0% found this document useful (0 votes)
29 views57 pages

9 RegEx

Uploaded by

saiabhishek.4806
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
29 views57 pages

9 RegEx

Uploaded by

saiabhishek.4806
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 57

Regular Expressions

(re)
in Python

Regex
Problem

Write a Python code to check if the given mobile


number is valid or not. The conditions to be
satisfied for a mobile number are:

a) Number of characters must be 10

b) All characters must be digits and must not


begin with a ‘0’
Validity of Mobile Number

Input Processing Output

A string representing a Take character by character Print valid or


mobile number and check if it valid invalid
Test Case 1
• Abc8967891

• Invalid
• Alphabets are not allowed
Test Case 2
• 440446845

• Invalid
• Only 9 digits
Test Case 3
• 0440446845

• Invalid
• Should not begin with a zero
Test Case 4
• 8440446845

• Valid
• All conditions satisfied
Python code to check validity of mobile number (Long
Code)

import sys
number = input()
if len(number)!=10:
print ('invalid')
sys.exit()
elif number[0]=='0':
print ('invalid')
sys.exit()
else:
for chr in number:
if (chr.isdigit()==False):
print ('invalid')
sys.exit()
print('Valid')
sys.exit()

• The sys.exit() function is used in Python to


terminate a program after necessary
cleanup (like closing files, releasing
resources, or logging)
• You can pass an optional exit status code to
indicate success or failure (with 0
indicating success and 1 indicating an
error).
• The default exit status for sys.exit() in
Python is 0.
• Manipulating text or data is a big thing

• If I were running an e-mail archiving


company, and you, as one of my
customers, requested all of the e-mail that
you sent and received last February, for
example, it would be nice if I could set a
computer program to collate and forward
that information to you, rather than having
a human being read through your e-mail
and process your request manually.
• So this demands the question of how we can
program machines with the ability to look for
patterns in text.
• Regular expressions provide such an
infrastructure for advanced text pattern
matching, extraction, and/or search-and-
replace functionality.
• Python supports regexes through the
standard library re module -> import
re
• regexes are strings containing text and
special characters that describe a pattern
with which to recognize multiple strings.
• Regexs without special characters

• These are simple expressions that match a


single string
• Power of regular expressions comes in when
meta characters and special characters are
used to define character sets, subgroup
matching, and pattern repetition
• Metacharacters are special characters in
regular expressions that have specific
meanings and functionalities beyond their
literal representation. They help define
patterns for matching strings. Here’s a
rundown of some commonly used
metacharacters in regex.

• Eg . ^ $ * + ? {n} {n,} {n,m} [] | ()


•Note Some of the whitespace character are
space/tab/new line
The findall() Function

• The findall() function returns a list containing all matches.


• SYNTAX re. findall(pattern,source_string)

• Return an empty list if no match was found


import re
txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']
text = "test cat testing scatter"

# \b Matches 'cat' at word boundaries


pattern_b = r"\bcat\b"
print(re.findall(pattern_b, text))
# ['cat']
# \B Matches 'cat' where it is not at a word boundary
pattern_B = r"\Bcat\B"
print(re.findall(pattern_B, text))
# ['cat'] (from "scatter")
The search() Function
• The search() function searches the string for a match, and returns a
Match object if there is a match.
• If there is more than one match, only the first occurrence of the match
will be returned.
• If no matches are found, the value None is returned.
• SYNTAX re.search(pattern,source_string)

• Search for the first white-space character in the string:


import re

txt = "The rain in Spain"


x = re.search("\s", txt)
print(type(x)
print("The first white-space character is located in position:", x.start())

<class 're.Match'>
The first white-space character is located in position: 3
The match() Function
• re.match(pattern, string): This function checks for a match only
at the beginning of the string. If the pattern matches the start of
the string, it returns a match object; otherwise, it returns None.
SYNTAX re.match(pattern,source_string)

• Search for the first white-space character in the string:

txt ="The rain in Spain"


x = re.match("The",txt)
print(type(x))
print("The first white-space character is located in position:",
x.start())

<class 're.Match'> The first white-space character is located in


position: 0
12/05/2024 20
re.match vs re.search
• Both return the first match of a substring
found in the string.
• re.match() searches only from the beginning
of the string and return match object if found.
But if a match of substring is found
somewhere in the middle of the string, it
returns none.
• re.search() searches for the whole string
even if the string contains multi-lines and tries
to find a match of the substring in all the lines
of string.
The split() Function
• The split() function returns a list where the string
has been split at each match
• SYNTAX re.split(pattern,source_string,maxsplit)
maxsplit is optional argument.

• Split at each white-space character:


• import re
txt = "The rain in Spain"
x = re.split("\s", txt)
print(x)
• ['The', 'rain', 'in', 'Spain']
The split() Function

• You can control the number of occurrences by


specifying the maxsplit parameter:
• Split the string only at the first occurrence:
• import re
txt = "The rain in Spain"
x = re.split("\s", txt, 1)
print(x)
• ['The', 'rain in Spain']
The sub() Function for substitution

• The sub() function replaces the matches with the text of your choice:
• SYNTAX re.sub(old_pattern, new_pattern, source_string, no. of
replacements )
no. of replacements is optional

• Replace every white-space character with the number 9:


• import re
txt = "The rain in Spain"
x = re.sub("\s", "9", txt)
print(x)
• The9rain9in9Spain
• You can control the number of replacements by
specifying the count parameter:
• Replace the first 2 occurrences:
• import re
txt = "The rain in Spain"
x = re.sub("\s", "9", txt, 2)
#2 is no. of occurences to be replaced
print(x)
• The9rain9in Spain
Basic features of the re
module

• Matching Patterns: You can check if a


string matches a pattern using re.match() or
re.search().
• Finding Patterns: Use re.findall() to find all
occurrences of a pattern in a string.
• Substituting Patterns: Use re.sub() to
replace occurrences of a pattern with
another string.
• In Python's re module, when you find a
match usingT methods like re.search() or
re.match(), you get a match object. This
match object provides several methods
and attributes that allow you to get
more information about the match.
Functions in match
object
group()

• A group() expression returns one or more subgroups of the


match.
import re
m = re.match(r'(\w+)@(\w+)\.(\
w+)','[email protected]')
print(m.group(0)) # The entire match
print(m.group(1)) # The first parenthesized subgroup.
print(m.group(2)) # The second parenthesized subgroup.
print(m.group(3)) # The third parenthesized subgroup.
print(m.group(1,2,3)) # Multiple arguments give us a tuple.
groups()

• A groups() expression returns a tuple


containing all the subgroups of the match.
import re
m = re.match(r'(\w+)@(\w+)\.(\
w+)','[email protected]')
print(m.groups())
#Output ('username', 'hackerrank', 'com')
groupdict()
?P<name> is a syntax used to define named
capturing groups.
You can replace name with any valid identifier.
import re
m = re.match('(?P<user>\w+)@(?P<website>\
w+)\.(?P<extension>\
w+)','[email protected]')
print(m.groupdict())
groupdict()

if m:
print("Full name:", m.group('user'))
print("website:", m.group('website'))
#Ouput
Full name: myname
website: hackerrank
Matching Any Single Character (.)

• dot or period (.) symbol (letter, number,


whitespace (not including “\n”), printable,
non-printable, or a symbol) matches any
single character except for \n

• To specify a dot character explicitly, you


must escape its functionality with a
backslash, as in “\.”
• The caret symbol ^ in regular expressions
serves two primary functions, depending on its
context.
1. Start of a String
When placed at the beginning of a regex pattern,
the caret ^ asserts that the following pattern
must match at the start of the string.
2. Negation in Character Classes
When used inside square brackets [...], the caret
^ negates the character class, meaning it
matches any character not included in the
brackets.
Matching from the Beginning or End of
Strings or Word Boundaries (^, $)

^ - Match beginning of string


$ - Match End of string
if you wanted to match any
string that ends with a
dollar sign, one possible
regex solution would be the
pattern .*\$$
Denoting Ranges (-) and Negation (^)

 brackets [] also support ranges of


characters
 A hyphen [a-z] between a pair of symbols
enclosed in brackets is used to indicate a range
of characters;
 For example: A–Z, a–z, or 0–9 or 1-9 for
uppercase letters, lowercase letters, and
numeric digits, respectively
Multiple Occurrence / Repetition
Using Closure Operators (*, +, ?, {})

 special symbols *, +, and ? , all of which


can be used to match single, multiple, or no
occurrences of string patterns
 Asterisk or star operator (*) - match zero
or more occurrences of the regex
immediately to its left
 Plus operator (+) - Match one or more
occurrences of a regex
 Question mark operator (?) : match
exactly 0 or 1 occurrences of a regex.

 There are also brace operators ({}) with


either a single value or a comma-
separated pair of values. These indicate a
match of exactly N occurrences (for {N})
or a range of occurrences; for example, {M,
N} will match from M to N occurrences.
re Valid Invalid

[dn]ot? dot, not, do, no dnt, dn,dott,dottt

[dn]ot?$ dot, not, do, no dnt, dn,dott,nott

[dn]ot*$ dott,nott,do,no dotn

[dn]ot* Do,no,dot,dottt

12, 123, 1234,


[0-9]{2,5} 12345, a123456 1,2,4,5,
12345, 123456,
a1234567,111111
[0-9]{5} 111 12, 123, 1234

aX{2,4} aXX,aXXX,aXXXX a ,aX


logical “or”
• The | denotes "OR" operator.
• |. This character separates terms contained within
each (...) group.
• example, for instance:
^I like (dogs|penguins), but not (lions|tigers).$
• This expression will match any of the following
strings:
• I like dogs, but not lions.
• I like dogs, but not tigers.
• I like penguins, but not lions.
• I like penguins, but not tigers.
metacharacter \d
• You can replace [0-9] by metacharacter \d, but not [1-
9].
• Ex.
[1-9][0-9]*|0 or [1-9]\d*|0

• For "abc123xyz", it matches the substring "123".


• For "abcxyz", it matches nothing.
• For "abc123xyz456_0", it matches
substrings: "123", "456" and "0" (three matches).
• For "0012300", it matches
substrings: "0", "0" and "12300" (three matches)!!!
^[1-9][0-9]*|0$ or ^[1-9]\
d*|0$

• The position anchors ^ and $ match the


beginning and the ending of the input
string, respectively. That is, this regex shall
match the entire input string, instead of a
part of the input string (substring).
• an occurrence indicator, + for one or
more, * for zero or more, and ? for zero or
one.
metacharacter \w

• metacharacter \w for a word character


[a-zA-Z0-9_].
Recall that metacharacter \d can be used for a
digit [0-9].

[a-zA-Z_][0-9a-zA-Z_]* or [a-zA-Z_]\w*

• Begin with one letters or underscore, followed by zero


or more digits, letters and underscore.
metacharacter \s
• \s (space) matches any single whitespace like blank, tab,
newline
• The uppercase counterpart \S (non-space) matches any
single character that doesn't match by \s
• # Sample string
• text = "Hello, this is an example string.\nIt has multiple
lines and spaces."
# Pattern to find a whitespace character followed by
'example'
pattern = "\s(example)"
print(re.search(pattern, "Hello, this is an example string.\nIt
has multiple lines and spaces."
“))
[Uppercase]metach
aracter
• In regex, the uppercase metacharacter
denotes the inverse of the lowercase
counterpart, for example, \w for word
character and \W for non-word
character; \d for digit and \D or non-
digit.
Escape code \

• The \ is known as the escape code, which restore the


original literal meaning of the following character.
Similarly, *, +, ? (occurrence indicators), ^, $ (position
anchors), \. to represent . have special meaning in regex.
• In a character class (ie.,square brackets[]) any
characters except ^, -, ] or \ is a literal, and do not
require escape sequence.
• ie., only these four characters, ^, -, ] or \ , require
escape sequence inside the bracket list: ^, -, ], \
• Ex: [.-]? matches an optional character . or -.
Escape code \

• Most of the special regex characters lose their meaning


inside bracket list, and can be used as they are;
except ^, -, ] or \.
• To include a ], place it first in the list, or use escape \].
• To include a ^, place it anywhere but first, or use escape \^.
• To include a - place it last, or use escape \-.
• To include a \, use escape \\.
• No escape needed for the other characters such
as ., +, *, ?, (, ), {, }, and etc, inside the bracket list
• You can also include metacharacters such as \w, \W, \d, \D, \
s, \S inside the bracket list.
Escape code \

re.findall('[ab\-c]', '123-456')
Output ['-']
re.findall('[a-c]', 'abdc')
Output ['a', 'b', 'c']
Flags

• re.DOTALL (or re.S)


• re.MULTILINE (or re.M)
• re.IGNORECASE flag (or re.I)
re.DOTALL (or re.S)
• Normally, the dot (.) in regex matches any character except
a newline. For example, if you want to match text across
multiple lines (which includes newline characters), the dot
will not work unless you enable the DOTALL flag.
• The DOTALL flag allows the dot (.) in a regular expression to
match any character, including newline characters (\n).
text = "Hello\nWorld"
pattern = ".*"
print(re.match(pattern, text))
# Matches only "Hello“
text = "Hello\nWorld"
pattern = ".*"
print(re.match(pattern, text, re.DOTALL))
# Matches "Hello\nWorld"
re.MULTILINE (or re.M)
• The MULTILINE flag changes the behavior of the ^ and $
anchors so that they match at the start and end of each
line, not just the start and end of the entire string.
• Without MULTILINE, ^ matches the beginning of the string,
and $ matches the end of the string.
text = "Hello\nWorld"
pattern = "^World$"
print(re.search(pattern, text))
# No match because "World" is not at the start of the text
text = "Hello\nWorld"
pattern = "^World$"
print(re.search(pattern, text, re.MULTILINE))
# Matches "World" because it's at the start of a new line
re.IGNORECASE (or re.I)
• In Python's re module, the re.IGNORECASE flag (also written as
re.I) is used to make the regular expression case-insensitive.
When this flag is applied, the regex will match letters regardless
of whether they are uppercase or lowercase.
text = "Hello World"
pattern = "hello"
match = re.search(pattern, text)
print(match)
# No match, because "hello" does not match "Hello" (case-
sensitive)
text = "Hello World"
pattern = "hello"
match = re.search(pattern, text, re.IGNORECASE)
print(match)
# Match found, because "hello" matches "Hello" (case-insensitive)
Raw string r
valid = re.match('[a-zA-z]{2}\.[a-zA-Z]{2}$', input())
print("valid" if valid else "invalid")
#OUTPUT WARNING
#SyntaxWarning: invalid escape sequence '\.'
#Solution- use r (raw string)
valid = re.match(r'[a-zA-z]{2}\.[a-zA-Z]{2}$', input())
print("valid" if valid else "invalid")
#Alternate solution - use double back slash
valid = re.match(r'[a-zA-z]{2}\\.[a-zA-Z]{2}$', input())
print("valid" if valid else "invalid")

You might also like