Unit 5 - Application Development Using Python
tushar.1801@gmail.com
D0OLHR8SGA
Program: MCA
Specialization: Core
Semester: 3
Course Name: Application Development using Python *
Course Code: 21VMT0C301
Unit Name: Strings, Pattern Matching with Regular Expressions
People read words, not binary 0s and 1s. In Python, text is stored in strings.
Strings in Python are immutable sequences of Unicode characters. A single character is
simply a string of length 1, since Python has no separate character data type. To access
the individual characters of a string, index it with square brackets.
Creation of strings:
In Python, single, double, or triple quotes can be used to create strings. The example for
this section creates strings using single quotes, double quotes, and triple quotes.
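A minimal sketch of the three quoting styles; the sample text and variable names are illustrative assumptions, not taken from the original example.

```python
# Three ways to write a string literal (sample text is an assumption).
single = 'Hello, Python'
double = "Hello, Python"
triple = """Hello,
Python"""  # triple quotes may span several lines

print(single)
print(double)
print(triple)

# There is no separate character type: indexing yields a string of length 1.
first = single[0]
print(first, type(first).__name__, len(first))  # H str 1
```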
Slicing gives access to a specific subset of the characters in a string. A slice is written
with the slicing operator, the colon (:), inside square brackets.
Example:
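A short sketch of slicing, assuming a sample string chosen only for illustration. A slice has the form text[start:stop:step]; the character at stop is not included.

```python
text = "Python"  # hypothetical sample string

head = text[0:3]            # characters at index 0, 1, 2
tail = text[3:]             # from index 3 to the end
last_two = text[-2:]        # negative indexes count from the end
every_other = text[::2]     # step of 2
reversed_text = text[::-1]  # step of -1 walks backwards

print(head, tail, last_two, every_other, reversed_text)
# Pyt hon on Pto nohtyP
```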
startswith() method:
The startswith() method returns True if the string begins with the supplied prefix;
otherwise it returns False.
Parameters:
prefix: The string to look for at the beginning (a tuple of strings is also accepted).
start: (Optional) The position in the string at which the check begins.
end: (Optional) The position in the string at which the check stops; this position itself is
not included.
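endswith() method:
The endswith() method returns True if the string ends with the supplied suffix; otherwise it
returns False.
Parameters:
suffix: The string to look for at the end (a tuple of strings is also accepted).
start: (Optional) The position in the string at which the check begins.
end: (Optional) The position in the string at which the check stops, that is, the last
position to be checked + 1.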
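The two methods can be sketched together as follows. The file name is a hypothetical value used only for illustration.

```python
filename = "report_2023.csv"  # hypothetical sample value

starts = filename.startswith("report")        # True
ends = filename.endswith(".csv")              # True

# start and end restrict the check to the slice filename[start:end].
year_at_7 = filename.startswith("2023", 7)    # True: "2023" begins at index 7
stem_ends = filename.endswith("2023", 0, 11)  # True: filename[0:11] is "report_2023"

# A tuple checks several alternatives at once.
is_data = filename.endswith((".csv", ".txt"))  # True

print(starts, ends, year_at_7, stem_ends, is_data)
```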
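join() method:
stringName.join(iterable)
Here, stringName is the separator placed between the items, and iterable means any object
that can return its members one at a time. Examples of iterables are a list, tuple, set,
dictionary, and string. Every item must itself be a string; for a dictionary, the keys are joined.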
Example:
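A minimal sketch of join() with several kinds of iterable; all sample values are assumptions chosen for illustration.

```python
words = ["Python", "is", "fun"]
sentence = " ".join(words)            # 'Python is fun'
dashed = "-".join("abc")              # a string is iterable too: 'a-b-c'
keys = ", ".join({"x": 1, "y": 2})    # a dictionary yields its keys: 'x, y'

# Items that are not strings must be converted first.
numbers = ",".join(str(n) for n in [1, 2, 3])  # '1,2,3'

print(sentence, dashed, keys, numbers)
```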
split() method:
Python's split() method breaks the given string into a list of strings using the specified
separator.
string.split(separator, maxsplit)
separator: (Optional) The delimiter at which the string is split. If it is absent, any run of
whitespace acts as the separator.
maxsplit: (Optional) The maximum number of splits to perform. If it is not supplied, the
default value is -1, which indicates that there is no limit.
Examples of splitting are given below:
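A short sketch of the common cases; the sample strings are assumptions for illustration.

```python
line = "apple, banana, cherry"  # hypothetical sample text

by_comma = line.split(", ")       # ['apple', 'banana', 'cherry']
first_only = line.split(", ", 1)  # at most one split: ['apple', 'banana, cherry']

# With no separator, any run of whitespace separates the parts.
by_space = "a  b   c".split()     # ['a', 'b', 'c']

print(by_comma, first_only, by_space)
```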
string.partition(separator)
The separator parameter is the substring at which the string is split. A tuple with 3 entries
is returned: the part before the separator, the separator itself, and the part after the
separator. Only the first occurrence of the separator is used.
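A minimal sketch of partition(), assuming a sample string made up for illustration. It also shows the result when the separator is not found.

```python
entry = "name=Asha=admin"  # hypothetical sample text

before, sep, after = entry.partition("=")
print(before, sep, after)  # name = Asha=admin

# If the separator is absent, the whole string comes first,
# followed by two empty strings.
missing = entry.partition(":")
print(missing)  # ('name=Asha=admin', '', '')
```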
String justification:
rjust():
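The rjust() method returns a new string of the specified length in which the original string
is right-justified; the left side is padded with the specified fill character. ljust() does the
opposite: it left-justifies the string and pads the right side.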
string.rjust(length, fillchar)
string.ljust(length, fillchar)
Parameters:
length: The length of the returned string. The original string is returned if length is less
than or equal to the length of the original string.
fillchar: (Optional) The single character used for padding. If it is absent, a space is used.
Example:
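A short sketch of both methods; the word and the fill character are assumptions for illustration.

```python
word = "cat"  # hypothetical sample string

right = word.rjust(7, "*")      # '****cat'
left = word.ljust(7, "*")       # 'cat****'
default_fill = word.rjust(7)    # padded with spaces: '    cat'
unchanged = word.rjust(2, "*")  # length too small: 'cat'

print(right, left, repr(default_fill), unchanged)
```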
The center() method returns a new string of the given length in which the original string is
centred and padded on both sides with the supplied character.
string.center(length[, fillchar])
Parameters:
length: The length of the string after padding.
fillchar: (Optional) The single character used for padding. If it is omitted, a space is used.
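A minimal sketch of center(), using a made-up title string and an even amount of padding so that both sides receive the same number of characters.

```python
title = "menu"  # hypothetical sample string

dashed = title.center(10, "-")  # '---menu---'
spaced = title.center(8)        # '  menu  '
unchanged = title.center(3)     # length too small: 'menu'

print(dashed, repr(spaced), unchanged)
```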
replace() method:
Python's replace() method returns a copy of the string in which every occurrence of one
substring is replaced with another substring. The original string is not modified.
Parameters:
old: The substring that needs to be replaced.
new: The substring that takes the place of the old one.
count: (Optional) The maximum number of occurrences of the old substring to replace.
Example:
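A short sketch of replace() with and without count; the sentence is an assumption chosen for illustration.

```python
sentence = "red pen, red ink, red cap"  # hypothetical sample text

all_replaced = sentence.replace("red", "blue")
first_two = sentence.replace("red", "blue", 2)  # replace at most 2 occurrences

print(all_replaced)  # blue pen, blue ink, blue cap
print(first_two)     # blue pen, blue ink, red cap
print(sentence)      # the original string is unchanged
```

This ends the replace() example; the notes that follow turn to regular expressions.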
In the regular expression example that searches for the pattern r'good', the code reports the
starting index and ending index of the match for "good".
Note that the r prefix in r'good' stands for raw, not regex. In a raw string, Python does not
treat the character \ as an escape character, which makes it slightly different from a
standard string. This matters because the regular expression engine uses the character \ for
its own escaping.
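A minimal sketch of searching for r'good' and reading the match position. The sample sentence is an assumption, and the use of re.search() with start() and end() is one plausible way to obtain the indexes, not necessarily the original code.

```python
import re

text = "Python is a good language"  # hypothetical sample text

match = re.search(r'good', text)
span = (match.start(), match.end())
print(span)  # (12, 16)

# The raw prefix keeps the backslash as an ordinary character.
plain = '\n'  # one character: a newline
raw = r'\n'   # two characters: a backslash and the letter n
print(len(plain), len(raw))  # 1 2
```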
Metacharacters are characters that have a special meaning in a pattern, and they are used
throughout the functions of the re module. The list of metacharacters is shown below.
Source: GeeksForGeeks
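As a rough illustration of the common metacharacters, the sketch below applies a few small patterns with re.findall(). The sample strings are assumptions; this is not a reproduction of the original table.

```python
import re

sample = "cat bat rat 42"  # hypothetical sample text

any_char = re.findall(r'.at', sample)       # .  any character except a newline
at_start = re.findall(r'^cat', sample)      # ^  start of the string
at_end = re.findall(r'42$', sample)         # $  end of the string
char_set = re.findall(r'[cb]at', sample)    # [] one character from a set
either = re.findall(r'cat|rat', sample)     # |  either alternative
exactly_two = re.findall(r'\d{2}', sample)  # {} exact repetition, \ special sequence
optional = re.findall(r'rats?', sample)     # ?  zero or one
zero_or_more = re.findall(r'ba*t', "bt bat baat")  # *  zero or more
one_or_more = re.findall(r'ba+t', "bt bat baat")   # +  one or more
grouped = re.findall(r'(c|r)at', sample)    # () group; findall returns the group text

print(any_char, char_set, either, zero_or_more, one_or_more, grouped)
```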
Proprietary content. All rights reserved. Unauthorized use or distribution prohibited.
This file is meant for personal use by tushar.1801@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action.
Special sequences do not match one literal character. Some, such as \A and \b, specify the
precise position in the search string where the match must take place; others, such as \d, \w,
and \s, stand for a whole class of characters. They make frequently used patterns simpler to
write.
Source: GeeksForGeeks
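A few of the special sequences can be sketched as follows; the sample sentence is an assumption for illustration.

```python
import re

text = "Room 42 is ready"  # hypothetical sample text

digits = re.findall(r'\d', text)          # \d  a digit
words = re.findall(r'\w+', text)          # \w  a letter, digit, or underscore
spaces = re.findall(r'\s', text)          # \s  a whitespace character
re_words = re.findall(r'\bre\w*', text)   # \b  a word boundary
at_start = re.findall(r'\ARoom', text)    # \A  the very start of the string

print(digits, words, len(spaces), re_words, at_start)
# ['4', '2'] ['Room', '42', 'is', 'ready'] 3 ['ready'] ['Room']
```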
re.findall():
re.findall() returns a list of strings containing all non-overlapping matches of the pattern in
the given string. The string is scanned from left to right, and matches are returned in the
order in which they are found.
Example:
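A minimal sketch of re.findall(); the sample sentence is an assumption. The second call shows what non-overlapping means in practice.

```python
import re

text = "Call 100 for police and 101 for fire"  # hypothetical sample text

numbers = re.findall(r'\d+', text)
print(numbers)  # ['100', '101']

# Matches do not overlap: five a's contain only two separate 'aa' matches.
pairs = re.findall(r'aa', "aaaaa")
print(pairs)  # ['aa', 'aa']
```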
re.compile():
re.compile() compiles a regular expression into a pattern object. The methods of the pattern
object can then be used for operations such as searching for pattern matches or replacing strings.
The first match is 'a' in "Whatever"; matching is case-sensitive.
The next matches are 'e' in "Whatever", 'e' once more in "Whatever", 'a' in "are", and
finally 'e' in "one".
The backslash '\' is an important metacharacter because it introduces the special sequences.
To match a literal backslash, without its special meaning as a metacharacter, use "\\".
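A separate minimal sketch of re.compile(), using a hypothetical sentence and pattern rather than the original example. It also shows the escaped backslash.

```python
import re

vowel_pattern = re.compile(r'[aeiou]')  # hypothetical pattern: lowercase vowels
sentence = "Regular Expressions"        # hypothetical sample text

found = vowel_pattern.findall(sentence)
print(found)  # ['e', 'u', 'a', 'e', 'i', 'o']; the capital E is skipped

# A literal backslash is matched with an escaped backslash.
path = r'C:\temp'
backslashes = re.findall(r'\\', path)
print(len(backslashes))  # 1
```

A compiled pattern object can be reused across many calls, which keeps the pattern in one place.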
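re.sub():
The "sub" in the function name stands for substitution. The function searches the given
string (3rd parameter) for a regular expression pattern (1st parameter) and replaces every
match with repl (2nd parameter). The optional count argument limits the number of
replacements; by default, all matches are replaced.
Syntax:
re.sub(pattern, repl, string, count=0, flags=0)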
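A short sketch of re.sub(); the date string is an assumption chosen for illustration.

```python
import re

text = "2023-01-15"  # hypothetical sample text

slashes = re.sub(r'-', '/', text)              # replace every match
first_only = re.sub(r'-', '/', text, count=1)  # replace only the first match

print(slashes)     # 2023/01/15
print(first_only)  # 2023/01-15
```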
re.subn():
subn() is identical to sub() in every respect except its output. Instead of returning only
the new string, it returns a tuple containing the new string and the total number of
replacements made.
Example:
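A minimal sketch of re.subn(); the sample text is an assumption.

```python
import re

# Collapse each run of whitespace into a single space.
new_text, replacements = re.subn(r'\s+', ' ', "too   many    spaces")

print(new_text)      # too many spaces
print(replacements)  # 2
```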
string.translate(table)
Example:
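The sketch below is a reconstruction that agrees with the explanation in this section. Only the name str3 and the result cdf appear in the text; the names str1 and str2 and the input string "abcdef" are assumptions.

```python
str1 = "belle"   # characters to map from (assumed name)
str2 = "apple"   # characters to map to (assumed name)
str3 = "abe"     # removal string: these characters map to None
text = "abcdef"  # assumed input string

table = str.maketrans(str1, str2, str3)
result = text.translate(table)
print(result)  # cdf
```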
The translation table maps the letters b, e, l, l, and e to the letters a, p, p, l, and e,
respectively. However, the removal string str3 resets the mapping for a, b, and e to None.
Therefore, a, b, and e are removed when the string is translated using translate(),
producing cdf.
Pattern Matching:
You may be accustomed to searching for text by pressing Ctrl-F and typing the words you
want. Regular expressions go a step further by letting you define a specific pattern of text
to search for.
Regular expressions, called regexes for short, are descriptions of a pattern of text. For
instance, \d in a regex stands for a digit character, that is, any single numeral from 0 to 9.
The following regex matches a six-digit pin code:
pinCodeRegex = re.compile(r'\d\d\d\d\d\d')
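A minimal sketch of using this pattern. The sample sentence is an assumption, and pc is used as the name of the match object because the notes refer to pc.groups().

```python
import re

pinCodeRegex = re.compile(r'\d\d\d\d\d\d')

# search() returns a match object, or None if nothing matches.
pc = pinCodeRegex.search("My pin code is 560001.")  # hypothetical sample text
print(pc.group())  # 560001

no_match = pinCodeRegex.search("No digits here")
print(no_match)  # None
```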
2. Using pc.groups():
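A short sketch of groups(), assuming the pin code pattern is split into two parenthesised groups. The name lastThree and the sample sentence are assumptions.

```python
import re

# Parentheses create groups: the first three and the last three digits.
pinCodeRegex = re.compile(r'(\d\d\d)(\d\d\d)')
pc = pinCodeRegex.search("My pin code is 560001.")  # hypothetical sample text

print(pc.group(1))  # 560
print(pc.group(2))  # 001
print(pc.groups())  # ('560', '001')

# groups() returns a tuple, so it can be unpacked in one step.
firstThree, lastThree = pc.groups()
```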
4. Matching a parenthesis:
Regular expressions give parentheses a special meaning, but what if you need to
match a parenthesis in your text? For instance, the first three digits may be written in
parentheses in the pin codes you are trying to match. In this case, you need to escape
the ( and ) characters with a backslash.
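A minimal sketch of escaped parentheses, assuming pin codes written in the form (560)001. The sample sentence is an assumption.

```python
import re

# \( and \) match literal parentheses; the unescaped ( ) still create groups.
pinCodeRegex = re.compile(r'(\(\d\d\d\))(\d\d\d)')
pc = pinCodeRegex.search("My pin code is (560)001.")  # hypothetical sample text

print(pc.group(1))  # (560)
print(pc.group(2))  # 001
```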
Greedy Quantifiers:
?, *, +, and {m,n} are examples of greedy quantifiers; they match as many characters as they
can (the longest possible match). For instance, the substrings 'a', 'aa', and 'aaa' all match
the regex 'a+', but in the string 'aaaa' the regex 'a+' will match as many 'a's as possible.
Non-greedy quantifiers:
A non-greedy quantifier, such as ??, *?, +?, and {m,n}?, matches as few characters as
possible (the shortest possible match). For instance, the regex 'a+?' will match as few 'a's
as possible in the string 'aaaa'.
Greedy Matching:
A greedy match occurs when the regex engine matches as many characters as it can while
finding your pattern in the string.
In the string "bbbb", for instance, the regex "b+" will match as many "b"s as possible. The
substrings "b", "bb", and "bbb" all match the regex "b+", but the regex engine is not
satisfied with them. It always tries to match more.
In other words, the greedy quantifiers give you the longest match from a given position in
the string.
All default quantifiers, including ?, *, +, {m}, and {m,n}, are greedy: they match as many
characters as they can while the regex pattern as a whole still matches.
A shorter match would also be valid in each case. However, because the regex engine is
greedy by default, it does not stop at the shorter match.
Example of greedy matching:
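A sketch of the greedy quantifiers on the string "bbbb". The use of re.findall() is an assumption about how the original example was written; the outputs shown are for Python 3.7 or later.

```python
import re

zero_or_one = re.findall('b?', 'bbbb')
zero_or_more = re.findall('b*', 'bbbb')
one_or_more = re.findall('b+', 'bbbb')
one_to_three = re.findall('b{1,3}', 'bbbb')

print(zero_or_one)   # ['b', 'b', 'b', 'b', '']
print(zero_or_more)  # ['bbbb', '']
print(one_or_more)   # ['bbbb']
print(one_to_three)  # ['bbb', 'b']
```

The trailing empty strings come from 'b?' and 'b*' matching zero characters at the end of the string.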
The first case uses the zero-or-one regex 'b?'. Because it is greedy, it matches one 'b'
character whenever possible.
Non-greedy pattern matching:
A non-greedy match occurs when the regex engine matches as few characters as possible
while still matching the pattern in the given string.
For instance, the regex 'a+?' will match as few 'a's as possible in the string 'aaaa'. As a
result, its first match is just the first character, 'a'. The second character, which also
matches on its own, forms the next match, and so on.
In other words, the non-greedy quantifiers give you the shortest match from a given position
in the string. By appending the question mark symbol "?" to the default quantifiers ?, *, +,
{m}, and {m,n}, you make them non-greedy. They "consume" or "match" as few characters as
possible while still satisfying the regex pattern.
The 'a??' version is the non-greedy zero-or-one quantifier. If it can, it matches zero 'a's.
Keep in mind that the engine advances from left to right and first "consumes" the empty
string. Only then, because it can no longer match the empty string at that position, is it
obliged to match the character 'a'. After that, the empty string can be matched once more.
Repeatedly, the empty string is matched first, and the letter 'a' only when necessary. That
is why this peculiar pattern appears.
The 'a*?' version is the non-greedy zero-or-more quantifier. Once more, if it can, it matches
zero 'a's. It matches one character at a given position, "consumes" it, and moves on only
if it has already matched zero characters at that position.
The 'a+?' version is the non-greedy one-or-more quantifier. In this case, the regex engine
matches just a single character 'a', consumes it, and continues on to the next match.
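The three non-greedy cases described above can be reproduced with a sketch like the following. It assumes the string 'aaaa' and re.findall(), and the outputs shown are for Python 3.7 or later (earlier versions handle empty matches differently).

```python
import re

lazy_zero_or_one = re.findall('a??', 'aaaa')
lazy_zero_or_more = re.findall('a*?', 'aaaa')
lazy_one_or_more = re.findall('a+?', 'aaaa')

print(lazy_zero_or_one)   # ['', 'a', '', 'a', '', 'a', '', 'a', '']
print(lazy_zero_or_more)  # ['', 'a', '', 'a', '', 'a', '', 'a', '']
print(lazy_one_or_more)   # ['a', 'a', 'a', 'a']
```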
Difference between greedy and non-greedy matching:
Regular expression matching is eager: it returns the earliest match it can find in the
string. Regular expressions also match greedily by default. A greedy match is the longest
string that the regex pattern can match and return.
The greedy match attempts to repeat the quantified pattern as many times as it can. The
non-greedy match attempts to repeat the quantified pattern as few times as possible.
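The difference is easiest to see on text with repeated delimiters. The sketch below uses a made-up snippet of markup as an assumption.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

greedy = re.search(r'<.*>', html).group()   # runs to the last '>'
lazy = re.search(r'<.*?>', html).group()    # stops at the first '>'
all_tags = re.findall(r'<.*?>', html)

print(greedy)    # <b>bold</b> and <i>italic</i>
print(lazy)      # <b>
print(all_tags)  # ['<b>', '</b>', '<i>', '</i>']
```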
Character class in python:
A group of characters enclosed in square brackets is referred to as a "character class" or
"character set." The regex engine matches exactly one character from a character class. The
characters that we want to match are placed inside the square brackets. For example, you can
use the character set [aeiou] to match any vowel.
Sometimes you need to match a group of characters but the shorthand character classes (\d,
\w, \s, and so on) are too general. Using square brackets, you can define your own character
class.
Character sets are specified with square brackets. To specify a range of characters inside a
character set, use a hyphen. The order of the characters inside the square brackets is
irrelevant. The regular expression [Aa]n, for instance, denotes either an uppercase or
lowercase a, followed by the letter n.
In its simplest form, a character class is written by enclosing a group of characters in
square brackets.
For instance, the regular expression [abc]at defines a character class that accepts "a", "b",
or "c" as the first character, followed by "at", so it will match words such as "bat" or
"cat".
Example of custom character classes:
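A short sketch of custom character classes, using the patterns named in the text. The sample sentences are assumptions.

```python
import re

at_words = re.findall(r'[abc]at', "The cat sat near a bat on a mat")
an_forms = re.findall(r'[Aa]n', "An apple and an egg")
vowels = re.findall(r'[aeiou]', "Character")
hex_runs = re.findall(r'[0-9a-f]+', "ff 7g 1a")  # hyphens define ranges
consonants = re.findall(r'[^aeiou ]', "hi you")  # a leading ^ negates the class

print(at_words)    # ['cat', 'bat']
print(an_forms)    # ['An', 'an', 'an']
print(vowels)      # ['a', 'a', 'e']
print(hex_runs)    # ['ff', '7', '1a']
print(consonants)  # ['h', 'y']
```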
Wildcard Symbols:
A wildcard is a symbol that stands in for one or more characters. Computer applications,
languages, search engines, and operating systems all use wildcards to make search criteria
simpler. The question mark (?) and the asterisk (*) are the most popular wildcards.
Asterisk (*): An asterisk stands for any number of characters. It is usually added to the
end of a root word, which is useful when searching for the variable endings of a root word.
Question mark (?): A question mark stands for a single character anywhere in the word. It is
most helpful when a word has multiple spellings and you want to search for all of them at
once.
In regular expressions, the dot (.) takes the place of the question mark wildcard and matches
any single character.
Similarly, in place of the asterisk wildcard, the characters .+ are used to match one or more
characters.
Example:
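A minimal sketch of the regex equivalents of the two wildcards; the sample words are assumptions.

```python
import re

# The dot stands for exactly one character, like the ? wildcard.
spellings = re.findall(r'\bwom.n\b', "woman women wombat")
print(spellings)  # ['woman', 'women']

# .+ stands for one or more characters, like the * wildcard.
rest = re.search(r'hel.+', "say hello world").group()
print(rest)  # hello world

# \w* limits the variable ending to word characters.
endings = re.findall(r'\bcomput\w*', "computer computing compute company")
print(endings)  # ['computer', 'computing', 'compute']
```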
re.VERBOSE:
A verbose regular expression differs from a compact regular expression in two ways:
Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs,
and carriage returns; they are not matched at all. To match a space in a verbose regular
expression, place a backslash in front of it.
Comments are ignored. Like a comment in Python code, a comment in a verbose regular
expression begins with the # character and extends to the end of the line. In this case it
is a comment within a multi-line string rather than within the source code, but the
principle is the same.
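Both rules can be sketched with a commented pin code pattern compiled with the re.VERBOSE flag. The pattern, its name, and the sample text are assumptions for illustration.

```python
import re

pin_pattern = re.compile(r'''
    (\d{3})   # first three digits
    \ ?       # optional space, escaped because plain whitespace is ignored
    (\d{3})   # last three digits
''', re.VERBOSE)

with_space = pin_pattern.search("PIN: 560 001").groups()
without_space = pin_pattern.search("PIN: 560001").groups()

print(with_space)     # ('560', '001')
print(without_space)  # ('560', '001')
```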
re.IGNORECASE:
This flag enables case-insensitive regular expression matching with the supplied string, so
that expressions like [A-Z] will also match lowercase letters. It is often supplied to
re.compile() as an optional argument.
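A short sketch of the flag, passed both to re.findall() and to re.compile(). The sample strings are assumptions.

```python
import re

text = "Python REGEX"  # hypothetical sample text

case_sensitive = re.findall(r'[A-Z]+', text)
case_insensitive = re.findall(r'[A-Z]+', text, re.IGNORECASE)

print(case_sensitive)    # ['P', 'REGEX']
print(case_insensitive)  # ['Python', 'REGEX']

# The flag can also be given when the pattern is compiled.
python_pattern = re.compile(r'python', re.IGNORECASE)
found = python_pattern.search("I like PYTHON").group()
print(found)  # PYTHON
```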
re.DOTALL:
By default, Python's "." special character matches any character except a newline, but the
DOTALL flag extends it.
With the DOTALL flag, the "." character matches any character, including newlines.
Real-world projects sometimes require analysing multi-line strings (lines separated by the
newline character, "\n"). In these circumstances, we use re.DOTALL.
Example:
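A minimal sketch comparing the pattern '.+' without and with the flag; the two-line sample string is an assumption.

```python
import re

text = "first line\nsecond line"  # hypothetical sample text

without_flag = re.search(r'.+', text).group()
with_flag = re.search(r'.+', text, re.DOTALL).group()

print(repr(without_flag))  # 'first line'
print(repr(with_flag))     # 'first line\nsecond line'
```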
In this case, the regular expression '.+' matches one or more characters. Without the flag,
the engine stops when it encounters the newline character, because the dot does not match
line breaks. With the DOTALL flag, the dot also matches the newline, so the match continues
across lines.
In this example, the pattern for lowercase letters is stored in the variable 'lower' and the
pattern for uppercase letters in the variable 'upper'. We call re.findall(pattern, text) and
wrap the call in the len() function to find the total number of lowercase and uppercase
characters.
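The counting approach described above can be sketched as follows. The variable names lower and upper follow the description; the sample text and the exact patterns are assumptions.

```python
import re

text = "Hello World, Welcome to Python"  # hypothetical sample text

lower = r'[a-z]'  # pattern for a lowercase letter
upper = r'[A-Z]'  # pattern for an uppercase letter

lower_count = len(re.findall(lower, text))
upper_count = len(re.findall(upper, text))

print(lower_count, upper_count)  # 21 4
```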