8_String and Regular Expression

Chapter 1 covers the basics of Python, focusing on string manipulation and regular expressions. It includes string methods, indexing, slicing, and practices for counting characters and manipulating strings. Additionally, it introduces regular expressions, their metacharacters, special sequences, and functions available in the re module.

Chapter 1: Python basic

Lecturer: Nguyen Tuan Long, PhD


Email: [email protected]
Mobile: 0982 746 235

1.8. String and Regular Expression


• Working with Strings
• String Methods
• Python RegEx
• RegEx – Metacharacters
• RegEx – Special Sequences
• RegEx – Functions
Working with Strings
String Concatenation
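A minimal sketch of string concatenation. The variable names and sample values are illustrative and do not come from the slides.

```python
first_name = "Ada"
last_name = "Lovelace"

# The + operator joins strings end to end
full_name = first_name + " " + last_name

# An f-string builds a string from embedded expressions
greeting = f"Hello, {full_name}!"

# str.join concatenates an iterable of strings with a separator
csv_line = ",".join(["a", "b", "c"])

print(full_name)
print(greeting)
print(csv_line)
```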
Working with Strings
Multiline Strings
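A short sketch of a multiline string, assuming triple quotes are the technique the slide shows. The sample text is illustrative.

```python
# Triple quotes let one string literal span several lines
message = """Beautiful is better than ugly.
Explicit is better than implicit."""

# The line break is stored in the string as a newline character
lines = message.splitlines()
print(len(lines))
print(lines[1])
```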
Working with Strings
Indexing
message = "hello python!"
Index          :   0   1   2   3   4   5   6   7   8   9  10  11  12
Character      :   h   e   l   l   o       p   y   t   h   o   n   !
Negative index : -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1
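The index table can be checked directly in Python. This is a small sketch using the same message string as the slide; the other variable names are illustrative.

```python
message = "hello python!"

first = message[0]         # index 0 is the first character
last = message[12]         # index 12 is the last of the 13 characters
also_last = message[-1]    # -1 counts from the end
also_first = message[-13]  # -13 is the same position as index 0

print(first, last, also_last, also_first)
```

An index outside the range -13 to 12 raises an IndexError for this string.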
Working with Strings
Slicing: variable_name[start:stop:step]

Strings are unchangeable (immutable): a slice returns a new string and never modifies the original.
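A brief sketch of slicing and of what "unchangeable" means in practice, reusing the message string from the indexing slide. The other names are illustrative.

```python
message = "hello python!"

word = message[0:5]           # characters at indexes 0..4
every_other = message[::2]    # step 2: every second character
reversed_msg = message[::-1]  # negative step walks backwards

# Item assignment on a string raises TypeError
try:
    message[0] = "H"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a new string instead of changing the old one
capitalized = "H" + message[1:]
print(word, every_other, reversed_msg, capitalized)
```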
Working with Strings
Length
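A small example of len(), assuming the same message string as the indexing slide.

```python
message = "hello python!"

length = len(message)            # number of characters, spaces included
last_char = message[length - 1]  # the last valid index is len - 1

print(length, last_char)
```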
Practice
1. Counting the number of characters in a string.
Sample String : 'google.com'
Expected Result : {'g': 2, 'o': 3, 'l': 1, 'e': 1, '.': 1, 'c': 1, 'm': 1}
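One possible solution to the practice, shown as a sketch rather than the intended answer. The function name count_characters is hypothetical; collections.Counter from the standard library is used only as a cross-check.

```python
from collections import Counter


def count_characters(text):
    """Return a dict mapping each character to how often it occurs."""
    counts = {}
    for ch in text:
        # dict.get supplies 0 the first time a character is seen
        counts[ch] = counts.get(ch, 0) + 1
    return counts


result = count_characters("google.com")
print(result)

# Counter performs the same tally in one call
same_result = dict(Counter("google.com"))
```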
String Methods

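.title(): capitalizes the first letter of each word
.upper(): converts all letters to upper case
.lower(): converts all letters to lower case
.rstrip(): removes trailing whitespace
.lstrip(): removes leading whitespace
.strip(): removes leading and trailing whitespace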
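The six methods can be compared on one sample string. This is an illustrative sketch; the sample value is an assumption.

```python
raw = "  nguyen tuan LONG  "

titled = raw.strip().title()  # trim first, then capitalize each word
upper = raw.upper()
lower = raw.lower()
right_trimmed = raw.rstrip()  # removes trailing whitespace only
left_trimmed = raw.lstrip()   # removes leading whitespace only

# Each method returns a new string; raw itself is unchanged
print(repr(titled), repr(right_trimmed), repr(left_trimmed))
```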
String Methods
.replace():
The replace method works like a find-and-replace tool. It takes two
values in its parentheses: the value to search for and the value to
put in its place. It returns a new string.

Syntax
string.replace(oldvalue, newvalue, count)
String Methods
Parameter Values
Parameter   Description
oldvalue    Required. The string to search for
newvalue    Required. The string to replace the old value with
count       Optional. A number specifying how many occurrences of the old value you want to replace. Default is all occurrences
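A short sketch of replace() with and without the optional count argument. The sample sentence is illustrative.

```python
text = "one fish, two fish, red fish"

# Without count, every occurrence is replaced
all_replaced = text.replace("fish", "cat")

# With count=1, only the first occurrence is replaced
first_only = text.replace("fish", "cat", 1)

print(all_replaced)
print(first_only)
```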
String Methods
.find( ):
The find method will search for any string we ask it to. It
finds the starting index of our searched term.

Definition and Usage:


• The find() method finds the first occurrence of the specified
value.
• The find() method returns -1 if the value is not found.
• The find() method is almost the same as the index() method; the
only difference is that the index() method raises an exception
if the value is not found.
String Methods

Syntax
string.find(value, start, end)
Parameter   Description
value       Required. The value to search for
start       Optional. Where to start the search. Default is 0
end         Optional. Where to end the search. Default is the end of the string
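A minimal comparison of find() and index(), assuming the message string from the indexing slide.

```python
message = "hello python!"

pos = message.find("python")    # starting index of the first occurrence
missing = message.find("java")  # -1 because the value is not present
from_5 = message.find("o", 5)   # search starts at index 5

# index() behaves the same but raises ValueError when nothing is found
try:
    message.index("java")
    index_raised = False
except ValueError:
    index_raised = True

print(pos, missing, from_5, index_raised)
```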

Question: How to find all indexes of the searched term?
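One way to answer the question is to call find() repeatedly, each time starting just past the previous hit. This is a sketch, not necessarily the answer the lecture intends, and find_all is a hypothetical helper name.

```python
def find_all(text, term):
    """Return the starting index of every occurrence of term in text."""
    positions = []
    start = text.find(term)
    while start != -1:
        positions.append(start)
        # Moving on by one character also catches overlapping occurrences
        start = text.find(term, start + 1)
    return positions


indexes = find_all("hello python!", "o")
overlapping = find_all("aaaa", "aa")
print(indexes, overlapping)
```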


String Methods
.split( ):
Splits the string at the specified separator, and returns a list.
Syntax
string.split(separator, maxsplit)
Parameter Values
Parameter   Description
separator   Optional. Specifies the separator to use when splitting the string. By default any whitespace is a separator
maxsplit    Optional. Specifies how many splits to do. Default value is -1, which is "all occurrences"
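A short sketch of split() with its two optional parameters. The sample strings are illustrative.

```python
line = "apple,banana,cherry,date"

parts = line.split(",")     # split at every comma
head_rest = line.split(",", 1)  # maxsplit=1: at most one split

# With no separator, any run of whitespace separates the items
words = "  a  b \n c ".split()

print(parts, head_rest, words)
```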
String Methods
Practice
message = '''
Import this
'''
1. message_lema = the lemmatized message. Lemmatization is a process in natural language
processing (NLP) that normalizes words to their root form, called the "lemma".
Ex: running, ran --> run; better, best --> good.
2. sentences = the list of sentences in message_lema
3. Bag of words: convert the sentences to vectors
   1. Create the dictionary from message_lema.
   2. Convert sentences to vectors: once the dictionary is obtained, each sentence in the data is
   converted into a vector whose length equals the size of the dictionary. Each element of the
   vector is the number of times the corresponding word appears in the sentence.
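A rough sketch of step 3 only. It assumes the text is already lemmatized, splits sentences on periods, and lowercases words; the sample text and the helper name tokenize are assumptions, not part of the exercise.

```python
import re

# Stand-in for message_lema (already lemmatized in this sketch)
message_lema = "The cat sit. The cat run. A dog sit."

# Step 2: split into sentences on "." and drop empty pieces
sentences = [s.strip() for s in message_lema.split(".") if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and return its words."""
    return re.findall(r"[a-z]+", sentence.lower())


# Step 3.1: the dictionary is the sorted set of distinct words
dictionary = sorted({word for s in sentences for word in tokenize(s)})

# Step 3.2: one count per dictionary word, for each sentence
vectors = [[tokenize(s).count(word) for word in dictionary] for s in sentences]

print(dictionary)
for vector in vectors:
    print(vector)
```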
Python RegEx
• A RegEx, or Regular Expression, is a sequence of characters
that forms a search pattern.
• RegEx can be used to check if a string contains the
specified search pattern.

The findall() function returns a list containing all matches.
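A minimal example of checking for a pattern and listing all matches with the re module. The sample text is illustrative.

```python
import re

txt = "The rain in Spain"

# findall returns every non-overlapping match as a list of strings
matches = re.findall("ai", txt)

# search returns a Match object, or None when the pattern is absent
starts_and_ends = re.search(r"^The.*Spain$", txt) is not None
no_match = re.search("Portugal", txt) is None

print(matches, starts_and_ends, no_match)
```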
RegEx - Metacharacters
Metacharacters are characters with a special meaning: [], \, ., ^, $,
*, +, ?, {}, |, ()
o [] : A set of characters
RegEx    String      Matches
[abc]    a           match
         ac          match
         Hey Jude    no match
[^ab]    a           no match
         n           match
         ac          match
• [arn]: Returns a match where one of the specified characters (a, r, or n) is present.
• [a-n]: Returns a match for any lower case character alphabetically between a and n.
• [^arn]: Returns a match for any character EXCEPT a, r, and n.
[Code screenshot; only its output is recoverable: ['T', 'r', 'n', 'n', 'S', 'p', 'n', '+']]
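The set notation can be tried with re.findall. This is a sketch on an assumed sample string, not the code from the slide.

```python
import re

txt = "The rain in Spain"

# Any one of the characters a, r, n
in_set = re.findall("[arn]", txt)

# Any lower case character from a to n
in_range = re.findall("[a-n]", txt)

# ^ at the start of a set negates it: anything except a, r, n
not_in_set = re.findall("[^arn]", txt)

print(in_set)
print(in_range)
print(not_in_set)
```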
RegEx - Metacharacters

• [0123]: Returns a match where any of the specified digits (0, 1, 2, or 3) is present.
• [0-5][0-9]: Returns a match for any two-digit number from 00 to 59.
• [a-zA-Z]: Returns a match for any character alphabetically between a and z, lower case OR upper case.
• [+]: In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string.
o . : Any character (except newline character)
RegEx    String    Matches
..       a         no match
         ac        match
         acd       match
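A few quick checks of the sets and the dot metacharacter described above. The sample strings are illustrative.

```python
import re

# Two-digit numbers from 00 to 59
minutes = re.findall("[0-5][0-9]", "8 times before 11:45 AM")

# Letters only, either case
letters = re.findall("[a-zA-Z]", "Ab1!")

# Inside a set, + is a literal plus sign
plus_signs = re.findall("[+]", "2+3+4")

# ".." needs two characters, so a one-character string cannot match
pairs = re.findall("..", "acd")
single_fails = re.search("..", "a") is None

print(minutes, letters, plus_signs, pairs, single_fails)
```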
RegEx - Metacharacters
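o ^ : Starts with
RegEx    String     Matches
^a       a          match
         abc        match
         bac        no match
^ab      abc        match
         acb        no match
o $ : Ends with
RegEx    String     Matches
a$       a          match
         Formula    match
         Cab        no match
ab$      abc        no match
         ab         match
o * : Zero or more occurrences
RegEx    String     Matches
ma*n     mn         match
         man        match
         maaan      match
         main       no match
         woman      match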
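The rows for ^, $ and * can be verified with re.search. This is a small sketch; the helper name matches is hypothetical.

```python
import re


def matches(pattern, text):
    """Return True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


starts = {s: matches("^a", s) for s in ["a", "abc", "bac"]}
ends = {s: matches("a$", s) for s in ["a", "Formula", "Cab"]}
star = {s: matches("ma*n", s) for s in ["mn", "man", "maaan", "main", "woman"]}

print(starts)
print(ends)
print(star)
```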
RegEx - Metacharacters
o + : One or more occurrences
RegEx    String     Matches
ma+n     mn         no match
         man        match
         maaan      match
         main       no match
         woman      match
o ? : Zero or one occurrences
RegEx    String     Matches
ma?n     mn         match
         man        match
         maaan      no match
         main       no match
         woman      match
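A sketch that contrasts + and ? on the same test strings; the variable names are illustrative.

```python
import re

samples = ["mn", "man", "maaan", "main", "woman"]

# + requires at least one "a" between m and n
one_or_more = [s for s in samples if re.search("ma+n", s)]

# ? allows at most one "a" between m and n
zero_or_one = [s for s in samples if re.search("ma?n", s)]

print(one_or_more)
print(zero_or_one)
```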

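o {} : Exactly the specified number of occurrences. In a{2,3}, 2 is the minimum (min) and 3 is the maximum (max).
RegEx        String            Matches
a{2,3}       abc dat           no match
             abc data          no match
             aabc daaat        match
             aabc daaaat       match
[0-9]{2,4}   ab233cde          match
             12 and 2313131    match
             1 and 2           no match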
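A short check of the {min,max} quantifier with re.findall, using strings from the slide. It also shows how a long run is consumed greedily, up to max characters at a time.

```python
import re

# Runs of two or three "a" characters
a_runs = re.findall("a{2,3}", "aabc daaat")
a_long = re.findall("a{2,3}", "aabc daaaat")  # "aaaa" yields "aaa"; the lone "a" left over is too short
a_none = re.findall("a{2,3}", "abc data")

# Runs of two to four digits
digits = re.findall("[0-9]{2,4}", "12 and 2313131")
digits_none = re.findall("[0-9]{2,4}", "1 and 2")

print(a_runs, a_long, a_none, digits, digits_none)
```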
Practice
Get all phone numbers
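One possible approach to the practice, as a sketch. The slides do not define the phone number format, so the pattern below assumes ten-digit numbers that start with 0 and may be grouped 4-3-3 with a space, dot, or hyphen; the sample text is made up.

```python
import re

# Assumed format: 0, three digits, then two groups of three digits
PHONE_PATTERN = r"\b0\d{3}[ .-]?\d{3}[ .-]?\d{3}\b"

text = "Call 0912 345 678 or 0987654321; order id 12345."

phone_numbers = re.findall(PHONE_PATTERN, text)
print(phone_numbers)
```

The pattern would need adjusting for other formats, such as numbers with a country code.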
RegEx - Metacharacters
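o | : Either or
RegEx        String       Matches
a|b          cde          no match
             ade          match
             acdbea       match
             a b cd       match
o () : Capture and group
RegEx        String       Matches
(a|b|c)xz    ab xz        no match
             abxz         match
             axz cabxz    match
             a b cd       no match
o \ : Signals a special sequence (can also be used to escape special characters)
RegEx        String       Matches
\.xz         abxz         no match
             .xz          match
             axz ca.xz    match
             axz          no match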
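A brief sketch of alternation, grouping, and escaping with the re module, using strings from the slide.

```python
import re

# | matches either alternative
either = re.findall("a|b", "acdbea")
neither = re.search("a|b", "cde") is None

# With one capture group, findall returns the group, not the whole match
grouped = re.findall("(a|b|c)xz", "axz cabxz")
group_fails = re.search("(a|b|c)xz", "ab xz") is None

# \. matches a literal dot instead of "any character"
escaped = re.findall(r"\.xz", "axz ca.xz")
escaped_fails = re.search(r"\.xz", "abxz") is None

print(either, grouped, escaped)
```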
RegEx - Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:
Character   Description
\A          Returns a match if the specified characters are at the beginning of the string
\b          Returns a match where the specified characters are at the beginning or at the end of a word
\B          Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word
\d          Matches any digit (0-9)
\D          Matches any character that is NOT a digit


RegEx - Special Sequences
Character   Description
\s          Matches any white space character
\S          Matches any character that is NOT a white space character
\w          Matches any word character (letters, digits from 0-9, and the underscore _ character)
\W          Matches any character that is NOT a word character
\Z          Returns a match if the specified characters are at the end of the string
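A compact sketch that exercises several special sequences on an assumed sample string. Raw strings (r"...") keep the backslashes from being interpreted by Python itself.

```python
import re

txt = "Room 101 opens at 9am"

digits = re.findall(r"\d+", txt)      # runs of digits
chunks = re.findall(r"\S+", txt)      # runs of non-whitespace characters
o_words = re.findall(r"\bo\w*", txt)  # words that begin with "o"
inner_o = re.findall(r"\Bo", txt)     # "o" not at the start of a word
non_word = re.findall(r"\W", "a_b c-d")  # the underscore counts as a word character

at_start = re.search(r"\ARoom", txt) is not None
at_end = re.search(r"am\Z", txt) is not None

print(digits, chunks, o_words, inner_o, non_word, at_start, at_end)
```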
RegEx - Functions
The re module offers a set of functions that allows us to search a string for a match:
Function   Description
findall    Returns a list containing all matches
search     Returns a Match object if there is a match anywhere in the string
split      Returns a list where the string has been split at each match
sub        Replaces one or many matches with a string
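The four functions in the table, applied to one illustrative string. This is a minimal sketch; the count keyword of re.sub limits how many matches are replaced.

```python
import re

txt = "The rain in Spain"

all_ai = re.findall("ai", txt)                  # list of all matches
first_space = re.search(r"\s", txt)             # Match object for the first hit
words = re.split(r"\s", txt)                    # split at each match
replaced = re.sub(r"\s", "9", txt)              # replace every match
replaced_two = re.sub(r"\s", "9", txt, count=2) # replace only the first two

print(all_ai, first_space.start(), words, replaced, replaced_two)
```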
Practice
Get full names, emails, and phone numbers in the file Example_Regex.txt
• Emails: [a-z]+@[a-z.]+
• Name: (?:[A-ZĐ]\w+\s){3,4} or

Try it yourself
• (?:[A-ZĐ][a-záàảãạăắằẳẵặâấầẩẫậéèẻẽẹêếềểễệóòỏõọôốồổỗộơớờởỡợíìỉĩịúùủũụưứừửữựýỳỷỹỵđ,]+\s){3,4}
• (?:[A-ZĐ][a-záàảãạăắằẳẵặâấầẩẫậéèẻẽẹêếềểễệóòỏõọôốồổỗộơớờởỡợíìỉĩịúùủũụưứừửữựýỳỷỹỵ]+\s){2,3}[A-ZĐ][a-záàảãạăắằẳẵặâấầẩẫậéèẻẽẹêếềểễệóòỏõọôốồổỗộơớờởỡợíìỉĩịúùủũụưứừửữựýỳỷỹỵ]+\n
• See: https://docs.python.org/3/library/re.html
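A sketch of how the email pattern and the first name pattern from the practice could be applied. The file Example_Regex.txt is not available here, so the sample text below is invented and stands in for its contents.

```python
import re

# Invented stand-in for the contents of Example_Regex.txt
sample = (
    "Nguyen Van An\n"
    "Email: an@example.com\n"
    "Tran Thi Bich Ngoc\n"
    "Email: ngoc@mail.example.org\n"
)

EMAIL_PATTERN = r"[a-z]+@[a-z.]+"
NAME_PATTERN = r"(?:[A-ZĐ]\w+\s){3,4}"

emails = re.findall(EMAIL_PATTERN, sample)

# Each name match ends with the whitespace consumed by \s, so strip it
names = [name.strip() for name in re.findall(NAME_PATTERN, sample)]

print(emails)
print(names)
```

To run this against the real file, the sample string could be replaced by the text read from Example_Regex.txt; the patterns may need tuning for its actual layout.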
