CH08
November 7, 2024
2 8.1 Introduction
• Strings support many of the same sequence operations as lists and tuples
• Strings, like tuples, are immutable
• Here, we take a deeper look at strings
• Introduce regular expressions and the re module for matching patterns in text
– Particularly important in today’s data-rich applications
• Table below shows many string-processing and NLP-related applications
String and NLP applications
Sentiment analysis
Spam classification
Speech-to-text engines
Spell checkers
Steganography
Text editors
Text-to-speech engines
Web scraping
Who authored Shakespeare’s works?
Word clouds
Word games
Writing medical diagnoses from x-rays, scans, blood tests
and many more…
[1]: '17.49'
• Can also use the e presentation type for exponential (scientific) notation.
[2]: f'{17.489:.2e}'
[2]: '1.75e+01'
2.1.1 Integers
• There also are integer presentation types (b, o and x or X) that format integers using the
binary, octal or hexadecimal number systems.
[3]: f'{10:d}'
[3]: '10'
[4]: f'{10:b}'
[4]: '1010'
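The integer presentation types can be compared side by side. The following is a minimal sketch; the value 255 and the variable names are illustrative and do not come from the notes.

```python
# Format one integer in several number systems.
value = 255  # illustrative value (assumption)

decimal_text = f'{value:d}'   # decimal
binary_text = f'{value:b}'    # binary
octal_text = f'{value:o}'     # octal
hex_lower = f'{value:x}'      # hexadecimal with lowercase digits
hex_upper = f'{value:X}'      # hexadecimal with uppercase digits

print(decimal_text, binary_text, octal_text, hex_lower, hex_upper)
```

The only difference between `x` and `X` is the case of the digits a through f.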
2.1.2 Characters
[5]: f'{65:c} {97:c}'
2.1.3 Strings
• If s is specified explicitly, the value to format must be a string, an expression that produces
a string or a string literal.
• If you do not specify a presentation type, non-string values are converted to strings.
[6]: f'{"hello":s} {7}'
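The two rules above can be checked with a short sketch. The variable names are illustrative; the point is that `s` accepts only string values, while omitting the presentation type lets a non-string value be converted.

```python
greeting = 'hello'
number = 7

explicit = f'{greeting:s}'   # s presentation type with a string value
implicit = f'{number}'       # no presentation type: the int becomes '7'

# Applying s to a non-string value such as an int raises ValueError.
try:
    f'{number:s}'
    raised = False
except ValueError:
    raised = True

print(explicit, implicit, raised)
```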
[8]: f'{Decimal("10000000000000000000000000.0"):.3f}'
[8]: '10000000000000000000000000.000'
[9]: f'{Decimal("10000000000000000000000000.0"):.3e}'
[9]: '1.000e+25'
[10]: '1.000E+25'
[13]: f'[{3.5:10f}]'
[14]: f'[{3.6:10.3f}]'
[14]: '[     3.600]'
[14]: f'[{"hello":10}]'
[16]: f'[{3.5:<15f}]'
[17]: f'[{"hello":>15}]'
[19]: f'[{3.5:^7.3f}]'
[20]: f'[{"hello":^7}]'
• To fill the remaining characters of the field with 0s rather than spaces, place a 0 before the
field width (and after the + if there is one)
[21]: f'[{27:+10d}]'
[22]: f'[{27:+010d}]'
[22]: '[+000000027]'
[26]: a = 5-7
f'[{a:+010d}]'
[26]: '[-000000002]'
27
27
-27
[24]: f'{12345678:,d}'
[24]: '12,345,678'
[25]: f'{123456.78:,.2f}'
[25]: '123,456.78'
[30]: '{:.2f}'.format(17.489)
[30]: '17.49'
[27]: 4439569264
[28]: s2 = 'birthday'
[30]: print(s1)
id(s1)
happy birthday
[30]: 4439833776
[32]: symbol *= 5
[33]: symbol
[33]: '>>>>>'
4.0.1 Removing Leading and Trailing Whitespace
• strip removes leading and trailing whitespace
[34]: sentence = '\n \t This is a test string. \t\t \n'
[35]: sentence
[36]: print(sentence)
[37]: sentence.strip()
[38]: print(sentence)
[42]: sentence.strip('\n')
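Because strings are immutable, `strip` returns a new string and leaves the original unchanged. The sketch below reuses the `sentence` value from the cells above as it appears in these notes; the other variable names are illustrative.

```python
sentence = '\n \t This is a test string. \t\t \n'

# With no argument, strip removes all leading and trailing whitespace.
stripped = sentence.strip()

# With an argument, strip removes only the listed characters from both ends.
only_newlines = sentence.strip('\n')

print(repr(stripped))
print(repr(only_newlines))
print(repr(sentence))  # the original string is unchanged
```

Passing `'\n'` removes only the outer newline characters, so the spaces and tabs next to them remain.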
5 8.5 Changing Character Case
5.0.1 Capitalizing Only a String’s First Character
• Method capitalize returns a new string with only the first letter capitalized (sometimes
called sentence capitalization)
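A minimal sketch of `capitalize` follows; the sample strings are illustrative. Note that `capitalize` also converts the remaining letters to lowercase.

```python
text = 'happy birthday'
capitalized = text.capitalize()      # only the first letter is uppercased

mixed = 'hELLO wORLD'
normalized = mixed.capitalize()      # the remaining letters are lowercased

print(capitalized)
print(normalized)
print(text)  # the original string is unchanged
```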
A: 65; a: 97
• Compare the strings 'Orange' and 'orange' using the comparison operators
[46]: False
[47]: True
[48]: True
[49]: True
[50]: False
[51]: False
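The input cells for the comparison results above are not shown in these notes. The sketch below is a reconstruction, assuming the six comparison operators were applied to 'Orange' and 'orange' in the usual order; it produces the same sequence of results. Strings are compared by the integer values of their characters, and 'O' has a lower value than 'o'.

```python
# Character codes explain the ordering: uppercase letters sort before lowercase.
print(f'A: {ord("A")}; a: {ord("a")}')

# Assumed order of operators: ==, !=, <, <=, >, >=
results = [
    'Orange' == 'orange',
    'Orange' != 'orange',
    'Orange' < 'orange',
    'Orange' <= 'orange',
    'Orange' > 'orange',
    'Orange' >= 'orange',
]
print(results)
```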
[53]: sentence.count('to')
[53]: 2
• If you specify as the second argument a start index, count searches only the slice
string[start_index:]
[54]: sentence.count('to', 12)
[54]: 1
• If you specify as the second and third arguments the start index and end index, count
searches only the slice string[start_index:end_index]
[55]: sentence.count('that', 12, 25)
[55]: 1
• Like count, the other string methods presented in this section each have start index and
end index arguments
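The equivalence between the optional index arguments and slicing can be checked directly. The definition of `sentence` is not shown in these notes; the value below is an assumption that is consistent with the outputs above.

```python
# Assumed value of sentence (not shown in the notes).
sentence = 'to be or not to be that is the question'

total = sentence.count('to')               # whole string
from_12 = sentence.count('to', 12)         # same as sentence[12:].count('to')
window = sentence.count('that', 12, 25)    # same as sentence[12:25].count('that')

print(total, from_12, window)
print(from_12 == sentence[12:].count('to'))
print(window == sentence[12:25].count('that'))
```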
[58]: 3
• rindex performs the same operation as index, but searches from the end of the string
[63]: sentence.rindex('be')
[63]: 16
• find and rfind perform the same tasks as index and rindex but return -1 if the substring
is not found
[64]: sentence.index('answer')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [64], line 1
----> 1 sentence.index('answer')
[59]: sentence.find('answer')
[59]: -1
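The practical difference is how a missing substring is reported: `index` raises a `ValueError` that must be handled, while `find` returns -1. The sketch below assumes the same `sentence` value as the outputs above suggest; `locate` is a hypothetical helper written for this illustration.

```python
# Assumed value of sentence (not shown in the notes).
sentence = 'to be or not to be that is the question'

def locate(text, substring):
    """Hypothetical helper: return the first index of substring, or None."""
    try:
        return text.index(substring)
    except ValueError:
        return None

print(locate(sentence, 'be'))        # first occurrence
print(locate(sentence, 'answer'))    # None, because index raised ValueError
print(sentence.find('answer'))       # -1, no exception
print(sentence.rfind('be'))          # last occurrence
```

Be careful when testing the result of `find`: -1 is truthy and 0 (a match at the start) is falsy, so compare against -1 explicitly.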
[60]: True
[61]: False
[62]: True
[63]: True
[64]: sentence.startswith('be')
[64]: False
[65]: sentence.endswith('question')
[65]: True
[66]: sentence.endswith('quest')
[66]: False
[69]: values
[69]: '1\t2\t3\t4\t5'
[72]: letters = 'A; B; C; D'
[73]: letters.split(';')
[74]: 'An--Bart--Joe'.split('--')
• Specify the maximum number of splits with an integer as the second argument
• Last token is the remainder of the string
[79]: letters.split('; ', 2)
A = B
• rsplit performs the same task as split but processes the maximum number of splits from
the end of the string toward the beginning
You can pass maxsplit as a keyword argument to limit the number of splits, e.g., if you are only
interested in the first part.
[4]: my_list = 'An Ben Casper'
my_list.split()
[5]: my_list.split(maxsplit=1)
[6]: my_list.split(maxsplit=1)[0]
[6]: 'An'
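The difference between `split` and `rsplit` only shows up when the number of splits is limited. A minimal sketch, reusing the `letters` and `my_list` values from the cells above (the other names are illustrative):

```python
letters = 'A; B; C; D'
my_list = 'An Ben Casper'

# At most two splits, counted from the left: the last token is the remainder.
from_left = letters.split('; ', 2)

# At most two splits, counted from the right: the first token is the remainder.
from_right = letters.rsplit('; ', 2)

# With whitespace splitting, maxsplit can be passed as a keyword argument.
first, rest = my_list.split(maxsplit=1)
head, last = my_list.rsplit(maxsplit=1)

print(from_left)
print(from_right)
print(first, '|', rest)
print(head, '|', last)
```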
[83]: letters_list = ['A', 'B', 'C', 'D']
[84]: ';'.join(letters_list)
[84]: 'A;B;C;D'
[86]: '0;1;2;3;4;5;6;7;8;9'
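The input cell for the digits output above is not shown in these notes. One way to produce it, offered as an assumption rather than the original code, is to convert each integer to a string before joining, because `join` accepts only strings.

```python
# join requires an iterable of strings, so convert each integer first.
digits = ';'.join([str(i) for i in range(10)])
print(digits)

# split with the same separator reverses the operation.
round_trip = digits.split(';')
print(round_trip)
```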
• To search for the separator from the end of the string, use method rpartition
[88]: url = 'http://www.deitel.com/books/PyCDS/table_of_contents.html'
[90]: document
[90]: 'table_of_contents.html'
[91]: rest_of_url
[91]: 'http://www.deitel.com/books/PyCDS'
[92]: 'table_of_contents.html'
or
[93]: document = url.rpartition('/')[2]
document
[93]: 'table_of_contents.html'
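Both `partition` and `rpartition` always return a three-element tuple, which makes tuple unpacking convenient. The sketch below uses an illustrative URL; the variable names other than `url`, `rest_of_url` and `document` are assumptions.

```python
url = 'http://www.deitel.com/books/PyCDS/table_of_contents.html'

# rpartition splits at the LAST occurrence of the separator.
rest_of_url, separator, document = url.rpartition('/')

# partition splits at the FIRST occurrence of the separator.
scheme, _, remainder = url.partition('://')

# If the separator is absent, the whole string lands in one element.
missing_right = 'no-slash-here'.rpartition('/')
missing_left = 'no-slash-here'.partition('/')

print(rest_of_url)
print(document)
print(scheme, remainder)
print(missing_right, missing_left)
```

When the separator is not found, `rpartition` puts the original string in the last element and `partition` puts it in the first.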
9.0.4 String Method splitlines
• splitlines returns a list of new strings representing lines of text split at each newline
character in the original string
[94]: lines = """This is line 1
This is line2
This is line3"""
[95]: lines
[96]: lines.splitlines()
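A short sketch of what `splitlines` returns for the `lines` string above. Passing `True` (the `keepends` argument) is an addition to what these notes cover; it keeps the newline characters in the result.

```python
lines = """This is line 1
This is line2
This is line3"""

without_newlines = lines.splitlines()      # newline characters are dropped
with_newlines = lines.splitlines(True)     # keepends=True keeps them

print(without_newlines)
print(with_newlines)
```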
[97]: False
[98]: '27'.isdigit()
[98]: True
[99]: 'A9876'.isalnum()
[99]: True
[100]: '123 Main Street'.isalnum()
[100]: False
[102]: print(file_path)
file_path
C:\MyFolder\MySubFolder\MyFile.txt
[102]: 'C:\\MyFolder\\MySubFolder\\MyFile.txt'
[104]: print(file_path)
file_path
C:\MyFolder\MySubFolder\MyFile.txt
[104]: 'C:\\MyFolder\\MySubFolder\\MyFile.txt'