CH08

November 7, 2024

1 8. Strings: A Deeper Look

2 8.1 Introduction
• Strings support many of the same sequence operations as lists and tuples
• Strings, like tuples, are immutable
• Here, we take a deeper look at strings
• Introduce regular expressions and the re module for matching patterns in text
– Particularly important in today’s data-rich applications
• Table below shows many string-processing and NLP-related applications

String and NLP applications


Anagrams
Automated grading of written homework
Automated teaching systems
Categorizing articles
Chatbots
Compilers and interpreters
Creative writing
Cryptography
Document classification
Document similarity
Document summarization
Electronic book readers
Fraud detection
Grammar checkers
Inter-language translation
Legal document preparation
Monitoring social media posts
Natural language understanding
Opinion analysis
Page-composition software
Palindromes
Parts-of-speech tagging
Project Gutenberg free books
Reading books, articles, documentation and absorbing knowledge
Search engines

Sentiment analysis
Spam classification
Speech-to-text engines
Spell checkers
Steganography
Text editors
Text-to-speech engines
Web scraping
Who authored Shakespeare’s works?
Word clouds
Word games
Writing medical diagnoses from x-rays, scans, blood tests
and many more…

2.1 8.2.1 Presentation Types


[1]: f'{17.489:.2f}'

[1]: '17.49'

• Can also use the e presentation type for exponential (scientific) notation.

[2]: f'{17.489:.2e}'

[2]: '1.75e+01'

2.1.1 Integers
• There also are integer presentation types (b, o and x or X) that format integers using the
binary, octal or hexadecimal number systems.
[3]: f'{10:d}'

[3]: '10'

[4]: f'{10:b}'

[4]: '1010'
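
The remaining integer presentation types can be compared side by side. This is only an illustrative sketch: the value 255 and the variable names are arbitrary choices, not part of the original notes.

```python
# Format one integer in each number system mentioned above.
value = 255  # arbitrary sample value chosen for illustration

binary = f'{value:b}'     # base 2
octal = f'{value:o}'      # base 8
lower_hex = f'{value:x}'  # base 16 with lowercase digits
upper_hex = f'{value:X}'  # base 16 with uppercase digits

print(binary, octal, lower_hex, upper_hex)  # 11111111 377 ff FF
```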

2.1.2 Characters
[5]: f'{65:c} {97:c}'

[5]: 'A a'

2.1.3 Strings
• If s is specified explicitly, the value to format must be a string, an expression that produces
a string or a string literal.
• If you do not specify a presentation type, non-string values are converted to strings.
[6]: f'{"hello":s} {7}'

[6]: 'hello 7'
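
To see why the s presentation type requires a string, the following sketch (with hypothetical variable names) applies it to an integer and catches the resulting error, then converts the value explicitly with str:

```python
# The s presentation type is rejected for a non-string value.
try:
    result = f'{7:s}'
except ValueError as error:
    result = f'ValueError: {error}'

# Converting explicitly with str() makes the s type acceptable.
converted = f'{str(7):s}'

print(result)
print(converted)
```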

2.1.4 Floating-Point and Decimal Values


• For extremely large and small floating-point and Decimal values, exponential (scientific)
notation can be used to format the values more compactly
[7]: from decimal import Decimal

[8]: f'{Decimal("10000000000000000000000000.0"):.3f}'

[8]: '10000000000000000000000000.000'

[9]: f'{Decimal("10000000000000000000000000.0"):.3e}'

[9]: '1.000e+25'

• The formatted value 1.000e+25 is equivalent to $1.000 \times 10^{25}$
• If you prefer a capital E for the exponent, use the E presentation type rather than e.
[10]: f'{Decimal("10000000000000000000000000.0"):.3E}'

[10]: '1.000E+25'

2.2 8.2.2 Field Widths and Alignment


• Python right-aligns numbers and left-aligns other values.
• By default, Python formats float values with six digits of precision to the right of the decimal point.
[12]: f'[{27:10d}]'

[12]: '[        27]'

[13]: f'[{3.5:10f}]'

[13]: '[  3.500000]'

[14]: f'[{3.6:10.3f}]'

[14]: '[     3.600]'

[14]: f'[{"hello":10}]'

[14]: '[hello     ]'

[16]: f'[{"hello day everyone":10}]'

[16]: '[hello day everyone]'

2.2.1 Explicitly Specifying Left and Right Alignment in a Field


• Can specify left and right alignment with < and >:
[15]: f'[{27:<15d}]'

[15]: '[27             ]'

[16]: f'[{3.5:<15f}]'

[16]: '[3.500000       ]'

[17]: f'[{"hello":>15}]'

[17]: '[          hello]'

2.2.2 Centering a Value in a Field


• Centering attempts to spread the remaining unoccupied character positions equally to the
left and right of the formatted value
• Python places the extra space to the right if an odd number of character positions remain
[18]: f'[{27:^7d}]'

[18]: '[  27   ]'

[19]: f'[{3.5:^7.3f}]'

[19]: '[ 3.500 ]'

[20]: f'[{"hello":^7}]'

[20]: '[ hello ]'
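
Field widths and alignment are most useful when several rows must line up. The sketch below formats a small table; the item names, prices and column widths are made up for illustration.

```python
# Hypothetical data: (name, price) pairs to display in two aligned columns.
items = [('apple', 0.5), ('watermelon', 3.25), ('kiwi', 12.0)]

# Left-align names in 12 characters; right-align prices in 8 characters
# with two digits to the right of the decimal point.
lines = [f'[{name:<12}][{price:>8.2f}]' for name, price in items]

for line in lines:
    print(line)
```

Every line has the same length, so the brackets form straight columns when printed in a fixed-width font.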

2.3 8.2.3 Numeric Formatting


2.3.1 Formatting Positive Numbers with Signs
• A + before the field width specifies that a positive number should be preceded by a +

• To fill the remaining characters of the field with 0s rather than spaces, place a 0 before the
field width (and after the + if there is one)

[21]: f'[{27:+10d}]'

[21]: '[       +27]'

[22]: f'[{27:+010d}]'

[22]: '[+000000027]'

[26]: a = 5-7
f'[{a:+010d}]'

[26]: '[-000000002]'

2.3.2 Using a Space Where a + Sign Would Appear in a Positive Value


• A space indicates that positive numbers should show a space character in the sign position
[23]: print(f'{27:d}\n{27: d}\n{-27: d}')

27
 27
-27

2.3.3 Grouping Digits


• Format numbers with thousands separators by using a comma (,)

[24]: f'{12345678:,d}'

[24]: '12,345,678'

[25]: f'{123456.78:,.2f}'

[25]: '123,456.78'
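
The sign, zero-fill, grouping and precision options can be combined in one format specifier. A minimal sketch, assuming an arbitrary sample value named balance:

```python
balance = 1234567.891  # arbitrary sample value chosen for illustration

# Sign, thousands separators and two digits of precision together.
credit = f'{balance:+,.2f}'
debit = f'{-balance:+,.2f}'

# Sign plus zero-fill in a field of eight characters.
padded = f'{42:+08d}'

print(credit, debit, padded)  # +1,234,567.89 -1,234,567.89 +0000042
```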

2.4 8.2.4 String’s format Method


• f-strings were added to Python in version 3.6
• Before that, formatting was performed with the string method format
• f-string formatting is based on the format method’s capabilities
• Call method format on a format string containing curly brace ({}) placeholders, possibly
with format specifiers
• Pass to the method the values to be formatted
• If there’s a format specifier, precede it by a colon (:)
• See book for more info

[30]: '{:.2f}'.format(17.489)

[30]: '17.49'

[31]: '{} {}'.format('Amanda', 'Cyan')

[31]: 'Amanda Cyan'

[26]: '{0} {0} {1}!!!'.format('Happy', 'Birthday')

[26]: 'Happy Happy Birthday!!!'
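
Besides positional placeholders, format also accepts keyword arguments that placeholders reference by name. The sketch below (variable names are hypothetical) shows this and checks that a format call and the equivalent f-string produce the same text:

```python
first, last = 'Amanda', 'Cyan'

# Placeholders filled by position and by keyword.
by_position = '{} {}'.format(first, last)
by_keyword = '{last}, {first}'.format(first=first, last=last)

# The same format specifiers work in format calls and in f-strings.
with_format = '{0:>8.2f}|{1:<6}|'.format(17.489, 'ok')
with_fstring = f'{17.489:>8.2f}|{"ok":<6}|'

print(by_position)
print(by_keyword)
print(with_format)
```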

3 8.3 Concatenating and Repeating Strings


• Previously, we used the + operator to concatenate strings and the * operator to repeat strings
• Also can perform these operations with augmented assignments
• Strings are immutable, so each operation assigns a new string object to the variable
[27]: s1 = 'happy'
id(s1)

[27]: 4439569264

[28]: s2 = 'birthday'

[29]: s1 += ' ' + s2

[30]: print(s1)
id(s1)

happy birthday

[30]: 4439833776

[31]: symbol = '>'

[32]: symbol *= 5

[33]: symbol

[33]: '>>>>>'
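
Comparing id values shows that a new object is created. Another way to observe the same thing, sketched here with a hypothetical second name alias, is to keep a reference to the original string and check that it is unaffected by the augmented assignment:

```python
s1 = 'happy'
alias = s1          # a second name for the same string object

s1 += ' birthday'   # creates a new string and rebinds s1 to it

print(alias)        # happy
print(s1)           # happy birthday
print(s1 is alias)  # False
```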

4 8.4 Stripping Whitespace from Strings


• Methods for removing whitespace from the ends of a string each return a new string

4.0.1 Removing Leading and Trailing Whitespace
• strip removes leading and trailing whitespace
[34]: sentence = '\n \t This is a test string. \t\t \n'

[35]: sentence

[35]: '\n \t This is a test string. \t\t \n'

[36]: print(sentence)

This is a test string.

[37]: sentence.strip()

[37]: 'This is a test string.'

[38]: print(sentence)

This is a test string.

4.0.2 Removing Leading Whitespace


• lstrip removes only leading whitespace
[39]: sentence.lstrip()

[39]: 'This is a test string. \t\t \n'

4.0.3 Removing Trailing Whitespace


• rstrip removes only trailing whitespace
[40]: sentence.rstrip()

[40]: '\n \t This is a test string.'

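• Passing a string argument to strip, lstrip or rstrip removes only the characters in that string from the ends, rather than all whitespace
[42]: sentence.strip('\n')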

[42]: ' \t This is a test string. \t\t '
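
A typical use of these methods is cleaning up text that was typed by a user or read from a file. The following sketch uses invented sample data to strip whitespace from every item in a list and to strip a custom set of characters:

```python
# Hypothetical raw input with stray whitespace around each name.
raw_names = ['  alice \n', '\tbob', 'carol  ']
cleaned = [name.strip() for name in raw_names]

# With an argument, strip removes any of the listed characters from both ends.
decorated = 'xx--title--xx'
bare = decorated.strip('x-')

print(cleaned)  # ['alice', 'bob', 'carol']
print(bare)     # title
```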

5 8.5 Changing Character Case
5.0.1 Capitalizing Only a String’s First Character
• Method capitalize returns a new string with only the first letter capitalized (sometimes
called sentence capitalization)

[43]: 'happy birthday'.capitalize()

[43]: 'Happy birthday'

5.0.2 Capitalizing the First Character of Every Word in a String


• Method title returns a new string with only the first character of each word capitalized
(sometimes called book-title capitalization)

[44]: 'strings: a deeper look'.title()

[44]: 'Strings: A Deeper Look'
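
Both methods also lowercase the letters they do not capitalize, and title treats an apostrophe as a word boundary. The sample strings in this sketch are chosen only to make those effects visible:

```python
mixed = 'hAPPY bIRTHDAY'

sentence_case = mixed.capitalize()  # the rest of the string is lowercased
title_case = mixed.title()          # the rest of each word is lowercased

# title starts a new "word" after any non-letter, including an apostrophe.
contraction = "they're here".title()

print(sentence_case)  # Happy birthday
print(title_case)     # Happy Birthday
print(contraction)    # They'Re Here
```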

6 8.6 Comparison Operators for Strings


• Strings may be compared with the comparison operators
• Strings are compared based on their underlying integer numeric values
• Can check integer codes with ord
[45]: print(f'A: {ord("A")}; a: {ord("a")}')

A: 65; a: 97
• Compare the strings 'Orange' and 'orange' using the comparison operators

[46]: 'Orange' == 'orange'

[46]: False

[47]: 'Orange' != 'orange'

[47]: True

[48]: 'Orange' < 'orange'

[48]: True

[49]: 'Orange' <= 'orange'

[49]: True

[50]: 'Orange' > 'orange'

[50]: False

[51]: 'Orange' >= 'orange'

[51]: False
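
Because uppercase letters have lower numeric values than lowercase letters, sorting mixed-case strings may give surprising results. The sketch below, with an invented word list, contrasts the default ordering with a case-insensitive ordering that uses str.lower as the sort key:

```python
words = ['orange', 'Orange', 'apple']  # sample data for illustration

default_order = sorted(words)                   # compares numeric character values
ignoring_case = sorted(words, key=str.lower)    # compares lowercase copies

# A case-insensitive equality test converts both operands first.
same_word = 'Orange'.lower() == 'orange'.lower()

print(default_order)  # ['Orange', 'apple', 'orange']
print(ignoring_case)  # ['apple', 'orange', 'Orange']
print(same_word)      # True
```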

7 8.7 Searching for Substrings


• Can search a string for a substring to
– count number of occurrences
– determine whether a string contains a substring
– determine the index at which a substring resides in a string
• Each method shown in this section compares characters lexicographically using their
underlying numeric values

7.0.1 Counting Occurrences


• count returns the number of times its argument occurs in a string
[52]: sentence = 'to be or not to be that is the question'

[53]: sentence.count('to')

[53]: 2

• If you specify as the second argument a start index, count searches only the slice
string[start_index:]
[54]: sentence.count('to', 12)

[54]: 1

• If you specify as the second and third arguments the start index and end index, count
searches only the slice string[start_index:end_index]
[55]: sentence.count('that', 12, 25)

[55]: 1

• Like count, the other string methods presented in this section each have start index and
end index arguments

7.0.2 Locating a Substring in a String


• index searches for a substring within a string and returns the first index at which the substring
is found; otherwise, a ValueError occurs:
[58]: sentence.index('be')

[58]: 3

• rindex performs the same operation as index, but searches from the end of the string and returns the last index at which the substring is found
[63]: sentence.rindex('be')

[63]: 16

• find and rfind perform the same tasks as index and rindex but return -1 if the substring
is not found
[64]: sentence.index('answer')

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [64], line 1
----> 1 sentence.index('answer')

ValueError: substring not found

[59]: sentence.find('answer')

[59]: -1
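
Because find returns -1 instead of raising an exception, it works well in a loop. The helper below, find_all, is a hypothetical function written for this sketch (it is not a built-in string method); it uses the optional start index to collect every position at which a substring occurs:

```python
def find_all(text, substring):
    """Return a list of every index at which substring occurs in text."""
    positions = []
    start = text.find(substring)
    while start != -1:
        positions.append(start)
        # Resume the search one character past the previous match.
        start = text.find(substring, start + 1)
    return positions


sentence = 'to be or not to be that is the question'
print(find_all(sentence, 'to'))      # [0, 13]
print(find_all(sentence, 'be'))      # [3, 16]
print(find_all(sentence, 'answer'))  # []
```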

7.0.3 Determining Whether a String Contains a Substring


• To check whether a string contains a substring, use operator in or not in
[60]: 'that' in sentence

[60]: True

[61]: 'THAT' in sentence

[61]: False

[62]: 'THAT' not in sentence

[62]: True

7.0.4 Locating a Substring at the Beginning or End of a String


• startswith and endswith return True if the string starts with or ends with a specified
substring
[63]: sentence.startswith('to')

[63]: True

[64]: sentence.startswith('be')

[64]: False

[65]: sentence.endswith('question')

[65]: True

[66]: sentence.endswith('quest')

[66]: False
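
Both methods also accept a tuple of substrings and return True if any one of them matches. A brief sketch, assuming invented file names and extensions:

```python
# Hypothetical file names used only for illustration.
filenames = ['photo.png', 'notes.txt', 'scan.jpg', 'draft.docx']

# endswith with a tuple matches any of the listed suffixes.
image_extensions = ('.png', '.jpg')
images = [name for name in filenames if name.endswith(image_extensions)]

print(images)  # ['photo.png', 'scan.jpg']
```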

8 8.8 Replacing Substrings


• A common text manipulation is to locate a substring and replace its value
• replace searches a string for the substring in its first argument and replaces each occurrence
with the substring in its second argument
• Can receive an optional third argument specifying the maximum number of replacements
[67]: values = '1\t2\t3\t4\t5'

[68]: values.replace('\t', ', ')

[68]: '1, 2, 3, 4, 5'

[69]: values

[69]: '1\t2\t3\t4\t5'

[70]: 'There is [note] in the note'.replace('note','')

[70]: 'There is [] in the '

[71]: 'There is [note] in the note'.replace('note','extra note')

[71]: 'There is [extra note] in the extra note'
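
The optional third argument is not demonstrated in the cells above. As a small sketch (the replacement word memo is an arbitrary choice), limiting the count to 1 changes only the first occurrence:

```python
text = 'There is [note] in the note'

first_only = text.replace('note', 'memo', 1)  # at most one replacement
every_one = text.replace('note', 'memo')      # all occurrences

print(first_only)  # There is [memo] in the note
print(every_one)   # There is [memo] in the memo
```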

9 8.9 Splitting and Joining Strings


• Tokens typically are separated by whitespace characters such as blank, tab and newline,
though other characters may be used—the separators are known as delimiters

9.0.1 Splitting Strings


• To tokenize a string at a custom delimiter, specify the delimiter string that split uses to
tokenize the string

[72]: letters = 'A; B; C; D'

[73]: letters.split(';')

[73]: ['A', ' B', ' C', ' D']

[74]: 'An--Bart--Joe'.split('--')

[74]: ['An', 'Bart', 'Joe']

• Specify the maximum number of splits with an integer as the second argument
• Last token is the remainder of the string
[79]: letters.split('; ', 2)

[79]: ['A', 'B', 'C; D']

[82]: a, b, _ = letters.split('; ', 2)
print(f'{a} = {b}')

A = B
• rsplit performs the same task as split but processes the maximum number of splits from
the end of the string toward the beginning
• Pass the keyword argument maxsplit to limit the number of splits, for example when you are
interested only in the first token
[4]: my_list = 'An Ben Casper'
my_list.split()

[4]: ['An', 'Ben', 'Casper']

[5]: my_list.split(maxsplit=1)

[5]: ['An', 'Ben Casper']

[6]: my_list.split(maxsplit=1)[0]

[6]: 'An'

This gives the same result as my_list.split()[0] but is more efficient, because splitting stops after the first token.
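
The rsplit method mentioned above has no example in these notes. The following sketch contrasts it with split on the same string; the file name is a made-up sample.

```python
letters = 'A; B; C; D'

from_left = letters.split('; ', 2)    # splits counted from the beginning
from_right = letters.rsplit('; ', 2)  # splits counted from the end

# A common use: separate a file extension from the rest of the name.
filename = 'archive.tar.gz'  # hypothetical file name
stem, extension = filename.rsplit('.', 1)

print(from_left)   # ['A', 'B', 'C; D']
print(from_right)  # ['A; B', 'C', 'D']
print(stem, extension)  # archive.tar gz
```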

9.0.2 Joining Strings


• join concatenates the strings in its argument, which must be an iterable containing only
string values
• The separator between the concatenated items is the string on which you call join

[83]: letters_list = ['A', 'B', 'C', 'D']

[84]: ';'.join(letters_list)

[84]: 'A;B;C;D'

• Join the results of a list comprehension that creates a list of strings


[86]: ';'.join([str(i) for i in range(10)])

[86]: '0;1;2;3;4;5;6;7;8;9'
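
The requirement that the iterable contain only strings explains why the comprehension above calls str. This sketch, using invented sample values, shows the TypeError raised for non-string items and a split followed by join that normalizes a delimiter:

```python
numbers = [1, 2, 3]  # sample data for illustration

# join raises TypeError when an item is not a string.
try:
    joined = ', '.join(numbers)
except TypeError:
    joined = None

# Converting each item first makes the call valid.
converted = ', '.join(str(number) for number in numbers)

# split and join together can replace one delimiter with another.
normalized = ','.join('A; B; C; D'.split('; '))

print(joined)      # None
print(converted)   # 1, 2, 3
print(normalized)  # A,B,C,D
```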

9.0.3 String Methods partition and rpartition


• String method partition splits a string into a tuple of three strings based on the method’s separator argument
– the part of the original string before the separator
– the separator itself
– the part of the string after the separator
[87]: 'Amanda: 89, 97, 92'.partition(': ')

[87]: ('Amanda', ': ', '89, 97, 92')

• To search for the separator from the end of the string, use method rpartition
[88]: url = 'http://www.deitel.com/books/PyCDS/table_of_contents.html'

[89]: rest_of_url, separator, document = url.rpartition('/')

[90]: document

[90]: 'table_of_contents.html'

[91]: rest_of_url

[91]: 'http://www.deitel.com/books/PyCDS'

If you are interested only in the value of document, it is better to write


[92]: _, _, document = url.rpartition('/')
document

[92]: 'table_of_contents.html'

or
[93]: document = url.rpartition('/')[2]
document

[93]: 'table_of_contents.html'
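
It can help to know what these methods return when the separator is absent, since the result is still a tuple of three strings. A short sketch using an invented sample string:

```python
text = 'no separator here'  # sample string without a colon

# partition puts the whole string first when the separator is not found.
head, separator, tail = text.partition(':')

# rpartition puts the whole string last when the separator is not found.
r_head, r_separator, r_tail = text.rpartition(':')

print((head, separator, tail))        # ('no separator here', '', '')
print((r_head, r_separator, r_tail))  # ('', '', 'no separator here')
```

Testing whether the middle item is an empty string is therefore a simple way to detect that no split took place.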

9.0.4 String Method splitlines
• splitlines returns a list of new strings representing lines of text split at each newline
character in the original string
[94]: lines = """This is line 1
This is line2
This is line3"""

[95]: lines

[95]: 'This is line 1\nThis is line2\nThis is line3'

[96]: lines.splitlines()

[96]: ['This is line 1', 'This is line2', 'This is line3']

• Passing True to splitlines keeps the newlines


[98]: lines.splitlines(True)

[98]: ['This is line 1\n', 'This is line2\n', 'This is line3']
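
Calling split('\n') may look equivalent, but the two differ for text that ends with a newline or uses Windows-style line endings. The sample text in this sketch is invented to show the difference:

```python
# Hypothetical text with a Windows line ending and a trailing newline.
text = 'first\r\nsecond\n'

with_splitlines = text.splitlines()  # recognizes \r\n and drops the trailing empty item
with_split = text.split('\n')        # keeps \r and yields a trailing empty string

print(with_splitlines)  # ['first', 'second']
print(with_split)       # ['first\r', 'second', '']
```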

10 8.10 Characters and Character-Testing Methods


• In Python, a character is simply a one-character string
• Python provides string methods for testing whether a string matches certain characteristics
• isdigit returns True if the string on which you call the method contains only the digit
characters (0–9)
– Useful for validating data
[97]: '-27'.isdigit()

[97]: False

[98]: '27'.isdigit()

[98]: True

11 8.10 Characters and Character-Testing Methods (cont.)


• isalnum returns True if the string on which you call the method is alphanumeric (only digits
and letters)

[99]: 'A9876'.isalnum()

[99]: True

[100]: '123 Main Street'.isalnum()

[100]: False

12 8.10 Characters and Character-Testing Methods (cont.)


• Table of many character-testing methods

String Method     Description

isalnum()         Returns True if the string contains only alphanumeric characters (i.e., digits and letters).
isalpha()         Returns True if the string contains only alphabetic characters (i.e., letters).
isdecimal()       Returns True if the string contains only decimal integer characters (that is, base 10 integers) and does not contain a + or - sign.
isdigit()         Returns True if the string contains only digits (e.g., '0', '1', '2').
isidentifier()    Returns True if the string represents a valid identifier.
islower()         Returns True if all alphabetic characters in the string are lowercase characters (e.g., 'a', 'b', 'c').
isnumeric()       Returns True if the characters in the string represent a numeric value without a + or - sign and without a decimal point.
isspace()         Returns True if the string contains only whitespace characters.
istitle()         Returns True if the first character of each word in the string is the only uppercase character in the word.
isupper()         Returns True if all alphabetic characters in the string are uppercase characters (e.g., 'A', 'B', 'C').
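
Since none of these methods accepts a sign, validating signed input needs an extra step. The function is_signed_integer below is a hypothetical helper written for this sketch, not a string method; it removes one optional leading sign and then relies on isdecimal:

```python
def is_signed_integer(text):
    """Return True if text is a base 10 integer with an optional sign."""
    # Drop a single leading + or - before testing the remaining characters.
    body = text[1:] if text[:1] in ('+', '-') else text
    return body.isdecimal()


for candidate in ['27', '-27', '+27', '2.7', '-', '']:
    print(repr(candidate), is_signed_integer(candidate))
```

An empty string or a lone sign is rejected because isdecimal returns False for an empty string.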

13 8.11 Raw Strings


• Backslash characters in strings introduce escape sequences—like \n for newline and \t for tab
• To include a backslash in a string, use two backslash characters \\
• Makes some strings difficult to read
• Consider a Microsoft Windows file location:
[101]: file_path = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'

[102]: print(file_path)
file_path

C:\MyFolder\MySubFolder\MyFile.txt

[102]: 'C:\\MyFolder\\MySubFolder\\MyFile.txt'

• Raw strings (preceded by the character r) are more convenient
• They treat each backslash as a regular character, rather than the beginning of an escape
sequence
[103]: file_path = r'C:\MyFolder\MySubFolder\MyFile.txt'

[104]: print(file_path)
file_path

C:\MyFolder\MySubFolder\MyFile.txt

[104]: 'C:\\MyFolder\\MySubFolder\\MyFile.txt'
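
The difference matters most when a folder name happens to begin with a letter that forms an escape sequence, and when writing patterns for the re module. The path, the sample text and the pattern in this sketch are invented for illustration:

```python
import re

# In a regular string, \n and \t become a newline and a tab.
regular = 'C:\new\table.txt'
raw = r'C:\new\table.txt'

print(len(regular))     # 14, because each escape sequence is one character
print(len(raw))         # 16, because each backslash is kept as a character
print('\n' in regular)  # True

# Raw strings keep regular-expression backslashes readable.
numbers = re.findall(r'\d+', 'Order 66 shipped 3 items')
print(numbers)          # ['66', '3']
```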
