Lecture 11
Files
Strings
• Objects with methods
• Finding and counting substrings: count, find, startswith
• Removing leading & trailing substrings/whitespace: strip, removeprefix
• Transforming Text: replace, upper, lower, title
• Checking String Composition: isalnum, isnumeric, isupper
• Splitting & Joining:
- names = str.split(', ')
- ', '.join(names)
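The string methods listed above can be tried on a small example. This is a minimal sketch; the sample text and variable names are made up for illustration, and removeprefix assumes Python 3.9 or later.

```python
# Hypothetical sample text, chosen only for illustration.
text = "  Mark Twain, Huck Finn, Tom Sawyer  "

cleaned = text.strip()                      # drop leading/trailing whitespace
names = cleaned.split(', ')                 # ['Mark Twain', 'Huck Finn', 'Tom Sawyer']
rejoined = ', '.join(names)                 # back to a single string

n_commas = cleaned.count(',')               # 2
pos = cleaned.find('Huck')                  # 12 (index of the first match)
starts = cleaned.startswith('Mark')         # True
no_prefix = cleaned.removeprefix('Mark ')   # requires Python 3.9+
shouted = cleaned.upper()
is_num = '2021'.isnumeric()                 # True
```

Each method returns a new value; the original string is never modified because strings are immutable.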
Regular Expressions
• AKA regex
• A syntax to better specify how to decompose strings
• Look for patterns rather than specific characters
• Metacharacters: . ^ $ * + ? { } [ ] \ | ( )
- Repeat, one-of-these, optional
• Character Classes: \d (digit), \s (space), \w (word character), also \D, \S, \W
• Digits with slashes between them: \d+/\d+/\d+
• Usually use raw strings (no backslash plague): r'\d+/\d+/\d+'
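A short sketch of why raw strings help, along with a few of the metacharacters and character classes named above. The sample strings are hypothetical and only meant to show the behavior.

```python
import re

plain = '\\d+/\\d+/\\d+'   # without a raw string, every backslash must be doubled
raw = r'\d+/\d+/\d+'       # raw string: backslashes are kept literally
same = (plain == raw)      # True, both spell the same pattern

# Character classes on a hypothetical sample string
sample = "Room 12, floor 3"
digits = re.findall(r'\d+', sample)                # ['12', '3']
words = re.findall(r'\w+', sample)                 # ['Room', '12', 'floor', '3']

# Optional (?), one-of-these ([...]), and repeat ({m,n})
optional = re.findall(r'colou?r', "color colour")  # ['color', 'colour']
one_of = re.findall(r'[bc]at', "bat cat rat")      # ['bat', 'cat']
repeat = re.findall(r'a{2,3}', "a aa aaaa")        # ['aa', 'aaa']
```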
Regular Expression Examples
• s0 = "No full dates here, just 02/15"
s1 = "02/14/2021 is a date"
s2 = "Another date is 12/25/2020"
s3 = "April Fools' Day is 4/1/2021 & May the Fourth is 5/4/2021"
• re.match(r'\d+/\d+/\d+',s1) # returns match object
• re.match(r'\d+/\d+/\d+',s2) # None!
• re.search(r'\d+/\d+/\d+',s2) # returns 1 match object
• re.search(r'\d+/\d+/\d+',s3) # returns 1! match object
• re.findall(r'\d+/\d+/\d+',s3) # returns list of strings
• re.finditer(r'\d+/\d+/\d+',s3) # returns iterable of matches
• re.sub(r'(\d+)/(\d+)/(\d+)',r'\3-\1-\2',s3)
# captures month, day, year, and reformats
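The calls above can be run together to see what each one returns. This sketch reuses s2 and s3 from the slide; the other variable names are chosen here for illustration.

```python
import re

date = r'\d+/\d+/\d+'
s2 = "Another date is 12/25/2020"
s3 = "April Fools' Day is 4/1/2021 & May the Fourth is 5/4/2021"

at_start = re.match(date, s2)           # None: match only tries the start of the string
first = re.search(date, s3).group(0)    # '4/1/2021' (search stops at the first hit)
all_dates = re.findall(date, s3)        # ['4/1/2021', '5/4/2021']

# finditer yields match objects, which also carry positions
found = [s3[m.start():m.end()] for m in re.finditer(date, s3)]

# sub rewrites every match; \1, \2, \3 refer to the captured groups
reordered = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', s3)
```

Note that sub only reorders the captured text; it does not zero-pad, so 4/1/2021 becomes 2021-4-1.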
Grouping
• Parentheses capture a group that can be accessed or used later
• Access via groups() or group(n) where n is the number of the group, but
numbering starts at 1
• Note: group(0) is the full matched string
• for match in re.finditer(r'(\d+)/(\d+)/(\d+)',s3):
      print(match.groups())
• for match in re.finditer(r'(\d+)/(\d+)/(\d+)',s3):
      print('{2}-{0:02d}-{1:02d}'.format(
          *[int(x) for x in match.groups()]))
• The * operator unpacks a list into individual arguments
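A small sketch of group access and of what the * operator does in the format call. It uses the s1 string from the earlier examples; the variable names are chosen here for illustration.

```python
import re

m = re.search(r'(\d+)/(\d+)/(\d+)', "02/14/2021 is a date")
whole = m.group(0)     # '02/14/2021' (the full match)
month = m.group(1)     # '02' (first parenthesized group)
parts = m.groups()     # ('02', '14', '2021')

# Convert to ints so the :02d format specifier can zero-pad
nums = [int(x) for x in parts]

# *nums passes the three integers as separate positional arguments...
formatted = '{2}-{0:02d}-{1:02d}'.format(*nums)
# ...which is equivalent to writing them out by hand
by_hand = '{2}-{0:02d}-{1:02d}'.format(nums[0], nums[1], nums[2])
```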
Modifying Strings
Method/Attribute   Purpose
split()            Split the string into a list, splitting it wherever the RE matches
sub()              Find all substrings where the RE matches, and replace them with a different string
subn()             Does the same thing as sub(), but returns the new string and the number of replacements
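The three functions in the table can be sketched as follows, using the module-level forms re.split, re.sub, and re.subn. The input strings are hypothetical.

```python
import re

# split: break wherever the pattern matches (comma or semicolon, with optional spaces)
line = "apples, pears;  plums ,cherries"
items = re.split(r'\s*[,;]\s*', line)    # ['apples', 'pears', 'plums', 'cherries']

# sub: replace every match with a different string
masked = re.sub(r'\d', '#', "call 555-1234")             # 'call ###-####'

# subn: same replacement, but also reports how many were made
new_text, count = re.subn(r'\d', '#', "call 555-1234")   # ('call ###-####', 7)
```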
Assignment 4
• Assignment will cover strings and files
• Reading & writing data to files
• Dealing with characters and encodings
Files
Opening a File
• Opening associates a file on disk with an object in memory (file object or file
handle).
• We access the file via the file object.
• <filevar> = open(<name>, <mode>)
• Mode: 'r' = read, 'w' = write, 'a' = append
• 'r' (read) is the default
• Also add 'b' to indicate the file should be opened in binary mode: 'rb','wb'
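A minimal sketch of the modes listed above. It writes to a temporary directory so nothing on disk is disturbed; the file name demo.txt is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')   # hypothetical file name

    f = open(path, 'w')      # 'w' creates the file (or wipes an existing one)
    f.write('first\n')
    f.close()

    f = open(path, 'a')      # 'a' adds to the end instead of overwriting
    f.write('second\n')
    f.close()

    f = open(path)           # mode defaults to 'r'
    text = f.read()          # 'first\nsecond\n'
    f.close()

    f = open(path, 'rb')     # binary mode returns bytes rather than str
    raw = f.read()
    f.close()
```

In text mode the result is a str; in binary mode it is a bytes object whose exact line endings depend on the platform.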
Standard File Objects
• When Python begins, it associates three standard file objects:
- sys.stdin: for input
- sys.stdout: for output
- sys.stderr: for errors
• In the notebook
- sys.stdin isn't really used; input() can be used if necessary
- sys.stdout is the output shown after the code
- sys.stderr is shown with a red background
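The standard file objects behave like any other file object, which the following sketch tries to show. The messages are made up, and the capture step with contextlib.redirect_stderr is only there to make the behavior visible.

```python
import io
import sys
from contextlib import redirect_stderr

print("normal output")                           # goes to sys.stdout
print("something went wrong", file=sys.stderr)   # goes to sys.stderr

# stderr is just a file-like object, so it can be swapped out temporarily
captured = io.StringIO()
with redirect_stderr(captured):
    print("captured warning", file=sys.stderr)
message = captured.getvalue()    # 'captured warning\n'
```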
Reading Files
• Use the open() function to open a file for reading
- f = open('huck-finn.txt')
• Usually, add an 'r' as the second parameter to indicate read (the default)
• Can iterate through the file (think of the file as a collection of lines):
- f = open('huck-finn.txt', 'r')
  for line in f:
      if 'Huckleberry' in line:
          print(line.strip())
• Using line.strip() because each line read includes the newline, and print
writes a newline so we would have double-spaced text
• Closing the file: f.close()
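A self-contained sketch of the line-by-line loop above. Since huck-finn.txt may not be available, it first writes a small stand-in file; the file name and its three lines are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')   # hypothetical stand-in for huck-finn.txt
    f = open(path, 'w', encoding='utf-8')
    f.write("Huckleberry came and went.\nNothing here.\nTom saw Huckleberry.\n")
    f.close()

    hits = []
    raw_lines = []
    f = open(path, 'r', encoding='utf-8')
    for line in f:                 # the file object yields one line at a time
        raw_lines.append(line)     # each line still ends with '\n'
        if 'Huckleberry' in line:
            hits.append(line.strip())
    f.close()
```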
Encoding in Files
• all_lines = open('huck-finn.txt').readlines()
all_lines[0] # '\ufeff\n'
• \ufeff is the UTF Byte-Order-Mark (BOM)
• The BOM is optional for UTF-8, but if present, it must be consumed when reading (the 'utf-8-sig' encoding does this)
• a = open('huck-finn.txt', encoding='utf-8-sig').readlines()
a[0] # '\n'
• Often no need to specify UTF-8 (or ASCII, since it is a subset) when UTF-8 is the platform's default encoding
• Other possible encodings:
- cp1252, utf-16, iso-8859-1
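The effect of the BOM can be reproduced without the original file. This sketch writes a small file with the 'utf-8-sig' encoding (which prepends the BOM) and reads it back both ways; the file name and content are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bom.txt')   # hypothetical file name

    # Writing with 'utf-8-sig' puts the byte-order mark at the start of the file
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write('\nChapter 1\n')

    # Plain 'utf-8' leaves the BOM in the text as '\ufeff'
    with open(path, encoding='utf-8') as f:
        with_bom = f.readlines()

    # 'utf-8-sig' strips the BOM if it is there
    with open(path, encoding='utf-8-sig') as f:
        without_bom = f.readlines()

first_with = with_bom[0]         # '\ufeff\n'
first_without = without_bom[0]   # '\n'
```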
Other Methods for Reading Files
• read(): read the entire file
• read(<num>): read <num> characters (text mode) or bytes (binary mode)
- open('huck-finn.txt', encoding='utf-8-sig').read(100)
• readlines(): read the entire file as a list of lines
- lines = open('huck-finn.txt', encoding='utf-8-sig').readlines()
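The three reading methods can be compared side by side. This is a sketch on a small temporary file with made-up content.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'lines.txt')   # hypothetical file name
    with open(path, 'w', encoding='utf-8') as f:
        f.write('line one\nline two\nline three\n')

    with open(path, encoding='utf-8') as f:
        everything = f.read()     # the whole file as one string

    with open(path, encoding='utf-8') as f:
        head = f.read(8)          # 'line one' (first 8 characters)
        rest = f.read()           # continues from where the last read stopped

    with open(path, encoding='utf-8') as f:
        lines = f.readlines()     # list of lines, newlines included
```

Successive reads share one position in the file, so head + rest reproduces the full content.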
Parsing Files
• Dealing with different formats, determining more meaningful data from files
• txt: text file
• csv: comma-separated values
• json: JavaScript object notation
• Jupyter also has viewers for these formats
• Look to use libraries to help where possible
- import json
- import csv
- import pandas
• Python also has pickle, but not used much anymore
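A brief sketch of letting the standard json and csv libraries do the parsing. The data is made up and held in strings (via io.StringIO) so the example does not depend on any file.

```python
import csv
import io
import json

# JSON text becomes ordinary Python objects (dict, list, str, int, ...)
record = json.loads('{"title": "sample", "pages": 3, "tags": ["a", "b"]}')
pages = record['pages']          # 3, already an int
round_trip = json.loads(json.dumps(record))

# csv.DictReader uses the header row as keys; every value is read as a string
csv_text = "name,score\nAda,90\nGrace,95\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
names = [r['name'] for r in rows]            # ['Ada', 'Grace']
scores = [int(r['score']) for r in rows]     # [90, 95]
```

With a real file, the same readers accept an open file object in place of io.StringIO.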
Writing Files
• outf = open("mydata.txt", "w")
• If you open an existing file for writing, you wipe out the file’s contents. If the
named file does not exist, a new one is created.
• Methods for writing to a file:
- print(<expressions>, file=outf)
- outf.write(<string>)
- outf.writelines(<list of strings>)
• If you use write, no newlines are added automatically
- Also, remember we can change print's ending: print(…, end=", ")
• Make sure you close the file! Otherwise, content may be lost (buffering)
• outf.close()
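The three writing methods differ mainly in how newlines are handled, which this sketch tries to make visible. It writes to a temporary directory; the file name and content are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'mydata.txt')

    outf = open(path, 'w', encoding='utf-8')
    print('a', 1, file=outf)                # print adds a newline: 'a 1\n'
    outf.write('b 2\n')                     # write adds nothing; newline is explicit
    outf.writelines(['c 3\n', 'd 4\n'])     # writelines adds no newlines either
    print('e', end=', ', file=outf)         # change print's ending
    print('f', file=outf)
    outf.close()                            # flushes buffered content to disk

    f = open(path, encoding='utf-8')
    written = f.read()
    f.close()
```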
D. Koop, CSCI 503/490, Spring 2023 25
Context Manager
• The with statement is used with contexts
• A context manager's __enter__ method is called at the beginning
• …and its __exit__ method at the end, even if there is an exception!
• outf = open('huck-finn-lines.txt','w')
  for i, line in enumerate(huckleberry):
      outf.write(line)
      if i > 3:
          raise Exception("Failure")  # outf is never closed; buffered lines may be lost
• with open('huck-finn-lines.txt','w') as outf:
      for i, line in enumerate(huckleberry):
          outf.write(line)
          if i > 3:
              raise Exception("Failure")  # outf is still closed on the way out
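The with version can be checked directly: after the exception, the file is closed and the lines written so far are on disk. In this sketch, huckleberry is a hypothetical stand-in list of lines, and the output goes to a temporary directory.

```python
import os
import tempfile

# Hypothetical stand-in for the lines used on the slide
huckleberry = [f"line {i}\n" for i in range(10)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'huck-finn-lines.txt')
    failed = False
    try:
        with open(path, 'w') as outf:
            for i, line in enumerate(huckleberry):
                outf.write(line)
                if i > 3:
                    raise Exception("Failure")
    except Exception:
        failed = True                 # the exception still propagates out of the with block

    closed_after = outf.closed        # True: the file was closed on exit from the with block

    with open(path) as f:
        saved = f.readlines()         # lines 0 through 4 were written before the failure
```

The with statement does not suppress the exception here; it only guarantees that the file is closed before the exception continues upward.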