
Lecture 11

The document discusses strings, regular expressions, and files in Python. It covers topics like Unicode, ASCII, finding and manipulating substrings, formatting strings, using regular expressions to search and modify strings, and reading and writing files in Python.


Programming Principles in Python (CSCI 503/490)

Files

Dr. David Koop

(some slides adapted from Dr. Reva Freedman)

D. Koop, CSCI 503/490, Spring 2023


Unicode and ASCII


• Conceptual systems
• ASCII:
- old, English-centric, 7-bit system (only 128 characters)
• Unicode:
- Can represent over 1 million characters from all languages + emoji 🎉
- Characters have hexadecimal representation: é = U+00E9 and
name (LATIN SMALL LETTER E WITH ACUTE)
- Python allows you to type "é" or represent via code "\u00e9"
• Codes: ord → character to integer, chr → integer to character
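The ord/chr round trip above can be sketched as follows. This is a minimal illustration; the variable names are chosen here for the example, and the standard-library unicodedata module is used only to look up the character's name.

```python
import unicodedata

e_acute = "\u00e9"                 # same string as typing the accented e directly
code_point = ord(e_acute)          # character -> integer (0xE9)
round_trip = chr(code_point)       # integer -> character
name = unicodedata.name(e_acute)   # official Unicode name of the character

print(e_acute, hex(code_point), name)
```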



Strings
• Objects with methods
• Finding and counting substrings: count, find, startswith
• Removing leading & trailing substrings/whitespace: strip, removeprefix
• Transforming Text: replace, upper, lower, title
• Checking String Composition: isalnum, isnumeric, isupper
• Splitting & Joining:
- names = s.split(', ')
- ', '.join(names)
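A short sketch of the string methods listed above. The sample strings are made up for illustration, and removeprefix assumes Python 3.9 or later.

```python
s = "  Ada, Grace, Linus  "

stripped = s.strip()                      # remove leading/trailing whitespace
names = stripped.split(", ")              # split into a list of names
joined = ", ".join(names)                 # join the list back into one string

n_a = stripped.count("a")                 # count lowercase 'a' occurrences
pos = stripped.find("Grace")              # index of first match, -1 if absent
starts = stripped.startswith("Ada")

chapter = "chapter-1".removeprefix("chapter-")        # Python 3.9+
shouted = stripped.replace("Linus", "Guido").upper()  # transform text
is_alnum = "abc123".isalnum()                         # check composition
```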



Format and f-Strings


• s.format: templating function
- Replace fields indicated by curly braces with corresponding values
- "My name is {} {}".format(first_name, last_name)
- "My name is {first_name} {last_name}".format(
      first_name=name[0], last_name=name[1])
• Formatted string literals (f-strings) reference variables directly!
- f"My name is {first_name} {last_name}"
• Can include expressions, too:
- f"My name is {name[0].capitalize()} {name[1].capitalize()}"
• Format mini-language allows specialized displays (alignment, numeric
formatting)
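The variants above can be compared side by side. This is a small sketch; the name list is sample data invented for the example, and the last line shows a few format mini-language specifiers (alignment, width, precision, zero padding).

```python
name = ["ada", "lovelace"]          # sample data for illustration
first_name, last_name = name

positional = "My name is {} {}".format(first_name, last_name)
keyword = "My name is {first_name} {last_name}".format(
    first_name=name[0], last_name=name[1])
fstring = f"My name is {first_name} {last_name}"
with_expr = f"My name is {name[0].capitalize()} {name[1].capitalize()}"

# Mini-language: right-align in 10 columns with 2 decimals,
# left-align in 8 columns, zero-pad an integer to 5 digits.
table_row = f"{3.14159:>10.2f}|{'left':<8}|{42:05d}"
```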


Regular Expressions
• AKA regex
• A syntax to better specify how to decompose strings
• Look for patterns rather than specific characters
• Metacharacters: . ^ $ * + ? { } [ ] \ | ( )
- Repeat, one-of-these, optional
• Character Classes: \d (digit), \s (space), \w (word character), also \D, \S, \W
• Digits with slashes between them: \d+/\d+/\d+
• Usually use raw strings (no backslash plague): r'\d+/\d+/\d+'


Regular Expression Methods


Method/Attribute   Purpose
match()            Determine if the RE matches at the beginning of the string.
search()           Scan through a string, looking for any location where this RE matches.
findall()          Find all substrings where the RE matches, and returns them as a list.
finditer()         Find all substrings where the RE matches, and returns them as an iterator.
split()            Split the string into a list, splitting it wherever the RE matches
sub()              Find all substrings where the RE matches, and replace them with a different string
subn()             Does the same thing as sub(), but returns the new string & number of replacements

[Deitel & Deitel]


Regular Expression Examples
• s0 = "No full dates here, just 02/15"
s1 = "02/14/2021 is a date"
s2 = "Another date is 12/25/2020"
s3 = "April Fools' Day is 4/1/2021 & May the Fourth is 5/4/2021"
• re.match(r'\d+/\d+/\d+',s1) # returns match object
• re.match(r'\d+/\d+/\d+',s2) # None!
• re.search(r'\d+/\d+/\d+',s2) # returns 1 match object
• re.search(r'\d+/\d+/\d+',s3) # returns 1! match object
• re.findall(r'\d+/\d+/\d+',s3) # returns list of strings
• re.finditer(r'\d+/\d+/\d+',s3) # returns iterable of matches
• re.sub(r'(\d+)/(\d+)/(\d+)',r'\3-\1-\2',s3)
# captures month, day, year, and reformats
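The calls above can be run together as one script. This sketch reuses the strings s0 to s3 from the slide; the result variable names are introduced here only for illustration.

```python
import re

date_re = r'\d+/\d+/\d+'   # raw string: digits, slash, digits, slash, digits

s0 = "No full dates here, just 02/15"
s1 = "02/14/2021 is a date"
s2 = "Another date is 12/25/2020"
s3 = "April Fools' Day is 4/1/2021 & May the Fourth is 5/4/2021"

no_match = re.search(date_re, s0)        # None: only one slash in 02/15
at_start = re.match(date_re, s1)         # match object (date is at the start)
not_at_start = re.match(date_re, s2)     # None: match() only checks the start
anywhere = re.search(date_re, s2)        # match object for 12/25/2020
first_only = re.search(date_re, s3)      # only the first date in s3
all_dates = re.findall(date_re, s3)      # list of matched strings
spans = [m.span() for m in re.finditer(date_re, s3)]  # match positions
reordered = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', s3)
```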







Grouping
• Parentheses capture a group that can be accessed or used later
• Access via groups() or group(n) where n is the number of the group, but
numbering starts at 1
• Note: group(0) is the full matched string
• for match in re.finditer(r'(\d+)/(\d+)/(\d+)',s3):
      print(match.groups())
• for match in re.finditer(r'(\d+)/(\d+)/(\d+)',s3):
      print('{2}-{0:02d}-{1:02d}'.format(
          *[int(x) for x in match.groups()]))
• * operator expands a list into individual elements
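A runnable version of the grouping loops above, collecting results in lists instead of printing so they are easy to inspect. The string s3 is the one from the earlier examples; the other names are introduced here.

```python
import re

s3 = "April Fools' Day is 4/1/2021 & May the Fourth is 5/4/2021"
pattern = r'(\d+)/(\d+)/(\d+)'      # three capture groups: month, day, year

first = re.search(pattern, s3)
whole = first.group(0)              # group(0) is the full matched string
month = first.group(1)              # numbered groups start at 1
parts = first.groups()              # tuple of all captured groups

# The * operator unpacks the list of ints into separate format arguments.
iso_dates = [
    '{2}-{0:02d}-{1:02d}'.format(*[int(x) for x in m.groups()])
    for m in re.finditer(pattern, s3)
]
```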





Modifying Strings

Method/Attribute   Purpose
split()            Split the string into a list, splitting it wherever the RE matches
sub()              Find all substrings where the RE matches, and replace them with a different string
subn()             Does the same thing as sub(), but returns the new string and the number of replacements

Substitution
• Do substitution in the middle of a string:
• re.sub(r'(\d+)/(\d+)/(\d+)',r'\3-\1-\2',s3)
• All matches are substituted
• First argument is the regular expression to match
• Second argument is the substitution
- \1, \2, … match up to the captured groups in the first argument
• Third argument is the string to perform substitution on
• Can also use a function:
• to_date = lambda m: (
      f'{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}')
  re.sub(r'(\d+)/(\d+)/(\d+)', to_date, s3)
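The function-based substitution can also be written with a named function, which is easier to read than a multi-line lambda. This sketch also exercises subn() and split() from the earlier table; the helper name to_iso is made up for the example.

```python
import re

s3 = "April Fools' Day is 4/1/2021 & May the Fourth is 5/4/2021"
pattern = r'(\d+)/(\d+)/(\d+)'

def to_iso(m):
    """Turn a month/day/year match into a zero-padded year-month-day string."""
    month, day, year = (int(g) for g in m.groups())
    return f'{year}-{month:02d}-{day:02d}'

replaced = re.sub(pattern, to_iso, s3)          # every match is substituted
replaced_n, count = re.subn(pattern, to_iso, s3)  # same, plus number of changes
pieces = re.split(r'\s*&\s*', s3)               # split wherever the RE matches
```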



Assignment 4
• Assignment will cover strings and files
• Reading & writing data to files
• Dealing with characters and encodings


Files



Files
• A file is a sequence of data stored on disk.
• Python uses the standard Unix newline character (\n) to mark line breaks.
- On Windows, end of line is marked by \r\n, i.e., carriage return + newline.
- On old Macs, it was carriage return \r only.
- Python converts these to \n when reading.


Opening a File
• Opening associates a file on disk with an object in memory (file object or file
handle).
• We access the file via the file object.
• <filevar> = open(<name>, <mode>)
• Mode 'r' = read or 'w' = write, 'a' = append
• read is default
• Also add 'b' to indicate the file should be opened in binary mode: 'rb','wb'
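A minimal sketch of the modes listed above. The scratch file demo.txt in a temporary directory is a made-up path used only so the example does not touch real files.

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(path, "w", encoding="utf-8")   # 'w': create or truncate, then write
f.write("first\n")
f.close()

f = open(path, "a", encoding="utf-8")   # 'a': append to the existing content
f.write("second\n")
f.close()

f = open(path, encoding="utf-8")        # mode omitted: 'r' (read) is the default
text = f.read()                         # text mode returns str
f.close()

f = open(path, "rb")                    # 'rb': binary read returns bytes
raw = f.read()
f.close()
```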

Standard File Objects
• When Python begins, it associates three standard file objects:
- sys.stdin: for input
- sys.stdout: for output
- sys.stderr: for errors
• In the notebook
- sys.stdin isn't really used; the built-in input() can be used if necessary
- sys.stdout is the output shown after the code
- sys.stderr is shown with a red background
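The standard streams behave like any other writable file object, so print can be pointed at either one. In this sketch, report is a hypothetical helper defined only for the example.

```python
import sys

def report(out, err):
    """Write a normal message to `out` and a warning to `err`."""
    print("result: 42", file=out)
    print("warning: something looks off", file=err)

# In a notebook, the first line appears as regular output and the
# second with the error styling.
report(sys.stdout, sys.stderr)
```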


Files and Jupyter


• You can double-click a file to see its contents (and edit it manually)
• To see one as text, may need to right-click
• Shell commands also help show files in the notebook
• The ! character indicates a shell command is being called
• These will work for Linux and macOS but not necessarily for Windows
• !cat <fname>: print the entire contents of <fname>
• !head -n <num> <fname>: print the first <num> lines of <fname>
• !tail -n <num> <fname>: print the last <num> lines of <fname>


Reading Files
• Use the open() function to open a file for reading
- f = open('huck-finn.txt')
• Can add 'r' as the second parameter to make read mode explicit (it is the default)
• Can iterate through the file (think of the file as a collection of lines):
- f = open('huck-finn.txt', 'r')
  for line in f:
      if 'Huckleberry' in line:
          print(line.strip())
• Using line.strip() because the read includes the newline, and print
writes a newline so we would have double-spaced text
• Closing the file: f.close()
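A self-contained version of the loop above. Since huck-finn.txt may not be available, this sketch first writes a small stand-in file (sample.txt, with invented contents) and then reads it line by line.

```python
import os
import tempfile

# Stand-in for huck-finn.txt: a small sample file with made-up contents.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
out = open(path, "w", encoding="utf-8")
out.write("Huckleberry Finn\nTom Sawyer\nHuckleberry again\n")
out.close()

matches = []
f = open(path, "r", encoding="utf-8")
for line in f:                        # each line still ends with '\n'
    if 'Huckleberry' in line:
        matches.append(line.strip())  # strip() drops the trailing newline
f.close()                             # release the file handle

print(matches)
```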


Remember Encodings (Unicode, ASCII)?


• Encoding: How things are actually stored
• ASCII "Extensions": how to represent characters for different languages
- No universal extension for 256 characters (one byte), so…
- ISO-8859-1, ISO-8859-2, CP-1252, etc.
• Unicode encoding:
- UTF-8: used in Python and elsewhere (variable width: 1 to 4 bytes per character)
- Also UTF-16 (2 or 4 bytes) and UTF-32 (4 bytes for everything)
- Byte Order Mark (BOM) for files to indicate endianness (which byte comes first)
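The byte counts above can be checked with str.encode. This sketch uses the "-le" codec variants so that no BOM is added to the counts; the sample characters are chosen here for illustration.

```python
import codecs

samples = ["A", "\u00e9", "\U0001F389"]   # ASCII letter, e with acute, party popper

# Bytes needed per character in each Unicode encoding (no BOM).
sizes = {
    ch: (len(ch.encode("utf-8")),
         len(ch.encode("utf-16-le")),
         len(ch.encode("utf-32-le")))
    for ch in samples
}

# Plain "utf-16" prepends a BOM so readers can detect the byte order.
bom = "A".encode("utf-16")[:2]
has_bom = bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# One-byte ASCII "extensions" disagree about the upper 128 values:
# the same byte decodes to different characters.
latin1_char = b"\xa3".decode("iso-8859-1")   # pound sign
latin2_char = b"\xa3".decode("iso-8859-2")   # L with stroke
```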


Encoding in Files
• all_lines = open('huck-finn.txt').readlines()
all_lines[0] # '\ufeff\n'
• \ufeff is the UTF Byte-Order-Mark (BOM)
• Optional for UTF-8, but if added, need to read it
• a = open('huck-finn.txt', encoding='utf-8-sig').readlines()
a[0] # '\n'
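• No need to specify UTF-8 (or ascii since it is a subset) when it is the platform default; on other platforms (e.g., some Windows setups), pass encoding='utf-8' explicitly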
• Other possible encodings:
- cp1252, utf-16, iso-8859-1
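The effect of the BOM can be reproduced without the original text file. This sketch writes a small file with the utf-8-sig codec (which adds the BOM) and then reads it back with both codecs; the file name and contents are made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "with-bom.txt")  # hypothetical file

# utf-8-sig writes the BOM at the start of the file.
with open(path, "w", encoding="utf-8-sig") as out:
    out.write("\nChapter 1\n")

# Plain utf-8 keeps the BOM as a character in the first line.
with open(path, encoding="utf-8") as f:
    with_bom = f.readlines()

# utf-8-sig strips the BOM while decoding.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.readlines()
```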





Other Methods for Reading Files
• read(): read the entire file
• read(<num>): read <num> characters (bytes in binary mode)
- open('huck-finn.txt', encoding='utf-8-sig').read(100)
• readlines(): read the entire file as a list of lines
- lines = open('huck-finn.txt', encoding='utf-8-sig').readlines()
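A small comparison of the three calls, using an invented three-line file so the results are easy to see. Note that read(<num>) advances the file position, so a following read() returns only the remainder.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")   # hypothetical file
with open(path, "w", encoding="utf-8") as out:
    out.write("line one\nline two\nline three\n")

with open(path, encoding="utf-8") as f:
    everything = f.read()          # whole file as one string

with open(path, encoding="utf-8") as f:
    head = f.read(8)               # first 8 characters
    rest = f.read()                # continues from the current position

with open(path, encoding="utf-8") as f:
    lines = f.readlines()          # list of lines, newlines included
```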


Reading a Text File


• Try to read a file at most once
• f = open('huck-finn.txt', 'r')
  for i, line in enumerate(f):
      if 'Huckleberry' in line:
          print(line.strip())
  for i, line in enumerate(f):
      if "George" in line:
          print(line.strip())
• Can't iterate twice!
• Best: do both checks when reading the file once
• Otherwise: either reopen the file or seek to beginning (f.seek(0))
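The behavior described above can be observed directly. This sketch uses a small made-up file: a second pass over the same file object yields nothing, f.seek(0) rewinds it, and a single pass can perform both checks.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "names.txt")   # hypothetical file
with open(path, "w", encoding="utf-8") as out:
    out.write("Huckleberry\nGeorge\nHuckleberry and George\n")

f = open(path, "r", encoding="utf-8")
first_pass = [line.strip() for line in f]
second_pass = [line.strip() for line in f]   # file is exhausted: empty list
f.seek(0)                                    # rewind to the beginning
after_seek = [line.strip() for line in f]
f.seek(0)

# Best: one pass, both checks.
huck, george = [], []
for line in f:
    if 'Huckleberry' in line:
        huck.append(line.strip())
    if 'George' in line:
        george.append(line.strip())
f.close()
```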


Parsing Files
• Dealing with different formats, determining more meaningful data from files
• txt: text file
• csv: comma-separated values
• json: JavaScript object notation
• Jupyter also has viewers for these formats
• Use libraries to help where possible
- import json
- import csv
- import pandas
• Python also has pickle, but not used much anymore


Comma-separated values (CSV) Format


• Comma is a field separator, newlines denote records
- a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
• May have a header (a,b,c,d,message), but not required
• No type information: we do not know what the columns are (numbers,
strings, floating point, etc.)
- Default: just keep everything as a string
- Type inference: Figure out the type to make each column based on values
• What about commas in a value? → double quotes
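A brief sketch of the last two points: a quoted value containing a comma, and a simple form of type inference. The data is invented, and infer is a hypothetical helper written for this example (real libraries such as pandas use more elaborate rules).

```python
import csv
import io

raw = 'a,b,message\n1,2,"hello, world"\n3,4.5,foo\n'   # made-up CSV text

# The csv module honors double quotes, so the comma inside the
# quoted value does not split the field.
rows = list(csv.reader(io.StringIO(raw)))

def infer(value):
    """Try int, then float; otherwise keep the string unchanged."""
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            pass
    return value

header, data = rows[0], rows[1:]
typed = [[infer(v) for v in row] for row in data]
```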



Python csv module


• The csv module helps with reading csv files
- import csv
  with open('persons_of_concern.csv', 'r') as f:
      for i in range(3): # skip first three lines
          next(f)
      reader = csv.reader(f)
      records = [r for r in reader] # r is a list
• or
- import csv
  with open('persons_of_concern.csv', 'r') as f:
      for i in range(3): # skip first three lines
          next(f)
      reader = csv.DictReader(f)
      records = [r for r in reader] # r is a dict
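The same pattern can be tried without the data file. In this sketch an in-memory io.StringIO stands in for the opened file, with three invented preamble lines before the header; the column names and values are assumptions made for illustration.

```python
import csv
import io

# Stand-in for the real file: three preamble lines, then header and records.
raw = "Title line\nSource: example\n\nname,count\nAda,3\nGrace,5\n"

f = io.StringIO(raw)
for _ in range(3):            # skip first three lines
    next(f)
list_records = list(csv.reader(f))        # each record is a list

f = io.StringIO(raw)
for _ in range(3):
    next(f)
dict_records = list(csv.DictReader(f))    # header row supplies the keys
```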








Writing Files
• outf = open("mydata.txt", "w")
• If you open an existing file for writing, you wipe out the file’s contents. If the
named file does not exist, a new one is created.
• Methods for writing to a file:
- print(<expressions>, file= outf)
- outf.write(<string>)
- outf.writelines(<list of strings>)
• If you use write, no newlines are added automatically
- Also, remember we can change print's ending: print(…, end=", ")
• Make sure you close the file! Otherwise, content may be lost (buffering)
• outf.close()
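The three writing methods can be compared in one short script. This sketch writes to a scratch file in a temporary directory (a made-up path) and reads the result back to show where newlines come from.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mydata.txt")   # hypothetical location

outf = open(path, "w", encoding="utf-8")
print("alpha", 1, file=outf)               # print adds spaces and a newline
outf.write("beta\n")                       # write adds nothing: supply '\n'
outf.writelines(["gamma\n", "delta\n"])    # writelines adds no newlines either
print("x", end=", ", file=outf)            # change print's ending
print("y", file=outf)
outf.close()                               # flush buffered content to disk

with open(path, encoding="utf-8") as f:
    written = f.read()
```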

With Statement: Improved File Handling


• With statement does "enter" and "exit" handling:
• In the previous example, we need to remember to call outf.close()
• Using a with statement, this is done automatically:
- with open('huck-finn.txt', 'r') as f:
      for line in f:
          if 'Huckleberry' in line:
              print(line.strip())
• This is important for writing files!
- with open('output.txt', 'w') as f:
      for k, v in counts.items():
          f.write(f'{k}: {v}\n') # f-string also works when v is a number
• Without with, we need f.close()
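A self-contained version of the writing example. The slide does not show how counts is built, so this sketch assumes it maps words to integer counts and creates it with collections.Counter from an invented sentence.

```python
import os
import tempfile
from collections import Counter

# Assumption: counts maps words to integer counts.
counts = Counter("the cat and the hat".split())

path = os.path.join(tempfile.mkdtemp(), "output.txt")   # hypothetical location
with open(path, "w", encoding="utf-8") as f:
    for k, v in counts.items():
        f.write(f"{k}: {v}\n")      # f-string converts the int for us

# Leaving the with block closed the file automatically.
closed_after_with = f.closed

with open(path, encoding="utf-8") as f:
    written = f.read()
```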




Context Manager
• The with statement is used with contexts
• A context manager's enter method is called at the beginning
• …and exit method at the end, even if there is an exception!
• outf = open('huck-finn-lines.txt','w')
  for i, line in enumerate(huckleberry):
      outf.write(line)
      if i > 3:
          raise Exception("Failure")
• with open('huck-finn-lines.txt','w') as outf:
      for i, line in enumerate(huckleberry):
          outf.write(line)
          if i > 3:
              raise Exception("Failure")
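The guarantee above can be checked by catching the exception and inspecting the file afterwards. In this sketch, huckleberry is a stand-in list of invented lines (the slide does not define it), and Tracker is a hypothetical class that shows the enter and exit calls explicitly.

```python
import os
import tempfile

huckleberry = [f"line {i}\n" for i in range(10)]   # stand-in for the real lines
path = os.path.join(tempfile.mkdtemp(), "huck-finn-lines.txt")

try:
    with open(path, "w", encoding="utf-8") as outf:
        for i, line in enumerate(huckleberry):
            outf.write(line)
            if i > 3:
                raise Exception("Failure")
except Exception as exc:
    error = str(exc)

closed_after_error = outf.closed       # exit handling closed the file
with open(path, encoding="utf-8") as f:
    written = f.readlines()            # lines written before the error

class Tracker:
    """Minimal context manager that records when it is entered and exited."""
    def __init__(self):
        self.events = []
    def __enter__(self):
        self.events.append("enter")
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        return False                   # do not suppress the exception

tracker = Tracker()
try:
    with tracker:
        raise ValueError("boom")
except ValueError:
    pass
```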








JavaScript Object Notation (JSON)


• A format for web data
• Looks very similar to Python dictionaries and lists
• Example:
- {"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
{"name": "Katie", "age": 33, "pet": "Cisco"}] }
• Only contains literals (no variables) but allows null
• Values: strings, arrays, dictionaries, numbers, booleans, or null
- Dictionary keys must be strings
- Quotation marks help differentiate string or numeric values
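Parsing the example with json.loads shows how the JSON values map to Python objects. This is a small sketch using the standard json module; the second string is an extra made-up example for the point about quotation marks.

```python
import json

text = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
              {"name": "Katie", "age": 33, "pet": "Cisco"}] }
"""

data = json.loads(text)
pet = data["pet"]                      # JSON null becomes Python None
places = data["places_lived"]          # JSON array becomes a list
katie_age = data["siblings"][1]["age"] # JSON number becomes an int

# Quotation marks decide whether a value is a string or a number.
mixed = json.loads('{"quoted": "1", "bare": 1}')
```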





Reading JSON data


• Python has a built-in json module
- import json
  with open('example.json') as f:
      data = json.load(f)
- with open('example-out.json', 'w') as f:
      json.dump(data, f)
• Can also load/dump to strings:
- json.loads, json.dumps
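A round trip through both pairs of functions, as a minimal sketch. The data and the file location (a temporary directory) are invented so the example runs without example.json.

```python
import json
import os
import tempfile

data = {"name": "Wes", "pet": None, "ok": True, "ages": [25, 33]}  # sample data

# File-based: dump writes to a file object, load reads from one.
path = os.path.join(tempfile.mkdtemp(), "example-out.json")  # hypothetical path
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f)
with open(path, encoding="utf-8") as f:
    from_file = json.load(f)

# String-based: dumps returns a str, loads parses one.
as_text = json.dumps({"pet": None, "ok": True})
from_text = json.loads(as_text)
```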
