Lecture 11

The document discusses strings, regular expressions, and files in Python. It covers topics like Unicode, ASCII, finding and manipulating substrings, formatting strings, using regular expressions to search and modify strings, and reading and writing files in Python.

Programming Principles in Python (CSCI 503/490)

Files

Dr. David Koop

(some slides adapted from Dr. Reva Freedman)

D. Koop, CSCI 503/490, Spring 2023

Unicode and ASCII

• Conceptual systems
• ASCII:
- old, English-centric, 7-bit system (only 128 characters)
• Unicode:
- Can represent over 1 million characters from all languages + emoji 🎉
- Characters have hexadecimal representation: é = U+00E9 and
name (LATIN SMALL LETTER E WITH ACUTE)
- Python allows you to type "é" or represent via code "\u00e9"
• Codes: ord → character to integer, chr → integer to character
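
A minimal sketch of the conversions above. The use of the standard-library unicodedata module to look up the character name is an addition for illustration, not something the slide shows.

```python
import unicodedata

e_acute = "\u00e9"                 # same character as typing "é" directly
code_point = ord(e_acute)          # character -> integer (0xE9 == 233)
round_trip = chr(code_point)       # integer -> character
name = unicodedata.name(e_acute)   # official Unicode name of the character

print(e_acute, hex(code_point), round_trip, name)
```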


Strings
• Objects with methods
• Finding and counting substrings: count, find, startswith
• Removing leading & trailing substrings/whitespace: strip, removeprefix
• Transforming Text: replace, upper, lower, title
• Checking String Composition: isalnum, isnumeric, isupper
• Splitting & Joining:
- names = s.split(', ')  # s is any string
- ', '.join(names)
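
A short sketch of the string methods listed above. The sample text and file name are hypothetical, and removeprefix assumes Python 3.9 or later.

```python
s = "  Koop, Freedman, Deitel  "   # hypothetical sample text

trimmed = s.strip()                # remove leading/trailing whitespace
names = trimmed.split(", ")        # split into a list of names
rejoined = ", ".join(names)        # join the list back into one string
n_commas = trimmed.count(",")      # count occurrences of a substring
pos = trimmed.find("Freedman")     # index of first occurrence (-1 if absent)
starts = trimmed.startswith("Koop")
no_prefix = "lecture11.pdf".removeprefix("lecture")  # Python 3.9+
shout = trimmed.upper()
is_num = "503".isnumeric()
```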


Format and f-Strings

• s.format: templating function
- Replace fields indicated by curly braces with corresponding values
- "My name is {} {}".format(first_name, last_name)
- "My name is {first_name} {last_name}".format(
      first_name=name[0], last_name=name[1])
• Formatted string literals (f-strings) reference variables directly!
- f"My name is {first_name} {last_name}"
• Can include expressions, too:
- f"My name is {name[0].capitalize()} {name[1].capitalize()}"
• Format mini-language allows specialized displays (alignment, numeric
formatting)
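
A few examples of the format mini-language, as a sketch. The variable names and values are made up for illustration.

```python
name = "Ada"          # hypothetical values
value = 3.14159
count = 1234567

left = f"{name:<8}|"       # left-align in a field of width 8
right = f"{name:>8}|"      # right-align
center = f"{name:^8}|"     # center
fixed = f"{value:.2f}"     # two digits after the decimal point
padded = f"{7:03d}"        # zero-pad an integer to width 3
grouped = f"{count:,}"     # thousands separator
percent = f"{0.256:.1%}"   # percentage with one decimal

print(left, right, center, fixed, padded, grouped, percent)
```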


Regular Expressions
• AKA regex
• A syntax to better specify how to decompose strings
• Look for patterns rather than specific characters
• Metacharacters: . ^ $ * + ? { } [ ] \ | ( )
- Repeat, one-of-these, optional
• Character Classes: \d (digit), \s (space), \w (word character), also \D, \S, \W
• Digits with slashes between them: \d+/\d+/\d+
• Usually use raw strings (no backslash plague): r'\d+/\d+/\d+'
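
A small sketch of why raw strings help: without the r prefix, every backslash in the pattern has to be doubled. The sample sentence is hypothetical.

```python
import re

raw = r"\d+/\d+/\d+"          # raw string: backslashes are kept as typed
escaped = "\\d+/\\d+/\\d+"    # regular string: each backslash must be doubled
same = raw == escaped         # both spell the same pattern

newline_len = len("\n")       # one character (a newline)
raw_len = len(r"\n")          # two characters (backslash and n)

found = re.search(raw, "Due 02/14/2021").group(0)
```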


Regular Expression Methods

Method/Attribute   Purpose
match()            Determine if the RE matches at the beginning of the string.
search()           Scan through a string, looking for any location where this RE matches.
findall()          Find all substrings where the RE matches, and returns them as a list.
finditer()         Find all substrings where the RE matches, and returns them as an iterator.
split()            Split the string into a list, splitting it wherever the RE matches
sub()              Find all substrings where the RE matches, and replace them with a different string
subn()             Does the same thing as sub(), but returns the new string & number of replacements

[Deitel & Deitel]
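
A brief sketch of split() and subn() from the table above. The sample text and the separator pattern are assumptions chosen for illustration.

```python
import re

text = "apples, pears;bananas ,  kiwis"   # hypothetical sample text
separator = r"\s*[,;]\s*"                 # comma or semicolon, optional spaces around it

# split(): break the string wherever the pattern matches
parts = re.split(separator, text)

# subn(): like sub(), but also returns how many replacements were made
new_text, n = re.subn(separator, " | ", text)
```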

Regular Expression Examples
• import re
  s0 = "No full dates here, just 02/15"
  s1 = "02/14/2021 is a date"
  s2 = "Another date is 12/25/2020"
  s3 = "April Fools' Day is 4/1/2021 & May the Fourth is 5/4/2021"
• re.match(r'\d+/\d+/\d+',s1) # returns match object
• re.match(r'\d+/\d+/\d+',s2) # None!
• re.search(r'\d+/\d+/\d+',s2) # returns 1 match object
• re.search(r'\d+/\d+/\d+',s3) # returns only the first match object
• re.findall(r'\d+/\d+/\d+',s3) # returns list of strings
• re.finditer(r'\d+/\d+/\d+',s3) # returns iterable of matches
• re.sub(r'(\d+)/(\d+)/(\d+)',r'\3-\1-\2',s3)
# captures month, day, year, and reformats


Grouping
• Parentheses capture a group that can be accessed or used later
• Access via groups() or group(n) where n is the number of the group, but
numbering starts at 1
• Note: group(0) is the full matched string
• for match in re.finditer(r'(\d+)/(\d+)/(\d+)',s3):
      print(match.groups())
• for match in re.finditer(r'(\d+)/(\d+)/(\d+)',s3):
      print('{2}-{0:02d}-{1:02d}'.format(
          *[int(x) for x in match.groups()]))
• * operator expands a list into individual elements
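
A minimal sketch of group access on a single match object, reusing s2 from the examples slide. The span() call is an extra shown for illustration.

```python
import re

s2 = "Another date is 12/25/2020"
m = re.search(r"(\d+)/(\d+)/(\d+)", s2)

whole = m.group(0)      # the full matched string
month = m.group(1)      # first captured group
parts = m.groups()      # all captured groups as a tuple
span = m.span()         # (start, end) indices of the match in s2

# The * operator unpacks the tuple into separate arguments for format()
iso = "{2}-{0}-{1}".format(*parts)
```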


Modifying Strings

Method/Attribute   Purpose
split()            Split the string into a list, splitting it wherever the RE matches
sub()              Find all substrings where the RE matches, and replace them with a different string
subn()             Does the same thing as sub(), but returns the new string and the number of replacements

Substitution
• Do substitution in the middle of a string:
• re.sub(r'(\d+)/(\d+)/(\d+)',r'\3-\1-\2',s3)
• All matches are substituted
• First argument is the regular expression to match
• Second argument is the substitution
- \1, \2, … match up to the captured groups in the first argument
• Third argument is the string to perform substitution on
• Can also use a function:
• to_date = lambda m: (
      f'{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}')
  re.sub(r'(\d+)/(\d+)/(\d+)', to_date, s3)

Assignment 4
• Assignment will cover strings and files
• Reading & writing data to files
• Dealing with characters and encodings


Files
• A file is a sequence of data stored on disk.
• Python uses the standard Unix newline character (\n) to mark line breaks.
- On Windows, end of line is marked by \r\n, i.e., carriage return + newline.
- On old Macs, it was carriage return \r only.
- Python converts these to \n when reading.
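
A sketch of the newline conversion described above, assuming a scratch file in a temporary directory (the path and contents are hypothetical). The file is written as raw bytes with Windows-style line endings, then read back in text mode and in binary mode.

```python
import os
import tempfile

# Hypothetical scratch file containing Windows-style line endings.
path = os.path.join(tempfile.mkdtemp(), "windows.txt")
with open(path, "wb") as f:
    f.write(b"first\r\nsecond\r\n")

# Text mode (the default) translates \r\n to \n while reading.
with open(path, "r") as f:
    text_lines = f.readlines()

# Binary mode returns the bytes exactly as stored on disk.
with open(path, "rb") as f:
    raw = f.read()
```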


Opening a File
• Opening associates a file on disk with an object in memory (file object or file
handle).
• We access the file via the file object.
• <filevar> = open(<name>, <mode>)
• Mode 'r' = read, 'w' = write, 'a' = append
• Read is the default
• Also add 'b' to indicate the file should be opened in binary mode: 'rb', 'wb'
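
The modes above can be tried out as follows. This is only a sketch; the file name and its location in a temporary directory are assumptions.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical scratch file

f = open(path, "w")      # write: creates the file (or truncates an existing one)
f.write("one\n")
f.close()

f = open(path, "a")      # append: keeps existing content, adds to the end
f.write("two\n")
f.close()

f = open(path)           # no mode given, so read ('r') is used
content = f.read()
f.close()

f = open(path, "rb")     # binary read: returns bytes instead of str
raw = f.read()
f.close()
```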

Standard File Objects
• When Python begins, it associates three standard file objects:
- sys.stdin: for input
- sys.stdout: for output
- sys.stderr: for errors
• In the notebook
- sys.stdin isn't really used; the built-in input() can be used if necessary
- sys.stdout is the output shown after the code
- sys.stderr is shown with a red background
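
A sketch showing that the standard streams behave like ordinary file objects. Capturing them with io.StringIO and contextlib is an assumption added here so the result can be inspected; in a notebook you would normally just call print.

```python
import contextlib
import io
import sys

out, err = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    print("normal output")                       # goes to sys.stdout
    print("a warning", file=sys.stderr)          # goes to sys.stderr
    sys.stdout.write("write() adds no newline\n")

captured_out = out.getvalue()
captured_err = err.getvalue()
```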


Files and Jupyter

• You can double-click a file to see its contents (and edit it manually)
• To see one as text, may need to right-click
• Shell commands also help show files in the notebook
• The ! character indicates a shell command is being called
• These will work for Linux and macOS but not necessarily for Windows
• !cat <fname>: print the entire contents of <fname>
• !head -n <num> <fname>: print the first <num> lines of <fname>
• !tail -n <num> <fname>: print the last <num> lines of <fname>


Reading Files
• Use the open() function to open a file for reading
- f = open('huck-finn.txt')
• Can pass 'r' as the second parameter to indicate read explicitly (it is the default)
• Can iterate through the file (think of the file as a collection of lines):
- f = open('huck-finn.txt', 'r')
  for line in f:
      if 'Huckleberry' in line:
          print(line.strip())
• Using line.strip() because the read includes the newline, and print
writes a newline so we would have double-spaced text
• Closing the file: f.close()


Remember Encodings (Unicode, ASCII)?

• Encoding: How things are actually stored
• ASCII "Extensions": how to represent characters for different languages
- No universal extension for 256 characters (one byte), so…
- ISO-8859-1, ISO-8859-2, CP-1252, etc.
• Unicode encoding:
- UTF-8: used in Python and elsewhere (variable width: 1 to 4 bytes per character)
- Also UTF-16 (2 or 4 bytes) and UTF-32 (4 bytes for everything)
- Byte Order Mark (BOM) for files to indicate endianness (which byte first)
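
A minimal sketch comparing how many bytes the same character takes in different encodings. Note that Python's "utf-16" and "utf-32" codecs prepend a BOM when encoding, which is why those results are longer than the character alone.

```python
text = "é"

utf8 = text.encode("utf-8")           # 2 bytes in UTF-8
latin1 = text.encode("iso-8859-1")    # 1 byte in this ASCII extension
utf16 = text.encode("utf-16")         # 2-byte BOM + 2 bytes
utf32 = text.encode("utf-32")         # 4-byte BOM + 4 bytes
emoji_len = len("🎉".encode("utf-8"))  # emoji need 4 bytes in UTF-8

try:
    text.encode("ascii")              # é is outside the 128 ASCII characters
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```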


Encoding in Files
• all_lines = open('huck-finn.txt').readlines()
all_lines[0] # '\ufeff\n'
• \ufeff is the UTF Byte-Order-Mark (BOM)
• Optional for UTF-8, but if added, need to read it
• a = open('huck-finn.txt', encoding='utf-8-sig').readlines()
a[0] # '\n'
• Usually no need to specify UTF-8 where it is the platform default (ASCII works too since it is a subset); otherwise pass encoding='utf-8'
• Other possible encodings:
- cp1252, utf-16, iso-8859-1
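
A sketch that reproduces the BOM behavior on a small scratch file instead of huck-finn.txt (the path and contents are hypothetical). Writing with utf-8-sig puts a BOM at the start; reading with plain utf-8 keeps it as \ufeff, while utf-8-sig strips it.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "bom.txt")  # hypothetical scratch file

with open(path, "w", encoding="utf-8-sig") as f:    # writes a BOM first
    f.write("hello\n")

with open(path, encoding="utf-8") as f:
    with_bom = f.read()        # BOM shows up as '\ufeff'

with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()     # BOM is removed while decoding
```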

Other Methods for Reading Files
• read(): read the entire file
• read(<num>): read <num> characters (bytes if the file is opened in binary mode)
- open('huck-finn.txt', encoding='utf-8-sig').read(100)
• readlines(): read the entire file as a list of lines
- lines = open('huck-finn.txt', encoding='utf-8-sig').readlines()


Reading a Text File

• Try to read a file at most once
• f = open('huck-finn.txt', 'r')
  for i, line in enumerate(f):
      if 'Huckleberry' in line:
          print(line.strip())
  for i, line in enumerate(f):
      if "George" in line:
          print(line.strip())
• Can't iterate twice! (the second loop prints nothing)
• Best: do both checks when reading the file once
• Otherwise: either reopen the file or seek to beginning (f.seek(0))
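
A sketch of the options above on a small scratch file (the path and its three lines are hypothetical stand-ins for huck-finn.txt): a second pass finds nothing, seek(0) rewinds, and a single pass handles both checks.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical scratch file
with open(path, "w") as out:
    out.write("Huckleberry Finn\nGeorge Jackson\nTom Sawyer\n")

f = open(path)
first_pass = [line.strip() for line in f if "Huckleberry" in line]
second_pass = [line.strip() for line in f]   # empty: the file is already at its end
f.seek(0)                                    # rewind to the beginning
after_seek = [line.strip() for line in f if "George" in line]
f.close()

# Preferred: do both checks in a single pass over the file.
huck, george = [], []
with open(path) as f:
    for line in f:
        if "Huckleberry" in line:
            huck.append(line.strip())
        if "George" in line:
            george.append(line.strip())
```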


Parsing Files
• Dealing with different formats, determining more meaningful data from files
• txt: text file
• csv: comma-separated values
• json: JavaScript object notation
• Jupyter also has viewers for these formats
• Look to use libraries to help where possible
- import json
- import csv
- import pandas
• Python also has pickle, but not used much anymore


Comma-separated values (CSV) Format

• Comma is a field separator, newlines denote records
- a,b,c,d,message
  1,2,3,4,hello
  5,6,7,8,world
  9,10,11,12,foo
• May have a header (a,b,c,d,message), but not required
• No type information: we do not know what the columns are (numbers,
strings, floating point, etc.)
- Default: just keep everything as a string
- Type inference: Figure out the type to make each column based on values
• What about commas in a value? → double quotes
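
A sketch of quoting and type inference, assuming a small in-memory CSV (the data and the infer helper are hypothetical). The csv module handles the quoted comma; the helper is one simple heuristic for guessing types, not a standard function.

```python
import csv
import io

# Hypothetical data: the last value of the first record contains a comma.
data = 'a,b,message\n1,2.5,"hello, world"\n3,4.0,foo\n'

rows = list(csv.reader(io.StringIO(data)))
header, records = rows[0], rows[1:]      # every value is still a string here

def infer(value):
    """Try int, then float; otherwise keep the string (a simple heuristic)."""
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            pass
    return value

typed = [[infer(v) for v in record] for record in records]
```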


Python csv module

• Help reading csv files using the csv module
- import csv
  with open('persons_of_concern.csv', 'r') as f:
      for i in range(3): # skip first three lines
          next(f)
      reader = csv.reader(f)
      records = [r for r in reader] # r is a list
• or
- import csv
  with open('persons_of_concern.csv', 'r') as f:
      for i in range(3): # skip first three lines
          next(f)
      reader = csv.DictReader(f)
      records = [r for r in reader] # r is a dict
Writing Files
• outf = open("mydata.txt", "w")
• If you open an existing file for writing, you wipe out the file's contents. If the
named file does not exist, a new one is created.
• Methods for writing to a file:
- print(<expressions>, file=outf)
- outf.write(<string>)
- outf.writelines(<list of strings>)
• If you use write or writelines, no newlines are added automatically
- Also, remember we can change print's ending: print(…, end=", ")
• Make sure you close the file! Otherwise, content may be lost (buffering)
• outf.close()
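
The three writing methods can be compared side by side. This is a sketch; the scratch file path is an assumption, and the file is read back only to show what was written.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mydata.txt")  # hypothetical scratch file

outf = open(path, "w")
print("a", "b", file=outf)          # print adds a space separator and a newline
outf.write("no newline")            # write adds nothing
outf.write("\n")                    # so the newline is written explicitly
outf.writelines(["x\n", "y\n"])     # writelines adds no newlines either
print("c", end=", ", file=outf)     # change print's ending
print("d", file=outf)
outf.close()                        # flush buffered content to disk

with open(path) as f:
    content = f.read()
```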

With Statement: Improved File Handling

• With statement does "enter" and "exit" handling:
• In the previous example, we need to remember to call outf.close()
• Using a with statement, this is done automatically:
- with open('huck-finn.txt', 'r') as f:
      for line in f:
          if 'Huckleberry' in line:
              print(line.strip())
• This is important for writing files!
- with open('output.txt', 'w') as f:
      for k, v in counts.items():
          f.write(f'{k}: {v}\n')  # f-string also works when v is a number
• Without with, we need f.close()

Context Manager
• The with statement is used with contexts
• A context manager's enter method is called at the beginning
• …and exit method at the end, even if there is an exception!
• outf = open('huck-finn-lines.txt','w')
  for i, line in enumerate(huckleberry):
      outf.write(line)
      if i > 3:
          raise Exception("Failure")
• with open('huck-finn-lines.txt','w') as outf:
      for i, line in enumerate(huckleberry):
          outf.write(line)
          if i > 3:
              raise Exception("Failure")
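
A minimal sketch of the enter/exit protocol using a hypothetical Tracker class (not part of Python or the lecture). It records the order of events to show that the exit method still runs when the body raises.

```python
class Tracker:
    """Hypothetical context manager that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")   # runs even if the body raised
        return False                 # do not suppress the exception

tracker = Tracker()
caught = None
try:
    with tracker:
        tracker.events.append("body")
        raise Exception("Failure")
except Exception as e:
    caught = str(e)
```

A file object's exit handling closes the file in the same way, which is why the with version of the slide's example leaves the file closed (and its buffered lines written) after the exception.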

JavaScript Object Notation (JSON)

• A format for web data
• Looks very similar to python dictionaries and lists
• Example:
- {"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
{"name": "Katie", "age": 33, "pet": "Cisco"}] }
• Only contains literals (no variables) but allows null
• Values: strings, arrays, dictionaries, numbers, booleans, or null
- Dictionary keys must be strings
- Quotation marks help differentiate string or numeric values
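
A sketch of how the JSON values above map to Python objects, using a shortened version of the slide's example parsed with json.loads.

```python
import json

text = """{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"}]}"""

data = json.loads(text)

pet = data["pet"]                         # JSON null becomes None
places = data["places_lived"]             # JSON array becomes a list
sibling_age = data["siblings"][0]["age"]  # unquoted number becomes an int
top_level_type = type(data).__name__      # JSON object becomes a dict
```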

Reading JSON data

• Python has a built-in json module
- import json
  with open('example.json') as f:
      data = json.load(f)
- with open('example-out.json', 'w') as f:
      json.dump(data, f)
• Can also load/dump to strings:
- json.loads, json.dumps
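
A short sketch of the string variants. The record is hypothetical; it includes an integer key to show that json.dumps converts dictionary keys to strings, so the round trip does not return exactly the same dictionary.

```python
import json

record = {"name": "Wes", "pet": None, "scores": [1, 2.5], 1: "int key"}  # hypothetical

as_text = json.dumps(record)            # Python object -> JSON string
back = json.loads(as_text)              # JSON string -> Python object
pretty = json.dumps(record, indent=2)   # indented, multi-line output

print(pretty)
```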
