12-Files Parsing

Uploaded by

leonardo333555
Files and Parsing

CS106AP Lecture 12
[Roadmap diagram: Day 1! · Programming Basics (The Console, Images) · Graphics · Data structures (Lists, Files, Parsing: Strings, Dictionaries 1.0, Dictionaries 2.0) · Midterm · Object-Oriented Programming · Everyday Python · Life after CS106AP!]


Today’s questions
How can I separate valuable data from junk?
Today’s topics
1. Review
   Command Line
   File Reading
2. What is Parsing?
   Useful String Functions
   How to Parse
3. What’s next?
Review
Command Line & Arguments

Command Line
PyCharm Terminal == Terminal/Command Prompt

Definition
Command Line/Terminal
Text interface for giving instructions to the computer. These instructions are relayed to the computer’s operating system.

Definition
Python Console/Interpreter
An interactive program that allows us to write Python code and run it line-by-line.
Command Line Usage
python3 script_name.py

using Python,
run this script’s
main() function
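
Python itself runs the file from top to bottom; main() runs because the script calls it. A minimal sketch of the usual convention, assuming the script defines its own main() (build_greeting is a hypothetical helper used only for this illustration):

```python
def build_greeting():
    """Hypothetical helper: return the text that this sketch's main() prints."""
    return 'Hello from main()!'


def main():
    print(build_greeting())


# This check is true only when the file is run directly
# (e.g. python3 script_name.py), not when it is imported by another file.
if __name__ == '__main__':
    main()
```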
What’s up with $?
Our convention is to let "$" represent the terminal prompt.

e.g.

$ python3 ghost.py hoover

this is the part you’d type into your terminal!

If we use “>>>”, we’re referring to the Python interpreter.

>>> 3 * 6

18
Arguments
Think/Pair/Share:
Line-by-line: what’s happening in the following code?

import sys

def main():
    args = sys.argv[1:]
    if len(args) == 1:
        print_processed_text(args[0], 'aei')
    if len(args) == 3 and args[0] == '-chars':
        print_processed_text(args[2], args[1])

$ python3 DeleteCharacters.py -chars aei poem.txt

Arguments
args = sys.argv[1:]
● sys.argv: get the command line arguments as a list!
● [1:]: slice off the first item in the list. Now our list doesn’t include the script name.

Arguments
$ python3 DeleteCharacters.py -chars aei poem.txt

args
['-chars', 'aei', 'poem.txt']
Indices: 0, 1, 2

len(args) == 3 and args[0] == '-chars', so the second if runs: print_processed_text('poem.txt', 'aei').

Arguments
Think/Pair/Share:
What would args be? What lines of code would execute?

$ python3 DeleteCharacters.py poem.txt

args
['poem.txt']
Indices: 0

len(args) == 1, so the first if runs: print_processed_text('poem.txt', 'aei').

Arguments
Think/Pair/Share:
What would args be?

$ python3 DeleteCharacters.py i rly like unic0rns ^-^

args
['i', 'rly', 'like', 'unic0rns', '^-^']
Indices: 0, 1, 2, 3, 4

len(args) == 5, so neither if runs.

Takeaways on arguments
$ python3 DeleteCharacters.py -chars aei poem.txt
Using Python, run this script with all of these arguments!

● We can use sys.argv to get a list of strings that correspond to the command line arguments!

Slide adapted from Chris Piech
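
The argument handling walked through above can be sketched as a small runnable script. This is an illustration under stated assumptions, not the lecture's DeleteCharacters.py: choose_options is a hypothetical helper, the usage message is made up, and the sketch only reports which call to print_processed_text it would make instead of reading a file.

```python
import sys


def choose_options(args):
    """Hypothetical helper: return (filename, chars) for the two supported
    argument patterns, or None if the arguments match neither."""
    if len(args) == 1:
        return args[0], 'aei'
    if len(args) == 3 and args[0] == '-chars':
        return args[2], args[1]
    return None


def main():
    args = sys.argv[1:]  # drop the script name
    options = choose_options(args)
    if options is None:
        # Made-up usage text, for illustration only.
        print('usage: python3 DeleteCharacters.py [-chars CHARS] FILENAME')
    else:
        filename, chars = options
        print('would call print_processed_text with', filename, chars)


if __name__ == '__main__':
    main()
```

Keeping the decision in a function that takes a plain list makes it easy to try different argument lists without going back to the terminal.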


Files
Storing Information
When we’re running a program, variables and information are stored in RAM (Random Access Memory).
When we’re not running a program and we want to save information, we store it on our hard drive (also called disk).
What’s in a text file?
0 The suns are able to fall and rise:\n
1 When that brief light has fallen for us,\n
2 we must sleep a never ending night.

● No bold/italics!
● Each line is ended by the ‘\n’ newline character!
○ Except for the last line of this file, which doesn’t have a ‘\n’ (many text files do end with one).
File Reading – catullus.txt
0 The suns are able to fall and rise:\n
1 When that brief light has fallen for us,\n
2 we must sleep a never ending night.

with open('catullus.txt', 'r') as f:
    for line in f:
        print(line)

print() automatically adds a '\n'!
Output:
The suns are able to fall and rise:\n\n
When that brief light has fallen for us,\n\n
we must sleep a never ending night.

How can we avoid the extra output line?

Output:
The suns are able to fall and rise:\n
When that brief light has fallen for us,\n
we must sleep a never ending night.

with open('catullus.txt', 'r') as f:
    for line in f:
        print(line, end='')

end’s default value is '\n'. Passing end='' means “once you’ve printed this line, don’t add on a '\n'”.
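
To try this without having catullus.txt on disk, here is a self-contained sketch. It assumes it is fine to write the poem into a temporary folder first; read_raw_lines and print_file are hypothetical helper names, not part of the lecture code.

```python
import os
import tempfile

POEM = (
    'The suns are able to fall and rise:\n'
    'When that brief light has fallen for us,\n'
    'we must sleep a never ending night.'
)


def read_raw_lines(filename):
    """Return the lines exactly as the for loop sees them ('\n' included)."""
    with open(filename, 'r') as f:
        return [line for line in f]


def print_file(filename):
    """Print the file without adding an extra '\n' after each line."""
    with open(filename, 'r') as f:
        for line in f:
            print(line, end='')


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'catullus.txt')
    with open(path, 'w') as f:
        f.write(POEM)
    raw_lines = read_raw_lines(path)
    print_file(path)
```

Looking at raw_lines shows why plain print(line) double-spaces the output: every line except the last already ends with '\n'.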
How can I separate valuable data from junk?
Parsing!
Data from Social Explorer: ACS 2017
What is data?
$GPGGA,005328.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,2.0,0000*70

$GPGSA,M,3,09,23,07,16,30,03,27,,,,,,2.3,1.3,1.9*38

$GPRMC,005328.000,A,3726.1389,N,12210.2515,W,0.00,256.18,221217,,,D*78

$GPGGA,005329.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,2.0,0000*71

$GPGSA,M,3,09,23,07,16,30,03,27,,,,,,2.3,1.3,1.9*38

$GPRMC,005329.000,A,3726.1389,N,12210.2515,W,0.00,256.18,221217,,,D*79

$GPGGA,005330.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,3.0,0000*78

$GPGSA,M,3,09,23,07,16,30,03,27,,,,,,2.3,1.3,1.9*38

Read more about NMEA


What is data?
● Usually just text!

○ Text is a common data exchange format.


Parsing

Definition

Parsing
The act of reading “raw” text and converting it
into a more useful format stored in memory.

Adapted from Jon Skeet
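
As a small illustration of this definition, the sketch below turns one raw '$GPGGA' line from the earlier slide into a dictionary stored in memory. The function name parse_gga and the dictionary keys are hypothetical, the field positions are an assumption based on the usual NMEA GGA layout, and str.split(',') is a standard string method that this lecture has not introduced.

```python
def parse_gga(line):
    """Return a dict of position fields from a '$GPGGA' line, or None for
    any other line (the 'junk' for this particular task)."""
    line = line.strip()
    if not line.startswith('$GPGGA'):
        return None
    fields = line.split(',')
    if len(fields) < 6:
        return None
    # Assumed layout: time, latitude, N/S, longitude, E/W follow the tag.
    return {
        'time': fields[1],
        'latitude': fields[2],
        'lat_hemisphere': fields[3],
        'longitude': fields[4],
        'lon_hemisphere': fields[5],
    }


raw = '$GPGGA,005328.000,3726.1389,N,12210.2515,W,2,07,1.3,22.5,M,-25.7,M,2.0,0000*70'
position = parse_gga(raw)
skipped = parse_gga('$GPGSA,M,3,09,23,07,16,30,03,27,,,,,,2.3,1.3,1.9*38')
```

The values are left as strings here; converting them to numbers would be a further parsing step.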


Components of Parsing
● File Reading

● String Manipulation

● Advanced Control Flow

● Container Data Types


String Manipulation - Useful Functions
s.isalpha()

s.isdigit()

s.isspace()
applies to spaces, tabs, and newlines.
Tabs are written ‘\t’. Newlines are ‘\n’.
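
A quick sketch of how these three checks behave. The classify function is a hypothetical helper written only for this illustration; the string methods themselves are standard Python.

```python
def classify(ch):
    """Hypothetical helper: label a single character."""
    if ch.isalpha():
        return 'alpha'
    if ch.isdigit():
        return 'digit'
    if ch.isspace():
        return 'space'
    return 'other'


labels = [classify(ch) for ch in 'a1 \t\n!']

# The methods also work on longer strings: every character must pass.
all_letters = 'parsing'.isalpha()
mixed = 'cs106'.isalpha()
only_whitespace = ' \t\n'.isspace()
empty = ''.isalpha()  # an empty string fails all three checks
```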
String Manipulation - Useful Functions
s.startswith(substr)
s.endswith(substr)
These functions return booleans!

>>> 'Sonja'.startswith('Son')
True
String Manipulation - Useful Functions
>>> s = 'computer'
>>> 'put' in s
True

You can use in with strings, like lists!
String Manipulation - Useful Functions
>>> s = 'hello!'
>>> s.find('!')
5
>>> s.find('l')
2

find() returns the index of the first occurrence of the substring you pass in.
String Manipulation - Useful Functions
>>> s = 'hello!'
>>> s.find('w')
-1

If the string doesn’t contain the substring, find() returns -1.

>>> s.find('l', 3)
3

You can optionally pass in a start index (and, after it, an end index). The format is:
s.find(substr, start_index, end_index)

Think/Pair/Share:
Find the first '@' in s. Return the substring made of 0 or more alpha characters following the '@'.
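
One possible approach to this exercise, sketched with find() and isalpha(). The function name at_word is hypothetical, and returning an empty string when s contains no '@' is an assumption, since the exercise does not say what should happen in that case.

```python
def at_word(s):
    """Return the run of alpha characters right after the first '@' in s."""
    at = s.find('@')
    if at == -1:
        return ''  # assumption: no '@' means there is nothing to return
    end = at + 1
    # Advance end while it is in bounds and still pointing at a letter.
    while end < len(s) and s[end].isalpha():
        end += 1
    return s[at + 1:end]


example = at_word('meet @sonja! at noon')
```

The bounds check comes first in the while condition so that s[end] is never evaluated past the end of the string.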
String Manipulation - Useful Functions
>>> s = ' hello world! '
>>> s.strip()
'hello world!'

strip() removes whitespace on the left & right sides of the string.

>>> s = ' hello world!\n '
>>> s.strip()
'hello world!'

strip() can be used on newlines and tabs as well as spaces.
Recall: How can we avoid the extra output line?

Output:
The suns are able to fall and rise:\n
When that brief light has fallen for us,\n
we must sleep a never ending night.

with open('catullus.txt', 'r') as f:
    for line in f:
        line = line.strip()
        print(line)

strip() removes the '\n' at the end of each line (along with any other whitespace on either side) before print() adds its own.
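
The same strip() step is also handy when the lines are being stored instead of printed. A minimal sketch, assuming we want a list of cleaned-up lines with blank lines left out; load_clean_lines and the sample file contents are hypothetical.

```python
import os
import tempfile


def load_clean_lines(filename):
    """Return the file's non-blank lines with surrounding whitespace removed."""
    clean = []
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line != '':
                clean.append(line)
    return clean


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample.txt')
    with open(path, 'w') as f:
        f.write('  alpha \n\n\tbeta\n')
    cleaned = load_clean_lines(path)
```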
How do we represent strings?
● Google “omega uppercase unicode”
○ '03A9'
○ hexadecimal notation (base-16) = 0-9 plus letters A-F

>>> s = '\u03A9'
>>> s
'Ω'
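
For a closer look at how the escape relates to the character, the built-in functions ord(), chr(), and hex() can be used. These are standard Python, but they are not covered on these slides, so treat this as an optional sketch.

```python
omega = '\u03A9'

code_point = ord(omega)         # the character's number, in base 10
as_hex = hex(code_point)        # the same number written in base 16
round_trip = chr(0x03A9)        # from the number back to the character
length = len(omega)             # the escape produces a single character
```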
Components of Parsing
● File Reading

● String Manipulation

● Advanced Control Flow

● Container Data Types


Compound Boolean Expressions
s = 'yay'
if len(s) == 2 and s[1] == 'a':
    pass  # do something

len(s) == 2 is False, so Python stops there. s[1] == 'a' (which would be True) will never get executed! This is also known as “short-circuiting”.
Why is this useful?
s = ''
if len(s) != 0 and s[0] == 'a':
    pass  # do something

len(s) != 0 is False, so s[0] is never evaluated. On its own, s[0] would result in an error!
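
A runnable sketch of the same idea, showing that the order of the two conditions matters. Both function names are hypothetical and exist only for this comparison.

```python
def starts_with_a(s):
    """Safe: the length check runs first, so s[0] is skipped for ''."""
    return len(s) != 0 and s[0] == 'a'


def starts_with_a_unsafe(s):
    """Unsafe: s[0] runs first and fails on an empty string."""
    return s[0] == 'a' and len(s) != 0


safe_result = starts_with_a('')

crashed = False
try:
    starts_with_a_unsafe('')
except IndexError:
    crashed = True
```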
Advanced Control Flow
break
continue
Advanced Control Flow
# print words in all_words until hit a censored word!

def censored(all_words, censored_words):
    for word in all_words:
        if word in censored_words:
            break
        print(word)
Advanced Control Flow
# print words in all_words that aren’t censored!

def censored(all_words, censored_words):
    for word in all_words:
        if word in censored_words:
            continue
        print(word)
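
To compare the two loops side by side, the sketch below collects the words in lists instead of printing them. The function names and the sample words are made up for illustration.

```python
def words_until_censored(all_words, censored_words):
    """break: stop the whole loop at the first censored word."""
    kept = []
    for word in all_words:
        if word in censored_words:
            break
        kept.append(word)
    return kept


def words_without_censored(all_words, censored_words):
    """continue: skip just this word and move on to the next one."""
    kept = []
    for word in all_words:
        if word in censored_words:
            continue
        kept.append(word)
    return kept


all_words = ['the', 'quick', 'darn', 'fox']
censored_words = ['darn']
stopped_early = words_until_censored(all_words, censored_words)
skipped_one = words_without_censored(all_words, censored_words)
```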
Think/Pair/Share:
Print the list of zoo animals (not including the bears) and the corresponding list of the number of times each animal has been fed.
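
The slide does not show the input for this exercise, so the sketch below makes a loud assumption: the data arrives as a list of animal names with one entry per feeding. Under that assumption, one possible approach uses continue to skip the bears and two parallel lists to keep the counts. The name feeding_counts and the sample data are hypothetical.

```python
def feeding_counts(feedings):
    """Return (animals, counts) as parallel lists, leaving out the bears.
    Assumes feedings holds one animal name per feeding."""
    animals = []
    counts = []
    for animal in feedings:
        if animal == 'bear':
            continue
        if animal in animals:
            counts[animals.index(animal)] += 1
        else:
            animals.append(animal)
            counts.append(1)
    return animals, counts


animals, counts = feeding_counts(['lion', 'bear', 'seal', 'lion', 'bear', 'lion'])
print(animals)
print(counts)
```

Keeping two lists in step like this works, but it is clumsy, which is part of the motivation for asking whether there is a better way to store complex data.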
What’s next?
● Dictionaries
○ Is there a better way to store complex data?
