Ch11 ManipulatingTextWithMethodsAndFiles

Chapter 11 covers manipulating text in Python, focusing on strings and lists, including their methods and how to handle files. It explains string manipulation techniques, such as slicing and using methods like capitalize and find, as well as how to read from and write to files. The chapter emphasizes the importance of understanding file structures and directories for effective file processing.

Uploaded by

rksaar2
Copyright
© © All Rights Reserved
Chapter 11: Manipulating Text with Methods and Files


Chapter Objectives
Text
’ Text is the universal medium
’ We can convert any other media to a text representation.
’ We can convert between media formats using text.
’ Text is simple.

’ Like sound, text is usually processed as an array: a long line of characters
’ We call one of these long lines of characters a string.
’ In many (especially older) programming languages, text is actually manipulated
as arrays of characters. It's horrible! Python actually knows how to deal with
strings.
Strings
’ Strings are defined with quote marks.
’ Python actually supports three kinds of quotes:
>>> print 'this is a string'
this is a string
>>> print "this is a string"
this is a string
>>> print """this is a string"""
this is a string
’ Use whichever kind lets you embed the quote
marks you want
>>> aSingleQuote = " ' "
>>> print aSingleQuote
'
Why would you want to use triple quotes?
’ To have long quotations with returns and such inside them.
def aLongString():
  return """This is a
long
string"""
>>> print aLongString()
This is a
long
string
>>>
If you wanted ONLY vowels, what would
you change in the IF?
def novowels(somestring):
  collection = ""
  for ch in somestring:
    if (ch != "a") and (ch != "e") and (ch != "i") and (ch != "o") and (ch != "u"):
      collection = collection + ch
  print collection

(1) Change all != to ==
(2) Change all != to == AND change all “and” to “or”
(3) Change all “and” to “or”
(4) Keep the same != and “and” structure, but list all 21 consonants
Getting parts of strings
’ We use the square bracket “[]” notation to get parts of
strings.
’ string[n] gives you the character at index n in the string (counting starts at 0)
’ string[n:m] gives you the characters from index n up to (but not including)
index m.
Getting parts of strings
>>> hello = "Hello"
>>> print hello[1]
e
>>> print hello[0]
H
>>> print hello[2:4]
ll

Character positions in "Hello":
H e l l o
0 1 2 3 4
Start and end assumed if not there
>>> print hello
Hello
>>> print hello[:3]
Hel
>>> print hello[3:]
lo
>>> print hello[:]
Hello
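The slicing rules above can be checked directly. This is a small sketch written in Python 3 syntax (the slides use the older print statement from JES); the variable names other than hello are chosen here for illustration.

```python
hello = "Hello"

# Indexing starts at 0, so hello[1] is the second character.
second = hello[1]

# A slice [n:m] includes index n and stops just before index m.
middle = hello[2:4]

# An omitted bound defaults to the start or the end of the string.
first_three = hello[:3]
from_three = hello[3:]
whole_copy = hello[:]

print(second, middle, first_three, from_three, whole_copy)
```

Note that first_three + from_three always rebuilds the original string, which is why the "up to but not including" rule is convenient.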
Dot notation
’ All data in Python are actually objects
’ Objects not only store data, but they respond to
special functions that only objects of the same type
understand.
’ We call these special functions methods
’ Methods are functions known only to certain objects
’ To execute a method, you use dot notation
’ Object.method()
Capitalize is a method known only
to strings
>>> test="this is a test."
>>> print test.capitalize()
This is a test.
>>> print capitalize(test)
A local or global name could not be found.
NameError: capitalize
>>> print 'this is another test'.capitalize()
This is another test
>>> print 12.capitalize()
A syntax error is contained in the code -- I can't read it as
Python.
Useful string methods
’ startswith(prefix) returns true if the string starts with the
given prefix
’ endswith(suffix) returns true if the string ends with the
given suffix
’ find(findstring) and find(findstring,start) and
find(findstring,start,end) finds the findstring in the
object string and returns the index number where the
string starts. You can tell it what index number to start
from, and even where to stop looking. It returns -1 if it
fails.
’ There is also rfind(findstring) (and variations) that
searches from the end of the string toward the front.
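The optional start and end arguments of find, and the backwards search of rfind, are easy to mix up, so here is a minimal sketch in Python 3 syntax (the slides use the JES print statement). The sample string is the one used in the slides; the variable names are chosen here for illustration.

```python
letter = "Mr. Mark Guzdial requests the pleasure of your company..."

starts_with_mr = letter.startswith("Mr.")
ends_with_dots = letter.endswith("...")

first_e = letter.find("e")               # search from the front
next_e = letter.find("e", first_e + 1)   # resume just past the first match
last_e = letter.rfind("e")               # search from the end toward the front

# With an end argument, the whole match must lie before that index.
too_short = letter.find("Guzdial", 0, 9)    # "Guzdial" starts at 9, so no match
long_enough = letter.find("Guzdial", 0, 16)

missing = letter.find("fred")            # -1 signals "not found"

print(first_e, next_e, last_e, too_short, long_enough, missing)
```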
If we call the function like this:
firsthalfsmaller("this is a test")
Which one of these is the output?
def firsthalfsmaller(something):
  newstring = ""
  for c in something:
    if c < "m":
      newstring = newstring + c.lower()
    if c >= "m":
      newstring = newstring + c.upper()
  print newstring
(1) ThiS iS a TeST
(2) tHIs Is A tEst
(3) this is A TEST
(4) THIS IS a test
Demonstrating startswith
>>> letter = "Mr. Mark Guzdial requests the pleasure of your company..."
>>> print letter.startswith("Mr.")
1
>>> print letter.startswith("Mrs.")
0

Remember that Python treats the number 0 as false and any other number
(including 1) as true.
Demonstrating endswith
>>> filename="barbara.jpg"
>>> if filename.endswith(".jpg"):
... print "It's a picture"
...
It's a picture
Demonstrating find
>>> print letter
Mr. Mark Guzdial requests the pleasure of your company...
>>> print letter.find("Mark")
4
>>> print letter.find("Guzdial")
9
>>> print len("Guzdial")
7
>>> print letter[4:9+7]
Mark Guzdial
>>> print letter.find("fred")
-1
Interesting string methods
’ upper() translates the string to uppercase
’ lower() translates the string to lowercase
’ swapcase() makes all upper->lower and vice versa
’ title() makes the first character of each word uppercase and
the rest lowercase.
’ isalpha() returns true if the string is not empty and contains only
letters
’ isdigit() returns true if the string is not empty and contains only
digits
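A quick way to get a feel for these methods is to run them on one mixed-case string. The sketch below uses Python 3 syntax (the slides use the JES print statement), and the sample string is made up for illustration.

```python
sample = "hello WORLD"

upper = sample.upper()        # every letter uppercase
lower = sample.lower()        # every letter lowercase
swapped = sample.swapcase()   # upper becomes lower and vice versa
titled = sample.title()       # first letter of each word uppercase

# isalpha() and isdigit() are false for the empty string.
alpha_checks = ["abc".isalpha(), "abc1".isalpha(), "".isalpha()]
digit_checks = ["123".isdigit(), "12.3".isdigit(), "".isdigit()]

print(upper, lower, swapped, titled)
print(alpha_checks, digit_checks)
```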
Replace method
>>> print letter
Mr. Mark Guzdial requests the pleasure of your
company...
>>> letter.replace("a","!")
'Mr. M!rk Guzdi!l requests the ple!sure of your
comp!ny...'
>>> print letter
Mr. Mark Guzdial requests the pleasure of your
company...
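The session above shows that replace returns a new string and leaves the original alone. To keep the changed text you have to store the result, as in this short sketch (Python 3 syntax; the name changed is chosen here for illustration).

```python
letter = "Mr. Mark Guzdial requests the pleasure of your company..."

# replace() builds and returns a new string; letter itself is untouched.
changed = letter.replace("a", "!")

print(changed)
print(letter)
```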
Strings are sequences
>>> for i in "Hello":
... print i
...
H
e
l
l
o
Lists
’ We've seen lists before—that's what range() returns.
’ Lists are very powerful structures.
’ Lists can contain strings, numbers, even other lists.
’ They work very much like strings
’ You get pieces out with []
’ You can add lists together
’ You can use for loops on them
’ We can use them to process a variety of kinds of data.
Demonstrating lists
>>> mylist = ["This","is","a", 12]
>>> print mylist
['This', 'is', 'a', 12]
>>> print mylist[0]
This
>>> for i in mylist:
... print i
...
This
is
a
12
>>> print mylist + ["Really!"]
['This', 'is', 'a', 12, 'Really!']
Useful methods to use with lists:
Most of these don't work with strings
’ append(something) puts something in the list at the
end.
’ remove(something) removes the first occurrence of something from the
list; it raises an error if something isn't there.
’ sort() puts the list in ascending order (alphabetical for strings)
’ reverse() reverses the list
’ count(something) tells you the number of times that
something is in the list. (Strings have a count method too.)
’ max() and min() are functions (we've seen them
before) that take a list as input and give you the
maximum and minimum value in the list.
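The list methods above change the list in place rather than returning a new one. Here is a minimal sketch in Python 3 syntax; the list contents and variable names are made up for illustration.

```python
words = ["pear", "apple", "fig"]

words.append("apple")                 # add at the end
apple_count = words.count("apple")    # how many times "apple" appears
words.remove("apple")                 # removes only the first "apple"
words.sort()                          # ascending (alphabetical) order, in place
sorted_words = list(words)            # copy, so we can compare later
words.reverse()                       # reverse in place

largest = max(words)                  # max() and min() are functions, not methods
smallest = min(words)

# remove() raises ValueError when the item is not in the list.
try:
    words.remove("kiwi")
    remove_failed = False
except ValueError:
    remove_failed = True

print(sorted_words, words, largest, smallest, remove_failed)
```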
Converting from strings to lists
>>> letter = "Mr. Mark Guzdial requests the pleasure of your company..."
>>> print letter.split(" ")
['Mr.', 'Mark', 'Guzdial', 'requests', 'the', 'pleasure', 'of', 'your', 'company...']
Extended Split Example
def phonebook():
  return """
Mary:893-0234:Realtor:
Fred:897-2033:Boulder crusher:
Barney:234-2342:Professional bowler:"""

def phones():
  book = phonebook()
  phonelist = book.split('\n')
  newphonelist = []
  for entry in phonelist:
    newphonelist = newphonelist + [entry.split(":")]
  return newphonelist

def findPhone(person):
  for people in phones():
    if people[0] == person:
      print "Phone number for",person,"is",people[1]
Running the Phonebook
>>> print phonebook()

Mary:893-0234:Realtor:
Fred:897-2033:Boulder crusher:
Barney:234-2342:Professional bowler:
>>> print phones()
[[''], ['Mary', '893-0234', 'Realtor', ''], ['Fred', '897-2033', 'Boulder crusher', ''], ['Barney', '234-2342', 'Professional bowler', '']]
>>> findPhone('Fred')
Phone number for Fred is 897-2033
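The output of phones() above starts with [''] because the triple-quoted string begins with a newline. One possible variation, sketched here in Python 3 syntax, skips empty lines and returns the number instead of printing it. The names find_phone and entries are chosen for this sketch and are not from the slides.

```python
def phonebook():
    return """
Mary:893-0234:Realtor:
Fred:897-2033:Boulder crusher:
Barney:234-2342:Professional bowler:"""

def phones():
    entries = []
    for line in phonebook().split("\n"):
        if line:                      # skip the empty first line
            entries.append(line.split(":"))
    return entries

def find_phone(person):
    # Return the phone number, or None when the person is not listed.
    for entry in phones():
        if entry[0] == person:
            return entry[1]
    return None

print(find_phone("Fred"))
```

Returning a value rather than printing makes the function usable from other code, which can then decide what to do when the result is None.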
Strings have no font
’ Strings are only the characters of the text
’ Text displayed “WYSIWYG” (What You See Is What You Get) also includes fonts and styles
’ The font is the characteristic look of the letters in all
sizes
’ The style is typically the boldface, italics, underline,
and other effects applied to the font
’ In printer's terms, each style is its own font
Encoding font information
’ Font and style information is often encoded as style
runs
’ A separate representation from the string
’ Indicates bold, italics, or whatever style modification;
start character; and end character.
The old brown fox runs.
’ Could be encoded as:
"The old brown fox runs."
[[bold 0 6] [italics 5 12]]
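As a rough illustration, style runs like these could be held in an ordinary Python list that sits next to the string. The sketch below (Python 3 syntax) assumes the end index is inclusive, which the slide does not state; under that assumption the bold run covers "The old". The function names are made up for this example.

```python
text = "The old brown fox runs."

# Each run is (style, start index, end index).
# Assumption: the end index is inclusive; the slide does not say either way.
style_runs = [("bold", 0, 6), ("italics", 5, 12)]

def styles_at(index, runs):
    """Return the styles that apply to the character at the given index."""
    return [style for (style, start, end) in runs if start <= index <= end]

def styled_text(style, runs, source):
    """Return the pieces of source covered by runs of the given style."""
    return [source[start:end + 1] for (name, start, end) in runs if name == style]

print(styled_text("bold", style_runs, text))
print(styled_text("italics", style_runs, text))
print(styles_at(5, style_runs))   # index 5 falls inside both runs
```

Keeping the runs separate from the string means the text itself stays a plain string that all the usual string methods still work on.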
How do we encode all that?
’ Is it a single value? Not really.
’ Do we encode it all in a complex list? We could.
’ How do most text systems handle this?
’ As objects
’ Objects have data, maybe in many parts.
’ Objects know how to act upon their data.
’ Objects' methods may be known only to that object, or
may be known by many objects, but each object
performs that method differently.
What can we do with all this?
’ Answer: Just about anything!
’ Strings and lists are about as powerful as one gets in
Python
’ By “powerful,” we mean that we can do a lot of different kinds of
computation with them.
’ Examples:
’ Pull up a Web page and grab information out of it, from within a function.
’ Find a nucleotide sequence in a string and print its name.
’ Manipulate functions' source

’ But first, we have to learn how to manipulate files…


Files: Places to put strings and other stuff
’ Files are large, named collections of bytes.
’ Files typically have a base name and a suffix
’ barbara.jpg has a base name of “barbara” and a suffix of
“.jpg”
’ Files exist in directories (sometimes called folders)

’ The path C:\ip-book\mediasources\640x480.jpg tells us that the file “640x480.jpg” is in
the folder “mediasources” in the folder “ip-book” on the disk “C:”
Directories
’ Directories can contain files or other directories.
’ There is a base directory on your computer, sometimes
called the root directory
’ A complete description of what directories to visit to
get to your file is called a path
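Python's standard library can take a path apart into the pieces named above. This sketch (Python 3) uses pathlib.PureWindowsPath so that it behaves the same on any operating system. The path is assembled from the slide's description (disk C:, folder ip-book, folder mediasources, file 640x480.jpg); the variable names are chosen here for illustration.

```python
from pathlib import PureWindowsPath

# A raw string (r"...") keeps the backslashes from being read as escapes.
path = PureWindowsPath(r"C:\ip-book\mediasources\640x480.jpg")

base_name = path.stem          # file name without the suffix
suffix = path.suffix           # the suffix, including the dot
folder = path.parent.name      # the directory that holds the file
steps = path.parts             # every stop along the path, root first

print(base_name, suffix, folder)
print(steps)
```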
We call this structure a “tree”
’ C:\ is the root of the tree.
’ It has branches, each of
which is a directory
’ Any directory (branch)
can contain more
directories (branches)
and files (leaves)

C:\
  Documents and Settings
    Mark Guzdial
      mediasources
        640x480.jpg
      cs1315
  Windows
Why do I care about all this?
’ If you're going to process files, you need to know where
they are (directories) and how to specify them (paths).
’ If you're going to do movie processing, which involves
lots of files, you need to be able to write programs that
process all the files in a directory (or even several
directories) without having to write down every single
file name.
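One way to visit every file in a directory without typing the names is os.listdir from the standard library. The sketch below (Python 3) is only an illustration: the helper name list_files_with_suffix and the frame file names are made up, and it builds a throwaway directory so that it can run anywhere.

```python
import os
import tempfile

def list_files_with_suffix(directory, suffix):
    """Return full paths of the files in directory whose names end with suffix."""
    matches = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(suffix):
            matches.append(os.path.join(directory, name))
    return matches

# Build a temporary directory with a few placeholder files to search.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["frame01.jpg", "frame02.jpg", "notes.txt"]:
        with open(os.path.join(demo_dir, name), "wt") as f:
            f.write("placeholder")
    found = [os.path.basename(p) for p in list_files_with_suffix(demo_dir, ".jpg")]

print(found)
```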
Using lists to represent trees
>>> tree = [["Leaf1","Leaf2"],[["Leaf3"],["Leaf4"],"Leaf5"]]
>>> print tree
[['Leaf1', 'Leaf2'], [['Leaf3'], ['Leaf4'], 'Leaf5']]
>>> print tree[0]
['Leaf1', 'Leaf2']
>>> print tree[1]
[['Leaf3'], ['Leaf4'], 'Leaf5']
>>> print tree[1][0]
['Leaf3']
>>> print tree[1][1]
['Leaf4']
>>> print tree[1][2]
Leaf5
The Point: Lists allow us to
represent complex
relationships, like trees
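Because a branch is just a list inside a list, a short recursive function can visit every leaf no matter how deeply it is nested. This is a sketch in Python 3 syntax; the function name leaves is chosen here and does not appear in the slides.

```python
tree = [["Leaf1", "Leaf2"], [["Leaf3"], ["Leaf4"], "Leaf5"]]

def leaves(node):
    """Collect every leaf under node, left to right."""
    if isinstance(node, list):
        # A list is a branch: gather the leaves of each child.
        result = []
        for child in node:
            result.extend(leaves(child))
        return result
    # Anything that is not a list is a leaf.
    return [node]

print(leaves(tree))
```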
How to open a file
’ For reading or writing a file (getting characters out or
putting characters in), you need to use open
’ open(filename , how) opens the filename.
’ If you don't provide a full path, the filename is assumed to be in the
same directory as JES.
’ how is a two-character string that says what you want
to do with the file.
’ “rt” means “read text”
’ “wt” means “write text”
’ “rb” and “wb” mean read or write bytes
’ We won't do much of that
Methods on files: Open returns a file object

’ open() returns a file object that you use to manipulate
the file
’ Example: file=open("myfile","wt")
’ file.read() reads the whole file as a single string.
’ file.readlines() reads the whole file into a list where
each element is one line.
’ read() and readlines() consume the file: once the end is reached, a second call
returns nothing until you close and reopen the file.
’ file.write(something) writes something to the file
’ file.close() closes the file: it writes the contents out to the disk
and won't let you do any more to the file without re-opening
it.
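The methods above can be tried end to end by writing a small file and reading it back. This sketch uses Python 3 and a temporary directory so that it leaves nothing behind; the file name and its contents are made up. It uses the with statement, which closes the file automatically and is an alternative to calling close() yourself.

```python
import os
import tempfile

demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "myfile.txt")

# "wt" = write text. Leaving the with block closes the file.
with open(path, "wt") as f:
    f.write("first line\nsecond line\n")

# "rt" = read text. read() returns the whole file as one string.
with open(path, "rt") as f:
    contents = f.read()
    second_read = f.read()    # already at the end, so this is empty

# Reopen to start from the top; readlines() gives one string per line.
with open(path, "rt") as f:
    lines = f.readlines()

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(demo_dir)

print(repr(contents), repr(second_read), lines)
```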
Reading a file
>>> program=pickAFile()
>>> print program
C:\Documents and Settings\Mark Guzdial\My Documents\py-programs\littlepicture.py
>>> file=open(program,"rt")
>>> contents=file.read()
>>> print contents
def littlepicture():
  canvas=makePicture(getMediaPath("640x480.jpg"))
  addText(canvas,10,50,"This is not a picture")
  addLine(canvas,10,20,300,50)
  addRectFilled(canvas,0,200,300,500,yellow)
  addRect(canvas,10,210,290,490)
  return canvas
>>> file.close()
Reading a file by lines
>>> file=open(program,"rt")
>>> lines=file.readlines()
>>> print lines
['def littlepicture():\n',
'  canvas=makePicture(getMediaPath("640x480.jpg"))\n',
'  addText(canvas,10,50,"This is not a picture")\n',
'  addLine(canvas,10,20,300,50)\n',
'  addRectFilled(canvas,0,200,300,500,yellow)\n',
'  addRect(canvas,10,210,290,490)\n',
'  return canvas']
>>> file.close()
How you get spam
def formLetter(gender, lastName, city, eyeColor):
  # Slide note: the first line may be updated (see the source code)
  file = open("formLetter.txt","wt")
  file.write("Dear ")
  if gender == "F":
    file.write("Ms. " + lastName + ":\n")
  if gender == "M":
    file.write("Mr. " + lastName + ":\n")
  file.write("I am writing to remind you of the offer ")
  file.write("that we sent to you last week. Everyone in ")
  file.write(city + " knows what an exceptional offer this is!")
  file.write("(Especially those with lovely eyes of " + eyeColor + "!)")
  file.write("We hope to hear from you soon.\n")
  file.write("Sincerely,\n")
  file.write("M. Adam, Attorney at Law")
  file.close()
Trying out our spam generator
>>> formLetter("M", "Guzdial", "Decatur", "brown")
Dear Mr. Guzdial:
I am writing to remind you of the offer that we
sent to you last week. Everyone in Decatur knows what
an exceptional offer this is!(Especially those with
lovely eyes of brown!)We hope to hear from you soon.
Sincerely,
M. Adam,
Attorney at Law

Only use this power for good!


Finding data on the Internet
’ The Internet is filled with wonderful data, and almost
all of it is in text!
’ Later, we'll write functions that directly grab files from
the Internet, turn them into strings, and pull
information out of them.
’ For now, let's assume that the files are on your disk,
and let's process them from there.
Example: Finding the Nucleotide Sequence
’ There are places on the Internet
where you can grab DNA
sequences of things like parasites.
’ What if you're a biologist and
want to know if a sequence of
nucleotides that you care about is
in one of these parasites?
’ We not only want to know “yes”
or “no,” but which parasite.
What the data looks like
>Schisto unique AA825099

gcttagatgtcagattgagcacgatgatcgattgaccgtgagatcgacga

gatgcgcagatcgagatctgcatacagatgatgaccatagtgtacg

>Schisto unique mancons0736

ttctcgctcacactagaagcaagacaatttacactattattattattatt

accattattattattattattactattattattattattactattattta

ctacgtcgctttttcactccctttattctcaaattgtgtatccttccttt
How are we going to do it?
’ First, we get the sequences in a big string.
’ Next, we find where the small subsequence is in the big
string.
’ From there, we need to work backwards until we find
“>” which is the beginning of the line with the sequence
name.
’ From there, we need to work forwards to the end of the
line. From “>” to the end of the line is the name of the
sequence
’ Yes, this is hard to get just right. Lots of debugging prints.
The code that does it
def findSequence(seq):
  sequencesFile = getMediaPath("parasites.txt")
  file = open(sequencesFile,"rt")
  sequences = file.read()
  file.close()
  # Find the sequence
  seqloc = sequences.find(seq)
  #print "Found at:", seqloc
  if seqloc <> -1:
    # Now, find the ">" with the name of the sequence
    nameloc = sequences.rfind(">",0,seqloc)
    #print "Name at:",nameloc
    endline = sequences.find("\n",nameloc)
    print "Found in ",sequences[nameloc:endline]
  if seqloc == -1:
    print "Not found"
Why -1?
’ If .find or .rfind don't find something, they return -1
’ If they return 0 or more, then it's the index of where the
search string is found.
’ What's “<>”?
’ That's notation for “not equals”
’ You can also use “!=” (the only form that Python 3 accepts)
Running the program
>>> findSequence("tagatgtcagattgagcacgatgatcgattgacc")
Found in >Schisto unique AA825099
>>> findSequence("agtcactgtctggttgaaagtgaatgcttccaccgatt")
Found in >Schisto unique mancons0736
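The same find, then rfind, then find pattern can be tried without JES or the parasites.txt file. The sketch below (Python 3 syntax) searches a small in-memory string built from the first lines of the sample data shown earlier. It returns the name instead of printing it; the names SAMPLE and find_sequence_name are made up for this example.

```python
# A few lines of the sample data from the slides, held in memory instead of a file.
SAMPLE = (
    ">Schisto unique AA825099\n"
    "gcttagatgtcagattgagcacgatgatcgattgaccgtgagatcgacga\n"
    "gatgcgcagatcgagatctgcatacagatgatgaccatagtgtacg\n"
    ">Schisto unique mancons0736\n"
    "ttctcgctcacactagaagcaagacaatttacactattattattattatt\n"
)

def find_sequence_name(seq, sequences):
    """Return the '>' header line of the entry containing seq, or None."""
    seqloc = sequences.find(seq)
    if seqloc == -1:
        return None
    # Work backwards from the match to the nearest ">" ...
    nameloc = sequences.rfind(">", 0, seqloc)
    # ... then forwards from there to the end of that line.
    endline = sequences.find("\n", nameloc)
    return sequences[nameloc:endline]

print(find_sequence_name("tagatgtcagattgagcacgatgatcgattgacc", SAMPLE))
print(find_sequence_name("caagacaatttacac", SAMPLE))
```

Note that a plain find only matches a subsequence that sits on a single line of the data; one that wraps across a line break would be missed unless the newlines are removed first.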
Example: Get the temperature
’ The weather is always
available on the Internet.
’ Can we write a function that
takes the current temperature
out of a source like
http://www.ajc.com/weather
or http://www.weather.com ?
The Internet is mostly text
’ Text is the other unimedia.
’ Web pages are actually text in the format called HTML
(HyperText Markup Language)
’ HTML isn't a programming language,
it's an encoding language.
’ It defines a set of meanings for certain characters, but
one can't program in it.
’ We can ignore the HTML meanings for now, and just
look at patterns in the text.
Where's the temperature?
’ The word “temperature”
doesn't really show up.
’ But the temperature always
follows the word “Currently”,
and always comes before the
“<b>&deg;</b>”

<td ><img src="/shared-local/weather/images/ps.gif" width="48" height="48" border="0"><font size=-2><br></font><font size="-1" face="Arial, Helvetica, sans-serif"><b>Currently</b><br>
Partly sunny<br>
<font size="+2">54<b>&deg;</b></font><font face="Arial, Helvetica, sans-serif" size="+1">F</font></font></td>
</tr>
We can use the same algorithm we've seen previously
’ Grab the content out of a file in a big string.
’ (We've saved the HTML page previously.
Soon, we'll see how to grab it directly.)
’ Find the starting indicator (“Currently”)
’ Find the ending indicator (“<b>&deg;”)
’ Read the characters just before the ending indicator
Finding the temperature
def findTemperature():
  weatherFile = getMediaPath("ajc-weather.html")
  file = open(weatherFile, "rt")
  weather = file.read()
  file.close()
  # Find the Temperature
  curloc = weather.find("Currently")
  if curloc <> -1:
    # Now, find the "<b>&deg;" following the temp
    temploc = weather.find("<b>&deg;",curloc)
    tempstart = weather.rfind(">",0,temploc)
    print "Current temperature:",weather[tempstart+1:temploc]
  if curloc == -1:
    print "They must have changed the page format -- can't find the temp"
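To experiment with the search logic without a saved web page, the pattern can be run on a short HTML fragment held in a string. This is a sketch in Python 3 syntax; the fragment is trimmed from the HTML shown on the earlier slide, and the names SAMPLE_HTML and find_temperature are made up for this example.

```python
# A trimmed piece of the weather page HTML shown on the earlier slide.
SAMPLE_HTML = (
    '<b>Currently</b><br>\n'
    'Partly sunny<br>\n'
    '<font size="+2">54<b>&deg;</b></font>'
)

def find_temperature(weather):
    """Return the text just before '<b>&deg;' that follows 'Currently', or None."""
    curloc = weather.find("Currently")
    if curloc == -1:
        return None
    # The ending indicator comes right after the temperature.
    temploc = weather.find("<b>&deg;", curloc)
    if temploc == -1:
        return None
    # The temperature starts just after the closest ">" before the indicator.
    tempstart = weather.rfind(">", 0, temploc)
    return weather[tempstart + 1:temploc]

print(find_temperature(SAMPLE_HTML))
```

The result is a string, so it would need int() before doing arithmetic with it. As the slide's error message suggests, this kind of pattern matching breaks as soon as the page layout changes.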
CSV Data
’ There are many places on the Internet where you can
find data in comma-separated values (CSV) format.
’ Do a Web search for “data journalism”
’ Or try the US Census at https://www.census.gov
’ (Also provided in state-populations.csv in MediaSources
at http://mediacomputation.org)
Find a state’s population
def findPopulation(state):
  file = open(getMediaPath("state-populations.csv"),"rt")
  lines = file.readlines()
  file.close()
  for line in lines:
    parts = line.split(",")
    if parts[4] == state:
      return int(parts[5])
  return -1
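For comparison, here is a sketch of the same lookup using Python 3's csv module, which also copes with quoted fields that contain commas (a plain split(",") would cut those apart). The column positions 4 and 5 come from the function above. The header names and the rows in SAMPLE_CSV are invented placeholders, not real census data, and the name find_population is chosen for this example.

```python
import csv
import io

# Placeholder data only: the names and numbers below are made up.
SAMPLE_CSV = (
    "a,b,c,d,NAME,POPULATION\n"
    "x,x,x,x,Examplia,12345\n"
    "x,x,x,x,Samplevania,67890\n"
)

def find_population(state, csv_text):
    """Return the population for state, or -1 when it is not listed."""
    for parts in csv.reader(io.StringIO(csv_text)):
        # Column 4 holds the name and column 5 the population, as in the slide.
        if len(parts) > 5 and parts[4] == state:
            return int(parts[5])
    return -1

print(find_population("Examplia", SAMPLE_CSV))
print(find_population("Nowhere", SAMPLE_CSV))
```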
