Python2 Tutorial_ Extensive Python Example

This document is a tutorial chapter on using Python sets, illustrated through an analysis of the novel 'Ulysses' by James Joyce. It demonstrates how sets can be used to identify unique words and analyze vocabulary across multiple literary works. The tutorial emphasizes the practical applications of sets in programming, particularly in text analysis.

Uploaded by

Sentinel
Copyright
© © All Rights Reserved

Python Course


An Extensive Example for Sets

Python and the Best Novel

This chapter deals with natural languages and literature. It also serves as an extensive example and use case for Python sets. Novices in Python often think that sets are just a toy for mathematicians and that there is no real use case in programming. The contrary is true: there are multiple use cases for sets. They are used, for example, to get rid of duplicates (multiple occurrences of elements) in a list, i.e. to make the elements of a list unique.
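As a minimal sketch of this use case, the following snippet removes duplicates from a small list. The word list is made-up sample data, not taken from any of the novels, and the order-preserving variant with dict.fromkeys is an addition for illustration:

```python
# Hypothetical sample data, not taken from any of the novels.
words = ["stately", "plump", "buck", "stately", "buck", "plump"]

# A set keeps each element once, but does not preserve the original order.
unique_words = set(words)

# If the order of first occurrence matters, dict.fromkeys is a common idiom
# (dictionaries preserve insertion order since Python 3.7).
unique_in_order = list(dict.fromkeys(words))

print(len(unique_words))
print(unique_in_order)
```

The set is the natural choice when only membership and the number of distinct elements matter, which is the case in the rest of this chapter.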
In the following example we will use sets to determine the different words occurring in a novel. Our use case is built around a novel which has been praised by many as the best novel in the English language and also as the hardest to read. We are talking about the novel "Ulysses" by James Joyce. We will not talk about or examine the beauty of the language or the language style. We will study the novel by having a close look at the words used in it. Our approach will be purely statistical. The claim is that James Joyce used more different words in this novel than any other author. His vocabulary is said to be above and beyond that of all other authors, maybe even Shakespeare.

Besides Ulysses we will use the novels "Sons and Lovers" by D.H. Lawrence, "The Way of All Flesh" by Samuel Butler, "Robinson Crusoe" by Daniel Defoe, "To the Lighthouse" by Virginia Woolf, "Moby Dick" by Herman Melville and the short story "Metamorphosis" by Franz Kafka.

Before you continue with this chapter of our tutorial it might be a good idea to read the chapter Sets and Frozen Sets and the two chapters on regular expressions and advanced regular expressions.
Different Words of a Text

To cut out all the words of the novel "Ulysses" we can use the function findall from the module "re":

    import re

    # we don't care about case sensitivity and therefore use lower:
    ulysses_txt = open("books/james_joyce_ulysses.txt").read().lower()

    # a word is a run of word characters and hyphens between word boundaries
    words = re.findall(r"\b[\w-]+\b", ulysses_txt)
    print("The novel ulysses contains " + str(len(words)))

    The novel ulysses contains 272452
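To see what the pattern used with findall actually extracts, here is a small sketch on a single sentence. The sample string is an assumption made for illustration; the regular expression is the one from the chapter:

```python
import re

# Hypothetical sample sentence, used only to illustrate the pattern.
sample = "Stately, plump Buck Mulligan came from the stair-head.".lower()

# \w matches letters, digits and the underscore; the hyphen in the
# character class keeps hyphenated words such as "stair-head" together.
tokens = re.findall(r"\b[\w-]+\b", sample)
print(tokens)
```

Punctuation is dropped because it is neither a word character nor a hyphen, while hyphenated words stay in one piece.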
This number is the total count of all words, and many words occur multiple times:

    for word in ["the", "while", "good", "bad", "ireland", "irish"]:
        # list.count scans the complete list for every word
        print("The word '" + word + "' occurs " +
              str(words.count(word)) + " times in the novel!")

    The word 'the' occurs 15112 times in the novel!
    The word 'while' occurs 123 times in the novel!
    The word 'good' occurs 321 times in the novel!
    The word 'bad' occurs 90 times in the novel!
    The word 'ireland' occurs 90 times in the novel!
    The word 'irish' occurs 117 times in the novel!
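Calling count on the list scans all the words once per query. If many words have to be looked up, one possible alternative, which is not part of the original example, is collections.Counter from the standard library. It builds all the frequencies in a single pass. The short word list below is made-up sample data:

```python
from collections import Counter

# Hypothetical miniature word list standing in for the list "words".
sample_words = ["the", "irish", "sea", "the", "good", "the", "irish"]

# One pass over the list builds a mapping from word to frequency.
frequencies = Counter(sample_words)

for word in ["the", "irish", "bad"]:
    # Words that never occur simply have a count of 0.
    print("The word '" + word + "' occurs " + str(frequencies[word]) + " times!")
```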
Output with Print 272452 surely is a huge number of words for a novel, but on the other hand there are lots of novels with even more words. More interesting and saying more about the quality of a novel is the number of different
Man is the best
Formatted output words. This is the moment where we will finally need "set". We will turn the list of words "words" into a set. Applying "len" to this set will give us the number of different words:
computer we can put
with string aboard a
diff_words = set(words)
modulo and the spacecraft...and the
print("'Ulysses' contains " + str(len(diff_words)) + " different words!")
format method only one that can be
Functions mass produced with
'Ulysses' contains 29422 different words!
unskilled labor.
Recursion and
(Wernher von Braun)
Recursive This is indeed an impressive number. You can see this, if you look at the other novels in our folder books:
Functions
    novels = ['sons_and_lovers_lawrence.txt',
              'metamorphosis_kafka.txt',
              'the_way_of_all_flash_butler.txt',
              'robinson_crusoe_defoe.txt',
              'to_the_lighthouse_woolf.txt',
              'james_joyce_ulysses.txt',
              'moby_dick_melville.txt']

    for novel in novels:
        txt = open("books/" + novel).read().lower()
        words = re.findall(r"\b[\w-]+\b", txt)
        diff_words = set(words)
        n = len(diff_words)
        # novel[:-4] strips the ".txt" suffix from the file name
        print("{name:38s}: {n:5d}".format(name=novel[:-4], n=n))

    sons_and_lovers_lawrence              : 10822
    metamorphosis_kafka                   :  3027
    the_way_of_all_flash_butler           : 11434
    robinson_crusoe_defoe                 :  6595
    to_the_lighthouse_woolf               : 11415
    james_joyce_ulysses                   : 29422
    moby_dick_melville                    : 18922
Special Words in Ulysses

We will subtract all the words occurring in the other novels from "Ulysses" in the following little Python program. It is amazing how many words are used by James Joyce and by none of the other authors:

    words_in_novel = {}
    for novel in novels:
        txt = open("books/" + novel).read().lower()
        words = re.findall(r"\b[\w-]+\b", txt)
        words_in_novel[novel] = words

    words_only_in_ulysses = set(words_in_novel['james_joyce_ulysses.txt'])
    # from here on, "novels" contains only the other books
    novels.remove('james_joyce_ulysses.txt')
    for novel in novels:
        words_only_in_ulysses -= set(words_in_novel[novel])

    with open("books/words_only_in_ulysses.txt", "w") as fh:
        txt = " ".join(words_only_in_ulysses)
        fh.write(txt)

    print(len(words_only_in_ulysses))

    15314
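The repeated use of -= above is a set difference. As a sketch on toy data, the same result can also be obtained with one call to the difference method, which accepts several sets at once. The miniature vocabularies below are invented for illustration and are not the real word sets of the novels:

```python
# Hypothetical miniature vocabularies, not the real word sets of the novels.
vocabulary = {
    "ulysses": {"the", "sea", "tramtrack", "dogwhip", "good"},
    "moby_dick": {"the", "sea", "whale", "good"},
    "metamorphosis": {"the", "room", "good"},
}

# All word sets except the one of "ulysses".
others = [words for title, words in vocabulary.items() if title != "ulysses"]

# difference() accepts any number of sets and removes all of their elements.
only_in_ulysses = vocabulary["ulysses"].difference(*others)
print(sorted(only_in_ulysses))
```

Unlike -=, the difference method returns a new set and leaves the original vocabulary untouched.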
By the way, Dr. Seuss wrote a book with only 50 different words: Green Eggs and Ham

The file with the words only occurring in Ulysses contains strange or rarely used words like:

huntingcrop tramtrack pappin kithogue pennyweight undergarments scission nagyaságos wheedling begad dogwhip hawthornden turnbull calumet covey repudiated pendennis waistcoatpocket nostrum

Common Words
It is also possible to find the words which occur in every book. To accomplish this, we need the set intersection:

    # we start with the words in ulysses
    common_words = set(words_in_novel['james_joyce_ulysses.txt'])
    for novel in novels:
        common_words &= set(words_in_novel[novel])

    print(len(common_words))

    1745
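The loop with &= can be condensed into a single call, because set.intersection accepts any number of sets. The following sketch uses invented miniature vocabularies; the titles and words are assumptions for illustration only:

```python
# Hypothetical miniature vocabularies, not the real word sets of the novels.
vocabulary = {
    "ulysses": {"the", "one", "often", "tramtrack"},
    "robinson_crusoe": {"the", "one", "island", "often"},
    "to_the_lighthouse": {"the", "one", "lighthouse"},
}

# Calling intersection on the class lets us pass all sets as arguments.
common = set.intersection(*vocabulary.values())
print(sorted(common))
```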
James Joyce

James Joyce was born in Rathgar, Dublin, Ireland on the 2nd of February 1882. He wrote poetry, novels and short stories. He is regarded as one of the most important and influential authors of the 20th century. His most famous work is Ulysses, published in 1922. He also published the short-story collection Dubliners in 1914, and the novels A Portrait of the Artist as a Young Man (1916) and Finnegans Wake (1939).

Joyce frequently travelled to Switzerland for eye surgeries and for treatments for his daughter Lucia. She suffered from schizophrenia. Jung, who treated Lucia in 1934, told the Joyce biographer Richard Ellmann that Lucia and her father were "like two people going to the bottom of a river, one falling and the other diving." He died in 1941 in Zurich.

Doing it Right

We made a slight mistake in the previous calculations. If you look at the texts, you will notice that they have a header and a footer added by Project Gutenberg, which do not belong to the texts. The texts are positioned between the lines:

***START OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL FLESH***
and
***END OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL FLESH***

or

*** START OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***
and
*** END OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***

The function read_text takes care of this:

    def read_text(fname):
        # the patterns are lower case, because the text is lower-cased below
        beg_e = re.compile(r"\*\*\* ?start of (this|the) project gutenberg ebook[^*]*\*\*\*")
        end_e = re.compile(r"\*\*\* ?end of (this|the) project gutenberg ebook[^*]*\*\*\*")
        txt = open("books/" + fname).read().lower()
        beg = beg_e.search(txt).end()
        end = end_e.search(txt).start()
        return txt[beg:end]

    words_in_novel = {}
    for novel in novels + ['james_joyce_ulysses.txt']:
        txt = read_text(novel)
        words = re.findall(r"\b[\w-]+\b", txt)
        words_in_novel[novel] = words

    words_in_ulysses = set(words_in_novel['james_joyce_ulysses.txt'])
    for novel in novels:
        words_in_ulysses -= set(words_in_novel[novel])

    with open("books/words_in_ulysses.txt", "w") as fh:
        txt = " ".join(words_in_ulysses)
        fh.write(txt)

    print(len(words_in_ulysses))

    15341

    # we start with the words in ulysses
    common_words = set(words_in_novel['james_joyce_ulysses.txt'])
    for novel in novels:
        common_words &= set(words_in_novel[novel])

    print(len(common_words))

    1279

The words in the set "common_words" belong to the most frequently used words of the English language. Let's have a look at 30 arbitrary words of this set:

    counter = 0
    for word in common_words:
        print(word, end=", ")
        counter += 1
        if counter == 30:
            break

    ancient, broke, breathing, laugh, divided, forced, wealth, ring, outside, throw, person, spend, better, errand, school, sought, knock, tell, inner, run, packed, another, since,
    touched, bearing, repeated, bitter, experienced, often, one,
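The header and footer stripping done by read_text can be tried out without the book files. The following sketch applies the same two patterns to a string instead of a file. The function name strip_gutenberg, the sample text and the fallback for texts without markers are assumptions made for this illustration and are not part of the original program:

```python
import re

# Same patterns as in read_text; the text is lower-cased before matching.
beg_e = re.compile(r"\*\*\* ?start of (this|the) project gutenberg ebook[^*]*\*\*\*")
end_e = re.compile(r"\*\*\* ?end of (this|the) project gutenberg ebook[^*]*\*\*\*")


def strip_gutenberg(text):
    """Return the lower-cased text between the start and end marker lines."""
    text = text.lower()
    beg = beg_e.search(text)
    end = end_e.search(text)
    if beg is None or end is None:
        # Assumption: a text without both markers is returned unchanged.
        return text
    return text[beg.end():end.start()]


# Hypothetical miniature file content, not an actual Project Gutenberg file.
sample = (
    "Header added by Project Gutenberg\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***\n"
    "Stately, plump Buck Mulligan\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***\n"
    "Footer with licence text\n"
)

print(strip_gutenberg(sample).strip())
```

Note that read_text itself raises an AttributeError if a marker line is missing, because search then returns None. The explicit check in the sketch is one possible way to handle such files.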



© 2011 - 2018, Bernd Klein, Bodenseo; Design by Denise Mitchinson adapted for python-course.eu by Bernd Klein
