
UBC Summer School in NLP - VSP 2019 Lecture 7

This document summarizes the key points from a class on natural language processing and computation. It discusses building nested dictionaries to track family trees across generations. It then covers using multiple Python dictionaries and files to transcribe text to speech, including downloading supplementary files and writing IPA transcriptions to a file. It introduces error handling using try/except blocks and avoiding errors with if/else statements. It provides an example of counting letter frequencies in words and fixing a KeyError. Finally, it mentions applying these concepts to an Indonesian language.

Uploaded by

万颖佳

VANCOUVER SUMMER PROGRAM

Package G (Linguistics): Computation for Natural Language Processing


Class 7
Instructor: Michael Fry
PLAN TODAY
• Review:
• Dictionaries
• Finish of text-to-speech (TTS)
• Write to file
• Access webpage
• Introduce Errors/exceptions
• Practical application
• Introduce importing packages
• Learn to use command line
• Introduce installing packages with pip
• Introduce NLTK (Natural Language Toolkit)
NESTED DICTIONARIES
• Sometimes it’s useful to nest dictionaries
• i.e. a dictionary whose key accesses a value that is also a dictionary
• For example, someone might want to track the oldest son of every family through
generations (‘Tim is the father of Jake is the father of Pete’)
• How do we build a nested dictionary?
• How can we access entries?
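A minimal sketch of both questions, using the family from the example above. The layout (each name maps to a dictionary holding that person's son) and the variable names are assumptions for illustration, not necessarily the structure built in class.

```python
# Each key is a father; its value is another dictionary holding his son,
# so the generations nest inside each other.
family = {'Tim': {'Jake': {'Pete': {}}}}

# Build the same structure step by step, starting from an empty dictionary.
built = dict()
built['Tim'] = dict()
built['Tim']['Jake'] = dict()
built['Tim']['Jake']['Pete'] = dict()

# Access entries by chaining keys: one pair of square brackets per level.
tims_sons = list(family['Tim'].keys())
jakes_sons = list(family['Tim']['Jake'].keys())
print(tims_sons)
print(jakes_sons)
```

Each extra pair of brackets goes one generation deeper, so family['Tim']['Jake'] is itself a dictionary.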
TTS: OUR METHOD
• We’re going to use the CMU Pronouncing Dictionary as a half-way point to get to IPA
• We’re going to use multiple Python dictionaries to do our transcription from spelling
to ARPABET to IPA, like this:
• ‘hello’ -> HH AH0 L OW1 -> hɛlo
• Our dictionary types will look like this:
    english2arpabet = {'hello': ['HH', 'AH0', 'L', 'OW1']}
    arpabet2ipa = {'HH': 'h', 'AH0': 'ɛ', 'L': 'l', 'OW1': 'o'}
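As a rough sketch, the two dictionaries can be chained to go from spelling to IPA for a single word. The variable names other than the two dictionaries are made up for illustration.

```python
english2arpabet = {'hello': ['HH', 'AH0', 'L', 'OW1']}
arpabet2ipa = {'HH': 'h', 'AH0': 'ɛ', 'L': 'l', 'OW1': 'o'}

# Step 1: spelling -> list of ARPABET symbols
arpabet = english2arpabet['hello']

# Step 2: each ARPABET symbol -> one IPA symbol
ipa_symbols = []
for symbol in arpabet:
    ipa_symbols.append(arpabet2ipa[symbol])

# Step 3: glue the IPA symbols together into one string
ipa = ''.join(ipa_symbols)
print(ipa)
```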
TTS: OUR METHOD
• Download:
    supplementary material → cmu_dict.txt
    supplementary material → arpabet2ipa.txt
    Datasets → three_little_pigs.txt
• Pseudocode (Step 1):
    read arpabet2ipa.txt into Python
    initialize an empty dictionary
    for each line:
        strip the line
        split the line
        add an entry into our arpabet2ipa dictionary with arpabet as the key
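One possible version of Step 1 is sketched below. The layout of arpabet2ipa.txt is an assumption here (one ARPABET symbol and one IPA symbol per line, separated by whitespace), and sample_lines is a made-up stand-in for the file so the sketch runs on its own.

```python
# Stand-in for the lines of arpabet2ipa.txt.
# ASSUMPTION: each line is "<ARPABET> <whitespace> <IPA>".
sample_lines = ['HH\th\n', 'AH0\tɛ\n', 'L\tl\n', 'OW1\to\n', '\n']

arpabet2ipa = dict()
for line in sample_lines:
    line = line.strip()        # remove the newline and extra spaces
    parts = line.split()       # split on any whitespace
    if len(parts) != 2:
        continue               # skip blank or unexpected lines
    arpabet, ipa = parts
    arpabet2ipa[arpabet] = ipa  # ARPABET symbol is the key

print(arpabet2ipa)
```

With the real file, the loop would run over a file object opened with open(path, encoding='utf-8') instead of sample_lines.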
TTS: OUR METHOD
• Pseudocode (Step 2):
    read the cmu_dict.txt file
    initialize empty dictionary
    for each line in the file:
        strip the line
        split the line by double space ('  ')
        add entry into dictionary with spelling as key
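A sketch of Step 2 under stated assumptions: each line of cmu_dict.txt holds the spelling, a double space, then the ARPABET symbols separated by single spaces. The sample lines are a stand-in for the file, and lowercasing the spelling is a choice made here so that keys look like 'hello'.

```python
# Stand-in for the lines of cmu_dict.txt.
# ASSUMPTION: "<SPELLING><two spaces><ARPABET symbols separated by spaces>".
sample_lines = ['HELLO  HH AH0 L OW1\n', 'PIGS  P IH1 G Z\n', 'not an entry\n']

english2arpabet = dict()
for line in sample_lines:
    line = line.strip()
    parts = line.split('  ')   # split by double space
    if len(parts) != 2:
        continue               # skip lines that are not dictionary entries
    spelling, pronunciation = parts
    # Store the pronunciation as a list of ARPABET symbols.
    english2arpabet[spelling.lower()] = pronunciation.split()

print(english2arpabet)
```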
TTS: OUR METHOD
• Pseudocode (Step 3):
    read three_little_pigs.txt
    make an empty ipa_output list
    for each line in the file:
        strip the line
        split the line
        make empty ipa_line list
        for each word in the line:
            strip off punctuation
            look up pronunciation in cmu_dict
            convert pronunciation to ipa
            add it to the ipa_line
        add ipa_line to ipa_output
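Step 3 might look something like the sketch below. The tiny dictionaries and story_lines are stand-ins for the real files, and leaving unknown words as they are spelled is an assumption made here so the loop never raises a KeyError.

```python
import string

# Tiny stand-ins for the dictionaries built in Steps 1 and 2.
english2arpabet = {'hello': ['HH', 'AH0', 'L', 'OW1']}
arpabet2ipa = {'HH': 'h', 'AH0': 'ɛ', 'L': 'l', 'OW1': 'o'}
# Stand-in for the lines of three_little_pigs.txt.
story_lines = ['Hello, hello!\n', 'Hello wolf.\n']

ipa_output = []
for line in story_lines:
    words = line.strip().split()
    ipa_line = []
    for word in words:
        # Strip punctuation from both ends and lowercase to match the keys.
        word = word.strip(string.punctuation).lower()
        if word in english2arpabet:
            ipa_word = ''
            for symbol in english2arpabet[word]:
                ipa_word += arpabet2ipa[symbol]
        else:
            ipa_word = word  # ASSUMPTION: keep unknown words as spelled
        ipa_line.append(ipa_word)
    ipa_output.append(ipa_line)

print(ipa_output)
```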
TTS: OUR METHOD
• Now, to make things easy to copy online, let’s write our data to a file
• Writing to a file is a lot like reading from a file; you just need to pass a mode argument
    out_file_path = 'C:/Users/Michael/Desktop/test.txt'
    with open(out_file_path, mode='w', encoding='utf-8') as out_file:
        out_file.write('DINOSAURS!')
• Notice that we can only write out strings
• Note: Python automatically creates the file for you if it doesn’t exist. If it does exist,
mode='w' erases the old contents and starts writing at the first line
• There are other modes (r, r+, w, w+, a, a+), which all allow different interactions with files
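The sketch below illustrates the difference between two of these modes, 'w' (overwrite) and 'a' (append). It writes into a temporary folder rather than the Desktop path above so it can run anywhere. The file name and the written strings are made up.

```python
import os
import tempfile

# Write into a fresh temporary folder so no real file is touched.
out_dir = tempfile.mkdtemp()
out_file_path = os.path.join(out_dir, 'test.txt')

# mode='w' creates the file because it does not exist yet.
with open(out_file_path, mode='w', encoding='utf-8') as out_file:
    out_file.write('DINOSAURS!\n')
    out_file.write(str(42) + '\n')   # numbers must be converted to strings

# Opening with mode='w' again erases what was there before.
with open(out_file_path, mode='w', encoding='utf-8') as out_file:
    out_file.write('hɛlo\n')

# mode='a' keeps the existing contents and adds to the end.
with open(out_file_path, mode='a', encoding='utf-8') as out_file:
    out_file.write('wʊlf\n')

with open(out_file_path, encoding='utf-8') as in_file:
    contents = in_file.read()
print(contents)
```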
TTS: OUR METHOD
• Now that we have IPA transcriptions for our story, let’s go listen to some of it
• Go to: https://itinerarium.github.io/phoneme-synthesis/
• Enter a sub-portion of your IPA and listen away
ERROR HANDLING
• When the Python interpreter reaches a line of code that it can't execute, it raises an
exception (this is also called throwing an error)
    alphabet = ['a', 'b', 'c']
    h = alphabet[7]
    h = h.upper()
• Python can't find index 7 (because it doesn't exist). Therefore Python can't assign the
value at index 7 to h
• Python can't "skip" a line of code that it can't execute. Even if it could skip line
2, what would happen on line 3?
• When there's nothing left for Python to do, it raises an Exception
ERROR HANDLING
• Errors are fatal, meaning the program cannot continue running
• Exceptions are non-fatal, meaning the program could possibly recover and keep
running
• However, programmers commonly call everything an error
• Even more confusingly, Python gives almost all of its exceptions names ending in "Error"
TYPES OF ERRORS
• Types of errors:
• SyntaxError (this is technically the only "error" on this list, because the program
cannot run with a SyntaxError)
• NameError
• IndexError
• KeyError
• AttributeError
• TypeError
• UnicodeError
• FileNotFoundError
• TabError
• ZeroDivisionError
• and more…
TYPES OF ERRORS
• SyntaxError:
• This is raised if you have typed code that doesn't conform to the Python syntax. For
example:
• A missing bracket - print('Some text'
• A missing quote - print('Some text)
• A missing colon - for line in file
• A missing comma - numbers = [1,2,3,4 5]
• SyntaxErrors are fatal, and they prevent your program from running
• Normally, IDLE will highlight the problem in red for you
TYPES OF ERRORS
• NameError:
• This is raised if the Python interpreter can't find a variable name. There are
several possible causes, but these two are very common:
• You misspelled a variable name (e.g. "woard" instead of "word")
• You defined a variable name inside an if-statement, then tried to access
that variable name, but the if-statement code was never executed in the
first place.
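A small sketch of the second cause. The word list and the variable name found are made up for illustration. The try/except used to keep the script running is covered later in this class.

```python
words = ['shrubbery', 'coconut']

for word in words:
    if word == 'newt':
        # This line only runs if 'newt' is in the list, and it isn't,
        # so the variable found is never created.
        found = word

try:
    print(found)
    error_raised = False
except NameError:
    error_raised = True
    print('NameError: found was never defined')
```

One common way to avoid this is to give the variable a starting value (for example found = None) before the loop.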
TYPES OF ERRORS
• IndexError:
• This is raised if you try to access an index in a list that does not exist. There are two
possible causes:
• The list has the correct length, and the index is wrong.
• Double-check the code that is generating the index number.

• The list is not the correct length, and the index you want should exist, but doesn't
• Make sure the code that builds the list is building one of the right length.
TYPES OF ERRORS
• KeyError:
• This is raised if you try to access a dictionary key that doesn't exist
• The key you want should be in the dictionary, but it isn't. Double-check the code that is
building the dictionary to make sure the key does exist before you try to access it.
• You've asked for an incorrect key name. Check for spelling mistakes if you typed key names
yourself. Otherwise, check that your code is not generating incorrect names.
TYPES OF ERRORS
• AttributeError:
• This is raised if you try to use a method or attribute that doesn't exist
    s = 'some string'
    s.append('!')
• This raises an AttributeError because strings do not have an "append" method
• There are no general solutions on how to solve an AttributeError. Follow the traceback of
the error
TYPES OF ERRORS
• TypeError
• This is raised if you try to do something with a variable of the wrong type.
numbers = [0,1,2,3,4]
print(numbers['0'])
• This raises a TypeError because the index is the wrong type. It has to be an integer, not a
string
• To debug, follow the traceback
ERROR HANDLING
• Python gives us a way to catch errors using a try/except/else block (kind of like an
if/else block)
    try:
        # some code
    except KeyError:
        # if a KeyError is raised, this happens
    except (IndexError, ValueError):
        # if an IndexError or a ValueError is raised, this happens
    except Exception:
        # if any other exception is raised, this happens
    else:
        # if no exception is raised, this happens
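A runnable sketch of how the branches of such a block are chosen. In Python, the else branch runs only when the try code finished without raising anything. The dictionary and the list of attempts are made up for illustration.

```python
sounds = {'vowels': ['a', 'e', 'o']}

# Each attempt is a (key, position) pair to look up.
attempts = [('vowels', 1), ('consonants', 0), ('vowels', 7)]

results = []
for key, position in attempts:
    try:
        value = sounds[key][position]
    except KeyError:
        # the key is not in the dictionary
        results.append('KeyError')
    except (IndexError, TypeError):
        # the position is out of range or not an integer
        results.append('IndexError or TypeError')
    else:
        # no exception was raised, so value is safe to use
        results.append('found ' + value)

print(results)
```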
ERROR HANDLING
• Generally, you don’t want errors, so you shouldn’t write code that relies on them.
• Of course, just because you shouldn’t doesn’t mean you can’t. Let’s fix this:

    words = ['shrubbery', 'coconut', 'witch', 'newt']

    letter_count = dict()
    for word in words:
        for letter in word:
            letter_count[letter] += 1  # raises KeyError: the letter has no entry yet
ERROR HANDLING
• Fix with try/except:
    for letter in word:
        try:
            letter_count[letter] += 1
        except KeyError:
            letter_count[letter] = 1
ERROR HANDLING
• Better yet to avoid any errors, so let’s fix it with an if/else instead
    for letter in word:
        if letter in letter_count:
            letter_count[letter] += 1
        else:
            letter_count[letter] = 1
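Put together with the outer loop, the if/else version runs as below. The second loop uses the built-in dict.get method as another way to avoid the KeyError. It is shown only as an alternative technique and was not part of the slides.

```python
words = ['shrubbery', 'coconut', 'witch', 'newt']

# Version 1: check for the key before adding to it.
letter_count = dict()
for word in words:
    for letter in word:
        if letter in letter_count:
            letter_count[letter] += 1
        else:
            letter_count[letter] = 1

# Version 2 (alternative): dict.get returns 0 when the key is missing.
letter_count_get = dict()
for word in words:
    for letter in word:
        letter_count_get[letter] = letter_count_get.get(letter, 0) + 1

print(letter_count)
```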
ERROR HANDLING: PRACTICAL APPLICATION
• Lamaholot: a Malayo-Polynesian (Austronesian) language spoken by
around 150,000 people in Indonesia
• We’re going to go through a wordlist
and find all the environments of each
speech sound
• We’ll have to deal with two types
of errors, KeyError and IndexError
ERROR HANDLING: PRACTICAL APPLICATION
• Linguistic environments:
• An "environment" in linguistics means the sounds immediately to the left and right of a sound.
• The environments for the sound [o] in the word [polo]:
• it occurs between [p] and [l]
• it occurs between [l] and the end of the word
• In linguistics, we would write the environments of [o] this way:
• p_l
• l_#
• The underscore marks the position of [o], and the # symbol means a word boundary (here, the end of the word)
ERROR HANDLING: PRACTICAL APPLICATION
• All of the environments for all the sounds in the word [kelopo]
• Environments of [k]
• #_e
• Environments of [e]
• k_l
• Environments of [l]
• e_o
• Environments of [o]
• l_p
• p_#
• Environments of [p]
• o_o
ERROR HANDLING: PRACTICAL APPLICATION
• Pseudocode:
    open a file
    make an empty words list
    for each line in file:
        strip and split by tab
        append the Lamaholot word to a list
    make an empty dictionary
    for each word in the list:
        for each letter in the word:
            find out the letters on either side
            add those environments to the dictionary for that letter
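One way the pseudocode could be written is sketched below. The two words come from the earlier slides. The file layout (Lamaholot word in the first tab-separated column) and the placeholder second column are assumptions, and sample_lines stands in for the real wordlist file.

```python
# Stand-in for the wordlist file.
# ASSUMPTION: the Lamaholot word is the first tab-separated column.
sample_lines = ['kelopo\tplaceholder\n', 'polo\tplaceholder\n']

words = []
for line in sample_lines:
    columns = line.strip().split('\t')
    words.append(columns[0])

environments = dict()
for word in words:
    for i in range(len(word)):
        letter = word[i]
        # Left side: word[-1] would silently give the LAST letter,
        # so the start of the word has to be checked explicitly.
        if i == 0:
            left = '#'
        else:
            left = word[i - 1]
        # Right side: going past the end raises an IndexError.
        try:
            right = word[i + 1]
        except IndexError:
            right = '#'
        environment = left + '_' + right
        # The first time we see a letter, it has no list yet (KeyError).
        try:
            environments[letter].append(environment)
        except KeyError:
            environments[letter] = [environment]

print(environments)
```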
NEW PYTHON BASIC: IMPORT
• Python has additional packages that aren’t immediately accessible in a program;
you need to import them
• To import a module, type import followed by the module name
• For example
    import string
• Now you can access things in the module using "dot-notation"
    print(string.punctuation)
    print(string.ascii_lowercase)

• By convention, import statements go at the very top of your code.
IMPORTING MODULES
• Another useful package/module is the os module
• OS stands for "operating system". This module contains useful functions for dealing with
files and folders.
• Here are two very useful functions:
    os.getcwd()
• Returns the "current working directory". This is the folder Python is being run from,
which is usually (but not always) the folder where the Python file is
    os.path.join(string1, string2, ...)
• Takes the strings and joins them with the operating system's path separator to make a file path
• These are often combined:
    path = os.path.join(os.getcwd(), 'data', 'turkish_words.txt')
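A short runnable sketch of the combination above. The file is not assumed to exist, so the sketch checks with os.path.exists before anyone tries to open it.

```python
import os

cwd = os.getcwd()
path = os.path.join(cwd, 'data', 'turkish_words.txt')
print(path)

# Checking first avoids a FileNotFoundError when opening the file.
if os.path.exists(path):
    print('file found')
else:
    print('no file at that path yet')
```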
PYTHON PACKAGES
• Importing gives us access to the wide world of things you can do with Python
• Many developers have made Python packages available
• We’ll be installing packages using pip
• pip is a tongue-in-cheek recursive acronym for "Pip Installs Packages"
• We use it from command prompt/terminal
• Let’s first have some fun with command prompt
• Using text input
NEW FUNCTION: INPUT()
• Since we’ll be working with command prompt to install packages, I thought it’d be
fun to play around for a minute with a command line interface
• Remember, using Sublime Text, which is a text editor, we can’t interact with our
script; we just program and run it
• Now that we’re using the command line, we can input text with the input() function
• Test it out in IDLE:
    print('please input your name:')
    name = input()
    print(name)
INTERACTING WITH CMD LINE
• Early computer games used text-based interfaces
• Today, we can recreate these types of games easily
• Let’s write a fun little text-based game:
• We’ll preset a path that the user must figure out
• The basic interaction asks the user to make a move and lets them know if they made the
right move or not
    correct_move = 'up'  # placeholder: the next move on the preset path
    print('make a move (up, down, left, right):')
    curr_move = input()
    if curr_move == correct_move:
        print('you got it right, you can move on!')
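A fuller sketch of the game, under stated assumptions: the function name play_game, the secret path, and the "try again" message are all made up. The moves here come from a scripted list so the example runs without anyone typing. To play for real, pass input as the second argument.

```python
def play_game(path, get_move):
    """Ask for moves until the whole preset path has been followed.

    Returns the number of wrong moves the player made.
    """
    wrong_moves = 0
    for correct_move in path:
        while True:
            print('make a move (up, down, left, right):')
            curr_move = get_move()
            if curr_move == correct_move:
                print('you got it right, you can move on!')
                break
            wrong_moves += 1
            print('not that way, try again')
    return wrong_moves


secret_path = ['up', 'right']

# Scripted moves stand in for a player typing at the command line.
scripted_moves = ['left', 'up', 'right']

def next_scripted_move():
    return scripted_moves.pop(0)

wrong = play_game(secret_path, next_scripted_move)
print('wrong moves:', wrong)
```

Calling play_game(secret_path, input) instead would read each move from the keyboard.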

INSTALLING NEW PYTHON PACKAGES
• We install Python packages using the pip install command
• Let’s install the Natural Language Toolkit (NLTK)
• In command prompt (PC)
• pip install nltk
• In terminal (Mac)
• pip3 install nltk
• You’ll see a bunch of output show up on the screen as it installs the package
• There are lots of packages you can install
• If there’s a specific project you’re working on, there’s a good chance someone has made
a package which can help!
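After installing, a quick way to check that Python can see the package is to try importing it. The sketch below reuses try/except from earlier in this class. The variable name nltk_installed is made up, and the script runs whether or not NLTK is installed.

```python
try:
    import nltk
    nltk_installed = True
except ImportError:
    # Raised when Python cannot find the package.
    nltk_installed = False

if nltk_installed:
    print('NLTK is ready to use')
else:
    print('NLTK not found: try the pip install command again')
```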
