Mapping Words to Properties Using Python Dictionaries
A tagged word of the form (word, tag) is an association between a word and a
part-of-speech tag.
The most natural way to store mappings in Python uses the so-called dictionary
data type.
Indexing Lists Versus Dictionaries
Lookup using words is familiar to anyone who has used a dictionary. Some more
examples are shown in the figure below.
In the case of a phonebook, we look up an entry using a name and get back a
number.
When we type a domain name in a web browser, the computer looks this up to
get back an IP address.
A word frequency table allows us to look up a word and find its frequency in a
text collection.
In all these cases, we are mapping from names to numbers, rather than the
other way around as with a list.
Python provides a dictionary data type that can be used for mapping between
arbitrary types.
To illustrate, we define pos to be an empty dictionary and then add four entries
to it, specifying the part-of-speech of some words.
Once we have populated the dictionary in this way, we can employ the keys to
retrieve values.
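A minimal sketch of these two steps follows. The four words and their tags are illustrative choices, not taken from the slide.

```python
# Start with an empty dictionary and add four entries,
# each mapping a word (key) to a part-of-speech tag (value).
pos = {}
pos['colorless'] = 'ADJ'
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'

# Use keys to retrieve values.
print(pos['ideas'])      # N
print(pos['colorless'])  # ADJ

# Looking up a key that was never added raises KeyError.
try:
    pos['green']
except KeyError as err:
    print('missing key:', err)
```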
With lists and strings, we can use len() to work out which integers are
legal indexes. How do we work out the legal keys for a dictionary?
If the dictionary is not too big, we can simply inspect its contents by evaluating
the variable pos, which shows the key-value pairs.
In Python 3.7 and later, the pairs appear in the order they were entered; earlier
versions did not guarantee any order. Either way, dictionaries are mappings rather
than sequences, so entries are looked up by key, not by position.
Alternatively, to just find the keys, we can either convert the dictionary to a list
or use the dictionary in a context where a list is expected, as the parameter of
sorted() or in a for loop.
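The three options just listed can be sketched as follows, assuming the same illustrative pos dictionary (written here as a literal for brevity):

```python
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

# Convert the dictionary to a list: this yields its keys.
keys = list(pos)
print(keys)

# sorted() also iterates over the keys.
ordered = sorted(pos)
print(ordered)  # ['colorless', 'furiously', 'ideas', 'sleep']

# A for loop (here inside a list comprehension) visits the keys as well.
ends_in_s = [word for word in pos if word.endswith('s')]
print(ends_in_s)  # ['colorless', 'ideas']
```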
As well as iterating over all keys in the dictionary with a for loop, we can use the
for loop to print each key together with its value, as we did for printing lists.
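A short sketch of such a loop, again assuming the illustrative pos dictionary:

```python
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

# Build one "word: tag" line per key, visiting the keys in sorted order.
lines = [word + ': ' + pos[word] for word in sorted(pos)]
for line in lines:
    print(line)
```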
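Finally, the dictionary methods keys(), values(), and items() allow us to access
the keys, values, and key-value pairs separately. In Python 3 these return view
objects rather than lists; wrap them in list() when a list is needed.
We can even sort the key-value tuples from items(), which orders them according
to their first element.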
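The three methods can be tried out as below. This is a sketch on the illustrative pos dictionary; the list() calls are only there to turn the views into lists.

```python
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

keys = list(pos.keys())      # the words
values = list(pos.values())  # the tags
items = list(pos.items())    # (word, tag) tuples

# Sorting the tuples orders them by their first element, the word.
sorted_items = sorted(pos.items())
for key, val in sorted_items:
    print(key + ':', val)
```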
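Now suppose we try to use a dictionary to store the fact that the word sleep can
be used as both a verb and a noun. Assigning 'V' and then 'N' to pos['sleep'] does
not keep both tags: the second assignment overwrites the first.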
In other words, there can be only one entry in the dictionary for 'sleep'.
However, there is a way of storing multiple values in that entry: we use a list
value, e.g., pos['sleep'] = ['N', 'V'].
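A brief sketch of both behaviours, the overwriting and the list-valued entry:

```python
pos = {}

# A second assignment to the same key replaces the earlier value.
pos['sleep'] = 'V'
pos['sleep'] = 'N'
after_overwrite = pos['sleep']
print(after_overwrite)  # N

# To keep both tags, store a list as the value of the single entry.
pos['sleep'] = ['N', 'V']
print(pos['sleep'])     # ['N', 'V']
```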
Defining Dictionaries
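There are a couple of ways to create a dictionary directly from key-value pairs:
the {key: value} literal syntax and the dict() constructor with keyword arguments.
We will normally use the first.
The dictionary keys must be immutable (more precisely, hashable) types, such as
strings and tuples. If we try to define a dictionary using a mutable key such as a
list, we get a TypeError.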
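Both ways of defining a dictionary, and the error raised by a mutable key, can be sketched as follows. The words, tags, and the tuple key are illustrative.

```python
# First way: a dictionary literal.
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

# Second way: dict() with keyword arguments (keys must be valid identifiers).
pos_kw = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')
print(pos == pos_kw)  # True

# A tuple is immutable, so it can serve as a key.
by_context = {('DET', 'right'): 'ADJ'}

# A list is mutable, so using it as a key raises TypeError.
try:
    bad = {['ideas', 'blogs', 'adventures']: 'N'}
except TypeError as err:
    message = str(err)
    print('TypeError:', message)
```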
Default Dictionaries
If we try to access a key that is not in a dictionary, we get a KeyError.
However, it’s often useful if a dictionary can automatically create an entry for
this new key and give it a default value, such as zero or the empty list.
Since Python 2.5, a special kind of dictionary called a defaultdict has been
available in the collections module.
In order to use it, we have to supply a parameter which can be used to create
the default value, e.g., int, float, str, list, dict, tuple.
Example.
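The original example is not preserved in this text; a minimal sketch of defaultdict with int and list as the default factories, using made-up words and counts, might look like this:

```python
from collections import defaultdict

# int() returns 0, so missing keys start with a zero count.
frequency = defaultdict(int)
frequency['colorless'] = 4
print(frequency['ideas'])  # 0

# list() returns [], so missing keys start with an empty list.
pos = defaultdict(list)
pos['sleep'] = ['NOUN', 'VERB']
print(pos['ideas'])        # []

# Note: looking up a missing key also creates the entry.
print('ideas' in frequency)  # True
```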
The preceding examples specified the default value of a dictionary entry to be
the default value of a particular data type.
However, we can specify any default value we like, simply by providing the
name of a function that can be called with no arguments to create the required
value.
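As a sketch, a lambda expression is one convenient way to supply such a no-argument function. The default tag 'NOUN' and the words are illustrative.

```python
from collections import defaultdict

# Every missing key gets the default value 'NOUN'.
pos = defaultdict(lambda: 'NOUN')
pos['colorless'] = 'ADJ'

print(pos['blog'])        # NOUN (the entry is created on first access)
print(list(pos.items()))  # [('colorless', 'ADJ'), ('blog', 'NOUN')]
```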
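Language processing tasks such as tagging can perform better with a fixed
vocabulary and a guarantee that no new words will appear.
To get this, we need to create a default dictionary that maps each word to its
replacement: frequent words map to themselves, and every other word maps to a
special out-of-vocabulary token.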
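One way this replacement mapping could be built is sketched below. The toy sentence, the vocabulary size of 3, and the token name 'UNK' are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

tokens = 'the cat sat on the mat and the dog sat on the log'.split()

# Keep only the most frequent words as the fixed vocabulary.
vocab_size = 3
frequent = [word for word, _ in Counter(tokens).most_common(vocab_size)]

# Frequent words map to themselves; anything else falls back to 'UNK'.
mapping = defaultdict(lambda: 'UNK')
for word in frequent:
    mapping[word] = word

replaced = [mapping[word] for word in tokens]
print(replaced)
```

After this step, the text contains at most vocab_size distinct words plus the out-of-vocabulary token.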
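We can also use a default dictionary to count how often each part-of-speech tag
occurs in a text. With defaultdict(int), a tag that hasn’t been seen before has a
zero count by default.
Each time we encounter a tag, we increment its count using the += operator.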
Example.
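The original example is not preserved in this text. The sketch below counts tags with a defaultdict(int) and then lists them by decreasing frequency; the small hand-tagged word list stands in for a real tagged corpus.

```python
from collections import defaultdict
from operator import itemgetter

# Toy tagged text (an assumption; a real corpus would be much larger).
tagged_words = [('the', 'DET'), ('dog', 'NOUN'), ('saw', 'VERB'),
                ('a', 'DET'), ('cat', 'NOUN'), ('in', 'ADP'),
                ('the', 'DET'), ('park', 'NOUN'), ('.', '.')]

# Unseen tags start at zero; each occurrence adds one.
counts = defaultdict(int)
for word, tag in tagged_words:
    counts[tag] += 1

# Sort the (tag, frequency) pairs by frequency, highest first.
ranked = sorted(counts.items(), key=itemgetter(1), reverse=True)
print(ranked)
print([tag for tag, freq in ranked])
```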
The first parameter of sorted() is the items to sort, which is a list of tuples
consisting of a POS tag and a frequency.
The second parameter specifies the sort key using a function itemgetter().
The last parameter of sorted() specifies that the items should be returned in
reverse order, i.e., decreasing values of frequency.
Accumulating words under a key in a defaultdict(list) is such a common task that
NLTK provides a more convenient way of creating one, in the form of nltk.Index().
We can use default dictionaries with complex keys and values.
Let’s study the range of possible tags for a word, given the word itself and the
tag of the previous word.
The example uses a dictionary whose default value for an entry is a dictionary
(whose default value is int(), i.e., zero).
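A minimal sketch of this nested default dictionary follows. The hand-tagged word list is made up for illustration, and zip() is used here to form the bigrams in place of a corpus reader.

```python
from collections import defaultdict

# Toy tagged text (an assumption standing in for a tagged corpus).
tagged_words = [('the', 'DET'), ('right', 'ADJ'), ('answer', 'NOUN'),
                ('is', 'VERB'), ('on', 'ADP'), ('the', 'DET'),
                ('right', 'NOUN'), ('.', '.'), ('turn', 'VERB'),
                ('right', 'ADV'), ('at', 'ADP'), ('the', 'DET'),
                ('right', 'ADJ'), ('time', 'NOUN')]

# Outer default: a new inner dictionary; inner default: int(), i.e. zero.
pos = defaultdict(lambda: defaultdict(int))

# Each bigram is a pair of (word, tag) pairs.
for (w1, t1), (w2, t2) in zip(tagged_words, tagged_words[1:]):
    pos[(t1, w2)][t2] += 1

# Look up with a compound key (previous tag, current word).
print(dict(pos[('DET', 'right')]))  # {'ADJ': 2, 'NOUN': 1}
```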
Notice how we iterated over the bigrams of the tagged corpus, processing a pair
of word-tag pairs for each iteration.
Each time through the loop we updated our pos dictionary’s entry for (t1, w2), a
tag and its following word.
When we look up an item in pos we must specify a compound key, and we get
back a dictionary object.
A POS tagger could use such information to decide that the word right, when
preceded by a determiner, should be tagged as ADJ.
Inverting a Dictionary
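Dictionaries support efficient lookup, so long as you want to get the value for
a given key. Finding a key given a value is slower and more cumbersome, so if we
expect to do this kind of reverse lookup often, it helps to construct a dictionary
that maps values to keys.
In the case that no two keys have the same value, this is an easy thing to do.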
We just get all the key-value pairs in the dictionary, and create a new dictionary
of value-key pairs.
The example below also illustrates another way of initializing a dictionary pos
with key-value pairs.
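A sketch of this simple inversion, assuming the illustrative pos dictionary in which every tag occurs exactly once:

```python
# Initialize pos directly from key-value pairs.
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

# Swap each (key, value) pair to build the reverse mapping.
pos2 = {value: key for key, value in pos.items()}

print(pos2['N'])  # ideas
```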
Let’s first make our part-of-speech dictionary a bit more realistic and add some
more words to pos using the dictionary update() method, to create the situation
where multiple keys have the same value.
Then the technique just shown for reverse lookup will no longer work.
Instead, we have to use append() to accumulate the words for each part-of-
speech, as follows.
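The steps above can be sketched as follows; the extra words passed to update() are illustrative.

```python
from collections import defaultdict

pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

# Add more words so that several keys share the same value.
pos.update({'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'})

# Naive inversion now loses words: one survivor per tag.
lossy = {value: key for key, value in pos.items()}
print(len(lossy))  # 4

# Accumulate every word under its tag instead.
pos2 = defaultdict(list)
for key, value in pos.items():
    pos2[value].append(key)

print(pos2['ADV'])  # ['furiously', 'peacefully']
```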
Now we have inverted the pos dictionary, and can look up any part-of-speech
and find all words having that part-of-speech.
We can do the same thing even more simply using NLTK’s support for indexing,
as follows.
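The original example is not preserved in this text. The sketch below uses a hypothetical helper, build_index(), as a pure-Python stand-in, on the assumption that nltk.Index() collects (key, value) pairs into a defaultdict(list).

```python
from collections import defaultdict

def build_index(pairs):
    """Hypothetical stand-in: group the values of (key, value) pairs by key."""
    index = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return index

pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV',
       'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'}

# Feed in (tag, word) pairs to index words by part-of-speech.
pos2 = build_index((value, key) for key, value in pos.items())
print(pos2['ADV'])  # ['furiously', 'peacefully']

# With NLTK installed, the corresponding call is assumed to be:
#     pos2 = nltk.Index((value, key) for (key, value) in pos.items())
```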