Lecture 6 - CS50's Introduction to Artificial Intelligence with Python
Lecture 6
Language
So far in the course, we have had to shape tasks and data so that an AI could process them.
Today, we will look at how an AI can be constructed to process human language.
Natural Language Processing spans all tasks where the AI gets human language as input. The
following are a few examples of such tasks:
automatic summarization, where the AI is given text as input and it produces a summary
of the text as output.
information extraction, where the AI is given a corpus of text and the AI extracts data as
output.
language identification, where the AI is given text and returns the language of the text as
output.
machine translation, where the AI is given text in a source language and it outputs the
translation in the target language.
named entity recognition, where the AI is given text and it extracts the names of the
entities in the text (for example, names of companies).
speech recognition, where the AI is given speech and it produces the same words in text.
text classification, where the AI is given text and it needs to classify it as some type of
text.
word sense disambiguation, where the AI needs to choose the right meaning of a word
that has multiple meanings (e.g. bank means both a financial institution and the ground
on the sides of a river).
Semantics is the meaning of words or sentences. While the sentence “Just before nine o’clock
Sherlock Holmes stepped briskly into the room” is syntactically different from “Sherlock Holmes
stepped briskly into the room just before nine o’clock,” their content is effectively identical.
Similarly, although the sentence “A few minutes before nine, Sherlock Holmes walked quickly
into the room” uses different words from the previous sentences, it still carries a very similar
meaning. Moreover, a sentence can be perfectly grammatical while being completely
nonsensical, as in Chomsky’s example, “Colorless green ideas sleep furiously.” To be able to parse
human speech and produce it, the AI needs a command of semantics.
Context-Free Grammar
Formal Grammar is a system of rules for generating sentences in a language. In Context-Free
Grammar, the text is abstracted from its meaning to represent the structure of the sentence
using formal grammar. Let’s consider the example sentence “She saw the city.” This is a simple
grammatical sentence, and we would like to generate a syntax tree representing its structure.
We start by assigning each word its part of speech. She and city are nouns, which we will mark
as N. Saw is a verb, which we will mark as V. The is a determiner, marking the following noun as
definite or indefinite, and we will mark it as D. Now, the above sentence can be rewritten as
N V D N
So far, we have abstracted each word from its semantic meaning to its part of speech. However,
words in a sentence are connected to each other, and to understand the sentence we must
understand how they connect. A noun phrase (NP) is a group of words that connect to a noun.
For example, the word she is a noun phrase in this sentence. In addition, the words the city also
form a noun phrase, consisting of a determiner and a noun. A verb phrase (VP) is a group of
words that connect to a verb. The word saw is a verb phrase in itself. However, the words saw the
city also make a verb phrase. In this case, it is a verb phrase consisting of a verb and a noun
phrase, which in turn consists of a determiner and a noun. Finally, the whole sentence (S) can be
represented as a noun phrase (“she”) followed by a verb phrase (“saw the city”).
Using formal grammar, the AI is able to represent the structure of sentences. In the grammar we
have described, there are enough rules to represent the simple sentence above. To represent
more complex sentences, we will have to add more rules to our formal grammar.
nltk
As is often the case in Python, multiple libraries have been written to implement the idea
above. nltk (Natural Language Toolkit) is one such library. To analyze the sentence from above,
we will provide the algorithm with rules for the grammar:
import nltk
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> D N | N
VP -> V | V NP
D -> "the"
N -> "she" | "city"
V -> "saw"
""")
parser = nltk.ChartParser(grammar)
Similar to what we did above, we define which components can be included in which others. A
sentence can include a noun phrase and a verb phrase, the phrases themselves can consist of
other phrases, nouns, verbs, etc., and, finally, each part of speech maps to specific words in the
language.
After giving the parser an input sentence split into a list of words, we can print the resulting
syntactic tree (pretty_print) and also generate a graphic representation (draw), as sketched below.
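For illustration, here is a minimal sketch of that usage (not the course’s exact source file): we read a sentence, split it into lowercase words, and ask the parser for its trees.

sentence = input("Sentence: ").lower().split()
try:
    for tree in parser.parse(sentence):
        tree.pretty_print()   # print the syntax tree as text
        tree.draw()           # open a graphical rendering of the tree
except ValueError:
    print("No parse tree possible.")

Note that parser.parse raises a ValueError if the sentence contains words the grammar does not cover.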
n-grams
An n-gram is a sequence of n items from a sample of text. In a character n-gram, the items are
characters, and in a word n-gram the items are words. A unigram, a bigram, and a trigram are
sequences of one, two, and three items, respectively. In the following sentence, the first three
word trigrams are “how often have,” “often have I,” and “have I said.”
“How often have I said to you that when you have eliminated the impossible whatever remains,
however improbable, must be the truth?”
n-grams are useful for text processing. While the AI hasn’t necessarily seen the whole sentence
before, it has most likely seen parts of it, like “have I said.” Since some words occur together more
often than others, it is also possible to predict the next word with some probability. For
example, your smartphone suggests words to you based on a probability distribution derived
from the last few words you typed. Thus, a helpful step in natural language processing is
breaking a sentence into n-grams.
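As a quick sketch (assuming nltk is installed; the words below are the Holmes quote above, already lowercased and split), nltk’s ngrams helper can produce these sequences:

from nltk.util import ngrams

# The quote above, lowercased and split into words for simplicity
words = ("how often have i said to you that when you have eliminated "
         "the impossible whatever remains however improbable must be the truth").split()

# Word trigrams: every sequence of three consecutive words
trigrams = list(ngrams(words, 3))
print(trigrams[:3])   # [('how', 'often', 'have'), ('often', 'have', 'i'), ('have', 'i', 'said')]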
Tokenization
Tokenization is the task of splitting a sequence of characters into pieces (tokens). Tokens can be
words as well as sentences, in which case the task is called word tokenization or sentence
tokenization. We need tokenization to be able to look at n-grams, since those rely on sequences
of tokens. We start by splitting the text into words based on the space character. While this is a
good start, this method is imperfect because we end up with words that include punctuation,
such as “remains,”. One option is to remove the punctuation. However, then we face additional
challenges, such as words with apostrophes (e.g. “o’clock”) and hyphens (e.g. “pearl-grey”).
Additionally, some punctuation is important for sentence structure, like periods. However, we
need to be able to distinguish between the period at the end of the abbreviation “Mr.” and a
period at the end of a sentence. Dealing with these questions is the process of tokenization. In the end,
once we have our tokens, we can start looking at n-grams.
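As a sketch of this step (nltk’s tokenizers require downloading the punkt data first), word and sentence tokenization might look like this; the example sentence is illustrative:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")   # tokenizer models used by nltk

text = "Just before nine o'clock Mr. Holmes stepped briskly into the room. He sat down."

print(sent_tokenize(text))   # two sentences; the period after "Mr." should not split the text
print(word_tokenize(text))   # punctuation becomes separate tokens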
Markov Models
As discussed in previous lectures, Markov models consist of nodes, the value of each of which
has a probability distribution based on a finite number of previous nodes. Markov models can be
used to generate text. To do so, we train the model on a text, and then establish probabilities
for the n-th token in every n-gram based on the n - 1 tokens preceding it. For example, using
trigrams, after the Markov model has two words, it can choose a third one from a probability
distribution based on the first two. Then, it can choose a fourth word from a probability
distribution based on the second and third words. To see an implementation of such a model
using nltk, refer to generator.py in the source code, where our model learns to generate
Shakespeare-sounding sentences. Eventually, using Markov models, we are able to generate text
that is often grammatical and sounds superficially similar to human language output.
However, these sentences lack actual meaning and purpose.
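The course’s generator.py relies on a library to do this; purely as an illustration of the idea, here is a minimal word-level trigram sketch in plain Python, trained on a tiny hypothetical text:

import random
from collections import defaultdict

# Hypothetical training text (normally this would be a large corpus)
words = ("how often have i said to you that when you have eliminated "
         "the impossible whatever remains must be the truth").split()

# Map each pair of consecutive words to the words observed after it
model = defaultdict(list)
for first, second, third in zip(words, words[1:], words[2:]):
    model[(first, second)].append(third)

# Generate text: repeatedly sample the next word from the distribution
# conditioned on the previous two words
current = ("how", "often")
output = list(current)
for _ in range(15):
    options = model.get(current)
    if not options:
        break
    next_word = random.choice(options)
    output.append(next_word)
    current = (current[1], next_word)

print(" ".join(output))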
Bag-of-Words Model
Bag-of-words is a model that represents text as an unordered collection of words. This model
ignores syntax and considers only the meanings of the words in the sentence. This approach is
helpful in some classification tasks, such as sentiment analysis (another classification task
would be distinguishing regular email from spam email). Sentiment analysis can be used, for
instance, in product reviews, categorizing reviews as positive or negative. Consider the four
example reviews shown in lecture: based only on the words in each review and ignoring the
grammar, we can see that the first and third are positive (“loved,” “fun,” “best”) and the second
and fourth are negative (“broke,” “cheap,” “flimsy”).
Naive Bayes
Naive Bayes is a technique that can be used in sentiment analysis with the bag-of-words
model. In sentiment analysis, we are asking “What is the probability that the sentence is
positive/negative, given the words in the sentence?” Answering this question requires computing
conditional probability, and it is helpful to recall Bayes’ rule from lecture 2:
P(b | a) = P(a | b) * P(b) / P(a)
Now, we would like to use this formula to find P(sentiment | text), or, for example, P(positive |
“my grandson loved it”). We start by tokenizing the input, such that we end up with P(positive |
“my”, “grandson”, “loved”, “it”). Applying Bayes’ rule directly, we get the following expression:
P(“my”, “grandson”, “loved”, “it” | positive)*P(positive)/P(“my”, “grandson”, “loved”, “it”). This
complicated expression will give us the precise answer to P(positive | “my”, “grandson”, “loved”,
“it”).
However, we can simplify the expression if we are willing to get an answer that’s not equal, but
proportional to P(positive | “my”, “grandson”, “loved”, “it”). Later on, knowing that the probability
distribution needs to sum up to 1, we can normalize the resulting value into an exact
probability. This means that we can simplify the expression above to the numerator only: P(“my”,
“grandson”, “loved”, “it” | positive)*P(positive). Again, we can simplify this expression based on the
knowledge that the conditional probability of a given b, multiplied by the probability of b, is
equal to the joint probability of a and b. Thus, we get the following expression for our
probability: P(positive, “my”, “grandson”, “loved”, “it”). Calculating this joint probability, however,
is complicated, because the probability of each word is conditioned on the probabilities of the
words preceding it. By the chain rule, it requires us to compute P(positive)*P(“my” |
positive)*P(“grandson” | positive, “my”)*P(“loved” | positive, “my”, “grandson”)*P(“it” | positive,
“my”, “grandson”, “loved”).
Here is where we use Bayes’ rule naively: we assume that the probability of each word is
independent of the other words, given the sentiment. This is not true, but despite this
imprecision, Naive Bayes produces a good sentiment estimate. Using this assumption, we end up
with the following probability: P(positive)*P(“my” | positive)*P(“grandson” | positive)*P(“loved” |
positive)*P(“it” | positive), which is not that difficult to calculate. P(positive) is the number of
positive samples divided by the total number of samples. P(“loved” | positive) is equal to the
number of positive samples containing the word “loved” divided by the number of positive
samples. Let’s consider the example from the lecture, with smiling and frowning emoji standing
in for the words “positive” and “negative”.
In that example, one table holds the conditional probability of each word occurring in a
sentence given that the sentence is positive or negative, and a smaller table holds the prior
probability of a positive or a negative sentence. Multiplying these values gives two resulting
numbers. At this point, they are in proportion to each other, but they don’t tell us much in terms
of probabilities. To get the probabilities, we need to normalize the values, arriving at P(positive)
= 0.6837 and P(negative) = 0.3163. The strength of Naive Bayes is that it is sensitive to words that occur more often in one
type of sentence than in the other. In our case, the word “loved” occurs much more often in
positive sentences, which makes the whole sentence more likely to be positive than negative. To
see an implementation of sentiment assessment using Naive Bayes with the nltk library, refer to
sentiment.py.
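Independently of sentiment.py, here is a minimal sketch of the same idea using nltk’s built-in NaiveBayesClassifier, with a handful of hypothetical, hand-labeled reviews as training data:

import nltk

# Hypothetical labeled training sentences
train = [
    ("my grandson loved it", "positive"),
    ("so much fun", "positive"),
    ("one of the best purchases", "positive"),
    ("product broke after a few days", "negative"),
    ("kind of cheap and flimsy", "negative"),
]

def features(text):
    # Bag-of-words features: each word present maps to True
    return {word: True for word in text.lower().split()}

classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train]
)

sample = features("my grandson said it was fun")
distribution = classifier.prob_classify(sample)
for label in distribution.samples():
    print(label, distribution.prob(label))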
One problem that we can run into is that some words may never appear in a certain type of
sentence. Suppose none of the positive sentences in our sample had the word “grandson.” Then,
P(“grandson” | positive) = 0, and when computing the probability of the sentence being positive
we will get 0. However, this is not the case in reality (not all sentences mentioning grandsons
are negative). One way to address this problem is with Additive Smoothing, where we add a
value α to each count in our distribution to smooth the data. This way, even if a certain count is
0, by adding α to it we won’t be multiplying the whole probability for a positive or negative
sentence by 0. A specific type of additive smoothing, Laplace Smoothing, adds 1 to each count in
our distribution, pretending that all values have been observed at least once.
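One common way to write this down (the exact formulation varies; this sketch assumes a vocabulary of vocabulary_size distinct words) is:

def smoothed_probability(word_count, class_word_total, vocabulary_size, alpha=1):
    # Additive smoothing: with alpha = 1 this is Laplace smoothing, so a word
    # never seen with this class still gets a small nonzero probability
    return (word_count + alpha) / (class_word_total + alpha * vocabulary_size)

# A word that never appeared in positive samples no longer yields probability 0
print(smoothed_probability(word_count=0, class_word_total=200, vocabulary_size=1000))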
Information Retrieval
Information retrieval is the task of finding relevant documents in response to a user query. To
achieve this task, we use topic modeling to discover the topics for a set of documents. How can
the AI go about extracting the topics of documents? One way to do so is by looking at term
frequency, which is simply counting how many times a term appears in a document. The idea
behind this is that key terms and important ideas are likely to repeat. To put term frequencies
to work, we will use tf-idf.
tf-idf
We can use tf-idf for information retrieval tasks. In lecture, we hoped to use term frequency to
establish the topics of each Sherlock Holmes story in the corpus we had.
However, when we ran the counting algorithm naively, the top-5 most common words were
function words, words that are used for syntactic purposes in the sentence and not as
independent units of meaning, such as “the,” “by,” “and,” “which,” “yet,” etc. By simply counting all
the words by frequency of appearance we didn’t get the unique words that repeat in the corpus
and define its topics, but the words that are the most common in the English language in
general. The words that we are after are content words, words that carry meaning
independently, such as “crime,” “brothers,” “demons,” “gentle,” “meek,” etc.
We can run the algorithm again, excluding words that we define as function words in the
English language. Now the output becomes more indicative of the content of each document.
However, in the case of Sherlock Holmes, the most common word in each document is,
unsurprisingly, Holmes. Since we are searching a corpus of Sherlock Holmes stories, having
Holmes as one of the keywords tells us nothing new about each story. This brings us to the idea
of Inverse Document Frequency, which is a measure of how common or rare a word is across
documents in a corpus. It is usually computed by the following equation:
idf(word) = log(total number of documents / number of documents containing the word)
Note that when a word appears in all documents, the value inside the logarithm is 1, and
log(1) = 0. This is where the name tf-idf comes from: it stands for Term Frequency-Inverse
Document Frequency. The idea is to multiply the term frequency of
each word by the inverse document frequency, thus getting a value for each word. The more
common the word is in one document and the fewer documents it appears in, the higher its
value will be. Conversely, the less common the word is in a document and the more common
it is across documents, the lower its value. This lets us get as output a list of words per
document that are likely to span the main topics that define the document.
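To make the computation concrete, here is a small sketch in plain Python over a hypothetical toy corpus (the course’s own version works over the full Sherlock Holmes texts):

import math

# Hypothetical toy corpus: document name -> list of words
corpus = {
    "story1": ["holmes", "examined", "the", "crime", "scene"],
    "story2": ["holmes", "spoke", "to", "the", "two", "brothers"],
    "story3": ["the", "gentle", "and", "meek", "brothers"],
}

def tf_idf(corpus):
    """Map (document, word) to its tf-idf score."""
    num_documents = len(corpus)
    scores = {}
    for name, words in corpus.items():
        for word in set(words):
            tf = words.count(word)                                         # term frequency
            containing = sum(1 for doc in corpus.values() if word in doc)  # document frequency
            idf = math.log(num_documents / containing)                     # inverse document frequency
            scores[(name, word)] = tf * idf
    return scores

# "the" appears in every document, so its idf (and tf-idf) is 0;
# a word like "crime" appears in only one document, so it scores highest there
for (name, word), score in sorted(tf_idf(corpus).items(), key=lambda item: -item[1])[:5]:
    print(name, word, round(score, 3))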
Information Extraction
Information Extraction is the task of extracting knowledge from documents. So far, treating text
with the bag-of-words approach was helpful when we wanted the AI to perform simple tasks,
such as recognizing sentiment in a sentence as positive or negative, or retrieving the key words
in a document. However, for a task like information extraction, where we need the AI to
understand the connections between different words, the bag-of-words approach won’t get us
very far.
A possible task of information extraction can take the form of giving a document to the AI as
input and getting a list of companies and the years when they were founded as output. One way
we can go about this is providing the AI with a template, such as “when {company} was
founded in {year}.” This will not get us perfect results, because not all information about
companies and their founding years is written in precisely this format. Still, if the dataset
is large enough, we are likely to come across sentences of precisely this form, which will allow
the AI to extract this knowledge (see the sketch after this paragraph). A more advanced approach would be to give the AI an
abstracted example like “Facebook, 2004,” and let it develop its own model to extract the data.
By going through enough data, the AI will be able to infer possible templates for extracting
information similar to the example.
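As a sketch of the template idea (the sentence and the regular expression here are illustrative, not taken from the course’s code), a hard-coded template can be expressed as a regular expression:

import re

# Hypothetical document text
text = ("Remember when Facebook was founded in 2004? "
        "That was ten years after Amazon was founded in 1994.")

# The template "when {company} was founded in {year}" as a regular expression
template = re.compile(r"when (\w+) was founded in (\d{4})", re.IGNORECASE)

# Only the sentence matching the exact template is captured ("Facebook", "2004");
# "Amazon was founded in 1994" is missed because it lacks the word "when"
for company, year in template.findall(text):
    print(company, year)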
Using templates, even if self-generated by the AI, is helpful in tasks like Information Extraction.
However, if we want the AI to be able to produce text like a human, the AI needs to be able to
understand not just templates, but how all words in the language relate to each other in their
meanings.
WordNet
WordNet is a database similar to a dictionary, where words are given definitions as well as
broader categories. For example, the word “house” will yield multiple definitions, one of which is
“a dwelling that serves as living quarters for one or more families.” This definition is paired with
the categories “building” and “dwelling.” Having access to these broader categories will allow us
to make connections between words that are semantically similar.
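Using nltk’s interface to WordNet (it requires downloading the wordnet corpus first), a quick sketch of such a lookup might be:

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet")   # the WordNet data itself

for synset in wordnet.synsets("house"):
    print(synset.name(), "-", synset.definition())
    # Hypernyms are the broader categories a sense belongs to
    print("   broader:", [hypernym.name() for hypernym in synset.hypernyms()])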
Word Representation
We want to represent word meanings in our AI. As we’ve seen before, it is convenient to provide
input to the AI in the form of numbers. One way to go about this is by using One-Hot
Representation, where each word is represented with a vector that consists of as many values
as we have words. Except for a single value in the vector that is equal to 1, all other values are
equal to 0. We can differentiate words by which of the values is 1, ending up with a
unique vector per word. For example, the sentence “He wrote a book” can be represented as four
vectors:
[1, 0, 0, 0] (he)
[0, 1, 0, 0] (wrote)
[0, 0, 1, 0] (a)
[0, 0, 0, 1] (book)
However, while this representation works in a world with four words, if we want to represent
words from a dictionary, where we can have 50,000 words, we will end up with 50,000 vectors of
length 50,000. This is incredibly inefficient. Another problem with this kind of representation is
that we are unable to represent similarity between words like “wrote” and “authored.” Instead,
we turn to the idea of Distributed Representation, where meaning is distributed across multiple
values in a vector. With distributed representation, each word is represented by a vector with a
limited number of values (far fewer than 50,000), each of which is a real number.
This allows us to generate a unique representation for each word while using smaller vectors.
Additionally, now we are able to represent similarity between words by how different the values
in their vectors are.
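For instance, one common way to compare such vectors is cosine similarity; the vectors and values below are made up purely for illustration:

import math

# Hypothetical distributed representations (values invented for illustration)
wrote    = [0.20, -0.40, 0.70, 0.10]
authored = [0.25, -0.35, 0.65, 0.05]
book     = [-0.60, 0.30, 0.10, 0.80]

def cosine_similarity(u, v):
    # Vectors pointing in similar directions give a value close to 1
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(wrote, authored))   # high: similar words
print(cosine_similarity(wrote, book))       # lower: less related words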
“You shall know a word by the company it keeps” is an idea by J. R. Firth, an English linguist.
Following this idea, we can come to define words by the words around them. For example, there
is a limited set of words that we can use to complete the sentence “for ___ he ate.” These are
probably words like “breakfast,” “lunch,” and “dinner.” This brings us to the conclusion that by
considering the environment in which a certain word tends to appear, we can infer the meaning
of the word.
word2vec
word2vec is an algorithm for generating distributed representations of words. It does so by
Skip-Gram Architecture, which is a neural network architecture for predicting context given a
target word. In this architecture, the neural network has an input unit for every target word. A
smaller, single hidden layer (e.g. 50 or 100 units, though this number is flexible) will generate
values that represent the distributed representations of words. Every unit in this hidden layer is
connected to every unit in the input layer. The output layer will generate words that are likely to
appear in a similar context as the target word. Similar to what we saw in the last lecture, this
network needs to be trained on a training dataset using the backpropagation algorithm.
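The course’s demo works with vectors that were already trained; purely as an illustration, here is a sketch of training skip-gram vectors with the third-party gensim library (assuming gensim 4.x and a hypothetical toy corpus; real training needs far more text):

from gensim.models import Word2Vec

# Hypothetical toy corpus: a list of tokenized sentences
sentences = [
    ["sherlock", "holmes", "stepped", "into", "the", "room"],
    ["holmes", "walked", "quickly", "into", "the", "room"],
    ["for", "breakfast", "he", "ate", "toast"],
    ["for", "dinner", "he", "ate", "soup"],
]

# sg=1 selects the skip-gram architecture; vector_size is the length of each
# word's distributed representation (the hidden layer size)
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv["holmes"])                  # the learned vector for "holmes"
print(model.wv.most_similar("breakfast"))  # words whose vectors are nearby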
This neural network turns out to be quite powerful. At the end of the process, every word ends
up being just a vector, or a sequence of numbers. For example,