Lecture 4 - CS50's Introduction to Artificial Intelligence with Python
Lecture 4
Machine Learning
Machine learning provides a computer with data, rather than explicit instructions. Using these
data, the computer learns to recognize patterns and becomes able to execute tasks on its own.
Supervised Learning
Supervised learning is a task where a computer learns a function that maps inputs to outputs
based on a dataset of input-output pairs.
There are multiple tasks under supervised learning, and one of those is Classification. This is a
task where the function maps an input to a discrete output. For example, given some
information on humidity and air pressure for a particular day (input), the computer decides
whether it will rain that day or not (output). The computer does this after training on a dataset
with multiple days where humidity and air pressure are already mapped to whether it rained or
not.
This task can be formalized as follows. We observe nature, where a function f(humidity, pressure)
maps the input to a discrete value, either Rain or No Rain. This function is hidden from us, and it
is probably affected by many other variables that we don’t have access to. Our goal is to create a
function h(humidity, pressure) that can approximate the behavior of function f. Such a task can be
visualized by plotting days on the dimensions of humidity and pressure (the input), coloring each
data point in blue if it rained that day and in red if it didn’t rain that day (the output). The white
data point has only the input, and the computer needs to figure out the output.
Nearest-Neighbor Classification
One way of solving a task like the one described above is by assigning the variable in question
the value of the closest observation. So, for example, the white dot on the graph above would
be colored blue, because the nearest observed dot is blue as well. This might work well
sometimes, but consider the graph below.
Following the same strategy, the white dot should be colored red, because the nearest
observation to it is red as well. However, looking at the bigger picture, it looks like most of the
other observations around it are blue, which might give us the intuition that blue is a better
prediction in this case, even though the closest observation is red.
One way to get around the limitations of nearest-neighbor classification is by using k-nearest-
neighbors classification, where the dot is colored based on the most frequent color of the k
nearest neighbors. It is up to the programmer to decide what k is. Using a 3-nearest neighbors
classification, for example, the white dot above will be colored blue, which intuitively seems
like a better decision.
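As a concrete illustration, here is a minimal sketch of k-nearest-neighbors classification in Python (the function name and the data layout are illustrative, not from the lecture's source code):

from collections import Counter

def knn_classify(point, data, k=3):
    # data is a list of ((x, y), label) pairs; classify `point` by the
    # most frequent label among its k nearest neighbors
    def distance(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    neighbors = sorted(data, key=lambda item: distance(point, item[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]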
Perceptron Learning
Another way of going about a classification problem, as opposed to the nearest-neighbor
strategy, is looking at the data as a whole and trying to create a decision boundary. In two-
dimensional data, we can draw a line between the two types of observations. Every additional
data point will be classified based on the side of the line on which it is plotted.
The drawback to this approach is that data are messy, and it is rare that one can draw a line that
neatly divides the observations into two classes without any mistakes. Often, we will
compromise, drawing a boundary that separates the observations correctly more often than not,
but still occasionally misclassifies them.
In this case, the inputs
x₁ = Humidity
x₂ = Pressure
will be given to a hypothesis function h(x₁, x₂), which will output its prediction of whether it is
going to rain that day or not. It will do so by checking on which side of the decision boundary
the observation falls. Formally, the function will weight each of the inputs and add a
constant, ending in a linear expression of the following form:
w₀ + w₁x₁ + w₂x₂
Often, the output variable will be coded as 1 and 0, where if the equation yields more than 0,
the output is 1 (Rain), and 0 otherwise (No Rain).
The weights and values are represented by vectors, which are sequences of numbers (which can
be stored in lists or tuples in Python). We produce a Weight Vector w: (w₀, w₁, w₂), and getting to
the best weight vector is the goal of the machine learning algorithm. We also produce an Input
Vector x: (1, x₁, x₂).
We take the dot product of the two vectors. That is, we multiply each value in one vector by the
corresponding value in the second vector, arriving at the expression above: w₀ + w₁x₁ + w₂x₂. The
first value in the input vector is 1 so that, when multiplied by the weight w₀, it preserves w₀ as a
constant term.
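For instance, a minimal sketch of this computation (the weights and inputs are made-up values):

w = [0.5, -2.0, 3.0]   # weight vector (w₀, w₁, w₂)
x = [1, 0.7, 0.2]      # input vector (1, humidity, pressure)

# dot product: w₀·1 + w₁·x₁ + w₂·x₂
prediction = sum(wi * xi for wi, xi in zip(w, x))
print(prediction)      # 0.5 - 1.4 + 0.6 = -0.3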
Thus, we can represent our hypothesis function in the following way:
h(x₁, x₂) = 1 if w₀ + w₁x₁ + w₂x₂ ≥ 0, and 0 otherwise
Since the goal of the algorithm is to find the best weight vector, when the algorithm encounters
new data it updates the current weights. It does so using the perceptron learning rule:
wᵢ = wᵢ + α(y − h(x₁, x₂)) × xᵢ
The important takeaway from this rule is that for each data point, we adjust the weights to
make our function more accurate. The details, which are not as critical to our point, are that
each weight is set to be equal to itself plus some value in parentheses. Here, y stands for
the observed value while the hypothesis function stands for the estimate. If they are identical,
this whole term is equal to zero, and thus the weight is not changed. If we underestimated
(calling No Rain while Rain was observed), then the value in the parentheses will be 1 and the
weight will increase by the value of xᵢ scaled by α, the learning rate. If we overestimated
(calling Rain while No Rain was observed), then the value in the parentheses will be -1 and the
weight will decrease by the value of xᵢ scaled by α. The higher α, the stronger the influence each
new event has on the weight.
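A minimal sketch of this update rule in Python, assuming each input is given as a vector with a leading 1 (alpha and all names here are illustrative):

def update_weights(weights, x, y, alpha=0.1):
    # hypothesis: 1 if the dot product crosses the threshold, else 0
    prediction = 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else 0
    # perceptron learning rule: wᵢ = wᵢ + α(y - prediction) · xᵢ
    return [w + alpha * (y - prediction) * xi for w, xi in zip(weights, x)]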
The result of this process is a threshold function that switches from 0 to 1 once the estimated
value crosses some threshold.
The problem with this type of function is that it is unable to express uncertainty, since it can
only be equal to 0 or to 1. It employs a hard threshold. A way to go around this is by using a
logistic function, which employs a soft threshold. A logistic function can yield a real number
between 0 and 1, which will express confidence in the estimate. The closer the value to 1, the
more likely it is to rain.
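A sketch of such a soft threshold, using the standard logistic (sigmoid) function:

import math

def logistic(z):
    # maps any real number into (0, 1), read here as confidence of Rain
    return 1 / (1 + math.exp(-z))

print(logistic(0))   # 0.5: on the decision boundary, maximum uncertainty
print(logistic(4))   # ~0.98: confidently Rain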
A third approach to classification is the support vector machine, which chooses the decision
boundary that maximizes the distance to the nearest data points of each class (the maximum
margin separator). Another benefit of support vector machines is that they can represent decision
boundaries in more than two dimensions, as well as non-linear decision boundaries, such as below.
To summarize, there are multiple ways to go about classification problems, with no one being
always better than the other. Each has their drawbacks and might prove more useful than
others in specific situations.
Regression
Regression is a supervised learning task of learning a function that maps an input point to a continuous
value, some real number. This differs from classification in that classification problems map an
input to discrete values (Rain or No Rain).
For example, a company might use regression to answer the question of how money spent on
advertising predicts money earned in sales. In this case, an observed function f(advertising)
represents the observed income following some money that was spent in advertising (note that
the function can take more than one input variable). These are the data that we start with. With
this data, we want to come up with a hypothesis function h(advertising) that will try to
approximate the behavior of function f. h will generate a line whose goal is not to separate
between types of observations, but to predict, based on the input, what will be the value of the
output.
Loss Functions
Loss functions are a way to quantify the utility lost by any of the decision rules above. The less
accurate the prediction, the larger the loss.
One simple loss function for discrete predictions is 0-1 loss:
L(actual, predicted):
0 if actual = predicted
1 otherwise
In words, this function gains value when the prediction isn’t correct and doesn’t gain value
when it is correct (i.e. when the observed and predicted values match).
In the example above, the days that are valued at 0 are the ones where we predicted the
weather correctly (rainy days are below the line and not rainy days are above the line). However,
days when it didn’t rain below the line and days when it did rain above it are the ones that we
failed to predict. We give each one the value of 1 and sum them up to get an empirical estimate
of how lossy our decision boundary is.
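A sketch of this empirical 0-1 loss in Python (names are illustrative):

def zero_one_loss(actuals, predictions):
    # each misclassified day contributes 1 to the total loss
    return sum(0 if a == p else 1 for a, p in zip(actuals, predictions))

print(zero_one_loss(["Rain", "Rain", "No Rain"],
                    ["Rain", "No Rain", "No Rain"]))  # 1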
L₁ and L₂ loss functions can be used when predicting a continuous value. In this case, we are
interested in quantifying for each prediction how much it differed from the observed value. We
do this by taking either the absolute value or the squared value of the observed value minus
the predicted value (i.e. how far the prediction was from the observed value):
L₁: |actual − predicted|
L₂: (actual − predicted)²
One can choose the loss function that serves their goals best. L₂ penalizes outliers more harshly
than L₁ because it squares the difference. L₁ can be visualized by summing the distances
from each observed point to the predicted point on the regression line.
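Both losses are simple to express in code; a minimal sketch:

def l1_loss(actuals, predictions):
    # sum of absolute differences
    return sum(abs(a - p) for a, p in zip(actuals, predictions))

def l2_loss(actuals, predictions):
    # sum of squared differences; outliers are penalized more harshly
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions))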
Overfitting
Overfitting is when a model fits the training data so well that it fails to generalize to other data
sets. In this sense, loss functions are a double-edged sword. In the two examples below, the loss
function is minimized such that the loss is equal to 0. However, it is unlikely that it will fit new
data well.
For example, in the left graph, a dot next to the red one at the bottom of the graph is likely to
be Rain (blue). However, with the overfitted model, it will be classified as No Rain (red).
Regularization
Regularization is the process of penalizing hypotheses that are more complex to favor simpler,
more general hypotheses. We use regularization to avoid overfitting.
In regularization, we estimate the cost of the hypothesis function h by adding up its loss and a
measure of its complexity, weighted by a constant λ:
cost(h) = loss(h) + λ complexity(h)
One way to test whether we overfitted the model is with Holdout Cross Validation. In this
technique, we split all the data in two: a training set and a test set. We run the learning
algorithm on the training set, and then see how well it predicts the data in the test set. The idea
here is that by testing on data that were not used in training, we can measure how well the
learning generalizes.
The downside of holdout cross validation is that we don’t get to train the model on half the
data, since it is used for evaluation purposes. A way to deal with this is using k-Fold Cross-
Validation. In this process, we divide the data into k sets. We run the training k times, each time
leaving out one set and using it as the test set. We end up with k different evaluations of our
model, which we can average and get an estimate of how our model generalizes without losing
any data.
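Both techniques are provided by scikit-learn (introduced in the next section); a sketch, assuming the evidence X and labels y are already loaded:

from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score, train_test_split

# holdout cross validation: hold out 50% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
model = Perceptron()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

# k-fold cross validation: k = 5 train/test runs, averaged at the end
scores = cross_val_score(Perceptron(), X, y, cv=5)
print(scores.mean())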
scikit-learn
As is often the case with Python, there are multiple libraries that allow us to conveniently use
machine learning algorithms. One such library is scikit-learn.
Consider a dataset of banknotes: the four left columns contain measurements that we can use to
predict whether a note is genuine or counterfeit, and the last column is the label, provided by a
human, coded as 0 (genuine) and 1 (counterfeit). Now we can train our model on this data set
and see if we can predict whether new banknotes are genuine or not.
import csv
import random

from sklearn import svm
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier

# model = KNeighborsClassifier(n_neighbors=1)
# model = svm.SVC()
model = Perceptron()
Note that after importing the libraries, we can choose which model to use. The rest of the code
will stay the same. SVC stands for Support Vector Classifier (which we know as support vector
machine). The KNeighborsClassifier uses the k-nearest-neighbors strategy, and requires as input
the number of neighbors it should consider.
# Read data in from file
with open("banknotes.csv") as f:
    reader = csv.reader(f)
    next(reader)

    data = []
    for row in reader:
        data.append({
            "evidence": [float(cell) for cell in row[:4]],
            "label": "Authentic" if row[4] == "0" else "Counterfeit"
        })
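Between reading the data and printing the results, the notes elide the training and evaluation steps; a sketch of what happens there, following the structure of banknotes0.py (variable names are illustrative):

# Separate data into training and testing groups
random.shuffle(data)
holdout = int(0.40 * len(data))
testing = data[:holdout]
training = data[holdout:]

# Train model on training set
X_training = [row["evidence"] for row in training]
y_training = [row["label"] for row in training]
model.fit(X_training, y_training)

# Make predictions on the testing set and count how many were correct
X_testing = [row["evidence"] for row in testing]
y_testing = [row["label"] for row in testing]
predictions = model.predict(X_testing)
correct = sum(1 for a, p in zip(y_testing, predictions) if a == p)
incorrect = len(y_testing) - correct
total = len(y_testing)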
# Print results
print(f"Results for model {type(model).__name__}")
print(f"Correct: {correct}")
print(f"Incorrect: {incorrect}")
print(f"Accuracy: {100 * correct / total:.2f}%")
This manual version of running the algorithm can be found in the source code for this lecture
under banknotes0.py. Since algorithms are often run in this same way, scikit-learn contains
additional functions that make the code even more succinct and easy to use, and this version
can be found under banknotes1.py.
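For instance, the manual shuffling and splitting above can be replaced with a single call to scikit-learn's train_test_split; a sketch in the spirit of that more succinct version:

from sklearn.model_selection import train_test_split

# Let scikit-learn shuffle and split the data for us
X = [row["evidence"] for row in data]
y = [row["label"] for row in data]
X_training, X_testing, y_training, y_testing = train_test_split(
    X, y, test_size=0.4
)

model.fit(X_training, y_training)
predictions = model.predict(X_testing)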
Reinforcement Learning
Reinforcement learning is another approach to machine learning, where after each action, the
agent gets feedback in the form of reward or punishment (a positive or a negative numerical
value).
The learning process starts with the environment providing a state to the agent. Then, the agent
performs an action on the state. Based on this action, the environment will return a state and a
reward to the agent, where the reward can be positive, making the behavior more likely in the
future, or negative (i.e. punishment), making the behavior less likely in the future.
This type of algorithm can be used to train walking robots, for example, where each step
returns a positive number (reward) and each fall a negative number (punishment).
This learning process can be formalized as a Markov decision process, which has the following
properties:
Set of states S
Set of actions Actions(S)
Transition model P(s’ | s, a)
Reward function R(s, a, s’)
For example, consider the following task:
The agent is the yellow circle, and it needs to get to the green square while avoiding the red
squares. Every single square in the task is a state. Moving up, down, or to the sides is an action.
The transition model gives us the new state after performing an action, and the reward function
is what kind of feedback the agent gets. For example, if the agent chooses to go right, it will
step on a red square and get negative feedback. This means that the agent will learn that, when
in the state of being in the bottom-left square, it should avoid going right. This way, the agent
will start exploring the space, learning which state-action pairs it should avoid. The algorithm
can be probabilistic, choosing to take different actions in different states based on some
probability that’s being increased or decreased based on reward. When the agent reaches the
green square, it will get a positive reward, learning that it is favorable to take the action it took
in the previous state.
Q-Learning
Q-Learning is one model of reinforcement learning, where a function Q(s, a) outputs an estimate
of the value of taking action a in state s.
The model starts with all estimated values equal to 0 (Q(s,a) = 0 for all s, a). When an action is
taken and a reward is received, the function does two things: 1) it estimates the value of Q(s, a)
based on the current reward and expected future rewards, and 2) updates Q(s, a) to take into
account both the old estimate and the new estimate. This gives us an algorithm that is capable
of improving upon its past knowledge without starting from scratch.
The new value estimate can be expressed as a sum of the reward (r) and the future reward
estimate. To get the future reward estimate, we consider the new state s′ that we got after taking
the last action, and add the estimate of the action in this new state that yields the highest
reward. This way, we estimate the utility of taking action a in state s not only by the immediate
reward, but also by the expected utility of the next step. The future reward estimate can appear
with a coefficient gamma (γ) that controls how much future rewards are valued. We end up with
the following equation, where α is again a learning rate and the maximum is taken over all
actions a′ available in the new state s′:
Q(s, a) ← Q(s, a) + α((r + γ max Q(s′, a′)) − Q(s, a))
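A minimal sketch of this update in Python, with Q as a dictionary mapping (state, action) pairs to value estimates (alpha, gamma, and all names here are illustrative):

def update_q(Q, state, action, reward, new_state, actions, alpha=0.5, gamma=0.9):
    # future reward estimate: best known value among actions in the new state
    future = max((Q.get((new_state, a), 0) for a in actions), default=0)
    old = Q.get((state, action), 0)
    # move the old estimate toward (reward + discounted future estimate)
    Q[(state, action)] = old + alpha * ((reward + gamma * future) - old)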
This brings us to discuss the Explore vs. Exploit tradeoff. A greedy algorithm always exploits,
taking the actions that are already established to lead to good outcomes. However, it will
always follow the same path to the solution, never finding a better path. Exploration, on the
other hand, means that the algorithm may use a previously unexplored route on its way to the
target, allowing it to discover more efficient solutions along the way. For example, if you listen
to the same songs every single time, you know you will enjoy them, but you will never get to
know new songs that you might like even more!
To implement the concept of exploration and exploitation, we can use the ε (epsilon) greedy
algorithm. In this type of algorithm, we set ε equal to how often we want to move randomly.
With probability 1-ε, the algorithm chooses the best move (exploitation). With probability ε, the
algorithm chooses a random move (exploration).
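A sketch of ε-greedy action selection, reusing the Q dictionary from the sketch above:

import random

def choose_action(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)  # explore: random move
    # exploit: best known move in this state
    return max(actions, key=lambda a: Q.get((state, a), 0))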
Another way to train a reinforcement learning model is to give feedback not upon every move,
but upon the end of the whole process. For example, consider a game of Nim. In this game,
different numbers of objects are distributed between piles. Each player takes any number of
objects from any one pile, and the player who takes the last object loses. In such a
game, an untrained AI will play randomly, and it will be easy to win against it. To train the AI, we
let it start by playing games randomly, receiving at the end a reward of 1 for winning and -1 for
losing. After training on 10,000 games, for example, it is already smart enough to be hard
to beat.
This approach becomes more computationally demanding when a game has many possible states
and actions, such as chess. It is infeasible to generate an estimated value for every possible
move in every possible state. In this case, we can use a function approximation, which allows us
to approximate Q(s, a) using various other features, rather than storing one value for each state-
action pair. Thus, the algorithm becomes able to recognize which moves are similar enough so
that their estimated value should be similar as well, and use this heuristic in its decision
making.
Unsupervised Learning
In all the supervised learning cases we saw before, we had data with labels that the algorithm
could learn from. For example, when we trained an algorithm to recognize counterfeit notes,
each banknote had four variables with different values (the input data) and a label indicating
whether it was counterfeit or not. In unsupervised learning, only the input data is present and the AI
learns patterns in these data.
Clustering
Clustering is an unsupervised learning task that takes the input data and organizes it into
groups such that similar objects end up in the same group. This can be used, for example, in
genetics research, when trying to find similar genes, or in image segmentation, when defining
different parts of the image based on similarity between pixels.
k-means Clustering
k-means Clustering is an algorithm to perform a clustering task. It maps all data points in a
space, and then randomly places k cluster centers in the space (it is up to the programmer to
decide how many; this is the starting state we see on the left). Each cluster center is simply a
point in the space. Then, each cluster gets assigned all the points that are closer to its center
than to any other center (this is the middle picture). Then, in an iterative process, each cluster
center moves to the middle of all its points (the state on the right), and then points are
reassigned to the clusters whose centers are now closest to them. When, after repeating
the process, each point remains in the same cluster it was in before, we have reached an
equilibrium and the algorithm is over, leaving us with points divided between clusters.
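A minimal sketch of k-means for two-dimensional points (all names are illustrative):

import random

def k_means(points, k):
    # start with k randomly chosen points as cluster centers
    centers = random.sample(points, k)
    while True:
        # assign each point to the cluster whose center is closest
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                          + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # move each center to the middle (mean) of its assigned points
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        # equilibrium: centers stopped moving, so assignments are stable
        if new_centers == centers:
            return clusters
        centers = new_centers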