What Is Machine Learning? | Python Data Science Handbook
This is an excerpt from the Python Data Science Handbook by Jake VanderPlas; Jupyter notebooks are available on GitHub (https://github.com/jakevdp/PythonDataScienceHandbook).
Before we take a look at the details of various machine learning methods, let's
start by looking at what machine learning is, and what it isn't. Machine learning is
often categorized as a subfield of artificial intelligence, but I find that
categorization can often be misleading at first brush. The study of machine
learning certainly arose from research in this context, but in the data science
application of machine learning methods, it's more helpful to think of machine
learning as a means of building models of data.
# Classification: Predicting discrete labels
We will first take a look at a simple classification task, in which we are given a set of labeled points and want to use these to classify some unlabeled points.
Here we have two-dimensional data: that is, we have two features for each point,
represented by the (x,y) positions of the points on the plane. In addition, we have
one of two class labels for each point, here represented by the colors of the
points. From these features and labels, we would like to create a model that will
let us decide whether a new point should be labeled "blue" or "red."
There are a number of possible models for such a classification task, but here we
will use an extremely simple one. We will make the assumption that the two
groups can be separated by drawing a straight line through the plane between
them, such that points on each side of the line fall in the same group. Here the
model is a quantitative version of the statement "a straight line separates the
classes", while the model parameters are the particular numbers describing the
location and orientation of that line for our data. The optimal values for these
model parameters are learned from the data (this is the "learning" in machine
learning), which is often called training the model.
The following figure shows a visual representation of what the trained model
looks like for this data:
figure source in Appendix (06.00-figure-code.html#Classification-Example-Figure-2)
Now that this model has been trained, it can be used to generalize to new, unlabeled
data. In other words, we can take a new set of data, draw this model line through
it, and assign labels to the new points based on this model. This stage is usually
called prediction. See the following figure:
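In code, this train-then-predict workflow might look like the following minimal sketch using Scikit-Learn; the synthetic "blob" data and the Gaussian naive Bayes estimator here are illustrative stand-ins, not the exact model behind the figures:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Illustrative two-class, two-feature dataset standing in for the figure's data
X, y = make_blobs(n_samples=100, centers=2, random_state=2, cluster_std=1.5)

# "Training": learn the model parameters from the labeled points
model = GaussianNB()
model.fit(X, y)

# "Prediction": assign labels to new, unlabeled points
X_new = np.array([[-1.0, 2.0], [4.0, 6.0]])
y_new = model.predict(X_new)
print(y_new)
```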
For example, this is similar to the task of automated spam detection for email; in this case, we might use the following features and labels:
- feature 1, feature 2, etc. → normalized counts of important words or phrases ("Viagra", "Nigerian prince", etc.)
- label → "spam" or "not spam"
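As a sketch of how such a spam filter might be wired together, the following uses word counts as features and a naive Bayes classifier; the tiny corpus and its labels are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature corpus; a real spam filter would train on far more email
emails = ["win money now", "meeting agenda attached",
          "cheap pills online", "lunch tomorrow?"]
labels = ["spam", "not spam", "spam", "not spam"]

# Word counts act as the features; the spam/not-spam tag is the label
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free money and cheap pills"]))
```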
Some important classification algorithms that we will discuss in more detail are
Gaussian naive Bayes (see In Depth: Naive Bayes Classification (05.05-naive-
bayes.html)), support vector machines (see In-Depth: Support Vector Machines
(05.07-support-vector-machines.html)), and random forest classification (see In-
Depth: Decision Trees and Random Forests (05.08-random-forests.html)).
# Regression: Predicting continuous labels
In contrast with the discrete labels of a classification task, a regression task deals with continuous labels. Consider the data shown in the following figure, which consists of a set of points, each with a continuous label:
As with the classification example, we have two-dimensional data: that is, there
are two features describing each data point. The color of each point represents
the continuous label for that point.
There are a number of possible regression models we might use for this type of
data, but here we will use a simple linear regression to predict the points. This
simple linear regression model assumes that if we treat the label as a third spatial
dimension, we can fit a plane to the data. This is a higher-level generalization of
the well-known problem of fitting a line to data with two coordinates.
We can visualize this setup as shown in the following figure:
Notice that the feature 1-feature 2 plane here is the same as in the two-
dimensional plot from before; in this case, however, we have represented the
labels by both color and three-dimensional axis position. From this view, it seems
reasonable that fitting a plane through this three-dimensional data would allow
us to predict the expected label for any set of input parameters. Returning to the
two-dimensional projection, when we fit such a plane we get the result shown in
the following figure:
This plane of fit gives us what we need to predict labels for new points. Visually,
we find the results shown in the following figure:
figure source in Appendix (06.00-figure-code.html#Regression-Example-Figure-4)
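Here is a minimal sketch of this plane-fitting model in Scikit-Learn, using synthetic two-feature data rather than the data behind the figures; the learned coefficients and intercept describe the orientation and location of the plane:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)

# Two features per point; the continuous label is (roughly) a plane plus noise
X = rng.rand(200, 2)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(200)

# Fitting the plane: the coefficients give its orientation, the intercept its location
model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)   # learned plane parameters
print(model.predict(rng.rand(3, 2)))   # predicted labels for new feature pairs
```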
As with the classification example, this may seem rather trivial in a low number of
dimensions. But the power of these methods is that they can be straightforwardly
applied and evaluated in the case of data with many, many features.
For example, this is similar to the task of computing the distance to galaxies observed through a telescope; in this case, we might use the following features and labels:
- feature 1, feature 2, etc. → brightness of each galaxy at one of several wavelengths or colors
- label → distance or redshift of the galaxy
The distances for a small number of these galaxies might be determined through
an independent set of (typically more expensive) observations. Distances to
remaining galaxies could then be estimated using a suitable regression model,
without the need to employ the more expensive observation across the entire
set. In astronomy circles, this is known as the "photometric redshift" problem.
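As a rough sketch of how such a photometric redshift estimate might look in code, the following trains a random forest regressor on hypothetical brightness measurements; the number of bands, array shapes, and the synthetic "redshift" relation are invented purely to make the example runnable:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)

# Hypothetical photometry: brightness of each galaxy in several bands (the features)
photometry = rng.rand(500, 5)
# Distances/redshifts measured expensively for this training subset (the labels);
# the linear relation below is invented only to create plausible numbers
redshift = photometry @ np.array([0.5, 0.2, 0.1, 0.9, 0.3]) + 0.05 * rng.randn(500)

# Train on the well-measured galaxies...
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(photometry, redshift)

# ...then estimate redshifts for galaxies with photometry only
print(model.predict(rng.rand(3, 5)))
```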
Some important regression algorithms that we will discuss are linear regression
(see In Depth: Linear Regression (05.06-linear-regression.html)), support vector
machines (see In-Depth: Support Vector Machines (05.07-support-vector-
machines.html)), and random forest regression (see In-Depth: Decision Trees and
Random Forests (05.08-random-forests.html)).
# Clustering: Inferring labels on unlabeled data
Classification and regression are examples of supervised learning, in which labels are provided along with the training data. In unsupervised learning, by contrast, the model is given unlabeled data and must infer structure on its own; a common example is clustering, in which data is automatically assigned to some number of discrete groups. Consider a set of two-dimensional points drawn from several such groups: by eye, it is clear that each of these points is part of a distinct group. Given this
input, a clustering model will use the intrinsic structure of the data to determine
which points are related. Using the very fast and intuitive k-means algorithm (see
In Depth: K-Means Clustering (05.11-k-means.html)), we find the clusters shown
in the following figure:
k-means fits a model consisting of k cluster centers; the optimal centers are
assumed to be those that minimize the distance of each point from its assigned
center. Again, this might seem like a trivial exercise in two dimensions, but as our
data becomes larger and more complex, such clustering algorithms can be
employed to extract useful information from the dataset.
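A minimal k-means sketch in Scikit-Learn, run on illustrative blob data (the number of clusters is chosen by hand here):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled two-feature data drawn from (illustrative) four groups
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

# k-means: choose k cluster centers minimizing each point's distance to its center
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # the learned centers
print(cluster_labels[:10])       # cluster assignments for the first few points
```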
We will discuss the k-means algorithm in more depth in In Depth: K-Means
Clustering (05.11-k-means.html). Other important clustering algorithms include
Gaussian mixture models (see In Depth: Gaussian Mixture Models (05.12-gaussian-mixtures.html)) and spectral clustering (see Scikit-Learn's clustering documentation (http://scikit-learn.org/stable/modules/clustering.html)).
# Dimensionality reduction: Inferring structure of unlabeled data
Dimensionality reduction is another example of an unsupervised algorithm, in which structure is inferred from the dataset itself. As an example, consider two-dimensional data whose points trace out a spiral. Visually, it is clear that there is some structure in this data: it is drawn from a one-dimensional line that is arranged in a spiral within this two-dimensional space. In
a sense, you could say that this data is "intrinsically" only one dimensional,
though this one-dimensional data is embedded in higher-dimensional space. A
suitable dimensionality reduction model in this case would be sensitive to this
nonlinear embedded structure, and be able to pull out this lower-dimensionality
representation.
The following figure shows a visualization of the results of the Isomap algorithm,
a manifold learning algorithm that does exactly this:
Notice that the colors (which represent the extracted one-dimensional latent
variable) change uniformly along the spiral, which indicates that the algorithm
did in fact detect the structure we saw by eye. As with the previous examples, the
power of dimensionality reduction algorithms becomes clearer in higher-
dimensional cases. For example, we might wish to visualize important
relationships within a dataset that has 100 or 1,000 features. Visualizing 1,000-
dimensional data is a challenge, and one way we can make this more
manageable is to use a dimensionality reduction technique to reduce the data to
two or three dimensions.
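A minimal sketch of the spiral example with Scikit-Learn's Isomap; the spiral data is generated here rather than taken from the figure:

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.RandomState(1)

# A one-dimensional latent variable t, embedded as a spiral in two dimensions
t = 3 * np.pi * rng.rand(300)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])

# Ask Isomap for a one-dimensional representation of the spiral
model = Isomap(n_components=1, n_neighbors=10)
X_1d = model.fit_transform(X)

# X_1d varies smoothly along the spiral, recovering (up to sign/scale) the latent t
print(X_1d.shape)   # (300, 1)
```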
# Summary
Here we have seen a few simple examples of some of the basic types of machine
learning approaches. Needless to say, there are a number of important practical
details that we have glossed over, but I hope this section was enough to give you
a basic idea of what types of problems machine learning approaches can solve.
In short, we saw the following:
- Supervised learning: Models that can predict labels based on labeled training data
  - Classification: Models that predict labels as two or more discrete categories
  - Regression: Models that predict continuous labels
- Unsupervised learning: Models that identify structure in unlabeled data
  - Clustering: Models that detect and identify distinct groups in the data
  - Dimensionality reduction: Models that detect and identify lower-dimensional structure in higher-dimensional data
In the following sections we will go into much greater depth within these
categories, and see some more interesting examples of where these concepts can
be useful.
All of the figures in the preceding discussion are generated based on actual
machine learning computations; the code behind them can be found in
Appendix: Figure Code (06.00-figure-code.html).