2 Machine Learning
Important: All the following exercises are done in Python, a common scripting language found on many computing servers and in data centers. The same experiments can also be run on MacOS and Windows (using the latest WSL), possibly with some limitations, although other ways to achieve the same results certainly exist on those systems.
Let's classify each of the following questions as "descriptive", "exploratory", "inference", "prediction" or "causal", and find methods to apply:
1. We have a dataset from an airline, with the registry of passengers of a plane. The dataset contains
the personal data of passengers, including their age and nationality. We are asked to retrieve some
information for commercial segmentation: Which age ranges exist in the data? Which is the
most frequent age range? And which is the most common country among the passengers?
We are facing a descriptive question. We are asked to retrieve basic information from
the features of our data, in order to understand the characteristics of the sampled population.
We can solve this through analytics by looking at feature ranges and statistics such as the mean,
mode, maximum and minimum. A summary of the dataset will provide much of that information.
2. We have a dataset from a library, with the registry of book lendings, containing data about the users,
the lent books, the loan dates and the book categories. We are asked whether there is any preference for
book categories across different users and ages.
We have here an exploratory question. We are asked whether there are relations among the features
we are presented with. We need to cross ages and categories to discover, for each age, which are
the most popular categories (computing the mode, grouping by ages or ranges of ages). Here
we could even do some clustering of ages according to categories, to discover the best
partitioning of age ranges and obtain the best representation of preferences.
3. We have a dataset of web site access logs, with the registry of accesses, including evidence of an
attack coming from a specific country of origin alongside the records of legitimate users. We are
asked whether the pattern of attack that currently distinguishes the attackers from the legitimate
users will still be valid to separate users in the future, and whether that pattern repeats
in attacks from other countries.
Here we have an inference question, as we are asked whether the patterns we are finding are
applicable to future or different data. Here we can look for the same patterns across
datasets. A naïve method could be to take the machine learning model that exhibited the pattern
on the target data, attempt to infer with it on different data, and then check the error. If the new
data does not fit the model, it is possible that this data is different (different patterns,
behaviors, relations...), but if it fits, it is possible that the pattern is more general than just the
training data.
4. We have a dataset of a streaming music service, with the registry of users and their music
preferences. We want to know if a random user, according to their music history, can be classified
into one or several specific categories.
This is a prediction question. We are asked to create a method that learns from user
preferences and separates users by categories, so that labeled users can (most probably) be offered
recommendations. If we have the categories (users are labeled) we can apply machine
learning classification methods to predict which labels a user has. If we don't have
those labels a priori, we can apply clustering to find new labels for users, then discover which
classes of users exist, and classify future users.
Exercise 2 – Supervised Learning
With these exercises we will take a first look at how to create, train and evaluate a Machine Learning
model with Python and Scikit-learn (abbreviated Sklearn). Sklearn is a free-software machine learning
library for the Python programming language featuring various classification, regression and clustering
algorithms, and it is widely used both in industry and academia.
Now we can load the Iris data using the method load_iris (the datasets module has to be imported first):
from sklearn import datasets
iris_ds = datasets.load_iris()
To avoid having to type iris_ds. to access the features and labels of the dataset we will save the
dataset's features (iris_ds.data) and labels (iris_ds.target) into variables called X and y
respectively:
X = iris_ds.data
y = iris_ds.target
One of the advantages of the Sklearn datasets package is that it comes with "loading" functions for its
datasets. Unfortunately, other datasets and files often do not have loaders, and we'll have to load them
by reading a CSV file and then selecting the data corresponding to X and y manually.
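For example, a minimal sketch of such a manual load with Pandas, assuming a hypothetical file my_dataset.csv whose last column holds the label:
import pandas as pd
df = pd.read_csv('my_dataset.csv')    # hypothetical file name
X = df.iloc[:, :-1].values            # all columns except the last as features
y = df.iloc[:, -1].values             # last column as labels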
You can see that we have also passed a random_state parameter to train_test_split.
random_state sets a seed for the Random Number Generator (RNG) used to perform the split, which
lets us obtain the same results at every run of this example and ensures repeatability.
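The split call referred to above is not reproduced in this handout; it would look along these lines (the 20% validation fraction is an assumption):
from sklearn.model_selection import train_test_split
# Hold out part of the samples for validation; random_state fixes the RNG seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)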
Building the classification model
Once our training and validation datasets are ready we can move on to creating our model. For this first
classification example we will use a Logistic Regression model:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(solver='lbfgs', multi_class='auto', random_state=0)
Again, we have passed a random_state. You will see this quite often when working with Sklearn. We
have also passed two other parameters, solver and multi_class. They actually take the default
values, but we need to pass them explicitly to silence some warnings (this is solved in later versions of Sklearn).
Right now our logistic regression model is like an empty box, as we have not trained it with any data.
Sklearn uses a common API for all its models, with fit used to train the model on training data and
predict used to make predictions on data. We can then train the model using fit:
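The training call and the score computation that produces the output below are not shown in this excerpt; a minimal sketch, assuming the R2 score is computed on the training predictions:
log_reg.fit(X_train, y_train)
from sklearn.metrics import r2_score
# Evaluate the model on its own training data
lr_train_prediction = log_reg.predict(X_train)
print("R2 Score: {}".format(r2_score(y_train, lr_train_prediction)))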
R2 Score: 0.9809523809523809
That's a fairly good R2 score! According to it, our model can explain 98% of the variance of the training
data. The Iris dataset is quite a simple dataset, so this result should not be surprising.
Let's now try and make predictions on the validation data using predict. We can calculate the Mean
Squared Error of the prediction versus the real labels with mean_squared_error:
lr_test_prediction = log_reg.predict(X_test)
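The error computation itself might look like this (sketch):
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, lr_test_prediction)
print("MSE on test data: {}".format(mse))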
Confusion Matrix
We can generate the prediction's confusion matrix with confusion_matrix:
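A minimal sketch of that call:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, lr_test_prediction)
print(cm)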
Although the confusion matrix above has all the information we need, its format is not very appealing.
We can generate a better visualization using a heatmap from the Seaborn package (plot in Figure 2.1):
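One way this heatmap could be produced (the exact styling used in the original is an assumption):
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted label")
plt.ylabel("Real label");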
Visualizing error
To visualize how the test and training errors evolve as the training process progresses, it is useful
to plot them against the number of samples used in training. This can be achieved
with the learning_curve method (plot in Figure 2.2):
from sklearn.model_selection import ShuffleSplit

# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% of the data randomly selected as a
# validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
plt.figure(figsize=(8,6))
plt.ylim(0.7, 1.01)
plt.xlabel("Training examples")
plt.ylabel("Score")
In this exercise we will take a more in-depth look at the process of building a supervised learning model,
in particular a Linear Regression. We will start by importing the packages we will need for plotting and data
manipulation:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
boston_ds = datasets.load_boston()
Sklearn also packages a description of the dataset within the structure returned by load_boston,
accessible through the DESCR attribute:
print(boston_ds.DESCR)
boston_ds.feature_names
All the data stored in the structure returned by load_boston is in Numpy's array format, with the features,
labels and feature names each stored in a separate array. This format can be somewhat
inconvenient when manipulating and exploring data, so we will convert the dataset
into Pandas' dataframe format. Pandas is Python's library for data manipulation and analysis, which lets
us store data in a relational, database-table style with its DataFrames.
We will first create a dataframe with the features and their respective names (recall that the feature data
is stored in the data attribute and the list of feature names is stored in feature_names):
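The construction of that dataframe is not shown in this excerpt; it would be something like:
boston_df_raw = pd.DataFrame(boston_ds.data, columns=boston_ds.feature_names)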
boston_df_raw['MEDV'] = boston_ds.target
You can check the number of rows and columns of a dataframe with the shape attribute:
boston_df_raw.shape
(506, 14)
You can also visualize the first rows of the dataframe with nice formatting with head. Passing a number
to head will display that many rows; if no argument is passed it will default to displaying 5 rows:
boston_df_raw.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
plt.figure(figsize=(14, 12))
sns.heatmap(boston_df_raw.corr(), annot=True);
Cells corresponding to highly correlated variables take colors at the far ends of the scale. By analyzing the
values on the heatmap above we can see that the features MEDV is most correlated with are RM (0.7)
and LSTAT (-0.74), so we will use them to make our predictions.
Another frequent way of visualizing data is through a scatterplot matrix, which displays how the
different variables are related (Seaborn's version also includes the histogram of each variable
on the diagonal). As we have already identified that MEDV is related to RM and LSTAT the most, we will
only display the scatter matrix for those three variables (plot in Figure 2.4):
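A sketch of the corresponding Seaborn call:
sns.pairplot(boston_df_raw[['MEDV', 'RM', 'LSTAT']]);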
Observing the scatterplots of MEDV against the other two variables we can see that the value 50
appears for many different values of the other variables, and it does not follow the overall trend of the
rest of the data. Because MEDV ranges from 0 to 50 these values might not correspond to an actual
median price but rather be the result of truncating the real value to the maximum of the scale (50). To
avoid considering modified values we will remove them from the data:
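The filtering step might look like the following (the later code refers to the filtered dataframe as boston_df):
# Keep only the rows whose MEDV value is below the truncation limit of 50
boston_df = boston_df_raw[boston_df_raw['MEDV'] < 50]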
If we check the number of rows in the dataset with shape we can see we have deleted 16 rows:
boston_df.shape
(490, 14)
Taking a further look at the data we can see that the relation between MEDV and LSTAT is definitely
not linear. Observing LSTAT's histogram, there is a clear positive skewness, which suggests that we might
be able to address this non-linearity by applying a logarithmic transformation. Let's create a new
dataframe with the transformed LSTAT:
df = pd.DataFrame(boston_df[['MEDV', 'RM']])
df['logLSTAT'] = np.log(boston_df['LSTAT']);
sns.pairplot(df);
Now the relationship between MEDV and logLSTAT definitely looks more like a linear one.
Now that the data is ready we can move on to creating and training the model. To avoid writing the feature
names multiple times we can store them in a variable:
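The omitted listing probably resembles the following; the 80/20 split and the seed value are assumptions:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

feats = ['RM', 'logLSTAT']
labels = ['MEDV']

# Split the transformed dataframe into training and validation sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

# Create the linear regression and train it in a single step
lin_reg = LinearRegression().fit(train_df[feats].values,
                                 train_df[labels].values.reshape(-1,1))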
The values.reshape(-1,1) method is used to reshape the numpy array into the specific shape
required by Sklearn. You can see that we have also applied the fit method directly, rather than creating
the model and training it in two separate steps.
print("Coefs: {}".format(lin_reg.coef_))
print("Intercept: {}".format(lin_reg.intercept_ ))
The coefficients alone don't provide much insight into the quality of the model, so we can calculate the R^2
score to check how much of the variance of the data our model is able to explain:
train_score = lin_reg.score(train_df[feats].values,
train_df[labels].values.reshape(-1,1))
Not a great score, but take into account that we are making predictions using only two variables. Let's now
calculate the Root Mean Square Error:
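A sketch of that calculation on the training data:
from sklearn.metrics import mean_squared_error
train_prediction = lin_reg.predict(train_df[feats].values)
rmse_train = np.sqrt(mean_squared_error(train_df[labels].values.reshape(-1,1),
                                        train_prediction))
print("RMSE on training data: {}".format(rmse_train))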
The results above are on training data; at this point in the course you might already have the intuition that testing
on training data alone does not give much insight into the quality of the model. To get an actual measure
of the quality of our model we need to test it on the validation data. For that purpose, we will predict
MEDV for each one of the observations in the validation dataset and calculate the Root Mean Square
Error (RMSE) with regard to the real values:
test_prediction = lin_reg.predict(test_df[feats].values)
rmse = np.sqrt(mean_squared_error(test_df[labels].values.reshape(-1,1),
test_prediction))
print("RMSE on test data: {}".format(rmse))
As expected, the RMSE on the test data is slightly above the one on the training data.
Exercise 3 – Unsupervised Learning
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
seed=25740565
For illustration purposes we will generate our own data for this example so we can have full control over
it. Let's start by defining the cluster centers:
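The exact coordinates used in the original listing are not reproduced here; an illustrative definition could be:
# Three illustrative cluster centers in 2-D space (values are an assumption)
centers = np.array([[2, 2], [-2, -2], [2, -2]])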
In order to generate data for clustering purposes, Sklearn provides the make_blobs function, which
generates isotropic Gaussian data points or 'blobs' that can be used for clustering:
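A sketch of that call, matching the description below:
from sklearn.datasets import make_blobs
# 300 two-dimensional samples drawn around the 3 centers defined above
X, y = make_blobs(n_samples=300, n_features=2, centers=centers, random_state=seed)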
We have told make_blobs to generate 300 2-dimensional points spread over 3 blobs, each one centered
at one of the points we defined in centers. As usual, it is a good idea to plot the data (whenever
dimensionality allows) to see what it looks like (plot in Figure 3.1):
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(X.T[0], X.T[1], s=3);
And giving it some color so we can better differentiate the clusters (plot in Figure 3.3):
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(X.T[0], X.T[1], s=3, c=y);
ax.scatter(x=centers.T[0], y=centers.T[1], s=80, c=[0,1,2]);
Figure 3.1 Figure 3.2 Figure 3.3
As usual, let's split the data into training and validation datasets:
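For instance (the 70/30 proportion is an assumption, consistent with the 210 training labels printed further below):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)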
We can now build our clustering model and train it using fit. K-means requires the number K of clusters
as a parameter, so we will use our privileged knowledge and set K = 3:
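A sketch of that step:
from sklearn.cluster import KMeans
# Fit k-means with K = 3 clusters on the training data
kmeans = KMeans(n_clusters=3, random_state=seed).fit(X_train)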
We can check the labels that the algorithm has assigned with the labels_ property:
kmeans.labels_
array([1, 2, 0, 1, 1, 2, 2, 0, 2, 0, 2, 0, 2, 2, 1, 1, 0, 0, 2, 1, 0, 1,
2, 2, 2, 0, 0, 2, 1, 1, 0, 2, 2, 0, 0, 0, 2, 0, 1, 2, 0, 1, 2, 0,
2, 1, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 1, 1, 0, 1, 0, 0, 1, 2, 1,
1, 2, 1, 2, 0, 0, 2, 1, 2, 0, 0, 0, 2, 1, 2, 1, 0, 1, 2, 0, 2, 2,
1, 1, 2, 1, 0, 1, 0, 1, 2, 1, 0, 0, 1, 2, 0, 0, 0, 1, 0, 0, 1, 2,
2, 2, 0, 0, 1, 2, 0, 1, 2, 1, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 2, 2,
0, 1, 2, 2, 1, 2, 0, 2, 1, 2, 2, 0, 1, 0, 2, 2, 1, 0, 1, 2, 1, 1,
1, 2, 1, 1, 2, 2, 1, 2, 0, 1, 0, 1, 2, 2, 0, 2, 0, 1, 0, 1, 1, 1,
1, 2, 2, 1, 1, 0, 1, 0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 2, 2, 0, 0, 1,
2, 1, 2, 1, 0, 1, 2, 2, 1, 1, 2, 0], dtype=int32)
kmeans.cluster_centers_.T
If we compare the clustering obtained by k-means with the original clusters we can see it did a pretty good
job (don't mind the cluster colors not matching from one plot to another; k-means re-assigns label values
to each cluster when done training) (plot in Figure 3.4):
fig, ax = plt.subplots(1, 2, figsize = (14,6))
ax[0].scatter(X_train.T[0], X_train.T[1], s = 3, c = y_train);
ax[0].scatter(x = centers.T[0], y = centers.T[1], s = 80, c = [0,1,2]);
ax[0].set_title('Actual data')
ax[1].scatter(X_train.T[0], X_train.T[1], s = 3, c = kmeans.labels_)
ax[1].scatter(x = kmeans.cluster_centers_.T[0], y =
kmeans.cluster_centers_.T[1], s = 80, c = [0,1,2])
ax[1].set_title('k-means clustering');
Figure 3.4
Overlaying both plots, we can see the cluster centers have a pretty good match (actual centers in red, plot
in Figure 3.5):
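One way such an overlay could be drawn (sketch):
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(X_train.T[0], X_train.T[1], s=3, c=kmeans.labels_)
ax.scatter(x=kmeans.cluster_centers_.T[0], y=kmeans.cluster_centers_.T[1], s=80, c=[0,1,2])
ax.scatter(x=centers.T[0], y=centers.T[1], s=80, c=['red']);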
Another way to evaluate the quality of a clustering is the homogeneity_score, which measures the
extent to which each cluster contains only points of a single class:
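A minimal sketch of the call that produces the value below:
from sklearn.metrics import homogeneity_score
homogeneity_score(y_train, kmeans.labels_)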
0.9603651641018862
The main drawback of homogeneity is that it only checks that points within the same cluster
belong to the same class, so if all the points in two different clusters belong to the same class the
homogeneity score will still be high, even though there should be only one cluster.
One of the main issues with K-means is that you need to specify the number of clusters you want the
algorithm to look for. As you normally do not know the number of clusters beforehand, tuning this
hyperparameter can be tricky, especially for large values of K. For example, if we look for only two clusters
in our dataset (plot in Figure 3.6):
kmeans2 = KMeans(n_clusters = 2, random_state = seed).fit(X_train)
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(X_train.T[0], X_train.T[1], s = 3, c = kmeans2.labels_);
ax.scatter(x = kmeans2.cluster_centers_.T[0], y =
kmeans2.cluster_centers_.T[1], s = 80, c = [0,1])
ax.scatter(x = centers.T[0], y = centers.T[1], s = 80, c = ['red']);
We can see that k-means has aggregated two clusters into a single one, finding two clusters as we told it
to (original cluster centers in red). Checking the homogeneity score we can see it is significantly lower:
homogeneity_score(y_train, kmeans2.labels_)
0.47267561432978034
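The five-cluster run discussed next is not reproduced here; a sketch, mirroring the two-cluster example above:
kmeans5 = KMeans(n_clusters = 5, random_state = seed).fit(X_train)
fig, ax = plt.subplots(figsize = (6,6))
ax.scatter(X_train.T[0], X_train.T[1], s = 3, c = kmeans5.labels_);
ax.scatter(x = centers.T[0], y = centers.T[1], s = 80, c = ['red']);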
We can see now that it has found two extra clusters that were not there in the first place. Calculating the
homogeneity score:
homogeneity_score(y_train, kmeans5.labels_)
0.9999999999999998
As we mentioned before, because it only checks that points within each cluster belong to the same
class, it has a score of effectively one. Taking it to an extreme, if we made as many clusters as data
points the homogeneity score would be 1, so you need to put some thought into using and interpreting
this score.
Q(s,a): the score (Q-value) of the current state s given the current action a
α: the learning rate
R(s): the reward for reaching state s
γ: the discount factor applied to the expected score from the next state
max_{a'} Q(s',a'): the best expected score over the actions available from the next state s'
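Taken together, these terms form the standard Q-learning update rule (written here for reference, since the original formula is not reproduced in this excerpt):
Q(s,a) ← Q(s,a) + α · [ R(s) + γ · max_{a'} Q(s',a') − Q(s,a) ]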
import numpy as np
import numpy.random as rd
import math
We also need a viability function that returns the list of possible actions from a given state:
# Viability function
def viability_function(actions, s):
    valid_actions = []
    for a in actions:
        a_prime = a
        if s[0] + a_prime[0] >= 0 \
                and s[1] + a_prime[1] >= 0 \
                and s[0] + a_prime[0] < rewards.shape[0] \
                and s[1] + a_prime[1] < rewards.shape[1]:
            valid_actions.append(a_prime)
    return np.array(valid_actions)
# qlearn function (excerpt)
def qlearn(actions, rewards, s_initial, alpha, gamma, max_iters,
           q_scoring=None):
    # Initialize scoring, state and action variables
    if q_scoring is None:
        q_scoring = np.full([rewards.shape[0], rewards.shape[1],
                             len(actions)], 0.5)
    s = s_initial
    a = actions[0]
    # Solve Reward
    r = rewards[s[0], s[1]]

    # ... the iterative Q-value update loop, which tracks iteration_count and
    # the best (state, action, Q-value) found, is not shown in this excerpt ...

    print("Iteration: {}".format(iteration_count))
    print(" Best Position: ({},{})".format(*best['s']))
    print(" Best Action: ({},{})".format(*best['a']))
    print(" Best Value: {}".format(best['q']))
    print()
Case of example
Consider a bidimensional space of 4 x 4 cells, with the following rewards for being in each cell:
There is a goal point, position [3,2], with a high reward for being on it, and no reward or a negative reward
for leaving it. Our actions are the king movements of a chess game, plus the No Operation (NOP) movement. Adding
the NOP movement allows us to remain in the best position once found, and then exhaust the convergence
steps until the loop breaks, finishing the game. The drawback of NOP is that we could get stuck in a
sub-optimal local position, whereas forcing the agent to always move could let it escape from such
positions. A sketch of the setup is given after the problem details below.
Problem Details:
Space has dimensions 4 x 4
Goal is to reach [3,2] (We don't tell which is the goal, but rather we reward it better)
Start point is at Random
Reward depends only on the current position
alpha = 0.5
gamma = 1.0
max_iters = 50
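As mentioned above, this is a sketch of how the run could be set up, assuming the full qlearn implementation; the reward values below are illustrative only, since the original reward grid is not reproduced here:
# Eight king moves plus the (0,0) "no operation" move
actions = np.array([(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)])

# Illustrative 4 x 4 reward grid with a high reward on the goal cell [3,2]
rewards = np.zeros((4, 4))
rewards[3, 2] = 100

# Random starting position inside the grid
s_initial = rd.randint(0, 4, size=2)

qlearn(actions, rewards, s_initial, alpha=0.5, gamma=1.0, max_iters=50)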
import numpy as np
import numpy.random as rd
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
We would like to see how the neural network performs on non-linearly separable data, so we will generate
two groups of points in a radial space:
mean = 0.0
var = 0.26
threshold = 0.25
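# The generation of x, y and labels is not reproduced in this handout. One
# plausible sketch (assumption): Gaussian coordinates, labelled by whether the
# squared radius exceeds the threshold. The colors list used by the plots
# below is also defined here.
n_points = 2500   # illustrative sample size
x = rd.normal(mean, np.sqrt(var), n_points)
y = rd.normal(mean, np.sqrt(var), n_points)
labels = (x**2 + y**2 > threshold).astype(float)
colors = ['red', 'blue']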
features = np.vstack((x,y)).T
This data is obviously not separable by a logistic regression or other linear models (without transforming the
feature vectors x and y). We can plot it to see that it is not linearly separable (plot in Figure 4.1):
plt.figure(figsize=(6,6))
plt.scatter(x,y, c=[colors[int(l)] for l in labels], s=0.2);
plt.xlim(-1,1); plt.ylim(-1,1);
Now we create a training dataset and a test dataset for validation purposes:
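A sketch of that split (the 20% test fraction is an assumption):
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)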
Now we train the model, using the fit function to fit it to the data. We select
a MultiLayer Perceptron, and we choose a single hidden layer with 64 neurons:
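A minimal sketch of that model, with mostly default settings (an assumption) apart from the hidden layer size:
from sklearn.neural_network import MLPClassifier
# Multi-Layer Perceptron with a single hidden layer of 64 neurons
mlp = MLPClassifier(hidden_layer_sizes=(64,), random_state=0)
mlp.fit(features_train, labels_train)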
Time to test! We can use the function predict to pass new data (here the test dataset) through the model,
and then compute the confusion matrix. As we have 2 classes, the confusion matrix will be a 2 x 2 table of
"Real vs Predicted" values. We want all the big numbers on the diagonal ("Real Red & Predicted
Red", "Real Blue & Predicted Blue").
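A sketch of those two steps:
from sklearn.metrics import confusion_matrix, classification_report
predictions = mlp.predict(features_test)
print(confusion_matrix(labels_test, predictions))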
[[191 11]
[ 6 292]]
As we can see, the network learns pretty well, with around 97% of values correctly predicted. We can
also print the summary of the classification and see the accuracy and precision values:
print(classification_report(labels_test,predictions))
We can finish by plotting how the test points are classified as "red" or "blue", and see that they follow the same
colors as the shape we used to design the dataset (plot in Figure 4.2):
plt.figure(figsize=(6,6))
plt.scatter(features_test.T[0], features_test.T[1], c=[colors[int(l)] for l
in predictions], s=0.2);
plt.xlim(-1,1); plt.ylim(-1,1);
Figure 4.1 Figure 4.2
Reminder
All these exercises are meant to show the basic concepts of these technologies. To gain a better understanding
and discover all the capabilities of these methods and applications, check the reference manuals and play
with new examples.