Machine Learning Path
Have feedback? We’d love to hear it! Leave your comments here.
Introduction
What is the Golden Path to Machine Learning?
This course is a combination of learning materials (external & internal) and programming exercises to get you familiar
with the core concepts of machine learning. We’ve heard a lot of engineers express interest in machine learning, and
we’d like to share our knowledge with you, so you too can build and understand basic models!
Prerequisites
● Basic knowledge of Python
Supervised learning
For when you want to make predictions.
Prediction is a broad word, and there is a lot you can do with supervised learning. You can predict
numerical values, like the lifetime value of a customer. You can predict categorical values, like the country a user lives
in (for those missing that data).
To build a supervised learning model, you will need:
1. A dataset with one row per entity you’re making a prediction about (e.g. user, zap, support ticket).
2. Other information about that entity — this is how you will make the predictions (e.g. role, domain, browser).
3. The “correct answers” for your dataset (e.g. if you’re trying to predict a user’s country, first you’ll need a dataset
with other users and their countries for your model to learn from).
Unsupervised learning
For when you want to make meaningful categories.
Unlike supervised learning, in unsupervised learning you don’t have the answers already. Instead, what you have is a
dataset with some information, and you want to make categories out of it, such as grouping survey responses into
themes or creating groupings of similar customers.
To make an unsupervised learning model, you will need #1 & #2 listed above, but unlike supervised models, you do
not need #3.
Let’s build a model that predicts whether or not a person survived the Titanic sinking. From now on, the information we’re
using to predict this (age, fare) we will call features, and the information we’re trying to predict (survival) we will call our
target.
A note about target variables - Sometimes you won’t have a clean target variable, or you’ll want to predict something
that you don’t have data for. Let’s say you want to predict the difficulty of a support ticket, but you don’t have a dataset
of tickets labeled with their corresponding difficulty. In this type of situation you have two options: you can either get
a group of folks together to label past tickets with difficulty, or you can create your own target variable from other
information.
If you get folks to hand label your data, make sure you first develop some guidelines on how to do that labeling, so
that the labels are consistent. You may even want to have multiple folks classify the same ticket, so you can then
average their votes later on.
If you want to create your own target variable, you have to be creative in thinking about what data you do have access
to. For example, if you have access to the length of time a customer champion actively worked on a given support
ticket, you could assume that more time = more difficult, and thus use time as a target variable that approximates
difficulty!
Check out the source code for datto if you’d like to get into more detail on how these methods work.
import pandas as pd
import datto as dt

# Load the Titanic training data and take a first look at its size and contents
df = pd.read_csv('train.csv')
df.shape
df.head(10)
Let’s clean those column names for easier referencing later on.
ct = dt.CleanText()
df = ct.clean_column_names(df)
There, that’s better. Now we don’t have any spaces, everything is lowercase and uses underscores, and we’ve removed
special characters.
df["survived"].unique()
df[df.duplicated(['passengerid'])].sort_values(by='passengerid')
Great, looks like we don’t have any duplicate rows. If you do find duplicates, you’ll want to use drop_duplicates with a
subset of passengerid.
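In case you ever need it, here’s a minimal sketch of that cleanup (a hypothetical step; we don’t need it for this dataset):
# Keep only the first row for each passengerid, dropping any repeats
df = df.drop_duplicates(subset=["passengerid"], keep="first")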
Now that you’ve gotten a better idea of what your data looks like, let’s move on to cleaning up your data.
df.info()
This shows each column’s data type and how many non-null values it has, which is a good way to double check that your numeric columns really came in as numeric types.
Let’s see if we should exclude any columns from our analysis. Some reasons we may want to exclude a column include too
many unique values, low variance in numerical values, or a large percentage of nulls.
eda = dt.Eda()
eda.find_cols_to_exclude(df)
We definitely won’t want passengerid as a feature, so we can drop that now. ticket has so many unique values we
probably can’t use it as is. cabin has many nulls, so let’s drop that too. We’ll keep name for later language processing.
One thing that will cause an error when building a model is having null values. Let’s check for counts of nulls and fill
those values with a placeholder.
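A quick way to get those counts (a minimal sketch):
# Number of null values per column, worst offenders first
df.isnull().sum().sort_values(ascending=False)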
Okay, so we’ll need to fill null values for a few different columns. Note that you never want to delete rows with null
values! There could be something different about users with null values, and deleting them could mean losing valuable
information from your model.
What we fill the nulls with depends on what the variable is. Let’s fill the numerical age column with its
mean, and the categorical embarked column with a placeholder of ‘unknown’.
df["age"].fillna(
df["age"].mean(), inplace=True
)
df["embarked"].fillna('unknown', inplace=True)
Let’s check if we have any correlated features. Highly correlated features can make some models unstable and harder to interpret.
eda.find_correlated_features(df)
Looks like the correlation between our numeric features is relatively low, so nothing to do here! Let’s move on to some
general discussion of types of machine learning models you may use.
While you can program a machine learning model without knowing any of the theory, the more you know about how
these models work, the better you’ll understand which model to use and why a model is performing the way it is.
So, a machine learning model is the method you use to get your predictions. A supervised machine learning model will
learn patterns in your data that best predict some given information.
Let’s compare this to making a set of heuristics — something you may already do in your work. For example, based on
what we know about our passengers, we might make a set of heuristics that look something like this:
(These rules are completely made up as an example, don’t assume they’re true!)
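As a purely hypothetical sketch, such hand-written rules might look like this in code (the thresholds below are invented for illustration, not learned from the data):
def heuristic_survived(row):
    # Made-up rules: predict survival for women and for young 1st/2nd class passengers
    if row["sex"] == "female":
        return 1
    if row["age"] < 10 and row["pclass"] in (1, 2):
        return 1
    return 0

# Apply the rules to every passenger
df["heuristic_prediction"] = df.apply(heuristic_survived, axis=1)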
A machine learning model is an alternative to making a set of heuristics like this. Instead of creating the heuristics
ourselves, the model learns these heuristics automatically from our data and figures out the best set of rules and
combinations of features that it will use to make predictions. Let’s learn very broadly about a few types of ML models,
keeping in mind that you could take a whole course learning about the intricacies of each of these!
There are two types of each of the following models - classifiers and regressors. If you are trying to predict a categorical
target, you will use a classifier, and if you are trying to predict a numerical value, you will use a regressor.
Optional additional resource: listing of all the supervised learning models that sklearn (Python package) has available
Regressions
The classifier version of a regression is called logistic regression, while the regressor version is called linear regression.
In both types of regression, we create something called a line of best fit, which is then used to predict the target.
For a linear regression, you can imagine plotting all your data on a graph, and then trying to draw the line that comes
as close as possible to all the data points.
While linear regression predicts a numerical value for the target, a logistic regression predicts a probability of each
category. It also creates a line on your graph, but this line looks different. The line attempts to separate categories as
best as possible. Everything to the right/underneath the line is predicted as a higher probability of being True, and
everything to the left/above the line is predicted as a higher probability of being False.
For example, in our work above, we would predict the probability of someone surviving, and use that probability to
ultimately decide whether it’s a higher probability they are a survivor or a higher probability they are not a survivor.
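As a toy illustration of the difference (separate from our Titanic model, with made-up numbers), here’s how both look in sklearn:
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up data: hours studied vs. exam score (numeric target, so a regressor)
hours = [[1], [2], [3], [4], [5]]
scores = [52, 60, 71, 80, 88]
linear = LinearRegression().fit(hours, scores)
linear.predict([[6]])  # predicts a numeric score

# Made-up data: hours studied vs. passed or not (categorical target, so a classifier)
passed = [0, 0, 1, 1, 1]
logistic = LogisticRegression().fit(hours, passed)
logistic.predict_proba([[6]])  # predicts a probability for each category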
Decision trees
You can think of a decision tree as a sort of flow chart, with various decision making pieces at each step. The features
are used as the questions that guide you down the tree, where you ultimately end up with a value or probability for
your target variable at the bottom. Each split in the tree is based on the answer to a single question, i.e. a single
feature. The splits are chosen automatically by the model learning which types of splits best separate the target
variable’s values.
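Here’s a minimal sketch of a single decision tree on two of our numeric features (assuming the nulls were filled as above); printing it out shows the flow chart of questions it learned:
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree and print the questions it asks at each split
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(df[["age", "fare"]], df["survived"])
print(export_text(tree, feature_names=["age", "fare"]))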
Bias is how far off your predictions are from the true values. Variance is how different your predictions are from one another.
In other words, bias is how far you are from your target (the bullseye), while variance is how spread out your darts are.
Let’s analyze the image above. The top row is low bias: in both panels, the darts are relatively close to the
bullseye (the correct answers for the target variable).
In the right-hand column, the darts are quite spread out, or in other words, have high variance. And so on
with the other rows & columns.
Models with high variance typically pay too much attention to the data the model is learning from and don’t
generalize well to new incoming data. This is called overfitting.
Models with high bias have the opposite problem: they aren’t learning enough to capture the patterns in the data and
thus don’t have a good rate of correct predictions. This is called underfitting.
So this is a tradeoff: ideally you want to balance both bias & variance, and different models will have different
advantages and disadvantages here. Okay, let’s move on to some more models...
Bagging
Bagging is when you make a whole bunch of decision trees, and then take the average of those trees to make an
average set of rules to use instead.
To make these different trees, you resample your data using bootstrapping and then make a tree from each new
bootstrapped sample. This means that you pull a sample from your data, and then put it back, so you end up with a new
dataset of the same size, but with various samples repeated. It’s like picking a random ball from a bag, and then
putting the ball back in the bag each time, so you can potentially pick the same ball again.
In the image above, you would create a decision tree from each bootstrapped sample, and then later average those
trees together. As you might imagine, bagging (and the following more complex tree based methods) typically perform
better than a single decision tree as they help average out any extremes and get the most useful splits from multiple
runs.
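Here’s a hedged sketch of both pieces: drawing one bootstrapped sample with pandas, and letting sklearn build and average many trees for us (reusing the same two toy features as before):
from sklearn.ensemble import BaggingClassifier

# One bootstrapped sample: same number of rows, drawn with replacement,
# so some passengers appear more than once and others not at all
bootstrap_sample = df.sample(frac=1, replace=True)

# Bagging: 100 trees, each built on its own bootstrapped sample (the default base model is a decision tree)
bagging = BaggingClassifier(n_estimators=100)
bagging.fit(df[["age", "fare"]], df["survived"])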
Random forests
Random forests work very similarly to bagging with one key difference: instead of using every single feature in your
dataset when building each tree, the model randomly selects only a subset of the available features to consider at each
split. This leads to more diverse trees, which tends to perform better than simple bagging.
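In sklearn this looks almost identical to bagging; max_features is the piece that limits how many features each split gets to consider (a sketch):
from sklearn.ensemble import RandomForestClassifier

# Each split considers only a random subset of the features ("sqrt" of the total)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(df[["age", "fare"]], df["survived"])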
Boosting
Boosting is another tree based method that is slightly different: instead of bootstrapping your data, each new decision
tree is created based on the ones before it. After the first tree is built, the next tree is built to focus on correcting the
mistakes the earlier trees made. In random forests and bagging, each tree is
pretty deep, meaning there are a lot of questions to go through in your flow chart before you reach an answer. In
boosting, the trees tend to be short, meaning you may answer one or two questions and then arrive at your answer.
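A sketch with sklearn’s gradient boosting; note the shallow trees (max_depth=2) and that each new tree is built to correct the ones before it:
from sklearn.ensemble import GradientBoostingClassifier

# Many short trees, each one focused on fixing the mistakes of the previous trees
boosting = GradientBoostingClassifier(n_estimators=200, max_depth=2)
boosting.fit(df[["age", "fare"]], df["survived"])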
Here’s an image, created by Kristie, to help illustrate the differences between random forests, bagging, and boosting:
Neural networks
Alrighty, one last supervised learning model for you. It’s the one I bet you’ve heard discussed, neural nets!
Neural networks are great for image data, video data, sound data, etc., e.g. if you want to classify an image as either a
dog or a cat. They’re also good for generating bodies of text. However, for most simple regression or classification
problems, a neural network is most likely overkill. They’re computationally expensive, and often random forests or
boosting will give great results with far less complexity, so we won’t go deep into how to code a neural
network in this course. However, if you’d like to go through a tutorial, this one is a great way to get started using a
package called Keras.
There are several types of neural networks, but here’s a broad overview of how a basic one works. First, you have an input
layer: this is all of the features in your data. You also have an output layer: this is the target you are trying to predict.
In the middle you have some number of hidden layers (how many layers to choose is mostly experimentation). To start,
your model guesses some random weights to use in the hidden layer(s). The input values are multiplied by the
hidden layer weights, and the output of that is compared to the real values of the target. Then, we use something
called back propagation to update the weights. This means that we calculate how far off our results were from the true
results, and we use a particular calculation to adjust the weights in the hidden layer so that hopefully we can do a
better job with our predictions next time. The number of times you go back and adjust those weights is called the
number of epochs, and is also chosen mostly by experimenting with your data.
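If you’d like to poke at a small network without Keras, sklearn includes a basic feed-forward one; here’s a hedged sketch (the layer sizes and iteration cap are arbitrary choices, not recommendations):
from sklearn.neural_network import MLPClassifier

# Two hidden layers of 16 neurons each; max_iter caps how many rounds of
# back-propagation weight updates the model gets to make
net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=500)
net.fit(df[["age", "fare"]], df["survived"])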
Optional additional resource: listing of all clustering models sklearn has available
K-means clustering
To make the separate categories in your data, you first choose how many clusters you want to create. The algorithm then
picks that number of random data points, i.e. if you want 3 clusters, it chooses 3 random data points. These chosen data
points are called centroids. Then, for each individual data point in your dataset (e.g. each user), it uses an equation to
calculate the similarity between that data point and each of the centroids. This similarity is calculated by incorporating
the features you have about each data point, such as features we added to our example dataset earlier, like pclass
and age.
Each data point is then assigned to the group of the centroid it is closest to, and so on, until every data point belongs to
a centroid, thus making your categories. In the image below, the black dots are the centroids, and
each data point would represent an item, user, etc. You can see that each data point has been assigned to a different
color category, based on which centroid it is closest to.
Of course, picking random points for your centroids will probably not result in great clusters, so part of the process of
this model is choosing the number of times you want to run it. Each time, the model adjusts and learns from itself based
on calculating how much the data points vary, with the end goal of making categories as distinct as possible by having
groups whose data points are all quite similar to each other.
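Here’s a minimal sketch with sklearn; n_clusters is the number of categories you want, and n_init is how many times the algorithm restarts from fresh random centroids:
from sklearn.cluster import KMeans

# Group passengers into 3 clusters using two numeric features (toy example)
kmeans = KMeans(n_clusters=3, n_init=10)
df["cluster"] = kmeans.fit_predict(df[["age", "fare"]])
df["cluster"].value_counts()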
NMF (non-negative matrix factorization)
Overall, the idea is that we have features representing (adjusted) counts of keywords in some text. What NMF does is
take all those features we’ve gotten from the text and reduce them to only a few categories. It does this in a way
that’s somewhat similar to how neural networks work: by multiplying some matrices together, then going back and
determining the best weights to multiply by so that the resulting categories are distinct.
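As a rough sketch of that idea, here’s NMF squashing word counts from the name column down to a handful of categories (the number of topics is an arbitrary choice for illustration):
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Turn each name into keyword counts, then reduce those counts to 5 broad categories
counts = CountVectorizer().fit_transform(df["name"])
topic_weights = NMF(n_components=5).fit_transform(counts)  # one row per passenger, one column per category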
X will be our feature set, and y will be our target. Let’s split our datasets into those separate chunks:
y = df["survived"]
X = df.drop("survived", axis=1)
In order to evaluate how well your model is doing, you’ll need a train set and a test set of data. This is done by splitting
your data into two different datasets, with the train set being larger than the test set.
The reason for this split is that you want your model to learn patterns from one set of data, but then, in order to mimic
how it might perform out in the real world, you want to test it on some data it has never seen before. And you want to
test it on some data you have the “correct” answers for, so that you can evaluate approximately how well it might do if
you were to deploy it.
An important idea to remember is that you want to keep these datasets entirely separate until the very end, when you
test (once!) how well your model is doing.
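Here’s what that split typically looks like with sklearn (an 80/20 split is a common, somewhat arbitrary choice):
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as a test set the model never sees until the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)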
To use the text in name as a feature, we need to transform this one column into many columns that represent adjusted
counts of each word in the name.
Let’s transform our text (name) into separate columns of adjusted word counts instead. As part of this process, we will
lemmatize the text — this means changing everything into its root word to standardize the word counts (e.g. running to
run, leapt to leap).
from sklearn.feature_extraction.text import TfidfVectorizer

# The vectorizer that will transform text into columns of adjusted word frequencies
vectorizer = TfidfVectorizer(
    tokenizer=ct.lematize,
    # Means you can have individual words or phrases of 2-3 words
    ngram_range=(1, 3),
    # Means each word must appear in at least 5 different data points
    min_df=5,
    # Means each word cannot appear in more than 40% of data points
    max_df=0.4,
)
Note that running pd.get_dummies on the whole dataset will make dummy columns for any categorical variables and
keep any numerical values as is. After you make these columns, you can combine the word vector columns
and the dummy columns into one final train set and one final test set.
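The exact combining code isn’t shown here, but one hedged way to do it with pandas and scipy looks like this (the X_train_combined / X_test_combined names just need to match what we use below, and this assumes the columns we decided to drop earlier are already gone):
import pandas as pd
from scipy.sparse import hstack

# Fit the text vectorizer on the train set only, then apply it to both sets
train_text = vectorizer.fit_transform(X_train["name"])
test_text = vectorizer.transform(X_test["name"])

# Dummy-encode the remaining categorical columns, aligning the test columns to the train columns
X_train_dummies = pd.get_dummies(X_train.drop("name", axis=1))
X_test_dummies = pd.get_dummies(X_test.drop("name", axis=1))
X_test_dummies = X_test_dummies.reindex(columns=X_train_dummies.columns, fill_value=0)

# Stack the word-frequency columns next to the dummy columns
X_train_combined = hstack([train_text, X_train_dummies.astype(float).values], format="csr")
X_test_combined = hstack([test_text, X_test_dummies.astype(float).values], format="csr")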
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()  # assuming lr is the logistic regression discussed above
lr.fit(X_train_combined, y_train)
y_predicted = lr.predict(X_test_combined)
You can also score how well that model is doing. There are a lot of different scoring methods you can use, but let’s use
precision & recall.
● Precision — out of everything you predicted as true, what percent were actually true?
● Recall — out of everything that was true (e.g. people who are survivors), what percent did you “catch” / predict as
true?
● Accuracy — out of all the predictions you made, what percent were correct?
We’ll walk through some examples to better understand these. Let’s say you have a dataset of 100 folks.
You predict that 20 of them are survivors. Of those 20 you predicted as survivors, only 13 were actually survivors.
Doing some math to figure out the other pieces, this means that you correctly predicted 13 folks were survivors, and
you were correct in predicting 53 folks were not survivors. This means your accuracy was (13+53)/100, or 66%.
Using our example numbers, in the above diagram, the light green circle half is 13, the red circle half is 7, the dark green
side is 27, and the light gray side is 53.
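Plugging those made-up numbers in:
# TP = 13, FP = 7, FN = 27, TN = 53
precision = 13 / (13 + 7)    # 0.65: of the 20 predicted survivors, 65% really were survivors
recall = 13 / (13 + 27)      # 0.325: we only caught 32.5% of the 40 actual survivors
accuracy = (13 + 53) / 100   # 0.66: 66% of all predictions were correct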
Accuracy is the most well known metric here, and it can be a problematic one. For this reason I recommend you
either avoid it entirely, or only use it in combination with the other metrics listed to get a fuller picture of how your
model is doing.
Here’s one example where accuracy is misleading — if we continue with the above example, let’s say your model
predicts that everyone is not a survivor. This would give you an accuracy of 60%! (60 are not survivors, so you are
correct 60 times). Which sounds relatively decent, however, a model that only predicts False is not particularly useful.
Checking precision and recall here would alert you to this fact.
In terms of deciding between precision and recall, you’ll want to think about your use case and which one is more
important. Let’s say you’re trying to predict email spam. Here, it’s much worse to classify something as spam when it isn’t
spam, because you would miss potentially important emails. So, precision is more important — you want to make sure
you are as spot on in your predictions as possible.
As another example, let’s say you’re trying to predict whether someone has a highly contagious illness. In this case,
missing a patient who is sick is much worse than accidentally telling someone they are sick when they aren’t. In this
case, recall is a higher priority. You want to catch as many of the sick patients as possible with your model.
Most likely you will want to measure both of these metrics to balance them somewhat, but knowing your use case will
help you decide what to optimize your model for.
Regressors
Root mean squared error (RMSE)
First off, this is a measurement of error, so it’s important to note that the lower this score is, the better. Here are the
steps to calculating it, to illustrate what it means (though you won’t have to do this manually):
1. For each data point, take the difference between the predicted value and the actual value.
2. Square each of those differences.
3. Take the average of all the squared differences.
4. Take the square root of that average.
Squaring the difference is useful because it helps penalize the model for larger errors. Taking the square root at the end
changes the error value back into the same units as your target.
R²
Unlike RMSE, the higher the score here, the better. R² takes your model and compares it to a much more basic model —
what would happen if you simply predicted the average value of your target variable for every single data point? In
other words, R² evaluates how much better your model is compared to predicting that average value for everything.
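We aren’t building a regressor in this course, but for reference, here’s a hedged sketch of computing both metrics with sklearn (the values are made up):
from sklearn.metrics import mean_squared_error, r2_score

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.0, 9.5]
rmse = mean_squared_error(actual, predicted) ** 0.5  # lower is better, in the same units as the target
r2 = r2_score(actual, predicted)  # higher is better, 1.0 would be a perfect fit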
Back in our classification example, let’s score our predictions:
from sklearn.metrics import accuracy_score, precision_score, recall_score
precision_score(y_test, y_predicted)
pd.DataFrame(y_predicted).value_counts()  # sanity check: how many of each class we predicted
recall_score(y_test, y_predicted)
accuracy_score(y_test, y_predicted)
So accuracy is the highest, and we’re relatively balanced on recall & precision. Let’s optimize our model and see if we
can make this better.
Optimizing your model
One thing we can do is gridsearch multiple models with different parameters (i.e. changes in a model’s setup; we won’t go
into the details of those). But remember, we don’t want to optimize using our test set, otherwise we won’t know how the
model will perform on unseen data.
So instead, we’ll use cross validation, meaning we will split our train set even further, this time into mini train sets and
mini test sets. We can make these splits multiple times and test out our different models on these mini splits to figure out
which model and parameters may work best. But we will not touch that final test set at all!
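datto will handle this for us below, but for reference, the same idea in plain sklearn looks roughly like this (a hedged sketch; the parameter grid is arbitrary):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Try a few parameter settings, scoring each one with 5-fold cross validation on the train set only
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid={"C": [0.1, 1, 10]}, cv=5, scoring="precision")
grid.fit(X_train_combined, y_train)
grid.best_params_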
We need to make one quick change first: datto works on pipelines, not single models. Pipelines allow you to do
multiple transformations to your data before running a model. We won’t go into the detail of those other steps in this
course; instead we’ll quickly set up our model in pipeline format to make it compatible with datto’s methods.
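A minimal version could be as simple as wrapping our logistic regression in a one-step pipeline (the step name "classifier" is just a label we chose here):
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# A one-step pipeline: no extra transformations, just the model itself
pipeline = Pipeline([("classifier", LogisticRegression(max_iter=1000))])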
Then, we’ll run the model_testing function. This function will test several metrics, but we can choose one to optimize
most. Let’s do precision, because we’ve decided that it’s more important to be correct and selective in labeling
survivors.
tm = dt.TrainModel()
tm.model_testing(X_train_combined, y_train, pipeline, "classification", "precision")
Here you’ll get your results. I got the results that a MLP Classifier worked best (you may get slightly different results), so
I’m going to rerun it on the test set using this new classifier plus the best parameters listed and get my final results.
from sklearn.neural_network import MLPClassifier
# Swap the winning model (plus any best parameters reported) into the pipeline;
# the "classifier" step name matches the pipeline sketch above
pipeline.set_params(classifier=MLPClassifier())
pipeline.fit(X_train_combined, y_train)
y_predicted = pipeline.predict(X_test_combined)
Once you have a model that is working well, you can simply save it somewhere (such as S3) and then use it to predict
incoming data somewhere, such as within a micro-service. You’ll need to load the model upon starting the service, and
then you can feed it new data to predict one by one. Make sure to fit your model on your data before saving it, and to
feed it the exact same columns of features in order to make predictions. To do this, you’ll need to rerun all the same
steps that you did when cleaning your data previously.
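One common way to do the saving and loading is with joblib (a hedged sketch; uploading the file to S3 would use whatever storage client your service already has):
import joblib

# After fitting on your training data, serialize the pipeline to a file
joblib.dump(pipeline, "survival_model.joblib")

# Inside the service, load it once at startup and reuse it for every prediction
model = joblib.load("survival_model.joblib")
model.predict(X_test_combined[:1])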
Additional resources
Through this course, you’ve seen ways to implement datto methods that help with the machine learning model
development process. If you’d like to see more examples of how to use this package and additional methods, check out
this Jupyter notebook.
Learning resources:
Conclusion
I hope this guide has given you a better idea of what machine learning is, how it works, and how you can use it to
create simple models!
● You can use machine learning to either predict something (supervised learning) or to create meaningful groups
(unsupervised learning).
● Always start with exploring your data — see what values you have in different columns, check data types, check for
null values, etc.
● To prepare your data for modeling, you’ll need to fill null values, split into a train set and a test set, transform text
data to word frequencies, and transform categorical data into numerical columns
● Your features are the information used to predict something, and your target is what you are trying to predict
● Classifiers are used when you have a categorical target, regressors are used when you have a numerical target
● Accuracy as a way to evaluate your model can be misleading — think about your use case and decide if precision
or recall is a better fit
● Never learn anything from your test set — split your train set into smaller pieces and use gridsearching to test out
different models and parameters to optimize your model
Best of luck working on your own models! Please share any feedback with us about how this course went.