Machine Learning Basics: An Illustrated Guide For Non-Technical Readers
Introduction: Machine Learning Concepts for Everyone
According to Google Trends, interest in the term machine learning (ML) has increased by over 490%
since Dataiku was founded in 2013. We’ve watched ML go from the realm of a relatively small
number of data scientists to the mainstream of analysis and business.
While this has resulted in a plethora of innovations and improvements among our customers
and for organizations worldwide, it has also provoked reactions ranging from curiosity to
anxiety among people everywhere.
We’ve decided to make this guide because we’ve noticed that there aren’t many resources out
there that answer the question, “What is machine learning?” while using a minimum of technical
terms. The basic concepts of machine learning are actually not very difficult to grasp when
they’re explained simply.
In this guidebook, we’ll start with some definitions and then move on to explain some of the
most common algorithms used in machine learning today, such as linear regression and tree-
based models. Then, we’ll dive a bit deeper into how we go about deciding how to evaluate and
fine-tune these models. Next, we’ll take a look at clustering models, and finally we’ll finish with
some resources that you can explore to learn more.
An Introduction to Key Data Science Concepts
A First Look at Machine Learning
What is machine learning? The answer is, in one word: algorithms.
Machine learning is, more or less, a way for computers to learn things without being
specifically programmed. But how does that actually happen?
As humans, we learn through past experiences. We use our senses to obtain these “experiences”
and use them later to survive. Machines learn through commands provided by humans.
These sets of rules are known as algorithms.
Algorithms are sets of rules that a computer is able to follow. Think about how you learned to do
long division — maybe you learned to divide the denominator into the first digits of the numerator,
subtract the subtotal, and continue with the next digits until you were left with a remainder.
Well, that’s an algorithm, and it’s the sort of thing we can program into a computer, which can
perform these sorts of calculations much, much faster than we can.
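If you are curious what that recipe looks like as actual computer code, here is a minimal Python sketch of the long-division steps described above (Python is simply one language this could be written in, and the numbers are arbitrary):

```python
def long_division(numerator, denominator):
    """Divide digit by digit, the way we learn to do it by hand."""
    quotient_digits = []
    remainder = 0
    for digit in str(numerator):                 # work through the numerator one digit at a time
        remainder = remainder * 10 + int(digit)  # bring down the next digit
        quotient_digits.append(str(remainder // denominator))  # how many times does the denominator fit?
        remainder = remainder % denominator      # carry what's left over to the next step
    return int("".join(quotient_digits)), remainder

print(long_division(7425, 3))   # -> (2475, 0): a quotient of 2475 with nothing left over
```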
What does machine learning look like?
In machine learning, our goal is either prediction or clustering. Prediction is a process where,
from a set of input variables, we estimate the value of an output variable. This technique is used
for data that has a precise mapping between input and output, referred to as labeled data.
This is known as supervised learning. For example, using a set of characteristics of a house, we
can predict its sale price. We will discuss unsupervised learning later in the guidebook when
exploring clustering methods.
LABELED DATA (patient information; the SICK column is the class label: 1 = sick, 0 = not sick)

AGE | GENDER | SMOKING | VACCINATION | SICK
 18 | M      | 0       | 1           | 0
 41 | F      | 0       | 0           | 0
 33 | F      | 1       | 0           | 1
 24 | M      | 0       | 1           | 0
 65 | M      | 1       | 0           | 1
 19 | F      | 0       | 1           | 0
UNLABELED DATA (customer information, with no label to predict)

AGE | MARITAL STATUS | GENDER | OCCUPATION | PRODUCT CODE
 18 | Single         | 0      | Student    | S212
Prediction problems are divided into two main categories:
Regression problems, where the variable to predict is numerical. For example: understand how
the number of sleep hours and study hours (the predictors) determine a student’s test scores.
[Illustration: Prediction as regression. The predictors (hours of study and hours of sleep) produce a numerical result, such as test scores of 71, 78, 57, or 63.]
Classification problems, where the variable to predict belongs to one of a number of pre-defined
categories, which can be as simple as "yes" or "no." For example: understand how the number of
sleep hours and study hours (the predictors) determine whether a student will Succeed or Fail,
where “Succeed” and “Fail” are the two class labels.
[Illustration: Prediction as classification. The same predictors produce a class label as the result: Succeed or Fail.]
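To make the regression/classification distinction concrete in code, here is a minimal sketch using scikit-learn (a Python library this guide mentions again later); the tiny sleep-and-study dataset is invented purely for illustration:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Predictors: [hours of sleep, hours of study] -- invented example values.
X = [[8, 2], [6, 4], [5, 1], [7, 3]]
test_scores = [71, 78, 57, 63]   # regression target: a quantity
passed = [1, 1, 0, 0]            # classification target: a class label (1 = succeed, 0 = fail)

regression_model = LinearRegression().fit(X, test_scores)
classification_model = LogisticRegression().fit(X, passed)

print(regression_model.predict([[7, 2]]))      # predicts a number (a test score)
print(classification_model.predict([[7, 2]]))  # predicts a category (succeed or fail)
```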
One way to remember this distinction is that classification is about predicting a category or
class label, while regression is about predicting a quantity. The most prominent and common
algorithms used in machine learning historically and today come in three groups: linear models,
tree-based models, and neural networks. We'll come back to this and explain these groups more
in-depth, but first, let's take a step back and define a few key terms.
The terms we’ve chosen to define here are commonly used in machine learning. Whether you’re
working on a project that involves machine learning or if you’re just curious about what’s going
on in this part of the data world, we hope you’ll find these definitions clear and helpful.
Definitions of 10 Fundamental Terms for Data Science and Machine Learning
Top Prediction Algorithms
Pros and Cons of Some of the Most Common Machine Learning Algorithms
Now, let’s take a look at some of the major types of machine learning algorithms. We can group them into three buckets: linear models, tree-based models, and neural networks.
Linear Models
TREE-BASED MODEL APPROACH
When you hear tree-based, think decision trees, i.e., a sequence of branching operations.
A decision tree is a graph that uses a branching method to show each possible outcome of a
decision. Like if you’re ordering a salad, you first decide the type of lettuce, then the toppings,
then the dressing. We can represent all possible outcomes in a decision tree. In machine learning,
the branches used are binary yes/no answers.
Tree-Based Models
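As a sketch of the idea, here is a tiny hand-written decision tree in Python for the toy patient table shown earlier: in that table every smoker is sick and every non-smoker is not, so a single yes/no branch is enough. Real tree-based models learn their branches from data rather than having them written by hand.

```python
def predict_sick(age, gender, smoking, vaccination):
    # Root question of our one-branch tree: does the patient smoke?
    if smoking == 1:
        return 1   # yes -> predict sick
    else:
        return 0   # no  -> predict not sick

print(predict_sick(age=33, gender="F", smoking=1, vaccination=0))  # -> 1, matching the table
print(predict_sick(age=18, gender="M", smoking=0, vaccination=1))  # -> 0, matching the table
```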
NEURAL NETWORKS
Neural networks take their name from a biological phenomenon: networks of
interconnected neurons that exchange messages with each other.
This idea has now been adapted to the world of machine learning in the form of
artificial neural networks (ANNs).
Deep learning, which you’ve probably heard a lot about, stacks several layers
of neural networks one after the other.
ANNs are a family of models that are trained to perform tasks we usually associate with cognitive skills.
TOP PREDICTION ALGORITHMS

Tree-Based: Random Forest
How it works: takes advantage of many decision trees, with rules created from subsamples of features. Each tree is weaker than a full decision tree, but by combining them we get better overall performance.
Pros: a sort of “wisdom of the crowd.” Tends to result in very high-quality models. Fast to train.
Cons: models can get very large. Not easy to understand predictions.
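For readers who want to see the moving parts, here is a minimal scikit-learn sketch of a random forest; the dataset is randomly generated and the parameter choices are illustrative, not a recommendation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Random, made-up data just to have something to fit.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# n_estimators = how many decision trees to combine ("wisdom of the crowd"),
# max_features = how many features each split may sample from.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

print(forest.predict(X[:3]))        # predictions from the combined trees
print(len(forest.estimators_))      # how many individual trees were built
```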
How to Evaluate Models
Metrics and Methodologies for Choosing the Best Model
By now, you might have already created a machine learning model. But now the question is: How can you tell if it’s a good model? It depends on what kind of model you’ve built.
Result Tab
Interpretation Section
Confusion Matrix
The Confusion matrix compares the actual values of the target variable with the predicted
values. In addition, Dataiku displays some associated metrics, such as “precision,”
“recall,” and the “F1-score.” By default, Dataiku displays the Confusion matrix and
associated metrics at the optimal threshold (or cut-off). However, manual changes to the
cut-off value are reflected in both the Confusion matrix and associated metrics in real-time.
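Outside of Dataiku, the same calculation can be sketched in a few lines of Python with scikit-learn; the scores and the 0.5 cut-off below are invented for illustration:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

actual = [0, 0, 1, 1, 1, 0]                   # true labels
scores = [0.2, 0.6, 0.8, 0.4, 0.9, 0.1]       # the model's predicted probabilities
threshold = 0.5                               # the cut-off; lowering it predicts "1" more often
predicted = [1 if s >= threshold else 0 for s in scores]

print(confusion_matrix(actual, predicted))    # actual vs. predicted counts
print(precision_score(actual, predicted),
      recall_score(actual, predicted),
      f1_score(actual, predicted))            # the associated metrics at this cut-off
```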
The validation step is where you optimize the parameters for each algorithm you want to use. The two most common approaches are k-fold cross-validation and a validation set (the hold-out strategy).

[Diagram: the validation and test workflow. For a given model (for example, a linear model), you select algorithms and metrics, run the calculations, and score the results. Legend of metrics: mean-squared error, R-squared, accuracy, ROC AUC, log loss.]

2. Test Step: score another model and see which performed best on your test set, then apply the best model to the new data. Now you’re ready to deploy the model on brand new data!
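For reference, here is how the metrics named in the legend above could be computed with scikit-learn; the numbers are invented, and the point is only to show where each metric comes from:

```python
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, roc_auc_score, log_loss

# Regression-style metrics compare predicted quantities with actual ones.
actual_scores, predicted_scores = [71, 78, 57, 63], [69, 80, 60, 61]
print(mean_squared_error(actual_scores, predicted_scores), r2_score(actual_scores, predicted_scores))

# Classification-style metrics compare predicted labels (or probabilities) with actual labels.
actual_labels, predicted_probs = [0, 1, 1, 0], [0.3, 0.8, 0.6, 0.2]
predicted_labels = [round(p) for p in predicted_probs]
print(accuracy_score(actual_labels, predicted_labels),
      roc_auc_score(actual_labels, predicted_probs),
      log_loss(actual_labels, predicted_probs))
```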
Introducing the K-fold Strategy and
the Hold-Out Strategy
Comparing 2 Popular Methodologies
When you build a model, don’t wait until you’ve already run it on the test set to
discover that it’s overfitted. Instead, the best practice is to evaluate regularization
techniques on your model before touching the test set. In some ways,
this validation step is just stage two of your training.
The validation step is where we will optimize the hyperparameters for each algorithm we want
to use. Let’s take a look at the definition of parameter and hyperparameter. In machine learning,
tuning the hyperparameters is an essential step in improving machine learning models.
Model parameters are attributes about a model after it has been trained based on known data.
You can think of model parameters as a set of rules that define how a trained model makes
predictions. These rules can be an equation, a decision tree, many decision trees, or something
more complex. In the case of a linear regression model, the model parameters are the coefficients
in our equations that the model “learns” — where each coefficient shows the impact of a change
in an input variable on the outcome variable.
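As a small illustration (with invented house data, echoing the sale-price example from earlier in this guide), here is what those learned coefficients look like after fitting a linear regression with scikit-learn:

```python
from sklearn.linear_model import LinearRegression

# Invented house data: [square meters, number of bedrooms] -> sale price.
X = [[120, 3], [80, 2], [200, 4], [150, 3]]
prices = [300_000, 210_000, 480_000, 360_000]

model = LinearRegression().fit(X, prices)
print(model.coef_)       # one learned coefficient per input variable (its impact on the price)
print(model.intercept_)  # the learned constant term
```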
A machine learning practitioner uses an algorithm’s hyperparameters as levers to control how
a model is trained by the algorithm. For example, when training a decision tree, one of these
controls, or hyperparameters, is called max_depth. Changing this max_depth hyperparameter
controls how deep the eventual model may go.
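Here is that lever in action, sketched with scikit-learn on randomly generated data; the depth values are arbitrary:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# We choose the hyperparameter (max_depth)...
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X, y)

# ...and the algorithm learns the parameters (the actual splits) within that limit.
print(shallow.get_depth(), deep.get_depth())
```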
It is an algorithm’s responsibility to find the optimal parameters for a model based on the
training data, but it is the machine learning practitioner’s responsibility to choose the right
hyperparameters to control the algorithm.
The preferred method of validating a model is called K-fold cross-validation. To do this, you take your training set and split it into some number — called K (hence the name) — of sections, or folds.

The drawback of K-fold cross-validation is that it can take up a lot of time and computing resources. A more efficient, though less robust, approach is to set aside a section of your training set and use it as a validation set: this is called the hold-out strategy.

• The held-out validation set could be the same size as your test set, so you might wind up with a split of 60-20-20 among your training set, validation set, and test set.
• Then, for each parameter combination, calculate the error on your validation set, and choose the model with the lowest error to then calculate the test error on your test set.
• This way, you will have the confidence that you have properly evaluated your model before applying it in the real world.

PROS OF THE HOLD-OUT STRATEGY
Fully independent data; only needs to be run once, so it has lower computational costs.

CONS OF THE HOLD-OUT STRATEGY
Performance evaluation is subject to higher variance given the smaller size of the data.

PROS OF THE K-FOLD STRATEGY
Prone to less variation because it uses the entire training set.

CONS OF THE K-FOLD STRATEGY
Higher computational costs; the model needs to be trained K times at the validation step (plus one more at the test step).
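Both strategies can be sketched in a few lines with scikit-learn; the 60-20-20 split below echoes the illustrative numbers above, while K = 5 is simply an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Hold-out: carve out a validation set and a test set (roughly 60-20-20).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("hold-out validation accuracy:", model.score(X_val, y_val))

# K-fold: train and score K times, each time holding out a different fold of the training data.
print("5-fold scores:", cross_val_score(LogisticRegression(), X_train, y_train, cv=5))
```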
Train/Test Set
Data science and machine learning platforms can do a lot of the heavy lifting for you
when it comes to train/test data. For example, during the model training phase, Dataiku
“holds out” the test set, and the model is only trained on the train set. Once the model
is trained, Dataiku evaluates its performance on the test set.
This ensures that the evaluation is done on data that the model has never seen before.
By default, Dataiku randomly splits the input dataset into a training and a test set, and
the fraction of data used for training can be customized (though again, 80% is a standard
fraction of data to use for training). For more advanced practitioners, there are lots more
settings in Dataiku that can be tweaked for the right train/test set specifications.
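Outside of Dataiku, the equivalent default behavior might look roughly like this scikit-learn sketch: a random split, training on 80% of the data and evaluating on the untouched 20%:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

model = LogisticRegression().fit(X_train, y_train)        # the model only ever sees the train set
print("accuracy on never-seen test data:", model.score(X_test, y_test))
```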
How to Validate Your Model

HOLDOUT STRATEGY
1. Split your data into train / validation / test.
2. For a given model, for each parameter combination, compute the validation metric.
3. Choose the parameter combination with the best metric, then compute the test metric (which can be compared with other models).

K-FOLD STRATEGY
1. Set aside the test set and split the train set into K folds.
2. For a given model, for each parameter combination, train on the other folds and compute a metric on each fold (metric 1 through metric K).
3. Choose the parameter combination with the best metrics, then compute the test metric (which can be compared with other models).
Unsupervised Learning and Clustering
An Overview of the Most Common Example of Unsupervised Learning
What do we mean by unsupervised learning? In unsupervised learning, we aren’t trying to predict a variable; instead, we want to discover hidden patterns within our data that will let us identify groups, or clusters, within that data.
[Illustration: unsupervised learning works with unlabeled observations and groups them through clustering; supervised learning works with labeled observations (ID, X, Y) and uses them for prediction, for example fitting a line y = mx + b.]
Clustering is often used in marketing in order to group users according to multiple characteristics,
such as location, purchasing behavior, age, and gender. It can also be used in scientific research;
for example, to find population clusters within DNA data.
One simple clustering algorithm is DBSCAN. In DBSCAN, you select a distance for your radius,
and you select a point within your dataset — then, all other data points within the radius’s
distance from your initial point are added to the cluster. You then repeat the process for each
new point added to the cluster, and you repeat until no new points are within the radii of the
most recently added points. Then, you choose another point within the dataset and build
another cluster using the same approach.
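Here is a minimal DBSCAN sketch with scikit-learn; the eps parameter plays the role of the radius described above, min_samples is how many neighbors a point needs to seed a cluster, and the handful of points is invented so the clusters are easy to see:

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1, 1], [1.2, 0.9], [0.9, 1.1],     # a tight group
                   [8, 8], [8.1, 7.9], [7.9, 8.2],     # another tight group
                   [4, 15]])                            # an isolated point

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)   # points in the same cluster share a label; -1 marks points left out as noise
```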
DBSCAN is intuitive, but its effectiveness and output rely heavily on what you choose for a radius,
and there are certain types of distributions that it won’t react well to. The most widespread
clustering algorithm is called k-means clustering. In k-means clustering, we pre-define the
number of clusters we want to create — the number we choose is the k, and it is always
a positive integer.
To run k-means clustering, we begin by randomly placing k starting points within our dataset.
These points are called centroids, and they become the prototypes for our k clusters. We create
these initial clusters by assigning every point within the dataset to its nearest centroid. Then,
with these initial clusters created, we calculate the midpoint of each of them, and we move each
centroid to the midpoint of its respective cluster. After that, since the centroids have moved, we
can then reassign each data point to a centroid, create an updated set of clusters, and calculate
updated midpoints. We continue iterating for a predetermined number of times — 300 is pretty
standard. By the time we get to the end, the centroids should move minimally, if at all.
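Here is what that looks like with scikit-learn, whose KMeans implementation also uses 300 iterations as its default cap; the six points are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

# We pre-define k (n_clusters) and cap the iterations at the standard 300.
kmeans = KMeans(n_clusters=2, max_iter=300, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)            # which cluster each point was assigned to
print(kmeans.cluster_centers_)   # the final centroids
```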
HIERARCHICAL CLUSTERING
Hierarchical clustering is another method of clustering. Here, clusters are assigned based on
hierarchical relationships between data points. There are two key types of hierarchical clustering:
agglomerative (bottom-up) and divisive (top-down). Agglomerative is more commonly used as it
is mathematically easier to compute, and is the method used by Python’s scikit-learn library,
so this is the method we’ll explore in detail.
Here’s how it works:
1. Assign each data point to its own cluster, so the number of initial clusters (K) is equal to
the number of initial data points (N).
2. Compute distances between all clusters.
3. Merge the two closest clusters.
4. Repeat steps two and three iteratively until all data points are finally merged into one
large cluster.
[Illustration: six panels showing seven data points being merged step by step, from individual points into progressively larger clusters.]
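Here is a minimal sketch of those agglomerative steps using scikit-learn, the library mentioned above; the seven points are invented, loosely echoing the illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 1], [9.5, 1.2], [5, 4.5]])

# Bottom-up clustering: every point starts as its own cluster, and the closest
# clusters are merged until only the requested number of clusters remains.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(points)
print(labels)   # each point ends up in one of the three merged clusters
```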
Conclusion
You now have all the basics you need to start testing machine learning.
Don’t worry, we won’t leave you empty-handed, though. Here are a few ideas for relatively
straightforward projects that you can get started with today.
How To: Improve Data Quality With an Efficient Data Labeling Process
Practical Next Steps
Meet Katie Gross
Throughout this guidebook, you may have noticed some “In plain English” features and
wondered what those were. These side blurbs are extracts from Dataiku Lead Data Scientist
Katie Gross’ “In plain English” blog series, in which she goes over high-level machine
learning concepts and breaks them down into plain English that everyone can understand.
For further exploration
BOOKS
• A theoretical, but clear and comprehensive, textbook: An Introduction to Statistical
Learning by James, Witten, Hastie, and Tibshirani.
• Anand Rajaraman and Jeffrey Ullman’s book (or PDF), Mining of Massive Datasets, for
some advanced but very practical use cases and algorithms.
• Python Machine Learning: a practical guide around scikit-learn.
• “This was my first machine learning book, and I owe a lot to it” says one of our senior
data scientists.
COURSES
• Andrew Ng’s Coursera/Stanford course on machine learning is basically a requirement!
• “Intro to Machine Learning” on Udacity is a good introduction to machine learning for
people who already have basic Python knowledge. The very regular quizzes and exercises
make it particularly interactive.
Dataiku Academy
In May 2020, Dataiku launched Dataiku Academy, a free online learning tool to enable
all users to build their skills around Dataiku. Dataiku Academy is designed for every type
of user — from non-technical roles to the hardcore coders — and meant to guide them
through self-paced, interactive online training.
Everyday AI, Extraordinary People
45,000+ ACTIVE USERS | 450+ CUSTOMERS
Dataiku is the platform for Everyday AI, systemizing the use of data for exceptional
business results. Organizations that use Dataiku elevate their people (whether technical
and working in code or on the business side and low- or no-code) to extraordinary, arming
them with the ability to make better day-to-day decisions with data.