Machine Learning Basics:

GUIDEBOOK

An Illustrated Guide
for Non-Technical Readers
Introduction:
Machine Learning Concepts
for Everyone
According to Google Trends, interest in the term machine learning (ML) has increased over 490%
since Dataiku was founded in 2013. We’ve watched ML go from the realm of a relatively small
number of data scientists to the mainstream of analysis and business.
While this has resulted in a plethora of innovations and improvements among our customers
and for organizations worldwide, it has also provoked reactions ranging from curiosity to
anxiety among people everywhere.
We’ve decided to make this guide because we’ve noticed that there aren’t many resources out
there that answer the question, “What is machine learning?” while using a minimum of technical
terms. The basic concepts of machine learning are actually not very difficult to grasp when
they’re explained simply.
In this guidebook, we’ll start with some definitions and then move on to explain some of the
most common algorithms used in machine learning today, such as linear regression and tree-
based models. Then, we’ll dive a bit deeper into how we go about deciding how to evaluate and
fine-tune these models. Next, we’ll take a look at clustering models, and finally we’ll finish with
some resources that you can explore to learn more.

We hope you enjoy this guidebook and that no matter how much or how
little you familiarize yourself with machine learning, you find it valuable!

GUIDEBOOK - Dataiku Machine Learning Basics: An Illustrated Guide for Non-Technical Readers 2
An Introduction to Key
Data Science Concepts
A First Look at Machine Learning

What is machine learning?
The answer is, in one word, algorithms.

Machine learning is, more or less, a way for computers to learn things without being
specifically programmed. But how does that actually happen?
As humans, we learn through past experiences. We use our senses to obtain these “experiences”
and use them later to survive. Machines learn through commands provided by humans.
These sets of rules are known as algorithms.

[Figure: PAST EXPERIENCES → BEHAVIOR; DATA & ANSWERS → RULES]

Algorithms are sets of rules that a computer is able to follow. Think about how you learned to do
long division — maybe you learned to divide the denominator into the first digits of the numerator,
subtract the subtotal, and continue with the next digits until you were left with a remainder.
Well, that’s an algorithm, and it’s the sort of thing we can program into a computer, which can
perform these sorts of calculations much, much faster than we can.
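
To make the idea of an algorithm concrete, here is a minimal sketch of the long-division steps described above, written in Python. The function name long_division is made up for this illustration, and the sketch assumes a non-negative whole-number numerator and a positive whole-number denominator.

```python
def long_division(numerator, denominator):
    """Divide digit by digit, the way it is done on paper."""
    quotient_digits = []
    remainder = 0
    for digit in str(numerator):
        # Bring down the next digit of the numerator.
        remainder = remainder * 10 + int(digit)
        # How many times does the denominator fit into what we have so far?
        quotient_digits.append(remainder // denominator)
        # Subtract the subtotal and carry the rest forward.
        remainder = remainder % denominator
    quotient = int("".join(str(d) for d in quotient_digits))
    return quotient, remainder


print(long_division(1234, 7))  # (176, 2), because 7 x 176 + 2 = 1234
```

Each pass through the loop is one "bring down a digit, divide, subtract" step, which is exactly the kind of fixed rule a computer can repeat very quickly.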

What does machine learning look like?

In machine learning, our goal is either prediction or clustering. Prediction is a process where,
from a set of input variables, we estimate the value of an output variable. This technique is used
for data that has a precise mapping between input and output, referred to as labeled data.
This is known as supervised learning. For example, using a set of characteristics of a house, we
can predict its sale price. We will discuss unsupervised learning later in the guidebook when
exploring clustering methods.

LABELED DATA

Patient information                      Label
AGE   GENDER   SMOKING   VACCINATION     SICK
18    M        0         1               0
41    F        0         0               0
33    F        1         0               1
24    M        0         1               0
65    M        1         0               1
19    F        0         1               0

Class label: 1 = SICK, 0 = NOT SICK

UNLABELED DATA

Customer information
AGE   MARITAL STATUS   GENDER   OCCUPATION      PRODUCT CODE
18    Single           0        Student         S212
41    Married          0        Teacher         M211
33    Single           1        Nurse           S600
24    Single           0        Store manager   S703
65    Married          1        Retired         M107

[Figure labels: Customer information, Product, Purchase, Pattern/Similarities]

Prediction problems are divided into two main categories:
Regression problems, where the variable to predict is numerical. For example: Understand how
the number of sleep hours and study hours (the predictors) determine a student’s test score.

[Figure: PREDICTION: REGRESSION. Predictors: hours of study, hours of sleep. Result: a number (71 and 57 in the illustration).]

Classification problems, where the variable to predict is part of one of some number of pre-
defined categories, which can be as simple as "yes" or "no." For example: Understand how the
number of sleep hours and study hours (the predictors) determine if a student will Succeed or
Fail, where “Succeed” and “Fail” are two class labels.

[Figure: PREDICTION: CLASSIFICATION. Predictors: hours of study, hours of sleep. Class labels: Succeed, Fail.]

One way to remember this distinction is that classification is about predicting a category or
class label, while regression is about predicting a quantity. The most prominent and common
algorithms used in machine learning historically and today come in three groups: linear models,
tree-based models, and neural networks. We'll come back to this and explain these groups more
in-depth, but first, let's take a step back and define a few key terms.
The terms we’ve chosen to define here are commonly used in machine learning. Whether you’re
working on a project that involves machine learning or if you’re just curious about what’s going
on in this part of the data world, we hope you’ll find these definitions clear and helpful.

Definitions of 10 Fundamental Terms for Data Science
and Machine Learning

MODEL [ˈmɑdəl] / noun
1. a mathematical representation of a real world process; a predictive model forecasts a future outcome based on past behaviors.

REGRESSION [rəˈgrɛʃən] / noun
1. a prediction method whose output is a real number, that is, a value that represents a quantity along a line. For example: Predicting the temperature of an engine or the revenue of a company.

TRAINING [ˈtreɪnɪŋ] / verb
1. the process of creating a model from the training data. The data is fed into the training algorithm, which learns a representation for the problem, and produces a model. Also called “learning.”

TARGET [ˈtɑrgət] / noun
1. in statistics, it is called the dependent variable; it is the output of the model or the variable you wish to predict.

CLASSIFICATION [ˌklæsəfəˈkeɪʃən] / noun
1. a prediction method that assigns each data point to a predefined category, e.g., a type of operating system.

TEST SET [tɛst sɛt] / noun
1. a dataset, separate from the training set but with the same structure, used to measure and benchmark the performance of various models.

TRAINING SET [ˈtreɪnɪŋ sɛt] / noun
1. a dataset used to find potentially predictive relationships that will be used to create a model.

OVERFITTING [ˌoʊvərˈfɪtɪŋ] / verb
1. a situation in which a model that is too complex for the data has been trained to predict the target. This leads to an overly specialized model, which makes predictions that do not reflect the reality of the underlying relationship between the features and target.

FEATURE [ˈfitʃər] / noun
1. also known as an independent variable or a predictor variable, a feature is an observable quantity, recorded and used by a prediction model. You can also engineer features by combining them or adding new information to them.

ALGORITHM [ˈælgəˌrɪðəm] / noun
1. a set of rules used to make a calculation or solve a problem.

Top Prediction Algorithms
Pros and Cons of Some of the Most Common Machine Learning Algorithms

Now, let’s take a look at some of the major types of machine learning
algorithms. We can group them into three buckets:
linear models, tree-based models, and neural networks.

LINEAR MODEL APPROACH

A linear model uses a simple formula to find the “best fit” line through a set of data points.
This methodology dates back over 200 years, and it has been used widely throughout statistics
and machine learning. It is useful for statistics because of its simplicity — the variable you want
to predict (the dependent variable) is represented as an equation of variables you know
(independent variables), and so prediction is just a matter of inputting the independent
variables and having the equation spit out the answer.
For example, you might want to know how long it will take to bake a cake, and your regression
analysis might yield an equation t = 0.5x + 0.25y, where t is the baking time in hours, x is the
weight of the cake batter in kg, and y is a variable which is 1 if it is chocolate and 0 if it is not.
If you have 1 kg of chocolate cake batter (we love cake), then you plug your variables into our
equation, and you get t = (0.5 x 1) + (0.25 x 1) = 0.75 hours, or 45 minutes.
Note that linear regressions can be simple or multiple. In multiple linear regression, the value
of the target variable changes based on the value of more than one independent variable, or x.
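
The cake example can be written as a few lines of Python. This is only an illustrative sketch: the function name baking_time_hours is invented here, and the coefficients 0.5 and 0.25 are the ones from the example equation, not values learned from real data.

```python
def baking_time_hours(batter_kg, is_chocolate):
    """Apply the example equation t = 0.5x + 0.25y."""
    x = batter_kg                    # weight of the cake batter in kg
    y = 1 if is_chocolate else 0     # 1 for chocolate, 0 otherwise
    return 0.5 * x + 0.25 * y


hours = baking_time_hours(1.0, is_chocolate=True)
print(hours, "hours, or", hours * 60, "minutes")  # 0.75 hours, or 45.0 minutes
```

Because the equation uses two independent variables (batter weight and flavor), it is also a small example of multiple linear regression.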

Linear Models

“Remember, linear models generate a formula to create a best-fit line to predict unknown values. Linear models are
considered “old school” and often not as predictive as newer
algorithm classes, but they can be trained relatively quickly
and are generally more straightforward to interpret, which
can be a big plus!”
Katie Gross

TREE-BASED MODEL APPROACH
When you hear tree-based, think decision trees, i.e., a sequence of branching operations.
A decision tree is a graph that uses a branching method to show each possible outcome of a
decision. For example, if you’re ordering a salad, you first decide the type of lettuce, then the toppings,
then the dressing. We can represent all possible outcomes in a decision tree. In machine learning,
the branches used are binary yes/no answers.

Tree-Based Models

“Tree-based models are very popular in machine learning.
The decision tree, the foundation of tree-based models, is quite
straightforward to interpret, but generally a weak predictor.
Ensemble models can be used to generate stronger predictions
from many trees, with random forest and gradient boosting
as two of the most popular. All tree-based models can be used
for regression or classification and can handle non-linear
relationships quite well.”
Katie Gross

NEURAL NETWORKS
Neural networks refer to a biological phenomenon composed of
interconnected neurons that exchange messages with each other.
This idea has now been adapted to the world of machine learning, where
such models are called artificial neural networks (ANNs).
Deep learning, which you’ve probably heard a lot about, uses
several layers of neural networks put one after the other.
ANNs are a family of models that are taught to adopt cognitive skills.

TOP PREDICTION ALGORITHMS

TYPE: Linear

NAME: Linear Regression
DESCRIPTION: The “best fit” line through all data points. Predictions are numerical.
ADVANTAGES: Easy to understand: you clearly see what the biggest drivers of the model are.
DISADVANTAGES: Sometimes too simple to capture complex relationships between variables. Does poorly with correlated features.

NAME: Logistic Regression
DESCRIPTION: The adaptation of linear regression to problems of classification (e.g., yes/no questions, groups, etc.).
ADVANTAGES: Also easy to understand.
DISADVANTAGES: Sometimes too simple to capture complex relationships between variables. Does poorly with correlated features.

TYPE: Tree-Based

NAME: Decision Tree
DESCRIPTION: A series of yes/no rules based on the features, forming a tree, to match all possible outcomes of a decision.
ADVANTAGES: Easy to understand.
DISADVANTAGES: Not often used on its own for prediction because it’s also often too simple and not powerful enough for complex data.

NAME: Random Forest
DESCRIPTION: Takes advantage of many decision trees, with rules created from subsamples of features. Each tree is weaker than a full decision tree, but by combining them we get better overall performance.
ADVANTAGES: A sort of “wisdom of the crowd”. Tends to result in very high quality models. Fast to train.
DISADVANTAGES: Models can get very large. Not easy to understand predictions.

NAME: Gradient Boosting
DESCRIPTION: Uses even weaker decision trees, that are increasingly focused on “hard” examples.
ADVANTAGES: High-performing.
DISADVANTAGES: A small change in the feature set or training set can create radical changes in the model. Not easy to understand predictions.

TYPE: Neural Networks

NAME: Neural Networks
DESCRIPTION: Interconnected “neurons” that pass messages to each other. Deep learning uses several layers of neural networks stacked on top of one another.
ADVANTAGES: Can handle extremely complex tasks; no other algorithm comes close in image recognition.
DISADVANTAGES: Very slow to train, because they often have a very complex architecture. Almost impossible to understand predictions.

How to Evaluate Models
Metrics and Methodologies for Choosing the Best Model

By now, you might have already created a machine learning model.
But now the question is: How can you tell if it’s a good model?
It depends on what kind of model you’ve built.

METRICS FOR EVALUATING MODELS

We’ve already talked about training sets and test sets — this is when you break your data into
two parts: one to train your model and the other to test it. After you’ve trained your model using
the training set, you want to test it with the test set. Makes sense, right? So now, which metrics
should you use with your test set?
Not even sure where to start when it comes to building, much less evaluating models?
Dataiku can help! Dataiku is one of the world’s leading AI and machine learning platforms,
supporting agility in organizations’ data efforts via collaborative, elastic, and responsible AI,
all at enterprise scale. Throughout this guide, we'll show you tips like the following on how tools
like Dataiku can help you if you're new to machine learning.

Result Tab

Dataiku is the world’s leading AI and machine learning platform, supporting
agility in organizations’ data efforts via collaborative, elastic, and responsible AI,
all at enterprise scale.
In Dataiku the Models page of a visual analysis consists of a Result tab that is useful for
comparing model performance across different algorithms and training sessions.
By default, the models are grouped by Session. However, we can select the Models view
to assess all models in one window, or the Table view to see all models along with more
detailed metrics.
There are several metrics for evaluating machine learning models, depending on whether
you are working with a regression model or a classification model.
For regression models, you want to look at mean squared error and R2. Mean squared
error is calculated by computing the square of all errors and averaging them over all
observations. The lower this number is, the more accurate your predictions are.
R2 (pronounced R-squared) is the proportion of the observed variance from the mean
that is explained (that is, predicted) by your model. R2 typically falls between 0 and 1
(it can drop below 0 on a test set when a model predicts worse than simply guessing the
mean), and a higher number is better.
For classification models, the simplest metric for evaluating a model is accuracy.
Accuracy is a common word, but in this case we have a very specific way of calculating it.
Accuracy is the percentage of observations which were correctly predicted by the model.
Accuracy is simple to understand, but should be interpreted with caution, in particular
when the various classes to predict are unbalanced: if 95% of observations belong to one
class, a model that always predicts that class is 95% accurate without having learned
anything. Another metric you might come across is the ROC AUC, which measures how well
the model ranks positive cases above negative ones across all possible cut-offs. AUC stands
for “area under the curve.” A higher ROC AUC generally means you have a better model.
Logarithmic loss, or log loss, is a metric often used in competitions like those run by
Kaggle, and it is applied when your classification model outputs not strict classifications
(e.g., true and false) but class membership probabilities (e.g., a 10% chance of being true,
a 75% chance of being true, etc.). Log loss applies heavier penalties to incorrect predictions
that your model made with high confidence.
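
As a rough illustration of how the metrics above are calculated, here is a small sketch in plain Python. The function names and the toy numbers are invented for this example; tools like Dataiku compute these metrics for you.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared errors over all observations."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variance around the mean that the predictions explain."""
    mean = sum(actual) / len(actual)
    total = sum((a - mean) ** 2 for a in actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - residual / total


def accuracy(actual, predicted):
    """Fraction of observations whose class was predicted correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def log_loss(actual, probabilities):
    """Average penalty; 'probabilities' are predicted chances of class 1."""
    penalties = [math.log(p if a == 1 else 1 - p)
                 for a, p in zip(actual, probabilities)]
    return -sum(penalties) / len(actual)


# Regression: made-up sale prices and predictions.
prices, predicted_prices = [100, 200, 300], [110, 190, 300]
print(mean_squared_error(prices, predicted_prices))
print(r_squared(prices, predicted_prices))

# Classification: three of the four made-up labels are predicted correctly.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))

# Log loss: a confident wrong answer costs more than a hesitant one.
print(log_loss([1], [0.1]), log_loss([1], [0.4]))
```

In the last line, the true class is 1 in both cases; predicting only a 10% chance of it is penalized more heavily than predicting a 40% chance.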

Interpretation Section

Dataiku displays algorithm-dependent interpretation panels. For example, a linear
model, such as Logistic Regression, will display information about the model’s coefficients,
instead of variable importance, as in the case of an XGBoost model.

THE CONFUSION MATRIX

One tool used to evaluate and compare models is a simple table known as a confusion
matrix. The number of columns and rows in the table depends on the number of possible
outcomes. For binary classification, there are only two possible outcomes, so there are only
two columns and two rows.
The labels that make up a confusion matrix are TP, or true positive, FN, or false negative, FP,
or false positive, and TN, or true negative.
There will always be errors in prediction models. When a model incorrectly predicts an
outcome as true when it should have predicted false, it is labeled as FP, or false positive.
This is known as a type I error. When a model incorrectly predicts an outcome as false when
it should have predicted true, it is labeled as FN, or false negative. This is known as a type II
error.
Depending on our use case, we have to decide if we are more willing to accept higher numbers
of type I or type II errors. For example, if we were classifying patient test results as indicative
of cancer or not, we would be more concerned with a high number of false negatives.
In other words, we would want to minimize the number of predictions where the model
falsely predicts that a patient’s test result is not indicative of cancer.
Similarly, if we were classifying a person as either guilty of a crime, or not guilty of a crime,
we would be more concerned with high numbers of false positives. We would want to reduce
the number of predictions where the model falsely predicts that a person is guilty.

                   Predicted positive     Predicted negative
Actual positive    True Positive (TP)     False Negative (FN)
Actual negative    False Positive (FP)    True Negative (TN)
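
The four cells of a confusion matrix can be counted with a few lines of Python. This is a minimal sketch with made-up labels (1 = positive, 0 = negative); the function name confusion_counts is hypothetical. Precision and recall are computed with their usual definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN).

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # correctly predicted positive
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # missed positive (type II error)
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # false alarm (type I error)
    tn = sum(a == 0 and p == 0 for a, p in pairs)  # correctly predicted negative
    return tp, fn, fp, tn


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(actual, predicted)
print(tp, fn, fp, tn)        # 3 1 1 3
print(tp / (tp + fp))        # precision: 0.75
print(tp / (tp + fn))        # recall: 0.75
```

In the cancer example, the number to watch would be fn (missed cases); in the guilty/not guilty example, it would be fp.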

Confusion Matrix

The Confusion matrix compares the actual values of the target variable with the predicted
values. In addition, Dataiku displays some associated metrics, such as “precision,”
“recall,” and the “F1-score.” By default, Dataiku displays the Confusion matrix and
associated metrics at the optimal threshold (or cut-off). However, manual changes to the
cut-off value are reflected in both the Confusion matrix and associated metrics in real-time.

OVERFITTING AND REGULARIZATION

When you train your model using the training set, the model learns the underlying patterns in
that training set in order to make predictions. But the model also learns peculiarities of that
data that don’t have any predictive value. And when those peculiarities start to influence the
prediction, we’ll do such a good job at explaining our training set that the performance on the
test set (and on any new data, for that matter) will suffer. This is called overfitting, and it can be
one of the biggest challenges to building a predictive model. The remedy for overfitting is called
regularization, which is basically just the process of simplifying your model or making it less
specialized.
For linear regression, regularization takes the form of L2 and L1 regularization.
The mathematics of these approaches is beyond the scope of this guidebook, but conceptually they’re
fairly simple. Imagine you have a regression model with a bunch of variables and a bunch of
coefficients, in the model y = C1a + C2b + C3c…, where the Cs are coefficients and a, b, and c are
variables. What L2 regularization does is reduce the magnitude of the coefficients, so that the
impact of individual variables is somewhat dulled.
Now, imagine that you have a lot of variables (dozens, hundreds, or even more) with small
but non-zero coefficients. L1 regularization eliminates many of these variables by shrinking
their coefficients all the way to zero, working under the assumption that much of what
they’re capturing is just noise.
For decision tree models, regularization can be achieved by limiting tree depth. A deep tree,
that is, one with many levels of decision nodes, will be complex, and the deeper it is, the
more complex it is. By limiting the depth of a tree, making it shallower, we accept losing
some accuracy on the training set, but the model will be more general.
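
For readers who want to see the difference between L2 and L1 in numbers, here is a small sketch under simplifying assumptions: a model with a single variable and no constant term (y = C·a), where the penalty added to the sum of squared errors is strength × C² for L2 and strength × |C| for L1. The function names, the strength value, and the data are invented for this illustration.

```python
import math


def plain_coefficient(a, y):
    """Best-fit coefficient with no regularization."""
    return sum(u * v for u, v in zip(a, y)) / sum(u * u for u in a)


def l2_coefficient(a, y, strength):
    """L2 shrinks the coefficient toward zero but rarely all the way."""
    return sum(u * v for u, v in zip(a, y)) / (sum(u * u for u in a) + strength)


def l1_coefficient(a, y, strength):
    """L1 subtracts a fixed amount, so small coefficients become exactly zero."""
    s_ay = sum(u * v for u, v in zip(a, y))
    shrunk = max(abs(s_ay) - strength / 2, 0.0)
    return math.copysign(shrunk, s_ay) / sum(u * u for u in a)


a = [1, 2, 3, 4]
y_signal = [2.1, 3.9, 6.2, 7.8]   # clearly related to a (roughly y = 2a)
y_noise = [0.2, -0.1, 0.1, 0.0]   # essentially unrelated to a

for name, y in [("signal", y_signal), ("noise", y_noise)]:
    print(name,
          round(plain_coefficient(a, y), 4),
          round(l2_coefficient(a, y, strength=2.0), 4),
          round(l1_coefficient(a, y, strength=2.0), 4))
```

With these numbers, both methods leave the strong coefficient close to 2, while the tiny "noise" coefficient is merely made smaller by L2 and set to exactly zero by L1.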

The validation step is where you optimize the parameters for each algorithm you want to use.
The two most common approaches are k-fold cross validation and a validation set; which
approach you use will depend on your requirements and constraints!

Legend: LINEAR MODEL, TREE MODEL, OTHER ALGORITHMS

1. Your Approach

VALIDATION SET VS K-FOLD CROSS VALIDATION
A validation set reduces the amount of data you can use, but it is simple and cheap.
K-folds provide better results, but are more expensive and take more time.

Common Metrics for Evaluating Models: MEAN-SQUARED ERROR, R-SQUARED, ACCURACY, ROC AUC, LOG LOSS

2. Test Step

The test step is where you take the best version of each algorithm and apply it
to your test set, a set of data that has not been used in either the training or
validating of the models. Your challenge here is to decide which metric to use for
evaluation.

3. Your Best Model

Based on the metrics you chose, you will be able to evaluate one algorithm against
another and see which performed best on your test set. Now you're ready to deploy
the model on brand new data!

                   HUMAN                              MACHINE
Validation step    Selects algorithms and metrics     Runs calculation
Test step          Selects metrics                    Runs calculation
Model              Decides how to apply the model     Scores new data

Introducing the K-fold Strategy and
the Hold-Out Strategy
Comparing 2 Popular Methodologies

When you build a model, don’t wait until you’ve already run it on the test set to
discover that it’s overfitted. Instead, the best practice is to evaluate regularization
techniques on your model before touching the test set. In some ways,
this validation step is just stage two of your training.

The validation step is where we will optimize the hyperparameters for each algorithm we want
to use. Let’s take a look at the definition of parameter and hyperparameter. In machine learning,
tuning the hyperparameters is an essential step in improving machine learning models.
Model parameters are attributes of a model after it has been trained based on known data.
You can think of model parameters as a set of rules that define how a trained model makes
predictions. These rules can be an equation, a decision tree, many decision trees, or something
more complex. In the case of a linear regression model, the model parameters are the coefficients
in our equations that the model “learns” — where each coefficient shows the impact of a change
in an input variable on the outcome variable.
A machine learning practitioner uses an algorithm’s hyperparameters as levers to control how
a model is trained by the algorithm. For example, when training a decision tree, one of these
controls, or hyperparameters, is called max_depth. Changing this max_depth hyperparameter
controls how deep the eventual model may go.
It is an algorithm’s responsibility to find the optimal parameters for a model based on the
training data, but it is the machine learning practitioner’s responsibility to choose the right
hyperparameters to control the algorithm.

The preferred method of validating a model is called K-fold Cross-Validation. To do this,
you take your training set and split it into some number, called K (hence the name), of
sections, or folds.

[Figure: the training set split into FOLD 1, FOLD 2, FOLD 3, ..., FOLD K]

Then, for each combination of parameters (e.g., tree depth of five and 15 trees), test
the model on the fold and calculate the error after training it on the rest of the
training set NOT part of the fold, and then continue doing this until you’ve calculated
the error on all K folds.
The average of these errors is your cross validated error for each combination of
parameters.

[Figure: for each parameter combination (e.g., Parameter A, depth = 5; Parameter B = 15), train on the other folds and compute the metric on the remaining fold, giving METRIC 1 through METRIC K, which together give the CROSS VALIDATED METRIC]

Then, choose the parameters with the best cross validated error, train your model on
the full training set, and then compute your error on the test set, which until
now hasn’t been touched.
This test error can now be used to compare with other algorithms.

[Figure: RETRAIN MODEL ON ALL TRAINING DATA, then COMPUTE METRIC ON TEST SET, giving the TEST METRIC (CAN COMPARE WITH OTHER MODELS)]
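
The K-fold procedure can be sketched in plain Python. Everything in this example is an assumption made for illustration: the data is made up, the "model" is a very simple nearest-neighbors predictor (it predicts the average target of the closest training points), and the setting being tuned, n_neighbors, stands in for settings such as tree depth.

```python
def knn_predict(train, x, n_neighbors):
    """Predict y for x as the mean y of the n_neighbors closest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(train, held_out, n_neighbors):
    errors = [(knn_predict(train, x, n_neighbors) - y) ** 2 for x, y in held_out]
    return sum(errors) / len(errors)


def cross_validated_error(data, n_neighbors, n_folds):
    """Average the error over the folds, each fold held out exactly once."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    fold_errors = []
    for i, fold in enumerate(folds):
        # Train on every fold except the one currently held out.
        rest = [p for j, other in enumerate(folds) if j != i for p in other]
        fold_errors.append(mean_squared_error(rest, fold, n_neighbors))
    return sum(fold_errors) / n_folds


# Made-up training data: y is roughly 2x, with a small wobble.
training_set = [(x, 2.0 * x + (1 if x % 2 else -1)) for x in range(20)]

candidates = [1, 3, 5]
cv_errors = {c: cross_validated_error(training_set, c, n_folds=5) for c in candidates}
best = min(cv_errors, key=cv_errors.get)
print(cv_errors)
print("best n_neighbors:", best)
```

The setting with the lowest cross validated error would then be used to train on the full training set before measuring the error once on the untouched test set.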

The drawback of K-fold cross-validation is that it can take up a lot of time and
computing resources. A more efficient though less robust approach is to set aside a
section of your training set and use it as a validation set; this is called the
hold-out strategy.
• The held-out validation set could be the same size as your test set, so you might
wind up with a split of 60-20-20 among your training set, validation set, and test set.
• Then, for each parameter combination, calculate the error on your validation set,
and then choose the model with the lowest error to then calculate the test error on
your test set.
• This way, you will have the confidence that you have properly evaluated your model
before applying it in the real world.

PROS OF THE HOLD-OUT STRATEGY
Fully independent data; only needs to be run once so has lower computational costs.

CONS OF THE HOLD-OUT STRATEGY
Performance evaluation is subject to higher variance given the smaller size of the data.

PROS OF THE K-FOLD STRATEGY
Prone to less variation because it uses the entire training set.

CONS OF THE K-FOLD STRATEGY
Higher computational costs; the model needs to be trained K times at the validation
step (plus one more at the test step).
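
The hold-out strategy with a 60-20-20 split might look like the sketch below. As before, the data, the simple nearest-neighbors "model", and the n_neighbors setting are assumptions made for illustration. Refitting on the training and validation data together before the final test is one common choice, not the only one.

```python
import random


def knn_predict(train, x, n_neighbors):
    """Predict y for x as the mean y of the n_neighbors closest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(train, held_out, n_neighbors):
    errors = [(knn_predict(train, x, n_neighbors) - y) ** 2 for x, y in held_out]
    return sum(errors) / len(errors)


# Made-up data, shuffled with a fixed seed so the split is repeatable.
data = [(x, 2.0 * x + (1 if x % 2 else -1)) for x in range(50)]
random.Random(0).shuffle(data)

# 60% training, 20% validation, 20% test.
n = len(data)
train = data[: n * 60 // 100]
validation = data[n * 60 // 100 : n * 80 // 100]
test = data[n * 80 // 100 :]

# Pick the setting with the lowest error on the validation set.
candidates = [1, 3, 5]
validation_errors = {c: mean_squared_error(train, validation, c) for c in candidates}
best = min(validation_errors, key=validation_errors.get)

# Only now touch the test set, once, with the chosen setting.
test_error = mean_squared_error(train + validation, test, best)
print("best n_neighbors:", best, "test error:", test_error)
```

Each candidate setting is evaluated only once here, which is why this approach is cheaper than K-fold cross-validation.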

Train/Test Set

Data science and machine learning platforms can do a lot of the heavy lifting for you
when it comes to train/test data. For example, during the model training phase, Dataiku
“holds out” on the test set, and the model is only trained on the train set. Once the model
is trained, Dataiku evaluates its performance on the test set.
This ensures that the evaluation is done on data that the model has never seen before.
By default, Dataiku randomly splits the input dataset into a training and a test set, and
the fraction of data used for training can be customized (though 80% is a standard
fraction of data to use for training). For more advanced practitioners, there are lots more
settings in Dataiku that can be tweaked for the right train/test set specifications.

How to Validate Your Model
HOLDOUT STRATEGY

1. Split your data into train / validation / test.
2. For a given model, for each parameter combination (e.g., Parameter A, depth = 5; Parameter B, n trees = 15): train the model, then compute the metric on the validation set. This is the VALIDATION METRIC.
3. Choose the parameter combination with the best metric (6 and 14 in the illustration), retrain the model on all training data, and compute the metric on the test set. This TEST METRIC can be compared with other models.

K-FOLD STRATEGY

1. Set aside the test set and split the train set into k folds (FOLD 1, FOLD 2, FOLD 3, ..., FOLD K).
2. For a given model, for each parameter combination (e.g., Parameter A, depth = 5; Parameter B, n trees = 15): train on the other folds and compute the metric on the remaining fold, from FOLD 1 (METRIC 1) through FOLD K (METRIC K). Together these give the CROSS VALIDATED METRIC.
3. Choose the parameter combination with the best metrics (6 and 14 in the illustration), retrain the model on all training data, and compute the metric on the test set. This TEST METRIC can be compared with other models.

Unsupervised Learning and Clustering
An Overview of the Most Common Example of Unsupervised Learning

What do we mean by unsupervised learning?

By unsupervised, we mean that we’re not trying to predict a variable; instead, we
want to discover hidden patterns within our data that will let us identify groups,
or clusters, within that data.

UNSUPERVISED LEARNING (unlabeled data, used for CLUSTERING)

OBSERVATIONS
S212   M   18
S212   F   41
S600   F   33
M107   M   24
M107   M   65

SUPERVISED LEARNING (labeled data, used for PREDICTION, e.g., y = mx + b)

ID     X   Y
ID1    1   1.2
ID2    2   2.3
ID3    3   2.4
ID4    4   2.9
ID5    5   4.2

Clustering is often used in marketing in order to group users according to multiple characteristics,
such as location, purchasing behavior, age, and gender. It can also be used in scientific research;
for example, to find population clusters within DNA data.
One simple clustering algorithm is DBSCAN. In DBSCAN, you select a distance for your radius,
and you select a point within your dataset — then, all other data points within the radius’s
distance from your initial point are added to the cluster. You then repeat the process for each
new point added to the cluster, and you repeat until no new points are within the radii of the
most recently added points. Then, you choose another point within the dataset and build
another cluster using the same approach.
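
The radius-based procedure described above can be sketched as follows. This is a simplified illustration of the steps as written here, not a full DBSCAN implementation: real DBSCAN also requires a minimum number of neighbors before growing a cluster and labels isolated points as noise. The function name radius_clusters and the sample points are invented for this example.

```python
import math


def radius_clusters(points, radius):
    """Group points by repeatedly absorbing neighbors within 'radius'."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        # Choose a starting point that is not yet in any cluster.
        seed = min(unvisited)
        unvisited.remove(seed)
        cluster = [seed]
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            # Every unassigned point within the radius joins the cluster...
            neighbors = [i for i in unvisited
                         if math.dist(points[current], points[i]) <= radius]
            for i in neighbors:
                unvisited.remove(i)
            cluster.extend(neighbors)
            # ...and the search repeats from each newly added point.
            frontier.extend(neighbors)
        clusters.append(sorted(cluster))
    return clusters


points = [(0, 0), (0, 1), (1, 1), (10, 10), (10, 11), (50, 50)]
print(radius_clusters(points, radius=1.5))  # [[0, 1, 2], [3, 4], [5]]
```

The result lists the positions of the points in each cluster. Changing the radius changes the answer: a radius of 20, for example, would merge the first five points into a single cluster.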
DBSCAN is intuitive, but its effectiveness and output rely heavily on what you choose for a radius,
and there are certain types of distributions that it won’t react well to. The most widespread
clustering algorithm is called k-means clustering. In k-means clustering, we pre-define the
number of clusters we want to create — the number we choose is the k, and it is always
a positive integer.
To run k-means clustering, we begin by randomly placing k starting points within our dataset.
These points are called centroids, and they become the prototypes for our k clusters. We create
these initial clusters by assigning every point within the dataset to its nearest centroid. Then,
with these initial clusters created, we calculate the midpoint of each of them, and we move each
centroid to the midpoint of its respective cluster. After that, since the centroids have moved, we
can then reassign each data point to a centroid, create an updated set of clusters, and calculate
updated midpoints. We continue iterating for a predetermined number of times — 300 is pretty
standard. By the time we get to the end, the centroids should move minimally, if at all.

K-MEANS CLUSTERING ALGORITHM IN ACTION

A popular clustering algorithm, k-means clustering identifies clusters via an iterative process.
The “k” in k-means is the number of clusters, and it is chosen before the algorithm is run.
1. First, we choose the number of clusters we want — in this case, we choose eight.
Thus, eight centroids are randomly chosen within our dataset.
2. Each data point is assigned to its closest centroid — this creates the first set of clusters,
which are represented by different colors.
3. The midpoint — also called the center of gravity — for each cluster is calculated, and the
centroids are moved to these points. Then, new clusters are formed, and the process is iterated.
4. The algorithm is terminated after a predetermined number of iterations. In this case,
we use 300, which is a common setting. The result: our final clusters!

HIERARCHICAL CLUSTERING
Hierarchical clustering is another method of clustering. Here, clusters are assigned based on
hierarchical relationships between data points. There are two key types of hierarchical clustering:
agglomerative (bottom-up) and divisive (top-down). Agglomerative clustering is more commonly used
because it is mathematically easier to compute, and it is the method used by Python’s scikit-learn
library, so this is the method we’ll explore in detail.
Here’s how it works:
1. Assign each data point to its own cluster, so the number of initial clusters (K) is equal to
the number of initial data points (N).
2. Compute distances between all clusters.
3. Merge the two closest clusters.
4. Repeat steps two and three iteratively until all data points are finally merged into one
large cluster.
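The four steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the scikit-learn implementation: the function name agglomerative is made up for this example, and the distance between two clusters is taken to be the distance between their closest members (often called single linkage), which is only one of several possible choices.

```python
def agglomerative(points):
    """Return the merge history as a list of (cluster_a, cluster_b) pairs.

    Each cluster is a tuple of point indices.
    """
    # Step 1: every point starts in its own cluster, so K equals N.
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def distance(a, b):
        # Assumption: single linkage (distance between the closest members).
        return min(
            sum((x - y) ** 2 for x, y in zip(points[i], points[j])) ** 0.5
            for i in a
            for j in b
        )

    # Step 4: repeat until everything is merged into one large cluster.
    while len(clusters) > 1:
        # Step 2: compare all pairs of clusters and find the closest pair.
        a, b = min(
            (
                (clusters[i], clusters[j])
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda pair: distance(*pair),
        )
        # Step 3: merge the two closest clusters.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


# Five points on a line: two close pairs and one far-away point.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
for a, b in agglomerative(points):
    print("merge", a, "with", b)
```

Reading the merge history from top to bottom shows the hierarchy being built: the closest points join first, and the most distant point is absorbed last.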

Conclusion
You now have all the basics you need to start testing machine learning.
Don’t worry, though: we won’t leave you empty-handed. Here are a few ideas for relatively
straightforward projects that you can get started with today.

How To: Execute Anomaly Detection at Scale

How To: Address Churn with Predictive Analytics

How To: Improve Data Quality With an Efficient Data Labeling Process

How To: Operationalize Data Science

How To: Future Proof your Operations with Predictive Maintenance

How To: Drive Serendipitous Discovery with Recommendation Engines

Practical Next Steps

It’s now time for you to put everything you learned here into practice!
The Machine Learning Basics, Continued: Building Your First Machine Learning Model
guidebook will walk you through some of the main considerations when building your first
predictive machine learning model, including defining the goal, preparing the data, and
building, tuning, and interpreting the model.

Meet Katie Gross
Throughout this guidebook, you may have noticed some “In plain English” features and
wondered what those were. These side blurbs are extracts from Dataiku Lead Data Scientist
Katie Gross’ “In plain English” blog series, in which she goes over high-level machine
learning concepts and breaks them down into plain English that everyone can understand.

You can check out all of these deep-dive blogs, as well as a webinar with GigaOm Research
featuring Katie Gross, here to continue your learning journey.

For further exploration
BOOKS
• A theoretical, but clear and comprehensive, textbook: An Introduction to Statistical
Learning by James, Witten, Hastie, and Tibshirani.
• Anand Rajaraman and Jeffrey Ullman’s book (or PDF), Mining of Massive Datasets, for
some advanced but very practical use cases and algorithms.
• Python Machine Learning: a practical guide around scikit-learn. “This was my first
machine learning book, and I owe a lot to it,” says one of our senior data scientists.

COURSES
• Andrew Ng’s Coursera/Stanford course on machine learning is basically a requirement!
• “Intro to Machine Learning” on Udacity is a good introduction to machine learning for
people who already have basic Python knowledge. The very regular quizzes and exercises
make it particularly interactive.

VIDEOS & OTHER

• Oxford professor Nando de Freitas’s 16-episode deep learning lecture series on YouTube.
• Open-source machine learning libraries, such as scikit-learn (and its great user
guide), Keras, TensorFlow, and MLlib.

Dataiku Academy

In May 2020, Dataiku launched Dataiku Academy, a free online learning tool to enable
all users to build their skills around Dataiku. Dataiku Academy is designed for every type
of user — from non-technical roles to the hardcore coders — and meant to guide them
through self-paced, interactive online training.

Everyday AI,
Extraordinary People
Elastic Architecture Built for the Cloud

Machine Learning, Visualization, Data Preparation, DataOps, Governance & MLOps, Applications

45,000+ ACTIVE USERS
450+ CUSTOMERS

Dataiku is the platform for Everyday AI, systemizing the use of data for exceptional
business results. Organizations that use Dataiku elevate their people (whether technical
and working in code or on the business side and low- or no-code) to extraordinary, arming
them with the ability to make better day-to-day decisions with data.



©2021 dataiku | dataiku.com
