Visual Guide To Machine Learning
Visual Guide To Machine Learning
• Anyone who wants to understand • Anyone who would rather copy and
WHEN, WHY, and HOW to deploy paste code than become fluent in the
machine learning tools & techniques underlying algorithms
PART 1:
ABOUT THIS SERIES
This is Part 1 of a 4-Part series designed to help you build a deep, foundational understanding of
machine learning, including data QA & profiling, classification, forecasting and unsupervised learning
1 ML Intro & Landscape Machine Learning introduction, definition, process & landscape
Tools to explore data quality (variable types, empty values, range &
2 Preliminary Data QA count calculations, table structure, left/right censored data, etc.)
noun
*Dictionary.com
COMMON ML QUESTIONS
Which customers are most What will sales look like for What patterns do we see in
likely to churn next month? the next 12 months? terms of product cross-selling?
How can we use online When we adjusted tactics Which product is customer X
customer reviews to monitor last month, did we drive any most likely to purchase next?
changes in sentiment? incremental revenue?
WHEN IS ML THE RIGHT FIT?
QUALITY ASSURANCE
UNIVARIATE PROFILING
MULTIVARIATE PROFILING
MACHINE LEARNING
Quality Assurance (QA) is about preparing & cleaning data prior to analysis. We’ll cover common QA topics
including variable types, empty/missing values, range & count calculations, censored data, etc.
MACHINE LEARNING PROCESS
QUALITY ASSURANCE
UNIVARIATE PROFILING
MULTIVARIATE PROFILING
MACHINE LEARNING
Univariate profiling is about exploring individual variables to build an understanding of your data. We’ll cover
common topics like normal distributions, frequency tables, histograms, etc.
MACHINE LEARNING PROCESS
QUALITY ASSURANCE
UNIVARIATE PROFILING
MULTIVARIATE PROFILING
MACHINE LEARNING
Multivariate profiling is about understanding relationships between multiple variables. We’ll cover common
tools for exploring categorical & numerical data, including kernel densities, violin & box plots, scatterplots, etc.
MACHINE LEARNING PROCESS
QUALITY ASSURANCE
UNIVARIATE PROFILING
MULTIVARIATE PROFILING
MACHINE LEARNING
Machine learning is a natural extension of multivariate profiling, and uses statistical models and methods to
answer questions which are too complex to solve using simple visual analysis or trial-and-error
MACHINE LEARNING LANDSCAPE
MACHINE LEARNING
Clustering/Segmentation
Classification Regression Reinforcement Learning
K-Means (Q-learning, deep RL, multi-armed-bandit, etc.)
QUALITY ASSURANCE
UNIVARIATE PROFILING
MULTIVARIATE PROFILING
MACHINE LEARNING
Quality Assurance (QA) is about preparing & cleaning data prior to analysis. We’ll cover common QA topics
including variable types, empty/missing values, range & count calculations, censored data, etc.
PRELIMINARY DATA QA
Data QA (otherwise known as Quality Assurance or Quality Control) is the first step in
the analytics and machine learning process; QA allows you to identify and correct
underlying data issues (blanks, errors, incorrect formats, etc.) prior to analysis
Are there any missing or Was the data our client Is there any risk that the Are there any outliers
empty values in the data captured from the online data capture process was that might skew the
shared by the HR team? survey encoded properly? biased in some way? results of our analysis?
VARIABLE TYPES
Range Calculations
Common variable types include:
Left/Right Censored
Variable Types Investigating empty values, and how they are recorded in your data,
is a prerequisite for every single analysis
Empty Values
Empty values can be recorded in many ways (NA, N/A, #N/A, NaN,
Null, “-”, “Invalid”, blank, etc.), but the most common mistake is
Range Calculations turning empty numerical values into zeros (0)
Count Calculations
Left/Right Censored
Table Structure
For a missing Retail Price, you would
likely be able to impute the value
since you know the product name/ID
RANGE CALCULATIONS
Variable Types One of the simplest QA tools for numerical variables is to calculate
the range of values in a column (minimum and maximum values)
min(height) = -10
max(height = 10 Is your variable normalized around
Table Structure a central value (i.e. 0)?
COUNT CALCULATIONS
Range Calculations
Distinct counts can be particularly useful for QA, and help to:
• Understand the granularity or “grain” of your data
Count Calculations
• Identify how many unique values a field contains
• Ensure consistency by identifying misspellings or categorization errors
which might otherwise be difficult to catch (i.e. leading or trailing spaces)
Left/Right Censored
PRO TIP: For numerical variables with many unique values (i.e. long decimals), use a
Table Structure histogram to plot frequency based on custom ranges or “bins” (more on that soon!)
LEFT/RIGHT CENSORED
Variable Types
When data is left or right censored, it means that due to some
circumstance the min or max value observed is not the natural
minimum or maximum of that metric
Empty Values • This can be difficult to spot unless you are aware of how the data is being
recorded (which means it’s a particularly dangerous issue to watch out for!)
Count Calculations
Left/Right Censored
Table Structure
Mall Shopper Survey Results Ecommerce Repeat Purchase Rate
Only tracks shoppers over the age of 18 due to legal Sharp drop as you approach the current date has nothing to do
reasons, so anyone under 18 is excluded (even though with customer behavior, but the fact that recent customers
there are plenty of mall shoppers under 18) haven’t have the opportunity or need to repurchase yet
TABLE STRUCTURE
Variable Types Table structures generally come in two flavors: long or wide
Range Calculations
PIVOT
Count Calculations
UNPIVOT
Left/Right Censored
Long tables typically contain a single, distinct column for each field (Date,
Variable Types
Product, Category, Quantity, Profit, etc.)
• Easy to see all available fields and variable types
Empty Values • Great for exploratory data analysis and aggregation (i.e. PivotTables)
Range Calculations Wide tables typically split the same metric into multiple columns or
categories (i.e. 2018 Sales, 2019 Sales, 2020 Sales, etc.)
• Typically not ideal for human readability, since wide tables may contain thousands
Count Calculations of columns (vs. only a handful if pivoted to a long format)
• Often (but not always) the best format for machine learning model input
Left/Right Censored • Great format for visualizing categorical data (i.e. sales by product category)
Table Structure There’s no right or wrong table structure; each type has strengths & weaknesses!
CASE STUDY: PRELIMINARY QA
THE You’ve just been hired as a Data Analyst for Maven Market, a local grocery
SITUATION store looking for help with basic data management and analysis.
The store manager would like you to conduct some analyses on product
THE inventory and sales, but the data is a mess.
ASSIGNMENT You’ll need to explore the data, conduct a preliminary QA, and help clean it up
to prepare the data for further analysis.
Review all fields to ensure that variable types are configured for proper
analysis (i.e. no dates formatted as strings, text formatted as values, etc.)
Remember that NA and 0 do not mean the same thing! Think carefully about
how to handle missing data and the impact it may have on your analysis
Run basic diagnostics like Range, Count, and Left/Right Censored checks
against all columns in your data set...every time
Understand your table structure before conducting any analysis to reduce the
risk of double counting, inaccurate calculations, omitted data, etc.
*Copyright Maven Analytics, LLC
MACHINE LEARNING PROCESS
QUALITY ASSURANCE
UNIVARIATE PROFILING
MULTIVARIATE PROFILING
MACHINE LEARNING
Univariate profiling is about exploring individual variables to build an understanding of your data. We’ll cover
common topics like normal distributions, frequency tables, histograms, etc.
UNIVARIATE PROFILING
Univariate profiling is the next step after preliminary QA; think of univariate profiling as
conducting a descriptive analysis of each variable by itself
VARIABLE TYPES
Categorical Variables
Terms like discrete, categorical, multinomial, and classes may all be used interchangeably.
Binary is a special type of categorical variable which takes only 1 of 2 cases: true for false (or 1
Data Profiling or 0) and is also known as a logical variable or a binary flag
DISCRETIZATION
Categorical
Distributions
Numerical Variables
Discretization Rules:
If Price <100 then Price Level = Low
Histograms & If Price >=100 & Price <500 then Price Level = Med
Kernel Densities If Price >=500 then Price Level = High
Normal Distribution
Data Profiling
NOMINAL VS. ORDINAL VARIABLES
VARIABLE TYPES
Categorical Variables
Histograms &
Kernel Densities
There are two types of categorical variables: nominal and ordinal
Normal Distribution • Nominal variables contain categories with no inherent logical rank, which can be
re-ordered with no consequence (i.e. Product Type = Camping, Biking or Hiking)
• Ordinal variables contain categories with a logical order (i.e. Size = Small, Medium,
Data Profiling Large), but the interval between those categories has no logical interpretation
CATEGORICAL DISTRIBUTIONS
Histograms & • Heat maps: Formatted to visualize patterns (typically used for multiple variables)
Kernel Densities
Understanding categorical distributions will help us gather knowledge for
Normal Distribution building accurate machine learning models (more on this later!)
Data Profiling
CATEGORICAL DISTRIBUTIONS
Categorical Variables
Section Distribution:
Camping Biking
Categorical Frequency table
14 6
Distributions
Camping Biking
Proportions table
Numerical Variables 70% 30%
Histograms &
Kernel Densities Size & Section Distribution:
Camping Biking
S 6 4
Normal Distribution Heat Map
L 8 2
Data Profiling
NUMERICAL VARIABLES
VARIABLE TYPES
Categorical Variables
You may hear numeric variables described further as interval and ratio, but the distinction is trivial
and rarely makes a difference in common use cases
Data Profiling
HISTOGRAMS
Categorical Variables
Histograms are used to plot a single, discretized numerical variable
Numerical Variables
Age Values:
8 29 45 8
Histograms & 25 33 37
7
Frequency
Kernel Densities 6
19 43 21 5
28 32 40 4
Normal Distribution 24 17 28 3
2
5 22 39
1
15 47 12
Data Profiling 0-10 11-20 21-30 31-40 41-50
Age Range
KERNEL DENSITIES
Categorical Variables
Kernel densities are “smooth” versions of histograms, which can help
to prevent users from over-interpreting breaks between bins
Numerical Variables
Age Values:
8 29 45 8
Histograms & 25 33 37
7
Frequency
Kernel Densities 6
19 43 21 5
28 32 40 4
Normal Distribution 24 17 28 3
2
5 22 39
1
15 47 12
Data Profiling 0-10 11-20 21-30 31-40 41-50
Age Range
HISTOGRAMS & KERNEL DENSITIES
PRO TIP: If your data is relatively symmetrical (not skewed), you can use Sturge’s Rule as a
Data Profiling quick “rule of thumb” to determine an appropriate number of bins: K = 1 + 3.322 log (N)
(where K = number of bins, N = number of observations)
CASE STUDY: HISTOGRAMS
THE You’ve just been promoted as the new Pit Boss at The Lucky Roll Casino.
SITUATION Your mission? Use data to help expose cheats on the casino floor.
Profits at the craps tables have been unusually low, and you’ve been asked to
THE investigate the possible use of loaded die (weighted towards specific numbers).
ASSIGNMENT Your plan is to track the outcome of each roll, then compare your results
against the expected probability distribution to see how closely they match.
Histograms &
Kernel Densities
Normal Distribution
Data Profiling
You can find normal distributions in many real-world examples: heights, weights, test scores, etc.
NORMAL DISTRIBUTION
Categorical
Distributions
1 1 𝑥−𝜇 2
Numerical Variables −
𝑒 2 𝜎
Histograms &
Kernel Densities 𝜎 2𝜋 Turn the parabola
upside down
Normal Distribution
Make the tails
flare out
Data Profiling
CASE STUDY: NORMAL DISTRIBUTION
THE It’s August 2016, and you’ve been invited to Rio de Janeiro as a Data Analyst
SITUATION for the Global Olympic Committee.
Your job is to collect demographic data for all female athletes competing in
THE
the Summer Games and determine how the distribution of Olympic athlete
ASSIGNMENT heights compares against the general public.
1. Gather heights for all female athletes competing in the 2016 Games
THE 2. Plot height frequencies using a Histogram, and test various bin widths
OBJECTIVES 3. Determine if athlete heights follow a normal distribution, or “bell curve”
4. Compare the distributions for athletes vs. the general public
DATA PROFILING
Numerical Variables
Mode of City = “Houston”
Mode of Sessions = 24
Histograms &
Kernel Densities Mode of Gender = F, M
(this is a bimodal field!)
Normal Distribution
Common uses:
Data Profiling • Understanding the most common values within a dataset
• Diagnosing if one variable is influenced by another
MODE
Categorical Variables While modes typically aren’t very useful on their own, they can provide
helpful hints for deeper data exploration
Categorical • For example, the right histogram below shows a multi-modal distribution, which
Distributions indicates that there may be another variable impacting the age distribution
8 8
Histograms & 7 7
Kernel Densities 6 6
Frequency
Frequency
5 5
4 4
Normal Distribution 3 3
2 2
1 1
0-10 11-20 21-30 31-40 41-50 0-10 11-20 21-30 31-40 41-50
Data Profiling
Age Range Age Range
MEAN
Categorical Variables
The mean is the calculated “central” value in a discrete set on numbers
• Mean is what most people think of when they hear the word “average”, and is
calculated by dividing the sum of all values by the count of all observations
Categorical
Distributions • Means can only be applied to numerical variables (not categorical)
Numerical Variables
𝑠𝑢𝑚 𝑜𝑓 𝑎𝑙𝑙 𝑣𝑎𝑙𝑢𝑒𝑠
𝑚𝑒𝑎𝑛 =
𝑐𝑜𝑢𝑛𝑡 𝑜𝑓 𝑜𝑏𝑠𝑒𝑟𝑣𝑎𝑡𝑖𝑜𝑛𝑠
Histograms &
Kernel Densities 5,220
=
5
= 𝟏, 𝟎𝟒𝟒
Normal Distribution
Common uses:
• Making a “best-guess” estimate of a value
Data Profiling
• Calculating a central value when outliers are not present
MEDIAN
The median is the middle value in a list of values sorted from highest to
Categorical Variables
lowest (or vice versa)
• When there are two middle-ranked values, the median is the average of the two
Categorical
• Medians can only be applied to numerical variables (not categorical)
Distributions
Numerical Variables
Normal Distribution
Common uses:
Data Profiling • Identifying the “center” of a distribution
• Calculating a central value when outliers may be present
PERCENTILE
Numerical Variables
Histograms &
Kernel Densities
Numerical Variables
Variance = 5
Normal Distribution
Common uses:
Data Profiling
• Comparing the numerical distributions of two different groups (i.e. prices of products
ordered online vs. in store)
VARIANCE
Categorical Variables 𝑛 2
σ𝑖=1(𝑥𝑖 − 𝜇)
Categorical
Distributions 𝑛−1
Numerical Variables Calculation Steps:
Histograms &
Kernel Densities
Common uses:
Data Profiling • Comparing segments for a given metric (i.e. time on site for mobile users vs. desktop)
• Understanding how likely certain values are bound to occur
SKEWNESS
Categorical Variables Skewness tells us how a distribution varies from a normal distribution
• This is commonly used to mathematically describe skew to the left or right
Categorical
Distributions
Left skew Normal Distribution Right skew
Numerical Variables
Histograms &
Kernel Densities
Normal Distribution
Common uses:
Data Profiling
• Identifying non-normal distributions, and describing them mathematically
BEST PRACTICES: UNIVARIATE PROFILING
Make sure you are using the appropriate tools for profiling categorical
variables vs. numerical variables
QA still comes first! Profiling metrics are important, but can lead to
misleading results without proper QA (i.e. handling outliers or missing values)
QUALITY ASSURANCE
UNIVARIATE PROFILING
MULTIVARIATE PROFILING
MACHINE LEARNING
Multivariate profiling is about understanding relationships between multiple variables. We’ll cover common
tools for exploring categorical & numerical data, including kernel densities, violin & box plots, scatterplots, etc.
MULTIVARIATE PROFILING
Multivariate profiling is the next step after univariate profiling, since single-metric
distributions are rarely enough to draw meaningful insights or conclusions
Categorical-Numerical
Distributions This is one of the simplest forms of multivariate profiling, and leverages
the same tools we used to analyze univariate distributions:
Multivariate Kernel
Densities • Frequency tables: Show the count (or frequency) of each distinct combination
• Proportions tables: Show the count of each combination as a % of the total
Violin & Box Plots • Heat maps: Frequency or proportions table formatted to visualize patterns
THE You’ve just been hired by the New York Department of Transportation
SITUATION (DOT) to help analyze traffic accidents in New York City from 2019-2020
THE 1. Create a table to plot accident frequency by time of day and day of week
OBJECTIVES 2. Apply conditional formatting to the table to create a heatmap showing the
days and times with the fewest (green) and most (red) accidents in the sample
CATEGORICAL-NUMERICAL DISTRIBUTIONS
Multivariate Kernel Teal class has a mean of ~15 and relatively low
Densities variance (highly concentrated around the mean)
Violin & Box Plots Yellow class has a mean of ~20 and moderate
variance relative to other categories
Numerical-Numerical
Distributions Purple class has a mean of ~25, overlaps with
yellow, and has relatively high variance
Numerical-Numerical
Distributions
Categorical profiling works for simple cases, but breaks down quickly
Categorical-Categorical
Distributions • Humans are pretty good at visualizing 1, 2, or maybe even 3 variables, but how
would you visualize a joint distribution for 10 variables? 100?
Categorical-Numerical
Distributions
Multivariate Kernel
Densities
?
Violin & Box Plots Categorical profiling can’t answer prescriptive or predictive questions
• Suppose you randomized several elements on your sales page (font, image, layout,
button, copy, etc.) to understand which ones drive conversions
Numerical-Numerical
Distributions • You could count conversions for individual elements, or some combinations of
elements, but categorical distribution alone can’t measure causation
Scatter Plots &
Correlation
This is when you need machine learning!
NUMERICAL-NUMERICAL DISTRIBUTIONS
Categorical-Numerical They are typically visualized using scatter plots, which plot points along
Distributions
the X and Y axis to show the relationship between two variables
Multivariate Kernel • Scatter plots allow for simple, visual intuition: when one variable increases or
Densities decreases, how does the other variable change?
• There are many possibilities: no relationship, positive, negative, linear, non-linear,
Violin & Box Plots cubic, exponential, etc.
Numerical-Numerical
Distributions
Multivariate Kernel
Densities
σ𝑛𝑖=1(𝑥𝑖 − 𝜇) 2 σ𝑛𝑖=1(𝑥𝑖 − 𝑥)(𝑦𝑖 − 𝑦)
Violin & Box Plots 𝑛−1 (𝑛 − 1)𝑠𝑥 𝑠𝑦
Numerical-Numerical
Distributions • Here we multiply variable X’s difference from its mean with variable Y’s difference
from its mean, instead of squaring a single variable (like we do with variance)
Scatter Plots &
Correlation • Sx and Sy are the standard deviations of X and Y, which puts them on the same scale
CORRELATION VS. CAUSATION
Categorical-Categorical
CORRELATION
Distributions
Categorical-Numerical
Distributions
Multivariate Kernel
Densities DOES NOT IMPLY
Violin & Box Plots
Numerical-Numerical
Distributions CAUSATION
Scatter Plots &
Correlation
CORRELATION VS. CAUSATION
Categorical-Categorical
Drowning Deaths
Distributions
Categorical-Numerical
Distributions
Multivariate Kernel
Densities
Ice Cream Cones Sold
Violin & Box Plots Consider the scatter plot above, showing daily ice cream sales and
drowning deaths in a popular New England vacation town
Numerical-Numerical • These two variables are clearly correlated, but do ice cream cones CAUSE people to
Distributions drown? Do drowning deaths CAUSE a surge in ice cream sales?
Categorical-Categorical Scatter plots show two dimensions by default (X and Y), but using
Distributions symbols or color allows you to visualize additional variables and
expose otherwise hidden patterns or trends
Categorical-Numerical
Distributions
Multivariate Kernel
Densities
Numerical-Numerical
Distributions
THE You’ve just landed your dream job as a Marketing Analyst at Loud & Clear, the
SITUATION hottest ad agency in San Diego.
Your client would like to understand the impact of their digital media spend, and
THE how it relates to website traffic, offline spend, site load time, and sales.
ASSIGNMENT Your role is to collect and visualize these metrics at the weekly-level in order to
begin exploring the relationships between them.
Use categorical variables to filter or “cut” your data and quickly compare
profiling metrics or distributions across classes
Remember that correlation does not imply causation, and that variables can
be related without one causing a change in the other
PART 2:
ABOUT THIS SERIES
This is Part 2 of a 4-Part series designed to help you build a deep, foundational understanding of
machine learning, including data QA & profiling, classification, forecasting and unsupervised learning
2 Classification Models
MACHINE LEARNING
Clustering/Segmentation
Classification Regression Reinforcement Learning
K-Means (Q-learning, deep RL, multi-armed-bandit, etc.)
LASSO/RIDGE, state-
Support vector machines,
space, advanced Deep Learning
gradient boosting, neural
generalized linear (Feed Forward, Convolutional, RNN/LSTM, Attention,
nets/deep learning, etc.
methods, VAR, DFA, etc. Deep RL, Autoencoder, GAN. etc.)
MACHINE LEARNING LANDSCAPE
MACHINE LEARNING
MACHINE LEARNING
3 Safari 36 2 0
Variables can be categorical or numerical 4 IE 17 1 0
• Categorical variables contain classes or categories (used for filtering) 5 Chrome 229 9 1
• Numerical variables contain numbers (used for aggregation)
These are numerical variables
The goal of any classification model is to predict a dependent variable using independent variables
EXAMPLE: Using data from a CRM database (sample below) to predict if a customer will churn next month
Cust. ID Gender Status HH Income Age Sign-Up Date Newsletter Churn Churn is our dependent variable,
1 M Bronze $30,000 29 Jan 17, 2019 1 0
since it’s what we want to predict
Gender, Status, HH Income, Age, Sign-Up Date and Newsletter are all
independent variables, since they can help us explain, or predict, Churn
CLASSIFICATION 101
EXAMPLE: Using data from a CRM database (sample below) to predict if a customer will churn next month
3 F Bronze $15,000 24 Oct 1, 2020 0 0 We use records with observed values for
4 F Silver $75,000 41 Apr 12, 2019 1 0 both independent and dependent variables
5 M Bronze $40,000 36 Jul 23, 2020 1 1
to “train” our classification model...
6 M Gold $35,000 31 Oct 22, 2017 0 1
Project Scoping
Remember, these steps ALWAYS come first
Preliminary QA Before building a model, you should have a deep understanding of both the project scope (stakeholders, framework,
desired outcome, etc.) and the data at hand (variable types, table structure, data quality, profiling metrics, etc.)
Data Profiling
Adding new, calculated Splitting records into Building classification models Choosing the best
variables (or “features”) to “Training” and “Test” data from Training data and performing model for a given
a data set based on sets, to validate accuracy applying to Test data to prediction, and tuning it to
existing fields and avoid overfitting maximize prediction accuracy prevent drift over time
FEATURE ENGINEERING
Cust. ID Status HH Income Age Sign-Up Date Newsletter Gold Silver Bronze Scaled Income Log Income Age Group Sign-Up Year Priority
Splitting categorical fields Standardizing numerical Converting a range of Grouping continuous Transforming a date or Using and/or logic to
into separate binary ranges common scales values into a compressed, values into discrete datetime value into encode “interactions”
features (i.e. 1-10, 0-100%) “less-skewed” distribution segments or bins individual components between variables
DATA SPLITTING
Splitting is the process of partitioning data into separate sets of records for the
purpose of training and testing machine learning models
• As a rule of thumb, ~70-80% of your data will be used for Training (which is what your model
learns from), and ~20-30% will be reserved for Testing (to validate the model’s accuracy)
3 F Bronze $15,000 24 Oct 1, 2020 0 0 Using Training data for optimization and Test data
Training
4 F Silver $75,000 41 Apr 12, 2019 1 0 for validation ensures that your model can
data
5 M Bronze $40,000 36 Jul 23, 2020 1 1 accurately predict both known and unknown values,
6 M Gold $35,000 31 Oct 22, 2017 0 1
which helps to prevent overfitting
7 F Gold $80,000 46 May 2, 2019 0 0
In this section we’ll introduce common classification models used to predict categorical
outcomes, including KNN, naïve bayes, decision trees, logistic regression, and more
• In its simplest form, KNN creates a scatter plot with training data, plots a new
Naïve Bayes unobserved value, and assigns a class (DV) based on the classes of nearby points
• K represents the number of nearby points (or “neighbors”) the model will consider
Decision Trees when making a prediction
Sentiment Analysis
K-NEAREST NEIGHBORS (KNN)
HH INCOME
$90,000
$80,000
$60,000
$50,000
Decision Trees
$40,000
$30,000
$10,000
No Purchase
AGE
20 25 30 35 40 45 50 55 60 65 70 75
Logistic Regression
X Y K 9 Purchase PURCHASE
28 $85,000 10 1 No Purchase (90% Confidence)
Sentiment Analysis
UNOBSERVED VALUE NEIGHBORS PREDICTION
K-NEAREST NEIGHBORS (KNN)
HH INCOME
$90,000
$80,000
$60,000
$50,000
Decision Trees
$40,000
$30,000
$10,000
No Purchase
AGE
20 25 30 35 40 45 50 55 60 65 70 75
Logistic Regression
X Y K 6 Purchase NO PURCHASE
44 $40,000 20 14 No Purchase (70% Confidence)
Sentiment Analysis
UNOBSERVED VALUE NEIGHBORS PREDICTION
K-NEAREST NEIGHBORS (KNN)
HH INCOME
$90,000
$80,000
$60,000
$50,000
Decision Trees
$40,000
$30,000
$10,000
No Purchase
AGE
20 25 30 35 40 45 50 55 60 65 70 75
Logistic Regression
???
X Y K 3 Purchase
50 $80,000 6 3 No Purchase
Sentiment Analysis
UNOBSERVED VALUE NEIGHBORS PREDICTION
K-NEAREST NEIGHBORS (KNN)
THE Congratulations! You just landed your dream job as a Machine Learning
SITUATION Engineer at Spotify, one of the world’s largest music streaming platforms
• For new observations, the model looks at the probability that each IV value would
Decision Trees be observed given each outcome, and compares the results to make a prediction
Sentiment Analysis
NAÏVE BAYES
These are the independent variables (IV’s)
Each record represents a customer which will help us make a prediction
3 1 1 0 1
4 0 0 0 0
Decision Trees
5 1 0 0 0 These are our observed values, which
we use to train the model
6 1 1 1 1
Random Forests 7 1 0 1 0
8 0 1 1 1
9 1 0 1 1
Logistic Regression
10 1 1 0 0
1
Naïve Bayes 2 1 0 1 0
NEWS
3 1 1 0 1 0
4 0 0 0 0
Decision Trees 1 0
5 1 0 0 0 1
FB
6 1 1 1 1
0
Random Forests 7 1 0 1 0
8 0 1 1 1 1 0
1
9 1 0 1 1
SITE
Logistic Regression
10 1 1 0 0 0
11 0 1 1 ???
Sentiment Analysis
NAÏVE BAYES
Naïve Bayes 2 1 0 1 0 1 3
NEWS
3 1 1 0 1 0
4 0 0 0 0
Decision Trees 1 0
5 1 0 0 0 1
FB
6 1 1 1 1
0
Random Forests 7 1 0 1 0
8 0 1 1 1 1 0
1
9 1 0 1 1
SITE
Logistic Regression
10 1 1 0 0 0
11 0 1 1 ???
Sentiment Analysis
NAÏVE BAYES
Naïve Bayes 2 1 0 1 0 1 3 4
NEWS
3 1 1 0 1 0
4 0 0 0 0
Decision Trees 1 0
5 1 0 0 0 1
FB
6 1 1 1 1
0
Random Forests 7 1 0 1 0
8 0 1 1 1 1 0
1
9 1 0 1 1
SITE
Logistic Regression
10 1 1 0 0 0
11 0 1 1 ???
Sentiment Analysis
NAÏVE BAYES
Naïve Bayes 2 1 0 1 0 1 3 4
NEWS
3 1 1 0 1 0 2
4 0 0 0 0
Decision Trees 1 0
5 1 0 0 0 1
FB
6 1 1 1 1
0
Random Forests 7 1 0 1 0
8 0 1 1 1 1 0
1
9 1 0 1 1
SITE
Logistic Regression
10 1 1 0 0 0
11 0 1 1 ???
Sentiment Analysis
NAÏVE BAYES
Naïve Bayes 2 1 0 1 0 1 3 4
NEWS
3 1 1 0 1 0 2 1
4 0 0 0 0
Decision Trees 1 0
5 1 0 0 0 1
FB
6 1 1 1 1
0
Random Forests 7 1 0 1 0
8 0 1 1 1 1 0
1
9 1 0 1 1
SITE
Logistic Regression
10 1 1 0 0 0
11 0 1 1 ???
Sentiment Analysis
NAÏVE BAYES
Naïve Bayes 2 1 0 1 0 1 3 4
NEWS
3 1 1 0 1 0 2 1
4 0 0 0 0
Decision Trees 1 0
5 1 0 0 0 1 4 1
FB
6 1 1 1 1
0 1 4
Random Forests 7 1 0 1 0
8 0 1 1 1 1 0
9 1 0 1 1
1 3 2
SITE
Logistic Regression
10 1 1 0 0 0 2 3
=
Overall: 50% 50%
11 0 1 1 ???
Sentiment Analysis
NAÏVE BAYES
K-Nearest Neighbors • For example, P (News | Purchase) tells us the probability that a customer
subscribed to the newsletter, given (or conditioned on) the fact that they purchased
NEWS
Probability of each independent variable,
P (No FB | Purchase) 20% given that a purchase was made
Decision Trees 0 2 1
P (Site | Purchase) 60%
1 0 P (No Site | Purchase) 40%
1 4 1
Random Forests P (News | No Purchase) 80%
FB
=
P (No Site | Purchase) 40%
P (No News | Purchase) P (No News | No Purchase)
P (News | No Purchase) 80%
x x
Decision Trees P (No News | No Purchase) 20% P (FB | Purchase) P (FB | No Purchase)
P (FB | No Purchase) 20% x x
P (Site | Purchase) P (Site | No Purchase)
P (No FB | No Purchase) 80% x x
Random Forests P (Site | No Purchase) 40% P (Purchase) P (No Purchase)
=
P (Purchase) 50%
Logistic Regression
P (No Purchase) 50%
Sentiment Analysis
NAÏVE BAYES
Unobserved value:
P (News | Purchase) 60%
NEWS FB SITE
=
P (No Site | Purchase) 40%
40% P (No News | No Purchase)
P (News | No Purchase) 80%
x x
Decision Trees P (No News | No Purchase) 20% 80% P (FB | No Purchase)
P (FB | No Purchase) 20% x x
60% P (Site | No Purchase)
P (No FB | No Purchase) 80% x x
Random Forests P (Site | No Purchase) 40% 50% P (No Purchase)
=
P (Purchase) 50%
Logistic Regression 9.6%
P (No Purchase) 50%
Sentiment Analysis
NAÏVE BAYES
Unobserved value:
P (News | Purchase) 60%
NEWS FB SITE
=
P (No Site | Purchase) 40%
40% 20%
P (News | No Purchase) 80%
x x
Decision Trees P (No News | No Purchase) 20% 80% 20%
P (FB | No Purchase) 20% x x
60% 40%
P (No FB | No Purchase) 80% x x
Random Forests P (Site | No Purchase) 40% 50% 50%
=
P (Purchase) 50%
Logistic Regression 9.6% 0.8%
P (No Purchase) 50%
Sentiment Analysis
NAÏVE BAYES
Unobserved value:
P (News | Purchase) 60%
NEWS FB SITE
=
P (No Site | Purchase) 40%
40% 20%
P (News | No Purchase) 80%
x x
Decision Trees P (No News | No Purchase) 20% 80% 20%
P (FB | No Purchase) 20% x x
60% 40%
P (No FB | No Purchase) 80% x x
Random Forests P (Site | No Purchase) 40% 50% 50%
=
P (Purchase) 50%
Logistic Regression 9.6% 0.8%
P (No Purchase) 50%
9.6%
Sentiment Analysis Overall Purchase Probability:
(9.6% + 0.8%)
= 92.3% PURCHASE
PREDICTION
NAÏVE BAYES
K-Nearest Neighbors
Who has time to work through all this math?
Naïve Bayes • No one! These probabilities are all calculated automatically by the model
(no manual effort required!)
Decision Trees
Doesn’t multiplying many probabilities lead to very small values?
Random Forests • Yes, which is why the relative probability is so important; even so, Naïve
Bayes can struggle with a very large number of IVs
Logistic Regression
Sentiment Analysis
CASE STUDY: NAÏVE BAYES
THE You’ve just been promoted to Marketing Manager at Cat Slacks, a global retail
SITUATION powerhouse specializing in high-quality pants for cats
• The goal is to determine which rules and IVs do the best job splitting up the
Decision Trees classes, which we can measure using a metric known as entropy
• This is NOT a manual process; decision trees test many variables and select rules
based on the change in entropy after the split (known as information gain)
Random Forests
Gender
10 YES 10 NO
Lifetime Value
4 YES 4 NO 6 YES 6 NO
Random Forests
Splitting on Gender isn’t effective, since our classes are still evenly distributed after the split
Logistic Regression
(in other words, Gender isn’t a good predictor for churn)
Sentiment Analysis
DECISION TREES
Dependent variable: CHURN
Q: Will this subscriber churn next month?
Independent Variables
K-Nearest Neighbors
Age
Gender
10 YES 10 NO
Lifetime Value
4 YES 8 NO 7 YES 1 NO
Random Forests
K-Nearest Neighbors
ENTROPY = −𝑃1 ∗ 𝑙𝑜𝑔2 𝑃1 − 𝑃2 ∗ 𝑙𝑜𝑔2 (𝑃2 )
Naïve Bayes Curve between 0 - 1
1.00
Decision Trees
0.75 Entropy = 1 (max)
50/50 split between classes
Logistic Regression
0.25
Entropy = 0 (min) Entropy = 0 (min)
All Class 2 (P1=0) All Class 1 (P2=0)
Sentiment Analysis
P1
0.25 0.50 0.75 1.00
DECISION TREES
Dependent variable: CHURN
Q: Will this subscriber churn next month?
Independent Variables
K-Nearest Neighbors
Age
Entropy = 1
Gender
10 YES 10 NO
Lifetime Value
Entropy = .91 Entropy = .54
(- 0.09) (- 0.46)
4 YES 8 NO 7 YES 1 NO
Random Forests
The reduction in entropy after the split tells us that we’re gaining information, and
Logistic Regression teasing out differences between those who churn and those who do not
Sentiment Analysis
DECISION TREES
Dependent variable: CHURN
Q: Will this subscriber churn next month?
Independent Variables
K-Nearest Neighbors
Age
Entropy = 1
Gender
10 YES 10 NO
Lifetime Value
Entropy = .91 Entropy = .54
(- 0.09) (- 0.46)
4 YES 8 NO 7 YES 1 NO
Random Forests
Lifetime Value Age
Decision node
K-Nearest Neighbors 10 YES 10 NO
Last Log-In
Leaf node
<7 >7
days days
Naïve Bayes
4 YES 8 NO 7 YES 1 NO
Random Forests
Sign-Up Date
Logistic Regression
<60 >60 UNOBSERVED VALUE PREDICTION
days days
Last Log-In
<7 >7
days days
Naïve Bayes
4 YES 8 NO 7 YES 1 NO
Random Forests
Sign-Up Date
Logistic Regression
<60 >60 UNOBSERVED VALUE PREDICTION
days days
Sentiment Analysis
ago, has an LTV of $28, and NO CHURN
signed up 90 days ago
1 YES 0 NO 1 YES 3 NO
DECISION TREES
Last Log-In
<7 >7
days days
Naïve Bayes
4 YES 8 NO 7 YES 1 NO
Random Forests
Sign-Up Date
Logistic Regression
<60 >60 UNOBSERVED VALUE PREDICTION
days days
Logistic Regression Does the best first split always lead to the most accurate model?
• Not necessarily! That’s why we often use a collection of multiple decision trees,
known as a random forest, to maximize accuracy
Sentiment Analysis
RANDOM FORESTS
CHURN
Sign-Up Sign-Up Sign-Up
TREE 3 PREDICTION
Log-In Log-In Log-In
TREE N PREDICTION
THE You are the founder and CEO of Trip Genie, an online subscription service
SITUATION designed to connect global travelers with local guides.
You’d like to better understand your customers and identify which types of
THE behaviors can be used to help predict paid subscriptions.
ASSIGNMENT Your goal is to build a decision tree to help you predict subscriptions based on
multiple factors (newsletter, Facebook follow, time on site, sessions, etc.).
• The likelihood function measures how accurately a model predicts outcomes, and
Decision Trees is used to optimize the “shape” of the curve
• Although it has the word “regression” in its name, logistic regression is not used for
predicting numeric variables
Random Forests
TRUE (1)
K-Nearest Neighbors
Naïve Bayes
0.5
Decision Trees
Random Forests
FALSE (0)
X1
Logistic Regression
TRUE (1)
K-Nearest Neighbors
PROBABILITY
Naïve Bayes
0.5
Decision Trees
Random Forests
FALSE (0)
X1
Logistic Regression
• Logistic regression plots the best-fitting curve between 0 and 1, which tells us the
probability of Y being TRUE for any given value of X1
Sentiment Analysis
LOGISTIC REGRESSION
SPAM
K-Nearest Neighbors
PROBABILITY
Naïve Bayes
0.5
Decision Trees
Random Forests
NOTFALSE
SPAM (0)
# RECIPIENTS
5 10 15 20 25
Logistic Regression
• Here we’re using logistic regression to predict if an email will be marked as spam,
based on the number of email recipients (X1)
Sentiment Analysis
LOGISTIC REGRESSION
SPAM
K-Nearest Neighbors
P = 95%
Prediction: SPAM
PROBABILITY
Naïve Bayes
0.5
Random Forests P = 1%
Prediction: NOT SPAM
NOTFALSE
SPAM (0)
# RECIPIENTS
5 10 15 20 25
Logistic Regression
• Using this model, we can test unobserved values of X1 to predict the probability that
Y is true or false (in this case the probability that an email is marked as spam)
Sentiment Analysis
LOGISTIC REGRESSION
SPAM
K-Nearest Neighbors
0.9
PROBABILITY
Naïve Bayes spam is high, so our decision point may be >50%
Logistic Regression
Is 50% always the right decision point for logistic models?
Sentiment Analysis • No. It depends on the relative risk of a false positive (incorrectly predicting a TRUE
outcome) or false negative (incorrectly predicting a FALSE outcome)
LOGISTIC REGRESSION
QUIT
K-Nearest Neighbors
PROBABILITY
Naïve Bayes low, so our decision point may be <50%
STAY % NEGATIVE
25% 50% 75% 100%
FEEDBACK
Logistic Regression
• Now consider a case where we’re predicting if an employee will quit based on
negative feedback from HR
Sentiment Analysis • It’s easier to train an employee than hire a new one, so the risk of a false positive is
low but the risk of a false negative (incorrectly predicting someone will stay) is high
LOGISTIC REGRESSION
K-Nearest Neighbors
Makes the output fall
between 0 and 1 1
Linear equation, where
Decision Trees
Random Forests
𝜷𝟏 = 0.5
0.5 𝜷𝟏 = 1.0
Logistic Regression 𝜷𝟏 = 5.0
Sentiment Analysis 0 X1
LOGISTIC REGRESSION
K-Nearest Neighbors
Makes the output fall
between 0 and 1 1
Linear equation, where
Decision Trees
Random Forests
𝜷𝟏 = 0.5
0.5 𝜷𝟏 = 1.0
Logistic Regression 𝜷𝟏 = 5.0
Sentiment Analysis 0 X1
LOGISTIC REGRESSION
K-Nearest Neighbors
Makes the output fall
between 0 and 1 1
Linear equation, where
Decision Trees
Random Forests
𝜷𝟏 = 0.5
0.5 𝜷𝟏 = 1.0
Logistic Regression 𝜷𝟏 = 5.0
Sentiment Analysis 0 X1
LOGISTIC REGRESSION
K-Nearest Neighbors
Makes the output fall
between 0 and 1 1
Linear equation, where
Decision Trees
Random Forests
𝜷𝟏 = 0.5
0.5 𝜷𝟏 = 1.0
Logistic Regression 𝜷𝟏 = 5.0
Sentiment Analysis 0 X1
LOGISTIC REGRESSION
K-Nearest Neighbors
Makes the output fall
between 0 and 1 1
Linear equation, where
Decision Trees
Random Forests
𝜷𝟎 = 0
0.5 𝜷𝟎 = 2
Logistic Regression 𝜷𝟎 = -2
Sentiment Analysis 0 X1
LOGISTIC REGRESSION
K-Nearest Neighbors
Makes the output fall
between 0 and 1 1
Linear equation, where
Decision Trees
Random Forests
𝜷𝟎 = 0
0.5 𝜷𝟎 = 2
Logistic Regression 𝜷𝟎 = -2
Sentiment Analysis 0 X1
LOGISTIC REGRESSION
K-Nearest Neighbors
Makes the output fall
between 0 and 1 1
Linear equation, where
Decision Trees
Random Forests
𝜷𝟎 = 0
0.5 𝜷𝟎 = 2
Logistic Regression 𝜷𝟎 = -2
Sentiment Analysis 0 X1
LOGISTIC REGRESSION
K-Nearest Neighbors
How do we determine the “right” values for 𝛽? LIKELIHOOD!
• Likelihood is a metric that tells us how good our model is at correctly predicting Y,
Naïve Bayes based on the shape of the curve
• This is where machine learning comes in; instead of human trial-and-error, an
algorithm determines the best weights to maximize likelihood using observed values
Decision Trees
Actual Observation
Random Forests
Y=1 Y=0 • When our model output is close to the actual Y,
we want likelihood to be HIGH (near 1)
Logistic Regression ~1 HIGH LOW
Model Output • When our model output is far from the actual Y,
~0 LOW HIGH we want likelihood to be LOW (near 0)
Sentiment Analysis
LOGISTIC REGRESSION
K-Nearest Neighbors
How do we determine the “right” values for 𝛽? LIKELIHOOD!
• Likelihood is a metric that tells us how good our model is at correctly predicting Y,
Naïve Bayes based on the shape of the curve
• This is where machine learning comes in; instead of human trial-and-error, an
algorithm determines the best weights to maximize likelihood using observed values
Decision Trees
Actual Observation
Random Forests
Y=1 Y=0 LIKELIHOOD FUNCTION:
=
Logistic Regression ~1 HIGH LOW
Model Output 𝒐𝒖𝒕𝒑𝒖𝒕 𝒚 ∗ 𝟏 − 𝒐𝒖𝒕𝒑𝒖𝒕 𝟏−𝒚
~0 LOW HIGH
Sentiment Analysis
LOGISTIC REGRESSION
K-Nearest Neighbors
How do we determine the “right” values for 𝛽? LIKELIHOOD!
• Likelihood is a metric that tells us how good our model is at correctly predicting Y,
Naïve Bayes based on the shape of the curve
• This is where machine learning comes in; instead of human trial-and-error, an
algorithm determines the best weights to maximize likelihood using observed values
Decision Trees
𝑦 1−𝑦
𝑜𝑢𝑡𝑝𝑢𝑡 ∗ 1 − 𝑜𝑢𝑡𝑝𝑢𝑡
Actual Observation
Random Forests (.99)1 ∗ (1 − .99)1−1
Y=1 Y=0
(.99)1 ∗ (.01)0
Logistic Regression ~1 HIGH LOW
Model Output
.99 ∗ 1
=
~0 LOW HIGH
K-Nearest Neighbors
How do we determine the “right” values for 𝛽? LIKELIHOOD!
• Likelihood is a metric that tells us how good our model is at correctly predicting Y,
Naïve Bayes based on the shape of the curve
• This is where machine learning comes in; instead of human trial-and-error, an
algorithm determines the best weights to maximize likelihood using observed values
Decision Trees
𝑦 1−𝑦
𝑜𝑢𝑡𝑝𝑢𝑡 ∗ 1 − 𝑜𝑢𝑡𝑝𝑢𝑡
Actual Observation
Random Forests (.01)0 ∗ (1 − .01)1−0
Y=1 Y=0
1 ∗ (.99)1
Logistic Regression ~1 HIGH LOW
Model Output
1 ∗ .99
=
~0 LOW HIGH
K-Nearest Neighbors
How do we determine the “right” values for 𝛽? LIKELIHOOD!
• Likelihood is a metric that tells us how good our model is at correctly predicting Y,
Naïve Bayes based on the shape of the curve
• This is where machine learning comes in; instead of human trial-and-error, an
algorithm determines the best weights to maximize likelihood using observed values
Decision Trees
𝑦 1−𝑦
𝑜𝑢𝑡𝑝𝑢𝑡 ∗ 1 − 𝑜𝑢𝑡𝑝𝑢𝑡
Actual Observation
Random Forests (.01)1 ∗ (1 − .01)1−1
Y=1 Y=0
.01 ∗ (.99)0
Logistic Regression ~1 HIGH LOW
Model Output
.01 ∗ 1
=
~0 LOW HIGH
K-Nearest Neighbors
How do we determine the “right” values for 𝛽? LIKELIHOOD!
• Likelihood is a metric that tells us how good our model is at correctly predicting Y,
Naïve Bayes based on the shape of the curve
• This is where machine learning comes in; instead of human trial-and-error, an
algorithm determines the best weights to maximize likelihood using observed values
Decision Trees
𝑦 1−𝑦
𝑜𝑢𝑡𝑝𝑢𝑡 ∗ 1 − 𝑜𝑢𝑡𝑝𝑢𝑡
Actual Observation
Random Forests (.99)0 ∗ (1 − .99)1−0
Y=1 Y=0
1 ∗ (.01)1
Logistic Regression ~1 HIGH LOW
Model Output
1 ∗ .01
=
~0 LOW HIGH
High likelihood
K-Nearest Neighbors SPAM
Naïve Bayes
PROBABILITY
Low likelihood
Low likelihood
Decision Trees
• Observations closest to the curve have the highest likelihood values (and vice versa),
Sentiment Analysis so maximizing total likelihood allows us to find the curve that fits our data best
LOGISTIC REGRESSION
SPAM
Naïve Bayes
PROBABILITY
Decision Trees
Random Forests
NOT SPAM
5 10 15 20 25
Logistic Regression
# RECIPIENTS
• # of Recipients can help us detect spam, but so can other variables like the number
Sentiment Analysis
of typos, count of words like “free” or “bonus”, sender reputation score, etc.
LOGISTIC REGRESSION
Naïve Bayes
Decision Trees
Random Forests
Logistic Regression
• Logistic regression can handle multiple independent variables, but the visual
Sentiment Analysis interpretation breaks down at >2 IV’s (this is why we need machine learning!)
LOGISTIC REGRESSION
Naïve Bayes
Makes the output fall
between 0 and 1 1
Decision Trees
1 + 𝑒 −(𝛽0+𝛽1 𝑥1+𝛽2 𝑥2+⋯+𝛽𝑛𝑥𝑛)
Random Forests Weighted independent
variables (x1, x2...xn)
Logistic Regression
• Logistic regression is about finding the best combination of weights (𝛽1, 𝛽2...𝛽𝑛) for
a given set of independent variables (x1, x2...x𝑛) to maximize the likelihood function
Sentiment Analysis
CASE STUDY: LOGISTIC REGRESSION
THE You’ve just been promoted to Marketing Manager for Lux Dining, a wildly
SITUATION popular international food blog.
The CMO is concerned about unsubscribe rates and thinks it may be related
THE to the frequency of emails your team has been sending.
ASSIGNMENT Your job is to use logistic regression to plot this relationship and predict if a
user will unsubscribe based on the number of weekly emails received.
• Sentiment analysis often falls under Natural Language Processing (NLP), but is
Naïve Bayes typically applied as a classification technique
• Unlike other classification models, you must ”hand-score” the sentiment (DV)
Random Forests
values for your Training data, which your model will learn from
Logistic Regression
Example use cases:
• Understanding the tone of product reviews posted by customers
Sentiment Analysis • Analyzing open-ended survey responses
SENTIMENT ANALYSIS
K-Nearest Neighbors
Limitations:
Decision Trees
• Based entirely on word count
• Straight-forward but not flexible
Logistic Regression
Sentiment Analysis
SENTIMENT ANALYSIS
The first step in any sentiment analysis is to clean and QA the text to
remove noise and isolate the most meaningful information:
K-Nearest Neighbors
• Remove punctuation, capitalization and special characters
• Correct spelling and grammatical errors
Naïve Bayes • Use proper encoding (i.e. UTF-8)
• Lemmatize or stem (remove grammar tense, convert to “root” term)
Decision Trees • Remove stop words (“a”, “the”, “or”, “of”, “are”, etc.)
NOTE: This process can vary based on the context; for example, you may want to
Sentiment Analysis preserve capitalization or punctuation if you care about measuring intensity (i.e. “GREAT!!”
vs. “great”), or choose to allow specific stop words or special characters
SENTIMENT ANALYSIS
Once the text has been cleaned, we can transform our text into
K-Nearest Neighbors numeric data using a “bag of words” approach:
• Split cleaned text into individual words (this is known as tokenization)
Naïve Bayes
• Create a new column with a binary flag (1/0) for each word
Sentiment is our
Random Forests Each word is an independent variable dependent variable
K-Nearest Neighbors How do you address language nuances like ambiguity, double
negatives, slang or sarcasm?
Naïve Bayes • Generally speaking, the more observations you score in your Training data, the
better your model will be at detecting these types of nuances
• No sentiment model is perfect, but feature engineering and advanced
Decision Trees techniques can help improve accuracy for more complex cases
Random Forests “If you like watching paint If you like watching paint dry like watch paint
dry, you’ll love this movie!” you’ll love this movie dry love movie
Logistic Regression
“The new version is awfully The new version is awfully new version awful
good, not as bad as expected!” good not as bad as expected good bad expect
Sentiment Analysis
CASE STUDY: SENTIMENT ANALYSIS
THE You’re an accomplished author and creator of the hit series Bark Twain the Data
SITUATION Dog, featuring a feisty chihuahua who uses machine learning to solve crimes.
Reviews for the latest book in the Bark Twain series are coming in, and they
THE aren’t looking great...
ASSIGNMENT To apply a bit more rigor to your analysis and automate scoring for future
reviews, you’ve decided to use this feedback to build a basic sentiment model.
In this section we’ll discuss techniques for selecting & tuning classification models,
including hyperparameter optimization, class balancing, confusion matrices and more
Imbalanced Classes
Confusion Matrix
K-Nearest Naïve Decision Random Logistic
Neighbors Bayes Trees Forests Regression
Model Selection
Examples: Examples: Examples: Examples: Examples:
• The class or outcome which occurs more frequently is known as the majority
Imbalanced Classes
class, while the class which occurs less frequently is the minority class
• Imbalanced classes can bias a model towards always predicting the majority class,
Confusion Matrix since it often yields the best overall accuracy (i.e. 99%+)
• This is a significant issue when predicting very rare and very important events, like
Model Selection a nuclear meltdown
Model Drift
There are several ways to balance classes in your Training data,
including up-sampling, down-sampling and weighting
IMBALANCED CLASSES
Hyperparameters Up-sampling
Minority class observations are
duplicated to balance the data
Imbalanced Classes
1
Down-sampling
Confusion Matrix Majority class observations are
randomly removed to balance
the data
Model Selection
Weighting
For models that randomly sample
observations (random forests),
Model Drift
increase the probability of selecting
the minority class
IMBALANCED CLASSES
Hyperparameters x
Up-sampling
x
Minority class observations are
x duplicated to balance the data
Imbalanced Classes x
x
x Down-sampling
x
Confusion Matrix Majority class observations are
x
randomly removed to balance
the data
x
Model Selection x
x
Weighting
x For models that randomly sample
x observations (random forests),
Model Drift
x increase the probability of selecting
x the minority class
x
IMBALANCED CLASSES
Hyperparameters Up-sampling
Minority class observations are
duplicated to balance the data
Imbalanced Classes 1
Down-sampling
Confusion Matrix Majority class observations are
randomly removed to balance
the data
Model Selection
Weighting
For models that randomly sample
observations (i.e. random forests),
Model Drift
increase the probability of selecting
the minority class
CONFUSION MATRIX
ACTUAL CLASS
Model Selection
1 0 1 0
1 0
Imbalanced Classes
True Positive False Positive
1 (TP) (FP)
PREDICTED
CLASS
Confusion Matrix False Negative True Negative
0 (FN) (TN)
Model Selection
Of all predictions, what % Of all predicted positives, Of all actual positives, what
were correct? what % were correct? % were predicted correctly?
CONFUSION MATRIX
1 0
Imbalanced Classes
1 100 5
PREDICTED
CLASS
Confusion Matrix
0 15 50
Model Selection
=
(100+50)/(100+50+5+15) 100/(100+5) 100/(100+15)
=
=
.88 .95 .87
CONFUSION MATRIX
A B C D A B C D
Confusion Matrix
A A
B B
Model Selection PREDICTED PREDICTED
PRODUCT PRODUCT
C C
Model Drift D D
Imbalanced Classes
ACTUAL PRODUCT
A B C D
In this case our model has a hard time differentiating
Confusion Matrix A products B and C, often predicting the wrong one
Model Drift D
CONFUSION MATRIX
=
.9846
B
PREDICTED Precision
PRODUCT
Model Selection C TP / ( TP + FP )
( 214 ) / ( 214 + 1 + 8 + 2 )
=
In this case, (TN) includes all
D cases where Product A was .9511
Model Drift not predicted OR observed
Recall
TP / ( TP + FN )
( 214 ) / ( 214 + 15 + 3 )
=
.9224
CONFUSION MATRIX
=
.9899
B
PREDICTED Precision
PRODUCT
Model Selection C TP / ( TP + FP )
( 452 ) / ( 452 + 15 + 1 )
=
D .9658
Model Drift
Recall
TP / ( TP + FN )
( 452 ) / ( 452 + 1 + 2 )
=
.9934
CONFUSION MATRIX
=
.9889
B
PREDICTED Precision
PRODUCT
Model Selection C TP / ( TP + FP )
( 1123 ) / ( 1123 + 19 )
=
D .9834
Model Drift
Recall
TP / ( TP + FN )
( 1123 ) / ( 1123 + 8 + 1 + 12 )
=
.9816
CONFUSION MATRIX
=
.9799
B
PREDICTED Precision
PRODUCT
Model Selection C TP / ( TP + FP )
( 34 ) / ( 34 + 3 + 2 + 12 )
=
D .6667
Model Drift
Recall
TP / ( TP + FN )
( 34 ) / ( 34 + 2 + 19 )
=
.6182
CONFUSION MATRIX
Imbalanced Classes
# Obs. Accuracy Precision Recall
Model Selection
C 1,144 .9889 .9834 .9816
Imbalanced Classes
# Obs. Accuracy Precision Recall
Model Selection
C 1,144 .5498 .2239 .4348
Imbalanced Classes When comparing performance, it’s important to prioritize the most
relevant metrics based on the context of your model
Confusion Matrix • RECALL may be the best metric if it’s critical to predict ALL positive outcomes
correctly, and false negatives are a major risk (i.e. nuclear reactor)
• PRECISION may be the best metric if false negatives aren’t a big deal, but false
Model Selection positives are a major risk (i.e. spam filter or document search)
• ACCURACY may be the best metric if you care about predicting positive and
Model Drift negative outcomes equally, or if the risk of each outcome is comparable
MODEL DRIFT
Drift is when a trained model gradually becomes less accurate over time,
Hyperparameters
even when all variables and parameters remain the same
• As a best practice, all ML models used for ongoing prediction should be
Imbalanced Classes updated or retrained on a regular basis to combat drift
• If you notice drift compared to your benchmark, retrain the model using updated
Training data and consider discarding old records (if you have enough volume)
Model Drift
• Conduct additional feature engineering as necessary
PART 3:
ABOUT THIS SERIES
This is Part 3 of a 4-Part series designed to help you build a deep, foundational understanding of
machine learning, including data QA & profiling, classification, forecasting and unsupervised learning
MACHINE LEARNING
Clustering/Segmentation
Classification Regression Reinforcement Learning
K-Means (Q-learning, deep RL, multi-armed-bandit, etc.)
LASSO/RIDGE, state-
Support vector machines,
space, advanced Deep Learning
gradient boosting, neural
generalized linear (Feed Forward, Convolutional, RNN/LSTM, Attention,
nets/deep learning, etc.
methods, VAR, DFA, etc. Deep RL, Autoencoder, GAN. etc.)
MACHINE LEARNING LANDSCAPE
MACHINE LEARNING
MACHINE LEARNING
3 Safari 36 2 0
Variables can be categorical or numerical 4 IE 17 1 0
• Categorical variables contain classes or categories (used for filtering) 5 Chrome 229 9 1
• Numerical variables contain numbers (used for aggregation)
These are numerical variables
The goal of regression is to predict a numeric dependent variable using independent variables
𝒚 Dependent variable (DV) – for regression, this must be numerical (not categorical)!
• This is the variable you’re trying to predict
• The dependent variable is commonly referred to as “Y”, “predicted”, “output”, or “target” variable
• Regression is about understanding how the numerical DV is impacted by, or dependent on, other variables in the model
EXAMPLE: Using marketing and sales data (sample below) to predict revenue for a given month
Social Posts, Competitive Activity, Marketing Spend and Promotion Count are all
independent variables, since they can help us explain, or predict, monthly Revenue
REGRESSION 101
EXAMPLE: Using marketing and sales data (sample below) to predict revenue for a given month
3 11 Medium $15,000 10 $1,112,050 We’ll use records with observed values for
4 22 Medium $705,000 11 $1,582,077 both independent and dependent variables
5 41 High $3,000 3 $1,889,053
to “train” our regression model...
6 5 High $3,500 3 $1,200,089
11 7 High $320,112 8 ??? ...then apply that model to new, unobserved values
containing IVs but no DV
This is what our regression model will predict!
REGRESSION WORKFLOW
Measurement Planning
Adding new, calculated Splitting records into Building regression models Choosing the best
variables (or “features”) to “Training” and “Test” data from Training data and performing model for a given
a data set based on sets, to validate accuracy applying to Test data to prediction, and tuning it to
existing fields and avoid overfitting maximize prediction accuracy prevent drift over time
FEATURE ENGINEERING
Social Competitive Marketing Promotion Competitive Competitive Competitive Promotion >10 &
Month ID Revenue Log Spend
Posts Activity Spend Count High Medium Low Social > 25
Splitting is the process of partitioning data into separate sets of records for the
purpose of training and testing machine learning models
• As a rule of thumb, ~70-80% of your data will be used for Training (which is what your model
learns from), and ~20-30% will be reserved for Testing (to validate the model’s accuracy)
3 11 Medium $15,000 10 $1,112,050 Using Training data for optimization and Test data
Training for validation ensures that your model can
4 22 Medium $705,000 11 $1,582,077
data accurately predict both known and unknown values,
5 41 High $3,000 3 $1,889,053 which helps to prevent overfitting
6 5 High $3,500 3 $1,200,089
There are two common use cases for linear regression: prediction and root-cause analysis
In this section we’ll introduce the basics of regression modeling, including linear
relationships, least squared error, simple and multiple regression and non-linear models
Y-intercept X value
(x,y)
Univariate Linear
Regression y = ⍺ + βx β
Y value Slope (rise/run) ⍺
Multiple Linear
Regression • NOTE: Not all relationships are linear (more on that later!)
Linear Relationships
Univariate Linear
Regression
Non-Linear (logarithmic) No Relationship
Multiple Linear
Regression
Non-Linear
Regression
Linear Relationships 50
40
Consider a line that fits
Least Squared Error every single point in the plot
30
This is known as a perfectly
linear relationship
20
Univariate Linear
Regression
10
Multiple Linear
Regression 10 20 30 40 50 60 70 80
Non-Linear
Regression
In this case you can simply calculate the exact value of Y for any given value
of X (no Machine Learning needed, just simple math!)
LINEAR RELATIONSHIPS
Linear Relationships 50
40
In the real world, things
Least Squared Error aren’t quite so simple
30
When you add variance, it
means that many different
20
Univariate Linear lines could potentially fit
Regression
through the plot
10
Multiple Linear
Regression 10 20 30 40 50 60 70 80
Non-Linear
Regression
To find the equation of the line with the best possible fit, we can use a
technique known as least squared error or “least squares”
LEAST SQUARED ERROR
• Now square each of those residuals, add them all up, and adjust your line until
you’ve minimized that sum; this is how least squares works!
STEP 1: Plot each data point on a scatterplot, and record the X and Y values (the full table is shown under Step 4 below)
LEAST SQUARED ERROR
y = 10 + 0.5x
STEP 2: Draw a straight line through the points in the scatterplot, and calculate the Y values derived by your linear equation
LEAST SQUARED ERROR
y = 10 + 0.5x
STEP 3: For each value of X, calculate the error (or residual) by comparing the actual Y value against the Y value produced by your linear equation
LEAST SQUARED ERROR
y = 10 + 0.5x

| X  | Y (actual) | Y (line) | Error | Sq. Error |
| 10 | 10 | 15   | 5    | 25     |
| 20 | 25 | 20   | -5   | 25     |
| 30 | 20 | 25   | 5    | 25     |
| 35 | 30 | 27.5 | -2.5 | 6.25   |
| 40 | 40 | 30   | -10  | 100    |
| 50 | 15 | 35   | 20   | 400    |
| 60 | 40 | 40   | 0    | 0      |
| 65 | 30 | 42.5 | 12.5 | 156.25 |
| 70 | 50 | 45   | -5   | 25     |
| 80 | 40 | 50   | 10   | 100    |

SUM OF SQUARED ERROR: 862.5

STEP 4: Square each individual residual and add them up to determine the sum of squared error (SSE)
• This defines exactly how well your line "fits" the plot (or in other words, how well the linear equation describes the relationship between X and Y)
LEAST SQUARED ERROR
y = 12 + 0.4x

| X  | Y (actual) | Y (line) | Error | Sq. Error |
| 10 | 10 | 16 | 6   | 36  |
| 20 | 25 | 20 | -5  | 25  |
| 30 | 20 | 24 | 4   | 16  |
| 35 | 30 | 26 | -4  | 16  |
| 40 | 40 | 28 | -12 | 144 |
| 50 | 15 | 32 | 17  | 289 |
| 60 | 40 | 36 | -4  | 16  |
| 65 | 30 | 38 | 8   | 64  |
| 70 | 50 | 40 | -10 | 100 |
| 80 | 40 | 44 | 4   | 16  |

SUM OF SQUARED ERROR: 722

STEP 5: Plot a new line, repeat Steps 1-4, and continue the process until you've found the line that minimizes the sum of squared error
• This is where Machine Learning comes in; human trial-and-error is completely
impractical, but machines can find an optimal linear equation in seconds
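To make that concrete, here is a small numpy sketch that computes the SSE for the two candidate lines above and then lets `numpy.polyfit` find the least-squares line directly; the x/y values are taken from the example table.

```python
import numpy as np

# (X, Y) data points from the example table
x = np.array([10, 20, 30, 35, 40, 50, 60, 65, 70, 80])
y = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40])

def sse(alpha, beta):
    """Sum of squared error for the line y = alpha + beta * x."""
    residuals = (alpha + beta * x) - y
    return np.sum(residuals ** 2)

print(sse(10, 0.5))   # 862.5  (y = 10 + 0.5x)
print(sse(12, 0.4))   # 722.0  (y = 12 + 0.4x)

# Let the machine find the line that minimizes SSE
beta_hat, alpha_hat = np.polyfit(x, y, deg=1)
print(alpha_hat, beta_hat, sse(alpha_hat, beta_hat))
```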
UNIVARIATE LINEAR REGRESSION
y = ⍺ + βx + 𝜀
• y = dependent variable (DV), x = independent variable (IV), ⍺ = Y-intercept, β = coefficient/parameter (sensitivity of Y to X), 𝜀 = error/residual
• This is just the equation of a line, plus an error term
Simple linear regression is rarely used on its own; think of it as a primer for
understanding more complex topics like non-linear and multiple regression
CASE STUDY: UNIVARIATE LINEAR REGRESSION
THE You are the proud owner of The Cone Zone, a mobile ice cream cart
SITUATION operating on the Atlantic City boardwalk.
You’ve noticed that you tend to sell more ice cream on hot days, and want to
THE understand how temperature and sales relate. Your goal is to build a simple linear
ASSIGNMENT regression model that you can use to predict sales based on the weather forecast.
1. Use a scatterplot to visualize a sample of daily temperatures (X) and sales (Y)
THE 2. Do you notice a clear pattern or trend? How might you interpret this?
OBJECTIVES 3. Find the line that best fits the data by minimizing SSE, then confirm by
plotting a linear trendline on the scatter plot
MULTIPLE LINEAR REGRESSION
Can we use more than one IV to predict the DV?

y = ⍺ + β1x1 + β2x2 + β3x3 + … + βnxn + 𝜀
• y = DV, ⍺ = Y-intercept, each xi is an independent variable with its own coefficient/weight βi, and 𝜀 is the error/residual
• Instead of just 1 IV, we have a whole set of independent variables (and associated coefficients/weights) to help explain our DV
MULTIPLE LINEAR REGRESSION
Multiple regression can scale well beyond 2 variables, but this is where
visual analysis breaks down (and why we need machine learning!)
MULTIPLE LINEAR REGRESSION
EXAMPLE You are preparing to list a new property on AirBnB, and want to estimate
(or predict) an appropriate price using the listing data below
MODEL 2: Predict price (Y) based on accommodation (X1) and number of bedrooms (X2):
Y = 52.59 + (15.4*X1) + (5.1*X2)

MODEL 3: Predict price (Y) based on accommodation (X1), number of bedrooms (X2), and room type (entire.place (X3), hotel.room (X4), private.room (X5))
NON-LINEAR REGRESSION
y = ⍺ + β*ln(x) + 𝜀  (⍺ = Y-intercept, with β applied to a log-transformed independent variable (IV))
All we’re really doing is transforming the data to create linear relationships between each IV and the DV,
then applying a standard linear regression model using those transformed values
NON-LINEAR REGRESSION
EXAMPLE #1 You are predicting Sales (Y) using Marketing Spend (X). As you spend
more on marketing, the impact on sales eventually begins to diminish.
The relationship between Sales and Marketing Spend is non-linear (logarithmic): y = ⍺ + βx + 𝜀
...but the relationship between Sales and the log of Marketing Spend is linear: y = ⍺ + β*ln(x) + 𝜀
NON-LINEAR REGRESSION
EXAMPLE #2 You are predicting population growth (Y) over time (X) and notice an
increasing rate of growth as the population size increases.
The relationship between Time and Population is non-linear (exponential): y = ⍺ + βx + 𝜀
...but the relationship between Time and the log of Population is linear: ln(y) = ⍺ + βx + 𝜀
NOTE: There are multiple ways to transform variables based on the type of relationship
(log, exponential, cubic, etc.), and multiple techniques to model them (more on that later!)
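A minimal sketch of the transform-then-fit idea, using hypothetical arrays: fit a straight line to ln(x) for the logarithmic case, or to ln(y) for the exponential case, and you're back to ordinary linear regression.

```python
import numpy as np

# Hypothetical diminishing-returns data: sales flatten out as spend grows
spend = np.array([1000, 2000, 4000, 8000, 16000, 32000], dtype=float)
sales = np.array([50, 62, 71, 83, 94, 104], dtype=float)

# Logarithmic model: y = a + b * ln(x)  ->  linear in ln(x)
b, a = np.polyfit(np.log(spend), sales, deg=1)
print(a, b)

# Hypothetical accelerating growth: exponential model ln(y) = a + b * x
population = np.array([100, 130, 170, 225, 290, 380], dtype=float)
time_step = np.arange(1, 7)
b2, a2 = np.polyfit(time_step, np.log(population), deg=1)
print(np.exp(a2 + b2 * time_step))  # back-transform predictions to the original scale
```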
CASE STUDY: NON-LINEAR REGRESSION
Your client has asked you to help set media budgets and estimate sales for an
THE upcoming campaign. Using historical ad spend and revenue, your goal is to
ASSIGNMENT build a regression model to help predict campaign performance.
In this section we’ll explore common diagnostic metrics used to evaluate regression
models and ensure that predictions are stable and accurate
R-Squared
Homoskedasticity
F-Significance
Multicollinearity
Variance Inflation
R-SQUARED
R-Squared measures how well your model explains the variance in the dependent variable you are predicting
• The higher the R-Squared, the "better" your model predicts variance in the DV and the more confident you can be in the accuracy of your predictions
• Adjusted R-Squared is often used as it "penalizes" the R-Squared value based on the number of variables included in the model
TSS: the total distance between each y value and the mean, squared (basically variance, without dividing by n)
R-SQUARED EXAMPLE
| X  | Y (actual) | Y (line) | Sq. Error | ȳ (mean) | (y − ȳ)² |
| 10 | 10 | 16 | 36  | 30 | 400 |
| 20 | 25 | 20 | 25  | 30 | 25  |
| 30 | 20 | 24 | 16  | 30 | 100 |
| 35 | 30 | 26 | 16  | 30 | 0   |
| 40 | 40 | 28 | 144 | 30 | 100 |
| 50 | 15 | 32 | 289 | 30 | 225 |
| 60 | 40 | 36 | 16  | 30 | 100 |
| 65 | 30 | 38 | 64  | 30 | 0   |
| 70 | 50 | 40 | 100 | 30 | 400 |
| 80 | 40 | 44 | 16  | 30 | 100 |

SSE = Σi (yi − predictioni)² = 722
TSS = Σi (yi − ȳ)² = 1,450
R² = 1 − SSE/TSS = 1 − 722/1,450 = 0.502
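The same calculation in a short numpy sketch; the actual and predicted values come from the y = 12 + 0.4x example above.

```python
import numpy as np

y_actual = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40], dtype=float)
y_pred   = np.array([16, 20, 24, 26, 28, 32, 36, 38, 40, 44], dtype=float)

sse = np.sum((y_actual - y_pred) ** 2)            # 722.0
tss = np.sum((y_actual - y_actual.mean()) ** 2)   # 1450.0
r_squared = 1 - sse / tss
print(sse, tss, round(r_squared, 3))              # 722.0 1450.0 0.502
```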
MEAN ERROR METRICS
Mean error metrics measure how well your regression model predicts, as opposed to how well it explains variance (like R-Squared)
• There are many variations, but the most common ones are Mean Squared Error (MSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE)
• These metrics provide "standards" which can be used to compare predictive accuracy across multiple regression models

MSE = Σi (yi − predictioni)² / n — the average of the squared distance between actual & predicted values
MAE = Σi |yi − predictioni| / n — the average of the absolute distance between actual & predicted values
MAPE = (Σi |yi − predictioni| / yi) / n — Mean Absolute Error, converted to a percentage

Mean error metrics can be used to evaluate regression models just like performance metrics like accuracy, precision and recall can be used to evaluate classification models
MSE EXAMPLE
y = 12 + 0.4x
(Using the example table from Step 5 above: SUM OF SQUARED ERROR = 722, n = 10)
MSE = Σi (yi − predictioni)² / n = 722 / 10 = 72.2
MAE EXAMPLE
y = 12 + 0.4x
(Same data, with the absolute error for each point: 6, 5, 4, 4, 12, 17, 4, 8, 10, 4; SUM OF ABSOLUTE ERROR = 74)
MAE = Σi |yi − predictioni| / n = 74 / 10 = 7.4
MAPE EXAMPLE
y = 12 + 0.4x
(Same data, with each absolute error divided by the actual Y value: 0.6, 0.2, 0.2, 0.133, 0.3, 1.133, 0.1, 0.267, 0.2, 0.1; SUM OF ABSOLUTE % ERROR = 3.233)
MAPE = (Σi |yi − predictioni| / yi) / n = 3.233 / 10 = 32.33%
MEAN ERROR METRICS
PRO TIP: In general we recommend considering all of them, since they can be calculated instantly and each provides helpful context into model performance
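A short sketch computing all three metrics at once, using the values from the y = 12 + 0.4x example above.

```python
import numpy as np

y_actual = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40], dtype=float)
y_pred   = np.array([16, 20, 24, 26, 28, 32, 36, 38, 40, 44], dtype=float)

mse  = np.mean((y_actual - y_pred) ** 2)              # 72.2
mae  = np.mean(np.abs(y_actual - y_pred))             # 7.4
mape = np.mean(np.abs(y_actual - y_pred) / y_actual)  # ~0.3233 (32.33%)
print(mse, mae, round(mape * 100, 2))
```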
HOMOSKEDASTICITY
Homoskedasticity means the model's residuals have roughly constant variance across the full range of IV values; heteroskedasticity means they don't.
• Homoskedastic: residuals are consistent over the entire IV range
• Heteroskedastic: residuals increase at higher IV values, indicating that there is some variance that the IVs are unable to explain
Breusch-Pagan tests can report a formal calculation of heteroskedasticity, but usually a simple visual check is enough
NULL HYPOTHESIS
noun
1. In a statistical test, the hypothesis that there is no significant difference
between specified populations, any observed difference being due to
sampling or experimental error *
Our goal is to reject the null hypothesis and prove (with a high level of confidence) that our
model can produce accurate, statistically significant predictions and not just random outputs
*Oxford Dictionary
F-STATISTIC & F-SIGNIFICANCE
The F-Statistic and associated P-Value (aka F-Significance) help us understand the predictive power of the model as a whole
• F-Significance is technically defined as "the probability that the null hypothesis cannot be rejected", which can be interpreted as the probability that your model predicts poorly
• The smaller the F-Significance, the more useful your regression is for prediction
• NOTE: It's common practice to use a P-Value of .05 (aka 95%) as a threshold to determine if a model is "statistically significant", or valid for prediction
PRO TIP: F-Significance should be the first thing you check when you evaluate a regression model; if it's above your threshold, you may need more training, if it's below your threshold, move on to coefficient-level significance (up next!)
T-STATISTICS & P-VALUES
T-Statistics and their associated P-Values help us understand the predictive power of the individual model coefficients

T-Statistics tell us the degree to which we can "trust" the coefficient estimates
• Calculated by dividing the coefficient by its standard error
• Primarily used as a stepping-stone to calculate P-Values, which are easier to interpret and more commonly used for diagnostics

P-Values tell us the probability that the coefficient is meaningless, statistically speaking
• The smaller the P-value, the more confident you can be that the coefficient is valid (not 0)
• Common thresholds: 0.001 → 99.9% confidence (***), 0.01 → 99% (**), 0.05 → 95% (*)
MULTICOLLINEARITY
Multicollinearity occurs when two or more independent variables are highly correlated, leading to untrustworthy model coefficients
• Correlation means that one IV can be used to predict another (i.e. height and weight), leading to many combinations of coefficients that predict equally well
• This leads to unreliable coefficient estimates, and means that your model will fail to generalize when applied to non-training data

How do I measure multicollinearity, and what can I do about it?
• Variance Inflation Factor (VIF) can help you quantify the degree of multicollinearity, and determine which IVs to exclude from the model
VARIANCE INFLATION FACTOR
To calculate Variance Inflation Factor (VIF) you treat each individual IV as the dependent variable, and use the R² value to measure how well you can predict it using the other IVs in the model

PRO TIP: As a rule of thumb, a VIF >10 indicates that multicollinearity is a problem, and that one or more IVs is redundant and should be removed from the model
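If you're working in Python, statsmodels exposes a VIF helper; this is a minimal sketch assuming a hypothetical DataFrame `X` that contains only the model's independent variables (the column names here are made up).

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical IVs; accommodates and bedrooms are deliberately correlated
rng = np.random.default_rng(0)
accommodates = rng.integers(1, 9, size=200)
X = pd.DataFrame({
    "accommodates": accommodates,
    "bedrooms": accommodates // 2 + rng.integers(0, 2, size=200),
    "minimum_nights": rng.integers(1, 30, size=200),
}).astype(float)

# Add an intercept column, then compute VIF for each IV
# (each IV is regressed on the others; VIF = 1 / (1 - R^2))
exog = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(exog.values, i) for i in range(1, exog.shape[1])],
    index=X.columns,
)
print(vif)  # a VIF above ~10 suggests the IV is redundant
```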
VARIANCE INFLATION FACTOR
SOLUTION: Pick one high-VIF variable to remove (arbitrarily), re-run the model, and recalculate the VIF values to see if multicollinearity is gone. Adios multicollinearity!

PRO TIP: Use a frequency table to confirm correlation!

|                   | Private.room: NO | Private.room: YES |
| Entire.place: NO  | 816              | 15,815            |
| Entire.place: YES | 13,166           | 0                 |
RECAP: SAMPLE MODEL OUTPUT
A typical regression summary output includes:
• The formula for the regression (variables & data set)
• A profile of the residuals/errors
• The Y-intercept & IV coefficients
• Coefficient P-Values
• The F-Statistic and P-Value (aka F-Significance)
*Copyright Maven Analytics, LLC
TIME-SERIES FORECASTING
In this section we’ll explore common time-series forecasting techniques, which use
regression models to predict future values based on seasonality and trends
• Learn how to identify and quantify seasonality using auto correlation and one-hot encoding
• Explore common models for non-linear forecasting (like ADBUDG and Gompertz)
• Apply forecasting techniques to analyze the impact of key business decisions (aka "interventions")
FORECASTING 101
Time-series forecasting is all about predicting future values of a single, numeric dependent variable
• Forecasting works just like any other regression model, except your data must contain multiple observations of your DV over time (aka time-series data)
• Time-series forecast models look for patterns in the observed data – like seasonality or linear/non-linear trends – to accurately predict future values

Common examples:
• Forecasting revenue for the next fiscal year
• Predicting website traffic growth over time
• Estimating sales for a new product launch
SEASONALITY
Seasonality refers to patterns that repeat over a fixed time period (hour of day, day of week, month, quarter, etc.)
• We can identify seasonal patterns using an Auto Correlation Function (ACF), then apply that seasonality to forecasts using techniques like one-hot encoding or moving averages (more on that soon!)

Common examples:
• Website traffic by hour of the day
• Seasonal product sales
• Airline ticket prices around major holidays
AUTO CORRELATION FUNCTION
Strong correlation every 7th lag, which indicates a weekly cycle
[ACF chart: correlation (%) by lag, 0-28]
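A minimal sketch of computing autocorrelations by lag with statsmodels, assuming a hypothetical daily series with a weekly cycle; spikes at lags 7, 14, 21, ... would indicate weekly seasonality.

```python
import numpy as np
from statsmodels.tsa.stattools import acf

# Hypothetical daily series with a weekly pattern plus noise
rng = np.random.default_rng(1)
days = np.arange(200)
traffic = 100 + 20 * (days % 7 == 5) + rng.normal(0, 3, size=200)

autocorr = acf(traffic, nlags=28)
for lag, value in enumerate(autocorr):
    print(f"lag {lag:2d}: {value:+.2f}")
```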
CASE STUDY: AUTO CORRELATION
Once you’ve identified a seasonal pattern, you can use one-hot encoding
to create independent variables which capture the effect of those time
Forecasting 101
periods (days, weeks, months, quarters, etc.)
Seasonality
• NOTE: When you use one-hot encoding, you must exclude one of the options
rather than encoding all of them (it doesn’t matter which one you exclude)
• Consider the equation A+B=5; there are an infinite combination of A & B values that
Linear Trending can solve it. One-hot encoding all options creates a similar problem for regression
Quarter ID Revenue Q1 Q2 Q3 Q4
Smoothing
1 $1,300,050 1 0 0 0
2 $11,233,310 0 1 0 0
4 $1,582,077 0 0 0 1
Intervention Analysis
PRO TIP: If your data contains multiple seasonal patterns (i.e. hour of day + day of week),
include both dimensions in the model as one-hot encoded independent variables
CASE STUDY: ONE-HOT ENCODING
THE You’re are a Senior Analyst for Weather Trends, a Brazilian weather
SITUATION station boasting the longest and most accurate forecasts in the biz.
You’ve been asked to help prepare temperature forecasts for the upcoming
THE year. To do this, you’ll need to analyze ~5 years of historical data from Rio de
ASSIGNMENT Janeiro, and use regression to predict monthly average temperatures.
What if the data includes both seasonality and a linear trend?
Trend describes an overarching direction or movement in a time series, not counting seasonality
• Trends are often linear (up/down), but can be non-linear as well (more on that later!)
• To account for linear trending in a regression, you can include a time step IV; this is simply an index value that starts at 1 and increments with each time period
• If the time step coefficient isn't statistically significant, it means you don't have a meaningful linear trend

PRO TIP: It's common for time-series models to include trending AND seasonality; in this case, use a combination of one-hot encoding and time step variables to account for both!
LINEAR TRENDING
| Month | Revenue    | T-Step | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov |
| Jan   | $1,300,050 | 1      | 1   | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0   |
| Feb   | $1,233,310 | 2      | 0   | 1   | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0   |
| Mar   | $1,112,050 | 3      | 0   | 0   | 1   | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0   |
| Apr   | $1,582,077 | 4      | 0   | 0   | 0   | 1   | 0   | 0   | 0   | 0   | 0   | 0   | 0   |
| May   | $1,776,392 | 5      | 0   | 0   | 0   | 0   | 1   | 0   | 0   | 0   | 0   | 0   | 0   |
| Jun   | $2,110,201 | 6      | 0   | 0   | 0   | 0   | 0   | 1   | 0   | 0   | 0   | 0   | 0   |
| Jul   | $1,928,290 | 7      | 0   | 0   | 0   | 0   | 0   | 0   | 1   | 0   | 0   | 0   | 0   |
| Aug   | $2,250,293 | 8      | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 1   | 0   | 0   | 0   |
| Sep   | $2,120,050 | 9      | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 1   | 0   | 0   |
(Note that only Jan-Nov are encoded; one month is excluded)
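A rough statsmodels sketch of a regression with a time step IV plus one-hot encoded months; the revenue values here are simulated stand-ins rather than the table above, just to keep the example self-contained.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical 3 years of monthly revenue with an upward trend and a summer bump
months = pd.date_range("2021-01-01", periods=36, freq="MS")
rng = np.random.default_rng(7)
revenue = (1_000_000 + 25_000 * np.arange(36)              # linear trend
           + 200_000 * np.isin(months.month, [6, 7, 8])    # seasonal lift
           + rng.normal(0, 40_000, 36))

df = pd.DataFrame({"revenue": revenue,
                   "t_step": np.arange(1, 37),              # time step IV
                   "month": months.strftime("%b")})
dummies = pd.get_dummies(df["month"], drop_first=True).astype(float)  # exclude one month
df = pd.concat([df, dummies], axis=1)

# Regress revenue on the time step plus the one-hot encoded months
model = smf.ols("revenue ~ t_step + " + " + ".join(dummies.columns), data=df).fit()
print(model.params)
```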
THE You are a Business Intelligence Analyst for Maven Muscles, a large national
SITUATION chain of fitness centers.
The Analytics Director needs your help building a monthly revenue forecast
THE for the upcoming year. Memberships follow a clear seasonal pattern, and
ASSIGNMENT revenue has been steadily rising as the gym continues to open new locations.
SMOOTHING
What if my data is so noisy that I can't tell if a trend exists?
• If the volatility is truly just noise (and something we want our model to ignore), averaging or weighting the values around each time step can help us "smooth" the data and produce more accurate forecasts

PRO TIP: Smoothing is a great way to expose patterns and trends that otherwise might be tough to see; make this part of your data profiling process!
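A minimal pandas sketch of smoothing with a moving average; the series and the window sizes below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical noisy daily ratings with a slow upward trend
rng = np.random.default_rng(3)
dates = pd.date_range("2024-01-01", periods=120, freq="D")
ratings = 3.5 + 0.004 * np.arange(120) + rng.normal(0, 0.4, 120)
series = pd.Series(ratings, index=dates)

# 7-day and 28-day centered moving averages smooth out the day-to-day noise
print(series.rolling(window=7, center=True).mean().head(10))
print(series.rolling(window=28, center=True).mean().dropna().head())
```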
CASE STUDY: SMOOTHING
THE You’ve just been hired as an analytics consultant for Maven Motel, a struggling
SITUATION national motel chain.
Management has been taking steps to improve guest satisfaction, and has asked
THE you to analyze daily data to determine how ratings are trending. Your task is to
ASSIGNMENT use a moving average calculation to discern if an underlying trend is present.
1. Collect daily average guest ratings for the motel. Do you see any clear
THE patterns or trends?
OBJECTIVES 2. Calculate a moving average, and compare various windows from 1-12 weeks
3. Determine if an underlying trend is present. How would you describe it?
NON-LINEAR TRENDS
PRO TIP: ADBUDG and Gompertz are more flexible versions of a logistic curve, and
are commonly seen in BI use cases (product launches, diminishing returns, etc.)
CASE STUDY: NON-LINEAR TREND
THE The team at Cat Slacks just launched a new product poised to revolutionize
the world of feline fashion: a lightweight, breathable jogging short designed
SITUATION for active cats who refuse to compromise on quality.
1. Collect sales data for the first 8 weeks since the launch
THE 2. Apply a Gompertz curve to fit a logistic trend
OBJECTIVES 3. Adjust parameters to compare various capacity limits and growth rates
INTERVENTION ANALYSIS
Intervention analysis uses a forecast model to quantify the impact of a specific event or change (an "intervention")
• By fitting a model to the "pre-intervention" data (up to the date of the change), you can compare predicted vs. actual values after that date to estimate the impact of the intervention

Common examples:
• Measuring the impact of a new website or check-out page on conversion rates
• Quantifying the impact of a new HR program to reduce employee churn
INTERVENTION ANALYSIS
STEP 1: Fit a regression model to the data, using only observations from the pre-intervention period
[Time-series chart with the intervention date marked]
INTERVENTION ANALYSIS
STEP 2: Compare the predicted and observed values in the post-intervention period, and sum the daily residuals to estimate the impact of the change
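A rough sketch of the two steps above, assuming a hypothetical daily conversion-rate series and a known intervention date; the model here is just a simple time-step trend for brevity.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical daily conversion rates with a lift after the intervention
rng = np.random.default_rng(11)
dates = pd.date_range("2024-01-01", periods=120, freq="D")
cvr = 0.020 + 0.00001 * np.arange(120) + rng.normal(0, 0.001, 120)
cvr[90:] += 0.004                                   # effect of the change on day 90
df = pd.DataFrame({"cvr": cvr, "t_step": np.arange(120)}, index=dates)

intervention = dates[90]
pre, post = df[df.index < intervention], df[df.index >= intervention]

# STEP 1: fit the model on pre-intervention data only
model = smf.ols("cvr ~ t_step", data=pre).fit()

# STEP 2: predict the post-intervention "baseline" and sum the residuals
baseline = model.predict(post)
impact = (post["cvr"] - baseline).sum()
print(f"Estimated cumulative lift in CVR: {impact:.4f}")
```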
CASE STUDY: INTERVENTION ANALYSIS
THE You are a Web Analyst for Alpine Supplies, an online retailer specializing
SITUATION in high-end camping and hiking gear.
The company recently rolled out a new product landing page, and the CMO
THE has asked you to help quantify the impact on conversion rate (CVR) and
ASSIGNMENT sales. You’ll need to conduct an intervention analysis, and make sure to
capture day of week seasonality and trending in your model.
1. Collect data to track sessions and CVR before and after the change
THE 2. Fit a regression model to predict CVR using data from before the redesign
OBJECTIVES 3. Use the model to forecast ”baseline” CVR after the change, and calculate
both incremental daily sales and the total cumulative impact
PART 4:
ABOUT THIS SERIES
This is Part 4 of a 4-Part series designed to help you build a deep, foundational understanding of
machine learning, including data QA & profiling, classification, forecasting and unsupervised learning
2 Segmentation & Clustering: Learn the key building blocks of segmentation and clustering, and review two of the most-used algorithms: K-Means and Hierarchical Clustering
3 Association Mining: Explore common techniques for association mining and when to use each of them, including A Priori and Markov models
MACHINE LEARNING
• Classification: Support vector machines, gradient boosting, neural nets/deep learning, etc.
• Regression: LASSO/RIDGE, state-space, advanced generalized linear methods, VAR, DFA, etc.
• Reinforcement Learning: Q-learning, deep RL, multi-armed-bandit, etc.
• Clustering/Segmentation: K-Means Clustering, Hierarchical Clustering
• Dimensionality Reduction & other unsupervised techniques: Matrix factorization, principal components, factor analysis, UMAP, T-SNE, topological data analysis, advanced clustering, etc.
• Deep Learning: Feed Forward, Convolutional, RNN/LSTM, Attention, Deep RL, Autoencoder, GAN, etc.
MACHINE LEARNING LANDSCAPE
MACHINE LEARNING
Unlike Classification and Regression, unsupervised techniques are used to DESCRIBE or ORGANIZE the data in some non-obvious way
MACHINE LEARNING
Business Objective → Feature Engineering → Model Application → Hyperparameter Tuning → Model Selection
• Feature Engineering: Add new, calculated variables (or "features") to the data set based on existing fields
• Model Application: Apply relevant unsupervised ML techniques, based on the objective (you will typically test multiple models)
• Hyperparameter Tuning: Adjust and tune model parameters (this is typically an iterative process)
• Model Selection: Select the model that yields the most useful or insightful results, based on the objective at hand
There often are no strict rules to determine which model is best; it’s about which one helps you best answer the question at hand
RECAP: FEATURE ENGINEERING
Original fields (Month ID, Revenue, Social Posts, Competitive Activity, Marketing Spend, Promotion Count) plus engineered features (Log Spend, Competitive High/Medium/Low, Promotion >10 & Social > 25)
In this section we’ll introduce the fundamentals of clustering & segmentation and
compare two common unsupervised models: K-means and hierarchical clustering
• Explore common models including K-Means and hierarchical clustering
Remember, there’s no “right” answer or single optimization metric when it comes to clustering and segmentation; the best
outputs are the ones which help you answer the question at hand and make practical, data-driven business decisions
CLUSTERING BASICS
EXTRACT: Collect the raw data you need for analysis (i.e. transactions, customer records, survey results, website pathing, etc.)
K-MEANS
K-Means is a common algorithm which assigns each observation in a data set to a specific cluster, where K represents the number of clusters

In its simplest form (2 dimensions), here's how it works:
1. Select K arbitrary locations in a scatterplot as cluster centers (or centroids), and assign each observation to a cluster based on the closest centroid
2. Recalculate and relocate each centroid to the mean of the observations assigned to it, then reassign each observation to its new closest centroid
3. Repeat the process until observations no longer change clusters
How do we know what's the "right" number of clusters (K)?
• While there is no "right" or "wrong" number of clusters, you can use the within-cluster sum of squares (WSS) to help inform your decision
• Square the distance from each observation to its assigned centroid and sum them to calculate WSS (for example, for two clusters, K=2)
K-MEANS
Look for an "elbow" or inflection point, where adding another cluster has a relatively small impact on WSS (in this case where K=5)
[Elbow chart: WSS by number of clusters (K), from 2 to 8]
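A scikit-learn sketch of the elbow approach: fit K-Means for a range of K values on hypothetical data and record the within-cluster sum of squares (which scikit-learn exposes as `inertia_`).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-D data with 5 underlying groups
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.2, random_state=0)

wss = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = model.inertia_          # within-cluster sum of squares for this K

for k, value in wss.items():
    print(k, round(value, 1))        # look for the "elbow" where improvement flattens
```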
K-MEANS
Does the shape of the clusters matter?
• Yes, K-Means works best when the clusters are mostly circular in shape;
but other tools like Hierarchical Clustering (up next!) can address this
THE You’ve just been hired as a Data Science Intern for Pet Palz, an ecommerce
SITUATION start-up selling custom 3-D printed models of household pets.
*This is known as agglomerative or “bottom-up” clustering (vs. divisive or “top-down” clustering, which is much less common)
HIERARCHICAL CLUSTERING
[Scatterplot of points p1-p6 alongside the corresponding dendrogram, with distance on the vertical axis]
STEP 1: Find the two closest points, and group them into a cluster (5 clusters remain)
STEP 2: Find the next two closest points/clusters, and group them together (4 clusters)
STEP 3: Repeat the process until all points are part of the same cluster (3 clusters → 2 clusters → 1)
HIERARCHICAL CLUSTERING
How do you know when to stop creating new clusters?
• The height of each branch tells us how close the clusters/data points are to each other (taller = longer distance)
• Vertical branches that lead to splits are called clades
How exactly do you define the distance between clusters?
• There are a number of valid ways to measure the distance between clusters, which are often referred to as "linkage methods"
• Common methods include measuring the closest min/max distance between clusters, the lowest average distance, or the distance between cluster centroids

When might you use hierarchical clustering over K-Means (K-Means vs. Hierarchical)?
• Run both models and compare outputs; hierarchical clustering may produce more meaningful results if clusters are not circular or uniform in shape
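A minimal SciPy sketch: build the linkage matrix on hypothetical points and plot the dendrogram. The `ward` linkage method used here is just one of the linkage options mentioned above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Hypothetical 2-D observations forming two loose groups
rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
               rng.normal([4, 4], 0.5, (20, 2))])

Z = linkage(X, method="ward")                     # agglomerative, bottom-up merging
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)

dendrogram(Z)                                     # branch heights show merge distances
plt.show()
```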
KEY TAKEAWAYS
Models won’t tell you how to interpret what each cluster represents
• Use multivariate analysis and profiling techniques to understand the characteristics of each cluster
*Copyright Maven Analytics, LLC
ASSOCIATION MINING
In this section we’ll introduce the fundamentals of association mining and basket
analysis, and compare common techniques including Apriori and Markov Chains
Association mining is NOT about trying to prove or establish causation; it’s just about
identifying frequently occurring patterns and correlations in large datasets
APRIORI
EXAMPLE: Using a sample of 20 transactions, calculate the association between bacon and eggs:

1) Support (bacon) = 10/20 = 0.5
   Support (eggs) = 7/20 = 0.35
   Support (bacon, eggs) = 6/20 = 0.3

2) Confidence (bacon → eggs) = Support (bacon, eggs) / Support (bacon) = 0.3 / 0.5 = 60%

3) Lift (bacon → eggs) = Support (bacon, eggs) / (Support (bacon) x Support (eggs)) = 0.3 / (0.5 x 0.35) = 1.7

Since Lift > 1, we can interpret the association between bacon and eggs as real and informative (eggs are likely to be purchased with bacon)
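The support / confidence / lift arithmetic is easy to express directly; this sketch recreates the bacon & eggs numbers using a hypothetical list of 20 transactions constructed to match the supports above.

```python
# Hypothetical basket data: 20 transactions, each a set of items
transactions = (
    [{"bacon", "eggs"}] * 6 + [{"bacon"}] * 4 + [{"eggs"}] * 1 + [{"bread"}] * 9
)

def support(*items):
    """Share of transactions containing every item in the itemset."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

s_bacon, s_eggs, s_both = support("bacon"), support("eggs"), support("bacon", "eggs")
confidence = s_both / s_bacon
lift = s_both / (s_bacon * s_eggs)
print(s_bacon, s_eggs, s_both)                 # 0.5  0.35  0.3
print(round(confidence, 2), round(lift, 2))    # 0.6  1.71
```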
EXAMPLE: Now calculate the association between bacon and basil:

1) Support (bacon) = 10/20 = 0.5
   Support (basil) = 5/20 = 0.25
   Support (bacon, basil) = 1/20 = 0.05

2) Confidence (bacon → basil) = Support (bacon, basil) / Support (bacon) = 0.05 / 0.5 = 10%

3) Lift (bacon → basil) = Support (bacon, basil) / (Support (bacon) x Support (basil)) = 0.05 / (0.5 x 0.25) = 0.4

Since Lift < 1, we can conclude that there is no positive association between bacon and basil (basil is unlikely to be purchased with bacon)
EXAMPLE: Now calculate the association between bacon and water:

1) Support (bacon) = 10/20 = 0.5
   Support (water) = 1/20 = 0.05
   Support (bacon, water) = 1/20 = 0.05

2) Confidence (bacon → water) = Support (bacon, water) / Support (bacon) = 0.05 / 0.5 = 10%

3) Lift (bacon → water) = Support (bacon, water) / (Support (bacon) x Support (water)) = 0.05 / (0.5 x 0.05) = 2

Since Lift = 2 we might assume a strong association between bacon and water, but this is skewed since water only appears in one transaction
To filter low-volume purchases, you can plot support for each item and determine a threshold or cutoff value:
[Bar chart of support by item]
In this case we might filter out any transactions containing items with support <0.15 (transactions 4, 6, 8, 10, 15, 17)
Association Mining Basics Wouldn’t you want to analyze all possible item combinations, instead
of calculating individual associations?
Apriori
• Yes! In reality that’s how Apriori works. But since the number of configurations
increases exponentially with each item, this is impractical for humans to do
Markov Chains • To reduce or “prune” the number of itemsets to analyze, we can use the apriori
principle, which basically states that if an item is infrequent, any combinations
containing that item must also be infrequent
Key Takeaways
STEP 4: Based on the filtered transactions, you can use an apriori model to calculate confidence and lift and identify the strongest associations
Can you calculate associations between multiple items, like coffee being purchased with bacon and eggs?
• Yes, you can calculate support, confidence and lift using the same exact logic as you would with individual items

Support (bacon, eggs, coffee) = 0.2
Confidence (bacon & eggs → coffee) = Support (bacon, eggs, coffee) / Support (bacon, eggs) = 0.2 / 0.3 = 67%
Lift (bacon & eggs → coffee) = Support (bacon, eggs, coffee) / (Support (bacon, eggs) x Support (coffee)) = 0.2 / 0.12 = 1.7

Calculating all possible associations would be impossible for a human; this is where Machine Learning comes in!
THE You are the proud owner of Coastal Roasters, a local coffee shop and
SITUATION bakery based in the Pacific Northwest.
You’d like to better understand which items customers tend to purchase together,
THE to help inform product placement and promotional strategies.
ASSIGNMENT As a first step, you’ve decided to analyze a sample of ~10,000 transactions over a
6-month period, and conduct a simple basket analysis using an apriori model.
MARKOV CHAINS
EXAMPLE: Monthly subscribers transitioning between Gold, Silver and Churn states each month:
[State diagram: monthly transition probabilities between Gold, Silver, and Churn]

Transition matrix (FROM state rows, TO state columns):
|        | Gold | Silver | Churn |
| Gold   | 0.80 | 0.15   | 0.05  |
| Silver | 0.20 | 0.40   | 0.40  |
| Churn  | 0.01 | 0.30   | 0.69  |
Example insights & recommendations:
• Most customers who churn stayed churned, but 31% do come back; of those who return,
nearly all of them re-subscribe to a Silver plan (vs. Gold)
ü RECOMMENDATION: Launch targeted marketing to recently churned customers, offering a discount to
resubscribe to a Silver membership plan
• Once customers upgrade to a Gold membership, the majority (80%) renew each month
ü RECOMMENDATION: Offer a one-time discount for Silver customers to upgrade to Gold; while you may
sacrifice some short-term revenue, it will likely be profitable in the long term
To account for prior transitions (vs. just the previous) you can use more complex “higher-order” Markov Chains
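The transition matrix above can be used directly to project how the subscriber mix evolves over time; here is a minimal numpy sketch (the starting mix of subscribers is hypothetical).

```python
import numpy as np

# FROM-state rows -> TO-state columns: Gold, Silver, Churn
P = np.array([[0.80, 0.15, 0.05],
              [0.20, 0.40, 0.40],
              [0.01, 0.30, 0.69]])

state = np.array([0.20, 0.60, 0.20])   # hypothetical starting mix of subscribers

for month in range(1, 13):
    state = state @ P                  # one month of transitions
    print(month, np.round(state, 3))
```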
CASE STUDY: MARKOV CHAINS
THE You’ve just been promoted to Senior Web Analyst at Alpine Supplies, an online
SITUATION retailer specializing in equipment and supplies for outdoor enthusiasts.
The VP of Sales just shared a sample of ~15,000 customer purchase paths, and
THE would like you to analyze the data to help inform a new cross-sell sales strategy.
ASSIGNMENT Your goal is to explore the data and build a simple Markov model to predict which
product an existing customer is most likely to purchase next.
In this section we’ll introduce the concept of statistical outliers, and review common
methods for detecting both cross-sectional and time-series outliers and anomalies
The terms outlier detection, anomaly detection, and rare-event detection are often used interchangeably; they all focus on
finding observations which are materially different than the others (outliers are also sometimes called “pathological” data)
OUTLIER DETECTION BASICS
• Business outliers: Legitimate values which provide real and meaningful information, such as a fraudulent transaction or an unexpected spike in sales

Cross-sectional outlier detection is used to measure the similarity between observations in a dataset, and identify observations which are unusually different or dissimilar
• Detecting outliers in one or two dimensions is often trivial, using basic profiling metrics (i.e. interquartile range) or visual analysis (scatterplots, box plots, etc.)
THE You are a Senior Data Analyst for Brain Games, a large chain of retail shops
SITUATION specializing in educational toys, puzzles and board games.
You’d like to analyze store-level sales by product category, to identify patterns and
THE see if any locations show an unusual composition of revenue.
ASSIGNMENT To do this, you’ll be exploring revenue data across 100 individual store locations,
and detecting outliers using “nearest neighbor” calculations.
Time-series outlier detection is used to identify observations which fall well outside expectations for a particular point in time, after accounting for seasonality and trending
• In practice, this involves building a time-series regression, comparing each observation to its predicted value, and plotting the residuals to detect anomalies
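A rough sketch of this residual-based approach, assuming hypothetical hourly traffic with hour-of-day seasonality and one injected anomaly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical hourly website sessions with an hour-of-day pattern
rng = np.random.default_rng(9)
idx = pd.date_range("2024-03-01", periods=24 * 14, freq="h")
sessions = 200 + 80 * np.sin(2 * np.pi * idx.hour / 24) + rng.normal(0, 10, len(idx))
sessions[100] += 300                                   # injected anomaly

df = pd.DataFrame({"sessions": sessions, "hour": idx.hour.astype(str)}, index=idx)

# Fit a regression that accounts for hour-of-day seasonality, then inspect residuals
model = smf.ols("sessions ~ C(hour)", data=df).fit()
residuals = df["sessions"] - model.predict(df)
threshold = 4 * residuals.std()
print(residuals[residuals.abs() > threshold])          # the anomaly stands out
```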
TIME-SERIES OUTLIERS
Instead of simple visual analysis, we can fit a linear regression model to account for seasonality (hour of day) and trending
• From there, we can plot the distribution of residuals in order to quickly identify any observations which significantly deviated from the model's prediction
TIME-SERIES OUTLIERS
When we plot the distribution of model residuals, we detect a clear outlier in the data
PRO TIP: Not all outliers are bad! If you find an anomaly, understand what happened and how you can learn from it
KEY TAKEAWAYS
For time-series analysis, you can fit a regression model to the data and plot
the residuals to clearly identify outliers
• This allows you to quickly find outliers, while controlling for seasonality and trending
*Copyright Maven Analytics, LLC
DIMENSIONALITY REDUCTION
Dimensionality reduction involves reducing the number of columns (i.e. dimensions) in a dataset while losing as little information as possible
• This can be used strictly as a ML optimization technique, or to help develop a better understanding of the data for business intelligence, market research, etc.
✓ Understanding the main traits or characteristics of your customers (website activity, transaction patterns, survey response data, etc.)
✓ Reducing trivial correlations and removing multicollinearity to build more accurate and meaningful models
PRINCIPAL COMPONENT ANALYSIS
• In the simplest form, PCA finds lines that best fit through the observations in a data set, and uses those lines to create new dimensions to analyze
[Scatterplots of test scores: Spelling (X) / Vocabulary (Y), Vocabulary (X) / Multiplication (Y), Spelling (X) / Multiplication (Y), Multiplication (X) / Geometry (Y)]
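A scikit-learn sketch of the idea, using hypothetical test-score data where spelling/vocabulary and multiplication/geometry move together; with data like this, the first two components should capture most of the variance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical student test scores: two language and two math subjects
rng = np.random.default_rng(2)
language = rng.normal(70, 10, 150)
math = rng.normal(70, 10, 150)
scores = pd.DataFrame({
    "spelling":       language + rng.normal(0, 3, 150),
    "vocabulary":     language + rng.normal(0, 3, 150),
    "multiplication": math + rng.normal(0, 3, 150),
    "geometry":       math + rng.normal(0, 3, 150),
})

pca = PCA()
components = pca.fit_transform(StandardScaler().fit_transform(scores))
print(pca.explained_variance_ratio_)   # first two components dominate
print(pca.components_[:2])             # loadings: language vs. math pattern
```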
PRINCIPAL COMPONENT ANALYSIS
[Scatterplots of Spelling vs. Vocabulary and Multiplication vs. Geometry, showing each principal component line and the weighting of individual observations along it]
PRINCIPAL COMPONENT ANALYSIS
In this plot, the first 2 components explain most of the variance in the data, and additional components have minimal impact
PRINCIPAL COMPONENT ANALYSIS
[Scatterplot of the derived Language and Math components]
In this example, defining components for language and math
helped us simplify and better understand the data set
• Using the new components we derived, we can conduct further analysis
like predictive modeling, classification, clustering, etc.
• For example, clustering might help us understand student testing patterns
(most skew towards either language or math, while a few excel in both subjects)
ADVANCED TECHNIQUES
CONGRATULATIONS!
You now have a solid foundational understanding of unsupervised learning, including techniques
like clustering, association mining, outlier detection, and dimensionality reduction.
We hope you’ve enjoyed the entire Machine Learning for BI series, and that you find an
opportunity to put your skills to good use!