
THE COMPLETE VISUAL GUIDE TO MACHINE LEARNING

With Best-Selling Instructors Josh MacCarty & Chris Dutton

*Copyright Maven Analytics, LLC


SETTING EXPECTATIONS

This is NOT a coding/programming course; it's about introducing and demystifying essential machine learning topics
• Our goal is to break down complex techniques using simple and intuitive explanations and demos

We'll focus on tools commonly applied to business intelligence use cases
• We'll focus on techniques like data profiling, linear/logistic regression, forecasting, and unsupervised learning, but will not cover some more advanced or specialized techniques (deep learning, NLP, etc.)

We'll use Microsoft Excel as a tool to help explain key concepts
• Excel's intuitive, visual interface allows us to expose the nuts and bolts of each technique to understand HOW and WHY these algorithms work (rather than simply running lines of code)

You do NOT need a math or stats background to take this course
• We'll cover the basics as needed, but won't dive deep into statistics or econometric theory
SETTING EXPECTATIONS

Who this is for:
• Analysts or BI professionals looking to transition into a ML/data science role
• Students looking to develop a deep conceptual understanding of core machine learning topics
• Anyone who wants to understand WHEN, WHY, and HOW to deploy machine learning tools & techniques

Who this is NOT for:
• Senior data professionals looking to master advanced topics in ML/AI
• Students looking for a hands-on coding course or bootcamp (i.e. python/R)
• Anyone who would rather copy and paste code than become fluent in the underlying algorithms
PART 1:
ABOUT THIS SERIES

This is Part 1 of a 4-Part series designed to help you build a deep, foundational understanding of
machine learning, including data QA & profiling, classification, forecasting and unsupervised learning

PART 1: QA & Data Profiling | PART 2: Classification | PART 3: Regression & Forecasting | PART 4: Unsupervised Learning



COURSE OUTLINE

1. ML Intro & Landscape: Machine Learning introduction, definition, process & landscape
2. Preliminary Data QA: Tools to explore data quality (variable types, empty values, range & count calculations, table structure, left/right censored data, etc.)
3. Univariate Profiling: Tools to understand individual variables (distribution, histograms & kernel densities, data profiling metrics, etc.)
4. Multivariate Profiling: Tools to understand multiple variables (kernel densities, violin & box plots, correlation, variance, etc.)

INTRO TO MACHINE LEARNING (ML)

MACHINE LEARNING [ muh-sheen-lur-ning ]

noun

1. The capacity of a computer to process and evaluate data beyond programmed algorithms, through contextualized inference*

Using statistical models to find patterns and make predictions

*Dictionary.com
COMMON ML QUESTIONS

• Which customers are most likely to churn next month?
• What will sales look like for the next 12 months?
• What patterns do we see in terms of product cross-selling?
• How can we use online customer reviews to monitor changes in sentiment?
• When we adjusted tactics last month, did we drive any incremental revenue?
• Which product is customer X most likely to purchase next?
WHEN IS ML THE RIGHT FIT?

[Chart: complexity of analysis (univariate → multivariate → machine learning) plotted against complexity of data]

• Machine Learning is a natural extension of data profiling and basic visual analysis

• ML is required when the underlying data or analysis is too complex for basic data profiling (i.e. visualizing relationships between 3+ variables)

• Machine Learning is ideal for finding optimal solutions that would be impossible or impractical to derive through human trial-and-error
THE MACHINE LEARNING PROCESS

[Diagram: a small tip of "Building models (the 'fun' stuff)" sits on a much larger base of "Data Prep, QA & Profiling (the boring but really, really, REALLY important stuff)"]

• ML models are only as good as the data they are built on ("garbage in, garbage out")

• Data prep & QA happens behind the scenes, but typically accounts for the majority (80%+) of the machine learning workflow

• While it's tempting to start with the "fun" stuff, data prep and QA skills are absolutely critical!
MACHINE LEARNING PROCESS

PREPARING YOUR DATA UNDERSTANDING YOUR DATA MODELING YOUR DATA

QUALITY ASSURANCE

UNIVARIATE PROFILING

MULTIVARIATE PROFILING

MACHINE LEARNING

Quality Assurance (QA) is about preparing & cleaning data prior to analysis. We’ll cover common QA topics
including variable types, empty/missing values, range & count calculations, censored data, etc.
MACHINE LEARNING PROCESS

PREPARING YOUR DATA UNDERSTANDING YOUR DATA MODELING YOUR DATA

QUALITY ASSURANCE

UNIVARIATE PROFILING

MULTIVARIATE PROFILING

MACHINE LEARNING

Univariate profiling is about exploring individual variables to build an understanding of your data. We’ll cover
common topics like normal distributions, frequency tables, histograms, etc.
MACHINE LEARNING PROCESS

PREPARING YOUR DATA UNDERSTANDING YOUR DATA MODELING YOUR DATA

QUALITY ASSURANCE

UNIVARIATE PROFILING

MULTIVARIATE PROFILING

MACHINE LEARNING

Multivariate profiling is about understanding relationships between multiple variables. We’ll cover common
tools for exploring categorical & numerical data, including kernel densities, violin & box plots, scatterplots, etc.
MACHINE LEARNING PROCESS

PREPARING YOUR DATA UNDERSTANDING YOUR DATA MODELING YOUR DATA

QUALITY ASSURANCE

UNIVARIATE PROFILING

MULTIVARIATE PROFILING

MACHINE LEARNING

Machine learning is a natural extension of multivariate profiling, and uses statistical models and methods to
answer questions which are too complex to solve using simple visual analysis or trial-and-error
MACHINE LEARNING LANDSCAPE

MACHINE LEARNING

Supervised Learning
• Classification: K-Nearest Neighbors, Naïve Bayes, Logistic Regression, Sentiment Analysis (plus random forest, support vector machines, gradient boosting, neural nets/deep learning, etc.)
• Regression: Least Squares, Linear Regression, Forecasting, Non-Linear Regression (plus LASSO/RIDGE, state-space, advanced generalized linear methods, VAR, DFA, etc.)

Unsupervised Learning
• Clustering/Segmentation: K-Means
• Outlier Detection
• Markov Chains
• Monte Carlo
• Matrix factorization, principal components, factor analysis, UMAP, T-SNE, topological data analysis, advanced clustering, etc.

Advanced Topics
• Reinforcement Learning (Q-learning, deep RL, multi-armed-bandit, etc.)
• Natural Language Processing (Latent Semantic Analysis, Latent Dirichlet Analysis, relationship extraction, semantic parsing, contextual word embeddings, translation, etc.)
• Computer Vision (Convolutional neural networks, style translation, etc.)
• Deep Learning (Feed Forward, Convolutional, RNN/LSTM, Attention, Deep RL, Autoencoder, GAN, etc.)
MACHINE LEARNING PROCESS

PREPARING YOUR DATA UNDERSTANDING YOUR DATA MODELING YOUR DATA

QUALITY ASSURANCE

UNIVARIATE PROFILING

MULTIVARIATE PROFILING

MACHINE LEARNING

Quality Assurance (QA) is about preparing & cleaning data prior to analysis. We’ll cover common QA topics
including variable types, empty/missing values, range & count calculations, censored data, etc.
PRELIMINARY DATA QA

Data QA (otherwise known as Quality Assurance or Quality Control) is the first step in
the analytics and machine learning process; QA allows you to identify and correct
underlying data issues (blanks, errors, incorrect formats, etc.) prior to analysis

TOPICS WE'LL COVER:
• Variable Types
• Empty Values
• Range Calculations
• Count Calculations
• Table Structures
• Left/Right Censors

COMMON USE CASES:
• Minimizing the risk of drawing false conclusions or misunderstanding your data
• Confirming that dates are properly formatted for analysis (as date values, not text strings)
• Replacing blanks or errors with appropriate values to prevent summarization errors
• Calculating ranges (max & min) to spot-check for outliers or unexpected values
PRELIMINARY DATA QA

POP QUIZ: When should you QA your data?

EVERY. SINGLE. TIME.


(no exceptions!)
WHY IS QA IMPORTANT?

As an analyst, Preliminary Data QA will help you answer questions like:

• Are there any missing or empty values in the data shared by the HR team?
• Was the data our client captured from the online survey encoded properly?
• Is there any risk that the data capture process was biased in some way?
• Are there any outliers that might skew the results of our analysis?
VARIABLE TYPES

Variable types give us information about our variables
• January 1, 2000 as a number simply displays a date
• January 1, 2000 as a date implies a set of information we can use (first day of the year, winter, Saturday, weekend, etc.), all of which can be used to build strong machine learning models

Common variable types include:
• Numeric (i.e. Customer count) | Discrete (i.e. Count) | Date (i.e. 1/1/2020)
• Ordinal (i.e. Small, Medium, Large) | Categorical (i.e. Gender) | Complex (i.e. 4 + 2i)
• Interval (i.e. Temperature) | Nominal (i.e. Nationality) | Logical (i.e. True/False)
• Ratio (i.e. Weight) | Binary (i.e. Yes/No) | Monetary (i.e. $4.50)
VARIABLE TYPES

Understanding variable types is fundamental to the machine learning process, and can help avoid common QA issues, including:
• Numeric variables formatted as characters or strings, preventing proper aggregation or analysis
• String/character variables formatted as numeric, preventing proper text-based operations
• Less obvious cases like variables that are re-coded from raw values to buckets (surveys)

Here zip codes (which will never be analyzed as values) are formatted as numeric rather than string

When we talk about variables, we might also refer to them as metrics, KPIs, dimensions, or columns. Similarly, we may use rows, observations, and records interchangeably
EMPTY VALUES

Investigating empty values, and how they are recorded in your data, is a prerequisite for every single analysis

Empty values can be recorded in many ways (NA, N/A, #N/A, NaN, Null, "-", "Invalid", blank, etc.), but the most common mistake is turning empty numerical values into zeros (0)

[Example: with empty Age values recorded as 0, Average Age = 28.6; with empty values excluded, Average Age = 35.8]
EMPTY VALUES

Empty values can be handled in 3 ways: keep, remove, or impute
• Keep empty or zero values if you are certain that they are accurate and meaningful (i.e. no sales of a product on a specific date)
• Remove empty values if you have a large volume of data and can confirm that there is no pattern or bias to the missing values (i.e. it's not the case that all sales from a specific product category are missing)
• Impute (substitute) empty values if you can accurately populate the data or if you are working with limited data and can use statistical methods (mean, conditional mean, linear interpolation, etc.) without introducing bias

For a missing Units Sold value, you would likely remove the row unless you are certain that an empty value represents 0 sales

For a missing Retail Price, you would likely be able to impute the value since you know the product name/ID
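Below is a minimal pandas sketch (not from the course, using a made-up product table) of what keep, remove, and impute might look like in practice:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Product":      ["A", "B", "C", "D"],
    "Units Sold":   [12, np.nan, 7, 3],
    "Retail Price": [4.50, 9.99, np.nan, 2.25],
})

# Keep: leave NaN as-is (pandas skips NaN in aggregations like .mean())
avg_price_keep = df["Retail Price"].mean()

# Remove: drop rows where Units Sold is missing
df_removed = df.dropna(subset=["Units Sold"])

# Impute: fill the missing Retail Price with the mean price (simple mean imputation)
df_imputed = df.copy()
df_imputed["Retail Price"] = df_imputed["Retail Price"].fillna(df["Retail Price"].mean())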
RANGE CALCULATIONS

One of the simplest QA tools for numerical variables is to calculate the range of values in a column (minimum and maximum values)

Range calculation is a helpful tool to understand your variables, confirm that ranges are realistic, and identify potential outliers

• min(Age) = 18, max(Age) = 65: Is there a clear lower or upper limit (i.e. 18+ or capped at 65)?
• min(Income) = 0, max(Income) = 100: Is your variable transformed to a standard max/min scale (i.e. 1-10, 0-100)?
• min(height) = -10, max(height) = 10: Is your variable normalized around a central value (i.e. 0)?
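A quick range-check sketch with pandas (the column names and values below are assumptions):

import pandas as pd

df = pd.DataFrame({"Age": [18, 34, 65, 41], "Income": [0, 55, 100, 72]})

# Min/max per numeric column is a fast sanity check for unrealistic values or outliers
print(df.agg(["min", "max"]))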
COUNT CALCULATIONS

Count calculations help you understand the number of records or observations that fall within specific categories, and can be used to:
• Identify categories you were not expecting to see
• Begin understanding how your data is distributed
• Gather knowledge for building accurate ML models (more on this later)

Distinct counts can be particularly useful for QA, and help to:
• Understand the granularity or "grain" of your data
• Identify how many unique values a field contains
• Ensure consistency by identifying misspellings or categorization errors which might otherwise be difficult to catch (i.e. leading or trailing spaces)

PRO TIP: For numerical variables with many unique values (i.e. long decimals), use a histogram to plot frequency based on custom ranges or "bins" (more on that soon!)
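A small sketch (made-up categories) of count and distinct-count checks with pandas:

import pandas as pd

df = pd.DataFrame({"Category": ["Camping", "Biking", "Camping", "Camping ", "Hiking"]})

print(df["Category"].value_counts())   # records per category (note the stray "Camping " with a trailing space)
print(df["Category"].nunique())        # number of distinct values; the trailing space inflates this count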
LEFT/RIGHT CENSORED

When data is left or right censored, it means that due to some circumstance the min or max value observed is not the natural minimum or maximum of that metric
• This can be difficult to spot unless you are aware of how the data is being recorded (which means it's a particularly dangerous issue to watch out for!)

Left Censored: Mall Shopper Survey Results
Only tracks shoppers over the age of 18 due to legal reasons, so anyone under 18 is excluded (even though there are plenty of mall shoppers under 18)

Right Censored: Ecommerce Repeat Purchase Rate
The sharp drop as you approach the current date has nothing to do with customer behavior; it simply reflects the fact that recent customers haven't had the opportunity or need to repurchase yet
TABLE STRUCTURE

Table structures generally come in two flavors: long or wide

[Diagram: a Long Table is converted to a Wide Table via PIVOT, and back via UNPIVOT]

Pivoting is the process of adjusting a table from long to wide by transforming rows into columns, and Unpivoting is the opposite (wide to long)
TABLE STRUCTURE

Long tables typically contain a single, distinct column for each field (Date, Product, Category, Quantity, Profit, etc.)
• Easy to see all available fields and variable types
• Great for exploratory data analysis and aggregation (i.e. PivotTables)

Wide tables typically split the same metric into multiple columns or categories (i.e. 2018 Sales, 2019 Sales, 2020 Sales, etc.)
• Typically not ideal for human readability, since wide tables may contain thousands of columns (vs. only a handful if pivoted to a long format)
• Often (but not always) the best format for machine learning model input
• Great format for visualizing categorical data (i.e. sales by product category)

There's no right or wrong table structure; each type has strengths & weaknesses!
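A hedged pandas sketch (made-up sales data) of pivoting long-to-wide and unpivoting back:

import pandas as pd

long_df = pd.DataFrame({
    "Product": ["A", "A", "A", "B", "B", "B"],
    "Year":    [2018, 2019, 2020, 2018, 2019, 2020],
    "Sales":   [100, 120, 90, 80, 95, 110],
})

# PIVOT: long -> wide (one Sales column per year)
wide_df = long_df.pivot(index="Product", columns="Year", values="Sales").reset_index()

# UNPIVOT: wide -> long again
long_again = wide_df.melt(id_vars="Product", var_name="Year", value_name="Sales")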
CASE STUDY: PRELIMINARY QA

THE You’ve just been hired as a Data Analyst for Maven Market, a local grocery
SITUATION store looking for help with basic data management and analysis.

The store manager would like you to conduct some analyses on product
THE inventory and sales, but the data is a mess.
ASSIGNMENT You’ll need to explore the data, conduct a preliminary QA, and help clean it up
to prepare the data for further analysis.

1. Look at common data profiling metrics to identify potential issues


THE
2. Take note of any issues you find in the product inventory sample
OBJECTIVES
3. Correct the issues to prepare the data for further analysis
BEST PRACTICES: PRELIMINARY QA

Review all fields to ensure that variable types are configured for proper
analysis (i.e. no dates formatted as strings, text formatted as values, etc.)

Remember that NA and 0 do not mean the same thing! Think carefully about
how to handle missing data and the impact it may have on your analysis

Run basic diagnostics like Range, Count, and Left/Right Censored checks
against all columns in your data set...every time

Understand your table structure before conducting any analysis to reduce the
risk of double counting, inaccurate calculations, omitted data, etc.
MACHINE LEARNING PROCESS

PREPARING YOUR DATA UNDERSTANDING YOUR DATA MODELING YOUR DATA

QUALITY ASSURANCE

UNIVARIATE PROFILING

MULTIVARIATE PROFILING

MACHINE LEARNING

Univariate profiling is about exploring individual variables to build an understanding of your data. We’ll cover
common topics like normal distributions, frequency tables, histograms, etc.
UNIVARIATE PROFILING

Univariate profiling is the next step after preliminary QA; think of univariate profiling as
conducting a descriptive analysis of each variable by itself

TOPICS WE'LL COVER:
• Categorical Variables
• Categorical Distributions
• Numerical Variables
• Histograms & Kernel Densities
• Normal Distribution
• Data Profiling

COMMON USE CASES:
• Developing a deeper understanding of the fields you're working with (outliers, distribution, etc.)
• Preparing to build ML models (necessary for selecting the correct model type)
• Exploring individual variables before conducting a deeper multivariate analysis
CATEGORICAL VARIABLES

[Variable types: CATEGORICAL vs. NUMERICAL]

Categorical variables contain categories as values, instead of numbers (i.e. Product Type, Customer Name, Country, Month, etc.)
• Categorical fields are exceptionally important for data analysis, and are typically used as dimensions by which we filter or "cut" numerical values (i.e. sales by store)
• Many types of predictive models are used to predict categorical dependent variables, or "classes" (more on that later!)

Terms like discrete, categorical, multinomial, and classes may all be used interchangeably. Binary is a special type of categorical variable which takes only 1 of 2 values: true or false (or 1 or 0), and is also known as a logical variable or a binary flag
DISCRETIZATION

Discretization is the process of creating a new categorical variable from an existing numerical variable, based on the values of the numerical variable

Discretization Rules:
If Price < 100 then Price Level = Low
If Price >= 100 & Price < 500 then Price Level = Med
If Price >= 500 then Price Level = High
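A minimal sketch of the discretization rules above using pandas.cut (the price values are made up):

import pandas as pd

prices = pd.Series([45, 120, 300, 750, 99])

price_level = pd.cut(
    prices,
    bins=[0, 100, 500, float("inf")],   # <100, 100-499, 500+
    labels=["Low", "Med", "High"],
    right=False,                        # left-inclusive, so 100 falls in "Med" and 500 in "High"
)
print(price_level.tolist())             # ['Low', 'Med', 'Med', 'High', 'Low']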
NOMINAL VS. ORDINAL VARIABLES

[Variable types: CATEGORICAL (NOMINAL vs. ORDINAL) vs. NUMERICAL]

There are two types of categorical variables: nominal and ordinal
• Nominal variables contain categories with no inherent logical rank, which can be re-ordered with no consequence (i.e. Product Type = Camping, Biking or Hiking)
• Ordinal variables contain categories with a logical order (i.e. Size = Small, Medium, Large), but the interval between those categories has no logical interpretation
CATEGORICAL DISTRIBUTIONS

Categorical distributions are visual and/or numeric representations of the unique values a variable contains, and how often each occurs

Common categorical distributions include:
• Frequency tables: Show the count (or frequency) of each distinct value
• Proportions tables: Show the count of each value as a % of the total
• Heat maps: Formatted to visualize patterns (typically used for multiple variables)

Understanding categorical distributions will help us gather knowledge for building accurate machine learning models (more on this later!)
CATEGORICAL DISTRIBUTIONS

Section Distribution:
• Frequency table: Camping = 14, Biking = 6
• Proportions table: Camping = 70%, Biking = 30%

Size & Section Distribution (Heat Map):
• S: Camping = 6, Biking = 4
• L: Camping = 8, Biking = 2
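A hedged pandas sketch reproducing the frequency, proportions, and heat-map tables above (the raw data is constructed to match the counts):

import pandas as pd

df = pd.DataFrame({
    "Section": ["Camping"] * 14 + ["Biking"] * 6,
    "Size":    ["S"] * 6 + ["L"] * 8 + ["S"] * 4 + ["L"] * 2,
})

freq  = df["Section"].value_counts()                    # frequency table (14 / 6)
props = df["Section"].value_counts(normalize=True)      # proportions table (70% / 30%)
heat  = pd.crosstab(df["Size"], df["Section"])          # Size x Section counts, ready for heat-map formatting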
NUMERICAL VARIABLES

[Variable types: CATEGORICAL vs. NUMERICAL]

Numerical variables contain numbers as values, instead of categories (i.e. Gross Revenue, Pageviews, Quantity, Retail Price, etc.)
• Numerical fields are typically aggregated (as a sum, count, average, max, min, etc.) and broken down by different dimensions or categories
• Numerical data is also known as quantitative data, while categorical data is often referred to as "qualitative"

You may hear numeric variables described further as interval and ratio, but the distinction is trivial and rarely makes a difference in common use cases
HISTOGRAMS

Histograms are used to plot a single, discretized numerical variable

Imagine taking a numerical variable (like age), defining ranges or "bins" (1-5, 6-10, etc.), and counting the number of observations which fall into each bin; this is exactly what histograms are designed to do!

Age Values: 8, 29, 45, 25, 33, 37, 19, 43, 21, 28, 32, 40, 24, 17, 28, 5, 22, 39, 15, 47, 12

[Histogram: frequency of the age values above by range (0-10, 11-20, 21-30, 31-40, 41-50)]
KERNEL DENSITIES

Kernel densities are "smooth" versions of histograms, which can help to prevent users from over-interpreting breaks between bins
• Technical definition: "Non-parametric density estimation via smoothing"
• Intuitive definition: "Wet noodle laying on a histogram"

[Kernel density curve overlaid on the same age histogram (bins 0-10 through 41-50)]
HISTOGRAMS & KERNEL DENSITIES

When to use histograms and kernel density charts:
• Visualizing how a given variable is distributed
• Providing a visual glimpse of profiling metrics like mean, mode, and skewness

Things to watch out for:
• Bin sensitivity: Bin size can significantly change the shape and "smoothness" of a histogram, so select a bin width that accurately shows the data distribution
• Outliers: Histograms can be used to identify outliers in your data set, but you may need to remove them to avoid skewing the distribution
• Sample size: Histograms are best suited for variables with many observations, to reflect the true population distribution

PRO TIP: If your data is relatively symmetrical (not skewed), you can use Sturges' Rule as a quick "rule of thumb" to determine an appropriate number of bins: K = 1 + 3.322 log(N) (where K = number of bins, N = number of observations)
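A quick sketch of Sturges' Rule applied to the 21 age values listed above:

import math

ages = [8, 29, 45, 25, 33, 37, 19, 43, 21, 28, 32, 40, 24, 17, 28, 5, 22, 39, 15, 47, 12]

k = 1 + 3.322 * math.log10(len(ages))   # ≈ 5.4, so ~5 bins is a reasonable starting point
print(round(k))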
CASE STUDY: HISTOGRAMS

THE You’ve just been promoted as the new Pit Boss at The Lucky Roll Casino.
SITUATION Your mission? Use data to help expose cheats on the casino floor.

Profits at the craps tables have been unusually low, and you’ve been asked to
THE investigate the possible use of loaded dice (weighted towards specific numbers).
ASSIGNMENT Your plan is to track the outcome of each roll, then compare your results
against the expected probability distribution to see how closely they match.

1. Record the outcomes for a series of individual dice rolls


THE
2. Plot the frequency of each result (1-12) using a histogram
OBJECTIVES
3. Compare your plot against the expected frequency distribution
NORMAL DISTRIBUTION

Many numerical variables naturally follow a normal distribution, also known as a Gaussian distribution or "bell curve"
• Normal distributions are symmetrical and dense around the center, with flared out "tails" on both ends (like a bell!)
• Normal distributions are essential to many underlying assumptions in ML, and a helpful tool for comparing distributions or testing differences between them

[Bell curve: symmetrical, most dense around the center]

You can find normal distributions in many real-world examples: heights, weights, test scores, etc.
NORMAL DISTRIBUTION

The normal distribution's probability density function:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$

• The squared term $\left(\frac{x-\mu}{\sigma}\right)^{2}$ is a parabola centered around the mean
• The negative sign in the exponent turns the parabola upside down
• The exponential makes the tails flare out
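A small Python sketch (not from the course) evaluating the density function above:

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = 1/(sigma*sqrt(2*pi)) * exp(-0.5 * ((x - mu)/sigma)**2)
    return (1.0 / (sigma * math.sqrt(2 * math.pi))) * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

print(normal_pdf(0))                   # highest density at the mean (≈ 0.399 for a standard normal)
print(normal_pdf(2), normal_pdf(-2))   # equal values: the curve is symmetrical around the mean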
CASE STUDY: NORMAL DISTRIBUTION

THE It’s August 2016, and you’ve been invited to Rio de Janeiro as a Data Analyst
SITUATION for the Global Olympic Committee.

Your job is to collect demographic data for all female athletes competing in
THE
the Summer Games and determine how the distribution of Olympic athlete
ASSIGNMENT heights compares against the general public.

1. Gather heights for all female athletes competing in the 2016 Games
THE 2. Plot height frequencies using a Histogram, and test various bin widths
OBJECTIVES 3. Determine if athlete heights follow a normal distribution, or “bell curve”
4. Compare the distributions for athletes vs. the general public
DATA PROFILING

Data profiling describes the process of using statistics to describe or summarize information about a particular variable
• Data profiling is critical for understanding variable characteristics which cannot be seen or easily visualized using tools like histograms
• Data profiling communicates meaning, using simple, concise, and universally understood metrics

Common data profiling metrics include:
• Mode
• Mean
• Median
• Percentile
• Variance
• Standard Deviation
• Skewness

PRO TIP: Combining visual distributions with data profiling metrics is a powerful way to communicate meaning in advanced analytics
MODE

The mode is the most frequently observed value
• With a numerical variable (most common), it's the general range around the highest "peak" in the distribution
• With a categorical variable, it's simply the value that appears most often

Examples:
• Mode of City = "Houston"
• Mode of Sessions = 24
• Mode of Gender = F, M (this is a bimodal field!)

Common uses:
• Understanding the most common values within a dataset
• Diagnosing if one variable is influenced by another
MODE

While modes typically aren't very useful on their own, they can provide helpful hints for deeper data exploration
• For example, the right histogram below shows a multi-modal distribution, which indicates that there may be another variable impacting the age distribution

[Two age histograms: one with a single mode (21-30), one with two modes (0-10 and 41-50)]
MEAN

The mean is the calculated "central" value in a discrete set of numbers
• Mean is what most people think of when they hear the word "average", and is calculated by dividing the sum of all values by the count of all observations
• Means can only be applied to numerical variables (not categorical)

$\text{mean} = \frac{\text{sum of all values}}{\text{count of observations}} = \frac{5{,}220}{5} = 1{,}044$

Common uses:
• Making a "best-guess" estimate of a value
• Calculating a central value when outliers are not present
MEDIAN

The median is the middle value in a list of values sorted from highest to lowest (or vice versa)
• When there are two middle-ranked values, the median is the average of the two
• Medians can only be applied to numerical variables (not categorical)

[Example: Median = 19.5 (average of the two middle values, 15 and 24)]

Common uses:
• Identifying the "center" of a distribution
• Calculating a central value when outliers may be present
PERCENTILE

Percentiles are used to describe the percent of values in a column which fall below a particular number
• If you are in the 90th percentile for height, you're taller than 90% of the population
• If the 50th percentile for test scores is 86, half of the students scored lower (which means 86 is also the median!)

[Example: Bob is the 3rd tallest in a group of 12. Since he's taller than 9 others, Bob is in the 75th percentile for height!]

Common uses:
• Providing context and intuitive benchmarks for how values "rank" within a sample (test scores, height/weight, blood pressure, etc.)
VARIANCE

Variance, in simple terms, describes how thin or wide a distribution is
• Variance measures how far the observations are from the mean, on average, and helps us concisely describe a variable's distribution
• The wider a distribution, the higher the variance (and vice versa)

[Three curves of increasing width: Variance = 5, Variance = 15, Variance = 30]

Common uses:
• Comparing the numerical distributions of two different groups (i.e. prices of products ordered online vs. in store)
VARIANCE

$\text{variance} = \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n-1}$

This is the average squared distance from the mean.

Calculation Steps:
1) Calculate the average of the variable
2) Subtract that average from the first row
3) Square that difference
4) Do steps 2-3 for every row and sum it up
5) Divide by the number of observations (-1)
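A minimal sketch (made-up values) that follows the steps above to compute the sample variance:

values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)                        # step 1
squared_diffs = [(x - mean) ** 2 for x in values]       # steps 2-3 for every row
variance = sum(squared_diffs) / (len(values) - 1)       # steps 4-5
print(mean, variance)                                   # 5.0, ~4.57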
STANDARD DEVIATION

Standard Deviation is the square root of the variance
• This converts the variance back to the scale of the variable itself
• In a normal distribution, ~68% of values fall within 1 standard deviation of the mean, ~95% fall within 2, and ~99.7% fall within 3 (known as the "empirical rule")

[Bell curve showing ~68% of values within 1 s.d., ~95% within 2 s.d., ~99.7% within 3 s.d.]

Common uses:
• Comparing segments for a given metric (i.e. time on site for mobile users vs. desktop)
• Understanding how likely certain values are to occur
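A quick numpy sketch (simulated data) checking the empirical rule:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=100, scale=15, size=100_000)     # hypothetical normally distributed metric

mu, sd = x.mean(), x.std(ddof=1)                    # ddof=1 gives the sample standard deviation
print(np.mean(np.abs(x - mu) <= 1 * sd))            # ≈ 0.68
print(np.mean(np.abs(x - mu) <= 2 * sd))            # ≈ 0.95
print(np.mean(np.abs(x - mu) <= 3 * sd))            # ≈ 0.997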
SKEWNESS

Skewness tells us how a distribution varies from a normal distribution
• This is commonly used to mathematically describe skew to the left or right

[Three curves: Left skew, Normal Distribution, Right skew]

Common uses:
• Identifying non-normal distributions, and describing them mathematically
BEST PRACTICES: UNIVARIATE PROFILING

Make sure you are using the appropriate tools for profiling categorical
variables vs. numerical variables

Distributions are a great way to quickly and visually explore variables

QA still comes first! Profiling metrics are important, but can lead to
misleading results without proper QA (i.e. handling outliers or missing values)

One single distribution, visualization, or metric isn't enough to fully understand a variable; always explore your data from multiple angles
MACHINE LEARNING PROCESS

PREPARING YOUR DATA UNDERSTANDING YOUR DATA MODELING YOUR DATA

QUALITY ASSURANCE

UNIVARIATE PROFILING

MULTIVARIATE PROFILING

MACHINE LEARNING

Multivariate profiling is about understanding relationships between multiple variables. We’ll cover common
tools for exploring categorical & numerical data, including kernel densities, violin & box plots, scatterplots, etc.
MULTIVARIATE PROFILING

Multivariate profiling is the next step after univariate profiling, since single-metric
distributions are rarely enough to draw meaningful insights or conclusions

TOPICS WE'LL COVER:
• Categorical-Categorical Distributions
• Categorical-Numerical Distributions
• Multivariate Kernel Densities
• Violin & Box Plots
• Numerical-Numerical Distributions
• Scatter Plots & Correlation

COMMON USE CASES:
• Exploring relationships between multiple categorical and/or numerical variables
• Informing which specific machine learning models or techniques to use

Multivariate distributions are generally called "joint" distributions, and it's clear why: you're joining two variables into the same distribution!
CATEGORICAL-CATEGORICAL DISTRIBUTIONS

Categorical/categorical distributions represent the frequency of unique combinations between two or more categorical variables

This is one of the simplest forms of multivariate profiling, and leverages the same tools we used to analyze univariate distributions:
• Frequency tables: Show the count (or frequency) of each distinct combination
• Proportions tables: Show the count of each combination as a % of the total
• Heat maps: Frequency or proportions table formatted to visualize patterns

Common uses:
• Understanding product mix based on multiple characteristics (i.e. size & category)
• Exploring customer demographics based on attributes like location, gender, etc.
CATEGORICAL-CATEGORICAL DISTRIBUTIONS

In this example we're looking at product inventory using a distribution of two categorical fields: Design and Size

We can show this joint distribution as a simple count (Frequency Table), as a percentage of the total (Proportions Table), or as a conditionally formatted table (Heat Map)

[Example tables: Frequency Table, Proportions Table, Heat Map]
CASE STUDY: HEAT MAPS

THE You’ve just been hired by the New York Department of Transportation
SITUATION (DOT) to help analyze traffic accidents in New York City from 2019-2020

The DOT commissioner would like to understand accident frequency by time of


day and day of week, in order to support a public service campaign promoting
THE safe driving habits.
ASSIGNMENT Your role is to provide the data that she needs to understand when traffic
accidents are most likely to occur.

THE 1. Create a table to plot accident frequency by time of day and day of week
OBJECTIVES 2. Apply conditional formatting to the table to create a heatmap showing the
days and times with the fewest (green) and most (red) accidents in the sample
CATEGORICAL-NUMERICAL DISTRIBUTIONS

Categorical-numerical distributions are used for comparing numerical distributions across classes in a category (i.e. age distribution by gender)

These are typically visualized using variations of familiar univariate numerical distributions, including:
• Histograms & Kernel Densities: Show the count (or frequency) of values
• Violin Plots: Kernel density "glued" to its mirror image, and tilted on its side
• Box Plots: Like a kernel density, but formatted to visualize key statistical values (min/max, median, quartiles) and outliers

Common uses:
• Comparing key business metrics (i.e. customer lifetime value, average order size, purchase frequency, etc.) by customer class (gender, loyalty status, location, etc.)
• Comparing sales performance by day of week or hour of day
KERNEL DENSITIES

Remember: kernel densities are just smooth versions of histograms
• To visualize a categorical-numerical distribution, kernel densities can be repeated to represent each class within a particular category

[Example: three overlapping kernel densities, one per class]
• Teal class has a mean of ~15 and relatively low variance (highly concentrated around the mean)
• Yellow class has a mean of ~20 and moderate variance relative to other categories
• Purple class has a mean of ~25, overlaps with yellow, and has relatively high variance
VIOLIN PLOTS

A violin plot is essentially a kernel density flipped vertically and combined with its mirror image
• Violin plots use the exact same data as kernel densities, just visualized slightly differently
• These can help you visualize the shape of each distribution, and compare them across classes more clearly
BOX PLOTS

Box plots are like violin plots, but designed to show key statistical attributes rather than smooth distributions, including:
• Median
• Min & Max (excluding outliers)
• 25th & 75th Percentiles
• Outliers

Box plots provide a ton of information in a single visual, and can be used to quickly compare statistical characteristics between classes
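A hedged matplotlib sketch (simulated order values by loyalty status; all names and numbers are made up) comparing classes with box plots:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
groups = {
    "Bronze": rng.normal(40, 10, 200),
    "Silver": rng.normal(55, 12, 200),
    "Gold":   rng.normal(70, 15, 200),
}

plt.boxplot(list(groups.values()), labels=list(groups.keys()))   # one box per class
plt.ylabel("Order value ($)")
plt.title("Order value by loyalty status")
plt.show()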
LIMITATIONS OF CATEGORICAL DISTRIBUTIONS

Categorical profiling works for simple cases, but breaks down quickly
• Humans are pretty good at visualizing 1, 2, or maybe even 3 variables, but how would you visualize a joint distribution for 10 variables? 100?

Categorical profiling can't answer prescriptive or predictive questions
• Suppose you randomized several elements on your sales page (font, image, layout, button, copy, etc.) to understand which ones drive conversions
• You could count conversions for individual elements, or some combinations of elements, but categorical distribution alone can't measure causation

This is when you need machine learning!
NUMERICAL-NUMERICAL DISTRIBUTIONS

Numerical-Numerical distributions are common and intuitive, but the most complex mathematically

They are typically visualized using scatter plots, which plot points along the X and Y axis to show the relationship between two variables
• Scatter plots allow for simple, visual intuition: when one variable increases or decreases, how does the other variable change?
• There are many possibilities: no relationship, positive, negative, linear, non-linear, cubic, exponential, etc.

Common uses:
• Quickly visualizing how two numerical variables relate
• Predicting how a change in one variable will impact another (i.e. square footage and house price, marketing spend and sales, etc.)
CORRELATION

Univariate profiling metrics aren't much help in the multivariate world; we now need a way to describe relationships between variables

Correlation is the most common multivariate profiling metric, and is used to describe how a pair of variables are linearly related
• In other words: for a given row, when one variable's observation goes above its mean, does the other variable's observation also go above its mean (and vice versa)?

[Three scatter plots: no correlation, positive correlation, strong positive correlation]
CORRELATION

Correlation is an extension of variance
• Think of correlation as a way to measure the variance of both variables at one time (called "co-variance"), while controlling for the scales of each variable

Variance formula: $\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n-1}$

Correlation formula: $\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x\,s_y}$

• Here we multiply variable X's difference from its mean with variable Y's difference from its mean, instead of squaring a single variable (like we do with variance)
• $s_x$ and $s_y$ are the standard deviations of X and Y, which puts them on the same scale
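A minimal Python sketch implementing the correlation formula above on two made-up lists:

import math

def pearson_corr(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))   # sample std of x
    sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))   # sample std of y
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cov / ((n - 1) * sx * sy)

spend = [10, 20, 30, 40, 50]
sales = [12, 22, 28, 43, 48]
print(pearson_corr(spend, sales))   # ≈ 0.99, a strong positive correlation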
CORRELATION VS. CAUSATION

CORRELATION DOES NOT IMPLY CAUSATION
CORRELATION VS. CAUSATION

[Scatter plot: Drowning Deaths vs. Ice Cream Cones Sold]

Consider the scatter plot above, showing daily ice cream sales and drowning deaths in a popular New England vacation town
• These two variables are clearly correlated, but do ice cream cones CAUSE people to drown? Do drowning deaths CAUSE a surge in ice cream sales?

Of course not, because correlation does NOT imply causation!
So what do you think is really going on here?
PRO TIP: VISUALIZING A THIRD DIMENSION

Scatter plots show two dimensions by default (X and Y), but using symbols or color allows you to visualize additional variables and expose otherwise hidden patterns or trends

Visualizing more than 3-4 dimensions is beyond human capability. This is where you need machine learning!
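A hedged matplotlib sketch (simulated spend/traffic data) showing color as the third dimension on a scatter plot:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
spend   = rng.uniform(0, 100, 200)                       # X: digital media spend
traffic = 50 + 3 * spend + rng.normal(0, 40, 200)        # Y: website traffic
weekend = rng.integers(0, 2, 200)                        # third variable: weekday (0) vs. weekend (1)

plt.scatter(spend, traffic, c=weekend, cmap="coolwarm")  # color exposes the third dimension
plt.xlabel("Digital media spend")
plt.ylabel("Website traffic")
plt.show()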
CASE STUDY: CORRELATION

THE You’ve just landed your dream job as a Marketing Analyst at Loud & Clear, the
SITUATION hottest ad agency in San Diego.

Your client would like to understand the impact of their digital media spend, and
THE how it relates to website traffic, offline spend, site load time, and sales.
ASSIGNMENT Your role is to collect and visualize these metrics at the weekly-level in order to
begin exploring the relationships between them.

1. Gather weekly spend, traffic, website, and sales data


THE 2. Create a scatterplot to visualize any two given variables
OBJECTIVES 3. Compare the relationship between Digital Media Spend and each variable
4. What patterns do you see, and how would you interpret them?
BEST PRACTICES: MULTIVARIATE PROFILING

Univariate profiling is a great start, but multivariate profiling is necessary when working with more than one variable (pretty much always)

Understand the types of variables you're working with (categorical vs. numerical) to determine which profiling and visualization techniques to use

Use categorical variables to filter or "cut" your data and quickly compare profiling metrics or distributions across classes

Remember that correlation does not imply causation, and that variables can be related without one causing a change in the other
PART 2:
ABOUT THIS SERIES

This is Part 2 of a 4-Part series designed to help you build a deep, foundational understanding of
machine learning, including data QA & profiling, classification, forecasting and unsupervised learning

PART 1: QA & Data Profiling | PART 2: Classification | PART 3: Regression & Forecasting | PART 4: Unsupervised Learning



COURSE OUTLINE

1. Intro to Classification: Machine Learning landscape & key concepts, classification process, feature engineering, data splitting, overfitting, etc.
2. Classification Models: K-Nearest Neighbors, Naïve Bayes, Decision Trees, Random Forests, Logistic Regression, Sentiment Analysis
3. Model Selection & Tuning: Model performance, hyperparameter tuning, confusion matrices, imbalanced classes, model drift, etc.

MACHINE LEARNING LANDSCAPE

MACHINE LEARNING

Supervised Learning
• Classification: K-Nearest Neighbors, Naïve Bayes, Decision Trees, Logistic Regression, Sentiment Analysis (plus support vector machines, gradient boosting, neural nets/deep learning, etc.)
• Regression: Least Squares, Linear Regression, Forecasting, Non-Linear Regression (plus LASSO/RIDGE, state-space, advanced generalized linear methods, VAR, DFA, etc.)

Unsupervised Learning
• Clustering/Segmentation: K-Means
• Outlier Detection
• Markov Chains
• Monte Carlo
• Matrix factorization, principal components, factor analysis, UMAP, T-SNE, topological data analysis, advanced clustering, etc.

Advanced Topics
• Reinforcement Learning (Q-learning, deep RL, multi-armed-bandit, etc.)
• Natural Language Processing (Latent Semantic Analysis, Latent Dirichlet Analysis, relationship extraction, semantic parsing, contextual word embeddings, translation, etc.)
• Computer Vision (Convolutional neural networks, style translation, etc.)
• Deep Learning (Feed Forward, Convolutional, RNN/LSTM, Attention, Deep RL, Autoencoder, GAN, etc.)
MACHINE LEARNING LANDSCAPE

MACHINE LEARNING

Supervised Learning: Has a "label"; focused on predicting a label or value

Unsupervised Learning: Does NOT have a "label"; focused on describing or organizing the data in some non-obvious way

A "label" is an observed variable which you are trying to predict (purchase (1/0), sentiment score (positive/negative), product type (A/B/C), etc.)
MACHINE LEARNING LANDSCAPE

MACHINE LEARNING > Supervised Learning

Classification: Used to predict classes for CATEGORICAL variables
CLASSIFICATION EXAMPLES:
• What type of subscribers are most likely to cancel or churn?
• Which segment does this customer fall into?
• Is the sentiment of this review positive, negative or neutral?

Regression: Used to predict values for NUMERICAL variables
REGRESSION EXAMPLES:
• How much revenue are we forecasting for the quarter?
• How many units do we expect to sell next season?
• How will seasonality impact our customer base next year?
RECAP: KEY CONCEPTS
All tables contain rows and columns
• Each row represents an individual record/observation
• Each column represents an individual variable

Variables can be categorical or numerical
• Categorical variables contain classes or categories (used for filtering)
• Numerical variables contain numbers (used for aggregation)

Example (each row is a record of a unique web session; fields like Browser are categorical, while fields like Time (Sec) and Pageviews are numerical):

Session ID | Browser | Time (Sec) | Pageviews | Purchase
1 | Chrome | 354 | 11 | 1
2 | Safari | 94 | 4 | 1
3 | Safari | 36 | 2 | 0
4 | IE | 17 | 1 | 0
5 | Chrome | 229 | 9 | 1

QA and data profiling comes first, every time
• Without quality data, you can't build quality models ("garbage in, garbage out")
• Profiling is the first step towards conditioning (or filtering) on key variables to understand their impact

Machine Learning is needed when visual analysis falls short
• Humans aren't capable of thinking beyond 3+ dimensions, but machines are built for it
• ML is a natural extension of visual analysis and univariate/multivariate profiling
CLASSIFICATION 101

The goal of any classification model is to predict a dependent variable using independent variables

𝒚 Dependent variable (DV)


• This is the variable you’re trying to predict
• The dependent variable is commonly referred to as “Y”, “predicted”, “output”, or “target” variable
• Classification is about understanding how the DV is impacted by, or dependent on, other variables in the model

𝐱 Independent variables (IVs)


• These are the variables which help you predict the dependent variable
• Independent variables are commonly referred to as “X’s”, “predictors”, “features”, “dimensions”, “explanatory variables”, or “covariates”
• Classification is about understanding how the IVs impact, or predict, the DV
CLASSIFICATION 101

EXAMPLE: Using data from a CRM database (sample below) to predict if a customer will churn next month

Churn is our dependent variable, since it's what we want to predict

Cust. ID Gender Status HH Income Age Sign-Up Date Newsletter Churn

1 M Bronze $30,000 29 Jan 17, 2019 1 0

2 F Gold $60,000 37 Mar 18, 2017 1 1

3 F Bronze $15,000 24 Oct 1, 2020 0 0

4 F Silver $75,000 41 Apr 12, 2019 1 0

5 M Bronze $40,000 36 Jul 23, 2020 1 1

6 M Gold $35,000 31 Oct 22, 2017 0 1

7 F Gold $80,000 46 May 2, 2019 0 0

8 F Bronze $20,000 33 Feb 20, 2020 0 0

9 M Silver $100,000 51 Aug 5, 2020 1 1

10 M Silver $45,000 29 Sep 15, 2019 1 0

Gender, Status, HH Income, Age, Sign-Up Date and Newsletter are all
independent variables, since they can help us explain, or predict, Churn
CLASSIFICATION 101

EXAMPLE: Using data from a CRM database (sample below) to predict if a customer will churn next month

Cust. ID Gender Status HH Income Age Sign-Up Date Newsletter Churn

1 M Bronze $30,000 29 Jan 17, 2019 1 0

2 F Gold $60,000 37 Mar 18, 2017 1 1

3 F Bronze $15,000 24 Oct 1, 2020 0 0

4 F Silver $75,000 41 Apr 12, 2019 1 0

5 M Bronze $40,000 36 Jul 23, 2020 1 1
6 M Gold $35,000 31 Oct 22, 2017 0 1

7 F Gold $80,000 46 May 2, 2019 0 0

8 F Bronze $20,000 33 Feb 20, 2020 0 0

9 M Silver $100,000 51 Aug 5, 2020 1 1

10 M Silver $45,000 29 Sep 15, 2019 1 0

We use records with observed values for both independent and dependent variables to "train" our classification model, then apply that model to new, unobserved values containing IVs but no DV:

11 F Bronze $40,000 40 Dec 12, 2020 0 ???

This is what our model will predict!
CLASSIFICATION WORKFLOW

Project Scoping → Preliminary QA → Data Profiling
Remember, these steps ALWAYS come first. Before building a model, you should have a deep understanding of both the project scope (stakeholders, framework, desired outcome, etc.) and the data at hand (variable types, table structure, data quality, profiling metrics, etc.)

Then:
• Feature Engineering: Adding new, calculated variables (or "features") to a data set based on existing fields
• Data Splitting: Splitting records into "Training" and "Test" data sets, to validate accuracy and avoid overfitting
• Model Training: Building classification models from Training data and applying to Test data to maximize prediction accuracy
• Selection & Tuning: Choosing the best performing model for a given prediction, and tuning it to prevent drift over time
FEATURE ENGINEERING

Feature engineering is the process of enriching a data set by creating additional independent variables based on existing fields
• New features can help improve the accuracy and predictive power of your ML models
• Feature engineering is often used to convert fields into "model-friendly" formats; for example, one-hot encoding transforms categorical variables into binary (1/0) fields

Original features Engineered features

Cust. ID Status HH Income Age Sign-Up Date Newsletter Gold Silver Bronze Scaled Income Log Income Age Group Sign-Up Year Priority

1 Bronze $30,000 29 Jan 17, 2019 1 0 0 1 .25 10.3 20-30 2019 1

2 Gold $60,000 37 Mar 18, 2017 1 1 0 0 .75 11.0 30-40 2017 1

3 Bronze $15,000 24 Oct 1, 2020 0 0 0 1 0 9.6 20-30 2020 0

4 Silver $75,000 41 Apr 12, 2019 1 0 1 0 1 11.2 40-50 2019 0

5 Bronze $40,000 36 Jul 23, 2020 1 0 0 1 .416 10.6 30-40 2020 1


FEATURE ENGINEERING TECHNIQUES

Common feature engineering techniques:


Cust. ID Status HH Income Age Sign-Up Date Newsletter Gold Silver Bronze Scaled Income Log Income Age Group Sign-Up Year Priority

1 Bronze $30,000 29 Jan 17, 2019 1 0 0 1 .25 10.3 20-30 2019 1

2 Gold $60,000 37 Mar 18, 2017 1 1 0 0 .75 11.0 30-40 2017 1

3 Bronze $15,000 24 Oct 1, 2020 0 0 0 1 0 9.6 20-30 2020 0

4 Silver $75,000 41 Apr 12, 2019 1 0 1 0 1 11.2 40-50 2019 0

5 Bronze $40,000 36 Jul 23, 2020 1 0 0 1 .416 10.6 30-40 2020 1

• One-Hot Encoding: Splitting categorical fields into separate binary features
• Scaling: Standardizing numerical ranges to common scales (i.e. 1-10, 0-100%)
• Log Transformation: Converting a range of values into a compressed, "less-skewed" distribution
• Discretization: Grouping continuous values into discrete segments or bins
• Extraction: Transforming a date or datetime value into individual components
• Boolean Logic: Using and/or logic to encode "interactions" between variables
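A hedged pandas sketch (columns borrowed from the example table above) of two of these techniques, one-hot encoding and min-max scaling:

import pandas as pd

df = pd.DataFrame({
    "Status":    ["Bronze", "Gold", "Bronze", "Silver"],
    "HH Income": [30_000, 60_000, 15_000, 75_000],
})

# One-hot encoding: Status becomes separate binary (1/0) columns
encoded = pd.get_dummies(df, columns=["Status"], dtype=int)

# Scaling: min-max scale HH Income to a 0-1 range (matches the .25 / .75 / 0 / 1 values above)
income = df["HH Income"]
encoded["Scaled Income"] = (income - income.min()) / (income.max() - income.min())
print(encoded)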
DATA SPLITTING

Splitting is the process of partitioning data into separate sets of records for the
purpose of training and testing machine learning models
• As a rule of thumb, ~70-80% of your data will be used for Training (which is what your model
learns from), and ~20-30% will be reserved for Testing (to validate the model’s accuracy)

[Example: the CRM table from earlier is split, with most customers assigned to the Training data and the remainder reserved as Test data]

Test data is NOT used to optimize models
Using Training data for optimization and Test data for validation ensures that your model can accurately predict both known and unknown values, which helps to prevent overfitting
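A minimal sketch of a ~70/30 split; scikit-learn's train_test_split is one common way to do this (the tiny DataFrame below is made up):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "HH Income": [30, 60, 15, 75, 40, 35, 80, 20, 100, 45],
    "Age":       [29, 37, 24, 41, 36, 31, 46, 33, 51, 29],
    "Churn":     [0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})

X = df[["HH Income", "Age"]]    # independent variables (IVs)
y = df["Churn"]                 # dependent variable (DV / label)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42   # ~70% Training, ~30% Test
)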
OVERFITTING
Splitting is primarily used to avoid overfitting, which is when a model predicts known (Training) data very
well but unknown (Test) data poorly
• Think of overfitting like memorizing the answers to a test instead of actually learning the material; you’ll ace the
test, but lack the ability to generalize and apply your knowledge to unfamiliar questions

OVERFIT model: Models the Training data too well; doesn't generalize well to Test data (high variance, low bias)
WELL-FIT model: Models the Training data just right; generalizes well to Test data (balance of bias & variance)
UNDERFIT model: Doesn't model Training data well enough; doesn't generalize well to Test data (high bias, low variance)
CLASSIFICATION MODELS

In this section we'll introduce common classification models used to predict categorical outcomes, including KNN, naïve Bayes, decision trees, logistic regression, and more

TOPICS WE'LL COVER:
• K-Nearest Neighbors
• Naïve Bayes
• Decision Trees
• Random Forests
• Logistic Regression
• Sentiment Analysis

COMMON USE CASES:
• Predicting if a subscriber will stay or churn
• Predicting which product a customer will purchase
• Predicting tone or sentiment from raw text (product reviews, survey responses, tweets, etc.)
• Predicting if an email is spam, or a bank transaction is fraudulent
K-NEAREST NEIGHBORS (KNN)

K-Nearest Neighbors (KNN) is a classification technique designed to predict outcomes based on similar observed values
• In its simplest form, KNN creates a scatter plot with training data, plots a new unobserved value, and assigns a class (DV) based on the classes of nearby points
• K represents the number of nearby points (or "neighbors") the model will consider when making a prediction

Example use cases:
• Generating product recommendations on a website
• Classifying customers into pre-existing segments
K-NEAREST NEIGHBORS (KNN)

[Scatter plot: training observations plotted by AGE (x-axis, 20-75) and HH INCOME (y-axis,
$10,000-$100,000), with each point labeled as Purchase or No Purchase]

Example predictions for unobserved values:

UNOBSERVED VALUE (Age, HH Income)    K    NEIGHBORS                      PREDICTION
28, $85,000                          10   9 Purchase, 1 No Purchase      PURCHASE (90% Confidence)
44, $40,000                          20   6 Purchase, 14 No Purchase     NO PURCHASE (70% Confidence)
50, $80,000                          6    3 Purchase, 3 No Purchase      ??? (tie)
K-NEAREST NEIGHBORS (KNN)

K-Nearest Neighbors What if there’s a tie?


• In practice, KNN also factors distance between neighbors; if there are multiple
modes, the class with the shortest total distance to the observation wins
Naïve Bayes

How do we figure out the “right” value for K?


Decision Trees
• This is where Machine Learning comes in! The model tests multiple K values to
see which one is most accurate when applied to your Test data
Random Forests

What if we have more than 2 IVs?


Logistic Regression • Scatter plots are great for demonstrating how KNN works for 2 independent
variables, but humans can’t visualize beyond 3 or 4 dimensions
• Computers can easily apply the same logic and calculations to many IVs
Sentiment Analysis
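To make the mechanics concrete, here is a rough Python sketch of the KNN logic described above; the toy training data, scaling factors and function names are hypothetical and not part of the course demo.

```python
import math
from collections import Counter

# Hypothetical training data: (age, hh_income, label)
training = [
    (29, 30000, "No Purchase"), (37, 60000, "Purchase"),
    (24, 15000, "No Purchase"), (41, 75000, "Purchase"),
    (36, 40000, "No Purchase"), (46, 80000, "Purchase"),
]

def knn_predict(age, income, k=3):
    # Scale both IVs to comparable ranges so income doesn't dominate the distance
    def scaled(a, i):
        return (a / 80.0, i / 100000.0)
    query = scaled(age, income)
    dists = [(math.dist(query, scaled(a, i)), label) for a, i, label in training]
    neighbors = [label for _, label in sorted(dists)[:k]]     # the k nearest points
    label, votes = Counter(neighbors).most_common(1)[0]       # most frequent class wins
    return label, votes / k                                   # prediction and confidence

print(knn_predict(28, 85000, k=3))   # ('Purchase', 1.0) with this toy data
```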
CASE STUDY: K-NEAREST NEIGHBORS

THE Congratulations! You just landed your dream job as a Machine Learning
SITUATION Engineer at Spotify, one of the world’s largest music streaming platforms

Your first assignment is to analyze user listening behavior to help improve


THE Spotify’s recommendation engine.
ASSIGNMENT You’ll need to use a KNN model to predict if a user will listen, skip or favorite a
given track, based on various attributes.

1. Collect sample data containing listener behaviors and song attributes


THE
OBJECTIVES 2. Visualize outcomes for a given pair of attributes using a Scatter Plot
3. Calculate the predicted outcome (listen, skip or favorite) and level of
confidence for any unobserved value
*Copyright Maven Analytics, LLC
NAÏVE BAYES

Naïve Bayes is a classification technique which uses conditional


K-Nearest Neighbors
probabilities to predict multi-class or binary outcomes
• Naïve Bayes essentially creates frequency tables for each combination of variables,
Naïve Bayes then calculates the conditional probability of each IV value for a given outcome

• For new observations, the model looks at the probability that each IV value would
Decision Trees be observed given each outcome, and compares the results to make a prediction

Random Forests Example use cases:


• Predicting purchase probability for prospects in a marketing database
Logistic Regression • Credit risk scoring in banking or financial industries

Sentiment Analysis
NAÏVE BAYES

Each record represents a customer. Subscribed to Newsletter, Followed FB Page and Visited Website
are the independent variables (IVs) which will help us make a prediction; Purchase? is the
dependent variable (DV) we are predicting.

Cust. ID   Subscribed to Newsletter   Followed FB Page   Visited Website   Purchase?
    1                 0                       1                 0              1
    2                 1                       0                 1              0
    3                 1                       1                 0              1
    4                 0                       0                 0              0      Observed values,
    5                 1                       0                 0              0      which we use to
    6                 1                       1                 1              1      train the model
    7                 1                       0                 1              0
    8                 0                       1                 1              1
    9                 1                       0                 1              1
   10                 1                       1                 0              0

   11                 0                       1                 1             ???     Unobserved value,
                                                                                      which our model
                                                                                      will predict
NAÏVE BAYES

Frequency tables between each IV and the DV (counts of customers for each combination):

               PURCHASE = 1    PURCHASE = 0
NEWS = 1             3               4
NEWS = 0             2               1

FB   = 1             4               1
FB   = 0             1               4

SITE = 1             3               2
SITE = 0             2               3

Overall:            50%             50%
NAÏVE BAYES

Frequency tables give us the conditional probability of each outcome
• For example, P (News | Purchase) tells us the probability that a customer
subscribed to the newsletter, given (or conditioned on) the fact that they purchased

Probability of each independent variable, given that a purchase was made:
P (News | Purchase)        60%        P (No News | Purchase)        40%
P (FB | Purchase)          80%        P (No FB | Purchase)          20%
P (Site | Purchase)        60%        P (No Site | Purchase)        40%

Probability of each independent variable, given that a purchase was NOT made:
P (News | No Purchase)     80%        P (No News | No Purchase)     20%
P (FB | No Purchase)       20%        P (No FB | No Purchase)       80%
P (Site | No Purchase)     40%        P (No Site | No Purchase)     60%

Overall probability of purchase:
P (Purchase)               50%        P (No Purchase)               50%
NAÏVE BAYES

Unobserved value:   NEWS = 0    FB = 1    SITE = 1

Prob. given Purchase = 1                        Prob. given Purchase = 0
P (No News | Purchase)       40%                P (No News | No Purchase)      20%
x P (FB | Purchase)          80%                x P (FB | No Purchase)         20%
x P (Site | Purchase)        60%                x P (Site | No Purchase)       40%
x P (Purchase)               50%                x P (No Purchase)              50%
=                            9.6%               =                              0.8%

Overall Purchase Probability:   9.6% / (9.6% + 0.8%)  =  92.3%   →   PURCHASE PREDICTION
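Here is a small Python sketch of the same calculation, using the hypothetical 1/0 customer rows from the table above; the row tuples and function names are illustrative only.

```python
rows = [  # (news, fb, site, purchase) -- the 10 observed customers above
    (0,1,0,1),(1,0,1,0),(1,1,0,1),(0,0,0,0),(1,0,0,0),
    (1,1,1,1),(1,0,1,0),(0,1,1,1),(1,0,1,1),(1,1,0,0),
]

def cond_prob(iv_index, iv_value, outcome):
    # P(IV = iv_value | Purchase = outcome), estimated from frequency counts
    matching_outcome = [r for r in rows if r[3] == outcome]
    hits = [r for r in matching_outcome if r[iv_index] == iv_value]
    return len(hits) / len(matching_outcome)

def purchase_probability(news, fb, site):
    score = {}
    for outcome in (1, 0):
        prior = sum(1 for r in rows if r[3] == outcome) / len(rows)   # P(Purchase) / P(No Purchase)
        score[outcome] = (cond_prob(0, news, outcome) *
                          cond_prob(1, fb, outcome) *
                          cond_prob(2, site, outcome) * prior)
    return score[1] / (score[1] + score[0])          # relative probability of a purchase

print(round(purchase_probability(news=0, fb=1, site=1), 3))   # 0.923
```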
NAÏVE BAYES

K-Nearest Neighbors
Who has time to work through all this math?

Naïve Bayes • No one! These probabilities are all calculated automatically by the model
(no manual effort required!)

Decision Trees
Doesn’t multiplying many probabilities lead to very small values?
Random Forests • Yes, which is why the relative probability is so important; even so, Naïve
Bayes can struggle with a very large number of IVs

Logistic Regression

Sentiment Analysis
CASE STUDY: NAÏVE BAYES

THE You’ve just been promoted to Marketing Manager at Cat Slacks, a global retail
SITUATION powerhouse specializing in high-quality pants for cats

Your boss wants to understand the impact of key customer interactions on


THE purchase behavior, including subscribing to the newsletter, following the
ASSIGNMENT Facebook page, and visiting the Cat Slacks website

1. Collect sample data containing customer interactions and purchase events


THE
OBJECTIVES 2. Compare the purchase probability for specific interactions
3. Calculate the overall purchase probability for any given set of interactions
*Copyright Maven Analytics, LLC
DECISION TREES

Decision Trees use a series of binary (true/false) rules to predict multi-


K-Nearest Neighbors
class or binary outcomes
• For each rule, a decision tree selects a single IV and uses it to split observations
Naïve Bayes into two groups, where each group (ideally) skews towards one class

• The goal is to determine which rules and IVs do the best job splitting up the
Decision Trees classes, which we can measure using a metric known as entropy

• This is NOT a manual process; decision trees test many variables and select rules
based on the change in entropy after the split (known as information gain)
Random Forests

Logistic Regression Example use cases:


• Predicting if a customer will cancel/churn next month
• Identifying fraudulent bank transactions or insurance claims
Sentiment Analysis
DECISION TREES
Dependent variable: CHURN
Q: Will this subscriber churn next month?
Independent Variables
K-Nearest Neighbors
Age

Gender
10 YES 10 NO

Naïve Bayes HH Income


Gender
Sign-Up Date
M F

Decision Trees Last Log-In

Lifetime Value

4 YES 4 NO 6 YES 6 NO
Random Forests

Splitting on Gender isn’t effective, since our classes are still evenly distributed after the split
Logistic Regression
(in other words, Gender isn’t a good predictor for churn)

Sentiment Analysis
DECISION TREES
Dependent variable: CHURN
Q: Will this subscriber churn next month?
Independent Variables
K-Nearest Neighbors
Age

Gender
10 YES 10 NO

Naïve Bayes HH Income


Last Log-In
Sign-Up Date
<7 >7
days days

Decision Trees Last Log-In

Lifetime Value

4 YES 8 NO 7 YES 1 NO
Random Forests

How do we know which independent variables to split?


Logistic Regression • Entropy measures the “evenness” or “uncertainty” between classes, and can be
compared before/after each split to measure information gain
• This is where Machine Learning comes in; the model tries all IVs to determine
Sentiment Analysis
which splits reduce entropy the most
DECISION TREES

P1 = Probability of Class 1 (Count of Class 1 / N)
P2 = Probability of Class 2 (Count of Class 2 / N)

ENTROPY = -P1 * log2(P1) - P2 * log2(P2)

[Entropy curve: plotting entropy (y-axis) against P1 (x-axis) produces a curve between 0 and 1]
• Entropy = 1 (max) when there is a 50/50 split between classes
• Entropy = 0 (min) when all observations belong to a single class (all Class 1 with P2 = 0, or all Class 2 with P1 = 0)
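As a quick illustration, here is a Python sketch of the entropy formula applied to the churn example that follows (10 YES / 10 NO, split on Last Log-In); the slides report the per-leaf entropy values, while the weighted information gain shown here is the standard way a model compares candidate splits.

```python
import math

def entropy(yes, no):
    total = yes + no
    result = 0.0
    for count in (yes, no):
        p = count / total
        if p > 0:                       # 0 * log2(0) is treated as 0
            result -= p * math.log2(p)
    return result

parent = entropy(10, 10)                        # 1.0 (50/50 split between classes)
left, right = entropy(4, 8), entropy(7, 1)      # ~0.92 and ~0.54 after the split
# Information gain = parent entropy minus the weighted average child entropy
gain = parent - (12/20 * left + 8/20 * right)
print(round(parent, 2), round(left, 2), round(right, 2), round(gain, 2))   # 1.0 0.92 0.54 0.23
```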
DECISION TREES
Dependent variable: CHURN
Q: Will this subscriber churn next month?
Independent Variables
K-Nearest Neighbors
Age
Entropy = 1
Gender
10 YES 10 NO

Naïve Bayes HH Income


Last Log-In
Sign-Up Date
<7 >7
days days

Decision Trees Last Log-In

Lifetime Value
Entropy = .91 Entropy = .54
(- 0.09) (- 0.46)
4 YES 8 NO 7 YES 1 NO
Random Forests

The reduction in entropy after the split tells us that we’re gaining information, and
Logistic Regression teasing out differences between those who churn and those who do not

Sentiment Analysis
DECISION TREES
Dependent variable: CHURN
Q: Will this subscriber churn next month?
Independent Variables
K-Nearest Neighbors
Age
Entropy = 1
Gender
10 YES 10 NO

Naïve Bayes HH Income


Last Log-In
Sign-Up Date
<7 >7
days days

Decision Trees Last Log-In

Lifetime Value
Entropy = .91 Entropy = .54
(- 0.09) (- 0.46)
4 YES 8 NO 7 YES 1 NO
Random Forests
Lifetime Value Age

<$50 >$50 <35 >35


Logistic Regression

Sentiment Analysis 2 YES 3 NO 2 YES 5 NO 0 YES 1 NO 7 YES 0 NO

Entropy = .97 Entropy = .86 Entropy = 0 Entropy = 0


(+ 0.06) (- 0.05) (- 0.54) (- 0.54)
DECISION TREES

Root node:  10 YES | 10 NO
  Split on Last Log-In:
    < 7 days  →  4 YES | 8 NO   (decision node)
        Split on Lifetime Value:
          < $50  →  2 YES | 3 NO   (decision node)
              Split on Sign-Up Date:
                < 60 days  →  1 YES | 0 NO   (leaf node)
                > 60 days  →  1 YES | 3 NO   (leaf node)
          > $50  →  2 YES | 5 NO   (leaf node)
    > 7 days  →  7 YES | 1 NO   (decision node)
        Split on Age:
          < 35  →  0 YES | 1 NO   (leaf node)
          > 35  →  7 YES | 0 NO   (leaf node)

UNOBSERVED VALUE                                                      PREDICTION
52-year-old customer who last logged in 21 days ago                   CHURN
Customer who logged in 3 days ago, has an LTV of $28,
and signed up 90 days ago                                             NO CHURN
Customer who logged in 5 days ago and has an LTV of $70               NO CHURN
DECISION TREES

K-Nearest Neighbors How do we know when to stop splitting?


• Splitting too much can lead to overfitting, so you can adjust model inputs
(known as “hyperparameters”) to limit things like tree depth or leaf size
Naïve Bayes

Decision Trees How do we know where to split numerical IVs?


• In practice, decision trees will test different numerical splits and optimize based
on information gain (just like testing individual IV splits)
Random Forests

Logistic Regression Does the best first split always lead to the most accurate model?
• Not necessarily! That’s why we often use a collection of multiple decision trees,
known as a random forest, to maximize accuracy
Sentiment Analysis
RANDOM FORESTS

A Random Forest is a collection of individual decision trees, each built
using a random subset of observations
• Each decision tree randomly selects variables to evaluate at each split
• Each tree produces a prediction, and the mode, or most frequent prediction, wins

At each split (SPLIT 1, SPLIT 2 ... SPLIT N), every tree considers a random subset of the
available IVs (Age, Gender, Income, Sign-Up, Log-In, LTV, etc.)

TREE 1 PREDICTION: CHURN
TREE 2 PREDICTION: NO CHURN        →     FINAL PREDICTION: CHURN
TREE 3 PREDICTION: CHURN                 (the most frequent prediction wins)
TREE N PREDICTION: CHURN
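For reference, here is a quick sketch of training a random forest with the scikit-learn library (not used in this course, and assumed to be installed); the toy rows and hyperparameter values are illustrative only, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[21, 30, 1], [3, 28, 0], [5, 70, 0], [40, 15, 1]]   # [days since log-in, LTV, age > 35]
y = [1, 0, 0, 1]                                         # churn (1) / no churn (0)

# 100 trees, each trained on a random sample of observations and features
model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
model.fit(X, y)

print(model.predict([[21, 30, 1]]))   # each tree votes; the most frequent class wins
```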


CASE STUDY: DECISION TREES

THE You are the founder and CEO of Trip Genie, an online subscription service
SITUATION designed to connect global travelers with local guides.

You’d like to better understand your customers and identify which types of
THE behaviors can be used to help predict paid subscriptions.
ASSIGNMENT Your goal is to build a decision tree to help you predict subscriptions based on
multiple factors (newsletter, Facebook follow, time on site, sessions, etc.).

1. Collect Training data at the customer-level


THE
OBJECTIVES 2. Split on each independent variable and compare changes in entropy
3. Build a decision tree to best predict subscriptions based on available IVs
*Copyright Maven Analytics, LLC
LOGISTIC REGRESSION

Logistic Regression is a classification technique used to predict the


K-Nearest Neighbors
probability of a binary (true/false) outcome
• In its simplest form, logistic regression forms an S-shaped curve between 0 - 1,
Naïve Bayes which represents the probability of a TRUE outcome for any given value of X

• The likelihood function measures how accurately a model predicts outcomes, and
Decision Trees is used to optimize the “shape” of the curve

• Although it has the word “regression” in its name, logistic regression is not used for
predicting numeric variables
Random Forests

Example use cases:


Logistic Regression
• Flagging spam emails or fraudulent credit card transactions
• Determining whether to serve a particular ad to a website visitor
Sentiment Analysis
LOGISTIC REGRESSION

TRUE (1)
K-Nearest Neighbors

Naïve Bayes

0.5

Decision Trees

Random Forests
FALSE (0)
X1

Logistic Regression

• Each dot represents an observed value, where X is a numerical independent


variable and Y is the binary outcome (true/false) that we want to predict
Sentiment Analysis
LOGISTIC REGRESSION

TRUE (1)
K-Nearest Neighbors

PROBABILITY
Naïve Bayes

0.5

Decision Trees

Random Forests
FALSE (0)
X1

Logistic Regression

• Logistic regression plots the best-fitting curve between 0 and 1, which tells us the
probability of Y being TRUE for any given value of X1
Sentiment Analysis
LOGISTIC REGRESSION

SPAM
K-Nearest Neighbors

PROBABILITY
Naïve Bayes

0.5

Decision Trees

Random Forests
NOT SPAM (0)
# RECIPIENTS
5 10 15 20 25

Logistic Regression

• Here we’re using logistic regression to predict if an email will be marked as spam,
based on the number of email recipients (X1)
Sentiment Analysis
LOGISTIC REGRESSION

SPAM
K-Nearest Neighbors
P = 95%
Prediction: SPAM

PROBABILITY
Naïve Bayes

0.5

Decision Trees P = 50%


Prediction: ???

Random Forests P = 1%
Prediction: NOT SPAM

NOT SPAM (0)
# RECIPIENTS
5 10 15 20 25

Logistic Regression

• Using this model, we can test unobserved values of X1 to predict the probability that
Y is true or false (in this case the probability that an email is marked as spam)
Sentiment Analysis
LOGISTIC REGRESSION

SPAM
K-Nearest Neighbors
0.9

In this case the risk of categorizing a real email as

PROBABILITY
Naïve Bayes spam is high, so our decision point may be >50%

By increasing the threshold to 90%, we:


0.5
1. Correctly predict more of the legit (not
Decision Trees spam) emails, which is really important
2. Incorrectly mark a few spam emails as
“not spam”, which is not a big deal
Random Forests
NOT SPAM (0)
# RECIPIENTS
5 10 15 20 25

Logistic Regression
Is 50% always the right decision point for logistic models?
Sentiment Analysis • No. It depends on the relative risk of a false positive (incorrectly predicting a TRUE
outcome) or false negative (incorrectly predicting a FALSE outcome)
LOGISTIC REGRESSION

QUIT
K-Nearest Neighbors

In this case the risk of a false positive is

PROBABILITY
Naïve Bayes low, so our decision point may be <50%

By decreasing the threshold to 10%, we:


0.5

Decision Trees 1. Correctly predict more cases where someone


is likely to quit, which is really important
2. Incorrectly flag some employees who plan to
stay, which is not a big deal
Random Forests 0.1

STAY (0)
                  25%        50%        75%        100%      (% NEGATIVE FEEDBACK)
Logistic Regression
• Now consider a case where we’re predicting if an employee will quit based on
negative feedback from HR
Sentiment Analysis • It’s easier to train an employee than hire a new one, so the risk of a false positive is
low but the risk of a false negative (incorrectly predicting someone will stay) is high
LOGISTIC REGRESSION

Logistic function:   1 / (1 + e^-(β0 + β1x1))

• The 1 / (1 + e^-z) form makes the output fall between 0 and 1
• The exponent contains a linear equation, where β0 is the intercept, x1 is the IV value
  and β1 is the weight (or slope)

[Plot: S-shaped curves for β1 = 0.5, β1 = 1.0 and β1 = 5.0 — larger weights make the curve
steeper around the 0.5 midpoint]
LOGISTIC REGRESSION

Logistic function:   1 / (1 + e^-(β0 + β1x1))

[Plot: S-shaped curves for β0 = 0, β0 = 2 and β0 = -2 — changing the intercept shifts the
curve left or right along x1]
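A small Python sketch of the logistic (sigmoid) function above, showing how the weight b1 and intercept b0 reshape and shift the S-curve; the sample x values are arbitrary.

```python
import math

def logistic(x, b0=0.0, b1=1.0):
    # 1 / (1 + e^-(b0 + b1*x)) always falls between 0 and 1
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

for x in (-4, -2, 0, 2, 4):
    print(x,
          round(logistic(x, b0=0, b1=1), 2),    # baseline curve
          round(logistic(x, b0=2, b1=1), 2),    # intercept shifts the curve left/right
          round(logistic(x, b0=0, b1=5), 2))    # larger weight makes the curve steeper
```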
LOGISTIC REGRESSION

How do we determine the "right" values for β? LIKELIHOOD!
• Likelihood is a metric that tells us how good our model is at correctly predicting Y,
based on the shape of the curve
• This is where machine learning comes in; instead of human trial-and-error, an
algorithm determines the best weights to maximize likelihood using observed values

                           Actual Observation
                           Y = 1       Y = 0
Model Output    ~1         HIGH        LOW        • When our model output is close to the actual Y,
                ~0         LOW         HIGH         we want likelihood to be HIGH (near 1)
                                                   • When our model output is far from the actual Y,
                                                     we want likelihood to be LOW (near 0)

LIKELIHOOD FUNCTION:   likelihood = (output)^y * (1 - output)^(1-y)

Worked examples:
Output = .99, actual y = 1:   (.99)^1 * (1 - .99)^(1-1)  =  .99 * 1  =  .99    (high likelihood)
Output = .01, actual y = 0:   (.01)^0 * (1 - .01)^(1-0)  =  1 * .99  =  .99    (high likelihood)
Output = .01, actual y = 1:   (.01)^1 * (1 - .01)^(1-1)  =  .01 * 1  =  .01    (low likelihood)
Output = .99, actual y = 0:   (.99)^0 * (1 - .99)^(1-0)  =  1 * .01  =  .01    (low likelihood)
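Here is a sketch of the likelihood calculation above, plus the total (log-)likelihood a logistic model would try to maximize over a set of hypothetical observations; the observation list is made up for illustration.

```python
import math

def likelihood(output, y):
    # output = model's predicted probability that y = 1; y = the actual outcome (1 or 0)
    return output**y * (1 - output)**(1 - y)

print(round(likelihood(0.99, 1), 2), round(likelihood(0.01, 0), 2))   # 0.99 and 0.99 (good predictions)
print(round(likelihood(0.01, 1), 2), round(likelihood(0.99, 0), 2))   # 0.01 and 0.01 (bad predictions)

# Multiplying many likelihoods produces tiny numbers, so optimizers usually
# maximize the sum of log-likelihoods instead (same optimum, easier math)
observations = [(0.9, 1), (0.8, 1), (0.2, 0), (0.3, 0)]   # (model output, actual y)
total_log_likelihood = sum(math.log(likelihood(p, y)) for p, y in observations)
print(round(total_log_likelihood, 3))                     # -0.908 with these toy values
```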


LOGISTIC REGRESSION

High likelihood
K-Nearest Neighbors SPAM

Naïve Bayes

PROBABILITY
Low likelihood

Low likelihood
Decision Trees

Random Forests NOT SPAM


5 10 15 20 25

High likelihood # RECIPIENTS


Logistic Regression

• Observations closest to the curve have the highest likelihood values (and vice versa),
Sentiment Analysis so maximizing total likelihood allows us to find the curve that fits our data best
LOGISTIC REGRESSION

But what if multiple X variables could help predict Y?


K-Nearest Neighbors

SPAM

Naïve Bayes

PROBABILITY
Decision Trees

Random Forests

NOT SPAM
5 10 15 20 25
Logistic Regression
# RECIPIENTS

• # of Recipients can help us detect spam, but so can other variables like the number
Sentiment Analysis
of typos, count of words like “free” or “bonus”, sender reputation score, etc.
LOGISTIC REGRESSION

But what if multiple X variables could help predict Y?


K-Nearest Neighbors

Naïve Bayes

Decision Trees

Random Forests

Logistic Regression

• Logistic regression can handle multiple independent variables, but the visual
Sentiment Analysis interpretation breaks down at >2 IV’s (this is why we need machine learning!)
LOGISTIC REGRESSION

But what if multiple X variables could help predict Y?


K-Nearest Neighbors

Naïve Bayes
The logistic function extends naturally to multiple weighted independent variables (x1, x2...xn),
and the 1 / (1 + e^-z) form still keeps the output between 0 and 1:

    1 / (1 + e^-(β0 + β1x1 + β2x2 + ... + βnxn))

Logistic Regression
• Logistic regression is about finding the best combination of weights (𝛽1, 𝛽2...𝛽𝑛) for
a given set of independent variables (x1, x2...x𝑛) to maximize the likelihood function
Sentiment Analysis
CASE STUDY: LOGISTIC REGRESSION

THE You’ve just been promoted to Marketing Manager for Lux Dining, a wildly
SITUATION popular international food blog.

The CMO is concerned about unsubscribe rates and thinks it may be related
THE to the frequency of emails your team has been sending.
ASSIGNMENT Your job is to use logistic regression to plot this relationship and predict if a
user will unsubscribe based on the number of weekly emails received.

1. Collect customer data containing email frequency and subscription activity


THE 2. Plot the logistic regression curve which maximizes likelihood
OBJECTIVES 3. Determine the appropriate email frequency (based on a 50% decision point)
*Copyright Maven Analytics, LLC
SENTIMENT ANALYSIS

Sentiment analysis is a technique used to determine the emotional


K-Nearest Neighbors
tone or “sentiment” behind text-based data

• Sentiment analysis often falls under Natural Language Processing (NLP), but is
Naïve Bayes typically applied as a classification technique

• Sentiment models typically use a “bag of words” approach, which involves


Decision Trees calculating the frequency or presence (1/0) of key words to convert text
(which models struggle with) into numerical inputs

• Unlike other classification models, you must ”hand-score” the sentiment (DV)
Random Forests
values for your Training data, which your model will learn from

Logistic Regression
Example use cases:
• Understanding the tone of product reviews posted by customers
Sentiment Analysis • Analyzing open-ended survey responses
SENTIMENT ANALYSIS

K-Nearest Neighbors

This word cloud is *technically* a very


simple version of sentiment analysis
Naïve Bayes

Limitations:
Decision Trees
• Based entirely on word count
• Straight-forward but not flexible

Random Forests • Requires manual interpretation


• Subject to inconsistency/human error

Logistic Regression

Sentiment Analysis
SENTIMENT ANALYSIS

The first step in any sentiment analysis is to clean and QA the text to
remove noise and isolate the most meaningful information:
K-Nearest Neighbors
• Remove punctuation, capitalization and special characters
• Correct spelling and grammatical errors
Naïve Bayes • Use proper encoding (i.e. UTF-8)
• Lemmatize or stem (remove grammar tense, convert to “root” term)
Decision Trees • Remove stop words (“a”, “the”, “or”, “of”, “are”, etc.)

The computer is running hot because I’m mining bitcoin!


Random Forests
the computer is running hot because I’m mining bitcoin!

Logistic Regression computer run hot mine bitcoin

NOTE: This process can vary based on the context; for example, you may want to
preserve capitalization or punctuation if you care about measuring intensity (i.e. "GREAT!!"
vs. “great”), or choose to allow specific stop words or special characters
SENTIMENT ANALYSIS

Once the text has been cleaned, we can transform our text into
K-Nearest Neighbors numeric data using a “bag of words” approach:
• Split cleaned text into individual words (this is known as tokenization)
Naïve Bayes
• Create a new column with a binary flag (1/0) for each word

• Manually assign sentiment for observations in your Training data


Decision Trees
• Apply any classification technique to predict sentiment for unobserved text

Each word is an independent variable; Sentiment is our dependent variable

Text                      hate   garbage   love   book   sentiment
I HATE this garbage!       1        1        0      0    negative      (observed value)
I love this book.          0        0        1      1    positive      (observed value)
This book is garbage       0        1        0      1    ???           (unobserved value)
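Here is a short sketch of the "bag of words" transformation above: flag the presence (1/0) of each vocabulary word so any classification technique can use the text as numeric IVs; the vocabulary and hand-scored labels mirror the toy table.

```python
VOCAB = ["hate", "garbage", "love", "book"]

def bag_of_words(cleaned_tokens):
    # binary flag (1/0) for each vocabulary word
    return [1 if word in cleaned_tokens else 0 for word in VOCAB]

training = [
    (["hate", "garbage"], "negative"),     # "I HATE this garbage!" (hand-scored)
    (["love", "book"], "positive"),        # "I love this book."    (hand-scored)
]
X = [bag_of_words(tokens) for tokens, _ in training]   # [[1, 1, 0, 0], [0, 0, 1, 1]]
y = [label for _, label in training]

print(bag_of_words(["book", "garbage"]))   # unobserved review -> [0, 1, 0, 1]
```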
SENTIMENT ANALYSIS

K-Nearest Neighbors How do you address language nuances like ambiguity, double
negatives, slang or sarcasm?

Naïve Bayes • Generally speaking, the more observations you score in your Training data, the
better your model will be at detecting these types of nuances
• No sentiment model is perfect, but feature engineering and advanced
Decision Trees techniques can help improve accuracy for more complex cases

Random Forests “If you like watching paint If you like watching paint dry like watch paint
dry, you’ll love this movie!” you’ll love this movie dry love movie

Logistic Regression
“The new version is awfully The new version is awfully new version awful
good, not as bad as expected!” good not as bad as expected good bad expect
Sentiment Analysis
CASE STUDY: SENTIMENT ANALYSIS

THE You’re an accomplished author and creator of the hit series Bark Twain the Data
SITUATION Dog, featuring a feisty chihuahua who uses machine learning to solve crimes.

Reviews for the latest book in the Bark Twain series are coming in, and they
THE aren’t looking great...
ASSIGNMENT To apply a bit more rigor to your analysis and automate scoring for future
reviews, you’ve decided to use this feedback to build a basic sentiment model.

1. Collect raw text reviews and tokenize key words


THE 2. Hand-score the sentiment for observed values in the Training data
OBJECTIVES 3. Apply any classification technique and validate using Test data
*Copyright Maven Analytics, LLC
MODEL SELECTION & TUNING

In this section we’ll discuss techniques for selecting & tuning classification models,
including hyperparameter optimization, class balancing, confusion matrices and more

TOPICS WE’LL COVER: COMMON USE CASES:

• Adjusting model parameters to optimize performance


Hyperparameters Imbalanced Classes
• Rebalancing classes to reduce bias
• Comparing performance across multiple classification
Confusion Matrix Selection & Drift
models to select the best performing option
• Retraining models with fresh data to minimize drift
HYPERPARAMETERS

Hyperparameters are inputs/settings you can control while training a model


• Adjusting parameters to maximize accuracy is known as "hyperparameter optimization"

Example hyperparameters by model:
K-Nearest Neighbors:    K, distance definition, NN algorithm
Naïve Bayes:            smoothing parameters
Decision Trees:         criterion (entropy, gini), tree depth, leaf size, max features, min gain
Random Forests:         # trees, # samples, # features, re-sampling
Logistic Regression:    intercept, penalty term, solver
IMBALANCED CLASSES

Imbalanced classes occur when you are predicting outcomes with


Hyperparameters
drastically different frequencies in the Training data

• The class or outcome which occurs more frequently is known as the majority
Imbalanced Classes
class, while the class which occurs less frequently is the minority class

• Imbalanced classes can bias a model towards always predicting the majority class,
Confusion Matrix since it often yields the best overall accuracy (i.e. 99%+)

• This is a significant issue when predicting very rare and very important events, like
Model Selection a nuclear meltdown

Model Drift
There are several ways to balance classes in your Training data,
including up-sampling, down-sampling and weighting
IMBALANCED CLASSES

Up-sampling: minority class observations are duplicated to balance the data

Down-sampling: majority class observations are randomly removed to balance the data

Weighting: for models that randomly sample observations (i.e. random forests),
increase the probability of selecting the minority class
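Here is a small sketch of up-sampling and down-sampling a training set, assuming a simple list of (features, label) records with a rare positive class; the 95/5 split and record structure are hypothetical.

```python
import random

random.seed(0)
records = [([i], 0) for i in range(95)] + [([i], 1) for i in range(5)]   # 95/5 imbalance

minority = [r for r in records if r[1] == 1]
majority = [r for r in records if r[1] == 0]

# Up-sampling: duplicate minority records (sampling with replacement) until balanced
up_sampled = majority + [random.choice(minority) for _ in range(len(majority))]

# Down-sampling: randomly keep only as many majority records as there are minority records
down_sampled = minority + random.sample(majority, len(minority))

print(len(up_sampled), len(down_sampled))   # 190 and 10
```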
CONFUSION MATRIX

A Confusion Matrix is a table summarizing the frequency of predicted vs.


Hyperparameters
actual classes in your Test data
• This is the most common and concise way to evaluate performance and compare
Imbalanced Classes classification models against one another

• Confusion matrices can be used to derive several types of model performance


metrics, including accuracy, precision and recall
Confusion Matrix

                                  ACTUAL CLASS
                              1                        0
PREDICTED CLASS     1    True Positive (TP)     False Positive (FP)
                    0    False Negative (FN)    True Negative (TN)

Example counts (Test data):
                                  ACTUAL CLASS
                              1        0
PREDICTED CLASS     1        100       5
                    0         15      50
CONFUSION MATRIX

Confusion matrices are used to derive model performance metrics


Hyperparameters
ACTUAL CLASS

1 0
Imbalanced Classes
True Positive False Positive
1 (TP) (FP)
PREDICTED
CLASS
Confusion Matrix False Negative True Negative
0 (FN) (TN)

Model Selection

Accuracy  = (TP+TN) / (TP+TN+FP+FN)     Of all predictions, what % were correct?
Precision = TP / (TP+FP)                Of all predicted positives, what % were correct?
Recall    = TP / (TP+FN)                Of all actual positives, what % were predicted correctly?
CONFUSION MATRIX

Confusion matrices are used to derive model performance metrics


Hyperparameters
ACTUAL CLASS

1 0
Imbalanced Classes
1 100 5
PREDICTED
CLASS
Confusion Matrix
0 15 50

Model Selection

Accuracy Precision Recall


Model Drift (TP+TN) / (TP+TN+FP+FN) TP / (TP+FP) TP / (TP+FN)
=

=
(100+50)/(100+50+5+15) 100/(100+5) 100/(100+15)
=

=
.88 .95 .87
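A small sketch of the metric formulas above, using the same example counts from the confusion matrix.

```python
def scores(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # of all predictions, % correct
    precision = tp / (tp + fp)                   # of all predicted positives, % correct
    recall = tp / (tp + fn)                      # of all actual positives, % predicted correctly
    return round(accuracy, 2), round(precision, 2), round(recall, 2)

print(scores(tp=100, fp=5, fn=15, tn=50))   # (0.88, 0.95, 0.87)
```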
CONFUSION MATRIX

Will a Confusion Matrix work for multi-class predictions too?


Hyperparameters
• Yes! Consider a model classifying multiple products (A, B, C, D); ideally, we
want the highest frequencies along the top-left to bottom-right diagonal

Imbalanced Classes ACTUAL PRODUCT ACTUAL PRODUCT

A B C D A B C D

Confusion Matrix
A A

B B
Model Selection PREDICTED PREDICTED
PRODUCT PRODUCT
C C

Model Drift D D

Good confusion matrix Bad confusion matrix


CONFUSION MATRIX

A confusion matrix can also provide insight into relationships and


Hyperparameters
interactions between classes, to help inform model improvements

Imbalanced Classes
ACTUAL PRODUCT

A B C D
In this case our model has a hard time differentiating
Confusion Matrix A products B and C, often predicting the wrong one

• Are these products very similar? Do we need to


B predict them separately?
Model Selection PREDICTED
PRODUCT • If so, can we engineer features to help distinguish
C B from C, and improve model accuracy?

Model Drift D
CONFUSION MATRIX

To score a multi-class confusion matrix, calculate metrics for each predicted


Hyperparameters class, then take a weighted average to evaluate the model as a whole

ACTUAL PRODUCT PRODUCT A:


Imbalanced Classes
A B C D Accuracy
( TP + TN ) / ( TP + TN + FP + FN )
A ( 214 + 452 + 1 + 1123 + 19 + 2 + 12 + 34 ) / ( 1886 )
Confusion Matrix

=
.9846
B
PREDICTED Precision
PRODUCT
Model Selection C TP / ( TP + FP )
( 214 ) / ( 214 + 1 + 8 + 2 )

=
In this case, (TN) includes all
D cases where Product A was .9511
Model Drift not predicted OR observed
Recall
TP / ( TP + FN )
( 214 ) / ( 214 + 15 + 3 )

=
.9224
CONFUSION MATRIX

To score a multi-class confusion matrix, calculate metrics for each predicted


Hyperparameters class, then take a weighted average to evaluate the model as a whole

ACTUAL PRODUCT PRODUCT B:


Imbalanced Classes
A B C D Accuracy
( TP + TN ) / ( TP + TN + FP + FN )
A ( 452 + 214 + 8 + 2 + 1123 + 19 + 3 + 12 + 34 ) / ( 1886 )
Confusion Matrix

=
.9899
B
PREDICTED Precision
PRODUCT
Model Selection C TP / ( TP + FP )
( 452 ) / ( 452 + 15 + 1 )

=
D .9658
Model Drift
Recall
TP / ( TP + FN )
( 452 ) / ( 452 + 1 + 2 )

=
.9934
CONFUSION MATRIX

To score a multi-class confusion matrix, calculate metrics for each predicted


Hyperparameters class, then take a weighted average to evaluate the model as a whole

ACTUAL PRODUCT PRODUCT C:


Imbalanced Classes
A B C D Accuracy
( TP + TN ) / ( TP + TN + FP + FN )
A ( 1123 + 214 + 1 + 2 + 15 + 452 + 3 + 2 + 34 ) / ( 1886 )
Confusion Matrix

=
.9788
B
PREDICTED Precision
PRODUCT
Model Selection C TP / ( TP + FP )
( 1123 ) / ( 1123 + 19 )

=
D .9834
Model Drift
Recall
TP / ( TP + FN )
( 1123 ) / ( 1123 + 8 + 1 + 12 )

=
.9816
CONFUSION MATRIX

To score a multi-class confusion matrix, calculate metrics for each predicted


Hyperparameters class, then take a weighted average to evaluate the model as a whole

ACTUAL PRODUCT PRODUCT D:


Imbalanced Classes
A B C D Accuracy
( TP + TN ) / ( TP + TN + FP + FN )
A ( 34 + 214 + 1 + 8 + 15 + 452 + 1 + 1123 ) / ( 1886 )
Confusion Matrix

=
.9799
B
PREDICTED Precision
PRODUCT
Model Selection C TP / ( TP + FP )
( 34 ) / ( 34 + 3 + 2 + 12 )

=
D .6667
Model Drift
Recall
TP / ( TP + FN )
( 34 ) / ( 34 + 2 + 19 )

=
.6182
CONFUSION MATRIX

To score a multi-class confusion matrix, calculate metrics for each predicted


Hyperparameters class, then take a weighted average to evaluate the model as a whole

Imbalanced Classes
# Obs. Accuracy Precision Recall

Confusion Matrix A 232 .9846 .9511 .9224

B 455 .9899 .9658 .9934

Model Selection
C 1,144 .9788 .9834 .9816

D 55 .9799 .6667 .6182


Model Drift
WEIGHTED AVG: .9822 .9659 .9666
CONFUSION MATRIX

To score a multi-class confusion matrix, calculate metrics for each predicted


Hyperparameters class, then take a weighted average to evaluate the model as a whole

Imbalanced Classes
# Obs. Accuracy Precision Recall

Confusion Matrix A 232 .7500 .1284 .0979

B 455 .4795 .2732 .1390

Model Selection
C 1,144 .5498 .2239 .4348

D 55 .5792 .0759 .0900


Model Drift
WEIGHTED AVG: .5586 .1994 .1868
MODEL SELECTION

Model selection describes the process of training multiple models,


Hyperparameters
comparing performance on Test data, and choosing the best option

Imbalanced Classes When comparing performance, it’s important to prioritize the most
relevant metrics based on the context of your model

Confusion Matrix • RECALL may be the best metric if it’s critical to predict ALL positive outcomes
correctly, and false negatives are a major risk (i.e. nuclear reactor)

• PRECISION may be the best metric if false negatives aren’t a big deal, but false
Model Selection positives are a major risk (i.e. spam filter or document search)

• ACCURACY may be the best metric if you care about predicting positive and
Model Drift negative outcomes equally, or if the risk of each outcome is comparable
MODEL DRIFT

Drift is when a trained model gradually becomes less accurate over time,
Hyperparameters
even when all variables and parameters remain the same
• As a best practice, all ML models used for ongoing prediction should be
Imbalanced Classes updated or retrained on a regular basis to combat drift

Tips to correct drift:


Confusion Matrix
• Benchmark your model performance (accuracy, precision, recall) on Day 1

Model Selection • Monitor performance against your benchmark over time

• If you notice drift compared to your benchmark, retrain the model using updated
Training data and consider discarding old records (if you have enough volume)
Model Drift
• Conduct additional feature engineering as necessary
PART 3:
ABOUT THIS SERIES

This is Part 3 of a 4-Part series designed to help you build a deep, foundational understanding of
machine learning, including data QA & profiling, classification, forecasting and unsupervised learning

PART 1 PART 2 PART 3 PART 4


QA & Data Profiling Classification Regression & Forecasting Unsupervised Learning

*Copyright Maven Analytics, LLC


COURSE OUTLINE

Review the ML landscape and key supervised learning concepts, including


1 Intro to Regression regression vs. classification, feature engineering, overfitting, etc.

Learn the key building blocks of regression analysis, including linear


2 Regression Modeling relationships, least squared error, multivariate and non-linear models

Understand and interpret model outputs and common diagnostic metrics


3 Model Diagnostics like R-squared, mean error, F-significance, P-values, and more

Explore powerful tools & techniques for time-series forecasting, including


4 Time-Series Forecasting seasonality, linear and non-linear trends, intervention analysis, etc.

*Copyright Maven Analytics, LLC


*Copyright Maven Analytics, LLC
MACHINE LEARNING LANDSCAPE

MACHINE LEARNING

Supervised Learning
  Classification: K-Nearest Neighbors, Naïve Bayes, Decision Trees, Logistic Regression,
  Sentiment Analysis (plus support vector machines, gradient boosting, neural nets/deep learning, etc.)
  Regression: Least Squares, Linear Regression, Forecasting, Non-Linear Regression,
  Intervention Analysis (plus LASSO/RIDGE, state-space, advanced generalized linear methods, VAR, DFA, etc.)

Unsupervised Learning
  Clustering/Segmentation (K-Means), Outlier Detection, Markov Chains
  (plus matrix factorization, principal components, factor analysis, UMAP, T-SNE,
  topological data analysis, advanced clustering, etc.)

Advanced Topics
  Reinforcement Learning (Q-learning, deep RL, multi-armed-bandit, etc.)
  Natural Language Processing (Latent Semantic Analysis, Latent Dirichlet Analysis, relationship
  extraction, semantic parsing, contextual word embeddings, translation, etc.)
  Computer Vision (Convolutional neural networks, style translation, etc.)
  Deep Learning (Feed Forward, Convolutional, RNN/LSTM, Attention, Deep RL, Autoencoder, GAN, etc.)
MACHINE LEARNING LANDSCAPE

MACHINE LEARNING

Supervised Learning Unsupervised Learning

Has a “label” Does NOT have a “label”

A “label” is an observed variable which you are trying to predict


(purchase (1/0), sentiment score (positive/negative), product type (A/B/C), etc.)

Focused on describing or organizing the


Focused on predicting a label or value
data in some non-obvious way
MACHINE LEARNING LANDSCAPE

MACHINE LEARNING

Supervised Learning Unsupervised Learning

Classification Regression CLASSIFICATION EXAMPLES:


• What type of subscribers are most likely to cancel or churn?
• Which segment does this customer fall into?
Used to predict Used to predict
classes for values for • Is the sentiment of this review positive, negative or neutral?
CATEGORICAL NUMERICAL
variables variables
REGRESSION EXAMPLES:
• How much revenue are we forecasting for the quarter?
• How many units do we expect to sell next season?
• How will seasonality impact the size of our customer base?
RECAP: KEY CONCEPTS
All tables contain rows and columns
• Each row represents an individual record/observation
• Each column represents an individual variable

Variables can be categorical or numerical
• Categorical variables contain classes or categories (used for filtering)
• Numerical variables contain numbers (used for aggregation)

Example (each row represents a record of a unique web session; Browser is a categorical
variable, while Time (Sec) and Pageviews are numerical variables):

Session ID   Browser   Time (Sec)   Pageviews   Purchase
    1        Chrome       354          11           1
    2        Safari        94           4           1
    3        Safari        36           2           0
    4        IE            17           1           0
    5        Chrome       229           9           1


QA and data profiling comes first, every time
• Without quality data, you can’t build quality models (“garbage in, garbage out”)
• Profiling is the first step towards conditioning (or filtering) on key variables to understand their impact

Machine Learning is needed when visual analysis falls short


• Humans aren’t capable of thinking beyond 3+ dimensions, but machines are built for it
• ML is a natural extension of visual analysis and univariate/multivariate profiling
REGRESSION 101

The goal of regression is to predict a numeric dependent variable using independent variables

𝒚 Dependent variable (DV) – for regression, this must be numerical (not categorical)!
• This is the variable you’re trying to predict
• The dependent variable is commonly referred to as “Y”, “predicted”, “output”, or “target” variable
• Regression is about understanding how the numerical DV is impacted by, or dependent on, other variables in the model

𝐱 Independent variables (IVs)


• These are the variables which help you predict the dependent variable
• Independent variables are commonly referred to as “X’s”, “predictors”, “features”, “explanatory variables”, or “covariates”
• Regression is about understanding how the IVs impact, or predict, the DV
REGRESSION 101

EXAMPLE: Using marketing and sales data (sample below) to predict revenue for a given month

Social Competitive Marketing Promotion


Month ID
Posts Activity Spend Count
Revenue Revenue is our dependent variable, since
it’s what we want to predict
1 30 High $130,000 12 $1,300,050

2 15 Low $600,000 5 $11,233,310 Since Revenue is numerical, we’ll use


3 11 Medium $15,000 10 $1,112,050 regression (vs. classification) to predict it
4 22 Medium $705,000 11 $1,582,077

5 41 High $3,000 3 $1,889,053

6 5 High $3,500 3 $1,200,089

7 23 Low $280,000 2 $1,300,080

8 12 Low $120,000 11 $700,150

9 8 Low $1,000,000 51 $41,011,150

10 19 Medium $43,000 10 $1,000,210

Social Posts, Competitive Activity, Marketing Spend and Promotion Count are all
independent variables, since they can help us explain, or predict, monthly Revenue
REGRESSION 101

EXAMPLE: Using marketing and sales data (sample below) to predict revenue for a given month

Social Competitive Marketing Promotion


Month ID Revenue
Posts Activity Spend Count

1 30 High $130,000 12 $1,300,050

2 15 Low $600,000 5 $11,233,310

3 11 Medium $15,000 10 $1,112,050 We’ll use records with observed values for
4 22 Medium $705,000 11 $1,582,077 both independent and dependent variables
5 41 High $3,000 3 $1,889,053
to “train” our regression model...
6 5 High $3,500 3 $1,200,089

7 23 Low $280,000 2 $1,300,080

8 12 Low $120,000 11 $700,150

9 8 Low $1,000,000 51 $41,011,150

10 19 Medium $43,000 10 $1,000,210

11 7 High $320,112 8 ??? ...then apply that model to new, unobserved values
containing IVs but no DV
This is what our regression model will predict!
REGRESSION WORKFLOW

Measurement Planning

Remember, these steps ALWAYS come first


Preliminary QA Before building a model, you should have a clear measurement plan (KPIs, project scope, desired outcome, etc.) and an
understanding of the data at hand (variable types, table structure, data quality, profiling metrics, etc.)
Data Profiling

Feature Engineering Data Splitting Model Training Selection & Tuning

Adding new, calculated Splitting records into Building regression models Choosing the best
variables (or “features”) to “Training” and “Test” data from Training data and performing model for a given
a data set based on sets, to validate accuracy applying to Test data to prediction, and tuning it to
existing fields and avoid overfitting maximize prediction accuracy prevent drift over time
FEATURE ENGINEERING

Feature engineering is the process of enriching a data set by creating additional


independent variables based on existing fields
• New features can help improve the accuracy and predictive power of your ML models
• Feature engineering is often used to convert fields into “model-friendly” formats; for example,
one-hot encoding transforms categorical variables into individual, binary (1/0) fields

Original features:    Month ID | Social Posts | Competitive Activity | Marketing Spend | Promotion Count | Revenue
Engineered features:  Competitive High | Competitive Medium | Competitive Low | Promotion >10 & Social >25 | Log Spend

(each data row below lists the original feature values followed by the engineered feature values)

1 30 High $130,000 12 $1,300,050 1 0 0 1 11.7

2 15 Low $600,000 5 $11,233,310 0 0 1 0 13.3

3 8 Medium $15,000 10 $1,112,050 0 1 0 0 9.6

4 22 Medium $705,000 11 $1,582,077 0 1 0 0 13.5

5 41 High $3,000 3 $1,889,053 1 0 0 0 8.0


DATA SPLITTING

Splitting is the process of partitioning data into separate sets of records for the
purpose of training and testing machine learning models
• As a rule of thumb, ~70-80% of your data will be used for Training (which is what your model
learns from), and ~20-30% will be reserved for Testing (to validate the model’s accuracy)

Social Competitive Marketing Promotion


Month ID Revenue
Posts Activity Spend Count

1 30 High $130,000 12 $1,300,050


Test data is NOT used to optimize models

2 15 Low $600,000 5 $11,233,310

3 11 Medium $15,000 10 $1,112,050 Using Training data for optimization and Test data
Training for validation ensures that your model can
4 22 Medium $705,000 11 $1,582,077
data accurately predict both known and unknown values,
5 41 High $3,000 3 $1,889,053 which helps to prevent overfitting
6 5 High $3,500 3 $1,200,089

7 23 Low $280,000 2 $1,300,080

8 12 Low $120,000 11 $700,150


Test
9 8 Low $1,000,000 51 $41,011,150
data
10 19 Medium $43,000 10 $1,000,210
OVERFITTING
Splitting is primarily used to avoid overfitting, which is when a model predicts known (Training) data very
well but unknown (Test) data poorly
• Think of overfitting like memorizing the answers to a test instead of actually learning the material; you’ll ace the
test, but lack the ability to generalize and apply your knowledge to unfamiliar questions

OVERFIT model WELL-FIT model UNDERFIT model


• Models the Training data too well • Models the Training data just right • Doesn’t model Training data well enough
• Doesn’t generalize well to Test data • Generalizes well to Test data • Doesn’t generalize well to Test data (high
(high variance, low bias) (balance of bias & variance) bias, low variance)
REGRESSION

There are two common use cases for linear regression: prediction and root-cause analysis

PREDICTION ROOT-CAUSE ANALYSIS

• Used to predict or forecast a numerical • Used to determine the causal impact of


dependent variable individual model inputs
• Goal is to make accurate predictions, even • Goal is to prove causality by comparing
if causality cannot necessarily be proven the sensitivity of each IV on the DV
*Copyright Maven Analytics, LLC
REGRESSION MODELING

In this section we’ll introduce the basics of regression modeling, including linear
relationships, least squared error, simple and multiple regression and non-linear models

TOPICS WE’LL COVER: GOALS FOR THIS SECTION:

• Understand basic linear relationships and the concept


Linear Relationships Least Squared Error of least squared errors
• Explore the difference between simple (univariate) and
Univariate Linear Multiple Linear
multiple linear regression
Regression Regression
• Demonstrate how regression can be used to model
linear or non-linear relationships
Non-Linear Regression
LINEAR RELATIONSHIPS

A linear relationship means that when you increase or decrease your


Linear Relationships X variable, your Y variable increases or decreases at a steady rate
• If you plot a linear relationship over many X/Y values, it will look like a straight line

Least Squared Error


• Mathematically, linear relationships can be described using the following equation:

y = ⍺ + βx      (y = Y value, ⍺ = Y-intercept, β = slope (rise/run), x = X value)

Multiple Linear
Regression • NOTE: Not all relationships are linear (more on that later!)

Non-Linear Common examples:


Regression
• Taxi Mileage (X) and Total Fare (Y)
• Units Sold (X) and Total Revenue (Y)
LINEAR RELATIONSHIPS

[Example scatterplots showing four patterns: Positive Linear, Negative Linear, Non-Linear (logarithmic), and No Relationship]

Sometimes there's no relationship at all!


LINEAR RELATIONSHIPS

Consider a line that fits every single point in the plot. This is known as a perfectly
linear relationship.

In this case you can simply calculate the exact value of Y for any given value
of X (no Machine Learning needed, just simple math!)
LINEAR RELATIONSHIPS

In the real world, things aren't quite so simple. When you add variance, it means that
many different lines could potentially fit through the plot.

To find the equation of the line with the best possible fit, we can use a
technique known as least squared error or "least squares"
LEAST SQUARED ERROR

Least squared error is used to mathematically determine the line that best fits
through a series of data points
• Imagine drawing a line through a scatterplot, and measuring the distance between
Least Squared Error your line and each point (these distances are called errors, or residuals)

• Now square each of those residuals, add them all up, and adjust your line until
you’ve minimized that sum; this is how least squares works!
Univariate Linear
Regression

Why “squared” error?


• Squaring the residuals converts them into positive values, and prevents positive
and negative distances from cancelling each other out (this helps the model
optimize more efficiently, too)
Non-Linear
Regression
LEAST SQUARED ERROR

X  | Y (actual) | Y (line) | Error | Sq. Error
10 | 10         | 15       | 5     | 25
20 | 25         | 20       | -5    | 25
30 | 20         | 25       | 5     | 25
35 | 30         | 27.5     | -2.5  | 6.25
40 | 40         | 30       | -10   | 100
50 | 15         | 35       | 20    | 400
60 | 40         | 40       | 0     | 0
65 | 30         | 42.5     | 12.5  | 156.25
70 | 50         | 45       | -5    | 25
80 | 40         | 50       | 10    | 100

STEP 1: Plot each data point on a scatterplot, and record the X and Y values
LEAST SQUARED ERROR

y = 10 + 0.5x

STEP 2: Draw a straight line through the points in the scatterplot, and
calculate the Y values derived by your linear equation
LEAST SQUARED ERROR

y = 10 + 0.5x

STEP 3: For each value of X, calculate the error (or residual) by comparing
the actual Y value against the Y value produced by your linear equation
LEAST SQUARED ERROR

y = 10 + 0.5x                                   SUM OF SQUARED ERROR: 862.5

STEP 4: Square each individual residual and add them up to determine the
sum of squared error (SSE)
• This defines exactly how well your line "fits" the plot (or in other words, how well
the linear equation describes the relationship between X and Y)
LEAST SQUARED ERROR

y = 12 + 0.4x

X  | Y (actual) | Y (line) | Error | Sq. Error
10 | 10         | 16       | 6     | 36
20 | 25         | 20       | -5    | 25
30 | 20         | 24       | 4     | 16
35 | 30         | 26       | -4    | 16
40 | 40         | 28       | -12   | 144
50 | 15         | 32       | 17    | 289
60 | 40         | 36       | -4    | 16
65 | 30         | 38       | 8     | 64
70 | 50         | 40       | -10   | 100
80 | 40         | 44       | 4     | 16

SUM OF SQUARED ERROR: 722

STEP 5: Plot a new line, repeat Steps 1-4, and continue the process until
you've found the line that minimizes the sum of squared error
• This is where Machine Learning comes in; human trial-and-error is completely
impractical, but machines can find an optimal linear equation in seconds
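For readers who want to see the least-squares mechanics outside of Excel, here is a minimal Python sketch using the ten points from the tables above. The use of numpy (and `np.polyfit`) is an assumption made purely for illustration; any least-squares routine would do.

```python
import numpy as np

# The ten (X, Y) points from the example above
x = np.array([10, 20, 30, 35, 40, 50, 60, 65, 70, 80], dtype=float)
y = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40], dtype=float)

def sse(alpha, beta):
    """Sum of squared errors for the candidate line y = alpha + beta * x."""
    residuals = y - (alpha + beta * x)
    return float(np.sum(residuals ** 2))

print(sse(10, 0.5))   # 862.5 for the first candidate line
print(sse(12, 0.4))   # 722.0 for the second candidate line

# Least-squares fit (returns slope and intercept of the SSE-minimizing line)
beta_hat, alpha_hat = np.polyfit(x, y, deg=1)
print(alpha_hat, beta_hat, sse(alpha_hat, beta_hat))
```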
UNIVARIATE LINEAR REGRESSION

Univariate ("simple") linear regression is used for predicting a numerical
output (DV) based on a single independent variable
• Univariate linear regression is simply an extension of least squares; you use the
linear equation that minimizes SSE to predict an output (Y) for any given input (X)

      y = ⍺ + βx + 𝜀
      where y = dependent variable (DV), x = independent variable (IV), ⍺ = Y-intercept,
      β = coefficient/parameter (sensitivity of Y to X), and 𝜀 = error/residual

      This is just the equation of a line, plus an error term

Non-Linear
Regression
Simple linear regression is rarely used on its own; think of it as a primer for
understanding more complex topics like non-linear and multiple regression
CASE STUDY: UNIVARIATE LINEAR REGRESSION

THE You are the proud owner of The Cone Zone, a mobile ice cream cart
SITUATION operating on the Atlantic City boardwalk.

You’ve noticed that you tend to sell more ice cream on hot days, and want to
THE understand how temperature and sales relate. Your goal is to build a simple linear
ASSIGNMENT regression model that you can use to predict sales based on the weather forecast.

1. Use a scatterplot to visualize a sample of daily temperatures (X) and sales (Y)
THE 2. Do you notice a clear pattern or trend? How might you interpret this?
OBJECTIVES 3. Find the line that best fits the data by minimizing SSE, then confirm by
plotting a linear trendline on the scatter plot
MULTIPLE LINEAR REGRESSION

Can we use more than one IV to predict the DV?

Multiple linear regression is used for predicting a numerical output (DV) based on
multiple independent variables
• In its simplest form, multiple linear regression is simply univariate linear regression
with additional x variables:

      y = ⍺ + β1x1 + β2x2 + β3x3 + … + βnxn + 𝜀

      Instead of just 1 IV, we have a whole set of independent variables (and associated
      coefficients/weights) to help explain our DV, plus a Y-intercept (⍺) and an error/residual term (𝜀)
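As a hedged illustration of the same idea in code, the sketch below fits a two-variable linear regression with scikit-learn. The listing numbers are hypothetical and the library choice is an assumption, not part of the course materials.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical listings: [accommodates, bedrooms] -> nightly price
X = np.array([[2, 1], [4, 2], [6, 3], [2, 1], [8, 4], [4, 1]], dtype=float)
y = np.array([80, 150, 210, 95, 300, 140], dtype=float)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # alpha and [beta1, beta2]
print(model.predict([[3, 2]]))         # predicted price for a new listing
```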
MULTIPLE LINEAR REGRESSION

To visualize how multiple regression works with 2 independent variables, imagine
fitting a plane (vs. a line) through a 3D scatterplot:

Least Squared Error

Univariate Linear
Regression

Multiple Linear
Regression

Non-Linear
Regression
Multiple regression can scale well beyond 2 variables, but this is where
visual analysis breaks down (and why we need machine learning!)
MULTIPLE LINEAR REGRESSION

EXAMPLE You are preparing to list a new property on AirBnB, and want to estimate
(or predict) an appropriate price using the listing data below
Linear Relationships

Least Squared Error

Univariate Linear
Regression

Multiple Linear
Regression

Non-Linear
Regression
MULTIPLE LINEAR REGRESSION

EXAMPLE You are preparing to list a new property on AirBnB, and want to estimate
(or predict) an appropriate price using the listing data below
Linear Relationships

MODEL 1: Predict price (Y) based on accommodation (X1):

      Y = 55.71 + (16.6*X1)

MODEL 2: Predict price (Y) based on accommodation (X1) and number of bedrooms (X2):

      Y = 52.59 + (15.4*X1) + (5.1*X2)

MODEL 3: Predict price (Y) based on accommodation (X1), number of bedrooms (X2), and room
type (entire.place (X3), hotel.room (X4), private.room (X5)):

      Y = 43.82 + (5.7*X1) + (5.1*X2) + (63.7*X3) + (65.4*X4) + (9.8*X5)

MODEL 4: Predict price (Y) based on accommodation (X1), number of bedrooms (X2), room type
(entire.place (X3), hotel.room (X4), private.room (X5)), and district (manhattan (X6), brooklyn (X7)):

      Y = 26.1 + (6.3*X1) + (6.7*X2) + (60.5*X3) + (54.5*X4) + (10.6*X5) + (28.5*X6) + (9.8*X7)


MULTIPLE LINEAR REGRESSION

EXAMPLE You are preparing to list a new property on AirBnB, and want to estimate
(or predict) an appropriate price using the listing data below
Linear Relationships

Least Squared Error

Univariate Linear
Regression

Multiple Linear
Regression

Non-Linear
Regression

                              Model 1    Model 2    Model 3    Model 4
Sum of Squared Error (SSE):   226,577    224,259    172,201    158,591
NON-LINEAR REGRESSION

What if the relationship between variables isn't linear?

Non-linear regression is used when variables have a non-linear relationship, but can
be transformed to create a linear one
• This works exactly like linear regression, except you use transformed versions of
your dependent or independent variables:

      y = ⍺ + β*ln(x) + 𝜀
      where y = dependent variable (DV), ln(x) = log-transformed independent variable (IV),
      ⍺ = Y-intercept, β = coefficient/parameter, and 𝜀 = error/residual
All we’re really doing is transforming the data to create linear relationships between each IV and the DV,
then applying a standard linear regression model using those transformed values
NON-LINEAR REGRESSION

EXAMPLE #1 You are predicting Sales (Y) using Marketing Spend (X). As you spend
more on marketing, the impact on sales eventually begins to diminish.
Linear Relationships

Least Squared Error

Univariate Linear
Regression

Multiple Linear
Regression
The relationship between Sales and Marketing Spend is non-linear (logarithmic):   y = ⍺ + βx + 𝜀
...but the relationship between Sales and the log of Marketing Spend is linear!   y = ⍺ + β*ln(x) + 𝜀
NON-LINEAR REGRESSION

EXAMPLE #2 You are predicting population growth (Y) over time (X) and notice an
increasing rate of growth as the population size increases.
Linear Relationships

Least Squared Error

Univariate Linear
Regression

Multiple Linear
Regression
The relationship between Time and Population is non-linear (exponential):   y = ⍺ + βx + 𝜀
...but the relationship between Time and the log of Population is linear!   ln(y) = ⍺ + βx + 𝜀

NOTE: There are multiple ways to transform variables based on the type of relationship
(log, exponential, cubic, etc.), and multiple techniques to model them (more on that later!)
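A minimal sketch of the log-transform approach in Python, assuming hypothetical spend/sales numbers with diminishing returns (the data and numpy usage are illustrative, not from the course files):

```python
import numpy as np

# Hypothetical marketing spend (X) and sales (Y) showing diminishing returns
spend = np.array([1000, 2000, 5000, 10000, 20000, 50000], dtype=float)
sales = np.array([12000, 15500, 19800, 23000, 26100, 30500], dtype=float)

# Transform the IV, then fit an ordinary least-squares line on the transformed values
beta, alpha = np.polyfit(np.log(spend), sales, deg=1)

# Predictions use y = alpha + beta * ln(x)
predicted_sales = alpha + beta * np.log(15000)
```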
CASE STUDY: NON-LINEAR REGRESSION

THE You work as a Marketing Analyst for Maven Marketing, an international


SITUATION advertising agency based in London.

Your client has asked you to help set media budgets and estimate sales for an
THE upcoming campaign. Using historical ad spend and revenue, your goal is to
ASSIGNMENT build a regression model to help predict campaign performance.

1. Collect historical data for daily ad spend and revenue


THE
OBJECTIVES 2. Create a linear regression, and gauge the fit. Does it look accurate?
3. Transform the ad spend values to fit a logarithmic relationship, and re-run
the model. How does this compare to the linear version?
*Copyright Maven Analytics, LLC
MODEL DIAGNOSTICS

In this section we’ll explore common diagnostic metrics used to evaluate regression
models and ensure that predictions are stable and accurate

TOPICS WE'LL COVER:
• Sample Model Output   • R-Squared
• Mean Error Metrics    • Homoskedasticity
• F-Significance        • P-Values & T-Statistics
• Multicollinearity     • Variance Inflation Factor

GOALS FOR THIS SECTION:
• Understand how to interpret regression model outputs
• Define common diagnostic metrics like R², MSE/MAE/MAPE, P-Values, F-statistics, etc.
• Explore the difference between homoskedasticity and heteroskedasticity in a linear regression model
• Understand how to identify and measure multicollinearity using variance inflation factor (VIF)
SAMPLE MODEL OUTPUT

Sample Model Output

R-Squared

Mean Error Metrics

Homoskedasticity

F-Significance

P-Values & T-Statistics

Multicollinearity

Variance Inflation
R-SQUARED

R-Squared measures how well your model explains the variance in the dependent
variable you are predicting
• The higher the R-Squared, the "better" your model predicts variance in the DV
and the more confident you can be in the accuracy of your predictions
• Adjusted R-Squared is often used as it "penalizes" the R-squared value based
on the number of variables included in the model

      R² = 1 – SSE / TSS

      SSE = Sum of Squared Error = Σᵢ (yᵢ – predictionᵢ)²
      (the total distance between predicted and actual values, squared, aka squared error)

      TSS = Total Sum of Squares = Σᵢ (yᵢ – ȳ)²
      (the total distance between each y value and the mean, squared; basically variance, without dividing by n)
R-SQUARED EXAMPLE

y = 12 + 0.4x

x  | y  | prediction | (y – prediction)² | ȳ  | (y – ȳ)²
10 | 10 | 16         | 36                | 30 | 400
20 | 25 | 20         | 25                | 30 | 25
30 | 20 | 24         | 16                | 30 | 100
35 | 30 | 26         | 16                | 30 | 0
40 | 40 | 28         | 144               | 30 | 100
50 | 15 | 32         | 289               | 30 | 225
60 | 40 | 36         | 16                | 30 | 100
65 | 30 | 38         | 64                | 30 | 0
70 | 50 | 40         | 100               | 30 | 400
80 | 40 | 44         | 16                | 30 | 100
                       SSE = 722               TSS = 1,450

      R² = 1 – SSE / TSS = 1 – 722 / 1,450 = 0.502
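The same R-Squared calculation can be verified in a short Python sketch using the example values above (numpy is assumed purely for convenience):

```python
import numpy as np

y = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40], dtype=float)
predictions = np.array([16, 20, 24, 26, 28, 32, 36, 38, 40, 44], dtype=float)  # y = 12 + 0.4x

sse = np.sum((y - predictions) ** 2)   # 722.0
tss = np.sum((y - y.mean()) ** 2)      # 1450.0
r_squared = 1 - sse / tss              # ~0.502
```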
MEAN ERROR METRICS

Mean error metrics measure how well your regression model predicts,
as opposed to how well it explains variance (like R-Squared)
• There are many variations, but the most common ones are Mean Squared Error
(MSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE)
• These metrics provide "standards" which can be used to compare predictive
accuracy across multiple regression models

      MSE  = Σᵢ (yᵢ – predictionᵢ)² / n        (average of the squared distance between actual & predicted values)
      MAE  = Σᵢ |yᵢ – predictionᵢ| / n         (average of the absolute distance between actual & predicted values)
      MAPE = Σᵢ (|yᵢ – predictionᵢ| / yᵢ) / n  (Mean Absolute Error, converted to a percentage)

Mean error metrics can be used to evaluate regression models in the same way that performance metrics
like accuracy, precision and recall are used to evaluate classification models
MSE EXAMPLE

y = 12 + 0.4x

Using the ten squared errors from the earlier example (SUM OF SQUARED ERROR = 722, n = 10):

      MSE = Σᵢ (yᵢ – predictionᵢ)² / n = 722 / 10 = 72.2
MAE EXAMPLE

y = 12 + 0.4x

X  | Y (actual) | Y (line) | Error | Abs Error
10 | 10         | 16       | 6     | 6
20 | 25         | 20       | -5    | 5
30 | 20         | 24       | 4     | 4
35 | 30         | 26       | -4    | 4
40 | 40         | 28       | -12   | 12
50 | 15         | 32       | 17    | 17
60 | 40         | 36       | -4    | 4
65 | 30         | 38       | 8     | 8
70 | 50         | 40       | -10   | 10
80 | 40         | 44       | 4     | 4

SUM OF ABSOLUTE ERROR: 74

      MAE = Σᵢ |yᵢ – predictionᵢ| / n = 74 / 10 = 7.4
MAPE EXAMPLE

y = 12 + 0.4x

X  | Y (actual) | Y (line) | Error | Abs Error | Abs % Error
10 | 10         | 16       | 6     | 6         | 0.6
20 | 25         | 20       | -5    | 5         | 0.2
30 | 20         | 24       | 4     | 4         | 0.2
35 | 30         | 26       | -4    | 4         | 0.133
40 | 40         | 28       | -12   | 12        | 0.3
50 | 15         | 32       | 17    | 17        | 1.133
60 | 40         | 36       | -4    | 4         | 0.1
65 | 30         | 38       | 8     | 8         | 0.267
70 | 50         | 40       | -10   | 10        | 0.2
80 | 40         | 44       | 4     | 4         | 0.1

SUM OF ABSOLUTE % ERROR: 3.233

      MAPE = Σᵢ (|yᵢ – predictionᵢ| / yᵢ) / n = 3.233 / 10 = 32.33%
MEAN ERROR METRICS

Sample Model Output


When should I use each type of error metric?
R-Squared
• Mean Squared Error (MSE) is particularly useful when outliers or
extreme values are important to predict
Mean Error Metrics
• Mean Absolute Error (MAE) is useful if you want to minimize the impact
Homoskedasticity of outliers on model selection

• Mean Absolute Percentage Error (MAPE) is useful when your DV is on a very large
scale, or if you want to compare models on a more intuitive scale
P-Values & T-Statistics

Multicollinearity PRO TIP: In general we recommend considering all of them, since they can be
calculated instantly and each provide helpful context into model performance
Variance Inflation
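All three metrics are one-liners in code; the sketch below reuses the example values from the previous slides (numpy is an assumed convenience, not a course requirement):

```python
import numpy as np

y = np.array([10, 25, 20, 30, 40, 15, 40, 30, 50, 40], dtype=float)
predictions = np.array([16, 20, 24, 26, 28, 32, 36, 38, 40, 44], dtype=float)  # y = 12 + 0.4x

mse  = np.mean((y - predictions) ** 2)        # 72.2
mae  = np.mean(np.abs(y - predictions))       # 7.4
mape = np.mean(np.abs(y - predictions) / y)   # ~0.323, i.e. 32.3%
```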
HOMOSKEDASTICITY

Homoskedasticity is a term used to describe a model with consistent "scatter", or
variance in residuals, across all IV values
• In order to make accurate predictions across the full range of IV values, models
should have normally distributed residuals (errors) with a zero mean
• Heteroskedasticity describes a model with inconsistent residual variance,
meaning that it predicts poorly for certain IV values and will fail to generalize
Homoskedasticity: residuals are consistent over the entire IV range
Heteroskedasticity: residuals increase at higher IV values, indicating that there is
some variance that the IVs are unable to explain

Breusch-Pagan tests can report a formal calculation of heteroskedasticity, but usually a simple visual check is enough
NULL HYPOTHESIS

NULL HYPOTHESIS (H0) [null hy·poth·e·sis]

noun
1. In a statistical test, the hypothesis that there is no significant difference
between specified populations, any observed difference being due to
sampling or experimental error *

The hypothesis that your model is garbage

Our goal is to reject the null hypothesis and prove (with a high level of confidence) that our
model can produce accurate, statistically significant predictions and not just random outputs

*Oxford Dictionary
F-STATISTIC & F-SIGNIFICANCE

Sample Model Output The F-Statistic and associated P-Value (aka F-Significance) help us
understand the predictive power of the model as a whole
R-Squared
• F-Significance is technically defined as “the probability that the null hypothesis
cannot be rejected”, which can be interpreted as the probability that your model
Mean Error Metrics
predicts poorly

Homoskedasticity • The smaller the F-Significance, the more useful your regression is for prediction

F-Significance • NOTE: It’s common practice to use a P-Value of .05 (aka 95%) as a threshold to
determine if a model is “statistically significant”, or valid for prediction
P-Values & T-Statistics

Multicollinearity PRO TIP: F-Significance should be the first thing you check when you evaluate a
regression model; if it’s above your threshold, you may need more training, if it’s
below your threshold, move on to coefficient-level significance (up next!)
Variance Inflation
T-STATISTICS & P-VALUES

Sample Model Output T-Statistics and their associated P-Values help us understand the
predictive power of the individual model coefficients
R-Squared
T-Statistics tell us the degree to which we can
“trust” the coefficient estimates
Mean Error Metrics
• Calculated by dividing the coefficient by its
standard error
Homoskedasticity • Primarily used as a stepping-stone to calculate
P-Values, which are easier to interpret and
more commonly used for diagnostics
F-Significance
P-Values tell us the probability that the
coefficient is meaningless, statistically speaking
P-Values & T-Statistics • The smaller the P-value, the more confident
you can be that the coefficient is valid (not 0)

Multicollinearity • Statistical significance is calculated as (1 – P)


P-Value | Statistical Significance
0.001   | 99.9%***
0.01    | 99%**
0.05    | 95%*
MULTICOLLINEARITY

Sample Model Output Multicollinearity occurs when two or more independent variables are
highly correlated, leading to untrustworthy model coefficients
R-Squared
• Correlation means that one IV can be used to predict another (i.e. height and
weight), leading to many combinations of coefficients that predict equally well
Mean Error Metrics
• This leads to unreliable coefficient estimates, and means that your model will fail
Homoskedasticity to generalize when applied to non-training data

F-Significance
How do I measure multicollinearity, and what can I do about it?
P-Values & T-Statistics
• Variance Inflation Factor (VIF) can help you quantify the degree of
multicollinearity, and determine which IVs to exclude from the model
Multicollinearity

Variance Inflation
VARIANCE INFLATION FACTOR

To calculate Variance Inflation Factor (VIF) you treat each individual IV as the
dependent variable, and use the R² value to measure how well you can predict
them using the other IVs in the model

      Full model:  Y = ⍺ + β1x1 + β2x2 + β3x3 + … + βnxn + 𝜀

      x1 = ⍺ + β2x2 + β3x3 + … + βnxn + 𝜀   →   VIF for X1 = 1 / (1 – R²)
      x2 = ⍺ + β1x1 + β3x3 + … + βnxn + 𝜀   →   VIF for X2 = 1 / (1 – R²)

PRO TIP: As a rule of thumb, a VIF >10 indicates that multicollinearity is a problem,
and that one or more IVs is redundant and should be removed from the model
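A minimal sketch of the VIF calculation in Python, following the "regress each IV on the others" recipe above. The data matrix is hypothetical and scikit-learn is simply an assumed way to get the R² values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Variance inflation factor for each column of a 2-D feature matrix X."""
    scores = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)                 # all IVs except column i
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        scores.append(1.0 / (1.0 - r2))                  # VIF = 1 / (1 - R^2)
    return scores

# Hypothetical IVs: accommodates, bedrooms, entire.place flag
X = np.array([[2, 1, 1], [4, 2, 1], [6, 3, 0],
              [2, 1, 0], [8, 4, 1], [4, 1, 0]], dtype=float)
print(vif(X))   # values > 10 would flag redundant, highly correlated IVs
```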
VARIANCE INFLATION FACTOR

Sample Model Output


EXAMPLE You are predicting the price of an AirBnB listing, and notice the following
results after calculating the VIF for each independent variable:

R-Squared

Mean Error Metrics


Entire.place and Private.room produce high VIF values, since they essentially measure the same thing;
if a listing isn’t a private room, there’s a high probability that it’s an entire place (and vice versa)
Homoskedasticity

SOLUTION: Pick one high-VIF variable to remove (arbitrarily), re-run the model,
F-Significance
and recalculate the VIF values to see if multicollinearity is gone

P-Values & T-Statistics

Multicollinearity
Adios multicollinearity!                    PRO TIP: Use a frequency table to confirm correlation!

                                Private.room
                                NO        YES
      Entire.place     NO       816       15,815
                       YES      13,166    0
RECAP: SAMPLE MODEL OUTPUT
A typical regression output includes:
• Formula for the regression (variables & data set)
• Profile of Residuals/Errors
• Y-Intercept & IV coefficients
• Coefficient Standard Errors & T-Values
• Coefficient P-Values
• R² and Adjusted R²
• F-Statistic and P-Value (aka F-Significance)
*Copyright Maven Analytics, LLC
TIME-SERIES FORECASTING

In this section we’ll explore common time-series forecasting techniques, which use
regression models to predict future values based on seasonality and trends

TOPICS WE'LL COVER:
• Forecasting 101     • Seasonality
• Linear Trending     • Smoothing
• Non-Linear Trends   • Intervention Analysis

GOALS FOR THIS SECTION:
• Understand when, why and how to use common time-series forecasting techniques
• Learn how to identify and quantify seasonality using auto correlation and one-hot encoding
• Explore common models for non-linear forecasting (like ADBUDG and Gompertz)
• Apply forecasting techniques to analyze the impact of key business decisions (aka "interventions")
FORECASTING 101

Forecasting 101 Time-series forecasting is all about predicting future values of a single,
numeric dependent variable
Seasonality • Forecasting works just like any other regression model, except your data must
contain multiple observations of your DV over time (aka time-series data)

Linear Trending • Time-series forecast models look for patterns in the observed data – like
seasonality or linear/non-linear trends – to accurately predict future values

Smoothing

Common Examples:
Non-Linear Trends • Forecasting revenue for the next fiscal year
• Predicting website traffic growth over time
Intervention Analysis • Estimating sales for an new product launch
SEASONALITY

Forecasting 101 Seasonality is a repeatable, predictable pattern that a dependent


variable may follow over time
Seasonality • Seasonality often aligns with specific time periods (calendar months, fiscal periods,
days of the week, hours of the day, etc.) but that isn’t always the case

Linear Trending • We can identify seasonal patterns using an Auto Correlation Function (ACF),
then apply that seasonality to forecasts using techniques like one-hot encoding
or moving averages (more on that soon!)
Smoothing

Common examples:
Non-Linear Trends
• Website traffic by hour of the day
• Seasonal product sales
Intervention Analysis
• Airline ticket prices around major holidays
AUTO CORRELATION FUNCTION

ACF essentially involves calculating the correlation between time-series data and
lagged versions of itself, then plotting those correlations
• This allows you to visualize which lags are highly correlated with the original data,
and reveals the length (or period) of the seasonal trend

[ACF bar chart: correlation (-60% to 100%) on the y-axis, plotted for lags 0 through 28]
AUTO CORRELATION FUNCTION

In this example the ACF plot shows a strong correlation at every 7th lag (7, 14, 21, 28),
which indicates a weekly seasonal cycle
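A minimal sketch of the same idea in Python: build a series with a weekly cycle, then correlate it against lagged copies of itself. The simulated data and the pandas usage are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a 7-day cycle plus random noise
rng = np.random.default_rng(0)
days = np.arange(200)
series = pd.Series(100 + 20 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 3, size=200))

# Correlation between the series and lagged versions of itself (the ACF)
acf = [series.autocorr(lag=k) for k in range(1, 29)]
# Peaks at lags 7, 14, 21, 28 would point to weekly seasonality
```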
CASE STUDY: AUTO CORRELATION

THE You are a Business Intelligence Analyst for Accucorp Accounting, a


SITUATION national firm specializing in tax preparation services.

The marketing team is hoping to better understand customer behavior, and


THE would like to know if online searches for “tax”-related keywords follow a
ASSIGNMENT seasonal pattern. Your goal is to analyze monthly search data and use auto
correlation to test for seasonality.

1. Collect monthly search volume for relevant tax keywords


THE 2. Create 36 monthly lags, and calculate the correlation between each lag
OBJECTIVES and the original values
3. Plot the correlations and determine the period of seasonality, if it exists
(monthly, quarterly, annual, etc.)
SEASONALITY: ONE-HOT ENCODING

Once you've identified a seasonal pattern, you can use one-hot encoding to create
independent variables which capture the effect of those time periods (days, weeks,
months, quarters, etc.)
• NOTE: When you use one-hot encoding, you must exclude one of the options
rather than encoding all of them (it doesn't matter which one you exclude)
• Consider the equation A+B=5; there are infinite combinations of A & B values that
can solve it. One-hot encoding all options creates a similar problem for regression

Quarter ID | Revenue     | Q1 | Q2 | Q3 | Q4
1          | $1,300,050  | 1  | 0  | 0  | 0
2          | $11,233,310 | 0  | 1  | 0  | 0
3          | $1,112,050  | 0  | 0  | 1  | 0
4          | $1,582,077  | 0  | 0  | 0  | 1

Intervention Analysis
PRO TIP: If your data contains multiple seasonal patterns (i.e. hour of day + day of week),
include both dimensions in the model as one-hot encoded independent variables
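In Python, one-hot encoding is typically a single call; the sketch below is illustrative (the revenue values are hypothetical), and `drop_first=True` handles the "exclude one option" rule described above.

```python
import pandas as pd

df = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"],
    "revenue": [1300050, 11233310, 1112050, 1582077, 1410000, 9820000],
})

# drop_first=True excludes one quarter so the dummy columns aren't redundant
dummies = pd.get_dummies(df["quarter"], prefix="q", drop_first=True)
model_data = pd.concat([df, dummies], axis=1)
```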
CASE STUDY: ONE-HOT ENCODING

THE You are a Senior Analyst for Weather Trends, a Brazilian weather
SITUATION station boasting the longest and most accurate forecasts in the biz.

You’ve been asked to help prepare temperature forecasts for the upcoming
THE year. To do this, you’ll need to analyze ~5 years of historical data from Rio de
ASSIGNMENT Janeiro, and use regression to predict monthly average temperatures.

1. Collect historical average monthly temperatures from Rio de Janeiro


THE 2. Create independent variables for each month using one-hot encoding
OBJECTIVES 3. Build a regression model to produce a forecast for the following year
LINEAR TRENDING

Forecasting 101
What if the data includes both seasonality and a linear trend?

Seasonality
Trend describes an overarching direction or movement in a time series,
not counting seasonality

Linear Trending • Trends are often linear (up/down), but can be non-linear as well (more on that later!)

• To account for linear trending in a regression, you can include a time step IV; this
Smoothing is simply an index value that starts at 1 and increments with each time period

• If the time step coefficient isn’t statistically significant, it means you don’t have a
Non-Linear Trends meaningful linear trend

Intervention Analysis PRO TIP: It’s common for time-series models to include trending AND seasonality; in this
case, use a combination of one-hot encoding and time step variables to account for both!
LINEAR TRENDING

Month | Revenue    | T-Step | Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
Jan   | $1,300,050 |   1    |  1   0   0   0   0   0   0   0   0   0   0
Feb   | $1,233,310 |   2    |  0   1   0   0   0   0   0   0   0   0   0
Mar   | $1,112,050 |   3    |  0   0   1   0   0   0   0   0   0   0   0
Apr   | $1,582,077 |   4    |  0   0   0   1   0   0   0   0   0   0   0
May   | $1,776,392 |   5    |  0   0   0   0   1   0   0   0   0   0   0
Jun   | $2,110,201 |   6    |  0   0   0   0   0   1   0   0   0   0   0
Jul   | $1,928,290 |   7    |  0   0   0   0   0   0   1   0   0   0   0
Aug   | $2,250,293 |   8    |  0   0   0   0   0   0   0   1   0   0   0
Sep   | $2,120,050 |   9    |  0   0   0   0   0   0   0   0   1   0   0
Oct   | $2,479,293 |  10    |  0   0   0   0   0   0   0   0   0   1   0
Nov   | $2,560,203 |  11    |  0   0   0   0   0   0   0   0   0   0   1
Dec   | $2,739,022 |  12    |  0   0   0   0   0   0   0   0   0   0   0
CASE STUDY: SEASONALITY + TREND

THE You are a Business Intelligence Analyst for Maven Muscles, a large national
SITUATION chain of fitness centers.

The Analytics Director needs your help building a monthly revenue forecast
THE for the upcoming year. Memberships follow a clear seasonal pattern, and
ASSIGNMENT revenue has been steadily rising as the gym continues to open new locations.

1. Collect historical monthly revenue data


THE 2. Use one-hot encoding and a time-step variable to account for both
OBJECTIVES seasonality and linear trending
3. Fit a regression model to forecast revenue for the upcoming year
SMOOTHING

Forecasting 101 What if my data is so noisy that I can’t tell if a trend exists?

Seasonality If your data is highly volatile, smoothing techniques can be used to


reveal underlying patterns or trends
Linear Trending • Common techniques like moving averages or weighted smoothing remove
random variation and “noise” from the data to help expose seasonality or trends

Smoothing • If the volatility is truly just noise (and something we want our model to ignore),
averaging or weighting the values around each time step can help us “smooth”
the data and produce more accurate forecasts
Non-Linear Trends

Intervention Analysis PRO TIP: Smoothing is a great way to expose patterns and trends that otherwise might be
tough to see; make this part of your data profiling process!
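A moving average is the simplest smoothing technique to sketch in code; the ratings below are hypothetical and the pandas rolling window is an assumed convenience.

```python
import pandas as pd

# Hypothetical noisy daily guest ratings
ratings = pd.Series([4.1, 3.2, 4.8, 3.9, 4.4, 2.9, 4.6, 4.0, 4.9, 3.7, 4.5, 4.2])

# Centered moving average; widen the window to smooth more aggressively
smoothed = ratings.rolling(window=4, center=True).mean()
```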
CASE STUDY: SMOOTHING

THE You’ve just been hired as an analytics consultant for Maven Motel, a struggling
SITUATION national motel chain.

Management has been taking steps to improve guest satisfaction, and has asked
THE you to analyze daily data to determine how ratings are trending. Your task is to
ASSIGNMENT use a moving average calculation to discern if an underlying trend is present.

1. Collect daily average guest ratings for the motel. Do you see any clear
THE patterns or trends?
OBJECTIVES 2. Calculate a moving average, and compare various windows from 1-12 weeks
3. Determine if an underlying trend is present. How would you describe it?
NON-LINEAR TRENDS

Time-series data won’t always follow a seasonal pattern or linear trend;


Forecasting 101 it may follow a non-linear trend, or have no predictable trend at all
• There are many formulas designed to forecast common non-linear trends:
Seasonality

Linear Trending • Exponential


• Cubic
• Step-Functions
Smoothing • ADBUDG
• Gompertz

Non-Linear Trends

Intervention Analysis
PRO TIP: ADBUDG and Gompertz are more flexible versions of a logistic curve, and
are commonly seen in BI use cases (product launches, diminishing returns, etc.)
CASE STUDY: NON-LINEAR TREND

THE The team at Cat Slacks just launched a new product poised to revolutionize
the world of feline fashion: a lightweight, breathable jogging short designed
SITUATION for active cats who refuse to compromise on quality.

You’ve been asked to provide a weekly sales forecast to help the


THE
manufacturing and warehouse teams with capacity planning. You only have
ASSIGNMENT 8 weeks of data to work with, but expect the launch to follow a logistic curve.

1. Collect sales data for the first 8 weeks since the launch
THE 2. Apply a Gompertz curve to fit a logistic trend
OBJECTIVES 3. Adjust parameters to compare various capacity limits and growth rates
INTERVENTION ANALYSIS

Forecasting 101 Intervention analysis is a technique used to estimate the impact of a


specific change (or “intervention”) on the dependent variable
Seasonality • Simply put, intervention analysis is about predicting what would have happened if
the change or intervention never took place

Linear Trending • By fitting a model to the “pre-intervention” data (up to the date of the change), you
can compare predicted vs. actual values after that date to estimate the impact of
the intervention
Smoothing

Common examples:
Non-Linear Trends
• Measuring the impact of a new website or check-out page on conversion rates
• Quantifying the impact of a new HR program to reduce employee churn
Intervention Analysis
INTERVENTION ANALYSIS

Forecasting 101 STEP 1: Fit a regression model to the data, using only observations from the pre-
intervention period:
Seasonality
Intervention

Linear Trending

Smoothing

Non-Linear Trends

Intervention Analysis
INTERVENTION ANALYSIS

Forecasting 101 STEP 2: Compare the predicted and observed values in the post-intervention
period, and sum the daily residuals to estimate the impact of the change:
Seasonality
Intervention

Linear Trending

Smoothing

Non-Linear Trends

Intervention Analysis
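A compact sketch of the two steps above in Python. The daily conversion rates are simulated (the intervention is assumed to land on day 60), and numpy's polyfit stands in for whatever regression model you would actually use.

```python
import numpy as np

# Hypothetical daily conversion rates with an intervention at day 60
days = np.arange(90)
cvr = np.where(days < 60, 0.020 + 0.00005 * days, 0.024 + 0.00005 * days)

# STEP 1: fit a trend using only the pre-intervention observations
pre = days < 60
beta, alpha = np.polyfit(days[pre], cvr[pre], deg=1)

# STEP 2: forecast the "no intervention" baseline, then sum the post-period residuals
baseline = alpha + beta * days[~pre]
estimated_impact = np.sum(cvr[~pre] - baseline)
```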
CASE STUDY: INTERVENTION ANALYSIS

THE You are a Web Analyst for Alpine Supplies, an online retailer specializing
SITUATION in high-end camping and hiking gear.

The company recently rolled out a new product landing page, and the CMO
THE has asked you to help quantify the impact on conversion rate (CVR) and
ASSIGNMENT sales. You’ll need to conduct an intervention analysis, and make sure to
capture day of week seasonality and trending in your model.

1. Collect data to track sessions and CVR before and after the change
THE 2. Fit a regression model to predict CVR using data from before the redesign
OBJECTIVES 3. Use the model to forecast ”baseline” CVR after the change, and calculate
both incremental daily sales and the total cumulative impact
PART 4:
ABOUT THIS SERIES

This is Part 4 of a 4-Part series designed to help you build a deep, foundational understanding of
machine learning, including data QA & profiling, classification, forecasting and unsupervised learning

PART 1 PART 2 PART 3 PART 4


QA & Data Profiling Classification Regression & Forecasting Unsupervised Learning

*Copyright Maven Analytics, LLC


COURSE OUTLINE

Review the ML landscape and key unsupervised learning concepts, including


1 Intro to Unsupervised ML unsupervised vs. supervised, feature engineering, workflow, etc.

Learn the key building blocks of segmentation and clustering, and review
2 Segmentation & Clustering two of the most-used algorithms: K-Means and Hierarchical Clustering

Explore common techniques for association mining and when to use each of
3 Association Mining them, including A Priori and Markov models

Learn how to detect cross-sectional and time-series outliers using


4 Outlier Detection techniques like Nearest Neighbor Distance and Time-Series models

Understand the core foundations of dimensionality reduction and how it


5 Dimensionality Reduction can be used in the context of BI and unsupervised learning

*Copyright Maven Analytics, LLC


*Copyright Maven Analytics, LLC
MACHINE LEARNING LANDSCAPE

MACHINE LEARNING

Supervised Learning
• Classification: K-Nearest Neighbors, Naïve Bayes, Decision Trees, Logistic Regression, Sentiment Analysis
• Regression: Least Squares, Linear Regression, Forecasting, Non-Linear Regression, Intervention Analysis
• Other: support vector machines, gradient boosting, neural nets/deep learning, LASSO/RIDGE,
  state-space, advanced generalized linear methods, VAR, DFA, etc.

Unsupervised Learning
• K-Means Clustering, Hierarchical Clustering, Markov Chains, Apriori,
  Cross-sectional Outlier Detection, Time-series Outlier Detection, Dimensionality Reduction
• Other: matrix factorization, principal components, factor analysis, UMAP, T-SNE,
  topological data analysis, advanced clustering, etc.

Advanced Topics
• Reinforcement Learning (Q-learning, deep RL, multi-armed-bandit, etc.)
• Natural Language Processing (Latent Semantic Analysis, Latent Dirichlet Analysis, relationship extraction,
  semantic parsing, contextual word embeddings, translation, etc.)
• Computer Vision (Convolutional neural networks, style translation, etc.)
• Deep Learning (Feed Forward, Convolutional, RNN/LSTM, Attention, Deep RL, Autoencoder, GAN, etc.)
MACHINE LEARNING LANDSCAPE

MACHINE LEARNING

Supervised Learning: has a "label"
• Classification is used to predict labels for CATEGORICAL variables
• Regression is used to predict values for NUMERICAL variables

Unsupervised Learning: does NOT have a "label"
• Used to DESCRIBE or ORGANIZE the data in some non-obvious way
• Unsupervised learning isn't about making predictions, it's about finding insights &
patterns hidden in the data
• Unlike classification or regression, we don't care about IVs, DVs or data
splitting – we just care about understanding our observed values
MACHINE LEARNING LANDSCAPE

MACHINE LEARNING

UNSUPERVISED EXAMPLES:

Clustering/Segmentation (K-Means Clustering, Hierarchical Clustering)
• Are there specific groups or "clusters" of customers who behave in similar ways?

Association Mining/Basket Analysis (Markov Chains, Apriori)
• When customers purchase a particular item, which other items tend to be purchased as well?

Outlier Detection (Cross-sectional, Time-series)
• Are there any anomalies in the data that we wouldn't expect to see under normal conditions?

Dimensionality Reduction
• Can we combine or reduce the number of dimensions in our data to better understand it?
UNSUPERVISED LEARNING WORKFLOW

Business Objective → Preliminary QA → Data Profiling
Remember, these steps ALWAYS come first. Before building a model, you should have a clear
understanding of the business objective (identifying clusters, detecting anomalies, etc.) and the
data at hand (variable types, table structure, data quality, profiling metrics, etc.)

Feature Engineering: add new, calculated variables (or "features") to the data set based on existing fields
Model Application: apply relevant unsupervised ML techniques, based on the objective (you will typically test multiple models)
Hyperparameter Tuning: adjust and tune model parameters (this is typically an iterative process)
Model Selection: select the model that yields the most useful or insightful results, based on the objective at hand

There often are no strict rules to determine which model is best; it's about which one helps you best answer the question at hand
RECAP: FEATURE ENGINEERING

Feature engineering is the process of enriching a data set by creating additional


variables (or dimensions) based on existing fields
• When preparing data for segmentation/clustering, features should be intuitive and actionable;
they should help you clearly describe each cluster and make practical business recommendations

Original features → Engineered features

Month ID | Social Posts | Competitive Activity | Marketing Spend | Promotion Count | Revenue     || Competitive High | Competitive Medium | Competitive Low | Promotion >10 & Social >25 | Log Spend
1        | 30           | High                 | $130,000        | 12              | $1,300,050  || 1                | 0                  | 0               | 1                          | 11.7
2        | 15           | Low                  | $600,000        | 5               | $11,233,310 || 0                | 0                  | 1               | 0                          | 13.3
3        | 8            | Medium               | $15,000         | 10              | $1,112,050  || 0                | 1                  | 0               | 0                          | 9.6
4        | 22           | Medium               | $705,000        | 11              | $1,582,077  || 0                | 1                  | 0               | 1                          | 13.5
5        | 41           | High                 | $3,000          | 3               | $1,889,053  || 1                | 0                  | 0               | 0                          | 8.0

KEY TAKEAWAYS

Unsupervised learning is about finding patterns, not making predictions


• Unlike classification or regression, it’s not about labels, dependent/independent variables or splitting test and
training data; it’s about better understanding the observed values in a data set

For many unsupervised techniques, there’s no “right” or “wrong” solution


• Unsupervised learning is often less about optimizing specific metrics or parameters, and more about using
Machine Learning to clearly answer the business question at hand

Always focus on the desired business outcome


• Make sure that you can clearly interpret your model output and translate it into intuitive, practical insights
and recommendations (otherwise what’s the point?)
*Copyright Maven Analytics, LLC
CLUSTERING & SEGMENTATION

In this section we’ll introduce the fundamentals of clustering & segmentation and
compare two common unsupervised models: K-means and hierarchical clustering

TOPICS WE'LL COVER:
• Clustering Basics         • K-Means
• Hierarchical Clustering   • Key Takeaways

GOALS FOR THIS SECTION:
• Understand how segmentation and clustering models fundamentally work
• Explore common models including K-Means and hierarchical clustering
• Compare and contrast common clustering models

Remember, there’s no “right” answer or single optimization metric when it comes to clustering and segmentation; the best
outputs are the ones which help you answer the question at hand and make practical, data-driven business decisions
CLUSTERING BASICS

Clustering Basics Clustering allows you to find concentrations or groups of observations


which are similar to one another but distinct from other groups
K-Means

Hierarchical Clustering

Key Takeaways
CLUSTERING BASICS

Clustering Basics Clustering allows you to find concentrations or groups of observations


which are similar to one another but distinct from other groups
K-Means
Multiple models can be used for clustering, but the general workflow consists of the
following steps:

EXTRACT:    Collect the raw data you need for analysis (i.e. transactions, customer
            records, survey results, website pathing, etc.)
TRANSFORM:  Use feature engineering to create clear, intuitive dimensions which can
            help you interpret each cluster and make actionable recommendations
MODEL:      Apply one or more clustering models and determine an appropriate
            number of clusters based on the model output and business context
INTERPRET:  Use multivariate analysis to understand the characteristics of each cluster
            (profiling metrics, distributions, etc.). The model won't do this for you!
K-MEANS

Clustering Basics K-Means is a common algorithm which assigns each observation in a data
set to a specific cluster, where K represents the number of clusters
K-Means
In its simplest form (2 dimensions), here's how it works:
Hierarchical Clustering 1. Select K arbitrary locations in a scatterplot as cluster centers (or centroids),
and assign each observation to a cluster based on the closest centroid
Key Takeaways 2. Recalculate and relocate each centroid to the mean of the observations
assigned to it, then reassign each observation to its new closest centroid
3. Repeat the process until observations no longer change clusters

Example use cases:


• Identifying customer segments for targeted marketing campaigns
• Clustering store locations based on factors like sales, ratings, size, etc.
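A minimal K-Means sketch in Python (scikit-learn is an assumed choice, and the six customers are made up). In practice you would typically scale the features first so one dimension doesn't dominate the distance calculation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [email engagement rate, annual spend]
X = np.array([[0.10, 120], [0.20, 150], [0.15, 130],
              [0.70, 800], [0.80, 950], [0.75, 870]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)            # cluster assignment for each customer
print(kmeans.cluster_centers_)   # final centroid locations
```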
K-MEANS

Clustering Basics

K-Means

Hierarchical Clustering

Key Takeaways

STEP 1: Determine what you think might be an appropriate number of


clusters (in this case 2), and select arbitrary locations as initial centroids
K-MEANS

Clustering Basics

K-Means

Hierarchical Clustering

Key Takeaways

STEP 2: Assign each observation to a cluster, based on the closest centroid


K-MEANS

Clustering Basics

K-Means

Hierarchical Clustering

Key Takeaways

STEP 3: Relocate each centroid to the mean of its assigned observations,


and reassign each observation to the new closest centroid
K-MEANS

Clustering Basics

K-Means

Hierarchical Clustering

Key Takeaways

STEP 4: Continue to relocate each centroid to the mean of its


assigned observations, until the clusters no longer change
K-MEANS

Clustering Basics How do we know what’s the “right” number of clusters (K)?
• While there is no “right” or “wrong” number of clusters, you can use the
K-Means within-cluster sum of squares (WSS) to help inform your decision

Hierarchical Clustering
Square these distances and sum them to calculate WSS for two clusters (K=2)

Key Takeaways
K-MEANS

Rerun the model with additional clusters, and plot the WSS for each value of K

[Plot: WSS on the y-axis vs. number of clusters (K = 2 to 8) on the x-axis]
K-MEANS

Look for an "elbow" or inflection point, where adding another cluster has a relatively
small impact on WSS (in this case where K=5)

PRO TIP: Think of this as a guideline, not a strict rule
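The elbow plot can be produced by re-running the model for several values of K; in scikit-learn the fitted model's `inertia_` attribute is its name for WSS. The simulated data below is an assumption made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Simulate three loose groups of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])

# inertia_ = within-cluster sum of squares (WSS); plot it against K and look for the elbow
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(2, 9)]
```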
K-MEANS

Clustering Basics Can we create clusters based on more than 2 dimensions?


• Yes! K-Means and WSS plots can scale to any number of dimensions
K-Means
• At higher dimensions, profiling metrics are critical for understanding what
each cluster represents, since you can’t compare them visually
Hierarchical Clustering

Key Takeaways
Does the shape of the clusters matter?
• Yes, K-Means works best when the clusters are mostly circular in shape;
but other tools like Hierarchical Clustering (up next!) can address this

Does the initial centroid location make a difference?


• It can, so a common best practice is to re-run the model K times to ensure
that you are minimizing WSS for the selected number of clusters
CASE STUDY: K-MEANS

THE You’ve just been hired as a Data Science Intern for Pet Palz, an ecommerce
SITUATION start-up selling custom 3-D printed models of household pets.

You’ve been asked to analyze customer-level data to help identify meaningful


audience segments and refine the company’s marketing strategy.
THE
ASSIGNMENT As a first step, you’ll be exploring the relationship between how often customers
engage with promotional emails and how much they actually spend, using a K-
means model for clustering.

1. Collect sample data containing email engagement and spend by customer


THE
OBJECTIVES 2. Use K-means to identify clusters
3. Interpret each cluster and suggest next steps based on your analysis
HIERARCHICAL CLUSTERING

Clustering Basics Hierarchical clustering is an unsupervised learning technique that


creates clusters by grouping similar data points together*
K-Means
In its simplest form (2 dimensions), here's how it works:
Hierarchical Clustering 1. Create a scatterplot, find the 2 closest points, and group them into a cluster
2. Then find the next two closest points or clusters, and group them to a cluster
Key Takeaways
3. Repeat the process of combining the closest pairs of points or clusters until you
eventually end up with one single cluster

This process is visualized using a tree diagram called a dendrogram,


which shows the hierarchical relationship between clusters

*This is known as agglomerative or “bottom-up” clustering (vs. divisive or “top-down” clustering, which is much less common)
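A short sketch of agglomerative clustering in Python using SciPy; the six points loosely mirror p1-p6 in the walkthrough that follows, and the "ward" linkage choice is just one common assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Six hypothetical points, similar to p1..p6 on the next slides
points = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 2.0],
                   [5.0, 5.0], [5.5, 5.2], [6.0, 4.8]])

Z = linkage(points, method="ward")                 # bottom-up merge history
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
# dendrogram(Z) draws the tree diagram described below
```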
HIERARCHICAL CLUSTERING

[Scatterplot of six points (p1-p6) shown next to the growing dendrogram, with distance on the y-axis]

STEP 1: Find the two closest points, and group them into a cluster (5 clusters remain)
HIERARCHICAL CLUSTERING

STEP 2: Find the next two closest points/clusters, and group them together (4 clusters remain)
HIERARCHICAL CLUSTERING

STEP 3: Repeat the process until all points are part of the same cluster
(3 clusters, then 2 clusters, and finally 1)
HIERARCHICAL CLUSTERING

How do you know when to stop creating new clusters?

Reading the dendrogram:
• The height of each branch tells us how close the clusters/data points are to each
other (taller = longer distance)
• Vertical branches that lead to splits are called clades
• At the end of each clade is a leaf, representing a single data point
• Clades help us understand similarity; two leaves in the same clade are more similar
to each other than they are to the leaves in another clade
HIERARCHICAL CLUSTERING

Clustering Basics How exactly do you define the distance between clusters?
• There are a number of valid ways to measure the distance between clusters,
K-Means which are often referred to as “linkage methods”
• Common methods include measuring the closest min/max distance between
Hierarchical Clustering clusters, the lowest average distance, or the distance between cluster centroids

Key Takeaways When might you use hierarchical clustering over K-Means?
• Run both models and compare outputs; Hierarchical clustering may produce
more meaningful results if clusters are not circular or uniform in shape

[Side-by-side comparison: K-Means vs. Hierarchical clustering on non-circular clusters]
KEY TAKEAWAYS

Cluster analysis is about finding concentrations of observations which are


similar to one another and distinct from other groups
• Common use cases include identifying key customer segments, clustering stores based on performance, etc.

Two common techniques are K-means and Hierarchical clustering


• In practice, we recommend testing multiple models since each technique has strengths and weaknesses

Models won’t tell you how to interpret what each cluster represents
• Use multivariate analysis and profiling techniques to understand the characteristics of each cluster
*Copyright Maven Analytics, LLC
ASSOCIATION MINING

In this section we’ll introduce the fundamentals of association mining and basket
analysis, and compare common techniques including Apriori and Markov Chains

TOPICS WE'LL COVER:
• Association Mining Basics   • Apriori
• Markov Chains               • Key Takeaways

GOALS FOR THIS SECTION:
• Understand the basics of association mining and introduce several real-world use cases
• Explore common association mining techniques like Apriori and Markov Chains
• Compare and contrast common unsupervised models for basket analysis
ASSOCIATION MINING BASICS

Association Mining is used to reveal underlying patterns, relationships and correlations in the data, and translate them into IF/THEN association rules (i.e. "If a customer buys product A, then they will likely buy product B")
• Association Mining is often applied to transactional data, to analyze which products tend to be purchased together (known as "basket analysis")

Common use cases:


• Building product recommendation systems (i.e. Amazon, Netflix)
• Sending targeted product offers based on purchase history
• Optimizing physical store layouts to maximize cross-selling

Association mining is NOT about trying to prove or establish causation; it’s just about
identifying frequently occurring patterns and correlations in large datasets
APRIORI

The Apriori algorithm is commonly used to analyze how frequently items are purchased together in a single transaction (i.e. beer and diapers, peanut butter and jelly, etc.)
• In its simplest form (2-item sets), Apriori models compare the frequency of transactions containing item A, item B, and both items A and B
• This allows you to understand how often these items tend to be purchased together, and calculate the strength of the association between them

Apriori models typically include three key metrics:
• Support: frequency of transactions containing a given item (or set of items), divided by total transactions
• Confidence: conditional probability of item B occurring, given item A = Support (A,B) / Support (A)
• Lift: measures the importance or "strength" of the association between items = Support (A,B) / (Support (A) x Support (B))
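As a rough illustration of these three metrics, here is a small Python sketch (not from the course; the toy transaction list is made up) that computes support, confidence and lift for a single pair of items:

# Illustrative sketch of support, confidence and lift (toy data)
transactions = [
    {"bacon", "eggs"}, {"bacon", "eggs"}, {"bacon"}, {"eggs", "bread"},
    {"bacon", "eggs"}, {"bread"}, {"bacon"}, {"eggs"},
]
n = len(transactions)

def support(*items):
    # share of transactions containing every item in the set
    items = set(items)
    return sum(items <= t for t in transactions) / n

support_a  = support("bacon")
support_b  = support("eggs")
support_ab = support("bacon", "eggs")

confidence = support_ab / support_a                 # P(eggs | bacon)
lift       = support_ab / (support_a * support_b)   # strength of the association

print(f"support={support_ab:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")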


APRIORI
EXAMPLE #1: Calculating the association between bacon & eggs

[Table: 20 transactions, each listing up to two items purchased (shown as icons in the original)]

1) Support (bacon) = 10/20 = 0.5
   Support (eggs) = 7/20 = 0.35
   Support (bacon, eggs) = 6/20 = 0.3

2) Confidence (bacon → eggs) = Support (bacon, eggs) / Support (bacon) = 0.3 / 0.5 = 60%

3) Lift (bacon → eggs) = Support (bacon, eggs) / (Support (bacon) x Support (eggs)) = 0.3 / (0.5 x 0.35) = 1.7

Since Lift > 1, we can interpret the association between bacon and eggs as real and informative (eggs are likely to be purchased with bacon)

*Inspired by KDNuggets (www.kdnuggets.com)


APRIORI
EXAMPLE #2: Calculating the association between bacon & basil

[Table: 20 transactions, each listing up to two items purchased (shown as icons in the original)]

1) Support (bacon) = 10/20 = 0.5
   Support (basil) = 5/20 = 0.25
   Support (bacon, basil) = 1/20 = 0.05

2) Confidence (bacon → basil) = Support (bacon, basil) / Support (bacon) = 0.05 / 0.5 = 10%

3) Lift (bacon → basil) = Support (bacon, basil) / (Support (bacon) x Support (basil)) = 0.05 / (0.5 x 0.25) = 0.4

Since Lift < 1, we can conclude that there is no positive association between bacon and basil (basil is unlikely to be purchased with bacon)

*Inspired by KDNuggets (www.kdnuggets.com)


APRIORI
EXAMPLE #3: Calculating the association between bacon & water

[Table: 20 transactions, each listing up to two items purchased (shown as icons in the original)]

1) Support (bacon) = 10/20 = 0.5
   Support (water) = 1/20 = 0.05
   Support (bacon, water) = 1/20 = 0.05

2) Confidence (bacon → water) = Support (bacon, water) / Support (bacon) = 0.05 / 0.5 = 10%

3) Lift (bacon → water) = Support (bacon, water) / (Support (bacon) x Support (water)) = 0.05 / (0.5 x 0.05) = 2

Since Lift = 2 we might assume a strong association between bacon and water, but this is skewed since water only appears in one transaction

*Inspired by KDNuggets (www.kdnuggets.com)


APRIORI
How do we account for infrequently purchased items?

To filter low-volume purchases, you can plot support for each item and determine a threshold or cutoff value:

[Chart: SUPPORT plotted for each item, with a clear drop-off where support = 0.15]

In this case we might filter out any transactions containing items with support <0.15 (transactions 4, 6, 8, 10, 15, 17)
• This helps us avoid misleading confidence and lift calculations, and reduces the number of transactions we need to analyze

*Inspired by KDNuggets (www.kdnuggets.com)
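A minimal Python sketch of this filtering step might look as follows (the transaction list is hypothetical; the 0.15 threshold mirrors the slide):

# Illustrative sketch: per-item support and a support threshold (toy data)
from collections import Counter

transactions = [
    {"bacon", "eggs"}, {"bacon", "bread"}, {"eggs"}, {"water"},
    {"bacon", "eggs"}, {"bread", "eggs"}, {"bacon"}, {"basil"},
]
n = len(transactions)

# Support for each individual item
item_support = {item: count / n
                for item, count in Counter(i for t in transactions for i in t).items()}

threshold = 0.15   # chosen from the "drop-off" in the support plot
rare_items = {item for item, s in item_support.items() if s < threshold}

# Drop any transaction containing a rare item before computing confidence/lift
filtered = [t for t in transactions if not (t & rare_items)]
print(sorted(item_support.items(), key=lambda kv: -kv[1]))
print(len(filtered), "of", n, "transactions kept")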


APRIORI

Wouldn't you want to analyze all possible item combinations, instead of calculating individual associations?
• Yes! In reality that's how Apriori works. But since the number of configurations increases exponentially with each item, this is impractical for humans to do
• To reduce or "prune" the number of itemsets to analyze, we can use the apriori principle, which basically states that if an item is infrequent, any combinations containing that item must also be infrequent

Consider a shop that sells 4 items (chocolate, cheese, bananas & bread)
There are 15 possible combinations or itemsets (ignoring duplicates)

[Diagram: all possible 1-item, 2-item, 3-item and 4-item sets]

*Inspired by KDNuggets (www.kdnuggets.com)


APRIORI

STEP 1: Calculate support for each 1-item set, and filter out all transactions containing items below the threshold (in this case cheese)

[Diagram: 1-item through 4-item sets, with any set containing cheese eliminated]

*Inspired by KDNuggets (www.kdnuggets.com)


APRIORI

STEP 2: Based on the remaining itemsets, calculate support for each 2-item set, and filter out transactions containing any pairs below the threshold (in this case chocolate & bread)

[Diagram: remaining 2-item through 4-item sets, with any set containing chocolate & bread eliminated]

*Inspired by KDNuggets (www.kdnuggets.com)


APRIORI

STEP 3: Repeat until all infrequent itemsets have been eliminated, and filter transactions accordingly

[Diagram: only the frequent itemsets remain]

*Inspired by KDNuggets (www.kdnuggets.com)


APRIORI

STEP 4: Based on the filtered transactions, you can use an apriori model to calculate confidence and lift and identify the strongest associations

[Diagram: frequent itemsets used for the confidence and lift calculations]

*Inspired by KDNuggets (www.kdnuggets.com)


APRIORI

The resulting associations can be visualized using network graphs like this one:

[Figure: network graph of item associations]

*Inspired by KDNuggets (www.kdnuggets.com)
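To make the pruning idea concrete, here is a simplified Python sketch of the apriori principle (toy data and an illustrative 0.3 support threshold; this is not a full Apriori implementation): start with 1-item sets, drop anything below the threshold, and only build larger itemsets from the survivors.

# Simplified sketch of the apriori principle (toy data)
from itertools import combinations

transactions = [
    {"chocolate", "bananas"}, {"chocolate", "bread"}, {"bananas", "bread"},
    {"chocolate", "bananas", "bread"}, {"cheese"}, {"bananas"},
]
n = len(transactions)
min_support = 0.3

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Level 1: keep only frequent single items (infrequent items, e.g. cheese, are pruned)
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Larger itemsets are built only from items that survived the previous level
k = 2
while frequent[-1]:
    survivors = {i for s in frequent[-1] for i in s}
    candidates = {frozenset(c) for c in combinations(sorted(survivors), k)}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level, sets in enumerate(frequent, start=1):
    print(f"{level}-item sets:", [sorted(s) for s in sets])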


APRIORI

Can you calculate associations between multiple items, like coffee being purchased with bacon and eggs?
• Yes, you can calculate support, confidence and lift using the same exact logic as you would with individual items

[Table: 10 transactions, each listing up to three items purchased (shown as icons in the original)]

Support (bacon, eggs) = 3/10 = 0.3
Support (coffee) = 4/10 = 0.4
Support (bacon, eggs, coffee) = 2/10 = 0.2

Confidence (bacon & eggs → coffee) = Support (bacon, eggs, coffee) / Support (bacon, eggs) = 0.2 / 0.3 = 67%

Lift (bacon & eggs → coffee) = Support (bacon, eggs, coffee) / (Support (bacon, eggs) x Support (coffee)) = 0.2 / 0.12 = 1.7

Calculating all possible associations would be impossible for a human; this is where Machine Learning comes in!

*Inspired by KDNuggets (www.kdnuggets.com)


CASE STUDY: APRIORI

THE SITUATION: You are the proud owner of Coastal Roasters, a local coffee shop and bakery based in the Pacific Northwest.

THE ASSIGNMENT: You'd like to better understand which items customers tend to purchase together, to help inform product placement and promotional strategies. As a first step, you've decided to analyze a sample of ~10,000 transactions over a 6-month period, and conduct a simple basket analysis using an apriori model.

THE OBJECTIVES:
1. Collect transaction-level data, including date, time, and items purchased
2. Determine an appropriate support threshold, based on product frequency
3. Calculate confidence and lift to measure association between popular items
MARKOV CHAINS

Markov Chains are used to describe the flow or transition between "states" of a categorical variable
• Markov Chains calculate the probability of moving between states (subscribers renewing or churning, customers purchasing one product after another, etc.)
• Observations must be tracked over time and tied to a unique ID, so that all state transitions can be recorded as probabilities in a matrix (where all rows = 100%)

EXAMPLE: Monthly subscribers transitioning between Gold, Silver and Churn states each month:

FROM \ TO    Gold    Silver    Churn
Gold         0.80    0.15      0.05
Silver       0.20    0.40      0.40
Churn        0.01    0.30      0.69

[Diagram: state transition graph between GOLD, SILVER and CHURN, labeled with the same probabilities]
MARKOV CHAINS
[The same transition matrix and state diagram as above]

Example insights & recommendations:
• Most customers who churn stay churned, but 31% do come back; of those who return, nearly all of them re-subscribe to a Silver plan (vs. Gold)
  ✓ RECOMMENDATION: Launch targeted marketing to recently churned customers, offering a discount to resubscribe to a Silver membership plan
• Once customers upgrade to a Gold membership, the majority (80%) renew each month
  ✓ RECOMMENDATION: Offer a one-time discount for Silver customers to upgrade to Gold; while you may sacrifice some short-term revenue, it will likely be profitable in the long term

To account for prior transitions (vs. just the previous) you can use more complex “higher-order” Markov Chains
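As an illustration, here is a minimal Python sketch (the subscriber sequences are hypothetical, not course data) of estimating a transition matrix from observed state paths and predicting the most likely next state:

# Illustrative sketch: estimating a Markov transition matrix (toy sequences)
from collections import Counter, defaultdict

# One state per customer per month (hypothetical data)
paths = [
    ["Gold", "Gold", "Gold", "Silver"],
    ["Silver", "Silver", "Churn", "Churn"],
    ["Silver", "Gold", "Gold", "Gold"],
    ["Churn", "Silver", "Silver", "Gold"],
]

# Count observed transitions from each state to the next
counts = defaultdict(Counter)
for path in paths:
    for current, nxt in zip(path, path[1:]):
        counts[current][nxt] += 1

# Convert counts to probabilities (each row sums to 1)
transition = {state: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
              for state, nxts in counts.items()}

# Most likely next state for a customer currently on Silver
print(max(transition["Silver"], key=transition["Silver"].get))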
CASE STUDY: MARKOV CHAINS

THE SITUATION: You've just been promoted to Senior Web Analyst at Alpine Supplies, an online retailer specializing in equipment and supplies for outdoor enthusiasts.

THE ASSIGNMENT: The VP of Sales just shared a sample of ~15,000 customer purchase paths, and would like you to analyze the data to help inform a new cross-sell sales strategy. Your goal is to explore the data and build a simple Markov model to predict which product an existing customer is most likely to purchase next.

THE OBJECTIVES:
1. Collect sample data containing customer-level purchase paths
2. Calculate a frequency and probability matrix for popular products
3. For any given product, predict the most likely future purchase
KEY TAKEAWAYS

Association mining reveals patterns, relationships & correlations in the data, and translates them into IF/THEN association rules
• Common use cases include building recommendation systems (i.e. Amazon, Netflix), targeting product offers based on purchase history, optimizing physical store layouts to maximize cross-selling, etc.

Apriori models measure how frequently specific items tend to be purchased together (known as "basket analysis")
• Rather than analyzing every possible combination of itemsets, the apriori principle can be used to speed up calculations by filtering out infrequent transactions

Markov Chains can be used to measure the probability of transitioning between "states" of a categorical variable
• Markov models analyze sequences of events over time to determine what is most likely to happen next
OUTLIER DETECTION

In this section we’ll introduce the concept of statistical outliers, and review common
methods for detecting both cross-sectional and time-series outliers and anomalies

TOPICS WE'LL COVER:
• Outlier Detection Basics
• Cross-Sectional Outliers
• Time-Series Outliers
• Key Takeaways

GOALS FOR THIS SECTION:
• Understand different types of outliers, and how they can impact an analysis
• Learn how to identify outliers in a cross-sectional dataset using a "nearest neighbors" approach
• Learn how to use regression analysis to detect outliers in a time-series dataset

The terms outlier detection, anomaly detection, and rare-event detection are often used interchangeably; they all focus on
finding observations which are materially different than the others (outliers are also sometimes called “pathological” data)
OUTLIER DETECTION BASICS

Outlier detection is used to identify observations in a dataset which are either unexpected or statistically different from the others

There are two general types of outliers you may encounter:
• Data/recording issues: Values which were captured or categorized incorrectly, and which must be removed or excluded to avoid skewing the analysis
• Business outliers: Legitimate values which provide real and meaningful information, such as a fraudulent transaction or an unexpected spike in sales

Common use cases:


• Identifying store locations which are behaving abnormally compared to others
• Detecting web traffic anomalies which would be impossible to detect otherwise
• Flagging potentially fraudulent credit card transactions
CROSS-SECTIONAL OUTLIERS

Cross-sectional outlier detection is used to measure the similarity between observations in a dataset, and identify observations which are unusually different or dissimilar
• Detecting outliers in one or two dimensions is often trivial, using basic profiling metrics (i.e. interquartile range) or visual analysis (scatterplots, box plots, etc.)
• Detecting outliers in 3+ dimensions is trickier, and often requires more sophisticated techniques; this is where machine learning comes in!

A common technique is to calculate the distance between each observation and its nearest neighbor, and plot those distances to reveal outliers

NOTE: Raw data should first be scaled to produce consistent and meaningful distance calculations
CROSS-SECTIONAL OUTLIERS

EXAMPLE Normalized store sales by product category (n=101)


In this example we're looking at normalized sales for 3 product categories (x_1, x_2 and x_3) across 101 different store locations
• In cases like these, detecting outliers by visualizing distributions or scatterplots can be difficult (and often impossible)
CROSS-SECTIONAL OUTLIERS

EXAMPLE Normalized store sales by product category (n=101)


In this case, box and violin plots don't reveal any obvious outliers...
CROSS-SECTIONAL OUTLIERS

EXAMPLE Normalized store sales by product category (n=101)


Kernel density plots show relatively normal distributions for each variable...
CROSS-SECTIONAL OUTLIERS

EXAMPLE Normalized store sales by product category (n=101)


And scatter plots fail to expose any anomalies in the data
CROSS-SECTIONAL OUTLIERS

EXAMPLE Normalized store sales by product category (n=101)


But there is an outlier in the sample, which we can clearly see when we visualize a 3rd dimension
CROSS-SECTIONAL OUTLIERS

For cross-sectional outlier detection, techniques like nearest neighbors are more efficient and can scale to any number of dimensions

By calculating the distance from each point to its nearest neighbor and plotting the distribution, we can detect a clear outlier in the data
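Below is a minimal Python sketch of this approach (it assumes scikit-learn; the random data simply stands in for normalized store sales, with one planted outlier):

# Illustrative sketch: nearest-neighbor distances for outlier detection
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 "stores", 3 product categories
X = np.vstack([X, [[4.0, 4.0, 4.0]]])         # one planted outlier

X_scaled = StandardScaler().fit_transform(X)  # scale first for meaningful distances

# k=2 because the closest neighbor of each point is the point itself
nn = NearestNeighbors(n_neighbors=2).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
nearest = distances[:, 1]                     # distance to the true nearest neighbor

# Flag points whose nearest-neighbor distance is unusually large
threshold = nearest.mean() + 3 * nearest.std()
print(np.where(nearest > threshold)[0])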
CASE STUDY: OUTLIER DETECTION

THE SITUATION: You are a Senior Data Analyst for Brain Games, a large chain of retail shops specializing in educational toys, puzzles and board games.

THE ASSIGNMENT: You'd like to analyze store-level sales by product category, to identify patterns and see if any locations show an unusual composition of revenue. To do this, you'll be exploring revenue data across 100 individual store locations, and detecting outliers using "nearest neighbor" calculations.

THE OBJECTIVES:
1. Collect category-level sales data by store location
2. Use scatterplots to visualize relationships and scan for outliers
3. Calculate the nearest neighbor for each observation and visualize the data using a box plot and histogram. Do you see any outliers now?
TIME-SERIES OUTLIERS

Time-series outlier detection is used to identify observations which fall well outside expectations for a particular point in time, after accounting for seasonality and trending
• In practice, this involves building a time-series regression, comparing each observation to its predicted value, and plotting the residuals to detect anomalies

[Diagram: Fit regression model → Calculate residuals → Plot residuals]
TIME-SERIES OUTLIERS

Real-world time-series data is often extremely granular, with recurring patterns and trends which make visual outlier detection virtually impossible

EXAMPLE: Hourly website sessions (n=8,760)

[Figure: hourly website sessions plotted over time]
TIME-SERIES OUTLIERS

EXAMPLE Hourly website sessions (n=8,760)


Just like our cross-sectional example, univariate distributions don't reveal any outliers or anomalies
TIME-SERIES OUTLIERS

Instead of simple visual analysis, we can fit a linear regression model to account for seasonality (hour of day) and trending
• From there, we can plot the distribution of residuals in order to quickly identify any observations which significantly deviated from the model's prediction
TIME-SERIES OUTLIERS

When we plot the distribution of model residuals, we detect a clear outlier in the data

PRO TIP: Not all outliers are bad! If you find an anomaly, understand what happened and how you can learn from it
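Here is a minimal Python sketch of the residual-based approach (it assumes scikit-learn; the hourly sessions data is simulated with a planted anomaly):

# Illustrative sketch: flag time-series outliers from regression residuals
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
hours = np.arange(24 * 60)                          # 60 days of hourly data
hour_of_day = hours % 24
sessions = 100 + 0.05 * hours + 30 * np.sin(2 * np.pi * hour_of_day / 24)
sessions = sessions + rng.normal(0, 5, hours.size)
sessions[700] += 200                                # planted anomaly

# Features: linear trend + one dummy column per hour of day (seasonality)
season = (hour_of_day[:, None] == np.arange(24)).astype(float)
X = np.column_stack([hours, season])

model = LinearRegression().fit(X, sessions)
residuals = sessions - model.predict(X)

# Observations far from the model's prediction stand out in the residuals
threshold = 4 * residuals.std()
print(np.where(np.abs(residuals) > threshold)[0])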
KEY TAKEAWAYS

Outlier detection is used to identify observations in a dataset which are either unexpected or statistically different from the others
• Common use cases include detecting web traffic anomalies, flagging fraudulent credit card transactions, identifying unusually high or low-performing store locations, etc.

For cross-sectional analysis, plotting "nearest neighbor" distances can help detect anomalies when visual profiling falls short
• This approach is ideal for exposing outliers in large, complex or highly-dimensional datasets

For time-series analysis, you can fit a regression model to the data and plot the residuals to clearly identify outliers
• This allows you to quickly find outliers, while controlling for seasonality and trending
DIMENSIONALITY REDUCTION

In this section we'll review the fundamentals of dimensionality reduction, discuss common Machine Learning and business use cases, and introduce core techniques like Principal Component Analysis (PCA)

TOPICS WE'LL COVER:
• Dimensionality Reduction Basics
• Principal Component Analysis
• Advanced Techniques
• Key Takeaways

GOALS FOR THIS SECTION:
• Understand how dimensionality reduction can be applied in business and Machine Learning contexts
• Explore the intuition behind core linear dimensionality reduction techniques, including Principal Component Analysis (PCA)
• Introduce more advanced models, including techniques for non-linear dimensionality reduction
DIMENSIONALITY REDUCTION

Dimensionality reduction involves reducing the number of columns (i.e. dimensions) in a dataset while losing as little information as possible
• This can be used strictly as a ML optimization technique, or to help develop a better understanding of the data for business intelligence, market research, etc.

COMMON BUSINESS USE CASES:
✓ Summarizing item-level sales data to better understand group or category performance
✓ Analyzing survey data to understand overarching trends and findings without parsing through individual question responses
✓ Understanding the main traits or characteristics of your customers (website activity, transaction patterns, survey response data, etc.)

COMMON ML USE CASES:
✓ Improving a model's predictive power by reducing redundant or unnecessary dimensions in the data
✓ Visualizing clusters in limited dimensions to spot-check cluster/segmentation outputs
✓ Reducing trivial correlations and removing multicollinearity to build more accurate and meaningful models
PRINCIPAL COMPONENT ANALYSIS

One of the most common dimensionality reduction techniques is Principal Component Analysis (PCA)
• In its simplest form, PCA finds lines that best fit through the observations in a data set, and uses those lines to create new dimensions to analyze

STEP 1: Use high-dimensional data that could be more useful if it could be accurately represented with fewer dimensions
STEP 2: Find "components" or lines that fit through those dimensions, to potentially use as new dimensions to represent the data
STEP 3: Interpret the weighting of each component on the original dimensions to build an intuition for what they represent
STEP 4: Select the "principal" components which explain a material proportion of the variance in the data
STEP 5: Use those principal components in place of existing dimensions for further analysis
PRINCIPAL COMPONENT ANALYSIS

EXAMPLE Student test scores by subject (n=10)


Plotting pairs of dimensions reveals some correlation between Spelling/Vocabulary and Multiplication/Geometry test scores
• Linear dimensionality reduction techniques will find lines that fit through highly correlated dimensions first

[Figure: scatterplots of Spelling (X) / Vocabulary (Y), Vocabulary (X) / Multiplication (Y), Spelling (X) / Multiplication (Y), and Multiplication (X) / Geometry (Y)]
PRINCIPAL COMPONENT ANALYSIS

EXAMPLE Student test scores by subject (n=10)


In the context of dimensionality reduction, these lines are called components or factors

Observations are given a weighting for each component, which essentially measures its position relative to the length of the line

[Figure: Spelling vs. Vocabulary and Multiplication vs. Geometry scatterplots, each with a fitted "component" line and a "weighting" measured along it]
PRINCIPAL COMPONENT ANALYSIS

EXAMPLE Student test scores by subject (n=10)


Plot the component weightings for each point, and record them as new dimensions in the table

NOTE: We're looking at only 2 dimensions here, but PCA scales to any number of dimensions

Component weightings recorded as new dimensions for the 10 students:
2.8    7.8
4.3    8.7
7.3    3.2
10.7   4.9
11.4   2.6
10.0   11.5
6.1    11.6
9.6    11.3
6.6    13.6
13.5   12.0

[Figure: Spelling/Vocabulary and Multiplication/Geometry scatterplots with example points labeled by their component weightings, e.g. (8.3, 5.6) and (2.8, 1.6)]
PRINCIPAL COMPONENT ANALYSIS

How do you determine the meaning of each component?
• PCA will generate many potential components, and it won't be immediately clear what they actually represent
• In practice, PCA models calculate weights (or "loadings") to help explain the relationship between each component and the original dimensions (think of loadings like coefficients in a linear regression)

[Table: original test scores X1-X4 with the component 1 weightings (Y)]

Component 1 ("language"):
Y = (0.72)*X1 + (0.69)*X2 + (-0.02)*X3 + (0.03)*X4

High coefficients on X1 (spelling) and X2 (vocabulary), so we might interpret component 1 as "language"


PRINCIPAL COMPONENT ANALYSIS

[Table: original test scores X1-X4 with the component 2 weightings (Y)]

Component 2 ("math"):
Y = (0.004)*X1 + (0.001)*X2 + (0.68)*X3 + (0.72)*X4

High coefficients on X3 (multiplication) and X4 (geometry), so we might interpret component 2 as "math"

This requires human intuition; your model won't tell you what the components mean!


PRINCIPAL COMPONENT ANALYSIS

How do you know how many components to keep?
• You can use a Scree Plot to visualize the cumulative variance explained by each component, and look for the "elbow" where additional components add relatively little value (similar to a WSS chart for clustering)
• The components up to and including the "elbow" become your principal components, and others can be disregarded

[Figure: Scree plot] In this plot, the first 2 components explain most of the variance in the data, and additional components have minimal impact
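For reference, here is a minimal Python sketch (it assumes scikit-learn; the correlated test-score data is simulated) showing the PCA workflow described above: fit the model, inspect the cumulative explained variance (the scree plot data), read the loadings, and keep the principal components:

# Illustrative sketch: PCA with scikit-learn (simulated test-score data)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Four correlated "subjects": spelling~vocabulary, multiplication~geometry
language = rng.normal(70, 10, (100, 1))
math = rng.normal(70, 10, (100, 1))
scores = np.hstack([language + rng.normal(0, 3, (100, 1)),   # spelling
                    language + rng.normal(0, 3, (100, 1)),   # vocabulary
                    math + rng.normal(0, 3, (100, 1)),        # multiplication
                    math + rng.normal(0, 3, (100, 1))])       # geometry

scaled = StandardScaler().fit_transform(scores)
pca = PCA().fit(scaled)

# Scree plot data: cumulative variance explained by each component
print(np.cumsum(pca.explained_variance_ratio_))

# Loadings: weights linking each component back to the original subjects
print(pca.components_[:2].round(2))

# Keep the first two "principal" components as new dimensions
reduced = PCA(n_components=2).fit_transform(scaled)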
PRINCIPAL COMPONENT ANALYSIS

[Figure: students plotted on the new Language and Math component dimensions]

In this example, defining components for language and math helped us simplify and better understand the data set
• Using the new components we derived, we can conduct further analysis like predictive modeling, classification, clustering, etc.
• For example, clustering might help us understand student testing patterns (most skew towards either language or math, while a few excel in both subjects)
ADVANCED TECHNIQUES

What about non-linear dimensionality reduction?
• Linear dimensionality reduction is the most well-researched and widely used method, and can be implemented across a broad range of BI and ML use cases
• Other, non-linear dimensionality reduction techniques do exist, but they are significantly more specialized and complex in terms of the underlying algorithms and math:
  • Kernel PCA
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)
  • Uniform Manifold Approximation & Projection (UMAP)
  • Self-Organizing Maps
KEY TAKEAWAYS

Dimensionality reduction is used to reduce the number of columns in a data set while losing as little information as possible
• This can be used for model optimization, or to identify overarching trends or patterns in a granular dataset

One of the most common dimensionality reduction techniques is known as Principal Component Analysis (PCA)
• PCA essentially fits lines through the observations in a data set to create new dimensions to analyze

Dimensionality reduction requires human logic and intuition to interpret what each component actually means
• This will help you simplify your data, understand it on a deeper level, and conduct further analysis
WRAPPING UP

CONGRATULATIONS!
You now have a solid foundational understanding of unsupervised learning, including techniques
like clustering, association mining, outlier detection, and dimensionality reduction.

We hope you’ve enjoyed the entire Machine Learning for BI series, and that you find an
opportunity to put your skills to good use!

PART 1: QA & Data Profiling
PART 2: Classification
PART 3: Regression & Forecasting
PART 4: Unsupervised Learning
