Case Study Data Science Business
Deep Learning
For Business™
Data Science has been one of the biggest tech buzzwords of the last 5-10 years!
2
The Big Data Industry is Growing Rapidly!
3
The Demand for Data Scientists is only going up!
4
The Problems in the Industry and with Universities
5
Data Science for the first time!
6
This course seeks to answer and fill in these Gaps!
You’ll learn:
● How to approach business problems and solve them using Data Science Techniques
● You’ll gain perspective on how Data Scientists fit into a tech world filled
with Data Engineers, Data Analysts, Business Analysts
● How to apply the latest techniques in Deep Learning and solve some of
our Business problems with 20 Case Studies
7
My In-Depth Data Science Learning Path
8
Carefully Selected Real World Case Studies
The case studies used in this course were carefully selected and chosen
specifically because:
● They encompass some of the most common business problems and can be easily applied to your own business
● They are taken from a wide variety of industries and separated into 7 sections:
1. Predictive Modeling & Classifiers
2. Data Science in Marketing
3. Data Science in Retail – Clustering and Recommendation Systems
4. Time Series Forecasting
5. Natural Language Processing
6. Big Data Projects with PySpark
7. Deployment into Production
9
Case Studies Section 1 – Predictive Modeling & Classifiers
10
Case Studies Section 2 – Data Science in Marketing
11
Case Studies Section 3 – Data Science in Retail
13
Case Studies Section 5 – Natural Language
Processing
1. Summarizing Reviews
2. Detecting Sentiment in text
3. Spam Filters
14
Case Studies Section 6 – Big Data with PySpark
15
Case Studies Section 7 – Deployment Into Production
16
Hi
I’m Rajeev Ratan
17
About me
● Radio Frequency Engineer in Trinidad & Tobago for 8 years
● University of Edinburgh – MSc in Artificial Intelligence (2014-2015) where I
specialized in Robotics and Machine Learning
● Entrepreneur First (EF6) London Startup Incubator (2016)
● CTO at Edtech Startup, Dozenal (2016)
● VP of Engineering at KeyPla (CV Real Estate Startup) (2016-2017)
● Head of CV and Business Analytics at BlinkPool (eGames Gambling
Startup) (2017-2019)
● Data Scientist Consultant
● Udemy Courses
1. Mastering OpenCV in Python (~15,000 students since 2016)
2. Deep Learning Computer Vision™ CNN, OpenCV, YOLO, SSD & GANs
(~5,000 students since 2018)
18
My Udemy Courses
19
What you’ll be able to do after
Understand all aspects that make one a
complete Data Scientist
● The software programming chops to mess
with Data
● The Analytical and Statistical skills to make
sense of Data
● All the Machine Learning Theory needed for
Data Science
● Real World Understanding of Data and how
to understand the business domain to
better solve problems
20
Requirements
● Some familiarity with programming
○ Familiarity with any language helps, especially Python but it is NOT a
prerequisite.
● High School Level Math
● Little to no Data Science & Statistical or Machine Learning knowledge
● Interest and Passion in solving problems with Data
21
What you’re getting
● ~800+ Slides
● ~15 hours of video
● Comprehensive Data Science & Deep Learning theoretical
and practical learning path
● ~50+ ipython notebooks
● 20 Amazing Real World Case Studies
22
Course Outline
1. Course Introduction
2. Python, Pandas and Visualizations
3. Statistics, Machine Learning
4. Deep Learning in Detail
5. Predictive Modeling & Classifiers – 6 Case Studies
6. Data Science in Marketing – 4 Case Studies
7. Data Science in Retail – Clustering & Recommendation Systems – 3 Case Studies
8. Time Series Forecasting – 2 Case Studies
9. Natural Language Processing – 3 Case Studies
10. Big Data Projects with PySpark – 2 Case Studies
11. Deployment into Production
23
Course Approach Options
Beginners – Do Everything!
1. Course Introduction
2. Python, Pandas and Visualizations
3. Statistics, Machine Learning
4. Deep Learning in Detail
5. Predictive Modeling & Classifiers – 6 Case Studies
6. Data Science in Marketing – 4 Case Studies
7. Data Science in Retail – Clustering & Recommendation Systems – 3 Case Studies
8. Time Series Forecasting – 2 Case Studies
9. Natural Language Processing – 3 Case Studies
10. Big Data Projects with PySpark – 2 Case Studies
11. Deployment into Production
25
Did a few courses online or have a degree
related to Data Science
1. Course Introduction
2. Python, Pandas and Visualizations
3. Statistics, Machine Learning
4. Deep Learning in Detail
5. Predictive Modeling & Classifiers – 6 Case Studies
6. Data Science in Marketing – 4 Case Studies
7. Data Science in Retail – Clustering & Recommendation Systems – 3 Case Studies
8. Time Series Forecasting – 2 Case Studies
9. Natural Language Processing – 3 Case Studies
10. Big Data Projects with PySpark – 2 Case Studies
11. Deployment into Production
26
Junior Data Scientists – Just Look at
the Case Studies in Any Order!
1. Course Introduction
2. Python, Pandas and Visualizations
3. Statistics, Machine Learning
4. Deep Learning in Detail
5. Predictive Modeling & Classifiers – 6 Case Studies
6. Data Science in Marketing – 4 Case Studies
7. Data Science in Retail – Clustering & Recommendation Systems – 3 Case Studies
8. Time Series Forecasting – 2 Case Studies
9. Natural Language Processing – 3 Case Studies
10. Big Data Projects with PySpark – 2 Case Studies
11. Deployment into Production
27
Why Data is the New Oil and What
Most Businesses are Doing wrong
Has this ever happened to you?
29
How do those online giants like Google
and Facebook know you so well?
● It’s DATA
30
Online Advertising
31
How Targeted Are Online Ads?
32
Let’s look at Uber
33
Netflix’s Recommendations
34
Amazon’s Recommendations
35
36
Banking
37
The Ubiquity of Data Opportunities
38
What Are Businesses Doing Wrong?
● Unfortunately, the data revolution hasn’t spread to all businesses and industries.
● Why? A lack of competition lets them sit comfortably until a young, fast-moving startup gets its Series A funding and can compete.
● What are some common mistakes companies make?
39
Mistakes Businesses Make with Data
NOT DATA DRIVEN
1. Not recording their data
2. Recording their data, but doing nothing
substantial with it and then discarding it
because of storage costs
3. Mistaking Business Analytic Reports as a
substitute for data science.
4. Not Trusting the data and relying on their
intuition
5. Relying too much on statistically flawed
analysis
40
Data = Value = Better Decisions = More Profits!
41
Defining Business Problems for
Analytic Thinking & Data Driven
Decision Making
Analytical Thinking Defined
43
Data-Analytic Thinking
● For businesses that utilize data, Data Analysis and Data-Driven Decision Making are so critical that they can almost blind them.
● Having a deep understanding of the business problem, the domain and how the data is generated is critical. For example:
○ Imagine you built a model that was 97% accurate in disease detection. Pretty good? Perhaps, but suppose it missed 3% of patients who actually had the disease. A model with 90% accuracy that never failed to detect the disease is better. False Positives here are better than missing a person who had the disease.
● Many times, companies hire data scientists who are great technically, but they miss the big picture, which leads to bad decisions.
44
Data-Analytic Thinking
45
10 Data Science Projects every
Business should do!
Data for your business is growing all
around you
47
10 Data Science Projects that can be applied
to most businesses!
● Analytics Projects
1. Determine your best (most profitable) customers
2. Most Profitable items and item categories
3. Customer Life Time Value
4. Seasonal Trends and Forecasting
● Machine Learning Prediction Projects
5. Determine Customers likely to leave your business (retention)
6. Customer Segments
7. Recommendation Systems
8. AB Testing Analysis of Ads or many other changes (UI, Logo etc.)
9. Fraud Detection
10. Natural Language Processing of Social Media Sentiment
48
1. Determine your best (most
profitable) customers
Many people mistakenly think their best customers are the ones who spend the most. However, there are many exceptions and other metrics we can use to determine the best or most valuable customers.
● In gambling, customers depositing the most are often the most skilled gamblers and can actually be profiting, thus hurting your bottom line
● Retailers like Amazon may have customers who spend hundreds monthly; however, due to relaxed return policies, these people can be returning products regularly, basically renting expensive items for ‘free’ and forcing you to sell your new stock as B-Stock or Used.
● Outside of profit from a customer, one can look at customers who garner the most referrals or perhaps have continuously increased their spending over time.
49
2. Most Profitable items and item
categories
● It’s very useful to understand which items are your number one sellers. However, many times analysts make simple mistakes that can mislead executives.
● For instance, one supermarket thought one particular item was their top seller for years.
● However, when I dug in, I noticed a brand of beer that was sold in 3 variations (cans, medium and large glass bottles) was actually their top seller.
● Additionally, categorizing items (a tedious task if done manually) can shed more light on what customers want.
50
3. Customer Life Time Value
51
4. Seasonal Trends and Forecasting
52
5. Determine Customers likely to leave your
business (retention)
● Retention is an exercise often proposed and performed by businesses when they try to prevent customers from dropping their service and/or going to a competitor.
● Imagine a retention strategy that involved gifts and a personal call to your customers – and you had 10,000 customers!
● That’s going to be a waste of time and money.
● It’d make far more sense to create a model to predict which customers were most likely going to leave and target those customers.
53
6. Customer Segments
55
8. A/B Testing
56
9. Fraud Detection
According to Wikipedia,
“Fraud is a billion-dollar business and it is increasing
every year. The PwC global economic crime survey of
2016 suggests that more than one in three (36%) of
organizations experienced economic crime.”
● Modern businesses need to be smart in detecting fraud that could potentially hurt their business.
● Chargebacks with stolen credit cards, identity theft, affiliate fraud and many others are things we can use Machine Learning models to detect.
57
10. Natural Language Processing (NLP)
58
Making Sense of Buzz Words, Data Science,
Big Data, Machine & Deep Learning
There are so many confusing buzz words in this industry.
Where does one begin?
60
Buzz Words
61
Buzz Words Explained
2. Big Data – Any dataset that can’t really be stored and manipulated on one
machine
62
Buzz Words Explained
63
Buzz Words Explained
9. Data Mining - The implication with the term data mining is that all the
discovery is driven by a person, which is one slight contrast between
machine learning and data mining, as many of the algorithms or methods
are similar between the two.
64
Buzz Words Explained
65
How Deep Learning is Changing
Everything!
The Power of Deep Learning
● Machine Learning has been around for decades with
many of the established algorithms around since the
1960s.
● What brought on the Data Science revolution was:
○ Increasing Computer Power
○ Cheap Data Storage
○ Development of software and tools that made it far more accessible
● Around ~2010 it was seen that typical Machine Learning
models could only learn and do so much and some
problems were just too difficult
67
https://en.wikipedia.org/wiki/Timeline_of_machine_learning
The Power of Deep Learning
● Deep Learning solved this and has ushered in a new revolution in Data
Science and Artificial Intelligence.
● It took our models from “that’s pretty good” to “that’s scary good, better
than what most humans can do”. Deep Learning achieved far higher
accuracy in a number of Computer Vision and NLP tasks and allowed
Machine Learning experts to tackle even more difficult problems,
problems once thought to be too challenging
68
The Power of Deep Learning
69
https://blog.statsbot.co/deep-learning-achievements-4c563e034257
How Does Deep Learning Work?
70
More Data Needed!
71
The Roles of Data Analyst, Data Engineer &
Data Scientists
Data Analysts
● Data analysts translate raw numbers into comprehensible reports
and/or insights
● As we know, all businesses potentially collect data, from sales figures, market research, logistics, or transportation costs.
● Data analysts take that data and use it to help companies make better
business decisions.
● They often can create simple visual illustrations and charts to convey
the meaning in their data
● This could mean figuring out trends in cash flow, how to schedule
employees, optimal pricing and more.
73
Data Engineers
Data engineers connect all parts of the data ecosystem within a company or
institution and make it accessible.
● Accessing, collecting, auditing, and cleaning data from applications and
systems into a usable state (ETL Pipelines)
● Creating, choosing and maintaining efficient databases
● Building data pipelines taking data from one system to the next
● Monitoring and managing all the data systems (scalability, security,
DevOps)
● Deploying Data Scientists’ models
● They work with production and understand Big Data concepts
74
Data Scientists
● Clean up and pre-process (data wrangling) data from various systems into usable data for their algorithms
● Using Machine Learning and Deep Learning to build better prediction
models
● Evaluating statistical models and results to determine the validity of
analyses.
● Testing and continuously improving the accuracy of machine learning
models.
● Much like Data Analysts, Data Scientists build data visualizations to summarize the conclusions of an advanced analysis.
75
The Data Science Hierarchy of Needs
https://en.wikipedia.org/wiki/Monica_Rogati 76
Data Analysts vs. Data Scientists
77
How Data Scientists Approach Problems
79
Obtain – Finding Data
80
Data Wrangling
81
Exploratory Analysis
● Also called Exploratory Data Analysis,
is an approach to analyzing data sets
to summarize their main
characteristics, often by using visual
methods.
82
Model
83
Interpret
● Having trained our model we need to understand
its performance strengths and weaknesses.
● This step isn’t too difficult; however, it does require a proper understanding of the domain you’re working in.
● Detecting Fraud or Diseases requires a model that rarely misses, and we will be OK with False Positives. Conversely, in a situation where False Positives are costly, you may want to adjust the model accordingly
84
Deploy into Production
85
Communication is Vital
86
What is Python and Why do
Data Scientists Love it?
Python https://www.python.org/
● Created by Guido van Rossum and first released in 1991, Python's design
philosophy emphasizes code readability with its notable use of
significant whitespace
● Python is an interpreted, high-level, general-purpose programming
language.
○ Interpreted – Instructions are executed directly without being compiled into machine-language instructions. Compiled languages, unlike interpreted languages, are faster and give the developer more control over memory management and hardware resources.
○ High-level – allowing us to perform complex tasks easily and
efficiently
88
Why Python for Data Science
● Python competes with many languages in the Data Science world, most
notably R and to a much lesser degree MATLAB, JAVA and C++.
89
Why does Python Lead the Pack?
90
A Crash Course in Python
Introduction to Pandas
Pandas
93
Understanding Pandas
94
Simple Example – Imagine 1000s of rows of data like the table below
⋮ ⋮ ⋮ ⋮
95
We can use pandas to produce this:
First Name Last Name Age General Subject Area Overall Grade Average Mark
⋮ ⋮ ⋮ ⋮ ⋮ ⋮
96
Understanding Pandas DataFrames
(Illustration: a DataFrame containing rows such as “Amit, 77” and “Gretel, 74”; a single column of the DataFrame is a Series.)
97
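As a rough illustration of the DataFrame/Series relationship and the kind of summary table shown a couple of slides back, here is a minimal pandas sketch; the names, subject areas and marks are made up for the example:

```python
import pandas as pd

# A small, made-up student table similar to the one on the slides
df = pd.DataFrame({
    "First Name": ["Amit", "Gretel", "Ravi"],
    "General Subject Area": ["Sciences", "Languages", "Sciences"],
    "Average Mark": [77, 74, 81],
})

# A single column of a DataFrame is a Series
marks = df["Average Mark"]
print(type(df), type(marks))  # DataFrame and Series

# Summarize many rows into a report-style table
summary = df.groupby("General Subject Area")["Average Mark"].mean()
print(summary)
```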
Introduction to Pandas
What is pandas and why is it useful?
Statistics for Data Analysts and Scientists
Statistics – No one likes statistics
100
Statistics – Why is it so important?
And in business…
● What month will I have the most sales, or what time of day?
● Should I take out insurance?
● Is the economy doing well?
101
Everything dealing with forecasting/predicting comes
down to Statistics
102
“Being a Statistician will be the sexiest job over the next decade”
– Hal Varian, Chief Economist, Google
103
The Subfields of Statistics
104
The Subfields of Statistics
106
Descriptive Statistics
Descriptive Statistics
● Descriptive statistics are used to describe or
summarize data in ways that are meaningful and
useful, such that, for example, patterns might
emerge from the data.
108
Examples of Descriptive Statistics
109
What are good Descriptive Statistics?
110
Simple Visualizations Help Us
Understand Descriptive Statistics
● In order to get a better understanding of our data, we need to embark on an Exploratory Data Analysis.
111
Simple Exploratory Analysis

Subject        Mark/Score
English        77
Mathematics    94
Physics        88
Chemistry      91
Biology        0

● Looking at this report card, this student has 5 subjects and has scored a total of 350 marks out of 500. If we look at the average grade we get 70. However, they have a 0 in Biology. Is this a mistake? If we ignore the 0, their average is substantially better at 87.5%
112
We now see the need for actually
examining our data
● This process where we visualize the data is called
Exploratory Analysis.
113
Exploratory Data Analysis
Exploratory Data Analysis (EDA)
115
Methods of EDA
● Histogram Plots
● Scatter Plots
● Violin Plots
● Boxplots
● Many more!
116
Sampling – How to Lie with Statistics
“There are lies, damn lies, and statistics!”
– Mark Twain
118
How to Lie with Statistics?
119
How to Lie with Statistics?
120
More examples
121
Trump Supporters prefer Beer to
Wine!
● What if the general population prefers beer over wine?
● What if Trump supporters typically lived in warmer
states where beer was more popular over wine?
● What if the actual statistics showed 64% of Trump
Supporters preferred beer over wine, while 63% of
Hillary Clinton’s supporters held that same view?
122
Statistical Goal?
.......
123
Sampling
What is Sampling?
.......
125
Sampling Example
127
Good Sampling
128
Stratifying Data
● Think about our wine dataset – if we use the mean from the
entire population of wines, does it accurately reflect the mean
alcohol percentage for wines?
● Yes and No – Yes if we’re thinking about all types of wines, but
generally No because Red and Whites typically have different
alcohol percentages and mixing them together skews this.
129
Tips for Creating Good Stratums
131
Sampling Summary
132
Variables in Data
What are variables?
First Name Last Name Age General Subject Area Overall Grade Average Mark
● Quantitative variable
135
Quantitative vs. Qualitative
Quantitative Qualitative/Categorical
136
Nominal and Ordinal
137
Interval & Ratio
● Ratio scales are similar to interval scales, but they have the added property of having an inherent zero, e.g. height or weight.
138
Continuous and Discrete Variables
● Goals are discreet, there is no way a player Player Goals Height
can score a fraction of a goal. Discrete
variables measure quantity and value, but Player 1 43 177cm
have no interval measurement between
adjacent values. Player 2 25 180cm
● Height however, is a continuous. Just because Player 3 3 177cm
we give whole number integer values, that
doesn’t mean Player 1 and Player 3 are
exactly the same height. Player 1 can be
177.3cm while Player 3 can be 176.9cm. You
can perhaps never have exactly the same
value in height of two people as there would
be nanometer differences in height
139
Frequency Distributions
Why do we collect data?
141
The Data Collection Process
1. Collect Data
2. Analyze Data
3. Use analysis for decision making etc.
142
Let’s explore Frequency counts
First Name Last Name Age General Subject Area Overall Grade Average Mark
143
We can use Frequency Breakdown to
Summarize the data
General Subject Area    Frequency
Sciences                45
Languages               23
Modern Arts             32
144
Histogram Plots
As we saw previously,
Histogram plots
represent this
frequency count well.
145
Making Good Histogram Plots
146
Bad or Confusing Histograms
147
Good Histograms
148
Rules of Thumb for Good Plots
151
Histogram Plots are Frequency Distributions
152
Analyzing Frequency Distribution Plots
153
Normal & Uniform Distributions
154
Examples of Normal & Uniform Distributions
● An example of a Normal Distribution is human heights.
● An example of a uniform distribution is rolling a fair die. After, say, 1000 rolls, we expect roughly equal numbers of 1s, 2s, 3s, 4s, 5s & 6s
155
Analyzing Frequency Distributions
We’ve seen how useful Frequency
Distributions are, but what else can we do?
157
Comparing our Whites and Reds
158
Analyze Frequency Distributions to Find Outliers
159
Outliers Rule of Thumb
160
Inter Quartile Ranges for Red Wines
(Plot: the upper outlier range is marked)
164
The Mean
● The mean is quite simple and we have discussed it before, so you intuitively know that the mean is the same as the average.
165
Common Mean Misconceptions
● Many times, we consider mean to be the expected average i.e.
the center or most common value. However, that is often a
mistake.
● Outliers can skew our means
● Joan, with her 26 phones, skews our mean

Person   Phones Owned
Nancy    6
Amy      5
Savi     7
Joan     26

Mean = (6 + 5 + 7 + 26) / 4 = 11
Mean without Joan = (6 + 5 + 7) / 3 = 6
166
Illustration of Mean Skewing
167
Mathematical Notation for Mean

$$\mu = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$

Where:
• 𝝁 = Mean
• 𝒙ᵢ = each value in our dataset
• 𝑵 = Number of data samples in our dataset

In the summation notation, i = 1 is the index of the summation (starting at the first value of x), N is what we sum up to, and xᵢ is what we sum.
168
Weighted Mean
● Here’s an issue you might encounter with means. If we’re given just the summary data below, how can we calculate the mean sale price of the cars sold across all years?
169
Weighted Mean
● You might assume we can just find the mean of the “Mean Sale Price”
and be done with it. That would be wrong.
● Think about this simple example. We have 2 boys and 3 girls
● Boy 1 weighs 101lbs, Boy 2: 115lbs, Girl 1: 90lbs, Girl 2: 77lbs, Girl 3: 99lbs.
● The average weight of all 5 of them is: 96.4 lbs
● However, if you were given the summarized means separately:
○ Mean Boy Weight = 108 lbs
○ Mean Girl Weight = 88.7 lbs
○ (Mean Boy Weight + Mean Girl Weight) / 2 = 98.33 lbs
● The means aren’t equal!
170
Calculating the Weighted Mean
Child    Weight (lbs)
Boy 1    101
Boy 2    115
Girl 1   90
Girl 2   77
Girl 3   99
Mean     96.4

Child    Mean Weight   Quantity   Expanded Means
Boys     108           2          2 * 108 = 216
Girls    88.7          3          3 * 88.7 = 266

Getting our accurate weighted mean is simple:
(216 + 266) / 5 = 96.4 lbs
171
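A quick way to check this is NumPy's weighted average; a minimal sketch using the group means and counts from the slide:

```python
import numpy as np

# Group means and group sizes from the slide example
means = np.array([108.0, 88.7])   # mean weight of boys, then girls (lbs)
counts = np.array([2, 3])         # number of boys, number of girls

naive_mean = means.mean()                          # (108 + 88.7) / 2 = 98.35 -> wrong
weighted_mean = np.average(means, weights=counts)  # (2*108 + 3*88.7) / 5 ~ 96.4
print(naive_mean, weighted_mean)
```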
Median
● Many people confuse Means and Medians.
● Remember while means are the average of all values, Median is average
of the two middle values or the actual middle value itself (depending on
if the quantity of data is odd or even)
4 6 7 8 10 11 30
Mean = 10.86
MEDIAN
4 6 7 8 10 11 14 41
Mean = 12.63
MEDIAN = (8+10)/2 = 9
172
Median
● Medians are useful because they ignore the effects of outliers
and give us a good idea of a general average of the data
● It’s a robust statistic
● One way to lie with statistics is using means over medians. E.g. the average salary of an employee in company A is 100,000 per year. However, that average or mean can easily be skewed upward by a few executives making millions per year. In fact, 95% of the company can be making less than the mean salary!
173
Mode
● The Mode is simply the most frequent item in a distribution.
● In any Kernel Density plot (KDE in Seaborn), the mode is always
the peak
174
When is Mean, Mode and Median the same?
● For a symmetrical
distribution, we really only
need to describe it by stating
what the mean is and how
wide or narrow the
distribution is.
175
When to use Mean, Mode and Median?
● Medians can be used on the same data type, and are good
for summarizing data when there are outliers (use boxplots
to check)
Source: https://www.statisticshowto.datasciencecentral.com/pearson-mode-skewness/
● Positive or right Skewness is when the mean > median > mode
○ Outliers are skewed to the right
● Negative Skewness is when the mean < median < mode
○ Outliers are skewed to the left
177
Variance, Standard Deviation and
Bessel’s Correction
Measure of Spread/Dispersion
● Mean, mode and median all give us a view of what’s the most
likely or common data point in a dataset.
● Think about human heights, we can use the mean, mode or
median to illustrate the point that the average male human is
5’9”.
● However, this tells us nothing about how common it is to find
someone over 7’
● Or an even more descriptive measure: 95% of men lie within what range of heights?
179
The Range of our Data
● In our Wine dataset we can see the max and min alcohol percentages were 14.9% and 8.0%
● The weakness in using Range is that it only uses the max and min.
● What if we had a distribution that was:
○ 10, 10, 11, 12, 11, 10, 11, 12, 200
○ This distribution has a range of 200 − 10 = 190, which gives us the impression that our values swing widely by up to 190; however, in reality our data is consistently varying between 10-12 with one major exception.
181
The lead up to Variance – Difference
● 10, 10, 11, 12, 11, 10, 11, 12, 200 ……… Mean = 31.88

Values below the mean (x − μ):
10 − 31.88 = −21.88
10 − 31.88 = −21.88
10 − 31.88 = −21.88
11 − 31.88 = −20.88
11 − 31.88 = −20.88
11 − 31.88 = −20.88
12 − 31.88 = −19.88
12 − 31.88 = −19.88
Total: −168.11

Values above the mean (x − μ):
200 − 31.88 = 168.11
Total: 168.11

● The Difference = (−168.11 + 168.11) / 9 = 0
● How do we solve this problem?
182
Use Mean Absolute Distance
183
Use Mean Squared Distance – Variance
● Mean Squared Distance is written mathematically as:

$$\text{Mean Squared Distance} = \frac{(x_1-\mu)^2 + (x_2-\mu)^2 + \dots + (x_N-\mu)^2}{N} = \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}$$

● So from our previous example, we get:
● (21.88² + 21.88² + 21.88² + 20.88² + 20.88² + 20.88² + 19.88² + 19.88² + 168.11²) / 9 ≈ 3533
● Mean Squared Distance is commonly called the Variance
● Standard Deviation is the square root of the Variance:

$$\text{Standard Deviation} = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}}$$
184
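For reference, a small sketch of how these quantities can be computed with NumPy on the example data above; the ddof argument is how NumPy applies Bessel's correction (covered next):

```python
import numpy as np

data = np.array([10, 10, 11, 12, 11, 10, 11, 12, 200])

mean = data.mean()        # ~31.89
variance = data.var()     # population variance, ~3533
std_dev = data.std()      # square root of the variance, ~59.4
print(mean, variance, std_dev)

# Bessel's correction (divide by N - 1) gives the sample estimate
sample_std = data.std(ddof=1)
print(sample_std)
```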
Variance and Standard Deviation
185
Coefficient of Variation (CV)
● When working with a sample rather than the full population, the sample Standard Deviation uses Bessel’s correction (dividing by N − 1 instead of N):

$$s = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\bar{x})^2}{N-1}}$$

● The Coefficient of Variation is the Standard Deviation divided by the Mean, which lets us compare spread across datasets with different scales.
187
Covariance and Correlation
Covariance
189
Covariance
Subject Height Weight
1 174 136
: 184 175
N 173 120
190
Covariance
● We can see that as height
increases so does the
weight, and vice versa.
$$Cov(x, y) = \sigma_{xy} = \sum_{i=1}^{N}\frac{(x_i - \mu_x)(y_i - \mu_y)}{N}$$
191
Working through the Covariance
X: 1, 4, 5, 7 → Mean = 17 / 4 = 4.25
192
The Weakness of Covariance
● This means we can’t compare covariances across data sets with different scales (like pounds and inches).
193
Correlation
$$\text{Correlation} = \rho = \frac{Cov(x,y)}{\sigma_x \sigma_y}$$
195
(Worked example table with columns x, y, (x − x̄), (y − ȳ), (x − x̄)², (y − ȳ)², and (x − x̄)(y − ȳ), where x̄ = 4.25 and ȳ = 0.6)
196
Correlation Matrix
● We can use Pandas and Seaborn to produce very informative correlation plots.
197
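A minimal sketch of such a plot, assuming a wine-quality CSV is available locally; the filename, separator and columns are placeholders for whatever file you use:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder filename; depending on the file, you may need sep=";"
df = pd.read_csv("winequality-red.csv")

corr = df.corr()  # pairwise Pearson correlations between numeric columns
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Pairwise scatter plots of every feature against every other
sns.pairplot(df)
plt.show()
```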
(Correlation heatmap of the wine dataset: some features show a negative correlation with quality, others a positive correlation with quality)
198
Pairwise Plots
199
Lying with Correlations – Divorce
Rates in Maine caused by
Margarine Consumption?
201
Source: http://www.tylervigen.com/spurious-correlations
Lying and Misleading with Statistics
202
The Normal Distribution & the
Central Limit Theorem
Normal or Gaussian Distributions
204
Normal Distributions Example
205
Describing Normal Distributions
Notation: N(μ, σ²)
206
Central Limit Theorem
● One of the most important theorems in Statistics is said to be
the Central Limit Theorem.
207
Sampling a Population
Source: https://www.superdatascience.com/courses/statistics-business-analytics-a-z/
208
Sampling a Population
● Imagine we are sampling the distribution shown in row 1.
● We take several samples (rows 2 to 6)
● The red X in rows 2 to 6 represents the mean of each sample
● Row 7 shows the distribution of our means taken from the samples.
● It follows a normal distribution!
Source: https://www.superdatascience.com/courses/statistics-business-analytics-a-z/
209
Try this experiment:
http://onlinestatbook.com/stat_sim/sampling_dist/index.html?source=post_page-----a67a3199dcd4----------------------
210
Z-Scores and Percentiles
Standard Deviations Revisited
213
Transforming entire Distributions to Z-Scores
Original Distribution Z-Transform
214
Z-Score Means are always 0 & Standard Deviations always 1
215
Why do we use Z-Scores?
● This brings us to how we can use Z-Scores to determine your percentile. Let’s
go back to our previous example
● Ryan’s Z-Score was 0.2 and Sara’s 0.3
● Ryan is at the 0.5793 quantile (the 57.93rd percentile), meaning he scored better than about 58% of his classmates
Source: http://i1.wp.com/www.sixsigmastudyguide.com/wp-content/uploads/2014/04/z-table.jpg
217
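A quick way to look up these percentiles programmatically is the normal CDF in SciPy; a small sketch:

```python
from scipy.stats import norm

# Percentile (area to the left) for a given z-score
print(norm.cdf(0.2))   # ~0.5793 -> roughly the 58th percentile
print(norm.cdf(0.3))   # ~0.6179

# Going the other way: the z-score for a given percentile
print(norm.ppf(0.95))  # ~1.645
```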
Probability – An Introduction
Coin Flipping – What are the odds of a Heads?
● Probability is a measure
quantifying the likelihood that
events will occur.
219
Why is Probability Important?
● In statistics, data science and Artificial Intelligence, obtaining the probabilities of events is essential to making a successful ‘smart’ decision, or in the Business World, a calculated risk.
○ What is the probability of my Marketing Campaign succeeding?
○ What is the probability that Customer A will buy products X, Y & Z?
220
Probability Scope
● In this section, we’ll learn to estimate probabilities both
theoretically and empirically
● The rules of Probability
○ The Addition Rule
○ The Product Rule
● Permutations and Combinations
● Bayes Theorem
221
Estimating Probability
Empirical or Experimental Estimates
223
Empirical or Experimental Estimates
224
Let’s attempt this in Python
225
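A minimal sketch of the kind of empirical estimate meant here, simulating many coin flips:

```python
import random

# Empirically estimate P(Heads) by simulating many coin flips
trials = 100_000
heads = sum(random.choice(["H", "T"]) == "H" for _ in range(trials))
print(heads / trials)  # should be close to 0.5
```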
Probability as a Percent
● 0 ≤ P(Event) ≤ 1
226
Probabilities of All Events Must Sum to 1
227
Simple Probability Question
228
Probability – Addition Rule
Probability and Certainty
● From our random experiments, we get a very good idea of the chance or likeliness of an event occurring. However, it’s still a prediction based on random experimentation, and even if an event is 99% likely to occur, we can still expect the 1% event to occur roughly once in 100 trials
● In a coin toss event, we have two outcomes: Heads or Tails
● In a dice roll we have 6 possible outcomes for each event.
● The Omega symbol Ω is used to signify the set of all possible outcomes; this is known as the Sample Space:
○ Ω(dice) = {1, 2, 3, 4, 5, 6}
○ Ω(coin) = {Heads, Tails}
230
Probability Multiple Events
● Imagine we now have two coins being tossed simultaneously
232
The Addition Rule
● Let’s roll a die and answer the question:
○ What is the probability of getting a 1 or a 6? P(1 ∪ 6) = 1/6 + 1/6 = 2/6 = 1/3
● The same rule applies to drawing a marble from a bag with 12 green marbles, 8 blue marbles and 10 others:

$$P(\text{green} \cup \text{blue}) = \frac{12}{12+8+10} + \frac{8}{12+8+10} = \frac{12}{30} + \frac{8}{30} = \frac{20}{30} = \frac{2}{3}$$
235
More on the Addition Rule
● Previously our events were Mutually Exclusive, that is, both outcomes cannot happen together.
● We can illustrate this using a Venn diagram to show there is no
intersection between the events.
236
Example of Mutually Exclusive Events
● Let’s say we have a class of 13 students, 7 female and 6 male.
● We’re forming a small committee involving 3 of these students
● What is the probability that the committee has 2 or more female
students?
● Let’s consider the events that allow this possibility
○ Event A – 2 females and 1 male is chosen
○ Event B – 3 females are chosen
● These two events cannot occur together, meaning they’re mutually
exclusive.
237
Example of Non-Mutually Exclusive Events
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
239
Other Probability Problems
● What is the probability of pulling 3 Aces in a row?
● There are 4 aces in a deck of 52 cards
● So, the probability of getting one ace is 4/52
● So the probability of getting a second ace would be the same?
● Nope, there are now 3 aces left and 51 cards left.
● P(3 Aces) = 4/52 × 3/51 × 2/50 = 24/132,600 = 0.000181 or 0.0181%
● This is an example of sampling without replacement; if we were to place the Ace back into the deck, this would be sampling with replacement.
240
Probability – Permutations & Combinations
Permutations and Combinations
242
Determining the Number of Outcomes
● Let’s explore tossing two coins:
● A – Coin 1
● B – Coin 2
● Ω(coin) = {Heads, Tails}
● Number of Outcomes for A = 2
● Number of Outcomes for B = 2
● Total number of Outcomes = a x b = 2 x 2 = 4
● This is known as the Product Rule
(Tree diagram: each outcome of Coin A branches into the outcomes of Coin B)
243
Number of Outcomes
● Exploring this concept further, let’s consider a combination lock
with a 2-Digit code.
● What are the odds of guessing the code on your first guess?
● P(E) = Number of successes / Total number of outcomes
● Ω₁ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
● Ω₂ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
● P(E) = 1/10 × 1/10 = 1/100 = 0.01
● We can extend this to any combination lock:
● 4 Digit = 1/10,000 = 0.0001
244
Permutations
● As we just saw, we explored an ordered set of numbers in order to get the probability of guessing the combination lock.
● An ordered combination of elements is called a Permutation
● Order matters because, let’s say the unlock code was 1234,
reordering the sequence to 4321 or 2341 would not work. Order
is important!
245
Two Types of Permutations
● Permutations with repetition - In our previous example with the
combination lock, we can see digits can be repeated.
○ Therefore, calculating the number of possibilities is quite easy.
○ Total Number of Outcomes for N choices = Ω = n₁ × n₂ × n₃ × …
● Permutations without repetition – These are instances where once
an outcome occurs, it is no longer replaced.
○ Let’s look at 16 numbered pool balls
○ When randomly choosing our first ball, there are 16 possibilities.
○ After choosing this ball, it is removed, so we’re now left with 15
○ To calculate the number of possible orderings we do:
16 × 15 × 14 × 13 × 12 … × 1, which is known as 16! or 16 Factorial
246
Permutations Without Repetition
● Let’s say we wanted to choose 3 random pool balls from our set of 16
● How many different ordered selections of balls can there be? E.g. 5,3,12 or 15,2,6 etc.
● Simple: 16!/13! = 16 × 15 × 14 = 3360 permutations
● The formula we apply is written as:
○ n! / (n − r)!
247
Combinations
● Before, we just looked at the two types of Permutation (with and
without repetition). In Permutations order mattered.
● In the scenario where it doesn’t matter is called Combinations
Order does matter Order doesn’t matter
123
132
213
123
231
312
321
248
Calculating Combinations Example
● If we don’t care about the order, the number of Combinations is
given by:
)!
○
+! )"+ !
(,! (,! /0,2//,342,444,000
○ = = = 560
-! (,"- ! -!×(-! ,×,,//3,0/0,400
249
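These counts can be checked directly in Python (math.perm and math.comb require Python 3.8+); a small sketch:

```python
import math

# Permutations without repetition: ordered picks of 3 balls from 16
print(math.perm(16, 3))  # 16 * 15 * 14 = 3360

# Combinations (order doesn't matter): 16 choose 3
print(math.comb(16, 3))  # 560

# Same result built from factorials
print(math.factorial(16) // (math.factorial(3) * math.factorial(13)))  # 560
```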
Combinations with Repetition
● Now this brings us the to the last bit of this chapter where we
look at Combinations with Repetition.
● In our last slide, the values were not repeated, e.g. in the case of combinations of 1, 2, 3 we could only use each digit once.
● In the situation where there can be repetition, the formula is as follows:

$$\frac{(r+n-1)!}{r!\,(n-1)!}$$
250
Bayes Theorem
Introduction
255
Bayes Theorem
“In probability theory and statistics, Bayes’ theorem (alternatively Bayes’
law or Bayes’ rule) describes the probability of an event, based on prior
knowledge of conditions that might be related to the event. For example, if cancer
is related to age, then, using Bayes’ theorem, a person's age can be used to more
accurately assess the probability that they have cancer than can be done without
knowledge of the person’s age.” Wikipedia
● P(A|B) = P(A and B) / P(B)
256
Bayes Theorem Demonstrated
Let’s explore a practical example:
● 1% of women have breast cancer (99% do not)
● 80% of mammograms detect cancer correctly, so 20% of them are wrong
● 9.6% of mammograms report cancer being detected when it is in fact not
present (False Positive)
257
Bayes Theorem Demonstrated
● So let’s suppose you or someone you know unfortunately, tests positive for
breast cancer.
● This means we got the top-row outcome (a positive test result).
● Probability of a True Positive (meaning we have cancer and it was a positive
result) = 1% x 80% = 0.008
● Probability of a False Positive (meaning we don’t have cancer and it was a positive result) = 99% x 9.6% = 0.09504
258
Bayes Theorem Demonstrated
● Remember, Probability is equal to:
● P(Event) = Event / all possible outcomes
● P(Having Cancer | Positive Test) = 0.008 / (0.008 + 0.09504) = 0.008 / 0.10304 = 0.0776 or 7.8%
259
Bayes Theorem Problem Summary
$$P(\text{Cancer}|\text{Positive}) = P(C|P) = \frac{P(P|C)\,P(C)}{P(P|C)\,P(C) + P(P|\sim C)\,P(\sim C)}$$

P(C|P) = Probability of having Cancer given a Positive test ≈ 7.8%
P(P|C) = Probability of a Positive test given that you had Cancer = 80%
P(C) = Probability of having Cancer = 1%
P(~C) = Probability of not having Cancer = 99%
P(~P) = Probability of the Test being Negative
P(P|~C) = Chance of a Positive test given you did not have Cancer = 9.6%
260
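A small sketch that plugs the numbers from this example into Bayes' Theorem:

```python
# Numbers from the mammogram example on the slides
p_cancer = 0.01                # P(C)
p_pos_given_cancer = 0.80      # P(P|C)
p_pos_given_no_cancer = 0.096  # P(P|~C)

numerator = p_pos_given_cancer * p_cancer
denominator = numerator + p_pos_given_no_cancer * (1 - p_cancer)

p_cancer_given_pos = numerator / denominator
print(p_cancer_given_pos)  # ~0.0776, i.e. about 7.8%
```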
Hypothesis Testing Introduction
What is a Hypothesis?
● A hypothesis is a proposed explanation (unproven) for a
phenomenon.
262
Testing a Hypothesis
● Take the hypothesis “If we change the color of the Add to Cart button on our e-commerce website, we’ll increase sales”.
● To test this, we can foresee many problems.
○ What if the change was carried out when the end of the month was approaching (i.e. sales were naturally going to increase)?
○ What if the company increased ads around that same time?
○ What if there was some random event that resulted in a change in sales?
● As you can see, there are many difficulties when testing a Hypothesis; in reality this is one of the more controversial uses of Statistics.
263
Framing of Hypothesis
Null Hypothesis
● This describes the existing conditions or present state of affairs.
Examples:
1. All lilies have 3 petals
2. The number of children in a household is unrelated to the number of televisions owned
3. Changing the Add to Cart button color on our e-commerce
website, will have no effect on our sales
264
Framing of Hypothesis
Alternative Hypothesis
● This is used to compare or contrast against the Null Hypothesis
● Example: If we change the color of the Add to Cart button on
our e-commerce website, we’ll increase sales.
Null Hypothesis – Users exposed to the new Add to Cart button did
not change their tendency to make a purchase
Alternative Hypothesis – Users exposed to the new Add to Cart button showed increased sales.
265
Statistical Significance
Research Design – Blind Experiment
267
Formalize our Hypothesis
Null Hypothesis
Subjects taking the new flu medication did not change the duration
of the flu symptoms compared to those who took the placebo.
Alternative Hypothesis
Subjects taking the new flu medication had the duration of their flu
symptoms reduced compared to those who took the placebo
268
Research Design – Blind Experiment
269
Statistical Significance
● Once we obtain the results from the two groups, how do we
interpret the results with any degree of certainty?
270
Comparing our Means
x̄₁ = 5.4
x̄₂ = 4.8
Our Null Hypothesis stated:
● x̄₁ − x̄₂ = 0
Our Alternative Hypothesis stated:
● x̄₁ − x̄₂ > 0
● 5.4 − 4.8 = 0.6 > 0
Therefore, 5.4 is greater than 4.8.
● Should our Null Hypothesis be rejected?
271
Would this Happen Every Time?
272
Permutation Test
● Each time we redo this experiment, it is called a Permutation Test.
273
Sampling Distribution
● Our sampling distribution approximates a full range of possible
test statistics for the null hypothesis.
● We then compare our control group mean (5.4) to see how likely it
is to observe this mean.
274
Simulation of Re-trials
● In the real world we can’t re-run our control group study several
times, due to time/money and effort constraints.
● However, we don’t need to. Here’s the trick.
Control Experimental
275
Recall Our Results
Subject Group (Randomly Assigned) Flu Duration
1 Control 4 Days
2 Control 3 Days
3 Experimental 3 Days
4 Control 5 Days
⋮
80 Experimental 5 Days
276
Simulation of Re-trials
● We now take some of the individual data points and randomly
assign them to the other group.
● We then calculate the new mean after this randomization
Control Experimental
277
Our Randomized Assignments
● Total in each group remains the same (i.e. 40 each)
278
Record the Means of Each Simulated Trial
● For each iteration or trial, we log the mean and the Mean Difference
● The Mean Difference is our initial mean of 5.4 (Control) minus the mean of the randomized trial re-assignments.
281
Let’s run these Simulations in Python
(Histogram of simulated mean differences, with the observed difference of 0.6 marked)
282
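A minimal sketch of such a permutation-test simulation, using made-up flu-duration data rather than the course's actual dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical flu-duration data (days) for 40 control and 40 experimental subjects
control = rng.normal(5.4, 1.5, 40).round()
experimental = rng.normal(4.8, 1.5, 40).round()

observed_diff = control.mean() - experimental.mean()

# Permutation test: shuffle group assignments many times and recompute the difference
pooled = np.concatenate([control, experimental])
diffs = []
for _ in range(10_000):
    rng.shuffle(pooled)
    diffs.append(pooled[:40].mean() - pooled[40:].mean())
diffs = np.array(diffs)

# P-value: fraction of shuffled differences at least as large as the observed one
p_value = (diffs >= observed_diff).mean()
print(observed_diff, p_value)
```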
What can we take away from this?
● We see that our real-life mean difference of 0.6 lies a bit to the far right.
● We see that the mean in our experimental iterations is almost zero.
○ This means the mean difference between groups is almost purely random and there is no real difference between groups
● But how do we show this statistically?
● What is the probability of a mean difference of 0.6 occurring?
○ Let’s check our mean difference values and see how many of these values were greater than or equal to 0.6
283
P-Value
284
Let’s Determine our P-Value in Python
● This means our difference in means i.e. 5.4 vs. 4.8 was statistically
significant.
285
More on P-Values
● We’ve shown that our results support our Alternative Hypothesis – i.e. our
new flu medication reduces the length of the flu.
● However, always remember P-values are the probability that what you
measured is the result of some random fluke.
● Saying our P-Value is 3.81% means our results have a 3.81% chance of being
attributed to random coincidence.
● However, 3.81% is small and is lower than the typical 5% threshold used for P-Values. Therefore, the result is Statistically Significant.
286
Hypothesis Testing – Pearson Correlation
Pearson Correlation Coefficient
288
Pearson Correlation Coefficient
289
Pearson Correlation Coefficient
Subject   x (age)   y (income)
1         22        35000
2         25        22000
3         45        88000
4         31        72000
5         33        37000
6         62        69000
7         42        48000
8         39        43000
9         26        19000
10        55        33000

(Scatter plot of y against x)
290
Pearson Correlation Coefficient
● Hypothesis Testing with Pearson tells us whether we can conclude two variables
correlate or influence each other in a way that is Statistically Significant
● r ranges from -1 to +1
○ Values close to 0 indicate no correlation
○ -1 indicates an inverse relationship and strong correlation
○ +1 indicates a positive relationship and strong correlation
● r measures the strength and direction of a linear relationship between two variables on a scatterplot
291
Calculating the Pearson Correlation Coefficient
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

Where: Σx = sum of x scores, Σy = sum of y scores, Σxy = sum of the products of paired scores, Σx² and Σy² = sums of the squared scores, n = number of paired scores
292
Calculating our Pearson Correlation
Coefficient
1. Null Hypothesis: No correlation between income and age
2. Define Alpha (our measure of significance strength, lower is stronger) – We’ll use 0.05
3. Find Degrees of Freedom – Our sample has 10 subjects and the formula finding degrees
of freedom is DF = n – 2 = 8 (in our case)
4. Use an r-Table to find the Critical Value of r (i.e. the threshold value we use to reject our
Null Hypothesis). In our case using an Alpha of 0.05 we get a critical r = 0.632 from our r-
tables. If our r is greater than our critical r, we reject the Null Hypothesis
5. Apply Formula:
293
http://statisticslectures.com/tables/rtable/
294
Calculating our Pearson Correlation Coefficient
Subject x y x2 y2 xy
1 22 35000 484 1225000000 770000
2 25 22000 625 484000000 550000
3 45 88000 2025 7744000000 3960000
4 31 72000 961 5184000000 2232000
5 33 37000 1089 1369000000 1221000
6 62 69000 3844 4761000000 4278000
7 42 48000 1764 2304000000 2016000
8 39 43000 1521 1849000000 1677000
9 26 19000 676 361000000 494000
10 55 33000 3025 1089000000 1815000
SUM 380 466000 16014 2.637E+10 19013000
295
Calculating our Pearson Correlation Coefficient
r = 1,305,000 / ((39.67)(68,223.16)) = 0.482
Since 0.482 is less than our critical r of 0.632, we fail to reject the Null Hypothesis.
296
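For comparison, SciPy computes the same coefficient (and a p-value) directly; a small sketch using the age/income data above:

```python
from scipy.stats import pearsonr

# Age and income data from the worked example
x = [22, 25, 45, 31, 33, 62, 42, 39, 26, 55]
y = [35000, 22000, 88000, 72000, 37000, 69000, 48000, 43000, 19000, 33000]

r, p_value = pearsonr(x, y)
print(r, p_value)  # r ~ 0.48; the p-value tests the null hypothesis of no correlation
```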
Introduction to Machine Learning
Machine Learning
● Machine Learning is almost synonymous with Artificial Intelligence (AI) because it entails the study of how software can learn.
● It has seen a rapid explosion in growth in the last 5-10 years due to
the combination of incredible breakthroughs in new algorithms
such as Deep Learning, combined with almost exponential
increases in CPU power, especially in parallel operations (GPU and
TPU) which allowed for huge improvements in training Deep
Learning networks.
298
Explicit Programming
● We said Machine Learning allows software to learn
without explicitly programming the information.
But what do we mean by that exactly?
You
300
We would create some Explicit
rules to determine which
customer will buy the American
football
You! Buy
my football!
301
Explicit Programming is difficult
If male then
If age is between 18 and 25 years then:
If location is City A then:
If income between 10000 and 20000 then:
303
We Need a Better Solution!
304
How Machine Learning enables Computers to Learn
Machine Learning to the Rescue!
307
Teaching a Machine Learning Algorithm
309
Human Learning Example
● An ice cream salesman over time
will know who the best target
customers are (hint: children)
● Thus he/she would use things like
balloons or musical ice cream
trucks to get their attention
310
Another Human Learning Example
● Babies learn the names of animals by
viewing pictures of them
311
Example of Training Data
Gender Age Credit Rating Declared Bankruptcy
Female 45 900 No
Male 23 850 No
Male 29 890 No
⋮ ⋮ ⋮ ⋮
Female 39 954 No
312
We can then assess the Performance of our Machine Learning Model
on Test Data. Basically it’s just like Tests or Examinations we did in
school
⋮ ⋮ ⋮ ⋮
Male 59 859 No No
313
What is a Machine Learning Model?
What is a Machine Learning Model?
315
Machine Learning Models are Equations!
(Diagram: input data x₁, x₂, …, xN → Machine Learning Model → Output)
316
What do these Equations look like?
● Model = w₁x₁ + w₂x₂ + w₃x₃ + b = wᵀx + b
317
Finding the Gradient of a Line given two points

$$m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{\Delta y}{\Delta x}$$

$$m = \frac{1 - 3}{-1 - 4} = \frac{-2}{-5} = \frac{2}{5}$$
● The equation for a straight line is:
○ 𝒚 = 𝒎𝒙 + 𝒄
● We need to find c, which is the y-intercept (i.e. the value of y when x = 0)
318
Finding the Equation of line given two points
● y = mx + c
● 1 = (2/5)(−1) + c
● 1 + 2/5 = c
● c = 7/5
● y = (2x)/5 + 7/5
● 5y = 2x + 7
319
Why did we do this?
320
Olympic 100m Gold Times Over Time
(Scatter plot: Olympic 100m gold medal times in seconds against year)
321
We’ll use Least Squares Method to get Equation of the line
● y = 9.53 (the predicted 2020 gold medal time)
(Scatter plot of the gold medal times with the least squares line overlaid)
322
Least Squares Method
323
Least Squares Method
● A Machine Learning model is simply a mathematical function that transforms the inputs into an output relevant to our objective.
(Scatter plot of time against year with the fitted line)
327
328
https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/machine-learning.png
Supervised Learning
● Supervised learning is by far the most popular form of AI and ML used today.
● We take a set of labeled data (called a dataset) and we feed it in to some ML
learning algorithm that then creates a model to fit this data to some outputs
● E.g. let’s say we give our ML algorithm a set of 10k spam emails and 10k non spam.
Our model figures out what texts or sequences of text indicate spam and thus can
now be used as a spam filter in the real world.
(Diagram: input data x₁, x₂, …, xN → Supervised Machine Learning → Output Labels)
329
Supervised Learning In Business
330
Unsupervised Learning
● Unsupervised learning is concerned with finding interesting clusters of input data. It
does so without any help of data labeling.
● It does this by creating interesting transformations of the input data
● It is very important in data analytics when trying to understand data
● Examples in the Business world:
o Customer Segmentation
(Diagram: inputs {Chicken, Pork, Beef, Vanilla, Chocolate} → Unsupervised Machine Learning → Cluster 1, Cluster 2)
331
Reinforcement Learning
● Reinforcement learning is a type of learning
where an agent learns by receiving rewards
and penalties.
● Unlike Supervised Learning, it isn’t given the
correct label or answer. It is taught based on
experience
● Usually applications are AI playing games
(e.g. DeepMind’s Go AI) but it can be applied
to Trading Bots (we’ll be making one later!)
332
ML Supervised Learning Process
333
ML Terminology
● Supervised Learning
● Regression – Linear Regression, Polynomial and Multivariate Regression
● Classification – Logistic Regression, Support Vector Machines, KNN, Naïve Bayes, Decision Trees and Random Forests, Neural Networks & Deep Learning
● Unsupervised Learning
○ Clustering – K-Means and many more
● Reinforcement Learning
335
Linear Regression – Introduction to Cost
Functions and Gradient Descent
Linear Regression
● A Linear Regression is a statistical approach to modeling the
relationship between a dependent variable (y) and one or more
independent variables (x).
● Basically, we want to know the regression equation that can be
used to make predictions.
● Our model uses linear predictor functions whose parameters
(similar to the m and c in ‘y=mx+c’) are estimated using the training
data.
● Linear Regressions are one of the most popular ML algorithms due to their simplicity and ease of implementation.
337
Linear Regression for one Independent Variable
● y = f(x) = mx + b
338
Linear Regression for one Independent Variable
339
Loss Functions
y = f(x) = mx + b
● How do we find the most appropriate values of m and b?
● We can measure the accuracy or goodness of a Linear Regression model by finding the error difference between the predicted outputs and the actual outputs (ground truth).
340
Loss Functions: Mean Squared Error (MSE)
$$\text{Mean Squared Error}(m, b) = \frac{1}{N}\sum_{i=1}^{N}(\text{Actual Output}_i - \text{Predicted Output}_i)^2$$
● Before we get into this, let’s look at how we find the values for m and
b
341
Finding the values of m and b
342
Introducing Gradient Descent
○ Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2N}\sum_{i=1}^{N}(h_\theta(x_i) - y_i)^2$
● We treat m and b as θ₀ and θ₁
● The above equation simply tells us how wrong the line is by measuring how far the predicted value h_θ(xᵢ) is from the actual value yᵢ
● We square the error so that we don’t end up with negative values
● We divide by 2 to make updating our parameters easier
343
What is Gradient Descent
$$J(\theta_0, \theta_1) = \frac{1}{2N}\sum_{i=1}^{N}(h_\theta(x_i) - y_i)^2$$
345
Gradient Descent Method
346
Gradient Descent Method Visualized
347
Gradient Descent Method Visualized
348
Line Fitting Visualized - Iteratively
349
Gradient Descent Method Visualized
350
Gradient Descent by Data Scientist Bhavesh Bhatt
Linear Algebra Representation
● f⁽ⁿ⁾ = f(x⁽ⁿ⁾; w, b) = wᵀx⁽ⁿ⁾ + b
351
Let’s Do Some Linear Regressions in Python
352
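A minimal sketch of a linear regression with scikit-learn, using a few approximate Olympic-style (year, winning time) pairs as stand-in data; the course notebook uses the full dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative (year, winning time in seconds) pairs
X = np.array([[1960], [1972], [1984], [1996], [2008], [2016]])
y = np.array([10.2, 10.14, 9.99, 9.84, 9.69, 9.81])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # the learned m and b
print(model.predict([[2020]]))        # extrapolated 2020 time
```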
Polynomial and Multivariate Linear
Regressions
A Polynomial Regression
● A Polynomial Regression is a type of
linear regression in which the
relationship between the independent
variable x and dependent variable y is
modeled as an nth degree polynomial.
● Polynomial regression fits a nonlinear
relationship between the values of x and
the corresponding outputs or dependent
variable y.
354
Polynomial Regression
y = m₁x + m₂x² + m₃x³ + ⋯ + mₙxⁿ + b
355
Polynomial Regression In Python
356
Multivariate Linear Regression
x1 x2 Y
342 32 32
235 36 23
357
Multivariate Linear Regression
● y = f(x) = b + w₁x₁ + w₂x₂ + ⋯ + wₙxₙ = b + Σᵢ₌₁ⁿ wᵢxᵢ
358
Multivariate Regression In Python
359
Logistic Regression
Logistic Regression
● Previously with Linear Regressions we were predicting a continuous
variable, like our 2020 Olympic 100m Gold time of 9.53 seconds.
364
Theory behind Logistic Regressions
365
Logistic Regressions Create Decision Boundaries
366
Logistic Regression – Cost or Loss Function
$$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
368
Convex vs. Non-Convex
source: https://medium.freecodecamp.org/understanding-gradient-descent-the-most-popular-ml-algorithm-
a66c0d97307f; https://www.cs.ubc.ca/labs/lci/mlrg/slides/non_convex_optimization.pdf
369
Other Machine Learning Classification
Algorithms
● Logistic Regressions aren’t the only class predictors, there are many
useful algorithms in Machine Learning such as:
○ Support Vector Machines
○ Random Forests
○ Neural Networks / Deep Learning
370
Support Vector Machines (SVMs)
Support Vector Machines (SVMs)
How do we
know which
decision
boundary is
best?
372
Optimal Hyperplane
373
Optimal Hyperplane in Multiple Dimensions
374
Support Vectors
375
Hyperplane Intuition
376
The Kernel Trick – Handling Non-Linear Data
377
Decision Trees and Random Forests
Decision Trees
379
Decision Trees
Root Node
380
Decision Trees – 3 Classes
381
The Decision Tree Algorithm
382
Defining and Evaluating our splits
383
Gini Gain Calculation
● Firstly we calculate the Gini Gain (called
impurity) for the whole dataset.
● This is the probability of incorrectly classifying a randomly selected element in the dataset.
$$G_{initial} = \sum_{i=1}^{C} p_i(1 - p_i)$$
● 𝑪 – number of classes
● 𝒑𝒊 – probability of randomly choosing
element of class i
384
Gini Gain Calculation
$$G_{initial} = \sum_{i=1}^{C} p_i(1 - p_i)$$

Where C = 2, p(1) = 4/7 and p(2) = 3/7
G = p(1) × (1 − p(1)) + p(2) × (1 − p(2))
G = 4/7 × (1 − 4/7) + 3/7 × (1 − 3/7)
G = 24/49 ≈ 0.49
385
Gini Gain Calculation for x = 0.5
386
Gini Gain Calculation for x = 0.5
G_initial = 0.49,  G_left = 0,  G_right = 0.5
We do this for each split and use the one with the
Highest Gini Gain
387
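A small sketch of the Gini impurity calculation used to score candidate splits; the class counts below are illustrative:

```python
# Gini impurity for a node described by its per-class counts
def gini(class_counts):
    total = sum(class_counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in class_counts]
    return sum(p * (1 - p) for p in probs)

print(gini([4, 3]))  # whole dataset from the slides: 24/49 ~ 0.49
print(gini([3, 0]))  # a pure node: 0.0
print(gini([2, 2]))  # a maximally mixed two-class node: 0.5
```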
Gini Gains for Multiple Classes
388
Random Forests
● Once we understand Decision Trees the intuition for
Random Forests makes sense.
389
K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNNs)
● KNNs are one of the most popular machine learning
algorithms as it gives somewhat decent performance and is
also quite easy to understand and implement.
391
The KNN Algorithm – Step by Step
392
The KNN Algorithm – Visually
Scatter plot of our training data Let’s classify our new red input
393
The KNN Algorithm – Using k = 3
394
Choosing K
● As we can see from our last diagram, the larger k, the more
points it considers. But how do we go about knowing which k
to choose?
● If k = 1, we will only assign classes considering the closest
point, this can lead to a model that overfits!
● However, as k goes up, we begin to simplify the model too much, thus leading to a model that is under-fitting (high bias and low variance)
● Let’s visualize this!
395
● Notice how our decision
boundary becomes
smoother as k increases
(less overfitting) but
only up to a point!
396
Disadvantages of KNNs
● As you may have noticed, every time we wish to calculate a new
input, we need to load all our training data, then calculate the
distances to every point! This is exhaustive and negatively impacts
performance, not to mention the high memory usage it requires.
● Also of note, for KNN to work well, we need to normalize or scale all
the input data so that we can calculate the distances fairly.
Common distance metrics used are Euclidean distance or Cosine
Distance.
● Datasets with a large number of features (i.e. dimensions) will
impact KNN performance. To avoid overfitting we thus need even
more data for KNN which isn’t always possible. This is known as the
Curse of Dimensionality.
397
Assessing Performance – Accuracy,
Confusion Matrix, Precision and Recall
Assessing Model Accuracy
● Accuracy is simply a measure of how much of our training data did our
model classify correctly
● Accuracy = Correct Classifications / Total Number of Classifications
399
Is Accuracy the only way to assess a model’s
performance?
● While very important, accuracy alone doesn’t tell us the whole story.
● Imagine we’re using a Model to predict whether a person has a life threatening disease
based on a blood test.
● There are now 4 possible scenarios.
1. TRUE POSITIVE- Test Predicts Positive for the disease and the person has the disease
2. TRUE NEGATIVE - Test Predicts Negative for the disease and the person does NOT
have the disease
3. FALSE POSITIVE - Test Predicts Positive for the disease but the person does NOT
have the disease
4. FALSE NEGATIVE - Test Predicts Negative for the disease but the person actually has
the disease
400
For a 2 or Binary Class Classification Problem
● Recall – Of all the actual positives, how many did we correctly identify?
○ Recall = True Positives / (True Positives + False Negatives)
● Precision – When predicting positive, how many of our positive predictions were right?
○ Precision = True Positives / (True Positives + False Positives)
● F-Score – The harmonic mean of Precision and Recall:
○ F-Score = (2 × Recall × Precision) / (Precision + Recall)
401
Confusion Matrix Real Example
Let’s say we’ve built a classifier to identify Male Faces in an image, there are:
● 10 Male Faces & 5 Female Faces in the image
● Our classifier identifies 6 male faces
● Out of the 6 male faces - 4 were male and 2 female.
                     Male      Not Male
Predicted Male       TP = 4    FP = 2
Predicted Not Male   FN = 6    TN = 3
403
Confusion Matrix for Multiple Classes
● We can use scikit-learn to generate our confusion matrix.
● Let’s analyze our results on our MNIST dataset
404
Confusion Matrix for Multiple Classes
(Confusion matrix of true values against predicted values)
405
Confusion Matrix Analysis
406
Recall
● Recall – True positives for a class over the actual number of samples of that class (TP / (TP + FN))
● Let’s look at the number 7:
○ TP = 1010 and FN = 18
○ 1010 / 1028 = 98.24%
407
Precision
408
Classification Report
Using scikit-learn we can automatically generate a Classification Report that
gives us Recall, Precision, F1 and Support.
409
F1-Score and Support
411
Let’s compare the
performance between
Logistic Regressions, SVMs
and KNN Classifiers in Python
412
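For reference, a minimal sketch in Python comparing the three classifiers (assuming scikit-learn's digits dataset; the course notebook may use different data and settings):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {"Logistic Regression": LogisticRegression(max_iter=5000),
          "SVM": SVC(),
          "KNN": KNeighborsClassifier(n_neighbors=3)}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
    print(classification_report(y_test, model.predict(X_test)))   # Recall, Precision, F1, Support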
Understanding the ROC and AUC Curve
ROC (Receiver Operating Characteristic)
415
How Thresholds affect Recall and Precision
416
The ROC Curve
417
Source: https://glassboxmedicine.com/2019/02/23/measuring-performance-auc-auroc/
Creating the ROC Curve
418
Source: https://glassboxmedicine.com/2019/02/23/measuring-performance-auc-auroc/
How we use the ROC Curve and AUC
419
Overfitting – Regularization,
Generalization and Outliers
What Makes a Good Model?
421
Examine these Models and the
Decision Boundary
● Model A ● Model B ● Model C
422
Let’s look at Each
● Overfitting ● Ideal or Balanced ● Underfitting
423
Overfitting and Underfitting
● Overfitting occurs when a statistical model or machine
learning algorithm captures the noise of the data. Overfitting occurs
if the model or algorithm shows low bias but high variance
● Underfitting occurs when a statistical model or machine learning
algorithm cannot capture the underlying trend of the data
424
Overfitting
● Overfitting leads to poor models and is one of the most common problems faced when developing AI/Machine Learning/Neural Nets.
● Overfitting occurs when our Model fits near perfectly to our training
data, as we saw in the previous slide with Model A. However, fitting
too closely to training data isn’t always a good thing.
● What happens if we try to classify a brand new point that occurs at the position shown on the left (whose true color is green)?
● It will be misclassified because our model has overfit the training data
● Models don’t need to be complex to be good
425
Overfitting – Bias and Variance
● Bias Error – Bias is the difference between the average prediction and the true value we're attempting to predict. Bias error results from simplifying assumptions made by a model to make the target function easier to learn.
Common low bias models are Decision Trees, KNN and SVMs. Parametric
models like Linear Regression and Logistic Regression have high-bias (i.e.
more assumptions). They’re often fast to train, but less flexible.
● Variance Error – Variance error originates from how much your target model would change if different training data were used. Normally, we do
expect some variance in models for different training data, however, the
underlying model should be generally the same. High variance models
(Decision Trees, KNN and SVMs) are very sensitive to this whereas low
variance models (Linear Regression and Logistic Regression) are not.
● Parametric or linear models have high bias and low variance
● Non-parametric or non-linear have low bias and high variance
426
The Bias and Variance Tradeoff
● Trade-Off – We want both low bias and low variance, however as we increase bias we
decrease variance and vice versa. But how does this happen?
● In linear regression, if we increase the degree of our polynomial, we lower the bias but increase the variance; conversely, if we reduce the complexity, we increase bias but reduce variance
● In K-Nearest Neighbors, increasing the number of neighbors (k) increases bias but reduces variance
427
How do we know if we’ve Overfit?
Test your model on… Test Data!
● Many times when overfitting, we can achieve high accuracy (95%+) on our training data, but then get abysmal (~70%) results on the test data. That is a perfect example of Overfitting.
428
Overfitting Illustrated Graphically
● Examine our Training Loss and Accuracy. They're both heading in the right direction!
● But, what's happening to the loss and accuracy on our Test data?
● This is a classic example of Overfitting to our training data
How do we avoid overfitting?
● Rule of thumb is to reduce the complexity of your model
430
How do we avoid overfitting?
● Why do less complex models overfit less?
● Overly complex models (i.e. higher-degree polynomials or, in Deep Learning, deeper networks) can sometimes find features or interpret noise as important in the data, due to their ability to memorize more features (called memorization capacity)
431
What is Regularization?
● It is a method of making our model more general to our
dataset.
● Overfitting ● Ideal or Balanced ● Underfitting
432
Types of Regularization
433
L1 And L2 Regularization
● L1 & L2 regularization are techniques we use to penalize large weights. Large weights or gradients manifest as abrupt changes in our model's decision boundary. By penalizing them, we are really making them smaller.
○ L2: Error = ½ Σ (target_o − out_o)² + (λ/2) Σ w_i²
○ L1: Error = ½ Σ (target_o − out_o)² + λ Σ |w_i|
434
L1 And L2 Regularization
On the left: LASSO regression (you can see that the coefficients, represented by the red rings, can
equal zero when they cross the y-axis). On the right: Ridge regression (you can see that the
coefficients approach, but never equal zero, because they never cross the y-axis).
Meta-credit: “Regularization in Machine Learning” by Prashant Gupta
435
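For reference, a minimal sketch in Python of L2 (Ridge) and L1 (Lasso) regression with scikit-learn; alpha plays the role of the regularization strength λ (the data here is synthetic, for illustration only):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.randn(100) * 0.1   # only 2 informative features

ridge = Ridge(alpha=1.0).fit(X, y)    # L2: shrinks coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)    # L1: can drive coefficients exactly to zero
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)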
Cross Validation
● Cross validation, also known as k-fold cross validation, is a method of training where we split our dataset into k folds instead of a single training and test split.
● For example, let's say we're using 5 folds. We train on 4 folds and use the 5th as our test set. We then rotate, training on a different 4 folds and testing on the held-out one, until every fold has been used as the test set.
● We then average the results coming out of each cycle.
● Cross Validation reduces overfitting but slows the training process (see the sketch below)
436
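A minimal sketch of 5-fold cross validation, assuming scikit-learn and its Iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)   # 5 folds
print(scores)          # accuracy for each fold
print(scores.mean())   # average performance across the folds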
Introduction to Neural
Networks
What are Neural Networks
● Neural Networks act as a ‘black box’ or brain that takes inputs
and predicts an output.
438
The Mysterious ‘Black Box’ Brain
439
The Mysterious ‘Black Box’ Brain
440
The Mysterious ‘Black Box’ Brain
441
How NNs ‘look’
A Simple Neural Network with 1 hidden layer
442
Example Neural Network
443
How do we get a prediction?
Pass the inputs into our NN to receive an output
444
How we predict example 2
445
Types of Deep Learning Models – Feed
Forward, CNNs, RNNs & LSTMs
Deep Learning has Spawned Dozens of Model
Types
● With the advent of Deep Learning algorithms obtaining incredible performance, tweaking and designing intricate elements of these Neural Networks has spawned dozens and dozens of different models.
● However, despite the many variations, they mostly fall into the following categories:
● Feed Forward Neural Networks
● Convolutional Neural Networks (CNN)
● Recurrent Neural Networks (RNN)
● Long Short Term Memory Networks (LSTM)
447
Convolutional Neural Networks (CNNs) –
Why are they needed?
448
How do Computers Store Images?
449
How do Computers Store Images?
● A Color image would consist of 3 channels (RGB) - Right
● And a Grayscale image would consist of one channel - Below
450
Why NNs Don’t Scale Well to Image Data
● Therefore, our input layer will have 12,288 weights. While not an insanely large amount, imagine using input images of 640 x 480: you'd have 921,600 weights! Add some more hidden layers and you'll see how fast this can grow out of control, leading to very long training times
451
CNNs are perfectly suited for Image
Classification
452
CNNs use a 3D Volume Arrangement for their Neurons
● Because our input data is an image, we can
constrain or design our Neural Network to
better suit this type of data
453
It’s all in the name
454
Convolutions
● Convolution is a mathematical term to describe the
process of combining two functions to produce a third
function.
455
Examples of Image Features
456
The Convolution Process
● Convolutions are executed by
sliding the filter or kernel over
the input image.
457
Recurrent Neural Networks
458
Recurrent Neural Networks
459
RNN Uses and Weaknesses
460
Introducing Long Short Term Memory Networks (LSTM)
461
6.0
● 6.3 Forward Propagation
● 6.4 Activation Functions
● 6.5 Training Part 1 – Loss Functions
● 6.6 Training Part 2 – Backpropagation and Gradient Descent
● 6.7. Backpropagation & Learning Rates – A Worked Example
● 6.8 Regularization, Overfitting, Generalization and Test Datasets
● 6.9 Epochs, Iterations and Batch Sizes
● 6.10 Measuring Performance and the Confusion Matrix
● 6.11 Review and Best Practices
463
6.3
Forward Propagation
How Neural Networks process their inputs to produce an output
Using actual values (random)
465
Looking at a single Node/Neuron
H1 = (w1 × i1) + (w2 × i2) + b1
466
Steps for all output values
Output = Σ (wn × in) + b
467
In reality these connections are simply formulas that
pass values from one layer to the next
Output of Nodes = Σ (wn × in) + b
H1 = (i1 × w1) + (i2 × w2) + b1          Output1 = (w5 × H1) + (w6 × H2) + b2
468
Calculating H2
H2 = (i1 × w3) + (i2 × w4) + b1
469
Getting our final outputs
Output1 = (w5 × H1) + (w6 × H2) + b2
𝑂𝑢𝑡𝑝𝑢𝑡% = (0.6 × 0.8675) + (0.05×0.885) + 0.6 = 0.5205 + 0.04425 + 0.6 = 1.16475
𝑂𝑢𝑡𝑝𝑢𝑡/ = 0.43375 + 0.2655 + 0.6 = 1.29925
470
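For reference, a minimal NumPy sketch reproducing the Output 1 calculation above, using the hidden activations and weights shown on this slide:

import numpy as np

H = np.array([0.8675, 0.885])     # hidden node outputs H1, H2
w_out1 = np.array([0.6, 0.05])    # weights into Output 1
b2 = 0.6                          # output-layer bias

output1 = np.dot(w_out1, H) + b2
print(output1)                    # 1.16475, matching the slide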
What does this tell us?
● Our initial default random weights (w and b) produced very
incorrect results
471
The Bias Trick
– Making our weights and biases as one
● Now is as good a time as any to show you the bias trick that is used to simplify our calculations.
[Diagram: the bias column b is appended to the weight matrix W, and a 1 is appended to the input vector xi]
● xi is our input values; instead of doing a multiplication and then adding our biases, we can simply add the biases to our weight matrix and add an additional element, 1, to our input data.
● This simplifies our calculation operations as we treat the biases and weights as one.
● NOTE: This makes our input vector size bigger by one, i.e. if xi had 32 values, it will now have 33.
472
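A minimal NumPy sketch of the bias trick, with hypothetical weights and inputs (not values from the course):

import numpy as np

W = np.array([[0.1, 0.2],
              [0.3, 0.4]])            # hypothetical 2x2 weight matrix
b = np.array([0.5, 0.6])              # hypothetical biases
x = np.array([1.0, 2.0])              # input vector

usual = W.dot(x) + b                  # multiply, then add biases

W_aug = np.hstack([W, b.reshape(-1, 1)])   # biases become an extra column of W
x_aug = np.append(x, 1.0)                  # input grows by one element (the 1)
trick = W_aug.dot(x_aug)

print(np.allclose(usual, trick))      # True - both give the same result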
6.4
Activation Functions
f(x) = max(0, x)
● Therefore, if Σ (wi × xi) + b = 0.75, then f(x) = 0.75
● However, if Σ (wi × xi) + b = −0.5, then f(x) = 0
474
The ReLU Activation Function
● This activation function is called the ReLU (Rectified Linear Unit) function
and is one of the most commonly used activation functions in training
Neural Networks.
f(x) = max(0, x)
475
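A minimal NumPy sketch of the ReLU function:

import numpy as np

def relu(x):
    return np.maximum(0, x)           # f(x) = max(0, x)

print(relu(np.array([0.75, -0.5])))   # [0.75, 0.], matching the example above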
Why use Activation Functions?
For Non Linearity
476
Example of Non Linear Data
Linearly Separable Non-Linearly Separable
NOTE: This shows just 2 Dimensions, imagine separating data with multiple dimensions
477
Types of Activation Functions
478
Why do we have biases in the first place?
480
The 'Deep' in Deep Learning: Hidden Layers
● Depth refers to the number of hidden layers
● The deeper the network, the better it learns non-linear mappings
● Deeper is often better, but there comes a point of diminishing returns and overly long training times
● Deeper Networks can also lead to overfitting
481
The Real Magic Happens in Training
482
6.5
Training Part 1 – Loss Functions
484
How do we begin training a NN?
What exactly do we need?
485
Training step by step
1. Initialize some random values for our weights and bias
2. Input a single sample of our data
3. Compare our output with the actual value it was supposed to be, we’ll be
calling this our Target values.
4. Quantify how 'bad' these random weights were; we'll call this our Loss.
5. Adjust the weights so that the Loss is lower
6. Keep doing this for each sample in our dataset
7. Then send the entire dataset through this weight ‘optimization’ program to
see if we get an even lower loss
8. Stop training when the loss stops decreasing.
486
Training Process Visualized
487
Quantifying Loss with Loss Functions
488
Loss Functions
● Loss functions are integral in training Neural Nets as they measure the
inconsistency or difference between the predicted results & the actual target
results.
● They are always positive and penalize big errors well
● The lower the loss the ‘better’ the model
● There are many loss functions, Mean Squared Error (MSE) is popular
● MSE = mean of (Target − Predicted)²
[Example table with columns: Outputs, Predicted Results, Target Values, Error (T − P), MSE]
489
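A minimal NumPy sketch of Mean Squared Error, with hypothetical target and predicted values:

import numpy as np

target = np.array([1.0, 0.0, 1.0])
predicted = np.array([0.8, 0.1, 0.6])

mse = np.mean((target - predicted) ** 2)   # always positive, penalizes big errors
print(mse)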
Types of Loss Functions
● There are many types of loss functions such as:
● L1
● L2
● Cross Entropy – Used in classification (Binary Cross Entropy for two-class problems)
● Hinge Loss
● Mean Absolute Error (MAE)
In practice, MSE is usually a good safe choice. We'll discuss using
different loss functions later on.
NOTE: A low loss goes hand in hand with accuracy. However, there is
more to a good model than good training accuracy and low loss. We’ll
learn more about this soon.
490
Using the Loss to Correct the Weights
● Getting the best weights for classifying our data is not a trivial task,
especially with large image data which can contain thousands of inputs
from a single image.
(3 x 4) + (4 x 2) + (4+2) = 26 (3 x 4) + (4 x 4) + (4 x 1) + (4+4+1) = 41
492
6.6
Training Part 2:
Backpropagation & Gradient
Descent
Determining the best weights efficiently
Introducing Backpropagation
494
Backpropagation: Revisiting our Neural Net
495
Backpropagation: Revisiting our Neural Net
● However, this tunes the weights for that specific input data. How do we make our Neural Network generalize?
498
Gradient Descent
499
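For intuition, a minimal Python sketch of plain gradient descent on a simple loss L(w) = (w − 3)², whose gradient is 2(w − 3) (a toy example, not the course's network):

w = 0.0                  # starting weight
lr = 0.1                 # learning rate (eta)

for step in range(50):
    grad = 2 * (w - 3)   # gradient of the loss with respect to w
    w = w - lr * grad    # move against the gradient
print(w)                 # approaches 3, the minimum of the loss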
Stochastic Gradient Descent
● Naïve Gradient Descent is very computationally expensive/slow as it
requires exposure to the entire dataset, then updates the gradient.
● Stochastic Gradient Descent (SGD) does the updates after every input
sample. This produces noisy or fluctuating loss outputs. However, again
this method can be slow.
503
Backpropagation Simplified
● We obtain the total error at the output nodes and then propagate
these errors back through the network using Backpropagation to
obtain the new and better gradients or weights.
504
The Chain Rule
● Backpropagation is made possible by the Chain Rule.
● What is the Chain Rule? Without over complicating things, it’s defined as:
505
Let’s take a look at our previous basic NN
506
We use the Chain Rule to Determine
the Direction the Weights Should Take
● Should W5 be
increased or
decreased?
507
Our Calculated Forward Propagation and Loss Values
509
Exploring W5
● We want ∂E_total / ∂w5
● Where E_total is the sum of the Error from Output 1 and Output 2
510
Using the Chain Rule to Calculate W5
● ∂E_total/∂w5 = ∂E_total/∂Out1 × ∂Out1/∂NetOutput1 × ∂NetOutput1/∂w5   (pieces 1, 2 and 3)
● E_total = ½(target_o1 − out1)² + ½(target_o2 − out2)²
● Out1 = 1 / (1 + e^(−NetOutput1))
● Fortunately, the partial derivative of the logistic function is the output multiplied by 1 minus the output:
● Piece 2: ∂Out1/∂NetOutput1 = out1 × (1 − out1) = 0.742294385 × (1 − 0.742294385) = 0.191293
512
Let's get ∂NetOutput1 / ∂w5
● ∂NetOutput1/∂w5 = 1 × outH1 × w5^(1−1) + 0 + 0
● Piece 3: ∂NetOutput1/∂w5 = outH1 = 0.704225234
513
We now have all the pieces to get ∂E_total / ∂w5
● ∂E_total/∂w5 = 0.642294385 × 0.191293431 × 0.704225234
● ∂E_total/∂w5 = 0.086526
514
So what’s the new weight for W5 ?
● New w5 = w5 − η × ∂E_total/∂w5
Learning Rate
● Notice we introduced a new parameter 'η' and gave it a value of 0.5
● Look carefully at the first formula. The learning rate simply controls how big a magnitude jump we take in the direction of ∂E_total/∂w5
515
Check your answers
● 𝑛𝑒𝑤 𝑤G = 0.400981
● 𝑛𝑒𝑤 𝑤3 = 0.558799
516
You’ve just used Backpropagation to Calculate the new W5
● You can now calculate the new updates for W6, W7 and W8
similarly.
517
6.8
Regularization, Overfitting,
Generalization and Test Datasets
How do we know our model is good?
What Makes a Good Model?
519
What Makes a Good Model?
● Model A ● Model B ● Model C
520
What Makes a Good Model?
● Overfitting ● Ideal or Balanced ● Underfitting
521
Overfitting
● Overfitting leads to poor models and is one of the most common problems faced when developing AI/Machine Learning/Neural Nets.
● Overfitting occurs when our Model fits near perfectly to our training data, as we saw in the previous slide with Model A. However, fitting too closely to training data isn't always a good thing.
● But, what’s
happening to the
loss and accuracy on
our Test data?
● This is a classic example of Overfitting to our
training data
524
How do we avoid overfitting?
● Overfitting is a consequence of our weights. Our weights
have been tuned to fit our Training Data well but due to this
‘over tuning’ it performs poorly on unseen Test Data.
525
How do we avoid overfitting?
● We can use less weights to get smaller/less deep Neural
Networks
● Deeper models can sometimes find features or
interpret noise to be important in data, due to their
abilities to memorize more features (called
memorization capacity)
526
How do we avoid overfitting?
● We can use less weights to get smaller/less deep Neural
Networks
● Deeper models can sometimes find features or interpret
noise to be important in data, due to their abilities to
memorize more features (called memorization capacity)
Or Regularization!
527
What is Regularization?
● It is a method of making our model more general to our dataset.
528
Types of Regularization
● L1 & L2 Regularization
● Cross Validation
● Early Stopping
● Drop Out
● Dataset Augmentation
529
L1 And L2 Regularization
● L1 & L2 regularization are techniques we use to penalize large weights. Large weights or gradients manifest as abrupt changes in our model's decision boundary. By penalizing them, we are really making them smaller.
● L2 is also known as Ridge Regression
○ Error = ½ Σ (target_o − out_o)² + (λ/2) Σ w_i²
531
Early Stopping
532
Dropout
● Dropout refers to dropping nodes (both hidden and visible) in a
neural network with the aim of reducing overfitting.
● In training certain parts of the neural network are ignored during
some forward and backward propagations.
● Dropout is an approach to regularization in neural networks which helps reduce interdependent learning amongst the neurons. Thus the NN learns more robust or meaningful features.
● In Dropout we set a parameter 'p' that gives the probability that a node is kept (so nodes are dropped with probability 1 − p).
● Dropout almost doubles the time to converge in training
533
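A minimal sketch of a Dropout layer, assuming tf.keras is the deep learning library used (the course notebooks may differ; note Keras' Dropout argument is the fraction of activations dropped, not kept):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(128, activation='relu', input_shape=(784,)),
    Dropout(0.5),                     # randomly drops 50% of activations during training
    Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])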
Dropout Illustrated
534
Data Augmentation
● Data Augmentation is one of the easiest ways to improve our models.
● It's simply taking our input dataset and making slight variations to it in order to increase the amount of data we have for training. Examples below.
● This allows us to build more robust models that don’t overfit.
535
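A minimal sketch of image augmentation, assuming tf.keras' ImageDataGenerator (other augmentation tools work similarly):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,        # small random rotations
    width_shift_range=0.1,    # small horizontal shifts
    height_shift_range=0.1,   # small vertical shifts
    horizontal_flip=True)     # mirror images left/right

# Typically used as: model.fit(augmenter.flow(X_train, y_train, batch_size=32), ...)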
6.9
Epochs, Iterations and Batch Sizes
539
6.10
Measuring Performance
How we measure the performance of our Neural Network
Loss and Accuracy
541
Loss and Accuracy
542
Is Accuracy the only way to assess a model’s
performance?
● While very important, accuracy alone doesn’t tell us the whole story.
● Imagine we’re using a NN to predict whether a person has a life threatening disease based on a
blood test.
● There are now 4 possible scenarios.
1. TRUE POSITIVE
■ Test Predicts Positive for the disease and the person has the disease
2. TRUE NEGATIVE
■ Test Predicts Negative for the disease and the person does NOT have the disease
3. FALSE POSITIVE
■ Test Predicts Positive for the disease but the person does NOT have the disease
4. FALSE NEGATIVE
■ Test Predicts Negative for the disease but the person actually has the disease
543
For a 2 or Binary Class Classification Problem
● Precision – Of all the samples we predicted as positive, how many did the classifier get right?
○ Precision = True Positives / (True Positives + False Positives)
544
An Example
545
https://en.wikipedia.org/wiki/Precision_and_recall
546
6.11
Review and Best Practices
549
Best Practices
● Activation Functions - Use ReLU or Leaky ReLU
● Loss Function - MSE
● Regularization
● Use L2 and start small and increase accordingly (0.01, 0.02,
0.05, 0.1……)
● Use Dropout and set P between 0.2 to 0.5
● Learning Rate – 0.001
● Number of Hidden Layers – As deep as your machine’s
performance will allow
● Number of Epochs – Try 10-100 (ideally at least 50)
550
A/B Testing A Fun Theoretical
Example
A/B Testing Introduction
● A/B Testing has become almost a buzz word given how much it’s
thrown around in marketing circles and amongst web & UX/UI
interfaces.
● It's not complicated, but there are a lot of things to consider when designing and analyzing your A/B Test
552
Our Real Life A/B Test Example
553
Our Hot Dogs
554
Our Experiment Design
555
Defining our Hypothesis
Null Hypothesis
● There is no difference between
the two hot dog sausage brands
(A and B)
Alternative Hypothesis
● The new hot dog sausage brand
(B) is better than A.
556
Our Evaluation Metric
● Now, in our example the two hot dog sausages cost roughly the
same so the cost impact is negligible.
558
Experiment Size
560
Statistical Power
● When choosing sample size, we need to decide what the Statistical Power of
our test will be.
● Statistical Power is the probability of correctly rejecting the Null Hypothesis
● Statistical Power is often referred to 1-Beta as it’s inversely proportional to
making Type II Error (Beta Errors).
● Type II error is the probability of failing to reject the Null Hypothesis when you
should have.
● We typically use a statistical power of 80% and don’t worry, I’ll explain what
that 80% means shortly.
● Statistical power is directly tied to the false negative rate: a power of 0.80 means that there is an 80% chance that if there was an effect, we would detect it (or a 20% chance that we'd miss the effect)
561
Statistical Significance Level
562
Baseline Metric and Effect Size
● But by what amount of change are we looking to improve with our new hotdogs? In our example let's look at getting a 25% increase.
563
Calculating our Sample Size
P1 (baseline) = 50%
P2 (p1+effect size) = 75%
Statistical Power = 80%
Significance Level = 5%
Sample Size N = 59
564
Experiment Length
● In the business world there are many other factors that could
mess with our results such as seasonality, trends etc.
565
Back to our Hotdog Experiment
566
Discussing our Results
● d̂ = p̂_experiment − p̂_control = 0.15
● So we have a 0.15 improvement with our Hotdog B, but was this by
random chance or is this statistically significant?
● This interval tells us the range of values the difference between the
two groups can have.
567
Calculating our Confidence Intervals
● We firstly need to calculate the pooled probability, which is really just the
combined probability of the two samples.
● p̂_pooled = (conversions_control + conversions_experiment) / (N_control + N_experiment) = 0.625
● The second and last step to get the Confidence Interval is getting the Standard Error. It estimates how much variation the obtained results will have, i.e. how widespread the values in the distribution of the sample will be. We'll calculate the Pooled Standard Error, which combines the standard error of both samples.
● SE_pooled = √( p̂_pooled × (1 − p̂_pooled) × (1/N_control + 1/N_experiment) )
● SE_pooled = √( 0.625 × (1 − 0.625) × (1/N_control + 1/N_experiment) ) = 0.044
568
Final Analysis
So now we have everything to get our confidence intervals and enough information to determine if we should reject our Null Hypothesis.
● H₀: d̂ = 0
● H₁: d̂ > 1.96 × SE_pooled
Remember we chose to use a 5% significance
level? What we do now is split the 5% into the two
ends of our normal distribution curve (we assume
normally distributed data).
569
Final Analysis – Obtaining our Z-Score
H₁: d̂ > 1.96 × SE_pooled
• Notice the 1.96 in our Alternative Hypothesis?
• We used the Z-Table to get the value for 0.025
• Therefore, in order to reject the Null Hypothesis, we want to show that the difference observed is greater than a certain interval around the value that refers to 5% of the results being due to chance.
• Hence why we use the z-score and the standard
error.
570
So are our new Hotdogs better?
● 𝑯𝟏 : 𝟎. 𝟏𝟓 > 𝟏. 𝟗𝟔×𝟎. 𝟎𝟒𝟒 ≡ 𝟎. 𝟏𝟓 > 𝟎. 𝟎𝟖𝟔𝟐𝟒
● So yes, we have statistically shown with our A/B test that the new hotdogs (B) are better than A.
● Before we say an emphatic yes to using our new hotdogs, let's dig into the interpretation of our results, given our specified parameters.
● Remember we chose a minimum effect size of 25%, as we can see given our Standard Error, the
lower bound (left) is a situation where the minimum effect is not guaranteed as it’s significantly
less than 0.15
● We can therefore conclude that our new hotdogs have a strong possibility of resulting in 25%
more sales
571
Clustering – Unsupervised Learning
Unsupervised Learning
● Unsupervised learning is concerned with finding interesting clusters of input data. It
does so without any help of data labeling.
● It does this by creating interesting transformations of the input data
● It is very important in data analytics when trying to understand data
● Examples in the Business world:
o Customer Segmentation
[Diagram: Input Data (Chicken, Pork, Beef, Vanilla, Chocolate) → Unsupervised Machine Learning → Cluster 1, Cluster 2]
573
Goal of Clustering
● The goal of clustering is, given a set of data points, to classify each data point into a specific group or cluster.
● There are several types of clustering methods which we'll now discuss.
575
Types of Clustering Algorithms
576
K-Means Clustering
K-Means Clustering Algorithm
578
K-Means Clustering – Step 1
580
K-Means Clustering – Step 3
581
K-Means Clustering – Step 4
582
K-Means Clustering – Step 5
● Repeat Step 3 & 4 – We now use the new centroids (the yellow and
blue crosses) as the cluster centers and then assign the closest
points to each centroid’s cluster.
● We keep repeating this step until the newly formed clusters stop
changing, all points remain in the same cluster and/or the number of
specified iterations is reached.
583
K-Means Clustering Algorithm Advantages
584
K-Means Clustering Algorithm Disadvantages
585
Choosing K – Elbow Method &
Silhouette Analysis
Choosing K
587
Elbow Method
588
Elbow Method
589
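A minimal sketch of the Elbow Method with scikit-learn, on synthetic blob data: plot the inertia (within-cluster sum of squares) for a range of k values and look for the bend:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)      # sum of squared distances to closest centroid

plt.plot(ks, inertias, marker='o')
plt.xlabel('k'); plt.ylabel('Inertia')
plt.show()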
Silhouette Analysis
591
592
Source: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
593
Source: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering Introduction
595
Agglomerative Hierarchical Clustering Introduction
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
596
Agglomerative Hierarchical Clustering: Step 1
597
Agglomerative Hierarchical Clustering: Step 2
598
Agglomerative Hierarchical Clustering: Step 3
599
Advantages and Disadvantages of Agglomerative Hierarchical Clustering
Disadvantages:
● It is quite slow and doesn’t scale well
600
Mean-Shift Clustering
Mean-Shift Clustering
602
Mean-Shift Clustering Steps
Let’s start with thinking about a set of 2-dim x,y points.
1. We first initialize a circular sliding window (radius r) starting at a randomly selected point.
2. Think of Mean Shift as a density finding algorithm, that keeps shifting the window to higher
and higher densities of points until it converges.
3. The density within the sliding window is proportional to the number of points inside it.
4. The sliding window is shifted according to the mean until there is no direction at which a shift
can accommodate more points inside the kernel.
5. This entire process is done with several sliding windows until all points lie within a window. When multiple sliding windows overlap, the window containing the most points is preserved. The data points are then clustered according to the sliding window in which they reside.
603
Mean-Shift Clustering Demo
604
DBSCAN (Density-Based Spatial
Clustering of Applications with Noise)
Density-Based Spatial Clustering of Applications with
Noise (DBSCAN)
606
DBSCAN – Step 1 & 2
Step 1
● DBSCAN starts off with an arbitrary starting data point that has not been visited.
● The neighborhood of this point is extracted using a distance epsilon ε (All points
which are within the ε distance are neighborhood points)
Step 2:
● If there are a sufficient number of points (according to minPoints) within this
neighborhood then the clustering process starts and the current data point
becomes the first point in the new cluster.
● If not, the point will be labeled as noise and the point is marked as “visited”.
607
DBSCAN – Step 3 & 4
Step 3
● For this first point in the new cluster, the points within its ε distance
neighborhood also become part of the same cluster. This procedure of making
all points in the ε neighborhood belong to the same cluster is then repeated for
all of the new points that have been just added to the cluster group.
Step 4
● This process of steps 2 and 3 is repeated until all points in the cluster are
determined, this occurs when all points have been visited and assigned a
cluster.
608
DBSCAN – Step 5
Step 5
● Once we’re done with the current cluster, a new unvisited point is retrieved and
processed, leading to the discovery of a further cluster or noise.
● This process repeats until all points are marked as visited. Since at the end of
this all points have been visited, each point will have been marked as either
belonging to a cluster or being noise.
609
DBSCAN Demo
610
DBSCAN Advantages and Disadvantages
Advantages:
● No preset K number of clusters needs to be set
● It can identify outliers
● It can find erratically sized and shaped clusters well
Disadvantages:
● It doesn't perform well when clusters are of varying densities (due to the setting of ε and minPoints for identifying the neighborhood points, as these will vary from cluster to cluster when they have different densities)
● High-Dimensional data poses problems when choosing ε
611
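A minimal sketch of DBSCAN with scikit-learn; eps and min_samples correspond to the ε and minPoints parameters described above:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))    # cluster labels; -1 marks points labelled as noise/outliers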
Expectation–Maximization (EM)
Clustering using Gaussian Mixture
Models (GMM)
Expectation–Maximization (EM) Clustering using
Gaussian Mixture Models (GMM)
613
Expectation–Maximization (EM) Clustering using
Gaussian Mixture Models (GMM)
614
EM Clustering Using GMM Steps 1 & 2
Step 1
● Like K-Means, we begin by selecting the number of clusters and randomly
initializing the Gaussian distribution parameters for each cluster.
Step 2
● Given these Gaussian distributions for each cluster, we compute the probability
that each data point belongs to a particular cluster. The closer a point is to the
Gaussian’s center, the more likely it belongs to that cluster.
● This works well for Normally Distributed data as we are assuming that most of
the data lies closer to the center of the cluster.
615
EM Clustering Using GMM Steps
Step 3
● Based on these probabilities a new set of parameters for the Gaussian
distributions is computed such that we maximize the probabilities of data
points within the clusters.
● We compute these new parameters using a weighted sum of the data point
positions, where the weights are the probabilities of the data point belonging in
that particular cluster.
Step 4
● We repeat Step 2 and 3 iteratively until convergence.
616
EM Clustering Using GMM Demo
617
EM Clustering Using GMM Advantages and Disadvantages
Advantages
● GMMs are a lot more flexible in terms of cluster
covariance than K-Means
● The clusters can take on any ellipse shape, rather than being restricted to circles
● Due to the use of probabilities, data points can belong to multiple clusters, e.g. a point can have a 0.65 probability of being in one cluster and 0.35 in another.
Disadvantages
● Choosing K and scaling to higher dimensions
618
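A minimal sketch of EM clustering with scikit-learn's GaussianMixture on synthetic blob data:

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)
gmm = GaussianMixture(n_components=3, covariance_type='full').fit(X)

print(gmm.predict(X[:5]))         # hard cluster assignments
print(gmm.predict_proba(X[:5]))   # soft assignments, e.g. 0.65 vs 0.35 between clusters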
Principal Component Analysis
Why do we need PCA?
● Many times when working with large datasets with many dimensions, things get too computationally expensive and confusing to keep track of
● However, when we look at the data itself, we often see variables that are strongly correlated and hence possibly redundant, e.g. a person's height and weight
● What if there was a way to reduce the dimensionality of our data while still retaining its information?
● That is what Principal Component Analysis
achieves
620
What is PCA Exactly?
● PCA is an algorithm that compresses your dataset’s dimensions from a
higher to lower dimensionality.
● It does this based on the eigenvectors of the variance in your dataset
● PCA is widely used in Data Compression, saving loads of processing time, as well as in Visualizations of high dimensional data in 2 or 3 dimensions. This is very helpful when doing cluster analysis.
● More technically, PCA finds a new set of dimensions such that:
○ All the dimensions are orthogonal (and thus linearly independent)
○ Ranked according to the variance of data along them. This means
the first principal component contains the most information about
the data.
621
How does PCA Work?
1. Calculate the Covariance Matrix (X) of our data points (i.e. how our variables all relate to one another)
2. Calculate the Eigenvectors and their Eigenvalues
3. Sort our Eigenvectors according to their Eigenvalues, from largest to smallest.
4. Select a set of Eigenvectors (k) to represent our data in k-dimensions
5. Transform our original n-dimensional dataset into k-dimensions
622
Eigenvectors & Eigenvalues
● The eigenvectors or principal components of our covariance matrix
represent the vector directions of the new feature space.
623
Understanding them intuitively
● Imagine you have a bunch of people. Every person has a group of
friends they talk to, with some frequency. And every person has a
credibility rating (though the same person may have different
credibility ratings from different people).
● Distribute an amount X of gossip to each person, and let them talk to
each other.
● The largest eigenvalue gives you an idea of the fastest rate at which
gossip can grow in this social circle.
● The corresponding eigenvector gives you an idea of how much
gossip each person should start with in order to obtain this maximal
growth rate. (In particular, if you want a story to spread rapidly, the
largest components of the principal eigenvector identify who you
should tell the story to)
https://www.quora.com/What-is-the-physical-meaning-of-the-eigenvalues-and-eigenvectors
624
PCA Output
1. Eigenvectors
625
PCA Output
https://medium.com/@sadatnazrul/the-dos-and-donts-of-principal-component-analysis-7c2e9dc8cc48
626
Points to Remember about PCA
● Always normalize your dataset before performing PCA
● Principal components are always linearly independent
● Every principal component will be Orthogonal/Perpendicular to every other
principal component
● Using PCA we can use k dimensions to represent some proportion of the original information (always less than 100%; your goal is to get as close to 100% as possible with as small a k as possible)
● Numerically we compute PCA using SVD (Singular Value Decomposition)
● The goal of PCA is to create new data that represents the original with some loss due to compression/reduced dimensionality
● New Data (k×n) = [top k eigenvectors] (k×m) × [original dataset] (m×n)
627
PCA Disadvantages
● PCA works only if the observed variables or data is linearly
correlated. If there is no correlation within our dataset PCA will
fail to find the adequate components to represent our data with
less dimensions
● PCA is lossy by nature, and thus information will be lost when we reduce dimensionality
● It is sensitive to scaling, which is why all data should be normalized
● Visualizations are hard to interpret meaningfully, as the principal components do not relate directly to the original data features.
628
PCA In Python!
629
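A minimal sketch of PCA with scikit-learn (Iris data used for illustration); note the data is standardized first:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)     # always normalize before PCA

pca = PCA(n_components=2)                        # keep the top 2 principal components
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)             # variance retained by each component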
t-Distributed Stochastic Neighbor
Embedding (t-SNE)
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Introduction
631
Why is t-SNE better than PCA?
633
How does t-SNE Work?
634
The t-SNE algorithm – Step 1
635
The t-SNE algorithm – Step 2
636
The t-SNE algorithm – Step 3
● In the last step we want this set of probabilities from the low-dimensional space to reflect those of the high-dimensional space as closely as possible (as we want the two map structures to be similar).
● We then measure the difference between the probability distributions of the two spaces using the Kullback-Leibler divergence (KL), often written as D(p,q).
● KL-divergence is an asymmetrical measure that compares two distributions (a "distance" between two distributions).
● We then minimize our KL cost function using gradient descent.
637
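A minimal sketch of t-SNE with scikit-learn, projecting the digits data to 2D:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)    # (1797, 2) - one 2D point per image, ready to scatter-plot by label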
Visual Comparison of t-SNE vs PCA
638
Introduction to Recommendation
Engines
640
Netflix’s Recommendations
641
Amazon’s Recommendations
642
643
How do they know us so well?
644
Are Big Tech Companies Spying on Us?
● No
645
Why are Recommendation Systems so Important?
646
“ “Recommender Systems aim to help
a user or a group of users
to select items from a crowded item
or information space.”
(MacNee et al. 2006)
647
Section Overview
In this section we’ll learn
● Intuition behind Recommendation Systems – How do
we review items?
● Collaborative Filtering and Content-based filtering
648
Before recommending, how do we
rate or review Items?
Let’s try to Build an Item Comparison Tool
● How do we do this?
650
Approach 1 – Net Score
● We take the Net Score: Positive ratings − Negative ratings
○ Net Score = Positive Ratings − Negative Ratings
Movie Positive Ratings Negative Ratings Net Score Ave Percent Positive
651
Examples of sites making this mistake
Urban Dictionary
652
Approach 2 – Average Rating
● Average Rating = Positive Ratings / Total Ratings
Movie Positive Ratings Negative Ratings Net Score Ave Percent Positive
A 750 500 250 60%
B 5000 4000 1000 56%
C 9 1 8 90%
● Our algorithm now scores Movie C the highest, but again, is this right? Nope. Movies with very few reviews will dominate the rankings.
653
Sites making this
mistake - Amazon
4
654
Approach 3 CORRECT Score = Lower bound of Wilson
score confidence interval for a Bernoulli parameter
● It can be seen that we need to balance the proportion of positive ratings with
the uncertainty of a small number of observations.
● Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson.
● What we want to ask is: Given the ratings I have, there is a 95% chance that the
“real” fraction of positive ratings is at least what?
● Wilson gives the answer. Considering only positive and negative ratings (i.e. not
a 5-star scale), the lower bound on the proportion of positive ratings is given by:
655
Source: https://www.evanmiller.org/how-not-to-sort-by-average-rating.html
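A minimal Python sketch of the Wilson score lower bound described in the article above (z = 1.96 for 95% confidence; pos = positive ratings, n = total ratings):

from math import sqrt

def wilson_lower_bound(pos, n, z=1.96):
    if n == 0:
        return 0.0
    phat = pos / n
    return ((phat + z * z / (2 * n)
             - z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n))
            / (1 + z * z / n))

print(wilson_lower_bound(750, 1250))   # Movie A from the earlier table
print(wilson_lower_bound(9, 10))       # Movie C: few reviews, so the lower bound drops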
User Collaborative Filtering and
Item/Content-based Filtering
Recommendation Systems
Let’s think about two ways we can do this for let’s say, an online retailer like
Amazon.
1. Do we have users that buy similar items? Let's say our system has records showing that a subset of users bought Metallica albums, and most of these customers also bought Megadeth albums. Therefore, we can infer that if someone has purchased Metallica albums, they are a likely candidate to purchase a Megadeth album. This is called Collaborative Filtering.
2. Another approach is, what if we know a user has been searching for formal wear
suits online. We know intrinsically that formal wear suits need to be paired with
appropriate shoes. As such, we can recommend the user purchase our top rated
shoes. This is called Content or Item based filtering.
658
Collaborative filtering User to User
659
User-to-item Matrix
User    Item 1    Item 2    Item 3    Item 4
1       1         0         1         1
2       0         1         0         0
3       1         1         0         0
4       0         0         1         0
5       1         1         0         0
660
User-to-item Matrix Explained
662
Cosine Similarity Explained
663
Cosine Similarity Explained continued
Words Sentence 1 Sentence 2
Amy 1 0
Likes 1 2
Mangoes 1 1
More 1 0
Than 1 0
Apples 1 0
Sam 0 1
potatoes 0 1
664
Cosine Similarity Explained continues
● Sentence 1 = [1,1,1,1,1,1,0,0]
● Sentence 2 = [0,2,1,0,0,0,1,1]
● Cosine Similarity ≈ 0.463
665
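A minimal NumPy sketch verifying the calculation above:

import numpy as np

s1 = np.array([1, 1, 1, 1, 1, 1, 0, 0])
s2 = np.array([0, 2, 1, 0, 0, 0, 1, 1])

cos_sim = np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
print(round(cos_sim, 3))    # 0.463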
Item-based approach collaborative filtering
666
User to User Collaborative Filtering Disadvantages
667
How item-item based filtering solved this problem
668
Case Study – User Collaborative
Filtering and Item/Content-based
filtering in Python
The Netflix Prize and Matrix
Factorization and Deep Learning as
Latent-Factor Methods
Collaborative Filtering Recap
However, it's not all good; let's take a look at some of the potential issues we face when using Collaborative Filtering.
671
Collaborative Filtering Challenges
673
Matrix Factorization
● Where σ1 > σ2 > σ3 > σ4; therefore the preference for the first item is written as:
● p11 = σ1·μ11·ν11 + σ2·μ12·ν21 + σ3·μ13·ν31 + σ4·μ14·ν41
● Vector Form – p11 = (σ ∘ μ1) ⋅ ν1
● So now we can select the top two features based on the sigmas:
● p11 ≈ σ1·μ11·ν11 + σ2·μ12·ν21
676
Simon Funk’s SVD
Therefore, the estimated score for item j from user i is:
Deep Learning Embedding
● Deep learning offered a more flexible method to include various factors into
modeling and creating embeddings.
● The workings for many of these methods can be a bit complicated to explain,
however in essence, they formulate the problem as a classification problem
where each item is one class.
● There have been many advances that allow these methods to deal with millions of users and items.
● Multi-Phase modeling such as used by YouTube divided the modeling process
into two steps where the first uses only user-item activity data to select
hundreds of candidates out of millions. Then in the second phase it uses more
information about the candidate videos to make another selection and ranking.
● Recommender systems should not overfit historical user-item preference data (pure exploitation), to avoid getting stuck in a local optimum.
678
Introduction to Natural Language
Processing
Introduction to Natural Language Processing
680
NLP’s role in Businesses - Translating
681
NLP’s role in Businesses - Summarizing information
682
NLP’s role in Businesses - Sentiment analysis
683
NLP’s role in Businesses - Chatbots
684
NLP’s role in Businesses – Information
Extraction and Search
685
NLP’s role in Businesses – Auto
Responders and Auto Complete
686
Main Topics in NLP
687
Modeling Language – The Bag of
Words Model
Bag of Words Modeling
689
Typical Machine Learning Inputs
690
Tokenization
691
Bag of Words and Tokenization Example
1 1 1 1 0 0 1 1 1
2 0 1 0 1 1 1 1 1
692
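A minimal sketch of Bag of Words with scikit-learn's CountVectorizer, using two hypothetical sentences (not necessarily the ones on the slide):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)          # tokenizes and counts words

print(vectorizer.get_feature_names_out())     # the vocabulary (one column per word)
print(bow.toarray())                          # one row of counts per document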
693
Understanding our input data
694
Normalization, Stop Word Removal,
Lemmatizing/Stemming
Words are Messy
696
Normalization Types
● Case Normalization
● Removing Stop Words
● Removing Punctuation and Special Symbols
● Lemmatising/Stemming
697
Case Normalization
Case normalization is simply standardizing the case of all words found in our document, for example changing:
● "Hi John, today we'll go to the market" into:
● "hi john, today we'll go to the market"
Typically, case doesn't actually change any meaning; however, there are cases (pun intended) where it can, for example:
● Reading is a city in the UK which is different to the act of
reading.
● April and May are both common names and months.
698
Removing Stop Words
● Stop words are common words that do not add additional information or predictive value due to how common they are in normal text. Examples of common stop words are:
○ I
○ is
○ and
○ a
○ are
○ the
699
Removing Punctuation and Special Symbols
700
Lemmatising/Stemming
● Both lemmatizing and stemming are techniques that seek to reduce
inflection forms to normalize words with the same lemma.
● Lemmatising does this by considering the context of the word while
stemming does not.
● However, most current lemmatizers or stemmer libraries are not highly
accurate.
701
TF-IDF Vectorizer (Term Frequency —
Inverse Document Frequency)
Vectorizing Text
704
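A minimal sketch of TF-IDF with scikit-learn; rare, informative words get higher weights than words that appear in every document (the sentences here are hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(X.toarray())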
Word2Vec - Efficient Estimation of
Word Representations in Vector Space
Vectorizing Text
706
Vectorizing Text
707
From the Word2Vec Paper - Read – “Efficient Estimation of Word
Representations in Vector Space” https://arxiv.org/pdf/1301.3781.pdf
708
Training Word2Vec
709
Context Window
710
Context Window
711
Continuous Bag of Words & Skip Gram
● Using this method, the Continuous Bag of Words and the Skip Gram
models separate the data into observations of target words and their
context.
● Continuous Bag of Words - In our Fox example, the context is ‘quick,
brown, jumped, over’. These form the features for the Fox class.
● Skip Gram - Here we structure the data so that the target is used to
predict the context, so here we’ll use Fox to predict the context ‘quick,
brown, jumped, over’.
712
Building the Neural Network Model
713
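A minimal sketch of training Word2Vec, assuming the gensim 4.x library (the course notebook may differ); sg=1 selects Skip Gram, sg=0 the Continuous Bag of Words model:

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
             ["the", "fox", "is", "quick"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["fox"][:5])             # first few dimensions of the 'fox' vector
print(model.wv.most_similar("fox"))    # nearest words in the embedding space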
Reinforcement Learning
Introduction to Reinforcement Learning
● In Reinforcement Learning, we teach our algorithm, or 'agent', by letting it act in an environment that produces a state and a reward.
715
Learning in Reinforcement Learning
716
Learning in Reinforcement Learning
717
Q Learning
718
Q Learning State Reward Tables
Action 1 Action 2
State 1 0 5
State 2 5 0
State 3 0 5
State 4 5 0
719
State Reward Tables – Deferred Learning
Action 1 Action 2
State 1 10 0
State 2 5 0
State 3 5 0
State 4 5 40
● After extensive trials, an agent will be able to learn that taking Action 2 repeatedly for States 1, 2, 3 and 4 leads to the greatest reward.
720
The Q Learning Rule
● r – Reward
● 𝛾 – Discounts reward impact (0 to 1)
● max_a′ Q(s′, a′) – This is the maximum Q value possible in the next state. It represents the maximum future reward, encouraging the agent to aim for the max reward in as few action steps as possible
721
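A minimal Python sketch of the standard Q-Learning update (the Q table and step below are hypothetical); alpha is the learning rate, gamma the discount factor described above:

import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

def update(state, action, reward, next_state):
    best_next = np.max(Q[next_state])                        # max_a' Q(s', a')
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

update(state=0, action=1, reward=0, next_state=1)            # one hypothetical step
print(Q)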
Introduction to Big Data
Big Data
723
724
Big Data Defined
725
Examples of Big Data
726
Classifications of Big Data
727
Structured Data
728
Semi-Structured Data
729
Un-Structured Data
730
Big Data Characteristics - The 3Vs
733
Challenges of Dealing with Big Data
● Storage Requirements
● Keep forever?
● Can we keep adding new data easily?
● How long do reports take to generate?
734
Big Data Solutions
735
Distributed Data/Computing
● Big Data storage and analysis requires new tools (software and
hardware)
● Introducing MapReduce
736
Hadoop, MapReduce and Spark
Big Data History
738
MapReduce
739
MapReduce
● Map phase: The user specifies a map function that is applied to each
key-value pair, producing other key-value pairs, referred to as
intermediate key-value pairs.
740
MapReduce
741
Hadoop
● Hadoop is a JAVA based open source system developed by Apache in 2006 and provides
a software framework for distributed storage and processing of big data using
the MapReduce programming model.
● Hadoop is a Processing Engine/Framework and introduced HDFS (the Hadoop Distributed File System), a batch processing engine (MapReduce) & a Resource Management Layer (YARN).
● Hadoop provided the ability to analyze large data sets. However, it relied heavily on disk storage as opposed to memory for computation. Hadoop was therefore very slow for calculations that require multiple passes over the same data.
● While this allowed the hardware requirements for Hadoop operations to be cheap (hard disk space is far cheaper than RAM), it made accessing and processing data much slower.
● However, Hadoop had poor support for SQL and machine learning implementations.
742
Spark
● The UC Berkeley AMP Lab and Databricks spearheaded the development of Spark, which aimed to solve many of the deficiencies, performance issues and complexities of Hadoop
● It was initially released in 2014 and donated to the Apache Software Foundation
Source: https://databricks.com/spark/about
744
RDDs – Resilient Distributed Data Set
745
Introduction to PySpark
PySpark
● The Spark toolkit was written in Scala, a language that compiles to byte code for the Java Virtual Machine (JVM); however, numerous implementations or wrappers have been developed for R, Java, SQL and of course Python!
● As such, Python users can now work with RDDs in the Python programming language too
747
PySpark Overview
748
PySpark in Industry
● Netflix has used PySpark internally to power many of
their backend machine learning tasks (apparently almost
one trillion per day!)
● The healthcare industry has used PySpark to perform
analytics including Genome Sequencing
● The financial sector has made full use of PySpark for in-
house trading, banking, credit risk and insurance use
cases.
● Retail and E-commerce – Literally begs the use of
PySpark given that these businesses have millions and
millions of sales and retail data in their data warehouses.
Both eBay and Alibaba are known to use PySpark.
749
RDDs, Transformations, Actions,
Lineage Graphs & Jobs
RDDs (Resilient Distributed Data) in Python
751
Lazy Evaluation and Pipelining
● Spark's RDD implementation allows us to evaluate code "lazily“.
752
Types of Spark Methods
● Transformations:
○ map()
○ reduceByKey()
● Actions:
○ take()
○ reduce()
○ saveAsTextFile()
○ collect()
753
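A minimal PySpark sketch of the classic word count, assuming a local SparkContext; map() and reduceByKey() are Transformations, collect() is an Action that triggers the job:

from pyspark import SparkContext

sc = SparkContext("local", "wordcount")
lines = sc.parallelize(["spark makes big data simple", "big data with spark"])

counts = (lines.flatMap(lambda line: line.split(" "))   # Transformation
               .map(lambda word: (word, 1))             # Transformation
               .reduceByKey(lambda a, b: a + b))        # Transformation
print(counts.collect())                                 # Action - triggers the job
sc.stop()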
Transformations
● Transformations are one of the methods you can perform on an RDD in Spark.
● They are lazy operations that create one or more new RDDs (because RDDs are
immutable, they can’t be altered in any way once they’ve been created)
● Transformations take an RDD as an input and apply some function on them and
outputs one or more RDDs.
● Let’s talk about lazy evaluation - as the Scala compiler comes across each
Transformation, it doesn’t actually build any new RDDs yet. Instead, it
constructs a chain (or pipeline) of hypothetical RDDs that would result from
those Transformations which will only be evaluated once an Action is called.
● This chain of hypothetical, or “child”, RDDs are all connected logically back to
the original “parent” RDD, this concept is called the lineage graph.
754
Actions
● Actions are any operations on RDDs that do not produce
an RDD output
755
Lineage Graphs
● Remember the chain or pipeline constructed due to our Lazy
Evaluation? These were Transformation operations that are only
evaluated when an Action is called. This chain of hypothetical,
or “child”, RDDs are all connected logically back to the original
“parent” RDD, this concept is called the lineage graph.
● A lineage graph outlines a "logical execution plan". The compiler begins with the earliest RDDs that aren't dependent on any other RDDs, and follows a logical chain of Transformations until it ends with the RDD that an Action is called on.
● Lineage Graphs are the drivers of Spark's fault tolerance. If a Node fails, the information about what that node was supposed to do is held in the lineage graph and thus can be done elsewhere.
[Figure: Visualization of an example lineage graph; r00, r01 are parent RDDs, r20 is the final RDD (source: Jacek Laskowski – https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-lineage.html#logical-execution-plan)]
756
Spark Applications and Jobs
● In Spark, whenever processing needs to be done, there
is a Driver process that is in charge of taking the user’s
code and converting it into a set of multiple tasks.
● There are also executor processes, each operating on a
separate node in the cluster, that are in charge of
running the tasks, as delegated by the driver.
● Each driver process has a set of executors that it has
access to in order to run tasks.
● A Spark application is a user-built program that consists of a driver and that driver's associated executors.
[Figure: Visualization of Spark Architecture (from the Spark API docs – https://spark.apache.org/docs/latest/cluster-overview.html)]
757
Spark Overview
758
Simple Data Cleaning in PySpark
Machine Learning in PySpark
Customer Lifetime Value (CLV)
Customer Lifetime Value
● That's where CLV comes in: we need to determine the expected lifetime value a customer has to our business.
762
Customer Lifetime Value Defined
763
Customer Lifetime Value Application
764
Benefits of knowing your CLV
● Push the marketing channels that bring you your most valuable
customers
765
What CLV is Not!
767
Another CLV Pitfall Example
768
Buy-til-you-die (BTYD) models
The buy-til-you-die model
● In 1987, researchers at Wharton and Columbia developed a model called the Pareto/NBD (Negative Binomial Distribution) that could be used for estimating the number of future purchases a customer will make.
770
BTYD Method
● BTYD models predict the purchasing activity of a customer using two stochastic probabilistic models:
1. The probability of a customer making a repeat purchase
2. The probability that a customer churns or 'dies'
771
Using BTYD to Estimate Expected No. of
Future Purchases
772
Obtaining a Customers Residual Lifetime Value
● To get our CLV just add the sum of each customer’s past
purchases to their RLV!
773
The Beta-Geometric/Negative Binomial Distribution
774
The lifetimes Module in Python
775
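A minimal sketch, assuming the lifetimes package and its bundled CDNOW sample data (the course notebook may use different customer data):

from lifetimes import BetaGeoFitter
from lifetimes.datasets import load_cdnow_summary

data = load_cdnow_summary(index_col=[0])      # frequency, recency, T per customer
bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(data['frequency'], data['recency'], data['T'])

# Expected number of purchases in the next 30 periods for each customer
data['predicted_30'] = bgf.conditional_expected_number_of_purchases_up_to_time(
    30, data['frequency'], data['recency'], data['T'])
print(data.head())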
776
Deploying your Machine Learning Model
777
Cloud Deployments
● There are many ways to deploy models and servers to various cloud
services
● For Example, using AWS’s EC2 (a web service that provides secure,
resizable compute capacity in the cloud)
778
Why Heroku?
● AWS and the others require too much manual configuration and also a credit card to sign up
● Heroku allows users free access to test their platform without a credit
card
779
A bit about Continuous Integration/Continuous Deployment
780
Steps to Deploy our Machine Learning Model
781
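A minimal sketch, assuming a Flask app serving a pickled scikit-learn model (the actual deployment in the course may be structured differently); on Heroku this would typically be run by gunicorn via a Procfile:

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))      # hypothetical saved model file

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]     # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run()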
Deep Learning Recommendation
Engines
Why Deep Learning for Recommendation Engines?
783