Module 1

Syllabus
Module-2: Classification
✔Logistic Regression, Decision Trees, Naïve Bayes-conditional
probability - Random Forest - SVM Classifier (6 Hours)
Module-3: Clustering
✔K-means, K-medoids, Hierarchical clustering (4 Hours)
Module-4: Optimization
✔Gradient descent - Variants of gradient descent - Momentum - Adagrad
- RMSprop - Adam – AMSGrad (3 Hours)
Text Books
✔ Cathy O'Neil and Rachel Schutt, "Doing Data Science: Straight Talk from the Frontline", O'Reilly, 2014.
✔ Dan Toomey, "R for Data Science", Packt Publishing, 2014.
✔ Trevor Hastie, Robert Tibshirani and Jerome Friedman, "The Elements of Statistical Learning", Springer, Second Edition, 2009.
✔ Kevin P. Murphy, "Machine Learning: A Probabilistic Perspective", MIT Press, First Edition, 2012.
Reference Books
✔ Glenn J. Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", John Wiley & Sons, Second Edition, 2014.
✔ G. K. Gupta, "Introduction to Data Mining with Case Studies", Eastern Economy Edition, Prentice Hall of India, 2006.
✔ Michael Berthold, David J. Hand, "Intelligent Data Analysis", Springer, 2007.
✔ Colleen McCue, "Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis", Elsevier, 2007.
✔ R N Prasad, Seema Acharya, "Fundamentals of Business Analytics", Wiley, Second Edition, 2016.
✔ https://www.sscnasscom.com/qualification-pack/SSC/Q2101/
Assessment Process (Theory)

Component              Marks
CAT-1                    15
CAT-2                    15
Assignments/Quizzes      30
FAT                      40
Total                   100
Module-1: Regression Analysis
✔ Linear regression: Simple linear regression - Regression Modelling - Correlation, ANOVA, Forecasting, Autocorrelation (6 Hours)
Data Analytics – What?
The science of analyzing raw data in order to draw conclusions about that information [Investopedia].
[Figure: types of learning, including Reinforcement Learning and Statistical Learning]
Statistical Learning
• Supervised statistical learning:
It is defined by the use of labelled datasets to train algorithms to classify data or predict outcomes accurately.
Examples: problems in business, medicine, astrophysics, and public policy.
• Unsupervised statistical learning:
Unsupervised learning uses unlabelled data. From that data, it discovers patterns that help with clustering, or it learns relationships and structure from the data.
Example: an input dataset containing images of different types of cats and dogs, with no labels attached.
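To make the contrast concrete, here is a minimal sketch in Python using scikit-learn; the toy data and the choice of LogisticRegression and KMeans are illustrative assumptions, not from the slides.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy data: two features per sample
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])

# Supervised: labels are known (0 = cat, 1 = dog), so we train a classifier
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.1, 2.0]]))   # predicts the label of a new sample

# Unsupervised: no labels; the algorithm discovers group structure itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                  # cluster assignment for each sample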
Statistical Learning – Wage Data
[Figure slides: plots of the Wage data]

Statistical Learning – Advertising Data
[Figure slides: plots of the Advertising data]
• Input variables: advertising budgets
• Input variables are denoted by X
  • X1 – TV budget
  • X2 – Radio budget
  • X3 – Newspaper budget
• Input variables are called by different names, such as:
  • Predictors
  • Independent variables
  • Features
  • Variables
Statistical Learning – Income Data
[Figure slides: plots of the Income data]
Linear Regression - Introduction
• Linear Regression is a statistical procedure that determines the equation of the straight line that best fits a specific set of data.
• It is a very simple approach to supervised learning.
• In particular, it is a useful tool for predicting a quantitative response.
[Figure: example scatter plot with best-fit line (JEE Score data)]
Simple Linear Regression
Estimating the coefficients β0 and β1: the least squares estimates are
  β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²
  β̂0 = ȳ − β̂1 · x̄
where x̄ and ȳ are the sample means of X and Y.
Simple Linear Regression
Question-1:
Consider the following five training examples:
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f(x) of the form f(x) = ax + b, which is parameterized by (a, b). Using squared error as the loss function, which of the following parameters would you use to model this function?
(a) (4, 3)
(b) (5, 3)
(c) (5, 1)
(d) (1, 5)
Simple Linear Regression
Solution:
1. Calculate Ypredicted for the given X using each candidate (a, b).
2. For each (a, b), calculate the RSS (Residual Sum of Squares), where RSS = Σi (Yi − Ypredicted,i)².
3. The best set of parameters is the one that gives the minimum RSS.
Simple Linear Regression
Solution (using Ypredicted = aX + b):
For a = 4 and b = 3:

X    Y          Ypredicted   (Y − Ypredicted)²
2    12.8978    11            3.6016
3    17.7586    15            7.6099
4    23.3192    19           18.6555
5    28.3129    23           28.2269
6    32.1351    27           26.3693
                RSS =        84.4632
For a = 5 and b = 3:

X    Y          Ypredicted   (Y − Ypredicted)²
2    12.8978    13            0.0104
3    17.7586    18            0.0583
4    23.3192    23            0.1019
5    28.3129    28            0.0979
6    32.1351    33            0.7481
                RSS =         1.0166
For a = 5 and b = 1:

X    Y          Ypredicted   (Y − Ypredicted)²
2    12.8978    11            3.6016
3    17.7586    16            3.0927
4    23.3192    21            5.3787
5    28.3129    26            5.3495
6    32.1351    31            1.2885
                RSS =        18.7110
For a = 1 and b = 5:

X    Y          Ypredicted   (Y − Ypredicted)²
2    12.8978     7           34.7840
3    17.7586     8           95.2303
4    23.3192     9          205.0395
5    28.3129    10          335.3623
6    32.1351    11          446.6925
                RSS =      1117.1086

Answer: (a, b) = (5, 3) gives the minimum RSS (1.0166), so option (b) is correct.
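A minimal sketch of the same computation in Python (numpy assumed available):

import numpy as np

X = np.array([2, 3, 4, 5, 6])
Y = np.array([12.8978, 17.7586, 23.3192, 28.3129, 32.1351])

# Candidate (a, b) pairs from the question
for a, b in [(4, 3), (5, 3), (5, 1), (1, 5)]:
    y_pred = a * X + b
    rss = np.sum((Y - y_pred) ** 2)   # residual sum of squares
    print(f"a={a}, b={b}: RSS = {rss:.4f}")

# (5, 3) gives the smallest RSS, confirming option (b).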
Simple Linear Regression
Solution:
(a) To find the best fit, calculate the model coefficients using the least squares formulas above:
  a = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² = 49.029 / 10 = 4.9029
  b = ȳ − a·x̄ = 22.8847 − 4.9029 × 4 ≈ 3.2732
Simple Linear Regression
Solution:
(c) Residual plot for the best linear fit (a = 4.9029, b = 3.2732):

X    Y          Ypredicted   Residual (Y − Ypredicted)
2    12.8978    13.0789      −0.1811
3    17.7586    17.9818      −0.2232
4    23.3192    22.8847       0.4345
5    28.3129    27.7876       0.5253
6    32.1351    32.6905      −0.5554

The random pattern in the residuals is an indication that a linear model is suitable for this data.
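The best-fit coefficients and residuals can be reproduced with numpy's polyfit; a sketch, not from the slides:

import numpy as np

X = np.array([2, 3, 4, 5, 6])
Y = np.array([12.8978, 17.7586, 23.3192, 28.3129, 32.1351])

# Least squares fit of a degree-1 polynomial: returns [slope, intercept]
a, b = np.polyfit(X, Y, 1)
print(a, b)                  # approx. 4.9029 and 3.2732

residuals = Y - (a * X + b)
print(residuals)             # no obvious pattern -> linear model is suitable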
Linear Regression Model Validation
Several measures can be used to validate a linear regression model; R² is just one such measure.
1. R² (R-squared)
R-squared typically takes a value in the range 0 to 1. The closer the R-squared value is to 1, the better the fit.
Mean Absolute Percentage Error (MAPE)
The closer the MAPE value is to zero, the better the predictions. A MAPE of less than 5% is considered an indication that the forecast is acceptably accurate.
4. Mean Absolute Error (MAE)
The lower the value, the better; 0 means the model is perfect.
6. Root Mean Squared Error (RMSE)
RMSE = sqrt( (1/n) · Σi (yi − ŷi)² )
8. Confidence Interval
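A compact sketch of these metrics in Python (numpy assumed; the function name and sample values are illustrative):

import numpy as np

def validation_metrics(y_true, y_pred):
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1 - ss_res / ss_tot                          # R-squared
    mape = np.mean(np.abs(residuals / y_true)) * 100  # MAPE, in percent
    mae = np.mean(np.abs(residuals))                  # mean absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))           # root mean squared error
    return r2, mape, mae, rmse

y = np.array([12.8978, 17.7586, 23.3192, 28.3129, 32.1351])
y_hat = np.array([13.0789, 17.9818, 22.8847, 27.7876, 32.6905])
print(validation_metrics(y, y_hat))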
Example: Confidence Interval for Regression Coefficient in R
alpha = 0.05 (i.e., a 95% confidence interval)
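In R this is typically done with confint() on a fitted lm model; the slide's R code is not reproduced here. A comparable sketch in Python with scipy, reusing the earlier example data:

import numpy as np
from scipy import stats

x = np.array([2, 3, 4, 5, 6])
y = np.array([12.8978, 17.7586, 23.3192, 28.3129, 32.1351])

res = stats.linregress(x, y)          # least squares fit
alpha = 0.05
# two-sided t critical value with n - 2 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=len(x) - 2)
lo = res.slope - t_crit * res.stderr  # lower bound for the slope
hi = res.slope + t_crit * res.stderr  # upper bound for the slope
print(f"95% CI for slope: ({lo:.4f}, {hi:.4f})")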
Covariance and Correlation
Both terms are used to describe how two variables relate to each other.
Covariance
Covariance signifies the direction of the linear relationship between the two variables. The value of covariance can be any number between −∞ and +∞.
Covariance Formula (sample covariance):
  cov(X, Y) = Σi (xi − x̄)(yi − ȳ) / (n − 1)
Types of covariance
Positive covariance
Positive covariance means both the variables (X, Y) move in the same direction
(i.e. show similar behavior). So, if greater values of one variable (X) seem to
correspond with greater values of another variable (Y), then the variables are
considered to have positive covariance. This tells you something about the linear
relationship between the two variables. So, for example, if an increase in a
person’s height corresponds with an increase in a person’s weight, there is
positive covariance between the two.
Types of covariance…
Negative covariance
Negative covariance means both the variables (X, Y) move in the
opposite direction. As opposed to positive covariance, if the greater
values of one variable (X) correspond to lesser values of another
variable (Y) and vice-versa, then the variables are considered to have
negative covariance.
Correlation
The correlation coefficient ranges from −1 to +1. The closer it is to +1 or −1, the more closely the two variables are related. When the correlation coefficient is negative, the changes in the two variables are in opposite directions.
Types of correlation
Positive correlation: the two variables move in the same direction; the correlation coefficient lies between 0 and +1.
What is a correlation matrix?
A correlation matrix is a table showing the correlation coefficient between each pair of variables in a dataset.
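A small sketch computing covariance and the correlation matrix with numpy (the height/weight values are made up for illustration):

import numpy as np

height = np.array([150, 160, 165, 172, 180])  # cm
weight = np.array([52, 58, 63, 70, 78])       # kg

# Sample covariance matrix; the off-diagonal entry is cov(height, weight)
print(np.cov(height, weight))

# Correlation matrix; entries lie in [-1, +1]
print(np.corrcoef(height, weight))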
Sampling
• IoE inspection to get feedback from students/faculty/parents/Alumni/Industry
• Quality control (Statistical Quality Control)
  • 100% inspection
  • Sample inspection
• Conducting experiments
Note: there should not be significant variation between the sample mean and the population mean; this must be demonstrated statistically.
Why ANOVA?
Helps us to understand how different sample groups respond.
ANOVA
• ANOVA – ANalysis Of VAriance
• Variance:
  • The variance measures the average degree to which each data point differs from the mean.
  • The variance is greater when there is a wider range of numbers in the group.
  • The calculation of variance uses squares because it weighs outliers more heavily than data points closer to the mean.
  • This prevents differences above the mean from canceling out those below, which would result in a variance of zero.
  • Thus variance is the average of the squared differences from the mean.
• ANOVA is a hypothesis-testing procedure that is used to evaluate differences between two or more samples.
Standard Deviation:
• Standard deviation tells how far the data points are from the mean.
• It is the square root of variance.
• These two statistical concepts are closely related.
• For data analysts, these two concepts are of paramount importance, as they are used to measure the volatility of a data distribution.
• In stock trading, a lower standard deviation indicates a less risky investment.
ANOVA
Shows how far each group mean is from the mean of the larger, combined population (the grand mean).
ANOVA
Step 3: Calculating the means
• The mean of each group
• The grand mean (the mean of all observations combined)
Decision rule: if F_statistic < F_critical, we fail to reject the null hypothesis of equal group means.
ANOVA
Example: random samples of scores from three groups (columns)

First Year   Second Year   Third Year
82           62            64
93           85            73
61           94            87
74           78            91
69           71            56
53           66            78
ANOVA
Calculate the mean of each column: First Year = 72, Second Year = 76, Third Year = 74.83.
ANOVA
Partitioning the Sum of Squares:
  SS_total = SS_between + SS_within
ANOVA
Squared deviations of each score from the grand mean (X̄grand = 74.28):

First Year   Second Year   Third Year   (XA − X̄)²   (XB − X̄)²   (XC − X̄)²
82           62            64            59.633      150.744     105.633
93           85            73           350.522      114.966       1.633
61           94            87           176.299      388.966     161.855
74           78            91             0.077       13.855     279.633
69           71            56            27.855       10.744     334.077
53           66            78           452.744       68.522      13.855
Sum: 432     456           449          1067.130     747.796     896.685
Mean: 72     76            74.83
Sum of Squares_between:
1. Find the difference between each group mean and the overall (grand) mean
2. Square the deviations
3. Multiply each squared deviation by the number of values in its column
4. Add them up

SS_between = 6 × [(72 − 74.2778)² + (76 − 74.2778)² + (74.8333 − 74.2778)²] ≈ 50.78
ANOVA
df = Degrees of Freedom
1. df between the columns: k − 1 = 3 − 1 = 2
2. df within the groups: N − k = 18 − 3 = 15
ANOVA F-statistic
F = MSC / MSE
MSC = Mean Square Columns/Treatments = SS_between / df_between
MSE = Mean Square Error/Within = SS_within / df_within
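This example can be checked quickly in Python with scipy (assumed available); the arrays reproduce the score table above:

import numpy as np
from scipy import stats

first = np.array([82, 93, 61, 74, 69, 53])
second = np.array([62, 85, 94, 78, 71, 66])
third = np.array([64, 73, 87, 91, 56, 78])

# One-way ANOVA across the three groups
f_stat, p_value = stats.f_oneway(first, second, third)
print(f_stat, p_value)        # F is roughly 0.14 for this data

# Critical F at alpha = 0.05 with df_between = 2 and df_within = 15
f_crit = stats.f.ppf(0.95, dfn=2, dfd=15)
print(f_crit)                 # roughly 3.68

# f_stat < f_crit, so we fail to reject the null hypothesis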
Forecasting
When building a forecasting model, the following factors should be considered:
• Amount of data
• Data quality
• Seasonality
• Trends
• Unexpected events
The amount of data is probably the most important factor (assuming that the data is accurate). A good rule of thumb: the more data we have, the better the forecasts our model will generate.

Seasonality means that there are distinct periods of time when the data contains consistent irregularities. For example, if an online web shop analyzed its sales history, it would be evident that the holiday season results in an increased amount of sales.

Trends are probably the most important information you are looking for. They indicate whether a variable in the time series will increase or decrease in a given period.

1. Simple Average
The forecast for next month (or any month in the future, for that matter) is the average of all sales that have occurred in the past.
2. Autoregression
y = b0 + b1·x1
This technique can be used on a time series where we predict the value for the next time step (t+1) given the observations at the last two time steps (t and t−1), i.e.
  X(t+1) = b0 + b1·X(t) + b2·X(t−1)
Because the regression model uses data from the same input variable at previous time steps, it is referred to as autoregression (regression of self).
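A minimal two-lag autoregression sketch using numpy least squares; the series reuses the sales data from the moving-average example below:

import numpy as np

series = np.array([5.2, 4.9, 5.5, 4.9, 5.2, 5.7, 5.4, 5.8, 5.9, 6.0, 5.2, 4.8])

# Build the lagged design matrix: predict X(t+1) from X(t) and X(t-1)
X = np.column_stack([np.ones(len(series) - 2), series[1:-1], series[:-2]])
y = series[2:]

# Least squares estimate of [b0, b1, b2]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef

# One-step-ahead forecast from the last two observations
forecast = b0 + b1 * series[-1] + b2 * series[-2]
print(forecast)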
3. Simple Moving Average
Rather than use all the previous data in the calculation of an average as the forecast, why not use just some of the more recent data? This is precisely what a moving average does, with the following formula:
  F(t+1) = ( Y(t) + Y(t−1) + … + Y(t−k+1) ) / k
where k is the number of recent periods included.
Simple Moving Average (k = 3)

Year:   1    2    3    4    5    6    7    8    9    10   11   12
Sales:  5.2  4.9  5.5  4.9  5.2  5.7  5.4  5.8  5.9  6    5.2  4.8

Year   Actual   Forecast (3-period moving average)
7      5.4      (4.9 + 5.2 + 5.7)/3 = 5.267
8      5.8      (5.2 + 5.7 + 5.4)/3 = 5.433
9      5.9      (5.7 + 5.4 + 5.8)/3 = 5.633
10     6        (5.4 + 5.8 + 5.9)/3 = 5.7
11     5.2      (5.8 + 5.9 + 6)/3 = 5.9
12     4.8      (5.9 + 6 + 5.2)/3 = 5.7
Percentage error calculation for one period: |(Actual − Forecast)/Actual| × 100% = |(4.9 − 5.2)/4.9| × 100% = 6.12%
Forecasting errors
Example 2:
Calculate a four-year moving average from the following data set:
4. Weighted Moving Average forecast example (weights 1, 2, 1)

Year:   1    2    3    4    5    6    7    8    9    10   11   12
Sales:  5.2  4.9  5.5  4.9  5.2  5.7  5.4  5.8  5.9  6    5.2  4.8

Year   Actual   Forecast (weighted moving average)
4      4.9      (1×5.2 + 2×4.9 + 1×5.5)/4 = 5.125
5      5.2      (1×4.9 + 2×5.5 + 1×4.9)/4 = 5.2
6      5.7      (1×5.5 + 2×4.9 + 1×5.2)/4 = 5.125
7      5.4      (1×4.9 + 2×5.2 + 1×5.7)/4 = 5.25
8      5.8      (1×5.2 + 2×5.7 + 1×5.4)/4 = 5.5
9      5.9      (1×5.7 + 2×5.4 + 1×5.8)/4 = 5.575
10     6        (1×5.4 + 2×5.8 + 1×5.9)/4 = 5.725
11     5.2      (1×5.8 + 2×5.9 + 1×6)/4 = 5.9
12     4.8      (1×5.9 + 2×6 + 1×5.2)/4 = 5.775
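Both moving averages can be sketched in Python with numpy's convolve (an illustrative sketch matching the tables above):

import numpy as np

sales = np.array([5.2, 4.9, 5.5, 4.9, 5.2, 5.7, 5.4, 5.8, 5.9, 6.0, 5.2, 4.8])

# Simple 3-period moving average: the window over years t-2..t forecasts year t+1
sma = np.convolve(sales, np.ones(3) / 3, mode='valid')
print(sma)          # sma[i] is the forecast for sales[i + 3]

# Weighted moving average with weights (1, 2, 1) / 4
w = np.array([1, 2, 1]) / 4
wma = np.convolve(sales, w[::-1], mode='valid')
print(wma)          # wma[i] is the forecast for sales[i + 3]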
5. Exponential Smoothing forecast

Year:   1   2   3   4   5   6   7   8   9   10
Sales:  30  25  35  25  20  30  35  40  30  45

The smoothed forecast is updated as F(t+1) = α·Y(t) + (1 − α)·F(t), where α (0 < α < 1) is the smoothing constant.
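A minimal exponential smoothing sketch in plain Python; α = 0.3 is an illustrative choice, since the slide's smoothing constant is not shown:

sales = [30, 25, 35, 25, 20, 30, 35, 40, 30, 45]
alpha = 0.3

# Initialize the first forecast with the first actual value
forecast = [sales[0]]
for t in range(1, len(sales)):
    # F(t+1) = alpha * Y(t) + (1 - alpha) * F(t)
    forecast.append(alpha * sales[t - 1] + (1 - alpha) * forecast[-1])

for year, (y, f) in enumerate(zip(sales, forecast), start=1):
    print(year, y, round(f, 2))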
A pure autoregressive (AR only) model is one where Yt depends only on its own lagged values; likewise, a pure Moving Average (MA only) model is one where Yt depends only on the lagged forecast errors.
If we combine differencing with autoregression and a moving average model, we obtain a non-seasonal ARIMA model. The full model can be represented with the following equation:
  y′t = c + φ1 y′(t−1) + … + φp y′(t−p) + θ1 ε(t−1) + … + θq ε(t−q) + εt
where y′t is the differenced series, φi are the autoregressive coefficients, θj are the moving average coefficients, and εt is white noise.
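A fitting sketch using statsmodels (assuming it is installed); the order (1, 1, 1) and the series are illustrative choices, not from the slides:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.array([30, 25, 35, 25, 20, 30, 35, 40, 30, 45], dtype=float)

# Non-seasonal ARIMA(p=1, d=1, q=1): AR order, differencing, MA order
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()
print(fitted.forecast(steps=3))   # forecast the next three values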
Autocorrelation
Autocorrelation analysis is an important step in the exploratory data analysis of time series forecasting. It helps detect patterns and check for randomness.
Autocorrelation is the correlation between a time series and a lagged version of itself.
Any autocorrelation present in time series data is identified using a correlogram, also known as an ACF plot. This helps you determine whether your series exhibits autocorrelation at all, at which point you can begin to better understand the pattern the values in the series may be following.
An autoregressive model is one where a value from a time series is regressed on previous values from that same time series.
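A short sketch that draws a correlogram with statsmodels' plot_acf (assuming statsmodels and matplotlib are installed; the series is illustrative):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

series = np.array([5.2, 4.9, 5.5, 4.9, 5.2, 5.7, 5.4, 5.8, 5.9, 6.0, 5.2, 4.8])

# Correlogram: correlation of the series with lagged copies of itself
plot_acf(series, lags=5)
plt.show()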