Module 1
What is Learning?
Learning: the ability to improve behavior based on experience.
[Figure: over successive trials the number of correct answers rises (2, 3, 4) and the number of incorrect answers falls (3, 2, 1); behavior improves with experience.]
What is Machine Learning?
Machine learning is the design of algorithms that:
• Learn from data, or build models using that data
• The learned model can be used to:
  • Detect patterns/structures/themes/trends etc. in the data
  • Make predictions about future data, and make decisions
• Modern ML algorithms are heavily "data-driven"
  • No need to pre-define and hard-code all the rules (usually infeasible/impossible anyway)
  • The rules are not "static"; they can adapt as the ML algorithm ingests more and more data
Machine Learning vs Programming
When to Use Machine Learning?
• Human expertise is absent
  Example: navigating on Mars
• Humans are unable to explain their expertise
  Example: vision, speech, language
• Requirements and data change over time
  Example: tracking, biometrics, personalized fingerprint recognition
• The problem or the data size is just too large
Why Machine Learning?
• The term "machine learning" was first coined in 1959 (by Arthur Samuel).
• A computer model based on a neural network was created in 1943 (by McCulloch and Pitts).
Why Machine Learning? DATA
• Structured data and unstructured data
• "More than 300 million photos get uploaded per day. Every minute there are 510,000 comments posted and 293,000 statuses updated."
• "Over 2.5 quintillion bytes of data are created every single day, and it's only going to grow from there. By 2020, it is said that 1.7 MB of data was created every second for every person on earth."
• More than 80% of data is unstructured.
Why Machine Learning? OPTIMIZED ALGORITHMS
• Python
• Libraries: Pandas, NumPy, scikit-learn, Keras, TensorFlow, PyTorch, Theano
• Less programming, more science!
Why Machine Learning? COMPUTING POWER
• Powerful CPUs
• GPUs
• Parallel and distributed computing
Jargon Difference!
Machine Learning vs. Data Science
Types of Learning
• Supervised (inductive) learning: training data includes the desired outputs.
• Unsupervised learning: training data does not include desired outputs; find hidden/interesting structure in the data.
• Semi-supervised learning: training data includes a few desired outputs.
• Reinforcement learning: the learner interacts with the world via "actions" and tries to find an optimal policy of behavior with respect to the "rewards" it receives from the environment.
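To make the first two paradigms concrete, here is a minimal sketch using scikit-learn on its built-in iris dataset; the model choices and dataset are illustrative assumptions, not from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the training data includes the desired outputs y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted class:", clf.predict(X[:1]))

# Unsupervised: only X is given; find hidden structure (clusters).
km = KMeans(n_clusters=3, n_init=10).fit(X)
print("Cluster id:", km.labels_[0])
```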
A Typical Supervised Learning Workflow (for Classification)
A Typical Unsupervised Learning Workflow (for Clustering)
Geometric View of Some Basic ML Problems
• Regression (supervised learning): learn a line/curve (the "model") using training data consisting of input-output pairs, where each output is a real-valued number
• Classification
• Clustering
• Dimensionality Reduction

[Figure: labeled training data (images tagged "dog"/"cat") feeds a supervised ML model of p(class|image); unlabeled training data feeds an unsupervised ML model of p(image).]
Machine Learning = Function Approximation
Supervised learning ("predict y given x") can be thought of as learning a function that maps x to y.
[Figure: labeled training data (images tagged "dog"/"cat") feeds a supervised ML model learning a function image → class.]
Unsupervised learning ("model x") can also be thought of as learning a function that maps x to some useful latent representation of x.
[Figure: unlabeled training data feeds an unsupervised ML model learning a function image → latent representation of the image (e.g., cluster id or compressed version).]
This is harder for unsupervised learning because there is no supervision y.
Other ML paradigms (e.g., reinforcement learning) can also be thought of as doing function approximation.
Machine Learning in the Real World
Broadly applicable in many domains (e.g., internet, robotics, healthcare and biology, computer vision, NLP, databases, computer systems, finance, etc.)
Machine Learning helps Computer Vision
Machine Learning helps NLP
Machine Learning helps NLP: Search and Information Retrieval
Machine Learning meets Speech Processing
ML algorithms can learn to translate speech in real time.
Machine Learning helps Chemistry
ML algorithms can understand properties of molecules and learn to synthesize new molecules.
ML algorithms can "read" databases of materials and recreate the Periodic Table within hours.
Machine Learning helps in Biology, E-commerce
Classification Learning
Task T:
Input
• A set of instances d1, d2, ..., dn
• Each instance has a set of features; we can represent an instance as a vector d = <x1, x2, x3, ..., xn>
Output
• A prediction drawn from a fixed set of values y1, y2, y3, ..., yc
• E.g., {+1, -1}
Inductive Learning
• Also called deterministic supervised learning.
• First, a verified input x is given to a function f, and the output f(x) is observed.
• Then we can give different sets of inputs (raw inputs) to the same function f and verify the outputs f(x).
• Using these outputs, we generate (learn) the rules.
Inductive Learning
• Inductive learning, also known as discovery learning, is a process where the learner discovers rules by observing examples.
• We can often work out rules for ourselves by observing examples: if there is a pattern, we record it.
• We then apply the rule in different situations to see if it works.
• In inductive learning, tasks are designed specifically to guide the learner and assist them in discovering a rule.
Inductive Learning
• Inductive learning or “Prediction”:
– Given examples of a function (X, F(X))
– Predict function F(X) for new examples X
This is the function which we are trying to learn.
• Classification
F(X) = Discrete
• Regression
F(X) = Continuous
• Probability estimation
F(X) = Probability(X):
2. Integer valued
Example: number of words in a text
Example training instance: <0.5, 2.8, +>
[Figure: 2-D feature space (axes from 0.0 to 3.0) with training points labeled + and -.]
2. Polynomial: e.g., a quadratic function ax² + bx + c, with 3 parameters a, b, c
3. Complex function
Note: we are interested in a function which not only fits the training data but also works well with future or test data.
Basic Terminologies
Representation of a Function
• When we talk of the representation of these hypotheses (or functions), we have two things: one is the features and the other is the function class.
Basic Terminologies
Hypothesis: a function for labeling examples.
[Figure: 2-D feature space with points labeled + and -, and unlabeled points marked "?" that the hypothesis must label.]
Hypothesis Space
Hypothesis space: the set of legal hypotheses.
[Figure: the same 2-D feature space of + and - training points; many different hypotheses could label it.]
Hypothesis Space
If there are 2 Boolean input features then there are possible instances.
If there are 3 Boolean input features then there are possible instances.
If there are n Boolean input features then there are possible instances.
If there are 2 Boolean input features then there are possible Boolean
functions
If there are 3 Boolean input features then there are possible Boolean
functions.
61
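A tiny sketch verifying these counts: with n Boolean features there are 2ⁿ instances, and a Boolean function assigns one of 2 outputs to each instance, giving 2^(2ⁿ) functions.

```python
from itertools import product

for n in (2, 3):
    instances = list(product([0, 1], repeat=n))  # all 2**n instances
    print(n, "features:", len(instances), "instances,",
          2 ** len(instances), "Boolean functions")
```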
Inductive Learning in General
• Inducing a general function from training examples.
• Construct a hypothesis h that agrees with all the training examples.
• A hypothesis is consistent if it agrees (works well) with all the training examples.
• A hypothesis is said to generalize if it correctly predicts the value of y for new examples.
Example applications:
• Disease diagnosis
  x: properties of patient (e.g., symptoms, lab test results)
  f(x): predicted disease
• Automated steering
  x: bitmap picture of road in front of car
  f(x): degrees to turn the steering wheel
• Credit risk assessment
  x: customer credit history and proposed purchase
  f(x): approve purchase or not
Learning = Representation + Evaluation + Optimization
• Learning algorithms are combinations of just three elements:

Representation     Evaluation        Optimization
Instances          Accuracy          Greedy search
Hyperplanes        Precision/Recall  Branch & bound
Decision trees     Squared error     Gradient descent
Sets of rules      Likelihood        Quasi-Newton
Neural networks    Posterior prob.   Linear progr.
Graphical models   Margin            Quadratic progr.
Etc.               Etc.              Etc.
Inductive Bias
As we can see, the hypothesis space is very large. It is not possible to examine every hypothesis individually to choose the best one.
So we put some restrictions on the hypotheses we consider.
Restricting the hypothesis space reflects a bias of the learning algorithm.
Generalization & Error
Bias
• The error introduced due to simplifying assumptions made by a model.
• Simplifying assumptions limit the model's capacity to learn.
• Low bias: fewer assumptions about the form of the target function.
• High bias: more assumptions about the form of the target function.
Generalization & Error
Variance
• Variance tells how much a random variable differs from its expected value.
• If a model performs well on the training dataset but poorly on the test dataset, the problem is high variance.
• Low variance: small changes to the estimated model when the training dataset changes.
• High variance: large changes to the estimated model when the training dataset changes.
Bias-Variance Trade-off
[Figures: models fit to train data and evaluated on test data.]
• Complex model: no error on the training data. What about the test set?
• Simple model: more error on the training data. What about the test set?
Bias-Variance Trade-off

              High             Low
Train Error   High bias        Low bias
Test Error    High variance    Low variance
Over-fitting
• Over-fitting and under-fitting are the two main problems that cause poor performance of machine learning models.
• Over-fitting occurs when the model fits more data than required and tries to capture each and every data point fed to it. It therefore starts capturing noise and inaccuracies from the dataset, which degrades the performance of the model.
• An over-fitted model doesn't perform accurately on the test/unseen dataset and can't generalize well.
• An over-fitted model is said to have low bias and high variance.
How to avoid Over-fitting
• Using cross-validation
• Using regularization techniques
• Implementing ensemble techniques
• Picking a less parameterized/complex model
• Training the model with sufficient data
• Removing features
• Stopping the training early (early stopping)
Under-fitting
• The model cannot create a mapping between the input and the target variable.
• Under-observing the features leads to higher error on both the training and unseen data samples.
• Under-fitting becomes obvious when the model is too simple and cannot capture the relationship between the input and the output.
How to avoid Under-fitting
• Increasing the model complexity
• Adding more features to the training data
• Reducing regularization
• Training for more iterations
Over-fitting
Over-fitting during training:
[Figure: model error vs. number of iterations; the training error keeps decreasing while the error on new data eventually starts rising.]
Regularization and Over-fitting
Adding a regularizer:
[Figure: model error vs. number of iterations, with and without a regularizer.]
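A minimal sketch of the idea, assuming scikit-learn's Ridge (L2-penalized linear regression); the synthetic data and alpha value are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=20)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # the penalty shrinks coefficients toward 0
print("Unregularized coefficients:", plain.coef_.round(2))
print("Ridge coefficients:        ", ridge.coef_.round(2))
```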
Cross-Validation
• Cross-validation involves partitioning your data into
distinct training and test subsets.
K-fold Cross-Validation
• To get more accurate estimates of performance you can do this k times.
• Break the data into k equal-sized subsets A1, ..., Ak.
• For each i in 1, ..., k:
  – Train a model on all the other folds A1, ..., A(i-1), A(i+1), ..., Ak
  – Test the model on Ai
• Compute the average performance over the k runs (see the sketch below).
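A short sketch of k-fold cross-validation with scikit-learn; the model and dataset are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train on k-1 folds, test on the held-out fold, repeat k = 5 times.
scores = cross_val_score(model, X, y, cv=5)
print("Per-fold accuracy:", scores.round(3))
print("Average accuracy: ", scores.mean().round(3))
```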
5-fold Cross-Validation
Occam's Razor
Learning as a Search
Supervised Learning: Classification vs. Regression
Training Example    Input Features                  Output
(Instance)          X1   X2   X3   ...   Xn         Y
I1                  a1   a2   a3   ...   an         Y1
I2                  b1   b2   b3   ...   bn         Y2
I3                  c1   c2   c3   ...   cn         Y3
...
Im                  p1   p2   p3   ...   pn         Ym
Test input          z1   z2   z3   ...   zn         ?? (the model has to predict it)

Classification or regression?
Supervised Learning
For each input x, the desired output y is given. Here y is the label.
Classification
• Classification is the process of categorizing a given set of data into classes.
• It can be performed on both structured and unstructured data.
• The process starts with predicting the class of given data points. The classes are often referred to as targets, labels, or categories.
Classification example:
Given a collection of emails, determine which are spam and which are not (see the sketch below).
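A hedged sketch of this task with scikit-learn's CountVectorizer and MultinomialNB; the tiny email list and labels are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting at 10am tomorrow",
          "free money claim now", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = non-spam

vec = CountVectorizer()
X = vec.fit_transform(emails)          # bag-of-words features
clf = MultinomialNB().fit(X, labels)   # supervised classifier

print(clf.predict(vec.transform(["claim your free prize"])))  # -> [1]
```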
Regression
• A technique for determining the statistical relationship between two or more variables, where a change in a dependent variable is associated with, and depends on, a change in one or more independent variables.
• A regression problem is used when the output variable is a real or continuous value, such as "salary" or "weight".
Regression examples:
• Sales of a product can be predicted by using the relationship between sales volume and amount of advertising.
• The performance of an employee can be predicted by using the relationship between performance and aptitude tests.
• The size of a child's vocabulary can be predicted by using the relationship between the vocabulary size, the child's age, and the parents' educational input.
Regression example: estimate the price of a house from the given data.
What would be the price of a medium-size house?
Regression Analysis
Dependent and Independent Variables
• Independent variables are considered as inputs to a system and may take on different values freely.
• Dependent variables are those values that change as a consequence of changes in other values in the system.
• The independent variable is also called the predictor or explanatory variable and is denoted by X.
• The dependent variable is also called the response variable and is denoted by Y.
Linear Regression
• The simplest mathematical relationship between two variables x and y is a linear relationship.
• In a cause-and-effect relationship, the independent variable is the cause, and the dependent variable is the effect.
• Least squares linear regression is a method for predicting the value of a dependent variable Y based on the value of an independent variable X.
Linear Regression

Height (cm)   Weight (kg)
120           45.0
127           51.8
140           58.4
134           55.8
179           86.2
122           44.9
166           68.1
149           52.0
180           95.3
171           73.5
155           61.1
178           89.8
The First Order Linear Model

y = b + mx + E

where y is the dependent variable, x is the independent variable, b is the y-intercept, m is the slope, and E is the random error term.
Slope & Intercept
Slope:
The slope of a line is the change in y for a one-unit increase in x.
Y-intercept:
The height at which the line crosses the vertical axis; it can be obtained by setting x = 0 in the equation y = mx + b.
Error Variable
The inclusion of the random error term allows (x, y) to fall either above the true regression line (when E > 0) or below the line (when E < 0).
Basis               Linear Regression               Logistic Regression
Core concept        The data is modeled using a     Models the probability of a certain
                    straight line                   class or event existing, such as
                                                    yes/no, win/lose, sick/healthy
Used for            Continuous variable             Categorical variable
Output/Prediction   Value of the variable           Probability of occurrence of an event
Evaluation          Loss, R-squared,                Accuracy, Precision, Recall, F1 score,
measures            Adjusted R-squared              ROC curve, Confusion matrix, etc.
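A small sketch contrasting the two columns of the table, assuming scikit-learn; the data is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
y_cont = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])  # continuous target
y_cat = np.array([0, 0, 0, 1, 1, 1])               # categorical target

lin = LinearRegression().fit(X, y_cont)
log = LogisticRegression().fit(X, y_cat)

print(lin.predict([[3.5]]))        # predicts a value of the variable
print(log.predict_proba([[3.5]]))  # predicts class probabilities
```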
Linear Regression using the Least Square Method

m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
b = ȳ − m·x̄

Worked example (x = 1, 2, 3, 4, 5; y = 2, 4, 5, 4, 5; x̄ = 3, ȳ = 4):

m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² = 6 / 10 = 0.6
b = 4 − 0.6 × 3 = 2.2
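The same computation as a quick NumPy sketch (using the x, y data from the R-squared example below):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - m * x.mean()
print(m, b)  # 0.6 2.2
```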
R-Squared
• Used to determine how well the regression line fits the data.
• Measures the proportion of variance in the dependent variable (y) that is explained by the independent variables (x) in a model.
R-Squared
• It ranges from 0 to 1, where a higher value indicates a better fit of the model to the data.
• An R-squared value of 1 means the model perfectly fits the data, while an R-squared value of 0 means the model explains none of the variability in the data.
• R-squared is always a non-negative number.
R-Squared

R² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)²

where ŷ is the value predicted by the regression line and ȳ is the mean of the observed y values.
x y
1 2
2 4
3 5
4 4
5 5
Mean 4
Given,
Regression line equation = mx + b
m = 0.6, b = 2.2
132
R-Squared

x   y   ŷ = 0.6x + 2.2   |y − ŷ|   (y − ŷ)²   |y − ȳ|   (y − ȳ)²
1   2   2.8              0.8       0.64       2         4
2   4   3.4              0.6       0.36       0         0
3   5   4.0              1.0       1.00       1         1
4   4   4.6              0.6       0.36       0         0
5   5   5.2              0.2       0.04       1         1

Σ(y − ŷ)² = 2.4, Σ(y − ȳ)² = 6
R² = 1 − 2.4/6 = 0.6
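The same R² computation as a short sketch:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
y_hat = 0.6 * x + 2.2

ss_res = ((y - y_hat) ** 2).sum()     # 2.4
ss_tot = ((y - y.mean()) ** 2).sum()  # 6.0
print(1 - ss_res / ss_tot)            # 0.6
```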
Standard Error of Estimate (SEE)
• The standard error of the estimate is a measure of the variability of the predicted values around the true regression line.
• We calculate the distance between the actual and the estimated/predicted values, which is called the error.
• Our task is to minimize this error.
Standard Error of Estimate (SEE)

SEE = √( Σ(y − ŷ)² / (n − k − 1) )

where n is the number of observations and k is the number of independent variables.

For the example above, n = 5 and k = 1, so SEE = √(2.4 / 3) ≈ 0.894.
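A quick sketch of the SEE computation for the same example:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
y_hat = 0.6 * x + 2.2

n, k = len(y), 1  # 5 observations, 1 independent variable
see = np.sqrt(((y - y_hat) ** 2).sum() / (n - k - 1))
print(see)  # ~0.894
```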
SEE vs. R-Squared
SEE and R-squared are two different measures used in regression analysis:
• SEE: a measure of the variability of the predicted values around the true regression line. It provides an indication of the accuracy of the predictions.
• R-squared: a statistic that measures the proportion of variation in the dependent variable that can be explained by the independent variables. It ranges from 0 to 1.
In summary, SEE measures the accuracy of the predictions, while R-squared measures the goodness of fit of the model to the data.
Types of Regression
Univariate and multivariate regression are two types of regression analysis used in statistics:
• Univariate regression: involves only one independent variable and one dependent variable.
• Multivariate regression: involves multiple independent variables and one dependent variable.
LR Exercise 1
Q. Study the relationship between the monthly sales and the advertising costs surveyed for different stores as given below. Find the equation of the straight line that best fits the data. Determine the R-squared value.

Store   Sales (units)   Advertising Cost
1       368000          1700
2       340000          1500
3       665000          2800
4       954000          5000
5       331000          1300
6       556000          2200
7       376000          1300
LR Exercise 2
Q. Examine the relationship between the age and price of used cars sold in the last year by a car dealership company. Find the equation of the straight line that best fits the data. Determine the R-squared value.

Car age (years)   Price (lakhs)
4                 6.3
4                 5.8
5                 5.7
5                 4.5
7                 4.5
7                 4.2
8                 4.1
9                 3.1
10                2.1
11                2.5
12                2.2
Other Evaluation Measures
Several evaluation measures are commonly used to assess the
performance of a linear regression model.
Error Calculation in Linear Regression
1. Mean Absolute Error (MAE):
The simplest regression error metric to understand. It can be calculated as:

MAE = (1/n) Σ |y − ŷ|
2. Mean Square Error (MSE):
The mean square error (MSE) is similar to the MAE, but squares each difference before summing instead of using the absolute value:

MSE = (1/n) Σ (y − ŷ)²
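A quick sketch of MAE and MSE on the running example (same data and fitted line as the R-squared slides):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
y_hat = 0.6 * x + 2.2

mae = np.abs(y - y_hat).mean()   # mean absolute error
mse = ((y - y_hat) ** 2).mean()  # mean squared error
print(mae, mse)  # 0.64 0.48
```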
Gradient Descent in LR
The model: y = b + m1·x1 + m2·x2 + ... + mn·xn
• The xi are the features
• b and the mi are the model parameters (coefficients)
• y is the target variable
Depending on the values of m and b, many different lines are possible, each with a different error.
The Cost Function of Linear Regression
• The cost function measures how well a machine learning model performs.
• It is the error between the predicted values and the actual values, represented as a single real number.
• The cost function of linear regression is the mean square error:
  J(m, b) = (1/n) Σ (y − ŷ)²
• After expanding ŷ = b + mx in the above equation we get:
  J(m, b) = (1/n) Σ (y − (b + mx))²
How to minimize the Cost Function?
• We have established that all the candidate straight lines are just different combinations of the model parameters b and m.
• The cost function is a function of the parameters b and m.
• Therefore, by changing the values of b and m we change the cost function.
• We keep changing the values of b and m until we find a combination where the cost function is minimized.
• To find the best combination we use the gradient descent algorithm.
Gradient Descent in LR
Steps (see the sketch below):
1. Calculate the slope (partial derivative) of the cost function at the current values of the parameters b and m separately.
2. Take a step of size α (the learning rate) and update the parameters.
3. Calculate the cost function J with the new (b, m) values, and repeat.
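A minimal sketch of these steps for simple linear regression, minimizing J(m, b) = (1/n) Σ (y − (b + mx))² on the running example; the learning rate and iteration count are illustrative:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

m, b = 0.0, 0.0
alpha = 0.01  # learning rate
n = len(x)

for _ in range(10000):
    y_hat = b + m * x
    # Step 1: partial derivatives of J with respect to m and b
    dm = (-2 / n) * (x * (y - y_hat)).sum()
    db = (-2 / n) * (y - y_hat).sum()
    # Step 2: update the parameters
    m -= alpha * dm
    b -= alpha * db

print(m, b)  # converges toward the least-squares solution 0.6, 2.2
```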
Multivariate Linear Regression

y = b + m1·x1 + m2·x2 + ... + mn·xn

Here, y is the dependent variable, x1, ..., xn are the independent variables, and b, m1, ..., mn are the model parameters.
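A brief sketch of multivariate linear regression with scikit-learn; the two-feature dataset is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = np.array([5.0, 6.0, 11.0, 12.0, 16.0])

model = LinearRegression().fit(X, y)
print("Coefficients m1, m2:", model.coef_)
print("Intercept b:", model.intercept_)
print("Prediction for [6, 6]:", model.predict([[6.0, 6.0]]))
```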
How to use Multivariate Regression Analysis?
The processes involved in multivariate regression analysis include the selection of features, engineering the features, feature normalization, selection of loss functions, hypothesis analysis, and creating a regression model.
1. Selection of features:
The most important step in multivariate regression. Also known as variable selection, this process involves selecting viable variables to build efficient models.
Feature Elimination
2. Feature normalization: involves feature scaling to maintain streamlined distributions and data ratios, which helps in better data analysis. The values of all the features can be changed according to the requirement.
3. Selecting the loss function and hypothesis: the loss function is used for predicting errors. It comes into play when the hypothesis prediction deviates from the actual figures. Here, the hypothesis represents the value predicted from the feature or variable.
5. Reducing the loss function: the loss function is minimized by running an algorithm for loss minimization on the dataset, which in turn alters the hypothesis parameters. Gradient descent is the most commonly used algorithm for loss minimization.
Assumptions in the Multivariate Regression Model
• The dependent and the independent variables have a linear relationship.
• The independent variables do not have a strong correlation among themselves.
• The observations of yᵢ are chosen randomly and independently from the population.
Advantages of Multivariate Regression
• Multivariate regression helps us study the relationships among multiple variables in the dataset.
• The correlation between dependent and independent variables helps in predicting the outcome.
• It is one of the most convenient and popular algorithms used in machine learning.
Disadvantages of Multivariate Regression
• The complexity of multivariate techniques requires complex mathematical calculations.
• It is not easy to interpret the output of a multivariate regression model, since there can be inconsistencies in the loss and error outputs.
• Multivariate regression models are not well suited to smaller datasets; they are designed to produce accurate outputs on larger datasets.