Module 2
Contents
Linear relationship
A linear relationship is a statistical term used to describe a straight-line
relationship between two variables. Linear relationships can be expressed
either graphically or as a mathematical equation of the form
y = mx + b
Linear and Non-Linear
Non-Linear relationship
A non-linear model is a statistical model that does NOT assume a linear
relationship between the dependent variable and the independent variables.
This makes it possible to analyse more complex relationships in larger
datasets by using curves, exponentials, logarithms, or interaction terms.
Linearly Separable Classes
Linearly Separable
XOR Problem: Not Linearly Separable
In a multi-class setting, the classes are mutually exclusive, which means a single
instance of data can belong to one and only one class.
Image Classification with Multiple Labels: Identifying and labeling multiple objects
or features within an image, like recognizing both “cat” and “outdoor” in a photograph.
Three types of Classification
Supervised Machine Learning
Supervised learning is the type of machine learning in which machines are trained
using well "labelled" training data, and on the basis of that data the machines predict
the output.
Labelled data means that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the
supervisor that teaches the machines to predict the output correctly.
It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to
the machine learning model.
The aim of a supervised learning algorithm is to find a mapping function to map
the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
Steps Involved in Supervised Learning:
Real-Time Regression Applications: Advertising Data
y=m*x+c
or
distance=m*speed+c
Where,
m=slope of line
c=y-intercept
Linear Regression
As the speed increases, the distance also increases. The variables have a
positive relationship.
What is the best Fit Line?
Our primary objective while using
linear regression is to locate the best-fit
line, which implies that the error
between the predicted and actual
values should be kept to a minimum.
There will be the least error in the best-
fit line.
The best Fit Line equation provides a
straight line that represents the
relationship between the dependent
and independent variables.
The slope of the line indicates how
much the dependent variable changes
for a unit change in the independent
variable(s).
The regression line is the best-fit line
for our model.
Linear Regression Line
Positive Linear Relationship:
If the dependent variable increases on the Y-
axis and independent variable increases on
X-axis, then such a relationship is termed as
a Positive linear relationship.
x y
1 3
2 2
3 2
4 4
5 3
Linear Regression: Explanation with Example
For this data, Xmean = 3 and Ymean = 2.8; Σ(x − Xmean)² = 10 and Σ(x − Xmean)(y − Ymean) = 2,
so the slope is m = 2/10 = 0.2 and the intercept is c = Ymean − m·Xmean = 2.8 − 0.2×3 = 2.2.
Predicting yp using x and plotting the points.
yp=(0.2*x)+2.2
yp=(0.2*1)+2.2=2.4
yp=(0.2*2)+2.2=2.6
yp=(0.2*3)+2.2=2.8
yp=(0.2*4)+2.2=3.0
yp=(0.2*5)+2.2=3.2
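A minimal Python sketch of this calculation (plain Python, fitting m and c by least squares and reproducing the yp values above):

# Worked example above: fit y = m*x + c by least squares and predict yp
x = [1, 2, 3, 4, 5]
y = [3, 2, 2, 4, 3]

x_mean = sum(x) / len(x)   # 3.0
y_mean = sum(y) / len(y)   # 2.8

sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))  # 2.0
sxx = sum((xi - x_mean) ** 2 for xi in x)                         # 10.0

m = sxy / sxx              # 0.2
c = y_mean - m * x_mean    # 2.2

yp = [round(m * xi + c, 1) for xi in x]   # [2.4, 2.6, 2.8, 3.0, 3.2]
print(m, c, yp)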
Linear Regression
In linear regression we predict a continuous variable ‘y’ with a given
input variable ‘x’.
It is a linear model. x is called the independent variable and y is the
dependent variable.
When there is a single input variable (x), the method is referred to
as simple linear regression.
When there are multiple input variables, literature from statistics
often refers to the method as multiple linear regression.
We keep moving the line through the data points to make sure the best fit line
has the least square distance between the data points and the regression line.
Evaluation metrics for Regression
Root Mean Squared Error (RMSE)
RMSE = sqrt( (1/n) · Σ(yi − ŷi)² )
where yi is the actual value, ŷi is the predicted value, and n is the number of samples.
The lower the value the better; 0 means the model is perfect.
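A short Python sketch of the RMSE calculation, reusing the y and yp values from the earlier example:

import math

y_true = [3, 2, 2, 4, 3]              # actual values from the earlier example
y_pred = [2.4, 2.6, 2.8, 3.0, 3.2]    # predictions from yp = 0.2*x + 2.2

mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(mse)                 # about 0.693
print(rmse)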
Simple Linear Regression
Solution: Using the model Ypredicted = aX + b and RSS = Σ(Y − Ypredicted)²
For a = 4 and b = 3
X      Y          Ypredicted   (Y − Ypredicted)²
2 12.8978 11 3.6016
3 17.7586 15 7.6099
4 23.3192 19 18.6555
5 28.3129 23 28.2269
6 32.1351 27 26.3693
RSS 84.4632
Simple Linear Regression
Solution:
For a = 5 and b = 3
X      Y          Ypredicted   (Y − Ypredicted)²
2 12.8978 13 0.0104
3 17.7586 18 0.0583
4 23.3192 23 0.1019
5 28.3129 28 0.0979
6 32.1351 33 0.7481
RSS 1.0166
Simple Linear Regression
Solution:
For a = 5 and b = 1
X      Y          Ypredicted   (Y − Ypredicted)²
2 12.8978 11 3.6016
3 17.7586 16 3.0927
4 23.3192 21 5.3787
5 28.3129 26 5.3495
6 32.1351 31 1.2885
RSS 18.7110
Simple Linear Regression
Solution:
For a = 1 and b = 5
X      Y          Ypredicted   (Y − Ypredicted)²
2 12.8978 7 34.7840
3 17.7586 8 95.2303
4 23.3192 9 205.0395
5 28.3129 10 335.3623
6 32.1351 11 446.6925
RSS 1117.1086
Answer: The parameters (a, b) = (5, 3) give the least RSS (1.0166). Hence
(5, 3) is used to model this function.
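A short Python sketch that reproduces the RSS comparison for the four candidate (a, b) pairs:

X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]

def rss(a, b):
    # Residual sum of squares for the model Ypredicted = a*X + b
    return sum((y - (a * x + b)) ** 2 for x, y in zip(X, Y))

for a, b in [(4, 3), (5, 3), (5, 1), (1, 5)]:
    print((a, b), round(rss(a, b), 4))
# (5, 3) gives the smallest RSS, about 1.0166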
Simple Linear Regression
Question-2:
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
(a) Find the best linear fit
(b) Determine the minimum RSS
(c) Draw the residual plot for the best linear fit and comment on the
suitability of the linear model to this training data.
Simple Linear Regression
Solution:
(a) To find the best fit, calculate the model coefficients using the formulas
β1 = Σ(X − Xmean)(Y − Ymean) / Σ(X − Xmean)²  and  β0 = Ymean − β1·Xmean
Simple Linear Regression
Solution:
X     Y          (X − Xmean)   (Y − Ymean)   (X − Xmean)(Y − Ymean)   (X − Xmean)²
2     12.8978    -2            -9.9869       19.9738                   4
3     17.7586    -1            -5.1261        5.1261                   1
4     23.3192     0             0.4345        0.0000                   0
5     28.3129     1             5.4282        5.4282                   1
6     32.1351     2             9.2504       18.5008                   4
Sum   20    114.4236     0     0.0000    49.0289    10
Mean   4     22.88472

Substituting in the formulas:
β1 = 49.0289 / 10 = 4.9029
β0 = 22.88472 − 4.9029 × 4 = 3.2732
The best linear fit is Y = 4.9029X + 3.2732
Simple Linear Regression
Solution:
(c) The residuals show a random pattern around zero, which is an indication that a linear model is suitable for this data.
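A minimal Python sketch of the same least-squares calculation and the residuals for this training data:

X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]

x_mean = sum(X) / len(X)                 # 4.0
y_mean = sum(Y) / len(Y)                 # 22.88472

sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(X, Y))  # 49.0289
sxx = sum((x - x_mean) ** 2 for x in X)                       # 10.0

beta1 = sxy / sxx                        # about 4.9029
beta0 = y_mean - beta1 * x_mean          # about 3.2732

residuals = [y - (beta0 + beta1 * x) for x, y in zip(X, Y)]
print(beta1, beta0, residuals)           # residuals scatter randomly around 0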
Methods to find best fit line in Linear Regression
1. Statistical
2. Gradient Descent
3. Randomly drawing straight line
Advantages of Linear Regression
• Linear regression is a relatively simple algorithm, making it easy to understand
and implement. The coefficients of the linear regression model can be interpreted
as the change in the dependent variable for a one-unit change in the independent
variable, providing insights into the relationships between variables.
• Linear regression is computationally efficient and can handle large datasets
effectively. It can be trained quickly on large datasets, making it suitable for real-
time applications.
• Linear regression is relatively robust to outliers compared to other machine
learning algorithms. Outliers may have a smaller impact on the overall model
performance.
• Linear regression often serves as a good baseline model for comparison with
more complex machine learning algorithms.
• Linear regression is a well-established algorithm with a rich history and is widely
available in various machine learning libraries and software packages.
Disadvantages of Linear Regression
• Linear regression assumes a linear relationship between the dependent and
independent variables. If the relationship is not linear, the model may not perform
well.
• Linear regression is sensitive to multicollinearity, which occurs when there is a high
correlation between independent variables. Multicollinearity can inflate the variance
of the coefficients and lead to unstable model predictions.
• Linear regression assumes that the features are already in a suitable form for the
model. Feature engineering may be required to transform features into a format
that can be effectively used by the model.
• Linear regression is susceptible to both overfitting and underfitting. Overfitting
occurs when the model learns the training data too well and fails to generalize to
unseen data. Underfitting occurs when the model is too simple to capture the
underlying relationships in the data.
• Linear regression provides limited explanatory power for complex relationships
between variables. More advanced machine learning techniques may be necessary for
deeper insights.
Multicollinearity
Multicollinearity occurs when two or more independent variables have a high correlation
with one another in a regression model, which makes it difficult to determine the
individual effect of each independent variable on the dependent variable.
The presence of multicollinearity can lead to unstable and unreliable coefficient
estimates, making it challenging to interpret the results and draw meaningful conclusions
from the model.
To detect multicollinearity
Calculate the variance inflation factor (VIF) for each independent variable; a VIF
value greater than about 5 (some texts use 10) is commonly taken to indicate problematic multicollinearity.
To fix multicollinearity
Remove one of the highly correlated variables, combine them into a single variable, or use
a dimensionality reduction technique such as principal component analysis to reduce the
number of variables while retaining most of the information.
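A rough NumPy sketch of the VIF check (regress each feature on the remaining features and use VIF = 1 / (1 − R²)); the two-feature data at the bottom is hypothetical, purely for illustration:

import numpy as np

def vif(X):
    # Variance inflation factor for each column of the feature matrix X
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # intercept + other features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - np.sum((y - A @ coef) ** 2) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return out

# Hypothetical example: x2 is almost an exact multiple of x1, so both VIFs are very large.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = 2 * x1 + np.array([0.10, -0.20, 0.05, 0.00, 0.15, -0.10])
print(vif(np.column_stack([x1, x2])))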
Correlation
If the correlation between two variables is +/- 1.0, then the variables are said to be
perfectly collinear.
When the correlation coefficient is negative, the changes in the two variables are in
opposite directions.
Overfitting and Underfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or
more data points than necessary, in the given dataset. Because of this, the model starts
capturing the noise and inaccurate values present in the dataset, and all these factors
reduce the efficiency and accuracy of the model. An overfitted model has low bias and
high variance.
Note:
•Bias: Bias is the difference between the predicted values and the actual values.
•Variance: If the machine learning model performs well on the training dataset but
does not perform well on the test dataset, the model has high variance.
Underfitting represents the inability of the model to learn the training data effectively,
resulting in poor performance on both the training and testing data. In simple terms, an
underfit model is inaccurate, especially when applied to new, unseen examples.
Variance
Variance is a measure of how far the data points spread out from the mean (the average of the squared deviations from the mean).
Standard Deviation
Standard deviation is the square root of variance.
1. Implement Simple Linear Regression using statistical approach for the
given dataset sample.
area price
2600 550000
3000 565000
3200 610000
3600 680000
4000 725000
Multilinear Regression
One of the most common types of predictive analysis is multiple linear
regression.
This type of analysis allows you to understand the relationship between
a continuous dependent variable and two or more independent variables.
The independent variables can be either continuous (like age and height)
or categorical (like gender and occupation).
It's important to note that categorical independent variables should be dummy
coded before running the analysis.
Here: Y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn
where Y = dependent variable and x1, x2, x3, ..., xn = multiple independent variables
Assumptions of Multiple Linear Regression
•A linear relationship should exist between the target and predictor variables.
Example: build a multiple linear regression model to predict the house price based on the
first three features.
Multiple Linear Regression Solved Example
Problem 1:Evaluate the following dataset with one response variable y and two
predictor variables X1 and X2 to fit a multiple linear regression model.
Multiple Linear Regression Solved Example: Formulae used
b1 = [Σx2² · Σx1y − Σx1x2 · Σx2y] / [Σx1² · Σx2² − (Σx1x2)²]
b2 = [Σx1² · Σx2y − Σx1x2 · Σx1y] / [Σx1² · Σx2² − (Σx1x2)²]
b0 = Ymean − b1·X1mean − b2·X2mean
where the sums are taken over deviations from the means, e.g. Σx1y = Σ(X1 − X1mean)(Y − Ymean).
Multiple Linear Regression Solved Example: Solution
b1 = [(194.875)(1162.5) − (−200.375)(−953.5)] / [(263.875)(194.875) − (−200.375)²] = 3.148
b2 = [(263.875)(−953.5) − (−200.375)(1162.5)] / [(263.875)(194.875) − (−200.375)²] = −1.656
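A small Python sketch that plugs the deviation sums quoted above into these formulae:

# Deviation sums from the solved example (sums of squares and cross-products about the means)
s_x1x1 = 263.875    # sum (x1 - x1_mean)^2
s_x2x2 = 194.875    # sum (x2 - x2_mean)^2
s_x1x2 = -200.375   # sum (x1 - x1_mean)(x2 - x2_mean)
s_x1y = 1162.5      # sum (x1 - x1_mean)(y - y_mean)
s_x2y = -953.5      # sum (x2 - x2_mean)(y - y_mean)

den = s_x1x1 * s_x2x2 - s_x1x2 ** 2
b1 = (s_x2x2 * s_x1y - s_x1x2 * s_x2y) / den   # about 3.148
b2 = (s_x1x1 * s_x2y - s_x1x2 * s_x1y) / den   # about -1.656
print(b1, b2)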
Data Preparation
•Cleaning the Data: Remove or impute missing values to ensure the dataset is
complete.
•Encoding Categorical Variables: If any categorical variables are present, convert
them into numerical form. In this example, our dataset is purely numerical.
•Feature Scaling: Standardise or normalise the features if they vary significantly in
scale. This step ensures that all features contribute equally to the model.
Multiple Linear Regression Steps
Model Training
Splitting the Dataset: Divide the dataset into training and testing sets. A split of
80% training and 20% testing is typically used.
Fitting the Model: Use the training data to fit the multiple linear regression
model.
Model Evaluation: After training the model, evaluate its performance using the test data.
Predicting: Use the model to make predictions on the test set
Metrics: Assess the model’s performance using metrics such as Mean Absolute Error
(MAE), Mean Squared Error (MSE), and R-squared score.
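A sketch of these steps using scikit-learn (assuming it is installed); the feature matrix and target below are synthetic placeholders rather than a real dataset:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data with three features, only to illustrate the workflow
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 4 + 2 * X[:, 0] - 3 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# 80% training / 20% testing split, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)   # fit on the training data
y_pred = model.predict(X_test)                     # predict on the test set

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 :", r2_score(y_test, y_pred))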
1. Implement Multilinear Regression using statistical approach for the dataset
sample solved in theory.
In a decision tree, there are two types of nodes: the decision node and the leaf
node. Decision nodes are used to make decisions and have multiple branches, whereas
leaf nodes are the outputs of those decisions.
Decision Trees
A tree can be “learned” by splitting the source set into subsets based on an
attribute value test. This process is repeated on each derived subset in a
recursive manner called recursive partitioning. The recursion is
completed with leaf nodes.
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
The tree first splits on Refund (Yes / No), then on Marital Status (Single, Divorced / Married),
and finally on Taxable Income (< 80K / ≥ 80K). Following the path Refund = No →
Marital Status = Married, the model assigns Cheat = "No".
Decision Tree: An Example
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure (ASM). Using this measure, we can
easily select the best attribute for the nodes of the tree. Two popular ASM techniques
are Information Gain and the Gini Index.
• The attribute with the best value of the measure (highest information gain, or lowest Gini index) is chosen as the splitting attribute.
Decision Tree- ID3
Key Characteristics
•Uses Entropy and Information Gain: ID3 uses entropy as a measure of impurity
and information gain (with largest value) to decide which attribute to split the data
on at each step.
•Categorical Data: ID3 is designed to work with categorical data. Continuous data
needs to be discretized before using ID3.
1. Entropy (expected information) of the entire dataset D:
Info(D) = − Σ (i = 1 to m) pi · log2(pi)
Info(D) = I(9,5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940
2. Entropy for each attribute in the dataset is calculated using the
below formulae
2.1. Entropy and Gain for Age : InfoAge(D) and Gain(age)
Infoage(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694
Gain(age) = Info(D) − Infoage(D) = 0.940 − 0.694 = 0.246
Choose the attribute with the highest gain as the splitting attribute
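A small Python sketch of entropy and information gain, checked against the Info(D) = 0.940 and Gain(age) = 0.246 values above:

import math
from collections import Counter

def entropy(labels):
    # Info(D) = -sum p_i * log2(p_i) over the class proportions
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    # Gain(A) = Info(D) - sum_v (|D_v| / |D|) * Info(D_v)
    n = len(labels)
    split_info = 0.0
    for v in set(attribute_values):
        subset = [lab for a, lab in zip(attribute_values, labels) if a == v]
        split_info += (len(subset) / n) * entropy(subset)
    return entropy(labels) - split_info

labels = ["yes"] * 9 + ["no"] * 5
print(round(entropy(labels), 3))               # 0.940

# Age split with class counts I(2,3), I(4,0), I(3,2) as in the example above
age = ["youth"] * 5 + ["middle"] * 4 + ["senior"] * 5
play = ["yes"] * 2 + ["no"] * 3 + ["yes"] * 4 + ["yes"] * 3 + ["no"] * 2
print(round(information_gain(age, play), 3))   # about 0.246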
CART stands for Classification and Regression Trees and is a more advanced algorithm
that can handle both classification and regression tasks. It was introduced by Breiman et
al. in 1984.
Key Characteristics
•Uses Gini Impurity or MSE: For classification tasks, CART uses the Gini impurity
and chooses the split that gives the largest reduction in impurity (the lowest weighted
Gini index), while for regression tasks, it uses mean squared error (MSE).
•Handles Numerical and Categorical Data: Unlike ID3, CART can handle both
categorical and continuous data without requiring prior discretization.
•Binary Trees: CART produces binary trees, meaning each internal node splits the data
into two child nodes.
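For reference, scikit-learn's decision trees are CART-style binary trees; a minimal sketch (assuming scikit-learn is available, and using its bundled iris data rather than the slide's dataset):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# criterion="gini" selects the split with the largest reduction in Gini impurity
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
print(export_text(clf, feature_names=load_iris().feature_names))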
Decision Tree-CART Problem
From the table, you can see that the Gini index for the outlook feature is the lowest,
so outlook becomes the root node.
Decision Tree-CART Problem
Let's focus on the subset of the data where outlook = sunny. We need to find the Gini index
for the temperature, humidity, and wind features respectively.
Decision Tree-CART Problem
Gini index for temperature on sunny
outlook
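A small Python sketch of the Gini calculation for a categorical split (the mini dataset below is hypothetical, not the slide's table):

from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum p_i^2 over the class proportions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(attribute_values, labels):
    # Gini index of a split on a categorical attribute: sum_v (|D_v|/|D|) * gini(D_v); lower is better
    n = len(labels)
    total = 0.0
    for v in set(attribute_values):
        subset = [lab for a, lab in zip(attribute_values, labels) if a == v]
        total += (len(subset) / n) * gini(subset)
    return total

attr = ["hot", "hot", "mild", "mild", "mild"]   # hypothetical attribute values
play = ["no", "no", "yes", "yes", "no"]         # hypothetical class labels
print(round(weighted_gini(attr, play), 3))      # about 0.267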
Decision Tree-CART Problem
where
P(Yes|today) = 0.484
P(No|today) = ((3/5)·(2/5)·(1/5)·(2/5)·(5/14)) / ((5/14)·(4/14)·(7/14)·(8/14)) = 0.235
Problem 3: Practice