Unit 4 Data Science
Linear Regression:
It is the most basic and commonly used technique for predictive analysis: a statistical
approach to modeling the relationship between a dependent variable and a given set of
independent variables.
Equation: Y = mX + c
• Multiple Linear Regression: Multiple linear regression is based on the presumption
that the dependent and independent variables (the target and predictor variables)
have a linear relationship. Multiple regression itself comes in two types, linear and
nonlinear; the linear form has one dependent variable and two or more independent
variables.
yi = b0 + b1xi1 + b2xi2 + … + bnxin + ei
Where:
• yi is the dependent or predicted variable
• xi1 … xin are the independent (predictor) variables
• b0 … bn are the regression coefficients and ei is the error term
Multiple linear regression gives a more precise model than ordinary (simple) linear
regression. Simple linear regression can readily capture the relationship between two
variables in simple cases; multiple linear regression is generally preferable for more
complex interactions that require more thought.
This form of regression analysis expands upon simple linear regression, the simplest form
of regression. Simple linear regression creates a linear mathematical relationship between
one independent variable and one dependent variable, represented by y = a + βx, where y
can only result in one outcome for a given x. For example, in the equation y = 20 + 2x,
where x = 5, y can only be 30.
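As a quick illustration, here is a minimal sketch (using scikit-learn, which these notes use later for metrics) that fits a line to points generated from y = 20 + 2x and recovers the intercept and slope:

import numpy as np
from sklearn.linear_model import LinearRegression

# data generated from y = 20 + 2x, matching the example above
x = np.array([[1], [2], [3], [4], [5]])
y = 20 + 2 * x.ravel()

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)  # approximately 20.0 and [2.0]
print(model.predict([[5]]))           # [30.0]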
Data Visualization:
Data visualization is a set of data points and information represented graphically to make
it easy and quick for users to understand. A visualization is good if it has a clear meaning
and purpose and is very easy to interpret without requiring context.
Data visualization tools provide an accessible way to see and understand trends, outliers,
and patterns in data by using visual elements such as charts, graphs, and maps.
1. Numerical Data:
Numerical data is also known as quantitative data. It is any data that represents an
amount, such as the height, weight, or age of a person. Numerical data visualization
is the easiest way to visualize data. It is generally used to help others digest large
data sets and raw numbers in a way that makes them easier to translate into action.
Numerical data is categorized into two categories:
• Continuous Data –
It can be subdivided or measured ever more finely (Example: height measurements).
• Discrete Data –
This type of data is not “continuous” (Example: the number of cars or children a
household has).
The visualization techniques used to represent numerical data are charts and numerical
values. Examples are pie charts, bar charts, averages, scorecards, etc.
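As a small sketch (the sample heights below are made up), a histogram is a typical way to visualize continuous numerical data with matplotlib:

import matplotlib.pyplot as plt

# assumed sample data: heights in cm (continuous numerical data)
heights = [160, 165, 170, 172, 168, 175, 180, 158, 169, 174]
plt.hist(heights, bins=5)  # a histogram groups continuous values into bins
plt.xlabel("Height (cm)")
plt.ylabel("Count")
plt.show()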
2. Categorical Data:
Categorical data is also known as qualitative data. It is any data that represents
groups. It consists of categorical variables used to represent characteristics such as
a person's ranking, a person's gender, etc. Categorical data visualization is all about
depicting key themes, establishing connections, and lending context. Categorical data
is classified into three categories:
• Binary Data –
In this, classification is based on positioning (Example: Agrees or Disagrees).
• Nominal Data –
In this, classification is based on attributes (Example: Male or Female).
• Ordinal Data –
In this, classification is based on ordering of information (Example: Timeline or
processes).
The visualization techniques used to represent categorical data are graphics, diagrams,
and flowcharts. Examples are word clouds, sentiment mapping, Venn diagrams, etc.
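As a small sketch (the survey responses below are invented), a bar chart of category counts is a common way to visualize binary or nominal data:

import matplotlib.pyplot as plt
from collections import Counter

# assumed sample binary data: survey responses
responses = ["Agree", "Disagree", "Agree", "Agree", "Disagree", "Agree"]
counts = Counter(responses)
plt.bar(list(counts.keys()), list(counts.values()))  # frequency per category
plt.ylabel("Count")
plt.show()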
Model Visualization:
Model visualization provides the reasoning and logic behind a model's predictions, enabling
accountability and transparency. Machine learning models are often considered black-box
models due to their complex inner workings, even when data scientists deliver a model with
high accuracy.
• Data exploration - Data exploration is done using exploratory data analysis (EDA).
Techniques such as t-distributed Stochastic Neighbour Embedding (t-SNE) or principal
component analysis (PCA) can be applied to understand the features, as in the sketch
below.
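A minimal sketch of PCA-based exploration, using scikit-learn's built-in iris dataset as a stand-in (the notes do not name a dataset):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)  # project features onto 2 components
plt.scatter(X2[:, 0], X2[:, 1], c=y)       # color by class to see structure
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()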
LIME (Local Interpretable Model-agnostic Explanations) is an algorithm to explain the
predictions of a classifier or regression model. It gives a list of features with visual
explanations; feature importance in visual form is needed to determine the essential
elements in the dataset. LIME provides visual descriptions of the model and explains what
the actual model is implementing.
A model predicts that a specific patient has the flu. The prediction is then explained by an
explainer that highlights the symptoms that are important to the model. With the help of
this information about the rationale behind the model, the doctor is now empowered to
decide whether or not to trust the model.
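A minimal sketch of the lime package on tabular data; since the flu dataset above is hypothetical, scikit-learn's breast cancer dataset stands in here:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
exp = explainer.explain_instance(data.data[0], model.predict_proba, num_features=5)
print(exp.as_list())  # top features and their weights behind this one prediction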
• Netron - Netron provides visualization for deep learning and machine learning
models.
• ANN Visualizer - a Python library that visualizes the architecture of artificial
neural networks.
• VisualDL - VisualDL is a deep learning visualization tool that can help visualize
deep learning jobs, including features such as scalars, parameter distributions,
model structure, and image visualization.
Residuals: A residual is a measure of how far away a point is vertically from the regression
line. Simply, it is the error between a predicted value and the observed actual value.
Residual equation: residual = observed value − predicted value, i.e. e = y − ŷ
Figure 1 is an example of how to visualize residuals against the line of best fit. The vertical
lines are the residuals.
How to Create One
We first want to calculate the residual of each point of data on the scatter plot. Recall the
equation for a residual:
residual = y − ŷ (observed value minus predicted value)
Step 1: Locate the residual = 0 line in the residual plot.
Step 2: Look at the points in the plot and answer the following questions:
Are they scattered randomly around the residual = 0 line? Or, are they clustered in a
curved pattern, such as a U-shaped pattern?
If the points show no pattern, that is, the points are randomly dispersed, we can conclude
that a linear model is an appropriate model.
If the points show a curved pattern, such as a U-shaped pattern, we can conclude that a
linear model is not appropriate and that a non-linear model might fit better.
How to Interpret a Residual Plot: Example 1
Interpret the plot to determine if a linear model is a good fit for the data.
Step 1: The residuals are the y values in residual plots. The residual = 0 line coincides
with the x-axis.
Step 2: Look at the points in the plot and answer the following questions:
Are they scattered randomly around the residual = 0 line? Or, are they clustered in a
curved pattern, such as a U-shaped pattern?
In this residual plot, the points are scattered randomly around the residual = 0 line. We
can conclude that a linear model is appropriate for modeling this data.
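A minimal sketch of building such a residual plot, with made-up roughly linear data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# assumed toy data: linear trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50).reshape(-1, 1)
y = 3 * x.ravel() + 5 + rng.normal(0, 2, 50)

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)  # residual = observed - predicted

plt.scatter(x, residuals)
plt.axhline(0, color="red")       # the residual = 0 line
plt.xlabel("x")
plt.ylabel("residual")
plt.show()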
How to Interpret a Residual Plot: Example 2
Interpret the plot to determine if a linear model is a good fit for the data.
Step 2: Look at the points in the plot and answer the following questions:
Are they scattered randomly around the residual = 0 line? Or, are they clustered in a
curved pattern, such as a U-shaped pattern?
In this residual plot, there is a pattern you can describe. The data points are above the
residual = 0 line near x = [0, 1]. Then all of the data points fall under the residual = 0
line near x = [2, 8]. The next data points are again clustered on or above the residual = 0
line. The data points form a curved, U-shaped pattern. Since there is a detectable pattern
in the residual plot, we conclude that a linear model is not the right fit for the data.
Distribution Plots: Distribution plots visually assess the distribution of sample data by
comparing the empirical distribution of the data with the theoretical values expected from a
specified distribution.
1. jointplot
2. distplot
3. pairplot
4. rugplot
Distplot
• KDE stands for Kernel Density Estimation, which is another kind of plot in
seaborn.
• bins sets the number of bins you want in your plot, and the right number depends
on your dataset.
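A minimal sketch with seaborn's built-in tips dataset (note: distplot is deprecated in recent seaborn releases; histplot with kde=True is the modern equivalent):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                      # built-in example dataset
sns.histplot(tips["total_bill"], kde=True, bins=30)  # histogram + KDE curve
plt.show()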
Looking at the resulting plot, we can say that most of the total bill values lie between 10
and 20.
Jointplot
It is used to draw a plot of two variables with bivariate and univariate graphs. It basically
combines two different plots. Syntax:
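A minimal sketch of the syntax, again using seaborn's tips dataset:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
# x and y name columns of tips; kind switches the bivariate plot style
sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg")
plt.show()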
Explanation:
• kind is a parameter that lets you control how you want to visualise the data; it
determines what goes inside the jointplot. The default is scatter, and it can also be
hex, reg (regression), or kde.
• x and y are two strings that are the column names, and the data those columns
contain is used by specifying the data parameter.
• Here we can see tips on the y axis and total bill on the x axis, as well as a linear
relationship between the two that suggests the tip increases with the total bill.
• hue sets up the categorical separation between the entries of the dataset.
Polynomial Regression:
Polynomial Regression is a form of linear regression in which the relationship between the
independent variable x and the dependent variable y is modeled as an nth degree polynomial.
Polynomial regression fits a nonlinear relationship between the value of x and the
corresponding conditional mean of y, denoted E(y | x).
Why Polynomial Regression:
• Inspection of residuals. If we try to fit a linear model to curved data, a scatter plot of
residuals (Y-axis) on the predictor (X-axis) will show patches of many positive
residuals in the middle. In such a situation, a linear model is not appropriate.
• Now that we have a basic understanding of what polynomial regression is, we will
implement both a linear and a polynomial model on a dataset where we have a
curvilinear relationship between the target and predictor. Finally, we will compare
the results to understand the difference between the two.
• First, import the required libraries and plot the relationship between the target
and the predictor variable.
y = a + bx + e
Here y is the dependent variable, a is the y-intercept, b is the slope, and e is the error
term.
In many cases, this linear model will not work out. For example, if we are analyzing the
production of a chemical synthesis in terms of the temperature at which the synthesis takes
place, we use a quadratic model:
y = a + b1x + b2x² + e
Here y is the dependent variable on x, a is the y-intercept, and e is the error term.
In general, we can model it for the nth degree: y = a + b1x + b2x² + … + bnxⁿ + e
Since the regression function is linear in terms of the unknown coefficients, these models
are linear from the point of view of estimation.
Hence, through the least squares technique, we compute the response value y.
Step 1: Import libraries and dataset
Import the important libraries and the dataset we are using to perform Polynomial
Regression.
Step 6: Visualising the Polynomial Regression results using a scatter plot.
Step 7: Predicting new results with both Linear and Polynomial Regression. Note that the
input variable must be in a numpy 2D array.
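A minimal end-to-end sketch of these steps, with made-up curvilinear data standing in for the dataset:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# assumed toy data with a curvilinear relationship
X = np.arange(1, 11).reshape(-1, 1)
y = 2 + 0.5 * X.ravel() ** 2

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)             # adds columns for x^0, x^1, x^2
model = LinearRegression().fit(X_poly, y)

# Step 7: predicting a new result; the input must be a 2D array
print(model.predict(poly.transform([[7.5]])))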
1) Mean Absolute Error (MAE)
MAE is a very simple metric which calculates the absolute difference between actual and
predicted values.
To better understand, take an example: you have input data and output data, and you use
linear regression to draw a best-fit line. Now you have to find the MAE of your model,
which is basically the mistake made by the model, known as an error. Find the difference
between the actual value and the predicted value; that is an absolute error, but we have to
find the mean absolute error over the complete dataset.
So, sum all the absolute errors and divide them by the total number of observations. This
is MAE:
MAE = (1/n) Σ |yi − ŷi|
Advantages of MAE
• The MAE you get is in the same unit as the output variable.
Disadvantages of MAE
• The graph of MAE is not differentiable at zero, so to use it as a loss we have to
apply optimizers like gradient descent with workarounds (such as sub-gradients) for
the non-differentiable point.
# y_test holds the actual values, y_pred the model's predictions
from sklearn.metrics import mean_absolute_error
print("MAE", mean_absolute_error(y_test, y_pred))
2) Mean Squared Error (MSE)
To overcome the disadvantage of MAE, the next metric is MSE. Above we found the
absolute difference; here we find the squared difference:
MSE = (1/n) Σ (yi − ŷi)²
What does MSE actually represent? It represents the average squared distance between
actual and predicted values. We square to avoid the cancellation of negative terms, and
squaring also makes the metric differentiable.
Advantages of MSE
• The graph of MSE is differentiable, so you can easily use it as a loss function.
Disadvantages of MSE
• The value you get after calculating MSE is in squared units of the output. For
example, if the output variable is in meters (m), then after calculating MSE the
result is in meters squared.
• If you have outliers in the dataset, MSE penalizes the outliers the most and the
calculated MSE becomes bigger. So, in short, it is not robust to outliers, which was
an advantage of MAE.
from sklearn.metrics import mean_squared_error
print("MSE", mean_squared_error(y_test, y_pred))
3) Root Mean Squared Error (RMSE)
As is clear from the name itself, RMSE is simply the square root of the mean squared
error: RMSE = √MSE.
Advantages of RMSE
• The output value you get is in the same unit as the required output variable which
makes interpretation of loss easy.
Disadvantages of RMSE
• It is not that robust to outliers as compared to MAE.
To compute RMSE, we apply the NumPy square root function to the MSE.
import numpy as np
from sklearn.metrics import mean_squared_error
print("RMSE", np.sqrt(mean_squared_error(y_test, y_pred)))
Most of the time people use RMSE as an evaluation metric, and when you are working with
deep learning techniques it is the most preferred metric. It is a very simple metric that is
used for most of the datasets hosted in machine learning competitions.
5) R Squared (R2)
The R2 score is a metric that tells the performance of your model, not the loss in an
absolute sense: it tells how well your model performed. In contrast, MAE and MSE depend
on the context, as we have seen, whereas the R2 score is independent of context.
So, with the help of R squared we have a baseline model to compare against, which none of
the other metrics provides. It plays a role similar to the threshold of 0.5 that we fix in
classification problems. Basically, R squared calculates how much better the regression
line is than the mean line.
Hence, R squared is also known as the Coefficient of Determination or sometimes the
Goodness of Fit.
R2 = 1 − (Σ(yi − ŷi)² / Σ(yi − ȳ)²)
Now, how will you interpret the R2 score? Suppose the R2 score is zero. Then the ratio of
the regression line's squared error to the mean line's squared error equals 1, and 1 − 1 is
zero. In this case the two lines effectively overlap, meaning the model performance is
worst: it is not capable of using the input features to predict the output column.
Now the second case is when the R2 score is 1. This means the division term is zero, which
happens when the regression line makes no mistakes at all; it is perfect. In the real world,
this is not possible.
So we can conclude that as our regression line moves towards perfection, the R2 score
moves towards one.
The normal case is when the R2 score is between zero and one, like 0.8, which means your
model is able to explain 80% of the variance in the data.
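Following the pattern of the earlier snippets (and reusing the assumed y_test and y_pred), the R2 score also comes from sklearn:

from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print("R2", r2)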
6) Adjusted R Squared
Adjusted R2 penalizes the model for irrelevant features. Its formula, matching the code
below, is:
Adjusted R2 = 1 − ((1 − R2)(n − 1) / (n − k − 1))
where n is the number of observations and k is the number of independent features.
Now, as k increases by adding some features, the denominator (n − k − 1) decreases while
n − 1 remains constant. If the R2 score remains constant or increases only slightly, the
complete fraction increases, and when we subtract it from one the resultant score
decreases. This is the case when we add an irrelevant feature to the dataset.
And if we add a relevant feature, then the R2 score increases and (1 − R2) decreases
heavily; the denominator also decreases, but the complete term decreases, and on
subtracting from one the score increases.
n = 40  # number of observations (assumed sample value)
k = 2   # number of independent features
adj_r2_score = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
print(adj_r2_score)
Hence, this metric becomes one of the most important metrics to use during the evaluation
of the model.