
UNIT 4 DATA SCIENCE

What is Regression Analysis?

Regression analysis is a predictive modelling technique that assesses the relationship between a dependent variable (i.e., the goal/target variable) and independent factors. Forecasting, time series modelling, determining the relationship between variables, and predicting continuous values can all be done using regression analysis. As an analogy, regression is a natural way to study the relationship between a household's floor area and its electricity cost.

Linear Regression:
It is the most basic and commonly used type of predictive analysis. It is a statistical approach to modeling the relationship between a dependent variable and a given set of independent variables.

These are of two types:


• Simple linear Regression
• Multiple Linear Regression
Simple Linear Regression: The association between two variables is established using a straight line in Simple Linear Regression. It tries to create a line that is as close to the data as possible by determining the slope and intercept, which define the line and minimise the regression errors. There is a single x variable and a single y variable.

Equation: Y = mX+c
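A minimal sketch (assuming scikit-learn is installed; the area and cost numbers below are made up for illustration) that recovers the slope m and intercept c of a simple linear fit:

import numpy as np
from sklearn.linear_model import LinearRegression

area = np.array([[50], [80], [110], [140], [170]])     # single independent variable X (household area)
cost = np.array([120.0, 185.0, 260.0, 330.0, 395.0])   # dependent variable Y (electricity cost)

model = LinearRegression().fit(area, cost)
print("slope m:", model.coef_[0])                      # m in Y = mX + c
print("intercept c:", model.intercept_)                # c in Y = mX + c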

• Multiple Linear Regression: Multiple linear regression is based on the presumption that the dependent (target) variable and the independent (predictor) variables have a linear relationship. Multilinear models may themselves be linear or nonlinear in the predictors, but there is always one dependent variable and two or more independent variables.

• Multiple regression, also known as multiple linear regression (MLR), is a statistical technique that uses two or more explanatory variables to predict the outcome of a response variable. In other words, it can explain the relationship between multiple independent variables and one dependent variable. These independent variables serve as predictor variables, while the single dependent variable serves as the criterion variable. You can use this technique in a variety of contexts, studies and disciplines, including econometrics and financial inference.

Multiple Linear Regression Formula

yi = β0 + β1xi1 + β2xi2 + ... + βpxip + ϵ

Where:
• yi is the dependent or predicted variable

• β0 is the y-intercept, i.e., the value of y when xi1 and xi2 are both 0.

• β1 and β2 are the regression coefficients representing the change in y relative to a one-unit change in xi1 and xi2, respectively.

• βp is the slope coefficient for each independent variable

• ϵ is the model’s random error (residual) term.

Steps Involved in any Multiple Linear Regression Model


Step #1: Data Pre Processing
1. Importing The Libraries.
2. Importing the Data Set.
3. Encoding the Categorical Data.
4. Avoiding the Dummy Variable Trap.
5. Splitting the Data set into Training Set and Test Set.
Step #2: Fitting Multiple Linear Regression to the Training set
Step #3: Predict the Test set results.
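A minimal sketch of these steps in Python; the file name data.csv, the assumption that the last column is the target, and the presence of categorical columns are all hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv")                             # Step 1.2: import the data set
X = pd.get_dummies(df.iloc[:, :-1], drop_first=True)     # Steps 1.3-1.4: encode categories, drop one dummy
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(     # Step 1.5: split into training and test sets
    X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)     # Step 2: fit multiple linear regression
y_pred = regressor.predict(X_test)                       # Step 3: predict the test set results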

Why Multiple Linear Regression is better than Simple Linear Regression

Multiple linear regression is a more precise tool than simple linear regression when several factors are involved. Simple linear regression can readily capture the relationship between two variables in simple settings, while multiple linear regression is generally preferable for more complex interactions that require more thought.

When to use Multiple Linear Regression

Multiple linear regression should be employed when numerous independent factors influence the outcome of a single dependent variable and when forecasting more complex interactions.

Multiple vs. linear regression

This form of regression analysis expands upon linear regression, which is the simplest form of regression. Simple linear regression creates a linear mathematical relationship between one independent variable and one dependent variable, represented by y = a + bx, where y can take only one value for a given x. For example, in the equation y = 20 + 2x, when x = 5, y can only be 30.

Data Visualization

Data visualization is the graphical representation of data points and information, making it quick and easy for users to understand. A data visualization is good if it has a clear meaning and purpose and is very easy to interpret, without requiring extra context. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data using visual elements such as charts, graphs, and maps.

Categories of Data Visualization:

Data visualization is critical to market research, where both numerical and categorical data can be visualized; this increases the impact of insights and reduces the risk of analysis paralysis. Data visualization is therefore categorized as follows:

1. Numerical Data:
Numerical data is also known as quantitative data. It is any data that represents an amount, such as a person's height, weight, or age. Numerical data visualization is the easiest way to visualize data. It is generally used to help others digest large data sets and raw numbers in a way that makes them easier to turn into action. Numerical data falls into two categories:

• Continuous Data – can take any value within a range (Example: height measurements).

• Discrete Data – is not "continuous" and takes only separate, countable values (Example: the number of cars or children a household has).

The visualization techniques used to represent numerical data are charts and numerical values. Examples are pie charts, bar charts, averages, scorecards, etc.

2. Categorical Data:
Categorical data is also known as qualitative data. It is any data that represents groups. It consists of categorical variables used to represent characteristics such as a person's ranking, a person's gender, etc. Categorical data visualization is all about depicting key themes, establishing connections, and lending context. Categorical data is classified into three categories:

• Binary Data – classification is based on two opposing positions (Example: agrees or disagrees).

• Nominal Data – classification is based on attributes (Example: male or female).

• Ordinal Data – classification is based on the ordering of information (Example: a timeline or process stages).

The visualization techniques used to represent categorical data are graphics, diagrams, and flowcharts. Examples are word clouds, sentiment mapping, Venn diagrams, etc.
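As a small illustration, a sketch (assuming seaborn and matplotlib are installed) showing one numerical and one categorical visualization on seaborn's built-in tips dataset:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")               # built-in example dataset
sns.histplot(tips["total_bill"], bins=20)     # numerical data: distribution of a quantitative column
plt.show()
sns.countplot(x="day", data=tips)             # categorical data: counts per category
plt.show()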

What is Model Visualization?

Model visualization provides the reasoning and logic behind a model's predictions, enabling accountability and transparency. Machine learning models are often considered black-box models due to their complex inner workings, so even when data scientists deliver a model with high accuracy, visualization is needed to explain how it works.

Different types of Model Visualization:

• Data exploration - Data exploration is done using exploratory data analysis (EDA). Techniques such as T-distributed Stochastic Neighbour Embedding (t-SNE) or principal component analysis (PCA) can be applied to understand the features (a sketch follows this list).

• Built models - Various metrics measure classification and regression models. Accuracy, precision and recall, the confusion matrix, log loss and the F1 score are used in classification, while mean squared error (MSE), mean squared logarithmic error and root mean squared error (RMSE) are used in regression. All these metrics are computed after the model is built to understand and measure its performance.

• Decision tree models - A static feature summary, such as feature importance, can be retrieved from the model. It only exists in decision-tree-based algorithms such as Random Forest and XGBoost.
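A minimal sketch of the data-exploration point above, applying PCA and t-SNE from scikit-learn to the built-in iris dataset:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)                     # linear projection to 2 components
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)   # non-linear 2-D embedding
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)                       # colour points by class to inspect structure
plt.title("PCA projection of the feature space")
plt.show()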

How does Model Visualization work?

The results or decisions of classification/regression machine learning models are difficult for humans to understand, which also makes them difficult to explain to non-data-scientists. The complex model behaviour can be approximated by locally fitting linear models to permutations of the training set.

LIME (Local Interpretable Model-agnostic Explanations)

LIME is an algorithm that explains the predictions of a classifier or regression model. It gives a list of features with visual explanations; this visual form of feature importance is needed to determine the essential elements in the dataset. LIME provides visual descriptions of the model and explains what the actual model is doing.
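A hedged sketch of how LIME can be called on tabular data; the objects model, X_train, X_test, feature_names and class_names are assumed to exist already and are hypothetical:

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
                                 class_names=class_names, mode="classification")
explanation = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(explanation.as_list())    # top features and their local weights for this single prediction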

Example for Model Prediction

A model predicts that a specific patient has the flu. The prediction is then explained by an explainer that highlights the symptoms that are important to the model. With this information about the rationale behind the model, the doctor is now empowered to decide whether or not to trust the model.

GradCAM (Gradient-weighted Class Activation Maps)

Gradient-weighted Class Activation Maps is an advanced and specialized method. Its constraints are that we need access to the internals of the model and that it works on images. Put simply, given a sample image, it outputs a heat map of the regions of the image where the neural network had the greatest activations, i.e., the features in the image that the model correlates with the predicted class.

SHAP (Shapley Additive Explanations)


SHAP provides many explainers for different kinds of models.
• Tree Explainer - Supports XGBoost, LightGBM, CatBoost, and scikit-learn models via Tree SHAP (a sketch follows this list).
• Deep Explainer (Deep SHAP) - Supports TensorFlow and Keras models by using DeepLIFT and Shapley values.
• Gradient Explainer - Supports TensorFlow and Keras models.
• Kernel Explainer (Kernel SHAP) - Applies to any model by combining LIME and Shapley values.
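A hedged sketch of the Tree Explainer in use; the fitted tree-based model (model) and the feature matrix X are assumed to exist and are hypothetical:

import shap

explainer = shap.TreeExplainer(model)      # Tree SHAP for tree-based models
shap_values = explainer.shap_values(X)     # per-feature contribution for every prediction
shap.summary_plot(shap_values, X)          # global view of which features matter most, and in which direction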

Decision Trees Visualization


Decision tree models can be visualized and interpreted easily with the help of decision tree plots. Visualization allows browsing through each of the individual trees to see their relative importance to the overall model, and answers the question of how important each feature is to a particular tree.
Unique visualization characteristics:
• The decision nodes show how the feature space is split.
• The split positions for decision nodes are shown visually in the distribution.
• The visualization shows how the training samples get distributed into leaf nodes and how the tree makes predictions for a specific observation.
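A minimal sketch of visualizing a single decision tree with scikit-learn's plot_tree on the built-in iris dataset:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
plot_tree(clf, filled=True)   # shows each decision node, its split condition, and the samples reaching each leaf
plt.show()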

Model Visualization Tools

Neural Networks Visualization

• CNNVis - Provides better analysis of Deep Convolutional Neural Networks.


• Neural Networks Playground - An interactive in-browser visualization of Neural Networks (TensorFlow Playground).

• Netron - Netron provides visualization for Deep Learning and Machine Learning models.

• ANN visualizer

Tools for data exploration


• Tableau
• Power BI
Tools for explaining Predictions
• SHAP (Shapley Additive Explanations)
• LIME (Local Interpretable Model agnostic Explanations)
Tools for Visualizing during Training

• VisualDL - VisualDL is a deep learning visualization tool that helps visualize deep learning jobs, including features such as scalars, parameter distributions, model structure, and image visualization.

• TensorBoard - It allows you to visualize the model structure, plot quantitative metrics about the execution of the model, and show additional data such as images, audio and text that pass through it.
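A hedged sketch of attaching TensorBoard to a Keras training run; model, X_train and y_train are assumed to exist already and are hypothetical:

import tensorflow as tf

tb = tf.keras.callbacks.TensorBoard(log_dir="logs")     # writes scalars, the model graph, etc. for TensorBoard
model.fit(X_train, y_train, epochs=5, callbacks=[tb])
# then run `tensorboard --logdir logs` in a terminal and open the printed URL in a browser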

Tools for Evaluation Visualization

• Yellowbrick - Yellowbrick is composed of visual diagnostic tools called visualizers that extend the scikit-learn API to allow human steering of the model selection process.

Residual Plots for regression model

Residuals: A residual is a measure of how far away a point is vertically from the regression
line. Simply, it is the error between a predicted value and the observed actual value.

Residual Equation: residual = actual y-value − predicted y-value

Figure 1 is an example of how to visualize residuals against the line of best fit. The vertical
lines are the residuals.

How to Create One

To begin, suppose we have a standard scatter plot and a trend line.

We first want to calculate the residual of each point of data on the scatter plot. Recall the
equation for a residual:

residual=actual y-value−predicted y-value
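A minimal sketch of building a residual plot from this equation, using made-up data and a scikit-learn linear fit:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([2.0, 4.1, 6.3, 7.9, 10.2, 11.8, 14.1, 16.0, 18.2, 19.9])
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)         # residual = actual y-value - predicted y-value
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color="red")              # the residual = 0 line
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()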

How to Interpret a Residual Plot

Step 1: Locate the residual = 0 line in the residual plot.

Step 2: Look at the points in the plot and answer the following questions:

Are they scattered randomly around the residual = 0 line? Or, are they clustered in a
curved pattern, such as a U-shaped pattern?

If the points show no pattern, that is, the points are randomly dispersed, we can conclude
that a linear model is an appropriate model.

If the points show a curved pattern, such as a U-shaped pattern, we can conclude that a
linear model is not appropriate and that a non-linear model might fit better.

How to Interpret a Residual Plot: Example 1

Interpret the plot to determine if the plot is a good fit for a linear model.

Step 1: Locate the residual = 0 line in the residual plot.

The residuals are the y values in residual plots. The residual =0 line coincides with the x-
axis.

Step 2: Look at the points in the plot and answer the following questions:

Are they scattered randomly around the residual = 0 line? Or, are they clustered in a
curved pattern, such as a U-shaped pattern?

In this residual plot, the points are scattered randomly around the residual=0 line. We can
conclude that a linear model is appropriate for modeling this data.

How to Interpret a Residual Plot: Example 2

Interpret the plot to determine if the plot is a good fit for a linear model.

Step 1: Locate the residual = 0 line in the residual plot.

Step 2: Look at the points in the plot and answer the following questions:

Are they scattered randomly around the residual = 0 line? Or, are they clustered in a
curved pattern, such as a U-shaped pattern?

In this residual plot, there is a pattern that you can describe. The data points lie above the residual = 0 line for x-values near 0 to 1, then all of the data points fall below the residual = 0 line for x-values roughly between 2 and 8, and the next data points are again clustered on or above the residual = 0 line. The data points form a curved, U-shaped pattern. Since there is a detectable pattern in the residual plot, we conclude that a linear model is not a good fit for the data.

Distribution Plots: Distribution plots visually assess the distribution of sample data by
comparing the empirical distribution of the data with the theoretical values expected from a
specified distribution.

There are four types of distribution plots in seaborn, namely:

1. jointplot
2. distplot
3. pairplot
4. rugplot

Distplot

It is used basically for a univariate set of observations and visualizes them through a histogram, i.e., only one observation, and hence we choose one particular column of the dataset. Syntax:

distplot(a[, bins, hist, kde, rug, fit, ...])

• KDE stands for Kernel Density Estimation and that is another kind of the plot in
seaborn.

• bins is used to set the number of bins you want in your plot and it actually depends
on your dataset.

• color is used to specify the color of the plot

Looking at the resulting plot of the total_bill column of the tips dataset, we can say that most of the total bills lie between 10 and 20.
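A minimal sketch of this plot on seaborn's built-in tips dataset; note that distplot is deprecated in recent seaborn versions, so histplot with kde=True is shown as the modern equivalent:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.histplot(tips["total_bill"], bins=30, kde=True, color="blue")   # histogram plus KDE curve of one column
plt.show()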

Jointplot

It is used to draw a plot of two variables with bivariate and univariate graphs. It basically combines two different plots. Syntax:

jointplot(x, y[, data, kind, stat_func, ...])

Explanation:

• kind is a variable that controls how you want to visualise the data; it determines what is drawn inside the jointplot. The default is scatter, and it can also be hex, reg (regression) or kde.

• x and y are two strings that are the column names, and the data those columns contain is used by specifying the data parameter.

• Here we can see tips on the y-axis and total bill on the x-axis, as well as a linear relationship between the two suggesting that the tip increases with the total bill.

• hue sets up the categorical separation between the entries of the dataset.

• palette is used for styling the plots.
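A minimal sketch of a jointplot on the same tips dataset, using the regression kind:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg")   # scatter + regression line, with marginal histograms
plt.show()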

Implementation of Polynomial Regression

Polynomial Regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x).
Why Polynomial Regression:

• There are some relationships that a researcher will hypothesize to be curvilinear. Clearly, such cases call for a polynomial term.

• Inspection of residuals. If we try to fit a linear model to curved data, a scatter plot of residuals (Y-axis) against the predictor (X-axis) will have patches of many positive residuals in the middle; hence, in such a situation, a linear model is not appropriate.

• An assumption in the usual multiple linear regression analysis is that all the independent variables are independent. In the polynomial regression model, this assumption is not satisfied, since the polynomial terms are powers of the same variable.

Polynomial Regression vs. Linear Regression

• Now that we have a basic understanding of what Polynomial Regression is, let’s open

up our Python IDE and implement polynomial regression.

• I’m going to take a slightly different approach here. We will implement both the

polynomial regression as well as linear regression algorithms on a simple dataset

where we have a curvilinear relationship between the target and predictor. Finally, we

will compare the results to understand the difference between the two.

• First, import the required libraries and plot the relationship between the target

variable and the independent variable.

Uses of Polynomial Regression:


These are basically used to define or describe non-linear phenomena such as:
• The growth rate of tissues.
• Progression of disease epidemics
• Distribution of carbon isotopes in lake sediments
The basic goal of regression analysis is to model the expected value of a dependent variable
y in terms of the value of an independent variable x. In simple regression, we used the
following equation –

y = a + bx + e

Here y is a dependent variable, a is the y-intercept, b is the slope and e is the error rate.
In many cases, this linear model will not work. For example, if we analyze the yield of a chemical synthesis in terms of the temperature at which the synthesis takes place, we need a quadratic model:
y = a + b1x + b2x^2 + e

Here y is the dependent variable on x, a is the y-intercept and e is the error rate.
In general, we can model it for an nth-degree polynomial:

y = a + b1x + b2x^2 + .... + bnx^n + e

Since the regression function is linear in terms of the unknown coefficients, these models are linear from the point of view of estimation. Hence, through the least squares technique, we can compute the fitted response value y.
Step 1: Import libraries and dataset
Import the important libraries and the dataset we are using to perform Polynomial
Regression.

Step 2: Dividing the dataset into 2 components


Divide the dataset into two components, X and y. X will contain the column between index 1 and 2 (the independent variable), and y will contain column 2 (the dependent variable).
Step 3: Fitting Linear Regression to the dataset
Fit the Linear Regression model on the two components.
Step 4: Fitting Polynomial Regression to the dataset
Fit the Polynomial Regression model on the two components X and y.
Step 5: In this step, we visualise the Linear Regression results using a scatter plot.

Step 6: Visualising the Polynomial Regression results using a scatter plot.

Step 7: Predicting new results with both Linear and Polynomial Regression. Note that the
input variable must be in a numpy 2D array.
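A minimal sketch of Steps 1-7 on a made-up curvilinear dataset (all numbers are illustrative only):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1, 11).reshape(-1, 1)                    # Step 2: independent variable
y = np.array([1, 4, 9, 15, 26, 35, 50, 62, 80, 101])   # Step 2: dependent variable (curvilinear)

lin = LinearRegression().fit(X, y)                     # Step 3: fit linear regression

poly = PolynomialFeatures(degree=2)                    # Step 4: build polynomial features and fit
X_poly = poly.fit_transform(X)
lin_poly = LinearRegression().fit(X_poly, y)

plt.scatter(X, y)                                      # Steps 5-6: visualise both fits
plt.plot(X, lin.predict(X), label="linear")
plt.plot(X, lin_poly.predict(X_poly), label="polynomial")
plt.legend()
plt.show()

print(lin.predict([[11.0]]))                           # Step 7: predict a new result (input is a 2D array)
print(lin_poly.predict(poly.transform([[11.0]])))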

Advantages of using Polynomial Regression:


• A broad range of functions can be fit under it.
• A polynomial basically fits a wide range of curvature.
• A polynomial can provide a good approximation of the relationship between the dependent and independent variables.
Disadvantages of using Polynomial Regression
• Polynomial models are very sensitive to outliers.
• The presence of one or two outliers in the data can seriously affect the results of a nonlinear analysis.
• In addition, there are unfortunately fewer model validation tools for the detection of
outliers in nonlinear regression than there are for linear regression.
Measures for In-sample Evaluation
1) Mean Absolute Error(MAE)

MAE is a very simple metric which calculates the absolute difference between actual and predicted values: MAE = (1/n) Σ |actual - predicted|.

To understand it better, take an example: you have input data and output data and use linear regression, which draws a best-fit line.

Now you have to find the MAE of your model, which is basically the mistake made by the model, known as the error. Find the difference between each actual value and its predicted value; that is an absolute error, but we have to find the mean absolute error over the complete dataset.

So, sum all the absolute errors and divide by the total number of observations: this is the MAE. We aim for a minimum MAE because it is a loss.

Advantages of MAE

• The MAE you get is in the same unit as the output variable.

• It is the most robust to outliers.

Disadvantages of MAE
• The graph of MAE is not differentiable (at zero), so we have to apply optimizers such as gradient descent with workarounds like sub-gradients.
from sklearn.metrics import mean_absolute_error
print("MAE", mean_absolute_error(y_test, y_pred))   # mean of |actual - predicted| over the test set
To overcome this disadvantage of MAE, the next metric to consider is MSE.

2) Mean Squared Error(MSE)


MSE is a widely used and very simple metric, with only a small change from mean absolute error: mean squared error takes the squared difference between the actual and predicted values.

So, where above we found the absolute difference, here we find the squared difference: MSE = (1/n) Σ (actual - predicted)^2.

What does the MSE actually represent? It represents the mean squared distance between actual and predicted values. We square the differences to avoid the cancellation of negative terms, and that is the benefit of MSE.

Advantages of MSE

The graph of MSE is differentiable, so you can easily use it as a loss function.
Disadvantages of MSE
• The value you get after calculating MSE is in squared units of the output. For example, if the output variable is in metres (m), then after calculating MSE the output we get is in metres squared.

• If you have outliers in the dataset then it penalizes the outliers most and the calculated MSE becomes large. So, in short, it is not robust to outliers, which was an advantage of MAE.
from sklearn.metrics import mean_squared_error
print("MSE",mean_squared_error(y_test,y_pred))

3) Root Mean Squared Error(RMSE)

As the name itself makes clear, RMSE is simply the square root of the mean squared error.

Advantages of RMSE
• The output value you get is in the same unit as the required output variable which
makes interpretation of loss easy.
Disadvantages of RMSE
• It is not that robust to outliers as compared to MAE.
To compute RMSE, we apply the NumPy square root function to the MSE:
print("RMSE", np.sqrt(mean_squared_error(y_test, y_pred)))
Most of the time people use RMSE as an evaluation metric, and when working with deep learning techniques it is usually the preferred metric.

4) Root Mean Squared Log Error(RMSLE)


Taking the log of the RMSE metric slows down the scale of the error. The metric is very helpful when you are developing a model without scaling the inputs, because in that case the outputs can vary over a large scale.
To control this situation, we take the log of the calculated RMSE error, and the result is the RMSLE.
To perform RMSLE this way, we apply the NumPy log function to the RMSE:
print("RMSLE", np.log(np.sqrt(mean_squared_error(y_test, y_pred))))

It is a simple metric that is used in many of the datasets hosted for machine learning competitions.
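As an aside, scikit-learn also provides mean_squared_log_error, which implements the more common definition of RMSLE (the square root of the mean squared difference between log(1 + actual) and log(1 + predicted)); a sketch, assuming y_test and y_pred are non-negative:

import numpy as np
from sklearn.metrics import mean_squared_log_error

print("RMSLE", np.sqrt(mean_squared_log_error(y_test, y_pred)))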

5) R Squared (R2)
The R2 score is a metric that tells you the performance of your model rather than the loss; in an absolute sense it says how well your model performed.
In contrast, MAE and MSE depend on the context, as we have seen, whereas the R2 score is independent of context.
So, with the help of R squared we have a baseline model to compare against, which none of the other metrics provides (similar to how classification problems have a fixed threshold of 0.5). Basically, R squared calculates how much better the regression line is than a simple mean line.
Hence, R squared is also known as the Coefficient of Determination, or sometimes the Goodness of Fit.

R2 = 1 - (sum of squared errors of the regression line / sum of squared errors of the mean line) = 1 - Σ(yi - ŷi)^2 / Σ(yi - ȳ)^2
Now, how will you interpret the R2 score? Suppose the R2 score is zero: then the sum of squared errors of the regression line equals that of the mean line, the ratio is 1, and 1 - 1 is zero. In this case the two lines overlap and the model's performance is at its worst; it is not able to explain any of the variation in the output column.

The second case is when the R2 score is 1. It happens when the ratio term is zero, which means the regression line makes no mistakes at all; it is perfect. In the real world this is not possible.

So we can conclude that as our regression line moves towards perfection, the R2 score moves towards one, and the model performance improves.

The normal case is when the R2 score is between zero and one, for example 0.8, which means your model is able to explain 80 per cent of the variance in the data.


from sklearn.metrics import r2_score
r2 = r2_score(y_test,y_pred)
print(r2)
6) Adjusted R Squared
The disadvantage of the R2 score is that, when new features are added to the data, the R2 score starts increasing or remains constant but never decreases, because it assumes that adding more features always explains more of the variance of the data.
But the problem is that when we add an irrelevant feature to the dataset, R2 sometimes still increases, which is misleading.
Hence, to control this situation, Adjusted R Squared came into existence:

Adjusted R2 = 1 - [(1 - R2)(n - 1) / (n - k - 1)], where n is the number of observations and k is the number of independent features.

Now, as k increases when we add features, the denominator (n - k - 1) decreases while n - 1 remains constant. If the R2 score remains constant or increases only slightly, the whole fraction increases, and when we subtract it from one the resulting score decreases; this is what happens when we add an irrelevant feature to the dataset.
If instead we add a relevant feature, the R2 score increases and 1 - R2 decreases heavily; even though the denominator also decreases, the complete term decreases, and on subtracting it from one the score increases.
n = 40   # number of observations
k = 2    # number of independent features
adj_r2_score = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
print(adj_r2_score)
Hence, this metric becomes one of the most important metrics to use during the evaluation
of the model.

