
UNIT 4 DATA SCIENCE

What is Regression Analysis?

Regression analysis is a predictive modelling technique that assesses the relationship between a dependent variable (i.e., the goal/target variable) and independent factors. Forecasting, time series modelling, determining the relationship between variables, and predicting continuous values can all be done using regression analysis. As an analogy, regression is a natural way to study the relationship between a household's floor area and its electricity cost.

Linear Regression:
It is the most basic and commonly used type of predictive analysis. It is a statistical approach to modeling the relationship between a dependent variable and a given set of independent variables.

These are of two types:


• Simple linear Regression
• Multiple Linear Regression
Simple Linear Regression: The association between two variables is established using a straight line in Simple Linear Regression. It tries to create a line that is as close to the data as possible by determining the slope and intercept, which define the line and minimise the regression errors. There is a single x variable and a single y variable.

Equation: Y = mX+c
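A minimal sketch (assuming scikit-learn is installed; the area and cost numbers below are made up for illustration) that recovers the slope m and intercept c of a simple linear fit:

import numpy as np
from sklearn.linear_model import LinearRegression

area = np.array([[50], [80], [110], [140], [170]])     # single independent variable X (household area)
cost = np.array([120.0, 185.0, 260.0, 330.0, 395.0])   # dependent variable Y (electricity cost)

model = LinearRegression().fit(area, cost)
print("slope m:", model.coef_[0])                      # m in Y = mX + c
print("intercept c:", model.intercept_)                # c in Y = mX + c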

• Multiple Linear Regression: Multiple linear regression is based on the presumption that the dependent (target) variable and the independent (predictor) variables have a linear relationship. Multilinear models may themselves be linear or nonlinear in the predictors, but there is always one dependent variable and two or more independent variables.

• Multiple regression, also known as multiple linear regression (MLR), is a statistical technique that uses two or more explanatory variables to predict the outcome of a response variable. In other words, it can explain the relationship between multiple independent variables and one dependent variable. These independent variables serve as predictor variables, while the single dependent variable serves as the criterion variable. You can use this technique in a variety of contexts, studies and disciplines, including econometrics and financial inference.

Multiple Linear Regression Formula

yi = β0 + β1xi1 + β2xi2 + ... + βpxip + ϵ

Where:
• yi is the dependent or predicted variable

• β0 is the y-intercept, i.e., the value of y when xi1 and xi2 are both 0.

• β1 and β2 are the regression coefficients representing the change in y relative to a one-unit change in xi1 and xi2, respectively.

• βp is the slope coefficient for each independent variable

• ϵ is the model’s random error (residual) term.

Steps Involved in any Multiple Linear Regression Model


Step #1: Data Pre Processing
1. Importing The Libraries.
2. Importing the Data Set.
3. Encoding the Categorical Data.
4. Avoiding the Dummy Variable Trap.
5. Splitting the Data set into Training Set and Test Set.
Step #2: Fitting Multiple Linear Regression to the Training set
Step #3: Predict the Test set results.
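A minimal sketch of these steps in Python; the file name data.csv, the assumption that the last column is the target, and the presence of categorical columns are all hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv")                             # Step 1.2: import the data set
X = pd.get_dummies(df.iloc[:, :-1], drop_first=True)     # Steps 1.3-1.4: encode categories, drop one dummy
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(     # Step 1.5: split into training and test sets
    X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)     # Step 2: fit multiple linear regression
y_pred = regressor.predict(X_test)                       # Step 3: predict the test set results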

Why Multiple Linear Regression is better than Simple Linear Regression

Multiple linear regression is a more precise tool than simple linear regression when several factors are involved. Simple linear regression can readily capture the relationship between two variables in simple settings, while multiple linear regression is generally preferable for more complex interactions that require more thought.

When to use Multiple Linear Regression

Multiple linear regression should be employed when numerous independent factors influence the outcome of a single dependent variable and when forecasting more complex interactions.

Multiple vs. linear regression

This form of regression analysis expands upon linear regression, which is the simplest form of regression. Simple linear regression creates a linear mathematical relationship between one independent variable and one dependent variable, represented by y = a + bx, where y can take only one value for a given x. For example, in the equation y = 20 + 2x, when x = 5, y can only be 30.

Data Visualization

Data visualization is the graphical representation of data points and information, making it quick and easy for users to understand. A data visualization is good if it has a clear meaning and purpose and is very easy to interpret, without requiring extra context. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data using visual elements such as charts, graphs, and maps.

Categories of Data Visualization:

Data visualization is critical to market research, where both numerical and categorical data can be visualized; this increases the impact of insights and reduces the risk of analysis paralysis. Data visualization is therefore categorized as follows:

1. Numerical Data:
Numerical data is also known as quantitative data. It is any data that represents an amount, such as a person's height, weight, or age. Numerical data visualization is the easiest way to visualize data. It is generally used to help others digest large data sets and raw numbers in a way that makes them easier to turn into action. Numerical data falls into two categories:

• Continuous Data – can take any value within a range (Example: height measurements).

• Discrete Data – is not "continuous" and takes only separate, countable values (Example: the number of cars or children a household has).

The visualization techniques used to represent numerical data are charts and numerical values. Examples are pie charts, bar charts, averages, scorecards, etc.

2. Categorical Data:
Categorical data is also known as qualitative data. It is any data that represents groups. It consists of categorical variables used to represent characteristics such as a person's ranking, a person's gender, etc. Categorical data visualization is all about depicting key themes, establishing connections, and lending context. Categorical data is classified into three categories:

• Binary Data – classification is based on two opposing positions (Example: agrees or disagrees).

• Nominal Data – classification is based on attributes (Example: male or female).

• Ordinal Data – classification is based on the ordering of information (Example: a timeline or process stages).

The visualization techniques used to represent categorical data are graphics, diagrams, and flowcharts. Examples are word clouds, sentiment mapping, Venn diagrams, etc.
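As a small illustration, a sketch (assuming seaborn and matplotlib are installed) showing one numerical and one categorical visualization on seaborn's built-in tips dataset:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")               # built-in example dataset
sns.histplot(tips["total_bill"], bins=20)     # numerical data: distribution of a quantitative column
plt.show()
sns.countplot(x="day", data=tips)             # categorical data: counts per category
plt.show()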

What is Model Visualization?

Model visualization provides the reasoning and logic behind a model's predictions, enabling accountability and transparency. Machine learning models are often considered black-box models due to their complex inner workings, so even when data scientists deliver a model with high accuracy, visualization is needed to explain how it works.

Different types of Model Visualization:

• Data exploration - Data exploration is done using exploratory data analysis (EDA). Techniques such as T-distributed Stochastic Neighbour Embedding (t-SNE) or principal component analysis (PCA) can be applied to understand the features (a sketch follows this list).

• Built models - Various metrics measure classification and regression models. Accuracy, precision and recall, the confusion matrix, log loss and the F1 score are used in classification, while mean squared error (MSE), mean squared logarithmic error and root mean squared error (RMSE) are used in regression. All these metrics are computed after the model is built to understand and measure its performance.

• Decision tree models - A static feature summary, such as feature importance, can be retrieved from the model. It only exists in decision-tree-based algorithms such as Random Forest and XGBoost.
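A minimal sketch of the data-exploration point above, applying PCA and t-SNE from scikit-learn to the built-in iris dataset:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)                     # linear projection to 2 components
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)   # non-linear 2-D embedding
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)                       # colour points by class to inspect structure
plt.title("PCA projection of the feature space")
plt.show()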

How does Model Visualization work?

The results or decisions of classification/regression machine learning models are difficult for humans to understand, which also makes them difficult to explain to non-data-scientists. The complex model behaviour can be approximated by locally fitting linear models to permutations of the training set.

LIME (Local Interpretable Model-agnostic Explanations)

LIME is an algorithm that explains the predictions of a classifier or regression model. It gives a list of features with visual explanations; this visual form of feature importance is needed to determine the essential elements in the dataset. LIME provides visual descriptions of the model and explains what the actual model is doing.
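A hedged sketch of how LIME can be called on tabular data; the objects model, X_train, X_test, feature_names and class_names are assumed to exist already and are hypothetical:

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
                                 class_names=class_names, mode="classification")
explanation = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(explanation.as_list())    # top features and their local weights for this single prediction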

Example for Model Prediction

A model predicts that a specific patient has the flu. The prediction is then explained by an explainer that highlights the symptoms that are important to the model. With this information about the rationale behind the model, the doctor is now empowered to decide whether or not to trust the model.

GradCAM (Gradient-weighted Class Activation Maps)

Gradient-weighted Class Activation Maps is an advanced and specialized method. Its constraints are that we need access to the internals of the model and that it works on images. Put simply, given a sample image, it outputs a heat map of the regions of the image where the neural network had the greatest activations, i.e., the features in the image that the model correlates with the predicted class.

SHAP (Shapley Additive Explanations)


SHAP provides many explainers for different kinds of models.
• Tree Explainer - Supports XGBoost, LightGBM, CatBoost, and scikit-learn models via Tree SHAP (a sketch follows this list).
• Deep Explainer (Deep SHAP) - Supports TensorFlow and Keras models by using DeepLIFT and Shapley values.
• Gradient Explainer - Supports TensorFlow and Keras models.
• Kernel Explainer (Kernel SHAP) - Applies to any model by combining LIME and Shapley values.
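A hedged sketch of the Tree Explainer in use; the fitted tree-based model (model) and the feature matrix X are assumed to exist and are hypothetical:

import shap

explainer = shap.TreeExplainer(model)      # Tree SHAP for tree-based models
shap_values = explainer.shap_values(X)     # per-feature contribution for every prediction
shap.summary_plot(shap_values, X)          # global view of which features matter most, and in which direction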

Decision Trees Visualization


Decision tree models can be visualized and interpreted easily with the help of decision tree plots. Visualization allows browsing through each of the individual trees to see their relative importance to the overall model, and answers the question of how important each feature is to a particular tree.
Unique visualization characteristics:
• The decision nodes show how the feature space is split.
• The split positions for decision nodes are shown visually in the distribution.
• The visualization shows how the training samples get distributed into leaf nodes and how the tree makes predictions for a specific observation.
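A minimal sketch of visualizing a single decision tree with scikit-learn's plot_tree on the built-in iris dataset:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
plot_tree(clf, filled=True)   # shows each decision node, its split condition, and the samples reaching each leaf
plt.show()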

Model Visualization Tools

Neural Networks Visualization

• CNNVis - Provides better analysis of Deep Convolutional Neural Networks.


• Neural Networks Playground - An interactive in-browser visualization of Neural Networks (TensorFlow Playground).

• Netron - Netron provides visualization for Deep Learning and Machine Learning models.

• ANN visualizer

Tools for data exploration


• Tableau
• Power BI
Tools for explaining Predictions
• SHAP (Shapley Additive Explanations)
• LIME (Local Interpretable Model agnostic Explanations)
Tools for Visualizing during Training

• VisualDL - VisualDL is a deep learning visualization tool that helps visualize deep learning jobs, including features such as scalars, parameter distributions, model structure, and image visualization.

• TensorBoard - It allows you to visualize the model structure, plot quantitative metrics about the execution of the model, and show additional data such as images, audio and text that pass through it.
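A hedged sketch of attaching TensorBoard to a Keras training run; model, X_train and y_train are assumed to exist already and are hypothetical:

import tensorflow as tf

tb = tf.keras.callbacks.TensorBoard(log_dir="logs")     # writes scalars, the model graph, etc. for TensorBoard
model.fit(X_train, y_train, epochs=5, callbacks=[tb])
# then run `tensorboard --logdir logs` in a terminal and open the printed URL in a browser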

Tools for Evaluation Visualization

• Yellowbrick - Yellowbrick is composed of visual diagnostic tools called visualizers that extend the scikit-learn API to allow human steering of the model selection process.

Residual Plots for regression model

Residuals: A residual is a measure of how far away a point is vertically from the regression
line. Simply, it is the error between a predicted value and the observed actual value.

Residual Equation: residual = actual y-value − predicted y-value

Figure 1 is an example of how to visualize residuals against the line of best fit. The vertical
lines are the residuals.

How to Create One

To begin, suppose we have a standard scatter plot and a trend line.

We first want to calculate the residual of each point of data on the scatter plot. Recall the
equation for a residual:

residual=actual y-value−predicted y-value
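A minimal sketch of building a residual plot from this equation, using made-up data and a scikit-learn linear fit:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = np.arange(1, 11).reshape(-1, 1)
y = np.array([2.0, 4.1, 6.3, 7.9, 10.2, 11.8, 14.1, 16.0, 18.2, 19.9])
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)         # residual = actual y-value - predicted y-value
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color="red")              # the residual = 0 line
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()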

How to Interpret a Residual Plot

Step 1: Locate the residual = 0 line in the residual plot.

Step 2: Look at the points in the plot and answer the following questions:

Are they scattered randomly around the residual = 0 line? Or, are they clustered in a
curved pattern, such as a U-shaped pattern?

If the points show no pattern, that is, the points are randomly dispersed, we can conclude
that a linear model is an appropriate model.

If the points show a curved pattern, such as a U-shaped pattern, we can conclude that a
linear model is not appropriate and that a non-linear model might fit better.

How to Interpret a Residual Plot: Example 1

Interpret the plot to determine if the plot is a good fit for a linear model.

Step 1: Locate the residual = 0 line in the residual plot.

The residuals are the y values in residual plots. The residual =0 line coincides with the x-
axis.

Step 2: Look at the points in the plot and answer the following questions:

Are they scattered randomly around the residual = 0 line? Or, are they clustered in a
curved pattern, such as a U-shaped pattern?

In this residual plot, the points are scattered randomly around the residual=0 line. We can
conclude that a linear model is appropriate for modeling this data.

How to Interpret a Residual Plot: Example 2

Interpret the plot to determine if the plot is a good fit for a linear model.

Step 1: Locate the residual = 0 line in the residual plot.

Step 2: Look at the points in the plot and answer the following questions:

Are they scattered randomly around the residual = 0 line? Or, are they clustered in a
curved pattern, such as a U-shaped pattern?

In this residual plot, there is a pattern that you can describe. The data points lie above the residual = 0 line for x-values near 0 to 1, then all of the data points fall below the residual = 0 line for x-values roughly between 2 and 8, and the next data points are again clustered on or above the residual = 0 line. The data points form a curved, U-shaped pattern. Since there is a detectable pattern in the residual plot, we conclude that a linear model is not a good fit for the data.

Distribution Plots: Distribution plots visually assess the distribution of sample data by
comparing the empirical distribution of the data with the theoretical values expected from a
specified distribution.

There are four types of distribution plots in seaborn, namely:

1. jointplot
2. distplot
3. pairplot
4. rugplot

Distplot

It is used basically for a univariate set of observations and visualizes them through a histogram, i.e., only one observation, and hence we choose one particular column of the dataset. Syntax:

distplot(a[, bins, hist, kde, rug, fit, ...])

• KDE stands for Kernel Density Estimation and that is another kind of the plot in
seaborn.

• bins is used to set the number of bins you want in your plot and it actually depends
on your dataset.

• color is used to specify the color of the plot

Looking at the resulting plot of the total_bill column of the tips dataset, we can say that most of the total bills lie between 10 and 20.
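A minimal sketch of this plot on seaborn's built-in tips dataset; note that distplot is deprecated in recent seaborn versions, so histplot with kde=True is shown as the modern equivalent:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.histplot(tips["total_bill"], bins=30, kde=True, color="blue")   # histogram plus KDE curve of one column
plt.show()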

Jointplot

It is used to draw a plot of two variables with bivariate and univariate graphs. It basically combines two different plots. Syntax:

jointplot(x, y[, data, kind, stat_func, ...])

Explanation:

• kind is a variable that controls how you want to visualise the data; it determines what is drawn inside the jointplot. The default is scatter, and it can also be hex, reg (regression) or kde.

• x and y are two strings that are the column names, and the data those columns contain is used by specifying the data parameter.

• Here we can see tips on the y-axis and total bill on the x-axis, as well as a linear relationship between the two suggesting that the tip increases with the total bill.

• hue sets up the categorical separation between the entries of the dataset.

• palette is used for styling the plots.
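A minimal sketch of a jointplot on the same tips dataset, using the regression kind:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg")   # scatter + regression line, with marginal histograms
plt.show()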

Implementation of Polynomial Regression

Polynomial Regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x).
Why Polynomial Regression:

• There are some relationships that a researcher will hypothesize to be curvilinear. Clearly, such cases call for a polynomial term.

• Inspection of residuals. If we try to fit a linear model to curved data, a scatter plot of residuals (Y-axis) against the predictor (X-axis) will have patches of many positive residuals in the middle; hence, in such a situation, a linear model is not appropriate.

• An assumption in the usual multiple linear regression analysis is that all the independent variables are independent. In the polynomial regression model, this assumption is not satisfied, since the polynomial terms are powers of the same variable.

Polynomial Regression vs. Linear Regression

• Now that we have a basic understanding of what Polynomial Regression is, let’s open

up our Python IDE and implement polynomial regression.

• I’m going to take a slightly different approach here. We will implement both the

polynomial regression as well as linear regression algorithms on a simple dataset

where we have a curvilinear relationship between the target and predictor. Finally, we

will compare the results to understand the difference between the two.

• First, import the required libraries and plot the relationship between the target

variable and the independent variable.

Uses of Polynomial Regression:


These are basically used to define or describe non-linear phenomena such as:
• The growth rate of tissues.
• Progression of disease epidemics
• Distribution of carbon isotopes in lake sediments
The basic goal of regression analysis is to model the expected value of a dependent variable
y in terms of the value of an independent variable x. In simple regression, we used the
following equation –

y = a + bx + e

Here y is a dependent variable, a is the y-intercept, b is the slope and e is the error rate.
In many cases, this linear model will not work. For example, if we analyze the yield of a chemical synthesis in terms of the temperature at which the synthesis takes place, we need a quadratic model:
y = a + b1x + b2x^2 + e

Here y is the dependent variable on x, a is the y-intercept and e is the error rate.
In general, we can model it for an nth-degree polynomial:

y = a + b1x + b2x^2 + .... + bnx^n + e

Since the regression function is linear in terms of the unknown coefficients, these models are linear from the point of view of estimation. Hence, through the least squares technique, we can compute the fitted response value y.
Step 1: Import libraries and dataset
Import the important libraries and the dataset we are using to perform Polynomial
Regression.

Step 2: Dividing the dataset into 2 components


Divide the dataset into two components, X and y. X will contain the column between index 1 and 2 (the independent variable), and y will contain column 2 (the dependent variable).
Step 3: Fitting Linear Regression to the dataset
Fit the Linear Regression model on the two components.
Step 4: Fitting Polynomial Regression to the dataset
Fit the Polynomial Regression model on the two components X and y.
Step 5: In this step, we visualise the Linear Regression results using a scatter plot.

Step 6: Visualising the Polynomial Regression results using a scatter plot.

Step 7: Predicting new results with both Linear and Polynomial Regression. Note that the
input variable must be in a numpy 2D array.
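A minimal sketch of Steps 1-7 on a made-up curvilinear dataset (all numbers are illustrative only):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1, 11).reshape(-1, 1)                    # Step 2: independent variable
y = np.array([1, 4, 9, 15, 26, 35, 50, 62, 80, 101])   # Step 2: dependent variable (curvilinear)

lin = LinearRegression().fit(X, y)                     # Step 3: fit linear regression

poly = PolynomialFeatures(degree=2)                    # Step 4: build polynomial features and fit
X_poly = poly.fit_transform(X)
lin_poly = LinearRegression().fit(X_poly, y)

plt.scatter(X, y)                                      # Steps 5-6: visualise both fits
plt.plot(X, lin.predict(X), label="linear")
plt.plot(X, lin_poly.predict(X_poly), label="polynomial")
plt.legend()
plt.show()

print(lin.predict([[11.0]]))                           # Step 7: predict a new result (input is a 2D array)
print(lin_poly.predict(poly.transform([[11.0]])))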

Advantages of using Polynomial Regression:


• A broad range of functions can be fit under it.
• A polynomial basically fits a wide range of curvature.
• A polynomial can provide a good approximation of the relationship between the dependent and independent variables.
Disadvantages of using Polynomial Regression
• Polynomial models are very sensitive to outliers.
• The presence of one or two outliers in the data can seriously affect the results of a nonlinear analysis.
• In addition, there are unfortunately fewer model validation tools for the detection of
outliers in nonlinear regression than there are for linear regression.
Measures for In-sample Evaluation
1) Mean Absolute Error(MAE)

MAE is a very simple metric which calculates the absolute difference between actual and predicted values: MAE = (1/n) Σ |actual - predicted|.

To understand it better, take an example: you have input data and output data and use linear regression, which draws a best-fit line.

Now you have to find the MAE of your model, which is basically the mistake made by the model, known as the error. Find the difference between each actual value and its predicted value; that is an absolute error, but we have to find the mean absolute error over the complete dataset.

So, sum all the absolute errors and divide by the total number of observations: this is the MAE. We aim for a minimum MAE because it is a loss.

Advantages of MAE

• The MAE you get is in the same unit as the output variable.

• It is the most robust to outliers.

Disadvantages of MAE
• The graph of MAE is not differentiable (at zero), so we have to apply optimizers such as gradient descent with workarounds like sub-gradients.
from sklearn.metrics import mean_absolute_error
print("MAE", mean_absolute_error(y_test, y_pred))   # mean of |actual - predicted| over the test set
To overcome this disadvantage of MAE, the next metric to consider is MSE.

2) Mean Squared Error(MSE)


MSE is a widely used and very simple metric, with only a small change from mean absolute error: mean squared error takes the squared difference between the actual and predicted values.

So, where above we found the absolute difference, here we find the squared difference: MSE = (1/n) Σ (actual - predicted)^2.

What does the MSE actually represent? It represents the mean squared distance between actual and predicted values. We square the differences to avoid the cancellation of negative terms, and that is the benefit of MSE.

Advantages of MSE

The graph of MSE is differentiable, so you can easily use it as a loss function.
Disadvantages of MSE
• The value you get after calculating MSE is in squared units of the output. For example, if the output variable is in metres (m), then after calculating MSE the output we get is in metres squared.

• If you have outliers in the dataset then it penalizes the outliers most and the calculated MSE becomes large. So, in short, it is not robust to outliers, which was an advantage of MAE.
from sklearn.metrics import mean_squared_error
print("MSE",mean_squared_error(y_test,y_pred))

3) Root Mean Squared Error(RMSE)

As the name itself makes clear, RMSE is simply the square root of the mean squared error.

Advantages of RMSE
• The output value you get is in the same unit as the required output variable which
makes interpretation of loss easy.
Disadvantages of RMSE
• It is not that robust to outliers as compared to MAE.
To compute RMSE, we apply the NumPy square root function to the MSE:
print("RMSE", np.sqrt(mean_squared_error(y_test, y_pred)))
Most of the time people use RMSE as an evaluation metric, and when working with deep learning techniques it is usually the preferred metric.

4) Root Mean Squared Log Error(RMSLE)


Taking the log of the RMSE metric slows down the scale of the error. The metric is very helpful when you are developing a model without scaling the inputs, because in that case the outputs can vary over a large scale.
To control this situation, we take the log of the calculated RMSE error, and the result is the RMSLE.
To perform RMSLE this way, we apply the NumPy log function to the RMSE:
print("RMSLE", np.log(np.sqrt(mean_squared_error(y_test, y_pred))))

It is a simple metric that is used in many of the datasets hosted for machine learning competitions.
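As an aside, scikit-learn also provides mean_squared_log_error, which implements the more common definition of RMSLE (the square root of the mean squared difference between log(1 + actual) and log(1 + predicted)); a sketch, assuming y_test and y_pred are non-negative:

import numpy as np
from sklearn.metrics import mean_squared_log_error

print("RMSLE", np.sqrt(mean_squared_log_error(y_test, y_pred)))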

5) R Squared (R2)
The R2 score is a metric that tells you the performance of your model rather than the loss; in an absolute sense it says how well your model performed.
In contrast, MAE and MSE depend on the context, as we have seen, whereas the R2 score is independent of context.
So, with the help of R squared we have a baseline model to compare against, which none of the other metrics provides (similar to how classification problems have a fixed threshold of 0.5). Basically, R squared calculates how much better the regression line is than a simple mean line.
Hence, R squared is also known as the Coefficient of Determination, or sometimes the Goodness of Fit.

R2 = 1 - (sum of squared errors of the regression line / sum of squared errors of the mean line) = 1 - Σ(yi - ŷi)^2 / Σ(yi - ȳ)^2
Now, how will you interpret the R2 score? Suppose the R2 score is zero: then the sum of squared errors of the regression line equals that of the mean line, the ratio is 1, and 1 - 1 is zero. In this case the two lines overlap and the model's performance is at its worst; it is not able to explain any of the variation in the output column.

The second case is when the R2 score is 1. It happens when the ratio term is zero, which means the regression line makes no mistakes at all; it is perfect. In the real world this is not possible.

So we can conclude that as our regression line moves towards perfection, the R2 score moves towards one, and the model performance improves.

The normal case is when the R2 score is between zero and one, for example 0.8, which means your model is able to explain 80 per cent of the variance in the data.


from sklearn.metrics import r2_score
r2 = r2_score(y_test,y_pred)
print(r2)
6) Adjusted R Squared
The disadvantage of the R2 score is that, when new features are added to the data, the R2 score starts increasing or remains constant but never decreases, because it assumes that adding more features always explains more of the variance of the data.
But the problem is that when we add an irrelevant feature to the dataset, R2 sometimes still increases, which is misleading.
Hence, to control this situation, Adjusted R Squared came into existence:

Adjusted R2 = 1 - [(1 - R2)(n - 1) / (n - k - 1)], where n is the number of observations and k is the number of independent features.

Now, as k increases when we add features, the denominator (n - k - 1) decreases while n - 1 remains constant. If the R2 score remains constant or increases only slightly, the whole fraction increases, and when we subtract it from one the resulting score decreases; this is what happens when we add an irrelevant feature to the dataset.
If instead we add a relevant feature, the R2 score increases and 1 - R2 decreases heavily; even though the denominator also decreases, the complete term decreases, and on subtracting it from one the score increases.
n = 40   # number of observations
k = 2    # number of independent features
adj_r2_score = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
print(adj_r2_score)
Hence, this metric becomes one of the most important metrics to use during the evaluation
of the model.

