
AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS – Unit V – Mailam Engineering College
Prepared by: Ms. G. Ramya, AP / CSBS

UNIT V – PREDICTIVE ANALYTICS


Linear least squares – implementation – goodness of fit – testing a linear model – weighted resampling.
Regression using StatsModels – multiple regression – nonlinear relationships – logistic regression –
estimating parameters – Time series analysis – moving averages – missing values – serial correlation –
autocorrelation. Introduction to survival analysis.
PART A

1. Define predictive analytics.


 Predictive analytics is the process of using data to forecast future
outcomes.
 The process uses data analysis, machine learning, artificial
intelligence, and statistical models to find patterns that might
predict future behavior.
 Data scientists use historical data as their source and utilize various
regression models and machine learning techniques to detect
patterns and trends in the data.

2. List the Steps in Predictive Analytics


1. Define the problem
2. Acquire and organize data
3. Pre-process data
4. Develop predictive models
5. Validate and deploy results

3. What are the Predictive Analytics Techniques available? List the techniques used for Predictive Analytics.
1. Regression analysis
2. Decision trees
3. Neural networks

4. List the uses and examples of predictive analytics


 Fraud detection
 Conversion and purchase prediction
 Risk reduction
 Operational improvement
 Customer segmentation
 Maintenance forecasting

5. Define Least squares fit


 A “linear fit” is a line intended to model the relationship between variables.
 A “least squares” fit is one that minimizes the mean squared error (MSE) between the line and the data.

6. Define Residuals
 The deviation of an actual value from a model.
 The difference between the actual values and the fitted line.
 thinkstats2 provides a function that computes residuals:
def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res
It returns the differences between the actual values and the fitted line.

7. What is Goodness of fit in predictive analytics?
Goodness of fit
 A goodness-of-fit is a statistical test that tries to determine whether a
set of observed values match those expected under the applicable
model.
 They can show whether your sample data fit an expected set of data
from a population with normal distribution.

8. Mention the types of goodness-of-fit tests


 The chi-square test determines if a relationship exists between
categorical data.
Variables must be mutually exclusive in order to qualify for the chi-square test for independence, and the chi-square goodness-of-fit test should not be used for continuous data.
 The Kolmogorov-Smirnov test determines whether a sample comes
from a specific distribution of a population.

9. What are the different ways to measure the quality of a linear model,
or goodness of fit?
 Standard deviation of the residuals
 Coefficient of determination, usually denoted R2 and called “R-squared”:


def CoefDetermination(ys, res):
    return 1 - Var(res) / Var(ys)
 Var(res) is the MSE of guesses using the model; Var(ys) is the MSE without it.

10. Differentiate Goodness-of-Fit Test vs. Independence Test


 Goodness-of-fit test and independence test are both statistical
tests used to assess the relationship between variables.
 A goodness-of-fit test is used to evaluate how well a set of
observed data fits a particular probability distribution.
 An independence test is used to assess the relationship between two
variables. It is used to test whether there is any association
between two variables.
 The primary purpose of an independence test is to see whether a
change in one variable is related to a change in another variable.
 An independence test is pointed towards two specific variables. A
goodness-of-fit test is used on an entire set of observed data to
evaluate the appropriateness of a specific model.

11. Define Regression and list its types.
Regression
 The linear least squares fit is an example of regression, which is
fitting any kind of model to any kind of data.
 The goal of regression analysis is to describe the relationship
between one set of variables, called the dependent variables, and
another set of variables, called independent or explanatory
variables.
 When there is only one dependent and one explanatory variable, that’s
simple regression.
 If there is more than one dependent variable with more than one
explanatory variable, that’s multivariate regression.
 If the relationship between the dependent and explanatory variable is
linear, that’s linear regression.

12. Define StatsModels and mention its purpose.


 statsmodels provides two interfaces (APIs); the “formula” API uses
strings to identify the dependent and explanatory variables.
It uses a syntax called patsy, in which the ~ operator separates the dependent variable on the left from the explanatory variables on the right.


 smf.ols takes the formula string and the DataFrame, live, and returns an OLS object that represents the model.
The name ols stands for “ordinary least squares.”
 Given a sequence of values for y and sequences for x1 and x2, find the parameters β0, β1, and β2 that minimize the sum of the squared residuals εi². This process is called ordinary least squares.

13. How to implement Regression in Python using StatsModels?


1. Step 1: Import packages.
2. Step 2: Loading data.
3. Step 3: Setting a hypothesis.
4. Step 4: Fitting the model
5. Step 5: Summary of the model.

14. Define R-squared value, F-statistic and Predictions.
R-squared value
 R-squared value ranges between 0 and 1.
 An R-squared of 100 percent indicates that all changes in the dependent
variable are completely explained by changes in the independent
variable(s).
F- statistic:
 The F statistic simply compares the combined effect of all variables.
Predictions:
 If the significance level (alpha) is taken to be 0.05, we reject the null hypothesis and accept the alternative hypothesis since p < 0.05; so we can say that there is a relationship between head size and brain weight.

15. Define multiple linear regression or Multiple Regression using Statsmodels in Python.
Multiple linear regression (MLR)
 Multiple linear regression (MLR), also known simply as multiple
regression, is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable.
 The goal of multiple linear regression is to model the linear relationship
between the explanatory (independent) variables and response
(dependent) variables.
 MLR is used extensively in econometrics and financial inference.

Formula and Calculation of Multiple Linear Regression
y = β0 + β1x1 + β2x2 + … + βpxp + ε
where y is the dependent (response) variable, x1 … xp are the explanatory variables, β0 is the intercept, β1 … βp are the slope coefficients, and ε is the model's error term (residual).



16. Define logistic regression.


 If the dependent variable is boolean, the generalized model is called
logistic regression.
 Logistic regression is a supervised machine learning algorithm that
accomplishes binary classification tasks by predicting the probability
of an outcome, event, or observation.
 The model delivers a binary or dichotomous outcome limited to two
possible outcomes: yes/no, 0/1, or true/false.
 Logistic regression is commonly used in binary classification
problems where the outcome variable reveals either of the two
categories (0 and 1).

17. Define Sigmoid Function


 Logistic regression uses a logistic function called a sigmoid
function to map predictions and their probabilities. Refer figure
5.3 for Sigmoid function.
 The sigmoid function refers to an S-shaped curve that converts any
real value to a range between 0 and 1.
 If the output of the sigmoid function (estimated probability) is
greater than a predefined threshold on the graph, the model predicts
that the instance belongs to that class.
 If the estimated probability is less than the predefined threshold,
the model predicts that the instance does not belong to the
class.
The sigmoid function is referred to as an activation function for logistic regression and is defined as:
f(value) = 1 / (1 + e^(−value))
where,
e = base of natural logarithms
value = numerical value one wishes to transform

18. List the types of Logistic Regression with Examples


Logistic regression is classified into binary, multinomial, and ordinal.
Binary logistic regression
 Binary logistic regression predicts the relationship between
the independent and binary dependent variables.
 Some examples of the output of this regression type may
be, success/failure, 0/1, or true/false.
Examples:
1. Deciding on whether or not to offer a loan to a bank customer: Outcome = yes or no.
2. Evaluating the risk of cancer: Outcome = high or low.
3. Predicting a team’s win in a football match: Outcome = yes or no.
Multinomial logistic regression
 A categorical dependent variable has two or more discrete
outcomes in a multinomial regression type.
 This implies that this regression type has more than two possible
outcomes.
Ordinal logistic regression
 Ordinal logistic regression applies when the dependent variable is
in an ordered state (i.e., ordinal). The dependent variable (y)
specifies an order with two or more categories or levels.

19. Define time series and time series analysis.


 Time Series
 A time series is a sequence of measurements from a system that
varies in time.
 An ordered sequence of values of a variable at equally spaced
time intervals.
 Time Series Analysis
 Time series analysis is a specific way of analyzing a sequence of
data points collected over an interval of time.
 In time series analysis, analysts record data points at consistent
intervals over a set period of time rather than just recording the data
points intermittently or randomly.
 Time series analysis has become a crucial tool for companies looking to make better decisions based on data.

20. Mention the components of Time Series Data


 Trends: Long-term increases, decreases, or stationary movement
 Seasonality: Predictable patterns at fixed intervals
 Cycles: Fluctuations without a consistent period
 Noise: Residual unexplained variability

21. List the different types of data used for predictive analysis.
 Types of Data
 Time Series Data: Comprises observations collected at different
time intervals. It's geared towards analyzing trends, cycles, and other
temporal patterns.
 Cross-Sectional Data: Involves data points collected at a single moment in
time. Useful for understanding relationships or comparisons between
different entities or categories at that specific point.
 Pooled Data: A combination of Time Series and Cross-Sectional data. This hybrid enriches the dataset, allowing for more nuanced and comprehensive analyses.

22. Mention the different types of time series analysis.


 Time Series Analysis Types
 Classification
 Curve fitting
 Descriptive analysis
 Explanative analysis
 Exploratory analysis
 Forecasting
 Intervention analysis
 Segmentation.

23. List the Time Series Analysis Techniques


 Moving Average
 Exponential Smoothing
 Autoregression
 Decomposition
 Time Series Clustering
 Wavelet Analysis
 Intervention Analysis
 Box-Jenkins ARIMA models

 Box-Jenkins Multivariate models


 Holt-Winters Exponential Smoothing

24. List the Advantages of Time Series Analysis


1. Data Cleansing
2. Understanding Data
3. Forecasting
4. Identifying Trends and Seasonality
5. Visualizations
6. Efficiency
7. Risk Assessment

25. List the Challenges of Time Series Analysis


1. Limited Scope
2. Noise Introduction
3. Interpretation Challenges
4. Generalization Issues
5. Model Complexity
6. Non-Independence of Data
7. Data Availability

26. Define Serial Correlation and Auto Correlation in Time Series Analysis.
Serial Correlation
 Serial correlation is the relationship between a given variable
and a lagged version of itself over various time intervals.
 It measures the relationship between a variable's current value
given its past values.
 A variable that is serially correlated indicates that it may not
be random.
 Serial correlation occurs in a time series when a variable and a
lagged version of itself (for instance a variable at times T and
at T- 1) are observed to be correlated with one another over
periods of time.
 lag: The size of the shift when a time series is shifted relative to itself in a serial correlation or autocorrelation.

Autocorrelation
 Autocorrelation refers to the degree of correlation of the same variable between two successive time intervals.


 Autocorrelation represents the degree of similarity between a
given time series and a lagged version of itself over successive
time intervals.
 Autocorrelation measures the relationship between a variable's
current value and its past values.

27. Define Survival Analysis and Survival Curve.
Survival Analysis


 Survival analysis is a field of statistics that focuses on
analysing the expected time until a certain event happens.
 For example, when a treatment is being studied, survival analysis can be used for analysing the results of that treatment in terms of the patients’ life expectancy.
 The term `survival time' specifies the length of time taken for failure
to occur.

Survival curves
o The fundamental concept in survival analysis is the survival
curve, S(t), which is a function that maps from a duration, t, to
the probability of surviving longer than t, it’s just the
complement of the CDF:
S(t) = 1 − CDF(t)
where CDF(t) is the probability of a lifetime less than or equal to t.
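A minimal NumPy sketch of estimating S(t) from a sample of observed lifetimes (the durations below are made-up values for illustration):

import numpy as np

def SurvivalCurve(durations, t):
    # S(t) = 1 - CDF(t): the fraction of lifetimes longer than t
    durations = np.asarray(durations)
    return np.mean(durations > t)

durations = [2, 5, 7, 7, 9, 12, 15]   # hypothetical survival times
print(SurvivalCurve(durations, 7))    # fraction surviving longer than 7 -> 3/7 ≈ 0.43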

28. Define missing value and narrate the reasons for missing values.
Missing Value
 Missing data is defined as the values or data that is not
stored for some variable/s in the given dataset.
Reason for Missing Values
 Past data might get corrupted due to improper maintenance.
 Observations are not recorded for certain fields due to some reasons.
 There might be a failure in recording the values due to human error.
 The user has not provided the values intentionally.
 Item nonresponse: This means the participant refused to
respond.
29. Why the missing data should be handled?

 The missing data will decrease the predictive power of the model. If the algorithms are applied with missing data, then there will be bias in the estimation of parameters.
 The results are not reliable if the missing data is not handled properly.

30. List the types of Missing Values

Type – Definition
Missing completely at random (MCAR): Missing data are randomly distributed across the variable and unrelated to other variables.
Missing at random (MAR): Missing data are not randomly distributed, but they are accounted for by other observed variables.
Missing not at random (MNAR): Missing data systematically differ from the observed values.

31. List the methods for identifying missing data

Functions and Descriptions
.isnull(): This function returns a pandas DataFrame where each value is a boolean: True if the value is missing, False otherwise.
.notnull(): Similarly to the previous function, the values for this one are False if either a NaN or a None value is detected.
.info(): This function generates three main columns, including the “Non-Null Count”, which shows the number of non-missing values for each column.

32. Outline a few approaches to detect outliers and explain different ways to deal with them. (Nov/Dec 2023)
Outliers are values at the extreme ends of a dataset.
Outliers are extreme values that differ from most other data
points in a dataset.
They can have a big impact on statistical analyses and skew the
results of any hypothesis tests.

It’s important to carefully identify potential outliers in the dataset and deal with them in an appropriate manner for accurate results.
There are four ways to identify or detect outliers:
• Sorting method - can sort quantitative variables
from low to high and scan for extremely low or
extremely high values.
• Data visualization method - can use software to
visualize data with a box plot, or a box-and-whisker
plot
• Statistical tests - applying statistical tests or procedures to identify extreme values.
• Interquartile range method - the range of the middle half of the dataset.

Different ways to deal with outliers


• Retain outliers
Keeping outliers is usually the better option when not sure if
they are errors.
• Remove outliers
Deleting extreme values from dataset before performing
statistical analyses.

33. Give an approach to handle missing values in a dataset. (Nov/Dec 2023)


Deleting Rows with missing values
Impute missing values for continuous variable
Impute missing values for categorical variable
Other Imputation Methods
Using Algorithms that support missing values
Prediction of missing values
Imputation using Deep Learning Library — Datawig

34. What is survival analysis? (Apr/May 2024)

 Survival analysis, also known as time-to-event analysis, is a branch of statistics that


studies the amount of time it takes before a particular event of interest occurs.

 The survival function is S(t) = 1 − F(t), i.e., the probability that a person, machine, or business lasts longer than t time units.


 Example - Insurance companies use survival analysis to predict the death of the
insured and estimate other important factors such as policy cancellations, non-
renewals, and how long it takes to file a claim.

35. Why do you need weighted resampling? (Apr/May 2024)

 Weighted sampling is a technique in which the probabilities of selecting a particular


person to participate in our survey are no longer equal.

 Some subgroups in our population are assigned a higher probability of being selected,
based on our determined survey needs.

 This allows researchers to correct issues that occur during data collection.

 Weighted resampling chooses samples from {X^(n), w^(n)} with probability proportional to the importance function at the sample location, ρ^(n) ∝ g(X^(n)), and re-weights them in order to ensure p(X) is fairly represented, w^(n) ∝ w^(n)/ρ^(n).
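As a simple illustration (not the exact importance-weighting scheme above), weighted resampling can be sketched with NumPy by drawing rows with probability proportional to their weights; the values and weights here are hypothetical:

import numpy as np

def ResampleWeighted(values, weights, n):
    # draw n samples, with probability proportional to each sample's weight
    weights = np.asarray(weights, dtype=float)
    p = weights / weights.sum()
    idx = np.random.choice(len(values), size=n, replace=True, p=p)
    return np.asarray(values)[idx]

values = np.array([10, 20, 30, 40])
weights = np.array([1, 1, 3, 1])      # the third subgroup is oversampled
print(ResampleWeighted(values, weights, 5))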


PART B

1. Give a brief introduction about predictive analytics.


 Predictive analytics
 Predictive analytics is the process of using data to forecast future
outcomes.
 The process uses data analysis, machine learning, artificial
intelligence, and statistical models to find patterns that might
predict future behavior.
 Data scientists use historical data as their source and utilize various
regression models and machine learning techniques to detect
patterns and trends in the data.

 Steps in Predictive Analytics


 The workflow for building predictive analytics frameworks follows five
basic steps:
1. Define the problem:
 A prediction starts with a good thesis and set of requirements.
 A distinct problem to solve will help determine what method of
predictive analytics should be used.
2. Acquire and organize data:
 An organization may have decades of data to draw upon, or a
continual flood of data from customer interactions.
 Before predictive analytics models can be developed, data flows
must be identified, and then datasets can be organized in a
repository such as a data warehouse like BigQuery.
3. Pre-process data:
 To prepare the data for the predictive analytics models, it should
be cleaned to remove anomalies, missing data points, or extreme
outliers, any of which might be the result of input or measurement
errors.
4. Develop predictive models:
 Data scientists have a variety of tools and techniques to develop
predictive models depending on the problem to be solved and
nature of the dataset.
 Machine learning, regression models, and decision trees are some
of the most common types of predictive models.
5. Validate and deploy results:


 Check on the accuracy of the model and adjust accordingly.
 Once acceptable results have been achieved, make them available to stakeholders via an app, website, or data dashboard.

 Predictive Analytics Techniques


Predictive analytics tends to be performed with three main types of
techniques:
1. Regression analysis
 Regression is a statistical analysis technique that estimates
relationships between variables.
 Regression is useful to determine patterns in large datasets to
determine the correlation between inputs.
 Regression is often used to determine how one or more independent
variables affects another, such as how a price increase will affect
the sale of a product.
2. Decision trees
 Decision trees are classification models that place data into
different categories based on distinct variables.
 The model looks like a tree, with each branch representing a
potential choice, with the leaf of the branch representing the
result of the decision.
3. Neural networks
 Neural networks are machine learning methods that are useful in
predictive analytics when modeling very complex relationships.
 Neural networks are best used to determine nonlinear
relationships in datasets, especially when no known mathematical
formula exists to analyze the data.
 Neural networks can be used to validate the results of decision
trees and regression models.

 Uses and examples of predictive analytics


 Predictive analytics can be used to streamline operations, boost
revenue, and mitigate risk for almost any business or industry,
including banking, retail, utilities, public sector, healthcare, and
manufacturing.
Fraud detection
 Predictive analytics examines all actions on a company’s network in
real time to pinpoint abnormalities that indicate fraud and other
vulnerabilities.
Conversion and purchase prediction

 Companies can take actions, like retargeting online ads to visitors, with data that predicts a greater likelihood of conversion and purchase intent.

Risk reduction
 Credit scores, insurance claims, and debt collections all use
predictive analytics to assess and determine the likelihood of future
defaults.
Operational improvement
 Companies use predictive analytics models to forecast
inventory, manage resources, and operate more efficiently.
Customer segmentation
 By dividing a customer base into specific groups, marketers
can use predictive analytics to make forward-looking decisions to
tailor content to unique audiences.
Maintenance forecasting
 Organizations use data to predict when routine equipment maintenance will be required and can then schedule it before a problem or malfunction arises.

2. Explain linear least squares and its implementation in detail.

Least squares fit


 A “linear fit” is a line intended to model the relationship
between variables.
 A “least squares” fit is one that minimizes the mean squared
error (MSE) between the line and the data.
 The more general problem is that of fitting a straight line
to a collection of pairs of observations (x, y)

 The most commonly used method for finding a model is that of least squares estimation.
 It is supposed that x is an independent (or predictor) variable which
is known exactly, while y is a dependent (or response) variable.

 The least squares (LS) estimates for the parameters are those for which the predicted values of the curve minimize the sum of the squared deviations from the observations.
 That is, the problem is to find the values of the intercept and slope that minimize the residual sum of squares.


Implementation

thinkstats2 provides simple functions that demonstrate linear least squares:

def LeastSquares(xs, ys):
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)
    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx
    return inter, slope

LeastSquares takes sequences xs and ys and returns the estimated parameters inter and slope.

thinkstats2 also provides FitLine, which takes inter and slope and returns the fitted line for a sequence of xs.

def FitLine(xs, inter, slope):
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys

Residuals
 The deviation of an actual value from a model.
 The difference between the actual values and the fitted line.
 thinkstats2 provides a function that computes residuals:

def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res


Residuals takes sequences xs and ys and estimated parameters inter and slope. It returns the differences between the actual values and the fitted line.
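A minimal self-contained sketch of the same least squares fit using plain NumPy (the xs and ys arrays are made-up data; the function mirrors, rather than reproduces, the thinkstats2 version):

import numpy as np

def LeastSquaresNumpy(xs, ys):
    # slope = Cov(x, y) / Var(x); the intercept makes the line pass through the means
    slope = np.cov(xs, ys, bias=True)[0, 1] / np.var(xs)
    inter = np.mean(ys) - slope * np.mean(xs)
    return inter, slope

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

inter, slope = LeastSquaresNumpy(xs, ys)
res = ys - (inter + slope * xs)           # residuals
print(inter, slope, np.mean(res ** 2))    # intercept, slope, MSE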

Figure 5.1 – Linear Least Square

 A plot in the figure 5.1 depicts the data points (in red), the least
squares line of best fit (in blue), and the residuals (in green)
 The parameters slope and inter are estimates based on a sample; like
other estimates, they are vulnerable to sampling bias,
measurement error, and sampling error.
Sampling bias is caused by non-representative sampling, measurement error
is caused by errors in collecting and recording data, and sampling error is
the result of measuring a sample rather than the entire population.

3. Explain in detail about Goodness of fit. (Nov/Dec 2023)
Goodness of fit
 A goodness-of-fit is a statistical test that tries to determine whether a
set of observed values match those expected under the applicable
model.
 They can show whether sample data fit an expected set of data from
a population with normal distribution.

Types of goodness-of-fit tests


 The chi-square test determines if a relationship exists between
categorical data.


Variables must be mutually exclusive in order to qualify for the chi-square test for independence, and the chi-square goodness-of-fit test should not be used for continuous data.
Goodness of fit is a measure of how well a statistical model fits a set of
observations.

 When goodness of fit is high, the values expected based on the model are close to the observed values.
 When goodness of fit is low, the values expected based on the model are far from the observed values.
 The Kolmogorov-Smirnov test determines whether a sample
comes from a specific distribution of a population.

To conduct the test, we need a certain variable, along with an assumption of how it is distributed, as well as:
 The observed values, which are derived from the actual data set
 The expected values, which are taken from the assumptions made
 The total number of categories in the set

Ways to measure the quality of a linear model, or goodness of fit.


 Standard deviation of the residuals - Std(res) is the root
mean squared error (RMSE) of predictions.
 Coefficient of determination, usually denoted R2 and called
“R- squared”:
def CoefDetermination(ys, res):
    return 1 - Var(res) / Var(ys)
Var(res) is the MSE of guesses using the model; Var(ys) is the MSE without it.
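A short NumPy sketch of both measures, assuming ys are the observed values and res are the residuals from a fitted line (the numbers are made up):

import numpy as np

ys = np.array([2.1, 4.2, 5.9, 8.1, 9.8])      # observed values (hypothetical)
res = np.array([0.1, 0.2, -0.1, 0.1, -0.2])   # residuals from some fitted line

rmse = np.sqrt(np.mean(res ** 2))    # standard deviation of the residuals (RMSE)
r2 = 1 - np.var(res) / np.var(ys)    # coefficient of determination, R-squared
print(rmse, r2)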

Importance of Goodness-of-Fit Tests


 Provide a way to assess how well a statistical model fits a
set of observed data.
 To determine whether the observed data are consistent with
the assumed statistical model
 Useful in choosing between different models which may better
fit the data.
 Help to identify outliers or market abnormalities that may be
affecting the fit of the model
 Provide information about the variability of the data and
the estimated parameters of the model.

 Can be useful for making predictions and understanding the behavior of the system being modeled.

Goodness-of-Fit Test vs. Independence Test


 Goodness-of-fit test and independence test are both statistical
tests used to assess the relationship between variables.
A goodness-of-fit test is used to evaluate how well a set of observed data
fits a particular probability distribution.
 An independence test is used to assess the relationship between two
variables. It is used to test whether there is any association
between two variables.
 The primary purpose of an independence test is to see whether a
change in one variable is related to a change in another variable.
An independence test is pointed towards two specific variables. A goodness-of-
fit test is used on an entire set of observed data to evaluate the
appropriateness of a specific model.
4. Discuss in detail about Regression using StatsModels.
Regression
 The linear least squares fit is an example of regression, which is
fitting any kind of model to any kind of data.
 The goal of regression analysis is to describe the relationship
between one set of variables, called the dependent variables, and
another set of variables, called independent or explanatory variables.
 When there is only one dependent and one explanatory variable, that’s
simple regression.
 If there is more than one dependent variable with more than one
explanatory variable, that’s multivariate regression.
 If the relationship between the dependent and explanatory variable is
linear, that’s linear regression.
 For example, if the dependent variable is y and the explanatory variables are x1 and x2, the linear regression model is:
y = β0 + β1 x1 + β2 x2 + ε
where β0 is the intercept, β1 is the parameter associated with x1, β2 is the parameter associated with x2, and ε is the residual.

StatsModels
 statsmodels provides two interfaces (APIs); the “formula” API uses strings to identify the dependent and explanatory variables.
It uses a syntax called patsy, in which the ~ operator separates the dependent variable on the left from the explanatory variables on the right.
smf.ols takes the formula string and the DataFrame, live, and returns an OLS object that represents the model.
The name ols stands for “ordinary least squares.”
Given a sequence of values for y and sequences for x1 and x2, find the parameters β0, β1, and β2 that minimize the sum of the squared residuals εi². This process is called ordinary least squares.
 The fit method fits the model to the data and returns a
RegressionResults object that contains the results.

Stepwise Implementation in Python


Step 1: Import packages.
Step 2: Loading data.
Step 3: Setting a hypothesis.
Step 4: Fitting the model.
The statsmodels.regression.linear_model.OLS() method is used to get ordinary least squares, and the fit() method is used to fit the data to it. The ols method takes in the data and performs linear regression.
dependent_column ~ independent_columns:
The left side of the ~ operator contains the name of the dependent variable (the predicted column) and the right side contains the independent variables.
Step 5: Summary of the model.
All the summary statistics of the linear regression model are
returned by the model.summary() method.

Example Program

# import packages
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# loading the csv file
df = pd.read_csv('headbrain1.csv')
print(df.head())

# fitting the model
df.columns = ['Head_size', 'Brain_weight']
model = smf.ols(formula='Head_size ~ Brain_weight', data=df).fit()

# model summary
print(model.summary())

Output

R- squared value:
 R-squared value ranges between 0 and 1.
 An R-squared of 100 percent indicates that all changes in the dependent
variable are completely explained by changes in the independent
variable(s).

F- statistic:
 The F statistic simply compares the combined effect of all variables.

Predictions:
If the significance level (alpha) is taken to be 0.05, we reject the null hypothesis and accept the alternative hypothesis since p < 0.05; so we can say that there is a relationship between head size and brain weight.
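Once the model has been fitted, it can also be used to make predictions for new data; a minimal sketch (the brain-weight values below are hypothetical and assume the fitted model object from the example above):

import pandas as pd

new_data = pd.DataFrame({'Brain_weight': [1200, 1350, 1500]})  # hypothetical new observations
predicted = model.predict(new_data)    # predicted Head_size for the new rows
print(predicted)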


5. Discuss in detail about multiple linear regression or Multiple Regression using Statsmodels in Python.
Multiple linear regression (MLR)
 Multiple linear regression (MLR), also known simply as multiple
regression, is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable.
 The goal of multiple linear regression is to model the linear relationship
between the explanatory (independent) variables and response
(dependent) variables.
MLR is used extensively in econometrics and financial inference.
Formula and Calculation of Multiple Linear Regression
yi = β0 + β1xi1 + β2xi2 + … + βpxip + ε
where, for i = n observations: yi is the dependent variable, xi1 … xip are the explanatory variables, β0 is the y-intercept (constant term), β1 … βp are the slope coefficients for each explanatory variable, and ε is the model’s error term (residual).

Figure 5.2 – Simple Linear Regression vs Multiple Linear Regression



Example
import statsmodels.api as sm

X = advertising[['TV', 'Newspaper', 'Radio']]
y = advertising['Sales']

# Add a constant to get an intercept
X_sm = sm.add_constant(X)

# Fit the regression line using 'OLS'
lr = sm.OLS(y, X_sm).fit()
print(lr.summary())
Output

Understanding the results:
 The R-squared value is 0.91, which is good: about 91% of the variance in the Y variable is explained by the X variables.
 The adjusted R-squared value is also good, although it penalizes additional predictors more than R-squared does.
 Looking at the p-values, we can see that ‘Newspaper’ is not a significant X variable since its p-value is greater than 0.05.
 The coefficient estimates and their 5%–95% confidence intervals look reasonable, except for the Newspaper variable.


6. Discuss in detail about logistic regression with a suitable case study. (Nov/Dec 2023)
LOGISTIC REGRESSION
 Linear regression can be generalized to handle other kinds of dependent
variables.
 If the dependent variable is boolean, the generalized model is called
logistic regression.
 If the dependent variable is an integer count, it’s called Poisson
regression.
 Logistic regression is a supervised machine learning algorithm that
accomplishes binary classification tasks by predicting the probability
of an outcome, event, or observation.
 The model delivers a binary or dichotomous outcome limited to two
possible outcomes: yes/no, 0/1, or true/false.
Logistic regression is commonly used in binary classification problems
where the outcome variable reveals either of the two categories (0 and 1).
Example:
1. Determine the probability of heart attacks:
With the help of a logistic model, medical practitioners can
determine the relationship between variables such as the weight,
exercise, etc., of an individual and use it to predict whether the
person will suffer from a heart attack or any other medical
complication.
2. Identifying spam emails:
Email inboxes are filtered to determine if the email communication
is promotional/spam by understanding the predictor variables and
applying a logistic regression algorithm to check its authenticity.
Sigmoid Function
 Logistic regression uses a logistic function called a sigmoid
function to map predictions and their probabilities. Refer figure
5.3 for Sigmoid function.
 The sigmoid function refers to an S-shaped curve that converts any
real value to a range between 0 and 1.
 If the output of the sigmoid function (estimated probability) is
greater than a predefined threshold on the graph, the model predicts
that the instance belongs to that class.
 If the estimated probability is less than the predefined threshold, the
model predicts that the instance does not belong to the class.
 The sigmoid function is referred to as an activation function for logistic regression and is defined as:
f(value) = 1 / (1 + e^(−value))
where,
e = base of natural logarithms
value = numerical value one wishes to transform
The following equation represents logistic regression:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
 x = input value
 y = predicted output
 b0 = bias or intercept term
 b1 = coefficient for input (x)

Figure 5.3 – Sigmoid Function
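A tiny NumPy sketch of the sigmoid and the corresponding logistic prediction (b0 and b1 are arbitrary illustrative values, not estimated from data):

import numpy as np

def sigmoid(z):
    # S-shaped curve that maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -1.0, 0.5               # hypothetical intercept and coefficient
x = np.array([0.0, 2.0, 6.0])
p = sigmoid(b0 + b1 * x)         # estimated probabilities
print(p)                         # approximately [0.269 0.5 0.881]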

Key Assumptions for implementing Logistic Regression


1. The dependent/response variable is binary or dichotomous
 The first assumption of logistic regression is that response
variables can only take on two possible outcomes – pass/fail,
male/female, and malignant/benign.


2. Little or no multicollinearity between the predictor/explanatory variables
 This assumption implies that the predictor variables (or the
independent variables) should be independent of each other.
Multicollinearity relates to two or more highly correlated
independent variables.
3. Linear relationship of independent variables to log odds
 Log odds refer to the ways of expressing probabilities. Log
odds are different from probabilities. Odds refer to the ratio of
success to failure, while probability refers to the ratio of
success to everything that can occur.
 For example, consider that you play twelve tennis games with
your friend. Here, the odds of you winning are 5 to 7 (or 5/7),
while the probability of you winning is 5 to 12 (as the total
games played = 12).
4. Prefers large sample size
 Logistic regression analysis yields reliable, robust, and valid
results when a larger sample size of the dataset is considered.
5. Problem with extreme outliers
Another critical assumption of logistic regression is the requirement of no
extreme outliers in the dataset.
6. Consider independent observations
 This assumption states that the dataset observations should be
independent of each other. The observations should not be
related to each other or emerge from repeated measurements of
the same individual type.

Types of Logistic Regression with Examples


Logistic regression is classified into binary, multinomial, and ordinal.
Binary logistic regression
 Binary logistic regression predicts the relationship between
the independent and binary dependent variables.
 Some examples of the output of this regression type may
be, success/failure, 0/1, or true/false.
Examples:
1. Deciding on whether or not to offer a loan to a bank customer: Outcome = yes or no.
2. Evaluating the risk of cancer: Outcome = high or low.
3. Predicting a team’s win in a football match: Outcome = yes or no.
Multinomial logistic regression

 A categorical dependent variable has two or more discrete outcomes in a multinomial regression type.
 This implies that this regression type has more than two possible outcomes.
Examples:
1. Let’s say you want to predict the most popular transportation
type for 2040. Here, transport type equates to the dependent
variable, and the possible outcomes can be electric cars,
electric trains, electric buses, and electric bikes.
2. Predicting whether a student will join a college,
vocational/trade school, or corporate industry.
3. Estimating the type of food consumed by pets, the outcome
may be wet food, dry food, or junk food.
Ordinal logistic regression
 Ordinal logistic regression applies when the dependent variable is
in an ordered state (i.e., ordinal). The dependent variable (y)
specifies an order with two or more categories or levels.
Examples: Dependent variables represent,
1. Formal shirt size: Outcomes = XS/S/M/L/XL
2. Survey answers: Outcomes = Agree/Disagree/Unsure
3. Scores on a math test: Outcomes = Poor/Average/Good
Logistic regression works in the following steps:
1. Prepare the data: The data should be in a format where each row
represents a single observation and each column represents a
different variable. The target variable (the variable you want to predict)
should be binary (yes/no, true/false, 0/1).
2. Train the model: We teach the model by showing it the training
data. This involves finding the values of the model parameters that
minimize the error in the training data.
3. Evaluate the model: The model is evaluated on the held-out test
data to assess its performance on unseen data.
4. Use the model to make predictions: After the model has been
trained and assessed, it can be used to forecast outcomes on new
data.

ESTIMATING PARAMETERS
Given a probability, compute the odds like this:
o = p / (1 − p)
Given odds in favor, convert to probability like this:
p = o / (o + 1)
Logistic regression is based on the following model:
log o = β0 + β1 x1 + β2 x2 + ε
where o is the odds in favor of a particular outcome.

Suppose we have estimated the parameters β0, β1, and β2. Given values for x1 and x2, we can compute the predicted value of log o, and then convert it to a probability:
o = np.exp(log_o)
p = o / (o + 1)

The usual goal is to find the maximum-likelihood estimate (MLE), which is the set of parameters that maximizes the likelihood of the data.
Example
Suppose the following data:
>>> y = np.array([0, 1, 0, 1])
>>> x1 = np.array([0, 0, 0, 1])
>>> x2 = np.array([0, 1, 1, 1])

And start with the initial guesses for β0, β1, and β2:

>>> beta = [-1.5, 2.8, 1.1]

Then for each row we can compute log_o:

>>> log_o = beta[0] + beta[1] * x1 + beta[2] * x2
[-1.5 -0.4 -0.4 2.4]

And convert from log odds to probabilities:

>>> o = np.exp(log_o)
[ 0.223 0.670 0.670 11.02 ]


>>> p = o / (o+1)
[ 0.182 0.401 0.401 0.916 ]

Notice that when log_o is greater than 0, o is greater than 1 and p is greater than 0.5.
The likelihood of an outcome is p when y==1 and 1-p when y==0.

If we think the probability of a boy is 0.8 and the outcome is a boy, the likelihood is 0.8; if the outcome is a girl, the likelihood is 0.2.

Compute that like this:

>>> likes = y * p + (1-y) * (1-p)
[ 0.817 0.401 0.598 0.916 ]

The overall likelihood of the data is the product of likes:

>>> like = np.prod(likes)
0.18

For these values of beta, the likelihood of the data is 0.18. The goal of
logistic regression is to find parameters that maximize this likelihood.

IMPLEMENTATION
StatsModels provides an implementation of logistic regression called
logit, named for the function that converts from probability to log
odds.

import statsmodels.formula.api as smf

model = smf.logit('boy ~ agepreg', data=df)
results = model.fit()
SummarizeResults(results)

The result is a Logit object that represents the model.
It contains attributes called endog and exog that contain the endogenous variable, another name for the dependent variable, and the exogenous variables, another name for the explanatory variables.
The result of model.fit is a BinaryResults object, which is similar to the RegressionResults object returned by ols.
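As a rough sketch of how the fitted results might be used (the column name follows the formula above, and the new age values are hypothetical):

import numpy as np
import pandas as pd

print(np.exp(results.params))                  # exponentiated parameters give odds ratios

new = pd.DataFrame({'agepreg': [20, 30, 40]})  # hypothetical explanatory values
print(results.predict(new))                    # predicted probability that the outcome is 1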


7. Discuss in detail about time series analysis with a suitable case study.
 Time Series
 A time series is a sequence of measurements from a system that
varies in time.
 An ordered sequence of values of a variable at equally spaced
time intervals.
 Time Series Analysis
 Time series analysis is a specific way of analyzing a sequence of
data points collected over an interval of time.
 In time series analysis, analysts record data points at consistent
intervals over a set period of time rather than just recording the data
points intermittently or randomly.

Time series analysis has become a crucial tool for companies
looking to make better decisions based on data.
 Examples of time series analysis in action include:
 Weather data
 Rainfall measurements
 Temperature readings
 Heart rate monitoring (EKG)
 Brain monitoring (EEG)
 Quarterly sales
 Stock prices
 Automated stock trading
 Industry forecasts
 Components of Time Series Data
 Trends: Long-term increases, decreases, or stationary movement
 Seasonality: Predictable patterns at fixed intervals
 Cycles: Fluctuations without a consistent period
 Noise: Residual unexplained variability

 Types of Data
 Time Series Data: Comprises observations collected at different
time intervals. It's geared towards analyzing trends, cycles, and other
temporal patterns.
 Cross-Sectional Data: Involves data points collected at a single
moment in time. Useful for understanding relationships or
comparisons between different entities or categories at that specific
point.
 Pooled Data: A combination of Time Series and Cross-Sectional data. This hybrid enriches the dataset, allowing for more nuanced and comprehensive analyses.

 Time Series Analysis Types


 Classification: Identifies and assigns categories to the data.
 Curve fitting: Plots the data along a curve to study the
relationships of variables within the data.
 Descriptive analysis: Identifies patterns in time series data, like
trends, cycles, or seasonal variation.
 Explanative analysis: Attempts to understand the data and the
relationships within it, as well as cause and effect.
 Exploratory analysis: Highlights the main characteristics of the
time series data, usually in a visual format.
 Forecasting: Predicts future data. This type is based on historical
trends. It uses the historical data as a model for future data,
predicting scenarios that could happen along future plot points.
 Intervention analysis: Studies how an event can change the data.
 Segmentation: Splits the data into segments to show the
underlying properties of the source information.

 Time Series Analysis Techniques


 Moving Average: Useful for smoothing out short-term fluctuations to reveal longer-term trends. It is ideal for removing noise and identifying the general direction in which values are moving.
 Exponential Smoothing: Suited for univariate data with a
systematic trend or seasonal component. Assigns higher weight to
recent observations, allowing for more dynamic adjustments.
 Autoregression: Leverages past observations as inputs for a regression
equation to predict future values. It is good for short-term
forecasting when past data is a good indicator.
 Decomposition: This breaks down a time series into its core
components—trend, seasonality, and residuals—to enhance the
understanding and forecast accuracy.
 Time Series Clustering: Unsupervised method to categorize data
points based on similarity, aiding in identifying archetypes or trends
in sequential data.
 Wavelet Analysis: Effective for analyzing non-stationary time series
data. It helps in identifying patterns across various scales or
resolutions.
 Intervention Analysis: Assesses the impact of external events on a time series, such as the effect of a policy change or a marketing campaign.
 Box-Jenkins ARIMA models: Focuses on using past behavior and
errors to model time series data. Assumes data can be characterized by a
linear function of its past values.
 Box-Jenkins Multivariate models: Similar to ARIMA, but accounts
for multiple variables. Useful when other variables influence one
time series.
 Holt-Winters Exponential Smoothing: Best for data with a distinct
trend and seasonality. Incorporates weighted averages and builds
upon the equations for exponential smoothing.

 The Advantages of Time Series Analysis


1. Data Cleansing: Time series analysis techniques such as smoothing
and seasonality adjustments help remove noise and outliers, making
the data more reliable and interpretable.
2. Understanding Data: Models like ARIMA or exponential smoothing
provide insight into the data's underlying structure. Autocorrelations and
stationary measures can help understand the data's true nature.
3. Forecasting: One of the primary uses of time series analysis is to
predict future values based on historical data. Forecasting is
invaluable for business planning, stock market analysis, and other
applications.
4. Identifying Trends and Seasonality: Time series analysis can
uncover underlying patterns, trends, and seasonality in data that
might not be apparent through simple observation.
5. Visualizations: Through time series decomposition and other techniques, it's possible to create meaningful visualizations that clearly show trends, cycles, and irregularities in the data.
6. Efficiency: With time series analysis, less data can sometimes be more.
Focusing on critical metrics and periods can often derive valuable
insights without getting bogged down in overly complex models or
datasets.
7. Risk Assessment: Volatility and other risk factors can be modeled
over time, aiding financial and operational decision-making
processes.

 Challenges of Time Series Analysis


1. Limited Scope: Time series analysis is restricted to time-dependent data. It's not suitable for cross-sectional or purely categorical data.


2. Noise Introduction: Techniques like differencing can introduce
additional noise into the data, which may obscure fundamental
patterns or trends.
3. Interpretation Challenges: Some transformed or differenced values
may need more intuitive meaning, making it easier to understand the
real- world implications of the results.
4. Generalization Issues: Results may only sometimes be
generalizable, primarily when the analysis is based on a single,
isolated dataset or period.
5. Model Complexity: The choice of model can greatly influence the
results, and selecting an inappropriate model can lead to unreliable or
misleading conclusions.
6. Non-Independence of Data: Unlike other types of statistical
analysis, time series data points are not always independent, which can
introduce bias or error in the analysis.
7. Data Availability: Time series analysis often requires many data points for
reliable results, and such data may not always be easily accessible or available.

8. Explain in detail about Time Series Analysis Techniques – Moving Average and exponentially-weighted moving average with an example. (Apr/May 2024)
 Moving Average
 A moving average divides the series into overlapping regions,
called windows, and computes the average of the values in
each window.
 One of the simplest moving averages is the rolling mean, which
computes the mean of the values in each window.
 For example, if the window size is 3, the rolling mean computes the
mean of values 0 through 2, 1 through 3, 2 through 4, etc.
 pandas provides rolling_mean, which takes a Series and a
window size and returns a new Series.
 >>> series = np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> pandas.rolling_mean(series, 3)
array([ nan, nan, 1, 2, 3, 4, 5, 6, 7, 8])
 The first two values are nan; the next value is the mean of the first three elements, 0, 1, and 2. The next value is the mean of 1, 2, and 3. And so on.


 The rolling mean does a good job of smoothing out the noise and extracting the trend.
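In recent versions of pandas, rolling_mean has been replaced by the rolling-window interface; a minimal equivalent sketch:

import numpy as np
import pandas as pd

series = pd.Series(np.arange(10))
rolling = series.rolling(window=3).mean()   # same result as rolling_mean(series, 3)
print(rolling.values)
# [nan nan  1.  2.  3.  4.  5.  6.  7.  8.]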
 Exponentially-weighted moving average (EWMA)
 The Exponentially Weighted Moving Average (EWMA) is a
quantitative or statistical measure used to model or describe a time
series.
 The moving average is designed as such that older observations are
given lower weights. The weights fall exponentially as the data point
gets older – hence the name exponentially weighted.
 An alternative is the exponentially-weighted moving average (EWMA),
which has two advantages.
 First, it computes a weighted average where the most recent value
has the highest weight and the weights for previous values drop off
exponentially.
 Second, the pandas implementation of EWMA handles missing
values better.
EWMA Formula
EWMA_t = α · r_t + (1 − α) · EWMA_(t−1)
Where:
α (alpha) = the weight decided by the user
r_t = value of the series in the current period
EWMA_(t−1) = the EWMA value of the previous period

ewma = pandas.ewma(reindexed.ppg, span=30)
thinkplot.Plot(ewma.index, ewma)

 The span parameter corresponds roughly to the window size of a moving average; it controls how fast the weights drop off, so it determines the number of points that make a non-negligible contribution to each average.

Figure 5.4 (right) shows the EWMA for the same data.
 It is similar to the rolling mean, where they are both defined, but it has no missing values, which makes it easier to work with.
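Similarly, current pandas computes the EWMA with the ewm method rather than pandas.ewma; a minimal sketch using a hypothetical daily price series:

import pandas as pd

prices = pd.Series([10.0, 10.5, 9.8, 11.2, 10.9, 11.5])   # hypothetical daily prices
ewma = prices.ewm(span=30).mean()                         # exponentially-weighted moving average
print(ewma)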


Figure 5.4: Daily price and a rolling mean (left) and exponentially-
weighted moving average (right).

Missing values
 A simple and common way to fill missing data is to use a moving average.
 The Series method fillna:
reindexed.ppg.fillna(ewma, inplace=True)

Wherever reindexed.ppg is nan, fillna replaces it with the corresponding value from
ewma. The inplace flag tells fillna to modify the existing Series rather than create a
new one.
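A small end-to-end sketch of the same idea with current pandas (the dates and prices are made up): the EWMA carries a value through the gaps, so fillna can use it to replace the missing entries.

import numpy as np
import pandas as pd

dates = pd.date_range('2024-01-01', periods=8, freq='D')
ppg = pd.Series([10.0, np.nan, 9.5, np.nan, np.nan, 11.0, 10.8, np.nan], index=dates)

ewma = ppg.ewm(span=3).mean()   # EWMA computed while skipping the missing values
filled = ppg.fillna(ewma)       # fill each gap with the corresponding EWMA value
print(filled)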

Serial Correlation
 One example of serial correlation is found in stock prices.


 Stock prices tend to go up and down together over time, which is
said to be “serially correlated.” This means that if stock prices go up
today, they will also go up tomorrow. Similarly, if stock prices go
down today, they are likely to go down tomorrow.
 The degree of serial correlation can be measured using the
autocorrelation coefficient.
 The autocorrelation coefficient measures how closely related a series
of data points are to each other.

Types of Serial Correlation
Positive Serial Correlation


 Positive serial correlation occurs when a positive error for one
observation increases the chance of a positive error for another
observation.
 In other words, if there is a positive error in one period, there is a greater
likelihood of a positive error in the next period as well.
 Positive serial correlation also means that a negative error for one
observation increases the chance of a negative error for another
observation.
 So, if there is a negative error in one period, there is a greater
likelihood of a negative error in the next period. Refer Figure 5.5

Figure 5.5 – Positive Serial Correlation

Negative Serial Correlation


A negative serial correlation occurs when a positive error for one observation
increases the chance of a negative error for another observation.

 In other words, if there is a positive error in one period, there is a greater likelihood of a negative error in the next period.
 A negative serial correlation also means that a negative error for one
observation increases the chance of a positive error for another
observation.
 So, if there is a negative error in one period, there is a greater
likelihood of a positive error in the next period. Refer Figure 5.6

Figure 5.6 – Negative Serial Correlation

def SerialCorr(series, lag=1):
    # pair each value with the value lag steps earlier
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    # Pearson correlation between the series and its lagged version
    corr = thinkstats2.Corr(xs, ys)
    return corr

After the shift, the first lag values are nan, so a slice is used to
remove them before computing Corr.
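
 A small self-contained check of the same idea, with numpy's corrcoef standing in for thinkstats2.Corr (the substitution is an assumption; both compute the Pearson correlation):

import numpy as np
import pandas as pd

def serial_corr(series, lag=1):
    # pair each value with the value lag steps earlier and correlate them
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    return np.corrcoef(xs, ys)[0, 1]

# a random walk tends to show strong positive serial correlation
np.random.seed(0)
prices = pd.Series(np.random.randn(200).cumsum())
print(serial_corr(prices, lag=1))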

Testing for Serial Correlation: the Durbin-Watson Test
 The Durbin-Watson test is a statistical test used to determine whether
or not there is a serial correlation in a data set.
 It tests the null hypothesis of no serial correlation against the
alternative positive or negative serial correlation hypothesis.
The test is named after James Durbin and Geoffrey Watson, who
developed it in 1950.
The Durbin-Watson Statistic (DW) is approximated by:

$$ DW = 2(1 − r) $$

Where:
\(r\) is the sample correlation between regression residuals from
one period and the previous period.

 The test statistic can take on values ranging from 0 to 4.


 A value of 2 indicates no serial correlation, a value between 0 and 2
indicates a positive serial correlation, and a value between 2 and 4
indicates a negative serial correlation:

 If there is no autocorrelation, the regression errors will be uncorrelated, and thus \(DW = 2\)


$$ DW = 2(1 − r) = 2(1 − 0) = 2 $$

 For positive serial autocorrelation, \(DW < 2\).


For example, if serial correlation of the regression residuals = 1,
\(DW = 2(1 − 1) = 0\).

 For negative autocorrelation, \(DW > 2\).


For example, if serial correlation of the regression residual = −1,
\(DW = 2(1 − (−1)) = 4\).

 To decide whether to reject the null hypothesis of no serial correlation, the calculated DW statistic is compared against tabulated lower and upper critical values.

Define \(d_l\) as the lower value and \(d_u\) as the upper value:
o If the DW statistic is less than \(d_l\), we reject the null hypothesis of
no positive serial correlation.
o If the DW statistic is greater than \((4 – d_l)\), we reject the
null hypothesis, indicating a significant negative serial correlation.
o If the DW statistic falls between \(d_l\) and \(d_u\), the test results
are inconclusive.
o If the DW statistic is greater than \(d_u\), we fail to reject the
null hypothesis of no positive serial correlation. Refer Figure 5.7
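
 As a hedged illustration, statsmodels provides a durbin_watson helper that computes the DW statistic from regression residuals (the data below are synthetic):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# toy data: y depends linearly on x plus independent noise
np.random.seed(1)
x = np.arange(100)
y = 2.0 * x + np.random.randn(100) * 5

# fit an ordinary least squares model and test its residuals
X = sm.add_constant(x)
results = sm.OLS(y, X).fit()
print(durbin_watson(results.resid))   # values near 2 suggest little serial correlation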

Example 5.1: The Durbin-Watson Test for Serial Correlation
Consider a regression output with two independent variables that generates a DW statistic of 0.654. Assume that the sample size is 15. Test for serial correlation of the error terms at the 5% significance level.

Solution
From the Durbin-Watson table with \(n = 15\) and \(k = 2\), \(d_l = 0.95\) and \(d_u = 1.54\).
Since \(d = 0.654 < 0.95 = d_l\), reject the null hypothesis and conclude that there is significant positive autocorrelation.
Example 5.2
Consider a regression model with 80 observations and two independent variables. Assume that the correlation between the error term and the first lagged value of the error term is 0.18. The most appropriate decision is:
A. reject the null hypothesis of positive serial correlation.
B. fail to reject the null hypothesis of positive serial correlation.
C. declare that the test results are inconclusive.

Solution
The correct answer is C. The test statistic is:
$$ DW \approx 2(1 − r) = 2(1 − 0.18) = 1.64 $$
The critical values from the Durbin-Watson table with \(n = 80\) and \(k = 2\) are \(d_l = 1.59\) and \(d_u = 1.69\).
Because 1.69 > 1.64 > 1.59, the test results are inconclusive.

10. Discuss in detail about Autocorrelation and differentiate between serial correlation and autocorrelation. (Nov/Dec 2023)
 Autocorrelation
 Autocorrelation refers to the degree of correlation of the same variable between two successive time intervals.
 Autocorrelation represents the degree of similarity between a given
time series and a lagged version of itself over successive time
intervals.
 Autocorrelation measures the relationship between a variable's
current value and its past values.
 The value of autocorrelation ranges from -1 to 1.
 An autocorrelation of +1 represents a perfect positive correlation,
while an autocorrelation of -1 represents a perfect negative
correlation.
 A value between -1 and 0 represents negative autocorrelation.
 A value between 0 and 1 represents positive autocorrelation.
 Autocorrelation gives information about the trend of a set of historical data, so it can be useful in technical analysis.

Types of Autocorrelation
 Positive autocorrelation
The observations with positive autocorrelation can be plotted into
a smooth curve. By adding a regression line, it can be
observed that a positive error is followed by another positive
one, and a negative error is followed by another negative one.
Refer Figure 5.8

Figure 5.8 – Positive Autocorrelation

 Negative autocorrelation
Conversely, negative autocorrelation means that an increase observed in one time interval leads to a proportionate decrease in the lagged time interval. By plotting the observations with a regression line, it can be observed that a positive error will be followed by a negative one and vice versa. Refer Figure 5.9

Figure 5.9 – Negative Autocorrelation

 Autocorrelation can be applied to different numbers of time gaps, which is known as the lag.
 A lag 1 autocorrelation measures the correlation between observations that are one time gap apart.
 For example, to learn the correlation between the temperatures of
one day and the corresponding day in the next month, a lag 30
autocorrelation should be used (assuming 30 days in that month).
 Autocorrelation refers to the correlation between a time series
variable and its own lagged values over time. In other words, it
measures the degree of similarity between observations of a variable
at different points in time.
 Autocorrelation is an important concept in time series analysis as
it helps to identify patterns and relationships within the data.
 Positive autocorrelation occurs when a time series variable is positively correlated with its own lagged values, while negative autocorrelation occurs when it is negatively correlated with its lagged values.
 Zero autocorrelation indicates that there is no correlation between the variable and its lagged values.

Benefits of Autocorrelation
 Autocorrelation has several benefits in time series analysis:
 Identifying patterns – Autocorrelation helps to identify patterns
in the time series data, which can provide insights into the
behavior of the variable over time.
 Model selection – Autocorrelation can be used to select appropriate models for
time series analysis.
 Forecasting – Autocorrelation can help to forecast future values
of a time series variable.
 Validating assumptions – Autocorrelation can be used to
validate assumptions of statistical models.
 Hypothesis testing – Autocorrelation can affect the results of hypothesis tests, such as t-tests and F-tests.

Test for Autocorrelation


 Autocorrelation can be assessed using a variety of statistical
techniques such as the autocorrelation function (ACF), partial
autocorrelation function (PACF), and the Durbin-Watson statistic.
 These methods help to quantify the strength and direction of the

autocorrelation and can be used to model and forecast time series data.

 The Durbin-Watson statistic is commonly used to test for autocorrelation.


 It can be applied to a data set by statistical software.
 The outcome of the Durbin-Watson test ranges from 0 to 4.
 An outcome closely around 2 means a very low level of autocorrelation.
 An outcome closer to 0 suggests a stronger positive autocorrelation,
and an outcome closer to 4 suggests a stronger negative
autocorrelation.

 The autocorrelation function (ACF) assesses the correlation between observations in a time series for a set of lags. The ACF for time series y is given by:
$$ \text{Corr}(y_t, y_{t-k}), \quad k = 1, 2, \dots $$
Analysts typically use graphs to display this function.

Computation of Autocorrelation in Python


The pandas.Series.autocorr() function lets you compute the
lag-N (default=1) autocorrelation on a given series.
Code Snippet:
df['series'].autocorr(lag=1)
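
 To look at several lags at once, the same call can be repeated for a range of lag values; a brief sketch with synthetic data (the column name 'series' is just a placeholder):

import numpy as np
import pandas as pd

np.random.seed(2)
df = pd.DataFrame({'series': np.random.randn(300).cumsum()})

# lag-1 autocorrelation, as above
print(df['series'].autocorr(lag=1))

# autocorrelation for the first few lags
for lag in range(1, 6):
    print(lag, round(df['series'].autocorr(lag=lag), 3))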

 Serial Correlation Versus Autocorrelation

1. Serial correlation is a statistical concept that refers to the correlation between a variable and itself over time. It is used to measure the degree to which a variable's values at one point in time are related to its values at another point in time. Serial correlation is often used in time-series analysis to detect patterns in data and to test whether a model is appropriate for the data.

2. Autocorrelation is a specific type of serial correlation that


measures the correlation between a variable and its lagged values.
In other words, autocorrelation measures the degree to which a
variable's values at one point in time are related to its values at
previous points in time. Autocorrelation is often used to assess
whether a time-series model is appropriate for the data.
3. Serial correlation is a more general term that refers to the correlation
between a variable and itself over time, whereas autocorrelation
specifically refers to the correlation between a variable and its
lagged values.
4. In terms of applications, serial correlation is often used to analyze


patterns in data over time, such as trends and seasonality, while
autocorrelation is often used in time-series analysis to assess the
fit of a model and to make predictions about future values.
For example, in finance, serial correlation might be used to analyze the daily returns of a stock or portfolio over time to detect trends and seasonality. Autocorrelation might be used to test whether a time-series model is appropriate for the data and to make predictions about future returns based on past values.

11. Give a brief introduction about Survival Analysis.

Survival Analysis
 Survival analysis is a field of statistics that focuses on analysing
the expected time until a certain event happens.
 Survival analysis can be used for analysing the results of that
treatment in terms of the patients’ life expectancy.
 The term 'survival time' specifies the length of time taken for failure to occur.
 Survival analysis is used to analyse data in which the time until the
event is of interest.
 The response is often referred to as a failure time, survival time, or
event time.
 This branch of statistics developed around measuring the
effects of medical treatment on patients’ survival in clinical trials.
 Examples
o Time until tumour recurrence
o Time until a machine part fails

Survival curves
 The fundamental concept in survival analysis is the survival curve, S(t), as in Figure 5.10, which is a function that maps from a duration, t, to the probability of surviving longer than t; it is just the complement of the CDF:
S(t) = 1 − CDF(t)
where CDF(t) is the probability of a lifetime less than or equal to t.

Figure 5.10 - Survival curves

 For example, the NSFG dataset gives the durations of 11189 complete pregnancies.
 This data can be read and its CDF computed as follows:
preg = nsfg.ReadFemPreg()
complete = preg.query('outcome in [1, 3, 4]').prglngth
cdf = thinkstats2.Cdf(complete, label='cdf')
 The outcome codes 1, 3, 4 indicate live birth, stillbirth, and miscarriage.
 The DataFrame method query takes a boolean expression and
evaluates it for each row, selecting the rows that yield True.
class SurvivalFunction(object):
    def __init__(self, cdf, label=''):
        self.cdf = cdf
        self.label = label or cdf.label

    @property
    def ts(self):
        # the sequence of lifetimes
        return self.cdf.xs

    @property
    def ss(self):
        # the survival curve, 1 - CDF
        return 1 - self.cdf.ps

 SurvivalFunction provides two properties:
o ts, which is the sequence of lifetimes,
o ss, which is the survival curve.

 From the survival curve we can derive the hazard function; for pregnancy lengths, the hazard function maps from a time, t, to the fraction of pregnancies that continue until t and then end at t.

 The hazard at t can be written as a ratio: the numerator is the fraction of lifetimes that end at t, which is also PMF(t), and the denominator is the fraction of lifetimes that last at least t.

Figure 5.5 - Hazard curve
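
 A minimal numpy sketch of both ideas, using made-up durations rather than the NSFG data: the survival curve is the complement of the empirical CDF, and the discrete hazard at t is PMF(t) divided by the fraction of lifetimes that last at least t.

import numpy as np

# hypothetical completed durations (e.g., pregnancy lengths in weeks)
durations = np.array([37, 38, 38, 39, 39, 39, 39, 40, 40, 41, 42, 43])

ts = np.sort(np.unique(durations))

# PMF(t): fraction of durations that end exactly at t
pmf = np.array([np.mean(durations == t) for t in ts])
# CDF(t): fraction of durations less than or equal to t
cdf = np.cumsum(pmf)
# survival curve S(t) = 1 - CDF(t): probability of lasting longer than t
ss = 1.0 - cdf
# discrete hazard: of the lifetimes still ongoing at t, the fraction that end at t
at_risk = ss + pmf          # fraction of lifetimes with duration >= t
hazard = pmf / at_risk

for t, s, h in zip(ts, ss, hazard):
    print(t, round(s, 3), round(h, 3))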

Censoring
 In longitudinal studies, exact survival time is only known for those individuals who show the event of interest during the follow-up period. For the others, all that is known is that they did not show the event during the follow-up period; these individuals are called censored observations.
 The following terms are used in relation to censoring:
 Right censoring: a subject is right censored if it is known that
failure occurs some time after the recorded follow-up period.
 Left censoring: a subject is left censored if it is known that the
failure occurs some time before the recorded follow-up period.
 Interval censoring: a subject is interval censored if it is known that
the event occurs between two times, but the exact time of failure is
not known.

Truncation
 A truncation period means that the outcome of interest cannot
possibly occur.
 A censoring period means that the outcome of interest may
have occurred.
 There are two types of truncation:
 Left truncation: a subject is left truncated if it enters the population at risk some stage after the start of the follow-up period.
 Right truncation: a subject is right truncated if it leaves the
population at risk some stage after the study start.

Figure 5.6: Left-, right-censoring, and truncation

 An 'X' indicates that the subject has experienced the outcome of interest; an 'O' indicates censoring.
 Subject A experiences the event of interest on day 7.
 Subject B does not experience the event during the study period and is right censored on day 12 (this implies that subject B experienced the event sometime after day 12).
 Subject C does not experience the event of interest during its period of observation and is censored on day 10.
 Subject D is interval censored: this subject is observed intermittently and experiences the event of interest sometime between days 5–6 and 7–8.
 Subject E is left censored: it has been found to have already experienced the event of interest when it enters the study on day 1.
 Subject F is interval truncated: there is no way possible that the event of interest could occur to this individual between days 4 and 6.
 Subject G is left truncated: there is no way possible that the event of interest could have occurred before the subject enters the study on day 3.

12. What are the Effective Strategies for Handling Missing Values in
Data Analysis?
Missing Value
 Missing data is defined as the values or data that is not stored
for some variable/s in the given dataset.
 As an example, in the Titanic dataset the columns 'Age' and 'Cabin' have some missing values.

Reason for Missing Values


 Past data might get corrupted due to improper maintenance.
 Observations are not recorded for certain fields due to some
reasons. There might be a failure in recording the values due to
human error.
 The user has not provided the values intentionally
 Item nonresponse: This means the participant refused to respond.

Reason to handle missing data


 The missing data will decrease the predictive power of the
model. If the algorithms are applied with missing data, then there
will be bias in the estimation of parameters.
 The results are not reliable if the missing data is not handled properly.

Types of Missing Values

 Missing completely at random (MCAR): missing data are randomly distributed across the variable and unrelated to other variables.
 Missing at random (MAR): missing data are not randomly distributed, but they are accounted for by other observed variables.
 Missing not at random (MNAR): missing data systematically differ from the observed values.
Missing Completely At Random (MCAR)
 In MCAR, the probability of data being missing is the same for all
the observations.
 In this case, there is no relationship between the missing data
and any other values observed or unobserved within the given
dataset.
 That is, missing values are completely independent of
other data. There is no pattern.
 In the case of MCAR data, the value could be missing due to
human error, some system/equipment failure, loss of sample, or
some unsatisfactory technicalities while recording the values.
 For Example, suppose in a library there are some overdue
books. Some values of overdue books in the computer system are
missing. The reason might be a human error, like the librarian
forgetting to type in the values.

Missing At Random (MAR)


 MAR data means that the reason for missing values can be
explained by variables which have complete information, as there is
some relationship between the missing data and other values/data.
 In this case, the data is not missing for all the observations.
 It is missing only within sub-samples of the data, and there is some


pattern in the missing values.
 For example, if you check the survey data, you may find that all
the people have answered their ‘Gender,’ but
‘Age’ values are mostly missing for people who have answered
their ‘Gender’ as ‘female.’ (The reason being most of the females
don’t want to reveal their age.)
So, the probability of data being missing depends only on the observed value or
data. In this case, the variables ‘Gender’ and ‘Age’ are related.
The reason for missing values of the ‘Age’ variable can be
explained by the ‘Gender’ variable, but you cannot predict
the missing value itself.

Missing Not At Random (MNAR)


 Missing values depend on the unobserved data.
 If there is some structure/pattern in missing data and other
observed data can not explain it, then it is considered to be
Missing Not At Random (MNAR).
 If the missing data does not fall under the MCAR or
MAR, it can be categorized as MNAR.
 It can happen due to the reluctance of people to provide the
required information.
 A specific group of respondents may not answer some
questions in a survey.

Methods for identifying missing data

 .isnull() – returns a pandas dataframe where each value is a boolean: True if the value is missing, False otherwise.
 .notnull() – similarly to the previous function, the values for this one are False if either a NaN or a None value is detected.
 .info() – generates three main columns, including the "Non-Null Count", which shows the number of non-missing values for each column.
 .isna() – similar to isnull and notnull; however, it shows True only when the missing value is of NaN type.

Approach to handle missing values in a dataset.


 Deleting Rows with missing values
 Impute missing values for continuous variable
 Impute missing values for categorical variable
 Other Imputation Methods
 Using Algorithms that support missing values
 Prediction of missing values
 Imputation using Deep Learning Library — Datawig
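
 A brief pandas sketch of the first three approaches in the list above (deletion, mean imputation for a continuous variable, and mode imputation for a categorical variable); the toy column names are made up:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 29.0, np.nan],
    'Fare': [7.25, 71.28, 8.05, np.nan, 8.46],
    'Sex': ['male', 'female', 'female', 'male', np.nan],
})

# identify missing values per column
print(df.isnull().sum())

# 1. delete rows that contain any missing value
dropped = df.dropna()

# 2. impute a continuous variable with its mean (the median is also common)
df['Age'] = df['Age'].fillna(df['Age'].mean())

# 3. impute a categorical variable with its most frequent value (mode)
df['Sex'] = df['Sex'].fillna(df['Sex'].mode()[0])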

13. Compare and contrast between multiple regression and logistic regression techniques
with example. (Apr/May 2024)

Multiple regression

 Explaining or predicting a single Y variable from two or more X variables is called multiple
regression. The goals of multiple regression are

(1) to describe and understand the relationship,

(2) to forecast (predict) a new observation, and

(3) to adjust and control a process.

Multiple Linear Regression Formula

$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon $$

Where:

 yi is the dependent or predicted variable


 β0 is the y-intercept, i.e., the value of y when both xi1 and xi2 are 0.
 β1 and β2 are the regression coefficients representing the change in y relative to a one-unit
change in xi1 and xi2, respectively.
 βp is the slope coefficient for each independent variable
 ϵ is the model’s random error (residual) term.

Uses

 There are two main uses for multiple regression analysis.

 The first is to determine the dependent variable based on multiple independent variables. For
example, you may be interested in determining what a crop yield will be based on
temperature, rainfall, and other independent variables.

 The second is to determine how strong the relationship is between each variable. For example,
you may be interested in knowing how a crop yield will change if rainfall increases or the
temperature decreases.
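
 A minimal StatsModels sketch of fitting a multiple regression with two predictors; the crop-yield data below are invented purely for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(3)
n = 100
df = pd.DataFrame({
    'temperature': np.random.uniform(15, 35, n),
    'rainfall': np.random.uniform(50, 200, n),
})
# hypothetical yield generated from both predictors plus noise
df['crop_yield'] = 2.0 + 0.3 * df['temperature'] + 0.05 * df['rainfall'] + np.random.randn(n)

# fit crop_yield ~ temperature + rainfall and inspect the estimated coefficients
results = smf.ols('crop_yield ~ temperature + rainfall', data=df).fit()
print(results.params)
print(results.rsquared)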

Logistic Regression:

 Logistic regression is a technique for predicting categorical outcomes with two possible categories.

 Logistic regression is a supervised machine learning algorithm that accomplishes binary


classification tasks by predicting the probability of an outcome, event, or observation.

 The model delivers a binary or dichotomous outcome limited to two possible outcomes:
yes/no, 0/1, or true/false.

 This type of statistical model (also known as logit model) is often used for classification
and predictive analytics. Since the outcome is a probability, the dependent variable is
bounded between 0 and 1.

 In logistic regression, a logit transformation is applied to the odds, that is, the probability of success divided by the probability of failure. This is also commonly known as the log odds, or the natural logarithm of the odds.

logistic function formulas:

 Sigmoid (logistic) function: p_i = 1 / (1 + exp(−z_i)), where z_i is the linear combination of the predictors

 Logit (log odds): ln(p_i / (1 − p_i)) = Beta_0 + Beta_1*X_1 + … + Beta_k*X_k
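
 A minimal sketch of estimating the parameters with StatsModels' Logit; the data are synthetic, and any real fraud, churn, or disease dataset would take their place:

import numpy as np
import statsmodels.api as sm

np.random.seed(4)
n = 500
x1 = np.random.randn(n)
x2 = np.random.randn(n)
# true log odds: -0.5 + 1.2*x1 - 0.8*x2
p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.2 * x1 - 0.8 * x2)))
y = np.random.binomial(1, p)

# fit the logistic regression by maximum likelihood
X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.Logit(y, X).fit()
print(results.params)            # estimated intercept and slopes
print(results.predict(X[:5]))    # predicted probabilities for the first rows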

Logistic Regression Uses:

Logistic regression is commonly used for prediction and classification problems. Some of these
use cases include:

 Fraud detection: Logistic regression models can help teams identify data anomalies, which
are predictive of fraud. Certain behaviors or characteristics may have a higher association with
fraudulent activities, which is particularly helpful to banking and other financial institutions in
protecting their clients.

 Disease prediction: In medicine, this analytics approach can be used to predict the likelihood
of disease or illness for a given population. Healthcare organizations can set up preventative care
for individuals that show higher propensity for specific illnesses.

 Churn prediction: Specific behaviors may be indicative of churn in different functions of an


organization. For example, the sales organization may want to learn which of their clients are at risk
of taking their business elsewhere. This can prompt teams to set up a retention strategy to avoid lost
revenue.

14. A company manufactures an electronic device to be used in a very wide temperature range. The company knows that increased temperature shortens the life time of the device, and a study is therefore performed in which the life time is determined as a function of temperature.

Temp in Celsius:      10   20   30   40   50   60   70   80   90
Life time in hours:  420  365  285  220  176  117   69   34    5

Find the linear regression equation. Also find the estimated life time when
temperature is 55. (Apr/May 2024)

Solution

 Calculate the mean of X (Temperature) and Y (Life time), using all nine observations:

Mean of X: X̄ = (10 + 20 + 30 + 40 + 50 + 60 + 70 + 80 + 90) / 9 = 450 / 9 = 50.

Mean of Y: Ȳ = (420 + 365 + 285 + 220 + 176 + 117 + 69 + 34 + 5) / 9 = 1691 / 9 ≈ 187.89.

 Calculate the sum of squares for X (SSX):

SSX = Σ(Xi - X̄)² = (10 - 50)² + (20 - 50)² + ... + (90 - 50)² = 6000.

 Calculate the sum of products (SP):

SP = Σ(Xi - X̄)(Yi - Ȳ) = ΣXiYi - nX̄Ȳ = 52670 - 84550 = -31880.

 Calculate the slope (b) of the best-fit line:

b = SP / SSX = -31880 / 6000 ≈ -5.313.

 Calculate the intercept (a) of the best-fit line:

a = Ȳ - bX̄ = 187.89 - (-5.313)(50) = 187.89 + 265.67 ≈ 453.56.

 The linear regression equation is therefore: Y = 453.56 - 5.313 X.

 Estimate the life time when the temperature is 55 degrees Celsius:

Y = 453.56 - 5.313 × 55 = 453.56 - 292.23 ≈ 161.3 hours.

So the estimated life time at 55 degrees Celsius is approximately 161 hours.
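
 The hand calculation can be checked quickly with numpy, whose polyfit function fits a least-squares line:

import numpy as np

temp = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
life = np.array([420, 365, 285, 220, 176, 117, 69, 34, 5])

# degree-1 polynomial fit returns the slope and intercept of the least-squares line
slope, intercept = np.polyfit(temp, life, 1)
print(slope, intercept)          # approximately -5.313 and 453.56
print(intercept + slope * 55)    # approximately 161.3 hours at 55 degrees Celsius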
