Unit V - Update
A "linear fit" is a line intended to model the relationship
between variables.
A “least squares” fit is one that minimizes the mean squared error
(MSE) between the line and the data.
6. Define Residuals
The deviation of an actual value from a model.
The difference between the actual values and the fitted line.
thinkstats2 provides a function that computes residuals:
def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res
It returns the differences between the actual values and the fitted line.
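As a quick sketch of how Residuals might be used (the data values here are made up for illustration):

```python
import numpy as np

def Residuals(xs, ys, inter, slope):
    # deviation of each actual y from the fitted line inter + slope * x
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    return ys - (inter + slope * xs)

xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]
res = Residuals(xs, ys, inter=1.0, slope=2.0)
# residuals are approximately [0.1, -0.1, 0.2, -0.2]
```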
9. What are the different ways to measure the quality of a linear model,
or goodness of fit?
Standard deviation of the residuals
Coefficient of determination, usually denoted R² and called "R-squared":
R² = 1 − Var(res) / Var(ys)
21. List the different types of data used for predictive analysis.
Types of Data
Time Series Data: Comprises observations collected at different
time intervals. It's geared towards analyzing trends, cycles, and other
temporal patterns.
Cross-Sectional Data: Involves data points collected at a single moment in
time. Useful for understanding relationships or comparisons between
different entities or categories at that specific point.
Pooled Data: A combination of Time Series and Cross-Sectional
data. This hybrid enriches the dataset, allowing for more nuanced
and comprehensive analyses.
Autocorrelation
Autocorrelation refers to the degree of correlation of the same
variable between two successive time intervals.
Survival curves
o The fundamental concept in survival analysis is the survival
curve, S(t), which is a function that maps from a duration, t, to
the probability of surviving longer than t; it is just the
complement of the CDF:
S(t) = 1 − CDF(t)
where CDF(t) is the probability of a lifetime less than or equal to t.
28. Define missing value and narrate the reason for missing value.
Missing Value
Missing data is defined as the values or data that are not
stored for some variable(s) in the given dataset.
Reason for Missing Values
Past data might get corrupted due to improper maintenance.
Observations are not recorded for certain fields for various reasons.
There might be a failure in recording the values due to human error.
The user has not provided the values intentionally.
Item nonresponse: the participant refused to respond.
29. Why should missing data be handled?
The results are not reliable if missing data are not handled
properly.
Missing completely at random (MCAR): Missing data are randomly
distributed across the variable and unrelated to other variables.
Missing at random (MAR): Missing data are not randomly distributed
but are accounted for by other observed variables.
Missing not at random (MNAR): Missing data systematically differ
from the observed values.
Survival function: S(t) = 1 − F(t), the probability that a person,
machine, or business lasts longer than t time units.
Example - Insurance companies use survival analysis to predict the death of the
insured and estimate other important factors such as policy cancellations,
non-renewals, and how long it takes to file a claim.
Some subgroups in our population are assigned a higher probability of being selected,
based on our determined survey needs.
This allows researchers to correct issues that occur during data collection.
PART B
Companies can take actions, like retargeting online ads to visitors, with
data that predicts a greater likelihood of conversion and purchase intent.
Risk reduction
Credit scores, insurance claims, and debt collections all use
predictive analytics to assess and determine the likelihood of future
defaults.
Operational improvement
Companies use predictive analytics models to forecast
inventory, manage resources, and operate more efficiently.
Customer segmentation
By dividing a customer base into specific groups, marketers
can use predictive analytics to make forward-looking decisions to
tailor content to unique audiences.
Maintenance forecasting
Organizations use data to predict when routine equipment maintenance will
be required and can then schedule it before a problem or malfunction arises.
Implementation
thinkstats2 also provides FitLine, which takes inter and slope and
returns the fitted line for a sequence of xs.
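A sketch consistent with that description; treat the exact signature and body as assumptions rather than the library's definitive source:

```python
import numpy as np

def FitLine(xs, inter, slope):
    # sort the xs so the fitted line plots cleanly, then evaluate it
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys

fit_xs, fit_ys = FitLine([3, 1, 2], inter=1.0, slope=2.0)
# fit_xs -> [1, 2, 3], fit_ys -> [3.0, 5.0, 7.0]
```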
Residuals
The deviation of an actual value from a model.
The difference between the actual values and the fitted line.
thinkstats2 provides the Residuals function, shown earlier, to compute them.
Figure 5.1 depicts the data points (in red), the least
squares line of best fit (in blue), and the residuals (in green).
The parameters slope and inter are estimates based on a sample; like
other estimates, they are vulnerable to sampling bias,
measurement error, and sampling error.
Sampling bias is caused by non-representative sampling, measurement error
is caused by errors in collecting and recording data, and sampling error is
the result of measuring a sample rather than the entire population.
StatsModels
statsmodels provides two interfaces (APIs); the "formula" API uses
strings to identify the dependent and explanatory variables.
Example Program
# import packages
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
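The model-fitting step that belongs here might look like the following sketch; the data frame and the column names head_size and brain_weight are assumptions chosen to match the discussion of the output below, with synthetic data standing in for the real dataset:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data standing in for the real head-size / brain-weight dataset
rng = np.random.default_rng(42)
head_size = rng.normal(3500, 300, size=100)
brain_weight = 0.25 * head_size + rng.normal(0, 30, size=100)
df = pd.DataFrame({'head_size': head_size, 'brain_weight': brain_weight})

# formula API: dependent variable ~ explanatory variable
model = smf.ols('brain_weight ~ head_size', data=df).fit()
```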
# model summary
print(model.summary())
Output
R-squared value:
R-squared value ranges between 0 and 1.
An R-squared of 100 percent indicates that all changes in the dependent
variable are completely explained by changes in the independent
variable(s).
F-statistic:
The F-statistic compares the combined effect of all the variables together.
Predictions:
Taking the significance level (alpha) to be 0.05, since p < 0.05 we reject
the null hypothesis and accept the alternative hypothesis, so we can say
that there is a relationship between head size and brain weight.
Example
import statsmodels.api as sm

X = advertising[['TV', 'Newspaper', 'Radio']]
y = advertising['Sales']
where,
e = base of natural logarithms
value = numerical value one wishes to transform
For the logistic regression equation:
x = input value
y = predicted output
b0 = bias or intercept term
b1 = coefficient for input (x)
ESTIMATING PARAMETERS
Given the odds, compute the probability like this:
p = o / (o + 1)
>>> beta = [-1.5, 2.8, 1.1]
>>> log_o = np.dot(X, beta)
>>> o = np.exp(log_o)
[ 0.223 0.670 0.670 11.02 ]
>>> p = o / (o+1)
[ 0.182 0.401 0.401 0.916 ]
For these values of beta, the likelihood of the data is 0.18. The goal of
logistic regression is to find parameters that maximize this likelihood.
IMPLEMENTATION
StatsModels provides an implementation of logistic regression called
logit, named for the function that converts from probability to log
odds.
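A minimal sketch of using logit through the formula API, fitted on synthetic data (the variable names and the true coefficients 0.5 and 1.5 are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data with a known log-odds relationship: log o = 0.5 + 1.5 x
rng = np.random.default_rng(0)
x = rng.normal(size=500)
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y = (rng.random(500) < p).astype(int)
df = pd.DataFrame({'y': y, 'x': x})

# logit converts from probability to log odds internally
results = smf.logit('y ~ x', data=df).fit(disp=False)
# results.params should be near the true values 0.5 and 1.5
```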
7. Discuss in detail about time series analysis with a suitable case study.
Time Series
A time series is a sequence of measurements from a system that
varies in time.
An ordered sequence of values of a variable at equally spaced
time intervals.
Time Series Analysis
Time series analysis is a specific way of analyzing a sequence of
data points collected over an interval of time.
In time series analysis, analysts record data points at consistent
intervals over a set period of time rather than just recording the data
points intermittently or randomly.
Time series analysis has become a crucial tool for companies
looking to make better decisions based on data.
Examples of time series analysis in action include:
Weather data
Rainfall measurements
Temperature readings
Heart rate monitoring (EKG)
Brain monitoring (EEG)
Quarterly sales
Stock prices
Automated stock trading
Industry forecasts
Components of Time Series Data
Trends: Long-term increases, decreases, or stationary movement
Seasonality: Predictable patterns at fixed intervals
Cycles: Fluctuations without a consistent period
Noise: Residual unexplained variability
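These components can be illustrated by constructing a synthetic series additively (all the numeric choices here are arbitrary):

```python
import numpy as np

t = np.arange(120)                             # e.g. 120 monthly observations
trend = 0.5 * t                                # long-term increase
seasonality = 10 * np.sin(2 * np.pi * t / 12)  # pattern repeating every 12 periods
rng = np.random.default_rng(1)
noise = rng.normal(0, 2, size=t.size)          # residual unexplained variability
series = trend + seasonality + noise           # additive model of the components
```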
Types of Data
Time Series Data: Comprises observations collected at different
time intervals. It's geared towards analyzing trends, cycles, and other
temporal patterns.
Cross-Sectional Data: Involves data points collected at a single
moment in time. Useful for understanding relationships or
comparisons between different entities or categories at that specific
point.
Pooled Data: A combination of Time Series and Cross-Sectional
data. This hybrid enriches the dataset, allowing for more nuanced
and comprehensive analyses.
>>> series.rolling(3).mean()
array([ nan, nan, 1, 2, 3, 4, 5, 6, 7, 8])
The first two values are nan; the next value is the mean of the first
three elements, 0, 1, and 2. The next value is the mean of 1, 2,
and 3, and so on.
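A self-contained version of the rolling-mean example, using the current pandas API (the old rolling_mean function was removed from later pandas releases):

```python
import pandas as pd

series = pd.Series(range(10))        # 0, 1, ..., 9
rolled = series.rolling(3).mean()    # window of 3; first two entries are NaN
# rolled.iloc[2] is the mean of 0, 1, 2 -> 1.0
```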
The exponentially-weighted moving average (EWMA) gives the most recent
value the highest weight, with the weights of earlier values dropping off
exponentially:

$$ EWMA_t = \alpha r_t + (1 - \alpha) EWMA_{t-1} $$

Where:
Alpha = the weight decided by the user
r = value of the series in the current period
Figure 5.4 (right) shows the EWMA for the same data.
It is similar to the rolling mean, where they are both defined, but it
has no missing values, which makes it easier to work with.
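In current pandas the EWMA is computed with the ewm method; a small sketch (the alpha value is an arbitrary choice):

```python
import pandas as pd

series = pd.Series([0.0, 1.0, 2.0, 3.0])
# adjust=False gives the recursive form: ewma_t = alpha*r_t + (1-alpha)*ewma_{t-1}
ewma = series.ewm(alpha=0.5, adjust=False).mean()
# ewma is 0.0, 0.5, 1.25, 2.125 -- no missing values
```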
Figure 5.4: Daily price and a rolling mean (left) and exponentially-weighted
moving average (right).
Missing values
A simple and common way to fill missing data is to use a moving average.
The pandas Series method fillna does this:
reindexed.ppg.fillna(ewma, inplace=True)
Wherever reindexed.ppg is nan, fillna replaces it with the corresponding value from
ewma. The inplace flag tells fillna to modify the existing Series rather than create a
new one.
Positive serial correlation also means that a negative error for one
observation increases the chance of a negative error for another
observation.
So, if there is a negative error in one period, there is a greater
likelihood of a negative error in the next period. Refer Figure 5.5
$$ DW = 2(1 − r) $$
Where:
\(r\) is the sample correlation between regression residuals from
one period and the previous period.
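The exact statistic can also be computed directly from the residuals as the sum of squared successive differences over the sum of squares; a sketch (statsmodels ships an equivalent as statsmodels.stats.stattools.durbin_watson):

```python
import numpy as np

def durbin_watson(resid):
    # DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2; a value near 2
    # indicates no serial correlation
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# alternating residuals indicate negative serial correlation (DW above 2)
dw = durbin_watson([1.0, -1.0, 1.0, -1.0])   # -> 3.0
```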
Define \(d_l\) as the lower value and \(d_u\) as the upper value:
o If the DW statistic is less than \(d_l\), we reject the null hypothesis of
no positive serial correlation.
o If the DW statistic is greater than \((4 – d_l)\), we reject the
null hypothesis, indicating a significant negative serial correlation.
o If the DW statistic falls between \(d_l\) and \(d_u\), the test results
are inconclusive.
o If the DW statistic is greater than \(d_u\), we fail to reject the
null hypothesis of no positive serial correlation. Refer Figure 5.7
Example 5.1: The Durbin-Watson Test for Serial Correlation
Consider a regression output with two independent variables that
generates a DW statistic of 0.654. Assume that the sample size is 15.
Test for serial correlation of the error terms at the 5%
significance level.
Solution
From the Durbin-Watson table with \(n = 15\) and \(k = 2\),
\(d_l = 0.95\) and \(d_u = 1.54\).
Since \(d = 0.654 < 0.95 = d_l\), reject the null hypothesis and
conclude that there is significant positive autocorrelation.
Example 5.2
Consider a regression model with 80 observations and two independent
variables. Assume that the correlation between the error term and the
first lagged value of the error term is 0.18. The most
appropriate decision is:
A. reject the null hypothesis of positive serial correlation.
B. fail to reject the null hypothesis of positive serial correlation.
C. declare that the test results are inconclusive.
Solution
The correct answer is C. The test statistic is:
$$ DW \approx 2(1 − r) = 2(1 − 0.18) = 1.64 $$
The critical values from the Durbin-Watson table with \(n = 80\) and
\(k = 2\) are \(d_l = 1.59\) and \(d_u = 1.69\).
Because 1.59 < 1.64 < 1.69, the test results are inconclusive.
Types of Autocorrelation
Positive autocorrelation
Observations with positive autocorrelation form a smooth curve when
plotted. By adding a regression line, it can be observed that a
positive error is followed by another positive one, and a negative
error is followed by another negative one. Refer Figure 5.8
Negative autocorrelation
Conversely, negative autocorrelation means that an increase observed in one
time interval leads to a proportionate decrease in the lagged time interval.
By plotting the observations with a regression line, it can be observed that a
positive error tends to be followed by a negative one, and vice versa.
Benefits of Autocorrelation
Autocorrelation has several benefits in time series analysis:
Identifying patterns – Autocorrelation helps to identify patterns
in the time series data, which can provide insights into the
behavior of the variable over time.
Model selection – Autocorrelation can be used to select appropriate models for
time series analysis.
Forecasting – Autocorrelation can help to forecast future values
of a time series variable.
Validating assumptions – Autocorrelation can be used to
validate assumptions of statistical models.
Hypothesis testing – Autocorrelation can affect the results of
hypothesis tests, such as t-tests and F-tests; accounting for
autocorrelation leads to more reliable conclusions.
Survival curves
The fundamental concept in survival analysis is the survival curve,
S(t), as in figure 5.10, which is a function that maps from a duration,
t, to the probability of surviving longer than t; it is just the
complement of the CDF:
S(t) = 1 − CDF(t)
where CDF(t) is the probability of a lifetime less than or equal to t.
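A sketch of computing S(t) from an empirical CDF; the lifetimes below are made-up values for illustration:

```python
import numpy as np

lifetimes = np.array([1, 2, 2, 3, 5, 8])     # assumed observed durations
ts = np.arange(0, 10)

# empirical CDF(t): fraction of lifetimes <= t
cdf = np.array([(lifetimes <= t).mean() for t in ts])
surv = 1 - cdf                               # S(t) = 1 - CDF(t)
# S(0) = 1.0 (everyone survives past 0); S(9) = 0.0 (no one survives past 8)
```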
From the survival curve we can derive the hazard function; for
pregnancy lengths, the hazard function maps from a time, t, to
the fraction of pregnancies that continue until t and then end at t.
Censoring
In longitudinal studies the exact survival time is known only for those
individuals who show the event of interest during the follow-up period.
For the remaining individuals the survival time is not known exactly;
these are called censored observations.
The following terms are used in relation to censoring:
Right censoring: a subject is right censored if it is known that
failure occurs some time after the recorded follow-up period.
Left censoring: a subject is left censored if it is known that the
failure occurs some time before the recorded follow-up period.
Interval censoring: a subject is interval censored if it is known that
the event occurs between two times, but the exact time of failure is
not known.
Truncation
A truncation period means that the outcome of interest cannot
possibly occur.
A censoring period means that the outcome of interest may
have occurred.
There are two types of truncation:
Left truncation: a subject is left truncated if it enters the population at risk
some stage after the start of the follow-up period.
Right truncation: a subject is right truncated if it leaves the
population at risk some stage after the study start.
12. What are the Effective Strategies for Handling Missing Values in
Data Analysis?
Missing Value
Missing data is defined as the values or data that are not stored
for some variable(s) in the given dataset.
Below is a sample of the missing data from the Titanic dataset.
The columns ‘Age’ and ‘Cabin’ have some missing values.
Missing completely at random (MCAR): Missing data are randomly
distributed across the variable and unrelated to other variables.
Missing at random (MAR): Missing data are not randomly distributed
but are accounted for by other observed variables.
Missing not at random (MNAR): Missing data systematically differ
from the observed values.
Missing Completely At Random (MCAR)
In MCAR, the probability of data being missing is the same for all
the observations.
In this case, there is no relationship between the missing data
and any other values observed or unobserved within the given
dataset.
That is, missing values are completely independent of
other data. There is no pattern.
In the case of MCAR data, the value could be missing due to
human error, some system/equipment failure, loss of sample, or
some unsatisfactory technicalities while recording the values.
For Example, suppose in a library there are some overdue
books. Some values of overdue books in the computer system are
missing. The reason might be a human error, like the librarian
forgetting to type in the values.
13. Compare and contrast between multiple regression and logistic regression techniques
with example. (Apr/May 2024)
Multiple regression
Explaining or predicting a single Y variable from two or more X variables is called multiple
regression. The goals of multiple regression are:
Uses
The first is to determine the dependent variable based on multiple independent variables. For
example, you may be interested in determining what a crop yield will be based on
temperature, rainfall, and other independent variables.
The second is to determine how strong the relationship is between each variable. For example,
you may be interested in knowing how a crop yield will change if rainfall increases or the
temperature decreases.
Logistic Regression:
Logistic regression is a technique for predicting categorical outcomes with
two possible categories.
The model delivers a binary or dichotomous outcome limited to two possible outcomes:
yes/no, 0/1, or true/false.
This type of statistical model (also known as logit model) is often used for classification
and predictive analytics. Since the outcome is a probability, the dependent variable is
bounded between 0 and 1.
Logistic regression is commonly used for prediction and classification problems. Some of these
use cases include:
Fraud detection: Logistic regression models can help teams identify data anomalies, which
are predictive of fraud. Certain behaviors or characteristics may have a higher association with
fraudulent activities, which is particularly helpful to banking and other financial institutions in
protecting their clients.
Disease prediction: In medicine, this analytics approach can be used to predict the likelihood
of disease or illness for a given population. Healthcare organizations can set up preventative care
for individuals that show higher propensity for specific illnesses.
Temp in Celsius: 10 20 30 40 50 60 70 80 90
Find the linear regression equation. Also find the estimated life time when
temperature is 55. (Apr/May 2024)
Solution
The best line equation (simple linear regression equation) is: Y = -22.52X + 1463.1
Identify the temperature range where 53 °C falls. The temperature range
is 50 < T < 60. Find the corresponding life time for that temperature
range: the life time ranges from 176 hours down to 117 hours.
Since 53 °C is closer to 50 °C, we use the life time value
for 50 °C, which is 176 hours.
Estimate the life time for 53 °C by interpolating between the life times
for 50 °C and 60 °C.
The interpolated life time for 53 °C is approximately 171.95 hours.