Predictive Analytics

Linear least squares - implementation - goodness of fit - testing a linear model - weighted resampling. Regression using StatsModels - multiple regression - nonlinear relationships - logistic regression - estimating parameters - Time series analysis - moving averages - missing values - serial correlation - autocorrelation. Introduction to survival analysis.
Contents
Linear Least Squares
Regression using StatsModels
Multiple Regression
Logistic Regression
Time Series Analysis
Introduction to Survival
Two Marks Questions with Answers
5.1 Linear Least Squares
Least squares method
• The method of least squares is about estimating parameters by minimizing the squared discrepancies between observed data, on the one hand, and their expected values on the other.
• The Least Squares (LS) criterion states that the sum of the squares of errors is minimum. The least-squares solutions yield y(x) whose elements sum to 1, but do not ensure the outputs to be in the range [0, 1].
• How do we draw such a line based on the observed data points ? Suppose an imaginary line of y = a + bx, so that
    E(Y) = a + bx
Fig. 5.1.1
• Imagine a vertical distance between the line and a data point, E = Y − E(Y). This error is the deviation of the data point from the imaginary line, the regression line. Then what are the best values of a and b ? The a and b that minimize the sum of such errors.
• Deviation does not have good properties for computation. Then why do we use squares of deviations ? Let us get a and b that minimize the sum of squared deviations rather than the sum of deviations. This method is called least squares.
• The least squares method minimizes the sum of squares of errors. Such a and b are called least squares estimators, i.e. estimators of the parameters α and β.
• The process of getting parameter estimators (e.g., a and b) is called estimation. The least squares method is the estimation method of Ordinary Least Squares (OLS).
Disadvantages of least squares :
1. Lack of robustness to outliers.
2. Certain datasets are unsuitable for least squares classification.
3. Decision boundary corresponds to the ML solution.
Example 5.1.4 : Fit a straight line y = mx + b to the points (x, y) = (3.00, 4.50), (4.25, 4.25), (5.50, 5.50) and (8.00, 5.50). Compute m and b by least squares.
Solution : Represent the observations in matrix form A X = L, where

    A = | 3.00  1 |        | 4.50 |
        | 4.25  1 |,  L =  | 4.25 |,  X = | m |
        | 5.50  1 |        | 5.50 |       | b |
        | 8.00  1 |        | 5.50 |

    X = (Aᵀ A)⁻¹ Aᵀ L = | 121.3125  20.7500 |⁻¹ | 105.8125 |  =  | 0.246 |
                        |  20.7500   4.0000 |   |  19.7500 |     | 3.663 |

so m = 0.246 and b = 3.663. The residuals are

    V = A X − L = [ −0.10, 0.46, −0.48, 0.13 ]ᵀ
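The same computation can be checked with a short NumPy sketch (NumPy is not used in the text; the points are those of the example above) :

import numpy as np

# Data points from Example 5.1.4
x = np.array([3.00, 4.25, 5.50, 8.00])
y = np.array([4.50, 4.25, 5.50, 5.50])

# Design matrix A with a column of ones for the intercept b
A = np.column_stack([x, np.ones_like(x)])

# Normal equations: X = (A^T A)^-1 A^T L
m, b = np.linalg.solve(A.T @ A, A.T @ y)
print(m, b)                         # approximately 0.246 and 3.663

# Residuals V = A X - L
print(A @ np.array([m, b]) - y)     # approximately [-0.10, 0.46, -0.48, 0.13]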
5.1.1 Goodness of Fit
• A goodness-of-fit test, in general, refers to measuring how well the observed data correspond to the fitted (assumed) model. The goodness-of-fit test compares the observed values to the expected (fitted or predicted) values.
• Goodness-of-fit tests are frequently applied in business decision making. For example, the image below depicts the linear regression function. The goodness-of-fit test here will compare the actual observed values, denoted by dots, to the predicted values, denoted by the regression line.
Fig. 5.1.2 : Goodness of fit
• Broadly, goodness of fit tests can be categorized based on the distribution of the predicted variable of the dataset :
a) The chi-square test   b) Kolmogorov-Smirnov test   c) Anderson-Darling test.
5.1.2 Testing a Linear Model
• The following measures are used to validate the simple linear regression models :
1. Co-efficient of determination (R-square).
2. Hypothesis test for the regression coefficient β₁.
3. Analysis of variance for overall model validity (relevant more for multiple linear regression).
4. Residual analysis to validate the regression model assumptions.
5. Outlier analysis.
• The primary objective of regression is to explain the variation in Y using the knowledge of X. The coefficient of determination (R-square) measures the percentage of variation in Y explained by the model (β₀ + β₁X).
Characteristics of R-square :
• Here are some basic characteristics of the measure :
1. Since R² is a proportion, it is always a number between 0 and 1.
2. If R² = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y.
3. If R² = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y.
• R-square is used in the linear regression setting. More specifically, R² indicates the proportion of the variance in the dependent variable (Y) that is predicted or explained by linear regression and the predictor variable (X, also known as the independent variable).
• In general, a high R² value indicates that the model is a good fit for the data, although interpretations of fit depend on the context of the analysis. An R² of 0.35, for example, indicates that 35 percent of the variation in the outcome has been explained just by predicting the outcome using the covariates included in the model.
• That percentage might be a very high portion of variation to predict in a field such as the social sciences; in other fields, such as the physical sciences, one would expect R² to be much closer to 100 percent.
• The theoretical minimum R² is 0. However, since linear regression is based on the best possible fit, R² will always be greater than zero, even when the predictor and outcome variables bear no relationship to one another.
• R² increases when a new predictor variable is added to the model, even if the new predictor is not associated with the outcome. To account for that effect, the adjusted R² incorporates the same information as the usual R² but then also penalizes for the number of predictor variables included in the model.
• As a result, R² increases as new predictors are added to a multiple linear regression model, but the adjusted R² increases only if the increase in R² is greater than one would expect from chance alone. In such a model, the adjusted R² is the most realistic estimate of the proportion of the variation that is predicted by the covariates included in the model.
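As a rough illustration of these definitions (not taken from the text), R² and the adjusted R² can be computed directly from the residual and total sums of squares; here n is the number of observations and k the number of predictors :

import numpy as np

def r_squared(y, y_hat):
    """Proportion of variation in y explained by the fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, k):
    """Adjusted R^2 penalizes for the number of predictors k."""
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)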
5.1.3 Spurious Regression
• The regression is spurious when we regress one random walk onto another independent random walk. It is spurious because the regression will most likely indicate a non-existing relationship (a small simulation illustrating this follows the list) :
1. The coefficient estimate will not converge toward zero (the true value). Instead, in the limit the coefficient estimate will follow a non-degenerate distribution.
2. The t value most often is significant.
3. R² is typically very high.
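A small simulation (an assumption for illustration, not part of the text) shows the effect : regressing one random walk on another, independent random walk routinely yields a large t-value and a sizeable R², even though the series are unrelated.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500

# Two independent random walks (cumulative sums of white noise)
y = np.cumsum(rng.normal(size=n))
x = np.cumsum(rng.normal(size=n))

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.tvalues[1])   # t-value of the slope: usually far from 0
print(model.rsquared)     # R^2: often surprisingly high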
• Spurious regression is linked to serially correlated errors.
• Granger and Newbold (1974) pointed out that, along with large t-values, strong evidence of serially correlated errors will appear in regression analysis, stating that when a low value of the Durbin-Watson statistic is combined with a high value of the t-statistic, the relationship is not true.
Hypothesis Test for Regression Co-efficient (t-Test)
• The regression co-efficient (β₁) captures the existence of a linear relationship between the response variable and the explanatory variable.
• If β₁ = 0, we can conclude that there is no statistically significant linear relationship between the two variables.
• Using the Analysis of Variance (ANOVA), we can test whether the overall model is statistically significant. However, for a simple linear regression, the null and alternative hypotheses in ANOVA and the t-test are exactly the same and thus there will be no difference in the p-value.
Residual analysis
© Residual (error) analysis is important to check whether the assumptions of regression
models have been satisfied. It is performed to check the following :
1. The residuals are normally distributed.
2. The variance of residual is constant (homoscedasticity).
3. The functional form of regression is correctly specified.
4. If there are any outliers.
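A hedged sketch of how these checks might be run with statsmodels and SciPy; the data are made up and the specific tests (Shapiro-Wilk, Breusch-Pagan, studentized residuals) are common choices rather than ones prescribed by the text :

import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

# Illustrative data (assumed): fit a simple OLS model first
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)
X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

resid = results.resid
print(stats.shapiro(resid))       # 1. normality of residuals
print(het_breuschpagan(resid, X)) # 2. constant variance (homoscedasticity)
outliers = np.abs(results.get_influence().resid_studentized_internal) > 3
print(outliers.sum())             # 5. count of potential outliers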
5.2 Regression using StatsModels
Linear regression statsmodel is the model that helps us to predict and is used for fitting up
the scenario where one parameter is directly dependent on the other parameter. Here, we
have one variable that is dependent and the other one which is independent. Depending on
the change in the value of the independent parameter, we need to predict the change in the
dependent variable.
The statsmodels library has more advanced statistical tools as compared to scikit-learn. Moreover, its regression analysis tools can give more detailed results.
• There are four available classes of the properties of the regression model that will help us to use the statsmodel linear regression. The classes are as follows :
a) Ordinary Least Squares (OLS)
b) Weighted Least Squares (WLS)
c) Generalized Least Squares (GLS)
d) GLSAR - Feasible generalized least squares along with the errors that are autocorrelated.
• The statsmodel linear regression model helps to predict or estimate the values of the dependent variables as and when there is a change in the independent quantities.
• statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration.
• Statsmodels is built on top of NumPy, SciPy and matplotlib.
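A minimal sketch of fitting two of these classes, OLS and WLS, on made-up data (the data and the weights are illustrative assumptions) :

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 3.0 + 1.2 * x + rng.normal(size=50)
X = sm.add_constant(x)                                 # add intercept column

ols_res = sm.OLS(y, X).fit()                           # Ordinary Least Squares
wls_res = sm.WLS(y, X, weights=1.0 / (1 + x)).fit()    # Weighted Least Squares

print(ols_res.params)      # estimated intercept and slope
print(ols_res.summary())   # detailed regression output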
5.3 Multiple Regression
• Let's see how big the difference in birth weight between first babies and others really is. The mothers of first babies are 3.59 years younger. Running the linear model again, we get the change in birth weight as a function of age.
• The slope is 0.0175 pounds per year. If we multiply the slope by the difference in ages, we get the expected difference in birth weight for first babies and others, due to mother's age.
• The result is 0.063, just about half of the observed difference. So we conclude, tentatively, that the observed difference in birth weight can be partly explained by the difference in mother's age.
• Using multiple regression, we can explore these relationships more systematically.
• The first line creates a new column named isfirst that is True for first babies and False otherwise. Then we fit a model using isfirst as an explanatory variable.
• Here are the sample results :
• Because isfirst is a boolean, ols treats it as a categorical variable, which means that the values fall into categories, like True and False, and should not be treated as numbers. The estimated parameter is the effect on birth weight when isfirst is true, so the result, −0.125 lbs, is the difference in birth weight between first babies and others.
• The slope and the intercept are statistically significant, which means that they were unlikely to occur by chance, but the R² value for this model is small, which means that isfirst doesn't account for a substantial part of the variation in birth weight.
• The results are similar with agepreg :
• Again, the parameters are statistically significant, but R² is low.
• These models confirm results we have already seen. But now we can fit a single model that includes both variables. With the formula totalwgt_lb ~ isfirst + agepreg, we get :
• In the combined model, the parameter for isfirst is smaller by about half, which means that part of the apparent effect of isfirst is actually accounted for by agepreg. And the p-value for isfirst is about 2.5 %, which is on the border of statistical significance.
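A hedged sketch of this combined model using the statsmodels formula API; the DataFrame live, its columns (totalwgt_lb, agepreg, birthord) and the rule isfirst = (birthord == 1) are assumptions that follow the birth-weight example discussed above :

import statsmodels.formula.api as smf

# 'live' is assumed to be a pandas DataFrame of live births with columns
# totalwgt_lb, agepreg and birthord (column names follow the example in the text)
live['isfirst'] = live.birthord == 1      # True for first babies, False otherwise

# Single model that includes both explanatory variables
model = smf.ols('totalwgt_lb ~ isfirst + agepreg', data=live)
results = model.fit()
print(results.summary())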
5.3.1 Nonlinear Relationship
• Remembering that the contribution of agepreg might be nonlinear, we might consider adding a variable to capture more of this relationship. One option is to create a column, agepreg2, that contains the squares of the ages, as in the sketch below :
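A minimal sketch, continuing the assumed DataFrame live from the previous example :

import statsmodels.formula.api as smf

# Add a squared-age column to capture a possible quadratic contribution of mother's age
live['agepreg2'] = live.agepreg ** 2

model = smf.ols('totalwgt_lb ~ isfirst + agepreg + agepreg2', data=live)
results = model.fit()
print(results.summary())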
• The parameter of agepreg2 is negative, so the parabola curves downward, which is consistent with the shape of the lines in Fig. 5.3.1.
Fig. 5.3.1 : Residuals of the linear fit
* The quadratic model of agepreg accounts for more of the variability in birth weight; the
parameter for isfirst is smaller in this model and no longer statistically significant.
* Using computed variables like agepreg2 is a common way to fit polynomials and other
functions to data. This process is still considered linear regression, because the dependent
variable is a linear function of the explanatory variables, regardless of whether some variables are nonlinear functions of others.
5.4 Logistic Regression
• Logistic regression is a form of regression analysis in which the outcome variable is binary or dichotomous. It is a statistical method used to model dichotomous or binary outcomes using predictor variables.
• Logistic regression is one of the supervised learning algorithms.
• The binary logistic regression model is given by,

    P(Y) = e^Z / (1 + e^Z)

  where Z = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ... + βₙXₙ  (X₁, X₂, ..., Xₙ are independent variables).
• The logistic regression function is rewritten as,

    ln [ P(Y) / (1 − P(Y)) ] = Z = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ
• Multiple linear regression is an extension of linear regression, which allows a response variable, y, to be modeled as a linear function of two or more predictor variables.
• The logit function is similar to a multiple linear regression model. Such models are called Generalized Linear Models (GLM); in GLM the errors do not follow a normal distribution and there exists a transformation function of the outcome variable that takes a linear function.
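As an illustration of the model P(Y) = e^Z / (1 + e^Z), a minimal sketch that fits a binary logistic regression with statsmodels on made-up data (the data and coefficients are assumptions) :

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))                  # two independent variables X1, X2
z = -0.5 + 1.5 * X[:, 0] - 2.0 * X[:, 1]       # Z = b0 + b1*X1 + b2*X2
p = 1.0 / (1.0 + np.exp(-z))                   # P(Y) = e^Z / (1 + e^Z)
y = rng.binomial(1, p)                         # binary outcome

logit_model = sm.Logit(y, sm.add_constant(X))
result = logit_model.fit()                     # maximum likelihood estimation
print(result.params)                           # estimated coefficients (log odds)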
5.4.1 Estimation of Parameters in Logistic Regression
© Parameter estimates (also called coefficients) are the log odds ratio associated with a one-
unit change of the predictor, all other predictors being held constant. For each term
involving a categorical variable, a number of dummy predictor variables are created to
predict the effect of each different level.
Regression parameters in the case of logistic regression are estimated using Maximum
Likelihood Estimator (MLE). In binary logistic regression, the response variable Y takes
only two values (Y = 0 and 1).
© The unknown model parameters are estimated using maximum-likelihood estimation.
© A coefficient describes the size of the contribution of that predictor; a large coefficient
indicates that the variable strongly influences the probability of that outcome, while a near-
zero coefficient indicates that the variable has little influence on the probability of that
outcome.
• A positive sign indicates that the explanatory variable increases the probability of the
outcome, while a negative sign indicates that the variable decreases the probability of that
outeome. A confidence interval for each parameter shows the uncertainty in the estimate.
    P(Y = 1 | Z) = π(Z),  where Z = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ  and

    π(Z) = e^Z / (1 + e^Z)

• The probability function of binary logistic regression for a specific observation Yᵢ is given by

    P(Yᵢ) = π(Zᵢ)^Yᵢ [1 − π(Zᵢ)]^(1 − Yᵢ)

• The log-likelihood function is given by,

    ln(L) = LL = Σᵢ [ Yᵢ ln π(Zᵢ) + (1 − Yᵢ) ln(1 − π(Zᵢ)) ]
5.4.2 Logistic Regression Model Diagnostics
Regression models for categorical outcomes should be evaluated for fit and adherence to
model assumptions. There are two main elements of such an assessment : Discrimination
and calibration.
• Discrimination measures the ability of the model to correctly classify observations into outcome categories. Calibration measures how well the model-estimated probabilities agree with the observed outcomes, and it is typically evaluated via a goodness-of-fit test.
• The (binary) logistic regression model describes the relationship between a binary outcome variable and one or more predictor variables.
Here we discuss four test models :
1, Omnibus test : The omnibus test is a likelihood-ratio chi-square test of the current
model versus the null model. The significance value of less than 0.05 indicates that the
current model outperforms the null model. Omnibus tests are generic statistical tests
used for checking whether the variance explained by the model is more than the
unexplained variance,
2. Wald's test : Wald's test is used for checking whether an individual explanatory variable is statistically significant. Wald's test is a chi-square test. A Wald test calculates a Z statistic, which is :

    Z = β̂ / SE(β̂)
   This value is squared, which yields a chi-square distribution, and is used as the Wald test statistic.
3. Hosmer-Lemeshow test : It is a chi-square goodness of fit test for binary logistic regression.
4. Pseudo R² : Pseudo R² is a measure of goodness of the model. It is called pseudo R² because it does not have the same interpretation of R² as in the MLR model.
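Continuing the hypothetical fitted Logit result from the earlier sketch, some of these diagnostics can be read directly from statsmodels : the likelihood-ratio (omnibus) test, the Wald z statistics and McFadden's pseudo R² :

# 'result' is the fitted statsmodels Logit result from the earlier sketch
print(result.llr, result.llr_pvalue)   # likelihood-ratio (omnibus) test vs. the null model
print(result.params / result.bse)      # Wald z statistics, Z = beta / SE(beta)
print(result.tvalues)                  # the same Wald z values, as reported by statsmodels
print(result.prsquared)                # McFadden's pseudo R-squared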
5.4.3 Variable Selection in Logistic Regression
• Variable selection is an important consideration when creating logistic regression models. Variables must be selected carefully so that the model makes accurate predictions, but without over-fitting the data.
Forward LR (Likelihood Ratio) :
• In forward LR, at each step one variable is added to the model. The following steps are used in building a logistic regression model using the forward LR selection method :
1. Start with no variables in the model.
2. For each independent variable, calculate the difference between the −2LL value of the current model and the −2LL value of the model after adding that variable, and add the variable that gives the largest significant improvement.
3. Repeat step 2 till all the variables are exhausted or the change in −2LL is not significant, that is, the p-value after adding a new variable is greater than 0.05.
• Forward selection (Wald) : Stepwise selection method with entry testing based on the significance of the score statistic and removal testing based on the probability of the Wald statistic.
• Method selection allows you to specify how independent variables are entered into the analysis. Using different methods, you can construct a variety of regression models from the same set of variables.
• Enter : A procedure for variable selection in which all variables in a block are entered in a single step.
• Forward selection (Conditional) : Stepwise selection method with entry testing based on the significance of the score statistic and removal testing based on the probability of a likelihood-ratio statistic based on conditional parameter estimates.
• Forward selection (Likelihood Ratio) : Stepwise selection method with entry testing based on the significance of the score statistic and removal testing based on the probability of a likelihood-ratio statistic based on the maximum partial likelihood estimates.
« Forward selection (Wald) : Stepwise selection method with entry testing based on the
significance of the score statistic and removal testing based on the probability of the Wald
statistic.
• Backward elimination (Conditional) : Backward stepwise selection. Removal testing is based on the probability of the likelihood-ratio statistic based on conditional parameter estimates.
• Backward elimination (Likelihood Ratio) : Backward stepwise selection. Removal testing is based on the probability of the likelihood-ratio statistic based on the maximum partial likelihood estimates.
• Backward elimination (Wald) : Backward stepwise selection. Removal testing is based on the probability of the Wald statistic.
5.5 Time Series Analysis
• First of all, we will create a scatter plot of dates and values in Matplotlib using plt.plot_date(). We will be using Python's built-in datetime module (datetime, timedelta) for parsing the dates. So, let us create a Python file called 'plot_time_series.py', sketched below.
Output : A scatter plot of the values plotted against their dates.
5.5.1 Missing Values
Data can have missing values for a number of reasons such as observations that were not
recorded and data corruption. Handling missing data is important as many machine learning
algorithms do not support data with missing values.
• We can load the dataset as a Pandas DataFrame and print summary statistics on each column.
• In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN. Values with a NaN value are ignored from operations like sum, count, etc.
• Use the isnull() method to detect the missing values. Pandas DataFrame provides a function isnull(); it creates a new dataframe of the same size as the calling dataframe, containing only True and False values, with True at the places where the original dataframe has NaN and False at other places.
Encoding missingness :
• The fillna() function is used to fill NA/NaN values using the specified method.
Syntax :
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Where,
1. value : It is a value that is used to fill the null values.
2. method : A method that is used to fill the null values.
3. axis : It takes an int or string value for rows/columns.
4. inplace : If it is true, it fills values at an empty place.
5. limit : It is an integer value that specifies the maximum number of consecutive forward/backward NaN value fills.
6. downcast : It takes a dict that specifies what to downcast, like float64 to int64.
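A short sketch of isnull() and fillna() in use; the DataFrame is a made-up example :

import numpy as np
import pandas as pd

df = pd.DataFrame({'temp': [21.0, np.nan, 23.5, np.nan, 25.0]})

print(df.isnull())           # True where values are NaN
print(df.isnull().sum())     # count of missing values per column

filled_const = df.fillna(value=0)            # replace NaN with a constant
filled_ffill = df.fillna(method='ffill')     # forward-fill from the previous value
print(filled_const)
print(filled_ffill)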
5.5.2 Serial Correlation
• Serial correlation is the relationship between a given variable and a lagged version of itself over various time intervals. It measures the relationship between a variable's current value given its past values.
¢ A variable that is serially correlated indicates that it may not be random. Technical analysts
validate the profitable patterns of a security or group of securities and determine the risk
associated with investment opportunities.
• The most common form of serial correlation is called first-order serial correlation, in which the error in time t is related to the previous period's (t − 1) error :

    εₜ = ρ εₜ₋₁ + uₜ ,   −1 < ρ < 1

• ρ > 0 indicates positive serial correlation - the error terms will tend to have the same sign from one period to the next period.
• ρ < 0 indicates negative serial correlation - the error terms will tend to have a different sign from one period to the next period.
Impure serial correlation
• This type of serial correlation is caused by a specification error such as an omitted variable or ignoring nonlinearities. Suppose the true regression equation is given by,

    Yₜ = β₀ + β₁X₁ₜ + β₂X₂ₜ + εₜ

  but X₂ₜ is omitted from the model. The error term εₜ will then capture the effect of X₂ₜ.
• Since many economic variables exhibit trends over time, X₂ₜ is likely to depend on X₂,ₜ₋₁, X₂,ₜ₋₂, ... This will translate into a seeming correlation between εₜ and εₜ₋₁, εₜ₋₂, ..., and this serial correlation would violate the no-serial-correlation assumption.
• A specification error of the functional form can also cause this type of serial correlation. Suppose the true regression equation between Y and X is quadratic but we assume it is linear. The error term will then depend on X².
The consequences of serial correlation :
1. Pure serial correlation does not cause bias in the regression coefficient estimates.
2. Serial correlation causes OLS to no longer be a minimum variance estimator.
3. Serial correlation causes the estimated variances of the regression coefficients to be
biased, leading to unreliable hypothesis testing.
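First-order serial correlation in the residuals of a fitted model is commonly checked with the Durbin-Watson statistic mentioned earlier; a hedged sketch using statsmodels (values near 2 suggest no serial correlation, values well below 2 suggest positive serial correlation) :

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
x = np.arange(100, dtype=float)

# Illustrative errors with first-order serial correlation: e_t = rho * e_{t-1} + u_t
rho, e = 0.8, np.zeros(100)
for t in range(1, 100):
    e[t] = rho * e[t - 1] + rng.normal()

y = 1.0 + 0.5 * x + e
results = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(results.resid))   # well below 2, indicating positive serial correlation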
5.5.3 Autocorrelation
• Autocorrelation refers to the degree of correlation of the same variable between two successive time intervals. It measures how the lagged version of the value of a variable is related to the original version of it in a time series.
• The value of autocorrelation varies between +1 and −1. If the autocorrelation of a series is a very small value, that does not mean there is no correlation; the correlation could be non-linear. A value between −1 and 0 represents negative autocorrelation. A value between 0 and 1 represents positive autocorrelation.
• Autocorrelation gives information about the trend of a set of historical data, so it can be useful in the technical analysis for the equity market.
« Fig. 5.5.1 shows positive and negative autocorrelation.
(a) Positive autocorrelation (b) Negative autocorrelation
Fig. 5.5.1
A technical analyst can learn how the stock price of a particular day is affected by those of
previous days through autocorrelation. Thus, he/she can estimate how the price will move
in the future.
• If the price of a stock with strong positive autocorrelation has been increasing for several days, the analyst can reasonably estimate that the price will continue to move upward in the coming days. The analyst may buy and hold the stock for a short period of time to profit from the upward price movement.
«The autocorrelation analysis only provides information about short-term trends and tells
little about the fundamentals of a company. Therefore, it can only be applied to support the
trades with short holding periods.
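A short sketch of computing lagged autocorrelations of a series with pandas' autocorr() and the statsmodels acf() function; the price series is made up :

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(4)
prices = pd.Series(100 + np.cumsum(rng.normal(size=250)))   # made-up daily prices

print(prices.autocorr(lag=1))    # autocorrelation at lag 1 (between -1 and +1)
print(acf(prices, nlags=5))      # autocorrelations at lags 0..5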
5.6 Introduction to Survival
© Survival analysis is used to analyze data in which the time until the event is of interest. The
response is often referred to as a failure time, survival time or event time.
© Originally, this branch of statistics developed around measuring the effects of medical
treatment on patient’s survival in clinical trials. For example, imagine a group of cancer
patients who are administered a certain new form of treatment. Survival analysis can be
used for analyzing the results of that treatment in terms of the patients’ life expectancy.
Censoring :
• Censoring is present when we have some information about a subject's event time, but we don't know the exact event time. For the analysis methods we will discuss to be valid, the censoring mechanism must be independent of the survival mechanism.
• There are generally three reasons why censoring might occur :
a. A subject does not experience the event before the study ends.
b. A person is lost to follow-up during the study period.
c. A person withdraws from the study.
• These are all examples of right-censoring.
• Types of right-censoring :
1. Fixed type I censoring occurs when a study is designed to end after C years of follow-
up. In this case, everyone who does not have an event observed during the course of the
study is censored at C years.
2. In random type I censoring, the study is designed to end after C years, but censored
subjects do not all have the same censoring time. This is the main type of right-
censoring we will be concerned with.
3. In type II censoring, a study ends when there is a pre-specified number of events.
• The survival function is a function of time (t) and can be represented as :

    S(t) = Pr(T > t)
where Pr() stands for the probability and T for the time of the event of interest for a random
observation from the sample. We can interpret the survival function as the probability of the
event of interest (for example, the death event) not occurring by the time t.
* The survival function takes values in the range between 0 and 1 (inclusive) and is a non-
increasing function of t.
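As an illustration of S(t) = Pr(T > t), a minimal sketch estimating an empirical survival function from a set of fully observed (uncensored) event times; properly handling censored observations needs methods such as Kaplan-Meier, which this sketch does not cover :

import numpy as np

# Made-up event times (e.g., months until the event), all fully observed
event_times = np.array([2, 3, 3, 5, 8, 8, 9, 12, 15, 20])

def survival(t, times=event_times):
    """Empirical S(t) = Pr(T > t): fraction of observations surviving beyond time t."""
    return np.mean(times > t)

for t in [0, 5, 10, 20]:
    print(t, survival(t))   # S(t) is non-increasing and lies between 0 and 1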
5.7 Two Marks Questions with Answers
Q.1 What is logistic regression ?
Ans. : Logistic regression is a form of regression analysis in which the outcome variable is
binary or dichotomous. A statistical method used to model dichotomous or binary outcomes
using predictor variables. Logistic regression is one of the supervised learning algorithms.
Q.2 What is an omnibus test ?
Ans. : The omnibus test is a likelihood-ratio chi-square test of the current model versus the null
model. The significance value of less than 0.05 indicates that the current model outperforms
the null model. Omnibus tests are generic statistical tests used for checking whether the
variance explained by the model is more than the unexplained variance.
Q.3 Define serial correlation.
Ans. : Serial correlation is the relationship between a given variable and a lagged version of itself over various time intervals. It measures the relationship between a variable's current value given its past values.
Q.4 What are the consequences of serial correlation ?
Ans. : 1. Pure serial correlation does not cause bias in the regression coefficient estimates.
2. Serial correlation causes OLS to no longer be a minimum variance estimator.
3. Serial correlation causes the estimated variances of the regression coefficients to be biased, leading to unreliable hypothesis testing.
Q.5 Define autocorrelation.
Ans. : Autocorrelation refers to the degree of correlation of the same variable between two
successive time intervals. It measures how the lagged version of the value of a variable is
related to the original version of it in a time series.
Q.6 What are the reasons for censoring ?
Ans. : There are generally three reasons why censoring might occur :
a. A subject does not experience the event before the study ends.
b. A person is lost to follow-up during the study period.
c. A person withdraws from the study.
Q.7 Explain regression using statsmodels.
Ans. : Linear regression statsmodel is the model that helps us to predict and is used for fitting
up the scenario where one parameter is directly dependent on the other parameter. Here, we
have one variable that is dependent and the other one which is independent. Depending on the
change in the value of the independent parameter, we need to predict the change in the
dependent variable.
Q.8 Why is residual analysis important ?
Ans. : Residual (error) analysis is important to check whether the assumptions of regression
models have been satisfied. It is performed to check the following :
1. The residuals are normally distributed.
2. The variance of residual is constant (homoscedasticity).
3. The functional form of regression is correctly specified.
4. If there are any outliers.
Q.9 What is spurious regression ?
Ans. : The regression is spurious when we regress one random walk onto another independent random walk. It is spurious because the regression will most likely indicate a non-existing relationship.