Fdsa UNIT V
PREDICTIVE ANALYTICS
Linear least squares – implementation – goodness of fit – testing a linear model – weighted resampling. Regression using StatsModels – multiple regression – nonlinear relationships – logistic regression – estimating parameters – accuracy. Time series analysis – moving averages – missing values – serial correlation – autocorrelation. Introduction to survival analysis.
1. Linear least squares:
Linear least squares (LLS) is the least squares approximation of linear functions to data.
It is a set of formulations for solving statistical problems involved in linear regression, including
variants for ordinary (unweighted), weighted, and generalized (correlated) residuals.
Main formulations:
The three main linear least squares formulations are:
● Ordinary least squares (OLS) is the most common estimator. OLS estimates are
commonly used to analyze both experimental and observational data.
● Weighted least squares (WLS) are used when heteroscedasticity is present in the error
terms of the model.
● Generalized least squares (GLS) is an extension of the OLS method that allows efficient
estimation of β when either heteroscedasticity, or correlations, or both are present among
the error terms of the model, as long as the form of heteroscedasticity and correlation is
known independently of the data.
Least squares fit:
Correlation coefficients measure the strength and sign of a relationship, but not the slope.
There are several ways to estimate the slope; the most common is a linear least squares fit. A
“linear fit” is a line intended to model the relationship between variables. A “least squares” fit is
one that minimizes the mean squared error (MSE) between the line and the data.
In other words, suppose we have a sequence of points, ys, that we want to express as a function of another sequence, xs. If there is a linear relationship between xs and ys with intercept inter and slope slope, then we expect each y[i] to be inter + slope * x[i].
But unless the correlation is perfect, this prediction is only approximate. The vertical deviation from the line, or residual, is
res = ys - (inter + slope * xs)
We might try to minimize the absolute value of the residuals, or their squares, or their cubes; the most common choice is to minimize the sum of squared residuals, sum(res**2).
Why to minimize the sum of squared residuals, sum(res**2)? There are three good reasons and
one less important one:
● Squaring has the feature of treating positive and negative residuals the same, which is
usually what we want.
● Squaring gives more weight to large residuals, but not so much weight that the largest
residual always dominates.
● If the residuals are uncorrelated and normally distributed with mean 0 and constant (but
unknown) variance, then the least squares fit is also the maximum likelihood estimator
of inter and slope.
● The values of inter and slope that minimize the squared residuals can be computed
efficiently.
2. Implementation:
def LeastSquares(xs, ys):
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)
    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx
    return inter, slope
thinkstats2 also provides FitLine, which takes a sequence xs along with inter and slope and returns the fitted line evaluated at xs.
Residuals takes sequences xs and ys and estimated parameters inter and slope. It returns
the differences between the actual values and the fitted line.
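A minimal sketch of what these helpers can look like (using numpy; the actual thinkstats2 versions may differ in detail):
import numpy as np

def FitLine(xs, inter, slope):
    # Sort xs and evaluate the fitted line at each point.
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys

def Residuals(xs, ys, inter, slope):
    # Difference between the observed ys and the values predicted by the line.
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    return ys - (inter + slope * xs)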
3. Goodness of fit:
There are several ways to measure the quality of a linear model, or goodness of fit. One
of the simplest is the standard deviation of the residuals.
If we use a linear model to make predictions, Std(res) is the root mean squared error
(RMSE) of the predictions. For example, if you use mother’s age to guess birth weight, the
RMSE of your guess would be 1.40 lbs.
Another way to measure goodness of fit is the coefficient of determination, usually denoted R2
and called “ R-squared”:
def CoefDetermination(ys, res):
    return 1 - Var(res) / Var(ys)
Var(res) is the MSE using the model, Var(ys) is the MSE without it. So their ratio is the
fraction of MSE that remains if you use the model, and R2 is the fraction of MSE the model
eliminates.
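For example, R2 for the birth weight model can be computed like this (a sketch, assuming ages and weights are the NSFG sequences of mother's age and birth weight used above):
inter, slope = thinkstats2.LeastSquares(ages, weights)
res = thinkstats2.Residuals(ages, weights, inter, slope)
r2 = thinkstats2.CoefDetermination(weights, res)   # fraction of MSE the model eliminates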
4. Testing a linear model:
To test whether an observed slope is statistically significant, we can use a hypothesis test in which the null hypothesis is that there is no relationship between the variables:
class SlopeTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        ages, weights = data
        _, slope = thinkstats2.LeastSquares(ages, weights)
        return slope

    def MakeModel(self):
        _, weights = self.data
        self.ybar = weights.mean()
        self.res = weights - self.ybar

    def RunModel(self):
        ages, _ = self.data
        weights = self.ybar + np.random.permutation(self.res)
        return ages, weights
The data are represented as sequences of ages and weights. The test statistic is the slope
estimated by LeastSquares. The model of the null hypothesis is represented by the mean weight
and the deviations. To generate simulated data, we permute the deviations and add them to the
mean.
The p-value can be estimated in two ways:
● Compute the probability that the slope under the null hypothesis exceeds the observed
slope.
● Compute the probability that the slope in the sampling distribution falls below 0. (If the
estimated slope were negative, we would compute the probability that the slope in the
sampling distribution exceeds 0.)
The second option is easier because we normally want to compute the sampling distribution
of the parameters anyway. And it is a good approximation unless the sample size is small and the
distribution of residuals is skewed. Even then, it is usually good enough, because p-values don't
have to be precise.
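Running the test is a one-liner with the thinkstats2 HypothesisTest machinery (a sketch, assuming ages and weights are the sequences used above):
ht = SlopeTest((ages, weights))
pvalue = ht.PValue()   # fraction of simulated slopes that exceed the observed slope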
5. Weighted resampling:
Definition:
A value associated with an observation in a sample that indicates what part of the
population it represents.
Explanation:
The NSFG survey deliberately oversamples several groups in order to improve the chance of getting statistically significant results; that is, in order to improve the power of tests involving
these groups. This survey design is useful for many purposes, but it means that we cannot use the
sample to estimate values for the general population without accounting for the sampling
process.
For each respondent, the data includes a variable called finalwgt, which is the number of people in the general population the respondent represents. This value is called a sampling weight, or just weight.
Example:
If you survey 100,000 people in a country of 300 million, each respondent represents
3,000 people. If you oversample one group by a factor of 2, each person in the oversampled
group would have a lower weight, about 1500.
To correct for oversampling, we can use resampling; that is, we can draw samples from
the survey using probabilities proportional to sampling weights. Then, for any quantity we want
to estimate, we can generate sampling distributions, standard errors, and confidence intervals.
● ResampleRows chooses rows from a DataFrame, giving each row the same probability. To resample with probabilities proportional to sampling weights, ResampleRowsWeighted takes a DataFrame, resamples rows according to the weights in finalwgt, and returns a DataFrame containing the resampled rows:
def ResampleRowsWeighted(df, column='finalwgt'):
    weights = df[column]
    cdf = Cdf(dict(weights))
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample
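For example, we can approximate the sampling distribution of the mean birth weight in the general population with weighted resampling (a sketch, assuming live is the NSFG DataFrame with finalwgt and totalwgt_lb columns):
estimates = [ResampleRowsWeighted(live).totalwgt_lb.mean()
             for _ in range(100)]
The standard deviation of estimates is the standard error, and its percentiles give a confidence interval.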
6. Regression:
The linear least squares fit is an example of regression, which is the more general
problem of fitting any kind of model to any kind of data. The goal of regression analysis is to
describe the relationship between one set of variables, called the dependent variables, and
another set of variables, called independent or explanatory variables.
● Regression is the task of predicting a continuous quantity.
● A regression problem requires the prediction of a quantity.
● A regression problem with multiple input variables is called a multiple (or multivariate) regression problem.
● This is in contrast to classification, where the output values represent the probability that a data point belongs to each output class.
statsmodels provides two interfaces (APIs); the formula API uses strings to identify the
dependent and explanatory variables. It uses a syntax called patsy; in this example, the ~
operator separates the dependent variable on the left from the explanatory variables on the right.
smf.ols takes the formula string and the DataFrame, live, and returns an OLS object that
represents the model. The name ols stands for “ordinary least squares.”
The fit method fits the model to the data and returns a RegressionResults object that contains the results.
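Putting these steps together, a sketch of the formula API (assuming live is a DataFrame of NSFG live births with columns totalwgt_lb, the birth weight, and agepreg, the mother's age at the end of pregnancy):
import statsmodels.formula.api as smf

formula = 'totalwgt_lb ~ agepreg'      # dependent ~ explanatory
model = smf.ols(formula, data=live)    # build the OLS model
results = model.fit()                  # fit it to the data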
The results are also available as attributes. params is a Series that maps from variable
names to their parameters, so we can get the intercept and slope like this:
inter = results.params['Intercept']
slope = results.params['agepreg']
pvalues is a Series that maps from variable names to the associated p-values, so we can check whether the estimated slope is statistically significant:
slope_pvalue = results.pvalues['agepreg']
And results provides resid, a sequence of residuals, and fittedvalues, a sequence of fitted values
corresponding to agepreg. The results object provides summary(), which represents the results in
a readable format.
print(results.summary())
Linear regression:
It is a method to predict the dependent variable (Y) based on values of independent
variable (X). It can be used for the cases where we want to predict some continuous quantity.
Dependent variable: The response variable whose values need to be predicted. It should be continuous.
Independent variable: The predictor variable used to predict the response variable.
Equation:
Y = b0 + b1X + e
Where,
b0 – intercept
b1 – Slope
X – Independent variable.
Y – Dependent variable
e - error
7. Multiple regression:
Multiple linear regression refers to a statistical technique that is used to predict the
outcome of a variable based on the value of two or more variables. It is sometimes known simply
as multiple regression, and it is an extension of linear regression. The variable that we want to
predict is known as the dependent variable, while the variables we use to predict the value of
the dependent variable are known as independent or explanatory variables.
You can use multiple linear regression when you want to know:
1. How strong the relationship is between two or more independent variables and one
dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect
crop growth).
2. The value of the dependent variable at a certain value of the independent variables (e.g.
the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer
addition).
Independence of observations: the observations in the dataset were collected using statistically
valid methods, and there are no hidden relationships among variables.
In multiple linear regression, it is possible that some of the independent variables are actually
correlated with one another, so it is important to check these before developing the regression
model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them
should be used in the regression model.
Linearity: the line of best fit through the data points is a straight line, rather than a curve or
some sort of grouping factor.
To find the best-fit line for each independent variable, multiple linear regression calculates three
things:
● The regression coefficients that lead to the smallest overall model error.
● The t-statistic of the overall model.
● The associated p-value (how likely it is that the t-statistic would have occurred by chance
if the null hypothesis of no relationship between the independent and dependent variables
was true).
It then calculates the t-statistic and p-value for each regression coefficient in the model.
import pandas as pd
import statsmodels.api as sm

stock_market = {'year': [2001, 2002, 2003],
                'month': ['jan', 'june', 'march'],
                'Interest_rate': [2, 3.3, 5.2],
                'unemployment_rate': [3.22, 5.33, 7.2],
                'stock_index_price': [1254, 1236.55, 1287.43]}

df = pd.DataFrame(stock_market, columns=['year', 'month', 'Interest_rate',
                                         'unemployment_rate', 'stock_index_price'])

# Here we have 2 variables for multiple linear regression. To use one
# variable for simple linear regression, use x = df['Interest_rate'] instead.
x = df[['Interest_rate', 'unemployment_rate']]
y = df['stock_index_price']

X = sm.add_constant(x)   # adding a constant (intercept) term
model = sm.OLS(y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)
8. Nonlinear relationships:
Nonlinear regression is a form of regression analysis in which observational data are
modeled by a function which is a nonlinear combination of the model parameters and depends on
one or more independent variables. The data are fitted by a method of successive
approximations. In nonlinear regression, a statistical model of the form
y ≈ f(x, β)
relates a vector of independent variables, x, and its associated observed dependent
variables, y. The function f is nonlinear in the components of the vector of parameters β, but
otherwise arbitrary.
With a single predictor, the model is a straight line. With more than one predictor, the
model is a plane or hyperplane. While such models are adequate for representing many
relationships (at least over a limited range of the predictors), there are many cases when a more
complicated model is required.
There are several procedures for fitting nonlinear models. The models that may be fit
include:
1. Transformable nonlinear models: models involving a single predictor variable in
which transforming Y, X or both results in a linear relationship between the transformed
variables.
2. Polynomial models: models involving one or more predictor variables which include higher-order terms such as β₁,₁X₁² or β₁,₂X₁X₂ (a fitting sketch follows this list).
3. Models that are nonlinear in the parameters: models in which the partial derivatives
of Y with respect to the predictor variables involve the unknown parameters.
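As an illustration of a polynomial model (item 2 above), a quadratic term can be added with the formula API; this is a sketch, assuming live is the NSFG DataFrame used earlier with columns totalwgt_lb and agepreg:
import statsmodels.formula.api as smf

live['agepreg2'] = live.agepreg ** 2   # add the squared term as a new column
results = smf.ols('totalwgt_lb ~ agepreg + agepreg2', data=live).fit()
print(results.summary())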
9. Logistic regression:
One of the basic and most popular algorithms to solve classification problem is logistic
regression. It is a method used to predict a dependent variable, given a set of independent
variables, such that the dependent variable is categorical.
Dependent variable (Y):
The categorical response variable, typically binary: values like 0 or 1, or yes or no (extensions such as multinomial logistic regression handle categories like A, B, or C).
Independent variable (X):
The predictor variable used to predict the response variable.
Equation:
log(Y / (1-Y) ) = C + B1X1 + B2X2 + …..
where,
Y = The probability of an event to happen which you are trying to predict.
X1, X2 are the independent variables which determine the occurrence of an event (i.e) Y.
C – Constant term (intercept), which gives the log odds of the event when the explanatory variables are all 0.
B1, B2 – Coefficients of the explanatory variables.
# syntax for logistic regression:
model = sm.Logit(ytrain, xtrain)
result = model.fit()
final_result = result.summary()
The usual goal is to find the maximum-likelihood estimate (MLE), which is the set of
parameters that maximizes the likelihood of the data. For example, suppose we have the
following data:
y = np.array([0, 1, 0, 1])
x1 = np.array([0, 0, 0, 1])
x2 = np.array([0, 1, 1, 1])
And we start with the initial guesses β0 = -1.5, β1 = 2.8, and β2 = 1.1:
beta = [-1.5, 2.8, 1.1]
Then for each row we can compute log_o, convert the log odds to odds, and convert the odds to probabilities:
log_o = beta[0] + beta[1] * x1 + beta[2] * x2
o = np.exp(log_o)
p = o / (o + 1)
[ 0.182 0.401 0.401 0.916 ]
Notice that when log_o is greater than 0, o is greater than 1 and p is greater than 0.5.
The likelihood of an outcome is p when y==1 and 1-p when y==0. For example, if we
think the probability of a boy is 0.8 and the outcome is a boy, the likelihood is 0.8; if the
outcome is a girl, the likelihood is 0.2. We can compute that like this:
likes = y * p + (1-y) * (1-p)
[ 0.817 0.401 0.598 0.916 ]
For these values of beta, the likelihood of the data is the product of likes, about 0.18. The goal of logistic regression is to find parameters that maximize this likelihood.
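Putting the steps of this example together (a runnable sketch of one likelihood evaluation, using the values above):
import numpy as np

y = np.array([0, 1, 0, 1])
x1 = np.array([0, 0, 0, 1])
x2 = np.array([0, 1, 1, 1])
beta = [-1.5, 2.8, 1.1]

log_o = beta[0] + beta[1] * x1 + beta[2] * x2   # log odds for each row
o = np.exp(log_o)                               # convert log odds to odds
p = o / (o + 1)                                 # convert odds to probabilities
likes = y * p + (1 - y) * (1 - p)               # likelihood of each outcome
like = np.prod(likes)                           # overall likelihood, about 0.18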
Accuracy:
Accuracy – The number of successful predictions, compared with what we would expect by
chance.
In the NSFG data, there are more boys than girls, so the baseline strategy is to guess “boy” every time. The accuracy of this strategy is just the fraction of boys:
actual = endog['boy']
baseline = actual.mean()
Since actual is encoded in binary integers, the mean is the fraction of boys, which is 0.507.
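To count correct guesses, we compare the model's predictions with the actual outcomes; the following sketch (assuming results is the fitted logistic model) computes true_pos, the cases where we guess “boy” and get it right:
predict = (results.predict() >= 0.5)      # predict "boy" when the probability is at least 0.5
true_pos = predict * actual               # guessed "boy" and it was a boy
true_neg = (1 - predict) * (1 - actual)   # guessed "girl" and it was a girl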
Similarly, true_neg indicates the cases where we guess “girl” and get it right.
Accuracy is the fraction of correct guesses:
acc = (sum(true_pos) + sum(true_neg)) / len(actual)
The result is 0.512, slightly better than the baseline, 0.507. But, you should not take this result
too seriously. We used the same data to build and test the model, so the model may not have
predictive power on new data.
11. Time series analysis:
A time series is a sequence of measurements taken from a system that varies in time.
Moving averages:
A moving average divides the series into overlapping regions, called windows, and computes the average of the values in each window. One of the simplest moving averages is the rolling mean, which computes the mean of the values in each window. For example, if the window size is 3, the rolling mean computes the mean of values 0 through 2, 1 through 3, 2 through 4, etc. pandas provides rolling_mean, which takes a Series and a window size and returns a new Series.
>>> series = np.arange(10)
>>> series
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> pandas.rolling_mean(series, 3)
array([ nan, nan, 1, 2, 3, 4, 5, 6, 7, 8])
The first two values are nan; the next value is the mean of the first three elements, 0, 1, and 2. The next value is the mean of 1, 2, and 3, and so on.
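Note that rolling_mean has been removed from recent versions of pandas; the equivalent today is the rolling method:
import numpy as np
import pandas as pd

series = pd.Series(np.arange(10))
rolled = series.rolling(3).mean()   # first two values are NaN, then 1.0, 2.0, ...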
Missing values:
Time series data often include missing values. One way to fill them is with a moving average: wherever reindexed.ppg is nan, fillna replaces it with the corresponding value from ewma, the exponentially-weighted moving average of the series. The inplace flag tells fillna to modify the existing Series rather than create a new one.
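A sketch of this step, assuming reindexed is the daily price series reindexed to a full range of dates:
ewma = pandas.ewma(reindexed.ppg, span=30)   # exponentially-weighted moving average
reindexed.ppg.fillna(ewma, inplace=True)     # fill gaps with the corresponding EWMA value
(In recent versions of pandas, pandas.ewma is replaced by reindexed.ppg.ewm(span=30).mean().)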
A drawback of this method is that it understates the noise in the series. We can solve that
problem by adding in resampled residuals:
resid = (reindexed.ppg - ewma).dropna()
fake_data = ewma + thinkstats2.Resample(resid, len(reindexed))
reindexed.ppg.fillna(fake_data, inplace=True)
resid contains the residual values, not including days when ppg is nan. fake_data contains the
sum of the moving average and a random sample of residuals. Finally, fillna replaces nan with
values from fake_data.
Serial correlation:
Serial correlation is the correlation between each value and the next value in the series, shifted by a given lag. If a series has a persistent trend, the serial correlation will be high simply because of the trend; for example, if prices are falling, we expect to see values above the mean in the first half of the series and values below the mean in the second half. It is more interesting to see whether the correlation persists if you subtract away the trend.
For example, we can compute the residual of the EWMA and then compute its serial
correlation:
ewma = pandas.ewma(reindexed.ppg, span=30)
resid = reindexed.ppg - ewma
corr = SerialCorr(resid, 1)
With lag=1, the serial correlations for the de-trended data are -0.022 for high quality, -0.015 for
medium, and 0.036 for low. These values are small, indicating that there is little or no one-day
serial correlation in this series.
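SerialCorr itself can be sketched like this (assuming thinkstats2.Corr computes the Pearson correlation between two sequences):
def SerialCorr(series, lag=1):
    # Correlate the series with a copy of itself shifted by lag.
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    return thinkstats2.Corr(xs, ys)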
15. Autocorrelation:
The autocorrelation function is a function that maps from lag to the serial correlation with
the given lag. “Autocorrelation" is another name for serial correlation, used more often when the
lag is not 1.
StatsModels, which we used for linear regression, also provides functions for time series
analysis, including acf, which computes the autocorrelation function:
import statsmodels.tsa.stattools as smtsa
acf = smtsa.acf(filled.resid, nlags=365, unbiased=True)
acf computes serial correlations with lags from 0 through nlags. The unbiased flag tells acf to
correct the estimates for the sample size. The result is an array of correlations. If we select daily
prices for high quality, and extract correlations for lags 1, 7, 30, and 365, we can confirm that acf
and SerialCorr yield approximately the same results:
>>> acf[0], acf[1], acf[7], acf[30], acf[365]
1.000, -0.029, 0.020, 0.014, 0.044
With lag=0, acf computes the correlation of the series with itself, which is always 1.
(Figure: autocorrelation function for daily prices (left), and daily prices with a simulated weekly seasonality (right).)
Prediction:
Time series analysis can be used to investigate, and sometimes explain, the behavior of systems that vary in time. It can also be used to make predictions; when it is, it is important to quantify the expected error. In other words, we want to know how accurate the prediction is likely to be.
There are three sources of error we should take into account:
● Sampling error: The prediction is based on estimated parameters, which depend on
random variation in the sample. If we run the experiment again, we expect the estimates
to vary.
● Random variation: Even if the estimated parameters are perfect, the observed data varies
randomly around the long-term trend, and we expect this variation to continue in the
future.
● Modeling error: We have already seen evidence that the long-term trend is not linear, so
predictions based on a linear model will eventually fail.
16. Introduction to survival analysis:
Survival analysis is a way to describe how long things last. The survival curve, S(t), maps from a duration, t, to the probability of surviving longer than t; it is the complement of the CDF, S(t) = 1 - CDF(t).
Example:
SurvivalFunction provides two properties: ts, which is the sequence of lifetimes, and ss, which is the survival curve. In Python, a “property” is a method that can be invoked as if it were a variable.
We can instantiate a SurvivalFunction by passing the CDF of lifetimes:
sf = SurvivalFunction(cdf)
SurvivalFunction also provides __getitem__ and Prob, which evaluates the survival curve:
# class SurvivalFunction
def __getitem__(self, t):
    return self.Prob(t)
● Hazard function: A function that maps from t to the fraction of people alive until t who
die at t.
SurvivalFunction provides MakeHazard, which calculates the hazard function:
# class SurvivalFunction
def MakeHazard(self, label=''):
    ss = self.ss
    lams = {}
    for i, t in enumerate(self.ts[:-1]):
        hazard = (ss[i] - ss[i+1]) / ss[i]
        lams[t] = hazard
    return HazardFunction(lams, label=label)
The HazardFunction object is a wrapper for a pandas Series:
class HazardFunction(object):

    def __init__(self, d, label=''):
        self.series = pandas.Series(d)
        self.label = label
d can be a dictionary or any other type that can initialize a Series, including another
Series. label is a string used to identify the HazardFunction when plotted.
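A usage sketch, assuming cdf is a Cdf of observed lifetimes:
sf = SurvivalFunction(cdf)           # survival curve derived from the CDF of lifetimes
hf = sf.MakeHazard(label='hazard')   # hazard function derived from the survival curve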