
UNIT V

PREDICTIVE ANALYTICS
Linear least squares – implementation – goodness of fit – testing a linear model – weighted
resampling. Regression using StatsModels – multiple regression – nonlinear relationships – logistic
regression – estimating parameters – accuracy. Time series analysis – moving averages – missing
values – serial correlation – autocorrelation. Introduction to survival analysis.
1. Linear least squares:

Linear least squares (LLS) is the least squares approximation of linear functions to data.
It is a set of formulations for solving statistical problems involved in linear regression, including
variants for ordinary (unweighted), weighted, and generalized (correlated) residuals.

Main formulations:
The three main linear least squares formulations are:

● Ordinary least squares (OLS) is the most common estimator. OLS estimates are
commonly used to analyze both experimental and observational data.
● Weighted least squares (WLS) are used when heteroscedasticity is present in the error
terms of the model.
● Generalized least squares (GLS) is an extension of the OLS method that allows efficient
estimation of β when either heteroscedasticity, or correlations, or both are present among
the error terms of the model, as long as the form of heteroscedasticity and correlation is
known independently of the data.
Least squares fit:
Correlation coefficients measure the strength and sign of a relationship, but not the slope.
There are several ways to estimate the slope; the most common is a linear least squares fit. A
“linear fit” is a line intended to model the relationship between variables. A “least squares” fit is
one that minimizes the mean squared error (MSE) between the line and the data.

Suppose we have a sequence of points ys that we want to express as a function of another
sequence xs. If there is a linear relationship between xs and ys with intercept ‘inter’ and
slope ‘slope’, we expect each y[i] to be inter + slope * x[i].

Equivalently, if we call the sequences ‘y’ and ‘x’, with intercept ‘a’ and slope ‘b’, the least
squares fit expects each y[i] to be a + (b * x[i]).

But unless the correlation is perfect, this prediction is only approximate. The vertical
deviation from the line, or residual, is

res = ys - (inter + slope * xs)


The residuals might be due to random factors like measurement error, or non-random
factors that are unknown. For example, if we are trying to predict weight as a function of height,
unknown factors might include diet, exercise, and body type. If we get the
parameters inter and slope wrong, the residuals get bigger, so it makes intuitive sense that the
parameters we want are the ones that minimize the residuals.

We could minimize the absolute values of the residuals, their squares, or their cubes; the most
common choice is to minimize the sum of squared residuals, sum(res**2).

Why minimize the sum of squared residuals, sum(res**2)? There are three good reasons and
one less important one:
● Squaring has the feature of treating positive and negative residuals the same, which is
usually what we want.
● Squaring gives more weight to large residuals, but not so much weight that the largest
residual always dominates.
● If the residuals are uncorrelated and normally distributed with mean 0 and constant (but
unknown) variance, then the least squares fit is also the maximum likelihood estimator
of inter and slope.
● The values of inter and slope that minimize the squared residuals can be computed
efficiently.
2. Implementation:
def LeastSquares(xs, ys):
    # MeanVar, Mean and Cov are helper functions from thinkstats2.
    meanx, varx = MeanVar(xs)
    meany = Mean(ys)

    # The estimated slope is Cov(xs, ys) divided by Var(xs);
    # the intercept follows from the means.
    slope = Cov(xs, ys, meanx, meany) / varx
    inter = meany - slope * meanx

    return inter, slope


LeastSquares takes sequences xs and ys and returns the estimated parameters inter and slope.

thinkstats2 also provides FitLine, which takes inter and slope and returns the fitted line for a
sequence of xs.

def FitLine(xs, inter, slope):
    # Sort xs so the fitted line can be plotted from left to right.
    fit_xs = np.sort(xs)
    fit_ys = inter + slope * fit_xs
    return fit_xs, fit_ys

Another useful test is to plot the residuals. A function that computes residuals is:

def Residuals(xs, ys, inter, slope):
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    res = ys - (inter + slope * xs)
    return res

Residuals takes sequences xs and ys and estimated parameters inter and slope. It returns
the differences between the actual values and the fitted line.
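As a self-contained illustration of the same logic (a minimal sketch with made-up data, using
only NumPy rather than the thinkstats2 helpers):

import numpy as np

# Hypothetical data: heights in meters (xs) and weights in kg (ys).
xs = np.array([1.5, 1.6, 1.7, 1.8, 1.9])
ys = np.array([50.0, 55.0, 62.0, 70.0, 76.0])

# Same computation as LeastSquares: slope = Cov(xs, ys) / Var(xs).
meanx, meany = xs.mean(), ys.mean()
slope = np.sum((xs - meanx) * (ys - meany)) / np.sum((xs - meanx) ** 2)
inter = meany - slope * meanx

# Fitted values and residuals, as in FitLine and Residuals.
fit_ys = inter + slope * xs
res = ys - fit_ys
print(inter, slope, res)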

3. Goodness of fit:

There are several ways to measure the quality of a linear model, or goodness of fit. One
of the simplest is the standard deviation of the residuals.

If we use a linear model to make predictions, Std(res) is the root mean squared error
(RMSE) of the predictions. For example, if you use mother’s age to guess birth weight, the
RMSE of your guess would be 1.40 lbs.

Another way to measure goodness of fit is the coefficient of determination, usually denoted R2
and called “R-squared”:

def CoefDetermination(ys, res):
    return 1 - Var(res) / Var(ys)

Var(res) is the MSE using the model, Var(ys) is the MSE without it. So their ratio is the
fraction of MSE that remains if you use the model, and R2 is the fraction of MSE the model
eliminates.
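Continuing the made-up example from the earlier sketch, both measures can be computed
directly (the residual values here are illustrative, not real results):

import numpy as np

ys  = np.array([50.0, 55.0, 62.0, 70.0, 76.0])      # observed values
res = np.array([ 0.4, -1.3, -1.0,  0.3,  1.6])      # hypothetical residuals from a fit

rmse = np.std(res)                   # Std(res): the RMSE of the predictions
r2 = 1 - np.var(res) / np.var(ys)    # fraction of MSE the model eliminates
print(rmse, r2)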

4. Testing a linear model:


There are several ways to test the results of a linear fit. One option is to test whether the
apparent reduction in MSE is due to chance. In that case, the test statistic is R2 and the null
hypothesis is that there is no relationship between the variables.
Another approach is to test whether the apparent slope is due to chance. The null
hypothesis is that the slope is actually zero.

Here's a HypothesisTest for this model:

class SlopeTest(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        ages, weights = data
        _, slope = thinkstats2.LeastSquares(ages, weights)
        return slope

    def MakeModel(self):
        _, weights = self.data
        self.ybar = weights.mean()
        self.res = weights - self.ybar

    def RunModel(self):
        ages, _ = self.data
        weights = self.ybar + np.random.permutation(self.res)
        return ages, weights

The data are represented as sequences of ages and weights. The test statistic is the slope
estimated by LeastSquares. The model of the null hypothesis is represented by the mean weight
and the deviations. To generate simulated data, we permute the deviations and add them to the
mean.
The p-value can be estimated in two ways:
● Compute the probability that the slope under the null hypothesis exceeds the observed
slope.
● Compute the probability that the slope in the sampling distribution falls below 0. (If the
estimated slope were negative, we would compute the probability that the slope in the
sampling distribution exceeds 0.)

The second option is easier because we normally want to compute the sampling distribution
of the parameters anyway. And it is a good approximation unless the sample size is small and the
distribution of residuals is skewed. Even then, it is usually good enough, because p-values don't
have to be precise.
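The same idea can be sketched outside the thinkstats2 framework; here is a minimal
permutation test with synthetic data (all names and values are illustrative):

import numpy as np

def slope_of(xs, ys):
    # Least squares slope: Cov(xs, ys) / Var(xs).
    return np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)

rng = np.random.default_rng(0)
ages = rng.uniform(18, 40, size=200)                    # hypothetical mothers' ages
weights = 6.5 + 0.02 * ages + rng.normal(0, 1.2, 200)   # hypothetical birth weights

actual = slope_of(ages, weights)

# Null hypothesis: the slope is zero. Simulate it by permuting the
# deviations around the mean weight and re-estimating the slope.
ybar = weights.mean()
res = weights - ybar
sim = np.array([slope_of(ages, ybar + rng.permutation(res)) for _ in range(1000)])

# p-value: how often the simulated slope exceeds the observed slope.
pvalue = np.mean(sim >= actual)
print(actual, pvalue)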

5. Weighted resampling:
Definition:
A sampling weight is a value associated with an observation in a sample that indicates what
part of the population it represents.
Explanation:
The survey deliberately oversamples several groups in order to improve the chance of
getting statistically significant results; that is, in order to improve the power of tests involving
these groups. This survey design is useful for many purposes, but it means that we cannot use the
sample to estimate values for the general population without accounting for the sampling
process.
For each respondent, the data includes a variable called finalwgt, which is the number
of people in the general population the respondent represents. This value is called a sampling
weight, or just “weight”.
Example:
If you survey 100,000 people in a country of 300 million, each respondent represents
3,000 people. If you oversample one group by a factor of 2, each person in the oversampled
group would have a lower weight, about 1500.
To correct for oversampling, we can use resampling; that is, we can draw samples from
the survey using probabilities proportional to sampling weights. Then, for any quantity we want
to estimate, we can generate sampling distributions, standard errors, and confidence intervals.

● ResampleRows chooses rows from a DataFrame, giving each row the same probability.
To resample with probabilities proportional to sampling weights, ResampleRowsWeighted
takes a DataFrame, resamples rows according to the weights in finalwgt, and returns a
DataFrame containing the resampled rows:

def ResampleRowsWeighted(df, column='finalwgt'):
    weights = df[column]
    cdf = Cdf(dict(weights))
    indices = cdf.Sample(len(weights))
    sample = df.loc[indices]
    return sample
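For comparison, recent versions of pandas can resample with weights directly; a minimal
sketch (the column name finalwgt and the data below are illustrative):

import pandas as pd

def resample_rows_weighted(df, column='finalwgt'):
    # Draw len(df) rows with replacement, with probability proportional
    # to the sampling weights in the given column.
    return df.sample(n=len(df), replace=True, weights=df[column])

# Hypothetical data
df = pd.DataFrame({'finalwgt': [1500, 3000, 3000],
                   'totalwgt_lb': [7.1, 6.8, 8.0]})
print(resample_rows_weighted(df))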

6. Regression:
The linear least squares fit is an example of regression, which is the more general
problem of fitting any kind of model to any kind of data. The goal of regression analysis is to
describe the relationship between one set of variables, called the dependent variables, and
another set of variables, called independent or explanatory variables.
● Regression is the task of predicting a continuous quantity.
● A regression problem requires the prediction of a quantity.
● A regression problem with multiple input variables is called a multivariate regression
problem.
● The output of regression is a continuous value, in contrast with classification, where the
output represents the probability that a data point belongs to each class.

Regression using statsmodel:


import statsmodels.formula.api as smf

live, firsts, others = first.MakeFrames()
formula = 'totalwgt_lb ~ agepreg'
model = smf.ols(formula, data=live)
results = model.fit()

statsmodels provides two interfaces (APIs); the formula API uses strings to identify the
dependent and explanatory variables. It uses a syntax called patsy; in this example, the ~
operator separates the dependent variable on the left from the explanatory variables on the right.
smf.ols takes the formula string and the DataFrame, live, and returns an OLS object that
represents the model. The name ols stands for “ordinary least squares.”
The fit method fits the model to the data and returns a RegressionResults object that
contains the results.
The results are also available as attributes. params is a Series that maps from variable
names to their parameters, so we can get the intercept and slope like this:
inter = results.params['Intercept']
slope = results.params['agepreg']

pvalues is a Series that maps from variable names to the associated p-values, so we can
check whether the estimated slope is statistically significant:
slope_pvalue = results.pvalues['agepreg']

And results provides resid, a sequence of residuals, and fittedvalues, a sequence of fitted values
corresponding to agepreg. The results object provides summary(), which represents the results in
a readable format.
print(results.summary())
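Since the NSFG data set is not included here, a minimal runnable sketch with synthetic data
(the names mirror the example above; the numbers are made up):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
live = pd.DataFrame({'agepreg': rng.uniform(18, 40, 100)})
live['totalwgt_lb'] = 6.8 + 0.02 * live.agepreg + rng.normal(0, 1.2, 100)

results = smf.ols('totalwgt_lb ~ agepreg', data=live).fit()
inter = results.params['Intercept']
slope = results.params['agepreg']
slope_pvalue = results.pvalues['agepreg']
print(inter, slope, slope_pvalue)
print(results.summary())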

Linear regression:
It is a method to predict the dependent variable (Y) based on values of independent
variable (X). It can be used for the cases where we want to predict some continuous quantity.
Dependent variable: The response variable whose values need to be predicted. It should be
continuous.
Independent variable: The predictor variable used to predict the response variable.
Equation:
Y = b0 + b1X + e
where,
b0 – intercept
b1 – slope
X – independent variable
Y – dependent variable
e – error

7. Multiple regression:
Multiple linear regression refers to a statistical technique that is used to predict the
outcome of a variable based on the value of two or more variables. It is sometimes known simply
as multiple regression, and it is an extension of linear regression. The variable that we want to
predict is known as the dependent variable, while the variables we use to predict the value of
the dependent variable are known as independent or explanatory variables.
You can use multiple linear regression when you want to know:
1. How strong the relationship is between two or more independent variables and one
dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect
crop growth).
2. The value of the dependent variable at a certain value of the independent variables (e.g.
the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer
addition).

Assumptions of multiple linear regression:


Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t
change significantly across the values of the independent variable.

Independence of observations: the observations in the dataset were collected using statistically
valid methods, and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually
correlated with one another, so it is important to check these before developing the regression
model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them
should be used in the regression model.

Normality: The data follows a normal distribution.

Linearity: the line of best fit through the data points is a straight line, rather than a curve or
some sort of grouping factor.

How to perform a multiple linear regression

Multiple linear regression formula


The formula for a multiple linear regression is:

y = B0 + B1X1 + … + BnXn + e

where:
● y = the predicted value of the dependent variable


● B0 = the y-intercept (value of y when all other parameters are set to 0)
● B1X1= the regression coefficient (B1) of the first independent variable (X1) (a.k.a. the
effect that increasing the value of the independent variable has on the predicted y value)
● … = do the same for however many independent variables you are testing
● BnXn = the regression coefficient of the last independent variable
● e = model error (a.k.a. how much variation there is in our estimate of y)

To find the best-fit line for each independent variable, multiple linear regression calculates three
things:

● The regression coefficients that lead to the smallest overall model error.
● The t-statistic of the overall model.
● The associated p-value (how likely it is that the t-statistic would have occurred by chance
if the null hypothesis of no relationship between the independent and dependent variables
was true).

It then calculates the t-statistic and p-value for each regression coefficient in the model.

Python coding for linear and multiple regression:

import pandas as pd
import statsmodels.api as sm

stock_market = {'year': [2001, 2002, 2003],
                'month': ['jan', 'june', 'march'],
                'Interest_rate': [2, 3.3, 5.2],
                'unemployment_rate': [3.22, 5.33, 7.2],
                'stock_index_price': [1254, 1236.55, 1287.43]}

df = pd.DataFrame(stock_market, columns=['year', 'month', 'Interest_rate',
                                         'unemployment_rate', 'stock_index_price'])

# Two explanatory variables for multiple linear regression. For simple
# linear regression with one variable, use x = df['Interest_rate'].
x = df[['Interest_rate', 'unemployment_rate']]
y = df['stock_index_price']

X = sm.add_constant(x)            # adding a constant (intercept) column
model = sm.OLS(y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)

8. Nonlinear relationships:
Nonlinear regression is a form of regression analysis in which observational data are
modeled by a function which is a nonlinear combination of the model parameters and depends on
one or more independent variables. The data are fitted by a method of successive
approximations. In nonlinear regression, a statistical model of the form

y ~ f(x, β)

relates a vector of independent variables, x, and its associated observed dependent
variables, y. The function f is nonlinear in the components of the vector of parameters β, but
otherwise arbitrary.
With a single predictor, the model is a straight line. With more than one predictor, the
model is a plane or hyperplane. While such models are adequate for representing many
relationships (at least over a limited range of the predictors), there are many cases when a more
complicated model is required.
There are several procedures for fitting nonlinear models. The models that may be fit
include:
1. Transformable nonlinear models: models involving a single predictor variable in
which transforming Y, X or both results in a linear relationship between the transformed
variables.
2. Polynomial models: models involving one or more predictor variables which include
higher-order terms such as B1,1X1² or B1,2X1X2 (see the sketch after this list).
3. Models that are nonlinear in the parameters: models in which the partial derivatives
of Y with respect to the predictor variables involve the unknown parameters.
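As a hedged sketch of fitting a polynomial (transformable nonlinear) model with the same
formula interface used earlier; the data are synthetic and the names illustrative:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.5 * x - 0.04 * x**2 + rng.normal(0, 0.3, 200)   # curved relationship

# Adding a squared term keeps the model linear in its parameters,
# so ordinary least squares can still fit the nonlinear relationship.
df = pd.DataFrame({'x': x, 'x2': x**2, 'y': y})
results = smf.ols('y ~ x + x2', data=df).fit()
print(results.params)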

9. Logistic regression:

One of the basic and most popular algorithms for solving classification problems is logistic
regression. It is a method used to predict a dependent variable, given a set of independent
variables, such that the dependent variable is categorical.
Dependent variable (Y):
The binary (or categorical) response variable, holding values like 0 or 1, yes or no, or A, B or C.
Independent variable (X):
The predictor variable used to predict the response variable.
Equation:
log(Y / (1-Y) ) = C + B1X1 + B2X2 + …..
where,
Y = The probability of an event to happen which you are trying to predict.
X1, X2 are the independent variables which determine the occurrence of an event (i.e) Y.
C – Constant term (intercept), which gives the log odds of the event when no other factors
are considered.
B1, B2 – Coefficients.
# syntax for logistic regression (statsmodels):
model = sm.Logit(ytrain, xtrain)
result = model.fit()
print(result.summary())
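A minimal runnable sketch with synthetic data (the variable names are illustrative, not the
NSFG variables):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=(200, 1))
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x[:, 0])))   # true probabilities
y = rng.binomial(1, p)                          # binary outcomes (0 or 1)

X = sm.add_constant(x)        # add the intercept column
model = sm.Logit(y, X)        # note the capital L in Logit
result = model.fit()
print(result.summary())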

10. Estimating parameters:


Unlike linear regression, logistic regression does not have a closed form solution, so it is
solved by guessing an initial solution and improving it iteratively.

The usual goal is to find the maximum-likelihood estimate (MLE), which is the set of
parameters that maximizes the likelihood of the data. For example, suppose we have the
following data:
y = np.array([0, 1, 0, 1])
x1 = np.array([0, 0, 0, 1])
x2 = np.array([0, 1, 1, 1])

And we start with the initial guesses β0 = -1.5, β1 = 2.8, and β2 = 1.1:
beta = [-1.5, 2.8, 1.1]
Then for each row we can compute log_o:
log_o = beta[0] + beta[1] * x1 + beta[2] * x2

And convert from log odds to probabilities:


o = np.exp(log_o)

[ 0.223 0.670 0.670 11.02 ]

p = o / (o+1)
[ 0.182 0.401 0.401 0.916 ]

Notice that when log_o is greater than 0, o is greater than 1 and p is greater than 0.5.
The likelihood of an outcome is p when y==1 and 1-p when y==0. For example, if we
think the probability of a boy is 0.8 and the outcome is a boy, the likelihood is 0.8; if the
outcome is a girl, the likelihood is 0.2. We can compute that like this:
likes = y * p + (1-y) * (1-p)
[ 0.817 0.401 0.598 0.916 ]

The overall likelihood of the data is the product of likes:


like = np.prod(likes)
0.18

For these values of beta, the likelihood of the data is 0.18. The goal of logistic regression is to
find parameters that maximize this likelihood.
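Putting the steps above together into one short runnable sketch:

import numpy as np

y  = np.array([0, 1, 0, 1])
x1 = np.array([0, 0, 0, 1])
x2 = np.array([0, 1, 1, 1])

beta = [-1.5, 2.8, 1.1]                          # initial guesses

log_o = beta[0] + beta[1] * x1 + beta[2] * x2    # log odds for each row
o = np.exp(log_o)                                # odds
p = o / (o + 1)                                  # probabilities

likes = y * p + (1 - y) * (1 - p)                # likelihood of each outcome
like = np.prod(likes)                            # overall likelihood, about 0.18
print(like)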

Accuracy:
Accuracy – The number of successful predictions, compared with what we would expect by
chance.
In the NSFG data, there are more boys than girls, so the baseline strategy is to guess “boy” every
time. The accuracy of this strategy is just the fraction of boys:
actual = endog['boy']
baseline = actual.mean()
Since actual is encoded in binary integers, the mean is the fraction of boys, which is 0.507.

Here's how we compute the accuracy of the model:


predict = (results.predict() >= 0.5)
true_pos = predict * actual
true_neg = (1 - predict) * (1 - actual)
results.predict returns a NumPy array of probabilities, which we round off to 0 or 1. Multiplying
by actual yields 1 if we predict a boy and get it right, 0 otherwise. So, true_pos indicates “true
positives”.

Similarly, true_neg indicates the cases where we guess “girl” and get it right.
Accuracy is the fraction of correct guesses:
acc = (sum(true_pos) + sum(true_neg)) / len(actual)

The result is 0.512, slightly better than the baseline, 0.507. But, you should not take this result
too seriously. We used the same data to build and test the model, so the model may not have
predictive power on new data.
11.Time series analysis:

Why time series analysis:


In regression, we have an independent variable and a dependent variable; we deduce a function
(a mapping) that describes how one variable is related to the other, and then move to the
analysis. In a time series, one of the variables is time.
Example: If you own a coffee shop, you might track how many cups of coffee you sell every
month by adding up all of your coffee sales for that month.
What is a time series?
● A set of observations taken at specified times, usually at equal intervals.
● Used to predict future values based on previously observed values.
Importance:
● Business forecasting
● Understanding past behavior
● Planning future operations.
Components of time series:
● Trend – a movement toward relatively higher or lower values over a long period of time;
a smooth function that captures persistent changes.
● Seasonality – a repeating pattern within a fixed time period; periodic variation, possibly
including daily, weekly, monthly, or yearly cycles.
● Irregularity or noise – erratic, unsystematic variation, also called residuals; it lasts for a
short duration, does not repeat, and represents random variation around the long-term trend.
● Cyclic – repeating up-and-down movements, typically without a fixed period.
When not to use time series analysis:
● The values are constant.
● The values are given by a known function (e.g. sin(x) or cos(x)).
Stationarity:
A time series is stationary if its statistical behavior does not change over time, so that past
behavior is a good guide to future behavior. It should have constant mean, constant variance,
and an autocovariance that does not depend on time.
Tests to check stationarity:
1. Rolling statistics – plot the moving average or moving variance and see whether it varies
with time. This is more of a visual technique.
2. ADF (Augmented Dickey-Fuller) test – the null hypothesis is that the time series is
non-stationary. The test result comprises a test statistic and critical values (see the sketch
below).
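A hedged sketch of the ADF test using statsmodels; the series here is synthetic (a random walk,
which is non-stationary):

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
series = np.cumsum(rng.normal(size=365))   # random walk: non-stationary

# adfuller returns the test statistic, the p-value, and the critical values
# (among other things); a large p-value means we cannot reject non-stationarity.
adf_stat, pvalue, usedlag, nobs, critical_values, icbest = adfuller(series)
print(adf_stat, pvalue, critical_values)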

12. Moving averages:


A moving average is defined as an average of a fixed number of items in the time series,
which moves through the series by dropping the first item of the previously averaged group and
adding the next item in each successive average. Moving averages can be used for measuring the
trend of any series. This method is applicable to linear as well as non-linear trends.
The moving-average (MA) model assumes that the present value of the series depends on
past errors (innovations), i.e., on the errors at t−1, t−2, …, t−n. The first-order model is given by
yt = µ + c1 εt−1 + εt
(we are using the error made in the last period to make the current prediction), where
µ – initial expectation (mean) of the series
c1 – constant coefficient
εt – error term.

A moving average divides the series into overlapping regions, called windows, and
computes the average of the values in each window. One of the simplest moving averages is the
rolling mean, which computes the mean of the values in each window. For example, if the
window size is 3, the rolling mean computes the mean of values 0 through 2, 1 through 3, 2
through 4, etc. pandas provides rolling_mean, which takes a Series and a window size and
returns a new Series.
>>> series = np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> pandas.rolling_mean(series, 3)
array([ nan, nan, 1, 2, 3, 4, 5, 6, 7, 8])

The first two values are nan; the next value is the mean of the first three elements, 0, 1,
and 2. The next value is the mean of 1, 2, and 3, and so on.
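rolling_mean comes from older pandas versions; in recent pandas the same computation is
written with the rolling-window interface. A minimal sketch:

import numpy as np
import pandas as pd

series = pd.Series(np.arange(10))
rolling = series.rolling(window=3).mean()
print(rolling.values)
# [nan nan  1.  2.  3.  4.  5.  6.  7.  8.]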

An alternative is the exponentially-weighted moving average (EWMA), which has two
advantages. First, as the name suggests, it computes a weighted average where the most recent
value has the highest weight and the weights for previous values drop off exponentially. Second,
the pandas implementation of EWMA handles missing values better.
ewma = pandas.ewma(reindexed.ppg, span=30)
thinkplot.Plot(ewma.index, ewma)
The span parameter corresponds roughly to the window size of a moving average; it
controls how fast the weights drop off, so it determines the number of points that make a
non-negligible contribution to each average.
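Similarly, pandas.ewma comes from older pandas; in recent versions the equivalent is the ewm
interface. A sketch with a made-up series standing in for reindexed.ppg:

import numpy as np
import pandas as pd

ppg = pd.Series(np.random.default_rng(5).normal(10, 1, 120))   # hypothetical daily prices

ewma = ppg.ewm(span=30).mean()
print(ewma.tail())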

Advantages of moving averages


● This method is simple compared to the method of least squares.
● It is a flexible method of measuring trend. If a few more figures are added to the data, the
entire calculation is not changed; we only get some more trend values.
● If the period of the moving average happens to coincide with the period of cyclical
fluctuations in the data, such fluctuations are automatically eliminated. The moving
average has the advantage that it follows the general movements of the data and that its
shape is determined by the data rather than the statistician's choice of a mathematical
function.
Disadvantages of moving averages
● Trend values cannot be computed for all the years. The longer the period of the moving
average, the greater the number of years for which trend values cannot be obtained. For
example, in a three-year moving average, trend values cannot be obtained for the first
year and the last year; in a five-year moving average, for the first two years and the last
two years, and so on. It is often these extreme years in which we are most interested.
● Great care has to be exercised in selecting the period of the moving average. No hard and
fast rules are available for the choice of the period, and one has to use one's own judgment.
● Since the moving average is not represented by a mathematical function, this method
cannot be used for forecasting, which is the main objective of trend analysis.
● Although in theory the cyclical fluctuations are completely eliminated when the period of
the moving average coincides with the period of the cycle, in practice the cycles are by no
means perfectly periodic; the lengths of the various cycles in any given series will usually
vary considerably, and therefore no moving average can completely remove the cycle.

13. Missing values:


Time series data based on human behavior often exhibits daily, weekly, monthly, or
yearly cycles. The tests for seasonality don't work well with missing data, so we have to solve
that problem first.
A simple and common way to fill missing data is to use a moving average.
The Series method fillna does just what we want:
reindexed.ppg.fillna(ewma, inplace=True)

Wherever reindexed.ppg is nan, fillna replaces it with the corresponding value from
ewma. The inplace flag tells fillna to modify the existing Series rather than create a new one.
A drawback of this method is that it understates the noise in the series. We can solve that
problem by adding in resampled residuals:
resid = (reindexed.ppg - ewma).dropna()
fake_data = ewma + thinkstats2.Resample(resid, len(reindexed))
reindexed.ppg.fillna(fake_data, inplace=True)
resid contains the residual values, not including days when ppg is nan. fake_data contains the
sum of the moving average and a random sample of residuals. Finally, fillna replaces nan with
values from fake_data.

14. Serial correlation:


As prices vary from day to day, you might expect to see patterns. If the price
is high on Monday, you might expect it to be high for a few more days; and if it's low, you might
expect it to stay low. A pattern like this is called serial correlation, because each value is
correlated with the next one in the series.
To compute serial correlation, we can shift the time series by an interval called a lag, and
then compute the correlation of the shifted series with the original:
def SerialCorr(series, lag=1):
    xs = series[lag:]
    ys = series.shift(lag)[lag:]
    corr = thinkstats2.Corr(xs, ys)
    return corr

After the shift, the first lag values are nan, so we use a slice to remove them before
computing Corr.
If we apply SerialCorr to the raw price data with lag 1, we find serial correlation 0.48 for
the high quality category, 0.16 for medium and 0.10 for low. In any time series with a long-term
trend, we expect to see strong serial correlations.

For example, if prices are falling, we expect to see values above the mean in the first half
of the series and values below the mean in the second half. It is more interesting to see if the
correlation persists if you subtract away the trend.
For example, we can compute the residual of the EWMA and then compute its serial
correlation:
ewma = pandas.ewma(reindexed.ppg, span=30)
resid = reindexed.ppg - ewma
corr = SerialCorr(resid, 1)

With lag=1, the serial correlations for the de-trended data are -0.022 for high quality, -0.015 for
medium, and 0.036 for low. These values are small, indicating that there is little or no one-day
serial correlation in this series.

15. Autocorrelation:
The autocorrelation function is a function that maps from lag to the serial correlation with
the given lag. “Autocorrelation" is another name for serial correlation, used more often when the
lag is not 1.
StatsModels, which we used for linear regression, also provides functions for time series
analysis, including acf, which computes the autocorrelation function:
import statsmodels.tsa.stattools as smtsa
acf = smtsa.acf(filled.resid, nlags=365, unbiased=True)
acf computes serial correlations with lags from 0 through nlags. The unbiased flag tells acf to
correct the estimates for the sample size. The result is an array of correlations. If we select daily
prices for high quality, and extract correlations for lags 1, 7, 30, and 365, we can confirm that acf
and SerialCorr yield approximately the same results:
>>> acf[0], acf[1], acf[7], acf[30], acf[365]
1.000, -0.029, 0.020, 0.014, 0.044
With lag=0, acf computes the correlation of the series with itself, which is always 1.

[Figure: autocorrelation function for daily prices (left), and daily prices with a simulated
weekly seasonality (right).]

Prediction:
Time series analysis can be used to investigate, and sometimes explain, the behavior of
systems that vary in time. It can also be used to make predictions, but it is important to quantify
the error; in other words, we want to know how accurate the prediction is likely to be.
There are three sources of error we should take into account:
● Sampling error: The prediction is based on estimated parameters, which depend on
random variation in the sample. If we run the experiment again, we expect the estimates
to vary.
● Random variation: Even if the estimated parameters are perfect, the observed data varies
randomly around the long-term trend, and we expect this variation to continue in the
future.
● Modeling error: We have already seen evidence that the long-term trend is not linear, so
predictions based on a linear model will eventually fail.

16. Survival analysis:


Survival analysis is a way to describe how long things last. It is often used to study
human lifetimes, but it also applies to the “survival” of mechanical and electronic components,
or more generally to intervals in time before an event.
If someone you know has been diagnosed with a life-threatening disease, you
might have seen a “5-year survival rate," which is the probability of surviving five years after
diagnosis. That estimate and related statistics are the result of survival analysis.
The fundamental concept in survival analysis is the survival curve, S(t), which is a
function that maps from a duration, t, to the probability of surviving longer than t. If you know
the distribution of durations, or “lifetimes", finding the survival curve is easy; it's just the
complement of the CDF:
S(t) = 1- CDF(t)

where CDF(t) is the probability of a lifetime less than or equal to t.
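A minimal sketch of this relationship with made-up lifetimes, using an empirical CDF:

import numpy as np

lifetimes = np.array([2.0, 3.5, 3.5, 5.0, 7.5, 9.0])    # hypothetical durations

ts = np.sort(np.unique(lifetimes))
cdf = np.array([np.mean(lifetimes <= t) for t in ts])   # CDF(t) = P(lifetime <= t)
ss = 1 - cdf                                            # S(t) = P(lifetime > t)
print(dict(zip(ts, ss)))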

Example:

SurvivalFunction provides two properties: ts, which is the sequence of lifetimes, and ss,
which is the survival curve. In Python, a “property" is a method that can be invoked as if it were
a variable.
We can instantiate a SurvivalFunction by passing the CDF of lifetimes:
sf = SurvivalFunction(cdf)
SurvivalFunction also provides __getitem__ and Prob, which evaluates the survival curve:
# class SurvivalFunction
def __getitem__(self, t):
    return self.Prob(t)

def Prob(self, t):
    return 1 - self.cdf.Prob(t)

● Hazard function: A function that maps from t to the fraction of people alive until t who
die at t.
SurvivalFunction provides MakeHazard, which calculates the hazard function:
# class SurvivalFunction
def MakeHazard(self, label=''):
    ss = self.ss
    lams = {}
    for i, t in enumerate(self.ts[:-1]):
        hazard = (ss[i] - ss[i+1]) / ss[i]
        lams[t] = hazard
    return HazardFunction(lams, label=label)
The HazardFunction object is a wrapper for a pandas Series:
class HazardFunction(object):

    def __init__(self, d, label=''):
        self.series = pandas.Series(d)
        self.label = label

d can be a dictionary or any other type that can initialize a Series, including another
Series. label is a string used to identify the HazardFunction when plotted.

● Kaplan-Meier estimation: An algorithm for estimating hazard and survival functions.


The general idea is that we can use the data to estimate the hazard function, then convert
the hazard function to a survival curve.
● cohort: a group of subjects defined by an event, like date of birth, in a particular interval
of time.
● cohort effect: a difference between cohorts.
● NBUE: A property of expected remaining lifetime, “New better than used in
expectation.”
● UBNE: A property of expected remaining lifetime, “Used better than new in
expectation.”
