Statistics Study Guide
Matthew Chesnes
The London School of Economics
September 22, 2001
1 Descriptive Statistics
Pictures of Data: histograms, pie charts, stem-and-leaf plots, scatter plots, ogives.
Ogive: a display of cumulative distribution percentages. Compute reasonable intervals for the data and determine their frequency, then their cumulative frequency, and finally their percentage frequency (percentile). Plot the percentages at the upper end of each interval. An easy way to display quartiles.
Stem and Leaf Plot: the stem is the major part of the data and the leaves are the minor part. Choose a reasonable unit for the stem and organize the leaves in order of magnitude. If the leaves are too large, split the stem into smaller intervals. Placing the leaves equidistant provides a histogram-like representation.
"A Diagram is Interacting the Eye" - B. Blight.
Measures of Data: mean, median, mode, standard deviation, quartiles.
Right skewed data - positively skewed - long right tail. Left skewed data - negatively skewed - long left tail.
Measures of Location and Spread:
Mean: the average (a stable value and useful in analysis, though sensitive to outliers).
Mode: the data value that occurs most often; the highest peak of the pdf in the continuous case.
Median: the 50th percentile (insensitive to outliers, though not as useful in statistical inference).
Range: max - min (crude and inaccurate).
Interquartile Range: 75th percentile - 25th percentile.
Sample standard deviation, s, an estimate of the population standard deviation, σ:
s = √(Corrected Sum of Squares / (n − 1)) = √( Σ_{i=1}^n (x_i − x̄)² / (n − 1) ).
The standard deviation is calculated on n − 1 degrees of freedom rather than n because dividing by n would yield a biased estimator. Alternative form of s:
CSS = Σ_{i=1}^n x_i² − n x̄²,
s = √( (Σ_{i=1}^n x_i² − n x̄²) / (n − 1) ).
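As a quick illustration (not from the original notes, data invented): both forms of the corrected sum of squares give the same sample standard deviation.

    import numpy as np

    x = np.array([4.0, 7.0, 6.0, 5.0, 9.0])       # hypothetical sample
    n = len(x)
    css = np.sum((x - x.mean()) ** 2)              # corrected sum of squares
    css_alt = np.sum(x ** 2) - n * x.mean() ** 2   # alternative form
    s = np.sqrt(css / (n - 1))                     # n - 1 degrees of freedom
    print(css, css_alt, s, np.std(x, ddof=1))      # the last two agree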
2 Probability
Additive Law: P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Exclusive Events: P(A ∪ B) = P(A) + P(B).
Total Probability: P(A) + P(A^c) = 1.
De Morgan's Laws: P(A^c ∩ B^c) = P((A ∪ B)^c), and P(A^c ∪ B^c) = P((A ∩ B)^c).
Combinatorial: nCx = n! / (x!(n − x)!).
2.1
Exclusive events are VERY dependent: one happening completely excludes the possibility of the other occurring.
The Law of Total Probability: P(A) = P(A ∩ B) + P(A ∩ B^c) = P(A|B)P(B) + P(A|B^c)P(B^c).
So in general, Bayes' Law can be written
P(B_i|A) = P(A|B_i)P(B_i) / Σ_{j=1}^n P(A|B_j)P(B_j).
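A small numerical sketch of total probability and Bayes' Law (the two events B1, B2 and all probabilities below are invented for illustration):

    # Two sources B1, B2; A = "item is defective"
    p_B = [0.6, 0.4]            # P(B1), P(B2)
    p_A_given_B = [0.05, 0.20]  # P(A|B1), P(A|B2)

    p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))   # law of total probability
    p_B1_given_A = p_A_given_B[0] * p_B[0] / p_A             # Bayes' law
    print(p_A, p_B1_given_A)                                 # 0.11 and about 0.27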
3
3.1
3.2
Hypergeometric: trials are not independent; sampling without replacement. As the population becomes large relative to the sample, the hypergeometric distribution tends to the binomial.
Negative Binomial: the distribution of the number of trials needed to get k successes.
Multinomial: a generalization of the binomial for more than 2 classifications.
3.3
Discrete Distribution Applications: measuring random arrival times, components that break down over time, defective items in a large batch.
Memoryless Property: at every point in time, there is always the same chance of the event occurring.
Arrival rate: λ. P(r arrivals in time t) = (λt)^r e^{−λt} / r!.
The rate, λ, is in terms of time, t. Use the Poisson approximation for the binomial when n is large AND p is either large (≈ 1) or small (≈ 0). Thus, use the Poisson approximation if np < 10. Then
P(R = r) ≈ (np)^r e^{−np} / r!.
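A brief check of the approximation (assumed values n = 500, p = 0.01, so np = 5 < 10):

    from scipy.stats import binom, poisson

    n, p = 500, 0.01
    for r in range(8):
        exact = binom.pmf(r, n, p)        # exact binomial probability
        approx = poisson.pmf(r, n * p)    # Poisson approximation with mean np
        print(r, round(exact, 4), round(approx, 4))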
Expectation of a discrete random variable: E[X] = Σ_i x_i p_i.
If X is distributed binomially, E[X] = np. If X is distributed Poisson, E[X] = λ.
Expectation is a linear operator: E[a + bX] = a + bE[X].
Variance and standard deviation: if R is a random variable with mean μ,
σ_R² = E[(R − μ)²] = Σ_r (r − μ)² p_r.
Alternate form: σ_R² = E[R²] − (E[R])².
Rearranging gives an important and useful result: σ_R² + μ² = E[R²].
If X is distributed binomially, σ_X² = npq. If X is distributed Poisson, σ_X² = λ.
For a continuous random variable, E[X] = ∫ x f(x) dx and
σ_X² = E[(X − μ)²] = E[X²] − (E[X])².
5.1
5.2
Consider the Poisson process with points occurring at random in time. λ is the average number of occurrences per unit of time. The time between occurrences is a continuous random variable, X, and it follows an exponential distribution.
1 − F(x) = P(X > x) = P(0 occurrences over the interval (0, x)) = e^{−λx}(λx)⁰/0! = e^{−λx}.
Thus F(x) = 1 − e^{−λx}, and f(x) = d/dx (1 − e^{−λx}) = λe^{−λx}.
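A simulation sketch of this fact (the rate λ = 2 is an arbitrary assumed value): the gaps between Poisson-process arrivals are exponential, so the empirical P(X > x) should track e^{−λx}.

    import numpy as np

    rng = np.random.default_rng(0)
    lam = 2.0
    gaps = rng.exponential(scale=1.0 / lam, size=100_000)  # inter-arrival times
    x = 0.5
    print((gaps > x).mean(), np.exp(-lam * x))             # both close to 0.368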
5.3
The Central Limit Theorem: if n values are sampled from a population and n is sufficiently large, then the sample mean (or sum) is normally distributed whatever the distribution of the parent population. If the parent is normal, n can be relatively small; if the parent is very non-normal, a larger n is needed, though about 50 is usually enough.
Standard Normal Distribution: μ = 0, σ = 1.
E[aX + bY] = aE[X] + bE[Y] = aμ_X + bμ_Y.
σ²[aX + bY] = a²σ_X² + b²σ_Y² + 2ab Cov(X, Y).
σ²[aX − bY] = a²σ_X² + b²σ_Y² − 2ab Cov(X, Y).
σ²[3 + 2X] = 4σ_X².
Theorem: any linear function of normal variables is itself normally distributed.
Normal approximation to the binomial: if R is distributed binomially with n trials and p, the probability of success, then as n → ∞ with p constant, R → Normal. As n → ∞ with np constant (therefore p → 0), R → Poisson (use the Poisson if np < 10). If R → Normal, R ≈ N(np, npq).
IMPORTANT: when using the normal approximation to the binomial, remember to add or subtract a half when computing intervals or finding critical values, to reflect the discreteness of the original distribution.
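A sketch of the continuity correction (n = 100 and p = 0.4 are assumed values): approximate P(R ≤ 45) with a normal of mean np and variance npq.

    import numpy as np
    from scipy.stats import binom, norm

    n, p = 100, 0.4
    mu, sigma = n * p, np.sqrt(n * p * (1 - p))
    exact = binom.cdf(45, n, p)
    approx = norm.cdf((45 + 0.5 - mu) / sigma)   # add a half for discreteness
    print(exact, approx)                         # the two are close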
6 Sampling Theory
Let X ∼ N(μ, σ²) and let x₁, ..., x_n be a random sample from this population, with sample mean X̄ = (1/n) Σ_i x_i. Then
X̄ ∼ N(μ, σ²/n).
The standard deviation of X̄ = σ/√n = the standard error.
Parameters, Estimators, and Standard Errors:
Parameter = μ; Estimator = x̄; Standard Error = σ/√n.
Parameter = p; Estimator = r/n; Standard Error = √(pq/n).
7 Estimation
7.1 Point Estimation
We want to estimate some parameter θ using an estimator θ̂. Calculate the Mean Square Error, MSE = E[(θ̂ − θ)²].
Squaring out, MSE = σ²_θ̂ + (E[θ̂] − θ)².
Or otherwise written, MSE = Variance + bias².
Desirable properties of estimators: Unbiased: E[estimator] = parameter. Efficient: small variance.
For example, E[s²] = E[CSS/(n − 1)] = E[Σ(x − x̄)²/(n − 1)] = σ².
Hence dividing by n − 1 is explained: it gives us an unbiased estimator. However, efficiency is more important than unbiasedness: if one estimator is slightly biased but extremely efficient, use it, because of the high variability of the alternative.
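A short simulation sketch of this bias (assumed setup: normal samples of size 5 with σ² = 1): dividing the corrected sum of squares by n systematically underestimates σ², while dividing by n − 1 does not.

    import numpy as np

    rng = np.random.default_rng(1)
    samples = rng.normal(size=(200_000, 5))
    css = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    print(css.mean() / 5, css.mean() / 4)   # about 0.8 (biased) versus about 1.0 (unbiased)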
7.2 Interval Estimation
A 95 percent confidence interval for the mean:
μ ∈ (x̄ ± Z_crit(2.5%) SE(x̄)) = (x̄ ± 1.96 σ/√n).
An incorrect interpretation of this interval would be: there is a 95 percent chance that x̄ is within 1.96 standard errors of μ. A correct (purist) statement would be: if you took many samples and calculated the confidence interval for the parameter each time, then 95 percent of the confidence intervals would contain the true value of the parameter. This is because the interval is the thing that has variability, not μ; μ is a constant.
Confidence intervals for proportions:
p ∈ ( r/n ± Z_crit √( (r/n)(1 − r/n) / n ) ).
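An illustrative sketch of both intervals with invented numbers (x̄ = 52.3, σ = 4, n = 40 for the mean; r = 37 successes out of m = 120 for the proportion):

    import numpy as np
    from scipy.stats import norm

    z = norm.ppf(0.975)                          # about 1.96

    xbar, sigma, n = 52.3, 4.0, 40
    print(xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n))

    r, m = 37, 120
    phat = r / m
    se = np.sqrt(phat * (1 - phat) / m)
    print(phat - z * se, phat + z * se)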
Sample Size Determination. Define d to be the tolerance, or the half-length of the confidence interval. To obtain a 95 percent confidence interval for a mean within a certain tolerance d, set n = (Z_crit σ / d)². One may have to estimate σ with s using a small sample first and then determine the optimal n. In general, d = Z_crit · SE, and the SE involves n, so solve for n and plug in d.
Exact formulation of the variance of x̄ when sampling from a finite population of size N: Var(x̄) = (σ²/n)(1 − n/N).
7.3
Suppose the sample is small and the variance is unknown. A confidence interval for μ is
μ ∈ (x̄ ± t_crit,n−1 · s/√n).
The t distribution, AKA Student's t distribution, is more spread out to allow for the variability of both x̄ and s. If σ is known, use the Z distribution for sure (unless n is incredibly low). If n is large, use Z because, even though the t distribution is theoretically correct, t → Z as n → ∞. One other case: if n is small and the distribution is really not normal (the Central Limit Theorem does not apply), then one must use a non-parametric method.
Comparison of Means: 3 cases.
Paired Data. Calculate d_i = x_i − y_i. We want an estimate for μ_d = μ_x − μ_y, so the confidence interval becomes
μ_d ∈ (d̄ ± t_{n−1} (s_d/√n)).
We use the t distribution because n is small and we are estimating σ_d.
Unpaired Large Samples. μ_x − μ_y is estimated by x̄ − ȳ. The standard error here is
S_{x̄−ȳ} = √( S_x²/n_x + S_y²/n_y ),
and thus a confidence interval becomes
μ_x − μ_y ∈ ( (x̄ − ȳ) ± Z_crit √( S_x²/n_x + S_y²/n_y ) ).
Unpaired Small Samples. One must make the assumption that the variances of the two populations are the same! A risky assumption. Assume σ₁ = σ₂ = σ_p. Then
S_p² = [ (n₁ − 1)S₁² + (n₂ − 1)S₂² ] / (n₁ + n₂ − 2) = (CSS₁ + CSS₂) / (n₁ + n₂ − 2),
and
SE = S_p √( 1/n₁ + 1/n₂ ).
Notice that S_p² is a weighted average of the sample variances, with each one's degrees of freedom as the weights. The test statistic for a hypothesis test or a confidence interval will follow a t distribution with n₁ + n₂ − 2 degrees of freedom.
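A sketch of the pooled procedure on two invented samples; scipy's equal-variance t test uses exactly this pooled standard error.

    import numpy as np
    from scipy.stats import t, ttest_ind

    x = np.array([5.1, 4.8, 6.0, 5.5, 5.9])
    y = np.array([4.2, 4.9, 4.6, 5.0, 4.4, 4.7])
    n1, n2 = len(x), len(y)

    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = x.mean() - y.mean()
    tcrit = t.ppf(0.975, df=n1 + n2 - 2)
    print(diff - tcrit * se, diff + tcrit * se)   # 95 percent CI for mu_x - mu_y
    print(ttest_ind(x, y, equal_var=True))        # matching t statistic and p-value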
7.4
S2 =
8 Hypothesis Testing
Testing H0 versus H1. Always choose the null hypothesis to be the simpler of the two alternatives.
Type I Error: rejecting H0 when it is true (α). Type II Error: failing to reject H0 when it is false (β). α and β both decrease with a larger sample size.
Power Function: the probability of accepting H1 (rejecting H0) for different values of the true parameter, θ.
Some might use the terminology "accepting H1", but this would be incorrect if it implies proof. All we are saying is that the available data support the hypothesis. Purists would never just "accept"; they would use the terminology "fail to reject H0".
To carry out a test, define the hypotheses, compute the test statistic, and compare it with the relevant distribution. If n is large, use the Z distribution for your decision. If n is smaller and σ is unknown, use the t distribution. If a test statistic is on a division point of the critical values, maybe you cannot confidently reject H0, but you should be very suspicious that it is actually true.
"Always report the lowest possible level (highest possible confidence). Doing otherwise is just ignorant." - C. Dougherty.
The P-value of the test tells you exactly where the test statistic lies: it is the probability, under the null hypothesis, of observing an estimate as extreme as or more extreme than your value.
When computing standard errors for a test, always compute them with the null values. Since we are assuming that the null is true until shown otherwise, one must use its values when doing the test.
Advantage of a paired test: it is much less sensitive to variability between subjects, since each subject serves as its own control.
Never use the data to form your hypothesis: choose the nature of the test (one-tailed or two-tailed, null and alternative hypotheses, etc.) first, and then carry out the test using the data.
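A minimal sketch of a two-sided test of H0: μ = 5 against H1: μ ≠ 5 on invented data, reporting the test statistic and P-value:

    import numpy as np
    from scipy.stats import ttest_1samp

    x = np.array([5.3, 5.9, 5.1, 6.2, 5.7, 5.4, 6.0, 5.8])
    tstat, pvalue = ttest_1samp(x, popmean=5.0)
    print(tstat, pvalue)     # reject H0 at the 5 percent level if pvalue < 0.05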
Σ (O_i − E_i)² / E_i ∼ χ²_{(r−1)(c−1)}.
The larger the statistic, the more likely we are to reject H0 in favor of association. The statistic is distributed as a chi-squared with (rows − 1)(cols − 1) degrees of freedom.
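A sketch of the test of association on a hypothetical 2 x 3 contingency table (the counts are invented):

    import numpy as np
    from scipy.stats import chi2_contingency

    observed = np.array([[20, 30, 25],
                         [35, 25, 15]])
    stat, pvalue, dof, expected = chi2_contingency(observed)
    print(stat, pvalue, dof)   # dof = (2 - 1) * (3 - 1) = 2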
10
Let R be a random variable with p.d.f. p_r, and let T be some function φ(R). Then Prob(T = t) = Σ p_r, where the sum is over all values of r such that φ(r) = t. Work out the distributions of R and then T to see that this is true.
Theorem: for a random variable X and a random variable Y = φ(X) such that φ is a monotonic function, the c.d.f. for X equals the c.d.f. for Y: F(x) = G(y). Also (IMPORTANT THEOREM), for the same transformation φ,
g(y) = f(x) |dx/dy|.
For a general transformation φ of a random variable (not necessarily monotonic), just look at the graph of the transformed X and evaluate the above theorem over each monotonic section.
Joint density functions of two random variables: f(x, y). This is simply a surface in three dimensions, with the volume under the surface (instead of the area under the curve) representing probability. The total volume under the surface is again equal to one. All of the Bayes calculus on probabilities also applies to density functions: f(y) = ∫ f(y|x) f(x) dx.
10.1
Covariance: Cov(X, Y) = σ_XY = E[(X − μ_X)(Y − μ_Y)]. If σ_XY > 0, X and Y work in the same direction; if σ_XY < 0, X and Y work in opposite directions. It can also be shown that Cov(X, Y) = E[XY] − E[X]E[Y]. Since the covariance depends on the units of the random variables, we define the correlation coefficient to be
ρ = σ_XY / (σ_X σ_Y).
ρ is the linear correlation coefficient, and it lies between −1 and 1. If X and Y are independent, it can be shown that Cov(X, Y) = 0. If X is a linear function of Y, then ρ_XY = ±1.
Properties of Variance and Covariance:
Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y). Variance is a second-order operator. The variances always add, though the covariance term takes the sign of ab.
Three-variable case: Var(aX + bY + cZ) = a² Var(X) + b² Var(Y) + c² Var(Z) + 2ab Cov(X, Y) + 2ac Cov(X, Z) + 2bc Cov(Y, Z).
Cov(aX + bY, cS + dT) = ac Cov(X, S) + ad Cov(X, T) + bc Cov(Y, S) + bd Cov(Y, T).
11
Σ is the variance-covariance matrix. All diagonal elements of this matrix are the variances of the random variables; the off-diagonal entries are covariances. It is of course a symmetric matrix. Σ = E[(X − μ)(X − μ)^T].
Theorem: if X ∼ N(0, Σ) (i.e., X is multivariate normal), then
X^T Σ⁻¹ X ∼ χ²_p.
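A simulation sketch of this theorem (p = 3 and the positive-definite Σ below are arbitrary assumed values): the quadratic form should behave like a χ²₃ variable, which has mean 3 and variance 6.

    import numpy as np

    rng = np.random.default_rng(2)
    sigma = np.array([[2.0, 0.5, 0.3],
                      [0.5, 1.0, 0.2],
                      [0.3, 0.2, 1.5]])
    x = rng.multivariate_normal(mean=np.zeros(3), cov=sigma, size=100_000)
    q = np.einsum('ij,jk,ik->i', x, np.linalg.inv(sigma), x)   # x^T Sigma^{-1} x, row by row
    print(q.mean(), q.var())                                   # close to 3 and 6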
12
Six basic statistics are needed for regression: n, the sample size; x̄, the sample mean of the independent variable; ȳ, the sample mean of the dependent variable; Sxx, the corrected sum of squares for the x's; Syy, the corrected sum of squares for the y's; and Sxy, the corrected sum of products for x and y.
Sxx = Σ_i (x_i − x̄)².
Syy = Σ_i (y_i − ȳ)².
Sxy = Σ_i (x_i − x̄)(y_i − ȳ) = Σ_i x_i y_i − n x̄ ȳ.
12.1
Covariance = c = Sxy / (n − 1).
Correlation = r = Sxy / √(Sxx Syy).
A correlation of zero means that there is no linear relationship between X and Y, but it does not necessarily mean there is no relationship at all: it could be nonlinear.
Test for Correlation: H0: ρ = 0 versus H1: ρ ≠ 0.
Test Statistic = r √(n − 2) / √(1 − r²), compared with a t distribution on n − 2 degrees of freedom.
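A sketch on invented data: the hand-computed statistic agrees with the P-value from scipy's correlation test.

    import numpy as np
    from scipy.stats import pearsonr, t

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.3, 8.1, 8.8])
    r, p = pearsonr(x, y)
    n = len(x)
    tstat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    print(tstat, 2 * t.sf(abs(tstat), df=n - 2), p)   # the two p-values agree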
12.2
Use scatterplots of the data as a starting point.
Simple Linear Model: y_i = α + βx_i + ε_i, with error term ε_i iid N(0, σ²).
Least squares chooses the estimates to minimize Σ_i (y_i − α − βx_i)².
This yields the estimators
b = Sxy / Sxx, and a = ȳ − b x̄.
It can be shown that a and b are B.L.U.E.: Best Linear Unbiased Estimators.
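A fitting sketch on invented data: the formulas above reproduce what np.polyfit returns.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])
    sxx = np.sum((x - x.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    b = sxy / sxx
    a = y.mean() - b * x.mean()
    print(a, b, np.polyfit(x, y, 1))   # polyfit returns [slope, intercept]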
The residuals are e_i = y_i − ŷ_i, and the residual sum of squares is RSS = Σ_i (y_i − ŷ_i)² = Σ_i e_i². Then
S² = RSS / (n − p)
is an unbiased estimator of σ².
Rearranging the definition of RSS, we find RSS = Syy − b² Sxx. In other words, RSS is the extra variability in y that we cannot explain after fitting the model. If Syy is the total variability, then b² Sxx is the explained variability.
Analysis of Variance Table:
Source                   Degrees of Freedom   Sums of Squares       Mean Square
Regression (Explained)   p − 1                b² Sxx                S_r²
Residual (Unexplained)   n − p                RSS = Syy − b² Sxx    S² = RSS/(n − p)
Total                    n − 1                Syy                   S_T²
Define
R² = b² Sxx / Syy = Sxy² / (Sxx Syy).
R² is the percentage of the variability in y that is explained by the independent variables via the regression equation. In words, it is the explained variability over the total variability, so it is a good measure of how well the line fits the data. In simple linear regression, we saw that the correlation coefficient is r = Sxy / √(Sxx Syy); thus, in SLR, R² = r². This doesn't carry over to multiple regression, because there we have many correlations and only one R² value.
Adjusted R². Good for comparing models in the multiple regression setting. It reflects the fact that adding more variables, while always increasing R², might lead to a worse model. Define
R²_adj = (S_T² − S²) / S_T².
SE(a) = S √( 1/n + x̄²/Sxx ).
Hypothesis tests and inferences about α and β are carried out as always and follow a t distribution, because we are estimating σ.
F Test for Regression: particularly useful for multiple regression. In SLR, F = t². Test H0: β_i = 0 for all i versus H1: β_i ≠ 0 for at least one i. The null hypothesis is that the regression has no effect. The test statistic is F = S_r²/S². If F is much different from 1, then reject H0 and conclude that there is a valid regression effect. It can be shown, as a ratio of two chi-squared variables, that
S_r² / S² ∼ F_{p−1, n−p}.
12.3 Prediction Intervals
Plug your x value into the regression equation and get your predicted y. Be careful, though, of points outside the range of your data. For an interval of confidence, develop a prediction interval for y. ŷ is your estimator, and the standard error of ŷ is
SE(ŷ) = S √( 1 + 1/n + (x − x̄)²/Sxx ).
So your prediction interval becomes
y ∈ ( ŷ ± t_{n−2} SE(ŷ) ).
From the last term in the SE formula, it is clear that the further away from the mean of x you are, the larger your prediction interval.
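A sketch of a 95 percent prediction interval at an assumed new point x0 = 4.5, continuing the invented data used above:

    import numpy as np
    from scipy.stats import t

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    syy = np.sum((y - y.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    b = sxy / sxx
    a = y.mean() - b * x.mean()
    s2 = (syy - b ** 2 * sxx) / (n - 2)                    # RSS / (n - p), with p = 2
    x0 = 4.5
    yhat = a + b * x0
    se = np.sqrt(s2 * (1 + 1 / n + (x0 - x.mean()) ** 2 / sxx))
    tcrit = t.ppf(0.975, df=n - 2)
    print(yhat - tcrit * se, yhat + tcrit * se)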
12.4 Multiple Regression
Model: y = xβ + ε, where x is the design matrix with a leading column of ones,
x =
[ 1  x11  x12  ...  x1p ]
[ 1  x21  x22  ...  x2p ]
[ ...                   ]
[ 1  xn1  xn2  ...  xnp ]
and β = (β₁, β₂, ..., β_p)^T.
Thus, for OLS, we minimize (y − xβ)^T (y − xβ), which yields
β̂ = b = (x^T x)⁻¹ (x^T y).
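A sketch of the matrix formula on simulated data (the coefficients 1.0, 2.0, −0.5 and the noise level are invented); np.linalg.lstsq gives the same answer.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 50
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

    X = np.column_stack([np.ones(n), x1, x2])      # leading column of ones
    b = np.linalg.solve(X.T @ X, X.T @ y)          # (X'X)^{-1} X'y
    print(b)
    print(np.linalg.lstsq(X, y, rcond=None)[0])    # same estimates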
13 Time Series
Laspeyres Price Index: for comparing prices using quantities at the base time, t₀:
P_t = 100 Σ q₀ p_t / Σ q₀ p₀.
Paasche Price Index:
P_t = 100 Σ q_t p_t / Σ q_t p₀.
Quantity Index:
P_t = 100 Σ q_t p₀ / Σ q₀ p₀.
Value Index:
P_t = 100 Σ q_t p_t / Σ q₀ p₀.
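A small worked sketch of the Laspeyres and Paasche price indices for two goods, with invented base-period (0) and current-period (t) prices and quantities:

    import numpy as np

    p0, q0 = np.array([1.0, 3.0]), np.array([10.0, 4.0])
    pt, qt = np.array([1.2, 3.6]), np.array([9.0, 5.0])

    laspeyres = 100 * np.sum(q0 * pt) / np.sum(q0 * p0)   # base-period quantities
    paasche = 100 * np.sum(qt * pt) / np.sum(qt * p0)     # current-period quantities
    print(laspeyres, paasche)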
Index Linking. It is useful to re-index from time to time, but to avoid jumps, define
P_t = P_{0,t} for t = 0, ..., 10, and
P_t = (P_{0,10} / 100) P_{10,t} = (P_{0,10} / 100) · 100 Σ q₁₀ p_t / Σ q₁₀ p₁₀ for t = 10, ..., 20.
Time Series: x₀, x₁, x₂, ..., x_t, ...
Classical economic time series: x_t = T_t + S_t + C_t + I_t (trend + seasonal + cyclical + irregular stationary component).
Stationary Time Series: relate the variable to itself using 1 or more lags.
Autoregression: (x_t − x̄) = b(x_{t−1} − x̄).
Autocorrelation (τ is the number of lags):
r_τ = [ Σ_t (x_t − x̄)(x_{t+τ} − x̄) / (n − 1) ] / [ Σ_t (x_t − x̄)² / (n − 1) ].
Models.
First-order autoregressive model: x_t = αx_{t−1} + ε_t, where |α| < 1 for a stationary time series and |α| ≥ 1 for a non-stationary one.
Second-order autoregressive model: x_t = α₁x_{t−1} + α₂x_{t−2} + ε_t.
Moving Average Model:
x_t = ε_t + bε_{t−1},
x_{t+1} = ε_{t+1} + bε_t,
x_{t+2} = ε_{t+2} + bε_{t+1}.
Here every pair of neighboring terms is correlated, but terms further apart are not. This can be extended to more than one lagged interaction.
ai xti =
i=0 i=0
bi
ti .
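A simulation sketch of a stationary first-order autoregressive series with an assumed coefficient α = 0.7; the lag-1 sample autocorrelation comes out close to α.

    import numpy as np

    rng = np.random.default_rng(4)
    alpha, n = 0.7, 5_000
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = alpha * x[t - 1] + rng.normal()   # x_t = alpha * x_{t-1} + e_t
    xd = x - x.mean()
    r1 = np.sum(xd[1:] * xd[:-1]) / np.sum(xd ** 2)
    print(r1)                                    # approximately 0.7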