Chapter 4 - Summarizing Numerical Data

15.075 Cynthia Rudin

Here are some ways we can summarize data numerically.

Sample Mean:
$$\bar{x} := \frac{1}{n} \sum_{i=1}^{n} x_i$$

Note: in this class we will work with both the population mean $\mu$ and the sample mean $\bar{x}$. Do not confuse them! Remember, $\bar{x}$ is the mean of a sample taken from the population and $\mu$ is the mean of the whole population.

Sample median: order the data values $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$, then
$$\text{median} := \tilde{x} := \begin{cases} x_{((n+1)/2)} & n \text{ odd} \\ \frac{1}{2}\left[ x_{(n/2)} + x_{(n/2+1)} \right] & n \text{ even} \end{cases}$$
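For concreteness, a minimal Python sketch of both summaries computed directly from the definitions above (the data list is made up, reusing the outlier example from these notes):

```python
# Sample mean and median from their definitions.
data = [1, 2, 3, 4, 500]  # made-up sample containing an outlier (500)

n = len(data)
sample_mean = sum(data) / n            # xbar = (1/n) * sum of x_i

ordered = sorted(data)                 # x_(1) <= x_(2) <= ... <= x_(n)
if n % 2 == 1:                         # n odd: take the middle order statistic
    sample_median = ordered[n // 2]
else:                                  # n even: average the two middle values
    sample_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(sample_mean)    # 102.0 -- pulled far from the bulk of the data
print(sample_median)  # 3     -- unaffected by the outlier
```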


Mean and median can be very different: 1, 2, 3, 4, 500 (here 500 is an outlier). The median is more robust to outliers.

Quantiles/Percentiles: Order the sample, then find $x_p$ so that it divides the data into two parts, where a fraction $p$ of the data values are less than or equal to $x_p$ and the remaining fraction $(1-p)$ are greater than $x_p$. That value $x_p$ is the $p$-th quantile, or $100p$-th percentile.

5-number summary: $\{x_{\min}, Q_1, Q_2, Q_3, x_{\max}\}$, where $Q_1 = x_{.25}$, $Q_2 = x_{.5}$, $Q_3 = x_{.75}$.

Range: $x_{\max} - x_{\min}$, measures dispersion.

Interquartile Range: $\mathrm{IQR} := Q_3 - Q_1$, a range resistant to outliers.
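A short NumPy sketch of the 5-number summary, range, and IQR on made-up data (NumPy's default percentile interpolation is one of several quantile conventions, so its quartiles can differ slightly from a hand computation):

```python
import numpy as np

data = np.array([2, 5, 7, 8, 10, 12, 13, 14, 18, 21])  # made-up sample

q1, q2, q3 = np.percentile(data, [25, 50, 75])

five_number_summary = (data.min(), q1, q2, q3, data.max())

data_range = data.max() - data.min()   # sensitive to outliers
iqr = q3 - q1                          # resistant to outliers

print(five_number_summary)
print("range:", data_range, "IQR:", iqr)
```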

Sample Variance $s^2$ and Sample Standard Deviation $s$:
$$s^2 := \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
(we will see later why the divisor is $n-1$).

Remember, for a large sample from a normal distribution, 95% of the sample falls in $[\bar{x} - 2s, \bar{x} + 2s]$. Do not confuse $s^2$ with $\sigma^2$, which is the variance of the population.

Coefficient of variation: $\mathrm{CV} := \frac{s}{\bar{x}}$, dispersion relative to the size of the mean.
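A NumPy sketch of these three quantities on made-up data; ddof=1 gives the $n-1$ divisor used above:

```python
import numpy as np

x = np.array([4.1, 5.3, 6.0, 5.8, 4.9, 5.5])  # made-up sample

n = len(x)
xbar = x.mean()

s2 = ((x - xbar) ** 2).sum() / (n - 1)   # sample variance, 1/(n-1) divisor
s = np.sqrt(s2)                          # sample standard deviation
assert np.isclose(s2, x.var(ddof=1))     # NumPy's built-in with ddof=1 agrees

cv = s / xbar                            # coefficient of variation

print(s2, s, cv)
```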

z-score:
$$z_i := \frac{x_i - \bar{x}}{s}$$
It tells you where a data point lies in the distribution, that is, how many standard deviations above/below the mean. E.g. $z_i = 3$ is 3 standard deviations above the mean, where the z-scores are distributed approximately N(0, 1).

It allows you to compute percentiles easily using the z-scores table, or a command on the computer.
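A sketch of z-scores and the percentiles they imply, using scipy.stats.norm.cdf in place of a printed z-table (this treats the data as roughly normal; the sample is made up):

```python
import numpy as np
from scipy.stats import norm

x = np.array([4.1, 5.3, 6.0, 5.8, 4.9, 5.5])  # made-up sample

xbar, s = x.mean(), x.std(ddof=1)
z = (x - xbar) / s              # standard deviations above/below the mean

# Approximate percentile of each point via the standard normal CDF.
percentiles = norm.cdf(z) * 100

print(np.round(z, 2))
print(np.round(percentiles, 1))
```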

Now some graphical techniques for describing data:

- Bar chart / Pie chart - good for summarizing data within categories
- Pareto chart - a bar chart where the bars are sorted
- Histogram
- Boxplot and normplot
- Scatterplot - for bivariate data
- Q-Q Plot - for 2 independent samples
- Hans Rosling's data visualizations
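As a quick illustration, a minimal matplotlib sketch of two of these plots (histogram and boxplot) on made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=200)   # made-up sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.hist(x, bins=20)        # histogram: shape of the distribution
ax1.set_title("Histogram")

ax2.boxplot(x)              # boxplot: 5-number summary at a glance
ax2.set_title("Boxplot")

plt.tight_layout()
plt.show()
```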

Chapter 4.4: Summarizing bivariate data

Two-Way Table. Here's an example:

                   Respiratory Problem?
                   yes    no    row total
    smokers         25    25       50
    non-smokers      5    45       50
    column total    30    70      100

Question: If this example is from a study with 50 smokers and 50 non-smokers, is it meaningful to conclude that in the general population:
a) 25/30 = 83% of people with respiratory problems are smokers?
b) 25/50 = 50% of smokers have respiratory problems?

Simpson's Paradox. Deals with aggregating smaller datasets into larger ones. Simpson's paradox is when conclusions drawn from the smaller datasets are the opposite of conclusions drawn from the larger dataset. It occurs when there is a lurking variable and uneven-sized groups being combined.

E.g. Kidney stone treatment (Source: Wikipedia). Which treatment is more effective?

                   Treatment A      Treatment B
    success rate   78% (273/350)    83% (289/350)

Including information about stone size, now which treatment is more effective?

                   Treatment A              Treatment B
    small stones   group 1: 93% (81/87)     group 2: 87% (234/270)
    large stones   group 3: 73% (192/263)   group 4: 69% (55/80)
    both           78% (273/350)            83% (289/350)

What happened!?
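The reversal can be verified directly from the counts; a small Python sketch (numbers taken from the table above):

```python
# (successes, total) for each treatment, split by stone size.
treatment_A = {"small": (81, 87), "large": (192, 263)}
treatment_B = {"small": (234, 270), "large": (55, 80)}

for size in ("small", "large"):
    sa, na = treatment_A[size]
    sb, nb = treatment_B[size]
    print(size, "stones: A", round(sa / na, 2), "vs B", round(sb / nb, 2))

# Aggregating over stone size flips the comparison.
sa = sum(s for s, _ in treatment_A.values())
na = sum(n for _, n in treatment_A.values())
sb = sum(s for s, _ in treatment_B.values())
nb = sum(n for _, n in treatment_B.values())
print("overall: A", round(sa / na, 2), "vs B", round(sb / nb, 2))
```

Treatment A wins within each stone size, but most of the hard (large-stone) cases went to A and most of the easy (small-stone) cases went to B; combining these uneven groups lets the lurking variable (stone size) reverse the conclusion.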

Continuing with bivariate data:

Correlation Coefficient - measures the strength of a linear relationship between two variables:
$$\text{sample correlation coefficient} = r := \frac{S_{xy}}{S_x S_y},$$
where
$$S_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \quad \text{and} \quad S_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
This is also called the Pearson Correlation Coefficient. If we rewrite
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x} \right) \left( \frac{y_i - \bar{y}}{S_y} \right),$$
you can see that $\frac{x_i - \bar{x}}{S_x}$ and $\frac{y_i - \bar{y}}{S_y}$ are the z-scores of $x_i$ and $y_i$.

- $r \in [-1, 1]$ and is $\pm 1$ only when the data fall along a straight line
- $\mathrm{sign}(r)$ indicates the slope of the line (do the $y_i$'s increase as the $x_i$'s increase?)
- always plot the data before computing $r$ to ensure it is meaningful

Correlation does not imply causation, it only implies association (there may be lurking variables that are not recognized or controlled). For example: there is a correlation between declining health and increasing wealth.

Linear regression (in Ch 10):
$$\frac{y - \bar{y}}{S_y} = r \, \frac{x - \bar{x}}{S_x}.$$
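A NumPy sketch of $r$ computed from the z-score form and checked against np.corrcoef (the x, y data are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # made up, roughly linear in x

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)          # z-scores of the x_i
zy = (y - y.mean()) / y.std(ddof=1)          # z-scores of the y_i

r = (zx * zy).sum() / (n - 1)                # r = 1/(n-1) * sum of z-score products
assert np.isclose(r, np.corrcoef(x, y)[0, 1])

print(round(r, 4))
```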

Chapter 4.5: Summarizing time-series data

Moving averages. Calculate the average over a window of previous timepoints:
$$MA_t = \frac{x_{t-w+1} + \cdots + x_t}{w},$$
where $w$ is the size of the window. Note that we make the window smaller at the beginning of the time series, when $t < w$.

To use moving averages for forecasting, given $x_1, \ldots, x_{t-1}$, let the predicted value at time $t$ be $\hat{x}_t = MA_{t-1}$. Then the forecast error is:
$$e_t = x_t - \hat{x}_t = x_t - MA_{t-1}.$$

The Mean Absolute Percent Error (MAPE) is:
$$MAPE = \frac{1}{T-1} \sum_{t=2}^{T} \left| \frac{e_t}{x_t} \right| \times 100\%.$$

The MAPE looks at the forecast error $e_t$ as a fraction of the measurement value $x_t$. Sometimes as measurement values grow, errors grow too; the MAPE helps to even this out.
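A sketch of a moving-average forecast and its MAPE on a made-up series (the window size w = 3 is an arbitrary illustrative choice):

```python
import numpy as np

x = np.array([12.0, 13.5, 13.0, 14.2, 15.0, 14.8, 16.1, 15.9])  # made-up series
w = 3                                                            # window size

# MA_t = average of the last w values up to and including time t
# (the window is shorter at the start of the series).
ma = np.array([x[max(0, t - w + 1): t + 1].mean() for t in range(len(x))])

# Forecast x_t by MA_{t-1}; forecast errors e_t = x_t - MA_{t-1} for t >= 2.
errors = x[1:] - ma[:-1]

mape = np.mean(np.abs(errors / x[1:])) * 100   # requires x_t != 0
print(round(mape, 2), "%")
```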

For the MAPE, $x_t$ cannot be 0.

Exponentially Weighted Moving Averages (EWMA). The EWMA doesn't completely drop old values:
$$EWMA_t = \alpha x_t + (1 - \alpha) EWMA_{t-1},$$
where $EWMA_0 = x_0$ and $0 < \alpha < 1$ is a smoothing constant. Here $\alpha$ controls the balance of recent data to old data. It is called "exponentially" weighted because of the recursive formula:
$$EWMA_t = \alpha \left[ x_t + (1-\alpha) x_{t-1} + (1-\alpha)^2 x_{t-2} + \cdots \right] + (1-\alpha)^t EWMA_0.$$
The forecast error is thus:
$$e_t = x_t - \hat{x}_t = x_t - EWMA_{t-1}.$$
(HW: Compare the MAPE for MA vs. EWMA.)

Autocorrelation coefficient. Measures the correlation between the time series and a lagged version of itself. The $k$-th order autocorrelation coefficient is:
$$r_k := \frac{\sum_{t=k+1}^{T} (x_{t-k} - \bar{x})(x_t - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}.$$
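A sketch of the EWMA recursion, its forecast errors, and the $k$-th order autocorrelation coefficient (the series, $\alpha = 0.3$, and $k = 1$ are illustrative choices):

```python
import numpy as np

x = np.array([12.0, 13.5, 13.0, 14.2, 15.0, 14.8, 16.1, 15.9])  # made-up series

# EWMA_t = alpha * x_t + (1 - alpha) * EWMA_{t-1}, with EWMA_0 = x_0.
alpha = 0.3
ewma = np.empty_like(x)
ewma[0] = x[0]
for t in range(1, len(x)):
    ewma[t] = alpha * x[t] + (1 - alpha) * ewma[t - 1]

# Forecast errors e_t = x_t - EWMA_{t-1}, and the corresponding MAPE.
errors = x[1:] - ewma[:-1]
mape = np.mean(np.abs(errors / x[1:])) * 100

def autocorr(series, k):
    """k-th order autocorrelation coefficient of a 1-D array."""
    xbar = series.mean()
    num = np.sum((series[:-k] - xbar) * (series[k:] - xbar))
    den = np.sum((series - xbar) ** 2)
    return num / den

print(round(mape, 2), "%", round(autocorr(x, 1), 3))
```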

MIT OpenCourseWare
http://ocw.mit.edu

15.075J / ESD.07J Statistical Thinking and Data Analysis


Fall 2011

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
