Chapter 4 - Summarizing Numerical Data

The document summarizes various techniques for numerically describing data through measures of central tendency (mean, median), dispersion (range, standard deviation, interquartile range), distributions (quantiles, histograms), and relationships between variables (correlation, regression). It discusses using averages to summarize both cross-sectional and time series data, including moving averages and exponentially weighted moving averages to smooth time series and forecast future values.

Uploaded by

eviroyer


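15.075 Cynthia Rudin

Here are some ways we can summarize data numerically.

Sample Mean: $\bar{x} := \frac{1}{n} \sum_{i=1}^{n} x_i$.

Note: in this class we will work with both the population mean $\mu$ and the sample mean $\bar{x}$. Do not confuse them! Remember, $\bar{x}$ is the mean of a sample taken from the population and $\mu$ is the mean of the whole population.

Sample median: order the data values $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$, so then

$$\text{median} := \tilde{x} := \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\ \frac{1}{2}\left[ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right] & n \text{ even.} \end{cases}$$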

Mean and median can be very different: 1, 2, 3, 4, 500 (here 500 is an outlier). The median is more robust to outliers.

Quantiles/Percentiles: Order the sample, then find $\tilde{x}_p$ so that it divides the data into two parts where: a fraction $p$ of the data values are less than or equal to $\tilde{x}_p$ and the remaining fraction $(1-p)$ are greater than $\tilde{x}_p$. That value $\tilde{x}_p$ is the $p$th quantile, or $100p$th percentile.
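As a quick check of the outlier example above, here is a minimal Python sketch using only the standard library `statistics` module. The variable names are illustrative and not part of the notes.

```python
import statistics

# The five values from the example; 500 is the outlier.
data = [1, 2, 3, 4, 500]

sample_mean = statistics.mean(data)      # (1 + 2 + 3 + 4 + 500) / 5
sample_median = statistics.median(data)  # middle value of the sorted data (n is odd)

print("mean:", sample_mean)
print("median:", sample_median)

# Dropping the outlier moves the mean a lot but the median only slightly.
trimmed = data[:-1]
mean_without_outlier = statistics.mean(trimmed)
median_without_outlier = statistics.median(trimmed)  # n is even: average of the two middle values
print("mean without outlier:", mean_without_outlier)
print("median without outlier:", median_without_outlier)
```

The single value 500 pulls the mean far away from the bulk of the data, while the median stays close to the other four values.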

5-number summary: $\{x_{\min}, Q_1, Q_2, Q_3, x_{\max}\}$, where $Q_1 = \tilde{x}_{.25}$, $Q_2 = \tilde{x}_{.5}$, $Q_3 = \tilde{x}_{.75}$.

Range: $x_{\max} - x_{\min}$, measures dispersion.

Interquartile Range: $IQR := Q_3 - Q_1$, a range resistant to outliers.
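The 5-number summary, range, and IQR can be sketched in Python as follows. Quartile conventions differ between textbooks and software; this sketch assumes the "inclusive" interpolation rule of `statistics.quantiles`, which may not match the exact convention used in the course. The helper name `five_number_summary` is hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (x_min, Q1, Q2, Q3, x_max) using the 'inclusive' quartile rule."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Same data as the outlier example: 500 is far from the rest.
data = [1, 2, 3, 4, 500]

x_min, q1, q2, q3, x_max = five_number_summary(data)
data_range = x_max - x_min  # sensitive to the outlier
iqr = q3 - q1               # only depends on the middle half of the data

print("five-number summary:", (x_min, q1, q2, q3, x_max))
print("range:", data_range)
print("IQR:", iqr)
```

On this data the range is dominated by the outlier, while the IQR is not affected by it.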

Sample Variance $s^2$ and Sample Standard Deviation $s$:

$$s^2 := \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

(The divisor is $n-1$ rather than $n$; see why later.)

Remember, for a large sample from a normal distribution, approximately 95% of the sample falls in $[\bar{x} - 2s, \bar{x} + 2s]$. Do not confuse $s^2$ with $\sigma^2$, which is the variance of the population.

Coefficient of variation: $CV := \frac{s}{\bar{x}}$, dispersion relative to size of mean.
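A minimal Python sketch of the sample variance, standard deviation, and coefficient of variation, written out with the $n-1$ divisor and cross-checked against the standard library. The data values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math
import statistics

def sample_variance(values):
    """Sample variance s^2 with the n - 1 divisor."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Hypothetical data: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)          # 32 / 7
s = math.sqrt(s2)                   # sample standard deviation
cv = s / (sum(data) / len(data))    # dispersion relative to the mean

print("s^2:", s2)
print("s:", s)
print("CV:", cv)

# statistics.variance also uses the n - 1 divisor.
print("matches statistics.variance:", math.isclose(s2, statistics.variance(data)))
```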

z-score: $z_i := \frac{x_i - \bar{x}}{s}$.

It tells you where a data point lies in the distribution, that is, how many standard deviations above/below the mean. E.g. $z_i = 3$ where the distribution is $N(0, 1)$.

It allows you to compute percentiles easily using the z-scores table, or a command on the computer.
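One way to do this "on the computer" is sketched below in Python, assuming the standard library class `statistics.NormalDist` (available from Python 3.8) as a stand-in for the z-table. The data values and the helper name `z_scores` are hypothetical.

```python
import statistics
from statistics import NormalDist

def z_scores(values):
    """Standardize each value: subtract the sample mean, divide by the sample std."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # uses the n - 1 divisor
    return [(x - x_bar) / s for x in values]

# Hypothetical data.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(data)
print("z-scores:", [round(v, 3) for v in z])

# Fraction of a standard normal N(0, 1) distribution lying below z = 3,
# i.e. the percentile that a z-table would report for z = 3.
p = NormalDist().cdf(3.0)
print("P(Z <= 3):", round(p, 5))
```

The percentile lookup is only meaningful if the data are roughly normal; the z-scores themselves can be computed for any sample.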

Now some graphical techniques for describing data.

Bar chart/Pie chart - good for summarizing data within categories

Pareto chart - a bar chart where the bars are sorted.

Histogram

Boxplot and normplot

Scatterplot for bivariate data

Q-Q Plot for 2 independent samples

Hans Rosling

Chapter 4.4: Summarizing bivariate data

Two Way Table

Here's an example:

                  Respiratory Problem?
                  yes     no      row total
smokers           25      25      50
non-smokers       5       45      50
column total      30      70      100

Question: If this example is from a study with 50 smokers and 50 non-smokers, is it meaningful to conclude that in the general population:
a) 25/30 = 83% of people with respiratory problems are smokers?
b) 25/50 = 50% of smokers have respiratory problems?

Simpson's Paradox

Deals with aggregating smaller datasets into larger ones.

Simpson's paradox is when conclusions drawn from the smaller datasets are the opposite of conclusions drawn from the larger dataset.

Occurs when there is a lurking variable and uneven-sized groups being combined.

E.g. Kidney stone treatment (Source: Wikipedia)

Which treatment is more effective?

Treatment A          Treatment B
78% (273/350)        83% (289/350)

Including information about stone size, now which treatment is more effective?

                  Treatment A                Treatment B
small stones      group 1: 93% (81/87)       group 2: 87% (234/270)
large stones      group 3: 73% (192/263)     group 4: 69% (55/80)
both              78% (273/350)              83% (289/350)

What happened!?
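The reversal can be checked directly from the counts in the kidney stone table. The following Python sketch recomputes the per-group and pooled success rates; the dictionary layout and function names are illustrative choices, not part of the notes.

```python
# (successes, patients) for each treatment and stone size, copied from the table.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    """Success rate within one group."""
    return successes / total

def pooled_rate(groups):
    """Success rate after combining all stone sizes for one treatment."""
    successes = sum(s for s, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return successes / total

for size in ("small", "large"):
    a = rate(*data["A"][size])
    b = rate(*data["B"][size])
    print(f"{size} stones: A = {a:.0%}, B = {b:.0%}")

pooled_a = pooled_rate(data["A"])
pooled_b = pooled_rate(data["B"])
print(f"both: A = {pooled_a:.0%}, B = {pooled_b:.0%}")
```

Treatment A has the higher rate within each stone size, yet the lower rate after pooling. The table shows the uneven group sizes behind this: A was given mostly to large-stone patients (263 of 350), B mostly to small-stone patients (270 of 350), and large stones have lower success rates under either treatment.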

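Continuing with bivariate data:

Correlation Coefficient - measures the strength of a linear relationship between two variables:

$$\text{sample correlation coefficient} = r := \frac{S_{xy}}{S_x S_y},$$

where

$$S_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad S_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

This is also called the Pearson Correlation Coefficient. If we rewrite

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{S_x} \cdot \frac{(y_i - \bar{y})}{S_y},$$

you can see that $\frac{(x_i - \bar{x})}{S_x}$ and $\frac{(y_i - \bar{y})}{S_y}$ are the z-scores of $x_i$ and $y_i$.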
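The two forms of $r$ above (covariance over the product of standard deviations, and the average product of z-scores) can be compared numerically. This is a minimal Python sketch with hypothetical data; the function names are illustrative.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation with the n - 1 divisor."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson_r(x, y):
    """r = S_xy / (S_x * S_y)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    return s_xy / (sample_std(x) * sample_std(y))

def pearson_r_from_z(x, y):
    """r as the sum of products of z-scores, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_std(x), sample_std(y)
    zx = [(a - x_bar) / s_x for a in x]
    zy = [(b - y_bar) / s_y for b in y]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Hypothetical data: roughly increasing, but not exactly on a line.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r1 = pearson_r(x, y)
r2 = pearson_r_from_z(x, y)
print("r (covariance form):", r1)
print("r (z-score form):   ", r2)

# Points exactly on a line with positive slope.
r_line = pearson_r(x, [2 * v for v in x])
print("r for points on a line:", r_line)
```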

$r \in [-1, 1]$ and is $\pm 1$ only when data fall along a straight line.

sign($r$) indicates the slope of the line (do the $y_i$'s increase as the $x_i$'s increase?)

Always plot the data before computing $r$ to ensure it is meaningful.

Correlation does not imply causation, it only implies association (there may be lurking variables that are not recognized or controlled). For example: There is a correlation between declining health and increasing wealth.

Linear regression (in Ch 10):

$$\frac{y - \bar{y}}{S_y} = r \, \frac{x - \bar{x}}{S_x}.$$

Chapter 4.5: Summarizing time-series data

Moving averages. Calculate average over a window of previous timepoints:

$$MA_t = \frac{x_{t-w+1} + \cdots + x_t}{w},$$

where $w$ is the size of the window. Note that we make window $w$ smaller at the beginning of the time series when $t < w$.

To use moving averages for forecasting, given $x_1, \ldots, x_{t-1}$, let the predicted value at time $t$ be $\hat{x}_t = MA_{t-1}$. Then the forecast error is:

$$e_t = x_t - \hat{x}_t = x_t - MA_{t-1}.$$

The Mean Absolute Percent Error (MAPE) is:

$$MAPE = \frac{1}{T-1} \sum_{t=2}^{T} \left| \frac{e_t}{x_t} \right| \times 100\%.$$

The MAPE looks at the forecast error $e_t$ as a fraction of the measurement value $x_t$. Sometimes as measurement values grow, errors grow too; the MAPE helps to even this out.
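The moving-average forecast and its MAPE can be sketched in Python as follows. The series is hypothetical, the function names are illustrative, and the shrinking window at the start follows the note above about $t < w$.

```python
def moving_average(x, w):
    """MA_t over the last w points; the window shrinks at the start of the series."""
    out = []
    for t in range(len(x)):
        window = x[max(0, t - w + 1): t + 1]
        out.append(sum(window) / len(window))
    return out

def mape(x, forecasts):
    """Mean absolute percent error over t = 2..T (index 0 has no forecast).

    Requires every x[t] used in the sum to be nonzero.
    """
    terms = [abs((x[t] - forecasts[t]) / x[t]) for t in range(1, len(x))]
    return 100.0 * sum(terms) / len(terms)

# Hypothetical time series.
x = [10, 12, 11, 13, 12]

ma = moving_average(x, w=2)
# The forecast for time t is the moving average at time t - 1.
forecasts = [None] + ma[:-1]

ma_mape = mape(x, forecasts)
print("moving averages:", ma)
print("forecasts:", forecasts)
print("MAPE (%):", round(ma_mape, 3))
```

Here the forecast errors are 2, 0, 1.5, and 0, so the MAPE is the average of 2/12, 0, 1.5/13, and 0, expressed as a percentage.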

For MAPE, $x_t$ can't be 0.

Exponentially Weighted Moving Averages (EWMA). It doesn't completely drop old values.

$$EWMA_t = \alpha x_t + (1 - \alpha) EWMA_{t-1},$$

where $EWMA_0 = x_0$ and $0 < \alpha < 1$ is a smoothing constant.

Here $\alpha$ controls the balance of recent data to old data.

It is called "exponentially" weighted because of the recursive formula:

$$EWMA_t = \alpha \left[ x_t + (1-\alpha) x_{t-1} + (1-\alpha)^2 x_{t-2} + \ldots \right] + (1-\alpha)^t EWMA_0.$$

The forecast error is thus:

$$e_t = x_t - \hat{x}_t = x_t - EWMA_{t-1}.$$

HW? Compare MAPE for MA vs EWMA.

Autocorrelation coefficient. Measures correlation between the time series and a lagged version of itself. The $k$th order autocorrelation coefficient is:

$$r_k := \frac{\sum_{t=k+1}^{T} (x_{t-k} - \bar{x})(x_t - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}.$$
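A minimal Python sketch of the EWMA recursion and the $k$th order autocorrelation coefficient defined above. The series, the value of the smoothing constant, and the function names are hypothetical choices for illustration.

```python
def ewma(x, alpha):
    """EWMA_t = alpha * x_t + (1 - alpha) * EWMA_{t-1}, starting from EWMA_0 = x_0."""
    out = [x[0]]
    for value in x[1:]:
        out.append(alpha * value + (1 - alpha) * out[-1])
    return out

def autocorrelation(x, k):
    """k-th order autocorrelation: lagged cross-products over total squared deviation."""
    T = len(x)
    x_bar = sum(x) / T
    # 0-based t = k..T-1 corresponds to t = k+1..T in the formula.
    numerator = sum((x[t - k] - x_bar) * (x[t] - x_bar) for t in range(k, T))
    denominator = sum((v - x_bar) ** 2 for v in x)
    return numerator / denominator

# Hypothetical time series, indexed from time 0.
x = [10, 12, 11, 13, 12]
smoothed = ewma(x, alpha=0.5)
print("EWMA:", smoothed)

# Forecast error at time t uses the EWMA from time t - 1.
errors = [x[t] - smoothed[t - 1] for t in range(1, len(x))]
print("forecast errors:", errors)

# An alternating series is negatively correlated with itself at lag 1.
alternating = [1, -1, 1, -1]
r1 = autocorrelation(alternating, 1)
print("lag-1 autocorrelation:", r1)
```

A larger `alpha` makes the EWMA track recent values more closely; a smaller one keeps more weight on the history.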

MIT OpenCourseWare
http://ocw.mit.edu

15.075J / ESD.07J Statistical Thinking and Data Analysis
Fall 2011

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
