
Data Analysis - Statistics-021216 - NH (Compatibility Mode)

Statistics is the study of collecting, analyzing, and presenting data. The main goals of exploratory data analysis are to find patterns in the data, understand what it looks like, and discover any surprises. Key aspects of analysis include examining relationships between variables, correlations, outliers, and data distributions. Statistics uses parameters like the mean and variance to describe population distributions, and sample statistics like the sample mean and variance to make inferences about populations.

Uploaded by

Jonathan Joe

Data Analysis: Statistics
Overview

Purpose of EDA

• To find patterns in the data, to get a sense of what the data look like, and perhaps to uncover some surprises

Patterns

• Relations between variables (e.g. linear or nonlinear)
• Correlations (positive, negative, none)
• Outliers
• Distributions (histograms, charts, etc.), e.g. is it normal?

[Figure: scatter plot of y with two points, A and B, marked]

Patterns: Trends and cycles

• Need a long series
• Trends may be
  ◦ Rigid (dotted line below)
  ◦ Flexible (moving average (MA), shown as a light curve), e.g. if data = {1, 3, 5, 4, 8, ...}, a 3-period MA is {3, 4, 17/3, ...}.
• Cycles may be irregular

[Figure: y plotted against time, showing the trend and the period and amplitude of a cycle]
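The 3-period moving average in the example above can be checked with a short sketch. The function name `moving_average` is an illustrative choice, not something from the slides, and exact fractions are used only so that 17/3 prints as in the text.

```python
from fractions import Fraction

def moving_average(values, window):
    """Return the simple moving average of `values` over `window` periods."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    # Average each run of `window` consecutive values.
    return [
        Fraction(sum(values[i:i + window]), window)
        for i in range(len(values) - window + 1)
    ]

data = [1, 3, 5, 4, 8]          # the series from the slide
ma3 = moving_average(data, 3)
print([str(v) for v in ma3])    # ['3', '4', '17/3']
```

A longer window gives a smoother (less flexible) trend, at the cost of losing more points at the ends of the series.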

Index numbers

• Used to track changes over time, e.g. house prices, steel prices
• For a single homogeneous product (i.e. quality does not change over time), a possible index is
  $1, P_1/P_0, \ldots, P_n/P_0$
  where $P_t$ is the price at time t. We often multiply by 100 to shift the decimal place.

Index numbers

• To construct an index of fruit prices, select a representative basket and compute as follows:

Item              t=0 (2000)   t=1      t=2      t=3
Apple             30c          35c      45c      50c
Orange            40c          40c      50c      55c
Banana (per kg)   $1.50        $1.70    $1.80    $2.00

t   Index
0   1
1   (35+40+170)/(30+40+150) = 245/220 = 1.11
2   (45+50+180)/220 = 275/220 = 1.25
3   (50+55+200)/220 = 305/220 = 1.39

i.e. the cost of fruit has gone up.
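As a rough check of the basket index, the sketch below sums the three prices in each period and divides by the base-period total. It assumes, as the slide's arithmetic does, one unit of each item in the basket (unweighted); the name `basket_index` is illustrative.

```python
# Prices in cents; bananas are per kg ($1.50 = 150c).
prices = {
    "apple":  [30, 35, 45, 50],
    "orange": [40, 40, 50, 55],
    "banana": [150, 170, 180, 200],
}

def basket_index(prices, base=0):
    """Cost of the basket in each period relative to the base period."""
    periods = len(next(iter(prices.values())))
    totals = [sum(series[t] for series in prices.values()) for t in range(periods)]
    return [total / totals[base] for total in totals]

index = basket_index(prices)
print([round(v, 2) for v in index])  # [1.0, 1.11, 1.25, 1.39]
```

Multiplying each value by 100 gives the index in the more familiar form 100, 111, 125, 139.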

Statistics and random variables

• "Statistics" can mean
  ◦ Methods of analyzing data (e.g. "I study statistics");
  ◦ Data themselves (e.g. "Do you have the statistics?"); or
  ◦ Formulas based on sample values (e.g. the sample mean $m = \sum x / n$ is a statistic)
• A random variable (rv) refers to an outcome that varies, e.g.
  ◦ Gender, e.g. G = 1 (if male) and G = 0 (if female). Since G takes on discrete values, it is a discrete rv.
  ◦ Weight, e.g. $W_i$ = weight of the ith person. W is a continuous rv.
• Often, we omit the term "random" and call it a variable.

Discrete distribution

• Suppose we toss 2 fair coins together.
• Let rv x = number of heads.
• Its probability distribution f(x) is given below.

x    f(x)
0    0.25
1    0.50
2    0.25

[Figure: bar chart of f(x) against x = 0, 1, 2]
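The distribution above can be derived by listing the four equally likely outcomes of two fair coins and counting heads in each. This is a minimal sketch; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All outcomes of tossing two fair coins: HH, HT, TH, TT (equally likely).
outcomes = list(product("HT", repeat=2))

# Count how many outcomes give each number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# f(x) = (number of outcomes with x heads) / (total number of outcomes)
f = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x, p in f.items():
    print(x, float(p))
```

The probabilities sum to 1, as they must for any probability distribution.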

Distributions: Continuous

• Consider a population of 200 students whose test scores (x) are distributed as shown below, e.g. x = 75.2 marks

Score     Frequency   Prob. distribution, f(x)   Cumulative distribution
0-20      10          0.05                       0.05
21-40     30          0.15                       0.20
41-60     90          0.45                       0.65
61-80     50          0.25                       0.90
81-100    20          0.10                       1.00
Total     200         1.00
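The last two columns of the table follow directly from the frequencies: each relative frequency is the class frequency divided by 200, and the cumulative column is the running total. A short sketch, with illustrative variable names:

```python
from itertools import accumulate

bins = ["0-20", "21-40", "41-60", "61-80", "81-100"]
freq = [10, 30, 90, 50, 20]

total = sum(freq)                                     # population size
rel = [n / total for n in freq]                       # f(x) for each class
cum = [c / total for c in accumulate(freq)]           # cumulative distribution

for b, n, p, c in zip(bins, freq, rel, cum):
    print(f"{b:>6} {n:>4} {p:.2f} {c:.2f}")
```

The running total is taken over the integer frequencies before dividing, which avoids accumulating floating-point rounding error.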

Distribution: Continuous

• We can plot its relative frequency polygon.
• If the column interval becomes smaller and smaller, the polygon approaches a curve, called a probability density function.

[Figure: probability density function f(x) plotted against score x]

Population mean and variance

• We often use 2 parameters to describe a population distribution: the mean ($\mu$) and the variance ($\sigma^2$). ($\sigma$ is the standard deviation.)
• Note that Greek letters refer to population parameters; we will use m for the sample mean and s for the sample standard deviation.
• The mean or expectation is given by
  $E[x] = \mu = \sum x f(x)$ if x is discrete
  $E[x] = \mu = \int x f(x)\,dx$ if x is continuous.
• The variance is given by
  $\sigma^2 = E[(x - \mu)^2] = \sum (x - \mu)^2 f(x)$ if x is discrete
  $\sigma^2 = E[(x - \mu)^2] = \int (x - \mu)^2 f(x)\,dx$ if x is continuous.
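For the continuous case, the integrals can be approximated numerically. The sketch below is an illustration only: the density f(x) = 2x on [0, 1] is an assumed example that does not appear in the slides, and `integrate` is a hypothetical helper using the midpoint rule. For this density the exact values are mean 2/3 and variance 1/18.

```python
def integrate(g, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def f(x):
    # Assumed density for illustration: f(x) = 2x on [0, 1].
    return 2.0 * x

total = integrate(f, 0, 1)                              # should be close to 1
mu = integrate(lambda x: x * f(x), 0, 1)                # mean
var = integrate(lambda x: (x - mu) ** 2 * f(x), 0, 1)   # variance

print(round(total, 4), round(mu, 4), round(var, 4))
```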

Example: Discrete case

• If x takes the values shown in the table:

x    f(x)
0    0.2
2    0.5
3    0.3

$\mu = 0.2(0) + 0.5(2) + 0.3(3) = 1.9$
$\sigma^2 = 0.2(0 - 1.9)^2 + 0.5(2 - 1.9)^2 + 0.3(3 - 1.9)^2 = 1.09$
Thus $\sigma = \sqrt{1.09} \approx 1.044$.
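The worked example can be reproduced by applying the discrete formulas for the mean and variance directly. A minimal sketch, with illustrative variable names:

```python
import math

# Probability distribution from the table: x -> f(x)
f = {0: 0.2, 2: 0.5, 3: 0.3}

mu = sum(x * p for x, p in f.items())                # mean: sum of x f(x)
var = sum((x - mu) ** 2 * p for x, p in f.items())   # variance: sum of (x - mu)^2 f(x)
sigma = math.sqrt(var)                               # standard deviation

print(round(mu, 3), round(var, 3), round(sigma, 3))  # 1.9 1.09 1.044
```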

Sample statistics

• In practice, we only have a sample of n values $\{x_1, \ldots, x_n\}$, n < N, where N is the population size.

Sample mean: $m = \sum x_i / n$
Sample variance: $s^2 = \sum (x_i - m)^2 / (n - 1)$

• Example: if our sample is {0, 2, 5, 5},
  $m = 12/4 = 3$
  $s^2 = [(0-3)^2 + (2-3)^2 + (5-3)^2 + (5-3)^2]/3 = 18/3 = 6$
  $s = \sqrt{6} \approx 2.449$.
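The sample example can be checked both by hand-coding the formulas and with Python's standard `statistics` module, whose `variance` function uses the same n - 1 divisor. This is a sketch for verification, not part of the original slides.

```python
import math
import statistics

sample = [0, 2, 5, 5]
n = len(sample)

# Direct application of the formulas.
m = sum(sample) / n
s2 = sum((x - m) ** 2 for x in sample) / (n - 1)
s = math.sqrt(s2)

print(m, s2, round(s, 3))  # 3.0 6.0 2.449

# The standard library gives the same results.
assert statistics.mean(sample) == m
assert statistics.variance(sample) == s2
```

Dividing by n instead of n - 1 would give the population variance formula (`statistics.pvariance`), which is 4.5 for these values.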

Distribution of sample mean m

• Left panel: population distribution of x, e.g. height of UI students, with unknown $\mu$ and $\sigma$. Note that x is not normally distributed.
• Right panel: sampling distribution of m, e.g. if we take, say, 100 samples, each of size n (say n = 9), and compute m for each sample, we have 100 values of m from which to draw the curve.
• Note that $m \sim N(\mu, \sigma^2/n)$ for large n. This is the Central Limit Theorem. For small n (< 30), $m \sim t_{n-1}(\mu, s^2/n)$.
• We often use these results to make inferences about $\mu$.

[Figure: left panel, population distribution of x centered at $\mu$; right panel, sampling distribution of m centered at $\mu$ with standard deviation $\sigma/\sqrt{n}$]
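The behaviour of the sample mean can be illustrated by simulation. The sketch below makes several assumptions that are not in the slides: the population is exponential with mean 1 and variance 1 (chosen only because it is clearly not normal), the seed is fixed for repeatability, and 20,000 samples are drawn rather than 100 so that the estimates are stable. The sample size n = 9 follows the example above.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 9                    # sample size, as in the example
num_samples = 20_000     # many more than 100, for a stable estimate

# Assumed population: exponential with rate 1 (mean 1, variance 1, skewed).
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

mean_of_means = sum(sample_means) / num_samples
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / num_samples

print(round(mean_of_means, 3))   # expected to be close to mu = 1
print(round(var_of_means, 3))    # expected to be close to sigma^2 / n = 1/9
```

The average of the sample means should sit near the population mean, and their variance near $\sigma^2/n$, even though the population itself is skewed.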

Summary: What is statistics?

• Statistics is the scientific application of mathematical principles to the collection, analysis, and presentation of data.
• At the foundation of all of statistics is data.

[Diagram: statistics deals with the collection, presentation, analysis, and use of data to make decisions and solve problems]


Engineering statistics

• Engineering statistics is the study of how best to:
  ◦ Collect engineering data
  ◦ Summarize or describe engineering data
  ◦ Draw formal inferences and practical conclusions on the basis of engineering data, all the while recognizing the reality of variation
