Lecture 1: Introduction: Statistics Is Concerned With
Key Definitions
A population (universe) is the collection of all members of a group
N represents the population size
A parameter is a numerical measure that describes a characteristic of a population
A statistic is a numerical measure that describes a characteristic of a sample
Sample
Measures computed from sample data are called statistics
Examples
Population                                                      Sample
All eligible voters                                             1000 voters polled
All light bulbs manufactured in a day                           100 light bulbs selected
All patients with high blood pressure for a clinical study      200 hypertension patients enrolled for a clinical study
Inferential Statistics
Drawing conclusions and/or making decisions concerning a population based only on sample data
Descriptive Statistics
Collect data
e.g., Survey
Present data
e.g., Tables and graphs
Characterize data
Types of Data
Data
  Categorical
    Examples: Marital Status, Political Party, Eye Color (defined categories)
  Numerical
    Discrete
      Examples: Number of Children, Defects per Hour (counted items)
    Continuous
      Examples: Weight, Distance (measured characteristics)
Histograms
Tables
Stem-and-Leaf Display
A simple way to see distribution details in a data set
METHOD: Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)
Data in Raw Form (as Collected): 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Data in Ordered Array from Smallest to Largest: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem-and-Leaf Display:
2 | 144677
3 | 028
4 | 1
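The METHOD above can be sketched in a few lines of Python. This is a minimal illustration, assuming two-digit values so that the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units digits) in sorted order.

    Assumes non-negative two-digit integers, as in the slide's data.
    """
    groups = defaultdict(list)
    for v in sorted(values):          # sorting first keeps the leaves ordered
        stem, leaf = divmod(v, 10)    # e.g. 27 -> stem 2, leaf 7
        groups[stem].append(leaf)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(groups.items())}

raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]   # raw data from the slide
display = stem_and_leaf(raw)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```

Run on the raw data from the slide, the sketch prints the same three rows as the display above.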
Find Range: 58 - 12 = 46
Select Number of Classes: 5 (usually between 5 and 15)
Compute Class Interval (Width): 10 (46/5, then round up)
Determine Class Boundaries (Limits): 10, 20, 30, 40, 50, 60
Class       Frequency   Percentage
[10, 20)        3           15
[20, 30)        6           30
[30, 40)        5           25
[40, 50)        4           20
[50, 60)        2           10
Total          20          100
Histogram Example
Class       Class Midpoint   Frequency
[10, 20)         15              3
[20, 30)         25              6
[30, 40)         35              5
[40, 50)         45              4
[50, 60)         55              2
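The class width, boundaries, percentages, and midpoints in the tables above follow mechanically from the range, the number of classes, and the class frequencies. The sketch below reproduces them; the raw observations are not given on the slides, so it starts from the stated minimum (12), maximum (58), and frequencies, and it takes the first boundary (10) from the slide rather than deriving it.

```python
import math

data_min, data_max = 12, 58          # smallest and largest observations (from the slide)
num_classes = 5                      # chosen number of classes
first_boundary = 10                  # lower limit of the first class, as on the slide

data_range = data_max - data_min                 # 58 - 12 = 46
width = math.ceil(data_range / num_classes)      # 46/5 = 9.2, rounded up to 10

# Class boundaries: 10, 20, ..., 60
boundaries = [first_boundary + k * width for k in range(num_classes + 1)]

frequencies = [3, 6, 5, 4, 2]                    # class frequencies from the slide
total = sum(frequencies)

percentages = [100 * f / total for f in frequencies]
midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]

for lo, hi, mid, f, pct in zip(boundaries, boundaries[1:], midpoints,
                               frequencies, percentages):
    print(f"[{lo}, {hi})  midpoint={mid:g}  frequency={f}  percentage={pct:g}")
```

The midpoints are the values used on the horizontal axis of the histogram, and the frequencies give the bar heights.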
Distribution Shape
The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the center.
Symmetric Distribution
[Figure: histogram of a symmetric distribution; vertical axis: Frequency]
The shape of the distribution is said to be skewed if the observations are not symmetrically distributed around the center.
Positively Skewed Distribution
A positively skewed distribution (skewed to the right) has a tail that extends to the right in the direction of positive values.
[Figure: histogram of a positively skewed distribution; vertical axis: Frequency]
A negatively skewed distribution (skewed to the left) has a tail that extends to the left in the direction of negative values.
[Figure: histogram of a negatively skewed distribution; vertical axis: Frequency]
Numerical description
Summary Measures
Quartiles
Variation: Range, Variance
Mean
Mean (Arithmetic Mean) of Data Values
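Sample mean (n = sample size):
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
Population mean (N = population size):
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$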
An example
TV watching hours/week: 5, 7, 3, 38, 7
Mean = (5 + 7 + 3 + 38 + 7)/5 = 60/5 = 12
Mean = 12 for the data 3, 5, 7, 7, 38
Mean = 6 for the data 3, 5, 7, 7, 8 (the extreme value 38 replaced by 8)
Median
Robust measure of central tendency
Not affected by extreme values
Median = 7 for the data 3, 5, 7, 7, 38
Median = 7 for the data 3, 5, 7, 7, 8
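The contrast between the mean and the median can be checked with Python's standard `statistics` module. The first list is the TV-watching data from the example; the second list, with 38 replaced by 8, is an assumption chosen to match the "Mean = 6" and "Median = 7" values shown above.

```python
from statistics import mean, median

with_outlier = [5, 7, 3, 38, 7]      # TV watching hours/week from the example
without_outlier = [5, 7, 3, 8, 7]    # assumed comparison data: 38 replaced by 8

for label, data in [("with 38", with_outlier), ("with 8", without_outlier)]:
    # The mean shifts with the extreme value; the median does not.
    print(f"{label}: mean = {mean(data)}, median = {median(data)}")
```

One extreme observation doubles the mean (6 to 12) while leaving the median at 7, which is what "robust" means here.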
Mode
A Measure of Central Tendency
Value that Occurs Most Often
Not Affected by Extreme Values
There May Not Be a Mode
There May Be Several Modes
Used for Either Numerical or Categorical Data
[Figure: two dot plots, one with Mode = 9 and one with No Mode]
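The properties listed for the mode can be illustrated with `statistics.multimode` (Python 3.8+), which returns every most-frequent value. The helper `modes` below is a hypothetical wrapper that reports "no mode" when all distinct values occur equally often; that convention is an assumption, since the slides do not define it precisely. The eye-color list is made-up illustration data.

```python
from statistics import multimode

def modes(values):
    """Return the most frequent value(s), or [] when no value stands out.

    Convention assumed here: if every distinct value occurs equally often
    (and there is more than one distinct value), there is no mode.
    """
    most_frequent = multimode(values)
    distinct = set(values)
    if len(distinct) > 1 and len(most_frequent) == len(distinct):
        return []
    return most_frequent

tv_hours = [5, 7, 3, 38, 7]                            # one mode; 38 has no effect
ordered = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]     # several modes
all_different = [3, 5, 7, 8]                           # no mode
eye_colors = ["blue", "brown", "brown", "green"]       # hypothetical categorical data

print(modes(tv_hours))        # [7]
print(modes(ordered))         # [24, 27]
print(modes(all_different))   # []
print(modes(eye_colors))      # ['brown']
```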
Quartiles
Split ordered data into 4 quarters
Position of the i-th quartile:
$$(Q_i) = \frac{i(n+1)}{4}$$
[Figure: ordered data divided by (Q1), (Q2), and (Q3) into four segments of 25% each]
Noncentral Location
Q1, Q2, and Q3 are the 25th, 50th, and 75th percentiles, respectively.
The pth percentile is the value of X such that p% of the measurements are less than X and (100 - p)% are greater than X.
Example (n = 10): position of Q3 = 3(10 + 1)/4 = 8.25
[Figure: box plot marking X_smallest, Q1, Median (Q2), Q3, and X_largest; values shown: 12, 15.75, 21]
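The position rule i(n + 1)/4 can be turned into a quartile value as sketched below. The slides give only the position; taking a linear interpolation between the two neighbouring ordered values when the position is fractional is an assumption (it is one common convention, and it matches `statistics.quantiles` with `method="exclusive"`). The data are the ten ordered values from the stem-and-leaf example, and the function name `quartile` is hypothetical.

```python
from statistics import quantiles

def quartile(ordered, i):
    """Value of the i-th quartile (i = 1, 2, 3) using position i(n + 1)/4.

    `ordered` must already be sorted. Fractional positions are resolved by
    linear interpolation between neighbours (an assumed convention).
    """
    n = len(ordered)
    position = i * (n + 1) / 4        # 1-based position in the ordered array
    lower = int(position)             # whole part of the position
    fraction = position - lower       # fractional part
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

ordered = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]   # n = 10
q1, q2, q3 = (quartile(ordered, i) for i in (1, 2, 3))
print(q1, q2, q3)   # positions 2.75, 5.5, 8.25

# Cross-check against the standard library's (n + 1)-based method.
print(quantiles(ordered, n=4, method="exclusive"))
```

For Q3 the position is 8.25, so the value lies a quarter of the way from the 8th ordered value (32) to the 9th (38), giving 33.5.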
Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation
Measures of variation give information on the spread or variability of the data values.
Range
Easy to compute
Difference between the Largest and the Smallest Observations:
Range = 12 - 7 = 5
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
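The sensitivity of the range to a single outlier is easy to reproduce. The sketch below uses the two data sets listed above; `value_range` is a hypothetical helper name.

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

typical = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5]
with_outlier = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 120]

print(value_range(typical))        # 5 - 1
print(value_range(with_outlier))   # 120 - 1
```

Only one of the 25 observations differs between the two lists, yet the range grows from 4 to 119.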
Variance
Sample Variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Population Variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Standard Deviation
Most widely used Measure of Variation
Has the Same Units as the Original Data
Sample Standard Deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Population Standard Deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Examples
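Data set: 11, 12, 13, 16, 16, 17, 18, 21 (n = 8)
$$\sum_{i=1}^{8} X_i = 11 + 12 + \cdots + 21 = 124$$
$$\bar{X} = \frac{1}{8}(11 + 12 + \cdots + 21) = 15.5$$
$$s^2 = \frac{1}{8 - 1}\sum_{i=1}^{8} (X_i - \bar{X})^2 = \frac{78}{7} = 11.14$$
$$s = \sqrt{s^2} = \sqrt{11.14} = 3.34$$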
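The worked example can be verified numerically. The sketch below computes the sample variance directly from the definition (divisor n - 1) and compares it with `statistics.variance` and `statistics.stdev`; the population versions (divisor N) are included only for contrast and would apply if these eight values were the entire population.

```python
import math
from statistics import mean, variance, stdev, pvariance, pstdev

data = [11, 12, 13, 16, 16, 17, 18, 21]
n = len(data)

x_bar = mean(data)                                   # 124 / 8 = 15.5
squared_deviations = sum((x - x_bar) ** 2 for x in data)   # 78.0

s2_manual = squared_deviations / (n - 1)             # sample variance, divisor n - 1
s_manual = math.sqrt(s2_manual)                      # sample standard deviation

print(f"sum = {sum(data)}, mean = {x_bar}")
print(f"s^2 = {s2_manual:.2f} (library: {variance(data):.2f})")
print(f"s   = {s_manual:.2f} (library: {stdev(data):.2f})")

# Population formulas divide by N instead of n - 1.
print(f"sigma^2 = {pvariance(data):.2f}, sigma = {pstdev(data):.2f}")
```

The sample figures agree with the slide: s² rounds to 11.14 and s to 3.34.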
Visualizing variation