Unit 3 - Descriptive Statistics
UNIT 3 - FUNDAMENTALS OF STATISTICS
Hashem Zarafat
What Is Statistics?
1. Collecting Data
e.g., surveys, databases
2. Presenting Data
e.g., plots, charts, tables, data visualization
3. Characterizing Data
e.g., means, correlations…
[Diagram: Data → Analysis → Decision-Making (Why?)]
Statistical Methods
• Descriptive Statistics
• Inferential Statistics
Descriptive Statistics
1. Involves
• Collecting Data
• Presenting Data
• Characterizing Data
2. Purpose
• Describe Data
[Figure: chart of dollar values ($) by quarter (Q1 to Q4), annotated with X̄ = 30.5 and S² = 113]
Numerical measures that describe a distribution
by providing:
• Information on the central tendency of
the distribution
• The width of the distribution
• The shape of the distribution
Measure of central tendency: a number that
characterizes the “middleness” of an entire
distribution
Inferential Statistics
1. Involves
• Estimation
• Hypothesis Testing
2. Purpose
• Make decisions about population
characteristics
Fundamental elements:
1. Statistical Inference
• Estimation or prediction or generalization about a
population based on information contained in a
sample
2. Measure of Reliability
• Statement (usually qualified) about the degree of
uncertainty associated with a statistical inference
Populations vs Samples
Nominal Scale (Categorical)
• R: “character” object
Ordinal Scale (Categorical)
• R: “factor” object
Interval Scale (Numerical, Continuous)
• R: “numeric” object
Ratio Scale (Numerical, Continuous)
• For any percentage p, the pth percentile is the value such that
p percent of all values are less than it.
– It splits the data into two pieces: the lower piece contains p percent of
the data, and the upper piece contains the rest of the data.
• Calculating: For example, suppose you have 25 test scores, and in order from
lowest to highest they look like this: 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72,
77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99. To find the 90th percentile for
these (ordered) scores, start by multiplying 90% times the total number of scores,
which gives 90% ∗ 25 = 0.90 ∗ 25 = 22.5 (the index). Rounding up to the nearest
whole number, you get 23. Counting from left to right (from the smallest to the
largest value in the data set), you go until you find the 23rd value in the data set.
That value is 98, and it’s the 90th percentile for this data set.
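The rounding-up procedure above can be sketched in Python. This is a minimal illustration, not a standard library routine: the function name `percentile_round_up` is hypothetical, and the text does not say what to do when the index is already a whole number, so this sketch assumes that position is simply used as is.

```python
import math

def percentile_round_up(values, p):
    """Return the p-th percentile using the round-up-the-index rule."""
    ordered = sorted(values)
    index = p * len(ordered) / 100       # e.g. 90 * 25 / 100 = 22.5
    rank = max(1, math.ceil(index))      # round up; rank counts from 1
    return ordered[rank - 1]             # convert to a 0-based position

scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]

print(percentile_round_up(scores, 90))   # 98
```

Statistical software usually interpolates between neighboring values instead of rounding the index, so library results can differ slightly from this hand method.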
Measures of Variability:
Quartiles, Range, and Interquartile Range
• The quartiles divide the data into four groups, each with
(approximately) a quarter of all observations.
• Naturally, the first, second and third quartiles are the percentiles
corresponding to p = 25%, p = 50%, and p = 75%.
• By definition, the second quartile (p = 50%) is equal to the median.
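As a rough illustration of the quartiles, the sketch below uses `statistics.quantiles` from the Python standard library on 25 example test scores (the data set is an assumption chosen for illustration). The range is taken as maximum minus minimum and the interquartile range as Q3 minus Q1; these are the usual definitions and are not spelled out in the bullets above. The library interpolates between neighboring values, so Q1 and Q3 may differ slightly from values found by rounding an index by hand.

```python
import statistics

scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(scores, n=4)

data_range = max(scores) - min(scores)   # largest minus smallest value
iqr = q3 - q1                            # spread of the middle half

print(q2 == statistics.median(scores))   # True: Q2 is the median
print(data_range, iqr)
```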
• If all of the observations are close to the mean, then their squared
deviations from the mean will be relatively small, and the variance
will be relatively small.
• If at least a few of the observations are far from the mean, then
their squared deviations from the mean will be large, and this will
cause the variance to be large.
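The two cases above can be compared numerically. The sketch below assumes the sample variance (sum of squared deviations divided by n − 1), which is what `statistics.variance` computes; both data sets are made up for illustration and share the same mean of 81.

```python
import statistics

tight = [79, 80, 81, 82, 83]           # all values close to the mean
spread = [57, 99, 78, 73, 84, 95]      # a few values far from the mean

for name, data in (("tight", tight), ("spread", spread)):
    mean = statistics.mean(data)
    squared_deviations = [(x - mean) ** 2 for x in data]
    # Sample variance: sum of squared deviations divided by (n - 1).
    by_hand = sum(squared_deviations) / (len(data) - 1)
    print(name, by_hand, statistics.variance(data))
```

The first data set has small squared deviations and a small variance; the second has several large squared deviations and a much larger variance.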
Empirical Rules for Interpreting Standard
Deviation
• A more natural measure is the standard deviation, which is the square
root of the variance (denoted s for a sample or σ for a population).
• The interpretation of the standard deviation can be stated as three
empirical rules.
• If the values of a variable are approximately normally distributed
(symmetric and bell-shaped), then the following rules hold:
(1) Approximately 68% of the observations are within one standard deviation of
the mean.
(2) Approximately 95% of the observations are within two standard deviations
of the mean.
(3) Approximately 99.7% of the observations are within three standard deviations
of the mean.
• Fortunately, many variables in real-world data are indeed approximately
normally distributed.
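The three percentages can be checked against an exact normal distribution. This is a small sketch using `statistics.NormalDist` from the Python standard library; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

For real data that is only approximately normal, the observed shares will be close to these values rather than equal to them.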
Normal Distribution: Empirical Rules
• Standard normal distribution: a normal distribution with a mean of
0 and a standard deviation of 1 (→ “standardization”)
• Probability: the expected relative frequency of a particular outcome
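Standardization can be illustrated with z-scores: subtract the mean from each value and divide by the standard deviation. The sketch below assumes the sample standard deviation and uses six made-up scores; after standardizing, the values have mean 0 and standard deviation 1.

```python
import statistics

scores = [57, 99, 78, 73, 84, 95]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)            # sample standard deviation

# z-score: how many standard deviations a value lies from the mean.
z_scores = [(x - mean) / sd for x in scores]

for x, z in zip(scores, z_scores):
    print(x, round(z, 2))
```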
Mean Absolute Deviation
➢ Mean
➢ (57+99+78+73+84+95)/6 = 81
➢ Median
➢ 57, 73, 78, 84, 95, 99
➢ Even: (78+84)/2 = 81
➢ 90th percentile
➢ 0.9*6 = 5.4; round up = 6; 6th score
in ranked dataset: 99.
➢ Standard Deviation
➢ √235.6 ≈ 15.35
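The worked figures above can be reproduced with Python's standard library. This is a sketch for checking the arithmetic: the mean absolute deviation is computed here as the average absolute distance from the mean, which is the usual definition but is an assumption, since the slide text does not spell it out.

```python
import math
import statistics

scores = [57, 99, 78, 73, 84, 95]

mean = statistics.mean(scores)            # (57+99+78+73+84+95)/6
median = statistics.median(scores)        # average of the two middle values
variance = statistics.variance(scores)    # sample variance, divides by n - 1
std_dev = math.sqrt(variance)             # square root of the variance

# Mean absolute deviation: average absolute distance from the mean.
mad = sum(abs(x - mean) for x in scores) / len(scores)

# 90th percentile by the round-up rule: 0.9 * 6 = 5.4, rounded up to 6.
rank = math.ceil(90 * len(scores) / 100)
p90 = sorted(scores)[rank - 1]

print(mean, median, round(variance, 1), round(std_dev, 2), round(mad, 2), p90)
# 81 81.0 235.6 15.35 11.67 99
```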
Measures of Shape: Skewness
• Skewness occurs because of a lack of symmetry.
▪ A variable can be skewed to the right (or positively
skewed) because of some really large values (e.g.
comparing Baseball players’ salaries).
▪ Or it can be skewed to the left (or negatively skewed)
because of some really small values (e.g. examining
temperature lows in Antarctica).
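The sign of the skewness can be seen in a short numerical sketch. The formula used here, the third central moment divided by the 1.5 power of the second central moment, is one common skewness coefficient and is an assumption, since the text above gives no formula; the function name `skewness` and the data are illustrative only.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # one very large value
left_skewed = [-x for x in right_skewed]    # mirror image: one very small value
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)   # True: positively skewed
print(skewness(left_skewed) < 0)    # True: negatively skewed
print(skewness(symmetric))          # 0.0
```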
Measures of Shape: Kurtosis
▪ Kurtosis has to do with the “fatness” of the tails of the
distribution relative to the tails of a normal distribution.
▪ A distribution with high kurtosis has many extreme
observations.
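A small sketch can make the comparison with the normal distribution concrete. It assumes the excess kurtosis, the fourth central moment divided by the squared second central moment, minus 3, so that a normal distribution scores 0; this formula is not given in the text above, and the function name `excess_kurtosis` and both data sets are illustrative.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

heavy_tails = [-10, -1, 0, 0, 0, 0, 1, 10]   # a few extreme observations
flat = [-3, -2, -1, 0, 1, 2, 3]              # evenly spread, no extremes

print(excess_kurtosis(heavy_tails) > 0)   # True: fatter tails than normal
print(excess_kurtosis(flat) < 0)          # True: thinner tails than normal
```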
Skewness and Kurtosis – What are acceptable
values?