Data Analytics Theory
“Statistical Techniques/Methods”
• Do some statistical calculations
• Interpret results
DATA AND SUMMARIZATION
Primary Uses of Statistics
POPULATION
A population consists of all the items or individuals about which
you want to draw a conclusion.
SAMPLE
A sample is the portion of a population selected for analysis.
PARAMETER
A parameter is a numerical measure that describes a
characteristic of a population.
STATISTIC
A statistic is a numerical measure that describes a characteristic
of a sample.
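The distinction between a parameter and a statistic can be sketched in a few lines of Python. The salary figures, the sample size of 4, and the names population and sample are made up for illustration; they do not come from the slides.

```python
import random
import statistics

# Hypothetical population: monthly salaries of all 10 employees of a small firm.
population = [2100, 2500, 2700, 3000, 3200, 3400, 3900, 4100, 4800, 6300]

# Parameter: a numerical measure computed from the whole population.
population_mean = statistics.mean(population)

# Sample: a portion of the population selected for analysis.
rng = random.Random(0)  # fixed seed so the run is repeatable
sample = rng.sample(population, k=4)

# Statistic: the same measure computed from the sample only.
sample_mean = statistics.mean(sample)

print("parameter (population mean):", population_mean)
print("statistic (sample mean):", sample_mean)
```

The sample mean will usually differ from the population mean, which is why a statistic is treated as an estimate of the corresponding parameter.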
Quantitative
• Discrete (no. of customers, no. of claims)
• Continuous (salary, price)
Qualitative (Categorical)
• Ordinal (customer satisfaction, efficiency of workers, bond rating)
• Nominal (sex, nationality, eye color)
Cross-Sectional Data
• In general, the more spread out a distribution is, the larger the measure of
dispersion will be
Measures of Dispersion
[Figure: chart with horizontal axis 1 to 10 and vertical axis 0 to 125]
• The larger the SD/variance is, the more the observations deviate, on
average, away from the mean
• The smaller the SD/variance is, the less the observations deviate, on
average, from the mean
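A small sketch of this point, using two made-up data sets that share the same mean but differ in spread. It assumes the sample versions of the variance and standard deviation (divisor n - 1), which is what statistics.variance and statistics.stdev compute; the slides do not say which divisor is intended.

```python
import statistics

# Two hypothetical data sets with the same mean (50) but different spread.
tight = [48, 49, 50, 51, 52]
spread_out = [10, 30, 50, 70, 90]

for name, data in (("tight", tight), ("spread_out", spread_out)):
    mean = statistics.mean(data)
    var = statistics.variance(data)  # sample variance, divisor n - 1
    sd = statistics.stdev(data)      # sample standard deviation
    print(f"{name}: mean={mean}, variance={var:.2f}, sd={sd:.2f}")

var_tight = statistics.variance(tight)
var_spread = statistics.variance(spread_out)
```

The second data set has the larger variance and standard deviation because its observations lie, on average, much further from the mean.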
Coefficient of Variation (CV)
$$CV = \frac{s}{\bar{x}} \times 100$$
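As a hedged illustration of why the CV is useful, the sketch below compares two made-up price series that have the same standard deviation but very different means. The function name coefficient_of_variation and the data are assumptions for illustration, and s is taken to be the sample standard deviation.

```python
import statistics

def coefficient_of_variation(values):
    """Return CV = (s / mean) * 100, with s the sample standard deviation."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical prices: both series have s = 2, but the means differ.
stock_a = [98, 100, 102]
stock_b = [8, 10, 12]

cv_a = coefficient_of_variation(stock_a)
cv_b = coefficient_of_variation(stock_b)

print(f"CV of stock A: {cv_a:.1f}%")
print(f"CV of stock B: {cv_b:.1f}%")
```

Both series vary by the same absolute amount, yet relative to its mean the second one is ten times more variable, which the standard deviation alone would not show.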
Percentiles, Quartiles and IQR
restaurant_type
chain :12000
independent:18000
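Quartiles split the ordered data into four equal parts, and the interquartile range (IQR) is the third quartile minus the first. The sketch below uses statistics.quantiles from the Python standard library on made-up data. Several conventions for interpolating quantiles exist, so other tools may give slightly different values; the "exclusive" method is chosen here only as an assumption.

```python
import statistics

# Hypothetical data set of 11 observations.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 20]

# n=4 returns the three cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# n=100 returns the 99 percentile cut points; index 89 is the 90th percentile.
p90 = statistics.quantiles(data, n=100, method="exclusive")[89]

print("Q1:", q1, "median:", q2, "Q3:", q3)
print("IQR:", iqr)
print("90th percentile:", p90)
```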
Box Plot
[Figure: box plot of IBM, axis from 83 to 91]
Chebyshev’s Theorem
Applies to any distribution, regardless of shape
Empirical Rule
Applies only to roughly mound-shaped and symmetric
distributions
Chebyshev’s Theorem
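At least $1 - \frac{1}{k^2}$ of the elements of any distribution lie within k standard deviations of the mean.
• k = 2: at least $1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\%$ lie within 2 standard deviations of the mean
• k = 3: at least $1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 89\%$ lie within 3 standard deviations of the mean
• k = 4: at least $1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} \approx 94\%$ lie within 4 standard deviations of the mean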
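The bound can be checked numerically. The sketch below computes it for k = 2, 3, 4 and compares it with the share actually observed in a made-up, strongly skewed data set. The helper names chebyshev_bound and share_within are hypothetical, and the population standard deviation (statistics.pstdev) is assumed.

```python
import statistics

def chebyshev_bound(k):
    """Minimum share of observations within k standard deviations of the mean."""
    return 1 - 1 / k**2

def share_within(data, k):
    """Observed share of data lying within k standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population standard deviation (divisor n)
    inside = [x for x in data if abs(x - mean) <= k * sd]
    return len(inside) / len(data)

# Hypothetical, strongly right-skewed data set.
data = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 50]

for k in (2, 3, 4):
    print(f"k={k}: bound={chebyshev_bound(k):.4f}, observed={share_within(data, k):.4f}")
```

The observed share is never below the bound, whatever the shape of the data; for most data sets it is considerably higher, since the theorem only guarantees a minimum.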
Empirical Rule
For roughly mound-shaped and symmetric
distributions, approximately:
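• 68% of observations lie within 1 standard deviation of the mean (μ – 1σ to μ + 1σ)
• 95% lie within 2 standard deviations of the mean (μ – 2σ to μ + 2σ)
• 99.7% lie within 3 standard deviations of the mean (μ – 3σ to μ + 3σ)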
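The shares behind the empirical rule can be checked against an exact normal distribution, which is the standard example of a mound-shaped, symmetric distribution. This is a minimal sketch using statistics.NormalDist from the Python standard library; the helper name share_within is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
# Roughly 0.68, 0.95 and 0.997 for k = 1, 2, 3.
```

Real data are only approximately normal, so the observed shares will match these values only roughly.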
Scatter Plots and Correlation
[Figure: scatter plots of y against x showing strong relationships, weak relationships, and no relationship]
Correlation Coefficient
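$$r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x s_y}$$
$$\operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
$$s_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad s_y = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}$$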
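A direct, step-by-step computation of the correlation coefficient might look like the sketch below. The function names and the data are made up for illustration, and the divisor n is assumed for the covariance and both standard deviations. Using n - 1 in all three would give the same r, because the divisor cancels in the ratio.

```python
import math

def covariance(x, y):
    """Average product of deviations from the means (divisor n)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

def std_dev(values):
    """Standard deviation with divisor n."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)

def correlation(x, y):
    """Correlation coefficient r = cov(x, y) / (s_x * s_y)."""
    return covariance(x, y) / (std_dev(x) * std_dev(y))

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
r = correlation(x, y)
print(f"cov(x, y) = {cov_xy:.4f}")
print(f"r = {r:.4f}")
```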
Features of correlation coefficient
• Unit free
• Ranges between -1.00 and 1.00
• -1 ≤ r < 0 implies that as X ↑ (↓), Y ↓ (↑)
• 0 < r ≤ 1 implies that as X ↑ (↓), Y ↑ (↓)
• The closer to -1.00, the stronger the negative linear relationship
• The closer to 1.00, the stronger the positive linear relationship
• The closer to 0.00, the weaker the linear relationship
• r=0 implies that X and Y are not linearly associated
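Some of these features can be checked on made-up data. The sketch below assumes a hypothetical helper pearson_r and illustrates three points: r is unchanged when X is converted to another unit, its sign flips when the direction of the relationship flips, and r = 0 rules out only a linear association (here Y is fully determined by X through Y = X², yet r is zero).

```python
import math

def pearson_r(x, y):
    """Correlation coefficient computed from sums of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_original = pearson_r(x, y)

# Unit free: rescaling X (for example metres to centimetres) leaves r unchanged.
r_rescaled = pearson_r([100 * v for v in x], y)

# Reversing the direction of Y flips the sign of r.
r_negative = pearson_r(x, [-v for v in y])

# r = 0 without independence: Y = X squared on values symmetric around zero.
x_sym = [-2, -1, 0, 1, 2]
r_quadratic = pearson_r(x_sym, [v ** 2 for v in x_sym])

print(r_original, r_rescaled, r_negative, r_quadratic)
```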
Examples of Approximate r Values
[Figure: example scatter plots of y against x with r = -1.00, r = -0.60, r = 0.00, r = 0.20, and r = 1.00]