002 Probability-and-Statistics-Part-1-Data

Statistics is used to make inferences about populations based on samples due to the infeasibility of measuring entire populations. Probability helps quantify the certainty of statistical conclusions. Data is collected observations that can be continuous, categorical, or both. Visualizing data through graphs can reveal trends that tables obscure. Data is measured at nominal, ordinal, interval, or ratio levels. Measures of central tendency like the mean, median, and mode describe where data is centered, while measures of dispersion like range, variance, and standard deviation describe how spread out it is.

Uploaded by

Selly Mohamed
Copyright
© All Rights Reserved
Probability and Statistics for Business and Data

PART 1 - DATA
Introduction
Probability and Statistics

● Statistics is the mathematical science behind the problem “what can I know about a population if I’m unable to reach every member?”
Probability and Statistics

● If we could measure the height of every resident of Australia, then we could make a statement about the average height of Australians at the time we took our measurement.
● In practice, measuring every member of a population is rarely feasible.
● This is where random sampling comes in.
Probability and Statistics

● If we take a reasonably sized random sample of Australians and measure their heights, we can form a statistical inference about the population of Australia.
● Probability helps us know how sure we are of our conclusions!
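
The idea of inferring a population average from a random sample can be sketched with a small simulation. The population below is entirely hypothetical: the size, the 170 cm center, and the 10 cm spread are made-up values chosen only for illustration, not real height data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated heights in cm (made-up parameters).
population = [random.gauss(170, 10) for _ in range(100_000)]

# Measure only a random sample instead of every member.
sample = random.sample(population, 500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a sample of this size, the two means should typically land close to each other, which is the intuition behind statistical inference.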
Data
What is Data?

● Data = the collected observations we have about something.
● Data can be continuous:
"What is the stock price?"
● or categorical:
"What car has the best repair history?"
Why Data Matters

● Helps us understand things as they are:
"What relationships, if any, exist between two events?"
"Do people who eat an apple a day enjoy fewer doctor's visits than those who don't?"
Why Data Matters

● Helps us predict future behavior to guide business decisions:
"Based on a user's click history, which ad is more likely to bring them to our site?"
Visualizing Data

● Compare a table:
[Table: Flights]
Not much can be gained by reading it.
Visualizing Data

● to a graph:
[Figure: Flights]
The graph uncovers two distinct trends: an increase in passengers flying over the years and a greater number of passengers flying in the summer months.
Analyze Visualizations Critically!

● Graphs can be misleading:


Measuring Data
Levels of Measurement

Nominal
● Predetermined categories
● Can’t be sorted
Animal classification (mammal, fish, reptile)
Political party (Republican, Democrat, Independent)
Levels of Measurement

Ordinal
● Can be sorted

● Lacks scale

Survey responses
Levels of Measurement

Interval
● Provides scale
● Lacks a true “zero” point
Temperature (Celsius or Fahrenheit)
Levels of Measurement

Ratio
● Values have a true zero point

Age, weight, salary


Population vs. Sample

● Population = every member of a group
● Sample = a subset of members that time and resources allow you to measure
Mathematical Symbols & Syntax
Symbol/Expression   Spoken as     Description
$x^2$               x squared     x raised to the second power: $x^2 = x \times x$
$x_i$               x-sub-i       a subscripted variable (the subscript acts as a label)
$x!$                x factorial   $4! = 4 \times 3 \times 2 \times 1$
$\bar{x}$           x bar         symbol for the sample mean
$\mu$               “mew”         symbol for the population mean (Greek lowercase letter mu)
$\Sigma$            sigma         syntax for writing sums (Greek capital letter sigma)
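Exponents

$x^5 = x \times x \times x \times x \times x$ (five factors of x)
EXAMPLE: $3^4 = 3 \times 3 \times 3 \times 3 = 81$
Exponents – special cases
$x^{-3} = \frac{1}{x \times x \times x}$
EXAMPLE: $2^{-3} = \frac{1}{2 \times 2 \times 2} = \frac{1}{8} = 0.125$

$x^{1/n} = \sqrt[n]{x}$
EXAMPLE: $8^{1/3} = \sqrt[3]{8} = 2$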
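Factorials

$x! = x \times (x - 1) \times (x - 2) \times \cdots \times 1$
EXAMPLE: $6! = 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 720$

EXAMPLE: $\frac{5!}{3!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{3 \times 2 \times 1} = 5 \times 4 = 20$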
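
The exponent and factorial examples can be checked with a few lines of Python. This is only a quick sketch using the built-in `**` operator and the standard library's `math.factorial`; the variable names are illustrative.

```python
import math

power = 3 ** 4                 # 3 × 3 × 3 × 3
negative_power = 2 ** -3       # 1 / (2 × 2 × 2)
cube_root = 8 ** (1 / 3)       # fractional exponent; a float, so compare with a tolerance

six_factorial = math.factorial(6)
ratio = math.factorial(5) // math.factorial(3)   # 5! / 3! = 5 × 4

print(power, negative_power, round(cube_root, 10))
print(six_factorial, ratio)
```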
Simple Sums
$\sum_{x=1}^{n} x = 1 + 2 + 3 + \cdots + n$

EXAMPLE: $\sum_{x=1}^{4} x = 1 + 2 + 3 + 4 = 10$

EXAMPLE: $\sum_{x=1}^{4} x^2 = 1 + 4 + 9 + 16 = 30$
Series Sums
$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$
EXAMPLE: $x = \{5, 3, 2, 8\}$
$n$ = number of elements in $x$ = 4
$\sum_{i=1}^{4} x_i = 5 + 3 + 2 + 8 = 18$
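
As a rough illustration, sigma notation maps directly onto Python's built-in `sum`. One assumption to keep in mind: Python lists are indexed from 0, while the slides label items from 1, so $x_1$ corresponds to `x[0]`.

```python
# Simple sums: x runs from 1 to 4 (range excludes its upper bound).
simple_sum = sum(x for x in range(1, 5))
sum_of_squares = sum(x ** 2 for x in range(1, 5))

# Series sum: add up every element of a series.
x = [5, 3, 2, 8]
n = len(x)                                   # number of elements in x
series_sum = sum(x[i] for i in range(n))     # x[0] plays the role of x_1

print(simple_sum, sum_of_squares, series_sum)
```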
Equation Example

● Formula for calculating a sample mean:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
● Read out loud:
“x bar (the symbol for the sample mean) is equal to the sum (indicated by the Greek letter sigma) of all the x-sub-i values in the series, as i goes from 1 to n (the number of items in the series), divided by n.”
Equation Example
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
1. Start with a series of values:
{7, 8, 9, 10}
2. Assign placeholders to each item:
value:        7  8  9  10
placeholder:  1  2  3  4     (n = 4)
3. These become $x_1$, $x_2$, etc.:
$x_1 = 7$, $x_2 = 8$, $x_3 = 9$, $x_4 = 10$
Equation Example
4. Plug these into the equation:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{7 + 8 + 9 + 10}{4} = \frac{34}{4} = 8.5$
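
The same calculation can be written as a short function. This is a minimal sketch; `sample_mean` is a name chosen here for illustration, not something from the slides.

```python
def sample_mean(values):
    """Sum all x_i and divide by n, the number of items."""
    n = len(values)
    return sum(values) / n

x = [7, 8, 9, 10]
x_bar = sample_mean(x)
print(x_bar)
```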
Measurement Types
Central Tendency
Measurements of Data

● “What was the average return?”
Measures of Central Tendency
● “How far from the average did individual values stray?”
Measures of Dispersion
Measures of Central Tendency
(mean, median, mode)
● Describe the “location” of the data
● Fail to describe the “shape” of the data
mean = “calculated average”
median = “middle value”
mode = “most frequently occurring value”
Mean

● Shows “location” but not “how spread out”


Median – odd number of values

9 10 10 11 13 15 16 19 19 21 23 28 30 33 34 36 44

= 19
Median - even number of values

10 10 11 13 15 16 19 19 21 23 28 30 33 34 36 44

$\frac{19 + 21}{2} = 20$
Mean vs. Median

● The mean can be influenced by outliers.
● The mean of {2,3,2,3,2,12} is 4
● The median is 2.5
● The median is much closer to most of the values in the series!
Mode

10 10 11 13 15 16 16 16 21 23 28 30 33 34 36 44

= 16
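
The median, mean, and mode examples above can be reproduced with Python's standard `statistics` module. This is a sketch using the series from the slides; the variable names are illustrative.

```python
import statistics

odd_series = [9, 10, 10, 11, 13, 15, 16, 19, 19, 21, 23, 28, 30, 33, 34, 36, 44]
even_series = [10, 10, 11, 13, 15, 16, 19, 19, 21, 23, 28, 30, 33, 34, 36, 44]
with_outlier = [2, 3, 2, 3, 2, 12]
mode_series = [10, 10, 11, 13, 15, 16, 16, 16, 21, 23, 28, 30, 33, 34, 36, 44]

median_odd = statistics.median(odd_series)     # middle value of 17 items
median_even = statistics.median(even_series)   # average of the two middle values

outlier_mean = statistics.mean(with_outlier)      # pulled upward by the 12
outlier_median = statistics.median(with_outlier)  # stays near most of the values

most_common = statistics.mode(mode_series)

print(median_odd, median_even, outlier_mean, outlier_median, most_common)
```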
Measurement Types
Dispersion
Measures of Dispersion
(range, variance, standard deviation)
9 10 11 13 15 16 19 19 21 23 28 30 33 34 36 39

• In this sample the mean is 22.25
• How do we describe how “spread out” the sample is?
Range

9 10 11 13 15 16 19 19 21 23 28 30 33 34 36 39

$Range = max - min = 39 - 9 = 30$
Variance

● Calculated as the average of the squared distances from each point to the mean
● There’s a difference between the SAMPLE variance and the POPULATION variance
● The sample variance is subject to Bessel's correction: divide by (n − 1) instead of n
Variance

SAMPLE VARIANCE: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

POPULATION VARIANCE: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
Sample Variance
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Sample: 4 7 9 8 11
Sample mean: $\bar{x} = \frac{4 + 7 + 9 + 8 + 11}{5} = \frac{39}{5} = 7.8$

$s^2 = \frac{(4 - 7.8)^2 + (7 - 7.8)^2 + (9 - 7.8)^2 + (8 - 7.8)^2 + (11 - 7.8)^2}{5 - 1} = 6.7$ (sample variance)


Standard Deviation

● square root of the variance
● benefit: same units as the sample
● meaningful to talk about “values that lie within one standard deviation of the mean”
Sample Standard Deviation
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Sample: 4 7 9 8 11
Sample mean: $\bar{x} = \frac{4 + 7 + 9 + 8 + 11}{5} = \frac{39}{5} = 7.8$

$s = \sqrt{\frac{(4 - 7.8)^2 + (7 - 7.8)^2 + (9 - 7.8)^2 + (8 - 7.8)^2 + (11 - 7.8)^2}{5 - 1}} = \sqrt{6.7} = 2.59$ (sample standard deviation)


Population Standard Deviation
$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$

Population: 4 7 9 8 11
Population mean: $\mu = \frac{4 + 7 + 9 + 8 + 11}{5} = \frac{39}{5} = 7.8$

$\sigma = \sqrt{\frac{(4 - 7.8)^2 + (7 - 7.8)^2 + (9 - 7.8)^2 + (8 - 7.8)^2 + (11 - 7.8)^2}{5}} = \sqrt{5.36} = 2.32$ (population standard deviation)
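
For comparison, here is a sketch of the same calculations using Python's standard `statistics` module, which provides separate functions for the sample versions (dividing by n − 1) and the population versions (dividing by N). The variable names are illustrative.

```python
import statistics

data = [4, 7, 9, 8, 11]

sample_variance = statistics.variance(data)       # divides by n - 1 (Bessel's correction)
sample_std = statistics.stdev(data)               # square root of the sample variance

population_variance = statistics.pvariance(data)  # divides by N
population_std = statistics.pstdev(data)          # square root of the population variance

print(round(sample_variance, 2), round(sample_std, 2))
print(round(population_variance, 2), round(population_std, 2))
```

Because the sample version divides by a smaller number, it always comes out slightly larger than the population version for the same values.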


Measurement Types
Quartiles
Quartiles and IQR

● Another way to describe data is through quartiles and the interquartile range (IQR)
● Has the advantage that every data point is considered, not aggregated!
Quartiles and IQR

● Consider the following series of 20 values:

9 10 10 11 13 15 16 19 19 21 23 28 30 33 34 36 44 45 47 60
[Figure: the series marked at the 1st quartile, the 2nd quartile (or median), and the 3rd quartile]
1. Divide the series in half
2. Divide each half in half again
3. The three dividing points become the quartiles
1st quartile = 14
2nd quartile = 22
3rd quartile = 35
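
The divide-and-divide-again procedure can be sketched as follows. Note that several quartile conventions exist and library functions may return slightly different values; this sketch assumes the "median of each half" approach, which matches the numbers above, and for odd-length series it assumes the middle value is left out of both halves. The function name `quartiles` is illustrative.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                   # values below the dividing point
    upper = ordered[len(ordered) - half:]    # values above the dividing point
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

series = [9, 10, 10, 11, 13, 15, 16, 19, 19, 21,
          23, 28, 30, 33, 34, 36, 44, 45, 47, 60]

q1, q2, q3 = quartiles(series)
iqr = q3 - q1   # interquartile range
print(q1, q2, q3, iqr)
```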
Plot the Quartiles
9 10 10 11 13 15 16 19 19 21 23 28 30 33 34 36 44 45 47 60

Quartile ranges are seldom the same size!
Fences & Outliers

● What is considered an “outlier”?
● A common practice is to set a “fence” 1.5 times the width of the IQR beyond the 1st and 3rd quartiles
● Anything outside the fence is an outlier
● This is determined by the data, not an arbitrary percentage!
Fences & Outliers
[Figure labels: 1 IQR, 1.5 IQR]

In this set, 60 is not an outlier, but 70 would be
Fences & Outliers
9 10 10 11 13 15 16 19 19 21 23 28 30 33 34 36 44 45 47 70
[Figure label: fence at 1.5 IQR]

Here 70 is a true outlier

● When drawing box plots, the whiskers are brought inward to the outermost values inside the fence.
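
A minimal sketch of the fence rule, assuming the fences sit 1.5 × IQR below the 1st quartile and above the 3rd quartile, and assuming the median-of-halves quartile convention. The function name `fences` and the variable names are illustrative.

```python
import statistics

def fences(values, k=1.5):
    """Return (lower_fence, upper_fence) placed k * IQR beyond Q1 and Q3."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[len(ordered) - half:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

with_60 = [9, 10, 10, 11, 13, 15, 16, 19, 19, 21,
           23, 28, 30, 33, 34, 36, 44, 45, 47, 60]
with_70 = with_60[:-1] + [70]   # same series with 60 replaced by 70

low_60, high_60 = fences(with_60)
low_70, high_70 = fences(with_70)

outliers_60 = [v for v in with_60 if v < low_60 or v > high_60]
outliers_70 = [v for v in with_70 if v < low_70 or v > high_70]

# Box-plot whisker: the outermost value still inside the upper fence.
upper_whisker_70 = max(v for v in with_70 if v <= high_70)

print(high_60, outliers_60)
print(high_70, outliers_70, upper_whisker_70)
```

Replacing 60 with 70 leaves the quartiles unchanged, so the fence stays where it was and only the extreme value falls outside it.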
Bivariate Data
Bivariate Data

● Compares two variables
● By convention, the x-axis is set to the independent variable
● The y-axis is set to the dependent variable, or that which is being measured relative to x.
Bivariate Data

● Scatter plots may uncover a correlation between two variables
● They can’t show causality!
Bivariate Data

● Correlation between two variables


● Doesn't prove causality!
Bivariate Data

● More statistical analysis is needed to determine causality!
● For example: "Does increasing the number of police officers decrease crime?"
● We would look at correlation, and do further analysis to understand causality.
Bivariate Data

[Figure panels: positive correlation; negative or inverse correlation]
Covariance

● A common way to compare two variables is to compare their variances: how far from each item’s mean do typical values fall?
● The first challenge is to match scale. Comparing height in inches to weight in pounds isn’t meaningful unless we develop a standard score to normalize the data.
Covariance

● For simplicity, we’ll consider the population covariance:

$cov(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$
Covariance Exercise

● Consider the following two tables:

First table    Second table
x  y           x  y
1  4           1  5
2  6           2  9
3  5           3  7
4  7           4  4
5  9           5  8
6  8           6  6
Covariance Exercise

● Plot them:
[Figure: scatter plots of the two tables]
Covariance Exercise x̅ = 3.5, y̅ = 6.5

● Calculate mean values:

First table: $\bar{x} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5$, $\bar{y} = \frac{4 + 6 + 5 + 7 + 9 + 8}{6} = 6.5$
Second table: $\bar{x} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5$, $\bar{y} = \frac{5 + 9 + 7 + 4 + 8 + 6}{6} = 6.5$
Covariance Exercise x̅ = 3.5, y̅ = 6.5

● Calculate (x - x̅) and (y - y̅)
● Calculate (x - x̅)(y - y̅)
● Calculate sums:

First table:
x  y  (x - x̅)  (y - y̅)  (x - x̅)(y - y̅)
1  4  -2.5     -2.5     6.25
2  6  -1.5     -0.5     0.75
3  5  -0.5     -1.5     0.75
4  7  0.5      0.5      0.25
5  9  1.5      2.5      3.75
6  8  2.5      1.5      3.75
Σ                       15.5

Second table:
x  y  (x - x̅)  (y - y̅)  (x - x̅)(y - y̅)
1  5  -2.5     -1.5     3.75
2  9  -1.5     2.5      -3.75
3  7  -0.5     0.5      -0.25
4  4  0.5      -2.5     -1.25
5  8  1.5      1.5      2.25
6  6  2.5      -0.5     -1.25
Σ                       -0.5
Covariance Exercise x̅ = 3.5, y̅ = 6.5

● Calculate covariance:

$cov(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$

First table: $cov(X, Y) = \frac{15.5}{6} = 2.583$
Second table: $cov(X, Y) = \frac{-0.5}{6} = -0.083$
Covariance Exercise

● Compare covariances:
First table: cov(x,y) = 2.583
Second table: cov(x,y) = -0.083
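
The exercise can be reproduced with a short function that follows the population covariance formula step by step. This is a sketch; `population_covariance` and the variable names are chosen here for illustration.

```python
def population_covariance(xs, ys):
    """Average of the products of deviations from each mean (divides by N)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / n

x = [1, 2, 3, 4, 5, 6]
y_first = [4, 6, 5, 7, 9, 8]    # first table
y_second = [5, 9, 7, 4, 8, 6]   # second table

cov_first = population_covariance(x, y_first)
cov_second = population_covariance(x, y_second)

print(round(cov_first, 3), round(cov_second, 3))
```

The first table's y values rise with x, giving a clearly positive covariance, while the second table's covariance is close to zero.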
Pearson Correlation Coefficient
Pearson Correlation Coefficient

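● In order to normalize values coming from two different distributions, we use:

$\rho_{X,Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y} = \frac{\frac{1}{n} \sum (x - \bar{x})(y - \bar{y})}{\sqrt{\frac{\sum (x - \bar{x})^2}{n}} \sqrt{\frac{\sum (y - \bar{y})^2}{n}}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2} \sqrt{\sum (y - \bar{y})^2}}$

$\rho$ = Greek letter “rho”
$\sigma$ = standard deviation
cov = covariance
$\bar{x}$ = mean of X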
Pearson Correlation Coefficient

● Values fall between +1 and −1, where
1 = total positive linear correlation
0 = no linear correlation
−1 = total negative linear correlation
Pearson Correlation Coefficient

● Several sets of (x, y) points, with the correlation coefficient for each set:
Correlation Exercise

● A company decides to test sales of a new product in five separate markets, to determine the best price point.
● They set a different price in each market and record sales volume over the same 30-day period.
Correlation Exercise

● These are the results
● Plot the results

Price (USD)  Units Sold (thousands)
10           55
11           57
15           49
19           48
22           39
Correlation Exercise

● There appears to be a strong correlation, but how strong?
Correlation Exercise

1. Recall the simplified correlation formula:

$\rho_{X,Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2} \sqrt{\sum (y - \bar{y})^2}}$

2. Find the mean of x and y:

$\bar{x} = \frac{10 + 11 + 15 + 19 + 22}{5} = 15.4$
$\bar{y} = \frac{55 + 57 + 49 + 48 + 39}{5} = 49.6$
Correlation Exercise
$\bar{x} = 15.4$, $\bar{y} = 49.6$

3. Calculate $x - \bar{x}$ and $y - \bar{y}$
4. Calculate $(x - \bar{x})(y - \bar{y})$
5. Calculate $(x - \bar{x})^2$ and $(y - \bar{y})^2$
6. Compute the sums:

Price (USD)  Units Sold (thousands)  (x − x̅)  (y − y̅)  (x − x̅)(y − y̅)  (x − x̅)²  (y − y̅)²
10           55                      -5.4     5.4      -29.16           29.16     29.16
11           57                      -4.4     7.4      -32.56           19.36     54.76
15           49                      -0.4     -0.6     0.24             0.16      0.36
19           48                      3.6      -1.6     -5.76            12.96     2.56
22           39                      6.6      -10.6    -69.96           43.56     112.36
Σ                                                      -137.2           105.2     199.2


Correlation Exercise
$\bar{x} = 15.4$, $\bar{y} = 49.6$

7. Plug these into the original formula:

$\rho_{X,Y} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2} \sqrt{\sum (y - \bar{y})^2}} = \frac{-137.2}{\sqrt{105.2} \sqrt{199.2}} = \frac{-137.2}{10.26 \times 14.11} = \frac{-137.2}{144.8} = -0.948$


Correlation Exercise

● $\rho_{X,Y} = -0.948$ shows a very strong negative correlation!
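
The seven steps of the exercise can be condensed into one function. This is a sketch of the simplified formula (sum of products over the product of root sums of squares); `pearson` and the variable names are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient using the simplified formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    return sum_products / (math.sqrt(sum_sq_x) * math.sqrt(sum_sq_y))

price = [10, 11, 15, 19, 22]        # USD
units_sold = [55, 57, 49, 48, 39]   # thousands

rho = pearson(price, units_sold)
print(round(rho, 3))
```

Because the 1/n factors cancel between the numerator and denominator, the function never needs to divide by n after computing the means.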
Next Up: PROBABILITY
