Statistics - Basic Concepts Part 2


Part 2

Data Analytics for Decision Making

• Statistical concepts and methods

• From histograms to probability distributions


• The Normal distribution
• Confidence intervals
• Hypothesis tests
Probability Density Function (pdf)
Discrete distribution
❑ A statistical distribution used for Discrete data

Continuous distribution
❑ A statistical distribution used for Continuous data
Normal Distribution
(the Bell curve)
Number of customers arriving at the checkout counter
of a grocery store in an hour

❑ approximate this process using a statistical distribution


use the pdf of that statistical distribution

[Figure: number of customers on the horizontal axis: 0, 1, 2, 3, 4, 5, 6, …]
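One way to make this concrete is a Poisson distribution, a common choice for counts of arrivals per hour. The slide names neither the distribution nor the arrival rate, so both the Poisson choice and the rate of 3 customers per hour below are assumptions made only for illustration. Because the data are discrete, the function is strictly a probability mass function rather than a density.

```python
import math

def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k arrivals when the mean number of arrivals is `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

ASSUMED_RATE = 3.0  # hypothetical: 3 customers per hour on average

# Probability of seeing 0, 1, ..., 6 customers in one hour
for k in range(7):
    print(k, round(poisson_pmf(k, ASSUMED_RATE), 4))
```

Changing `ASSUMED_RATE` shifts the peak of the distribution: higher rates move the most likely counts to the right.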
Transforming data into functions
Observed data → Standard functions?
Continuous probability distributions

Distribution   Parameters   Application

Uniform        a, b         Multiple uses / random number generators
Normal         μ, σ         Multiple uses
Exponential    μ            Times between arrivals / services
Weibull        a, b         Times between failures
Beta           a1, a2       Times between failures
Lognormal      μ, σ         Execution times of tasks
Triangular     a, b, c      When we have no data
Gamma          a, b         Execution times of tasks
Continuous probability distributions describe the probabilities of the possible values of a continuous random variable.
Here are some examples and their applications:

1. **Normal Distribution**:
- **Finance**: Stock returns are often assumed to follow a normal distribution, especially when analyzing their
historical data over short periods.
- **Quality Control**: Measurements of products, when many factors contribute to minor variations, tend to have a
normal distribution.

2. **Uniform Distribution**:
- **Simulation**: When modeling random events where each interval of time, length, or other continuous measures is
equally likely, a continuous uniform distribution is used.
- **Manufacturing**: If a product's lifespan is known to fall between two fixed times, with any time in between being
equally likely, it's modeled as a uniform distribution.

3. **Exponential Distribution**:
- **Reliability Engineering**: Describing the time between failures of a process or system that has a
constant failure rate.
- **Queuing Theory**: Modeling the time between arrivals in a system like customers at a bank or calls
at a call center when these times are memoryless.

Other distributions include:


- **Beta Distribution**: Used in project management to predict task durations when only limited sample
data are available.
- **Log-Normal Distribution**: Describing stock prices and real estate property values, as they can't go
below zero but can increase without bound.

These distributions find their applications in diverse sectors like finance, engineering, and operations
research, among others. They play a crucial role in shaping business strategies, decision-making processes,
and optimization efforts.
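The "memoryless" property mentioned for the exponential distribution can be checked numerically: the chance of waiting a further 3 minutes is the same whether or not 10 minutes have already passed. The sketch below assumes a mean gap of 5 minutes between arrivals, a value invented for illustration.

```python
import math

def exp_survival(t: float, mean: float) -> float:
    """P(T > t) for an exponential waiting time with the given mean."""
    return math.exp(-t / mean)

MEAN_GAP = 5.0  # hypothetical: 5 minutes between arrivals on average

# P(wait more than 13 min | already waited 10 min) versus P(wait more than 3 min)
conditional = exp_survival(13.0, MEAN_GAP) / exp_survival(10.0, MEAN_GAP)
unconditional = exp_survival(3.0, MEAN_GAP)

print(conditional, unconditional)  # the two values agree up to rounding
```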
Discrete probability distributions

Distribution   Parameters   Application

Uniform        a, b         Multiple events
Bernoulli      p            Binary events
Binomial       p, t         Number of items in a batch
Poisson        λ            Number of arrivals per time interval
Geometric      p            # of parts between failures
Some example applications of four of the distributions above:

1. **Binomial Distribution**:
- **Quality Control**: Determining the probability of getting a certain number of defective products in a sample from a
production batch.
- **Marketing**: Estimating the success rate of email marketing campaigns by gauging the number of users who click on
an email link out of all the recipients.

2. **Poisson Distribution**:
- **Customer Service**: Predicting the number of customer service calls a call center can expect within a given time
frame.
- **Supply Chain Management**: Estimating the number of order arrivals at a warehouse in a particular time interval.

3. **Uniform Distribution**:
- **Inventory Management**: When each product in a catalog has an equal likelihood of being purchased,
their demand can be modeled with a uniform distribution.
- **Resource Allocation**: If every task in a process is equally likely to be selected for processing, their
assignment can be modeled uniformly.

4. **Bernoulli Distribution**:
- **Finance**: Determining the success or failure of an investment, where success could be a profit and
failure could be a loss.
- **Human Resources**: Assessing the outcome of an employee training program, where success indicates
the employee passed and failure indicates they did not.

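The quality-control use of the binomial distribution can be sketched as follows. The sample size of 20 and the defect rate of 5 % are hypothetical values chosen only to make the example run.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k defective items in a sample of n, each defective with probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

SAMPLE_SIZE = 20    # hypothetical number of items inspected
DEFECT_RATE = 0.05  # hypothetical probability that one item is defective

p_none = binomial_pmf(0, SAMPLE_SIZE, DEFECT_RATE)
p_at_most_one = p_none + binomial_pmf(1, SAMPLE_SIZE, DEFECT_RATE)

print(f"P(no defectives)      = {p_none:.4f}")
print(f"P(at most 1 defective) = {p_at_most_one:.4f}")
```

With `n = 1` the same function reduces to the Bernoulli case: a single binary event with success probability `p`.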
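Z values and areas under the Normal curve

Z values            Area between    Outside
-1 and 1            68.26 %
-2 and 2            95.44 %
-3 and 3            99.74 %
-6 and 6            99.99966 %      [3.4 in 1000000]   Six Sigma approach

Note: the 3.4 in 1000000 figure is the Six Sigma convention, which allows for a 1.5-sigma shift of the mean. With no shift, the exact area between -6 and 6 is even closer to 100 %.

Other important points

Z values            Area between    Outside
-1.65 and 1.65      90 %            [1 in 10]
-1.96 and 1.96      95 %            [1 in 20]
-2.58 and 2.58      99 %            [1 in 100]
-3.29 and 3.29      99.9 %          [1 in 1000]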
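The areas in these tables can be recomputed from the error function in Python's standard library. This is a minimal sketch; the helper name `area_between` is made up here. Computed values may differ from the table in the last digit, since the table appears to come from rounded lookup values.

```python
import math

def area_between(z: float) -> float:
    """Area under the standard Normal curve between -z and +z."""
    return math.erf(z / math.sqrt(2))

for z in (1, 2, 3, 1.65, 1.96, 2.58, 3.29):
    inside = area_between(z)
    print(f"-{z} and {z}: {100 * inside:.2f} % inside, {100 * (1 - inside):.2f} % outside")
```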
Heights of Men and Women

[Figure: plot of the probability density function; horizontal axis: Height, from 1.20 to 2.10]

Heights near the center of the curve have higher probabilities; heights in the two tails have lower probabilities.

Probabilities are areas under the curve:
Prob(Height < 1.55) = ? (area to the left of 1.55)
Prob(Height > 1.80) = ? (area to the right of 1.80)
Prob(1.50 < Height < 1.70) = ? (area between 1.50 and 1.70)
Prob(Height = 1.55) = 0 (the area of a line is zero)
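The three questions above can be answered once a distribution is fitted to the heights. The slides do not give the mean or standard deviation, so the sketch below assumes a single Normal distribution with mean 1.65 and standard deviation 0.10; both numbers are hypothetical.

```python
from statistics import NormalDist

# Hypothetical parameters, chosen only for illustration (not taken from the slides)
heights = NormalDist(mu=1.65, sigma=0.10)

p_below_155 = heights.cdf(1.55)                    # area to the left of 1.55
p_above_180 = 1 - heights.cdf(1.80)                # area to the right of 1.80
p_between = heights.cdf(1.70) - heights.cdf(1.50)  # area between 1.50 and 1.70
p_exactly_155 = heights.cdf(1.55) - heights.cdf(1.55)  # a single point has zero area

print(f"Prob(Height < 1.55)        = {p_below_155:.4f}")
print(f"Prob(Height > 1.80)        = {p_above_180:.4f}")
print(f"Prob(1.50 < Height < 1.70) = {p_between:.4f}")
print(f"Prob(Height = 1.55)        = {p_exactly_155}")
```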
Confidence intervals

Confidence interval
Range in which we expect to find the true value of our variable

Confidence level
Probability that computed intervals cover the true value of our variable

Standard error of the sample mean [SEM]
Dispersion measure of the sample average around the true mean
SEM = Sample standard deviation / sqrt(sample size)

Computing the confidence interval
Interval = Mean +/- Z value * Standard Error
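The two formulas above translate directly into code. The sales figures below are invented for illustration, and `confidence_interval` is a hypothetical helper name; the Z value of 1.96 corresponds to a 95 % confidence level.

```python
import math
import statistics

def confidence_interval(sample, z):
    """Return (low, high) = mean +/- z * SEM, with SEM = sample std dev / sqrt(n)."""
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * sem, mean + z * sem

monthly_sales = [102, 98, 110, 95, 105, 99, 101, 108, 97, 103]  # hypothetical data

low, high = confidence_interval(monthly_sales, z=1.96)
print(f"95 % confidence interval: {low:.2f} to {high:.2f}")
```

Passing `z=2.58` instead gives the 99 % interval, which is wider for the same data.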
1. **Standard Deviation**:
- It measures how spread out the values in a dataset are around the average.
- Think of it as the average amount by which individual data points in a business dataset (like
sales figures for different months) deviate from the average of that dataset.
- It gives you a sense of the variability within your dataset.

2. **Standard Error**:
- It measures how accurate a sample mean is as an estimate of the population mean.
- Imagine you take multiple samples (like monthly sales from different regions) and calculate
their averages. The standard error tells you how much those averages differ from one another.
- The smaller the standard error, the more confident you can be that the sample mean
accurately represents the population mean.

In essence, while the standard deviation is about variability within a single dataset, the
standard error is about the accuracy of average values across multiple datasets or samples.
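A small simulation can illustrate the difference: draw many samples, compute each sample's average, and compare the spread of those averages with the formula standard deviation / sqrt(sample size). The population mean of 100, standard deviation of 15, and sample size of 25 are hypothetical values.

```python
import math
import random
import statistics

rng = random.Random(42)         # fixed seed so the run is repeatable
POP_MEAN, POP_SD = 100.0, 15.0  # hypothetical population
SAMPLE_SIZE = 25

# Draw 2000 samples and record the average of each one
sample_means = []
for _ in range(2000):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

spread_of_means = statistics.stdev(sample_means)   # observed spread of the averages
theoretical_sem = POP_SD / math.sqrt(SAMPLE_SIZE)  # 15 / sqrt(25) = 3

print(f"Spread of individual values (standard deviation): {POP_SD}")
print(f"Spread of sample averages (observed):             {spread_of_means:.2f}")
print(f"Standard error from the formula:                  {theoretical_sem:.2f}")
```

The averages vary far less than the individual values, which is why larger samples give more precise estimates of the mean.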
Confidence intervals
99% confidence interval: mean +/- 2.58 SEM
95.44% confidence interval: mean +/- 2.00 SEM
90% confidence interval: mean +/- 1.65 SEM

The higher the confidence level, the wider the interval
Hypothesis tests (1 of 2)
Is the mean μ of a given population equal to a given value M?
Hypothesis H0 : mean μ = M
Hypothesis H1 : mean μ is not equal to M
1. Take M as the mean under hypothesis H0
2. Choose a confidence level
3. Compute the confidence interval centered on M
4. Does the sample mean m fall inside the confidence interval?
Yes = H0 is not rejected: the data are consistent with μ = M at the chosen confidence level
No = H0 is rejected: the mean μ is not equal to M

[Figure: 95.5% confidence interval centered on M. A sample mean m inside the interval: hypothesis H0 holds. A sample mean m outside the interval: hypothesis H0 does not hold.]
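The four steps can be sketched in a few lines. The package weights and the target values 500 and 505 are hypothetical, and `consistent_with_h0` is a made-up helper name; the default Z value of 1.96 corresponds to a 95 % confidence level.

```python
import math
import statistics

def consistent_with_h0(sample, hypothesized_mean, z=1.96):
    """True if the sample mean falls inside the interval centered on the hypothesized mean M."""
    sem = statistics.stdev(sample) / math.sqrt(len(sample))
    low = hypothesized_mean - z * sem
    high = hypothesized_mean + z * sem
    return low <= statistics.mean(sample) <= high

# Hypothetical data: weights of 10 packages from a filling line
weights = [498, 502, 501, 499, 503, 497, 500, 502, 498, 501]

print(consistent_with_h0(weights, 500))  # sample mean is close to 500
print(consistent_with_h0(weights, 505))  # sample mean is far from 505
```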
Hypothesis tests (2 of 2)
Do two populations have the same mean?
Hypothesis H0 : mean 1 = mean 2 => mean 1 – mean 2 = 0
Hypothesis H1 : mean 1 is not equal to mean 2
1. Compute the standard error of the difference:
standard error = sqrt [s1*s1/n1 + s2*s2/n2]
(s1, s2 = sample standard deviations; n1, n2 = sample sizes)
2. Choose a confidence level
3. Compute the confidence interval centered on 0
4. Compute the difference between the sample means
5. Does the difference fall inside the confidence interval?
Yes = H0 is not rejected: the data are consistent with equal means at the chosen confidence level
No = H0 is rejected: the means are not equal

[Figure: 95.5% confidence interval centered on 0. A difference m1-m2 inside the interval: H0 holds. A difference m1-m2 outside the interval: H0 does not hold.]
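The same procedure for two samples, as a minimal sketch. The two sets of store figures are hypothetical, and `same_mean` is a made-up helper name; the default Z value of 1.96 corresponds to a 95 % confidence level.

```python
import math
import statistics

def same_mean(sample_1, sample_2, z=1.96):
    """True if the difference between sample means falls inside the interval centered on 0."""
    s1, s2 = statistics.stdev(sample_1), statistics.stdev(sample_2)
    n1, n2 = len(sample_1), len(sample_2)
    standard_error = math.sqrt(s1 * s1 / n1 + s2 * s2 / n2)
    difference = statistics.mean(sample_1) - statistics.mean(sample_2)
    return -z * standard_error <= difference <= z * standard_error

# Hypothetical data: daily sales in two stores
store_a = [20, 22, 19, 21, 23, 20, 22, 21]
store_b = [25, 27, 26, 24, 28, 26, 25, 27]

print(same_mean(store_a, store_b))  # the averages are 21 and 26
print(same_mean(store_a, store_a))  # a sample compared with itself: difference is 0
```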


LET'S PRACTICE
WITH EXCEL
