DS Full
Probability Distributions
A Probability Distribution
● Consider the experiment of flipping a coin twice.
● The 4 possible outcomes are HH, HT, TH, TT
● Let the variable X represent the number of heads.
○ X can take the values 0, 1 or 2.
○ When a variable takes different values, each with an associated probability, it is called a random
variable.
X Probability
0 1/4 = 0.25
1 2/4 = 0.5
2 1/4 = 0.25
Binomial Distribution
● This is a widely used discrete probability distribution
and it plays a major role in quality control and quality
assurance functions.
● Examples:
○ Manufacturing units use the binomial distribution for defect
analysis.
○ Service organizations like banks and insurance corporations use it to get
an idea of the proportion of customers who are satisfied with the
service quality.
○ It is used when deciding whether to accept or reject a lot containing
components or finished products, based on a statistically designed
sampling plan.
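● If X ~ Bin(n, p), then $P(X = r) = \binom{n}{r} p^{r} q^{n-r}$, where
○ n = number of trials
○ p = probability of success on a single trial
○ q = 1 - p (probability of failure on a single trial)
○ r = number of successes in n trials (the value taken by the random variable)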
Examples
● If X ~ Bin(5,0.3), find
a) P(X=5) b) P(X≤4)
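The two parts of this example can be checked numerically. The sketch below assumes the standard binomial probability formula; the helper name `binom_pmf` is illustrative and not from the slides.

```python
from math import comb


def binom_pmf(r: int, n: int, p: float) -> float:
    """P(X = r) for X ~ Bin(n, p): nCr * p^r * q^(n - r)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


n, p = 5, 0.3
p_eq_5 = binom_pmf(5, n, p)   # a) P(X = 5)
p_le_4 = 1 - p_eq_5           # b) P(X <= 4) = 1 - P(X = 5)

print(round(p_eq_5, 5), round(p_le_4, 5))  # 0.00243 0.99757
```

Part b) uses the complement because X cannot exceed 5, so summing P(X=0) to P(X=4) directly gives the same result.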
Poisson Distribution
● This is another discrete distribution which also plays a major role in
quality control in the context of reducing the number of defects per
standard unit such as number of defects per item etc.
● Other real-life examples include:
○ The number of cars arriving at a highway entrance per hour.
○ The number of customers visiting a bank per hour during peak business period.
Examples
● If, on average, 6 customers arrive at a bank every two minutes
during busy working hours:
a) What is the probability that exactly 4 customers arrive in a given minute?
b) What is the probability that more than 3 customers arrive in a given minute?
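A minimal sketch of this example, assuming the standard Poisson probability formula $P(X = k) = e^{-\lambda}\lambda^{k}/k!$. The rate is first converted to a per-minute mean, since both questions refer to a single minute; `poisson_pmf` is an illustrative helper name.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


lam = 6 / 2  # 6 arrivals per two minutes -> mean of 3 per minute

p_exactly_4 = poisson_pmf(4, lam)                                 # a)
p_more_than_3 = 1 - sum(poisson_pmf(k, lam) for k in range(4))    # b) 1 - P(X <= 3)

print(round(p_exactly_4, 4), round(p_more_than_3, 4))  # 0.168 0.3528
```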
P(Z<-a)=P(Z>a)=prob
P(Z>-a)=1-P(Z>a)=1-prob
Examples
● P(Z≤2.1)
● P(Z>2.1)
● P(1.5 ≤Z ≤2.1)
● P(Z<-1.5)
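These four probabilities are normally read from a standard normal table. As a cross-check, the sketch below uses `statistics.NormalDist` from the Python standard library in place of the table.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_le_2_1 = Z.cdf(2.1)                 # P(Z <= 2.1)
p_gt_2_1 = 1 - Z.cdf(2.1)             # P(Z > 2.1)
p_between = Z.cdf(2.1) - Z.cdf(1.5)   # P(1.5 <= Z <= 2.1)
p_lt_neg_1_5 = Z.cdf(-1.5)            # P(Z < -1.5), equal to P(Z > 1.5) by symmetry

print(round(p_le_2_1, 4), round(p_gt_2_1, 4),
      round(p_between, 4), round(p_lt_neg_1_5, 4))
# 0.9821 0.0179 0.0489 0.0668
```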
Continuity Correction
● To approximate P(X=n) for a discrete X by a continuous distribution, use P(n-0.5<X<n+0.5)
Example
● Find the probability of obtaining between 4 and 7 heads inclusive in 12
tosses of a fair coin,
○ Using the binomial distribution
○ Using the normal approximation to binomial distribution
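Both parts can be sketched as follows. The normal approximation assumes mean np and standard deviation √(npq), and applies the continuity correction so that "4 to 7 inclusive" becomes the interval 3.5 to 7.5.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 12, 0.5

# Exact binomial probability P(4 <= X <= 7)
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(4, 8))

# Normal approximation with continuity correction: P(3.5 < X < 7.5)
mu, sigma = n * p, sqrt(n * p * (1 - p))
normal = NormalDist(mu, sigma)
approx = normal.cdf(7.5) - normal.cdf(3.5)

print(round(exact, 3), round(approx, 3))  # 0.733 0.732
```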
Example
● A radioactive disintegration gives counts that follow a Poisson distribution
with mean count per second of 25. Find the probability that in 1 second
the count is between 23 and 27 inclusive,
○ Using the Poisson distribution
○ Using normal approximation to Poisson distribution
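A sketch of both calculations, assuming the usual approximation of Poisson(λ) by a normal distribution with mean λ and standard deviation √λ, again with the continuity correction (23 to 27 inclusive becomes 22.5 to 27.5).

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

lam = 25  # mean count per second

# Exact Poisson probability P(23 <= X <= 27)
exact = sum(exp(-lam) * lam**k / factorial(k) for k in range(23, 28))

# Normal approximation with continuity correction: P(22.5 < X < 27.5)
normal = NormalDist(mu=lam, sigma=sqrt(lam))
approx = normal.cdf(27.5) - normal.cdf(22.5)   # about 0.3829

print(round(exact, 4), round(approx, 4))
```

The two values agree closely here because the mean is fairly large.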
Sampling Applications
MCU3209_Jayani C. Hapugoda_OUSL 29
Introduction
● Statistics consists of two main branches called descriptive statistics
and inferential statistics.
● Inferential statistics are used to generalize the findings from a sample
to the population.
Example
● Consider a population of 5 school buses A, B, C, D, E. The number of
students travelling in each bus is given below, together with the mean of
every possible sample of 2 buses:

Bus    No. of students
A      24
B      30
C      21
D      18
E      27

Sample    Sample mean
AB        27.0
AC        22.5
AD        21.0
AE        25.5
BC        25.5
BD        24.0
BE        28.5
CD        19.5
CE        24.0
DE        22.5
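The sample means in this example can be reproduced by enumerating every sample of 2 buses. The sketch below also compares the mean of the sample means with the population mean; that comparison is an added illustration and is not stated in the slide.

```python
from itertools import combinations
from statistics import mean

students = {"A": 24, "B": 30, "C": 21, "D": 18, "E": 27}

# Every possible sample of size 2 and its sample mean
sample_means = {
    "".join(pair): mean(students[bus] for bus in pair)
    for pair in combinations(students, 2)
}

population_mean = mean(students.values())
mean_of_sample_means = mean(sample_means.values())

for sample, value in sample_means.items():
    print(sample, value)
print(population_mean, mean_of_sample_means)  # both equal 24
```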
Sampling Distribution
● The frequency distribution of possible values of a statistic for
repeated samples of the same size from the same population is
called the sampling distribution of the statistic.
Skewed Populations
● As n increases, the shape of the sampling distribution becomes more normal.
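This effect can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not part of the slides: it draws samples from an exponential population (chosen only because it is strongly skewed) and measures how the skewness of the sample means shrinks as n grows. The function names, the seed and the number of repetitions are all arbitrary choices.

```python
import random
from statistics import fmean, pstdev


def skewness(values):
    """Population skewness: mean of cubed standardized deviations."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)


def sample_mean_skewness(n, reps=4000, seed=1):
    """Skewness of the means of `reps` samples of size n from an exponential population."""
    rng = random.Random(seed)
    means = [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]
    return skewness(means)


skew_small_n = sample_mean_skewness(2)
skew_large_n = sample_mean_skewness(30)
print(skew_small_n, skew_large_n)  # the second value is much closer to 0
```

A skewness near 0 is consistent with a symmetric, bell-shaped distribution of sample means.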
Parameter Estimation
1. Point Estimation
2. Interval Estimation
● For a given population parameter, we can construct a random interval
so that it has a given probability of capturing the population
parameter.
● It deals with replacing a point estimate, a single number, by an entire
interval of possible values. (An interval of possible values for the
parameter being estimated).
● An interval estimate provides more information about population
parameter than does a point estimate.
Confidence Intervals
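As an illustration of interval estimation, the sketch below builds a confidence interval for a population mean. It assumes the common large-sample form, sample mean ± z × σ/√n, with a known population standard deviation; the function name and all numeric inputs are hypothetical and not taken from the slides.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Interval sample_mean +/- z * sigma / sqrt(n), assuming sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical inputs, for illustration only
low, high = z_confidence_interval(sample_mean=50.0, sigma=10.0, n=100)
print(round(low, 2), round(high, 2))  # 48.04 51.96
```

Raising the confidence level widens the interval, while a larger sample size narrows it.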
Test Statistic
● A test statistic is a quantity calculated from our sample of data (data
summary or measure).
● Its value is used to decide whether or not the null hypothesis should
be rejected in our hypothesis test.
● The choice of a test statistic will depend on the assumed probability
model and the hypotheses under question.
Small sample: test statistic and critical value
Large sample (n>30): test statistic and critical value
Example 1
● A traditional manufacturing process has produced millions of TV tubes
with a mean life of 1200h and a standard deviation of 300h. The engineering
department of the company introduced a new process. A sample of 100
tubes from the new process gives a sample mean of 1265h. Assuming the standard
deviation of the new process is the same as that of the traditional process, test the following
hypotheses at the 5% significance level.
1. The traditional method and the new method give the same mean life
2. The new method is better than the traditional method
3. The traditional method is better than the new method
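A sketch of the three tests, assuming a large-sample z-test with known standard deviation. The mapping of the three statements to alternative hypotheses (two-tailed, right-tailed, left-tailed) is an interpretation of the wording, not something the slide spells out.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, x_bar, alpha = 1200, 300, 100, 1265, 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))  # 65 / 30, about 2.17

std_normal = NormalDist()
z_two_tailed = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96
z_one_tailed = std_normal.inv_cdf(1 - alpha)      # about 1.645

# 1. H0: mu = 1200 vs H1: mu != 1200
reject_two_tailed = abs(z) > z_two_tailed
# 2. H0: mu = 1200 vs H1: mu > 1200 (new method better)
reject_right_tailed = z > z_one_tailed
# 3. H0: mu = 1200 vs H1: mu < 1200 (traditional method better)
reject_left_tailed = z < -z_one_tailed

print(round(z, 2), reject_two_tailed, reject_right_tailed, reject_left_tailed)
# 2.17 True True False
```

Under these assumptions, the data support a difference in mean life and a longer mean life for the new process, but not a longer mean life for the traditional process.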
Example 2
● A personnel specialist of a major corporation is recruiting a large number
of employees for an overseas assignment. During the testing process, the
management asks the specialist for the mean of the test scores, and the
reply is 90. When the management reviews 20 of the test results
compiled, it finds that the mean score is 84 and the standard deviation of
these scores is 11. If the management wants to test the specialist’s claim at the
1% significance level, what decision can be drawn?
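A sketch of this test, assuming a two-tailed one-sample t-test, since the sample is small (n = 20) and only the sample standard deviation is available. The Python standard library has no t distribution, so the critical value is hard-coded from a standard t table (19 degrees of freedom, 0.5% in each tail).

```python
from math import sqrt

mu0, n, x_bar, s = 90, 20, 84, 11

# Test statistic: t = (x_bar - mu0) / (s / sqrt(n))
t = (x_bar - mu0) / (s / sqrt(n))

# Assumption: two-tailed test at alpha = 0.01 with n - 1 = 19 degrees of
# freedom; 2.861 is the corresponding value from a standard t table.
t_critical = 2.861

reject_h0 = abs(t) > t_critical
print(round(t, 2), reject_h0)  # -2.44 False
```

Since |t| is below the critical value, the specialist's claim of a mean score of 90 is not rejected at the 1% level under these assumptions.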
Example 3
● A marketing manager of an enterprise is facing a decision whether to
introduce a new product into the market or not. Consumer acceptance
measured in a blind comparison test is agreed upon as an appropriate
basis for evaluation. Marketing of the new product will be pursued only if
the acceptance rate exceeds 30%. Otherwise, the new product will not be
introduced in the market. A random sample of 200 consumers reveals
that the acceptance rate is 32%. Using a level of significance of 0.01,
perform the hypothesis test and recommend an action.
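A sketch of this test, assuming a right-tailed large-sample z-test for a proportion with H0: p = 0.30 and H1: p > 0.30. The variable name `launch_product` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0, p_hat, n, alpha = 0.30, 0.32, 200, 0.01

# Test statistic: z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z_critical = NormalDist().inv_cdf(1 - alpha)  # right-tailed, about 2.326

launch_product = z > z_critical
print(round(z, 2), round(z_critical, 3), launch_product)  # 0.62 2.326 False
```

Under these assumptions the sample does not show that acceptance exceeds 30%, so the product would not be introduced.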
Example 4
● Two research laboratories have independently produced drugs that
provide relief to arthritis sufferers. The first drug was tested on a group
of 90 arthritis sufferers and produced an average of 8.5h of relief, and a
sample standard deviation of 1.8h. The second drug was tested on 80
arthritis sufferers, producing an average of 7.9h of relief, and a sample
standard deviation of 2.1h. At the 0.05 level of significance, does the
second drug provide a significantly shorter period of relief?
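A sketch of this comparison, assuming a large-sample z-test for the difference of two means with H0: μ1 = μ2 and H1: μ1 > μ2 (the second drug gives shorter relief). Both samples exceed 30, so the sample standard deviations are used in place of the population values.

```python
from math import sqrt
from statistics import NormalDist

n1, mean1, s1 = 90, 8.5, 1.8   # first drug
n2, mean2, s2 = 80, 7.9, 2.1   # second drug
alpha = 0.05

# Standard error of the difference between the two sample means
se = sqrt(s1**2 / n1 + s2**2 / n2)
z = (mean1 - mean2) / se

z_critical = NormalDist().inv_cdf(1 - alpha)  # right-tailed, about 1.645

second_is_shorter = z > z_critical
print(round(z, 2), second_is_shorter)  # 1.99 True
```

Under these assumptions H0 is rejected: the second drug provides a significantly shorter period of relief at the 0.05 level.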
Introduction
● This technique is used for 2 purposes:
○ Comparing population proportions of more than 2 samples (Goodness of fit)
○ To determine the association between 2 nominal variables (Test of independence)
● Assumptions:
○ Samples are randomly and independently drawn.
○ The data must be in frequency form.
○ No frequency in any category should be less than 5.
● Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(f_o - f_e)^2}{f_e}$
Where $f_o$ = observed frequency for the ith category
$f_e$ = expected frequency for the ith category
● Critical value = chi-square value with k-1 degrees of freedom, where k
is the number of categories.
● When the test statistic > critical value (chi-square table value), we reject
H0.
A 36
B 52
C 40
D 35
E 37
Total 200
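The question that accompanied this table is not part of the extracted text, so the sketch below rests on stated assumptions: the counts are treated as observed frequencies, the null hypothesis is taken to be equal frequencies across the five categories (200/5 = 40 each), and the test is run at the 5% level. The critical value 9.488 is the chi-square table value for 4 degrees of freedom at 5%.

```python
observed = {"A": 36, "B": 52, "C": 40, "D": 35, "E": 37}

total = sum(observed.values())
k = len(observed)

# Assumption: H0 states that all categories are equally likely
expected = total / k  # 40 per category

# Chi-square statistic: sum of (fo - fe)^2 / fe
chi_square = sum((fo - expected) ** 2 / expected for fo in observed.values())

# Assumption: alpha = 0.05; table value for k - 1 = 4 degrees of freedom
critical_value = 9.488

reject_h0 = chi_square > critical_value
print(chi_square, reject_h0)  # 4.85 False
```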
● Test statistic =
Correlation Analysis
Correlation
● Correlation describes the strength of a linear relationship.
● The strength of a linear relationship is an indication of how closely the
points in the scatter plot fit a straight line.
● The coefficient of correlation (r) measures the strength of a linear
relationship. Its numerical value ranges from -1 to +1.
● If there is no fit, we say there is no linear relationship; points are scattered
randomly on the plot, and r=0.
● If points lie exactly on a straight line, we say that there is a perfect
linear relationship; r=1 or r=-1.
Correlation
● Guidelines for classifying the strength of a linear relationship:
Coefficient of correlation (r)    Strength of relationship
+1                                Perfect positive
Between 0.75 and 0.99             Strong positive
Between 0.5 and 0.74              Moderate positive
Between 0.25 and 0.49             Weak positive
Between -0.24 and 0.24            No relationship
Between -0.25 and -0.49           Weak negative
Between -0.5 and -0.74            Moderate negative
Between -0.75 and -0.99           Strong negative
-1                                Perfect negative
Correlation Coefficient
Promotional Expenses (X)    Sales (Y)
7                           12
10                          14
9                           13
4                           5
11                          15
5                           7
3                           4
Draw the scatter plot and draw the line of best fit.
Calculate and interpret the correlation between promotional
expenses and sales.
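The correlation for this data can be computed as below. The formula slide is not part of the extracted text, so the sketch assumes the standard Pearson coefficient, r = Sxy / √(Sxx · Syy), built from sums of squared and cross deviations.

```python
from math import sqrt

expenses = [7, 10, 9, 4, 11, 5, 3]   # promotional expenses (X)
sales = [12, 14, 13, 5, 15, 7, 4]    # sales (Y)

n = len(expenses)
x_bar, y_bar = sum(expenses) / n, sum(sales) / n

# Sums of cross deviations and squared deviations
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(expenses, sales))
sxx = sum((x - x_bar) ** 2 for x in expenses)
syy = sum((y - y_bar) ** 2 for y in sales)

r = sxy / sqrt(sxx * syy)
print(sxy, sxx, syy, round(r, 4))  # 83.0 58.0 124.0 0.9787
```

By the guideline table, a value of about 0.98 indicates a strong positive linear relationship between promotional expenses and sales.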
Regression Analysis
● The correlation coefficient gives us the degree of relationship between
two variables.
● It doesn’t estimate or predict one variable using the other variable.
○ Ex: predicting the sales volume using the advertising expenditure.
● Using regression analysis, it is possible to predict one variable using other
variables.
● For business planning and forecasting, regression is much more useful
than correlation.
● A standard error that is small compared with the range of y
values indicates a good model fit.
Coefficient of Determination
Example
Promotional Expenses (X) Sales (Y)
7 12
10 14
9 13
4 5
11 15
5 7
3 4
Example Ctd.
1) Calculate the regression equation. (i.e. Calculate the slope and
intercept of the regression line)
2) Interpret the slope coefficient of the regression equation.
3) Using the regression equation, calculate the sales volume with
respect to promotional expense of 4.
4) Obtain the coefficient of determination (i.e. how much of the
variability of sales is predicted by promotional expenses) and
interpret the results.
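The four parts can be sketched as follows, assuming simple least-squares regression of sales on promotional expenses: slope b1 = Sxy/Sxx, intercept b0 = ȳ − b1·x̄, and coefficient of determination r² = Sxy²/(Sxx·Syy). Variable names are illustrative.

```python
expenses = [7, 10, 9, 4, 11, 5, 3]   # promotional expenses (X)
sales = [12, 14, 13, 5, 15, 7, 4]    # sales (Y)

n = len(expenses)
x_bar, y_bar = sum(expenses) / n, sum(sales) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(expenses, sales))
sxx = sum((x - x_bar) ** 2 for x in expenses)
syy = sum((y - y_bar) ** 2 for y in sales)

# 1) Regression equation: sales = b0 + b1 * expenses
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# 3) Predicted sales for a promotional expense of 4
predicted_at_4 = b0 + b1 * 4

# 4) Coefficient of determination
r_squared = sxy**2 / (sxx * syy)

print(round(b0, 4), round(b1, 4), round(predicted_at_4, 4), round(r_squared, 4))
# -0.0172 1.431 5.7069 0.9579
```

For part 2), the slope says that each additional unit of promotional expense is associated with about 1.43 more units of sales. For part 4), about 96% of the variability in sales is explained by promotional expenses.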
Multiple Regression
● Multiple regression is a statistical technique used to predict the value of a
dependent variable that is influenced by two or more independent variables.
● Ex: The performance of an employee depends on age, gender, education level
etc.
● The multiple regression equation with k independent variables can be given as
$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki}$
■ Where:
■ $Y_i$ = the predicted value from the model
■ $b_0$ = the estimated intercept
■ $b_1, b_2, \dots, b_k$ = the estimated slope coefficients
■ $X_{1i}, X_{2i}, \dots, X_{ki}$ = values of the independent variables
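Once the coefficients have been estimated, applying the equation is a weighted sum. The sketch below only evaluates the equation; it does not fit the model. The function name, the coefficient values and the choice of predictors (age and years of education) are hypothetical.

```python
def predict(intercept, slopes, x_values):
    """Evaluate Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk for one observation."""
    if len(slopes) != len(x_values):
        raise ValueError("need exactly one x value per slope coefficient")
    return intercept + sum(b * x for b, x in zip(slopes, x_values))


# Hypothetical estimated coefficients, for illustration only
b0 = 20.0
slopes = [0.5, 2.0]      # b1 for age, b2 for years of education

# Predicted performance score for an employee aged 30 with 16 years of education
score = predict(b0, slopes, [30, 16])
print(score)  # 20 + 0.5*30 + 2.0*16 = 67.0
```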
Components in Time Series
● Four components in a time series:
○ Trend (T): represents the long term behavior of the time series, whether the data
reveal a steady upward or downward movement.
○ Cyclic Effect (C): represents the typical business cycles that occur at irregular intervals
over several years (long term).
○ Seasonal Variation (S): represents variation caused by the seasons; a repetitive pattern
with a period of less than one year (short term).
○ Random Variation (R): represents irregular variations that occur by chance, with no
assignable cause, and that cannot be predicted.
Time Series Models
● Two commonly used models:
○ Additive model: Produces the forecasts by adding the components
Y = T+C+S+R
○ Multiplicative model: Produces the forecasts by multiplying the components
Y = T×C×S×R
Moving Averages (MA)
Moving Averages (MA) -Example
● A company is interested in forecasting demand for one of its
products. Find 3-month and 5-month moving averages for the data.
Centered Moving Averages
● When a moving average spans an even number of observations,
the average falls midway between two periods.
● To align it with an actual time period, it has to be centered by
calculating the average of two consecutive moving averages.
● Ex:
Month    Sales (100 units)    4-month moving average    Centered moving average
1        15
2        9
                              14.25
3        16                                             13.75
                              13.25
4        17                                             14.625
                              16
5        11
6        20
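The figures in this example can be reproduced with two short helpers. This is a sketch; the function names are illustrative, and `centered_moving_average` is meant for even window lengths such as the 4-month case here.

```python
def moving_average(values, window):
    """Simple moving averages over every run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def centered_moving_average(values, window):
    """Average of each pair of consecutive moving averages (for even windows)."""
    ma = moving_average(values, window)
    return [(a + b) / 2 for a, b in zip(ma, ma[1:])]


sales = [15, 9, 16, 17, 11, 20]  # months 1 to 6, in hundreds of units

print(moving_average(sales, 4))           # [14.25, 13.25, 16.0]
print(centered_moving_average(sales, 4))  # [13.75, 14.625]
```

The two centered values line up with months 3 and 4, which is why only those months have a centered moving average in the example.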
Trend Analysis
● Trend can be linear or non-linear.
● Linear trend lines are most commonly used, as a non-linear trend
can be mathematically transformed to a linear trend.
● In trend analysis, we fit a trend line using the de-seasonalized time
series data.
● The least squares method is used to fit the trend line.
● The equation used is $y = b_0 + b_1 t$
○ $y$ is the forecasted value
○ $t$ is the independent variable (time)
○ $b_0$ is the intercept (the y value when t=0)
○ $b_1$ is the slope of the line
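A minimal sketch of fitting the trend line by least squares, assuming time is coded as t = 1, 2, ..., n. The data values are hypothetical and stand in for a de-seasonalized series; `fit_linear_trend` is an illustrative name.

```python
def fit_linear_trend(y):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * t, with t = 1..n."""
    n = len(y)
    t = list(range(1, n + 1))
    t_bar, y_bar = sum(t) / n, sum(y) / n
    b1 = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
          / sum((ti - t_bar) ** 2 for ti in t))
    b0 = y_bar - b1 * t_bar
    return b0, b1


# Hypothetical de-seasonalized values for periods 1 to 6
deseasonalized = [12, 14, 15, 18, 19, 22]

b0, b1 = fit_linear_trend(deseasonalized)
forecast_t7 = b0 + b1 * 7  # trend forecast for the next period

print(round(b0, 3), round(b1, 3), round(forecast_t7, 3))  # 9.867 1.943 23.467
```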
Seasonal Effect Analysis
● Seasonal variation may occur within shorter time periods (within
a year).
● Seasonal indices are constructed to measure the seasonal effect.
● Steps:
1) Calculate a series of suitable centered moving averages (CMA).
2) Calculate the percentage of the actual value to the CMA value for
each period in the time series having a CMA entry.
3) Use the ratios calculated in step 2 to calculate the average seasonal
effect for each season.
4) Adjust the seasonal effects so that they add up to a number
equal to the number of seasons.
5) De-seasonalize the original series by dividing each value by the corresponding
adjusted seasonal effect.
6) Estimate the trend line by fitting an appropriate regression model to
the de-seasonalized series.
7) Prepare the forecast based on trend*percentage seasonal variation.
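The seven steps can be sketched end to end as below. The quarterly sales figures are hypothetical (they are not the data from the slides), the seasonal effects are kept as ratios rather than percentages so that they sum to the number of seasons, and a linear trend with t = 1, 2, ..., n is assumed for step 6. All function names are illustrative.

```python
def centered_moving_average(values, period):
    """Step 1: CMA for an even period; entry j aligns with values[j + period // 2]."""
    ma = [sum(values[i:i + period]) / period
          for i in range(len(values) - period + 1)]
    return [(a + b) / 2 for a, b in zip(ma, ma[1:])]


def seasonal_indices(values, period):
    """Steps 2-4: average actual/CMA ratio per season, scaled to sum to `period`."""
    cma = centered_moving_average(values, period)
    ratios = [[] for _ in range(period)]
    for j, c in enumerate(cma):
        idx = j + period // 2
        ratios[idx % period].append(values[idx] / c)
    raw = [sum(r) / len(r) for r in ratios]
    scale = period / sum(raw)
    return [r * scale for r in raw]


def fit_linear_trend(y):
    """Step 6: least-squares line y = b0 + b1 * t with t = 1, 2, ..., n."""
    n = len(y)
    t = list(range(1, n + 1))
    t_bar, y_bar = sum(t) / n, sum(y) / n
    b1 = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
          / sum((ti - t_bar) ** 2 for ti in t))
    return y_bar - b1 * t_bar, b1


# Hypothetical quarterly sales for years 1 to 4 (illustration only)
sales = [20, 30, 40, 25, 24, 36, 48, 30, 28, 42, 56, 35, 32, 48, 64, 40]
period = 4

indices = seasonal_indices(sales, period)
# Step 5: divide each value by the adjusted seasonal effect of its quarter
deseasonalized = [v / indices[i % period] for i, v in enumerate(sales)]
b0, b1 = fit_linear_trend(deseasonalized)

# Step 7: year 5 forecasts (t = 17..20) as trend * seasonal effect
forecast = [(b0 + b1 * t) * indices[(t - 1) % period] for t in range(17, 21)]

print([round(s, 3) for s in indices])
print([round(f, 1) for f in forecast])
```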
Seasonal Effect Analysis - Example
● A company is interested in forecasting sales for one of its products.
Forecast sales for year 5 for the data using seasonal adjustments.