BU255 notes

WEEK 6

8.1. Estimating the Population Mean Using the z Statistic (σ Known)


point estimate = an estimate of a population parameter constructed from a
statistic taken from a sample

100(1-a)% Confidence Interval to Estimate μ

α = the area under the normal curve in the tails of the distribution outside the area
defined by the confidence interval
Distribution of Sample Mean for 95% Confidence

α = 0.05, α/2 = 0.025


table area = 0.5 – α/2: z0.025 is found by 0.5 – 0.0250 = 0.4750 → z = 1.96 (from the z table)
EXAMPLE: Determine a 95% confidence interval for x̄ = 1300, σ = 160, n = 85 and z
= 1.96
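Worked in code, the interval is x̄ ± z·σ/√n with the numbers given:

```python
# 95% z confidence interval for the mean, using the example's numbers:
# x-bar = 1300, sigma = 160, n = 85, z = 1.96 (sigma known).
import math

x_bar, sigma, n, z = 1300, 160, 85, 1.96
margin = z * sigma / math.sqrt(n)           # margin of error E
lower, upper = x_bar - margin, x_bar + margin
print(f"{lower:.2f} <= mu <= {upper:.2f}")  # -> 1265.99 <= mu <= 1334.01
```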

margin of error of the interval = the distance between the statistic computed to
estimate a parameter and the parameter.

Common z values
95% confident that the population mean is in an interval: If the company
were to randomly select 100 samples of 85 bills and use the results of each sample
to construct a 95% confidence interval, approximately 95 of the 100 intervals would
contain the population mean.

8.2. Estimating the Population Mean Using the t Statistic (σ Unknown)


EXAMPLE: A business analyst is interested in estimating the average flying time
from Toronto to Vancouver. They know neither the population mean flying time
nor the population standard deviation, so they will take a random sample of
flights and compute a sample mean.

The t Distribution
= a distribution that describes the sample data in small samples when the standard
deviation is unknown and the population is normally distributed

Characteristics of the t Distribution


- symmetric, unimodal, flatter in the middle, with more area in the tails than
the standard normal distribution
- t distribution approaches the standard normal curve as n becomes large
Confidence Intervals to Estimate the Population Mean Using the t Statistic
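A minimal sketch of the t-based interval x̄ ± t·s/√n, using a hypothetical sample of 15 flight times (hours); the table value t0.025,14 = 2.145 is for 95% confidence with df = 14:

```python
# Hypothetical sketch: t confidence interval when sigma is unknown.
# The flight times below are made up; t_{0.025,14} = 2.145 is from a t table.
import math
import statistics

times = [4.9, 5.1, 5.0, 5.3, 4.8, 5.2, 5.0, 4.7, 5.1, 5.4, 4.9, 5.0, 5.2, 4.8, 5.1]
n = len(times)
x_bar = statistics.mean(times)
s = statistics.stdev(times)        # sample standard deviation
t_crit = 2.145                     # t_{0.025, 14}
margin = t_crit * s / math.sqrt(n)
print(f"{x_bar - margin:.3f} <= mu <= {x_bar + margin:.3f}")
```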

8.3. Estimating the Population Proportion


EXAMPLE: What proportion of the market does our company control?

Confidence interval to estimate p

p̂ = sample proportion, p = population proportion, q̂ = 1 - p̂


EXAMPLE: A study of 87 randomly selected companies with a telemarketing
operation revealed that 39% of the sampled companies used telemarketing to
assist them in order processing. Using this information how could an analyst
estimate the population proportion of telemarketing companies that use their
telemarketing operation to assist them in order processing?
p̂ - the point estimate of the population proportion p
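With the study's numbers (n = 87, p̂ = 0.39) the interval p̂ ± z·√(p̂q̂/n) can be computed directly; a 95% level is assumed here since the example doesn't state one:

```python
# Confidence interval for the population proportion p, using n = 87 and
# p-hat = 0.39 from the telemarketing study; 95% confidence (z = 1.96) assumed.
import math

n, p_hat, z = 87, 0.39, 1.96
q_hat = 1 - p_hat
margin = z * math.sqrt(p_hat * q_hat / n)
print(f"{p_hat - margin:.3f} <= p <= {p_hat + margin:.3f}")  # -> 0.288 <= p <= 0.492
```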

8.5. Estimating Sample Size


- it is important to estimate the size of sample necessary to accomplish the
purposes of the study

Sample Size When Estimating μ


EXAMPLE: Estimate the monthly expenditure on bread by a family in Montreal. How
much error is the analyst willing to tolerate in the results? Suppose she wants the
estimate to be within $1.00 of the actual figure, and the standard deviation of
average monthly bread purchases is $4.00. What is the sample-size estimate for this
problem? The value of z for a 90% level of confidence is 1.645. E = ±$1.00, σ =
$4.00, and z = 1.645
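The sample-size formula n = (zσ/E)², rounded up to the next whole unit, with the values given:

```python
# Sample size when estimating mu: n = (z * sigma / E)^2, rounded up.
# E = 1.00, sigma = 4.00, z = 1.645 (90% confidence), as given in the example.
import math

z, sigma, E = 1.645, 4.00, 1.00
n = (z * sigma / E) ** 2
print(round(n, 4), "-> sample of", math.ceil(n))  # -> 43.2964 -> sample of 44
```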

Determining Sample Size When Estimating p

- if p is unknown, analysts often use p = 0.5, because it yields the largest
sample size
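A short sketch with a hypothetical error tolerance E = 0.03 and an assumed 95% confidence level; note that p = 0.5 maximizes p(1 − p) and therefore maximizes n:

```python
# Sample size when estimating p with p unknown: use p = 0.5 (largest n).
# E = 0.03 and z = 1.96 (95% confidence) are hypothetical choices.
import math

z, p, E = 1.96, 0.5, 0.03
n = z**2 * p * (1 - p) / E**2
print(math.ceil(n))   # -> 1068
```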

OVERVIEW OF THE CHAPTER:


8.1. Estimating the Population Mean Using the z Statistic (σ Known)

8.2. Estimating the Population Mean Using the t Statistic (σ Unknown)

8.3. Estimating the Population Proportion

8.5. Estimating Sample Size


Determining Sample Size When Estimating p

WEEK 7
9.1. Introduction to Hypothesis Testing
Hypothesis = a tentative explanation of a principle operating in nature

Types of Hypotheses
1. Research hypotheses
2. Statistical hypotheses
3. Substantive hypotheses

1. Research Hypothesis
= a statement of what the researcher believes will be the outcome of an
experiment/study
“Older workers are more loyal to a company”

2. Statistical Hypotheses
= a formal hypothesis structure set up with a null and an alternative
hypothesis to scientifically test research hypotheses
a) null hypothesis H0
The hypothesis that assumes the status quo—that the old theory, method,
or standard is still true (Older workers are not more loyal to a company)
b) alternative hypothesis Ha
The hypothesis that the researcher is interested in proving (Older workers
are more loyal to a company)

Tests:
1. Two-tailed tests
- a statistical test wherein the researcher is interested in testing both sides of
the distribution
- nondirectional: Ha allows for either the > or < possibility; Ha: ≠
o same, different, equal, control, out-of-control
- Are the machines overfilling or underfilling the wheat packages?
- 2 critical points, 2 rejection regions, α split in half
2. One-tailed tests
- a statistical test wherein the researcher is interested in testing one side of the
distribution
- directional: Ha uses either the > or < possibility; Ha: < or Ha: >
o higher, lower, older, younger, more, less, longer, shorter
- 1 critical point, 1 rejection region, α not split in half

3. Substantive hypotheses
substantive result = what occurs when the outcome of a statistical study
produces results that are important to the decision-maker

8 Steps of Testing Hypotheses:


1. Establish a null and an alternative hypothesis
o One-tailed or Two-tailed?
2. Determine the appropriate statistical test
o critical values = the value that divides the nonrejection region from
the rejection region
3. Set the value of α
4. Establish the decision rule
5. Gather sample data
6. Analyze the data
7. Reach a statistical conclusion
8. Make a business decision

Rejection and Nonrejection Regions


rejection region = the portion of a distribution in which a computed statistic lies
that will result in the decision to reject the null hypothesis
nonrejection region = any portion of a distribution that is not in the rejection
region. If the observed statistic falls in this region, the decision is to fail to reject the
null hypothesis

Type I and Type II Errors


Type I error = an error committed by rejecting a true null hypothesis.
- Innocent person found guilty and sent to jail
α (level of significance) = the probability of committing a Type I error

Type II error = an error committed by failing to reject a false null hypothesis.


- If a manager doesn’t fire an employee because of lack of evidence but they
are stealing
- Employers often would rather risk a Type I error than a Type II error
β = the probability of committing a Type II error.

- One way to reduce both kinds of errors simultaneously is to increase the


sample size
Power = the probability of rejecting a false null hypothesis
=1-β

9.2 Testing Hypotheses About a Population Mean Using the z Statistic (σ


Known)
EXAMPLE: A survey found that average net income for sole proprietor CPAs is
$74,914. An accounting analyst wants to test this figure by taking a random sample
of 112 sole proprietor CPAs to determine whether the net income figure has
changed. Assume the population standard deviation of net incomes for sole
proprietor CPAs is $14,530.
1. H0: μ = $74,914
Ha: μ ≠ $74,914
- Two tailed
2. z test

3. Set α = 0.05
4. Decision rule: reject H0 if |z| > zα/2 = 1.96 (table area = 0.5 - α/2 = 0.5 – 0.025 = 0.4750)
5. Gather data: sample mean is $78,695
6. Calculate z with n = 112, x̄ = $78,695, σ = $14,530, μ = $74,914

7. Observed value (2.75) > critical value of z (1.96) → reject the null hypothesis
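Steps 6-7 in code, using the figures given:

```python
# z statistic for the CPA net income example:
# H0: mu = 74,914 vs. Ha: mu != 74,914; x-bar = 78,695, sigma = 14,530, n = 112.
import math

x_bar, mu, sigma, n = 78695, 74914, 14530, 112
z = (x_bar - mu) / (sigma / math.sqrt(n))
print(round(z, 2), "-> reject H0" if abs(z) > 1.96 else "-> fail to reject H0")
# -> 2.75 -> reject H0
```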

Using the p-Value to Test Hypotheses


= a method of testing hypotheses in which there is no preset level of α. The
probability of getting a test statistic at least as extreme as the observed test
statistic is computed under the assumption that the null hypothesis is true.
p-value (observed significance level) = the smallest value of α for which the
null hypothesis can be rejected
= 0.5 – (table area for the observed z) for a one-tailed test; doubled for a two-tailed test
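The same p-value can be computed from the standard normal CDF instead of a table; for the CPA example's observed z = 2.75 (two-tailed test):

```python
# p-value for a two-tailed z test with observed z = 2.75, using Python's
# built-in standard normal distribution rather than a printed z table.
from statistics import NormalDist

z = 2.75
one_tail = 1 - NormalDist().cdf(z)   # area beyond z; equals 0.5 - table area
p_value = 2 * one_tail               # two-tailed test
print(round(p_value, 4))             # -> 0.006  (< 0.05, so reject H0)
```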

9.3 Testing Hypotheses About a Population Mean Using the t Statistic (σ


Unknown)

9.4 Testing Hypotheses About a Proportion

OVERVIEW OF THE CHAPTER


9.1. Introduction to Hypothesis Testing
Type I error = an error committed by rejecting a true null hypothesis.
Type II error = an error committed by failing to reject a false null hypothesis.
9.2 Testing Hypotheses About a Population Mean Using the z Statistic (σ
Known)
p-value = 0.5 – (table area for the observed z) for a one-tailed test; doubled for a two-tailed test
Rejecting the Null hypothesis using p-value

9.3 Testing Hypotheses About a Population Mean Using the t Statistic (σ


Unknown)

9.4 Testing Hypotheses About a Proportion

WEEK 8
10.1. Hypothesis Testing and Confidence Intervals About the Difference in
2 Means Using the z Statistic (σ2 Known)
EXAMPLE: Which toothpaste brand is more effective?

Z Formula for the Difference in 2 Sample Means – Independent Samples


and σ2 Known

Confidence intervals to Estimate μ1 – μ2


EXAMPLE: Estimate the difference between middle-income shoppers and low-
income shoppers in terms of the average amount saved on grocery bills per week
by using coupons. Random samples of 60 middle-income shoppers and 80 low-
income shoppers are taken, and their purchases are monitored for one week.

98% confidence interval → zα/2 = 2.33

10.2 Hypothesis Testing and Confidence Intervals About the Difference in


Two Means: Independent Samples and Population Variances Unknown

Hypothesis Testing

If σ₁² = σ₂² = σ²:

t formula

EXAMPLE: A company has an education program. They want to know which


program works better: the traditional one or the new one. One group with 15 people
will do method A and the other one with 12 people will do method B. Use α = 0.05.
1. The hypotheses
2. t-test

3. α = 0.05
4. Decision rule (two-tailed test): df = n1 + n2 – 2 = 25, t0.025,25 = ± 2.060
5. Gather data

6. t value

7. The null hypothesis is rejected
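A sketch of the pooled-variance t statistic; the sample means and standard deviations below are hypothetical, while n1 = 15, n2 = 12, and t0.025,25 = 2.060 come from the example:

```python
# Hypothetical sketch of the pooled-variance t test (sigma^2 unknown but
# assumed equal). The x-bars and s values are made up for illustration.
import math

n1, x1, s1 = 15, 47.73, 19.5   # method A (hypothetical mean and std dev)
n2, x2, s2 = 12, 63.25, 18.3   # method B (hypothetical mean and std dev)

sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
t = (x1 - x2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t, 3), "-> reject H0" if abs(t) > 2.060 else "-> fail to reject H0")
```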

Confidence Interval to Estimate μ1 – μ2: σ₁² and σ₂² Unknown but Equal

10.3 Statistical Inferences for Two Related Populations


- dependent samples
“t test for related measures” = a t test to test the differences in two related or
matched samples; sometimes called the matched-pairs test or the correlated t test
EXAMPLE: before and after study
- The 2 samples have to be the same size and the individual scores have to be
matched

t- Formula to Test the Difference in 2 Dependent Populations

n = number of pairs
d = sample difference for each pair
D = mean population difference
sd = standard deviation of the sample differences
d̄ = mean sample difference

EXAMPLE: A stock market investor is interested in determining whether there is a


significant difference in the price to earnings ratio for companies from one year to
the next. In an effort to study this question, the investor randomly samples 9
companies at the end of year 1 and year 2.

1.

2.
3. Assume α = 0.01
4. α/2 = 0.005. n = 9, df = 8, t0.005,8 = ± 3.355
- if t > 3.355 or t < -3.355 → reject the null hypothesis
6. t = - 0.7
7. fail to reject null hypothesis
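The matched-pairs statistic t = d̄/(sd/√n) can be sketched with hypothetical year-over-year P/E differences for the 9 companies (n = 9, df = 8, critical t0.005,8 = ±3.355 as in the example):

```python
# Hypothetical sketch of the matched-pairs t test: 9 year-over-year P/E
# differences (made-up numbers). t = d-bar / (s_d / sqrt(n)), df = n - 1 = 8.
import math
import statistics

d = [-1.2, 0.5, -0.8, 1.1, -2.0, 0.3, -0.6, 0.9, -1.5]   # year1 - year2 pairs
n = len(d)
d_bar = statistics.mean(d)
s_d = statistics.stdev(d)
t = d_bar / (s_d / math.sqrt(n))
print(round(t, 2), "-> reject H0" if abs(t) > 3.355 else "-> fail to reject H0")
```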

Confidence Interval Formula to Estimate the Difference in Related


Populations, D
EXAMPLE: The sales of new houses fluctuates seasonally. A national real estate
association wants to estimate the average difference in the number of new houses
sales per company in Halifax between year 1 and year 2. To do so, the association
randomly selects 18 real estate firms in the Halifax area and obtains their new-
house sales figures for May of year 1 and May of year 2. The numbers of sales per
company.
Using these data, the association’s analyst estimates the average difference in the
number of sales per real estate company in Halifax for May of year 1 and May of
year 2 and constructs a 99% confidence interval. The analyst assumes that
differences in sales are normally distributed in the population.

n = 18 (number of pairs), df = 17, 99% level of confidence, t0.005,17 = 2.898, d̄ = -3.39

- the minus sign indicates more sales in year 2 than in year 1


- Reject the null hypothesis

10.4. Statistical Inferences About 2 Population Proportions


EXAMPLE: Comparing the market share of a product for 2 different markets
Mean difference

Standard deviation of the difference of sample proportions

Z Formula for the Difference in 2 Population Proportions

p̂ = sample proportion, p = population proportion, q̂ = 1 - p̂


Hypothesis Testing
- can be used to determine the probability of getting a particular difference in 2
sample proportions, p̂1 – p̂2, when given the values of the population proportions
z Formula to Test the Difference in Population Proportions

EXAMPLE: Is the proportion of people driving new cars in Windsor different from
the proportion in Kingston?

Confidence Interval to Estimate p1 – p2

EXAMPLE: A manager wants to determine the difference between the proportion of


morning shoppers who are 25 years old or less and the proportion of after 5 pm
shoppers who are 25 years old or below. Out of 400 morning shoppers, 48 are under
25 years old. Out of 460 after 5 pm shoppers, 187 are below 25 years old. Construct
a 98% confidence interval.

- the negative signs in the interval indicate a higher proportion of people at or
below age 25 shopping after 5 p.m.
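The interval can be reproduced from the counts given (z0.01 = 2.33 for 98% confidence):

```python
# 98% CI for p1 - p2 using the shopper example's numbers:
# 48 of 400 morning shoppers and 187 of 460 after-5-pm shoppers are <= 25.
import math

n1, x1 = 400, 48
n2, x2 = 460, 187
p1, p2 = x1 / n1, x2 / n2            # 0.12 and ~0.4065
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
margin = 2.33 * se
print(f"{p1 - p2 - margin:.3f} <= p1 - p2 <= {p1 - p2 + margin:.3f}")
# -> -0.352 <= p1 - p2 <= -0.221  (entirely negative: more under-25s after 5 pm)
```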
WEEK 9
11.1 Introduction to Design of Experiments
experimental design = a plan and a structure to test hypotheses in which the
researcher either controls or manipulates one or more variables.
Independent variables
treatment variable = the independent variable of an experimental design
that the researcher either controls or modifies
classification variable (levels) = the independent variable of an
experimental design that was present prior to the experiment and is not the
result of the researcher's manipulations or control

EXAMPLE: A finance analyst might conduct a study to determine whether there is


a significant difference in application fees for home loans in the 10 provinces of
Canada and might include three different types of lending organizations. In this
study, the independent variables are provinces and types of lending organizations.

Dependent variable = the response to the different levels of the independent


variable

3 specific types of experimental designs:


1. completely randomized design
2. randomized block design
3. factorial experiments

11.2 The Completely Randomized Design (One Way ANOVA)


completely randomized design = an experimental design wherein there is one
treatment or independent variable with two or more treatment levels and one
dependent variable
EXAMPLE: A tire-quality study in which tire quality is the independent variable and
the treatment levels are low, medium and high quality. The dependent variable
might be the number of km driven before the tread fails provincial inspection.

One-Way Analysis of Variance (ANOVA)


= the process used to analyze a completely randomized experimental design. This
process involves computing a ratio of the variance between treatment levels of the
independent variable to the error variance. This ratio is an F value, which is then
used to determine whether there are any significant differences between the means
of the treatment levels.
- Begins with the notion that dependent variable responses (data,
measurement) are not all the same in a given study
EXAMPLE: Measurements for the opening of 24 valves are given below. The mean
is 6.34 cm. Only one of the valve openings is 6.34 cm. Why?
- Attempting to break down the total variance among the objects being studied
into possible causes

- ANOVA tests are one-tailed, with the rejection region in the upper tail

One-way ANOVA partitions the total variance of the data into 2 variances:
1. the variance resulting from the treatment (columns)
2. the error variance, or that portion of the total variance unexplained by the
treatment

ASSUMPTIONS ANALYSIS OF VARIANCE


- Observations are drawn from normally distributed populations
- Observations represent random samples from the populations
- The variances of the populations are equal

EXAMPLE: An analyst decides to analyze the effects of the machine operator on the
valve opening measurements of valves produced in a manufacturing plant. The
independent variable in this design is the machine operator. Suppose further that 4
different operators operate the machines. These four machine operators are the
levels of treatment (classification) of the independent variable. The dependent
variable is the opening measurement of the valve. Is there a significant difference in
the mean valve opening of 24 valves produced by the 4 operators?
ANOVA table

In the machine operator example, dfC = 3 and dfE = 20. F0.05,3,20 = 3.10. The
observed F value is 10.18, larger than the critical F value, so the null hypothesis is
rejected.
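The one-way ANOVA partition can be computed by hand; a sketch with hypothetical valve openings (cm) for the 4 operators (C = 4, N = 24):

```python
# Hypothetical sketch of one-way ANOVA: partition total variance into the
# treatment (column) part and the error part, then F = MSC / MSE.
groups = [
    [6.33, 6.26, 6.31, 6.29, 6.40],                     # operator 1
    [6.26, 6.36, 6.23, 6.27, 6.19, 6.50, 6.19, 6.22],   # operator 2
    [6.44, 6.38, 6.58, 6.54, 6.56],                     # operator 3
    [6.29, 6.23, 6.19, 6.21, 6.29, 6.33],               # operator 4
]
N = sum(len(g) for g in groups)
C = len(groups)
grand_mean = sum(sum(g) for g in groups) / N

# SSC: variance between treatment levels; SSE: variance within levels.
ssc = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
msc, mse = ssc / (C - 1), sse / (N - C)   # dfC = C - 1, dfE = N - C
F = msc / mse
print(round(F, 2), "-> reject H0" if F > 3.10 else "-> fail to reject H0")
```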

Reading the F Distribution Table


- F distribution begins at 0 and cannot be negative because variances are always
positive
- dfC (degrees of freedom in the numerator) = treatment (column) df = C - 1
- dfE (degrees of freedom in the denominator) = error df = N - C

Comparison of F and t Values


- ANOVA can be used to test hypotheses about the difference in 2 sample
means from independent populations
- the t test is a special case of one-way ANOVA where there are only 2 treatment
levels (dfC = 1)
- in this case, F = t²

WEEK 10
12.1 Correlation
Correlation = a measure of the degree of relatedness of two or more variables.
- several measures of correlation are available; ideally you would solve for ρ (the
population coefficient of correlation)

coefficient of correlation, r
= measures the linear correlation of two variables (interval data)
- computed by business analysts from sample data
- the Pearson product-moment correlation coefficient
- ranges from -1 (perfect inverse correlation) through 0 (no correlation) to +1 (perfect positive correlation)
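A sketch of the Pearson r computation on hypothetical interval data, using r = Sxy/√(Sxx·Syy):

```python
# Pearson product-moment correlation r from raw interval data.
# The x, y values below are hypothetical.
import math

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 7, 8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))   # -> 0.927 (strong positive linear correlation)
```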

Examples of correlations:
12.2 Introduction to Simple Regression Analysis

Regression analysis = the process of constructing a mathematical model or


function that can be used to predict or determine one variable by any other
variable.

Bivariate (simple, linear) regression


- involves 2 variables whereby one is predicted by another variable
- dependent variable (y) – is predicted
- independent variable (x) - explanatory variable, the predictor
- the first step is to construct a scatter plot
12.3. Determining the Equation of the Regression Line

- if ɛi = 0 the point is on the regression line


Mathematical models can be
- deterministic: produce an exact output for a given input

- probabilistic: includes an error term that allows for various values of output to
occur for a given value of input

Equation of the Simple Regression Line


- the process of determining b0 and b1 is called least squares analysis
- the least squares regression line is the regression line that results in the smallest
sum of squared errors

Slope of the Regression Line

y-intercept of the Regression Line
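The slope and intercept formulas can be sketched on hypothetical data: b1 = (Σxy − n·x̄·ȳ)/(Σx² − n·x̄²) and b0 = ȳ − b1·x̄.

```python
# Least squares slope b1 and intercept b0 for a simple regression line.
# The x, y values below are hypothetical.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 7, 8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
     (sum(xi**2 for xi in x) - n * x_bar**2)
b0 = y_bar - b1 * x_bar
print(f"y-hat = {b0:.3f} + {b1:.3f}x")   # -> y-hat = 1.200 + 1.086x
```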

12.4. Residual Analysis


Residual = The difference between the actual y value and the y value predicted by
the regression model
- the error of the regression model in predicting each value of the dependent
variable

Σ(y - ŷ) = sum of the residuals
- always zero (except for rounding error)

Outliers = data points that lie apart from the rest of the points.
- located with residuals

Residual graphs are plotted against the x-axis


Using Residuals to Test the Assumptions of the Regression Model
Assumptions:
- the model is linear
- the error terms have constant variances
- the error terms are independent
- the error terms are normally distributed

residual plot = a type of graph in which the residuals for a particular regression
model are plotted along with their associated values of x, as points (x, y - ŷ)

homoscedasticity = the condition that occurs when the error variances produced
by a regression model are constant

heteroscedasticity = the condition that occurs when the error variances produced
by a regression model are not constant

- error terms do not appear to be related to adjacent terms


12.5 Standard Error of the Estimate
- provides a single measurement of the regression error
sum of squares of error (SSE) = the sum of the residuals squared for a
regression model

Computational Formula for SSE

Standard Error of the Estimate, se


= a standard deviation of the error of a regression model

- about 68% of residuals fall within 0 ± 1se, and about 95% within 0 ± 2se
- outliers can be identified as data points whose residuals lie beyond ±2se or ±3se
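A sketch tying residuals, SSE, and se together on hypothetical data (the fitted line used below is the least squares line for these points):

```python
# Residuals, SSE, and the standard error of the estimate s_e = sqrt(SSE/(n-2)).
# Hypothetical data; the fitted line is y-hat = 1.2 + (19/17.5)x.
import math

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 7, 8]
b0, b1 = 1.2, 19 / 17.5
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e**2 for e in residuals)           # sum of squares of error
se = math.sqrt(sse / (len(x) - 2))           # standard error of the estimate
print(round(sum(residuals), 6))              # residuals sum to ~0
print(round(se, 3))
```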

12.6 Coefficient of Determination


coefficient of determination, r2
= the proportion of variability of the dependent variable accounted for or explained
by the independent variable in a regression model
- ranges from 0 to 1
- if r2 = 0: the predictor accounts for none of the variability of the dependent
variable
- if r2 = 1: perfect prediction of y by x and 100% of the variability of y is accounted
for by x
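r² = 1 − SSE/SSyy, sketched on hypothetical data:

```python
# Coefficient of determination r^2 = 1 - SSE/SS_yy. Hypothetical data;
# y-hat = 1.2 + (19/17.5)x is the least squares line for these points.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 7, 8]
n = len(y)
y_bar = sum(y) / n
b0, b1 = 1.2, 19 / 17.5
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
ss_yy = sum((yi - y_bar) ** 2 for yi in y)                     # total variation
r2 = 1 - sse / ss_yy
print(round(r2, 3))   # -> 0.86 (x accounts for ~86% of the variability of y)
```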
Sum of squares of y SSyy

Explained variation (SSR, SSreg) Unexplained variation (SSE,


SSerr)

Computational Formula for r2

12.7 Hypothesis Tests for the Slope of the Regression Model and for the
Overall Model
Testing the Slope
- a hypothesis can be conducted on the sample slope of the regression model to
determine whether the population slope is significantly different from zero
- another way to determine how well a regression model fits the data

- if the population slope is zero, ȳ would be used as the predictor of y for all values of x

- 2 tailed – testing if there is a correlation

- one tailed - testing if there is a significant positive correlation

- one tailed - testing if there is a significant negative correlation


t Test of the Slope

Testing the Overall Model


- the F test for overall significance tests the same thing as the t test in simple
regression
F = t2
- the hypotheses being tested in simple regression by the F test for overall
significance are:

- the values of the sum of squares (SS), degrees of freedom (df), and mean squares
(MS) are obtained from the ANOVA table
WEEK 11

12.8 Estimation
- regression analysis can be used as a prediction tool
Confidence Intervals to Estimate the Conditional Mean of y: μy|x
- one type of confidence interval is an estimate of the average value of y, E(y|x), for a
given x, where x0 is a particular value of x

EXAMPLE: Construct a 95% confidence interval to estimate the average value of y
(airline cost) when x (number of passengers) is 73. For a 95% confidence interval,
α = 0.05 and α/2 = 0.025. df = n – 2 = 12 – 2 = 10. The table t value is
t0.025,10 = 2.228. Other needed values for this problem, solved for previously, are:

se = 0.1772, Σx = 930, x̄ = 77.5, Σx² = 73,764, x0 = 73, ŷ = 4.5411
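Plugging these values into the conditional-mean interval formula ŷ ± t·se·√(1/n + (x0 − x̄)²/(Σx² − (Σx)²/n)):

```python
# 95% CI for the conditional mean E(y|x0) at x0 = 73 passengers, using the
# values given: se = 0.1772, sum_x = 930, x-bar = 77.5, sum_x2 = 73,764,
# y-hat = 4.5411, n = 12, t_{0.025,10} = 2.228.
import math

n, t, se = 12, 2.228, 0.1772
sum_x, x_bar, sum_x2 = 930, 77.5, 73764
x0, y_hat = 73, 4.5411

sxx = sum_x2 - sum_x**2 / n                  # sum_x2 - (sum_x)^2/n = 1689
margin = t * se * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
print(f"{y_hat - margin:.4f} <= E(y|x0) <= {y_hat + margin:.4f}")
# -> 4.4192 <= E(y|x0) <= 4.6630
```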

Prediction Intervals to Estimate a Single Value of y

EXAMPLE:

t0.025,10 = 2.228, se = 0.1772, Σx = 930, x̄ = 77.5, Σx² = 73,764, x0 = 73, ŷ = 4.5411
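The prediction interval for a single y value uses the same quantities but adds 1 under the square root, which widens the interval:

```python
# 95% prediction interval for a single value of y at x0 = 73, using the same
# given values; note the extra "1 +" compared with the conditional-mean CI.
import math

n, t, se = 12, 2.228, 0.1772
sum_x, x_bar, sum_x2 = 930, 77.5, 73764
x0, y_hat = 73, 4.5411

sxx = sum_x2 - sum_x**2 / n
margin = t * se * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
print(f"{y_hat - margin:.4f} <= y <= {y_hat + margin:.4f}")
# -> 4.1279 <= y <= 4.9543
```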

12.9 Using Regression to Develop a Forecasting Trend Line


- time-series data: data gathered on a particular characteristic over a period of time
at regular intervals
EXAMPLE: 10 years of weekly Toronto Stock Exchange averages, monthly
consumption of coffee over a two-year period
- time-series data contain any one or combination of four elements: trend, cyclicity,
seasonality, and irregularity
- trend = the long-term general direction of the data
12.10 Interpreting the Output

- the regression equation is found under Coefficients at the bottom of ANOVA


- the slope or coefficient of x is 2.2315 and the y-intercept is 30.9125
- the Standard Error is the fourth statistic under Regression Statistics (15.6491)
- r2 is on the second line (0.886)
- the t test for the slope is found under t Stat near the bottom of the ANOVA section
in the Number of beds (x variable) row, t = 8.83
- adjacent to the t Stat is the p-value, the probability of a t statistic at least this
extreme occurring by chance if the null hypothesis is true (p = 0.000005)
- the p-value for F equals the p-value for t here, since F = t² in simple regression
- the predicted values and the residuals are shown in the Residual Output section

13.1 The Multiple Regression Model


multiple regression = regression analysis with one dependent variable and two or
more independent variables or at least one nonlinear independent variable

- y is the dependent (response) variable


partial regression coefficient (βi)= the coefficient of an independent variable in a
multiple regression model that represents the increase that will occur in the value of
the dependent variable from a one-unit increase in the independent variable if all
other variables are held constant

Multiple Regression Model with Two Independent Variables (First Order)


- constructed with 2 independent variables where the highest power of either
variable is 1

response surface = the surface defined by a multiple regression model.


response plane = a plane that is fit in a three-dimensional space and that
represents the response surface defined by a multiple regression model with two
independent first-order variables.

Response Plane for a First-Order Two-Predictor Multiple Regression Model

Determining the Multiple Regression Equation


- the regression analysis is referred to as least squares analysis
- a regression model with 2 independent variables will generate 3 simultaneous
equations with 3 unknowns (b0, b1, b2)
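A sketch of solving the three normal equations directly (here via Cramer's rule) on made-up data; in practice statistical software does this step:

```python
# Hypothetical sketch: fitting y-hat = b0 + b1*x1 + b2*x2 by solving the
# three least squares normal equations A @ [b0, b1, b2] = c with Cramer's rule.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y  = [5, 4, 11, 10, 15]   # made-up data (exactly y = x1 + 2*x2)
n = len(y)

sx1x2 = sum(a * b for a, b in zip(x1, x2))
A = [
    [n,        sum(x1),                  sum(x2)],
    [sum(x1),  sum(v * v for v in x1),   sx1x2],
    [sum(x2),  sx1x2,                    sum(v * v for v in x2)],
]
c = [sum(y),
     sum(a * b for a, b in zip(x1, y)),
     sum(a * b for a, b in zip(x2, y))]

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

d = det3(A)
b = []
for j in range(3):            # Cramer's rule: replace column j with c
    Aj = [row[:] for row in A]
    for i in range(3):
        Aj[i][j] = c[i]
    b.append(det3(Aj) / d)
print([round(v, 3) for v in b])   # -> [0.0, 1.0, 2.0]
```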

13.2 Significance Tests of the Regression Model and Its Coefficients


Testing the Overall Model
- F test of overall significance

The F Value

- if a regression model has only one linear independent variable, it is a simple


regression model and the F test for overall model is the same as the t test for
significance of the population slope

Significance Tests of the Regression Coefficients


- in multiple regression, individual significance tests can be computed for each
regression coefficient using a t test
The Hypothesis for testing the regression coefficient of each independent variable:

At α = 0.05, the null hypothesis is rejected because the probabilities (p) are less
than 0.05.
- if the t ratios for any predictor variable are not significant (fail to reject H 0), the
analyst might decide to drop that variable from the analysis as a nonsignificant
predictor
- the df for each of these individual tests of regression coefficients are n – k – 1

- testing the regression coefficients gives the analyst some insight into the fit of the
regression model but also helps in the evaluation of how worthwhile individual
independent variables are in predicting y

WEEK 12
14.2 Indicator (Dummy) Variables
= qualitative variables that represent whether or not a given item or person
possesses a certain characteristic and are usually coded as 0 (negative) or 1
(affirmative).
- if an indicator variable has c categories, then c – 1 dummy variables must be
created
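A minimal sketch of dummy coding with c = 3 hypothetical categories, so c − 1 = 2 dummy variables are created:

```python
# Encoding a 3-category qualitative variable as c - 1 = 2 dummy variables;
# the omitted category acts as the baseline. Data are hypothetical.
regions = ["East", "West", "North", "East", "North"]
categories = ["West", "North"]          # "East" is the baseline (all zeros)
dummies = [[1 if r == c else 0 for c in categories] for r in regions]
print(dummies)   # -> [[0, 0], [1, 0], [0, 1], [0, 0], [0, 1]]
```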

14.3 Model Building: Search Procedures


Search procedures = Processes whereby more than one multiple regression
model is developed for a given database, and the models are compared and sorted
by different criteria, depending on the given procedure.
all possible regressions = A multiple regression search procedure in which all
possible multiple linear regression models are determined from the data using all
variables.
Stepwise regression = A step-by-step multiple regression search procedure that
begins by developing a regression model with a single predictor variable and adds
and deletes predictors one step at a time, examining the fit of the model at each
step until there are no more significant predictors remaining outside the model.
Forward selection = A multiple regression search procedure that is essentially the
same as stepwise regression analysis except that once a variable is entered into the
process, it is never removed.
backward elimination = A step-by-step multiple regression search procedure that
begins with a full model containing all predictors. Nonsignificant predictors are
eliminated from the model, one at a time, until no nonsignificant predictors remain.
