STAR Rando Questions Stats

Uploaded by

tsong51400

REGRESSION

Study online at https://quizlet.com/_190zv4


Why look at regression statistics?
o Correlation statistics are used for describing the STRENGTH of a relationship b/w 2 variables
o Regression is used when a researcher wants to establish this relationship as a basis for PREDICTION
--> Ex. Good for clinical decision making and predicting quantifiable clinical outcomes (prognosis)
2 types of regression analysis:
1) Linear regression
2) Nonlinear regression
1) NONLINEAR REGRESSION
NOTE: applying a linear regression line to curvilinear data will not allow us to get a full hold on all the data. we will miss some of the "moments"
how do we adjust for nonlinear regression?
we use a different formula!
called: QUADRATIC EQUATION
defines the quadratic curve
Quadratic equation (def)
Y-hat = a + b1X + b2X^2
polynomial regression (def) process of deriving the quadratic equation
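A minimal sketch of polynomial regression, assuming made-up data and hypothetical helper names (`solve_linear`, `fit_quadratic`). It derives a, b1 and b2 in Y-hat = a + b1X + b2X^2 by solving the least-squares normal equations in plain Python; it is one possible way to do the fit, not the method the cards prescribe.

```python
def solve_linear(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_quadratic(x, y):
    """Return (a, b1, b2) minimizing the sum of squared residuals."""
    # Normal equations for the design columns [1, X, X^2].
    powers = [sum(xi ** k for xi in x) for k in range(5)]
    A = [[powers[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(3)]
    return tuple(solve_linear(A, rhs))


# Hypothetical curvilinear data generated from Y = 1 + 2X + 3X^2.
xs = [0, 1, 2, 3, 4, 5]
ys = [1 + 2 * x + 3 * x * x for x in xs]
a, b1, b2 = fit_quadratic(xs, ys)
print(a, b1, b2)
```

Because the made-up points lie exactly on a quadratic curve, the fitted coefficients should come back close to 1, 2 and 3.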
Researchers must decide to use a linear or curvilinear regression model based on visual inspection of the scatter plot and then doing an analysis of variance to determine if their pick was the right model for the set of data. After doing the analysis of variance for each equation, what variable is used to determine which one is better?
lowest p-value!
1) do an analysis of variance for both linear and nonlinear regression lines
2) lowest p-value will tell you which line to use

this analysis of variance is done before running ANCOVA


2) Linear regression (def)
involves the examination of 2 variables, X and Y
• Ex. Looking at blood pressure (Y) as it relates to age (X)
linear regression is the simplest and most commonly used form of regression stats
X stands for: independent or predictor variable
Y stands for: dependent or criterion variable
linear regression line (def)
the line that we determine, which goes through the middle of the data points and represents a good approx of the relationship
• In a perfect correlation, all data points will fall on a straight line
• Could use this line to predict values of Y by knowing any given value of X
how do we know where to draw the line? where is the middle and how do we construct it?
determine the equation to use!
y-hat = a + bX
y-hat = the PREDICTED VALUE OF Y
a = regression constant or Y-intercept (value of Y when X=0 --> point at which the line intersects the Y axis)
b = regression coefficient or slope of the line
• When b is positive, Y increases as X increases
• When b is negative, Y decreases as X increases
• If b = 0, the line is horizontal, indicating no relationship bc Y is constant for all values of X
• Neg and pos values of b will correspond to the negative and positive correlation b/w X and Y
what is the role of the correlation coefficient in linear regression analysis?
we can use the value of r to determine how closely the data points cluster around the line
THE REGRESSION MODEL
Regression model (def)
model which allows us to find RESIDUALS when r <1
NOTE: in a perfect scenario, all points being examined will fall on the line (aka r = perfect +1 or -1)
• HOWEVER, when r < 1.00, the regression line is only partially useful for predicting values of Y and we will need to look at residuals
When r < 1, 3 things occur:
1) The regression line is not 100% accurate for predicting Y for any given value of X because the data is all over the place
2) Some data points will be above the regression line and some below. Some will fall far from the line and some will fall closely
3) residuals pop up!
--> the values you plug in for X will give you a PREDICTED value of Y that is different from the actual/observed value of Y
Residuals (def)
ERROR COMPONENTS, or the distance that each ACTUAL value of Y varies from the PREDICTED value of Y
(Y - Y-hat)
residuals represent the degree of error in the regression line


what is the main purpose of the line of best fit when r <1?
when r <1, we try to minimize the error component and yield the smallest residuals
least squares method (def)
used to help find the line of best fit, which minimizes the sum of the squared residuals
--> determined by finding the squares of all residuals and summing them, Σ(Y - Y-hat)^2, for every possible line that could be drawn to the data
the line that gives you the SMALLEST sum of squares is the line of best fit
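A short sketch of the least squares idea, assuming hypothetical age/BP numbers and hypothetical helper names (`fit_line`, `sum_sq_residuals`). It computes a and b for y-hat = a + bX from the usual closed-form formulas and shows that nudging the slope away from the fitted value makes the sum of squared residuals larger.

```python
# Hypothetical data for illustration only: age (X) and blood pressure (Y).
ages = [30, 40, 50, 60, 70]
bps = [118, 125, 130, 141, 146]


def fit_line(x, y):
    """Return (a, b) of the least-squares line y-hat = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # regression coefficient (slope)
    a = mean_y - b * mean_x  # regression constant (Y-intercept)
    return a, b


def sum_sq_residuals(x, y, a, b):
    """Sum of (Y - Y-hat)^2 for the line y-hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = fit_line(ages, bps)
ss_best = sum_sq_residuals(ages, bps, a, b)
ss_other = sum_sq_residuals(ages, bps, a, b + 0.1)  # a slightly different line
print(a, b, ss_best, ss_other)
```

For these made-up numbers the fitted line is roughly y-hat = 96 + 0.72X, and any other line, such as the nudged one, gives a larger sum of squares.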
why do we want the line that gives us the SMALLEST sum of squares
the smaller the sum of squares, the closer the data points will be to the line of best fit
ASSUMPTIONS FOR REGRESSION ANALYSIS
What 2 assumptions do we make:
1) For any given value of X, a distribution of Y scores exists
--> MEANING, if we sampled more subjects, we'd see a distribution of Y scores for each value of X
--> ex. Each BP score in our data set is a random observation from the larger population distribution of all possible BP scores for that age
2) Assume that each of these distributions is normal and their variances are equal
ANALYSIS OF RESIDUALS: PLOTTING RESIDUALS

Plotting residuals (def)
once we have determined our expected and actual Y values, and calculated the residuals, we can plot them to see how much the actual values differed from the expected ones

what are the 3 hypothetical figures we can look at when PLOTTING RESIDUALS:
residuals vs predicted scores
• Figure A: if all the assumptions are met, the pattern should resemble a horizontal band of points
• Figure B: indicates that the variance of residuals is not consistent, but DOES depend on the predictor variable, X --> also, notice that the variance increases as X gets larger --> assumptions of normality and equality of variance NOT met
• Figure C: indicates that it completely goes against the linear model. Is a curvilinear model instead.
When the residuals do not plot/fall in a horizontal band (like figure A), the researcher can do 2 things:
• 1) LOGARITHMIC TRANSFORMATIONS: they may transform one or both sets of data to more closely satisfy the necessary assumptions, so that the variance is stabilized, the distributions are normalized, and/or the relationship becomes more linear
• 2) create STANDARDIZED RESIDUALS: obtained by dividing each residual score by the S.D. of the residual distribution, allowing them to be expressed in S.D. units (very similar to what we do with z-scores)
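A small sketch of standardized residuals, assuming hypothetical residual values and the sample S.D. (n - 1 in the denominator) from Python's `statistics` module; the cards do not say which S.D. formula to use, so that choice is an assumption.

```python
import statistics

# Hypothetical residuals (Y - Y-hat) from some fitted regression line.
residuals = [0.4, 0.2, -2.0, 1.8, -0.4]

# S.D. of the residual distribution (sample S.D., an assumption here).
sd = statistics.stdev(residuals)

# Each residual expressed in S.D. units, much like a z-score.
standardized = [r / sd for r in residuals]
print(standardized)
```

After the division, the standardized residuals have an S.D. of 1, which makes unusually large values easier to spot.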
outliers (def)
small number of deviant scores that are far from the main cluster of points around the regression line, that can significantly distort the statistical association
--> There is no statistical rationale for discarding an outlier, but if a causal factor can be identified (ex. Error in measurement or recording, or someone other than the tester made a mistake), it may be reasonable to omit the outlier
how can the correlation coefficient be used in regression statistics?
can be used to supplement the findings from the regression line and is a rough indicator of how well the regression line "fits" the data
o The correlation coefficient can be used to represent the strength of an association:
• when r is close to + or - 1 --> the regression line is said to provide a strong basis for prediction
• when r gets farther from + or - 1 --> errors of prediction increase --> and r alone cannot evaluate the accuracy of how far away things are from the regression line
when testing the ACCURACY OF PREDICTION OF OUR REGRESSION LINE (and to evaluate how far the data points fall from the regression line), 2 statistical approaches can be used
1) coefficient of determination
2) standard error of the estimate (SEE)

1) coefficient of determination
• The square of the correlation coefficient, r^2 = the % of the TOTAL variance in the Y scores that can be explained by X scores
--> it indicates the accuracy of prediction based on X
--> aka if we know the variance of X, we can account for part of the variance of Y (ex. Looking at BP vs age, r = .87 and r^2 = .76, meaning 76% of the variance in BP can be accounted for by knowing the variance of age)
--> 1 - r^2 = the amount of variance that CANNOT be explained by the relationship b/w the 2 variables (ex. 1 - .76 = .24, so 24% not explained)
--> r^2 values range b/w 0 and 1


2) standard error of the estimate (SEE)
• Looks at the variance of errors on either side of the regression line
--> If SEE = high value, scores are widely dispersed
--> If SEE = small, the better the fit
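A hedged sketch of both accuracy measures on hypothetical age/BP data. The cards give no formula for SEE, so the common definition SEE = sqrt(SSres / (n - 2)) is assumed here; all variable names are illustrative.

```python
import math

# Hypothetical data for illustration only.
x = [30, 40, 50, 60, 70]       # age
y = [118, 125, 130, 141, 146]  # blood pressure

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# 1) coefficient of determination
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2            # share of Y variance explained by X
unexplained = 1 - r_squared   # share NOT explained

# 2) standard error of the estimate (assumed formula: sqrt(SSres / (n - 2)))
b = sxy / sxx
a = mean_y - b * mean_x
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
see = math.sqrt(ss_res / (n - 2))

print(round(r_squared, 4), round(unexplained, 4), round(see, 4))
```

A small SEE together with an r^2 near 1 points to a line that fits these made-up points closely.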


ANALYSIS OF VARIANCE OF REGRESSION
Why do we look at an analysis of variance of regression (ANOVA-regression)?
after we determine our regression line, analysis of variance of regression allows us to statistically test if we found the regression relationship by chance or not
null hypothesis for variance of regression analysis? null: b = 0
recap: Y stands for observed value
recap: Y-hat stands for predicted value
NOTE: by looking at how Y and Y-hat relate to Y-bar (the MEAN OF ALL SCORES), we are able to find variances
REGRESSION sum of squares
(SSreg): Σ(Y-hat - Y-bar)^2 --> Tells us the variance in Y that is explained by the regression line
RESIDUAL sum of squares
(SSres): Σ(Y - Y-hat)^2 --> Tells us the variance of our observed scores around the regression line (not explained)
If we find all the sum of squares for regression and residual, we can make an F ratio
F = MSreg / MSres
--> this is analogous to anova and used to test significance
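A sketch of the F ratio for a simple linear regression, using hypothetical age/BP data. It assumes the standard degrees of freedom for one predictor (1 for regression, n - 2 for residual); names such as `ss_reg` and `f_ratio` are illustrative.

```python
# Hypothetical data for illustration only.
x = [30, 40, 50, 60, 70]
y = [118, 125, 130, 141, 146]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x
y_hat = [a + b * xi for xi in x]

ss_reg = sum((yh - mean_y) ** 2 for yh in y_hat)         # Σ(Y-hat - Y-bar)^2
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # Σ(Y - Y-hat)^2
ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # Σ(Y - Y-bar)^2

df_reg = 1      # one predictor
df_res = n - 2  # n observations minus two estimated parameters
ms_reg = ss_reg / df_reg
ms_res = ss_res / df_res
f_ratio = ms_reg / ms_res
print(ss_reg, ss_res, f_ratio)
```

The two sums of squares add up to the total sum of squares, and the resulting F would then be compared with a critical F value for (1, n - 2) degrees of freedom to test the null hypothesis b = 0.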
INTERPRETING OUR REGRESSION ANALYSIS
The calculated reference population that we PREDICT (Y-hat), should be clearly specified --> bc its ultimate purpose is to help us predict scores for all kinds/new samples of observations from the data. This means 2 things:
• 1) That all predictions are ONLY applicable when they meet the population criteria/assumption
• 2) We cannot make valid predictions for values of X that are outside the range of scores that were used to generate the regression line
why do we have to do so many analyses of variance?
o The overall fxn of experimental design is to explain the effect of an independent variable on a dependent variable, while controlling for CONFOUNDING effects of extraneous factors --> research does not always do this well
o Confounding effects can never be eliminated, but can be controlled for by experimental design and/or statistical control
ANALYSIS OF COVARIANCE (ANCOVA) for LINEAR regressions
analysis of COvariance (ANCOVA) (def) a combination of ANOVA + linear regression
Covariate (def)
confounding variables that can affect the relationship between the dependent variable and other independent variables of primary interest --> typically we want to remove the effect of this covariate so that we are only looking at the true relationship of the indep and dep variables
o Ex. Looking at videotaping or having students participate in a group activity --> which one makes them a better clinician
• Confounding variable: GPA (need to make all these people the same intelligence)
• Indep: teaching strategy
• Dependent: clinical performance
• Covariate: GPA
adjustment of means (def)
need to adjust means of the covariates to look at the true regression relationship
• 1st: run an ANOVA to determine p values and determine significance or not --> this also helps one to see if a covariate may be causing problems
• 2nd: When it is discovered that a covariate has a confounding effect, we want to ARTIFICIALLY EQUATE the 2 groups (X and Y) --> done by finding the mean of whatever covariate group you are examining
• After making the covariate means the same, we can use regression lines to predict what the mean score for the X and Y variables would be if the covariate were not present, by comparing the covariate to the dependent variable
why do we compare the covariate to the dependent variable?
the dependent variable is the one that will make the largest shift if a covariate is affecting the results of the data.
the independent variable stands alone and will not be affected
As with all tests, ANCOVA requires 4 assumptions to be met:
• 1) Linearity of the covariate
• 2) Homogeneity of slopes
• 3) Independence of the covariate
• 4) Reliability of the covariate
ALL DONE BEFORE RUNNING ANCOVA


1) Linearity of the covariate
there must be a linear relationship between the COVARIATE and the dependent variable (bc you plot them vs each other)
2) Homogeneity of slopes
slopes of all regression lines (for X, Y, and covariate) should be parallel
--> Testing for homogeneity of slopes should be done BEFORE the ANCOVA
--> If not parallel, ANCOVA is not valid


3) Independence of the covariate
o Covariate chosen should be related to the dependent variable, but INDEPENDENT from the independent variable
o Do not want the indep variable to influence the value of the covariate
o Measured prior to ANCOVA


4) Reliability of the covariate
o Covariate cannot be contaminated by measurement error itself
o Any error found within the covariate will affect the overall calculations and adjustment of means

how many covariates can we examine at once?
as many as we want!
• ANCOVA can control for many different covariates
• They must all be related to the dependent variable and independent of the indep variable
• Ex. If we wanted to compare strength @ different ranges --> covariates may = height, weight, limb girth, BMI, etc
When is ANCOVA run?
after ANOVA and initial regression line has been calculated
INTERPRETING ANCOVA
• Although ANCOVA does increase power, it does not provide a safeguard against problems in study design --> the ANCOVA is not a substitute for randomization and a good study design
• Should only be used in situations when experimental control of relevant variables is not possible, and the covariates need to be identified and measured BEFORE ANCOVA
• ANCOVA cannot determine causation (same with correlation and regression)
• ANCOVA also restricts generalization to the population/range that was established for the study... similar to correlation and regression again

chi square
Study online at https://quizlet.com/_5xslur
The actual counts in the cells of a contingency table are referred to as the expected cell frequencies.
False
When we carry out a chi-square test for independence, the null hypothesis states that the two relevant classifications.
D. The alternative hypothesis states that the two classifications are statistically independent
NOT
If ri is the row total for row i and cj is the column total for column j, then the estimated expected cell frequency corresponding to row i and column j equals ricj/n.
The chi-square test is valid if all of the estimated expected cell frequencies are at least 5.
The chi-square statistic is based on (r-1)(c-1) degrees of freedom where r and c denote the number of rows and columns respectively in the contingency table.
None of the above.
Which of the following statements about the chi-square test of independence is false?
The alternative hypothesis states that the two classifications are statistically independent.
NOT
If ri is the row total for row i and cj is the column total for column j, then the estimated expected cell frequency corresponding to row i and column j equals ricj/n.
The chi-square test is valid if all of the estimated expected cell frequencies are at least 5.
The chi-square statistic is based on (r-1)(c-1) degrees of freedom where r and c denote the number of rows and columns respectively in the contingency table.
None of the above.
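The three true statements in this card can be sketched in a few lines. The 2 x 2 counts below are hypothetical, and the function names (`expected_frequencies`, `chi_square_statistic`) are illustrative rather than taken from any library.

```python
def expected_frequencies(table):
    """Estimated expected count for cell (i, j) = ri * cj / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[ri * cj / n for cj in col_totals] for ri in row_totals]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_frequencies(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# Hypothetical observed counts for a 2 x 2 contingency table.
observed = [[20, 30],
            [30, 20]]

expected = expected_frequencies(observed)
chi2 = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r-1)(c-1)
valid = all(e >= 5 for row in expected for e in row)  # rule of thumb from the card
print(expected, chi2, df, valid)
```

Here every expected count is 25, so the at-least-5 rule is satisfied, and the statistic is compared with a chi-square critical value on 1 degree of freedom.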
In a contingency table, when all the expected frequencies equal the observed frequencies the calculated χ² statistic equals zero.
True
When using chi-square goodness of fit test with multinomial probabilities, the rejection of the null hypothesis indicates that at least one of the multinomial probabilities is not equal to the value stated in the null hypothesis.
True
A fastener manufacturing company uses chi-square goodness of fit test to determine if the population of all lengths of 1/4 inch bolts it manufactures is distributed according to a normal distribution. If we reject the null hypothesis, it is reasonable to assume that the population distribution is at least approximately normally distributed.
False
When using the chi-square goodness of fit test, if the value of the chi-square statistic is large enough, we reject the null hypothesis
True
In performing a chi-square test of independence, as the difference between the respective observed and expected frequencies decreases, the probability of concluding that the row variable is independent of the column variable
Correct: increases
NOT
decreases
may decrease or increase depending on the number of rows and columns.
remains the same.
A chi-square goodness of fit test is considered to be valid if each of the expected cell frequencies is ___________________.
E. at least 5.
NOT
greater than zero.
less than 5.
between 0 and 5.
at most 1
The χ² statistic from a contingency table with 6 rows and five columns will have _____ degrees of freedom
correct: 20
NOT 30 24 5 25
The chi-square goodness of fit test is _________ a one-tailed test with the rejection point in the right tail.
Yes always NOT sometimes or never
The χ² statistic is used to test whether the assumption of normality is reasonable for a given population distribution. The sample consists of 5000 observations and is divided into 6 categories (intervals). The degrees of freedom for the chi-square statistic is:
correct 3
NOT 4999 6 5 4

The process of using sample statistics to draw conclusions about true population parameters is called:
Statistical inference
The classification of student class designation (freshman, sophomore, junior, senior) is an example of:
Categorical random variable
To monitor campus security, the campus police office is taking a survey of the number of students in a parking lot each 30 minutes of a 24-hour period with the goal of determining when patrols of the lot would serve the most students. If X is the number of students in the lot each period of time, then X is an example of:
a discrete random variable
The personnel director at a large company studied the eating habits of the company's employees. The director noted whether employees brought their own lunches to work, ate at the company cafeteria, or went out to lunch. The goal of the study was to improve the food service at the company cafeteria. This type of data collection would best be considered as:
an observational study
A sample of 200 students at a Big-Ten university was taken after the midterm to ask them whether or not they studied for the exam before the midterm, and whether they did well or poorly on the midterm. The following table contains the result.

                             Did Well in Midterm    Did Poorly in Midterm
Studied for the exam         80                     20
Did not study for the exam   30                     70

Referring to the above table, of those who did not study for the exam, _______ percent of them did well on the midterm.
30
An insurance company evaluates many numerical variables about a person before deciding on an appropriate rate for automobile insurance. A representative from a local insurance agency selected a random sample of insured drivers and recorded, X, the number of claims each made in the last 3 years, with the following results.
X f
50
Which measure of central tendency can be used for both numerical and categorical variables?
mode
Which of the following is NOT a measure of central tendency?
interquartile range
A business venture can result in the following outcomes (with their corresponding chance of occurring in parentheses): Highly Successful (10%), Successful (25%), Break Even (25%), Disappointing (20%), and Highly Disappointing (?). If these are the only outcomes possible for the business venture, what is the chance that the business venture will be considered Highly Disappointing
20%
The employees of a company were surveyed on questions regarding their educational background and marital status. Of the 600 employees, 400 had college degrees, 100 were single, and 60 were single college graduates. The probability that an employee of the company is single or has a college degree is:
0.733
Thirty-six of the staff of 80 teachers at a local school are certified in CPR. In 180 days of school, about how many days can we expect that the teacher on bus duty will likely be certified?
81
If n=10 and p=0.70, then the mean of the binomial distribution is: 7.00
A professor receives, on average, 24.7 e-mails from her students per 24 hour period on the day before the midterm exam. To compute the probability of receiving at least 10 e-mails on such a day, what type of probability distribution should be used?
Poisson distribution
For some positive value of Z, the probability that a standard normal variable is between 0 and Z is 0.3770. Therefore the value of Z is:
1.16

For some positive value of X, the probability that a standard normal variable is between 0 and +1.5X is equal to 0.4332. Therefore the value of X is:
1.00
If a particular batch of data is approximately normally distributed, we would find that approximately
2 of every 3 observations would fall between 1 standard deviation above and below the mean.
4 of every 5 observations would fall between 1.28 standard deviations above and below the mean.
19 of every 20 observations would fall between 2 standard deviations above and below the mean.
ALL OF THE ABOVE
A company that sells annuities must base the annual payout on the probability distribution of the length of life of the participants in the plan. Suppose the probability distribution of the lifetimes of the participants is approximately normally distributed with a mean of 68 years of age and a standard deviation of 3.5 years. What proportion of the plan recipients would receive payments beyond age 75?
0.0228
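The 0.0228 answer can be checked with a short sketch that converts age 75 to a z-score and takes the upper tail of the standard normal curve. The helper name `normal_cdf` is illustrative; it uses the error function from Python's `math` module.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


mean_life = 68   # years, from the card
sd_life = 3.5    # years, from the card
age = 75

z = (age - mean_life) / sd_life  # how many S.D.s above the mean
p_beyond = 1 - normal_cdf(z)     # proportion living past age 75
print(z, round(p_beyond, 4))
```

Age 75 sits exactly 2 standard deviations above the mean, so the upper-tail area is about 0.0228.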
You work at Smuckers in Orrville, Ohio and are in charge of the grape jelly production line. The weight of jars of grape jelly and the number of jars at each weight for the last month (4 weeks) are as follows:
Weight     # Jars
8.0 oz.    300
8.5 oz.    287
8.9 oz.    128
9.0 oz.    67
ANSWER 8.42 oz
You work at Smuckers in Orrville, Ohio and are in charge of the grape jelly production line. The weight of jars of grape jelly and the number of jars at each weight for the last month (4 weeks) are as follows:
0.1366
There are 5 starting players on the Bulldog basketball team. Each player wanted to improve her shooting percentage, especially Hannah. The number of baskets Hannah made in the last 10 games are listed below:
12,13,21,18,19,8,4,22,12,27
Calculate the mean, median, mode, variance and standard deviation and range and list them in that order.
15.6, 15.5, 12, 49.16, 7.01, 23
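The answers for Hannah's baskets can be reproduced with Python's `statistics` module. This sketch assumes the card wants the sample variance and sample standard deviation (n - 1 in the denominator), which is what matches the listed 49.16 and 7.01.

```python
import statistics

# Baskets Hannah made in the last 10 games (from the card).
baskets = [12, 13, 21, 18, 19, 8, 4, 22, 12, 27]

mean = statistics.mean(baskets)
median = statistics.median(baskets)
mode = statistics.mode(baskets)
variance = statistics.variance(baskets)  # sample variance (n - 1)
std_dev = statistics.stdev(baskets)      # sample standard deviation
data_range = max(baskets) - min(baskets)

print(mean, median, mode, round(variance, 2), round(std_dev, 2), data_range)
```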
The width of a confidence interval estimate for a proportion will be: narrower for 90% confidence than for 95% confidence
If you were constructing a 99% confidence interval of the population mean based on a sample of n=25 where the standard deviation of the sample s=0.05, the critical value of t will be:
2.7969
A confidence interval was used to estimate the proportion of statistics students who are females. A random sample of 72 students generated a 90% confidence interval of 0.438 to 0.642. Therefore, it is reasonable to assume that the population of female students should be above 50%.
False
Suppose a 95% confidence interval turns out to be 1,000 to 2,100. If the researcher wishes to reduce the width of the interval, one way he can do that is to increase the sample size
True
A major department store chain is interested in estimating the average amount its credit card customers spent on their first visit to the store. 85 accounts were randomly selected. The 95% confidence interval was $56.89 to $145.78. Therefore, the store manager can be 95% confident that
most of his customers are spending more than $50
If the calculated Z-score falls in the rejection region (in the tail of the bell curve), then the null hypothesis is:
rejected
A Type II error is committed when we reject a null hypothesis that is true.
False
How many Kleenex should a Kimberly Clark Corp. package of tissues contain? Researchers determined that, on average, a person uses 60 tissues during a cold, and the population standard deviation is 22. Suppose a random sample of 100 Kleenex users yielded a mean of 52. Test the following hypothesis at the 5% significance level.
Critical Value = 1.96; Calculated Z = 3.64; Reject the null hypothesis
Given your conclusion in question #8, you should recommend that Kimberly Clark Corp. put more than 60 tissues in each box.
False
In a sample of 265 subjects, the average score on an examination was 63.8. Historically, the population standard deviation has been σ = 3.08. What is the 95% confidence interval estimate for the µ?
[63.43 to 64.17]
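A quick sketch of how the [63.43 to 64.17] interval comes about, assuming the usual z-interval for a mean with known population standard deviation: x-bar ± z * σ / sqrt(n). `statistics.NormalDist` (Python 3.8+) supplies the z multiplier.

```python
import math
from statistics import NormalDist

n = 265            # sample size, from the card
sample_mean = 63.8
sigma = 3.08       # known population standard deviation
confidence = 0.95

# z multiplier for a two-sided 95% interval (about 1.96).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

standard_error = sigma / math.sqrt(n)
margin = z * standard_error
lower = sample_mean - margin
upper = sample_mean + margin
print(round(lower, 2), round(upper, 2))
```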
When testing for independence in a contingency table with 3 rows and 4 columns, there are ______ degrees of freedom.
6
If we wish to determine whether there is evidence that the proportion of items of interest is the same in group 1 as in group 2, the appropriate tests to use are the Z test and the chi-square test
True
A manufacturing company produces bike frames and makes 1200 per year. These bike frames can be produced using three different processes. Each process has a unique cost associated with it. The CEO wants to know if the manufacturing processes are related to the overall cost of bike frame production. The following contingency table gives the number of bike frames produced with each process and cost level. At a significance level of .05, perform the chi-square test of independence to determine if production process is related to production cost.
5.99 CRITICAL VALUE
12.56 CHI SQUARE VALUE
5. Question : In the past, of all the students enrolled in "Basic Business Statistics" 10% earned A's, 20% earned B's, 30% earned C's, 20% earned D's and the remaining 20% either failed or withdrew from the course.

Dr Johnson is a new professor teaching "Basic Business Statistics" for the first time this semester. At the conclusion of the semester, in Dr. Johnson's class of 60 students, there were 10 A's, 20 B's, 20 C's, 5 D's and 5 W's or F's.

Assume that Dr. Johnson's class constitutes a random sample. Dr Johnson wants to know if there is sufficient evidence to conclude that the grade distribution of his class is different than the historical grade distribution.

Use α = .05, and determine if the grade distribution for Dr. Johnson's class is different than the historical grade distribution.
Reject the null hypothesis at the 5% level of significance because 16.4 > 9.48. Therefore there is a difference between the historical grade distribution and Dr. Johnson's
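The 16.4 in the answer can be reproduced with a short goodness of fit sketch. The critical value is hard-coded from a chi-square table for 4 degrees of freedom at α = .05 (9.488, which the card rounds to 9.48); variable names are illustrative.

```python
# Observed grade counts in Dr. Johnson's class: A, B, C, D, W/F.
observed = [10, 20, 20, 5, 5]

# Historical proportions for the same categories.
historical = [0.10, 0.20, 0.30, 0.20, 0.20]

n = sum(observed)
expected = [p * n for p in historical]  # about 6, 12, 18, 12, 12

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1    # categories minus 1
critical_value = 9.488    # table value for df = 4, alpha = .05

reject_null = chi2 > critical_value
print(round(chi2, 1), df, reject_null)
```

Since the statistic is larger than the critical value, the sketch agrees with the card's conclusion to reject the null hypothesis.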
In a contingency table, when all the expected frequencies equal the observed frequencies the calculated Chi-square statistic equals zero.
True
The actual counts in the cells of a contingency table are referred to as the expected cell frequencies
False
In general, the degrees of freedom in a contingency table are equal to:
(rows -1)(columns-1)
The chi-square goodness of fit is always a one-tailed test with the rejection region in the right tail.
True
When conducting the chi-square goodness of fit test for your Integrated Research Project, you will be using "Agree, Disagree and No opinion" for your survey responses. Therefore the degrees of freedom used to obtain the critical value is 2.
False
The Dalton Baseball Team is trying to analyze the number of hits their team got last season in order to train in the off season and improve their record for next season. The number of hits each of the 15 players received are as follows:
Baseball Player    # Hits
1     34
2     21
3     14
4     13
5     15
6     16
7     19
8     21
9     9
10    11
11    18
12    21
13    14
14    21
15    16
18, 21, 15
21, 16.53, 23
X 17.53, 16, 21
The weather is bad for the Northern Ohio area and storms are being forecasted. If the probability of a thunderstorm hitting Canton, Ohio, is 67%, and the probability of hail is 34%, and the probability of both a thunderstorm and hail hitting is 28%, what then is the probability of either a thunderstorm or hail hitting the Canton, Ohio area today?
73%
The t distribution approaches the _______________ as the sample size ___________.
Z, increases
When the sample size and sample standard deviation remain the same, a 99% confidence interval for a population mean, µ will be _________________ the 95% confidence interval for µ
Wider than
When the population is normally distributed, population standard deviation σ is unknown, and the sample size is n = 15; the confidence interval for the population mean µ is based on:
The t distribution
In a manufacturing process a random sample of 36 bolts manufactured has a mean length of 3 inches with a standard deviation of .3 inches. What is the 99% confidence interval for the true mean length of the bolt?
2.865 to 3.136
The chi-square goodness of fit test for multinomial probabilities with 5 categories has _____ degrees of freedom
4
A U.S. based internet company offers an on-line proficiency course in basic accounting. Completion of this online course satisfies the "Fundamentals of Accounting" course requirement in many MBA programs. In the first semester 315 students have enrolled in the course. The marketing research manager divided the country into seven regions of approximately equal populations. The course enrollment values in each of the seven regions are given below. The management wants to know if there is equal interest in the course across all regions. 45,60,30,40,50,55,35
15.56
You have diligently collected your surveys and crunched all of the numbers for your primary research on your IRP. One of the questions you asked was as follows:
The Orrville Public Library would see an increase in number of customers if they extended their evening hours to stay open until 11:00 pm on weekend nights.
You had 57 surveys collected from library administrators and staff, with 39 Agree and 10 disagree. What is the calculated Chi-Square value for this test?
17.16
You have diligently collected your surveys and crunched all of the numbers for your primary research on your IRP. One of the questions you asked was as follows:
The Orrville Public Library would see an increase in number of customers if they extended their evening hours to stay open until 11:00 pm on weekend nights.
You had 57 surveys collected from library administrators and staff, with 39 Agree and 10 disagree. Based on the calculated Chi-square value from above, what is the critical value you will compare that to and is this data now proven to be significant?
The critical value is 3.84, and yes it is significant

Statistics 101
Data Analysis and Statistical Inference
In-class problems on confidence intervals

Answers to conceptual questions on confidence intervals

Decide whether the following statements are true or false. Explain your reasoning.

Problems:

a) For a given standard error, lower confidence levels produce wider confidence intervals.

False. To get higher confidence, we need to make the interval wider. This is evident in the multiplier,
which increases with confidence level.
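To see how the multiplier grows with the confidence level, here is a small sketch using the normal curve. The helper name z_multiplier is illustrative, and the t-curve would give slightly larger values for small samples.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Normal-curve multiplier for a two-sided interval at this confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

# Higher confidence level -> larger multiplier -> wider interval.
for level in (0.90, 0.95, 0.99):
    print(level, round(z_multiplier(level), 3))
```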

b) If you increase sample size, the width of confidence intervals will increase.

False. Increasing the sample size decreases the width of confidence intervals, because it decreases the standard
error.

c) The statement, "the 95% confidence interval for the population mean is (350, 400)", is equivalent to the
statement, "there is a 95% probability that the population mean is between 350 and 400".

False. 95% confidence means that we used a procedure that works 95% of the time to get this interval. That is,
95% of all intervals produced by the procedure will contain their corresponding parameters. For any one
particular interval, the true population mean is either inside the interval or outside the interval. In this
case, it is either between 350 and 400, or it is not. Hence, the probability that the
population mean is between those two exact numbers is either zero or one.

d) To reduce the width of a confidence interval by a factor of two (i.e., in half), you have to quadruple the
sample size.

True, as long as we're talking about a CI for a population percentage. The standard error for a population
percentage has the square root of the sample size in the denominator. Hence, increasing the sample size by a
factor of 4 (i.e., multiplying it by 4) is equivalent to multiplying the standard error by 1/2. Hence, the interval
will be half as wide. This also works approximately for population averages as long as the multiplier from the t-
curve doesn't change much when increasing the sample size (which it won't if the original sample size is large).
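A quick numerical check of the factor-of-two claim for a population percentage. This is a sketch only: the sample percentage 0.4 and the sample sizes 500 and 2000 are made-up illustrative values, and the normal-curve multiplier is assumed.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(p_hat, n, confidence=0.95):
    """Width of the interval p_hat +/- multiplier * SE for a percentage."""
    multiplier = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    return 2 * multiplier * standard_error

width_small = ci_width(0.4, 500)
width_large = ci_width(0.4, 2000)   # four times the sample size
ratio = width_large / width_small   # should be very close to 1/2

print(width_small, width_large, ratio)
```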

e) Assuming the central limit theorem applies, confidence intervals are always valid.

By "valid," we mean that the confidence interval procedure has a 95% chance of producing an interval that
contains the population parameter.

False. The central limit theorem is needed for confidence intervals to be valid. However, it is also necessary
that the data be collected from random samples. Confidence intervals will not remedy poorly collected data.

f) The statement, "the 95% confidence interval for the population mean is (350, 400)" means that 95% of the
population values are between 350 and 400.

False. The confidence interval is a range of plausible values for the population average. It does not provide a
range for 95% of the data values from the population. To find the percentage of values in the population
between 350 and 400, we need to look at a histogram of the data values and determine what percentage of
observations are between 350 and 400.

g) If you take large random samples over and over again from the same population, and make 95% confidence
intervals for the population average, about 95% of the intervals should contain the population average.

True. This is the definition of confidence intervals.
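The repeated-sampling idea can be illustrated with a small simulation. This is a sketch under assumed values: a normal population with mean 375 and SD 50, samples of size 100, and the normal-curve multiplier. None of these numbers come from the handout.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage(true_mean=375.0, true_sd=50.0, n=100, trials=2000,
             confidence=0.95, seed=1):
    """Fraction of simulated intervals that contain the population average."""
    rng = random.Random(seed)
    multiplier = NormalDist().inv_cdf(0.5 + confidence / 2)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        avg = sum(sample) / n
        sd = sqrt(sum((x - avg) ** 2 for x in sample) / (n - 1))
        margin = multiplier * sd / sqrt(n)
        if avg - margin <= true_mean <= avg + margin:
            hits += 1
    return hits / trials

print(coverage())  # should land near 0.95
```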

h) If you take large random samples over and over again from the same population, and make 95% confidence
intervals for the population average, about 95% of the intervals should contain the sample average.

False. The confidence interval is a range for the population average, not for the sample average. In fact, every
confidence interval contains its corresponding sample average, because CIs are of the form: sample avg. +/-
multiplier × SE. So, the sample average is right in the middle of the CI.

i) It is necessary that the distribution of the variable of interest follows a normal curve.

False. It is necessary that the distribution of the sample average follows a normal curve. The data values of the
variable, however, need not follow a normal curve, because if the sample size is large enough the central limit
theorem for the sample average will apply.
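A rough simulation of this point: the data values below come from a strongly skewed (exponential) population, yet the averages of samples of size 50 are much closer to symmetric. The population, the sample size, and the use of skewness as a summary are all assumptions chosen for illustration.

```python
import random
from math import sqrt

def skewness(values):
    """Sample skewness: average cubed z-score (about 0 for a symmetric curve)."""
    n = len(values)
    avg = sum(values) / n
    sd = sqrt(sum((v - avg) ** 2 for v in values) / n)
    return sum(((v - avg) / sd) ** 3 for v in values) / n

rng = random.Random(7)

# Individual data values from a skewed population.
raw_values = [rng.expovariate(1.0) for _ in range(5000)]

# Averages of many samples of size 50 from the same population.
sample_averages = [sum(rng.expovariate(1.0) for _ in range(50)) / 50
                   for _ in range(3000)]

print(skewness(raw_values), skewness(sample_averages))
```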

j) A 95% confidence interval obtained from a random sample of 1000 people has a better chance of containing
the population percentage than a 95% confidence interval obtained from a random sample of 500 people.

False. All 95% confidence intervals have the property that they come from a procedure that has a 95% chance of
yielding an interval that contains the true value. The confidence interval method automatically accounts for
sample size in the standard error. A 95% CI with n=1000 will be narrower than a 95% CI with n=500, but both
CIs will have 95% confidence of containing the population percentage.
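The contrast between width and coverage can be sketched by simulation. The true percentage 0.4, the number of trials, and the seed are illustrative assumptions; the interval used is sample percentage +/- multiplier * SE.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate(n, p=0.4, trials=1000, confidence=0.95, seed=3):
    """Return (coverage, average width) of intervals for a percentage."""
    rng = random.Random(seed)
    multiplier = NormalDist().inv_cdf(0.5 + confidence / 2)
    hits = 0
    total_width = 0.0
    for _ in range(trials):
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        margin = multiplier * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            hits += 1
        total_width += 2 * margin
    return hits / trials, total_width / trials

coverage_500, width_500 = simulate(500)
coverage_1000, width_1000 = simulate(1000)

# Coverage should be near 0.95 for both; only the width shrinks with n.
print(coverage_500, width_500)
print(coverage_1000, width_1000)
```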

k) If you go through life making 99% confidence intervals for all sorts of population means, about 1% of
the time the intervals won't cover their respective population means.

True. Since 99% of the intervals should contain the corresponding population mean, 1% of them will not.
