Lab Manual
Lab Manual
Grade 1: _______
Grade 2: _______
Grade 3: _______
B. Gerstman
Version: F06
Biostatistics Lab Manual
Table of Contents
Page
Table of contents 1
Introduction (Rules and Suggestions) 2
Lab 0: Sign up for computer accounts 3
Lab 1: Measurement and Sampling 4
Lab 2: Frequency Distributions 7
Lab 3: Summary Statistics 10
Lab 4: Probability 15
Lab 5: Introduction to Estimation 24
Lab 6: Introduction to Hypothesis Testing 31
Lab 7: Paired Samples 35
Lab 8: Independent Samples 42
Lab 9: Inference About a Proportion 47
Lab 10: Cross-Tabulated Counts 50
Data listing for simulation experiment (populati.sav) 57
Probability Tables (z, t, chi-square) 73
Back Cover 77
Page 1
Introduction
The biostatistics lab activity is an important part of the SJSU Department of Health Science Biostatistics
course. You must complete all lab work in this manual each week. Most labs can be completed within the
allotted 1.5 hour period. Occasional labs may require completion at home. Make certain you complete lab
work each week.
The lab workbook will be evaluated according to criteria set forth by the instructor of the course.
The premise of the lab is for you to take a simple random sample (SRS) from a population listing
(“sampling frame”) and then, over the course of the semester, analyze the data in your sample. Data for
the population (N = 600) are listed in the appendix of this manual and can also be downloaded from the
course website (data file populati.sav ). Downloading the data file is not required.
This course assumes you know how to manage Windows® computer files. If you do not know how to use
the Windows file manager, please take HPrf101 or a basic Windows computing course before enrolling
in this course.
Homework exercises are separate from the lab and do not go with lab work.
Our lab (MH321) is maintained by the SJSU College of Applied Sciences and Arts (CASA). Once you
are registered for the class, you should sign up for your account as soon as possible. To do this, go to
www.casa.sjsu.edu and click “New! Computer account sign up.” Here’s a screenshot from the CASA
homepage.
Write down your account ID and password. You are responsible for maintaining your computer
account. If you experience difficulties, contact the technical staff via www.casa.sjsu.edu.
Purpose: To select a simple random sample from populati.sav and enter your data into an SPSS file.
1. Random Numbers: You want to select a simple random sample (SRS) of n = 10 from the population
listed in the appendix of this manual. The population consists of 600 individuals, many of whom are
HIV positive. The first step in selecting your sample is to generate 10 random numbers between 1
and 600. To generate 10 random numbers between 1 ane 600.
Notes:
1. This sample was made without replacement. Selection with and without replacement makes no
difference in how you analyze your data except when the sampling fraction exceeds 5%. Since
the current sampling faction = 10 / 600 = 1.7%, you may analyze your data using standard
statistical methods.
2. When the sampling fraction exceeds 5%, you must incorporate “finite population correction
factors” into calculations. (Finite population correction factors are not covered in introductory
courses.)
3. Make certain you select random numbers between 1 and 600. In the past, some students have
sampled only a range of values, creating a bias in the sample.
4. You will be using your data for the rest of the semester. Make certain your sample is sound or
you will have to redo all lab analyses after correcting the sampling error.
2. Data: Identify the 10 people in the population with ID numbers that match your random numbers.
This comprises your unique random sample from the population. List these data here:
3. Data Entry: You are now going to enter your data into an SPSS file. SPSS stores its data in a special
.sav file format. This format is native to SPSS and can be opened only by SPSS.
a. Start SPSS. (The college computers have the SPSS icon on the Start Bar. Your home installation
may have the icon installed elsewhere.)
d. Type the variables name in the column labeled NAME. Name the variables exactly as specified
(ID, AGE, SEX, HIV, KAPOSISA, REPORTDA, OPPORTUN, SBP1, SBP2).
f. In the column labeled MEASURE, identify variables as scale, ordinal, or nominal, as appropriate.
g. You may leave the remaining columns blank or keep the default settings.
h. Click the Data View tab at the bottom of the screen, and then carefully enter your data. When
you are done, your data table should look something like this (with different values, of course):
i. Save your data! Use the naming convention LnameF10.sav (e.g., GerstmanB10.sav ). If you
are hooked-up to the CASA local area network (LAN), save your data to your home (H:) drive.
If you are not connected to the LAN, save your data to a removable device (e.g., memory stick,
floppy drive) and upload the data file to the LAN at the next opportunity.
Please see the syllabus for policies regarding homework assignments. Plagiarism will not be
tolerated.
Purpose: To explore the AGE data in your sample with a stem-and-leaf plot and frequency table.
1. Stem-and-leaf plot: Stem-and-leaf plots are effective ways to explore a distribution of numbers.
a. On the stem below, construct a stem-and-leaf plot of the AGE data in your sample.
|0|
|1|
|2|
|3|
|4|
|5|
|6|
AGE ×10
|0|
|0|
|1|
|1|
|2|
|2|
|3|
|3|
|4|
|4|
|5|
|5|
|6|
AGE ×10
Note: W hen plotting from scratch, you will have to decide on how many stem-values to include on your
plot. A rule of thumb is to use 4 to 12 stem “bins.” Then, use trial-and-error to select the best plot.
b. W hich of the above plots do you prefer? [Circle]: Single stem-values Double stem-values
d. The approximate location of the distribution can be ascertained by “eye-balling” its balancing point. This
locates the approximate mean (arithmetic average). The approximate mean of my data set is _______.
e. The spread of the distribution can be describe in several ways. The easiest way is to identify the minimum
value and maximum value in the data set. My data spread from ______ [minimum] to _______ [maximum].
a. Start SPSS.
b. Open the file that contains your data (LnameF10.sav ).
c. Click Analyze > Descriptive Statistics > Explore . Your screen should look like
this:
e. Click OK.
f. After the program runs, go to the OUTPUT window and Navigate to the Stem-and-leaf
plot (toward the bottom of the Window). Does this plot look the same as the one you drew by
hand? [Circle]: Yes No
Whenever you do a plot or calculation by hand and by computer, always compare the results. If results
differ, figure out why and reconcile the difference.
g. Print the output. The instructor will let you know whether you need to turn in your output.
3. Frequency table by hand: Create a frequency table for your AGE data. When n is small, it helps to
group data into class intervals before tallying frequencies. Group you AGE data into 10-year class
intervals, then then tally the results in this table:
Age range (years) Frequency Count Relative Freq (%) Cumulative Freq (%)
0!9
10!19
20!29
30!39
40!49
50!59
60!69
ALL 10 100% --
5. Optional: Recode data with SPSS. To have SPSS classify data into intervals, click Transform >
Recode > Into Different Variable . You will then be presented with a series of dialogue
boxes. Select AGE as your input variable and AGEGRP2 as your output variable. Follow the screen
prompts to set up codes for each class interval. A lab instructor will help you with the process if you
experience difficulty.
Purpose: To calculate and interpret summary statistics for the AGE data in your sample.
1. Sample mean: We begin by considering the most common measure of central location, the mean.
n=
3xi =
b. The mean is the balancing point of the distribution. It is also a good reflection of several things
you might want to know about the data. Name three of these things:
i. ___________________________________________________________________
ii. ___________________________________________________________________
iii. ___________________________________________________________________
2. Standard deviation: The standard deviation is the most common measure of spread, being based on
the average sum of squared deviations.
a. Calculate the sum of squared deviations for the AGE data in your sample using this helpful table:
b. Variance =
c. Take the square root of the variance. This is the sample standard deviation, s.
Standard deviation =
d. The standard deviation is not easy to interpret. One thing to keep in mind is that large standard
deviations are associated with large spreads and small standard deviations are associated with
small spreads. Another useful fact applies when the distribution is Normal (bell-shaped). When
this is the case 68% of the data will lie within one standard deviation of the mean and _______%
[fill in] of the data will lie within two standard deviations of the mean.
e. Many distributions are not Normal (bell-shaped). Under such circumstances, Chebyshev’s rule
may be applied. This rule says that at least _______% [fill in] of the values will lie within 2
standard deviations of the mean, whatever the shape of the distribution.
3. SPSS
a. Start SPSS
b. Open your data file LnameFname10.sav .
c. Click Analyze > Descriptive Statistics > Descriptives
d. Select the AGE variable
e. Navigate to the output.
f. Print the output. You may be asked to place your output in your lab notebook or hand it in for
HW.
g. Do your hand-calculated statistics match those computed by SPSS?
[Circle] Yes No*
4. Reporting summary statistics. Reporting summary statistics is an important part of our work.
Calculations should carry enough significant digits to maintain accuracy. Reported (final) results
should conform to reasonable (e.g., APA) standards.
a. Our aim is to report means and standard deviations with 1 decimal accuracy beyond the precision
of the data. To do this, carry 3 additional decimals in calculations. For example, if the initial data
is in integers, calculations should carry 3 decimals and the mean should be reported with one
decimal accuracy. If the data have 3 decimal accuracy, calculations should carry 6 decimals and
the mean should be reported with 4 decimals.
b. Units of measure and the sample size should be reported along with summary statistics.
c. Descriptions should be concise, clear, and accurate.
d. Here’s a model for reporting the mean and standard deviation: “The mean age in the sample is
29.0 years with a standard deviation of 15.4 years (n = 10).”
*
If results differ, track down the error and make necessary corrections.
5. 5-point summary and boxplot. Five-point summaries and boxplots offer an alternative way to
summarize distributional features.
a. Before determining the 5-point summary for your data, you must list it in ascending order. List
your AGE data as an ordered array:
FenceUpper = Q3 + 1.5(IQR) =
FenceLower = Q1 ! 1.5(IQR) =
e. Are there any values outside the upper fence? (List, if any):
Are there any values outside the lower fence? (List, if any):
6. SPSS:
[Circle] Yes No
vi. Navigate to the boxplot (computed earlier) and compare it to the one you drew by hand. Do
they match?
[Circle] Yes No
vii. The instructor will tell you know if any output is required for HW.
Lab 4: Probability
1. The Binomial Distributions. If we select 3 individuals at random from populati.sav, how many will be
female? W e know 26.5% of populati.sav is female. Therefore, we expected 3 × 0.265 = 0.795 females per
sample. Obviously, no single sample will have exactly 0.795 females. Some will have no females, some will
have one, some have two, and some will three. The binomial distribution will allow us to attach probabilities to
these potential outcomes.
Consider the variable SEX in your data set. This variable is categorical with two possible outcomes: a given
observation selected at random is either male or female. W hether the observation is male or female is a
Bernoulli random variable. Let the selection of a “female” represent a “success.” The total number of
successes in a sample of size n will be a binomial random variable.
a. Let X represent the number of females in a given sample. In sampling 3 people from
populati.sav , what are the values for binomial parameters n and p?
n = _____ p = _____
b. The binomial formula will determine the probability mass function for the number of females in
a given sample. Before using the binomial formula, we must learn how to use the choose
function. Recall that n Ci = n! / (i!)(n!i)! where n Ci represent the possible number of ways to
choose i items out of n and “!” represents the factorial function (see lecture notes).
i. How many different ways are there to choose 0 items out of 3? The answer is there is
only one way to choose 0 items out of 3 — you must select none of the objects. Using
the “choose function” formula we confirm:
3 C0 = = 1.
iii. 3 C2 =
iv. 3 C3 =
ii. The binomial formula is Pr(X = i) =( nC i)(p i)(q n-i). W e want to determine probabilities of various
outcomes. I’ll do the first calculation for you by calculating the probability of observing 0 females
(i = 0):
(Calculation notes: 3C 0 was calculated on the prior page; 0.265 0 = 1 because anything to the 0
power is 1; I used my calculator to determine 0.735 3 = 0.3971).
Pr(X = 1) =
Pr(X = 2) =
Pr(X = 3) =
vi. Below is the histogram for this probability mass function. Probability histograms show
probabilities as areas under the “curve.” On the histogram below, shade the bar corresponding to
Pr(X = 0).
The bar you just shaded has height 0.3971 and width 1.0 (from 0 to 1). The area of this bar = height × width
= 0.3971 × 1.0 = 0.3971, which is equal to the probability of observing zero females in the sample. Do you
understand the area under the curve concept”? [Circle] Yes No
If you answered “no,” please see the instructor for help. Leave no stone unturned.
d. The cumulative probability of an event is the probability of seeing that number or less. Use the
probabilities you just calculated to determine the following cumulative probabilities:
v. Cumulative probabilities correspond to areas under the curve to the left of a given point.
Shade the region corresponding to the cumulative probability Pr(X # 1) on the histogram
below.
This shaded area is equal to 0.3971 + 0.4295 = 0.8267 = Pr(X # 1). Notice that the cumulative
probability corresponds to the area in the left-tail of the distribution.
e. There are times we might want to determine probabilities of seeing a number greater than or
equal to a given value. This corresponds to areas in right-tails of probability distributions. For
example, we might want to know the probability of seeing 2 or more females in a sample. Shade
the region corresponding to Pr(X $ 2) on the histogram below and then determine Pr(X $ 2) =
Pr(X = 2) + Pr(X = 3) = _______.
a. Introduction to Normal random variables. Problem 1 in this lab looked at the probability of 0,
1, 2, or 3 females in a SRS of n = 3 taken from populati.sav. The binomial distribution was used
to model these outcomes. We need a different approach to model probabilities for continuous
random variables. This is achieved with smooth density curves (more technically called
probability density functions).
Many types of density curves are used in statistics. The most important of these is the family of
distributions known as the Normal distribution family. Members of the Normal distribution are
identified by two parameters: : and F. We introduced the basics of Normal distributions in
lecture and will not review all specifics here. Instead, recall that X~N(:, F) denotes a particular
Normal random variable with mean : and standard deviation F.
You may view Normal curves as idealized probability histograms. For example, a histogram of
the variable SBP1 (systolic blood pressure, first reading) in from populati.sav is shown
below.
A Normal curve with : = 120 and s = 20 is superimposed over the histogram. Although the fit of
the curve is imperfect, it will still suit our purpose in demonstrating how to use the curve to
model Normal probabilities.
Values for SBP1 vary approximately according to a Normal distribution with : = 120 and F = 20.
Let X represent value for SBP1 : X~N(120, 20). Let us depict this with a curve:
ii. Tick marks appear on the horizontal axis at standard deviation (20-unit) intervals. For
example, the tick mark to the left of center= 120 ! 20 = 100. Label all the tick marks on the
curve.
iii. After you have labeled the axis, shade the area under the curve corresponding to values less
than or equal to 80, i.e., Pr(X # 80).
ii. Z density curves are centered on 0 and have inflection points at !1 and +1. Label the tick
marks on the horizontal axis on the curve below.
After labeling the horizontal axis, shade the area under the curve corresponding to Pr(Z
# 0). This encompasses half the curve. Therefore, Pr(Z # 0) = __________ [fill in].
iii. Label the X axis of the Standard Normal curve below and then shade the area corresponding
to Pr(Z # 1).
iv. On this next density curve, shade the area under the curve corresponding to Pr(Z # 2) Then,
use your Z table to determine the Pr(Z # 2) = ________ [fill in].
v. On the density curve below, shade the area corresponding to Pr(Z #!2). Then, use your z
table for negative values to determine Pr(Z # !2) = _________ [fill in].
vi. On the density curve below, shade the area under the curve corresponding to Pr(Z $ 2). This
corresponds to the upper tail of the distribution. Since the area under the entire curve sums
to 1, the probability in the right tail is equal to 1 minus the area under the curve to the left of
2. Therefore, Pr(Z $ 2) = 1 ! Pr(Z # 2) = 1 ! __________ = __________ [fill in ×2].
vii. To compute probability in an interval, subtract from 1 the combined areas in the left tail and
in the right tail. For example, to determine Pr(!2 # Z # 2), subtract the left tail associated
with Pr(Z # !2) [answer v.] and the right tail associated with Pr(Z $ 2) [answer vi.]. On the
density curve below, shade the area under the curve corresponding to Pr(!2 # Z # 2).
The 68–95–99.7 rule says that 95% of the area will be between !2 and 2. Notice that the above
curve shows a shaded area of 0.9544, which rounds to 95%.
i. Let zp denote a z score with a cumulative probability of p. For example, z.5 = 0, since a z
score of 0 has a cumulative probability of .50 (i.e., 50%). Schematically, z.5 looks like:
ii. Use your z table to determine z.8413 = _____ [fill in]. Then, mark the location of z.8413 on the
density curve below and shade the cumulative probability region corresponding to a
cumulative probability of .8413.
iii. Use your Z table to determine z.9772 = _______ [fill in]. Then, mark z.9772 on the density curve
below and shade the cumulative probability associated with this z score.
iv. z.0228 = _______ [fill in]. Mark z.0228 on the curve, and shade the appropriate area
corresponding to this cumulative probability.
d. Modeling the SBP1 variable with the Normal distribution. Now that you understand the
Standard Normal distribution, go back to the SBP1 variable introduced in part a of this problem.
This variable is approximately distributed Normally with a mean of 120 and standard deviation
of 20, i.e., X~N(120, 20). Although this variable is not a Z variable, you can turn it values into a
Z variable by subtracting the distribution’s mean and dividing by it’s standard deviation. The
formula is .
z=
You’ll notice that the z score of this value is 0. This means it is 0 standard deviations away
from the mean. The cumulative probability of this value is Pr(X # 120) = Pr(Z # 0) = .5000.
z=
For this particular variable, Pr(X # 80) = Pr(Z # ______) = _______ [fill in ×2]
iii. You drew the area under the curve for this problem on page 19. Review this drawing and
make certain you understand the nature of the probability you just calculated. Check this box
after you’ve reviewed the drawing: 9
3. HW: Go to the StatPrimer website and print the exercises for Chapter 4. Complete exercises
assigned in class.
Purposes: To learn about the sampling distribution of means and confidence intervals for :.
1. Population & Sample: The following figure depicts the distribution of AGE in populati.sav .
This population consists of N = 600. It has a mean : of 29.5 and standard deviation s of 13.59.
Note: Although this distribution has a central mound, it is neither symmetrical nor bell-shaped.
Therefore, it is not Normal.
b. You calculated the mean age in your sample (LnameF10.sav ) in Lab 3. Report this value
here: ______ [fill in]
Note: The mean AGE in your sample, , is an estimate of population mean :. Although they are
related, they are not the same.
2. Sampling experiment (simulation): During this part of the lab, you are going to pretend you do not
know the population mean : of AGE . Your goal will be to infer the value of : based on data in your
sample. To accomplish this, you must understand the sampling “behavior” of a mean. The sampling
behavior of a mean demonstrated by its sampling distribution. We call this the sampling
distribution of a mean (SDM). To help you understand the SDM, we conduct a simulation
experiment.
a. Your lab instructor will collect the value of the mean AGE in your sample. Confirm you have
submitted your name and your sample mean to the lab instructor by checking this box:
b. The lab instructor will add each student’s sample mean to a file called SampleMeans.sav under
the variable MEANAGE . The data file SampleMeans.sav will then be distributed to everyone in
the class. Confirm you have a copy of SampleMeans.sav in your possession by checking this
box:
d. Create a histogram of the variable MEANAGE by clicking Graphs > Histogram . Look at the
output.
e. The histogram you just created is simulated SDM. Heed the fact that this is a distribution of
sample means, not a distribution of individual ages. The mean and standard deviation of this
sampling distribution on the histogram output. Write these value here:
f. Is the distribution of MEANAGE more Normal or less Normal than the population distribution of
AGE ? [A histogram of AGE is shown on p. 24 of this lab workbook.]
Circle best response: More bell-shaped Less bell-shaped About the same
The results of this experiment demonstrate the key elements of the SDM. These are:
(1) The Central Limit Theorem–the SDM is more Normal than the population distribution.
(2) The Law of Averages–the average of the SDM is about equal to population mean :.
(3) The Law of Large Numbers–the standard deviation (variability) of the SDM is less than the
standard deviation of individual values. (The standard deviation of the SDM is called the
standard error of the mean and is equal to F / /n, which in this case is equal to 13.59 / /10 =
4.30. Notice that the standard deviation of the simulated SDM should be close to this value.
g. We have been studying three separate distributions in this lab. These are:
Each of these distributions has a mean and standard deviation. It is helpful to use different
symbols to represent the different means and standard deviations in this simulation experiment.
These are:
Number of observations N n û
mean :
The standard deviation of SampleMeans.sav is a simulated SEM. This statistic reflects the
precision of as an estimate of :.
a. We know the standard deviation of AGE in the population is s = 13.59. What is the standard
error of the mean in your sample? (SEM = s / /n).
c. Did your confidence interval capture the value of the population mean? (Recall that in this
simulation experiment : is 29.5). [Circle] Yes No
e. In the long run, what percentage of 95% confidence intervals based on repeated samples will fail
to capture :?
f. In the long run, what percentage of (1!")100% confidence intervals will fail to capture :?
4. Student’s t. Student’s t distribution is used when making inferences about a population mean : when
data are a SRS and population standard deviation F is not known. In such instances, we estimate F
using sample standard deviation s as calculated in the sample. The t distribution compensates for the
additional uncertainty added by this estimate. (Additional background about the t distribution was
provided in lecture.)
a. Start your browser. Go the StatPrimer website and scroll to the bottom of the screen. Click on
the link to the t table and print this table (File > Print ). Tape the printout into your
Procedure Notebook, near your z tables.
b. Let tdf,p represent a t score with df degrees of freedom and cumulative probability of p. Using the t
table you just printed, determine the values for the t scores requested below. Mark the t score on
the X axis of the figure and shade the cumulative probability regions in each instance. Then,
determine the right-tail regions.
5. Confidence interval for :AGE , s estimated: We usually do not know the value of F in practice. In
such instances, we use sample standard deviation s as part of the procedure to calculate the 95%
confidence intervals for :.
a. What was the standard deviation of the AGE variable in your sample? (You calculated this in Lab
3).
s = _______
c. Calculate the 95% confidence interval for : using the formula ± (tn-1,.975 )(sem).
i. Start SPSS.
ii. Open LnameF10.sav .
iii. Click Analyze > Descriptive Statistics > Explore .
iv. Move the AGE variable into the Dependent list.
v. Click “OK.”
vi. Navigate to the Output window.
vii. The confidence interval is reported in the region labeled “Descriptive.”
viii.Print the output.
ix. Review the output.
x. Await further instructions on how to hand in the output, if at all.
6. Sample size requirement. Margin of error d is equal to half the confidence interval length. Let LCL
represent your lower confidence limit, and let UCL represent your upper confidence limit. Margin of
error d = ½(UCL ! LCL). Another formula for d is d = (tn-1,1-"/2)(sem).
b. We can increase the precision of our confidence interval estimate by collecting a larger sample.
We will use the formula n . (4)(F2) / d2 to determine the sample size needed to derive an
estimate for : with margin of error of d. How large a sample would we need to calculate a 95%
confidence interval for : with a margin or error no greater than 5? (Recall that s = 13.59).
c. How large a sample would we need to calculate a 95% confidence interval for : with a margin of
error no greater than 2?
Purpose: To learn about significance testing and to conduct one-sample tests for means.
1. Sampling distribution of a mean (SDM): Last week we simulated the SDM for samples of n = 10
for variable AGE. We learned that the SDM was approximately Normal with a mean of 29.5 and
standard error of 4.3, i.e., Based on this information, we can predict that 95%
of the s in the SDM will fall in the interval 29.5 ± (1.96)(4.3) = 29.5 ± 8.4 = (21, 38). This is what
this SDM model looks like:
In our simulation experiment, sample means were saved in SampleMeans.sav as the variable
MEANAGE. Over a three year period, I compiled data from this simulation. Here is a histogram showing
the distribution of the first 100 sample means:
a. Based on the above histogram, there were two (2) sample means greater than or equal to 38. This
represents __________% of the observations.
b. Sampling theory predicted that _________% of the sample means would be greater than or equal
to 38.
c. Did sampling theory do a good job in predicting the percentage of sample means above 38?
[Circle] Yes No
d. Explain you answer to part c.
2. One-sample z-test: This exercise introduces statistical hypothesis testing while revealing some of its
limitations. We start with a one-sample z test. The one-sample z test compares a single mean to an
expected value. The expected value is provided by the researcher (not the data).
Assume you do not know the population mean : of AGE . Assume you do know F = 13.59. Test
whether the data provide evidence that the population is not 32.
b. Calculate the test statistic using the data in your sample (LnameF10.sav ). The formula is
c. Place your zstat on the Normal density curve below. Then shade the areas under the curve
corresponding to the one-sided P-value.
P=
If P # ", the evidence is considered significant at the " type I error threshold.
3. One sample t test. You are going to retest the data, but will now pretend that s is not known. You
will base your standard error on s instead of F, and will use a tstat instead of a zstat. You will also use a
two-sided alternative.
b. You calculated the sample standard deviation s for AGE in you sample in Lab 3. Use this sample
standard deviation to calculate the standard error of the mean. (Formula: sem = s / /n).
d. Take the absolute value of your tstat. Place it on the curve below and shade the tail region
corresponding to the one-sided P-value. In addition, shade the mirror image of your |tstat| in the lef-
tail.
e. Use your t table to find the landmarks on t9 that is just to the right of your |tstat|. What is the area
under the curve to the right of this landmark? _______
f. Now find the landmark on t9 that is just to left of your |tstat|. What is the area under the curve to the
right of this landmark? _______
g. The area in the tail is less than your answer to f and more than your answer to e. Therefore, the
one-sided P-value is _____ < P < _____
h. Double the answers to g to get the two-sided P-value. _____ < P < _____
i. Interpret your results. Is there significant evidence against H0 ? (Brief narrative response.)
a. Start SPSS
b. Open LnameF10.sav .
c. Click Analyze > Compare Means > One Sample T Test .
d. A dialogue box will open. Place AGE in the Test Variable field and enter “32” in the field
labeled Test Value field (since you are testing H0 : : = 32). Your screen should look something
like this:
e. Navigate to the output window and review the test statistics calculated by SPSS. Your instructor
will let you know whether you must print the output.
f. Do these results match the ones you calculated by hand? [Circle] Yes No
1. Your data set includes the variables SBP1 and SBP2 . SBP1 is an initial systolic blood pressure
measurement (mm Hg). SBP2 is a follow-up measurement. Why are these paired and not independent
samples?
2. Initial Description.
a. Start SPSS.
b. Open your data set (LnameF10.sav ).
c. Click Analyze > Descriptive Statistics > Descriptives
d. Select the variables SBP1 and SBP2.
e. Report your results here:
3. Difference variable DELTA . It is important to maintain this pairings throughout this analysis. This is
accomplished by creating a new variable to hold differences within pairs. We call this variable DELTA .
a. List your data for SBP2 and SBP1 below. By hand, calculate DELTA values (let DELTA = SBP2 -
SBP1 ).
b. Construct a stem-and-leaf plot of DELTA. Below is a stem to hold your plot. The stem has split
(double) stem values to help draw out the distribution. It can store values between !14 and 14. If
you have values larger or small than this range, extend the stem up or down, as needed.
|-1|
|-0|
|-0|
| 0|
| 0|
| 1|
DELTA ×10
4. SPSS.
d. Click OK
e. Return to your Data View window. Make sure SPSS has calculated the variable DELTA as
expected.
f. Click Analyze > Descriptive Statistics > Explore and place DELTA in the Dependent
List. Click OK .
g. Go to the Output Window and review the stem-and-leaf plot produced by SPSS. How does your
stem-and-leaf plot compare to the one created by SPSS? Did SPSS use double-stem values?
[Circle] Yes No
h. Make note of the mean and standard deviation of DELTA . List these below.
= ______
sd = ______
nd = ______ (Note: Your sample has 20 measurements but only 10 paired observations.)
a. By hand, calculate the mean and standard deviation of DELTA . Here’s a table that you should fill
in to help with calculations.
nd =
3xd =
6. Confidence interval for : d . We will calculate a confidence interval for :d to see how
well estimates :d .
a. Calculate a 95% confidence interval for population mean difference :d . The formula is
± (tn!1,1!"/2 )(semd ) where semd = sd / /nd . Show all work.
c. The mean difference of SBP2 and SBP1 in the population :d = !0.18. Did your confidence
interval capture :d ? [Circle] Yes No
d. What percentage of 95% confidence intervals for :d will fail to capture !0.18?
7. Paired t Test. Use a two-sided t test to determine whether the observed mean difference is significant.
a. List the null hypothesis and alternative hypotheses for the test here:
b. Calculate the tstat and its df. (Formulas: tstat = (xbardelta ! µ0 ) / sedelta where µ0 = 0 and sedelta = sd / /n.
This statistic has df = n ! 1).
c. Place your tstat on the density curve below and shade the areas under the curve corresponding to
the P-value. (This is two-tailed test because the alternative hypothesis is two-sided.)
d. Use the method established in the prior lessons to determine the two-sided P-value.
f. Check your calculations with SPSS. Your data set should already be opened. Click Analyze >
Compare Means > One Sample T Test . Place DELTA in the field labeled Test Variable
and enter “0” in the field labeled Test Value (you are testing H0 : :d = 0). Navigate to the output
window and review the SPSS output. Print the results to your t test.
8. Power: You make a type II error whenever you retain a false null hypothesis. The probability of
avoiding a type II error, power, can be calculated according to the formula
random variable, ) represents a mean difference worth detecting, F represents the standard deviation,
and n represents the sample size.
a. Assume the standard deviation of DELTA is 5. Calculate the power of the test to detect a mean
difference of 5 mm Hg based on n = 10.
b. Now calculate the power of the test to detect a mean difference of 2 mm Hg.
c. What effect did lowering the “difference worth detecting” have on the power of the test?
d. Suppose you redo your study with 50 observations, you want to detect a mean difference of 2, and
the standard deviation is still about 5. What is the power of the test under these conditions?
e. What effect did increasing the sample size have on the power of the study?
1. Two groups. In this lab, we will compare AGE distributions in two independent groups.
a. Group 1 will comprise the AGE data in the sample you selected at the beginning of the semester
(LnameF10.sav ). You calculated summary statistics for these data in Lab 3. Go back to Lab 3
and retrieve the summary statistics for the AGE variable. Use at least two decimal places when
listing your mean and standard deviation, as you will be using these statistics to calculate
additional results.
n1 = _______
= ______
s1 = ______
b. A researcher studying a different population wants to compare ages in the two groups. Let’s call
this group 2. The researcher calculates the follow summary statistics:
n2 = 15
= 42.47
s2 = 12.48
These samples are independent because data points in sample 1 are not related to data points in
sample 2.
2. Confidence Interval.
b. Use the sample standard deviations to calculate the pooled estimate of variance. The formula is
d. Calculate the 95% confidence interval for :1 - :2 . The formula is ± (tdf, .975 )(sexbar1-
xbar2 ).
3. Hypothesis test. Conduct a two-sided t test to address whether the means differ significantly.
a. Write down the null and alternative hypotheses. Use proper statistical notation.
b. Conduct a flexible significance test. (A prior alpha is unnecessary.) Calculate the tstat and its
degrees of freedom. The . The standard error and df were calculated on the prior
page.
c. Place your tstat on this curve below. Shade the areas under the curve corresponding to the P-value.
The test is two sided, so shade both tails. Using your t table, wedge the |tstat| between two t critical
values (landmarks) on the curve. Show the location of the critical value landmarks on the X axis.
Then determine the right-tail areas associated with these critical values.
d. Determine the one-sided P-value [Fill in]: _____ < P < _____
e. Determine the two-sided P-value [Fill in]: _____ < P < _____
4. SPSS analysis.
i. Start SPSS.
ii. Open your data file (LnameF10.sav ).
iii. In the column for the AGE variable, enter the following additional values: {21, 26, 31,
34, 37, 38, 40, 40, 43, 44, 47, 52, 60, 62, 62 }.
iv. Go to the Variable View . Create a new variable called GROUP .
v. Return to the Data View . For the first 10 observations, assign a value of 1 to GROUP . For the
next 15 observations, assign a value of 2 to GROUP .
vi. Your screen should now look something like this:
vii. Click Analyze > Compare Means > Independent Samples T Test . The
“Independent-Samples T test” dialogue box will appear. Place AGE in the “Test Variable”
field. Place GROUP in the “Grouping Variable” field.
viii.Click the “Define Groups” button. The Define Groups dialogue box will appear. In the Group
1 field, type “1.” In the Group 2 field, type “2.” Click the Continue button.
ix. You will be taken to back to the “Independent Samples T test” dialogue box. Click OK .
x. After the program runs, go to the output window and navigate to the region labeled
“Independent Samples Test.” The t statistics we use appear in the row labeled “Equal
Variances Assumed.” Check these results and make sure your hand calculations were correct.
If not, reconcile the difference.
xi. Await further instruction on what to do with this output.
5. Assumptions. All inferential methods requires assumptions. The confidence interval and t test used in
this chapter are no exception.
a. List the validity assumptions required for statistical inference using the t procedures.
i.
ii.
iii.
i.
ii.
iii.
1. Sample size requirements: It’s always best to determine the sample size requirements of a study
before collecting data. Suppose you wanted to determine the sample size needed to estimate the
prevalence of HIV in populati.sav so that the margin of error is no greater than 0.10. To do this
you’d need to make certain assumptions. In particular, you would need to assume (“guestimate”) the
value of population prevalence p before determining the sample size requirement. When you have no
idea about the true value of population value p, you use .5 as a starting point to ensure adequate
sample size. Use this assumption and the formula to determine the sample size
n=
2. New data set. After completing part 1, you should realize that the current paltry sample size of n = 10
is inadequate to estimate prevalence p with any precision. To save you the trouble of selecting a new,
larger sample
a. Download the file GerstmanSampleBig.sav from the course website. Save this file to your
home directory or to a floppy disk. (Right-click, “Save as.”)
b. After you have downloaded the new data set, start SPSS and open your copy of the file.
c. Browse the data file. Note that this data set has n = 96 observations. In addition, HIV has been
re-coded so that 1 = positive and 2 = negative.
3. Determine the prevalence of HIV in the sample by clicking Analyze > Descriptive
Statistics > Frequencies .
4. Confidence interval (plus-four method). Sample prevalence is the point estimate for population
prevalence p. Let’s use the “plus four” method to calculate a confidence interval for p.
a. Calculations:
= x + 2 = _____
= n + 4 = _____
= _____
= _____
c. The true prevalence p of HIV in populati.sav is 0.768. (In practice, you would not know this
true value, but this is a simulation experiment.) Did your confidence interval capture the true
prevalence? [Circle] Yes No
d. What percentage of calculated 95% confidence intervals based on independent samples from the
same population will fail to capture parameter p?
5. One-sample z test of a proportion. An investigator wants to test whether the prevalence of HIV in
the population is greater than 50%. Because the investigator is cautious, she uses a two-sided test. Use
data in GerstmanSampleBig.sav to conduct this test.
b. The standard error of the proportion is based on the assumed value of p under the null
d. Place your zstat on the curve below. Make certain you place the test statistic in its proper location,
way out in the right tail of the density curve. If you do this properly, you will see that the P-value
is very small.
Purposes: To cross-tabulate binary data from independent groups and compare independent proportions.
1. Cross-tabulation: The relation between two binary variables is analyzed by cross-tabulating data to
form a 2-by-2 table. Tallying is done by the computer. Currently, we want to assess the relation
between SEX and HIV
a. Start SPSS.
b. Open your copy of GerstmanSampleBig.sav . (You downloaded this dataset for last week.)
c. Click Analyze > Descriptive Statistics > CrossTabs .
d. Select SEX as the row variable and HIV as the column variable. Your screen should look
something like this:
e. Click OK .
f. After SPSS completes its processing, go to the output window and view the cross-tabulation of
counts. Show the cross-tabulation here:
Male
Female
Total
2. Prevalence. The data you are working with represent a sample from populati.sav . We want to
estimate prevalences of HIV in males and females.
c. Prevalences can be compared in the form of a prevalence difference. This will let you know how
much higher or lower is the prevalence in one gender or the other. Calculate prevalence
difference in the data.
3. Chi-squared distributions. Your data showed a prevalence difference of .6765 - .8333 = !.1568,
indicating a lower prevalence in men by about 16%. At this point in the course you should be aware
that a observed difference in the sample may or may not hold up in the population. Therefore, you
want to test this difference for significance. Before conducting the test, you need to learn about the P2
(chi-squared) probability density function.
g. Chi-squared distributions are similar to other probability density functions in that they are used to
determine probabilities according to the “areas under the curve” method. Chi-square tables
provide landmarks (“critical values”) for chi-square random variables according to various
degrees of freedom, not unlike t tables.
h. Let P²df,p represent the critical value on a chi-squared distribution with df degrees of freedom that
has a cumulative probability (“left tail”) of p. Use your chi-square table to determine the
following critical values. In each instance, mark the critical value on the curve in its proper
location and shade the region in the right tail of the distribution.
4. Chi-square test. You want to test the data to see if there is a significant difference in the prevalences
of HIV in males and females.
a. The null and alternative hypotheses can be stated in several ways. One option is “H0 : no
association between row and column variables in the population” and H1 : “association.” When
testing binary data, you may also state the hypothesis in terms of population proportions p1 and p2 .
Use statistical notation to express the equivalence of population proportions.
Male 68
Female 24
Total 66 26 92
g. Place the P²stat on the X axis of the curve below. Shade the P-value region in the right tail of the
density curve. Then, place critical value landmark on the X-axis of the curve.
5. SPSS. After completing the chi-square test by hand, check your work with SPSS.
a. Your copy of GerstmanSampleBig.sav should already be open. If not, start SPSS and open
the file.
b. Click Analyze > Descriptive Statistics > CrossTabs .
c. Click the Statistics button and check the chi-square box.
d. Click OK
e. Go to the output window.
f. Print relevant parts of the output.
g. The instructor will let you know if you need to hand in the output.
h. Did your hand calculated statistics match SPSS’s? [Circle] Yes No