ACMT 311 Assignment
• Each question requires R code unless it is italicized, in which case you need to answer
the question in a complete sentence.
• I need to see your script and not the result. So do not cut and paste results. I will run each
script to see the result.
• Be sure your code works. Do NOT submit code that does not run.
Data Background
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey
of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify
risk factors in the adult population and report emerging health trends. For example, respondents
are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco
use, and even their level of healthcare coverage. The BRFSS Web site
(http://www.cdc.gov/brfss) contains a complete description of the survey, including the research
questions that motivate the study and other interesting results.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted
in 2000. While there are over 200 variables in this data set, we will work with a small subset.
Begin by loading the dataset into your R workspace by using this command:
source("http://www.openintro.org/stat/data/cdc.R"). Make sure the data, cdc, is showing in the
environment panel.
Questions:
Q1. How many cases and variables does the data set, cdc, have?
Q2. What are the variables of the data set, cdc? (Your code should return: genhlth, exerany,
hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these
variables corresponds to a question that was asked in the survey. For example, for
genhlth, respondents were asked to evaluate their general health, responding either
excellent, very good, good, fair or poor. The exerany variable indicates whether the
respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates
whether the respondent had some form of health coverage (1) or did not (0). The
smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in
their lifetime. The other variables record the respondent’s height in inches,
weight in pounds as well as their desired weight, wtdesire, age in years, and gender.)
Q3. Generate the first 10 lines of the dataset.
Q4. For each variable, identify whether it is numerical or categorical. Be sure to list the
variables. Use the same variable name as given in the dataset.
Q5. Create a numerical summary of the variable, weight, that shows the minimum, Q1,
median, mean, Q3, and maximum values.
Q6. Calculate the inter-quartile range for the variable, weight.
Q7. Make a histogram of the weight distribution. Include a title and label for both axes so that
the graphic is understandable to anyone.
Q8. From the histogram, what kind of shape is the weight distribution? That is, state whether
the histogram is right-skewed, left-skewed, symmetric and whether it is unimodal,
bimodal, or multi-modal?
Q9. Make a horizontal boxplot of the weight distribution. Include a title and label for the axes
so that the graphic is understandable to anyone.
Q10. Describe one feature you see in the histogram that is difficult to see in the boxplot.
Q11. Describe one feature you see in the boxplot that is difficult to see in the histogram.
Q12. Draw the boxplot for the variable, wtdesire (desired weight). Include a title and label for
the axes so that the graphic is understandable to anyone.
Q13. Why do you think there are 2 extreme values showing in the boxplot for the variable,
wtdesire?
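As a rough illustration of the kinds of functions these questions call for, here is a minimal sketch on a made-up data frame. The data frame `toy` and its values are hypothetical and are NOT the cdc data, so the numbers and plots say nothing about the real survey.

```r
# Hypothetical toy data, invented for illustration only (not the cdc data set).
toy <- data.frame(
  weight = c(120, 135, 150, 150, 165, 180, 190, 210, 250, 320),
  gender = factor(c("f", "f", "m", "f", "m", "m", "f", "m", "m", "m"))
)

dim(toy)                  # number of cases (rows) and variables (columns)
names(toy)                # variable names
head(toy, 3)              # first 3 rows
sapply(toy, is.numeric)   # TRUE for numerical columns, FALSE for categorical ones

# Five-number summary plus the mean, and the inter-quartile range (Q3 - Q1).
weight_summary <- summary(toy$weight)
weight_iqr <- IQR(toy$weight)
weight_summary
weight_iqr

# Histogram with a title and both axis labels.
hist(toy$weight,
     main = "Toy weights",
     xlab = "Weight (pounds)",
     ylab = "Number of people")

# Horizontal boxplot of the same variable.
boxplot(toy$weight,
        horizontal = TRUE,
        main = "Toy weights",
        xlab = "Weight (pounds)")
```

The same calls should work on any data frame column; only the data frame and variable names change.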
R Assignment #2
Using the data from assignment #1
Questions
Q1. Change all the entries in the variable, smoke100, from 0 to No and 1 to Yes. See Sec.
16.4 on how to do this.
Q2. Generate the last 8 lines of the dataset to check that all the changes took place.
Q3. Calculate how many “Yes” and how many “No” responses there were.
Q4. Calculate the relative frequency of the response distribution. To do this, take your code in
Q3 and divide it by the number of observations. Notice that R automatically divides all
entries.
Q5. Make a barplot of the responses by putting the table( ) command inside the barplot
function. Include a title and label for both axes so that the graphic is understandable to
the general audience.
Q6. Now split the smokers by gender. Do a cross-tab count by gender and smoker. (In your
code, put gender first, then smoker.)
Q7. Calculate the relative frequency of the response distribution.
Q8. Graph a side-by-side barplot with smoker on the horizontal axis and gender on the
vertical axis. Include a title, legend, and label for both axes so that the graphic is
understandable to the general audience. (Note: the legend might cover part of your
graphic. Do not worry about it. If you are bothered by the coverage, a fast way is to
extend the vertical axis or you can research how to reposition the legend.)
Q9. By separating gender and smokers, what feature do you see in the relative frequency
calculation or barplot that you do not see when gender was not separated? Give at least
one feature.
Q10. Graph a mosaic plot using the function mosaicplot( ) by putting table( ) inside the
mosaicplot( ) function. Include a title to your plot. Be sure labels are clear to the general
audience. (Although not necessary, you may try adding colors to make the plot more
visually pleasing.)
Q11. Name one feature that you see in the side-by-side barplot that is easier to see than in the
mosaic plot.
Q12. Name one feature that you see in the mosaic plot that is easier to see than in the side-by-
side plot.
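The recoding, counting, and plotting steps in this assignment can be sketched on a small made-up data frame. Everything here is an assumption for illustration: `toy` is hypothetical, and `factor()` with `labels` is only one of several ways to recode 0/1 values, not necessarily the method in Sec. 16.4.

```r
# Hypothetical toy data, invented for illustration only (not the cdc data set).
toy <- data.frame(
  smoke100 = c(1, 0, 0, 1, 1, 0, 1, 0),
  gender   = c("m", "f", "f", "m", "f", "m", "m", "f")
)

# Recode 0 -> "No" and 1 -> "Yes" (one possible approach).
toy$smoke100 <- factor(toy$smoke100, levels = c(0, 1), labels = c("No", "Yes"))
tail(toy, 3)   # last 3 rows, to check the recoding

# Counts and relative frequencies; dividing a table divides every entry.
counts <- table(toy$smoke100)
rel_freq <- counts / nrow(toy)

barplot(counts,
        main = "Smoked at least 100 cigarettes (toy data)",
        xlab = "Response",
        ylab = "Number of people")

# Cross-tab with gender first (rows) and smoker second (columns).
by_gender <- table(gender = toy$gender, smoke100 = toy$smoke100)
by_gender / nrow(toy)

# Side-by-side bars: one group per smoker response, one bar per gender.
# ylim is extended a little so the legend has room.
barplot(by_gender,
        beside = TRUE,
        legend.text = rownames(by_gender),
        ylim = c(0, 5),
        main = "Smoking by gender (toy data)",
        xlab = "Smoked at least 100 cigarettes",
        ylab = "Number of people")

# Mosaic plot of the same cross-tab.
mosaicplot(by_gender,
           main = "Smoking by gender (toy data)",
           xlab = "Gender",
           ylab = "Smoked at least 100 cigarettes",
           color = TRUE)
```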
Assignment #3
Background
For this assignment, we are going to see how the Central Limit Theorem works.
We will consider the real estate data from the city of Ames, Iowa. The details of every real
estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus
will be all residential home sales in Ames between 2006 and 2010. This collection
represents our population of interest.
Download the real estate data from Ames, Iowa by entering the following code:
• download.file("http://www.openintro.org/stat/data/ames.RData", destfile =
"ames.RData")
• load("ames.RData")
(There are lots of quotation marks here. If the code does not work, you may have to play around
with the quotation marks.)
This data set has 82 variables. We are only going to be focusing on one variable, the Sale Price
of homes in Ames, Iowa.
Questions:
Q1. Rename the variable SalePrice as price.
Q2. Draw a histogram of price. Label the axes and give a title to your histogram so that it is
understandable to a general audience.
Q3. From the histogram, what kind of distribution is price?
Q4. Set the seed first. Then take any sample size you want from price by using the code:
sample(price, size), although I suggest using a small sample size for easy viewing. Call
that sample_size_size. For example, if your chosen sample size is 25, then your vector
will be called sample_size_25. (This part is optional: You may want to call out
sample_size_size to view what samples R generates. Try the code sample( ) several times
using different seeds or without seeding to see the samples vary.)
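To see what seeding does, here is a minimal sketch on a short made-up vector. The vector `price_toy` and the seed value 311 are hypothetical choices for illustration, not the Ames prices.

```r
# Hypothetical toy prices, invented for illustration only (not the Ames data).
price_toy <- c(105000, 129900, 140000, 160000, 172000,
               189900, 215000, 244000, 310000, 395000)

# Setting the same seed before each draw reproduces the same sample.
set.seed(311)
sample_size_4 <- sample(price_toy, 4)

set.seed(311)
same_again <- sample(price_toy, 4)

sample_size_4
identical(sample_size_4, same_again)

# Without resetting the seed, the next draw will generally differ.
sample(price_toy, 4)
```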
We will now take different sample sizes of price and calculate the sample means. For
uniformity, call the variable, sample_means_size. You can follow the directions below or
refer to Chapter 21 (Samples and Distributions) of the R Guide. Any word in red and
italicized means you enter your own value. Any word in green is the code function name.
Q5. Use sample size = 5. Begin each code as follows: # 5a, # 5b, # 5c…
a. Do each of the following in order.
• set.seed(any integer)
• sample_means_size <- rep(NA, number of repetitions) # I suggest 1000 or more
# repetitions. Play around with the number of repetitions.
• for (i in 1:number of repetitions) {
sample_means_size[i] <- mean(sample(price, size))
}
# This loop takes the mean of each sample and stores it in the ith entry of the
# vector, sample_means_size, each time.
b. Do a histogram of sample_means_size. Be sure to label the axes. Add the title:
Sample Size of size.
c. What kind of distribution is sample_means_size?
d. Calculate the mean of sample_means_size.
e. Calculate the standard deviation of sample_means_size.
f. Calculate the standard deviation of price divided by the square root of the sample
size.
Q6. Repeat all of a – f in #5 using size = 10. Start your code with # 6a, # 6b, # 6c …
Q7. Repeat all of a – f in #5 using size = 30. Start your code with # 7a, # 7b, # 7c …
Q8. Repeat all of a – f in #5 using size = 50. Start your code with # 8a, # 8b, # 8c …
Q9. Calculate the mean of price.
Q10. As the sample size increases from 5 to 50, how does the mean of sample_means_size
compare with the mean of price?
Q11. As the sample size increases from 5 to 50, are your answers to (e) and (f) in #5 – 8 getting
closer to each other?
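The whole simulation can be sketched end to end on a synthetic population. This is only an illustration under stated assumptions: `population` is a made-up right-skewed vector generated with `rlnorm()`, not the Ames prices, and the seed, sample size, and number of repetitions are arbitrary choices.

```r
# Hypothetical right-skewed population, invented for illustration only.
set.seed(2024)
population <- rlnorm(3000, meanlog = 12, sdlog = 0.4)

size <- 30     # sample size (arbitrary choice)
reps <- 2000   # number of repetitions (arbitrary choice)

# Draw many samples and store the mean of each one.
sample_means <- rep(NA, reps)
for (i in 1:reps) {
  sample_means[i] <- mean(sample(population, size))
}

hist(sample_means,
     main = paste("Sample Size of", size),
     xlab = "Sample mean",
     ylab = "Frequency")

# The mean of the sample means should sit close to the population mean.
mean(sample_means)
mean(population)

# The spread of the sample means should be close to sd / sqrt(n).
sd(sample_means)
sd(population) / sqrt(size)
```

Changing `size` and rerunning the block shows how the histogram and the two standard deviation values respond to the sample size.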
Assignment #4
Background
This is a made-up dataset, but we will assume the data was randomly collected. We want to see what
happens when a two-sample means hypothesis test is performed on a matched-pair dataset.
Questions:
Q1. Upload the dataset called “assignment4” into RStudio.
Situation 1 – Analyze the data as a two-sample means problem
Q2. Draw a side-by-side boxplot. There is no need for title or axes labels.
Q3. Draw the quantile plots for each variable. Include a line for each plot. Put the title “Group
1” and “Group 2” in the appropriate quantile plot.
Q4. From the boxplots and quantile plots, would you say the distributions of the variables are
approximately symmetric, right-skewed, left-skewed or none of the given?
Q5. Write the null and alternative hypotheses in symbols, if a two-sample means test is
performed. Identify the symbols used.
Q6. Perform a two-sample hypothesis test to determine if there is any difference between the
population means.
Q7. Using the significance level, 𝛼 = 0.10, write a conclusion for your hypothesis test in the
context of the given situation.
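A two-sample analysis of this kind might look like the sketch below. The vectors `group1` and `group2` are hypothetical values invented for illustration, not the assignment4 data. The sketch also assumes that "quantile plot" means a normal quantile plot (`qqnorm()` with `qqline()`), and it uses R's default Welch two-sided test of H0: μ1 = μ2 against Ha: μ1 ≠ μ2.

```r
# Hypothetical toy values, invented for illustration only (not assignment4).
group1 <- c(12.1, 14.3, 15.0, 16.2, 18.4, 19.9, 21.5, 23.0)
group2 <- c(11.6, 13.9, 14.2, 15.9, 17.7, 19.5, 20.8, 22.6)

# Side-by-side boxplots of the two groups.
boxplot(group1, group2, names = c("Group 1", "Group 2"))

# Normal quantile plots, each with a reference line.
qqnorm(group1, main = "Group 1")
qqline(group1)
qqnorm(group2, main = "Group 2")
qqline(group2)

# Two-sample t test treating the groups as independent
# (Welch version, two-sided, which is the t.test() default).
two_sample <- t.test(group1, group2)
two_sample

# Compare the p-value with the chosen significance level.
alpha <- 0.10
two_sample$p.value < alpha   # TRUE would mean "reject H0"
```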
Situation 2 - Analyze the data as a matched pair
Q8. Add a column of differences between Group1 and Group2 to the data frame. One way to
add a new column to the existing data frame is:
data_frame$new_column_name <- data_frame$Group1 - data_frame$Group2.
View your dataset to make sure the column is added correctly.
Q9. Draw a boxplot for the “differences” variable.
Q10. Draw a quantile plot of the “differences” variable and include a line.
Q11. From the boxplot and quantile plot, would you say the distribution of the “differences”
variable is approximately symmetric, right-skewed, left-skewed or none of the given?
Q12. Perform a one-sample hypothesis test to determine if there are any differences in the
population means.
Q13. Using the significance level, 𝛼 = 0.10, write a conclusion for your hypothesis test in the
context of the given situation.
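The matched-pair version can be sketched the same way. The data frame `toy` and its values are hypothetical, invented for illustration and not the assignment4 data, and the quantile plot is again assumed to be a normal quantile plot.

```r
# Hypothetical toy values, invented for illustration only (not assignment4).
toy <- data.frame(
  Group1 = c(12.1, 14.3, 15.0, 16.2, 18.4, 19.9, 21.5, 23.0),
  Group2 = c(11.6, 13.9, 14.2, 15.9, 17.7, 19.5, 20.8, 22.6)
)

# Add a column of within-pair differences.
toy$differences <- toy$Group1 - toy$Group2
head(toy)

# Boxplot and normal quantile plot of the differences.
boxplot(toy$differences, main = "Differences (Group1 - Group2)")
qqnorm(toy$differences, main = "Differences")
qqline(toy$differences)

# One-sample t test of H0: mean difference = 0 (two-sided).
one_sample <- t.test(toy$differences, mu = 0)
one_sample

# The same test expressed as a paired t test on the two columns.
paired <- t.test(toy$Group1, toy$Group2, paired = TRUE)

alpha <- 0.10
one_sample$p.value < alpha   # TRUE would mean "reject H0"
```

With these toy values the paired analysis tests the spread of the differences rather than the spread within each group, which is why its result can differ sharply from an independent two-sample test on the same numbers.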