STATS 10 Assignment 1

The document outlines the requirements for submitting an assignment for a statistics course, including formatting instructions and submission guidelines. It consists of two parts: Part I focuses on R programming tasks related to data analysis, while Part II includes questions about statistical interpretation and analysis of given datasets. Students must provide clear and legible answers, including outputs and explanations for their calculations and analyses.

Uploaded by

vc431365

Please submit both parts of the assignment in one single PDF file. You can use any PDF editor
software to merge the two parts into one file. Please make sure that the questions are in the correct
order and clearly labeled, and that the answers are legible and easy to read.
To submit your assignment, upload the PDF file under the designated assignment page on the
course website before the deadline specified. Email or hard copy submissions are not accepted.

Part I
Include both the R commands and their corresponding outputs, results, or answers for all
exercise questions in Part I.

1. Vectors:
a. Create a vector named heights that contains the heights, in inches, of yourself and two
students near you. Print the contents of this vector.
b. Create a vector named names that contains the names of these people. Print the contents
of this vector.
c. Try typing cbind(heights, names). What did this command do? What class is this new
object?
Hint: Try the class() function.
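A minimal R sketch of question 1. The heights and names below are made-up values for illustration, not anyone's real data.

```r
# Hypothetical values, invented for illustration only
heights <- c(66, 70, 64)
names <- c("Ana", "Ben", "Cara")

print(heights)
print(names)

# cbind() binds the two vectors together as the columns of one object
combined <- cbind(heights, names)
print(combined)

class(combined)   # "matrix" "array" in R >= 4.0 (just "matrix" in older versions)
typeof(combined)  # "character": a matrix holds one type, so the numbers become strings
```

Because a matrix can hold only one data type, the numeric heights are coerced to character strings when they are bound to the names.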

2. Downloading data:
a. Download the data set births.csv from the course site and upload it into RStudio. Name
the data frame NCbirths.
b. Demonstrate that you have been successful by typing head(NCbirths) and copying and
pasting the output into your word processing document.

3. Package loading:
a. Install the maps package. Verify its installation by typing find.package("maps") and
include the output in your answer.
b. Type library(maps) to load up the package. Type map("state") and include the plot output
in your answer.

Use the births data set for questions 4-11.

4. Perform vector operations:
a. Extract the weight variable as a vector from the data frame.
b. What units do you think the weights are in?
c. Create a new vector named weights_in_pounds which are the weights of the babies in
pounds. You can look up conversion factors on the internet.
d. Demonstrate your success by typing weights_in_pounds[1:20] and including the output in
your word processing document.
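A short sketch of the conversion in question 4, assuming the weights are recorded in ounces. The vector weight_oz is a hypothetical stand-in for NCbirths$weight, with invented values.

```r
# Hypothetical stand-in for NCbirths$weight; values invented, units assumed to be ounces
weight_oz <- c(124, 102, 131, 115, 98)

# There are 16 ounces in a pound, so divide every element by 16
weights_in_pounds <- weight_oz / 16

weights_in_pounds
```

Arithmetic on a vector is applied element by element, so a single division converts every weight at once.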

5. What is the mean weight of the babies in pounds?


a. What percentage of the mothers in the sample smoke? Hint: use the tally function with
the format argument. Use the help screen for guidance.
b. According to the Centers for Disease Control, approximately 21% of adult Americans are
smokers. How far off is the percentage you found in 2 from the CDC’s report?

6. Produce three different histograms of the weights in pounds. Use 3 bins, 20 bins, and 100
bins. Which histogram seems to give the best visualization, and why?
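One possible way to draw the three histograms in base R, shown on simulated data rather than the course data. A single number passed to breaks is only a suggestion to hist(), so this sketch builds the break points explicitly; exact_breaks is a hypothetical helper name.

```r
set.seed(10)  # makes the simulated data reproducible

# Simulated stand-in for the real weights; these are NOT the course data
weights_in_pounds <- rnorm(500, mean = 7.2, sd = 1.3)

# Equal-width break points that yield exactly `bins` bins
exact_breaks <- function(x, bins) {
  seq(min(x), max(x), length.out = bins + 1)
}

for (bins in c(3, 20, 100)) {
  hist(weights_in_pounds,
       breaks = exact_breaks(weights_in_pounds, bins),
       main = paste(bins, "bins"),
       xlab = "Weight (pounds)")
}
```

Too few bins hide the shape of the distribution, while too many bins make the plot noisy, which is the trade-off the question asks about.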

7. We can use the syntax boxplot(vector1, vector2) to make a side by side box plot. Create a
side by side boxplot of the mother’s ages and the father’s ages. Which gender tends to be
older?

8. Try typing histogram(~ weight | Habit, data = NCbirths, layout = c(1, 2)). Describe what this
code does. Based on the graph, do you see any major differences between baby weights from
smoking moms vs. non-smoking moms?

9. Produce a dot plot of the weights in pounds.

10. Consider the other categorical variables in this data. Of those that record the health of the
baby, which do you think will be associated with the mother’s smoking and why? Make a
two-way Summary Table to check your hypothesis. Do you have evidence that this variable is
associated with smoking? Why?

Part II
You may choose to type or write your answers electronically or scan your handwritten
solutions. Please ensure that you show all steps and explanations to receive full credit,
unless otherwise instructed.

1. A data set on Shark Attacks Worldwide posted on StatCrunch records data on all shark
attacks in recorded history including attacks before 1800. The data set can be viewed here:
https://www.statcrunch.com/app/index.html?dataid=2188687

a. How many variables are contained in the data?


i. 15

b. Which of the following questions could not be answered using this data set? Briefly
explain.
i. In what month do most shark attacks occur?
ii. Are shark attacks more likely to occur in warm temperature or cooler
temperatures?
Answer: Question ii cannot be answered because the data set has no variable that
records temperature. The other three questions can be answered from the data.
iii. Attacks by which species of shark are more likely to result in a fatality?
iv. What country has the most shark attacks per year?

c. A researcher wants to understand the age of the people in the data set and proposed some
questions of interest: Are the reported cases mostly younger people or older people?
How is the age distributed? How would you help the researcher answer these questions?
What statistical tools (e.g., graphs, measures) will you use? (You only need to describe
your approach)
i. First, I would advise the researcher to create a histogram of the ages of all
reported cases. The bins can cover equal age ranges such as 15 to 20, 20 to 25,
25 to 30, and so on.
ii. From the histogram, the researcher can see which ages account for most of the
reported cases by looking at the overall shape of the distribution, taking note
of the skewness and the modality. The median age and the interquartile range
can then summarize the typical age and the spread.

2. The scores of a quiz are displayed in the graph below.

a. Describe the shape of distribution


i. The distribution of the graph is unimodal and skewed left.

b. Would the mean score be greater than, less than, or about the same as the median score?
Explain.
i. In a distribution that is skewed left, the mean score would be less than the
median score, because the low scores in the left tail pull the mean down.

c. What measures would you use to report the center and spread? Explain.
i. To measure the center, I would use the median, which is a good measure of a
typical value for skewed distributions. To measure the spread, I would use the
interquartile range, which pairs with the median and describes the range of
the middle 50% of the scores.
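The relationship between the mean and the median in a left-skewed distribution can be checked on a small example. The scores below are invented for illustration, since the quiz graph itself is not reproduced here.

```r
# Hypothetical left-skewed quiz scores (most are high, a few are low)
quiz_scores <- c(2, 5, 7, 8, 8, 9, 9, 9, 10, 10)

mean(quiz_scores)    # 7.7, pulled down by the low scores in the left tail
median(quiz_scores)  # 8.5, resistant to the tail
IQR(quiz_scores)     # spread of the middle 50% of the scores
```

For these made-up scores the mean falls below the median, which is the pattern described in part b.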

3. The distribution of test scores in a class is unimodal and symmetric with a mean of 80 pts
and a standard deviation of 7pts. Based on the information, Adam estimated that his score is
higher than approximately 97.5% of the students in class. What score did Adam receive?
Explain.
i. Adam’s score is about 94 points. By the empirical rule, about 95% of the scores
in a unimodal, symmetric distribution fall within two standard deviations of
the mean (66 to 94 points), which leaves about 2.5% in each tail. A score two
standard deviations above the mean, 80 + 2(7) = 94, is therefore higher than
about 95% + 2.5% = 97.5% of the class.
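The arithmetic can be checked in R. The second calculation assumes the scores are approximately normal, which is stronger than the "unimodal and symmetric" condition stated in the question.

```r
mean_score <- 80
sd_score <- 7

# Empirical rule: about 97.5% of scores lie below mean + 2 standard deviations
empirical_cutoff <- mean_score + 2 * sd_score
empirical_cutoff  # 94

# Under an assumed normal model, the exact 97.5th percentile is slightly lower
normal_cutoff <- qnorm(0.975, mean = mean_score, sd = sd_score)
normal_cutoff     # about 93.72
```

The two answers differ slightly because the empirical rule rounds the normal multiplier 1.96 up to 2.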

4. Assume that both men and women’s heights have symmetric and unimodal distributions.
Women’s distribution has a mean of 64 inches and a standard deviation of 2.5 inches. Men’s
distribution has a mean of 69 inches and a standard deviation of 3 inches.
a. What women’s height corresponds with a z-score of -1.50?
i. A z-score of -1.50 corresponds to a height of 64 + (-1.50)(2.5) = 60.25 inches
for women.
b. Professional basketball player Evelyn Akhator is 75 inches tall and plays in the
WNBA (women’s league). Professional basketball player Draymond Green is 79 inches
tall and plays in the NBA (men’s league). Compared to their own peers, who is taller?
i. Evelyn Akhator has a z-score of (75 - 64) / 2.5 = 4.4, and Draymond Green has a
z-score of (79 - 69) / 3 ≈ 3.3 (z-scores have no units). Because her z-score is
larger, Akhator is taller relative to her peers.
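The z-score calculations in question 4 can be reproduced with two small helper functions. The function names z_score and height_from_z are hypothetical and chosen only for this sketch.

```r
# z = (x - mean) / sd, and the inverse x = mean + z * sd
z_score <- function(x, mean, sd) (x - mean) / sd
height_from_z <- function(z, mean, sd) mean + z * sd

# Part a: women's height at z = -1.50
height_from_z(-1.5, mean = 64, sd = 2.5)        # 60.25 inches

# Part b: compare each player with their own group
z_akhator <- z_score(75, mean = 64, sd = 2.5)   # 4.4
z_green   <- z_score(79, mean = 69, sd = 3)     # about 3.33

z_akhator > z_green  # TRUE: Akhator is taller relative to her peers
```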

5. The top ten movies based on Marvel comic book characters for the U.S. box office as of fall
2017 are shown in the following table, with domestic gross rounded to the nearest hundred
million. (Source: ultimatemovieranking.com)

a. Report the five-number summary of the domestic gross income.


Min: 363, Q1: 389, Median: 428.5, Q3: 520, Max: 677
b. Interpret the five-number summary in context, i.e., what information can you obtain about
the distribution of the domestic gross income?
i. The five-number summary shows that the middle 50% of these movies have domestic
gross incomes between 389 million (Q1) and 520 million (Q3), with a median (Q2)
of about 428.5 million. The full range runs from 363 million to 677 million.
The median sits closer to Q1 and the maximum lies far above Q3, which suggests
a right-skewed distribution. By the 1.5 × IQR rule, the fences are
389 - 196.5 = 192.5 and 520 + 196.5 = 716.5, so no movie is flagged as an outlier.
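A quick check of the outlier fences for question 5, using only the five-number summary reported in the answer (the underlying table is not reproduced here).

```r
# Five-number summary as reported in the answer, in millions of dollars
gross_min <- 363; gross_q1 <- 389; gross_median <- 428.5
gross_q3 <- 520; gross_max <- 677

gross_iqr <- gross_q3 - gross_q1            # 131
lower_fence <- gross_q1 - 1.5 * gross_iqr   # 192.5
upper_fence <- gross_q3 + 1.5 * gross_iqr   # 716.5

# No outliers if the minimum and maximum both fall inside the fences
no_outliers <- gross_min >= lower_fence && gross_max <= upper_fence
no_outliers  # TRUE
```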

6. The data set below shows the number of central public libraries in 32 states.
The five-number summary is given as:
Minimum   Q1   Median   Q3    Maximum
1         62   91       218   756

Sketch a boxplot using the five-number summary above and the data below.
Mark the values of the quartiles, the lower whisker, the upper whisker, and any potential
outliers in the boxplot. Explain how you determined the length of the whiskers. (The
scale of the plot does not need to be accurate)
IQR = Q3 - Q1 = 218 - 62 = 156, so 1.5 × IQR = 234.
Lower fence = Q1 - 1.5 × IQR = 62 - 234 = -172; upper fence = Q3 + 1.5 × IQR = 218 + 234 = 452.
I found the lower fence by subtracting 1.5 × IQR from Q1, and I found the upper fence
by adding 1.5 × IQR to Q3. The lower fence (-172) is below the minimum, so the lower
whisker ends at the minimum, 1. The upper whisker ends at the largest observation that
does not exceed 452, and any value above 452, including the maximum of 756, is marked
as a potential outlier.
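The whisker rule can be sketched in R. The fences come from the five-number summary given in the question, but the vector upper_tail is invented for illustration because the 32 state counts are not reproduced here.

```r
lib_min <- 1; lib_q1 <- 62; lib_q3 <- 218; lib_max <- 756

lib_iqr <- lib_q3 - lib_q1              # 156
lower_fence <- lib_q1 - 1.5 * lib_iqr   # -172
upper_fence <- lib_q3 + 1.5 * lib_iqr   # 452

# The minimum is above the lower fence, so the lower whisker ends at the minimum.
# If it were not, the raw data would be needed to locate the whisker (hence NA).
lower_whisker <- if (lib_min >= lower_fence) lib_min else NA

# Hypothetical upper-tail observations, invented for illustration only
upper_tail <- c(218, 300, 410, 530, 756)

# The upper whisker ends at the largest observation inside the fence
upper_whisker <- max(upper_tail[upper_tail <= upper_fence])
potential_outliers <- upper_tail[upper_tail > upper_fence]
```

With the real data, the same two lines would give the actual end of the upper whisker and the states to mark as potential outliers.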
