Lab2
Vianna Chavez
2025-02-04
Exercise 1
1a)
flint <- read.csv("~/Desktop/Stats10/flint.csv")
1b)
mean(flint$Pb >= 15)
## [1] 0.04436229
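A minimal sketch of why mean() applied to a comparison gives a proportion. The values in toy_pb are made up for illustration and are not from the flint data.

```r
# Hypothetical lead readings (illustration only, not the flint data)
toy_pb <- c(2, 20, 5, 15)
# The comparison gives a logical vector: FALSE TRUE FALSE TRUE
flags <- toy_pb >= 15
# mean() treats TRUE as 1 and FALSE as 0, so this is the share of TRUE values
prop_high <- mean(flags)
print(prop_high)
```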
1c)
mean(flint$Cu[flint$Region == "North"])
## [1] 44.6424
1d)
mean(flint$Cu[flint$Pb >= 15])
## [1] 305.8333
1e)
mean(flint$Pb)
## [1] 3.383272
mean(flint$Cu)
file:///Users/vianna/Desktop/Stats10/Lab2.html 1/16
2/5/25, 11:34 PM Lab2
## [1] 54.58102
1f)
boxplot(flint$Pb, main = "Flint lead (Pb) levels")
1g)
# No, I don't think a boxplot is the best visual representation of the mean. I think a histogram would represent the mean better.
Exercise 2
2a)
life <- read.table("https://ucla.box.com/shared/static/rqk4lc030pabv30wknx2ft9jy848ub9n.txt", header = TRUE)
plot(Life~Income, data = life)
# The plot seems to support the idea that as income increases, life expectancy tends to be longer.
2b)
boxplot(life$Income, main = "Income")
library(mosaic)
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
2c)
2d)
plot(Life~Income, data = low_income)
library(mosaic)
cor(Life~Income, data = low_income) # correlation coefficient
## [1] 0.752886
Exercise 3
3a)
summary(maas$lead)
summary(maas$zinc)
3b)
histogram(maas$lead)
library(mosaic)
histogram(log(maas$lead))
3c)
plot(log(lead)~log(zinc), data = maas)
# There seems to be a fairly strong, positive, linear relationship between the two variables.
3d)
mycolors <- c("lightblue", "pink", "purple")
mylevels <- cut(maas$lead, c(0, 150, 400, 10000))
mysize <- 8 # passed to pch below, so this picks the plotting symbol (8 is an asterisk); point size would be set with cex
plot(maas$x, maas$y, col=mycolors[as.numeric(mylevels)], pch= mysize)
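A minimal sketch of how the color mapping in 3d works. The values in toy_lead are made up for illustration and are not from maas. cut() assigns each value to one of the three intervals, and as.numeric() turns that interval into an index into mycolors.

```r
# Hypothetical lead values (illustration only, not the maas data)
toy_lead <- c(100, 200, 500)
mycolors <- c("lightblue", "pink", "purple")
# cut() bins each value into (0,150], (150,400], or (400,10000]
toy_levels <- cut(toy_lead, c(0, 150, 400, 10000))
# as.numeric() gives the bin number (1, 2, or 3), which selects the color
toy_colors <- mycolors[as.numeric(toy_levels)]
print(toy_colors)
```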
Exercise 4
4a)
LA <- read.table("https://ucla.box.com/shared/static/d189x2gn5xfmcic0dmnhj2cw94jwvqpa.txt", header = TRUE)
4b)
# Relationship: There seems to be a positive association between income and schools. As income increases, so does the number of schools. The relationship appears to be linear, but it is affected by some outliers.
# 2. second plot
plot(Schools~Income, data = LA[LA$Schools != 0, ], main = "School v. Income")
Exercise 5
5a)
customer_data <- read.csv("https://ucla.box.com/shared/static/y2y8rcie7mjw2h5t92x9dfcp133tc90h.csv")
# Yes, there are missing values. There are 22 in total: 10 in age, 5 in income, and 7 in purchase_amt.
sum(is.na(customer_data))
## [1] 22
colSums(is.na(customer_data))
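A minimal sketch of how the two missing-value counts relate, using a made-up data frame rather than the customer data. sum(is.na()) counts every missing cell, while colSums(is.na()) breaks the same count down by column.

```r
# Hypothetical data frame (illustration only, not customer_data)
toy <- data.frame(
  age = c(25, NA, 40, NA),
  income = c(50000, 62000, NA, 48000)
)
# Total number of missing cells in the whole data frame
total_missing <- sum(is.na(toy))
# Number of missing cells in each column
missing_by_col <- colSums(is.na(toy))
print(total_missing)
print(missing_by_col)
```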
5b)
# income and purchase_amt should be numeric, since the amounts may include decimals, such as a purchase amount of 164.50.
class(customer_data$cust_id)
## [1] "character"
class(customer_data$age)
## [1] "integer"
class(customer_data$gender)
## [1] "character"
class(customer_data$income)
## [1] "numeric"
class(customer_data$education)
## [1] "character"
class(customer_data$marital_status)
## [1] "character"
class(customer_data$purchase_amt)
## [1] "numeric"
5c)
# I produced box plots to help me visualize customer income, purchase amount, and age. I did not see any outliers in these plots.
Part II
You may choose to type or write your answers electronically or scan your handwritten
solutions. Please ensure that you show all steps and explanations to receive full credit,
unless otherwise instructed.
Exercise 1
A study was done on a random sample of 900 college students. The researcher wants to find out if
gender would affect people’s body image. The two-way table below summarizes the two
variables.
Two-way table: Gender (rows) by Body Image (columns: About right, Overweight, Underweight, Total)
a. In general, are students happy with their body weight? (Hint: Students that are happy with
their body weight responded "about right.")
Yes. In general, around 66% of the students responded that their weight is about right.
b. Suppose the researcher wants to compare the differences in body image between females and males.
What graph would best visualize the data for this purpose? Explain. (No need to draw the
actual plot.)
If the researcher wants to compare these categorical variables, the best chart to use would be
a grouped bar chart. This visual representation shows the differences between the body image
categories and, within each category, the differences between males and females.
c. Are female students more likely to feel they are about right than male students? Explain with
numerical evidence.
No, male students are a bit more likely to feel about right with their weight: about 67.4% of
males feel about right, compared with about 65.9% of females.
d. For students who do not feel ‘about right’ with their body image, are there any differences
between the two gender groups? (Hint: are they more likely to feel they are overweight or
underweight? Do female students and male students feel the same way?)
Among female participants who do not feel about right, feeling overweight is much more common
than feeling underweight: around 81.2% of them feel overweight. The gap is much more drastic
than for the males, who are almost evenly split. However, males are more likely to feel that
they are underweight.
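A minimal sketch of how percentages like the ones in parts c and d could be computed in R. The counts below are hypothetical and are not the study's table; only the row and column labels follow the exercise.

```r
# Hypothetical counts (illustration only, not the study's data)
counts <- matrix(c(60, 30, 10,
                   50, 20, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("Female", "Male"),
                                 BodyImage = c("About right", "Overweight", "Underweight")))
# Row proportions: within each gender, the share in each body image category
row_props <- prop.table(counts, margin = 1)
# Among those who do not feel about right, the share who feel overweight
not_right <- counts[, c("Overweight", "Underweight")]
overweight_share <- not_right[, "Overweight"] / rowSums(not_right)
print(round(row_props, 3))
print(round(overweight_share, 3))
```

Comparing the "About right" column of row_props answers part c, and overweight_share answers part d.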
Exercise 2
For each of the scatterplots shown, provide a written description that includes the direction, form,
and strength of the relationship, along with any outliers that do not fit the general trend. In
addition, explain what these characteristics mean in the context of the data.
a. Data on 50 states taken from the U.S. Census shows how the median family income is
related to the population (25 years or older) with a college degree or higher.
a. Direction: positive
b. Form: linear
c. Strength of relationship: moderate
d. Outliers: Yes (29, 61?)
e. Explain characteristics
i. This graph shows that states where more of the population has a college degree or
higher tend to have higher median family incomes. This makes sense because having
a degree tends to lead to higher-paying job opportunities.
b. Consider the relationship between the average amount of fuel used (in liters) to drive a fixed
distance in a car (100 km), and the speed at which the car is driven (in km per hour).
a. Direction: neither positive nor negative overall (decreasing, then increasing)
b. Form: nonlinear (U-shape)
c. strength of relationship: strong
d. outliers: 1 potential outlier
e. Explain characteristics
a. This scatter plot represents the relationship between a car’s speed and the fuel
consumption of that car. At low speeds, fuel consumption is high. At moderate
speeds, fuel consumption is lowest, so efficiency is maximized. At high speeds,
fuel consumption increases again.
Exercise 3
A researcher collected data on the median starting salaries and the median mid-career salaries for
graduates at a selection of colleges. (Source: The Wall Street Journal, Salary increase by salary
type, https://www.wsj.com/public/resources/documents/info-Salaries_for_Colleges_by_Typesort.html).
The data points and the fitted least squares regression
line are displayed in the graph below.
b. And why do you think the median salary is used instead of the mean?
The median salary is used because it is less susceptible to outliers than the mean, which is
sensitive to extreme values in the data.
c. Can the median mid-career salary be estimated given a median starting salary of 60 (in
thousands of dollars)? Please explain why or why not, and show your calculation and
explanation if possible.
Yes, because a starting salary of $60,000 falls within the range of the observed data in the
scatter plot, so the estimate from the regression line should be reasonably reliable.
d. Can the median mid-career salary be estimated given a median starting salary of 100 (in
thousands of dollars)? Please explain why or why not, and show your calculation and
explanation if possible.
This estimate may not be reliable because $100,000 is outside the range of the observed data.
Predicting there would be extrapolation, and the linear trend may not hold.
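A minimal sketch of the interpolation and extrapolation check described in parts c and d. The salaries below are hypothetical (in thousands of dollars) and are not the WSJ data, and predict_with_check is a helper name made up for this illustration.

```r
# Hypothetical salaries in thousands of dollars (illustration only, not the WSJ data)
salaries <- data.frame(
  start = c(40, 45, 50, 55, 60, 65, 70),
  mid   = c(70, 80, 88, 99, 108, 118, 125)
)
# Least squares regression of mid-career salary on starting salary
fit <- lm(mid ~ start, data = salaries)

# Hypothetical helper: returns the predicted value and whether x lies
# inside the range of observed starting salaries
predict_with_check <- function(x) {
  in_range <- x >= min(salaries$start) && x <= max(salaries$start)
  estimate <- unname(predict(fit, newdata = data.frame(start = x)))
  list(estimate = estimate, in_range = in_range)
}

print(predict_with_check(60))   # inside the observed range: interpolation
print(predict_with_check(100))  # outside the observed range: extrapolation
```

The line still returns a number for 100, which is why the range check matters: the calculation works, but the data give no support for the trend that far out.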
Exercise 4
Assume that the relationship between the calories in a five-ounce serving and the % alcohol
content for a sample of wines is linear. Use the % alcohol as the explanatory variable, and fit a
least squares regression line.
Exercise 5
A doctor who believes strongly that antidepressants work better than "talk therapy" tests
depressed patients by treating half of them with antidepressants and the other half with talk
therapy. The doctor recruited 100 patients for the study. After six months’ treatment, the patients
will be evaluated on a scale of 1 to 5, with 5 indicating the greatest improvement. The doctor is
designing the study plan.
a. The doctor wants to put the most severe patients in the antidepressants group because he is
concerned about those patients’ conditions. Will this affect his ability to compare the
effectiveness of the antidepressants and the “talk therapy”? Explain.
Yes, this will invalidate the comparison between the two treatments. The groups would differ
in severity from the start, which introduces bias. The doctor must instead randomly assign
patients to the two groups to ensure that there is a fair comparison between the two.
b. The doctor asks you whether it is acceptable for him to know which treatment each patient
receives. Explain why this practice may affect his ability to compare the two groups.
No, it would not be acceptable for the doctor to know which treatment group each patient
is assigned to. This could lead to experimenter bias when he evaluates the patients, and a
study is best conducted when it is double-blind.
c. What improvements to the plan would you recommend?
Overall, I would strongly suggest randomly assigning patients to each group. Random
assignment helps balance confounding variables, such as the severity of the patients'
depression, across the two groups. I would also use a double-blind approach, in which
neither the experimenter nor the patient knows which treatment the patient is receiving.
This would further reduce any bias that may come from the experimenter or the participant.
I would possibly also suggest having a larger group to conduct the study on.
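A minimal sketch of how the 100 patients could be randomly assigned, 50 to each treatment. The seed value and the patient IDs are arbitrary choices for illustration.

```r
set.seed(10)  # arbitrary seed so the assignment can be reproduced
patient_id <- 1:100
# Shuffle 50 labels of each treatment, then give them to patients in order
treatment <- sample(rep(c("antidepressant", "talk therapy"), each = 50))
assignment <- data.frame(patient_id, treatment)
print(table(assignment$treatment))
```

Because the labels are shuffled rather than chosen by the doctor, severity cannot influence which group a patient ends up in.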