2/5/25, 11:34 PM Lab2

Lab2
Vianna Chavez
2025-02-04

Exercise 1
1a)
flint <- read.csv("~/Desktop/Stats10/flint.csv")

1b)
mean(flint$Pb >= 15)

## [1] 0.04436229
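
The call in 1b works because R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values. A minimal sketch with made-up readings (the vector `pb` is hypothetical and does not come from flint.csv):

```r
# Hypothetical lead readings in ppb (illustrative values, not the Flint data)
pb <- c(0, 3, 7, 15, 22, 1, 4, 9)

# The comparison yields a logical vector: TRUE where the reading is at or above 15
above <- pb >= 15

# sum() counts the TRUE values; mean() gives their share of all readings
n_above <- sum(above)
prop_above <- mean(above)

print(n_above)
print(prop_above)
```

The same pattern extends to any condition, which is why `mean(flint$Pb >= 15)` returns a proportion rather than an average lead level.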

1c)
mean(flint$Cu[flint$Region == "North"])

## [1] 44.6424

1d)
mean(flint$Cu[flint$Pb >= 15])

## [1] 305.8333

1e)
mean(flint$Pb)

## [1] 3.383272

mean(flint$Cu)

## [1] 54.58102

1f)
boxplot(flint$Pb, main = "Flint lead (Pb) levels")

1g)
# No, I don't believe this boxplot is the best visual representation here. A boxplot marks
# the median rather than the mean, and I think a histogram would show the distribution better.

Exercise 2
2a)
life <- read.table("https://ucla.box.com/shared/static/rqk4lc030pabv30wknx2ft9jy848ub9n.txt", header = TRUE)
plot(Life~Income, data = life)

# The plot seems to support the idea that as income increases, life expectancy
# tends to be longer.

2b)
boxplot(life$Income, main = "Income")

library(mosaic)

## Registered S3 method overwritten by 'mosaic':
##   method                           from
##   fortify.SpatialPolygonsDataFrame ggplot2

##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.

##
## Attaching package: 'mosaic'

## The following objects are masked from 'package:dplyr':
##
##     count, do, tally

## The following object is masked from 'package:Matrix':
##
##     mean

## The following object is masked from 'package:ggplot2':
##
##     stat

## The following objects are masked from 'package:stats':
##
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var

## The following objects are masked from 'package:base':
##
##     max, mean, min, prod, range, sample, sum

histogram(life$Income, main = "Income")

# There are notable outliers in the box plot image.

2c)

low_income <- life[life$Income < 1000, ]


high_income <- life[life$Income >= 1000, ]

2d)
plot(Life~Income, data = low_income)

library(mosaic)
cor(Life~Income, data = low_income) # correlation coefficient

## [1] 0.752886
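
The formula form `cor(Life~Income, data = ...)` used in 2d relies on the mosaic package. For comparison, here is a sketch of the same subset-then-correlate step in base R; the data frame `toy` and its values are hypothetical stand-ins for `life`:

```r
# Hypothetical data standing in for the life expectancy table (illustrative values)
toy <- data.frame(Income = c(200, 500, 900, 1500, 3000),
                  Life   = c(48, 55, 63, 66, 70))

# Keep only the rows with income below 1000, as in 2c
low <- toy[toy$Income < 1000, ]

# Base R cor() takes the two numeric vectors directly
r_low <- stats::cor(low$Income, low$Life)

print(nrow(low))
print(r_low)
```

Calling `stats::cor` explicitly avoids any masking by mosaic; the correlation is the same whichever variable is listed first.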

Exercise 3
3a)

maas <- read.table("https://ucla.box.com/shared/static/tv3cxooyp6y8fh6gb0qj2cxihj8klg1h.txt", header = TRUE)

summary(maas$lead)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    37.0    72.5   123.0   153.4   207.0   654.0

summary(maas$zinc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   113.0   198.0   326.0   469.7   674.5  1839.0

3b)
histogram(maas$lead)

library(mosaic)
histogram(log(maas$lead))

3c)
plot(log(lead)~log(zinc), data = maas)

# There seems to be a fairly strong, positive, and linear relationship between the two
# variables.

3d)
mycolors <- c("lightblue", "pink", "purple")
mylevels <- cut(maas$lead, c(0, 150, 400, 10000))
mysize <- 8 # passed to pch below, so it selects the plotting symbol (8 is an asterisk), not the point size; use cex to change the size
plot(maas$x, maas$y, col=mycolors[as.numeric(mylevels)], pch= mysize)
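
To make the coloring step in 3d easier to follow, here is a small sketch of how `cut()` assigns each lead value to a bin and how the bin index picks a color. The vector `lead` below is hypothetical, not the maas data:

```r
# Hypothetical lead concentrations (illustrative values only)
lead <- c(50, 150, 151, 399, 400, 654)

# cut() bins each value into (0,150], (150,400], (400,10000];
# intervals are closed on the right by default, so 150 falls in the first bin
levels_found <- cut(lead, c(0, 150, 400, 10000))

# as.numeric() turns the factor into bin indices 1, 2, 3,
# which then index into the color vector
mycolors <- c("lightblue", "pink", "purple")
point_colors <- mycolors[as.numeric(levels_found)]

print(table(levels_found))
print(point_colors)
```

Because the bins are right-closed, a reading of exactly 150 or 400 is colored with the lower group.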

Exercise 4
4a)
LA <- read.table("https://ucla.box.com/shared/static/d189x2gn5xfmcic0dmnhj2cw94jwvqpa.txt", header=TRUE)

plot(LA$Longitude, LA$Latitude, xlab = "Longitude", ylab = "Latitude", main = "Mapping LA City Centers")
library(maps)
map("county", "california", add = TRUE)

4b)
# Relationship: There seems to be a positive association between income and schools. As
# income increases, so does the number of schools. The relationship looks roughly linear,
# but it is affected by some outliers.
# 2. second plot
plot(Schools~Income, data = LA[LA$Schools != 0, ], main = "School v. Income")

Exercise 5
5a)
customer_data <- read.csv("https://ucla.box.com/shared/static/y2y8rcie7mjw2h5t92x9dfcp133tc90h.csv")

# Yes, there are missing values. There are 22 in total: 10 in age, 5 in income,
# and 7 in purchase_amt.

sum(is.na(customer_data))

## [1] 22

colSums(is.na(customer_data))

##        cust_id            age         gender         income      education
##              0             10              0              5              0
## marital_status   purchase_amt
##              0              7

5b)
# income and purchase_amt should be stored as numeric values because the amounts may
# include decimals, such as a purchase amount of 164.50.

customer_data$income <- as.numeric(customer_data$income)


customer_data$purchase_amt <- as.numeric(customer_data$purchase_amt)

class(customer_data$cust_id)

## [1] "character"

class(customer_data$age)

## [1] "integer"

class(customer_data$gender)

## [1] "character"

class(customer_data$income)

## [1] "numeric"

class(customer_data$education)

## [1] "character"

class(customer_data$marital_status)

## [1] "character"

class(customer_data$purchase_amt)

## [1] "numeric"

5c)

# I produced box plots to help me visualize customer income, purchase amount, and age.
# I did not notice any outliers in these plots.

boxplot(customer_data$income, main = "Customer Income")

boxplot(customer_data$purchase_amt, main = "Purchase Amount")

boxplot(customer_data$age, main = "Age")

Part II

You may choose to type or write your answers electronically or scan your handwritten
solutions. Please ensure that you show all steps and explanations to receive full credit,
unless otherwise instructed.

Exercise 1
A study was done on a random sample of 900 college students. The researcher wants to find out
whether gender affects people's body image. The two-way table below summarizes the two
variables.

Two-way table: Gender by Body Image

Gender    About right   Overweight   Underweight   Total
Female        310           130           30         470
Male          290            68           72         430
Total         600           198          102         900

a. In general, are students happy with their body weight? (Hint: Students that are happy with
their body weight responded "about right.")
Yes. In general, about 67% of students (600 of 900) responded that their weight is about
right.
b. If the researcher wants to compare the differences in body image between females and males,
what graph would best visualize the data for this purpose? Explain. (No need to draw the
actual plot.)
Because both variables are categorical, the best chart would be a grouped bar chart, ideally
showing percentages within each gender since the two groups differ in size. This makes it easy
to compare the three body image categories side by side for males and females.
c. Are female students more likely to feel they are about right than male students? Explain with
numerical evidence.
No. Male students are slightly more likely to feel about right with their weight: about
67.4% of males (290 of 430) versus about 66.0% of females (310 of 470).
d. For students who do not feel ‘about right’ with their body image, are there any differences
between the two gender groups? (Hint: are they more likely to feel they are overweight or
underweight? Do female students and male students feel the same way?)
Among female students who do not feel about right, feeling overweight is much more common than
feeling underweight: about 81.2% (130 of 160) feel overweight. The males are almost evenly
split, with about 48.6% (68 of 140) feeling overweight and 51.4% (72 of 140) feeling
underweight, so males are slightly more likely to feel underweight.
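
The percentages quoted in parts a, c, and d can be reproduced from the cell counts. The following is a sketch in base R; the object names (`body`, `by_gender`, `not_right`) are illustrative choices, and only the six inner counts from the table are used:

```r
# Counts from the two-way table (rows: gender, columns: body image)
body <- matrix(c(310, 130, 30,
                 290,  68, 72),
               nrow = 2, byrow = TRUE,
               dimnames = list(Gender = c("Female", "Male"),
                               Image  = c("About right", "Overweight", "Underweight")))

# (a) Overall share answering "about right"
overall_right <- sum(body[, "About right"]) / sum(body)

# (c) Distribution of body image within each gender (row proportions)
by_gender <- prop.table(body, margin = 1)

# (d) Among students who did not answer "about right",
#     the overweight/underweight split within each gender
not_right <- prop.table(body[, c("Overweight", "Underweight")], margin = 1)

print(round(overall_right, 3))
print(round(by_gender, 3))
print(round(not_right, 3))
```

Using `margin = 1` conditions on gender, which is the comparison parts c and d ask for; `margin = 2` would instead give the gender split within each body image category.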

Exercise 2
For each of the scatterplots shown, provide a written description that includes the direction, form,
and strength of the relationship, along with any outliers that do not fit the general trend. In
addition, explain what these characteristics mean in the context of the data.

a. Data on 50 states taken from the U.S. Census shows how the median family income is
related to the population (25 years or older) with a college degree or higher.

a. Direction: positive
b. Form: linear
c. Strength of relationship: moderate
d. Outliers: yes (29, 61?)
e. Explain characteristics
i. This graph shows that states where a larger share of adults hold a college degree tend
to have higher median family incomes. This makes sense because having a degree or
educational background tends to lead to higher-paying job opportunities.

b. Consider the relationship between the average amount of fuel used (in liters) to drive a fixed
distance in a car (100 km), and the speed at which the car is driven (in km per hour).
a. Direction: neither positive nor negative overall (decreasing, then increasing)
b. Form: nonlinear (U-shape)
c. Strength of relationship: strong
d. Outliers: 1 potential outlier
e. Explain characteristics
a. This scatterplot represents the relationship between a car’s speed and the fuel
consumption of that car. At low speeds, fuel consumption is high. At moderate
speeds, efficiency is maximized. At high speeds, fuel consumption increases
again.
Exercise 3
A researcher collected data on the median starting salaries and the median mid-career salaries for
graduates at a selection of colleges. (Source: The Wall Street Journal, Salary increase by salary
type, https://www.wsj.com/public/resources/documents/info-Salaries_for_Colleges_by_Typesort.html).
The data points and the fitted least squares regression
line are displayed in the graph below.

a. What is the explanatory variable and response variable?


a Explanatory variable: median starting salary
b Response variable: median mid-career salary

b. And why do you think the median salary is used instead of the mean?
a The median salary is used because it is resistant to outliers, whereas the mean is
pulled toward extreme values such as a few very high salaries.

c. Can the median mid-career salary be estimated given a median starting salary of 60 (in
thousands of dollars)? Please explain why or why not, and show your calculation and
explanation if possible.
Yes, because a starting salary of $60,000 falls within the range of the observed data in the
scatterplot, so an estimate read from the fitted line seems reasonably reliable.
d. Can the median mid-career salary be estimated given a median starting salary of 100 (in
thousands of dollars)? Please explain why or why not, and show your calculation and
explanation if possible.
This estimate may not be reliable because $100,000 is outside the range of the observed data.
Predicting there would be extrapolation, and the linear trend may not hold that far out.

Exercise 4
Assume that the relationship between the calories in a five-ounce serving and the % alcohol
content for a sample of wines is linear. Use the % alcohol as the explanatory variable, and fit a
least squares regression line.

a. Calculate slope and intercept of the regression line.


Slope: 18.98, Intercept: -67.67
b. Report the equation of the regression line and interpret it in the context of the problem.
Ŷ = -67.67 + 18.98x
This line suggests that for each one-percentage-point increase in alcohol content, calories
increase by about 18.98 on average. The intercept says that a wine with 0% alcohol would have
-67.67 calories, which has no practical meaning because 0% lies far outside the observed range.
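
For reference, the least squares slope and intercept follow from the summary statistics through slope = r * s_y / s_x and intercept = mean_y - slope * mean_x. Here is a sketch using the values from the table of summary statistics; the variable names are illustrative, and the intercept comes out between roughly -67.6 and -67.7 depending on when the slope is rounded:

```r
# Summary statistics from the table in this exercise
mean_cal <- 141.67   # mean calories (response)
sd_cal   <- 46.34    # standard deviation of calories
mean_alc <- 11.03    # mean % alcohol (explanatory)
sd_alc   <- 2.32     # standard deviation of % alcohol
r        <- 0.95     # correlation coefficient

# Slope: r times the ratio of the standard deviations (response over explanatory)
slope <- r * sd_cal / sd_alc

# Intercept: forces the line through the point of means
intercept <- mean_cal - slope * mean_alc

print(round(slope, 2))
print(round(intercept, 2))
```

Rounding only at the end keeps the slope and intercept consistent with each other.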

c. Find and interpret the value of the coefficient of determination.


r^2 = 0.95^2 = 0.9025
This coefficient of determination suggests that about 90.25% of the variation in calories is
explained by the linear relationship with % alcohol.
d. Suppose a new point was added to your data: a wine that is 20% alcohol that contains 0
calories. How will that affect the value of r and the slope of the regression line? (No
calculation needed)
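This new point would be an outlier far from the trend in the rest of the data, so it would
affect both values: the correlation coefficient r would drop sharply (toward zero or even
below), and the slope of the regression line would be pulled downward.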
Data table (Source: healthalicious.com)

Calories   % alcohol
  122        10.6
  119        10.1
  121        10.1
  123         8.8
  129        11.1
  236        15.2

Table of summary statistics

            Calories   % alcohol
Mean         141.67      11.03
Std. Dev.     46.34       2.32
r                  0.95
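
Although part d asks for no calculation, the effect of the extra wine can be checked directly. This is a sketch that refits the line on the six wines as printed in the data table, then again with the hypothetical 20% alcohol, 0 calorie wine added. Values computed from the printed data may differ slightly from those based on the summary table:

```r
# Data as printed in the data table
wine <- data.frame(
  calories = c(122, 119, 121, 123, 129, 236),
  alcohol  = c(10.6, 10.1, 10.1, 8.8, 11.1, 15.2)
)

# Add the hypothetical wine from part d: 20% alcohol, 0 calories
wine_new <- rbind(wine, data.frame(calories = 0, alcohol = 20))

# Correlation before and after adding the point
r_before <- cor(wine$alcohol, wine$calories)
r_after  <- cor(wine_new$alcohol, wine_new$calories)

# Least squares slope before and after adding the point
slope_before <- coef(lm(calories ~ alcohol, data = wine))[["alcohol"]]
slope_after  <- coef(lm(calories ~ alcohol, data = wine_new))[["alcohol"]]

print(c(r_before = r_before, r_after = r_after))
print(c(slope_before = slope_before, slope_after = slope_after))
```

A single point far from the others in both alcohol content and calories has a large influence here because the sample has only six wines.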

Exercise 5
A doctor who believes strongly that antidepressants work better than "talk therapy" tests
depressed patients by treating half of them with antidepressants and the other half with talk
therapy. The doctor recruited 100 patients for the study. After six months’ treatment, the patients
will be evaluated on a scale of 1 to 5, with 5 indicating the greatest improvement. The doctor is
designing the study plan.

a. The doctor wants to put the most severe patients in the antidepressants group because he is
concerned about those patients’ conditions. Will this affect his ability to compare the
effectiveness of the antidepressants and the “talk therapy”? Explain.
Yes, this would invalidate the comparison. Severity would be confounded with treatment, so any
difference in improvement could be due to how ill the patients were at the start rather than
to the treatment itself. The doctor should instead randomly assign patients to the two groups
to ensure a fair comparison.
b. The doctor asks you whether it is acceptable for him to know which treatment each patient
receives. Explain why this practice may affect his ability to compare the two groups.
No, it would not be acceptable for the doctor to know which treatment each patient receives
if he is also the one evaluating them. Because he already believes antidepressants work
better, this knowledge could bias his ratings (experimenter bias). A study is best conducted
when it is double-blind.
c. What improvements to the plan would you recommend?
Overall, I would strongly suggest randomly assigning patients to each group. Random assignment
balances potential confounding variables, such as the severity of the patients' depression,
across the two groups. I would also use a double-blind approach, in which neither the
experimenter nor the patient knows which treatment group the patient is in, as far as that is
possible. This would further reduce any bias that may come from the experimenter or the
participant. I would possibly also suggest recruiting a larger group for the study.
