Problem Set 7

Uploaded by

Sunny Huang
Copyright
© All Rights Reserved

Public Policy 529

Fall 2024: Problem Set #7

Due Monday, November 4, end of the day

1. Facing claims that city police were engaging in racial profiling, the city of
Grand Rapids hired a consulting firm to perform a study on traffic stops in
the city. The results of this study were released in April 2017, and the
consulting firm’s report is posted on Canvas in the Problem Sets folder.
In short, the study found that Black motorists were stopped at “close to
twice the rate that would be expected given their presence in the traffic.”

It is useful to examine the study’s methodology. First, the consulting firm
collected benchmark data on the race of drivers at particular
intersections in the city. Thus, for each location, we have a sample with
information on the percentage (i.e. proportion) of drivers that are Black.
Second, the consulting firm collected data from the police department on
the race of people who were stopped near those same locations, providing
a second sample that measures the proportion of drivers that are
Black. Under the null hypothesis of no racial profiling, the percentages in
these independent samples are the same.

(a) Earlier in the course, we talked about measurement. Examine
how the consulting firm measured the benchmark data on the
race of drivers (pp. 30-39 of the study). Assess the reliability
and validity of this measurement strategy.

Hi, I apologize for submitting this problem set so late. I’d finished all
problems a while ago, except for one, and finally got around to it.
Measurement Methodology

The consulting firm measured the benchmark data by using a few methods to
reduce overall bias:

 They selected 20 different locations across Grand Rapids after going
through a specific site selection process, which included a
comprehensive review of each site.

 They conducted the surveys at each location on at least 8 different days
and times to account for the variation in traffic patterns.

 They also trained surveyors to visually identify the race and ethnicity of
drivers and recorded these driver demographics using a standardized
form.

Reliability

The method seems to be fairly reliable, since the firm took steps to make
data collection consistent across surveyors and sessions through:

 standardized training (including practice sessions and guidelines),

 a consistent data form, and

 even the presence of a GRPD officer at each session.

However, there are a few notable pitfalls in their method:

 Visual identification of race is still inherently subjective, so
classifications could vary between surveyors.

 Some traffic patterns and driver demographics could also fluctuate
based on factors that cannot be reliably accounted for by the
consulting firm.

Validity
The results make sense and align with other studies’ findings. However, visual
identification of race is still inherently subjective: a surveyor’s
classification may not match a driver’s actual race, and it could vary between
surveyors.

(b) According to the data (p. 56), at the corner of Bridge &
Stocking, 15.0% of the 2,383 drivers were Black. Out of 673
traffic stops made in that vicinity, 32.8% of the drivers were
Black. Construct a 99% confidence interval for the difference
of proportions. Be sure to use the correct standard error for a
confidence interval.

 p₁ = 0.328 (proportion of Black drivers in traffic stops)

 n₁ = 673 (sample size for traffic stops)

 p₂ = 0.150 (proportion of Black drivers in benchmark)

 n₂ = 2,383 (sample size for benchmark)

 SE = square root [ ( 0.328 ( 1 - 0.328 ) / 673 ) + ( 0.150 ( 1 -
0.150 ) / 2383 ) ]
SE = square root [ 0.000328 + 0.000054 ] = 0.0195

 For a 99% confidence interval, z = 2.576

 CI = (p₁ - p₂) ± z * SE

 CI = 0.178 ± 2.576 * 0.0195 = 0.178 ± 0.0503

 CI = (0.1277, 0.2283)
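
The interval arithmetic above can be cross-checked with a short R sketch. It assumes only the four inputs given in the problem; the variable names are illustrative, and `qnorm` supplies the critical value instead of a z table.

```r
# Inputs copied from part (b)
p1 <- 0.328; n1 <- 673    # traffic stops
p2 <- 0.150; n2 <- 2383   # benchmark drivers

# Unpooled standard error, the form used for a confidence interval
se_ci <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Two-sided 99% interval: critical value is the 99.5th normal percentile
z_crit <- qnorm(1 - 0.01 / 2)
d <- p1 - p2
ci <- c(lower = d - z_crit * se_ci, upper = d + z_crit * se_ci)

print(round(se_ci, 4))
print(round(ci, 4))
```

Because the whole interval lies above 0, it already suggests that the two proportions differ.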

(c) Now perform a significance test (α = .01) in which the null
hypothesis is that there is no difference between the
proportion of drivers who are Black and the proportion of
traffic stops that involve Black drivers. Perform all the steps
and report all relevant statistics.

 Null hypothesis H₀: There is no difference between the proportion of
Black drivers and the proportion of Black drivers in traffic stops (p₁ =
p₂).

 Alternative hypothesis H₁: There is a difference (p₁ ≠ p₂).

 Because the null hypothesis says the two proportions are equal, the
standard error for the test uses the pooled proportion rather than the
two separate sample proportions.

 Pooled p = ( 0.328 * 673 + 0.150 * 2383 ) / ( 673 + 2383 ) = 0.1892

 SE = square root [ 0.1892 ( 1 - 0.1892 ) ( 1/673 + 1/2383 ) ] = 0.0171

 Z = ( p₁ - p₂ ) / SE

 Z = (0.328 - 0.150) / 0.0171

 Z = 10.41

 For a 2-tailed test, the critical value for α = .01 is ± 2.576.

 p-value: For Z = 10.41, the two-tailed p-value is far below 0.0001.

 Since |Z| > 2.576 and p-value < 0.01, we reject the null hypothesis.

 Based on the z statistic and p-value, there is strong evidence to
conclude that there is a significant difference between the proportion
of Black drivers in the benchmark population and the proportion
stopped by police at the Bridge and Stocking location.

 This means that Black drivers are stopped at a higher rate than their
presence in the general driving population would predict.
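
One common convention for this test estimates the standard error from the pooled proportion, since the null hypothesis says both samples share a single proportion. The R sketch below follows that convention as an assumption, not as the only valid choice; the variable names are illustrative.

```r
# Inputs copied from the problem statement
p1 <- 0.328; n1 <- 673    # traffic stops
p2 <- 0.150; n2 <- 2383   # benchmark drivers

# Pooled proportion: both samples combined, as H0 (p1 = p2) implies
p_pool <- (p1 * n1 + p2 * n2) / (n1 + n2)

# Standard error under H0, then the z statistic
se_test <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z <- (p1 - p2) / se_test

# Two-tailed p-value and the decision at alpha = .01
p_value <- 2 * pnorm(-abs(z))
z_crit <- qnorm(1 - 0.01 / 2)
reject <- abs(z) > z_crit

print(c(p_pool = p_pool, se = se_test, z = z))
print(reject)
```

The unpooled standard error gives a somewhat smaller z statistic, but the decision to reject is the same either way.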

2. One way that analysts measure the level of education in a country is to
calculate the number of years of school, on average, that people in the
country have completed. Suppose that, across a sample of democracies,
the mean of this variable is 8.2 years (n=81; s=2.8), and in a sample of
non-democracies the mean is 7.1 years (n=31; s=3.1).

(a) Why might we want to make the equal variance assumption in
this case?

 The equal variance assumption (homoscedasticity) holds that the two
populations have the same variance. It determines which version of
the t-test we can use.

 We may make the equal variance assumption if we believe the
underlying variability in the years of education across democracies and
non-democracies is not significantly different.

 Making an equal variance assumption simplifies the analysis by letting
us pool the variances to create a single combined variance estimate,
which gives a more stable estimate of the standard error and more
degrees of freedom (n1 + n2 – 2).

 This assumption is most reasonable when sample sizes are similar or the
variances do not differ greatly. Here the sample standard deviations
are close (2.8 vs. 3.1), and pooling keeps the calculation of the test
statistic straightforward.

(b) Suppose that we make the equal variance assumption. Perform
a significance test for the difference of means.

Create hypotheses

 Null = mean # of years of education is the same for
democracies and non-democracies

o (𝜇1 − 𝜇2 = 0)

 Alternative = mean # of years of education differs
between democracies and non-democracies

o (𝜇1 − 𝜇2 ≠ 0)

Calculate pooled standard deviation

 x̄1 = 8.2 (mean for democracies)
 x̄2 = 7.1 (mean for non-democracies)
 n1 = 81, s1 = 2.8
 n2 = 31, s2 = 3.1

 s = square root [ ( (81 - 1)(2.8)^2 + (31 - 1)(3.1)^2 ) / (81 + 31 - 2) ]
 s = 2.88

Calculate standard error (SE) of difference in means

 SE = s * square root [ 1/81 + 1/31 ]

 SE = 0.609

Calculate t-statistic

 t = (8.2 - 7.1) / 0.609 = 1.81

Check critical value for α = .05 (two-tailed)

 df = n1 + n2 – 2 = 81 + 31 – 2 = 110
 Using t-distribution table…
 Critical value = 1.98

 t-statistic = 1.81 < critical value = 1.98 →
don’t reject Null

Interpretation

 Because the t-statistic of 1.81 is less than
the critical value of 1.98, we don’t reject the
Null at α = .05. There is no statistically significant
difference in the mean number of years of
education between democracies and non-democracies.
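
The pooled-variance steps above can be reproduced from the summary statistics alone. This is a minimal R sketch under the equal variance assumption; the variable names are illustrative, and `qt` and `pt` stand in for the printed t table.

```r
# Summary statistics from the problem statement
m1 <- 8.2; s1 <- 2.8; n1 <- 81   # democracies
m2 <- 7.1; s2 <- 3.1; n2 <- 31   # non-democracies

# Pooled standard deviation: weighted average of the two variances
dof <- n1 + n2 - 2
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / dof)

# Standard error of the difference in means, then the t statistic
se_pooled <- sp * sqrt(1 / n1 + 1 / n2)
t_stat <- (m1 - m2) / se_pooled

# Two-tailed critical value at alpha = .05 and the matching p-value
t_crit <- qt(0.975, dof)
p_value <- 2 * pt(-abs(t_stat), dof)

print(round(c(sp = sp, se = se_pooled, t = t_stat, crit = t_crit, p = p_value), 3))
```

The p-value lands above .05, which agrees with comparing the t statistic to the critical value.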

(c) Suppose we do not make the equal variance assumption.
Provide one reason to support this decision. Calculate the
standard error and degrees of freedom in this scenario. How
do these numbers compare to the equal variance assumption?

 We would not make the equal variance assumption if there were evidence
suggesting the variances in education years differ significantly
between the democracies and non-democracies.

 Evidence can come from documented differences in educational
systems or inequality in access to education between democracies and
non-democracies, which would point to structural differences between
the two groups.

Calculate standard error

 x̄1 = 8.2 (mean for democracies)
 x̄2 = 7.1 (mean for non-democracies)
 n1 = 81, s1 = 2.8
 n2 = 31, s2 = 3.1

 SE = square root [ (2.8)^2 / 81 + (3.1)^2 / 31 ]

 SE = 0.64

Calculate degrees of freedom

 df = ( s1^2/n1 + s2^2/n2 )^2 / [ ( s1^2/n1 )^2 / (n1 - 1) + ( s2^2/n2 )^2 / (n2 - 1) ]

 df = 49.8, or about 50
Comparison

Standard error

 Equal variance assumption with pooled
variance = 0.609

 Unequal variance = 0.64

 The standard error without the equal variance
assumption is slightly higher than the pooled
standard error.

 A higher SE reflects the additional
uncertainty when variances aren’t assumed to
be the same.

Degrees of freedom

 Equal variance assumption = 110

 Unequal variance assumption = 50

 The lower degrees of freedom without the equal
variance assumption reflect the additional
uncertainty when we don’t pool variances.
The test is more conservative.

 A lower df means the critical value at a given
confidence level will be larger, making it harder
to reject the Null.

 Confidence intervals without the equal variance
assumption tend to be wider here due to the larger
standard error and critical value, meaning this
approach is more cautious and less likely to result
in Type 1 errors/false positives.
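
The comparison above can be sketched in R using the Welch-Satterthwaite approximation for the degrees of freedom. That formula is assumed here as the standard choice for unequal variances, since the text does not show its calculation; the variable names are illustrative.

```r
# Summary statistics from the problem statement
s1 <- 2.8; n1 <- 81   # democracies
s2 <- 3.1; n2 <- 31   # non-democracies

# Each group's contribution to the variance of the difference in means
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Standard error without pooling, and the approximate degrees of freedom
se_welch <- sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Pooled counterparts, for comparison
df_pooled <- n1 + n2 - 2
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df_pooled)
se_pooled <- sp * sqrt(1 / n1 + 1 / n2)

print(round(c(se_pooled = se_pooled, se_welch = se_welch), 3))
print(round(c(df_pooled = df_pooled, df_welch = df_welch), 1))
# Fewer degrees of freedom means a larger two-tailed critical value
print(round(c(crit_pooled = qt(0.975, df_pooled), crit_welch = qt(0.975, df_welch)), 3))
```

The unpooled standard error is larger here because the smaller sample also has the larger standard deviation.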

(d) Without making the equal variance assumption, make a 95%
confidence interval for the difference of means.
Calculate confidence interval

 df = 50

 95% confidence interval

 t critical value with α/2 = 0.025 in each tail, df = 50: 2.01

 CI = 1.1 ± 2.01 * 0.638

 CI = (-0.182, 2.382)

Interpretation

We are 95% confident that the true difference in the average years of
education between democracies and non-democracies is between -0.182 and
2.382 years. Because the interval includes 0, the difference is not
statistically significant at the .05 level.
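
A short R sketch of the same interval follows. It assumes the Welch-Satterthwaite degrees of freedom and keeps every intermediate value unrounded, so the endpoints may differ from a hand calculation in the third decimal place. The variable names are illustrative.

```r
# Summary statistics from the problem statement
m1 <- 8.2; s1 <- 2.8; n1 <- 81   # democracies
m2 <- 7.1; s2 <- 3.1; n2 <- 31   # non-democracies

# Unpooled standard error and approximate degrees of freedom
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se_welch <- sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# 95% interval: point estimate plus or minus critical value times SE
t_crit <- qt(0.975, df_welch)
ci <- (m1 - m2) + c(-1, 1) * t_crit * se_welch

print(round(ci, 3))
# TRUE when 0 lies inside the interval (difference not significant at .05)
print(ci[1] < 0 && ci[2] > 0)
```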

3. Use the dataset InfantMortality for this question. In this dataset, the
variable infmort2000 is the level of infant mortality for each country in
the year 2000. The variable infmort2010 is the level of infant mortality for
the same set of countries in the year 2010.

(a) To measure the difference in these two infant mortality rates
for each country, create a new variable called IMdiff, which
equals infmort2010 - infmort2000.

In Stata, use the generate command:


gen IMdiff = infmort2010 - infmort2000

In R, the command is:


InfMort$IMdiff <- InfMort$infmort2010 - InfMort$infmort2000

So that you can see the result of this command, browse your results
with either browse in Stata or View(InfMort) in R. When done, use
commands to obtain the mean, standard deviation, and sample
size for IMdiff.

Report your findings in your answers. How do you interpret the
mean of IMdiff?

 Mean = -12.08

o This mean indicates that on average, the 2010 infant mortality rate
was 12.08 deaths per 1,000 live births lower than in 2000 for the
countries in this dataset.

o A negative mean indicates a decrease in infant mortality rates over
this decade.

 Standard deviation = 12.15

 Sample size = 178

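# Load the dataset (assumes InfantMortality.rdata is in the working directory)
load("InfantMortality.rdata")
# Difference in infant mortality for each country: 2010 minus 2000
InfMort$IMdiff <- InfMort$infmort2010 - InfMort$infmort2000
View(InfMort)
# Mean and standard deviation of IMdiff
mean(InfMort$IMdiff)
sd(InfMort$IMdiff)
# Frequency table of IMdiff; the counts add up to the sample size
table(InfMort$IMdiff)
# Sample size directly (assumes IMdiff has no missing values)
length(InfMort$IMdiff)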
> load("InfantMortality")
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
cannot open compressed file 'InfantMortality', probable reason 'No such file or directory'
> setwd("C:/Users/sunny/OneDrive - Umich/2024-2025/PUBPOL 529 Augmented
Statistics/Working Directory")
> load("InfantMortality")
Error in readChar(con, 5L, useBytes = TRUE) : cannot open the connection
In addition: Warning message:
In readChar(con, 5L, useBytes = TRUE) :
cannot open compressed file 'InfantMortality', probable reason 'No such file or directory'
> load("InfantMortality.rdata")
> InfMort$IMdiff<-InfMort$infmort2010-InfMort$infmort2000
> View(InfMort)
> mean(IMdiff)
Error: object 'IMdiff' not found
> mean(InfMort$IMdiff)
[1] -12.07809
> sd(InfMort$IMdiff)
[1] 12.15155
> sd(InfMort$IMdiff, na.rm = FALSE)
[1] 12.15155
> table(InfMort$IMdiff)

-64.4 -58.5 -50.5 -44.4


1 1 1 1
-42 -39.1 -39 -37.8
1 1 2 1
-36 -34.9 -34.5 -33.1
1 1 1 1
-32.3 -31.7 -30.7 -30.6
1 1 1 1
-30 -27.9 -26.9 -26.5
1 1 3 1
-26.2 -25.5 -24.7 -24.2
1 1 1 2
-24 -23.7 -23.2 -22.7
2 1 1 1
-22.3 -22.2 -22 -20.1
1 1 1 2
-19.7 -19.2 -19 -18.7
1 1 1 1
-18.4 -18.2 -17.6 -16.6
1 1 2 3
-16.5 -16.2 -16 -15.4
1 1 1 1
-15.3 -15.2 -14.5 -14.3
1 2 2 1
-14.2 -13.8 -13.6 -13
1 1 2 1
-12.3 -12.2 -11.8 -11.6

1 2 1 3
-11.2 -10.7 -10.6 -10.5
1 1 2 1
-10.4 -10.2 -10.1 -10
2 1 2 1
-9.5 -8.4 -8.2 -8.1
1 1 1 1
-7.80000000000001 -7.6 -7.5 -7.4
1 1 1 1
-7.2 -6.7 -6.6 -6.4
2 2 2 1
-6.3 -6.2 -5.8 -5.7
2 1 2 2
-5.6 -5.5 -5.3 -5.2
1 2 1 1
-5.1 -5 -4.8 -4.6
2 2 1 1
-4.5 -4.4 -4.2 -4.1
3 1 2 1
-4 -3.7 -3.69999999999999 -3.4
2 1 1 1
-3.1 -3 -2.9 -2.7
2 1 1 1
-2.6 -2.5 -2.4 -2.3
1 2 2 1
-2.2 -2.1 -2.09999999999999 -2
1 1 1 3
-1.9 -1.8 -1.7 -1.6
1 2 2 1
-1.5 -1.4 -1.3 -1.2
2 2 5 3
-1.1 -1 -0.9 -0.8
2 2 5 3
-0.5 -0.3 0.0999999999999996 0.100000000000001
1 1 1 1

(b) We learned that the mean of the differences is the same as the
difference of the means. Is that true? Use commands to find
the means of infmort2010 and infmort2000, then calculate the
difference between them. Compare this to the mean of IMdiff
that you found above.

 Yes, the mean of the differences is the same as the difference of the
means. This follows from the arithmetic of averages and holds whenever
every country has both values, as in paired (dependent) samples.

o Mean of differences = average of the differences between paired
values (infmort2010 - infmort2000 for each country).

o Difference of means = difference between the mean of the 2010 rates
and the mean of the 2000 rates.

o These two values are equal because adding up the differences country
by country gives the same total as subtracting the sum of the 2000
rates from the sum of the 2010 rates, and both are divided by the
same n.

 infmort2010 mean = 29.10

 infmort2000 mean = 41.17

 29.10 - 41.17 = -12.07

 Yes, the difference of means is the same as the mean of differences:
-12.08 using the unrounded means (29.09663 - 41.17472 = -12.07809), or
-12.07 when the means are rounded first.

# The variables live inside the InfMort data frame, so they need the InfMort$ prefix
mean(InfMort$infmort2010)
mean(InfMort$infmort2000)
> mean(infmort2010)
Error: object 'infmort2010' not found
> mean(InfMort$infmort2010)
[1] 29.09663
> mean(InfMort$infmort2000)
[1] 41.17472
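
The equality between the mean of the differences and the difference of the means can be illustrated with a few made-up numbers. The vectors below are hypothetical and are not taken from the InfantMortality dataset; the second half shows how the equality can fail once a pair is incomplete.

```r
# Hypothetical paired rates for four countries (not real data)
rate_2000 <- c(50.0, 12.5, 80.2, 6.1)
rate_2010 <- c(38.4, 9.0, 61.7, 5.2)

# With complete pairs the two quantities agree
mean_of_diffs <- mean(rate_2010 - rate_2000)
diff_of_means <- mean(rate_2010) - mean(rate_2000)
print(all.equal(mean_of_diffs, diff_of_means))

# If one 2010 value is missing, dropping NAs separately breaks the equality,
# because the two means are then taken over different sets of countries
rate_2010_na <- rate_2010
rate_2010_na[4] <- NA
mean_of_diffs_na <- mean(rate_2010_na - rate_2000, na.rm = TRUE)
diff_of_means_na <- mean(rate_2010_na, na.rm = TRUE) - mean(rate_2000)
print(c(mean_of_diffs_na, diff_of_means_na))
```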

(c) This insight from part (b) tells us that testing whether the mean of
IMdiff=0 is the same as testing whether infmort2010 and infmort2000
have different means. By hand, perform the test of whether the
mean of IMdiff=0. You have the information you need to calculate
the standard error from your summary statistics in part (a). It’s just a
one-sample test of statistical significance. Produce a t statistic and
p-value for this test.
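
The write-up does not show a worked answer for part (c). As a hedged sketch, the one-sample calculation can be reproduced in R from the mean, standard deviation, and sample size reported in part (a); the variable names are illustrative.

```r
# Summary statistics for IMdiff as reported in part (a)
m <- -12.07809   # mean of IMdiff
s <- 12.15155    # standard deviation of IMdiff
n <- 178         # sample size

# Standard error of the mean and the t statistic for H0: mean = 0
se <- s / sqrt(n)
t_stat <- (m - 0) / se

# Two-tailed p-value with n - 1 degrees of freedom
dof <- n - 1
p_value <- 2 * pt(-abs(t_stat), dof)

print(round(c(se = se, t = t_stat, df = dof), 3))
print(p_value)
```

The t statistic is negative because IMdiff is defined as 2010 minus 2000 and infant mortality fell over the decade.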

(d) Now use your software to perform a dependent samples (i.e.
paired) t-test for the difference of means. Report the results
and compare them to the test that you performed in part (c).

In Stata, the command is: ttest infmort2010 = infmort2000

In R, the command is: t.test(InfMort$infmort2000, InfMort$infmort2010, paired = TRUE)

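The paired t-test gives t = 13.261 with df = 177 and p-value < 2.2e-16, so we
reject the null hypothesis that the true mean difference is 0. The 95%
confidence interval for the mean difference is (10.28, 13.88). The sign is
positive because this command subtracts infmort2010 from infmort2000, the
reverse of IMdiff. Apart from the sign, this is the same test as a one-sample
t-test of whether the mean of IMdiff is 0.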

t.test(InfMort$infmort2000, InfMort$infmort2010, paired = TRUE)

> t.test(InfMort$infmort2000, InfMort$infmort2010, paired = TRUE)

Paired t-test

data: InfMort$infmort2000 and InfMort$infmort2010
t = 13.261, df = 177, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
10.28067 13.87551
sample estimates:
mean difference
12.07809

4. The 2016 American National Election Studies asked respondents whether
the federal government should make it easier or more difficult to buy a
gun. The possible answers were: more difficult, keep rules the same,
easier. It also asked whether they had a four-year college degree: yes,
no. The table below shows the joint frequency distribution of the
responses.

                    College Degree?
Gun Purchases       Yes       No        Total
More Difficult      1,014     1,245     2,259
Keep the Same       530       1,171     1,701
Easier              86        189       275
Total               1,630     2,605     4,235

(a) Calculate fe, the expected frequency in each cell under the
scenario that the variables are independent.

fe = (row total * column total) / grand total

Gun Purchases     Yes (Expected)                   No (Expected)
More Difficult    (2259 * 1630) / 4235 = 869.46    (2259 * 2605) / 4235 = 1389.54
Keep the Same     (1701 * 1630) / 4235 = 654.69    (1701 * 2605) / 4235 = 1046.31
Easier            (275 * 1630) / 4235 = 105.84     (275 * 2605) / 4235 = 169.16

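
The expected frequencies can be generated for every cell at once. The R sketch below is one way to do it, with the observed counts typed in from the table; the object names are illustrative.

```r
# Observed counts: rows are gun-purchase opinions, columns are college degree
observed <- matrix(c(1014, 1245,
                      530, 1171,
                       86,  189),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("More Difficult", "Keep the Same", "Easier"),
                                   c("Yes", "No")))

# fe = (row total * column total) / grand total for every cell;
# outer() forms all row-total by column-total products in one step
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

print(round(expected, 2))
```

The expected counts keep the same row and column totals as the observed table, which makes a quick sanity check.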
(b) Use the χ2 table from Canvas or lecture slides to look up the
critical value of χ2 that would be necessary to reject the null
hypothesis that the variables are independent (α=.05). You will
first have to find the correct degrees of freedom.

df = (# of rows - 1) * (# of columns - 1)

df = ( 3 - 1 ) * ( 2 - 1 ) = 2
Based on 2 degrees of freedom and α = .05, the critical value is 5.99.

(c) Calculate the χ2 statistic from the data presented above. Can
you reject the null hypothesis?

Chi-square statistic formula: χ2 = sum over all cells of (fo - fe)^2 / fe

Calculating each term for χ2 using the observed frequencies (given) and the
expected frequencies:

 More Difficult (Yes): (1014-869.46)^2 / 869.46 = 24.03

 More Difficult (No): (1245-1389.54)^2 / 1389.54 = 15.04

 Keep the Same (Yes): (530-654.69)^2 / 654.69 = 23.75

 Keep the Same (No): (1171-1046.31)^2 / 1046.31 = 14.86

 Easier (Yes): (86-105.84)^2 / 105.84 = 3.72

 Easier (No): (189-169.16)^2 / 169.16 = 2.33

χ2 = 24.03 + 15.04 + 23.75 + 14.86 + 3.72 + 2.33 = 83.73

Since the chi-square statistic of 83.73 is greater than the critical value of 5.99,
the null hypothesis can be rejected.

There is strong evidence that having a college degree and opinions on whether
the federal government should make buying a gun easier or more difficult are not
independent of each other at the 0.05 significance level.
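
The test can be cross-checked in R. This sketch computes the statistic cell by cell, looks up the critical value with `qchisq` instead of a printed table, and compares the result with base R's `chisq.test`; the object names are illustrative. Keeping the expected frequencies unrounded can move the total by about 0.01 relative to a hand calculation.

```r
# Observed counts: rows are gun-purchase opinions, columns are college degree
observed <- matrix(c(1014, 1245,
                      530, 1171,
                       86,  189),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("More Difficult", "Keep the Same", "Easier"),
                                   c("Yes", "No")))

# Expected counts under independence, then each cell's contribution
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
terms <- (observed - expected)^2 / expected
chi_sq <- sum(terms)

# Degrees of freedom, critical value at alpha = .05, and p-value
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
crit <- qchisq(0.95, dof)
p_value <- pchisq(chi_sq, dof, lower.tail = FALSE)

print(round(terms, 2))
print(round(c(chi_sq = chi_sq, df = dof, crit = crit), 3))

# Built-in test for comparison; it should report the same statistic
builtin <- chisq.test(observed)
print(builtin$statistic)
```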
