Lab 1: Introduction to data

Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process
is to summarize and describe the raw information - the data. In this lab, you will gain insight into public health by
generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and
Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data
processing and subsetting.

Getting started
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the
United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report
emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their
HIV/AIDS status, possible tobacco use, and even their level of health care coverage. The BRFSS Web site
(http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that
motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are
over 200 variables in this data set, we will work with a small subset.

We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter the
following command.

source("http://www.openintro.org/stat/data/cdc.R")

The data set cdc that shows up in your workspace is a data matrix, with each row representing a case and each
column representing a variable. R calls this data format a data frame, which is a term that will be used throughout
the labs.

To view the names of the variables, type the command

names(cdc)

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of
these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were
asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany
variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates
whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates
whether the respondent had smoked at least 100 cigarettes in their lifetime. The other variables record the
respondent’s height in inches, weight in pounds, desired weight (wtdesire), age in years, and gender.

Exercise 1 How many cases are there in this data set? How many variables? For each variable,
identify its data type (e.g. categorical, discrete).

We can have a look at the first few entries (rows) of our data with the command

head(cdc)

and similarly we can look at the last few by typing

tail(cdc)

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise
here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to
take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.
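As a minimal sketch of these small peeks, the block below builds a tiny stand-in data frame (called toy here; its values are made up and are not from the cdc data) so that it runs without downloading anything. The n argument of head and tail controls how many rows are shown, and str prints one line per variable with its type, which may help with Exercise 1.

```r
# Toy stand-in for cdc: hypothetical values, for illustration only
toy <- data.frame(
  genhlth = c("good", "excellent", "fair", "good", "very good", "poor"),
  exerany = c(1, 0, 1, 1, 0, 1),
  weight  = c(175, 125, 105, 132, 150, 114),
  age     = c(77, 33, 49, 42, 55, 31)
)

first_three <- head(toy, n = 3)  # first 3 rows instead of the default 6
last_two    <- tail(toy, n = 2)  # last 2 rows
str(toy)                         # one line per variable: name, type, first values
```

The same calls work on cdc itself, for example head(cdc, n = 3).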

Summaries and tables

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that
information into a few summary statistics and graphics. As a simple example, the function summary returns a
numerical summary: minimum, first quartile, median, mean, third quartile, and maximum. For weight this is

summary(cdc$weight)

R also functions like a very fancy calculator. If you wanted to compute the interquartile range for the respondents’
weight, you would look at the output from the summary command above and then enter

190 - 140
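The subtraction above relies on reading the first and third quartiles off the summary output by eye. As a sketch of the same calculation done entirely in code, the block below uses a short vector of hypothetical weights (w is an illustrative name, not part of the cdc data) together with the built-in functions quantile and IQR.

```r
# Hypothetical weights in pounds, not taken from the cdc data
w <- c(120, 140, 150, 165, 180, 190, 210)

q <- quantile(w, probs = c(0.25, 0.75))  # first and third quartiles
iqr_by_hand  <- unname(q[2] - q[1])      # third quartile minus first quartile
iqr_built_in <- IQR(w)                   # same quantity from the built-in function
```

For the real data, IQR(cdc$weight) should agree with the difference of the quartiles reported by summary.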

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median,
and variance of weight, type

mean(cdc$weight)
var(cdc$weight)
median(cdc$weight)

While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about
categorical data? We would instead consider the sample frequency or relative frequency distribution. The function
table does this for you by counting the number of times each kind of response was given. For example, to see the
number of people who have smoked 100 cigarettes in their lifetime, type

table(cdc$smoke100)

or instead look at the relative frequency distribution by typing

table(cdc$smoke100)/20000

Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to
something we observed in the last lab; when we multiplied or divided a vector with a number, R applied that action
across entries in the vectors. As we see above, this also works for tables. Next, we make a bar plot of the entries in
the table by putting the table inside the barplot command.

barplot(table(cdc$smoke100))

Notice what we’ve done here! We’ve computed the table of cdc$smoke100 and then immediately applied the
graphical function, barplot. This is an important idea: R commands can be nested. You could also break this into two
steps by typing the following:

smoke <- table(cdc$smoke100)
barplot(smoke)

Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into the
console) and then used it as the input for barplot. The special symbol <- performs an assignment, taking the
output of one line of code and saving it into an object in your workspace. This is another important idea that we’ll
return to later.
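The relative frequency calculation earlier divides by a hard-coded 20,000. A sketch of two alternatives that do not depend on knowing the sample size in advance is given below; it uses a short vector of hypothetical responses rather than the cdc data, and the built-in function prop.table, which is not otherwise used in this lab.

```r
# Hypothetical smoke100 responses (1 = yes, 0 = no), for illustration only
smoke100 <- c(1, 0, 0, 1, 1, 0, 0, 0)

counts        <- table(smoke100)           # frequency of each response
rel_by_length <- counts / length(smoke100) # divide by the number of cases
rel_by_prop   <- prop.table(counts)        # same result from a built-in function
```

With the real data, table(cdc$smoke100) / nrow(cdc) or prop.table(table(cdc$smoke100)) would play the same role.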

Exercise 2 Create a numerical summary for height and age, and compute the interquartile range for
each. Compute the relative frequency distribution for gender and exerany. How many males are in
the sample? What proportion of the sample reports being in excellent health?

The table command can be used to tabulate any number of variables that you provide. For example, to examine
which participants have smoked across each gender, we could use the following.

table(cdc$gender, cdc$smoke100)

Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The
rows refer to gender. To create a mosaic plot of this table, we would enter the following command.

mosaicplot(table(cdc$gender, cdc$smoke100))

We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the next
(see the table/barplot example above).
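As a sketch of that two-step approach, the block below saves a two-way table built from a small made-up data frame (toy and its values are hypothetical) and then converts the counts to proportions within each row using prop.table, a built-in function this lab does not otherwise cover. Row proportions are roughly what the mosaic plot displays as tile widths within each gender.

```r
# Hypothetical respondents, for illustration only
toy <- data.frame(
  gender   = c("m", "f", "m", "f", "f", "m", "f", "m"),
  smoke100 = c(1, 0, 0, 1, 0, 1, 0, 1)
)

# Step 1: save the two-way table of counts (rows = gender, columns = smoke100)
gender_smoke <- table(toy$gender, toy$smoke100)

# Step 2: use the saved table; here, proportions within each row (margin = 1)
row_props <- prop.table(gender_smoke, margin = 1)

# mosaicplot(gender_smoke) would draw the corresponding mosaic plot
```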

Exercise 3 What does the mosaic plot reveal about smoking habits and gender?

Interlude: How R thinks about data

We mentioned that R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a
different observation (a different respondent) and each column is a different variable (the first is genhlth, the second
exerany and so on). We can see the size of the data frame next to the object name in the workspace or we can type

dim(cdc)

which will return the number of rows and columns. Now, if we want to access a subset of the full data frame, we can
use row-and-column notation. For example, to see the sixth variable of the 567th respondent, use the format

cdc[567, 6]

which means we want the element of our data set that is in the 567th row (meaning the 567th person or observation)
and the 6th column (in this case, weight). We know that weight is the 6th variable because it is the 6th entry in the list
of variable names

names(cdc)

To see the weights for the first 10 respondents we can type

cdc[1:10, 6]

In this expression, we have asked just for rows in the range 1 through 10. R uses the “:” to create a range of values,
so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering

1:10

Finally, if we want all of the data for the first 10 respondents, type

cdc[1:10, ]

By leaving out an index or a range (we didn’t type anything between the comma and the square bracket), we get all
the columns. When starting out in R, this is a bit counterintuitive. As a rule, we omit the column number to see all
columns in a data frame. Similarly, if we leave out an index or range for the rows, we would access all the
observations, not just the 567th, or rows 1 through 10. Try the following to see the weights for all 20,000 respondents
fly by on your screen

cdc[, 6]

Recall that column 6 represents respondents’ weight, so the command above reported all of the weights in the data
set. An alternative method to access the weight data is by referring to the name. Previously, we typed names(cdc) to
see all the variables contained in the cdc data set. We can use any of the variable names to select items in our data
set.

cdc$weight

The dollar-sign tells R to look in data frame cdc for the column called weight. Since that’s a single vector, we can
subset it with just a single index inside square brackets. We see the weight for the 567th respondent by typing

cdc$weight[567]

Similarly, for just the first 10 respondents

cdc$weight[1:10]

The command above returns the same result as the cdc[1:10,6] command. Both row-and-column notation and
dollar-sign notation are widely used; which one you choose depends on your personal preference.
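To check that the two notations really agree, here is a small sketch on a made-up data frame (toy is hypothetical, with weight placed in the second column rather than the sixth). It also shows a third form, indexing a column by its name inside the square brackets, which is standard R but is not used elsewhere in this lab.

```r
# Hypothetical data frame; weight is column 2 here, not column 6 as in cdc
toy <- data.frame(
  height = c(70, 64, 60, 66, 61),
  weight = c(175, 125, 105, 132, 150),
  age    = c(77, 33, 49, 42, 55)
)

by_position <- toy[1:3, 2]         # row-and-column notation
by_name     <- toy$weight[1:3]     # dollar-sign notation
by_col_name <- toy[1:3, "weight"]  # column name inside the brackets
one_cell    <- toy[4, 2]           # weight of the 4th respondent

identical(by_position, by_name)    # TRUE: both return the same vector
```

Referring to columns by name tends to be safer than by position, since the code keeps working if the column order changes.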

A little more on subsetting

It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this
through conditioning commands. First, consider expressions like

cdc$gender == "m"

cdc$age > 30

These commands produce a series of TRUE and FALSE values. There is one value for each respondent, where
TRUE indicates that the person was male (via the first command) or older than 30 (second command).
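A small sketch of what these TRUE and FALSE series look like, using two short hypothetical vectors in place of the cdc columns. Because R treats TRUE as 1 and FALSE as 0 in arithmetic, sum counts the cases that satisfy a condition and mean gives their proportion.

```r
# Hypothetical values, for illustration only
gender <- c("m", "f", "f", "m", "f")
age    <- c(77, 33, 25, 42, 55)

is_male <- gender == "m"  # TRUE FALSE FALSE TRUE FALSE
over_30 <- age > 30       # TRUE TRUE FALSE TRUE TRUE

n_male       <- sum(is_male)   # number of TRUE values: 2
prop_over_30 <- mean(over_30)  # proportion of TRUE values: 0.8
```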

Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R
function subset to do that for us. For example, the command

mdata <- subset(cdc, cdc$gender == "m")

will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it in
your workspace alongside its dimensions, you can take a peek at the first several rows as usual

head(mdata)

This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only
specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can carve up
the data based on values of one or more variables.

As an aside, you can use several of these conditions together with & and |. The & is read “and” so that

m_and_over30 <- subset(cdc, cdc$gender == "m" & cdc$age > 30)

will give you the data for men over the age of 30. The | character is read “or” so that

m_or_over30 <- subset(cdc, cdc$gender == "m" | cdc$age > 30)

will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the
mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like
when forming a subset.
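The difference between "and" and "or" is easiest to see on a data set small enough to check by eye. The sketch below uses a made-up five-row data frame (toy is hypothetical). Inside subset the column names can be written without the toy$ prefix, and the last line shows an equivalent form that uses a logical vector in row-and-column notation.

```r
# Hypothetical respondents, for illustration only
toy <- data.frame(
  gender = c("m", "f", "m", "f", "m"),
  age    = c(25, 45, 52, 28, 31)
)

m_and_over30 <- subset(toy, gender == "m" & age > 30)  # rows 3 and 5
m_or_over30  <- subset(toy, gender == "m" | age > 30)  # rows 1, 2, 3 and 5

# Same rows as m_and_over30, selected with a logical vector in the row position
same_rows <- toy[toy$gender == "m" & toy$age > 30, ]
```

The "or" subset is never smaller than the "and" subset, since every case that meets both conditions also meets at least one.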

Exercise 4 Create a new object called under23_and_smoke that contains all observations of
respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the
command you used to create the new object as the answer to this exercise.

Quantitative data

With our subsetting tools in hand, we’ll now return to the task of the day: making basic summaries of the BRFSS
questionnaire. We’ve already looked at categorical data such as smoke and gender so now let’s turn our attention to
quantitative data. Two common ways to visualize quantitative data are with box plots and histograms. We can
construct a box plot for a single variable with the following command.

boxplot(cdc$height)

You can compare the locations of the components of the box by examining the summary statistics.

summary(cdc$height)

Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the graph.
The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several
categories. So we can, for example, compare the heights of men and women with

boxplot(cdc$height ~ cdc$gender)

The notation here is new. The ~ character can be read “versus” or “as a function of”. So we’re asking R to give us
box plots of heights where the groups are defined by gender.
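As a numerical companion to the grouped box plot, the sketch below computes the median height in each group on a small made-up data frame (toy is hypothetical). It uses tapply, a built-in function not otherwise covered in this lab, and also calls boxplot with plot = FALSE so that the statistics behind the plot are returned without drawing anything. The data argument is an alternative to writing toy$ in front of each variable.

```r
# Hypothetical heights in inches, for illustration only
toy <- data.frame(
  height = c(70, 64, 72, 62, 68, 66),
  gender = c("m", "f", "m", "f", "m", "f")
)

# Median height within each gender group
medians <- tapply(toy$height, toy$gender, median)

# Same grouping with the formula notation; plot = FALSE returns the statistics only
bp <- boxplot(height ~ gender, data = toy, plot = FALSE)
bp$names       # group labels, in the order the boxes would be drawn
bp$stats[3, ]  # third row of stats holds the median of each group
```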

Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI). BMI is a
weight to height ratio and can be calculated as

\[ \mathrm{BMI} = \frac{\text{weight (lb)}}{\text{height (in)}^2} \times 703 \]

703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and
pounds). The following two lines first make a new object called bmi and then create box plots of these values,
defining groups by the variable cdc$genhlth.

bmi <- (cdc$weight/cdc$height^2) * 703

boxplot(bmi ~ cdc$genhlth)

Notice that the first line above is just some arithmetic, but it’s applied to all 20,000 numbers in the cdc data set. That
is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703.
The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it lets us perform
computations like this using very simple expressions.
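To make the element-by-element arithmetic concrete, here is a sketch with three hypothetical respondents (the values are made up). It checks the vectorized result against a calculation done by hand for the first person, and it also shows where a factor close to 703 comes from, assuming the usual conversions of about 39.3701 inches per meter and 2.20462 pounds per kilogram.

```r
# Hypothetical weights (lb) and heights (in), for illustration only
weight <- c(175, 125, 105)
height <- c(70, 64, 60)

# One expression, applied to every respondent at once
bmi <- (weight / height^2) * 703

# The same calculation for the first respondent alone (about 25.1)
first_bmi <- (175 / 70^2) * 703

# Origin of the factor: BMI is kg / m^2, so converting from lb and in gives
# (inches per meter)^2 / (pounds per kilogram), which is close to 703
conversion <- 39.3701^2 / 2.20462
```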

Exercise 5 What does this box plot show? Pick another categorical variable from the data set and
see how it relates to BMI. List the variable you chose, why you might think it would have a
relationship to BMI, and indicate what the figure seems to suggest.

Finally, let’s make some histograms. We can look at the histogram for the age of our respondents with the
command

hist(cdc$age)

Histograms are generally a very good way to see the shape of a single distribution, but that shape can change
depending on how the data is split between the different bins. You can control the number of bins by adding an
argument to the command. In the next two lines, we first make a default histogram of bmi and then one with 50
breaks.

hist(bmi)
hist(bmi, breaks = 50)

Note that you can flip between plots that you’ve created by clicking the forward and backward arrows in the lower
right region of RStudio, just above the plots. How do these two histograms compare?
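The effect of the breaks argument can also be inspected without looking at the plots. The sketch below uses simulated values (not the real bmi object) and calls hist with plot = FALSE, which returns the bin boundaries and counts instead of drawing. Note that a single number given to breaks is treated by R as a suggestion, so the actual number of bins may not be exactly 50.

```r
# Simulated values standing in for bmi; not real survey data
set.seed(1)
x <- rnorm(1000, mean = 26, sd = 4)

h_default <- hist(x, plot = FALSE)               # default number of bins
h_fine    <- hist(x, breaks = 50, plot = FALSE)  # roughly 50 bins requested

length(h_default$counts)  # number of bins actually used by default
length(h_fine$counts)     # number of bins actually used with breaks = 50
```

More bins show finer detail in the shape of the distribution, at the cost of a noisier outline.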

At this point, we’ve done a good first pass at analyzing the information in the BRFSS questionnaire. We’ve found an
interesting association between smoking and gender, and we can say something about the relationship between
people’s assessment of their general health and their own BMI. We’ve also picked up essential computing tools –
summary statistics, subsetting, and plots – that will serve us well throughout this course.

On Your Own

1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight
(weight). Create this new variable by subtracting the two columns in the data frame and assigning the result to a
new object called wdiff.
3. What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and
desired weight? What if wdiff is positive or negative?
4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use.
What does this tell us about how people feel about their current weight?
5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight
differently than women.
6. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what
proportion of the weights are within one standard deviation of the mean.
7. What concepts from the textbook are covered in this lab? What concepts, if any, are not covered in the
textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or
homework problems? Be specific in your answer.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license
(http://creativecommons.org/licenses/by-sa/3.0). This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab
written by Mark Hansen of UCLA Statistics.
