Census Data
tidycensus is designed to return data from the US Census Bureau ready for use
within the tidyverse.
Obtain and set your Census API key
Get a Census API key and install it with census_api_key().
# Load the tidycensus package into your R session
library(tidycensus)
# Census API key
api_key <- "YOUR_CENSUS_API_KEY" # placeholder: substitute your own key and keep it out of shared scripts
census_api_key(api_key)
# Check your API key
Sys.getenv("CENSUS_API_KEY")
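Because census_api_key() stores the key in the CENSUS_API_KEY environment variable, a script can check for it before making any requests. The sketch below uses only base R; the helper name has_census_key() is hypothetical and not part of tidycensus.

```r
# Hypothetical helper (not part of tidycensus): TRUE when a key is
# available in the CENSUS_API_KEY environment variable.
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

# Warn early instead of failing on the first API request
if (!has_census_key()) {
  message("No Census API key found; set one with census_api_key() first.")
}
```

Calling census_api_key() with install = TRUE writes the key to .Renviron, so later sessions should pass this check without setting the key again.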
# Getting Census data with tidycensus
# get_decennial() and get_acs() obtain data from the decennial US Census (2010) and
# the American Community Survey (2012-2016).
# get_decennial() returns a single data value for each row; get_acs() returns an ACS
# estimate and its margin of error.
# Obtain and view state populations from the 2010 US Census
state_pop <- get_decennial(geography = "state", variables = "P001001")
# Obtain and view state median household income from the 2012-2016 ACS
state_income <- get_acs(geography = "state", variables = "B19013_001")
# tidycensus options: geography, variables, named vector, filter by state/county
# tidycensus returns tidy data frames. Set output = "wide" in get_acs() or
# get_decennial() to get a wide table.
# Get an ACS dataset for Census tracts in Texas by setting the state
tx_income <- get_acs(geography = "tract", state = "TX",
variables = "B19013_001")
# Get an ACS dataset for Census tracts in Travis County, TX
travis_income <- get_acs(geography = "tract", state = 'TX', county = 'Travis',
variables = "B19013_001")
# Custom variable names
travis_income2 <- get_acs(geography = "tract", state = "TX", county = "Travis",
variables = c(hhincome = "B19013_001"))
# Return county data in wide format
or_wide <- get_acs(geography = 'county', state = "OR",
variables = c(hhincome = "B19013_001",
medage = "B01002_001"),
output = 'wide')
# Create a scatterplot
plot(or_wide$hhincomeE, or_wide$medageE)
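To make the difference between the tidy (long) and wide layouts concrete, here is a minimal base R sketch on made-up data. The values and GEOIDs are invented for illustration; only the column naming (an E suffix for estimates, an M suffix for margins of error) mirrors what the wide output above uses.

```r
# Toy long-format data shaped like get_acs() output (all values are made up)
long <- data.frame(
  GEOID    = c("001", "001", "003", "003"),
  variable = c("hhincome", "medage", "hhincome", "medage"),
  estimate = c(50000, 40.1, 62000, 36.5),
  moe      = c(2500, 0.4, 3100, 0.6),
  stringsAsFactors = FALSE
)

# One row per GEOID, one estimate/moe column pair per variable
wide <- reshape(long, idvar = "GEOID", timevar = "variable", direction = "wide")

# Rename estimate.<var> to <var>E and moe.<var> to <var>M
names(wide) <- sub("^estimate\\.(.+)$", "\\1E", names(wide))
names(wide) <- sub("^moe\\.(.+)$", "\\1M", names(wide))

wide
```

The long layout suits grouped summaries and faceted plots, while the wide layout is convenient when two variables are compared directly, as in the scatterplot.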
Loading variables in tidycensus
load_variables() obtains a dataset of variables from a specified sample and loads
it as a browsable data frame.
In RStudio, use View() to interactively search for these variables.
# Load variables from the 2012-2016 ACS
v16 <- load_variables(year = 2016, dataset = "acs5", cache = TRUE)
# Get variables from the ACS Data Profile
v16p <- load_variables(year = 2016, dataset = 'acs5/profile', cache = TRUE)
# Get variables from the 2000 Census SF3
v00 <- load_variables(year = 2000, dataset = "sf3", cache = TRUE)
# Load the tidyverse
library(tidyverse)
# Filter for table B19001
filter(v16, str_detect(name, "B19001"))
# Use public transportation to search for related variables
filter(v16p, str_detect(label, fixed("public transportation", ignore_case = TRUE)))
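The same kind of label search can be sketched in base R on a toy table, which also shows why the spelling of the search string matters. The rows below are illustrative stand-ins, not real ACS variable codes or labels.

```r
# Toy stand-in for a load_variables() result; rows are illustrative only
vars <- data.frame(
  name  = c("X_001", "X_002", "X_003"),
  label = c("Total",
            "Car, truck, or van",
            "Public transportation (excluding taxicab)"),
  stringsAsFactors = FALSE
)

# Case-insensitive match on the label column
hits <- vars[grepl("public transportation", vars$label, ignore.case = TRUE), ]

# A misspelled pattern silently matches nothing
misses <- vars[grepl("public transporation", vars$label, ignore.case = TRUE), ]

nrow(hits)
nrow(misses)
```

An empty result is not an error, so it is worth checking the row count after any label search.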
# Comparing geographies with ggplot2 dotplot
# Access the 1-year ACS with the survey parameter
ne_income <- get_acs(geography = "state", state = c("ME", "NH", "VT", "MA", "RI",
"CT", "NY"),
variables = "B19013_001", survey = "acs1")
# Create a dot plot
ggplot(ne_income, aes(x = estimate, y = NAME)) +
geom_point()
# Reorder the states in descending order of estimates
ggplot(ne_income, aes(x = estimate, y = reorder(NAME, estimate))) +
geom_point()
# Set dot color and size
ggplot(ne_income, aes(x = estimate, y = reorder(NAME, estimate))) +
geom_point(color = "navy", size = 4) +
scale_x_continuous(labels = scales::dollar) +
theme_minimal(base_size = 18) +
labs(x = "2016 ACS estimate", y = "", title = "Median household income by state")
Download and view a table of data from the ACS
Variables in the ACS are organized into tables, with a common prefix. Commonly,
analysts will want to work with all variables in a given table, as these variables
might represent different aspects of a common characteristic (such as race or
income levels). To request data for an entire table in tidycensus, specify a table
argument with the table prefix, and optionally cache a dataset of table codes to
speed up table searching in future requests. In this exercise, you'll acquire a
table of variables representing different income bands, then filter out the
denominator rows.
Get a summary variable and calculate percentages
Many variables in the ACS are represented as counts or estimated counts. While
count data is useful for some applications, it is often good practice to normalize
count data by its denominator to convert it to a proportion or percentage for
better comparisons. In tidycensus, the summary_var argument allows users to request
that a variable is given its own column in a tidy Census dataset. When the
summary_var parameter is requested in get_acs(), both summary_est and summary_moe
are returned.
# Download table "B19001"
wa_income <- get_acs(geography = "county", state = "WA",
table = "B19001")
# Assign Census variables vector to race_vars
race_vars <- c(White = "B03002_003", Black = "B03002_004", Native = "B03002_005",
Asian = "B03002_006", HIPI = "B03002_007", Hispanic = "B03002_012")
# Request a summary variable from the ACS
ca_race <- get_acs(geography = "county", state = "CA",
variables = race_vars, summary_var = "B03002_001")
ca_race_pct <- ca_race %>%
mutate(pct = 100 * (estimate / summary_est))
# Finding the largest group by county
# Group the dataset and filter the estimate
ca_largest <- ca_race %>% group_by(GEOID) %>%
filter(estimate == max(estimate))
head(ca_largest)
# A tibble: 6 x 7
# Groups:   GEOID [6]
  GEOID NAME                    variable estimate   moe summary_est summary_moe
  <chr> <chr>                   <chr>       <dbl> <dbl>       <dbl>       <dbl>
1 06001 Alameda County, Califo~ White      523797   541     1605217          NA
2 06003 Alpine County, Califor~ White         804   164        1184         191
3 06005 Amador County, Califor~ White       29436     7       36963          NA
4 06007 Butte County, Californ~ White      164398   199      223877          NA
5 06009 Calaveras County, Cali~ White       36857    54       44787          NA
6 06011 Colusa County, Califor~ Hispanic    12351    NA       21361          NA
# Group the dataset and get a breakdown of the results
ca_largest %>%
group_by(variable) %>% tally()
# Recoding variables and calculating group sums
# Use a tidy workflow to wrangle ACS data
wa_grouped <- wa_income %>% filter(variable != "B19001_001") %>%
mutate(incgroup = case_when(variable < "B19001_008" ~ "below35k",
variable < "B19001_013" ~ "35kto75k",
TRUE ~ "above75k")) %>%
group_by(NAME, incgroup) %>%
summarize(group_est = sum(estimate))
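The recoding above compares variable codes as strings. That works because the codes share a prefix and their numeric suffixes are zero-padded, so alphabetical order matches numeric order. A small base R sketch of the same logic, assuming the table's non-total variables run from B19001_002 to B19001_017:

```r
# Variable codes B19001_002 ... B19001_017 (the _001 total is excluded)
codes <- sprintf("B19001_%03d", 2:17)

# Same cut points as the case_when() call, written with nested ifelse()
incgroup <- ifelse(codes < "B19001_008", "below35k",
            ifelse(codes < "B19001_013", "35kto75k", "above75k"))

# How many income bands fall in each group
table(incgroup)
```

Without zero-padding the comparison would break: as strings, "10" sorts before "9".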
Comparing ACS estimates for multiple years
The ACS is updated every year.
purrr::map_df() (loaded with the tidyverse) iterates over a sequence of values,
applies a function to each one, and combines the results into a single data frame.
# Map through ACS1 estimates to see how they change through the years
mi_cities <- map_df(2012:2016, function(x) {
get_acs(geography = "place", state = "MI",
survey = 'acs1', variables = c(totalpop = "B01003_001"),
year = x) %>%
mutate(year = x) })
mi_cities %>% arrange(NAME, year)
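The iterate-then-combine pattern can be tried without any API calls. In this base R sketch, fake_request() is a hypothetical stand-in for get_acs() that returns made-up numbers; lapply() plus rbind plays the role of map_df().

```r
# Hypothetical stand-in for get_acs(); returns invented values for two places
fake_request <- function(year) {
  data.frame(NAME = c("City A", "City B"),
             estimate = c(100, 200) + (year - 2012),
             stringsAsFactors = FALSE)
}

years <- 2012:2016

# Request each year, tag the rows with the year, then stack the results
combined <- do.call(rbind, lapply(years, function(x) {
  out <- fake_request(x)
  out$year <- x
  out
}))

# Sort so each place's time series reads top to bottom
combined <- combined[order(combined$NAME, combined$year), ]
head(combined)
```

Adding the year column inside the function is the important step, since the individual results do not otherwise record which request produced them.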
Inspecting margins of error
ACS margins of error by default represent a 90 percent confidence level around an
estimate.
# Get data on elderly poverty by Census tract in Vermont
vt_eldpov <- get_acs(geography = "tract", state = "VT",
variables = c(eldpovm = "B17001_016",
eldpovf = "B17001_030"))
# Identify rows with greater margins of error than their estimates
moe_check <- filter(vt_eldpov, moe > estimate)
# Check proportion of rows where the margin of error exceeds the estimate
nrow(moe_check) / nrow(vt_eldpov)
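Since published ACS margins of error use a 90 percent confidence level, they can be converted to standard errors or to another confidence level with the usual normal critical values (1.645 for 90 percent, 1.96 for 95 percent). A minimal base R sketch with two example margins:

```r
# Two example 90 percent margins of error
moe90 <- c(164, 541)

# Standard error implied by a 90 percent margin of error
se <- moe90 / 1.645

# Equivalent margin of error at the 95 percent level
moe95 <- se * 1.96

round(moe95, 1)
```

get_acs() also takes a moe_level argument, which should make this conversion unnecessary when the data is requested at the desired level from the start.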
Using margin of error functions in tidycensus
tidycensus includes four functions, based on formulas recommended by the US
Census Bureau, for calculating margins of error for derived estimates:
moe_sum()
moe_product()
moe_ratio()
moe_prop()
# Calculate a margin of error for a sum
moe_sum(moe = c(55, 33, 44, 12, 4))
# Calculate a margin of error for a product
moe_product(est1 = 55, est2 = 33, moe1 = 12, moe2 = 9)
# Calculate a margin of error for a ratio
moe_ratio(num = 1000, denom = 950, moe_num = 200, moe_denom = 177)
# Calculate a margin of error for a proportion
moe_prop(num = 374, denom = 1200, moe_num = 122, moe_denom = 333)
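For reference, the four calls above can be reproduced by hand in base R. The formulas below are the standard Census Bureau approximations for derived estimates as I understand them; they are written out here as a sketch and are not copied from the tidycensus source, which may handle edge cases differently.

```r
# Sum (or difference): root of the sum of squared margins of error
sum_moe <- sqrt(sum(c(55, 33, 44, 12, 4)^2))

# Product of two estimates
prod_moe <- sqrt((55 * 9)^2 + (33 * 12)^2)

# Ratio, where the numerator is not a subset of the denominator
r <- 1000 / 950
ratio_moe <- sqrt(200^2 + r^2 * 177^2) / 950

# Proportion, where the numerator is a subset of the denominator
p <- 374 / 1200
prop_moe <- sqrt(122^2 - p^2 * 333^2) / 1200

c(sum = sum_moe, product = prod_moe, ratio = ratio_moe, prop = prop_moe)
```

The proportion formula subtracts under the square root, so the radicand can turn negative for some inputs; Census Bureau guidance in that case is to use the ratio formula instead.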
Calculate group-wise margins of error
One way to reduce margins of error in an ACS analysis is to combine estimates when
appropriate. This can be accomplished using tidyverse group-wise data analysis
tools. In this exercise, you'll combine estimates for male and female elderly
poverty in Vermont, and use the moe_sum() as part of this group-wise analysis.
While you may lose some detail with this type of approach, your estimates will be
more reliable relative to their margins of error than before you combined them.
# Group the dataset and calculate a derived margin of error
vt_eldpov2 <- vt_eldpov %>%
group_by(GEOID) %>%
summarize(
estmf = sum(estimate),
moemf = moe_sum(moe = moe, estimate = estimate)
)
# Filter rows where newly-derived margin of error exceeds newly-derived estimate
moe_check2 <- filter(vt_eldpov2, moemf > estmf)
# Check proportion of rows where margin of error exceeds estimate
nrow(moe_check2) / nrow(vt_eldpov2)
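A toy calculation shows why combining helps. The two estimates and margins below are made up; the combined margin uses the root-sum-of-squares rule for sums.

```r
# Two small made-up estimates with large margins of error
est <- c(10, 14)
moe <- c(12, 13)

# Combined estimate and its derived margin of error
est_sum  <- sum(est)
moe_comb <- sqrt(sum(moe^2))

# Margin of error relative to the estimate, before and after combining
moe / est
moe_comb / est_sum
```

The combined margin grows more slowly than the combined estimate, so the relative margin of error drops even though the absolute margin is larger than either input.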
# Quick visual exploration of ACS margins of error
# Request median household income data
maine_inc <- get_acs(geography = "county", state = "ME",
variables = c(hhincome = "B19013_001"))
# Generate horizontal error bars with dots
ggplot(maine_inc, aes(x = estimate, y = NAME)) +
geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
geom_point()
# Customizing a ggplot2 margin of error plot
# Remove unnecessary content from the county's name
maine_inc2 <- maine_inc %>%
mutate(NAME = str_replace(NAME, " County, Maine", ""))
# Build a margin of error plot incorporating your modifications
ggplot(maine_inc2, aes(x = estimate, y = reorder(NAME, estimate))) +
geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
geom_point(size = 3, color = "darkgreen") +
theme_grey(base_size = 14) +
labs(title = "Median household income",
subtitle = "Counties in Maine",
x = "ACS estimate (bars represent margins of error)",
y = "") +
scale_x_continuous(labels = scales::dollar)
Getting Census boundary files with tigris
The US Census Bureau's TIGER/Line shapefiles include boundary files for the
geography at which decennial Census and ACS data are aggregated. These geographies
include legal entities that have legal standing in the U.S., such as states and
counties, and statistical entities used for data tabulation such as Census tracts
and block groups. In this exercise, you'll use the tigris package to acquire such
boundary files for counties in Colorado and Census tracts for Colorado's Denver
County, which covers the city of Denver.
Getting geographic features with tigris
In addition to enumeration units, the TIGER/Line database produced by the Census
Bureau includes geographic features. These features consist of several datasets for
use in thematic mapping and spatial analysis, such as transportation infrastructure
and water features. In this exercise, you'll acquire and plot roads and water data
with tigris.
library(tigris)
# Get a counties dataset for Colorado and plot it
co_counties <- counties(state = "CO")
plot(co_counties)
# Get a Census tracts dataset for Denver County, Colorado and plot it
denver_tracts <- tracts(state = "CO", county = "Denver")
plot(denver_tracts)
# Plot area water features for Lane County, Oregon
lane_water <- area_water(state = "OR", county = "Lane")
plot(lane_water)
# Plot primary & secondary roads for the state of New Hampshire
nh_roads <- primary_secondary_roads(state = "NH")
plot(nh_roads)
Understanding the structure of tigris objects
In tigris versions before 1.0, the default return class is Spatial*DataFrame from
the sp package (newer versions return sf objects by default).
Objects of class Spatial* represent components of spatial data in different slots,
which include descriptions of the object's geometry, attributes, and coordinate
system. In this exercise, we'll briefly examine the structure of objects returned
by tigris functions.
# Check the class of the data
class(co_counties)
# Take a look at the information in the data slot
head(co_counties@data)
# Check the coordinate system of the data
co_counties@proj4string
TIGER/Line and cartographic boundary files
In addition to its TIGER/Line shapefiles, the US Census Bureau releases
cartographic boundary shapefiles for enumeration units. TIGER/Line shapefiles
correspond to legal boundaries of units, which can include water area and in turn,
may not be preferable for thematic mapping. The Census Bureau's cartographic
boundary shapefiles are clipped to the US shoreline and are generalized, which can
make them superior for mapping projects. In this exercise, you'll compare the
TIGER/Line and cartographic boundary representations of the US state of Michigan.
# Get a counties dataset for Michigan
mi_tiger <- counties("MI")
# Get the equivalent cartographic boundary shapefile
mi_cb <- counties("MI", cb = TRUE)
# Overlay the two on a plot to make a comparison
plot(mi_tiger)
plot(mi_cb, add = TRUE, border = "red")
Getting data as simple features objects
The sf package, which stands for simple features, promises to revolutionize the way
that vector spatial data are handled within R. sf objects represent spatial data
much like regular data frames, with a list-column that contains the geometry of the
geographic dataset. tigris can return spatial data as simple features objects
either by declaring class = "sf" within a function call or by setting the
tigris_class global option. In this exercise, you'll get acquainted with simple
features in tigris.
# Get data from tigris as simple features
options(tigris_class = "sf")
# Get counties from Colorado and view the first few rows
colorado_sf <- counties("CO")
head(colorado_sf)
# Plot its geometry column
plot(colorado_sf$geometry)
Setting a cache directory
Spatial data from the US Census Bureau can get very big - sometimes hundreds of
megabytes in size. By default, tigris functions download data from the US Census
Bureau's website - but this can get tiresome if downloading the same large datasets
over and over. To resolve this, tigris includes an option to cache downloaded data
on a user's computer for future use, meaning that files only have to be downloaded
from the Census website once. In this exercise, you'll get acquainted with the
caching functionality in tigris.
# Set the cache directory
tigris_cache_dir("Your preferred cache directory path would go here")
# Set the tigris_use_cache option
options(tigris_use_cache = TRUE)
# Check to see that you've modified the option correctly
getOption("tigris_use_cache")
Working with historic shapefiles
To ensure clean integration with the tidycensus package, tigris defaults to
returning shapefiles that correspond to the year of the most recently-released
ACS data. However, you may want boundary
files for other years. tigris allows R users to obtain shapefiles for 1990, 2000,
and 2010 through 2017, which represent many boundary changes over time. In this
exercise, you'll use tigris to explore how Census tract boundaries have changed in
Williamson County, Texas between 1990 and 2016.
# Get a historic Census tract shapefile from 1990 for Williamson County, Texas
williamson90 <- tracts(state = "TX", county = "Williamson", year = 1990, cb = TRUE)
# Compare with a current dataset for 2016
williamson16 <- tracts(state = "TX", county = "Williamson", year = 2016, cb = TRUE)
# Plot the geometry to compare the results
par(mfrow = c(1, 2))
plot(williamson90$geometry)
plot(williamson16$geometry)
## Combining datasets of the same tigris type: rbind_tigris()
# Get Census tract boundaries for Oregon and Washington
or_tracts <- tracts("OR", cb = TRUE)
wa_tracts <- tracts("WA", cb = TRUE)
# Check the tigris attributes of each object
attr(or_tracts, 'tigris')
attr(wa_tracts, "tigris")
# Combine the datasets then plot the result
or_wa_tracts <- rbind_tigris(or_tracts, wa_tracts)
plot(or_wa_tracts$geometry)
## Getting data for multiple states: use tidyverse map() with rbind_tigris().
# Generate a vector of state codes and assign to new_england
new_england <- c("ME", "NH", "VT", "MA")
# Iterate through the states and request tract data for each state
ne_tracts <- map(new_england, function(x) {
tracts(state = x, cb = TRUE)
}) %>%
rbind_tigris()
plot(ne_tracts$geometry)
Joining data from an external data frame
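The sf package enables the use of the tidyverse *_join() functions on simple
features objects, so attribute data can be joined to boundaries directly.
Get a dataset of legislative district boundaries for Texas with
state_legislative_districts(), setting house = "lower" to request data for the
House of Representatives. The code below assumes tx_members is an already-loaded
data frame of legislators with District and Party columns.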
# Get boundaries for Texas and set the house parameter
tx_house <- state_legislative_districts(state = "TX", house = "lower", cb = TRUE)
# Merge data on legislators to their corresponding boundaries
tx_joined <- left_join(tx_house, tx_members, by = c("NAME" = "District"))
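The behavior of a left join with differently named key columns can be checked on toy data first. This base R sketch uses merge() with all.x = TRUE as a stand-in for left_join(); the district numbers and parties are made up.

```r
# Toy boundaries table (attributes only) and toy members table
districts <- data.frame(NAME = c("1", "2", "3"), stringsAsFactors = FALSE)
members <- data.frame(District = c("1", "3"),
                      Party = c("R", "D"),
                      stringsAsFactors = FALSE)

# Keep every district, matching NAME on the left to District on the right
joined <- merge(districts, members,
                by.x = "NAME", by.y = "District", all.x = TRUE)

joined
```

Districts without a matching member are kept with NA in the joined columns. Both key columns here are character; if one side were numeric, it would be safer to convert it before joining.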
Plotting simple features with ggplot2::geom_sf()
# Plot the legislative district boundaries
ggplot(tx_joined) +
geom_sf()
# Set fill aesthetic to map areas represented by Republicans and Democrats
ggplot(tx_joined, aes(fill = Party)) +
geom_sf()
# Set values so that Republican areas are red and Democratic areas are blue
# remove gridlines and add informative title
ggplot(tx_joined, aes(fill = Party)) +
geom_sf() +
coord_sf(crs = 3083, datum = NA) +
scale_fill_manual(values = c("R" = "red", "D" = "blue")) +
theme_minimal(base_size = 16) +
labs(title = "State House Districts in Texas")
Getting simple feature geometry
tidycensus can obtain simple feature geometry for many geographies by adding the
argument geometry = TRUE. In this exercise, you'll obtain a dataset of median
housing values for owner-occupied units by Census tract in Orange County,
California with simple feature geometry included.
library(tidycensus)
library(tidyverse)
library(sf)
# Get dataset with geometry set to TRUE
orange_value <- get_acs(geography = "tract", state = "CA",
county = "Orange",
variables = "B25077_001",
geometry = TRUE)
# Plot the estimate to view a map of the data
plot(orange_value["estimate"])
Joining data from tigris and tidycensus
Geometry is currently supported in tidycensus for geographies in the core Census
hierarchy - state, county, Census tract, block group, and block - as well as zip
code tabulation areas. However, you may be interested in mapping data for other
geographies. In this case, you can download the equivalent boundary file from the
Census Bureau using the tigris package and join your demographic data to it for
mapping.
# Get an income dataset for Idaho by school district
idaho_income <- get_acs( geography = 'school district (unified)',
variables = "B19013_001",
state = "ID")
# Get a school district dataset for Idaho
idaho_school <- school_districts(state = "ID", type = "unified", class = "sf")
# Join the income dataset to the boundaries dataset
id_school_joined <- left_join(idaho_school, idaho_income, by = "GEOID")
plot(id_school_joined["estimate"])
Shifting Alaska and Hawaii geometry
Analysts will commonly want to map data for the entire United States by state or
county; however, this can be difficult by default as Alaska and Hawaii are distant
from the continental United States. A common solution is to rescale and shift
Alaska and Hawaii for mapping purposes, which is supported in tidycensus. You'll
learn how to do this in this exercise.
# Get a dataset of median home values from the 1-year ACS
state_value <- get_acs(geography = 'state',
variables = "B25077_001", survey = "acs1",
geometry = TRUE, shift_geo = TRUE)
# Plot the dataset to view the shifted geometry
plot(state_value["estimate"])
Making a choropleth map
Use geom_sf() from ggplot2 and map a data column to the fill aesthetic.
Map colors
ggplot2 v3.0 integrated the viridis color palettes, which are perceptually uniform
and legible to colorblind individuals and in black and white.
scale_fill_viridis_c(): continuous viridis palette for fill color.
scale_color_viridis_c(): continuous viridis palette for borders.
scale_*_viridis_d(): discrete viridis palettes.
Map output
Customize map elements and add some descriptive information to provide context to
your map.
library(ggplot2)
# Create a choropleth map with ggplot
# (marin_value is assumed to be already loaded: tract-level ACS estimates with geometry)
ggplot(marin_value, aes(fill = estimate)) +
geom_sf()
# Set continuous viridis palettes for your map
ggplot(marin_value, aes(fill = estimate, color = estimate)) +
geom_sf() +
scale_fill_viridis_c() +
scale_color_viridis_c()
# Set the color guide to FALSE and add a subtitle and caption to your map
ggplot(marin_value, aes(fill = estimate, color = estimate)) +
geom_sf() +
scale_fill_viridis_c(labels = scales::dollar) +
scale_color_viridis_c(guide = FALSE) +
theme_minimal() +
coord_sf(crs = 26911, datum = NA) +
labs(title = "Median owner-occupied housing value by Census tract",
subtitle = "Marin County, California",
caption = "Data source: 2012-2016 ACS.\nData acquired with the R tidycensus package.",
fill = "ACS estimate")
Graduated symbol maps
There are many other effective ways to map statistical data besides choropleth
maps. One such example is the graduated symbol map, which represents statistical
variation by differently-sized symbols. In this exercise, you'll learn how to use
the st_centroid() tool in the sf package to create points at the center of each
state to be used as inputs to a graduated symbol map in ggplot2.
library(sf)
# Generate point centers
centers <- st_centroid(state_value)
# Set size parameter and the size range
ggplot() +
geom_sf(data = state_value, fill = "white") +
geom_sf(data = centers, aes(size = estimate), shape = 21,
fill = "lightblue", alpha = 0.7, show.legend = "point") +
scale_size_continuous(range = c(1, 20))
Faceted maps with ggplot2
One of the most powerful features of ggplot2 is its support for faceted plotting,
in which multiple plots are generated for unique values of a column in the data.
Faceted maps can be produced with geom_sf() in this way as well if tidycensus data
are in tidy format. In this exercise, you'll produce faceted maps that show the
racial and ethnic geography of Washington, DC from the 2010 decennial Census.
# Check the first few rows of the loaded dataset dc_race
head(dc_race)
# Remove the gridlines and generate faceted maps
ggplot(dc_race, aes(fill = percent, color = percent)) +
geom_sf() +
coord_sf(datum = NA) +
facet_wrap(~variable)
Interactive visualization with mapview
The mapview R package allows R users to interactively map spatial datasets in one
line of R code. This makes it an essential tool for exploratory spatial data
analysis in R. In this exercise, you'll learn how to quickly map tidycensus data
interactively using mapview and your Orange County, California median housing
values dataset.
library(mapview)
# Map the orange_value dataset interactively
m <- mapview(orange_value)
m@map
# Map your data by the estimate column
m <- mapview(orange_value, zcol = "estimate")
m@map
# Add a legend to your map
m <- mapview(orange_value, zcol = "estimate", legend = TRUE)
m@map
Generating random dots with sf
Dot-density maps are created by randomly placing dots within areas where each dot
is proportional to a certain number of observations. In this exercise, you'll learn
how to create dots in this way with the sf package using the st_sample() function.
You will generate dots that are proportional to about 100 people in the decennial
Census, and then you will group the dots to speed up plotting with ggplot2.
# Generate dots, create a group column, and group by group column
dc_dots <- map(c("White", "Black", "Hispanic", "Asian"), function(group) {
dc_race %>%
filter(variable == group) %>%
st_sample(., size = .$value / 100) %>%
st_sf() %>%
mutate(group = group)
}) %>%
reduce(rbind) %>%
group_by(group) %>%
summarize()
Obtaining data for cartography with tigris
Before making your dot-density map of Washington, DC with ggplot2, it will be
useful to acquire some ancillary cartographic data with the tigris package that
will help map viewers understand what you've visualized. These datasets will
include major roads in DC; area water features; and the boundary of the District of
Columbia, which you'll use as a background in your map.
# Filter the DC roads object for major roads only
dc_roads <- roads("DC", "District of Columbia") %>%
filter(RTTYP %in% c("I", "S", "U"))
# Get an area water dataset for DC
dc_water <- area_water("DC", "District of Columbia")
# Get the boundary of DC
dc_boundary <- counties("DC", cb = TRUE)
Making a dot-density map with ggplot2
In your final exercise of this course, you are going to put together the datasets
you've acquired and generated into a dot-density map with ggplot2. You'll plot your
generated dots and ancillary datasets with geom_sf(), and add some informative map
elements to create a cartographic product.
# Plot your datasets and give your map an informative caption
ggplot() +
geom_sf(data = dc_boundary, color = NA, fill = "white") +
geom_sf(data = dc_dots, aes(color = group, fill = group), size = 0.1) +
geom_sf(data = dc_water, color = "lightblue", fill = "lightblue") +
geom_sf(data = dc_boundary, color = "grey") +
coord_sf(crs = 26918, datum = NA) +
scale_color_brewer(palette = "Set1", guide = FALSE) +
scale_fill_brewer(palette = "Set1") +
labs(title = "The racial geography of Washington, DC",
subtitle = "2010 decennial U.S. Census",
fill = "",
caption = "1 dot = approximately 100 people.\nData acquired with the R tidycensus and tigris packages.")