Section1 Exercise1 PerformDataEngineeringTasks
Exercise
Perform data engineering tasks
Section 1 Exercise 1
October 5, 2022
Spatial Data Science
Time to complete
90 minutes
Introduction
Data engineering is a fundamental part of every analysis. The term refers to the planning,
preparation, and processing of data to make it more useful for analysis. It can include simple
tasks like identifying and correcting imperfections in your data and calculating new fields. It
can also include more complex tasks like reducing the dimensions of a multivariate dataset.
Data engineering also involves the process of geoenriching your data, which can include
various tasks.
In this exercise, you will use ArcGIS Notebooks and the Data Engineering view in ArcGIS Pro
to perform data engineering tasks. These tasks will use the built-in tools that are available with
these products as well as tools that are available by integrating open source libraries.
Exercise scenario
Because voting is voluntary in the United States, the level of voter participation (referred to as
"voter turnout") has a significant impact on the election results and resulting public policy.
Modeling voter turnout, and understanding where low turnout is prevalent, can inform
outreach efforts to increase voter participation. With the ultimate goal of predicting voter
turnout, in this exercise, you will focus on performing various data engineering tasks to
prepare election result data for predictive analysis.
Throughout this course, you will save all your data to the EsriTraining folder. When you
create the folder, do not include any spaces or special characters in the folder name.
d Extract the exercise data files to the EsriTraining folder on your local computer.
You downloaded and extracted the exercise data files that you will need to complete this
section of the MOOC.
e If your computer does not meet these requirements, check the Common Questions to
find links to complete any other recommended updates, and then run the test again.
Note: If your computer does not meet the requirements, you may need to use a different
computer or update your graphics card. For more information about graphics card
requirements, see ArcGIS Pro Help: ArcGIS Pro 3.0 system requirements (hardware
requirements).
f If your computer meets the requirements, continue to the step that is titled Locate your
course account to install ArcGIS Pro.
You ran and saved a report that told you whether your computer can support ArcGIS Pro 3.0.
b For Microsoft .NET, click the link in the Minimum Requirement column, and follow the
instructions to download and install Microsoft .NET Desktop Runtime 6.0.5 - Windows
x64.
After you successfully download and install Microsoft .NET Desktop Runtime 6.0.5, you are
ready to download and install ArcGIS Pro 3.0.
You will use your course ArcGIS account username and password to download ArcGIS Pro and
complete all the MOOC exercises. The username for your account ends with _sds (for
example, jdoe_sds). You may want to write down the username and password for quick
reference, or you can always return to the Lessons tab to locate your credentials.
Note: If you registered in the last few hours, your account may not be ready. Refresh the page
in an hour or so to see whether your account information is available.
If you already installed ArcGIS Pro 3.0, you can skip the remaining actions and move to the
next step, which is titled Sign in to ArcGIS Pro.
Note: To learn how to enable private browsing, go to How to Enable Private Browsing on any
Web Browser.
f Under ArcGIS Login, copy and paste or type your course ArcGIS username and password.
Note: An automated email will be sent to the email address that is associated with the
account, telling you that your account was recently modified. No action is required.
After the sign-in process is complete, you will see the home page of the MOOC organization.
j In the upper-right corner, click your account username, and then click My Settings.
k On the left side of the page, under My Settings, click the Licenses tab.
Note: You can run ArcGIS Pro in a different language by clicking the down arrow next to
English (Version 3.0) and choosing a different supported language. Keep in mind that this
course is taught in English, which means that all screenshots and exercises use the English
version of ArcGIS Pro.
n Click Download.
If the default download location does not have enough space, you can change the location by
following the steps in How to Change the File Download Location in Your Browser.
q When you are finished installing ArcGIS Pro, close the incognito web browser window.
b Sign in with the provided course ArcGIS account username that ends in _sds.
Note: The course ArcGIS account username and password are listed on the MOOC home
page under Lessons. The username for this account ends with _sds (for example, jdoe_sds).
a On the Start page, near Recent Projects, click Open Another Project.
Note: If you have configured ArcGIS Pro to start without a project template or with a default
project, you will not see the Start page. On the Project tab, click Open, and then click Open
Another Project.
d Click OK.
Your ArcGIS Pro project opens to a gray reference map, which is called a basemap. Because
you are preparing United States election data, the basemap is currently focused on the
contiguous United States.
Above the map is the ArcGIS Pro ribbon. ArcGIS Pro uses this horizontal ribbon to display and
organize functionality into a series of tabs. On the Map tab is the Navigate group, which
provides the tools that you need to navigate the map. The default tool is the Explore tool,
which you can use to pan and zoom in and out of maps. To explore different areas of the
world on this basemap, pan the map by clicking your mouse and holding down the button
while you move the map. When you pan a map with the mouse, the pointer becomes a hand.
Zoom in or out of the map by using the mouse wheel or by using the Fixed Zoom In and
Fixed Zoom Out buttons in the Navigate group.
To the side of the map is the Contents pane, which lists the layers that have been added to
the map. Also to the side of the map is the Catalog pane, which lists the items that are
associated with this ArcGIS Pro package—Maps, Toolboxes, Notebooks, Databases, Styles,
Folders, and Locations.
If you do not see the Contents or Catalog panes, from the View tab, in the Windows group,
click either the Contents button or the Catalog Pane button.
To learn more about the ArcGIS Pro interface, see ArcGIS Pro Help: ArcGIS Pro user interface.
To learn more about ArcGIS Pro projects, see ArcGIS Pro Help: Projects in ArcGIS Pro.
A notebook opened in the ArcGIS Pro project. The first few cells in this notebook are
Markdown cells that help to explain the exercise.
a In the notebook, double-click the first Markdown cell that is titled Data Engineering.
Markdown cells use hashtags (#) to determine the size and format of the explanatory text.
c Add a space between the hashtag and the words Data Engineering.
The text font style and size change to make the text appear more like a heading.
Note: Adding more hashtags decreases the size of the font. If you are familiar with
HTML, you can think of this action as switching between heading tags (<h1>, <h2>, <h3>). Be
sure to maintain a space between the hashtag and your text; otherwise, the line will
appear as regular text.
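The relationship between the number of hashtags and the heading level can be sketched in a few lines of Python. This is only an illustration, assuming CommonMark-style rules (one to six hashtags followed by a space); the heading_tag function is a hypothetical helper and is not part of ArcGIS Notebooks.

```python
import re

# A heading line: one to six '#' characters, at least one space, then the text.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.*\S)\s*$")


def heading_tag(line):
    """Return the HTML heading tag for a Markdown line, or None for regular text."""
    match = HEADING.match(line)
    if match is None:
        return None
    return f"h{len(match.group(1))}"


for line in ["#Data Engineering", "# Data Engineering", "## Data Engineering"]:
    print(repr(line), "->", heading_tag(line))
```

The line without a space after the hashtag is treated as regular text, which matches the behavior described in the note above.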
Note: Alternatively, you can select the cell and press Shift + Enter on your keyboard.
Running a Markdown cell will apply the formatting that you have indicated in the cell.
Similarly, running a Code cell will execute the code that you have written in the cell.
A module can define functions, classes, and variables, and it can include runnable code. You will
use the import statement to import the modules.
a Click the arrow to the left of the Load And Prepare Data section to expand the section.
c From the ArcGIS Notebooks toolbar, click the Insert Cell Below button.
A Code cell is added under the Markdown cell. You will use this cell to import the Python
modules that are required to complete this exercise.
d Use the import syntax to import the following Python modules, pressing Enter on your
keyboard after each line:
• arcgis
• pandas
• os
• arcpy
This Code cell will call the modules from the ArcGIS Pro conda environment. To the left of the
Code cell is blue text with brackets. When you run a Code cell, an asterisk appears inside the
brackets to indicate that the cell is running. When the cell has completed running, the asterisk
is replaced with a number.
The number 1 appears in the brackets to indicate that the cell has been executed, which
means that the modules were successfully loaded.
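If an import fails, it can help to check which of the four modules the active environment can actually locate. The following is a minimal sketch that uses only the Python standard library; the module_available helper is a hypothetical name, and arcgis and arcpy will normally be reported as missing outside the ArcGIS Pro conda environment.

```python
import importlib.util

# Modules that the exercise imports in its first Code cell.
required_modules = ["arcgis", "pandas", "os", "arcpy"]


def module_available(name):
    """Return True if the import system can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None


availability = {name: module_available(name) for name in required_modules}

for name, found in availability.items():
    print(f"{name}: {'found' if found else 'missing'}")
```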
You will use the pandas module quite often in this exercise. Instead of typing pandas each
time, you will shorten pandas to pd.
f Modify the line of code that says import pandas to say import pandas as pd.
g Click Run.
You used pd as a variable. A variable is a name that references an object. The object could be
a dataset or, in this case, a Python module. You could have shortened pandas to any variable
name. You used pd because it is the most common local name for pandas. The remaining
Code cells will use pd when using pandas functionality.
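The aliasing mechanism itself can be sketched with the os module, which the exercise also imports, so that the example runs in any Python environment. The alias name operating_system is hypothetical and chosen only for illustration; import pandas as pd works the same way.

```python
import os
import os as operating_system  # hypothetical alias, for illustration only

# Both names are bound to the same module object, so either one can be used.
same_object = operating_system is os
print(same_object)

# Functions are reached through the alias exactly as through the full name.
file_name = operating_system.path.basename("data/countypres2016.csv")
print(file_name)
```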
a Click the Markdown cell that is titled Read Data Into Python.
b From the ArcGIS Notebooks toolbar, click the Insert Cell Below button.
Note: Your file path may differ from what is listed in the image above. Include the full file path
to the folder where the countypres2016.csv dataset is located.
By defining this variable, you can use the table_csv_path variable throughout the script to
refer to the county election dataset (countypres2016.csv).
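A path variable of this kind might be defined as in the sketch below. The data_folder location is an assumption made for illustration (an EsriTraining folder under the user's home folder); replace it with the full path to the folder where you extracted the exercise data.

```python
import os

# Hypothetical location; replace with the folder that holds your extracted data.
data_folder = os.path.join(os.path.expanduser("~"), "EsriTraining")

# Full path to the county election dataset used in the exercise.
table_csv_path = os.path.join(data_folder, "countypres2016.csv")

# Checking the path first gives a clearer message than a failed read later.
if os.path.isfile(table_csv_path):
    print("Found:", table_csv_path)
else:
    print("Not found, check the folder path:", table_csv_path)
```

Building the path with os.path.join avoids problems with backslashes in hand-typed Windows paths.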
You want to specify that the FIPS attribute field in this data frame will be a text, or string,
value, so that leading zeros in the codes are preserved. You will use the dtype parameter to
specify this field type.
g After table_csv_path, add a comma and a space, and then type dtype = {'FIPS': str}.
You created a data frame for the county elections dataset that you will use to prepare,
reformat, and geoenable your data.
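The effect of the dtype parameter can be seen in a small, self-contained sketch. The rows below are made up for illustration and are read from an in-memory file instead of countypres2016.csv; only the FIPS and candidatevotes field names come from the exercise.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real CSV file.
sample_csv = io.StringIO(
    "FIPS,candidatevotes\n"
    "01001,100\n"
    "06037,200\n"
)

# Without dtype, pandas infers integers and the leading zero is lost.
df_inferred = pd.read_csv(sample_csv)

sample_csv.seek(0)  # rewind the in-memory file before reading it again

# With dtype, FIPS is kept as text.
df_text = pd.read_csv(sample_csv, dtype={"FIPS": str})

print(df_inferred["FIPS"].tolist())
print(df_text["FIPS"].tolist())
```

Keeping FIPS as text matters later, because a code such as 01001 no longer matches its county once it has been read as the number 1001.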
k In ArcGIS Pro, from the Notebook tab, in the Notebook group, click Save.
Before moving to the next step in this PDF, you must expand each section and execute the
rest of the steps in the notebook in ArcGIS Pro.
l Expand each section and select each cell and either click Run or press Shift + Enter on
your keyboard.
You must run each cell in the notebook before proceeding to the next step.
n After you have finished running all cells in the notebook, on the Notebook tab, in the
Notebook group, click Save.
The feature class that you created in the notebook has been added to the map. The color of
the data will vary every time it is added to the map.
The Data Engineering view opens in a dockable window that can be moved and docked in
the same way that you dock maps, layouts, and attribute tables. In addition to the view, a
Data Engineering contextual tab is available. The tab provides access to commands that are
used for data engineering.
The Data Engineering view contains two panels: a fields panel and a statistics panel. The
fields panel lists the fields in the layer that you used to open the view. The fields panel allows
you to explore fields, change symbology, and produce charts for fields in the layer. The
statistics panel allows you to explore the values and distribution of your data by viewing
statistics and data quality metrics. The panel's statistics table is empty by default until you add
fields from the fields panel.
e From the Data Engineering contextual tab, in the Tools group, click Integrate and choose
Enrich.
The Environments dialog box opens. Here, you can set parameters that apply to
geoprocessing tools, such as the processing extent that limits processing to a specific
geographic area, a coordinate system for all output geodatasets, or the cell size of output
raster datasets.
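Conceptually, a processing extent acts as a bounding box that decides which features a tool considers. The sketch below illustrates that idea in plain Python with made-up points and a made-up extent; the Extent class and within_extent function are hypothetical, and this is not how ArcGIS Pro implements the environment setting.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    """A rectangular extent defined by its minimum and maximum coordinates."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def within_extent(x, y, extent):
    """Return True if the point (x, y) falls inside or on the edge of the extent."""
    return extent.xmin <= x <= extent.xmax and extent.ymin <= y <= extent.ymax


# Made-up points as (longitude, latitude) pairs.
points = {"A": (-100.0, 40.0), "B": (-150.0, 64.0), "C": (-80.0, 30.0)}

# Made-up extent, roughly covering the contiguous United States.
extent = Extent(xmin=-125.0, ymin=24.0, xmax=-66.0, ymax=50.0)

# Only points inside the extent would be processed.
processed = [name for name, (x, y) in points.items() if within_extent(x, y, extent)]
print(processed)
```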
c Under Processing Extent, click the Extent down arrow and choose the
County_elections_pres_2016_final layer.
Note: Your extent will again be listed as As Specified Below, but your extent figures now
match the County_elections_pres_2016_final layer.
Next, you will set the data source for the Enrich tool.
e Under Business Analyst, next to Data Source, click the Browse button.
The Business Analyst Data Source dialog box opens, which is where you can set the data
source for geoenrichment to a specific country. You will set the data source to the United
States.
f In the Business Analyst Data Source dialog box, on the left side, under Portal, click All
Countries.
g Scroll through the countries listed to see which countries and regions have demographic
data available through Esri.
Esri's GeoEnrichment service enables you to query authoritative global data for over 150
countries and regions. This extensive global data portfolio allows you to integrate global
demographics, business, behavioral, environmental, and places datasets into your own data.
i From the options that display under United States, click Esri 2022.
j Click OK.
Because your study area is the United States, you set the region to select demographic
variables from the United States to geoenrich your data.
Note: For more information on demographic data from Esri, see the Esri Location Data web
page.
a Return to the Enrich dialog box, and confirm that the Input Features parameter is set to
County_elections_pres_2016_final.
Note: If you do not see County_elections_pres_2016_final, return to the Create a pandas data
frame step and verify that you have executed each cell in the notebook.
The tool will automatically create an output feature class name that reflects the input. You can
keep this name or modify it to be more meaningful for your analysis.
b For Output Feature Class, replace the current text with CountyElections2016Enrich.
Note: This parameter represents a file path that leads to the ArcGIS Pro project's file
geodatabase (DataEngineering_and_Visualization.gdb). In ArcGIS Pro, the Current Workspace
environment defaults to the project's default geodatabase.
The Data Browser window is where you can explore the different demographic variables that
are available for data enrichment. Esri provides various demographic variables that are
regularly updated with the latest available data. For the United States, Esri also provides
attributes from previous censuses (2000 and 2010) that are recalibrated with the most current
census (2020) geography. You can quickly add various demographic variables to your data
using the Enrich tool. You can also add variables that you created or that were shared with
you.
d In the Data Browser window, in the Search Variables field, type Median Age and press
Enter.
On the left, you have the option to filter the available variables so that you can easily focus
your search. To the right of the Median Age variables, you see a hashtag and the word Index.
For each variable, these icons, along with a percent sign icon, are used to specify whether you
want a total count (hashtag), index, or percentage (percent sign) for the variable.
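The difference between a count, a percentage, and an index can be sketched with simple arithmetic. The numbers below are made up, and the index formula is an assumption based on the common convention that an index of 100 means the local rate equals the rate of a reference area; check the Esri documentation for the exact definition used by the Data Browser.

```python
def percentage(count, total):
    """Share of the total, expressed as a percentage."""
    return 100.0 * count / total


def index(local_rate, reference_rate):
    """Local rate relative to a reference rate, where 100 means they are equal (assumed convention)."""
    return 100.0 * local_rate / reference_rate


# Made-up values for one county.
county_count = 2500        # total count (hashtag)
county_population = 10000

# Made-up rate for the reference area, in percent.
reference_rate = 20.0

county_percentage = percentage(county_count, county_population)
county_index = index(county_percentage, reference_rate)

print(county_count, county_percentage, county_index)
```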
The Details panel helps you keep track of the variables that you select. When a variable is
selected, it is automatically listed in the Details panel.
g Search for and add the most recent Per Capita Income variable.
For this exercise, you will not enrich the data because an enriched data layer has been
provided for you in the next exercise. You explored the workflow for geoenriching your data
using the Enrich tool.
i At the top of the Enrich dialog box, click the Close button.
j Under the map, on the Data Engineering view tab, click the Close button.
By applying various data engineering techniques, you cleaned and prepared the election
data. Geoenabling and geoenriching the data provides demographic variables that you can
use to model or predict voter turnout.
In the next exercise, you will use various visualization techniques to explore relationships
between voter turnout and these variables. You will use this information to identify potential
variables to use in your prediction model later in the MOOC.
k If you would like to perform additional data engineering tasks, proceed to the optional
stretch goal. Otherwise, close the Data Engineering map and notebook tabs, save the
project, and then exit ArcGIS Pro.
1. Identify and remove records with null candidatevotes values in the election data.
2. Apply a symbology layer (default.lyrx) to the 2016 election turnout feature class
(out_2016_fc_name).
Note: Alaska does not have counties. Research its administrative and political subdivisions to
determine how the data would need to be engineered to address this issue.
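One possible starting point for the first stretch task is sketched below with pandas. The rows are made up for illustration, and the data frame name df is hypothetical; in the notebook, you would apply the same calls to the data frame that you created from countypres2016.csv.

```python
import pandas as pd

# Made-up rows; one record has a null candidatevotes value.
df = pd.DataFrame(
    {
        "FIPS": ["01001", "01003", "02013"],
        "candidatevotes": [100.0, None, 250.0],
    }
)

# Identify the records with null candidatevotes values.
null_mask = df["candidatevotes"].isna()
null_records = df[null_mask]
print(null_records)

# Remove those records and renumber the remaining rows.
cleaned = df.dropna(subset=["candidatevotes"]).reset_index(drop=True)
print(cleaned)
```

Reviewing null_records before dropping them shows which counties would be lost, which is useful when deciding whether removal is appropriate.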
Use the Lesson Forum to post your questions, observations, and syntax examples. Be sure to
include the #stretch hashtag in the posting title.
When you are finished, close the Data Engineering map and notebook tabs, save the project,
and then exit ArcGIS Pro.