Exercise
Create a prediction model
Section 2 Exercise 1
03/2020
Spatial Data Science MOOC
Time to complete: 90 minutes
Introduction
Prediction is an important part of spatial data science. You can use prediction to forecast
future values (for example, predicting tomorrow's air quality for a specified location),
downscale information (for example, using voter turnout data at the county level to predict
voter turnout at the tract level), or fill in missing values in a dataset.
ArcGIS provides various prediction tools to help you complete these types of analyses. In this
exercise, you will use the Forest-based Classification and Regression tool, which uses an
adaptation of Leo Breiman's random forest algorithm. This supervised machine learning
algorithm allows you to use existing data to train models that may be useful for predictive
analysis.
The tool creates many decision trees, called an ensemble or a forest, that are used for
prediction. Each tree generates its own prediction and is used as part of a voting scheme to
make final predictions. The strength of the forest-based method is in capturing commonalities
of weak predictors (the trees) and combining them to create a powerful predictor (the forest).
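Note: As a conceptual illustration only (not the code that the ArcGIS tool runs), the following Python sketch uses scikit-learn's RandomForestRegressor to show how the predictions of individual trees are combined into a single forest prediction. The data and variable names are hypothetical.

# Sketch of the forest-based idea: many trees, each predicting on its own,
# combined into one forest prediction (for regression, by averaging).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 3))                                       # three hypothetical explanatory variables
y = 0.6 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.05, 500)   # synthetic "voter turnout"

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Each tree in the ensemble generates its own prediction for a new observation...
new_obs = X[:1]
tree_predictions = [tree.predict(new_obs)[0] for tree in forest.estimators_]

# ...and the forest combines them into the final prediction.
print("Mean of tree predictions:", np.mean(tree_predictions))
print("Forest prediction:       ", forest.predict(new_obs)[0])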
You will use this tool to train and evaluate a predictive model, modifying variables and
parameters to improve the model performance.
Exercise scenario
After preparing and visualizing your data, you are ready to begin your predictive analysis. In
this exercise, you will create models that predict voter turnout. These models will use
explanatory variables, such as income and age, to predict the dependent variable, voter
turnout.
You will use this model to downscale voter turnout from the county to the tract level. This
information will be used to organize a "Get Out the Vote" canvassing campaign by identifying
local regions that are expected to have low voter turnout.
c Extract the files to a folder on your local computer, saving them in a location that you will
remember.
d In the Open Project dialog box, browse to the Prediction folder that you saved on your
computer.
A Prediction map tab opens to a gray basemap with a map layer that represents the 2016
election results for each county. Counties with a voter turnout value under the mean are
purple, and counties with a voter turnout value over the mean are green.
c In the search results, click the Forest-Based Classification And Regression (Spatial
Statistics Tools) tool.
Note: Be sure that you are using the Spatial Statistics tool and not the GeoAnalytics Desktop
tool.
The Forest-Based Classification And Regression tool opens in the Geoprocessing pane.
e Under Explanatory Training Variables, next to Variable, click the Add Many button.
f In the variable window, check the box for the following variables:
At the bottom of the Geoprocessing pane, you will see a message confirming that the tool
completed. You did not specify an output for the tool; instead, you will review the
model's performance using the tool messages.
The Forest-Based Classification And Regression (Spatial Statistics Tools) tool message window
appears. Tool messages contain information such as the parameters used to run the tool, how
long the tool ran, and model performance diagnostics.
Note: Each time that you run the Forest-based Classification and Regression tool, you may
get slightly different results due to the randomness introduced in the algorithm to prevent the
model from overfitting to the training data.
By default, Forest-based Classification and Regression reserves 10 percent of the data for
validation. The model is trained without this random subset, and the tool returns an R-
Squared value measuring how well the model performs on the unseen data.
When a model is evaluated based on the training dataset rather than a validation dataset, it is
common for estimates of performance to be overstated due to a concept called overfitting.
Therefore, the validation R-Squared is a better indicator of model performance than the
training R-Squared.
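Note: The difference between the training and validation R-Squared values can be illustrated with the following Python sketch, which reserves 10 percent of a hypothetical dataset for validation. This is a stand-in using scikit-learn, not the tool's actual implementation.

# Sketch: hold out 10 percent of the records for validation and compare
# training R-Squared to validation R-Squared (hypothetical data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((1000, 5))
y = X @ rng.random(5) + rng.normal(0, 0.2, 1000)

# Reserve 10 percent of the data, as the tool does by default.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=1)
model = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)

# The training R-Squared is typically optimistic (overfitting);
# the validation R-Squared measures performance on unseen data.
print("Training R-Squared:  ", round(model.score(X_train, y_train), 3))
print("Validation R-Squared:", round(model.score(X_val, y_val), 3))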
The model returned a validation R-Squared value of 0.519, indicating that the model
predicted the voter turnout value in the validation dataset with an accuracy of about 52
percent.
Next, you will review how important each explanatory variable was in generating a prediction.
Note: Throughout the exercise, you will rerun the same geoprocessing tool using different
parameters.
b Click Run.
The Out_Trained_Features layer displays the predicted voter turnout for each county in the
contiguous United States. A variable importance table and associated bar chart are added to
the Contents pane and can be used to explore which variables were most important in this
prediction.
The 2019 Education: High School/No Diploma : Percent and 2019 Per Capita Income
variables have the highest importance, meaning that they were the most useful in predicting
voter turnout.
Each time that you run the Forest-based Classification and Regression tool, you may get
slightly different results due to the randomness introduced in the algorithm to prevent the
model from overfitting to the training data. To understand and account for this variability, you
will use a parameter that lets the tool create multiple models in one run. This will allow you to
explore the distribution of model performance.
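Note: Conceptually, creating multiple models in one run is similar to the following Python sketch, which repeats the 10 percent validation split with different random seeds and summarizes the resulting R-Squared values. The data is hypothetical, and this is not the tool's internal code.

# Sketch: repeat the validation split 10 times and examine the
# distribution of validation R-Squared values (hypothetical data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.random((1000, 5))
y = X @ rng.random(5) + rng.normal(0, 0.2, 1000)

r2_values = []
for seed in range(10):   # comparable to a Number Of Runs For Validation of 10
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=seed)
    model = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X_train, y_train)
    r2_values.append(model.score(X_val, y_val))

print("R-Squared per run:", [round(r, 3) for r in r2_values])
print("Mean R-Squared:   ", round(float(np.mean(r2_values)), 3))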
b In the Geoprocessing pane, expand Validation Options, and then enter the following
parameters:
c Under Output Validation Table, check the box for Calculate Uncertainty.
Note: If you do not see the Calculate Uncertainty check box, you may need to scroll down in
the Geoprocessing pane.
f In the tool message window, expand Messages, if necessary, and scroll to the Validation
Data: Regression Diagnostics section.
The tool trained 10 models with random subsets of validation data. The most representative
R-Squared across the 10 runs is 0.527, corresponding to about 53 percent accuracy in
prediction of the validation data. You can use a histogram to review the distribution of R-
Squared values returned over the 10 runs.
The histogram shows the variability in model performance by visualizing the distribution of R-
Squared values returned over the 10 runs. The mean R-Squared for the 10 runs of this model
is 0.52.
Instead of a bar chart, the variable importance is visualized using a box plot to show the
distribution of importance across the 10 runs of the model. There is overlap in the distribution
of importance of the 2019 Per Capita Income and 2019 Education: High School/No Diploma :
Percent variables. In some runs of the model, Per Capita Income was more important, and in
other runs, Education: High School/No Diploma was more important. Overall, both variables
are strong candidates for your predictive model.
The Prediction Interval chart visualizes the level of uncertainty for any given prediction value.
By considering the range of prediction values returned by the individual trees in the forest,
prediction intervals are generated indicating the range in which the true value is expected to
fall. You can be 90 percent confident that new prediction values generated using the same
explanatory variables would fall in this range. This chart can help you identify if the model is
better at predicting some values than others. For example, if the confidence intervals were
much larger for low voter turnout values, then you would know that the model is not as stable
for predicting low voter turnout as it is for predicting high voter turnout. The prediction
intervals in this model are fairly consistent, indicating that the model performance is relatively
stable across all values.
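Note: The following Python sketch shows one common way to build a 90 percent prediction interval from the spread of individual tree predictions. The tool's exact method may differ, and the data here is hypothetical.

# Sketch: take the 5th and 95th percentiles of the individual tree
# predictions as a simple 90 percent prediction interval (hypothetical data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.random((1000, 5))
y = X @ rng.random(5) + rng.normal(0, 0.2, 1000)

forest = RandomForestRegressor(n_estimators=500, random_state=3).fit(X, y)

new_obs = X[:1]
tree_preds = np.array([tree.predict(new_obs)[0] for tree in forest.estimators_])
lower, upper = np.percentile(tree_preds, [5, 95])

print(f"Forest prediction: {forest.predict(new_obs)[0]:.3f}")
print(f"90 percent prediction interval: [{lower:.3f}, {upper:.3f}]")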
At this point, you have used the attributes in your dataset as the explanatory variables in your
model. With the Forest-based Classification and Regression tool, you can also calculate new
variables based on distances to meaningful locations. In the next step, you will calculate new
variables and assess their importance to the model.
b In the Contents pane, expand DistanceVariables, and then turn on the following layers:
• DistanceVariables
• Cities10
• Cities9
• Cities8
• Cities7
• Cities6
• Cities5
Each Cities layer represents a class of city size based on population. Cities10 represents cities
with the largest populations, and Cities5 represents the cities with the smallest populations.
c In the Contents pane, turn off the DistanceVariables and Cities layers.
e Press and hold the Shift key on your keyboard, and then click Cities5.
f Drag the selected layers from the Contents pane into the Geoprocessing pane, under
Explanatory Training Distance Features.
g Click Run.
Note: When a city point is contained within a county, the distance will be zero.
The Forest-based Classification and Regression tool calculates the distances from each county
to the nearest city of each class (the closest class 5 city, the closest class 6 city, and so on).
These distances are added to the Out_Trained_Features layer as separate attribute fields.
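Note: Conceptually, the distance calculation resembles the following Python sketch, which finds, for each county point, the distance to the nearest city in each size class. The coordinates and class labels are hypothetical; the tool works directly with the layer geometries.

# Sketch: distance from each county point to the nearest city of each
# size class, using one k-d tree per class (hypothetical coordinates).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(4)
county_points = rng.random((3000, 2)) * 1000        # hypothetical county centroids
city_points = rng.random((200, 2)) * 1000           # hypothetical city locations
city_class = rng.integers(5, 11, size=200)          # size classes 5 through 10

for cls in range(5, 11):
    tree = cKDTree(city_points[city_class == cls])
    distances, _ = tree.query(county_points)        # zero when a county point coincides with a city
    print(f"Cities{cls} mean nearest distance:", round(float(distances.mean()), 1))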
l In the tool message window, expand Messages, if necessary, and scroll to the Validation
Data: Regression Diagnostics section.
The model's R-Squared has increased to 0.614, which means that you are predicting with
more than 61 percent accuracy based on the validation data. You will review the variable
importance chart to see how influential each of these distance variables is to the model
performance.
The distance to the smallest cities (Cities5) and the distance to the largest cities (Cities10) are
more important than the other distance variables. Overall, however, these variables were not
as helpful as the income and education variables.
In the next step, you will add additional demographic variables in an attempt to make a more
robust model.
a In the Geoprocessing pane, under Explanatory Training Variables, click the Add Many
button.
• County
• Shape_Area
• Shape_Length
• Voter_Turnout
• 2019 Median Age
• 2019 Per Capita Income
• 2019 Education: High School/No Diploma : Percent
• Own A Selfie Stick : Percent
d Click Add.
The model's R-Squared has increased to 0.707, which means that you are now predicting with
more than 70 percent accuracy based on the validation data.
Note: You can zoom in to the chart to more clearly see the distribution for a particular
variable.
2019 Per Capita Income and 2019 Education: High School/No Diploma : Percent still have the
highest variable importance in the model, but there are several new variables that have
contributed to the model and raised its performance. There are also several variables that
may not be helping the model, represented by their low variable importance.
All these variables represent continuous numerical values. In the next step, you will add a
categorical variable to the model.
This layer includes all the attributes from the CountyElections2016 layer and an additional
attribute that represents the following categories for state voting requirements:
• No document required
• ID without photo
• ID with photo
• Strict ID without photo
• Strict ID with photo
You will add this categorical variable to the model and assess the model performance.
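Note: The tool handles categorical variables internally when you mark them as categorical. If you were preparing data yourself, for example in an ArcGIS Notebook, one common approach is one-hot encoding, as in the following hypothetical sketch.

# Sketch: one-hot encode a categorical explanatory variable with pandas
# (hypothetical data; the ArcGIS tool does this for you when the
# Categorical box is checked).
import pandas as pd

counties = pd.DataFrame({
    "Voter_Turnout": [0.55, 0.61, 0.48],
    "Voting_Requirement": ["ID with photo", "No document required", "Strict ID with photo"],
})

encoded = pd.get_dummies(counties, columns=["Voting_Requirement"])
print(encoded.columns.tolist())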
f In the variable window, check the box for State Voting Requirement Laws.
g Click Add.
The State Voting Requirement Laws variable should be added to the list of explanatory
training variables and marked as a categorical variable.
h Click Run.
The R-Squared value is about the same, with no improvements to the model. You can review
the variable importance to determine how helpful state voting requirement laws are
compared to the other variables.
l Zoom to the bottom right of the chart to locate the State Voting Requirements variable.
Although voting requirement laws are likely important in voter participation, they do not help
the model very much—at least compared to the other explanatory variables. You will continue
to explore state-related variables that can impact voter participation.
This layer includes the attributes from the CountyElections2016 layer, the State Voting
Requirement Laws attribute, and an additional attribute that measures the difference in
election party votes as a percent. This attribute, State Percent Votes Difference, represents
how competitive each state is in the presidential election, with a lower percent difference
indicating a more competitive state.
Note: This ArcGIS Pro project includes an ArcGIS Notebook, ElectionPartyVotes, that was
used to calculate the State Percent Votes Difference.
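Note: The actual calculation is defined in the ElectionPartyVotes notebook and is not reproduced here. As a hedged illustration only, a state-level percent difference in party votes could be derived with pandas along the following lines; the column names and formula are assumptions.

# Hedged sketch: aggregate county votes by state and compute the absolute
# difference in party votes as a percent of the total (assumed columns).
import pandas as pd

county_votes = pd.DataFrame({
    "State": ["IA", "IA", "FL"],
    "Dem_Votes": [12000, 8000, 50000],
    "Rep_Votes": [10000, 9000, 52000],
})

state_votes = county_votes.groupby("State")[["Dem_Votes", "Rep_Votes"]].sum()
total_votes = state_votes.sum(axis=1)
state_votes["Percent_Votes_Difference"] = (
    (state_votes["Dem_Votes"] - state_votes["Rep_Votes"]).abs() / total_votes * 100
)
print(state_votes)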
f In the variable window, check the box for State Percent Votes Difference.
g Click Add.
Hint: In the Forest-Based Classification And Regression tool message window, scroll to
the Validation Data: Regression Diagnostics section.
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
_______________________________________________________________________________
Identifying a "good" model is subjective and varies greatly based on the industry and how
the model will be used. In many fields, including many of the social sciences, an R-Squared
value over 0.70 might be considered satisfactory for making a prediction. Before using this
model to predict, you will simplify the model to only include the most important variables.
i Close the Distribution Of Variable Importance chart and the tool message window.
2019 Median Age, 2019 Per Capita Income, and 2019 Education: High School/No Diploma :
Percent are already listed under Explanatory Training Variables.
• County
• FIPS
• Shape_Area
• Shape_Length
• State
• Voter_Turnout
• 2019 Median Age
• 2019 Per Capita Income
• 2019 Education: High School/No Diploma : Percent
• 2019 Population Age 18+
You are unchecking these variables in the variable window to ensure that they are not
listed under Explanatory Training Variables twice.
f Click Add.
j Review the various geoprocessing tool messages and charts to answer the following
questions.
3. What is the validation R-Squared value for this model?
_______________________________________________________________________________
_______________________________________________________________________________
4. What is the mean R-Squared value over the 10 runs of this model?
Hint: In the Contents pane, open the Validation R2 chart.
_______________________________________________________________________________
5. What are the two most important explanatory variables in this model?
_______________________________________________________________________________
The simplified model has approximately the same R-Squared value, meaning that removing
the variables with low importance did not compromise model performance. You will review
additional model metrics to help you assess if the model requires any additional changes.
Model Out of Bag Errors is another diagnostic that can help validate the model. The
percentage of variation explained indicates the percent of variability in voter turnout that can
be explained using this model. Model Out of Bag Errors also shows how much performance is
gained by increasing the number of trees in the model. If the percentage of variation
explained significantly increases from the 500 to the 1000 column, you may want to increase
the number of trees to improve model performance.
This model does not see a significant increase in percentage of variation explained, so you do
not need to increase the number of trees.
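Note: The 500-versus-1000-tree comparison is conceptually similar to the following Python sketch, which compares out-of-bag performance for two forest sizes on hypothetical data. The ArcGIS tool reports this diagnostic as the percentage of variation explained.

# Sketch: check whether doubling the number of trees meaningfully improves
# out-of-bag performance (hypothetical data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.random((1000, 5))
y = X @ rng.random(5) + rng.normal(0, 0.2, 1000)

for n_trees in (500, 1000):
    forest = RandomForestRegressor(n_estimators=n_trees, oob_score=True, random_state=5).fit(X, y)
    print(n_trees, "trees, out-of-bag R-Squared:", round(forest.oob_score_, 3))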
The Explanatory Variable Range Diagnostics section lists the range of values covered by each
explanatory variable in the datasets used to train and validate the model. For example,
median age values spanned from 23 to 61 in the dataset used to train the model and from 24
to 58 in the dataset used to validate the model.
The Validation Share indicates the percentage of overlap between the values used to train the
model and the values used to validate it. In this example, the median age values used to
validate the model cover 86 percent of the range of values used to train the model. A value
over 1 indicates that the model is predicting based on values outside the range of the training
data. To minimize extrapolation, you will review this diagnostic as you predict voter turnout to tracts.
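Note: The idea of a range-coverage share can be sketched with the following Python function. The exact formula that the tool uses is described in the help topic referenced below; the function and values here are a simplified, hypothetical illustration.

# Simplified sketch: how much of the training range of a variable is
# covered by the validation (or prediction) range. The tool's reported
# share is computed from the actual data and may differ.
def coverage_share(train_min, train_max, other_min, other_max):
    overlap = min(train_max, other_max) - max(train_min, other_min)
    return max(overlap, 0) / (train_max - train_min)

print(round(coverage_share(20, 60, 25, 55), 2))   # hypothetical ranges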
For more information about these additional model metrics, see ArcGIS Pro Help: How
Forest-based Classification and Regression works.
The prediction features must include the variables used to train the model, but the variables
do not have to have the same names. You can use the Match Explanatory Variables parameter
to match the variables by name.
b Under Match Explanatory Variables, under Prediction, for the empty cells, click the down
arrow and choose the matching Training variable name.
The percentage of variation explained is still fairly high at 71 percent, and it does not vary
greatly between 500 and 1000 trees. Because processing is not taking too long, you can keep
1000 as the Number of Trees.
The confidence intervals are much larger for low voter turnout values than for high voter
turnout values. This indicates that the model is not as reliable for predicting low voter turnout
as it is for predicting high voter turnout. Because the goal of your analysis is to identify areas
with low voter turnout, this model is not reliable enough to meet your needs. The factors that
drive voter turnout are likely very different from place to place, making it difficult to find a
model that predicts well for the entire country. It is often good practice to reduce your study
area and create more localized models. In the next step, you will attempt to model voter
turnout in the state of Iowa.
The confidence intervals are much smaller for low voter turnout values, indicating that the
model has become more stable for predicting these values.
The distribution of R-Squared values suggests that the model has stabilized, with validation
R-Squared values averaging approximately 0.73. You will review the variable importance
chart to determine whether there are any variables that you can remove to improve model
performance.
You will remove the variables of least importance and rerun the model to determine if these
refinements improve model performance.
Note: Typically, this is an iterative process, where you remove some variables and then rerun
and evaluate the model.
• Cities10
• Cities9
• Cities7
• Cities6
• Cities5
c Under Match Explanatory Variables, under Prediction, in an empty cell, click the down
arrow and choose the matching Training variable name.
There are still some variables that have a larger prediction range (tract level) than training
range (county level), so you will review the Prediction Interval chart to evaluate the model's
stability.
The confidence intervals are much smaller for low voter turnout values, indicating that the
model is more reliable. You can also use the Distribution of Variable Importance chart to
evaluate the stability of the model.
You can evaluate the stability of a model by the length of the interquartile range (the box) of
each explanatory variable's importance. A large interquartile range (a long box) can indicate
that the model is unstable, because the variable's importance varies considerably from one
run of the model to another. In this model, the interquartile ranges are fairly short, confirming
that the model is stable.
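Note: The interquartile-range check can be sketched with the following Python code; the importance values here are hypothetical.

# Sketch: use the interquartile range (IQR) of each variable's importance
# across the 10 runs as a rough stability check (hypothetical values).
import numpy as np

importance_runs = {
    "Per Capita Income":      [0.21, 0.22, 0.20, 0.23, 0.21, 0.22, 0.20, 0.21, 0.22, 0.21],
    "High School/No Diploma": [0.19, 0.24, 0.15, 0.26, 0.18, 0.22, 0.16, 0.25, 0.20, 0.23],
}

for name, values in importance_runs.items():
    q1, q3 = np.percentile(values, [25, 75])
    print(f"{name}: IQR = {q3 - q1:.3f}")   # a short box (small IQR) suggests a stable importance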
The distribution of R-Squared values suggests that the model has stabilized, with validation
R-Squared values averaging approximately 0.70.
The Out_Predicted_Features layer contains the predicted voter turnout for Iowa. Exploring
this map can help you identify the tracts with lower predicted voter turnout values and assess
which regions of the state could be good locations for a campaign to get out the vote.
Overall, the model is performing well for this analysis question. You can continue modifying
and improving the model (for example, changing the Training Data Excluded For Validation
parameter to zero) or proceed with these results. Remember, your model will not be perfect.
Your goal is to find a model that is useful for your objective, which, in this case, is a campaign
to get out the vote.
m If you would like to continue this analysis, proceed to the optional stretch goal; otherwise,
save the project and exit ArcGIS Pro.
The goal of prediction is to use the predicted values to make more informed decisions. If you
would like to continue this analysis, you can apply the results of this prediction to organize a
canvassing effort.
For this canvassing effort, you used the prediction model to identify a tract in Florida that has
low predicted voter turnout. You will assign 50 volunteers to a specific set of houses that they
will visit to inform potential voters about an upcoming election.
The following is a list of high-level steps that you can complete to continue this analysis:
Use the Lesson Forum to post your questions, observations, and syntax examples. Be sure to
include the #stretch hashtag in the posting title.
2. Compared to the other variables, is State Percentage Votes Difference a helpful variable in
this model?
Hint: In the Contents pane, open the Distribution Of Variable Importance chart.
Comparatively, the State Percentage Votes Difference is one of the more helpful
variables in this model. This indicates that knowing how competitive each state is can
help in predicting voter turnout.
4. What is the mean R-Squared value over the 10 runs of this model?
Hint: In the Contents pane, open the Validation R2 chart.
The R-Squared values for each run of the model range from 0.63 through 0.77 with a
mean value of 0.722.
5. What are the two most important explanatory variables in this model?
The two most important variables are 2019 Education: High School/No Diploma :
Percent and 2019 Per Capita Income.