Proc Geocode
Paper 087-2009
ABSTRACT
How do you convert your address data into map locations? This is done through the process of geocoding.
SAS/GRAPH® 9.2 now includes PROC GEOCODE to simplify this process for you. This presentation will briefly
discuss the current capability of this procedure and show examples of both address geocoding and Web address (IP
address) geolocating. A brief discussion of future directions will also be included.
INTRODUCTION
Every business or organization has a lot of data that includes an address. The address data is useful only if it is
transformed into location information that can be viewed on a map, used in distance calculations, or processed in
other similar ways. To make this data useful, you must convert the address to a location having a latitude and
longitude. This conversion process is called geocoding. If the address is an IP address, the process is usually called
geolocating.
This paper will discuss both geocoding and geolocating using PROC GEOCODE. First, we will introduce the
concepts needed to understand geocoding, and then we will discuss PROC GEOCODE’s current and planned
functionality. Finally, examples will show how to use PROC GEOCODE.
GEOCODING BASICS
The geocoding process depends on having lookup data with the necessary information to make the conversion. This
data is the key to geocoding. Factors such as age and granularity of the lookup data determine the geocoding
results. Addresses routinely change with new construction, new roads being added, and postal codes being split and
changed. The older your lookup data, the more likely it is that some address matches might be off.
Granularity is another important consideration. Does the location need to be the actual house location or is it okay at
a ZIP Code level or even a city level? If you are viewing the addresses on a state or U.S. map, then ZIP Code or city
is accurate enough.
To understand geocoding, it is important to first understand the lookup data. It is particularly important to understand
the differences between ZIP Code data, ZIP+4 data, and street address data. IP address data is completely different
from the other types of addresses, but it is important to understand this data, too.
SAS Global Forum 2009 Coders' Corner
A ZIP Code is further divided by ZIP+4 areas. Four additional digits are appended to the ZIP Code to specify these
additional subdivisions. A ZIP+4 will likely represent a single street or a part of a street. In a high-density city area, it
might represent one side of a street on a single block, or even one floor of a large building. Figure 2 illustrates the
relationship between a ZIP Code and a ZIP+4. The centroid is the midpoint of those addresses in the ZIP+4 area.
All addresses in this ZIP+4 area would get the same centroid if geocoded with ZIP+4 data. However, there would be
multiple locations within the overall ZIP Code area for each ZIP+4.
ZIP Code data will get you to the general location of the address, but not to the actual house location. ZIP+4 data will
probably get you to the correct street in the address, but not to the actual house location. To geocode to the specific
house location, you need street-level data.
IP GEOLOCATING DATA
IP data is a form of range data and, unlike street addresses, was not designed to be geographic. IP addresses are very
different from ZIP Codes and street addresses. Generally, they are collected from visitors to Web sites and indicate
the connection the visitor used. IP address lookup data contains information that matches ranges of IP addresses to
particular geographic locations. The location found will not be at the street or even ZIP Code level, but might indicate
the city, state, or country where the IP address is registered.
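The range idea can be sketched outside of PROC GEOCODE. The following is a minimal illustration, not the procedure's internal method: the data set names, variable names, and sample values (WORK.IPRANGES, WORK.VISITORS, BEGIN_IP, END_IP, PLACE) are all hypothetical. It converts each dotted address to a single number and then finds the range that contains it.

```sas
/* Hypothetical range table: each row maps a block of IP numbers to a place */
data work.ipranges;
   input begin_ip end_ip place $;
   datalines;
167772160 167772415 SiteA
167772416 167772671 SiteB
;
run;

/* Hypothetical visitor addresses in dotted form */
data work.visitors;
   length ip $15;
   input ip $;
   /* Convert a.b.c.d to one number: a*256**3 + b*256**2 + c*256 + d */
   ipnum = input(scan(ip, 1, '.'), 3.) * 16777216
         + input(scan(ip, 2, '.'), 3.) * 65536
         + input(scan(ip, 3, '.'), 3.) * 256
         + input(scan(ip, 4, '.'), 3.);
   datalines;
10.0.0.25
10.0.1.200
;
run;

/* Range lookup: keep every visitor, attach the place whose range contains it */
proc sql;
   create table work.located as
   select v.ip, v.ipnum, r.place
   from work.visitors as v
        left join work.ipranges as r
        on v.ipnum between r.begin_ip and r.end_ip;
quit;
```

A left join is used so that addresses outside every range remain in the output with a missing PLACE value.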
PROC GEOCODE
The GEOCODE procedure converts address data to geographic coordinates (latitude and longitude values). These
geographic coordinates can then be plotted on a map, used to calculate distances, or used in spatial analysis. Appendix 2
contains more information about what can be done with the geocoded coordinates. In addition, the procedure
enables you to add attribute values to your data if they are in the lookup data. Examples would be adding census
blocks or area codes to an address.
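As one illustration of a distance calculation, the sketch below uses the GEODIST function on geocoded coordinates. It assumes that X holds longitude and Y holds latitude in degrees, as in SASHELP.ZIPCODE. The data set contents and the reference point are invented for the example.

```sas
/* Hypothetical geocoded output: X = longitude, Y = latitude, in degrees */
data work.geocoded;
   input name $ x y;
   datalines;
StoreA -75.55 39.75
StoreB -75.52 39.16
;
run;

data work.distances;
   set work.geocoded;
   /* Hypothetical reference point (latitude, longitude in degrees) */
   ref_lat  = 35.79;
   ref_long = -78.78;
   /* GEODIST takes latitude first, then longitude; 'M' requests miles */
   miles = geodist(y, x, ref_lat, ref_long, 'M');
run;
```

Note the argument order: GEODIST expects latitude before longitude, so Y is passed before X.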
The GEOCODE procedure requires two SAS data sets:
• The input data set that you want to geocode. This data set will contain variables related to the address such
as street address, city, state, and ZIP Code.
• A lookup data set containing the data to transform your address data into geographic locations. By default,
SASHELP.ZIPCODE is used for ZIP Code or city geocoding. Range data (for example, IP addresses) uses
two data sets.
The simplest example of how these data sets are used is with ZIP Code geocoding. Figure 4 shows that the variables
from the input data are carried forward into the output data set. The ZIP variable is looked up in the lookup data set
and the X and Y values (longitude and latitude) are added to the output data set. These are the geographic
coordinates of that ZIP Code centroid. In addition, if the lookup data set contains other attributes such as county
names or census blocks, you can specify that these additional values be moved to the output data set.
In reality, geocoding is more complicated than this. By default, the procedure will give you the next larger area if the
ZIP Code isn’t found. If a standard five-digit ZIP Code isn’t found, then it will attempt to find the city area location. If a
ZIP+4 isn’t found, it will move to the ZIP Code, and then to the city. If you are interested in the ZIP Code location
only, you can turn off this behavior. The _MATCHED_ variable indicates the type of successful match that was found.
The value in this example means the ZIP Code matched. Other possible values are listed in the documentation.
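A minimal sketch of ZIP Code geocoding with the default fallback behavior might look like the following. The customer data is invented, the bare method keyword follows the style of the other listings in this paper, and the default variable names (ZIP, CITY, STATE) and default lookup data set (SASHELP.ZIPCODE) are assumed. Check the option names against the SAS/GRAPH documentation for your release.

```sas
/* Hypothetical input addresses; the last ZIP Code is deliberately invalid */
data work.customers;
   input name $ city $ state $ zip;
   datalines;
Adams Dover DE 19901
Baker Newark DE 19711
Clark Dover DE 99999
;
run;

proc geocode
   zip                           /* Geocoding method */
   data=work.customers           /* Input address data */
   out=work.geocoded             /* Output data set */
   attributevar=(msa areacode);  /* Extra variables copied from the lookup data */
run;

/* Summarize which type of match was made for each address */
proc freq data=work.geocoded;
   tables _matched_;
run;
```

Tabulating _MATCHED_ is a quick way to see how many addresses fell back to a larger area.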
Currently, the GEOCODE procedure supports five methods of geocoding. These include ZIP Code centroid, ZIP+4
center, city center, range geocoding (commonly used as IP address geolocating), and custom geocoding. The
preceding descriptions of ZIP Code and ZIP+4 data explain the type of data returned for these methods. City
geocoding is similar to the others, but a city location is found by averaging the ZIP Code centroid data for the city and
state to find the mean location. Range geocoding is used when the address value falls within a range of
values. This is commonly used for finding the locations of IP addresses, but any type of range lookup could be done.
Custom geocoding enables you to provide your own data and lookup key.
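The city averaging described above can be approximated by hand. This sketch only illustrates the idea and is not necessarily how the procedure computes its result. It assumes the CITY, STATECODE, X, and Y variables in SASHELP.ZIPCODE. The output data set name is hypothetical.

```sas
/* Mean of the ZIP Code centroids for each city and state */
proc sql;
   create table work.citymeans as
   select statecode,
          city,
          mean(x) as x label='Mean Longitude of City',
          mean(y) as y label='Mean Latitude of City',
          count(*) as nzips label='Number of ZIP Codes Averaged'
   from sashelp.zipcode
   group by statecode, city;
quit;
```

Grouping by both state and city keeps same-named cities in different states separate.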
Additional geocoding methods are currently being researched and developed. Street-level geocoding is currently in
development for a future SAS release.
Lookup data is the key to geocoding. SAS does not provide geocoding data for all of the geocoding methods. Some
data must be found on external sites or purchased. Where possible, data is available on the SAS Maps Online
Web site. The lookup data is discussed in more detail below.
PROCEDURE SYNTAX
The syntax for the GEOCODE procedure is very flexible and enables you to specify different lookup data sets and use
different variable names.
The basic syntax is like other SAS procedures:
PROC GEOCODE <option(s)>;
The complete syntax is specified in SAS/GRAPH documentation for the GEOCODE procedure. This can be found in
the SAS/GRAPH Reference at http://support.sas.com/documentation/onlinedoc/graph/index.html.
latest TIGER release that will include ZIP+4 values until the 2010 Census, or possibly later
according to Census documentation.
o You can purchase the Melissa Data Geo*Data ZIP+4 product (www.melissadata.com) at
http://w2.melissadata.com/geocoder/geodata.htm.
o SAS 9.2 includes an autocall macro (GCDMEL9.SAS), which imports Geo*Data files into SAS data
sets for use with PROC GEOCODE. You can modify this program to import other sources of data.
• CITY
o This mechanism uses the same data as the ZIP Code method.
• RANGE (IP Address)
o One source of IP address lookup data is available from MaxMind (www.maxmind.com).
o GeoLite City and GeoLite Country are free products available from
http://www.maxmind.com/app/geolitecity and http://www.maxmind.com/app/geolitecountry.
o GeoIP City and GeoIP Country are the more accurate fee-based products at
http://www.maxmind.com/app/city and http://www.maxmind.com/app/country.
o SAS 9.2 includes an autocall macro (MAXMIND.SAS), which imports the CSV MaxMind files
(binary format not supported). You can modify this program to import other sources of data.
• CUSTOM
o Any data can be used as a custom lookup data set. The only requirement is that it has at least
three variables: the X and Y coordinate values and a key variable to look up. The following
example for the CUSTOM geocoding method provides code where a custom data set is created.
EXAMPLES
The complete examples are available for download from http://support.sas.com/rnd/papers. Search for the title of this
paper. In some cases, the lookup data must be purchased or downloaded and processed before using the examples.
These examples demonstrate each of the current geocoding methods available in SAS 9.2.
The WORK.GEOCODED data set will contain all of the variables in the address data set, plus the variables X, Y,
MSA, and AREACODE. In addition, a _MATCHED_ variable is created to give you the type of match done. Most of
these matches should be ZIP, but some might be None, City, or City mean. These values indicate that the ZIP Code
specified was not found, or that a less precise location was found instead.
Figure 5 shows a map that displays the geocoded points from this example. The examples contain a program
(MAPDEL.SAS) that can be used to display the points on a map of the state of Delaware.
ZIP+4 GEOCODING
This example uses the ZIP+4 (PLUS4) geocoding method. Sample address data to be geocoded is provided in the
example, but you will usually provide this. The address data for this example is very similar to the data for ZIP Code
geocoding in the previous example. SAS does not ship lookup data for this example. You must either download the
data from the SAS Maps Online Web site, or purchase your own data. The example will attempt to provide the
location of the ZIP+4 center for each address. If the ZIP+4 is not found, then it will fall back to the ZIP Code
centroid. If the ZIP Code cannot be found either, the procedure would return the location of the city by default, but
the NOCITY option disables that step in this example. In addition, the attribute variables TRACT, BLOCK, FIPS, FENAME,
and FETYPE will be added to the output data for each match.
The procedure syntax is:
proc geocode
   plus4                                         /* Geocoding method */
   lookup=lookup.zip4                            /* Lookup data set */
   data=work.customers                           /* Input address data */
   out=work.geocoded                             /* Output data set */
   attributevar=(tract block fips fename fetype) /* Added output variables */
   nocity;                                       /* Only do ZIP+4 or ZIP, not CITY */
run;
quit;
CITY GEOCODING
This example uses the CITY geocoding method. Sample address data to be geocoded is provided in the example,
but you will usually provide this. The address data for this example is very similar to the data for ZIP Code geocoding.
This example uses the SASHELP.ZIPCODE file to create a mean city center, based on the ZIP Codes in that city. In
addition, the variables MSA and AREACODE will be added to the output data for each match.
The procedure syntax is:
proc geocode
   city                          /* city method */
   data=work.customers           /* address data to geocode */
   out=work.geocoded             /* output data set */
   lookup=sashelp.zipcode        /* lookup data set to use */
   attributevar=(msa areacode);  /* added output variables */
run;
CUSTOM GEOCODING
Custom geocoding is a flexible lookup that enables you to apply your own type of lookup data to your address data.
In this example, the address data is a list of customers that have only an area code from their phone number. The
lookup data is created by extracting the necessary area code data from the SASHELP.ZIPCODE data set. Because
this is a custom file, you must specify the name of the lookup data set, the name of the address variable, and the
name of the lookup variable.
The following code creates the custom area code lookup file:
/* Sort so that PROC MEANS can process the data BY area code */
proc sort data=sashelp.zipcode out=temp;
   by areacode;
run;

/* Average the ZIP Code centroids within each area code */
proc means data=temp noprint;
   by areacode;
   label x='Mean Longitude of Area Code'
         y='Mean Latitude of Area Code'
         areacode='Area Code';
   var x y;
   output mean= out=work.areacodes (keep=x y areacode label='Area Code Centers');
run;
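With WORK.AREACODES created, a CUSTOM call could then look like the sketch below. The customer data is invented, and AREACODE is kept numeric to match SASHELP.ZIPCODE. The ADDRESSVAR= and LOOKUPVAR= options are used here on the assumption that they name the key variable in the input and lookup data sets respectively. Confirm them in the SAS/GRAPH documentation before relying on this.

```sas
/* Hypothetical customers that carry only an area code */
data work.customers;
   input name $ areacode;
   datalines;
Adams 302
Baker 919
;
run;

proc geocode
   custom                  /* Geocoding method */
   data=work.customers     /* Input address data */
   out=work.geocoded       /* Output data set */
   lookup=work.areacodes   /* Custom lookup data created by PROC MEANS */
   addressvar=areacode     /* Key variable in the input data (assumed option) */
   lookupvar=areacode;     /* Key variable in the lookup data (assumed option) */
run;
```

Because the lookup data already names its coordinates X and Y, no separate coordinate options are given in this sketch.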
IP GEOCODING
IP geocoding uses the RANGE geocoding method and is the most common use of this method. IP addresses are
generally referred to as being geolocated, rather than geocoded. Usually, a range of IP addresses belongs to a
company or an Internet provider, so the lookup information is in ranges. There are no default variable names with this
method, so all data sets and variable names must be specified. There are two files used for lookup. The range data
set provides the ranges. The lookup data provides the latitude and longitude information. A KEY variable links the
two data sets. Internally, the proper range is found, and then the key value is used to access the lookup data set to
find the latitude and longitude for that key.
The procedure syntax is:
proc geocode
range /* Geocoding method */
out=work.IPlocations /* Geolocated output data */
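The listing above names only the method and the output data set. A fuller call has to name both lookup data sets and every variable. The sketch below shows the general shape under stated assumptions: the data set names (WORK.VISITORS, WORK.IPBLOCKS, WORK.IPCITIES) and all variable names are hypothetical placeholders for data you have already imported, and the option names should be verified against the SAS/GRAPH documentation for your release.

```sas
/* Assumes these hypothetical data sets already exist:                  */
/*   WORK.VISITORS - input data with the IP address in variable IPADDR   */
/*   WORK.IPBLOCKS - range data with STARTIP, ENDIP, and key LOCID       */
/*   WORK.IPCITIES - lookup data with key LOCID, LONGITUDE, and LATITUDE */
proc geocode
   range                    /* Geocoding method */
   data=work.visitors       /* Input data with IP addresses */
   out=work.IPlocations     /* Geolocated output data */
   addressvar=ipaddr        /* IP address variable in the input data */
   rangedata=work.ipblocks  /* Range data set */
   beginrangevar=startip    /* Start of each range */
   endrangevar=endip        /* End of each range */
   rangekeyvar=locid        /* Key variable in the range data */
   lookup=work.ipcities     /* Lookup data with coordinates */
   lookupkeyvar=locid       /* Key variable in the lookup data */
   lookupxvar=longitude     /* Longitude variable in the lookup data */
   lookupyvar=latitude;     /* Latitude variable in the lookup data */
run;
```

The two key options carry the link described above: the matching range supplies a key value, and that key selects the row of coordinates in the lookup data.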
CONCLUSION
PROC GEOCODE converts your address data to map locations. The procedure is both simple to use and extremely
flexible, enabling you to provide your own lookup data. If you need the most current data, non-U.S. postal codes, or
any specialized data, you can purchase this data and easily use it with the procedure.
REFERENCES
SAS/GRAPH 9.2 Documentation: http://support.sas.com/documentation/onlinedoc/graph/index.html
RESOURCES
“PROC GEOCODE: Creating Map Locations from Your Data.” SAS Global Forum 2009 paper and example source
code download. SAS Institute Inc. http://support.sas.com/rnd/papers/.
“Tips and Tricks IV: More SAS/Graph Map Secrets.” SAS Global Forum 2009 paper and example source code
download. SAS Institute Inc. http://support.sas.com/rnd/papers/.
“SAS Mapping: Technologies, Techniques, Tips and Tricks.” SUGI 28 SAS Presents handout and example source
code download. SAS Institute Inc. http://support.sas.com/rnd/papers/.
“Tips and Tricks II: Getting the most from your SAS/GRAPH maps.” SUGI 29 SAS Presents handout and example
source code download. SAS Institute Inc. http://support.sas.com/rnd/papers/.
“Tips and Tricks III: More Unique SAS/GRAPH Maps.” SUGI 30 SAS Presents handout and example source code
download. SAS Institute Inc. http://support.sas.com/rnd/papers/.
Ed Odom
SAS Institute Inc.
SAS Campus Drive
Cary, NC 27513
[email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
APPENDIX 1
What is ZIP+4?
ZIP+4 is an optional four-digit add-on to the standard ZIP Code. It subdivides a larger ZIP Code delivery
area into smaller carrier routes. It is used primarily by businesses or large commercial mailers to pre-sort mail for a
bulk discount.
SAS data set for your platform. This data set can be used with the PLUS4 option in PROC GEOCODE for ZIP+4
geocoding.
APPENDIX 2
DISPLAY ON MAP
Geocoding assigns latitude and longitude values to your addresses. Those geocoded locations can then be displayed
on a map.
SPATIAL ANALYSIS
Spatial analysis involves determining the effect of position on response data. Geocoded X and Y locations can be
used as the basis for spatial analyses. Several SAS procedures and products perform spatial analysis of attribute
values associated with X and Y locations and compute predicted values at other positions:
• PROC VARIOGRAM
• PROC KRIGE2D
• JMP