Data Cleaning and Base SAS Functions: Caroline Bahler, Meridian Software Inc
Paper 56-26
Beginning Tutorials

General Comments on Data Cleansing
Data cleaning, cleansing, and scrubbing are synonymous terms for the same process: the removal of incorrect data values from a data source (5). Dirty data is data that contains incorrect or erroneous values.

Data cleansing is an art, not a science. Each set of data that needs to be cleaned has its own set of headaches and cleansing solutions. Therefore, the following functions are presented as general tools that allow the “cleanser” to tackle many types of basic cleansing problems, rather than as specific solutions for a defined situation.

Data cleansing requires the following information:
• Is there a pre-existing data source, either a database table or a data set, that the new data will be added to?
• Are there any business rules that need to be used during cleansing? Often one of the cleansing chores is to convert one field into another using a set of criteria.
• What are the cleansing problems in the new data? Before any cleansing effort can begin, an inventory of all of the obvious flaws in the data needs to be compiled.

Finally, some general rules of data cleansing:
• The data is ALWAYS dirtier than you thought it was.
• New problems will always come to light once the obvious ones have been solved.
• Data cleansing is an on-going process that never stops.

In the previous example, a new character variable called zip code was created utilizing the PUT function. Conversely, if the zip code in the new mail list is character but needs to be numeric, then the INPUT function can be used (2). For example,

data newlist;
   set newdata.maillist;
   zipcode = INPUT(zip,8.);
run;

In addition to character/numeric conversions, the PUT and INPUT functions can be used to convert date/time values into character variables and vice versa.

Character Functions
Frequently it is necessary to change the form of a character variable or to use only a portion of its value. For example, you might need to uppercase all letters within the variable value. In this case, a new variable does not need to be defined for the function to be used.

The following is a list of character functions that are extremely useful in data cleansing.

Function               Use
Compress               Removes specified characters from a variable. One use is
                       to remove unnecessary spaces from a variable.
Index, indexc, indexw  These functions return the starting position of a
                       character, character string, or word, and are extremely
                       useful in determining where to start or stop when
                       substringing a variable.
Left                   Left justifies the variable value.
Length                 Returns the number of characters within a character
                       variable value.
Lowcase                Lower cases all letters within a variable value.
Right                  Right justifies the variable value.
Scan                   Returns a portion of the variable value as defined by a
                       delimiter. For example, the delimiter could be a space,
                       comma, semi-colon, etc.
Substr                 Returns a portion of the variable value based on a
                       starting position and number of characters.
Translate              Replaces specific characters with other characters that
                       are specified.
Tranwrd                Replaces a portion of the character string (word) with
                       another character string or word. For example, a
                       delimiter was supposed to be a comma but the data in
                       some cases contains a colon. This function could be
                       used to replace the colon with a comma.
Trim                   Removes the trailing blanks from the right-hand side of
                       a variable value.
Upcase                 Upper cases all letters within a variable value.

If you need to use one of these functions on a numeric variable, then it is preferable to first convert the numeric value into a character value (see previous section). By default, conversion from numeric to character will occur when these functions are used within the DATA step, with a conversion note written to the log at the end of the DATA step (3).

For example:

A new mailing list contains a date value that is character and needs to be converted into a SAS date value. An additional challenge is that the character value does not match any date informat. The solution to this conversion has two (2) steps:
1. Re-arrange the date character value so that the date is in the form ddmonyyyy, i.e. the form read by the date9. informat.
2. Convert the new character value to a date value.

data newlist;
   set newdata.maillist;

   /* Extract month, day and year */
   /* from the date character var (a) */
   m = scan(date,1,' ');
   d = scan(date,2,' ');
   y = scan(date,2,',');

   dd = compress(d||m||y,' ,');

   /* Convert mon, day, year into */
   /* new date variable (b) */
   newdate = input(dd,date9.);
run;

a) In this case the SCAN function was used, but the SUBSTR function could also have been used to extract the month, day, and year from the original character date variable. The SCAN function was used because the data values contained a space or comma delimiter. Note that the comma was used to delimit the year, and that the year is the second piece of text and NOT the third. The reason is that, when the comma is used as the only delimiter, the text string has only two pieces: month and day before the comma, and year after the comma. The SUBSTR function would have been the only choice if a delimiter had not been available.
b) Conversion of the resulting month, day and year variables into a new variable was accomplished by utilizing the COMPRESS and INPUT functions. The COMPRESS function was used to remove any spaces present within the three concatenated variables and to remove the comma within the day variable value. Note: because the SCAN function was used with a space delimiter to extract the day value from the original date variable, the comma was left with the day value, since there was no space between the day and the comma. Finally, the use of the INPUT function creates a new variable with a SAS date value.

In many data cleansing scenarios, a single variable contains multiple pieces of data that need to be split into separate variables. If there is no delimiter between them, then the variable must be divided using the SUBSTR (substring) function.

The SUBSTR function requires a starting point and the number of characters to be kept in the new variable. In some cases, however, the starting point may not be constant. In those cases then several
• A character or a specific set of characters occurs where the character string starts. Using the data from the last example, the last 3 characters can be extracted using INDEX to define the starting position.

data cleandata;
   set dirtydata;
   oldidx = upcase(oldid);
   a = substr(oldid,index(oldidx,'B'),3);
   put a;
run;

The new mailing list has, in one case, separate variables for the month, day, and year of one date. The problem is that this data needs to be added to a pre-existing data set that contains this information as a single SAS date. If the data is numeric, then the MDY function converts the separate variables into a single date value variable. However, if the data is character, then the conversion to numeric should occur first, followed by the conversion to the date value.

The following code shows how this two (2) part process can occur within one (1) statement.
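A minimal sketch of such a statement, assuming the month, day, and year arrive as character variables named month, day, and year (hypothetical names, as are the data set name and the sample values). Each INPUT call converts one character value to numeric, and MDY combines the three results into a SAS date in the same assignment.

```sas
data newlist;
   /* month, day, year are assumed character variables */
   length month day $2 year $4;
   input month $ day $ year $;

   /* character-to-numeric conversion nested inside MDY */
   newdate = mdy(input(month,2.), input(day,2.), input(year,4.));

   /* display the numeric date value in ddmonyyyy form */
   format newdate date9.;
   datalines;
01 17 2001
;
run;
```

Nesting the INPUT calls avoids creating three intermediate numeric variables that would later have to be dropped.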
is a function that can be used to verify that the abbreviation for the state is correct. However, this conversion has another use in identifying the zip codes that are potentially incorrect.

The following is a list of state and zip code functions that are extremely useful in data cleansing.

Function   Use
Stname     Returns state name in all upper case from state abbreviation.
Stnamel    Returns state name in mixed case from state abbreviation.
Zipname    Returns state name in upper case from zip code.
Zipnamel   Returns state name in mixed case from zip code.
Zipstate   Returns state abbreviation from zip code.

For example:

data newlist;
   set newdata.maillist;

   if state ne zipstate(zip) then
      stateflag=1;
   else
      stateflag=0;
run;

In the example above, the value returned by the ZIPSTATE function is compared to the variable containing the state abbreviation. If the two state abbreviations do not match, then a flag is set.

Putting it all together
Appendix 1 is an example of using all of the function types to cleanse a set of data that is going to be added to a pre-existing data table in a data warehouse. Table 1 lists the data in its “raw” form. All variables within the “raw” data set are character variables.

The following changes need to be made:
• Change moddate to a datetime value.
• Upper case all state abbreviations.
• Ensure all phone numbers use only a dash as the divider.
• Add an identifier: the data needs a character variable that uniquely identifies each row. The identifier needs to start with 1000.
• Determine if the state abbreviations match the zip code determined abbreviations.

could be caused by either a data entry problem with the state abbreviation or a data entry problem with the zip code. In this case, our program has not identified the actual problem. Instead the program has identified only that there is a problem.

Conclusion
This paper was not an exhaustive study of all functions available within SAS to cleanse data. Instead it discussed the most common base functions used to:
• perform data type conversions
• parse or change the justification or case of character variables
• parse and create date/time values
• determine state names from state abbreviations and zip codes

References
1. Functions and Call Routines, Base SAS Software. SAS On-line Documentation version 8. SAS Institute, Inc. Cary, NC.
2. Delwiche, Lora D. and Slaughter, Susan J. 1998. The Little SAS Book, Second Edition. SAS Institute, Inc. Cary, NC. pp 204-205.
3. Zip codes for basic example: www.usps.com
4. Howard, Neil. 1999. Introduction to SAS Functions. Proceedings of the Twenty-fourth Annual SAS User’s Group International Conference. SAS Institute, Inc. Cary, NC. pp 393-399.
5. Karp, Andrew. 1999. Working with SAS Date and Time Functions. Proceedings of the Twenty-fourth Annual SAS User’s Group International Conference. SAS Institute, Inc. Cary, NC. pp 400-406.
6. Cody, Ron. 2000. Cody's Data Cleaning Techniques Using SAS Software. SAS Institute, Inc. Cary, NC.

Trademarks
SAS® and all SAS products are trademarks or registered trademarks of SAS Institute Inc. Meridian Software, Inc.® is a registered trademark of Meridian Software, Inc.

Contact Information
Caroline Bahler
Meridian Software, Inc.
12204 Old Creedmoor Road
Raleigh, NC 27613
(919) 518-1070
[email protected]
Cleansing Program.
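data cleandata;              /* placeholder output data set name */
   set rawdata;              /* placeholder name for the raw data in Table 1 */
   retain i 1000;

   /* identifier */
   i = i + 1;
   id = put(i,4.);

   /* conversion to datetime: rebuild moddate as ddmonyyyy:hh:mm:ss */
   date = compress(scan(moddate,2,' '),',')||
          scan(moddate,1,' ')||
          scan(moddate,3,' ');
   time = scan(moddate,4,' ');
   datetime = input(compress(date||":"||time),datetime21.);
run;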
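
The program above shows only the identifier and datetime steps. As a rough sketch, the other listed changes (upper casing the state, standardizing the phone divider, and flagging state/zip code mismatches) could be handled with the UPCASE, TRANSLATE, and ZIPSTATE functions described earlier. The data set names rawsample and cleansample are hypothetical, and the two input rows are copied from Table 1 only so that the step runs on its own.

```sas
/* Two rows from Table 1, entered inline so the sketch is self-contained */
data rawsample;                       /* hypothetical data set name */
   infile datalines dlm='|';
   length state $2 zip $5 phone $8;
   input state $ zip $ phone $;
   datalines;
MI|37501|235-9875
ga|30344|965/5692
;
run;

data cleansample;                     /* hypothetical data set name */
   set rawsample;

   /* upper case the state abbreviation */
   state = upcase(state);

   /* replace a slash divider in the phone number with a dash */
   phone = translate(phone,'-','/');

   /* flag rows whose state does not match the state derived from the zip code */
   if state ne zipstate(zip) then stateflag = 1;
   else stateflag = 0;
run;
```

Upper casing the state before the comparison matters here: otherwise a lower-case abbreviation would be flagged as a mismatch even when the zip code agrees with it. With these two rows, the Memphis record (MI with zip code 37501) should be the one that is flagged.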
Table 1. Original Data
Id    First   Last      Address             City          State  Zip    Area  Phone     Moddate
1001  Brenda  Jones     101 1st St          Omaha         NE     68101  123   147-2457  Jan 17, 2001 20:07:49
1002  Jim     Smith     5 Keyland Lane      Portland      ME     04103  213   125-4596  Jan 17, 2001 20:07:49
1003  John    Handford  3269 Graceland Ave  Memphis       MI     37501  111   235-9875  Jan 17, 2001 20:07:49
1004  Jane    Kew       5684 Jonesboro Rd   Blowing Rock  NC     28605  102   286-5468  Jan 17, 2001 20:07:49
1005  Mary    Roderick  201 Garland Dr      Atlanta       ga     30344  412   965/5692  Jan 17, 2001 20:07:49

Table 2. Cleansed Data
Id    First   Last      Address             City          State  Zip    Area  Phone     Moddate
1001  Brenda  Jones     101 1st St          Omaha         NE     68101  123   147-2457  17JAN2001:20:07:49
1002  Jim     Smith     5 Keyland Lane      Portland      ME     04103  213   125-4596  17JAN2001:20:07:49
1003  John    Handford  3269 Graceland Ave  Memphis       MI     37501  111   235-9875  17JAN2001:20:07:49
1004  Jane    Kew       5684 Jonesboro Rd   Blowing Rock  NC     28605  102   286-5468  17JAN2001:20:07:49
1005  Mary    Roderick  201 Garland Dr      Atlanta       GA     30344  412   965-5692  17JAN2001:20:07:49
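
Table 3. State Abbreviation Compared With Zip Code Determined Abbreviation (* = mismatch)
Id    State  State from zip code
1001  NE     NE
1002  ME     ME
1003  MI     TN  *
1004  NC     NC
1005  ga     GA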