Comparing Two Tables
ABSTRACT
This paper gives some tips for correctly and efficiently comparing two data files via SAS® programs, to
determine not just whether discrepancies exist but where they exist. This can be helpful if you need to
compare two different versions of a file. The tips include thoughts on different methods of reading data
into SAS, then code examples for the COMPARE and SQL procedures to compare the datasets.
INTRODUCTION
It is a common occurrence that we need to compare two different versions of a file. It could be that there
are two different versions of the data and we are told to check what changed. Or, we are transitioning
from one method or tool for making a file to a new one, and we want to make sure that they produce the
same output from identical input. Also, it could be that two versions were generated as a way to validate
the data. (For example, you assign two programmers to independently generate the same file, then
check if their results are the same as verification. It is unlikely they would both make the identical
mistake.)
For this paper, we will use an imaginary report, using dummy data. Let’s say an office is told to generate
a report of student enrollment counts. The report should include rows for the following counts:
by School
by School and Grade
by School and Gender
by School, Gender, and Grade
Also, if any student-level record has a missing Grade value, exclude it from the counts as an invalid
record. Our data contains 1 school with 40 students, 10 of whom have missing Grade values.
For our example, we will say (somewhat absurdly) that the company has decided to verify the file by
having three programmers independently generate it. This will enable us to highlight how different types
of differences appear.
Our dataset data1 is correct: it has all the required combinations of data and properly excluded records
without a grade. Our dataset data2 has all the required combinations of data but failed to exclude records
without a grade. (Note that it contains 40 students total, instead of just 30.) Our dataset data3 is like
data1, except that it omits the rows for the by School, Gender, and Grade combination. The datasets are
shown below.
See appendices 1 and 2, at the end of this paper, for the full SAS code to generate all output in this
paper.
Output 2. PROC COMPARE Without ID Statement
However, this method will only identify differences by observation number. Although that is fine for telling
you that discrepancies exist, it is not very helpful for troubleshooting where or why discrepancies
occurred. To see more details, or if order does not matter, you can declare some ID variables.1 First use
the SORT procedure to sort the datasets by the variables, then run PROC COMPARE, as follows:
PROC SORT DATA=data1; BY SchoolID Grade Gender; RUN;
PROC SORT DATA=data2; BY SchoolID Grade Gender; RUN;
PROC COMPARE DATA=data1 COMPARE=data2 LISTALL;
ID SchoolID Grade Gender;
RUN;
Now we can see that the discrepancies occurred at the SchoolID level and the SchoolID-by-Gender level.
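The sort-and-compare steps above, combined with the footnoted suggestion of storing the ID variables in a macro variable, might be sketched as follows. The macro variable name idvars is an assumption chosen for illustration, and the inline DATALINES steps simply reproduce the data1 and data2 values from the appendix so that the sketch runs on its own.

```sas
/* Hypothetical macro variable: the ID variables are typed once and reused. */
%let idvars = SchoolID Grade Gender;

/* Rebuild data1 and data2 inline (values copied from the appendix). */
data data1;
   infile datalines dsd missover;
   input SchoolID :$4. Grade :$2. Gender :$1. Count;
   datalines;
0001,03,M,10
0001,03,F,10
0001,03,,20
0001,05,M,5
0001,05,F,5
0001,05,,10
0001,,M,15
0001,,F,15
0001,,,30
;
run;

data data2;
   infile datalines dsd missover;
   input SchoolID :$4. Grade :$2. Gender :$1. Count;
   datalines;
0001,03,M,10
0001,03,F,10
0001,03,,20
0001,05,M,5
0001,05,F,5
0001,05,,10
0001,,M,20
0001,,F,20
0001,,,40
;
run;

/* Both datasets must be sorted by the ID variables before PROC COMPARE. */
proc sort data=data1; by &idvars; run;
proc sort data=data2; by &idvars; run;

/* Match observations on the ID variables instead of observation number. */
proc compare data=data1 compare=data2 listall;
   id &idvars;
run;
```

Because the variable list lives in one place, changing the matching key means editing only the %LET statement.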
1 You could save the ID variables to a macro variable to limit the number of times you need to type them out. Doing
so can help reduce errors from typos.
Output 5. PROC COMPARE Showing Missing Observations
The LISTALL option will also explicitly list any variables that are in one dataset but not in the other. That
is not the case in our examples, but it is good to be aware that this procedure does check for it. (Without
the LISTALL option, some of the output will state that there are a different number of variables, but it is
not as easy to notice.)
When using PROC COMPARE, it is important to be aware of how its output flags different types of
discrepancies as well as situations when it is hard to notice that discrepancies occurred. For more detail
on how to safely understand PROC COMPARE output, check the Recommended Reading section.
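One way to reduce the risk of overlooking a discrepancy in the printed output is to check the return code that PROC COMPARE stores in the automatic macro variable SYSINFO. The following is a minimal sketch of that idea, not a method taken from this paper. The macro variable name rc is an assumption, and the inline data steps repeat the appendix values for data1 and data2 so the example is self-contained.

```sas
/* Rebuild data1 and data2 inline (values copied from the appendix). */
data data1;
   infile datalines dsd missover;
   input SchoolID :$4. Grade :$2. Gender :$1. Count;
   datalines;
0001,03,M,10
0001,03,F,10
0001,03,,20
0001,05,M,5
0001,05,F,5
0001,05,,10
0001,,M,15
0001,,F,15
0001,,,30
;
run;

data data2;
   infile datalines dsd missover;
   input SchoolID :$4. Grade :$2. Gender :$1. Count;
   datalines;
0001,03,M,10
0001,03,F,10
0001,03,,20
0001,05,M,5
0001,05,F,5
0001,05,,10
0001,,M,20
0001,,F,20
0001,,,40
;
run;

proc sort data=data1; by SchoolID Grade Gender; run;
proc sort data=data2; by SchoolID Grade Gender; run;

/* NOPRINT suppresses the report; only the return code is wanted here. */
proc compare data=data1 compare=data2 noprint;
   id SchoolID Grade Gender;
run;

/* Save SYSINFO right away: the next step would overwrite it. */
%let rc = &sysinfo;

/* Decode a few of the documented SYSINFO bits and write them to the log. */
data _null_;
   rc = &rc;
   if rc = 0 then put 'NOTE: PROC COMPARE found no differences.';
   else do;
      put 'WARNING: PROC COMPARE return code is ' rc;
      if band(rc, 64)   then put '  BASE has an observation not in COMPARE.';
      if band(rc, 128)  then put '  COMPARE has an observation not in BASE.';
      if band(rc, 4096) then put '  A value comparison was unequal.';
   end;
run;
```

A return code of 0 means the procedure found no differences at all. Any other value is a sum of bit flags, so testing individual bits with the BAND function shows which kinds of discrepancy occurred.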
2 If your columns are in a different order between the datasets, use EXCEPT CORR instead of just EXCEPT. You
can just use EXCEPT CORR in either case if you want to make certain SAS is comparing the correct columns. Note
that other set operators also exist and can be handy for organizing data or conducting quality control. See the online
documentation or the Advanced Programming SAS Certification guide for details.
Output 6. PROC SQL Data1 and Data2 (with Counts)
If you do not want to see discrepancies in counts, but rather just see missing observations, you can
accomplish this by DROPping the variable that changes: in our example, Count. Here is the comparison
between data1 and data3, dropping the Count variable. (The second query has no output since data1
was not missing any data.)
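The comparison just described can be sketched as a pair of EXCEPT queries, following the same pattern as the appendix code. The inline data steps are an assumption made so the example runs without the CSV files. They reproduce the data1 and data3 values from the appendix.

```sas
/* Rebuild data1 and data3 inline (values copied from the appendix). */
data data1;
   infile datalines dsd missover;
   input SchoolID :$4. Grade :$2. Gender :$1. Count;
   datalines;
0001,03,M,10
0001,03,F,10
0001,03,,20
0001,05,M,5
0001,05,F,5
0001,05,,10
0001,,M,15
0001,,F,15
0001,,,30
;
run;

data data3;
   infile datalines dsd missover;
   input SchoolID :$4. Grade :$2. Gender :$1. Count;
   datalines;
0001,03,,20
0001,05,,10
0001,,M,15
0001,,F,15
0001,,,30
;
run;

proc sql;
   /* Rows whose key combination is in data1 but not in data3. */
   title 'records in data1 not in data3';
   select * from data1(drop=Count)
   except
   select * from data3(drop=Count);

   /* The reverse direction: rows in data3 but not in data1. */
   title 'records in data3 not in data1';
   select * from data3(drop=Count)
   except
   select * from data1(drop=Count);
quit;
title;  /* clear the title so it does not carry into later output */
```

Dropping Count with the DROP= dataset option means the two queries compare only SchoolID, Grade, and Gender, so a row counts as missing only when its key combination is absent from the other dataset.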
CONCLUSION
This paper has presented some tips and tools for checking whether two files are identical. My hope is
that, whether for validation via dual processing, checking between versions, or some other purpose, the
ideas in this paper will be helpful. I should also note that similar results could, at least in theory, be
gained through checksums or hashing methods; however, I am not familiar enough with those methods to
recommend them, nor have I found an instance where that would be preferable to methods shown here.
To run the SAS code in Appendix 2, save the contents below into three text files and set the macro
variable &dir to the directory that contains them.
data1.csv
SchoolID,Grade,Gender,Count
0001,03,M,10
0001,03,F,10
0001,03,,20
0001,05,M,5
0001,05,F,5
0001,05,,10
0001,,M,15
0001,,F,15
0001,,,30
data2.csv
SchoolID,Grade,Gender,Count
0001,03,M,10
0001,03,F,10
0001,03,,20
0001,05,M,5
0001,05,F,5
0001,05,,10
0001,,M,20
0001,,F,20
0001,,,40
data3.csv
SchoolID,Grade,Gender,Count
0001,03,,20
0001,05,,10
0001,,M,15
0001,,F,15
0001,,,30
run;
proc print noobs; title 'dataset 2'; run;
data data3;
infile "&dir\data3.csv" firstobs=2 dsd missover pad lrecl=1000;
input SchoolID :$4. Grade :$2. Gender :$1. Count :8.;
run;
proc print noobs; title 'dataset 3'; run;
*using DROP;
title2 'records in data1 not in data3';
SELECT * FROM data1(DROP=Count)
EXCEPT
SELECT * FROM data3(DROP=Count)
;
title2 'records in data3 not in data1';
SELECT * FROM data3(DROP=Count)
EXCEPT
SELECT * FROM data1(DROP=Count)
;
QUIT;
ACKNOWLEDGMENTS
The author thanks the WUSS chair and staff for this conference and the opportunity to present.
RECOMMENDED READING
Hemedinger, Chris. “Using LIBNAME XLSX to read and write Excel files.”
https://blogs.sas.com/content/sasdummy/2015/05/20/using-libname-xlsx-to-read-and-write-excel-files/.
Accessed 21 December 2018.
Horstman, Joshua and Roger Muller (2016), “Don’t Get Blindsided by PROC COMPARE,” 2016 Western
Users of SAS Software, San Francisco, CA, USA.
SAS Institute Inc. 2011. SAS® Certification Prep Guide: Advanced Programming for SAS®9, Third
Edition. Cary, NC: SAS Institute Inc.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Aaron R. Brown
South Carolina Department of Education
1429 Senate Street, Columbia, SC 29201
Work Phone: 803-734-8858
E-mail: [email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.