Paper 129-2019

Tips for Correctly and Efficiently Comparing Two Files in SAS®

Aaron Brown, South Carolina Department of Education

ABSTRACT
This paper gives some tips for correctly and efficiently comparing two data files via SAS® programs, not
just to locate if discrepancies exist but where they exist. This can be helpful if you need to compare two
different versions of a file. The tips include thoughts on different methods of reading data into SAS, then
code examples for the COMPARE and SQL procedure to compare the datasets.

INTRODUCTION
It is a common occurrence that we need to compare two different versions of a file. It could be that there
are two different versions of the data and we are told to check what changed. Or, we are transitioning
from one method or tool for making a file to a new one, and we want to make sure that they produce the
same output from identical input. Also, it could be that two versions were generated as a way to validate
the data. (For example, you assign two programmers to independently generate the same file, then
check if their results are the same as verification. It is unlikely they would both make the identical
mistake.)
For this paper, we will use an imaginary report, using dummy data. Let’s say an office is told to generate
a report of student enrollment counts. The report should include rows for the following counts:
• by School
• by School and Grade
• by School and Gender
• by School, Gender, and Grade
Also, if any student-level record has a missing Grade value, exclude it from the counts as an invalid
record. Our data contains 1 school with 40 students, 10 of which have missing Grade values.
For our example, we will say (somewhat absurdly) that the company has decided to verify the file by
having three programmers independently generate it. This will enable us to highlight how different types
of differences appear.
Our dataset data1 is correct: it has all the required combinations of data and properly excluded records
without a grade. Our dataset data2 has all the required combinations of data but failed to exclude records
without a grade. (Note that it contains 40 students total, instead of just 30.) Our dataset data3 matches
data1, except that it omits the rows for the by School, Gender, and Grade combination. The datasets are
shown below.

Table 1. Contents of our Sample Datasets

First, in Step 1, we will look at issues when getting the data ready for comparison. Then, in Step 2, we
will compare the data with the COMPARE procedure. In Step 3, we will look into specific details of
discrepancies via the SQL procedure.

See appendices 1 and 2, at the end of this paper, for the full SAS code to generate all output in this
paper.

STEP 1: READING YOUR DATA INTO SAS

If you are comparing SAS datasets, then this step is essentially a non-issue; the data is already ready-to-
use in SAS. But, often, we are required to generate CSV files or Excel files. When that is the case, you
need to read the data in those files back into SAS before you can compare them. It is important to make
sure you read them in a way that keeps your comparison valid. (Plus, it is embarrassing to tell
someone they screwed up, when it was actually your program that screwed up.)
If your data is a CSV or text file, I highly recommend using the DATA step to read in data. The IMPORT
procedure can be used, but it can cause unexpected results, such as converting something to numeric
when you want it as a character string.
As an example, here is what our data1 dataset looks like if read in through PROC IMPORT.

Output 1. Dataset 1 via PROC IMPORT

All the observations were read into SAS, but SchoolID and Grade were rendered as numeric variables.
This caused their leading zeroes to be dropped. This change will negatively impact our comparisons via
PROC COMPARE and PROC SQL.
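If a dataset has already been imported this way, the leading zeroes can usually be restored. The sketch below assumes the PROC IMPORT result is named data1_imp (as in Appendix 2), that SchoolID and Grade both arrived as numeric, and that their original widths were 4 and 2. The output name data1_fixed is only illustrative.

```sas
/* Sketch: rebuild zero-padded character keys from numeric columns. */
data data1_fixed;
  length SchoolID $4 Grade $2;   /* assumed original widths */
  set data1_imp(rename=(SchoolID=SchoolID_n Grade=Grade_n));
  /* The Zw. format pads with leading zeroes; missing values stay blank */
  if not missing(SchoolID_n) then SchoolID = put(SchoolID_n, z4.);
  if not missing(Grade_n) then Grade = put(Grade_n, z2.);
  drop SchoolID_n Grade_n;
run;
```

Reading the file with a DATA step in the first place avoids the need for this repair.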
If you must use PROC IMPORT, it can be helpful to check the variable types and lengths of the
datasets you plan to compare. You could do this programmatically through the CONTENTS and
COMPARE procedures or by glancing at the datasets.
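One possible way to do the programmatic check is sketched here. It writes the variable attributes of each dataset to a small table and compares those tables. The names attr1 and attr2 are hypothetical, and data1_imp is the PROC IMPORT result from Appendix 2.

```sas
/* Sketch: compare variable attributes before comparing values. */
proc contents data=data1 noprint
  out=attr1(keep=name type length);
run;
proc contents data=data1_imp noprint
  out=attr2(keep=name type length);
run;

/* Upper-case the names so the match does not depend on spelling case */
data attr1; set attr1; name = upcase(name); run;
data attr2; set attr2; name = upcase(name); run;

proc sort data=attr1; by name; run;
proc sort data=attr2; by name; run;

/* In the OUT= dataset, TYPE is 1 for numeric and 2 for character */
proc compare data=attr1 compare=attr2 listall;
  id name;
run;
```

Any variable whose TYPE or LENGTH differs between the two tables is worth fixing before the real comparison.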
If importing Excel spreadsheets through PROC IMPORT, the MIXED option can sometimes help prevent
variable type issues. You can also try using LIBNAME XLSX, to link directly to the Excel document; a
blog with details about using that is in the references of this paper. But, in general, using Excel means
things will be a little bit trickier and you might need to add some code to undo any assumptions SAS
made while reading in the data.
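As a rough illustration of the LIBNAME XLSX approach, the sketch below links to a workbook and copies one sheet into a SAS dataset. The workbook name data1.xlsx, the sheet name Sheet1, and the libref xl are all assumptions for the example. The XLSX engine also requires SAS/ACCESS Interface to PC Files.

```sas
/* Sketch: read a worksheet through the XLSX LIBNAME engine. */
libname xl xlsx "&dir\data1.xlsx";   /* hypothetical workbook */

/* List the sheets SAS can see, along with the types it assigned */
proc contents data=xl._all_;
run;

/* Copy one sheet into WORK; Sheet1 is an assumed sheet name */
data data1_xl;
  set xl.Sheet1;
run;

libname xl clear;   /* release the file */
```

Checking the PROC CONTENTS output first shows whether any key variables were read as numeric.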
For the rest of this paper, we will assume that the data were read into SAS properly.

STEP 2: COMPARE THE DATA

Comparing the datasets is relatively easy by using PROC COMPARE. If you expect no discrepancies (as
is our expectation here), I recommend using the LISTALL option; if you expect several discrepancies, you
may want to exclude it lest it generate too much clutter. If the order of the observations should be
identical, you can simply run code like the below:
PROC COMPARE DATA=data1 COMPARE=data2 LISTALL;
RUN;
We see that the 7th, 8th, and 9th observations had discrepancies.

Output 2. PROC COMPARE Without ID Statement
However, this method will only identify differences by observation number. Although that is fine for telling
you that discrepancies exist, it is not very helpful for troubleshooting where or why discrepancies
occurred. To see more details, or if order does not matter, you can declare some ID variables.¹ First use
the SORT procedure to sort the datasets by the variables, then run PROC COMPARE, as follows:
PROC SORT DATA=data1; BY SchoolID Grade Gender; RUN;
PROC SORT DATA=data2; BY SchoolID Grade Gender; RUN;
PROC COMPARE DATA=data1 COMPARE=data2 LISTALL;
ID SchoolID Grade Gender;
RUN;
Now we can see that the discrepancies occurred at the SchoolID-level and the SchoolID-by-Gender-level.

Output 3. PROC COMPARE With ID Statement

That is much more helpful for telling the programmers what is wrong and what they need to double-check.
PROC COMPARE can also tell you if something is in one dataset but missing from another one. We will
show that by comparing data1 and data3 (after sorting data3). Its comparison shows no differences in
counts!

Output 4. PROC COMPARE With No Compared Values Unequal

However, if we look in the rest of PROC COMPARE’s output, we see that some observations were
missing.

¹ You could save the ID variables to a macro variable to limit the number of times you need to type them out. Doing
so can help decrease errors via typos.
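A minimal sketch of that idea follows. The macro variable name idvars is an arbitrary choice.

```sas
/* Sketch: keep the ID variable list in one place. */
%let idvars = SchoolID Grade Gender;

proc sort data=data1; by &idvars; run;
proc sort data=data2; by &idvars; run;

proc compare data=data1 compare=data2 listall;
  id &idvars;
run;
```

If the key changes later, only the %LET statement needs to be edited.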

Output 5. PROC COMPARE Showing Missing Observations
The LISTALL option will also explicitly list if any variables are in one dataset but not in the other. Such is
not the case in our examples, but it is good to be aware that this procedure does check it. (Without the
LISTALL option, some of the output will state there are a different number of variables, but it is not as
easy to notice.)
When using PROC COMPARE, it is important to be aware of how its output flags different types of
discrepancies as well as situations when it is hard to notice that discrepancies occurred. For more detail
on how to safely understand PROC COMPARE output, check the Recommended Reading section.
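One way to reduce the risk of overlooking a discrepancy is to check the return code that PROC COMPARE stores in the automatic macro variable SYSINFO. The sketch below is not from the original paper. It assumes data1 and data3 are already sorted by the ID variables, uses a hypothetical macro variable rc, and decodes only a few of the documented bit flags.

```sas
/* Sketch: test the PROC COMPARE return code instead of reading output. */
proc compare data=data1 compare=data3 noprint;
  id SchoolID Grade Gender;
run;
%let rc = &sysinfo;   /* capture at once; later steps overwrite it */

data _null_;
  rc = &rc;
  if rc = 0 then put 'Datasets match.';
  else do;
    put 'Datasets differ. Return code: ' rc;
    /* Each condition sets one bit of the code */
    if band(rc, 64)   then put '  Base has observations not in comparison.';
    if band(rc, 128)  then put '  Comparison has observations not in base.';
    if band(rc, 1024) then put '  Base has variables not in comparison.';
    if band(rc, 2048) then put '  Comparison has variables not in base.';
    if band(rc, 4096) then put '  At least one value comparison was unequal.';
  end;
run;
```

A nonzero code signals that something differs even when the table of unequal values is empty, as in the data1 and data3 comparison.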

STEP 3: PROC SQL TO FIND DISCREPANCIES

If there are a lot of records in one dataset but not in another, PROC COMPARE can become too cluttered
to really understand what is going on. When I was faced with that situation, I found a way to use PROC
SQL to view the discrepancies in a cleaner, more orderly manner by using the EXCEPT set operator.²
(And even if PROC COMPARE is comprehensible, this output is probably easier for explaining the issue
to someone not used to PROC COMPARE.)
The EXCEPT operator returns only the rows from the first query that do not appear in the second query. In
a sense, it means “SELECT everything EXCEPT these”. To fully compare two datasets, you run two SQL
queries. In the example below, the first query checks for everything in data1 not in data2, and the second
query checks for anything in data2 not in data1.
PROC SQL NUMBER;
title2 'records in data1 not in data2';
SELECT * FROM data1
EXCEPT
SELECT * FROM data2
;
title2 'records in data2 not in data1';
SELECT * FROM data2
EXCEPT
SELECT * FROM data1
;
QUIT;

² If your columns are in a different order between the datasets, use EXCEPT CORR instead of just EXCEPT. You
can just use EXCEPT CORR in either case if you want to make certain SAS is comparing the correct columns. Note
that other set operators also exist and can be handy for organizing data or conducting quality control. See the online
documentation or the Advanced Programming SAS Certification guide for details.
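As a brief illustration of the CORR keyword, the first query of the comparison could be written as below. The title text is only a suggestion.

```sas
/* Sketch: match columns by name rather than by position. */
proc sql number;
  title2 'records in data1 not in data2 (columns matched by name)';
  select * from data1
  except corr
  select * from data2
  ;
quit;
```

Note that a plain EXCEPT removes duplicate rows from its result. If duplicates matter for your check, the ALL keyword keeps them.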

Output 6. PROC SQL Data1 and Data2 (with Counts)
If you do not want to see discrepancies in counts, but rather just see missing observations, you can
accomplish this by DROPping the variable that changes: in our example, Count. Here is the comparison
between data1 and data3, dropping the Count variable. (The second query has no output since data1
was not missing any data.)

PROC SQL NUMBER;

title2 'records in data1 not in data3';
SELECT * FROM data1(DROP=Count)
EXCEPT
SELECT * FROM data3(DROP=Count)
;
title2 'records in data3 not in data1';
SELECT * FROM data3(DROP=Count)
EXCEPT
SELECT * FROM data1(DROP=Count)
;
QUIT;

Output 7. PROC SQL Data1 and Data3 (No Counts)

Either method gives useful information. The first gives more details, but might seem cluttered to someone
if they care more about missing observations. The second omits the Count discrepancies, but highlights
missing observations.
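If the discrepancies need to be reused later, for example in a report back to the programmers, the same queries can write to tables instead of the output window. This is a sketch under assumptions: the table names only_in_data1 and only_in_data3 and the macro variables n_only1 and n_only3 are hypothetical, and the TRIMMED keyword needs SAS 9.3 or later.

```sas
/* Sketch: store the missing-observation checks and count the rows. */
proc sql noprint;
  create table only_in_data1 as
    select * from data1(drop=Count)
    except
    select * from data3(drop=Count);

  create table only_in_data3 as
    select * from data3(drop=Count)
    except
    select * from data1(drop=Count);

  /* Row counts go into macro variables for a one-line log summary */
  select count(*) into :n_only1 trimmed from only_in_data1;
  select count(*) into :n_only3 trimmed from only_in_data3;
quit;

%put NOTE: &n_only1 row(s) only in data1, &n_only3 row(s) only in data3.;
```

Two zero counts would indicate that neither dataset has key combinations the other lacks.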

CONCLUSION
This paper has presented some tips and tools for checking if two files are identical or not. My hope is
that, whether for validation via dual processing, checking between versions, or some other purpose, the
ideas in this paper will be helpful. I should also note that similar results could, at least in theory, be
gained through checksums or hashing methods; however, I am not familiar enough with those methods to
recommend them, nor have I found an instance where that would be preferable to methods shown here.

APPENDIX 1: TEXT FILES

To run the SAS code in Appendix 2, save the contents below into three text files and set the file directory
as the macro variable &dir.
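For example, the macro variable could be set as follows before running the Appendix 2 code. The folder shown is purely hypothetical. The appendix code builds Windows-style paths with a backslash.

```sas
/* Hypothetical folder; replace with the location of the three CSV files. */
%let dir = C:\temp\compare;
```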

data1.csv
SchoolID,Grade,Gender,Count
0001,03,M,10
0001,03,F,10
0001,03,,20
0001,05,M,5
0001,05,F,5
0001,05,,10
0001,,M,15
0001,,F,15
0001,,,30

data2.csv
SchoolID,Grade,Gender,Count
0001,03,M,10
0001,03,F,10
0001,03,,20
0001,05,M,5
0001,05,F,5
0001,05,,10
0001,,M,20
0001,,F,20
0001,,,40

data3.csv
SchoolID,Grade,Gender,Count
0001,03,,20
0001,05,,10
0001,,M,15
0001,,F,15
0001,,,30

APPENDIX 2: SAS CODE

*STEP 1: READING IN DATA;
*&dir is a macro variable with the folder location for the CSV files;
*in the DATA steps below, lrecl is set to be arbitrarily large;
data data1;
infile "&dir\data1.csv" firstobs=2 dsd missover pad lrecl=1000;
input SchoolID :$4. Grade :$2. Gender :$1. Count :8.;
run;
proc print noobs; title 'dataset 1'; run;
data data2;
infile "&dir\data2.csv" firstobs=2 dsd missover pad lrecl=1000;
input SchoolID :$4. Grade :$2. Gender :$1. Count :8.;

run;
proc print noobs; title 'dataset 2'; run;
data data3;
infile "&dir\data3.csv" firstobs=2 dsd missover pad lrecl=1000;
input SchoolID :$4. Grade :$2. Gender :$1. Count :8.;
run;
proc print noobs; title 'dataset 3'; run;

Title 'dataset 1 via PROC IMPORT';

proc import datafile="&dir\data1.csv" dbms=csv replace out=data1_imp;
getnames=yes;
run;
proc print noobs; run;

*STEP 2: PROC COMPARE;

Title 'Comparing the Datasets';
Title2 'no ID statement';
PROC COMPARE DATA=data1 COMPARE=data2 LISTALL;
RUN;
title2 'with ID statement';
PROC SORT DATA=data1; BY SchoolID Grade Gender; RUN;
PROC SORT DATA=data2; BY SchoolID Grade Gender; RUN;
PROC COMPARE DATA=data1 COMPARE=data2 LISTALL;
ID SchoolID Grade Gender;
RUN;

PROC SORT data=data3; BY SchoolID Grade Gender; run;

PROC COMPARE DATA=data1 COMPARE=data3 LISTALL;
ID SchoolID Grade Gender;
RUN;

*STEP THREE: PROC SQL;

Title 'PROC SQL Checks';
PROC SQL NUMBER;
*show all discrepancies;
title2 'records in data1 not in data2';
SELECT * FROM data1
EXCEPT
SELECT * FROM data2
;
title2 'records in data2 not in data1';
SELECT * FROM data2
EXCEPT
SELECT * FROM data1
;

*using DROP;
title2 'records in data1 not in data3';
SELECT * FROM data1(DROP=Count)
EXCEPT
SELECT * FROM data3(DROP=Count)

;
title2 'records in data3 not in data1';
SELECT * FROM data3(DROP=Count)
EXCEPT
SELECT * FROM data1(DROP=Count)
;
QUIT;

ACKNOWLEDGMENTS
The author thanks the WUSS chair and staff for this conference and the opportunity to present.

RECOMMENDED READING
Hemedinger, Chris. “Using LIBNAME XLSX to read and write Excel files.”
https://blogs.sas.com/content/sasdummy/2015/05/20/using-libname-xlsx-to-read-and-write-excel-files/.
Accessed 21 December 2018.
Horstman, Joshua and Roger Muller (2016), “Don’t Get Blindsided by PROC COMPARE,” 2016 Western
Users of SAS Software, San Francisco, CA, USA.
SAS Institute Inc. 2011. SAS® Certification Prep Guide: Advanced Programming for SAS®9, Third
Edition. Cary, NC: SAS Institute Inc.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Aaron R. Brown
South Carolina Department of Education
1429 Senate Street, Columbia, SC 29201
Work Phone: 803-734-8858
E-mail: [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
