
Checking Values of Character Variables: Using PROC FREQ To List Values

This document discusses techniques for checking values of character variables in SAS. It begins by using PROC FREQ to list the unique values and frequencies of selected character variables, which identifies several invalid values. It then uses a DATA step to identify invalid values and determine their corresponding patient numbers, and PROC PRINT with a WHERE statement to list only records containing invalid values. Finally, it shows how user-defined formats and informats can be used to detect invalid values.

Uploaded by

chakramch
Copyright
© Attribution Non-Commercial (BY-NC)

1

Checking Values of Character Variables


Introduction
Using PROC FREQ to List Values
Description of the File PATIENTS.TXT
Using a DATA Step to Check for Invalid Values
Using PROC PRINT with a WHERE Statement to List Invalid Values
Using Formats to Check for Invalid Values
Using Informats to Check for Invalid Values

Introduction
There are some basic operations that need to be routinely performed when dealing with character data values. You may have a character variable that can take on only certain allowable values, such as 'M' and 'F' for gender. You may also have a character variable that can take on numerous values, but the values must fit a certain form, such as a single letter followed by two or three digits. This chapter shows you several ways that you can use SAS software to perform validity checks on character variables.
Using PROC FREQ to List Values

This section demonstrates how to use PROC FREQ to check for invalid values of a character variable. In order to test the programs you develop, use the raw data file PATIENTS.TXT (listed in the Appendix). You can use this data file and, in later sections, a SAS data set created from this raw data file, for many of the examples in this text.

Cody's Data Cleaning Techniques Using SAS Software

Description of the Raw Data File PATIENTS.TXT


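The layout of the data file PATIENTS.TXT is as follows:

Variable  Description               Starting  Length  Variable   Valid Values
Name                                Column            Type
--------  ------------------------  --------  ------  ---------  --------------------
PATNO     Patient Number                   1       3  Character  Numerals only
GENDER    Gender                           4       1  Character  'M' or 'F'
VISIT     Visit Date                       5      10  MMDDYY10.  Any valid date
HR        Heart Rate                      15       3  Numeric    Between ... and ...
SBP       Systolic Blood Pressure         18       3  Numeric    Between ... and ...
DBP       Diastolic Blood Pressure        21       3  Numeric    Between ... and ...
DX        Diagnosis Code                  24       3  Character  1 to 3 digit numeral
AE        Adverse Event                   27       1  Character  '0' or '1'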


There are several character variables that should have a limited number of valid values. For this exercise, you expect values of GENDER to be 'F' or 'M', values of DX the numerals 1 through 999, and values of AE (adverse events) to be '0' or '1'. A very simple approach to identifying invalid character values in this file is to use PROC FREQ to list all the unique values of these variables. Of course, once invalid values are identified using this technique, other means will have to be employed to locate specific records (or patient numbers) corresponding to the invalid values.


Use the program PATIENTS.SAS (shown next) to create the SAS data set PATIENTS from the raw data file PATIENTS.TXT (which can be downloaded from the SAS Web site or found listed in the Appendix). This program is followed with the appropriate PROC FREQ statements to list the unique values (and their frequencies) for the variables GENDER, DX, and AE.

Program 1-1 Writing a Program to Create the Data Set PATIENTS

*----------------------------------------------------------*
|PROGRAM NAME: PATIENTS.SAS IN C:\CLEANING                 |
|PURPOSE: TO CREATE A SAS DATA SET CALLED PATIENTS         |
*----------------------------------------------------------*;
LIBNAME CLEAN "C:\CLEANING";

DATA CLEAN.PATIENTS;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD; /* Pad short records with blanks */
   INPUT @1  PATNO  $3.
         @4  GENDER $1.
         @5  VISIT  MMDDYY10.
         @15 HR     3.
         @18 SBP    3.
         @21 DBP    3.
         @24 DX     $3.
         @27 AE     $1.;

   LABEL PATNO  = "Patient Number"
         GENDER = "Gender"
         VISIT  = "Visit Date"
         HR     = "Heart Rate"
         SBP    = "Systolic Blood Pressure"
         DBP    = "Diastolic Blood Pressure"
         DX     = "Diagnosis Code"
         AE     = "Adverse Event?";
   FORMAT VISIT MMDDYY10.;
RUN;




Program 1-2

PROC FREQ DATA=CLEAN.PATIENTS;
   TITLE "Frequency Counts for Selected Character Variables";
   TABLES GENDER DX AE / NOCUM NOPERCENT;
RUN;


Here is the output from running Program 1-2.

Frequency Counts for Selected Character Variables

The FREQ Procedure

       Gender
GENDER    Frequency
-------------------
2                 1
F                12
M                14
X                 1
f                 2

Frequency Missing = 1

   Diagnosis Code
DX    Frequency
---------------
1             7
2             2
3             3
4             3
5             3
6             1
7             2
X             2

Frequency Missing = 8

   Adverse Event?
AE    Frequency
---------------
0            19
1            10
A             1

Frequency Missing = 1


Let's focus in on the frequency listing for the variable GENDER. If valid values for GENDER are 'F', 'M', and missing, this output would point out several data errors. The values '2' and 'X' both occur once. Depending on the situation, the lowercase value 'f' may or may not be considered an error. If lowercase values were entered into the file by mistake, but the value, aside from the case, was correct, you could change all lowercase values to uppercase with the UPCASE function. More on that later. The invalid DX code of 'X' and the adverse event of 'A' are also easily identified. At this point, it is necessary to run additional programs to identify the location of these errors. Running PROC FREQ is still a useful first step in identifying errors of these types, and it is also useful as a last step, after the data have been cleaned, to ensure that all the errors have been identified and corrected.
Using a DATA Step to Check for Invalid Values



Program 1-3

DATA _NULL_;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   FILE PRINT; ***Send output to the Output window;
   TITLE "Listing of Invalid Patient Numbers and Data Values";
   ***Note: We will only input those variables of interest;
   INPUT @1  PATNO  $3.
         @4  GENDER $1.
         @24 DX     $3.
         @27 AE     $1.;
   ***Check GENDER;
   IF GENDER NOT IN ('F' 'M' ' ') THEN PUT PATNO= GENDER=;
   ***Check DX;
   IF VERIFY(DX,' 0123456789') NE 0 THEN PUT PATNO= DX=;
   ***Check AE;
   IF AE NOT IN ('0' '1' ' ') THEN PUT PATNO= AE=;
RUN;


Before discussing the output, let's spend a moment looking over the program. First, notice the use of the DATA _NULL_ statement. Because the only purpose of this program is to identify invalid data values, there is no need to create a SAS data set. The FILE PRINT statement causes the results of any subsequent PUT statements to be sent to the Output window (or output device). Without this statement, the results of the PUT statements would be sent to the SAS Log. GENDER and AE are checked by using the IN operator. The statement

IF X IN ('A' 'B' 'C') THEN . . .;

is equivalent to

IF X = 'A' OR X = 'B' OR X = 'C' THEN . . .;

That is, if X is equal to any of the values in the list following the IN operator, the expression is evaluated as true. You want an error message printed when the value of GENDER is not one of the acceptable values ('F', 'M', or missing). Therefore, place a NOT in front of the whole expression, triggering the error report for invalid values of GENDER or AE. You can separate the values in the list by spaces or commas. There are several alternative ways that the gender checking statement can be written. The method above uses the IN operator. A straightforward alternative to the IN operator is

IF NOT (GENDER EQ 'F' OR GENDER EQ 'M' OR GENDER = ' ') THEN PUT PATNO= GENDER=;

Another possibility is

IF GENDER NE 'F' AND GENDER NE 'M' AND GENDER NE ' ' THEN PUT PATNO= GENDER=;

While all of these statements checking for GENDER and AE produce the same result, the IN operator is probably the easiest to write, especially if there are a large number of possible values to check. Always be sure to consider whether you want to identify missing values as invalid or not. In the statements above, you are allowing missing values as valid codes. If you want to flag missing values as errors, do not include a missing value in the list of valid codes.
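Whether missing values are allowed or flagged comes down to one entry in the IN list. The following is a minimal sketch, not one of the original programs: the three inline records are made up for illustration, and the messages go to the SAS Log because there is no FILE PRINT statement.

```sas
* Sketch only: the inline records below are made up for illustration;
DATA _NULL_;
   INPUT PATNO $ 1-3 GENDER $ 4;
   ***Missing allowed: a blank is in the list of valid codes;
   IF GENDER NOT IN ('F' 'M' ' ') THEN
      PUT "Invalid: " PATNO= GENDER=;
   ***Missing flagged: the blank is left out of the list;
   IF GENDER NOT IN ('F' 'M') THEN
      PUT "Invalid or missing: " PATNO= GENDER=;
DATALINES;
001M
002
003X
;
```

With these records, the blank GENDER for patient 002 should be reported only by the second test, while the 'X' for patient 003 should be reported by both.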


If you want to allow lowercase M's and F's as valid values, you can add the single line

GENDER = UPCASE(GENDER);

before the statement that checks GENDER. The DX check in Program 1-3 uses the VERIFY function, which has the form

VERIFY(character_variable,verify_string)

where the verify string is either a character variable or a series of character values placed in single or double quotes. The VERIFY function returns the first position in the character_variable that contains a character that is not in the verify_string. If the character_variable does not contain any invalid values, the VERIFY function returns a 0. To make this clearer, let's look at the following statement that uses the VERIFY function to check for invalid GENDER values:

IF VERIFY (GENDER,'FM ') NE 0 THEN PUT PATNO= GENDER=;

Notice that you included a blank in the verify_string so that missing values will be considered valid. If GENDER has a value other than an 'F', an 'M', or a blank, the VERIFY function returns the position of the invalid character (always 1 here, because GENDER is one character long), and the patient number and value are printed.
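To see how VERIFY behaves with the verify_string that Program 1-3 uses for DX, you can run it against a few hand-picked values. This is only a sketch: the test values and the variable name POS are assumptions made for illustration, and the results are written to the SAS Log.

```sas
* Sketch only: the test values are made up for illustration;
DATA _NULL_;
   LENGTH DX $ 3;
   DO DX = '12', '1X3', '1 3', ' ';
      ***POS is the position of the first character not in the list;
      POS = VERIFY(DX,' 0123456789');
      PUT DX= POS=;
   END;
RUN;
```

VERIFY should return 2 for '1X3' (the position of the X) and 0 for the other three values, including '1 3', because the blank is part of the verify_string.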



Although the function

VERIFY(DX,' 0123456789')

returns a 0 if there are no invalid characters in the DX code, it should be pointed out that DX codes with embedded blanks will not be identified as invalid with this statement. If you want to ensure that only the character representations of the numbers 1 to 999 are considered valid, the following statements can be used:

X_DX = INPUT(DX,3.);
IF X_DX EQ . AND DX NE ' ' THEN PUT PATNO= DX=;

A DX value that contains a decimal point is still a valid number to the 3. informat. To flag such values as well, you can replace the first statement with

X_DX = INPUT(TRANSLATE(DX,'X','.'),3.);

The TRANSLATE function above will convert periods (or decimal points) to X's. If DX originally contained a decimal point, the value of X_DX would be a missing value. In general, the syntax of the TRANSLATE function is

TRANSLATE(char_variable,to_string,from_string)


where each character in the from_string is translated to the corresponding character in the to_string. For example, to translate the numerals 1 through 5 to the letters A through E for a variable called SCORE, you would write

NEW_VAR = TRANSLATE(SCORE,'ABCDE','12345');

Another interesting approach is to test to see if the value of X_DX is not an integer. The MOD function is an effective way to do this. If any number modulus 1 is not 0 (the remainder after you divide the number by 1), the number is not an integer. The SAS code using this method is

X_DX = INPUT(DX,3.);
IF (X_DX EQ . OR MOD(X_DX,1) NE 0) AND DX NE ' ' THEN PUT PATNO= DX=;

Here is another point. If you want to avoid filling up your SAS Log with error messages resulting from invalid arguments to the INPUT function, you can use the double question mark (??) modifier before the informat to tell the program to ignore these errors and not to report the errors to the SAS Log. The INPUT function would then look like this:

X_DX = INPUT(DX,?? 3.);
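The MOD test and the ?? modifier can be tried together on a handful of values before running them against the real file. The sketch below is an illustration only: the test values are made up, and the messages are written to the SAS Log.

```sas
* Sketch only: the test values are made up for illustration;
DATA _NULL_;
   LENGTH DX $ 3;
   DO DX = '12', '1.5', '1X3', ' ';
      ***The ?? modifier suppresses invalid data messages in the log;
      X_DX = INPUT(DX,?? 3.);
      IF (X_DX EQ . OR MOD(X_DX,1) NE 0) AND DX NE ' ' THEN
         PUT "Invalid: " DX=;
      ELSE PUT "Accepted: " DX=;
   END;
RUN;
```

Here '1.5' should be flagged because it is not an integer, and '1X3' because it cannot be read as a number; '12' and the blank (missing) value should be accepted.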

The ?? informat modifier can also be used with the INPUT statement. Here is the output from running Program 1-3.

Listing of Invalid Patient Numbers and Data Values

PATNO=002 DX=X
PATNO=003 GENDER=X
PATNO=004 AE=A
PATNO=010 GENDER=f
PATNO=013 GENDER=2
PATNO=002 DX=X
PATNO=023 GENDER=f

Note that patient 002 appears twice in this output. This occurs because there is a duplicate observation for patient 002 (in addition to several other purposely included errors), so that the data set can be used for examples later in this book, such as the detection of duplicate ID's and duplicate observations.


Suppose you want to check for valid patient numbers (PATNO) in a similar manner. However, you want to flag missing values as errors (every patient must have a valid ID). The following statements

ID = INPUT(TRANSLATE(PATNO,'X','.'),?? 3.);
IF ID LT 1 THEN PUT "Invalid ID for PATNO=" PATNO;

will work in the same way as your check for invalid DX codes, except that missing values will now be listed as errors.
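The test ID LT 1 catches missing values because SAS treats a missing numeric value as smaller than any number. A short sketch with made-up records (not taken from PATIENTS.TXT) shows the idea; the messages are written to the SAS Log.

```sas
* Sketch only: the inline records below are made up for illustration;
DATA _NULL_;
   INPUT PATNO $ 1-3;
   ***Periods become X so that values such as 1.5 are read as missing;
   ID = INPUT(TRANSLATE(PATNO,'X','.'),?? 3.);
   IF ID LT 1 THEN PUT "Invalid ID for PATNO=" PATNO;
DATALINES;
001
XX5
1.5
0
;
```

Only the record 001 should pass. XX5 and 1.5 produce a missing ID, which compares as less than 1, and 0 is less than 1 in its own right.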
Using PROC PRINT with a WHERE Statement to List Invalid Values

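There are several alternative ways to identify the ID's containing invalid data. One is to use PROC PRINT with a WHERE statement. For example, to list the ID's with invalid GENDER values, you could write a program like the one shown in Program 1-4.

Program 1-4 Using PROC PRINT to List Invalid Character Values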

PROC PRINT DATA=CLEAN.PATIENTS;
   TITLE "LISTING OF INVALID GENDER VALUES";
   WHERE GENDER NOT IN ('M' 'F' ' ');
   ID PATNO;
   VAR GENDER;
RUN;

It's easy to forget that WHERE statements can be used within SAS procedures. SAS programmers who have been at it for a long time, like the author, often write a short DATA step first and use PUT statements or create a temporary SAS data set and follow it with a PROC PRINT. The program above is both shorter and more efficient than a DATA step followed by a PROC PRINT. DATA _NULL_ steps, however, tend to be fairly efficient and are a reasonable alternative as well as the more flexible approach.


The output from Program 1-4 follows.

LISTING OF INVALID GENDER VALUES

PATNO    GENDER
003      X
010      f
013      2
023      f

This program can be extended to list invalid values for all the character variables. You simply add the other invalid conditions to the WHERE statement as shown in Program 1-5.

Program 1-5 Using PROC PRINT to List Invalid Character Data for Several Variables

PROC PRINT DATA=CLEAN.PATIENTS;
   TITLE "LISTING OF INVALID CHARACTER VALUES";
   WHERE GENDER NOT IN ('M' 'F' ' ')    OR
         VERIFY(DX,' 0123456789') NE 0  OR
         AE NOT IN ('0' '1' ' ');
   ID PATNO;
   VAR GENDER DX AE;
RUN;

The resulting output is shown next.

LISTING OF INVALID CHARACTER VALUES

PATNO    GENDER    DX    AE
002      F         X     0
003      X         3     1
004      F         5     A
010      f         1     0
013      2         1
002      F         X     0
023      f               0


Notice that this output is not as informative as the one produced by the DATA _NULL_ step in Program 1-3. It lists all the patient numbers, genders, DX codes, and adverse events even when only one of the variables has an error (patient 002, for example). So, there is a trade-off: the simpler program produces slightly less desirable output. We could get philosophical and extend this concept to life in general, but that's for some other book. You can also substitute any of the more complicated logical expressions from the previous section into this WHERE statement if you wish. For example, to perform a more careful check on DX codes, you could modify the WHERE statement as shown here:

PROC PRINT DATA=CLEAN.PATIENTS;
   TITLE "LISTING OF INVALID CHARACTER VALUES";
   WHERE GENDER NOT IN ('M' 'F' ' ')           OR
         (INPUT(DX,3.) EQ . OR
          MOD(INPUT(DX,3.),1) NE 0)            AND
         DX NE ' '                             OR
         AE NOT IN ('0' '1' ' ');
   ID PATNO;
   VAR GENDER DX AE;
RUN;

Using Formats to Check for Invalid Values

Another way to check for invalid values of a character variable from raw data is to use user-defined formats. There are several possibilities here. One, you can create a format that leaves all valid character values as is and formats all invalid values to a single error code. Let's start out with a program that simply assigns formats to the character variables and uses PROC FREQ to list the number of valid and invalid codes. Following that, you will extend the program by using a DATA step to identify which ID's have invalid values. Program 1-6 uses formats to convert all invalid data values to a single value.


Program 1-6 Using a User-Defined Format and PROC FREQ to List Invalid Data Values

PROC FORMAT;
   VALUE $GENDER 'F','M' = 'Valid'
                 ' '     = 'Missing'
                 OTHER   = 'Miscoded';
   VALUE $DX '001' - '999' = 'Valid' /* See important note below */
             ' '           = 'Missing'
             OTHER         = 'Miscoded';
   VALUE $AE '0','1' = 'Valid'
             ' '     = 'Missing'
             OTHER   = 'Miscoded';
RUN;

PROC FREQ DATA=CLEAN.PATIENTS;
   TITLE "Using Formats to Identify Invalid Values";
   FORMAT GENDER $GENDER.
          DX     $DX.
          AE     $AE.;
   TABLES GENDER DX AE / NOCUM NOPERCENT MISSING;
RUN;

For the variables GENDER and AE, which have specific valid values, you list each of the valid values in the range to the left of the equal sign in the VALUE statement. Format each of these values with the value 'Valid'. For the $DX format, you specify a range of values on the left side of the equal sign.

Important Note: It should be pointed out here that the range '001' - '999' will behave differently on Windows and UNIX platforms compared to MVS and CMS platforms. You may want to test several values on your platform to be sure the program is performing as you intend. For example, a value that mixes digits and a letter, such as '0A1', will be considered 'Valid' on a Windows or a UNIX platform and 'Invalid' on MVS or CMS (as pointed out by two of my reviewers, John Laing and Mike Zdeb). You may want to test for alphabetic values for DX in a short DATA step prior to running Program 1-6.

You may choose to lump the missing value with the valid values if that is appropriate, or you may want to keep track of missing values separately, as was done here. Finally, any value other than the valid values or a missing value will be formatted as 'Miscoded'. All that is left is to run PROC FREQ to count the number of 'Valid', 'Missing', and 'Miscoded' values. The TABLES option MISSING causes the missing values to be listed in the body of the PROC FREQ output. Here is the output from PROC FREQ.


Using Formats to Identify Invalid Values

The FREQ Procedure

        Gender
GENDER      Frequency
---------------------
Missing             1
Miscoded            4
Valid              26

    Diagnosis Code
DX          Frequency
---------------------
Missing             8
Valid              21
Miscoded            2

    Adverse Event?
AE          Frequency
---------------------
Missing             1
Valid              29
Miscoded            1
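The Important Note above suggests testing several values against the $DX range on your own platform. One way to do that is a short DATA _NULL_ step that passes hand-picked values through the PUT function. This is a sketch under stated assumptions: the test values and the variable name RESULT are made up for illustration, and the result for a value such as '0A1' depends on whether your platform uses the ASCII or EBCDIC collating sequence, so no expected output is shown.

```sas
* Sketch only: the test values are made up for illustration;
PROC FORMAT;
   VALUE $DX '001' - '999' = 'Valid'
             ' '           = 'Missing'
             OTHER         = 'Miscoded';
RUN;

DATA _NULL_;
   LENGTH DX $ 3 RESULT $ 8;
   DO DX = '1', '999', '0A1', 'X', ' ';
      ***RESULT holds the formatted value for this platform;
      RESULT = PUT(DX,$DX.);
      PUT DX= RESULT=;
   END;
RUN;
```

The SAS Log then shows, for each test value, which of the three labels the format assigns on the machine where the step was run.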

The PROC FREQ output isn't particularly useful. It doesn't tell you which observations (patient numbers) contain missing or invalid values. Let's modify the program by adding a DATA step, so that ID's with invalid character values are listed.

Program 1-7 Using a User-Defined Format and a DATA Step to List Invalid Data Values

PROC FORMAT;
   VALUE $GENDER 'F','M' = 'Valid'
                 ' '     = 'Missing'
                 OTHER   = 'Miscoded';
   VALUE $DX '001' - '999' = 'Valid'
             ' '           = 'Missing'
             OTHER         = 'Miscoded';
   VALUE $AE '0','1' = 'Valid'
             ' '     = 'Missing'
             OTHER   = 'Miscoded';
RUN;

DATA _NULL_;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   FILE PRINT; ***Send output to the Output window;
   TITLE "Listing of Invalid Patient Numbers and Data Values";
   ***Note: We will only input those variables of interest;
   INPUT @1  PATNO  $3.
         @4  GENDER $1.
         @24 DX     $3.
         @27 AE     $1.;

   IF PUT(GENDER,$GENDER.) = 'Miscoded' THEN PUT PATNO= GENDER=;
   IF PUT(DX,$DX.) = 'Miscoded' THEN PUT PATNO= DX=;
   IF PUT(AE,$AE.) = 'Miscoded' THEN PUT PATNO= AE=;
RUN;

The heart of this program is the PUT function. To review, the PUT function is similar to the INPUT function. It takes the following form:

character_variable = PUT(variable,format)

where character_variable is a character variable that contains the value of the variable listed as the first argument to the function, formatted by the format listed as the second argument to the function. The result of a PUT function is always a character variable, and the function is frequently used to perform numeric-to-character conversions. In Program 1-7, the first argument of the PUT function is a character variable, and the result of the PUT function for any invalid data values would be the value 'Miscoded'. Here is the output from Program 1-7.

Listing of Invalid Patient Numbers and Data Values

PATNO=002 DX=X
PATNO=003 GENDER=X
PATNO=004 AE=A
PATNO=010 GENDER=f
PATNO=013 GENDER=2
PATNO=002 DX=X
PATNO=023 GENDER=f


Using Informats to Check for Invalid Values



Program 1-8

*----------------------------------------------------------------*
| PROGRAM NAME: INFORM1.SAS IN C:\CLEANING                       |
| PURPOSE: TO CREATE A SAS DATA SET CALLED PATIENTS2             |
|          AND SET ANY INVALID VALUES FOR GENDER AND AE TO       |
|          MISSING, USING A USER-DEFINED INFORMAT                |
*----------------------------------------------------------------*;
LIBNAME CLEAN "C:\CLEANING";

PROC FORMAT;
   INVALUE $GEN 'F','M' = _SAME_
                OTHER   = ' ';
   INVALUE $AE  '0','1' = _SAME_
                OTHER   = ' ';
RUN;

DATA CLEAN.PATIENTS2;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   INPUT @1  PATNO  $3.
         @4  GENDER $GEN1.
         @27 AE     $AE1.;

   LABEL PATNO  = "Patient Number"
         GENDER = "Gender"
         DX     = "Diagnosis Code"
         AE     = "Adverse Event?";
RUN;


PROC PRINT DATA=CLEAN.PATIENTS2;
   TITLE "Listing of Data Set PATIENTS2";
   VAR PATNO GENDER AE;
RUN;


Listing of Data Set PATIENTS2

Obs    PATNO    GENDER    AE
  1    001      M         0
  2    002      F         0
  3    003                1
  4    004      F
  5    XX5      M         0
  6    006                1
  7    007      M         0
  8             M         0
  9    008      F         0
 10    009      M         1
 11    010                0
 12    011      M         1
 13    012      M         0
 14    013
 15    014      M         1
 16    002      F         0
 17    003      M         0
 18    015      F         1
 19    017      F         0
 20    019      M         0
 21    123      M         0
 22    321      F         1
 23    020      F         0
 24    022      M         1
 25    023                0
 26    024      F         0
 27    025      M         1
 28    027      F         0
 29    028      F         0
 30    029      M         1
 31    006      F         0


Notice that invalid values for GENDER and AE are now missing values, including the two lowercase 'f's (patient numbers 010 and 023). Let's add one more feature to this program. By using the keyword UPCASE in the informat specification, you can automatically convert the values being read to uppercase before the ranges are checked. Here are the PROC FORMAT statements, rewritten to use this option.

PROC FORMAT;
   INVALUE $GEN (UPCASE) 'F'   = 'F'
                         'M'   = 'M'
                         OTHER = ' ';
   INVALUE $AE '0','1' = _SAME_
               OTHER   = ' ';
RUN;

The UPCASE option is placed in parentheses following the informat name. Notice some other changes as well. You cannot use the keyword _SAME_ anymore because the value is changed to uppercase for comparison purposes, but the _SAME_ specification would leave the original lowercase value unchanged. By specifying each value individually, the lowercase 'f' (the only lowercase GENDER value) would match the range 'F' and be assigned the value of an uppercase 'F'. The output of this data set is identical to the output for Program 1-8 except that the value of GENDER for patients 010 and 023 is an uppercase 'F'.

If you want to preserve the original value of the variable, you can use a user-defined informat with an INPUT function instead of an INPUT statement. You can use this method to check a raw data file or a SAS data set. Program 1-9 reads the SAS data set CLEAN.PATIENTS and uses user-defined informats to detect errors.


Program 1-9 Using a User-Defined Informat with the INPUT Function

PROC FORMAT;
   INVALUE $GENDER 'F','M' = _SAME_
                   OTHER   = 'ERROR';
   INVALUE $AE     '0','1' = _SAME_
                   OTHER   = 'ERROR';
RUN;

DATA _NULL_;
   FILE PRINT;
   SET CLEAN.PATIENTS;
   IF INPUT (GENDER,$GENDER.) = 'ERROR' THEN
      PUT @1 "Error for Gender for Patient:" PATNO " Value is " GENDER;
   IF INPUT (AE,$AE.) = 'ERROR' THEN
      PUT @1 "Error for AE for Patient:" PATNO " Value is " AE;
RUN;

The advantage of this program over Program 1-8 is that the original values of the variables are not lost.
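The UPCASE option and the INPUT function can also be combined, so that lowercase codes are accepted while the original value stays untouched. The following is a sketch under stated assumptions, not one of the original programs: the informat name $GENCHK and the inline records are made up for illustration, and the messages are written to the SAS Log.

```sas
* Sketch only: the informat name and the records are hypothetical;
PROC FORMAT;
   INVALUE $GENCHK (UPCASE) 'F'   = 'F'
                            'M'   = 'M'
                            OTHER = 'ERROR';
RUN;

DATA _NULL_;
   INPUT PATNO $ 1-3 GENDER $ 4;
   ***GENDER itself is never changed, only the result of INPUT is tested;
   IF INPUT(GENDER,$GENCHK.) = 'ERROR' THEN
      PUT "Error for Gender for Patient:" PATNO " Value is " GENDER;
DATALINES;
001M
010f
003X
;
```

With these records, only patient 003 should be reported: the lowercase 'f' is converted to uppercase before the comparison, so it matches 'F', while 'X' falls through to OTHER.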
