Checking Values of Character Variables: Using PROC FREQ To List Values
Introduction
There are some basic operations that need to be routinely performed when dealing with character data values. You may have a character variable that can take on only certain allowable values, such as 'M' and 'F' for gender. You may also have a character variable that can take on numerous values, but the values must fit a certain form, such as a single letter followed by two or three digits. This chapter shows you several ways that you can use SAS software to perform validity checks on character variables.
Using PROC FREQ to List Values
This section demonstrates how to use PROC FREQ to check for invalid values of a character variable. To test the programs you develop, use the raw data file PATIENTS.TXT, listed in the Appendix. You can use this data file and, in later sections, a SAS data set created from this raw data file for many of the examples in this text.
The raw data file PATIENTS.TXT contains both character and numeric variables from a typical clinical trial. A number of data errors were included in the file so that you can test the data cleaning programs that are developed in this text. The programs in this book assume that the file PATIENTS.TXT is located in a directory (folder) called C:\CLEANING. This is the directory that is used throughout this text as the location for data files, SAS data sets, SAS programs, and SAS macros. You can modify the INFILE and LIBNAME statements to fit your own operating environment. Here is the layout for the data file PATIENTS.TXT:

Variable   Description                Starting   Length   Variable     Valid Values
Name                                  Column              Type
--------   ------------------------   --------   ------   ----------   --------------------
PATNO      Patient Number                    1        3   Character    Numerals only
GENDER     Gender                            4        1   Character    'M' or 'F'
VISIT      Visit Date                        5       10   MMDDYY10.    Any valid date
HR         Heart Rate                       15        3   Numeric      Between 40 and 100
SBP        Systolic Blood Pressure          18        3   Numeric      Between 80 and 200
DBP        Diastolic Blood Pressure         21        3   Numeric      Between 60 and 120
DX         Diagnosis Code                   24        3   Character    1 to 3 digit numeral
AE         Adverse Event                    27        1   Character    '0' or '1'
There are several character variables that should have a limited number of valid values. For this exercise, you expect values of GENDER to be 'F' or 'M', values of DX to be the numerals 1 through 999, and values of AE (adverse events) to be '0' or '1'. A very simple approach to identifying invalid character values in this file is to use PROC FREQ to list all the unique values of these variables. Of course, once invalid values are identified using this technique, other means will have to be employed to locate the specific records (or patient numbers) corresponding to the invalid values.
Chapter 1
Use the program PATIENTS.SAS (shown next) to create the SAS data set PATIENTS from the raw data file PATIENTS.TXT (which can be downloaded from the SAS Web site or found listed in the Appendix). This program is followed by the appropriate PROC FREQ statements to list the unique values (and their frequencies) for the variables GENDER, DX, and AE.

Program 1-1  Writing a Program to Create the Data Set PATIENTS
*----------------------------------------------------------*
|PROGRAM NAME: PATIENTS.SAS IN C:\CLEANING                 |
|PURPOSE: TO CREATE A SAS DATA SET CALLED PATIENTS         |
*----------------------------------------------------------*;
LIBNAME CLEAN "C:\CLEANING";

DATA CLEAN.PATIENTS;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD; /* Pad short records with blanks */
   INPUT @1  PATNO  $3.
         @4  GENDER $1.
         @5  VISIT  MMDDYY10.
         @15 HR     3.
         @18 SBP    3.
         @21 DBP    3.
         @24 DX     $3.
         @27 AE     $1.;
   LABEL PATNO  = "Patient Number"
         GENDER = "Gender"
         VISIT  = "Visit Date"
         HR     = "Heart Rate"
         SBP    = "Systolic Blood Pressure"
         DBP    = "Diastolic Blood Pressure"
         DX     = "Diagnosis Code"
         AE     = "Adverse Event?";
   FORMAT VISIT MMDDYY10.;
RUN;
The DATA step is straightforward. Notice the PAD option in the INFILE statement. This will seem foreign to most mainframe users and is probably no longer necessary on other platforms. The PAD option pads all records (adds blanks to the end of short records) to the default logical record length or to a length specified by another INFILE option, LRECL. It prevents the possibility of skipping to the next record (line of data) when a short line is encountered.

Next, you want to use PROC FREQ to list all the unique values for your character variables. To simplify the output from PROC FREQ, use the NOCUM (no cumulative statistics) and NOPERCENT (no percentages) TABLES options, because you only want frequency counts for each of the unique character values. (Note: sometimes the percent and cumulative statistics can be useful; the choice is yours.) The PROC statements are shown in Program 1-2.

Program 1-2  Using PROC FREQ to List All the Unique Values for Character Variables
PROC FREQ DATA=CLEAN.PATIENTS;
   TITLE "Frequency Counts for Selected Character Variables";
   TABLES GENDER DX AE / NOCUM NOPERCENT;
RUN;
Let's focus in on the frequency listing for the variable GENDER. If valid values for GENDER are 'F', 'M', and missing, the PROC FREQ output would point out several data errors. The values '2' and 'X' both occur once. Depending on the situation, the lowercase value 'f' may or may not be considered an error. If lowercase values were entered into the file by mistake, but the value (aside from the case) was correct, you could change all lowercase values to uppercase with the UPCASE function. More on that later. The invalid DX code of 'X' and the adverse event of 'A' are also easily identified. At this point, it is necessary to run additional programs to identify the location of these errors. Running PROC FREQ is still a useful first step in identifying errors of these types, and it is also useful as a last step, after the data have been cleaned, to ensure that all the errors have been identified and corrected.
Using a DATA Step to Check for Invalid Values
Your next task is to use a DATA step to identify invalid data values and to determine where they occur in the raw data file (by listing the patient number). This time, DATA step processing is used to identify invalid character values for selected variables. As before, you will check GENDER, DX, and AE. Several different methods are used to identify these values. First, you can write a simple DATA step that reports invalid data values by using PUT statements in a DATA _NULL_ step. Here is the program:

Program 1-3  Using a DATA _NULL_ Step to Detect Invalid Character Data
DATA _NULL_;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   FILE PRINT; ***Send output to the Output window;
   TITLE "Listing of Invalid Patient Numbers and Data Values";
   ***Note: We will only input those variables of interest;
   INPUT @1  PATNO  $3.
         @4  GENDER $1.
         @24 DX     $3.
         @27 AE     $1.;
   ***Check GENDER;
   IF GENDER NOT IN ('F' 'M' ' ') THEN PUT PATNO= GENDER=;
   ***Check DX;
   IF VERIFY(DX,' 0123456789') NE 0 THEN PUT PATNO= DX=;
   ***Check AE;
   IF AE NOT IN ('0' '1' ' ') THEN PUT PATNO= AE=;
RUN;
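The checks on GENDER and AE use the IN operator, which compares a value against a list. A statement such as

   IF X IN ('A' 'B' 'C') THEN . . .;

is equivalent to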
   IF X = 'A' OR X = 'B' OR X = 'C' THEN . . .;
That is, if X is equal to any of the values in the list following the IN operator, the expression is evaluated as true. You want an error message printed when the value of GENDER is not one of the acceptable values ('F', 'M', or missing). Therefore, place a NOT in front of the whole expression, triggering the error report for invalid values of GENDER or AE. You can separate the values in the list by spaces or commas.

There are several alternative ways that the gender checking statement can be written. The method above uses the IN operator. A straightforward alternative to the IN operator is
   IF NOT (GENDER EQ 'F' OR GENDER EQ 'M' OR GENDER = ' ') THEN PUT PATNO= GENDER=;
Another possibility is
   IF GENDER NE 'F' AND GENDER NE 'M' AND GENDER NE ' ' THEN PUT PATNO= GENDER=;
While all of these statements checking for GENDER and AE produce the same result, the IN operator is probably the easiest to write, especially if there are a large number of possible values to check. Always be sure to consider whether you want to identify missing values as invalid or not. In the statements above, you are allowing missing values as valid codes. If you want to flag missing values as errors, do not include a missing value in the list of valid codes.
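As a minimal sketch of that last point, the step below leaves the blank out of the list, so a missing GENDER is reported along with the miscoded values. It reads a few made-up records in-line with DATALINES rather than from PATIENTS.TXT; the patient numbers and values are hypothetical and serve only as an illustration.

```sas
DATA _NULL_;
   ***Hypothetical in-line records, not taken from PATIENTS.TXT;
   INPUT @1 PATNO  $3.
         @4 GENDER $1.;
   ***No blank in the list, so a missing GENDER is flagged as well;
   IF GENDER NOT IN ('F' 'M') THEN PUT PATNO= GENDER=;
DATALINES;
101M
102
103x
104F
;
RUN;
```

With these four records, the log should list patients 102 (missing GENDER) and 103 (lowercase 'x') and skip the other two.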
If you want to allow lowercase values of GENDER as valid, you can add the line

   GENDER = UPCASE(GENDER);

immediately before the line that checks for invalid gender codes. As you can probably guess, the UPCASE function changes all lowercase letters to uppercase letters. A statement similar to the gender checking statement is used to test the adverse events.

There are so many valid values for DX (any numeral from 1 to 999) that the approach you used for GENDER and AE would be inefficient (and wear you out typing) if you used it to check for invalid DX codes. The VERIFY function is one of the many possible ways you can check to see if there is a value other than the numerals 0 to 9 or blank as a DX value. The VERIFY function has the following form:
VERIFY(character_variable,verify_string)
where the verify_string is either a character variable or a series of character values placed in single or double quotes. The VERIFY function returns the first position in the character_variable that contains a character that is not in the verify_string. If the character_variable does not contain any invalid values, the VERIFY function returns a 0. To make this clearer, let's look at the following statement that uses the VERIFY function to check for invalid GENDER values:
   IF VERIFY(GENDER,'FM ') NE 0 THEN PUT PATNO= GENDER=;
Notice that you included a blank in the verify_string so that missing values will be considered valid. If GENDER has a value other than an 'F', 'M', or missing, the VERIFY function returns the position of the invalid character in the string. But, because the length of GENDER is 1, any invalid value for GENDER returns a 1.

You are now ready to understand the VERIFY function that checked for invalid DX codes. The verify_string contained a blank plus the characters (numerals) 0 through 9. Thus, if the DX code contains any character other than a blank or a 0 through 9, the function returns the position of this offending character, which would have to be a 1, 2, or 3 (DX is three bytes in length), and the error message would be printed.
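To see the values VERIFY returns, the short step below applies the same verify_string to a handful of made-up DX codes and writes the result to the log. It is only a sketch: the in-line values and the variable name POSITION are assumptions introduced for illustration.

```sas
DATA _NULL_;
   ***Hypothetical DX codes read in-line;
   INPUT @1 DX $3.;
   ***Position of the first character that is not a blank or a numeral;
   POSITION = VERIFY(DX,' 0123456789');
   PUT DX= POSITION=;
DATALINES;
123
1X3
X
1 3
;
RUN;
```

POSITION should be 0 for '123', 2 for '1X3', and 1 for 'X'. The code '1 3' also yields 0, because the blank is part of the verify_string.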
Although the VERIFY function returns a 0 if there are no invalid characters in the DX code, it should be pointed out that DX codes with embedded blanks will not be identified as invalid with this statement, because the blank is part of the verify_string. If you want to ensure that only the character representations of the numbers 1 to 999 are considered valid, the following statements can be used:
   X_DX = INPUT(DX,3.);
   IF X_DX EQ . AND DX NE ' ' THEN PUT PATNO= DX=;
The first line above creates a numeric variable (X_DX) from the character DX value. The INPUT function can be thought of in a similar manner to an INPUT statement. It says to pretend you are reading a variable (DX) from a data file according to the informat (3.), except that you are actually reading the value from a character variable. The result of this process is assigned to the variable X_DX. In other words, the INPUT function performs a character-to-numeric conversion. If there is an invalid DX code (containing a letter or an embedded blank, for example), the INPUT function sends an error message to the SAS Log and returns a missing value. In the second line, you test if the numeric equivalent of the DX code is missing and the original DX is not missing, putting out an error message when this condition is true. (Note: because the original character value was three bytes, you don't have to test if X_DX is greater than 999, because this is the largest number you can write with three digits.) Any invalid DX code will then cause the error message to be printed.

For really compulsive programmers (like the author), there is one final problem with the above approach. A value such as 1.5 would not result in an error message because the number is between 1 and 999. There are several ways around this problem. One way is to use the TRANSLATE function to substitute an invalid character for the decimal point before you perform the character-to-numeric conversion:
   X_DX = INPUT(TRANSLATE(DX,'X','.'),3.);
The TRANSLATE function has the form

   TRANSLATE(character_variable,to_string,from_string)

where each character in the from_string is translated to the corresponding character in the to_string. For example, to translate the numerals 1 through 5 to the letters A through E for a variable called SCORE, you would write
   NEW_VAR = TRANSLATE(SCORE,'ABCDE','12345');
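A small sketch of the decimal-point substitution applied to two made-up DX values may help here. The variable name CHANGED and the test values are assumptions for illustration only.

```sas
DATA _NULL_;
   LENGTH DX $ 3;
   ***Two hypothetical codes: one valid, one containing a decimal point;
   DO DX = '123', '1.5';
      ***Replace any period with the letter X;
      CHANGED = TRANSLATE(DX,'X','.');
      ***A value containing X cannot be read with the 3. informat;
      X_DX = INPUT(CHANGED,3.);
      PUT DX= CHANGED= X_DX=;
   END;
RUN;
```

For '123' the value passes through unchanged and X_DX is 123. For '1.5' the translated value is '1X5', so X_DX is missing and the SAS Log notes an invalid argument to the INPUT function.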
Another interesting approach is to test to see if the value of X_DX is not an integer. The MOD function is an effective way to do this. If any number modulus 1 (the remainder after you divide the number by 1) is not 0, the number is not an integer. The SAS code using this method is
   X_DX = INPUT(DX,3.);
   IF (X_DX EQ . OR MOD(X_DX,1) NE 0) AND DX NE ' ' THEN PUT PATNO= DX=;
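The integer test can be tried on a few made-up values in a self-contained step. This is a sketch under the assumption that the four test codes below are representative; they are not taken from PATIENTS.TXT.

```sas
DATA _NULL_;
   LENGTH DX $ 3;
   ***Hypothetical codes: valid, non-integer, non-numeric, and missing;
   DO DX = '123', '1.5', 'X', ' ';
      X_DX = INPUT(DX,3.);
      IF (X_DX EQ . OR MOD(X_DX,1) NE 0) AND DX NE ' ' THEN
         PUT "Invalid:  " DX=;
      ELSE PUT "Accepted: " DX=;
   END;
RUN;
```

The codes '1.5' and 'X' should be reported as invalid, while '123' and the missing value are accepted. The SAS Log will also carry notes about the invalid argument and about missing values passed to MOD.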
Here is another point. If you want to avoid filling up your SAS Log with error messages resulting from invalid arguments to the INPUT function, you can use the double question mark (??) modifier before the informat to tell the program to ignore these errors and not to report them to the SAS Log. The INPUT function would then look like this:
X_DX = INPUT(DX,?? 3.);
The ?? informat modifier can also be used with the INPUT statement. Here is the output from running Program 1-3:
Listing of Invalid Patient Numbers and Data Values

PATNO=002 DX=X
PATNO=003 GENDER=X
PATNO=004 AE=A
PATNO=010 GENDER=f
PATNO=013 GENDER=2
PATNO=002 DX=X
PATNO=023 GENDER=f
Note that patient 002 appears twice in this output. This occurs because there is a duplicate observation for patient 002 (in addition to several other purposely included errors), so that the data set can be used for examples later in this book, such as the detection of duplicate IDs and duplicate observations.
Suppose you want to check for valid patient numbers (PATNO) in a similar manner. However, you want to flag missing values as errors (every patient must have a valid ID). The following statements
   ID = INPUT(TRANSLATE(PATNO,'X','.'),?? 3.);
   IF ID LT 1 THEN PUT "Invalid ID for PATNO=" PATNO;
will work in the same way as your check for invalid DX codes, except that missing values will now be listed as errors.
Using PROC PRINT with a WHERE Statement to List Invalid Values
PROC PRINT DATA=CLEAN.PATIENTS;
   TITLE "LISTING OF INVALID GENDER VALUES";
   WHERE GENDER NOT IN ('M' 'F' ' ');
   ID PATNO;
   VAR GENDER;
RUN;
It's easy to forget that WHERE statements can be used within SAS procedures. SAS programmers who have been at it for a long time (like the author) often write a short DATA step first and use PUT statements, or create a temporary SAS data set and follow it with a PROC PRINT. The program above is both shorter and more efficient than a DATA step followed by a PROC PRINT. DATA _NULL_ steps, however, tend to be fairly efficient and are a reasonable alternative, as well as the more flexible approach.
LISTING OF INVALID GENDER VALUES

PATNO    GENDER
 003       X
 010       f
 013       2
 023       f
This program can be extended to list invalid values for all the character variables. You simply add the other invalid conditions to the WHERE statement, as shown in Program 1-5.

Program 1-5  Using PROC PRINT to List Invalid Character Data for Several Variables
PROC PRINT DATA=CLEAN.PATIENTS;
   TITLE "LISTING OF INVALID CHARACTER VALUES";
   WHERE GENDER NOT IN ('M' 'F' ' ') OR
         VERIFY(DX,' 0123456789') NE 0 OR
         AE NOT IN ('0' '1' ' ');
   ID PATNO;
   VAR GENDER DX AE;
RUN;
Notice that the output from this program is not as informative as the one produced by the DATA _NULL_ step in Program 1-3. It lists all the patient numbers, genders, DX codes, and adverse events even when only one of the variables has an error (patient 002, for example). So, there is a trade-off: the simpler program produces slightly less desirable output. We could get philosophical and extend this concept to life in general, but that's for some other book.

You can also substitute any of the more complicated logical expressions from the previous section into this WHERE statement if you wish. For example, to perform a more careful check on DX codes, you could modify the WHERE statement as shown here:
PROC PRINT DATA=CLEAN.PATIENTS;
   TITLE "LISTING OF INVALID CHARACTER VALUES";
   WHERE GENDER NOT IN ('M' 'F' ' ') OR
         ((INPUT(DX,3.) EQ . OR MOD(INPUT(DX,3.),1) NE 0) AND DX NE ' ') OR
         AE NOT IN ('0' '1' ' ');
   ID PATNO;
   VAR GENDER DX AE;
RUN;
Another way to check for invalid values of a character variable from raw data is to use user-defined formats. There are several possibilities here. One, you can create a format that leaves all valid character values as is and formats all invalid values to a single error code. Let's start out with a program that simply assigns formats to the character variables and uses PROC FREQ to list the number of valid and invalid codes. Following that, you will extend the program by using a DATA step to identify which IDs have invalid values. Program 1-6 uses formats to convert all invalid data values to a single value.
Program 1-6  Using a User-Defined Format and PROC FREQ to List Invalid Data Values
PROC FORMAT;
   VALUE $GENDER 'F','M' = 'Valid'
                 ' '     = 'Missing'
                 OTHER   = 'Miscoded';
   VALUE $DX '001' - '999' = 'Valid'   /* See important note below */
             ' '           = 'Missing'
             OTHER         = 'Miscoded';
   VALUE $AE '0','1' = 'Valid'
             ' '     = 'Missing'
             OTHER   = 'Miscoded';
RUN;

PROC FREQ DATA=CLEAN.PATIENTS;
   TITLE "Using Formats to Identify Invalid Values";
   FORMAT GENDER $GENDER.
          DX     $DX.
          AE     $AE.;
   TABLES GENDER DX AE / NOCUM NOPERCENT MISSING;
RUN;
For the variables GENDER and AE, which have specific valid values, you list each of the valid values in the range to the left of the equal sign in the VALUE statement. Format each of these values with the value 'Valid'. For the $DX format, you specify a range of values on the left side of the equal sign.

Important Note: It should be pointed out here that the range '001' - '999' will behave differently on Windows and UNIX platforms compared to MVS and CMS platforms, because character ranges are compared in the collating sequence of the platform. You may want to test several values on your platform to be sure the program is performing as you intend. For example, a value such as '0A' will be formatted as 'Valid' on a Windows or a UNIX platform and as 'Miscoded' on MVS or CMS (as pointed out by two of my reviewers, John Laing and Mike Zdeb). You may want to test for alphabetic values for DX in a short DATA step prior to running Program 1-6.

You may choose to lump the missing value with the valid values if that is appropriate, or you may want to keep track of missing values separately, as was done here. Finally, any value other than the valid values or a missing value will be formatted as 'Miscoded'. All that is left is to run PROC FREQ to count the number of 'Valid', 'Missing', and 'Miscoded' values. The TABLES option MISSING causes the missing values to be listed in the body of the PROC FREQ output. Here is the output from PROC FREQ:
Using Formats to Identify Invalid Values

The FREQ Procedure

Gender

GENDER      Frequency
---------------------
Missing             1
Miscoded            4
Valid              26

Diagnosis Code

DX          Frequency
---------------------
Missing             8
Valid              21
Miscoded            2

Adverse Event?

AE          Frequency
---------------------
Missing             1
Valid              29
Miscoded            1
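Following the suggestion to test several values on your own platform, here is one possible sketch. It defines a separate format with the same ranges as the DX format (the name $DXTEST and the test values are hypothetical) and uses the PUT function to show how each value is classified.

```sas
PROC FORMAT;
   ***Same ranges as the DX format, under a hypothetical name;
   VALUE $DXTEST '001' - '999' = 'Valid'
                 ' '           = 'Missing'
                 OTHER         = 'Miscoded';
RUN;

DATA _NULL_;
   LENGTH DX $ 3;
   ***Hypothetical test values, including one that mixes digits and letters;
   DO DX = '123', '0A', 'X', ' ';
      RESULT = PUT(DX,$DXTEST.);
      PUT DX= RESULT=;
   END;
RUN;
```

The classification of '0A' is the one to watch: it depends on whether letters sort before or after numerals in the collating sequence of your platform.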
PROC FORMAT;
   VALUE $GENDER 'F','M' = 'Valid'
                 ' '     = 'Missing'
                 OTHER   = 'Miscoded';
   VALUE $DX '001' - '999' = 'Valid'
             ' '           = 'Missing'
             OTHER         = 'Miscoded';
   VALUE $AE '0','1' = 'Valid'
             ' '     = 'Missing'
             OTHER   = 'Miscoded';
RUN;
DATA _NULL_;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   FILE PRINT; ***Send output to the Output window;
   TITLE "Listing of Invalid Patient Numbers and Data Values";
   ***Note: We will only input those variables of interest;
   INPUT @1  PATNO  $3.
         @4  GENDER $1.
         @24 DX     $3.
         @27 AE     $1.;
   ***A formatted value of Miscoded marks an invalid code;
   IF PUT(GENDER,$GENDER.) = 'Miscoded' THEN PUT PATNO= GENDER=;
   IF PUT(DX,$DX.) = 'Miscoded' THEN PUT PATNO= DX=;
   IF PUT(AE,$AE.) = 'Miscoded' THEN PUT PATNO= AE=;
RUN;
The heart of this program is the PUT function. To review, the PUT function is similar to the INPUT function. It takes the following form:
character_variable = PUT(variable,format)
where character_variable is a character variable that contains the value of the variable listed as the first argument to the function, formatted by the format listed as the second argument to the function. The result of a PUT function is always a character variable, and the function is frequently used to perform numeric-to-character conversions. In Program 1-7, the first argument of the PUT function is a character variable, and the result of the PUT function for any invalid data values would be the value 'Miscoded'. Here is the output from Program 1-7:
Listing of Invalid Patient Numbers and Data Values

PATNO=002 DX=X
PATNO=003 GENDER=X
PATNO=004 AE=A
PATNO=010 GENDER=f
PATNO=013 GENDER=2
PATNO=002 DX=X
PATNO=023 GENDER=f
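The numeric-to-character use of the PUT function mentioned above can be sketched in a few lines. The variable names and the value 7 are made up for illustration; Z3. is the standard SAS format that writes a number with leading zeros.

```sas
DATA _NULL_;
   ***Hypothetical numeric ID;
   ID_NUM  = 7;
   ***PUT always returns a character value, here three bytes wide;
   ID_CHAR = PUT(ID_NUM,Z3.);
   LEN     = LENGTH(ID_CHAR);
   PUT ID_NUM= ID_CHAR= LEN=;
RUN;
```

Here ID_CHAR holds the character value 007, three bytes long, which matches the way patient numbers are stored in PATNO.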
PROC FORMAT is also used to create informats. Remember that formats are used to control how variables look in output or how they are classified by such procedures as PROC FREQ. Informats modify the value of variables as they are read from the raw data, or they can be used with an INPUT function to create new variables in the DATA step. User-defined informats are created in much the same way as user-defined formats. Instead of a VALUE statement that creates formats, an INVALUE statement is used to create informats. The only difference between the two is that informat names can only be seven characters in length. (Note: For those curious readers, the reason is that informats and formats are both stored in the same catalog, and an "@" is placed before informats to distinguish them from formats.) The following is a program that changes invalid values for GENDER and AE to missing values by using a user-defined informat.

Program 1-8  Using a User-Defined Informat to Set Invalid Data Values to Missing
*----------------------------------------------------------------*
| PROGRAM NAME: INFORM1.SAS IN C:\CLEANING                       |
| PURPOSE: TO CREATE A SAS DATA SET CALLED PATIENTS2             |
|          AND SET ANY INVALID VALUES FOR GENDER AND AE TO       |
|          MISSING, USING A USER-DEFINED INFORMAT                |
*----------------------------------------------------------------*;
LIBNAME CLEAN "C:\CLEANING";

PROC FORMAT;
   INVALUE $GEN 'F','M' = _SAME_
                OTHER   = ' ';
   INVALUE $AE  '0','1' = _SAME_
                OTHER   = ' ';
RUN;

DATA CLEAN.PATIENTS2;
   INFILE "C:\CLEANING\PATIENTS.TXT" PAD;
   INPUT @1  PATNO  $3.
         @4  GENDER $GEN1.
         @27 AE     $AE1.;
   LABEL PATNO  = "Patient Number"
         GENDER = "Gender"
         AE     = "Adverse Event?";
RUN;
PROC PRINT DATA=CLEAN.PATIENTS2;
   TITLE "Listing of Data Set PATIENTS2";
   VAR PATNO GENDER AE;
RUN;
Notice the INVALUE statements in the PROC FORMAT above. The keyword _SAME_ is a SAS reserved value that does what its name implies: it leaves any of the values listed in the range specification unchanged. The keyword OTHER in the subsequent line refers to any values not matching one of the previous ranges. Notice also that the informats in the INPUT statement use the user-defined informat name followed by the number of columns to be read, the same method that is used with predefined SAS informats. Output from the PROC PRINT is shown next.
Listing of Data Set PATIENTS2
Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
PATNO 001 002 003 004 XX5 006 007 008 009 010 011 012 013 014 002 003 015 017 019 123 321 020 022 023 024 025 027 028 029 006
GENDER M F F M M M F M M M M F M F F M M F F M F M F F M F
AE 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0
Notice that invalid values for GENDER and AE are now missing values, including the two lowercase 'f's (patient numbers 010 and 023). Let's add one more feature to this program. By using the keyword UPCASE in the informat specification, you can automatically convert the values being read to uppercase before the ranges are checked. Here are the PROC FORMAT statements, rewritten to use this option:
PROC FORMAT;
   INVALUE $GEN (UPCASE) 'F'   = 'F'
                         'M'   = 'M'
                         OTHER = ' ';
   INVALUE $AE '0','1' = _SAME_
               OTHER   = ' ';
RUN;
The UPCASE option is placed in parentheses following the informat name. Notice some other changes as well. You cannot use the keyword _SAME_ anymore because the value is changed to uppercase for comparison purposes, but the _SAME_ specification would leave the original lowercase value unchanged. By specifying each value individually, the lowercase 'f' (the only lowercase GENDER value) would match the range 'F' and be assigned the value of an uppercase 'F'. The output of this data set is identical to the output for Program 1-8, except that the value of GENDER for patients 010 and 023 is an uppercase 'F'.

If you want to preserve the original value of the variable, you can use a user-defined informat with an INPUT function instead of an INPUT statement. You can use this method to check a raw data file or a SAS data set. Program 1-9 reads the SAS data set CLEAN.PATIENTS and uses user-defined informats to detect errors.
Program 1-9
PROC FORMAT;
   INVALUE $GENDER 'F','M' = _SAME_
                   OTHER   = 'ERROR';
   INVALUE $AE '0','1' = _SAME_
               OTHER   = 'ERROR';
RUN;

DATA _NULL_;
   FILE PRINT;
   SET CLEAN.PATIENTS;
   IF INPUT(GENDER,$GENDER.) = 'ERROR' THEN
      PUT @1 "Error for Gender for Patient:" PATNO " Value is " GENDER;
   IF INPUT(AE,$AE.) = 'ERROR' THEN
      PUT @1 "Error for AE for Patient:" PATNO " Value is " AE;
RUN;
The advantage of this program over Program 1-8 is that the original values of the variables are not lost.
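The same idea can be combined with the UPCASE option, so that the original value is kept and a cleaned copy is stored alongside it. The sketch below is one possible arrangement, not a program from this chapter: the informat name $GENUP, the variable CLEANED, and the three test values are all hypothetical.

```sas
PROC FORMAT;
   ***Hypothetical informat: uppercase first, then keep only F and M;
   INVALUE $GENUP (UPCASE) 'F'   = 'F'
                           'M'   = 'M'
                           OTHER = ' ';
RUN;

DATA _NULL_;
   LENGTH GENDER $ 1;
   ***Made-up values: valid, lowercase, and miscoded;
   DO GENDER = 'F', 'f', 'X';
      ***GENDER keeps its original value; CLEANED holds the result;
      CLEANED = INPUT(GENDER,$GENUP1.);
      PUT GENDER= CLEANED=;
   END;
RUN;
```

CLEANED should be F for both 'F' and 'f', and missing for 'X', while GENDER itself is left as it was entered.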