Example Exploratory Data Analysis PDF
This is an unmarked, optional tutorial that shows the kind of thinking that goes into an exploratory data analysis.
The goal of this tutorial document is to walk through some of the common issues encountered in the early
stages of an exploratory analysis on a set of data. It gives examples of common problem areas in:
- reading in data
- dealing with blanks
- dealing with factors
This data is a modified version of data from the New Zealand Election Survey, deliberately modified to
introduce problems that occur naturally in many data sets.
load("selected_nzes2011.Rdata")
We also want to load the packages whose functions we will use. For this particular analysis we
will only need the dplyr package, but for your project you will likely need other packages as well, e.g.
ggplot2.
library(dplyr)
If we try to run that line, we will get an error message about unexpected input or a missing object.
We next need to diagnose where the problem lies: in the R code or in the data? The best way to
troubleshoot this issue is to run each line of the dplyr chain one by one.
selected_nzes2011
The first line runs without any errors, but the second line gives an error:
selected_nzes2011 %>%
  select(jpartyvote, jdiffvoting, _singlefav)
We know that select() is a valid dplyr function, so that cannot be the problem. This means the problem
might be the variable names. The issue is that R has rules about which variable names are legal (e.g. no
spaces, starting with a letter), and when data is loaded, R will often fix variable names to make them legal.
This happened to _singlefav at the time the data was loaded.
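The renaming rule can be reproduced with base R's make.names(), which functions such as read.csv() and data.frame() apply by default (check.names = TRUE). This is a minimal sketch; the vector raw_names is made up for illustration and is not taken from the survey file.

```r
# Hypothetical raw column names, as they might appear in a source file
raw_names <- c("jpartyvote", "jdiffvoting", "_singlefav", "party vote")

# make.names() turns each string into a syntactically valid R name:
# a name that starts with an underscore gets an "X" prefix,
# and a space is replaced by a dot
fixed_names <- make.names(raw_names)
print(fixed_names)
```

Names that are already legal pass through unchanged, so only the problem columns end up with a different name from the one in the data dictionary.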
We could check this by looking through every single variable name in the data with the names() command.
names(selected_nzes2011)
##   [1] "Jelect" "jblogel" "jnewspaper" "jnatradio"
##   [5] "jtalkback" "jdiscussp" "jrallies" "jpersuade"
##   [9] "jpcmoney" "jpcposter" "jlablike" "jnatlike"
##  [13] "jgrnlike" "jnzflike" "jactlike" "junflike"
##  [17] "jmaolike" "jmnplike" "jmostlike" "jmostlikex"
##  [21] "jrepublic" "jsphealth" "jspedu" "jspunemp"
##  [25] "jspdefence" "jspsuper" "jspbusind" "jsppolice"
##  [29] "jspwelfare" "jspenviro" "jgovpdk" "jgovplab"
##  [33] "jgovpnat" "jgovpgrn" "jgovpnzf" "jgovpact"
##  [37] "jgovunf" "jgovpmao" "jgovpmnp" "jnevervoteno"
##  [41] "jnevervotelab" "jnevervotenat" "jnevervotegrn" "jnevervotenzf"
##  [45] "jnevervoteact" "jnevervoteunf" "jnevervotemao" "jnevervotemnp"
##  [49] "jnevervoteoth" "jnevervoteothx" "jfirstpx" "jsecondp"
##  [53] "jage" "jlanguage" "jlanguagex" "jrollsex"
##  [57] "jhqual" "jwkft" "jwkpt" "jwkun"
##  [61] "jwkret" "jwkdis" "jwksch" "jwkunpo"
##  [65] "jwkunpi" "jhhincome" "jhhadults" "jhhchn"
##  [69] "jmarital" "r_jind" "jlablr" "jnatlr"
##  [73] "jgrnlr" "jnzflr" "jactlr" "junflr"
##  [77] "jmaolr" "jmnplr" "jslflr" "jrelservices"
##  [81] "jrelnone" "jrelang" "jrelpres" "jrelcath"
##  [85] "jrelmeth" "jrelbap" "jrellat" "jrelrat"
##  [89] "jrelfun" "jrelothc" "jrelnonc" "jreligionx"
##  [93] "jreligiousity" "jethnicity_e" "jethnicity_m" "jethnicity_p"
##  [97] "jethnicity_a" "jethnicity_o" "jethnicityx" "jethnicmost"
## [101] "jethnicmostx" "jpartyvote" "jelecvote" "njptyvote"
## [105] "njelecvote" "jdiffvoting" "X_singlefav"
However, when a data set has hundreds of columns, it is quicker to search for candidate names than to read
the whole list. We can search the names for a fragment of the name by using the
grep("FRAGMENT", variable, value = TRUE) command, which in this case might be:
grep("singlefav", names(selected_nzes2011), value = TRUE)
## [1] "X_singlefav"
The value = TRUE argument, as described in the help for the grep() function, reports the matching character
string, as opposed to the index number for that string.
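To make the difference between the two return modes concrete, here is a small sketch on a made-up vector of column names (col_names is illustrative, not the real data set).

```r
# Hypothetical column names standing in for names(selected_nzes2011)
col_names <- c("jpartyvote", "jdiffvoting", "X_singlefav")

# Default behaviour: grep() returns the positions of the matching elements
match_index <- grep("singlefav", col_names)

# With value = TRUE it returns the matching strings themselves
match_value <- grep("singlefav", col_names, value = TRUE)

print(match_index)
print(match_value)
```

The index form is handy for subsetting by position, while the value form is easier to read when the goal is simply to find out what a column is called.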
We can now confirm that the variable is called X_singlefav, so that is how we should be referring to it.
selected_nzes2011 %>%
  select(jpartyvote, jdiffvoting, X_singlefav) %>%
  str()
These are all categorical data; however, they are recorded as characters (text strings) as opposed to factors.
An easy way of tabulating these data to see how many times each level occurs is to use the group_by() function
along with the summarise() command:
selected_nzes2011 %>%
  group_by(jpartyvote) %>%
  summarise(count = n())
## # A tibble: 14 x 2
##    jpartyvote    count
##    <chr>         <int>
##  1 Act              29
##  2 ALC              10
##  3 Alliance          2
##  4 Another party     8
##  5 Conservative     74
##  6 Don't know       23
##  7 Green           348
##  8 Labour          749
##  9 Mana             62
## 10 Maori Party     128
## 11 National       1130
## 12 NZ First        216
## 13 United Future    14
## 14 <NA>            308
We can see that 23 people answered "Don't know". Since our question is about people who knew which
party they voted for, we might want to exclude these observations from our analysis. We can do so by
filtering them out.
selected_nzes2011 %>%
  filter(jpartyvote != "Don't know") %>%
  group_by(jpartyvote) %>%
  summarise(count = n())
## # A tibble: 12 x 2
##    jpartyvote    count
##    <chr>         <int>
##  1 Act              29
##  2 ALC              10
##  3 Alliance          2
##  4 Another party     8
##  5 Conservative     74
##  6 Green           348
##  7 Labour          749
##  8 Mana             62
##  9 Maori Party     128
## 10 National       1130
## 11 NZ First        216
## 12 United Future    14
Because there is a %>% at the end of the line, R knows to continue on to the next line, just as it does with any
other operator that leaves an expression unfinished at the end of a line.
Note that adding the filter also got rid of the NA entries. This happens because comparing NA with any value
gives NA rather than TRUE, and filter() keeps only the rows for which the condition is TRUE. NA (Not Available)
is used to indicate blank entries: those observations for which there is no data recorded. It is always a good
plan to be aware of NAs and deliberately include them in or exclude them from the analysis so that the final
results are not surprising. In this case, since NA indicates that these people did not answer the question about
which party they voted for, excluding them from the analysis makes sense.
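The way NA rows disappear under a != condition can be checked on a toy data frame. This is a sketch only: the data frame votes and its party column are made up, and the second filter shows one possible way to keep the NA rows if that were the deliberate choice.

```r
library(dplyr)

# Toy data: one "Don't know" answer and one missing answer
votes <- data.frame(
  party = c("Labour", "Don't know", NA, "Green"),
  stringsAsFactors = FALSE
)

# Comparing NA with a string gives NA, not TRUE or FALSE
na_comparison <- NA != "Don't know"

# filter() keeps only rows where the condition is TRUE,
# so the NA row is dropped together with "Don't know"
kept <- votes %>%
  filter(party != "Don't know")

# To drop "Don't know" but keep the NA rows, ask for them explicitly
kept_with_na <- votes %>%
  filter(party != "Don't know" | is.na(party))

print(kept)
print(kept_with_na)
```

Writing the is.na() condition out makes the handling of blanks a visible decision rather than a side effect of the comparison.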
We can similarly view the levels and number of occurrences of these levels in the X_singlefav variable:
selected_nzes2011 %>%
  group_by(X_singlefav) %>%
  summarise(count = n())
## # A tibble: 8 x 2
##   X_singlefav   count
##   <chr>         <int>
## 1 Act              33
## 2 Green           388
## 3 Labour         1043
## 4 Mana             47
## 5 National       1266
## 6 NZ First        138
## 7 United Future   128
## 8 <NA>             58
This set also has NA entries, but in this case we don't want to get rid of anything but the NAs, so we need to
target them directly. NA entries need special targeting because they do not actually exist (they are different
to the text "NA" or a variable saved with the name NA).
If we only wanted to find the NAs we would use the is.na() function with the name of the variable inside
the parentheses.
However, since we want the entries that are not NAs, we can use the Not operator, !, to indicate that we want
all the ones that are not NA: !is.na(). Hence we can keep only the non-NA entries in our dplyr chain:
selected_nzes2011 %>%
  filter(!is.na(X_singlefav)) %>%
  group_by(X_singlefav) %>%
  summarise(count = n())
## # A tibble: 7 x 2
##   X_singlefav   count
##   <chr>         <int>
## 1 Act              33
## 2 Green           388
## 3 Labour         1043
## 4 Mana             47
## 5 National       1266
## 6 NZ First        138
## 7 United Future   128
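We can also combine both conditions in a single filter() call; conditions separated by a comma must all be
TRUE for a row to be kept:
selected_nzes2011 %>%
  filter(!is.na(X_singlefav), jpartyvote != "Don't know") %>%
  group_by(X_singlefav) %>%
  summarise(count = n())
## # A tibble: 7 x 2
##   X_singlefav   count
##   <chr>         <int>
## 1 Act              29
## 2 Green           354
## 3 Labour          914
## 4 Mana             42
## 5 National       1172
## 6 NZ First        119
## 7 United Future   115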
If we examine the categories in jdiffvoting, we can see that this variable has both a "Don't know" level
and NA entries.
selected_nzes2011 %>%
  group_by(jdiffvoting) %>%
  summarise(count = n())
## # A tibble: 7 x 2
##   jdiffvoting                                                        count
##   <chr>                                                              <int>
## 1 Don't know                                                            63
## 2 Voting can make a big difference to what happens                    1605
## 3 Voting can make a reasonable amount of difference to what happens    841
## 4 Voting can make some difference to what happens                      339
## 5 Voting won't make any difference to what happens                     119
## 6 Voting won't make much difference to what happens                    106
## 7 <NA>                                                                  28
This creates a new variable named sameparty that has the value "same" if jpartyvote is equal to
X_singlefav, and "different" otherwise.
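The call that creates sameparty is not reproduced in this copy of the tutorial. A minimal sketch of one way to get the described behaviour, assuming mutate() combined with ifelse() and using a made-up data frame toy in place of the survey data:

```r
library(dplyr)

# Toy data with the same column names as the tutorial (values are made up)
toy <- data.frame(
  jpartyvote  = c("Act", "Act", "Labour", NA),
  X_singlefav = c("Act", "National", NA, "Green"),
  stringsAsFactors = FALSE
)

# Assumed approach: label each row by comparing the two columns.
# Where either value is NA the comparison is NA, so ifelse() returns NA.
toy <- toy %>%
  mutate(sameparty = ifelse(jpartyvote == X_singlefav, "same", "different"))

print(toy)
```

On the real data the result would be assigned back to selected_nzes2011 so that later chains can group by sameparty.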
We can again check our work by exploring the groupings in a View:
selected_nzes2011 %>%
  group_by(jpartyvote, X_singlefav, sameparty) %>%
  summarise(count = n())
## Source: local data frame [82 x 4]
## Groups: jpartyvote, X_singlefav [?]
##
##    jpartyvote X_singlefav   sameparty count
##    <chr>      <chr>         <chr>     <int>
##  1 Act        Act           same         12
##  2 Act        Green         different     1
##  3 Act        National      different    14
##  4 Act        United Future different     1
##  5 Act        <NA>          <NA>          1
##  6 ALC        Green         different     1
##  7 ALC        Labour        different     4
##  8 ALC        National      different     2
##  9 ALC        United Future different     3
## 10 Alliance   Labour        different     1
## # ... with 72 more rows
We can see that for observations where jpartyvote equaled X_singlefav, the value "same" was recorded for
the new variable sameparty, and the value "different" was recorded otherwise. If either jpartyvote or
X_singlefav had an NA, R could not check for equality and hence NA was recorded for the sameparty
variable as well.
To view and summarize the "same" entries we can use the following:
selected_nzes2011 %>%
  group_by(jpartyvote, X_singlefav, sameparty) %>%
  summarise(count = n()) %>%
  filter(sameparty == "same")
## Source: local data frame [7 x 4]
## Groups: jpartyvote, X_singlefav [7]
##
##   jpartyvote    X_singlefav   sameparty count
##   <chr>         <chr>         <chr>     <int>
## 1 Act           Act           same         12
## 2 Green         Green         same        237
## 3 Labour        Labour        same        632
## 4 Mana          Mana          same         31
## 5 National      National      same       1004
## 6 NZ First      NZ First      same         82
## 7 United Future United Future same          5
And to view and summarize the "different" entries we can use the following:
selected_nzes2011 %>%
  group_by(jpartyvote, X_singlefav, sameparty) %>%
  summarise(count = n()) %>%
  filter(sameparty == "different")
## Source: local data frame [59 x 4]
## Groups: jpartyvote, X_singlefav [59]
##
##    jpartyvote    X_singlefav   sameparty count
##    <chr>         <chr>         <chr>     <int>
##  1 Act           Green         different     1
##  2 Act           National      different    14
##  3 Act           United Future different     1
##  4 ALC           Green         different     1
##  5 ALC           Labour        different     4
##  6 ALC           National      different     2
##  7 ALC           United Future different     3
##  8 Alliance      Labour        different     1
##  9 Alliance      National      different     1
## 10 Another party Green         different     2
## # ... with 49 more rows
We can also check how we got any NAs we have by using the is.na() function:
selected_nzes2011 %>%
  group_by(jpartyvote, X_singlefav, sameparty) %>%
  summarise(count = n()) %>%
  filter(is.na(sameparty))
## Source: local data frame [16 x 4]
## Groups: jpartyvote, X_singlefav [16]
##
##    jpartyvote   X_singlefav   sameparty count
##    <chr>        <chr>         <chr>     <int>
##  1 Act          <NA>          <NA>          1
##  2 Conservative <NA>          <NA>          1
##  3 Don't know   <NA>          <NA>          7
##  4 Green        <NA>          <NA>          1
##  5 Labour       <NA>          <NA>         11
##  6 Maori Party  <NA>          <NA>          2
##  7 National     <NA>          <NA>          7
##  8 NZ First     <NA>          <NA>          2
##  9 <NA>         Act           <NA>          4
## 10 <NA>         Green         <NA>         32
## 11 <NA>         Labour        <NA>        121
## 12 <NA>         Mana          <NA>          4
## 13 <NA>         National      <NA>         92
## 14 <NA>         NZ First      <NA>         17
## 15 <NA>         United Future <NA>         12
## 16 <NA>         <NA>          <NA>         26
The checks show that the observations with NAs in sameparty are going to be excluded from the analysis
when we filter out the NAs in the jpartyvote and X_singlefav variables, so we don't need to worry about
them anymore.
str(selected_nzes2011$jnzflike)
## Factor w/ 12 levels "0","1","10","2",..: 11410411NANA112...
str(selected_nzes2011$jage)
## int [1:3101] 37 37 28 71 43 NA 59 68 64 70 ...
jnzflike is a factor variable; in fact it is ordinal, and by default the levels are listed in alphabetical order.
Since this is a categorical variable, we can also summarize the occurrences of each level with group_by() and
summarise() again:
selected_nzes2011 %>%
  group_by(jnzflike) %>%
  summarise(count = n())
## # A tibble: 13 x 2
##    jnzflike   count
##    <fctr>     <int>
##  1 0            622
##  2 1            298
##  3 10           134
##  4 2            266
##  5 3            227
##  6 4            162
##  7 5            544
##  8 6            165
##  9 7            138
## 10 8            107
## 11 9             81
## 12 Don't know   224
## 13 NA           133
While jnzflike is on a 0 to 10 scale, this variable also has a level labeled "Don't know", which is why R
does not store it as a numeric variable.
jage, on the other hand, is an integer, with values that are whole numbers between 0 and infinity (or NA).
For this variable we would want to take a look at numerical summaries such as means, medians, etc.
selected_nzes2011 %>%
  summarise(agemean = mean(jage), agemedian = median(jage), agesd = sd(jage),
            agemin = min(jage), agemax = max(jage))
##   agemean agemedian agesd agemin agemax
## 1      NA        NA   NaN     NA     NA
What went wrong? All of the results were reported as NA because there were some NA entries in the jage
variable (people not reporting their age). Since it is not possible to take the average of a series of values
that contains NAs, obtaining the numerical summaries requires that we exclude the NAs from the calculation.
Most numerical summary functions allow us to easily exclude NAs with the na.rm argument. See the help
documentation for the median function for more information.
?median
## starting httpd help server ... done
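As a small illustration of the na.rm argument, the sketch below uses a made-up vector ages rather than the survey's jage column. The same argument can be passed to each summary function inside summarise().

```r
# Made-up ages with one missing value
ages <- c(37, 28, NA, 59)

# Without na.rm the missing value propagates to the result
mean_with_na <- mean(ages)                 # NA

# With na.rm = TRUE the NA is dropped before the calculation
mean_age   <- mean(ages, na.rm = TRUE)     # (37 + 28 + 59) / 3
median_age <- median(ages, na.rm = TRUE)   # 37
sd_age     <- sd(ages, na.rm = TRUE)
min_age    <- min(ages, na.rm = TRUE)      # 28
max_age    <- max(ages, na.rm = TRUE)      # 59

print(c(mean = mean_age, median = median_age, sd = sd_age,
        min = min_age, max = max_age))
```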
An alternative approach is just to filter out the NAs first, and then ask for the numerical summaries:
selected_nzes2011 %>%
  filter(!(is.na(jage))) %>%
  summarise(agemean = mean(jage), agemedian = median(jage), agesd = sd(jage),
            agemin = min(jage), agemax = max(jage))
##    agemean agemedian   agesd agemin agemax
## 1 53.22328        54 17.5371     18    100
An age range of 18 to 100 is reasonable for a voting-age population, so there are no obvious
errors in the data. If there were, we would need to decide if we should filter them out of the analysis.
Having gained some familiarity with the specific variables we are using, we next need to consider if there is
additional work we should do on the data in investigating the question. There are a number of different
approaches we might take. For example, we could consider if those that strongly like NZ First are older than
those that strongly dislike NZ First (approach 1), or we could consider if old people like NZ First more than
young people (approach 2).
selected_nzes2011 %>%
  filter(jnzflike %in% c("0", "10")) %>%
  group_by(jnzflike) %>%
  summarise(count = n())
## # A tibble: 2 x 2
##   jnzflike count
##   <fctr>   <int>
## 1 0          622
## 2 10         134
Remember that jnzflike is not a numerical variable, hence we use quotation marks around the
values (even though they happen to be numbers).
This is an example of simplifying the analysis by considering only two levels of a categorical variable, as
opposed to all possible levels.
## # A tibble: 3 x 2
##   retiredage count
##   <chr>      <int>
## 1 retiredage   876
## 2 workingage  2156
## 3 <NA>          69
We can see that individuals in the dataset are now labeled as either "retiredage" or "workingage" or
neither (NA), which we can easily filter out if need be.
This is an example of using a numerical threshold to convert a numerical variable to a categorical variable.
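The step that created retiredage is not reproduced in this copy, and the threshold it used is not stated. The sketch below shows the general technique on made-up data, with a cutoff of 65 chosen purely as an assumption for illustration.

```r
library(dplyr)

# Made-up ages, including one missing value
toy <- data.frame(jage = c(23, 64, 65, 80, NA))

# ASSUMPTION: 65 is an illustrative cutoff, not necessarily the one
# used in the original analysis
cutoff <- 65

# Rows at or above the cutoff get one label, rows below it the other;
# a missing age gives a missing label
toy <- toy %>%
  mutate(retiredage = ifelse(jage >= cutoff, "retiredage", "workingage"))

# Tabulate the new categorical variable to check the conversion
age_counts <- toy %>%
  group_by(retiredage) %>%
  summarise(count = n())

print(age_counts)
```

Tabulating immediately after the conversion is the same check-your-work habit used throughout this tutorial.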
For approach 2, we might also want to turn the scale of liking into numeric values, because at the
moment we cannot easily get summary information of the data in factor form. For example, if we try to run
the following command, we get an error saying "need numeric data".
selected_nzes2011 %>%
  group_by(retiredage) %>%
  summarise(medlike = median(jnzflike))
We can change the type of data with functions of the form as.thingtochangeto(), but it is easy to go wrong
with factors. For example, this is wrong:
selected_nzes2011 <- selected_nzes2011 %>%
  mutate(numlikenzf = as.numeric(jnzflike))
We can see it has gone wrong if we use grouping to check our work (and it is a very good plan to check our
work after converting factors).
selected_nzes2011 %>%
  group_by(jnzflike, numlikenzf) %>%
  summarise(count = n())
## Source: local data frame [13 x 3]
## Groups: jnzflike [?]
##
##    jnzflike   numlikenzf count
##    <fctr>          <dbl> <int>
##  1 0                   1   622
##  2 1                   2   298
##  3 10                  3   134
##  4 2                   4   266
##  5 3                   5   227
##  6 4                   6   162
##  7 5                   7   544
##  8 6                   8   165
##  9 7                   9   138
## 10 8                  10   107
## 11 9                  11    81
## 12 Don't know         12   224
## 13 NA                 NA   133
Factor entries have two parts: the text we see on the screen, and a numeric order (remember how 10 was
coming between 1 and 2 because of the alphabetical order). When we say "turn this into a number", R uses
the numeric order in which it stores the values to do that conversion, as opposed to the names of the levels
of the categorical variable. Hence, we need a conversion method that will use the text strings that label the
levels, as opposed to the storage order of these levels. We can do this by first saving the variable as a
character variable, and then turning it into a number:
selected_nzes2011 <- selected_nzes2011 %>%
  mutate(numlikenzf = as.numeric(as.character(jnzflike)))
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by coercion
The warning "NAs introduced by coercion" happens because the level "Don't know" cannot be turned into a
number. But this should be fine for our purposes since we are interested in the numerical responses
anyway.
selected_nzes2011 %>%
  group_by(jnzflike, numlikenzf) %>%
  summarise(count = n())
## Source: local data frame [13 x 3]
## Groups: jnzflike [?]
##
##    jnzflike   numlikenzf count
##    <fctr>          <dbl> <int>
##  1 0                   0   622
##  2 1                   1   298
##  3 10                 10   134
##  4 2                   2   266
##  5 3                   3   227
##  6 4                   4   162
##  7 5                   5   544
##  8 6                   6   165
##  9 7                   7   138
## 10 8                   8   107
## 11 9                   9    81
## 12 Don't know         NA   224
## 13 NA                 NA   133
Converting the factor to a character first ensures that the numerical values used in the labels of the levels of
the categorical variable are used.
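The difference between the two conversions is easy to reproduce on a tiny factor. This is a sketch on made-up values; the vector likes is illustrative and suppressWarnings() is used only to keep the expected coercion warning out of the output.

```r
# A small factor that mixes number-like labels with a text label
likes <- factor(c("0", "10", "2", "Don't know", "10"))

# Levels are sorted as text, so "10" comes before "2"
print(levels(likes))

# Wrong: as.numeric() on a factor returns the internal level codes
codes <- as.numeric(likes)

# Right: convert to character first so the labels themselves are parsed;
# "Don't know" cannot be parsed and becomes NA
values <- suppressWarnings(as.numeric(as.character(likes)))

print(codes)
print(values)
```

Comparing codes with values side by side is a quick way to catch this mistake before it reaches a summary statistic.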
Now that we have cleaned up the data in a way that addresses the needs of the research questions we want to
explore, we are ready to continue with our analysis.
Variable | Question | DataType
jactlike | | Factor
jactlr | | chr
jage | | int
jblogel | | chr
jdiffvoting | | chr
jdiscussp | | chr
Jelect | Electorate | int
jelecvote | | chr
jethnicity_a | | chr
jethnicity_e | | chr
jethnicity_m | | chr
jethnicity_o | | chr
jethnicity_p | | chr
jethnicityx | | chr
jethnicmost | | chr
jethnicmostx | | chr
jfirstpx | | chr
jgovpact | | chr
jgovpdk | C8: can't recall which parties formed the government after 2008 election | chr
jgovpgrn | | chr
jgovplab | | chr
jgovpmao | C8: Maori Party helped form the government after 2008 election | chr
jgovpmnp | C8: Mana Party helped form the government after 2008 election | chr
jgovpnat | | chr
jgovpnzf | | chr
jgovunf | C8: United Future helped form the government after 2008 election | chr
jgrnlike | | Factor
jgrnlr | | chr
jhhadults | | int
jhhchn | | int
jhhincome | | chr
jhqual | | chr
jlablike | | Factor
jlablr | | chr
jlanguage | | chr
jlanguagex | | chr
jmaolike | | Factor
jmaolr | | chr
jmarital | | chr
jmnplike | | Factor
jmnplr | | chr
jmostlike | | chr
jmostlikex | | chr
jnatlike | | Factor
jnatlr | | chr
jnatradio | | chr
jnevervoteact | | chr
jnevervotegrn | | chr
jnevervotelab | | chr
jnevervotemao | | chr
jnevervotemnp | | chr
jnevervotenat | | chr
jnevervoteno | | chr
jnevervotenzf | | chr
jnevervoteoth | | chr
jnevervoteothx | C16: other party for which you would never vote | chr
jnevervoteunf | | chr
jnewspaper | | chr
jnzflike | | Factor
jnzflr | | chr
jpartyvote | | chr
jpcmoney | | chr
jpcposter | | chr
jpersuade | | chr
jrallies | | chr
jrelang | F17: anglican | chr
jrelbap | F17: baptist | chr
jrelcath | F17: catholic | chr
jrelfun | | chr
jreligionx | | chr
jreligiousity | | chr
jrellat | | chr
jrelmeth | F17: methodist | chr
jrelnonc | F17: non-Christian | chr
jrelnone | F17: no religion | chr
jrelothc | | chr
jrelpres | F17: presbyterian | chr
jrelrat | F17: ratana | chr
jrelservices | | chr
jrepublic | | chr
jrollsex | | chr
jsecondp | C11: on election day which party overall was your second choice to be in government | chr
jslflr | | chr
jspbusind | | chr
jspdefence | | chr
jspedu | | chr
jspenviro | | chr
jsphealth | | chr
jsppolice | B3g: should there be more or less public spending on police and law enforcement | chr
jspsuper | | chr
jspunemp | | chr
jspwelfare | | chr
jtalkback | | chr
junflike | | Factor
junflr | | chr
jwkdis | | chr
jwkft | | chr
jwkpt | | chr
jwkret | F9: retired | chr
jwksch | | chr
jwkun | | chr
jwkunpi | | chr
jwkunpo | | chr
njelecvote | | chr
njptyvote | | chr
r_jind | | chr
_singlefav | | chr