Unit - 3

REVIEW OF BASIC DATA ANALYTIC METHODS USING R

The previous chapter presented the six phases of the Data Analytics Lifecycle.

• Phase 1: Discovery

• Phase 2: Data Preparation

• Phase 3: Model Planning

• Phase 4: Model Building

• Phase 5: Communicate Results

• Phase 6: Operationalize

The first three phases involve various aspects of data exploration. In general, the success of a data analysis project requires a deep understanding of the data. It also requires a toolbox for mining and presenting the data. These activities include the study of the data in terms of basic statistical measures and creation of graphs and plots to visualize and identify relationships and patterns. Several free or commercial tools are available for exploring, conditioning, modeling, and presenting data. Because of its popularity and versatility, the open-source programming language R is used to illustrate many of the presented analytical tasks and models in this book.

This chapter introduces the basic functionality of the R programming language and environment. The first section gives an overview of how to use R to acquire, parse, and filter the data as well as how to obtain some basic descriptive statistics on a dataset. The second section examines using R to perform exploratory data analysis tasks using visualization. The final section focuses on statistical inference, such as hypothesis testing and analysis of variance in R.

3.1 Introduction to R


R is a programming language and software framework for statistical analysis and graphics. Available for use under the GNU General Public License [1], R software and installation instructions can be obtained via the Comprehensive R Archive Network [2]. This section provides an overview of the basic functionality of R. In later chapters, this foundation in R is utilized to demonstrate many of the presented analytical techniques.

Before delving into specific operations and functions of R later in this chapter, it is important to understand the flow of a basic R script to address an analytical problem. The following R code illustrates a typical analytical situation in which a dataset is imported, the contents of the dataset are examined, and some model building tasks are executed. Although the reader may not yet be familiar with the R syntax, the code can be followed by reading the embedded comments, denoted by #. In the following scenario, the annual sales in U.S. dollars for 10,000 retail customers have been provided in the form of a comma-separated-value (CSV) file. The read.csv() function is used to import the CSV file. This dataset is stored to the R variable sales using the assignment operator <-.

# import a CSV file of the total annual sales for each customer
sales <- read.csv("c:/data/yearly_sales.csv")

# examine the imported dataset
head(sales)
summary(sales)

# plot num_of_orders vs. sales
plot(sales$num_of_orders, sales$sales_total,
     main="Number of Orders vs. Sales")

# perform a statistical analysis (fit a linear regression model)
results <- lm(sales$sales_total ~ sales$num_of_orders)
summary(results)

# perform some diagnostics on the fitted model
# plot histogram of the residuals
hist(results$residuals, breaks=800)

In this example, the data file is imported using the read.csv() function. Once the file has been imported, it is useful to examine the contents to ensure that the data was loaded properly as well as to become familiar with the data. In the example, the head() function, by default, displays the first six records of sales.

# examine the imported dataset
head(sales)

  cust_id sales_total num_of_orders gender
1  100001      800.64             3      F
2  100002      217.53             3      F
3  100003       74.58             2      M
4  100004      498.60             3      M
5  100005      723.11             4      F
6  100006       69.43             2      F

The summary() function provides some descriptive statistics, such as the mean and median, for each data column. Additionally, the minimum and maximum values as well as the 1st and 3rd quartiles are provided. Because the gender column contains two possible characters, an "F" (female) or "M" (male), the summary() function provides the count of each character's occurrence.
summary(sales)

    cust_id        sales_total       num_of_orders    gender
 Min.   :100001   Min.   :  30.02   Min.   : 1.000   F:5035
 1st Qu.:102501   1st Qu.:  80.29   1st Qu.: 2.000   M:4965
 Median :105001   Median : 151.65   Median : 2.000
 Mean   :105001   Mean   : 249.46   Mean   : 2.428
 3rd Qu.:107500   3rd Qu.: 295.50   3rd Qu.: 3.000
 Max.   :110000   Max.   :7606.09   Max.   :22.000

Plotting a dataset's contents can provide information about the relationships between the various columns. In this example, the plot() function generates a scatterplot of the number of orders (sales$num_of_orders) against the annual sales (sales$sales_total). The $ is used to reference a specific column in the dataset sales. The resulting plot is shown in Figure 3-1.

# plot num_of_orders vs. sales
plot(sales$num_of_orders, sales$sales_total,
     main="Number of Orders vs. Sales")

[Figure 3-1: a scatterplot titled "Number of Orders vs. Sales," with sales$num_of_orders on the horizontal axis and sales$sales_total on the vertical axis.]

FIGURE 3-1 Graphically examining the data

Each point corresponds to the number of orders and the total sales for each customer. The plot indicates that the annual sales are proportional to the number of orders placed. Although the observed relationship between these two variables is not purely linear, the analyst decided to apply linear regression using the lm() function as a first step in the modeling process.

results <- lm(sales$sales_total ~ sales$num_of_orders)
results

Call:
lm(formula = sales$sales_total ~ sales$num_of_orders)

Coefficients:
        (Intercept)  sales$num_of_orders
             -154.1                166.2

The resulting intercept and slope values are -154.1 and 166.2, respectively, for the fitted linear equation. However, results stores considerably more information that can be examined with the summary() function. Details on the contents of results are examined by applying the attributes() function. Because regression analysis is presented in more detail later in the book, the reader should not overly focus on interpreting the following output.

summary(results)

Call:
lm(formula = sales$sales_total ~ sales$num_of_orders)

Residuals:
    Min      1Q  Median      3Q     Max
 -666.5  -125.5   -26.7    86.6  4103.4

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          -154.128      4.129  -37.33   <2e-16 ***
sales$num_of_orders   166.221      1.462  113.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 210.8 on 9998 degrees of freedom
Multiple R-squared:  0.5617,    Adjusted R-squared:  0.5617
F-statistic: 1.292e+04 on 1 and 9998 DF,  p-value: < 2.2e-16
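As a quick illustration of the attributes() function mentioned above, the following sketch inspects the components of a fitted model object. A small made-up dataset stands in for the sales data, so the variable names here are illustrative only.

```r
# fit a regression on a small made-up dataset (stand-in for the sales data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)

# the names attribute lists the components stored in the fitted model,
# including "coefficients", "residuals", and "fitted.values"
attributes(fit)$names

fit$coefficients  # access one component directly with the $ notation
```

This is the same mechanism used later in the chapter when results$residuals is passed to hist().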

The summary() function is an example of a generic function. A generic function is a group of functions sharing the same name but behaving differently depending on the number and the type of arguments they receive. Utilized previously, plot() is another example of a generic function; the plot is determined by the passed variables. Generic functions are used throughout this chapter and the book. In the final portion of the example, the following R code uses the generic function hist() to generate a histogram (Figure 3-2) of the residuals stored in results. The function call illustrates that optional parameter values can be passed. In this case, the number of breaks is specified to observe the large residuals.

# perform some diagnostics on the fitted model
# plot histogram of the residuals
hist(results$residuals, breaks=800)

[Figure 3-2: a histogram titled "Histogram of results$residuals," with the residual values (roughly 0 to 4000) on the horizontal axis and frequency on the vertical axis.]

FIGURE 3-2 Evidence of large residuals

This simple example illustrates a few of the basic model planning and building tasks that may occur
in Phases 3 and 4 of the Data Analytics Lifecycle. Throughout this chapter, it is useful to envision how the
presented R functionality will be used in a more comprehensive analysis.
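The dispatch behavior of generic functions such as summary() can be observed directly by passing it objects of different classes; a brief sketch:

```r
# summary() behaves differently depending on the class of its argument
summary(c(1, 2, 3, 4, 5))          # numeric vector: min, quartiles, mean, max
summary(factor(c("F", "F", "M")))  # factor: a count for each level
summary(c(TRUE, TRUE, FALSE))      # logical vector: counts of FALSE and TRUE
```

The same function name produces quite different output in each case, which is why summary(sales) and summary(results) earlier in the chapter look nothing alike.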

3.1.1 R Graphical User Interfaces


R software uses a command-line interface (CLI) that is similar to the BASH shell in Li nux or the interactive
versionsof scripting languages such as Python. UNIX and Linux users can enter command Rat the terminal
prompt to use the CU. For Windows installations, Rcomes with RGui.exe, which provides a basic graphica l
user interface (GUI). However, to im prove the ease of writing, executing, and debugging Rcode, several
additional GUis have been written for R. Popular GUis include the Rcommander [3]. Rattle [4], and RStudio
[5). This section presents a brief overview of RStudio, which was used to build the Rexamples in th is book.
Figure 3-3 provides a screenshot of the previous Rcode example executed in RStudio.
[Figure 3-3: a screenshot of RStudio running the previous R code example, with four highlighted window panes: Scripts, Workspace, Plots, and Console.]

FIGURE 3-3 RStudio GUI

The four highlighted window panes follow.

• Scripts: Serves as an area to write and save R code

• Workspace: Lists the datasets and variables in the R environment

• Plots: Displays the plots generated by the R code and provides a straightforward mechanism to export the plots

• Console: Provides a history of the executed R code and the output

Additionally, the console pane can be used to obtain help information on R. Figure 3-4 illustrates that by entering ?lm at the console prompt, the help details of the lm() function are provided on the right. Alternatively, help(lm) could have been entered at the console prompt.
Functions such as edit() and fix() allow the user to update the contents of an R variable. Alternatively, such changes can be implemented with RStudio by selecting the appropriate variable from the Workspace pane.

R allows one to save the workspace environment, including variables and loaded libraries, into an .RData file using the save.image() function. An existing .RData file can be loaded using the load() function. Tools such as RStudio prompt the user for whether the developer wants to save the workspace contents prior to exiting the GUI.
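A minimal sketch of this save-and-restore cycle follows, using a throwaway temporary file path rather than a real project directory:

```r
x <- 42
ws <- tempfile(fileext = ".RData")  # a temporary path, for illustration only

save.image(file = ws)  # save all workspace variables to the .RData file
rm(x)                  # remove x from the workspace
load(ws)               # restore the saved workspace

x                      # x is available again
```

In practice the file would be given a meaningful name in a project directory so that a later R session can pick up where the previous one left off.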
The reader is encouraged to install R and a preferred GUI to try out the R examples provided in the book and utilize the help functionality to access more details about the discussed topics.
[Figure 3-4: a screenshot of RStudio after entering ?lm at the console prompt; the right-hand pane displays the "Fitting Linear Models" help page for the lm() function, including its usage, arguments, and details.]

FIGURE 3-4 Accessing help in RStudio

3.1.2 Data Import and Export

In the annual retail sales example, the dataset was imported into R using the read.csv() function as in the following code.

sales <- read.csv("c:/data/yearly_sales.csv")

R uses a forward slash (/) as the separator character in the directory and file paths. This convention makes script files somewhat more portable at the expense of some initial confusion on the part of Windows users, who may be accustomed to using a backslash (\) as a separator. To simplify the import of multiple files with long path names, the setwd() function can be used to set the working directory for the subsequent import and export operations, as shown in the following R code.

setwd("c:/data/")
sales <- read.csv("yearly_sales.csv")

Other import functions include read.table() and read.delim(), which are intended to import other common file types such as TXT. These functions can also be used to import the yearly_sales.csv file, as the following code illustrates.

sales_table <- read.table("yearly_sales.csv", header=TRUE, sep=",")
sales_delim <- read.delim("yearly_sales.csv", sep=",")

The main difference between these import functions is the default values. For example, the read.delim() function expects the column separator to be a tab ("\t"). In the event that the numerical data in a data file uses a comma for the decimal, R also provides two additional functions, read.csv2() and read.delim2(), to import such data. Table 3-1 includes the expected defaults for headers, column separators, and decimal point notations.

TABLE 3-1 Import Function Defaults

Function        Headers   Separator   Decimal Point
read.table()    FALSE     ""          "."
read.csv()      TRUE      ","         "."
read.csv2()     TRUE      ";"         ","
read.delim()    TRUE      "\t"        "."
read.delim2()   TRUE      "\t"        ","
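To see these defaults in action, the following sketch writes a small semicolon-separated file with comma decimals to a temporary path and reads it back with read.csv2(); the file contents are made up for illustration.

```r
# build a small file that uses ";" as the separator and "," as the decimal
f <- tempfile(fileext = ".csv")
writeLines(c("id;amount", "1;30,02", "2;80,29"), f)

d <- read.csv2(f)  # header=TRUE, sep=";", dec="," by default
d$amount           # the comma decimals are parsed as numeric values
```

Reading the same file with read.csv() would instead split on the commas inside the numbers, which is why choosing the function that matches the file's conventions matters.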


The analogous R functions such as write.table(), write.csv(), and write.csv2() enable exporting of R datasets to an external file. For example, the following R code adds an additional column to the sales dataset and exports the modified dataset to an external file.

# add a column for the average sales per order
sales$per_order <- sales$sales_total/sales$num_of_orders

# export data as tab-delimited without the row names
write.table(sales, "sales_modified.txt", sep="\t", row.names=FALSE)

Sometimes it is necessary to read data from a database management system (DBMS). R packages such as DBI [6] and RODBC [7] are available for this purpose. These packages provide database interfaces for communication between R and DBMSs such as MySQL, Oracle, SQL Server, PostgreSQL, and Pivotal Greenplum. The following R code demonstrates how to install the RODBC package with the install.packages() function. The library() function loads the package into the R workspace. Finally, a connector (conn) is initialized for connecting to a Pivotal Greenplum database training2 via open database connectivity (ODBC) with user user. The training2 database must be defined either in the /etc/ODBC.ini configuration file or using the Administrative Tools under the Windows Control Panel.

install.packages("RODBC")
library(RODBC)
conn <- odbcConnect("training2", uid="user", pwd="password")

The connector needs to be present to submit a SQL query to an ODBC database by using the sqlQuery() function from the RODBC package. The following R code retrieves specific columns from the housing table in which household income (hinc) is greater than $1,000,000.

housing_data <- sqlQuery(conn, "select serialno, state, persons, rooms
                                from housing
                                where hinc > 1000000")
head(housing_data)

(The first rows of the retrieved housing_data, with columns serialno, state, persons, and rooms, are displayed.)

Although plots can be saved using the RStudio GUI, plots can also be saved using R code by specifying the appropriate graphic devices. Using the jpeg() function, the following R code creates a new JPEG file, adds a histogram plot to the file, and then closes the file. Such techniques are useful when automating standard reports. Other functions, such as png(), bmp(), pdf(), and postscript(), are available in R to save plots in the desired format.

jpeg(file="c:/data/sales_hist.jpeg")  # create a new jpeg file
hist(sales$num_of_orders)             # export histogram to jpeg
dev.off()                             # shut off the graphic device

More information on data imports and exports can be found at http://cran.r-project.org/doc/manuals/r-release/R-data.html, such as how to import datasets from statistical software packages including Minitab, SAS, and SPSS.

3.1.3 Attribute and Data Types

In the earlier example, the sales variable contained a record for each customer. Several characteristics, such as total annual sales, number of orders, and gender, were provided for each customer. In general, these characteristics or attributes provide the qualitative and quantitative measures for each item or subject of interest. Attributes can be categorized into four types: nominal, ordinal, interval, and ratio (NOIR) [8]. Table 3-2 distinguishes these four attribute types and shows the operations they support. Nominal and ordinal attributes are considered categorical attributes, whereas interval and ratio attributes are considered numeric attributes.

TABLE 3-2 NOIR Attribute Types

             Categorical (Qualitative)                 Numeric (Quantitative)
             Nominal                Ordinal            Interval             Ratio

Definition   The values represent   Attributes imply   The difference       Both the difference
             labels that            a sequence.        between two values   and the ratio of two
             distinguish one                           is meaningful.       values are meaningful.
             from another.

Examples     ZIP codes,             Quality of         Temperature in       Age, temperature
             nationality, street    diamonds,          Celsius or           in Kelvin, counts,
             names, gender,         academic grades,   Fahrenheit,          length, weight
             employee ID numbers,   magnitude of       calendar dates,
             TRUE or FALSE          earthquakes        latitudes

Operations   =, ≠                   =, ≠,              =, ≠,                =, ≠,
                                    <, ≤, >, ≥         <, ≤, >, ≥,          <, ≤, >, ≥,
                                                       +, −                 +, −,
                                                                            ×, ÷

Data of one attribute type may be converted to another. For example, the quality of diamonds {Fair, Good, Very Good, Premium, Ideal} is considered ordinal but can be converted to nominal {Good, Excellent} with a defined mapping. Similarly, a ratio attribute like Age can be converted into an ordinal attribute such as {Infant, Adolescent, Adult, Senior}. Understanding the attribute types in a given dataset is important to ensure that the appropriate descriptive statistics and analytic methods are applied and properly interpreted. For example, the mean and standard deviation of U.S. postal ZIP codes are not very meaningful or appropriate. Proper handling of categorical variables will be addressed in subsequent chapters. Also, it is useful to consider these attribute types during the following discussion on R data types.
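The Age conversion described above can be sketched with R's cut() function, which bins a numeric vector into an ordered factor. The break points and ages below are illustrative choices, not part of the original example.

```r
age <- c(1, 14, 35, 70)  # ages in years (made-up values)

# bin the ratio attribute into an ordered (ordinal) factor
stage <- cut(age,
             breaks = c(0, 2, 17, 64, Inf),
             labels = c("Infant", "Adolescent", "Adult", "Senior"),
             ordered_result = TRUE)

stage              # returns Infant Adolescent Adult Senior
is.ordered(stage)  # returns TRUE
```

Note that the conversion discards information: once ages are binned, the difference and ratio operations that applied to the original ratio attribute are no longer meaningful.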

Numeric, Character, and Logical Data Types


Like other programming languages, R supports the use of numeric, character, and logical (Boolean) values. Examples of such variables are given in the following R code.

i <- 1              # create a numeric variable
sport <- "football" # create a character variable
flag <- TRUE        # create a logical variable

R provides several functions, such as class() and typeof(), to examine the characteristics of a given variable. The class() function represents the abstract class of an object. The typeof() function determines the way an object is stored in memory. Although i appears to be an integer, i is internally stored using double precision. To improve the readability of the code segments in this section, the inline R comments are used to explain the code or to provide the returned values.

class(i)  # returns "numeric"
typeof(i) # returns "double"

class(sport)  # returns "character"
typeof(sport) # returns "character"

class(flag)   # returns "logical"
typeof(flag)  # returns "logical"

Additional R functions exist that can test the variables and coerce a variable into a specific type. The following R code illustrates how to test if i is an integer using the is.integer() function and to coerce i into a new integer variable, j, using the as.integer() function. Similar functions can be applied for double, character, and logical types.

is.integer(i)      # returns FALSE
j <- as.integer(i) # coerces contents of i into an integer
is.integer(j)      # returns TRUE

The application of the length() function reveals that the created variables each have a length of 1. One might have expected the returned length of sport to have been 8 for each of the characters in the string "football". However, these three variables are actually one-element vectors.

length(i)     # returns 1
length(flag)  # returns 1
length(sport) # returns 1 (not 8 for "football")
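To count the characters in a string rather than the elements in a vector, R provides the nchar() function:

```r
sport <- "football"

length(sport) # returns 1 (one element in the vector)
nchar(sport)  # returns 8 (characters in "football")
```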

Vectors

Vectors are a basic building block for data in R. As seen previously, simple R variables are actually vectors. A vector can only consist of values in the same class. The tests for vectors can be conducted using the is.vector() function.

is.vector(i)     # returns TRUE
is.vector(flag)  # returns TRUE
is.vector(sport) # returns TRUE

R provides functionality that enables the easy creation and manipulation of vectors. The following R code illustrates how a vector can be created using the combine function, c(), or the colon operator, :, to build a vector from the sequence of integers from 1 to 5. Furthermore, the code shows how the values of an existing vector can be easily modified or accessed. The code, related to the z vector, indicates how logical comparisons can be built to extract certain elements of a given vector.

u <- c("red", "yellow", "blue") # create a vector "red" "yellow" "blue"
u                # returns "red" "yellow" "blue"
u[1]             # returns "red" (1st element in u)
v <- 1:5         # create a vector 1 2 3 4 5
v                # returns 1 2 3 4 5
sum(v)           # returns 15
w <- v * 2       # create a vector 2 4 6 8 10
w                # returns 2 4 6 8 10
w[3]             # returns 6 (the 3rd element of w)
z <- v + w       # sums two vectors element by element
z                # returns 3 6 9 12 15
z > 8            # returns FALSE FALSE TRUE TRUE TRUE
z[z > 8]         # returns 9 12 15
z[z > 8 | z < 5] # returns 3 9 12 15 ("|" denotes "or")

Sometimes it is necessary to initialize a vector of a specific length and then populate the content of the vector later. The vector() function, by default, creates a logical vector. A vector of a different type can be specified by using the mode parameter. The vector c, an integer vector of length 0, may be useful when the number of elements is not initially known and the new elements will later be added to the end of the vector as the values become available.

a <- vector(length=3)          # create a logical vector of length 3
a                              # returns FALSE FALSE FALSE
b <- vector(mode="numeric", 3) # create a numeric vector of length 3
typeof(b)                      # returns "double"
b[2] <- 3.1                    # assign 3.1 to the 2nd element
b                              # returns 0.0 3.1 0.0
c <- vector(mode="integer", 0) # create an integer vector of length 0
c                              # returns integer(0)
length(c)                      # returns 0

Although vectors may appear to be analogous to arrays of one dimension, they are technically dimensionless, as seen in the following R code. The concept of arrays and matrices is addressed in the following discussion.

length(b) # returns 3
dim(b)    # returns NULL (an undefined value)

Arrays and Matrices

The array() function can be used to restructure a vector as an array. For example, the following R code builds a three-dimensional array to hold the quarterly sales for three regions over a two-year period and then assigns the sales amount of $158,000 to the second region for the first quarter of the first year.

# the dimensions are 3 regions, 4 quarters, and 2 years
quarterly_sales <- array(0, dim=c(3,4,2))
quarterly_sales[2,1,1] <- 158000
quarterly_sales

, , 1

       [,1] [,2] [,3] [,4]
[1,]      0    0    0    0
[2,] 158000    0    0    0
[3,]      0    0    0    0

, , 2

     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0

A two-dimensional array is known as a matrix. The following code initializes a matrix to hold the quarterly sales for the three regions. The parameters nrow and ncol define the number of rows and columns, respectively, for the sales_matrix.

sales_matrix <- matrix(0, nrow = 3, ncol = 4)
sales_matrix

     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0

R provides the standard matrix operations such as addition, subtraction, and multiplication, as well as the transpose function t() and the inverse matrix function matrix.inverse() included in the matrixcalc package. The following R code builds a 3x3 matrix, M, and multiplies it by its inverse to obtain the identity matrix.

library(matrixcalc)
M <- matrix(c(1,3,3,5,0,4,3,3,3), nrow = 3, ncol = 3) # build a 3x3 matrix
M %*% matrix.inverse(M) # multiply M by inverse(M)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
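The other matrix operations mentioned above, such as the transpose t() and matrix arithmetic, can be sketched as follows using base R only. Note the distinction between *, which operates element by element, and %*%, which performs true matrix multiplication.

```r
A <- matrix(1:4, nrow = 2, ncol = 2) # 2x2 matrix, filled column by column

t(A)    # transpose: rows become columns
A + A   # element-wise addition
A * A   # element-wise multiplication (NOT matrix multiplication)
A %*% A # matrix multiplication
```

Confusing * with %*% is a common source of silent errors, because both expressions run without complaint on square matrices but return different results.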

Data Frames

Similar to the concept of matrices, data frames provide a structure for storing and accessing several variables of possibly different data types. In fact, as the is.data.frame() function indicates, a data frame was created by the read.csv() function at the beginning of the chapter.

# import a CSV file of the total annual sales for each customer
sales <- read.csv("c:/data/yearly_sales.csv")
is.data.frame(sales) # returns TRUE

As seen earlier, the variables stored in the data frame can be easily accessed using the $ notation. The following R code illustrates that in this example, each variable is a vector with the exception of gender, which was, by a read.csv() default, imported as a factor. Discussed in detail later in this section, a factor denotes a categorical variable, typically with a few finite levels such as "F" and "M" in the case of gender.

length(sales$num_of_orders)    # returns 10000 (number of customers)

is.vector(sales$cust_id)       # returns TRUE
is.vector(sales$sales_total)   # returns TRUE
is.vector(sales$num_of_orders) # returns TRUE
is.vector(sales$gender)        # returns FALSE

is.factor(sales$gender)        # returns TRUE

Because of their flexibility to handle many data types, data frames are the preferred input format for many of the modeling functions available in R. The following use of the str() function provides the structure of the sales data frame. This function identifies the integer and numeric (double) data types, the factor variables and levels, as well as the first few values for each variable.

str(sales) # display structure of the data frame object

'data.frame': 10000 obs. of 4 variables:
 $ cust_id      : int  100001 100002 100003 100004 100005 100006 ...
 $ sales_total  : num  800.6 217.5 74.6 498.6 723.1 ...
 $ num_of_orders: int  3 3 2 3 4 2 2 2 2 2 ...
 $ gender       : Factor w/ 2 levels "F","M": 1 1 2 2 1 1 2 2 1 2 ...

In the simplest sense, data frames are lists of variables of the same length. A subset of the data frame can be retrieved through subsetting operators. R's subsetting operators are powerful in that they allow one to express complex operations in a succinct fashion and easily retrieve a subset of the dataset.

# extract the fourth column of the sales data frame
sales[,4]
# extract the gender column of the sales data frame
sales$gender
# retrieve the first two rows of the data frame
sales[1:2,]
# retrieve the first, third, and fourth columns
sales[,c(1,3,4)]
# retrieve both the cust_id and the sales_total columns
sales[,c("cust_id", "sales_total")]
# retrieve all the records whose gender is female
sales[sales$gender=="F",]

The following R code shows that the class of the sales variable is a data frame. However, the type of the sales variable is a list. A list is a collection of objects that can be of various types, including other lists.

class(sales)
[1] "data.frame"
typeof(sales)
[1] "list"
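Because a data frame is internally a list of column vectors, list-style access works alongside the $ notation. A sketch with a small made-up data frame (the sales data is not required here):

```r
# a small stand-in data frame with two illustrative columns
df <- data.frame(cust_id = c(1, 2), total = c(800.64, 217.53))

df$total      # the total column, via $ notation
df[["total"]] # list-style access by name: the same column vector
df[[2]]       # list-style access by position: also the total column
```

All three expressions return the identical vector, which is why list operations such as length() and lapply() work directly on data frames.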

Lists

Lists can contain any type of objects, including other lists. Using the vector v and the matrix M created in earlier examples, the following R code creates assortment, a list of different object types.

# build an assorted list of a string, a numeric, a list, a vector,
# and a matrix
housing <- list("own", "rent")
assortment <- list("football", 7.5, housing, v, M)
assortment

[[1]]
[1] "football"

[[2]]
[1] 7.5

[[3]]
[[3]][[1]]
[1] "own"

[[3]][[2]]
[1] "rent"

[[4]]
[1] 1 2 3 4 5

[[5]]
     [,1] [,2] [,3]
[1,]    1    5    3
[2,]    3    0    3
[3,]    3    4    3

In displaying the contents of assortment, the use of the double brackets, [[]], is of particular importance. As the following R code illustrates, the use of the single set of brackets only accesses an item in the list, not its content.

# examine the fifth object, M, in the list
class(assortment[5])    # returns "list"
length(assortment[5])   # returns 1

class(assortment[[5]])  # returns "matrix"
length(assortment[[5]]) # returns 9 (for the 3x3 matrix)

As presented earlier in the data frame discussion, the str() function offers details about the structure of a list.

str(assortment)
List of 5
 $ : chr "football"
 $ : num 7.5
 $ :List of 2
  ..$ : chr "own"
  ..$ : chr "rent"
 $ : int [1:5] 1 2 3 4 5
 $ : num [1:3, 1:3] 1 3 3 5 0 4 3 3 3

Factors

Factors were briefly introduced during the discussion of the gender variable in the data frame sales. In this case, gender could assume one of two levels: F or M. Factors can be ordered or not ordered. In the case of gender, the levels are not ordered.

class(sales$gender)      # returns "factor"
is.ordered(sales$gender) # returns FALSE

Included with the ggplot2 package, the diamonds data frame contains three ordered factors.
Examining the cut factor, there are five levels in order of improving cut: Fair, Good, Very Good, Premium,
and Ideal. Thus, sales$gender contains nominal data, and diamonds$cut contains ordinal data.
head(sales$gender)  # display first six values and the levels

[1] F F M M F F
Levels: F M

library(ggplot2)
data(diamonds) # load the data frame into the R workspace

str(diamonds)
'data.frame':   53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

head(diamonds$cut)  # display first six values and the levels

[1] Ideal   Premium Good    Premium Good    Very Good
Levels: Fair < Good < Very Good < Premium < Ideal

Suppose it is decided to categorize sales$sales_total into three groups (small, medium,
and big) according to the amount of the sales with the following code. These groupings are the basis for
the new ordinal factor, spender, with levels {small, medium, big}.
# build an empty character vector of the same length as sales
sales_group <- vector(mode="character",
                      length=length(sales$sales_total))

# group the customers according to the sales amount


sales_group[sales$sales_total<100] <- "small"
sales_group[sales$sales_total>=100 & sales$sales_total<500] <- "medium"
sales_group[sales$sales_total>=500] <- "big"

# create and add the ordered factor to the sales data frame
spender <- factor(sales_group, levels=c("small", "medium", "big"),
                  ordered = TRUE)
sales <- cbind(sales, spender)

str(sales$spender)
 Ord.factor w/ 3 levels "small"<"medium"<..: 3 2 1 2 3 1 1 1 2 1 ...

head(sales$spender)
[1] big    medium small  medium big    small
Levels: small < medium < big

The cbind() function is used to combine variables column-wise. The rbind() function is used
to combine datasets row-wise. The use of factors is important in several R statistical modeling functions,
such as analysis of variance, aov(), presented later in this chapter, and the use of contingency tables,
discussed next.
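As a brief sketch of the two functions, using a small hypothetical data frame rather than the sales data:

```r
# cbind() combines column-wise; rbind() combines row-wise
df1 <- data.frame(id = 1:2, amt = c(10, 20))
df2 <- data.frame(id = 3, amt = 30)

rbind(df1, df2)                    # three rows, same two columns
cbind(df1, flag = c(TRUE, FALSE))  # original rows plus a new flag column
```

Note that rbind() requires the two data frames to have matching column names.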

Contingency Tables
In R, table refers to a class of objects used to store the observed counts across the factors for a given dataset.
Such a table is commonly referred to as a contingency table and is the basis for performing a statistical
test on the independence of the factors used to build the table. The following R code builds a contingency
table based on the sales$gender and sales$spender factors.
# build a contingency table based on the gender and spender factors
sales_table <- table(sales$gender, sales$spender)
sales_table
small medium big
F 1726 2746 563
M 1656 2723 586

class(sales_table)   # returns "table"
typeof(sales_table)  # returns "integer"
dim(sales_table)     # returns 2 3

# performs a chi-squared test
summary(sales_table)
Number of cases in table: 10000
Number of factors: 2
Test for independence of all factors:
Chisq = 1.516, df = 2, p-value = 0.4686

Based on the observed counts in the table, the summary() function performs a chi-squared test
on the independence of the two factors. Because the reported p-value is greater than 0.05, the assumed
independence of the two factors is not rejected. Hypothesis testing and p-values are covered in more detail
later in this chapter. Next, applying descriptive statistics in R is examined.
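The same chi-squared test can also be run directly with the base R function chisq.test(). The following sketch (a supplementary example) rebuilds the observed counts from the contingency table as a matrix:

```r
# observed counts taken from the sales_table output above
tbl <- matrix(c(1726, 2746, 563,
                1656, 2723, 586), nrow = 2, byrow = TRUE)

# Pearson chi-squared test of independence; no continuity correction
# is applied for tables larger than 2x2
chisq.test(tbl)  # reports Chisq of about 1.516 on 2 degrees of freedom
```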

3.1.4 Descriptive Statistics


It has already been shown that the summary() function provides several descriptive statistics, such as
the mean and median, about a variable such as the sales data frame. The results now include the counts
for the three levels of the spender variable based on the earlier examples involving factors.
summary(sales)
    cust_id        sales_total      num_of_orders    gender   spender
 Min.   :100001   Min.   :  30.02   Min.   : 1.000   F:5035   small :3382
 1st Qu.:102501   1st Qu.:  80.29   1st Qu.: 2.000   M:4965   medium:5469
 Median :105001   Median : 151.65   Median : 2.000            big   :1149
 Mean   :105001   Mean   : 249.46   Mean   : 2.428
 3rd Qu.:107500   3rd Qu.: 295.50   3rd Qu.: 3.000
 Max.   :110000   Max.   :7606.09   Max.   :22.000

The following code provides some common R functions that include descriptive statistics. In
parentheses, the comments describe the functions.

# to simplify the function calls, assign
x <- sales$sales_total
y <- sales$num_of_orders

cor(x,y)     # returns 0.7508015 (correlation)
cov(x,y)     # returns 345.2111 (covariance)
IQR(x)       # returns 215.21 (interquartile range)
mean(x)      # returns 249.4577 (mean)
median(x)    # returns 151.65 (median)
range(x)     # returns 30.02 7606.09 (min max)
sd(x)        # returns 319.0508 (std. dev.)
var(x)       # returns 101793.4 (variance)

The IQR() function provides the difference between the third and the first quartiles. The other func-
tions are fairly self-explanatory by their names. The reader is encouraged to review the available help files
for acceptable inputs and possible options.
The function apply() is useful when the same function is to be applied to several variables in a data
frame. For example, the following R code calculates the standard deviation for the first three variables in
sales. In the code, setting MARGIN=2 specifies that the sd() function is applied over the columns.
Other functions, such as lapply() and sapply(), apply a function to a list or vector. Readers can refer
to the R help files to learn how to use these functions.

apply(sales[,c(1:3)], MARGIN=2, FUN=sd)
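As a brief sketch of the difference between lapply() and sapply(), using a small hypothetical list rather than the sales data:

```r
vals <- list(a = 1:3, b = 4:6)

lapply(vals, sum)  # returns a list: $a is 6, $b is 15
sapply(vals, sum)  # simplifies to a named vector: a = 6, b = 15
```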

Additional descriptive statistics can be applied with user-defined functions. The following R code
defines a function, my_range(), to compute the difference between the maximum and minimum values
returned by the range() function. In general, user-defined functions are useful for any task or operation
that needs to be frequently repeated. More information on user-defined functions is available by entering
help("function") in the console.
# build a function to provide the difference between
# the maximum and the minimum values
my_range <- function(v) {range(v)[2] - range(v)[1]}
my_range(x)

3.2 Exploratory Data Analysis


So far, this chapter has addressed importing and exporting data in R, basic data types and operations, and
generating descriptive statistics. Functions such as summary() can help analysts easily get an idea of
the magnitude and range of the data, but other aspects such as linear relationships and distributions are
more difficult to see from descriptive statistics. For example, the following code shows a summary view of
a data frame data with two columns x and y. The output shows the range of x and y, but it's not clear
what the relationship may be between these two variables.

summary(data)
       x                  y
 Min.   :-1.90481   Min.   : ...
 1st Qu.:-0.66321   1st Qu.: ...
 Median : ...       Median : ...
 Mean   : ...       Mean   : ...
 3rd Qu.: 0.65414   3rd Qu.: ...
 Max.   : ...       Max.   : ...

A useful way to detect patterns and anomalies in the data is through exploratory data analysis with
visualization. Visualization gives a succinct, holistic view of the data that may be difficult to grasp from the
numbers and summaries alone. Variables x and y of the data frame data can instead be visualized in a
scatterplot (Figure 3-5), which easily depicts the relationship between two variables. An important facet
of the initial data exploration, visualization assesses data cleanliness and suggests potentially important
relationships in the data prior to the model planning and building phases.

FIGURE 3-5 A scatterplot can easily show if x and y share a relationship

The code to generate data as well as Figure 3-5 is shown next.

x <- rnorm(50)
y <- x + rnorm(50, mean=0, sd=0.5)

data <- as.data.frame(cbind(x, y))

summary(data)

library(ggplot2)
ggplot(data, aes(x=x, y=y)) +
  geom_point(size=2) +
  ggtitle("Scatterplot of X and Y") +
  theme(axis.text=element_text(size=12),
        axis.title = element_text(size=14),
        plot.title = element_text(size=20, face="bold"))

Exploratory data analysis [9] is a data analysis approach to reveal the important characteristics of a
dataset, mainly through visualization. This section discusses how to use some basic visualization techniques
and the plotting feature in R to perform exploratory data analysis.

3.2.1 Visualization Before Analysis


To illustrate the importance of visualizing data, consider Anscombe's quartet. Anscombe's quartet consists
of four datasets, as shown in Figure 3-6. It was constructed by statistician Francis Anscombe [10] in 1973
to demonstrate the importance of graphs in statistical analyses.

       #1             #2             #3             #4
   x     y        x     y        x     y        x     y
   4    4.26      4    3.10      4    5.39      8    5.25
   5    5.68      5    4.74      5    5.73      8    5.56
   6    7.24      6    6.13      6    6.08      8    5.76
   7    4.82      7    7.26      7    6.42      8    6.58
   8    6.95      8    8.14      8    6.77      8    6.89
   9    8.81      9    8.77      9    7.11      8    7.04
  10    8.04     10    9.14     10    7.46      8    7.71
  11    8.33     11    9.26     11    7.81      8    7.91
  12   10.84     12    9.13     12    8.15      8    8.47
  13    7.58     13    8.74     13   12.74      8    8.84
  14    9.96     14    8.10     14    8.84     19   12.50

FIGURE 3-6 Anscombe's quartet

The four datasets in Anscombe's quartet have nearly identical statistical properties, as shown in Table 3-3.

TABLE 3-3 Statistical Properties of Anscombe's Quartet

Statistical Property             Value
Mean of x                        9
Variance of x                    11
Mean of y                        7.50 (to 2 decimal points)
Variance of y                    4.12 or 4.13 (to 2 decimal points)
Correlation between x and y      0.816
Linear regression line           y = 3.00 + 0.50x (to 2 decimal points)

Based on the nearly identical statistical properties across each dataset, one might conclude that these
four datasets are quite similar. However, the scatterplots in Figure 3-7 tell a different story. Each dataset is
plotted as a scatterplot, and the fitted lines are the result of applying linear regression models. The estimated
regression line fits Dataset 1 reasonably well. Dataset 2 is definitely nonlinear. Dataset 3 exhibits a linear
trend, with one apparent outlier at x = 13. For Dataset 4, the regression line fits the dataset quite well.
However, with only points at two x values, it is not possible to determine that the linearity assumption is
proper.
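These properties can be verified directly in R from the built-in anscombe dataset; the following lines are a supplementary check, not part of the original example:

```r
data(anscombe)

sapply(anscombe[, c("x1", "x2", "x3", "x4")], mean)  # each is 9
sapply(anscombe[, c("y1", "y2", "y3", "y4")], var)   # each is about 4.12-4.13
cor(anscombe$x1, anscombe$y1)                        # about 0.816
```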

FIGURE 3-7 Anscombe's quartet visualized as scatterplots

The R code for generating Figure 3-7 is shown next. It requires the R package ggplot2 [11], which can
be installed simply by running the command install.packages("ggplot2"). The anscombe
dataset for the plot is included in the standard R distribution. Enter data() for a list of datasets included
in the R base distribution. Enter data(DatasetName) to make a dataset available in the current
workspace.
In the code that follows, variable levels is created using the gl() function, which generates
factors of four levels (1, 2, 3, and 4), each repeating 11 times. Variable mydata is created using the
with(data, expression) function, which evaluates an expression in an environment con-
structed from data. In this example, the data is the anscombe dataset, which includes eight attributes:
x1, x2, x3, x4, y1, y2, y3, and y4. The expression part in the code creates a data frame from the
anscombe dataset, and it only includes three attributes: x, y, and the group each data point belongs
to (mygroup).
install.packages("ggplot2")  # not required if package has been installed

data(anscombe)  # load the anscombe dataset into the current workspace
anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

nrow(anscombe)  # number of rows
[1] 11

# generate levels to indicate which group each data point belongs to
levels <- gl(4, nrow(anscombe))
levels
 [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
[34] 4 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4

# group anscombe into a data frame
mydata <- with(anscombe, data.frame(x=c(x1,x2,x3,x4), y=c(y1,y2,y3,y4),
                                    mygroup=levels))

mydata
    x     y mygroup
1  10  8.04       1
2   8  6.95       1
3  13  7.58       1
4   9  8.81       1
...
42  8  5.56       4
43  8  7.91       4
44  8  6.89       4

# make scatterplots using the ggplot2 package
library(ggplot2)
theme_set(theme_bw())  # set plot color theme

# create the four plots of Figure 3-7
ggplot(mydata, aes(x, y)) +
  geom_point(size=4) +
  geom_smooth(method="lm", fill=NA, fullrange=TRUE) +
  facet_wrap(~mygroup)

3.2.2 Dirty Data


This section addresses how dirty data can be detected in the data exploration phase with visualizations. In
general, analysts should look for anomalies, verify the data with domain knowledge, and decide the most
appropriate approach to clean the data.
Consider a scenario in which a bank is conducting data analyses of its account holders to gauge customer
retention. Figure 3-8 shows the age distribution of the account holders.

FIGURE 3-8 Age distribution of bank account holders

If the age data is in a vector called age, the graph can be created with the following R script:

hist(age, breaks=100, main="Age Distribution of Account Holders",
     xlab="Age", ylab="Frequency", col="gray")

The figure shows that the median age of the account holders is around 40. A few accounts with account
holder age less than 10 are unusual but plausible. These could be custodial accounts or college savings
accounts set up by the parents of young children. These accounts should be retained for future analyses.

However, the left side of the graph shows a huge spike of customers who are zero years old or have
negative ages. This is likely to be evidence of missing data. One possible explanation is that the null age
values could have been replaced by 0 or negative values during the data input. Such an occurrence may
be caused by entering age in a text box that only allows numbers and does not accept empty values. Or it
might be caused by transferring data among several systems that have different definitions for null values
(such as NULL, NA, 0, -1, or -2). Therefore, data cleansing needs to be performed over the accounts with
abnormal age values. Analysts should take a closer look at the records to decide if the missing data should
be eliminated or if an appropriate age value can be determined using other available information for each
of the accounts.
In R, the is.na() function provides tests for missing values. The following example creates a vector
x where the fourth value is not available (NA). The is.na() function returns TRUE at each NA value
and FALSE otherwise.
x <- c(1, 2, 3, NA, 4)
is.na(x)
[1] FALSE FALSE FALSE  TRUE FALSE

Some arithmetic functions, such as mean(), applied to data containing missing values can yield an
NA result. To prevent this, set the na.rm parameter to TRUE to remove the missing value during the
function's execution.
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 2.5

The na.exclude() function returns the object with incomplete cases removed.
DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))
DF
  x  y
1 1 10
2 2 20
3 3 NA

DF1 <- na.exclude(DF)
DF1
  x  y
1 1 10
2 2 20

Account holders older than 100 may be due to bad data caused by typos. Another possibility is that these
accounts may have been passed down to the heirs of the original account holders without being updated.
In this case, one needs to further examine the data and conduct data cleansing if necessary. The dirty data
could be simply removed or filtered out with an age threshold for future analyses. If removing records is
not an option, the analysts can look for patterns within the data and develop a set of heuristics to attack
the problem of dirty data. For example, wrong age values could be replaced with approximation based
on the nearest neighbor-the record that is the most similar to the record in question based on analyzing
the differences in all the other variables besides age.
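As a minimal sketch of such a cleansing step, with made-up ages and a simple median rule standing in for a full nearest-neighbor approach:

```r
# hypothetical ages; 0 and -1 stand in for missing values
age <- c(34, 0, 45, -1, 52, 28)

bad <- age <= 0                # flag the implausible values
age[bad] <- median(age[!bad])  # replace them with the median of valid ages
age                            # 34.0 39.5 45.0 39.5 52.0 28.0
```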

Figure 3-9 presents another example of dirty data. The distribution shown here corresponds to the age
of mortgages in a bank's home loan portfolio. The mortgage age is calculated by subtracting the origination
date of the loan from the current date. The vertical axis corresponds to the number of mortgages at
each mortgage age.

FIGURE 3-9 Distribution of mortgage in years since origination from a bank's home loan portfolio

If the data is in a vector called mortgage, Figure 3-9 can be produced by the following R script.

hist(mortgage, breaks=10, xlab="Mortgage Age", col="gray",
     main="Portfolio Distribution, Years Since Origination")

Figure 3-9 shows that the loans are no more than 10 years old, and these 10-year-old loans have a
disproportionate frequency compared to the rest of the population. One possible explanation is that the
10-year-old loans do not only include loans originated 10 years ago, but also those originated earlier than
that. In other words, the 10 in the x-axis actually means ">= 10". This sometimes happens when data is ported
from one system to another or because the data provider decided, for some reason, not to distinguish loans
that are more than 10 years old. Analysts need to study the data further and decide the most appropriate
way to perform data cleansing.
Data analysts should perform sanity checks against domain knowledge and decide if the dirty data
needs to be eliminated. Consider the task to find out the probability of mortgage loan default. If the
past observations suggest that most defaults occur before about the 4th year and 10-year-old mortgages
rarely default, it may be safe to eliminate the dirty data and assume that the defaulted loans are less than
10 years old. For other analyses, it may become necessary to track down the source and find out the true
origination dates.
Dirty data can occur due to acts of omission. In the sales data used at the beginning of this chapter,
it was seen that the minimum number of orders was 1 and the minimum annual sales amount was $30.02.
Thus, there is a strong possibility that the provided dataset did not include the sales data on all customers,
just the customers who purchased something during the past year.

3.2.3 Visualizing a Single Variable


Using visual representations of data is a hallmark of exploratory data analyses: letting the data speak to
its audience rather than imposing an interpretation on the data a priori. Sections 3.2.3 and 3.2.4 examine
ways of displaying data to help explain the underlying distributions of a single variable or the relationships
of two or more variables.
R has many functions available to examine a single variable. Some of these functions are listed in
Table 3-4.

TABLE 3-4 Example Functions for Visualizing a Single Variable

Function              Purpose
plot(data)            Scatterplot where x is the index and y is the value;
                      suitable for low-volume data
barplot(data)         Barplot with vertical or horizontal bars
dotchart(data)        Cleveland dot plot [12]
hist(data)            Histogram
plot(density(data))   Density plot (a continuous histogram)
stem(data)            Stem-and-leaf plot
rug(data)             Add a rug representation (1-d plot) of the data to an
                      existing plot

Dotchart and Barplot


Dotchart and barplot portray continuous values with labels from a discrete variable. A dotchart can be
created in R with the function dotchart(x, label=...), where x is a numeric vector and label
is a vector of categorical labels for x. A barplot can be created with the barplot(height) function,
where height represents a vector or matrix. Figure 3-10 shows (a) a dotchart and (b) a barplot based
on the mtcars dataset, which includes the fuel consumption and 10 aspects of automobile design and
performance of 32 automobiles. This dataset comes with the standard R distribution.
The plots in Figure 3-10 can be produced with the following R code.

data(mtcars)
dotchart(mtcars$mpg, labels=row.names(mtcars), cex=.7,
         main="Miles Per Gallon (MPG) of Car Models",
         xlab="MPG")
barplot(table(mtcars$cyl), main="Distribution of Car Cylinder Counts",
        xlab="Number of Cylinders")

Histogram and Density Plot


Figure 3-11(a) includes a histogram of household income. The histogram shows a clear concentration of
low household incomes on the left and the long tail of the higher incomes on the right.
FIGURE 3-10 (a) Dotchart on the miles per gallon of cars and (b) Barplot on the distribution of car cylinder
counts

FIGURE 3-11 (a) Histogram and (b) Density plot of household income

Figure 3-11(b) shows a density plot of the logarithm of household income values, which emphasizes
the distribution. The income distribution is concentrated in the center portion of the graph. The code to
generate the two plots in Figure 3-11 is provided next. The rug() function creates a one-dimensional
density plot on the bottom of the graph to emphasize the distribution of the observation.
# randomly generate 4000 observations from the log normal distribution
income <- rlnorm(4000, meanlog = 4, sdlog = 0.7)
summary(income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.301  33.720  54.970  70.320  88.800 659.800
income <- 1000*income
summary(income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4301   33720   54970   70320   88800  659800
# plot the histogram
hist(income, breaks=500, xlab="Income", main="Histogram of Income")
# density plot
plot(density(log10(income), adjust=0.5),
     main="Distribution of Income (log10 scale)")
# add rug to the density plot
rug(log10(income))

In the data preparation phase of the Data Analytics Lifecycle, the data range and distribution can be
obtained. If the data is skewed, viewing the logarithm of the data (if it's all positive) can help detect
structures that might otherwise be overlooked in a graph with a regular, nonlogarithmic scale.
When preparing the data, one should look for signs of dirty data, as explained in the previous section.
Examining if the data is unimodal or multimodal will give an idea of how many distinct populations with
different behavior patterns might be mixed into the overall population. Many modeling techniques assume
that the data follows a normal distribution. Therefore, it is important to know if the available dataset can
match that assumption before applying any of those modeling techniques.
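One quick check, sketched here with simulated data (a supplementary example), is to compare the mean and the median and to inspect a normal Q-Q plot:

```r
set.seed(2)
z <- rlnorm(500)  # simulated right-skewed data

mean(z) > median(z)   # TRUE: the mean is pulled right by the long tail
qqnorm(z); qqline(z)  # points bend away from the line for skewed data

# after a log transform this sample is normal by construction,
# and the Q-Q plot hugs the reference line
qqnorm(log(z)); qqline(log(z))
```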
Consider a density plot of diamond prices (in USD). Figure 3-12(a) contains two density plots for
premium and ideal cuts of diamonds. The group of premium cuts is shown in red, and the group of ideal cuts
is shown in blue. The range of diamond prices is wide, in this case ranging from around $300 to almost
$20,000. Extreme values are typical of monetary data such as income, customer value, tax liabilities, and
bank account sizes.
Figure 3-12(b) shows more detail of the diamond prices than Figure 3-12(a) by taking the logarithm. The
two humps in the premium cut represent two distinct groups of diamond prices: One group centers around
log10 price = 2.9 (where the price is about $794), and the other centers around log10 price = 3.7 (where the
price is about $5,012). The ideal cut contains three humps, centering around 2.9, 3.3, and 3.7 respectively.
The R script to generate the plots in Figure 3-12 is shown next. The diamonds dataset comes with
the ggplot2 package.
library("ggplot2")
data(diamonds)  # load the diamonds dataset from ggplot2

# Only keep the premium and ideal cuts of diamonds
niceDiamonds <- diamonds[diamonds$cut=="Premium" |
                         diamonds$cut=="Ideal",]

summary(niceDiamonds$cut)
     Fair      Good Very Good   Premium     Ideal
        0         0         0     13791     21551

# plot density plot of diamond prices
ggplot(niceDiamonds, aes(x=price, fill=cut)) +
  geom_density(alpha = .3, color=NA)

# plot density plot of the log10 of diamond prices
ggplot(niceDiamonds, aes(x=log10(price), fill=cut)) +
  geom_density(alpha = .3, color=NA)

As an alternative to ggplot2, the lattice package provides a function called densityplot()
for making simple density plots.

FIGURE 3-12 Density plots of (a) diamond prices and (b) the logarithm of diamond prices

3.2.4 Examining Multiple Variables


A scatterplot (shown previously in Figure 3-1 and Figure 3-5) is a simple and widely used visualization
for finding the relationship among multiple variables. A scatterplot can represent data with up to five
variables using x-axis, y-axis, size, color, and shape. But usually only two to four variables are portrayed
in a scatterplot to minimize confusion. When examining a scatterplot, one needs to pay close attention

to the possible relationship between the variables. If the functional relationship between the variables is
somewhat pronounced, the data may roughly lie along a straight line, a parabola, or an exponential curve.
If variable y is related exponentially to x, then the plot of x versus log(y) is approximately linear. If the
plot looks more like a cluster without a pattern, the corresponding variables may have a weak relationship.
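The following short sketch, with artificial data (a new example), illustrates that point: an exponential trend becomes a straight line once y is viewed on a log scale.

```r
x <- 1:50
y <- 2 * exp(0.1 * x)  # exponential relationship

plot(x, y)       # curved
plot(x, log(y))  # approximately a straight line
cor(x, log(y))   # essentially 1
```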
The scatterplot in Figure 3-13 portrays the relationship of two variables: x and y. The red line shown
on the graph is the fitted line from the linear regression. Linear regression will be revisited in Chapter 6,
"Advanced Analytical Theory and Methods: Regression." Figure 3-13 shows that the regression line does
not fit the data well. This is a case in which linear regression cannot model the relationship between the
variables. Alternative methods such as the loess() function can be used to fit a nonlinear line to the
data. The blue curve shown on the graph represents the LOESS curve, which fits the data better than linear
regression.

FIGURE 3-13 Examining two variables with regression

The R code to produce Figure 3-13 is as follows. The runif(75, 0, 10) generates 75 numbers
between 0 to 10 with random deviates, and the numbers conform to the uniform distribution. The
rnorm(75, 0, 20) generates 75 numbers that conform to the normal distribution, with the mean equal
to 0 and the standard deviation equal to 20. The points() function is a generic function that draws a
sequence of points at the specified coordinates. Parameter type="l" tells the function to draw a solid
line. The col parameter sets the color of the line, where 2 represents the red color and 4 represents the
blue color.

# generate 75 numbers between 0 and 10 of uniform distribution
x <- runif(75, 0, 10)
x <- sort(x)
y <- 200 + x^3 - 10 * x^2 + x + rnorm(75, 0, 20)

lr <- lm(y ~ x)       # linear regression
poly <- loess(y ~ x)  # LOESS

fit <- predict(poly)  # fit a nonlinear line

plot(x, y)

# draw the fitted line for the linear regression
points(x, lr$coefficients[1] + lr$coefficients[2] * x,
       type = "l", col = 2)

# draw the fitted line for LOESS
points(x, fit, type = "l", col = 4)

Dotchart and Barplot


Dotchart and barplot from the previous section can visualize multiple variables. Both of them use color as
an additional dimension for visualizing the data.
For the same mtcars dataset, Figure 3-14 shows a dotchart that groups vehicle cylinders on the y-axis
and uses colors to distinguish different cylinders. The vehicles are sorted according to their MPG values.
The code to generate Figure 3-14 is shown next.

FIGURE 3-14 Dotplot to visualize multiple variables



# sort by mpg
cars <- mtcars[order(mtcars$mpg),]

# grouping variable must be a factor
cars$cyl <- factor(cars$cyl)

cars$color[cars$cyl==4] <- "red"
cars$color[cars$cyl==6] <- "blue"
cars$color[cars$cyl==8] <- "darkgreen"

dotchart(cars$mpg, labels=row.names(cars), cex=.7, groups=cars$cyl,
         main="Miles Per Gallon (MPG) of Car Models\nGrouped by Cylinder",
         xlab="Miles Per Gallon", color=cars$color, gcolor="black")

The barplot in Figure 3-15 visualizes the distribution of car cylinder counts and number of gears. The
x-axis represents the number of cylinders, and the color represents the number of gears. The code to
generate Figure 3-15 is shown next.

FIGURE 3-15 Barplot to visualize multiple variables

counts <- table(mtcars$gear, mtcars$cyl)

barplot(counts, main="Distribution of Car Cylinder Counts and Gears",
        xlab="Number of Cylinders", ylab="Counts",
        col=c("#0000FFFF", "#0080FFFF", "#00FFFFFF"),
        legend = rownames(counts), beside=TRUE,
        args.legend = list(x="top", title="Number of Gears"))

Box-and-Whisker Plot
Box-and-whisker plots show the distribution of a continuous variable for each value of a discrete variable.
The box-and-whisker plot in Figure 3-16 visualizes mean household incomes as a function of region in
the United States. The first digit of the U.S. postal ("ZIP") code corresponds to a geographical region
in the United States. In Figure 3-16, each data point corresponds to the mean household income from a
particular zip code. The horizontal axis represents the first digit of a zip code, ranging from 0 to 9, where
0 corresponds to the northeast region of the United States (such as Maine, Vermont, and Massachusetts),
and 9 corresponds to the southwest region (such as California and Hawaii). The vertical axis represents
the logarithm of mean household incomes. The logarithm is taken to better visualize the distribution
of the mean household incomes.

FIGURE 3-16 A box-and-whisker plot of mean household income and geographical region

In this figure, the scatterplot is displayed beneath the box-and-whisker plot, with some jittering for the
overlapping points so that each line of points widens into a strip. The "box" of the box-and-whisker shows the
range that contains the central 50% of the data, and the line inside the box is the location of the median
value. The upper and lower hinges of the boxes correspond to the first and third quartiles of the data. The
upper whisker extends from the hinge to the highest value that is within 1.5 * IQR of the hinge. The lower
whisker extends from the hinge to the lowest value within 1.5 * IQR of the hinge. IQR is the inter-quartile
range, as discussed in Section 3.1.4. The points outside the whiskers can be considered possible outliers.
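These bounds are easy to compute by hand. The following sketch uses the mtcars dataset that ships with R (the income data behind Figure 3-16 is not bundled with R) to compute the hinges, the IQR, and the whisker limits for a numeric variable:

```r
# a numeric variable to summarize; mtcars ships with R
x <- mtcars$mpg

# first and third quartiles (the hinges of the box)
q1 <- as.numeric(quantile(x, 0.25))
q3 <- as.numeric(quantile(x, 0.75))

# inter-quartile range, equivalent to IQR(x)
iqr <- q3 - q1

# the whiskers extend to the most extreme observations
# within 1.5 * IQR of the hinges
upper_limit <- q3 + 1.5 * iqr
lower_limit <- q1 - 1.5 * iqr

# observations beyond the whiskers are possible outliers
outliers <- x[x > upper_limit | x < lower_limit]
```

The same bounds are what boxplot.stats() reports in its out component, so this is mainly useful for listing the suspected outliers programmatically.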

The graph shows how household income varies by region. The highest median incomes are in region
0 and region 9. Region 0 is slightly higher, but the boxes for the two regions overlap enough that the dif-
ference between the two regions probably is not significant. The lowest household incomes tend to be in
region 7, which includes states such as Louisiana, Arkansas, and Oklahoma.
Assuming a data frame called DF contains two columns (MeanHouseholdIncome and Zip1), the
following R script uses the ggplot2 library [11] to plot a graph that is similar to Figure 3-16.

library(ggplot2)
# plot the jittered scatterplot w/ boxplot
# color-code points with zip codes
# the outlier.size=0 prevents the boxplot from plotting the outlier
ggplot(data=DF, aes(x=as.factor(Zip1), y=log10(MeanHouseholdIncome))) +
    geom_point(aes(color=factor(Zip1)), alpha=0.2, position="jitter") +
    geom_boxplot(outlier.size=0, alpha=0.1) +
    guides(colour=FALSE) +
    ggtitle("Mean Household Income by Zip Code")

Alternatively, one can create a simple box-and-whisker plot with the boxplot() function provided
by the R base package.
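For instance, a minimal base-R sketch using the built-in mtcars dataset (again, the income data of Figure 3-16 is not bundled with R) draws one box per cylinder count:

```r
# one box per group; the formula reads "mpg grouped by cyl"
boxplot(mpg ~ cyl, data = mtcars,
        main = "Miles Per Gallon by Cylinder Count",
        xlab = "Number of Cylinders", ylab = "Miles Per Gallon")
```

The base-R version lacks the jittered scatterplot overlay of the ggplot2 approach, but it requires no additional packages.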

Hexbinplot for Large Datasets

This chapter has shown that the scatterplot is a popular visualization for data containing one or
more variables. But one should be careful about using it on high-volume data. If there is too much data, the
structure of the data may become difficult to see in a scatterplot. Consider a case to compare the logarithm
of household income against the years of education, as shown in Figure 3-17. The cluster in the scatterplot
on the left (a) suggests a somewhat linear relationship between the two variables. However, one cannot really see
the structure of how the data is distributed inside the cluster. This is a Big Data type of problem. Millions
or billions of data points would require different approaches for exploration, visualization, and analysis.

FIGURE 3-17 (a) Scatterplot and (b) Hexbinplot of household income against years of education

Although color and transparency can be used in a scatterplot to address this issue, a hexbinplot is
sometimes a better alternative. A hexbinplot combines the ideas of scatterplot and histogram. Similar to
a scatterplot, a hexbinplot visualizes data in the x-axis and y-axis. Data is placed into hexagonal bins, and the third
dimension uses shading to represent the concentration of data in each hexbin.
In Figure 3-17(b), the same data is plotted using a hexbinplot. The hexbinplot shows that the data is
more densely clustered in a streak that runs through the center of the cluster, roughly along the regression
line. The biggest concentration is around 12 years of education, extending to about 15 years.
In Figure 3-17, note the outlier data at MeanEducation=0. These data points may correspond to
some missing data that needs further cleansing.
Assuming the two variables MeanHouseholdIncome and MeanEducation are from a data
frame named zcta, the scatterplot of Figure 3-17(a) is plotted by the following R code.
# plot the data points
plot(log10(MeanHouseholdIncome) ~ MeanEducation, data=zcta)
# add a straight fitted line of the linear regression
abline(lm(log10(MeanHouseholdIncome) ~ MeanEducation, data=zcta), col="red")

Using the zcta data frame, the hexbinplot of Figure 3-17(b) is plotted by the following R code.
Running the code requires the use of the hexbin package, which can be installed by running
install.packages("hexbin").

library(hexbin)
# "g" adds the grid, "r" adds the regression line
# sqrt transform on the count gives more dynamic range to the shading
# inv provides the inverse transformation function of trans
hexbinplot(log10(MeanHouseholdIncome) ~ MeanEducation,
           data=zcta, trans=sqrt, inv=function(x) x^2, type=c("g", "r"))
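For readers who prefer to stay within ggplot2, a similar plot can be built with geom_hex (which also depends on the hexbin package). The zcta data frame is not bundled with R, so the sketch below generates synthetic education and income data purely for illustration; the variable names mirror those used above.

```r
library(ggplot2)

# synthetic stand-in for the zcta data:
# income rises roughly linearly with education (on the log10 scale) plus noise
set.seed(42)
edu <- runif(10000, 0, 16)
inc <- 10^(3.5 + 0.07 * edu + rnorm(10000, sd = 0.3))
df <- data.frame(MeanEducation = edu, MeanHouseholdIncome = inc)

# hexagonal binning with a fitted regression line, as in Figure 3-17(b)
p <- ggplot(df, aes(x = MeanEducation, y = log10(MeanHouseholdIncome))) +
    geom_hex(bins = 30) +
    geom_smooth(method = "lm", color = "red")
print(p)
```

The fill shading of each hexagon plays the same role as the count legend in the hexbinplot() version.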

Scatterplot Matrix
A scatterplot matrix shows many scatterplots in a compact, side-by-side fashion. The scatterplot matrix,
therefore, can visually represent multiple attributes of a dataset to explore their relationships, magnify
differences, and disclose hidden patterns.
Fisher's iris dataset [13] includes the measurements in centimeters of the sepal length, sepal width,
petal length, and petal width for 50 flowers from each of three species of iris. The three species are setosa, versicolor,
and virginica. The iris dataset comes with the standard R distribution.
In Figure 3-18, all the variables of Fisher's iris dataset (sepal length, sepal width, petal length, and
petal width) are compared in a scatterplot matrix. The three different colors represent three species of iris
flowers. The scatterplot matrix in Figure 3-18 allows its viewers to compare the differences across the iris
species for any pairs of attributes.

FIGURE 3-18 Scatterplot matrix of Fisher's [13] iris dataset

Consider the scatterplot from the first row and third column of Figure 3-18, where sepal length is com-
pared against petal length. The horizontal axis is the petal length, and the vertical axis is the sepal length.
The scatterplot shows that versicolor and virginica share similar sepal and petal lengths, although the latter
has longer petals. The petal lengths of all setosa are about the same, and the petal lengths are remarkably
shorter than those of the other two species. The scatterplot shows that for versicolor and virginica, sepal length
grows linearly with the petal length.
The R code for generating the scatterplot matrix is provided next.

# define the colors
colors <- c("red", "green", "blue")

# draw the plot matrix
pairs(iris[1:4], main = "Fisher's Iris Dataset",
      pch = 21, bg = colors[unclass(iris$Species)])

# set graphical parameter to clip plotting to the figure region
par(xpd = TRUE)

# add legend
legend(0.2, 0.02, horiz = TRUE, as.vector(unique(iris$Species)),
       fill = colors, bty = "n")

The vector colors defines the color scheme for the plot. It could be changed to something like
colors <- c("gray50", "white", "black") to make the scatterplots grayscale.

Analyzing a Variable over Time

Visualizing a variable over time is the same as visualizing any pair of variables, but in this case the goal is
to identify time-specific patterns.
Figure 3-19 plots the monthly total numbers of international airline passengers (in thousands) from
January 1949 to December 1960. Enter plot(AirPassengers) in the R console to obtain a similar
graph. The plot shows that, for each year, a large peak occurs mid-year around July and August, and a small
peak happens around the end of the year, possibly due to the holidays. Such a phenomenon is referred to
as a seasonality effect.

FIGURE 3-19 Airline passenger counts from 1949 to 1960

Additionally, the overall trend is that the number of air passengers steadily increased from 1949 to
1960. Chapter 8, "Advanced Analytical Theory and Methods: Time Series Analysis," discusses the analysis
of such datasets in greater detail.
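As a quick preview, the seasonality and trend can be examined directly with the base-R decompose() function, which performs a simple moving-average decomposition of a monthly time series (Chapter 8 covers more rigorous methods):

```r
# AirPassengers is a monthly ts object (frequency 12) bundled with R
comps <- decompose(AirPassengers)

# plot the observed, trend, seasonal, and random components together
plot(comps)

# inspect the repeating seasonal component for one year;
# the mid-year months carry the largest positive seasonal effect
round(comps$seasonal[1:12], 1)
```

The seasonal panel of the resulting plot makes the mid-year peak described above immediately visible.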

3.2.5 Data Exploration Versus Presentation


Using visualization for data exploration is different from presenting results to stakeholders. Not every type
of plot is suitable for all audiences. Most of the plots presented earlier try to detail the data as clearly as pos-
sible for data scientists to identify structures and relationships. These graphs are more technical in nature
and are better suited to technical audiences such as data scientists. Nontechnical stakeholders, however,
generally prefer simple, clear graphics that focus on the message rather than the data.
Figure 3-20 shows the density plot of the distribution of account values from a bank. The data has been
converted to the log10 scale. The plot includes a rug on the bottom to show the distribution of the variable.
This graph is more suitable for data scientists and business analysts because it provides information that
can be relevant to the downstream analysis. The graph shows that the transformed account values follow
an approximate normal distribution, in the range from $100 to $10,000,000. The median account value is
approximately $30,000 (10^4.5), with the majority of the accounts between $1,000 (10^3) and $1,000,000 (10^6).

FIGURE 3-20 Density plots are better to show to data scientists

Density plots are fairly technical, and they contain so much information that they would be difficult to
explain to less technical stakeholders. For example, it would be challenging to explain why the account
values are in the log10 scale, and such information is not relevant to stakeholders. The same message can
be conveyed by partitioning the data into log-like bins and presenting it as a histogram. As can be seen in
Figure 3-21, the bulk of the accounts are in the $1,000-1,000,000 range, with the peak concentration in the
$10-50K range, extending to $500K. This portrayal gives the stakeholders a better sense of the customer
base than the density plot shown in Figure 3-20.
Note that the bin sizes should be carefully chosen to avoid distortion of the data. In this example, the bins
in Figure 3-21 are chosen based on observations from the density plot in Figure 3-20. Without the density
plot, the peak concentration might be just due to the somewhat arbitrary-appearing choices for the bin sizes.
This simple example addresses the different needs of two groups of audience: analysts and stakehold-
ers. Chapter 12, "The Endgame, or Putting It All Together," further discusses the best practices of delivering
presentations to these two groups.
Following is the R code to generate the plots in Figure 3-20 and Figure 3-21.
# Generate random log normal income data
income = rlnorm(5000, meanlog=log(40000), sdlog=log(5))

# Part I: Create the density plot
plot(density(log10(income), adjust=0.5),
     main="Distribution of Account Values (log10 scale)")
# Add rug to the density plot
rug(log10(income))

# Part II: Make the histogram with "log-like" bins
breaks = c(0, 1000, 5000, 10000, 50000, 100000, 5e5, 1e6, 2e7)
# Create bins and label the data
bins = cut(income, breaks, include.lowest=T,
           labels = c("< 1K", "1-5K", "5-10K", "10-50K",
                      "50-100K", "100-500K", "500K-1M", "> 1M"))
# Plot the bins
plot(bins, main = "Distribution of Account Values",
     xlab = "Account value ($ USD)",
     ylab = "Number of Accounts", col = "blue")

FIGURE 3-21 Histograms are better to show to stakeholders

3.3 Statistical Methods for Evaluation


Visualization is useful for data exploration and presentation, but statistics is crucial because it is applied
throughout the entire Data Analytics Lifecycle. Statistical techniques are used during the initial data explo-
ration and data preparation, model building, evaluation of the final models, and assessment of how the
new models improve the situation when deployed in the field. In particular, statistics can help answer the
following questions for data analytics:

• Model Building and Planning

• What are the best input variables for the model?

• Can the model predict the outcome given the input?
