Machine Learning With R Cookbook - Sample Chapter
Chapter 10, Association Analysis and Sequence Mining, exposes you to the common
methods used to discover associated items and underlying frequent patterns from
transaction data. This chapter is a must read for those of you interested in finding out how
researchers discovered the famous association between customers that purchase beer and
those who purchase diapers.
Chapter 11, Dimension Reduction, teaches you how to select and extract features from
original variables. With this technique, we can remove the effect of redundant features,
reduce the computational cost, and avoid overfitting. For a more concrete example,
this chapter reveals how to compress and restore an image with the dimension reduction
approach.
Chapter 12, Big Data Analysis (R and Hadoop), reveals how you can use RHadoop,
which allows R to leverage the scalability of Hadoop, so as to process and analyze big
data. We cover all the steps, from setting up the RHadoop environment to actual big data
processing and machine learning on big data. Lastly, we explore how to deploy an
RHadoop cluster using Amazon EC2.
Appendix A, Resources for R and Machine Learning, will provide you with all the
resources for R and machine learning.
Appendix B, Dataset Survival of Passengers on the Titanic, shows you the dataset for
survival of passengers on the Titanic.
Practical Machine Learning with R
In this chapter, we will cover the following topics:
Visualizing data
Introduction
The aim of machine learning is to uncover hidden patterns and unknown correlations, and to
find useful information in data. In addition to this, when combined with data analysis,
machine learning can be used to perform predictive analysis. With machine learning, the
analysis of business operations and processes is not limited to human-scale thinking;
machine-scale analysis enables businesses to capture hidden value in big data.
Moreover, you can perform nonlinear dimension reduction to calculate the dissimilarity
of image data, and visualize the clustered graph, as shown in the following screenshot.
All you need to do is follow the recipes provided in this book.
Chapter 1
This chapter serves as an overall introduction to machine learning and R; the first few recipes
introduce how to set up the R environment and integrated development environment, RStudio.
After setting up the environment, the following recipe introduces package installation and
loading. In order to understand how data analysis is practiced using R, the next four recipes
cover data read/write, data manipulation, basic statistics, and data visualization using R. The
last recipe in the chapter lists useful data sources and resources.
Getting ready
If you are new to the R language, you can find a detailed introduction, language history, and
functionality on the official website (http://www.r-project.org/). When you are ready to
download and install R, please access the following link: http://cran.r-project.org/.
How to do it...
Please perform the following steps to download and install R for Windows and Mac users:
1. Go to the R project website, http://www.r-project.org/, and click on the
download R link (http://cran.r-project.org/mirrors.html):
CRAN mirrors
3. Select the correct download link based on your operating system:
As the installation of R differs for Windows and Mac, the steps required to install R for each
OS are provided here.
For Windows users:
1. Click on Download R for Windows, as shown in the following screenshot, and then
click on base:
3. The installation file should be downloaded. Once the download is finished, you can
double-click on the installation file and begin installing R:
5. After successfully completing the installation, a shortcut to the R application will
appear in your Start menu, which will open the R Console:
6. Click on R to open R Console:
As an alternative to downloading a Mac .pkg file to install R, Mac users can also install R
using Homebrew:
1. Download XQuartz-2.X.X.dmg from https://xquartz.macosforge.org/landing/.
2. Double-click on the .dmg file to mount it.
3. Update brew with the following command line:
$ brew update
5. Install gfortran:
$ brew install gfortran
6. Install R:
$ brew install R
For Linux users, there are precompiled binaries for Debian, Red Hat, SUSE, and Ubuntu.
Alternatively, you can install R from source code. Besides downloading precompiled binaries,
you can install R for Linux through a package manager. Here are the installation steps for
CentOS and Ubuntu.
4. Install R through the repository:
$ sudo yum install R
How it works...
CRAN provides precompiled binaries for Linux, Mac OS X, and Windows. For Mac and Windows
users, the installation procedures are straightforward. You can generally follow onscreen
instructions to complete the installation. For Linux users, you can use the package manager
provided for each platform to install R or build R from the source code.
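Whichever installation route you choose, it can help to confirm which R installation a session is actually using. The following is a minimal sketch that relies only on base R; nothing in it is specific to a particular operating system:

```r
# Print the version string of the running R installation.
print(R.version.string)

# R.version is a list; major and minor hold the version components as strings.
cat("Major:", R.version$major, "Minor:", R.version$minor, "\n")

# R.home() returns the directory where this R installation lives.
r_home <- R.home()
print(r_home)
```

If the reported version or home directory is not the one you expected, an older installation is probably earlier on your search path.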
See also
For those planning to build R from the source code, refer to R Installation and
Administration (http://cran.r-project.org/doc/manuals/R-admin.html),
which illustrates how to install R on a variety of platforms.
Getting ready
RStudio requires a working R installation; when RStudio loads, it must be able to locate a
version of R. You must therefore have completed the previous recipe with R installed on your
OS before proceeding to install RStudio.
How to do it...
Perform the following steps to download and install RStudio for Windows and Mac users:
1. Access RStudio's official site by using the following URL:
http://www.rstudio.com/products/RStudio/.
5. Start RStudio:
Perform the following steps for downloading and installing RStudio for Ubuntu/Debian and
RedHat/Centos users:
For Debian(6+)/Ubuntu(10.04+) 32-bit:
$ wget http://download1.rstudio.org/rstudio-0.98.1091-i386.deb
$ sudo gdebi rstudio-0.98.1091-i386.deb
For RedHat/CentOS(5.4+) 32-bit:
$ wget http://download1.rstudio.org/rstudio-0.98.1091-i686.rpm
$ sudo yum install --nogpgcheck rstudio-0.98.1091-i686.rpm
How it works
The RStudio program can be run on the desktop or through a web browser. The desktop
version is available for Windows, Mac OS X, and Linux platforms with similar operations across
all platforms. For Windows and Mac users, after downloading the precompiled package of
RStudio, follow the onscreen instructions, shown in the preceding steps, to complete the
installation. Linux users may use the package management system provided for installation.
See also
In addition to the desktop version, users may install a server version to provide
access to multiple users. The server version provides a URL that users can access
to use the RStudio resources. To install RStudio Server, please refer to the following link:
http://www.rstudio.com/ide/download/server.html. This page provides
installation instructions for the following Linux distributions: Debian (6+), Ubuntu
(10.04+), RedHat, and CentOS (5.4+).
For other Linux distributions, you can build RStudio from the source code.
Getting ready
Start an R session on your host computer.
How to do it...
Perform the following steps to install and load R packages:
1. To list the installed packages:
> library()
R will return a list of CRAN mirrors, and then ask the user to either type a mirror ID to select it,
or enter zero to exit:
1. Install a package from CRAN; take package e1071 as an example:
> install.packages("e1071")
4. If you would like to view the documentation of the package, you can use the help
function:
> help(package ="e1071")
5. If you would like to view the documentation of the function, you can use the help
function:
> help(svm, e1071)
6. Alternatively, you can use the help shortcut, ?, to view the help document for this
function:
> ?e1071::svm
7. If the function does not provide any documentation, you may want to search the
supplied documentation for a given keyword. For example, if you wish to search for
documentation related to svm:
> help.search("svm")
9. To view the arguments taken by a function, simply use the args function. For
example, if you would like to know the arguments taken by the lm function:
> args(lm)
10. Some packages will provide examples and demos; you can use example or demo to
view an example or demo. For example, one can view an example of the lm function
and a demo of the graphics package by typing the following commands:
> example(lm)
> demo(graphics)
11. To view all the available demos, you may use the demo function to list all of them:
> demo()
How it works
This recipe first introduces how to view installed packages, install packages from CRAN, and
load new packages. Before installing packages, those of you who are interested in the listing
of CRAN packages can refer to
http://cran.r-project.org/web/packages/available_packages_by_name.html.
When a package is installed, documentation related to the package is also provided. You are,
therefore, able to view the documentation or the related help pages of installed packages and
functions. Additionally, demos and examples are provided by packages that can help users
understand the capability of the installed package.
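In scripts, it is often convenient to install a package only when it is missing and then load it. The following is a minimal sketch of that pattern; load_or_install is a hypothetical helper name introduced here for illustration, not a function provided by R or by this book:

```r
# Hypothetical helper (not part of base R): load a package, installing it
# from CRAN first only when it is not already available.
load_or_install <- function(pkg, install = TRUE) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Give up quietly when the caller does not want anything installed.
    if (!install) return(FALSE)
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string.
  library(pkg, character.only = TRUE)
  TRUE
}

# "stats" ships with R, so nothing is downloaded in this call.
loaded <- load_or_install("stats")
print(loaded)
```

To apply the same idea to the package used in this recipe, you could call load_or_install("e1071"), which contacts a CRAN mirror only if the package is not yet installed.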
See also
Besides installing packages from CRAN, there are other R package repositories,
including Crantastic, a community site for rating and reviewing CRAN packages,
and R-Forge, a central platform for the collaborative development of R packages. In
addition to this, Bioconductor provides R packages for the analysis of genomic data.
If you would like to find relevant functions and packages, please visit the list of task
views at http://cran.r-project.org/web/views/, or search for keywords at
http://rseek.org.
Getting ready
First, start an R session on your machine. As this recipe involves file I/O, if
the user does not specify the full path, read and write activity will take place in the current
working directory.
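Because relative paths are resolved against the working directory, it is worth checking where that is before reading or writing files. The following sketch uses only base R; the file name example.txt is an arbitrary choice for illustration, and the file is written to a temporary directory so that nothing is left behind:

```r
# Show the directory that relative paths are resolved against.
wd <- getwd()
print(wd)

# Build a path portably instead of pasting separators by hand.
out_path <- file.path(tempdir(), "example.txt")

# Write one line, confirm that the file exists, and then clean up.
writeLines("hello", out_path)
existed <- file.exists(out_path)
print(existed)
unlink(out_path)
```

You can change the working directory with setwd(), although passing full paths (as done here with file.path) avoids depending on it at all.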
How to do it...
Perform the following steps to read and write data with R:
1. To view the built-in datasets of R, type the following command:
> data()
2. R will return a list of datasets in a dataset package, and the list comprises the
name and description of each dataset.
3. To load the dataset iris into an R session, type the following command:
> data(iris)
4. The dataset iris is now loaded into the data frame format, which is a common
data structure in R to store a data table.
5. To view the data type of iris, simply use the class function:
> class(iris)
[1] "data.frame"
6. The data.frame console print shows that the iris dataset is in the structure of
data frame.
7. Use the save function to store an object in a file. For example, to save the loaded iris
data into myData.RData, use the following command:
> save(iris, file="myData.RData")
8. Use the load function to read a saved object into an R session. For example, to load
iris data from myData.RData, use the following command:
> load("myData.RData")
9. In addition to using built-in datasets, R also provides a function to import data from
text into a data frame. For example, the read.table function can format a given
text into a data frame:
> test.data = read.table(header = TRUE, text = "
+ a b
+ 1 2
+ 3 4
+ ")
10. You can also use col.names and row.names to specify the names of columns
and rows:
> test.data = read.table(text = "
+ 1 2
+ 3 4",
+ col.names=c("a","b"),
+ row.names = c("first","second"))
12. The class function shows that the test.data variable contains a data frame.
13. In addition to importing data by using the read.table function, you can use the
write.table function to export data to a text file:
> write.table(test.data, file = "test.txt" , sep = " ")
14. The write.table function will write the content of test.data into test.txt
(the written path can be found by typing getwd()), with a separation delimiter as
white space.
15. Similar to write.table, write.csv can also export data to a file. However,
write.csv uses a comma as the default delimiter:
> write.csv(test.data, file = "test.csv")
16. With the read.csv function, the csv file can be imported as a data frame. However,
the last example writes column and row names of the data frame to the test.csv
file. Therefore, specifying header to TRUE and row names as the first column within
the function can ensure the read data frame will not treat the header and the first
column as values:
> csv.data = read.csv("test.csv", header = TRUE, row.names=1)
> head(csv.data)
a b
1 1 2
2 3 4
How it works
Generally, data for collection may be in multiple files and different formats. To exchange data
between files and RData, R provides many built-in functions, such as save, load, read.csv,
read.table, write.csv, and write.table.
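To see how these functions fit together, the following sketch writes the iris data to a CSV file and to an RData file, reads both back, and checks that nothing was lost. The file names and the use of a temporary directory are assumptions made for this illustration:

```r
data(iris)

# Round trip through CSV. row.names = FALSE keeps row labels out of the file.
csv_path <- file.path(tempdir(), "iris_roundtrip.csv")
write.csv(iris, file = csv_path, row.names = FALSE)

# stringsAsFactors = TRUE restores Species as a factor
# (since R 4.0 the default is FALSE, which would give a character column).
iris_csv <- read.csv(csv_path, stringsAsFactors = TRUE)
csv_ok <- isTRUE(all.equal(iris, iris_csv))
print(csv_ok)

# Round trip through RData. save() stores the object under its own name,
# so load() recreates a variable called iris_copy.
rdata_path <- file.path(tempdir(), "iris_roundtrip.RData")
iris_copy <- iris
save(iris_copy, file = rdata_path)
rm(iris_copy)
load(rdata_path)
rdata_ok <- isTRUE(all.equal(iris, iris_copy))
print(rdata_ok)
```

The CSV route is portable to other tools but loses R-specific types unless you restore them on import, whereas RData preserves the object as it was.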
See also
For the load, read.table, and read.csv functions, the file to be read can also be a
complete URL (for supported URLs, use ?url for more information).
On some occasions, data may be in an Excel file instead of a flat text file. The WriteXLS
package allows writing an object into an Excel file with a given variable in the first argument
and the file to be written in the second argument:
1. Install the WriteXLS package:
> install.packages("WriteXLS")
3. Use the WriteXLS function to write the data frame iris into a file named iris.xls:
> WriteXLS("iris", ExcelFileName="iris.xls")
Getting ready
Ensure you have completed the previous recipes by installing R on your operating system.
How to do it...
Perform the following steps to manipulate the data with R.
Subset the data using the bracket notation:
1. Load the dataset iris into the R session:
> data(iris)
2. To select values, you may use a bracket notation that designates the indices of the
dataset. The first index is for the rows and the second for the columns:
> iris[1,"Sepal.Length"]
[1] 5.1
4. You can then use str() to summarize and display the internal structure of Sepal.iris:
> str(Sepal.iris)
'data.frame':	150 obs. of  2 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
5. To subset data with the rows of given indices, you can specify the indices at the first
index with the bracket notation. In this example, we show you how to subset data
with the top five records with the Sepal.Length column and the Sepal.Width
selected:
> Five.Sepal.iris = iris[1:5, c("Sepal.Length", "Sepal.Width")]
> str(Five.Sepal.iris)
'data.frame':	5 obs. of  2 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6
6. It is also possible to set conditions to filter the data. For example, you can return
only the records of the setosa species, with all five variables. In the following example,
the first index specifies the row-selection criterion, and the second index specifies the
range of column indices to return:
> setosa.data = iris[iris$Species=="setosa",1:5]
> str(setosa.data)
'data.frame':	50 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Alternatively, the which function returns the indexes of satisfied data. The following
example returns indices of the iris data containing species equal to setosa:
> which(iris$Species=="setosa")
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50
8. The indices returned by the operation can then be applied as the index to select the
iris containing the setosa species. The following example returns the setosa with all
five variables:
> setosa.data = iris[which(iris$Species=="setosa"),1:5]
> str(setosa.data)
'data.frame':	50 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
'data.frame':	150 obs. of  2 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
This reveals that Sepal.data contains 150 observations of two variables, Sepal.Length and
Sepal.Width.
1. On the other hand, you can use the subset function to get subset data containing
setosa only. In the second argument of the subset function, you can specify the
subset criteria:
> setosa.data = subset(iris, Species =="setosa")
> str(setosa.data)
'data.frame':	50 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
2. Often, you will want to combine conditions with a union or an intersection while
subsetting data. The OR (|) and AND (&) operators can be employed for this
purpose. For example, if you would like to retrieve data with Petal.Width >= 0.2
and Petal.Length <= 1.4:
> example.data = subset(iris, Petal.Length <= 1.4 & Petal.Width >= 0.2,
+                       select = Species)
> str(example.data)
'data.frame':	21 obs. of  1 variable:
Merging data: merging data involves joining two data frames into a merged data frame by a
common column or row name. The following example shows how to merge the flower.type
data frame and the first three rows of the iris on the common Species column:
> flower.type = data.frame(Species = "setosa", Flower = "iris")
> merge(flower.type, iris[1:3,], by ="Species")
  Species Flower Sepal.Length Sepal.Width Petal.Length Petal.Width
1  setosa   iris          5.1         3.5          1.4         0.2
2  setosa   iris          4.9         3.0          1.4         0.2
3  setosa   iris          4.7         3.2          1.3         0.2
Ordering data: the order function returns the row indices that sort a data frame by a
specified column. The following example shows the first six records of the iris data
ordered by sepal length (from largest to smallest):
> head(iris[order(iris$Sepal.Length, decreasing = TRUE),])
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
132          7.9         3.8          6.4         2.0 virginica
118          7.7         3.8          6.7         2.2 virginica
119          7.7         2.6          6.9         2.3 virginica
123          7.7         2.8          6.7         2.0 virginica
136          7.7         3.0          6.1         2.3 virginica
106          7.6         3.0          6.6         2.1 virginica
How it works
Before conducting data analysis, it is important to organize collected data into a structured
format. Therefore, we can simply use the R data frame to subset, merge, and order a dataset.
This recipe first introduces two methods to subset data: one uses the bracket notation, while
the other uses the subset function. You can use both methods to generate the subset data
by selecting columns and filtering data with the given criteria. The recipe then introduces the
merge function to merge data frames. Last, the recipe introduces how to use order to sort
the data.
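The three operations can be combined in a single short script. The following sketch is one possible arrangement; the species_info lookup table and its Code column are hypothetical values invented for the illustration:

```r
data(iris)

# Bracket notation: a logical row filter plus columns selected by name.
wide_setosa <- iris[iris$Species == "setosa" & iris$Sepal.Width > 3.5,
                    c("Sepal.Length", "Sepal.Width")]

# subset(): the same filter, with columns given unquoted in select.
wide_setosa2 <- subset(iris, Species == "setosa" & Sepal.Width > 3.5,
                       select = c(Sepal.Length, Sepal.Width))
same_rows <- isTRUE(all.equal(wide_setosa, wide_setosa2))
print(same_rows)

# merge(): attach a lookup table on the shared Species column.
# The Code values are made up for this example.
species_info <- data.frame(Species = c("setosa", "versicolor", "virginica"),
                           Code = c("SE", "VE", "VI"))
merged <- merge(iris, species_info, by = "Species")
print(head(merged, 3))

# order(): sort by species, then by petal length from largest to smallest.
sorted <- iris[order(iris$Species, -iris$Petal.Length), ]
print(head(sorted, 3))
```

Negating a numeric column inside order is a common way to mix ascending and descending keys, since the decreasing argument otherwise applies to every key.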
There's more...
The sub and gsub functions allow using regular expressions to substitute strings. The sub
function replaces only the first match, whereas gsub replaces all matches:
> sub("e", "q", names(iris))
[1] "Sqpal.Length" "Sqpal.Width"  "Pqtal.Length" "Pqtal.Width"  "Spqcies"
> gsub("e", "q", names(iris))
[1] "Sqpal.Lqngth" "Sqpal.Width"  "Pqtal.Lqngth" "Pqtal.Width"  "Spqciqs"
Getting ready
Ensure you have completed the previous recipes by installing R on your operating system.
How to do it...
Perform the following steps to apply statistics on a dataset:
1. Load the iris data into an R session:
> data(iris)
2. Observe the format of the data:
> class(iris)
[1] "data.frame"
3. The iris dataset is a data frame containing four numeric attributes: petal length,
petal width, sepal width, and sepal length. For numeric values, you can
perform descriptive statistics, such as mean, sd, var, min, max, median, range,
and quantile. These can be applied to any of the four attributes in the dataset:
> mean(iris$Sepal.Length)
[1] 5.843333
> sd(iris$Sepal.Length)
[1] 0.8280661
> var(iris$Sepal.Length)
[1] 0.6856935
> min(iris$Sepal.Length)
[1] 4.3
> max(iris$Sepal.Length)
[1] 7.9
> median(iris$Sepal.Length)
[1] 5.8
> range(iris$Sepal.Length)
[1] 4.3 7.9
> quantile(iris$Sepal.Length)
  0%  25%  50%  75% 100%
 4.3  5.1  5.8  6.4  7.9
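4. To apply a statistic to each of the four numeric attributes at once, you can use the
sapply function:
> sapply(iris[1:4], mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    5.843333     3.057333     3.758000     1.199333
5. Alternatively, the summary function reports a range of descriptive statistics for
every column:
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500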
6. The preceding example shows how to output the descriptive statistics of a single
variable. R also provides the correlation for users to investigate the relationship
between variables. The following example generates a 4x4 matrix by computing the
correlation of each attribute pair within the iris:
> cor(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
7. R also provides a function to compute the covariance of each attribute pair within
the iris:
> cov(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
8. Statistical tests are performed to assess the significance of the results; here we
demonstrate how to use a t-test to determine the statistical difference between
two samples. In this example, we perform a t.test on the petal width of irises in
either the setosa or versicolor species. If we obtain a p-value less than 0.05, we can
conclude that the mean petal widths of setosa and versicolor differ significantly:
> t.test(iris$Petal.Width[iris$Species=="setosa"],
+        iris$Petal.Width[iris$Species=="versicolor"])
data:  iris$Petal.Width[iris$Species == "setosa"] and iris$Petal.Width[iris$Species == "versicolor"]
t = -34.0803, df = 74.755, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.143133 -1.016867
sample estimates:
mean of x mean of y
    0.246     1.326
9. Alternatively, you can perform a correlation test on the sepal length to the sepal
width of an iris, and then retrieve a correlation score between the two variables.
The stronger the positive correlation, the closer the value is to 1. The stronger the
negative correlation, the closer the value is to -1:
> cor.test(iris$Sepal.Length, iris$Sepal.Width)
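data:  iris$Sepal.Length and iris$Sepal.Width
t = -1.4403, df = 148, p-value = 0.1519
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.27269325  0.04351158
sample estimates:
       cor
-0.1175698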
How it works...
R has built-in statistical functions, which enable the user to perform descriptive statistics
on a single variable. The recipe first introduces how to apply mean, sd, var, min, max,
median, range, and quantile on a single variable. Moreover, in order to apply the statistics
on all four numeric variables, one can use the sapply function. In order to determine the
relationships between multiple variables, one can compute the correlation and covariance.
Finally, the recipe shows how to determine the statistical differences of two given samples by
performing a statistical test.
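The pieces described above can also be used programmatically rather than read off the console. The following sketch computes several statistics per attribute with sapply and extracts the p-value from the object returned by t.test; the 0.05 significance level is a conventional choice assumed here, not something fixed by R:

```r
data(iris)

# sapply over the numeric columns; each call returns a named vector,
# so the result is a matrix with one row per statistic.
stats <- sapply(iris[, 1:4], function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x))
})
print(round(stats, 3))

# t.test returns a list-like object of class "htest"; p.value is one element.
setosa <- iris$Petal.Width[iris$Species == "setosa"]
versicolor <- iris$Petal.Width[iris$Species == "versicolor"]
result <- t.test(setosa, versicolor)

alpha <- 0.05  # assumed significance level
significant <- result$p.value < alpha
print(significant)
```

Working with result$p.value and result$conf.int directly is handy when you need to run the same test over many pairs of groups.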
There's more...
If you need to compute aggregated summary statistics for data in different groups,
you can use the aggregate and reshape functions to compute the summary statistics of
data subsets:
1. Use aggregate to calculate the mean of each iris attribute, grouped by species:
> aggregate(x=iris[,1:4],by=list(iris$Species),FUN=mean)
2. Use reshape to calculate the mean of each iris attribute, grouped by species:
> library(reshape)
> iris.melt <- melt(iris, id='Species')
> cast(Species~variable, data=iris.melt, mean,
+      subset=Species %in% c('setosa','versicolor'),
+      margins='grand_row')
For information on reshape and aggregate, refer to the help documents by using ?reshape
or ?aggregate.
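If you would rather not install an extra package, base R alone can produce the same grouped means. The following sketch shows two alternatives, the formula interface of aggregate and the tapply function; it is offered as an illustration rather than as the approach used in this recipe:

```r
data(iris)

# Formula interface: "." stands for every column other than Species.
group_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(group_means)

# tapply handles a single column: values, grouping factor, function.
petal_means <- tapply(iris$Petal.Width, iris$Species, mean)
print(petal_means)
```

The formula interface returns a data frame with the grouping column named Species, which is often easier to merge or plot than the Group.1 column produced by the by = list(...) form.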
Visualizing data
Visualization is a powerful way to communicate information through graphical means. Visual
presentations make data easier to comprehend. This recipe presents some basic functions
to plot charts, and demonstrates how visualizations are helpful in data exploration.
Getting ready
Ensure that you have completed the previous recipes by installing R on your operating system.
How to do it...
Perform the following steps to visualize a dataset:
1. Load the iris data into the R session:
> data(iris)
2. Calculate the frequency of species within the iris using the table command:
> table.iris = table(iris$Species)
> table.iris
    setosa versicolor  virginica
        50         50         50
3. As the frequency in the table shows, each species represents 1/3 of the iris data. We
can draw a simple pie chart to represent the distribution of species within the iris:
> pie(table.iris)
4. A histogram groups values into bins along the x-axis and plots the frequency of each
bin. The following example produces a histogram of the sepal length:
> hist(iris$Sepal.Length)
5. In the histogram, the x-axis presents the sepal length and the y-axis presents the
count for different sepal lengths. The histogram shows that for most irises, sepal
lengths range from 4 cm to 8 cm.
7. The preceding screenshot clearly shows that the median and upper range of the petal
width of setosa are much smaller than those of versicolor and virginica. Therefore, the
petal width can be used as a substantial attribute to distinguish iris species.
8. A scatter plot is used when there are two variables to plot against one another. This
example plots the petal length against the petal width and colors the dots according
to the species they belong to:
> plot(x=iris$Petal.Length, y=iris$Petal.Width, col=iris$Species)
9. The preceding screenshot is a scatter plot of the petal length against the petal width.
As there are four attributes within the iris dataset, it takes six operations to plot all
combinations. However, R provides a function named pairs, which can generate
each subplot in one figure:
> pairs(iris[1:4], main = "Edgar Anderson's Iris Data", pch = 21,
bg = c("red", "green3", "blue")[unclass(iris$Species)])
How it works...
R provides many built-in plot functions, which enable users to visualize data with different kinds
of plots. This recipe demonstrates the use of pie charts that can present category distribution. A
pie chart with equal-sized slices shows that the number of each species is equal. A histogram plots the
frequency of different sepal lengths. A box plot can convey a great deal of descriptive statistics,
and shows that the petal width can be used to distinguish an iris species. Lastly, we introduced
scatter plots, which plot variables on a single plot. In order to quickly generate a scatter plot
containing all the pairs of iris data, one may use the pairs command.
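As a companion to the box plot mentioned above, the following sketch draws the petal width by species and saves the figure to a file instead of the screen. The output file name, the use of the pdf device, and the titles are assumptions chosen for this illustration:

```r
data(iris)

# Write the plot to a PDF in a temporary directory; pdf() is available
# in every standard R build, so no display is needed.
plot_path <- file.path(tempdir(), "petal_width_boxplot.pdf")
pdf(plot_path, width = 6, height = 4)

# The formula Petal.Width ~ Species draws one box per species.
# boxplot() also returns the statistics it plotted (invisibly).
bp <- boxplot(Petal.Width ~ Species, data = iris,
              main = "Petal width by species",
              xlab = "Species", ylab = "Petal width (cm)")

# Close the device so that the file is actually written.
dev.off()

# Row 3 of bp$stats holds the median of each box.
print(bp$stats[3, ])
```

Calling dev.off() is easy to forget; until the device is closed, the file may be empty or incomplete.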
See also
Getting ready
Ensure that you have completed the previous recipes by installing R on your operating system.
How to do it...
Perform the following steps to retrieve data for machine learning:
1. Access the UCI machine learning repository: http://archive.ics.uci.edu/ml/.
2. Click on View ALL Data Sets. Here you will find a list of datasets containing field
names, such as Name, Data Types, Default Task, Attribute Types, # Instances, #
Attributes, and Year:
5. Click on Data Folder, which will display a directory containing the iris dataset:
6. You can then either download iris.data or use the read.csv function to read the
dataset:
> iris.data = read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"),
+   header = FALSE,
+   col.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"))
> head(iris.data)
  Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
1          5.1         3.5          1.4         0.2 Iris-setosa
2          4.9         3.0          1.4         0.2 Iris-setosa
3          4.7         3.2          1.3         0.2 Iris-setosa
4          4.6         3.1          1.5         0.2 Iris-setosa
5          5.0         3.6          1.4         0.2 Iris-setosa
6          5.4         3.9          1.7         0.4 Iris-setosa
How it works...
Before conducting data analysis, it is important to collect your dataset. However, to collect
an appropriate dataset for further exploration and analysis is not easy. We can, therefore,
use the prepared dataset with the UCI repository as our data source. Here, we first access
the UCI dataset repository and then use the iris dataset as an example. We can find the iris
dataset by using the browser's find function (Ctrl + F), and then enter the file directory. Last,
we can download the dataset and use the R IO function, read.csv, to load the iris dataset
into an R session.
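If the repository is unreachable, you can still practice the same import on a few lines pasted into the session. The following sketch assumes three sample rows in the iris.data layout (comma-separated, no header, species prefixed with Iris-) and also shows one way to strip that prefix so that the labels match R's built-in iris dataset:

```r
# Three sample rows in the headerless, comma-separated iris.data layout.
raw_lines <- "5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica"

# text = reads from a string instead of a file or URL.
sample.data <- read.csv(text = raw_lines, header = FALSE,
                        col.names = c("Sepal.Length", "Sepal.Width",
                                      "Petal.Length", "Petal.Width",
                                      "Species"),
                        stringsAsFactors = TRUE)

# Remove the "Iris-" prefix from the factor levels.
levels(sample.data$Species) <- sub("^Iris-", "", levels(sample.data$Species))
str(sample.data)
```

The same header and col.names arguments apply unchanged when the first argument is a file path or a url() connection.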
See also