R for Programmers
Mastering the Tools
Dan Zhang
Published with arrangement with the original publisher, Beijing Huazhang Graphics and Information Company.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2016 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the
validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the
copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to
publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let
us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including
photocopying, microfilming, and recording, or in any information storage or retrieval system, without written
permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers,
MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of
users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has
been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
SECTION I BASICS OF R
1 Basic R Packages
1.1 R Is the Most Worthwhile Programming Language to Learn
1.1.1 Experience with Java
1.1.2 Why Choose R?
1.1.2.1 Origin of R
1.1.2.2 Development of R
1.1.2.3 Communities and Resources of R
1.1.2.4 Philosophy of R
1.1.2.5 Users of R
1.1.2.6 Syntax of R
1.1.2.7 Thinking Patterns of R
1.1.2.8 Problems to Be Solved by R
1.1.2.9 Shortcomings of R
1.1.3 Application Prospects of R
1.1.4 Missions Assigned to R by the New Era
1.2 Installation of Different Versions of R
1.2.1 Installation of R in Windows
1.2.2 Installation of R in Linux Ubuntu
1.2.3 Installation of Latest Version of R
1.2.4 Installation of Certain Versions of R
1.2.4.1 Installation of Version 2.15.3 of R
1.2.4.2 Installation of Version 3.0.1 of R
1.3 fortunes: Records the Wisdom of R
1.3.1 Introduction to fortunes
1.3.2 Installation of fortunes
1.3.3 Use of fortunes
1.4 Using formatR to Format Codes Automatically
1.4.1 Introduction to formatR
1.4.2 Installation of formatR
1.4.3 Use of formatR
1.4.3.1 tidy.source: To Format Codes by Inputting a Character String
1.4.3.2 tidy.source: To Format Codes by Inputting Files
1.4.3.3 Formatting and Outputting R Script Files
1.4.3.4 tidy.eval: Output Formatted R Codes and the Run Results
1.4.3.5 usage: Definition of Format Function and Output with Specific Width
1.4.3.6 tidy.gui: A GUI Tool Used to Edit and Format R Codes
1.4.3.7 tidy.dir: Format All the R Scripts in Directory dir
1.4.4 Source Code Analysis of formatR
1.4.5 Bugs in the Source Code
1.5 Multiuser Online Collaboration of R Development: RStudio Server
1.5.1 RStudio and RStudio Server
1.5.2 Installation of RStudio Server
1.5.3 Use of RStudio Server
1.5.3.1 System Configuration of RStudio Server
1.5.3.2 System Management of RStudio Server
1.5.4 Multiuser Collaboration of RStudio Server
1.5.4.1 Add New Users and New User Groups
1.5.4.2 Share of Git Codes
1.6 Foolproof Programming of R and JSON
1.6.1 Introduction to rjson
1.6.1.1 Install and Load rjson
1.6.1.2 Call Function: fromJSON(): from JSON to R
1.6.1.3 toJSON(): from R to JSON
1.6.1.4 Converting between the C Library and R Library, and Performance Test
1.6.2 Introduction to RJSONIO
1.6.2.1 Install and Load RJSONIO
1.6.2.2 fromJSON(): from JSON to R
1.6.2.3 toJSON: from R to JSON
1.6.2.4 isValidJSON(): Check Whether JSON Is Valid
1.6.2.5 asJSVars(): Convert into a Variable Format of JavaScript
1.6.3 Implementation of Customized JSON
1.6.4 Performance Comparison of JSON
1.6.4.1 Serialization Test (toJSON) on a Large Object between rjson and RJSONIO
1.6.4.2 Serialization Test (toJSON) between Output by Lines and Output by Columns in RJSONIO
1.7 High-Quality Graphic Rendering Library of R Cairo
1.7.1 Introduction to Cairo
1.7.2 Installation of Cairo
1.7.3 Use of Cairo
1.7.3.1 Scatterplot
1.7.3.2 Three-Dimensional Cross-Sectional View
1.7.3.3 Graphic with Mass Text in It
1.8 A Peculiar Tool Set: caTools
1.8.1 Introduction to caTools
SECTION II R SERVER
4 Cross-Platform Communication of R
4.1 Cross-Platform Communication between Rserve and Java
4.1.1 Installation of Rserve
4.1.2 Remote Connection between Rserve and Java
4.1.2.1 Download JAR Package of Java Client
4.1.2.2 Create Java Project in Eclipse
4.1.2.3 Implementation of Java Programming
4.2 Rsession Makes It Easier for Java to Call R
4.2.1 Download of Rsession
4.2.1.1 Download the Distribution Directly
4.2.1.2 Download the Source Code and Compile Distribution
4.2.2 Construct Rsession Projects Using Eclipse
4.2.3 API Introduction of Rsession
4.2.4 Use of Rsession
4.2.4.1 Server Environment of Rserve
4.2.4.2 Java Code
4.2.4.3 Run Journal Output
7 RHadoop
7.1 R Has Injected Statistical Elements into Hadoop
7.1.1 Introduction to Hadoop
7.1.2 Why Should We Combine R with Hadoop?
7.1.2.1 Why Should We Combine R with Hadoop When the Hadoop Family Is Already So Powerful?
7.1.2.2 Mahout Can Also Perform Data Mining and Machine Learning. So What’s the Difference between R and Mahout?
7.1.3 How to Combine Hadoop with R
7.1.3.1 RHadoop
7.1.3.2 RHive
7.1.3.3 Rewrite Mahout
7.1.3.4 Use Hadoop to Call R
7.1.3.5 R and Hadoop in Real Practice
7.1.4 Outlook for the Future
7.2 Installation and Use of RHadoop
7.2.1 Environment Preparation
7.2.2 Installation of RHadoop
7.2.2.1 Download the Three Relevant Packages of RHadoop
7.2.2.2 For the Installation of RHadoop, the Root Authority Operation Is Recommended
7.2.2.3 Installation of the Dependent Library
7.2.2.4 Install the rhdfs Library
7.2.2.5 Install the rmr Library
7.2.2.6 Install the rHBase Library
7.2.2.7 List All the Packages of RHadoop
7.2.3 Program Development of RHadoop
7.2.3.1 Basic Operations of rhdfs
7.2.3.2 Task of the rmr Algorithm
7.2.3.3 Wordcount Task of the rmr Algorithm
7.3 RHadoop Experiment: Count the Times of the Appearance of Certain E-Mail Addresses
7.3.1 Demand Description
7.3.2 Algorithm Implementation
7.3.2.1 Calculate How Many Times the E-Mail Addresses Appear
7.3.2.2 Sort the E-Mail Addresses by Their Times of Appearance
7.4 Implement the Collaborative Filtering Algorithm by RHadoop Based on MapReduce
7.4.1 Introduction to the Collaborative Filtering Algorithm Based on Item Recommendation
7.4.2 Implementation of a Local Program of R
7.4.2.1 Create a Co-Occurrence Matrix of Items
7.4.2.2 Create a Rating Matrix of Users on Items
7.4.2.3 Calculate the Recommendation Result through a Matrix
7.4.3 Implement a Distributed Program by R Based on Hadoop
7.4.3.1 Create a Co-Occurrence Matrix of Items
SECTION IV APPENDIXES
Appendix A: Installation of a Java Environment
Appendix B: Installation of MySQL
Appendix C: Installation of Redis
Appendix D: Installation of MongoDB
Appendix E: Installation of Cassandra
Appendix F: Installation of Hadoop
Appendix G: Installation of the Hive Environment
Appendix H: Installation of HBase
Bibliography
Preface
solutions to these problems. However, we already have mature and effective solutions in the
computer industry.
All of the content of this book comes from my own experience of using R at work, so it is best
read as a real record of my work with R. The content covers computing, the Internet, databases,
big data, statistics, and finance. This book is a detailed summary of solutions that combine R
with Java, MySQL, Redis, MongoDB, Cassandra, Hadoop, Hive, HBase, and so forth, with an
emphasis on practice and operability. If you are a beginner in R, this book may help you
appreciate the charm of R across industries and fields. If you have used R for a while in a
certain industry, this book may show you how much vitality R gains when combined with other
computer languages and help you break through a bottleneck. If you work in a technical area,
the case implementations in this book, presented from a holistic view, may bring you new
insight, or even help you replan your career and find a new direction for studying and
striving, just as they did for me. If you are a middle or senior manager of a corporation, you
will find many technical achievements in this book. You may even apply these achievements
directly in a corporate environment and profit from the detailed operation records in this book.
Here I should point out that this book is not an introduction or guide, which means there will
be no syntax explanation of R in this book. You may have picked the wrong book if you wish to
gain basic and introductory knowledge of R. However, if you are familiar with the basics of R but
lack a computer language background, this book will tell you what R can do in a real environment
and how to implement the applications step by step.
After discussing with many beginners in R from different fields, I find that the greatest prob-
lem in learning to use R is how to manage to use its numerous software packages. There are few
books and only some pamphlets on the Internet covering this problem. This book covers more
than 30 packages of R with my experiences in practice and case analysis. I believe this will help us
solve problems using the packages of R.
This book is the first one of the series A Geek Ideal of R. Its companion piece, R for Programmers:
Advanced Techniques, will give an in-depth introduction to the underlying principle of R and how
to develop enterprise applications with R.
The environment involved in this book includes two operating systems, Linux Ubuntu® and
Windows® 7 and the 2.15.3 version and 3.0.1 version of R, which are specifically identified in each
section.
R is constantly advancing and undergoing updates, and it will lead a revolution of data eventu-
ally. Interdisciplinary integration is the trend of this development, which is also a great opportu-
nity for us.
Tong Tong, a fan of R language like me, is currently working in an import and export bank in
China. He specializes in RMB trade, which makes him an expert with profound knowledge and
understanding of the internationalization of the RMB. Tong Tong graduated from China Foreign
Affairs University, where he majored in international economics and trade. He also has a back-
ground in statistics and finance.
Tong Tong always has the passion to share his practical experience and combine his daily work
with the R language. I had the privilege of meeting him in a financial forum and I communicated
a lot with him about the use of R language. Despite the fact that Tong Tong does not have a pure
computer background, his passion for R language also made him discover a lot of unique ideas
in the past. As a self-learner, he did encounter many problems during the practice and he finally
found solutions through the book R for Programmers: Mastering the Tools.
Hereby, I am honored to invite Tong Tong to help me spread knowledge about R to many
readers with diverse mother tongues, and we also hope that it is a chance to share with the world
how we use the R language in China.
Once again, thank you, Tong Tong!
Acknowledgments
I wish to thank my teammates Lin Weilin, Lin Weiping, and Deng Yishuo. Thanks to R for
bringing us together. I thank He Ruijun, acquiring editor at CRC Press, who helped promote the
publication of this book. I also thank the translator, Tong Tong, for the translation work on this
book. I give special thanks to my parents and my wife for their support of my work and their care.
This book is dedicated to my dearest family and the fans of R.
SECTION I BASICS OF R
Chapter 1
Basic R Packages
This chapter first presents reasons for learning to use R. It then covers installation, development
tools, and a few commonly used R packages to help readers quickly become acquainted with R
and to stimulate their interest in learning the language.
Among the five programming languages—Node, Lua, Python, Ruby, and R—which one will
have the best application prospects in 2014 in China?
My choice is R, and I think R will be a star programming language not only in 2014 but also
for a long time in the future. This book therefore begins with a discussion of why R is the most
worthwhile programming language to learn.
◾ Origin of R
◾ Development of R
◾ Communities and resources of R
◾ Philosophy of R
◾ Users of R
◾ Syntax of R
◾ Thinking pattern of R
◾ Problems to be solved by R
◾ Shortcomings of R
1.1.2.1 Origin of R
In 1992, Ross Ihaka and Robert Gentleman, two statisticians from the University of Auckland in
New Zealand, invented a new programming language to teach elementary statistics courses more
conveniently. As both statisticians have R as their first initial, R was adopted as the name of this
newly invented programming language.
Since learning to use R, I have taken an interdisciplinary approach to knowledge. Statistics is
based on probability, and probability is based on mathematics. R rests on the premise that we are
trying to solve practical statistical questions through programming. How well we can solve those
problems depends on how our knowledge of these subjects intersects. R's general-purpose
statistical functions make it a distinctive programming language.
1.1.2.2 Development of R
For a long time, R was a niche language, used initially only by statisticians who wished to
replace SAS (Statistical Analysis System). As the concept of big data became more widespread, R
was finally discovered by industry. Subsequently, more and more people with engineering
backgrounds joined the community and made many improvements and upgrades to R's computing
engine, performance, and packages, which led to the rebirth of R as a powerful new language.
The R used today has come much closer to the standard of industrial software. Driven by
engineers as well as statisticians, R has grown much more rapidly. As the demand for data
analysis continues to grow, R will develop even faster and become a synonym for free and open
data analysis software.
1.1.2.4 Philosophy of R
Every programming language has its own design concept and philosophy. In my experience,
the philosophy of R is to get down to work.
We do not need to write long programs or design elaborate structures when using R. We can
build a complex statistical model with just a function call and a few parameters. The question is
which model and which parameters to choose, rather than how to program.
Using R, we turn a mathematical formula into a statistical model, and we may also consider
how to make the result of a classifier more accurate. But we will not need to think about time and
space complexity when applying R.
The philosophy of R is to transfer your knowledge of mathematics and statistics into computing
models, and that follows from the origin of R.
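As a concrete illustration of building a model with a single function call, here is a minimal sketch that uses the built-in iris data set and base R's lm(). The choice of variables and the names fit, new_flower, and pred are assumptions made for illustration and are not taken from the book.

```r
# Fit a simple linear regression with one function call.
# The formula (Sepal.Length explained by Petal.Length) is an illustrative choice.
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)

# Inspect the estimated intercept and slope.
print(coef(fit))

# Predict for a new observation. Only the model and its inputs are chosen;
# no fitting algorithm has to be written by hand.
new_flower <- data.frame(Petal.Length = 4.0)
pred <- predict(fit, newdata = new_flower)
print(pred)
```

Switching to a different model is mostly a matter of calling a different modeling function with different arguments, which is the point the text makes about choosing models rather than programming them.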
1.1.2.5 Users of R
As noted previously, at first R was used only by some statisticians in academia, and then it became
widely adopted by scholars in many other fields. The applications of R can cover many fields
including statistical analysis, applied mathematics, financial analysis, economic analysis, social
sciences, data mining, artificial intelligence, bioinformatics, biopharmaceuticals, global geograph-
ical science, data visualization, and so forth.
The big data revolution triggered by the Internet in recent years has led many industry experts
to learn and use R. With this expansion of interest, R has gradually met the demands of
industrial use and developed across all of these fields.
The following are some R packages and products that help promote the development of R in industry.
◾ RHadoop, from Revolution Analytics, which allows R to call the cluster resources of
Hadoop
◾ RStudio, from RStudio, which gives us a new understanding of what editing software can be
◾ RMySQL, ROracle, and RJDBC, which create channels for R to access databases
◾ rmongodb, rredis, RHive, rhbase, and RCassandra, which create channels for R to access
NoSQL stores
◾ Rmpi and snow, which make parallel computing possible on a single multicore machine
◾ Rserve and rwebsocket, which allow R to exchange data across platforms
1.1.2.6 Syntax of R
Similar to Python, R is an object-oriented programming language, but the syntax of R is much freer.
The names of many R functions are quite arbitrary, which may be part of the philosophy of R.
A programmer with a foundation in programming languages other than R may possibly feel
disconcerted when he or she sees assignment syntax as follows.
> a<-c(1,2,3,4)->b
> a
[1] 1 2 3 4
> b
[1] 1 2 3 4
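The transcript above chains a leftward and a rightward assignment in one statement. The following minimal sketch, using only base R, shows that both names end up bound to the same vector and lists the other common assignment forms. The names d and e are hypothetical and only serve the illustration.

```r
# Chained assignment: both a and b are bound to the same vector.
a <- c(1, 2, 3, 4) -> b
print(identical(a, b))

# Other ways to bind a value to a name.
d = c(1, 2, 3, 4)          # "=" also assigns at the top level
assign("e", c(1, 2, 3, 4)) # assign() takes the name as a string

print(identical(a, d))
print(identical(a, e))
```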
> rnorm(10)
[1] -0.694541401 1.877780959 -0.178608091 0.004362026
[5] 0.836891967 1.794961298 0.115284187 0.155175219
[9] 0.464028612 -0.842569561
We could achieve a good visualization effect using R to draw a scatter diagram of the iris data
set.
[Figure: scatterplot matrix of the iris data set, with panels for Sepal.Width, Petal.Length, Petal.Width, and Species]
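A scatterplot matrix of this kind can be produced with a single call. The sketch below is an assumption about how such a figure might be drawn, not the book's own listing. It draws to a null PDF device so that it also runs without a display, and the coloring by species is an illustrative choice.

```r
# Open a PDF device that writes no file, so the sketch runs headless.
# In an interactive session this line and dev.off() can be dropped.
pdf(NULL)

# plot() on a data frame draws a scatterplot matrix of all columns.
# Points are colored by species (an illustrative choice).
species_colors <- as.integer(iris$Species)
plot(iris, col = species_colors)

dev.off()
```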
An advantage of the thinking patterns of R is that we can simply analyze the data we need to
cope with. We do not need to go through a role transition from programmer to product
manager and consider what functions are involved, not to mention the program design.
Becoming free of the thinking patterns of programmers will help us learn more and find more
suitable positions for ourselves.
1.1.2.9 Shortcomings of R
Although R has many merits, as discussed earlier, it does have some shortcomings.
◾ Statistics analysis: including statistical distribution, hypothesis testing, and statistical modeling
◾ Financial analysis: quantitative strategies, investment portfolio, risk control, time series, and
volatility
◾ Data mining: data mining algorithms, data modeling, machine learning
◾ Internet: recommender system, consumption prediction, social network
I have written many articles on my blog concerning the application of R, covering all the
aforementioned fields except for biology. The broad application prospects of R make it highly
capable of creating value in the new era.
R has progressed to version 3.2.2, but some third-party R packages, such as RHadoop and RHive,
still have not been upgraded beyond version 2.15. Thus specific versions of R are needed if we
are to use these packages.
This is quite simple on Windows®: all we have to do is run the installer (exe file) for each version.
But those who are not familiar with Linux may have difficulty with the installation process. This
section therefore describes the installation of specific versions of R on Windows and Linux Ubuntu®.
We can see that the default version of R is 2.14.1, which is different from the version used in
this book. So next we wish to install the latest version of R.
# Re-install R packages.
~ sudo apt-get install r-base-core
Here we have installed the different versions of R conveniently to meet the different application
demands.
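Once several versions are installed, it can help to confirm from inside R which one is actually running. The following is a minimal sketch using base R only. The target version string is one of the versions named in this book, and the variable names are illustrative.

```r
# Human-readable version banner of the running interpreter.
print(R.version.string)

# getRversion() returns a version object that compares correctly with strings.
current <- getRversion()
target <- "3.0.1"  # one of the versions used in this book; adjust as needed

is_target <- current == target
at_least_target <- current >= target

if (!is_target) {
  message("Running R ", as.character(current), ", not ", target)
}
```

Comparing version objects rather than raw strings avoids mistakes such as treating "2.15.3" as newer than "2.9.0".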
The popularity of big data throughout the world has led to the gradual adoption of R. But because
the R community has existed for many years, many of us might have little idea of the wisdom of
R in its long history. Fortunately, someone is secretly recording the wisdom of R.
Installation of fortunes:
# Launch R.
~ R
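After launching R, the package can be installed from CRAN and queried. The sketch below assumes a reachable CRAN mirror for the installation step, which is left as a comment so the block runs offline. The quote number and the search term are arbitrary illustrative choices.

```r
# Install once from CRAN if needed (assumes network access):
# install.packages("fortunes")

has_fortunes <- requireNamespace("fortunes", quietly = TRUE)

if (has_fortunes) {
  # A random quote from the collection.
  print(fortunes::fortune())
  # A specific quote by number (108 is an arbitrary example).
  print(fortunes::fortune(108))
  # Quotes matching a search string (the term is an arbitrary example).
  print(fortunes::fortune("memory"))
} else {
  message("fortunes is not installed; run install.packages(\"fortunes\") first")
}
```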
Most beginners focus only on how to implement a given function when they write code and
ignore the importance of coding standards. Such code will be unpopular not only with those
who need to read it, but also with the writers themselves when they review it after a few
months.
The most painful thing for a programmer is not working overtime every day to write code, but
working overtime every day to try to understand programs written by others.
At first, most programmers did not consider how to make it easier for others to
understand their code. Eventually someone started to set coding standards because he could no
longer tolerate ugly code. Then someone invented tools that could format code
automatically. formatR is such a tool for automatically formatting R code.
◾ Windows 7 64bit
◾ R: 3.0.1 x86_64-w64-mingw32/x64 64bit
Installation of formatR
# Launch R.
~ R
# Load formatR.
library(formatR)
if(TRUE){
x=1 # inline comments
}else{
x=2;print('Oh no... ask the right bracket to go away!')}
1*3 # one space before this comment will become two!
2+2+2 # 'short comments'
## here is a long long long long long long long long long long long long long long long long long long long long comment
> tidy.source(messy)
# a single line of comments is preserved
1 + 1
if (TRUE) {
x = 1 # inline comments
} else {
x = 2
print("Oh no... ask the right bracket to go away!")
}
1 * 3 # one space before this comment will become two!
2 + 2 + 2 # 'short comments'
## here is a long long long long long long long long long long long long long long long long
## long long long long comment
The formatted output has had its spacing, indentation, line breaks, and comments normalized,
which makes the code more readable.
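The same kind of formatting can be tried on a short character vector. This is a sketch under two assumptions: that formatR is installed, and that a current release is used, where the function is named tidy_source() (the dot-separated tidy.source used in this chapter is the older name). The sample input messy_code is made up for illustration.

```r
# A small piece of badly formatted but valid R code (illustrative input).
messy_code <- c(
  "if(TRUE){",
  "x=1 # inline comments",
  "}else{",
  "x=2;print('done')}"
)

# Confirm the input parses before handing it to a formatter.
parsed <- parse(text = messy_code)

has_formatR <- requireNamespace("formatR", quietly = TRUE)
if (has_formatR) {
  # output = FALSE suppresses printing; the formatted lines are returned
  # in the text.tidy element of the result.
  res <- formatR::tidy_source(text = messy_code, output = FALSE)
  cat(res$text.tidy, sep = "\n")
} else {
  message("formatR is not installed; run install.packages(\"formatR\") first")
}
```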
~ vi demo.r
a<-1+1;a;matrix(rnorm(10),5);
if(a>2) {b=c('11',832);"#a>2";} else print('a is invalid!!')
Format demo.r:
> x = "demo.r"
> tidy.source(x)
a <- 1 + 1
a
matrix(rnorm(10), 5)
if (a > 2) {
b = c("11", 832)
"#a>2"
} else print("a is invalid!!")
> f="demo2.r"
> tidy.source(x, keep.blank.line = TRUE, file = f)
> file.show(f)
matrix(rnorm(10), 5)
## [,1] [,2]
## [1,] 0.65050729 0.1725221
## [2,] 0.05174598 0.3434398
## [3,] -0.91056310 0.1138733
## [4,] 0.18131010 -0.7286614
## [5,] 0.40811952 1.8288346
> var
function (x, y = NULL, na.rm = FALSE, use)
{
if (missing(use))
use <- if (na.rm)
"na.or.complete"
else "everything"
na.method <- pmatch(use, c("all.obs", "complete.obs", "pairwise.complete.obs",
"everything", "na.or.complete"))
if (is.na(na.method))
stop("invalid 'use' argument")
if (is.data.frame(x))
x <- as.matrix(x)
else stopifnot(is.atomic(x))
if (is.data.frame(y))
y <- as.matrix(y)
else stopifnot(is.atomic(y))
.Call(C_cov, x, y, na.method, FALSE)
}
<bytecode: 0x0000000008fad030>
<environment: namespace:stats>
Sometimes the definition of a function can be very long, as with the function lm. We can
control the display width of a function through the width parameter of usage.
> usage(lm)
lm(formula, data, subset, weights, na.action, method = "qr", model =
TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts
= NULL, offset,...)
> install.packages("gWidgetsRGtk2")
also installing the dependencies 'RGtk2', 'gWidgets'
trying URL
'http://mirror.bjtu.edu.cn/cran/bin/windows/contrib/3.0/RGtk2_2.20.25.zip'
Content type 'application/zip' length 13646817 bytes (13.0 Mb)
opened URL
downloaded 13.0 Mb
trying URL
'http://mirror.bjtu.edu.cn/cran/bin/windows/contrib/3.0/gWidgets_0.0-52.zip'
Content type 'application/zip' length 1212449 bytes (1.2 Mb)
opened URL
downloaded 1.2 Mb
trying URL
'http://mirror.bjtu.edu.cn/cran/bin/windows/contrib/3.0/gWidgetsRGtk2_0.0-82.zip'
Content type 'application/zip' length 787592 bytes (769 Kb)
opened URL
downloaded 769 Kb
> library("gWidgetsRGtk2")
> g = tidy.gui()
~ vi dir2.r
if(a>2) {b=c('11',832);"#a>2";} else print('a is invalid!!')
Basic R Packages ◾ 21
Run tidy.dir:
> tidy.dir(path="dir")
tidying dir/dir.r
tidying dir/dir2.r
~ vi dir.r
a <- 1 + 1
a
matrix(rnorm(10), 5)
~ vi dir2.r
if (a > 2) {
b = c("11", 832)
"#a>2"
} else print("a is invalid!!")
tidy.source = function(
source = 'clipboard', keep.comment = getOption('keep.comment', TRUE),
keep.blank.line = getOption('keep.blank.line', TRUE),
replace.assign = getOption('replace.assign', FALSE),
left.brace.newline = getOption('left.brace.newline', FALSE),
reindent.spaces = getOption('reindent.spaces', 4),
output = TRUE, text = NULL,
width.cutoff = getOption('width'),...
) {
# Length processing.
if (length(text) == 0L || all(grepl('^\\s*$', text))) {
if (output) cat('\n',...)
return(list(text.tidy = text, text.mask = text))
}
# Blank processing.
if (keep.blank.line && R3) {
one = paste(text, collapse = '\n') # record how many line breaks before/after
n1 = attr(regexpr('^\n*', one), 'match.length')
n2 = attr(regexpr('\n*$', one), 'match.length')
}
# Comment processing.
if (keep.comment) text = mask_comments(text, width.cutoff, keep.blank.line)
# Linefeed of brackets.
if (left.brace.newline) text.tidy = move_leftbrace(text.tidy)
> c('11',832)->x2
> x2
[1] "11" "832"
> tidy.eval(text="c('11',832)->x2")
c("11", 832) <- x2
Error in eval(expr, envir, enclos): object 'x2' not found
The bug has been fixed. The author’s reply: This bug has been fixed in R 3.0.2.
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
The functions provided in formatR are useful and practical, especially when we read code that
does not conform to the standards. I suggest that integrated development environments (IDEs)
build formatR into their editors as a standard formatting tool. We hope that reading other
people’s code may become a pleasant experience in the future.
24 ◾ R for Programmers: Mastering the Tools
RStudio is an effective tool for R development and arguably the best integrated development
environment (IDE) for R. RStudio Server is one of its strongest editions: it can be installed
on a remote server and accessed through a Web browser, and it supports multiuser collaborative
development. Let’s try such an effective tool!
Note: RStudio Server supports only Linux. The latest version of RStudio Server can
be downloaded from http://www.rstudio.com/ide/download/server.html.
It can be seen that RStudio Server has been launched, and port 8787 has been opened.
/etc/rstudio/rserver.conf
/etc/rstudio/rsession.conf
~ vi /etc/rstudio/rserver.conf
# Listener port.
www-port=8080
~ vi /etc/rstudio/rsession.conf
# Start
~ sudo rstudio-server start
# Stop
~ sudo rstudio-server stop
# Restart
~ sudo rstudio-server restart
Force suspension of the running of the R process. This is an operation with the highest priority,
which will be executed immediately.
Taking RStudio Server temporarily offline rejects Web access and shows users a friendly error.
Other operations of RStudio Server are just the same as in the standalone version of RStudio.
Open a new browser window and log in through the Hadoop account, as in Figure 1.8.
AAAAB3NzaC1yc2EAAAADAQABAAABAQDMmnFyZe2RHpXaGmENdH9kSyDyVzRas4GtRwMN
x+qQ4QsB8xVTrIbFayG2ilt+P8UUkVYO0qtUJIaLRjGy/SvQzzL7JKX12+VyYoKTfKvZ
ZnANJ414d6oZpbDwsC0Z7JARcWsFyTW1KxOMyesmzNNdB+F3bYN9sYNiTkOeVNVYmEQ8
aXywn4kcljBhVpT8PbuHl5eadSLt5zpN6bcX7tlquuTlRpLi1e4K+8j
Qo67H54FuDyrPLUYtVaiTNT/xWN6IU+DQ9CbfykJ0hrfDU1d1LiLQ4K2Fdg+vcKtB7Wxe
z2wKjsxb4Cb8TLSbXdIKEwSOFooINw25g/Aamv/nVvW1 conan@conan-deskop
Then we upload the local project to GitHub. Create a new project, rstudio-demo, on
GitHub, with the address https://github.com/bsspirit/rstudio-demo. Upload the local directory to
the rstudio-demo project with the operations that follow.
# Initialize git.
~ git init
Open RStudio and set it to the directory /home/conan/R/Github, via tools –> version control –>
project setup, as in Figure 1.9.
sayHello<-function(name){
print(paste("hello",name))
}
sayHello("Conan")
sayHello("World")
Commit: click tools –> version control –> commit, as in Figure 1.10.
Upload to Github: click tools –> version control –> push, as in Figure 1.11.
These powerful functions of RStudio have made programming quite easy to learn. Let’s start
doing this and try to be a real geek!
As a lightweight data format, JSON (JavaScript Object Notation) is widely used in all
kinds of environments. JSON is the native object notation of JavaScript, and it is also the
document storage format of MongoDB. JSON is semistructured and can express documents with rich
meaning. A JSON document is smaller than the equivalent XML document, which makes JSON
more suitable for network transmission. JSON was rarely used in early R programming, but as R
has become more popular and powerful it has extended into many other fields, including JSON.
How can one convert JSON data into R data in a foolproof way? The following is an introduction.
> install.packages("rjson")
> library(rjson)
~ vi fin0.json
{
"table1": {
"time": "130911",
"data": {
"code": [
"TF1312",
"TF1403",
"TF1406"
],
"rt_time": [
130911,
130911,
130911
]
}
},
"table2": {
"time": "130911",
"data": {
"contract": [
"TF1312",
"TF1312",
"TF1403"
],
"jtid": [
99,
65,
21
]
}
}
}
$table1$data
$table1$data$code
[1] "TF1312" "TF1403" "TF1406"
$table1$data$rt_time
[1] 130911 130911 130911
$table2
$table2$time
[1] "130911"
$table2$data
$table2$data$contract
[1] "TF1312" "TF1312" "TF1403"
$table2$data$jtid
[1] 99 65 21
> class(json_data)
[1] "list"
> class(json_data$table2)
[1] "list"
> class(json_data$table2$data)
[1] "list"
> class(json_data$table2$data$jtid)
[1] "numeric"
> class(json_data$table1$data$code)
[1] "character"
It can be seen that after conversion the original JSON object has been parsed into nested R
lists, except for the innermost parts, which are basic vectors (numeric, character). To take
a leaf node, note that the path written in JavaScript as json.table1.data.code[0] corresponds to
json_data$table1$data$code[1] in R, because R indexes from 1.
> json_data$table1$data$code
[1] "TF1312" "TF1403" "TF1406"
> json_data$table1$data$code[1]
[1] "TF1312"
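The same access pattern can be tried without a JSON file. The sketch below builds by hand a list that mirrors part of fin0.json (so it is an assumption about the parsed structure, using the values shown earlier) and reads a leaf in two equivalent ways.

```r
# Hand-built stand-in for part of the parsed JSON (illustrative only).
json_data <- list(
  table1 = list(
    time = "130911",
    data = list(
      code    = c("TF1312", "TF1403", "TF1406"),
      rt_time = c(130911, 130911, 130911)
    )
  )
)

# $ and [[ ]] walk the nested lists; [1] picks the first element (R is 1-based).
leaf_a <- json_data$table1$data$code[1]
leaf_b <- json_data[["table1"]][["data"]][["code"]][1]
print(leaf_a)
print(class(json_data$table1$data))
```

The [[ ]] form is convenient when the key names are held in variables rather than typed literally.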
> json_str<-toJSON(json_data)
> print(json_str)
[1]
"{\"table1\":{\"time\":\"130911\",\"data\":{\"code\":[\"TF1312\",\"T
F1403\",\"TF1406\"],\"rt_time\":[130911,130911,130911]}},\"table2\":
{\"time\":\"130911\",\"data\":{\"contract\":[\"TF1312\",\"TF1312\",\
"TF1403\"],\"jtid\":[99,65,21]}}}"
> cat(json_str)
{"table1":{"time":"130911","data":{"code":["TF1312","TF1403","TF1406
"],"rt_time":[130911,130911,130911]}},"table2":{"time":"130911","dat
a":{"contract":["TF1312","TF1312","TF1403"],"jtid":[99,65,21]}}}
We can convert an R object to JSON with the function toJSON(). If we output the result with
print(), the double quotes are shown escaped (\"). If we output the result with cat(), we get
a standard JSON string. There are two ways to write the JSON to a file: writeLines() and sink().
# writeLines
> writeLines(json_str, "fin0_out1.json")
# sink
> sink("fin0_out2.json")
> cat(json_str)
> sink()
Although the code is different, the JSON content is the same. The only difference is that
writeLines() appends a line break at the end of the file.
{"table1":{"time":"130911","data":{"code":["TF1312","TF1403","TF1406
"],"rt_time":[130911,130911,130911]}},"table2":{"time":"130911","dat
a":{"contract":["TF1312","TF1312","TF1403"],"jtid":[99,65,21]}}}
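The difference between the two ways of writing can be checked directly. This is a minimal sketch with a short, made-up JSON string and temporary files; cat() with its file argument is used here in place of the sink()/cat() pair and writes the same bytes.

```r
# A short JSON string standing in for json_str (illustrative only).
json_str <- '{"code":["TF1312","TF1403"]}'

f1 <- tempfile(fileext = ".json")
f2 <- tempfile(fileext = ".json")

writeLines(json_str, f1)      # writes the string followed by a line break
cat(json_str, file = f2)      # writes the string only

# Same content line by line, but the writeLines() file is slightly larger.
same_text <- identical(readLines(f1), readLines(f2, warn = FALSE))
size_diff <- file.size(f1) - file.size(f2)
print(same_text)
print(size_diff)
```

The size difference is the trailing line break, so either file parses to the same JSON.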
It can be seen that the operation based on the C implementation is faster than that based on
the R implementation. The difference of 0.02 is small because the amount of data is small; as
the JSON string gets bigger, the difference grows too. fromJSON uses the C method by default,
so normally we do not need to pass the parameter method = "C". Next we run a
performance test of toJSON.
> install.packages("RJSONIO")
> library(RJSONIO)
$table1$data
$table1$data$code
[1] "TF1312" "TF1403" "TF1406"
$table1$data$rt_time
[1] 130911 130911 130911
$table2
$table2$time
[1] "130911"
$table2$data
$table2$data$contract
[1] "TF1312" "TF1312" "TF1403"
$table2$data$jtid
[1] 99 65 21
We find that the result is the same as in rjson: The class of R object is all lists except for the
innermost part. Then take a leaf node:
> json_data$table1$data$code
[1] "TF1312" "TF1403" "TF1406"
> json_data$table1$data$code[1]
[1] "TF1312"
> json_str<-toJSON(json_data)
> print(json_str)
[1] "{\n \"table1\": {\n \"time\": \"130911\",\n\"data\": {\n
\"code\": [\"TF1312\", \"TF1403\", \"TF1406\"],\n\"rt_time\":
> cat(json_str)
{
"table1": {
"time": "130911",
"data": {
"code": [ "TF1312", "TF1403", "TF1406" ],
"rt_time": [ 1.3091e+05, 1.3091e+05, 1.3091e+05 ]
}
},
"table2": {
"time": "130911",
"data": {
"contract": [ "TF1312", "TF1312", "TF1403" ],
"jtid": [ 99, 65, 21]
}
}
}
The output of toJSON is formatted, which is different from that of rjson. Then output it to
the file:
The result:
{
"table1": {
"time": "130911",
"data": {
"code": [ "TF1312", "TF1403", "TF1406" ],
"rt_time": [ 1.3091e+05, 1.3091e+05, 1.3091e+05 ]
}
},
"table2": {
"time": "130911",
"data": {
"contract": [ "TF1312", "TF1312", "TF1403"],
"jtid": [ 99, 65, 21 ]
}
}
}
> isValidJSON(json_str)
Error in file(con, "r"): cannot open the connection
> isValidJSON(json_str,TRUE)
[1] TRUE
myMatrix = [ [ 1, 4, 7, 10, 13 ],
[ 2, 5, 8, 11, 14 ],
[ 3, 6, 9, 12, 15 ] ] ;
[
{
"code": "TF1312",
"rt_time": "152929",
"rt_latest": 93.76,
"rt_bid1": 93.76,
"rt_ask1": 90.76,
"rt_bsize1": 2,
"rt_asize1": 100,
"optionValue": -0.4,
"diffValue": 0.6
}
]
Define a data.frame:
> df<-data.frame(
+ code=c('TF1312','TF1310','TF1313'),
+ rt_time=c("152929","152929","152929"),
+ rt_latest=c(93.76,93.76,93.76),
+ rt_bid1=c(93.76,93.76,93.76),
+ rt_ask1=c(90.76,90.76,90.76),
+ rt_bsize1=c(2,3,1),
+ rt_asize1=c(100,1,11),
+ optionValue=c(-0.4,0.2,-0.1),
+ diffValue=c(0.6,0.6,0.5)
+)
> df
code rt_time rt_latest rt_bid1 rt_ask1 rt_bsize1 rt_asize1
optionValue diffValue
1 TF1312 152929 93.76 93.76 90.76 2 100
-0.4 0.6
2 TF1310 152929 93.76 93.76 90.76 3 1
0.2 0.6
3 TF1313 152929 93.76 93.76 90.76 1 11
-0.1 0.5
Use toJSON directly. The data frame is serialized column by column, with one JSON array per
column, which is not what we want.
> cat(toJSON(df))
{
"code": [ "TF1312", "TF1310", "TF1313" ],
"rt_time": [ "152929", "152929", "152929" ],
"rt_latest": [ 93.76, 93.76, 93.76 ],
"rt_bid1": [ 93.76, 93.76, 93.76 ],
"rt_ask1": [ 90.76, 90.76, 90.76 ],
"rt_bsize1": [ 2, 3, 1 ],
"rt_asize1": [ 100, 1, 11 ],
"optionValue": [ -0.4, 0.2, -0.1 ],
"diffValue": [ 0.6, 0.6, 0.5 ]
}
"rt_time": "152929",
"rt_latest": 93.76,
"rt_bid1": 93.76,
"rt_ask1": 90.76,
"rt_bsize1": 2,
"rt_asize1": 100,
"optionValue": -0.4,
"diffValue": 0.6
},
{
"code": "TF1310",
"rt_time": "152929",
"rt_latest": 93.76,
"rt_bid1": 93.76,
"rt_ask1": 90.76,
"rt_bsize1": 3,
"rt_asize1": 1,
"optionValue": 0.2,
"diffValue": 0.6
},
{
"code": "TF1313",
"rt_time": "152929",
"rt_latest": 93.76,
"rt_bid1": 93.76,
"rt_ask1": 90.76,
"rt_bsize1": 1,
"rt_asize1": 11,
"optionValue": -0.1,
"diffValue": 0.5
}
]
The output is displayed through data conversion by the function alply(). Such output by lines
is exactly what we want.
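The row-wise reshaping can also be sketched in base R, without plyr. This is an alternative illustration rather than the alply() call used in the text: each row of a small data frame (a cut-down version of df, with fewer columns) is turned into a named list, which is the shape a JSON serializer needs in order to emit one object per row.

```r
# Cut-down version of the data frame used above (fewer columns for brevity).
df <- data.frame(
  code      = c("TF1312", "TF1310", "TF1313"),
  rt_bsize1 = c(2, 3, 1),
  diffValue = c(0.6, 0.6, 0.5),
  stringsAsFactors = FALSE
)

# One named list per row; the outer list is unnamed, like a JSON array.
rows <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, ]))
str(rows[[1]])
```

Passing rows to toJSON() should then give an array of objects, although the exact formatting depends on the JSON package used.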
> library(rjson)
> df<-data.frame(
+ a=rep(letters,10000),
+ b=rnorm(260000),
+ c=as.factor(Sys.Date()-rep(1:260000))
+ )
>
> system.time(rjson::toJSON(df))
1.01 0.02 1.03
> system.time(rjson::toJSON(df))
1.01 0.03 1.04
> system.time(rjson::toJSON(df))
0.98 0.05 1.03
And create a test script of RJSONIO and run it on the command line.
> library(RJSONIO)
> df<-data.frame(
+ a=rep(letters,10000),
+ b=rnorm(260000),
+ c=as.factor(Sys.Date()-rep(1:260000))
+ )
> system.time(RJSONIO::toJSON(df))
2.23 0.02 2.24
> system.time(RJSONIO::toJSON(df))
2.30 0.00 2.29
> system.time(RJSONIO::toJSON(df))
2.25 0.01 2.26
The results of a comparison show that the performance of rjson is better than that of RJSONIO.
> library(rjson)
> library(plyr)
> df<-data.frame(
+ a=rep(letters,100),
+ b=rnorm(2600),
+ c=as.factor(Sys.Date()-rep(1:2600))
+ )
> system.time(rjson::toJSON(df))
0.01 0.00 0.02
> system.time(rjson::toJSON(df))
0.01 0.00 0.02
And create a test script of RJSONIO and run it on the command line.
> library(RJSONIO)
> library(plyr)
> df<-data.frame(
+ a=rep(letters,100),
+ b=rnorm(2600),
+ c=as.factor(Sys.Date()-rep(1:2600))
+ )
>
> system.time(RJSONIO::toJSON(df))
0.03 0.00 0.03
> system.time(RJSONIO::toJSON(df))
0.04 0.00 0.03
The result shows that output by columns is much more efficient than output by lines, because
output by lines requires extra data processing.
The finding that rjson is faster than RJSONIO does not support the claims made in the
introduction of RJSONIO. Of course, one test of mine cannot fully prove this conclusion. I hope
that you will run some tests yourselves if you have the time and interest.
R not only offers strong computing power for statistical analysis and data mining, but also
holds its own as a data visualization tool against expensive commercial software.
However, R would not be strong in visualization without the support of all kinds of open source
packages. Cairo is one such library for vector graphics processing. With Cairo you can create
high-quality vector graphics (SVG, PDF, PostScript) and bitmaps (PNG, JPEG, TIFF), and run
high-quality rendering in background programs as well. This section introduces how to use Cairo in R.
# Start R.
~ R
# Install Cairo.
> install.packages("Cairo")
◾ CairoPNG: grDevices::png()
◾ CairoJPEG: grDevices::jpeg()
◾ CairoTIFF: grDevices::tiff()
◾ CairoSVG: grDevices::svg()
◾ CairoPDF: grDevices::pdf()
I usually output graphics as PNG and SVG. Here let’s check the capabilities of Cairo:
# Load Cairo.
> library(Cairo)
◾ Supported: png, jpeg, pdf, svg, ps, x11 (Linux desktop), raster
◾ Not supported: tiff, win (Windows desktop)
Note: On Windows, x11 would be FALSE and win would be TRUE.
Then let’s compare the effect of output by CairoPNG() and png().
1.7.3.1 Scatterplot
First we draw a scatterplot with 6000 points.
# png function.
> png(file="plot4.png",width=640,height=480)
> plot(x,y,col="#ff000018",pch=19,cex=2,main = "plot")
> dev.off()
# CairoPNG function
> CairoPNG(file="Cairo4.png",width=640,height=480)
> plot(x,y,col="#ff000018",pch=19,cex=2,main = "Cairo")
> dev.off()
Two png files, plot4.png and Cairo4.png, will be generated in the current directory, as in
Figures 1.12 and 1.13.
The code of graphic output of SVG is
> svg(file="plot-svg4.svg",width=6,height=6)
> plot(x,y,col="#ff000018",pch=19,cex=2,main = "plot-svg")
> dev.off()
> CairoSVG(file="Cairo-svg4.svg",width=6,height=6)
> plot(x,y,col="#ff000018",pch=19,cex=2,main = "Cairo-svg")
> dev.off()
[Figures 1.12 and 1.13: scatterplots of y against x produced by png() and CairoPNG().]
Two svg files will be generated in the current directory: plot-svg4.svg and Cairo-svg4.svg.
These two files can be displayed in browsers.
# PNG graphic.
> png(file="plot2.png",width=640,height=480)
> op <- par(bg = "white", mar=c(0,2,3,0)+.1)
> persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col =
"lightblue", ltheta = 120, shade = 0.75, ticktype = "detailed", xlab
= "X", ylab = "Y", zlab = "Sinc(r)", main = "Plot")
> par(op)
> dev.off()
> CairoPNG(file="Cairo2.png",width=640,height=480)
> op <- par(bg = "white", mar=c(0,2,3,0)+.1)
> persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col =
"lightblue", ltheta = 120, shade = 0.75, ticktype = "detailed", xlab
= "X", ylab = "Y", zlab = "Sinc(r)", main = "Cairo")
> par(op)
> dev.off()
Two png files will be generated in the current directory: plot2.png and Cairo2.png, as in
Figures 1.14 and 1.15.
[Figures 1.14 and 1.15: perspective plots of Sinc(r) over X and Y, titled "Plot" (png()) and
"Cairo" (CairoPNG()).]
# Load MASS.
> library(MASS)
# PNG Graphic.
> png(file="plot5.png",width=640,height=480)
> biplot(corresp(m, nf=2), main="Plot")
> dev.off()
> CairoPNG(file="Cairo5.png",width=640,height=480)
> biplot(corresp(m, nf=2), main="Cairo")
> dev.off()
Two png files will be generated in the current directory: plot5.png and Cairo5.png, as in
Figures 1.16 and 1.17.
If we check the properties of these two files we find that a graphic generated by png is 54 kb
in size, while that generated by CairoPNG is 43.8 kb in size, which can be seen in Figure 1.18.
[Figures 1.16 and 1.17: correspondence analysis biplots titled "Plot" (png()) and "Cairo"
(CairoPNG()), showing row labels (R) and column labels (C) on axes from −0.6 to 0.4.]
> svg(file="plot-svg5.svg",width=6,height=6)
> biplot(corresp(m, nf=2), main="Plot-svg")
> dev.off()
> CairoSVG(file="Cairo-svg5.svg",width=6,height=6)
> biplot(corresp(m, nf=2), main="Cairo-svg")
> dev.off()
Two svg files will be generated in the current directory: plot-svg5.svg and Cairo-svg5.svg.
For all three of the preceding cases, it’s quite difficult to tell the difference between
CairoPNG() and png(); the Cairo output is just a little lighter and softer. On this point, Xie
Yihui has added that the difference is hard to see because the png() device of R now uses Cairo
by default. Several years ago png() did not use Cairo, so the PNG files generated at that time
were rather low in quality. In most cases there is little difference, although the
anti-aliasing sometimes differs.
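Whether a given R build can use Cairo for its own bitmap devices can be checked from base R. The short sketch below only inspects the session; it does not draw anything, and the values printed depend on the platform and on how R was built.

```r
# TRUE if this build of R was compiled with cairo support for its devices.
has_cairo <- capabilities("cairo")
print(has_cairo)

# The default backend used by png(), jpeg(), and similar bitmap devices.
print(getOption("bitmapType"))
```

If has_cairo is TRUE and the default backend is a cairo one, png() and CairoPNG() are likely to produce very similar images, which fits the observation above.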
R is born to be free. It differs from languages like Java and PHP, which are constrained by
unified standards. Almost all R packages vary in their naming and syntax, and some even mix
and match unrelated functions. caTools is such a mixed and matched library, covering several
sets of unrelated functions, including image processing, encoding and decoding, classifiers,
vector computing, and scientific computing, all of which are convenient to use and powerful.
It’s impossible to describe this package in a few words. Only the word peculiar can summarize
its features.
# Start R.
~ R
> library(caTools)
# Number of columns.
> ncol(y$image)
[1] 61
# Number of rows.
> nrow(y$image)
[1] 87
* The readers may run the codes and generate color images.
A wave.gif file will be generated in the current directory, as in Figure 1.20. The demonstration
effect of animation can be seen in the online sources of this book, file wave.gif.
We can see that caTools has the same function of outputting GIF animation with the anima-
tion package made by Xie Yihui. We do not need to rely on other software libraries when we use
caTools to make gif animation, but the saveGIF function of animation package needs to rely on
third-party software such as ImageMagick and GraphicsMagick.
decoding of base64 supports only a class of vector but not data.frame and list. Here is the code to
encode and decode a Boolean vector:
# Original Data
> x
[1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
# Original data.
> x
[1] "Hello R!!"
> data(iris)
> class(iris)
[1] "data.frame"
# Original data.
> head(x)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> library(MASS)
# Load the data set.
> data(cats)
# Print the first 6 lines.
> head(cats)
Sex Bwt Hwt
1 F 2.0 7.0
2 F 2.0 7.4
3 F 2.0 9.5
4 F 2.1 7.2
5 F 2.1 7.3
6 F 2.1 7.6
# Calculate the AUC of ROC curve and output image, as in Figure 1.21.
> colAUC(cats[,2:3], cats[,1], plotROC=TRUE)
Bwt Hwt
F vs. M 0.8338451 0.759048
◾ AUC = 1 is a perfect classifier. We can always get a perfect prediction no matter what thresh-
old value we set by using such a classifier. A perfect classifier does not exist for prediction on
most occasions.
◾ 0.5 < AUC < 1 is better than a random prediction. The classifier (model) will have predictive
value if the threshold value is set appropriately.
◾ AUC = 0.5 is the same as a random prediction (like flipping coins). The model does not have
predictive value.
◾ AUC < 0.5 is worse than a random prediction. But if we always take the opposite side of this
prediction, it’s better than a random prediction.
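To make the meaning of these values concrete, AUC can be computed by hand as the share of (positive, negative) pairs in which the positive case gets the higher score. The sketch below uses the rank-sum form of that calculation in base R; the function name auc_rank and the toy scores are made up for this illustration, and it is not the code behind colAUC().

```r
# AUC from ranks: equivalent to counting correctly ordered (positive, negative) pairs.
auc_rank <- function(score, is_pos) {
  r  <- rank(score)            # average ranks handle ties
  n1 <- sum(is_pos)            # number of positives
  n0 <- sum(!is_pos)           # number of negatives
  (sum(r[is_pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data: three positives, three negatives (illustrative values).
score  <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2)
is_pos <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
auc_toy <- auc_rank(score, is_pos)
print(auc_toy)

# A perfectly separated score gives AUC = 1.
auc_perfect <- auc_rank(c(0.1, 0.2, 0.8, 0.9), c(FALSE, FALSE, TRUE, TRUE))
print(auc_perfect)
```

In the toy data, 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.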
[Figure 1.21: ROC curves for Bwt and Hwt; x-axis: probability of false alarm (1-specificity);
y-axis: sensitivity.]
It can be seen from Figure 1.21 that the AUC values of Bwt and Hwt both lie within (0.5, 1). So
the data set cats is a real and effective data set. If we treat cats as a classification data
set, we can judge the quality of a classifier on it using AUC.
> combs(2:5, 3)
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 2 3 5
[3,] 2 4 5
[4,] 3 4 5
> x = (1:1000)*pi/1000
> trapz(x, sin(x))
[1] 1.999993
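The trapz() result above can be reproduced with the trapezoid rule written out in base R, which shows what the function computes: the sum of the areas of the trapezoids between neighboring points. This is a sketch of the rule itself, not of the caTools source.

```r
# Same grid as in the trapz() call above.
x <- (1:1000) * pi / 1000
y <- sin(x)

# Width of each interval times the mean height of its two end points.
area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
print(area)
```

The value is close to 2, the exact integral of sin(x) over [0, pi]; the small gap comes from the grid starting at pi/1000 and from the discretization.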
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# Train model.
> model = LogitBoost(Data, Label, nIter=20)
# Model data.
> model
$Stump
feature threshhold sign feature threshhold sign feature threshhold sign
[1,] 3 1.9 -1 2 2.9 -1 4 1.6 1
[2,] 4 0.6 -1 3 4.7 -1 3 4.8 1
[3,] 3 1.9 -1 2 2.0 -1 4 1.7 1
[4,] 4 0.6 -1 3 1.9 1 3 4.9 1
[5,] 3 1.9 -1 4 1.6 -1 4 1.3 1
[6,] 4 0.6 -1 1 6.5 1 2 2.6 -1
[7,] 3 1.9 -1 3 1.9 1 4 1.7 1
[8,] 4 0.6 -1 2 2.0 -1 2 3.0 -1
[9,] 3 1.9 -1 3 5.0 -1 3 5.0 1
[10,] 4 0.6 -1 2 2.9 1 1 4.9 -1
[11,] 3 1.9 -1 3 1.9 1 3 4.4 1
[12,] 4 0.6 -1 2 2.0 -1 4 1.7 1
[13,] 3 1.9 -1 3 5.1 -1 2 3.1 -1
[14,] 4 0.6 -1 2 2.0 -1 3 5.1 1
[15,] 3 1.9 -1 3 1.9 1 1 6.5 -1
[16,] 4 0.6 -1 4 1.6 -1 3 5.1 1
[17,] 3 1.9 -1 2 3.1 1 2 3.1 -1
[18,] 4 0.6 -1 3 1.9 1 1 4.9 -1
$lablist
[1] setosa versicolor virginica
Levels: setosa versicolor virginica
attr(,"class")
[1] "LogitBoost"
For the first six pieces of data, Lab columns show that the data belong to classification 1, setosa.
The other three columns—setosa, versicolor, and virginica—represent the probability of that clas-
sification. Next we set iterations to compare the result of the classification and that of real data.
# Set iterations to 2.
> table(predict(model, Data, nIter=2), Label)
Label
setosa versicolor virginica
setosa 48 0 0
versicolor 0 45 1
virginica 0 3 45
From the three preceding tests we can see that the more iterations there are, the more accurate
the model is. Next we randomly split the data into a training set and a test set and make a
prediction on the classification.
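The definitions of Data, Label, and mask do not appear in the listing here, so the following is only a sketch of one plausible setup, assuming the iris measurements as features, the Species column as labels, and a random logical mask that keeps roughly two thirds of the rows for training. The seed and the split ratio are arbitrary choices for this illustration.

```r
data(iris)

# Features are the four measurements; the label is the species (assumed setup).
Data  <- iris[, -5]
Label <- iris[, 5]

# Random logical mask: TRUE rows go to the training set, FALSE rows to the test set.
set.seed(42)
mask <- sample(c(TRUE, FALSE), size = nrow(Data), replace = TRUE, prob = c(2 / 3, 1 / 3))
print(table(mask))
```

Data[mask, ] and Label[mask] then form the training set, while Data[!mask, ] and Label[!mask] are held back for testing.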
# Train model.
> model = LogitBoost(Data[mask,], Label[mask], nIter=10)
# Prediction on classification.
> table(predict(model, Data[!mask,], nIter=2), Label[!mask])
setosa versicolor virginica
setosa 16 0 0
versicolor 0 15 3
virginica 0 1 12
[49] 222.2 220.7 220.0 218.7 217.0 215.9 215.8 214.1 212.3 213.9
214.6 213.6
[61] 212.1 211.4 213.1 212.9 213.3 211.5 212.3 213.0 211.0 210.7
210.1 211.4
[73] 210.0 209.7 208.8 208.8 208.8 210.6 211.9 212.8 212.5 214.8
215.3 217.5
[85] 218.8 220.7 222.2 226.7 228.4 233.2 235.7 237.1 240.6 243.8
245.3 246.0
[97] 246.3 247.7 247.6 247.8 249.4 249.0 249.9 250.5 251.5 249.0
247.6 248.8
[109] 250.4 250.7 253.0 253.7 255.0 256.2 256.0 257.4 260.4 260.0
261.3 260.4
[121] 261.6 260.8 259.8 259.0 258.9 257.4 257.7 257.9 257.4 257.3
257.6 258.9
[133] 257.8 257.7 257.2 257.5 256.8 257.5 257.0 257.6 257.3 257.5
259.6 261.1
[145] 262.9 263.3 262.8 261.8 262.2 262.7
[Figure 1.22: BJsales plotted against Time (0 to 150), with values between about 200 and 260.]
There are six lines in Figure 1.22. The black line is the original data, and lines in other colors
are moving averages in different units. For example, the red line represents the average of three
points, and the green line represents the average of eight points.*
* The readers may run the source code to check the color image.
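A moving average of this kind can also be sketched in base R, which makes the calculation explicit: each output point is the mean of a window of k neighboring points. The sketch uses stats::filter() on the built-in BJsales series rather than the caTools function, so the treatment of the first and last points may differ from the figure.

```r
# Centered moving average over k = 3 points using equal weights.
k   <- 3
ma3 <- stats::filter(BJsales, rep(1 / k, k), sides = 2)

# The window does not fit at the very first point, so it is NA there.
print(head(ma3, 5))
print(mean(BJsales[1:3]))   # equals the second value of ma3
```

A larger k gives a smoother line but loses more points at both ends of the series.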
[Figure: x plotted with its moving-window minimum, median, mean, and maximum.]
> n=200
> x = rnorm(n,sd=30) + abs(seq(n)-n/4)
> plot(x, main = "Moving Window Analysis Functions (window size=25)")
> lines(runmin (x,k), col="red")
> lines(runmed (x,k), col="green")
> lines(runmean(x,k), col="blue")
> lines(runmax (x,k), col="cyan")
# Summation
> x = c(1, 1e20, 1e40, -1e40, -1e20, -1)
> a = sum(x); print(a)
[1] -1e+20
# Exact Summation
> b = sumexact(x); print(b)
[1] 0
If we sum the vector x, the result should be 0, but sum() returns –1e+20. The error is caused
by the limited precision of floating-point arithmetic. The error no longer exists after
correction by sumexact(). Then we compute a cumulative sum with cumsum().
# Accumulative summation.
> a = cumsum(x); print(a)
[1] 1e+00 1e+20 1e+40 0e+00 -1e+20 -1e+20
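The rounding behind these results can be seen in a single step. In the sketch below, adding 1 to 1e20 changes nothing, because a double-precision number carries only about 16 significant digits; adding the values in cancelling pairs avoids the loss for this particular vector. This illustrates the cause of the error, not the algorithm used by sumexact().

```r
x <- c(1, 1e20, 1e40, -1e40, -1e20, -1)

# 1 is too small to change 1e20, so it is lost as soon as it is added.
lost <- (1e20 + 1) - 1e20
print(lost)

# Adding values of equal magnitude first lets each pair cancel exactly.
paired <- (x[1] + x[6]) + (x[2] + x[5]) + (x[3] + x[4])
print(paired)
```

Reordering only helps here because the vector happens to consist of cancelling pairs; a general-purpose exact summation needs more careful bookkeeping.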
This chapter mainly introduces three packages of R that can process time series data. It may help
readers to master the data structure and basic use of time series data.
Time series analysis is a statistical method of dynamic data processing. Through time series analy-
sis, we can figure out what is changing in the world! R, as an effective tool for statistical analysis,
is very powerful in time series processing. A separate data class defined in R, zoo, is suitable for
time series data; it is the basic library for time series and stock analysis. This section introduces the
structure of zoo and how to use it in R.
date, and time, and depends only on the basic R environment. zooreg object extends and behaves like
zoo object, but it is used only in analysis of regular time series data. Many other packages of R for
time series analysis are based on zoo and zooreg. There are mainly six kinds of API in zoo:
1. Basic object
– zoo: ordered time series object
– zooreg: regular time series object, extending and behaving like zoo object. It is different
from zoo in that it demands that data be continuous
2. Class conversion
– as.zoo: coercion from and to zoo
– plot.zoo: plotting method for objects of class zoo
– xyplot.zoo: plot zoo series with Lattice
– ggplot2.zoo: plot zoo objects with ggplot2
3. Data operation
– coredata: extract/replace the core data of zoo
– index: extract/replace the index of zoo
– window.zoo: filter data by time
– merge.zoo: merge two or more zoo objects
– read.zoo: read zoo sequence from files
– aggregate.zoo: compute summary statistics of zoo objects
– rollapply: apply rolling functions
– rollmean: calculate roll mean of zoo data
4. Processing of NA
– na.fill: fill NA
– na.locf: replace NA
– na.aggregate: replace NA by aggregation
– na.approx: replace NA using interpolation
– na.StructTS: replace NA using seasonal Kalman filter
– na.trim: filter records with NA
5. Auxiliary tools
– is.regular: check regularity of a series
– lag.zoo: calculate lag and difference
– MATCH: value matching
– ORDER: compute ordering permutation
6. Display control
– yearqtr: display time in year and quarter
– yearmon: display time in year and month
– xblocks: plot contiguous blocks along x-axis
– make.par.list: format conversion for plot.zoo data and xyplot.zoo data
◾ Windows 7 64bit
◾ R: 3.0.1 x86_64-w64-mingw32/x64 (64-bit)
Basic Packages of Time Series ◾ 69
# Start R.
~ R
x is the core data, which may be a vector, matrix, or factor. order.by is the index, whose
values must be unique so that the data can be ordered. frequency is the number of observations
per unit of time.
The code that follows will create a zoo object indexed by time, as in Figure 2.1. We should pay
special attention to the fact that zoo object allows discontinuous time series data.
# Display.
> plot(x)
[Figure 2.1: the zoo object x plotted against Index (dates from Feb 01 to Feb 13).]
Next, we create some groups of time series data indexed by number. The following code will
create a matrix of 4 rows and 3 columns with 12 elements. Create a zoo object y indexed by num-
ber 0:10 and output y as in Figure 2.2.
[Figure 2.2: the zoo object y plotted as three panels (Series 1, Series 2, Series 3) against
Index (0 to 10).]
◾ ts.eps: tolerance for the interval of the time series. If the interval of the data is less
than ts.eps, ts.eps will be used as the interval. It is set through getOption("ts.eps"), which
is 1e-05 by default.
◾ order.by: index, which demands the uniqueness of fields for ordering. It behaves like
order.by of zoo.
The following code will create a zooreg object indexed by continuous year (quarter), as in
Figure 2.3.
[Figure 2.3: the zooreg object zr plotted against its year and quarter index.]
◾ Lag: Lag is calculated according to index in zoo, and according to value in zooreg.
◾ Difference: Difference is calculated according to index in zoo, and according to value in
zooreg.
# Lag.
> lag(zz, k = -1)
2 3 6 7 8
1 2 3 6 7
# Difference.
> diff(zz)
2 3 6 7 8
1 1 3 1 1
> diff(zr)
2 3 7 8
1 1 1 1
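The definitions of zz and zr are not shown in this copy. The sketch below assumes an irregular zoo series and its zooreg counterpart built from the values 1, 2, 3, 6, 7, 8; these inputs are an assumption, chosen because they are consistent with the printed output.

```r
library(zoo)

# Assumed inputs: the index has a gap between 3 and 6.
v  <- c(1, 2, 3, 6, 7, 8)
zz <- zoo(v, order.by = v)
zr <- as.zooreg(zz)

lag(zz, k = -1)  # zoo: each value moves to the adjacent index entry
lag(zr, k = -1)  # zooreg: each value moves by one time step
diff(zz)         # differences between neighbouring observations
diff(zr)         # differences only where the previous time step exists
```

With zoo, the jump from 3 to 6 appears as a difference of 3; with zooreg, index 6 has no observation at index 5 to compare with, so it drops out of the result.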
Then the object is converted from the zoo class to other classes.
# Load ggplot2.
> library(ggplot2)
> library(scales)
[Figure: plot of x against Index, Feb 02 to Feb 14]
Use the function index() to modify the index of the zoo class.
# Extract data dated from 2003-02-01 whose index date is in
# x.date[1:6].
> window(x, index = x.date[1:6], start = as.Date("2003-02-01"))
2003-02-09 0.7021167 -0.3073809
2003-02-11 2.5071111 0.6210542
# Merge the data sets into a complete one, and fill the blank data
# with NA.
> merge(y1, y2, all = TRUE)
y1.1 y1.2 y2.1 y2.2
1 1 6 NA NA
2 2 7 NA NA
3 3 8 0.9514985 1.7238941
4 4 9 -1.1131230 -0.2061446
5 5 10 0.6169665 -1.3141951
6 NA NA 0.5134937 0.0634741
7 NA NA 0.3694591 -0.2319775
# Merge the data sets into a complete one, and fill the blank data
# with 0.
> merge(y1, y2, all = TRUE, fill = 0)
y1.1 y1.2 y2.1 y2.2
1 1 6 0.0000000 0.0000000
2 2 7 0.0000000 0.0000000
3 3 8 0.9514985 1.7238941
4 4 9 -1.1131230 -0.2061446
5 5 10 0.6169665 -1.3141951
6 0 0 0.5134937 0.0634741
7 0 0 0.3694591 -0.2319775
2.1.3.8 Processing of NA
Use the function na.fill() to fill NA.
# Use "extend" to fill NA: interior NAs are linearly interpolated, and
# leading or trailing NAs take the nearest non-NA value.
> na.fill(z, "extend")
1 2 3 4 5 6 7 8
2.0 2.0 2.5 3.0 4.0 5.0 9.0 9.0
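The definition of z is not shown in this copy. The sketch below assumes an input that is consistent with the printed result, with one leading, one interior, and one trailing NA.

```r
library(zoo)

# Assumed input; the book's own definition of z is not available here.
z <- zoo(c(NA, 2, NA, 3, 4, 5, 9, NA))

filled <- na.fill(z, "extend")
filled
```

The interior NA sits between 2 and 3 and becomes 2.5, while the NAs at both ends copy their nearest neighbour.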
Use the function na.aggregate() to replace NA with a summary statistic (the mean by default).
Use the function na.approx() to replace NA by interpolation.
> na.approx(z)
1 3 4 6 7 8
2.000000 1.333333 1.000000 4.000000 5.000000 2.000000
Use the function na.StructTS() to calculate the seasonal Kalman filter to replace NA, as in
Figure 2.5.
[Figure 2.5: plot of cbind(z, zout) against Index, 2000 to 2003]
> na.trim(xx)
6 6 7
> as.yearqtr("2001-2")
[1] "2001 Q2"
> as.yearmon("2007-03-01")
[1] "March 2007"
> as.yearmon("2007-12")
[1] "December 2007"
[Figure: plot of Flow]
> set.seed(0)
> flow <- ts(filter(rlnorm(200, mean = 1), 0.8, method = "r"))
> rgb <- hcl(c(0, 0, 260), c = c(100, 0, 100), l = c(50, 90, 50),
alpha = 0.3)
> plot(flow)
> xblocks(flow > 30, col = rgb[1]) ## high values red
> xblocks(flow < 15, col = rgb[3]) ## low value blue
> xblocks(flow >= 15 & flow <= 30, col = rgb[2]) ## the rest gray
2.1.3.11 Read Time Series Data from Files and Create zoo Object
First we create a file and name it read.csv:
~ vi read.csv
2003-01-01,1.0073644,0.05579711
2003-01-03,-0.2731580,0.06797239
2003-01-05,-1.3096795,-0.20196174
2003-01-07,0.2225738,-1.15801525
2003-02-09,1.1134332,-0.59274327
2003-02-11,0.8373944,0.76606538
2003-02-13,0.3145168,0.03892812
2003-03-15,0.2222181,0.01464681
2003-03-17,-0.8436154,-0.18631697
2003-04-19,0.4438053,1.40059083
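The reading step is not shown in this copy. A minimal sketch with read.zoo(), assuming the zoo package is installed; it writes a few of the rows above to a temporary file (an illustrative path) so that it runs on its own.

```r
library(zoo)

# Write three of the rows above to a temporary file (illustrative location).
path <- tempfile(fileext = ".csv")
writeLines(c("2003-01-01,1.0073644,0.05579711",
             "2003-01-03,-0.2731580,0.06797239",
             "2003-01-05,-1.3096795,-0.20196174"), path)

# The first column becomes the index; format asks for it to be parsed as a Date.
r <- read.zoo(path, sep = ",", header = FALSE, format = "%Y-%m-%d")

class(index(r))
dim(r)
```

Arguments such as sep and header are passed on to read.table(), so the usual file-reading options apply.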
We’ve now covered the main uses of the zoo library and the zoo object. Now we can start
processing time series data in R!
[Diagram: data.frame, list, zoo, ts, POSIXct, xts, zooreg]
http://blog.fens.me/r-xts/
This section continues with xts, an extension of zoo. Time series data can follow complex
patterns. As a basic time series library, zoo is designed for general problems, such as defining
stock data and analyzing weather data. For other tasks, we may need more auxiliary functions to
work efficiently. xts, as an extension of zoo, provides more functions for data processing and
data conversion.
xts object
◾ Windows 7 64bit
◾ R: 3.0.1 x86_64-w64-mingw32/x64 (64-bit)
# Start R.
~ R
# Install xts.
> install.packages("xts")
also installing the dependency 'zoo'
# Load xts.
> library(xts)
# Load sample_matrix.
> data(sample_matrix)
> plot(as.xts(sample_matrix))
Warning message:
In plot.xts(as.xts(sample_matrix)) :
only the univariate series will be plotted
The warning message indicates that only a univariate series can be plotted. Thus only the
first column, sample_matrix[,1], is drawn.
K-line chart:
[Figures: two plots titled as.xts(sample_matrix), x-axis Jan 02 2007 to Jun 19 2007]
> firstof(2005,01,01)
[1] "2005-01-01 CST"
> lastof(2007,10)
[1] "2007-10-31 23:59:59.99998 CST"
Create the first observation and the last observation of a time period:
$last.time
[1] "2000-12-31 23:59:59.99998 CST"
$last.time
[1] "2001-02-28 23:59:59.99998 CST"
> .parseISO8601('2000-01/02')
$first.time
[1] "2000-01-01 CST"
$last.time
[1] "2000-02-29 23:59:59.99998 CST"
> .parseISO8601('T08:30/T15:00')
$first.time
[1] "1970-01-01 08:30:00 CST"
$last.time
[1] "1970-12-31 15:00:59.99999 CST"
> head(x)
[,1]
2010-January-01 00:00:00.000 1
2010-January-01 00:01:00.000 2
2010-January-01 00:02:00.000 3
2010-January-01 00:03:00.000 4
2010-January-01 00:04:00.000 5
2010-January-01 00:05:00.000 6
# Calculate mean by quarter and display the result by the last day
# of each quarter.
> apply.quarterly(xts.ts,mean)
[,1]
2007-03-31 0.12642053
2007-06-30 0.09977926
2007-08-19 0.04589268
# Calculate mean by year and display the result by the last day of
# each year.
> apply.yearly(xts.ts,mean)
[,1]
2007-08-19 0.09849522
> data(sample_matrix)
# Divide the matrix data by month.
> to.period(sample_matrix)
           sample_matrix.Open sample_matrix.High sample_matrix.Low sample_matrix.Close
2007-01-31 50.03978 50.77336 49.76308 50.22578
2007-02-28 50.22448 51.32342 50.19101 50.77091
2007-03-31 50.81620 50.81620 48.23648 48.97490
2007-04-30 48.94407 50.33781 48.80962 49.33974
2007-05-31 49.34572 49.69097 47.51796 47.73780
2007-06-30 47.74432 47.94127 47.09144 47.76719
> class(to.period(sample_matrix))
[1] "matrix"
> data(sample_matrix)
# Divide by month.
> endpoints(sample_matrix)
[1] 0 30 58 89 119 150 180
# Divide by every 7 days.
> endpoints(sample_matrix, 'days',k=7)
 [1]   0   6  13  20  27  34  41  48  55  62  69  76  83  90  97 104 111 118 125
[20] 132 139 146 153 160 167 174 180
# Divide by week.
> endpoints(sample_matrix, 'weeks')
 [1]   0   7  14  21  28  35  42  49  56  63  70  77  84  91  98 105 112 119 126
[20] 133 140 147 154 161 168 175 180
# Divide by month.
> endpoints(sample_matrix, 'months')
[1] 0 30 58 89 119 150 180
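The endpoints are mostly useful as input to other functions. As a sketch of one such use, assuming the xts package is installed, the monthly endpoints can drive period.apply() to compute one statistic per month; choosing the Close column is an illustrative assumption.

```r
library(xts)

data(sample_matrix)
x <- as.xts(sample_matrix)

# Row numbers that close each month, with a leading 0.
ep <- endpoints(x, on = "months")

# Apply mean() to the Close column within each monthly block.
monthly_mean <- period.apply(x[, "Close"], INDEX = ep, FUN = mean)
monthly_mean
```

Each result row is stamped with the last observation of its block, which is also how apply.quarterly() and apply.yearly() label their output.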
2013-11-24 6 6
2013-11-25 7 NA
2013-11-26 8 NA
2013-11-27 9 NA
2013-11-28 10 NA
> data(sample_matrix)
> x <- as.xts(sample_matrix)
# Split by week, and print out the data of first two weeks.
> split(x, f="weeks")[[1]]
Open High Low Close
2007-01-02 50.03978 50.11778 49.95041 50.11778
2007-01-03 50.23050 50.42188 50.23050 50.39767
2007-01-04 50.42096 50.42096 50.26414 50.33236
2007-01-05 50.37347 50.37347 50.22103 50.33459
2007-01-06 50.24433 50.24433 50.11121 50.18112
2007-01-07 50.13211 50.21561 49.99185 49.99185
2007-01-08 50.03555 50.10363 49.96971 49.98806
> split(x, f="weeks")[[2]]
Open High Low Close
2007-01-09 49.99489 49.99489 49.80454 49.91333
2007-01-10 49.91228 50.13053 49.91228 49.97246
2007-01-11 49.88529 50.23910 49.88529 50.23910
2007-01-12 50.21258 50.35980 50.17176 50.28519
2007-01-13 50.32385 50.48000 50.32385 50.41286
2007-01-14 50.46359 50.62395 50.46359 50.60145
2007-01-15 50.61724 50.68583 50.47359 50.48912
Processing of NA.
2013-11-22 4
2013-11-23 4
2013-11-24 6
2013-11-25 7
2013-11-26 8
2013-11-27 8
2013-11-28 8
# Replace NA with the next non-NA observation.
> na.locf(x, fromLast=TRUE)
[,1]
2013-11-19 3
2013-11-20 3
2013-11-21 3
2013-11-22 4
2013-11-23 6
2013-11-24 6
2013-11-25 7
2013-11-26 8
2013-11-27 NA
2013-11-28 NA
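The definition of x is not shown in this copy, so the sketch below uses a small illustrative series (an assumption) to contrast the two directions of na.locf().

```r
library(xts)

# Illustrative data, not the book's series.
x <- xts(c(1, NA, 3, NA, NA, 6), order.by = as.Date("2013-11-19") + 0:5)

forward  <- na.locf(x)                   # carry the last non-NA value forward
backward <- na.locf(x, fromLast = TRUE)  # carry the next non-NA value backward

cbind(x, forward, backward)
```

NAs that have no non-NA value on the side being copied from are left in place, which is why trailing NAs can survive a fromLast = TRUE fill.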
# Print out starting time and ending time in the unit of day.
> periodicity(xts.ts)
Daily periodicity from 2007-01-01 to 2007-08-19
> data(sample_matrix)
# By year.
> timeBasedSeq('1999/2008')
[1] "1999-01-01" "2000-01-01" "2001-01-01" "2002-01-01" "2003-01-01"
[6] "2004-01-01" "2005-01-01" "2006-01-01" "2007-01-01" "2008-01-01"
# By month.
> head(timeBasedSeq('199901/2008'))
[1] "December 1998" "January 1999" "February 1999" "March 1999" "April 1999"
[6] "May 1999"
# By day.
> head(timeBasedSeq('199901/2008/d'),40)
 [1] "December 1998" "January 1999" "January 1999" "January 1999" "January 1999"
 [6] "January 1999" "January 1999" "January 1999" "January 1999" "January 1999"
[11] "January 1999" "January 1999" "January 1999" "January 1999" "January 1999"
[16] "January 1999" "January 1999" "January 1999" "January 1999" "January 1999"
[21] "January 1999" "January 1999" "January 1999" "January 1999" "January 1999"
[26] "January 1999" "January 1999" "January 1999" "January 1999" "January 1999"
[31] "January 1999" "January 1999" "February 1999" "February 1999" "February 1999"
[36] "February 1999" "February 1999" "February 1999" "February 1999" "February 1999"
2013-11-19 1
2013-11-20 2
2013-11-21 3
2013-11-22 4
2013-11-23 5
2013-11-24 6
2013-11-25 7
2013-11-26 8
2013-11-27 9
2013-11-28 10
2013-11-21 1
2013-11-22 1
2013-11-23 1
> str(x)
An 'xts' object on 2013-11-19/2013-11-28 containing:
Data: int [1:10, 1] 1 2 3 4 5 6 7 8 9 10
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
NULL
xts provides more API support for time series than the zoo class. Thus we have a more
convenient tool for converting and transforming time series data.
[Diagram: zoo, ts, xts, plot.xts, xtsExtra; caption: Visualization of time series]
http://blog.fens.me/r-xts-xtsextra/
A post on R-bloggers, "plot.xts is wonderful!" (http://www.r-bloggers.com/plot-xts-is-wonderful/),
gave me great motivation to continue exploring the power of xts. xts extends the basic data
structure of zoo and provides richer functions. The xtsExtra library provides a simple but
effective plotting function, plot.xts. This section shows how to visualize xts time series
objects with plot.xts.
◾ Windows 7 64bit
◾ R: 3.0.1 x86_64-w64-mingw32/x64 (64-bit)
Note: xtsExtra supports both Windows 7 and Linux. Since xtsExtra isn’t published on CRAN,
we need to download it from R-Forge.
# Start R.
~ R
# Load xtsExtra.
> library(xtsExtra)
> names(formals(plot.xts))
 [1] "x"            "y"            "screens"      "layout.screens" "..."
 [6] "yax.loc"      "auto.grid"    "major.ticks"  "minor.ticks"    "major.format"
[11] "bar.col.up"   "bar.col.dn"   "candle.col"   "xy.labels"      "xy.lines"
[16] "ylim"         "panel"        "auto.legend"  "legend.names"   "legend.loc"
[21] "legend.pars"  "events"       "blocks"       "nc"             "nr"
> data(sample_matrix)
> sample_xts <- as.xts(sample_matrix)
> plot(sample_xts[,1])
> class(sample_xts[,1])
[1] "xts" "zoo"
It can be seen from Figure 2.10 that xtsExtra::plot.xts() achieves a different effect from
xts::plot.xts(). Next, we’ll draw some more complex graphs.
[Figure 2.10: plot of sample_xts[, 1]]
[Figures: two plots of sample_xts[1:30,]]
> plot(sample_xts[,1:2])
[Figure: multi-panel plot of sample_xts[, 1:2], Jan 02 2007 to Jun 30 2007]
Through drawing these graphs, we may find that plot.xts() provides various parameter
configurations to help us achieve richer forms of expression in visualizing time series.
Draw a graph displayed by a double screen and assign the screen and color of each line, as in
Figure 2.17.
[Figures: plots of sample_xts[, 1:4] and sample_xts in single-screen and double-screen layouts, Jan 2007 to Jun 2007]
Draw a graph displayed by a double screen and assign different coordinates, as in Figure 2.18.
Draw a graph displayed by a double screen and assign a different type of output, as in
Figure 2.19.
Draw a graph displayed by multiple screens and set them in different groups as in Figure 2.20.
[Figures: plots of 10^sample_xts and sample_xts[1:75, 1:2]-50.5; multi-screen plots with panels labeled Open, High, Low, and Close; plot of sample_xts[, 1] with "Bad days" annotations]
> plot(sample_xts[,1],sample_xts[,2])
[Figures: two scatter plots of sample_xts[, 2] against sample_xts[, 1]]
Draw a graph of the xts class and add background coordinate lines automatically, as in Figure 2.25.
> plot.xts(tser)
[Figures: two plots of tser; one plot of Value over Mar 19 2015 to Mar 29 2015 with labels A to F]
We’ve seen the powerful drawing function xtsExtra::plot.xts. It makes it easy to draw time
series graphics with rich elements!
Chapter 3
Performance Monitoring
Packages of R
This chapter mainly introduces three tool packages related to the performance of R, which may
help readers find bottlenecks of performance in programs.
Caching is widely applied in computer systems. For applications with highly concurrent access,
it is often the most cost-effective way to improve performance. For repetitive calculations in
particular, a cache can save a large amount of central processing unit (CPU) time, even up to
99%. Optimization of R is under way: R leaders, represented by Hadley Wickham, are making R
much faster.
◾ memoise: define memoised function, i.e., load the result of function calculation into local
cache
◾ forget: forget past results; resets the cache of a memoised function
◾ Windows 7 64bit
◾ R: 3.0.1 x86_64-w64-mingw32/x64 (64-bit)
# Start R.
~ R
# Install memoise.
> install.packages("memoise")
trying URL 'http://mirror.bjtu.edu.cn/cran/bin/windows/contrib/3.0/memoise_0.1.zip'
Content type 'application/zip' length 10816 bytes (10 Kb)
opened URL
downloaded 10 Kb
# Load memoise.
> library(memoise)
if (cache$has_key(hash)) {
cache$get(hash)
} else {
res <- f(...)
cache$set(hash, res)
res
}
}
attr(memo_f, "memoised") <- TRUE
return(memo_f)
}
TRUE
}
# Clear cache.
cache_reset()
From the source code we can appreciate the design philosophy of memoise and see how well its
author understands R. With its concise and effective code, memoise is worth studying.
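The caching idea discussed above can be illustrated in a few lines of base R. This is a simplified sketch, not the memoise source: the names memo_sketch, slow_square, and fast_square are hypothetical, and the cache key is built with deparse() rather than a hash.

```r
# A simplified memoisation wrapper: results are stored in an environment,
# keyed by a text form of the arguments.
memo_sketch <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(...), envir = cache)   # compute once and remember
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

calls <- 0
slow_square <- function(n) {
  calls <<- calls + 1   # count how often the real computation runs
  n^2
}

fast_square <- memo_sketch(slow_square)
fast_square(4)
fast_square(4)   # same arguments: answered from the cache
calls            # number of real computations so far
```

The environment lives in the closure of the returned function, so each wrapped function gets its own private cache.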
As R becomes more widely used, its computational performance attracts increasing attention.
Measuring how much CPU time an algorithm costs is a key step in optimizing performance.
Fortunately, base R provides a function for performance monitoring, Rprof().
◾ Windows 7 64bit
◾ R: 3.0.1 x86_64-w64-mingw32/x64 (64-bit)
# Start R.
~ R
# View the definition of Rprof().
> Rprof
function (filename = "Rprof.out", append = FALSE, interval = 0.02,
memory.profiling = FALSE, gc.profiling = FALSE, line.profiling = FALSE,
numfiles = 100L, bufsize = 10000L)
{
if (is.null(filename))
filename <- ""
invisible(.External(C_Rprof, filename, append, interval,
memory.profiling, gc.profiling, line.profiling, numfiles,
bufsize))
}
<bytecode: 0x000000000d8efda8>
Rprof() is used to generate a log file to record performance indicators. Normally we just need
to specify the filename.
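The basic workflow can be sketched with base R only. The function busy() and the temporary log path are illustrative assumptions; the workload simply keeps the CPU occupied long enough for the sampler to record something.

```r
prof_file <- tempfile(fileext = ".out")   # illustrative log location

busy <- function(seconds = 0.5) {
  # Spin for a fixed wall-clock time so that samples are recorded.
  stop_at <- Sys.time() + seconds
  total <- 0
  while (Sys.time() < stop_at) total <- total + sum(sqrt(1:1000))
  total
}

Rprof(prof_file)     # start sampling into the log file
invisible(busy())
Rprof(NULL)          # stop sampling

report <- summaryRprof(prof_file)
names(report)
head(report$by.self)
```

Calling Rprof(NULL) is what closes the log; code that runs after it is not profiled.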
> bidpx1<-read.csv(file="000000_0.txt",header=FALSE)
> names(bidpx1)<-c("tradedate","tradetime","securityid","bidpx1","bidsize1","offerpx1","offersize1")
> bidpx1$securityid<-as.factor(bidpx1$securityid)
> head(bidpx1)
tradedate tradetime securityid bidpx1 bidsize1 offerpx1 offersize1
1 20130724 145004 131810 2.620 6960 2.630 13000
2 20130724 145101 131810 2.860 13880 2.890 6270
3 20130724 145128 131810 2.850 327400 2.851 1500
4 20130724 145143 131810 2.603 44630 2.800 10650
5 20130724 144831 131810 2.890 11400 3.000 77990
6 20130724 145222 131810 2.600 1071370 2.601 35750
> object.size(bidpx1)
1299920 bytes
Explanation of fields: bidpx1 is the price of the best bid, bidsize1 is the size of the best
bid, offerpx1 is the price of the best offer, and offersize1 is the size of the best offer.
Task: divide the data into groups by securityid and by hour (datehour), then calculate the
mean price of the best bid and the total size of the best bid for each group.
+ ddply(bidpx1,.(securityid,datehour),summarize,price=mean(bidpx1),size=sum(bidsize1))
+ }
> head(fun1())
securityid datehour price size
1 131810 2013072210 3.445549 189670150
2 131810 2013072211 3.437179 131948670
3 131810 2013072212 3.421000 920
4 131810 2013072213 3.509442 299554430
5 131810 2013072214 3.578667 195130420
6 131810 2013072215 1.833000 718940
Check the running time of fun1 with system.time(). Running it twice shows that the time cost
is similar each time, so the second run does not benefit from any cache.
> system.time(fun1())
   user  system elapsed
   0.08    0.00    0.07
> system.time(fun1())
   user  system elapsed
   0.06    0.00    0.06
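The same comparison can be scripted by keeping the timing objects instead of only printing them. A minimal sketch in base R, where work() is an illustrative stand-in for fun1():

```r
work <- function() sum(sqrt(seq_len(1e6)))   # illustrative workload

t1 <- system.time(work())
t2 <- system.time(work())

# "elapsed" is wall-clock time; "user.self" and "sys.self" are CPU time.
timings <- rbind(first = unclass(t1), second = unclass(t2))
timings[, c("user.self", "sys.self", "elapsed")]
```

Single timings this short are noisy, so small differences between the two rows should not be over-interpreted.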
~ vi fun1_rprof.out
sample.interval=20000
"substr" "paste" "fun1"
"paste" "fun1"
"structure" "splitter_d" "ddply" "fun1"
The raw log is hard to read directly, so we use summaryRprof() to interpret it.
Check the statistical report by summaryRprof().
$by.total
total.time total.pct self.time self.pct
"fun1" 0.14 100.00 0.00 0.00
"ddply" 0.10 71.43 0.00 0.00
"ldply" 0.08 57.14 0.00 0.00
".fun" 0.06 42.86 0.06 42.86
".Call" 0.06 42.86 0.00 0.00
"" 0.06 42.86 0.00 0.00
"llply" 0.06 42.86 0.00 0.00
"loop_apply" 0.06 42.86 0.00 0.00
"paste" 0.04 28.57 0.02 14.29
"[[" 0.02 14.29 0.02 14.29
"structure" 0.02 14.29 0.02 14.29
"substr" 0.02 14.29 0.02 14.29
"list_to_dataframe" 0.02 14.29 0.00 0.00
"rbind.fill" 0.02 14.29 0.00 0.00
"splitter_d" 0.02 14.29 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 0.14
Explanation of data:
◾ $by.self: time spent in each function itself. self.time is the time spent in the function's
own code, and total.time is the accumulated time including the functions it calls.
◾ $by.total: time spent in each function including its callees. self.time and total.time have
the same meaning as in $by.self.
We can find from $by.self that most of the time is spent on .fun.
◾ .fun: the time of actual running is 0.06, accounting for 42.86% of the total sampling time
◾ paste: the time of actual running is 0.02, accounting for 14.29% of the total sampling time
◾ "[[": the time of actual running is 0.02, accounting for 14.29% of the total sampling time
◾ "structure": the time of actual running is 0.02, accounting for 14.29% of the total sampling time
◾ "substr": the time of actual running is 0.02, accounting for 14.29% of the total sampling time
From $by.total we can read the accumulated time of some of the functions:
◾ fun1: the accumulated time of running is 0.14, accounting for 100% of total accumulated
time of running. The time of actual running is 0.00.
◾ .fun: the accumulated time of running is 0.06, accounting for 42.86% of total accumulated
time of running. The time of actual running is 0.06.
◾ paste: the accumulated time of running is 0.04, accounting for 28.57% of total accumulated
time of running. The time of actual running is 0.02.
◾ splitter_d: the accumulated time of running is 0.02, accounting for 14.29% of total
accumulated time of running. The time of actual running is 0.00.
Now we know the CPU time of every function called. To optimize performance, we’ll start
from the function that cost the most time.
# Install stockPortfolio.
> install.packages("stockPortfolio")
# Load stockPortfolio.
> library(stockPortfolio)
> fileName <- "Rprof2.log"
From the eight printed records that cost the most time, we can see that most of the time of
actual running (self.time) is spent on file:2.02, scan:4.64.
# Install profr.
> install.packages("profr")
# Load profr.
> library(profr)
# Use ggplot2 to draw graphics.
# Load ggplot2.
> library(ggplot2)
The first case of data visualization is stock data analysis. The following code will generate
Figures 3.1 and 3.2.
> file<-"fun1_rprof.out"
# Graphic by plot.
> plot(parse_rprof(file))
# Graphic by ggplot2.
> ggplot(parse_rprof(file))
The second case of data visualization is data download. The following code will generate
Figures 3.3 and 3.4.
[Figures 3.1 and 3.2: call stacks of fun1, Level against Time (0.00 to 0.14), drawn by plot and by ggplot2]
[Figures 3.3 and 3.4: call stacks of the data download case, Level against Time, drawn by plot and by ggplot2]
Options:
  -h, --help       print short help message and exit
  -v, --version    print version info and exit
  --lines          print line information
  --total          print only by total
  --self           print only by self
  --linesonly      print only by line (implies --lines)
  --min%total=     minimum % to print for 'by total'
  --min%self=      minimum % to print for 'by self'
   %       total       %        self
 total    seconds     self    seconds    name
 100.0      0.14       0.0      0.00     "fun1"
  71.4      0.10       0.0      0.00     "ddply"
  57.1      0.08       0.0      0.00     "ldply"
  42.9      0.06      42.9      0.06     ".fun"
  42.9      0.06       0.0      0.00     ".Call"
  42.9      0.06       0.0      0.00     ""
  42.9      0.06       0.0      0.00     "llply"
  42.9      0.06       0.0      0.00     "loop_apply"
  28.6      0.04      14.3      0.02     "paste"
  14.3      0.02      14.3      0.02     "[["
  14.3      0.02      14.3      0.02     "structure"
  14.3      0.02      14.3      0.02     "substr"
  14.3      0.02       0.0      0.00     "list_to_dataframe"
   %        self       %       total
  self    seconds    total    seconds    name
  42.9      0.06      42.9      0.06     ".fun"
  14.3      0.02      28.6      0.04     "paste"
  14.3      0.02      14.3      0.02     "[["
  14.3      0.02      14.3      0.02     "structure"
  14.3      0.02      14.3      0.02     "substr"
Here is the report that displays only the parts in which indicator total accounts for more than
50% of the time.
   %       total       %        self
 total    seconds     self    seconds    name
 100.0      0.14       0.0      0.00     "fun1"
  71.4      0.10       0.0      0.00     "ddply"
  57.1      0.08       0.0      0.00     "ldply"
We can optimize the performance of code by using Rprof, and the computing power will no
longer be the bottleneck.
More and more people are starting to explore the field of data visualization. Images can be more
expressive than words, and interactive images based on HTML offer better visualization than
static PNG images. R is well prepared for data visualization: ggplot2 for general graphics,
ggmap for world maps, quantmod for stock charts, and googleVis for interactive visualization
based on HTML. We can turn data into images, and even make the images dynamic, with just a few
lines of code. This section uses the performance report as an entry point to introduce a
visualization package of R, lineprof.
1. Performance function:
a. focus: set the zoom of display height
b. auto_focus: set the zoom of display height automatically
c. lineprof: record the occupation of CPU and RAM
d. shine: output by shiny
2. Inner function: ancillary function
a. align: align the source code
b. find_ex: load demo
c. line_profile: output of formatted data of performance monitoring (Rprof)
d. parse_prof: formatted output
e. reduce_depth: set the depth of output
# Start R.
~ R
# Load devtools.
> library(devtools)
# Load lineprof.
> library(lineprof)
In the resource of this case, read-delim.r is the script file of objective function, wine.csv is the
test set, and x:lineprof is the data report generated.
Check read-delim.r.
all
}
Load wine.csv.
> df<-read.csv(file=wine)
Use shinySlickgrid to visualize data of performance indicators and output the result in the
form of a webpage.
# Load shinySlickgrid.
> library(shinySlickgrid)
# Start shiny.
> shine(x)
Loading required package: shiny
Shiny URLs starting with /lineprof will mapped to /home/conan/R/x86_64-pc-linux-gnu-library/3.0/lineprof/www
Shiny URLs starting with /slickgrid will mapped to /home/conan/R/x86_64-pc-linux-gnu-library/3.0/shinySlickgrid/slickgrid
Shiny opens a Web server in the background. The default access port is 6742. Remote access is
available through a browser. Open a browser and go to http://192.168.1.201:6742, as in
Figure 3.5.
The table on the webpage in Figure 3.5 has six columns. # is the line number, source code is
the source code of the monitored function, t is the total time (in seconds) spent on the line,
r is the memory released, a is the memory allocated, and d is the number of duplications. The
following explains the performance data of the function.
◾ #6: used for loading data. Total time: 0.309 s. Memory allocated: 0.064 MB. Number of
duplications: 14
◾ #15: used for clearing data. Total time: 0.179 s. Memory allocated: 0.065 MB. Number of
duplications: 37
By using the visualization tool lineprof, we can produce more flexible, more intuitive, and
better-looking reports, or even interactive Web reports. If you compare the effect with that
of Section 3.2, you will likely be surprised by how powerful R is and how rapidly R can
progress. R has great potential, but it still needs more people to push it forward. I hope
that you can also contribute to the progress of R.
R SERVER II
Chapter 4
Cross-Platform
Communication of R
This chapter mainly introduces four tool packages of cross-platform communication of R. It will
help readers implement the communication between R, Java, and JavaScript.
Rserve provides us with a new choice. It is an abstract network interface to R that implements
communication between languages over the Transmission Control Protocol/Internet Protocol
(TCP/IP). Through client/server (C/S) calls, Rserve supports communication between R and other
languages, including C/C++, Java, PHP, Python, Ruby, and Node.js. Rserve supports many
functions, including remote connection, user authentication, and file transfer. We can use R as
a background service engine to process tasks such as statistical modeling, data analysis, and
plotting. This section explores the cross-platform communication between Rserve and Java.
Note: Rserve supports both Windows 7 and Linux. Because Rserve is mainly used as a
communication server, Linux is recommended.
The installation of Rserve is as follows:
# Start R.
~ R
# Install Rserve.
> install.packages("Rserve")
installing via 'install.libs.R' to /usr/local/lib/R/site-library/Rserve
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (Rserve)
# Start Rserve.
~ R CMD Rserve
R version 3.0.1 (2013-05-16) — "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
Here the Rserve server has been started on port 6311. 127.0.0.1 indicates that only local
applications can access it. To access Rserve remotely, we need to enable remote mode by adding
the parameter --RS-enable-remote to the startup command.
0.0.0.0 indicates that the restriction on IP access is lifted and we can now access Rserve
remotely.
◾ REngine.jar: used for mapping between R data types and Java data types
◾ RserveEngine.jar: used by the communication program of Rserve
The documentation of these two libraries is in the official javadoc at http://rforge.net/org/doc/.
These two JAR packages are compiled Java binaries, and no source code file is included.
◾ main(): the entry point of the Java program; it instantiates a demo object and calls
callRserve().
◾ callRserve(): creates a socket connection to access Rserve remotely and sends two R
statements to the Rserve server as character strings. Rserve runs the calculation and returns
the result, which is printed in Java.
package org.conan.r.rserve;
import org.rosuda.REngine.REXP;
import org.rosuda.REngine.REXPMismatchException;
import org.rosuda.REngine.Rserve.RConnection;
import org.rosuda.REngine.Rserve.RserveException;
public class Demo1 {
    /**
     * Start Java by main.
     */
    public static void main(String[] args) throws RserveException,
            REXPMismatchException {
        Demo1 demo = new Demo1();
        demo.callRserve();
    }
    /**
     * Access Rserve.
     */
    public void callRserve() throws RserveException,
            REXPMismatchException {
        // Create the connection to the Rserve server.
        RConnection c = new RConnection("192.168.1.201");
        // Run an R statement.
        REXP x = c.eval("R.version.string");
        // Print out the result in Java.
        System.out.println(x.asString());
        // Run rnorm(10).
        double[] arr = c.eval("rnorm(10)").asDoubles();
        // Print out the result in a loop.
        for (double a : arr) {
            System.out.print(a + ",");
        }
        // Close the connection when done.
        c.close();
    }
}
Thus we’ve easily implemented the communication between Java and R through Rserve. More
precisely, we’ve implemented communication over TCP/IP by using Java to access the Rserve
server. With the communication problem solved, we can apply R in more fields. This section is
only a simple introduction to the installation and launch of Rserve. For the detailed use and
configuration of the Rserve server, please see Section 6.1.
Rserve, as an important communication interface of R, has become an important channel for
extending R. But because Rserve is a low-level interface, calling its API from Java is rather
difficult. This led to the creation of Rsession. Rsession wraps Rserve and provides a
higher-level API, including Rserve server control and a multisession mechanism, and it supports
Windows. Rsession makes it easier for Java to call the Rserve API and provides a simpler way
for Java to access remote or local Rserve instances, whereas the other library for
communication between R and Java, JRI, does not support a multisession mechanism. We introduce
JRI in the next section.
◾ Windows 7 64bit
◾ R: 3.0.1 x86_64-w64-mingw32/x64 (64-bit)
Rsession is built with Ant, so we can compile and package it ourselves. Ant is a build
automation tool for Java.
clean:
clean-dist:
init:
[mkdir] Created dir: d:\workspace\java\rsession\Rsession\build
[mkdir] Created dir: d:\workspace\java\rsession\Rsession\dist\lib
resource:
[copy] Copying 28 files to d:\workspace\java\rsession\Rsession\dist\lib
[copy] Copied 12 empty directories to 1 empty directory under d:\workspace\java\rsession\Rsession\dist\lib
compile:
[javac] d:\workspace\java\rsession\Rsession\build.xml:33: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 10 source files to d:\workspace\java\rsession\Rsession\build
dist:
[jar] Building jar: d:\workspace\java\rsession\Rsession\dist\lib\Rsession.jar
[zip] Building zip: d:\workspace\java\rsession\Rsession\dist\libRsession.zip
BUILD SUCCESSFUL
Total time: 2 seconds
Running the ant command shown above generates a distribution, libRsession.zip, in
d:\workspace\java\rsession\Rsession\dist\. The manually compiled distribution contains
jmatharray.jar, which does not exist in the directly downloaded distribution. I think this
jar file might be a dependency needed only at compile time. It is not needed at run time
and does not affect execution.
1. Interfaces
– BusyListener: listens to the state of the R engine
– EvalListener: listens to the R scripts run by the R engine
– Logger: for log output
– UpdateObjectsListener: listens to changes in the environment while R is running
2. Classes
– Rdaemon: daemon process of Rserve
– RLogPanel: a panel that displays the R log
– RObjectsPanel: a control that displays R variables
– RserverConf: connection configuration of Rserve instances
– Rsession: connects to Rserve instances
– StartRserve: starts a local Rserve
The environment configuration of the Rserve server is the same as that used for RSclient
in Section 5.2.
package org.conan.r.rsession;
import java.io.File;
import java.util.Properties;
import org.math.R.RserverConf;
import org.math.R.Rsession;
import org.rosuda.REngine.REXPMismatchException;
/**
* "Main" to start java.
*/
public static void main(String args[]) throws
REXPMismatchException {
//Run R script.
double[] rand = s.eval("rnorm(5)").asDoubles();
System.out.println(rand);
// Create an R object.
s.set("demo", Math.random());
s.eval("ls()");
// Delete demo.
s.rm("demo");
s.eval("ls()");
s.end();
}
}
// Run R script.
double[] rand = s.eval("rnorm(5)").asDoubles();
for(double ran:rand){
System.out.print(ran+",");
}
// Log output.
[eval] rnorm(5)
org.rosuda.REngine.REXPDouble@5f934ad[5]
{0.08779203903807914,0.039929482749452114,-0.8788534039223883,
-0.8875740206608903,-0.8493446334021442}
0.08779203903807914,0.039929482749452114,-0.8788534039223883,
-0.8875740206608903,-0.8493446334021442
// Create an R object.
s.set("demo", Math.random());
s.eval("ls()");
// Delete demo.
s.rm("demo");
s.eval("ls()");
// Log output.
[set] demo
s.set("df", new double[][] {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11,
12}}, "x1", "x2", "x3");
double df$x1_3 = s.eval("df$x1[3]").asDouble();
System.out.println(df$x1_3);
s.rm("df");
// Log output.
[set] df
[Figure: plot of rnorm(10) against Index]
// Log output.
<html> Min. 1st Qu. Median Mean 3rd Qu. Max. <br/>
-2.332000 -0.659900 0.036920 0.004485 0.665800 2.517000 </html>
// Log output.
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.19700 -0.65330 -0.09893 -0.07190 0.53300 2.29000
// Log output.
trying to load package sensitivity
package sensitivity is not installed.
package sensitivity not yet installed.
[eval] install.packages('sensitivity',repos='http://cran.cict.fr/',
dependencies=TRUE)
org.rosuda.REngine.REXPNull@4d47c5fc
request package sensitivity install...
package sensitivity is not installed.
! package sensitivity installation failed.
Impossible to install package sensitivity !
Rsession turns out to be more user-friendly than the Java API of Rserve described in
Section 5.1. Rsession encapsulates the process of calling R from Java, which makes it easier
for those with a Java background to get started and to do statistical computing inside Java
applications. Let’s be creative!
Java has dominated the industry for a long time. The Java syntax, the JVM, the JDK, and Java
open source libraries have all grown explosively and now cover almost every field of
application development. As Java covers more fields, problems also appear. Java is becoming
harder to learn, because the syntax keeps getting more complex and similar projects are
created every day. It is even harder for statistical practitioners without an IT background.
R has always been an outstanding language for statistics, with a simple syntax and a moderate
learning curve. Combining the generality of Java with the specialization of R is very useful.
This section introduces rJava, a high-speed channel that connects R and Java and enables
two-way communication.
I suggest installing rJava with root privileges. Because rJava is a basic package for
cross-platform calls, running as root reduces permission-check errors on Linux when
cross-platform programs are called.
We assume that the Java environment is already installed before we install rJava. For the
installation of the Java environment, please refer to Appendix A.
# Install rJava.
> install.packages("rJava")
installing via 'install.libs.R' to /usr/local/lib/R/site-library/rJava
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (rJava)
# Load rJava.
> library(rJava)
# Start JVM.
> .jinit()
# Same as .jcall(s,"I","length").
> s$length()
[1] 12
# Same as .jcall(s,"I","indexOf","World").
> s$indexOf("World")
[1] 6
Install rJava in R.
# Start R.
~ R
# Install rJava.
> install.packages("rJava")
# Load rJava.
> library(rJava)
> .jinit()
◾ Method main(): the entry point of the program. It instantiates a demo object and calls
the method callRJava().
◾ Method callRJava(): creates an Rengine object and starts the R engine in the JVM. It passes
two R statements to the R engine as character strings and prints the results in Java.
package org.conan.r.rjava;
import org.rosuda.JRI.Rengine;
/**
* Method main for starting Java applications.
*/
public static void main(String[] args) {
DemoRJava demo = new DemoRJava();
demo.callRJava();
}
-Djava.library.path="C:\Program Files\R\R-3.0.1\library\rJava\jri\x64"
Run result:
~ mkdir /home/conan/R/DemoRJava
~ cd /home/conan/R/DemoRJava
~ ls -l
-rw-r--r-- 1 conan conan 1328 Aug 8 2013 DemoRJava.jar
~ export R_HOME=/usr/lib/R
~ java -Djava.library.path=/usr/local/lib/R/site-library/rJava/jri
-cp /usr/local/lib/R/site-library/rJava/jri/JRI.jar:/home/conan/R/
DemoRJava/DemoRJava.jar org.conan.r.rjava.DemoRJava
Thus we’ve achieved two-way calling between R and Java using rJava and JRI in both Windows 7
and Linux Ubuntu.
Programmers who do not use Node.js in Web development may have fallen a little behind.
Node.js is a platform for developing back-end programs in JavaScript. In my personal
experience, development with it is more efficient than with PHP. Node.js adopts a fully
asynchronous model, and it has great potential to surpass PHP in performance.
HTML5, as a Web front end that relies heavily on JavaScript, offers an abundance of beautiful
effects. If we use HTML5 to re-render the images produced by R and add communication and
user interaction to them, the result combines the advantages of both languages and will
surely be impressive. The following is an introduction to cross-platform communication
between R and Node.js.
Rserve environment: Rserve v1.7-1. Port: 6311, remote access allowed. View the Rserve
process:
~ ps -aux|grep Rserve
conan 9736 0.0 1.2 116288 25440 ? Ss 13:11 0:01
/usr/lib/R/bin/Rserve --RS-enable-remote
◾ Node: v0.10.5
◾ NPM: 1.2.19
◾ IP: 192.168.1.13
◾ Express: 3.2.2
We’ve created an Express 3 project in Windows 7. The project directory is
D:\workspace\project\investment\webui.
~ D:\workspace\project\investment\webui>ls
README.md app.js models node-rio-dump.bin node_modules package.json
public routes views
~ vi app.js
// Omit.
var vis = require('./routes/vis')
app.get('/vis/rio',vis.rio);
~ vi /routes/vis.js
var rio = require("rio");
exports.rio = function (req, res) {
  var options = {
    // Address of the remote Rserve server.
    host: "192.168.1.201",
    // Port of the remote Rserve server.
    port: 6311,
    // Callback that receives the computed result.
    callback: function (err, val) {
      if (!err) { // Normal return.
        console.log("RETURN:" + val);
        // Send the result back to the browser.
        return res.send({'success': true, 'res': val});
      } else { // Error return.
        console.log("ERROR:Rserve call failed");
        return res.send({'success': false});
      }
    }
  };
  rio.enableDebug(true); // Turn on debug mode.
  rio.evaluate("pi/2 * 2 * 2", options); // Run the R code.
};
With the preceding code, rio connects to the remote Rserve server. The statement (pi/2*2*2)
is passed to the remote Rserve server as a character string, and the return value is received
through the callback.
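The callback in the route above does two things at once: it decides between the normal and the error case, and it writes the HTTP response. As a minimal sketch, the first part can be pulled out into a pure function, which makes the JSON shape easy to check without a running Rserve instance. The helper name toJsonBody is hypothetical; it is not part of rio or Express.

```javascript
// Hypothetical helper (not part of rio or Express): builds the JSON body
// that the route sends back, following the callback(err, val) convention.
function toJsonBody(err, val) {
  if (err) {
    // Error case: report failure without a result.
    return { success: false };
  }
  // Normal case: pass the value computed by R through unchanged.
  return { success: true, res: val };
}

// The same arithmetic as the R expression pi/2 * 2 * 2, which is
// evaluated left to right: ((pi / 2) * 2) * 2 = 2 * pi.
var expected = Math.PI / 2 * 2 * 2;

console.log(JSON.stringify(toJsonBody(null, expected)));
console.log(JSON.stringify(toJsonBody(new Error("Rserve call failed"))));
```

Inside the route, the callback body would then reduce to res.send(toJsonBody(err, val)).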
Open http://localhost:3000/vis/rio in a browser. The Web page shows that the computed result
of (pi/2*2*2) is 6.283185307179586. The value returned to the browser is a JSON object.
{
"success": true,
"res": 6.283185307179586
}
Connected to Rserve
Supported capabilities --------------
Data packet
00000000: 2108 0000 182d 4454 fb21 1940 !....-DT{!.@
Type SEXP 33
Response value: 6.283185307179586
RETURN:6.283185307179586
GET /vis/rio 200 33ms - 49b
Disconnected from Rserve
Closed from Rserve
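The "Data packet" line in the debug log above can be decoded by hand, which shows what actually travels over the socket. The sketch below assumes the layout that matches this dump: one type byte (0x21 = 33, the "Type SEXP 33" in the log, an array of doubles), a 3-byte little-endian payload length, and then 8-byte little-endian IEEE 754 doubles. It ignores attribute and large-object flags, so it is an illustration rather than a complete Rserve protocol parser, and decodeDoubleArray is a hypothetical name.

```javascript
// Decode an array-of-doubles expression from the hex words printed in the
// rio debug log. Assumes: 1 type byte, 3 length bytes, then doubles.
function decodeDoubleArray(hexDump) {
  // Drop the spaces between the 16-bit groups and read the raw bytes.
  var buf = Buffer.from(hexDump.replace(/\s+/g, ""), "hex");
  var type = buf.readUInt8(0);        // expression type
  var length = buf.readUIntLE(1, 3);  // payload length in bytes (24 bit)
  var values = [];
  for (var offset = 4; offset < 4 + length; offset += 8) {
    values.push(buf.readDoubleLE(offset));
  }
  return { type: type, length: length, values: values };
}

// The packet logged for pi/2 * 2 * 2.
var packet = decodeDoubleArray("2108 0000 182d 4454 fb21 1940");
console.log(packet);
```

For the rnorm(10) call later in this section the packet starts with 21 50 00 00: the same type, with a payload of 0x50 = 80 bytes, that is, ten doubles.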
The log shows the communication between Node.js and Rserve: the response value is
6.283185307179586, the same as the value displayed on the page. Now let’s modify the
R script to be run. rnorm(10) draws 10 random numbers from the standard normal distribution,
N(0,1).
rio.evaluate("rnorm(10)",options);//Run R codes.
{
"success": true,
"res": [
-0.011531884725262991,
0.5106443501593562,
-0.05216533321965309,
1.9221980152236238,
0.5205238122633465,
-0.3275367539102907,
-0.06588102930129405,
1.5410418730008988,
1.308169913050071,
0.005044179478212583
]
}
Connected to Rserve
Supported capabilities --------------
Data packet
00000000: 2150 0000 f6ca 0c5e 079e 87bf 9b4a fad1 !P..vJ.^...?.JzQ
00000010: 3257 e03f eda2 5320 6ab5 aabf 2b25 bdb4 2W`?m"S.j5*?+%=4
00000020: 52c1 fe3f ebba ce8d 21a8 e03f bc17 92b7 RA~?k:N.!(`?<..7
00000030: 5cf6 d4bf ca9f 4642 94dd b0bf 1be3 e485 \vT?J.FB.]0?.cd.
00000040: 1ba8 f83f 5a94 2293 43ee f43f 1724 4e9e .(x?Z.".Cnt?.$N.
00000050: 34a9 743f 4)t?
Type SEXP 33
Response value: -0.011531884725262991,0.5106443501593562,
-0.05216533321965309,1.9221980152236238,0.5205238122633465,
-0.3275367539102907,-0.06588102930129405,1.5410418730008988,
1.308169913050071,0.005044179478212583
RETURN:-0.011531884725262991,0.5106443501593562,
-0.05216533321965309,1.9221980152236238,0.5205238122633465,
-0.3275367539102907,-0.06588102930129405,1.5410418730008988,
1.308169913050071,0.005044179478212583
GET /vis/rio 200 30ms - 285b
Disconnected from Rserve
Closed from Rserve
Thus we’ve achieved the cross-platform communication between R and Node.js, which seems
similar to the concept of “getting to a new level” in Chinese Kung-fu.
Chapter 5
Server Implementation of R
This chapter mainly introduces four tool packages for running R as a server, which help
readers use R to build a Socket server, a Web server, and a WebSocket server.
We’ve met Rserve for the connection between R and Java in Section 4.1. Now let’s learn more
details of Rserve.
Many projects depend on Rserve as a Transmission Control Protocol/Internet Protocol (TCP/IP)
communication interface between R and other languages. The server configuration and
operation of Rserve are easy to learn, and clients are implemented in many languages,
including C/C++ and Java. R has its own client as well, the RSclient project, which is
introduced in the next section. This section provides a detailed discussion of the
configuration and use of Rserve as a server application.
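Because Rserve is a plain TCP/IP service, a client in any language starts the same way: it opens a socket to the server and reads a short identification string before sending commands. The sketch below parses such a string. It assumes the 32-byte layout described in the Rserve protocol documentation (the signature "Rsrv", a 4-character protocol version, the protocol name such as "QAP1", then optional 4-character attributes, with "-", spaces, and line breaks as filler). The sample string and the function name parseRserveId are illustrative and were not captured from a real server.

```javascript
// Parse the identification string that an Rserve server sends on connect.
// parseRserveId is a hypothetical helper written for this illustration.
function parseRserveId(id) {
  if (id.length !== 32) {
    throw new Error("expected a 32-character ID string");
  }
  if (id.slice(0, 4) !== "Rsrv") {
    throw new Error("not an Rserve server");
  }
  var attributes = [];
  // Everything after the first 12 characters is a list of 4-character
  // attributes; filler characters are ignored.
  for (var i = 12; i < 32; i += 4) {
    var attr = id.slice(i, i + 4).replace(/[-\r\n ]/g, "");
    if (attr.length > 0) {
      attributes.push(attr);
    }
  }
  return {
    version: id.slice(4, 8),
    protocol: id.slice(8, 12),
    attributes: attributes
  };
}

// Illustrative sample: 12 + 4 + 14 + 2 = 32 characters.
var sampleId = "Rsrv0103QAP1" + "\r\n\r\n" + "--------------" + "\r\n";
console.log(parseRserveId(sampleId));
```

A real client would read these 32 bytes from the socket (port 6311 by default) and refuse to continue if the signature or protocol does not match what it supports.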
Note: Rserve supports both Windows 7 and Linux. Because Rserve is mainly used as a
communication server, Linux is the recommended platform.
The following is the installation of Rserve:
# Start R.
~ R
# Install Rserve.
> install.packages("Rserve")
# Load Rserve.
> library(Rserve)
There are two ways to start the Rserve server: from within R and from the command line.
To start the Rserve server from within R, we use the functions of the Rserve package.
> library(Rserve)
~ ps aux | grep R
conan 8799 0.1 1.5 121748 32088 pts/0 S+ 22:30 0:00
/usr/lib/R/bin/exec/R
conan 8830 0.0 1.2 116336 25044 ? Ss 06:46 0:00
/home/conan/R/x86_64-pc-linux-gnu-library/3.0/Rserve/libs//Rserve
In this case the current R session is not interrupted. Instead, a separate Rserve instance
is started in the system background. By contrast, run.Rserve() starts Rserve inside the
current R session.
~ netstat -nltp|grep R
tcp 0 0 127.0.0.1:6311 0.0.0.0:*
LISTEN 30664/R
Start Rserve in the command line and open the remote access mode.
~ ps -aux|grep Rserve
conan 27639 0.0 1.2 116288 25236 ? Ss 20:41 0:00
/usr/lib/R/bin/Rserve --RS-enable-remote
config file:/etc/Rserv.conf
working root: /tmp/Rserv
port: 6311
local socket: [none, TCP/IP used]
authorization required: no
plain text password: not allowed
passwords file: [none]
allow I/O: yes
allow remote access: no
control commands: no
interactive: yes
max.input buffer size: 262144 kB
~ sudo vi /etc/Rserv.conf
workdir /tmp/Rserv
remote enable
fileio enable
interactive yes
port 6311
maxinbuf 262144
encoding utf8
control enable
source /home/conan/R/RServe/source.R
eval xx=1
The option “source” specifies a file to be loaded when the Rserve server starts, for example
to initialize system variables and system functions. The option “eval” evaluates an
R expression at startup; here it defines the variable xx.
Add the initialization script of the Rserve server.
~ vi /home/conan/R/RServe/source.R
cat("This is my Rserve!!")
print(paste("Server start at",Sys.time()))
~ R CMD Rserve
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
~ ps -aux|grep Rserve
conan 28339 0.0 1.2 116292 25240 ? Ss 22:31 0:00
/usr/lib/R/bin/Rserve
~ R
> library(RSclient) # Load RSclient.
> conn<-RS.connect()
> RS.eval(conn,rnorm(10))
[1] 0.03230305 0.95710725 -0.33416069 -0.37440009 -1.95515719
-0.22895924
[7] 0.39591984 1.67898842 -0.01666688 -0.26877775
Modify the configuration to add user login authentication and to permit plain text
passwords. Edit the file /etc/Rserv.conf.
~ sudo vi /etc/Rserv.conf
workdir /tmp/Rserv
remote enable
fileio enable
interactive yes
port 6311
maxinbuf 262144
encoding utf8
control enable
source /home/conan/R/RServe/source.R
eval xx=1
auth required
plaintext enable
If we now use RSclient to access Rserve directly, an authentication error is reported.
> library(RSclient)
> conn<-RS.connect()
> RS.eval(conn,rnorm(10))
Error in RS.eval(conn, rnorm(10)) :
command failed with status code 0x41: authentication failed
> library(RSclient)
> conn<-RS.connect()
> RS.login(conn,"conan","conan",authkey=RS.authkey(conn))
[1] TRUE
> RS.eval(conn,rnorm(5))
[1] -1.19827684 0.72164617 0.22225934 0.09901505 -1.54661436
The login authentication here is bound to operating system users. We could also specify
the uid and gid parameters in Rserv.conf to control the server permissions in more detail.
This section gives a detailed introduction to the installation, start, configuration, and use of
Rserve. With this knowledge, we can now use Rserve to construct enterprise online applications.
Note: Rserve supports both Windows 7 and Linux. Because Rserve is mainly used as a
communication server, Linux is the recommended platform.
Start the Rserve server.
~ ps -aux|grep Rserve
conan 28339 0.0 1.2 116292 25240 ? Ss 22:31 0:00
/usr/lib/R/bin/Rserve
Rserve environment:
After Rserve is started, we’ll use RSclient to access the Rserve server.
◾ Windows 7 64bit
◾ R: 3.0.1 x86_64-w64-mingw32/x64 64bit
# Start R.
~ R
# Install RSclient.
> install.packages("RSclient")
# Load RSclient.
> library(RSclient)
In the new version of RCC, API function names are separated by “.”, as follows.
This section mainly introduces the API of the new version. The following are the client
operation functions:
Server management functions (--RS-enable-control must be set when Rserve is started) are as
follows:
> library(RSclient)
# Run script.
> RS.eval(conn,rnorm(5))
[1] -2.6762608 1.4435144 -0.4298395 -0.7046573 -1.4056073
# Set variables.
> RS.assign(conn,"xx",99)
raw(0)
> RS.eval(conn,xx-55)
[1] 44
# Synchronous execution.
> RS.eval(conn,head(rnorm(10000000)),wait=TRUE)
[1] -4.20217390 0.22353317 -1.70256992 0.30053213 -0.01427486
-0.70522254
# Close connection.
> RS.close(conn)
NULL
> conn
Closed Rserve connection 0x000000000445cc80
~ R
> library(RSclient)
> conn<-RS.connect(host="192.168.1.201")
> RS.login(conn,"conan","conan",authkey=RS.authkey(conn))
> RS.assign(conn,"A",1234)
> RS.eval(conn,A)
[1] 1234
> RS.eval(conn,getwd())
[1] "/tmp/Rserv/conn29039"
Operation of client B.
~ R
> library(RSclient)
> conn<-RS.connect(host="192.168.1.201")
> RS.login(conn,"conan","conan",authkey=RS.authkey(conn))
> RS.assign(conn,"B",5678)
> RS.eval(conn,B)
[1] 5678
> RS.eval(conn,ls())
[1] "B"
> RS.eval(conn,getwd())
[1] "/tmp/Rserv/conn29040"
We can see that after the connections of client A and client B are created, Rserve on the
server side works in two separate spaces, so the sessions of client A and client B are
isolated from each other. So how could two clients communicate? An obvious answer is global
variables: set global variables in the Rserve server. The code is as follows.
An “access rejected” error is reported. Although we have turned on --RS-enable-control, the
error persists. We don’t know whether this is a bug in Rserve, but global variables have
proved impracticable here. Instead, the two clients can interact indirectly through
intermediate storage, keeping the data to be exchanged in a store such as MySQL or Redis.
We’ve achieved a remote connection between the Rserve server and R through RSclient.
Thinking further, we could construct a step-by-step computing environment with Rserve and
RSclient. With the help of other open source techniques, R can also meet the standard of
an online application.
R has long been used as a client program on personal computers. We are familiar with the
whole process: downloading the installation package, installing it on the desktop, writing
algorithms, and running the code. Then we publish our work as images or documents. But if we
can run R on the server side and publish the results on the Web, we will be working in an
Internet style! FastRWeb provides a way to implement an R application with a Browser/Server
(B/S) architecture.
[Figure: FastRWeb architecture. Labels: Browser, HTTP, Web server, Proxy, Rserve, R runtime
engine, run(), R object, PNG, HTML/CSS/JS]
~ R
> install.packages("FastRWeb")
Because FastRWeb depends on Cairo, the Cairo library needs to be installed on Linux.
Please refer to Section 1.7 for this part. Now let’s install Rserve. For the installation
and use of Rserve, please refer to Section 5.1.
> install.packages("Rserve")
~ cd /var/FastRWeb/code/
~ ls -l
-rw-r--r-- 1 conan conan 1210 Oct 29 16:07 README
-rw-r--r-- 1 conan conan 79 Oct 29 17:50 rserve.conf
-rw-r--r-- 1 conan conan 2169 Oct 29 17:51 rserve.R
-rwxr-xr-x 1 conan conan 457 Oct 29 17:35 start
README is the help file, rserve.conf contains the startup parameters of Rserve, rserve.R is
the startup script of Rserve, and start is the startup command.
~ vi rserve.conf
http.port 8888
remote enable
source /var/FastRWeb/code/rserve.R
control enable
By default, Rserve provides a socket communication interface. For the convenience of Web
testing, we’ll use an HTTP interface instead. Modify the file rserve.R and add two lines of
code at the top.
~ vi rserve.R
# Content added.
library(FastRWeb)
.http.request <- FastRWeb:::.http.request
Now we’ve completed the modification to use the HTTP protocol as the communication interface.
~ sudo ./start
R CMD Rserve --RS-conf /var/FastRWeb/code/rserve.conf --vanilla
--no-save
--RS-enable-remote
We can see from the startup log that XML, Cairo, Matrix, FastRWeb, and Rserve have all
been loaded and run normally. View the system process and port:
~ ps -aux|grep Rserve
conan 23739 0.0 1.4 120140 28916 ? Ss 16:47 0:00
/usr/lib/R/bin/Rserve --RS-conf /var/FastRWeb/code/rserve.conf
--vanilla --no-save --RS-enable-remote
Two ports have been opened: the Rserve socket port 6311 and the HTTP port 8888. Open
http://192.168.1.201:8888/example1.png in a browser to access it through the Web, as in
Figure 5.2.
The R code behind Figure 5.2 is in the file /var/FastRWeb/web.R/example1.png.R.
~ vi /var/FastRWeb/web.R/example1.png.R
Modify the file /var/FastRWeb/web.R/example1.png.R and refresh the browser. The result is
shown in Figure 5.3.
~ vi /var/FastRWeb/web.R/example1.png.R
Thus we’ve run an R script through the Web and drawn graphics on a Web page. There are some
other examples in the directory /var/FastRWeb/web.R/. You can use them as references when
writing your own R scripts.
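The relationship between the request URL and the R script can be summarized in a few lines. The sketch below only restates the mapping observed in this section (a request for /example1.png is answered by /var/FastRWeb/web.R/example1.png.R). The function name scriptPathFor is hypothetical, and the real dispatch logic of FastRWeb handles more cases than this.

```javascript
// Directory that holds the R scripts, as used in this section.
var WEB_R_DIR = "/var/FastRWeb/web.R";

// Hypothetical helper: maps a request URL to the R script that answers it.
function scriptPathFor(requestUrl) {
  // Keep only the path part, e.g. "/example1.png"; the query string,
  // if any, is not part of the script name.
  var pathname = new URL(requestUrl).pathname;
  return WEB_R_DIR + pathname + ".R";
}

console.log(scriptPathFor("http://192.168.1.201:8888/example1.png"));
```

Read the other way round, adding a new script under the web.R directory is what makes a new URL available on port 8888.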
[Figure: plot, axis label rnorm(100)]
~ ls -l /var/FastRWeb/web.R
total 32
-rw-r--r-- 1 conan conan 790 Oct 29 16:07 common.R
-rw-r--r-- 1 conan conan 316 Oct 29 20:01 example1.png.R
-rw-r--r-- 1 conan conan 520 Oct 29 16:07 example2.R
-rw-r--r-- 1 conan conan 174 Oct 29 16:07 index.R
-rw-r--r-- 1 conan conan 215 Oct 29 16:07 info.R
-rw-r--r-- 1 conan conan 64 Oct 29 16:07 main.R
-rw-r--r-- 1 conan conan 167 Oct 29 16:07 README
-rw-r--r-- 1 conan conan 214 Oct 29 16:07 tmp.R
[Figure: plot, axis labels rnorm(400) and rnorm(400)]
According to its description, FastRWeb can communicate with any web server through CGI.
In this way, we can use R scripts on servers written in languages such as PHP, Python, Ruby,
and Java. There is an article on R-bloggers discussing how to set up FastRWeb on XAMPP based
on Apache (http://www.r-bloggers.com/setting-up-fastrweb-on-mac-os-x/). If you are
interested, give it a try!
R has developed from a statistical language to an industrialized language. It not only sup-
ports the basic operation and visualization of the Web, but it also supports WebSocket. Our
Internet application now can interact with R directly through the WebSocket protocol with-
out using Rserve. R has undergone a technological revolution and become more advanced
and convenient.
# Start R.
~ R
# Install websockets.
> install.packages("websockets")
# Load websockets.
> library(websockets)
package 'websockets' was built under R version 3.0.2
websockets depends on caTools, which is a tool set. Please refer to Section 1.8 for more
information about caTools.
In June 2015 the author found that websockets had been removed from the CRAN library on
March 2, 2014, and had since been taken over and maintained by Joe Cheng from RStudio. The
address is http://cran.r-project.org/web/packages/websockets/index.html.
Thus installing websockets through install.packages() reports an error.
> install.packages("websockets")
Installing package into '/home/conan/R/
x86_64-pc-linux-gnu-library/3.0'
(as 'lib' is unspecified)
Warning:
package 'websockets' is not available (for R version 3.0.1)
Errors are reported during the installation. It warns of a lack of the dependent packages caTools
and digest, so we need to install these two packages first.
# Start R.
~ R
# Start R.
~ R
# Load websockets.
> library(websockets)
> library(websockets)
package 'websockets' was built under R version 3.0.2
# Start demo.
> demo(websockets)
~ netstat -nltp|grep r
If you open the page http://192.168.1.201:7681 in your browser, you can see the demo
application implemented by websockets, as in Figure 5.4. Please note that the browser must
support HTML5, so we recommend Chrome.
Output the log of the server:
# Listen "receive".
> recv = function(DATA, WS,...){
+ cat("Receive callback\n")
+ D = ""
+ if(is.raw(DATA)){D = rawToChar(DATA)}
+
+ cat("Callback:You sent",D,"\n")
+ websocket_write(DATA=paste("You sent",D,"\n",collapse=" "),WS=WS)
+}
> set_callback('receive',recv,w)
# Listen "closed".
> cl = function(WS){
+ cat("Websocket client socket ",WS$socket," has closed.\n")
+ }
> set_callback('closed',cl,w)
# Connection established
> es = function(WS){
+ cat("Websocket client socket ",WS$socket," has been
established.\n")
+}
> set_callback('established',es,w)
~ vi client.r
# Listen "receive".
rece<-function(DATA, WS, HEADER) {
D=''
if(is.raw(DATA)){
cat("raw data")
D = rawToChar(DATA)
}
cat("==>",D,"\n")
}
set_callback("receive",rece, client)
# Close connection.
websocket_close(client)
> library(websockets)
> client = websocket("ws://192.168.1.201",port=7681)
> rece<-function(DATA, WS, HEADER) {
+ D=''
+ if(is.raw(DATA)){
+ cat("raw data")
+ D=rawToChar(DATA)
+ }
+ cat("==>",D,"\n")
+}
> set_callback("receive",rece, client)
> websocket_write("2222", client)
[1] 1
> service(client)
raw data ==> You sent 2222
> websocket_close(client)
Client socket 3 was closed.
It can be seen from the output that we’ve achieved the communication process between the
client and the server.
postToServer('browser');
closeConnect();
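The two calls above, postToServer('browser') and closeConnect(), belong to the browser page of the test, whose source is not reproduced in this section. As a rough sketch of what such a client can look like, the code below uses the standard WebSocket API. The function bodies are assumptions made for illustration, and createClient with its injectable constructor is a hypothetical wrapper added so that the logic can be exercised without a live server.

```javascript
// Hypothetical browser-side client. The third argument lets a test pass in
// a fake constructor; otherwise the platform WebSocket is used.
function createClient(url, handlers, WebSocketImpl) {
  var Impl = WebSocketImpl || WebSocket;
  var socket = new Impl(url);

  // Messages may only be sent once the connection is open.
  socket.onopen = function () {
    if (handlers.onOpen) { handlers.onOpen(); }
  };
  // Called for every message the R server writes back.
  socket.onmessage = function (event) {
    if (handlers.onMessage) { handlers.onMessage(event.data); }
  };

  return {
    postToServer: function (text) { socket.send(text); },
    closeConnect: function () { socket.close(); }
  };
}

// Usage in a browser page (needs the R WebSocket server to be running):
// var client = createClient("ws://192.168.1.201:7681", {
//   onOpen: function () { client.postToServer("browser"); },
//   onMessage: function (text) { console.log(text); client.closeConnect(); }
// });
```

With the receive callback defined on the R side earlier in this section, the message handed to onMessage would be the echoed text that begins with "You sent".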
Now we’ve finished the WebSocket server test constructed by R and opened a more convenient
channel for R and other languages.
III
DATABASE AND BIG DATA
Chapter 6
Database and NoSQL
This chapter mainly introduces five tool packages for accessing databases from R, which help
readers connect R with MySQL, MongoDB, Redis, Cassandra, and Hive. The last section of this
chapter introduces a financial big data case based on Hive.
MySQL is the most commonly used open source database software. It is simple to install (for
the installation and configuration of MySQL, please refer to Appendix B) and stable in
operation, which makes it well suited for small and medium-sized data storage. R, as a data
analysis tool, naturally supports a database driver interface. Combining R with MySQL is
a powerful pairing.
# Linux kernel.
~ uname -a
Linux conan 3.5.0-23-generic #35~precise1-Ubuntu SMP Fri Jan 25
17:13:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
# Linux version.
~ cat /etc/issue
Ubuntu 12.04.2 LTS \n \l
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
# Start R.
~ R
# Install RMySQL.
> install.packages('RMySQL')
...
Configuration error:
could not find the MySQL installation include and/or library
directories. Manually specify the location of the MySQL
libraries and the header files and re-run R CMD INSTALL.
INSTRUCTIONS:
export PKG_CPPFLAGS="-I"
export PKG_LIBS="-L -lmysqlclient"
The installation reported errors and prompted us to pass the configuration parameter
--with-mysql-dir pointing to the MySQL installation directory. Let’s fix the error.
~ mysql -uroot -p
# Insert 3 items.
mysql> INSERT INTO t_user(user) values('A1'),('AB'),('fens.me');
Query OK, 3 rows affected (0.04 sec)
Records: 3 Duplicates: 0 Warnings: 0
# Start R.
~ R
# Load RMySQL.
> library(RMySQL)
Loading required package: DBI
# Create Connection.
> conn <- dbConnect(MySQL(), dbname = "rmysql", username="rmysql",
password="rmysql")
# Run SQL.
> users = dbGetQuery(conn, "SELECT * FROM t_user")
# View data.
> users
id user
1 1 A1
2 2 AB
3 3 fens.me
# Disconnect.
> dbDisconnect(conn)
[1] TRUE
Now we’ve achieved the connection between R and MySQL in Linux Ubuntu.
◾ Windows 7 64 bit
◾ Windows character set: gbk, utf8
◾ R: 3.0.1, x86_64-w64-mingw32/x64 (64-bit)
◾ MySQL: mysql Ver 14.14 Distrib 5.6.11, for Win64 (x86_64)
~ R --version
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
http://www.gnu.org/licenses/.
# MySQL version.
~ mysql --version
mysql Ver 14.14 Distrib 5.6.11, for Win64 (x86_64)
# Start R.
~ R
# Install RMySQL.
> install.packages('RMySQL')
package 'RMySQL' is not available (for R version 3.0.1)
Here we meet errors again. The message says that there is no corresponding version of
RMySQL. Because RMySQL does not provide a Windows distribution, we have to compile the
RMySQL package manually in Windows and install it.
~ dir C:\Users\Administrator\AppData\Local\Temp\RtmpsfqQjK\
downloaded_packages
2013-09-24 13:16 165,363 RMySQL_0.9-3.tar.gz
http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
Then there is another error. This time the message says that the dynamic link library file
D:\toolkit\mysql56/bin/libmySQL.dll is missing.
~ mysql -uroot -p
mysql> create database rmysql;
Query OK, 1 row affected (0.04 sec)
# Start R.
~ R
# Load RMySQL.
> library(RMySQL)
Loading required package: DBI
MYSQL_HOME defined as D:\toolkit\mysql56
# Create connection.
> conn <- dbConnect(MySQL(), dbname = "rmysql", username="rmysql",
password="rmysql")
# Run SQL.
> users = dbGetQuery(conn, "SELECT * FROM t_user")
> users
id user
1 1 A1
2 2 AB
3 3 fens.me
# Disconnect.
> dbDisconnect(conn)
[1] TRUE
# Close connection.
> dbDisconnect(conn)
# Query of data.
> d0
row_names a b c
1 1 1 a 0.9886816
2 3 3 c 0.2770364
3 4 4 d 1.3613716
4 6 6 f 1.6123509
5 7 7 g 0.1761607
6 8 8 h 0.2970002
7 9 9 i 0.1903272
6 8 h 0.2970002
7 9 i 0.1903272
# Delete table.
> if(dbExistsTable(conn,'t_demo')){
+ dbRemoveTable(conn, "t_demo")
+ }
[1] TRUE
Special tip: Try not to use dbWriteTable(), as it will delete the original table structure, create a
new table structure, and insert data.
> dbDisconnect(conn)
> conn <- dbConnect(MySQL(), dbname = "rmysql", username="root",
password="",client.flag=CLIENT_MULTI_STATEMENTS)
# Create table.
mysql> CREATE TABLE t_blog(
id INT PRIMARY KEY AUTO_INCREMENT,
title varchar(12) NOT NULL UNIQUE,
author varchar(12) NOT NULL,
length int NOT NULL,
create_date timestamp NOT NULL DEFAULT now()
)ENGINE=INNODB DEFAULT CHARSET=UTF8;
# Insert data.
mysql> INSERT INTO t_blog(title,author,length) values('Hello, this is the
first article','Conan',20),('Coding of RMySQL','Conan',99),('R for
Programmers series','Conan',15);
# Query table.
mysql> select * from t_blog;
+----+----------------------------------+--------+--------+---------------------+
| id | title                            | author | length | create_date         |
+----+----------------------------------+--------+--------+---------------------+
|  1 | Hello, this is the first article | Conan  |     20 | 2013-08-15 00:13:13 |
|  2 | Coding of RMySQL                 | Conan  |     99 | 2013-08-15 00:13:13 |
|  3 | R for Programmers series         | Conan  |     15 | 2013-08-15 00:13:13 |
+----+----------------------------------+--------+--------+---------------------+
3 rows in set (0.00 sec)
> library(RMySQL)
# Run SQL.
> dbSendQuery(conn,"INSERT INTO t_blog(title,author,length) values
('Insert new article in R','Conan',50)");
# Query data.
> query<-dbSendQuery(conn, "SELECT * FROM t_blog")
Warning message:
In mysqlExecStatement(conn, statement, ...) :
RS-DBI driver warning: (unrecognized MySQL field type 7 in column 4 imported
as character)
# Query data.
> data <- fetch(query, n = -1)
> mysqlCloseResult(query)
[1] TRUE
# Display is normal.
> print(data)
id title author length create_date
1 1 Hello, this is the first article Conan 20 2013-08-15 00:13:13
2 2 Coding of RMySQL Conan 99 2013-08-15 00:13:13
3 3 R for Programmers series Conan 15 2013-08-15 00:13:13
4 4 Insert new article in R Conan 50 2013-08-15 00:29:45
> dbDisconnect(conn)
[1] TRUE
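The warning above means that the timestamp column create_date arrives in R as character. A minimal
sketch of converting such a column after fetching, using a hand-built data frame that stands in for
the fetched result (the stand-in data frame and the UTC time zone are assumptions, not part of the
original session):

```r
# Stand-in for the data frame returned by fetch(); values copied from the output above.
data <- data.frame(
  id = c(1L, 4L),
  create_date = c("2013-08-15 00:13:13", "2013-08-15 00:29:45"),
  stringsAsFactors = FALSE
)

# Parse the character column into POSIXct (the time zone is assumed to be UTC here).
data$create_date <- as.POSIXct(data$create_date,
                               format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Date arithmetic now works: seconds between the first and the last insert.
gap_secs <- as.numeric(difftime(data$create_date[2], data$create_date[1],
                                units = "secs"))
print(class(data$create_date))
print(gap_secs)
```

Converting once, right after the fetch, keeps later filtering and plotting code free of string handling.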
Mastering the techniques of RMySQL and understanding how it works helps us reduce errors and work
more efficiently.
MongoDB, a document-oriented NoSQL database, is very flexible to use. It avoids the complex
up-front schema design that a relational database requires. MongoDB stores data in a JSON-based
format and adopts JavaScript as its operating language, which gives users a great deal of freedom.
We can solve some really complex conditional queries by writing code that runs on the MongoDB server.
This section introduces how to connect R with MongoDB through rmongodb.
For the installation and configuration of MongoDB, please refer to Appendix D. Now let's look at
the MongoDB server environment. Use the command /etc/init.d/mongodb to start MongoDB.
The default port is 27017.
# Start mongodb.
~ sudo /etc/init.d/mongodb start
Rather than invoking init scripts through /etc/init.d, use the service(8)
utility, e.g. service mongodb start
Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the start(8) utility, e.g. start mongodb
mongodb start/running, process 1878
Use the command line client program of MongoDB, mongo, to open Mongo Shell. Some
simple operations of Mongo Shell include viewing the database, switching the database, and view-
ing the data set.
# Switch database.
> use foobar
switched to db foobar
Then, we’ll use MongoDB client of R, rmongodb, to access the MongoDB server remotely to make a
test.
mongo.bson.buffer.start.array(buf, "comments")
mongo.bson.buffer.append(buf, "0", "a1")
mongo.bson.buffer.append(buf, "1", "a2")
mongo.bson.buffer.append(buf, "2", "a3")
mongo.bson.buffer.finish.object(buf)
b <- mongo.bson.from.buffer(buf)
ns = "db.blog"
Insert a record.
mongo.insert(mongo,ns,b)
mongo.destroy(mongo)
◾ Win7 64bit
◾ R: 3.0.1 x86_64-w64-mingw32/x64 64bit
Section 6.2.2 introduced some basic functions of the rmongodb library, but we have not yet
installed the rmongodb package.
# Start R.
~ R
# Install rmongodb.
> install.packages("rmongodb")
Then, use mongo.create() to create a connection to the MongoDB server. For a local connection,
mongo.create() needs no parameters. The next example uses a remote connection, so it adds the
host parameter to specify the IP address of the MongoDB server.
Use mongo.is.connected() to check whether the connection is working. This statement is used
a great deal in development. If an object or function raises an error during R modeling, the
connection is closed automatically, and because of MongoDB's exception mechanism no disconnection
message is shown. We need to run this command manually to test whether the connection is still alive.
Next, define two variables, db and ns. db is the database we use, and ns is the database name plus
the collection name.
# Define db.
> db<-"foobar"
# Define db.collection.
> ns<-"foobar.blog"
{
"_id" : ObjectId("51663e14da2c51b1e8bc62eb"),
"name" : "Echo",
"age" : 22,
"gender" : "Male",
"score" : {
"Mike" : 5,
"Jimmy" : 3.5,
"Ann" : 4
},
"comments" : [
"a1",
"a2",
"a3"
]
}
Object class.
Array class.
Then, use the modifiers $inc, $set, and $push to operate. First use $inc to add 1 to age.
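To make the effect of the three modifiers concrete, here is a minimal sketch that emulates them on a
plain R list holding the sample document. It does not talk to MongoDB; the helper names apply_inc,
apply_set, and apply_push are hypothetical and exist only for this illustration.

```r
# The sample document from above, written as a nested R list.
doc <- list(
  name = "Echo", age = 22, gender = "Male",
  score = list(Mike = 5, Jimmy = 3.5, Ann = 4),
  comments = c("a1", "a2", "a3")
)

# Hypothetical helpers that mimic the three modifiers on a local list.
apply_inc  <- function(doc, field, by)    { doc[[field]] <- doc[[field]] + by; doc }
apply_set  <- function(doc, field, value) { doc[[field]] <- value; doc }
apply_push <- function(doc, field, value) { doc[[field]] <- c(doc[[field]], value); doc }

doc <- apply_inc(doc, "age", 1)               # like $inc: add 1 to a number
doc <- apply_set(doc, "gender", "Female")     # like $set: overwrite any value
doc <- apply_push(doc, "comments", "Orange")  # like $push: append to an array
str(doc)
```

On the server, the same changes are expressed as update documents built with the BSON buffer
functions rather than applied to a local list.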
# Delete object.
> mongo.remove(mongo, ns, query)
# Destroy mongo connection.
> mongo.destroy(mongo)
> library(stringr)
> batch_insert<-function(arr,ns){
+ mongo_insert<-function(x){
+ buf <- mongo.bson.buffer.create()
+ mongo.bson.buffer.append(buf, "name", str_c("Dave",x))
+ mongo.bson.buffer.append(buf, "age", x)
+ mongo.bson.buffer.start.array(buf, "comments")
+ mongo.bson.buffer.append(buf, "0", "a1")
+ mongo.bson.buffer.append(buf, "1", "a2")
+ mongo.bson.buffer.append(buf, "2", "a3")
+ mongo.bson.buffer.finish.object(buf)
+ return(mongo.bson.from.buffer(buf))
+ }
+ mongo.insert.batch(mongo, ns, lapply(arr,mongo_insert))
+}
> batch_inc<-function(data,ns){
+ for(i in data){
+ buf <- mongo.bson.buffer.create()
+ mongo.bson.buffer.append(buf, "name", str_c("Dave",i))
+ criteria <- mongo.bson.from.buffer(buf)
+ buf <- mongo.bson.buffer.create()
+ mongo.bson.buffer.start.object(buf, "$inc")
+ mongo.bson.buffer.append(buf, "age", 1L)
+ mongo.bson.buffer.finish.object(buf)
+ objNew <- mongo.bson.from.buffer(buf)
+ mongo.update(mongo, ns, criteria, objNew)
+ }
+}
> batch_set<-function(data,ns){
+ for(i in data){
+ buf <- mongo.bson.buffer.create()
+ mongo.bson.buffer.append(buf, "name", str_c("Dave",i))
+ criteria <- mongo.bson.from.buffer(buf)
+ buf <- mongo.bson.buffer.create()
+ mongo.bson.buffer.start.object(buf, "$set")
+ mongo.bson.buffer.append(buf, "age", 1L)
+ mongo.bson.buffer.finish.object(buf)
+ objNew <- mongo.bson.from.buffer(buf)
+ mongo.update(mongo, ns, criteria, objNew)
+ }
+ }
> batch_push<-function(data,ns){
+ for(i in data){
+ buf <- mongo.bson.buffer.create()
+ mongo.bson.buffer.append(buf, "name", str_c("Dave",i))
+ criteria <- mongo.bson.from.buffer(buf)
+ buf <- mongo.bson.buffer.create()
+ mongo.bson.buffer.start.object(buf, "$push")
+ mongo.bson.buffer.append(buf, "comments", "Orange")
+ mongo.bson.buffer.finish.object(buf)
+ objNew <- mongo.bson.from.buffer(buf)
+ mongo.update(mongo, ns, criteria, objNew)
+ }
+}
All three modifier programs above use a for loop, so the overhead of the loop itself should be
the same for each, and we will not take it into consideration. Now
let's write the execution program and compare the speed of these three modifiers.
# Time of loop.
> data=1:1000
# Clear data.
> mongo.remove(mongo, ns)
[1] TRUE
# Insert in batch.
> system.time(batch_insert(data, ns))
user system elapsed
0.25 0.00 0.28
# Modifier $inc.
> system.time(batch_inc(data, ns))
user system elapsed
0.47 0.27 2.50
# Modifier $set.
> system.time(batch_set(data, ns))
user system elapsed
0.77 0.48 3.17
# Modifier $push.
> system.time(batch_push(data, ns))
user system elapsed
0.81 0.41 4.23
Ranked by elapsed time, the three modifiers are: $push > $set > $inc. In other words, $inc is the
fastest and $push the slowest. Because each modifier operates on a different type ($push on arrays,
$set on any value, and $inc on numbers), the test results serve only as a reference.
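As a small check on the timings reported above, the following sketch copies the three elapsed values
into a named vector and ranks them. The numbers come from the system.time() output in this section;
the variable names are only illustrative.

```r
# Elapsed seconds copied from the system.time() results above.
elapsed <- c(inc = 2.50, set = 3.17, push = 4.23)

# Sort ascending: the first name is the fastest modifier.
fastest_first <- names(sort(elapsed))
print(fastest_first)

# Elapsed time relative to the fastest modifier.
relative <- round(elapsed / min(elapsed), 2)
print(relative)
```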
We introduced how to use R to access JSON data in Section 1.6, and now we've successfully
connected R with MongoDB. Because MongoDB stores data as JSON, we now have the foundation for
processing semistructured JSON data, and R will also play an important role in the area of
semistructured data processing.
Redis is an in-memory key-value database. It's more advanced than Memcache because it supports
many data structures while remaining efficient and fast. Redis handles highly concurrent data
access well, and it also performs well in real-time data storage. This section introduces how
to connect to Redis using R.
# Start redis.
~ /etc/init.d/redis-server start
Starting redis-server: redis-server.
Use the command line client program of Redis, redis-cli, to open Redis Shell. The simple
operation of Redis Shell includes inserting a record and querying records.
Then, let’s use the Redis client of R, rredis, to access the Redis server remotely to run the
test.
functions to introduce. If you are interested in those functions, you may find them in the
official documents of rredis.
◾ Win7 64bit
◾ R: 3.0.1 x86_64-w64-mingw32/x64 64bit
Now I’ll provide an introduction according to the following five operation types.
◾ Basic operation of redis: create connection, switch database, display all KEY value in list,
clear the data of current database, close connection
◾ Operation of class string: insert, read, delete, insert and set expiration time, operation in batch
◾ Operation of class list: insert, read, pop
◾ Operation of class set: insert, read, difference, intersection, union
◾ Interactive operation between rredis and redis-cli
# Install rredis.
> install.packages("rredis")
Create a connection with the Redis server through redisConnect(). If it’s a local connection,
redisConnect() won’t need parameters. The following example uses a remote connection. Add the
host parameter and configure the IP address redisConnect(host = "192.168.1.101",port = 6379).
# Switch to database1.
> redisSelect(1)
[1] "OK"
> redisKeys()
NULL
# Switch to database0.
> redisSelect(0)
[1] "OK"
> redisKeys()
[1] "x" "data"
# Close connection.
> redisClose()
# Insert object.
> redisSet('x',runif(5))
[1] "OK"
# Read object.
> redisGet('x')
[1] 0.67616159 0.06358643 0.07478021 0.32129140 0.16264615
# Insert in batch.
> redisMSet(list(x=pi,y=runif(5),z=sqrt(2)))
[1] TRUE
# Read in batch.
> redisMGet(c('x','y','z'))
$x
[1] 3.141593
$y
[1] 0.9249501 0.3444994 0.6477250 0.1681421 0.2646853
$z
[1] 1.414214
# Delete data.
> redisDelete('x')
[1] 1
> redisGet('x')
NULL
> redisSAdd('A',runif(2))
> redisSAdd('A',55)
> redisSAdd('B',55)
> redisSAdd('B',rnorm(3))
# Different set.
> redisSDiff(c('A','B'))
[[1]]
[1] 0.6494041 0.3181108
# Intersection.
> redisSInter(c('A','B'))
[[1]]
[1] 55
# Union.
> redisSUnion(c('A','B'))
[[1]]
[1] 55
[[2]]
[1] 0.1074787 1.3111006 0.8223434
[[3]]
[1] 0.6494041 0.3181108
# Read data.
> redisGet('shell')
[1] "Greetings, R client!"
Use rredis to insert data and use Redis client to read data.
# Insert data.
> redisSet('R', 'Greetings, shell client!')
[1] "OK"
# Read data(gibberish).
redis 127.0.0.1:6379> get R
"X\\x00\x00\x00\x02\x00\x02\x0f\x00\x00\x02\x03\x00\x00\x00\x00\x10\
x00\x00\x00\x01\x00\x04\x00\\x00\x00\x00\x18Greetings, shell
client!"
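The unreadable prefix in front of the text is consistent with R's binary serialization format, which
suggests that the value was stored as a serialized R object rather than as a plain string. The sketch
below uses only base R (no Redis connection) to show the difference between the two byte
representations. The rredis documentation describes passing a raw vector such as charToRaw(...) to
redisSet() when other clients need to read the value; treat that as something to verify against your
installed version.

```r
msg <- "Greetings, shell client!"

# Bytes of the serialized R object: header and type information, then the text.
serialized <- serialize(msg, connection = NULL)

# Bytes of the plain string only.
plain <- charToRaw(msg)

print(length(serialized))
print(length(plain))

# The binary serialization header begins with "X", as in the redis-cli output above.
starts_with_X <- serialized[1] == as.raw(0x58)

# The readable text sits at the very end of the serialized bytes.
ends_with_text <- identical(tail(serialized, length(plain)), plain)

# Unserializing restores the original R value.
round_trip <- identical(unserialize(serialized), msg)
print(c(starts_with_X, ends_with_text, round_trip))
```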
KEY:
– Users: user id
VALUE:
– Id: user id
– Pw: password
– Email: E-mail address
Then, read the data file into memory. Create a Redis connection, insert the data into Redis in a
loop, and output the corresponding VALUE using users:wolys as the KEY.
# Read data.
> data<-scan(file="data5.txt",what=character(),sep=" ")
> data<-data[which(data!='#')]
> data
[1] "wolys" "wolysopen111" "[email protected]"
[4] "coralshanshan" "601601601" "[email protected]"
[7] "pengfeihuchao" "woaidami" "[email protected]"
[10] "simulategirl" "@#$9608125" "simulateboy@163.
com"
[13] "daisypp" "12345678" "zhoushigang_123@163.
com"
[16] "sirenxing424" "tfiloveyou" "sirenxing424@126.
com"
[19] "raininglxy" "1901061139" "[email protected]"
[22] "leochenlei" "leichenlei" "chenlei1201@gmail.
com"
[25] "z370433835" "lkp145566" "[email protected]"
[28] "cxx0409" "12345678" "[email protected]"
[31] "xldq_l" "061222ll" "[email protected]"
# Connect redis.
> redisConnect(host="192.168.1.101",port=6379)
> redisFlushAll()
> redisKeys()
[[1]]
[1] "pw:wolysopen111"
[[2]]
[1] "email:[email protected]"
[[3]]
[1] "id:wolys"
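The reshaping step behind this test case can be sketched without a Redis server. The flat vector
produced by scan() holds id, password, and email in groups of three; the sketch below turns such a
vector into one record per user, keyed as users:<id>. The sample values and variable names are
made up for illustration.

```r
# Hypothetical flat vector in the same layout as the scanned file: id, pw, email, ...
flat <- c("alice", "secret1", "alice@example.com",
          "bob",   "secret2", "bob@example.com")

# One row per user, three fields per row.
m <- matrix(flat, ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("id", "pw", "email")))

# One named list per user; list names follow the KEY design above.
records <- lapply(seq_len(nrow(m)), function(i) as.list(m[i, ]))
names(records) <- paste0("users:", m[, "id"])

str(records[["users:alice"]])
```

Each element of records can then be written to Redis inside the loop under its users:<id> key.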
Thus we’ve finished the whole test case. Redis is a very efficient in-memory database. By com-
bining R with Redis, we can construct a powerful real-time computing application.
Cassandra is an open source distributed database management system designed to handle large
amounts of data across many commodity servers. Some of Cassandra's techniques, including
multiple datacenters, consistent hashing, and Bloom filters, provided new design ideas for later
NoSQL products. This section introduces RCassandra, which connects R with Cassandra.
For environment preparation, I choose Linux Ubuntu here. You may choose other types of
Linux according to your preference. The server environment of Cassandra is:
# Start Cassandra.
~ bin/cassandra
6.4.2.1 17 Functions
The following are the 17 functions, followed by a comparison with the corresponding Cassandra commands.
RC.close RC.insert
RC.cluster.name RC.login
RC.connect RC.mget.range
RC.consistency RC.mutate
RC.describe.keyspace RC.read.table
RC.describe.keyspaces RC.use
RC.get RC.version
RC.get.range RC.write.table
RC.get.range.slices
Cassandra:
connect 192.168.1.200/9160;
RCassandra:
conn<-RC.connect(host="192.168.1.200",port=9160)
Cassandra:
show cluster name;
RCassandra:
RC.cluster.name(conn)
Cassandra:
show keyspaces;
RCassandra:
RC.describe.keyspaces(conn)
Cassandra:
show schema DEMO;
RCassandra:
RC.describe.keyspace(conn,'DEMO')
Cassandra:
use DEMO;
RCassandra:
RC.use(conn,'DEMO')
Cassandra:
consistencylevel as ONE;
RCassandra:
RC.consistency(conn,level="one")
7. Insert data.
Cassandra:
set Users[1][name] = scott;
RCassandra:
RC.insert(conn,'Users','1', 'name', 'scott')
8. Insert a data frame.
Cassandra:
NA
RCassandra:
RC.write.table(conn, "Users", df)
Cassandra:
list Users;
RCassandra:
RC.read.table(conn,"Users")
Cassandra:
get Users[1]['name'];
RCassandra:
RC.get(conn,'Users','1', c('name'))
Cassandra:
exit; quit;
RCassandra:
RC.close(conn)
# Start R.
~ R
# Install RCassandra.
> install.packages('RCassandra')
Keyspaces operation:
Then there is the operation of data. We should note that we cannot create a column family in
RCassandra, so we need to create a column family through Cassandra commands.
# iris is a data.frame.
> RC.write.table(conn, "iris", iris)
attr(,"class")
[1] "CassandraConnection"
> r[[1]]
key value ts
1 Petal.Length 1.7 1.372881e+15
2 Petal.Width 0.4 1.372881e+15
3 Sepal.Length 5.4 1.372881e+15
4 Sepal.Width 3.9 1.372881e+15
5 Species setosa 1.372881e+15
> head(y)
Petal.Length Petal.Width Sepal.Length Sepal.Width Species
1 1.4 0.2 5.1 3.5 setosa
2 1.4 0.2 4.9 3.0 setosa
3 1.3 0.2 4.7 3.2 setosa
4 1.5 0.2 4.6 3.1 setosa
5 1.4 0.2 5.0 3.6 setosa
# Insert data.
> RC.write.table(conn, "Users", df)
attr(,"class")
[1] "CassandraConnection"
Hive is a programming interface to Hadoop. It adopts SQL-like syntax, which makes it relatively
easy for a data analyst to master quickly. Hive has made the world of Java simpler and lighter and
helps Hadoop to be accepted by nonprogrammers. Starting with Hive, an analyst can master big
data too. This section introduces how to use RHive to connect R to Hive.
◾ Different data storage: Hive is based on HDFS of Hadoop, while relational databases are
based on a local file system.
◾ Different computing model: Hive is based on the MapReduce of Hadoop, while relational
databases are based on a memory computing model of indexes.
◾ Different application scenarios: Hive serves OLAP data warehouse systems that query big
data with poor real-time performance, while relational databases serve OLTP transactional
systems that need real-time queries.
◾ Different scalability: It’s very easy to increase distributed storage capacity and computing
power in Hive because it is based on Hadoop, whereas it’s rather difficult to scale relational
databases horizontally, so we have to keep increasing the performance of a single machine.
Hive is a data warehouse product based on Hadoop, so we need to have a Hadoop environ-
ment. For the installation and configuration of Hadoop, please refer to Appendix G. For environ-
ment preparation, I choose Linux Ubuntu here. You may choose other types of Linux according
to your preference.
# Start R.
~ sudo R
> install.packages("rJava")
# Install RHive.
> install.packages("RHive")
# Load RHive.
> library(RHive)
Loading required package: rJava
Loading required package: Rserve
This is RHive 0.0-7. For overview type '?RHive'.
HIVE_HOME=/home/conan/hadoop/hive-0.9.0
call rhive.init() because HIVE_HOME is set.
When loading RHive, the environment variables of local Hive will be loaded in the environ-
ment of R automatically.
1. Connect to Hive.
Hive:
hive shell
RHive:
rhive.connect("192.168.1.210")
Hive:
show tables;
RHive:
rhive.list.tables()
Hive:
desc o_account;
RHive:
rhive.desc.table('o_account')
rhive.desc.table('o_account',TRUE)
Hive:
select * from o_account;
RHive:
rhive.query('select * from o_account')
Hive:
dfs -ls /;
RHive:
rhive.hdfs.ls()
Hive:
dfs -cat /user/hive/warehouse/o_account/part-m-00000;
RHive:
rhive.hdfs.cat('/user/hive/warehouse/o_account/part-m-00000')
Disconnect.
Hive:
quit;
RHive:
rhive.close()
# Initialization.
> rhive.init()
# Connect to hive.
> rhive.connect("192.168.1.210")
# Close connection.
> rhive.close()
[1] TRUE
rhive.hdfs.cat('/user/hive/warehouse/
rhive_sblk_1372238856/000000_0')
[email protected] 12:21:39
[email protected] 12:21:39
[email protected] 12:21:39
[email protected] 12:21:39
We’ve now opened the data channel between R and Hive using RHive and can start to operate on
big data in Hive from R. In the next section, I introduce an RHive case that applies big data
in finance.
I’ve been working in finance for a relatively short time, and the reverse repurchase is the first
product that I’ve traded. Reverse repurchase is possibly a new term for most of us, even for
experienced stock market participants. China’s banking system experienced a severe lack of liquidity
in July 2013. The overnight interbank borrowing rate reached 30% at that time. When the
banking system lacks liquidity and short-term interest rates rise sharply, other financial
institutions sell their stocks and bonds to lend money to the banks.
> rhive.desc.table('t_reverse_repurchase')
col_name data_type
1 tradedate string
2 tradetime string
3 securityid string
4 bidpx1 double
5 bidsize1 double
6 offerpx1 double
7 offersize1 double
# Start R.
~ R
# Load RHive.
> library(RHive)
# Initialization.
> rhive.init()
View the historical data partitions for all stocks: the test data run from June 27, 2013 to
July 26, 2013.
Extract the data of “One Day Reverse Repurchase of Shanghai Stock Exchange” (204001) and
“One Day Reverse Repurchase of Shenzhen Stock Exchange” (131810).
Use ggplot2 to draw the trend of these two products over one week, as in Figure 6.1.
# Load ggplot2.
> library(ggplot2)
> g<-ggplot(data=bidpx1, aes(x=as.POSIXct(tradetime,format="%Y%m%d%H%M%S"), y=bidpx1))
> g<-g+geom_line(aes(group=securityid,colour=securityid))
> g<-g+xlab('tradetime')+ylab('bidpx1')
[Figures 6.1 and 6.2: bidpx1 (y-axis, 0.0 to 12.5) plotted against tradetime, one line per securityid (131810, 204001).]
Then the trend in one day, July 26, 2013, is as in Figure 6.2.
Extract the data of 131810 and 204001, and store them to table t_reverse_repurchase.
# Sign in R.
> library(RHive)
> rhive.init()
> rhive.connect("c1.wtmart.com")
3 4677 20130701
4 3124 20130702
5 2328 20130703
6 3787 20130704
7 4294 20130705
8 4977 20130708
9 4568 20130709
10 6619 20130710
11 5633 20130712
12 6159 20130715
13 5918 20130716
14 6200 20130719
15 6074 20130722
16 5991 20130723
17 5899 20130724
18 5346 20130725
19 6192 20130726
Extract the data of one day and run ETL on it. First load the packages:
> library(ggplot2)
> library(scales)
> library(plyr)
> bidpx1<-rhive.query(paste("SELECT
securityid,tradedate,tradetime,bidpx1 FROM t_reverse_repurchase
WHERE tradedate>=20130722"));
> oneDay<-function(date){
+ d1<-bidpx1[which(bidpx1$tradedate==date),]
+ d1$tradetime2<-round(as.numeric(as.character(d1$tradetime))/100)*100
+ d1$tradetime2[which(d1$tradetime2<100000)] <- paste(0,d1$tradetime2
[which(d1$tradetime2<100000)],sep="")
+ d1$tradetime2[which(d1$tradetime2=='1e+05')]='100000'
+ d1$tradetime2[which(d1$tradetime2=='096000')]='100000'
+ d1$tradetime2[which(d1$tradetime2=='106000')]='110000'
+ d1$tradetime2[which(d1$tradetime2=='126000')]='130000'
+ d1$tradetime2[which(d1$tradetime2=='136000')]='140000'
+ d1$tradetime2[which(d1$tradetime2=='146000')]='150000'
+ d1
+ }
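The function above rounds HHMMSS values by treating them as decimal numbers and then patches the
invalid results such as 096000 one by one. As an alternative sketch (not the author's method), the
rounding can be done in seconds, so that a minute overflow carries into the hour automatically. Note
that this variant rounds to the nearest minute (30 seconds or more rounds up), which differs slightly
from the decimal rounding above. The function name round_to_minute is hypothetical.

```r
# Round HHMMSS times (numeric or character) to the nearest minute, returned as "HHMM00".
round_to_minute <- function(hhmmss) {
  x <- as.integer(as.character(hhmmss))
  # Convert HHMMSS to seconds since midnight.
  secs <- (x %/% 10000L) * 3600L + ((x %/% 100L) %% 100L) * 60L + x %% 100L
  # Nearest minute; 30 seconds or more rounds up.
  secs <- ((secs + 30L) %/% 60L) * 60L
  # Format back to a zero-padded HHMM00 string.
  sprintf("%02d%02d00", secs %/% 3600L, (secs %% 3600L) %/% 60L)
}

print(round_to_minute(c(93059, 95959, 145930)))
```

Because the arithmetic is done in seconds, values such as 09:59:59 become 100000 without a special case.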
> meanScale<-function(d1){
+ ddply(d1, .(securityid,tradetime2), summarize, bidpx1=mean(bidpx1))
+ }
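meanScale() averages bidpx1 within each (securityid, tradetime2) group using plyr. For readers
without plyr, roughly the same grouping can be sketched with base R's aggregate(). The small data
frame below is made up for illustration, and the function name mean_scale_base is hypothetical.

```r
# Made-up sample in the same shape as the output of oneDay().
d1 <- data.frame(
  securityid = c(131810, 131810, 204001),
  tradetime2 = c("093100", "093100", "093100"),
  bidpx1     = c(3.0, 3.2, 2.9),
  stringsAsFactors = FALSE
)

# Base R counterpart of the ddply() call: mean of bidpx1 per security and minute.
mean_scale_base <- function(d1) {
  aggregate(bidpx1 ~ securityid + tradetime2, data = d1, FUN = mean)
}

d2 <- mean_scale_base(d1)
print(d2)
```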
> findPoint<-function(a1,a2){
+ bigger_point<-function(a1,a2){
+ idx<-c()
+ for(i in intersect(a1$tradetime2,a2$tradetime2)){
+ i1<-which(a1$tradetime2==i)
+ i2<-which(a2$tradetime2==i)
+ if(a1$bidpx1[i1]-a2$bidpx1[i2]>=-0.02){
+ idx<-c(idx,i1)
+ }
+ }
+ idx
+ }
+ remove_continuous_point<-function(idx){
+ idx[-which(idx-c(NA,rev(rev(idx)[-1]))==1)]
+ }
+ idx<-bigger_point(a1,a2)
+ remove_continuous_point(idx)
+ }
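remove_continuous_point() keeps only the first index of each run of consecutive indices. One thing to
watch for in R is that negative indexing with an empty which() result drops every element, so the
helper returns nothing when no two indices are consecutive. The sketch below shows that pitfall and
one possible guard using diff(); the name drop_consecutive is hypothetical and this is not the
author's code.

```r
# Keep an index only if it does not directly follow the previous one.
drop_consecutive <- function(idx) {
  if (length(idx) < 2) return(idx)
  idx[c(TRUE, diff(idx) != 1)]
}

print(drop_consecutive(c(2, 3, 4, 8, 10, 11)))
print(drop_consecutive(c(2, 5, 9)))

# The pitfall: which() finds nothing here, and idx[-integer(0)] selects no elements.
idx <- c(2, 5, 9)
naive <- idx[-which(c(NA, diff(idx)) == 1)]
print(length(naive))
```

Using a logical mask instead of negative positions sidesteps the empty-index case entirely.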
> bigger_point<-function(a1,a2){
+ idx<-c()
+ for(i in intersect(a1$tradetime2,a2$tradetime2)){
+ i1<-which(a1$tradetime2==i)
+ i2<-which(a2$tradetime2==i)
+ if(a1$bidpx1[i1]-a2$bidpx1[i2]>=-0.02){
+ idx<-c(idx,i1)
+ }
+ }
+ idx
+ }
> remove_continuous_point<-function(idx){
+ idx[-which(idx-c(NA,rev(rev(idx)[-1]))==1)]
+ }
> findOptimize<-function(d3){
+ idx2<-which((d3$bidpx1-c(NA,rev(rev(d3$bidpx1)[-1])))/d3$bidpx1>0.1)
+ if(length(idx2)<1)
+ print("No Optimize point")
+ d3[idx2,]
+ }
> draw<-function(d2,d3,d4,date,png=FALSE){
+ g<-ggplot(data=d2, aes(x=strptime(paste(date,tradetime2,sep=""),
format="%Y%m%d%H%M%S"), y=bidpx1))
+ g<-g+geom_line(aes(group=securityid,colour=securityid))
+ g<-g+geom_point(data=d3,aes(size=1.5,colour=securityid))
+ if(nrow(d4)>0){
+ g<-g+geom_text(data=d4,aes(label= format(d4$bidpx1,digits=4)),
colour="blue",hjust=0, vjust=0)
+ }
+ g<-g+xlab('tradetime')+ylab('bidpx1')
+ if(png){
+ ggsave(g,file=paste(date,".png",sep=""),width=12,height=8)
+ }else{
+ g
+ }
+ }
> date<-20130722
> d1<-oneDay(date)
> d2<-meanScale(d1)
> a1<-d2[which(d2$securityid==131810),]
> a2<-d2[which(d2$securityid==204001),]
> d3<-d2[findPoint(a1,a2),]
> d4<-findOptimize(d3)
> draw(d2,d3,d4,as.character(date),TRUE)
[Figure: bidpx1 plotted against tradetime, one line per securityid (131810, 204001); a signal point is labeled 3.474.]
Generate five images corresponding to five days of a week (Figures 6.3 to 6.7). The back-test
of the data shows that the signals given by the programmatic strategy are all good selling points.
Their performance is better than my manual operation.
Comparing one week of data, we find that this simple strategy can bring us some benefit and that
it outperforms manual decisions.
[Figures: four further daily plots of bidpx1 against tradetime, one line per securityid (131810, 204001); labeled signal points include 4.314, 11.54, 9.630, and 7.163.]
RHadoop
This chapter mainly introduces how to use R to access a Hadoop cluster through the RHadoop tools.
It helps readers manage HDFS using R, develop MapReduce programs, and access HBase.
The chapter also uses R to implement a MapReduce program based on a collaborative filtering
algorithm, which is more concise than the Java version.
into Hadoop
http://blog.fens.me/r-hadoop-intro/
R and Hadoop belong to two different disciplines. They have different user groups, they are based
on two different knowledge systems, and they do different things. But data, as their intersection,
has made the combination of R and Hadoop an interdisciplinary choice and a tool to mine the
value of data.
◾ Hive is a data warehouse tool based on Hadoop. It can map the structured data file to a data-
base table and quickly achieve simple MapReduce statistics through SQL-like statements. It
is very suitable for statistical analysis for data warehouses because it doesn’t need to develop
special MapReduce applications.
◾ Pig is a big data analysis tool based on Hadoop. It provides an SQL-like language, Pig Latin.
Pig Latin transforms SQL-like data analysis requests into a series of optimized MapReduce
calculations.
◾ HBase is a highly reliable, high-performance, column-oriented, and scalable distributed
storage system. We can construct massive structured storage clusters on a low-cost PC Server
using HBase.
◾ Sqoop is a tool to transfer data between Hadoop and relational database. We can transfer
the data of relational database(MySQL, Oracle, Postgres, etc.) to Hadoop and vice versa.
◾ Zookeeper is a distributed open source coordination service designed for distributed appli-
cations. It’s mainly used for solving the data management problems occurring in distrib-
uted applications, simplifying the difficulty of coordination and management of distributed
applications, and increasing the performance of a distributed service.
◾ Mahout is a distributed machine learning and data mining framework based on
Hadoop. It uses MapReduce to implement some of its data mining algorithms and solves the
problem of parallel mining.
◾ Avro is a data serialization system designed for data-intensive applications with massive data
exchange. Avro is a new data serialization format and transfer facility, and it will replace the
original IPC mechanism of Hadoop.
◾ Ambari, based on the Web, supports the supply, management, and monitoring of Hadoop
clusters.
◾ Chukwa is an open source data collection system used for monitoring large distributed
systems. It can collect all kinds of data, transform them into a file suited to be processed by
Hadoop, and store them in HDFS for Hadoop to perform MapReduce operations.
Hadoop started to develop independently with MapReduce and HDFS in 2006. The
Hadoop family has since incubated several top-level Apache projects. In recent years especially,
the development of Hadoop has accelerated. It has integrated many new technologies,
such as YARN, HCatalog, Oozie, and Cassandra, which makes it very hard to keep up with. For more
introductory information, please refer to the series of articles in the author’s blog about the
Hadoop family.
◾ Question 1: Why should we combine R with Hadoop as the Hadoop family is already so
powerful?
◾ Question 2: Mahout can also perform data mining and machine learning. So what’s the
difference between R and Mahout?
Now, let’s simulate a scenario: analyze 1 PB of access logs from a news website and predict
future changes in traffic. The specific process can be divided into four steps.
◾ Use R to perform data analysis, construct a regression model on the business object, and
define indicators.
◾ Use Hadoop to extract total indicator data from the massive log data.
◾ Use R to verify and optimize the indicator data.
◾ Use the distributed algorithm of Hadoop to rewrite the R model and deploy it online.
In this scenario, R and Hadoop both play important roles. If we adopt the thinking pattern of
programmers and use only Hadoop in the whole process, there will be no data modeling and no
validation, and the forecasting results will be problematic. If we adopt the thinking pattern
of statisticians and use only R, the forecast based on sampled data will also be biased. So combining
R with Hadoop is an inevitable direction for the industry. The intersection of industry and
academia gives researchers in interdisciplinary studies a great deal of room to explore.
7.1.2.2 Mahout Can Also Perform Data Mining and Machine Learning.
So What’s the Difference between R and Mahout?
1. Mahout is an algorithm framework of data mining and machine learning based on Hadoop.
The focus of Mahout is to solve the computing problems of big data.
2. The algorithms supported by Mahout include collaborative filtering, recommendation
algorithms, clustering algorithms, sorting algorithms, linear discriminant analysis (LDA), naïve
Bayes, random forest, and so forth. Most of the algorithms in Mahout are based on distance.
After matrix decomposition, they make full use of the parallel computing framework of
MapReduce and accomplish their computing tasks efficiently.
3. Many data mining algorithms cannot be parallelized with MapReduce.
Besides, all the current models of Mahout are general-purpose computing models. If we apply these
models in projects directly, the results will be only a little better than random
results. Secondary development on Mahout requires a solid technical grounding in Java and
Hadoop, and even basic knowledge of linear algebra, probability and statistics, and
algorithms. So it’s not easy to master Mahout.
4. R provides most of the algorithms supported by Mahout. It also provides many algorithms
that are not supported by Mahout. In addition, the growth of algorithms in R is faster than
in Mahout, and it’s simple in development and flexible in parameter configuration. R is also
very fast in small data cluster computing.
Though Mahout can also perform well in data mining and machine learning, its strengths
don’t overlap with those of R. Only by combining the advantages of both sides and choosing the right
technique for each field can we truly guarantee the quality of the programs we make.
7.1.3.1 RHadoop
RHadoop is a product that combines Hadoop and R. It was developed by Revolution Analytics,
and its code is published on GitHub. RHadoop includes three R packages (rmr, rhdfs, and
rhbase), corresponding respectively to three parts of the Hadoop system architecture: MapReduce,
HDFS, and HBase. This chapter gives a detailed introduction to the installation and use of
RHadoop.
7.1.3.2 RHive
RHive, developed by the Korean company NexR, is a tool package to access Hive directly through
R. Sections 6.5 and 6.6 have provided information on the installation and use of RHive.
I’ve written three examples of using Java to call R. You can refer to Sections 4.1, 4.2, and 4.3
to try to make your own combination and create some unique applications.
RHadoop series
Hadoop R RHadoop
RHadoop is the first product to combine R and Hadoop and to support big data analysis
on that basis. Hadoop is used for big data storage, and R replaces Java for writing the
MapReduce algorithms. With the help of RHadoop, R developers now have a powerful tool to
process big data at the 10 GB, 100 GB, TB, or even PB level, which overcomes the performance
limits of a single machine. But it’s not so easy for R, Java, and Hadoop users to master all the knowledge.
It’s important to use the official Oracle (Sun) JDK, downloaded from the
official website (http://www.oracle.com/technetwork/java/javase/downloads/index.html). The JDK
that ships with Linux Ubuntu may have many incompatibility problems. For the JDK, please choose
a 1.6.x version, as version 1.7 also has incompatibility problems. For the R environment, please
install version 2.15.3, as version 2.14 and versions above 3.0 do not support RHadoop. The system
environment in this section is as follows:
# Install rJava.
> install.packages("rJava")
Then, we need to install some other dependent libraries directly through install.packages(),
including reshape2, Rcpp, iterators, itertools, digest, RJSONIO, and functional.
> install.packages("reshape2")
> install.packages("Rcpp")
> install.packages("iterators")
> install.packages("itertools")
> install.packages("digest")
> install.packages("RJSONIO")
> install.packages("functional")
~ export HADOOP_CMD=/root/hadoop/hadoop-1.1.2/bin/hadoop
~ export HADOOP_STREAMING=/root/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar
~ sudo vi /etc/environment
HADOOP_CMD=/root/hadoop/hadoop-1.1.2/bin/hadoop
HADOOP_STREAMING=/root/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar
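Before loading the RHadoop packages it can help to confirm from inside R that both variables are
visible to the R process. A minimal sketch with base R only; the helper name missing_hadoop_vars
is hypothetical, and the Sys.setenv()/Sys.unsetenv() calls are there purely to make the example
self-contained.

```r
# Names of the environment variables configured above.
required <- c("HADOOP_CMD", "HADOOP_STREAMING")

# Return the names of the variables that are unset or empty in this R session.
missing_hadoop_vars <- function(vars = required) {
  vals <- Sys.getenv(vars, unset = "")
  vars[!nzchar(vals)]
}

# Simulate a session where only HADOOP_CMD has been exported.
Sys.setenv(HADOOP_CMD = "/root/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.unsetenv("HADOOP_STREAMING")

still_missing <- missing_hadoop_vars()
print(still_missing)
```

If the function returns a non-empty vector, set the variables in the shell (or with Sys.setenv())
and start R again before continuing.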
# Install rhdfs.
~ R CMD INSTALL /root/R/rhdfs_1.0.5.tar.gz
~ ls /disk1/system/usr/local/lib/R/site-library/
digest functional iterators itertools plyr Rcpp reshape2
rhdfs rJava RJSONIO rmr2 stringr
Because my hard disk is external and the R library directory is attached using mount
and a symlink (ln -s), my R libraries are all under the directory /disk1/system. Normally, R
libraries are in /usr/lib/R/site-library or /usr/local/lib/R/site-library. You can
use the whereis R command to find where R and its libraries are located on your computer.
# Load rhdfs.
> library(rhdfs)
Found 4 items
drwxr-xr-x - root supergroup 0 2013-02-01 12:15 /user/conan
drwxr-xr-x - root supergroup 0 2013-03-06 17:24 /user/hdfs
drwxr-xr-x - root supergroup 0 2013-02-26 16:51 /user/hive
drwxr-xr-x - root supergroup 0 2013-03-06 17:21 /user/root
> hdfs.ls("/user/")
> hdfs.cat("/user/hdfs/o_same_school/part-m-00000")
[1] "10,3,tsinghua university,2004-05-26 15:21:00.0"
[2] "23,4007,Beijing No.171 Middle School,2004-05-31 06:51:53.0"
[3] "51,4016,Dalian University of Technology,2004-05-27 09:38:31.0"
[4] "89,4017,Amherst College,2004-06-01 16:18:56.0"
[5] "92,4017,Stanford University,2012-11-28 10:33:25.0"
[6] "99,4017,Stanford University Graduate School of
Business,2013-02-19 12:17:15.0"
# Load rmr2.
> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2
Normal R:
[1] 1 4 9 16 25 36 49 64 81 100
R based on Hadoop:
Because MapReduce can only access the HDFS file system, we need to use to.dfs() to store
data in HDFS, and then use from.dfs() to extract the computing result of MapReduce
from the HDFS file system.
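As a minimal sketch of the comparison above, the squares of 1 to 10 can be computed in plain R and then in the shape that a map function sees. The map() helper below is a hypothetical local stand-in written in base R, not an rmr2 function; it only illustrates that a map step receives a key and a vector of values and returns key/value pairs.

```r
# Plain R: square each element of 1:10.
small.ints <- 1:10
squares <- sapply(small.ints, function(x) x^2)
print(squares)

# Local stand-in for a map step (hypothetical helper, base R only).
# It receives a key k (NULL here) and a vector of values v, and
# returns the input values as keys and their squares as values.
map <- function(k, v) list(key = v, val = v^2)

out <- map(NULL, small.ints)
print(out$val)

# On a cluster, the input would first be written with to.dfs(), the map
# step would run inside mapreduce(), and the result would be read back
# with from.dfs(), as described in the text.
```

Both computations give the same vector; only the data path differs.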
# mapreduce function
mapreduce(input = input ,output = output, input.format = "text",
map = wc.map, reduce = wc.reduce,combine = T)
}
$key
[1] "-"
[2] "04:42:37.0"
[3] "06:51:53.0"
[4] "07:10:24.0"
[5] "09:38:31.0"
[6] "10:33:25.0"
[7] "10,3,tsinghua"
[8] "10:42:10.0"
[9] "113,4017,Stanford"
[10] "12:00:38.0"
$val
[1] 1 2 1 2 1 1 1 4 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Thus we’ve finished the installation and use of rhdfs and rmr2 in RHadoop. Though it’s a little
tedious, it’s still worth it to refine the code of MapReduce in R.
RHadoop experiment
We can use R to implement MapReduce development through RHadoop. R code is much
more concise than the equivalent Java code. This section works through a statistics use case.
Using the experimental data, count how many times each e-mail domain (the part after @) appears.
Then sort the domains by their number of appearances. Use RHadoop to implement the
MapReduce algorithm. The experimental data are in hadoop15.txt.
[41 e-mail addresses, one per line]
163.com,14
sohu.com,2
# Start R.
~ R
> library(rhdfs)
> library(rmr2)
$key
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1
[39] 1 1 1
$val
V1
[41 rows, one e-mail address per row]
Define mr() to count how many times each e-mail domain appears.
# Create mr().
> mr<-function(input=d0){
+ map<-function(k,v){
+ # Extract the part of the string after @ (word() comes from stringr).
+ keyval(word(as.character(v$V1), 2, sep = fixed('@')),1)
+ }
+ reduce =function(k, v ) {
+ # Sum the counts for each domain.
+ keyval(k, sum(v))
+ }
+ d1<-mapreduce(input=input,map=map,reduce=reduce,combine=TRUE)
+ }
# Run mr().
> d1<-mr(d0)
$key
[1] "126.com" "163.com" "21cn.com" "gmail.com" "qq.com"
[6] "sina.com" "sohu.com" "yahoo.cn" "yahoo.com.cn"
$val
[1] 9 14 1 1 9 2 2 1 2
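The two steps of this experiment (count per domain, then sort by count) can be checked locally before running them on Hadoop. The following is a minimal base R sketch; the addresses are hypothetical sample values, not the contents of hadoop15.txt, and the variable names are illustrative.

```r
# Hypothetical sample addresses (not the book's data set).
emails <- c("alice@163.com", "bob@qq.com", "carol@163.com",
            "dave@126.com", "erin@qq.com", "frank@163.com")

# Map step: emit (domain, 1) for every address.
domains <- sub("^.*@", "", emails)
ones <- rep(1, length(domains))

# Reduce step: sum the ones for each domain.
counts <- tapply(ones, domains, sum)

# Sort step: largest count first; ties fall back to alphabetical order.
sorted <- counts[order(-counts, names(counts))]
result <- data.frame(domain = names(sorted),
                     count = as.vector(sorted),
                     stringsAsFactors = FALSE)
print(result)
```

The same key/value design (domain as key, 1 as value) is what the mr() function above passes to keyval().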
# Create sort().
> sort<-function(input=d1){
+ map<-function(k,v){
+ # Format the result set of d1.
+ keyval(1,data.frame(k,v))
+ }
+ reduce<-function(k,v){
Please note that the reduce process of the second step in this section runs on a single node.
It requires more optimization when we deal with big data. Still, we can achieve our goal
with just a few lines of code, which is far less than Java would need.
Collaborative filtering algorithm by RHadoop based on MapReduce
http://blog.fens.me/rhadoop-mapreduce-rmr/
Because RHadoop's rmr2 package runs on Hadoop in its own particular way, the code is rather
difficult to implement. If you want to learn it in depth, try it yourself and think carefully about
the key/value design of the MapReduce algorithm.
Item-based collaborative filtering is one of the most widely used recommendation algorithms.
Many Internet companies are now using it, including Netflix, YouTube, Amazon, and so forth.
There are two main steps in designing the recommendation algorithm:
To introduce the algorithm model, we select a small group of data as the test set. The data set is
taken from the book Mahout in Action, p. 49. The eighth line of the original data, “3, 101, 2.5”,
is changed to “3, 101, 2.0”. The three fields of each row are user ID, item ID, and rating of item.
Create the data file small.csv on the server.
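One way to create the file is to write it from R. The sketch below uses the 21 ratings listed in the train data set of this section; writing to tempdir() is an assumption made so that the example runs anywhere, and on the server the file would go wherever read.csv() is later pointed.

```r
# Ratings as listed in this section: user ID, item ID, rating.
ratings <- data.frame(
  user = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5),
  item = c(101, 102, 103, 101, 102, 103, 104, 101, 104, 105, 107,
           101, 103, 104, 106, 101, 102, 103, 104, 105, 106),
  pref = c(5.0, 3.0, 2.5, 2.0, 2.5, 5.0, 2.0, 2.0, 4.0, 4.5, 5.0,
           5.0, 3.0, 4.5, 4.0, 4.0, 3.0, 2.0, 4.0, 3.5, 4.0)
)

# Write a headerless CSV (path is an assumption for this sketch).
path <- file.path(tempdir(), "small.csv")
write.table(ratings, path, sep = ",", row.names = FALSE, col.names = FALSE)

# Read it back the same way the section does.
train <- read.csv(file = path, header = FALSE)
names(train) <- c("user", "item", "pref")
print(head(train))
```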
U3
[101] 2.0
[102] 0.0
[103] 0.0
[104] 4.0
[105] 4.5
[106] 0.0
[107] 5.0
# Load plyr.
> library(plyr)
# Load the data set.
> train<-read.csv(file="small.csv",header=FALSE)
> names(train)<-c("user","item","pref")
# View train data set.
> train
user item pref
1 1 101 5.0
2 1 102 3.0
3 1 103 2.5
4 2 101 2.0
5 2 102 2.5
6 2 103 5.0
7 2 104 2.0
8 3 101 2.0
9 3 104 4.0
10 3 105 4.5
11 3 107 5.0
12 4 101 5.0
13 4 103 3.0
14 4 104 4.5
15 4 106 4.0
16 5 101 4.0
17 5 102 3.0
18 5 103 2.0
19 5 104 4.0
20 5 105 3.5
21 5 106 4.0
# Method of calculating user list.
> usersUnique<-function(){
+ users<-unique(train$user)
+ users[order(users)]
+ }
# Method of calculating items list.
> itemsUnique<-function(){
+ items<-unique(train$item)
+ items[order(items)]
+ }
# User list.
> users<-usersUnique()
> users
[1] 1 2 3 4 5
# Items list.
> items<-itemsUnique()
> items
[1] 101 102 103 104 105 106 107
# Create item list index.
> index<-function(x) which(items %in% x)
> data<-ddply(train,.(user,item,pref),summarize,idx=index(item))
> data
user item pref idx
1 1 101 5.0 1
2 1 102 3.0 2
3 1 103 2.5 3
4 2 101 2.0 1
5 2 102 2.5 2
6 2 103 5.0 3
7 2 104 2.0 4
8 3 101 2.0 1
9 3 104 4.0 4
10 3 105 4.5 5
11 3 107 5.0 7
12 4 101 5.0 1
13 4 103 3.0 3
14 4 104 4.5 4
15 4 106 4.0 6
16 5 101 4.0 1
17 5 102 3.0 2
18 5 103 2.0 3
19 5 104 4.0 4
20 5 105 3.5 5
21 5 106 4.0 6
# Co-occurrence matrix.
> cooccurrence<-function(data){
+ n<-length(items)
+ co<-matrix(rep(0,n*n),nrow=n)
+ for(u in users){
+ idx<-index(data$item[which(data$user==u)])
+ m<-merge(idx,idx)
+ for(i in 1:nrow(m)){
+ co[m$x[i],m$y[i]]=co[m$x[i],m$y[i]]+1
+ }
+ }
+ return(co)
+ }
# Recommendation algorithm.
> recommend<-function(udata=udata,co=coMatrix,num=0){
+ n<-length(items)
+
+ # all of pref
+ pref<-rep(0,n)
+ pref[udata$idx]<-udata$pref
+
+ # User rating matrix.
+ userx<-matrix(pref,nrow=n)
+
+ # Co-occurrence matrix * rating matrix.
+ r<-co %*% userx
+
+ # Rank the recommendation result.
+ r[udata$idx]<-0
+ idx<-order(r,decreasing=TRUE)
+ topn<-data.frame(user=rep(udata$user[1],length(idx)),item=
items[idx],val=r[idx])
+ topn<-topn[which(topn$val>0),]
+
+ # Take the first "num" items of result.
+ if(num>0){
+ topn<-head(topn,num)
+ }
+
+ # Return the result.
+ return(topn)
+ }
# Generate co-occurrence matrix.
> co<-cooccurrence(data)
> co
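The co-occurrence matrix can also be cross-checked without loops. The sketch below is an alternative to the cooccurrence() function above, not a replacement for it: it builds a user-by-item indicator matrix and multiplies its transpose by itself with crossprod(). The data are the 21 ratings of this section; the variable names are illustrative.

```r
# User and item columns of the train data set in this section.
train <- data.frame(
  user = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5),
  item = c(101, 102, 103, 101, 102, 103, 104, 101, 104, 105, 107,
           101, 103, 104, 106, 101, 102, 103, 104, 105, 106)
)

# Indicator matrix: rows are users, columns are items,
# 1 where the user rated the item (each pair occurs once).
ind <- unclass(table(train$user, train$item))

# t(ind) %*% ind: entry (i, j) counts the users who rated both i and j.
co <- crossprod(ind)
print(co)
```

The first row (item 101) is 5 3 4 4 2 2 1, which agrees with the freq column of the co-occurrence data frame shown later in this section.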
1. Create a co-occurrence matrix. First, get the combination list of all items according to user
groups; then count the item combination list and create a co-occurrence matrix of items.
2. Create rating matrix of users on items.
3. Merge the co-occurrence matrix and the rating matrix.
4. Calculate the recommendation result list.
5. In MapReduce implementation, all the operations should be completed by using Map and
Reduce tasks. The process has changed slightly, as in Figure 7.2. The following are the five
steps of matrix analysis.
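The steps above can be sketched on a single machine with one matrix product. This is a base R illustration under stated assumptions, not the MapReduce implementation: unrated items are stored as 0, all ratings are positive, and the names rating, co, scores, and rec are illustrative.

```r
# The 21 ratings of this section.
train <- data.frame(
  user = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5),
  item = c(101, 102, 103, 101, 102, 103, 104, 101, 104, 105, 107,
           101, 103, 104, 106, 101, 102, 103, 104, 105, 106),
  pref = c(5.0, 3.0, 2.5, 2.0, 2.5, 5.0, 2.0, 2.0, 4.0, 4.5, 5.0,
           5.0, 3.0, 4.5, 4.0, 4.0, 3.0, 2.0, 4.0, 3.5, 4.0)
)
users <- sort(unique(train$user))
items <- sort(unique(train$item))

# Rating matrix of users on items (0 means "not rated").
rating <- matrix(0, nrow = length(users), ncol = length(items),
                 dimnames = list(as.character(users), as.character(items)))
rating[cbind(match(train$user, users), match(train$item, items))] <- train$pref

# Co-occurrence matrix from the indicator of rated items.
co <- crossprod((rating > 0) * 1)

# Rating matrix times co-occurrence matrix gives the score of every
# item for every user (co is symmetric, so the order matches co %*% userx).
scores <- rating %*% co

# Recommendation list for user 3: drop items already rated, sort the rest.
user3 <- scores["3", ]
rec <- sort(user3[rating["3", ] == 0], decreasing = TRUE)
print(user3)
print(rec)
```

For user 3 the full score vector is 40.0, 18.5, 24.5, 38.0, 26.0, 16.5, 15.5 for items 101 to 107, which matches the user 3 rows of the final result in this section.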
[Figure 7.2: Recommender job on Hadoop. Labels: preference values (HDFS); To Item Preference Mapper and To User Vector Reducer; user preference vectors; User Vector To Co-occurrence Mapper, combiner, and Reducer; co-occurrence matrix; User Vector Splitter Mapper; Co-occurrence column wrapper mapper; pre-partial-product vectors.]
$key
[1] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 102 102
102 102
[20] 102 102 102 103 103 103 103 103 103 103 103 103 103 103 104 104 104
104 104
[39] 104 104 104 104 104 104 104 105 105 105 105 106 106 106 106 107 107
107 107
[58] 101 101 101 101 101 101 102 102 102 102 102 102 103 103 103 103 103
103 104
[77] 104 104 104 104 104 105 105 105 105 105 105 106 106 106 106 106 106
$val
[1] 101 102 103 101 102 103 104 101 104 105 107 101 103 104 106 101 102
103 101
[20] 102 103 104 101 102 103 101 102 103 104 101 103 104 106 101 102 103
104 101
[39] 104 105 107 101 103 104 106 101 104 105 107 101 103 104 106 101 104
105 107
[58] 101 102 103 104 105 106 101 102 103 104 105 106 101 102 103 104 105
106 101
[77] 102 103 104 105 106 101 102 103 104 105 106 101 102 103 104 105 106
2. Count the item combination list and create co-occurrence matrix of items.
a. Key: vector of item list
b. Val: value of data frame of co-occurrence matrix(item, item, Freq)
The format of the matrix should be the same as the format that follows in Section 7.4.3.2.
Merge the two heterogeneous data sources into one, laying the foundation for data in
Section 7.4.3.3.
$key
[1] 101 101 101 101 101 101 101 102 102 102 102 102 102 103 103 103 103
103 103
[20] 104 104 104 104 104 104 104 105 105 105 105 105 105 105 106 106 106
106 106
[39] 106 107 107 107 107
$val
k v freq
1 101 101 5
2 101 102 3
3 101 103 4
4 101 104 4
5 101 105 2
6 101 106 2
7 101 107 1
8 102 101 3
9 102 102 3
10 102 103 3
11 102 104 2
12 102 105 1
13 102 106 1
14 103 101 4
15 103 102 3
16 103 103 4
17 103 104 3
18 103 105 1
19 103 106 2
20 104 101 4
21 104 102 2
22 104 103 3
23 104 104 4
24 104 105 2
25 104 106 2
26 104 107 1
27 105 101 2
28 105 102 1
29 105 103 1
30 105 104 2
31 105 105 2
32 105 106 1
33 105 107 1
34 106 101 2
35 106 102 1
36 106 103 2
37 106 104 2
38 106 105 1
39 106 106 2
40 107 101 1
41 107 104 1
42 107 105 1
43 107 107 1
$key
[1] 101 101 101 101 101 102 102 102 103 103 103 103 104 104 104 104
105 105 106
[20] 106 107
$val
item user pref
1 101 1 5.0
2 101 2 2.0
3 101 3 2.0
4 101 4 5.0
5 101 5 4.0
6 102 1 3.0
7 102 2 2.5
8 102 5 3.0
9 103 1 2.5
10 103 2 5.0
11 103 4 3.0
12 103 5 2.0
13 104 2 2.0
14 104 3 4.0
15 104 4 4.5
16 104 5 4.0
17 105 3 4.5
18 105 5 3.5
19 106 4 4.0
20 106 5 4.0
21 107 3 5.0
◾ Key: NULL
◾ Val: merged data frame
$key
NULL
$val
k.l v.l freq.l item.r user.r pref.r
1 103 101 4 103 1 2.5
2 103 102 3 103 1 2.5
3 103 103 4 103 1 2.5
4 103 104 3 103 1 2.5
5 103 105 1 103 1 2.5
6 103 106 2 103 1 2.5
7 103 101 4 103 2 5.0
8 103 102 3 103 2 5.0
9 103 103 4 103 2 5.0
10 103 104 3 103 2 5.0
11 103 105 1 103 2 5.0
12 103 106 2 103 2 5.0
13 103 101 4 103 4 3.0
$key
[1] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
101 101
[19] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
101 102
[37] 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102
102 103
[55] 103 103 103 103 103 103 103 103 103 103 103 103 103 103 103 103
103 103
[73] 103 103 103 103 103 104 104 104 104 104 104 104 104 104 104 104
104 104
[91] 104 104 104 104 104 104 104 104 104 104 104 104 104 104 104 105
105 105
[109] 105 105 105 105 105 105 105 105 105 105 105 106 106 106 106
106 106 106
[127] 106 106 106 106 106 107 107 107 107
$val
k.l v.l user.r v
1 101 101 1 25.0
2 101 101 2 10.0
3 101 101 3 10.0
4 101 101 4 25.0
5 101 101 5 20.0
6 101 102 1 15.0
7 101 102 2 6.0
8 101 102 3 6.0
9 101 102 4 15.0
10 101 102 5 12.0
11 101 103 1 20.0
12 101 103 2 8.0
13 101 103 3 8.0
14 101 103 4 20.0
15 101 103 5 16.0
16 101 104 1 20.0
17 101 104 2 8.0
18 101 104 3 8.0
$key
[1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5
5 5 5
$val
user item pref
1 1 101 44.0
2 1 103 39.0
3 1 104 33.5
4 1 102 31.5
5 1 106 18.0
6 1 105 15.5
7 1 107 5.0
8 2 101 45.5
9 2 103 41.5
10 2 104 36.0
11 2 102 32.5
12 2 106 20.5
13 2 105 15.5
14 2 107 4.0
15 3 101 40.0
16 3 104 38.0
17 3 105 26.0
18 3 103 24.5
19 3 102 18.5
20 3 106 16.5
21 3 107 15.5
22 4 101 63.0
23 4 104 55.0
24 4 103 53.5
25 4 102 37.0
26 4 106 33.0
27 4 105 26.0
28 4 107 9.5
29 5 101 68.0
30 5 104 59.0
31 5 103 56.5
32 5 102 42.5
33 5 106 34.5
34 5 105 32.0
35 5 107 11.5
> to.dfs(1:10)
Warning message:
In to.dfs(1:10) : Converting to.dfs argument to keyval with a NULL key
# Load rmr2.
> library(rmr2)
# Input data file.
> train<-read.csv(file="small.csv",header=FALSE)
> names(train)<-c("user","item","pref")
# Use the Hadoop backend in rmr2. Hadoop is the default backend.
> rmr.options(backend = 'hadoop')
# Store the data set in HDFS.
> train.hdfs = to.dfs(keyval(train$user,train))
# Print the result
> from.dfs(train.hdfs)
13/04/07 14:35:44 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
13/04/07 14:35:44 INFO zlib.ZlibFactory: Successfully loaded &
initialized native-zlib library
13/04/07 14:35:44 INFO compress.CodecPool: Got brand-new decompressor
$key
[1] 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 5 5
$val
user item pref
1 1 101 5.0
2 1 102 3.0
3 1 103 2.5
4 2 101 2.0
5 2 102 2.5
6 2 103 5.0
7 2 104 2.0
8 3 101 2.0
9 3 104 4.0
10 3 105 4.5
11 3 107 5.0
12 4 101 5.0
13 4 103 3.0
14 4 104 4.5
15 4 106 4.0
16 5 101 4.0
17 5 102 3.0
18 5 103 2.0
19 5 104 4.0
20 5 105 3.5
21 5 106 4.0
# STEP 1, Create co-occurrence matrix of items.
# 1) get the combination list of all items according to user groups.
> train.mr<-mapreduce(
+ train.hdfs,
+ map = function(k, v) {
+ keyval(k,v$item)
+ }
+ ,reduce=function(k,v){
+ m<-merge(v,v)
+ keyval(m$x,m$y)
+ }
+ )
> from.dfs(train.mr)
$key
[1] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 102
102 102 102
[20] 102 102 102 103 103 103 103 103 103 103 103 103 103 103 104 104
104 104 104
[39] 104 104 104 104 104 104 104 105 105 105 105 106 106 106 106 107
107 107 107
[58] 101 101 101 101 101 101 102 102 102 102 102 102 103 103 103 103
103 103 104
[77] 104 104 104 104 104 105 105 105 105 105 105 106 106 106 106 106 106
$val
[1] 101 102 103 101 102 103 104 101 104 105 107 101 103 104 106 101
102 103 101
[20] 102 103 104 101 102 103 101 102 103 104 101 103 104 106 101 102
103 104 101
[39] 104 105 107 101 103 104 106 101 104 105 107 101 103 104 106 101
104 105 107
[58] 101 102 103 104 105 106 101 102 103 104 105 106 101 102 103 104
105 106 101
[77] 102 103 104 105 106 101 102 103 104 105 106 101 102 103 104 105 106
> from.dfs(step2.mr)
$key
[1] 101 101 101 101 101 101 101 102 102 102 102 102 102 103 103 103
103 103 103
[20] 104 104 104 104 104 104 104 105 105 105 105 105 105 105 106 106
106 106 106
[39] 106 107 107 107 107
$val
k v freq
1 101 101 5
2 101 102 3
3 101 103 4
4 101 104 4
5 101 105 2
6 101 106 2
7 101 107 1
8 102 101 3
9 102 102 3
10 102 103 3
11 102 104 2
12 102 105 1
13 102 106 1
14 103 101 4
15 103 102 3
16 103 103 4
17 103 104 3
18 103 105 1
19 103 106 2
20 104 101 4
21 104 102 2
22 104 103 3
23 104 104 4
24 104 105 2
25 104 106 2
26 104 107 1
27 105 101 2
28 105 102 1
29 105 103 1
30 105 104 2
31 105 105 2
32 105 106 1
33 105 107 1
34 106 101 2
35 106 102 1
36 106 103 2
37 106 104 2
38 106 105 1
39 106 106 2
40 107 101 1
41 107 104 1
42 107 105 1
43 107 107 1
# 2. Create rating matrix of users on items.
> train2.mr<-mapreduce(
+ train.hdfs,
+ map = function(k, v) {
+ #df<-v[which(v$user==3),]
+ df<-v
+ key<-df$item
+ val<-data.frame(item=df$item,user=df$user,pref=df$pref)
+ keyval(key,val)
+ }
+)
> from.dfs(train2.mr)
$key
[1] 101 101 101 101 101 102 102 102 103 103 103 103 104 104 104 104
105 105 106
[20] 106 107
$val
item user pref
1 101 1 5.0
2 101 2 2.0
3 101 3 2.0
4 101 4 5.0
5 101 5 4.0
6 102 1 3.0
7 102 2 2.5
8 102 5 3.0
9 103 1 2.5
10 103 2 5.0
11 103 4 3.0
12 103 5 2.0
13 104 2 2.0
14 104 3 4.0
15 104 4 4.5
16 104 5 4.0
17 105 3 4.5
18 105 5 3.5
19 106 4 4.0
20 106 5 4.0
21 107 3 5.0
#3. Merge the co-occurrence matrix and the rating matrix.
> eq.hdfs<-equijoin(
+ left.input=step2.mr,
+ right.input=train2.mr,
+ map.left=function(k,v){
+ keyval(k,v)
+ },
+ map.right=function(k,v){
+ keyval(k,v)
+ },
+ outer = c("left")
+)
> from.dfs(eq.hdfs)
$key
NULL
$val
k.l v.l freq.l item.r user.r pref.r
1 103 101 4 103 1 2.5
2 103 102 3 103 1 2.5
3 103 103 4 103 1 2.5
4 103 104 3 103 1 2.5
5 103 105 1 103 1 2.5
6 103 106 2 103 1 2.5
7 103 101 4 103 2 5.0
8 103 102 3 103 2 5.0
9 103 103 4 103 2 5.0
10 103 104 3 103 2 5.0
11 103 105 1 103 2 5.0
12 103 106 2 103 2 5.0
13 103 101 4 103 4 3.0
14 103 102 3 103 4 3.0
15 103 103 4 103 4 3.0
16 103 104 3 103 4 3.0
17 103 105 1 103 4 3.0
18 103 106 2 103 4 3.0
19 103 101 4 103 5 2.0
20 103 102 3 103 5 2.0
> from.dfs(cal.mr)
$key
[1] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
101 101 101
[19] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
101 101 102
[37] 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102
102 102 103
[55] 103 103 103 103 103 103 103 103 103 103 103 103 103 103 103
103 103 103
[73] 103 103 103 103 103 104 104 104 104 104 104 104 104 104 104
104 104 104
[91] 104 104 104 104 104 104 104 104 104 104 104 104 104 104 104
105 105 105
[109] 105 105 105 105 105 105 105 105 105 105 105 106 106 106 106
106 106 106
[127] 106 106 106 106 106 107 107 107 107
$val
k.l v.l user.r v
1 101 101 1 25.0
2 101 101 2 10.0
3 101 101 3 10.0
4 101 101 4 25.0
5 101 101 5 20.0
6 101 102 1 15.0
7 101 102 2 6.0
8 101 102 3 6.0
9 101 102 4 15.0
10 101 102 5 12.0
11 101 103 1 20.0
12 101 103 2 8.0
13 101 103 3 8.0
14 101 103 4 20.0
15 101 103 5 16.0
16 101 104 1 20.0
17 101 104 2 8.0
18 101 104 3 8.0
19 101 104 4 20.0
20 101 104 5 16.0
$val
user item pref
1 1 101 44.0
2 1 103 39.0
3 1 104 33.5
4 1 102 31.5
5 1 106 18.0
6 1 105 15.5
7 1 107 5.0
8 2 101 45.5
9 2 103 41.5
10 2 104 36.0
11 2 102 32.5
12 2 106 20.5
13 2 105 15.5
14 2 107 4.0
15 3 101 40.0
16 3 104 38.0
17 3 105 26.0
18 3 103 24.5
19 3 102 18.5
20 3 106 16.5
21 3 107 15.5
22 4 101 63.0
23 4 104 55.0
24 4 103 53.5
25 4 102 37.0
26 4 106 33.0
27 4 105 26.0
28 4 107 9.5
29 5 101 68.0
30 5 104 59.0
31 5 103 56.5
32 5 102 42.5
33 5 106 34.5
34 5 105 32.0
35 5 107 11.5
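The join-based pipeline (co-occurrence pairs, join with ratings, partial products, sum per user and item) can be reproduced locally to check the numbers above. The sketch below uses base R merge() and aggregate() as stand-ins for the equijoin and reduce steps; it is an illustration under that assumption, not the rmr2 code, and the column name part is illustrative.

```r
# The 21 ratings of this section.
train <- data.frame(
  user = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5),
  item = c(101, 102, 103, 101, 102, 103, 104, 101, 104, 105, 107,
           101, 103, 104, 106, 101, 102, 103, 104, 105, 106),
  pref = c(5.0, 3.0, 2.5, 2.0, 2.5, 5.0, 2.0, 2.0, 4.0, 4.5, 5.0,
           5.0, 3.0, 4.5, 4.0, 4.0, 3.0, 2.0, 4.0, 3.5, 4.0)
)

# Step 1: all item pairs per user, then count each pair (k, v, freq).
ui <- train[, c("user", "item")]
pairs <- merge(ui, ui, by = "user")          # columns: user, item.x, item.y
co <- aggregate(user ~ item.x + item.y, data = pairs, FUN = length)
names(co) <- c("k", "v", "freq")

# Steps 2 and 3: join the co-occurrence rows with the ratings on k == item.
joined <- merge(co, train, by.x = "k", by.y = "item")

# Step 4: partial product freq * pref for every joined row.
joined$part <- joined$freq * joined$pref

# Step 5: sum the partial products per user and target item v.
scores <- aggregate(part ~ user + v, data = joined, FUN = sum)
names(scores) <- c("user", "item", "pref")
scores <- scores[order(scores$user, -scores$pref), ]
print(scores[scores$user == 1, ])
```

For user 1 this gives 44.0, 39.0, 33.5, 31.5, 18.0, 15.5, and 5.0 for items 101, 103, 104, 102, 106, 105, and 107, the same values as rows 1 to 7 of the result above.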
RHadoop series
HBase is a distributed database product of the Hadoop family. It has many advantages, including
support for high-concurrency reading and writing, column-oriented data storage, efficient indexing,
auto-sharding, and automatic region migration. It has become more and more accepted and used in industry.
For information on the installation and configuration of HBase, please refer to Appendix H.
View the server environment of HBase. Use the command bin/start-hbase.sh to start the HBase server.
The default port is 60000.
# Start Hadoop.
~ /home/conan/hadoop/hadoop-1.1.2/bin/start-all.sh
# Start HBase.
~ /home/conan/hadoop/hbase-0.94.2/bin/start-hbase.sh
# Start Thrift service of HBase.
~ /home/conan/hadoop/hbase-0.94.2/bin/hbase-daemon.sh start thrift
# View HBase process.
~ jps
12041 HMaster
12209 HRegionServer
13222 ThriftServer
31734 TaskTracker
31343 DataNode
31499 SecondaryNameNode
13328 Jps
31596 JobTracker
11916 HQuorumPeer
31216 NameNode
# Create table.
HBase Shell:
create 'student_shell','info'
rhbase:
hb.new.table("student_rhbase","info")
# List all the tables.
HBase Shell:
list
rhbase:
hb.list.tables()
# Display table structure.
HBase Shell:
describe 'student_shell'
rhbase:
hb.describe.table("student_rhbase")
# Insert a piece of data.
HBase Shell:
put 'student_shell','mary','info:age','19'
rhbase:
hb.insert("student_rhbase",list(list("mary","info:age", "24")))
# Load data.
HBase Shell:
get 'student_shell','mary'
rhbase:
hb.get('student_rhbase','mary')
# Delete table (HBase Shell needs two commands to delete a table; rhbase needs only one).
HBase Shell:
disable 'student_shell'
drop 'student_shell'
rhbase:
hb.delete.table('student_rhbase')
# Start R.
~ R
# Load rhbase.
> library(rhbase)
# Initialize HBase in R.
> hb.init()
<pointer: 0x16494a0>
attr(,"class")
[1] "hb.client.connection"
# Create table.
> hb.new.table("student_rhbase","info",opts=list(maxversions=5,x=
list(maxversions=1L,compression='GZ',inmemory=TRUE)))
[1] TRUE
# View all the tables.
> hb.list.tables()
$student_rhbase
maxversions compression inmemory bloomfiltertype bloomfiltervecsize
info: 5 NONE FALSE NONE
0
bloomfilternbhashes blockcache timetolive
info: 0 FALSE -1
# View table structure.
> hb.describe.table("student_rhbase")
maxversions compression inmemory bloomfiltertype
bloomfiltervecsize
info: 5 NONE FALSE NONE
0
bloomfilternbhashes blockcache timetolive
info: 0 FALSE -1
# Query data.
> hb.get('student_rhbase','mary')
[[1]]
[[1]][[1]]
[1] "mary"
[[1]][[2]]
[1] "info:age"
[[1]][[3]]
[[1]][[3]][[1]]
[1] "24"
# Delete table.
> hb.delete.table('student_rhbase')
[1] TRUE
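The nested list printed by hb.get() above is awkward to work with directly. As a minimal sketch, the structure can be flattened into a data frame with base R. The list below is a literal copy of the printed output for the row 'mary', typed by hand rather than fetched from HBase, and the column names row, column, and value are illustrative.

```r
# Literal copy of the structure printed above: one row key,
# one column name, and a list holding one value.
res <- list(list("mary", "info:age", list("24")))

# Flatten every entry into a (row, column, value) record.
records <- lapply(res, function(r) {
  data.frame(row = r[[1]],
             column = r[[2]],
             value = unlist(r[[3]]),
             stringsAsFactors = FALSE)
})
df <- do.call(rbind, records)

# The value is stored as text here, so convert it before doing arithmetic.
df$age <- as.numeric(df$value)
print(df)
```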
RHadoop series
http://blog.fens.me/rhadoo-rmr2-pipemapred/
When we use the rmr2 package in RHadoop, we may meet the error PipeMapRed.waitOutputThreads():
subprocess failed with code 1. This error has troubled many users. What causes it,
and how do we solve it? This section gives the answers.
The code that may trigger the error is as follows.
Error log in R:
hadoop-unjar1697638502297829404/] [] /tmp/
streamjob4620072667602885650.jar tmpDir=null
13/06/23 10:44:25 INFO mapred.FileInputFormat: Total input paths to
process : 1
13/06/23 10:44:25 INFO streaming.StreamJob: getLocalDirs(): [/home/
conan/hadoop/tmp/mapred/local]
13/06/23 10:44:25 INFO streaming.StreamJob: Running job:
job_201306231032_0001
13/06/23 10:44:25 INFO streaming.StreamJob: To kill this job, run:
13/06/23 10:44:25 INFO streaming.StreamJob: /home/conan/hadoop/
hadoop-1.1.2/libexec/../bin/hadoop job -Dmapred.job.tracker=hdfs://
master:9001 -kill job_201306231032_0001
13/06/23 10:44:25 INFO streaming.StreamJob: Tracking URL: http://
master:50030/jobdetails.jsp?jobid=job_201306231032_0001
13/06/23 10:44:26 INFO streaming.StreamJob: map 0% reduce 0%
13/06/23 10:45:04 INFO streaming.StreamJob: map 100% reduce 100%
13/06/23 10:45:04 INFO streaming.StreamJob: To kill this job, run:
13/06/23 10:45:04 INFO streaming.StreamJob: /home/conan/hadoop/
hadoop-1.1.2/libexec/../bin/hadoop job -Dmapred.job.tracker=hdfs://
master:9001 -kill job_201306231032_0001
13/06/23 10:45:04 INFO streaming.StreamJob: Tracking URL: http://
master:50030/jobdetails.jsp?jobid=job_201306231032_0001
13/06/23 10:45:04 ERROR streaming.StreamJob: Job not successful.
Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
LastFailedTask: task_201306231032_0001_m_000000
13/06/23 10:45:04 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine,
vectorized.reduce, :
hadoop streaming failed with error code 1
We can’t find the actual error of Hadoop only from the log above.
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.
java:576) at
org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135) at
org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at
org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436) at
org.apache.hadoop.mapred.MapTask.run(MapTask.java:372) at
org.apache.hadoop.mapred.Child$4.run(Child.java:255) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:396) at
org.apache.hadoop.security.UserGroupInformation.
doAs(UserGroupInformation.java:1121) at
org.apache.hadoop.mapred.Child.main(Child.java:249)
From the description, this appears to be a permission error. Let’s analyze the permissions
of users and user groups in the RHadoop environment.
The users have the same permissions, so why does the error appear? I ran another test in
Hadoop as root, and no error was reported.
Because the problem is now quite specific, let’s search the Internet for the cause.
Search for “org.apache.hadoop.security.AccessControlException” on Google. Following the search
results, adjust the configuration of Hadoop: add the definition of dfs.permissions.superusergroup
in hdfs-site.xml. We can first check the default value of the superusergroup setting in the system.
The default name of the group is supergroup. Modify hdfs-site.xml and add the definition
of dfs.permissions.superusergroup.
~ vi $HADOOP_HOME/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/home/conan/hadoop/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.permissions.superusergroup</name>
<value>supergroup</value>
</property>
</configuration>
Restart the Hadoop cluster and R, and run rmr2 script. But the error is still not solved.
# Start R.
~ R
# /usr/local is empty.
~ ls /usr/local/lib/R/site-library
I find that all the R libraries in my environment are under the directory /home/conan, and
there are none under the directory /usr/local/lib/R/site-library. So we reinstall RHadoop as
root, following the directions of the author of RHadoop.
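Before reinstalling, it can help to see which library directory each required package currently lives in. The sketch below uses only base R; the package list is taken from the install commands and load messages in this section, and a missing package simply shows NA.

```r
# Packages that rmr2 depends on in this section, plus rmr2 itself.
required <- c("rJava", "Rcpp", "RJSONIO", "digest", "functional",
              "stringr", "plyr", "reshape2", "rmr2")

# installed.packages() returns one row per installed package,
# including the library directory it was found in.
ip <- installed.packages()
status <- data.frame(
  package = required,
  libpath = unname(ip[match(required, rownames(ip)), "LibPath"]),
  stringsAsFactors = FALSE
)
print(status)

# Library search path of the current R session, for comparison.
print(.libPaths())
```

If the libpath column points at a home directory rather than a system-wide directory such as /usr/local/lib/R/site-library, Hadoop tasks running as another user may not be able to load the packages.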
# Switch to root.
~ sudo -i
# View the path of class libraries of R.
~ R
> .libPaths()
[1] "/usr/local/lib/R/site-library" "/usr/lib/R/site-library"
[3] "/usr/lib/R/library"
# Install under root authority.
~ R CMD javareconf
~ R
# Start R.
> install.packages("rJava")
> install.packages("reshape2")
> install.packages("Rcpp")
> install.packages("iterators")
> install.packages("itertools")
> install.packages("digest")
> install.packages("RJSONIO")
> install.packages("functional")
~ cd /home/conan/R
~ R CMD INSTALL rmr2_2.1.0.tar.gz
The dependent packages are now installed under /usr/local/lib/R/site-library. Quit the root
user and restart R to test.
# Restart R.
~ R
> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape
# Run rmr2.
> small.ints = to.dfs(1:10)
> mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
Note: If the PATH variable already exists, append the JDK path to the end and separate
it from the previous entry with a semicolon (;).
After saving the settings, open a new CMD window and test Java from the command line.
304 ◾ Appendix A
# Install JDK.
~ sh ./jdk-6u45-linux-x64.bin
~ pwd
/home/conan/tookit
~ sudo vi /etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/conan/tookit/jdk1.6.0_45/bin"
JAVA_HOME=/home/conan/tookit/jdk1.6.0_45
~ echo $JAVA_HOME
/home/conan/tookit/jdk1.6.0_45
~ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
~ java
Usage: java [-options] class [args...]
(to execute a class)
or java [-options] -jar jarfile [args...]
(to execute a jar file)
where options include:
-d32 use a 32-bit data model if available
-d64 use a 64-bit data model if available
-server to select the "server" VM
The default VM is server.
-cp <class search path of directories and zip/jar files>
-classpath <class search path of directories and zip/jar files>
A : separated list of directories, JAR archives,
and ZIP archives to search for class files.
-D<name>=<value>
set a system property
-verbose[:class|gc|jni]
enable verbose output
-version print product version and exit
-version:<value>
require the specified version to run
-showversion print product version and continue
-jre-restrict-search | -jre-no-restrict-search
include/exclude user private JREs in the version search
-? -help print this help message
-X print help on non-standard options
-ea[:<packagename>...|:<classname>]
-enableassertions[:<packagename>...|:<classname>]
enable assertions
-da[:<packagename>...|:<classname>]
-disableassertions[:<packagename>...|:<classname>]
disable assertions
-esa | -enablesystemassertions
enable system assertions
-dsa | -disablesystemassertions
disable system assertions
-agentlib:<libname>[=<options>]
load native agent library <libname>, e.g. -agentlib:hprof
see also, -agentlib:jdwp=help and -agentlib:hprof=help
-agentpath:<pathname>[=<options>]
load native agent library by full pathname
-javaagent:<jarpath>[=<options>]
load Java programming language agent, see java.lang.instrument
-splash:<imagepath>
show splash screen with specified image
After all of the preceding operations, the installation of a Java environment in Linux Ubuntu
is completed.
Appendix B: Installation of MySQL
During the installation we need to enter a password for the MySQL root user. I chose mysql as my
password. After installation, the MySQL server starts automatically and we can check the
MySQL server process.
~ mysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 42
Server version: 5.5.35-0ubuntu0.12.04.2 (Ubuntu)
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights
reserved.
Type 'help;' or '\h' for help. Type '\c' to clear the current input
statement.
mysql>
~ mysql -uroot -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 37
Server version: 5.5.35-0ubuntu0.12.04.2 (Ubuntu)
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights
reserved.
~ sudo vi /etc/mysql/my.cnf
# Add the character encoding of client under the label [client].
[client]
default-character-set=utf8
~ sudo vi /etc/mysql/my.cnf
#bind-address = 127.0.0.1
Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the stop(8) and then start(8)
utilities,
e.g. stop mysql ; start mysql. The restart(8) utility is also
available.
mysql start/running, process 3577
~ mysql -uroot -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 37
Server version: 5.5.35-0ubuntu0.12.04.2 (Ubuntu)
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights
reserved.
Type 'help;' or '\h' for help. Type '\c' to clear the current input
statement.
We can see that the listening address has changed from 127.0.0.1:3306 to 0.0.0.0:3306,
which indicates that MySQL now accepts remote connections.
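The bind-address edit can also be scripted instead of made in vi. The following is a minimal sketch, not the procedure used in this appendix: the function name comment_out_bind_address is our own, and the demo runs against a scratch file rather than the live /etc/mysql/my.cnf. It assumes a sed that accepts -i with a backup suffix, as GNU and BSD sed do.

```shell
#!/bin/sh
# Sketch: comment out an active bind-address line so that mysqld listens on
# all interfaces. The function name and its file argument are our own choices.
comment_out_bind_address() {
    conf="$1"
    # Keep a .bak copy, then prefix any uncommented bind-address line with '#'.
    # Lines that already start with '#' do not match and are left alone.
    sed -i.bak 's/^[[:space:]]*bind-address/#&/' "$conf"
}

# Demo on a scratch copy rather than the live configuration file.
demo=$(mktemp)
printf '[mysqld]\nbind-address = 127.0.0.1\n' > "$demo"
comment_out_bind_address "$demo"
cat "$demo"
rm -f "$demo" "$demo.bak"
```

As with the manual edit, the MySQL server has to be restarted before the new listening address takes effect.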
The MySQL server is now installed on Ubuntu Linux.
Appendix C: Installation of Redis
After the installation, the Redis server starts automatically. Now let’s check that the Redis
server process is running.
~ sudo vi /etc/redis/redis.conf
~ sudo vi /etc/redis/redis.conf
#bind 127.0.0.1
~ redis-cli
We can connect to the server but can’t run a command without authenticating. We then connect
to the Redis server again, this time supplying the password.
~ redis-cli -a redisredis
Everything goes well this time. Check the port on which Redis is listening.
We can see that the listening address has changed from 127.0.0.1:6379 to 0.0.0.0:6379,
which indicates that Redis now accepts remote connections. We can access the Redis server remotely
from another Linux machine.
Remote access is successful. The Redis server is now installed on Ubuntu Linux.
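The two redis.conf edits in this appendix can be sketched as a script. This is only an illustration under assumptions: we assume the password is set through the requirepass directive, the function names set_requirepass and comment_out_bind are our own, and the script should be tried on a copy of the file first.

```shell
#!/bin/sh
# Sketch: apply two redis.conf edits non-interactively.
# Function names are our own; FILE should be a copy of redis.conf while testing.

# set_requirepass FILE PASSWORD
# Replace an existing (possibly commented-out) requirepass line, or append one.
# PASSWORD must not contain '|', '&' or backslashes, which sed would interpret.
set_requirepass() {
    conf="$1"
    password="$2"
    if grep -Eq '^[[:space:]]*#?[[:space:]]*requirepass[[:space:]]' "$conf"; then
        sed -E -i.bak \
            "s|^[[:space:]]*#?[[:space:]]*requirepass[[:space:]].*|requirepass $password|" \
            "$conf"
    else
        printf 'requirepass %s\n' "$password" >> "$conf"
    fi
}

# comment_out_bind FILE
# Prefix an active bind line with '#' so Redis listens on all interfaces.
comment_out_bind() {
    sed -E -i.bak 's/^[[:space:]]*bind[[:space:]]/#&/' "$1"
}
```

For example, set_requirepass redis.conf.copy redisredis followed by comment_out_bind redis.conf.copy reproduces the settings used here; the Redis server must be restarted to pick them up.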
Appendix D: Installation of MongoDB
After installation, MongoDB starts automatically. Let’s check that the MongoDB server
process is running.
Since the script you are attempting to invoke has been converted to an
Upstart job, you may also use the status(8) utility, e.g. status mongodb
mongodb start/running, process 6870
Check the status of the MongoDB server through its web console. Enter http://ip:28017 in
the browser to open the web console, as in Figure D.1.
http://docs.mongodb.org/
http://groups.google.com/group/mongodb-user
The MongoDB server allows external access by default. A single-node MongoDB is now
successfully installed on Ubuntu Linux.
Appendix E
.
├── apache-cassandra-1.2.15
│ ├── bin
│ ├── CHANGES.txt
│ ├── conf
│ ├── interface
│ ├── javadoc
│ ├── lib
│ ├── LICENSE.txt
│ ├── NEWS.txt
│ ├── NOTICE.txt
│ ├── pylib
│ ├── README.txt
│ └── tools
├── apache-cassandra-1.2.15-bin.tar.gz
~ vi conf/cassandra.yaml
data_file_directories:
- /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches
Make sure that all these directories have been created in the operating system and that the directory
/var/log/cassandra is writable by the user who runs Cassandra.
# Create directory.
~ sudo mkdir -p /var/lib/cassandra/data
~ sudo mkdir -p /var/lib/cassandra/saved_caches
~ sudo mkdir -p /var/lib/cassandra/commitlog
~ sudo mkdir -p /var/log/cassandra/
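The directory check described above can be sketched as a small script. The function name prepare_cassandra_dirs and its optional prefix argument are our own additions for illustration; the prefix only exists so the script can be tried in a scratch location without touching /var.

```shell
#!/bin/sh
# Sketch: create the directories named in cassandra.yaml, plus the log
# directory, and confirm that the current user can write to each of them.
# ROOT is a hypothetical prefix for dry runs; leave it empty (and run with
# sufficient privileges) to work on the real paths.
prepare_cassandra_dirs() {
    root="${1:-}"
    for dir in \
        /var/lib/cassandra/data \
        /var/lib/cassandra/commitlog \
        /var/lib/cassandra/saved_caches \
        /var/log/cassandra
    do
        mkdir -p "$root$dir" || return 1
        if [ ! -w "$root$dir" ]; then
            echo "not writable: $root$dir" >&2
            return 1
        fi
    done
}
```

Calling prepare_cassandra_dirs /tmp/cassandra-test, for instance, builds the same tree under /tmp/cassandra-test and returns a nonzero status if any directory cannot be created or written.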
~ sudo vi /etc/environment
CASSANDRA_HOME=/home/conan/toolkit/cassandra1215
~ bin/cassandra-cli
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.2.15
https://github.com/apache/hadoop-common/releases/tag/release-1.1.2
~ wget https://github.com/apache/hadoop-common/archive/release-1.1.2.tar.gz
Appendix F
https://github.com/apache/hadoop-common/tree/branch-1.1
~ cd hadoop-1.1.2
~ ls -l
total 648
drwxrwxr-x 2 conan conan 4096 Mar 6 2013 bin
-rw-rw-r-- 1 conan conan 120025 Mar 6 2013 build.xml
-rw-rw-r-- 1 conan conan 467130 Mar 6 2013 CHANGES.txt
drwxrwxr-x 2 conan conan 4096 Oct 3 02:31 conf
drwxrwxr-x 2 conan conan 4096 Oct 3 02:28 ivy
-rw-rw-r-- 1 conan conan 10525 Mar 6 2013 ivy.xml
drwxrwxr-x 4 conan conan 4096 Mar 6 2013 lib
-rw-rw-r-- 1 conan conan 13366 Mar 6 2013 LICENSE.txt
drwxrwxr-x 2 conan conan 4096 Oct 3 03:35 logs
-rw-rw-r-- 1 conan conan 101 Mar 6 2013 NOTICE.txt
-rw-rw-r-- 1 conan conan 1366 Mar 6 2013 README.txt
-rw-rw-r-- 1 conan conan 7815 Mar 6 2013 sample-conf.tgz
drwxrwxr-x 16 conan conan 4096 Mar 6 2013 src
The root directory contains neither the hadoop-*.jar class libraries nor their dependent libraries.
~ wget http://archive.apache.org/dist/ant/binaries/apache-ant-1.8.4-bin.tar.gz
~ tar xvf apache-ant-1.8.4-bin.tar.gz
~ mkdir /home/conan/toolkit/
~ mv apache-ant-1.8.4 /home/conan/toolkit/
~ cd /home/conan/toolkit/
~ mv apache-ant-1.8.4 ant184
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/conan/toolkit/jdk16/bin:/home/conan/toolkit/ant184/bin"
JAVA_HOME=/home/conan/toolkit/jdk16
ANT_HOME=/home/conan/toolkit/ant184
The build takes several minutes to complete. When it finishes, check the generated build directory.
~ ls -l build
drwxrwxr-x 3 conan conan 4096 Oct 3 04:06 ant
drwxrwxr-x 2 conan conan 4096 Oct 3 04:02 c++
I find that the generated hadoop-*.jar files are all named hadoop-*-1.1.3-SNAPSHOT.jar. A possible
explanation is that the source of the last release is already marked as the SNAPSHOT of the next
version. Modify line 31 of build.xml so that the value of the version attribute is 1.1.2. The
generated packages will then be named hadoop-*-1.1.2.jar. Run Ant again.
~ vi build.xml
~ rm -rf build
~ ant
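The version change can also be made without opening an editor. The following sketch assumes that build.xml declares the version on a single line of the form <property name="version" value="..."/>; check line 31 of your copy before relying on it. The function name set_build_version is our own.

```shell
#!/bin/sh
# Sketch: set the value of the "version" property in an Ant build.xml.
# Assumes the property is declared on one line as
#   <property name="version" value="..."/>
# which may not match every copy of the file.
set_build_version() {
    build_file="$1"
    version="$2"
    # Only touch the line that declares name="version"; keep a .bak copy.
    sed -E -i.bak \
        "/<property[[:space:]]+name=\"version\"/ s|value=\"[^\"]*\"|value=\"$version\"|" \
        "$build_file"
}
```

Running set_build_version build.xml 1.1.2 before rm -rf build and ant would have the same effect as the manual edit, provided the assumption about the property line holds.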
~ sudo vi /etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/conan/toolkit/jdk16/bin:/home/conan/toolkit/ant184/bin:/home/conan/toolkit/maven3/bin:/home/conan/toolkit/tomcat7/bin:/home/conan/hadoop/hadoop-1.1.2/bin"
JAVA_HOME=/home/conan/toolkit/jdk16
ANT_HOME=/home/conan/toolkit/ant184
MAVEN_HOME=/home/conan/toolkit/maven3
HADOOP_HOME=/home/conan/hadoop/hadoop-1.1.2
HADOOP_CMD=/home/conan/hadoop/hadoop-1.1.2/bin/hadoop
HADOOP_STREAMING=/home/conan/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar
#core-site.xml
~ vi conf/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/conan/hadoop/tmp</value>
</property>
<property>
<name>io.sort.mb</name>
<value>256</value>
</property>
</configuration>
#hdfs-site.xml
~ vi conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/home/conan/hadoop/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
#mapred-site.xml
~ vi conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://master:9001</value>
</property>
</configuration>
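After editing the three configuration files, it can be useful to read a value back, for example to confirm that fs.default.name points at the intended host and port. The sketch below is a line-based reader, not an XML parser: it assumes each <name> and <value> element sits on a single line, as in the listings above. The function name get_property is our own.

```shell
#!/bin/sh
# Sketch: print the <value> that follows a given <name> in a Hadoop
# *-site.xml file. Works only when <name> and <value> each occupy one line.
# Usage: get_property FILE PROPERTY_NAME
get_property() {
    awk -v key="$2" '
        /<name>/ {
            # Remember the most recent property name, stripped of its tags.
            name = $0
            gsub(/.*<name>|<\/name>.*/, "", name)
        }
        /<value>/ && name == key {
            # Print the value belonging to the requested name and stop.
            value = $0
            gsub(/.*<value>|<\/value>.*/, "", value)
            print value
            exit
        }
    ' "$1"
}
```

For instance, get_property conf/core-site.xml fs.default.name should print the HDFS address configured above, and prints nothing when the property is absent.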
~ mkdir /home/conan/hadoop/data
~ mkdir /home/conan/hadoop/tmp
~ sudo chmod 755 /home/conan/hadoop/data/
~ sudo chmod 755 /home/conan/hadoop/tmp/
~ ssh-keygen -t rsa
~ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
6. Format HDFS.
~ bin/start-all.sh
~ jps
15574 DataNode
16324 Jps
15858 SecondaryNameNode
16241 TaskTracker
15283 NameNode
15942 JobTracker
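The jps listing shows the five Hadoop daemons of a single-node setup plus jps itself. A quick way to spot a daemon that failed to start is to compare the listing with the expected names. The following is a sketch under that assumption; the function name missing_daemons is our own, and it reads the jps output from standard input so that it can be tried on saved output as well.

```shell
#!/bin/sh
# Sketch: read `jps` output on stdin and print each expected Hadoop 1.x
# daemon that does not appear in it. Prints nothing when all are running.
missing_daemons() {
    listing=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Compare the second column exactly, so that SecondaryNameNode
        # is not mistaken for NameNode.
        printf '%s\n' "$listing" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            printf '%s\n' "$daemon"
    done
}
```

It would be used as jps | missing_daemons after bin/start-all.sh; any name it prints points to a log file worth checking under the logs directory.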
~ vi src/core/org/apache/hadoop/fs/FileUtil.java
After repackaging with Ant, a hadoop-core-1.1.2.jar that can run on Windows is generated.
We’ve now installed single-node Hadoop on Ubuntu Linux.
Appendix G: Installation of the Hive Environment
~ cd /home/cos/toolkit/hive-0.9.0
~ vi conf/hive-site.xml
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive_metadata?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>password to use against metastore database</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
#log4j.appender.EventCounter = org.apache.hadoop.metrics.jvm.EventCounter
log4j.appender.EventCounter = org.apache.hadoop.log.metrics.EventCounter
~ sudo vi /etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/home/cos/toolkit/ant184/bin:/home/cos/toolkit/jdk16/bin:/home/cos/toolkit/maven3/bin:/home/cos/toolkit/hadoop-1.0.3/bin:/home/cos/toolkit/hive-0.9.0/bin"
JAVA_HOME=/home/cos/toolkit/jdk16
ANT_HOME=/home/cos/toolkit/ant184
MAVEN_HOME=/home/cos/toolkit/maven3
HADOOP_HOME=/home/cos/toolkit/hadoop-1.0.3
HIVE_HOME=/home/cos/toolkit/hive-0.9.0
HADOOP_STREAMING=/home/conan/hadoop/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar
CLASSPATH=/home/cos/toolkit/jdk16/lib/dt.jar:/home/cos/toolkit/jdk16/lib/tools.jar
$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
~ ls /home/cos/toolkit/hive-0.9.0/lib
mysql-connector-java-5.1.22-bin.jar
# Decompress HBase.
~ tar xvf hbase-0.94.18.tar.gz
Appendix H
~ vi conf/hbase-env.sh
~ vi conf/hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>master</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/conan/hadoop/hdata</value>
</property>
</configuration>
Copy the configuration file and class library of the Hadoop environment.
~ cp ~/hadoop/hadoop-1.1.2/conf/hdfs-site.xml conf/
~ cp ~/hadoop/hadoop-1.1.2/hadoop-core-1.1.2.jar lib/
~ mkdir /home/conan/hadoop/hdata
~ /home/conan/hadoop/hadoop-1.1.2/bin/start-all.sh
~ /home/conan/hadoop/hbase-0.94.18/bin/start-hbase.sh
~ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.18, r1577788, Sat Mar 15 04:46:47 UTC 2014
hbase(main):002:0> help # View the help information of HBase command line.
~ wget http://apache.fayea.com/apache-mirror/thrift/0.9.1/thrift-0.9.1.tar.gz
~ tar xvf thrift-0.9.1.tar.gz
~ mv thrift-0.9.1/ /home/conan/hadoop/
~ cd /home/conan/hadoop/
Note: All the errors below are caused by this release package, so this method is not recommended.
The second method is to download the source code through Git.
To avoid these errors, we suggest using the second method.
# Compile Thrift.
~ make
# Install Thrift.
~ sudo make install
Thrift is now installed from the Git source code. Next, check the Thrift version.
~ thrift -version
Thrift version 0.9.1
Thrift is started, and now we can use various languages to access HBase through Thrift.
Information Technology
Unlike other books about R, written from the perspective of statistics, R for
Programmers: Mastering the Tools is written from the perspective of programmers,
providing a channel for programmers with expertise in other programming
languages to quickly understand R. The contents are divided into four sections.
The first section consists of the basics of R, explaining the advantages of using
R, the installation of different versions of R, and the 12 frequently used packages
of R. This will help you understand the tool packages, time series packages, and
performance monitoring packages of R quickly.
The fourth section comprises the appendices, introducing the installation of Java,
various databases, and Hadoop. Because this is a reference book, there is no special
sequence for reading all the chapters. You can choose the chapters in which you
have an interest. If you are new to R, and you wish to master R comprehensively,
simply follow the chapters in sequence.
K26506
ISBN: 978-1-4987-3681-7
an informa business
6000 Broken Sound Parkway, NW, Suite 300, Boca Raton, FL 33487
711 Third Avenue, New York, NY 10017
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN, UK
www.crcpress.com