Learning Predictive Analytics with R: Get to Grips with Key Data Visualization and Predictive Analytic Skills Using R, 1st Edition, by Eric Mayor
Author(s): Eric Mayor
ISBN(s): 9781782169352, 1782169350
Edition: 1
File Details: PDF, 3.36 MB
Year: 2015
Language: English
Learning Predictive Analytics
with R
Eric Mayor
BIRMINGHAM - MUMBAI
Learning Predictive Analytics with R
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book
is sold without warranty, either express or implied. Neither the author nor Packt
Publishing, nor its dealers and distributors, will be held liable for any damages
caused or alleged to have been caused, directly or indirectly, by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78216-935-2
www.packtpub.com
Credits

Reviewers: Ajay Dhamija, Khaled Tannir, Matt Wiley
Commissioning Editor: Kunal Parikh
Acquisition Editor: Kevin Colaco
Technical Editor: Deepti Tuscano
Copy Editors: Puja Lalwani, Merilyn Pereira
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Coordinator: Aparna Bhagat
Cover Work: Aparna Bhagat
About the Reviewers
• LinkedIn: ajaykumardhamija
• ResearchGate: Ajay_Dhamija2
• Academia: ajaydhamija
• Facebook: akdhamija
• Twitter: @akdhamija
• Quora: Ajay-Dhamija
He is the author of the books RavenDB 2.x Beginner's Guide and Optimizing Hadoop
MapReduce, Packt Publishing, and a technical reviewer on the books Pentaho Analytics
for MongoDB and MongoDB High Availability, Packt Publishing.
He enjoys taking landscape and night photos, traveling, playing video games,
creating funny electronics gadgets using Arduino, Raspberry Pi, and .Net Gadgeteer,
and of course spending time with his wife and family.
Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.PacktPub.com
and, as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at [email protected] for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign
up for a range of free newsletters and receive exclusive discounts and offers on Packt
books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital
book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser
Updating graphics 47
Case study – exploring cancer-related deaths in the US 50
Discovering the dataset 50
Integrating supplementary external data 55
Summary 60
Chapter 4: Cluster Analysis 61
Distance measures 63
Learning by doing – partition clustering with kmeans() 65
Setting the centroids 66
Computing distances to centroids 67
Computing the closest cluster for each case 67
Tasks performed by the main function 68
Internal validation 69
Using k-means with public datasets 71
Understanding the data with the all.us.city.crime.1970 dataset 71
Finding the best number of clusters in the life.expectancy.1971 dataset 77
External validation 79
Summary 79
Chapter 5: Agglomerative Clustering Using hclust() 81
The inner working of agglomerative clustering 82
Agglomerative clustering with hclust() 86
Exploring the results of votes in Switzerland 86
The use of hierarchical clustering on binary attributes 92
Summary 95
Chapter 6: Dimensionality Reduction with Principal Component Analysis 97
The inner working of Principal Component Analysis 98
Learning PCA in R 103
Dealing with missing values 104
Selecting how many components are relevant 105
Naming the components using the loadings 107
PCA scores 109
Accessing the PCA scores 109
PCA scores for analysis 110
PCA diagnostics 112
Summary 113
Chapter 7: Exploring Association Rules with Apriori 115
Apriori – basic concepts 116
Association rules 116
Itemsets 116
Support 116
Confidence 117
Lift 117
The inner working of apriori 117
Generating itemsets with support-based pruning 118
Generating rules by using confidence-based pruning 119
Analyzing data with apriori in R 119
Using apriori for basic analysis 119
Detailed analysis with apriori 122
Preparing the data 123
Analyzing the data 123
Coercing association rules to a data frame 127
Visualizing association rules 128
Summary 130
Chapter 8: Probability Distributions, Covariance, and Correlation 131
Probability distributions 131
Introducing probability distributions 131
Discrete uniform distribution 132
The normal distribution 133
The Student's t-distribution 136
The binomial distribution 137
The importance of distributions 138
Covariance and correlation 139
Covariance 141
Correlation 142
Pearson's correlation 142
Spearman's correlation 145
Summary 146
Chapter 9: Linear Regression 147
Understanding simple regression 148
Computing the intercept and slope coefficient 150
Obtaining the residuals 151
Computing the significance of the coefficient 154
Working with multiple regression 156
Analyzing data in R: correlation and regression 156
First steps in the data analysis 157
Performing the regression 160
Checking for the normality of residuals 161
Checking for variance inflation 162
CART 207
Pruning 208
Random forests in R 210
Examining the predictions on the testing set 211
Conditional inference trees in R 212
Caret – a unified framework for classification 213
Summary 213
Chapter 12: Multilevel Analyses 215
Nested data 215
Multilevel regression 218
Random intercepts and fixed slopes 218
Random intercepts and random slopes 219
Multilevel modeling in R 221
The null model 221
Random intercepts and fixed slopes 225
Random intercepts and random slopes 228
Predictions using multilevel models 233
Using the predict() function 233
Assessing prediction quality 234
Summary 235
Chapter 13: Text Analytics with R 237
An introduction to text analytics 237
Loading the corpus 239
Data preparation 241
Preprocessing and inspecting the corpus 241
Computing new attributes 245
Creating the training and testing data frames 245
Classification of the reviews 245
Document classification with k-NN 245
Document classification with Naïve Bayes 247
Classification using logistic regression 249
Document classification with support vector machines 252
Mining the news with R 253
A successful document classification 253
Extracting the topics of the articles 257
Collecting news articles in R from the New York Times article search API 259
Summary 262
Preface
The amount of data in the world is increasing exponentially as time passes. It is
estimated that the total amount of data produced in 2020 will be 20 zettabytes
(Kotov, 2014), that is, 20 billion terabytes. Organizations spend a lot of effort and
money on collecting and storing data, and still, most of it is not analyzed at all, or
not analyzed properly. One reason to analyze data is to predict the future, that is, to
produce actionable knowledge. The main purpose of this book is to show you how
to do that with reasonably simple algorithms. The book is composed of chapters
describing the algorithms and their use, and of appendices with exercises,
solutions to the exercises, and references.
Prediction
What is meant by prediction? The answer, of course, depends on the field and the
algorithms used, but this explanation is true most of the time—given the attested
reliable relationships between indicators (predictors) and an outcome, the presence
(or level) of the indicators for similar cases is a reliable clue to the presence (or level)
of the outcome in the future. Here are some examples of relationships, starting with
the most obvious:
Unsupervised learning
In unsupervised learning, the algorithm will seek to find the structure that
organizes unlabelled data. For instance, based on similarities or distances between
observations, an unsupervised cluster analysis will determine groups and which
observations fit best into each of the groups. An application of this is, for instance,
document classification.
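As a quick illustrative sketch of this idea (the dataset and the choice of three clusters here are assumptions for illustration, using R's built-in iris data and the kmeans() function covered later in Chapter 4):

```r
# A minimal sketch of unsupervised clustering: k-means on R's built-in
# iris dataset, ignoring the known species labels during clustering
data(iris)
set.seed(123)                            # make the cluster assignments reproducible
fit <- kmeans(iris[, 1:4], centers = 3)  # ask for 3 groups from the 4 numeric attributes
table(fit$cluster, iris$Species)         # compare discovered groups to the true labels
```

The cross-tabulation shows how well the groups found from similarities alone recover the (hidden) labels, which is exactly the kind of structure-finding task described above.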
Supervised learning
In supervised learning, we know the class or the level of some observations of a
given target attribute. When performing a prediction, we use known relationships in
labeled data (data for which we know what the class or level of the target attribute
is) to predict the class or the level of the attribute in new cases (of which we do not
know the value).
• Classification problems
• Regression problems
Classification
In some cases, we want to predict which group an observation is part of. Here,
we are dealing with a quality of the observation. This is a classification problem.
Examples include:
Regression
In other cases, we want to predict an observation's level on an attribute. Here, we are
dealing with a quantity, and this is a regression problem. Examples include:
• The prediction of how much individuals will cost to health care based on
their health habits
• The prediction of the weight of animals based on their diets
• The prediction of the number of defective devices based on manufacturing
specifications
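A regression problem like these can be sketched with simulated data (the variable names and numbers below are purely illustrative, not taken from any real dataset):

```r
# A minimal regression sketch: predicting weight from a diet measure
# using simulated (illustrative) data
set.seed(42)
diet   <- runif(30, min = 10, max = 30)     # simulated daily food intake
weight <- 5 + 2 * diet + rnorm(30, sd = 3)  # weight depends on diet, plus noise
model  <- lm(weight ~ diet)                 # fit a linear regression
coef(model)                                 # estimated intercept and slope
```

Because the outcome is a quantity rather than a category, the model predicts a numeric level (a weight) for each new case.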
As emphasized in the preceding diagram, field knowledge (here called Business
Understanding) informs and is informed by data understanding. The understanding
of the data then informs how the data has to be prepared. The next step is data
modeling, which can also lead to further data preparation. Data models have to
be evaluated, and this evaluation can be informed by field knowledge (this is also
emphasized in the diagram), which is itself updated through the data mining
process. Finally, if the evaluation is satisfactory, the models are deployed for
prediction. This book will focus on the data modeling and evaluation stages.
Caveats
Of course, predictions are not always accurate, and some have written about the
caveats of data science. What do you think about the relationship between the
attributes titled Predictor and Outcome on the following plot? It seems like there is
a relationship between the two. For the statistically inclined, I tested its significance:
r = 0.4195, p = .0024. The value p is the probability of obtaining a relationship of this
strength or stronger if there is actually no relationship between the attributes. We
could conclude that the relationship between these variables in the population they
come from is quite reliable, right?
Believe it or not, the population these observations come from is that of randomly
generated numbers. We generated a data frame of 50 columns of 50 randomly
generated numbers. We then examined all the correlations (manually) and generated
a scatterplot of the two attributes with the largest correlation we found. The code is
provided here, in case you want to check it yourself—line 1 sets the seed so that you
find the same results as we did, line 2 generates the data frame, line 3 fills it with
random numbers, column by column, line 4 generates the scatterplot, line 5 fits the
regression line, and line 6 tests the significance of the correlation (note that, since
line 4 plots DF[[2]] on the x axis and DF[[16]] on the y axis, the regression on line 5
must be fitted as DF[[16]] ~ DF[[2]]):
1 set.seed(1)
2 DF = data.frame(matrix(nrow=50,ncol=50))
3 for (i in 1:50) DF[,i] = runif(50)
4 plot(DF[[2]],DF[[16]], xlab = "Predictor", ylab = "Outcome")
5 abline(lm(DF[[16]]~DF[[2]]))
6 cor.test(DF[[2]], DF[[16]])
How could this relationship happen, given that the odds were 2.4 in 1,000? Well,
think of it; we correlated all 50 attributes two by two, which resulted in 2,450 tests (not
considering the correlation of each attribute with itself). Such a spurious correlation
was to be expected. The usual threshold below which we consider a relationship
significant is p = 0.05, as we will discuss in Chapter 8, Probability Distributions,
Covariance, and Correlation. This means that we expect to be wrong once in 20 times.
You would be right to suspect that there are other significant correlations in the
generated data frame (there should be approximately 125 of them in total). This is
the reason why we should always correct for the number of tests. In our example, as
we performed 2,450 tests, our threshold for significance should be 0.0000204 (0.05 /
2,450). This is called the Bonferroni correction.
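The correction just described can be computed directly in R; the numbers below follow the count of 2,450 tests used above:

```r
# Bonferroni correction: divide the significance threshold by the number of tests
n.tests <- 50 * 49    # 2,450 pairwise tests, as counted above
0.05 / n.tests        # corrected threshold, approximately 0.0000204
# Equivalently, inflate the observed p-value and compare it to 0.05;
# p.adjust() caps the adjusted value at 1
p.adjust(0.0024, method = "bonferroni", n = n.tests)
```

With the correction applied, the observed p = .0024 is no longer significant (0.0024 × 2,450 far exceeds 0.05), which is consistent with the data being randomly generated.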
Spurious correlations are always a possibility in data analysis and this should be
kept in mind at all times. A related concept is that of overfitting. Overfitting happens,
for instance, when a weak classifier bases its prediction on the noise in data. We
will discuss overfitting in the book, particularly when discussing cross-validation
in Chapter 14, Cross-validation and Bootstrapping Using Caret and Exporting Predictive
Models Using PMML. All the chapters are listed in the following section.
We hope you enjoy reading the book and hope you learn a lot from us!