Introduction To Data Mining With R: Yanchang Zhao
Yanchang Zhao
http://www.RDataMining.com
8 May 2015
Presented at AusDM 2014 (QUT, Brisbane) in Nov 2014, at Twitter (US) in Oct 2014, at UJAT (Mexico) in Sept 2014, and at University of Canberra in Sept 2013
Questions
Outline
Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Big Data
Online Resources
What is R?
An Introduction to R
The R Language Definition
R Data Import/Export
...
http://www.r-project.org/
http://cran.r-project.org/
http://www.bioconductor.org/
http://r-forge.r-project.org/
https://github.com/
http://cran.r-project.org/manuals.html
Why R?
http://www.kdnuggets.com/polls/2014/languages-analytics-data-mining-data-science.html
http://cran.r-project.org/web/views/
Classification with R
plot(iris.ctree)
[Figure: conditional inference tree for the iris data. Node 1 splits on Petal.Length (p < 0.001) at 1.9; node 3 splits on Petal.Width (p < 0.001) at 1.7; node 4 splits on Petal.Length (p = 0.026) at 4.4. Terminal nodes: Node 2 (n = 40), Node 5 (n = 21), Node 6 (n = 19), Node 7 (n = 32), each showing the class distribution.]
Prediction
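A minimal sketch of how prediction on held-out data might look with a conditional inference tree. It assumes the party package is installed; the seed, the 70/30 split, and the names train.data, test.data and test.pred are hypothetical choices for illustration, not taken from the slides.

```r
library(party)  # assumption: ctree() from the party package

set.seed(1234)
# hypothetical 70/30 split into training and test sets
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train.data <- iris[ind == 1, ]
test.data <- iris[ind == 2, ]

# build the tree on the training set, using all four measurements
iris.ctree <- ctree(Species ~ ., data = train.data)

# predict species for the unseen test set
test.pred <- predict(iris.ctree, newdata = test.data)

# confusion matrix: predicted vs. actual species
table(test.pred, test.data$Species)
```

Off-diagonal counts in the resulting table are the misclassified test cases.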
Clustering with R
DBSCAN: fpc
BIRCH: birch
k-means Clustering
set.seed(8953)
iris2 <- iris
# remove class IDs
iris2$Species <- NULL
# k-means clustering
iris.kmeans <- kmeans(iris2, 3)
# check result
table(iris$Species, iris.kmeans$cluster)
##
##               1  2  3
##   setosa      0 50  0
##   versicolor  2  0 48
##   virginica  36  0 14
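A minimal sketch of how the clusters and their centres could be drawn, assuming a plot on two of the four dimensions. The choice of Sepal.Length and Sepal.Width and the name dims are assumptions for illustration.

```r
set.seed(8953)
iris2 <- iris[-5]                      # drop the Species column
iris.kmeans <- kmeans(iris2, 3)

# assumption: show only two of the four dimensions
dims <- c("Sepal.Length", "Sepal.Width")

# points coloured by assigned cluster
plot(iris2[dims], col = iris.kmeans$cluster)

# cluster centres drawn as large asterisks
points(iris.kmeans$centers[, dims], col = 1:3, pch = 8, cex = 2)
```

Because only two dimensions are shown, clusters that look overlapping in the plot may still be separated in the remaining dimensions.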
[Figure: scatter plot of the k-means clustering result; recoverable axis label: Sepal.Width.]
Density-based Clustering
library(fpc)
iris2 <- iris[-5] # remove class IDs
# DBSCAN clustering
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
# compare clusters with original class IDs
table(ds$cluster, iris$Species)
[Output: cross-tabulation of clusters 0 to 3 (rows) against species (columns).]
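To make the roles of eps and MinPts concrete, the base-R sketch below marks core points by hand: points with at least MinPts points within distance eps. Counting the point itself is an assumption about the convention, and the names d, n.within and is.core are hypothetical. This is not the fpc implementation, only an illustration of the density criterion.

```r
iris2 <- iris[-5]   # remove class IDs
eps <- 0.42
MinPts <- 5

# pairwise Euclidean distances between all observations
d <- as.matrix(dist(iris2))

# number of points within eps of each point
# (includes the point itself, at distance 0)
n.within <- rowSums(d <= eps)

# a point is treated as a core point if its eps-neighbourhood is dense enough
is.core <- n.within >= MinPts

# how many core points fall in each species
table(is.core, iris$Species)
```

Points that are neither core points nor within eps of one are the ones a density-based method would treat as noise.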
[Figure: DBSCAN clusters plotted on discriminant coordinates dc 1 and dc 2, with points labelled by cluster ID (0 to 3).]
Association Rule Mining with R
##      Class    Sex   Age Survived
## 501    3rd   Male Adult       No
## 477    3rd   Male Adult       No
## 674    3rd   Male Adult       No
## 766   Crew   Male Adult       No
## 1485   3rd Female Adult       No
## 1388   2nd Female Adult       No
## 448    3rd   Male Adult       No
## 590    3rd   Male Adult       No
##    lhs                                  rhs            support confidence  lift
## 1  {Class=2nd, Age=Child}            => {Survived=Yes}   0.011      1.000 3.096
## 2  {Class=2nd, Sex=Female, Age=Child} => {Survived=Yes}  0.006      1.000 3.096
## 3  {Class=1st, Sex=Female}           => {Survived=Yes}   0.064      0.972 3.010
## 4  {Class=1st, Sex=Female, Age=Adult} => {Survived=Yes}  0.064      0.972 3.010
## 5  {Class=2nd, Sex=Male, Age=Adult}  => {Survived=No}    0.070      0.917 1.354
## 6  {Class=2nd, Sex=Female}           => {Survived=Yes}   0.042      0.877 2.716
## 7  {Class=Crew, Sex=Female}          => {Survived=Yes}   0.009      0.870 2.692
## 8  {Class=Crew, Sex=Female, Age=Adult} => {Survived=Yes} 0.009      0.870 2.692
## 9  {Class=2nd, Sex=Male}             => {Survived=No}    0.070      0.860 1.271
## 10 {Class=2nd,
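The support, confidence and lift columns can be checked by hand. The sketch below recomputes them for rule 1, {Class=2nd, Age=Child} => {Survived=Yes}, from the Titanic contingency table in base R's datasets package. It assumes that table holds the same 2201 records as the data used for the rules; the variable names are hypothetical.

```r
# Titanic: 4-way table with dimensions Class, Sex, Age, Survived
n <- sum(Titanic)                               # total number of records

lhs  <- sum(Titanic["2nd", , "Child", ])        # records matching the left-hand side
both <- sum(Titanic["2nd", , "Child", "Yes"])   # records matching both sides
rhs  <- sum(Titanic[, , , "Yes"])               # records matching the right-hand side

support    <- both / n               # P(lhs and rhs)
confidence <- both / lhs             # P(rhs | lhs)
lift       <- confidence / (rhs / n) # confidence relative to the base rate of rhs

round(c(support = support, confidence = confidence, lift = lift), 3)
```

The rounded values come out as 0.011, 1.000 and 3.096, matching rule 1 in the table. A lift above 1 means the left-hand side makes survival more likely than the overall survival rate.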
library(arulesViz)
plot(rules, method = "graph")
[Figure: "Graph for 12 rules", with left-hand-side itemsets such as {Class=2nd,Age=Child} and {Class=2nd,Sex=Male} linked to {Survived=Yes} or {Survived=No}.]
Text Mining with R
Text mining: tm
Retrieve Tweets
Retrieve recent tweets by @RDataMining
## Option 1: retrieve tweets from Twitter
library(twitteR)
tweets <- userTimeline("RDataMining", n = 3200)
## Option 2: download @RDataMining tweets from RDataMining.com
url <- "http://www.rdatamining.com/data/rdmTweets.RData"
dir.create("./data", showWarnings = FALSE)  # make sure the target folder exists
download.file(url, destfile = "./data/rdmTweets.RData", mode = "wb")
## load tweets into R
load(file = "./data/rdmTweets.RData")
(n.tweet <- length(tweets))
## [1] 320
strwrap(tweets[[320]]$text, width = 55)
## [1] "An R Reference Card for Data Mining is now available"
## [2] "on CRAN. It lists many useful R functions and packages"
## [3] "for data mining applications."
Text Cleaning
library(tm)
# convert tweets to a data frame
df <- twListToDF(tweets)
# build a corpus
myCorpus <- Corpus(VectorSource(df$text))
# convert to lower case
# (recent versions of tm expect content_transformer() around base functions)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
# remove punctuation and numbers
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove URLs, 'http' followed by non-space characters
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
# remove 'r' and 'big' from stopwords
myStopwords <- setdiff(stopwords("english"), c("r", "big"))
# remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
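To see what each cleaning step does, the base-R sketch below applies the same sequence to a single string instead of a corpus. The example tweet text and its URL are hypothetical, and stopword removal is left out for brevity.

```r
# hypothetical tweet text with a made-up short URL
x <- "An R Reference Card for Data Mining is now available on CRAN: http://t.co/abc123"

x <- tolower(x)                        # lower case
x <- gsub("[[:punct:]]", "", x)        # punctuation: the URL becomes "httptcoabc123"
x <- gsub("[[:digit:]]", "", x)        # numbers: the URL becomes "httptcoabc"
x <- gsub("http[^[:space:]]*", "", x)  # what is left of the URL
x <- gsub("[[:space:]]+", " ", trimws(x))  # tidy up whitespace

x
```

Removing punctuation first still works because the stripped URL keeps its leading "http" and contains no spaces, so the URL pattern continues to match it.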
Stemming
# keep a copy of corpus
myCorpusCopy <- myCorpus
# stem words
myCorpus <- tm_map(myCorpus, stemDocument)
# stem completion
myCorpus <- tm_map(myCorpus, stemCompletion,
dictionary = myCorpusCopy)
# replace "miners" with "mining", because "mining" was
# first stemmed to "mine" and then completed to "miners"
myCorpus <- tm_map(myCorpus, content_transformer(gsub),
                   pattern = "miners", replacement = "mining")
strwrap(myCorpus[320], width=55)
## [1] "r reference card data mining now available cran list"
## [2] "used r functions package data mining applications"
Frequent Terms
[Figure: bar chart of frequent terms, with axis labels "big", "mining", "postdoctoral", "social", "computing", "network", "r", "tutorial", "data", "package..", "research..", "universi..".]
Associations
# which words are associated with 'r'?
findAssocs(myTdm, "r", 0.2)
##             r
## examples 0.32
## code     0.29
## package  0.20
# which words are associated with 'mining'?
findAssocs(myTdm, "mining", 0.25)
##                mining
## data             0.47
## mahout           0.30
## recommendation   0.30
## sets             0.30
## supports         0.30
## frequent         0.26
## itemset          0.26
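The association scores above can be read as correlations between terms across documents. The base-R sketch below assumes findAssocs() is based on the Pearson correlation between rows of the term-document matrix, and reproduces that idea on four hypothetical toy documents. The names docs, tokens, tdm and assoc are illustrative.

```r
# hypothetical toy documents, already cleaned
docs <- c("r data mining", "r code examples",
          "data mining with r", "big data analytics")

tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# term-document matrix: one row per term, one column per document
tdm <- vapply(tokens,
              function(tk) as.numeric(table(factor(tk, levels = terms))),
              numeric(length(terms)))
rownames(tdm) <- terms

# correlation of every other term with "mining" across documents
assoc <- cor(t(tdm))["mining", ]
assoc <- sort(assoc[names(assoc) != "mining"], decreasing = TRUE)

# keep only terms at or above a correlation limit of 0.25
round(assoc[assoc >= 0.25], 2)
```

A term scores high when it tends to appear in the same documents as "mining" and to be absent from the others.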
Network of Terms
library(graph)
library(Rgraphviz)
plot(myTdm, term=freq.terms, corThreshold=0.1, weighting=T)
[Figure: network of terms, with nodes university, tutorial, social, network, analysis, mining, research, postdoctoral, position, used, data, big, package, examples, computing and slides.]
Word Cloud
library(wordcloud)
m <- as.matrix(myTdm)
freq <- sort(rowSums(m), decreasing=T)
wordcloud(words=names(freq), freq=freq, min.freq=4, random.order=F)
[Figure: word cloud of the tweet terms with a frequency of at least 4.]
Topic Modelling
library(topicmodels)
set.seed(123)
myLda <- LDA(as.DocumentTermMatrix(myTdm), k=8)
terms(myLda, 5)
##      Topic 1    Topic 2  Topic 3    Topic 4
## [1,] "mining"   "data"   "r"        "position"
## [2,] "data"     "free"   "examples" "research"
## [3,] "analysis" "course" "code"     "university"
## [4,] "network"  "online" "book"     "data"
## [5,] "social"   "ausdm"  "mining"   "postdoctoral"
##      Topic 5        Topic 6     Topic 7     Topic 8
## [1,] "data"         "data"      "r"         "r"
## [2,] "r"            "scientist" "package"   "data"
## [3,] "mining"       "research"  "computing" "clustering"
## [4,] "applications" "r"         "slides"    "mining"
## [5,] "series"       "package"   "parallel"  "detection"
R and Big Data
Hadoop
Spark
H2O
MongoDB
R and Hadoop
https://github.com/RevolutionAnalytics/RHadoop/wiki
http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/
Online Resources
RDataMining website: http://www.rdatamining.com
Online documents: http://www.rdatamining.com/resources/onlinedocs
The End
Thanks!
Email: yanchang(at)rdatamining.com