Week 8
STAT240
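library(ggplot2)
# Randomly assign each of the N data items to one of K clusters
N = nrow(data); K = 4; set.seed(240)
cluster = sample(K, size = N, replace = TRUE)
data$cluster = cluster
# A trailing "+" is required: a line that starts with "+" is parsed as a new expression
p = ggplot(data, aes(x = x, y = y, color = as.factor(cluster))) +
  labs(color = "cluster") + theme_classic() + geom_point()
print(p)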
k-means clustering in 2D
Step 2: Compute the mean x and y coordinate of each cluster
(this is called a centroid)
mu = matrix(NA, K, 3)
colnames(mu) = c("x", "y", "cluster")
for (k in 1:K) {
  mu[k, 1] = mean(data$x[data$cluster == k])
  mu[k, 2] = mean(data$y[data$cluster == k])
  mu[k, 3] = k
}
# Overlay the centroids as crosses (shape = 4) on the scatter plot
q = p + geom_point(data = as.data.frame(mu),
                   mapping = aes(x = x, y = y, color = as.factor(cluster)),
                   shape = 4, stroke = 1) +
  theme(legend.position = "none")
print(q)
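As a quick check of the centroid step, the same per-cluster means can be computed on a tiny made-up data set with base R's aggregate(). The data frame `toy` and its values are hypothetical, chosen only so that the means are easy to verify by hand; this is a sketch, not the lecture's own code.

```r
# Toy data (hypothetical values, not the lecture's data set)
toy = data.frame(x = c(0, 2, 10, 12),
                 y = c(0, 2, 10, 14),
                 cluster = c(1, 1, 2, 2))

# aggregate() computes the mean of x and of y within each cluster,
# returning one row per cluster (i.e. one centroid per row)
centroids = aggregate(cbind(x, y) ~ cluster, data = toy, FUN = mean)
print(centroids)
```

Cluster 1 contains (0, 0) and (2, 2), so its centroid is (1, 1); cluster 2 contains (10, 10) and (12, 14), so its centroid is (11, 12).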
k-means clustering in 2D
Step 3: Reassign each data item to the cluster with the closest
centroid
library(flexclust)
# dist2() computes the N x K matrix of distances between data items and
# centroids (Euclidean distance is the default)
d = dist2(data[, 1:2], mu[, 1:2])
# For each data item, pick the cluster whose centroid is closest
for (n in 1:N) { data$cluster[n] = which.min(d[n, ]) }
p = ggplot(data, aes(x = x, y = y, color = as.factor(cluster))) +
  labs(color = "cluster") + theme_classic() + geom_point()
print(p)
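The reassignment step can also be sketched without flexclust. The example below uses base R only; the matrices `pts` and `ctr` are hypothetical toy values, and the distance matrix is built by hand rather than with dist2().

```r
# Four toy points and two toy centroids (hypothetical values)
pts = matrix(c(0, 0,   1, 1,   9, 9,   10, 10), ncol = 2, byrow = TRUE)
ctr = matrix(c(0.5, 0.5,   9.5, 9.5), ncol = 2, byrow = TRUE)

# d[n, k] = Euclidean distance from point n to centroid k.
# sweep() subtracts centroid k from every row of pts.
d = sapply(1:nrow(ctr), function(k) sqrt(rowSums(sweep(pts, 2, ctr[k, ])^2)))

# For each point (row of d), the index of the closest centroid
nearest = apply(d, 1, which.min)
print(nearest)
```

The first two points are closest to centroid 1 and the last two to centroid 2.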
k-means clustering in 2D
Iterate by repeating steps 2 and 3 until a stopping condition is
met (for example, the cluster assignments stop changing or a maximum
number of iterations is reached)
set.seed(240)
# Note iter.max: it caps the number of iterations
result = kmeans(data[, 1:2], centers = 4, iter.max = 100)
data$cluster = result$cluster
p = ggplot(data, aes(x = x, y = y, color = as.factor(cluster))) +
  labs(color = "cluster") + theme_classic() + geom_point()
print(p)
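To make the iteration explicit, here is a minimal base R sketch that repeats steps 2 and 3 in a loop. The function `simple_kmeans` and its arguments are hypothetical names introduced for illustration. It takes the initial assignment as an argument so the result is reproducible, and it assumes no cluster ever becomes empty. R's kmeans() uses the Hartigan-Wong algorithm by default, so its internals differ from this loop.

```r
# Hypothetical helper (not part of any package): repeat steps 2 and 3
# until the assignments stop changing or max_iter is reached.
simple_kmeans = function(X, init, max_iter = 100) {
  X = as.matrix(X)
  K = max(init)
  cluster = init
  for (iter in 1:max_iter) {
    # Step 2: centroid of each cluster (one row per cluster)
    mu = t(sapply(1:K, function(k) colMeans(X[cluster == k, , drop = FALSE])))
    # Step 3: squared distance to each centroid, then pick the closest
    d = sapply(1:K, function(k) rowSums(sweep(X, 2, mu[k, ])^2))
    new_cluster = apply(d, 1, which.min)
    if (all(new_cluster == cluster)) break  # stopping condition: no change
    cluster = new_cluster
  }
  list(cluster = cluster, centers = mu, iterations = iter)
}

# Toy data (hypothetical): two groups on the x-axis, deliberately bad start
X = matrix(c(0, 0,   1, 0,   10, 0,   11, 0), ncol = 2, byrow = TRUE)
fit = simple_kmeans(X, init = c(1, 2, 1, 2))
print(fit$cluster)
print(fit$centers)
```

Starting from the alternating assignment, the first pass moves the points into the groups {1, 2} and {3, 4}; the second pass changes nothing, so the loop stops with centroids (0.5, 0) and (10.5, 0).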
k-means theory
The k-means algorithm is one of the simplest clustering
algorithms
Drawbacks:
Sentiment analysis
Chatbots, translation
Normalization of factors
Data analysis
We'll review string manipulation in base R. First, we will read the text
of The Great Gatsby (Fitzgerald 1925) into a single string variable:
fn = "gatsby.txt"
s = readChar(fn, file.info(fn)$size)
nchar(s) # Print number of characters in text
[1] 296673
String manipulation
We want a character vector with one word per element
x = strsplit(s, '\\s+')
x = unlist(x)
print(x[204:264]) #Opening lines of the novel
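The same split can be tried on a short string without the novel's text file. The string below is a made-up stand-in; the pattern "\\s+" matches any run of whitespace, including newlines.

```r
# A short stand-in string (hypothetical, not read from gatsby.txt)
s = "In my younger and more vulnerable years\nmy father gave me some advice"

# strsplit() returns a list with one element per input string;
# unlist() flattens it into a plain character vector of words
x = unlist(strsplit(s, "\\s+"))
print(length(x))
print(x[1:3])
```

Note that if a string begins with whitespace, strsplit() returns an empty string as the first element, which may need to be dropped before counting words.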
library(wordcloud)
t = table(x)
wordcloud(names(t), t)
Wordclouds
In NLP, we often ignore "stopwords" such as "a" and "the". Here,
we also ignore infrequent words
library(stopwords)
x = tolower(x) #make all lowercase
x = x[!(x %in% stopwords("en"))] #remove "stopwords"
t = table(x)
t = t[t >= 20] #ignore infrequent words
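The filtering steps above can be checked on a handful of words. This sketch uses a hypothetical three-word stopword list (`my_stopwords`) in place of the stopwords package, and stores the counts in `counts` rather than `t` to avoid masking base R's t() function.

```r
# Toy word vector and a tiny stopword list (both hypothetical)
x = c("The", "car", "and", "the", "road", "and", "a", "car", "the", "Car")
my_stopwords = c("a", "and", "the")

x = tolower(x)                        # make all lowercase
x = x[!(x %in% my_stopwords)]         # remove stopwords
counts = table(x)                     # word frequencies
counts = counts[counts >= 2]          # ignore infrequent words
print(counts)
```

Only "car" survives: it occurs three times after lowercasing, while "road" occurs once and is dropped by the frequency threshold.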
Low-level string manipulation
Easiest R package: stringr
[1] "p" NA NA NA NA
[1] "pp" NA NA NA NA
Strings: detecting
library(stringr)
my_s = c("apple", "banana", "pear", "pineapple")
str_detect(my_s, "a") # Does each string contain the character "a"?
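For comparison, base R's grepl() plays the same role as str_detect(): it returns one TRUE or FALSE per string. The sketch below reuses `my_s`; the patterns "pp" and "^p" are extra illustrations not taken from the slides.

```r
my_s = c("apple", "banana", "pear", "pineapple")

# fixed = TRUE treats the pattern as a literal string, not a regular expression
has_a  = grepl("a", my_s, fixed = TRUE)
has_pp = grepl("pp", my_s, fixed = TRUE)

# "^p" is a regular expression: strings that start with "p"
starts_p = grepl("^p", my_s)

print(has_a)
print(my_s[has_pp])
print(my_s[starts_p])
```

Every string contains "a"; only "apple" and "pineapple" contain "pp"; only "pear" and "pineapple" start with "p".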
[[1]]
[1] "apples" "oranges" "pears" "bananas"
[[2]]
[1] "pineapples" "mangos" "guavas"
str_split(a, "\\s+")
[[1]]
[1] "apples" "and" "oranges" "and" "pears" "and" "bananas"
[[2]]
[1] "pineapples" "and" "mangos" "and" "guavas"
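The definition of `a` is not shown above. The sketch below assumes a two-element vector that is consistent with the printed output, and uses base R's strsplit(), which also returns a list with one element per input string. Splitting on " and " instead of whitespace yields just the fruit names.

```r
# Assumed input, reconstructed from the printed output (not shown in the slides)
a = c("apples and oranges and pears and bananas",
      "pineapples and mangos and guavas")

by_space = strsplit(a, "\\s+")     # split on runs of whitespace
by_and   = strsplit(a, " and ")    # split on the literal separator " and "

print(by_space[[2]])
print(by_and[[1]])
```

Because the result is a list, `[[1]]` and `[[2]]` select the pieces belonging to the first and second input string.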
Reading
Required:
Recommended:
https://r4ds.had.co.nz/strings.html