
Introduction to Data Science

STAT240

Dr. David C. Stenning


03/06/2024 - Week 8
Clustering
Group similar data items together

Exploratory data analysis / pattern recognition

Example of unsupervised learning (a branch of machine learning)


Simulated dataset
We can use Calm Code to simulate a dataset demonstrating
clustering
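If the CSV is not at hand, a minimal sketch of simulating a comparable two-column dataset in base R and saving it as data.csv is shown below; the cluster centers, spreads, and sample sizes are illustrative assumptions, not the course's actual simulation settings.

# Sketch: simulate four roughly spherical 2D clusters (assumed settings)
set.seed(240)
centers = matrix(c(0, 0, 5, 0, 0, 5, 5, 5), ncol = 2, byrow = TRUE)
sim = do.call(rbind, lapply(1:4, function(k) {
  data.frame(x = rnorm(50, centers[k, 1], 0.8),
             y = rnorm(50, centers[k, 2], 0.8))
}))
write.csv(sim, "data.csv", row.names = FALSE)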
Simulated dataset
data = read.csv('data.csv')
library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
theme_classic() +
geom_point()
k-means clustering
The simplest clustering method is k-means clustering

Iterative method with random initialization

Requires a single, fixed parameter: the number of clusters k


k-means clustering in 2D
Step 1: Randomly assign each data item to one of the k clusters

N = dim(data)[1]; K = 4; set.seed(240)
cluster = sample(K, size = N, replace = TRUE)
data$cluster = cluster
p = ggplot(data, aes(x = x, y = y, color = as.factor(cluster))) +
  labs(color = "cluster") + theme_classic() + geom_point()
print(p)
k-means clustering in 2D
Step 2: Compute the mean x and y coordinate of each cluster
(this is called a centroid)

If a cluster is empty, assign the corresponding centroid to a randomly selected data item

mu = matrix(NA, K, 3)
colnames(mu) = c("x", "y", "cluster")
for (k in 1:K) {
mu[k, 1] = mean(data$x[data$cluster == k])
mu[k, 2] = mean(data$y[data$cluster == k])
mu[k, 3] = k
}
q = p + geom_point(as.data.frame(mu), mapping = aes(x = x, y = y,
    color = as.factor(cluster)), shape = 4, stroke = 1) +
  theme(legend.position = "none")
print(q)
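The empty-cluster rule stated above is not handled in the loop; a minimal sketch of one way to add it, assuming the rule is exactly as written (move the centroid to a randomly chosen data item), is:

for (k in 1:K) {
  if (sum(data$cluster == k) == 0) {        # cluster k received no points
    idx = sample(N, 1)                      # pick a random data item
    mu[k, 1:2] = c(data$x[idx], data$y[idx])
  }
}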
k-means clustering in 2D
# Using the eval=FALSE chunk option to print the code without running it again
p = ggplot(data, aes(x = x, y = y, color = as.factor(cluster))) +
  labs(color = "cluster") + theme_classic() + geom_point()
q = p + geom_point(as.data.frame(mu), mapping = aes(x = x, y = y,
    color = as.factor(cluster)), shape = 4, stroke = 1) +
  theme(legend.position = "none")
print(q)
k-means clustering in 2D
Step 3: Reassign each data item to the cluster with the closest
centroid

library(flexclust)
# dist2() computes distance matrix (Euclidean dist. is default)
d = dist2(data[, 1:2], mu[, 1:2])
for (n in 1:N) { data$cluster[n] = which.min(d[n, ]) }
p = ggplot(data, aes(x = x, y = y, color = as.factor(cluster))) +
  labs(color = "cluster") + theme_classic() + geom_point()
print(p)
k-means clustering in 2D
Iterate by repeating steps 2 and 3 until a stopping condition is met (a full loop combining both steps is sketched after the list below)

Examples of stopping conditions:

1. A maximum number of iterations is reached

2. The cluster assignment doesn't change

3. The centroids move by only a small amount

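A minimal sketch of the full iteration, reusing data, mu, K, N, and dist2() from the earlier steps; the maximum of 100 iterations is an illustrative choice, and the empty-cluster rule is omitted for brevity:

for (iter in 1:100) {                       # stopping condition 1: max iterations
  old = data$cluster
  for (k in 1:K) {                          # Step 2: recompute centroids
    mu[k, 1] = mean(data$x[data$cluster == k])
    mu[k, 2] = mean(data$y[data$cluster == k])
  }
  d = dist2(data[, 1:2], mu[, 1:2])         # Step 3: reassign to nearest centroid
  for (n in 1:N) { data$cluster[n] = which.min(d[n, ]) }
  if (all(data$cluster == old)) break       # stopping condition 2: no change
}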

k-means clustering in 2D
The result may look something like this:

set.seed(240)
result = kmeans(data[, 1:2], centers = 4, iter.max = 100) # Note iter.max
data$cluster = result$cluster
p = ggplot(data, aes(x = x, y = y, color = as.factor(cluster))) +
  labs(color = "cluster") + theme_classic() + geom_point()
print(p)
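As a follow-up, the returned object can be inspected; these are standard fields of base R's kmeans() output:

result$centers       # fitted centroid coordinates
result$size          # number of points assigned to each cluster
result$tot.withinss  # total within-cluster sum of squares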
k-means theory
The k-means algorithm is one of the simplest clustering
algorithms

Drawbacks:

you have to already know how many clusters you want

the "variance" of each cluster is the same, such that a very


compact cluster might have a k-means solution that
includes aspects of other clusters

doesn't work for "convex" clusters

not guaranteed to find the "optimum" solution

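Because the algorithm only reaches a local optimum, a common remedy is to run it from several random initializations and keep the best fit; base R's kmeans() supports this through its nstart argument. A minimal sketch (25 restarts is an arbitrary illustrative choice):

set.seed(240)
result = kmeans(data[, 1:2], centers = 4, iter.max = 100, nstart = 25)
result$tot.withinss # kmeans() keeps the start with the lowest value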

Intermission
String manipulation
Natural language processing (NLP)

Sentiment analysis

Chatbots, translation

Electronic healthcare records ...

Preprocessing data sources

Normalization of factors

Parsing websites or documents ...

Data analysis

Understanding DNA ...


String manipulation

We'll review string manipulation in base R. First, we will read the text
of The Great Gatsby (Fitzgerald 1925) into a single string variable

fn = "gatsby.txt"
s = readChar(fn, file.info(fn)$size)
nchar(s) # Print number of characters in text

[1] 296673
String manipulation
We want a character vector with one word per element

x = strsplit(s, '\\s+')
x = unlist(x)
print(x[204:264]) #Opening lines of the novel

[1] "In" "my" "younger" "and"


[5] "more" "vulnerable" "years" "my"
[9] "father" "gave" "me" "some"
[13] "advice" "that" "I’ve" "been"
[17] "turning" "over" "in" "my"
[21] "mind" "ever" "since." "“Whenever"
[25] "you" "feel" "like" "criticizing"
[29] "anyone,”" "he" "told" "me,"
[33] "“just" "remember" "that" "all"
[37] "the" "people" "in" "this"
[41] "world" "haven’t" "had" "the"
[45] "advantages" "that" "you’ve" "had.”"
[49] "He" "didn’t" "say" "any"
[53] "more," "but" "we’ve" "always"
[57] "been" "unusually" "communicative" "in"
[61] "a"
Wordclouds
We can create a wordcloud of the document (visualization)

library(wordcloud)
t = table(x)
wordcloud(names(t), t)
Wordclouds
In NLP, we often ignore "stopwords" such as "a" and "the". Here,
we also ignore infrequent words

library(stopwords)
x = tolower(x) #make all lowercase
x = x[!(x %in% stopwords("en"))] #remove "stopwords"
t = table(x)
t = t[t >= 20] #ignore infrequent words
Low-level string manipulation
Easiest R package: stringr

Match a string and extract it str_extract

Detect a matching string str_detect

Replace one string with another str_replace

Split a string on a substring str_split


Strings: extracting
my_s = c("apples x4", "bag of flour", "bag of sugar", "OJ x2",
"1%milk x2")
str_extract(my_s, "\\d") #First instance of a number

[1] "4" NA NA "2" "1"

str_extract(my_s, "[a-z]+") #First instance of one or more


lowercase letters

[1] "apples" "bag" "bag" "x" "milk"

str_extract(my_s, "p") #First instance of "p"

[1] "p" NA NA NA NA

str_extract(my_s, "p+") #First instance of one or more "p"

[1] "pp" NA NA NA NA
Strings: detecting
my_s = c("apple", "banana", "pear", "pineapple")
str_detect(my_s, "a") #Contain the character "a"?

[1] TRUE TRUE TRUE TRUE

str_detect(my_s, "^a") #*Begin* with the character a?

[1] TRUE FALSE FALSE FALSE

str_detect(my_s,"a$") #*End* with the character a?

[1] FALSE TRUE FALSE FALSE


Strings: replacing
my_s = c("one apple", "two pears", "three bananas")
str_replace(my_s, "[aeiou]", "-") #Replace first vowel w/ "-"

[1] "-ne apple" "tw- pears" "thr-e bananas"

str_replace_all(my_s, "[aeiou]", "-")#Replace all vowels w/ "-"

[1] "-n- -ppl-" "tw- p--rs" "thr-- b-n-n-s"


Strings: splitting
a = c("apples and oranges and pears and bananas",
"pineapples and mangos and guavas")
str_split(a, " and ") #Note that the "splitter" isn't returned

[[1]]
[1] "apples" "oranges" "pears" "bananas"

[[2]]
[1] "pineapples" "mangos" "guavas"

str_split(a, "\\s+")

[[1]]
[1] "apples" "and" "oranges" "and" "pears" "and"
"bananas"

[[2]]
[1] "pineapples" "and" "mangos" "and" "guavas"
Reading
Required:

Munzert: intro to Chapter 8 and Sections 8.1 & 8.2

Recommended:

https://r4ds.had.co.nz/strings.html
