
Introduction to Data Science

STAT240

Dr. David C. Stenning


03/06/2024 - Week 8
Clustering
Group similar data items together

Exploratory data analysis / pattern recognition

Example of unsupervised learning (a branch of machine learning)


Simulated dataset
We can use Calm Code to simulate a dataset demonstrating
clustering
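If the CSV is not at hand, a minimal sketch of simulating a comparable two-column dataset in base R and saving it as data.csv is shown below; the cluster centers, spreads, and sample sizes are illustrative assumptions, not the course's actual simulation settings.

# Sketch: simulate four roughly spherical 2D clusters (assumed settings)
set.seed(240)
centers = matrix(c(0, 0, 5, 0, 0, 5, 5, 5), ncol = 2, byrow = TRUE)
sim = do.call(rbind, lapply(1:4, function(k) {
  data.frame(x = rnorm(50, centers[k, 1], 0.8),
             y = rnorm(50, centers[k, 2], 0.8))
}))
write.csv(sim, "data.csv", row.names = FALSE)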
Simulated dataset
data = read.csv('data.csv')
library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
theme_classic() +
geom_point()
k-means clustering
The simplest clustering method is k-means clustering

Iterative method with random initialization

Requires a single, fixed parameter: the number of clusters k


k-means clustering in 2D
Step 1: Randomly assign each data item to one of the k clusters

N = dim(data)[1]; K = 4; set.seed(240)
cluster = sample(K, size = N, replace = TRUE)
data$cluster = cluster
p = ggplot(data, aes(x = x, y = y, color = as.factor(cluster))) +
  labs(color = "cluster") + theme_classic() + geom_point()
print(p)
k-means clustering in 2D
Step 2: Compute the mean x and y coordinate of each cluster
(this is called a centroid)

If a cluster is empty, assign the corresponding centroid to a randomly selected data item

mu = matrix(NA, K, 3)
colnames(mu) = c("x", "y", "cluster")
for (k in 1:K) {
mu[k, 1] = mean(data$x[data$cluster == k])
mu[k, 2] = mean(data$y[data$cluster == k])
mu[k, 3] = k
}
q = p + geom_point(as.data.frame(mu), mapping = aes(x = x, y = y,
    color = as.factor(cluster)), shape = 4, stroke = 1) +
  theme(legend.position = "none")
print(q)
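The empty-cluster rule stated above is not handled in the loop; a minimal sketch of one way to add it, assuming the rule is exactly as written (move the centroid to a randomly chosen data item), is:

for (k in 1:K) {
  if (sum(data$cluster == k) == 0) {        # cluster k received no points
    idx = sample(N, 1)                      # pick a random data item
    mu[k, 1:2] = c(data$x[idx], data$y[idx])
  }
}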
k-means clustering in 2D
# Using the eval=FALSE chunk option to print the code without running it again
p = ggplot(data, aes(x = x, y = y, color = as.factor(cluster))) +
  labs(color = "cluster") + theme_classic() + geom_point()
q = p + geom_point(as.data.frame(mu), mapping = aes(x = x, y = y,
    color = as.factor(cluster)), shape = 4, stroke = 1) +
  theme(legend.position = "none")
print(q)
k-means clustering in 2D
Step 3: Reassign each data item to the cluster with the closest
centroid

library(flexclust)
# dist2() computes distance matrix (Euclidean dist. is default)
d = dist2(data[, 1:2], mu[, 1:2])
for (n in 1:N) { data$cluster[n] = which.min(d[n, ]) }
p = ggplot(data, aes(x = x, y = y, color = as.factor(cluster))) +
  labs(color = "cluster") + theme_classic() + geom_point()
print(p)
k-means clustering in 2D
Iterate by repeating steps 2 and 3 until a stopping condition is met (a full loop combining both steps is sketched after the list below)

Examples of stopping conditions:

1. A maximum number of iterations is reached

2. The cluster assignment doesn't change

3. The centroids move by only a small amount

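A minimal sketch of the full iteration, reusing data, mu, K, N, and dist2() from the earlier steps; the maximum of 100 iterations is an illustrative choice, and the empty-cluster rule is omitted for brevity:

for (iter in 1:100) {                       # stopping condition 1: max iterations
  old = data$cluster
  for (k in 1:K) {                          # Step 2: recompute centroids
    mu[k, 1] = mean(data$x[data$cluster == k])
    mu[k, 2] = mean(data$y[data$cluster == k])
  }
  d = dist2(data[, 1:2], mu[, 1:2])         # Step 3: reassign to nearest centroid
  for (n in 1:N) { data$cluster[n] = which.min(d[n, ]) }
  if (all(data$cluster == old)) break       # stopping condition 2: no change
}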

k-means clustering in 2D
The result may look something like this:

set.seed(240)
result = kmeans(data[, 1:2], centers = 4, iter.max = 100) # Note iter.max
data$cluster = result$cluster
p = ggplot(data, aes(x = x, y = y, color = as.factor(cluster))) +
  labs(color = "cluster") + theme_classic() + geom_point()
print(p)
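As a follow-up, the returned object can be inspected; these are standard fields of base R's kmeans() output:

result$centers       # fitted centroid coordinates
result$size          # number of points assigned to each cluster
result$tot.withinss  # total within-cluster sum of squares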
k-means theory
The k-means algorithm is one of the simplest clustering
algorithms

Drawbacks:

you have to already know how many clusters you want

the "variance" of each cluster is the same, such that a very


compact cluster might have a k-means solution that
includes aspects of other clusters

doesn't work for "convex" clusters

not guaranteed to find the "optimum" solution

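Because the algorithm only reaches a local optimum, a common remedy is to run it from several random initializations and keep the best fit; base R's kmeans() supports this through its nstart argument. A minimal sketch (25 restarts is an arbitrary illustrative choice):

set.seed(240)
result = kmeans(data[, 1:2], centers = 4, iter.max = 100, nstart = 25)
result$tot.withinss # kmeans() keeps the start with the lowest value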

Intermission
String manipulation
Natural language processing (NLP)

Sentiment analysis

Chatbots, translation

Electronic healthcare records ...

Preprocessing data sources

Normalization of factors

Parsing websites or documents ...

Data analysis

Understanding DNA ...


String manipulation

We'll review string manipulation in base R. First, we will read the text
of The Great Gatsby (Fitzgerald 1925) into a single string variable

fn = "gatsby.txt"
s = readChar(fn, file.info(fn)$size)
nchar(s) # Print number of characters in text

[1] 296673
String manipulation
We want a character vector with one word per element

x = strsplit(s, '\\s+')
x = unlist(x)
print(x[204:264]) #Opening lines of the novel

[1] "In" "my" "younger" "and"


[5] "more" "vulnerable" "years" "my"
[9] "father" "gave" "me" "some"
[13] "advice" "that" "I’ve" "been"
[17] "turning" "over" "in" "my"
[21] "mind" "ever" "since." "“Whenever"
[25] "you" "feel" "like" "criticizing"
[29] "anyone,”" "he" "told" "me,"
[33] "“just" "remember" "that" "all"
[37] "the" "people" "in" "this"
[41] "world" "haven’t" "had" "the"
[45] "advantages" "that" "you’ve" "had.”"
[49] "He" "didn’t" "say" "any"
[53] "more," "but" "we’ve" "always"
[57] "been" "unusually" "communicative" "in"
[61] "a"
Wordclouds
We can create a wordcloud of the document (visualization)

library(wordcloud)
t = table(x)
wordcloud(names(t), t)
Wordclouds
In NLP, we often ignore "stopwords" such as "a" and "the". Here,
we also ignore infrequent words

library(stopwords)
x = tolower(x) #make all lowercase
x = x[!(x %in% stopwords("en"))] #remove "stopwords"
t = table(x)
t = t[t >= 20] #ignore infrequent words
Low-level string manipulation
Easiest R package: stringr

Match a string and extract it str_extract

Detect a matching string str_detect

Replace one string with another str_replace

Split a string on a substring str_split


Strings: extracting
my_s = c("apples x4", "bag of flour", "bag of sugar", "OJ x2",
"1%milk x2")
str_extract(my_s, "\\d") #First instance of a number

[1] "4" NA NA "2" "1"

str_extract(my_s, "[a-z]+") #First instance of one or more


lowercase letters

[1] "apples" "bag" "bag" "x" "milk"

str_extract(my_s, "p") #First instance of "p"

[1] "p" NA NA NA NA

str_extract(my_s, "p+") #First instance of one or more "p"

[1] "pp" NA NA NA NA
Strings: detecting
my_s = c("apple", "banana", "pear", "pineapple")
str_detect(my_s, "a") #Contain the character "a"?

[1] TRUE TRUE TRUE TRUE

str_detect(my_s, "^a") #*Begin* with the character a?

[1] TRUE FALSE FALSE FALSE

str_detect(my_s,"a$") #*End* with the character a?

[1] FALSE TRUE FALSE FALSE


Strings: replacing
my_s = c("one apple", "two pears", "three bananas")
str_replace(my_s, "[aeiou]", "-") #Replace first vowel w/ "-"

[1] "-ne apple" "tw- pears" "thr-e bananas"

str_replace_all(my_s, "[aeiou]", "-")#Replace all vowels w/ "-"

[1] "-n- -ppl-" "tw- p--rs" "thr-- b-n-n-s"


Strings: splitting
a = c("apples and oranges and pears and bananas",
"pineapples and mangos and guavas")
str_split(a, " and ") #Note that the "splitter" isn't returned

[[1]]
[1] "apples" "oranges" "pears" "bananas"

[[2]]
[1] "pineapples" "mangos" "guavas"

str_split(a, "\\s+")

[[1]]
[1] "apples" "and" "oranges" "and" "pears" "and"
"bananas"

[[2]]
[1] "pineapples" "and" "mangos" "and" "guavas"
Reading
Required:

Munzert: intro to Chapter 8 and Sections 8.1 & 8.2

Recommended:

https://r4ds.had.co.nz/strings.html
