Lecture 12
Cynthia Rush
Columbia University
December 9, 2016
Cynthia Rush Lecture 12: Debugging and Databases December 9, 2016 1 / 107
Course Notes
Last Time
Topics for Today
Section I
Databases
Databases vs. Dataframes
Databases
SQL
Connecting R to SQL
SQL is its own language, independent of R (similar to regular
expressions). But we're going to learn how to run SQL queries
through R.
First, install the packages DBI, RSQLite.
Also, we need a database file: download the file baseball.db and
save it in your working directory.
> library(DBI)
> library(RSQLite)
> drv <- dbDriver("SQLite")
> con <- dbConnect(drv, dbname="baseball.db")
The object con is now a persistent connection to the database
baseball.db.
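Once a connection exists, it can be inspected before any query is run. The following is a minimal sketch, assuming an in-memory SQLite database (`":memory:"`) as a stand-in so that it runs without baseball.db; the names `toy.con` and `toy.salaries` are hypothetical, and the three rows are copied from the Salaries output shown later in these notes.

```r
library(DBI)
library(RSQLite)

# Assumption: an in-memory database stands in for baseball.db.
toy.con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Three Salaries rows copied from output shown later in the notes.
toy.salaries <- data.frame(
  yearID   = c(2004L, 2007L, 2008L),
  teamID   = c("SFN", "CHA", "BOS"),
  playerID = c("aardsda01", "aardsda01", "aardsda01"),
  salary   = c(300000, 387500, 403250),
  stringsAsFactors = FALSE
)
dbWriteTable(toy.con, "Salaries", toy.salaries)

tables   <- dbListTables(toy.con)              # names of all tables
fields   <- dbListFields(toy.con, "Salaries")  # column names of one table
salaries <- dbReadTable(toy.con, "Salaries")   # whole table as a data frame

dbDisconnect(toy.con)  # close the connection when finished
```

With the real file, the same three inspection calls would be applied to `con` instead of `toy.con`.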
Check Yourself
Tasks
Using dbReadTable(), grab the table named Salaries and save it as a
data frame called salaries. Using the salaries data frame and ddply(),
compute the payroll (total of salaries) for each team in the year 2010. Find
the 3 teams with the highest payrolls, and the team with the lowest payroll.
Check Yourself
Solutions
> library(plyr)
> salaries <- dbReadTable(con, "Salaries")
> my.sum.func <- function(team.yr.df) {
+ return(sum(team.yr.df$salary))
+ }
> payroll <- ddply(salaries, .(yearID, teamID), my.sum.func)
> payroll <- payroll[payroll$yearID == 2010, ]
> payroll <- payroll[order(payroll$V1, decreasing = TRUE), ]
> payroll[1:3, ]
yearID teamID V1
733 2010 NYA 206333389
719 2010 BOS 162447333
721 2010 CHN 146609000
> payroll[nrow(payroll), ]
yearID teamID V1
737 2010 PIT 34943000
SQL
SELECT
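Main tool in the SQL language: SELECT, which allows you to perform
queries on a particular table in a database. It has the form:
SELECT columns or computations FROM table, followed by any of WHERE,
GROUP BY, HAVING, ORDER BY, LIMIT (in that order).
WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional.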
Examples
Pick out five columns from the table Batting, and look at the first 10 rows:
playerID yearID AB H HR
1 aardsda01 2004 0 0 0
2 aardsda01 2006 2 0 0
3 aardsda01 2007 0 0 0
4 aardsda01 2008 1 0 0
5 aardsda01 2009 0 0 0
6 aaronha01 1954 468 131 13
7 aaronha01 1955 602 189 27
8 aaronha01 1956 609 200 26
9 aaronha01 1957 615 198 44
10 aaronha01 1958 601 196 30
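A query of roughly the following shape would produce a table like the one above. This is a sketch on a toy in-memory database rather than baseball.db; `toy.con` and `toy.batting` are hypothetical names, and the rows are copied from the output above.

```r
library(DBI)
library(RSQLite)

# Assumption: a small in-memory table stands in for the real Batting table.
toy.con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
toy.batting <- data.frame(
  playerID = c("aardsda01", "aardsda01", "aaronha01", "aaronha01", "aaronha01"),
  yearID   = c(2004L, 2006L, 1954L, 1955L, 1956L),
  AB       = c(0L, 2L, 468L, 602L, 609L),
  H        = c(0L, 0L, 131L, 189L, 200L),
  HR       = c(0L, 0L, 13L, 27L, 26L),
  stringsAsFactors = FALSE
)
dbWriteTable(toy.con, "Batting", toy.batting)

# Pick out five columns and keep only the first 3 rows.
first.rows <- dbGetQuery(toy.con, paste("SELECT playerID, yearID, AB, H, HR",
                                        "FROM Batting",
                                        "LIMIT 3"))
print(first.rows)

dbDisconnect(toy.con)
```

Note that `paste()` joins the pieces with spaces, so the query reaches SQLite as a single string.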
Examples
To reiterate: the previous call was simply to check our work, and we
wouldn't actually want to do this on a large database, since it'd be much
less efficient to first read the table into an R data frame, and then call R
commands.
SQL
ORDER BY
We can use the ORDER BY option in SELECT to specify an ordering for the
rows
Default is ascending order; add DESC for descending
playerID yearID AB H HR
1 bondsba01 2001 476 156 73
2 mcgwima01 1998 509 152 70
3 sosasa01 1998 643 198 66
4 mcgwima01 1999 521 145 65
5 sosasa01 2001 577 189 64
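The ordering behaviour can be sketched on a toy table as follows. The in-memory database and the names `toy.con` and `toy.batting` are assumptions for illustration; the rows are copied from tables in these notes and deliberately stored out of order.

```r
library(DBI)
library(RSQLite)

# Assumption: a four-row in-memory table stands in for Batting.
toy.con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
toy.batting <- data.frame(
  playerID = c("aaronha01", "sosasa01", "bondsba01", "mcgwima01"),
  yearID   = c(1954L, 1998L, 2001L, 1998L),
  AB       = c(468L, 643L, 476L, 509L),
  H        = c(131L, 198L, 156L, 152L),
  HR       = c(13L, 66L, 73L, 70L),
  stringsAsFactors = FALSE
)
dbWriteTable(toy.con, "Batting", toy.batting)

# DESC: largest home run totals first.
top.hr <- dbGetQuery(toy.con, paste("SELECT playerID, yearID, AB, H, HR",
                                    "FROM Batting",
                                    "ORDER BY HR DESC",
                                    "LIMIT 2"))

# No keyword: ascending order, so the smallest total comes first.
bottom.hr <- dbGetQuery(toy.con, paste("SELECT playerID, HR",
                                       "FROM Batting",
                                       "ORDER BY HR",
                                       "LIMIT 1"))
dbDisconnect(toy.con)
```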
Check Yourself
Tasks
Run the following queries and determine what they're doing. Write R code to do
the same thing on the batting data frame.
> dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
+ "FROM Batting",
+ "WHERE yearID >= 1990
+ AND yearID <= 2000",
+ "ORDER BY HR DESC",
+ "LIMIT 5"))
> dbGetQuery(con, paste("SELECT playerID, yearID, MAX(HR)",
+ "FROM Batting"))
Check Yourself
Solutions
> bat.ord <- batting[order(batting$HR, decreasing = TRUE), ]
> subset <- bat.ord$yearID >= 1990 & bat.ord$yearID <= 2000
> columns <- c("playerID", "yearID", "AB", "H", "HR")
> head(bat.ord[subset, columns], 5)
playerID yearID AB H HR
54613 mcgwima01 1998 509 152 70
78578 sosasa01 1998 643 198 66
54614 mcgwima01 1999 521 145 65
78579 sosasa01 1999 625 180 63
31877 griffke02 1997 608 185 56
playerID yearID HR
7514 bondsba01 2001 73
Section II
SQL
SELECT
Main tool in the SQL language: SELECT, which allows you to perform
queries on a particular table in a database. It has the form:
SELECT columns or computations FROM table, followed by any of WHERE,
GROUP BY, HAVING, ORDER BY, LIMIT (in that order).
WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional.
Importantly, in the first line of SELECT we can directly specify
computations that we want performed.
Examples
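As a small sketch of a computation specified directly in the SELECT line, the query below asks SQLite for the mean and maximum of HR and compares the mean with the value computed in R. The in-memory database and the names `toy.con` and `toy.batting` are assumptions; the rows are copied from the Batting output earlier in these notes.

```r
library(DBI)
library(RSQLite)

# Assumption: a five-row in-memory table stands in for Batting.
toy.con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
toy.batting <- data.frame(
  playerID = c("aardsda01", "aardsda01", "aaronha01", "aaronha01", "aaronha01"),
  yearID   = c(2004L, 2006L, 1954L, 1955L, 1956L),
  HR       = c(0L, 0L, 13L, 27L, 26L),
  stringsAsFactors = FALSE
)
dbWriteTable(toy.con, "Batting", toy.batting)

# The aggregation happens inside the database; only one row comes back to R.
hr.summary <- dbGetQuery(toy.con, paste("SELECT AVG(HR), MAX(HR)",
                                        "FROM Batting"))
print(hr.summary)

# Check against the same computation done in R on the data frame.
r.mean <- mean(toy.batting$HR)

dbDisconnect(toy.con)
```

The returned columns are named after the expressions themselves, here `AVG(HR)` and `MAX(HR)`.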
GROUP BY
Note: the order of commands here matters; try switching the order of
GROUP BY and ORDER BY above, and you'll get an error.
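The clause-order rule can be checked on a toy table. This is a sketch under assumptions: an in-memory database, hypothetical names `toy.con` and `toy.batting`, and six rows copied from tables elsewhere in these notes. The first query groups before ordering; the second swaps the two clauses and is wrapped in `tryCatch()` so the resulting error is captured rather than stopping the script.

```r
library(DBI)
library(RSQLite)

# Assumption: a six-row in-memory table stands in for Batting.
toy.con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
toy.batting <- data.frame(
  playerID = c("mcgwima01", "sosasa01", "mcgwima01",
               "sosasa01", "bondsba01", "sosasa01"),
  yearID   = c(1998L, 1998L, 1999L, 1999L, 2001L, 2001L),
  HR       = c(70L, 66L, 65L, 63L, 73L, 64L),
  stringsAsFactors = FALSE
)
dbWriteTable(toy.con, "Batting", toy.batting)

# Correct order: GROUP BY comes before ORDER BY.
good <- dbGetQuery(toy.con, paste("SELECT yearID, AVG(HR)",
                                  "FROM Batting",
                                  "GROUP BY yearID",
                                  "ORDER BY AVG(HR) DESC"))

# Swapped order: SQLite rejects the query with a syntax error.
bad <- tryCatch(
  dbGetQuery(toy.con, paste("SELECT yearID, AVG(HR)",
                            "FROM Batting",
                            "ORDER BY AVG(HR) DESC",
                            "GROUP BY yearID")),
  error = function(e) e
)

dbDisconnect(toy.con)
```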
WHERE
We can use the WHERE option in SELECT to specify a subset of the rows to
use (pre-aggregation/pre-calculation)
> dbGetQuery(con, paste("SELECT yearID, AVG(HR)",
+ "FROM Batting",
+ "WHERE yearID >= 1990",
+ "GROUP BY yearID",
+ "ORDER BY AVG(HR) DESC",
+ "LIMIT 5"))
yearID AVG(HR)
1 1996 5.073620
2 1999 4.692699
3 2000 4.525437
4 2004 4.490115
5 2001 4.412288
Check Yourself
Tasks
Run the following queries and determine what they're doing. Write R code
to do the same thing on the batting data frame. Hint: use daply().
> dbGetQuery(con, paste("SELECT teamID, AVG(HR)",
+ "FROM Batting",
+ "WHERE yearID >= 1990",
+ "GROUP BY teamID",
+ "ORDER BY AVG(HR) DESC",
+ "LIMIT 5"))
Check Yourself
Solutions
> bat.sub <- batting[batting$yearID >= 1990, ]
> my.mean.func <- function(team.df) {
+ return(mean(team.df$HR, na.rm = TRUE))
+ }
> avg.hrs <- daply(bat.sub, .(teamID), my.mean.func)
> avg.hrs <- sort(avg.hrs, decreasing = TRUE)
> head(avg.hrs, 5)
CHA NYA TOR CAL TEX
6.164251 5.986486 5.760937 5.625731 5.563961
AS
HAVING
We can use the HAVING option in SELECT to specify a subset of the rows
to display (post-aggregation/post-calculation)
> dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
+ "FROM Batting",
+ "WHERE yearID >= 1990",
+ "GROUP BY yearID",
+ "HAVING avgHR >= 4.5",
+ "ORDER BY avgHR DESC"))
yearID avgHR
1 1996 5.073620
2 1999 4.692699
3 2000 4.525437
Check Yourself
Tasks
Recompute the payroll for each team in 2010, but now with
dbGetQuery() and an appropriate SQL query. In particular, the output of
dbGetQuery() should be a data frame with two columns, the first giving
the team names, and the second the payrolls, just like your output from
ddply() before. (Hint: your SQL query here will have to use GROUP BY.)
Check Yourself
Solutions
> dbGetQuery(con, paste("SELECT teamID, SUM(salary) as SUMsal",
+ "FROM Salaries",
+ "WHERE yearID = 2010",
+ "GROUP BY teamID",
+ "ORDER BY SUMsal DESC",
+ "LIMIT 3"))
teamID SUMsal
1 NYA 206333389
2 BOS 162447333
3 CHN 146609000
Section III
Databases: Join
JOIN
In all we've seen so far with SELECT, the FROM line has just specified one
table. But sometimes we need to combine information from many tables.
Use the JOIN option for this.
There are 4 options for JOIN:
1. INNER JOIN or just JOIN: retain just the rows of each table that match
the condition.
2. LEFT OUTER JOIN or just LEFT JOIN: retain all rows in the first
table, and just the rows in the second table that match the condition.
3. RIGHT OUTER JOIN or just RIGHT JOIN: retain just the rows in the
first table that match the condition, and all rows in the second table.
4. FULL OUTER JOIN or just FULL JOIN: retain all rows in both tables.
Fields that cannot be filled in are assigned NULL values, which appear as
NA in R.
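The four join types can be imitated on plain data frames with base R's `merge()`, which may help in checking SQL results by hand. The sketch below uses two hypothetical toy data frames (`toy.bat`, `toy.sal`) with made-up player IDs "a" through "d"; it illustrates the row-retention rules only and is not the SQL mechanism itself.

```r
# Hypothetical toy tables: players "b" and "c" appear in both.
toy.bat <- data.frame(playerID = c("a", "b", "c"), HR = c(10, 20, 30),
                      stringsAsFactors = FALSE)
toy.sal <- data.frame(playerID = c("b", "c", "d"), salary = c(1, 2, 3),
                      stringsAsFactors = FALSE)

# INNER JOIN: only the rows that match in both tables.
inner <- merge(toy.bat, toy.sal, by = "playerID")

# LEFT JOIN: every row of the first table; unmatched salary becomes NA.
left <- merge(toy.bat, toy.sal, by = "playerID", all.x = TRUE)

# RIGHT JOIN: every row of the second table; unmatched HR becomes NA.
right <- merge(toy.bat, toy.sal, by = "playerID", all.y = TRUE)

# FULL JOIN: every row of both tables.
full <- merge(toy.bat, toy.sal, by = "playerID", all = TRUE)

print(full)
```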
Examples
Suppose we want to find the average salaries of the players with the top 10
highest home run averages. We need to combine the Batting and Salaries tables.
> dbGetQuery(con, paste("SELECT *",
+ "FROM Salaries",
+ "ORDER BY playerID",
+ "LIMIT 8"))
yearID teamID lgID playerID salary
1 2004 SFN NL aardsda01 300000
2 2007 CHA AL aardsda01 387500
3 2008 BOS AL aardsda01 403250
4 2009 SEA AL aardsda01 419000
5 2010 SEA AL aardsda01 2750000
6 1986 BAL AL aasedo01 600000
7 1987 BAL AL aasedo01 625000
8 1988 BAL AL aasedo01 675000
Examples
We can use a JOIN on the pair: yearID, playerID.
> dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
+ "FROM Batting JOIN Salaries
+ USING(yearID, playerID)",
+ "ORDER BY playerID",
+ "LIMIT 7"))
Note that here we're missing one of David Aardsma's records from the Batting
table (i.e., the JOIN discarded 1 record).
Examples
For demonstration purposes, we can use a LEFT JOIN on the pair: yearID,
playerID:
> dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
+ "FROM Batting LEFT JOIN Salaries
+ USING(yearID, playerID)",
+ "ORDER BY playerID",
+ "LIMIT 7"))
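The difference between the two joins can be seen on a toy version of the two tables. This is a sketch under assumptions: an in-memory database, hypothetical names `toy.con`, `toy.batting` and `toy.salaries`, and a handful of David Aardsma's rows copied from the Batting and Salaries output shown earlier (his 2006 batting record has no matching salary row).

```r
library(DBI)
library(RSQLite)

# Assumption: small in-memory tables stand in for Batting and Salaries.
toy.con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
toy.batting <- data.frame(
  playerID = c("aardsda01", "aardsda01", "aardsda01"),
  yearID   = c(2004L, 2006L, 2007L),
  HR       = c(0L, 0L, 0L),
  stringsAsFactors = FALSE
)
toy.salaries <- data.frame(
  playerID = c("aardsda01", "aardsda01"),
  yearID   = c(2004L, 2007L),
  salary   = c(300000, 387500),
  stringsAsFactors = FALSE
)
dbWriteTable(toy.con, "Batting", toy.batting)
dbWriteTable(toy.con, "Salaries", toy.salaries)

# Inner join: the 2006 batting record has no salary match and is dropped.
inner <- dbGetQuery(toy.con, paste("SELECT yearID, playerID, salary, HR",
                                   "FROM Batting JOIN Salaries",
                                   "USING(yearID, playerID)",
                                   "ORDER BY yearID"))

# Left join: the 2006 record is kept, with salary filled in as NA.
left <- dbGetQuery(toy.con, paste("SELECT yearID, playerID, salary, HR",
                                  "FROM Batting LEFT JOIN Salaries",
                                  "USING(yearID, playerID)",
                                  "ORDER BY yearID"))
dbDisconnect(toy.con)
```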
Examples
Now, as to our original question (average salaries of the players with the top 10
highest home run averages):
> dbGetQuery(con, paste("SELECT playerID, AVG(HR), AVG(salary)",
+ "FROM Batting JOIN Salaries
+ USING(yearID, playerID)",
+ "GROUP BY playerID",
+ "ORDER BY Avg(HR) DESC",
+ "LIMIT 10"))
Check Yourself
Tasks
Using the Fielding table, list the 10 worst (highest) numbers of errors
(E) committed by a player in one season, only considering years 1990
and later. In addition to the number of errors, list the year and player
ID for each record.
By appropriately merging the Fielding and Salaries tables, list the
salaries for each record that you extracted in the last question.
Check Yourself
Solutions
> dbGetQuery(con, paste("SELECT yearID, playerID, E",
+ "FROM Fielding",
+ "WHERE yearID >= 1990",
+ "ORDER BY E DESC",
+ "LIMIT 10"))
yearID playerID E
1 1992 offerjo01 42
2 1993 offerjo01 37
3 1996 valenjo03 37
4 2000 valenjo03 36
5 1998 carusmi01 35
6 1995 offerjo01 35
7 2008 reynoma01 34
8 2010 desmoia01 34
9 1993 cordewi01 33
10 2000 glaustr01 33
> dbGetQuery(con, paste("SELECT yearID, playerID, E, salary",
+ "FROM Fielding LEFT JOIN Salaries
+ USING(yearID, playerID)",
+ "WHERE yearID >= 1990",
+ "ORDER BY E DESC",
+ "LIMIT 10"))
Debugging
Bug is the original name for glitches and unexpected defects in code:
dates back to at least Edison in 1876.
Debugging is the process of locating, understanding, and removing
bugs from your code.
Why should we care to learn about this?
The truth: you're going to have to debug, because you're not perfect
(none of us are!) and so you can't write perfect code.
Debugging is frustrating and time-consuming, but essential.
Writing code that makes it easier to debug later is worth it, even if it
takes a bit more time (lots of our design ideas support this).
Simple things you can do to help: use lots of comments, use
meaningful variable names!
Debugging
How?
Debugging is (largely) a process of differential diagnosis. Stages of
debugging:
1. Reproduce the error: can you make the bug reappear?
2. Characterize the error: what can you see that is going wrong?
3. Localize the error: where in the code does the mistake originate?
4. Modify the code: did you eliminate the error? Did you add new ones?
Localizing the Bug
Who called xy.coords()? (Not us, at least not explicitly!) And why is it
saying x is a list? (We never set it to be so!)
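One way to answer the "who called it?" question programmatically is to capture the error and inspect the call that raised it. The sketch below is an assumption-laden illustration: `bad.list` is a hypothetical input whose second component is named "Y" rather than "y", and `pdf(NULL)` opens a null graphics device so nothing is drawn. Interactively, running `traceback()` right after the error gives the full chain of calls.

```r
# Hypothetical input: a list whose components are "x" and "Y" (not "y").
bad.list <- list(x = -10:10, Y = (-10:10)^3)

pdf(NULL)  # null graphics device: no window and no file
err <- tryCatch(plot(bad.list), error = function(e) e)
invisible(dev.off())

# The call in which the error was raised: plot() hands its arguments to
# xy.coords() internally, which is where the complaint about x comes from.
failing.call <- conditionCall(err)
print(failing.call)
print(conditionMessage(err))
```

Seeing that the failing call is an internal helper, not our own code, suggests the input we passed to `plot()` is what needs checking.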
Let's modify the function by calling print() at various points, to print
out the state of variables, to help localize the error.
> my.plotter = function(x, y, my.list=NULL) {
+ if (!is.null(my.list)) {
+ print("Here is my.list:")
+ print(my.list)
+ print("Now about to plot my.list")
+ plot(my.list, main="A plot from my.list!")
+ }
+ else {
+ print("Here is x:"); print(x)
+ print("Here is y:"); print(y)
+ print("Now about to plot x, y")
+ plot(x, y, main="A plot from x, y!")
+ }
+ }
Localizing the Bug
$Y
[1] -1000 -729 -512 -343 -216 -125 -64 -27 -8
[20] 729 1000
Check Yourself
Tasks
Below is a random.walk() function like the one you wrote in homework.
Unfortunately, this one has some bugs: find them and fix them! If you forget the
random walk algorithm, it's written in the next slide.
> random.walk = function(x.start = 5, seed = NULL) {
+ if (!is.null(seed)) set.seed(seed)
+ x.vals <- x.start
+ while (TRUE) {
+ r <- runif(1, -2, 1)
+ if (tail(x.vals + r, 1) <= 0) break
+ else x.vals <- c(x.vals, x.vals + r)
+ }
+ return(x.vals = x.vals, num.steps = length(x.vals))
+ }
>
> # random.walk(x.start=5, seed=3)$num.steps # Should print 8
> # random.walk(x.start=10, seed=7)$num.steps # Should print 14
Check Yourself
Solutions
> random.walk = function(x.start=5, seed=NULL) {
+ if (!is.null(seed)) set.seed(seed)
+ x.vals <- x.start
+ while (TRUE) {
+ r <- runif(1, -2, 1)
+ if (tail(x.vals + r, 1) <= 0) break
+ else x.vals <- c(x.vals, x.vals[length(x.vals)] + r)
+ print(x.vals)
+ }
+ ret.val <- list(x.vals = x.vals, num.steps = length(x.vals))
+ print(ret.val)
+ return(ret.val)
+ }
> random.walk(x.start = 5, seed = 3)$num.steps
$num.steps
[1] 8
> random.walk(x.start = 10, seed = 7)$num.steps
K-means Clustering
Clustering
Similarities to PCA
Both clustering and PCA seek to simplify the data via a small number of
summaries:
PCA looks to find a low-dimensional representation of the
observations that explains most of the variance.
Clustering looks to find homogeneous subgroups among the
observations.
K-Means Clustering
Simple approach for partitioning a data set into K distinct,
non-overlapping clusters. Note K is pre-specified.
How Do We Do It?
First we specify the number of desired clusters K.
The K-means algorithm then assigns each observation to exactly one
of the K clusters.
The algorithm boils down to a simple and intuitive mathematical
problem.
Example of K-means
A simulated dataset with 150 observations in two-dimensional space. We
see the results of the K-means algorithm using different values of K.
Some figures taken from An Introduction to Statistical Learning (Springer, 2013)
with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
K-means Clustering
Notation
Let $C_1, C_2, \ldots, C_K$ denote sets containing the indices of the observations
in each cluster. These sets satisfy the following properties:
1. $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \ldots, n\}$. (Each observation belongs to at
least one of the K clusters.)
2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. (The clusters are non-overlapping.)
K-means Clustering
Main Idea
The idea behind K-means clustering is that a good clustering is one for
which the within-cluster variation is as small as possible.
Within-Cluster Variation
For cluster $C_k$, denote the within-cluster variation by $W(C_k)$.
$W(C_k)$ measures the amount by which the observations within a
cluster differ from each other.
The algorithm is then an optimization problem:
$$\min_{C_1, \ldots, C_K} \left\{ \sum_{k=1}^{K} W(C_k) \right\}.$$
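K-means Clustering
Optimization Task
To solve the optimization problem, we need to define $W(C_k)$.
The most common choice is to use squared Euclidean distance:
$$W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2,$$
where $|C_k|$ is the number of observations in cluster $k$. The optimization
problem then becomes
$$\min_{C_1, \ldots, C_K} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\}.$$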
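K-means Clustering
The algorithm never increases the objective because of the identity
$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2,$$
where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean of feature $j$ in cluster $C_k$.
Reallocating the observations (step 2b, assigning each observation to the
cluster with the closest centroid) can only improve the above, so the value
of the objective function in the optimization problem never increases.
As the algorithm runs, the clustering obtained will continually improve
until the result no longer changes.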
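The cluster mean matters because the pairwise within-cluster variation equals twice the sum of squared distances to the cluster centroid. The sketch below checks that identity numerically for a single cluster of simulated points; the data and the names `x`, `pairwise` and `centered` are assumptions chosen for illustration.

```r
set.seed(1)
x <- matrix(rnorm(10), nrow = 5, ncol = 2)  # one cluster: 5 points, 2 features
n.k <- nrow(x)

# Pairwise form: sum of squared distances over all ordered pairs (i, i'),
# divided by the cluster size.
pairwise <- 0
for (i in seq_len(n.k)) {
  for (i.prime in seq_len(n.k)) {
    pairwise <- pairwise + sum((x[i, ] - x[i.prime, ])^2)
  }
}
pairwise <- pairwise / n.k

# Centroid form: twice the sum of squared distances to the feature means.
x.bar <- colMeans(x)
centered <- 2 * sum(sweep(x, 2, x.bar)^2)

print(c(pairwise = pairwise, centered = centered))
```

Because the two forms agree, moving each point to its nearest centroid and then recomputing centroids cannot make the objective larger.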
K-means Clustering
K-means performed six times with different starting assignments gets six
different values of the objective.
[FIGURE 10.7, An Introduction to Statistical Learning, p. 390: K-means clustering performed six times on the data from Figure 10.5 with K = 3, each time with a different random assignment of the observations in Step 1 of the K-means algorithm. Above each plot is the value of ...]
Example of K-means
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Example of K-means
> library(ggplot2)
> ggplot(data = iris) +
+ geom_point(aes(Petal.Length, Petal.Width,
+ color = Species))
[Figure: scatterplot of Petal.Width against Petal.Length for the iris data, with points colored by Species (setosa, versicolor, virginica).]
Example of K-means
> set.seed(3)
> km.out <- kmeans(iris[, 3:4], centers = 3, nstart = 1)
> km.out$tot.withinss
[1] 31.41289
> km.out <- kmeans(iris[, 3:4], centers = 3, nstart = 20)
> km.out$tot.withinss
[1] 31.37136
Check Yourself
Tasks
Use the previous code to write a function K.means which takes as input
two arguments data and K and returns the final clustering the algorithm
finds. Note that in the previous work we assumed K = 3, so generalize
your function in terms of K.
Check Yourself
Solution
> K.means <- function(data, K) {
+ clusters <- sample(1:K, nrow(data), replace = TRUE)
+ centers <- apply(data, 2, tapply, clusters, mean)
+ new.clus <- new.clusters(points = data,
+ centers = centers)
+ while(any(new.clus != clusters)) {
+ clusters <- new.clus
+ centers <- apply(data, 2, tapply, clusters, mean)
+ new.clus <- new.clusters(points = data,
+ centers = centers)
+ }
+ return(list(clusters = new.clus, centers = centers))
+ }
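The solution above relies on a helper, new.clusters(), whose definition does not appear in these notes. The following is a hypothetical sketch of what such a helper might look like, assuming it takes the data points and a matrix of centers (one row per cluster) and returns, for each point, the row index of the nearest center in squared Euclidean distance. It is one possible implementation, not necessarily the one used in lecture.

```r
# Hypothetical helper: assign each point to its nearest center.
new.clusters <- function(points, centers) {
  points  <- as.matrix(points)
  centers <- as.matrix(centers)
  # Squared Euclidean distance from every point to every center:
  # rows index points, columns index centers.
  dists <- apply(centers, 1, function(ctr) rowSums(sweep(points, 2, ctr)^2))
  # For each point, the index of the closest center.
  unname(apply(dists, 1, which.min))
}

# Small usage example with two obvious groups of points.
toy.points  <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
toy.centers <- rbind(c(0, 0.5), c(10, 10.5))
toy.assign  <- new.clusters(points = toy.points, centers = toy.centers)
print(toy.assign)
```

One caveat under this assumption: the returned indices match the cluster labels 1, ..., K only while every cluster is non-empty, since `tapply()` drops the row for a cluster that has lost all of its points.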
> K.means(data, K = 3)
$clusters
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[28] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
[55] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1
[82] 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 1 3
[109] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3
[136] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3
$centers
Petal.Length Petal.Width
1 4.269231 1.342308
2 1.462000 0.246000
3 5.595833 2.037500
> km.out$centers
Petal.Length Petal.Width
1 1.462000 0.246000
2 5.595833 2.037500
3 4.269231 1.342308
Section VI
Exam Review
Format
Topics
Distributions as Models
MLE or MOM estimation.
Testing fit (visually and otherwise).
Permutation tests.
Optimization
Constrained vs. Unconstrained
Gradient Descent: Generally, how/why does it work? How is it
different from Newton's Method and what are the
strengths/weaknesses of each?
(I will not, for example, ask you to code up your own gradient descent
algorithm, though I may ask you to properly use something already
coded.)
Built-in R optimization functions (like we used in homework).
Transforming Data
Functions like sort() and order()
Obviously apply(), tapply(), sapply(), lapply(), but also the
form of the output when we use different functions like range() vs.
mean().
Built-in R functions split(), aggregate(), and merge().
Functions in the plyr package to replace the apply() family.
Unsupervised Learning
The difference between supervised and unsupervised learning.
How to find principal components and interpret them.
Familiarity with and understanding of clustering generally.