13: Clustering

Unsupervised learning - introduction

- Talk about clustering
  - Learning from unlabeled data
- Unsupervised learning
  - Useful to contrast with supervised learning
- Compare and contrast
  - Supervised learning
    - Given a set of labels, fit a hypothesis to it
  - Unsupervised learning
    - Try to determine structure in the data
    - A clustering algorithm groups data together based on data features
- What is clustering good for?
  - Market segmentation - group customers into different market segments
  - Social network analysis - e.g. Facebook "smartlists"
  - Organizing computer clusters and data centers for network layout and location
  - Astronomical data analysis - understanding galaxy formation

K-means algorithm

- Want an algorithm to automatically group the data into coherent clusters
- K-means is by far the most widely used clustering algorithm

Overview

- Take unlabeled data and group it into two clusters
- Algorithm overview
  - 1) Randomly allocate two points as the cluster centroids
    - Have as many cluster centroids as clusters you want (K cluster centroids, in fact)
    - In our example we just have two clusters
  - 2) Cluster assignment step
    - Go through each example and, depending on whether it is closer to the red or the blue centroid, assign it to one of the two clusters
    - To demonstrate this, we go through the data and "colour" each point red or blue
  - 3) Move centroid step
    - Take each centroid and move it to the average of the correspondingly assigned data points
  - Repeat 2) and 3) until convergence
- More formal definition
  - Input:
    - K (number of clusters in the data)
    - Training set {x^(1), x^(2), x^(3), ..., x^(m)}
  - Algorithm:

        Randomly initialize K cluster centroids μ_1, μ_2, ..., μ_K
        Repeat {
            for i = 1 to m
                c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
            for k = 1 to K
                μ_k := average (mean) of the points assigned to cluster k
        }

  - First inner loop (cluster assignment)
    - Repeatedly sets c^(i) to the index of the cluster centroid closest to x^(i)
    - i.e. take each example, measure the squared distance to each cluster centroid, and assign the example to the closest one: c^(i) = argmin_k ||x^(i) - μ_k||^2
  - Second inner loop (move centroid)
    - Loops over each centroid and calculates the mean of all the points assigned to it in the first loop
  - What if there is a centroid with no data assigned to it?
    - Remove that centroid, so you end up with K-1 clusters
    - Or randomly reinitialize it (though it is not clear when that is preferable)
  - A minimal implementation sketch of both steps follows below
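To make the two steps concrete, here is a minimal NumPy sketch of the algorithm just described. It is not from the lecture: the function name run_kmeans, the fixed iteration count, and the seed parameter are my own choices, and a real implementation would also test for convergence rather than run a fixed number of iterations.

    import numpy as np

    def run_kmeans(X, K, n_iters=10, seed=0):
        """Minimal K-means. X: (m, n) data matrix; K: number of clusters."""
        rng = np.random.default_rng(seed)
        m = X.shape[0]
        # Random initialization: pick K distinct training examples as centroids
        centroids = X[rng.choice(m, size=K, replace=False)].astype(float)
        for _ in range(n_iters):
            # Cluster assignment step: c[i] := index of the centroid closest to x(i)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            c = dists.argmin(axis=1)
            # Move centroid step: move each centroid to the mean of its assigned points
            for k in range(K):
                assigned = X[c == k]
                if len(assigned):
                    centroids[k] = assigned.mean(axis=0)
                # An empty cluster is simply left in place here; the notes suggest
                # removing it or randomly reinitializing it instead.
        return c, centroids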
K-means for non-separated clusters

- So far we have looked at K-means where there are well-defined clusters
- But often K-means is applied to datasets where there aren't well-defined clusters
  - e.g. T-shirt sizing
- [Figure: T-shirt sizing - a weight vs. height scatter plot with no obvious discrete groups]
- Say you want to have three sizes (S, M, L); how big do you make each one?
  - One way would be to run K-means on this data
- [Figure: T-shirt sizing - the same data grouped into three clusters by K-means]
  - K-means creates three clusters, even though they aren't really there
  - Look at the first population of people
    - Try to design a small T-shirt which fits the 1st population
    - And so on for the other two
  - This is an example of market segmentation
    - Build products which suit the needs of your subpopulations

K-means optimization objective

- Supervised learning algorithms have an optimization objective (a cost function); K-means does too
  - Why is this good?
    - Knowing this is useful because it helps with debugging
    - It helps us find better clusters
- While K-means is running we keep track of two sets of variables
  - c^(i) is the index of the cluster {1, 2, ..., K} to which example x^(i) is currently assigned
    - i.e. there are m c^(i) values, as each example has a c^(i) value, and that value can only be one of K different values
  - μ_k is the cluster centroid k
    - i.e. the location of cluster centroid k
    - So there are K μ values
    - These are the centroids which exist in the training data space
  - μ_{c^(i)} is the cluster centroid of the cluster to which example x^(i) has been assigned
    - This is more for convenience than anything else
      - You could look up that example i is indexed to cluster j (using the c vector), where j is between 1 and K
      - Then look up the value associated with cluster j in the μ vector (i.e. the features associated with μ_j)
      - But instead, for ease of description, we have this variable, which takes exactly the same value
      - Say x^(i) has been assigned to cluster 5: this means c^(i) = 5 and μ_{c^(i)} = μ_5
- Using this notation we can write the optimization objective:

      J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2

  - i.e. the average of the squared distances between each training example x^(i) and the cluster centroid to which x^(i) has been assigned
- This is just what we have been doing, as the visual description shows
  - [Figure: red lines showing the distance between each example x^(i) and the centroid of the cluster to which it has been assigned]
  - When an example is very close to its cluster centroid, this value is small
  - When the cluster centroid is very far from the example, the value is large
- This is sometimes called the distortion (or distortion cost function)
- So we are finding the values which minimize this function:

      \min_{c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K} J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)

- If we consider the K-means algorithm
  - The cluster assignment step is minimizing J(...) with respect to c^(1), ..., c^(m)
    - i.e. find the centroid closest to each example
    - Doesn't change the centroids themselves
  - The move centroid step is choosing the values of μ which minimize J(...) with respect to μ_1, ..., μ_K
  - So we are partitioning the algorithm into two parts
    - The first part minimizes the c variables
    - The second part minimizes the μ variables
- We can use this knowledge to help debug our K-means algorithm (e.g. J should never increase between iterations)

Random initialization

- How do we initialize K-means, and how do we avoid local optima?
- We never spoke about how to initialize the centroids
  - A few ways exist - one method is most recommended
- Have the number of centroids be less than the number of examples (K < m) (if K > m we have a problem)
  - Randomly pick K training examples
  - Set μ_1 up to μ_K to these examples' values
- K-means can converge to different solutions depending on the initialization
  - Risk of local optima
  - [Figure: the same dataset clustered three ways - one global optimum and two local optima]
  - The local optima are valid convergence points, but they are not the global optimum
- If this is a concern
  - We can do multiple random initializations
  - See if we get the same result - many identical results are likely to indicate a global optimum
- Algorithmically we can do this as follows (a sketch follows this list):

      For i = 1 to 100 {
          Randomly initialize K-means
          Run K-means; get c^(1), ..., c^(m), μ_1, ..., μ_K
          Compute the cost function (distortion) J(c^(1), ..., c^(m), μ_1, ..., μ_K)
      }
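A sketch of this random-restart loop, reusing the run_kmeans function from the earlier sketch. The helper names distortion and best_of_n_inits are my own, not from the lecture:

    import numpy as np

    def distortion(X, c, centroids):
        """J(c, mu): mean squared distance from each example to its assigned centroid."""
        return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))

    def best_of_n_inits(X, K, n_inits=100):
        """Run K-means n_inits times; keep the clustering with the lowest distortion."""
        best = None
        for seed in range(n_inits):
            c, centroids = run_kmeans(X, K, seed=seed)
            J = distortion(X, c, centroids)
            if best is None or J < best[0]:
                best = (J, c, centroids)
        return best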
- A typical number of times to initialize K-means is 50-1000
- With, say, 100 random initializations:
  - Randomly initialize and run K-means for each of the 100 initializations
  - Compute the distortion on the set of cluster assignments and centroids at convergence
  - End up with 100 ways of clustering the data
  - Pick the clustering which gave the lowest distortion
- If you are running K-means with 2-10 clusters, this can help find a better global optimum
  - If K is larger than 10, multiple random initializations are less likely to be necessary
  - The first solution is probably good enough (with finer clustering granularity, local optima matter less)

How do we choose the number of clusters?

- Choosing K?
  - There is not a great way to do this automatically
  - Normally you use visualizations and choose it manually
    - What are the intuitions regarding the data?
- Why is this hard?
  - Sometimes it is very ambiguous
    - e.g. two clusters or four clusters could both be reasonable
    - There is not necessarily a correct answer
  - This is why doing it automatically is hard

Elbow method

- Vary K and compute the cost function over a range of K values
- As K increases, the minimized value of J(...) should decrease (more centroids means each cluster can fit its points more closely)
- Plot this (K vs. J())
- Look for the "elbow" on the graph (a sketch of this follows at the end of these notes)
- [Figure: cost function J vs. K (no. of clusters), with the "elbow" at the bend in the curve]
- Choose the "elbow" number of clusters
- If you get a nice plot, this is a reasonable way of choosing K
- Risks
  - Normally you don't get a nice curve -> no clear elbow
  - In that case it is not really that helpful

Another method for choosing K

- Using K-means for market segmentation
- Often you run K-means for a later/downstream purpose
  - See how well different numbers of clusters serve your later needs
- e.g. the T-shirt size example
  - You could have three sizes (S, M, L)
  - Or five sizes (XS, S, M, L, XL)
  - Run K-means with K = 3 and K = 5
  - [Figure: T-shirt sizing - weight vs. height clustered into three groups (K = 3) and five groups (K = 5)]
  - This gives a way to choose the number of clusters
    - Consider the cost of making extra sizes vs. how well distributed the products are
    - How important are those sizes? (e.g. more sizes might make the customers happier)
    - So the applied problem may help guide the number of clusters
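Returning to the elbow method referenced above, here is a minimal sketch built on the run_kmeans and best_of_n_inits helpers from the earlier sketches. X is assumed to be the same (m, n) data matrix; the K range and n_inits values are arbitrary choices:

    import matplotlib.pyplot as plt

    # Minimize J for each candidate K, using random restarts to avoid local optima
    Ks = list(range(1, 9))
    Js = [best_of_n_inits(X, K, n_inits=50)[0] for K in Ks]

    # Plot K vs. J and look for the "elbow" by eye
    plt.plot(Ks, Js, marker="o")
    plt.xlabel("K (no. of clusters)")
    plt.ylabel("Cost function J (distortion)")
    plt.show()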
