SQLDM - Implementing K-Means Clustering Using SQL: Jay B.Simha

This document summarizes an attempt to implement the k-means clustering algorithm using SQL for analyzing large datasets within a database management system. The k-means algorithm is implemented using basic SQL commands like SELECT, JOIN, etc. without needing to handle the data externally. Preliminary results show the SQL k-means algorithm scales linearly as the number of attributes, records, and clusters increase, demonstrating it can efficiently analyze large datasets. Further work is needed to compare its performance to memory-based algorithms.


National conference on relational algebra, relational calculus and applications in DBMS 2007

SQLDM implementing k-means clustering using SQL

Jay B. Simha
Abiba Systems, Bengaluru 560 080
jay.b.simha@abibasystems.com

ABSTRACT

Clustering is one of the important data mining techniques used in exploratory analysis. In particular, the k-means algorithm is a popular algorithm for clustering. The algorithms available in the statistical, machine learning and database literature have limitations of scalability. In this paper an attempt has been made to implement the k-means algorithm using SQL in a database. This approach is guaranteed to provide scalability on large datasets. Further, in-database analysis makes data access and pre-processing easier. The preliminary results are encouraging. The SQLDM algorithm shows linear scalability in both the number of records and the number of attributes. Work is in progress to improve the algorithm with additional features.

Introduction: Data mining techniques can significantly boost the ability to analyze data. Despite this potential, data mining is destined to be a niche technology unless the data can be effectively accessed from traditional database systems, which may require integration with the DBMS. This is because data analysis needs to be consolidated at the warehouse for data integrity and management concerns. Hence, one of the key challenges is to enable integration of data mining technology seamlessly within the framework of traditional database systems [2]. In the last decade, significant research progress has been made towards streamlining data mining algorithms. There has been an explosion of work [1] in scaling many major data mining techniques to work with large data sets, i.e., ensuring that the algorithms are disk-aware and, more generally, conscious of the memory hierarchy, instead of assuming that all data must reside in memory. Another direction of work that has been pursued is to consider whether data mining algorithms may be implemented as traditional database applications.
Efforts to implement mining algorithms on top of database systems have also led to primitives such as sampling to ease the task of data mining on relational systems [3]. Some extensions allow the mined model to be accessed by SQL-like operators [6]. However, very little work has been done on using existing DBMS capabilities and the SQL language to implement data mining algorithms [7]. In this paper we propose a simple implementation of the k-means clustering algorithm using SQL. Commonly available DDL and DML commands are used to build the system. The resulting system is elegant (in terms of amount of code) and scalable (limited by the underlying RDBMS and its tuning).

Clustering and k-means: Cluster analysis is a technique for grouping data and finding structures in data. The most common application of clustering methods is to partition a data set into clusters or classes, where similar data are assigned to the same cluster whereas dissimilar data should belong to different clusters [4]. K-means [5] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations lead to different results. So, the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point k new centroids are calculated as the centers of the clusters resulting from the previous step. With these k new centroids, a new binding is made between the same data set points and the nearest new centroid. This creates a loop. As a result of this loop the k centroids change their location step by step until no more changes occur; in other words, the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $\left\| x_i^{(j)} - c_j \right\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, is an indicator of the distance of the n data points from their respective cluster centers.

The algorithm is composed of the following steps:

1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Fig 1. K-means algorithm

Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration, corresponding to the global minimum of the objective function. The algorithm is also significantly sensitive to the initial randomly selected cluster centers. The k-means algorithm can be run multiple times to reduce this effect.

Implementing SQL K-means: There are three reasons for implementing the k-means algorithm in SQL. First, most implementations of k-means are written in a procedural language like C, C++ or Java. Using SQL makes writing a data mining algorithm easier, since SQL is declarative and there is no need to handle data at the lowest level. Second, these algorithms assume that all the data can fit into memory. This creates a problem when the data to be explored is larger than the available memory, which is often the case in practice. Even though some clever algorithms use sampling or incremental learning, the ability to handle large data is still a problem with these algorithms. Third, building the cluster model is only one part of the problem. Accessing the data from sources like a data warehouse and transforming it outside the database is another major problem. Using SQL, this problem can be minimized, since the data is analyzed within the database without additional overheads.
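For comparison, the four steps of Fig 1 can be sketched as the kind of procedural, memory-resident implementation that the reasons above refer to. This is only an illustrative Python sketch, not code from the paper: the function name `kmeans`, the random choice of initial centroids and the `max_iter` cap are assumptions, and the whole data set is held in a Python list, which is exactly the limitation the SQL approach aims to avoid.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """In-memory k-means on a list of equal-length numeric tuples (illustrative)."""
    rng = random.Random(seed)
    # Step 1: pick k records as the initial centroids (assumed: random choice).
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

data = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
final_centroids, final_clusters = kmeans(data, k=2)
```

Every pass touches the full list `points`, so the memory footprint grows with the number of records.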


In this research an attempt has been made to implement the k-means algorithm using SQL for in-database analysis of large datasets. The strategy for implementing SQL k-means is as follows.

1. Select the K initial cluster centers randomly
   - Top K records
   - Filter with ORDER BY
   - Filter with MOD operator
2. Compute the distance matrix
   - Join the records and clusters tables to compute distances
   - Select one cluster per record
3. Check the movement of the points
   - Recompute the new cluster centers
   - Check the difference between the old and new centers
4. Repeat steps 2 and 3 until the difference is less than a threshold
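The three options listed for choosing the initial centers can be sketched as follows. The paper does not give the exact queries, so the SQL text here is one possible reading of each option; the helper name `pick_centers` is hypothetical, and the sketch runs on SQLite through Python's `sqlite3` module rather than on MySQL (where `RAND()` would replace `RANDOM()`). The table `Y` with columns `rid` and `y1` stands in for the records table.

```python
import sqlite3

def pick_centers(conn, k, method="top"):
    """Return k (rid, y1) rows from table Y to serve as initial centers."""
    if method == "top":
        # Top K records in key order.
        sql, args = "SELECT rid, y1 FROM Y ORDER BY rid LIMIT ?", (k,)
    elif method == "order_by":
        # Order by a random key and keep the first K rows.
        sql, args = "SELECT rid, y1 FROM Y ORDER BY RANDOM() LIMIT ?", (k,)
    elif method == "mod":
        # Keep every (N / K)-th record using the modulo operator.
        n = conn.execute("SELECT COUNT(*) FROM Y").fetchone()[0]
        step = max(n // k, 1)
        sql = "SELECT rid, y1 FROM Y WHERE rid % ? = 0 ORDER BY rid LIMIT ?"
        args = (step, k)
    else:
        raise ValueError(f"unknown method: {method}")
    return conn.execute(sql, args).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Y (rid INTEGER PRIMARY KEY, y1 REAL)")
conn.executemany("INSERT INTO Y VALUES (?, ?)",
                 [(i, float(i * i)) for i in range(1, 11)])

top_centers = pick_centers(conn, 3, "top")
mod_centers = pick_centers(conn, 3, "mod")
random_centers = pick_centers(conn, 3, "order_by")
```

The MOD variant spreads the picks across the table, which may matter when the records are stored in a sorted order; the top-K variant is the cheapest but takes whatever comes first.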

The different data structures (tables) used in implementing the SQL k-means are given in table 1 and the process flow is shown in figure 2.

Table 1. Data structures used for SQL k-means

Table   Primary Key   Columns            Number of rows   Contents
Y       RID           y1, y2, ..., yp    N                Records
C       CID           y1, y2, ..., yp    K                Cluster centers
YD      RID           d1, d2, ..., dp    KN               Distances
MD      RID           D                  N                Minimum distance
C2      None          cid, rid           N                Classified records
CN      CID           y1, y2, ..., yp    K                New cluster centers

The process starts with the selection of cluster centers from the data table (Y) into table C. Next, the distance of each data point (record) in the data table (Y) to each cluster center in table C is inserted into the distances table (YD). The records are then assigned to clusters based on minimum distance and given the class label of the cluster they belong to in table MD. Subsequently, the new cluster centers are computed by joining tables C2 and Y and are inserted into table CN. If CN and C differ (based on the chosen criterion), the process is repeated until convergence.

Fig. 2 SQL k-means process

The queries used for implementing the k-means are given in fig 3. The code shown is only for one attribute to demonstrate the compactness and readability of the code. However, the concept can be generalized to accommodate any number of attributes as restricted by the underlying RDBMS.


Fig 3. SQL queries for k-means
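A minimal sketch of what single-attribute queries of this kind might look like is given below, using the table names from Table 1. It should not be read as the author's actual code: the schemas are simplified (for example, YD is assumed to hold one row per record and center with columns `rid`, `cid`, `d`), the function name `sql_kmeans`, the top-K initialization, the `threshold` and the tie-breaking by `MIN(cid)` are all assumptions, and the queries run on SQLite through Python's `sqlite3` module instead of MySQL over JDBC.

```python
import sqlite3

def sql_kmeans(conn, k, threshold=1e-6, max_iter=50):
    """Run k-means on table Y(rid, y1) using SQL only; returns iterations used."""
    cur = conn.cursor()
    cur.executescript("""
        DROP TABLE IF EXISTS C;
        CREATE TABLE C (cid INTEGER PRIMARY KEY, y1 REAL);
    """)
    # Initial centers: the top K records (one of the listed options).
    cur.execute("INSERT INTO C (cid, y1) "
                "SELECT rid, y1 FROM Y ORDER BY rid LIMIT ?", (k,))
    iteration = 0
    for iteration in range(1, max_iter + 1):
        cur.executescript("""
            DROP TABLE IF EXISTS YD;
            DROP TABLE IF EXISTS MD;
            DROP TABLE IF EXISTS C2;
            DROP TABLE IF EXISTS CN;
            -- Distances: every record against every center (K*N rows).
            CREATE TABLE YD AS
                SELECT Y.rid AS rid, C.cid AS cid,
                       (Y.y1 - C.y1) * (Y.y1 - C.y1) AS d
                FROM Y CROSS JOIN C;
            -- Minimum distance per record (N rows).
            CREATE TABLE MD AS
                SELECT rid, MIN(d) AS d FROM YD GROUP BY rid;
            -- One cluster per record; MIN(cid) breaks ties (N rows).
            CREATE TABLE C2 AS
                SELECT MIN(YD.cid) AS cid, YD.rid AS rid
                FROM YD JOIN MD ON YD.rid = MD.rid AND YD.d = MD.d
                GROUP BY YD.rid;
            -- New centers: mean of each cluster's members.
            CREATE TABLE CN AS
                SELECT C2.cid AS cid, AVG(Y.y1) AS y1
                FROM C2 JOIN Y ON C2.rid = Y.rid
                GROUP BY C2.cid;
        """)
        # Largest movement of any center between C and CN.
        shift = cur.execute("""
            SELECT COALESCE(MAX(ABS(C.y1 - CN.y1)), 0)
            FROM C JOIN CN ON C.cid = CN.cid
        """).fetchone()[0]
        # Centers without members are absent from CN and keep their position.
        cur.execute("UPDATE C SET y1 = "
                    "(SELECT CN.y1 FROM CN WHERE CN.cid = C.cid) "
                    "WHERE cid IN (SELECT cid FROM CN)")
        if shift < threshold:
            break
    conn.commit()
    return iteration

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Y (rid INTEGER PRIMARY KEY, y1 REAL)")
conn.executemany("INSERT INTO Y VALUES (?, ?)",
                 list(enumerate([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], start=1)))
iterations = sql_kmeans(conn, k=2)
centers = conn.execute("SELECT cid, y1 FROM C ORDER BY cid").fetchall()
labels = conn.execute("SELECT cid, rid FROM C2 ORDER BY rid").fetchall()
```

Only the scalar `shift` and the final results cross into the host program; the records themselves never leave the database, which is the point of the in-database approach.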

Experiments: The experiments were conducted on a Linux-based desktop system with 2 GB of RAM. The database used was MySQL 5.0. The SQL code generation was done in Jython and the connection to the DBMS was made through the JDBC library. The synthetic data used in the experiments was generated using the IBM data generator [8]. The number of attributes, the number of records and the number of clusters were varied to build the performance profile of the proposed algorithm.

Results and discussions: The results of the experiments on the synthetic datasets using the SQL k-means algorithm seem to be promising. It can be seen that SQL k-means scales linearly on all three dimensions: number of attributes, number of records and number of clusters. However, a comparison of this implementation with memory-based algorithms could not be carried out, because the large data sets could not fit into memory.
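Since the SQL was generated programmatically, the step from one attribute to p attributes can be illustrated with a small generator. The paper does not show its Jython generator, so the sketch below is an assumption throughout: the function names `distance_sql` and `centers_sql` are hypothetical, squared Euclidean distance is assumed, and the generated text is checked against SQLite via `sqlite3` rather than MySQL.

```python
import sqlite3

def distance_sql(p):
    """Query text for record-to-center squared distances over y1..yp."""
    terms = " + ".join(
        f"(Y.y{i} - C.y{i}) * (Y.y{i} - C.y{i})" for i in range(1, p + 1))
    return (f"SELECT Y.rid AS rid, C.cid AS cid, {terms} AS d "
            "FROM Y CROSS JOIN C")

def centers_sql(p):
    """Query text for new centers: per-cluster mean of every attribute."""
    cols = ", ".join(f"AVG(Y.y{i}) AS y{i}" for i in range(1, p + 1))
    return (f"SELECT C2.cid AS cid, {cols} "
            "FROM C2 JOIN Y ON C2.rid = Y.rid GROUP BY C2.cid")

# Small check with p = 2 on hand-made tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Y  (rid INTEGER PRIMARY KEY, y1 REAL, y2 REAL);
    CREATE TABLE C  (cid INTEGER PRIMARY KEY, y1 REAL, y2 REAL);
    CREATE TABLE C2 (cid INTEGER, rid INTEGER);
    INSERT INTO Y  VALUES (1, 0.0, 0.0), (2, 3.0, 4.0);
    INSERT INTO C  VALUES (1, 0.0, 0.0);
    INSERT INTO C2 VALUES (1, 1), (1, 2);
""")
distances = conn.execute(distance_sql(2) + " ORDER BY rid").fetchall()
new_centers = conn.execute(centers_sql(2)).fetchall()
```

The length of the generated text grows with p while the query shape stays the same, so adding attributes changes only the expression evaluated per row.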

Figure: run time (time units) versus number of attributes, number of records (M), and number of clusters.


Conclusions: Clustering is one of the important functions in knowledge discovery and data mining. K-means is one of the most popular clustering algorithms used in data mining. In this research an attempt has been made to implement the k-means algorithm in SQL. The results are promising and the algorithm implemented in SQL scales linearly in all the dimensions tested. This is expected to provide scalability for large datasets and a reduction in pre-processing overheads. Work is in progress to enhance the current implementation with additional features. The future research directions are (i) to develop a method, implemented in SQL, for finding the optimum number of clusters, (ii) to evaluate the cluster quality and (iii) to test the concept in a live scenario.

References:

1. Agrawal R., et al.: Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining, 1996: 307-328.
2. Chaudhuri S.: Data Mining and Database Systems: Where is the Intersection? Data Engineering Bulletin 21(1), 1998.
3. Clear J., et al.: NonStop SQL/MX Primitives for Knowledge Discovery. KDD 1999: 425-429.
4. Hoppner F., Klawonn F., Kruse R., and Runkler T.: Fuzzy Cluster Analysis. John Wiley and Sons, 1999.
5. MacQueen J.: Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297, 1967.
6. Netz A., et al.: Integration of Data Mining and Relational Databases. Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt, 2000.
7. Ordonez C.: Programming the K-means Clustering Algorithm in SQL. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 22-25, 2004, Seattle, WA, USA.
8. http://www.cs.loyola.edu/~cgiannel/assoc_gen.html
