BIL Report
By
Apoorv Shinde (5021157)
Mrunali Shinde (5021158)
Mayuri Phapale (5021168)
Guided by:
Prof. Poonam Bari
CERTIFICATE
Submitted by:
Apoorv Shinde (5021157)
Mrunali Shinde (5021158)
Mayuri Phapale (5021168)
in partial fulfillment of the degree of T.E. in Information Technology, the term work of the BIL project
is approved.
External Examiner
Internal Guide
DECLARATION
We declare that this written submission represents our ideas in our own words and, where others' ideas
or words have been included, we have adequately cited and referenced the original sources. We also
declare that we have adhered to all principles of academic honesty and integrity and have not
misrepresented, fabricated, or falsified any idea, data, fact, or source in our submission. We understand
that any violation of the above will be cause for disciplinary action by the institute and can also evoke penal
action from the sources which have thus not been properly cited or from whom proper permission has
not been taken when needed.
Date: -
ABSTRACT
K-Means clustering stands as a cornerstone in unsupervised learning, providing a versatile approach for
partitioning data into distinct groups based on similarity criteria. This paper offers a comprehensive
exploration of K-Means clustering, elucidating its principles, methodologies, and applications in data
analysis.
Originally devised for partitioning data into K clusters, K-Means has evolved to accommodate various
data types and cluster shapes efficiently. Its fundamental concept revolves around iteratively assigning
data points to the nearest centroid and updating centroids based on the mean of assigned points,
converging to stable cluster configurations.
We delve into the nuances of K-Means clustering algorithms, including initialization strategies,
distance metrics, and convergence criteria, elucidating their roles in optimizing clustering outcomes.
Furthermore, extensions like K-Medoids and hierarchical clustering broaden K-Means' applicability to
diverse datasets and clustering scenarios.
The paper also highlights practical applications of K-Means clustering across domains, including
customer segmentation, image compression, and anomaly detection. By leveraging its simplicity and
scalability, K-Means clustering emerges as a versatile tool for exploratory data analysis and pattern
discovery in complex datasets.
TABLE OF CONTENTS

Sr. No.  Title
1        Introduction
1.1      About K-Means
1.2      Problem Statement
1.3      Objectives
7        Acknowledgment
1. INTRODUCTION
1.1 ABOUT K-MEANS BI TOOL
Introduction:
In contemporary data mining, the precision of data classification holds paramount importance in guiding decision-
making processes across various sectors. This project centres on the utilization of K-Means clustering, a foundational
unsupervised learning technique, to partition data into distinct groups based on similarity criteria. Unlike traditional
classification methods, K-Means clustering doesn't require labelled data, making it particularly advantageous for
exploratory data analysis and pattern discovery.
Motivation:
The motivation behind this project arises from the burgeoning necessity for precise data classification techniques in
modern data mining applications. In today's data-centric landscape, organizations rely heavily on data-driven
insights to drive decision-making, foster innovation, and gain competitive advantages. However, the complexity and
heterogeneity of contemporary datasets pose significant challenges to conventional classification approaches.
Subsequently, the project delves into the theoretical foundations of K-Means clustering, elucidating concepts such as
centroid initialization, distance metrics, and convergence criteria. These theoretical underpinnings are translated into
practical implementation as K-Means clustering is applied to the pre-processed dataset for segmentation. Fine-tuning
of clustering parameters and evaluation of cluster validity metrics are key components of the project, aimed at
refining the K-Means model's performance and extracting meaningful insights from the segmented data.
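The iterative assignment and update described above correspond to minimizing the within-cluster sum of squares (WCSS). In standard notation, introduced here only for reference, with \mu_k denoting the centroid of cluster C_k, the objective is

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

Each iteration first assigns points to their nearest centroid (which cannot increase J for fixed centroids) and then recomputes each centroid as the mean of its assigned points (which cannot increase J for fixed assignments), so the procedure converges to a locally optimal cluster configuration.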
Through this project, we aim to showcase the versatility and efficacy of K-Means clustering in data segmentation
tasks, highlighting its potential to uncover hidden patterns and structures within complex datasets.
1.2 PROBLEM STATEMENT
The project aims to tackle the challenge of accurately segmenting complex datasets within the realm of
data mining by leveraging K-Means clustering. The primary focus lies in addressing several key aspects:
Firstly, the selection of a suitable dataset poses a fundamental challenge, necessitating the identification
of a dataset that exhibits intricate patterns and potentially non-linear relationships between features.
This complexity serves as a testbed for showcasing K-Means' capacity to handle challenging datasets
effectively, where traditional classification methods might fall short.
Subsequently, the central objective revolves around developing a K-Means-based model that partitions
the dataset instances into meaningful, well-separated clusters. This involves a comprehensive approach,
encompassing the tuning of clustering parameters and the implementation of appropriate techniques to
ensure robust performance across diverse evaluation metrics.
1.3 OBJECTIVES
K-Means Clustering:
Clustering is widely used in various fields such as biology, psychology, and economics. The number of
clusters or model parameters is often unknown in cluster analysis, posing a significant challenge. K-Means
clustering is a simple and fast partitioning technique, but it requires the number of clusters K to be specified
in advance. Several approaches have therefore been proposed to automate the selection of K in K-Means
clustering.
These approaches aim to select the optimal number of clusters automatically, reducing the reliance on user
input and domain knowledge; one common heuristic is sketched below.
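As an illustration of such an approach (not part of this project's own implementation), the following minimal sketch uses scikit-learn to compare candidate values of K with the within-cluster sum of squares (the "elbow" heuristic) and the silhouette score. The synthetic dataset and the range of K values are placeholders chosen only for demonstration.

# Illustrative sketch: choosing K via the elbow heuristic and silhouette score.
# Assumes numpy and scikit-learn are available; the data below are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic 2-D data with three loose groups (placeholder for a real dataset).
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                  for c in ([0, 0], [4, 4], [8, 0])])

for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    wcss = model.inertia_                       # within-cluster sum of squares
    sil = silhouette_score(data, model.labels_)
    print(f"K={k}: WCSS={wcss:.1f}, silhouette={sil:.3f}")
# A pronounced "elbow" in WCSS together with a high silhouette score suggests K=3 here.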
2.2 REFERRED WORK
Here are some published works we have analyzed for our project:
1. Dubes, R. C., & Jain, A. K. (1988). Algorithms for Clustering Data. Prentice Hall.
2. Ng, A. (2012). Clustering with the K-Means Algorithm. Machine Learning.
3. Bozdogan, H. (1994). Mixture-model cluster analysis using model selection criteria and a new informational
measure of complexity.
4. Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset: an information-theoretic
approach.
5. Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in
a data set.
6. Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data.
7. Smyth, P. (1996). Clustering using Monte Carlo cross validation.
8. Wang, J. (2010). Consistent selection of the number of clusters via cross-validation. Biometrika.
3. PROPOSED SYSTEM
The proposed system is a desktop application designed to perform K-Means clustering analysis on user-provided data.
Built using Python, the system integrates essential libraries such as tkinter for the graphical user interface (GUI),
pandas for data manipulation, matplotlib for visualization, and scikit-learn for implementing the K-Means algorithm.
Key Features
1. User-Friendly Interface: The application offers an intuitive GUI that allows users to interact with the system
effortlessly.
2. Data Import: Users can import their data from Excel files into the application seamlessly.
3. K-Means Clustering: Upon importing the data, users specify the number of clusters (K) they want to generate using
the K-Means algorithm.
4. Result Display: After running the clustering algorithm, the system displays the resulting clusters visually and
provides cluster centre coordinates.
5. Evaluation Metrics: Users can evaluate the quality of clustering using metrics such as within-cluster sum of squares
(WCSS) or silhouette score.
6. Export Results: The application allows users to export the clustered data along with cluster assignments for further
analysis.
Workflow:
1. Data Import: Users select and import their dataset in Excel format using the application's interface.
2. Specify Parameters: Users enter the desired number of clusters (K) for the K-Means algorithm.
3. Run K-Means Algorithm: Upon clicking the "Calculate" button, the system processes the data using the K-Means
algorithm.
4. Display Results: The application showcases the clustered data points graphically, providing users with insights into
the underlying patterns.
5. Evaluate Clustering: Users can assess the quality of clustering using provided evaluation metrics.
6. Export Results: Optionally, users can export the clustered data along with cluster assignments for further analysis
in other tools, as sketched below.
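The workflow above can be sketched with the libraries the proposed system names (pandas, scikit-learn, matplotlib). This is a minimal illustration, not the application's actual code (which appears in Section 5.1); the file names, column selection, and choice of K below are placeholders.

# Illustrative end-to-end sketch of the proposed workflow (placeholder file names).
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1-2. Import data from Excel and specify the number of clusters (placeholders).
df = pd.read_excel("input_data.xlsx")      # assumed input file
features = df.select_dtypes("number")      # assumed: cluster on the numeric columns
k = 3

# 3. Run the K-Means algorithm.
model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

# 4. Display results: scatter of the first two features coloured by cluster.
plt.scatter(features.iloc[:, 0], features.iloc[:, 1], c=model.labels_, cmap="viridis")
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1], marker="x", c="red")
plt.show()

# 5. Evaluate clustering quality (WCSS and silhouette score).
print("WCSS:", model.inertia_)
print("Silhouette:", silhouette_score(features, model.labels_))

# 6. Export results with cluster assignments for further analysis.
df.assign(cluster=model.labels_).to_excel("clustered_output.xlsx", index=False)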
Benefits:
1. Accessibility: The system caters to individuals and small businesses, enabling them to analyze their data without
extensive technical knowledge.
2. Efficiency: By automating the K-Means clustering process, the application saves users time and effort in analysing
their data.
3. Insightful Visualizations: The graphical representation of clustered data aids users in understanding patterns and
making informed decisions.
4. Flexibility: Users can customize the clustering process by adjusting parameters such as the number of clusters to
suit their specific needs.
The proposed K-Means clustering desktop application offers a user-friendly solution for data analysis, empowering
users to derive insights from their datasets efficiently. With its intuitive interface and powerful features, the system
simplifies the complex process of clustering analysis, making it accessible to a broader audience.
4. IMPLEMENTATION DETAILS
4.1 STEPS
1. Import Libraries:
- Include tkinter, pandas, matplotlib, and scikit-learn for essential functionality.
2. Build the GUI:
- Create the main tkinter window with controls for selecting a data file, entering the number of clusters (K), and starting the clustering.
3. Load Data:
- Read the user-selected file into an array suitable for clustering.
4. Specify Parameters:
- Accept the desired number of clusters (K) from the user.
5. Run K-Means:
- Fit the K-Means algorithm to the loaded data to obtain cluster labels and centroids.
6. Visualize Results:
- Display clustered data and centroids using matplotlib.
7. Evaluate and Export:
- Report clustering quality and optionally export the clustered data.
8. Run Application:
- Execute the mainloop() function to launch the GUI application.
5. EXPERIMENTAL RESULTS
5.1 CODE
import tkinter as tk
from tkinter import messagebox
from tkinter import filedialog
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

class KMeans:
    def __init__(self, n_clusters=3, max_iter=300):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.cluster_centers_ = None

    def fit(self, data):
        # Initialise centroids from randomly chosen data points.
        rng = np.random.default_rng()
        indices = rng.choice(len(data), self.n_clusters, replace=False)
        self.cluster_centers_ = data[indices].astype(float)
        for _ in range(self.max_iter):
            labels = self._assign_labels(data)
            # Recompute each centroid as the mean of its assigned points.
            new_centers = np.array([
                data[labels == k].mean(axis=0) if np.any(labels == k)
                else self.cluster_centers_[k]
                for k in range(self.n_clusters)])
            if np.allclose(new_centers, self.cluster_centers_):
                break  # converged
            self.cluster_centers_ = new_centers
        return self

    def _assign_labels(self, data):
        # Assign every point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(data[:, None, :] - self.cluster_centers_[None, :, :], axis=2)
        return np.argmin(distances, axis=1)

class KMeansGUI:
    def __init__(self, root):
        self.root = root
        self.root.title("KMeans Clustering")
        self.root.geometry("400x300")
        self.data = None
        self.k = tk.IntVar()
        self.k.set(3)  # default number of clusters
        self.create_widgets()

    def create_widgets(self):
        self.label = tk.Label(self.root, text="Select Data File:")
        self.label.pack()
        tk.Button(self.root, text="Browse", command=self.select_file).pack()
        tk.Label(self.root, text="Number of clusters (k):").pack()
        tk.Entry(self.root, textvariable=self.k).pack()
        tk.Button(self.root, text="Calculate", command=self.cluster_data).pack()

    def load_data(self, file_path):
        # Assumed: the first two numeric columns of the CSV/Excel file are the features.
        if file_path.lower().endswith(".csv"):
            df = pd.read_csv(file_path)
        else:
            df = pd.read_excel(file_path)
        return df.select_dtypes(include=[np.number]).iloc[:, :2].to_numpy()

    def select_file(self):
        file_path = filedialog.askopenfilename()
        if file_path:
            try:
                self.data = self.load_data(file_path)
            except Exception as e:
                messagebox.showerror("Error", f"Failed to load data: {e}")

    def cluster_data(self):
        if self.data is None:
            messagebox.showerror("Error", "Please select a data file.")
            return
        k_value = self.k.get()
        if k_value <= 0:
            messagebox.showerror("Error", "Please enter a valid value for k.")
            return
        kmeans = KMeans(n_clusters=k_value)
        kmeans.fit(self.data)
        labels = kmeans._assign_labels(self.data)
        plt.scatter(self.data[:, 0], self.data[:, 1], c=labels, cmap='viridis')
        plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', c='red', s=100)
        plt.title("KMeans Clustering")
        plt.xlabel("X-axis")
        plt.ylabel("Y-axis")
        plt.show()

if __name__ == "__main__":
    root = tk.Tk()
    app = KMeansGUI(root)
    root.mainloop()
5.2 RESULTS
Fig. 5.3 Output of data.csv
Fig. 5.5 Output of weight-height.csv
6. CONCLUSION
In conclusion, K-Means clustering stands out as a robust and efficient algorithm, capable of partitioning data into
distinct groups based on similarities. Its simplicity and effectiveness make it a valuable tool in various fields, from
customer segmentation in marketing to image compression in computer vision.
K-Means offers utility in exploratory data analysis by uncovering underlying patterns and structures within
datasets. By iteratively optimizing cluster centroids to minimize the within-cluster sum of squares, K-Means
efficiently partitions data points into clusters, aiding in data interpretation and decision-making.
As we continue to advance in the realm of data science, K-Means clustering remains a fundamental technique,
contributing to insights and discoveries across diverse domains. Its versatility and scalability make it a go-to
choice for clustering analysis, driving innovation and progress in data-driven applications.
7. REFERENCES
ACKNOWLEDGEMENT
Our group would like to thank our Principal, Dr. S. M. Khot, and our Head of Department, Dr. Shubhangi
Vaikole, for giving us this wonderful opportunity. We would also like to extend our thanks to our
project guide, Prof. Poonam Bari, for guiding us throughout the project and helping us overcome the
obstacles and difficulties we faced during the project. We thank the University of Mumbai for
allotting this project, which led us to research topics related to the project and helped us learn many
things more efficiently.