BIL Report


A Mini Project Report on

“BI Tool for Visualization of K-means”

By
Apoorv Shinde (5021157)
Mrunali Shinde (5021158)
Mayuri Phapale (5021168)

Guided by:
Prof. Poonam Bari

Department of Information Technology
Fr. Conceicao Rodrigues Institute of Technology
Sector 9A, Vashi, Navi Mumbai – 400703
University of Mumbai
2023-2024

CERTIFICATE

This is to certify that the mini project entitled

“BI Tool for Visualization of K-Means”

Submitted by:
Apoorv Shinde (5021157)
Mrunali Shinde (5021158)
Mayuri Phapale (5021168)

in partial fulfillment of the degree of T.E. in Information Technology, as term work for the BIL project,
is approved.

External Examiner

Internal Guide

Head of the Department

College Seal

Date:

DECLARATION

We declare that this written submission represents our ideas in our own words, and where others' ideas
or words have been included, we have adequately cited and referenced the original sources. We also
declare that we have adhered to all principles of academic honesty and integrity and have not
misrepresented, fabricated, or falsified any idea/data/fact/source in our submission. We understand
that any violation of the above will be cause for disciplinary action by the institute and can also evoke
penal action from the sources which have thus not been properly cited or from whom proper permission
has not been taken when needed.

Apoorv Shinde (5021157)


Mrunali Shinde (5021158)
Mayuri Phapale (5021168)

Date: -

ABSTRACT

K-Means clustering stands as a cornerstone in unsupervised learning, providing a versatile approach for
partitioning data into distinct groups based on similarity criteria. This paper offers a comprehensive
exploration of K-Means clustering, elucidating its principles, methodologies, and applications in data
analysis.

Originally devised for partitioning data into K clusters, K-Means has evolved to accommodate various
data types and cluster shapes efficiently. Its fundamental concept revolves around iteratively assigning
data points to the nearest centroid and updating centroids based on the mean of assigned points,
converging to stable cluster configurations.

We delve into the nuances of K-Means clustering algorithms, including initialization strategies,
distance metrics, and convergence criteria, elucidating their roles in optimizing clustering outcomes.
Furthermore, extensions like K-Medoids and hierarchical clustering broaden K-Means' applicability to
diverse datasets and clustering scenarios.

The paper also highlights practical applications of K-Means clustering across domains, including
customer segmentation, image compression, and anomaly detection. By leveraging its simplicity and
scalability, K-Means clustering emerges as a versatile tool for exploratory data analysis and pattern
discovery in complex datasets.

TABLE OF CONTENTS
Sr. No.  Title
1        Introduction
  1.1    About K-Means
  1.2    Problem Statement
  1.3    Objectives
2        Literature Survey
3        Proposed System
  3.1    Proposed Solution
4        Implementation Details
5        Experimental Results
  5.1    Code
  5.2    Results
6        Conclusion
7        References
         Acknowledgement

1. INTRODUCTION
1.1 ABOUT K-MEANS BI TOOL

Introduction:
In contemporary data mining, the precision of data classification holds paramount importance in guiding decision-
making processes across various sectors. This project centres on the utilization of K-Means clustering, a foundational
unsupervised learning technique, to partition data into distinct groups based on similarity criteria. Unlike traditional
classification methods, K-Means clustering doesn't require labelled data, making it particularly advantageous for
exploratory data analysis and pattern discovery.

Motivation:
The motivation behind this project arises from the burgeoning necessity for precise data classification techniques in
modern data mining applications. In today's data-centric landscape, organizations rely heavily on data-driven
insights to drive decision-making, foster innovation, and gain competitive advantages. However, the complexity and
heterogeneity of contemporary datasets pose significant challenges to conventional classification approaches.

About the Project:


This project focuses on harnessing the power of K-Means clustering for data segmentation within the data mining
domain. It begins by selecting a dataset suited for K-Means analysis, characterized by features that exhibit diverse
patterns and structures. Following dataset selection, rigorous preprocessing steps are undertaken to enhance data
quality and optimize K-Means clustering performance. These preprocessing tasks encompass data cleaning, feature
engineering, and scaling techniques to prepare the dataset for effective segmentation by the K-Means algorithm.

Subsequently, the project delves into the theoretical foundations of K-Means clustering, elucidating concepts such as
centroid initialization, distance metrics, and convergence criteria. These theoretical underpinnings are translated into
practical implementation as K-Means clustering is applied to the pre-processed dataset for segmentation. Fine-tuning
of clustering parameters and evaluation of cluster validity metrics are key components of the project, aimed at
refining the K-Means model's performance and extracting meaningful insights from the segmented data.

Through this project, we aim to showcase the versatility and efficacy of K-Means clustering in data segmentation
tasks, highlighting its potential to uncover hidden patterns and structures within complex datasets.
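Because K-Means relies on Euclidean distance, the preprocessing steps mentioned above matter in practice: a feature on a large scale (such as income) will dominate one on a small scale (such as height) unless the data is standardized first. The following is a minimal sketch of this, using scikit-learn and made-up numbers chosen purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: height (cm) and income (arbitrary units).
# Rows 0-1 are low-income, rows 2-3 are high-income.
X = np.array([[160.0, 55000.0],
              [175.0, 52000.0],
              [168.0, 98000.0],
              [181.0, 101000.0]])

# Standardize each feature to zero mean and unit variance so that
# height and income contribute equally to the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)  # rows 0-1 land in one cluster, rows 2-3 in the other
```

Without the scaling step, the raw income values would swamp the height differences and the clusters would be determined by income alone.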

1.2 PROBLEM STATEMENT

The project aims to tackle the challenge of accurately grouping complex datasets within the realm of
data mining by leveraging K-Means clustering. The primary focus lies in addressing several key
aspects:

Firstly, the selection of a suitable dataset poses a fundamental challenge, necessitating the identification
of a dataset that exhibits intricate patterns and natural groupings among its features. This complexity
serves as a testbed for showcasing K-Means' capacity to uncover structure in challenging datasets
effectively, where manual inspection might fall short.

Subsequently, the central objective revolves around developing a K-Means-based clustering tool that
partitions dataset instances into meaningful, well-separated clusters. This involves a comprehensive
approach, encompassing the selection of the number of clusters, appropriate preprocessing, and the use
of evaluation metrics such as the within-cluster sum of squares and the silhouette score to ensure
robust clustering quality.

1.3 OBJECTIVES

1. Develop an Accurate K-Means-Based Clustering Model
2. Address Dataset Complexity
3. Enhance Interpretability and Visualization
4. Ensure Generalization and Scalability
5. Contribute to Advancing K-Means Clustering in Data Mining
2. LITERATURE SURVEY
2.1 Theory

K-Means Clustering:
Clustering is widely used in various fields such as biology, psychology, and economics. The number of
clusters or model parameters is often unknown in cluster analysis, posing a significant challenge. K-Means
clustering is a simple and fast technique, but it requires the number of clusters K to be specified in
advance. Several approaches have been proposed to automate the selection of the number of clusters in
K-Means clustering:

1. By rule of thumb: A simple heuristic (such as k ≈ √(n/2) for n data points) applicable to any dataset.


2. Elbow Method: A visual method that identifies the "elbow" point where the rate of decrease in distortion
significantly slows down.
3. Information Criterion Approach: Uses information criteria like Akaike’s and Bayesian inference criterion
to balance model complexity and likelihood.
4. An Information Theoretic Approach: Utilizes rate distortion theory and jump statistic to determine the
right number of clusters.
5. Choosing k Using the Silhouette: Evaluates clustering quality based on within-cluster tightness and
separation.
6. Cross-validation: Estimates the number of clusters based on cluster stability using techniques like Monte
Carlo cross-validation.

These approaches aim to automate the selection of the optimal number of clusters in K-Means clustering,
reducing the reliance on user input and domain knowledge.
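Two of the approaches above, the Elbow Method (item 2) and the Silhouette (item 5), can be sketched together in a few lines with scikit-learn. On synthetic data with three well-separated blobs, the within-cluster sum of squares (WCSS, scikit-learn's inertia_) drops sharply up to k = 3 and flattens afterwards, while the silhouette score peaks at k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic 2-D blobs, 50 points each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

scores = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    print(f"k={k}  WCSS={km.inertia_:.1f}  silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)  # silhouette peaks at the true number of blobs
print("best k by silhouette:", best_k)
```

The Elbow Method reads the "bend" off a plot of WCSS against k by eye; the silhouette criterion has the advantage of yielding a single number to maximize.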

2.2 REFERRED WORK

Here are some published works we analyzed for our project:

1. Dubes, R. C., & Jain, A. K. (1988). Algorithms for Clustering Data. Prentice Hall.
2. Ng, A. (2012). Clustering with the K-Means Algorithm. Machine Learning.
3. Bozdogan, H. (1994). Mixture-model cluster analysis using model selection criteria and a new informational
measure of complexity.
4. Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset: an information-theoretic
approach.
5. Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in
a data set.
6. Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data.
7. Smyth, P. (1996). Clustering using Monte Carlo cross validation.
8. Wang, J. (2010). Consistent selection of the number of clusters via cross-validation. Biometrika.

3. PROPOSED SYSTEM
The proposed system is a desktop application designed to perform K-Means clustering analysis on user-provided data.
Built using Python, the system integrates essential libraries such as tkinter for the graphical user interface (GUI),
pandas for data manipulation, matplotlib for visualization, and scikit-learn for implementing the K-Means algorithm.

Key Features

1. User-Friendly Interface: The application offers an intuitive GUI that allows users to interact with the system
effortlessly.

2. Data Import: Users can import their data from Excel files into the application seamlessly.

3. K-Means Clustering: Upon importing the data, users specify the number of clusters (K) they want to generate using
the K-Means algorithm.

4. Result Display: After running the clustering algorithm, the system displays the resulting clusters visually and
provides cluster centre coordinates.

5. Evaluation Metrics: Users can evaluate the quality of clustering using metrics such as within-cluster sum of squares
(WCSS) or silhouette score.

6. Export Results: The application allows users to export the clustered data along with cluster assignments for further
analysis.

Workflow:

1. Data Import: Users select and import their dataset in Excel format using the application's interface.

2. Specify Parameters: Users enter the desired number of clusters (K) for the K-Means algorithm.

3. Run K-Means Algorithm: Upon clicking the "Calculate" button, the system processes the data using the K-Means
algorithm.

4. Display Results: The application showcases the clustered data points graphically, providing users with insights into
the underlying patterns.

5. Evaluate Clustering: Users can assess the quality of clustering using provided evaluation metrics.

6. Export Results: Optionally, users can export the clustered data along with cluster assignments for further analysis
in other tools.

Benefits:

1. Accessibility: The system caters to individuals and small businesses, enabling them to analyze their data without
extensive technical knowledge.

2. Efficiency: By automating the K-Means clustering process, the application saves users time and effort in analysing
their data.

3. Insightful Visualizations: The graphical representation of clustered data aids users in understanding patterns and
making informed decisions.

4. Flexibility: Users can customize the clustering process by adjusting parameters such as the number of clusters to
suit their specific needs.

The proposed K-Means clustering desktop application offers a user-friendly solution for data analysis, empowering
users to derive insights from their datasets efficiently. With its intuitive interface and powerful features, the system
simplifies the complex process of clustering analysis, making it accessible to a broader audience.
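The import–cluster–evaluate–export workflow described above can be condensed into a short scikit-learn/pandas sketch. The data here is a small hypothetical in-memory table; in the actual application it would come from pd.read_excel() on the file the user selects in the GUI, and the final export line (commented out, since it needs openpyxl installed) corresponds to the "Export Results" step:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical dataset standing in for an imported Excel sheet
df = pd.DataFrame({
    "Height": [150, 152, 180, 182, 165, 167],
    "Weight": [50, 52, 85, 88, 65, 67],
})

# Step 2-3: user-specified K, then run the K-Means algorithm
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df)

df["Cluster"] = km.labels_        # per-row cluster assignment
centers = km.cluster_centers_     # one (Height, Weight) centre per cluster
wcss = km.inertia_                # step 5: within-cluster sum of squares

# Step 6 (export): df.to_excel("clustered.xlsx")
print(df)
```

This mirrors the desktop application's pipeline end to end, minus the tkinter front end and the matplotlib display.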
4. IMPLEMENTATION DETAILS

4.1 STEPS

1. Import Libraries:
- Include tkinter, pandas, matplotlib, and scikit-learn for essential functionality.

2. Create GUI Window:


- Initialize a tkinter window to serve as the application's graphical interface.

3. Add Labels and Buttons:


- Create labels for titles and instructions.
- Add buttons for data import and clustering initiation.

4. Import Excel Data:


- Develop a function to import Excel data using pandas.

5. Perform K-Means Clustering:


- Implement a function to execute K-Means clustering.
- Obtain the number of clusters from user input.
- Utilize scikit-learn's KMeans algorithm for clustering.

6. Visualize Results:
- Display clustered data and centroids using matplotlib.

7. Enable User Interaction:


- Ensure buttons are responsive and activated based on input availability.

8. Run Application:
- Execute the mainloop() function to launch the GUI application.

5. EXPERIMENTAL RESULTS

5.1 CODE
import tkinter as tk
from tkinter import messagebox
from tkinter import filedialog
import numpy as np
import matplotlib.pyplot as plt

class KMeans:
    def __init__(self, n_clusters=3, max_iter=300):
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def fit(self, X):
        # Initialize centroids by picking random distinct data points
        self.cluster_centers_ = X[np.random.choice(X.shape[0], self.n_clusters, replace=False)]
        for _ in range(self.max_iter):
            labels = self._assign_labels(X)
            new_centers = self._compute_centers(X, labels)
            if np.allclose(self.cluster_centers_, new_centers):
                break  # centroids stable: converged
            self.cluster_centers_ = new_centers
        return self

    def _assign_labels(self, X):
        # Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None] - self.cluster_centers_, axis=2)
        return np.argmin(distances, axis=1)

    def _compute_centers(self, X, labels):
        new_centers = np.zeros_like(self.cluster_centers_)
        for i in range(self.n_clusters):
            cluster_points = X[labels == i]
            if len(cluster_points) > 0:
                # New centroid is the mean of the points assigned to it
                new_centers[i] = cluster_points.mean(axis=0)
            else:
                # Re-seed an empty cluster with a random data point
                new_centers[i] = X[np.random.choice(X.shape[0])]
        return new_centers

class KMeansGUI:
    def __init__(self, root):
        self.root = root
        self.root.title("KMeans Clustering")
        self.root.geometry("400x300")

        self.data = None
        self.k = tk.IntVar()
        self.k.set(3)  # default number of clusters

        self.create_widgets()

    def create_widgets(self):
        self.label = tk.Label(self.root, text="Select Data File:")
        self.label.pack()

        self.select_button = tk.Button(self.root, text="Select File", command=self.select_file)
        self.select_button.pack()

        self.k_label = tk.Label(self.root, text="Enter number of clusters (k):")
        self.k_label.pack()

        self.k_entry = tk.Entry(self.root, textvariable=self.k)
        self.k_entry.pack()

        self.cluster_button = tk.Button(self.root, text="Cluster Data", command=self.cluster_data)
        self.cluster_button.pack()

    def select_file(self):
        file_path = filedialog.askopenfilename()
        if file_path:
            try:
                self.data = self.load_data(file_path)
            except Exception as e:
                messagebox.showerror("Error", f"Failed to load data: {e}")

    def load_data(self, file_path):
        data = []
        with open(file_path, 'r') as file:
            next(file)  # skip the header line
            for line in file:
                line = line.strip()
                if line:
                    # Exclude the first column (e.g. Gender); keep numeric features
                    row = [float(val) for val in line.split(',')[1:]]
                    data.append(row)
        return np.array(data)

    def cluster_data(self):
        if self.data is None:
            messagebox.showerror("Error", "Please select a data file.")
            return

        k_value = self.k.get()
        if k_value <= 0:
            messagebox.showerror("Error", "Please enter a valid value for k.")
            return

        kmeans = KMeans(n_clusters=k_value)
        kmeans.fit(self.data)
        labels = kmeans._assign_labels(self.data)

        plt.scatter(self.data[:, 0], self.data[:, 1], c=labels, cmap='viridis')
        plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', c='red', s=100)
        plt.title("KMeans Clustering")
        plt.xlabel("X-axis")
        plt.ylabel("Y-axis")
        plt.show()

if __name__ == "__main__":
    root = tk.Tk()
    app = KMeansGUI(root)
    root.mainloop()

5.2 RESULTS

Fig. 5.1 Home screen

Fig. 5.2 Select file as data.csv

Fig. 5.3 Output of data.csv

Fig. 5.4 Select file as weight-height.csv

Fig. 5.5 Output of weight-height.csv

6. CONCLUSION

In conclusion, K-Means clustering stands out as a robust and efficient algorithm, capable of partitioning data into
distinct groups based on similarities. Its simplicity and effectiveness make it a valuable tool in various fields, from
customer segmentation in marketing to image compression in computer vision.

K-Means offers utility in exploratory data analysis by uncovering underlying patterns and structures within
datasets. By iteratively optimizing cluster centroids to minimize the within-cluster sum of squares, K-Means
efficiently partitions data points into clusters, aiding in data interpretation and decision-making.

As we continue to advance in the realm of data science, K-Means clustering remains a fundamental technique,
contributing to insights and discoveries across diverse domains. Its versatility and scalability make it a go-to
choice for clustering analysis, driving innovation and progress in data-driven applications.

7. REFERENCES

1. Scikit-learn documentation: https://scikit-learn.org/stable/documentation.html


2. Pandas documentation: https://pandas.pydata.org/docs/
3. Matplotlib documentation: https://matplotlib.org/stable/contents.html
4. Tkinter documentation: https://docs.python.org/3/library/tkinter.html
5. Towards Data Science: https://towardsdatascience.com/
6. Real Python: https://realpython.com/
7. Datacamp: https://www.datacamp.com/
8. GeeksforGeeks: https://www.geeksforgeeks.org/

ACKNOWLEDGEMENT

Our group would like to thank our Principal, Dr. S. M. Khot, and our Head of Department, Dr. Shubhangi
Vaikole, for giving us this wonderful opportunity. We would also like to extend our thanks to our
project guide, Prof. Poonam Bari, for guiding us throughout the project and helping us overcome the
obstacles and difficulties we faced. We thank the University of Mumbai for allotting this project, which
led us to research related topics and helped us learn many things with greater efficiency.
