CLUSTERING HIGH DIMENSIONAL DATA USING SIGNATURES WITH HASHING TECHNIQUE
A PROJECT REPORT
Submitted by
ANUSHA S 1305066
KIRUTHIKA M 1305084
PRATHEEBA R 1305097
SWETHA M 1305116
BACHELOR OF ENGINEERING
in
APRIL 2017
COIMBATORE INSTITUTE OF TECHNOLOGY
(A Govt. Aided Autonomous Institution Affiliated to Anna University)
COIMBATORE - 641014
BONAFIDE CERTIFICATE
Certified that this project, CLUSTERING HIGH DIMENSIONAL DATA USING SIGNATURES
WITH HASHING TECHNIQUE, is the bonafide work of ANUSHA S (1305066), KIRUTHIKA M
(1305084), PRATHEEBA R (1305097) and SWETHA M (1305116), carried out under my
supervision during the academic year 2016-2017.
Certified that the candidates were examined by us in the project work viva-voce
examination held on
Place:
Date:
ACKNOWLEDGEMENT
We express our sincere thanks to our Secretary Dr.R.Prabhakar and our Principal
Dr.V.Selladurai for providing us a great opportunity to carry out our work. Words are rather meagre to
express our gratitude to them; this work is the outcome of their inspiration and support.
We thank the Head of the Department of Computer Science and Engineering & Information
Technology for his encouragement during this tenure.
We equally tender our sincere thankfulness to our project guide Mrs.S.Priya, Department of
Computer Science and Engineering & Information Technology, for her valuable suggestions and guidance.
During the period of study, the entire staff of the Department of Computer Science and
Engineering & Information Technology offered ungrudging help. It is also a matter of great pleasure to
thank our parents and family members for their constant support.
LIST OF SYMBOLS
DB     - Database of points
D      - Set of attributes / dimensions
S      - Subspace
N      - Neighbourhood of a point in a subspace S
C      - Cluster
CS     - Core set
U      - Dense unit
H      - Hash signature of a dense unit
L      - Large integer
HTable - Hash table
PCA    - Principal Component Analysis
1. INTRODUCTION
Data mining is the computational process of discovering patterns in large relational databases and
summarizing them into useful information. The overall goal of the data mining process is to extract
information from a data set and transform it into an understandable structure for further use. A typical
workflow first extracts, transforms, and loads transaction data onto the data warehouse system before
mining begins.
1.2 Data Mining Techniques
1.2.1 Association
Association enables the discovery of interesting relations between different variables in large
databases. Association rule learning uncovers hidden patterns in the data that can be used to identify
variables within the data and the co-occurrences of different variables that appear with the greatest
frequencies.
1.2.2 Clustering
Clustering is a data mining technique that automatically groups objects with similar
characteristics into meaningful or useful clusters.
1.2.3 Classification
Classification is used to assign each item in a data set to one of a predefined set of classes
or groups. Classification methods make use of mathematical techniques such as decision trees, linear
programming, neural networks and statistics.
1.2.4 Prediction
Prediction is a data mining technique that discovers the relationship between
independent variables, as well as the relationship between dependent and independent variables.
1.2.5 Sequential Patterns
Sequential pattern analysis seeks to discover or identify similar patterns, regular events or trends
in transaction data over a business period.
1.3 Clustering
Clustering is the task of grouping a set of objects in such a way that objects in the same group are
more similar to each other than to those in other groups; a cluster is thus a group of objects that belong
to the same class. Clustering methods can be classified into the following categories:
Partitioning Method
Hierarchical Method
Density-based Method
Grid-based Method
Constraint-based Method
1.3.1 Partitioning Method
For a database of n objects, the partitioning method constructs k partitions of the data, where
each partition represents a cluster and k is less than or equal to n. That is, it classifies the
data into k groups such that each group contains at least one object and each object belongs to
exactly one group.
1.3.2 Hierarchical Method
This method creates a hierarchical decomposition of the given set of data objects. There are two
approaches:
Agglomerative Approach
Divisive Approach
1.3.2.1 Agglomerative Approach
This method starts with each object forming a separate group. It keeps merging the
objects or groups that are close to one another, and continues until all of the groups are
merged into one or until the termination condition holds.
1.4 Applications of Clustering
Clustering can also help marketers discover distinct groups in their customer base and
characterize those groups based on their purchasing patterns.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card fraud.
1.5 Hashing
A hash function is used to index the original value or key, and is then used each time the
data associated with that value or key is to be retrieved. A hash table uses a hash function to compute an
index into an array of buckets or slots, from which the desired value can be found.
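As a minimal illustration in Java (the project's implementation platform), a java.util.HashMap computes the bucket index from the key's hash code internally; the key and value names here are made up for the example:

    import java.util.HashMap;
    import java.util.Map;

    public class HashLookup {
        public static void main(String[] args) {
            // The map hashes each key to a bucket index internally,
            // so insertion and lookup run in expected constant time.
            Map<String, Integer> table = new HashMap<>();
            table.put("patientA", 42);
            table.put("patientB", 17);

            // Retrieval re-hashes the key and probes the same bucket.
            System.out.println(table.get("patientA")); // prints 42
        }
    }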
1.5.1 Advantage
The main advantage of hash tables over other table data structures is speed, especially when the
number of entries is large. If the set of key-value pairs is fixed and known ahead of time, the average
lookup cost can be reduced by a careful choice of the hash function, bucket table size, and internal data
structures.
2. SYSTEM SPECIFICATION
The hardware and software for the system are selected by considering factors such as CPU processing
speed, peripheral channel speed, printer speed, seek time and rotational delay of the hard disk, and
communication speed. The hardware and software specifications are as follows.
3.1 EXISTING SYSTEM
DESCRIPTION:
In ordered subspace clustering, the number of clusters to be formed is not known a priori. Each
data point in a union of subspaces can always be written as a linear combination of all the other points. A
block diagonal matrix is formed from the data. The columns within a segment will be the zero
vector or very close to it, because columns from the same subspace share similarity. Columns that greatly
deviate from the zero vector indicate the boundary of a segment, as the similarity there is low.
DESCRIPTION:
CLIQUE partitions each dimension into equal-width intervals and identifies the dense units among
them. Dense units from lower-dimensional subspaces are then joined, level by level, to generate
candidate dense units in higher-dimensional subspaces.
LIMITATIONS:
The CLIQUE algorithm does not perform well when the number of dimensions increases.
DESCRIPTION:
The data set is projected onto each dimension and histograms are constructed over the dimensions.
Sparse histograms do not contribute to a cluster. Dimensions having dense histograms are combined
recursively, and a depth-first search (DFS) is applied to find the maximal region that represents a cluster.
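A rough sketch of this histogram step over a single dimension, assuming equal-width bins; binCount and denseThreshold are illustrative parameters, not values taken from the project:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class HistogramStep {
        // Builds a 1-D histogram over one dimension and keeps only the
        // dense bins; sparse bins are dropped from further consideration.
        static List<Integer> denseBins(double[] values, int binCount, int denseThreshold) {
            double min = Arrays.stream(values).min().getAsDouble();
            double max = Arrays.stream(values).max().getAsDouble();
            double width = (max - min) / binCount;
            int[] counts = new int[binCount];
            for (double v : values) {
                // Clamp the top edge into the last bin.
                int bin = Math.min((int) ((v - min) / width), binCount - 1);
                counts[bin]++;
            }
            List<Integer> dense = new ArrayList<>();
            for (int b = 0; b < binCount; b++)
                if (counts[b] >= denseThreshold) dense.add(b);
            return dense;
        }
    }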
3.2 PROPOSED SYSTEM
The proposed system uses the commonality of data points across dimensions as the key step in
cluster generation, thereby bypassing the generation of clusters across increasing combinations of
dimensions. Clusters generated in 1-D that have the same signatures imply a maximal subspace cluster,
where the subspace is the set of those particular dimensions. However, same signatures in 1-D do not
always imply a maximal subspace cluster, because a cluster in 1-D can have interleaved dense units of
clusters from other dimensions. So a dense unit in 1-D is split into combinations of dense units of
comparatively smaller size tau+1. These combinations of dense units form the core set. Dense units
from the core set can then be tested for same signatures to find maximal subspace clusters. Finally,
DBSCAN can be run across all the maximal subspace clusters so formed to find the final maximal
clusters.
For the dense sets of points in each of the 1-dimensional projections of the attribute set of the
given data, the sufficiently common points among these 1-dimensional sets lead to the dense points in
the higher-dimensional subspaces.
Each row of the database is assigned a point ID. A random number is generated for each point
ID and maintained in a table so that it can be used during the assignment of signatures. Every point is
added to a point vector, and all the point vectors are added to a data matrix.
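A minimal sketch of this labelling step, assuming point IDs are simply row indices of the data matrix; the class and method names are illustrative:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Random;
    import java.util.Set;

    public class PointLabels {
        // Assigns a unique large random integer label to each point ID,
        // to be used later when computing dense-unit signatures.
        static Map<Integer, Long> assignLabels(int pointCount, long seed) {
            Random rnd = new Random(seed);
            Map<Integer, Long> labels = new HashMap<>();
            Set<Long> used = new HashSet<>();
            for (int id = 0; id < pointCount; id++) {
                long label;
                do {
                    // Mask to a non-negative 63-bit integer.
                    label = rnd.nextLong() & Long.MAX_VALUE;
                } while (!used.add(label));  // enforce uniqueness
                labels.put(id, label);
            }
            return labels;
        }
    }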
A point is said to be dense if it has at least tau points within its epsilon neighborhood, where
epsilon is a distance measure between two points. These dense units can be connected together to form a cluster.
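Since the points are sorted before dense units are found (see Section 6.1), the density test in one dimension reduces to a sliding window over the sorted values. A sketch under that assumption, with illustrative names:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class DensePoints {
        // Returns positions (in sorted order) of points that have at least
        // tau other points within epsilon in this one dimension.
        static List<Integer> densePoints(double[] dim, double epsilon, int tau) {
            double[] v = dim.clone();
            Arrays.sort(v);                  // sort once per dimension
            List<Integer> dense = new ArrayList<>();
            int lo = 0, hi = 0;
            for (int i = 0; i < v.length; i++) {
                // Shrink the left edge until it is within epsilon of v[i].
                while (v[i] - v[lo] > epsilon) lo++;
                // Grow the right edge while it stays within epsilon of v[i].
                while (hi < v.length - 1 && v[hi + 1] - v[i] <= epsilon) hi++;
                int neighbours = hi - lo;    // window size minus the point itself
                if (neighbours >= tau) dense.add(i);
            }
            return dense;
        }
    }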
A dense unit in 1-D is split into combinations of dense units of comparatively smaller size
tau+1. These combinations of dense units form the core set. Dense units from the core set can then be
tested for same signatures to find maximal subspace clusters.
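A sketch of generating these (tau+1)-sized combinations; the recursion and all names are illustrative, not the project's exact code:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    public class CoreSet {
        // Emits every k-sized subset of a dense unit's point IDs; with
        // k = tau+1 these smaller dense units form the core set.
        static void combinations(int[] unit, int k, int start,
                                 Deque<Integer> current, List<List<Integer>> out) {
            if (current.size() == k) {
                out.add(new ArrayList<>(current));
                return;
            }
            for (int i = start; i < unit.length; i++) {
                current.addLast(unit[i]);
                combinations(unit, k, i + 1, current, out);
                current.removeLast();
            }
        }

        public static void main(String[] args) {
            int[] denseUnit = {1, 2, 3, 4};   // point IDs, illustrative
            int tau = 2;                      // subsets of size tau+1 = 3
            List<List<Integer>> coreSet = new ArrayList<>();
            combinations(denseUnit, tau + 1, 0, new ArrayDeque<>(), coreSet);
            // Prints the four 3-subsets: [1,2,3], [1,2,4], [1,3,4], [2,3,4]
            System.out.println(coreSet);
        }
    }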
3.2.5 Hashing
Signatures are assigned to each of these 1-D dense units in order to avoid comparing
the individual points among all dense units when deciding whether they contain exactly the same points.
We can hash the signatures of these 1-D dense units from all k dimensions.
The hash table is then checked for collisions. The resulting collisions lead to the
maximal subspace dense units, which are given as input to the DBSCAN algorithm to obtain
clusters.
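A sketch of this signature-and-collision step, assuming dense units are arrays of point IDs and the labels come from the random mapping described earlier; all names are illustrative:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SignatureHash {
        // Maps each dense unit's signature (the sum of its points' labels)
        // to the dimensions in which a unit with that signature was found.
        // Entries listing more than one dimension reveal maximal subspace
        // dense units.
        static Map<Long, List<Integer>> hashSignatures(
                Map<Integer, List<int[]>> denseUnitsPerDim, // dim -> dense units
                Map<Integer, Long> labels) {                // point ID -> label
            Map<Long, List<Integer>> hTable = new HashMap<>();
            for (Map.Entry<Integer, List<int[]>> e : denseUnitsPerDim.entrySet()) {
                int dim = e.getKey();
                for (int[] unit : e.getValue()) {
                    long signature = 0;
                    for (int p : unit) signature += labels.get(p);
                    hTable.computeIfAbsent(signature, s -> new ArrayList<>())
                          .add(dim);        // a collision appends another dim
                }
            }
            return hTable;
        }
    }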
Block Diagram:
3.3 Features
It involves a bottom-up approach, since the number of clusters and the number of subspaces
are not known a priori.
This algorithm gives only non-redundant information, i.e., it assigns signatures to the
data points, thereby generating non-redundant clusters.
4. DESIGN
Input is obtained as a database containing information about patients with diabetes. The
database is normalized using WEKA and stored in .csv format.
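A minimal sketch of loading such a normalized .csv file into the data matrix; the class name and the file layout (one header row followed by numeric rows) are assumptions for the example:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class CsvLoader {
        // Reads a WEKA-normalized CSV into a list of point vectors.
        // The row index doubles as the point ID.
        static List<double[]> load(String path) throws IOException {
            List<double[]> matrix = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                in.readLine();                       // skip the header row
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cells = line.split(",");
                    double[] point = new double[cells.length];
                    for (int j = 0; j < cells.length; j++)
                        point[j] = Double.parseDouble(cells[j].trim());
                    matrix.add(point);
                }
            }
            return matrix;
        }
    }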
Each row of the database is assigned a point ID. A random number is generated for each point
ID and maintained in a table. A point is said to be dense in the data matrix if it has at least tau points
within its epsilon neighborhood in a single dimension, where epsilon is a distance measure between two
points. These dense units can be connected together to form a cluster.
A dense unit in 1-D is split into combinations of dense units of comparatively smaller size
tau+1. These combinations of dense units form the core set, and the dense units from the core set can
then be tested for same signatures to find the maximal subspace clusters.
Signatures are assigned to each of these 1-D dense units to avoid comparing the individual
points among all dense units when deciding whether they contain exactly the same points. We hash the
signatures of these 1-D dense units from all k dimensions.
The resulting collisions in the hash table lead to the maximal subspace dense units.
These dense units are given as input to the DBSCAN algorithm to obtain clusters.
The output is a set of maximal subspace clusters from all the dimensions.
5. SOFTWARE DESCRIPTION
NetBeans IDE is the official IDE for Java 8. With its editors, code analyzers, and
converters, you can quickly and smoothly upgrade your applications to use new Java 8 language
constructs, such as lambdas, functional operations, and method references. Batch analyzers and
converters are provided to search through multiple applications at the same time, matching patterns
for conversion to new Java 8 language constructs. With its constantly improving Java Editor, many
rich features and an extensive range of tools, templates and samples, NetBeans IDE sets the
standard for developing with cutting-edge technologies out of the box.
The IDE provides wizards and templates to let you create Java EE, Java SE, and Java ME
applications. A variety of technologies and frameworks are supported out of the box. For example, you
can use wizards and templates to create applications that use the OSGi framework or the NetBeans module
system as the basis of modular applications. The language-aware NetBeans editor detects errors while
you type and assists you with documentation popups and smart code completion, all with the speed and
simplicity of your favorite lightweight text editor.
5.3 Building
Out of the box, the IDE provides support for the Maven and Ant build systems. In the New
Project wizard, when you choose to create a new application, you can choose to create Maven-based or
Ant-based applications. You can open Maven-based applications into the IDE without an import process
because the IDE reads project settings from the Maven POM file. In addition, tools are provided for
importing Ant-based projects that were not created in the IDE. The IDE includes a Maven Repository
Browser, as well as graphs for analyzing Maven dependencies.
To identify and solve problems in your applications, such as deadlocks and memory leaks, the
IDE provides a feature-rich debugger and profiler.
When you are testing your applications, the IDE provides tools for using JUnit and TestNG, as
well as code analyzers and, in particular, integration with the popular open source FindBugs tool.
Design GUIs for Java SE, HTML5, Java EE, PHP, C/C++, and Java ME applications quickly and
smoothly by using editors and drag-and-drop tools in the IDE. For Java SE applications, the NetBeans
GUI Builder automatically takes care of correct spacing and alignment, while supporting in-place editing,
as well. The GUI builder is so easy to use and intuitive that it has been used to prototype GUIs live at
customer presentations.
6. IMPLEMENTATION
6.1 Function
Assign a set of random integers to the data points in the database. Sort the points in the database
so that the whole database need not be scanned each time: once even the first two values are not within
epsilon distance of each other, the scan of that dimension can stop early. Find the core set by using
epsilon as a distance measure between the points in each dimension, such that there are at least tau
points in the neighborhood.
A set K of n large integers is randomly generated and used as a one-to-one mapping M : DB →
K to assign a unique label to each point in the database. The signature H of a dense unit U is given by the
sum of the labels of the points in it. These 1-D signatures across different dimensions can be matched
without checking the individual points contained in the dense units. The signature sums are hashed
into a hash table. If the sums collide, then these dense units are the same (with very high probability) and
exist in the subspace {d1, d2, ..., dm}. Thus, the final collisions after hashing all dense units in all
dimensions generate dense units in the relevant maximal subspaces. These dense units are combined to
get the final clusters in their respective subspaces.
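For instance, suppose the points P1, P2 and P3 carry the illustrative labels K1 = 907, K2 = 641 and K3 = 283. If {P1, P2, P3} is dense in dimension d1 and also in dimension d5, both dense units hash to the same signature 907 + 641 + 283 = 1831, and this collision reveals, without comparing any individual points, that the dense unit exists in the subspace {d1, d5}.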
PROCEDURE:
The maximal subspace cluster is generated using this procedure.
STEP 1:
Consider a set K of very large, unique and random positive integers {K1, K2, ..., Kn}. We
define M as a one-to-one mapping function, M : DB → K. Each point Pi ∈ DB is assigned a unique
random integer Ki from the set K.
STEP 2:
In each single dimension, the sorted 1-D projection is scanned to find the dense units, and each
dense unit is split into its (tau+1)-sized combinations to form the core set.
STEP 3:
Then, we create a hash table hTable as follows. In each dimension j, for every dense unit Ua we
calculate the sum of its elements, called its signature Ha, and hash this signature into hTable. If Ha
collides with another signature Hb from dimension k, then the dense unit Ua exists in subspace {j, k}
with extremely high probability. After repeating this process in all single dimensions, each entry of this
hash table will contain a dense unit in the maximal subspace, as we can store the colliding dimensions
against each signature Hi in hTable.
STEP 4:
The dense units in all possible maximal subspaces are processed to create density-reachable sets
and hence, maximal clusters. We use DBSCAN in each found subspace for the clustering process, and
the values of epsilon and tau can be adapted differently as per the dimensionality of the subspace to deal
with the curse of dimensionality.
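A compact sketch of DBSCAN as it would be applied within one found subspace; pts holds the points projected onto that subspace, and eps/minPts play the roles of epsilon and tau above (all names are illustrative):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    public class Dbscan {
        static final int NOISE = -1, UNVISITED = 0;

        // Labels each point with a cluster ID (1..k), or NOISE.
        static int[] cluster(double[][] pts, double eps, int minPts) {
            int[] label = new int[pts.length];       // 0 = unvisited
            int clusterId = 0;
            for (int i = 0; i < pts.length; i++) {
                if (label[i] != UNVISITED) continue;
                List<Integer> seeds = regionQuery(pts, i, eps);
                if (seeds.size() < minPts) { label[i] = NOISE; continue; }
                clusterId++;                         // i is a core point
                label[i] = clusterId;
                Deque<Integer> queue = new ArrayDeque<>(seeds);
                while (!queue.isEmpty()) {
                    int q = queue.poll();
                    if (label[q] == NOISE) label[q] = clusterId; // border point
                    if (label[q] != UNVISITED) continue;
                    label[q] = clusterId;
                    List<Integer> qSeeds = regionQuery(pts, q, eps);
                    if (qSeeds.size() >= minPts) queue.addAll(qSeeds);
                }
            }
            return label;
        }

        // All points within eps (Euclidean distance) of point i, itself included.
        static List<Integer> regionQuery(double[][] pts, int i, double eps) {
            List<Integer> out = new ArrayList<>();
            for (int j = 0; j < pts.length; j++) {
                double d2 = 0;
                for (int k = 0; k < pts[i].length; k++) {
                    double diff = pts[i][k] - pts[j][k];
                    d2 += diff * diff;
                }
                if (Math.sqrt(d2) <= eps) out.add(j);
            }
            return out;
        }
    }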
7.1 Advantages
All the clustering algorithms find the common points in each dimension, but this approach also
finds the interleaved dense units.
Unlike other algorithms, it uses a method of assigning signature so that repeated clusters are not
formed.
We need to scan the database only once, unlike other algorithms which require n database scans
for n dimensions.
Since the data points are sorted before finding dense units, there is no need to scan all the data
points in a dimension when even the first two points do not contribute to the cluster.
This approach finds dense units only in single dimensions, which can then be combined to form
dense units in multiple dimensions.
The number of database scans and the number of clusters to be generated need not be estimated
before runtime.
8. CONCLUSION
The generation of large and high-dimensional data in the last few years has overwhelmed the
data mining community. This approach efficiently finds quality subspace clusters without expensive
database scans or generating trivial clusters in between. We have compared the SUBSCALE algorithm
against recent subspace clustering algorithms, and our proposed algorithm has performed far better when
it comes to handling high-dimensional datasets. However, the main cost in the SUBSCALE algorithm is
the computation of the candidate 1-dimensional dense units. In addition to splitting the hash table
computation, SUBSCALE has a high degree of parallelism, as there is no dependency in computing dense
units across multiple dimensions. We plan to implement a parallel version of our algorithm on General
Purpose Graphics Processing Units (GPGPU) in the near future.
9. REFERENCES
2. Tierney S, Gao J, Guo Y (2014) Subspace clustering for sequential data. In: Computer Vision and
Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE. pp 1019-1026.
4. Vidal R (2011) Subspace clustering. IEEE Signal Processing Magazine 28(2):52-68.