Horizontal Aggregation and Hierarchical Clustering

CS010 807 Main Project

Department of Computer Science & Engineering Rajagiri School of Engineering and Technology, Rajagiri Valley

1. INTRODUCTION
Most algorithms require datasets in a horizontal layout as input, with several records and one variable or dimension per column. Managing large data sets without DBMS support can be a difficult task. Trying different subsets of data points and dimensions is more flexible, faster and easier to do inside a relational database with SQL queries than outside with alternative tools. Horizontal aggregation can be performed using the PIVOT operator, which can easily be implemented inside a query processor, much like select, project and join. The PIVOT operator exchanges rows and columns of tabular data, enabling data transformations useful in data modeling, data analysis and data presentation.

There are many existing functions and operators for aggregation in Structured Query Language. The most commonly used aggregation is the sum of a column; other aggregation operators return the average, maximum, minimum or row count over groups of rows. These aggregation operations have many limitations when building large data sets for data mining purposes. Database schemas are also highly normalized for On-Line Transaction Processing (OLTP) systems, whereas the data sets to be mined are stored in a relational database or data warehouse. Data mining, statistical and machine learning algorithms generally require aggregated data in summarized form. Such algorithms need input in cross-tabular (horizontal) form, and significant effort is required to compute aggregations for this purpose. This effort is due to the amount and complexity of SQL code which needs to be written, optimized and tested.
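For illustration only, the sketch below uses the example fact table F(K, D1, D2, A) that appears later in Figure 4.1 (the names are taken from that example, not from any production schema). It shows the conventional vertical aggregation: one aggregated value per group, several rows per D1, which is not the layout a data mining algorithm expects.

// Sketch only: F, D1, D2 and A follow the Figure 4.1 example used later in this report.
public class VerticalAggregationExample {
    // Conventional SQL aggregation: one SUM per (D1, D2) group, i.e. several
    // rows per D1 -- a vertical layout that mining algorithms cannot use directly.
    static final String VERTICAL_QUERY =
            "SELECT D1, D2, SUM(A) AS sumA FROM F GROUP BY D1, D2";

    public static void main(String[] args) {
        // The desired horizontal layout would instead be one row per D1 with one
        // column per distinct D2 value (D2X, D2Y), which is what the PIVOT-based
        // horizontal aggregation described in this report produces.
        System.out.println(VERTICAL_QUERY);
    }
}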

To analyze data efficiently, data mining systems widely use datasets with columns in a horizontal tabular layout. Preparing a data set is one of the more complex tasks in a data mining project; it requires many SQL queries, joining tables and aggregating columns. Conventional RDBMSs usually manage tables in vertical form. Aggregating columns into a horizontal tabular layout returns a set of numbers per row, instead of one number per row. The system uses one parent table and different child tables; operations are then performed on the data loaded from multiple tables. The PIVOT operator offered by the RDBMS is used to compute the aggregate operations; the PIVOT method is fast and scales well.

Partitioning the large data set obtained from horizontal aggregation into homogeneous clusters is an important task in this system. A K-means algorithm implemented in SQL is well suited for this operation; to overcome the problems of K-means, we optimize it by using hierarchical clustering.

1.1 Project Scope


The objective of this project is to find an efficient clustering algorithm that overcomes the defects of the commonly used K-means algorithm for partitioning the large data set, obtained as a result of horizontal aggregation, into homogeneous clusters. Data mining, statistical and machine learning algorithms generally require aggregated data in summarized form. To analyze data efficiently, data mining systems widely use datasets with columns in a horizontal tabular layout. Preparing a data set is one of the more complex tasks in a data mining project; it requires many SQL queries, joining tables and aggregating columns. Conventional RDBMSs usually manage tables in vertical form. Aggregating columns into a horizontal tabular layout returns a set of numbers per row, instead of one number per row. The system uses one parent table and different child tables; operations are then performed on the data loaded from multiple tables. Partitioning the large data set obtained from horizontal aggregation into homogeneous clusters is an important task in data mining analysis.

The proposed implementation allows clustering large data sets inside a relational DBMS, eliminating the need to export or access data outside the DBMS. Horizontal aggregation represents a template for generating SQL code from a data mining tool. This SQL code reduces manual work in the data preparation phase of a data mining project.

The automatically generated code is more efficient than end-user-written SQL code, so datasets for data mining projects can be created in less time, and they can be created entirely inside the DBMS.


1.2 Design and Implementation Constraints


Managing large data sets without DBMS support can be a difficult task. Existing aggregation operations have many limitations when building data sets for data mining purposes: significant SQL code needs to be written, optimized and tested.

1.3 Assumptions and Dependencies


Data in numerical form is preferred. The data should be suitable for performing horizontal aggregation, and it is assumed that there are no missing attributes. The data sets obtained after horizontal aggregation should be suitable for clustering.

1.4 Advantages
Works well with outliers
Produces good clusters with a single scan of the data set
Minimizes running time
Linearly scalable
Less space complexity


2. LITERATURE SURVEY
2.1 Background
Horizontal aggregation is a new class of function that returns aggregated columns in a horizontal layout. Most algorithms require datasets in a horizontal layout as input, with several records and one variable or dimension per column. Preparing a data set is one of the more complex tasks in a data mining project; it requires many SQL queries, joining tables and aggregating columns. Horizontal aggregation can be performed using the PIVOT operator, which can easily be implemented inside a query processor, much like select, project and join. The PIVOT operator exchanges rows and columns of tabular data, enabling data transformations useful in data modeling, data analysis and data presentation. The PIVOT method is fast and scales well.

Partitioning the large data set obtained from horizontal aggregation into homogeneous clusters is an important task in this system. A K-means algorithm implemented in SQL is well suited for this operation; to overcome its problems, we optimize it by using hierarchical clustering.

Similar IEEE papers:

(i) Horizontal Aggregations in SQL by Using Clustering as a Mixture Gaussians

To analyze data efficiently, data mining systems widely use datasets with columns in a horizontal tabular layout. Preparing a data set is one of the more complex tasks in a data mining project; it requires many SQL queries, joining tables and aggregating columns. Conventional RDBMSs usually manage tables in vertical form. Aggregating columns into a horizontal tabular layout returns a set of numbers per row, instead of one number per row [3]. The system uses one parent table and different child tables; operations are then performed on the data loaded from multiple tables. The PIVOT operator offered by the RDBMS is used to calculate the aggregate operations; it is fast and scales well. Partitioning the large data set obtained from horizontal aggregation into homogeneous clusters is an important task in that system, and a Gaussian mixture algorithm implemented in SQL is used for implementing this operation.

(ii) Optimization of Horizontal Aggregation in SQL by Using K-Means Clustering

Horizontal aggregation is a new class of function that returns aggregated columns in a horizontal layout. Horizontal aggregation can be performed using the PIVOT operator, which can easily be implemented inside a query processor, much like select, project and join. The PIVOT operator exchanges rows and columns of tabular data, enabling data transformations useful in data modeling, data analysis and data presentation. The K-means clustering algorithm is used to cluster the attribute that results from horizontal aggregation. Clustering using K-means requires the user to specify the number of clusters, which is a major disadvantage [4].


3. HARDWARE AND SOFTWARE SPECIFICATIONS


Minimum software requirements:
IDE: NetBeans 7.01
SQL Server 2005
JDK 1.6

Minimum hardware requirements:
1 GB RAM

3.1 Java Development Kit

A Java Development Kit (JDK) is a program development environment for writing Java applets and applications. It consists of a runtime environment that "sits on top" of the operating system layer as well as the tools and programming that developers need to compile, debug, and run applets and applications written in the Java language.

3.2 SQL Server 2005

Microsoft SQL Server is a relational database management system developed by Microsoft. As a database, it is a software product whose primary function is to store and retrieve data as requested by other software applications, be they on the same computer or running on another computer across a network (including the Internet). There are at least a dozen different editions of Microsoft SQL Server aimed at different audiences and at workloads ranging from small single-machine applications to large Internet-facing applications with many concurrent users. Its primary query languages are T-SQL and ANSI SQL. SQL Server 2005 included native support for managing XML data, in addition to relational data. For this purpose, it defined an xml data type that can be used either as a data type in database columns or as literals in queries. XML columns can be associated with XSD schemas; XML data being stored is verified against the schema. XML is converted to an internal binary data type before being stored in the database. Specialized indexing methods were made available for XML data. XML data is queried using XQuery; SQL Server 2005 added some extensions to the T-SQL language to allow embedding XQuery queries in T-SQL.

SQL Server 2005 was also enhanced with new indexing algorithms, syntax and better error recovery systems. Data pages are checksummed for better error resiliency, and optimistic concurrency support was added for better performance. Permissions and access control were made more granular, and the query processor handles concurrent execution of queries more efficiently. Partitions on tables and indexes are supported natively, so scaling out a database onto a cluster is easier. SQL CLR was introduced with SQL Server 2005 to let it integrate with the .NET Framework. SQL Server 2005 also introduced MARS (Multiple Active Result Sets), a method of allowing a database connection to be used for multiple purposes.

3.3 NetBeans 7.01

NetBeans is an integrated development environment (IDE) for developing primarily with Java, but also with other languages, in particular PHP, C/C++, and HTML5. It is also an application platform framework for Java desktop applications and others. The NetBeans IDE is written in Java and can run on Windows, OS X, Linux, Solaris and other platforms supporting a compatible JVM. The NetBeans Platform allows applications to be developed from a set of modular software components called modules.

Applications based on the NetBeans Platform (including the NetBeans IDE itself) can be extended by third-party developers. The NetBeans Platform is a reusable framework that simplifies the development of Java Swing desktop applications. The NetBeans IDE bundle for Java SE contains what is needed to start developing NetBeans plugins and NetBeans Platform based applications; no additional SDK is required. NetBeans IDE is a free, open-source, cross-platform IDE with built-in support for the Java programming language.


4. SYSTEM ANALYSIS AND DESIGN


4.1 Existing System

The existing system is the system presently in use; its limitations are studied in order to find a solution to them. The existing system performs horizontal aggregation using PIVOT and clustering using K-means. The PIVOT operator exchanges rows and columns of tabular data, enabling data transformations useful in data modeling, data analysis and data presentation. Data mining algorithms require input in cross-tabular (horizontal) form, and significant effort is required to compute it. K-means is initialized from some random or approximate solution. Each step assigns each point to its nearest cluster, and the points belonging to the same cluster are then averaged to obtain new cluster centroids. Each step successively improves the cluster centroids until they are stable. This is the standard version of the K-means technique.

This clustering technique has many disadvantages. First, the learning algorithm requires a priori specification of the number of cluster centers. If two clusters overlap heavily, K-means cannot resolve that there are two clusters. Euclidean distance measures can unequally weight underlying factors. The learning algorithm finds only a local optimum of the squared error function. Randomly chosen initial cluster centers may not lead to a fruitful result. The method is applicable only when the mean is defined, i.e. it fails for categorical data. It is also unable to handle noisy data and outliers, and it fails for non-linear data sets.

4.2 Requirement Analysis


SQL Server 2008

SQL Server is an RDBMS developed by Microsoft and is the RDBMS used for this project. It supports the PIVOT operation. The databases are created in SQL Server; a connection is established with SQL Server, and the databases are then read, aggregated and clustered accordingly.

Database
Databases are created to manage large quantities of information by inputting, storing, retrieving and managing that information. The databases used for horizontal aggregation and clustering should contain numerical data.

4.3 Proposed System


The proposed system is one where we apply our own ideas or make slight modifications to the existing system. Here we choose a more efficient clustering algorithm: BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH requires only a single scan of the database. In addition, BIRCH is recognized as the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. This clustering algorithm has many advantages. It is local, in that each clustering decision is made without scanning all data points and all currently existing clusters. It exploits the observation that the data space is not usually uniformly occupied and that not every data point is equally important. It makes full use of available memory to derive the finest possible subclusters while minimizing I/O costs. It is also an incremental method that does not require the whole data set in advance.

4.4. Module Division


1. Interface Design
2. Horizontal Aggregation
3. BIRCH Clustering
4. Comparison using K-Means

4.4.1 Interface Design

The interface of our project is built with NetBeans 7.01. NetBeans is an integrated development environment (IDE) for developing primarily with Java, but also with other languages, in particular PHP, C/C++ and HTML5. It is also an application platform framework for Java desktop applications and others. The NetBeans desktop application template has been used for our project to give it a professional style.

The interface consists of a menu bar with a set of menu options. One of the menu options is Optimising; it has different menu items to display the required table and to perform Horizontal Aggregation, K-Means Clustering and BIRCH Clustering. When the Dataset menu item is clicked, the user is prompted for the dataset name. On entering a valid dataset name, the corresponding dataset is displayed; if a valid name is not entered, no table is displayed and an error is reported. When the Horizontal Aggregation menu item is clicked, the user is prompted for the METHOD, i.e. the aggregation function to be used for the horizontal aggregation using PIVOT; the COLUMN TO BE AGGREGATED, i.e. the column on whose values the aggregation function should be applied; and the TRANSPOSE column, i.e. the column whose values are to be transposed into separate columns. Once valid entries are made, the horizontally aggregated output is displayed in table format; if illegal values are entered, no table is displayed. When the Optimizing menu item is clicked, the K-means clustering algorithm is executed using the previously entered values and the output is displayed in table format; the K-means clustering algorithm is run for comparison purposes. When the BIRCH Clustering menu item is clicked, the BIRCH clustering code is executed. The output consists of the number of data points within each cluster, displayed as a bar chart.
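A minimal Swing sketch of this menu structure is shown below. It is illustrative only: the class and item names are assumptions, the real interface is a NetBeans-generated form, and the sketch also includes the comparison menu described in the next paragraph. Action listeners that call the aggregation and clustering code would be attached to each item.

import javax.swing.*;

// Sketch of the application's menu bar; names are illustrative assumptions.
public class OptimisingMenuSketch {
    public static JMenuBar buildMenuBar() {
        JMenuBar bar = new JMenuBar();

        JMenu optimising = new JMenu("Optimising");
        optimising.add(new JMenuItem("Dataset"));                // prompts for a dataset name and displays the table
        optimising.add(new JMenuItem("Horizontal Aggregation")); // prompts for method, column and transpose column
        optimising.add(new JMenuItem("K-Means Clustering"));     // runs K-means on the aggregated table
        optimising.add(new JMenuItem("BIRCH Clustering"));       // runs BIRCH and shows the bar chart
        bar.add(optimising);

        JMenu comparison = new JMenu("Comparison");
        comparison.add(new JMenuItem("Comparison Chart"));       // time/space comparison of the two algorithms
        bar.add(comparison);
        return bar;
    }

    public static void main(String[] args) {
        JFrame frame = new JFrame("Horizontal Aggregation & Hierarchical Clustering");
        frame.setJMenuBar(buildMenuBar());
        frame.setSize(600, 400);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}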


Another menu option is for comparison. It contains a menu item to display the comparison chart. When this menu item is pressed, the comparison chart is displayed; it contains the comparison of the time complexity and space complexity of both clustering algorithms.

4.4.2 Horizontal Aggregation

The second module of this project performs horizontal aggregation. Horizontal aggregation is a new class of function that returns aggregated columns in a horizontal layout. Most algorithms require datasets in a horizontal layout as input, with several records and one variable or dimension per column. Managing large data sets without DBMS support can be a difficult task. Trying different subsets of data points and dimensions is more flexible, faster and easier to do inside a relational database with SQL queries than outside with alternative tools. Horizontal aggregation is done using the PIVOT operator, which exchanges rows and columns of tabular data, enabling data transformations useful in data modeling, data analysis and data presentation.

There are several advantages to horizontal aggregation. First, horizontal aggregation represents a template for generating SQL code from a data mining tool; this SQL code reduces manual work in the data preparation phase of a data mining project. Second, the automatically generated code is more efficient than end-user-written SQL code, so datasets for data mining projects can be created in less time. Third, the data sets can be created entirely inside the DBMS.

Table F:
K    D1   D2   A
1    3    X    9
2    2    Y    6
3    1    Y    10
4    1    Y    0
5    2    X    1
6    1    X    Null
7    3    X    8

Horizontally aggregated table (SUM of A, grouped by D1, transposed on D2):
D1   D2X    D2Y
1    Null   10
2    8      6
3    17     Null

Figure 4.1: Horizontal aggregation of table F

To perform horizontal aggregation in our project, the user is prompted for the METHOD, i.e. the aggregation function to be used for horizontal aggregation with PIVOT; the COLUMN TO BE AGGREGATED, i.e. the column on whose values the aggregation function is applied; and the TRANSPOSE column, i.e. the column whose values are transposed into separate columns. Once valid entries are made, the horizontally aggregated output is displayed in table format. If illegal values are entered, no table is displayed.
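As a hedged sketch (not the project's exact code), the Figure 4.1 aggregation could be issued from Java through JDBC as shown below. The connection URL, credentials, database name and the literal transpose values X and Y are assumptions chosen for illustration.

import java.sql.*;

// Sketch: runs the PIVOT query for the Figure 4.1 example (SUM of A,
// transpose column D2, implicit grouping column D1) and prints the result.
public class HorizontalAggregationExample {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders; see the connection sketch in Section 4.6.
        Connection con = DriverManager.getConnection(
                "jdbc:sqlserver://localhost:1433;databaseName=ProjectDB", "user", "password");

        String pivotQuery =
                "SELECT D1, [X] AS D2X, [Y] AS D2Y " +
                "FROM (SELECT D1, D2, A FROM F) AS src " +
                "PIVOT (SUM(A) FOR D2 IN ([X], [Y])) AS pvt " +
                "ORDER BY D1";

        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(pivotQuery);
        while (rs.next()) {
            // One row per D1 with one aggregated column per D2 value,
            // e.g. D1 = 3 gives D2X = 17 and D2Y = NULL as in Figure 4.1.
            System.out.println(rs.getInt("D1") + "\t" + rs.getObject("D2X") + "\t" + rs.getObject("D2Y"));
        }
        rs.close();
        st.close();
        con.close();
    }
}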

4.4.3 BIRCH Clustering

BIRCH is a clustering algorithm which deals with large datasets by first generating a more compact summary that retains as much distribution information as possible, and then clustering the data summary instead of the dataset.

BIRCH summarizes a dataset into a set of sub clusters to reduce the scale of the clustering problem using the concept of clustering feature.

Clustering feature: A clustering feature (CF) entry is a triple summarizing the information that we maintain about a sub cluster of data points.
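In the standard BIRCH formulation this triple is (N, LS, SS): the number of points, their linear sum and their squared sum. The sketch below is a minimal illustration of such an entry; the field and method names are our own assumptions, not taken from a particular library. The additivity of the triple is what allows entries to absorb points and be merged incrementally.

// Minimal sketch of a BIRCH clustering feature (CF) entry.
public class ClusteringFeature {
    int n;          // number of data points in the sub cluster
    double[] ls;    // linear sum of the points (one entry per dimension)
    double ss;      // sum of the squared norms of the points

    ClusteringFeature(int dims) {
        ls = new double[dims];
    }

    // Absorb a single data point into this sub cluster.
    void add(double[] point) {
        n++;
        double sq = 0;
        for (int i = 0; i < point.length; i++) {
            ls[i] += point[i];
            sq += point[i] * point[i];
        }
        ss += sq;
    }

    // Merge another CF entry into this one (additivity of CF triples).
    void merge(ClusteringFeature other) {
        n += other.n;
        for (int i = 0; i < ls.length; i++) ls[i] += other.ls[i];
        ss += other.ss;
    }

    // Centroid of the sub cluster, derived from the triple alone.
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int i = 0; i < ls.length; i++) c[i] = ls[i] / n;
        return c;
    }

    // Radius (root mean squared distance of the points from the centroid);
    // compared against the threshold T when deciding whether a leaf entry
    // can absorb a new point.
    double radius() {
        double[] c = centroid();
        double centroidSq = 0;
        for (double v : c) centroidSq += v * v;
        double r2 = ss / n - centroidSq;
        return Math.sqrt(Math.max(r2, 0));
    }
}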

CF-tree: A CF-tree is a height-balanced tree with two parameters: the branching factor (B for nonleaf nodes and L for leaf nodes) and the threshold T. Each nonleaf node contains at most B entries of the form [CFi, childi], where i = 1, 2, ..., B, childi is a pointer to its i-th child node, and CFi is the CF entry of the sub cluster represented by this child. A nonleaf node therefore represents a sub cluster made up of all the sub clusters represented by its entries. In addition, each leaf node has two pointers, prev and next, which are used to chain all leaf nodes together for efficient scans. All entries in a leaf node must satisfy a threshold requirement with respect to the threshold value T: the diameter (alternatively, the radius) of each leaf entry has to be less than T.

The tree size is a function of T. The larger T is, the smaller the tree is. A node is required to fit in a page of size P, where P is a parameter of BIRCH. Once the dimension d of the data space is given, the sizes of leaf and nonleaf entries are known, and then B and L are determined by P. So P can be varied for performance tuning.

Phases of the BIRCH algorithm: The four phases of the BIRCH algorithm are:
1. Loading: The main task of Phase 1 is to scan all data and build an initial in-memory CF-tree using the given amount of memory and recycling space on disk.
2. Optional Condensing: This phase serves as a cushion between the loading phase and Phase 3. We scan the leaf entries in the initial CF-tree to rebuild a smaller CF-tree while removing more outliers and grouping crowded sub clusters into larger ones.
3. Global Clustering: We use an existing global or semi-global algorithm to cluster all the leaf entries across the boundaries of different nodes. After this phase we obtain a set of clusters that captures the major distribution of patterns in the data.
4. Cluster Refining: This phase uses the centroids of the clusters produced by the previous phase as seeds and redistributes the data points to their closest seed to obtain a set of new clusters. It also ensures that all copies of a given data point go to the same cluster.
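The outline below maps these four phases onto method calls. It is a skeleton only: the method bodies are omitted, and all class and method names are assumptions used to show how the phases chain together, not a working implementation.

// Skeleton only: phase bodies are omitted; CFTree stands for the structure
// built from the CF entries sketched above.
public class BirchPhases {

    static class CFTree { /* CF entries organised as described in Section 4.4.3 */ }

    public void run(double[][] points, double threshold) {
        CFTree tree = load(points, threshold);       // Phase 1: build the initial in-memory CF-tree
        CFTree smaller = condense(tree);             // Phase 2 (optional): rebuild a smaller CF-tree
        int[] clusters = globalClustering(smaller);  // Phase 3: cluster the leaf entries (e.g. with K-means)
        refine(points, clusters);                    // Phase 4: redistribute points to the nearest centroid
    }

    CFTree load(double[][] points, double threshold) { return new CFTree(); }
    CFTree condense(CFTree tree) { return tree; }
    int[] globalClustering(CFTree tree) { return new int[0]; }
    void refine(double[][] points, int[] clusters) { }
}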

4.4.4 Comparison Using K-Means

The performance of the BIRCH clustering algorithm is compared with that of another clustering algorithm, K-means. The comparison is done on the time and space complexities of the two algorithms. The running times are obtained by taking the difference between the starting and ending times while running each algorithm.
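A hedged sketch of this measurement is shown below: wall-clock time is taken before and after each run and the difference reported. The clustering calls are placeholders standing in for the project's actual K-means and BIRCH routines.

// Sketch of the timing comparison; runKMeans/runBirch are placeholders.
public class TimingComparison {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        runKMeans();                                   // K-means on the aggregated table
        long kMeansMillis = System.currentTimeMillis() - start;

        start = System.currentTimeMillis();
        runBirch();                                    // BIRCH on the same data
        long birchMillis = System.currentTimeMillis() - start;

        System.out.println("K-means: " + kMeansMillis + " ms, BIRCH: " + birchMillis + " ms");
    }

    static void runKMeans() { /* placeholder for the K-means routine */ }
    static void runBirch()  { /* placeholder for the BIRCH routine */ }
}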

K-means is initialized from some random or approximate solution. Each step assigns each point to its nearest cluster and then points belonging to the same cluster are averaged to get new cluster centroids. Each step successively improves cluster centroids until they are stable. This is the standard version of K-Means technique used. The algorithm is designed as follows:

The K-means algorithm, based on a classification technique, uses the horizontal aggregation as input. The PIVOT operator is used to calculate the aggregation of particular data values from distinct fact tables, and PIVOT optimization is provided for a large number of fact tables. Establishing database connectivity and choosing the different tables with the .mdb extension is the first step in this system.

Horizontal aggregation is evaluated by choosing the transpose column and the aggregate operation; the PIVOT operator then automatically transforms the table to a horizontal layout. This is the main advantage of this particular approach.

The K-means algorithm is the best-known squared-error-based clustering algorithm, here taking the horizontally aggregated data as input. The algorithm consists of four main steps:
1) Selection of the initial k means for k clusters from the attributes of the datasets obtained from the horizontal aggregation operation.
2) Calculation of the dissimilarity between an object and the mean of a cluster.
3) Allocation of an object to the cluster whose mean is nearest to the object.
4) Recalculation of the mean of a cluster from the objects allocated to it, so that the intra-cluster dissimilarity is minimized.

4.5 System Overview


4.5.1 Use Case Description

Use Case 1:
Name: Horizontal Aggregation Module
Description: To horizontally aggregate a specified dataset so as to use it as input for the clustering algorithms.
Flow of events:
a) Main flow
1. User logs in to SQL Server.
2. Connection is established.
3. The necessary datasets are loaded into SQL Server.
4. User is prompted for the required aggregating function, the aggregated column and the transpose column.
5. The code for horizontal aggregation is executed.
6. Output is obtained and displayed in table format.

Activity diagram:

Figure 4.2: Activity Diagram. The user logs in to the DB, loads the tables into the DB, and is prompted for the method, transpose and aggregate column; the query is executed, the horizontally aggregated table is obtained and the user views the result.


Use Case Diagram:

Figure 4.3: Use Case Diagram for Horizontal Aggregation. The User enters the table name, the transpose column, the aggregate function and the column to be aggregated.

Specific Requirements
SQL Server should contain the required datasets
The aggregate function should be a valid one
The column to be aggregated should contain only numerical data

Preconditions
The required dataset should be loaded into SQL Server
Connection should be established between SQL Server and the application

Postconditions
The horizontally aggregated output of the dataset is displayed in table format

Use Case 2:
Name: BIRCH Clustering Module
Description: To perform clustering using the BIRCH algorithm. Its performance is compared with the K-Means algorithm.
Flow of events:
a) Main flow
1. Scan the horizontally aggregated data.
2. Create clustering features for each data point and insert them into the CF-tree.
3. Condense the CF-tree.
4. Perform global clustering using any clustering algorithm such as K-Means.
5. Refine the output obtained.
6. Check whether more clustering can be done on the data points.
7. If yes, repeat steps 4 to 7.
b) Alternative flow
7a. If no, exit and display the clusters using a bar graph.

Activity Diagram:

Figure 4.4: BIRCH Clustering. The user logs in to the DB and obtains the aggregated data; initial scanning creates the CF-tree, the clustering feature tree is condensed, global clustering is performed and the tree is refined; if more clustering is possible the cycle repeats, otherwise it ends.

Use Case Diagram:

Figure 4.5: Use case diagram for the BIRCH algorithm. The User creates the CF-tree, condenses the clustering features, performs global clustering and refines the tree.

Postconditions
1. View and analyze the clustered results

4.6 Algorithm

Connection to the SQL Server
Step 1: Start.
Step 2: Connect to SQL Server.
Step 3: Load the tables which are to be clustered into the database.
Step 4: Establish the connection between the database and this application.
Step 5: Stop.
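A minimal sketch of steps 2 to 4, assuming the Microsoft JDBC driver for SQL Server is on the classpath; the server address, database name and credentials are placeholders, not the project's actual settings.

import java.sql.Connection;
import java.sql.DriverManager;

// Sketch of establishing the JDBC connection used by the rest of the application.
public class SqlServerConnection {
    public static Connection connect() throws Exception {
        Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");  // explicit driver load (needed on JDK 1.6)
        String url = "jdbc:sqlserver://localhost:1433;databaseName=ProjectDB";
        return DriverManager.getConnection(url, "user", "password");
    }

    public static void main(String[] args) throws Exception {
        Connection con = connect();
        System.out.println("Connected: " + !con.isClosed());
        con.close();
    }
}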

Horizontal Aggregation Using Pivot
Step 1: Start.
Step 2: Read a valid table name and read the transpose column name.
Step 3: Read the column name which is to be aggregated.
Step 4: Read the name of the aggregate function which is to be applied on the column.
Step 5: Display the dataset in table format.
Step 6: Execute the PIVOT query.
Step 7: Display the horizontally aggregated output in table format.
Step 8: Stop.
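The sketch below shows one way the PIVOT query of step 6 could be assembled from the values read in steps 2 to 4. The class name is an assumption, and the list of distinct transpose values would have to be known in advance or fetched with a separate SELECT DISTINCT query; this is not the project's exact code.

// Sketch: builds the T-SQL PIVOT statement from the user-supplied inputs.
public class PivotQueryBuilder {
    public static String build(String table, String method, String aggColumn,
                               String transposeColumn, String[] transposeValues) {
        StringBuilder in = new StringBuilder();
        for (int i = 0; i < transposeValues.length; i++) {
            if (i > 0) in.append(", ");
            in.append("[").append(transposeValues[i]).append("]");
        }
        // Every column of the table that is neither aggregated nor pivoted
        // (e.g. D1 in PIVOTCONTENT) implicitly becomes a grouping column.
        return "SELECT * FROM " + table +
               " PIVOT (" + method + "(" + aggColumn + ")" +
               " FOR " + transposeColumn + " IN (" + in + ")) AS pvt";
    }

    public static void main(String[] args) {
        // Reproduces the query behind the Figure 4.1 / PIVOTCONTENT test case.
        System.out.println(build("PIVOTCONTENT", "SUM", "A", "D2", new String[]{"x", "y"}));
    }
}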

K-Means Algorithm
Step 1: Start.
Step 2: Initialize from some random or approximate solution.
Step 3: Assign each point to its nearest cluster.
Step 4: Average the points belonging to the same cluster to obtain new cluster centroids.
Step 5: Repeat steps 3 and 4 until the cluster centroids are stable.
Step 6: Stop.
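A minimal in-memory sketch of these steps (random initialisation, nearest-centroid assignment, recomputation of centroids until stable). The class name and the toy data in main are illustrative; in the project the rows would come from the horizontally aggregated table.

import java.util.Arrays;
import java.util.Random;

// Sketch of the K-means loop described in the steps above.
public class KMeansSketch {
    public static int[] cluster(double[][] points, int k, int maxIterations) {
        Random rnd = new Random();
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) {
            centroids[i] = points[rnd.nextInt(points.length)].clone();   // Step 2: random initial centroids
        }
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {                    // Step 3: assign to nearest centroid
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(points[p], centroids[c]) < distance(points[p], centroids[best])) best = c;
                }
                if (assignment[p] != best) { assignment[p] = best; changed = true; }
            }
            for (int c = 0; c < k; c++) {                                // Step 4: recompute centroids as means
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int d = 0; d < sum.length; d++) sum[d] += points[p][d];
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
                }
            }
            if (!changed) break;                                         // Step 5: stop when assignments are stable
        }
        return assignment;
    }

    private static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;   // squared Euclidean distance is enough for comparisons
    }

    public static void main(String[] args) {
        double[][] data = {{1, 2}, {1.5, 1.8}, {8, 8}, {9, 11}};
        System.out.println(Arrays.toString(cluster(data, 2, 100)));
    }
}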

BIRCH Algorithm
Step 1: Start.
Step 2: Scan all data and build an initial in-memory CF-tree using the given amount of memory and recycling space on disk.
Step 2.1: Identifying the appropriate leaf: Starting from the root, recursively descend the CF-tree by choosing the closest child node according to a chosen distance metric D0.
Step 2.2: Modifying the leaf: On reaching a leaf node, find the closest leaf entry, say Li, and test whether Li can absorb the new entry Ent without violating the threshold condition. If so, update the CF vector for Li to reflect this. If not, add a new entry for Ent to the leaf. If there is space on the leaf for this new entry, we are done; otherwise we must split the leaf node. Node splitting is done by choosing the farthest pair of entries as seeds and redistributing the remaining entries.
Step 2.3: Modifying the path to the leaf: After inserting Ent into a leaf, update the CF information on the path from the root to the leaf.
Step 3: Scan the leaf entries in the initial CF-tree to rebuild a smaller CF-tree while removing more outliers and grouping crowded sub clusters into larger ones.
Step 4: Use an existing global or semi-global algorithm to cluster all the leaf entries across the boundaries of different nodes.
Step 5: Use the centroids of the clusters produced by the previous phase as seeds and redistribute the data points to their closest seed to obtain a set of new clusters.
Step 6: Stop.
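The sketch below illustrates steps 2.1 and 2.2 in simplified form, reusing the ClusteringFeature sketch from Section 4.4.3: the closest leaf entry absorbs the point if the threshold T is not violated, otherwise a new entry is started. Descending the nonleaf levels and splitting full nodes are omitted, so this is an illustration under stated assumptions, not a complete CF-tree implementation.

import java.util.ArrayList;
import java.util.List;

// Simplified single-leaf sketch of CF insertion (steps 2.1 and 2.2).
public class CFLeafSketch {
    final List<ClusteringFeature> entries = new ArrayList<ClusteringFeature>();
    final double threshold;    // T: maximum radius of a leaf entry
    final int maxEntries;      // L: branching factor for leaf nodes (splitting not shown)

    CFLeafSketch(double threshold, int maxEntries) {
        this.threshold = threshold;
        this.maxEntries = maxEntries;
    }

    void insert(double[] point) {
        ClusteringFeature closest = null;
        double best = Double.MAX_VALUE;
        for (ClusteringFeature cf : entries) {            // Step 2.1: find the closest leaf entry
            double d = distance(cf.centroid(), point);
            if (d < best) { best = d; closest = cf; }
        }
        if (closest != null) {                            // Step 2.2: try to absorb the point
            ClusteringFeature trial = copyOf(closest);
            trial.add(point);
            if (trial.radius() <= threshold) {
                closest.add(point);
                return;
            }
        }
        ClusteringFeature fresh = new ClusteringFeature(point.length);
        fresh.add(point);
        entries.add(fresh);   // new entry; a full implementation splits the node when entries.size() > maxEntries
    }

    private static ClusteringFeature copyOf(ClusteringFeature cf) {
        ClusteringFeature copy = new ClusteringFeature(cf.ls.length);
        copy.merge(cf);
        return copy;
    }

    private static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}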

4.7. System Architecture


Figure 4.6: System architecture. Datasets from the database are horizontally aggregated using PIVOT; the aggregated data then passes through initial scanning and tree condensing in the BIRCH clustering stage.

5. IMPLEMENTATION
Implementation is an activity that spans the development phase. It is the process of bringing a developed system into operational use and turning it over to the user. The new system and its components are to be tested in a structured and planned manner. A successful system should be delivered, and users should have confidence that the system will work efficiently and effectively. The more complex the system being implemented, the more involved the system analysis and design effort required for implementation will be.

Implementation is the stage of the project when the theoretical design is turned into the working system. It is the most crucial stage in achieving a successful new system and giving the users confidence that the new system will work and be effective. The coding process performs this task. Coding is viewed as a natural consequence of design.

5.1. Top down Design


A top-down approach is essentially the breaking down of a system to gain insight into its compositional sub-systems. In a top-down approach an overview of the system is first formulated, specifying but not detailing any first-level subsystems. Each subsystem is then refined in yet greater detail, sometimes over many additional subsystem levels, until the entire specification is reduced to base elements. A top-down model is often specified with the assistance of "black boxes", which make it easier to manipulate; however, black boxes may fail to elucidate elementary mechanisms or be detailed enough to realistically validate the model.

Top-down approaches emphasize planning and a complete understanding of the system. It is inherent that no coding can begin until a sufficient level of detail has been reached in the design of at least some part of the system. The top-down approach is done by attaching stubs in place of the modules. This, however, delays testing of the ultimate functional units of a system until significant design is complete. Bottom-up emphasizes coding and early testing, which can begin as soon as the first module has been specified. This approach, however, runs the risk that modules may be coded without a clear idea of how they link to other parts of the system, and that such linking may not be as easy as first thought. Re-usability of code is one of the main benefits of the bottom-up approach.

Top-down is a programming style, the mainstay of traditional procedural languages, in which design begins by specifying complex pieces and then dividing them into successively smaller pieces. Eventually, the components are specific enough to be coded and the program is written. This is the exact opposite of the bottom-up programming approach, which is common in object-oriented languages such as C++ or Java. The technique for writing a program using top-down methods is to write a main procedure that names all the major functions it will need; the programming team then looks at the requirements of each of those functions and the process is repeated. These compartmentalized sub-routines eventually perform actions so simple that they can be easily and concisely coded. When all the various sub-routines have been coded, the program is done.

By defining how the application comes together at a high level, lower-level work can be self-contained. By defining how the lower-level abstractions are expected to integrate into higher-level ones, interfaces become clearly defined. Practicing top-down programming has several advantages:


Separating the low-level work from the higher-level abstractions leads to a modular design.
Modular design means development can be self-contained.
Having "skeleton" code illustrates clearly how low-level modules integrate.
Fewer operational errors, because each module is processed separately, so programmers get more time for each one.
Much less time consuming, since each programmer is only involved in a part of the big project.
A very optimized way of processing, since each programmer applies their own knowledge and experience to their own modules, so the project becomes optimized as a whole.
Easy to maintain: if an error occurs in the output, it is easy to identify which module of the program generated it.

5.2. Bottom up Approach


A bottom-up approach is the piecing together of systems to give rise to grander systems, thus making the original systems sub-systems of the emergent system. In a bottom-up approach the individual base elements of the system are first specified in great detail. These elements are then linked together to form larger subsystems, which in turn are linked, sometimes across many levels, until a complete top-level system is formed. This strategy often resembles a "seed" model, whereby the beginnings are small but eventually grow in complexity and completeness. However, "organic strategies" may result in a tangle of elements and subsystems, developed in isolation and subject to local optimization as opposed to meeting a global purpose.


Object-oriented programming (OOP) is a programming paradigm that uses "objects" to design applications and computer programs. The bottom-up approach has one weakness: we need to use a lot of intuition to decide the functionality that is to be provided by each module. If a system is to be built from an existing system, this approach is more suitable, as it starts from existing modules.

5.3. Software Development Process Model


A software development process, also known as a software development life cycle (SDLC), is a structure imposed on the development of a software product. Similar terms include software life cycle and software process. It is often considered a subset of systems development life cycle. There are several models for such processes, each describing approaches to a variety of tasks or activities that take place during the process. Some people consider a life-cycle model a more general term and a software development process a more specific term. For example, there are many specific software development processes that 'fit' the spiral life-cycle model. ISO/IEC 12207 is an international standard for software life-cycle processes. It aims to be the standard that defines all the tasks required for developing and maintaining software. There are many different models and methodologies, but each generally consists of a series of defined steps or stages. Few popular models are:

Waterfall Model
V-Model
Spiral Model
Prototyping
Iterative (Incremental Delivery)
Rapid Application Development

The engineering process that we used to complete our project was the V-model, or validation model. This is an elaboration of the waterfall model and emphasizes validation activities corresponding to each of the activities that create the products of the project. It can be seen as expanding the testing activity in the waterfall model. Each step has a matching validation process that can give feedback (in case defects are found) to the corresponding development stage and a reworking of the succeeding steps. For example, the system designer might have written that a calculation be carried out in a certain way.

The person who wrote the program design might have misunderstood what was required and written the pseudo code in a different way. At the system testing stage, the system designer would carry out checks to ensure that the software is doing what was specified in the design document and would discover the program designer's misreading of that document. So all the steps from program design onward have to be repeated to correct the defect.


Figure 5.1: V- Model

6. TESTING STRATEGIES


6.1 Black Box Testing


Black-box and white-box are test design methods. Black-box test design treats the system as a "black box", so it does not explicitly use knowledge of the internal structure. Black-box test design is usually described as focusing on testing functional requirements. Synonyms for black-box include: behavioral, functional, opaque-box and closed-box; black box testing is therefore also known as behavioral testing or closed box testing.

Black box testing takes an external perspective of the test object to derive test cases. These tests can be functional or non-functional, though usually functional. The test designer selects valid and invalid inputs and determines the correct output. There is no knowledge of the test object's internal structure. The higher the level, and hence the bigger and more complex the box, the more one is forced to use black box testing to simplify. While this method can uncover unimplemented parts of the specification, one cannot be sure that all existent paths are tested. Typical black box test design techniques include:
Decision table testing
Pairwise testing
State transition tables
Use case testing
Cross-functional testing


6.2. White Box Testing


White-box test design allows one to peek inside the "box", and it focuses specifically on using internal knowledge of the software to guide the selection of test data. Synonyms for white-box include: structural, glass-box and clear-box. While black-box and white-box are terms that are still in popular use, many people prefer the terms "behavioral" and "structural". Behavioral test design is slightly different from black-box test design because the use of internal knowledge isn't strictly forbidden, but it's still discouraged. In practice, it hasn't proven useful to use a single test design method. One has to use a mixture of different methods so that they aren't hindered by the limitations of a particular one. Some call this "gray-box" or "translucent-box" test design, but others wish we'd stop talking about boxes altogether. It is important to understand that these methods are used during the test design phase, and their influence is hard to see in the tests once they're implemented. Note that any level of testing (unit testing, system testing, etc.) can use any test design methods. Unit testing is usually associated with structural test design, but this is because testers usually don't have well-defined requirements at the unit level to validate. It is also known as Clear Box Testing or Glass Box or Open Box. This is used to find internal functionalities of software like conditional loops, statement coverage. It is mainly done by the Developers.

White box testing (clear box testing, glass box testing, transparent box testing or structural testing) uses an internal perspective of the system to design test cases based on internal structure. It requires programming skills to identify all paths through the software. The tester chooses test case inputs to exercise paths through the code and determines the appropriate outputs. In electrical hardware testing, every node in a circuit may be probed and measured; an example is in-circuit testing (ICT).

Since the tests are based on the actual implementation, if the implementation changes, the tests probably will need to change, too. For example ICT needs updates if component values change, and needs modified/new fixture if the circuit changes. This adds financial resistance to the change process, thus buggy products may stay buggy.

Automated optical inspection (AOI) offers similar component-level correctness checking without the cost of ICT fixtures; however, changes still require test updates. While white box testing is applicable at the unit, integration and system levels of the software testing process, it is typically applied at the unit level. While it normally tests paths within a unit, it can also test paths between units during integration, and between subsystems during a system-level test. Though this method of test design can uncover an overwhelming number of test cases, it might not detect unimplemented parts of the specification or missing requirements, but one can be sure that all paths through the test object are executed. Typical white box test design techniques include:
Control flow testing
Data flow testing
Branch testing

6.2.1 Module Testing

In computer programming, module testing is a software verification and validation method in which a programmer tests whether individual units of source code are fit for use. A unit is the smallest testable part of an application; in procedural programming a unit may be an individual function or procedure. Unit tests are typically written and run by software developers to ensure that code meets its design and behaves as intended. Its implementation can vary from being very manual (pencil and paper) to being formalized as part of build automation. The primary goal of module testing is to take the smallest piece of testable software in the application, isolate it from the remainder of the code, and determine whether it behaves exactly as you expect. Each unit is tested separately before being integrated into modules, so as to test the interfaces between modules. Unit testing has proven its value in that a large percentage of defects are identified during its use.

The most common approach to unit testing requires drivers and stubs to be written. The driver simulates a calling unit and the stub simulates a called unit. The investment of developer time in this activity sometimes results in demoting unit testing to a lower level of priority and that is almost always a mistake. Even though the drivers and stubs cost time and money, unit testing provides some undeniable advantages. It allows for automation of the testing process, reduces difficulties of discovering errors contained in more complex pieces of the application, and test coverage is often enhanced because attention is given to each unit.

6.2.2 Integration Testing

Integrated design is a collaborative method for designing buildings which emphasizes the development of a holistic design. Conventional building design usually involves a series of hand-offs from owner to architect to builder to occupant. This path does not invite all affected parties into the planning process, and therefore does not take into account their needs, areas of expertise or insights. In some cases, using the conventional method, incompatible elements of the design are not discovered until late in the process, when it is expensive to make changes. In addition to extensive collaboration, integrated design involves a whole building design approach. A building is viewed as an interdependent system, as opposed to an accumulation of its separate components (site, structure, systems and use). The goal of looking at all the systems together is to make sure they work in harmony rather than against each other.

6.3. Testing Strategies


Test Case 1:
Enter the table name: PIVOTCONTENT
Enter the method, the column to be aggregated and the transpose column

6.4 Summary of Test Results


Case 1: PIVOTCONTENT is entered as table name. The dataset is displayed. Method entered was sum, Column to be aggregated was A and transpose column was D2. The aggregated output was displayed. The K-Means algorithm is executed and displayed in table format and BIRCH clustering algorithm is displayed as a Bar chart.


Figure 6.1: Optimising options

Figure 6.2: Comparison Options


D1   D2   A
3    x    9
2    y    6
1    y    10
1    y    0
2    x    1
1    x    -1
3    x    8
2    x    7

Figure 6.3: PIVOTCONTENT


7. RESULTS AND DISCUSSION


The Application horizontally aggregated the datasets and produced the clustered output.
1. Results for the dataset PIVOTCONTENT :

Figure 7.1: Loading Table to be Aggregated


Figure 7.2: The Loaded Table

Figure 7.3: Method of Aggregation

Figure 7.4: Column to be Aggregated


Figure 7.5: Transpose Column

Figure 7.6: After Horizontal Aggregation


Figure 7.7: After K-Means Clustering

Figure 7.8: After Hierarchical BIRCH Clustering


Figure 7.9: Comparison of K-Means and BIRCH Clustering

Comparing the results of horizontal aggregation with and without the PIVOT operator proved to be a tedious task; hence a comparison chart for clustering with K-means and BIRCH was implemented instead.

The output of BIRCH clustering was originally to be shown on a dot-matrix graph. Since that was difficult to do, a bar graph of the BIRCH clustering output was produced instead.

We implemented horizontal aggregation and clustering on simple databases, since the PIVOT operator works for simple ones.


8. CONCLUSION
This project has a great chance of creating a revolution in the field of data mining. Most data mining applications require data in summarized form, so the horizontal aggregation of the data returns aggregated columns in a horizontal layout. Most algorithms require datasets in a horizontal layout as input, with several records and one variable or dimension per column. Horizontal aggregation can be performed by using the PIVOT operator, which exchanges rows and columns of tabular data and enables data transformations useful in data modelling, data analysis and data presentation.

The clustering of the data was done using the BIRCH algorithm which can handle large amount of datasets and produces good clusters. It also removes all the outliers and hence minimizes the running time. Thus the data clustering has wide applications in the field of image processing, bioinformatics, market analysis etc.

We have implemented horizontal aggregation using the PIVOT operator on databases, and the aggregated output was clustered using the BIRCH algorithm. The clustered output is displayed on a bar chart. A comparison between clustering using K-means and BIRCH was performed; the result was that the time and space complexity of BIRCH clustering were much lower than those of K-means. BIRCH proved to be an efficient method for large datasets.


9. FUTURE ENHANCEMENT
Horizontal aggregation for the methods MIN, MAX and COUNT can be implemented in the future. The input to the system could also be given as multiple tables rather than a single table. This project can be extended to develop a more formal model of evaluation methods to achieve better results, and more complete, cost-effective methods can then be developed.


10. REFERENCES

Web references:
[1] www.hiv.lang.gov/content/sequence/HEATMAP/heatmap.html
[2] www.ines-conf.org/ines-conf/58-INES2004.pdf
[3] www.ijert.org/view.php?id=2378&title=horizontal-aggregations-in-sql-by-using-clustering-as-a-mixture-gaussians
[4] www.ijarcsse.com/docs/papers/May2012/Volum2_issue5/V2I500441.pdf

Book references:
[1] Reda Alhajj: Advanced Data Mining and Applications, Copyright 2007
[2] Richard T. Watson: Data Management: Databases and Organizations, John Wiley & Sons, Inc., Canada, Copyright 2002
[3] S. Sumathi, S. Esakkirajan: Fundamentals of Relational Database Management Systems, Springer-Verlag Berlin Heidelberg, 2007
[4] Charu C. Aggarwal, Chandan K. Reddy: Data Clustering: Algorithms and Applications, CRC Press, 2014
[5] Jiawei Han, Micheline Kamber: Data Mining: Concepts and Techniques, Academic Press, 2001


APPENDIX A
UML Diagrams

Activity Diagrams
An activity diagram describes how activities are coordinated. It is particularly useful when you know that an operation has to achieve a number of different things, and you want to model the essential dependencies between them before deciding in what order to do them. It records the dependencies between activities, such as which things can happen in parallel and what must be finished before something else can start, and it represents the workflow of the process.

Activity diagrams commonly contain:
Activity states and action states
Transitions
Objects

Action states are atomic and cannot be decomposed; the work of an action state is not interrupted. Activity states can be further decomposed, their activity being represented by other activity diagrams.

Activity states may be interrupted. When the action or activity of a state completes, flow of control passes immediately to the next action or activity state. A flow of control has to start and end someplace: the initial state is drawn as a solid ball, and the stop state as a solid ball inside a circle.

A branch specifies alternate paths taken based on some Boolean expression. A branch may have one incoming transition and two or more outgoing ones.

Use Case Diagrams
Use case diagrams overview the usage requirements for a system. They are useful for presentations to management and/or project stakeholders, but for actual development you will find that use cases provide significantly more value, because they describe "the meat" of the actual requirements. Use case diagrams depict:
Use cases: A use case describes a sequence of actions that provide something of measurable value to an actor and is drawn as a horizontal ellipse.
Actors: An actor is a person, organization, or external system that plays a role in one or more interactions with your system. Actors are drawn as stick figures.
Associations: Associations between actors and use cases are indicated in use case diagrams by solid lines. An association exists whenever an actor is involved with an interaction described by a use case.
System boundary boxes (optional): You can draw a rectangle around the use cases, called the system boundary box, to indicate the scope of your system. Anything within the box represents functionality that is in scope, and anything outside the box is not.
Packages (optional): Packages are UML constructs that enable you to organize model elements (such as use cases) into groups. Packages are depicted as file folders and can be used on any of the UML diagrams, including both use case diagrams and class diagrams.
