Clustering With Multiviewpoint-Based Similarity Measure: Abstract
Abstract:
All clustering methods have to assume some cluster relationship among the data objects
that they are applied to. Similarity between a pair of objects can be defined either explicitly or
implicitly. In this paper, we introduce a novel multiviewpoint-based similarity measure and two
related clustering methods. The major difference between a traditional dissimilarity/similarity
measure and ours is that the former uses only a single viewpoint, which is the origin, while the
latter utilizes many different viewpoints: objects assumed not to be in the same cluster as the
two objects being measured. Using multiple viewpoints, a more informative assessment of
similarity can be achieved. Theoretical analysis and an empirical study are conducted to support
this claim. Two criterion functions for document clustering are proposed based on this new
measure. We compare them with several well-known clustering algorithms that use other popular
similarity measures on various document collections to verify the advantages of our proposal.
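To make the viewpoint idea concrete, here is a minimal, self-contained Java sketch (the class and method names MvsDemo/mvs and the toy vectors are ours, not from the paper): for each viewpoint dh assumed to lie outside the cluster of di and dj, similarity is assessed from the displaced vectors (di - dh) and (dj - dh), then averaged over the viewpoints.

```java
public class MvsDemo {
    // Multiviewpoint similarity of di and dj: instead of measuring from the
    // origin alone, measure the inner product of (di - dh) and (dj - dh) for
    // each viewpoint dh assumed to be outside their cluster, then average.
    static double mvs(double[] di, double[] dj, double[][] viewpoints) {
        double sum = 0;
        for (double[] dh : viewpoints) {
            double s = 0;
            for (int k = 0; k < di.length; k++)
                s += (di[k] - dh[k]) * (dj[k] - dh[k]);
            sum += s;
        }
        return sum / viewpoints.length;
    }

    public static void main(String[] args) {
        double[] di = {1, 0};
        double[] dj = {0.8, 0.6};
        // Toy viewpoints assumed to lie outside the cluster of di and dj.
        double[][] outside = {{0, 1}, {-1, 0}};
        System.out.println(mvs(di, dj, outside)); // prints 2.4
    }
}
```

The single-viewpoint case (origin only) reduces to the ordinary dot product; the averaging over external viewpoints is what makes the assessment more informative.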
INTRODUCTION
CLUSTERING is one of the most interesting and important topics in data mining. The aim of
clustering is to find intrinsic structures in data, and organize them into meaningful subgroups for
further study and analysis. Many clustering algorithms are published every year. They are
proposed for very distinct research fields, and developed using totally different techniques and
approaches. Nevertheless, according to a recent study, more than half a century after it was
introduced, the simple algorithm k-means still remains one of the top 10 data mining algorithms.
It is the most frequently used partitional clustering algorithm in practice. Another recent
scientific discussion [2] states that k-means is the favorite algorithm that practitioners in the
related fields choose to use. Needless to say, k-means has more than a few basic drawbacks, such
as sensitivity to initialization and to cluster size, and its performance can be worse than that of
other state-of-the-art algorithms in many domains. In spite of that, its simplicity,
understandability, and scalability are the reasons for its tremendous popularity. An algorithm
with adequate performance and usability in most application scenarios could be preferable to one
with better performance in some cases but limited usage due to high complexity. While offering
reasonable results, k-means is fast and easy to combine with other methods in larger systems.

A common approach to the clustering problem is to treat it as an optimization process. An
optimal partition is found by optimizing a particular function of similarity (or distance) among
the data. Basically, there is an implicit assumption that the true intrinsic structure of the data can
be correctly described by the similarity formula defined and embedded in the clustering criterion
function. Hence, the effectiveness of clustering algorithms under this approach depends on the
appropriateness of the similarity measure to the data at hand. For instance, the original k-means
has a sum-of-squared-error objective function that uses Euclidean distance. In a very sparse and
high-dimensional domain like text documents, spherical k-means, which uses cosine similarity
(CS) instead of Euclidean distance as the measure, is deemed to be more suitable. Banerjee et al.
showed that Euclidean distance was indeed one particular form of a class of distance measures
called Bregman divergences. They proposed the Bregman hard-clustering algorithm, in which
any kind of Bregman divergence can be applied. Kullback-Leibler divergence is a special case of
Bregman divergence that was said to give good clustering results on document data sets.
Kullback-Leibler divergence is a good example of a nonsymmetric measure. Also on the topic of
capturing dissimilarity in data, Pekalska et al. found that the discriminative power of some
distance measures could increase when their non-Euclidean and nonmetric attributes were
increased. They concluded that non-Euclidean and nonmetric measures can be informative for
statistical learning of data. Pelillo even argued that the symmetry and nonnegativity assumptions
of similarity measures are actually a limitation of current state-of-the-art clustering approaches.
Simultaneously, clustering still requires more robust dissimilarity or similarity measures; recent
works such as [8] illustrate this need.

The work in this paper is motivated by investigations from the above and similar research
findings. It appears to us that the nature of the similarity measure plays a very important role in
the success or failure of a clustering method. Our first objective is to derive a novel method for
measuring similarity between data objects in sparse and high-dimensional domains, particularly
text documents. From the proposed similarity measure, we then formulate new clustering
criterion functions and introduce their respective clustering algorithms, which are fast and
scalable like k-means, but are also capable of providing high-quality and consistent performance.
Existing System
A common approach to the clustering problem is to treat it as an optimization process. An
optimal partition is found by optimizing a particular function of similarity (or distance) among
the data. Basically, there is an implicit assumption that the true intrinsic structure of the data can
be correctly described by the similarity formula defined and embedded in the clustering criterion
function. Hence, the effectiveness of clustering algorithms under this approach depends on the
appropriateness of the similarity measure to the data at hand. For instance, the original k-means
has a sum-of-squared-error objective function that uses Euclidean distance. In a very sparse and
high-dimensional domain like text documents, spherical k-means, which uses cosine similarity
(CS) instead of Euclidean distance as the measure, is deemed to be more suitable.
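As a quick illustration of the measure just mentioned, here is a minimal cosine-similarity sketch in Java (the class name and toy term-frequency vectors are ours, for illustration only):

```java
public class CosineDemo {
    // Cosine similarity: dot(a, b) / (|a| * |b|). For document vectors with
    // nonnegative term frequencies, the result lies in [0, 1]; 1 means the
    // two vectors point in the same direction regardless of length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Two sparse term-frequency vectors over a tiny toy vocabulary.
        double[] d1 = {2, 1, 0, 0};
        double[] d2 = {1, 1, 0, 0};
        System.out.println(cosine(d1, d2));
    }
}
```

Because cosine similarity ignores vector length, it suits sparse, high-dimensional text data where documents of very different lengths should still compare as similar.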
Proposed System:
The work in this paper is motivated by investigations from the above and similar research
findings. It appears to us that the nature of the similarity measure plays a very important role in
the success or failure of a clustering method. Our first objective is to derive a novel method for
measuring similarity between data objects in sparse and high-dimensional domains, particularly
text documents. From the proposed similarity measure, we then formulate new clustering
criterion functions and introduce their respective clustering algorithms, which are fast and
scalable like k-means, but are also capable of providing high-quality and consistent performance.
Software Requirement Specification

Software Specification
Operating System : Windows XP
Technology : JAVA 1.6, JFreeChart

Hardware Specification
Processor : Pentium IV
RAM : 512 MB
Hard Disk : 80 GB
Modules
Select File
Process
Histogram
Clusters
Similarity
Result
TECHNOLOGIES USED
4.1 Introduction To Java:
Java has been around since 1991, developed by a small team of Sun Microsystems
developers in a project originally called the Green project. The intent of the project was to
develop a platform-independent software technology that would be used in the consumer
electronics industry. The language that the team created was originally called Oak.
The first implementation of Oak was in a PDA-type device called Star Seven (*7) that
consisted of the Oak language, an operating system called GreenOS, a user interface, and
hardware. The name *7 was derived from the telephone sequence that was used in the team's
office and that was dialed in order to answer any ringing telephone from any other phone in the
office.
Around the time the First Person project was floundering in consumer electronics, a new
craze was gaining momentum in America; the craze was called "Web surfing." The World Wide
Web, a name applied to the Internet's millions of linked HTML documents was suddenly
becoming popular for use by the masses. The reason for this was the introduction of a graphical
Web browser called Mosaic, developed by NCSA. The browser simplified Web browsing by
combining text and graphics into a single interface to eliminate the need for users to learn many
confusing UNIX and DOS commands. Navigating around the Web was much easier using
Mosaic.
It has only been since 1994 that Oak technology has been applied to the Web. In 1994,
two Sun developers created the first version of HotJava, then called WebRunner, a
graphical browser for the Web that still exists today. The browser was coded entirely in the Oak
language, by this time renamed Java. Soon after, the Java compiler was rewritten in the Java
language from its original C code, thus proving that Java could be used effectively as an
application language. Sun introduced Java in May 1995 at the SunWorld 95 convention.
Web surfing has become an enormously popular practice among millions of computer
users. Until Java, however, the content of information on the Internet has been a bland series of
HTML documents. Web users are hungry for applications that are interactive, that users can
execute no matter what hardware or software platform they are using, and that travel across
heterogeneous networks and do not spread viruses to their computers. Java can create such
applications.
The Java programming language is a high-level language that can be characterized by all
of the following buzzwords:
Simple
Architecture neutral
Object oriented
Portable
Distributed
High performance
Interpreted
Multithreaded
Robust
Dynamic
Secure
With most programming languages, you either compile or interpret a program so that you
can run it on your computer. The Java programming language is unusual in that a program is
both compiled and interpreted. With the compiler, you first translate a program into an
intermediate language called Java bytecode: the platform-independent code that is interpreted by
the interpreter on the Java platform. The interpreter parses and runs each Java bytecode
instruction on the computer. Compilation happens just once; interpretation occurs each time the
program is executed. The following figure illustrates how this works.
Every Java application contains a method with the name main, which will sound familiar to C
programmers. The main method is passed an array of strings as a parameter (similar to the
argv[] of C), and is declared as a static method. To output text from the program, we execute the
println method of System.out, which is Java's output stream. UNIX users will appreciate the
theory behind such a stream, as it is actually standard output. For those who are instead used to
the Wintel platform, it will write the string passed to it to the user's screen.
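The points above can be seen in one minimal program (the class name Hello is ours): javac compiles the file once to bytecode, and the JVM interprets that bytecode each time the program runs.

```java
// Compiled once by javac to platform-independent bytecode (Hello.class),
// then interpreted/JIT-compiled by the JVM on any platform.
public class Hello {
    // The entry point, analogous to C's main; args mirrors C's argv[].
    public static void main(String[] args) {
        // System.out is Java's standard output stream.
        System.out.println("Hello from Java");
    }
}
```

Running `javac Hello.java` followed by `java Hello` prints the greeting; the same .class file runs unchanged on any platform with a JVM.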
4.2 Swing:
Introduction To Swing:
Swing contains all the components of the modern Java GUI toolkit, and it offers
complexity appropriate to the task at hand: if something is simple, you don't have to write
much code, but as you try to do more, your code becomes increasingly complex. This means an
easy entry point, but you've got the power if you need it.
Swing has great depth. This section does not attempt to be comprehensive, but instead
introduces the power and simplicity of Swing to get you started using the library. Please be
aware that what you see here is intended to be simple. If you need to do more, Swing can
probably give you what you want if you're willing to do the research by hunting through the
online documentation from Sun.
Benefits Of Swing:
Swing components are Beans, so they can be used in any development environment that
supports Beans. Swing provides a full set of UI components. For speed, all the components are
lightweight, and Swing is written entirely in Java for portability.
Another benefit of Swing is its orthogonality of use: once you pick up the general ideas
about the library, you can apply them everywhere, primarily because of the Beans naming
conventions.
Keyboard navigation is automatic: you can use a Swing application without the mouse,
and you don't have to do any extra programming. Scrolling support is effortless: you simply
wrap your component in a JScrollPane as you add it to your form. Other features, such as tool
tips, typically require a single line of code to implement.
Swing also supports something called pluggable look and feel, which means that the
appearance of the UI can be dynamically changed to suit the expectations of users working under
different platforms and operating systems. It is even possible to invent your own look and feel.
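A minimal sketch of the two conveniences just described, wrapping a component in a JScrollPane and adding a one-line tool tip (the class and method names are ours; the enclosing frame is omitted so the snippet stays headless-friendly):

```java
import javax.swing.JScrollPane;
import javax.swing.JTextArea;

public class SwingSketch {
    // Scrolling support comes "for free" by wrapping the component
    // in a JScrollPane as it is added to the form.
    static JScrollPane makeScrollableArea() {
        JTextArea area = new JTextArea(10, 40);
        area.setToolTipText("Type here"); // tool tips: a single line of code
        return new JScrollPane(area);
    }

    public static void main(String[] args) {
        JScrollPane pane = makeScrollableArea();
        JTextArea area = (JTextArea) pane.getViewport().getView();
        System.out.println(area.getToolTipText());
    }
}
```

In a real application the returned JScrollPane would simply be added to a JFrame's content pane, exactly as the Hier class below does with its panels.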
Domain Description:
Data mining involves the use of sophisticated data analysis tools to discover previously
unknown, valid patterns and relationships in large data sets. These tools can include statistical
models, mathematical algorithms, and machine learning methods (algorithms that improve their
performance automatically through experience, such as neural networks or decision trees).
Consequently, data mining consists of more than collecting and managing data; it also includes
analysis and prediction.
Data mining can be performed on data represented in quantitative, textual, or multimedia
forms. Data mining applications can use a variety of parameters to examine the data. They
include association (patterns where one event is connected to another event, such as purchasing a
pen and purchasing paper), sequence or path analysis (patterns where one event leads to another
event, such as the birth of a child and purchasing diapers), classification (identification of new
patterns, such as coincidences between duct tape purchases and plastic sheeting purchases),
clustering (finding and visually documenting groups of previously unknown facts, such as
geographic location and brand preferences), and forecasting (discovering patterns from which
one can make reasonable predictions regarding future activities, such as the prediction that
people who join an athletic club may take exercise classes).
The medical community sometimes uses data mining to help predict the effectiveness of a
procedure or medicine.
Pharmaceutical firms use data mining of chemical compounds and genetic material to
help guide research on new treatments for diseases.
Retailers can use information collected through affinity programs (e.g., shoppers' club cards,
frequent-flyer points, contests) to assess the effectiveness of product selection and placement
decisions, coupon offers, and which products are often purchased together.
DESIGN ANALYSIS
UML Diagrams:
UML is a method for describing the system architecture in detail using the blueprint.
UML represents a collection of best engineering practices that have proven successful in the
modeling of large and complex systems.
UML is a very important part of developing object-oriented software and the software
development process.
UML uses mostly graphical notations to express the design of software projects.
Using the UML helps project teams communicate, explore potential designs, and validate
the architectural design of the software.
Definition:
UML is a general-purpose visual modeling language that is used to specify, visualize, construct,
and document the artifacts of the software system.
UML is a language:
It provides a vocabulary and rules for communication, and functions on both conceptual and
physical representations. So it is a modeling language.
UML Specifying:
Specifying means building models that are precise, unambiguous, and complete. In particular, the
UML addresses the specification of all the important analysis, design, and implementation
decisions that must be made in developing and deploying a software-intensive system.
UML Visualization:
The UML includes both graphical and textual representations. This makes it easy to visualize the
system and aids better understanding.
UML Constructing:
UML models can be directly connected to a variety of programming languages and it is
sufficiently expressive and free from any ambiguity to permit the direct execution of models.
UML Documenting:
UML provides a variety of documents in addition to raw executable code.
The use case view of a system encompasses the use cases that describe the behavior of the
system as seen by its end users, analysts, and testers.
The design view of a system encompasses the classes, interfaces, and collaborations that form the
vocabulary of the problem and its solution.
The process view of a system encompasses the threads and processes that form the system's
concurrency and synchronization mechanisms.
The implementation view of a system encompasses the components and files that are used to
assemble and release the physical system.The deployment view of a system encompasses the
nodes that form the system's hardware topology on which the system executes.
Uses of UML :
The UML is intended primarily for software-intensive systems. It has been used
effectively for such domains as
Enterprise Information Systems
Banking and Financial Services
Telecommunications
Transportation
Defense/Aerospace
Retail
Medical Electronics
Scientific Fields
Distributed Web
Building blocks of UML:
The vocabulary of the UML encompasses three kinds of building blocks:
Things
Relationships
Diagrams
Things:
Things are the data abstractions that are first-class citizens in a model. Things are of four types:
Structural Things, Behavioral Things, Grouping Things, Annotational Things
Relationships:
Relationships tie the things together. Relationships in the UML are
Dependency, Association, Generalization, Realization
UML Diagrams:
A diagram is the graphical presentation of a set of elements, most often rendered as a connected
graph of vertices (things) and arcs (relationships).
There are two types of diagrams, they are:
Structural and Behavioral Diagrams
Structural Diagrams:
The UML's four structural diagrams exist to visualize, specify, construct, and document
the static aspects of a system. We can view the static parts of a system using one of the following
diagrams. Structural diagrams consist of the Class Diagram, Object Diagram, Component
Diagram, and Deployment Diagram.
Behavioral Diagrams :
The UML's five behavioral diagrams are used to visualize, specify, construct, and
document the dynamic aspects of a system. The UML's behavioral diagrams are roughly
organized around the major ways in which the dynamics of a system can be modeled.
Behavioral diagrams consist of the
Use Case Diagram, Sequence Diagram, Collaboration Diagram, Statechart Diagram, Activity
Diagram
Use Case Diagram
An actor represents a user or another system that will interact with the system you are
modeling. A use case is an external view of the system that represents some action the user
might perform in order to complete a task.
Contents:
Use cases
Actors
System boundary
select path
process
histogram
clusters
similarity
Result
Class Diagram
Class diagrams describe the static structure of a system in terms of classes, packages, and
objects. Class diagrams describe three different perspectives when designing a system:
conceptual, specification, and implementation. These perspectives become evident as the
diagram is created and help solidify the design. Class diagrams are arguably the most used UML
diagram type. The class diagram is the main building block of any object-oriented solution. It
shows the classes in a system, the attributes and operations of each class, and the relationships
between classes. In most modeling tools a class has three parts: the name at the top, attributes in
the middle, and operations or methods at the bottom. In large systems with many classes, related
classes are grouped together to create class diagrams. Different relationships between classes are
shown by different types of arrows. Below is an image of a class diagram.
select file
+file()
Process
+process()
Histogram
+histogram()
result
Clusters
+cluster()
Similarity
+similarity()
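The boxes above could be sketched as Java skeletons (hypothetical names and empty bodies, for illustration only; each box in the diagram maps to a class with its listed operation):

```java
// Hypothetical skeletons mirroring the class diagram's boxes.
class SelectFile { void file()      { /* choose the input document  */ } }
class Process    { void process()   { /* parse and index the file   */ } }
class Histogram  { void histogram() { /* plot pairwise similarities */ } }
class Clusters   { void cluster()   { /* group similar documents    */ } }
class Similarity { double similarity() { return 0.0; /* placeholder score */ } }

public class ModuleSketch {
    public static void main(String[] args) {
        // The workflow follows the module order: select, process,
        // histogram, cluster, similarity.
        new SelectFile().file();
        new Process().process();
        new Histogram().histogram();
        new Clusters().cluster();
        System.out.println(new Similarity().similarity());
    }
}
```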
Sequence Diagram
Sequence diagrams in UML show how objects interact with each other and the order in which
those interactions occur. It is important to note that they show the interactions for a particular
scenario. The processes are represented vertically and interactions are shown as arrows. This
section explains the purpose and the basics of sequence diagrams.
/ process
/ select file
/ histogram
/ clusters
/ similarity
/ result
1 : select file()
3 : divide histograms()
4 : divide clusters()
5 : no of similarities()
6 : result()
Collaboration diagram
The communication diagram was called the collaboration diagram in UML 1. It is similar to a
sequence diagram, but the focus is on the messages passed between objects. The same
information can be represented using a sequence diagram and different objects.
/ process
/ histogram
/ select file
/ similarity
/ clusters
/ result
select file
process
histograms
clusters
similarity
result
State Machine diagram in UML, sometimes referred to as a State or Statechart diagram
select file
process
histogram
clusters
similarity
result
Component diagram
A component diagram displays the structural relationship of the components of a software
system. These are mostly used when working with complex systems that have many components.
Components communicate with each other using interfaces. The interfaces are linked using
connectors. The image below shows a component diagram.
Deployment Diagram
A deployment diagram shows the hardware of your system and the software running on that
hardware. Deployment diagrams are useful when your software solution is deployed across
multiple machines, each with a unique configuration. Below is an example deployment diagram.
SAMPLE CODE
//Bit.java
import java.io.*;
import java.lang.*;
import java.util.*;
////////////////////Bit class
class Bit
{
//bit operations
public static int Power(int tBase,int tExponent)
{
int tAns=1,t;
for(t=1;t<=tExponent;t++)
{
tAns=tAns*tBase;
}
return(tAns);
}
public static int GetBit(int tValue,int tPos)
{
int tBit=0;
tBit=tValue&Power(2,tPos);
if(tBit>0) tBit=1;
return(tBit);
}
}
//Dict.java
import java.util.*;
class Dict
{
String words[];
int nwords;
int iwords[];
//constructor
Dict()
{
int maxWords=150000;
nwords=0;
words=new String[maxWords];
iwords=new int[26];
for(int t=0;t<26;t++) iwords[t]=0;
}
//methods
public void read_dictionary()
{
try
{
//System.out.println("Reading...");
//System.out.println("GNU Collaborative International Dictionary of English (GCIDE)\n");
for(int i=0;i<26;i++)
{
String tfpath="dict\\gcide\\words_"+(char)(97+i)+".txt";
FileInputStream fin;
fin=new FileInputStream(tfpath);
int ch=0;
String tmp="";
while((ch=fin.read())!=-1)
{
if(ch==13)
{
addWord(tmp,i);
tmp="";
fin.read();
continue;
}
tmp+=(char)ch;
}
//System.out.println("gcide_"+(char)(97+i)+": "+iwords[i]+" words");
}
}
catch(Exception e)
{
//System.out.println("Error: "+e.getMessage());
}
}
public void addWord(String tword,int alphabetIndex)
{
words[nwords]=tword;
nwords++;
iwords[alphabetIndex]++;
}
public boolean isWord(String tword)
{
boolean flag=false;
tword=tword.toLowerCase();
for(int t=0;t<nwords;t++)
{
if(tword.compareTo(words[t])==0)
{
flag=true;
break;
}
}
return(flag);
}
public String toString()
{
String tstr="";
tstr="\nTotal: "+nwords+" words";
return(tstr);
}
}
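The linear scan in isWord above compares the query against every stored word, which is O(n) per lookup. As a hedged alternative sketch (ours, not part of the project code), a HashSet gives constant-time average lookup while preserving the same case-insensitive behavior:

```java
import java.util.HashSet;
import java.util.Set;

// Alternative dictionary sketch: HashSet membership instead of a linear
// scan over a String array. Names are ours, for illustration only.
public class FastDict {
    private final Set<String> words = new HashSet<>();

    // Store words lowercased so lookups are case-insensitive.
    public void addWord(String w) {
        words.add(w.toLowerCase());
    }

    // O(1) average lookup instead of scanning all nwords entries.
    public boolean isWord(String w) {
        return words.contains(w.toLowerCase());
    }

    public int size() {
        return words.size();
    }
}
```

With roughly 150,000 dictionary words (the maxWords bound used above), the difference per call is substantial when isWord is invoked for every token of every document.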
//DocumentIndexGraph.java
import java.lang.*;
import java.io.*;
public class DocumentIndexGraph
{
Itemset V;
ItemsetCollection E;
//constructor
public DocumentIndexGraph()
{
V=new Itemset();
E=new ItemsetCollection();
}
//get functions
public Itemset getV()
{
return(V);
}
public ItemsetCollection getE()
{
return(E);
}
//set functions
public void setV(Itemset tItemset)
{
V.clear();
V.appendItemset(tItemset);
}
public void setE(ItemsetCollection tItemsetCollection)
{
E.clear();
E.appendItemsetCollection(tItemsetCollection);
}
//methods
public void addNode(String tWord)
{
V.addItem(tWord);
}
public void addEdge(Itemset tEdge)
{
E.addItemset(tEdge);
}
public boolean isEdge(String str1,String str2)
{
boolean flag=false;
for(int t=0;t<=E.get_nItemsets()-1;t++)
{
String tstr1=E.getItemset(t).getItem(0);
String tstr2=E.getItemset(t).getItem(1);
if(str1.compareToIgnoreCase(tstr1)==0&&str2.compareToIgnoreCase(tstr2)==0)
{
flag=true;
break;
}
}
return(flag);
}
public boolean isPath(String str)
{
String tarr[]=StringUtils.split(str," ");
boolean flag=true;
for(int t=0;t<=tarr.length-2;t++)
{
if(isEdge(tarr[t],tarr[t+1])==false)
{
flag=false;
break;
}
}
return(flag);
}
public double findPhrasePathWeight(String str)
{
String tarr[]=StringUtils.split(str," ");
int tCount=0;
for(int t=0;t<=tarr.length-2;t++)
{
if(isEdge(tarr[t],tarr[t+1])==true)
{
tCount+=1;
}
}
double weight=(double)tCount/(double)(V.get_nItems());
return(weight);
}
}
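To see the graph operations in action without the project's Itemset classes, here is a simplified, self-contained analogue (all names are ours; an edge "a b" stands for word a followed by word b, and phraseWeight mirrors findPhrasePathWeight's count-of-edges-over-node-count formula):

```java
import java.util.HashSet;
import java.util.Set;

// Simplified analogue of DocumentIndexGraph using standard collections.
public class SimpleDig {
    private final Set<String> nodes = new HashSet<>();
    private final Set<String> edges = new HashSet<>(); // "a b" = edge a -> b

    void addEdge(String a, String b) {
        nodes.add(a);
        nodes.add(b);
        edges.add(a + " " + b);
    }

    boolean isEdge(String a, String b) {
        return edges.contains(a + " " + b);
    }

    // Count how many consecutive word pairs of the phrase are edges,
    // normalized by the number of graph nodes (as in findPhrasePathWeight).
    double phraseWeight(String phrase) {
        String[] t = phrase.split(" ");
        int count = 0;
        for (int i = 0; i + 1 < t.length; i++)
            if (isEdge(t[i], t[i + 1])) count++;
        return (double) count / nodes.size();
    }
}
```

For example, after adding the edges data->mining and mining->methods, the phrase "data mining methods" contributes two matching edges over three nodes, giving a weight of 2/3.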
//Hierarchical Clustering.java
import java.io.*;
import java.net.*;
import java.awt.*;
import java.awt.event.*;
import java.util.*;
import javax.swing.*;
import javax.swing.filechooser.*;
import org.jfree.ui.RefineryUtilities;
class Hier extends JFrame implements ActionListener
{
JFrame frmRootPath = new JFrame("Root Path : Clustering with Multi-Viewpoint based Similarity Measure");
JFrame frmButton = new JFrame("Functions : Clustering with Multi-Viewpoint based Similarity Measure");
WebDocument documents[];
WebDocument CumulativeDocument;
ItemsetCollection Clusters=new ItemsetCollection();
Hier()
{
//Root path frame
frmRootPath.setDefaultLookAndFeelDecorated(true);
frmRootPath.setResizable(false);
frmRootPath.setBounds(50,50,400,400);
frmRootPath.getContentPane().setLayout(null);
//Functions frame
frmButton.setDefaultLookAndFeelDecorated(true);
frmButton.setResizable(false);
frmButton.setBounds(50,50,201,380);
frmButton.getContentPane().setLayout(null);
//Result frame
frmResult.setDefaultLookAndFeelDecorated(true);
frmResult.setResizable(false);
frmResult.setBounds(50,50,600,580);
frmResult.getContentPane().setLayout(null);
//Root path Design
lblRootPath.setBounds(50,15,100,20);
frmRootPath.getContentPane().add(lblRootPath);
spLinks.setBounds(48,35,270,200);
frmRootPath.getContentPane().add(spLinks);
//Process button Design
btProcess.setBounds(50,65,100,20);
btProcess.addActionListener(this);
frmButton.getContentPane().add(btProcess);
//Histogram button Design
btHistogram.setBounds(50,125,100,20);
btHistogram.addActionListener(this);
frmButton.getContentPane().add(btHistogram);
//Cluster button Design
btCluster.setBounds(50,185,100,20);
btCluster.addActionListener(this);
frmButton.getContentPane().add(btCluster);
//Similarity button Design
btSimilarity.setBounds(50,245,100,20);
btSimilarity.addActionListener(this);
frmButton.getContentPane().add(btSimilarity);
//Result Design
lblResult.setBounds(17,35,100,20);
frmResult.getContentPane().add(lblResult);
spResult.setBounds(15,55,540,450);
frmResult.getContentPane().add(spResult);
txtResult.setEditable(false);
//initialize lstRootPath
FileSystemView fv=FileSystemView.getFileSystemView();
File files[]=fv.getFiles(new File("data"),true);
Vector tvector=new Vector();
for(int t=0;t<files.length;t++)
{
String tFileName=fv.getSystemDisplayName(files[t]);
tvector.add(tFileName);
}
lstRootPath.setSelectionMode(ListSelectionModel.SINGLE_SELECTION);
lstRootPath.setListData(tvector);
lstRootPath.setSelectedIndex(0);
frmRootPath.setVisible(true);
frmButton.setVisible(true);
frmResult.setVisible(true);
}
//dispatch button clicks (handlers for the other buttons, truncated in the original listing, are restored from the methods below)
public void actionPerformed(ActionEvent evt)
{
if(evt.getSource()==btProcess)
{
process();
}
if(evt.getSource()==btHistogram)
{
Histogram();
}
if(evt.getSource()==btCluster)
{
Cluster();
}
if(evt.getSource()==btSimilarity)
{
Similarity();
}
}
public void process()
{
try
{
dict.read_dictionary();
//starting-urls
String tRootPath=(String)lstRootPath.getSelectedValue();
frontier.enqueue(tRootPath);
//breadth-first-search
nVisited=0;
txtResult.setText("");
logText="";
while(nVisited<maxPages&&frontier.isEmpty()==false)
{
String tstrFrontier=frontier.toString();
String tPath=frontier.dequeue();
if(isVisitedPage(tPath)==false)
{
logText+="Frontier: "+tstrFrontier+"\n";
addVisitedPage(tPath);
logText+="Downloading ["+tPath+"]..."+"\n";
parser1.setFilePath("data\\"+tPath);
Queue q=parser1.findLinks();
logText+="ExtractedLinks: "+q.toString()+"\n";
printVisitedPages();
logText+="\n";
while(q.isEmpty()==false)
{
String tPath1=q.dequeue();
if(isVisitedPage(tPath1)==false)
frontier.enqueue(tPath1);
}
}
}
//write visitlog
FileOutputStream foutlog=new FileOutputStream(logPath);
foutlog.write(logText.getBytes());
foutlog.close();
//construct webdocuments and its DIG
nDocuments=nVisited;
documents=new WebDocument[nDocuments];
CumulativeDocument=new WebDocument();
//find metas and construct cumulative document index graph
addResultText(" Clustering with Multi-Viewpoint based Similarity Measure:\n\n");
ItemsetCollection icWords=new ItemsetCollection();
ItemsetCollection icEdges=new ItemsetCollection();
for(int t=0;t<nDocuments;t++) //for each document
{
documents[t]=new WebDocument();
documents[t].setDocName(visitedPages[t]);
parser1.setFilePath("data\\"+visitedPages[t]);
Queue q=parser1.findMetas(); //get meta-data
String tstr=q.toString();
tstr=StringUtils.replaceString(tstr,",","");
tstr=StringUtils.replaceString(tstr,"{","");
tstr=StringUtils.replaceString(tstr,"}","");
//get unique words in this document
String tarr[]=StringUtils.split(tstr," ");
Itemset tItemset=new Itemset(tarr);
ItemsetCollection ic1=new ItemsetCollection(tItemset);
tItemset=ic1.getUniqueItemset();
simalpha=0.3;
//suppress non-dictionary words
for(int t1=0;t1<tItemset.get_nItems();t1++)
{
if(dict.isWord(tItemset.getItem(t1))==false)
{
//tItemset.removeItem(t1);
}
}
icWords.addItemset(tItemset);
documents[t].DIG.setV(tItemset);
//get unique edges in this document
tstr=q.toString();
tstr=StringUtils.replaceString(tstr,"{","");
tstr=StringUtils.replaceString(tstr,"}","");
tarr=StringUtils.split(tstr,", ");
for(int j=0;j<tarr.length;j++)
{
documents[t].addPhrase(tarr[j]);
CumulativeDocument.addPhrase(tarr[j]);
String[] tarr1=StringUtils.split(tarr[j]," ");
if(tarr1.length>1)
{
for(int k=0;k<=tarr1.length-2;k++)
{
//for(int k1=k+1;k1<=tarr1.length-1;k1++) //if word-(k+1) appears before word-k
//{
Itemset i1=new Itemset(); //if word(k+1) appears next to word-k
i1.addItem(tarr1[k]);
i1.addItem(tarr1[k+1]);
icEdges.addItemset(i1);
documents[t].DIG.addEdge(i1);
//}
}
}
}
}
//set graph nodes and edges
for(int t=0;t<nDocuments;t++)
{
ItemsetCollection ic1=new ItemsetCollection();
ic1=documents[t].DIG.getE();
documents[t].DIG.setE(ic1.getUniqueItemsetCollection());
}
CumulativeDocument.DIG.setV(icWords.getUniqueItemset());
CumulativeDocument.DIG.setE(icEdges.getUniqueItemsetCollection());
double HRmin=1.0;
double HRmax=0.0;
/*for(int t=0;t<=nDocuments-2;t++)
{
for(int j=t+1;j<=nDocuments-1;j++)
{
double tsim=findSimilarity(documents[t],documents[j]);
if(HRmin>tsim) HRmin=tsim;
if(HRmax<tsim) HRmax=tsim;
}
}*/
//clustering
addResultText("\nSimilarities and its Corresponding OLP:\n");
Similarities=new ItemsetCollection();
double similarityThreshold=0.3;
sim=new double[nDocuments][nDocuments][1];
sim_perc=new double[nDocuments][nDocuments][1];
for(int t=0;t<=nDocuments-1;t++)
{
for(int j=0;j<=nDocuments-1;j++)
{
double hratio=findSimilarity(documents[t],documents[j]);
Itemset i1=new Itemset();
i1.addItem(""+t);
i1.addItem(""+j);
i1.addItem(""+hratio);
Similarities.addItemset(i1);
addResultText(" sim("+t+","+j+") : OLP --> "+hratio+"\n");
sim[t][j][0]=hratio;
sim_perc[t][j][0]=hratio*100;
if(hratio>=similarityThreshold)
{
String tstr1=""+t;
String tstr2=""+j;
int tNewClusterIndex=-1;
int tOldClusterIndex=-1;
for(int i=0;i<=Clusters.get_nItemsets()-1;i++)
{
if(Clusters.getItemset(i).isContains(tstr1)==true)
{
tNewClusterIndex=i;
}
if(Clusters.getItemset(i).isContains(tstr2)==true)
{
tOldClusterIndex=i;
}
}
if(tNewClusterIndex!=-1&&tOldClusterIndex!=-1)
{
Clusters.getItemset(tOldClusterIndex).removeItem(tstr2);
Clusters.getItemset(tNewClusterIndex).addItem(tstr2);
}
}
}
}
}
catch(IOException e)
{
System.out.println(e);
}
}
//display histogram
public void Histogram()
{
try
{
for(int i=0;i<nDocuments;i++)
{
Histogram hist = new Histogram("Document "+i+" Similiarity",sim_perc[i]);
hist.pack();
RefineryUtilities.centerFrameOnScreen(hist);
hist.setVisible(true);
}
/*txtResult.setText("");
//ItemsetCollection Hist=new ItemsetCollection();
//ItemsetCollection Similarities=new ItemsetCollection();
addResultText("\nHistogram:\n");
double tstart=0.0f;
double tinterval=0.1f;
for(int t=0;t<10;t++)
{
int tCount=0;
for(int j=0;j<=Similarities.get_nItemsets()-1;j++)
{
double tsim=Double.parseDouble(Similarities.getItemset(j).getItem(2));
if(tsim>=tstart&&tsim<=tstart+tinterval)
{
tCount++;
}
}
Hist.addItemset(new Itemset(""+tCount));
addResultText("("+tstart+","+(tstart+tinterval)+"): "+Hist.getItemset(Hist.get_nItemsets()-1)+"\n");
tstart+=tinterval;
}*/
}
catch(Exception e)
{
System.out.println(e);
}
}
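The commented-out block above bins the pairwise similarities into ten intervals of width 0.1. A compact standalone version of that binning (illustrative; it uses half-open bins, which avoids the double counting of boundary values that the `tsim<=tstart+tinterval` test in the original would cause):

```java
public class SimilarityBins {
    // Count how many similarity values fall into each of ten
    // half-open bins [0.0,0.1), [0.1,0.2), ..., [0.9,1.0].
    public static int[] bin(double[] sims) {
        int[] counts = new int[10];
        for (double s : sims) {
            int idx = (int) (s * 10);
            if (idx < 0) idx = 0;
            if (idx > 9) idx = 9; // clamp s == 1.0 into the last bin
            counts[idx]++;
        }
        return counts;
    }
}
```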
//display clusters
public void Cluster()
{
try{
txtResult.setText("");
//ItemsetCollection Clusters=new ItemsetCollection();
addResultText("\nClusters With the Obtained OLP :\n");
int nClusters=0;
for(int t=0;t<=Clusters.get_nItemsets()-1;t++)
{
if(Clusters.getItemset(t).get_nItems()!=0)
{
addResultText("Cluster"+(nClusters+1)+": "+Clusters.getItemset(t).toString()+"\n");
nClusters+=1;
}
}
}
catch(Exception e)
{
System.out.println(e);
}
}
public void Similarity()
{
try
{
txtResult.setText("");
ItemsetCollection Clusters=new ItemsetCollection();
addResultText("\nSimilarities and its Corresponding OLP:\n");
ItemsetCollection Similarities=new ItemsetCollection();
double similarityThreshold=0.3;
for(int t=0;t<=nDocuments-1;t++)
{
for(int j=0;j<=nDocuments-1;j++)
{
double hratio=findSimilarity(documents[t],documents[j]);
Itemset i1=new Itemset();
i1.addItem(""+t);
i1.addItem(""+j);
i1.addItem(""+hratio);
Similarities.addItemset(i1);
addResultText(" sim("+t+","+j+") : "+hratio+"\n"); //OLP
if(hratio>=similarityThreshold)
{
String tstr1=""+t;
String tstr2=""+j;
int tNewClusterIndex=-1;
int tOldClusterIndex=-1;
for(int i=0;i<=Clusters.get_nItemsets()-1;i++)
{
if(Clusters.getItemset(i).isContains(tstr1)==true)
{
tNewClusterIndex=i;
}
if(Clusters.getItemset(i).isContains(tstr2)==true)
{
tOldClusterIndex=i;
}
}
if(tNewClusterIndex!=-1&&tOldClusterIndex!=-1)
{
Clusters.getItemset(tOldClusterIndex).removeItem(tstr2);
Clusters.getItemset(tNewClusterIndex).addItem(tstr2);
}
}
}
}
}
catch(Exception e)
{
System.out.println(e);
}
}
//find phrase similarity (the method header below is inferred; the original listing is truncated here)
double findPhraseSimilarity(WebDocument d1,WebDocument d2)
{
WebDocument doc1=CombineDocument(d1,d2);
//find sigmaj
double sigmaj=0.0;
for(int t=0;t<d1.getPhrases().get_nItems();t++)
{
double s1j=StringUtils.split(d1.getPhrase(t)," ").length;
double tweight=doc1.DIG.findPhrasePathWeight(d1.getPhrase(t));
sigmaj+=s1j*tweight;
}
//find sigmak
double sigmak=0.0;
for(int t=0;t<d2.getPhrases().get_nItems();t++)
{
double s2k=StringUtils.split(d2.getPhrase(t)," ").length;
double tweight=doc1.DIG.findPhrasePathWeight(d2.getPhrase(t));
sigmak+=s2k*tweight;
}
double fragmentationFactor=1.2; //proposed constant
//find sigmap
double sigmap=0.0;
for(int t=0;t<doc1.getPhrases().get_nItems();t++)
{
double li=StringUtils.split(doc1.getPhrase(t)," ").length;
double si=doc1.getPhrases().get_nItems();
double gi=java.lang.Math.pow(li/si,fragmentationFactor);
double f1i=d1.findPhraseFrequency(doc1.getPhrase(t));
double w1i=doc1.DIG.findPhrasePathWeight(doc1.getPhrase(t));
double f2i=d2.findPhraseFrequency(doc1.getPhrase(t));
double w2i=doc1.DIG.findPhrasePathWeight(doc1.getPhrase(t));
double tsum=(f1i*w1i)+(f2i*w2i);
sigmap+=java.lang.Math.pow(gi*tsum,2.0);
}
//find sim_p
double simp=java.lang.Math.sqrt(sigmap);
simp/=(sigmaj+sigmak);
return(simp);
}
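Written out, the phrase score computed above is sim_p = sqrt(sum_i (g_i (f1_i w1_i + f2_i w2_i))^2) / (sigma_j + sigma_k), where g_i = (l_i / s_i)^1.2 and the f and w values are phrase frequencies and path weights. A standalone numeric version of this formula (illustrative names; the per-document sums are passed in precomputed):

```java
public class PhraseSimSketch {
    // simp = sqrt(sum_i (g[i] * (f1[i]*w1[i] + f2[i]*w2[i]))^2) / (sigmaJ + sigmaK)
    public static double phraseSim(double[] g, double[] f1, double[] w1,
                                   double[] f2, double[] w2,
                                   double sigmaJ, double sigmaK) {
        double sigmaP = 0.0;
        for (int i = 0; i < g.length; i++) {
            double t = g[i] * (f1[i] * w1[i] + f2[i] * w2[i]);
            sigmaP += t * t; // squared weighted contribution of phrase i
        }
        return Math.sqrt(sigmaP) / (sigmaJ + sigmaK);
    }
}
```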
double findTermSimilarity(WebDocument d1,WebDocument d2)
{
WebDocument doc1=CombineDocument(d1,d2);
double sigma1=0.0;
double sigma21=0.0,sigma22=0.0;
for(int t=0;t<doc1.DIG.V.get_nItems();t++)
{
double tfidf1=findTFIDF(doc1.DIG.V.getItem(t),d1);
double tfidf2=findTFIDF(doc1.DIG.V.getItem(t),d2);
sigma1+=tfidf1*tfidf2;
sigma21+=tfidf1*tfidf1;
sigma22+=tfidf2*tfidf2;
}
//cosine similarity
double simt=sigma1/java.lang.Math.sqrt(sigma21*sigma22);
return(simt);
}
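findTermSimilarity above is the standard cosine measure over TF-IDF vectors: cos(a, b) = (a . b) / (|a| |b|). A self-contained vector form of the same formula (illustrative, not the project's class):

```java
public class CosineSketch {
    // cos(a,b) = dot(a,b) / (norm(a) * norm(b))
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];  // sigma1 in the listing above
            na  += a[i] * a[i];  // sigma21
            nb  += b[i] * b[i];  // sigma22
        }
        return dot / Math.sqrt(na * nb);
    }
}
```

Orthogonal vectors score 0, and any vector scores 1 against a positive multiple of itself.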
double findTFIDF(String term,WebDocument d1)
{
//find tf
double n1=d1.findTermFrequency(term);
double tsum=0.0;
for(int t=0;t<d1.DIG.V.get_nItems();t++)
{
tsum+=d1.findTermFrequency(d1.DIG.V.getItem(t));
}
double tf=n1/tsum;
//find idf
int tDocCount=0;
for(int t=0;t<nDocuments;t++)
{
if(documents[t].DIG.V.isContains(term)==true)
{
tDocCount+=1;
}
}
double tval=(double)nDocuments/(double)tDocCount;
double idf=java.lang.Math.log(tval);
double tfidf=tf*idf;
return(tfidf);
}
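findTFIDF implements tfidf = tf * idf, with tf the term's relative frequency in the document and idf = ln(N / df), where N is the number of documents and df the number of documents containing the term. A self-contained sketch of the same formula (names are illustrative):

```java
public class TfIdfSketch {
    // tf  = termCount / docLength
    // idf = ln(nDocs / docsWithTerm)
    public static double tfidf(int termCount, int docLength,
                               int nDocs, int docsWithTerm) {
        double tf = (double) termCount / docLength;
        double idf = Math.log((double) nDocs / docsWithTerm);
        return tf * idf;
    }
}
```

A term that appears in every document gets idf = ln(1) = 0, so it contributes nothing to the similarity, which is the intended down-weighting of common terms.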
WebDocument CombineDocument(WebDocument d1,WebDocument d2)
{
//construct combined doc to find matching phrases
DocumentIndexGraph dig1=new DocumentIndexGraph();
dig1.V.appendItemset(d1.DIG.V);
dig1.V.appendItemset(d2.DIG.V);
//the listing is truncated here; the isVisited header below is inferred from its body
private boolean isVisited(String tStr)
{
boolean visited=false;
for(int t=0;t<nVisited;t++)
{
if(tStr.compareToIgnoreCase(visitedPages[t])==0)
{
visited=true;
}
}
return(visited);
}
private void printVisitedPages()
{
logText+="visited:"+"\n";
for(int t=0;t<nVisited;t++)
{
logText+="["+visitedPages[t]+"]"+"\n";
}
}
static public void main(String[] args)
{
try {
UIManager.setLookAndFeel("com.sun.java.swing.plaf.windows.WindowsLookAndFeel");
} catch (Exception e) {
e.printStackTrace();
}
new Hier();
}
}
//Histogram
import java.awt.*;
import org.jfree.chart.*;
import org.jfree.chart.axis.*;
import org.jfree.chart.plot.*;
import org.jfree.chart.renderer.category.*;
import org.jfree.data.category.*;
import org.jfree.data.general.*;
import org.jfree.ui.*;
/**
* A simple demonstration application showing how to create a bar chart.
*
*/
public class Histogram extends ApplicationFrame {
/**
* Creates a new demo instance.
*
* @param title the frame title.
*/
public Histogram(final String title,double[][] sim) {
super(title);
final CategoryDataset dataset = createDataset(sim);
final JFreeChart chart = createChart(title,dataset);
final ChartPanel chartPanel = new ChartPanel(chart);
chartPanel.setPreferredSize(new Dimension(500, 270));
setContentPane(chartPanel); //without this the frame would display empty
}
/**
* Creates a sample chart.
*
* @param dataset the dataset.
*
* @return The chart.
*/
private JFreeChart createChart(String title,final CategoryDataset dataset) {
// create the chart...
final JFreeChart chart = ChartFactory.createBarChart(
title,
// chart title
"",
// domain axis label
"Similarity",
// range axis label
dataset,
// dataset
PlotOrientation.VERTICAL, // orientation
true,
// include legend
true,
// tooltips?
false
// URLs?
);
// NOW DO SOME OPTIONAL CUSTOMISATION OF THE CHART...
// set the background color for the chart...
chart.setBackgroundPaint(Color.white);
// get a reference to the plot for further customisation...
final CategoryPlot plot = chart.getCategoryPlot();
plot.setBackgroundPaint(Color.lightGray);
plot.setDomainGridlinePaint(Color.white);
plot.setRangeGridlinePaint(Color.white);
// set the range axis to display integers only...
final NumberAxis rangeAxis = (NumberAxis) plot.getRangeAxis();
rangeAxis.setStandardTickUnits(NumberAxis.createIntegerTickUnits());
// disable bar outlines...
final BarRenderer renderer = (BarRenderer) plot.getRenderer();
renderer.setDrawBarOutline(false);
// set up gradient paints for series...
final GradientPaint gp0 = new GradientPaint(
0.0f, 0.0f, Color.blue,
0.0f, 0.0f, Color.lightGray
);
final GradientPaint gp1 = new GradientPaint(
0.0f, 0.0f, Color.green,
0.0f, 0.0f, Color.lightGray
);
final GradientPaint gp2 = new GradientPaint(
0.0f, 0.0f, Color.red,
0.0f, 0.0f, Color.lightGray
);
renderer.setSeriesPaint(0, gp0);
renderer.setSeriesPaint(1, gp1);
renderer.setSeriesPaint(2, gp2);
final CategoryAxis domainAxis = plot.getDomainAxis();
domainAxis.setCategoryLabelPositions(
CategoryLabelPositions.createUpRotationLabelPositions(Math.PI / 6.0)
);
// OPTIONAL CUSTOMISATION COMPLETED.
return chart;
}
/**
* Starting point for the demonstration application.
*
* @param args ignored.
*/
/*public static void main(final String[] args) {
final Histogram demo = new Histogram("Bar Chart Demo");
demo.pack();
RefineryUtilities.centerFrameOnScreen(demo);
demo.setVisible(true);
}*/
}
//ItemsetCollection.java
import java.lang.*;
import java.io.*;
import java.util.*;
////////////////////ItemsetCollection class
class ItemsetCollection
{
ArrayList Itemsets;
//the constructor and some methods are missing from this listing; the header below is inferred from the loop it encloses
public void appendItemsetCollection(ItemsetCollection tItemsetCollection)
{
int t;
for(t=0;t<=tItemsetCollection.get_nItemsets()-1;t++)
{
addItemset(tItemsetCollection.getItemset(t));
}
}
public void removeItemset(Itemset tItemset)
{
for(int i=0;i<=Itemsets.size()-1;i++)
{
if(getItemset(i).isEquals(tItemset)==true)
{
Itemsets.remove(i);
break;
}
}
}
public void removeItemset(int tIndex)
{
if(tIndex>=0&&tIndex<=Itemsets.size()-1)
{
removeItemset(getItemset(tIndex));
}
}
public void removeItemsetCollection(ItemsetCollection tItemsetCollection)
{
for(int t=0;t<=tItemsetCollection.get_nItemsets()-1;t++)
{
removeItemset(tItemsetCollection.getItemset(t));
}
}
public void removeEmptyItemsets()
{
for(int t=0;t<=Itemsets.size()-1;t++)
{
if(getItemset(t).get_nItems()==0)
{
removeItemset(t);
t-=1; //re-check this index: removal shifts later elements left
}
}
}
public void clear()
{
Itemsets.clear();
}
public Itemset getUniqueItemset()
{
Itemset tItemset=new Itemset();
for(int i=0;i<=Itemsets.size()-1;i++)
{
for(int j=0;j<=getItemset(i).get_nItems()-1;j++)
{
if(tItemset.isContains(getItemset(i).getItem(j))==false)
{
tItemset.addItem(getItemset(i).getItem(j));
}
}
}
return(tItemset);
}
public ItemsetCollection getUniqueItemsetCollection()
{
ItemsetCollection ic1=new ItemsetCollection();
for(int i=0;i<=Itemsets.size()-1;i++)
{
if(ic1.isContains(getItemset(i))==false)
{
ic1.addItemset(getItemset(i));
}
}
return(ic1);
}
public double getSupport(String tItem)
{
int t,tCount=0;
double tSupport;
for(t=0;t<=Itemsets.size()-1;t++)
{
if(getItemset(t).isContains(tItem)==true)
{
tCount=tCount+1;
}
}
tSupport=((double)tCount/(double)Itemsets.size())*100.0;
tSupport=Math.round(tSupport);
return(tSupport);
}
public double getSupport(Itemset tItemset)
{
int t,tCount=0;
double tSupport;
for(t=0;t<=Itemsets.size()-1;t++)
{
if(getItemset(t).isContains(tItemset)==true)
{
tCount=tCount+1;
}
}
tSupport=((double)tCount/(double)Itemsets.size())*100.0;
tSupport=Math.round(tSupport);
return(tSupport);
}
public int getSupportCount(Itemset tItemset)
{
int t,tCount=0;
for(t=0;t<=Itemsets.size()-1;t++)
{
if(getItemset(t).isContains(tItemset)==true)
{
tCount=tCount+1;
}
}
return(tCount);
}
public boolean isContains(Itemset tItemset)
{
boolean found=false;
for(int t=0;t<=Itemsets.size()-1;t++)
{
if(getItemset(t).isContains(tItemset)==true)
{
found=true;
break;
}
}
return(found);
}
public String toString()
{
String tStr="";
for(int t=0;t<=Itemsets.size()-1;t++)
{
tStr=tStr+getItemset(t).toString()+"\n\r\n\r";
if(printStatus==true)
{
System.out.print(t+" transactions, "+(tStr.length()/1024)+"k...\r");
}
}
return(tStr);
}
public String toString1()
{
String tStr="";
for(int t=0;t<=Itemsets.size()-1;t++)
{
tStr=tStr+getItemset(t).toString()+"\n";
if(printStatus==true)
{
System.out.print(t+" transactions, "+(tStr.length()/1024)+"k...\r");
}
}
return(tStr);
}
}
//WebPageRetrieval.java
import java.io.*;
import java.net.*;
class WebPageRetrieval
{
public static void openWebpage(String tstrURL) throws Exception
{
URL target=new URL(tstrURL);
URLConnection con=target.openConnection();
byte b[]=new byte[1024];
int n=0;
System.out.println("Reading: ["+tstrURL+"]:");
BufferedInputStream in=new BufferedInputStream(con.getInputStream(),8192);
while((n=in.read(b,0,1024))!=-1)
{
System.out.print(new String(b,0,n));
}
System.out.println("\nContentType: "+con.getContentType());
System.out.println("ContentLength: "+con.getContentLength());
}
public static void main(String args[]) throws Exception
{
openWebpage("http://www.yahoo.com/");
}
}
SCREEN SHOTS
TESTING
Testing is the process of executing a program with the intent of finding errors. A good test
case is one that has a high probability of finding an as-yet undiscovered error, and a successful
test is one that actually uncovers such an error. System testing is the stage of implementation
aimed at ensuring that the system works accurately and efficiently as expected before live
operation commences; it verifies that the whole set of programs works together. System testing
involves several key activities and steps (program, string, and system testing) and is important
in adopting a successful new system. It is the last chance to detect and correct errors before the
system is installed for user acceptance testing.
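A unit-level check of a similarity routine can be written as a plain-Java test harness. The sketch below uses a simple set-overlap (Jaccard) score as a hypothetical stand-in for the project's findSimilarity; all names are illustrative. The properties tested are the ones any similarity measure here should satisfy: identical inputs score 1.0, disjoint inputs 0.0, and partial overlap falls in between.

```java
public class SimilarityTest {
    // Hypothetical unit under test: Jaccard overlap of two term sets.
    static double jaccard(java.util.Set<String> a, java.util.Set<String> b) {
        java.util.Set<String> inter = new java.util.HashSet<>(a);
        inter.retainAll(b);
        java.util.Set<String> union = new java.util.HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    // Each check exercises one property of the measure.
    public static boolean runAll() {
        java.util.Set<String> x = java.util.Set.of("data", "mining");
        java.util.Set<String> y = java.util.Set.of("web", "search");
        boolean ok = jaccard(x, x) == 1.0;                   // identity
        ok &= jaccard(x, y) == 0.0;                          // disjoint sets
        ok &= jaccard(x, java.util.Set.of("data")) == 0.5;   // partial overlap
        return ok;
    }
}
```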
The software testing process commences once the program is created and the
documentation and related data structures are designed. Software testing is essential for
correcting errors; without it, the program or project cannot be considered complete. Software
testing is the critical element of software quality assurance and represents the ultimate review
of specification, design, and coding. Any engineering product can be tested in one of two ways:
6.1 Unit Testing:
6.1.1 White Box Testing:
This testing is also called glass box testing. Knowing the specific functions a product has
been designed to perform, tests can be conducted that demonstrate each function is fully
operational while also searching for errors in each function. White box testing is a test case
design method that uses the control structure of the procedural design to derive test cases.
Basis path testing is a form of white box testing.
Basis path testing:
Flow graph notation
Cyclomatic complexity
Equivalence partitioning
Comparison testing
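For basis path testing, cyclomatic complexity gives the number of linearly independent paths through a routine: V(G) = number of simple decisions + 1. The method below (an illustrative example, not project code) has two decisions, so V(G) = 3, and three inputs are enough to cover every basis path:

```java
public class BasisPathExample {
    // Two decision points => cyclomatic complexity V(G) = 2 + 1 = 3.
    public static String classify(double sim) {
        if (sim > 1.0) return "invalid";            // decision 1
        if (sim >= 0.3) return "same-cluster";      // decision 2
        return "different-cluster";                 // fall-through path
    }
}
```

One test input per basis path (one out-of-range value, one above the threshold, one below) exercises all three paths.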
Testcase number | Testcase    | Input     | Expected output | Obtained output
----------------+-------------+-----------+-----------------+-----------------
                | Select file | File name |                 |
REFERENCES
1. X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B.
Liu, P.S. Yu, Z.-H. Zhou, M. Steinbach, D.J. Hand, and D. Steinberg, Top 10 Algorithms in Data
Mining, Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37, 2007.
2. I. Guyon, U.V. Luxburg, and R.C. Williamson, Clustering: Science or Art?, Proc. NIPS
Workshop Clustering Theory, 2009.
3. I.S. Dhillon and D.S. Modha, Concept Decompositions for Large Sparse Text Data Using
Clustering, Machine Learning, vol. 42, nos. 1/2, pp. 143-175, Jan. 2001.
4. S. Zhong, Efficient Online Spherical K-means Clustering, Proc.IEEE Intl Joint Conf.
Neural Networks (IJCNN), pp. 3180-3185, 2005.
5. A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, Clustering with Bregman Divergences, J.
Machine Learning Research, vol. 6,pp. 1705-1749, Oct. 2005.
6. E. Pekalska, A. Harol, R.P.W. Duin, B. Spillmann, and H. Bunke, Non-Euclidean or
Non-Metric Measures Can Be Informative, Structural, Syntactic, and Statistical Pattern
Recognition, vol. 4109, pp. 871-880, 2006.
7. M. Pelillo, What Is a Cluster? Perspectives from Game Theory,Proc. NIPS Workshop
Clustering Theory, 2009.
8. D. Lee and J. Lee, Dynamic Dissimilarity Measure for Support Based Clustering, IEEE
Trans. Knowledge and Data Eng., vol. 22,no. 6, pp. 900-905, June 2010.
9. A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, Clustering on the Unit Hypersphere Using Von
Mises-Fisher Distributions,J. Machine Learning Research, vol. 6, pp. 1345-1382, Sept. 2005.
10. W. Xu, X. Liu, and Y. Gong, Document Clustering Based on Non-Negative Matrix
Factorization, Proc. 26th Ann. Intl ACM SIGIR Conf. Research and Development in
Information Retrieval, pp. 267-273, 2003.
11. I.S. Dhillon, S. Mallela, and D.S. Modha, Information-Theoretic Co-Clustering, Proc.
Ninth ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining (KDD), pp. 89-98,
2003.
12. C.D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval.
Cambridge Univ. Press, 2009.
13. C. Ding, X. He, H. Zha, M. Gu, and H. Simon, A Min-Max Cut Algorithm for Graph
Partitioning and Data Clustering, Proc.IEEE Intl Conf. Data Mining (ICDM), pp. 107-114,
2001.
14. H. Zha, X. He, C. Ding, H. Simon, and M. Gu, Spectral Relaxation for K-Means
Clustering, Proc. Neural Information Processing Systems (NIPS), pp. 1057-1064, 2001.
15. J. Shi and J. Malik, Normalized Cuts and Image Segmentation,IEEE Trans. Pattern
Analysis Machine Intelligence, vol. 22, no. 8,pp. 888-905, Aug. 2000.
16. I.S. Dhillon, Co-Clustering Documents and Words Using Bipartite Spectral Graph
Partitioning, Proc. Seventh ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining
(KDD), pp. 269-274, 2001.
17. Y. Gong and W. Xu, Machine Learning for Multimedia Content Analysis. Springer-Verlag,
2007.
18. Y. Zhao and G. Karypis, Empirical and Theoretical Comparisons of Selected Criterion
Functions for Document Clustering,Machine Learning, vol. 55, no. 3, pp. 311-331, June 2004.
19. G. Karypis, CLUTO a Clustering Toolkit, technical report, Dept.of Computer Science,
Univ. of Minnesota, http://glaros.dtc.umn.edu/~gkhome/views/cluto, 2003.
20. Clustering, Proc. 17th Natl Conf. Artificial Intelligence: Workshop of Artificial
Intelligence for Web Search (AAAI), pp. 58-64, July 2000.