Comparative Study of Data Mining Tools
Abstract: The rapid development of information technology and the adoption of its many applications have transformed
business and various other fields. The growing use of electronics and technology in business has also driven major
improvements in data mining, since it is an important part of data accessibility. Data mining and its applications can be
viewed as one of the emerging and promising technological developments that provide efficient means of accessing the
various types of data and information available worldwide. These applications also aid decision making. A better
understanding of them helps in making a choice among the available applications and tools. This paper gives a
comprehensive theoretical analysis of six open source data mining tools. The study describes the technical
specification, features, and specialization of each selected tool along with its applications. The study is intended to
make the choice and selection of tools easier.
Keywords: Data, Data Mining, Data Mining Tools, Open Source Tools, Technical Specification.
I. Introduction
The amount of information and data stored in electronic format has increased dramatically over the last few decades.
Databases have grown continuously and now reach terabytes in size. This explosive growth continues day by day, and
estimates suggest that the amount of information in the world doubles every 20 months. The most important question
concerning data is therefore its retrieval, which finds its most suitable answer in data mining. Data mining is the
process of extracting predictive information from large masses of data. It can also be described as a process of
analyzing data from different perspectives and summarizing it into useful information.
Data mining has a long history deeply rooted in machine learning, artificial intelligence, databases, and statistics, and
the term itself was coined early on. Data mining is strongly associated with data science, which involves the manipulation
and classification of data by applying statistical and mathematical concepts. Data mining is an important phase in
knowledge discovery and applies discovery and analytical methods to data in order to produce specific models of the data.
Data are available everywhere and can be used to make predictions, usually by means of statistical approaches. Data mining
is an extension of traditional data analysis and statistical approaches in that it incorporates analytical techniques drawn
from a range of disciplines. Due to the widespread availability of huge, complex, information-rich data sets, the ability
to extract useful knowledge hidden in these data and to act on that knowledge has become increasingly important in today’s
competitive world. Thus data mining is the analysis of large observational data sets to find unsuspected relationships and
to summarize the data in novel ways that are both understandable and useful to the data owner [1].
Briefly, data mining is an approach to research and analysis. [2] It is exploration and analysis of large quantities of data
in order to discover meaningful patterns and rules. [3]
Sometimes data arrive in different formats because they come from different sources, and they may contain irrelevant
attributes and missing values. Data therefore need to be prepared before any kind of data mining is applied. Data mining
is also known under many other names, including knowledge extraction, information discovery, information harvesting, data
archeology, and data pattern processing [4]. Many researchers and practitioners use data mining as a synonym for knowledge
discovery, but data mining is also just one step of the knowledge discovery process. All the techniques follow an
automated knowledge discovery (KDD) process, i.e., data cleaning, data integration, data selection, data transformation,
data mining, and knowledge representation [5].
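The KDD steps listed above can be pictured as a chain of small functions. The following is only a minimal sketch in Python, assuming toy records held as dictionaries. The function names, the field names (`age`, `income`, `city`), the income cut-off, and the trivial "mining" step (a frequency count) are all hypothetical and stand in for whatever a real tool would do at each stage.

```python
from collections import Counter

# Hypothetical toy records from two sources; None marks a missing value.
SOURCE_A = [
    {"id": 1, "age": 25, "income": 30000, "city": "Pune"},
    {"id": 2, "age": None, "income": 52000, "city": "Delhi"},
]
SOURCE_B = [
    {"id": 3, "age": 41, "income": 75000, "city": "Pune"},
    {"id": 4, "age": 37, "income": 48000, "city": "Delhi"},
]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: merge records from several sources into one list."""
    return [record for source in sources for record in source]


def select(records, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: discretize income into two coarse bands."""
    return [
        {"city": r["city"], "band": "high" if r["income"] >= 50000 else "low"}
        for r in records
    ]


def mine(records):
    """Data mining (placeholder): count how often each (city, band) pair occurs."""
    return Counter((r["city"], r["band"]) for r in records)


def represent(patterns):
    """Knowledge representation: render the patterns as readable strings."""
    return [f"{city}: {band} income x{n}" for (city, band), n in sorted(patterns.items())]


def kdd_pipeline():
    """Run the six steps in the order given in the text."""
    cleaned = [clean(source) for source in (SOURCE_A, SOURCE_B)]
    merged = integrate(*cleaned)
    selected = select(merged, ["city", "income"])
    transformed = transform(selected)
    patterns = mine(transformed)
    return represent(patterns)


summary = kdd_pipeline()
```

In this sketch the record with the missing age is discarded during cleaning, so only three records reach the mining step. A real tool would typically offer imputation as an alternative to dropping records.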
A. Weka
Weka (Waikato Environment for Knowledge Analysis) is a collection of machine learning algorithms for data mining
tasks. These algorithms can either be applied directly to a data set or be called from your own Java code. The Weka
(pronounced Weh-Kuh) workbench contains a collection of visualization tools and algorithms for data analytics and
predictive modeling, together with graphical user interfaces for easy access to this functionality.
1)Technical Specification:
First released in 1997.
Latest version available is WEKA 3.6.11.
Licensed under the GNU General Public License.
Platform-independent software.
Written in Java.
Can be downloaded from www.cs.waikato.ac.
2)General Features
Weka is a Java-based open source data mining tool that collects many data mining and machine learning
algorithms, including data pre-processing, classification, clustering, and association rule extraction.
Weka provides three graphical user interfaces: the Explorer, for exploratory data analysis supporting
preprocessing, attribute selection, learning, and visualization; the Experimenter, which provides an experimental
environment for testing and evaluating machine learning algorithms; and the Knowledge Flow for new process
Advantages
It is also suitable for developing new machine learning schemes [8].
Weka loads data files in ARFF, CSV, C4.5, and binary formats. It is open source, free, extensible, and can be
integrated into other Java packages.
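ARFF, the first of the formats named above, is a plain-text format: a header declares the relation and its attributes, and a data section lists one comma-separated record per line. The sketch below, in Python, shows how such a file might be read. It is a simplified illustration and not Weka's own loader: it assumes unquoted values and ignores sparse rows and date attributes, and the sample data and the function name `parse_arff` are hypothetical.

```python
def parse_arff(text):
    """Parse a simple ARFF document into (relation, attributes, rows).

    Simplified sketch: no quoting, no sparse rows, no date attributes.
    """
    relation = None
    attributes = []   # list of (name, type) pairs in declaration order
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blank lines and comments
            continue
        lower = line.lower()                       # keywords are case-insensitive
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":                   # '?' marks a missing value
                    row[name] = None
                elif kind in ("numeric", "real", "integer"):
                    row[name] = float(value)
                else:                              # nominal or string attribute
                    row[name] = value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            kind = kind.strip()
            if not kind.startswith("{"):           # keep nominal value lists verbatim
                kind = kind.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Hypothetical sample file contents, for illustration only.
SAMPLE = """\
% Illustrative data set
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,?,yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
```

Here `rows` holds one dictionary per record, with the numeric attribute converted to a float and the missing temperature in the second record represented as `None`.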
Limitations
It lacks proper and adequate documentation and suffers from “Kitchen Sink Syndrome”, where systems are
updated constantly.
Poor connectivity to Excel spreadsheets and non-Java based databases.
CSV reader is not as robust as in RapidMiner.
Not as polished.
Weka is much weaker in classical statistics.
Does not have a facility to save scaling parameters so that they can be applied to future datasets.
Does not have an automatic facility for parameter optimization of machine learning/statistical methods.
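To make the limitation about scaling parameters concrete, the sketch below shows in Python what such a facility amounts to: the minimum and maximum of each column are learned once from the training data, stored, and later reused to scale unseen data in exactly the same way. This is an illustration of the idea only and does not describe any Weka function; the helper names `fit_min_max` and `apply_min_max` and the sample values are hypothetical.

```python
import json


def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return {"min": [min(c) for c in columns], "max": [max(c) for c in columns]}


def apply_min_max(rows, params):
    """Scale rows using previously learned parameters (training range maps to [0, 1])."""
    scaled = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, params["min"], params["max"]):
            span = hi - lo
            # A constant column has zero span; map it to 0.0 to avoid division by zero.
            new_row.append((value - lo) / span if span else 0.0)
        scaled.append(new_row)
    return scaled


train = [[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]]
params = fit_min_max(train)

# Serialize the parameters so they can be stored alongside a trained model.
saved = json.dumps(params)

# Later: restore the parameters and scale unseen data consistently.
restored = json.loads(saved)
future = [[15.0, 500.0], [40.0, 200.0]]
scaled_future = apply_min_max(future, restored)
```

Note that a future value outside the training range (40.0 in the first column) scales to a value above 1, which is the expected behaviour when the stored parameters are reused rather than refitted.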
B. KEEL
KEEL (Knowledge Extraction based on Evolutionary Learning) is an application package of machine learning software tools.
KEEL is designed to provide solutions to data mining problems and to assess evolutionary algorithms. It has a
collection of libraries for preprocessing and post-processing techniques for data manipulation, soft-computing methods for
knowledge extraction and learning, and scientific and research methods.
1)Technical Overview
First released in 2004.
Latest version available is KEEL 2.0.
Licensed under the GNU General Public License.
Can run on any platform.
Written in Java.
Can be downloaded from www.sci2s.ugr.es/keel.
2)Specialization
KEEL is a software tool for assessing evolutionary algorithms for data mining problems.
Machine learning tool.
Advantages
It includes regression, classification, clustering, pattern mining, and so on.
It contains a large collection of classical knowledge extraction algorithms, preprocessing techniques (instance
selection, feature selection, discretization, imputation methods for missing values, etc.), and Computational
Intelligence based learning algorithms, including evolutionary rule learning algorithms based on different
approaches (Pittsburgh, Michigan and IRL) and hybrid models such as genetic fuzzy systems, evolutionary
neural networks, etc. [9]
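As a rough illustration of what an evolutionary learning algorithm does, the Python sketch below evolves a single threshold rule of the form "predict positive if x >= t" with a very small genetic algorithm. It shows only the generic loop of selection, crossover and mutation. It is not one of KEEL's algorithms and does not implement the Pittsburgh, Michigan or IRL approaches; the data set, the parameter values and the function names are hypothetical.

```python
import random

# Hypothetical one-feature data set: (value, label) pairs.
DATA = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]


def accuracy(threshold):
    """Fitness: fraction of records classified correctly by 'x >= threshold'."""
    hits = sum((x >= threshold) == bool(y) for x, y in DATA)
    return hits / len(DATA)


def evolve(generations=30, pop_size=20, seed=0):
    """Evolve a threshold; returns the fittest individual of the final population."""
    rng = random.Random(seed)                      # fixed seed keeps the run repeatable
    population = [rng.uniform(0.0, 10.0) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half unchanged (elitist truncation selection).
        population.sort(key=accuracy, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2.0                  # crossover: mean of two parents
            child += rng.gauss(0.0, 0.5)           # mutation: small Gaussian perturbation
            children.append(min(10.0, max(0.0, child)))  # clamp to the search range
        population = parents + children
    return max(population, key=accuracy)


best = evolve()
best_fitness = accuracy(best)
```

Because the fitter half of each generation is carried over unchanged, the best fitness in the population can never decrease from one generation to the next.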
Limitation:
Efficiency is restricted by the number of algorithms it supports, as compared to other tools.
C. R
R is a free software programming language and software environment for statistical computing and graphics.
The R language is widely used among statisticians and data miners for developing statistical software and for data analysis.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including
mathematical symbols and formulae where needed.
1)Technical Specification
First released in 1997.
Latest version available is 3.1.0.
Licensed under the GNU General Public License.
Cross-platform.
Written in C, Fortran and R.
Can be downloaded from www.r-project.org.
2)General Features
The R project is a platform for the analysis, graphics and software development activities of data miners and
practitioners in related areas.
D. KNIME
KNIME (Konstanz Information Miner) is an open source data analytics, reporting and integration platform. It has been used
in pharmaceutical research, but is also used in other areas such as CRM customer data analysis, business intelligence and
financial data analysis. It is based on the Eclipse platform and, through its modular API, is easily extensible. Custom
nodes and types can be implemented in KNIME within hours, thus extending KNIME to comprehend and provide first-tier
support for highly domain-specific data formats.
1)Technical Specification
Released in 2004.
Latest version available is KNIME 2.9.
Licensed under the GNU General Public License.
Compatible with Linux, OS X, Windows.
Written in Java.
Can be downloaded from www.knime.org.
2)General Features
KNIME, pronounced “naim”, is a nicely designed data mining tool that runs inside IBM’s Eclipse
development environment.
It is a modular data exploration platform that enables the user to visually create data flows (often referred to as
pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive
views on data and models.
The KNIME base version already incorporates over 100 processing nodes for data I/O, preprocessing and
cleansing, modeling, analysis and data mining as well as various interactive views, such as scatter plots, parallel
coordinates and others.
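The idea of a data flow whose steps can be executed selectively and inspected afterwards can be sketched in a few lines of Python. The sketch is purely conceptual: KNIME workflows are built visually, and the `Node` and `Pipeline` classes below are hypothetical and do not correspond to the KNIME API.

```python
class Node:
    """One processing step in a linear data flow."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.output = None       # cached result, kept for later inspection
        self.executed = False


class Pipeline:
    """A linear chain of nodes that can be run fully or only up to a chosen node."""

    def __init__(self, nodes):
        self.nodes = nodes

    def execute(self, data, upto=None):
        """Run nodes in order, stopping after the node named `upto` (None runs all).

        Nodes that were already executed are not run again; their cached
        output is passed on instead.
        """
        for node in self.nodes:
            if not node.executed:
                node.output = node.func(data)
                node.executed = True
            data = node.output
            if node.name == upto:
                break
        return data


# Hypothetical three-node flow: read values, drop missing ones, compute the mean.
pipeline = Pipeline([
    Node("read", lambda _: [3, None, 8, 5, None, 10]),
    Node("clean", lambda rows: [r for r in rows if r is not None]),
    Node("mean", lambda rows: sum(rows) / len(rows)),
])

partial = pipeline.execute(None, upto="clean")   # run only the first two nodes
mean_ran_early = pipeline.nodes[2].executed      # False: the last node is still pending
result = pipeline.execute(None)                  # finish the flow, reusing cached outputs
```

After the first call the intermediate result of the "clean" node can be examined before the remaining step is run, which mirrors the selective execution described above.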
3)Specialization
Integration of the Chemistry Development Kit with additional nodes for the processing of chemical structures,
compounds, etc.
Specialized for Enterprise reporting, Business Intelligence, data mining.
Advantages
It integrates all analysis modules of the well-known Weka data mining environment, and additional plugins
allow R scripts to be run, offering access to a vast library of statistical routines [8].
It is easy to try out because it requires no installation besides downloading and unarchiving.
The one aspect of KNIME that truly sets it apart from other data mining packages is its ability to interface with
programs that allow for the visualization and analysis of molecular data.
Limitations:
Has only limited error measurement methods.
Has no wrapper methods for descriptor selection.
Does not have automatic facility for Parameter optimization of machine learning/statistical methods.
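Automatic parameter optimization, in its simplest form, is an exhaustive grid search: every candidate parameter value is tried and the one with the best score on held-out data is kept. The Python sketch below illustrates this for the number of neighbours k of a small nearest-neighbour classifier. It is not taken from any of the tools discussed; the data, the candidate values and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical 1-D data: (feature value, class label).
TRAIN = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.2, "b"),
         (6.0, "b"), (6.5, "b"), (7.0, "b"), (7.5, "b")]
VALID = [(1.2, "a"), (1.8, "a"), (6.2, "b"), (7.2, "b")]


def knn_predict(x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(TRAIN, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def validation_accuracy(k):
    """Score one parameter value on the held-out validation set."""
    hits = sum(knn_predict(x, k) == y for x, y in VALID)
    return hits / len(VALID)


def grid_search(candidates):
    """Try every candidate value and return (best_value, best_score)."""
    scores = {k: validation_accuracy(k) for k in candidates}
    best_k = max(scores, key=scores.get)   # ties resolve to the first candidate tried
    return best_k, scores[best_k]


best_k, best_score = grid_search([1, 3, 5, 7])
```

With this toy data a large neighbourhood (k = 7) lets the majority class outvote the smaller one, so the search settles on a small value of k. Random search or cross-validation could be substituted for the exhaustive loop and the single validation split without changing the overall structure.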
F. ORANGE
Orange is a component-based data mining and machine learning software suite, featuring a visual programming front-
end for explorative data analysis and visualization, and Python bindings and libraries for scripting. It includes a set of
components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration
techniques. It is implemented in C++ and Python. Its graphical user interface builds upon the cross-platform framework
1)Technical Requirements:
Developed in 2009.
Latest version available is Orange 2.7.
Licensed under the GNU General Public License.
Written in Python, C++, C.
Can be downloaded from www.orange.biolab.si
2)General Features
Orange is a component-based data mining and machine learning software suite.
It includes a set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation,
and exploration techniques.
Data mining in Orange is done through visual programming or Python scripting.
3)Specialization
Open source data visualization and analysis for novices and experts.
Table 1: Technical overview of the best six open source data mining tools
S.N  Tool Name  Release Date  Latest Release Date/Version  License                     Operating System      Language        Website
2    ORANGE     2009          6 May 2013/2.7               GNU General Public License  Cross Platform        Python, C++, C  www.orange.biolab.si
3    KNIME      2004          6 December 2013/2.9          GNU General Public License  Linux, OS X, Windows  Java            www.knime.org
The table above gives a technical overview of the tools: the name of each tool, its release date, the release date of its
latest version, license, operating system, language and official website.
S.N  Tool    Advantages                                                                            Limitations
2    ORANGE  Better debugger, shortest scripts, poor statistics, suitable for novices and experts  Big installation, limited reporting capabilities
3    KNIME   Molecular analysis, mass spectrometry, Chemistry Development Kit                      Limited error measurements, no wrapper methods for descriptor selection, poor parameter optimization
6    R       Purely statistical                                                                    Less specialized for data mining, requires knowledge of array language
The table above enumerates the advantages and limitations of each tool separately.
References:
[1] Hand, D., Mannila, H., Smyth, P.: “Principles of Data Mining”, Prentice Hall India, pp. 1, 2004.
[2] Sethi, I. K.: “Layered Neural Net Design Through Decision Trees”, IEEE International Symposium on Circuits and
Systems, 1990.
[3] Mehta, M., Agrawal, R., Rissanen, J.: “SLIQ: A Fast Scalable Classifier for Data Mining”, in Proc.
International Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.
[4] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.): “Advances in Knowledge Discovery
and Data Mining”, AAAI Press, Cambridge, 1996.
[5] Wisaeng, K.: “An Empirical Comparison of Data Mining Techniques in Medical Databases”, International
Journal of Computer Applications (0975-8887), Volume 77, No. 7, September 2013.
[6] Mulik, S.R., Gulawani, S.G.: “Performance Comparison of Data Mining Tools in Mining
Association Rules”, International Journal of Research in IT, Management and Engineering
(IJRIME), Volume 1, Issue 3, ISSN: 2249-1619.
[7] Mikut, R., Reischl, M.: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery,
Volume 1, Issue 5, pp. 431-443, September/October 2011.
[8] Witten, I.H., Frank, E.: “Data Mining: Practical Machine Learning Tools and Techniques”, 2nd edition, Morgan
Kaufmann, San Francisco, 2005.
[9] Alcala-Fdez, J., Sanchez, L., Garcia, S., del Jesus, M.J., Ventura, S., Garrell, J.M., Otero, J., Romero, C.,
Bacardit, J., Rivas, V.M., Fernandez, J.C., Herrera, F.: “KEEL: A Software Tool to Assess Evolutionary Algorithms
for Data Mining Problems”, Soft Computing 13(3), pp. 307-318, 2009.
[10] Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: “YALE: Rapid Prototyping for Complex Data
Mining Tasks”, in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (KDD-06), pp. 935-940, 2006.
[11] http://orange.biolab.si/features/
[12] https://github.com/Dans-labs/recommender-systems/blob/.../datamining.r
[13] http://www.r-project.org/
[14] http://www.knime.org/
[15] http://rapidminer.com/