

An Extensive Study of Data Analysis Tools (Rapid Miner, Weka, R Tool, Knime,
Orange)

Article · September 2018


DOI: 10.14445/23488387/IJCSE-V5I9P102



SSRG International Journal of Computer Science and Engineering ( SSRG – IJCSE ) – Volume 5 Issue 9 – September 2018

An Extensive Study of Data Analysis Tools

(Rapid Miner, Weka, R Tool, Knime, Orange)

Venkateswarlu Pynam1, R Roje Spanadna2, Kolli Srikanth3
Assistant Professors, Department of Information Technology, University College of Engineering Vizianagaram,
JNT University Kakinada, Andhra Pradesh 525003

Abstract

In today's world, data has been increasing in terms of the 3 Vs (volume, velocity and variety). Large and complex collections of datasets are difficult to process with traditional data processing applications, which has led to a new technology called Data Analytics. It is the science of exploring raw data to elicit useful information and hidden patterns. The main aim of data analysis is to apply advanced analytics techniques to huge and varied datasets. The size of a dataset may vary from terabytes to zettabytes, and the data can be structured or unstructured. This paper gives a comprehensive and theoretical analysis of five open source data analytics tools: RapidMiner, Weka, R tool, KNIME and Orange. The study is meant to make the choice and selection of a tool easy; the tools are evaluated on various parameters such as the amount of data handled, response time, ease of use, price tag, supported algorithms and handling.

Keywords - Data Analytics, Big Data, Data analytical tools, Visualization tools, Data mining.

I. INTRODUCTION

Data is a collection of values in the form of raw facts which are translated into forms that are easy to process. Data has been increasing exponentially in digital form over the last few decades, and data sizes have risen from gigabytes to terabytes. This explosive rate of growth continues day by day, and estimates suggest that the amount of information in the world doubles almost every month. This massive amount of data, both structured and unstructured, is called Big Data. Handling and processing such data has become difficult with conventional databases and software techniques. There are different problems with big data [1]: processing large data without solid analytical techniques is difficult and often leads to inaccurate results. Data Analytics is the science of analyzing data to convert information into useful knowledge. This knowledge can help us understand our world better and, in many contexts, enable us to make better decisions. Data analytics techniques are structured around different categories of analytics, namely descriptive, inferential, predictive and prescriptive analytics. With the increasing need for data analysis [7], tools that directly analyze the data and derive conclusions are in demand.

There are thousands of Big Data tools for data analysis at present. Data analysis is the process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision making. Data analysis tools span the areas of open source data tools, data visualization tools, sentiment tools, data extraction tools and databases. These tools generate reports that summarize the conclusions, provide better visualizations and produce accurate results with minimum effort. Many tools are available for data analytics, such as RapidMiner, Weka, KNIME, R tool, Orange, OpenRefine, Solver, Julia, etc. [5]. We have chosen five of these for comparison, namely RapidMiner, Weka, KNIME, R tool and Orange, and we will determine the most efficient tool among them on the basis of a few parameters.

II. DATA ANALYTICAL TOOLS

RapidMiner is a data science software platform originally developed by Ralf Klinkenberg, Ingo Mierswa and Simon Fischer at the Artificial Intelligence Unit of the Technical University of Dortmund. RapidMiner [9] provides a unified environment for data preparation, machine learning, deep learning, text mining, predictive analytics and business analytics. It is used for business and commercial applications as well as research, education, training, rapid prototyping and application development, and supports all steps of the machine learning process, including data preparation, results visualization, model validation and optimization [8].

RapidMiner uses a client/server model, with the server offered either on-premise or in public or private cloud infrastructures. There is no scoping mechanism in RapidMiner processes, so objects can be stored and retrieved at any nesting level. Parameter optimization schemes are also available in RapidMiner. Numerous clustering operators are

ISSN: 2348 – 8387 http://www.internationaljournalssrg.org Page 4



available in RapidMiner that generate a cluster attribute, e.g. the K-Means operator. Macros are one of the advanced topics of RapidMiner. RapidMiner automatically determines the type of each attribute of a dataset, and all attributes have a valid role; the type and role can be changed using the corresponding operators [3]. RapidMiner favors operators over scripts, because writing scripts can be time-consuming and error-prone [3]. RapidMiner keeps datasets in memory as long as possible, so if there is any memory left, RapidMiner will not dispose of previous results of the process. One report described RapidMiner's strengths as a "platform that supports an extensive breadth and depth of functionality", noting that with this it comes quite close to the business market.

III. ANALYSIS TECHNIQUES

There are various phases in the analysis process which are performed in order to get the output. These phases are carried out in sequential order to achieve the desired goal effectively. The phases of analytics [10] are:

1. Identify the problem,
2. Designing data requirements,
3. Pre-processing data,
4. Performing analytics over data and
5. Visualizing data.

Figure 1: The phases of analytics

A. Identify the problem
Nowadays analytics are performed on web datasets because of the increasing use of the internet and the growing business of organizations over the internet. This leads to a gradual increase in data size day by day. Organizations want to make predictions over the data in order to take the desired decisions, so the analytical applications must be scalable for collecting such datasets. Let us assume, for example, that there is an e-commerce website that wants to increase its business [4]. Identifying a wider variety of data sources may increase [2] the probability of finding hidden patterns and correlations. For example, to provide insight, it can be beneficial to identify as many types of related data sources as possible, especially when it is unclear exactly what to look for. Depending on the business scope of the analysis and the nature of the business problems being addressed, the required datasets and their sources can be internal and/or external to the enterprise.

B. Designing data requirements
To perform the data analytics for a distinct problem, datasets from the associated domains are needed. Based on the domain and problem specification, the data source can be determined, and based on the problem definition, the data characteristics of these datasets can be defined. For example, if we are going to perform social media analytics as the problem specification, we use Facebook or Twitter as the data source; for identifying user characteristics, we need user profile information, likes and posts as data attributes.

C. Preprocessing data
In data analytics, the data sources, data characteristics, data tools and algorithms do not all require data in the same format. This leads to performing data operations such as data cleansing, data aggregation, data augmentation, data sorting and data formatting, in order to furnish the data in a supported format to all the tools and algorithms that will be used in the analysis.

Preprocessing thus performs the data operations that translate data into a fixed format before it is furnished to the algorithms; the data analytics process is then supplied formatted data as input. In Big Data settings, the datasets additionally need to be formatted and transferred to the Hadoop Distributed File System (HDFS), to be used further by the Mappers and Reducers running on the distinct nodes of Hadoop clusters.

D. Performing analytics over data
After data is available in the appropriate format, data analytics applications can be run. Data analytics applications are executed to extract essential knowledge from data in order to take improved business decisions, in the sense of data mining. They may use either descriptive or predictive analytics for business insight.

Analytics can be achieved with various machine learning and custom algorithmic concepts, such as data


regression, data classification, data clustering, etc. For Big Data, the equivalent algorithms can be converted into MapReduce algorithms and run on Hadoop clusters by translating their data analytics logic into MapReduce jobs. These models need to be evaluated and improved over discrete stages of the machine learning process; the improved algorithms can provide better observations.

E. Visualizing data
The capability to analyze large amounts of data and find useful insights carries little value if the only ones who can interpret the results are the analysts. Data Visualization is dedicated to using data visualization techniques to clearly communicate the analysis results for effective interpretation by business users. Business users must be able to understand the results in order to obtain value from the analysis. The results of the Data Visualization phase provide users with the ability to perform visual analysis [4].

IV. DATA ANALYTICS TOOLS

A. RapidMiner
RapidMiner is available both as free and open-source software and in a commercial version, and is a popular predictive analytics platform. RapidMiner helps organizations embed predictive analytics in their work processes through its user-friendly, rich library of data science and machine learning algorithms, offered in an all-in-one programming environment, RapidMiner Studio. It likewise covers the basic data mining tasks such as data cleansing, filtering, clustering, etc. The tool is also compatible with Weka scripts. RapidMiner is used for business and commercial applications as well as research and education.

Now make sure to highlight the repository so that the folders end up in the right place, and create a folder named 'data'.

Figure 2: Importing the new data

To load our data, we can simply click on the 'Import data' button.
Step 1: After locating the file, click 'Next'.

Figure 3: Loading the data

Step 2: The data is loaded and displayed much like a spreadsheet.

Figure 4: Data loaded in spreadsheet form

Step 3: In this window we can decide whether we want to exclude any particular column by selecting the 'exclude column' entry. Further, you can change the 'name', 'role' or 'type' of an attribute. Since the default for each loaded column is 'general attribute', in this case we need to change the role of our 'churn' attribute.

Figure 5: Changing the role of a general attribute

BUILDING A DECISION TREE

To create a decision tree, we first have to import a dataset. Here, we are using a dataset about customer churn. After downloading the dataset and importing


it into the RapidMiner tool, we have to retrieve the data from our repository: click on the process directory, highlight your customer data and drag it over.

Figure 6: The process directory

Before we actually build a model, we have to inspect our data for issues and see if we need to do any further preparation. Click on the 'output' port of the operator and drag a connection onto the 'results' port of the process panel. Now click the port to establish the connection, go to the 'run process' button and run it.

Figure 7: Running the dataset

Figure 8: Decision tree for the customer churn data

B. KNIME
KNIME is a data mining tool that can be used to perform almost any kind of analysis. We explored how to visualize a dataset and retrieve its essential features. Predictive modelling used a linear regression predictor to estimate sales for each item accordingly [6]. Finally, we filter out the appropriate columns and export them to a .csv file.

1. File reader
The most common way to store reasonably small amounts of data is still a text file. Among text files, the most common format is by far CSV (Comma-Separated Values). The "comma" in the CSV name is just one of the characters available to separate data inside the file; semicolons, colons, dots, tabs and many other signs are equally suitable. A more rigid specification of the file structure of course allows for quicker reading. However, occasionally you need a more flexible definition of the file structure to get to a result, even if it requires a somewhat longer parsing time.

Figure 9: Reading data from an ASCII file or URL location

2. Partitioning
The input table is divided into two partitions (i.e. row-wise), e.g. train and test data. The two partitions are available at the two output ports.

Figure 10: Partitioning the data

3. Decision tree
After the data is partitioned into train and test data, a decision tree model is trained and applied. The Decision Tree Learner node is responsible for the training of the decision tree model. Here is a brief


description of the basic settings available in its configuration window.

Figure 11: The Decision Tree Learner node

4. Decision tree image
This node renders the decision tree view as an image; the presently supported image type is PNG. The data input is optional. It can be used to provide a column with color information, which is needed for the charts in the nodes of the decision tree.

Figure 12: Nodes of the decision tree

5. Decision tree predictor

Figure 13: Predictors of the decision tree

6. Scorer
Compares two columns by their attribute value pairs and shows the confusion matrix, i.e. how many rows of which attribute match their classification. Additionally, it is possible to highlight cells of this matrix to determine the underlying rows. The dialog allows you to select two columns for comparison; the values from the first selected column are represented in the confusion matrix's rows, and the values from the second column by the confusion matrix's columns. The output of the node is the confusion matrix with the number of matches in each cell. Additionally, the second out-port reports a number of accuracy statistics such as True Positives, False Positives, True Negatives, False Negatives, Recall, Precision, Sensitivity, Specificity, F-measure, as well as the overall accuracy and Cohen's kappa.

Figure 14: Confusion matrix of the node

7. Entropy scorer
Scorer for clustering results given a reference clustering. Connect the table containing the reference clustering to the first input port (the table should contain a column with the cluster IDs) and the table with the clustering results to the second input port (it should also contain a column with some cluster IDs). Select the respective columns in both tables from the dialog. After successful execution, the view will show entropy values (the smaller the better) and a quality value (in [0,1], with 1 being the best possible value, as used in "Fuzzy Clustering in Parallel Universes", section 6: "Experimental results").

Figure 15: Clustering results

8. Numeric scorer
This node computes certain statistics between a numeric column's reference values (ri) and predicted (pi)


values. It computes R² = 1 − SSres/SStot = 1 − Σ(pi−ri)² / Σ(ri − (1/n)·Σri)² (can be negative!), mean absolute error ((1/n)·Σ|pi−ri|), mean squared error ((1/n)·Σ(pi−ri)²), root mean squared error (sqrt((1/n)·Σ(pi−ri)²)), and mean signed difference ((1/n)·Σ(pi−ri)). The computed values can be inspected in the node's view and/or further processed using the output table.

Statistics:
This node calculates statistical moments such as minimum, maximum, mean, standard deviation, variance, median, overall sum, number of missing values and row count across all numeric columns, and counts all nominal values together with their occurrences. The dialog offers two options for choosing the median and/or nominal value calculations.

Figure 16: Statistical moment calculations

Figure 17: Decision tree for the data set

C. Weka
Initially, after starting the Weka Explorer, a window appears where we can perform various operations using the different datasets available [4]. To load the required dataset, simply click on the 'Open file' button and choose the path C:/weka-3.8/data.

Figure 18: Loading the thyroid dataset

Select the file hypothyroid.arff from the given datasets and click on the 'Open' button.

Figure 19: Selecting an .arff file from the datasets

With the selected dataset, preprocessing is performed and the respective graph is shown based on the class and data items selected, as shown below.

Figure 20: Preprocessing the data

Here we classify the dataset based on a percentage split of 65%, which yields 95.97% correctly classified instances. To show the output screen, simply click on the 'Start' button.

Figure 21: Splitting the data

To generate a decision tree, simply click on the folder 'Trees' and select the algorithm to generate a decision tree. Here we are selecting the J48 algorithm to generate the tree based on the test option "Use training set". Now the following decision tree is generated based on the classified data items.
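The Weka steps above (a percentage split followed by a decision-tree learner) can also be sketched outside the GUI. The following is a minimal illustration in Python using scikit-learn on the built-in iris dataset; note that scikit-learn's DecisionTreeClassifier implements CART rather than Weka's J48 (C4.5), so the exact tree and accuracy will differ from the figures quoted in the text:

```python
# Sketch of the walkthrough above in plain Python with scikit-learn.
# CART stands in for J48 (C4.5); the 65% split mirrors Weka's
# "Percentage split" test option described in the text.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 65% of the rows for training, the remaining 35% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.65, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

pred = tree.predict(X_test)
print(f"Correctly classified instances: {accuracy_score(y_test, pred):.2%}")
```

The fixed random seed only makes the run repeatable; Weka's percentage split likewise shuffles before splitting unless told otherwise.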
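Similarly, the statistics listed earlier for KNIME's Numeric Scorer node follow directly from their formulas. Here is a plain-Python sketch with made-up reference and predicted values (purely illustrative, not KNIME output):

```python
# The Numeric Scorer formulas computed directly.
# r holds the reference values (ri), p the predictions (pi);
# the numbers are invented for illustration only.
import math

r = [3.0, 5.0, 2.5, 7.0]   # actual values ri
p = [2.5, 5.0, 4.0, 8.0]   # predicted values pi
n = len(r)

mean_r = sum(r) / n
ss_res = sum((pi - ri) ** 2 for pi, ri in zip(p, r))
ss_tot = sum((ri - mean_r) ** 2 for ri in r)

r2   = 1 - ss_res / ss_tot                              # R-squared (can be negative!)
mae  = sum(abs(pi - ri) for pi, ri in zip(p, r)) / n    # mean absolute error
mse  = ss_res / n                                       # mean squared error
rmse = math.sqrt(mse)                                   # root mean squared error
msd  = sum(pi - ri for pi, ri in zip(p, r)) / n         # mean signed difference

print(r2, mae, mse, rmse, msd)
```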

Figure 22: Generating a decision tree

D. Orange
Orange provides data visualization and data analysis for novices and experts through interactive workflows. The File widget will read the famous iris flower dataset and send it to the workflow; any changes will propagate through the workflow, updating its widgets.

Figure 23: Reading the iris flower dataset

Our aim is to inspect different types of animals and their classification. Place a suitable visualization widget on the canvas and attach it to the File widget.

Figure 24: Splitting the data

We can visualize the pre-processed data in the form of simple graphs; the above pre-processed data can be visualized using a box plot.

Figure 25: Preprocessing the data

A decision tree is a structure that includes a root node, branches and leaf nodes. Each internal node stands for a test on an attribute, each branch stands for the outcome of a test, and each leaf node holds a class label. The uppermost node in the tree is the root node.

Figure 26: Decision tree for the dataset

V. CONCLUSION

Based on the analysis, Weka would be considered very close to KNIME because of its many built-in features that require no coding knowledge. RapidMiner would be considered appropriate for experts, particularly those in the hard sciences, because of the additional programming skills that are needed and the limited visualization support that is provided. RapidMiner has a good and simple-to-use graphical interface, so it can easily be used and run on any system; furthermore, it integrates the best algorithms of other specialized tools. R is the leading tool in visualization, although it is a bit harder to create pretty graphs with it. R promotes reproducible research: R commands provide an exact record of how an analysis was done, and commands can be altered, rerun, clarified, shared, etc. It can be concluded that although data analytics is the concept basic to all of these tools, Orange, in comparison, offers tools that seem targeted primarily at people with less need to build custom applications into their own software but who want a far easier time with user interaction; it is written in Python, its source is available, and user extensions are supported.

REFERENCES

[1] Lekha R. Nair, Sujala D. Shetty, "Research in Big Data and Analytics: An Overview", International Journal of Computer Applications, Volume 108, No. 14, 2014.
[2] Mike Barlow, Real-Time Big Data Analytics: Emerging Architecture. Sebastopol, CA: O'Reilly Media, 2013, p. 3.
[3] Sanjay Rathee, "Big Data and Hadoop with components like Flume, Pig, Hive and Jaql", presented at the International Conference on Cloud, Big Data and Trust 2013, RGPV, 2015.
[4] Swasti Singhal, Monika Jena, "A Study on WEKA Tool for Data Preprocessing, Classification and Clustering", International Journal of Innovative Technology and Exploring Engineering (IJITEE), Volume 2, Issue 6, 2013.
[5] Kalpana Rangra, K. L. Bansal, "Comparative Study of Data Mining Tools", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 6, 2014.
[6] Michael R. Berthold, Nicolas Cebron, Fabian Dill, Thomas R. Gabriel, Tobias Kötter, Thorsten Meinl, Peter Ohl, Kilian Thiel and Bernd Wiswedel, "KNIME – The Konstanz Information Miner", University of Konstanz, Nycomed Chair for Bioinformatics and Information Mining, Germany.
[7] http://bigdata-madesimple.com/top-30-big-data-tools-data-analysis/
[8] http://opensourceforu.com/2017/03/top-10-open-source-data-mining-tools/
[9] https://rapidminer.com/wp-content/uploads/2014/10/RapidMiner-5-Operator-Reference.pdf
[10] http://pingax.com/understanding-data-analytics-project-life-cycle/
