Exp 6
AIM:-
Perform data Pre-processing task and demonstrate Classification,
Clustering, Association algorithm on data sets using data mining tool
(WEKA/R tool).
THEORY:-
WEKA is open-source software that provides tools for data pre-processing,
implementations of several machine learning algorithms, and visualization
tools, so that you can develop machine learning techniques and apply them to
real-world data mining problems. What WEKA offers is summarized in the
following diagram:
Following the flow of the diagram from the start, you can see that Big Data
passes through several stages before it is suitable for machine learning:
First, you will start with the raw data collected from the field. This data may
contain several null values and irrelevant fields. You use the data pre-
processing tools provided in WEKA to cleanse the data.
Then, you would save the pre-processed data in your local storage for applying
ML algorithms.
Next, depending on the kind of ML model that you are trying to develop you
would select one of the options such as Classify, Cluster, or Associate.
Attribute selection allows the automatic selection of features to create a
reduced dataset.
Note that under each category, WEKA provides the implementation of several
algorithms. You would select an algorithm of your choice, set the desired
parameters and run it on the dataset.
Then, WEKA gives you the statistical output of the model processing. It also
provides a visualization tool to inspect the data.
Several models can be applied to the same dataset. You can then compare the
outputs of the different models and select the one that best meets your purpose.
Thus, the use of WEKA results in a quicker development of machine learning
models on the whole.
Pre-processing using WEKA:
The data collected from the field contains many unwanted elements that lead
to wrong analysis. For example, the data may contain null fields, or it may
contain columns that are irrelevant to the current analysis. Thus, the data
must be pre-processed to meet the requirements of the type of analysis you
are seeking. This is done in the pre-processing module.
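The two clean-up steps described above (dropping irrelevant columns and
filling null fields) can be sketched in plain Python. This is only an
illustration of the idea, not WEKA's implementation: the function name
preprocess, the sample records, and the choice of filling with the column mean
are all assumptions made for this example.

```python
from statistics import mean

def preprocess(rows, drop_columns=()):
    """Drop unwanted columns and fill missing numeric values with the column mean.

    rows: list of dicts with the same keys; None marks a missing value.
    """
    # Remove columns that are irrelevant to the analysis.
    kept = [{k: v for k, v in row.items() if k not in drop_columns} for row in rows]

    # Compute the mean of each numeric column from the values that are present.
    columns = list(kept[0].keys()) if kept else []
    means = {}
    for col in columns:
        present = [r[col] for r in kept if isinstance(r[col], (int, float))]
        if present:
            means[col] = mean(present)

    # Replace each missing value with its column mean.
    for row in kept:
        for col, m in means.items():
            if row[col] is None:
                row[col] = m
    return kept

# Hypothetical records: "id" is irrelevant and one "length" value is missing.
raw = [
    {"id": 1, "length": 0.50, "rings": 10},
    {"id": 2, "length": None, "rings": 8},
    {"id": 3, "length": 0.25, "rings": 6},
]
clean = preprocess(raw, drop_columns=("id",))
print(clean)
```

Mean imputation is only one possible policy; removing incomplete rows or
filling with the most frequent value are equally valid depending on the data.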
To demonstrate the available features in pre-processing, we will use
the Abalone database that is provided in the installation.
Using the Open file ... option under the Preprocess tab, select
the abalone.arff file.
Using Filters:
Some machine learning techniques, such as association rule mining,
require categorical data.
weka→filters→supervised→attribute→Discretize
weka→filters→unsupervised→attribute→ReplaceMissingValues
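To show what discretization does to a numeric attribute, here is a minimal
sketch of equal-width binning in plain Python. Note that this is a simpler
technique than the supervised Discretize filter listed above, which chooses
cut points using the class labels; the function name equal_width_bins, the bin
labels, and the sample values are assumptions made for this example.

```python
def equal_width_bins(values, n_bins=3):
    """Map each numeric value to a bin label 'bin1' ... 'binN'."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so a single bin is enough.
        return ["bin1"] * len(values)
    labels = []
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"bin{index + 1}")
    return labels

# Hypothetical numeric attribute values.
lengths = [0.0, 1.0, 2.0, 4.5, 6.0, 9.0]
binned = equal_width_bins(lengths, n_bins=3)
print(binned)
```

After this step every value is a category label, which is the form that
association rule mining expects.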
Selecting Classifier
Click on the Choose button and select the following classifier −
weka→classifiers→bayes→NaiveBayes
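For readers who want to see what a Naive Bayes classifier computes, the
following is a minimal sketch for categorical attributes in plain Python. It
is not WEKA's NaiveBayes class: the function names, the add-one (Laplace)
smoothing, and the tiny weather-style dataset are assumptions made for this
illustration.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)][value] -> count
    value_counts = defaultdict(Counter)
    # Distinct values seen for each attribute, needed for smoothing.
    domains = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains

def predict(model, row):
    """Return the class with the highest posterior score for one row."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Work in log space: log P(class) + sum of log P(value | class).
        score = math.log(count / total)
        for i, value in enumerate(row):
            seen = value_counts[(i, label)][value]
            # Add-one smoothing avoids zero probabilities for unseen values.
            score += math.log((seen + 1) / (count + len(domains[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot"), ("overcast", "cool")]
labels = ["no", "no", "yes", "yes", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("sunny", "hot")))
print(predict(model, ("rainy", "cool")))
```

The classifier treats the attributes as independent given the class, which is
the "naive" assumption the name refers to.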
Association Rule mining using WEKA:
It has been observed that people who buy beer often buy diapers at the same
time. That is, there is an association between buying beer and buying diapers.
Although this may not seem convincing, the association rule was mined from
large supermarket databases. Similarly, an association may be found between
peanut butter and bread.
Finding such associations is valuable for supermarkets, as they can stock
diapers next to beer so that customers can locate both items easily, resulting
in increased sales for the supermarket.
The Apriori algorithm is one algorithm that finds probable associations and
creates association rules. WEKA provides an implementation of the Apriori
algorithm. You can define the minimum support and an acceptable confidence
level while computing these rules.
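The roles of minimum support and confidence can be made concrete with a small
plain-Python sketch of Apriori. This is an illustration under stated
assumptions, not WEKA's implementation: the function names apriori and rules,
the four sample baskets, and the threshold values are invented for this
example. Support is the fraction of transactions containing an itemset, and
the confidence of a rule A => B is support(A and B) / support(A).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1 candidates: every single item.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        level = {}
        for c in current:
            s = support(c)
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for each qualifying rule."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    out.append((set(antecedent), set(itemset - antecedent), confidence))
    return out

# Hypothetical market baskets.
baskets = [
    {"beer", "diapers"},
    {"beer", "diapers", "bread"},
    {"bread", "butter"},
    {"beer", "bread"},
]
frequent = apriori(baskets, min_support=0.5)
found = rules(frequent, min_confidence=0.8)
print(found)
```

Raising the minimum support reduces the number of itemsets that survive each
level, while raising the confidence threshold keeps only the stronger rules.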
Data Visualization
Data visualization is the representation of data through graphs and plots
with the aim of understanding the data clearly.
There are many ways to represent data. Some of them are as follows:
1) Pixel Oriented Visualization: The color of each pixel represents the
corresponding value of a dimension.
2) Geometric Representation: The multidimensional datasets are represented
in 2D, 3D, and 4D scatter plots.
3) Icon Based Visualization: The data is represented using Chernoff faces
and stick figures. Chernoff faces use the human mind’s ability to recognize
facial characteristics and the differences between them. The stick figure
technique uses five-piece stick figures (four limbs and a body) to represent
multidimensional data.
4) Hierarchical Data Visualization: The datasets are represented
using treemaps. A treemap represents hierarchical data as a set of nested
rectangles.