SECTION 2 DATA MINING LAB
Structure
2.0 Introduction
2.1 Objectives
2.2 Introduction to WEKA
2.3 Latest Version and Downloads
2.4 Data Sets
2.5 Installation of WEKA
2.6 Features of WEKA Explorer
2.7 Data Preprocessing
2.8 Association Rule Mining
2.9 Classification
2.10 Clustering
2.11 Practical Sessions
2.12 Summary
2.13 Further Readings
2.14 Website References
2.15 Online Lab Resources
2.0 INTRODUCTION
This is a lab course, in which you will gain hands-on experience. You are
expected to have studied the corresponding course material (MCS-221 Data
Warehousing and Data Mining). Along with the examples discussed in this
section, a separate list of lab sessions to be performed, session-wise, is given
towards the end. Please go through the general guidelines and the program
documentation guidelines carefully.
2.1 OBJECTIVES
After going through this practical course, you will be able to:
understand how to handle data mining tasks using a data mining
toolkit (such as the open-source WEKA);
understand the various options available in WEKA;
understand the data sets and data preprocessing;
demonstrate classification, clustering, etc. on large data sets;
demonstrate the working of algorithms for data mining tasks such
as association rule mining, classification, clustering and regression;
exercise the data mining techniques with varied input values for
different parameters;
gain hands-on experience of working with real data sets;
add mining algorithms as components to existing tools;
apply mining techniques to realistic data.
2.2 INTRODUCTION TO WEKA
WEKA makes machine learning algorithms readily available to its users. ML
specialists can use these methods to extract useful information from high
volumes of data, and can create an environment in which to develop new
machine learning methods and implement them on real data.

2.3 LATEST VERSION AND DOWNLOADS

There are two versions of WEKA: WEKA 3.8 is the latest stable version and
WEKA 3.9 is the development version. The stable version receives only bug
fixes and feature upgrades that do not break compatibility with its earlier
releases, while the development version may receive new features that break
compatibility with its earlier releases.
WEKA 3.8 and 3.9 feature a package management system that makes it easy
for the WEKA community to add new functionality to WEKA. The package
management system requires an internet connection in order to download and
install packages.
Stable version
WEKA 3.8 is the latest stable version of WEKA. This branch of WEKA only
receives bug fixes and upgrades that do not break compatibility with earlier 3.8
releases, although major new features may become available in packages.
There are different options for downloading and installing it on your system:
Windows
Use https://sourceforge.net/projects/weka/files/weka-3-8/3.8.5/weka-3-8-5-
azul-zulu-windows.exe/download?use_mirror=nchc to download a self-
extracting executable for 64-bit Windows that includes Azul's 64-bit OpenJDK
Java VM 11 (weka-3-8-5-azul-zulu-windows.exe; 124.6 MB)
This executable will install WEKA in your Program Menu. Launching via the
Program Menu or shortcuts will automatically use the included JVM to run
WEKA.
Linux
Use https://sourceforge.net/projects/weka/files/weka-3-8/3.8.5/weka-3-8-5-
azul-zulu-linux.zip/download?use_mirror=nchc to download a zip archive
for Linux that includes Azul's 64-bit OpenJDK Java VM 11 (weka-3-8-5-
azul-zulu-linux.zip; 137.4 MB)
First unzip the zip file. This will create a new directory called weka-3-8-5. To
run Weka, change into that directory and type:
./weka.sh
WEKA packages
There is a list of packages for WEKA that can be installed using the built-in
package manager. Javadoc for a package is available at
https://weka.sourceforge.io/doc.packages/ followed by the name of the
package.
Requirements
The latest official releases of WEKA require Java 8 or later. Note that if you
are using Windows and your computer has a display with high pixel density
(HiDPI), you may need to use Java 9 or later to avoid problems with
inappropriate scaling of WEKA's graphical user interfaces.
2.4 DATASETS
Below are some sample WEKA data sets available in .arff format.
airline.arff
breast-cancer.arff
contact-lens.arff
cpu.arff
cpu.with-vendor.arff
credit-g.arff
diabetes.arff
glass.arff
hypothyroid.arff
ionosphere.arff
iris.2D.arff
iris.arff
labor.arff
ReutersCorn-train.arff
ReutersCorn-test.arff
ReutersGrain-train.arff
ReutersGrain-test.arff
segment-challenge.arff
segment-test.arff
soybean.arff
supermarket.arff
unbalanced.arff
vote.arff
weather.numeric.arff
weather.nominal.arff
A gzip'ed tar file containing ordinal, real-world datasets donated by
Professor Arie Ben David is available at
https://sourceforge.net/projects/weka/files/datasets/regression-
datasets/datasets-arie_ben_david.tar.gz/download?use_mirror=pilotfiber
A bzip'ed tar file containing the Reuters21578 dataset split into separate
files according to the ModApte split (reuters21578-ModApte.tar.bz2;
81,745,032 bytes).
A zip file containing a new, image-based version of the classic iris data,
with 50 images for each of the three species of iris. The images have size
600x600. Please see the ARFF file for further information (iris_reloaded.zip,
92,267,000 Bytes). After expanding into a directory using your jar utility (or
an archive program that handles tar-archives/zip files in case of the gzip'ed
tars/zip files), these datasets may be used with WEKA.
2.5 INSTALLATION OF WEKA

1. Download the WEKA installer appropriate for your operating system,
as described in Section 2.3.
2. After successful download, open the file location and double-click on
the downloaded file. The Setup wizard will appear. Click on Next.
3. The License Agreement terms will open. Read it thoroughly and click
on “I Agree”.
Once WEKA is installed, its GUI Chooser offers the following applications:
1. Simple CLI is the WEKA shell, with command-line input and output.
Typing "help" shows an overview of all the commands. Simple CLI
offers access to all classes, such as classifiers, clusterers, filters, etc.
2. Explorer is an environment to explore the data. The WEKA Explorer
window shows different tabs, starting with Preprocess. Initially, the
Preprocess tab is active, as the data set is first preprocessed and
explored before algorithms are applied to it.
3. Experimenter is an environment for performing experiments and
statistical tests between learning schemes. The WEKA Experimenter
allows users to create, run, and modify different schemes in one
experiment on a dataset. The Experimenter has two types of
configuration: Simple and Advanced. Both configurations allow users
to run experiments locally and on remote computers.
4. KnowledgeFlow is a JavaBeans-based interface for setting up and
running machine learning experiments. KnowledgeFlow shows a
graphical representation of WEKA algorithms. The user can select
components and create a workflow to analyze datasets. The data can
be handled batch-wise or incrementally. Parallel workflows can be
designed, and each will run in a separate thread. The different
component types available are DataSources, DataSavers, Filters,
Classifiers, Clusterers, Evaluation, and Visualization.
5. Workbench is a module that combines all of the above GUIs in a
single window.
2.6 FEATURES OF WEKA EXPLORER

In this section we will learn about the features of the WEKA Explorer.
2.6.1 Dataset
(i) Nominal Attributes: Attributes which relate to names and have predefined
values, such as color or weather. These attributes are also called categorical
attributes. They do not have any order, and their values are also called
enumerations.
@attribute outlook {sunny, overcast, rainy}: declaration of the nominal
attribute.
(ii) Binary Attributes: These attributes represent only values 0 and 1. These
are the type of nominal attributes with only 2 categories. These attributes are
also called Boolean.
(iii) Ordinal Attributes: The attributes which preserve some order or ranking
amongst them are ordinal attributes. Successive values cannot be predicted but
only order is maintained. Example: size, grade, etc.
(iv) Numeric Attributes: Attributes representing measurable quantities are
numeric attributes. These are represented by real numbers or
integers. Example: temperature, humidity.
@attribute humidity real: declaration of a numeric attribute
(v) String Attributes: These attributes represent a sequence of characters,
enclosed in double quotes.
2.6.2 ARFF Data Format

WEKA works on ARFF files for data analysis. ARFF stands for Attribute-
Relation File Format. It has 3 sections: relation, attributes, and data. Every
section starts with "@".
ARFF files take Nominal, Numeric, String, Date, and Relational data attributes.
Some of the well-known machine learning datasets are present in WEKA as
ARFF files.
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}   % class attribute: represents the output
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
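As an illustration, this minimal Java sketch (assuming weka.jar is on the
classpath and the above contents are saved as weather.arff) loads the file
through the WEKA API and prints a summary:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; DataSource also handles CSV and other formats.
        Instances data = new DataSource("weather.arff").getDataSet();
        // Declare the last attribute (play) as the class attribute.
        data.setClassIndex(data.numAttributes() - 1);
        // Print the relation name, number of instances and per-attribute statistics.
        System.out.println(data.toSummaryString());
    }
}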
2.6.3 CSV Format

Key points are:
CSV is probably the simplest possible structured format for data.
CSV strikes a delicate balance, remaining readable by both machines and
humans.
CSV is a two-dimensional structure consisting of rows of data, each row
containing multiple cells. Rows are (usually) separated by line terminators, so
each row corresponds to one line. Cells within a row are separated by
commas (hence the C(omma) part).
Note that, strictly, we are really talking about DSV files, in that we can allow
'delimiters' between cells other than a comma. However, many people and
many programs still call such data CSV (since the comma is so common as
the delimiter).
CSV is a "text-based" format, i.e. a CSV file is a text file. This makes it
amenable to processing with all kinds of text-oriented tools (from text editors
to Unix tools like sed, grep, etc.).
If you open up a CSV file in a text editor it would look something like:
A,B,C
1,2,3
4,"5,3",6
Here there are 3 rows, each of 3 columns. Notice how the second cell in the
last row is "quoted", because the content of that value actually contains a ","
character.
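WEKA can also read CSV files directly. A small sketch (assuming a file
data.csv, such as the example above with a header row) using WEKA's
CSVLoader:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class LoadCsv {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("data.csv")); // the first row is taken as the header
        Instances data = loader.getDataSet();
        System.out.println(data);               // prints the data in ARFF notation
    }
}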
2.6.4 XRFF Format

XRFF stands for the XML attribute-Relation File Format. It represents data
that can store comments, attributes, and instance weights. It has the .xrff file
extension, and .xrff.gz for the compressed format. XRFF files represent data
in XML format.
2.6.5 Classifiers

The Classify tab is used to apply classification and regression algorithms to
the data set and to evaluate the resulting models, for example by cross-
validation or on a separate test set.
2.6.6 Clustering
WEKA uses the Cluster tab to find similarities in the dataset. Based on
clustering, the user can find out which attributes are useful for analysis and
ignore the others. The available algorithms for clustering in WEKA include
k-means (SimpleKMeans), EM, Cobweb, X-means, and FarthestFirst.
2.6.7 Association
The best-known algorithm available in WEKA for finding association rules
is Apriori.
2.6.8 Select Attributes

WEKA uses two settings for selecting the best attributes for calculation
purposes:
Attribute Evaluator: the method by which attributes (or subsets of attributes)
are evaluated, e.g. correlation-based, information gain, or gain ratio.
Search Method: the algorithm by which the attribute space is searched:
best-first, forward selection, random, exhaustive, genetic algorithm, and
ranking.
2.6.9 Visualization

The Visualize tab displays a scatter-plot matrix for all pairs of attributes,
which helps in examining relationships between attributes and the class.
2.7 DATA PREPROCESSING

2.7.1 Discretization

Synopsis
An instance filter that discretizes a range of numeric attributes in the dataset
into nominal attributes (the synopsis of
weka.filters.unsupervised.attribute.Discretize).

Options
bins -- Number of bins.
useEqualFrequency -- If set to true, equal-frequency binning will be used
instead of equal-width binning.
Select the temperature attribute based on class temperature to visualize below:
weka.filters.supervised.instance.Resample

Synopsis
Produces a random subsample of a dataset using either sampling with
replacement or without replacement. The original dataset must fit entirely in
memory. The number of instances in the generated dataset may be specified.
The dataset must have a nominal class attribute. If not, use the unsupervised
version. The filter can be made to maintain the class distribution in the
subsample, or to bias the class distribution toward a uniform distribution. When
used in batch mode (i.e. in the Filtered Classifier), subsequent batches are NOT
resampled.
Options
debug -- If set to true, filter may output additional info to the console.
noReplacement -- Disables the replacement of instances.
doNotCheckCapabilities -- If set, the filter's capabilities are not checked
before the filter is built (use with caution to reduce runtime).
sampleSizePercent -- The subsample size as a percentage of the original set.
invertSelection -- Inverts the selection (only if instances are drawn WITHOUT
replacement).
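A minimal sketch of applying this filter through the WEKA Java API
(assuming weather.nominal.arff is available; the option setters mirror the
options listed above):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class ResampleExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // supervised filter needs a class

        Resample resample = new Resample();
        resample.setSampleSizePercent(50.0); // subsample half of the instances
        resample.setNoReplacement(true);     // draw instances without replacement
        resample.setInputFormat(data);       // must be called before filtering

        Instances sample = Filter.useFilter(data, resample);
        System.out.println("Original: " + data.numInstances()
                + " instances, sample: " + sample.numInstances());
    }
}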
Select the outlook attribute based on class outlook (Nom) to visualize below:
Select the humidity attribute based on class outlook (Nom) to visualize below:
Select the windy attribute based on class outlook (Nom) to visualize below:
Select the play attribute based on class outlook (Nom) to visualize below:
weka.attributeSelection.CfsSubsetEval

CfsSubsetEval
Evaluates the worth of a subset of attributes by considering the individual
predictive ability of each feature along with the degree of redundancy
between them. Subsets of features that are highly correlated with the class
while having low inter-correlation are preferred.
Options
PoolSize: The size of the thread pool, for example, the number of cores in the
CPU.
InfoGainAttributeEval:
Evaluates the worth of an attribute by measuring the information gain with
respect to the class.
InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute)
Options
missingMerge -- Distribute counts for missing values. Counts are distributed
across other values in proportion to their frequency. Otherwise, missing is
treated as a separate value.
binarizeNumericAttributes -- Just binarize numeric attributes instead of
properly discretizing them.
doNotCheckCapabilities -- If set, evaluator capabilities are not checked before
evaluator is built (Use with caution to reduce runtime).
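The same evaluator can be used programmatically. A sketch (assuming a
loaded dataset with the class attribute set) that ranks all attributes by
information gain using the Ranker search:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankByInfoGain {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // score = InfoGain(Class, Attribute)
        selector.setSearch(new Ranker());                   // rank attributes by that score
        selector.SelectAttributes(data);
        System.out.println(selector.toResultsString()); // same "Ranked attributes" output as the GUI
    }
}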
weka.attributeSelection.CorrelationAttributeEval

CorrelationAttributeEval
Evaluates the worth of an attribute by measuring the correlation (Pearson's)
between it and the class.

Options
Ranked attributes:
0.447 4 student
0.293 2 Age
0.258 5 Credit-Rating
0.24 1 Rid
0.113 3 Income
Selected attributes: 4, 2, 5, 1, 3 : 5
weka.attributeSelection.GainRatioAttributeEval
GainRatioAttributeEval
Evaluates the worth of an attribute by measuring the gain ratio with respect to
the class.
Options
MissingMerge: Distribute counts for missing values. Counts are distributed
across other values in proportion to their frequency. Otherwise, missing is
treated as a separate value.
Ranked attributes:
0.1564 2 Age
0.1518 4 student
0.0488 5 Credit-Rating
0.0188 3 Income
0 1 Rid
Selected attributes: 2, 4, 5, 3, 1 : 5
2.8 ASSOCIATION RULE MINING
Apriori is an algorithm for frequent item set mining and association rule
learning over transactional databases. It proceeds by identifying the frequent
individual items in the database and extending them to larger and larger item
sets, as long as those item sets appear sufficiently often in the database. The
frequent item sets determined by Apriori can be used to derive association
rules which highlight general trends in the database.
The Apriori algorithm was proposed by Agrawal and Srikant in 1994. Apriori
is designed to operate on databases containing transactions. Each transaction
is seen as a set of items. Given a threshold C, the Apriori algorithm identifies
the item sets which are subsets of at least C transactions in the database.
Apriori uses breadth-first search and a hash tree structure to count candidate
item sets efficiently. It generates candidate item sets of length k from item sets
of length k-1. Then it prunes the candidates which have an infrequent sub-
pattern.
Pseudo code:
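The original pseudocode figure is not reproduced here; the following plain-
Java sketch (not WEKA code; the transaction database is just a list of item
sets) conveys the same level-wise logic:

import java.util.*;

public class AprioriSketch {

    // Returns, level by level, every item set contained in at least minCount transactions.
    static List<Set<Set<String>>> apriori(List<Set<String>> db, int minCount) {
        List<Set<Set<String>>> levels = new ArrayList<>();
        Set<Set<String>> frequent = frequentSingletons(db, minCount);
        while (!frequent.isEmpty()) {
            levels.add(frequent);
            // Join step: combine frequent (k-1)-sets that differ by exactly one item.
            Set<Set<String>> candidates = new HashSet<>();
            for (Set<String> a : frequent)
                for (Set<String> b : frequent) {
                    Set<String> u = new HashSet<>(a);
                    u.addAll(b);
                    if (u.size() == a.size() + 1) candidates.add(u);
                }
            // Prune step: every (k-1)-subset of a frequent k-set must itself be frequent.
            Set<Set<String>> prev = frequent;
            candidates.removeIf(c -> c.stream().anyMatch(item -> {
                Set<String> s = new HashSet<>(c);
                s.remove(item);
                return !prev.contains(s);
            }));
            // Scan step: count candidates in the database, keep those meeting minimum support.
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> c : candidates)
                if (db.stream().filter(t -> t.containsAll(c)).count() >= minCount) next.add(c);
            frequent = next;
        }
        return levels;
    }

    static Set<Set<String>> frequentSingletons(List<Set<String>> db, int minCount) {
        Map<String, Long> counts = new HashMap<>();
        db.forEach(t -> t.forEach(i -> counts.merge(i, 1L, Long::sum)));
        Set<Set<String>> result = new HashSet<>();
        counts.forEach((i, c) -> { if (c >= minCount) result.add(new HashSet<>(Set.of(i))); });
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "diapers", "beer"),
                Set.of("milk", "diapers", "beer"),
                Set.of("bread", "milk", "diapers", "beer"));
        System.out.println(apriori(db, 2)); // item sets supported by at least 2 transactions
    }
}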
Useful Concepts
To select interesting rules from the set of all possible rules, constraints on
various measures of significance and interest can be used. The best-known
constraints are minimum thresholds on support and confidence.
(i) Support
The support supp(X) of an item set X is defined as the proportion of
transactions in the data set which contain the item set:
supp(X) = (number of transactions which contain the item set X) / (total
number of transactions)
(ii) Confidence
The confidence of a rule is defined as:
conf(X -> Y) = supp(X ∪ Y) / supp(X)
(iii) Lift
The lift of a rule is defined as:
lift(X -> Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
(iv) Conviction
The conviction of a rule is defined as:
conv(X -> Y) = (1 - supp(Y)) / (1 - conf(X -> Y))
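As a worked example on the 14-instance weather data: for the rule
outlook=overcast -> play=yes, the item set {outlook=overcast, play=yes}
occurs in 4 of the 14 transactions, so supp = 4/14 ≈ 0.29; all 4 transactions
containing outlook=overcast also contain play=yes, so conf = 4/4 = 1.0; and
since supp(play=yes) = 9/14, lift = 1.0 / (9/14) ≈ 1.56, indicating a positive
association.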
Options
car: If enabled class association rules are mined instead of (general) association
rules.
classIndex: Index of the class attribute. If set to -1, the last attribute is taken as
class attribute.
delta: Iteratively decrease support by this factor. Reduces support until min
support is reached or required number of rules has been generated.
LowerBoundMinSupport: Lower bound for minimum support.
MetricType: Set the type of metric by which to rank rules. Confidence is the
proportion of the examples covered by the premise that are also covered by the
consequence (Class association rules can only be mined using confidence). Lift
is confidence divided by the proportion of all examples that are covered by the
consequence. This is a measure of the importance of the association that is
independent of support. Leverage is the proportion of additional examples
covered by both the premise and consequence above those expected if the
premise and consequence were independent of each other. The total number of
examples that this represents is presented in brackets following the leverage.
Conviction is another measure of departure from independence.
MinMetric: Minimum metric score. Consider only rules with scores higher
than this value.
NumRules: Number of rules to find.
OutputItemSets: If enabled, the item sets are output as well.
RemoveAllMissingCols: Remove columns with all missing values.
SignificanceLevel: Significance level. Significance test (confidence metric
only).
UpperBoundMinSupport: Upper bound for minimum support. Start
iteratively decreasing minimum support from this value.
Verbose: If enabled, the algorithm will be run in verbose mode.
How to open the Apriori algorithm in WEKA:
Start WEKA → select the Explorer → open the file (weather.nominal.arff) →
select the Associate tab → choose the algorithm (Apriori) → click on Start.
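Equivalently, Apriori can be run from the Java API. A small sketch (assuming
weather.nominal.arff is available) mirroring the run above:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);              // -N 10: report the 10 best rules
        apriori.setLowerBoundMinSupport(0.1); // -M 0.1: lower bound for minimum support
        apriori.buildAssociations(data);

        System.out.println(apriori);          // prints the run information and rules
    }
}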
10. play=yes 9 ==> humidity=normal 6 conf:(0.67) < lift:(1.33)> lev:(0.11) [1] conv:(1.13)
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 2 -C 0.1 -D 0.05 -U 1.0 -M 0.1
-S -1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5
2.9 CLASSIFICATION
In this section, the following classification algorithms are applied to the
ionosphere data set:
Logistic Regression
Naive Bayes
Decision Tree
k-Nearest Neighbors
Support Vector Machines
Each instance describes the properties of radar returns from the atmosphere,
and the task is to predict whether or not there is structure in the ionosphere.
There are 34 numerical input variables of generally the same scale. You can
learn more about this dataset on the UCI Machine Learning Repository. Top
results are in the order of 98% accuracy.
The algorithm learns a coefficient for each input value; these are linearly
combined into a regression function and transformed using a logistic
(s-shaped) function. Logistic regression is a fast and simple technique, but
can be very effective on some problems.
Logistic regression itself only supports binary classification problems,
although the WEKA implementation has been adapted to support multi-class
classification problems.
Choose the logistic regression algorithm:
1. Click the “Choose” button and select “Logistic” under the “functions”
group.
2. Click on the name of the algorithm to review the algorithm
configuration.
The algorithm can run for a fixed number of iterations (maxIts), but by default
will run until it is estimated that the algorithm has converged. The
implementation uses a ridge estimator, which is a type of regularization. This
method seeks to simplify the model during training by minimizing the
coefficients learned by the model. The ridge parameter defines how much
pressure to put on the algorithm to reduce the size of the coefficients. Setting
this to 0 will turn off this regularization.
You can see that, with the default configuration, logistic regression achieves
an accuracy of 88%.
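This experiment can be reproduced in code. A sketch (assuming an
ionosphere.arff file for the dataset described above) that evaluates Logistic
with 10-fold cross-validation:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LogisticExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("ionosphere.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Logistic logistic = new Logistic();
        logistic.setRidge(1.0e-8); // the default ridge value; 0 disables regularization

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(logistic, data, 10, new Random(1)); // 10-fold cross-validation
        System.out.println(eval.toSummaryString());
        System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
    }
}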
Choose the Naive Bayes algorithm:
1. Click the “Choose” button and select “NaiveBayes” under the “bayes”
group.
2. Click on the name of the algorithm to review the algorithm
configuration.
There are a number of other flavors of Naive Bayes algorithms that you could
work with.
They work by creating a tree to evaluate an instance of data, starting at the
root of the tree and moving down to the leaves until a prediction can be made.
The process of creating a decision tree works by greedily selecting the best
split point in order to make predictions, and repeating the process until the
tree is of a fixed depth.
After the tree is constructed, it is pruned in order to improve the model's
ability to generalize to new data.
The depth of the tree is defined automatically, but a depth can be specified in
the maxDepth attribute.
You can also choose to turn off pruning by setting the noPruning parameter to
True, although this may result in worse performance. The minNum parameter
defines the minimum number of instances supported by the tree in a leaf node
when constructing the tree from the training data.
You can see that, with the default configuration, the decision tree algorithm
achieves an accuracy of 89%.
Another, more advanced decision tree algorithm that you can use is the C4.5
algorithm, called J48 in WEKA. You can review a visualization of a decision
tree prepared on the entire training data set by right-clicking on the "Result
list" and clicking "Visualize Tree".
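A brief sketch of building J48 on the same data and printing the induced tree
(setUnpruned and setMinNumObj are J48's counterparts of the pruning and
minNum options discussed above):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Example {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("ionosphere.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setUnpruned(false); // keep pruning enabled (disabling it may hurt generalization)
        tree.setMinNumObj(2);    // minimum number of instances per leaf
        tree.buildClassifier(data);

        System.out.println(tree); // text rendering of the decision tree
    }
}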
It is a simple algorithm, but one that does not assume very much about the
problem other than that the distance between data instances is meaningful in
making predictions. As such, it often achieves very good performance.
When making predictions on classification problems, KNN will take the mode
(most common class) of the k most similar instances in the training dataset.
WEKA Configuration for the Search Algorithm in the k-Nearest Neighbors Algorithm
You can see that, with the default configuration, the kNN algorithm achieves
an accuracy of 86%.
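In WEKA, kNN is implemented by the IBk classifier (under the "lazy"
group). A short sketch setting k on the same data:

import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("ionosphere.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk();
        knn.setKNN(3);             // use the 3 most similar instances
        knn.buildClassifier(data); // lazy learner: essentially just stores the data

        // Predict the first instance's class as the mode of its 3 nearest neighbours.
        double label = knn.classifyInstance(data.instance(0));
        System.out.println(data.classAttribute().value((int) label));
    }
}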
SVM was developed for numerical input variables, although it will
automatically convert nominal values to numerical values. Input data is also
normalized before being used.
SVMs work by finding a line that best separates the data into the two groups.
This is done using an optimization process that only considers those data
instances in the training dataset that are closest to the line that best separates
the classes. These instances are called support vectors, hence the name of the
technique.
In almost all problems of interest, a line cannot be drawn to neatly separate the
classes, therefore a margin is added around the line to relax the constraint,
allowing some instances to be misclassified but allowing a better result overall.
Finally, few datasets can be separated with just a straight line. Sometimes a
line with curves, or even polygonal regions, needs to be marked out.
with SVM by projecting the data into a higher dimensional space in order to
draw the lines and make predictions. Different kernels can be used to control
the projection and the amount of flexibility in separating the classes.
Choose the SVM algorithm:
1. Click the “Choose” button and select “SMO” under the “function”
group.
2. Click on the name of the algorithm to review the algorithm
configuration.
SMO refers to the specific efficient optimization algorithm used inside the
SVM implementation, which stands for Sequential Minimal Optimization.
A key parameter in SVM is the type of kernel to use. The simplest kernel is a
Linear kernel, which separates data with a straight line or hyperplane. The
default in WEKA is a Polynomial kernel, which will separate the classes
using a curved or wiggly line; the higher the polynomial degree (the exponent
value), the more wiggly the line.
A popular and powerful kernel is the RBF Kernel or Radial Basis Function
Kernel that is capable of learning closed polygons and complex shapes to
separate the classes. It is a good idea to try a suite of different kernels and C
(complexity) values on your problem and see what works best.
You can see that, with the default configuration, the SVM algorithm achieves
an accuracy of 88%.
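A sketch of configuring SMO with an RBF kernel and a complexity value,
the two parameters suggested above for tuning (the gamma value shown is
just an illustration):

import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvmExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("ionosphere.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        SMO svm = new SMO();
        svm.setC(1.0);              // complexity: how soft the margin is
        RBFKernel kernel = new RBFKernel();
        kernel.setGamma(0.01);      // width of the radial basis function
        svm.setKernel(kernel);
        svm.buildClassifier(data);

        System.out.println(svm);    // prints the learned support vector machine
    }
}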
2.10 CLUSTERING
K-Means Algorithm Using WEKA Explorer
Let us see how to implement the K-means algorithm for clustering using
WEKA Explorer.
Cluster Analysis
Cluster analysis partitions a data set into subsets of similar objects. These
subsets are called clusters, and the set of clusters is called a clustering.
Cluster analysis is used in many applications such as image recognition,
pattern recognition, web search and security, and in business intelligence,
such as the grouping of customers with similar likings.
K-Means Clustering
How Does the K-Means Clustering Algorithm Work?
Step #1: Choose a value of K, where K is the number of clusters.
Step #2: Iterate over the points, assigning each to the cluster with the nearest
center. Once every element has been assigned, compute the centroid of each
cluster.
Step #3: Iterate over every element of the dataset and calculate the Euclidean
distance between the point and the centroid of every cluster. If any point is in
a cluster that is not the nearest to it, reassign that point to the nearest cluster;
after doing this for all the points in the dataset, recalculate the centroid of
each cluster.
Step #4: Repeat Step #3 until no new assignment takes place between two
consecutive iterations.
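In WEKA, k-means is implemented by SimpleKMeans. A sketch (assuming
vote.arff, the dataset used below) that builds 5 clusters and prints the centroid
table and the within-cluster sum of squared errors:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("vote.arff").getDataSet();
        // Clustering is unsupervised, so no class index is set. To mirror the GUI's
        // "Classes to clusters" setup, the class attribute could first be removed
        // with the Remove filter.

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(5); // the K in k-means
        kmeans.setSeed(10);       // seed for choosing the random initial centroids
        kmeans.buildClusterer(data);

        System.out.println(kmeans); // centroid table and sum of squared errors
        System.out.println("SSE: " + kmeans.getSquaredError());
    }
}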
4) Click on Start in the left panel. The algorithm displays its results on the
white screen. Let us analyze the run information:
Scheme, Relation, Instances, and Attributes describe the properties of the
dataset and the clustering method used. In this case, the vote.arff dataset
has 435 instances and 13 attributes.
The sum of squared errors is 1098.0. This error will reduce with an
increase in the number of clusters.
The final clusters with their centroids are represented in the form of a
table. In our case, the cluster sizes are 168.0, 47.0, 37.0, 122.0, 33.0 and
28.0.
The blue color represents class label democrat and the red color
represents class label republican.
Click the box on the right-hand side of the window to change the x
coordinate attribute and view clustering with respect to other attributes.
Output
K-means clustering is a simple cluster analysis method. The number of
clusters can be set using the settings tab. The centroid of each cluster is
calculated as the mean of all points within the cluster. With an increase in the
number of clusters, the sum of squared errors is reduced. The objects within a
cluster exhibit similar characteristics and properties. The clusters represent
the class labels.
You may seek assistance in doing the lab exercises from the concerned lab
instructor. Since the assignments carry credits, the lab instructor is obviously
not expected to tell you how to solve them, but you may ask questions
concerning the operating system and the tools.
The program should be interactive, general and properly documented with real
Input/ Output data.
If two or more submissions from different students appear to be of the same origin
(i.e. are variants of essentially the same program), none of them will be counted.
You are strongly advised not to copy somebody else's work.
As soon as you have finished a lab exercise, contact one of the lab instructors
/ in-charge in order to get the exercise evaluated, and also get his/her
signature in the Observation book.
There are 10 lab sessions (3 hours each), and the list of assignments is
provided session-wise. It is important to observe the deadline given for each
assignment.
2.11 PRACTICAL SESSIONS
Session-1
Session-2
Session-3
8. Implement the Apriori algorithm to find the association rules in the
contact-lens.arff dataset.
Session-4
Session-5
Session-6
Session-7
Session-8
Session-9
Session-10
2.12 SUMMARY
In this Data Mining lab course you were given exposure to working with
WEKA. WEKA is a collection of machine learning algorithms for data
mining tasks. It contains tools for data preparation, classification, regression,
clustering, association rule mining, and visualization. WEKA is named after
the weka, a flightless bird found only in New Zealand. WEKA is open-source
software issued under the GNU General Public License. The video links for
the course are available under Online Lab Resources.
2.13 FURTHER READINGS

1. Eibe Frank, Mark A. Hall, and Ian H. Witten (2016). The WEKA
Workbench. Online Appendix for "Data Mining: Practical Machine
Learning Tools and Techniques", Morgan Kaufmann, Fourth Edition,
2016.
2. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter
Reutemann, and Ian H. Witten (2009). The WEKA Data Mining
Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.
3. Jason Bell (2020) Machine Learning: Hands-On for Developers and
Technical Professionals, Second Edition, Wiley.
4. Richard J. Roiger (2020) Just Enough R! An Interactive Approach to
Machine Learning and Analytics, CRC Press.
5. Parteek Bhatia (2019) Data Mining and Data Warehousing Principles
and Practical Techniques, Cambridge University Press.
6. Mark Wickham (2018) Practical Java Machine Learning Projects with
Google Cloud Platform and Amazon Web Services, APress.
7. Ashish Singh Bhatia, Boštjan Kaluža (2018) Machine Learning in Java,
Second Edition, Packt Publishing.
8. Richard J. Roiger (2016) Data Mining: A Tutorial-Based Primer, CRC
Press.
9. Mei Yu Yuan (2016) Data Mining and Machine Learning: WEKA
Technology and Practice, Tsinghua University Press (in Chinese).
10. Jürgen Cleve, Uwe Lämmel (2016) Data Mining, De Gruyter (in
German).
11. Eric Rochester (2015) Clojure Data Analysis Cookbook - Second
Edition, Packt Publishing.
12. Boštjan Kaluža (2013) Instant Weka How-to, Packt Publishing.
13. Hongbo Du (2010) Data Mining Techniques and Applications,
Cengage Learning.
2.14 WEBSITE REFERENCES

1. https://www.cs.waikato.ac.nz/ml/weka
2. https://weka.wikispaces.com
2.15 ONLINE LAB RESOURCES