Weka DW&DM Lab Notes
Data Warehousing Tools
ParAccel (Actian)
Cloudera
Talend
QuerySurge
Amazon Redshift
Teradata
Oracle
Tableau
Open Source Data Mining Tools
WEKA
Orange
KNIME
R Programming
RapidMiner
Apache Mahout
Tanagra
XLMiner
Experiment 1: Installation of WEKA Tool
Aim: To investigate the application interfaces of the Weka tool.
Introduction
Weka (pronounced to rhyme with Mecca) is a workbench that contains a collection of
visualization tools and algorithms for data analysis and predictive modeling, together with
graphical user interfaces for easy access to these functions. The original non-Java version of
Weka was a Tcl/Tk front-end to (mostly third-party) modeling algorithms implemented in other
programming languages, plus data preprocessing utilities in C, and a Makefile-based system for
running machine learning experiments. This original version was primarily designed as a tool for
analyzing data from agricultural domains, but the more recent fully Java-based version (Weka 3),
for which development started in 1997, is now used in many different application areas, in
particular for educational purposes and research. Advantages of Weka include:
Free availability under the GNU General Public License
Portability, since it is fully implemented in Java and runs on almost any modern platform
A comprehensive collection of data preprocessing and modeling techniques
Ease of use due to its graphical user interfaces
Description:
Open the program. Once the program has been loaded on the user's machine, it is opened by
navigating to the programs option of the Start menu; the exact path will depend on the user's
operating system. Figure 1.1 is an example of the initial opening screen on a computer.
There are four options available on this initial screen: Explorer, Experimenter, KnowledgeFlow,
and Simple CLI. The first two are described here.
1. Explorer - the graphical interface used to conduct experimentation on raw data. After clicking
the Explorer button, the Weka Explorer interface appears. Two of its tabs are described below:
The Cluster tab is used to apply different tools that identify clusters within the data file. It
opens the process that is used to identify commonalities or clusters of occurrences within the
data set and produce information for the user to analyze.
The Visualize tab is used to see what the various manipulations produced on the data set in a
2D format, in scatter plot and bar graph output.
2. Experimenter - this option allows users to conduct different experimental variations on data
sets and perform statistical manipulation. The Weka Experiment Environment enables the user to
create, run, modify, and analyze experiments in a more convenient manner than is possible when
processing the schemes individually. For example, the user can create an experiment that runs
several schemes against a series of datasets and then analyze the results to determine if one of the
schemes is (statistically) better than the others; a programmatic sketch of this idea follows below.
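The same kind of scheme comparison can also be scripted against Weka's Java API. The following is only a minimal sketch of the idea, not the Experimenter tool itself; the dataset path assumes Weka's bundled "data" directory:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareSchemes {
    public static void main(String[] args) throws Exception {
        // Load a dataset; the path assumes Weka's bundled data directory.
        Instances data = DataSource.read("data/iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate each scheme with the same 10-fold cross-validation seed.
        Evaluation j48Eval = new Evaluation(data);
        j48Eval.crossValidateModel(new J48(), data, 10, new Random(1));

        Evaluation nbEval = new Evaluation(data);
        nbEval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        System.out.printf("J48 accuracy:         %.2f%%%n", j48Eval.pctCorrect());
        System.out.printf("Naive Bayes accuracy: %.2f%%%n", nbEval.pctCorrect());
    }
}
```

Unlike the Experimenter, this sketch does not perform a statistical significance test; it simply reports the cross-validated accuracy of each scheme on the same folds.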
Click the “Open file…” button to open a data set, then double-click on the “data” directory.
Weka provides a number of small common machine learning datasets that you can use to practice on.
Select the “iris.arff” file to load the Iris dataset.
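The same dataset can also be loaded programmatically through the Weka API. A minimal sketch (the path again assumes Weka's bundled "data" directory):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadIris {
    public static void main(String[] args) throws Exception {
        // DataSource picks the right loader from the file extension (.arff here).
        Instances data = DataSource.read("data/iris.arff");
        // The class attribute is conventionally the last one in iris.arff.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Relation:   " + data.relationName());
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
    }
}
```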
Exercise:
1. Normalize the data using min-max normalization
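Min-max normalization rescales each numeric attribute x to x' = (x - min) / (max - min), mapping it into [0, 1]. In Weka this corresponds to the unsupervised Normalize filter; a minimal sketch, with the dataset path as an assumption:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class MinMaxNormalize {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff"); // path is an assumption
        // Normalize rescales every numeric attribute to [0, 1] by default.
        Normalize filter = new Normalize();
        filter.setInputFormat(data);
        Instances normalized = Filter.useFilter(data, filter);
        System.out.println(normalized.toSummaryString());
    }
}
```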
Multiple Choice Questions:
Which of the following file formats is most commonly used for importing data into Weka?
a. PDF
b. CSV
c. MP4
d. PNG
Answer: b. CSV.
Explanation: Weka can import data in various file formats, but CSV (Comma Separated Values)
is the most commonly used file format for importing data.
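Programmatically, CSV files can be read with Weka's CSVLoader. A minimal sketch (the file name is hypothetical):

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class LoadCsv {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("mydata.csv")); // hypothetical file name
        Instances data = loader.getDataSet();     // header row becomes attribute names
        System.out.println(data.numInstances() + " instances loaded");
    }
}
```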
Which of the following is not a data preprocessing technique in Weka?
a. Attribute selection
b. Data cleaning
c. Data normalization
d. Data visualization
Answer: d. Data visualization.
Explanation: Data visualization is not a preprocessing technique; Weka's preprocessing
facilities include attribute selection, data cleaning, and data normalization. Visualization is
provided separately, through the Visualize tab.
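As an illustration of one of these preprocessing steps, attribute selection can be run through the API with a correlation-based (CFS) evaluator and a greedy search. A minimal sketch, with the dataset path as an assumption:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval()); // correlation-based subset evaluator
        selector.setSearch(new GreedyStepwise());   // greedy forward search
        selector.SelectAttributes(data);            // note the legacy capitalised method name

        // Print the names of the selected attributes (class index is included last).
        for (int i : selector.selectedAttributes()) {
            System.out.println(data.attribute(i).name());
        }
    }
}
```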
a. Naive Bayes
b. K-means
c. Random Forest
d. PCA
Which of the following is not a clustering algorithm?
a. K-means
b. DBSCAN
c. EM
d. Linear Regression
Answer: d. Linear Regression.
a. It is computationally expensive
b. It requires large amounts of training data
c. It is sensitive to irrelevant features
d. It cannot handle categorical data
a. Linear Regression
b. Naive Bayes
c. Random Forest
d. K-means
Which of the following is not a clustering algorithm in Weka?
a. K-means
b. DBSCAN
c. Naive Bayes
d. EM
Answer: c. Naive Bayes.
Which of the following is not a data preprocessing technique in Weka?
a. Data normalization
b. Data imputation
c. Data visualization
d. Data discretization
Answer: c. Data visualization.
Explanation: Data visualization is not a preprocessing technique; Weka's preprocessing
facilities include data normalization, data imputation, and data discretization.
Which of the following is not an attribute type supported by Weka?
a. Numeric
b. Nominal
c. Binary
d. Sequential
Answer: d. Sequential.
a. Classification
b. Clustering
c. Association Rule Mining
d. Data Visualization
Which of the following Weka classifiers implements the C4.5 decision tree algorithm?
a. Random Forest
b. J48
c. K-means
d. DBSCAN
Answer: b. J48.
a. Overfitting
b. Underfitting
c. Missing values
d. Class imbalance
Which of the following is not a regression technique?
a. Linear Regression
b. Polynomial Regression
c. Logistic Regression
d. K-means
Answer: d. K-means.
Which of the following is not a cross-validation technique?
a. K-fold cross-validation
b. Leave-one-out cross-validation
c. Stratified cross-validation
d. Naive Bayes cross-validation
Answer: d. Naive Bayes cross-validation.
Which of the following is not an ensemble learning technique?
a. Bagging
b. Boosting
c. Random Forest
d. K-means
Answer: d. K-means.
a. Logistic Regression
b. Decision Tree
c. Naive Bayes
d. k-Nearest Neighbors (k-NN)
Which of the following measures is used to evaluate the quality of a clustering?
a. Accuracy
b. F-measure
c. Silhouette coefficient
d. Precision
Answer: c. Silhouette coefficient.
Which technique in Weka is used to handle class imbalance by oversampling the minority class?
a. Bagging
b. SMOTE
c. Random Forest
d. Boosting
Answer: b. SMOTE.
a. Decision Tree
b. k-Nearest Neighbors (k-NN)
c. Linear Regression
d. Support Vector Machine (SVM)
Which of the following is a dimensionality reduction technique?
a. Min-max normalization
b. Recursive Feature Elimination (RFE)
c. Correlation-based Feature Selection (CFS)
d. Principal Component Analysis (PCA)
Answer: d. Principal Component Analysis (PCA).
a. AdaBoost
b. Bagging
c. Boosting
d. Random Forest
Which of the following distance measures can be used with the k-NN algorithm?
a. Euclidean distance
b. Manhattan distance
c. Mahalanobis distance
d. All of the above
Answer: d. All of the above.
Which of the following methods can be used to impute missing values?
a. Mean imputation
b. Median imputation
c. Mode imputation
d. All of the above
Answer: d. All of the above.
Which of the following kernels can be used with SVM in Weka?
a. Linear kernel
b. Polynomial kernel
c. Gaussian kernel
d. All of the above
Answer: d. All of the above.
Which of the following is not a clustering algorithm?
a. k-Means
b. Hierarchical Clustering
c. DBSCAN
d. Linear Regression
Answer: d. Linear Regression.
Which of the following is a rule-based classifier in Weka?
a. k-NN
b. Apriori
c. Random Forest
d. JRip
Answer: d. JRip.
Which of the following is an ensemble learning technique?
a. Decision Tree
b. Naive Bayes
c. Bagging
d. k-NN
Answer: c. Bagging.
Which of the following dimensionality reduction techniques is supervised?
a. PCA
b. LDA
c. ICA
d. SVM
Answer: b. LDA.
Which ensemble technique combines the predictions of several base models by training a
meta-level model on their outputs?
a. Bagging
b. Boosting
c. Stacking
d. Random Forest
Answer: c. Stacking.
Which of the following is a density-based clustering algorithm?
a. k-Means
b. EM
c. DBSCAN
d. SOM
Answer: c. DBSCAN.
a. Mean Imputation
b. Mode Imputation
c. Median Imputation
d. k-NN Imputation
Which ensemble learning technique in Weka combines multiple models using a weighted sum
of their predictions?
a. Bagging
b. Boosting
c. Stacking
d. Random Forest
Answer: b. Boosting.
Explanation: Boosting is a type of ensemble learning technique in Weka
that combines multiple models using a weighted sum of their predictions.
It works by iteratively reweighting the instances based on their
classification errors and building a new model on the reweighted data.
Weka provides various ensemble learning techniques, including Bagging,
Boosting, Random Forest, Stacking, and more.
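In Weka this behaviour is available, for example, through the AdaBoostM1 meta-classifier. A minimal sketch with a decision stump as the weak base learner (the dataset path is an assumption):

```java
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostingExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/iris.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new DecisionStump()); // weak learner, retrained on reweighted data
        booster.setNumIterations(10);               // number of boosting rounds
        booster.buildClassifier(data);

        System.out.println(booster);                // prints the boosted ensemble
    }
}
```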
a. Naive Bayes
b. k-NN
c. Decision Tree
d. SVM
Which clustering algorithm in Weka models the data as a mixture of probability distributions,
fitted by expectation-maximization?
a. k-Means
b. EM
c. DBSCAN
d. SOM
Answer: b. EM.
Which feature selection approach evaluates attribute subsets by training and testing a learning
algorithm on them?
a. Filter
b. Wrapper
c. Embedded
d. Correlation-based
Answer: b. Wrapper.
Which rule learner in Weka implements the RIPPER algorithm?
a. OneR
b. ZeroR
c. JRip
d. Random Tree
Answer: c. JRip.
Which clustering algorithm builds a tree of nested clusters (a dendrogram)?
a. k-Means
b. EM
c. DBSCAN
d. Hierarchical
Answer: d. Hierarchical.
Which feature selection approach evaluates attribute subsets based on their correlation with the
class and the redundancy among the attributes?
a. Filter
b. Wrapper
c. Embedded
d. Correlation-based
Answer: d. Correlation-based.
Which of the following is a subspace clustering algorithm for high-dimensional data?
a. k-Means
b. EM
c. DBSCAN
d. CLIQUE
Answer: d. CLIQUE.
Which of the following classifiers in Weka is a decision tree learner based on C4.5?
a. Naive Bayes
b. k-NN
c. J48
d. Random Forest
Answer: c. J48.