Data Mining Lab Report
SUBMITTED TO:
UE163048
Practical - 1
(Implementation of functions, Procedures, Triggers and Cursors)
1. Creating a Table and inserting values
Output:
FUNCTIONS:
1. To calculate total no. of customers in the table
PROCEDURES:
a. To create a greetings procedure
b. To drop a procedure
CURSORS:
1. Implicit Cursor: To update salary of customers
TRIGGERS:
1. Create trigger before insert/ update/ delete
(i) On update
(ii) On insert
(iii) On delete
Practical - 2
(Normalization of Relations)
Original Table:
Practical - 3
(Learning and exploring Weka)
What is WEKA?
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be
applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-
processing, classification, regression, clustering, association rules, and visualization. It is also well-
suited for developing new machine learning schemes.
Weka is open source software issued under the GNU General Public License.
3 (a): Exploring preprocessing in Weka
Unsupervised filters
1. Add ID - An instance filter that adds an ID attribute to the dataset. The new attribute contains a unique ID for each instance.
Note: The ID is not reset for the second batch of instances when batch mode is used from the command line, or with the FilteredClassifier.
2. Numeric to Nominal - A filter for turning numeric attributes into nominal ones.
3. Numeric to Binary - Converts all numeric attributes into binary attributes (apart from the class attribute, if set): if the value of the numeric attribute is exactly zero, the value of the new attribute will be zero; if it is missing, the new value will be missing; otherwise, the new value will be one.
4. Discretize - An instance filter that discretizes a range of numeric attributes in the dataset into nominal attributes. Discretization is by simple binning. Skips the class attribute if set.
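As an illustration of the two numeric filters above, here is a minimal Python sketch of their core logic (the function names and data are ours, for illustration only; this is not Weka's API or implementation):

```python
# Sketch of two unsupervised Weka-style filters in plain Python.

def numeric_to_binary(values):
    """NumericToBinary idea: a value maps to 0 only if it is exactly zero,
    otherwise to 1 (missing values are ignored here for simplicity)."""
    return [0 if v == 0 else 1 for v in values]

def discretize_equal_width(values, bins):
    """Discretize by simple equal-width binning: split [min, max] into
    `bins` intervals and replace each value by its bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    out = []
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        out.append(min(idx, bins - 1))  # clamp the maximum into the last bin
    return out

print(numeric_to_binary([0, 3.5, -2, 0]))   # zero stays 0, everything else 1
print(discretize_equal_width([1, 2, 3, 10], 3))
```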
Supervised Filters:
1. Nominal to Binary - Converts all nominal attributes into binary numeric attributes. An attribute with k values is transformed into k binary attributes if the class is nominal (using the one-attribute-per-value approach). Binary attributes are left binary if option '-A' is not given. If the class is numeric, k - 1 new binary attributes are generated in the manner described in "Classification and Regression Trees" by Breiman et al. (i.e., by taking the average class value associated with each attribute value into account).
2. Merge Nominal Values - Merges values of all nominal attributes among the specified attributes,
excluding the class attribute, using the CHAID method, but without considering re-splitting of
merged subsets.
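The one-attribute-per-value expansion performed by Nominal to Binary (the nominal-class case above) can be sketched in plain Python; this is an illustrative toy, not Weka's implementation, and it does not cover the k - 1 encoding used for a numeric class:

```python
# One-hot (one-attribute-per-value) expansion of a nominal attribute.

def nominal_to_binary(values):
    """Expand a nominal attribute with k distinct values into k 0/1 columns,
    one per category (categories ordered alphabetically here)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

rows = nominal_to_binary(["red", "green", "red", "blue"])
# columns correspond to the sorted categories: blue, green, red
for row in rows:
    print(row)
```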
CLASSIFICATION IN WEKA
1. ZERO-R: The Zero Rule classifier (0R or ZeroR for short) is the simplest rule you can use on a classification problem: it simply predicts the majority class in your dataset (i.e., the mode).
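The ZeroR idea can be sketched in a few lines of Python (illustrative only; Weka's ZeroR additionally handles a numeric class by predicting the mean):

```python
# Minimal sketch of the ZeroR idea: always predict the majority (modal) class.
from collections import Counter

def zero_r(train_labels):
    """Return a classifier that predicts the most frequent training label."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda instance: majority  # the instance is ignored entirely

predict = zero_r(["yes", "no", "yes", "yes", "no"])
print(predict({"price": "high"}))  # -> yes, regardless of the input
```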
2. DECISION TABLE: Class for building and using a simple decision table majority classifier.
Decision Table with 25 cross-validation folds
CLUSTERING IN WEKA
SIMPLEK-MEANS
The WEKA SimpleKMeans algorithm uses Euclidean distance measure to compute distances between
instances and clusters. To perform clustering, select the "Cluster" tab in the Explorer and click on the
"Choose" button. This results in a drop down list of available clustering algorithms.
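The clustering loop that SimpleKMeans runs can be sketched as follows; this is a toy re-implementation with fixed initial centroids and made-up points, not Weka's code:

```python
# Sketch of the k-means loop (assignment + centroid update) with Euclidean distance.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, centroids, iterations=10):
    for _ in range(iterations):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

print(k_means([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)]))
```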
Practical - 4
(Association rule mining in Weka using the Apriori algorithm)
Unsupervised filters
1. Add ID
2. Numeric to Nominal
3. Numeric to Binary
4. Discretize
Supervised Filters:
1. Nominal to Binary
The Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent itemsets and the relevant association rules. It is devised to operate on a database containing a large number of transactions, for instance, items bought by customers in a store.
Practical - 5
INTRODUCTION:
Nowadays, data mining plays a vital role in the automobile industry and is one of the most important areas of research, with the objective of finding meaningful information from the data stored in huge datasets. Automotive data mining (ADM) is a very important research area which helps to predict useful information from automobile databases in order to improve automobile performance, gain a better understanding of it, and support better sales and marketing operations. Data mining, or knowledge discovery, has become an area of growing significance because it helps in analyzing data from different perspectives and summarizing it into useful information.
BACKGROUND:
Required Software (WEKA) -
We have used a data mining software package named WEKA for this project. For the purposes of this study, we selected WEKA (Waikato Environment for Knowledge Analysis), software that was developed at the University of Waikato in New Zealand. The WEKA tool supports a wide range of algorithms and very large data sets. The WEKA workbench (pronounced to rhyme with "Mecca") contains a collection of visualization tools and algorithms. WEKA is open source software issued under the GNU General Public License. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. The original non-Java version of WEKA was written in TCL/TK, but the more recent Java-based version, WEKA 3 (1997), is now used in many different application areas, in particular for education and research. WEKA's main user interface is the Explorer. There is also the Experimenter, with which we can compare the performance of WEKA's machine learning algorithms. The Explorer interface has several panels through which we can access the main components of the workbench. The Visualization tab, which allows visualizing a 2-D plot of the current working relation, is very useful. In this study, WEKA toolkit 3.8.1 is used for generating the association rules and predicting the result.
WEKA supports several standard data mining tasks, more specifically, data preprocessing,
clustering, classification, regression, visualization, and feature selection. All of WEKA's techniques are
predicated on the assumption that the data is available as a single flat file or relation, where each data
point is described by a fixed number of attributes (normally, numeric or nominal attributes, but some
other attribute types are also supported). WEKA provides access to SQL databases using Java Database
Connectivity and can process the result returned by a database query. It is not capable of multi-relational
data mining, but there is separate software for converting a collection of linked database tables into a
single table that is suitable for processing using WEKA.
PROBLEM STATEMENT:
Data mining is widely used in the automobile industry to find the problems that arise in this field. An automobile's performance is of great concern, and several factors may affect it. For prediction, three components are required: the parameters which affect the car's performance, the data mining methods, and the data mining tool. By applying data mining techniques to automobile data we can obtain knowledge that describes the car's performance. This knowledge will help to find the best car in each segment depending upon various factors like price, mileage, engine type, etc.
METHODOLOGY:
The methodology consists of 5 different phases (as shown in the workflow diagram below): Data Set Generation, Data Cleaning, Attribute Selection, Data Mining, and Analysis of Results.
WORKFLOW DIAGRAM:
● Dataset and attribute selection - A dummy automobile dataset is collected which contains data on various cars from the major car-manufacturing companies in the market. The dataset contains 207 instances and 26 attributes, and it also has some missing values. The data file has to be in either CSV format or ARFF format.
● Preprocessing - Data preprocessing is the first step of the evaluation. For this experiment the WEKA Explorer interface is used. Here the source data file is selected from the local machine. After loading the data in the Explorer, the data is refined by selecting different options, which is known as 'data cleaning'; attributes can also be selected or removed as per our need. The screenshot shows our dataset after preprocessing. The left-hand side of the screen shows the relation name, the number of attributes and the number of records. The right-hand side gives details of the attribute values, type, and number of distinct values. A summary of the selected attribute is displayed at the bottom right of the screen.
● Filters - The preprocess section allows filters to be defined that transform the data in various ways. The Filter box is used to set up the filters that are required. There are mainly two categories of filters: supervised and unsupervised. Here we choose the unsupervised category of filters. If the dataset contains any numeric values, we have to convert them to nominal values (as the association rule learner in WEKA supports only nominal values) by using the 'NumericToNominal' filter under the attribute section of the unsupervised filters.
● Precision: the proportion of instances that are truly of a class divided by the total number of instances classified as that class, i.e. TP / (TP + FP).
● Recall: the proportion of instances of a class that are correctly classified as that class, i.e. TP / (TP + FN) (equivalent to the TP rate).
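These two measures can be computed directly from the per-class counts of a confusion matrix; the counts below are hypothetical, for illustration only:

```python
# Precision and recall for one class, from true-positive / false-positive /
# false-negative counts.

def precision(tp, fp):
    """Fraction of instances predicted as the class that truly belong to it."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of instances of the class that were correctly predicted
    (equivalent to the TP rate)."""
    return tp / (tp + fn)

# Hypothetical counts for one class:
tp, fp, fn = 40, 10, 20
print(precision(tp, fp))  # 40 / 50 = 0.8
print(recall(tp, fn))     # 40 / 60 ≈ 0.667
```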
1. First, minimum support is applied to find all frequent item sets in a database.
2. Second, these frequent item sets and the minimum confidence constraint are used to form rules.
While the second step is straightforward, the first step needs more attention: finding all frequent item sets in a database is difficult because it involves searching all possible item sets.
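The two steps above can be sketched in Python. This is a naive toy version for illustration: real Apriori prunes candidate itemsets level by level using the downward-closure property instead of enumerating all of them, and the transactions below are made up:

```python
# Toy sketch of Apriori's two steps: (1) frequent itemsets above a minimum
# support, (2) rules above a minimum confidence.
from itertools import combinations

def apriori(transactions, min_support, min_conf):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: frequent itemsets (naive enumeration of every candidate).
    frequent = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            s = support(set(combo))
            if s >= min_support:
                frequent[frozenset(combo)] = s

    # Step 2: rules X => Y with confidence = supp(X ∪ Y) / supp(X).
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                conf = s / frequent[frozenset(lhs)]
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset) - set(lhs), conf))
    return frequent, rules

tx = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"milk"}, {"bread", "milk"}]
frequent, rules = apriori(tx, min_support=0.5, min_conf=0.9)
print(rules)
```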
Support - The support for a rule X => Y is obtained by dividing the number of transactions which satisfy the rule, N(X=>Y), by the total number of transactions, N:
Support (X=>Y) = N (X=>Y) / N
Confidence - The confidence of the rule X => Y is obtained by dividing the number of transactions which satisfy the rule, N(X=>Y), by the number of transactions which contain the body of the rule, X:
Confidence (X=>Y) = N (X=>Y) / N (X)
The confidence is the conditional probability of the RHS holding true given that the LHS holds true. A high confidence suggests a statistical dependence of the RHS event on the LHS event, though it does not by itself imply causation.
Lift - The lift of the rule X => Y is the deviation of the support of the whole rule from the support expected under independence, given the supports of the LHS (X) and the RHS (Y). Lift is an indication of the effect that knowing that the LHS holds true has on the probability of the RHS holding true. Hence lift gives us information about the increase in probability of the "then" part (consequent, RHS) given the "if" part (antecedent, LHS):
Lift exactly 1: no effect (LHS and RHS independent); no relationship between the events.
Lift greater than 1: positive effect (given that the LHS holds true, it is more likely that the RHS holds true); positive dependence between the events.
Lift smaller than 1: negative effect (when the LHS holds true, it is less likely that the RHS holds true); negative dependence between the events.
Leverage - The proportion of additional examples covered by both the antecedent and the consequent above those expected if the antecedent and consequent were independent of each other.
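A small worked example of these four measures, computed over a hypothetical transaction list (the transactions and attribute values are made up for illustration):

```python
# Worked example of support, confidence, lift, and leverage for a rule X => Y.

transactions = [
    {"diesel", "suv"}, {"diesel", "suv"}, {"diesel"},
    {"petrol", "hatchback"}, {"petrol"},
]  # hypothetical data
N = len(transactions)

def supp(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / N

X, Y = {"diesel"}, {"suv"}
support    = supp(X | Y)                        # N(X=>Y) / N
confidence = supp(X | Y) / supp(X)              # N(X=>Y) / N(X)
lift       = supp(X | Y) / (supp(X) * supp(Y))  # deviation from independence
leverage   = supp(X | Y) - supp(X) * supp(Y)

print(support, confidence, round(lift, 3), round(leverage, 3))
```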
Since the data mining software used to generate the association rules accepts data only in ARFF format, the data in the MS Excel file was first converted into comma-separated text format and then to ARFF format. The data, in nominal ARFF form, is then given to the Weka Associator, where Apriori is selected for association rule mining.
For our test we shall consider data on 206 automobiles with respect to different types of nominal attributes.
The ARFF file presented below contains information regarding each automobile's performance.
Using the Apriori algorithm we want to find the association rules that have minimum support = 0.1 (10%) and minimum confidence = 0.9 (90%). We will do this using the WEKA GUI. After we launch the WEKA application and open the Automobile.csv file, we move to the Associate tab and set up the following configuration:
Here we set minimum support = 0.1, because this generates more frequent item sets. If we set minimum support to 0.2 or more, many item sets are removed, and the number of attributes left is not sufficient to support a proper decision. Minimum confidence = 0.9, on the other hand, can be set high, because this threshold yields a smaller number of (stronger) rules.
USEFUL CONCEPTS:
For the dataset, association rules of the form X -> Y are generated, where the frequent item-sets are found using the Apriori technique. The item-sets X and Y are called the antecedent and the consequent of the rule, respectively. Generation of association rules (AR) is generally controlled by two measures or metrics called support and confidence; the most important ones are given below.
For this automobile dataset, we can calculate the interestingness measures from the Weka results for every generated association rule. Here, however, we calculate them only for one rule generated by Weka.
To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.
Support:
The support for a rule X => Y is obtained by dividing the number of transactions which satisfy the rule, N(X=>Y), by the total number of transactions, N.
The support supp(X) or supp(Y) of an itemset X or Y is defined as the proportion of transactions in the
data set which contain the itemset.
supp(X)= no. of transactions which contain the itemset X / total no. of transactions
supp(Y)= no. of transactions which contain the itemset Y / total no. of transactions
Confidence:
Lift:
Lift(X→Y) = supp(X ∪ Y) / (supp(X) * supp(Y))
Leverage:
Leverage is the proportion of additional examples covered by both the premise and the consequent above those expected if the two were independent:
Leverage(X→Y) = supp(X ∪ Y) − supp(X) * supp(Y)
RESULT ANALYSIS:
The KDD (Knowledge Discovery in Databases) paradigm is a step-by-step process for finding interesting patterns in large amounts of data; data mining is one step in that process. The algorithms' potential as good analytical tools for performance evaluation is shown by looking at the results from the automobile performance dataset. It is much easier to store data than it is to make sense of it. Being able to find relationships in large amounts of stored data can lead to enhanced analysis strategies in fields such as education, marketing, computer performance analysis, and data analysis in general. The problem addressed by KDD is to find patterns in these massive datasets. Traditionally, data has been analyzed manually, but there are human limits: large databases offer too much data to analyze in the traditional manner. The focus of this practical is first to summarize exactly what the KDD process is.
1. After completing all the individual tests (preprocessing, classification, filtering, association and visualization), we show the final accumulated structure of automobile performance by using the KDD process, built in the WEKA KnowledgeFlow interface.
2. If you use an ARFF file in your experiment, select the ARFF loader; if you use a CSV file, select the CSV loader. We take a CSV file.
3. Click on the CSV loader and place it on the canvas, then pass the dataset to the next component.
4. To transform numeric values in the CSV file into nominal values, use the intermediate filter "NumericToNominal", then pass the dataset to the next component.
5. To classify the file, some intermediate evaluation components are needed:
1. Class assigner (to assign the class attribute)
2. Cross-validation fold maker
6. After that, choose a standard classifier to classify and predict the results for the given dataset using the test and training sets. We take a classifier such as J48.
7. Then connect the classifier to some intermediate evaluation components:
I. Classifier performance evaluator (used to obtain the important parameter results)
II. Prediction appender (used to obtain the predicted results)
Both are connected to the J48 classifier through the batchClassifier signal.
To show the results, use a text viewer connected through the text signal. This gives the parameter results (confusion matrix, accuracy, TP rate, FP rate, precision, recall, and so on) as well as the predicted results alongside the actual results.
8. (i) To show the graph generated by the classifier, a graph viewer is needed, connected through the graph signal.
(ii) A visualization tool, the model performance chart, is also needed: by passing the threshold data it produces charts of the classifier parameters, and by using the visualizable error signal it produces charts of the error points between attributes (e.g., actual result vs. predicted result).
9. After completing the classification stage, we go to the association stage to generate association rules using the Apriori algorithm. Passing the dataset from the loader, a text viewer connected through the text signal is needed to show the resulting rules.
10. (i) At last, for visualization of the dataset, choose the Scatter Plot Matrix tool, passing the dataset from the loader.
(ii) After completing the diagram, load the data in the CSV loader, click Run at the top left, and check the status area at the bottom for success or error messages.
WEKA LIMITATIONS:
1. In WEKA, when we declare the set of values for a nominal attribute, only those values may be used in the data section of the file. If an undeclared value appears, WEKA shows an error pop-up message, because WEKA does not accept any numeric or string value that has not been declared.
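A minimal, hypothetical ARFF fragment illustrating this limitation: the value CNG in the data section is not listed in the @attribute declaration, so Weka rejects the file.

```
@relation cars
@attribute fuel {petrol, diesel}
@data
petrol
CNG
```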
This practical presents data mining in an automobile environment that identifies car-performance patterns using the association rule mining technique. The identified patterns are analyzed to offer helpful and constructive recommendations to new car buyers to enhance their selection process. Association rule mining has been applied to cars/automobiles for analysis of their performance. In this work, the association rule mining technique is used to find hidden patterns and to evaluate automobiles' performance and trends. The Apriori algorithm is used for finding associations among attributes.
The automobile performance was evaluated based on data collected from the market, including attributes like price, mileage, etc. After that, the Zero-R classification algorithm was used. The data mining tool used in the experiment was WEKA 3.8.2. Based on the accuracy and the classification errors, one may conclude that the Zero-R classification method was the best-suited algorithm for this dataset. The Apriori algorithm was applied to the dataset using WEKA to analyze overall automobile performance through some of the best rules. The data may be extended to collect additional technical characteristics of the automobiles and mined with different classification algorithms to predict automobile performance.
In future work, the authors are also interested in working on data for each and every automobile model present in the market. This may reveal what kinds of construction mechanisms are adopted for automobile models that share the same characteristics, and may also provide various multidimensional summary reports.