MLCSE3
VISION
To be a premier institute of knowledge sharing, quality research and development of technologies towards nation building.
MISSION
2. Collaborate with industries and academia towards training, research, innovation and entrepreneurship.
3. Create a platform for active participation in co-curricular and extra-curricular activities.
PROGRAM EDUCATIONAL OBJECTIVES (PEOs) AND PROGRAM SPECIFIC OUTCOMES (PSOs)
PEO1: Strong foundation of knowledge and skills in the field of Computer Science and
Engineering.
PEO2: Provide solutions to challenging problems in their profession by applying computer
engineering theory and practices.
PEO3: Demonstrate leadership and be effective in multidisciplinary environments.
PSO1: Ability to design and develop computer programs and understand the structure and development methodologies of software systems.
PSO2: Ability to apply their skills in the field of networking, web design, cloud computing
and data analytics.
PSO3: Ability to understand the basic and advanced computing technologies towards getting employed or becoming an entrepreneur.
1. Students are advised to come to the laboratory at least 5 minutes before the starting time; those who come after 5 minutes will not be allowed into the lab.
2. Plan your task properly well before the commencement, and come prepared to the lab with the synopsis / program / experiment details.
3. Student should enter into the laboratory with:
a. Laboratory observation notes with all the details (Problem statement, Aim, Algorithm,
Procedure, Program, Expected Output, etc.,) filled in for the lab session.
b. Laboratory Record updated up to the last session's experiments, and any other materials (if any) needed in the lab.
c. Proper Dress code and Identity card.
4. Sign in the laboratory login register, write the TIME-IN, and occupy the computer system
allotted to you by the faculty.
5. Execute your task in the laboratory, and record the results / output in the lab observation note
book, and get certified by the concerned faculty.
6. All the students should be polite and cooperative with the laboratory staff, and must maintain discipline and decency in the laboratory.
7. Computer labs are established with sophisticated and high end branded systems, which should
be utilized properly.
8. Students/Faculty must keep their mobile phones in SWITCHED OFF mode during the lab sessions. Misuse of the equipment or misbehaviour with the staff and systems will attract severe punishment.
9. Students must take the permission of the faculty in case of any urgency to go out; anybody found loitering outside the lab / class without permission during working hours will be treated seriously and punished appropriately.
10. Students should LOG OFF / SHUT DOWN the computer system before leaving the lab after completing the task (experiment) in all aspects, and must ensure the system and seat are left in proper condition.
Course Objectives:
1. The lab course provides hands-on experimentation for gaining practical orientation on different Machine Learning concepts. Specifically, students learn:
2. To write programs for various data exploration and analytics tasks and techniques in R
Programming language.
3. To apply various Machine learning techniques available in WEKA for data exploration and
pre-processing datasets containing numerical and categorical attributes.
4. To apply various Machine Learning techniques available in WEKA for extracting patterns / knowledge and interpret the resulting patterns.
Course Outcomes:
1. Understand the data structures available in R programming and write R programs to perform several data analytics operations such as plotting, boxplots, normalization, discretization, transformation and attribute selection on datasets.
2. Write R programs to build regression and classification models for numerical and categorical datasets and evaluate the models with appropriate performance metrics.
3. Understand and use the WEKA Explorer for data exploration, visualization, and other data pre-processing tasks on numerical and categorical datasets.
4. Extract association rules using the Apriori and FP-Growth methods available in WEKA and interpret the resulting patterns.
5. Build and cross-validate models for Classification and Clustering on labelled and unlabelled datasets respectively using different methods.
SYLLABUS
1. Load the ‘iris.csv’ file and display the names and type of each column. Find statistics such as min, max,
range, mean, median, variance, standard deviation for each column of data. Repeat the above for
‘mtcars.csv’ dataset also.
2.Write R program to normalize the variables into 0 to 1 scale using min-max normalization
3.Generate histograms for each feature / variable (sepal length/ sepal width/ petal length/ petal width)
and generate scatter plots for every pair of variables showing each species in a different color.
4.Generate box plots for each of the numerical attributes. Identify the attribute with the highest
variance.
5.Study of homogeneous and heterogeneous data structures such as vector, matrix, array, list, data
frame in R.
6.Write R Program using ‘apply’ group of functions to create and apply normalization function on
each of the numeric variables/columns of iris dataset to transform them into a value around 0 with z-score normalization.
7.Write R Program using ‘apply’ group of functions to create and apply discretization function on
each of the numeric variables/ features of iris dataset to transform them into 3 levels designated as
“Low, Medium, High” values based on equi-width quantiles such that each variable gets nearly equal
number of data points in each level.
8. Write R program to fit a simple linear regression model for 'Evaporation Coefficient' as a function of 'Air Velocity' for the following data:
a) Air Velocity: 20, 60, 100, 140, 180, 220, 260, 300, 340, 380
   Evaporation Coefficient: 0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65
b) Analyze the significance of residual standard-error value, R-squared value, F-statistic. Find the correlation coefficient for this data and analyze the significance of the correlation value.
c) Perform a log transformation on the 'Air Velocity' column, perform linear regression again, and analyze all the relevant values.
9. Write R program for reading ‘state.x77’ dataset into a data frame and apply multiple regression to predict the value of the variable ‘Murder’ based on the other independent variables and their correlations.
10. Write R program to split ‘Titanic’ dataset into training and test partitions and build a decision tree for predicting whether a passenger survived or not, given the description of the person who travelled. Evaluate the performance metrics from the confusion matrix.
11. Create an ARFF (Attribute-Relation File Format) file and read it in WEKA. Explore the purpose
of each button under the pre-process panel after loading the ARFF file. Also, try to interpret using a
different ARFF file, weather.arff, provided with WEKA.
12. Performing data preprocessing in WEKA: Study Unsupervised Attribute Filters such as ReplaceMissingValues to replace missing values in the given dataset, Add to add the new attribute Average, and Discretize to discretize the attributes into bins. Explore Normalize and Standardize options on a dataset with numerical attributes.
13. Demonstration of classification process using J48 algorithm on mixed type of dataset after discretizing numeric attributes.
14. Perform cross-validation strategy with various fold levels. Compare the accuracy of the results.
15. Association rule analysis in WEKA
Demonstration of Association Rule Mining on supermarket dataset using Apriori Algorithm with
different support and confidence thresholds.
Demonstration of Association Rule Mining on supermarket dataset using FP- Growth Algorithm with
different support and confidence thresholds.
16. Performing clustering in WEKA
Apply hierarchical clustering algorithm on numeric dataset and estimate cluster quality. Apply
DBSCAN algorithm on numeric dataset and estimate cluster quality.
R Programming Lab
1. Load the iris dataset and display the names and type of each column. Find statistics such as min, max, range, mean, median, variance and standard deviation for each column.
# the iris dataset is an inbuilt dataset available in R / RStudio and can be used via the 'iris' identifier
print(iris)
print(names(iris))
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
For displaying the names along with their types we use the lapply function using class as second argument
print(lapply(iris,class))
## $Sepal.Length
## [1] "numeric"
##
## $Sepal.Width
## [1] "numeric"
##
## $Petal.Length
## [1] "numeric"
##
## $Petal.Width
## [1] "numeric"
##
## $Species
## [1] "factor"
print(mean(iris$Sepal.Length))
## [1] 5.843333
print(mean(iris$Sepal.Width))
## [1] 3.057333
print(mean(iris$Petal.Length))
## [1] 3.758
print(mean(iris$Petal.Width))
## [1] 1.199333
print(median(iris$Sepal.Length))
## [1] 5.8
print(median(iris$Sepal.Width))
## [1] 3
print(median(iris$Petal.Length))
## [1] 4.35
print(median(iris$Petal.Width))
## [1] 1.3
print(min(iris$Sepal.Length))
## [1] 4.3
print(max(iris$Sepal.Length))
## [1] 7.9
print(min(iris$Sepal.Width))
## [1] 2
print(max(iris$Sepal.Width))
## [1] 4.4
print(min(iris$Petal.Length))
## [1] 1
print(max(iris$Petal.Length))
## [1] 6.9
print(min(iris$Petal.Width))
## [1] 0.1
print(max(iris$Petal.Width))
## [1] 2.5
Finding the range of each column: We use range() to find the range of each column. It returns a vector containing the minimum and maximum of the given argument.
print(range(iris$Sepal.Length))
## [1] 4.3 7.9
print(range(iris$Sepal.Width))
## [1] 2.0 4.4
print(range(iris$Petal.Length))
## [1] 1.0 6.9
print(range(iris$Petal.Width))
## [1] 0.1 2.5
Finding the variance of each column: We use the var() function to find the variance of each column.
print(var(iris$Sepal.Length))
## [1] 0.6856935
print(var(iris$Sepal.Width))
## [1] 0.1899794
print(var(iris$Petal.Length))
## [1] 3.116278
print(var(iris$Petal.Width))
## [1] 0.5810063
Finding the standard deviation of each column: We use the sd() function to find the standard deviation of each column.
print(sd(iris$Sepal.Length))
## [1] 0.8280661
print(sd(iris$Sepal.Width))
## [1] 0.4358663
print(sd(iris$Petal.Length))
## [1] 1.765298
print(sd(iris$Petal.Width))
## [1] 0.7622377
summary(iris)
2. R program to normalize the variables into 0 to 1 scale using min-max normalization
The formula to achieve min-max normalization is:
y = (x - min) / (max - min)
#dummy data
x = sample(-100:100, 50)
print("original data")
## [1] "original data"
print(x)
## [1] -31 -85 100  45  38  69 -62  66 -61 -89 -40 -81 -91  67  42  39 -68  63 -95
## [20]   8  14  56  26 -99  22  71 -26  91  92 -20 -39  73  96  23 -79  87 -18 -93
## [39]  82   4 -56  94   0 -17 -83   5 -52 -21 -23  -6
maximum = max(x)
minimum = min(x)
normalized = (x-minimum)/(maximum-minimum)
print("Normalized data")
## [1] "Normalized data" print(normalized)
par(mfrow=c(1,2))
hist(x, breaks = 10, xlab = "Data", col = "lightblue")
hist(normalized, breaks = 10, xlab = "Normalized data", col = "yellow")
3. Generate histograms for any one variable and generate scatter plots for every pair of
variables showing each species in different colour on iris dataset.
Generating a histogram for any one variable; let it be sepal length.
hist(iris$Sepal.Length, col = "yellow", xlab = "Sepal length in cm", main = "Histogram of Sepal lengths")
# correlation panel: displays the correlation coefficient for each pair of variables
panel.cor = function(x, y, ...){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r = round(cor(x, y), digits = 2)
  txt = paste0("R = ", r)
  cex.cor = 0.8 / strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * abs(r))  # abs() keeps cex positive when the correlation is negative
}
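The panel function above can be passed to pairs() to draw the scatter-plot matrix. A minimal sketch (an assumed call, colouring the points by the Species factor):
pairs(iris[, 1:4], col = iris$Species, upper.panel = panel.cor,
      main = "Scatter plots for every pair of iris variables")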
4. Generate box plots for each of the numerical attributes. Identify the attribute with the highest variance.
We will be using the built-in "airquality" dataset for this program. It contains daily air quality measurements in New York, May to September 1973.
Finding the variance is simple: the spread of the boxplot indicates the variance. The greater the spread of a boxplot, the higher its variance.
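A minimal sketch (assuming the four numeric measurement columns Ozone, Solar.R, Wind and Temp of airquality) of the box plots and a variance check:
# box plots for each numerical attribute
boxplot(airquality[, 1:4], main = "Box plots of airquality attributes")
# variances, ignoring missing values; the largest value identifies the attribute with the highest variance
sapply(airquality[, 1:4], var, na.rm = TRUE)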
The five-number summary is the minimum, first quartile, median, third quartile, and maximum. In a box
plot, we draw a box from the first quartile to the third quartile. A vertical line goes through the box at the
median. The whiskers go from each quartile to the minimum or maximum.
Example:
A sample of 10 boxes of raisins has these weights (in grams): 25, 28, 29, 29, 30, 34, 35, 35, 37, 38
1. Order the data from smallest to largest: 25, 28, 29, 29, 30, 34, 35, 35, 37, 38
2. Find the median: the middle two values are 30 and 34, so the median is (30 + 34) / 2 = 32.
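A quick check of this five-number summary in R (using the raisin weights above):
weights <- c(25, 28, 29, 29, 30, 34, 35, 35, 37, 38)
fivenum(weights)   # 25 29 32 35 38: minimum, lower hinge, median, upper hinge, maximum
boxplot(weights)   # the box spans the quartiles, with a line at the median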
5. Study of homogeneous and heterogeneous data structures such as vector, matrix,
array, list and data frame in R.
Data Structure:
A data structure is a particular way of organizing data in a computer so that it can be used
effectively. The idea is to reduce the space and time complexities of different tasks. Data structures
in R programming are tools for holding multiple values.
R's base data structures are often organized by their dimensionality (1D, 2D, or nD) and whether they're homogeneous (all elements must be of the identical type) or heterogeneous (the elements can be of various types). This gives rise to the five data types which are most frequently used in data analysis. The following table shows a clear-cut view of these data structures.
Dimension   Homogeneous   Heterogeneous
1D          Vector        List
2D          Matrix        Data frame
nD          Array
Vector:
A vector is an ordered collection of basic data types of a given length. The key point is that all the elements of a vector must be of the identical data type, i.e. vectors are homogeneous data structures. Vectors are one-dimensional data structures.
How to create a Vector?
Vectors are generally created using the c() function.
> X = c(1, 3, 5, 7, 8)
> print(X)
[1] 1 3 5 7 8
> typeof(X)
[1] "double"
> length(X)
[1] 5
> x <- c(1, 5.4, TRUE, "hello")
> typeof(x)
[1] "character"
> length(x)
[1] 4
> x
[1] "1" "5.4" "TRUE" "hello"
> X
[1] 1 3 5 7 8
Creating a vector using the colon (:) operator:
> x <- 1:7
> x
[1] 1 2 3 4 5 6 7
> y <- 2:-2
> y
[1] 2 1 0 -1 -2
Creating a vector using seq() function:
> z <- seq(1, 3, by=0.2)
> z
[1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
> a<-seq(1, 5, length.out=4)
> a
[1] 1.000000 2.333333 3.666667 5.000000
Accessing elements using integer vector as index:
Vector indexing in R starts from 1, unlike most programming languages where indexing starts from 0. We can use a vector of integers as an index to access specific elements. We can also use negative integers to return all elements except those specified. But we cannot mix positive and negative integers while indexing, and real numbers, if used, are truncated to integers.
> x = c(0, 2, 4, 6, 8, 10)
> x
[1]  0  2  4  6  8 10
> x[3]
[1] 4
> x[c(2, 4)]
[1] 2 6
> x[-1]    # access all except the 1st element
[1]  2  4  6  8 10
Accessing elements using logical vector as index:
When we use a logical vector for indexing, the position where the logical vector is TRUE is
returned. This useful feature helps us in filtering a vector, as shown below.
> x[c(TRUE, FALSE, FALSE, TRUE)]
[1] 0 6 8
> x[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
[1] 0 6 8 10
> x[c(TRUE, FALSE, FALSE)]
[1] 0 6
> x[c(TRUE)]
[1] 0 2 4 6 8 10
> x[c(TRUE, TRUE)]
[1] 0 2 4 6 8 10
> x[c(TRUE, FALSE)]
[1] 0 4 8
> x[x < 0]
numeric(0)
> x[x < 4]
[1] 0 2
> x[x > 4]
[1]  6  8 10
Modifying vectors
We can modify a vector using the assignment operator. We can use the techniques discussed above to
access specific elements and modify them. If we want to truncate the elements, we can use reassignments.
> x = c(-3, -2, -1, 0, 1, 2)
> x
[1] -3 -2 -1  0  1  2
> x[2] <- 0
> x
[1] -3  0 -1  0  1  2
> x[x < 0] <- 5   # modify elements less than 0 as 5
> x
[1] 5 0 5 0 1 2
> x <- x[1:4]     # truncate x to first 4 elements
> x
[1] 5 0 5 0
Lists:
A list is a generic object consisting of an ordered collection of objects. Lists are heterogeneous data structures. These are also one-dimensional data structures. A list can be a list of vectors, a list of matrices, a list of characters, a list of functions, and so on.
Creating Lists
List can be created using the list() function.
> x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)
> str(x)
List of 3
$ a: num 2.5
$ b: logi TRUE
$ c: int [1:3] 1 2 3
The structure of the list can be returned using the str() function.
In this example, a, b and c are called tags, which makes it easier to reference the components of the list. However, tags are optional. We can create the same list without the tags as follows; in such a scenario, numeric indices are used by default.
> x <- list(2.5,TRUE,1:3)
> x
[[1]]
[1] 2.5
[[2]]
[1] TRUE
[[3]]
[1] 1 2 3
[[2]]
[1] "Debi" "Sandeep" "Subham" "Shiba"
[[3]]
[1] 4
[[2]]
[,1] [,2] [,3]
[1,] 1 3 -1
[2,] 2 4 9
[[3]]
[[3]][[1]]
[1] "Red"
[[3]][[2]]
[1] 12.3
$Matrix
$Misc
$Misc[[1]]
[1] "Red"
$Misc[[2]]
[1] 12.3
Arrays:
Arrays are homogeneous data structures that can store data in more than two dimensions; they are created with the array() function by supplying the data and a dim vector. For example:
v1 = c(1,2,3)
v2 = c(4,5,6,7,8,9)
arr = array(c(v1, v2), dim = c(3, 3, 2))   # assumed call; values are recycled to fill two 3x3 matrices
print(arr)
## ,,1
##
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## ,,2
##
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Data Frames
A data frame is a tabular data object, or two-dimensional array-like structure, in which each column contains values of one variable and each row contains one set of values from each column. Unlike matrices, each column of a data frame can contain different modes of data.
costs = data.frame(
  name = c("carrot", "apple", "sugar"),
  costPerKG = c(50.00, 60.00, 39.50),
  QuantityAvailableinKGs = c(10, 5, 50))
print(costs)
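A short usage sketch (assuming the costs data frame above) of how columns and filtered rows can be accessed:
print(costs$costPerKG)                               # a single column as a vector
print(costs[costs$QuantityAvailableinKGs > 5, ])     # rows satisfying a condition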
6) R program using apply group of functions to create and apply normalization function on each of the numeric variables/columns of iris dataset to transform them into a value around 0 with z-score normalization.
We can achieve z-score normalization in R by using the function scale(X, center = TRUE, scale = TRUE), where X refers to the data. We need to apply it to all the columns of the iris dataset which hold numeric data; to do that we use the apply() function.
# Load the iris dataset
data(iris)
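A minimal sketch of the apply step (an assumed implementation that transforms only the four numeric columns):
# 2 means the function is applied column-wise; scale() produces z-scores
iris_z <- apply(iris[, 1:4], 2, scale)
head(iris_z)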
Output:
7. Write R program using apply group of functions to create and apply discretization function on each of the numeric variables/features of iris dataset to transform them into 3 levels designated as "Low, Medium, High" values based on quantiles such that each variable gets a nearly equal number of data points in each level.
To achieve discretization of the numeric variables in the iris dataset, we can use the apply family of functions in R. The goal is to transform each numeric feature into three categories ("low", "medium", "high") based on equal-frequency (quantile) binning, so that each bin contains approximately the same number of data points.
Here’s how you can write an R program that uses the apply family of functions to achieve this:
data(iris)
# Discretization function: cut a numeric vector into 3 quantile-based bins
discretize <- function(x) {
  cut(x, breaks = quantile(x, probs = 0:3 / 3), include.lowest = TRUE,
      labels = c("low", "medium", "high"))
}
# Apply the discretization function to each numeric column of iris, excluding the Species column
iris_discretized <- as.data.frame(lapply(iris[, 1:4], discretize))
iris_discretized$Species <- iris$Species
head(iris_discretized)
Output:
The values in the columns Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width are now categorized as "low", "medium", or "high" based on quantiles. The Species column remains unchanged.
8. Write R program to fit a simple linear regression model for 'Evaporation Coefficient' as a function of 'Air Velocity'.
The simple linear regression model is
y = β1 + β2 x + ε      ... (1)
where β1 is the intercept, β2 is the slope, and ε is the error term.
Problem Specification
In the given problem, 'Air Velocity' and 'Evaporation Coefficient' are the variables, with 10 observations. The goal here is to establish a mathematical equation for 'Evaporation Coefficient' as a function of 'Air Velocity', so you can use it to predict 'Evaporation Coefficient' when only the 'Air Velocity' is known. So, it is desirable to build a linear regression model with the response variable as 'Evaporation Coefficient' and the predictor as 'Air Velocity'. Before we begin building the regression model, it is good practice to analyse and understand the variables.
> airvelocity<-c(20,60,100,140,180,220,260,300,340,380)
> evaporationcoefficient <- c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)
> airvelocity
[1] 20 60 100 140 180 220 260 300 340 380
> evaporationcoefficient
[1] 0.18 0.37 0.35 0.78 0.56 0.75 1.18 1.36 1.17 1.65
Graphical analysis
The aim of this exercise is to build a simple regression model that you can use to predict 'Evaporation Coefficient'. But before jumping into the syntax, let's try to understand these variables graphically.
Typically, for each of the predictors, the following plots help visualize the patterns:
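A minimal sketch (assuming the airvelocity and evaporationcoefficient vectors created above) that draws the scatter plot with a smoothing line:
scatter.smooth(airvelocity, evaporationcoefficient,
               xlab = "Air Velocity", ylab = "Evaporation Coefficient",
               main = "Air Velocity vs Evaporation Coefficient")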
The scatter plot along with the smoothing line above suggests a linear and positive relationship between 'Air Velocity' and 'Evaporation Coefficient'.
This is a good thing. Because, one of the underlying assumptions of linear regression is,
the relationship between the response and predictor variables is linear and additive.
a) Analyze the significance of residual standard error value, R-squared value, F-statistic. Find the
correlation coefficient for this data and analyze the significance of the correlation value.
m1 = lm(dist ~ speed, data = cars)
summary(m1)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
summary.aov(m1) # To get the sums of squares and mean squares
## Df Sum Sq Mean Sq F value Pr(>F)
## speed 1 21185 21185 89.57 1.49e-12 ***
## Residuals 48 11354 237
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#calculate sums of squares total, residual and model
y = cars$dist
ybar = mean(y)
# ss is sum of squares
ss.total = sum((y - ybar)^2)
print(ss.total)
## [1] 32538.98
ss.residual=sum((y-m1$fitted)^2)
print(ss.residual)
## [1] 11353.52
ss.model = ss.total - ss.residual
print(ss.model)
## [1] 21185.46
n = length(cars$speed)
k = length(m1$coef)
df.total = n - 1
df.residual = n - k
df.model = k - 1
ms.residual = ss.residual / df.residual
print(ms.residual)
## [1] 236.5317
ms.model = ss.model / df.model
print(ms.model)
## [1] 21185.46
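A short continuation (a sketch using the quantities computed above) that recovers the F-statistic and R-squared from the sums of squares, matching the values reported by summary(m1):
f.value = ms.model / ms.residual    # approx. 89.57
r.squared = ss.model / ss.total     # approx. 0.6511
print(f.value)
print(r.squared)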
linearModel = lm(y~x)
print(linearModel)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -1.457 0.456
#we now apply predict() function and set the predictor variable in the pdata argument
result = predict(linearModel,pdata)
print(result)
## 1
## 180.9577
#we can also set the interval type as "predict" without changing the default 0.95 confidence level
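A minimal sketch (assuming the linearModel and pdata objects above) of requesting a prediction interval at the default 0.95 level:
result_interval = predict(linearModel, pdata, interval = "prediction", level = 0.95)
print(result_interval)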
3. R program using apply group of functions to create and apply normalization function on each of the numeric variables/columns of iris dataset to transform them into a value around 0 with z-score normalization.
We can achieve z-score normalization in R by using the function scale(X, center = TRUE, scale = TRUE), where X refers to the data. We need to apply it to all the columns of the iris dataset which hold numeric data; to do that we use the apply() function.
#We are using apply function to implement scale() function on every column.
#Here 2 means column
12. Write a R program for reading 'state.x77' dataset into a data frame and apply multiple regression to predict the value of the variable 'Murder' based on the other independent variables and their correlations.
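A minimal sketch (assumptions: the built-in state.x77 matrix is converted to a data frame, its column names are made syntactically valid, and all remaining variables serve as predictors):
st <- as.data.frame(state.x77)
names(st) <- make.names(names(st))      # e.g. "Life Exp" becomes "Life.Exp"
print(cor(st)[, "Murder"])              # correlations of Murder with the other variables
model <- lm(Murder ~ ., data = st)      # multiple regression for Murder
summary(model)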
11. Write R program to split 'Titanic' dataset into training and test partitions and build a decision tree for predicting whether a passenger survived or not, given the description of the person who travelled. Evaluate the performance metrics from the confusion matrix.
# Load necessary libraries
library(rpart) # For decision tree
library(caret) # For confusion matrix and other metrics
library(dplyr) # For data manipulation
# Load the Titanic dataset (assumption: the 'titanic' package, whose titanic_train data frame
# provides the passenger-level attributes used below)
library(titanic)
titanic_data <- titanic_train
# View the structure of the dataset
str(titanic_data)
# Clean the dataset (e.g., handle missing values)
# For simplicity, we'll remove rows with missing values in the 'Survived' column
titanic_cleaned <- titanic_data %>%
  filter(!is.na(Survived))
# Convert factors to appropriate levels if needed
titanic_cleaned$Survived <- as.factor(titanic_cleaned$Survived)
# Split the dataset into training and test sets (70% train, 30% test)
set.seed(123) # For reproducibility
train_index <- createDataPartition(titanic_cleaned$Survived, p = 0.7, list = FALSE)
train_data <- titanic_cleaned[train_index, ]
test_data <- titanic_cleaned[-train_index, ]
# Build the decision tree model to predict 'Survived'
decision_tree_model <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare,
data = train_data, method = "class")
# Print the decision tree model
print(decision_tree_model)
# Predict on the test set
predictions <- predict(decision_tree_model, test_data, type = "class")
# Evaluate performance using confusion matrix
conf_matrix <- confusionMatrix(predictions, test_data$Survived)
# Print the confusion matrix and performance metrics
print(conf_matrix)
# Extract additional performance metrics from the confusion matrix
accuracy <- conf_matrix$overall['Accuracy']
precision <- conf_matrix$byClass['Pos Pred Value']
recall <- conf_matrix$byClass['Sensitivity']
f1_score <- 2 * (precision * recall) / (precision + recall)
# Display performance metrics
cat("Accuracy: ", accuracy, "\n")
cat("Precision: ", precision, "\n")
cat("Recall: ", recall, "\n")
cat("F1 Score: ", f1_score, "\n")
WEKA
11. Create an ARFF (Attribute-Relation File Format) file and read it in WEKA. Explore the purpose
of each button under the preprocess panel after loading the ARFF file. Also, try to interpret using
a different ARFF file, weather.arff, provided with WEKA.
ARFF is WEKA's native data storage format. ARFF is an acronym for Attribute-Relation File Format. The bulk of the ARFF file consists of a list of instances, and the attribute values for each instance are separated by commas. This format is similar to the CSV file format and can also be seen as an extension of the normal CSV format.
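A minimal hand-written ARFF file (a hypothetical 'student' relation, for illustration) that can be saved with a .arff extension and opened in the Preprocess panel:
@relation student
@attribute name string
@attribute age numeric
@attribute grade {A, B, C}
@data
'John', 21, A
'Mary', 22, B
'Ravi', 20, C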
Current Relation:
It describes the dataset by giving us information such as the number of instances, attributes, etc.
Selected Attribute:
This explains the characteristics of a specific attribute's data, such as its type (Nominal or Numeric) and the number of missing values.
Histogram:
This will be generated for the attribute currently selected. You can draw it for any other attribute. Here 'play' is selected as the class attribute; it is used to colour the histogram, and any filters that require a class value use it too.
If you select a numeric attribute, you see its minimum and maximum values, mean, and standard deviation. In this case, the histogram will show the distribution of the class as a function of this attribute.
Remove:
You can delete an attribute by clicking its checkbox and using the Remove button.
Invert:
It is used to invert the selection.
Pattern:
Pattern selects those attributes whose names match a user-supplied regular expression.
Undo:
It can be used to undo the changes we made.
Edit:
This button is used to edit and inspect the data loaded.
12. Performing data preprocessing in WEKA: Study Unsupervised Attribute Filters such as ReplaceMissingValues to replace missing values in the given dataset, Add to add the new attribute Average, and Discretize to discretize the attributes into bins. Explore Normalize and Standardize options on a dataset with numerical attributes.
Study of some of the Unsupervised Attribute Filters
ReplaceMissingValues replaces each missing value by the mean for numeric attributes and the mode for nominal ones.
If a class is set, missing values of that attribute are not replaced by default, but this can be changed.
ReplaceMissingWithUserConstant is another filter that can replace missing values. In this case, it allows the user to specify a constant value to use. This constant can be specified separately for numeric, nominal and date attributes.
Add:
Add inserts an attribute at a given position, whose value is declared to be missing for all instances.
Use the generic object editor to specify the attribute’s name, where it will appear in the list of
attributes, and its possible values (for nominal attributes); for date attributes, you can also specify the
date format.
Discretize:
Discretize uses equal-width or equal-frequency binning to discretize a range of numeric attributes, specified in the usual way. For the former method, the number of bins can be specified or chosen automatically by maximizing the likelihood using leave-one-out cross-validation. It is also possible to create several binary attributes instead of one multi-valued one. For equal-frequency discretization, the desired number of instances per interval can be changed.
PKIDiscretize discretizes numeric attributes using equal-frequency binning; the number of bins is
the square root of the number of values (excluding missing values). Both these filters skip the class
attribute by default.
Exploring Normalize and Standardize options on the dataset with numerical attributes.
Here we are selecting iris data set as the data set we are using to explore Normalize and Standardize
options.
Normalize:
Normalize scales all numeric values in the dataset to lie between 0 and 1. The normalized values
can be further scaled and translated with user-supplied constants.
As you can see, the maximum value is 1 and the minimum is 0, indicating normalization.
Standardize:
Center and Standardize transform the data to have zero mean. Standardize gives them unit variance
too. All three skip the class attribute if set.
If we use the Standardize filter instead of the Normalize filter, then the output will be:
As you can see, the mean is 0 and the standard deviation is 1, which is the characteristic of Standardize. But the data may not be in the range of 0 and 1.
Now you can find the id3 algorithm under the trees section.
Classifier Output:
=== Run information ===
Scheme: weka.classifiers.trees.Id3
Relation: weather.symbolic
Instances: 14
Attributes: 5
              outlook
              temperature
              humidity
              windy
              play
Test mode: 10-fold cross-validation
outlook = sunny
| humidity = high: no
| humidity = normal: yes
outlook = overcast: yes
outlook = rainy
| windy = TRUE: no
| windy = FALSE: yes
Demonstration of classification process using naïve Bayes algorithm on categorical dataset (‘vote’):
Load the vote dataset by opening vote.arff
              export-administration-act-south-africa
              Class
Test mode: 10-fold cross-validation

Attribute          democrat  republican
                     (0.61)      (0.39)
handicapped-infants
n 103.0 135.0
y 157.0 32.0
[total] 260.0 167.0
water-project-cost-sharing
n 120.0 74.0
y 121.0 76.0
[total] 241.0 150.0
adoption-of-the-budget-resolution
n 30.0 143.0
y 232.0 23.0
[total] 262.0 166.0
physician-fee-freeze
n 246.0 3.0
y 15.0 164.0
[total] 261.0 167.0
el-salvador-aid
n 201.0 9.0
y 56.0 158.0
[total] 257.0 167.0
religious-groups-in-schools
n 136.0 18.0
y 124.0 150.0
[total] 260.0 168.0
anti-satellite-test-ban
n 60.0 124.0
y 201.0 40.0
[total] 261.0 164.0
aid-to-nicaraguan-contras
n 46.0 134.0
y 219.0 25.0
[total] 265.0 159.0
mx-missile
n 61.0 147.0
y 189.0 20.0
[total] 250.0 167.0
immigration
n 140.0 74.0
y 125.0 93.0
[total] 265.0 167.0
synfuels-corporation-cutback
n 127.0 139.0
y 130.0 22.0
[total] 257.0 161.0
education-spending
n 214.0 21.0
y 37.0 136.0
[total] 251.0 157.0
superfund-right-to-sue
n 180.0 23.0
y 74.0 137.0
[total] 254.0 160.0
crime
n 168.0 4.0
y 91.0 159.0
[total] 259.0 163.0
duty-free-exports
n 92.0 143.0
y 161.0 15.0
[total] 253.0 158.0
export-administration-act-south-africa
n 13.0 51.0
y 174.0 97.0
[total] 178.0 148.0
Classifier output:
=== Run information ===
Instances: 4627
Attributes: 217
[ list of attributes omitted ]
Test mode: 10-fold cross-validation
RandomForest
Applying J48 algorithm:
Select the classifier
Number of Leaves : 10
=== Run information ===
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Test mode: 20-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
Number of Leaves : 10
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.980 0.000 1.000 0.980 0.990 0.985 1.000 1.000 Iris-setosa
0.980 0.060 0.891 0.980 0.933 0.900 0.974 0.956 Iris-versicolor
0.900 0.010 0.978 0.900 0.938 0.910 0.978 0.947 Iris-virginica
Weighted Avg. 0.953 0.023 0.956 0.953 0.954 0.932 0.984 0.967
=== Confusion Matrix ===
  a  b  c   <-- classified as
49 1 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 5 45 | c = Iris-virginica
Folds   Accuracy
10      96%
20      95.333%
30      96%
Let us take iris dataset as all the attributes are numerical except the class attribute.
Cluster 0
((((((((((((((((((((0.2:0.03254,0.2:0.03254):0.00913,(0.3:0.03254,0.3:0.03254):0.00913):0.00332,
((0.2:0.02778,0.2:0.02778):0.00476,0.2:0.03254):0.01244):0,0.2:0.04498):0.0051,0.2:0.05008):0.003
64,0.2:0.05371):0.00437,(0.2:0.05085,0.2:0.05085):0.00724):0.01535,
(0.5:0.06731,0.4:0.06731):0.00612):0.00188,0.2:0.07531):0.00196,0.3:0.07728):0.00536,
((((((0.2:0.04383,0.2:0.04383):0.00625,0.3:0.05008):0,0.1:0.05008):0.00279,
(((((0.2:0.03254,0.2:0.03254):0.01129,0.2:0.04383):0.00116,0.2:0.04498):0.0051,0.2:0.05008):0.002
79,((0.1:0,0.1:0):0,0.1:0):0.05287):0):0.00522,0.2:0.05808):0.01919,
((0.2:0.04498,0.2:0.04498):0.01549,0.1:0.06047):0.0168):0.00536):0.00165,0.2:0.08429):0.00356,
(((0.2:0.02778,0.2:0.02778):0.04371,
((0.3:0.04498,0.2:0.04498):0.01394,0.4:0.05893):0.01256):0.00809,0.4:0.07958):0.00826):0.00212,0
.4:0.08996):0.00321,0.6:0.09317):0.00598,
(0.4:0.0678,0.4:0.0678):0.03135):0.00292,0.3:0.10206):0.01316,0.2:0.11523):0.01375,(0.2:0.12263,
(0.1:0.10346,0.2:0.10346):0.01917):0.00634):0.00241,0.4:0.13139)
Cluster 2
(((((((((((((((((((((((((((((1.4:0.07344,(((1.5:0.06508,1.5:0.06508):0.00066,
(1.4:0.05008,1.4:0.05008):0.01566):0.00224,1.3:0.06798):0.00546):0.00188,(1.3:0.07137,
(1.3:0.05556,1.3:0.05556):0.01581):0.00395):0.00733,(1.5:0.07137,
((1.4:0.04498,1.4:0.04498):0.01549,1.5:0.06047):0.01089):0.01127):0.00515,1.4:0.08779):0.00538,1
.2:0.09317):0.00405,1.5:0.09722):0.0004,(1.5:0.05556,1.5:0.05556):0.04207):0.00152,
(1.5:0.07344,1.6:0.07344):0.02571):0,1.6:0.09914):0.00219,1.5:0.10133):0.00073,1.6:0.10206):0.00
14,(((((1.3:0.08333,1.3:0.08333):0.00613,((((1.3:0.06574,((1.3:0.05287,1.2:0.05287):0,(1.3:0.05287,
(1.3:0.04498,1.3:0.04498):0.00789):0):0.01287):0.0077,
(1.2:0.04498,1.2:0.04498):0.02845):0,1.2:0.07344):0.0093,(1.1:0.05287,
(1.1:0.04498,1.0:0.04498):0.00789):0.02987):0.00672):0.0005,1.0:0.08996):0.00406,1.0:0.09402):0.
00041,1.3:0.09443):0.00902):0.00268,1.7:0.10614):0.00342,((((((1.8:0.08784,
((1.8:0.03254,1.8:0.03254):0.0254,1.8:0.05794):0.0299):0.00162,(1.9:0.08429,
(1.8:0.05287,1.8:0.05287):0.03142):0.00518):0.00524,1.9:0.0947):0.01144,(2.2:0.09415,
(2.1:0.04167,2.2:0.04167):0.05249):0.01199):0,(((1.8:0.07148,
(1.8:0.05008,1.8:0.05008):0.02141):0.02614,(2.0:0.08504,2.0:0.08504):0.01258):0.00852,
(((2.1:0.05287,2.1:0.05287):0.04475,((((2.3:0.04383,2.3:0.04383):0.03881,2.4:0.08264):0.00719,
(2.3:0.07148,2.3:0.07148):0.01834):0.00487,2.5:0.0947):0.00292):0.00534,2.1:0.10296):0.00318):0)
:0.00129,2.1:0.10743):0.00214):0.00446,((2.5:0.08983,
(2.4:0.06047,2.3:0.06047):0.02935):0.01175,2.3:0.10158):0.01245):0.01212,1.4:0.12614):0.00283,1.
4:0.12897):0.00054,1.5:0.12951):0.00514,
(((1.9:0,1.9:0):0.08779,2.0:0.08779):0.01089,2.0:0.09869):0.03597):0.01023,
((1.5:0.09869,1.3:0.09869):0.00264,1.5:0.10133):0.04356):0.00338,
(((2.1:0.09869,2.0:0.09869):0.02337,2.3:0.12206):0.01586,((1.8:0.07344,1.9:0.07344):0.05554,
(1.8:0.12263,1.6:0.12263):0.00634):0.00895):0.01034):0.00275,1.8:0.15102):0.00299,2.3:0.15401):0
.00606,
(((1.0:0.05008,1.0:0.05008):0.04555,1.1:0.09562):0.03389,1.0:0.12951):0.03056):0.00969,1.0:0.169
76):0.00916,2.4:0.17892):0.01985,2.5:0.19878):0.00086,1.7:0.19964):0.02884,
(2.2:0.11232,2.0:0.11232):0.11615)
0 0 50 | Iris-virginica
Cluster 0 <-- Iris-setosa
Cluster 1 <-- No class
Cluster 2 <-- Iris-versicolor
Incorrectly clustered instances : 51.0 34 %
As you can see, the squares represent incorrectly clustered points. Most of the Iris-virginica instances are clustered with Iris-versicolor and are therefore incorrectly clustered.
Clustered DataObjects: 150
Number of attributes: 4
Epsilon: 0.3; minPoints: 4
Distance-type:
Number of generated clusters: 2
Elapsed time: .01
( 98.) 5.1,2.5,3,1.1 --> 1
( 99.) 5.7,2.8,4.1,1.3 --> 1
(100.) 6.3,3.3,6,2.5 --> 1
(101.) 5.8,2.7,5.1,1.9 --> 1
0 50 ( 34%)
1 99 ( 66%)
Unclustered instances : 1
Read the supermarket dataset into WEKA by opening the file supermarket.arff, and find the association rules in the Associate tab.
At confidence = 0.9:
Let us take the confidence value of 0.9 and apply the Apriori algorithm for association rule mining on the supermarket dataset. You can modify it by changing the value of attribute -C (minMetric in the associator menu).
Associator output:
=== Run information ===
Apriori
=======
1. biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723 <conf:(0.92)> lift:
(1.27) lev:(0.03) [155] conv:(3.35)
2. baking needs=t biscuits=t fruit=t total=high 760 ==> bread and cake=t 696 <conf:(0.92)> lift:
(1.27) lev:(0.03) [149] conv:(3.28)
3. baking needs=t frozen foods=t fruit=t total=high 770 ==> bread and cake=t 705 <conf:(0.92)>
lift:(1.27) lev:(0.03) [150] conv:(3.27)
4. biscuits=t fruit=t vegetables=t total=high 815 ==> bread and cake=t 746 <conf:(0.92)>
lift: (1.27) lev:(0.03) [159] conv:(3.26)
5. party snack foods=t fruit=t total=high 854 ==> bread and cake=t 779 <conf:(0.91)>
lift:(1.27) lev:(0.04) [164] conv:(3.15)
6. biscuits=t frozen foods=t vegetables=t total=high 797 ==> bread and cake=t 725 <conf:(0.91)>
lift:(1.26) lev:(0.03) [151] conv:(3.06)
7. baking needs=t biscuits=t vegetables=t total=high 772 ==> bread and cake=t 701 <conf:(0.91)>
lift:(1.26) lev:(0.03) [145] conv:(3.01)
8. biscuits=t fruit=t total=high 954 ==> bread and cake=t 866 <conf:(0.91)> lift:(1.26) lev:(0.04)
[179] conv:(3)
9. frozen foods=t fruit=t vegetables=t total=high 834 ==> bread and cake=t 757 <conf:(0.91)> lift:
(1.26) lev:(0.03) [156] conv:(3)
10. frozen foods=t fruit=t total=high 969 ==> bread and cake=t 877 <conf:(0.91)> lift:(1.26) lev:
(0.04) [179] conv:(2.92)
At confidence = 0.85:
Let us take the confidence value of 0.85 and apply the Apriori algorithm for association rule mining on the supermarket dataset. You can modify it by changing the value of attribute -C (minMetric in the associator menu).
Associator output:
=== Run information ===
Instances: 4627
Attributes: 217
[list of attributes omitted]
=== Associator model (full training set) ===
Apriori
=======
itemsets L(5): 2
1. biscuits=t frozen foods=t fruit=t vegetables=t 1039 ==> bread and cake=t 929 <conf:(0.89)>
lift:(1.24) lev:(0.04) [181] conv:(2.62)
2. fruit=t vegetables=t total=high 1050 ==> bread and cake=t 938 <conf:(0.89)> lift:(1.24) lev:
(0.04) [182] conv:(2.6)
3. fruit=t total=high 1243 ==> bread and cake=t 1104 <conf:(0.89)> lift:(1.23) lev:(0.05) [209]
conv:(2.49)
4. biscuits=t total=high 1228 ==> bread and cake=t 1082 <conf:(0.88)> lift:(1.22) lev:(0.04) [198]
conv:(2.34)
5. milk-cream=t total=high 1217 ==> bread and cake=t 1071 <conf:(0.88)> lift:(1.22) lev:(0.04)
[195] conv:(2.32)
6. biscuits=t margarine=t vegetables=t 1054 ==> bread and cake=t 925 <conf:(0.88)>
lift:(1.22) lev:(0.04) [166] conv:(2.27)
7. frozen foods=t total=high 1273 ==> bread and cake=t 1117 <conf:(0.88)> lift:(1.22) lev:(0.04)
[200] conv:(2.27)
8. biscuits=t margarine=t fruit=t 1073 ==> bread and cake=t 938 <conf:(0.87)> lift:(1.21) lev:
(0.04) [165] conv:(2.21)
9. party snack foods=t total=high 1120 ==> bread and cake=t 979 <conf:(0.87)> lift:(1.21) lev:
(0.04) [172] conv:(2.21)
10. vegetables=t total=high 1270 ==> bread and cake=t 1110 <conf:(0.87)> lift:(1.21) lev:(0.04)
[195] conv:(2.21)
At confidence = 0.8:
Let us take the confidence value of 0.8 and apply the Apriori algorithm for association rule mining on the supermarket dataset. You can modify it by changing the value of attribute -C (minMetric in the associator menu).
Associator output:
=== Run information ===
1. biscuits=t vegetables=t 1764 ==> bread and cake=t 1487 <conf:(0.84)> lift:(1.17) lev:(0.05)
[217] conv:(1.78)
2. total=high 1679 ==> bread and cake=t 1413 <conf:(0.84)> lift:(1.17) lev:(0.04) [204] conv:
(1.76)
3. biscuits=t milk-cream=t 1767 ==> bread and cake=t 1485 <conf:(0.84)> lift:(1.17) lev:(0.05)
[213] conv:(1.75)
4. biscuits=t fruit=t 1837 ==> bread and cake=t 1541 <conf:(0.84)> lift:(1.17) lev:(0.05) [218]
conv:(1.73)
5. biscuits=t frozen foods=t 1810 ==> bread and cake=t 1510 <conf:(0.83)> lift:(1.16) lev:(0.04)
[207] conv:(1.69)
6. frozen foods=t fruit=t 1861 ==> bread and cake=t 1548 <conf:(0.83)> lift:(1.16) lev:(0.05)
[208] conv:(1.66)
7. frozen foods=t milk-cream=t 1826 ==> bread and cake=t 1516 <conf:(0.83)> lift:(1.15) lev:
(0.04) [201] conv:(1.65)
8. baking needs=t milk-cream=t 1907 ==> bread and cake=t 1580 <conf:(0.83)> lift:(1.15) lev:
(0.04) [207] conv:(1.63)
9. milk-cream=t fruit=t 2038 ==> bread and cake=t 1684 <conf:(0.83)> lift:(1.15) lev:(0.05) [217]
conv:(1.61)
10. baking needs=t biscuits=t 1764 ==> bread and cake=t 1456 <conf:(0.83)> lift:(1.15) lev:(0.04) [186] conv:(1.6)
Demonstration of Association Rule Mining on supermarket dataset using FP-Growth Algorithm with different support and confidence thresholds:
Select the FPGrowth algorithm.
At confidence = 0.9:
Let us take the confidence value of 0.9 and apply the FPGrowth algorithm for association rule mining on the supermarket dataset. You can modify it by changing the value of attribute -C (minMetric in the associator menu).
Associator output:
=== Run information ===
Scheme: weka.associations.FPGrowth -P 2 -I -1 -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1
Relation: supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
=== Associator model (full training set) ===
1. [fruit=t, frozen foods=t, biscuits=t, total=high]: 788 ==> [bread and cake=t]: 723 <conf:(0.92)>
lift:(1.27) lev:(0.03) conv:(3.35)
2. [fruit=t, baking needs=t, biscuits=t, total=high]: 760 ==> [bread and cake=t]: 696 <conf:(0.92)>
lift:(1.27) lev:(0.03) conv:(3.28)
3. [fruit=t, baking needs=t, frozen foods=t, total=high]: 770 ==> [bread and cake=t]: 705 <conf:
(0.92)> lift:(1.27) lev:(0.03) conv:(3.27)
4. [fruit=t, vegetables=t, biscuits=t, total=high]: 815 ==> [bread and cake=t]: 746 <conf:(0.92)>
lift:(1.27) lev:(0.03) conv:(3.26)
5. [fruit=t, party snack foods=t, total=high]: 854 ==> [bread and cake=t]: 779 <conf:(0.91)> lift:
(1.27) lev:(0.04) conv:(3.15)
6. [vegetables=t, frozen foods=t, biscuits=t, total=high]: 797 ==> [bread and cake=t]: 725 <conf:
(0.91)> lift:(1.26) lev:(0.03) conv:(3.06)
7. [vegetables=t, baking needs=t, biscuits=t, total=high]: 772 ==> [bread and cake=t]: 701 <conf:
(0.91)> lift:(1.26) lev:(0.03) conv:(3.01)
8. [fruit=t, biscuits=t, total=high]: 954 ==> [bread and cake=t]: 866 <conf:(0.91)> lift:(1.26) lev:
(0.04) conv:(3)
9. [fruit=t, vegetables=t, frozen foods=t, total=high]: 834 ==> [bread and cake=t]: 757 <conf:
(0.91)> lift:(1.26) lev:(0.03) conv:(3)
10. [fruit=t, frozen foods=t, total=high]: 969 ==> [bread and cake=t]: 877 <conf:(0.91)> lift:(1.26)
lev:(0.04) conv:(2.92)
At confidence = 0.85:
Let us take the confidence value of 0.85 and apply the FPGrowth algorithm for association rule mining on the supermarket dataset. You can modify it by changing the value of attribute -C (minMetric in the associator menu).
Associator output:
=== Run information ===
1. [fruit=t, vegetables=t, frozen foods=t, biscuits=t]: 1039 ==> [bread and cake=t]: 929 <conf:
(0.89)> lift:(1.24) lev:(0.04) conv:(2.62)
2. [fruit=t, vegetables=t, total=high]: 1050 ==> [bread and cake=t]: 938 <conf:(0.89)> lift:(1.24)
lev:(0.04) conv:(2.6)
3. [fruit=t, total=high]: 1243 ==> [bread and cake=t]: 1104 <conf:(0.89)> lift:(1.23) lev:(0.05)
conv:(2.49)
4. [biscuits=t, total=high]: 1228 ==> [bread and cake=t]: 1082 <conf:(0.88)> lift:(1.22) lev:(0.04)
conv:(2.34)
5. [milk-cream=t, total=high]: 1217 ==> [bread and cake=t]: 1071 <conf:(0.88)> lift:(1.22) lev:
(0.04) conv:(2.32)
At confidence = 0.8:
Let us take the confidence value of 0.8 and apply the FPGrowth algorithm for association rule mining on the supermarket dataset. You can modify it by changing the value of attribute -C (minMetric in the associator menu).
Associator output:
=== Run information ===
6. [fruit=t, frozen foods=t]: 1861 ==> [bread and cake=t]: 1548 <conf:(0.83)>
lift:(1.16) lev:(0.05) conv:(1.66)
7. [milk-cream=t, frozen foods=t]: 1826 ==> [bread and cake=t]: 1516
<conf:(0.83)> lift:(1.15) lev:(0.04) conv:(1.65)
8. [milk-cream=t, baking needs=t]: 1907 ==> [bread and cake=t]: 1580
<conf:(0.83)> lift:(1.15) lev:(0.04) conv:(1.63)
9. [fruit=t, milk-cream=t]: 2038 ==> [bread and cake=t]: 1684 <conf:(0.83)>
lift:(1.15) lev:(0.05) conv:(1.61)
10. [baking needs=t, biscuits=t]: 1764 ==> [bread and cake=t]: 1456
<conf:(0.83)> lift:(1.15) lev: (0.04) conv:(1.