
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

MACHINE LEARNING LAB


(CS3207)
LABORATORY MANUAL & RECORD

B.Tech (CSE) (With effect from 2022-23 admitted batches)


(III YEAR- I SEM)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SANKETIKA VIDYA PARISHAD ENGINEERING COLLEGE


(APPROVED BY AICTE, AFFILIATED TO ANDHRA UNIVERSITY,
ACCREDITED BY NAAC-A GRADE, ISO 9001:2015 CERTIFIED)
PM PALEM, VISAKHAPATNAM-41,
www.svpce.edu.in

SANKETIKA VIDYA PARISHAD ENGINEERING COLLEGE


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INSTITUTE VISION AND MISSION

VISION
To be a premier institute of knowledge sharing, quality research and development
technologies towards nation building
MISSION

1. Develop a state-of-the-art environment for high-quality learning

2. Collaborate with industries and academia towards training, research, innovation and
entrepreneurship
3. Create a platform for active participation in co-curricular and extra-curricular activities

DEPARTMENT VISION AND MISSION


VISION
To impart quality education for producing highly talented globally recognizable technocrats
and entrepreneurs with innovative ideas in computer science and engineering to meet industrial
needs and societal expectations

MISSION

 To impart high standard value-based technical education in all aspects of Computer


Science and Engineering through the state of the art infrastructure and innovative
approach.
 To produce ethical, motivated, and skilled engineers through theoretical knowledge
and practical applications.
 To impart the ability for tackling simple to complex problems individually as well as
in a team.
 To develop globally competent engineers with strong foundations, capable of “out of
the box” thinking so as to adapt to the rapidly changing scenarios requiring socially
conscious green computing solutions.

SANKETIKA VIDYA PARISHAD ENGINEERING COLLEGE


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

PROGRAMME EDUCATIONAL OBJECTIVES (PEOs)

Graduates of B. Tech in computer science and Engineering Programme shall be able to

PEO1: Demonstrate a strong foundation of knowledge and skills in the field of Computer Science
and Engineering.
PEO2: Provide solutions to challenging problems in their profession by applying computer
engineering theory and practices.
PEO3: Exhibit leadership and work effectively in multidisciplinary environments.

PROGRAMME SPECIFIC OUTCOMES (PSOs)

PSO1: Ability to design and develop computer programs and understand the structure and
develop methodologies of software systems.
PSO2: Ability to apply their skills in the field of networking, web design, cloud computing
and data analytics.
PSO3: Ability to understand the basic and advanced computing technologies towards getting
employed or to become an entrepreneur

SANKETIKA VIDYA PARISHAD ENGINEERING COLLEGE


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
PROGRAM OUTCOMES
1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals,
and an engineering specialization to the solution of complex engineering problems.
2. Problem Analysis: Identify, formulate, research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences and
engineering sciences.
3. Design/Development of Solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety and the cultural, societal and environmental considerations.
4. Conduct Investigations of Complex Problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
5. Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modelling to complex engineering activities with an
understanding of the limitations.
6. The Engineer and Society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
7. Environment and Sustainability: Understand the impact of the professional engineering solutions
in societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of
the engineering practice.
9. Individual and Team Work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and
design documentation, make effective presentations, give and receive clear instructions.
11. Project Management and Finance: Demonstrate knowledge and understanding of the engineering
and management principles and apply these to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.
12. Life-Long Learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.

SANKETIKA VIDYA PARISHAD ENGINEERING COLLEGE


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

GENERAL LABORATORY INSTRUCTIONS

1. Students are advised to come to the laboratory at least 5 minutes before the starting time;
those who come more than 5 minutes late will not be allowed into the lab.
2. Plan your task properly well before the commencement; come prepared to the lab with the
synopsis / program / experiment details.
3. Student should enter into the laboratory with:

a. Laboratory observation notes with all the details (Problem statement, Aim, Algorithm,
Procedure, Program, Expected Output, etc.,) filled in for the lab session.
b. Laboratory Record updated up to the last session experiments and other utensils (if any)
needed in the lab.
c. Proper Dress code and Identity card.
4. Sign in the laboratory login register, write the TIME-IN, and occupy the computer system
allotted to you by the faculty.
5. Execute your task in the laboratory, and record the results / output in the lab observation note
book, and get certified by the concerned faculty.
6. All the students should be polite and cooperative with the laboratory staff, must maintain the
discipline and decency in the laboratory.
7. Computer labs are established with sophisticated and high end branded systems, which should
be utilized properly.
8. Students/Faculty must keep their mobile phones in SWITCHED OFF mode during the lab
sessions. Misuse of the equipment, misbehaviors with the staff and systems etc., will attract
severe punishment.
9. Students must take the permission of the faculty in case of any urgency to go out; anybody
found loitering outside the lab / class without permission during working hours will be treated
seriously and punished appropriately.
10. Students should LOG OFF / SHUT DOWN the computer system before they leave the
lab after completing the task (experiment) in all aspects. They must ensure the system / seat
is left in proper condition.

SANKETIKA VIDYA PARISHAD ENGINEERING COLLEGE


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Course Objectives:

1. The lab course provides hands-on experimentation for gaining practical orientation on different
machine learning concepts. Specifically, students learn:
2. To write programs for various data exploration and analytics tasks and techniques in the R
programming language.
3. To apply various machine learning techniques available in WEKA for data exploration and
pre-processing of datasets containing numerical and categorical attributes.
4. To apply various machine learning techniques available in WEKA for extracting patterns /
knowledge and interpret the resulting patterns.

Course Outcomes (CO) :

1. Understand the data structures available in R programming and learn to write R programs to
perform several data analytics operations like plotting, boxplots, normalization, discretization,
transformation, attribute selection, etc., on datasets.

2. Write R programs to build regression and classification models for numerical and categorical
datasets and evaluate the models with appropriate performance metrics.

3. Understand and use the WEKA explorer for data exploration, visualization, and other data
pre-processing tasks for numerical and categorical datasets.

4. Extract Association rules using A-priori and FP-growth methods available in WEKA and
interpret the patterns.

5. Build and cross-validate models for Classification and Clustering on labelled and unlabeled
datasets respectively using different methods.

SANKETIKA VIDYA PARISHAD ENGINEERING COLLEGE


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CS3207 MACHINE LEARNING LAB

SYLLABUS

1. Exploratory data analysis using R

Load the ‘iris.csv’ file and display the names and type of each column. Find statistics such as min, max,
range, mean, median, variance, standard deviation for each column of data. Repeat the above for
‘mtcars.csv’ dataset also.

2.Write R program to normalize the variables into 0 to 1 scale using min-max normalization

3.Generate histograms for each feature / variable (sepal length/ sepal width/ petal length/ petal width)
and generate scatter plots for every pair of variables showing each species in a different color.
4.Generate box plots for each of the numerical attributes. Identify the attribute with the highest
variance.

5.Study of homogeneous and heterogeneous data structures such as vector, matrix, array, list, data
frame in R.

6.Write R Program using ‘apply’ group of functions to create and apply normalization function on
each of the numeric variables/columns of iris dataset to transform them into a value around 0 with z-
score normalization.
7.Write R Program using ‘apply’ group of functions to create and apply discretization function on
each of the numeric variables/ features of iris dataset to transform them into 3 levels designated as
“Low, Medium, High” values based on equi-width quantiles such that each variable gets nearly equal
number of data points in each level.

8. a) Use R to apply linear regression to predict evaporation coefficient in terms of air velocity using
the data given below:

Air Velocity (cm/sec): 20, 60, 100, 140, 180, 220, 260, 300, 340, 380
Evaporation Coefficient (sq mm/sec): 0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65

b) Analyze the significance of residual standard-error value, R-squared value, F statistic. Find
the correlation coefficient for this data and analyze the significance of the correlation value.

c) Perform a log transformation on the 'Air Velocity' column, perform linear regression again,
and analyze all the relevant values.

9. Write R program for reading ‘state.x77’ dataset into a data frame and apply multiple regression to
predict the value of the variable ‘murder’ based on the other independent variables based on their
correlations.

10. Write R program to split ‘Titanic’ dataset into training and test partitions and build a decision tree
for predicting whether survived or not given the description of a person travelled. Evaluate the
performance metrics from the confusion matrix.

2. WEKA Knowledge Extraction toolkit:

11. Create an ARFF (Attribute-Relation File Format) file and read it in WEKA. Explore the purpose
of each button under the pre-process panel after loading the ARFF file. Also, try to interpret using a
different ARFF file, weather.arff, provided with WEKA.

12. Performing data preprocessing in WEKA Study Unsupervised Attribute Filters such as Replace
Missing Values to replace missing values in the given dataset, Add to add the new attribute Average,
Discretize to discretize the attributes into bins. Explore Normalize and Standardize options on a dataset
with numerical attributes.

13. Classification using the WEKA toolkit Demonstration of classification process using id3
algorithm on categorical dataset (weather).
Demonstration of classification process using naïve Bayes algorithm on categorical dataset (‘vote’).
Demonstration of classification process using Random Forest algorithm on datasets containing large
number of attributes.
14. Classification using the WEKA toolkit – Part 2

Demonstration of classification process using J48 algorithm on mixed type of dataset after discretizing
numeric attributes.
Perform cross-validation strategy with various fold levels. Compare the accuracy of the results.
15. Association rule analysis in WEKA

Demonstration of Association Rule Mining on supermarket dataset using Apriori Algorithm with
different support and confidence thresholds.
Demonstration of Association Rule Mining on supermarket dataset using FP- Growth Algorithm with
different support and confidence thresholds.
16. Performing clustering in WEKA

Apply hierarchical clustering algorithm on numeric dataset and estimate cluster quality. Apply
DBSCAN algorithm on numeric dataset and estimate cluster quality.

SANKETIKA VIDYA PARISHAD ENGINEERING COLLEGE


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Index

S.No  Experiment Name                                                              Page No.

1     Exploratory data analysis using R                                                4
2     R program to normalize the variables into 0 to 1 scale using min-max
      normalization                                                                   11
3     Generate histograms for any one variable and generate scatter plots for
      every pair of variables showing each species in a different colour on the
      iris dataset                                                                    13
4     Generate box plots for each of the numerical attributes. Identify the
      attribute with the highest variance                                             17
5     Study of homogeneous and heterogeneous data structures such as vector,
      matrix, array, list and data frame in R                                         18
6     R program using the apply group of functions to create and apply a
      normalization function on each of the numerical columns of the iris
      dataset to transform them into a value around 0 with z-score normalization      20
7     Apply linear regression to
      a) predict evaporation coefficient in terms of air velocity using the
         given data                                                                   24
      b) analyze the significance of residual standard error value, R-squared
         value, F-statistic; find the correlation coefficient for this data and
         analyze the significance of the correlation value                            25
      c) perform a log transformation on the "Air Velocity" column, perform
         linear regression again and analyze all the relevant values                  27
8     R program using the apply group of functions to create and apply a
      normalization function on each of the numeric variables/columns of the
      iris dataset to transform them into a value around 0 with z-score
      normalization                                                                   27
9     Create an ARFF (Attribute-Relation File Format) file and read it in WEKA.
      Explore the purpose of each button under the preprocess panel after
      loading the ARFF file. Also, try to interpret using a different ARFF file,
      weather.arff, provided with WEKA                                                 1
10    Performing data preprocessing in WEKA: Study Unsupervised Attribute
      Filters such as ReplaceMissingValues to replace missing values in the
      given dataset, Add to add the new attribute Average, Discretize to
      discretize the attributes into bins. Explore Normalize and Standardize
      options on a dataset with numerical attributes                                   4
11    Classification using the WEKA toolkit
      a) Demonstration of classification process using the id3 algorithm on a
         categorical dataset (weather)                                                 8
      b) Demonstration of classification process using the naïve Bayes
         algorithm on a categorical dataset ('vote')                                   8
      c) Demonstration of classification process using the Random Forest
         algorithm on datasets containing a large number of attributes                 8
12    Classification using the WEKA toolkit – Part 2: Demonstration of
      classification process using the J48 algorithm on a mixed type of dataset
      after discretizing numeric attributes. Perform cross-validation strategy
      with various fold levels. Compare the accuracy of the results                   16
13    Performing clustering in WEKA: Apply hierarchical clustering algorithm on
      a numeric dataset and estimate cluster quality. Apply DBSCAN algorithm on
      a numeric dataset and estimate cluster quality                                  23
14    Association rule analysis in WEKA
      a) Demonstration of Association Rule Mining on supermarket dataset using
         the Apriori Algorithm with different support and confidence thresholds       33
      b) Demonstration of Association Rule Mining on supermarket dataset using
         the FP-Growth Algorithm with different support and confidence thresholds     33
15    Reference                                                                       40

R Programming Lab


1. Exploratory data analysis using R


Load the iris dataset and display the names and type of each column. Find statistics such as min, max,
range, mean, median, variance and standard deviation for each column of data.
Iris Data Set: The iris dataset is the best known dataset to be found in the pattern recognition
literature. The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
Attribute Information
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class
−Iris Setosa
−Iris Versicolour
−Iris Virginica

# the Iris dataset is an inbuilt dataset available in RStudio and we can use it using iris identifier
print(iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species


## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa


## 28 5.2 3.5 1.5 0.2 setosa


## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor

## 83 5.8 2.7 3.9 1.2 versicolor


## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor


## 88 6.3 2.3 4.4 1.3 versicolor


## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica


## 139 6.0 3.0 4.8 1.8 virginica


## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica

To display the names and types of each column


# to display names of each column we use the names function

print(names(iris))
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

For displaying the names along with their types we use the lapply function using class as second argument
print(lapply(iris, class))
## $Sepal.Length
## [1] "numeric"
##
## $Sepal.Width
## [1] "numeric"
##
## $Petal.Length
## [1] "numeric"
##
## $Petal.Width
## [1] "numeric"
##
## $Species
## [1] "factor"

Finding the mean of individual columns

# we can find the mean of a column using the mean function
# we can refer to individual columns in a dataset using the $ symbol as shown below

print(mean(iris$Sepal.Length))
## [1] 5.843333

print(mean(iris$Sepal.Width))
## [1] 3.057333

print(mean(iris$Petal.Length))
## [1] 3.758

print(mean(iris$Petal.Width))
## [1] 1.199333

Finding the median of each column

# we can find the median of a column using the median function

print(median(iris$Sepal.Length))
## [1] 5.8

print(median(iris$Sepal.Width))
## [1] 3

print(median(iris$Petal.Length))
## [1] 4.35

print(median(iris$Petal.Width))
## [1] 1.3


Finding minimum and maximum values of each column

# we can find the minimum of a column using the min function and the maximum using the max function

print(min(iris$Sepal.Length))
## [1] 4.3

print(min(iris$Sepal.Width))
## [1] 2

print(min(iris$Petal.Length))
## [1] 1

print(min(iris$Petal.Width))
## [1] 0.1

print(max(iris$Sepal.Length))
## [1] 7.9

print(max(iris$Sepal.Width))
## [1] 4.4

print(max(iris$Petal.Length))
## [1] 6.9

print(max(iris$Petal.Width))
## [1] 2.5

Finding the range of each column

We use range() to find the range of each column. It returns a vector containing the minimum and
maximum of all the given arguments.

print(range(iris$Sepal.Length))
## [1] 4.3 7.9

print(range(iris$Sepal.Width))
## [1] 2.0 4.4

print(range(iris$Petal.Length))
## [1] 1.0 6.9

print(range(iris$Petal.Width))
## [1] 0.1 2.5

Finding the variance of each column

We use the var() function to find the variance of each column.

print(var(iris$Sepal.Length))
## [1] 0.6856935

print(var(iris$Sepal.Width))
## [1] 0.1899794

print(var(iris$Petal.Length))
## [1] 3.116278

print(var(iris$Petal.Width))
## [1] 0.5810063

Finding the standard deviation of each column

We use the sd() function to find the standard deviation of each column.

print(sd(iris$Sepal.Length))
## [1] 0.8280661

print(sd(iris$Sepal.Width))
## [1] 0.4358663

print(sd(iris$Petal.Length))
## [1] 1.765298

print(sd(iris$Petal.Width))
## [1] 0.7622377


summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
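The experiment also asks for the same statistics on the mtcars dataset. A minimal sketch, assuming the built-in mtcars data frame (all of its columns are numeric), that computes the statistics for every column at once using sapply:

# mtcars is also a built-in data frame in R; all of its columns are numeric
data(mtcars)
print(names(mtcars))          # column names
print(sapply(mtcars, class))  # type of each column

# compute min, max, range, mean, median, variance and standard deviation
# for every column in a single call
stats <- sapply(mtcars, function(x) {
  c(min = min(x), max = max(x), range = diff(range(x)),
    mean = mean(x), median = median(x), var = var(x), sd = sd(x))
})
print(round(stats, 3))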


2. R program to normalize the variables into 0 to 1 scale using min-max normalization

The formula to achieve min-max normalization is:

    y = (x - min) / (max - min)

#dummy data
x = sample(-100:100, 50)
print("original data")
## [1] "original data"

print(x)
##  [1] -31 -85 100  45  38  69 -62  66 -61 -89 -40 -81 -91  67  42  39 -68  63 -95
## [20]   8  14  56  26 -99  22  71 -26  91  92 -20 -39  73  96  23 -79  87 -18 -93
## [39]  82   4 -56  94   0 -17 -83   5 -52 -21 -23  -6

maximum = max(x)
minimum = min(x)
normalized = (x - minimum) / (maximum - minimum)
print("Normalized data")
## [1] "Normalized data"

print(normalized)

## [1] 0.34170854 0.07035176 1.00000000 0.72361809 0.68844221 0.84422111


## [7] 0.18592965 0.82914573 0.19095477 0.05025126 0.29648241 0.09045226
## [13] 0.04020101 0.83417085 0.70854271 0.69346734 0.15577889 0.81407035
## [19] 0.02010050 0.53768844 0.56783920 0.77889447 0.62814070 0.00000000
## [25] 0.60804020 0.85427136 0.36683417 0.95477387 0.95979899 0.39698492
## [31] 0.30150754 0.86432161 0.97989950 0.61306533 0.10050251 0.93467337
## [37] 0.40703518 0.03015075 0.90954774 0.51758794 0.21608040 0.96984925
## [43] 0.49748744 0.41206030 0.08040201 0.52261307 0.23618090 0.39195980
## [49] 0.38190955 0.46733668
#using par function to fix multiple graphs in same plot

par(mfrow=c(1,2))
hist(x,breaks = 10, xlab = "Data",col = "lightblue", )
hist(normalized, breaks = 10, xlab = "Normalized data", col = "yellow")
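The same min-max normalization can also be packaged as a reusable function and applied to every numeric column of a dataset. A minimal sketch (the function name minmax is only an illustrative choice) applied to the iris measurements:

# min-max normalization as a reusable function
minmax <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# apply it column-wise to the four numeric columns of iris
iris_minmax <- as.data.frame(lapply(iris[1:4], minmax))
summary(iris_minmax)   # every column now ranges from 0 to 1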


3. Generate histograms for any one variable and generate scatter plots for every pair of
variables showing each species in different colour on iris dataset.
Generating histogram for any one variable let it be sepal length
hist(iris$Sepal.Length, col="yellow", xlab = "Sepal length in cm", main = "Histogram of Sepal
lengths")

Let us use red, green, blue as the colours for 3 species


my_cols=c("red","green","blue")
#21 is for circle, 22 is for squares 24 is for triangles
pairs(iris[1:4],pch=c(21,22,24)[iris$Species],bg=my_cols[iris$Species])


# correlation panel (used as the lower panel)
panel.cor = function(x, y){
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r = round(cor(x, y), digits = 2)
  txt = paste0("R = ", r)
  cex.cor = 0.8 / strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}

# customizing panels and printing correlations
# customize the upper panel to show the coloured points
upper.panel = function(x, y){
  points(x, y, pch = 19, col = my_cols[iris$Species])
}

# create the plots by passing the panel functions to pairs()
pairs(iris[1:4], lower.panel = panel.cor, upper.panel = upper.panel)

# customize the upper panel to show both the points and the correlation value
upper.panel = function(x, y){
  points(x, y, pch = 19, col = my_cols[iris$Species])
  r = round(cor(x, y), digits = 2)
  txt = paste0("R = ", r)
  usr = par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.9, txt)
}
pairs(iris[1:4], lower.panel = NULL, upper.panel = upper.panel)


4. Generate box plots for each of the numerical attribute. Identify the attribute with
highest variance.
A built-in dataset such as "airquality" (daily air quality measurements in New York, May to
September 1973) can be used for this program; the box plots below are drawn on the iris dataset.
Finding the variance is simple: the spread of the boxplot indicates the variance. The greater the
spread of a boxplot, the higher the variance of that attribute.

The five-number summary is the minimum, first quartile, median, third quartile, and maximum. In a box
plot, we draw a box from the first quartile to the third quartile. A vertical line goes through the box at the
median. The whiskers go from each quartile to the minimum or maximum (5).

Example:
A sample of 10 boxes of raisins has these weights (in grams): 25, 28, 29, 29, 30, 34, 35, 35, 37, 38
1. Order the data from smallest to largest.
   25, 28, 29, 29, 30, 34, 35, 35, 37, 38
2. Find the median.

The median is the mean of the middle two numbers:

   25, 28, 29, 29, 30, 34, 35, 35, 37, 38
   (30 + 34) / 2 = 32
3. Find the quartiles.
First Quartile Q1: The first quartile is the median of the data points to the left of the median.
   25, 28, 29, 29, 30
   Q1 = 29
Third Quartile Q3: The third quartile is the median of the data points to the right of the median.
   34, 35, 35, 37, 38
   Q3 = 35
4. Complete the five-number summary by finding the min and the max.
The min is the smallest data point, which is 25.
The max is the largest data point, which is 38.
The five-number summary is 25, 29, 32, 35, 38.


boxplot(Sepal.Length ~ Species, data = iris)

boxplot(Sepal.Width ~ Species, data = iris)

boxplot(Petal.Length ~ Species, data = iris)

boxplot(Petal.Width ~ Species, data = iris)
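To answer the second part of the task (identifying the attribute with the highest variance) numerically rather than only visually, the per-column variances can be computed directly; a minimal sketch on the iris measurements:

# variance of each numeric attribute
variances <- sapply(iris[1:4], var)
print(variances)

# attribute with the highest variance (Petal.Length for iris)
print(names(which.max(variances)))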

5. Study of homogeneous and heterogeneous data structures such as vector, matrix,
array, list and data frame in R.

Data Structure:
A data structure is a particular way of organizing data in a computer so that it can be used
effectively. The idea is to reduce the space and time complexities of different tasks. Data structures
in R programming are tools for holding multiple values.
R's base data structures are often organized by their dimensionality (1D, 2D, or nD) and whether
they're homogeneous (all elements must be of the identical type) or heterogeneous (the elements
can be of various types). This gives rise to the five data types which are most frequently utilized
in data analysis. The following table shows a clear-cut view of these data structures.

Dimension Homogenous Heterogeneous

1D Vector List

2D Matrix Dataframe

nD Array

Vector:
A vector is an ordered collection of basic data types of a given length. The only key thing here is all the
elements of a vector must be of the identical data type e.g homogenous data structures. Vectors are one-
dimensional data structures.
How to create a Vector?
Vectors are generally created using the c() function.
> X = c(1, 3, 5, 7, 8)
> print(X)
[1] 1 3 5 7 8
> typeof(X)
[1] "double"
> length(X)
[1] 5
> x <- c(1, 5.4, TRUE, "hello")
> typeof(x)
[1] "character"
> length(x)
[1] 4
> x
[1] "1" "5.4" "TRUE" "hello"
> X
[1] 1 3 5 7 8
Creating a vector using operator:
> x <- 1:7
> x
[1] 1 2 3 4 5 6 7
> y <- 2:-2
> y
[1] 2 1 0 -1 -2
Creating a vector using seq() function:
> z <- seq(1, 3, by=0.2)
> z
[1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
> a<-seq(1, 5, length.out=4)
> a
[1] 1.000000 2.333333 3.666667 5.000000
Accessing elements using integer vector as index:
Vector index in R starts from 1, unlike most programming languages where index start from 0.
We can use a vector of integers as index to access specific elements. We can also use negative
integers to return all elements except that those specified. But we cannot mix positive and neg
ative integers while indexing and real numbers, if used, are truncated to integers (2).
> x <- c(0, 2, 4, 6, 8, 10)
> x
[1]  0  2  4  6  8 10
> x[3]
[1] 4
> x[c(2, 4)]
[1] 2 6
> x[-1]          # Access all except the 1st element
[1]  2  4  6  8 10
Accessing elements using logical vector as index:

When we use a logical vector for indexing, the position where the logical vector is TRUE is
returned. This useful feature helps us in filtering of vector as shown below.
> x[c(TRUE, FALSE, FALSE, TRUE)]
[1] 0 6 8
> x[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
[1] 0 6 8 10
> x[c(TRUE, FALSE, FALSE)]
[1] 0 6
> x[c(TRUE)]
[1] 0 2 4 6 8 10
> x[c(TRUE, TRUE)]
[1] 0 2 4 6 8 10
> x[c(TRUE, FALSE)]
[1] 0 4 8
> x[x < 0]
numeric(0)
> x[x < 4]
[1] 0 2
> x[x > 4]
[1]  6  8 10
Modifying vectors

We can modify a vector using the assignment operator. We can use the techniques discussed above to
access specific elements and modify them. If we want to truncate the elements, we can use reassignments.
> x <- c(-3, -2, -1, 0, 1, 2)
> x
[1] -3 -2 -1  0  1  2
> x[2] <- 0
> x
[1] -3  0 -1  0  1  2
> x[x < 0] <- 5   # modify elements less than 0 as 5
> x
[1] 5 0 5 0 1 2
> x <- x[1:4]     # truncate x to first 4 elements
> x
[1] 5 0 5 0

Lists:

A list is a generic object consisting of an ordered collection of objects. Lists are heterogeneous
data structures. These are also one-dimensional data structures. A list can be a list of vectors,
a list of matrices, a list of characters, a list of functions and so on.
Creating Lists
List can be created using the list() function.
> x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)
> str(x)
List of 3
$ a: num 2.5
$ b: logi TRUE
$ c: int [1:3] 1 2 3
 Structure of the list can be returned using the str() function.
 In this example, a, b and c are called tags, which makes it easier to reference the components
of the list. However, tags are optional. We can create the same list without the tags as follows.
In such a scenario, numeric indices are used by default.
> x <- list(2.5,TRUE,1:3)
> x
[[1]]
[1] 2.5

[[2]]
[1] TRUE

[[3]]
[1] 1 2 3

> empId = c(1, 2, 3, 4)


> empName = c("Debi", "Sandeep", "Subham", "Shiba")
> numberOfEmp = 4
> empList = list(empId, empName, numberOfEmp)
> print(empList)
[[1]]
[1] 1 2 3 4

[[2]]
[1] "Debi" "Sandeep" "Subham" "Shiba"

[[3]]
[1] 4

Name List Elements in R Language


> data_list <- list(c("Jan","Feb","Mar"), matrix(c(1,2,3,4,-1,9), nrow = 2), list("Red",12.3))
> data_list
[[1]]
[1] "Jan" "Feb" "Mar"

[[2]]
     [,1] [,2] [,3]
[1,]    1    3   -1
[2,]    2    4    9

[[3]]
[[3]][[1]]
[1] "Red"

[[3]][[2]]
[1] 12.3

> names(data_list) <- c("Monat", "Matrix", "Misc")


> data_list
$Monat
[1] "Jan" "Feb" "Mar"

$Matrix
     [,1] [,2] [,3]
[1,]    1    3   -1
[2,]    2    4    9

$Misc
$Misc[[1]]
[1] "Red"

$Misc[[2]]
[1] 12.3

Accessing Elements from the Lists


 Accessing elements by index
> print(data_list[3])
$Misc

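Matrix:
As listed in the table above, a matrix is the two-dimensional homogeneous data structure in R. A minimal sketch (standard base R) of creating and indexing a matrix with the matrix() function:

# create a 3 x 3 numeric matrix, filling values column by column
m <- matrix(1:9, nrow = 3, ncol = 3)
print(m)
print(dim(m))      # dimensions: 3 3
print(m[2, 3])     # element in row 2, column 3
print(t(m))        # transpose of the matrix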
Arrays:
Arrays are homogeneous data structures that can store data in more than two dimensions.

v1 = c(1,2,3)
v2 = c(4,5,6,7,8,9)

arr = array(c(v1,v2), dim = c(3,3,2))
print(arr)

## ,,1
##
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## ,,2
##
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9

Data Frames
Data frame is a tabular data object or two dimensional array like structure in which each column,
contains values of one variable and each row contains one set of values from each column.
Unlike matrices, each column of a data frame can contain different modes of data.

costs = data.frame(
  name = c("carrot","apple","sugar"),
  costPerKG = c(50.00,60.00,39.50),
  QuantityAvailableinKGs = c(10,5,50))
print(costs)

## name costPerKG QuantityAvailableinKGs


## 1 carrot 50.0 10
## 2 apple 60.0 5
## 3 sugar 39.5 50


6) R program using the apply group of functions to create and apply a normalization function on
each of the numerical columns of the iris dataset to transform them into a value around 0 with
z-score normalization.
We can achieve z-score normalization in R by using the function scale(X, center=TRUE, scale=TRUE),
where X refers to the data.

So we need to apply it to all the columns in the iris dataset which have numeric data. To do that we
use the apply() function.
# Load the iris dataset
data(iris)

# Function to perform z-score normalization
normalize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Apply normalization function to numeric columns of iris dataset
iris_normalized <- as.data.frame(lapply(iris[, sapply(iris, is.numeric)], normalize))

# Add non-numeric columns back to the normalized dataframe
non_numeric_cols <- iris[, !sapply(iris, is.numeric)]
iris_normalized <- cbind(non_numeric_cols, iris_normalized)

# Print the first few rows of the normalized dataframe
head(iris_normalized)

Output:

Species Sepal.Length Sepal.Width Petal.Length Petal.Width


1 setosa -0.8976739 1.01560199 -1.335752 -1.311052
2 setosa -1.1392005 -0.13153881 -1.335752 -1.311052
3 setosa -1.3807271 0.32731751 -1.392399 -1.311052
4 setosa -1.5014904 0.09788935 -1.279104 -1.311052
5 setosa -1.0184372 1.24503015 -1.335752 -1.311052
6 setosa -0.5353840 1.93331359 -1.165810 -1.048667

7. Write R program using the apply group of functions to create and apply a discretization function on
each of the numeric variables/features of the iris dataset to transform them into 3 levels designated as
"low, medium, high" values based on quantiles such that each variable gets nearly equal number of
data points in each level.

To achieve discretization of the numeric variables in the iris dataset, we can use the apply family of
functions in R. The goal is to transform each numeric feature into three categories ("low", "medium",
"high") using quantile-based bins, so that each bin contains approximately the same number of data
points.

Here’s how you can write an R program that uses the apply family of functions to achieve this:

# Load the iris dataset

data(iris)

# Define the discretization function that categorizes data into 3 levels

discretize <- function(x) {

# Cut the data into 3 bins at the 1/3 and 2/3 quantiles so each bin gets roughly the same number of points

# The "labels" parameter assigns labels to the resulting levels

cut(x, breaks = quantile(x, probs = 0:3 / 3), include.lowest = TRUE, labels = c("low", "medium", "high"))
}

# Apply the discretization function to each numeric column in the iris dataset

# Use the apply function to apply discretize to each column of iris, excluding the species column

iris_discretized <- iris

iris_discretized[, 1:4] <- apply(iris[, 1:4], 2, discretize)

# View the discretized dataset

head(iris_discretized)
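To check that each level really does receive a roughly equal number of data points, the level counts per column can be tabulated; a minimal sketch using the iris_discretized data frame created above:

# count how many observations fall into each level of every discretized column
lapply(iris_discretized[1:4], table)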


Output:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species


low high low low setosa
medium high high low setosa
medium high high low setosa
low high low low setosa
medium high high low setosa
medium high high low setosa

The values in the columns Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width are now
categorized as "low", "medium", or "high" based on equal-width quantiles. The Species column remains
unchanged.


8. Introduction to regression using R.


Air Velocity (cm/sec) 20,60,100,140,180,220,260,300,340,380
Evaporation Coefficient(mm2/sec) 0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65

Introduction to Linear Regression


Linear regression is one of the most commonly used predictive modelling techniques. The aim of linear
regression is to find a mathematical equation for a continuous response variable Y as a function of one or
more X variable(s). So that you can use this regression model to predict the Y when only the X is known. It
is expressed in the equation 1.

Y = β1 + β2 X + ε      (1)

where β1 is the intercept, β2 is the slope, and ε is the error term.

Problem Specification
In the given problem, 'Air velocity' and 'Evaporation Coefficient' are the variables, with 10 observations.
The goal here is to establish a mathematical equation for 'Evaporation Coefficient' as a function of 'Air velocity',
so you can use it to predict 'Evaporation Coefficient' when only the 'Air velocity' is known. So, it is desirable
to build a linear regression model with the response variable as 'Evaporation Coefficient' and the predictor as 'Air
velocity'. Before we begin building the regression model, it is a good practice to analyse and understand the
variables.

> airvelocity<-c(20,60,100,140,180,220,260,300,340,380)
> evaporationcoefficient<-c(0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1
.17, 1.65)
> airvelocity
[1] 20 60 100 140 180 220 260 300 340 380
> evaporationcoefficient
[1] 0.18 0.37 0.35 0.78 0.56 0.75 1.18 1.36 1.17 1.65

Graphical analysis
The aim of this exercise is to build a simple regression model that you can use to predict
'Evaporation Coefficient'. But before jumping in to the syntax, let's try to understand these variables
graphically.

Typically, for each of the predictors, the following plots help visualize the patterns:

Using Scatter Plot to Visualize the Relationship


Scatter plots can help visualize linear relationships between the response and predictor variables. Ideally,
if you have many predictor variables, a scatter plot is drawn for each one of them against the response,


along with the line of best fit as seen below.

> scatter.smooth(airvelocity, evaporationcoefficient, main="Airvelocity ~ Evaporation Coefficient")

The scatter plot along with the smoothing line above suggests a linear and positive relationship between
'Air Velocity' and 'Evaporation Coefficient'.

This is a good thing. Because, one of the underlying assumptions of linear regression is,
the relationship between the response and predictor variables is linear and additive.

Using BoxPlot to Check for Outliers


Generally, an outlier is any datapoint that lies outside the 1.5 * inter quartile range (IQR). IQR is
calculated as the distance between the 25th percentile and 75th percentile values for that variable (1).

> par(mfrow=c(1, 2))

> boxplot(airvelocity, main="Air Velocity", sub=paste("Outlier rows: ", boxplot.stats(airvelocity)$out))   # box plot for 'air velocity'

> boxplot(evaporationcoefficient, main="Evaporation Coefficient", sub=paste("Outlier rows: ", boxplot.stats(evaporationcoefficient)$out))   # box plot for 'evaporation coefficient'


a) Analyze the significance of residual standard error value, R-squared value, F-statistic. Find the
correlation coefficient for this data and analyze the significance of the correlation value.

m1 = lm(dist ~ speed, data=cars)
summary(m1)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
summary.aov(m1) # To get the sums of squares and mean squares
## Df Sum Sq Mean Sq F value Pr(>F)
## speed 1 21185 21185 89.57 1.49e-12 ***
## Residuals 48 11354 237
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#calculate sums of squares: total, residual and model

y = cars$dist
ybar = mean(y)

#ss is sum of squares
ss.total = sum((y-ybar)^2)
print(ss.total)
## [1] 32538.98

ss.residual = sum((y-m1$fitted)^2)
print(ss.residual)
## [1] 11353.52

ss.model = ss.total - ss.residual
print(ss.model)
## [1] 21185.46

#calculate degrees of freedom: total, residual and model

n = length(cars$speed)
k = length(m1$coef)

df.total = n-1
df.residual = n-k
df.model = k-1

# calculating mean squares

ms.residual = ss.residual / df.residual
print(ms.residual)
## [1] 236.5317

ms.model = ss.model / df.model
print(ms.model)
## [1] 21185.46
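The task also asks for the correlation coefficient of the air velocity data itself. A minimal sketch using cor() and cor.test() on the airvelocity and evaporationcoefficient vectors defined earlier (output not shown):

# Pearson correlation between air velocity and evaporation coefficient
r <- cor(airvelocity, evaporationcoefficient)
print(r)

# cor.test() also reports the p-value, so the significance of the
# correlation can be judged directly
cor.test(airvelocity, evaporationcoefficient)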


b) Perform a log transformation on the “Air Velocity” column, perform


linear regression again and analyze all the relevant values.
#air velocity
x = c(20,60,100,140,180,220,260,300,340,380)
#evaporation coefficient
y = c(0.18,0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65)
x = log(x)

linearModel = lm(y~x)
print(linearModel)

##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -1.457 0.456

#now we create a new dataframe that sets the x value


pdata = data.frame(x=400)

#we now apply predict() function and set the predictor variable in the pdata
argument.

result = predict(linearModel,pdata)
print(result)

## 1
## 180.9577
#we can also set the interval type as "predict" without changing the default
0.95 confidence level

result = predict(linearModel,pdata,interval = "predict")


print(result)

## fit lwr upr


## 1 180.9577 92.97031 268.945
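Note that since the model was fitted on log(air velocity), a new air velocity value should also be log-transformed before prediction; the large fitted value above comes from feeding the raw value 400 into a model whose predictor is on the log scale. A corrected sketch of the prediction step (output not shown):

# predict the evaporation coefficient for an air velocity of 400 cm/sec;
# the predictor must be on the same (log) scale the model was fitted on
pdata = data.frame(x = log(400))
result = predict(linearModel, pdata, interval = "predict")
print(result)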

3. R program using the apply group of functions to create and apply a normalization function on
each of the numeric variables/columns of the iris dataset to transform them into a value around 0
with z-score normalization.
We can achieve z-score normalization in R by using the function scale(X, center=TRUE, scale=TRUE),
where X refers to the data.

So we need to apply it to all the columns in the iris dataset which have numeric data. To do that we
use the apply() function.
#We are using apply function to implement scale() function on every column.
#Here 2 means column


#It indicates apply function to apply scale() function column wise.


apply(iris[1:4], 2, scale)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] -0.89767388 1.01560199 -1.33575163 -1.3110521482
## [2,] -1.13920048 -0.13153881 -1.33575163 -1.3110521482
## [3,] -1.38072709 0.32731751 -1.39239929 -1.3110521482
## [4,] -1.50149039 0.09788935 -1.27910398 -1.3110521482
## [5,] -1.01843718 1.24503015 -1.33575163 -1.3110521482
## [6,] -0.53538397 1.93331463 -1.16580868 -1.0486667950
## [7,] -1.50149039 0.78617383 -1.33575163 -1.1798594716
## [8,] -1.01843718 0.78617383 -1.27910398 -1.3110521482
## [9,] -1.74301699 -0.36096697 -1.33575163 -1.3110521482
## [10,] -1.13920048 0.09788935 -1.27910398 -1.4422448248
## [11,] -0.53538397 1.47445831 -1.27910398 -1.3110521482
## [12,] -1.25996379 0.78617383 -1.22245633 -1.3110521482
## [13,] -1.25996379 -0.13153881 -1.33575163 -1.4422448248
## [14,] -1.86378030 -0.13153881 -1.50569459 -1.4422448248
## [15,] -0.05233076 2.16274279 -1.44904694 -1.3110521482
## [16,] -0.17309407 3.08045544 -1.27910398 -1.0486667950
## [17,] -0.53538397 1.93331463 -1.39239929 -1.0486667950
## [18,] -0.89767388 1.01560199 -1.33575163 -1.1798594716
## [19,] -0.17309407 1.70388647 -1.16580868 -1.1798594716
## [20,] -0.89767388 1.70388647 -1.27910398 -1.1798594716
## [21,] -0.53538397 0.78617383 -1.16580868 -1.3110521482
## [22,] -0.89767388 1.47445831 -1.27910398 -1.0486667950
## [23,] -1.50149039 1.24503015 -1.56234224 -1.3110521482
## [24,] -0.89767388 0.55674567 -1.16580868 -0.9174741184
## [25,] -1.25996379 0.78617383 -1.05251337 -1.3110521482
## [26,] -1.01843718 -0.13153881 -1.22245633 -1.3110521482
## [27,] -1.01843718 0.78617383 -1.22245633 -1.0486667950
## [28,] -0.77691058 1.01560199 -1.27910398 -1.3110521482
## [29,] -0.77691058 0.78617383 -1.33575163 -1.3110521482
## [30,] -1.38072709 0.32731751 -1.22245633 -1.3110521482
## [31,] -1.25996379 0.09788935 -1.22245633 -1.3110521482
## [32,] -0.53538397 0.78617383 -1.27910398 -1.0486667950
## [33,] -0.77691058 2.39217095 -1.27910398 -1.4422448248
## [34,] -0.41462067 2.62159911 -1.33575163 -1.3110521482
## [35,] -1.13920048 0.09788935 -1.27910398 -1.3110521482
## [36,] -1.01843718 0.32731751 -1.44904694 -1.3110521482
## [37,] -0.41462067 1.01560199 -1.39239929 -1.3110521482
## [38,] -1.13920048 1.24503015 -1.33575163 -1.4422448248
## [39,] -1.74301699 -0.13153881 -1.39239929 -1.3110521482
## [40,] -0.89767388 0.78617383 -1.27910398 -1.3110521482
## [41,] -1.01843718 1.01560199 -1.39239929 -1.1798594716
## [42,] -1.62225369 -1.73753594 -1.39239929 -1.1798594716
## [43,] -1.74301699 0.32731751 -1.39239929 -1.3110521482
## [44,] -1.01843718 1.01560199 -1.22245633 -0.7862814418
## [45,] -0.89767388 1.70388647 -1.05251337 -1.0486667950
## [46,] -1.25996379 -0.13153881 -1.33575163 -1.1798594716
## [47,] -0.89767388 1.70388647 -1.22245633 -1.3110521482
## [48,] -1.50149039 0.32731751 -1.33575163 -1.3110521482
## [49,] -0.65614727 1.47445831 -1.27910398 -1.3110521482
## [50,] -1.01843718 0.55674567 -1.33575163 -1.3110521482


## [51,] 1.39682886 0.32731751 0.53362088 0.2632599711


## [52,] 0.67224905 0.32731751 0.42032558 0.3944526477

## [53,] 1.27606556 0.09788935 0.64691619 0.3944526477


## [54,] -0.41462067 -1.73753594 0.13708732 0.1320672944
## [55,] 0.79301235 -0.59039513 0.47697323 0.3944526477
## [56,] -0.17309407 -0.59039513 0.42032558 0.1320672944
## [57,] 0.55148575 0.55674567 0.53362088 0.5256453243
## [58,] -1.13920048 -1.50810778 -0.25944625 -0.2615107354
## [59,] 0.91377565 -0.36096697 0.47697323 0.1320672944
## [60,] -0.77691058 -0.81982329 0.08043967 0.2632599711
## [61,] -1.01843718 -2.42582042 -0.14615094 -0.2615107354
## [62,] 0.06843254 -0.13153881 0.25038262 0.3944526477
## [63,] 0.18919584 -1.96696410 0.13708732 -0.2615107354
## [64,] 0.30995914 -0.36096697 0.53362088 0.2632599711
## [65,] -0.29385737 -0.36096697 -0.08950329 0.1320672944
## [66,] 1.03453895 0.09788935 0.36367793 0.2632599711
## [67,] -0.29385737 -0.13153881 0.42032558 0.3944526477
## [68,] -0.05233076 -0.81982329 0.19373497 -0.2615107354
## [69,] 0.43072244 -1.96696410 0.42032558 0.3944526477
## [70,] -0.29385737 -1.27867961 0.08043967 -0.1303180588
## [71,] 0.06843254 0.32731751 0.59026853 0.7880306775
## [72,] 0.30995914 -0.59039513 0.13708732 0.1320672944
## [73,] 0.55148575 -1.27867961 0.64691619 0.3944526477
## [74,] 0.30995914 -0.59039513 0.53362088 0.0008746178
## [75,] 0.67224905 -0.36096697 0.30703027 0.1320672944
## [76,] 0.91377565 -0.13153881 0.36367793 0.2632599711
## [77,] 1.15530226 -0.59039513 0.59026853 0.2632599711
## [78,] 1.03453895 -0.13153881 0.70356384 0.6568380009
## [79,] 0.18919584 -0.36096697 0.42032558 0.3944526477
## [80,] -0.17309407 -1.04925145 -0.14615094 -0.2615107354
## [81,] -0.41462067 -1.50810778 0.02379201 -0.1303180588
## [82,] -0.41462067 -1.50810778 -0.03285564 -0.2615107354
## [83,] -0.05233076 -0.81982329 0.08043967 0.0008746178
## [84,] 0.18919584 -0.81982329 0.76021149 0.5256453243
## [85,] -0.53538397 -0.13153881 0.42032558 0.3944526477
## [86,] 0.18919584 0.78617383 0.42032558 0.5256453243
## [87,] 1.03453895 0.09788935 0.53362088 0.3944526477
## [88,] 0.55148575 -1.73753594 0.36367793 0.1320672944
## [89,] -0.29385737 -0.13153881 0.19373497 0.1320672944
## [90,] -0.41462067 -1.27867961 0.13708732 0.1320672944
## [91,] -0.41462067 -1.04925145 0.36367793 0.0008746178
## [92,] 0.30995914 -0.13153881 0.47697323 0.2632599711
## [93,] -0.05233076 -1.04925145 0.13708732 0.0008746178
## [94,] -1.01843718 -1.73753594 -0.25944625 -0.2615107354
## [95,] -0.29385737 -0.81982329 0.25038262 0.1320672944
## [96,] -0.17309407 -0.13153881 0.25038262 0.0008746178
## [97,] -0.17309407 -0.36096697 0.25038262 0.1320672944
## [98,] 0.43072244 -0.36096697 0.30703027 0.1320672944
## [99,] -0.89767388 -1.27867961 -0.42938920 -0.1303180588
## [100,] -0.17309407 -0.59039513 0.19373497 0.1320672944
## [101,] 0.55148575 0.55674567 1.27004036 1.7063794137
## [102,] -0.05233076 -0.81982329 0.76021149 0.9192233541
## [103,] 1.51759216 -0.13153881 1.21339271 1.1816087073


## [104,] 0.55148575 -0.36096697 1.04344975 0.7880306775


## [105,] 0.79301235 -0.13153881 1.15674505 1.3128013839
## [106,] 2.12140867 -0.13153881 1.60992627 1.1816087073
## [107,] -1.13920048 -1.27867961 0.42032558 0.6568380009
## [108,] 1.75911877 -0.36096697 1.43998331 0.7880306775

## [109,] 1.03453895 -1.27867961 1.15674505 0.7880306775


## [110,] 1.63835547 1.24503015 1.32668801 1.7063794137
## [111,] 0.79301235 0.32731751 0.76021149 1.0504160307
## [112,] 0.67224905 -0.81982329 0.87350679 0.9192233541
## [113,] 1.15530226 -0.13153881 0.98680210 1.1816087073
## [114,] -0.17309407 -1.27867961 0.70356384 1.0504160307
## [115,] -0.05233076 -0.59039513 0.76021149 1.5751867371
## [116,] 0.67224905 0.32731751 0.87350679 1.4439940605
## [117,] 0.79301235 -0.13153881 0.98680210 0.7880306775
## [118,] 2.24217198 1.70388647 1.66657392 1.3128013839
## [119,] 2.24217198 -1.04925145 1.77986923 1.4439940605
## [120,] 0.18919584 -1.96696410 0.70356384 0.3944526477
## [121,] 1.27606556 0.32731751 1.10009740 1.4439940605
## [122,] -0.29385737 -0.59039513 0.64691619 1.0504160307
## [123,] 2.24217198 -0.59039513 1.66657392 1.0504160307
## [124,] 0.55148575 -0.81982329 0.64691619 0.7880306775
## [125,] 1.03453895 0.55674567 1.10009740 1.1816087073
## [126,] 1.63835547 0.32731751 1.27004036 0.7880306775
## [127,] 0.43072244 -0.59039513 0.59026853 0.7880306775
## [128,] 0.30995914 -0.13153881 0.64691619 0.7880306775
## [129,] 0.67224905 -0.59039513 1.04344975 1.1816087073
## [130,] 1.63835547 -0.13153881 1.15674505 0.5256453243
## [131,] 1.87988207 -0.59039513 1.32668801 0.9192233541
## [132,] 2.48369858 1.70388647 1.49663097 1.0504160307
## [133,] 0.67224905 -0.59039513 1.04344975 1.3128013839
## [134,] 0.55148575 -0.59039513 0.76021149 0.3944526477
## [135,] 0.30995914 -1.04925145 1.04344975 0.2632599711
## [136,] 2.24217198 -0.13153881 1.32668801 1.4439940605
## [137,] 0.55148575 0.78617383 1.04344975 1.5751867371
## [138,] 0.67224905 0.09788935 0.98680210 0.7880306775
## [139,] 0.18919584 -0.13153881 0.59026853 0.7880306775
## [140,] 1.27606556 0.09788935 0.93015445 1.1816087073
## [141,] 1.03453895 0.09788935 1.04344975 1.5751867371
## [142,] 1.27606556 0.09788935 0.76021149 1.4439940605
## [143,] -0.05233076 -0.81982329 0.76021149 0.9192233541
## [144,] 1.15530226 0.32731751 1.21339271 1.4439940605
## [145,] 1.03453895 0.55674567 1.10009740 1.7063794137
## [146,] 1.03453895 -0.13153881 0.81685914 1.4439940605
## [147,] 0.55148575 -1.27867961 0.70356384 0.9192233541
## [148,] 0.79301235 -0.13153881 0.81685914 1.0504160307
## [149,] 0.43072244 0.78617383 0.93015445 1.4439940605
## [150,] 0.06843254 -0.13153881 0.76021149 0.7880306775

As you can see, the whole data is now centered around 0.
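A quick way to confirm this is to check the column means and standard deviations of the scaled matrix (a sketch only; 'scaled_data' is a placeholder name for the matrix printed above):

# Sanity check (sketch): after scaling, every column should have mean ~0 and sd ~1
# 'scaled_data' is a placeholder for the scaled matrix printed above
round(colMeans(scaled_data), 6)
round(apply(scaled_data, 2, sd), 6)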


12. Write an R program to read the 'state.x77' dataset into a data frame and apply multiple regression to
predict the value of the variable 'Murder' based on the other independent variables, selected according to
their correlations.

# Load the state.x77 dataset


data("state.x77")
# Convert the dataset to a data frame for easier manipulation
state_df <- as.data.frame(state.x77)
# View the structure of the dataset
str(state_df)
# Check the correlation matrix between all variables
cor_matrix <- cor(state_df)
print(cor_matrix)
# Identify the independent variables with the highest correlation to 'Murder'
# We'll look for variables with the highest correlation to 'Murder'
cor_murder <- cor_matrix["Murder", ]
print(cor_murder)
# Based on the correlations, we will choose the independent variables
# For this example, we'll select the variables with a high correlation to 'Murder'
# In this case, let's use variables with absolute correlation above 0.3
independent_vars <- names(cor_murder[abs(cor_murder) > 0.3 & names(cor_murder) != "Murder"])
# Create the formula for multiple regression
formula <- as.formula(paste("Murder ~", paste(independent_vars, collapse = " + ")))
# Fit the multiple regression model
model <- lm(formula, data = state_df)
# View the summary of the regression model
summary(model)
# Predictions based on the model
predictions <- predict(model, state_df)
# View the first few predictions
head(predictions)
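As an optional check (not required by the experiment), the fitted model can be evaluated by comparing the predictions with the actual murder rates:

# Optional evaluation of the fit (uses the 'model', 'state_df' and 'predictions' objects above)
actual    <- state_df$Murder
rmse      <- sqrt(mean((predictions - actual)^2))   # root mean squared error of the fit
r_squared <- summary(model)$r.squared               # proportion of variance explained
cat("RMSE:", rmse, "  R-squared:", r_squared, "\n")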


11. Write an R program to split the 'Titanic' dataset into training and test partitions and build a decision tree
to predict whether a passenger survived, given the passenger's description. Evaluate the performance
metrics from the confusion matrix.
# Load necessary libraries
library(rpart) # For decision tree
library(caret) # For confusion matrix and other metrics
library(dplyr) # For data manipulation
# Load the Titanic dataset
# (here we assume the 'titanic' CRAN package, whose per-passenger data is provided as 'titanic_train')
library(titanic)
titanic <- titanic_train
# View the structure of the dataset
str(titanic)
# Clean the dataset (e.g., handle missing values)
# For simplicity, we'll remove rows with missing values in the 'Survived' column
titanic_cleaned <- titanic %>%
filter(!is.na(Survived))
# Convert factors to appropriate levels if needed
titanic_cleaned$Survived <- as.factor(titanic_cleaned$Survived)
# Split the dataset into training and test sets (70% train, 30% test)
set.seed(123) # For reproducibility
train_index <- createDataPartition(titanic_cleaned$Survived, p = 0.7, list = FALSE)
train_data <- titanic_cleaned[train_index, ]
test_data <- titanic_cleaned[-train_index, ]
# Build the decision tree model to predict 'Survived'
decision_tree_model <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare,
data = train_data, method = "class")
# Print the decision tree model
print(decision_tree_model)
# Predict on the test set
predictions <- predict(decision_tree_model, test_data, type = "class")
# Evaluate performance using confusion matrix
conf_matrix <- confusionMatrix(predictions, test_data$Survived)
# Print the confusion matrix and performance metrics
print(conf_matrix)
# Extract additional performance metrics from the confusion matrix
accuracy <- conf_matrix$overall['Accuracy']
precision <- conf_matrix$byClass['Pos Pred Value']
recall <- conf_matrix$byClass['Sensitivity']
f1_score <- 2 * (precision * recall) / (precision + recall)
# Display performance metrics
cat("Accuracy: ", accuracy, "\n")
cat("Precision: ", precision, "\n")
cat("Recall: ", recall, "\n")
cat("F1 Score: ", f1_score, "\n")


WEKA

Waikato Environment for Knowledge Analysis

11. Create an ARFF (Attribute-Relation File Format) file and read it in WEKA. Explore the purpose
of each button under the preprocess panel after loading the ARFF file. Also, try to interpret using
a different ARFF file, weather.arff, provided with WEKA.

ARFF is WEKA's native data storage format. ARFF is an acronym for Attribute-Relation File Format.
The bulk of an ARFF file consists of a list of instances, and the attribute values for each instance are separated
by commas. This format is similar to the CSV file format and can be seen as an extension of the normal CSV
format.

Creating an ARFF file:


Many of the spreadsheet and database applications we use today provide a mechanism to export
data into a CSV (Comma Separated Values) file. Having done that, we need to make a few
modifications to turn it into an ARFF file.

1. Load the file into a text editor.
2. Add the dataset name using the @relation tag.
3. Add the attribute information using @attribute tags.
4. Add a line with @data followed by the actual data.
5. Save the file as raw text with the extension .arff.

Below is an example of an ARFF file generated from such a CSV.
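(The listing below is illustrative, in the style of WEKA's weather data; only a few instances are shown.)

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes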


Exploring the preprocess panel.


6. Click on the Open file button.
7. Choose the weather.arff file.
This loads the data into the Explorer. The next figure shows how the Explorer looks after loading
the file.

Current Relation:
This describes the loaded dataset, giving information such as the number of instances, the number of attributes, and so on.
Selected Attribute:
This describes the data of the currently selected attribute: its type (nominal or numeric), the number of
missing values, and so on.
Histogram:
A histogram is generated for the currently selected attribute; you can draw it for any other attribute
as well. Here play is selected as the class attribute; it is used to colour the histogram, and any filters
that require a class value use it too.
If you select a numeric attribute, you see its minimum and maximum values, mean, and standard
deviation. In this case, the histogram shows the distribution of the class as a function of this attribute.
Remove:
You can delete an attribute by clicking its checkbox and using the Remove button.
Invert:
It is used to invert the selection.
Pattern:
Pattern selects those attributes whose names match a user-supplied regular expression.
Undo:
It can be used to undo the changes we made.
Edit:
This button is used to edit and inspect the data loaded.

12. Performing data preprocessing in WEKA: Study unsupervised attribute filters such as
ReplaceMissingValues to replace missing values in the given dataset, Add to add a new
attribute Average, and Discretize to discretize the attributes into bins. Explore the Normalize and
Standardize options on a dataset with numerical attributes.
Study of some of the Unsupervised Attribute Filters

ReplaceMissingValues:

ReplaceMissingValues replaces each missing value with the mean for numeric attributes and the
mode for nominal ones.
If a class is set, missing values of that attribute are not replaced by default, but this can be changed.
ReplaceMissingWithUserConstant is another filter that can replace missing values. In this case, it
allows the user to specify a constant value to use. This constant can be specified separately for
numeric, nominal and date attributes.

Add:
Add inserts an attribute at a given position, whose value is declared to be missing for all instances.
Use the generic object editor to specify the attribute’s name, where it will appear in the list of
attributes, and its possible values (for nominal attributes); for date attributes, you can also specify the
date format.
Discretize:
Discretize uses equal-width or equal-frequency binning to discretize a range of numeric attributes,
specified in the usual way. For the former method, the number of bins can be specified or chosen
automatically by maximizing the likelihood using leave-one-out cross-validation. It is also
possible to create several binary attributes instead of one multi-valued one. For equal-frequency
discretization, the desired number of instances per interval can be changed.
PKIDiscretize discretizes numeric attributes using equal-frequency binning; the number of bins is
the square root of the number of values (excluding missing values). Both these filters skip the class
attribute by default.
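To see concretely what the two binning strategies do, here is a small illustration in plain R (a sketch of the idea only, not WEKA's own implementation):

# Illustration (in R, not WEKA) of equal-width versus equal-frequency binning
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
equal_width <- cut(x, breaks = 3)   # 3 bins of equal range; the outlier 100 stretches the bins
equal_freq  <- cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = 4)),
                   include.lowest = TRUE)   # roughly the same number of values per bin
table(equal_width)
table(equal_freq)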

Exploring Normalize and Standardize options on the dataset with numerical attributes.
Here we are selecting iris data set as the data set we are using to explore Normalize and Standardize
options.

Normalize:
Normalize scales all numeric values in the dataset to lie between 0 and 1. The normalized values
can be further scaled and translated with user-supplied constants.

The data before normalization.


To select Normalize filter,


● Go to Filter and click on the choose button.
● Then navigate as follows in that menu.
Weka >> filters >> unsupervised >> attribute >> Normalize
● Now click on Apply.
● After applying normalization


As you can see, the maximum value is 1 and the minimum is 0, indicating normalization.

Standardize:
Center and Standardize transform the data to have zero mean; Standardize gives it unit variance too.
All three filters (Normalize, Center, and Standardize) skip the class attribute if one is set.

If we use the Standardize filter instead of the Normalize filter, then the output will be as follows.

As you can see, the mean is 0 and the standard deviation is 1, which is the characteristic of Standardize;
however, the data may not lie in the range 0 to 1.
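For comparison, the two transformations can be sketched in R on an arbitrary numeric vector; these are the standard min-max and z-score formulas that the two filters apply:

# Min-max normalization and z-score standardization, sketched in R
x <- c(4.3, 5.8, 6.1, 7.9)
normalized   <- (x - min(x)) / (max(x) - min(x))   # values rescaled to lie in [0, 1]
standardized <- (x - mean(x)) / sd(x)              # zero mean, unit standard deviation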


13. Classification using the WEKA toolkit

a. Demonstration of the classification process using the Id3 algorithm on a categorical
dataset (weather).
b. Demonstration of classification process using the naïve Bayes algorithm on a
categorical dataset (‘vote’).
c. Demonstration of classification process using Random Forest algorithm on datasets
containing a large number of attributes.
The Classify panel lets you train and test learning schemes that perform classification or regression.

Demonstration of the classification process using the Id3 algorithm on a categorical dataset (weather):
By default, you may not find the Id3 algorithm under the available classifiers in the Classify tab.
To install it, go to Tools >> Package Manager and install the SimpleEducationalLearningSchemes package.

Now you can find the id3 algorithm under the trees section.

Now click on start to start the process of classification.

Classifier Output:
=== Run information ===
Scheme:       weka.classifiers.trees.Id3
Relation:     weather.symbolic
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===


Id3

outlook = sunny
| humidity = high: no
| humidity = normal: yes
outlook = overcast: yes
outlook = rainy
| windy = TRUE: no
| windy = FALSE: yes

Time taken to build model: 0 seconds

=== Stratified cross-validation ===


=== Summary ===

Correctly Classified Instances 12 85.7143 %


Incorrectly Classified Instances        2               14.2857 %
Kappa statistic 0.6889
Mean absolute error 0.1429
Root mean squared error 0.378
Relative absolute error 30 %
Root relative squared error 76.6097 %
Total Number of Instances 14

=== Detailed Accuracy By Class ===

=== Confusion Matrix ===

 a b   <-- classified as
 8 1 | a = yes
 1 4 | b = no
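The summary figures above can be read directly off this confusion matrix; a quick check (in R, for illustration only):

# Reading the metrics from the confusion matrix (rows = actual, columns = predicted)
cm <- matrix(c(8, 1,
               1, 4), nrow = 2, byrow = TRUE)
accuracy      <- sum(diag(cm)) / sum(cm)   # (8 + 4) / 14 = 0.857, i.e. 85.7143 %
precision_yes <- cm[1, 1] / sum(cm[, 1])   # 8 / 9 ≈ 0.889
recall_yes    <- cm[1, 1] / sum(cm[1, ])   # 8 / 9 ≈ 0.889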


Demonstration of the classification process using the naïve Bayes algorithm on a categorical dataset ('vote'):
Load the vote dataset by opening the vote.arff file. Select the NaiveBayes classifier.

Now click on start to start the classification process


Classifier output:
=== Run information ===
Scheme:
weka.classifiers.bayes.NaiveBayes
Relation: vote
Instances: 435
Attributes: 17
              handicapped-infants
              water-project-cost-sharing
              adoption-of-the-budget-resolution
              physician-fee-freeze
              el-salvador-aid
              religious-groups-in-schools
              anti-satellite-test-ban
              aid-to-nicaraguan-contras
              mx-missile
              immigration
              synfuels-corporation-cutback
              education-spending
              superfund-right-to-sue
              crime
              duty-free-exports

              export-administration-act-south-africa
              Class
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Naive Bayes Classifier

                            Class
Attribute            democrat  republican
                       (0.61)      (0.39)

handicapped-infants
n 103.0 135.0
y 157.0 32.0
[total] 260.0 167.0

water-project-cost-sharing

n 120.0 74.0
y 121.0 76.0
[total] 241.0 150.0

adoption-of-the-budget-resolution

n 30.0 143.0
y 232.0 23.0
[total] 262.0 166.0

physician-fee-freeze

n 246.0 3.0
y 15.0 164.0
[total] 261.0 167.0

el-salvador-aid
n 201.0 9.0
y 56.0 158.0
[total] 257.0 167.0

religious-groups-in-schools

n 136.0 18.0
y 124.0 150.0
[total] 260.0 168.0

anti-satellite-test-ban
n 60.0 124.0

y 201.0 40.0
[total] 261.0 164.0


aid-to-nicaraguan-contras

n 46.0 134.0
y 219.0 25.0
[total] 265.0 159.0

mx-missile

n 61.0 147.0
y 189.0 20.0
[total] 250.0 167.0

immigration

n 140.0 74.0
y 125.0 93.0
[total] 265.0 167.0

synfuels-corporation-cutback

n 127.0 139.0
y 130.0 22.0
[total] 257.0 161.0

education-spending

n 214.0 21.0
y 37.0 136.0
[total] 251.0 157.0

superfund-right-to-sue

n 180.0 23.0
y 74.0 137.0
[total] 254.0 160.0

crime
n 168.0 4.0
y 91.0 159.0
[total] 259.0 163.0

duty-free-exports

n 92.0 143.0
y 161.0 15.0
[total] 253.0 158.0


export-administration-act-south-africa

n 13.0 51.0
y 174.0 97.0
[total] 178.0 148.0


Time taken to build model: 0 seconds

=== Stratified cross-validation ===


=== Summary ===
Correctly Classified Instances 392 90.1149 %
Incorrectly Classified Instances 43 9.8851 %
Kappa statistic 0.7949
Mean absolute error 0.0995
Root mean squared error 0.2977
Relative absolute error                 20.9815 %
Root relative squared error 61.1406 %
Total Number of Instances 435

=== Detailed Accuracy By Class ===

=== Confusion Matrix ===

   a   b   <-- classified as
 238  29 |   a = democrat
  14 154 |   b = republican
Demonstration of the classification process using the Random Forest algorithm on a dataset
containing a large number of attributes.

The dataset with a large number of attributes is supermarket; it has 217 attributes.
Load it into WEKA by opening the file supermarket.arff.

Now select the RandomForest classifier.

Apply the algorithm by clicking on Start.


Classifier output:
=== Run information ===

Scheme:       weka.classifiers.trees.RandomForest -P 100 -I 100 -num-slots 1 -K 0 -M 1.0 -V 0.001 -S 1
Relation:     supermarket


Instances: 4627
Attributes: 217
[ list of attributes omitted ]
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

RandomForest

Bagging with 100 iterations and base learner

weka.classifiers.trees.RandomTree -K 0 -M 1.0 -V 0.001 -S 1 -do-not-check-capabilities

Time taken to build model: 8.99 seconds

=== Stratified cross-validation ===


=== Summary ===

Correctly Classified Instances 2948 63.713 %


Incorrectly Classified Instances 1679 36.287 %
Kappa statistic 0
Mean absolute error 0.4624
Root mean squared error 0.4808
Relative absolute error 99.9964 %
Root relative squared error 100 %
Total Number of Instances 4627

=== Detailed Accuracy By Class ===

=== Confusion Matrix ===

    a    b   <-- classified as
 2948    0 |   a = low
 1679    0 |   b = high


14. Classification using the WEKA toolkit –


Part 2: Demonstration of the classification process using the J48 algorithm on a mixed-type
dataset after discretizing numeric attributes. Perform a cross-validation strategy with
various fold levels and compare the accuracy of the results.

Selecting the mixed type of dataset:


We use the iris dataset for this problem. It is a mixed-type dataset, containing both numerical
and nominal attributes.
Histograms of all attributes.

Discretizing numeric attributes:


Select the filter Discretize and apply it.
After discretizing the numeric attributes, the continuous data will be discretized and the histograms
will look like this:

Applying the J48 algorithm:
Select the J48 classifier.

Apply it by clicking on Start.


Classifier output at 10-fold cross-validation:
=== Run information ===
Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     iris-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last-precision6
Instances:    150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree

petalwidth = '(-inf-0.34]': Iris-setosa (41.0)
petalwidth = '(0.34-0.58]': Iris-setosa (8.0)
petalwidth = '(0.58-0.82]': Iris-setosa (1.0)
petalwidth = '(0.82-1.06]': Iris-versicolor (7.0)
petalwidth = '(1.06-1.3]': Iris-versicolor (21.0)
petalwidth = '(1.3-1.54]': Iris-versicolor (20.0/3.0)
petalwidth = '(1.54-1.78]': Iris-versicolor (6.0/2.0)
petalwidth = '(1.78-2.02]': Iris-virginica (23.0/1.0)
petalwidth = '(2.02-2.26]': Iris-virginica (9.0)
petalwidth = '(2.26-inf)': Iris-virginica (14.0)

Number of Leaves : 10

Size of the tree : 11

Time taken to build model: 0.09 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 144 96 %


Incorrectly Classified Instances 6 4 %
Kappa statistic 0.94
Mean absolute error 0.0489
Root mean squared error 0.1637
Root relative squared error 34.7274 %
Total Number of Instances 150

=== Detailed Accuracy By Class ===

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  5 45 |  c = Iris-virginica

Classifier output at 20-fold cross-validation:

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2


Relation:     iris-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last-precision6

Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Test mode: 20-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree

petalwidth = '(-inf-0.34]': Iris-setosa (41.0)


petalwidth = '(0.34-0.58]': Iris-setosa (8.0)
petalwidth = '(0.58-0.82]': Iris-setosa (1.0)
petalwidth = '(0.82-1.06]': Iris-versicolor (7.0)
petalwidth = '(1.06-1.3]': Iris-versicolor (21.0)
petalwidth = '(1.3-1.54]': Iris-versicolor (20.0/3.0)
petalwidth = '(1.54-1.78]': Iris-versicolor (6.0/2.0)
petalwidth = '(1.78-2.02]': Iris-virginica (23.0/1.0)
petalwidth = '(2.02-2.26]': Iris-virginica (9.0)
petalwidth = '(2.26-inf)': Iris-virginica (14.0)

Number of Leaves : 10

Size of the tree : 11

Time taken to build model: 0.01 seconds

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 143 95.3333 %
Incorrectly Classified Instances        7                4.6667 %
Kappa statistic 0.93
Mean absolute error 0.0485
Root mean squared error 0.1619
Relative absolute error 10.8999 %
Root relative squared error 34.3145 %
Total Number of Instances 150

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.980 0.000 1.000 0.980 0.990 0.985 1.000 1.000 Iris-setosa
0.980 0.060 0.891 0.980 0.933 0.900 0.974 0.956 Iris-versicolor
0.900 0.010 0.978 0.900 0.938 0.910 0.978 0.947 Iris-virginica
Weighted Avg. 0.953 0.023 0.956 0.953 0.954 0.932 0.984 0.967
=== Confusion Matrix ===

  a  b  c   <-- classified as
49 1 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
0 5 45 | c = Iris-virginica

Comparison of accuracies at different fold levels of cross-validation:

Fold level    Accuracy
10            96 %
20            95.333 %
30            96 %


Performing clustering in WEKA: Apply a hierarchical clustering algorithm on a numeric dataset and
estimate the cluster quality. Apply the DBSCAN algorithm on a numeric dataset and estimate the
cluster quality.

Let us take the iris dataset, as all the attributes are numerical except the class attribute.

Applying the hierarchical clustering algorithm:

We set the number of clusters to 3 because there are three classes, and selected the class attribute
for "Classes to clusters" evaluation.

Scheme:       weka.clusterers.HierarchicalClusterer -N 3 -L SINGLE -P -A "weka.core.EuclideanDistance -R first-last"
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
Ignored:
              class
Test mode:    Classes to clusters evaluation on training data

=== Clustering model (full training set) ===

Cluster 0
((((((((((((((((((((0.2:0.03254,0.2:0.03254):0.00913,(0.3:0.03254,0.3:0.03254):0.00913):0.00332,
((0.2:0.02778,0.2:0.02778):0.00476,0.2:0.03254):0.01244):0,0.2:0.04498):0.0051,0.2:0.05008):0.003
64,0.2:0.05371):0.00437,(0.2:0.05085,0.2:0.05085):0.00724):0.01535,
(0.5:0.06731,0.4:0.06731):0.00612):0.00188,0.2:0.07531):0.00196,0.3:0.07728):0.00536,
((((((0.2:0.04383,0.2:0.04383):0.00625,0.3:0.05008):0,0.1:0.05008):0.00279,
(((((0.2:0.03254,0.2:0.03254):0.01129,0.2:0.04383):0.00116,0.2:0.04498):0.0051,0.2:0.05008):0.002
79,((0.1:0,0.1:0):0,0.1:0):0.05287):0):0.00522,0.2:0.05808):0.01919,
((0.2:0.04498,0.2:0.04498):0.01549,0.1:0.06047):0.0168):0.00536):0.00165,0.2:0.08429):0.00356,
(((0.2:0.02778,0.2:0.02778):0.04371,

((0.3:0.04498,0.2:0.04498):0.01394,0.4:0.05893):0.01256):0.00809,0.4:0.07958):0.00826):0.00212,0
.4:0.08996):0.00321,0.6:0.09317):0.00598,
(0.4:0.0678,0.4:0.0678):0.03135):0.00292,0.3:0.10206):0.01316,0.2:0.11523):0.01375,(0.2:0.12263,
(0.1:0.10346,0.2:0.10346):0.01917):0.00634):0.00241,0.4:0.13139)

Cluster 2
(((((((((((((((((((((((((((((1.4:0.07344,(((1.5:0.06508,1.5:0.06508):0.00066,
(1.4:0.05008,1.4:0.05008):0.01566):0.00224,1.3:0.06798):0.00546):0.00188,(1.3:0.07137,
(1.3:0.05556,1.3:0.05556):0.01581):0.00395):0.00733,(1.5:0.07137,
((1.4:0.04498,1.4:0.04498):0.01549,1.5:0.06047):0.01089):0.01127):0.00515,1.4:0.08779):0.00538,1
.2:0.09317):0.00405,1.5:0.09722):0.0004,(1.5:0.05556,1.5:0.05556):0.04207):0.00152,
(1.5:0.07344,1.6:0.07344):0.02571):0,1.6:0.09914):0.00219,1.5:0.10133):0.00073,1.6:0.10206):0.00
14,(((((1.3:0.08333,1.3:0.08333):0.00613,((((1.3:0.06574,((1.3:0.05287,1.2:0.05287):0,(1.3:0.05287,
(1.3:0.04498,1.3:0.04498):0.00789):0):0.01287):0.0077,
(1.2:0.04498,1.2:0.04498):0.02845):0,1.2:0.07344):0.0093,(1.1:0.05287,
(1.1:0.04498,1.0:0.04498):0.00789):0.02987):0.00672):0.0005,1.0:0.08996):0.00406,1.0:0.09402):0.
00041,1.3:0.09443):0.00902):0.00268,1.7:0.10614):0.00342,((((((1.8:0.08784,
((1.8:0.03254,1.8:0.03254):0.0254,1.8:0.05794):0.0299):0.00162,(1.9:0.08429,
(1.8:0.05287,1.8:0.05287):0.03142):0.00518):0.00524,1.9:0.0947):0.01144,(2.2:0.09415,
(2.1:0.04167,2.2:0.04167):0.05249):0.01199):0,(((1.8:0.07148,
(1.8:0.05008,1.8:0.05008):0.02141):0.02614,(2.0:0.08504,2.0:0.08504):0.01258):0.00852,
(((2.1:0.05287,2.1:0.05287):0.04475,((((2.3:0.04383,2.3:0.04383):0.03881,2.4:0.08264):0.00719,
(2.3:0.07148,2.3:0.07148):0.01834):0.00487,2.5:0.0947):0.00292):0.00534,2.1:0.10296):0.00318):0)
:0.00129,2.1:0.10743):0.00214):0.00446,((2.5:0.08983,
(2.4:0.06047,2.3:0.06047):0.02935):0.01175,2.3:0.10158):0.01245):0.01212,1.4:0.12614):0.00283,1.
4:0.12897):0.00054,1.5:0.12951):0.00514,
(((1.9:0,1.9:0):0.08779,2.0:0.08779):0.01089,2.0:0.09869):0.03597):0.01023,
((1.5:0.09869,1.3:0.09869):0.00264,1.5:0.10133):0.04356):0.00338,
(((2.1:0.09869,2.0:0.09869):0.02337,2.3:0.12206):0.01586,((1.8:0.07344,1.9:0.07344):0.05554,
(1.8:0.12263,1.6:0.12263):0.00634):0.00895):0.01034):0.00275,1.8:0.15102):0.00299,2.3:0.15401):0
.00606,
(((1.0:0.05008,1.0:0.05008):0.04555,1.1:0.09562):0.03389,1.0:0.12951):0.03056):0.00969,1.0:0.169
76):0.00916,2.4:0.17892):0.01985,2.5:0.19878):0.00086,1.7:0.19964):0.02884,
(2.2:0.11232,2.0:0.11232):0.11615)

Time taken to build model (full training data) : 0.12 seconds


=== Model and evaluation on training set ===

Clustered Instances

0       49 ( 33%)
1        1 (  1%)
2      100 ( 67%)

Class attribute: class
Classes to Clusters:

  0  1  2  <-- assigned to cluster
 49  1  0 | Iris-setosa
  0  0 50 | Iris-versicolor

0 0 50 | Iris-virginica
Cluster 0 <-- Iris-setosa
Cluster 1 <-- No class
Cluster 2 <-- Iris-versicolor
Incorrectly clustered instances : 51.0 34 %
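This figure can be reproduced from the classes-to-clusters table above (shown here as a small R check):

# 1 Iris-setosa fell into cluster 1 (no class assigned) and all 50 Iris-virginica fell into
# cluster 2, which was assigned to Iris-versicolor
incorrect <- 1 + 50
incorrect / 150        # 0.34, i.e. the reported 34 %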

In the cluster visualization, the squares represent incorrectly clustered points. Most of the
Iris-virginica instances are grouped with Iris-versicolor and are therefore incorrectly clustered.

Applying DBSCAN algorithm on numeric dataset:


DBSCAN is not available in WEKA by default; install it from the Package Manager. After tuning the
algorithm's parameters (here Epsilon = 0.3 and minPoints = 4), run the clusterer.
Cluster output:
=== Run information ===

Scheme:       weka.clusterers.DBSCAN -E 0.3 -M 4 -A "weka.core.EuclideanDistance -R first-last"
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
Ignored:
              class
=========================================================================

Clustered DataObjects: 150
Number of attributes: 4
Epsilon: 0.3; minPoints: 4
Distance-type:
Number of generated clusters: 2
Elapsed time: .01

( 0.) 5.1,3.5,1.4,0.2 --> 0


( 1.) 4.9,3,1.4,0.2 --> 0
( 2.) 4.7,3.2,1.3,0.2 --> 0
( 3.) 4.6,3.1,1.5,0.2 --> 0
( 4.) 5,3.6,1.4,0.2 --> 0
( 5.) 5.4,3.9,1.7,0.4 --> 0
( 6.) 4.6,3.4,1.4,0.3 --> 0
( 7.) 5,3.4,1.5,0.2 --> 0
( 8.) 4.4,2.9,1.4,0.2 --> 0
( 9.) 4.9,3.1,1.5,0.1 --> 0


( 10.) 5.4,3.7,1.5,0.2 --> 0


( 11.) 4.8,3.4,1.6,0.2 --> 0
( 12.) 4.8,3,1.4,0.1 --> 0
( 13.) 4.3,3,1.1,0.1 --> 0
( 14.) 5.8,4,1.2,0.2 --> 0
( 15.) 5.7,4.4,1.5,0.4 --> 0
( 16.) 5.4,3.9,1.3,0.4 --> 0
( 17.) 5.1,3.5,1.4,0.3 --> 0
( 18.) 5.7,3.8,1.7,0.3 --> 0
( 19.) 5.1,3.8,1.5,0.3 --> 0
( 20.) 5.4,3.4,1.7,0.2 --> 0
( 21.) 5.1,3.7,1.5,0.4 --> 0
( 22.) 4.6,3.6,1,0.2 --> 0
( 23.) 5.1,3.3,1.7,0.5 --> 0
( 24.) 4.8,3.4,1.9,0.2 --> 0
( 25.) 5,3,1.6,0.2 --> 0
( 26.) 5,3.4,1.6,0.4 --> 0
( 27.) 5.2,3.5,1.5,0.2 --> 0
( 28.) 5.2,3.4,1.4,0.2 --> 0
( 29.) 4.7,3.2,1.6,0.2 --> 0
( 30.) 4.8,3.1,1.6,0.2 --> 0
( 31.) 5.4,3.4,1.5,0.4 --> 0
( 32.) 5.2,4.1,1.5,0.1 --> 0
( 33.) 5.5,4.2,1.4,0.2 --> 0
( 34.) 4.9,3.1,1.5,0.1 --> 0
( 35.) 5,3.2,1.2,0.2 --> 0
( 36.) 5.5,3.5,1.3,0.2 --> 0
( 37.) 4.9,3.1,1.5,0.1 --> 0
( 38.) 4.4,3,1.3,0.2 --> 0
( 39.) 5.1,3.4,1.5,0.2 --> 0
( 40.) 5,3.5,1.3,0.3 --> 0
( 41.) 4.5,2.3,1.3,0.3 --> 0
( 42.) 4.4,3.2,1.3,0.2 --> 0
( 43.) 5,3.5,1.6,0.6 --> 0
( 44.) 5.1,3.8,1.9,0.4 --> 0
( 45.) 4.8,3,1.4,0.3 --> 0
( 46.) 5.1,3.8,1.6,0.2 --> 0
( 47.) 4.6,3.2,1.4,0.2 --> 0
( 48.) 5.3,3.7,1.5,0.2 --> 0
( 49.) 5,3.3,1.4,0.2 --> 0
( 50.) 7,3.2,4.7,1.4 --> 1
( 51.) 6.4,3.2,4.5,1.5 --> 1
( 52.) 6.9,3.1,4.9,1.5 --> 1
( 53.) 5.5,2.3,4,1.3 --> 1


( 54.) 6.5,2.8,4.6,1.5 --> 1

( 55.) 5.7,2.8,4.5,1.3 --> 1


( 56.) 6.3,3.3,4.7,1.6 --> 1
( 57.) 4.9,2.4,3.3,1 --> 1
( 58.) 6.6,2.9,4.6,1.3 --> 1
( 59.) 5.2,2.7,3.9,1.4 --> 1
( 60.) 5,2,3.5,1 --> 1
( 61.) 5.9,3,4.2,1.5 --> 1
( 62.) 6,2.2,4,1 --> 1
( 63.) 6.1,2.9,4.7,1.4 --> 1
( 64.) 5.6,2.9,3.6,1.3 --> 1
( 65.) 6.7,3.1,4.4,1.4 --> 1
( 66.) 5.6,3,4.5,1.5 --> 1
( 67.) 5.8,2.7,4.1,1 --> 1
( 68.) 6.2,2.2,4.5,1.5 --> 1
( 69.) 5.6,2.5,3.9,1.1 --> 1
( 70.) 5.9,3.2,4.8,1.8 --> 1
( 71.) 6.1,2.8,4,1.3 --> 1
( 72.) 6.3,2.5,4.9,1.5 --> 1
( 73.) 6.1,2.8,4.7,1.2 --> 1
( 74.) 6.4,2.9,4.3,1.3 --> 1
( 75.) 6.6,3,4.4,1.4 --> 1
( 76.) 6.8,2.8,4.8,1.4 --> 1
( 77.) 6.7,3,5,1.7 --> 1
( 78.) 6,2.9,4.5,1.5 --> 1
( 79.) 5.7,2.6,3.5,1 --> 1
( 80.) 5.5,2.4,3.8,1.1 --> 1
( 81.) 5.5,2.4,3.7,1 --> 1
( 82.) 5.8,2.7,3.9,1.2 --> 1
( 83.) 6,2.7,5.1,1.6 --> 1
( 84.) 5.4,3,4.5,1.5 --> 1
( 85.) 6,3.4,4.5,1.6 --> 1
( 86.) 6.7,3.1,4.7,1.5 --> 1
( 87.) 6.3,2.3,4.4,1.3 --> 1
( 88.) 5.6,3,4.1,1.3 --> 1
( 89.) 5.5,2.5,4,1.3 --> 1
( 90.) 5.5,2.6,4.4,1.2 --> 1
( 91.) 6.1,3,4.6,1.4 --> 1
( 92.) 5.8,2.6,4,1.2 --> 1
( 93.) 5,2.3,3.3,1 --> 1
( 94.) 5.6,2.7,4.2,1.3 --> 1
( 95.) 5.7,3,4.2,1.2 --> 1
( 96.) 5.7,2.9,4.2,1.3 --> 1
( 97.) 6.2,2.9,4.3,1.3 --> 1

( 98.) 5.1,2.5,3,1.1 --> 1
( 99.) 5.7,2.8,4.1,1.3 --> 1
(100.) 6.3,3.3,6,2.5 --> 1
(101.) 5.8,2.7,5.1,1.9 --> 1


(102.) 7.1,3,5.9,2.1 --> 1


(103.) 6.3,2.9,5.6,1.8 --> 1
(104.) 6.5,3,5.8,2.2 --> 1
(105.) 7.6,3,6.6,2.1 --> 1
(106.) 4.9,2.5,4.5,1.7 --> 1
(107.) 7.3,2.9,6.3,1.8 --> 1
(108.) 6.7,2.5,5.8,1.8 --> 1
(109.) 7.2,3.6,6.1,2.5 --> 1
(110.) 6.5,3.2,5.1,2 --> 1
(111.) 6.4,2.7,5.3,1.9 --> 1
(112.) 6.8,3,5.5,2.1 --> 1
(113.) 5.7,2.5,5,2 --> 1
(114.) 5.8,2.8,5.1,2.4 --> 1
(115.) 6.4,3.2,5.3,2.3 --> 1
(116.) 6.5,3,5.5,1.8 --> 1
(117.) 7.7,3.8,6.7,2.2 --> 1
(118.) 7.7,2.6,6.9,2.3 --> 1
(119.) 6,2.2,5,1.5 --> 1
(120.) 6.9,3.2,5.7,2.3 --> 1
(121.) 5.6,2.8,4.9,2 --> 1
(122.) 7.7,2.8,6.7,2 --> 1
(123.) 6.3,2.7,4.9,1.8 --> 1
(124.) 6.7,3.3,5.7,2.1 --> 1
(125.) 7.2,3.2,6,1.8 --> 1
(126.) 6.2,2.8,4.8,1.8 --> 1
(127.) 6.1,3,4.9,1.8 --> 1
(128.) 6.4,2.8,5.6,2.1 --> 1
(129.) 7.2,3,5.8,1.6 --> 1
(130.) 7.4,2.8,6.1,1.9 --> 1
(131.) 7.9,3.8,6.4,2 --> NOISE
(132.) 6.4,2.8,5.6,2.2 --> 1
(133.) 6.3,2.8,5.1,1.5 --> 1
(134.) 6.1,2.6,5.6,1.4 --> 1
(135.) 7.7,3,6.1,2.3 --> 1
(136.) 6.3,3.4,5.6,2.4 --> 1
(137.) 6.4,3.1,5.5,1.8 --> 1
(138.) 6,3,4.8,1.8 --> 1
(139.) 6.9,3.1,5.4,2.1 --> 1
(140.) 6.7,3.1,5.6,2.4 --> 1
(141.) 6.9,3.1,5.1,2.3 --> 1
(142.) 5.8,2.7,5.1,1.9 --> 1
(143.) 6.8,3.2,5.9,2.3 --> 1
(144.) 6.7,3.3,5.7,2.5 --> 1
(145.) 6.7,3,5.2,2.3 --> 1
(146.) 6.3,2.5,5,1.9 --> 1
(147.) 6.5,3,5.2,2 --> 1

(148.) 6.2,3.4,5.4,2.3 --> 1


(149.) 5.9,3,5.1,1.8 --> 1

Time taken to build model (full training data) : 0.01 seconds

=== Model and evaluation on training set ===

Clustered Instances

0      50 ( 34%)
1      99 ( 66%)

Unclustered instances : 1

Class attribute: class

Classes to Clusters:

  0  1  <-- assigned to cluster
 50  0 | Iris-setosa
  0 50 | Iris-versicolor
  0 49 | Iris-virginica

Cluster 0 <-- Iris-setosa
Cluster 1 <-- Iris-versicolor

Incorrectly clustered instances :       49.0     32.6667 %
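Again, this figure follows from the table: the 49 clustered Iris-virginica instances fall into cluster 1, which was assigned to Iris-versicolor, and one instance (number 131) was marked as NOISE and left unclustered:

# DBSCAN cluster quality from the classes-to-clusters table
incorrect <- 49
incorrect / 150        # ≈ 0.3267, i.e. the reported 32.6667 %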


15. Association rule analysis in WEKA:


a. Demonstration of Association Rule Mining on supermarket dataset using
Apriori Algorithm with different support and confidence thresholds.
b. Demonstration of Association Rule Mining on supermarket dataset
using FPGrowth Algorithm with different support and confidence
thresholds.

Read the supermarket dataset into WEKA by opening the file supermarket.arff. The association rules
are generated in the Associate tab.

Demonstration of Association Rule Mining on the supermarket dataset using the Apriori algorithm
with different support and confidence thresholds:
Select the Apriori associator.

At confidence = 0.9:
Let us take a confidence value of 0.9 and apply the Apriori algorithm for association rule mining on
the supermarket dataset. The threshold can be modified by changing the -C option (minMetric) in the
associator's settings.
Associator output:
=== Run information ===

Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.15 (694 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:


Size of set of large itemsets L(1): 44

Size of set of large itemsets L(2): 380

Size of set of large itemsets L(3): 910

Size of set of large itemsets L(4): 633

Size of set of large itemsets L(5): 105

Size of set of large itemsets L(6): 1

Best rules found:

1. biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723 <conf:(0.92)> lift:
(1.27) lev:(0.03) [155] conv:(3.35)
2. baking needs=t biscuits=t fruit=t total=high 760 ==> bread and cake=t 696 <conf:(0.92)> lift:
(1.27) lev:(0.03) [149] conv:(3.28)
3. baking needs=t frozen foods=t fruit=t total=high 770 ==> bread and cake=t 705 <conf:(0.92)>
lift:(1.27) lev:(0.03) [150] conv:(3.27)
4. biscuits=t fruit=t vegetables=t total=high 815 ==> bread and cake=t 746 <conf:(0.92)>
lift: (1.27) lev:(0.03) [159] conv:(3.26)
5. party snack foods=t fruit=t total=high 854 ==> bread and cake=t 779 <conf:(0.91)>
lift:(1.27) lev:(0.04) [164] conv:(3.15)
6. biscuits=t frozen foods=t vegetables=t total=high 797 ==> bread and cake=t 725 <conf:(0.91)>
lift:(1.26) lev:(0.03) [151] conv:(3.06)
7. baking needs=t biscuits=t vegetables=t total=high 772 ==> bread and cake=t 701 <conf:(0.91)>
lift:(1.26) lev:(0.03) [145] conv:(3.01)
8. biscuits=t fruit=t total=high 954 ==> bread and cake=t 866 <conf:(0.91)> lift:(1.26) lev:(0.04)
[179] conv:(3)
9. frozen foods=t fruit=t vegetables=t total=high 834 ==> bread and cake=t 757 <conf:(0.91)> lift:
(1.26) lev:(0.03) [156] conv:(3)
10. frozen foods=t fruit=t total=high 969 ==> bread and cake=t 877 <conf:(0.91)> lift:(1.26) lev:
(0.04) [179] conv:(2.92)
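To interpret these figures, take rule 1: 788 baskets contain the antecedent items and 723 of them also contain bread and cake. A quick R check using only the counts printed above:

# Confidence and (implied) baseline for rule 1
conf       <- 723 / 788      # ≈ 0.917, the reported confidence of 0.92
p_baseline <- conf / 1.27    # lift = confidence / P(consequent), so P(bread and cake=t) ≈ 0.72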

At confidence = 0.85:
Let us take a confidence value of 0.85 and apply the Apriori algorithm for association rule mining on
the supermarket dataset. The threshold can be modified by changing the -C option (minMetric) in the
associator's settings.

Associator output:
=== Run information ===

Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.85 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     supermarket


Instances: 4627
Attributes: 217
[list of attributes omitted]
=== Associator model (full training set) ===
Apriori
=======

Minimum support: 0.2 (925 instances)
Minimum metric <confidence>: 0.85
Number of cycles performed: 16

Generated sets of large itemsets:

Size of set of large itemsets L(1): 38

Size of set of large itemsets L(2): 225

Size of set of large itemsets L(3): 302

Size of set of large itemsets L(4): 80

Size of set of large itemsets L(5): 2

Best rules found:

1. biscuits=t frozen foods=t fruit=t vegetables=t 1039 ==> bread and cake=t 929 <conf:(0.89)>
lift:(1.24) lev:(0.04) [181] conv:(2.62)
2. fruit=t vegetables=t total=high 1050 ==> bread and cake=t 938 <conf:(0.89)> lift:(1.24) lev:
(0.04) [182] conv:(2.6)
3. fruit=t total=high 1243 ==> bread and cake=t 1104 <conf:(0.89)> lift:(1.23) lev:(0.05) [209]
conv:(2.49)
4. biscuits=t total=high 1228 ==> bread and cake=t 1082 <conf:(0.88)> lift:(1.22) lev:(0.04) [198]
conv:(2.34)
5. milk-cream=t total=high 1217 ==> bread and cake=t 1071 <conf:(0.88)> lift:(1.22) lev:(0.04)
[195] conv:(2.32)
6. biscuits=t margarine=t vegetables=t 1054 ==> bread and cake=t 925 <conf:(0.88)>
lift:(1.22) lev:(0.04) [166] conv:(2.27)
7. frozen foods=t total=high 1273 ==> bread and cake=t 1117 <conf:(0.88)> lift:(1.22) lev:(0.04)
[200] conv:(2.27)
8. biscuits=t margarine=t fruit=t 1073 ==> bread and cake=t 938 <conf:(0.87)> lift:(1.21) lev:
(0.04) [165] conv:(2.21)
9. party snack foods=t total=high 1120 ==> bread and cake=t 979 <conf:(0.87)> lift:(1.21) lev:
(0.04) [172] conv:(2.21)


10. vegetables=t total=high 1270 ==> bread and cake=t 1110 <conf:(0.87)> lift:(1.21) lev:(0.04)
[195] conv:(2.21)

At confidence = 0.8:
Let us take a confidence value of 0.8 and apply the Apriori algorithm for association rule mining on
the supermarket dataset. The threshold can be modified by changing the -C option (minMetric) in the
associator's settings.

Associator output:
=== Run information ===

Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.8 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
=== Associator model (full training set) ===
Apriori
=======

Minimum support: 0.3 (1388 instances)


Minimum metric <confidence>: 0.8
Number of cycles performed: 14

Generated sets of large itemsets:

Size of set of large itemsets L(1): 25

Size of set of large itemsets L(2): 69

Size of set of large itemsets L(3): 20

Best rules found:

1. biscuits=t vegetables=t 1764 ==> bread and cake=t 1487 <conf:(0.84)> lift:(1.17) lev:(0.05)
[217] conv:(1.78)
2. total=high 1679 ==> bread and cake=t 1413 <conf:(0.84)> lift:(1.17) lev:(0.04) [204] conv:
(1.76)
3. biscuits=t milk-cream=t 1767 ==> bread and cake=t 1485 <conf:(0.84)> lift:(1.17) lev:(0.05)
[213] conv:(1.75)
4. biscuits=t fruit=t 1837 ==> bread and cake=t 1541 <conf:(0.84)> lift:(1.17) lev:(0.05) [218]
conv:(1.73)


5. biscuits=t frozen foods=t 1810 ==> bread and cake=t 1510 <conf:(0.83)> lift:(1.16) lev:(0.04)
[207] conv:(1.69)
6. frozen foods=t fruit=t 1861 ==> bread and cake=t 1548 <conf:(0.83)> lift:(1.16) lev:(0.05)
[208] conv:(1.66)
7. frozen foods=t milk-cream=t 1826 ==> bread and cake=t 1516 <conf:(0.83)> lift:(1.15) lev:
(0.04) [201] conv:(1.65)
8. baking needs=t milk-cream=t 1907 ==> bread and cake=t 1580 <conf:(0.83)> lift:(1.15) lev:
(0.04) [207] conv:(1.63)
9. milk-cream=t fruit=t 2038 ==> bread and cake=t 1684 <conf:(0.83)> lift:(1.15) lev:(0.05) [217]
conv:(1.61)
10. baking needs=t biscuits=t 1764 ==> bread and cake=t 1456 <conf:(0.83)> lift:(1.15) lev:(0.04)
[186] conv:(1.6)

Demonstration of Association Rule Mining on the supermarket dataset using the FPGrowth algorithm
with different support and confidence thresholds:
Select the FPGrowth associator.

At confidence = 0.9:
Let us take a confidence value of 0.9 and apply the FPGrowth algorithm for association rule mining
on the supermarket dataset. The threshold can be modified by changing the -C option (minMetric) in
the associator's settings.

Associator output:
=== Run information ===

Scheme:       weka.associations.FPGrowth -P 2 -I -1 -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1
Relation:     supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
=== Associator model (full training set) ===

FPGrowth found 16 rules (displaying top 10)

1. [fruit=t, frozen foods=t, biscuits=t, total=high]: 788 ==> [bread and cake=t]: 723 <conf:(0.92)>
lift:(1.27) lev:(0.03) conv:(3.35)
2. [fruit=t, baking needs=t, biscuits=t, total=high]: 760 ==> [bread and cake=t]: 696 <conf:(0.92)>
lift:(1.27) lev:(0.03) conv:(3.28)


3. [fruit=t, baking needs=t, frozen foods=t, total=high]: 770 ==> [bread and cake=t]: 705 <conf:
(0.92)> lift:(1.27) lev:(0.03) conv:(3.27)
4. [fruit=t, vegetables=t, biscuits=t, total=high]: 815 ==> [bread and cake=t]: 746 <conf:(0.92)>
lift:(1.27) lev:(0.03) conv:(3.26)
5. [fruit=t, party snack foods=t, total=high]: 854 ==> [bread and cake=t]: 779 <conf:(0.91)> lift:
(1.27) lev:(0.04) conv:(3.15)
6. [vegetables=t, frozen foods=t, biscuits=t, total=high]: 797 ==> [bread and cake=t]: 725 <conf:
(0.91)> lift:(1.26) lev:(0.03) conv:(3.06)
7. [vegetables=t, baking needs=t, biscuits=t, total=high]: 772 ==> [bread and cake=t]: 701 <conf:
(0.91)> lift:(1.26) lev:(0.03) conv:(3.01)
8. [fruit=t, biscuits=t, total=high]: 954 ==> [bread and cake=t]: 866 <conf:(0.91)> lift:(1.26) lev:
(0.04) conv:(3)
9. [fruit=t, vegetables=t, frozen foods=t, total=high]: 834 ==> [bread and cake=t]: 757 <conf:
(0.91)> lift:(1.26) lev:(0.03) conv:(3)
10. [fruit=t, frozen foods=t, total=high]: 969 ==> [bread and cake=t]: 877 <conf:(0.91)> lift:(1.26)
lev:(0.04) conv:(2.92)

At confidence = 0.85:
Let us take a confidence value of 0.85 and apply the FPGrowth algorithm for association rule mining
on the supermarket dataset. The threshold can be modified by changing the -C option (minMetric) in
the associator's settings.
Associator output:
=== Run information ===

Scheme:       weka.associations.FPGrowth -P 2 -I -1 -N 10 -T 0 -C 0.85 -D 0.05 -U 1.0 -M 0.1
Relation:     supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
=== Associator model (full training set) ===

FPGrowth found 51 rules (displaying top 10)

1. [fruit=t, vegetables=t, frozen foods=t, biscuits=t]: 1039 ==> [bread and cake=t]: 929 <conf:
(0.89)> lift:(1.24) lev:(0.04) conv:(2.62)
2. [fruit=t, vegetables=t, total=high]: 1050 ==> [bread and cake=t]: 938 <conf:(0.89)> lift:(1.24)
lev:(0.04) conv:(2.6)
3. [fruit=t, total=high]: 1243 ==> [bread and cake=t]: 1104 <conf:(0.89)> lift:(1.23) lev:(0.05)
conv:(2.49)
4. [biscuits=t, total=high]: 1228 ==> [bread and cake=t]: 1082 <conf:(0.88)> lift:(1.22) lev:(0.04)
conv:(2.34)
5. [milk-cream=t, total=high]: 1217 ==> [bread and cake=t]: 1071 <conf:(0.88)> lift:(1.22) lev:
(0.04) conv:(2.32)


6. [fruit=t, biscuits=t, margarine=t]: 1073 ==> [bread and cake=t]: 938 <conf:(0.87)> lift:(1.21) lev:(0.04) conv:(2.21)
7. [party snack foods=t, total=high]: 1120 ==> [bread and cake=t]: 979 <conf:(0.87)> lift:(1.21) lev:(0.04) conv:(2.21)
8. [vegetables=t, total=high]: 1270 ==> [bread and cake=t]: 1110 <conf:(0.87)> lift:(1.21) lev:(0.04) conv:(2.21)
9. [fruit=t, frozen foods=t, biscuits=t]: 1309 ==> [bread and cake=t]: 1143 <conf:(0.87)> lift:(1.21) lev:(0.04) conv:(2.2)

At confidence = 0.8:
Let us take a confidence value of 0.8 and apply the FPGrowth algorithm for association rule mining
on the supermarket dataset. The threshold can be modified by changing the -C option (minMetric) in
the associator's settings.

Associator output:
=== Run information ===

Scheme:       weka.associations.FPGrowth -P 2 -I -1 -N 10 -T 0 -C 0.8 -D 0.05 -U 1.0 -M 0.1
Relation:     supermarket
Instances: 4627
Attributes: 217
[list of attributes omitted]
=== Associator model (full training set) ===

FPGrowth found 17 rules (displaying top 10)

1. [vegetables=t, biscuits=t]: 1764 ==> [bread and cake=t]: 1487 <conf:(0.84)> lift:(1.17) lev:(0.05) conv:(1.78)
2. [total=high]: 1679 ==> [bread and cake=t]: 1413 <conf:(0.84)> lift:(1.17) lev:(0.04) conv:(1.76)
3. [milk-cream=t, biscuits=t]: 1767 ==> [bread and cake=t]: 1485 <conf:(0.84)> lift:(1.17) lev:(0.05) conv:(1.75)
4. [fruit=t, biscuits=t]: 1837 ==> [bread and cake=t]: 1541 <conf:(0.84)> lift:(1.17) lev:(0.05) conv:(1.73)
5. [frozen foods=t, biscuits=t]: 1810 ==> [bread and cake=t]: 1510 <conf:(0.83)> lift:(1.16) lev:(0.04) conv:(1.69)
6. [fruit=t, frozen foods=t]: 1861 ==> [bread and cake=t]: 1548 <conf:(0.83)> lift:(1.16) lev:(0.05) conv:(1.66)
7. [milk-cream=t, frozen foods=t]: 1826 ==> [bread and cake=t]: 1516 <conf:(0.83)> lift:(1.15) lev:(0.04) conv:(1.65)
8. [milk-cream=t, baking needs=t]: 1907 ==> [bread and cake=t]: 1580 <conf:(0.83)> lift:(1.15) lev:(0.04) conv:(1.63)
9. [fruit=t, milk-cream=t]: 2038 ==> [bread and cake=t]: 1684 <conf:(0.83)> lift:(1.15) lev:(0.05) conv:(1.61)
10. [baking needs=t, biscuits=t]: 1764 ==> [bread and cake=t]: 1456 <conf:(0.83)> lift:(1.15) lev:(0.04) conv:(1.6)