
COMPUTER LABORATORY MANUAL

Data Warehousing and Mining


(CS – 423)

DEPARTMENT OF COMPUTER SOFTWARE ENGINEERING


Military College of Signals
National University of Sciences and Technology
www.mcs.nust.edu.pk
PREFACE
This lab manual has been prepared to facilitate the students of Computer Software
Engineering in studying and analysing various tools and concepts of Data Warehousing and
Data Mining. Generally, data mining (sometimes called data or knowledge discovery) is the
process of analysing data from different perspectives and summarizing it into useful
information - information that can be used to increase revenue, cut costs, or both. Data
mining software is one of a number of analytical tools for analysing data. It allows users to
analyse data from many different dimensions or angles, categorize it, and summarize the
relationships identified. Technically, data mining is the process of finding correlations or
patterns among dozens of fields in large relational databases.

PREPARED BY
This lab manual was prepared by Assoc Prof Dr. Hammad Afzal, Asst Prof Malik Muhammad Zaki Murtaza Khan, and Lab Engr Marium Hida.

GENERAL INSTRUCTIONS
a. Students are required to maintain the lab manual with them till the end of the semester.
b. All readings, answers to questions, and illustrations must be completed in the space provided. If more space is required, additional sheets may be attached.
c. It is the responsibility of the student to have the manual graded before the deadlines given by the instructor.
d. Loss of the manual will result in resubmission of the complete manual.
e. Students are required to go through the experiment before coming to the lab session. Lab
session details will be given in training schedule.
f. Students must bring the manual in each lab.
g. Keep the manual neat, clean, and presentable.
h. Plagiarism is strictly forbidden. No credit will be given if a lab session is plagiarised, and no resubmission will be entertained.
i. Marks will be deducted for late submission.
j. Error handling in a program is the responsibility of the student.

VERSION HISTORY
Date            Updated By                                                     Details
August 2014     Dr. Hammad Afzal                                               v1
September 2014  Dr. Hammad Afzal & TA Subhan Khan                              v2
September 2016  Dr. Hammad Afzal & Lab Engr Marium Hida                        v3
September 2019  Dr. Hammad Afzal, Dr. Malik Muhammad Zaki Murtaza Khan &       v4
                Lab Engr Marium Hida
April 2021      Dr. Hammad Afzal, Dr. Naima Iltaf, Lab Engr Sehrish Ferdous &  v5
                Lab Engr Saba Siddique
February 2022   Dr. Naima Iltaf, Dr. Hammad Afzal, Lab Engr Saba Siddique &    v6
                Lab Engr Laraib Zainab
Lab Rubrics (Group 1)

Each rubric is scored on four levels: Unacceptable (Marks = 0), Substandard (Marks = 1), Adequate (Marks = 2), and Proficient (Marks = 3).

R1 – Completeness and Accuracy
Unacceptable (0): The program failed to produce the right/accurate result.
Substandard (1): The program execution led to inaccurate or incomplete results; it was not correctly functional, or not all the features were implemented.
Adequate (2): The program was correctly functional and most of the features were implemented.
Proficient (3): The program was correctly functional, and all the features were implemented.

R2 – Syntax and Semantics
Unacceptable (0): The student fails to figure out the syntax and semantic errors of the incorrect program.
Substandard (1): The student successfully figures out a few of the syntax and semantic errors of the program, with extensive guidance.
Adequate (2): The student successfully figures out most of the syntax and semantic errors of the program, with minimal guidance.
Proficient (3): The student successfully figures out all syntax and semantic errors of the program without any guidance.

R3 – Demonstration
Unacceptable (0): The student failed to demonstrate a clear understanding of the assigned task.
Substandard (1): The student has a basic understanding, but asked questions were not answered.
Adequate (2): The student has basic knowledge and understanding, and provides fundamental answers to asked questions.
Proficient (3): The student has demonstrated an accurate understanding of the lab objective and concepts; all the questions are answered completely and correctly.

R4 – Complexity and Readability
Unacceptable (0): The code is poorly organized and very difficult to read.
Substandard (1): The code is readable only by someone who knows what it is supposed to be doing.
Adequate (2): The code is fairly easy to read.
Proficient (3): The code is exceptionally well organized and very easy to follow.

R5 – Perseverance and Plagiarism
Unacceptable (0): The complete working program is copied, indicating no effort on the student's part, resulting in a total score of zero for all rubrics.
Substandard (1): Most of the working program is copied; minor contribution by the student.
Adequate (2): Most of the working program is contributed by the student; minor copied components.
Proficient (3): The complete working program is contributed by the student.
Lab Rubrics (Group 3)

Each rubric is scored on four levels: Unacceptable (Marks = 0), Substandard (Marks = 1), Adequate (Marks = 2), and Proficient (Marks = 3).

R1 – Completeness and Accuracy
Unacceptable (0): The system failed to produce the right/accurate result.
Substandard (1): The system execution led to inaccurate or incomplete results; it was not correctly functional, or not all the features were implemented.
Adequate (2): The system was correctly functional and most of the features were implemented.
Proficient (3): The system was correctly functional, and all the features were implemented.

R2 – Demonstration
Unacceptable (0): The student failed to demonstrate a clear understanding of the assigned task.
Substandard (1): The student has basic knowledge and understanding, but asked questions were not answered.
Adequate (2): The student has moderate knowledge and understanding; answers to the questions are basic.
Proficient (3): The student has demonstrated an accurate understanding of the lab objective and concepts; all the questions are answered completely and correctly.

R3 – Plagiarism
Unacceptable (0): The complete working program is copied, indicating no effort on the student's part, resulting in a total score of zero for all rubrics.
Substandard (1): Most of the working program is copied; minor contribution by the student.
Adequate (2): Most of the working program is contributed by the student; minor copied components.
Proficient (3): The complete working program is contributed by the student.

R4 – Contribution / Group Participation
Unacceptable (0): Shows little commitment to group goals and fails to perform assigned roles.
Substandard (1): Demonstrates commitment to group goals, but has difficulty performing assigned roles.
Adequate (2): Demonstrates commitment to group goals and carries out assigned roles effectively.
Proficient (3): Actively helps to identify group goals and works effectively to meet them in all roles assumed.

R5 – Presentation Skills
Unacceptable (0): Poor presentation; cannot explain the topic; scientific terminology lacking or confused; lacks understanding of the topic.
Substandard (1): Presentation lacks clarity and organization; little use of scientific terms and vocabulary; poor understanding of the topic.
Adequate (2): Presentation acceptable; adequate use of scientific vocabulary and terms; acceptable understanding of the topic.
Proficient (3): Well-organized, clear presentation; good use of scientific vocabulary and terminology; good understanding of the topic.
Lab Rubrics (Group 6)

Each rubric is scored on four levels: Unacceptable (Marks = 0), Substandard (Marks = 1), Adequate (Marks = 2), and Proficient (Marks = 3).

R1 – Completeness and Accuracy
Unacceptable (0): The system failed to produce the right/accurate result.
Substandard (1): The system execution led to inaccurate or incomplete results; it was not correctly functional, or not all the features were implemented.
Adequate (2): The system was correctly functional and most of the features were implemented.
Proficient (3): The system was correctly functional, and all the features were implemented.

R2 – Complex Problems
Unacceptable (0): Fails to comprehend the problem and its implications.
Substandard (1): The student is unable to decompose/transfer the problem to a conceptual model with adequate understanding of the complexity of his design.
Adequate (2): The student is able to convert the conceptual model to a simulation/program.
Proficient (3): The student is able to analyze and infer the results after execution and is able to relate the results to the conceptual model's design choices/complexities.

R3 – Demonstration
Unacceptable (0): The student failed to demonstrate a clear understanding of the assigned task.
Substandard (1): The student has basic knowledge and understanding, but asked questions were not answered.
Adequate (2): The student has moderate knowledge and understanding; answers to the questions are basic.
Proficient (3): The student has demonstrated an accurate understanding of the lab objective and concepts; all the questions are answered completely and correctly.

R4 – Followed Directions
Unacceptable (0): The student clearly failed to follow the verbal and written instructions to successfully complete the lab.
Substandard (1): The student failed to follow some of the verbal and written instructions to successfully complete all requirements of the lab.
Adequate (2): The student followed most of the verbal and written instructions to complete all the requirements of the lab.
Proficient (3): The student followed the verbal and written instructions to successfully complete the requirements of the lab.

R5 – Modern Tool Usage
Unacceptable (0): The student clearly failed to use simulation tools to design, configure, test and troubleshoot the given scenario.
Substandard (1): The student has basic knowledge of simulation tools to design, configure, test and troubleshoot the given scenario.
Adequate (2): The student has moderate knowledge of simulation tools to design, configure, test and troubleshoot the given scenario.
Proficient (3): The student effectively uses simulation tools to design, configure, test and troubleshoot the given scenario.
COURSE LEVEL OUTCOMES
CS-423 Data Warehousing and Data Mining
Course Learning Outcomes (CLOs)
At the end of the course the students will be able to:

CLO 1: Develop an understanding of the concepts of Data Warehousing and Data Mining fundamentals, including various Data Cubes, Data Pre-Processing, and Frequent Pattern Analysis. (PLO 1, BT Level C-2)
CLO 2: Apply the concepts of supervised and unsupervised learning on different types of data. (PLO 3, BT Level C-3)
CLO 3: Practice modern tools and programming environments to learn various data mining tasks. (PLO 5, BT Level P-3)

S No List of Experiments CLO R-G


1. Introduction to Python - I 3 1
2. Introduction to Python - II 3 1
3. Data Cleaning 3 1
4. Feature Selection 3 1
5. Dimensionality Reduction using PCA 3 1
6. Understanding Clustering – I 3 1
7. Understanding Clustering – II 3 1
8. Association Rule Analysis 3 1
9. Understanding Classification using KNN 3 1
10. Linear Regression 3 1
11. Open-Ended Lab 3 1
12. Data Analytics using Rapid Miner 3 1
13. Project 3 3
Table of Contents
Experiment 1: Introduction to Python - I
Experiment 2: Introduction to Python - II
Experiment 3: Data Cleaning
Experiment 4: Feature Selection
Experiment 5: Dimensionality Reduction through PCA
Experiment 6: Understanding Clustering - I
Experiment 7: Understanding Clustering - II
Experiment 8: Association Rule Analysis using Python
Experiment 9: Understanding Classification using KNN
Experiment 10: Linear Regression
Experiment 11: Open-Ended Lab
MARKS

S No   Experiment   Max. Marks   Marks Obtained (R1, R2, R3, R4, R5)   Instructor Sign

Grand Total
Experiment 1 – Introduction to Python - I
Objective: To provide basic knowledge about the Python language: expressions and statements, operators, variables, lists, strings, and functions.
Time Required : 3 hrs
Programming Language : Python
Software Required : Anaconda/Google Colab

Learning the basics of Python Language


Introduction
Python is a general-purpose programming language used in just about any kind of software you
can think of. You can use it to build websites, artificial intelligence, servers, business software,
and more.
Python is a portable, cross-platform language: you can write and execute Python code on any
operating system with a Python interpreter.
Expressions and statements:
A computer program written in Python is built from statements and expressions. Each statement
performs some action. For example, the built-in print() function writes to the screen:
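The original example screenshot is not reproduced here; a minimal equivalent:

print('Hello, World!')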

The single quotation marks in our print call indicate that what we are printing is a string – a sequence of letters or other symbols. If the string itself contains a single quotation mark, we must enclose the string in double quotation marks instead; otherwise it will generate an error.

Each expression performs some calculation, yielding a value. For example, we can calculate the
result of a simple mathematical expression using whole numbers (or integers):
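A representative snippet in place of the missing screenshot, typed at the interactive prompt:

2 + 3 * 4   # Python evaluates the expression and prints 14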
When Python has calculated the result of this expression, it prints it to the screen, even though
we have not used print. All our programs will be built from such statements and expressions.
Python Input
While programming, we might want to take the input from the user. In Python, we can use the
input() function.
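A sketch of the missing example (num is the variable name the text below refers to):

num = input('Enter a number: ')   # user types 10
print(num)        # 10
print(type(num))  # <class 'str'> – input() always returns a string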

In the above example, we have used the input() function to take input from the user and stored
the user input in the num variable.
It is important to note that the entered value 10 is a string, not a number. So, type(num) returns
<class 'str'>.
To convert user input into a number we can use int() or float() functions as:
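For example (a representative sketch):

num = int(input('Enter a number: '))      # convert the input string to an integer
price = float(input('Enter a price: '))   # convert the input string to a float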

Creating Variables:
In programming, a variable is a container (storage area) to hold data. Python has no command for declaring a variable; a variable is created the moment you first assign a value to it. Python also allows you to assign values to multiple variables in one line:
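For example (the values are illustrative):

x, y, z = 'Orange', 'Banana', 'Cherry'   # three variables assigned in one line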

Variables do not need to be declared with any particular type and can even change type after they
have been set. (Note: variable names are case sensitive)
If we want to assign the same value to multiple variables at once, we can do this as:
You can get the data type of a variable with the type() function.
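A representative sketch of both operations:

x = y = z = 'Orange'   # assign the same value to multiple variables at once
print(type(x))         # <class 'str'>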

Python Literals
Literals are representations of fixed values in a program. They can be numbers, characters, or
strings, etc. For example, 'Hello, World!', 12, 23.0, 'C', etc.
Literals are often used to assign values to variables or constants. For example:
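The example the next sentence refers to:

site_name = 'programiz.com'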

In the above expression, site_name is a variable, and 'programiz.com' is a literal.

Literal Collections
There are four different literal collections: List literals, Tuple literals, Dict literals, and Set literals.
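A sketch consistent with the description below (the exact values are illustrative):

fruits = ['apple', 'mango', 'orange']                # list literal
numbers = (1, 2, 3)                                  # tuple literal
alphabets = {'a': 'apple', 'b': 'ball', 'c': 'cat'}  # dict literal
vowels = {'a', 'e', 'i', 'o', 'u'}                   # set literal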

In the above example, we created a list of fruits, a tuple of numbers, a dictionary of alphabets
having values with keys designated to each value and a set of vowels.
Python Data Types
Data Types Classes Description
Numeric int, float, complex holds numeric values
String str holds sequence of characters
Sequence list, tuple, range holds collection of items
Mapping dict holds data in key-value pair form
Boolean bool holds either True or False
Set set, frozenset holds collection of unique items
Since everything is an object in Python programming, data types are classes and variables are
instances(object) of these classes.
List Data Type
List is an ordered collection of similar or different types of items separated by commas and
enclosed within brackets [ ]. For example,
Access List Items: Here, we have created a list named languages with 3 string values inside it.
To access items from a list, we use the index number (0, 1, 2 ...). For example,
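A sketch of the missing example (the three string values are illustrative):

languages = ['Python', 'Swift', 'Java']
print(languages[0])   # Python
print(languages[2])   # Java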

We can measure the length of a list with the following command:


len(languages)
Concatenate two lists: Adding two lists using the simple ‘+’ operator creates a new list with everything from the first list, followed by everything from the second list.
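For example (continuing with the languages list above):

more_languages = languages + ['C', 'Go']   # a new list; the originals are unchanged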

We use append() to add a single item to a list; the list itself is updated as a result of the operation.
languages.append("C++")
The following command prints all words present before the 2nd index in language list.
languages[:2]
The following command prints all words present after the 2nd index in language list.
languages[2:]
By convention, m:n means elements m…n-1.
languages[0:2]

We can also slice with negative indexes — the same basic rule of starting from the start index
and stopping one before the end index applies.
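For example (continuing with the languages list above):

languages[-2:]   # the last two items
languages[:-1]   # everything except the last item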
Python Tuple Data Type
Tuple is an ordered sequence of items same as a list. The only difference is that tuples are
immutable. Tuples once created cannot be modified.
In Python, we use the parentheses () to store items of a tuple. For example,
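A minimal sketch matching the description below:

product = ('Xbox', 499.99)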

Here, product is a tuple with a string value 'Xbox' and a float value 499.99.

Python Set Data Type


Set is an unordered collection of unique items. Set is defined by values separated by commas
inside braces { }. For example,
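A sketch matching the description below (the five values are illustrative):

student_info = {101, 102, 103, 104, 105}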

Here, we have created a set named student_info with 5 integer values.


Since sets are unordered collections, indexing has no meaning. Hence, the slicing operator []
does not work.
Dictionary Data Type
Python dictionary is an ordered collection of items. It stores elements in key/value pairs.
Here, keys are unique identifiers that are associated with each value.
Access Dictionary Values Using Keys: We use keys to retrieve the respective value. But not the
other way around. For example,
In the below example, we have created a dictionary named capital_city. Here,
Keys are 'Nepal', 'Italy', 'England'
Values are 'Kathmandu', 'Rome', 'London'
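A sketch assembled from the keys and values listed above:

capital_city = {'Nepal': 'Kathmandu', 'Italy': 'Rome', 'England': 'London'}
print(capital_city['Nepal'])   # Kathmandu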
Methods and descriptions:
list.append(x): Add an item to the end of the list.
list.extend(L): Extend the list by appending all the items in the given list.
list.insert(i, x): Insert an item at a given position. The first argument is the index of the element before which to insert, so a.insert(0, x) inserts at the front of the list, and a.insert(len(a), x) is equivalent to a.append(x).
list.remove(x): Remove the first item from the list whose value is x. It is an error if there is no such item.
list.pop([i]): Remove the item at the given position in the list, and return it. If no index is specified, a.pop() removes and returns the last item in the list. (The square brackets around the i in the method signature denote that the parameter is optional, not that you should type square brackets at that position. You will see this notation frequently in the Python Library Reference.)
list.sort(): Sort the items of the list, in place.
list.reverse(): Reverse the elements of the list, in place.

Some of the methods we used to access the elements of a list also work with individual words,
or strings. For example, we can assign a string to a variable, index a string, and slice a string:
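A representative sketch of the missing example:

word = 'Python'
word[0]     # 'P'   – indexing
word[1:4]   # 'yth' – slicing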

We can also perform multiplication and addition with strings:


We can join the words of a list to make a single string, or split a string into a list, as follows:
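Representative sketches of all of these string operations:

'data' + 'mining'             # 'datamining' – addition
'ab' * 3                      # 'ababab' – multiplication
' '.join(['data', 'mining'])  # 'data mining' – join a list into one string
'data mining'.split(' ')      # ['data', 'mining'] – split a string into a list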

Functions:
A function is a block of code which only runs when it is called. You can pass data, known as
parameters, into a function. A function can return data as a result.
Creating & Calling a Function
In Python a function is defined using the def keyword:
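A minimal sketch (the function name is illustrative):

def my_function():
    print('Hello from a function')

my_function()   # calling the function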

Arguments
Information can be passed into functions as arguments. Arguments are specified after the
function name, inside the parentheses. You can add as many arguments as you want, just
separate them with a comma.
The following example has a function with one argument (fname). When the function is called,
we pass along a first name, which is used inside the function to print the full name:
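A sketch of such a function (the surname is illustrative):

def greet(fname):
    print(fname + ' Khan')

greet('Ali')   # prints: Ali Khan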

By default, a function must be called with the correct number of arguments. Meaning that if your
function expects 2 arguments, you have to call the function with 2 arguments, not more, and not
less.
This function expects 2 arguments, and gets 2 arguments:
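For example (a representative sketch):

def full_name(fname, lname):
    print(fname + ' ' + lname)

full_name('Ali', 'Khan')   # exactly 2 arguments, as expected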
If you try to call the function with 1 or 3 arguments, you will get an error. However, if a parameter has a default value, calling the function without that argument uses the default value.
Return Values
To let a function return a value, use the return statement.
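A minimal sketch:

def square(x):
    return x * x

print(square(5))   # 25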

Lab Tasks:
Q1: Write a program to:
a) Define a string with your name and assign it to a variable. Print the contents of this
variable in two ways, first by simply typing the variable name and pressing enter, then by
using the print statement.
b) Try adding the string to itself using my_string + my_string, or multiplying it by a
number, e.g., my_string * 3. Notice that the strings are joined together without any
spaces. How could you fix this?

Q2. Write a python program to create a function arithematic_operation to perform arithmetic


operations(+, -, *, /) using two numbers.

Q3: Given two strings, s1 and s2, return a new string made of the first, middle, and last characters of each input string.
Given:
s1 = "America"
s2 = "Japan"
Expected Output:
AJrpan

Q4: Write a program to create a function that takes two strings s1 and s2 and creates a new string by appending s2 in the middle of s1.
Given:
s1 = "Software"
s2 = "Design"
Expected Output:
SoftDesignware
Q5. Write a program to create five lists, one for each row in our dataset:

Also use list indexing to extract the number of ratings from the five rows and then average them.

Q6. Write a program to create the course variable, then set the course variable to be an empty list.
1. Now, add 'Machine Learning', 'Software Construction', and 'Formal Methods' to the course list in that order without reassigning the variable.
2. Delete 'Software Construction' and display the updated list content.
3. Add the course 'Artificial Intelligence' to course where 'Software Construction' used to be.
4. Slice course to return the 1st and 3rd elements.

References:
[1] https://mas-dse.github.io/startup/anaconda-windows-install/#anaconda
[2] https://www.programiz.com/python-programming/first-program
[3] https://www.codecademy.com/learn/learn-python-3
[4] https://www.w3schools.com/python/
Experiment # 2: Introduction to Python - II
Objective: To provide basic knowledge about the Python language: tuples, sets, dictionaries, and conditional and loop statements
Time Required : 3 hrs
Programming Language : Python
Software Required : Anaconda/ Google Colab

Python — For Loop


A for loop is used for iterating over a sequence (a list, a tuple, a dictionary, a set, or a string). This is less like the for keyword in other programming languages and works more like an iterator method found in other object-oriented programming languages.
With the for loop, we can execute a set of statements once for each item in a list, tuple, set, etc. In Python, the for loop is used to run a block of code a certain number of times. The for loop does not require an indexing variable to be set beforehand.
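A sketch of the missing example (val and sequence are the names the next sentence refers to):

sequence = [6, 5, 3, 8]
for val in sequence:
    print(val)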

Here, val accesses each item of the sequence on each iteration. The loop continues until we reach the last item in the sequence.

You can loop through the tuple items by using a for loop.

Python —While Loop


Python while loop is used to run a block of code until a certain condition is met.
The syntax of while loop is:
Example:
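A representative sketch (prints the numbers 1 to 5):

i = 1
while i <= 5:   # the loop runs while the condition is true
    print(i)
    i = i + 1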

Python Tuples
Tuples are used to store multiple items in a single variable. The tuple is one of 4 built-in collection data types in Python used to store collections of data; the other 3 are List, Set, and Dictionary, all with different qualities and usage.
A tuple is a collection which is ordered and unchangeable. It allows duplicate members. Tuples are written with round brackets.

Python - Access Tuple Items


You can access tuple items by referring to the index number, inside square brackets:
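For example (a representative sketch):

this_tuple = ('apple', 'banana', 'cherry')
print(this_tuple[1])   # banana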

Negative Indexing
Negative indexing means start from the end. -1 refers to the last item, -2 refers to the second last
item etc.

Python - Update Tuples


Tuples are unchangeable, meaning that you cannot change, add, or remove items once the tuple
is created. But there are some workarounds.
Change Tuple Values
You can convert the tuple into a list, change the list, and convert the list back into a tuple.
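A sketch of this workaround:

x = ('apple', 'banana', 'cherry')
y = list(x)      # convert the tuple into a list
y[1] = 'kiwi'    # change the list
x = tuple(y)     # convert the list back into a tuple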
Python - Tuple Methods
Python has two built-in methods that you can use on tuples.
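The two methods are count() and index(); for example:

x = ('apple', 'banana', 'apple')
x.count('apple')    # 2 – number of occurrences of a value
x.index('banana')   # 1 – position of the first occurrence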

Python Sets
A set is a collection which is unordered and unindexed. It does not allow duplicate members. In
Python, sets are written with curly brackets.

Access Items
You cannot access items in a set by referring to an index or a key but you can loop through the
set items using a for loop, or ask if a specified value is present in a set, by using the in keyword.
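A representative sketch of both:

this_set = {'apple', 'banana', 'cherry'}
for x in this_set:
    print(x)                   # iteration order is arbitrary
print('banana' in this_set)   # True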

Add Items
To add one item to a set use the add() method.

To add more than one item to a set use the update() method.
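Sketches of both methods (continuing with this_set above):

this_set.add('orange')                 # add one item
this_set.update(['mango', 'grapes'])   # add multiple items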

Get the Length of a Set


To determine how many items a set has, use the len() function.
Set Methods:

Python Dictionaries
A dictionary is a collection which is ordered (as of Python 3.7), changeable, and does not allow duplicate keys. In Python, dictionaries are written with curly brackets, and they have keys and values.

Accessing Items
You can access the items of a dictionary by referring to its key name, inside square brackets:

There is also a method called get() that will give you the same result:
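Sketches of both ways of accessing a value (the dictionary is illustrative):

this_dict = {'brand': 'Ford', 'model': 'Mustang', 'year': 1964}
x = this_dict['model']       # 'Mustang'
y = this_dict.get('model')   # 'Mustang' – same result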

Loop Through a Dictionary


You can loop through a dictionary by using a for loop. When looping through a dictionary, the return values are the keys of the dictionary, but there are methods to return the values as well.
Example:
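A representative sketch (continuing with this_dict above):

for key in this_dict:
    print(key)                    # the keys
for value in this_dict.values():
    print(value)                  # the values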

Dictionary Length
To determine how many items (key-value pairs) a dictionary has, use the len() function.

Dictionary Methods

Python Conditions and If statements


Python supports the usual logical conditions from mathematics:
 Equals: a == b

 Not Equals: a != b
 Less than: a < b
 Less than or equal to: a <= b
 Greater than: a > b
 Greater than or equal to: a >= b
These conditions can be used in several ways, most commonly in "if statements" and loops.
An "if statement" is written by using the if keyword.
If statement:
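A sketch of the missing example (the values match the explanation below):

a = 33
b = 200
if b > a:
    print('b is greater than a')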

In this example we use two variables, a and b, which are used as part of the if statement to test
whether b is greater than a. As a is 33, and b is 200, we know that 200 is greater than 33, and so
we print to screen that "b is greater than a".
Indentation
Python relies on indentation (whitespace at the beginning of a line) to define scope in the code.
Other programming languages often use curly brackets for this purpose.
If statement, without indentation (will raise an error):

Elif
The elif keyword is Python's way of saying "if the previous conditions were not true, then try this condition".

In this example a is equal to b, so the first condition is not true, but the elif condition is true, so we print to screen that "a and b are equal".
Else
The else keyword catches anything which isn't caught by the preceding conditions.
In this example a is greater than b, so the first condition is not true, also the elif condition is not
true, so we go to the else condition and print to screen that "a is greater than b". You can also
have an else without the elif.
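A sketch matching the explanation above:

a = 200
b = 33
if b > a:
    print('b is greater than a')
elif a == b:
    print('a and b are equal')
else:
    print('a is greater than b')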

Short Hand If
If you have only one statement to execute, you can put it on the same line as the if statement.
One line if statement:
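A representative one-liner:

a, b = 200, 33
if a > b: print('a is greater than b')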

Short Hand If ... Else


If you have only one statement to execute, one for if, and one for else, you can put it all on the
same line:
One line if else statement:

This technique is known as Ternary Operators, or Conditional Expressions. You can also
have multiple else statements on the same line:
One line if else statement, with 3 conditions:
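Representative sketches of both short-hand forms:

a, b = 2, 330
print('A') if a > b else print('B')                             # one line if else

a, b = 330, 330
print('A') if a > b else print('=') if a == b else print('B')  # with 3 conditions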

Logical Operator
And
The and keyword is a logical operator, and is used to combine conditional statements.
Test if a is greater than b, AND if c is greater than a:

Or
The or keyword is a logical operator, and is used to combine conditional statements.
Test if a is greater than b, OR if a is greater than c:
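Representative sketches of both operators:

a, b, c = 200, 33, 500
if a > b and c > a:
    print('Both conditions are True')
if a > b or a > c:
    print('At least one of the conditions is True')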
Nested If
You can have if statements inside if statements; this is called nested if statements.

The while loop requires relevant variables to be ready; in this example we need to define an indexing variable, i, which we set to 1.
The break Statement
With the break statement we can stop the loop even if the while condition is true.
Exit the loop when i is 3:
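A sketch of the missing example:

i = 1
while i < 6:
    print(i)
    if i == 3:
        break   # stop the loop even though i < 6 is still true
    i += 1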

Looping Through a String
Strings are iterable objects too; they contain a sequence of characters.
Loop through the letters in the word "banana":
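A minimal sketch:

for x in 'banana':
    print(x)   # prints one letter per line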

The break Statement


With the break statement we can stop the loop before it has looped through all the items.
Exit the loop when x is "banana":
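A sketch of the missing example:

fruits = ['apple', 'banana', 'cherry']
for x in fruits:
    print(x)
    if x == 'banana':
        break   # 'cherry' is never printed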
The range() Function
To loop through a set of code a specified number of times, we can use the range() function.
The range() function returns a sequence of numbers, starting from 0 by default, and increments
by 1 (by default), and ends at a specified number.

Note: The range(6) is not the values of 0 to 6, but the values 0 to 5.


The range() function defaults to 0 as a starting value, however it is possible to specify the starting
value by adding a parameter: range(2, 6), which means values from 2 to 6 (but not including 6).
Using the start parameter:

The range() function defaults to increment the sequence by 1, however it is possible to specify
the increment value by adding a third parameter: range(2, 30, 3).
Increment the sequence with 3 (default is 1):
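Representative sketches of all three forms of range():

for x in range(6):
    print(x)   # 0 1 2 3 4 5

for x in range(2, 6):
    print(x)   # 2 3 4 5

for x in range(2, 30, 3):
    print(x)   # 2 5 8 ... 29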

Else in For Loop


The else keyword in a for loop specifies a block of code to be executed when the loop is finished.
Print all numbers from 0 to 5, and print a message when the loop has ended:
Note: The else block will NOT be executed if the loop is stopped by a break statement.
Break the loop when x is 3, and see what happens with the else block:
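Sketches of both cases:

for x in range(6):
    print(x)
else:
    print('Finally finished!')   # runs when the loop completes normally

for x in range(6):
    if x == 3:
        break
    print(x)
else:
    print('Finally finished!')   # NOT executed: the loop was stopped by break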

Nested Loops
A nested loop is a loop inside a loop. The "inner loop" will be executed one time for each
iteration of the "outer loop".
Print each adjective for every fruit:
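A sketch of the missing example:

adjectives = ['red', 'big', 'tasty']
fruits = ['apple', 'banana', 'cherry']
for x in adjectives:
    for y in fruits:
        print(x, y)   # the inner loop runs once per outer iteration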

The pass Statement


for loops cannot be empty, but if you for some reason have a for loop with no content, put in
the pass statement to avoid getting an error.
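A minimal sketch:

for x in [0, 1, 2]:
    pass   # placeholder; an empty loop body would raise an error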

LAB TASKS:
Q1: Write a program that takes two integers as input (lower limit and upper limit) and displays
all the prime numbers including and between these two numbers.
Q2: Given a list iterate it and display numbers which are divisible by 5 and if you find number
greater than 150 stop the loop iteration.
list1 = [12, 15, 32, 42, 55, 75, 122, 132, 150, 180, 200]
Q3: Write a program that accepts a comma separated sequence of words as input and prints the
words in a comma-separated sequence after sorting them alphabetically. Suppose the following
input is supplied to the program: without, hello, bag, world. Then, the output should be: bag,
hello, without, world.
Q4:
a. Write a simple calculator program. Follow the steps below:
• Declare and define a function named Menu which displays a list of choices for the user, such as addition, subtraction, multiplication, and classic division. It should take the user's choice as input and return it.
• Define and declare a separate function for each choice (each mathematical
operation).
• In the main body of the program call the respective function depending on the
user’s choice.
b. Implement the following functions for the calculator you created in the above task.
• Factorial
• x_power_y (x raised to the power y)

References:
https://www.w3schools.com/python/default.asp
Experiment 3: Data Cleaning

Objective : To learn and implement the data cleaning techniques


Time Required : 3 hrs
Programming Language : Python
Software Required : Anaconda

Introduction
Cleaning Data
Data cleaning or Data cleansing is very important from the perspective of building intelligent
automated systems. Data cleansing is a preprocessing step that improves the data validity,
accuracy, completeness, consistency, and uniformity. It is essential for building reliable machine
learning models that can produce good results. Otherwise, no matter how good the model is, its
results cannot be trusted. In short, data cleaning means fixing bad data in your data set. Bad data
could be:
1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates
The dataset that we are going to use is ‘rawdata.csv’. It has following characteristics:
 The data set contains some empty cells ("Date" in row 22, and "Calories" in row 18 and
28).
 The data set contains wrong format ("Date" in row 26).
 The data set contains wrong data ("Duration" in row 7).
 The data set contains duplicates (row 11 and 12).

Step 1: Load and view dataset


Task: Load and view the dataset provided after importing important libraries.
Step 2: Dealing with empty cells

Empty cells can potentially give a wrong result while analyzing data, so to deal with them we will perform the following operations:
a. Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells by using the method dropna(). Since data sets can be very big, removing a few rows will usually not have a big impact on the result.
Task: Remove all the empty cells in dataset provided
By default, the dropna() method returns a new DataFrame, and will not change the original. If
you want to change the original DataFrame, use the inplace = True argument.
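A sketch of both forms, assuming the dataset has already been loaded into a DataFrame df:

new_df = df.dropna()      # returns a new DataFrame without empty cells
df.dropna(inplace=True)   # changes the original DataFrame instead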
b. Replace empty values
Another way of dealing with empty cells is to insert a new value instead by using method fillna().
This way you do not have to delete entire rows just because of some empty cells.
Task: Replace the empty values with 150
c. Replace only for a specified Columns
In above methods, we are replacing all empty cells in the whole Data Frame. To only replace
empty values for one column, specify the column name for the DataFrame.
Task: Replace the empty values in ‘Calories’ with 130.
d. Replace Using Mean, Median, or Mode
A common way to replace empty cells, is to calculate the mean, median or mode value of the
column. Pandas uses the mean() median() and mode() methods to calculate the respective values
for a specified column:
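For example, replacing the empty cells of one column with its mean (a representative sketch):

x = df['Calories'].mean()
df['Calories'].fillna(x, inplace=True)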
i. Mean:
Mean = the average value (the sum of all values divided by number of values).
ii. Median:
Median = the value in the middle, after you have sorted all values ascending.
iii. Mode:
Mode = the value that appears most frequently.
Tasks:
1. Calculate the Mean of ‘Calories’ and replace the missing values with it.
2. Calculate the Median of ‘Maxpulse’ and replace the missing values with it.
3. Calculate the mode of ‘Pulse’ and replace the missing values with it.

Step 3: Dealing with data of wrong format


Cells with data of the wrong format can make it difficult, or even impossible, to analyze the data. To fix this, you have two options: remove the rows, or convert all cells in the columns into the same format.
a) Convert Into a Correct Format
In our Data Frame, we have two cells with the wrong format. Check out rows 22 and 26; the 'Date' column should hold dates in a consistent date format.
Task: Convert the ‘Date’ column into datetime format.
You will see that the date in row 26 was fixed after converting the ‘Date’ column into datetime format, but the empty date in row 22 got a NaT (Not a Time) value, in other words an empty value. One way to deal with empty values is simply removing the entire row.
Task: Remove the entire row 22
b) Removing Rows
The result from the converting in the example above gave us a NaT value, which can be handled
as a NULL value, and we can remove the row by using the dropna() method.
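A sketch of both steps using pandas (imported as pd):

df['Date'] = pd.to_datetime(df['Date'])    # cells with a wrong format become NaT
df.dropna(subset=['Date'], inplace=True)   # remove rows whose 'Date' is NaT/empty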
Step 4: Dealing with wrong data
"Wrong data" does not have to be "empty cells" or "wrong format", it can just be wrong, like if
someone registered "199" instead of "1.99". Sometimes you can spot wrong data by looking at
the data set, because you have an expectation of what it should be.
In our data set, you can see that in row 7, the duration is 450, but for all the other rows the
duration is between 30 and 60. It doesn't have to be wrong, but taking into consideration that this is the data set of someone's workout sessions, we conclude that this person did not work out for 450 minutes.
a) Replacing Values
One way to fix wrong values is to replace them with something else. In our test data, it is most
likely a typo, and the value should be "45" instead of "450".
Task: Insert the value "45" in row 7.
For small data sets you might be able to replace the wrong data one by one, but not for big data
sets. To replace wrong data for larger data sets you can create some rules, e.g. set some
boundaries for legal values, and replace any values that are outside of the boundaries.
Task: Loop through all values in the ‘Duration’ column. If the value is higher than 120, set it to
120.
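A sketch of such a boundary rule:

for x in df.index:
    if df.loc[x, 'Duration'] > 120:
        df.loc[x, 'Duration'] = 120   # cap values at the legal boundary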
b) Removing Rows
Another way of handling wrong data is to remove the rows that contain wrong data. This way
you do not have to find out what to replace them with, and there is a good chance you do not
need them to do your analyses.
Task: Delete rows where "Duration" is higher than 120.
Step 5: Dealing with duplicates
Duplicate rows are rows that have been registered more than one time. By looking at our data
set, we can assume that rows 11 and 12 are duplicates.
To discover duplicates, we can use the duplicated() method. The duplicated() method returns a Boolean value for each row:
Task: Remove duplicates using the drop_duplicates() method.
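A sketch of both methods:

print(df.duplicated())             # True for every row that is a duplicate
df.drop_duplicates(inplace=True)   # remove the duplicate rows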
Final Lab Task:
For given dataset ‘diabetes.csv’, perform data cleaning techniques. After applying the data
cleaning methods, carry out fitting of data with a Regression Model and compute its
accuracy. The code for fitting with Logistic Regression is as follows:

# Split data into a training set and a test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
Note: Here x contains the features (independent variables) and y contains the dependent variable, or label.
# Fitting with Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Compute model accuracy
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred)*100)
Experiment 4: Feature Selection

Objective : To learn and implement the feature selection techniques


Time Required : 3 hrs
Programming Language : Python
Software Required : Anaconda

Introduction
Feature Selection is one of the core concepts in machine learning which hugely impacts the
performance of your model. The data features that you use to train your machine learning models
have a huge influence on the performance you can achieve. Irrelevant or partially relevant
features can negatively impact model performance.
Feature Selection is the process where you automatically or manually select those features which contribute most to the prediction variable or output in which you are interested. Having irrelevant features in your data can decrease the accuracy of the models and make your model learn based on irrelevant features.
The most common and easy-to-use feature selection techniques which provide good results are as follows:
1. Univariate Selection:
Statistical tests can be used to select those features that have the strongest relationship with the
output variable.
The scikit-learn library provides the SelectKBest class that can be used with a suite of different
statistical tests to select a specific number of features.
TASK 1:
Download the dataset from this link: https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv
Description of the variables in dataset:
battery_power: Total energy a battery can store in one time measured in mAh
blue: Has Bluetooth or not
clock_speed: the speed at which microprocessor executes instructions
dual_sim: Has dual sim support or not
fc: Front Camera megapixels
four_g: Has 4G or not
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of the processor
pc: Primary Camera megapixels
px_height: Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in MegaBytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: the longest time that a single battery charge will last when you are
three_g: Has 3G or not
touch_screen: Has touch screen or not
wifi: Has wifi or not
price_range: This is the target variable with a value of 0(low cost), 1(medium cost), 2(high cost)
and 3(very high cost).

Use the chi-squared (chi²) statistical test for non-negative features to select 10 of the best features
from the above dataset which is used for Mobile Price Range Prediction.
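A sketch of one way to do this with SelectKBest, assuming the Kaggle file is saved as train.csv:

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

data = pd.read_csv('train.csv')
X = data.drop('price_range', axis=1)   # independent features
y = data['price_range']                # target variable

selector = SelectKBest(score_func=chi2, k=10)
fit = selector.fit(X, y)
scores = pd.DataFrame({'Feature': X.columns, 'Score': fit.scores_})
print(scores.nlargest(10, 'Score'))    # the 10 best-scoring features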
2. Feature Importance
You can get the feature importance of each feature of your dataset by using the feature importance property of the model. Feature importance gives you a score for each feature of your data; the higher the score, the more important or relevant the feature is towards your output variable.
Feature importance is an inbuilt class that comes with Tree Based Classifiers.

TASK 2:
Load the dataset again and use Extra Tree Classifier for extracting the top 10 features for the
dataset and plot your results.
 To import the Extra Tree Classifier, use the following command:
from sklearn.ensemble import ExtraTreesClassifier
 Use the following inbuilt class for feature importances:
feature_importances_
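A sketch under the same assumptions as Task 1 (X, y, and pandas as loaded there):

import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.nlargest(10).plot(kind='barh')   # plot the top 10 features
plt.show()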

3. Correlation Matrix with Heatmap


Correlation states how the features are related to each other or to the target variable. Correlation can be positive (an increase in one feature's value increases the value of the target variable) or negative (an increase in one feature's value decreases the value of the target variable).
Heatmap makes it easy to identify which features are most related to the target variable.
TASK 3:
Load the dataset and plot heatmap of correlated features using the seaborn library.
 To import the seaborn library, use the following command:
import seaborn as sns
To get correlations of each feature in the dataset, use the following command:
corrmat = data.corr()
 To plot heat map, use the following command:
g=sns.heatmap(corrmat , annot=True, cmap="RdYlGn")
After plotting the results, see how the price range, in the last row, is correlated with other
features.
Experiment 5: Dimensionality Reduction through PCA

Objective : To apply dimensionality reduction tasks (PCA) using Python.
Time Required : 3 hrs
Programming Language : Python
Software Required : Anaconda

Introduction
Data pre-processing is crucial in any data mining process, as it directly impacts the success rate of the project. It reduces the complexity of the data under analysis, because data in the real world is unclean. Data is said to be unclean if it has missing attributes or attribute values, contains noise or outliers, or contains duplicate or wrong data. The presence of any of these will degrade the quality of the results. Furthermore, data sparsity increases as the dimensionality increases, which makes operations like clustering and outlier detection less meaningful, as they greatly depend on density and distance between points. The purposes of dimensionality reduction are to:
∙ Avoid the curse of dimensionality
∙ Reduce the time required by algorithms
∙ Greatly reduce memory consumption
∙ Ease visualization of the data
∙ Eliminate irrelevant features
Principal Component Analysis (PCA) is a method used to reduce the number of variables in your data by extracting the important ones from a large pool. It reduces the dimension of your data with the aim of retaining as much information as possible. In other words, this method combines highly correlated variables to form a smaller, artificial set of variables, called "principal components", that account for most of the variance in the data.

TASK:
Apply PCA on the Fisher’s Iris data set. The data contains 3 classes of 50 instances each, where
each class refers to a type of iris plant. There are 4 different attributes describing the data. You
will use principal component analysis to transform the data to a lower dimensional space.
Steps to follow:
a) Download the Iris data set from the following webpage:
http://archive.ics.uci.edu/ml/datasets/Iris
b) Load all relevant packages and dataset.
c) Split feature vectors and labels.
d) Normalize the dataset which is done by subtracting the mean of each feature vector from
the dataset so that the dataset should be centered on the origin.
e) Compute the covariance matrix which is basically a measure of the extent to which
corresponding elements from two sets of ordered data move in the same direction.
 To compute the covariance matrix, use the np.cov() builtin method
f) Calculate the eigenvalues and eigenvectors.
Remember: The Eigenvectors of the Covariance matrix we get are Orthogonal to each
other and each vector represents a principal axis. A higher Eigenvalue corresponds to a
higher variability. Hence the principal axis with the higher Eigenvalue will be an axis
capturing higher variability in the data. Orthogonal means the vectors are mutually
perpendicular to each other.
 You can use the builtin method np.linalg.eigh(). It will return two objects, a 1-D
array containing the eigenvalues, and a 2-D square array or matrix (depending on
the input type) of the corresponding eigenvectors (in columns).
g) Sort the eigen values in descending order.
Remember: We order the eigenvalues from largest to smallest so that it gives us the
components in order of significance. Each column in the Eigen vector-matrix corresponds
to a principal component, so arranging them in descending order of their Eigenvalue will
automatically arrange the principal component in descending order of their variability.
Hence, the first column in our rearranged Eigen vector-matrix here will be a principal
component that captures the highest variability.
 You can use the builtin method np.argsort()
h) Choose components and form a feature vector.
Remember: If we have a dataset with n variables, then we have the corresponding n
eigenvalues and eigenvectors. To reduce the dimensions, we choose the first p
eigenvalues and ignore the rest. Some information is lost in the process, but if the
eigenvalues are small, we do not lose much.
In this task, select the first two principal components. n_components = 2 means your final
data should be reduced to just 2 dimensions.
i) Transform the data by having a dot product between the Transpose of the Feature Vector
and the Transpose of the mean-centered data. By transposing the outcome of the dot
product, the result we get is the data reduced to lower dimensions (2-D) from higher
dimensions (4-D).
 You can use the following command for this purpose:
X_reduced=np.dot(eigenvector_subset.transpose(),
X_meaned.transpose()).transpose()
j) Project the data onto its first two principal components and plot the results using the
seaborn and matplotlib libraries. (Hint: Create Data Frame of reduced dataset and
concatenate it with Labels (target variable) to create a complete Dataset).
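Putting steps b) to i) together, a compact sketch (assuming the downloaded file is named iris.data and has no header row):

import numpy as np
import pandas as pd

cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv('iris.data', names=cols)

X = df.drop('class', axis=1).values   # c) feature vectors
y = df['class'].values                # c) labels
X_meaned = X - np.mean(X, axis=0)     # d) center the data on the origin

cov_mat = np.cov(X_meaned, rowvar=False)        # e) covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)    # f) eigenvalues and eigenvectors
order = np.argsort(eig_vals)[::-1]              # g) indices for descending order
eigenvector_subset = eig_vecs[:, order][:, :2]  # h) first two principal components

# i) project the mean-centered data onto the two principal components
X_reduced = np.dot(eigenvector_subset.transpose(), X_meaned.transpose()).transpose()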
Experiment 6: Understanding Clustering - I

Objective:
 Develop an understanding of how to perform k-means on a data set.
 Develop an understanding of the use of objective function to select the best possible
value of k in k-means clustering.
 Learn how to implement KMeans with PCA
Time Required: 3 hrs
Programming Language: Python
Software Required: Anaconda

Introduction
The technique of segregating datasets into various groups on the basis of similar features and characteristics is called clustering. The groups formed are known as clusters. Clustering is used as a data analysis technique for discovering interesting patterns in data, such as groups of customers based on their behavior, and in various fields such as image recognition and spam filtering. Clustering is an unsupervised learning technique in machine learning, as it can segregate multivariate data into various groups, without any supervisor, on the basis of common patterns hidden inside the dataset.
There are many clustering algorithms to choose from, and no single best clustering algorithm for all cases. Instead, it is a good idea to explore a range of clustering algorithms and different configurations for each algorithm. In this lab we will be learning and understanding the implementation of the k-means clustering algorithm.

KMeans Clustering: KMeans Algorithm is an Iterative algorithm that divides a group of n


datasets into k subgroups or clusters based on the similarity and their mean distance from the
centroid of that particular subgroup formed.
K here is the pre-defined number of clusters to be formed by the algorithm. If K=3, it means the number of clusters to be formed from the dataset is 3.

Task 1:
You have to solve the customer segmentation problem by using KMeans clustering and the
dataset “Mall_Customers.csv”.
Steps to follow:
1. Import the important libraries
2. Load and view the dataset
3. Apply feature scaling using MinMaxScaler. MinMaxScaler() is a data normalization
technique in machine learning that scales and transforms the features of a dataset to have
values between 0 and 1. This normalization method is used to ensure that all features are
on a similar scale.
You can use the following code:
#Feature Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scale = scaler.fit_transform(df[['Annual Income (k$)','Spending Score (1-100)']])
df_scale = pd.DataFrame(scale, columns=['Annual Income (k$)','Spending Score (1-100)'])
df_scale.head(5)
4. Apply KMeans with 2 clusters
#Applying KMeans
from sklearn.cluster import KMeans
import sklearn.cluster as cluster
km=KMeans(n_clusters=2)
y_predicted = km.fit_predict(df[['Annual Income (k$)','Spending Score (1-100)']])
y_predicted
5. Find the centroid of the two clusters by using the attribute ‘cluster_centers_’ as
shown below:
#Find the centroid
km.cluster_centers_
6. Visualize the results by using the scatterplot from seaborn library
#Visualize Results
df['Clusters'] = km.labels_
sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)",hue = 'Clusters',
data=df,palette='viridis')

Finding Optimum number of Clusters in K Means


The tricky part with K-Means clustering is that one does not know in advance into how many clusters the given data should be divided. There are two methods that can be used to find the optimal value of K other than hit and trial, but in this lab we will use only one, based on WCSS.
Elbow Method with Within-Cluster-Sum of Squared Errors (WCSS)
The Elbow Method is a popular technique for determining the optimal number of clusters. Here, we calculate the Within-Cluster-Sum of Squared Errors (WCSS) for various values of k and choose the k at which the decrease in WCSS first starts to level off. In the plot of WCSS versus k, this can be observed as an elbow.
 The Squared Error for a data point is the square of the distance of a point from its cluster
center.
 The WSS score is the summation of Squared Errors for all given data points.
 Distance metrics like Euclidean Distance or the Manhattan Distance can be used.
Task 2:
Continuing with our task 1,
1. Calculate the WCSS for K=2 to k=12 and calculate the WCSS in each iteration by using
the following code:
#Finding optimum value of K
K=range(2,12)
wss = []
for k in K:
kmeans=cluster.KMeans(n_clusters=k)
kmeans=kmeans.fit(df_scale)
wss_iter = kmeans.inertia_
wss.append(wss_iter)

2. Plot the WCSS vs K cluster graph


#Plotting the graph
plt.xlabel('K')
plt.ylabel('Within-Cluster-Sum of Squared Errors (WSS)')
plt.plot(K,wss)
Note: You will observe an elbow bend at point 5. It is the point after which WCSS does
not diminish much with the increase in value of K.
3. After finding out the optimum value of K, apply KMeans with this value and plot the
graph.
#Applying KMeans with optimal value of K
kmeans = cluster.KMeans(n_clusters=5)
kmeans = kmeans.fit(df[['Annual Income (k$)','Spending Score (1-100)']])
df['Clusters'] = kmeans.labels_
sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)",hue = 'Clusters',
data=df,palette='viridis')

Task 3:
Apply KMeans clustering after reducing the dimensionality of dataset into two components. You
can use the following code for applying PCA:
#Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df_scale)
pca_df = pd.DataFrame(data = principalComponents
, columns = ['principal component 1', 'principal component 2'])
pca_df.head()
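To finish Task 3, a sketch that clusters the two principal components, reusing the imports from the earlier tasks and the K=5 elbow value observed in Task 2:

#Applying KMeans on the PCA-reduced data
kmeans_pca = cluster.KMeans(n_clusters=5)
kmeans_pca = kmeans_pca.fit(pca_df)
pca_df['Clusters'] = kmeans_pca.labels_
sns.scatterplot(x='principal component 1', y='principal component 2',
                hue='Clusters', data=pca_df, palette='viridis')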
Experiment 7: Understanding Clustering - II

Objective: Develop an understanding of how to perform hierarchical clustering on a data set.


Time Required: 3 hrs
Programming Language: Python
Software Required: Anaconda

Introduction
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups
similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is
distinct from each other cluster, and the objects within each cluster are broadly similar to each
other. Hierarchical clustering is further divided into two types: Agglomerative and Divisive. In
this lab, we'll be applying agglomerative clustering, in which objects are grouped in clusters
based on their similarity. The algorithm starts by treating each object as a singleton cluster. Next,
pairs of clusters are successively merged until all clusters have been merged into one big cluster
containing all objects. The result is a tree-based representation of the objects,
named dendrogram.

Task:
You have to solve the wholesale customer segmentation problem using hierarchical clustering.
You can download the dataset using this link. The data is hosted on the UCI Machine Learning
repository. The aim of this problem is to segment the clients of a wholesale distributor based on
their annual spending on diverse product categories, like milk, grocery, region, etc.
Steps to follow:
1. Import the important libraries
2. Load and view the dataset
3. Normalize the data so that the scale of each variable is the same. If the scale of the variables
is not the same, the model might become biased towards the variables with a higher
magnitude like Fresh or Milk. To normalize the data, you can use the following code:
#Normalize data
from sklearn.preprocessing import normalize
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
data_scaled.head()
4. Draw the dendrogram to help you decide the number of clusters for this particular problem.
You can use the following code for it:
#Draw dendogram
import scipy.cluster.hierarchy as sch
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = sch.dendrogram(sch.linkage(data_scaled, method='ward'))
After drawing the dendogram, you will see that the x-axis contains the samples and y-axis
represents the distance between these samples. The vertical line with maximum distance is the
blue line and hence you can decide a threshold of 6 and cut the dendrogram with the following
code:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = sch.dendrogram(sch.linkage(data_scaled, method='ward'))
plt.axhline(y=6, color='r', linestyle='--')
After running the above code, you will get two clusters as this line cuts the dendrogram at two
points.
5. Apply hierarchical clustering for 2 clusters. You can use the following code for it
#Apply hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(data_scaled)
After executing the above code, you will see the values of 0s and 1s in the output since you
defined 2 clusters. 0 represents the points that belong to the first cluster and 1 represents points
in the second cluster.
6. Plot the clusters to visualize them by using following code:
#Plotting clusters
plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=cluster.labels_)
Experiment 8: Association Rule Analysis using Python
Objective : To implement association rule analysis using Python.
Time Required : 3 hrs
Programming Language : Python
Software Required : Anaconda

Introduction
Association Rule Analysis finds interesting associations and relationships among large sets of
data items. An association rule shows how frequently an itemset occurs in a set of transactions.
A typical example is Market Basket Analysis, one of the key techniques used by large retailers
to uncover associations between items. It allows retailers to identify relationships between the
items that people frequently buy together.

Apriori Algorithm:
The Apriori algorithm is used for finding frequent itemsets in a dataset for Boolean association
rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset
properties. It applies an iterative, level-wise search in which frequent k-itemsets are used to find
candidate (k+1)-itemsets. To improve the efficiency of this level-wise generation of frequent
itemsets, an important property called the Apriori property is used: every non-empty subset of a
frequent itemset must also be frequent, which prunes the search space. Walmart, in particular,
has made great use of the algorithm in suggesting products to its users.

The output of the Apriori algorithm is a set of frequent itemsets, from which association rules
are generated. The rules are evaluated using measures called support, confidence, and lift. Now
let's understand each term.

Support: It is the fraction of transactions that contain the itemset, calculated by dividing the
number of transactions containing the itemset by the total number of transactions:

Support(A) = (transactions containing A) / (total transactions)

Confidence: It is the measure of trustworthiness of a rule and can be calculated using the below
formula.

Conf(A => B) = Support(A ∪ B) / Support(A)

Lift: It measures how much more likely B is purchased when A is purchased, compared with
how often B is purchased on its own. It can be calculated using the below formula.

Lift(A => B) = Conf(A => B) / Support(B)


Lift(A => B) = 1: There is no relation between A and B.
Lift(A => B) > 1: There is a positive relation between the itemsets. It means that when product A
is bought, it is more likely that B is also bought.
Lift(A => B) < 1: There is a negative relation between the itemsets. It means that if product A is
bought, it is less likely that B is also bought.
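For example, suppose there are 100 transactions in total, 20 contain item A, 30 contain item B,
and 10 contain both. Then Support(A ∪ B) = 10/100 = 0.10, Conf(A => B) = 0.10 / 0.20 = 0.50,
and Lift(A => B) = 0.50 / 0.30 ≈ 1.67. Since the lift is greater than 1, buying A makes buying B
more likely than its 30% baseline.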

Dataset
Load the data using the following link: https://www.kaggle.com/datasets/mrmining/online-retail

Lab Tasks
Write a Python program that accomplishes the following (a starting sketch is given after the list):
1. Load the transaction data from the 'Online Retail.xlsx' file into a pandas DataFrame.
2. Preprocess the data by removing extra spaces in the 'Description' column, dropping rows
without invoice numbers, and filtering out credit transactions.
3. Create separate transaction baskets for each country of interest (France, United Kingdom,
Portugal, and Sweden) by grouping the data based on 'Country', 'InvoiceNo', and
'Description' columns. Calculate the sum of 'Quantity' for each unique combination of
'InvoiceNo' and 'Description'. Reshape the resulting DataFrame to have 'InvoiceNo' as the
index and each unique 'Description' as a column, representing the quantity of the
corresponding item in the transaction.
4. Apply the Apriori algorithm using the 'apriori()' function from the mlxtend library to find
frequent itemsets that include the 'Cutlery Set' in each country. Set the minimum support
threshold to 0.05.
5. Generate association rules from the frequent itemsets using the 'association_rules()'
function, considering a minimum lift threshold of 1.
6. Sort the association rules based on confidence and lift values in descending order.
7. Extract and analyze the top association rules that involve the 'Cutlery Set' for each
country.
8. Interpret the rules to identify patterns of the 'Cutlery Set' being purchased with other
items in different countries. Look for high-confidence rules with significant lift values,
which indicate strong associations between the 'Cutlery Set' and other items.
9. Print the top association rules for each country, including the antecedent (items
commonly purchased before the 'Cutlery Set') and consequent (items commonly
purchased after the 'Cutlery Set') of each rule, along with their confidence and lift values.
10. Provide insightful interpretations of the association rule patterns in each country,
highlighting any interesting and meaningful findings related to the 'Cutlery Set' item.
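Here is a minimal sketch of steps 1–6 for a single country (France), assuming the mlxtend
library is installed and the file has been downloaded as 'Online Retail.xlsx'; the 'Cutlery Set'
filtering and the per-country interpretation (steps 4 and 7–10) are left as your task:
#Sketch of steps 1-6 (assumptions: mlxtend installed, file named 'Online Retail.xlsx')
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# 1. Load the transactions
df = pd.read_excel('Online Retail.xlsx')

# 2. Preprocess: strip descriptions, drop missing invoices, remove credit transactions
df['Description'] = df['Description'].str.strip()
df = df.dropna(subset=['InvoiceNo'])
df['InvoiceNo'] = df['InvoiceNo'].astype(str)
df = df[~df['InvoiceNo'].str.contains('C')]

# 3. One basket per invoice for France; encode quantities as presence flags
basket = (df[df['Country'] == 'France']
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().fillna(0))
basket = basket > 0

# 4. Frequent itemsets with minimum support 0.05
frequent_itemsets = apriori(basket, min_support=0.05, use_colnames=True)

# 5. Association rules with a minimum lift threshold of 1
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)

# 6. Sort by confidence and lift in descending order
rules = rules.sort_values(['confidence', 'lift'], ascending=False)
print(rules.head())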
Experiment 9: Understanding Classification using KNN
Objective: To develop an understanding of how classifier model is trained and tested on a data
set.
Time Required: 3 hrs
Programming Language: Python
Software Required: Anaconda

Introduction
Classification is a data mining function that assigns items in a collection to target categories or
classes. The goal of classification is to accurately predict the target class for each case in the
data. For example, a classification model could be used to identify loan applicants as low,
medium, or high credit risks. Some basic classification algorithms are Logistic Regression, the
Naive Bayes Classifier, Nearest Neighbor, and Support Vector Machines. In this lab we will
explore the KNN (K Nearest Neighbor) classification algorithm to understand the training and
testing of a classifier model.
The k-nearest neighbors (KNN) algorithm is a data classification method that estimates the
likelihood of a data point belonging to one group or another, based on the groups of the data
points nearest to it. The k-nearest neighbor algorithm is a type of supervised machine learning
algorithm used to solve classification and regression problems; however, it is mainly used for
classification problems.
Note: Don't confuse KNN classification with K-means clustering. KNN is a supervised
classification algorithm that classifies new data points based on the nearest data points. On the
other hand, K-means clustering is an unsupervised clustering algorithm that groups data into K
clusters.

Task 1: You must predict whether a person will have diabetes or not using a KNN classifier.
Steps to follow (a sketch tying the steps together is given after this list):
1. Load and view the provided dataset ‘diabetes.csv’.
2. Import all the important libraries.
3. Perform data cleaning by replacing empty values with the mean of the respective column so
that they do not skew the outcome.
4. Split the independent variables (features) and the dependent variable (label) of the dataset
into X and y
5. Split data into training set and test set
6. Perform feature scaling on the training and test sets of independent variables to bring the
features onto a comparable scale, using the following code:
#Feature scaling
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
7. Define the K Nearest Neighbor model by using the following code:
# Define the model
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')
8. Fit your defined model and predict the test results
9. Evaluate the model using the confusion matrix, f1_score and accuracy score by
comparing the predicted and actual test values. To compute the confusion matrix, use the
following code:
#Finding confusion matrix and f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
cm = confusion_matrix(y_test, y_pred)
print (cm)
print (f1_score(y_test, y_pred))
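Here is a minimal sketch tying steps 1–9 together, assuming the provided 'diabetes.csv' uses
'Outcome' as the label column (an assumed name — adjust it to your file):
#Sketch of steps 1-9 (assumption: label column is named 'Outcome')
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

data = pd.read_csv('diabetes.csv')

# Steps 3-4: fill empty values with column means, then split features/label
data = data.fillna(data.mean())
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Step 5: hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 6: feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Steps 7-8: define, fit, and predict
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Step 9: evaluate with the confusion matrix, f1_score, and accuracy
print(confusion_matrix(y_test, y_pred))
print(f1_score(y_test, y_pred))
print(accuracy_score(y_test, y_pred))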

Task 2: Using the code implemented above, vary the model by using the cosine similarity
measure instead of Euclidean distance, and determine which one produces better values in terms
of accuracy and f1_score.
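For Task 2, a sketch of the variation, assuming scikit-learn's brute-force neighbor search (the
tree-based searches do not support the cosine metric):
#Task 2 variation: cosine distance instead of Euclidean
classifier_cos = KNeighborsClassifier(n_neighbors=11, metric='cosine', algorithm='brute')
classifier_cos.fit(X_train, y_train)
y_pred_cos = classifier_cos.predict(X_test)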
Experiment 10: Linear Regression

Objective : To apply linear regression using Python.


Time Required : 3 hrs
Programming Language : Python
Software Required : Anaconda

Introduction
Linear regression is a widely used supervised learning algorithm in the field of machine learning
and statistics. It is primarily used for predicting continuous numeric values based on input
features. The goal of linear regression is to model the relationship between the input variables
(also known as independent variables or features) and the continuous target variable (also known
as the dependent variable) by fitting a linear equation to the data.
In linear regression, the relationship between the input features and the target variable is assumed
to be linear. The algorithm estimates the coefficients of the linear equation that best fits the given
data, allowing us to make predictions on new data points.
The coefficients (weights) are estimated during the training process using a method called
Ordinary Least Squares (OLS) or a variant such as Ridge Regression or Lasso Regression. The
objective is to minimize the sum of squared differences between the predicted values and the
actual target values in the training data.
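Concretely, for a data point with input features x1, x2, ..., xn, the fitted equation has the form

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

where b0 is the intercept, and OLS chooses the coefficients b0, b1, ..., bn that minimize the sum
of squared residuals, Σ (y_actual − y_predicted)², over the training data.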
During the training phase, the algorithm adjusts the coefficients to find the line that best fits the
training data. Once trained, the model can make predictions by simply plugging in the values of
the input features into the linear equation.
Linear regression can be used for various tasks such as predicting housing prices, stock market
trends, sales forecasts, and many more. It serves as a foundation for more advanced regression
techniques and can be extended to handle more complex relationships through feature
engineering and incorporating non-linear transformations.

TASK:
Apply Linear Regression on the Advertising data set, which contains advertising expenditures on
TV, radio, and newspaper, and the corresponding sales figures.
Steps to follow (a complete sketch is given after the list):
a) Download the Advertising data set from the following webpage:
https://www.kaggle.com/datasets/thorgodofthunder/tvradionewspaperadvertising

b) Load all relevant packages and dataset.


from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

c) Split feature vectors and labels


d) Clean and preprocess the data to prepare it for the linear regression model. This step may
involve handling missing values, encoding categorical variables, and splitting the data
into features (X) and target variable (y).
e) Split the data into training and testing set
f) Train the Model: Instantiate the linear regression model and fit it to the training data. The
model will learn the relationship between the features and the target variable.
g) Evaluate the trained model using various metrics such as mean squared error (MSE) and
coefficient of determination (R^2).
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
h) Plot a scatter plot of the training set and testing set using the first feature column of X as
the x-axis and Y as the y-axis.
plt.scatter(X_test[:,0] , y_pred, color = 'red')
plt.title("Plot Testing set")
plt.ylabel("Y")
plt.xlabel("Input feature columns")
Experiment 11: Open Ended Lab
Objective: To assess students' data mining and analytical skills by having them solve a problem
using the knowledge gained in previous labs
Time Required: 3 hrs
Programming Language: Python
Software Required: Anaconda
______________________________________________________________________________
Task:
An automobile manufacturer is seeking to identify the closest competitors to their newly
developed vehicle prototypes before launching the new model. To achieve this, they need to
group existing vehicles on the market based on similarities, determine which group is the most
similar to the prototypes, and use this information to identify the primary competitors for their
new model.

The objective is to utilize clustering techniques to identify clusters of vehicles that possess
unique characteristics. This analysis will provide an overview of the current market of vehicles
and aid manufacturers in deciding on the development of new models based on the identified
distinct clusters.

You can download the dataset from the link given below:
https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cars_clus.csv

Build your own pipeline and justify it. Also show the implementation and results of your solution
through code.

Rubrics for Evaluation


Each parameter is assessed against CLO-2 (C-3) on a four-point scale: Poor (0), Weak (1),
Good (2), Excellent (3).

Data Preparation
- Poor (0): The student did not perform any data cleaning, or the cleaning process is completely
incorrect or inadequate.
- Weak (1): The student attempted to clean the data, but the result is mostly incorrect or
inadequate.
- Good (2): The student successfully cleaned the data, but with some errors or omissions, or
some parts are incomplete or unclear.
- Excellent (3): The student successfully cleaned the data, demonstrating a good understanding
of data cleaning techniques and best practices.

Data Analysis
- Poor (0): No attempt was made to analyze the data, or inappropriate statistical techniques were
used.
- Weak (1): Inappropriate statistical techniques are used, or the understanding of the statistical
analyses performed is incomplete.
- Good (2): The understanding of the statistical analyses performed is mostly clear and accurate.
- Excellent (3): Appropriate statistical techniques are used to analyze the data, and the
understanding of the statistical analyses performed is clear and accurate.

Data Modeling
- Poor (0): The student did not demonstrate any understanding of data modeling concepts.
- Weak (1): The student demonstrated a poor understanding of data modeling concepts.
- Good (2): The student demonstrated a good understanding of data modeling concepts.
- Excellent (3): The student demonstrated an excellent understanding of data modeling concepts.

Model Selection and Building
- Poor (0): The student does not build any data mining models, or the building process is
completely incorrect or inadequate.
- Weak (1): The student attempts to build data mining models, but the result is mostly incorrect
or inadequate.
- Good (2): The student successfully built data mining models, but with some errors or
omissions, or some parts are incomplete or unclear.
- Excellent (3): The student successfully built accurate, robust, and interpretable data mining
models, demonstrating a good understanding of model building techniques and best practices.

Model Evaluation
- Poor (0): The student does not use any evaluation metrics, or the selected metrics are
completely incorrect or inadequate.
- Weak (1): The student attempted to use evaluation metrics such as Accuracy, F1, etc.
- Good (2): The student successfully used evaluation metrics, but with some errors or omissions,
or some parts are incomplete.
- Excellent (3): The student successfully used appropriate evaluation metrics, demonstrating a
good understanding of evaluation metrics and their interpretation.
