CS-423 Data Warehousing and Data Mining Lab
PREPARED BY
Lab manual is prepared by Assoc Prof Dr. Hammad Afzal, Asst Prof Malik Muhammad Zaki
Murtaza Khan and Lab Engr Marium Hida.
GENERAL INSTRUCTIONS
a. Students are required to maintain the lab manual with them till the end of the semester.
b. All readings, answers to questions and illustrations must be completed in the space provided.
If more space is required, additional sheets may be attached.
c. It is the responsibility of the student to have the manual graded before the deadlines given
by the instructor.
d. Loss of the manual will result in re-submission of the complete manual.
e. Students are required to go through the experiment before coming to the lab session. Lab
session details will be given in the training schedule.
f. Students must bring the manual in each lab.
g. Keep the manual neat, clean, and presentable.
h. Plagiarism is strictly forbidden. No credit will be given if a lab session is plagiarised, and
no re-submission will be entertained.
i. Marks will be deducted for late submission.
j. Error handling in a program is the responsibility of the student.
VERSION HISTORY
Date             Updated By                                                                              Details
August 2014      Dr. Hammad Afzal                                                                        v1
September 2014   Dr. Hammad Afzal & TA Subhan Khan                                                       v2
September 2016   Dr. Hammad Afzal & Lab Engr Marium Hida                                                 v3
September 2019   Dr. Hammad Afzal, Dr. Malik Muhammad Zaki Murtaza Khan & Lab Engr Marium Hida           v4
April 2021       Dr. Hammad Afzal, Dr. Naima Iltaf, Lab Engr Sehrish Ferdous & Lab Engr Saba Siddique    v5
February 2022    Dr. Naima Iltaf, Dr. Hammad Afzal, Lab Engr Saba Siddique & Lab Engr Laraib Zainab      v6
Lab Rubrics (Group 1)
Experiment 1 – Introduction to Python - I
Objective: To provide basic knowledge about the Python language: expressions and statements,
operators, variables, lists, strings, and functions.
Time Required : 3 hrs
Programming Language : Python
Software Required : Anaconda/Google Colab
The single quotation marks in our print statement indicate that what we are printing is a string – a
sequence of letters or other symbols. If the string itself contains a single quotation mark (an
apostrophe), we must enclose the string in double quotation marks instead; otherwise Python will
report a syntax error.
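For example, a minimal illustration of both quoting styles (the string values are placeholders):
print('Hello, World!')
print("It's a lovely day")   # double quotes, because the string contains an apostrophe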
Each expression performs some calculation, yielding a value. For example, we can calculate the
result of a simple mathematical expression using whole numbers (or integers):
When Python has calculated the result of this expression, it prints it to the screen, even though
we have not used print. All our programs will be built from such statements and expressions.
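For instance, entering an arithmetic expression in a cell evaluates and displays its value without an explicit print (a minimal illustration):
2 + 3 * 4   # displays 14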
Python Input
While programming, we might want to take the input from the user. In Python, we can use the
input() function.
In the above example, we have used the input() function to take input from the user and stored
the user input in the num variable.
It is important to note that the entered value 10 is a string, not a number. So, type(num) returns
<class 'str'>.
To convert user input into a number we can use int() or float() functions as:
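A minimal sketch of this (the prompt text is illustrative):
num = input('Enter a number: ')   # suppose the user types 10
print(type(num))                  # <class 'str'>
num = int(num)                    # convert the string to an integer
print(num + 5)                    # 15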
Creating Variables:
In programming, a variable is a container (storage area) used to hold data. Python has no command
for declaring a variable; a variable is created the moment you first assign a value to it. Python
also allows you to assign values to multiple variables in one line:
Variables do not need to be declared with any particular type and can even change type after they
have been set. (Note: variable names are case sensitive)
If we want to assign the same value to multiple variables at once, we can do this as:
You can get the data type of a variable with the type() function.
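A short illustration of these points (the names and values are placeholders):
x = 5                        # x holds an int
x = 'five'                   # the same variable can later hold a str
a, b, c = 1, 2.5, 'three'    # multiple variables assigned in one line
p = q = r = 0                # the same value assigned to multiple variables
print(type(x), type(a), type(b))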
Python Literals
Literals are representations of fixed values in a program. They can be numbers, characters, or
strings, etc. For example, 'Hello, World!', 12, 23.0, 'C', etc.
Literals are often used to assign values to variables or constants. For example:
Literal Collections
There are four different literal collections List literals, Tuple literals, Dict literals, and Set
literals.
In the above example, we created a list of fruits, a tuple of numbers, a dictionary of alphabets
having values with keys designated to each value and a set of vowels.
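A minimal example of literals and literal collections:
greeting = 'Hello, World!'                 # string literal
count = 12                                 # integer literal
price = 23.0                               # float literal
fruits = ['apple', 'mango', 'orange']      # list literal
numbers = (1, 2, 3)                        # tuple literal
alphabets = {'a': 'apple', 'b': 'ball'}    # dict literal
vowels = {'a', 'e', 'i', 'o', 'u'}         # set literal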
Python Data Types
Data Type    Classes               Description
Numeric      int, float, complex   holds numeric values
String       str                   holds a sequence of characters
Sequence     list, tuple, range    holds a collection of items
Mapping      dict                  holds data in key-value pairs
Boolean      bool                  holds either True or False
Set          set, frozenset        holds a collection of unique items
Since everything is an object in Python programming, data types are classes and variables are
instances(object) of these classes.
List Data Type
List is an ordered collection of similar or different types of items separated by commas and
enclosed within brackets [ ]. For example,
Access List Items: Here, we have created a list named languages with 3 string values inside it.
To access items from a list, we use the index number (0, 1, 2 ...). For example,
We use append() to add a single item to the end of a list; the list itself is updated as a result of
the operation.
languages.append("C++")
The following command returns all items before index 2 in the languages list:
languages[:2]
The following command returns all items from index 2 to the end of the languages list:
languages[2:]
By convention, m:n means elements m…n-1.
languages[0:2]
We can also slice with negative indexes — the same basic rule of starting from the start index
and stopping one before the end index applies.
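Putting these operations together (languages is an illustrative list):
languages = ['Python', 'Swift', 'C']
print(languages[0])        # Python
languages.append('C++')    # ['Python', 'Swift', 'C', 'C++']
print(languages[:2])       # ['Python', 'Swift']
print(languages[2:])       # ['C', 'C++']
print(languages[-3:-1])    # ['Swift', 'C']  (slicing with negative indexes)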
Python Tuple Data Type
Tuple is an ordered sequence of items same as a list. The only difference is that tuples are
immutable. Tuples once created cannot be modified.
In Python, we use the parentheses () to store items of a tuple. For example,
Here, product is a tuple containing the string value 'Xbox' and the floating-point value 499.99.
Some of the methods we used to access the elements of a list also work with individual words,
or strings. For example, we can assign a string to a variable, index a string, and slice a string:
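A brief sketch of tuples and of string indexing and slicing (the values are illustrative):
product = ('Xbox', 499.99)
print(product[0])          # Xbox
my_string = 'Monty Python'
print(my_string[0])        # M
print(my_string[6:12])     # Python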
Functions:
A function is a block of code which only runs when it is called. You can pass data, known as
parameters, into a function. A function can return data as a result.
Creating & Calling a Function
In Python a function is defined using the def keyword:
Arguments
Information can be passed into functions as arguments. Arguments are specified after the
function name, inside the parentheses. You can add as many arguments as you want, just
separate them with a comma.
The following example has a function with one argument (fname). When the function is called,
we pass along a first name, which is used inside the function to print the full name:
By default, a function must be called with the correct number of arguments. Meaning that if your
function expects 2 arguments, you have to call the function with 2 arguments, not more, and not
less.
This function expects 2 arguments, and gets 2 arguments:
If you try to call the function with 1 or 3 arguments, you will get an error. However, if a
parameter has a default value and we call the function without that argument, the default value is used.
Return Values
To let a function return a value, use the return statement.
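A compact sketch combining these ideas (the names and values are illustrative):
def greet(fname, lname='Khan'):     # lname has a default value
    print(fname, lname)

greet('Ali', 'Ahmed')               # called with 2 arguments
greet('Ali')                        # uses the default value for lname

def add(x, y):
    return x + y                    # return sends a value back to the caller

result = add(3, 4)
print(result)                       # 7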
Lab Tasks:
Q1: Write a program to:
a) Define a string with your name and assign it to a variable. Print the contents of this
variable in two ways, first by simply typing the variable name and pressing enter, then by
using the print statement.
b) Try adding the string to itself using my_string + my_string, or multiplying it by a
number, e.g., my_string * 3. Notice that the strings are joined together without any
spaces. How could you fix this?
Q3: Given two strings, s1 and s2, return a new string made of the first, middle, and last
characters of each input string.
Given:
s1 = "America"
s2 = "Japan"
Expected Output:
AJrpan
Q4: Write a program to create a function that takes two strings s1 and s2 and creates a new string
by appending s2 in the middle of s1.
Given:
s1 = "Software"
s2 = "Design"
Expected Output:
SoftDesignware
Q5. Write a program to create five lists, one for each row in our dataset:
Also use list indexing to extract the number of ratings from the five rows and then average them.
Q6. Write a program to create the course variable and set it to an empty list.
1. Now, add 'Machine Learning', 'Software Construction', and 'Formal Methods' to the
course list in that order without reassigning the variable.
2. Delete 'Software Construction' and display the updated list content.
3. Add the course 'Artificial Intelligence' to course where 'Software Construction' used to
be.
4. Slice course to return the 1st and 3rd elements.
References:
[1] https://mas-dse.github.io/startup/anaconda-windows-install/#anaconda
[2] https://www.programiz.com/python-programming/first-program
[3] https://www.codecademy.com/learn/learn-python-3
[4] https://www.w3schools.com/python/
Experiment # 2: Introduction to Python - II
Objective: To provide basic knowledge about the Python language: tuples, sets, dictionaries,
conditional and loop statements.
Time Required : 3 hrs
Programming Language : Python
Software Required : Anaconda/ Google Colab
Here, val accesses each item of the sequence on each iteration (as in a loop of the form
for val in sequence:); the loop continues until the last item in the sequence is reached.
You can loop through the tuple items by using a for loop.
Python Tuples
Tuples are used to store multiple items in a single variable. Tuple is one of 4 built-in data types
in Python used to store collections of data, the other 3 are List, String, and Dictionary, all with
different qualities and usage.
A tuple is a collection which is ordered and unchangeable. It allows duplicate members. Tuples
are written with round brackets.
Negative Indexing
Negative indexing means start from the end. -1 refers to the last item, -2 refers to the second last
item etc.
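A minimal illustration of a tuple, negative indexing, and a for loop over its items (the values are placeholders):
thistuple = ('apple', 'banana', 'cherry')
print(thistuple[-1])       # cherry (negative indexing starts from the end)
for x in thistuple:        # loop through the tuple items
    print(x)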
Python Sets
A set is a collection which is unordered and unindexed. It does not allow duplicate members. In
Python, sets are written with curly brackets.
Access Items
You cannot access items in a set by referring to an index or a key but you can loop through the
set items using a for loop, or ask if a specified value is present in a set, by using the in keyword.
Add Items
To add one item to a set use the add() method.
To add more than one item to a set use the update() method.
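A short sketch of these set operations (the values are placeholders):
thisset = {'apple', 'banana', 'cherry'}
print('banana' in thisset)           # membership test with the in keyword
thisset.add('orange')                # add one item
thisset.update(['mango', 'grapes'])  # add multiple items
for x in thisset:                    # loop through the set items
    print(x)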
Python Dictionaries
A dictionary is a collection which is unordered, changeable, indexed and doesn’t allow
duplicates. In Python dictionaries are written with curly brackets, and they have keys and values.
Accessing Items
You can access the items of a dictionary by referring to its key name, inside square brackets:
There is also a method called get() that will give you the same result:
Dictionary Length
To determine how many items (key-value pairs) a dictionary has, use the len() function.
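A minimal example of these dictionary operations (the keys and values are placeholders):
thisdict = {'brand': 'Ford', 'model': 'Mustang', 'year': 1964}
print(thisdict['model'])        # access by key name
print(thisdict.get('model'))    # same result using get()
print(len(thisdict))            # number of key-value pairs -> 3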
Dictionary Methods
Dictionaries also provide several built-in methods, such as keys(), values(), items(), update(), and pop().
Python Conditions and If Statements
Python supports the usual comparison conditions from mathematics:
Equals: a == b
Not Equals: a != b
Less than: a < b
Less than or equal to: a <= b
Greater than: a > b
Greater than or equal to: a >= b
These conditions can be used in several ways, most commonly in "if statements" and loops.
An "if statement" is written by using the if keyword.
If statement:
In this example we use two variables, a and b, which are used as part of the if statement to test
whether b is greater than a. As a is 33, and b is 200, we know that 200 is greater than 33, and so
we print to screen that "b is greater than a".
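A minimal example matching this description:
a = 33
b = 200
if b > a:
    print("b is greater than a")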
Indentation
Python relies on indentation (whitespace at the beginning of a line) to define scope in the code.
Other programming languages often use curly brackets for this purpose.
If statement, without indentation (will raise an error):
Elif
The elif keyword is python’s way of saying "if the previous conditions were not true, then try
this condition".
In this example a is equal to b, so the first condition is not true, but the elif condition is true, so
we print to screen that "a and b are equal".
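A minimal example matching this description (illustrative values):
a = 33
b = 33
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")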
Else
The else keyword catches anything which isn't caught by the preceding conditions.
In this example a is greater than b, so the first condition is not true, also the elif condition is not
true, so we go to the else condition and print to screen that "a is greater than b". You can also
have an else without the elif.
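A minimal example matching this description (illustrative values):
a = 200
b = 33
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("a is greater than b")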
Short Hand If
If you have only one statement to execute, you can put it on the same line as the if statement.
One line if statement:
This technique is known as Ternary Operators, or Conditional Expressions. You can also
combine multiple conditions and else clauses on the same line:
One line if else statement, with 3 conditions:
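Minimal examples of the one-line forms (with a and b defined as in the earlier examples):
if a > b: print("a is greater than b")                               # one line if
print("A") if a > b else print("B")                                  # one line if else
print("A") if a > b else print("=") if a == b else print("B")        # one line if else, 3 conditions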
Logical Operator
And
The and keyword is a logical operator, and is used to combine conditional statements.
Test if a is greater than b, AND if c is greater than a:
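A minimal example matching this description (illustrative values):
a = 200
b = 33
c = 500
if a > b and c > a:
    print("Both conditions are True")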
Or
The or keyword is a logical operator, and is used to combine conditional statements.
Test if a is greater than b, OR if a is greater than c:
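A minimal example, reusing the values of a, b, and c from above:
if a > b or a > c:
    print("At least one of the conditions is True")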
Nested If
You can have if statements inside if statements, this is called nested if statements.
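A minimal example of a nested if (the value of x is illustrative):
x = 41
if x > 10:
    print("Above ten,")
    if x > 20:
        print("and also above 20!")
    else:
        print("but not above 20.")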
The while loop requires relevant variables to be ready, in this example we need to define an
indexing variable, i, which we set to 1.
The break Statement
With the break statement we can stop the loop even if the while condition is true.
Exit the loop when i is 3:
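A minimal example matching this description:
i = 1
while i < 6:
    print(i)
    if i == 3:
        break
    i += 1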
Looping Through a String
Even strings are iterable objects; they contain a sequence of characters.
Loop through the letters in the word "banana":
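A minimal example matching this description:
for x in "banana":
    print(x)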
The range() function defaults to increment the sequence by 1, however it is possible to specify
the increment value by adding a third parameter: range(2, 30, 3).
Increment the sequence with 3 (default is 1):
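A minimal example matching this description:
for x in range(2, 30, 3):
    print(x)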
Nested Loops
A nested loop is a loop inside a loop. The "inner loop" will be executed one time for each
iteration of the "outer loop".
Print each adjective for every fruit:
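A minimal example matching this description (the word lists are illustrative):
adj = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
for x in adj:          # outer loop
    for y in fruits:   # inner loop runs once for each iteration of the outer loop
        print(x, y)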
LAB TASKS:
Q1: Write a program that takes two integers as input (lower limit and upper limit) and displays
all the prime numbers including and between these two numbers.
Q2: Given a list, iterate over it and display the numbers that are divisible by 5; if you find a
number greater than 150, stop the loop iteration.
list1 = [12, 15, 32, 42, 55, 75, 122, 132, 150, 180, 200]
Q3: Write a program that accepts a comma separated sequence of words as input and prints the
words in a comma-separated sequence after sorting them alphabetically. Suppose the following
input is supplied to the program: without, hello, bag, world. Then, the output should be: bag,
hello, without, world.
Q4:
a. Write a simple calculator program. Follow the steps below:
• Declare and define a function named Menu which displays a list of choices for the
user, such as addition, subtraction, multiplication, and classic division. It should take the
user's choice as input and return it.
• Define and declare a separate function for each choice (each mathematical
operation).
• In the main body of the program call the respective function depending on the
user’s choice.
b. Implement the following functions for the calculator you created in the above task.
• Factorial
• x_power_y (x raised to the power y)
References:
https://www.w3schools.com/python/default.asp
Experiment 3: Data Cleaning
Introduction
Cleaning Data
Data cleaning or Data cleansing is very important from the perspective of building intelligent
automated systems. Data cleansing is a preprocessing step that improves the data validity,
accuracy, completeness, consistency, and uniformity. It is essential for building reliable machine
learning models that can produce good results. Otherwise, no matter how good the model is, its
results cannot be trusted. In short, data cleaning means fixing bad data in your data set. Bad data
could be:
1. Empty cells
2. Data in wrong format
3. Wrong data
4. Duplicates
The dataset that we are going to use is ‘rawdata.csv’. It has the following characteristics:
The data set contains some empty cells ("Date" in row 22, and "Calories" in row 18 and
28).
The data set contains wrong format ("Date" in row 26).
The data set contains wrong data ("Duration" in row 7).
The data set contains duplicates (row 11 and 12).
Empty cells can potentially give a wrong result while analyzing data, so to deal with the
empty cells we will perform the following operations:
a. Remove Rows
One way to deal with empty cells is to remove the rows that contain them, using the method
dropna(). Since data sets can be very big, removing a few rows will usually not have a big impact on
the result.
Task: Remove all the empty cells in dataset provided
By default, the dropna() method returns a new DataFrame, and will not change the original. If
you want to change the original DataFrame, use the inplace = True argument.
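A minimal sketch, assuming the data has been loaded from 'rawdata.csv' with pandas:
import pandas as pd
df = pd.read_csv('rawdata.csv')
new_df = df.dropna()          # returns a new DataFrame without rows that contain empty cells
df.dropna(inplace=True)       # or: modify the original DataFrame in place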
b. Replace empty values
Another way of dealing with empty cells is to insert a new value instead by using method fillna().
This way you do not have to delete entire rows just because of some empty cells.
Task: Replace the empty values with 150
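A minimal sketch, continuing with the DataFrame df loaded above:
df.fillna(150, inplace=True)    # replace every empty cell with 150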
c. Replace only for a specified Columns
In above methods, we are replacing all empty cells in the whole Data Frame. To only replace
empty values for one column, specify the column name for the DataFrame.
Task: Replace the empty values in ‘Calories’ with 130.
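A minimal sketch for a single column:
df['Calories'] = df['Calories'].fillna(130)   # replace empty values only in the 'Calories' column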
d. Replace Using Mean, Median, or Mode
A common way to replace empty cells, is to calculate the mean, median or mode value of the
column. Pandas uses the mean() median() and mode() methods to calculate the respective values
for a specified column:
i. Mean:
Mean = the average value (the sum of all values divided by number of values).
ii. Median:
Median = the value in the middle, after you have sorted all values ascending.
iii. Mode:
Mode = the value that appears most frequently.
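For example, a minimal sketch of replacing missing values in 'Calories' with the column mean (the median and mode tasks below follow the same pattern):
x = df['Calories'].mean()                  # use .median() or .mode()[0] for the other tasks
df['Calories'] = df['Calories'].fillna(x)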
Tasks:
1. Calculate the Mean of ‘Calories’ and replace the missing values with it.
2. Calculate the Median of ‘Maxpulse’ and replace the missing values with it.
3. Calculate the mode of ‘Pulse’ and replace the missing values with it.
Experiment 4: Feature Selection
Introduction
Feature Selection is one of the core concepts in machine learning which hugely impacts the
performance of your model. The data features that you use to train your machine learning models
have a huge influence on the performance you can achieve. Irrelevant or partially relevant
features can negatively impact model performance.
Feature Selection is the process where you automatically or manually select those features which
contribute most to your prediction variable or output in which you are interested in. Having
irrelevant features in your data can decrease the accuracy of the models and make your model
learn based on irrelevant features.
The most commonly and easy to use Feature selection techniques which provide good results are
as follows:
1. Univariate Selection:
Statistical tests can be used to select those features that have the strongest relationship with the
output variable.
The scikit-learn library provides the SelectKBest class that can be used with a suite of different
statistical tests to select a specific number of features.
TASK 1:
Download the dataset from this link: https://www.kaggle.com/iabhishekofficial/mobile-price-
classification#train.csv
Description of the variables in dataset:
battery_power: Total energy a battery can store in one time measured in mAh
blue: Has Bluetooth or not
clock_speed: the speed at which microprocessor executes instructions
dual_sim: Has dual sim support or not
fc: Front Camera megapixels
four_g: Has 4G or not
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of the processor
pc: Primary Camera megapixels
px_height
Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in MegaBytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: the longest time that a single battery charge will last when you are
three_g: Has 3G or not
touch_screen: Has touch screen or not
wifi: Has wifi or not
price_range: This is the target variable with a value of 0(low cost), 1(medium cost), 2(high cost)
and 3(very high cost).
Use the chi-squared (chi²) statistical test for non-negative features to select 10 of the best features
from the above dataset which is used for Mobile Price Range Prediction.
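A minimal sketch for this task, assuming the downloaded training file is named 'train.csv' and that 'price_range' is the target column:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

data = pd.read_csv('train.csv')
X = data.drop('price_range', axis=1)     # independent features (all non-negative)
y = data['price_range']                  # target variable

selector = SelectKBest(score_func=chi2, k=10)
fit = selector.fit(X, y)

scores = pd.DataFrame({'Feature': X.columns, 'Score': fit.scores_})
print(scores.nlargest(10, 'Score'))      # the 10 best features by chi-squared score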
2. Feature Importance
You can get the feature importance of each feature of your dataset by using the feature
importance property of the model. Feature importance gives you a score for each feature of your
data, the higher the score more important or relevant is the feature towards your output variable.
Feature importance is an inbuilt class that comes with Tree Based Classifiers.
TASK 2:
Load the dataset again and use Extra Tree Classifier for extracting the top 10 features for the
dataset and plot your results.
To import the Extra Tree Classifier, use the following command:
from sklearn.ensemble import ExtraTreesClassifier
Use the following inbuilt class for feature importances:
feature_importances_
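A minimal sketch under the same assumptions as Task 1 (file named 'train.csv', target column 'price_range'):
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

data = pd.read_csv('train.csv')
X = data.drop('price_range', axis=1)
y = data['price_range']

model = ExtraTreesClassifier()
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
importances.nlargest(10).plot(kind='barh')     # plot the top 10 features by importance score
plt.show()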
Experiment 5: Dimensionality Reduction using PCA
Introduction
Data pre-processing is crucial in any data mining process, as it directly impacts the success rate of
the project. It reduces the complexity of the data under analysis, since data in the real world is
unclean. Data is said to be unclean if it has missing attributes or attribute values, contains noise or
outliers, or includes duplicate or wrong data. The presence of any of these degrades the quality of the results.
Furthermore, data sparsity increases as the dimensionality increases which makes operations like
clustering, outlier detection less meaningful as they greatly depend on density and distance
between points. Purpose of dimensionality reduction is to:
∙ Avoid the curse of dimensionality
∙ Reduce the time required by algorithms
∙ Reduce memory consumption
∙ Ease visualization of the data
∙ Eliminate irrelevant features
Principal Component Analysis (PCA) is a method used to reduce the number of variables in your
data by extracting the important ones from a large pool. It reduces the dimension of your data with
the aim of retaining as much information as possible. In other words, this method combines
highly correlated variables to form a smaller set of artificial variables, called “principal
components”, that account for most of the variance in the data.
TASK:
Apply PCA on the Fisher’s Iris data set. The data contains 3 classes of 50 instances each, where
each class refers to a type of iris plant. There are 4 different attributes describing the data. You
will use principal component analysis to transform the data to a lower dimensional space.
Steps to follow:
a) Download the Iris data set from the following webpage:
http://archive.ics.uci.edu/ml/datasets/Iris
b) Load all relevant packages and dataset.
c) Split feature vectors and labels.
d) Normalize the dataset which is done by subtracting the mean of each feature vector from
the dataset so that the dataset should be centered on the origin.
e) Compute the covariance matrix which is basically a measure of the extent to which
corresponding elements from two sets of ordered data move in the same direction.
To compute the covariance matrix, use the np.cov() builtin method
f) Calculate the eigenvalues and eigenvectors.
Remember: The Eigenvectors of the Covariance matrix we get are Orthogonal to each
other and each vector represents a principal axis. A higher Eigenvalue corresponds to a
higher variability. Hence the principal axis with the higher Eigenvalue will be an axis
capturing higher variability in the data. Orthogonal means the vectors are mutually
perpendicular to each other.
You can use the builtin method np.linalg.eigh(). It will return two objects, a 1-D
array containing the eigenvalues, and a 2-D square array or matrix (depending on
the input type) of the corresponding eigenvectors (in columns).
g) Sort the eigen values in descending order.
Remember: We order the eigenvalues from largest to smallest so that it gives us the
components in order of significance. Each column in the Eigen vector-matrix corresponds
to a principal component, so arranging them in descending order of their Eigenvalue will
automatically arrange the principal component in descending order of their variability.
Hence, the first column in our rearranged Eigen vector-matrix here will be a principal
component that captures the highest variability.
You can use the builtin method np.argsort()
h) Choose components and form a feature vector.
Remember: If we have a dataset with n variables, then we have the corresponding n
eigenvalues and eigenvectors. To reduce the dimensions, we choose the first p
eigenvalues and ignore the rest. Some information is lost in the process, but if the
eigenvalues are small, we do not lose much.
In this task, select the first two principal components. n_components = 2 means your final
data should be reduced to just 2 dimensions.
i) Transform the data by having a dot product between the Transpose of the Feature Vector
and the Transpose of the mean-centered data. By transposing the outcome of the dot
product, the result we get is the data reduced to lower dimensions (2-D) from higher
dimensions (4-D).
You can use the following command for this purpose:
X_reduced=np.dot(eigenvector_subset.transpose(),
X_meaned.transpose()).transpose()
j) Project the data onto its first two principal components and plot the results using the
seaborn and matplotlib libraries. (Hint: Create Data Frame of reduced dataset and
concatenate it with Labels (target variable) to create a complete Dataset).
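A minimal sketch that walks through steps (b)–(j), assuming the data has been loaded into a DataFrame named df with the four feature columns first and the class label in a column named 'target' (these names are assumptions):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

X = df.iloc[:, 0:4].values          # feature vectors
y = df['target'].values             # labels

X_meaned = X - np.mean(X, axis=0)   # center the data on the origin

cov_mat = np.cov(X_meaned, rowvar=False)            # covariance matrix
eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)

sorted_index = np.argsort(eigen_values)[::-1]       # sort eigenvalues in descending order
eigen_vectors = eigen_vectors[:, sorted_index]

n_components = 2
eigenvector_subset = eigen_vectors[:, 0:n_components]   # first two principal axes

X_reduced = np.dot(eigenvector_subset.transpose(), X_meaned.transpose()).transpose()

pca_df = pd.DataFrame(X_reduced, columns=['PC1', 'PC2'])
pca_df['target'] = y                                # concatenate labels with reduced data
sns.scatterplot(x='PC1', y='PC2', hue='target', data=pca_df)
plt.show()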
Experiment 6: Understanding Clustering - I
Objective:
Develop an understanding of how to perform k-means on a data set.
Develop an understanding of the use of objective function to select the best possible
value of k in k-means clustering.
Learn how to implement KMeans with PCA
Time Required: 3 hrs
Programming Language: Python
Software Required: Anaconda
Introduction
The technique of segregating datasets into various groups, on the basis of similar features and
characteristics, is called Clustering. The groups so formed are known as Clusters.
Clustering is used as a data analysis technique for discovering interesting patterns in
data, such as groups of customers based on their behavior, and in various fields such as image
recognition and spam filtering. Clustering is an unsupervised learning technique in
machine learning, as it can segregate multivariate data into various groups, without any
supervisor, on the basis of common patterns hidden inside the datasets.
There are many clustering algorithms to choose from and no single best clustering algorithm for
all cases. Instead, it is a good idea to explore a range of clustering algorithms and different
configurations for each algorithm. In this lab we will be learning and understanding the
implementation of k-means clustering algorithm.
Task 1:
You have to solve the customer segmentation problem by using KMeans clustering and the
dataset “Mall_Customers.csv”.
Steps to follow:
1. Import the important libraries
2. Load and view the dataset
3. Apply feature scaling using MinMaxScaler. MinMaxScaler() is a data normalization
technique in machine learning that scales and transforms the features of a dataset to have
values between 0 and 1. This normalization method is used to ensure that all features are
on a similar scale.
You can use the following code:
#Feature Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scale = scaler.fit_transform(df[['Annual Income (k$)','Spending Score (1-100)']])
df_scale = pd.DataFrame(scale, columns = ['Annual Income (k$)','Spending Score (1-
100)']);
df_scale.head(5)
4. Apply KMeans with 2 clusters
#Applying KMeans
from sklearn.cluster import KMeans
import sklearn.cluster as cluster
km=KMeans(n_clusters=2)
y_predicted = km.fit_predict(df[['Annual Income (k$)','Spending Score (1-100)']])
y_predicted
5. Find the centroid of the two clusters by using the attribute ‘cluster_centers_’ as
shown below:
#Find the centroid
km.cluster_centers_
6. Visualize the results by using the scatterplot from seaborn library
#Visualize Results
df['Clusters'] = km.labels_
sns.scatterplot(x="Spending Score (1-100)", y="Annual Income (k$)",hue = 'Clusters',
data=df,palette='viridis')
Task 3:
Apply KMeans clustering after reducing the dimensionality of dataset into two components. You
can use the following code for applying PCA:
#Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df_scale)
pca_df = pd.DataFrame(data = principalComponents
, columns = ['principal component 1', 'principal component 2'])
pca_df.head()
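A minimal sketch of then applying KMeans to the two principal components obtained above (it assumes the imports from the earlier steps, i.e. KMeans and seaborn as sns; the number of clusters is illustrative):
#Applying KMeans on the PCA-reduced data
kmeans_pca = KMeans(n_clusters=2)
pca_df['Clusters'] = kmeans_pca.fit_predict(pca_df[['principal component 1', 'principal component 2']])
sns.scatterplot(x='principal component 1', y='principal component 2', hue='Clusters',
data=pca_df, palette='viridis')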
Experiment 7: Understanding Clustering - II
Introduction
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups
similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is
distinct from each other cluster, and the objects within each cluster are broadly similar to each
other. Hierarchical clustering is further divided into two types: Agglomerative and Divisive. In
this lab, we'll be applying agglomerative clustering, in which objects are grouped into clusters
based on their similarity. The algorithm starts by treating each object as a singleton cluster. Next,
pairs of clusters are successively merged until all clusters have been merged into one big cluster
containing all objects. The result is a tree-based representation of the objects,
called a dendrogram.
Task:
You have to solve the wholesale customer segmentation problem using hierarchical clustering.
You can download the dataset using this link. The data is hosted on the UCI Machine Learning
repository. The aim of this problem is to segment the clients of a wholesale distributor based on
their annual spending on diverse product categories, like milk, grocery, region, etc.
Steps to follow:
1. Import the important libraries
2. Load and view the dataset
3. Normalize the data so that the scale of each variable is the same. If the scale of the variables
is not the same, the model might become biased towards the variables with a higher
magnitude like Fresh or Milk. To normalize the data, you can use the following code:
#Normalize data
from sklearn.preprocessing import normalize
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
data_scaled.head()
4. Draw the dendrogram to help you decide the number of clusters for this particular problem.
You can use the following code for it:
#Draw dendrogram
import scipy.cluster.hierarchy as sch
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = sch.dendrogram(sch.linkage(data_scaled, method='ward'))
After drawing the dendrogram, you will see that the x-axis contains the samples and the y-axis
represents the distance between these samples. The vertical line with the maximum distance is the
blue line, so you can decide on a threshold of 6 and cut the dendrogram with the following
code:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = sch.dendrogram(sch.linkage(data_scaled, method='ward'))
plt.axhline(y=6, color='r', linestyle='--')
After running the above code, you will get two clusters as this line cuts the dendrogram at two
points.
5. Apply hierarchical clustering for 2 clusters. You can use the following code for it
#Apply hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(data_scaled)
After executing the above code, you will see the values of 0s and 1s in the output since you
defined 2 clusters. 0 represents the points that belong to the first cluster and 1 represents points
in the second cluster.
6. Plot the clusters to visualize them by using following code:
#Plotting clusters
plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=cluster.labels_)
Experiment 8: Association Rule Analysis using Python
Objective : To implement association rule analysis using Python.
Time Required : 3 hrs
Programming Language : Python
Software Required : Anaconda
Introduction
Association Rule Analysis finds interesting associations and relationships among large sets of
data items. An association rule shows how frequently an itemset occurs in a transaction. A typical example
is Market Basket Analysis, one of the key techniques used by large
retailers to uncover associations between items. It allows retailers to identify relationships between
the items that people frequently buy together.
Apriori Algorithm:
The Apriori algorithm is used for finding frequent item-sets in a dataset for Boolean association
rule. Name of the algorithm is Apriori because it uses prior knowledge of frequent itemset
properties. We apply an iterative approach or level-wise search where k-frequent item-sets are
used to find k+1 item-sets. To improve the efficiency of level-wise generation of frequent item-
sets, an important property is used called Apriori property which helps by reducing the search
space. Walmart especially has made great use of the algorithm in suggesting products to its
users.
The output of the apriori algorithm is the generation of association rules. This can be done by
using some measures called support, confidence, and lift. Now let’s understand each term.
Support: It is calculated by dividing the number of transactions containing the item by the total
number of transactions: Support(A) = (transactions containing A) / (total transactions).
Confidence: It is a measure of trustworthiness, i.e., how often the rule is found to be true:
Confidence(A => B) = Support(A and B) / Support(A).
Lift: It measures how much more likely B is purchased when A is purchased, compared with B's
overall purchase rate: Lift(A => B) = Confidence(A => B) / Support(B). A lift value greater than 1
indicates a positive association.
Dataset
Load the data using the following link https://www.kaggle.com/datasets/mrmining/online-
retail
Lab Tasks
Write a Python program that accomplishes the following:
1. Load the transaction data from the 'Online Retail.xlsx' file into a pandas DataFrame.
2. Preprocess the data by removing extra spaces in the 'Description' column, dropping rows
without invoice numbers, and filtering out credit transactions.
3. Create separate transaction baskets for each country of interest (France, United Kingdom,
Portugal, and Sweden) by grouping the data based on 'Country', 'InvoiceNo', and
'Description' columns. Calculate the sum of 'Quantity' for each unique combination of
'InvoiceNo' and 'Description'. Reshape the resulting DataFrame to have 'InvoiceNo' as the
index and each unique 'Description' as a column, representing the quantity of the
corresponding item in the transaction.
4. Apply the Apriori algorithm using the 'apriori()' function from the mlxtend library to find
frequent itemsets that include the 'Cutlery Set' in each country. Set the minimum support
threshold to 0.05.
5. Generate association rules from the frequent itemsets using the 'association_rules()'
function, considering a minimum lift threshold of 1.
6. Sort the association rules based on confidence and lift values in descending order.
7. Extract and analyze the top association rules that involve the 'Cutlery Set' for each
country.
8. Interpret the rules to identify patterns of the 'Cutlery Set' being purchased with other
items in different countries. Look for high-confidence rules with significant lift values,
which indicate strong associations between the 'Cutlery Set' and other items.
9. Print the top association rules for each country, including the antecedent (items
commonly purchased before the 'Cutlery Set') and consequent (items commonly
purchased after the 'Cutlery Set') of each rule, along with their confidence and lift values.
10. Provide insightful interpretations of the association rule patterns in each country,
highlighting any interesting and meaningful findings related to the 'Cutlery Set' item.
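A minimal sketch of steps 1–6 for one country (France), assuming the mlxtend library is installed and that the column names match the standard Online Retail file (InvoiceNo, Description, Quantity, Country); the remaining countries and the 'Cutlery Set' filtering follow the same pattern:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

#Steps 1-2: load and preprocess
df = pd.read_excel('Online Retail.xlsx')
df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]        # drop credit transactions

#Step 3: transaction basket for France
basket = (df[df['Country'] == 'France']
          .groupby(['InvoiceNo', 'Description'])['Quantity'].sum()
          .unstack().fillna(0))
basket = basket.applymap(lambda x: 1 if x > 0 else 0)   # encode quantities as 0/1

#Steps 4-6: frequent itemsets and association rules
frequent_itemsets = apriori(basket, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
rules = rules.sort_values(['confidence', 'lift'], ascending=False)
print(rules.head())
# rules involving the 'Cutlery Set' can then be filtered from the antecedents/consequents columns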
Experiment 9: Understanding Classification using KNN
Objective: To develop an understanding of how classifier model is trained and tested on a data
set.
Time Required: 3 hrs
Programming Language: Python
Software Required: Anaconda
Introduction
Classification is a data mining function that assigns items in a collection to target categories or
classes. The goal of classification is to accurately predict the target class for each case in the
data. For example, a classification model could be used to identify loan applicants as low,
medium, or high credit risks. Some of the basic classification algorithms are: Logistic
Regression, Naive Bayes Classifier, Nearest Neighbor, Support Vector Machines etc. In this lab
we will be exploring KNN (K Nearest Neighbor) classification algorithm to understand the
training and testing of a classifier model.
The k-nearest neighbors (KNN) algorithm is a data classification method for estimating the
likelihood that a data point belongs to one group or another, based on the groups of the data points
nearest to it. The k-nearest neighbor algorithm is a type of supervised
machine learning algorithm used to solve classification and regression problems. However, it's
mainly used for classification problems.
Note: Don't confuse K-NN classification with K-means clustering. KNN is a supervised
classification algorithm that classifies new data points based on the nearest data points. On the
other hand, K-means clustering is an unsupervised clustering algorithm that groups data into a K
number of clusters.
Task 1: You must predict whether a person will have diabetes or not using KNN classifier.
Steps to follow:
1. Load and view the provided dataset ‘diabetes.csv’.
2. Import all the important libraries.
3. Perform data cleaning by replacing empty values with the mean of respective column so
that it won’t affect the outcome.
4. Split the independent variables (features) and the dependent variable (label) of the dataset
into X and Y
5. Split data into training set and test set
6. Perform feature scaling on the training and test sets of features to bring all values onto a
comparable scale, using the following code:
#Feature scaling
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
7. Define the K Nearest Neighbor model with the training set by using the following code:
# Define the model
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=11, p=2, metric='euclidean')
8. Fit your defined model and predict the test results
9. Evaluate the model using the confusion matrix, f1_score and accuracy score by
comparing the predicted and actual test values. To compute the confusion matrix, use the
following code:
#Finding confusion matrix and f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
cm = confusion_matrix(y_test, y_pred)
print (cm)
print (f1_score(y_test, y_pred))
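For the steps not covered by the snippets above (1–5 and 8), a minimal sketch, assuming the file 'diabetes.csv' stores the label in a column named 'Outcome':
import pandas as pd
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('diabetes.csv')
dataset.fillna(dataset.mean(), inplace=True)        # replace empty values with column means

X = dataset.drop('Outcome', axis=1)                 # features
y = dataset['Outcome']                              # label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# ... apply the feature scaling and model definition given above, then:
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)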
Task 2: Using the above implemented code, vary the model by using cosine similarity measure
instead of Euclidean and determine which one is producing better values in terms of accuracy
and f1_score.
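For Task 2, a minimal sketch of the cosine-based variant (scikit-learn's KNeighborsClassifier accepts metric='cosine' when the brute-force search algorithm is used):
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=11, metric='cosine', algorithm='brute')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)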
Experiment 10: Linear Regression
Introduction
Linear regression is a widely used supervised learning algorithm in the field of machine learning
and statistics. It is primarily used for predicting continuous numeric values based on input
features. The goal of linear regression is to model the relationship between the input variables
(also known as independent variables or features) and the continuous target variable (also known
as the dependent variable) by fitting a linear equation to the data.
In linear regression, the relationship between the input features and the target variable is assumed
to be linear. The algorithm estimates the coefficients of the linear equation that best fits the given
data, allowing us to make predictions on new data points.
The coefficients (weights) are estimated during the training process using a method called
Ordinary Least Squares (OLS) or a variant such as Ridge Regression or Lasso Regression. The
objective is to minimize the sum of squared differences between the predicted values and the
actual target values in the training data.
During the training phase, the algorithm adjusts the coefficients to find the line that best fits the
training data. Once trained, the model can make predictions by simply plugging in the values of
the input features into the linear equation.
Linear regression can be used for various tasks such as predicting housing prices, stock market
trends, sales forecasts, and many more. It serves as a foundation for more advanced regression
techniques and can be extended to handle more complex relationships through feature
engineering and incorporating non-linear transformations.
TASK:
Apply Linear Regression on the Advertising data set, which contains advertising expenditures on TV,
radio, and newspaper, and the corresponding sales figures.
Steps to follow:
a) Download the Advertising data set from the following webpage:
https://www.kaggle.com/datasets/thorgodofthunder/tvradionewspaperadvertising
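A minimal sketch, assuming the downloaded file is named 'Advertising.csv' with columns 'TV', 'Radio', 'Newspaper', and 'Sales' (adjust the names to match the actual file):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv('Advertising.csv')
X = data[['TV', 'Radio', 'Newspaper']]      # advertising expenditures (features)
y = data['Sales']                           # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)                 # estimate coefficients via ordinary least squares
y_pred = model.predict(X_test)

print('Coefficients:', model.coef_)
print('MSE:', mean_squared_error(y_test, y_pred))
print('R^2:', r2_score(y_test, y_pred))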
The objective is to utilize clustering techniques to identify clusters of vehicles that possess
unique characteristics. This analysis will provide an overview of the current market of vehicles
and aid manufacturers in deciding on the development of new models based on the identified
distinct clusters.
You can download the dataset from the link given below:
https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/
labs/cars_clus.csv
Build your own pipeline and justify it. Also show the implementation and results of your solution
through code.