
College Code: 7Q

BRILLIANT GRAMMAR SCHOOL EDUCATIONAL SOCIETY’S


GROUP OF INSTITUTIONS-INTEGRATED CAMPUS
(Approved by A.I.C.T.E & P.C.I, New Delhi, Affiliated to JNTUH, Hyderabad)
Abdullapur (V), Abdullapurmet (M), R.R Dt. Hyderabad – 501505
website: www.bgiic.ac.in, E-mail : [email protected],[email protected] Cell:9442263457

COMPUTER SCIENCE AND DESIGN

DATA MINING
LAB MANUAL
Regulation: R18 / JNTUH
Academic year: 2023-2024
B.TECH III YEAR I SEM(CSD)
BRILLIANT GRAMMAR SCHOOL EDUCATIONAL SOCIETY’S
GROUP OF INSTITUTIONS – INTEGRATED CAMPUS

DEPARTMENT OF COMPUTER SCIENCE AND DESIGN



DEPARTMENT
OF
COMPUTER SCIENCE AND DESIGN

DATA MINING LAB


LAB MANUAL

Brilliant Grammar School Educational Society's Group of Institutions - Integrated Campus

Abdullapur (V), Hayathnagar (M), Hyderabad, R.R. Dist - 501505


LIST OF EXPERIMENTS:

Experiments using Weka & Pentaho Tools

1. Data Processing Techniques: (i) Data cleaning (ii) Data transformation – Normalization

(iii) Data integration

2. Partitioning - Horizontal, Vertical, Round Robin, Hash based

3. Data Warehouse schemas – star, snowflake, fact constellation

4. Data cube construction – OLAP operations

5. Data Extraction, Transformations & Loading operations

6. Implementation of Attribute oriented induction algorithm

7. Implementation of apriori algorithm

8. Implementation of FP – Growth algorithm

9. Implementation of Decision Tree Induction

10. Calculating Information gain measures

11. Classification of data using Bayesian approach

12. Classification of data using K – nearest neighbour approach

13. Implementation of K – means algorithm

14. Implementation of BIRCH algorithm

15. Implementation of PAM algorithm

16. Implementation of DBSCAN algorithm


1. Data Processing Techniques: (i) Data cleaning (ii) Data transformation – Normalization

(iii) Data integration

Data Cleaning:

Data in the real world is frequently incomplete, noisy, and inconsistent. Many parts of the data may be
irrelevant or missing. Data cleaning is carried out to handle this aspect. Data cleaning methods aim to fill
in missing values, smooth out noise while identifying outliers, and fix data discrepancies. Unclean data
can confuse the model and degrade its results. Therefore, running the data through various data
cleaning/cleansing methods is an important data preprocessing step.

(a) Missing Data:

It is fairly common for a dataset to contain missing values. They may arise during data collection or
as a result of a data validation rule, but they must be handled either way.

1. Dropping rows/columns: If an entire row consists of NaN values, it adds no information and can be
dropped immediately. The same applies when most of a row or column (say, more than 65%) is missing.


2. Checking for duplicates: If the same row or column is repeated, drop the duplicates and keep only
the first instance, so that the repeated data object does not receive extra weight or bias when
machine learning algorithms are run.

3. Estimate missing values: If only a small percentage of the values are missing, basic interpolation
methods can be used to fill in the gaps. However, the most typical approach to dealing with missing
data is to fill the gaps with the feature's mean, median, or mode value.

(b) Noisy Data:

Noisy data is meaningless data that machines cannot interpret. It can be caused by poor data collection,
data entry problems, and so on. It can be dealt with in the following ways:

1. Binning Method: This method smooths sorted data. The data is divided into equal-sized segments and
each segment is handled independently: all values in a segment can be replaced by the segment mean,
or the boundary values can be used instead.

2. Clustering: In this method, related data is grouped into clusters. Outliers may fall outside the
clusters, or they may go unnoticed.

3. Regression: Data can be smoothed by fitting it to a regression function. The regression model
employed may be linear (a single independent variable) or multiple (several independent variables).


Data Integration

Data integration is the part of a data analysis task that combines data from multiple sources into a
coherent data store. These sources may include multiple databases. How can the data be matched up?
For example, a data analyst may find Customer_ID in one database and cust_id in another; how can he
be sure that these two refer to the same entity? Databases and data warehouses carry metadata (data
about the data), which helps avoid such errors during integration.

Data Normalization

Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or [0.0,
1.0]. There are different methods to normalize the data, as discussed below.

Consider that we have a numeric attribute A and n observed values for attribute A: V1, V2, V3, ..., Vn.

o Min-max normalization: This method implements a linear transformation on the original data.
Let minA and maxA be the minimum and maximum values observed for attribute A, and let Vi be the
value of attribute A that has to be normalized. Min-max normalization maps Vi to V'i in a new,
smaller range [new_minA, new_maxA]. The formula for min-max normalization is:

V'i = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

For example, suppose $12,000 and $98,000 are the minimum and maximum values for the attribute
income, and [0.0, 1.0] is the range into which we have to map the value $73,600.


The value $73,600 would be transformed using min-max normalization as follows:

V' = ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0.0) + 0.0 = 0.716

o Z-score normalization: This method normalizes the value for attribute A using
the mean and standard deviation. The following formula is used for Z-score normalization:

V'i = (Vi - Ā) / σA

Here Ā and σA are the mean and standard deviation for attribute A, respectively.
For example, suppose the mean and standard deviation for attribute A are $54,000 and $16,000, and we
have to normalize the value $73,600 using z-score normalization: (73,600 - 54,000) / 16,000 = 1.225.

o Decimal Scaling: This method normalizes the value of attribute A by moving the decimal point
in the value. This movement of the decimal point depends on the maximum absolute value of A.
The formula for decimal scaling is:

V'i = Vi / 10^j

Here j is the smallest integer such that max(|V'i|) < 1.


For example, the observed values for attribute A range from -986 to 917, and the maximum
absolute value for attribute A is 986. Here, to normalize each value of attribute A using decimal
scaling, we have to divide each value of attribute A by 1000, i.e., j=3.
So, the value -986 would be normalized to -0.986, and 917 would be normalized to 0.917.
The normalization parameters, such as the mean, standard deviation, and maximum absolute value,
must be preserved so that future data can be normalized uniformly.
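
As a quick cross-check of the three methods above, here is a minimal Python sketch; only numpy is
assumed, and the values are simply the ones used in the running examples:

# Minimal sketch of the three normalization methods discussed above,
# applied to the running income example (min = 12,000, max = 98,000, v = 73,600).
import numpy as np

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # V' = (V - min_A) / (max_A - min_A) * (new_max - new_min) + new_min
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # V' = (V - mean_A) / std_A
    return (v - mean_a) / std_a

def decimal_scaling(values):
    # V' = V / 10^j, where j is the smallest integer such that max(|V'|) < 1
    values = np.asarray(values, dtype=float)
    j = 0
    while np.max(np.abs(values)) / (10 ** j) >= 1:
        j += 1
    return values / (10 ** j)

print(min_max(73600, 12000, 98000))        # ~0.716
print(z_score(73600, 54000, 16000))        # ~1.225
print(decimal_scaling([-986, 917]))        # [-0.986  0.917]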

2. Partitioning - Horizontal, Vertical, Round Robin, Hash based

Partitioning is done to enhance performance and facilitate easy management of data.


Partitioning also helps in balancing the various requirements of the system. It optimizes the
hardware performance and simplifies the management of the data warehouse by partitioning
each fact table into multiple separate partitions.
Horizontal Partitioning

There are various ways in which a fact table can be partitioned. In horizontal partitioning,
we have to keep in mind the requirements for manageability of the data warehouse.


Partitioning by Time into Equal Segments


In this partitioning strategy, the fact table is partitioned on the basis of time period. Here
each time period represents a significant retention period within the business. For example,
if the user queries for month to date data then it is appropriate to partition the data into
monthly segments. We can reuse the partitioned tables by removing the data in them.

Partition by Time into Different-sized Segments


This kind of partitioning is done where the aged data is accessed infrequently. It is implemented
as a set of small partitions for relatively current data and a larger partition for inactive data.

Vertical Partitioning

Vertical partitioning splits the data vertically, i.e., column-wise rather than row-wise.


Vertical partitioning can be performed in the following two ways −

 Normalization
 Row Splitting

Normalization
Normalization is the standard relational method of database organization. In this method,
duplicate rows are collapsed into a single row, which reduces space. Take a look at the following
tables that show how normalization is performed.

Row Splitting
Row splitting tends to leave a one-to-one map between the partitions. The motive of row
splitting is to speed up access to a large table by reducing its size.

Hash Partitioning

Hash partitioning maps data to partitions based on a hashing algorithm that Oracle
applies to a partitioning key that you identify. The hashing algorithm evenly distributes
rows among partitions, giving partitions approximately the same size. Hash partitioning
is the ideal method for distributing data evenly across devices. Hash partitioning is also
an easy-to-use alternative to range partitioning, especially when the data to be partitioned
is not historical.


Oracle Database uses a linear hashing algorithm; to prevent data from clustering within
specific partitions, you should define the number of partitions as a power of two
(for example, 2, 4, 8).

The following statement creates a table sales_hash, which is hash partitioned on


the salesman_id field:

CREATE TABLE sales_hash


(salesman_id NUMBER(5),
salesman_name VARCHAR2(30),
sales_amount NUMBER(10),
week_no NUMBER(2))
PARTITION BY HASH(salesman_id)
PARTITIONS 4;

Round-robin partitioning: the simplest strategy, it ensures uniform data distribution. With n partitions,
the ith tuple in insertion order is assigned to partition (i mod n). This strategy enables the sequential
access to a relation to be done in parallel. However, the direct access to individual tuples, based on a
predicate, requires accessing the entire relation.
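
To make the difference between round-robin and hash-based placement concrete, here is a small
illustrative Python sketch; the sample rows and the choice of four partitions are assumptions made
for demonstration only and are not tied to any particular warehouse product:

# Illustrative sketch: assigning rows to 4 partitions by round-robin vs. by hashing a key.
from zlib import crc32

NUM_PARTITIONS = 4
rows = [(101, 'Asha'), (102, 'Ravi'), (103, 'Kiran'), (104, 'Meena'), (105, 'John')]

# Round-robin: the i-th tuple (in insertion order) goes to partition i mod n.
round_robin = {i: [] for i in range(NUM_PARTITIONS)}
for i, row in enumerate(rows):
    round_robin[i % NUM_PARTITIONS].append(row)

# Hash-based: a hash of the partitioning key (the id column here) decides the partition,
# so the same key value always lands in the same partition.
hashed = {i: [] for i in range(NUM_PARTITIONS)}
for row in rows:
    key = str(row[0]).encode()
    hashed[crc32(key) % NUM_PARTITIONS].append(row)

print("Round robin:", round_robin)
print("Hash based :", hashed)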

3. Data Warehouse schemas – star, snowflake, fact constellation

Star Schema
 Each dimension in a star schema is represented with only one-dimension table.
 This dimension table contains the set of attributes.
 The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.


 There is a fact table at the center. It contains the keys to each of the four dimensions.
 The fact table also contains the attributes, namely dollars sold and units sold.
 Star Schema Definition
 The star schema that we have discussed can be defined using Data Mining Query
Language (DMQL) as follows −
 define cube sales star [time, item, branch, location]:

 dollars sold = sum(sales in dollars), units sold = count(*)

 define dimension time as (time key, day, day of week, month, quarter, year)
 define dimension item as (item key, item name, brand, type, supplier type)
 define dimension branch as (branch key, branch name, branch type)
 define dimension location as (location key, street, city, province or state, country)
Snowflake Schema
 Some dimension tables in the Snowflake schema are normalized.
 The normalization splits up the data into additional tables.
 Unlike the star schema, the dimension tables in a snowflake schema are normalized. For
example, the item dimension table in the star schema is normalized and split into two
dimension tables, namely item and supplier.

 Now the item dimension table contains the attributes item_key, item_name, type,
brand, and supplier-key.
 The supplier key is linked to the supplier dimension table. The supplier dimension
table contains the attributes supplier_key and supplier_type.
 Snowflake Schema Definition
 Snowflake schema can be defined using DMQL as follows −
 define cube sales snowflake [time, item, branch, location]:


 dollars sold = sum(sales in dollars), units sold = count(*)



 define dimension time as (time key, day, day of week, month, quarter, year)
 define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier
type))
 define dimension branch as (branch key, branch name, branch type)
 define dimension location as (location key, street, city (city key, city, province or state,
country))
Fact Constellation Schema
 A fact constellation has multiple fact tables. It is also known as galaxy schema.
 The following diagram shows two fact tables, namely sales and shipping.

 The sales fact table is the same as that in the star schema.

 The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, and to_location.
 The shipping fact table also contains two measures, namely dollars cost and units
shipped.
 It is also possible to share dimension tables between fact tables. For example, time,
item, and location dimension tables are shared between the sales and shipping fact
table.
 Fact Constellation Schema Definition
 Fact constellation schema can be defined using DMQL as follows −
 define cube sales [time, item, branch, location]:

 dollars sold = sum(sales in dollars), units sold = count(*)

 define dimension time as (time key, day, day of week, month, quarter, year)
 define dimension item as (item key, item name, brand, type, supplier type)
 define dimension branch as (branch key, branch name, branch type)
 define dimension location as (location key, street, city, province or state,country)

 define cube shipping [time, item, shipper, from location, to location]:

 dollars cost = sum(cost in dollars), units shipped = count(*)

 define dimension time as time in cube sales


 define dimension item as item in cube sales
 define dimension shipper as (shipper key, shipper name, location as location in cube sales,
shipper type)
 define dimension from location as location in cube sales
 define dimension to location as location in cube sales

4. Data cube construction – OLAP operations


An OLAP cube is a term that typically refers to a multi-dimensional array of data. OLAP
is an acronym for online analytical processing,[1] which is a computer-based technique
of analyzing data to look for insights. The term cube here refers to a multi-dimensional
dataset, which is also sometimes called a hypercube if the number of dimensions is
greater than three.

Operations:

1. Slice is the act of picking a rectangular subset of a cube by choosing a single value
for one of its dimensions, creating a new cube with one fewer dimension.[4] The
picture shows a slicing operation: the sales figures of all sales regions and all product
categories of the company in the years 2005 and 2006 are "sliced" out of the data cube.

2. Dice: The dice operation produces a subcube by allowing the analyst to pick specific
values of multiple dimensions.[5] The picture shows a dicing operation: the new cube
shows the sales figures of a limited number of product categories, while the time and region
dimensions cover the same range as before.

3. Drill Down/Up allows the user to navigate among levels of data ranging from the
most summarized (up) to the most detailed (down).[4] The picture shows a drill-down
operation: the analyst moves from the summary category "outdoor protective equipment"
to see the sales figures for the individual products.

4. Roll-up: A roll-up involves summarizing the data along a dimension. The
summarization rule might be computing totals along a hierarchy or applying a set of
formulas such as "profit = sales - expenses".

5. Pivot allows an analyst to rotate the cube in space to see its various faces. For
example, cities could be arranged vertically and products horizontally while viewing
data for a particular quarter. Pivoting could replace products with time periods to see
data across time for a single product.
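
The cube operations above can be imitated on a small pandas DataFrame; the toy sales table below is
an assumption made purely for illustration, and pandas only approximates what a real OLAP engine does:

# Rough sketch of slice, dice, roll-up and pivot on a toy sales table using pandas.
import pandas as pd

sales = pd.DataFrame({
    'year':     [2005, 2005, 2006, 2006, 2006],
    'region':   ['North', 'South', 'North', 'South', 'North'],
    'category': ['Phones', 'Phones', 'Laptops', 'Phones', 'Laptops'],
    'amount':   [100, 150, 200, 120, 180],
})

# Slice: fix a single value on one dimension (year = 2006).
slice_2006 = sales[sales['year'] == 2006]

# Dice: pick specific values on several dimensions at once.
dice = sales[(sales['year'].isin([2005, 2006])) & (sales['category'] == 'Phones')]

# Roll-up: summarize along a dimension (total amount per region).
rollup = sales.groupby('region')['amount'].sum()

# Pivot: rotate the cube to view region vs. year totals.
pivot = sales.pivot_table(values='amount', index='region', columns='year', aggfunc='sum')

print(slice_2006, dice, rollup, pivot, sep='\n\n')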


6. Implementation of Attribute oriented induction algorithm


Aim: To perform the implementation of the attribute-oriented induction algorithm.
Resources: Weka
Theory: AOI stands for Attribute-Oriented Induction. The attribute-oriented induction approach to
concept description was first proposed in 1989, a few years before the introduction of the data
cube approach. The data cube approach is essentially based on materialized views of the data,
which typically have been pre-computed in a data warehouse. In general, it performs off-line
aggregation before an OLAP or data mining query is submitted for processing. In contrast, the
attribute-oriented induction approach is generally a query-oriented, generalization-based, on-line
data analysis method. The general idea of attribute-oriented induction is to first collect the
task-relevant data using a database query and then perform generalization based on the examination
of the number of distinct values of each attribute in the relevant collection of data.
The generalization is implemented by attribute removal or attribute generalization.
Aggregation is implemented by merging identical generalized tuples and accumulating their counts.
This decreases the size of the generalized data set. The resulting generalized relation can be
mapped into several forms for presentation to the user, including charts or rules.
Algorithm: The process of attribute-oriented induction is as follows −
• First, data focusing must be performed before attribute-oriented induction. This step
corresponds to the specification of the task-relevant data (i.e., the data for analysis).
The data are collected based on the information provided in the data mining query.
• Because a data mining query is usually relevant to only a portion of the database,
selecting the relevant set of data not only makes mining more efficient, but also
yields more significant results than mining the whole database.
• Specifying the set of relevant attributes (i.e., the attributes for mining, as
indicated in DMQL with the "in relevance to" clause) may be difficult for the user. A user
may choose only a few attributes that he or she thinks are important, while missing others that
could also play a role in the description.
• For example, suppose that the dimension birth_place is defined by the attributes city,
province or state, and country. To allow generalization on the birth_place
dimension, the other attributes defining this dimension should also be included.
• In other words, having the system automatically include province or state and
country as relevant attributes enables city to be generalized to these higher conceptual
levels during the induction phase.
• At the other extreme, the user may have introduced too many attributes
by specifying all of the possible attributes with the clause "in relevance to *". In this
case, all of the attributes in the relation specified by the from clause would be included in
the analysis.
• Some attributes are unlikely to contribute to an interesting description. A
correlation-based or entropy-based analysis method can be used to perform attribute
relevance analysis and filter out statistically irrelevant or weakly relevant attributes
from the descriptive mining process.

Procedure:
Step 1: Open the Weka Explorer.
Step 2: Load the data set.
Step 3: Choose the "Select attributes" tab, choose the CfsSubsetEval evaluator and a search
method (the run below used GreedyStepwise), and click the Start button.

Output: === Run information ===
Evaluator: weka.attributeSelection.CfsSubsetEval -P 1 -E 1
Search: weka.attributeSelection.GreedyStepwise -T -1.7976931348623157E308 -N -1 -num-slots 1
Relation: breast-cancer
Instances: 286
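
Independently of the Weka run above, the core generalization and aggregation steps of
attribute-oriented induction can be sketched in a few lines of Python. The toy tuples and the
one-level city-to-country concept hierarchy below are illustrative assumptions, not part of the
Weka output:

# Minimal sketch of AOI generalization: replace low-level city values by a higher
# concept (country), then merge identical generalized tuples and accumulate counts.
from collections import Counter

# Task-relevant tuples: (city, degree). Both columns are made up for illustration.
tuples = [('Hyderabad', 'MSc'), ('Chennai', 'MSc'), ('Toronto', 'PhD'),
          ('Vancouver', 'PhD'), ('Hyderabad', 'PhD')]

# One-level concept hierarchy for the city attribute (assumed).
city_to_country = {'Hyderabad': 'India', 'Chennai': 'India',
                   'Toronto': 'Canada', 'Vancouver': 'Canada'}

# Attribute generalization: climb the hierarchy because 'city' has many distinct values.
generalized = [(city_to_country[city], degree) for city, degree in tuples]

# Aggregation: identical generalized tuples are merged and their counts accumulated.
prime_relation = Counter(generalized)
for tup, count in prime_relation.items():
    print(tup, 'count =', count)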

7. Implementation of apriori algorithm


This experiment demonstrates the basic elements of association rule mining using WEKA. The sample
dataset used for this example is contactlenses.arff.

Step 1: Open the data file in the Weka Explorer. It is presumed that the required data fields have
been discretized; in this example it is the age attribute.

Step 2: Clicking on the Associate tab will bring up the interface for the association rule algorithms.

Step 3: We will use the Apriori algorithm. This is the default algorithm.

Step 4: In order to change the parameters for the run (e.g., support, confidence, etc.) we
click on the text box immediately to the right of the Choose button.

Dataset contactlenses.arff


The following screenshot shows the association rules that were generated when the Apriori
algorithm is applied to the given dataset.

13. Implementation of K – means algorithm


This experiment demonstrates the use of simple k-means clustering with the Weka Explorer. The
sample data set used for this example is based on the iris data available in ARFF format. It is
assumed that appropriate preprocessing has been performed. The iris dataset includes 150 instances.

Steps involved in this experiment:

Step 1: Run the Weka Explorer and load the data file iris.arff in the preprocessing interface.

Step 2: In order to perform clustering, select the 'Cluster' tab in the Explorer and click on the
Choose button. This step results in a dropdown list of available clustering algorithms.

Step 3: In this case we select 'SimpleKMeans'.

Step 4: Next, click on the text box to the right of the Choose button to get the popup window shown
in the screenshots. In this window we enter six as the number of clusters and leave the seed value
as it is. The seed value is used to generate a random number, which is used for making the internal
assignments of instances to clusters.

Step 5: Once the options have been specified, we run the clustering algorithm. In the 'Cluster
mode' panel we make sure that the 'Use training set' option is selected, and then we click the
Start button. This process and the resulting window are shown in the following screenshots.

Step 6: The result window shows the centroid of each cluster as well as statistics on the number
and percentage of instances assigned to the different clusters. The cluster centroids are the mean
vectors of each cluster and can be used to characterize the clusters. For example, the centroid of
cluster 1 (class Iris-versicolor) shows a mean sepal length of 5.4706, sepal width of 2.4765, petal
length of 3.7941, and petal width of 1.1294.

Step 7: Another way of understanding the characteristics of each cluster is through visualization:
right-click the result set in the result list panel and select 'Visualize cluster assignments'.

The following screenshot shows the clustering results that were generated when the simple k-means
algorithm is applied to the given dataset.


7. Implementation of apriori algorithm

The Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean
association rules. It uses a "bottom-up" approach, where frequent subsets are extended one item at a
time (a step known as candidate generation), and groups of candidates are tested against the data.

 Problem:

TID   ITEMS
100   1,3,4
200   2,3,5
300   1,2,3,5
400   2,5

Find the frequent item sets for the above transactions with a minimum support of 2 and a
confidence measure of 70% (i.e., 0.7).

Procedure:
Step 1:
Count the number of transactions in which each item occurs.

ITEM   NO. OF TRANSACTIONS
1      2
2      3
3      3
4      1
5      3

Step 2:
Eliminate all those items whose number of transactions is less than the minimum support
(2 in this case).


ITEM   NO. OF TRANSACTIONS
1      2
2      3
3      3
5      3

These are the single items that are bought frequently. Now let's say we want to find pairs of
items that are bought frequently together. We continue from the above table (the table in Step 2).

Step 3:
We start making pairs from the first item, like (1,2), (1,3), (1,5), and then from the second item,
like (2,3), (2,5). We do not consider (2,1) because we already formed (1,2) when making pairs with 1,
and buying 1 and 2 together is the same as buying 2 and 1 together. After making all the pairs we
get:

ITEM PAIRS

1,2
1,3
1,5
2,3
2,5
3,5
Step 4:
Now, we count how many times each pair is bought together.

ITEM PAIRS   NO. OF TRANSACTIONS
1,2          1
1,3          2
1,5          1
2,3          2
2,5          3
3,5          2


Step 5:
Again, remove all item pairs whose number of transactions is less than 2.

ITEM PAIRS   NO. OF TRANSACTIONS
1,3          2
2,3          2
2,5          3
3,5          2

These pairs of items are bought frequently together. Now, let's say we want to find a set of
three items that are bought together. We use the above table (from Step 5) to make sets of three
items.

Step 6:
To make sets of three items we need one more rule (termed self-join): from the item pairs in the
above table, we find two pairs with the same first item, so we get (2,3) and (2,5), which join to
give (2,3,5). Then we count how many times (2, 3, 5) are bought together in the original table and
we get the following:

ITEM SET   NO. OF TRANSACTIONS
(2,3,5)    2

Thus, the set of three items that are bought together in this data is (2, 3, 5).

Confidence:
We can take our frequent item set knowledge even further by finding association rules using the
frequent item sets. In simple words, we know (2, 3, 5) are bought together frequently, but what is
the association between them? To do this, we create a list of all subsets of the frequently bought
items (2, 3, 5); in our case we get the following subsets:

 {2}
 {3}
 {5}
 {2,3}
 {3,5}
 {2,5}


Now, we find associations among all the subsets.

{2} => {3,5}: (If '2' is bought, what is the probability that '3' and '5' would be bought in the
same transaction?)
Confidence = P(2 ∩ 3 ∩ 5) / P(2) = 2/3 = 67%
{3} => {2,5} = P(2 ∩ 3 ∩ 5) / P(3) = 2/3 = 67%
{5} => {2,3} = P(2 ∩ 3 ∩ 5) / P(5) = 2/3 = 67%
{2,3} => {5} = P(2 ∩ 3 ∩ 5) / P(2 ∩ 3) = 2/2 = 100%
{3,5} => {2} = P(2 ∩ 3 ∩ 5) / P(3 ∩ 5) = 2/2 = 100%
{2,5} => {3} = P(2 ∩ 3 ∩ 5) / P(2 ∩ 5) = 2/3 = 67%
Also, considering the remaining 2-item sets, we would get the following associations:
{1} => {3} = P(1 ∩ 3) / P(1) = 2/2 = 100%
{3} => {1} = P(1 ∩ 3) / P(3) = 2/3 = 67%
{2} => {3} = P(2 ∩ 3) / P(2) = 2/3 = 67%
{3} => {2} = P(2 ∩ 3) / P(3) = 2/3 = 67%
{2} => {5} = P(2 ∩ 5) / P(2) = 3/3 = 100%
{5} => {2} = P(2 ∩ 5) / P(5) = 3/3 = 100%
{3} => {5} = P(3 ∩ 5) / P(3) = 2/3 = 67%
{5} => {3} = P(3 ∩ 5) / P(5) = 2/3 = 67%
Eliminate all those having confidence less than 70%. Hence, the rules would be:
{2,3} => {5}, {3,5} => {2}, {1} => {3}, {2} => {5}, {5} => {2}.
 Now these manual results should be checked against the rules generated in WEKA.

First, create a CSV file for the above transaction table in a spreadsheet program (e.g., Excel) and
save it in CSV format; this file is then loaded into Weka.
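
Before checking the rules in Weka, the manual computation above can be reproduced with a short
brute-force Python sketch; it enumerates candidate itemsets directly rather than following Apriori's
full candidate-generation and pruning machinery, but on this tiny dataset it yields the same five rules:

# Brute-force check of the worked example: frequent itemsets with support >= 2
# and association rules with confidence >= 70%.
from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_SUPPORT, MIN_CONF = 2, 0.7

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        s = support(set(candidate))
        if s >= MIN_SUPPORT:
            frequent[frozenset(candidate)] = s

print("Frequent itemsets:", {tuple(sorted(k)): v for k, v in frequent.items()})

# Rules A => B with confidence = support(A u B) / support(A).
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            a = frozenset(antecedent)
            conf = s / frequent[a]
            if conf >= MIN_CONF:
                print(sorted(a), "=>", sorted(itemset - a), f"conf={conf:.0%}")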


8. Implementation of FP – Growth algorithm


PROBLEM:
Find all frequent item sets in the following dataset using the FP-growth algorithm, with minimum
support = 2 and confidence = 70%.

TID   ITEMS
100   1,3,4
200   2,3,5
300   1,2,3,5
400   2,5

Solution:
As with the Apriori algorithm, find the frequency of occurrence of each item in the dataset and
then prioritize the items in descending order of their frequency of occurrence. Eliminating the
items whose count is less than the minimum support and assigning the priorities, we obtain the
following table.

ITEM   NO. OF TRANSACTIONS   PRIORITY
1      2                     4
2      3                     1
3      3                     2
5      3                     3

Re-arranging the original table (each transaction rewritten in priority order, with infrequent
items removed), we obtain

TID   ITEMS
100   3,1
200   2,3,5
300   2,3,5,1
400   2,5


Construction of the tree:
Note that every FP-tree has a 'null' node as the root. So, draw the root node first, attach the
items of row 1 one by one, and write their occurrence counts next to them. The tree is further
expanded by adding nodes according to the prefixes formed and by incrementing the counts every
time an item re-occurs, and hence the tree is built.

Prefixes (conditional pattern bases):

 1 -> {3}:1, {2,3,5}:1
 5 -> {2,3}:2, {2}:1
 3 -> {2}:2

Frequent item sets:

 1 -> 3:2 /* 2 and 5 are eliminated because their counts are less than the minimum support, and
the count of 3 is obtained by adding its occurrences in both instances */
 Similarly, 5 -> 2,3:2; 2:3; 3:2
 3 -> 2:2

Therefore, the frequent item sets are {3,1}, {2,3,5}, {2,5}, {2,3}, {3,5}.

The tree is constructed with root (null) and two branches: the branch 2:3 has children 3:2
(followed by 5:2 and then 1:1) and 5:1, and the separate branch 3:1 has child 1:1 (the original
figure showed this FP-tree).


Generating the association rules from the tree above and calculating
the confidence measures, we get:
 {3}=>{1}=2/3=67%
 {1}=>{3}=2/2=100%
 {2}=>{3,5}=2/3=67%

 {2,5}=>{3}=2/3=67%
 {3,5}=>{2}=2/2=100%
 {2,3}=>{5}=2/2=100%
 {3}=>{2,5}=2/3=67%
 {5}=>{2,3}=2/3=67%
 {2}=>{5}=3/3=100%
 {5}=>{2}=3/3=100%
 {2}=>{3}=2/3=67%
 {3}=>{2}=2/3=67%

Thus, eliminating all the rules having confidence less than 70%, we obtain the following
conclusions:
{1}=>{3}, {3,5}=>{2}, {2,3}=>{5}, {2}=>{5}, {5}=>{2}.

As we see, 5 rules are generated manually, and these are to be checked against the results in
WEKA. In order to check the results in the tool we need to follow a procedure similar to that of
Apriori.

First, create a CSV file for the above transaction table in a spreadsheet program (e.g., Excel) and
save it in CSV format; this file is then loaded into Weka.
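
The same itemsets and rules can also be obtained programmatically; the sketch below assumes the
third-party mlxtend library is installed (pip install mlxtend). If it is not available, the
brute-force sketch given in the Apriori section produces the same frequent itemsets:

# Sketch of running FP-Growth on the same four transactions, assuming the
# third-party mlxtend library is available.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [['1', '3', '4'], ['2', '3', '5'], ['1', '2', '3', '5'], ['2', '5']]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Minimum support of 2 out of 4 transactions = 0.5.
frequent = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent)

# Keep only rules whose confidence is at least 70%.
rules = association_rules(frequent, metric='confidence', min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])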

9. Implementation of Decision Tree Induction

Decision tree learning is one of the most widely used and practical methods for inductive
inference over supervised data. It provides a procedure for classifying categorical data based on
their attributes. This representation of acquired knowledge in tree form is intuitive and easy for
humans to assimilate.


ILLUSTRATION:
Build a decision tree for the following data

AGE           INCOME   STUDENT   CREDIT_RATING   BUYS_COMPUTER
Youth         High     No        Fair            No
Youth         High     No        Excellent       No
Middle aged   High     No        Fair            Yes
Senior        Medium   No        Fair            Yes
Senior        Low      Yes       Fair            Yes
Senior        Low      Yes       Excellent       No
Middle aged   Low      Yes       Excellent       Yes
Youth         Medium   No        Fair            No
Youth         Low      Yes       Fair            Yes
Senior        Medium   Yes       Fair            Yes
Youth         Medium   Yes       Excellent       Yes
Middle aged   Medium   No        Excellent       Yes
Middle aged   High     Yes       Fair            Yes
Senior        Medium   No        Excellent       No


The entropy is a measure of the uncertainty associated with a random variable. As uncertainty
increases, so does entropy; for a two-class problem the values lie in the range [0, 1].

Entropy(D) = - Σ pi log2(pi)

Information gain is used as an attribute selection measure; we pick the attribute having the
highest information gain. The gain is calculated by:

Gain(D, A) = Entropy(D) - Σv (|Dv| / |D|) * Entropy(Dv)

where D is a given data partition and A is an attribute. If we partition the tuples in D on
attribute A having v distinct values, D is split into v partitions or subsets (D1, D2, ..., Dv),
where Dj contains those tuples in D that have outcome aj of A.

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

Entropy(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Compute the expected information requirement for each attribute, starting with the attribute age:

Gain(age, D)
= Entropy(D) - Σv (|Dv| / |D|) * Entropy(Dv)
= Entropy(D) - (5/14) Entropy(S_youth) - (4/14) Entropy(S_middle-aged) - (5/14) Entropy(S_senior)
= 0.940 - 0.694
= 0.246

Similarly, for other attributes,


Gain (Income, D) =0.029
Gain (Student, D ) = 0.151
Gain (credit_rating, D) = 0.048

INCOME   STUDENT   CREDIT_RATING   CLASS
High     No        Fair            No
High     No        Excellent       No
Medium   No        Fair            No
Low      Yes       Fair            Yes
Medium   Yes       Excellent       Yes

33

Downloaded by Santosh Kumar ([email protected])


lOMoARcPSD|23130956

Now, calculating the information gain for the subtable (age = youth, i.e., age <= 30):

The attribute age has the highest information gain and therefore becomes the splitting attribute
at the root node of the decision tree. Branches are grown for each outcome of age, and the tuples
are partitioned accordingly.

For the youth branch (2 yes, 3 no, Entropy = 0.971):
Income = "high": S11 = 0, S12 = 2, I = 0
Income = "medium": S21 = 1, S22 = 1, I(S21, S22) = 1
Income = "low": S31 = 1, S32 = 0, I = 0
Entropy for income: E(income) = (2/5)(0) + (2/5)(1) + (1/5)(0) = 0.4
Gain(income) = 0.971 - 0.4 = 0.571

Similarly, Gain(student) = 0.971 and Gain(credit_rating) = 0.0208.
Gain(student) is the highest, so student becomes the splitting attribute for this branch.

A decision tree for the concept buys_computer indicates whether a customer at AllElectronics is
likely to purchase a computer. Each internal (non-leaf) node represents a test on an attribute.
Each leaf node represents a class (either buys_computer = "yes" or buys_computer = "no").

First, create a CSV file for the above table in a spreadsheet program and save it in CSV format;
this file is then loaded into Weka.
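
The entropy and information-gain arithmetic above can be verified with a short Python sketch; the
(age, class) pairs are hard-coded from the table, and only the split on age at the root is checked:

# Verify Entropy(D) and Gain(age, D) for the buys_computer table above.
from math import log2
from collections import Counter

# (age, buys_computer) pairs taken from the 14-row training table.
rows = [('youth', 'no'), ('youth', 'no'), ('middle', 'yes'), ('senior', 'yes'),
        ('senior', 'yes'), ('senior', 'no'), ('middle', 'yes'), ('youth', 'no'),
        ('youth', 'yes'), ('senior', 'yes'), ('youth', 'yes'), ('middle', 'yes'),
        ('middle', 'yes'), ('senior', 'no')]

def entropy(labels):
    # Entropy(D) = -sum_i p_i * log2(p_i)
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

labels = [cls for _, cls in rows]
print('Entropy(D) =', round(entropy(labels), 3))                 # 0.940

# Gain(D, age) = Entropy(D) - sum_v |D_v|/|D| * Entropy(D_v)
expected = 0.0
for value in set(age for age, _ in rows):
    subset = [cls for age, cls in rows if age == value]
    expected += len(subset) / len(rows) * entropy(subset)
print('Gain(D, age) =', round(entropy(labels) - expected, 3))    # 0.246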


Procedure for running the rules in weka:


Step 1:
Open the Weka Explorer, open the file, and then select all the attributes. The figure gives a
better understanding of how to do that.

Step 2:
Now select the Classify tab in the tool and click on the Start button; we can then see the result
of the problem as below.

Step 3:
Check the result obtained manually against the result in Weka by right-clicking on the result and
visualizing the tree.

The visualized tree in Weka is as shown below:


10. Calculating Information gain measures

Information gain (IG) measures how much "information" a feature gives us about the class: features
that perfectly partition the data should give maximal information, while unrelated features should
give no information. IG measures the reduction in entropy. CfsSubsetEval aims to identify a subset
of attributes that are highly correlated with the target while not being strongly correlated with
one another. It searches through the space of possible attribute subsets for the "best" one, using
the BestFirst search method by default, although other methods can be chosen. To use a wrapper
method rather than a filter method such as CfsSubsetEval, first select WrapperSubsetEval and then
configure it by choosing a learning algorithm to apply and setting the number of cross-validation
folds to use when evaluating it on each attribute subset.

Steps:

 Open the WEKA tool.
 Click on WEKA Explorer.
 Click on the Preprocess tab.
 Click on the Open file button.
 Select and click on the data folder option.
 Choose a data set and open the file.
 Click on the Select attributes tab and choose an attribute evaluator and a search method.
 Click on the Start button.
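
Weka performs the ranking internally through the chosen evaluator. As a rough analogue outside
Weka, the following sketch (assuming scikit-learn is installed, and using the bundled iris data
only as a convenient example) ranks attributes by mutual information, an entropy-reduction measure
closely related to information gain:

# Rank features by mutual information with the class (an information-gain-like score).
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)

# Print the features from most to least informative about the class.
for name, score in sorted(zip(load_iris().feature_names, scores),
                          key=lambda p: p[1], reverse=True):
    print(f'{name:20s} {score:.3f}')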

11. Classification of data using Bayesian approach


Description:
In machine learning, Naïve Bayes classifiers are a family of simple probabilistic classifiers based
on applying Bayes' theorem with strong (naïve) independence assumptions between the features.

Example:

AGE     INCOME   STUDENT   CREDIT_RATING   BUYS_COMPUTER
<=30    High     No        Fair            No
<=30    High     No        Excellent       No
31-40   High     No        Fair            Yes
>40     Medium   No        Fair            Yes
>40     Low      Yes       Fair            Yes
>40     Low      Yes       Excellent       No
31-40   Low      Yes       Excellent       Yes
<=30    Medium   No        Fair            No
<=30    Low      Yes       Fair            Yes
>40     Medium   Yes       Fair            Yes
<=30    Medium   Yes       Excellent       Yes
31-40   Medium   No        Excellent       Yes
31-40   High     Yes       Fair            Yes
>40     Medium   No        Excellent       No

CLASSES:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

DATA TO BE CLASSIFIED:


X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(C1): P(buys_computer = "yes") = 9/14 = 0.643
P(C2): P(buys_computer = "no") = 5/14 = 0.357

Computing P(X|C1) and P(X|C2) we get:

1. P(age = "<=30" | buys_computer = "yes") = 2/9
2. P(age = "<=30" | buys_computer = "no") = 3/5
3. P(income = "medium" | buys_computer = "yes") = 4/9
4. P(income = "medium" | buys_computer = "no") = 2/5
5. P(student = "yes" | buys_computer = "yes") = 6/9
6. P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
7. P(credit_rating = "fair" | buys_computer = "yes") = 6/9
8. P(credit_rating = "fair" | buys_computer = "no") = 2/5

For X = (age <= 30, income = medium, student = yes, credit_rating = fair):

P(X|C1) = P(X | buys_computer = "yes") = 2/9 * 4/9 * 6/9 * 6/9 = 32/729 ≈ 0.044

P(X|C2) = P(X | buys_computer = "no") = 3/5 * 2/5 * 1/5 * 2/5 = 12/625 ≈ 0.019

P(X | buys_computer = "yes") * P(buys_computer = "yes") = (32/729) * (9/14) = 0.028

P(X | buys_computer = "no") * P(buys_computer = "no") = (12/625) * (5/14) = 0.007

Therefore, the conclusion is that the given data belongs to C1 (buys_computer = "yes"),
since P(C1|X) > P(C2|X).
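
The same posterior comparison can be reproduced with a few lines of Python working directly on the
table; Laplace smoothing is deliberately omitted so that the numbers match the hand calculation:

# Reproduce the naive Bayes hand calculation for
# X = (age<=30, income=medium, student=yes, credit_rating=fair).
data = [
    ('<=30', 'high', 'no', 'fair', 'no'),           ('<=30', 'high', 'no', 'excellent', 'no'),
    ('31-40', 'high', 'no', 'fair', 'yes'),         ('>40', 'medium', 'no', 'fair', 'yes'),
    ('>40', 'low', 'yes', 'fair', 'yes'),           ('>40', 'low', 'yes', 'excellent', 'no'),
    ('31-40', 'low', 'yes', 'excellent', 'yes'),    ('<=30', 'medium', 'no', 'fair', 'no'),
    ('<=30', 'low', 'yes', 'fair', 'yes'),          ('>40', 'medium', 'yes', 'fair', 'yes'),
    ('<=30', 'medium', 'yes', 'excellent', 'yes'),  ('31-40', 'medium', 'no', 'excellent', 'yes'),
    ('31-40', 'high', 'yes', 'fair', 'yes'),        ('>40', 'medium', 'no', 'excellent', 'no'),
]
x = ('<=30', 'medium', 'yes', 'fair')

for cls in ('yes', 'no'):
    rows = [r for r in data if r[4] == cls]
    prior = len(rows) / len(data)                    # P(class)
    likelihood = 1.0
    for i, value in enumerate(x):
        # P(attribute_i = value | class) estimated by relative frequency.
        likelihood *= sum(1 for r in rows if r[i] == value) / len(rows)
    print(f'P(X|{cls}) * P({cls}) = {prior * likelihood:.3f}')
# yes: 0.028, no: 0.007 -> X is classified as buys_computer = "yes"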

Checking the result in the WEKA tool:

In order to check the result in the tool we need to follow a procedure.

Step 1:

Create a CSV file with the table considered in the example; the file will look as shown below.

Step 2:

Now open the Weka Explorer and then select all the attributes in the table.

Step 3:

Select the Classify tab in the tool, choose the bayes folder and then the NaiveBayes classifier
to see the result as shown below.


12. Classification of data using K – nearest neighbour approach


KNN as Classifier

First, start with importing necessary python packages −

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Next, download the iris dataset from its weblink as follows −

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, we need to assign column names to the dataset as follows −

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Now, we need to read dataset to pandas dataframe as follows −

dataset = pd.read_csv(path, names = headernames)


dataset.head()
sepal-length sepal-width petal-length petal-width Class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Data Preprocessing will be done with the help of following script lines.

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

Next, we will divide the data into train and test split. Following code will split
the dataset into 60% training data and 40% of testing data −

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.40)

Next, data scaling will be done as follows −

from sklearn.preprocessing import StandardScaler



scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, train the model with the help of KNeighborsClassifier class of sklearn as
follows −

from sklearn.neighbors import KNeighborsClassifier


classifier = KNeighborsClassifier(n_neighbors = 8)
classifier.fit(X_train, y_train)

At last we need to make prediction. It can be done with the help of following
script −

y_pred = classifier.predict(X_test)

Next, print the results as follows −

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)

Output
Confusion Matrix:
[[21 0 0]
[ 0 16 0]
[ 0 7 16]]
Classification Report:
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 21
Iris-versicolor 0.70 1.00 0.82 16
Iris-virginica 1.00 0.70 0.82 23
micro avg 0.88 0.88 0.88 60
macro avg 0.90 0.90 0.88 60
weighted avg 0.92 0.88 0.88 60

Accuracy: 0.8833333333333333

13. Implementation of K – means algorithm


K-means algorithm aims to partition n observations into “k clusters” in which each observation
belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in
partitioning of the data into Voronoi cells.

ILLUSTRATION:

As a simple illustration of the k-means algorithm, consider the following data set consisting of
the scores of two variables for each of five individuals.

I X1 X2

A 1 1

B 1 0

C 0 2

D 2 4

E 3 5

This data set is to be grouped into two clusters. As a first step in finding a sensible partition,
let two individuals, A and C, define the initial cluster means (using the Euclidean distance
measure), giving:

Cluster Individual Mean Vector(Centroid)

Cluster1 A (1,1)

Cluster2 C (0,2)


The remaining individuals are now examined in sequence and allocated to the cluster to which
they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is
recalculated each time a new member is added. This leads to the following series of steps:

             A      C
A            0      1.4
B            1      2.2
C            1.4    0
D            3.2    2.82
E            4.5    4.2

The initial partitions have now changed, and the two clusters at this stage have the
following characteristics:

Cluster     Individuals   Mean Vector (Centroid)
Cluster 1   A, B          (1, 0.5)
Cluster 2   C, D, E       (1.7, 3.7)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we
compare each individual's distance to its own cluster mean and to that of the opposite cluster,
and we find:

Individual   Distance to Cluster 1 mean   Distance to Cluster 2 mean
A            0.5                          2.7
B            0.5                          3.7
C            1.8                          2.4
D            3.6                          0.5
E            4.9                          1.9

Individual C is now relocated to Cluster 1 because it is closer to that cluster's centroid,
resulting in the new partition:

Cluster     Individuals   Mean Vector (Centroid)
Cluster 1   A, B, C       (0.7, 1)
Cluster 2   D, E          (2.5, 4.5)

The iterative relocation would now continue from this new partition until no more relocations
occur. However, in this example each individual is now nearer to its own cluster mean than to that
of the other cluster, so the iteration stops, and the latest partitioning is taken as the final
cluster solution.
It is also possible that the k-means algorithm won't converge to a final solution. In that case,
it is a better idea to stop the algorithm after a pre-chosen maximum number of iterations.
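
The iterations above can be reproduced with a minimal numpy-only k-means sketch on the same five
points, starting from individuals A and C as the initial centroids:

# Minimal k-means sketch on the five points above (k = 2), using numpy only.
import numpy as np

points = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]], dtype=float)  # A..E
centroids = points[[0, 2]].copy()    # start from individuals A and C, as above

for _ in range(10):                  # a few iterations are enough here
    # Assign each point to the nearest centroid (Euclidean distance).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centroid as the mean of its assigned points.
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):   # stop when nothing moves
        break
    centroids = new_centroids

print('labels   :', labels)          # [0 0 0 1 1] -> {A, B, C} and {D, E}
print('centroids:', centroids)       # approximately (0.67, 1.0) and (2.5, 4.5)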
Checking the solution in Weka:
In order to check the result in the tool we need to follow a procedure.
Step 1:
Create a CSV file with the table considered in the example; the CSV file will look as shown below.

Step 2:
Now open the Weka Explorer and then select all the attributes in the table.


Step 3:
Select the Cluster tab in the tool and choose the SimpleKMeans technique to
see the result as shown below.


14. Implementation of BIRCH algorithm


BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised
data mining algorithm that performs hierarchical clustering over large data sets. With
modifications, it can also be used to accelerate k-means clustering and Gaussian mixture
modeling with the expectation-maximization algorithm. An advantage of BIRCH is its
ability to incrementally and dynamically cluster incoming, multi-dimensional metric data
points to produce the best quality clustering for a given set of resources (memory and time
constraints). In most cases, BIRCH only requires a single scan of the database.

Algorithm
The BIRCH algorithm builds a tree structure over the given data called the clustering feature
tree (CF tree). The algorithm is based on this CF (clustering features) tree and uses the
tree-structured summary to create clusters.


In the context of the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes
that have several sub-clusters can be called CF subclusters, and these CF subclusters are
situated in non-leaf CF nodes.

The CF tree is a height-balanced tree that gathers and manages clustering features and holds the
necessary information about the given data for further hierarchical clustering. This prevents the
need to work with the whole data given as input. A cluster of data points is represented in the
tree as a CF, a triple of three numbers (N, LS, SS).

o N = number of items in subclusters


o LS = vector sum of the data points
o SS = sum of the squared data points

There are mainly four phases which are followed by the algorithm of BIRCH.

o Scanning data into memory.


o Condense data (resize data).


o Global clustering.
o Refining clusters.

Two of them (resize data and refining clusters) are optional in these four phases. They come
in the process when more clarity is required. But scanning data is just like loading data into
a model. After loading the data, the algorithm scans the whole data and fits them into the CF
trees.

In condensing, it resets and resizes the data for better fitting into the CF tree. In global
clustering, it sends CF trees for clustering using existing clustering algorithms. Finally,
refining fixes the problem of CF trees where the same valued points are assigned to
different leaf nodes.
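
There is no worked example in this section, so the following sketch simply shows BIRCH in action,
assuming scikit-learn is available; the synthetic blob data and the parameter values are
illustrative assumptions:

# Sketch of BIRCH with scikit-learn; the blob data is synthetic and for illustration only.
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# 300 two-dimensional points around 3 centres.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# threshold controls the radius of the CF subclusters; n_clusters drives the final
# global clustering phase described above.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print('Number of CF subclusters :', len(model.subcluster_centers_))
print('First ten cluster labels :', labels[:10])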

15. Implementation of PAM algorithm


PAM (Partitioning Around Medoids) is the most powerful of the k-medoids algorithms but has the
disadvantage of high time complexity; CLARA and CLARANS are more scalable variants. The following
k-medoids clustering is performed using PAM.

Algorithm:

Given the value of k and unlabelled data:

1. Choose k number of random points from the data and assign these k points to k number of clusters. These are the
initial medoids.
2. For all the remaining data points, calculate the distance from each medoid and assign it to the cluster with the
nearest medoid.
3. Calculate the total cost (the sum of the distances from all the data points to their nearest medoids).
4. Select a random point as the new medoid and swap it with the previous medoid. Repeat 2 and 3 steps.
5. If the total cost of the new medoid is less than that of the previous medoid, make the new medoid permanent and
repeat step 4.
6. If the total cost of the new medoid is greater than the cost of the previous medoid, undo the swap and repeat step 4.
7. The Repetitions have to continue until no change is encountered with new medoids to classify data points.

Here is an example to make the theory clear:

Data set:

x y
0 5 4
1 7 7
2 1 3
3 8 6
4 4 9


Scatter plot:

If k is given as 2, we need to break down the data points into 2 clusters.

1. Initial medoids: M1(1, 3) and M2(4, 9)


2. Calculation of distances

Manhattan Distance: |x1 - x2| + |y1 - y2|

     x   y   From M1(1, 3)   From M2(4, 9)

0 5 4 5 6

1 7 7 10 5

2 1 3 - -

3 8 6 10 7

4 4 9 - -

Cluster 1: 0

Cluster 2: 1, 3

1. Calculation of total cost:


(5) + (5 + 7) = 17
2. Random medoid: (5, 4)

M1(5, 4) and M2(4, 9):


x y From M1(5, 4) From M2(4, 9)

0 5 4 - -

1 7 7 5 5

2 1 3 5 9

3 8 6 5 7

4 4 9 - -

Cluster 1: 2, 3

Cluster 2: 1

1. Calculation of total cost:


(5 + 5) + 5 = 15
Less than the previous cost
New medoid: (5, 4).
2. Random medoid: (7, 7)

M1(5, 4) and M2(7, 7)

x y From M1(5, 4) From M2(7, 7)

0 5 4 - -

1 7 7 - -

2 1 3 5 10

3 8 6 5 2

4 4 9 6 5

Cluster 1: 2

Cluster 2: 3, 4

1. Calculation of total cost:


(5) + (2 + 5) = 12
Less than the previous cost
New medoid: (7, 7).
2. Random medoid: (8, 6)

M1(7, 7) and M2(8, 6)


x y From M1(7, 7) From M2(8, 6)


0 5 4 5 5
1 7 7 - -
2 1 3 10 10
3 8 6 - -
4 4 9 5 7

Cluster 1: 4

Cluster 2: 0, 2

1. Calculation of total cost:


(5) + (5 + 10) = 20
Greater than the previous cost
UNDO
Hence, the final medoids: M1(5, 4) and M2(7, 7)
Cluster 1: 2
Cluster 2: 3, 4
Total cost: 12
Clusters:

Time complexity: O(k * (n - k)^2)
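
The hand computation above can be double-checked with a short Python sketch on the same five
points. Because the dataset is tiny, the sketch simply evaluates every possible pair of medoids
instead of performing PAM's iterative swap steps; both approaches end at the same minimum-cost
configuration here:

# Check the PAM result on the same five points using Manhattan distance.
from itertools import combinations

points = [(5, 4), (7, 7), (1, 3), (8, 6), (4, 9)]
K = 2

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(medoids):
    # Each point contributes its distance to the nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

# Exhaustively evaluate every possible pair of medoids (feasible for 5 points).
best = min(combinations(points, K), key=total_cost)
print('Best medoids:', best, 'total cost:', total_cost(best))   # ((5, 4), (7, 7)), cost 12

# Assign every point to its nearest medoid to form the clusters.
clusters = {m: [] for m in best}
for p in points:
    clusters[min(best, key=lambda m: manhattan(p, m))].append(p)
print('Clusters:', clusters)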

16. Implementation of DBSCAN algorithm

Implementation steps for the DBSCAN algorithm:

Now, we will perform the implementation of the DBSCAN algorithm in Python. We will do this in
steps so that the implementation does not get complex and is easy to follow. We have to follow
the steps below in order to implement the DBSCAN algorithm inside a Python program:

Step 1: Importing all the required libraries:

First and foremost, we have to import all the required libraries (NumPy, pandas, matplotlib, and
scikit-learn) so that we can use their functions while implementing the DBSCAN algorithm.

Here, we have first imported all the required libraries or modules inside the program:

# Importing numpy library as nmp
import numpy as nmp
# Importing pandas library as pds
import pandas as pds
# Importing matplotlib library as pplt
import matplotlib.pyplot as pplt
# Importing DBSCAN from the cluster module of sklearn
from sklearn.cluster import DBSCAN
# Importing StandardScaler and normalize from the preprocessing module of sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
# Importing PCA from the decomposition module of sklearn
from sklearn.decomposition import PCA

Step 2: Loading the Data:

In this step, we have to load the data, and we can do this by importing or loading the dataset
(which the DBSCAN algorithm will work on) inside the program. To load the dataset inside the
program, we will use the read_csv() function of the pandas library and print information from
the dataset as we have done below:

# Loading the data inside an initialized variable
M = pds.read_csv('sampleDataset.csv')  # Path of dataset file
# Dropping the CUST_ID column from the dataset with drop() function
M = M.drop('CUST_ID', axis = 1)
# Using fillna() function to handle missing values
M.fillna(method ='ffill', inplace = True)
# Printing dataset head in output
print(M.head())

Output:
BALANCE BALANCE_FREQUENCY ... PRC_FULL_PAYMENT TENURE
0 40.900749 0.818182 ... 0.000000 12
1 3202.467416 0.909091 ... 0.222222 12
2 2495.148862 1.000000 ... 0.000000 12
3 1666.670542 0.636364 ... 0.000000 12
4 817.714335 1.000000 ... 0.000000 12

[5 rows x 17 columns]

The data as given in the output above will be printed when we run the program, and we will
work on this data from the dataset file we loaded.

Step 3: Preprocessing the data:

Now, in this step, we will start preprocessing the data of the dataset by using the functions of
the preprocessing module of the sklearn library. We have to use the following technique while
preprocessing the data with sklearn library functions:

# Initializing a variable with the StandardScaler() function
scalerFD = StandardScaler()
# Transforming the data of the dataset with the scaler
M_scaled = scalerFD.fit_transform(M)
# To make sure that the data follows a gaussian distribution,
# we will normalize the scaled data with the normalize() function
M_normalized = normalize(M_scaled)
# Now we will convert the numpy arrays into pandas dataframes
M_normalized = pds.DataFrame(M_normalized)

Step 4: Reduce the dimensionality of the data:


In this step, we will be reducing the dimensionality of the scaled and normalized data so
that the data can be visualized easily inside the program. We have to use the PCA function
in the following way in order to transform the data and reduce its dimensionality:

# Initializing a variable with the PCA() function
pcaFD = PCA(n_components = 2)  # number of components to keep
# Transforming the normalized data with PCA
M_principal = pcaFD.fit_transform(M_normalized)
# Making a dataframe from the transformed data
M_principal = pds.DataFrame(M_principal)
# Naming the two columns of the transformed data
M_principal.columns = ['C1', 'C2']
# Printing the head of the transformed data
print(M_principal.head())

Output:
C1 C2
0 -0.489949 -0.679976
1 -0.519099 0.544828
2 0.330633 0.268877
3 -0.481656 -0.097610
4 -0.563512 -0.482506
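
The manual stops after the PCA step. A plausible final step, continuing with the variables and
imports defined in the earlier steps, is sketched below; the eps and min_samples values are
assumptions that would normally be tuned for the dataset at hand:

# Assumed continuation: fitting DBSCAN on the PCA-reduced data.
# eps and min_samples below are illustrative and would normally be tuned.
db_default = DBSCAN(eps = 0.0375, min_samples = 3).fit(M_principal)
labels = db_default.labels_

# Points labelled -1 are treated as noise; the rest belong to numbered clusters.
print('Number of clusters found:',
      len(set(labels)) - (1 if -1 in labels else 0))

# Visualizing the clusters on the two principal components.
pplt.scatter(M_principal['C1'], M_principal['C2'], c = labels)
pplt.show()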
