
K. J. INSTITUTE OF ENGINEERING & TECHNOLOGY, SAVLI

LAB MANUAL

DATA WAREHOUSING AND DATA MINING


(3161610)

FOR

6TH SEMESTER

IT
CERTIFICATE

K. J. INSTITUTE OF ENGINEERING & TECHNOLOGY


DATA WAREHOUSING AND DATA MINING (3161610)

This is to certify that Mr./Ms. MISTRY ZEEL JAYESHBHAI of 6TH SEM B.E. I.T.
class, ENROLL NO. 220643116004, has satisfactorily completed his/her term
work in DATA WAREHOUSING AND DATA MINING for the term ending
in APRIL, 2023/2024.

DATE:
Grade:

PROF. MISS PRANAVI PATEL


INTERNAL EXAMINER EXTERNAL EXAMINER

MISS. PRANAVI PATEL


HEAD OF DEPARTMENT (IT)
FOREWORD

It is my great pleasure to present this laboratory manual for third year


Information Technology students for the subject of Data
Warehousing and Data Mining.

As students, many of you may have questions in your mind regarding
the subject, and this manual attempts to answer exactly those
questions.

Faculty members are also advised that covering these aspects at the
initial stage itself will greatly relieve them in the future, as much of the
load will be taken care of by the enthusiasm and energy of the students
once they are conceptually clear.

DR. DEVANG SHAH


PRINCIPAL
LABORATORY MANUAL CONTENTS

This manual is intended for the third year students of Information


Technology in the subject of Data Warehousing and Data Mining.
This manual contains practical/lab sessions related to Python,
implemented in Jupyter Notebook and WEKA, covering the subject to
enhance understanding.

Students are advised to thoroughly go through this manual rather
than only the topics mentioned in the syllabus, as practical aspects are
the key to understanding and conceptual visualization of the theoretical
aspects covered in the books.

Good Luck for your Enjoyable Laboratory Sessions

MISS. PRANAVI PATEL


FACULTY AND HEAD OF DEPARTMENT (IT)
LAB INDEX
SR NO.  DATE      LIST OF PRACTICALS                                                                             PAGE NO.  SIGNATURE
1       18/1/24   Introduce and perform different methods of i) Data cleaning and ii) Data integration and transformation.   1
2       25/1/24   Perform data reduction with Linear Algebra methods and Manifold Learning methods, with achieved accuracy from the make_classification dataset using Logistic Regression.   35
3       08/02/24  Demonstration of pre-processing methods: i) Rescale Data and ii) Binarize Data.   43
4       15/02/24  Perform different normalization methods: i) Maximum Absolute Scaling, ii) Min-Max Feature Scaling, and iii) Z-score Method.   72
5       22/02/24  Introduce and perform Attribute Relevance.   75
6       07/03/24  Implement Decision Tree based algorithms: Random Forest and AdaBoost.   89
7       14/03/24  Perform installation of the Weka tool.   93
8       28/03/24  Demonstration of preprocessing on dataset student.arff.   100
9       04/04/24  Demonstration of preprocessing on dataset labor.arff.   104
10      11/04/24  Demonstration of Association rule process on dataset contact-lenses.arff using the Apriori algorithm.   108

PRACTICAL-1

AIM: Introduce and Perform different methods of i) Data cleaning and ii)
Data integration and transformation.

1. Data Cleaning
As we know, Data Mining is the discipline of study which involves extracting insights from
huge amounts of data by the use of various scientific methods, algorithms, and processes. To
extract useful knowledge from data, Data Mining needs raw data. This raw data is a collection of
information from various outside sources and is the essential raw material of data scientists. It is
additionally known as primary or source data. It often contains garbage, irregular and inconsistent
values which lead to many difficulties. When using data, the insights and analysis extracted are
only as good as the data we are using. Essentially, when garbage data goes in, garbage analysis
comes out. Here data cleaning comes into the picture: data cleansing is an essential part of data
mining. Data cleaning is the process of removing incorrect, corrupted, garbage, incorrectly
formatted, duplicate, or incomplete data within a dataset.
Why Data Cleaning?
Data cleaning is one of the most important tasks a data science professional performs. Having
wrong or bad-quality data can be detrimental to processes and analysis. Having clean data will
ultimately increase overall productivity and permit the very best quality information in your
decision-making.

Error-Free Data
When multiple sources of data are combined, there is a high chance of error. Through data
cleaning, errors can be removed from the data. Having clean data which is free from wrong and
garbage values helps in performing analysis faster as well as more efficiently, saving a
considerable amount of time. If we use data containing garbage values, the results won't be
accurate, and when we don't use accurate data, we will surely make mistakes. Monitoring errors
and good reporting help to find where errors are coming from, and also make it easier to fix
incorrect or corrupt data for future applications.

Data Quality
The quality of the data is the degree to which it follows the rules of particular requirements. For
example, if we have imported phone numbers data of different customers, and in some places, we
have added email addresses of customers in the data. But because our needs were straightforward
for phone numbers, then the email addresses would be invalid data. Here some pieces of data
follow a specific format. Some types of numbers have to be in a specific range. Some data cells
might require a selected quite data like numeric, Boolean, etc. In every scenario, there are some
mandatory constraints our data should follow. Certain conditions affect multiple fields of data in
a particular form. Particular types of data have unique restrictions. If the data isn‘t in the required
format, it would always be invalid. Data cleaning will help us simplify this process and avoid
useless data values.
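
As a minimal sketch of such constraint checks (the column names, the 10-digit phone format and the
age range below are assumptions for illustration, not columns from the manual's datasets), a few
pandas checks might look like this:

import pandas as pd

# Hypothetical customer data; column names are assumed for illustration only.
customers = pd.DataFrame({'Phone': ['9876543210', 'abc@mail.com', '9123456780'],
                          'Age': [25, 130, 41]})

# Format constraint: a phone number should be exactly 10 digits.
valid_phone = customers['Phone'].str.fullmatch(r'\d{10}')

# Range constraint: age should fall in a plausible range.
valid_age = customers['Age'].between(0, 120)

# Rows violating any constraint are candidates for cleaning.
print(customers[~(valid_phone & valid_age)])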

Accurate and Efficient
Ensuring the data is close to the correct values. We know that most of the data in a dataset is
valid, and we should focus on establishing its accuracy. Even if the data is authentic and correct,
it doesn't mean the data is accurate. Determining accuracy helps to figure out whether the data
entered is accurate or not. For example, the address of a customer may be stored in the specified
format, yet it may not be the right one, or an email may have an additional character or value that
makes it incorrect or invalid. Another example is the phone number of a customer. This means that
we have to rely on data sources and cross-check the data to figure out if it's accurate or not.
Depending on the kind of data we are using, we might be able to find various resources that could
help us with cleaning in this regard.
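
A lightweight way to cross-check accuracy against another source, as described above, is to merge
the two sources and compare the overlapping fields. The DataFrames and column names below are
hypothetical, purely to illustrate the idea:

import pandas as pd

# Hypothetical: phone numbers recorded by us vs. a trusted reference source.
ours = pd.DataFrame({'CustomerID': [1, 2, 3],
                     'Phone': ['9876543210', '9123456780', '9000000000']})
reference = pd.DataFrame({'CustomerID': [1, 2, 3],
                          'Phone': ['9876543210', '9123456781', '9000000000']})

# Cross-check: rows where the two sources disagree need manual review.
merged = ours.merge(reference, on='CustomerID', suffixes=('_ours', '_ref'))
print(merged[merged['Phone_ours'] != merged['Phone_ref']])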

Complete Data
Completeness is the degree to which we know all the required values. Completeness is a
little more challenging to achieve than accuracy or quality, because it's nearly impossible to have
all the information we need; only known facts can be entered. We can try to complete data by
redoing the data-gathering activities, like approaching the clients again, re-interviewing people,
etc. For example, we might need to enter every customer's contact information, but a number of
them might not have email addresses. In this case, we have to leave those columns empty. If we
have a system that requires us to fill all columns, we can try to enter 'missing' or 'unknown' there.
But entering such values does not mean that the data is complete; it would still be referred to as
incomplete.
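
Completeness can also be measured directly, for example as the fraction of missing values per
column. A small sketch on a hypothetical DataFrame (not one of the manual's CSV files):

import numpy as np
import pandas as pd

# Hypothetical contact data with gaps in the Email column.
contacts = pd.DataFrame({'Name': ['Asha', 'Ravi', 'Tina'],
                         'Email': ['asha@x.com', np.nan, np.nan]})

# Fraction of missing values per column: a simple completeness metric.
print(contacts.isna().mean())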

Maintains Data Consistency


To ensure the data is consistent within the same dataset or across multiple datasets, we can measure
consistency by comparing two similar systems. We can also check the data values within the same
dataset to see if they are consistent or not. Consistency can be relational. For example, a customer's
age might be 25, which is a valid and accurate value, but the same system may also state that the
customer is a senior citizen. In such cases, we have to cross-check the data, similar to measuring
accuracy, and see which value is true. Is the client 25 years old, or is the client a senior citizen? Only
one of these values can be true. There are multiple ways to make your data consistent:

By checking different systems. By checking the source. By checking the latest data.
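
A minimal sketch of such a relational consistency check (the Age column, the SeniorCitizen flag,
and the assumption that "senior" means 60 or above are all hypothetical):

import pandas as pd

# Hypothetical data: an age value and a senior-citizen flag that should agree.
df_check = pd.DataFrame({'Age': [25, 67, 40],
                         'SeniorCitizen': [True, True, False]})

# Flag rows where the two fields contradict each other.
print(df_check[(df_check['Age'] < 60) & df_check['SeniorCitizen']])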

Python Code:

In [1]:

import pandas as pd
In [2]:

data = pd.read_csv('p1-1.csv')
In [3]:

data.head()

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

Out[3]
In [4]:

data
Out[4]:

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa


3 4 4.6 3.1 1.5 0.2 Iris-setosa


4 5 5.0 3.6 1.4 0.2 Iris-setosa

... ... ... ... ... ... ...

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 6 columns


In [5]:

data.tail()
Out[5]:

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

In [6]:

data.isnull()

###This function provides the boolean value for the complete dataset to know if any null value is present or not.
Out[6]:

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 False False False False False False

1 False False False False False False

2 False False False False False False

3 False False False False False False

4 False False False False False False

... ... ... ... ... ... ...

145 False False False False False False

146 False False False False False False

147 False False False False False False

148 False False False False False False

149 False False False False False False

150 rows × 6 columns


In [7]:

data.isna()

####This is the same as the isnull() function and provides the same output.

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 False False False False False False

1 False False False False False False

2 False False False False False False

3 False False False False False False

4 False False False False False False

... ... ... ... ... ... ...

145 False False False False False False

146 False False False False False False

147 False False False False False False

148 False False False False False False

149 False False False False False False

150 rows × 6 columns

Out[7]:

In [8]:


data.isna().any()

###This function also gives a boolean value if any null value is present or not,
###but it gives results column-wise, not in tabular format.
Out[8]:
Id False
SepalLengthCm False
SepalWidthCm False
PetalLengthCm False
PetalWidthCm False
Species False
dtype: bool
In [9]:

data.isna().sum()

###This function gives the sum of the null values present in the dataset, column-wise.

Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64

Out[9]

In [10]:

data.isna().any().sum()

###This function gives a single value: the number of columns that contain any null values.


Out[10]:
There are no null values present in our dataset. But if there are any null values present, we can fill
those places with any other value using the fillna() function of DataFrame. Following is the syntax of
the fillna() function:

DataFrame_name.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

Merging datasets is the process of combining two datasets into one, lining up rows based on some
particular or common property for data analysis. We can do this by using the merge() function of the
DataFrame. Following is the syntax of the merge function:

DataFrame_name.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
De-Duplicate
De-duplication means removing all duplicate values. There is no need for duplicate values in data
analysis; these values only affect the accuracy and efficiency of the analysis result. To find duplicate
values in the dataset we will use a simple DataFrame function, i.e. duplicated(). Let's see the example:
In [11]:

data.duplicated()
Out[11]:
0 False
1 False
2 False
3 False
4 False
...
145 False
146 False
147 False
148 False
149 False
Length: 150, dtype: bool
This function also provides boolean values for duplicate values in the dataset. As we can see, the
dataset doesn't contain any duplicate values.

If a dataset contains duplicate values it can be removed using the drop_duplicates() function.
Following is the syntax of this function:

DataFrame_name.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
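
As a brief illustration on a hypothetical DataFrame (not one of the manual's CSV files), finding and
dropping duplicates might look like this:

import pandas as pd

# Hypothetical data with one exact duplicate row.
df_dup = pd.DataFrame({'Name': ['Asha', 'Ravi', 'Asha'],
                       'Marks': [80, 75, 80]})

print(df_dup.duplicated())        # the repeated row is marked True
print(df_dup.drop_duplicates())   # keeps the first occurrence by default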

data.duplicated().any().sum()

Out[10]:

In [11]:

data1 = pd.read_csv('StudentDetails.csv')
In [12]:

data1

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage
0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[12]:
In [13]:


data1["Name of Collage"].fillna("no collage")


0 Sceit
1 Parul
2 LD
3 no collage
4 no collage
5 Sigma
6 SVNIT
7 no collage
8 MS
9 KJIT
Name: Name of Collage, dtype: object

Out[13]:

In [14]:

data2 = pd.read_csv('StudentDetails.csv')
In [15]:

data2

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[15]:
In [16]:

data2["Name of Collage"].fillna(method='ffill', inplace=True)


In [17]:

data2
Out[17]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 LD

4 5 Nimisha Sutariya NaN 7.50 LD

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 SVNIT

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

In [18]:

data3 = pd.read_csv('StudentDetails.csv')
In [19]:

data3
Out[19]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN



8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

In [20]:

data3["Name of Collage"].fillna(method='bfill', inplace=True)


In [21]:

data3

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage
0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 Sigma

4 5 Nimisha Sutariya NaN 7.50 Sigma

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 MS

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[21]:

In [22]:

data4 = pd.read_csv('StudentDetails.csv')
In [23]:

data4

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[23]:

In [24]:

data4["Name of Collage"].fillna(method='backfill', inplace=True)


In [25]:


data4
Out[25]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 Sigma



4 5 Nimisha Sutariya NaN 7.50 Sigma

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 MS

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

In [26]:

data5 = pd.read_csv('StudentDetails.csv')
In [27]:

data5

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[27]:
In [28]:

data5["Name of Collage"].fillna(method='pad', inplace=True)


In [29]:

data5

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 LD

4 5 Nimisha Sutariya NaN 7.50 LD

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 SVNIT

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[29]:

In [30]:


data6 = pd.read_csv('StudentDetails.csv')
In [31]:

data6
Out[31]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

In [32]:

data6["Name of Collage"].fillna(method='ffill',limit=1,inplace=True)
In [33]:

data6
Out[33]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 LD

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 SVNIT



8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

2. Data Integration and Transformation

Data Integration and Transformation


So far, we've made sure to remove the impurities in data and make it clean. Now, the next step is
to combine data from different sources to get a unified structure with more meaningful and
valuable information. This is mostly used if the data is segregated into different sources. To make
it simple, let's assume we have data in CSV format in different places, all talking about the same
scenario. Say we have some data about an employee in a database. We can't expect all the data
about the employee to reside in the same table. It's possible that the employee's personal data will
be located in one table, the employee's project history will be in a second table, the employee's
time-in and time-out details will be in another table, and so on. So, if we want to do some analysis
about the employee, we need to get all the employee data in one common place. This process of
bringing data together in one place is called data integration. To do data integration, we can merge
multiple pandas DataFrames using the merge function. Visualization is an important tool for
insight generation, but it is rare that you get the data in exactly the right form you need. You will
often need to create some new variables or summaries, rename variables, or reorder observations
for the data to be easier to manage.

In [57]:

df1 = pd.DataFrame({'Name': ['Shivani', 'Tanvi', 'Nimisha', 'Twinkle'],


'ENR': ['28', '03', '12', '05']})
In [58]:


df1
Out[58]:
Name ENR

0 Shivani 28

1 Tanvi 03

2 Nimisha 12


3 Twinkle 05

In [60]:

df2 = pd.DataFrame({'Name': ['Shivani', 'Tanvi', 'Nimisha', 'Twinkle'],


'Skills': ['Wd','JAVA' , 'SQL', 'Python']})

In [61]:

df2

Name Skills

0 Shivani Wd

1 Tanvi JAVA

2 Nimisha SQL

3 Twinkle Python

Out[61]:
In [62]:

data9 = pd.merge(df1, df2)

In [63]:

data9
Out[63]:
Name ENR Skills

0 Shivani 28 Wd

1 Tanvi 03 JAVA

2 Nimisha 12 SQL

3 Twinkle 05 Python

In [ ]:

data8=pd.merge(data6, data7, on='Sr No', how='inner')


In [ ]:

In [64]:

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],'supervisor': ['Carly', 'Guido', 'Steve']})

In [65]:

df4

group supervisor

0 Accounting Carly

1 Engineering Guido

2 HR Steve

Out[65]:

In [68]:

display(df1, df2, pd.merge(df1, df2))


Name ENR

0 Shivani 28

1 Tanvi 03

2 Nimisha 12

3 Twinkle 05

Name Skills

0 Shivani Wd

1 Tanvi JAVA

2 Nimisha SQL

3 Twinkle Python

Name ENR Skills

0 Shivani 28 Wd

1 Tanvi 03 JAVA

2 Nimisha 12 SQL

In [41]:

import numpy as np
import pandas as pd

In [42]:


df = pd.read_csv('Dataset11.csv')
In [43]:

df
Out[43]:

NAME A B C D E

0 JANE 1 6 6 9 1

1 JOHN 8 1 2 8 1

2 ASHLEY 6 3 5 1 7

3 MAX 0 3 4 0 8

4 EMILY 7 6 6 0 6

Add / drop columns


The first and foremost way of transformation is adding or dropping columns. A new column can be
added as follows:
In [44]:

df['new'] = np.random.random(5)
In [45]:

df
Out[45]:


NAME A B C D E new

0 JANE 1 6 6 9 1 0.458527

1 JOHN 8 1 2 8 1 0.390897

2 ASHLEY 6 3 5 1 7 0.044329

3 MAX 0 3 4 0 8 0.229151


4 EMILY 7 6 6 0 6 0.589566

We give the values as an array or list and assign a name to the new column. Make sure the size of
the array is compatible with the size of the dataframe. The drop function is used to drop a column.
In [46]:

df.drop('new', axis=1, inplace=True)


In [47]:

df
Out[47]:

NAME A B C D E

0 JANE 1 6 6 9 1

1 JOHN 8 1 2 8 1

2 ASHLEY 6 3 5 1 7

3 MAX 0 3 4 0 8

4 EMILY 7 6 6 0 6

We pass the name of the column to be dropped. The axis parameter is set to 1 to indicate we are
dropping a column. Finally, the inplace parameter needs to be True to save the changes.

Add / drop rows


We can use the loc method to add a single row to a dataframe.
In [48]:

df.loc[5,:] = ['Jack', 3, 3, 4, 5, 1]
In [49]:

df
Out[49]:

NAME A B C D E

0 JANE 1.0 6.0 6.0 9.0 1.0

1 JOHN 8.0 1.0 2.0 8.0 1.0

2 ASHLEY 6.0 3.0 5.0 1.0 7.0

3 MAX 0.0 3.0 4.0 0.0 8.0

4 EMILY 7.0 6.0 6.0 0.0 6.0

5 Jack 3.0 3.0 4.0 5.0 1.0

In [50]:

df.drop(5, axis=0, inplace=True)


We have just dropped the row that was added in the previous step.

Insert
The insert function adds a column into a specific position.

In [51]:

df.insert(0, 'new', np.random.random(5))


In [52]:

df
Out[52]:

new NAME A B C D E

0 0.641899 JANE 1.0 6.0 6.0 9.0 1.0

1 0.846706 JOHN 8.0 1.0 2.0 8.0 1.0

2 0.291893 ASHLEY 6.0 3.0 5.0 1.0 7.0

3 0.598154 MAX 0.0 3.0 4.0 0.0 8.0

4 0.882514 EMILY 7.0 6.0 6.0 0.0 6.0

In [53]:

df.insert(2, 'me', np.random.random(5))


In [54]:

df
Out[54]:

new NAME me A B C D E

0 0.641899 JANE 0.661207 1.0 6.0 6.0 9.0 1.0


1 0.846706 JOHN 0.631619 8.0 1.0 2.0 8.0 1.0

2 0.291893 ASHLEY 0.459551 6.0 3.0 5.0 1.0 7.0

3 0.598154 MAX 0.659862 0.0 3.0 4.0 0.0 8.0

4 0.882514 EMILY 0.373759 7.0 6.0 6.0 0.0 6.0

In [55]:

df.drop('new', axis=1, inplace=True)


In [56]:

df
Out[56]:

NAME me A B C D E

0 JANE 0.661207 1.0 6.0 6.0 9.0 1.0

1 JOHN 0.631619 8.0 1.0 2.0 8.0 1.0

2 ASHLEY 0.459551 6.0 3.0 5.0 1.0 7.0

3 MAX 0.659862 0.0 3.0 4.0 0.0 8.0

4 EMILY 0.373759 7.0 6.0 6.0 0.0 6.0

In [57]:

df.drop('me', axis=1, inplace=True)


In [58]:


df
Out[58]:

NAME A B C D E

0 JANE 1.0 6.0 6.0 9.0 1.0



1 JOHN 8.0 1.0 2.0 8.0 1.0

2 ASHLEY 6.0 3.0 5.0 1.0 7.0

3 MAX 0.0 3.0 4.0 0.0 8.0

4 EMILY 7.0 6.0 6.0 0.0 6.0

Melt
The melt function converts a dataframe from wide form (a high number of columns) to narrow form (a
high number of rows). It is best explained via an example. Consider the following dataframe, which
contains consecutive daily measurements for 5 people. The long format of this dataframe can be
achieved using the melt function (the melt call itself is shown after the concat and merge examples
below). The column passed to the id_vars parameter remains the same, and the other columns are
combined under the variable and value columns.
In [66]:

df1 = pd.read_csv('Dataset12.csv')
In [67]:

df1

NAME A B C D E


0 ASHLEY 6 3 5 1 7

1 MAX 0 3 4 0 8

2 EMILY 7 6 6 0 6

Out[67]:

In [72]:

df2 = pd.read_csv('Dataset11.csv')
In [73]:

df2

NAME A B C D E

0 JANE 1 6 6 9 1

1 JOHN 8 1 2 8 1

2 Jack 4 9 8 6 3

Here is how we can combine them:

Out[73]

In [77]:

pd.concat([df1, df2], axis=0, ignore_index=True)


Out[77]:

NAME A B C D E


0 ASHLEY 6 3 5 1 7

1 MAX 0 3 4 0 8

2 EMILY 7 6 6 0 6

3 JANE 1 6 6 9 1

4 JOHN 8 1 2 8 1


5 Jack 4 9 8 6 3

In [76]:

pd.concat([df1, df2], axis=1, ignore_index=True)


Out[76]:

0 1 2 3 4 5 6 7 8 9 10 11

0 ASHLEY 6 3 5 1 7 JANE 1 6 6 9 1

1 MAX 0 3 4 0 8 JOHN 8 1 2 8 1

2 EMILY 7 6 6 0 6 Jack 4 9 8 6 3

Merge
Merge function also combines dataframes based on common values in a given column or columns.
Consider the following two dataframes.
In [35]:

pd.melt(df, id_vars='NAME').head()
Out[35]:


NAME variable value

0 JANE A 1.0

1 JOHN A 8.0

2 Jack A 4.0


3 JANE B 6.0

4 JOHN B 1.0

In [79]:

df3 = pd.read_csv('Customer.csv')
df4 = pd.read_csv('Order.csv')

In [81]:

df3

ID Name Category

0 1 Rane A

1 2 Alex B

2 3 Ayan A

3 4 Jack C


4 5 John B

Out[81]

In [82]:

df4
Out[82]:

ID Amount Payment

0 2 250 Credit Card

1 4 320 Credit Card

2 5 250 Cash

3 6 440 Cash

We can merge them based on the id column.


In [86]:

df3.merge(df4, on='ID')

ID Name Category Amount Payment

0 2 Alex B 250 Credit Card

1 4 Jack C 320 Credit Card

2 5 John B 250 Cash

Out[86]:


df3.merge(df4, on='ID', how = 'inner')


Out[87]:

ID Name Category Amount Payment

0 2 Alex B 250 Credit Card

1 4 Jack C 320 Credit Card


2 5 John B 250 Cash

We can perform Full join by just passing the how argument as ‘outer’ to the merge() function:
In [88]:

df3.merge(df4, on = 'ID', how = 'outer')


Out[88]:

ID Name Category Amount Payment

0 1 Rane A NaN NaN

1 2 Alex B 250.0 Credit Card

2 3 Ayan A NaN NaN

3 4 Jack C 320.0 Credit Card

4 5 John B 250.0 Cash

5 6 NaN NaN 440.0 Cash

Performing a left join is actually quite similar to a full join. Just change the how argument to ‘left’:
In [89]:


df3.merge(df4, on = 'ID', how = 'left')


Out[89]:

ID Name Category Amount Payment

0 1 Rane A NaN NaN

1 2 Alex B 250.0 Credit Card


2 3 Ayan A NaN NaN

3 4 Jack C 320.0 Credit Card

4 5 John B 250.0 Cash

Similar to other joins, we can perform a right join by changing the how argument to ‘right’:
In [91]:

df3.merge(df4, on = 'ID', how = 'right')


Out[91]:

ID Name Category Amount Payment

0 2 Alex B 250 Credit Card

1 4 Jack C 320 Credit Card

2 5 John B 250 Cash

3 6 NaN NaN 440 Cash

Get dummies
Some machine learning models cannot handle categorical variables. In such cases, we should
encode the categorical variables in a way that each category is represented as a column.
In [95]:

df5 = pd.read_csv('Customer.csv')
In [96]:

df5

Name Category Value

0 Rane A 14.2

1 Alex A 21.4

2 Ayan C 15.6

3 Jack B 12.1

4 John B 17.7

Out[96]:

In [98]:

pd.get_dummies(df5)
Out[98]:

Value Name_Alex Name_Ayan Name_Jack Name_John Name_Rane Category_A Category_B Category_C

0 14.2 0 0 0 0 1 1 0 0


1 21.4 1 0 0 0 0 1 0 0

2 15.6 0 1 0 0 0 0 0 1

3 12.1 0 0 1 0 0 0 1 0

4 17.7 0 0 0 1 0 0 1 0

For instance, in the first row, the name is Rane and the Category is A. Thus, the columns that represent
these values (Name_Rane and Category_A) are 1 and all other dummy columns are 0.

Pivot table
The pivot_table function transforms a dataframe into a format that explains the relationship among
variables. We have a dataframe that contains two categorical features (i.e. columns) and a
numerical feature. We want to see the average value of the categories in both columns. The
pivot_table function transforms the dataframe in a way that the average values, or any other
aggregation, can be seen clearly.
In [100]:

df5.pivot_table(index='Name', columns='Category', aggfunc='mean')


Out[100]:

Value

Category A B C

Name

Alex 21.4 NaN NaN

Ayan NaN NaN 15.6

Jack NaN 12.1 NaN

John NaN 17.7 NaN

Rane 14.2 NaN NaN

PRACTICAL-2

AIM: Perform data reduction with Linear Algebra methods and Manifold
Learning methods, with achieved accuracy from the make_classification dataset
using Logistic Regression.

Dimensionality reduction is an unsupervised learning technique. Nevertheless, it can be used as
a data transform pre-processing step for machine learning algorithms on classification and
regression predictive modeling datasets with supervised learning algorithms.
There are many dimensionality reduction algorithms to choose from and no single best algorithm
for all cases. Instead, it is a good idea to explore a range of dimensionality reduction algorithms
and different configurations for each algorithm. In this tutorial, you will discover how to fit and
evaluate top dimensionality reduction algorithms in Python.

 Dimensionality reduction seeks a lower-dimensional representation of numerical input data that
preserves the salient relationships in the data.
 There are many different dimensionality reduction algorithms and no single best method for all
datasets.
 How to implement, fit, and evaluate top dimensionality reduction algorithms in Python with the
scikit-learn machine learning library.

Dimensionality reduction refers to techniques for reducing the number of input variables in
training data. High-dimensionality might mean hundreds, thousands, or even millions of input
variables.

Fewer input dimensions often means correspondingly fewer parameters or a simpler structure in
the machine learning model, referred to as degrees of freedom. A model with too many degrees of
freedom is likely to overfit the training dataset and may not perform well on new data.
It is desirable to have simple models that generalize well, and in turn, input data with few input
variables. This is particularly true for linear models where the number of inputs and the degrees of
freedom of the model are often closely related.

Dimensionality reduction is a data preparation technique performed on data prior to modeling. It


might be performed after data cleaning and data scaling and before training a predictive model.

Linear Algebra Methods


Matrix factorization methods drawn from the field of linear algebra can be used for
dimensionality reduction.

Some of the more popular methods include:

 Principal Components Analysis


 Singular Value Decomposition
 Linear Discriminant Analysis

Manifold Learning Methods


Manifold learning methods seek a lower-dimensional projection of high dimensional input that
captures the salient properties of the input data.

Some of the more popular methods include:

 Isomap Embedding
 Locally Linear Embedding
 Modified Locally Linear Embedding

To begin the practical, we first introduce the classification dataset.

from sklearn.datasets import make_classification


make_classification
In [4]:

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)


In [5]:

X
Out[5]:
array([[ 0.08054814,  0.82273313, -1.21175254, ...,  2.88260938,
         1.79160028, -4.29708787],
       [-2.3302999 , -4.86608574, -3.88291317, ..., -0.14561581,
        -0.55489384,  0.61420772],
       [-1.19714954,  1.5556314 , -0.61871573, ...,  1.73481788,
         0.13067403, -3.13351468],
       ...,
       [ 0.61415067, -3.04457734, -3.15540898, ..., -0.3321506 ,
        -2.76644911,  0.81460546],
       [ 3.34221924, -1.33613258, -0.34013763, ..., -3.95225071,
         1.33439536, -0.69139029],
       [-1.49207892,  2.75225738, -1.22655776, ..., -3.10146388,
         2.34534351, -1.32021006]])
In [6]:

y
Out[6]:
array([0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,
1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1,
1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0,
1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0,
1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0,
1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0,
0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1,
0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1,
0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0,
1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1,


0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1,
0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1,
1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1,
1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0,
1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0,
0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0,
0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0,
0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0,
1, 0, 0, 1, 0, 0, 1, 1, 0, 1])
In [7]:

from numpy import mean


from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the model
model = LogisticRegression()
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy: 0.824 (0.034)

Linear Algebra Methods


Principal Component Analysis

In [8]:

from numpy import mean


from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.824 (0.034)

Singular Value Decomposition


In [9]:

from numpy import mean


from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)


# define the pipeline


steps = [('svd', TruncatedSVD(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.824 (0.034)

Linear Discriminant Analysis


In [10]:

from numpy import mean


from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.825 (0.034)

Manifold Learning Methods


Isomap Embedding
In [11]:


from numpy import mean


from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import Isomap
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('iso', Isomap(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy: 0.888 (0.029)

Locally Linear Embedding


In [12]:

from numpy import mean


from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lle', LocallyLinearEmbedding(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy: 0.886 (0.028)


Modified Locally Linear Embedding


In [13]:

from numpy import mean


from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lle', LocallyLinearEmbedding(n_components=5, method='modified', n_neighbors=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy: 0.848 (0.037)
In [ ]:


PRACTICAL-3

AIM: Demonstration of Pre-processing Methods i) Rescale Data and ii)


Binarize Data

i) Rescale Data

Your data must be prepared before you can build models. The data preparation process can
involve three steps: data selection, data preprocessing and data transformation. Your preprocessed
data may contain attributes with a mixture of scales for various quantities such as dollars,
kilograms and sales volume. Many machine learning methods expect, or are more effective when,
the data attributes have the same scale. Two popular data scaling methods are normalization and
standardization.
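
For reference, standardization (rescaling each attribute to zero mean and unit variance) can be
sketched with scikit-learn's StandardScaler. This block is an illustrative sketch added here, not one
of the manual's original cells:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = iris.data

# Standardize each attribute to zero mean and unit variance.
standardized_X = StandardScaler().fit_transform(X)
print(standardized_X[:3])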

Data Normalization
Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1. It is useful
to scale the input attributes for a model that relies on the magnitude of values, such as the distance
measures used in k-nearest neighbors and in the preparation of coefficients in regression. The
example below demonstrates data normalization of the Iris flowers dataset.
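
Note that sklearn's preprocessing.normalize, used below, rescales each sample (row) to unit norm; to
rescale each attribute (column) into the 0-1 range described above, MinMaxScaler is the usual tool.
A minimal sketch, not part of the original cells:

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# Rescale every attribute (column) into the range [0, 1].
rescaled_X = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
print(rescaled_X[:3])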

Normalization
In [1]:

from sklearn.datasets import load_iris


from sklearn import preprocessing
# load the iris dataset
iris = load_iris()
print(iris.data.shape)

(150, 4)
In [2]:

iris


Out[2]:
{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],


[4.4, 3.2, 1.3, 0.2],


[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],


[6.3, 2.3, 4.4, 1.3],


[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],


[6.4, 2.8, 5.6, 2.2],


[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]]),
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),

In [3]:

X = iris.data
y = iris.target

In [4]:

X
Out[4]:
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],


[4.6, 3.1, 1.5, 0.2],


[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],


[5.3, 3.7, 1.5, 0.2],


[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],
[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],


[5. , 2.3, 3.3, 1. ],


[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],
[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],


[6. , 3. , 4.8, 1.8],


[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]])
In [5]:

y
Out[5]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [6]:

# normalize the data attributes


normalized_X = preprocessing.normalize(X)
In [8]:

normalized_X
Out[8]:
array([[0.80377277, 0.55160877, 0.22064351, 0.0315205 ],
[0.82813287, 0.50702013, 0.23660939, 0.03380134],
[0.80533308, 0.54831188, 0.2227517 , 0.03426949],
[0.80003025, 0.53915082, 0.26087943, 0.03478392],
[0.790965 , 0.5694948 , 0.2214702 , 0.0316386 ],


[0.78417499, 0.5663486 , 0.2468699 , 0.05808704],


[0.78010936, 0.57660257, 0.23742459, 0.0508767 ],
[0.80218492, 0.54548574, 0.24065548, 0.0320874 ],
[0.80642366, 0.5315065 , 0.25658935, 0.03665562],
[0.81803119, 0.51752994, 0.25041771, 0.01669451],
[0.80373519, 0.55070744, 0.22325977, 0.02976797],
[0.786991 , 0.55745196, 0.26233033, 0.03279129],
[0.82307218, 0.51442011, 0.24006272, 0.01714734],
[0.8025126 , 0.55989251, 0.20529392, 0.01866308],
[0.81120865, 0.55945424, 0.16783627, 0.02797271],
[0.77381111, 0.59732787, 0.2036345 , 0.05430253],
[0.79428944, 0.57365349, 0.19121783, 0.05883625],
[0.80327412, 0.55126656, 0.22050662, 0.04725142],
[0.8068282 , 0.53788547, 0.24063297, 0.04246464],
[0.77964883, 0.58091482, 0.22930848, 0.0458617 ],
[0.8173379 , 0.51462016, 0.25731008, 0.03027177],
[0.78591858, 0.57017622, 0.23115252, 0.06164067],
[0.77577075, 0.60712493, 0.16864581, 0.03372916],
[0.80597792, 0.52151512, 0.26865931, 0.07901744],
[0.776114 , 0.54974742, 0.30721179, 0.03233808],
[0.82647451, 0.4958847 , 0.26447184, 0.03305898],
[0.79778206, 0.5424918 , 0.25529026, 0.06382256],
[0.80641965, 0.54278246, 0.23262105, 0.03101614],
[0.81609427, 0.5336001 , 0.21971769, 0.03138824],
[0.79524064, 0.54144043, 0.27072022, 0.03384003],
[0.80846584, 0.52213419, 0.26948861, 0.03368608],
[0.82225028, 0.51771314, 0.22840286, 0.06090743],
[0.76578311, 0.60379053, 0.22089897, 0.0147266 ],
[0.77867447, 0.59462414, 0.19820805, 0.02831544],
[0.81768942, 0.51731371, 0.25031309, 0.03337508],
[0.82512295, 0.52807869, 0.19802951, 0.03300492],
[0.82699754, 0.52627116, 0.19547215, 0.03007264],
[0.78523221, 0.5769053 , 0.22435206, 0.01602515],
[0.80212413, 0.54690282, 0.23699122, 0.03646019],
[0.80779568, 0.53853046, 0.23758697, 0.03167826],
[0.80033301, 0.56023311, 0.20808658, 0.04801998],
[0.86093857, 0.44003527, 0.24871559, 0.0573959 ],
[0.78609038, 0.57170209, 0.23225397, 0.03573138],
[0.78889479, 0.55222635, 0.25244633, 0.09466737],
[0.76693897, 0.57144472, 0.28572236, 0.06015208],
[0.82210585, 0.51381615, 0.23978087, 0.05138162],
[0.77729093, 0.57915795, 0.24385598, 0.030482 ],
[0.79594782, 0.55370283, 0.24224499, 0.03460643],
[0.79837025, 0.55735281, 0.22595384, 0.03012718],
[0.81228363, 0.5361072 , 0.22743942, 0.03249135],

52
DATAWAREHOSING AND DATAMINING 220643116004

[0.76701103, 0.35063361, 0.51499312, 0.15340221],


[0.74549757, 0.37274878, 0.52417798, 0.17472599],
[0.75519285, 0.33928954, 0.53629637, 0.16417236],
[0.75384916, 0.31524601, 0.54825394, 0.17818253],
[0.7581754 , 0.32659863, 0.5365549 , 0.17496355],
[0.72232962, 0.35482858, 0.57026022, 0.16474184],
[0.72634846, 0.38046824, 0.54187901, 0.18446945],
[0.75916547, 0.37183615, 0.51127471, 0.15493173],
[0.76301853, 0.33526572, 0.53180079, 0.15029153],
[0.72460233, 0.37623583, 0.54345175, 0.19508524],
[0.76923077, 0.30769231, 0.53846154, 0.15384615],
[0.73923462, 0.37588201, 0.52623481, 0.187941 ],
[0.78892752, 0.28927343, 0.52595168, 0.13148792],
[0.73081412, 0.34743622, 0.56308629, 0.16772783],
[0.75911707, 0.3931142 , 0.48800383, 0.17622361],
[0.76945444, 0.35601624, 0.50531337, 0.16078153],
[0.70631892, 0.37838513, 0.5675777 , 0.18919257],
[0.75676497, 0.35228714, 0.53495455, 0.13047672],
[0.76444238, 0.27125375, 0.55483721, 0.18494574],
[0.76185188, 0.34011245, 0.53057542, 0.14964948],
[0.6985796 , 0.37889063, 0.56833595, 0.21312598],
[0.77011854, 0.35349703, 0.50499576, 0.16412362],
[0.74143307, 0.29421947, 0.57667016, 0.17653168],
[0.73659895, 0.33811099, 0.56754345, 0.14490471],
[0.76741698, 0.34773582, 0.51560829, 0.15588157],
[0.76785726, 0.34902603, 0.51190484, 0.16287881],
[0.76467269, 0.31486523, 0.53976896, 0.15743261],
[0.74088576, 0.33173989, 0.55289982, 0.18798594],
[0.73350949, 0.35452959, 0.55013212, 0.18337737],
[0.78667474, 0.35883409, 0.48304589, 0.13801311],
[0.76521855, 0.33391355, 0.52869645, 0.15304371],
[0.77242925, 0.33706004, 0.51963422, 0.14044168],
[0.76434981, 0.35581802, 0.51395936, 0.15814134],
[0.70779525, 0.31850786, 0.60162596, 0.1887454 ],
[0.69333409, 0.38518561, 0.57777841, 0.1925928 ],
[0.71524936, 0.40530797, 0.53643702, 0.19073316],
[0.75457341, 0.34913098, 0.52932761, 0.16893434],
[0.77530021, 0.28304611, 0.54147951, 0.15998258],
[0.72992443, 0.39103094, 0.53440896, 0.16944674],
[0.74714194, 0.33960997, 0.54337595, 0.17659719],
[0.72337118, 0.34195729, 0.57869695, 0.15782644],
[0.73260391, 0.36029701, 0.55245541, 0.1681386 ],
[0.76262994, 0.34186859, 0.52595168, 0.1577855 ],
[0.76986879, 0.35413965, 0.5081134 , 0.15397376],
[0.73544284, 0.35458851, 0.55158213, 0.1707278 ],

53
DATAWAREHOSING AND DATAMINING 220643116004

[0.73239618, 0.38547167, 0.53966034, 0.15418867],


[0.73446047, 0.37367287, 0.5411814 , 0.16750853],
[0.75728103, 0.3542121 , 0.52521104, 0.15878473],
[0.78258054, 0.38361791, 0.4603415 , 0.16879188],
[0.7431482 , 0.36505526, 0.5345452 , 0.16948994],
[0.65387747, 0.34250725, 0.62274045, 0.25947519],
[0.69052512, 0.32145135, 0.60718588, 0.22620651],
[0.71491405, 0.30207636, 0.59408351, 0.21145345],
[0.69276796, 0.31889319, 0.61579374, 0.1979337 ],
[0.68619022, 0.31670318, 0.61229281, 0.232249 ],
[0.70953708, 0.28008043, 0.61617694, 0.1960563 ],
[0.67054118, 0.34211284, 0.61580312, 0.23263673],
[0.71366557, 0.28351098, 0.61590317, 0.17597233],
[0.71414125, 0.26647062, 0.61821183, 0.19185884],
[0.69198788, 0.34599394, 0.58626751, 0.24027357],
[0.71562645, 0.3523084 , 0.56149152, 0.22019275],
[0.71576546, 0.30196356, 0.59274328, 0.21249287],
[0.71718148, 0.31640359, 0.58007326, 0.22148252],
[0.6925518 , 0.30375079, 0.60750157, 0.24300063],
[0.67767924, 0.32715549, 0.59589036, 0.28041899],
[0.69589887, 0.34794944, 0.57629125, 0.25008866],
[0.70610474, 0.3258945 , 0.59747324, 0.1955367 ],
[0.69299099, 0.34199555, 0.60299216, 0.19799743],
[0.70600618, 0.2383917 , 0.63265489, 0.21088496],
[0.72712585, 0.26661281, 0.60593821, 0.18178146],
[0.70558934, 0.32722984, 0.58287815, 0.23519645],
[0.68307923, 0.34153961, 0.59769433, 0.24395687],
[0.71486543, 0.25995106, 0.62202576, 0.18567933],
[0.73122464, 0.31338199, 0.56873028, 0.20892133],
[0.69595601, 0.3427843 , 0.59208198, 0.21813547],
[0.71529453, 0.31790868, 0.59607878, 0.17882363],
[0.72785195, 0.32870733, 0.56349829, 0.21131186],
[0.71171214, 0.35002236, 0.57170319, 0.21001342],
[0.69594002, 0.30447376, 0.60894751, 0.22835532],
[0.73089855, 0.30454106, 0.58877939, 0.1624219 ],
[0.72766159, 0.27533141, 0.59982915, 0.18683203],
[0.71578999, 0.34430405, 0.5798805 , 0.18121266],
[0.69417747, 0.30370264, 0.60740528, 0.2386235 ],
[0.72366005, 0.32162669, 0.58582004, 0.17230001],
[0.69385414, 0.29574111, 0.63698085, 0.15924521],
[0.73154399, 0.28501714, 0.57953485, 0.21851314],
[0.67017484, 0.36168166, 0.59571097, 0.2553047 ],
[0.69804799, 0.338117 , 0.59988499, 0.196326 ],
[0.71066905, 0.35533453, 0.56853524, 0.21320072],
[0.72415258, 0.32534391, 0.56672811, 0.22039426],

54
DATAWAREHOSING AND DATAMINING 220643116004

[0.69997037, 0.32386689, 0.58504986, 0.25073566],


[0.73337886, 0.32948905, 0.54206264, 0.24445962],
[0.69052512, 0.32145135, 0.60718588, 0.22620651],
[0.69193502, 0.32561648, 0.60035539, 0.23403685],
[0.68914871, 0.33943145, 0.58629069, 0.25714504],
[0.72155725, 0.32308533, 0.56001458, 0.24769876],
[0.72965359, 0.28954508, 0.57909015, 0.22005426],
[0.71653899, 0.3307103 , 0.57323119, 0.22047353],
[0.67467072, 0.36998072, 0.58761643, 0.25028107],
[0.69025916, 0.35097923, 0.5966647 , 0.21058754]])
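As a quick sanity check (a minimal sketch, not part of the original run, assuming numpy and the
normalized_X array above are available): after preprocessing.normalize, every row should have an
L2 norm of 1.

import numpy as np
print(np.linalg.norm(normalized_X, axis=1)[:5])   # expected: values all very close to 1.0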
In [ ]:

Data Standardization

Standardization refers to shifting the distribution of each attribute to have a mean of zero and a
standard deviation of one (unit variance). It is useful to standardize attributes for a model that relies
on the distribution of attributes, such as Gaussian processes. The example below demonstrates data
standardization of the Iris flowers dataset.

Data Standardization
In [9]:

from sklearn.datasets import load_iris


from sklearn import preprocessing
# load the Iris dataset
iris = load_iris()

In [10]:


iris

Out[10]:
{'data': array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],


[4.5, 2.3, 1.3, 0.3],


[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],


[6.7, 3.1, 4.7, 1.5],


[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],


[7.9, 3.8, 6.4, 2. ],


[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]]),
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
In [11]:

X = iris.data
y = iris.target

In [12]:

standardized_X = preprocessing.scale(X)
In [14]:


standardized_X

Out[14]:
array([[-9.00681170e-01, 1.01900435e+00, -1.34022653e+00,


-1.31544430e+00],
[-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
-1.31544430e+00],
[-1.38535265e+00, 3.28414053e-01, -1.39706395e+00,
-1.31544430e+00],
[-1.50652052e+00, 9.82172869e-02, -1.28338910e+00,
-1.31544430e+00],
[-1.02184904e+00, 1.24920112e+00, -1.34022653e+00,
-1.31544430e+00],
[-5.37177559e-01, 1.93979142e+00, -1.16971425e+00,
-1.05217993e+00],
[-1.50652052e+00, 7.88807586e-01, -1.34022653e+00,
-1.18381211e+00],
[-1.02184904e+00, 7.88807586e-01, -1.28338910e+00,
-1.31544430e+00],
[-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
-1.31544430e+00],
[-1.14301691e+00, 9.82172869e-02, -1.28338910e+00,
-1.44707648e+00],
[-5.37177559e-01, 1.47939788e+00, -1.28338910e+00,
-1.31544430e+00],
[-1.26418478e+00, 7.88807586e-01, -1.22655167e+00,
-1.31544430e+00],
[-1.26418478e+00, -1.31979479e-01, -1.34022653e+00,

-1.44707648e+00],
[-1.87002413e+00, -1.31979479e-01, -1.51073881e+00,
-1.44707648e+00],
[-5.25060772e-02, 2.16998818e+00, -1.45390138e+00,
-1.31544430e+00],
[-1.73673948e-01, 3.09077525e+00, -1.28338910e+00,
-1.05217993e+00],
[-5.37177559e-01, 1.93979142e+00, -1.39706395e+00,
-1.05217993e+00],
[-9.00681170e-01, 1.01900435e+00, -1.34022653e+00,
-1.18381211e+00],
[-1.73673948e-01, 1.70959465e+00, -1.16971425e+00,


-1.18381211e+00],
[-9.00681170e-01, 1.70959465e+00, -1.28338910e+00,
-1.18381211e+00],
[-5.37177559e-01, 7.88807586e-01, -1.16971425e+00,
-1.31544430e+00],
[-9.00681170e-01, 1.47939788e+00, -1.28338910e+00,
-1.05217993e+00],
[-1.50652052e+00, 1.24920112e+00, -1.56757623e+00,
-1.31544430e+00],
[-9.00681170e-01, 5.58610819e-01, -1.16971425e+00,
-9.20547742e-01],
[-1.26418478e+00, 7.88807586e-01, -1.05603939e+00,
-1.31544430e+00],
[-1.02184904e+00, -1.31979479e-01, -1.22655167e+00,
-1.31544430e+00],
[-1.02184904e+00, 7.88807586e-01, -1.22655167e+00,
-1.05217993e+00],
[-7.79513300e-01, 1.01900435e+00, -1.28338910e+00,
-1.31544430e+00],
[-7.79513300e-01, 7.88807586e-01, -1.34022653e+00,
-1.31544430e+00],
[-1.38535265e+00, 3.28414053e-01, -1.22655167e+00,
-1.31544430e+00],
[-1.26418478e+00, 9.82172869e-02, -1.22655167e+00,
-1.31544430e+00],
[-5.37177559e-01, 7.88807586e-01, -1.28338910e+00,
-1.05217993e+00],
[-7.79513300e-01, 2.40018495e+00, -1.28338910e+00,
-1.44707648e+00],
[-4.16009689e-01, 2.63038172e+00, -1.34022653e+00,
-1.31544430e+00],
[-1.14301691e+00, 9.82172869e-02, -1.28338910e+00,
-1.31544430e+00],
[-1.02184904e+00, 3.28414053e-01, -1.45390138e+00,
-1.31544430e+00],
[-4.16009689e-01, 1.01900435e+00, -1.39706395e+00,
-1.31544430e+00],
[-1.14301691e+00, 1.24920112e+00, -1.34022653e+00,
-1.44707648e+00],
[-1.74885626e+00, -1.31979479e-01, -1.39706395e+00,
-1.31544430e+00],
[-9.00681170e-01, 7.88807586e-01, -1.28338910e+00,
-1.31544430e+00],
[-1.02184904e+00, 1.01900435e+00, -1.39706395e+00,
-1.18381211e+00],

[-1.62768839e+00, -1.74335684e+00, -1.39706395e+00,


-1.18381211e+00],
[-1.74885626e+00, 3.28414053e-01, -1.39706395e+00,
-1.31544430e+00],
[-1.02184904e+00, 1.01900435e+00, -1.22655167e+00,
-7.88915558e-01],
[-9.00681170e-01, 1.70959465e+00, -1.05603939e+00,
-1.05217993e+00],
[-1.26418478e+00, -1.31979479e-01, -1.34022653e+00,
-1.18381211e+00],
[-9.00681170e-01, 1.70959465e+00, -1.22655167e+00,
-1.31544430e+00],
[-1.50652052e+00, 3.28414053e-01, -1.34022653e+00,
-1.31544430e+00],
[-6.58345429e-01, 1.47939788e+00, -1.28338910e+00,
-1.31544430e+00],
[-1.02184904e+00, 5.58610819e-01, -1.34022653e+00,
-1.31544430e+00],
[ 1.40150837e+00, 3.28414053e-01, 5.35408562e-01,
2.64141916e-01],
[ 6.74501145e-01, 3.28414053e-01, 4.21733708e-01,
3.95774101e-01],
[ 1.28034050e+00, 9.82172869e-02, 6.49083415e-01,
3.95774101e-01],
[-4.16009689e-01, -1.74335684e+00, 1.37546573e-01,
1.32509732e-01],
[ 7.95669016e-01, -5.92373012e-01, 4.78571135e-01,
3.95774101e-01],
[-1.73673948e-01, -5.92373012e-01, 4.21733708e-01,
1.32509732e-01],
[ 5.53333275e-01, 5.58610819e-01, 5.35408562e-01,
5.27406285e-01],
[-1.14301691e+00, -1.51316008e+00, -2.60315415e-01,
-2.62386821e-01],
[ 9.16836886e-01, -3.62176246e-01, 4.78571135e-01,
1.32509732e-01],
[-7.79513300e-01, -8.22569778e-01, 8.07091462e-02,
2.64141916e-01],
[-1.02184904e+00, -2.43394714e+00, -1.46640561e-01,
-2.62386821e-01],
[ 6.86617933e-02, -1.31979479e-01, 2.51221427e-01,
3.95774101e-01],
[ 1.89829664e-01, -1.97355361e+00, 1.37546573e-01,
-2.62386821e-01],
[ 3.10997534e-01, -3.62176246e-01, 5.35408562e-01,

2.64141916e-01],
[-2.94841818e-01, -3.62176246e-01, -8.98031345e-02,
1.32509732e-01],
[ 1.03800476e+00, 9.82172869e-02, 3.64896281e-01,
2.64141916e-01],
[-2.94841818e-01, -1.31979479e-01, 4.21733708e-01,
3.95774101e-01],
[-5.25060772e-02, -8.22569778e-01, 1.94384000e-01,
-2.62386821e-01],
[ 4.32165405e-01, -1.97355361e+00, 4.21733708e-01,
3.95774101e-01],
[-2.94841818e-01, -1.28296331e+00, 8.07091462e-02,
-1.30754636e-01],
[ 6.86617933e-02, 3.28414053e-01, 5.92245988e-01,
7.90670654e-01],
[ 3.10997534e-01, -5.92373012e-01, 1.37546573e-01,
1.32509732e-01],
[ 5.53333275e-01, -1.28296331e+00, 6.49083415e-01,
3.95774101e-01],
[ 3.10997534e-01, -5.92373012e-01, 5.35408562e-01,
8.77547895e-04],
[ 6.74501145e-01, -3.62176246e-01, 3.08058854e-01,
1.32509732e-01],
[ 9.16836886e-01, -1.31979479e-01, 3.64896281e-01,
2.64141916e-01],
[ 1.15917263e+00, -5.92373012e-01, 5.92245988e-01,
2.64141916e-01],
[ 1.03800476e+00, -1.31979479e-01, 7.05920842e-01,
6.59038469e-01],
[ 1.89829664e-01, -3.62176246e-01, 4.21733708e-01,
3.95774101e-01],
[-1.73673948e-01, -1.05276654e+00, -1.46640561e-01,
-2.62386821e-01],
[-4.16009689e-01, -1.51316008e+00, 2.38717193e-02,
-1.30754636e-01],
[-4.16009689e-01, -1.51316008e+00, -3.29657076e-02,
-2.62386821e-01],
[-5.25060772e-02, -8.22569778e-01, 8.07091462e-02,
8.77547895e-04],
[ 1.89829664e-01, -8.22569778e-01, 7.62758269e-01,
5.27406285e-01],
[-5.37177559e-01, -1.31979479e-01, 4.21733708e-01,
3.95774101e-01],
[ 1.89829664e-01, 7.88807586e-01, 4.21733708e-01,
5.27406285e-01],

[ 1.03800476e+00, 9.82172869e-02, 5.35408562e-01,


3.95774101e-01],
[ 5.53333275e-01, -1.74335684e+00, 3.64896281e-01,
1.32509732e-01],
[-2.94841818e-01, -1.31979479e-01, 1.94384000e-01,
1.32509732e-01],
[-4.16009689e-01, -1.28296331e+00, 1.37546573e-01,
1.32509732e-01],
[-4.16009689e-01, -1.05276654e+00, 3.64896281e-01,
8.77547895e-04],
[ 3.10997534e-01, -1.31979479e-01, 4.78571135e-01,
2.64141916e-01],
[-5.25060772e-02, -1.05276654e+00, 1.37546573e-01,
8.77547895e-04],
[-1.02184904e+00, -1.74335684e+00, -2.60315415e-01,
-2.62386821e-01],
[-2.94841818e-01, -8.22569778e-01, 2.51221427e-01,
1.32509732e-01],
[-1.73673948e-01, -1.31979479e-01, 2.51221427e-01,
8.77547895e-04],
[-1.73673948e-01, -3.62176246e-01, 2.51221427e-01,
1.32509732e-01],
[ 4.32165405e-01, -3.62176246e-01, 3.08058854e-01,
1.32509732e-01],
[-9.00681170e-01, -1.28296331e+00, -4.30827696e-01,
-1.30754636e-01],
[-1.73673948e-01, -5.92373012e-01, 1.94384000e-01,
1.32509732e-01],
[ 5.53333275e-01, 5.58610819e-01, 1.27429511e+00,
1.71209594e+00],
[-5.25060772e-02, -8.22569778e-01, 7.62758269e-01,
9.22302838e-01],
[ 1.52267624e+00, -1.31979479e-01, 1.21745768e+00,
1.18556721e+00],
[ 5.53333275e-01, -3.62176246e-01, 1.04694540e+00,
7.90670654e-01],
[ 7.95669016e-01, -1.31979479e-01, 1.16062026e+00,
1.31719939e+00],
[ 2.12851559e+00, -1.31979479e-01, 1.61531967e+00,
1.18556721e+00],
[-1.14301691e+00, -1.28296331e+00, 4.21733708e-01,
6.59038469e-01],
[ 1.76501198e+00, -3.62176246e-01, 1.44480739e+00,
7.90670654e-01],
[ 1.03800476e+00, -1.28296331e+00, 1.16062026e+00,

7.90670654e-01],
[ 1.64384411e+00, 1.24920112e+00, 1.33113254e+00,
1.71209594e+00],
[ 7.95669016e-01, 3.28414053e-01, 7.62758269e-01,
1.05393502e+00],
[ 6.74501145e-01, -8.22569778e-01, 8.76433123e-01,
9.22302838e-01],
[ 1.15917263e+00, -1.31979479e-01, 9.90107977e-01,
1.18556721e+00],
[-1.73673948e-01, -1.28296331e+00, 7.05920842e-01,
1.05393502e+00],
[-5.25060772e-02, -5.92373012e-01, 7.62758269e-01,
1.58046376e+00],
[ 6.74501145e-01, 3.28414053e-01, 8.76433123e-01,
1.44883158e+00],
[ 7.95669016e-01, -1.31979479e-01, 9.90107977e-01,
7.90670654e-01],
[ 2.24968346e+00, 1.70959465e+00, 1.67215710e+00,
1.31719939e+00],
[ 2.24968346e+00, -1.05276654e+00, 1.78583195e+00,
1.44883158e+00],
[ 1.89829664e-01, -1.97355361e+00, 7.05920842e-01,
3.95774101e-01],
[ 1.28034050e+00, 3.28414053e-01, 1.10378283e+00,
1.44883158e+00],
[-2.94841818e-01, -5.92373012e-01, 6.49083415e-01,
1.05393502e+00],
[ 2.24968346e+00, -5.92373012e-01, 1.67215710e+00,
1.05393502e+00],
[ 5.53333275e-01, -8.22569778e-01, 6.49083415e-01,
7.90670654e-01],
[ 1.03800476e+00, 5.58610819e-01, 1.10378283e+00,
1.18556721e+00],
[ 1.64384411e+00, 3.28414053e-01, 1.27429511e+00,
7.90670654e-01],
[ 4.32165405e-01, -5.92373012e-01, 5.92245988e-01,
7.90670654e-01],
[ 3.10997534e-01, -1.31979479e-01, 6.49083415e-01,
7.90670654e-01],
[ 6.74501145e-01, -5.92373012e-01, 1.04694540e+00,
1.18556721e+00],
[ 1.64384411e+00, -1.31979479e-01, 1.16062026e+00,
5.27406285e-01],
[ 1.88617985e+00, -5.92373012e-01, 1.33113254e+00,
9.22302838e-01],

[ 2.49201920e+00, 1.70959465e+00, 1.50164482e+00,


1.05393502e+00],
[ 6.74501145e-01, -5.92373012e-01, 1.04694540e+00,
1.31719939e+00],
[ 5.53333275e-01, -5.92373012e-01, 7.62758269e-01,
3.95774101e-01],
[ 3.10997534e-01, -1.05276654e+00, 1.04694540e+00,
2.64141916e-01],
[ 2.24968346e+00, -1.31979479e-01, 1.33113254e+00,
1.44883158e+00],
[ 5.53333275e-01, 7.88807586e-01, 1.04694540e+00,
1.58046376e+00],
[ 6.74501145e-01, 9.82172869e-02, 9.90107977e-01,
7.90670654e-01],
[ 1.89829664e-01, -1.31979479e-01, 5.92245988e-01,
7.90670654e-01],
[ 1.28034050e+00, 9.82172869e-02, 9.33270550e-01,
1.18556721e+00],
[ 1.03800476e+00, 9.82172869e-02, 1.04694540e+00,
1.58046376e+00],
[ 1.28034050e+00, 9.82172869e-02, 7.62758269e-01,
1.44883158e+00],
[-5.25060772e-02, -8.22569778e-01, 7.62758269e-01,
9.22302838e-01],
[ 1.15917263e+00, 3.28414053e-01, 1.21745768e+00,
1.44883158e+00],
[ 1.03800476e+00, 5.58610819e-01, 1.10378283e+00,
1.71209594e+00],
[ 1.03800476e+00, -1.31979479e-01, 8.19595696e-01,
1.44883158e+00],
[ 5.53333275e-01, -1.28296331e+00, 7.05920842e-01,
9.22302838e-01],
[ 7.95669016e-01, -1.31979479e-01, 8.19595696e-01,
1.05393502e+00],
[ 4.32165405e-01, 7.88807586e-01, 9.33270550e-01,
1.44883158e+00],
[ 6.86617933e-02, -1.31979479e-01, 7.62758269e-01,
7.90670654e-01]])
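A quick way to confirm the effect of preprocessing.scale (a small sketch added here, assuming
numpy and standardized_X are in scope) is to check that every column now has mean close to 0 and
standard deviation close to 1:

import numpy as np
print(standardized_X.mean(axis=0))   # values very close to 0
print(standardized_X.std(axis=0))    # values very close to 1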

ii) Binarize Data


Binarization is the process of dividing data into two groups and assigning one of two values to
all the members of the same group. This is usually accomplished by defining a threshold t and
assigning the value 0 to all the data points below the threshold and 1 to those above it.
sklearn.preprocessing.Binarizer() is a method which belongs to the preprocessing module. It plays
a key role in the discretization of continuous feature values.

Binarize data
In [13]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing

In [14]:

data = pd.read_csv('p3.csv')
In [15]:

data
Out[15]:

Country Age Salary Purchased

0 France 44.0 72000.0 No

1 Spain 27.0 48000.0 Yes

2 Germany 30.0 54000.0 No

3 Spain 38.0 61000.0 No


4 Germany 40.0 NaN Yes

5 France 35.0 58000.0 Yes

6 Spain NaN 52000.0 No

7 France 48.0 79000.0 Yes

8 Germany 50.0 83000.0 No

9 France 37.0 67000.0 Yes

In [33]:

data["Salary"].fillna(method='ffill', inplace=True)
In [40]:

data["Age"].fillna(method='ffill', inplace=True)
In [41]:

data

Out[41]:
Country Age Salary Purchased

0 France 44.0 72000.0 No

1 Spain 27.0 48000.0 Yes

2 Germany 30.0 54000.0 No

3 Spain 38.0 61000.0 No


4 Germany 40.0 61000.0 Yes

5 France 35.0 58000.0 Yes

6 Spain 35.0 52000.0 No

7 France 48.0 79000.0 Yes

8 Germany 50.0 83000.0 No

9 France 37.0 67000.0 Yes


In [42]:

age = data.iloc[:, 1].values


salary = data.iloc[:, 2].values
print ("\nOriginal age data values : \n", age)
print ("\nOriginal salary data values : \n", salary)

Original age data values :


[44. 27. 30. 38. 40. 35. 35. 48. 50. 37.]

Original salary data values :


[72000. 48000. 54000. 61000. 61000. 58000. 52000. 79000. 83000. 67000.]
In [43]:

from sklearn.preprocessing import Binarizer

x = age
x = x.reshape(1, -1)


y = salary
y = y.reshape(1, -1)
In [44]:

# For age, let threshold be 35


# For salary, let threshold be 61000
binarizer_1 = Binarizer(threshold=35)
binarizer_2 = Binarizer(threshold=61000)
In [45]:

print ("\nBinarized age : \n", binarizer_1.fit_transform(x))

print ("\nBinarized salary : \n", binarizer_2.fit_transform(y))

Binarized age :
[[1. 0. 0. 1. 1. 0. 0. 1. 1. 1.]]

Binarized salary :
[[1. 0. 0. 0. 0. 0. 0. 1. 1. 1.]]
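The same binarization can be reproduced with a plain NumPy comparison (a hedged sketch using
the age and salary arrays defined above; Binarizer maps values at or below the threshold to 0 and
values above it to 1):

import numpy as np
print((age > 35).astype(int))        # [1 0 0 1 1 0 0 1 1 1]
print((salary > 61000).astype(int))  # [1 0 0 0 0 0 0 1 1 1]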


PRACTICAL-4

AIM: Perform Different Normalization methods. i) Maximum Absolute


Scaling, ii) Min-Max Feature Scaling, and iii) Z-score Method.

i) Maximum Absolute Scaling


Maximum absolute scaling scales the data by its maximum absolute value; that is, it divides every
observation by the maximum absolute value of the variable: x_scaled = x / max(|x|). The result of this
transformation is a distribution in which the values vary approximately within the range of -1 to 1.

Maximum Absolute Scaling


In [1]:

from sklearn.preprocessing import MaxAbsScaler


In [2]:

X = [[ 1., -1., 2.],
     [ 2., 0., 0.],
     [ 0., 1., -1.]]

In [3]:

transformer = MaxAbsScaler().fit(X)
In [4]:

transformer.transform(X)
Out[4]:
array([[ 0.5, -1. , 1. ],
[ 1. , 0. , 0. ],


[ 0. , 1. , -0.5]])
In [ ]:
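The same result can be computed by hand (a minimal sketch assuming only NumPy), which makes
the formula explicit: divide each column by its maximum absolute value.

import numpy as np
X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
print(X / np.abs(X).max(axis=0))   # matches the MaxAbsScaler output above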

ii) Min-Max Feature Scaling

Transform features by scaling each feature to a given range. This estimator scales and translates
each feature individually such that it is in the given range on the training set, e.g. between zero
and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))


X_scaled = X_std * (max - min) + min
Where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.

Min-Max Feature Scaling

In [5]:

from sklearn.preprocessing import MinMaxScaler


In [6]:

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]


In [7]:

scaler = MinMaxScaler()
scaler.fit(data)
In [10]:


print(scaler.data_max_)
[ 1. 18.]
In [12]:

print(scaler.transform(data))
[[0. 0. ]
[0.25 0.25]
[0.5 0.5 ]
[1. 1. ]]
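For comparison, here is the same transformation written out manually using the formula above (a
small sketch assuming NumPy; with the default feature_range of (0, 1), X_scaled equals X_std):

import numpy as np
data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]])
X_std = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
print(X_std)   # same values as scaler.transform(data)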

iii) Z-score Method.


In statistics, a z-score tells us how many standard deviations away a value is from the mean. We
use the following formula to calculate a z-score:

z = (X – μ) / σ

where:

 X is a single raw data value


 μ is the population mean
 σ is the population standard deviation

How to Calculate Z-Scores in Python

We can calculate z-scores in Python using scipy.stats.zscore, which uses the following syntax:

scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate')

where:

 a: an array like object containing data


 axis: the axis along which to calculate the z-scores. Default is 0.
 ddof: degrees of freedom correction in the calculation of the standard deviation. Default
is 0.
 nan_policy: how to handle when input contains nan. Default is 'propagate', which returns
nan. 'raise' throws an error and 'omit' performs calculations ignoring nan values.


Z_Score

In [25]:

import scipy.stats as stats


import numpy as np

In [26]:

data = np.array([[5, 6, 7, 7, 8],


[8, 8, 8, 9, 9],
[2, 2, 4, 4, 5]])

In [27]:

stats.zscore(data, axis=1)

Out[27]:
array([[-1.56892908, -0.58834841, 0.39223227, 0.39223227, 1.37281295],
[-0.81649658, -0.81649658, -0.81649658, 1.22474487, 1.22474487],
[-1.16666667, -1.16666667, 0.5 , 0.5 , 1.33333333]])
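To see where these numbers come from, the first row can be checked by hand (a small sketch
assuming the data array above; scipy.stats.zscore uses the population standard deviation, ddof=0,
by default):

row = data[0]                        # [5, 6, 7, 7, 8]
z = (row - row.mean()) / row.std()   # numpy std() defaults to ddof=0
print(z)                             # approximately [-1.569, -0.588, 0.392, 0.392, 1.373]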
In [ ]:


PRACTICAL-5

AIM: Introduce and perform Attribute Relevance.

Attribute Relevance

The attribute relevance analysis phase has the task of recognizing the attributes (characteristics) with
the strongest impact on churn. The attributes that show the greatest segregation power with respect to
churn (churn = "Yes" or "No") are selected by attribute relevance analysis as the best candidates for
building a predictive churn model. Attribute relevance analysis is by no means used only for predictive
churn model development; you can use it for any classification task. It is based on two terms:
Information Value and Weight of Evidence.

Information Value and Weight of Evidence


Weight of Evidence is explained as follows:
The weight of evidence tells the predictive power of an independent variable in relation to the
dependent variable. Since it evolved from the credit scoring world, it is generally described as a
measure of the separation of good and bad customers. "Bad customers" refers to the customers
who defaulted on a loan and "good customers" refers to the customers who paid the loan back.

And from the same source, Information Value is explained as follows:

Information value is one of the most useful techniques for selecting important variables in a predictive
model. It helps to rank variables on the basis of their importance.

WoE = ln(Relative Frequency of Goods / Relative Frequency of Bads)

IV = Sum over i of (Distribution of Goods_i - Distribution of Bads_i) * WoE_i
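As a small worked example (a sketch using the Female row of the Gender table computed in Step 2
below, so the numbers can be cross-checked there):

import numpy as np
distr_good, distr_bad = 0.427477, 0.559156         # share of non-churners / churners who are Female
woe = np.log(distr_good / distr_bad)               # approximately -0.2685
iv_contribution = (distr_good - distr_bad) * woe   # approximately 0.0354
print(woe, iv_contribution)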

If we're talking about churn modeling, Goods would be clients who didn't churn, and Bads would
be clients who committed churn. Just from this, you can see the simplicity behind the formulas.
The attribute relevance analysis for the churn modeling example is divided into 6 steps:
1. Data Cleaning and Preparation,
2. Calculating IV and WoE,
3. Identifying Churners Profile,
4. Coarse Classing,
5. Dummy Variable Creation,
6. Correlations between Dummy Variables.

Step 1. Data Cleaning and Preparation

The dataset contains no missing values, so prerequisite 1 of 2 is satisfied.

There are 10,000 observations and 14 columns. From here we proceed to data cleaning. Here are
the steps:

1. Delete RowNumber, CustomerId, and Surname, since they are arbitrary and can't be used.
2. Group CreditScore, Age, Balance, and EstimatedSalary into 5 bins.
3. Delete CreditScore, Age, Balance, and EstimatedSalary because they aren't needed anymore.
In [8]:

import pandas as pd
import numpy as np

In [2]:

data = pd.read_csv('p5.csv')
In [3]:

data
Out[3]:

      RowNumber  CustomerId    Surname  CreditScore Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0             1    15634602   Hargrave          619    France  Female   42       2       0.00              1          1               1        101348.88       1
1             2    15647311       Hill          608     Spain  Female   41       1   83807.86              1          0               1        112542.58       0
2             3    15619304       Onio          502    France  Female   42       8  159660.80              3          1               0        113931.57       1
3             4    15701354       Boni          699    France  Female   39       1       0.00              2          0               0         93826.63       0
4             5    15737888   Mitchell          850     Spain  Female   43       2  125510.82              1          1               1         79084.10       0
...         ...         ...        ...          ...       ...     ...  ...     ...        ...            ...        ...             ...              ...     ...
9995       9996    15606229   Obijiaku          771    France    Male   39       5       0.00              2          1               0         96270.64       0
9996       9997    15569892  Johnstone          516    France    Male   35      10   57369.61              1          1               1        101699.77       0
9997       9998    15584532        Liu          709    France  Female   36       7       0.00              1          0               1         42085.58       1
9998       9999    15682355  Sabbatini          772   Germany    Male   42       3   75075.31              2          1               0         92888.52       1
9999      10000    15628319     Walker          792    France  Female   28       4  130142.79              1          1               0         38190.78       0

10000 rows × 14 columns


In [4]:

data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

data['CreditScore_Bins'] = pd.qcut(data['CreditScore'], 5,
    labels=['CS_lt_566', 'CS_556_to_627', 'CS_627_to_678', 'CS_678_to_735', 'CS_gt_735'])
data['Age_Bins'] = pd.qcut(data['Age'], 5,
    labels=['Age_lt_31', 'Age_31_to_35', 'Age_35_to_40', 'Age_40_to_46', 'Age_gt_46'])
data['Balance_Bins'] = pd.qcut(data['Balance'], 5,
    labels=['Bal_lt_73080', 'Bal_73080_to_110138', 'Bal_110138_to_133710', 'Bal_gt_133710'],
    duplicates='drop')
data['Salary_Bins'] = pd.qcut(data['EstimatedSalary'], 5,
    labels=['Sal_lt_41050', 'Sal_41050_to_80238', 'Sal_80238_to_119710', 'Sal_119710_to_159836',
            'Sal_159836_to_199992'])

data.drop(['CreditScore', 'Age', 'Balance', 'EstimatedSalary'], axis=1, inplace=True)


In [5]:

data
Out[5]:

     Geography  Gender  Tenure  NumOfProducts  HasCrCard  IsActiveMember  Exited  CreditScore_Bins      Age_Bins          Balance_Bins           Salary_Bins
0       France  Female       2              1          1               1       1     CS_556_to_627  Age_40_to_46          Bal_lt_73080   Sal_80238_to_119710
1        Spain  Female       1              1          0               1       0     CS_556_to_627  Age_40_to_46   Bal_73080_to_110138   Sal_80238_to_119710
2       France  Female       8              3          1               0       1         CS_lt_566  Age_40_to_46         Bal_gt_133710   Sal_80238_to_119710
3       France  Female       1              2          0               0       0     CS_678_to_735  Age_35_to_40          Bal_lt_73080   Sal_80238_to_119710
4        Spain  Female       2              1          1               1       0         CS_gt_735  Age_40_to_46  Bal_110138_to_133710    Sal_41050_to_80238
...        ...     ...     ...            ...        ...             ...     ...               ...           ...                   ...                   ...
9995    France    Male       5              2          1               0       0         CS_gt_735  Age_35_to_40          Bal_lt_73080   Sal_80238_to_119710
9996    France    Male      10              1          1               1       0         CS_lt_566  Age_31_to_35          Bal_lt_73080   Sal_80238_to_119710
9997    France  Female       7              1          0               1       1     CS_678_to_735  Age_35_to_40          Bal_lt_73080    Sal_41050_to_80238
9998   Germany    Male       3              2          1               0       1         CS_gt_735  Age_40_to_46   Bal_73080_to_110138   Sal_80238_to_119710
9999    France  Female       4              1          1               0       0         CS_gt_735     Age_lt_31  Bal_110138_to_133710          Sal_lt_41050

10000 rows × 11 columns

Step 2. Calculating IV and WoE


Down below is the function which will calculate Weight of Evidence and Information Value.
Given a Pandas DataFrame, an attribute name, and the target variable name, it will do the calculations.
The function will return a Pandas DataFrame and the IV score.

In [6]:

def calculate_woe_iv(dataset, feature, target):
    lst = []
    for i in range(dataset[feature].nunique()):
        val = list(dataset[feature].unique())[i]
        lst.append({
            'Value': val,
            'All': dataset[dataset[feature] == val].count()[feature],
            'Good': dataset[(dataset[feature] == val) & (dataset[target] == 0)].count()[feature],
            'Bad': dataset[(dataset[feature] == val) & (dataset[target] == 1)].count()[feature]
        })

    dset = pd.DataFrame(lst)
    dset['Distr_Good'] = dset['Good'] / dset['Good'].sum()
    dset['Distr_Bad'] = dset['Bad'] / dset['Bad'].sum()
    dset['WoE'] = np.log(dset['Distr_Good'] / dset['Distr_Bad'])
    dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})
    dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']
    iv = dset['IV'].sum()
    dset = dset.sort_values(by='WoE')

    return dset, iv
In [9]:

for col in data.columns:
    if col == 'Exited': continue
    else:
        print('WoE and IV for column: {}'.format(col))
        df, iv = calculate_woe_iv(data, col, 'Exited')
        print(df)
        print('IV score: {:.2f}'.format(iv))
        print('\n')

WoE and IV for column: Geography


Value All Good Bad Distr_Good Distr_Bad WoE IV
2 Germany 2509 1695 814 0.212859 0.399607 -0.629850 0.117623
1 Spain 2477 2064 413 0.259199 0.202749 0.245626 0.013865
0 France 5014 4204 810 0.527942 0.397644 0.283430 0.036930
IV score: 0.17

WoE and IV for column: Gender


Value All Good Bad Distr_Good Distr_Bad WoE IV
0 Female 4543 3404 1139 0.427477 0.559156 -0.268527 0.035359
1 Male 5457 4559 898 0.572523 0.440844 0.261361 0.034416
IV score: 0.07

WoE and IV for column: Tenure


Value All Good Bad Distr_Good Distr_Bad WoE IV
10 0 413 318 95 0.039935 0.046637 -0.155153 0.001040
1 1 1035 803 232 0.100841 0.113893 -0.121710 0.001589
9 9 984 771 213 0.096823 0.104566 -0.076931 0.000596
6 3 1009 796 213 0.099962 0.104566 -0.045021 0.000207
8 5 1012 803 209 0.100841 0.102602 -0.017307 0.000030


7 10 490 389 101 0.048851 0.049583 -0.014869 0.000011


4 4 989 786 203 0.098707 0.099656 -0.009577 0.000009
5 6 967 771 196 0.096823 0.096220 0.006246 0.000004
2 8 1025 828 197 0.103981 0.096711 0.072482 0.000527
0 2 1048 847 201 0.106367 0.098675 0.075068 0.000577
3 7 1028 851 177 0.106869 0.086892 0.206935 0.004134
IV score: 0.01

WoE and IV for column: NumOfProducts


C:\Users\K Patel\anaconda3\lib\site-packages\pandas\core\arraylike.py:358: Ru
ntimeWarning: divide by zero encountered in log
result = getattr(ufunc, method)(*inputs, **kwargs)
Value All Good Bad Distr_Good Distr_Bad WoE IV
1 3 266 46 220 0.005777 0.108002 -2.928314 0.299348
0 1 5084 3675 1409 0.461509 0.691703 -0.404655 0.093149
3 4 60 0 60 0.000000 0.029455 0.000000 -0.000000
2 2 4590 4242 348 0.532714 0.170839 1.137260 0.411545
IV score: 0.80

WoE and IV for column: HasCrCard


Value All Good Bad Distr_Good Distr_Bad WoE IV
1 0 2945 2332 613 0.292854 0.300933 -0.027211 0.000220
0 1 7055 5631 1424 0.707146 0.699067 0.011490 0.000093
IV score: 0.00

WoE and IV for column: IsActiveMember


Value All Good Bad Distr_Good Distr_Bad WoE IV
1 0 4849 3547 1302 0.445435 0.639175 -0.361127 0.069965
0 1 5151 4416 735 0.554565 0.360825 0.429791 0.083268
IV score: 0.15

WoE and IV for column: CreditScore_Bins


Value All Good Bad Distr_Good Distr_Bad WoE IV
1 CS_lt_566 2010 1558 452 0.195655 0.221895 -0.125852 0.003302
0 CS_556_to_627 2020 1599 421 0.200804 0.206676 -0.028827 0.000169
3 CS_gt_735 1979 1573 406 0.197539 0.199313 -0.008941 0.000016
4 CS_627_to_678 2010 1615 395 0.202813 0.193913 0.044877 0.000399
2 CS_678_to_735 1981 1618 363 0.203190 0.178203 0.131216 0.003279
IV score: 0.01


WoE and IV for column: Age_Bins


Value All Good Bad Distr_Good Distr_Bad WoE IV
2 Age_gt_46 1885 1019 866 0.127967 0.425135 -1.200636 0.356791
0 Age_40_to_46 1696 1211 485 0.152078 0.238095 -0.448275 0.038559
1 Age_35_to_40 2266 1927 339 0.241994 0.166421 0.374392 0.028294
4 Age_31_to_35 1781 1615 166 0.202813 0.081492 0.911775 0.110617
3 Age_lt_31 2372 2191 181 0.275148 0.088856 1.130289 0.210563
IV score: 0.74

WoE and IV for column: Balance_Bins


Value All Good Bad Distr_Good Distr_Bad WoE \
3 Bal_110138_to_133710 2000 1461 539 0.183474 0.264605 -0.366167
2 Bal_gt_133710 2000 1538 462 0.193143 0.226804 -0.160654
1 Bal_73080_to_110138 2000 1554 446 0.195153 0.218949 -0.115059
0 Bal_lt_73080 4000 3410 590 0.428231 0.289642 0.391017

IV
3 0.029708
2 0.005408
1 0.002738
0 0.054191
IV score: 0.09

WoE and IV for column: Salary_Bins


Value All Good Bad Distr_Good Distr_Bad WoE \
4 Sal_159836_to_199992 2000 1569 431 0.197036 0.211586 -0.071242
0 Sal_80238_to_119710 2000 1596 404 0.200427 0.198331 0.010513
2 Sal_119710_to_159836 2000 1596 404 0.200427 0.198331 0.010513
1 Sal_41050_to_80238 2000 1601 399 0.201055 0.195876 0.026095
3 Sal_lt_41050 2000 1601 399 0.201055 0.195876 0.026095

IV
4 0.001037
0 0.000022
2 0.000022
1 0.000135
3 0.000135
IV score: 0.00

For now, you should only care about the line which says IV score. More precisely, focus on the
variables with the highest IV scores. Down below is a table for IV interpretation; the commonly used
rule of thumb is: IV below 0.02 - not predictive, 0.02 to 0.1 - weak predictor, 0.1 to 0.3 - medium
predictor, 0.3 to 0.5 - strong predictor, above 0.5 - suspiciously good and worth double-checking.

IV Interpretation table

Now you should see a clearer picture. You should only keep those attributes which have good
predictive power. In the current dataset those are:


 NumOfProducts (0.80)
 Age_Bins (0.74)
 Geography (0.17)
 IsActiveMember (0.15)

Step 3. Identifying Churners Profile

This isn't actually a required step, but it is quite beneficial to do.

As a company, you probably want to know what the typical churner looks like. Not his/her physical
appearance, of course, but where the churner lives, what his/her age is, and so on.

To find that out, you will need to take a closer look at the returned data frames for the variables
with the greatest predictive power. More precisely, look at the WoE column. Look for rows with a
negative WoE score: that is the value most churners have.

In our example, this is the typical churner's profile:

 Lives in Germany (WoE -0.63)
 Uses 3 products/services (WoE -2.92)
 Isn't an active member (WoE -0.36)
 Is older than 46 (WoE -1.20)

With this information, you as a company can act and address this critical customer group.

Step 4. Coarse Classing

Coarse classing is another term I hadn't heard prior to my master's degree studies. The idea behind it
is very simple: you basically want to group together instances with similar WoE, because they
provide the same information.

For this dataset, coarse classing should be applied to Spain and France in the Geography attribute
(WoEs 0.24 and 0.28).

In [10]:

Geography_df, Geography_iv = calculate_woe_iv(data, 'Geography', 'Exited')


In [11]:

Geography_df

Out[11]:
Value All Good Bad Distr_Good Distr_Bad WoE IV

2 Germany 2509 1695 814 0.212859 0.399607 -0.629850 0.117623

1 Spain 2477 2064 413 0.259199 0.202749 0.245626 0.013865

0 France 5014 4204 810 0.527942 0.397644 0.283430 0.036930


In [12]:


Geography_iv

Out[12]:
0.16841897055216165


Down below is the function for coarse classing, along with the function call. To call the function
you must know beforehand the index locations of the two rows you want coarse-classed.

def coarse_classer(df, indexloc_1, indexloc_2):
    mean_val = pd.DataFrame(np.mean(pd.DataFrame([df.iloc[indexloc_1], df.iloc[indexloc_2]]))).T
    original = df.drop([indexloc_1, indexloc_2])
    coarsed_df = pd.concat([original, mean_val])
    coarsed_df = coarsed_df.sort_values(by='WoE', ascending=False).reset_index(drop=True)
    return coarsed_df

geography_df = coarse_classer(Geography_df, 1, 2)
geography_df

Out[14]:
Value All Good Bad Distr_Good Distr_Bad WoE IV

0 France 5014.0 4204.0 810.0 0.527942 0.397644 0.283430 0.036930

1 NaN 3745.5 3134.0 611.5 0.393570 0.300196 0.264528 0.025398


In [ ]:


You can notice that Value is NaN for the newly created row. It's nothing to worry about: you can
simply remap the original dataset to replace Spain and France with something new, for example,
Spain_and_France.

data['Geography'].replace({ 'Spain': 'Spain_and_France', 'France': 'Spain_and_France' }, inplace=True)

Step 5. Dummy Variable Creation

As you know, many classification models perform best when the attributes are binary. That's
where dummy variables come in. A dummy variable is one that takes the value 0 or 1 to indicate
the absence or presence of some categorical effect that may be expected to shift the outcome.
In a nutshell, if the attribute has n unique values, you will need to create n - 1 dummy variables.
You create one less dummy variable to avoid collinearity issues, which arise when one variable is a
perfect predictor of another.
Dummy variables will be needed for the following attributes:
 Geography
 NumOfProducts
 Age_Bins
and the code below will create them and then concatenate them into a new DataFrame along
with the IsActiveMember attribute and the target variable, Exited:

geography_dummies = pd.get_dummies(data['Geography'], drop_first=True, prefix='Geography')
num_products_dummies = pd.get_dummies(data['NumOfProducts'], drop_first=True, prefix='Num_Prods')
age_dummies = pd.get_dummies(data['Age_Bins'], drop_first=True)

df = pd.concat([geography_dummies, num_products_dummies, age_dummies,
                data[['IsActiveMember', 'Exited']]], axis=1)

In [18]:


df
Out[18]:

      Geography_Spain_and_France  Num_Prods_2  Num_Prods_3  Num_Prods_4  Age_31_to_35  Age_35_to_40  Age_40_to_46  Age_gt_46  IsActiveMember  Exited
0                              1            0            0            0             0             0             1          0               1       1
1                              1            0            0            0             0             0             1          0               1       0
2                              1            0            1            0             0             0             1          0               0       1
3                              1            1            0            0             0             1             0          0               0       0
4                              1            0            0            0             0             0             1          0               1       0
...                          ...          ...          ...          ...           ...           ...           ...        ...             ...     ...
9995                           1            1            0            0             0             1             0          0               0       0
9996                           1            0            0            0             1             0             0          0               1       0
9997                           1            0            0            0             0             1             0          0               1       1
9998                           0            1            0            0             0             0             1          0               0       1
9999                           1            0            0            0             0             0             0          0               0       0

10000 rows × 10 columns


In [ ]:


Step 6. Correlations between Dummy Variables

The final step of this process is to calculate correlations between dummy variables and to exclude
those with high correlation. What's considered a high correlation coefficient is up for debate, but I
would suggest you remove anything with an absolute correlation above 0.7. If you're wondering
which of the two correlated dummy variables to remove, remove the one with the lower Weight of
Evidence, due to its weaker connection to the target variable.

import seaborn as sb

ax = sb.heatmap(df.corr(), linewidths=0.8, cmap='Blues', fmt=".1f", annot=True)

Correlation Matrix

Here it is visible that there is no strong correlation between the dummy variables, and therefore all of
them must remain.


PRACTICAL-6

AIM: Implement Decision Tree based Algorithm: Random Forest and


AdaBoost.

Random Forest

Random forest is an ensemble of decision tree algorithms.

It is an extension of bootstrap aggregation (bagging) of decision trees and can be used for
classification and regression problems.
In bagging, a number of decision trees are created where each tree is created from a different
bootstrap sample of the training dataset. A bootstrap sample is a sample of the training dataset
where a sample may appear more than once in the sample, referred to as sampling with
replacement.
Bagging is an effective ensemble algorithm as each decision tree is fit on a slightly different
training dataset, and in turn, has a slightly different performance. Unlike normal decision tree
models, such as classification and regression trees (CART), trees used in the ensemble are
unpruned, making them slightly overfit to the training dataset. This is desirable as it helps to make
each tree more different and have less correlated predictions or prediction errors.
Predictions from the trees are averaged across all decision trees resulting in better performance
than any single tree in the model.

A prediction on a regression problem is the average of the prediction across the trees in the
ensemble. A prediction on a classification problem is the majority vote for the class label across
the trees in the ensemble.

 Regression: Prediction is the average prediction across the decision trees.


 Classification: Prediction is the majority vote class label predicted across the decision trees.
Random forest involves constructing a large number of decision trees from bootstrap samples from
the training dataset, like bagging.

Unlike bagging, random forest also involves selecting a subset of input features (columns or
variables) at each split point in the construction of trees. Typically, constructing a decision tree
involves evaluating the value for each input variable in the data in order to select a split point. By
reducing the features to a random subset that may be considered at each split point, it forces each
decision tree in the ensemble to be more different.


Random Forest

In [1]:

from numpy import mean


from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

In [2]:

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)


In [3]:

model = RandomForestClassifier()
In [4]:

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)


n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
In [5]:


print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))


Accuracy: 0.901 (0.025)
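As a small follow-up sketch (not part of the evaluated run above), the ensemble can be fitted on the
full dataset and used to classify a single row; here the first training row is reused purely for
illustration:

model.fit(X, y)
print(model.predict(X[:1]), y[0])   # predicted class label for the first row vs. its true label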
AdaBoost
Boosting is a class of ensemble machine learning algorithms that involve combining the
predictions from many weak learners.

A weak learner is a model that is very simple, although has some skill on the dataset. Boosting
was a theoretical concept long before a practical algorithm could be developed, and the AdaBoost
(adaptive boosting) algorithm was the first successful approach for the idea.

The AdaBoost algorithm involves using very short (one-level) decision trees as weak learners that
are added sequentially to the ensemble. Each subsequent model attempts to correct the predictions
made by the model before it in the sequence. This is achieved by weighing the training dataset to
put more focus on training examples on which prior models made prediction errors.

AdaBoost
In [6]:

from numpy import mean


from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
In [7]:

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)


In [8]:

model = AdaBoostClassifier()
In [9]:


cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)


n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

In [10]:

print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))


Accuracy: 0.806 (0.041)
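The two main AdaBoost hyperparameters are the number of weak learners and the learning rate.
Below is a hedged sketch of re-running the same evaluation with different values (these values are
arbitrary, not tuned for this dataset):

model = AdaBoostClassifier(n_estimators=100, learning_rate=0.5)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))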

In [ ]:


PRACTICAL-7

AIM: Perform installation of Weka tool.

To install the Weka tool, follow the steps below.

Step1: Download the Weka installer.

Step2: Open the following URL: "https://waikato.github.io/weka-wiki/downloading_weka/"

Step3: On the left side select the appropriate OS for your computer.

Step4: Click on 'Click Here'.


Step5: Open the downloaded setup and click on "Yes".

Step6: The setup screen will appear; click on Next.

Step7: The next screen will be the License Agreement; click on I Agree.


Step8: The next screen is for choosing components. All components are already marked, so don't
change anything; just click on the Install button.

Step9: The next screen is for the installation location, so choose a drive with sufficient free space.
The installation needs about 301 MB of memory space.


Step10: The next screen is for choosing the Start menu folder; don't change anything, just click on
the Install button.

Step11: After this, the installation process will start and will hardly take a minute to complete.


Step12: Click on the Next button after the installation process is complete.

Step13: Click on Finish to finish the installation process.


Step14: Weka is successfully installed on the system and an icon is created on the desktop.

Step15: Run the software and see the interface.



PRACTICAL-8

AIM: Demonstration of preprocessing on dataset student.arff

This experiment illustrates some of the basic data preprocessing operations that can be performed
using WEKA-Explorer. The sample dataset used for this example is the student data available in
arff format.

Step1: Loading the data. We can load the dataset into Weka by clicking on the Open button in the
preprocessing interface and selecting the appropriate file.

Step2: Once the data is loaded, Weka will recognize the attributes, and during the scan of the data
Weka will compute some basic statistics on each attribute. The left panel shows the list of recognized
attributes, while the top panel indicates the names of the base relation or table and the current
working relation (which are the same initially).

Step3: Clicking on an attribute in the left panel will show the basic statistics on that attribute. For
categorical attributes the frequency of each attribute value is shown, while for continuous attributes
we can obtain the min, max, mean, standard deviation, etc.

Step4: The visualization in the right panel is in the form of a cross-tabulation across two attributes.

Note: we can select another attribute using the dropdown list.

Step5: Selecting or filtering attributes.

Removing an attribute - when we need to remove an attribute, we can do this by using the attribute
filters in Weka. In the Filter panel, click on the Choose button. This will show a popup window
with a list of available filters.

Scroll down the list and select the "weka.filters.unsupervised.attribute.Remove" filter.

Step 6:
a) Next click the textbox immediately to the right of the Choose button. In the resulting dialog
box enter the index of the attribute to be filtered out.

b) Make sure that the invertSelection option is set to false. Then click OK in the filter box; you
will see "Remove-R-7".

c) Click the Apply button to apply the filter to this data. This will remove the attribute and create a
new working relation.

d) Save the new working relation as an arff file by clicking the Save button on the top (button)
panel. (student.arff)
Discretization

1) Sometimes association rule mining can only be performed on categorical data. This requires
performing discretization on numeric or continuous attributes. In the following example let us
discretize the age attribute.

- Let us divide the values of the age attribute into three bins (intervals).
- First load the dataset into Weka (student.arff).
- Select the age attribute.
- Activate the filter dialog box and select "weka.filters.unsupervised.attribute.Discretize" from the
  list.
- To change the defaults for the filter, click on the box immediately to the right of the Choose
  button.
- We enter the index of the attribute to be discretized. In this case the attribute is age, so we must
  enter '1' corresponding to the age attribute.
- Enter '3' as the number of bins. Leave the remaining field values as they are.
- Click the OK button.
- Click Apply in the filter panel. This will result in a new working relation with the selected
  attribute partitioned into 3 bins.
- Save the new working relation in a file called student-data-discretized.arff
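For comparison outside Weka, the same equal-width binning idea can be sketched in pandas (a small
illustrative snippet; the numeric ages below are made up, since in student.arff the age attribute is
already categorical):

import pandas as pd
ages = pd.Series([23, 27, 31, 35, 38, 42, 45, 52, 58, 61])          # hypothetical numeric ages
print(pd.cut(ages, bins=3, labels=['young', 'middle', 'senior']))   # three equal-width bins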

Dataset student.arff

@relation student

@attribute age {<30,30-40,>40}
@attribute income {low, medium, high}
@attribute student {yes, no}
@attribute credit-rating {fair, excellent}
@attribute buyspc {yes, no}

@data
<30, high, no, fair, no
<30, high, no, excellent, no
30-40, high, no, fair, yes
>40, medium, no, fair, yes
>40, low, yes, fair, yes
>40, low, yes, excellent, no
30-40, low, yes, excellent, yes
<30, medium, no, fair, no
<30, low, yes, fair, no
>40, medium, yes, fair, yes
<30, medium, yes, excellent, yes
30-40, medium, no, excellent, yes
30-40, high, yes, fair, yes
>40, medium, no, excellent, no

The following screenshot shows the effect of discretization.



PRACTICAL-9

AIM: Demonstration of preprocessing on dataset labor.arff


This experiment illustrates some of the basic data preprocessing operations that can be performed
using WEKA-Explorer. The sample dataset used for this example is the labor data available in arff
format.

Step1: Loading the data. We can load the dataset into Weka by clicking on the Open button in the
preprocessing interface and selecting the appropriate file.

Step2: Once the data is loaded, Weka will recognize the attributes, and during the scan of the data
Weka will compute some basic statistics on each attribute. The left panel shows the list of recognized
attributes, while the top panel indicates the names of the base relation or table and the current
working relation (which are the same initially).

Step3: Clicking on an attribute in the left panel will show the basic statistics on that attribute. For
categorical attributes the frequency of each attribute value is shown, while for continuous attributes
we can obtain the min, max, mean, standard deviation, etc.

Step4: The visualization in the right panel is in the form of a cross-tabulation across two attributes.

Note: we can select another attribute using the dropdown list.

Step5: Selecting or filtering attributes.

Removing an attribute - when we need to remove an attribute, we can do this by using the attribute
filters in Weka. In the Filter panel, click on the Choose button. This will show a popup window
with a list of available filters.

Scroll down the list and select the "weka.filters.unsupervised.attribute.Remove" filter.

Step 6:
a) Next click the textbox immediately to the right of the Choose button. In the resulting dialog
box enter the index of the attribute to be filtered out.

b) Make sure that the invertSelection option is set to false. Then click OK in the filter box; you
will see "Remove-R-7".

c) Click the Apply button to apply the filter to this data. This will remove the attribute and create a
new working relation.

d) Save the new working relation as an arff file by clicking the Save button on the top (button)
panel. (labor.arff)


Discretization

1) Sometimes association rule mining can only be performed on categorical data. This requires
performing discretization on numeric or continuous attributes. In the following example let us
discretize the duration attribute.

- Let us divide the values of the duration attribute into three bins (intervals).
- First load the dataset into Weka (labor.arff).
- Select the duration attribute.
- Activate the filter dialog box and select "weka.filters.unsupervised.attribute.Discretize" from the
  list.
- To change the defaults for the filter, click on the box immediately to the right of the Choose
  button.
- We enter the index of the attribute to be discretized. In this case the attribute is duration, so we
  must enter '1' corresponding to the duration attribute.
- Enter '1' as the number of bins. Leave the remaining field values as they are.
- Click the OK button.
- Click Apply in the filter panel. This will result in a new working relation with the selected
  attribute partitioned into 1 bin.
- Save the new working relation in a file called labor-data-discretized.arff

Dataset labor.arff


The following screenshot shows the effect of discretization



PRACTICAL-10

AIM: Demonstration of Association rule process on dataset contact


lenses.arff using a priori algorithm.

This experiment illustrates some of the basic elements of association rule mining using WEKA.
The sample dataset used for this example is contactlenses.arff.

Step1: Open the data file in Weka Explorer. It is presumed that the required data fields have been
discretized. In this example it is the age attribute.

Step2: Clicking on the Associate tab will bring up the interface for the association rule algorithms.

Step3: We will use the Apriori algorithm. This is the default algorithm.

Step4: In order to change the parameters for the run (for example support, confidence, etc.) we click
on the text box immediately to the right of the Choose button.

Dataset contactlenses.arff

The following screenshot shows the association rules that were generated when the Apriori
algorithm is applied on the given dataset.
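For readers who want to experiment outside Weka, the same support/confidence idea can be
sketched in Python with the third-party mlxtend package (this toy one-hot table and the thresholds
are illustrative assumptions, not a reproduction of the contact-lenses run above):

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules   # assumes mlxtend is installed

# a tiny one-hot transaction table in the spirit of contact-lenses attribute values
transactions = pd.DataFrame({'age=young':        [1, 1, 0, 0, 1],
                             'tear-rate=normal': [1, 0, 1, 1, 1],
                             'lenses=soft':      [1, 0, 1, 0, 1]}).astype(bool)

frequent = apriori(transactions, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric='confidence', min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])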
