0% found this document useful (0 votes)

37 views114 pages

B DWM Lab Manual Zil

project lab manual

Uploaded by

꧁[PàRTH Pàtel]꧂

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

37 views114 pages

B DWM Lab Manual Zil

project lab manual

Uploaded by

꧁[PàRTH Pàtel]꧂

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 114

K.J.

INSTITUTE OF ENGINEERING & TECHNOLOGY ,SAVLI

LAB MANUAL

DATA WAREHOUSING AND DATA MINING

(3161610)

FOR

6THSEMESTER

IT
CERTIFICATE

K. J. INSTITUTE OF ENGINEERING & TECHNOLOGY

DATA WAREHOUSING AND DATA MINING(3161610)

This is to certify that Mr. /Ms. _ MISTRY ZEEL JAYESHBHAI Of 6TH SEM B.E I.T.
Class, ENROLL NO. 220643116004 , has satisfactorily Completed his/her term
work in DATA WAREHOUSING AND DATA MINING for the term ending
in _APRIL 2023 /2024.

DATE:
Grade:

PROFF. MISS. PRANAVI PATEL

INTERNAL EXAMINER EXTERNAL EXAMINER

MISS. PRANAVI PATEL

HEAD OF DEPARTMENT (IT)
FOREWORD

It is my great pleasure to present this laboratory manual for third year

Information Technology students for the subject of Data
Warehousing and Data Mining.

As a student, many of you may be wondering with some of the

questions in your mind regarding the subject and exactly what has
been tried is to answer through this manual.

Faculty members are also advised that covering these aspects in

initial stage itself will greatly relieve them in future as much of the
load will be taken care by the enthusiasm energies of the students
once they are conceptually clear.

DR. DEVANG SHAH

PRINCIPAL
LABORATORY MANUAL CONTENTS

This manual is intended for the third year students of Information

Technology in the subject of Data Warehousing and Data Mining.
This manual typically contains practical/lab sessions related Python
and implemented into Jupyter Notebook and WEKA, related the
subject to enhanced understanding.

Students are advised to thoroughly go through this manual rather

than only topics mentioned in the syllabus as practical aspects are
the key to understanding and conceptual visualization of theoretical
aspects covered in the books.

Good Luck for your Enjoyable Laboratory Sessions

MISS. PRANAVI PATEL

FACULTY AND HEAD OF DEPARTMENT (IT)
LAB INDEX
SR
DATE LIST OF PRACTICALS PAGE NO. SIGNATURE
NO.
Introduce and Perform different
1 18/1/24 methods of i) Data cleaning and ii) 1
Data integrationand transformation.
Perform data reduction in Linear
Algebra methods and Mainfold
2 Learning methods with achieved
25/1/24 35
Accuracy from
Make_Classification dataset using
LogisticRegression.

Demonstration of Pre-processing
3 08/02/24 43
Methods
i) Rescale Data and ii) Binarize Data.
Perform Different Normalization
methods.
4
15/02/24 i) Maximum Absolute Scaling, ii) 72
Min-Max Feature Scaling, and iii)
Z-score Method.

Introduce and perform

5 22/02/24 75
AttributeRelevance.

Implement Decision Tree based

6 07/03/24 89
Algorithm: Random Forest and
AdaBoost.

7 14/03/24 Perform installation of Weka tool. 93

Demonstration of preprocessing on
8 28/03/24 100
datasetstudent.arff

Demonstration of preprocessing on
9 04/04/24 104
datasetlabor.arff

Demonstration of Association
10 11/04/24 rule process on dataset contact 108
lenses .arffusing a priori
algorithm
DATAWAREHOSING AND DATAMINING 220643116004
444

PRACTICAL-1

AIM: Introduce and Perform different methods of i) Data cleaning and ii)
Data integration and transformation.

1. Data Cleaning
As we know that, Data Mining is the discipline of study which involves extracting insights from
huge amounts of data by the use of various scientific methods, algorithms, and processes. To
extract useful knowledge from data, Data Mining need raw data. This Raw data is a collection of
information from various outlines sources and an essential raw material of Data Scientists. It is
additionally known as primary or source data. It consists of garbage, irregular and inconsistent
values which lead to many difficulties. When using data, the insights and analysis extracted are
only as good as the data we are using. Essentially, when garbage data is in, then garbage analysis
comes out. Here Data cleaning comes into the picture, Data cleansing is an essential part of data
Mining. Data cleaning is the process of removing incorrect, corrupted, garbage, incorrectly
formatted, duplicate, or incomplete data within a dataset.
Why Data Cleaning?
Data cleaning is the most important task that should be done as a data science professional. Having
wrong or bad quality data can be detrimental to processes and analysis. Having clean data will
ultimately increase overall productivity and permit the very best quality information in your
decision-making.

Error-Free Data
When multiple sources of data are combined there may be chances of so much error. Through Data
Cleaning, errors can be removed from data. Having clean data which is free from wrong and
garbage values can help in performing analysis faster as well as efficiently. By doing this task our
considerable amount of time is saved. If we use data containing garbage values, the results won‘t
be accurate. When we don‘t use accurate data, surely we will make mistakes. Monitoring errors
and good reporting helps to find where errors are coming from, and also makes it easier to fix
incorrect or corrupt data for future applications.

Data Quality
The quality of the data is the degree to which it follows the rules of particular requirements. For
example, if we have imported phone numbers data of different customers, and in some places, we
have added email addresses of customers in the data. But because our needs were straightforward
for phone numbers, then the email addresses would be invalid data. Here some pieces of data
follow a specific format. Some types of numbers have to be in a specific range. Some data cells
might require a selected quite data like numeric, Boolean, etc. In every scenario, there are some
mandatory constraints our data should follow. Certain conditions affect multiple fields of data in
a particular form. Particular types of data have unique restrictions. If the data isn‘t in the required
format, it would always be invalid. Data cleaning will help us simplify this process and avoid
useless data values.

1
DATAWAREHOSING AND DATAMINING 220643116004
444
Accurate and Efficient
Ensuring the data is close to the correct values. We know that most of the data in a dataset are
valid, and we should focus on establishing its accuracy. Even if the data is authentic and corre ct,
it doesn‘t mean the data is accurate. Determining accuracy helps to figure out the data entered is
accurate or not. For example, the address of a customer is stored in the specified format, maybe it
doesn‘t need to be in the right one. The email has an additional character or value that makes it
incorrect or invalid. Another example is the phone number of a customer. This means that we have
to rely on data sources, to cross-check the data to figure out if it‘s accurate or not. Depending on
the kind of data we are using, we might be able to find various resources that could help us in this
regard for cleaning.

Complete Data
Completeness is the degree to which we should know all the required values. Completeness is a
little more challenging to achieve than accuracy or quality. Because it‘s nearly impossible to have
all the info we need. Only known facts can be entered. We can try to complete data by redoing the
data gathering activities like approaching the clients again, re-interviewing people, etc. For
example, we might need to enter every customer‘s contact information. But a number of them
might not have email addresses. In this case, we have to leave those columns empty. If we have a
system that requires us to fill all columns, we can try to enter missing or unknown there. But
entering such values does not mean that the data is complete. It would be still being referred to as
incomplete.

Maintains Data Consistency

To ensure the data is consistent within the same dataset or across multiple datasets, we can measure
consistency by comparing two similar systems. We can also check the data values within the same
dataset to see if they are consistent or not. Consistency can be relational. For example, a customer‘s
age might be 25, which is a valid value and also accurate, but it is also stated as a senior citizen in
the same system. In such cases, we have to cross-check the data, similar to measuring accuracy,
and see which value is true. Is the client a 25-year old? Or the client is a senior citizen? Only one
of these values can be true. There are multiple ways to for your data consistent.

By checking in different systems. By checking the source. By checking the latest data.

Python Code:

.
In [1]:

import pandas as pd
In [2]:

2
DATAWAREHOSING AND DATAMINING 220643116004
444
data = pd.read_csv('p1-1.csv')
In [3]:

data.head()

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

Out[3]
In [4]:

data
Out[4]:

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

3 4 4.6 3.1 1.5 0.2 Iris-setosa

3
DATAWAREHOSING AND DATAMINING 220643116004
444

4 5 5.0 3.6 1.4 0.2 Iris-setosa

... ... ... ... ... ... ...

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 6 columns

In [5]:

data.tail()
Out[5]:

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

In [6]:

4
DATAWAREHOSING AND DATAMINING 220643116004
444
data.isnull()

###This function provides the boolean value for the complete dataset to know if any null value is present or not.
Out[6]:

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 False False False False False False

1 False False False False False False

2 False False False False False False

3 False False False False False False

4 False False False False False False

... ... ... ... ... ... ...

145 False False False False False False

146 False False False False False False

147 False False False False False False

148 False False False False False False

149 False False False False False False

150 rows × 6 columns

In [7]:

5
DATAWAREHOSING AND DATAMINING 220643116004
444
data.isna()

####This is the same as the isnull() function. Ans provides the same output.

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 False False False False False False

1 False False False False False False

2 False False False False False False

3 False False False False False False

4 False False False False False False

... ... ... ... ... ... ...

145 False False False False False False

146 False False False False False False

147 False False False False False False

148 False False False False False False

149 False False False False False False

150 rows × 6 columns

Out[7]:

In [8]:

6
DATAWAREHOSING AND DATAMINING 220643116004
444

data.isna().any()

###This function also gives a boolean value if any null value is present or not,
###but it gives results column-wise, not in tabular format.
Out[8]:
Id False
SepalLengthCm False
SepalWidthCm False
PetalLengthCm False
PetalWidthCm False
Species False
dtype: bool
In [9]:

data.isna().sum()

###This function gives the sum of the null values preset in the dataset column-wise.

Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64

Out[9]

In [10]:

data.isna().any().sum()

###This function gives output in a single value if any null is present or not.

7
DATAWAREHOSING AND DATAMINING 220643116004
444

Out[10]:
There are no null values present in our dataset. But if there are any null value s preset we can fill
those places with any other value using the fillna() function of DataFrame.Following is the syntax of
fillna() function:

DataFrame_name.fillna(value=None, method=None, axis=None, inplace

=False, limit=None, downcast=None)

and for merging datasets: Merging the dataset is the process of combining two datasets in one, and
line up rows based on some particular or common property for data analysis. We can do this by
using the merge() function of the dataframe. Following is the syntax of the merge function:

DataFrame_name.merge(right, how='inner', on=None, left_on=None, ri

ght_on=None, left_index=False, right_index=False, sort=False, suff
ixes=('_x', '_y'), copy=True, indicator=False, validate=None)
De-Duplicate De-Duplicate means remove all duplicate values. There is no need for duplicate values
in data analysis. These values only affect the accuracy and efficiency of the analysis result. To find
duplicate values in the dataset we will use a simple dataframe function i.e. duplicated(). Let’s see the
example:
In [11]:

data.duplicated()
Out[11]:
0 False
1 False
2 False
3 False
4 False
...
145 False
146 False
147 False
148 False
149 False
Length: 150, dtype: bool
This function also provides bool values for duplicate values in the dataset. As we can see that
dataset doesn’t contain any duplicate values.

If a dataset contains duplicate values it can be removed using the drop_duplicates() function.
Following is the syntax of this function:

DataFrame_name.drop_duplicates(subset=None, keep='first', inplace=

False, ignore_index=False)
8
DATAWAREHOSING AND DATAMINING 220643116004
444

data.duplicated().any().sum()

Out[10]:

In [11]:

data1 = pd.read_csv('StudentDetails.csv')
In [12]:

data1 Student Name 12th Marks Diplomla CPI Name of Collage

Sr No.
0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[12]:
In [13]:

9
DATAWAREHOSING AND DATAMINING 220643116004
444

data1["Name of Collage"].fillna("no collage")

0 Sceit
1 Parul
2 LD
3 no collage
4 no collage
5 Sigma
6 SVNIT
7 no collage
8 MS
9 KJIT
Name: Name of Collage, dtype: object

Out[13]:

In [14]:

data2
= pd.read_csv('StudentDetails.csv')
In [15]:

data2

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN

10
DATAWAREHOSING AND DATAMINING 220643116004
444
8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[15]:
In [16]:

data2["Name of Collage"].fillna(method='ffill', inplace=True)

In [17]:

data2
Out[17]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 LD

4 5 Nimisha Sutariya NaN 7.50 LD

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 SVNIT

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

In [18]:

data3 = pd.read_csv('StudentDetails.csv')
In [19]:

11
DATAWAREHOSING AND DATAMINING 220643116004
444
data3
Out[19]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

In [20]:

data3["Name of Collage"].fillna(method='bfill', inplace=True)

In [21]:

data3 Student Name 12th Marks Diplomla CPI Name of Collage

Sr No.
0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 Sigma

4 5 Nimisha Sutariya NaN 7.50 Sigma

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 MS

12
DATAWAREHOSING AND DATAMINING 220643116004
444
8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[21]:

In [22]:

data4 = pd.read_csv('StudentDetails.csv')
In [23]:

data4

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[23]:

In [24]:

data4["Name of Collage"].fillna(method='backfill', inplace=True)

In [25]:

13
DATAWAREHOSING AND DATAMINING 220643116004
444

data4
Out[25]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 Sigma

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

4 5 Nimisha Sutariya NaN 7.50 Sigma

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 MS

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

In [26]:

data5 = pd.read_csv('StudentDetails.csv')
In [27]:

data5

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

14
DATAWAREHOSING AND DATAMINING 220643116004
444
4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[27]:
In [28]:

data5["Name of Collage"].fillna(method='pad', inplace=True)

In [29]:

data5

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 LD

4 5 Nimisha Sutariya NaN 7.50 LD

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 SVNIT

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

Out[29]:

In [30]:

15
DATAWAREHOSING AND DATAMINING 220643116004
444

data6 = pd.read_csv('StudentDetails.csv')
In [31]:

data6
Out[31]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 NaN

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 NaN

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

In [32]:

data6["Name of Collage"].fillna(method='ffill',limit=1,inplace=True)
In [33]:

data6
Out[33]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

0 1 Shivani Prajapati 80.0 NaN Sceit

16
DATAWAREHOSING AND DATAMINING 220643116004
444
1 2 Tanvi Patel NaN 7.81 Parul

2 3 Twinkle Chaudhari 70.0 NaN LD

3 4 Nitesh Singh NaN 8.00 LD

4 5 Nimisha Sutariya NaN 7.50 NaN

5 6 Vaishali Rathwa 75.0 NaN Sigma

6 7 Heena Prajapati 85.0 NaN SVNIT

7 8 Tejas Prajapati NaN 8.50 SVNIT

Sr No. Student Name 12th Marks Diplomla CPI Name of Collage

8 9 Akshay Prajapati 85.0 NaN MS

9 10 Ashvin Prajapati 96.0 NaN KJIT

2. Data Integration and Transformation

Data Integration and Transformation

So far, we've made sure to remove the impurities in data and make it clean. Now, the next step is
to combine data from different sources to get a unified structure with more meaningful and
valuable information. This is mostly used if the data is segregated into different sources. To make
it simple, let's assume we have data in CSV format in different places, all talking about the same
scenario. Say we have some data about an employee in a database. We can't expect all the data
about the employee to reside in the same table. It's possible that the employee's personal data will
be located in one table, the employee's project history will be in a second table, the employee's
time-in and time-out details will be in another table, and so on. So, if we want to do some analysis
about the employee, we need to get all the employee data in one common place. This process of
bringing data together in one place is called data integration. To do data integration, we can merge
multiple pandas DataFrames using the merge function. Visualization is an important tool for
insight generation, but it is rare that you get the data in exactly the right form you need. You will
often need to create some new variables or summaries, rename variables, or reorder observations
for the data to be easier to manage.

In [57]:

df1 = pd.DataFrame({'Name': ['Shivani', 'Tanvi', 'Nimisha', 'Twinkle'],

'ENR': ['28', '03', '12', '05']})
In [58]:

17
DATAWAREHOSING AND DATAMINING 220643116004
444

df1
Out[58]:
Name ENR

0 Shivani 28

1 Tanvi 03

2 Nimisha 12

Name ENR

3 Twinkle 05

In [60]:

df2 = pd.DataFrame({'Name': ['Shivani', 'Tanvi', 'Nimisha', 'Twinkle'],

'Skills': ['Wd','JAVA' , 'SQL', 'Python']})

In [61]:

df2

Name Skills

0 Shivani Wd

1 Tanvi JAVA

2 Nimisha SQL

3 Twinkle Python

Out[61]:
In [62]:

data9 = pd.merge(df1, df2)

18
DATAWAREHOSING AND DATAMINING 220643116004
444
In [63]:

data9
Out[63]:
Name ENR Skills

0 Shivani 28 Wd

1 Tanvi 03 JAVA
Name ENR Skills

2 Nimisha 12 SQL

3 Twinkle 05 Python

In [ ]:

data8=pd.merge(data6, data7, on='Sr No', how='inner')

In [ ]:

In [64]:

df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],'supervisor': ['Carly', 'Guido', 'Steve']})

In [65]:

df4

group supervisor

0 Accounting Carly

19
DATAWAREHOSING AND DATAMINING 220643116004
444
1 Engineering Guido

2 HR Steve

Out[65]:

In [68]:

display(df1, df2, pd.merge(df1, df2))

Name ENR

0 Shivani 28

1 Tanvi 03

2 Nimisha 12

3 Twinkle 05

Name Skills

0 Shivani Wd

1 Tanvi JAVA

2 Nimisha SQL

3 Twinkle Python

Name ENR Skills

0 Shivani 28 Wd

1 Tanvi 03 JAVA

2 Nimisha 12 SQL

In [41]:

import numpy as np
import pandas as pd

In [42]:

20
DATAWAREHOSING AND DATAMINING 220643116004
444

df = pd.read_csv('Dataset11.csv')
In [43]:

df
Out[43]:

NAME A B C D E

0 JANE 1 6 6 9 1

1 JOHN 8 1 2 8 1

2 ASHLEY 6 3 5 1 7

3 MAX 0 3 4 0 8

4 EMILY 7 6 6 0 6

Add / drop columns

The first and foremost way of transformation is adding or dropping columns. A new column can be
added as follows:
In [44]:

df['new'] = np.random.random(5)
In [45]:

df
Out[45]:

21
DATAWAREHOSING AND DATAMINING 220643116004
444

NAME A B C D E new

0 JANE 1 6 6 9 1 0.458527

1 JOHN 8 1 2 8 1 0.390897

2 ASHLEY 6 3 5 1 7 0.044329

3 MAX 0 3 4 0 8 0.229151

NAME A B C D E new

4 EMILY 7 6 6 0 6 0.589566

We give the values as an array or list and assign a name to the new column. Make sure the size of
the array is compatible with the size of the dataframe. The drop function is used to drop a column.
In [46]:

df.drop('new', axis=1, inplace=True)

In [47]:

df
Out[47]:

NAME A B C D E

0 JANE 1 6 6 9 1

1 JOHN 8 1 2 8 1

2 ASHLEY 6 3 5 1 7

3 MAX 0 3 4 0 8

4 EMILY 7 6 6 0 6

22
DATAWAREHOSING AND DATAMINING 220643116004
444
We pass the name of the column to be dropped. The axis parameter is set to 1 to indicate we are
dropping a column. Finally, the inplace parameter needs to be True to save the changes.

Add / drop rows

We can use the loc method to add a single row to a dataframe.
In [48]:

df.loc[5,:] = ['Jack', 3, 3, 4, 5, 1]
In [49]:

df
Out[49]:

NAME A B C D E

0 JANE 1.0 6.0 6.0 9.0 1.0

1 JOHN 8.0 1.0 2.0 8.0 1.0

2 ASHLEY 6.0 3.0 5.0 1.0 7.0

3 MAX 0.0 3.0 4.0 0.0 8.0

4 EMILY 7.0 6.0 6.0 0.0 6.0

5 Jack 3.0 3.0 4.0 5.0 1.0

In [50]:

df.drop(5, axis=0, inplace=True)

We have just dropped the row that was added in the previous step.

Insert
The insert function adds a column into a specific position.

23
DATAWAREHOSING AND DATAMINING 220643116004
444
In [51]:

df.insert(0, 'new', np.random.random(5))

In [52]:

df
Out[52]:

new NAME A B C D E

0 0.641899 JANE 1.0 6.0 6.0 9.0 1.0

1 0.846706 JOHN 8.0 1.0 2.0 8.0 1.0

2 0.291893 ASHLEY 6.0 3.0 5.0 1.0 7.0

3 0.598154 MAX 0.0 3.0 4.0 0.0 8.0

4 0.882514 EMILY 7.0 6.0 6.0 0.0 6.0

In [53]:

df.insert(2, 'me', np.random.random(5))

In [54]:

df
Out[54]:

new NAME me A B C D E

0 0.641899 JANE 0.661207 1.0 6.0 6.0 9.0 1.0

24
DATAWAREHOSING AND DATAMINING 220643116004
444

1 0.846706 JOHN 0.631619 8.0 1.0 2.0 8.0 1.0

2 0.291893 ASHLEY 0.459551 6.0 3.0 5.0 1.0 7.0

3 0.598154 MAX 0.659862 0.0 3.0 4.0 0.0 8.0

4 0.882514 EMILY 0.373759 7.0 6.0 6.0 0.0 6.0

In [55]:

df.drop('new', axis=1, inplace=True)

In [56]:

df
Out[56]:

NAME me A B C D E

0 JANE 0.661207 1.0 6.0 6.0 9.0 1.0

1 JOHN 0.631619 8.0 1.0 2.0 8.0 1.0

2 ASHLEY 0.459551 6.0 3.0 5.0 1.0 7.0

3 MAX 0.659862 0.0 3.0 4.0 0.0 8.0

4 EMILY 0.373759 7.0 6.0 6.0 0.0 6.0

In [57]:

df.drop('me', axis=1, inplace=True)

In [58]:

25
DATAWAREHOSING AND DATAMINING 220643116004
444

df
Out[58]:

NAME A B C D E

0 JANE 1.0 6.0 6.0 9.0 1.0

NAME A B C D E

1 JOHN 8.0 1.0 2.0 8.0 1.0

2 ASHLEY 6.0 3.0 5.0 1.0 7.0

3 MAX 0.0 3.0 4.0 0.0 8.0

4 EMILY 7.0 6.0 6.0 0.0 6.0

Melt
The melt function converts a dataframe from wide (high number of columns) to narrow form (high
number of rows). It is best explained via an example. Consider following dataframe. It contains
consecutive daily measurements for 5 people. The long format of this dataframe can be achieved
using the melt function.
The column passed to the id_vars parameter remains the same and the other columns are
combined under the variable and value columns.
In [66]:

df1 = pd.read_csv('Dataset12.csv')
In [67]:

df1

NAME A B C D E

26
DATAWAREHOSING AND DATAMINING 220643116004
444

0 ASHLEY 6 3 5 1 7

1 MAX 0 3 4 0 8

2 EMILY 7 6 6 0 6

Out[67]:

In [72]:

df2 = pd.read_csv('Dataset11.csv')
In [73]:

df2

NAME A B C D E

0 JANE 1 6 6 9 1

1 JOHN 8 1 2 8 1

2 Jack 4 9 8 6 3

Here is how we can combine them:

Out[73]

In [77]:

pd.concat([df1, df2], axis=0, ignore_index=True)

Out[77]:

NAME A B C D E

27
DATAWAREHOSING AND DATAMINING 220643116004
444

0 ASHLEY 6 3 5 1 7

1 MAX 0 3 4 0 8

2 EMILY 7 6 6 0 6

3 JANE 1 6 6 9 1

4 JOHN 8 1 2 8 1

NAME A B C D E

5 Jack 4 9 8 6 3

In [76]:

pd.concat([df1, df2], axis=1, ignore_index=True)

Out[76]:

0 1 2 3 4 5 6 7 8 9 10 11

0 ASHLEY 6 3 5 1 7 JANE 1 6 6 9 1

1 MAX 0 3 4 0 8 JOHN 8 1 2 8 1

2 EMILY 7 6 6 0 6 Jack 4 9 8 6 3

Merge
Merge function also combines dataframes based on common values in a given column or columns.
Consider the following two dataframes.
.
In [35]:

pd.melt(df, id_vars='NAME').head()
Out[35]:

28
DATAWAREHOSING AND DATAMINING 220643116004
444

NAME variable value

0 JANE A 1.0

1 JOHN A 8.0

2 Jack A 4.0

NAME variable value

3 JANE B 6.0

4 JOHN B 1.0

In [79]:

df3 = pd.read_csv('Customer.csv')df4 = pd.read_csv('Order.csv')

In [81]:

df3

ID Name Category

0 1 Rane A

1 2 Alex B

2 3 Ayan A

3 4 Jack C

29
DATAWAREHOSING AND DATAMINING 220643116004
444

4 5 John B

Out[81]

In [82]:

df4
Out[82]:

ID Amount Payment

0 2 250 Credit Card

1 4 320 Credit Card

2 5 250 Cash

3 6 440 Cash

We can merge them based on the id column.

In [86]:

df3.merge(df4, on='ID')

ID Name Category Amount Payment

0 2 Alex B 250 Credit Card

1 4 Jack C 320 Credit Card

2 5 John B 250 Cash

Out[86]:

30
DATAWAREHOSING AND DATAMINING 220643116004
444

df3.merge(df4, on='ID', how = 'inner')

Out[87]:

ID Name Category Amount Payment

0 2 Alex B 250 Credit Card

1 4 Jack C 320 Credit Card

ID Name Category Amount Payment

2 5 John B 250 Cash

We can perform Full join by just passing the how argument as ‘outer’ to the merge() function:
In [88]:

df3.merge(df4, on = 'ID', how = 'outer')

Out[88]:

ID Name Category Amount Payment

0 1 Rane A NaN NaN

1 2 Alex B 250.0 Credit Card

2 3 Ayan A NaN NaN

3 4 Jack C 320.0 Credit Card

4 5 John B 250.0 Cash

5 6 NaN NaN 440.0 Cash

Performing a left join is actually quite similar to a full join. Just change the how argument to ‘left’:
In [89]:

31
DATAWAREHOSING AND DATAMINING 220643116004
444

df3.merge(df4, on = 'ID', how = 'left')

Out[89]:

ID Name Category Amount Payment

0 1 Rane A NaN NaN

1 2 Alex B 250.0 Credit Card

ID Name Category Amount Payment

2 3 Ayan A NaN NaN

3 4 Jack C 320.0 Credit Card

4 5 John B 250.0 Cash

Similar to other joins, we can perform a right join by changing the how argument to ‘right’:
In [91]:

df3.merge(df4, on = 'ID', how = 'right')

Out[91]:

ID Name Category Amount Payment

0 2 Alex B 250 Credit Card

1 4 Jack C 320 Credit Card

2 5 John B 250 Cash

3 6 NaN NaN 440 Cash

Get dummies
Some machine learning models cannot handle categorical variables. In such cases, we should

32
DATAWAREHOSING AND DATAMINING 220643116004
444
encode the categorical variables in a way that each category is represented as a column.
In [95]:

df5 = pd.read_csv('Customer.csv')
In [96]:

df5

Name Category Value

0 Rane A 14.2

1 Alex A 21.4

2 Ayan C 15.6

3 Jack B 12.1

4 John B 17.7

Out[96]:

In [98]:

pd.get_dummies(df5)
Out[98]:

Valu Name_Al Name_Ay Name_Ja Name_Jo Name_Ra Category_ Category_ Category_

e ex an ck hn ne A B C

0 14.2 0 0 0 0 1 1 0 0

33
DATAWAREHOSING AND DATAMINING 220643116004
444

1 21.4 1 0 0 0 0 1 0 0

2 15.6 0 1 0 0 0 0 0 1

3 12.1 0 0 1 0 0 0 1 0

4 17.7 0 0 0 1 0 0 1 0

For instance, in the first row, the name is Jane and the ctg is A. Thus, the columns that represent
these values are 1 and all other columns are 0.

Pivot table
The pivot_table function transforms a dataframe to a format that explains the relationship among
variables.
We have the dataframe on the left that contains two categorical features (i.e. columns) and a
numerical feature. We want to see the average value of the categories in both columns. The
pivot_table function transforms the dataframe in a way that the average values or any other
aggregation can be seen clearly.
In [100]:

df5.pivot_table(index='Name', columns='Category', aggfunc='mean')

Out[100]:

Value

Category A B C

Name

Alex 21.4 NaN NaN

Ayan NaN NaN 15.6

Jack NaN 12.1 NaN

John NaN 17.7 NaN

Rane 14.2 NaN NaN

34
DATAWAREHOSING AND DATAMINING 220643116004
444
PRACTICAL-2

AIM: Perform data reduction in Linear Algebra methods and Mainfold

Learning methods with achieved Accuracy from Make_Classification dataset
using Logistic Regression.

Dimensionality reduction is an unsupervised learning technique.Nevertheless, it can be used as

a data transform pre-processing step for machine learning algorithms on classification and
regression predictive modeling datasets with supervised learning algorithms.
There are many dimensionality reduction algorithms to choose from and no single best algorithm
for all cases. Instead, it is a good idea to explore a range of dimensionality reduction algorithms
and different configurations for each algorithm. In this tutorial, you will discover how to fit and
evaluate top dimensionality reduction algorithms in Python.

Dimensionality reduction seeks a lower-dimensional representation of numerical input data that

preserves the salient relationships in the data.
There are many different dimensionality reduction algorithms and no single best method for all
datasets.
How to implement, fit, and evaluate top dimensionality reduction in Python with the scikit-learn
machine learning library.

Dimensionality reduction refers to techniques for reducing the number of input variables in
training data. High-dimensionality might mean hundreds, thousands, or even millions of input
variables.

Fewer input dimensions often means correspondingly fewer parameters or a simpler structure in
the machine learning model, referred to as degrees of freedom. A model with too many degrees of
freedom is likely to overfit the training dataset and may not perform well on new data.
It is desirable to have simple models that generalize well, and in turn, input data with few input
variables. This is particularly true for linear models where the number of inputs and the degrees of
freedom of the model are often closely related.

Dimensionality reduction is a data preparation technique performed on data prior to modeling. It

might be performed after data cleaning and data scaling and before training a predictive model.

Linear Algebra Methods

Matrix factorization methods drawn from the field of linear algebra can be used for

35
DATAWAREHOSING AND DATAMINING 220643116004

dimensionality.

Some of the more popular methods include:

 Principal Components Analysis

 Singular Value Decomposition
 Linear Discriminant Analysis

Manifold Learning Methods

Manifold learning methods seek a lower-dimensional projection of high dimensional input that
captures the salient properties of the input data.

Some of the more popular methods include:

 Isomap Embedding
 Locally Linear Embedding
 Modified Locally Linear Embedding

To begin practical with classification introduction

Introduction of classification dataset

from sklearn.datasets import make_classification

make_classification
In [4]:

X, y = make_classification(n_sample s=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)

In [5]:

arr ay([[ 0.08054814, 0.82273313, -1.21175254, ..., 2.88260938,

1.79160028, -4.29708787],

Out[5]:
[-2.3302999 , - 4.86608574, -3.88291317, ..., -0.14561581,
-0.55489384, 0 .61420772],

36
DATAWAREHOSING AND DATAMINING 220643116004

[-1.19714954, 1.5556314 , -0.61871573, ..., 1.73481788,

0.13067403, -3.13351468],
...,
[ 0.61415067, -3.04457734, -3.15540898, ..., -0.3321506 ,
-2.76644911, 0.81460546],
[ 3.34221924, -1.33613258, -0.34013763, ..., -3.95225071,
1.33439536, -0.69139029],
[-1.49207892, 2.75225738, -1.22655776, ..., -3.10146388,
2.34534351, -1.32021006]])
In [6]:

Out[6]:
array([0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,
1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1,
1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0,
1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0,
1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0,
1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0,
0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1,
0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1,
0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0,
1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1,

37
DATAWAREHOSING AND DATAMINING 220643116004

0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1,
0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1,
1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1,
1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0,
1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0,
0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0,
0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0,
0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0,
1, 0, 0, 1, 0, 0, 1, 1, 0, 1])
In [7]:

from numpy import mean

from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the model
model = LogisticRegression()
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy: 0.824 (0.034)

Linear Algebra Methods

38
DATAWAREHOSING AND DATAMINING 220643116004

Principle Component Analysis

In [8]:

from numpy import mean

from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.824 (0.034)

Single Value Decomposition

In [9]:

from numpy import mean

39
DATAWAREHOSING AND DATAMINING 220643116004

# define the pipeline

steps = [('svd', TruncatedSVD(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.824 (0.034)

Linear Discriminant Analysis

In [10]:

from numpy import mean

from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.825 (0.034)

Manifold Learning Methods

Isomap Embedding
In [11]:

40
DATAWAREHOSING AND DATAMINING 220643116004

from numpy import mean

from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import Isomap
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('iso', Isomap(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy: 0.888 (0.029)

Locally Linear Embedding

In [12]:

from numpy import mean

from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lle', LocallyLinearEmbedding(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy: 0.886 (0.028)

41
DATAWAREHOSING AND DATAMINING 220643116004

Modified Locally Linear Embedding

In [13]:

from numpy import mean

from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lle', LocallyLinearEmbedding(n_components=5, method='modified', n_neighbors=10)), ('m', LogisticRegr
ession())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy: 0.848 (0.037)
In [ ]:

42
DATAWAREHOSING AND DATAMINING 220643116004

PRACTICAL-3

AIM: Demonstration of Pre-processing Methods i) Rescale Data and ii)

Binarize Data

i) Rescale Data

Your data must be prepared before you can build models. The data preparation process can
involve three steps: data selection, data preprocessing and data transformation. Your preprocessed
data may contain attributes with a mixtures of scales for various quantities such as dollars,
kilograms and sales volume. Many machine learning methods expect or are more effective if the
data attributes have the same scale. Two popular data scaling methods are normalization and
standardization.

Data Normalization
Normalization refers to rescaling real valued numeric attributes into the range 0 and 1. It is useful
to scale the input attributes for a model that relies on the magnitude of values, such as distance
measures used in k-nearest neighbors and in the preparation of coefficients in regression. The
example below demonstrate data normalization of the Iris flowers dataset.

Normalization
In [1]:

from sklearn.datasets import load_iris

from sklearn import preprocessing
# load the iris dataset
iris = load_iris()
print(iris.data.shape)

(150, 4)
In [2]:

iris

43
DATAWAREHOSING AND DATAMINING 220643116004

{'data': array([[5.1, 3.5, 1.4, 0.2],

Out[2]:
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],

44
DATAWAREHOSING AND DATAMINING 220643116004

[4.4, 3.2, 1.3, 0.2],

[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],

45
DATAWAREHOSING AND DATAMINING 220643116004

[6.3, 2.3, 4.4, 1.3],

[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],

46
DATAWAREHOSING AND DATAMINING 220643116004

[6.4, 2.8, 5.6, 2.2],

[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]]),
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),

In [3]:

X = iris.data
y = iris.target

In [4]:

X
Out[4]:
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],

47
DATAWAREHOSING AND DATAMINING 220643116004

[4.6, 3.1, 1.5, 0.2],

[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],

48
DATAWAREHOSING AND DATAMINING 220643116004

[5.3, 3.7, 1.5, 0.2],

[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],
[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],

49
DATAWAREHOSING AND DATAMINING 220643116004

[5. , 2.3, 3.3, 1. ],

[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],
[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],

50
DATAWAREHOSING AND DATAMINING 220643116004

[6. , 3. , 4.8, 1.8],

[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]])
In [5]:

y
Out[5]:
array([0, 0, 0, 0, 0, 0 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [6]:

# normalize the data attributes

normalized_X = preprocessing.normalize(X)
In [8]:

normalized_X
Out[8]:
array([[0.80377277, 0.55160877, 0.22064351, 0.0315205 ],
[0.82813287, 0.50702013, 0.23660939, 0.03380134],
[0.80533308, 0.54831188, 0.2227517 , 0.03426949],
[0.80003025, 0.53915082, 0.26087943, 0.03478392],
[0.790965 , 0.5694948 , 0.2214702 , 0.0316386 ],

51
DATAWAREHOSING AND DATAMINING 220643116004

[0.78417499, 0.5663486 , 0.2468699 , 0.05808704],

[0.78010936, 0.57660257, 0.23742459, 0.0508767 ],
[0.80218492, 0.54548574, 0.24065548, 0.0320874 ],
[0.80642366, 0.5315065 , 0.25658935, 0.03665562],
[0.81803119, 0.51752994, 0.25041771, 0.01669451],
[0.80373519, 0.55070744, 0.22325977, 0.02976797],
[0.786991 , 0.55745196, 0.26233033, 0.03279129],
[0.82307218, 0.51442011, 0.24006272, 0.01714734],
[0.8025126 , 0.55989251, 0.20529392, 0.01866308],
[0.81120865, 0.55945424, 0.16783627, 0.02797271],
[0.77381111, 0.59732787, 0.2036345 , 0.05430253],
[0.79428944, 0.57365349, 0.19121783, 0.05883625],
[0.80327412, 0.55126656, 0.22050662, 0.04725142],
[0.8068282 , 0.53788547, 0.24063297, 0.04246464],
[0.77964883, 0.58091482, 0.22930848, 0.0458617 ],
[0.8173379 , 0.51462016, 0.25731008, 0.03027177],
[0.78591858, 0.57017622, 0.23115252, 0.06164067],
[0.77577075, 0.60712493, 0.16864581, 0.03372916],
[0.80597792, 0.52151512, 0.26865931, 0.07901744],
[0.776114 , 0.54974742, 0.30721179, 0.03233808],
[0.82647451, 0.4958847 , 0.26447184, 0.03305898],
[0.79778206, 0.5424918 , 0.25529026, 0.06382256],
[0.80641965, 0.54278246, 0.23262105, 0.03101614],
[0.81609427, 0.5336001 , 0.21971769, 0.03138824],
[0.79524064, 0.54144043, 0.27072022, 0.03384003],
[0.80846584, 0.52213419, 0.26948861, 0.03368608],
[0.82225028, 0.51771314, 0.22840286, 0.06090743],
[0.76578311, 0.60379053, 0.22089897, 0.0147266 ],
[0.77867447, 0.59462414, 0.19820805, 0.02831544],
[0.81768942, 0.51731371, 0.25031309, 0.03337508],
[0.82512295, 0.52807869, 0.19802951, 0.03300492],
[0.82699754, 0.52627116, 0.19547215, 0.03007264],
[0.78523221, 0.5769053 , 0.22435206, 0.01602515],
[0.80212413, 0.54690282, 0.23699122, 0.03646019],
[0.80779568, 0.53853046, 0.23758697, 0.03167826],
[0.80033301, 0.56023311, 0.20808658, 0.04801998],
[0.86093857, 0.44003527, 0.24871559, 0.0573959 ],
[0.78609038, 0.57170209, 0.23225397, 0.03573138],
[0.78889479, 0.55222635, 0.25244633, 0.09466737],
[0.76693897, 0.57144472, 0.28572236, 0.06015208],
[0.82210585, 0.51381615, 0.23978087, 0.05138162],
[0.77729093, 0.57915795, 0.24385598, 0.030482 ],
[0.79594782, 0.55370283, 0.24224499, 0.03460643],
[0.79837025, 0.55735281, 0.22595384, 0.03012718],
[0.81228363, 0.5361072 , 0.22743942, 0.03249135],

52
DATAWAREHOSING AND DATAMINING 220643116004

[0.76701103, 0.35063361, 0.51499312, 0.15340221],

[0.74549757, 0.37274878, 0.52417798, 0.17472599],
[0.75519285, 0.33928954, 0.53629637, 0.16417236],
[0.75384916, 0.31524601, 0.54825394, 0.17818253],
[0.7581754 , 0.32659863, 0.5365549 , 0.17496355],
[0.72232962, 0.35482858, 0.57026022, 0.16474184],
[0.72634846, 0.38046824, 0.54187901, 0.18446945],
[0.75916547, 0.37183615, 0.51127471, 0.15493173],
[0.76301853, 0.33526572, 0.53180079, 0.15029153],
[0.72460233, 0.37623583, 0.54345175, 0.19508524],
[0.76923077, 0.30769231, 0.53846154, 0.15384615],
[0.73923462, 0.37588201, 0.52623481, 0.187941 ],
[0.78892752, 0.28927343, 0.52595168, 0.13148792],
[0.73081412, 0.34743622, 0.56308629, 0.16772783],
[0.75911707, 0.3931142 , 0.48800383, 0.17622361],
[0.76945444, 0.35601624, 0.50531337, 0.16078153],
[0.70631892, 0.37838513, 0.5675777 , 0.18919257],
[0.75676497, 0.35228714, 0.53495455, 0.13047672],
[0.76444238, 0.27125375, 0.55483721, 0.18494574],
[0.76185188, 0.34011245, 0.53057542, 0.14964948],
[0.6985796 , 0.37889063, 0.56833595, 0.21312598],
[0.77011854, 0.35349703, 0.50499576, 0.16412362],
[0.74143307, 0.29421947, 0.57667016, 0.17653168],
[0.73659895, 0.33811099, 0.56754345, 0.14490471],
[0.76741698, 0.34773582, 0.51560829, 0.15588157],
[0.76785726, 0.34902603, 0.51190484, 0.16287881],
[0.76467269, 0.31486523, 0.53976896, 0.15743261],
[0.74088576, 0.33173989, 0.55289982, 0.18798594],
[0.73350949, 0.35452959, 0.55013212, 0.18337737],
[0.78667474, 0.35883409, 0.48304589, 0.13801311],
[0.76521855, 0.33391355, 0.52869645, 0.15304371],
[0.77242925, 0.33706004, 0.51963422, 0.14044168],
[0.76434981, 0.35581802, 0.51395936, 0.15814134],
[0.70779525, 0.31850786, 0.60162596, 0.1887454 ],
[0.69333409, 0.38518561, 0.57777841, 0.1925928 ],
[0.71524936, 0.40530797, 0.53643702, 0.19073316],
[0.75457341, 0.34913098, 0.52932761, 0.16893434],
[0.77530021, 0.28304611, 0.54147951, 0.15998258],
[0.72992443, 0.39103094, 0.53440896, 0.16944674],
[0.74714194, 0.33960997, 0.54337595, 0.17659719],
[0.72337118, 0.34195729, 0.57869695, 0.15782644],
[0.73260391, 0.36029701, 0.55245541, 0.1681386 ],
[0.76262994, 0.34186859, 0.52595168, 0.1577855 ],
[0.76986879, 0.35413965, 0.5081134 , 0.15397376],
[0.73544284, 0.35458851, 0.55158213, 0.1707278 ],

53
DATAWAREHOSING AND DATAMINING 220643116004

[0.73239618, 0.38547167, 0.53966034, 0.15418867],

[0.73446047, 0.37367287, 0.5411814 , 0.16750853],
[0.75728103, 0.3542121 , 0.52521104, 0.15878473],
[0.78258054, 0.38361791, 0.4603415 , 0.16879188],
[0.7431482 , 0.36505526, 0.5345452 , 0.16948994],
[0.65387747, 0.34250725, 0.62274045, 0.25947519],
[0.69052512, 0.32145135, 0.60718588, 0.22620651],
[0.71491405, 0.30207636, 0.59408351, 0.21145345],
[0.69276796, 0.31889319, 0.61579374, 0.1979337 ],
[0.68619022, 0.31670318, 0.61229281, 0.232249 ],
[0.70953708, 0.28008043, 0.61617694, 0.1960563 ],
[0.67054118, 0.34211284, 0.61580312, 0.23263673],
[0.71366557, 0.28351098, 0.61590317, 0.17597233],
[0.71414125, 0.26647062, 0.61821183, 0.19185884],
[0.69198788, 0.34599394, 0.58626751, 0.24027357],
[0.71562645, 0.3523084 , 0.56149152, 0.22019275],
[0.71576546, 0.30196356, 0.59274328, 0.21249287],
[0.71718148, 0.31640359, 0.58007326, 0.22148252],
[0.6925518 , 0.30375079, 0.60750157, 0.24300063],
[0.67767924, 0.32715549, 0.59589036, 0.28041899],
[0.69589887, 0.34794944, 0.57629125, 0.25008866],
[0.70610474, 0.3258945 , 0.59747324, 0.1955367 ],
[0.69299099, 0.34199555, 0.60299216, 0.19799743],
[0.70600618, 0.2383917 , 0.63265489, 0.21088496],
[0.72712585, 0.26661281, 0.60593821, 0.18178146],
[0.70558934, 0.32722984, 0.58287815, 0.23519645],
[0.68307923, 0.34153961, 0.59769433, 0.24395687],
[0.71486543, 0.25995106, 0.62202576, 0.18567933],
[0.73122464, 0.31338199, 0.56873028, 0.20892133],
[0.69595601, 0.3427843 , 0.59208198, 0.21813547],
[0.71529453, 0.31790868, 0.59607878, 0.17882363],
[0.72785195, 0.32870733, 0.56349829, 0.21131186],
[0.71171214, 0.35002236, 0.57170319, 0.21001342],
[0.69594002, 0.30447376, 0.60894751, 0.22835532],
[0.73089855, 0.30454106, 0.58877939, 0.1624219 ],
[0.72766159, 0.27533141, 0.59982915, 0.18683203],
[0.71578999, 0.34430405, 0.5798805 , 0.18121266],
[0.69417747, 0.30370264, 0.60740528, 0.2386235 ],
[0.72366005, 0.32162669, 0.58582004, 0.17230001],
[0.69385414, 0.29574111, 0.63698085, 0.15924521],
[0.73154399, 0.28501714, 0.57953485, 0.21851314],
[0.67017484, 0.36168166, 0.59571097, 0.2553047 ],
[0.69804799, 0.338117 , 0.59988499, 0.196326 ],
[0.71066905, 0.35533453, 0.56853524, 0.21320072],
[0.72415258, 0.32534391, 0.56672811, 0.22039426],

54
DATAWAREHOSING AND DATAMINING 220643116004

[0.69997037, 0.32386689, 0.58504986, 0.25073566],

[0.73337886, 0.32948905, 0.54206264, 0.24445962],
[0.69052512, 0.32145135, 0.60718588, 0.22620651],
[0.69193502, 0.32561648, 0.60035539, 0.23403685],
[0.68914871, 0.33943145, 0.58629069, 0.25714504],
[0.72155725, 0.32308533, 0.56001458, 0.24769876],
[0.72965359, 0.28954508, 0.57909015, 0.22005426],
[0.71653899, 0.3307103 , 0.57323119, 0.22047353],
[0.67467072, 0.36998072, 0.58761643, 0.25028107],
[0.69025916, 0.35097923, 0.5966647 , 0.21058754]])
In [ ]:

Data Standardization

Standardization refers to shifting the distribution of each attribute to have a mean of zero and a
standard deviation of one (unit variance). It is useful to standardize attributes for a model that relies
on the distribution of attributes such as Gaussian processes. The example below demonstrate data
standardization of the Iris flowers dataset.

Data Standardization
In [9]:

from sklearn.datasets import load_iris

from sklearn import preprocessing
# load the Iris dataset
iris = load_iris()

In [10]:

55
DATAWAREHOSING AND DATAMINING 220643116004

iris

{'data': array([[5.1, 3.5, 1.4, 0.2],

Out[10]:
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],

56
DATAWAREHOSING AND DATAMINING 220643116004

[4.5, 2.3, 1.3, 0.3],

[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],

57
DATAWAREHOSING AND DATAMINING 220643116004

[6.7, 3.1, 4.7, 1.5],

[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],

58
DATAWAREHOSING AND DATAMINING 220643116004

[7.9, 3.8, 6.4, 2. ],

[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]]),
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
In [11]:

X = iris.data y = iris.target

In [12]:

standardized_X = preprocessing.scale(X)
In [14]:

59
DATAWAREHOSING AND DATAMINING 220643116004

standardized_X

array([[-9.00681170e-01, 1.01900435e+00, -1.34022653e+00,

-1.31544430e+00],
[-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
-1.31544430e+00],
[-1.38535265e+00, 3.28414053e-01, -1.39706395e+00,
-1.31544430e+00],
[-1.50652052e+00, 9.82172869e-02, -1.28338910e+00,
-1.31544430e+00],
[-1.02184904e+00, 1.24920112e+00, -1.34022653e+00,
-1.31544430e+00],
[-5.37177559e-01, 1.93979142e+00, -1.16971425e+00,
-1.05217993e+00],
[-1.50652052e+00, 7.88807586e-01, -1.34022653e+00,
-1.18381211e+00],
[-1.02184904e+00, 7.88807586e-01, -1.28338910e+00,
-1.31544430e+00],
[-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
-1.31544430e+00],
[-1.14301691e+00, 9.82172869e-02, -1.28338910e+00,
-1.44707648e+00],
[-5.37177559e-01, 1.47939788e+00, -1.28338910e+00,
-1.31544430e+00],
[-1.26418478e+00, 7.88807586e-01, -1.22655167e+00,
-1.31544430e+00],
[-1.26418478e+00, -1.31979479e-01, -1.34022653e+00,

Out[14]:
-1.44707648e+00],
[-1.87002413e+00, -1.31979479e-01, -1.51073881e+00,
-1.44707648e+00],
[-5.25060772e-02, 2.16998818e+00, -1.45390138e+00,
-1.31544430e+00],
[-1.73673948e-01, 3.09077525e+00, -1.28338910e+00,
-1.05217993e+00],
[-5.37177559e-01, 1.93979142e+00, -1.39706395e+00,
-1.05217993e+00],
[-9.00681170e-01, 1.01900435e+00, -1.34022653e+00,
-1.18381211e+00],
[-1.73673948e-01, 1.70959465e+00, -1.16971425e+00,

60
DATAWAREHOSING AND DATAMINING 220643116004

-1.18381211e+00],
[-9.00681170e-01, 1.70959465e+00, -1.28338910e+00,
-1.18381211e+00],
[-5.37177559e-01, 7.88807586e-01, -1.16971425e+00,
-1.31544430e+00],
[-9.00681170e-01, 1.47939788e+00, -1.28338910e+00,
-1.05217993e+00],
[-1.50652052e+00, 1.24920112e+00, -1.56757623e+00,
-1.31544430e+00],
[-9.00681170e-01, 5.58610819e-01, -1.16971425e+00,
-9.20547742e-01],
[-1.26418478e+00, 7.88807586e-01, -1.05603939e+00,
-1.31544430e+00],
[-1.02184904e+00, -1.31979479e-01, -1.22655167e+00,
-1.31544430e+00],
[-1.02184904e+00, 7.88807586e-01, -1.22655167e+00,
-1.05217993e+00],
[-7.79513300e-01, 1.01900435e+00, -1.28338910e+00,
-1.31544430e+00],
[-7.79513300e-01, 7.88807586e-01, -1.34022653e+00,
-1.31544430e+00],
[-1.38535265e+00, 3.28414053e-01, -1.22655167e+00,
-1.31544430e+00],
[-1.26418478e+00, 9.82172869e-02, -1.22655167e+00,
-1.31544430e+00],
[-5.37177559e-01, 7.88807586e-01, -1.28338910e+00,
-1.05217993e+00],
[-7.79513300e-01, 2.40018495e+00, -1.28338910e+00,
-1.44707648e+00],
[-4.16009689e-01, 2.63038172e+00, -1.34022653e+00,
-1.31544430e+00],
[-1.14301691e+00, 9.82172869e-02, -1.28338910e+00,
-1.31544430e+00],
[-1.02184904e+00, 3.28414053e-01, -1.45390138e+00,
-1.31544430e+00],
[-4.16009689e-01, 1.01900435e+00, -1.39706395e+00,
-1.31544430e+00],
[-1.14301691e+00, 1.24920112e+00, -1.34022653e+00,
-1.44707648e+00],
[-1.74885626e+00, -1.31979479e-01, -1.39706395e+00,
-1.31544430e+00],
[-9.00681170e-01, 7.88807586e-01, -1.28338910e+00,
-1.31544430e+00],
[-1.02184904e+00, 1.01900435e+00, -1.39706395e+00,
-1.18381211e+00],
61
DATAWAREHOSING AND DATAMINING 220643116004

[-1.62768839e+00, -1.74335684e+00, -1.39706395e+00,

-1.18381211e+00],
[-1.74885626e+00, 3.28414053e-01, -1.39706395e+00,
-1.31544430e+00],
[-1.02184904e+00, 1.01900435e+00, -1.22655167e+00,
-7.88915558e-01],
[-9.00681170e-01, 1.70959465e+00, -1.05603939e+00,
-1.05217993e+00],
[-1.26418478e+00, -1.31979479e-01, -1.34022653e+00,
-1.18381211e+00],
[-9.00681170e-01, 1.70959465e+00, -1.22655167e+00,
-1.31544430e+00],
[-1.50652052e+00, 3.28414053e-01, -1.34022653e+00,
-1.31544430e+00],
[-6.58345429e-01, 1.47939788e+00, -1.28338910e+00,
-1.31544430e+00],
[-1.02184904e+00, 5.58610819e-01, -1.34022653e+00,
-1.31544430e+00],
[ 1.40150837e+00, 3.28414053e-01, 5.35408562e-01,
2.64141916e-01],
[ 6.74501145e-01, 3.28414053e-01, 4.21733708e-01,
3.95774101e-01],
[ 1.28034050e+00, 9.82172869e-02, 6.49083415e-01,
3.95774101e-01],
[-4.16009689e-01, -1.74335684e+00, 1.37546573e-01,
1.32509732e-01],
[ 7.95669016e-01, -5.92373012e-01, 4.78571135e-01,
3.95774101e-01],
[-1.73673948e-01, -5.92373012e-01, 4.21733708e-01,
1.32509732e-01],
[ 5.53333275e-01, 5.58610819e-01, 5.35408562e-01,
5.27406285e-01],
[-1.14301691e+00, -1.51316008e+00, -2.60315415e-01,
-2.62386821e-01],
[ 9.16836886e-01, -3.62176246e-01, 4.78571135e-01,
1.32509732e-01],
[-7.79513300e-01, -8.22569778e-01, 8.07091462e-02,
2.64141916e-01],
[-1.02184904e+00, -2.43394714e+00, -1.46640561e-01,
-2.62386821e-01],
[ 6.86617933e-02, -1.31979479e-01, 2.51221427e-01,
3.95774101e-01],
[ 1.89829664e-01, -1.97355361e+00, 1.37546573e-01,
-2.62386821e-01],
[ 3.10997534e-01, -3.62176246e-01, 5.35408562e-01,
62
DATAWAREHOSING AND DATAMINING 220643116004

2.64141916e-01],
[-2.94841818e-01, -3.62176246e-01, -8.98031345e-02,
1.32509732e-01],
[ 1.03800476e+00, 9.82172869e-02, 3.64896281e-01,
2.64141916e-01],
[-2.94841818e-01, -1.31979479e-01, 4.21733708e-01,
3.95774101e-01],
[-5.25060772e-02, -8.22569778e-01, 1.94384000e-01,
-2.62386821e-01],
[ 4.32165405e-01, -1.97355361e+00, 4.21733708e-01,
3.95774101e-01],
[-2.94841818e-01, -1.28296331e+00, 8.07091462e-02,
-1.30754636e-01],
[ 6.86617933e-02, 3.28414053e-01, 5.92245988e-01,
7.90670654e-01],
[ 3.10997534e-01, -5.92373012e-01, 1.37546573e-01,
1.32509732e-01],
[ 5.53333275e-01, -1.28296331e+00, 6.49083415e-01,
3.95774101e-01],
[ 3.10997534e-01, -5.92373012e-01, 5.35408562e-01,
8.77547895e-04],
[ 6.74501145e-01, -3.62176246e-01, 3.08058854e-01,
1.32509732e-01],
[ 9.16836886e-01, -1.31979479e-01, 3.64896281e-01,
2.64141916e-01],
[ 1.15917263e+00, -5.92373012e-01, 5.92245988e-01,
2.64141916e-01],
[ 1.03800476e+00, -1.31979479e-01, 7.05920842e-01,
6.59038469e-01],
[ 1.89829664e-01, -3.62176246e-01, 4.21733708e-01,
3.95774101e-01],
[-1.73673948e-01, -1.05276654e+00, -1.46640561e-01,
-2.62386821e-01],
[-4.16009689e-01, -1.51316008e+00, 2.38717193e-02,
-1.30754636e-01],
[-4.16009689e-01, -1.51316008e+00, -3.29657076e-02,
-2.62386821e-01],
[-5.25060772e-02, -8.22569778e-01, 8.07091462e-02,
8.77547895e-04],
[ 1.89829664e-01, -8.22569778e-01, 7.62758269e-01,
5.27406285e-01],
[-5.37177559e-01, -1.31979479e-01, 4.21733708e-01,
3.95774101e-01],
[ 1.89829664e-01, 7.88807586e-01, 4.21733708e-01,
5.27406285e-01],
63
DATAWAREHOSING AND DATAMINING 220643116004

[ 1.03800476e+00, 9.82172869e-02, 5.35408562e-01,

3.95774101e-01],
[ 5.53333275e-01, -1.74335684e+00, 3.64896281e-01,
1.32509732e-01],
[-2.94841818e-01, -1.31979479e-01, 1.94384000e-01,
1.32509732e-01],
[-4.16009689e-01, -1.28296331e+00, 1.37546573e-01,
1.32509732e-01],
[-4.16009689e-01, -1.05276654e+00, 3.64896281e-01,
8.77547895e-04],
[ 3.10997534e-01, -1.31979479e-01, 4.78571135e-01,
2.64141916e-01],
[-5.25060772e-02, -1.05276654e+00, 1.37546573e-01,
8.77547895e-04],
[-1.02184904e+00, -1.74335684e+00, -2.60315415e-01,
-2.62386821e-01],
[-2.94841818e-01, -8.22569778e-01, 2.51221427e-01,
1.32509732e-01],
[-1.73673948e-01, -1.31979479e-01, 2.51221427e-01,
8.77547895e-04],
[-1.73673948e-01, -3.62176246e-01, 2.51221427e-01,
1.32509732e-01],
[ 4.32165405e-01, -3.62176246e-01, 3.08058854e-01,
1.32509732e-01],
[-9.00681170e-01, -1.28296331e+00, -4.30827696e-01,
-1.30754636e-01],
[-1.73673948e-01, -5.92373012e-01, 1.94384000e-01,
1.32509732e-01],
[ 5.53333275e-01, 5.58610819e-01, 1.27429511e+00,
1.71209594e+00],
[-5.25060772e-02, -8.22569778e-01, 7.62758269e-01,
9.22302838e-01],
[ 1.52267624e+00, -1.31979479e-01, 1.21745768e+00,
1.18556721e+00],
[ 5.53333275e-01, -3.62176246e-01, 1.04694540e+00,
7.90670654e-01],
[ 7.95669016e-01, -1.31979479e-01, 1.16062026e+00,
1.31719939e+00],
[ 2.12851559e+00, -1.31979479e-01, 1.61531967e+00,
1.18556721e+00],
[-1.14301691e+00, -1.28296331e+00, 4.21733708e-01,
6.59038469e-01],
[ 1.76501198e+00, -3.62176246e-01, 1.44480739e+00,
7.90670654e-01],
[ 1.03800476e+00, -1.28296331e+00, 1.16062026e+00,
64
DATAWAREHOSING AND DATAMINING 220643116004

7.90670654e-01],
[ 1.64384411e+00, 1.24920112e+00, 1.33113254e+00,
1.71209594e+00],
[ 7.95669016e-01, 3.28414053e-01, 7.62758269e-01,
1.05393502e+00],
[ 6.74501145e-01, -8.22569778e-01, 8.76433123e-01,
9.22302838e-01],
[ 1.15917263e+00, -1.31979479e-01, 9.90107977e-01,
1.18556721e+00],
[-1.73673948e-01, -1.28296331e+00, 7.05920842e-01,
1.05393502e+00],
[-5.25060772e-02, -5.92373012e-01, 7.62758269e-01,
1.58046376e+00],
[ 6.74501145e-01, 3.28414053e-01, 8.76433123e-01,
1.44883158e+00],
[ 7.95669016e-01, -1.31979479e-01, 9.90107977e-01,
7.90670654e-01],
[ 2.24968346e+00, 1.70959465e+00, 1.67215710e+00,
1.31719939e+00],
[ 2.24968346e+00, -1.05276654e+00, 1.78583195e+00,
1.44883158e+00],
[ 1.89829664e-01, -1.97355361e+00, 7.05920842e-01,
3.95774101e-01],
[ 1.28034050e+00, 3.28414053e-01, 1.10378283e+00,
1.44883158e+00],
[-2.94841818e-01, -5.92373012e-01, 6.49083415e-01,
1.05393502e+00],
[ 2.24968346e+00, -5.92373012e-01, 1.67215710e+00,
1.05393502e+00],
[ 5.53333275e-01, -8.22569778e-01, 6.49083415e-01,
7.90670654e-01],
[ 1.03800476e+00, 5.58610819e-01, 1.10378283e+00,
1.18556721e+00],
[ 1.64384411e+00, 3.28414053e-01, 1.27429511e+00,
7.90670654e-01],
[ 4.32165405e-01, -5.92373012e-01, 5.92245988e-01,
7.90670654e-01],
[ 3.10997534e-01, -1.31979479e-01, 6.49083415e-01,
7.90670654e-01],
[ 6.74501145e-01, -5.92373012e-01, 1.04694540e+00,
1.18556721e+00],
[ 1.64384411e+00, -1.31979479e-01, 1.16062026e+00,
5.27406285e-01],
[ 1.88617985e+00, -5.92373012e-01, 1.33113254e+00,
9.22302838e-01],
65
DATAWAREHOSING AND DATAMINING 220643116004

[ 2.49201920e+00, 1.70959465e+00, 1.50164482e+00,

1.05393502e+00],
[ 6.74501145e-01, -5.92373012e-01, 1.04694540e+00,
1.31719939e+00],
[ 5.53333275e-01, -5.92373012e-01, 7.62758269e-01,
3.95774101e-01],
[ 3.10997534e-01, -1.05276654e+00, 1.04694540e+00,
2.64141916e-01],
[ 2.24968346e+00, -1.31979479e-01, 1.33113254e+00,
1.44883158e+00],
[ 5.53333275e-01, 7.88807586e-01, 1.04694540e+00,
1.58046376e+00],
[ 6.74501145e-01, 9.82172869e-02, 9.90107977e-01,
7.90670654e-01],
[ 1.89829664e-01, -1.31979479e-01, 5.92245988e-01,
7.90670654e-01],
[ 1.28034050e+00, 9.82172869e-02, 9.33270550e-01,
1.18556721e+00],
[ 1.03800476e+00, 9.82172869e-02, 1.04694540e+00,
1.58046376e+00],
[ 1.28034050e+00, 9.82172869e-02, 7.62758269e-01,
1.44883158e+00],
[-5.25060772e-02, -8.22569778e-01, 7.62758269e-01,
9.22302838e-01],
[ 1.15917263e+00, 3.28414053e-01, 1.21745768e+00,
1.44883158e+00],
[ 1.03800476e+00, 5.58610819e-01, 1.10378283e+00,
1.71209594e+00],
[ 1.03800476e+00, -1.31979479e-01, 8.19595696e-01,
1.44883158e+00],
[ 5.53333275e-01, -1.28296331e+00, 7.05920842e-01,
9.22302838e-01],
[ 7.95669016e-01, -1.31979479e-01, 8.19595696e-01,
1.05393502e+00],
[ 4.32165405e-01, 7.88807586e-01, 9.33270550e-01,
1.44883158e+00],
[ 6.86617933e-02, -1.31979479e-01, 7.62758269e-01,
7.90670654e-01]])

ii) Binarize Data

Binarization is the process of dividing data into two groups and assigning one out of two values to
all the members of the same group. This is usually accomplished. By defining a threshold t and

66
DATAWAREHOSING AND DATAMINING 220643116004

assigning the value 0 to all the data points below. The threshold and 1 to those above it.
sklearn.preprocessing.Binarizer() is a method which belongs to preprocessing module. It plays
a key role in the discretization of continuous feature values.

Binarize data
In [13]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing

In [14]:

data = pd.read_csv('p3.csv')
In [15]:

data
Out[15]:

Country Age Salary Purchased

0 France 44.0 72000.0 No

1 Spain 27.0 48000.0 Yes

2 Germany 30.0 54000.0 No

3 Spain 38.0 61000.0 No

67
DATAWAREHOSING AND DATAMINING 220643116004

4 Germany 40.0 NaN Yes

5 France 35.0 58000.0 Yes

6 Spain NaN 52000.0 No

7 France 48.0 79000.0 Yes

8 Germany 50.0 83000.0 No

9 France 37.0 67000.0 Yes

In [33]:

data["Salary"].fillna(method='ffill', inplace=True)
In [40]:

data["Age"].fillna(method='ffill', inplace=True)
In [41]:

data

Country Age Salary Purchased

0 France 44.0 72000.0 No

1 Spain 27.0 48000.0 Yes

2 Germany 30.0 54000.0 No

3 Spain 38.0 61000.0 No

68
DATAWAREHOSING AND DATAMINING 220643116004

4 Germany 40.0 61000.0 Yes

5 France 35.0 58000.0 Yes

6 Spain 35.0 52000.0 No

7 France 48.0 79000.0 Yes

8 Germany 50.0 83000.0 No

9 France 37.0 67000.0 Yes

Out[41]:

In [42]:

age = data.iloc[:, 1].values

salary = data.iloc[:, 2].values
print ("\nOriginal age data values : \n", age)
print ("\nOriginal salary data values : \n", salary)

Original age data values :

[44. 27. 30. 38. 40. 35. 35. 48. 50. 37.]

Original salary data values :

[72000. 48000. 54000. 61000. 61000. 58000. 52000. 79000. 83000. 67000.]
In [43]:

from sklearn.preprocessing import Binarizer

x = age
x = x.reshape(1, -1)

69
DATAWAREHOSING AND DATAMINING 220643116004

y = salary
y = y.reshape(1, -1)
In [44]:

# For age, let threshold be 35

# For salary, let threshold be 61000
binarizer_1 = Binarizer(35)
binarizer_2 = Binarizer(61000)
C:\Users\K Patel\anaconda3\lib\site-packages\sklearn\utils\validation.py:70:
FutureWarning: Pass threshold=35 as keyword args. From version 1.0 (renaming
of 0.25) passing these as positional arguments will result in an error
warnings.warn(f"Pass {args_msg} as keyword args. From version "
C:\Users\K Patel\anaconda3\lib\site-packages\sklearn\utils\validation.py:70:
FutureWarning: Pass threshold=61000 as keyword args. From version 1.0 (renami
ng of 0.25) passing these as positional arguments will result in an error
warnings.warn(f"Pass {args_msg} as keyword args. From version "
In [45]:

print ("\nBinarized age : \n", binarizer_1.fit_transform(x))

print ("\nBinarized salary : \n", binarizer_2.fit_transform(y))

Binarized age :
[[1. 0. 0. 1. 1. 0. 0. 1. 1. 1.]]

Binarized salary :
[[1. 0. 0. 0. 0. 0. 0. 1. 1. 1.]]

70
DATAWAREHOSING AND DATAMINING 220643116004

PRACTICAL-4

AIM: Perform Different Normalization methods. i) Maximum Absolute

Scaling, ii) Min-Max Feature Scaling, and iii) Z-score Method.

i) Maximum Absolute Scaling

Maximum absolute scaling scales the data to its maximum value; that is, it divides every
observation by the maximum value of the variable: The result of the preceding transformation is a
distribution in which the values vary approximately within the range of -1 to 1.

Maximum Absolute Scaling

In [1]:

from sklearn.preprocessing import M axAbsScaler

In [2]:

X = [[ 1., -1., 2.],[ 2., 0., 0.],

[ 0., 1., -1.]]

In [3]:

transformer = MaxAbsScaler().fit(X)
In [4]:

transformer.transform(X)
Out[4]:
arr ay([[ 0.5, -1. , 1. ],
[ 1. , 0. , 0. ],

71
DATAWAREHOSING AND DATAMINING 220643116004

[ 0. , 1. , -0.5]])
In [ ]:

ii) Min-Max Feature Scaling

Transform features by scaling each feature to a given range. This estimator scales and translates
each feature individually such that it is in the given range on the training set, e.g. between zero
and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min
Where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.

Min-Max Feature Scaling

In [5]:

from sklearn.preprocessing import MinMaxScaler

In [6]:

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

In [7]:

scaler = MinMaxScaler()
In [10]:

72
DATAWAREHOSING AND DATAMINING 220643116004

print(scaler.data_max_)
[ 1. 18.]
In [12]:

print(scaler.transform(data))
[[0. 0. ]
[0.25 0.25]
[0.5 0.5 ]
[1. 1. ]]

iii) Z-score Method.

In statistics, a z-score tells us how many standard deviations away a value is from the mean. We
use the following formula to calculate a z-score:

z = (X – μ) / σ

where:

 X is a single raw data value

 μ is the population mean
 σ is the population standard deviation

How to Calculate Z-Scores in Python

We can calculate z-scores in Python using scipy.stats.zscore, which uses the following syntax:

scipy.stats.zscore(a, axis=0, ddof=0, nan_policy=‘propagate‘)

where:

 a: an array like object containing data

 axis: the axis along which to calculate the z-scores. Default is 0.
 ddof: degrees of freedom correction in the calculation of the standard deviation. Default
is 0.
 nan_policy: how to handle when input contains nan. Default is propagate, which returns
nan. ‗raise‘ throws an error and ‗omit‘ performs calculations ignoring nan values.

73
DATAWAREHOSING AND DATAMINING 220643116004

Z_Score

In [25]:

import scipy.stats as stats

import numpy as np

In [26]:

data = np.array([[5, 6, 7, 7, 8],

[8, 8, 8, 9, 9],
[2, 2, 4, 4, 5]])

In [27]:

stats.zscore(data, axis=1)

Out[27]:
array([[-1.56892908, -0.58834841, 0.39223227, 0.39223227, 1.37281295],
[-0.81649658, -0.81649658, -0.81649658, 1.22474487, 1.22474487],
[-1.16666667, -1.16666667, 0.5 , 0.5 , 1.33333333]])
In [ ]:

74
DATAWAREHOSING AND DATAMINING 220643116004

PRACTICAL-5

AIM: Introduce and perform Attribute Relevance.

Attribute Relevance

Attribute relevance analysis phase has task to recognize attributes (characteristics) with strongest
impact on churn. Attributes which shows greatest segregation power in relation with churn (churn
= ―Yes‖ or ―No‖) by attribute relevance analysis will be selected as best candidates for building
predictive churn model. By no means is Attribute Relevance Analysis used only for predictive churn
model development, you can use it for every classification task. It is based on two terms:
Information Value and Weight of Evidence.

Information Value and Weight of Evidence

Weight of Evidence is explained as follows:
The weight of evidence tells the predictive power of an independent variable in relation to the
dependent variable. Since it evolved from credit scoring world, it is generally described as a
measure of the separation of good and bad customers. “Bad Customers” refers to the customers
who defaulted on a loan and “Good Customers” refers to the customers who paid back loan.

And from the same source, Information Value is explained as follows:

Information value is one of the most useful technique to select important variables in a predictive
model. It helps to rank variables on the basis of their importance.

WoE=[ln(Relative Frequency of Goods/ Relative Frequency of Bads)]*100

IV=Sum(Distribution Goodi – Distribution Badi)*WoEi

If we‘re talking about churn modeling, Goods would be clients which didn‘t churn, and Bads would
be clients which committed churn. Just from this, you can see the simplicity behind the formulas.
The Attribute Relevance Analysis for churn modeling example is divided into 6 steps:
1. Data Cleaning and Preparation,
2. Calculating IV and WoE,
3. Identifying Churners Profile,
4. Coarse Classing,
5. Dummy Variable Creation,
6. Correlations between Dummy Variables.

Step 1. Data Cleaning and Preparation

The dataset contains no missing values, so prerequisite 1 of 2 is satisfied!

There are 10,000 observations and 14 columns. From here on proceeded to data cleaning. Here are

the steps:
1. Delete RowNumber, CustomerId, and Surname — they are arbitrary and can‘t be used.

75
DATAWAREHOSING AND DATAMINING 220643116004

2. Group CreditScore, Age, Balance, and EstimatedSalary into 5 bins

3. Delete CreditScore, Age, Balance, and EstimatedSalary because they aren‘t needed anymore
In [8]:

import pandas as pd
import numpy as np

In [2]:

data = pd.read_csv('p5.csv')
In [3]:

data
Out[3]:

Row Cust Sur Cred Geog Ge A Te NumOf Has IsActiv Estimat Ex

Bala
Num omer na itSco raph nd g nu Produc CrC eMemb edSalar ite
ber Id me re y er e re nce
ts ard er y d

Har Fe
1563 Fran 4 101348.
0 1 grav 619 mal 2 0.00 1 1 1 1
4602 ce 2 88
e e

1564 Fe
1 2 Hill Spai 4 8380 112542.
608 mal 1 1 0 1 0
7311 n 1 7.86 58
e

Fe 1596
1561 Oni Fran 4 113931.
2 3 502 mal 8 60.8 3 1 0 1
9304 o ce 2 57
e 0

1570 Fe 3
Bon Fran 1 0.00 2 0 0 93826.6 0
3 4 699 mal
1354 i ce 9 3
e

76
DATAWAREHOSING AND DATAMINING 220643116004

4 5 1573 Mit 850 Spai Fe 4 2 1255 1 1 1 79084.1 0

Row Cust Sur Cred Geog Ge A Te NumOf Has IsActiv Estimat Ex

Bala
Num omer na itSco raph nd g nu Produc CrC eMemb edSalar ite
nce
ber Id me re y er e re ts ard er y d

7888 chel n mal 3 10.8 0

l e 2

... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

9
9 1560 Obij Fran Ma 3 96270.6
9996 771 5 0.00 2 1 0 0
9 6229 iaku ce le 9 4
5

9 Joh
9 1556 Fran Ma 3 5736 101699.
9997 nsto 516 10 1 1 1 0
9 9892 ce le 5 9.61 77
ne
6

9 Fe
9 1558 Fran 3 42085.5
9998 Liu 709 mal 7 0.00 1 0 1 1
9 4532 ce 6 8
e
7

9 Sab
9 1568 Ger Ma 4 7507 92888.5
9999 bati 772 3 2 1 0 1
9 2355 many le 2 5.31 2
ni
8

9 Fe 1301
9 1562 Wal Fran 2 38190.7
10000 792 mal 4 42.7 1 1 0 0
9 8319 ker ce 8 8
e 9
9

10000 rows × 14 columns

In [4]:

data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)

data['CreditScore_Bins'] = pd.qcut(data['CreditScore'], 5,
labels=['CS_lt_566', 'CS_556_to_627', 'CS_627_to_678', 'CS_678_to_735', 'CS_gt_735'])
data['Age_Bins'] = pd.qcut(data['Age'], 5,

77
DATAWAREHOSING AND DATAMINING 220643116004

labels=['Age_lt_31', 'Age_31_to_35', 'Age_35_to_40', 'Age_40_to_46', 'Age_gt_46'])

data['Balance_Bins'] = pd.qcut(data['Balance'], 5,
labels=['Bal_lt_73080', 'Bal_73080_to_110138', 'Bal_110138_to_133710', 'Bal_gt_133710'],
duplicates='drop')
data['Salary_Bins'] = pd.qcut(data['EstimatedSalary'], 5,
labels=['Sal_lt_41050', 'Sal_41050_to_80238', 'Sal_80238_to_119710', 'Sal_119710_to_159836',
'Sal_159836_to_199992'])

data.drop(['CreditScore', 'Age', 'Balance', 'EstimatedSalary'], axis=1, inplace=True)

In [5]:

data
Out[5]:

Geog Ge Te NumOf HasC IsActive Exi CreditSc Age_Bi Balance_Bi Salary_Bin

raph nde nur Product rCard Member ted ore_Bins ns ns s
y r e s

Fe
Franc CS_556_t Age_40 Bal_lt_7308 Sal_80238_
0 mal 2 1 1 1 1
e o_627 _to_46 0 to_119710
e

Fe
CS_556_t Age_40 Bal_73080_ Sal_80238_
1 Spain mal 1 1 0 1 0
o_627 _to_46 to_110138 to_119710
e

Fe
Franc CS_lt_56 Age_40 Bal_gt_133 Sal_80238_
2 mal 8 3 1 0 1
e 6 _to_46 710 to_119710
e

Fe
Franc CS_678_t Age_35 Bal_lt_7308 Sal_80238_
3 mal 1 2 0 0 0 to_119710
e o_735 _to_40 0
e

Fe
CS_gt_73 Age_40 Bal_110138 Sal_41050_
4 Spain mal 2 1 1 1 0
5 _to_46 _to_133710 to_80238
e

... ... ... ... ... ... ... ... ... ... ... ...

99 Franc Mal CS_gt_73 Age_35 Bal_lt_7308 Sal_80238_

5 2 1 0 0
95 e e 5 _to_40 0 to_119710

78
DATAWAREHOSING AND DATAMINING 220643116004

99 Franc Mal 10 1 1 1 0 CS_lt_56 Age_31 Bal_lt_7308 Sal_80238_

Geog Ge Te NumOf HasC IsActive Exi CreditSc Age_Bi Balance_Bi Salary_Bin

raph nde nur Product rCard Member ted ore_Bins ns ns s
y r e s

96 e e 6 _to_35 0 to_119710

Fe
99 Franc CS_678_t Age_35 Bal_lt_7308 Sal_41050_
mal 7 1 0 1 1
97 e o_735 _to_40 0 to_80238
e

99 Germ Mal CS_gt_73 Age_40 Bal_73080_ Sal_80238_

3 2 1 0 1
98 any e 5 _to_46 to_110138 to_119710

Fe
99 Franc CS_gt_73 Age_lt Bal_110138 Sal_lt_4105
mal 4 1 1 0 0
99 e 5 _31 _to_133710 0
e

10000 rows × 11 columns

Step 2. Calculating IV and WoE

Down below is the function which will calculate Weight of Evidence and Infomation Value.
Given Pandas DataFrame, attribute name, and target variable name it will do the calculations.
The function will return Pandas DataFrame and IV score.

In [6]:

def calculate_woe_iv(dataset, feature, target):

lst = []
for i in range(dataset[feature].nunique()):
val = list(dataset[feature].unique())[i]
lst.append({
'Value': val,
'All': dataset[dataset[feature] == val].count()[feature],
'Good': dataset[(dataset[feature] == val) & (dataset[target] == 0)].count()[feature],
'Bad': dataset[(dataset[feature] == val) & (dataset[target] == 1)].count()[feature]
})

dset = pd.DataFrame(lst)
dset['Distr_Good'] = dset['Good'] / dset['Good'].sum()
dset['Distr_Bad'] = dset['Bad'] / dset['Bad'].sum()

79
DATAWAREHOSING AND DATAMINING 220643116004

dset['WoE'] = np.log(dset['Distr_Good'] / dset['Distr_Bad'])

dset = dset.replace({'WoE': {np.inf: 0, -np.inf: 0}})
dset['IV'] = (dset['Distr_Good'] - dset['Distr_Bad']) * dset['WoE']
iv = dset['IV'].sum()

dset = dset.sort_values(by='WoE')

return dset, iv
In [9]:

for col in data.columns:

if col == 'Exited': continue
else:
print('WoE and IV for column: {}'.format(col))
df, iv = calculate_woe_iv(data, col, 'Exited')
print(df)
print('IV score: {:.2f}'.format(iv))
print('\n')

WoE and IV for column: Geography

Value All Good Bad Distr_Good Distr_Bad WoE IV
2 Germany 2509 1695 814 0.212859 0.399607 -0.629850 0.117623
1 Spain 2477 2064 413 0.259199 0.202749 0.245626 0.013865
0 France 5014 4204 810 0.527942 0.397644 0.283430 0.036930
IV score: 0.17

WoE and IV for column: Gender

Value All Good Bad Distr_Good Distr_Bad WoE IV
0 Female 4543 3404 1139 0.427477 0.559156 -0.268527 0.035359
1 Male 5457 4559 898 0.572523 0.440844 0.261361 0.034416
IV score: 0.07

WoE and IV for column: Tenure

Value All Good Bad Distr_Good Distr_Bad WoE IV
10 0 413 318 95 0.039935 0.046637 -0.155153 0.001040
1 1 1035 803 232 0.100841 0.113893 -0.121710 0.001589
9 9 984 771 213 0.096823 0.104566 -0.076931 0.000596
6 3 1009 796 213 0.099962 0.104566 -0.045021 0.000207
8 5 1012 803 209 0.100841 0.102602 -0.017307 0.000030

80
DATAWAREHOSING AND DATAMINING 220643116004

7 10 490 389 101 0.048851 0.049583 -0.014869 0.000011

4 4 989 786 203 0.098707 0.099656 -0.009577 0.000009
5 6 967 771 196 0.096823 0.096220 0.006246 0.000004
2 8 1025 828 197 0.103981 0.096711 0.072482 0.000527
0 2 1048 847 201 0.106367 0.098675 0.075068 0.000577
3 7 1028 851 177 0.106869 0.086892 0.206935 0.004134
IV score: 0.01

WoE and IV for column: NumOfProducts

C:\Users\K Patel\anaconda3\lib\site-packages\pandas\core\arraylike.py:358: Ru
ntimeWarning: divide by zero encountered in log
result = getattr(ufunc, method)(*inputs, **kwargs)
Value All Good Bad Distr_Good Distr_Bad WoE IV
1 3 266 46 220 0.005777 0.108002 -2.928314 0.299348
0 1 5084 3675 1409 0.461509 0.691703 -0.404655 0.093149
3 4 60 0 60 0.000000 0.029455 0.000000 -0.000000
2 2 4590 4242 348 0.532714 0.170839 1.137260 0.411545
IV score: 0.80

WoE and IV for column: HasCrCard

Value All Good Bad Distr_Good Distr_Bad WoE IV
1 0 2945 2332 613 0.292854 0.300933 -0.027211 0.000220
0 1 7055 5631 1424 0.707146 0.699067 0.011490 0.000093
IV score: 0.00

WoE and IV for column: IsActiveMember

Value All Good Bad Distr_Good Distr_Bad WoE IV
1 0 4849 3547 1302 0.445435 0.639175 -0.361127 0.069965
0 1 5151 4416 735 0.554565 0.360825 0.429791 0.083268
IV score: 0.15

WoE and IV for column: CreditScore_Bins

Value All Good Bad Distr_Good Distr_Bad WoE IV
1 CS_lt_566 2010 1558 452 0.195655 0.221895 -0.125852 0.003302
0 CS_556_to_627 2020 1599 421 0.200804 0.206676 -0.028827 0.000169
3 CS_gt_735 1979 1573 406 0.197539 0.199313 -0.008941 0.000016
4 CS_627_to_678 2010 1615 395 0.202813 0.193913 0.044877 0.000399
2 CS_678_to_735 1981 1618 363 0.203190 0.178203 0.131216 0.003279
IV score: 0.01

81
DATAWAREHOSING AND DATAMINING 220643116004

WoE and IV for column: Age_Bins

Value All Good Bad Distr_Good Distr_Bad WoE IV
2 Age_gt_46 1885 1019 866 0.127967 0.425135 -1.200636 0.356791
0 Age_40_to_46 1696 1211 485 0.152078 0.238095 -0.448275 0.038559
1 Age_35_to_40 2266 1927 339 0.241994 0.166421 0.374392 0.028294
4 Age_31_to_35 1781 1615 166 0.202813 0.081492 0.911775 0.110617
3 Age_lt_31 2372 2191 181 0.275148 0.088856 1.130289 0.210563
IV score: 0.74

WoE and IV for column: Balance_Bins

Value All Good Bad Distr_Good Distr_Bad WoE \
3 Bal_110138_to_133710 2000 1461 539 0.183474 0.264605 -0.366167
2 Bal_gt_133710 2000 1538 462 0.193143 0.226804 -0.160654
1 Bal_73080_to_110138 2000 1554 446 0.195153 0.218949 -0.115059
0 Bal_lt_73080 4000 3410 590 0.428231 0.289642 0.391017

IV
3 0.029708
2 0.005408
1 0.002738
0 0.054191
IV score: 0.09

WoE and IV for column: Salary_Bins

Value All Good Bad Distr_Good Distr_Bad WoE \
4 Sal_159836_to_199992 2000 1569 431 0.197036 0.211586 -0.071242
0 Sal_80238_to_119710 2000 1596 404 0.200427 0.198331 0.010513
2 Sal_119710_to_159836 2000 1596 404 0.200427 0.198331 0.010513
1 Sal_41050_to_80238 2000 1601 399 0.201055 0.195876 0.026095
3 Sal_lt_41050 2000 1601 399 0.201055 0.195876 0.026095

IV
4 0.001037
0 0.000022
2 0.000022
1 0.000135
3 0.000135
IV score: 0.00

For now, you should just care about the line which says IV score. More precisely, keep your

thoughts on variables with the highest IV scores. Down below is a table for IV interpretation:

82
DATAWAREHOSING AND DATAMINING 220643116004

IV Interpretation table

Now you should see the clearer picture. You should only keep those attributes which have good

predictive power! In the current dataset those are:

 NumOfProducts (0.80)
 Age_Bins (0.74)
 Geography (0.17)
 IsActiveMember (0.15)

Step 3. Identifying Churners Profile

This isn‘t actually required step but is quite beneficial to do.

You as a company probably want to know how does the typical churner look like. I mean, you don‘t

care about his/her physical appearance, but you want to know where does the churner lives, what‘s

his/her age, etc…

To find that out, you will need to take a closer look at the returned data frames for those variables

which have the greatest predictive power. More precisely, look at the WoE column. Ideally, you

will find negative WoE score — this is the value most churners have.

In our example, this is the typical churners profile:

 Lives in Germany (WoE -0.63)
 Uses 3 products/services (WoE -2.92)
 Isn‘t an active member (WoE -0.36)
 Has more than 46 years (WoE -1.20)
83
DATAWAREHOSING AND DATAMINING 220643116004

With this information, you as a company can act and address this critical customer group.

Step 4. Coarse Classing

Coarse Classing is another term I haven‘t heard prior to my master degree studies. It‘s very simple

to explain the idea behind it, you basically want to group together instances with similar WoE

because they provide the same information.

For this dataset, coarse classing should be applied to Spain and France in Geography attribute

(WoEs 0.24 and 0.28).

In [10]:

Geography_df, Geography_iv = calculate_woe_iv(data, 'Geography', 'Exited')

In [11]:

Geography_df

Value All Good Bad Distr_Good Distr_Bad WoE IV

2 Germany 2509 1695 814 0.212859 0.399607 -0.629850 0.117623

1 Spain 2477 2064 413 0.259199 0.202749 0.245626 0.013865

0 France 5014 4204 810 0.527942 0.397644 0.283430 0.036930

Out[11]:

In [12]:

84
DATAWAREHOSING AND DATAMINING 220643116004

Geography_iv

0.16841897055216165

Out[12]:

Down below is the function for coarse classing, along with the function call. To call the function
you must know before hand what are the index locations of the two rows you want coarsed.

def coarse_classer(df, indexloc_1, indexloc_2):

mean_val = pd.DataFrame(np.mean(pd.DataFrame([df.iloc[indexloc_1], df.iloc[indexloc_2]]))).T
original = df.drop([indexloc_1, indexloc_2])

coarsed_df = pd.concat([original, mean_val])

coarsed_df = coarsed_df.sort_values(by='WoE', ascending=False).reset_index(drop=True)

return coarsed_df

geography_df = coarse_classer(Geography_df, 1, 2)geography_df

Value All Good Bad Distr_Good Distr_Bad WoE IV

0 France 5014.0 4204.0 810.0 0.527942 0.397644 0.283430 0.036930

1 NaN 3745.5 3134.0 611.5 0.393570 0.300196 0.264528 0.025398

Out[14]:

In [ ]:

85
DATAWAREHOSING AND DATAMINING 220643116004

You can notice that Value is NaN for the newly created row. It‘s nothing to worry about, you can
simply remap the original dataset to replace Spain and France with something new, for example,
Spain_and_France.

data['Geography'].replace({ 'Spain': 'Spain_and_France', 'France': 'Spain_and_France' }, inplace=True)

Step 5. Dummy Variable Creation

As you know, classification models perform the best when there exist only binary attributes. That‘s
where dummy variables come in. A dummy variable is one that takes the value 0 or 1 to indicate
the absence or presence of some categorical effect that may be expected to shift the outcome.
In a nutshell, if the attribute has n unique values, you will need to create n — 1 dummy variables.
You create one less dummy variable to avoid collinearity issues — when one variable is a perfect
predictor of the other.
Dummy variables will be needed for the following attributes:
 Geography
 NumOfProducts
 Age_Bins
and the code below will create them, and then concatenate them to a new Data Frame along
with IsActiveMember attribute and the target variable — Exited:

geography_dummies = pd.get_dummies(data['Geography'], drop_first=True, prefix='Geography')

num_products_dummies = pd.get_dummies(data['NumOfProducts'], drop_first=True, prefix='Num_Prods')
age_dummies = pd.get_dummies(data['Age_Bins'], drop_first=True)

df = pd.concat([geography_dummies, num_products_dummies, age_dummies, data[['IsActiveMember', 'Exited']]], a

xis=1)

In [18]:

86
DATAWAREHOSING AND DATAMINING 220643116004

df
Out[18]:

Geography_Spai Num_P Num_P Num_P Age_31 Age_35 Age_40 Age_ IsActive Exi
n_and_France rods_2 rods_3 rods_4 _to_35 _to_40 _to_46 gt_46 Member ted

0 1 0 0 0 0 0 1 0 1 1

1 1 0 0 0 0 0 1 0 1 0

2 1 0 1 0 0 0 1 0 0 1

3 1 1 0 0 0 1 0 0 0 0

4 1 0 0 0 0 0 1 0 1 0

... ... ... ... ... ... ... ... ... ... ...

99 1 1 0 0 0 1 0 0 0 0
95

99 1 0 0 0 1 0 0 0 1 0
96

99 1 0 0 0 0 1 0 0 1 1
97

99 0 1 0 0 0 0 1 0 0 1
98

99 1 0 0 0 0 0 0 0 0 0
99

10000 rows × 10 columns

In [ ]:

87
DATAWAREHOSING AND DATAMINING 220643116004

Step 6. Correlations between Dummy Variables

The final step of this process is to calculate correlations between dummy variables and to exclude
those with high correlation. What‘s considered a high correlation coefficient is up to debate, but I
would suggest you remove anything with a correlation above 0.7 (of absolute value). If you‘re
wondering which dummy variable to remove between the two, remove the one with lower Weight
of Evidence, due to weaker connection to the target variable.

ax = sb.heatmap(df.corr(), linewidths=0.8, cmap='Blues', fmt=".1f", annot=True)

Correlation Matrix
Here is visible that there exists no correlation between dummy variables, and therefore, all of
them must remain.

88
DATAWAREHOSING AND DATAMINING 220643116004

PRACTICAL-6

AIM: Implement Decision Tree based Algorithm: Random Forest and

AdaBoost.

Random Forest

Random forest is an ensemble of decision tree algorithms.

It is an extension of bootstrap aggregation (bagging) of decision trees and can be used for
classification and regression problems.
In bagging, a number of decision trees are created where each tree is created from a different
bootstrap sample of the training dataset. A bootstrap sample is a sample of the training dataset
where a sample may appear more than once in the sample, referred to as sampling with
replacement.
Bagging is an effective ensemble algorithm as each decision tree is fit on a slightly different
training dataset, and in turn, has a slightly different performance. Unlike normal decision tree
models, such as classification and regression trees (CART), trees used in the ensemble are
unpruned, making them slightly overfit to the training dataset. This is desirable as it helps to make
each tree more different and have less correlated predictions or prediction errors.
Predictions from the trees are averaged across all decision trees resulting in better performance
than any single tree in the model.

A prediction on a regression problem is the average of the prediction across the trees in the
ensemble. A prediction on a classification problem is the majority vote for the class label across
the trees in the ensemble.

 Regression: Prediction is the average prediction across the decision trees.

 Classification: Prediction is the majority vote class label predicted across the decision trees.
Random forest involves constructing a large number of decision trees from bootstrap samples from
the training dataset, like bagging.

Unlike bagging, random forest also involves selecting a subset of input features (columns or
variables) at each split point in the construction of trees. Typically, constructing a decision tree
involves evaluating the value for each input variable in the data in order to select a split point. By
reducing the features to a random subset that may be considered at each split point, it forces each
decision tree in the ensemble to be more different.

89
DATAWAREHOSING AND DATAMINING 220643116004

Random Forest

In [1]:

from numpy import mean

In [2]:

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)

In [3]:

model = RandomForestClassifier()
In [4]:

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
In [5]:

90
DATAWAREHOSING AND DATAMINING 220643116004

print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.901 (0.025)
AdaBoost
Boosting is a class of ensemble machine learning algorithms that involve combining the
predictions from many weak learners.

A weak learner is a model that is very simple, although has some skill on the dataset. Boosting
was a theoretical concept long before a practical algorithm could be developed, and the AdaBoost
(adaptive boosting) algorithm was the first successful approach for the idea.

The AdaBoost algorithm involves using very short (one-level) decision trees as weak learners that
are added sequentially to the ensemble. Each subsequent model attempts to correct the predictions
made by the model before it in the sequence. This is achieved by weighing the training dataset to
put more focus on training examples on which prior models made prediction errors.

AdaBoost
In [6]:

from numpy import mean

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=6)

In [8]:

model = AdaBoostClassifier()
In [9]:

91
DATAWAREHOSING AND DATAMINING 220643116004

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

In [10]:

print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.806 (0.041)

In [ ]:

92
DATAWAREHOSING AND DATAMINING 220643116004

PRACTICAL-7

AIM: Perform installation of Weka tool.

For Weka tool installation follow the following steps.

Step1: Write down Weka installer

Step2: Click on the following URL ―https://waikato.github.io/weka-wiki/downloading_weka/‖

Step3: On the left side select the appropriate OS for your computer.

Step4: Click on ‗Click Here‘

93
DATAWAREHOSING AND DATAMINING 220643116004

Step5: Open the downloaded setup and click on ―Yes‖.

Step6: Setup screen will appear, click on Next.

Step7: The next screen will be of License Agreement, click on I Agree.

94
DATAWAREHOSING AND DATAMINING 220643116004

Step8: Next screen is of choosing components, all components are already marked so don‘t
change anything just click on the Install button.

Step9: The next screen will be of installing location so choose the drive which will have
sufficient memory space for installation. It needed a memory space of 301 MB.

95
DATAWAREHOSING AND DATAMINING 220643116004

Step10: Next screen will be of choosing the Start menu folder so don‘t do anything just click on
Install Button.

Step11: After this installation process will start and will hardly take a minute to complete the
installation.

96
DATAWAREHOSING AND DATAMINING 220643116004

Step12: Click on the Next button after the installation process is complete.

Step13: Click on Finish to finish the installation process.

97
DATAWAREHOSING AND DATAMINING 220643116004

Step14: Weka is successfully installed on the system and an icon is created on the desktop.

Step15: Run the software and see the interface.

98
DATAWAREHOSING AND DATAMINING 220643116004

99
DATAWAREHOSING AND DATAMINING 220643116004

PRACTICAL-8

AIM: Demonstration of preprocessing on dataset student.arff

This experiment illustrates some of the basic data preprocessing operations that can be performed
using WEKA-Explorer. The sample dataset used for this example is the student data available in
arff format.

Step1: Loading the data. We can load the dataset into weka by clicking on open button in
preprocessing interface and selecting the appropriate file.

Step2: Once the data is loaded, weka will recognize the attributes and during the scan of the data
weka will compute some basic strategies on each attribute. The left panel in the above figure shows
the list of recognized attributes while the top panel indicates the names of the base relation or table
and the current working relation (which are same initially).

Step3:Clicking on an attribute in the left panel will show the basic statistics on the attributes for
the categorical attributes the frequency of each attribute value is shown, while for continuous
attributes we can obtain min, max, mean, standard deviation and deviation etc.,

Step4:The visualization in the right button panel in the form of cross-tabulation across two
attributes.

Note:we can select another attribute using the dropdown list. Step5:Selecting or filtering attributes

Removing an attribute-When we need to remove an attribute,we can do this by using the attribute
filters in weka.In the filter model panel,click on choose button,This will show a popup window
with a list of available filters.

Scroll down the list and select the ―weka.filters.unsupervised.attribute.remove‖ filters.

Step 6:a)Next click the textbox immediately to the right of the choose button.In the resulting dialog
box enter the index of the attribute to be filtered out.

b) Make sure that invert selection option is set to false.The click OK now in the filter box.you
will see ―Remove-R-7‖.

c) Click the apply button to apply filter to this data.This will remove the attribute and create
new working relation.

d) Save the new working relation as an arff file by clicking save button on the
100
DATAWAREHOSING AND DATAMINING 220643116004

top(button)panel.(student.arff)
Discretization

1) Sometimes association rule mining can only be performed on categorical data.This requires
performing discretization on numeric or continuous attributes.In the following example let us
discretize age attribute.

ÆLet us divide the values of age attribute into three bins(intervals). ÆFirst load the dataset into
weka(student.arff)

ÆSelect the age attribute.

ÆActivate filter-dialog box and select ―WEKA.filters.unsupervised.attribute.discretize‖from the

list.

ÆTo change the defaults for the filters,click on the box immediately to the right of the choose
button.

ÆWe enter the index for the attribute to be discretized.In this case the attribute is age.So we must
enter ‗1‘ corresponding to the age attribute.

ÆEnter ‗3‘ as the number of bins.Leave the remaining field values as they are. ÆClick OK button.

ÆClick apply in the filter panel.This will result in a new working relation with the selected
attribute partition into 3 bins.

ÆSave the new working relation in a file called student-data-discretized.arff

Dataset student .arff

@relation student

@attribute age {<30,30-40,>40} @attribute income {low, medium, high} @attribute student {yes,
no}

@attribute credit-rating {fair, excellent} @attribute buyspc {yes, no}

@data

101
DATAWAREHOSING AND DATAMINING 220643116004

<30, high, no, fair, no

<30, high, no, excellent, no 30-40, high, no, fair, yes

>40, medium, no, fair, yes

>40, low, yes, fair, yes

>40, low, yes, excellent, no 30-40, low, yes, excellent, yes

<30, medium, no, fair, no

<30, low, yes, fair, no

>40, medium, yes, fair, yes

<30, medium, yes, excellent, yes

30-40, medium, no, excellent, yes

30-40, high, yes, fair, yes

>40, medium, no, excellent, no

The following screenshot shows the effect of discretization.

102
DATAWAREHOSING AND DATAMINING 220643116004

103
DATAWAREHOSING AND DATAMINING 220643116004

PRACTICAL-9

AIM: Demonstration of preprocessing on dataset labor.arff

This experiment illustrates some of the basic data preprocessing operations that can be performed
using WEKA-Explorer. The sample dataset used for this example is the labor data available in arff
format.

Step1:Loading the data. We can load the dataset into weka by clicking on open button in
preprocessing interface and selecting the appropriate file.

Step2:Once the data is loaded, weka will recognize the attributes and during the scan of the data
weka will compute some basic strategies on each attribute. The left panel in the above figure shows
the list of recognized attributes while the top panel indicates the names of the base relation or table
and the current working relation (which are same initially).

Step4:The visualization in the right button panel in the form of cross-tabulation across two
attributes.

Note:we can select another attribute using the dropdown list. Step5:Selecting or filtering attributes

Scroll down the list and select the ―weka.filters.unsupervised.attribute.remove‖ filters.

Step 6:a)Next click the textbox immediately to the right of the choose button.In the resulting dialog
box enter the index of the attribute to be filtered out.

b) Make sure that invert selection option is set to false.The click OK now in the filter box.you
will see ―Remove-R-7‖.

c) Click the apply button to apply filter to this data.This will remove the attribute and create
new working relation.

d) Save the new working relation as an arff file by clicking save button on the
top(button)panel.(labor.arff)

104
DATAWAREHOSING AND DATAMINING 220643116004

Discretization

ÆLet us divide the values of duration attribute into three bins(intervals). ÆFirst load the dataset
into weka(labor.arff)

ÆSelect the duration attribute.

ÆActivate filter-dialog box and select ―WEKA.filters.unsupervised.attribute.discretize‖from the

list.

ÆTo change the defaults for the filters,click on the box immediately to the right of the choose
button.

ÆWe enter the index for the attribute to be discretized.In this case the attribute is duration So we
must enter ‗1‘ corresponding to the duration attribute.

ÆEnter ‗1‘ as the number of bins.Leave the remaining field values as they are. ÆClick OK button.

ÆClick apply in the filter panel.This will result in a new working relation with the selected
attribute partition into 1 bin.

ÆSave the new working relation in a file called labor-data-discretized.arff

Dataset labor.arff

105
DATAWAREHOSING AND DATAMINING 220643116004

The following screenshot shows the effect of discretization

106
DATAWAREHOSING AND DATAMINING 220643116004

107
DATAWAREHOSING AND DATAMINING 220643116004

PRACTICAL-10

AIM: Demonstration of Association rule process on dataset contact

lenses.arff using a priori algorithm.

This experiment illustrates some of the basic elements of asscociation rule

mining using WEKA. The sample dataset used for this example is
contactlenses.arff
Step1: Open the data file in Weka Explorer. It is presumed that the required
data fields have been discretized. In this example it is age attribute.
Step2: Clicking on the associate tab will bring up the interface for association
rule algorithm. Step3: We will use apriori algorithm. This is the default
algorithm.
Step4: Inorder to change the parameters for the run (example support,
confidence etc) we click on the text box immediately to the right of the choose
button.
Dataset contactlenses.arff

Dataset contactlenses.arff

The following screenshot shows the association rules that were generated
108
DATAWAREHOSING AND DATAMINING 220643116004

when apriori algorithm is applied on the given dataset.

109

Data Cleaning
No ratings yet
Data Cleaning
8 pages
Data Cleaning Thesis
100% (2)
Data Cleaning Thesis
5 pages
Data Cleaning and Data Transformation
No ratings yet
Data Cleaning and Data Transformation
13 pages
DS-Unit-2 ABM Final
No ratings yet
DS-Unit-2 ABM Final
134 pages
Data Science - Module 1.3
No ratings yet
Data Science - Module 1.3
34 pages
Data and DW Lab Manual Updated
No ratings yet
Data and DW Lab Manual Updated
44 pages
Importance of Data Cleaning 1
No ratings yet
Importance of Data Cleaning 1
47 pages
Data Preprocessing
No ratings yet
Data Preprocessing
77 pages
Aiml Data Preprocessing
No ratings yet
Aiml Data Preprocessing
99 pages
DM Chapter 3
No ratings yet
DM Chapter 3
60 pages
DEC - Unit II Data Pre-Processing
No ratings yet
DEC - Unit II Data Pre-Processing
96 pages
VIPDMTheory Chapter 3
No ratings yet
VIPDMTheory Chapter 3
87 pages
03preprocessing 1
No ratings yet
03preprocessing 1
39 pages
Pre Processing
No ratings yet
Pre Processing
52 pages
The Data Science Process
No ratings yet
The Data Science Process
33 pages
Foundations of Data Science
No ratings yet
Foundations of Data Science
139 pages
03 Preprocessing
No ratings yet
03 Preprocessing
59 pages
CS-DM Module-2
No ratings yet
CS-DM Module-2
30 pages
02 Data - Preprocessing - 4,5,6
No ratings yet
02 Data - Preprocessing - 4,5,6
54 pages
EWM HU Label PPF Action and Conditions Technique
50% (2)
EWM HU Label PPF Action and Conditions Technique
3 pages
FDS UNIT 1 Part2
No ratings yet
FDS UNIT 1 Part2
47 pages
Data Cleaning
No ratings yet
Data Cleaning
35 pages
Introduction To Data Science: Data Science Methodology & Data Preparation DR Shuhaida Mohamed Shuhidan Jan 2025
No ratings yet
Introduction To Data Science: Data Science Methodology & Data Preparation DR Shuhaida Mohamed Shuhidan Jan 2025
34 pages
03preprocessing Part1
No ratings yet
03preprocessing Part1
21 pages
21BCAD5C01 IDA Module 2 Notes
No ratings yet
21BCAD5C01 IDA Module 2 Notes
16 pages
Overview of Data Preprocessing
No ratings yet
Overview of Data Preprocessing
4 pages
Data Warehouse and Data Mining - Unit 3
No ratings yet
Data Warehouse and Data Mining - Unit 3
14 pages
1-Introduction To Data Cleaning
No ratings yet
1-Introduction To Data Cleaning
22 pages
Intro To Data Analytics - Cleanup & Transformation
No ratings yet
Intro To Data Analytics - Cleanup & Transformation
30 pages
DSV-S8 Data Cleaning
No ratings yet
DSV-S8 Data Cleaning
34 pages
DS Unit 2
No ratings yet
DS Unit 2
23 pages
Introduction To Data Science
No ratings yet
Introduction To Data Science
23 pages
Unit - III DW
No ratings yet
Unit - III DW
14 pages
3 Data Preprocessing
No ratings yet
3 Data Preprocessing
33 pages
Unit 3
No ratings yet
Unit 3
18 pages
Lecture 6 Data Preprocessing
No ratings yet
Lecture 6 Data Preprocessing
59 pages
Data Preprocessing Part 1
No ratings yet
Data Preprocessing Part 1
14 pages
Disruptive Technologies DA Lecture 8
No ratings yet
Disruptive Technologies DA Lecture 8
17 pages
Bana Reviewer
No ratings yet
Bana Reviewer
4 pages
Data Cleaning Using Pandas
No ratings yet
Data Cleaning Using Pandas
9 pages
DM Unit 3
No ratings yet
DM Unit 3
15 pages
Deep Learning Ram
No ratings yet
Deep Learning Ram
21 pages
3 Preprocessing
No ratings yet
3 Preprocessing
27 pages
DWDM 3
No ratings yet
DWDM 3
12 pages
Data Cleaning and Preparation
No ratings yet
Data Cleaning and Preparation
20 pages
m4t5 - PDF - Eng Data Cleaning & Etl
No ratings yet
m4t5 - PDF - Eng Data Cleaning & Etl
6 pages
COS10022 - Lecture 03 - Data Preparation PDF
No ratings yet
COS10022 - Lecture 03 - Data Preparation PDF
61 pages
M-II FDS U-II Questions
No ratings yet
M-II FDS U-II Questions
43 pages
Syllabus: Data Warehousing and Data Mining
No ratings yet
Syllabus: Data Warehousing and Data Mining
18 pages
The Ultimate Guide To Data Cleaning
No ratings yet
The Ultimate Guide To Data Cleaning
18 pages
OJCST Vol13 N2-3 P 78-81
No ratings yet
OJCST Vol13 N2-3 P 78-81
4 pages
What Is Data Cleaning
No ratings yet
What Is Data Cleaning
8 pages
DWM Module 2
No ratings yet
DWM Module 2
9 pages
Lecture Source: Books by Tan, Steinbach, Kumar Han, Kamber & Pei Evans Dinesh Kumar + Experiential Knowledge
No ratings yet
Lecture Source: Books by Tan, Steinbach, Kumar Han, Kamber & Pei Evans Dinesh Kumar + Experiential Knowledge
40 pages
Estimasi Anggaran Biaya Google Adwords Iklan Website
No ratings yet
Estimasi Anggaran Biaya Google Adwords Iklan Website
54 pages
Data Handling and Visualization 3rd Unit
No ratings yet
Data Handling and Visualization 3rd Unit
4 pages
Day-4 Preprocessing
No ratings yet
Day-4 Preprocessing
11 pages
Airline Management System.
100% (1)
Airline Management System.
5 pages
UNIT - 2 .DataScience 04.09.18
No ratings yet
UNIT - 2 .DataScience 04.09.18
53 pages
DWM
No ratings yet
DWM
14 pages
Experiment 12: Series and Parallel Pumps Introduction To The Equipment
No ratings yet
Experiment 12: Series and Parallel Pumps Introduction To The Equipment
8 pages
CAN BUS English Version
100% (1)
CAN BUS English Version
17 pages
E Commerce MCQ
No ratings yet
E Commerce MCQ
15 pages
01 - SAP Data Transfer Made Easy
0% (1)
01 - SAP Data Transfer Made Easy
13 pages
Agis User Manula M210 P13
No ratings yet
Agis User Manula M210 P13
28 pages
Dungeon Lords
No ratings yet
Dungeon Lords
27 pages
Intro ERP Using GBI Case Study PP (Letter) en v2.40
No ratings yet
Intro ERP Using GBI Case Study PP (Letter) en v2.40
39 pages
LIOLink DOC V7 2 en
No ratings yet
LIOLink DOC V7 2 en
78 pages
Amadeus Training Catalogue
No ratings yet
Amadeus Training Catalogue
81 pages
3.1.2.7 Packet Tracer - Investigating A VLAN Implementation Instructions
100% (1)
3.1.2.7 Packet Tracer - Investigating A VLAN Implementation Instructions
4 pages
Sony Xperia C2305 User's Manual
No ratings yet
Sony Xperia C2305 User's Manual
123 pages
Cobas Ampliprep Taqman
No ratings yet
Cobas Ampliprep Taqman
4 pages
GIA 2018 - ENABLON 4.0 - FINAL PRESENTATION27Nov7 - 6PM
No ratings yet
GIA 2018 - ENABLON 4.0 - FINAL PRESENTATION27Nov7 - 6PM
22 pages
JAVA NOTES 3 Sem A - NOTES 3 CLASS CONSTRUCTOR
No ratings yet
JAVA NOTES 3 Sem A - NOTES 3 CLASS CONSTRUCTOR
45 pages
NumPy & Pandas
No ratings yet
NumPy & Pandas
27 pages
Pepper Fuchs Solenoid Driver Kfd0 Sd2 Ex1
No ratings yet
Pepper Fuchs Solenoid Driver Kfd0 Sd2 Ex1
3 pages
Hazard Analysis
No ratings yet
Hazard Analysis
39 pages
Robustel M1000 PRO GPRS Modem
No ratings yet
Robustel M1000 PRO GPRS Modem
4 pages
LINKSYS Wusb54gsc Wireless-G LINKSYS by Cisco
No ratings yet
LINKSYS Wusb54gsc Wireless-G LINKSYS by Cisco
38 pages
Oop Reviewer
No ratings yet
Oop Reviewer
3 pages
Codeblocks - The Open Source Cross Platform C++ Ide: Julius Parulek
No ratings yet
Codeblocks - The Open Source Cross Platform C++ Ide: Julius Parulek
19 pages
SITB2
No ratings yet
SITB2
9 pages
Vaksin Senin, 25 Oktober 2021
No ratings yet
Vaksin Senin, 25 Oktober 2021
8 pages
Text
No ratings yet
Text
19 pages
Cases Skodew
No ratings yet
Cases Skodew
6 pages
Lecture 3 Getting Input From The Keyboard
No ratings yet
Lecture 3 Getting Input From The Keyboard
17 pages
Data Sheet 6MD5513-0AP00-0AA0: Product Details
No ratings yet
Data Sheet 6MD5513-0AP00-0AA0: Product Details
2 pages
Step by Step Configuration of E Separation
No ratings yet
Step by Step Configuration of E Separation
3 pages
Introduction to Robotics
From Everand
Introduction to Robotics
Swarnalata Verma
No ratings yet
Data Mining: Fundamentals and Applications
From Everand
Data Mining: Fundamentals and Applications
Fouad Sabry
No ratings yet