B DWM Lab Manual Zil
B DWM Lab Manual Zil
LAB MANUAL
FOR
6THSEMESTER
IT
CERTIFICATE
This is to certify that Mr. /Ms. _ MISTRY ZEEL JAYESHBHAI Of 6TH SEM B.E I.T.
Class, ENROLL NO. 220643116004 , has satisfactorily Completed his/her term
work in DATA WAREHOUSING AND DATA MINING for the term ending
in _APRIL 2023 /2024.
DATE:
Grade:
Demonstration of Pre-processing
3 08/02/24 43
Methods
i) Rescale Data and ii) Binarize Data.
Perform Different Normalization
methods.
4
15/02/24 i) Maximum Absolute Scaling, ii) 72
Min-Max Feature Scaling, and iii)
Z-score Method.
Demonstration of preprocessing on
8 28/03/24 100
datasetstudent.arff
Demonstration of preprocessing on
9 04/04/24 104
datasetlabor.arff
Demonstration of Association
10 11/04/24 rule process on dataset contact 108
lenses .arffusing a priori
algorithm
DATAWAREHOSING AND DATAMINING 220643116004
444
PRACTICAL-1
AIM: Introduce and Perform different methods of i) Data cleaning and ii)
Data integration and transformation.
1. Data Cleaning
As we know that, Data Mining is the discipline of study which involves extracting insights from
huge amounts of data by the use of various scientific methods, algorithms, and processes. To
extract useful knowledge from data, Data Mining need raw data. This Raw data is a collection of
information from various outlines sources and an essential raw material of Data Scientists. It is
additionally known as primary or source data. It consists of garbage, irregular and inconsistent
values which lead to many difficulties. When using data, the insights and analysis extracted are
only as good as the data we are using. Essentially, when garbage data is in, then garbage analysis
comes out. Here Data cleaning comes into the picture, Data cleansing is an essential part of data
Mining. Data cleaning is the process of removing incorrect, corrupted, garbage, incorrectly
formatted, duplicate, or incomplete data within a dataset.
Why Data Cleaning?
Data cleaning is the most important task that should be done as a data science professional. Having
wrong or bad quality data can be detrimental to processes and analysis. Having clean data will
ultimately increase overall productivity and permit the very best quality information in your
decision-making.
Error-Free Data
When multiple sources of data are combined there may be chances of so much error. Through Data
Cleaning, errors can be removed from data. Having clean data which is free from wrong and
garbage values can help in performing analysis faster as well as efficiently. By doing this task our
considerable amount of time is saved. If we use data containing garbage values, the results won‘t
be accurate. When we don‘t use accurate data, surely we will make mistakes. Monitoring errors
and good reporting helps to find where errors are coming from, and also makes it easier to fix
incorrect or corrupt data for future applications.
Data Quality
The quality of the data is the degree to which it follows the rules of particular requirements. For
example, if we have imported phone numbers data of different customers, and in some places, we
have added email addresses of customers in the data. But because our needs were straightforward
for phone numbers, then the email addresses would be invalid data. Here some pieces of data
follow a specific format. Some types of numbers have to be in a specific range. Some data cells
might require a selected quite data like numeric, Boolean, etc. In every scenario, there are some
mandatory constraints our data should follow. Certain conditions affect multiple fields of data in
a particular form. Particular types of data have unique restrictions. If the data isn‘t in the required
format, it would always be invalid. Data cleaning will help us simplify this process and avoid
useless data values.
1
DATAWAREHOSING AND DATAMINING 220643116004
444
Accurate and Efficient
Ensuring the data is close to the correct values. We know that most of the data in a dataset are
valid, and we should focus on establishing its accuracy. Even if the data is authentic and corre ct,
it doesn‘t mean the data is accurate. Determining accuracy helps to figure out the data entered is
accurate or not. For example, the address of a customer is stored in the specified format, maybe it
doesn‘t need to be in the right one. The email has an additional character or value that makes it
incorrect or invalid. Another example is the phone number of a customer. This means that we have
to rely on data sources, to cross-check the data to figure out if it‘s accurate or not. Depending on
the kind of data we are using, we might be able to find various resources that could help us in this
regard for cleaning.
Complete Data
Completeness is the degree to which we should know all the required values. Completeness is a
little more challenging to achieve than accuracy or quality. Because it‘s nearly impossible to have
all the info we need. Only known facts can be entered. We can try to complete data by redoing the
data gathering activities like approaching the clients again, re-interviewing people, etc. For
example, we might need to enter every customer‘s contact information. But a number of them
might not have email addresses. In this case, we have to leave those columns empty. If we have a
system that requires us to fill all columns, we can try to enter missing or unknown there. But
entering such values does not mean that the data is complete. It would be still being referred to as
incomplete.
By checking in different systems. By checking the source. By checking the latest data.
Python Code:
.
In [1]:
import pandas as pd
In [2]:
2
DATAWAREHOSING AND DATAMINING 220643116004
444
data = pd.read_csv('p1-1.csv')
In [3]:
data.head()
Out[3]
In [4]:
data
Out[4]:
3
DATAWAREHOSING AND DATAMINING 220643116004
444
data.tail()
Out[5]:
In [6]:
4
DATAWAREHOSING AND DATAMINING 220643116004
444
data.isnull()
###This function provides the boolean value for the complete dataset to know if any null value is present or not.
Out[6]:
5
DATAWAREHOSING AND DATAMINING 220643116004
444
data.isna()
####This is the same as the isnull() function. Ans provides the same output.
Out[7]:
In [8]:
6
DATAWAREHOSING AND DATAMINING 220643116004
444
data.isna().any()
###This function also gives a boolean value if any null value is present or not,
###but it gives results column-wise, not in tabular format.
Out[8]:
Id False
SepalLengthCm False
SepalWidthCm False
PetalLengthCm False
PetalWidthCm False
Species False
dtype: bool
In [9]:
data.isna().sum()
###This function gives the sum of the null values preset in the dataset column-wise.
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
Out[9]
In [10]:
data.isna().any().sum()
###This function gives output in a single value if any null is present or not.
7
DATAWAREHOSING AND DATAMINING 220643116004
444
Out[10]:
There are no null values present in our dataset. But if there are any null value s preset we can fill
those places with any other value using the fillna() function of DataFrame.Following is the syntax of
fillna() function:
and for merging datasets: Merging the dataset is the process of combining two datasets in one, and
line up rows based on some particular or common property for data analysis. We can do this by
using the merge() function of the dataframe. Following is the syntax of the merge function:
data.duplicated()
Out[11]:
0 False
1 False
2 False
3 False
4 False
...
145 False
146 False
147 False
148 False
149 False
Length: 150, dtype: bool
This function also provides bool values for duplicate values in the dataset. As we can see that
dataset doesn’t contain any duplicate values.
If a dataset contains duplicate values it can be removed using the drop_duplicates() function.
Following is the syntax of this function:
data.duplicated().any().sum()
Out[10]:
In [11]:
data1 = pd.read_csv('StudentDetails.csv')
In [12]:
Out[12]:
In [13]:
9
DATAWAREHOSING AND DATAMINING 220643116004
444
Out[13]:
In [14]:
data2
= pd.read_csv('StudentDetails.csv')
In [15]:
data2
10
DATAWAREHOSING AND DATAMINING 220643116004
444
8 9 Akshay Prajapati 85.0 NaN MS
Out[15]:
In [16]:
data2
Out[17]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage
In [18]:
data3 = pd.read_csv('StudentDetails.csv')
In [19]:
11
DATAWAREHOSING AND DATAMINING 220643116004
444
data3
Out[19]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage
In [20]:
12
DATAWAREHOSING AND DATAMINING 220643116004
444
8 9 Akshay Prajapati 85.0 NaN MS
Out[21]:
In [22]:
data4 = pd.read_csv('StudentDetails.csv')
In [23]:
data4
Out[23]:
In [24]:
13
DATAWAREHOSING AND DATAMINING 220643116004
444
data4
Out[25]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage
In [26]:
data5 = pd.read_csv('StudentDetails.csv')
In [27]:
data5
14
DATAWAREHOSING AND DATAMINING 220643116004
444
4 5 Nimisha Sutariya NaN 7.50 NaN
Out[27]:
In [28]:
data5
Out[29]:
In [30]:
15
DATAWAREHOSING AND DATAMINING 220643116004
444
data6 = pd.read_csv('StudentDetails.csv')
In [31]:
data6
Out[31]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage
In [32]:
data6["Name of Collage"].fillna(method='ffill',limit=1,inplace=True)
In [33]:
data6
Out[33]:
Sr No. Student Name 12th Marks Diplomla CPI Name of Collage
16
DATAWAREHOSING AND DATAMINING 220643116004
444
1 2 Tanvi Patel NaN 7.81 Parul
In [57]:
17
DATAWAREHOSING AND DATAMINING 220643116004
444
df1
Out[58]:
Name ENR
0 Shivani 28
1 Tanvi 03
2 Nimisha 12
Name ENR
3 Twinkle 05
In [60]:
In [61]:
df2
Name Skills
0 Shivani Wd
1 Tanvi JAVA
2 Nimisha SQL
3 Twinkle Python
Out[61]:
In [62]:
18
DATAWAREHOSING AND DATAMINING 220643116004
444
In [63]:
data9
Out[63]:
Name ENR Skills
0 Shivani 28 Wd
1 Tanvi 03 JAVA
Name ENR Skills
2 Nimisha 12 SQL
3 Twinkle 05 Python
In [ ]:
In [64]:
In [65]:
df4
group supervisor
0 Accounting Carly
19
DATAWAREHOSING AND DATAMINING 220643116004
444
1 Engineering Guido
2 HR Steve
Out[65]:
In [68]:
0 Shivani 28
1 Tanvi 03
2 Nimisha 12
3 Twinkle 05
Name Skills
0 Shivani Wd
1 Tanvi JAVA
2 Nimisha SQL
3 Twinkle Python
0 Shivani 28 Wd
1 Tanvi 03 JAVA
2 Nimisha 12 SQL
In [41]:
import numpy as np
import pandas as pd
In [42]:
20
DATAWAREHOSING AND DATAMINING 220643116004
444
df = pd.read_csv('Dataset11.csv')
In [43]:
df
Out[43]:
NAME A B C D E
0 JANE 1 6 6 9 1
1 JOHN 8 1 2 8 1
2 ASHLEY 6 3 5 1 7
3 MAX 0 3 4 0 8
4 EMILY 7 6 6 0 6
df['new'] = np.random.random(5)
In [45]:
df
Out[45]:
21
DATAWAREHOSING AND DATAMINING 220643116004
444
NAME A B C D E new
0 JANE 1 6 6 9 1 0.458527
1 JOHN 8 1 2 8 1 0.390897
2 ASHLEY 6 3 5 1 7 0.044329
3 MAX 0 3 4 0 8 0.229151
NAME A B C D E new
4 EMILY 7 6 6 0 6 0.589566
We give the values as an array or list and assign a name to the new column. Make sure the size of
the array is compatible with the size of the dataframe. The drop function is used to drop a column.
In [46]:
df
Out[47]:
NAME A B C D E
0 JANE 1 6 6 9 1
1 JOHN 8 1 2 8 1
2 ASHLEY 6 3 5 1 7
3 MAX 0 3 4 0 8
4 EMILY 7 6 6 0 6
22
DATAWAREHOSING AND DATAMINING 220643116004
444
We pass the name of the column to be dropped. The axis parameter is set to 1 to indicate we are
dropping a column. Finally, the inplace parameter needs to be True to save the changes.
df.loc[5,:] = ['Jack', 3, 3, 4, 5, 1]
In [49]:
df
Out[49]:
NAME A B C D E
In [50]:
Insert
The insert function adds a column into a specific position.
23
DATAWAREHOSING AND DATAMINING 220643116004
444
In [51]:
df
Out[52]:
new NAME A B C D E
In [53]:
df
Out[54]:
new NAME me A B C D E
24
DATAWAREHOSING AND DATAMINING 220643116004
444
In [55]:
df
Out[56]:
NAME me A B C D E
In [57]:
25
DATAWAREHOSING AND DATAMINING 220643116004
444
df
Out[58]:
NAME A B C D E
Melt
The melt function converts a dataframe from wide (high number of columns) to narrow form (high
number of rows). It is best explained via an example. Consider following dataframe. It contains
consecutive daily measurements for 5 people. The long format of this dataframe can be achieved
using the melt function.
The column passed to the id_vars parameter remains the same and the other columns are
combined under the variable and value columns.
In [66]:
df1 = pd.read_csv('Dataset12.csv')
In [67]:
df1
NAME A B C D E
26
DATAWAREHOSING AND DATAMINING 220643116004
444
0 ASHLEY 6 3 5 1 7
1 MAX 0 3 4 0 8
2 EMILY 7 6 6 0 6
Out[67]:
In [72]:
df2 = pd.read_csv('Dataset11.csv')
In [73]:
df2
NAME A B C D E
0 JANE 1 6 6 9 1
1 JOHN 8 1 2 8 1
2 Jack 4 9 8 6 3
Out[73]
In [77]:
NAME A B C D E
27
DATAWAREHOSING AND DATAMINING 220643116004
444
0 ASHLEY 6 3 5 1 7
1 MAX 0 3 4 0 8
2 EMILY 7 6 6 0 6
3 JANE 1 6 6 9 1
4 JOHN 8 1 2 8 1
NAME A B C D E
5 Jack 4 9 8 6 3
In [76]:
0 1 2 3 4 5 6 7 8 9 10 11
0 ASHLEY 6 3 5 1 7 JANE 1 6 6 9 1
1 MAX 0 3 4 0 8 JOHN 8 1 2 8 1
2 EMILY 7 6 6 0 6 Jack 4 9 8 6 3
Merge
Merge function also combines dataframes based on common values in a given column or columns.
Consider the following two dataframes.
.
In [35]:
pd.melt(df, id_vars='NAME').head()
Out[35]:
28
DATAWAREHOSING AND DATAMINING 220643116004
444
0 JANE A 1.0
1 JOHN A 8.0
2 Jack A 4.0
3 JANE B 6.0
4 JOHN B 1.0
In [79]:
In [81]:
df3
ID Name Category
0 1 Rane A
1 2 Alex B
2 3 Ayan A
3 4 Jack C
29
DATAWAREHOSING AND DATAMINING 220643116004
444
4 5 John B
Out[81]
In [82]:
df4
Out[82]:
ID Amount Payment
2 5 250 Cash
3 6 440 Cash
df3.merge(df4, on='ID')
Out[86]:
30
DATAWAREHOSING AND DATAMINING 220643116004
444
We can perform Full join by just passing the how argument as ‘outer’ to the merge() function:
In [88]:
Performing a left join is actually quite similar to a full join. Just change the how argument to ‘left’:
In [89]:
31
DATAWAREHOSING AND DATAMINING 220643116004
444
Similar to other joins, we can perform a right join by changing the how argument to ‘right’:
In [91]:
Get dummies
Some machine learning models cannot handle categorical variables. In such cases, we should
32
DATAWAREHOSING AND DATAMINING 220643116004
444
encode the categorical variables in a way that each category is represented as a column.
In [95]:
df5 = pd.read_csv('Customer.csv')
In [96]:
df5
0 Rane A 14.2
1 Alex A 21.4
2 Ayan C 15.6
3 Jack B 12.1
4 John B 17.7
Out[96]:
In [98]:
pd.get_dummies(df5)
Out[98]:
0 14.2 0 0 0 0 1 1 0 0
33
DATAWAREHOSING AND DATAMINING 220643116004
444
1 21.4 1 0 0 0 0 1 0 0
2 15.6 0 1 0 0 0 0 0 1
3 12.1 0 0 1 0 0 0 1 0
4 17.7 0 0 0 1 0 0 1 0
For instance, in the first row, the name is Jane and the ctg is A. Thus, the columns that represent
these values are 1 and all other columns are 0.
Pivot table
The pivot_table function transforms a dataframe to a format that explains the relationship among
variables.
We have the dataframe on the left that contains two categorical features (i.e. columns) and a
numerical feature. We want to see the average value of the categories in both columns. The
pivot_table function transforms the dataframe in a way that the average values or any other
aggregation can be seen clearly.
In [100]:
Value
Category A B C
Name
34
DATAWAREHOSING AND DATAMINING 220643116004
444
PRACTICAL-2
Dimensionality reduction refers to techniques for reducing the number of input variables in
training data. High-dimensionality might mean hundreds, thousands, or even millions of input
variables.
Fewer input dimensions often means correspondingly fewer parameters or a simpler structure in
the machine learning model, referred to as degrees of freedom. A model with too many degrees of
freedom is likely to overfit the training dataset and may not perform well on new data.
It is desirable to have simple models that generalize well, and in turn, input data with few input
variables. This is particularly true for linear models where the number of inputs and the degrees of
freedom of the model are often closely related.
35
DATAWAREHOSING AND DATAMINING 220643116004
dimensionality.
Isomap Embedding
Locally Linear Embedding
Modified Locally Linear Embedding
Out[5]:
[-2.3302999 , - 4.86608574, -3.88291317, ..., -0.14561581,
-0.55489384, 0 .61420772],
36
DATAWAREHOSING AND DATAMINING 220643116004
Out[6]:
array([0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,
1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1,
1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0,
1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0,
1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0,
1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0,
0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1,
0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1,
0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0,
1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1,
37
DATAWAREHOSING AND DATAMINING 220643116004
0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1,
0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1,
1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1,
1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0,
1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0,
0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0,
0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,
1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0,
0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0,
1, 0, 0, 1, 0, 0, 1, 1, 0, 1])
In [7]:
38
DATAWAREHOSING AND DATAMINING 220643116004
In [8]:
39
DATAWAREHOSING AND DATAMINING 220643116004
40
DATAWAREHOSING AND DATAMINING 220643116004
41
DATAWAREHOSING AND DATAMINING 220643116004
42
DATAWAREHOSING AND DATAMINING 220643116004
PRACTICAL-3
i) Rescale Data
Your data must be prepared before you can build models. The data preparation process can
involve three steps: data selection, data preprocessing and data transformation. Your preprocessed
data may contain attributes with a mixtures of scales for various quantities such as dollars,
kilograms and sales volume. Many machine learning methods expect or are more effective if the
data attributes have the same scale. Two popular data scaling methods are normalization and
standardization.
Data Normalization
Normalization refers to rescaling real valued numeric attributes into the range 0 and 1. It is useful
to scale the input attributes for a model that relies on the magnitude of values, such as distance
measures used in k-nearest neighbors and in the preparation of coefficients in regression. The
example below demonstrate data normalization of the Iris flowers dataset.
Normalization
In [1]:
(150, 4)
In [2]:
iris
43
DATAWAREHOSING AND DATAMINING 220643116004
Out[2]:
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
44
DATAWAREHOSING AND DATAMINING 220643116004
45
DATAWAREHOSING AND DATAMINING 220643116004
46
DATAWAREHOSING AND DATAMINING 220643116004
In [3]:
X = iris.data
y = iris.target
In [4]:
X
Out[4]:
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
47
DATAWAREHOSING AND DATAMINING 220643116004
48
DATAWAREHOSING AND DATAMINING 220643116004
49
DATAWAREHOSING AND DATAMINING 220643116004
50
DATAWAREHOSING AND DATAMINING 220643116004
y
Out[5]:
array([0, 0, 0, 0, 0, 0 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
In [6]:
normalized_X
Out[8]:
array([[0.80377277, 0.55160877, 0.22064351, 0.0315205 ],
[0.82813287, 0.50702013, 0.23660939, 0.03380134],
[0.80533308, 0.54831188, 0.2227517 , 0.03426949],
[0.80003025, 0.53915082, 0.26087943, 0.03478392],
[0.790965 , 0.5694948 , 0.2214702 , 0.0316386 ],
51
DATAWAREHOSING AND DATAMINING 220643116004
52
DATAWAREHOSING AND DATAMINING 220643116004
53
DATAWAREHOSING AND DATAMINING 220643116004
54
DATAWAREHOSING AND DATAMINING 220643116004
Data Standardization
Standardization refers to shifting the distribution of each attribute to have a mean of zero and a
standard deviation of one (unit variance). It is useful to standardize attributes for a model that relies
on the distribution of attributes such as Gaussian processes. The example below demonstrate data
standardization of the Iris flowers dataset.
Data Standardization
In [9]:
In [10]:
55
DATAWAREHOSING AND DATAMINING 220643116004
iris
Out[10]:
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
56
DATAWAREHOSING AND DATAMINING 220643116004
57
DATAWAREHOSING AND DATAMINING 220643116004
58
DATAWAREHOSING AND DATAMINING 220643116004
X = iris.data y = iris.target
In [12]:
standardized_X = preprocessing.scale(X)
In [14]:
59
DATAWAREHOSING AND DATAMINING 220643116004
standardized_X
Out[14]:
-1.44707648e+00],
[-1.87002413e+00, -1.31979479e-01, -1.51073881e+00,
-1.44707648e+00],
[-5.25060772e-02, 2.16998818e+00, -1.45390138e+00,
-1.31544430e+00],
[-1.73673948e-01, 3.09077525e+00, -1.28338910e+00,
-1.05217993e+00],
[-5.37177559e-01, 1.93979142e+00, -1.39706395e+00,
-1.05217993e+00],
[-9.00681170e-01, 1.01900435e+00, -1.34022653e+00,
-1.18381211e+00],
[-1.73673948e-01, 1.70959465e+00, -1.16971425e+00,
60
DATAWAREHOSING AND DATAMINING 220643116004
-1.18381211e+00],
[-9.00681170e-01, 1.70959465e+00, -1.28338910e+00,
-1.18381211e+00],
[-5.37177559e-01, 7.88807586e-01, -1.16971425e+00,
-1.31544430e+00],
[-9.00681170e-01, 1.47939788e+00, -1.28338910e+00,
-1.05217993e+00],
[-1.50652052e+00, 1.24920112e+00, -1.56757623e+00,
-1.31544430e+00],
[-9.00681170e-01, 5.58610819e-01, -1.16971425e+00,
-9.20547742e-01],
[-1.26418478e+00, 7.88807586e-01, -1.05603939e+00,
-1.31544430e+00],
[-1.02184904e+00, -1.31979479e-01, -1.22655167e+00,
-1.31544430e+00],
[-1.02184904e+00, 7.88807586e-01, -1.22655167e+00,
-1.05217993e+00],
[-7.79513300e-01, 1.01900435e+00, -1.28338910e+00,
-1.31544430e+00],
[-7.79513300e-01, 7.88807586e-01, -1.34022653e+00,
-1.31544430e+00],
[-1.38535265e+00, 3.28414053e-01, -1.22655167e+00,
-1.31544430e+00],
[-1.26418478e+00, 9.82172869e-02, -1.22655167e+00,
-1.31544430e+00],
[-5.37177559e-01, 7.88807586e-01, -1.28338910e+00,
-1.05217993e+00],
[-7.79513300e-01, 2.40018495e+00, -1.28338910e+00,
-1.44707648e+00],
[-4.16009689e-01, 2.63038172e+00, -1.34022653e+00,
-1.31544430e+00],
[-1.14301691e+00, 9.82172869e-02, -1.28338910e+00,
-1.31544430e+00],
[-1.02184904e+00, 3.28414053e-01, -1.45390138e+00,
-1.31544430e+00],
[-4.16009689e-01, 1.01900435e+00, -1.39706395e+00,
-1.31544430e+00],
[-1.14301691e+00, 1.24920112e+00, -1.34022653e+00,
-1.44707648e+00],
[-1.74885626e+00, -1.31979479e-01, -1.39706395e+00,
-1.31544430e+00],
[-9.00681170e-01, 7.88807586e-01, -1.28338910e+00,
-1.31544430e+00],
[-1.02184904e+00, 1.01900435e+00, -1.39706395e+00,
-1.18381211e+00],
61
DATAWAREHOSING AND DATAMINING 220643116004
2.64141916e-01],
[-2.94841818e-01, -3.62176246e-01, -8.98031345e-02,
1.32509732e-01],
[ 1.03800476e+00, 9.82172869e-02, 3.64896281e-01,
2.64141916e-01],
[-2.94841818e-01, -1.31979479e-01, 4.21733708e-01,
3.95774101e-01],
[-5.25060772e-02, -8.22569778e-01, 1.94384000e-01,
-2.62386821e-01],
[ 4.32165405e-01, -1.97355361e+00, 4.21733708e-01,
3.95774101e-01],
[-2.94841818e-01, -1.28296331e+00, 8.07091462e-02,
-1.30754636e-01],
[ 6.86617933e-02, 3.28414053e-01, 5.92245988e-01,
7.90670654e-01],
[ 3.10997534e-01, -5.92373012e-01, 1.37546573e-01,
1.32509732e-01],
[ 5.53333275e-01, -1.28296331e+00, 6.49083415e-01,
3.95774101e-01],
[ 3.10997534e-01, -5.92373012e-01, 5.35408562e-01,
8.77547895e-04],
[ 6.74501145e-01, -3.62176246e-01, 3.08058854e-01,
1.32509732e-01],
[ 9.16836886e-01, -1.31979479e-01, 3.64896281e-01,
2.64141916e-01],
[ 1.15917263e+00, -5.92373012e-01, 5.92245988e-01,
2.64141916e-01],
[ 1.03800476e+00, -1.31979479e-01, 7.05920842e-01,
6.59038469e-01],
[ 1.89829664e-01, -3.62176246e-01, 4.21733708e-01,
3.95774101e-01],
[-1.73673948e-01, -1.05276654e+00, -1.46640561e-01,
-2.62386821e-01],
[-4.16009689e-01, -1.51316008e+00, 2.38717193e-02,
-1.30754636e-01],
[-4.16009689e-01, -1.51316008e+00, -3.29657076e-02,
-2.62386821e-01],
[-5.25060772e-02, -8.22569778e-01, 8.07091462e-02,
8.77547895e-04],
[ 1.89829664e-01, -8.22569778e-01, 7.62758269e-01,
5.27406285e-01],
[-5.37177559e-01, -1.31979479e-01, 4.21733708e-01,
3.95774101e-01],
[ 1.89829664e-01, 7.88807586e-01, 4.21733708e-01,
5.27406285e-01],
63
DATAWAREHOSING AND DATAMINING 220643116004
7.90670654e-01],
[ 1.64384411e+00, 1.24920112e+00, 1.33113254e+00,
1.71209594e+00],
[ 7.95669016e-01, 3.28414053e-01, 7.62758269e-01,
1.05393502e+00],
[ 6.74501145e-01, -8.22569778e-01, 8.76433123e-01,
9.22302838e-01],
[ 1.15917263e+00, -1.31979479e-01, 9.90107977e-01,
1.18556721e+00],
[-1.73673948e-01, -1.28296331e+00, 7.05920842e-01,
1.05393502e+00],
[-5.25060772e-02, -5.92373012e-01, 7.62758269e-01,
1.58046376e+00],
[ 6.74501145e-01, 3.28414053e-01, 8.76433123e-01,
1.44883158e+00],
[ 7.95669016e-01, -1.31979479e-01, 9.90107977e-01,
7.90670654e-01],
[ 2.24968346e+00, 1.70959465e+00, 1.67215710e+00,
1.31719939e+00],
[ 2.24968346e+00, -1.05276654e+00, 1.78583195e+00,
1.44883158e+00],
[ 1.89829664e-01, -1.97355361e+00, 7.05920842e-01,
3.95774101e-01],
[ 1.28034050e+00, 3.28414053e-01, 1.10378283e+00,
1.44883158e+00],
[-2.94841818e-01, -5.92373012e-01, 6.49083415e-01,
1.05393502e+00],
[ 2.24968346e+00, -5.92373012e-01, 1.67215710e+00,
1.05393502e+00],
[ 5.53333275e-01, -8.22569778e-01, 6.49083415e-01,
7.90670654e-01],
[ 1.03800476e+00, 5.58610819e-01, 1.10378283e+00,
1.18556721e+00],
[ 1.64384411e+00, 3.28414053e-01, 1.27429511e+00,
7.90670654e-01],
[ 4.32165405e-01, -5.92373012e-01, 5.92245988e-01,
7.90670654e-01],
[ 3.10997534e-01, -1.31979479e-01, 6.49083415e-01,
7.90670654e-01],
[ 6.74501145e-01, -5.92373012e-01, 1.04694540e+00,
1.18556721e+00],
[ 1.64384411e+00, -1.31979479e-01, 1.16062026e+00,
5.27406285e-01],
[ 1.88617985e+00, -5.92373012e-01, 1.33113254e+00,
9.22302838e-01],
65
DATAWAREHOSING AND DATAMINING 220643116004
66
DATAWAREHOSING AND DATAMINING 220643116004
assigning the value 0 to all the data points below. The threshold and 1 to those above it.
sklearn.preprocessing.Binarizer() is a method which belongs to preprocessing module. It plays
a key role in the discretization of continuous feature values.
Binarize data
In [13]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
In [14]:
data = pd.read_csv('p3.csv')
In [15]:
data
Out[15]:
67
DATAWAREHOSING AND DATAMINING 220643116004
In [33]:
data["Salary"].fillna(method='ffill', inplace=True)
In [40]:
data["Age"].fillna(method='ffill', inplace=True)
In [41]:
data
68
DATAWAREHOSING AND DATAMINING 220643116004
Out[41]:
In [42]:
x = age
x = x.reshape(1, -1)
69
DATAWAREHOSING AND DATAMINING 220643116004
y = salary
y = y.reshape(1, -1)
In [44]:
Binarized age :
[[1. 0. 0. 1. 1. 0. 0. 1. 1. 1.]]
Binarized salary :
[[1. 0. 0. 0. 0. 0. 0. 1. 1. 1.]]
70
DATAWAREHOSING AND DATAMINING 220643116004
PRACTICAL-4
In [3]:
transformer = MaxAbsScaler().fit(X)
In [4]:
transformer.transform(X)
Out[4]:
arr ay([[ 0.5, -1. , 1. ],
[ 1. , 0. , 0. ],
71
DATAWAREHOSING AND DATAMINING 220643116004
[ 0. , 1. , -0.5]])
In [ ]:
Transform features by scaling each feature to a given range. This estimator scales and translates
each feature individually such that it is in the given range on the training set, e.g. between zero
and one.
This transformation is often used as an alternative to zero mean, unit variance scaling.
In [5]:
scaler = MinMaxScaler()
In [10]:
72
DATAWAREHOSING AND DATAMINING 220643116004
print(scaler.data_max_)
[ 1. 18.]
In [12]:
print(scaler.transform(data))
[[0. 0. ]
[0.25 0.25]
[0.5 0.5 ]
[1. 1. ]]
z = (X – μ) / σ
where:
We can calculate z-scores in Python using scipy.stats.zscore, which uses the following syntax:
where:
73
DATAWAREHOSING AND DATAMINING 220643116004
Z_Score
In [25]:
In [26]:
In [27]:
stats.zscore(data, axis=1)
Out[27]:
array([[-1.56892908, -0.58834841, 0.39223227, 0.39223227, 1.37281295],
[-0.81649658, -0.81649658, -0.81649658, 1.22474487, 1.22474487],
[-1.16666667, -1.16666667, 0.5 , 0.5 , 1.33333333]])
In [ ]:
74
DATAWAREHOSING AND DATAMINING 220643116004
PRACTICAL-5
Attribute Relevance
Attribute relevance analysis phase has task to recognize attributes (characteristics) with strongest
impact on churn. Attributes which shows greatest segregation power in relation with churn (churn
= ―Yes‖ or ―No‖) by attribute relevance analysis will be selected as best candidates for building
predictive churn model. By no means is Attribute Relevance Analysis used only for predictive churn
model development, you can use it for every classification task. It is based on two terms:
Information Value and Weight of Evidence.
If we‘re talking about churn modeling, Goods would be clients which didn‘t churn, and Bads would
be clients which committed churn. Just from this, you can see the simplicity behind the formulas.
The Attribute Relevance Analysis for churn modeling example is divided into 6 steps:
1. Data Cleaning and Preparation,
2. Calculating IV and WoE,
3. Identifying Churners Profile,
4. Coarse Classing,
5. Dummy Variable Creation,
6. Correlations between Dummy Variables.
There are 10,000 observations and 14 columns. From here on proceeded to data cleaning. Here are
the steps:
1. Delete RowNumber, CustomerId, and Surname — they are arbitrary and can‘t be used.
75
DATAWAREHOSING AND DATAMINING 220643116004
import pandas as pd
import numpy as np
In [2]:
data = pd.read_csv('p5.csv')
In [3]:
data
Out[3]:
Har Fe
1563 Fran 4 101348.
0 1 grav 619 mal 2 0.00 1 1 1 1
4602 ce 2 88
e e
1564 Fe
1 2 Hill Spai 4 8380 112542.
608 mal 1 1 0 1 0
7311 n 1 7.86 58
e
Fe 1596
1561 Oni Fran 4 113931.
2 3 502 mal 8 60.8 3 1 0 1
9304 o ce 2 57
e 0
1570 Fe 3
Bon Fran 1 0.00 2 0 0 93826.6 0
3 4 699 mal
1354 i ce 9 3
e
76
DATAWAREHOSING AND DATAMINING 220643116004
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9
9 1560 Obij Fran Ma 3 96270.6
9996 771 5 0.00 2 1 0 0
9 6229 iaku ce le 9 4
5
9 Joh
9 1556 Fran Ma 3 5736 101699.
9997 nsto 516 10 1 1 1 0
9 9892 ce le 5 9.61 77
ne
6
9 Fe
9 1558 Fran 3 42085.5
9998 Liu 709 mal 7 0.00 1 0 1 1
9 4532 ce 6 8
e
7
9 Sab
9 1568 Ger Ma 4 7507 92888.5
9999 bati 772 3 2 1 0 1
9 2355 many le 2 5.31 2
ni
8
9 Fe 1301
9 1562 Wal Fran 2 38190.7
10000 792 mal 4 42.7 1 1 0 0
9 8319 ker ce 8 8
e 9
9
data['CreditScore_Bins'] = pd.qcut(data['CreditScore'], 5,
labels=['CS_lt_566', 'CS_556_to_627', 'CS_627_to_678', 'CS_678_to_735', 'CS_gt_735'])
data['Age_Bins'] = pd.qcut(data['Age'], 5,
77
DATAWAREHOSING AND DATAMINING 220643116004
data
Out[5]:
Fe
Franc CS_556_t Age_40 Bal_lt_7308 Sal_80238_
0 mal 2 1 1 1 1
e o_627 _to_46 0 to_119710
e
Fe
CS_556_t Age_40 Bal_73080_ Sal_80238_
1 Spain mal 1 1 0 1 0
o_627 _to_46 to_110138 to_119710
e
Fe
Franc CS_lt_56 Age_40 Bal_gt_133 Sal_80238_
2 mal 8 3 1 0 1
e 6 _to_46 710 to_119710
e
Fe
Franc CS_678_t Age_35 Bal_lt_7308 Sal_80238_
3 mal 1 2 0 0 0 to_119710
e o_735 _to_40 0
e
Fe
CS_gt_73 Age_40 Bal_110138 Sal_41050_
4 Spain mal 2 1 1 1 0
5 _to_46 _to_133710 to_80238
e
... ... ... ... ... ... ... ... ... ... ... ...
78
DATAWAREHOSING AND DATAMINING 220643116004
96 e e 6 _to_35 0 to_119710
Fe
99 Franc CS_678_t Age_35 Bal_lt_7308 Sal_41050_
mal 7 1 0 1 1
97 e o_735 _to_40 0 to_80238
e
Fe
99 Franc CS_gt_73 Age_lt Bal_110138 Sal_lt_4105
mal 4 1 1 0 0
99 e 5 _31 _to_133710 0
e
In [6]:
dset = pd.DataFrame(lst)
dset['Distr_Good'] = dset['Good'] / dset['Good'].sum()
dset['Distr_Bad'] = dset['Bad'] / dset['Bad'].sum()
79
DATAWAREHOSING AND DATAMINING 220643116004
dset = dset.sort_values(by='WoE')
return dset, iv
In [9]:
80
DATAWAREHOSING AND DATAMINING 220643116004
81
DATAWAREHOSING AND DATAMINING 220643116004
IV
3 0.029708
2 0.005408
1 0.002738
0 0.054191
IV score: 0.09
IV
4 0.001037
0 0.000022
2 0.000022
1 0.000135
3 0.000135
IV score: 0.00
For now, you should just care about the line which says IV score. More precisely, keep your
thoughts on variables with the highest IV scores. Down below is a table for IV interpretation:
82
DATAWAREHOSING AND DATAMINING 220643116004
IV Interpretation table
Now you should see the clearer picture. You should only keep those attributes which have good
You as a company probably want to know how does the typical churner look like. I mean, you don‘t
care about his/her physical appearance, but you want to know where does the churner lives, what‘s
To find that out, you will need to take a closer look at the returned data frames for those variables
which have the greatest predictive power. More precisely, look at the WoE column. Ideally, you
will find negative WoE score — this is the value most churners have.
With this information, you as a company can act and address this critical customer group.
Coarse Classing is another term I haven‘t heard prior to my master degree studies. It‘s very simple
to explain the idea behind it, you basically want to group together instances with similar WoE
For this dataset, coarse classing should be applied to Spain and France in Geography attribute
In [10]:
Geography_df
Out[11]:
In [12]:
84
DATAWAREHOSING AND DATAMINING 220643116004
Geography_iv
0.16841897055216165
Out[12]:
Down below is the function for coarse classing, along with the function call. To call the function
you must know before hand what are the index locations of the two rows you want coarsed.
return coarsed_df
Out[14]:
In [ ]:
85
DATAWAREHOSING AND DATAMINING 220643116004
You can notice that Value is NaN for the newly created row. It‘s nothing to worry about, you can
simply remap the original dataset to replace Spain and France with something new, for example,
Spain_and_France.
As you know, classification models perform the best when there exist only binary attributes. That‘s
where dummy variables come in. A dummy variable is one that takes the value 0 or 1 to indicate
the absence or presence of some categorical effect that may be expected to shift the outcome.
In a nutshell, if the attribute has n unique values, you will need to create n — 1 dummy variables.
You create one less dummy variable to avoid collinearity issues — when one variable is a perfect
predictor of the other.
Dummy variables will be needed for the following attributes:
Geography
NumOfProducts
Age_Bins
and the code below will create them, and then concatenate them to a new Data Frame along
with IsActiveMember attribute and the target variable — Exited:
In [18]:
86
DATAWAREHOSING AND DATAMINING 220643116004
df
Out[18]:
Geography_Spai Num_P Num_P Num_P Age_31 Age_35 Age_40 Age_ IsActive Exi
n_and_France rods_2 rods_3 rods_4 _to_35 _to_40 _to_46 gt_46 Member ted
0 1 0 0 0 0 0 1 0 1 1
1 1 0 0 0 0 0 1 0 1 0
2 1 0 1 0 0 0 1 0 0 1
3 1 1 0 0 0 1 0 0 0 0
4 1 0 0 0 0 0 1 0 1 0
... ... ... ... ... ... ... ... ... ... ...
99 1 1 0 0 0 1 0 0 0 0
95
99 1 0 0 0 1 0 0 0 1 0
96
99 1 0 0 0 0 1 0 0 1 1
97
99 0 1 0 0 0 0 1 0 0 1
98
99 1 0 0 0 0 0 0 0 0 0
99
87
DATAWAREHOSING AND DATAMINING 220643116004
The final step of this process is to calculate correlations between dummy variables and to exclude
those with high correlation. What‘s considered a high correlation coefficient is up to debate, but I
would suggest you remove anything with a correlation above 0.7 (of absolute value). If you‘re
wondering which dummy variable to remove between the two, remove the one with lower Weight
of Evidence, due to weaker connection to the target variable.
Correlation Matrix
Here is visible that there exists no correlation between dummy variables, and therefore, all of
them must remain.
88
DATAWAREHOSING AND DATAMINING 220643116004
PRACTICAL-6
Random Forest
It is an extension of bootstrap aggregation (bagging) of decision trees and can be used for
classification and regression problems.
In bagging, a number of decision trees are created where each tree is created from a different
bootstrap sample of the training dataset. A bootstrap sample is a sample of the training dataset
where a sample may appear more than once in the sample, referred to as sampling with
replacement.
Bagging is an effective ensemble algorithm as each decision tree is fit on a slightly different
training dataset, and in turn, has a slightly different performance. Unlike normal decision tree
models, such as classification and regression trees (CART), trees used in the ensemble are
unpruned, making them slightly overfit to the training dataset. This is desirable as it helps to make
each tree more different and have less correlated predictions or prediction errors.
Predictions from the trees are averaged across all decision trees resulting in better performance
than any single tree in the model.
A prediction on a regression problem is the average of the prediction across the trees in the
ensemble. A prediction on a classification problem is the majority vote for the class label across
the trees in the ensemble.
Unlike bagging, random forest also involves selecting a subset of input features (columns or
variables) at each split point in the construction of trees. Typically, constructing a decision tree
involves evaluating the value for each input variable in the data in order to select a split point. By
reducing the features to a random subset that may be considered at each split point, it forces each
decision tree in the ensemble to be more different.
89
DATAWAREHOSING AND DATAMINING 220643116004
Random Forest
In [1]:
In [2]:
model = RandomForestClassifier()
In [4]:
90
DATAWAREHOSING AND DATAMINING 220643116004
A weak learner is a model that is very simple, although has some skill on the dataset. Boosting
was a theoretical concept long before a practical algorithm could be developed, and the AdaBoost
(adaptive boosting) algorithm was the first successful approach for the idea.
The AdaBoost algorithm involves using very short (one-level) decision trees as weak learners that
are added sequentially to the ensemble. Each subsequent model attempts to correct the predictions
made by the model before it in the sequence. This is achieved by weighing the training dataset to
put more focus on training examples on which prior models made prediction errors.
AdaBoost
In [6]:
model = AdaBoostClassifier()
In [9]:
91
DATAWAREHOSING AND DATAMINING 220643116004
In [10]:
In [ ]:
92
DATAWAREHOSING AND DATAMINING 220643116004
PRACTICAL-7
Step3: On the left side select the appropriate OS for your computer.
93
DATAWAREHOSING AND DATAMINING 220643116004
94
DATAWAREHOSING AND DATAMINING 220643116004
Step8: Next screen is of choosing components, all components are already marked so don‘t
change anything just click on the Install button.
Step9: The next screen will be of installing location so choose the drive which will have
sufficient memory space for installation. It needed a memory space of 301 MB.
95
DATAWAREHOSING AND DATAMINING 220643116004
Step10: Next screen will be of choosing the Start menu folder so don‘t do anything just click on
Install Button.
Step11: After this installation process will start and will hardly take a minute to complete the
installation.
96
DATAWAREHOSING AND DATAMINING 220643116004
Step12: Click on the Next button after the installation process is complete.
97
DATAWAREHOSING AND DATAMINING 220643116004
Step14: Weka is successfully installed on the system and an icon is created on the desktop.
98
DATAWAREHOSING AND DATAMINING 220643116004
99
DATAWAREHOSING AND DATAMINING 220643116004
PRACTICAL-8
This experiment illustrates some of the basic data preprocessing operations that can be performed
using WEKA-Explorer. The sample dataset used for this example is the student data available in
arff format.
Step1: Loading the data. We can load the dataset into weka by clicking on open button in
preprocessing interface and selecting the appropriate file.
Step2: Once the data is loaded, weka will recognize the attributes and during the scan of the data
weka will compute some basic strategies on each attribute. The left panel in the above figure shows
the list of recognized attributes while the top panel indicates the names of the base relation or table
and the current working relation (which are same initially).
Step3:Clicking on an attribute in the left panel will show the basic statistics on the attributes for
the categorical attributes the frequency of each attribute value is shown, while for continuous
attributes we can obtain min, max, mean, standard deviation and deviation etc.,
Step4:The visualization in the right button panel in the form of cross-tabulation across two
attributes.
Note:we can select another attribute using the dropdown list. Step5:Selecting or filtering attributes
Removing an attribute-When we need to remove an attribute,we can do this by using the attribute
filters in weka.In the filter model panel,click on choose button,This will show a popup window
with a list of available filters.
Step 6:a)Next click the textbox immediately to the right of the choose button.In the resulting dialog
box enter the index of the attribute to be filtered out.
b) Make sure that invert selection option is set to false.The click OK now in the filter box.you
will see ―Remove-R-7‖.
c) Click the apply button to apply filter to this data.This will remove the attribute and create
new working relation.
d) Save the new working relation as an arff file by clicking save button on the
100
DATAWAREHOSING AND DATAMINING 220643116004
top(button)panel.(student.arff)
Discretization
1) Sometimes association rule mining can only be performed on categorical data.This requires
performing discretization on numeric or continuous attributes.In the following example let us
discretize age attribute.
ÆLet us divide the values of age attribute into three bins(intervals). ÆFirst load the dataset into
weka(student.arff)
ÆTo change the defaults for the filters,click on the box immediately to the right of the choose
button.
ÆWe enter the index for the attribute to be discretized.In this case the attribute is age.So we must
enter ‗1‘ corresponding to the age attribute.
ÆEnter ‗3‘ as the number of bins.Leave the remaining field values as they are. ÆClick OK button.
ÆClick apply in the filter panel.This will result in a new working relation with the selected
attribute partition into 3 bins.
@relation student
@attribute age {<30,30-40,>40} @attribute income {low, medium, high} @attribute student {yes,
no}
@data
101
DATAWAREHOSING AND DATAMINING 220643116004
102
DATAWAREHOSING AND DATAMINING 220643116004
103
DATAWAREHOSING AND DATAMINING 220643116004
PRACTICAL-9
Step1:Loading the data. We can load the dataset into weka by clicking on open button in
preprocessing interface and selecting the appropriate file.
Step2:Once the data is loaded, weka will recognize the attributes and during the scan of the data
weka will compute some basic strategies on each attribute. The left panel in the above figure shows
the list of recognized attributes while the top panel indicates the names of the base relation or table
and the current working relation (which are same initially).
Step3:Clicking on an attribute in the left panel will show the basic statistics on the attributes for
the categorical attributes the frequency of each attribute value is shown, while for continuous
attributes we can obtain min, max, mean, standard deviation and deviation etc.,
Step4:The visualization in the right button panel in the form of cross-tabulation across two
attributes.
Note:we can select another attribute using the dropdown list. Step5:Selecting or filtering attributes
Removing an attribute-When we need to remove an attribute,we can do this by using the attribute
filters in weka.In the filter model panel,click on choose button,This will show a popup window
with a list of available filters.
Step 6:a)Next click the textbox immediately to the right of the choose button.In the resulting dialog
box enter the index of the attribute to be filtered out.
b) Make sure that invert selection option is set to false.The click OK now in the filter box.you
will see ―Remove-R-7‖.
c) Click the apply button to apply filter to this data.This will remove the attribute and create
new working relation.
d) Save the new working relation as an arff file by clicking save button on the
top(button)panel.(labor.arff)
104
DATAWAREHOSING AND DATAMINING 220643116004
Discretization
1) Sometimes association rule mining can only be performed on categorical data.This requires
performing discretization on numeric or continuous attributes.In the following example let us
discretize duration attribute.
ÆLet us divide the values of duration attribute into three bins(intervals). ÆFirst load the dataset
into weka(labor.arff)
ÆTo change the defaults for the filters,click on the box immediately to the right of the choose
button.
ÆWe enter the index for the attribute to be discretized.In this case the attribute is duration So we
must enter ‗1‘ corresponding to the duration attribute.
ÆEnter ‗1‘ as the number of bins.Leave the remaining field values as they are. ÆClick OK button.
ÆClick apply in the filter panel.This will result in a new working relation with the selected
attribute partition into 1 bin.
Dataset labor.arff
105
DATAWAREHOSING AND DATAMINING 220643116004
106
DATAWAREHOSING AND DATAMINING 220643116004
107
DATAWAREHOSING AND DATAMINING 220643116004
PRACTICAL-10
Dataset contactlenses.arff
The following screenshot shows the association rules that were generated
108
DATAWAREHOSING AND DATAMINING 220643116004
109