Exp2-Dm - KS

The document outlines data preprocessing tasks in Python, focusing on handling categorical data, scaling features, and splitting datasets into training and testing sets. It provides code examples for encoding categorical variables, using the OneHotEncoder and LabelEncoder, and demonstrates how to split the dataset using train_test_split. Additionally, it emphasizes the importance of scaling features for accurate machine learning predictions and includes visualizations before and after scaling.

Uploaded by

lakshminarasimhayatham

KS 2/4 B. Tech AI&DS Data Mining Using Python Lab

EXP2.
2. Demonstrate the following data preprocessing tasks using Python libraries.
a) Dealing with categorical data
b) Scaling the features
c) Splitting dataset into Training and Testing Sets

A) Dealing with Categorical Data:

When dealing with large, real-world datasets, categorical data is almost inevitable. Categorical variables represent data that can be divided into groups; examples are race, sex, age group, and educational level. These variables often have letters or words as their values. Since machine learning models work on numbers and calculations, categorical variables need to be coded into numbers. Coding a categorical variable into integers may not be enough on its own, because integer codes imply an order between categories that may not exist; one-hot encoding avoids this by giving each category its own 0/1 column.

DATASETEXP1:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import dtale
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load EXP1DATASET.csv and preview the first 20 rows
datas = pd.read_csv('C:\\Users\\AIDS\\Desktop\\EXP1DATASET.csv')
datas.head(20)

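# One-hot encode the SALARY column; categories unseen at transform time are ignored
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_datas = pd.DataFrame(encoder.fit_transform(datas[['SALARY']]).toarray())
# Append the indicator columns and drop the original SALARY column
df2 = datas.join(encoder_datas)
df2.drop('SALARY', axis=1, inplace=True)
df2.head()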

   INDEX COUNTRY  FACULTYID    0    1    2    3    4    5    6    7    8    9   10   11   12
0      0   INDIA      270.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
1      1  FRANCE      301.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2      2   JAPAN       50.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
3      3   SPAIN       35.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0
4      4   CHINA      390.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0

from sklearn.preprocessing import LabelEncoder

# Replace the last column (SALARY) with integer labels, one per distinct value
le = LabelEncoder()
datas.iloc[:, -1] = le.fit_transform(datas.iloc[:, -1])
datas

    INDEX    COUNTRY  FACULTYID  SALARY
0       0      INDIA      270.0     3.0
1       1     FRANCE      301.0     4.0
2       2      JAPAN       50.0     6.0
3       3      SPAIN       35.0    12.0
4       4      CHINA      390.0     8.0
5       5    GERMANY       35.0     7.0
6       6      INDIA        NaN     5.0
7       7  AUSTRALIA      422.0    10.0
8       8     RUSSIA       50.0    11.0
9       9        USA      534.0    12.0
10     10      KORIA       21.0     0.0
11     11        USA        NaN     8.0
12     12      INDIA       63.0     9.0
13     13     LONDON      200.0     2.0
14     14      INDIA      105.0     1.0
15     15      JAPAN      321.0     9.0

As you can see, the SALARY column has been successfully transformed.

This completes the encoding of SALARY; COUNTRY is still stored as text and would need the same treatment before modelling. We can now move to the next step.
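
One way to handle a text column such as COUNTRY is `ColumnTransformer`, which is imported at the top of this experiment but not used. The sketch below one-hot encodes COUNTRY and passes the remaining column through unchanged. It is an illustration only: the frame `toy` and its values are made up, and `sparse_threshold=0` is set just so the result comes back as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the lab dataset; values are illustrative only
toy = pd.DataFrame({
    'COUNTRY': ['INDIA', 'FRANCE', 'JAPAN', 'INDIA'],
    'FACULTYID': [270.0, 301.0, 50.0, 63.0],
})

ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(handle_unknown='ignore'), ['COUNTRY'])],
    remainder='passthrough',   # keep FACULTYID as it is
    sparse_threshold=0,        # always return a dense NumPy array
)

# Output columns: one indicator per country (alphabetical), then FACULTYID
X_encoded = ct.fit_transform(toy)
print(X_encoded)
```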

C) Splitting the Dataset into Training and Testing Sets:

Machine learning models need a training set from which to learn the relations between features so that they can predict for new observations. When we are given a single large dataset, it is a good idea to split it into two parts, a training_set and a test_set, so that the model can be tested on data it did not see during training.

X = datas.iloc[:, :-1].values
y = datas.iloc[:, -1].values
# .values converts the selected columns into NumPy arrays
print("Independent Variable\n")
print(X)
print("\nDependent Variable\n")
print(y)

Independent Variable
[[0 'INDIA' 270.0]
[1 'FRANCE' 301.0]
[2 'JAPAN' 50.0]
[3 'SPAIN' 35.0]
[4 'CHINA' 390.0]
[5 'GERMANY' 35.0]
[6 'INDIA' nan]
[7 'AUSTRALIA' 422.0]
[8 'RUSSIA' 50.0]
[9 'USA' 534.0]
[10 'KORIA' 21.0]
[11 'USA' nan]
[12 'INDIA' 63.0]
[13 'LONDON' 200.0]
[14 'INDIA' 105.0]
[15 'JAPAN' 321.0]]

Dependent Variable
[ 3. 4. 6. 12. 8. 7. 5. 10. 11. 12. 0. 8. 9. 2. 1. 9.]

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)

print(x_train)

[[0 'INDIA' 270.0]
[3 'SPAIN' 35.0]
[7 'AUSTRALIA' 422.0]
[13 'LONDON' 200.0]
[5 'GERMANY' 35.0]
[12 'INDIA' 63.0]
[14 'INDIA' 105.0]
[9 'USA' 534.0]
[2 'JAPAN' 50.0]
[15 'JAPAN' 321.0]
[4 'CHINA' 390.0]
[1 'FRANCE' 301.0]]

print(x_test)

[[10 'KORIA' 21.0]
[8 'RUSSIA' 50.0]
[11 'USA' nan]
[6 'INDIA' nan]]

from sklearn.model_selection import train_test_split

# random_state fixes the shuffle so the same split is produced on every run
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

x_train

array([[13, 'LONDON', 200.0],
[4, 'CHINA', 390.0],
[2, 'JAPAN', 50.0],
[14, 'INDIA', 105.0],
[10, 'KORIA', 21.0],
[7, 'AUSTRALIA', 422.0],
[15, 'JAPAN', 321.0],
[11, 'USA', nan],
[3, 'SPAIN', 35.0],
[0, 'INDIA', 270.0],
[5, 'GERMANY', 35.0],
[12, 'INDIA', 63.0]], dtype=object)

x_test

array([[1, 'FRANCE', 301.0],
[6, 'INDIA', nan],
[8, 'RUSSIA', 50.0],
[9, 'USA', 534.0]], dtype=object)

y_train

array([ 2., 8., 6., 1., 0., 10., 9., 8., 12., 3., 7., 9.])

y_test
array([ 4., 5., 11., 12.])
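
The two runs in this part differ only in `random_state`: without it the rows are shuffled differently on every call, while a fixed value reproduces the same split. The sketch below checks this on a stand-in array of 16 rows (the same row count as the dataset) and shows that `test_size=0.2` gives 4 test rows and 12 training rows, because the test count is rounded up from 3.2. The arrays `X_demo` and `y_demo` are placeholders, not the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 16 rows, matching the row count of the lab dataset
X_demo = np.arange(32).reshape(16, 2)
y_demo = np.arange(16)

# Two calls with the same random_state should produce identical splits
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(a_train.shape, a_test.shape)      # sizes of the two parts
print(np.array_equal(a_test, b_test))   # True when the split is reproducible
```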

B) Scaling the features:

Since machine learning models rely on numbers to work out relations, it is important for the features in a dataset to be on similar scales. Scaling brings all features into a comparable range; StandardScaler, used below, rescales each column to zero mean and unit variance. Unscaled data can cause inaccurate predictions in models that are sensitive to feature magnitude, such as distance-based methods. Some machine learning algorithms, such as decision trees, are insensitive to feature scale and do not require explicit scaling.
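
Standardization replaces each value x with z = (x - mean) / std, computed separately for every column. A minimal sketch, on a small made-up array, suggesting that this formula matches what `StandardScaler` produces; the scaler uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on very different scales
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

# Manual z-score, column by column
manual = (data - data.mean(axis=0)) / data.std(axis=0)

# Same transformation through scikit-learn
scaled = StandardScaler().fit_transform(data)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```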

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
matplotlib.style.use('ggplot')

# Fix the seed so the random sample is reproducible
np.random.seed(1)
# Three normally distributed columns with different means and standard deviations
df = pd.DataFrame({
    'Col_1': np.random.normal(0, 2, 30000),
    'Col_2': np.random.normal(5, 3, 30000),
    'Col_3': np.random.normal(-5, 5, 30000)
})

df.head()

      Col_1      Col_2      Col_3
0  3.248691   6.158165   3.702313
1 -1.223513   3.502543   0.381228
2 -1.056344   8.859463  -0.195733
3 -2.145937   3.943360  -8.099449
4  1.730815  11.728704 -10.434062

col_names = df.columns
features = df[col_names]
# Learn each column's mean and standard deviation, then standardize
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features = pd.DataFrame(features, columns=col_names)
scaled_features.head()

      Col_1     Col_2     Col_3
0  1.624908  0.376945  1.751563
1 -0.614084 -0.504704  1.083641
2 -0.530391  1.273758  0.967605
3 -1.075893 -0.358355 -0.621956
4  0.864990  2.226327 -1.091483

scaled_features.describe()

              Col_1         Col_2         Col_3
count  3.000000e+04  3.000000e+04  3.000000e+04
mean   1.231607e-17 -5.933032e-17  6.300146e-17
std    1.000017e+00  1.000017e+00  1.000017e+00
min   -4.240174e+00 -4.079611e+00 -4.349633e+00
25%   -6.757204e-01 -6.750500e-01 -6.728936e-01
50%    1.140547e-03  4.150084e-03  1.111792e-03
75%    6.657449e-01  6.767212e-01  6.671877e-01
max    4.171969e+00  3.600454e+00  4.021416e+00
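
The std row reads 1.000017 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). The two differ by a factor of sqrt(n / (n - 1)), which the short calculation below evaluates for n = 30000.

```python
import math

n = 30000  # number of rows in df

# Ratio between the sample (ddof=1) and population (ddof=0) standard deviation
ratio = math.sqrt(n / (n - 1))
print(round(ratio, 6))  # 1.000017
```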

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(6, 5))
# Left panel: density of each column on its original scale
ax1.set_title('Before Scaling')
sns.kdeplot(df['Col_1'], ax=ax1)
sns.kdeplot(df['Col_2'], ax=ax1)
sns.kdeplot(df['Col_3'], ax=ax1)
# Right panel: the same columns after standardization
ax2.set_title('After Standard Scaler')
sns.kdeplot(scaled_features['Col_1'], ax=ax2)
sns.kdeplot(scaled_features['Col_2'], ax=ax2)
sns.kdeplot(scaled_features['Col_3'], ax=ax2)
plt.show()
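
When scaling is combined with a train/test split, a common practice is to fit the scaler on the training rows only and reuse its statistics for the test rows, so that no information from the test set leaks into preprocessing. A minimal sketch under that assumption, using synthetic data; the names `X_syn`, `X_tr` and `X_te` are illustrative and do not come from the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: 1000 rows, 2 columns
rng = np.random.default_rng(1)
X_syn = rng.normal(loc=5, scale=3, size=(1000, 2))

X_tr, X_te = train_test_split(X_syn, test_size=0.2, random_state=0)

# Fit on the training rows only, then apply the same statistics to both parts
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # close to 0 by construction
print(X_te_scaled.mean(axis=0))  # near 0, but not exactly
```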
