Practical 1 52

The document discusses preprocessing steps for a machine learning model, including: 1) Loading a dataset with country, age, salary, and purchase data and handling missing values using mean imputation. 2) Encoding categorical features like country using one-hot encoding and purchase status using label encoding. 3) Splitting the dataset into training and test sets. 4) Scaling numeric features like age and salary using standardization to prevent feature dominance. The preprocessing prepares the raw data for machine learning modeling by handling issues like missing data, encoding categories, and scaling features.


20BECE30058

1) Preprocessing

Fetching the Dataset


from google.colab import drive
drive.mount('/content/drive')  # --- mount Google Drive so the CSV file is accessible

Mounted at /content/drive

import numpy as np
import pandas as pd

dataset = pd.read_csv('/content/drive/MyDrive/ML/ML_Lab/Data.csv')

Dataset

   Country   Age    Salary   Purchased
0  France    44.0   72000.0  No
1  Spain     27.0   48000.0  Yes
2  Germany   NaN    54000.0  No
3  Spain     38.0   61000.0  No
4  Germany   40.0   NaN      Yes
5  France    35.0   58000.0  Yes
6  Spain     NaN    52000.0  No
7  France    48.0   79000.0  Yes
8  Germany   50.0   NaN      No
9  France    37.0   67000.0  Yes


X = dataset.iloc[:, :-1].values  # --- take all the columns except the label column
y = dataset.iloc[:, -1].values   # --- take the values of the label column
print(X)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' nan 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 nan]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

Handling missing data
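
Before imputing, it can help to count how many values are missing per column. A quick check (not part of the original notebook, but using only the dataset loaded above):

print(dataset.isnull().sum())  # --- Age and Salary each have 2 missing values in this dataset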


from sklearn.impute import SimpleImputer

# --- create an instance of the class SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# --- missing_values=np.nan targets all the empty values in the data
# --- strategy can be 'mean', 'median', 'most_frequent' or 'constant'

# --- connect the imputer object with the data; only the columns with numerical data are fitted
imputer.fit(X[:, 1:3])

# --- transform the data (replace the missing values with the column means)
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)  # --- no missing values in X now

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 39.875 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 61375.0]
 ['France' 35.0 58000.0]
 ['Spain' 39.875 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 61375.0]
 ['France' 37.0 67000.0]]

---- Arguments ----

SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)

copy --> If True, a copy of X is created. If False, imputation is done in place whenever possible.
fill_value --> When strategy == "constant", fill_value is used to replace all occurrences of missing_values (default=None).

Encoding Categorical data

Suppose some data in the dataset is categorical and we need to convert it into numeric format.

Here the "Purchased" and "Country" columns hold categorical data.

'Country' has 3 values: France, Spain and Germany. If we simply assign the numbers 0, 1 and 2 to these values, the model may treat them as an ordering, as if some kind of priority were given to each value (0, 1 & 2).

To avoid this, we use one-hot encoding (create a separate binary column for each unique value), as shown below.
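
As a side note (not part of this practical's approach), pandas offers a one-liner that produces a similar one-hot encoding directly on the DataFrame; a minimal sketch, assuming the same dataset loaded above:

encoded = pd.get_dummies(dataset, columns=['Country'])  # --- creates Country_France, Country_Germany, Country_Spain
print(encoded.head())

The scikit-learn ColumnTransformer route below is preferred in a modelling pipeline, because the fitted encoder can be reused on new data.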

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# ----- Object of the ColumnTransformer class
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# --- 3 things in 'transformers': 1st --> a name for the transformation/encoding
# ---                             2nd --> the encoder class
# ---                             3rd --> indices of the columns to one-hot encode, here [0] for the Country column
# --- remainder='passthrough' --> keep all the other features/columns in the feature matrix
# --- without transformation   # {'drop', 'passthrough'}

print(type(X))
# ----- Connect the data with the 'ct' object; 'fit_transform' fits and transforms in one step
X = ct.fit_transform(X)  # --- update the encoded columns into the same feature matrix X
X

<class 'numpy.ndarray'>
array([[1.0, 0.0, 0.0, 44.0, 72000.0],
[0.0, 0.0, 1.0, 27.0, 48000.0],
[0.0, 1.0, 0.0, 39.875, 54000.0],
[0.0, 0.0, 1.0, 38.0, 61000.0],
[0.0, 1.0, 0.0, 40.0, 61375.0],
[1.0, 0.0, 0.0, 35.0, 58000.0],
[0.0, 0.0, 1.0, 39.875, 52000.0],
[1.0, 0.0, 0.0, 48.0, 79000.0],
[0.0, 1.0, 0.0, 50.0, 61375.0],
[1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

Encoding the dependent Variable

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)  # --- the dependent variable is passed as the parameter
print(y)

[0 1 0 0 1 1 0 1 0 1]
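
If needed, the numeric labels can be mapped back to the original strings with the fitted encoder (a quick check, not in the original notebook):

print(le.inverse_transform([0, 1]))  # --- ['No' 'Yes']: 'No' was encoded as 0 and 'Yes' as 1 (alphabetical order)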

Splitting Dataset
from sklearn.model_selection import train_test_split

# --- returns 4 matrices: 2 for training and 2 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# --- random_state = seed value for the splitting order
# --- X stands for the feature matrix and y stands for the target vector

print(X_train)

[[0.0 0.0 1.0 39.875 52000.0]
[0.0 1.0 0.0 40.0 61375.0]
[1.0 0.0 0.0 44.0 72000.0]
[0.0 0.0 1.0 38.0 61000.0]
[0.0 0.0 1.0 27.0 48000.0]
[1.0 0.0 0.0 48.0 79000.0]
[0.0 1.0 0.0 50.0 61375.0]
[1.0 0.0 0.0 35.0 58000.0]]

print(X_test)

[[0.0 1.0 0.0 39.875 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]

print(y_train)

[0 1 0 0 1 1 0 1]

print(y_test)

[0 1]

Feature Scaling

Feature scaling is done to prevent some features from dominating others purely because of their magnitudes, even when they are not more important.

2 Methods of Feature Scaling

Standardization (scales roughly between -3 and +3):

x_stand = (x - mean(x)) / standard_deviation(x)

Normalization (scales between 0 and 1):

x_norm = (x - min(x)) / (max(x) - min(x))

Normalization is recommended when the features do not follow a normal distribution or when values must lie in a fixed [0, 1] range; standardization works well in general conditions and is the usual default.
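
To make the two formulas concrete, here is a small sketch (not from the original notebook) that applies both to a toy salary column with NumPy:

import numpy as np

x = np.array([48000.0, 54000.0, 61000.0, 72000.0])
x_stand = (x - x.mean()) / x.std()            # --- standardization: mean 0, standard deviation 1
x_norm = (x - x.min()) / (x.max() - x.min())  # --- normalization: values in [0, 1]
print(x_stand)
print(x_norm)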

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])  # --- all rows, columns from index 3 up to the last (the numerical features)
# --- the fit method computes the mean and standard deviation of each feature; the transform method then applies the formula
# --- we apply the same scaler to the test set, which is why we use the 'transform' method only
X_test[:, 3:] = sc.transform(X_test[:, 3:])

print(X_train)

[[0.0 0.0 1.0 -0.05231070414657415 -1.0245464314521104]
 [0.0 1.0 0.0 -0.03411567661733096 -0.023360993551025323]
 [1.0 0.0 0.0 0.5481252043184508 1.1113158360702047]
 [0.0 0.0 1.0 -0.3252361170852219 -0.06340841106706872]
 [0.0 0.0 1.0 -1.9263985396586218 -1.4517188849565736]
 [1.0 0.0 0.0 1.1303660852542325 1.858867629703015]
 [0.0 1.0 0.0 1.4214865257221234 -0.023360993551025323]
 [1.0 0.0 0.0 -0.7619167777870582 -0.38378775119541597]]

print(X_test)

[[0.0 1.0 0.0 -0.05231070414657415 -0.8109602046998791]
 [1.0 0.0 0.0 -0.4707963373191673 0.5773502691896258]]

Note: We must apply the same scaler fitted on the training data to the test data; otherwise different values of mean and standard deviation would be computed for the test set and the results would be affected. Reusing the fitted scaler gives the same transformation on the test set.

Do we have to apply the scaling to the dummy variables / one-hot encoded columns? No: standardization produces values roughly between -3 and +3, and the values in the dummy variables already lie between 0 and 1, which falls within this range; scaling them would also destroy their 0/1 interpretation.

Apply feature scaling to numerical values only, not to dummy variables, as sketched below.
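
One way to enforce this rule programmatically (a sketch, not the notebook's approach) is to let ColumnTransformer scale only the numerical columns and pass the dummy columns through untouched. This assumes the un-scaled X_train and X_test from the splitting step:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# --- columns 3 and 4 are Age and Salary; columns 0-2 are the dummy variables
scale_ct = ColumnTransformer(transformers=[('scaler', StandardScaler(), [3, 4])], remainder='passthrough')
X_train_scaled = scale_ct.fit_transform(X_train)  # --- note: the scaled columns come first in the output
X_test_scaled = scale_ct.transform(X_test)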

