Week 4

Uploaded by

bamek59014
Copyright
© All Rights Reserved

DATA PREPROCESSING – CATEGORICAL DATA

For a given set of training data examples stored in a .CSV file, demonstrate data preprocessing in machine learning with the following steps:
A) Getting the dataset.
B) Importing libraries.
C) Importing datasets.
D) Finding Missing Data.
E) Encoding Categorical Data.
F) Splitting dataset into training and test set.
G) Feature scaling.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('Dataset.csv')
#Extracting Independent Variable
x= data_set.iloc[:, :-1].values
#Extracting Dependent variable
y= data_set.iloc[:, 3].values
#handling missing data (replacing missing values with the column mean)
#Note: Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=nm.nan, strategy='mean')
#Fitting imputer object to the numeric columns of the independent variables x
imputer= imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])
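#for Country variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
#Note: OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22;
#ColumnTransformer one-hot encodes column 0 and passes the other columns through
column_transformer= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= nm.array(column_transformer.fit_transform(x), dtype=float)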
#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
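
Step D (finding missing data) can be checked explicitly before imputing. The following is a minimal sketch, not part of the original script: the small DataFrame is a hypothetical stand-in for Dataset.csv, and its column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for Dataset.csv (values invented for illustration)
df = pd.DataFrame({
    "Country": ["India", "France", "Germany", "France"],
    "Age": [38.0, np.nan, 30.0, 48.0],
    "Salary": [68000.0, 45000.0, np.nan, 65000.0],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# Step D: count the missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)

# Replace each missing numeric value with the mean of its column
numeric_cols = ["Age", "Salary"]
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# After imputation no missing values should remain
remaining_missing = int(df.isnull().sum().sum())
print(remaining_missing)
```

Counting missing values first shows which columns actually need imputation, so the mean is only computed for the numeric columns that contain gaps.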

Encoding Categorical Data

Categorical data is data that takes its values from a fixed set of categories. Our
dataset has two categorical variables, Country and Purchased. A machine learning
model works entirely on mathematics and numbers, so a categorical variable in the
dataset can cause trouble while building the model. It is therefore necessary to
encode these categorical variables as numbers.
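
As a small illustration of the two encodings, the sketch below label-encodes a Purchased-style column and one-hot encodes a Country-style column. The sample values are hypothetical and are not taken from Dataset.csv.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample values (not taken from Dataset.csv)
country = np.array(["France", "Germany", "India", "France"])
purchased = np.array(["No", "Yes", "No", "Yes"])

# Label encoding: each category becomes an integer (classes are sorted alphabetically)
label_encoder = LabelEncoder()
purchased_encoded = label_encoder.fit_transform(purchased)
print(label_encoder.classes_)   # ['No' 'Yes']
print(purchased_encoded)        # [0 1 0 1]

# One-hot encoding: one 0/1 column per category, so no order is implied between countries
onehot_encoder = OneHotEncoder()
country_onehot = onehot_encoder.fit_transform(country.reshape(-1, 1)).toarray()
print(onehot_encoder.categories_[0])  # ['France' 'Germany' 'India']
print(country_onehot)
```

Label encoding suits a two-valued target such as Purchased. For a feature such as Country, dummy (one-hot) columns avoid suggesting that one country is numerically larger than another.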

Feature Scaling

Feature scaling is the final step of data preprocessing in machine learning. It is a
technique for bringing the independent variables of the dataset onto a common scale.
The StandardScaler used above rescales each variable to zero mean and unit variance,
so that no variable dominates the others simply because its values are larger.
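
To make the effect of standardization concrete, the sketch below compares StandardScaler with the formula z = (x - mean) / std computed by hand. The Age and Salary values are hypothetical, chosen only to put the two columns on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age and Salary columns on very different scales
x_train = np.array([[30.0, 40000.0], [40.0, 60000.0], [50.0, 80000.0]])
x_test = np.array([[40.0, 70000.0]])

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learns mean and std from the training data only
x_test_scaled = scaler.transform(x_test)        # reuses the training mean and std

# Same result computed by hand: z = (x - mean) / std, per column
manual = (x_train - x_train.mean(axis=0)) / x_train.std(axis=0)
print(np.allclose(x_train_scaled, manual))

# Each scaled training column has mean 0 and standard deviation 1
print(x_train_scaled.mean(axis=0), x_train_scaled.std(axis=0))
```

The scaler is fitted on the training set and only applied to the test set, so no information from the test data leaks into the preprocessing.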
