Praktikum Modul 3

The document discusses the key steps in data pre-processing: 1) handling missing data by replacing missing values with column averages, 2) encoding categorical variables as factors, 3) splitting the data into training and test sets using an 80/20 split, and 4) scaling features to standardized ranges to prevent bias during model training. These steps clean the data and prepare it for analysis.

10/8/2019 Data Pre-processing

Data Pre-processing
Siddhanta, Wiga Maulana Baihaqi
November 12, 2017

Data Pre-processing steps


Much real-world data is dirty and needs to be cleaned before it can be used in code. The process of cleaning a dataset is called data preprocessing.

Preprocessing of data includes the steps below:

1. Taking care of missing data
2. Categorical data
3. Splitting data into training and test sets
4. Feature scaling

1. Taking Care of Missing data

# import the dataset first

# the data is in the Data.csv file in the current folder / working directory
# (use setwd("current location of Data.csv") to set the working directory)
dataset <- read.csv("Data.csv")

# show the dataset
dataset

## Country Age Salary Purchased
## 1 France 44 72000 No
## 2 Spain 27 48000 Yes
## 3 Germany 30 54000 No
## 4 Spain 38 61000 No
## 5 Germany 40 NA Yes
## 6 France 35 58000 Yes
## 7 Spain NA 52000 No
## 8 France 48 79000 Yes
## 9 Germany 50 83000 No
## 10 France 37 67000 Yes

https://rstudio-pubs-static.s3.amazonaws.com/329310_0842b7c1f17e4943a7dcbc70a3a47440.html 1/5

# So missing values are present in both the Age and Salary columns

# Taking care of missing values
# by replacing each NA with the average of the non-NA entries in its column

dataset$Age <- ifelse(is.na(dataset$Age),
                      ave(dataset$Age, FUN = function(x)
                        mean(x, na.rm = TRUE)),
                      dataset$Age)

dataset$Salary <- ifelse(is.na(dataset$Salary),
                         ave(dataset$Salary, FUN = function(x)
                           mean(x, na.rm = TRUE)),
                         dataset$Salary)

How does the ave() function work here?

Read it like this: we are changing the Age column of the dataset, and if a column entry is NA, we take the average of the dataset$Age column, where FUN is a function of x that calculates the mean excluding the NA values (na.rm = TRUE);

else,
we keep whatever is already present in dataset$Age.
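To see why ave() pairs naturally with ifelse() here, a quick sketch (not from the original text): with no grouping variable, ave() applies FUN to the whole column and repeats the single result for every row, so its output lines up element-wise with the column.

```r
# ave() with no grouping factor: FUN is applied to the whole vector
# and the single result is recycled to the original length
ave(c(1, NA, 3), FUN = function(x) mean(x, na.rm = TRUE))
```

```
## [1] 2 2 2
```

Because the result has the same length as the column, ifelse() can pick the average for the NA positions and the original value everywhere else.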

mean() :

#defining x = 1 2 3
x <- 1:3
#introducing missing value
x[1] <- NA
# mean = NA
mean(x)

## [1] NA


# mean excluding the NA value
mean(x, na.rm = TRUE)

## [1] 2.5

So finally the dataset looks like this:

dataset

## Country Age Salary Purchased
## 1 France 44.00000 72000.00 No
## 2 Spain 27.00000 48000.00 Yes
## 3 Germany 30.00000 54000.00 No
## 4 Spain 38.00000 61000.00 No
## 5 Germany 40.00000 63777.78 Yes
## 6 France 35.00000 58000.00 Yes
## 7 Spain 38.77778 52000.00 No
## 8 France 48.00000 79000.00 Yes
## 9 Germany 50.00000 83000.00 No
## 10 France 37.00000 67000.00 Yes

Now the missing values are replaced by the average of the respective columns!

2. Categorical data

Categorical data is non-numeric data that belongs to a specific set of categories, like the Country column in the dataset.

By default, the read.csv() function in R makes all string variables categorical variables (factors), but suppose there is a name column in the dataset; in that case we don't want it as a categorical variable. Below is the code to make specific variables factor variables.


# Encoding categorical data
dataset$Country = factor(dataset$Country,
levels = c('France', 'Spain', 'Germany'),
labels = c(1, 2, 3))

dataset$Purchased = factor(dataset$Purchased,
levels = c('No', 'Yes'),
labels = c(0, 1))
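As a quick check (not part of the original tutorial), we can confirm the encoding took effect:

```r
# the Country column is now a factor with three levels
levels(dataset$Country)
```

```
## [1] "1" "2" "3"
```

```r
# and Purchased is a two-level factor
levels(dataset$Purchased)
```

```
## [1] "0" "1"
```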

3. Splitting into training and test datasets: when a dataset is presented to us for machine learning, we need some of the data for training and some to test the model after the learning stage is done.

So we need to split the dataset into training and test sets; using the code below we can do so.

For this we need to install caTools:

# install.packages("caTools")  # run once if the package is not installed

library(caTools)  # load caTools
set.seed(123)  # fixes the random seed so the split is reproducible; you can omit this in real use
split = sample.split(dataset$Purchased,SplitRatio = 0.8)
training_set = subset(dataset,split == TRUE)
test_set = subset(dataset, split == FALSE)

SplitRatio is the fraction of the data assigned to the training set; it is usually set to 0.8, giving an 80:20 split between training and test.

The sample.split() method takes the dependent-variable column and produces a logical vector with TRUE and FALSE in random locations, in the given split ratio.

The subset() method takes the dataset and returns the rows for which the condition holds.
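To make this concrete, a small inspection sketch (an illustration, not from the original text): the split vector can be examined directly before subsetting.

```r
# split is a logical vector with one entry per row of the dataset;
# TRUE marks rows that go to the training set
class(split)
```

```
## [1] "logical"
```

```r
mean(split)   # proportion of TRUEs, close to the 0.8 SplitRatio
```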

4. Feature scaling:

Feature scaling is needed when different features have very different ranges, as the Age and Salary columns do here.

When we train a model, which is basically trying to fit some line (in linear regression), the error is minimized; to minimize the error, a Euclidean distance is minimized by some algorithm (such as gradient descent).

If no feature scaling is applied, training will be heavily biased toward the feature with large values, because that feature dominates the Euclidean distance.
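A small numeric sketch (an illustration, not part of the original text) makes the bias visible using two rows from the dataset above:

```r
# two observations as (Age, Salary) vectors
x1 <- c(44, 72000)
x2 <- c(27, 48000)

# unscaled Euclidean distance: Salary completely dominates
sqrt(sum((x1 - x2)^2))   # ~24000; the Age gap of 17 years is invisible

# after standardizing each feature, both contribute comparably
m <- rbind(x1, x2)
s <- scale(m)            # center and scale each column
sqrt(sum((s[1, ] - s[2, ])^2))
```

With the raw values, the Salary difference of 24000 swamps the Age difference of 17; after scaling, both features carry similar weight in the distance.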


Hence, we need feature scaling, which is done in the steps below:

#feature scaling
training_set[,2:3] = scale(training_set[,2:3])
test_set[,2:3] = scale(test_set[,2:3])

Columns 2:3 are Age and Salary. Now the dataset (both training and test) looks like this:

training_set

## Country Age Salary Purchased
## 1 1 0.90101716 0.9392746 0
## 2 2 -1.58847494 -1.3371160 1
## 3 3 -1.14915281 -0.7680183 0
## 4 2 0.02237289 -0.1040711 0
## 5 3 0.31525431 0.1594000 1
## 7 2 0.13627122 -0.9577176 0
## 8 1 1.48678000 1.6032218 1
## 10 1 -0.12406783 0.4650265 1

test_set

## Country Age Salary Purchased
## 6 1 -0.7071068 -0.7071068 1
## 9 3 0.7071068 0.7071068 0

Note: most modeling libraries in R take care of feature scaling internally, so we might not always need to do it ourselves.
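One caveat worth noting (a common alternative practice, not from the original text): the code above scales the test set with its own mean and standard deviation. A frequent convention instead reuses the training set's statistics for the test set, so the model sees test data on exactly the scale it was trained on:

```r
# compute centering/scaling parameters on the training set only
train_center <- colMeans(training_set[, 2:3])
train_sd     <- apply(training_set[, 2:3], 2, sd)

# apply the SAME parameters to both sets
training_set[, 2:3] <- scale(training_set[, 2:3],
                             center = train_center, scale = train_sd)
test_set[, 2:3]     <- scale(test_set[, 2:3],
                             center = train_center, scale = train_sd)
```

This matters most when the test set is small (as here, with only two rows), since its own mean and standard deviation are then very unstable.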

Now the data preprocessing part is done!
