Practicum Module 3
Data Pre-processing
Siddhanta, Wiga Maulana Baihaqi
November 12, 2017
Data preprocessing includes the following steps:
1. Taking care of missing data
2. Categorical data
3. Splitting data into training and test data sets
4. Feature scaling
https://rstudio-pubs-static.s3.amazonaws.com/329310_0842b7c1f17e4943a7dcbc70a3a47440.html 1/5
10/8/2019 Data Pre-processing
Read it like this: we are updating the Age column of dataset. If an entry in the column is NA, take the average of the dataset$Age column, where FUN is a function of x that calculates the mean while excluding the NA values (na.rm = TRUE).
Otherwise, keep whatever is already present in dataset$Age.
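The code that this description walks through is not reproduced above. A minimal sketch consistent with the description, assuming a small made-up data frame named dataset with an Age column (the values are illustrative only), might look like this:

```r
# Hypothetical toy data; the values are made up for illustration.
dataset <- data.frame(Age = c(44, 27, NA, 38))

# Replace each NA in Age with the mean of the non-missing ages;
# all other entries are kept unchanged.
dataset$Age <- ifelse(is.na(dataset$Age),
                      ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      dataset$Age)

dataset$Age
```

With no grouping variable, ave() applies FUN to the whole column and returns a vector of the same length, which is what lets ifelse() pick the replacement value position by position.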
mean() :
#defining x = 1 2 3
x <- 1:3
#introducing missing value
x[1] <- NA
# mean = NA
mean(x)
## [1] NA
# mean ignoring the NA
mean(x, na.rm = TRUE)
## [1] 2.5
dataset
Now the missing values are replaced by the average of the respective columns!
2. Categorical data
Categorical data is non-numeric data that belongs to a specific set of categories, like the Country column in the dataset.
In R versions before 4.0.0, read.csv() converts all string variables to categorical variables (factors) by default. But suppose the dataset has a name column: we do not want that column to be categorical. Below is the code to make specific variables factor variables.
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))
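As a small illustration of what this encoding produces, here is a sketch on a made-up character vector (the name purchased and its values are assumptions, not part of the original dataset):

```r
# Hypothetical toy column; the values are made up for illustration.
purchased <- c("No", "Yes", "No", "Yes")

# Encode the strings as a factor whose levels are labelled 0 and 1.
encoded <- factor(purchased, levels = c("No", "Yes"), labels = c(0, 1))

# The labels are stored as the strings "0" and "1".
levels(encoded)

# The result is still a factor; convert via character to get numbers.
as.numeric(as.character(encoded))
```

Note that the result is a factor, not a numeric vector, so going through as.character() is needed if actual numbers are wanted later.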
3. Splitting into training and test datasets
When we are given a dataset for machine learning, we need part of the data to train the model and the rest to test the model once training is done.
So we need to split the dataset into a training set and a test set, which is done with sample.split() (from the caTools package) and subset().
SplitRatio is the fraction of rows assigned to the training set; it is usually set to 0.8, giving an 80:20 split between training and test.
sample.split() takes a column and returns a logical vector with TRUE and FALSE at random positions, in the proportion given by SplitRatio.
subset() takes the dataset and returns the rows that satisfy the given condition.
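The split described here can be sketched as follows. This is an illustration under stated assumptions, not necessarily the tutorial's exact code: the toy data frame, the seed value and the variable name is_train are hypothetical, and the caTools package must be installed.

```r
library(caTools)  # provides sample.split()

# Hypothetical toy data; the values are made up for illustration.
dataset <- data.frame(
  Age = c(44, 27, 30, 38, 40, 35, 52, 48, 50, 37),
  Purchased = factor(rep(c("No", "Yes"), times = 5))
)

set.seed(123)  # arbitrary seed so the random split is reproducible

# TRUE marks rows for the training set, FALSE rows for the test set.
is_train <- sample.split(dataset$Purchased, SplitRatio = 0.8)

training_set <- subset(dataset, is_train == TRUE)
test_set <- subset(dataset, is_train == FALSE)
```

Every row ends up in exactly one of the two sets, so the row counts of training_set and test_set add up to the row count of dataset.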
4. Feature Scaling :
Feature scaling is needed when different features have very different ranges, for example the Age and Salary columns.
When we train a model, which basically tries to fit some line (in linear regression), the training algorithm (for example gradient descent) minimizes an error that is based on the Euclidean distance.
If no feature scaling is applied, training is heavily biased towards the feature with large values, because its contribution to the Euclidean distance is much larger.
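To see why the larger-valued feature dominates, consider a small sketch with two made-up people (the vectors a and b and their values are assumptions for illustration only). Their ages differ a lot, their salaries only modestly, yet the unscaled distance is almost entirely the salary difference:

```r
# Hypothetical values for two people; the numbers are made up.
a <- c(Age = 25, Salary = 50000)
b <- c(Age = 50, Salary = 52000)

# Euclidean distance using both features, unscaled.
d_both <- sqrt(sum((a - b)^2))

# Distance using Salary alone.
d_salary <- abs(a[["Salary"]] - b[["Salary"]])

# The two numbers are nearly identical: Age barely contributes.
c(both = d_both, salary_only = d_salary)
```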
#feature scaling
training_set[,2:3] = scale(training_set[,2:3])
test_set[,2:3] = scale(test_set[,2:3])
Columns 2:3 are Age and Salary. Now the dataset (both training and test) looks like:
training_set
test_set
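The code above standardizes the test set with its own mean and standard deviation. A common alternative, shown here as a sketch rather than as the method used in this tutorial, reuses the training set's mean and standard deviation so that both sets end up on the same scale. The toy data frames and the names scaled_train, scaled_test, centers and sds are assumptions for illustration.

```r
# Hypothetical toy data; the values are made up for illustration.
training_set <- data.frame(Age = c(27, 35, 44, 50),
                           Salary = c(48000, 58000, 72000, 83000))
test_set <- data.frame(Age = c(30, 48),
                       Salary = c(54000, 79000))

# Standardize the training set; scale() records the column means and
# standard deviations it used as attributes of the result.
scaled_train <- scale(training_set)
centers <- attr(scaled_train, "scaled:center")
sds <- attr(scaled_train, "scaled:scale")

# Apply the training-set parameters to the test set.
scaled_test <- scale(test_set, center = centers, scale = sds)

scaled_test
```

With this approach a given Age or Salary value maps to the same scaled value whether it appears in the training set or the test set.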
Note : Many R libraries take care of feature scaling internally, so we might not always need to include this step.