Praktikum Modul 3

The document discusses the key steps in data pre-processing: 1) handling missing data by replacing missing values with column averages, 2) encoding categorical variables as factors, 3) splitting the data into training and test sets using an 80/20 split, and 4) scaling features to standardized ranges to prevent bias during model training. These steps clean the data and prepare it for analysis.

10/8/2019 Data Pre-processing

Data Pre-processing
Siddhanta, Wiga Maulana Baihaqi
November 12, 2017

Data Pre-processing steps


Much real-world data is dirty and needs to be cleaned before it can be used in code. The process of cleaning a dataset is called data preprocessing.

Preprocessing of data includes the steps below:

1. Taking care of missing data
2. Categorical data
3. Splitting data into training and test sets
4. Feature scaling

1. Taking Care of Missing data

# import the dataset first

# the data is in the Data.csv file in the current folder / working directory
# (use setwd("current location of Data.csv") to set the working directory)
dataset <- read.csv("Data.csv")

# show the dataset
dataset

## Country Age Salary Purchased
## 1 France 44 72000 No
## 2 Spain 27 48000 Yes
## 3 Germany 30 54000 No
## 4 Spain 38 61000 No
## 5 Germany 40 NA Yes
## 6 France 35 58000 Yes
## 7 Spain NA 52000 No
## 8 France 48 79000 Yes
## 9 Germany 50 83000 No
## 10 France 37 67000 Yes

https://rstudio-pubs-static.s3.amazonaws.com/329310_0842b7c1f17e4943a7dcbc70a3a47440.html 1/5

# So missing values are present in both the Age and Salary columns

# Taking care of missing values
# by replacing each NA with the average of the non-NA entries in its column

dataset$Age <- ifelse(is.na(dataset$Age),
                      ave(dataset$Age, FUN = function(x)
                        mean(x, na.rm = TRUE)),
                      dataset$Age)

dataset$Salary <- ifelse(is.na(dataset$Salary),
                         ave(dataset$Salary, FUN = function(x)
                           mean(x, na.rm = TRUE)),
                         dataset$Salary)

How does the ave() function work here?

Read it like this: we are changing the Age column of the dataset, and if a column entry is NA, we take the average of the dataset$Age column, where FUN is a function of x that calculates the mean excluding the NA values (na.rm = TRUE);

else,
we keep whatever is already present in dataset$Age.
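To see why ave() pairs naturally with ifelse() here, a quick sketch (not from the original text): with no grouping variable, ave() applies FUN to the whole column and repeats the single result for every row, so its output lines up element-wise with the column.

```r
# ave() with no grouping factor: FUN is applied to the whole vector
# and the single result is recycled to the original length
ave(c(1, NA, 3), FUN = function(x) mean(x, na.rm = TRUE))
```

```
## [1] 2 2 2
```

Because the result has the same length as the column, ifelse() can pick the average for the NA positions and the original value everywhere else.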

mean() :

#defining x = 1 2 3
x <- 1:3
#introducing missing value
x[1] <- NA
# mean = NA
mean(x)

## [1] NA


# mean excluding the NA value
mean(x, na.rm = TRUE)

## [1] 2.5

So finally the dataset looks like this:

dataset

## Country Age Salary Purchased
## 1 France 44.00000 72000.00 No
## 2 Spain 27.00000 48000.00 Yes
## 3 Germany 30.00000 54000.00 No
## 4 Spain 38.00000 61000.00 No
## 5 Germany 40.00000 63777.78 Yes
## 6 France 35.00000 58000.00 Yes
## 7 Spain 38.77778 52000.00 No
## 8 France 48.00000 79000.00 Yes
## 9 Germany 50.00000 83000.00 No
## 10 France 37.00000 67000.00 Yes

Now the missing values are replaced by the average of the respective columns!

2. Categorical data

Categorical data is non-numeric data that belongs to a specific set of categories, like the Country column in the dataset.

By default, the read.csv() function in R makes all string variables categorical variables (factors), but suppose there is a name column in the dataset; in that case we don't want it as a categorical variable. Below is the code to make specific variables factor variables.


# Encoding categorical data
dataset$Country = factor(dataset$Country,
levels = c('France', 'Spain', 'Germany'),
labels = c(1, 2, 3))

dataset$Purchased = factor(dataset$Purchased,
levels = c('No', 'Yes'),
labels = c(0, 1))
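As a quick check (not part of the original tutorial), we can confirm the encoding took effect:

```r
# the Country column is now a factor with three levels
levels(dataset$Country)
```

```
## [1] "1" "2" "3"
```

```r
# and Purchased is a two-level factor
levels(dataset$Purchased)
```

```
## [1] "0" "1"
```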

3. Splitting into training and test datasets: when a dataset is presented to us for machine learning, we need some of the data for training and some to test the model after the learning stage is done.

So we need to split the dataset into training and test sets; using the code below we can do so.

For this we need to install caTools:

# install.packages("caTools")  # run once if the package is not installed

library(caTools)  # load caTools
set.seed(123)  # fixes the random seed so the split is reproducible; you can omit this in real use
split = sample.split(dataset$Purchased,SplitRatio = 0.8)
training_set = subset(dataset,split == TRUE)
test_set = subset(dataset, split == FALSE)

SplitRatio is the fraction of the data assigned to the training set; it is usually set to 0.8, giving an 80:20 split between training and test.

The sample.split() method takes the dependent-variable column and produces a logical vector with TRUE and FALSE in random locations, in the given split ratio.

The subset() method takes the dataset and returns the rows for which the condition holds.
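To make this concrete, a small inspection sketch (an illustration, not from the original text): the split vector can be examined directly before subsetting.

```r
# split is a logical vector with one entry per row of the dataset;
# TRUE marks rows that go to the training set
class(split)
```

```
## [1] "logical"
```

```r
mean(split)   # proportion of TRUEs, close to the 0.8 SplitRatio
```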

4. Feature scaling:

Feature scaling is needed when different features have very different ranges, as the Age and Salary columns do here.

When we train a model, which is basically trying to fit some line (in linear regression), the error is minimized; to minimize the error, a Euclidean distance is minimized by some algorithm (such as gradient descent).

If no feature scaling is applied, training will be heavily biased toward the feature with large values, because that feature dominates the Euclidean distance.
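A small numeric sketch (an illustration, not part of the original text) makes the bias visible using two rows from the dataset above:

```r
# two observations as (Age, Salary) vectors
x1 <- c(44, 72000)
x2 <- c(27, 48000)

# unscaled Euclidean distance: Salary completely dominates
sqrt(sum((x1 - x2)^2))   # ~24000; the Age gap of 17 years is invisible

# after standardizing each feature, both contribute comparably
m <- rbind(x1, x2)
s <- scale(m)            # center and scale each column
sqrt(sum((s[1, ] - s[2, ])^2))
```

With the raw values, the Salary difference of 24000 swamps the Age difference of 17; after scaling, both features carry similar weight in the distance.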


Hence, we need feature scaling, which is done in the steps below:

#feature scaling
training_set[,2:3] = scale(training_set[,2:3])
test_set[,2:3] = scale(test_set[,2:3])

Columns 2:3 are Age and Salary. Now the dataset (both training and test) looks like this:

training_set

## Country Age Salary Purchased
## 1 1 0.90101716 0.9392746 0
## 2 2 -1.58847494 -1.3371160 1
## 3 3 -1.14915281 -0.7680183 0
## 4 2 0.02237289 -0.1040711 0
## 5 3 0.31525431 0.1594000 1
## 7 2 0.13627122 -0.9577176 0
## 8 1 1.48678000 1.6032218 1
## 10 1 -0.12406783 0.4650265 1

test_set

## Country Age Salary Purchased
## 6 1 -0.7071068 -0.7071068 1
## 9 3 0.7071068 0.7071068 0

Note: most modeling libraries in R take care of feature scaling internally, so we might not always need to do it ourselves.
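One caveat worth noting (a common alternative practice, not from the original text): the code above scales the test set with its own mean and standard deviation. A frequent convention instead reuses the training set's statistics for the test set, so the model sees test data on exactly the scale it was trained on:

```r
# compute centering/scaling parameters on the training set only
train_center <- colMeans(training_set[, 2:3])
train_sd     <- apply(training_set[, 2:3], 2, sd)

# apply the SAME parameters to both sets
training_set[, 2:3] <- scale(training_set[, 2:3],
                             center = train_center, scale = train_sd)
test_set[, 2:3]     <- scale(test_set[, 2:3],
                             center = train_center, scale = train_sd)
```

This matters most when the test set is small (as here, with only two rows), since its own mean and standard deviation are then very unstable.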

Now the data preprocessing part is done!
