Data Cleaning Using R

The document outlines the process of data cleaning using R, emphasizing the importance of transforming raw data into a clean and consistent format for analysis. It details various objectives of data cleaning, such as eliminating errors, ensuring consistency, and removing duplicates, along with practical examples of R code for different cleaning tasks. The document also compares messy data with clean data and provides methods for handling missing values, modifying column names, and detecting outliers.

Uploaded by

kkabeerbe23


DATA CLEANING USING R

DATA CLEANING: DEFINITION
• The process of transforming raw data into consistent data that can be easily analyzed.
• It is aimed at filtering the content of statistical statements.
• It improves your data quality and overall productivity.
OBJECTIVES OF DATA CLEANING
The following are the various purposes of data cleaning in R:
• Eliminate errors
• Eliminate redundancy
• Increase data reliability
• Deliver accuracy
• Ensure consistency
• Assure completeness
• Standardize your approach
CLEAN DATA VS MESSY DATA

Messy Data
• Special characters (e.g. commas in numeric values)
• Numeric values stored as text/character data types
• Duplicate rows
• Misspellings
• Inaccuracies
• White space
• Missing data
• Zeros instead of null values

Clean Data
• Free of duplicate rows/values
• Error-free (free of misspellings)
• Relevant (free of special characters)
• The appropriate data type for analysis
• Free of outliers (or only contains outliers that have been identified/understood)
• Neat and clean data structure
DATA CLEANING EXAMPLE
• Using an inbuilt dataset (the “airquality” dataset)

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Note the NA values inside the Ozone and Solar.R columns.
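
To quantify the NA values rather than spot them by eye, one option is to count them per column. The following is a minimal sketch using base R only; the names `na_counts` and `incomplete_share` are chosen here for illustration and do not come from the slides.

```r
# Count missing values in each column of the built-in airquality dataset
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Share of rows that contain at least one NA
incomplete_share <- mean(!complete.cases(airquality))
print(incomplete_share)
```

A per-column count like this helps decide whether imputing or dropping rows is the more reasonable next step.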

SUMMARY FUNCTION
> summary(airquality)

We can get a clear visual of the irregular data using a boxplot.
> boxplot(airquality)
REMOVING IRREGULAR DATA WITH THE IS.NA() METHOD

new_df = airquality
# Replace missing Ozone values with the median of the observed values
new_df$Ozone = ifelse(is.na(new_df$Ozone),
                      median(new_df$Ozone, na.rm = TRUE),
                      new_df$Ozone)
PERFORMING THE SAME OPERATION ON ANOTHER COLUMN

# Replace missing Solar.R values with the median of the observed values
new_df$Solar.R = ifelse(is.na(new_df$Solar.R),
                        median(new_df$Solar.R, na.rm = TRUE),
                        new_df$Solar.R)
summary(new_df)
head(new_df)
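
The two imputation steps above repeat the same pattern, so they could be wrapped in a small function. This is only a sketch: `impute_median` is a hypothetical helper name, and applying it to every numeric column with missing values is an assumption, not something the slides prescribe.

```r
# Hypothetical helper: replace NA values in a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

new_df <- airquality
# Apply the helper to every numeric column that contains at least one NA
has_na <- sapply(new_df, function(col) is.numeric(col) && anyNA(col))
new_df[has_na] <- lapply(new_df[has_na], impute_median)

colSums(is.na(new_df))  # all counts should now be zero
```

Median imputation keeps the number of rows unchanged but shrinks the spread of the imputed columns, so it is worth comparing `summary()` before and after.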
AN ILLUSTRATION WITH EXAMPLE
1) CREATION OF EXAMPLE DATA
2) EXAMPLE 1: MODIFY COLUMN NAMES
3) EXAMPLE 2: FORMAT MISSING VALUES
4) EXAMPLE 3: REMOVE EMPTY ROWS & COLUMNS
5) EXAMPLE 4: REMOVE ROWS WITH MISSING VALUES
6) EXAMPLE 5: REMOVE DUPLICATES
7) EXAMPLE 6: MODIFY CLASSES OF COLUMNS
8) EXAMPLE 7: DETECT & REMOVE OUTLIERS
9) EXAMPLE 8: REMOVE SPACES IN CHARACTER STRINGS
10) EXAMPLE 9: COMBINE CATEGORIES
CREATION OF EXAMPLE DATA

data <- data.frame(x1 = c(1:4, 99999, 1, NA, 1, 1, NA),   # Create example data frame
                   x1 = c(1:5, 1, "NA", 1, 1, "NA"),
                   x1 = c(letters[c(1:3)], "x x", "x", " y y y", "x", "a", "a", NA),
                   x4 = "",
                   x5 = NA)
data                                                       # Print example data frame
EXAMPLE 1: MODIFY COLUMN NAMES
• Let’s first have a closer look at the names of our data frame columns:

colnames(data)   # Print column names
# [1] "x1"   "x1.1" "x1.2" "x4"   "x5"

• Let’s assume that we want to change these column names to a consecutive range with the prefix “col”. Then, we can apply the colnames, paste0, and ncol functions as shown below.

colnames(data) <- paste0("col", 1:ncol(data))   # Modify all column names
data                                            # Print updated data frame
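
The renaming step can be tried on a small stand-alone data frame. The sketch below uses a hypothetical toy data frame named `demo`; it swaps `1:ncol()` for `seq_len(ncol())`, which is a stylistic choice made here so that a data frame with zero columns does not produce the sequence `1:0`.

```r
# Toy data frame (hypothetical) with a duplicated column name
demo <- data.frame(x1 = 1:3, x1 = c("a", "b", "c"), x4 = "")
colnames(demo)   # data.frame() has already made the names unique

# Rename all columns to a consecutive range with the prefix "col"
colnames(demo) <- paste0("col", seq_len(ncol(demo)))
colnames(demo)
```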

EXAMPLE 2: FORMAT MISSING VALUES
• In the R programming language, missing values are usually represented by NA. For that reason, it is useful to convert all missing values to this NA format.
• Some missing values are represented by blank character strings:

data[data == ""]   # Print blank data cells
# [1] NA NA NA "" "" "" "" "" "" "" "" "" "" NA NA NA NA NA NA NA NA NA NA

• To assign NA values to those blank cells, we can use the following syntax:

data[data == ""] <- NA   # Replace blanks by NA

• Next, let’s have a closer look at the column col2:

data$col2   # Print column
# [1] "1"  "2"  "3"  "4"  "5"  "1"  "NA" "1"  "1"  "NA"
The NA values in this column are shown between quotes (i.e. “NA”). This indicates that those NA values are formatted as character strings instead of real NA values.

We can change that using the following R code:

data$col2[data$col2 == "NA"] <- NA   # Replace character "NA"
data                                 # Print updated data frame
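
Both replacements (blank strings and the text "NA") can be handled in one pass. The following is a sketch under assumptions: `to_na` and its `placeholders` argument are hypothetical names introduced here, and the toy data frame `demo` is not the example data from the slides.

```r
# Toy data frame (hypothetical) mixing real NA, blank strings and the text "NA"
demo <- data.frame(id    = c("1", "NA", "3"),
                   label = c("a", "", NA),
                   stringsAsFactors = FALSE)

# Hypothetical helper: turn chosen placeholder strings into real NA values
to_na <- function(df, placeholders = c("", "NA")) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) col[col %in% placeholders] <- NA
    col
  })
  df
}

demo <- to_na(demo)
is.na(demo)   # logical matrix marking the cells that are now missing
```

Restricting the replacement to character columns leaves numeric and logical columns untouched.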
EXAMPLE 3: REMOVE EMPTY ROWS & COLUMNS
Use the rowSums, is.na, and ncol functions to remove rows that contain only NA values:

data <- data[rowSums(is.na(data)) != ncol(data), ]    # Drop empty rows
data                                                  # Print updated data frame

We can also exclude columns that contain only NA values:

data <- data[ , colSums(is.na(data)) != nrow(data)]   # Drop empty columns
data                                                  # Print updated data frame
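
One detail worth illustrating: if only a single column survives the column filter, base R subsetting returns a plain vector unless `drop = FALSE` is given. The sketch below uses a hypothetical toy data frame `demo`; adding `drop = FALSE` is a suggestion made here, not part of the original slides.

```r
# Toy data frame (hypothetical): row 2 is entirely NA, and so is column b
demo <- data.frame(a = c(1, NA, 3), b = NA)

demo <- demo[rowSums(is.na(demo)) != ncol(demo), , drop = FALSE]  # drop empty rows
demo <- demo[, colSums(is.na(demo)) != nrow(demo), drop = FALSE]  # drop empty columns

str(demo)  # still a data frame, even though only one column remains
```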
EXAMPLE 4: REMOVE ROWS WITH MISSING VALUES
In case you have decided to remove all rows with one or more NA values, you may use the na.omit function as shown below:

data <- na.omit(data)   # Delete rows with missing values
data                    # Print updated data frame
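
Before deleting rows it can be useful to know how many will be lost. The sketch below, on a hypothetical toy data frame `demo`, counts the incomplete rows with `complete.cases` and then shows that `na.omit` records the dropped rows in its "na.action" attribute.

```r
# Toy data frame (hypothetical) with one incomplete row
demo <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"))

n_before  <- nrow(demo)
kept      <- demo[complete.cases(demo), ]   # same rows that na.omit() keeps
n_removed <- n_before - nrow(kept)
print(n_removed)

# na.omit() also records which rows it dropped in the "na.action" attribute
cleaned <- na.omit(demo)
attr(cleaned, "na.action")
```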
EXAMPLE 5: REMOVE DUPLICATES
Apply the unique function to our data frame as demonstrated in the following R snippet:

data <- unique(data)   # Exclude duplicates
data                   # Print updated data frame
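
To inspect duplicated rows before discarding them, the base R function `duplicated` can be used alongside `unique`. This is a minimal sketch on a hypothetical toy data frame `demo`.

```r
# Toy data frame (hypothetical) in which row 2 repeats row 1
demo <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))

dup <- duplicated(demo)    # TRUE for rows that repeat an earlier row
print(demo[dup, ])         # inspect the duplicates before dropping them
deduped <- demo[!dup, ]    # keeps the same rows as unique(demo)
```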
EXAMPLE 6: MODIFY CLASSES OF COLUMNS
• The class of the columns of a data frame is another critical topic when it comes to data cleaning.
• This example explains how to format each column to the most appropriate data type automatically.
• Let’s first check the current classes of our data frame columns:

sapply(data, class)   # Print classes of all columns
#        col1        col2        col3
#   "numeric" "character" "character"

• We can use the type.convert function to change the column classes whenever it is appropriate:

data <- type.convert(data, as.is = TRUE)
data                  # Print updated data frame

• If we print the data types of our columns once again, we can see that the first two columns have been changed to the integer class:

sapply(data, class)   # Print classes of updated columns
#        col1        col2        col3
#   "integer" "integer" "character"
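
The effect of `type.convert` on different kinds of text columns can be seen on a small stand-alone example. The data frame `demo` below is hypothetical; with `as.is = TRUE`, columns that cannot be read as logical or numeric stay character rather than becoming factors.

```r
# Toy data frame (hypothetical) where every column is stored as text
demo <- data.frame(a = c("1", "2", "3"),          # whole numbers
                   b = c("1.5", "2", "3"),        # decimal numbers
                   c = c("TRUE", "FALSE", "TRUE"),# logical values
                   d = c("1", "two", "3"),        # mixed content
                   stringsAsFactors = FALSE)

demo <- type.convert(demo, as.is = TRUE)
sapply(demo, class)
```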
EXAMPLE 7: DETECT & REMOVE OUTLIERS
One method to detect outliers is provided by the boxplot.stats function:

data$col1[data$col1 %in% boxplot.stats(data$col1)$out]          # Identify outliers in column
# [1] 99999

• The previous output has returned one outlier (i.e. the value 99999). This value is obviously much higher than the other values in this column.
• Apply the R code below to remove the outlier:

data <- data[! data$col1 %in% boxplot.stats(data$col1)$out, ]   # Remove rows with outliers
data                                                            # Print updated data frame
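
The rule behind `boxplot.stats` can be written out by hand: with the default `coef = 1.5`, a value is flagged when it lies more than 1.5 times the distance between the hinges below the lower hinge or above the upper hinge. The sketch below uses a hypothetical toy vector `x`, not the data frame from the slides.

```r
x <- c(1, 2, 3, 4, 99999, 1, 1, 1)   # toy vector (hypothetical)

stats  <- boxplot.stats(x)                  # default coef = 1.5
spread <- diff(stats$stats[c(2, 4)])        # distance between lower and upper hinge
lower  <- stats$stats[2] - 1.5 * spread
upper  <- stats$stats[4] + 1.5 * spread

is_outlier <- x < lower | x > upper
x[is_outlier]                               # same values as stats$out
```

Whether a flagged value should actually be removed depends on the data; the rule only marks candidates for review.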
EXAMPLE 8: REMOVE SPACES IN CHARACTER STRINGS
• Use the gsub function as demonstrated below:

data$col3 <- gsub(" ", "", data$col3)   # Delete white space in character strings
data                                    # Print updated data frame
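
Note that `gsub(" ", "", ...)` deletes every space, including those inside a value. When only leading and trailing white space should go, base R's `trimws` is an alternative. The comparison below is a sketch on a hypothetical toy vector `x`.

```r
x <- c("x x", " y y y", "a ")    # toy strings (hypothetical)

gsub(" ", "", x)                 # removes every space, including inner ones
trimws(x)                        # removes only leading and trailing white space
gsub("\\s+", " ", trimws(x))     # additionally collapses repeated inner white space
```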
EXAMPLE 9: COMBINE CATEGORIES
The following R code illustrates how to group the categories “a”, “b”, and “c” in a single category “a”:

data$col3[data$col3 %in% c("b", "c")] <- "a"   # Merge categories
data                                           # Print updated data frame
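
A frequency table before and after the merge is a quick way to confirm that the categories were combined as intended. This is a minimal sketch on a hypothetical toy vector `x`.

```r
x <- c("a", "b", "c", "x", "a")   # toy category column (hypothetical)
table(x)                          # counts before merging

x[x %in% c("b", "c")] <- "a"      # merge "b" and "c" into "a"
table(x)                          # counts after merging
```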
THANKS
