
Data Cleaning Using R

The document outlines the process of data cleaning using R, emphasizing the importance of transforming raw data into a clean and consistent format for analysis. It details various objectives of data cleaning, such as eliminating errors, ensuring consistency, and removing duplicates, along with practical examples of R code for different cleaning tasks. The document also compares messy data with clean data and provides methods for handling missing values, modifying column names, and detecting outliers.

Uploaded by

kkabeerbe23

DATA CLEANING USING R

DATA CLEANING DEFINITION
• The process of transforming raw data into consistent data that can be easily analyzed.
• It is aimed at filtering the content of statistical statements.
• It improves your data quality and overall productivity.
OBJECTIVE OF DATA CLEANING
The following are the various purposes of data cleaning in R:
• Eliminate Errors
• Eliminate Redundancy
• Increase Data Reliability
• Deliver Accuracy
• Ensure Consistency
• Assure Completeness
• Standardize your approach
CLEAN DATA VS MESSY DATA

Messy Data
• Special characters (e.g. commas in numeric values)
• Numeric values stored as text/character data types
• Duplicate rows
• Misspellings
• Inaccuracies
• White space
• Missing data
• Zeros instead of null values

Clean Data
• Free of duplicate rows/values
• Error-free (free of misspellings)
• Relevant (free of special characters)
• The appropriate data type for analysis
• Free of outliers (or only contains outliers that have been identified/understood)
• Neat and clean data structure
vary.
DATA CLEANING EXAMPLE
• Using a built-in dataset (the “airquality” dataset)

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Note the NA values inside the Ozone and Solar.R columns.


SUMMARY FUNCTION
> summary(airquality)
We can get a clear visual of the irregular data using a boxplot:
> boxplot(airquality)
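Alongside summary() and the boxplot, a direct count of missing values per column can help decide which columns need attention. The following is a minimal sketch using base R on the built-in airquality dataset; the variable name na_counts is only illustrative.

```r
# Count NA values in each column of the built-in airquality dataset
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Share of missing values per column, in percent
print(round(100 * na_counts / nrow(airquality), 1))
```

Only columns with a non-zero count need the replacement step.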
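REMOVING IRREGULAR DATA WITH THE is.na() METHOD

Missing values in the Ozone column are replaced by the column median; all other values are kept as they are:

new_df = airquality
new_df$Ozone = ifelse(is.na(new_df$Ozone),
                      median(new_df$Ozone, na.rm = TRUE),
                      new_df$Ozone)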
PERFORMING THE SAME OPERATION IN ANOTHER COLUMN

new_df$Solar.R = ifelse(is.na(new_df$Solar.R),
                        median(new_df$Solar.R, na.rm = TRUE),
                        new_df$Solar.R)
summary(new_df)
head(new_df)
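When several columns contain NA values, the replacement can be wrapped in a small function instead of being repeated per column. The sketch below is one possible way to do this, not the method used in the slides; impute_median is a hypothetical helper name, and it assumes every affected column is numeric.

```r
# Hypothetical helper: replace NA values in a numeric vector by the vector's median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

new_df <- airquality

# Names of the columns that contain at least one NA value
na_cols <- names(new_df)[colSums(is.na(new_df)) > 0]

# Apply the helper to each of those columns
new_df[na_cols] <- lapply(new_df[na_cols], impute_median)

colSums(is.na(new_df))   # check how many NA values remain per column
```

Median imputation keeps every row but reduces the variability of the imputed columns, so whether it is appropriate depends on the analysis.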
AN ILLUSTRATION WITH EXAMPLE
1) CREATION OF EXAMPLE DATA
2) EXAMPLE 1: MODIFY COLUMN NAMES
3) EXAMPLE 2: FORMAT MISSING VALUES
4) EXAMPLE 3: REMOVE EMPTY ROWS & COLUMNS
5) EXAMPLE 4: REMOVE ROWS WITH MISSING VALUES
6) EXAMPLE 5: REMOVE DUPLICATES
7) EXAMPLE 6: MODIFY CLASSES OF COLUMNS
8) EXAMPLE 7: DETECT & REMOVE OUTLIERS
9) EXAMPLE 8: REMOVE SPACES IN CHARACTER STRINGS
10) EXAMPLE 9: COMBINE CATEGORIES
CREATION OF EXAMPLE DATA
data <- data.frame(x1 = c(1:4, 99999, 1, NA, 1, 1, NA),   # Create example data frame
                   x1 = c(1:5, 1, "NA", 1, 1, "NA"),
                   x1 = c(letters[c(1:3)], "x x", "x", " y y y", "x", "a", "a", NA),
                   x4 = "",
                   x5 = NA)
data                                                      # Print example data frame
EXAMPLE 1: MODIFY COLUMN NAMES
• Let's first have a closer look at the names of our data frame columns:

colnames(data)                                    # Print column names
# [1] "x1"   "x1.1" "x1.2" "x4"   "x5"

• Let's assume that we want to change these column names to a consecutive range with the prefix “col”. Then, we can apply the colnames, paste0, and ncol functions as shown below.

colnames(data) <- paste0("col", 1:ncol(data))     # Modify all column names
data                                              # Print updated data frame
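If only one column needs a new name, the same colnames function can be combined with a logical index. This is a small sketch on a made-up data frame; df and the column name "group" are illustrative only.

```r
# Small stand-in data frame (made up for this sketch)
df <- data.frame(x1 = 1:3, x2 = c("a", "b", "c"))

# Rename only the column that is currently called "x2"
colnames(df)[colnames(df) == "x2"] <- "group"

colnames(df)   # print the updated column names
```

Selecting by name rather than by position keeps the code valid if the column order changes later.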


EXAMPLE 2: FORMAT MISSING VALUES
• In the R programming language, missing values are usually represented by NA. For that reason, it is useful to convert all missing values to this NA format.

• Some missing values are represented by blank character strings.

data[data == ""]                                  # Print blank data cells
# [1] NA NA NA "" "" "" "" "" "" ""
#     "" "" "" NA NA NA NA NA NA NA NA NA NA

• To assign NA values to those blank cells, we can use the following syntax:

data[data == ""] <- NA                            # Replace blanks by NA

• Next, let's have a closer look at the column col2:

data$col2                                         # Print column
# [1] "1" "2" "3" "4" "5" "1" "NA" "1" "1" "NA"
EXAMPLE 2: FORMAT MISSING VALUES
The NA values in this column are shown between quotes (i.e. "NA"). This indicates that those NA values are formatted as character strings instead of real NA values.

We can change that using the following R code:

data$col2[data$col2 == "NA"] <- NA                # Replace character "NA"
data                                              # Print updated data frame
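Real data often mix several placeholders for missing values. The sketch below replaces a whole set of them in one step; the vector x and the list of placeholder strings in na_strings are assumptions made for illustration and should be adapted to the data at hand.

```r
# Made-up vector with several common placeholders for missing values
x <- c("1", "NA", "", "n/a", "4", "-")

# Placeholder strings treated as missing in this sketch (an assumption)
na_strings <- c("NA", "", "n/a", "-")

# Replace every placeholder by a real NA value
x[x %in% na_strings] <- NA

x               # print the updated vector
sum(is.na(x))   # number of real NA values
```

Using %in% instead of == avoids problems with elements that are already NA, because %in% never returns NA.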
EXAMPLE 3: REMOVE EMPTY ROWS & COLUMNS
Use the rowSums, is.na, and ncol functions to remove rows that contain only NA values:

data <- data[rowSums(is.na(data)) != ncol(data), ]   # Drop empty rows
data                                                 # Print updated data frame
EXAMPLE 3: REMOVE EMPTY ROWS & COLUMNS
We can also exclude columns that contain only NA values:

data <- data[ , colSums(is.na(data)) != nrow(data)]  # Drop empty columns
data                                                 # Print updated data frame
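Both steps can also be combined in a single subsetting call. The following sketch uses a made-up data frame df; the names empty_rows, empty_cols, and df_clean are illustrative.

```r
# Made-up data frame: row 2 and column c contain only NA values
df <- data.frame(a = c(1, NA, 3),
                 b = c("x", NA, "z"),
                 c = NA)

empty_rows <- rowSums(is.na(df)) == ncol(df)   # TRUE for rows that are entirely NA
empty_cols <- colSums(is.na(df)) == nrow(df)   # TRUE for columns that are entirely NA

# drop = FALSE keeps the result a data frame even if only one column remains
df_clean <- df[!empty_rows, !empty_cols, drop = FALSE]
df_clean
```

Without drop = FALSE, base R would return a plain vector whenever a single column is left, which can break later code that expects a data frame.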
EXAMPLE 4: REMOVE ROWS WITH MISSING VALUES
In case you have decided to remove all rows with one or more NA values, you may use the na.omit function as shown below:

data <- na.omit(data)                             # Delete rows with missing values
data                                              # Print updated data frame
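na.omit drops a row as soon as any column is missing. If only some columns matter for the analysis, complete.cases can be restricted to those columns. A minimal sketch on a made-up data frame (df, df_all, and df_score are illustrative names):

```r
# Made-up data frame with NA values in two different columns
df <- data.frame(id    = 1:4,
                 score = c(10, NA, 30, 40),
                 note  = c("ok", "ok", NA, "ok"))

# Remove rows with an NA value in any column
df_all <- na.omit(df)

# Remove only rows where id or score is missing; NA values in note are kept
df_score <- df[complete.cases(df[, c("id", "score")]), ]

nrow(df_all)     # number of rows left after na.omit
nrow(df_score)   # number of rows left after the restricted check
```

The restricted version keeps more rows, at the cost of leaving NA values in the columns that were not checked.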
EXAMPLE 5: REMOVE DUPLICATES
Apply the unique function to our data frame as demonstrated in the following R snippet:

data <- unique(data)                              # Exclude duplicates
data                                              # Print updated data frame
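Before removing duplicates, it can be useful to inspect which rows are affected. The base R function duplicated marks every repeated row; the sketch below uses a made-up data frame df.

```r
# Made-up data frame in which row 3 repeats row 2
df <- data.frame(id = c(1, 2, 2, 3),
                 value = c("a", "b", "b", "c"))

duplicated(df)        # logical vector, TRUE for rows that repeat an earlier row
df[duplicated(df), ]  # show the duplicated rows
sum(duplicated(df))   # count the duplicated rows

# Keep only the first occurrence of each row
df_unique <- df[!duplicated(df), ]
df_unique
```

For data frames this gives the same rows as unique(df), but the intermediate logical vector makes it possible to review the duplicates first.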
EXAMPLE 6: MODIFY CLASSES OF COLUMNS
• The class of the columns of a data frame is another critical topic when it comes to data cleaning.
• This example explains how to format each column to the most appropriate data type automatically.
• Let's first check the current classes of our data frame columns:

sapply(data, class)                               # Print classes of all columns
#        col1        col2        col3
#   "numeric" "character" "character"

Use the type.convert function to change the column classes wherever it is appropriate:

data <- type.convert(data, as.is = TRUE)
data                                              # Print updated data frame

If we print the data types of our columns once again, we can see that the first two columns have been changed to the integer class.

sapply(data, class)                               # Print classes of updated columns
#        col1        col2        col3
#   "integer"   "integer" "character"
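type.convert only changes a column when every value can be parsed as the new type. Numbers stored as text with thousands separators, one of the messy-data cases listed earlier, therefore stay character. The sketch below shows one way to handle this case; the vector x is made up, and it assumes the comma is a thousands separator rather than a decimal mark.

```r
# Made-up character vector with commas as thousands separators
x <- c("1,200", "350", "4,075")

# type.convert leaves the vector as character because "1,200" is not a valid number
class(type.convert(x, as.is = TRUE))

# Remove the commas first, then convert to numeric
x_num <- as.numeric(gsub(",", "", x, fixed = TRUE))
x_num
class(x_num)
```

The argument fixed = TRUE tells gsub to treat the comma as a literal character instead of a regular expression.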
EXAMPLE 7: DETECT & REMOVE OUTLIERS
One method to detect outliers is provided by the boxplot.stats function:

data$col1[data$col1 %in% boxplot.stats(data$col1)$out]          # Identify outliers in column
# [1] 99999

• The previous output has returned one outlier (i.e. the value 99999). This value is obviously much higher than the other values in this column.
• Apply the R code below to remove the outlier:

data <- data[! data$col1 %in% boxplot.stats(data$col1)$out, ]   # Remove rows with outliers
data                                                            # Print updated data frame
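The rule behind boxplot.stats can be made explicit: values further than 1.5 times the interquartile range from the box are flagged. The sketch below computes similar limits by hand on a made-up vector x. Note that boxplot.stats derives the box from hinges (see fivenum) while this sketch uses quantile, so the limits can differ slightly on other data.

```r
# Made-up numeric vector with one extreme value
x <- c(1, 2, 3, 4, 99999, 1, 1, 1)

q <- quantile(x, c(0.25, 0.75))   # first and third quartile
iqr <- q[[2]] - q[[1]]            # interquartile range

lower <- q[[1]] - 1.5 * iqr       # lower limit
upper <- q[[2]] + 1.5 * iqr       # upper limit

# Values outside the limits are treated as outliers
outliers <- x[x < lower | x > upper]
outliers

# Same check with boxplot.stats for comparison
boxplot.stats(x)$out
```

Writing the limits out by hand also makes it easy to use a different multiplier than 1.5 when the default rule is too strict or too lenient.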
EXAMPLE 8: REMOVE SPACES IN CHARACTER STRINGS
• Use the gsub function as demonstrated below:

data$col3 <- gsub(" ", "", data$col3)             # Delete white space in character strings
data                                              # Print updated data frame
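gsub(" ", "", ...) removes every space, including spaces between words. If only leading and trailing white space should go, the base R function trimws is an alternative. A short comparison on a made-up vector x:

```r
# Made-up character vector with inner, leading, and trailing spaces
x <- c("x x", " y y y", "a ")

gsub(" ", "", x)   # removes all spaces, also those inside the strings
trimws(x)          # removes only leading and trailing white space
```

Which variant fits depends on whether the inner spaces carry meaning, for example in names made of several words.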
EXAMPLE 9: COMBINE CATEGORIES
The following R code illustrates how to group the categories “a”, “b”, and “c” in a single category “a”.

data$col3[data$col3 %in% c("b", "c")] <- "a"      # Merge categories
data                                              # Print updated data frame
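The code above works on character columns. If the column is stored as a factor, the categories can be merged by renaming the factor levels instead. This is a sketch under that assumption, using a made-up factor f:

```r
# Made-up factor with the categories a, b, c, and x
f <- factor(c("a", "b", "c", "x", "a"))

# Rename the levels "b" and "c" to "a"; identical level names are merged
levels(f)[levels(f) %in% c("b", "c")] <- "a"

levels(f)   # remaining categories
table(f)    # counts per category after merging
```

Renaming levels changes all matching values at once and also removes the merged categories from the set of levels.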
THANKS
