Data Cleaning Using R
DATA CLEANING DEFINITION
• The process of transforming raw data into consistent data that can be easily analyzed.
• It is aimed at filtering the content of statistical statements.
• It improves your data quality and overall productivity.
OBJECTIVE OF DATA CLEANING
The following are the various purposes of data cleaning in R:
• Eliminate Errors
• Eliminate Redundancy
• Increase Data Reliability
• Deliver Accuracy
• Ensure Consistency
• Assure Completeness
• Standardize your approach
CLEAN DATA VS MESSY DATA
Messy Data
• Special characters (e.g. commas in numeric values)
• Numeric values stored as text/character data types
• Duplicate rows
• Misspellings
• Inaccuracies
• White space
• Missing data
• Zeros instead of null values
Clean Data
• Free of duplicate rows/values
• Error-free (free of misspellings)
• Relevant (free of special characters)
• The appropriate data type for analysis
• Free of outliers (or only contains outliers that have been identified/understood)
• Neat and clean data structure
vary.
DATA
• Using an inbuilt dataset (the "airquality" dataset)
• > head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
• new_df = airquality
• new_df$Ozone = ifelse(is.na(new_df$Ozone), median(new_df$Ozone, na.rm = TRUE), new_df$Ozone)
PERFORMING THE SAME OPERATION IN ANOTHER COLUMN.
• new_df$Solar.R = ifelse(is.na(new_df$Solar.R), median(new_df$Solar.R, na.rm = TRUE), new_df$Solar.R)
• summary(new_df)
• head(new_df)
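The column-by-column median replacement above can be generalized. The following is a minimal sketch, assuming we want the same treatment for every numeric column that contains missing values; the helper name impute_median is hypothetical and not part of the original slides.

```r
# Hypothetical helper (an assumption, not from the slides): replace NA values
# in every numeric column with that column's median.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

new_df <- impute_median(airquality)  # airquality ships with base R
colSums(is.na(new_df))               # remaining NA count per column
```

The median is used here, as in the slides, because it is less sensitive to extreme values than the mean.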
AN ILLUSTRATION
1) CREATION OF EXAMPLE DATA
2) EXAMPLE 1: MODIFY COLUMN NAMES
3) EXAMPLE 2: FORMAT MISSING VALUES
4) EXAMPLE 3: REMOVE EMPTY ROWS & COLUMNS
5) EXAMPLE 4: REMOVE ROWS WITH MISSING VALUES
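The first two steps of this outline might look like the sketch below. The cell values are assumptions made for illustration, since the slides do not show the original data; only the column names match the ones printed in the slides. Renaming with paste0 is one possible approach, not necessarily the one the author used.

```r
# Hypothetical example data (values are assumptions, not from the slides).
# Repeating the name X1 makes data.frame() de-duplicate it to X1, X1.1, X1.2.
data <- data.frame(X1 = c(1, 2, 3),
                   X1 = c("a", "", "c"),
                   X1 = c("x", "y", NA),
                   X4 = c(NA, NA, NA),
                   X5 = c(10, 20, 30))

old_names <- colnames(data)   # names as created by data.frame()

# Example 1: give the columns consistent names col1, col2, ...
colnames(data) <- paste0("col", seq_len(ncol(data)))
new_names <- colnames(data)
```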
• colnames(data)   # Print column names
# [1] "X1" "X1.1" "X1.2" "X4" "X5"
• data[data == ""]   # Print blank data cells
# [1] NA NA NA "" "" "" "" "" "" "" "" "" "" NA NA NA NA NA NA NA NA NA NA
• To assign NA values to those blank cells, we can use the following syntax:
data # Print
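The assignment itself did not survive in the slides. A minimal sketch of one common way to do it, using a small hypothetical data frame (the values are assumptions):

```r
# Hypothetical example data with blank character cells
data <- data.frame(col1 = c("a", "", "c"),
                   col2 = c("", "", "z"),
                   stringsAsFactors = FALSE)

# data == "" yields a logical matrix; matching cells are overwritten with NA
data[data == ""] <- NA
data
```

After this step the blank cells are treated as missing values by functions such as is.na.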
EXAMPLE 3: REMOVE EMPTY ROWS & COLUMNS
Use the rowSums, is.na, and ncol functions to remove only-NA rows:
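A sketch of how these functions can be combined, on hypothetical data (the values are assumptions). The column step with colSums and nrow is an addition that mirrors the row step, included because the example title also mentions empty columns.

```r
# Hypothetical example data: rows 2 and 4 and column col3 contain only NA
data <- data.frame(col1 = c(1, NA, 3, NA),
                   col2 = c("a", NA, NA, NA),
                   col3 = c(NA, NA, NA, NA))

# Keep rows where the number of NA cells is not equal to the number of columns
data <- data[rowSums(is.na(data)) != ncol(data), ]

# Same idea for columns; drop = FALSE keeps the result a data frame
data <- data[, colSums(is.na(data)) != nrow(data), drop = FALSE]
data
```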
Use the type.convert function to change the column classes wherever appropriate:
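A minimal sketch on hypothetical data (the values are assumptions). Setting as.is = TRUE keeps text columns as character instead of converting them to factors.

```r
# Hypothetical example data: numbers stored as character strings
data <- data.frame(col1 = c("1", "2", "3"),
                   col2 = c("1.5", "2.5", "x"),
                   stringsAsFactors = FALSE)

sapply(data, class)                       # both columns are character

data <- type.convert(data, as.is = TRUE)  # re-detect the class of each column

sapply(data, class)                       # col1 is now integer; col2 stays character
```

Because col2 contains the non-numeric entry "x", it cannot be converted and remains a character column.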
• The previous output has returned one outlier (i.e. the value 99999). This
value is obviously much higher than the other values in this column.
• Apply the R code below to remove the outlier:
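The code referred to here is missing from the slides. The sketch below shows one possible approach, assuming the outlier was detected with boxplot.stats (a base R function that flags values beyond 1.5 times the interquartile range from the hinges). Apart from 99999, the data values are assumptions.

```r
# Hypothetical example data containing the extreme value 99999
data <- data.frame(col1 = c(1, 2, 3, 4, 99999, 1, 1, 1),
                   col2 = letters[1:8])

# Values flagged as outliers by the boxplot rule
outliers <- boxplot.stats(data$col1)$out

# Keep only the rows whose col1 value is not among the outliers
data <- data[!data$col1 %in% outliers, ]
data
```

Whether a flagged value should be removed is a judgment call; a value like 99999 is often a placeholder code for missing data rather than a real measurement.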