Dirty Data in R Programming: Identifying, Understanding, and Cleaning Data
MANJU
2nd MSc Computer Science
Techniques:
Output:
column
1 1
2 2
3 3
4 4
5 5
Dealing With Duplicates
•Identifying duplicates: duplicated()
•Removing duplicates:
Output:
ID Value
1 1 10
2 2 20
3 3 30
5 4 40
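The output above is consistent with a data frame whose fourth row repeats the third, since row name 4 is missing. The input data is not shown, so the following is only a minimal sketch under that assumption. The data frame and the names dup_flags and deduped are hypothetical, and only base R is used:

```r
# Hypothetical input (assumed, not from the slides): row 4 is an exact copy of row 3
data <- data.frame(
  ID    = c(1, 2, 3, 3, 4),
  Value = c(10, 20, 30, 30, 40)
)

# duplicated() returns TRUE for every row that repeats an earlier row
dup_flags <- duplicated(data)

# Keep only the first occurrence of each row; the original row names are preserved
deduped <- data[!dup_flags, ]
print(deduped)
```

Calling unique(data) should give the same rows. To treat rows as duplicates based on one column only, duplicated(data$ID) can be used in place of duplicated(data).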
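Correcting Data Types
•Converting strings to numbers: as.numeric()
•Parsing dates: lubridate::ymd()
•Example:
library(lubridate)
# Dates stored as strings in three different year-month-day layouts
data <- data.frame(
  date = c("20241219", "2024/12/18", "2024-12-17")
)
# ymd() parses each string into a Date, whatever separator it uses
data$date <- ymd(data$date)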
Output:
date
1 2024-12-19
2 2024-12-18
3 2024-12-17
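The slides name as.numeric() for string-to-number conversion but give no example. A minimal sketch, assuming a hypothetical text column (raw) in which one entry is not a valid number:

```r
# Hypothetical column read in as text; "N/A" is not a valid number
raw <- c("10", "20.5", "N/A")

# as.numeric() returns NA (and raises a warning) for entries it cannot parse;
# the warning is suppressed here because the NAs are inspected explicitly below
values <- suppressWarnings(as.numeric(raw))

# Positions that failed to convert and need manual attention
failed <- which(is.na(values) & !is.na(raw))
print(values)
print(failed)
```

Checking the failed positions before overwriting the original column helps avoid silently turning dirty entries into missing values.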
Detecting and Handling Outliers
•Methods:
•Boxplots for visual detection
•Z-score or IQR method for identification
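•Code:
# Quartiles and interquartile range (named iqr so it does not mask stats::IQR)
Q1 <- quantile(data$column, 0.25)
Q3 <- quantile(data$column, 0.75)
iqr <- Q3 - Q1
# Keep rows within 1.5 * IQR of the quartiles; drop = FALSE keeps a
# single-column result as a data frame instead of a plain vector
data <- data[data$column >= (Q1 - 1.5 * iqr) &
             data$column <= (Q3 + 1.5 * iqr), , drop = FALSE]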
Example
data <- data.frame(column = c(10, 12, 15, 18, 22, 100, 150))
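To make the two identification methods concrete, the sketch below applies the IQR rule and a z-score check to this example data. The names lower, upper, outliers, cleaned, z and z_outliers are introduced for illustration, and the z-score cutoff of 3 is a common convention assumed here, not something the slides specify:

```r
data <- data.frame(column = c(10, 12, 15, 18, 22, 100, 150))

# IQR method (quantile() uses its default type 7 interpolation)
Q1  <- quantile(data$column, 0.25)   # 13.5
Q3  <- quantile(data$column, 0.75)   # 61
iqr <- Q3 - Q1                       # 47.5

lower <- Q1 - 1.5 * iqr              # -57.75
upper <- Q3 + 1.5 * iqr              # 132.25

# Values outside the fences are treated as outliers
outliers <- data$column[data$column < lower | data$column > upper]

# drop = FALSE keeps the one-column result as a data frame
cleaned <- data[data$column >= lower & data$column <= upper, , drop = FALSE]

# Z-score method: distance from the mean in units of standard deviation
z <- (data$column - mean(data$column)) / sd(data$column)
z_outliers <- data$column[abs(z) > 3]   # assumed cutoff of 3

print(outliers)
print(cleaned)
print(z_outliers)
```

On this data the IQR rule removes only 150. The value 100 stays because the two large values pull Q3 up to 61 and widen the upper fence. The z-score check flags nothing: the large values also inflate the standard deviation, and with only seven observations no |z| can exceed roughly 2.27, so a cutoff of 3 is too strict for very small samples.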