
Dirty Data in R Programming: Identifying, Understanding, and Cleaning Data

The document discusses dirty data in R programming, defining it as data that is incomplete, inconsistent, inaccurate, or improperly formatted, and outlines common issues such as missing values, duplicates, and outliers. It provides techniques for handling these issues, including removing or imputing missing values, identifying and removing duplicates, correcting data types, and detecting outliers using methods like boxplots and IQR. Examples of R code are included to illustrate the processes of cleaning data.

Uploaded by

rithikaoct08

Dirty Data In R Programming

Identifying, Understanding, and Cleaning Data

MANJU
2nd M.Sc. Computer Science

What Is Dirty Data?

•Definition: Data that is incomplete, inconsistent, inaccurate, or improperly formatted.
•Examples:
o Missing values
o Duplicates
o Incorrect data types
o Outliers
•Impact: Dirty data distorts analysis results and can lead to wrong decisions
Common Issues in Dirty Data
•Missing Data: NA values or blanks
•Inconsistent Data: Different date formats, mismatched categories
•Outliers: Extreme values affecting analysis
•Duplicates: Repeated rows in datasets
•Typographical Errors: Incorrect spellings or labels
•Incorrect Data Types: Numeric stored as text, etc.
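A quick audit can show which of the issues listed above are present before any cleaning starts. The sketch below uses base R only; the data frame `df` and its column names are hypothetical and exist purely for illustration.

```r
# Hypothetical data frame that exhibits several of the issues listed above
df <- data.frame(
  id   = c(1, 2, 2, 3, 4),
  age  = c("25", "31", "31", NA, "forty"),  # numbers stored as text, one NA, one bad entry
  city = c("Chennai", "chennai", "chennai", "Madurai", NA),
  stringsAsFactors = FALSE
)

# Missing values per column
na_counts <- colSums(is.na(df))

# Number of rows that exactly repeat an earlier row
dup_count <- sum(duplicated(df))

# Storage type of each column (text-stored numbers show up as "character")
col_types <- sapply(df, class)

# Inconsistent labels show up as extra distinct values
city_labels <- unique(df$city)

print(na_counts)
print(dup_count)
print(col_types)
print(city_labels)
```

Each check maps to one bullet above, so the printed results give a rough to-do list for the cleaning steps that follow.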
Handling Missing Values

Techniques:

•Removing rows with missing values: na.omit()
•Replacing values: tidyr::replace_na()
•Imputation: Mean, median, or custom methods
Example
library(dplyr)
# Replace each NA in `column` with the median of the non-missing values
data <- data %>%
  mutate(column = ifelse(is.na(column), median(column, na.rm = TRUE), column))
Example
data <- data.frame(column = c(1, 2, NA, 4, 5))
# After the median imputation shown above, the NA becomes 3 (the median of 1, 2, 4, 5)

Output:
  column
1      1
2      2
3      3
4      4
5      5
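The other techniques from the list, row removal and mean imputation, can be sketched in base R as follows. The data frame, its column names, and the "unknown" placeholder are assumptions made for illustration, not part of the original slides.

```r
# Hypothetical data with one missing value in each column
data <- data.frame(
  score = c(10, NA, 30, 40),
  group = c("a", "b", NA, "b"),
  stringsAsFactors = FALSE
)

# Option 1: drop every row that contains at least one NA
complete_rows <- na.omit(data)

# Option 2: replace missing numeric values with the column mean
imputed <- data
imputed$score[is.na(imputed$score)] <- mean(imputed$score, na.rm = TRUE)

# Option 3: replace missing labels with an explicit placeholder
imputed$group[is.na(imputed$group)] <- "unknown"

print(complete_rows)
print(imputed)
```

Row removal is the simplest choice but discards the non-missing values in the dropped rows, which is why imputation is often preferred when many rows have a single gap.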
Dealing With Duplicates
•Identifying duplicates: duplicated()

•Removing duplicates:

data <- data[!duplicated(data), ]


Example
data <- data.frame(
  ID = c(1, 2, 3, 2, 4),
  Value = c(10, 20, 30, 20, 40)
)

duplicated(data)
# [1] FALSE FALSE FALSE  TRUE FALSE

data <- data[!duplicated(data), ]

Output:
  ID Value
1  1    10
2  2    20
3  3    30
5  4    40
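Whole-row comparison misses records that share a key but differ in another column. A minimal sketch of key-based de-duplication, assuming a hypothetical `records` data frame in which `ID` is the key:

```r
# Hypothetical records: ID 2 appears twice with different values
records <- data.frame(
  ID    = c(1, 2, 3, 2, 4),
  Value = c(10, 20, 30, 25, 40)
)

# Whole-row comparison finds nothing, because the two ID 2 rows differ in Value
full_row_dups <- sum(duplicated(records))

# Comparing on the key column only keeps the first row seen for each ID
first_per_id <- records[!duplicated(records$ID), ]

# fromLast = TRUE keeps the last row seen for each ID instead
last_per_id <- records[!duplicated(records$ID, fromLast = TRUE), ]

print(full_row_dups)
print(first_per_id)
print(last_per_id)
```

Whether the first or the last occurrence should win depends on the dataset, for example on whether later rows are corrections of earlier ones.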
Correcting Data Types
•Converting strings to numbers: as.numeric()
•Parsing dates: lubridate::ymd()

•Example:
library(lubridate)
data$date <- ymd(data$date)
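Example
data <- data.frame(
  date = c("20241219", "2024/12/18", "2024-12-17")
)
# ymd() accepts all three year-month-day layouts and returns Date values
data$date <- ymd(data$date)

Output:
        date
1 2024-12-19
2 2024-12-18
3 2024-12-17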
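The string-to-number conversion mentioned above can be sketched in the same way. The `raw` data frame and its `amount` column are hypothetical; the point is that as.numeric() turns unparseable text into NA, so it is worth recording which entries failed.

```r
# Hypothetical column where numbers were read in as text
raw <- data.frame(
  amount = c("10", "20.5", "abc"),
  stringsAsFactors = FALSE
)

# as.numeric() returns NA (with a warning) for text that is not a number;
# suppressWarnings() silences that warning because failures are checked explicitly below
converted <- suppressWarnings(as.numeric(raw$amount))

# Entries that were present before conversion but are NA afterwards failed to parse
failed <- raw$amount[!is.na(raw$amount) & is.na(converted)]

raw$amount <- converted

print(failed)
print(raw)
```

Inspecting `failed` before overwriting the column makes it possible to correct typos by hand instead of silently turning them into missing values.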
Detecting and Handling Outliers
•Methods:
o Boxplots for visual detection
o Z-score or IQR method for identification
•Code:
Q1 <- quantile(data$column, 0.25)
Q3 <- quantile(data$column, 0.75)
IQR <- Q3 - Q1
# Keep only the rows inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
data <- data[data$column >= (Q1 - 1.5 * IQR) &
             data$column <= (Q3 + 1.5 * IQR), ]
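Example
data <- data.frame(column = c(10, 12, 15, 18, 22, 100, 150))

# quantile() interpolates between neighbouring values by default
Q1 <- quantile(data$column, 0.25)   # Q1 = 13.5
Q3 <- quantile(data$column, 0.75)   # Q3 = 61
IQR <- Q3 - Q1                      # IQR = 47.5

lower_threshold <- Q1 - 1.5 * IQR   # 13.5 - 71.25 = -57.75
upper_threshold <- Q3 + 1.5 * IQR   # 61 + 71.25 = 132.25

# drop = FALSE keeps the one-column result as a data frame instead of a vector
data <- data[data$column >= lower_threshold &
             data$column <= upper_threshold, , drop = FALSE]
# Only 150 is removed; 100 stays because the two large values pull Q3 up to 61

Output:
  column
1     10
2     12
3     15
4     18
5     22
6    100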
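The boxplot and Z-score methods listed above can be sketched in base R as well. The helper `zscore_outliers()`, its cutoff of 3, and the `readings` vector are assumptions introduced for illustration; the cutoff is a common convention, not something stated in the slides.

```r
# Same values as the IQR example above
x <- c(10, 12, 15, 18, 22, 100, 150)

# boxplot.stats() returns the points a boxplot would draw beyond its whiskers
boxplot_outliers <- boxplot.stats(x)$out

# Hypothetical helper: flag values more than `cutoff` standard deviations from the mean
zscore_outliers <- function(values, cutoff = 3) {
  z <- (values - mean(values)) / sd(values)
  values[abs(z) > cutoff]
}

# Hypothetical larger sample: twenty typical readings and one extreme reading
readings <- c(rep(10, 20), 100)
z_flagged <- zscore_outliers(readings)

print(boxplot_outliers)
print(z_flagged)
```

The Z-score rule is applied to a larger sample on purpose: in a very small sample the extreme values inflate both the mean and the standard deviation, so on the seven-value vector `x` no value reaches a cutoff of 3.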
THANK YOU
