Pre Processing

The document outlines a process for handling missing values in a dataset using R, including methods to replace missing values with mean, median, and mode. It also describes steps to remove duplicate rows, convert categorical columns to factors, perform one-hot encoding, and normalize a specific column. The code snippets provided demonstrate these data preprocessing techniques.

Uploaded by

عبد ا للہ

df <- read.csv("dataset.csv", header = TRUE, stringsAsFactors = FALSE)

sum(is.na(df)) # Total missing values

colSums(is.na(df)) # Missing values per column
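As a minimal sketch of what these two calls report, the following runs them on a small made-up data frame. The name toy and its values are hypothetical and are not taken from dataset.csv.

```r
# Hypothetical toy data; not taken from dataset.csv
toy <- data.frame(
  Column1  = c(1, NA, 3, NA),
  Category = c("a", "b", NA, "a"),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix of the same shape
total_missing  <- sum(is.na(toy))      # one count for the whole data frame
missing_by_col <- colSums(is.na(toy))  # named vector, one count per column

print(total_missing)
print(missing_by_col)
```

The per-column counts are usually the more useful of the two, since they show which columns need imputation.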

c) Replacing Missing Values

# Replace missing values in a column with mean

df$Column1[is.na(df$Column1)] <- mean(df$Column1, na.rm = TRUE)

# Replace missing values with median

df$Column1[is.na(df$Column1)] <- median(df$Column1, na.rm = TRUE)

# Replace missing values with mode

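mode_function <- function(x) {
  x <- x[!is.na(x)]      # ignore missing values so NA can never be the mode
  unique_x <- unique(x)
  # Count occurrences of each distinct value and return the most frequent one
  unique_x[which.max(tabulate(match(x, unique_x)))]
}

df$Column1[is.na(df$Column1)] <- mode_function(df$Column1)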
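The three replacements above are alternatives: once one of them has filled the missing entries of a column, the others find nothing left to replace. One way to make that choice explicit is a small helper that takes the method as an argument. This is only a sketch; the function name impute and the sample values are hypothetical.

```r
# Hypothetical helper: fill the NAs of a vector using one chosen statistic
impute <- function(x, method = c("mean", "median", "mode")) {
  method <- match.arg(method)
  observed <- x[!is.na(x)]  # statistics are computed on observed values only
  fill <- switch(method,
    mean   = mean(observed),
    median = median(observed),
    mode   = {
      u <- unique(observed)
      u[which.max(tabulate(match(observed, u)))]
    }
  )
  x[is.na(x)] <- fill
  x
}

values <- c(2, 2, 5, NA, 11)  # made-up values with one missing entry

by_mean   <- impute(values, "mean")
by_median <- impute(values, "median")
by_mode   <- impute(values, "mode")
```

Applied to a data frame, this would be used as df$Column1 <- impute(df$Column1, "median"), with the method chosen once per column.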

df <- df[!duplicated(df), ] # Remove duplicate rows
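To illustrate how duplicated() behaves, here is a short sketch on made-up data (the data frame toy is hypothetical). It flags a row only when an identical row has already appeared, so the first occurrence is kept.

```r
# Hypothetical toy data with one repeated row
toy <- data.frame(
  id    = c(1, 2, 2, 3),
  value = c("x", "y", "y", "z"),
  stringsAsFactors = FALSE
)

dup_flags <- duplicated(toy)   # TRUE where a row repeats an earlier row
deduped   <- toy[!dup_flags, ]
rownames(deduped) <- NULL      # optional: renumber rows after filtering

print(dup_flags)
print(deduped)
```

For a data frame, unique(toy) keeps the same rows as this filter.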

# Convert categorical column to factors

df$Category <- as.factor(df$Category)

# One-hot encoding (dummy variables)


library(dummies)

df_encoded <- dummy.data.frame(df, names = c("Category"), sep = "_")
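If the dummies package is not available in your R installation, a similar result can be sketched with base R's model.matrix(). This is a swapped-in technique, not the method used above. The toy data and the "Category_" column prefix are assumptions chosen to mimic the sep = "_" naming.

```r
# Hypothetical toy data; Category is already a factor
toy <- data.frame(
  Column1  = c(10, 20, 30),
  Category = factor(c("red", "green", "red"))
)

# "- 1" removes the intercept so every factor level gets its own 0/1 column
indicator <- model.matrix(~ Category - 1, data = toy)

# model.matrix names columns "Category<level>"; insert "_" to mimic sep = "_"
colnames(indicator) <- sub("^Category", "Category_", colnames(indicator))

# Keep the remaining columns and append the indicator columns
toy_encoded <- cbind(toy["Column1"], as.data.frame(indicator))
print(toy_encoded)
```

The indicator columns follow the order of the factor levels, which is alphabetical by default.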

Normalization

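# Min-max scaling: rescale the column to the range [0, 1]

df$Column1 <- (df$Column1 - min(df$Column1, na.rm = TRUE)) / (max(df$Column1, na.rm = TRUE) - min(df$Column1, na.rm = TRUE))

# Z-score standardization (an alternative to min-max scaling; apply one or the other)

df$Column1 <- (df$Column1 - mean(df$Column1, na.rm = TRUE)) / sd(df$Column1, na.rm = TRUE)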
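The two formulas above are min-max scaling and z-score standardization. Writing each as a small function makes it easier to apply one of them to a column without overwriting the result with the other. This is a sketch under that assumption; the function names min_max and z_score and the sample values are hypothetical.

```r
# Hypothetical helper: rescale to [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)  # c(min, max), ignoring missing values
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical helper: center to mean 0 and scale to standard deviation 1
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

values <- c(10, 20, 30, 40, 50)  # made-up values

scaled       <- min_max(values)
standardized <- z_score(values)

print(scaled)
print(standardized)
```

Both helpers divide by zero when every value in the column is the same, so a constant column should be checked for before scaling.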
