Exploratory Data Analysis
Lab Exercise 1: Summary Statistics and Data Visualization
Problem Statement:
Use the mtcars dataset available in R. Calculate summary statistics (mean, median, standard deviation)
for the mpg (miles per gallon) column. Then, create a histogram and a boxplot for the same column.
Lab Exercise 2: Correlation Analysis
Problem Statement:
Use the iris dataset. Calculate the correlation matrix for the numerical variables in the dataset. Create a
pairs plot to visualize the relationships between these variables.
Lab Exercise 3: Data Cleaning and Handling Missing Values
Problem Statement:
Create a sample dataset with some missing values. Handle the missing values by imputing the mean for
numerical columns and the mode for categorical columns.
Lab Exercise 4: Outlier Detection
Problem Statement:
Using the mtcars dataset, detect outliers in the hp (horsepower) column using the IQR method: flag
values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. Display the rows that contain outliers.
Lab Exercise 5: Data Transformation and Visualization
Problem Statement:
Use the iris dataset. Normalize the Sepal.Length column to the range [0, 1] (min-max scaling) and create
a density plot of the normalized values. Also, create a scatter plot of the normalized Sepal.Length against
Sepal.Width.
Answers
Lab Exercise 1
# Load the dataset
data(mtcars)
# Calculate summary statistics
mean_mpg <- mean(mtcars$mpg)
median_mpg <- median(mtcars$mpg)
sd_mpg <- sd(mtcars$mpg)
# Display the summary statistics
mean_mpg
median_mpg
sd_mpg
# Create a histogram
hist(mtcars$mpg, main="Histogram of MPG", xlab="Miles Per Gallon", col="blue")
# Create a boxplot
boxplot(mtcars$mpg, main="Boxplot of MPG", ylab="Miles Per Gallon", col="green")
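As a compact variant of the answer above, the three statistics can be collected in one named vector and compared against R's built-in summary(). This is only a sketch; the name mpg_stats is an assumption introduced here for illustration.

```r
# Load the dataset
data(mtcars)

# Collect the three statistics for mpg in one named vector
# (mpg_stats is an illustrative name, not part of the original answer)
mpg_stats <- c(
  mean   = mean(mtcars$mpg),
  median = median(mtcars$mpg),
  sd     = sd(mtcars$mpg)
)

# Display the statistics rounded to two decimals
round(mpg_stats, 2)

# Built-in summary: minimum, quartiles, median, mean, and maximum
summary(mtcars$mpg)
```

Note that summary() does not report the standard deviation, so sd() is still needed separately.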
Lab Exercise 2
# Load the dataset
data(iris)
# Calculate the correlation matrix
cor_matrix <- cor(iris[, 1:4])
# Display the correlation matrix
cor_matrix
# Create a pairs plot
pairs(iris[, 1:4], main="Pairs Plot of Iris Dataset", col=iris$Species)
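To read the correlation matrix more easily, one option is to list each pair of variables once and rank the pairs by absolute correlation. The sketch below assumes the same four numerical columns; the name cor_pairs is introduced here for illustration.

```r
# Load the dataset and compute the correlation matrix
data(iris)
cor_matrix <- cor(iris[, 1:4])

# Take the upper triangle so each pair of variables appears only once
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)

# Build a table of pairs with their correlation coefficient
# (cor_pairs is an illustrative name)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx],
  stringsAsFactors = FALSE
)

# Sort from the strongest to the weakest relationship (by absolute value)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
cor_pairs
```

The pairs at the top of this table should correspond to the panels of the pairs plot where the points lie closest to a straight line.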
Lab Exercise 3
# Create a sample dataset with missing values
sample_data <- data.frame(
  Age = c(25, 30, NA, 22, 40, NA, 35),
  Gender = c("Male", "Female", "Female", NA, "Male", "Male", NA)
)
# Define a function to impute the mean for numerical columns
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  return(x)
}
# Define a function to impute the mode for categorical columns
impute_mode <- function(x) {
  x[is.na(x)] <- names(sort(table(x), decreasing = TRUE))[1]
  return(x)
}
# Impute missing values
sample_data$Age <- impute_mean(sample_data$Age)
sample_data$Gender <- impute_mode(sample_data$Gender)
# Display the cleaned dataset
sample_data
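When a dataset has many columns, the two imputation rules can be applied by column type in a single pass. The following is a minimal sketch under that assumption; impute_column and cleaned are hypothetical names introduced here, and ties for the mode are resolved by taking the first most frequent value.

```r
# Recreate the sample dataset with missing values
sample_data <- data.frame(
  Age = c(25, 30, NA, 22, 40, NA, 35),
  Gender = c("Male", "Female", "Female", NA, "Male", "Male", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean for numerical columns, mode otherwise
impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    # table() ignores NA by default; which.max() picks the first most frequent value
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}

# Apply the helper to every column and rebuild the data frame
cleaned <- as.data.frame(lapply(sample_data, impute_column),
                         stringsAsFactors = FALSE)

# Confirm that no missing values remain in any column
colSums(is.na(cleaned))
```

For this sample, the missing ages become 30.4 (the mean of the five observed ages) and the missing genders become "Male" (three occurrences against two for "Female").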
Lab Exercise 4
# Load the dataset
data(mtcars)
# Calculate the quartiles and the IQR for the hp column
Q1 <- quantile(mtcars$hp, 0.25)
Q3 <- quantile(mtcars$hp, 0.75)
IQR_hp <- IQR(mtcars$hp)
# Define the outlier boundaries
lower_bound <- Q1 - 1.5 * IQR_hp
upper_bound <- Q3 + 1.5 * IQR_hp
# Detect outliers
outliers <- mtcars[mtcars$hp < lower_bound | mtcars$hp > upper_bound, ]
# Display the rows containing outliers
outliers
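The same rule can be wrapped in a reusable function and cross-checked against base R's boxplot.stats(). This is a sketch only: iqr_outliers and hp_outliers are hypothetical names introduced here, and boxplot.stats() computes its fences from hinges rather than from quantile(), so the two approaches can disagree on other data.

```r
# Load the dataset
data(mtcars)

# Hypothetical helper: TRUE for values outside the k * IQR fences
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Rows of mtcars whose hp value is flagged as an outlier
hp_outliers <- mtcars[iqr_outliers(mtcars$hp), ]
hp_outliers

# Cross-check: outlying hp values as drawn by a boxplot
boxplot.stats(mtcars$hp)$out
```

On mtcars, both approaches flag a single car, the Maserati Bora with 335 hp.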
Lab Exercise 5
# Load the dataset
data(iris)
# Normalize the Sepal.Length column
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
iris$Sepal.Length.Normalized <- normalize(iris$Sepal.Length)
# Create a density plot for the normalized values
plot(density(iris$Sepal.Length.Normalized), main="Density Plot of Normalized Sepal Length",
     xlab="Normalized Sepal Length")
# Create a scatter plot between the normalized Sepal.Length and Sepal.Width
plot(iris$Sepal.Length.Normalized, iris$Sepal.Width,
     main="Scatter Plot of Normalized Sepal Length vs Sepal Width",
     xlab="Normalized Sepal Length", ylab="Sepal Width", col=iris$Species)
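A quick sanity check on the normalization is to confirm that the scaled values span exactly 0 to 1 and that the correlation with Sepal.Width is unchanged, since min-max scaling is a linear transformation. The sketch below works on a separate vector instead of adding a column to iris; sepal_norm, cor_raw, and cor_norm are names introduced here for illustration.

```r
# Load the dataset
data(iris)

# Min-max normalization, as in the answer above
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Keep the normalized values in a separate vector (illustrative name)
sepal_norm <- normalize(iris$Sepal.Length)

# The normalized values should run from 0 to 1
range(sepal_norm)

# Correlation with Sepal.Width before and after normalization
cor_raw  <- cor(iris$Sepal.Length, iris$Sepal.Width)
cor_norm <- cor(sepal_norm, iris$Sepal.Width)
all.equal(cor_raw, cor_norm)
```

Because the shape of the data is preserved, the scatter plot looks the same as one drawn from the raw values; only the horizontal axis scale changes.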