Creating EDA Reports Using ggplot2 in R Markdown

This lesson guide focuses on creating Exploratory Data Analysis (EDA) reports using ggplot2 in R Markdown. Students will learn the importance of ggplot2, apply the grammar of graphics for effective visualizations, and perform univariate and bivariate analyses. The guide includes practical examples and code snippets for generating structured and reproducible EDA reports.

Lesson Guide: Creating EDA Reports using ggplot2 in R Markdown


Course: Analytics Techniques and Tools using R

Learning Objectives
By the end of this lesson, students will be able to:

1. Understand the importance of using ggplot2 for EDA reports.
2. Apply the grammar of graphics to create effective visualizations.
3. Use R Markdown to generate structured and reproducible EDA reports with ggplot2.
4. Perform univariate and bivariate analysis using ggplot2.
5. Conduct statistical tests and visualize their results using ggplot2.

Lesson Content
1. Introduction to ggplot2 for EDA Reports

● Why use ggplot2?
● Grammar of Graphics: A structured approach to visualization
● Basic syntax of ggplot2
● Installing and loading ggplot2

# Install ggplot2 if it is not already installed
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", dependencies = TRUE)
}

# Load the package
library(ggplot2)
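
Because the lesson targets R Markdown reports, it may help to see how plotting code sits inside one. The following is a minimal sketch, not a prescribed template: it writes a small .Rmd file from R and renders it only when the rmarkdown package and pandoc are available. The file name eda_report.Rmd, the report title, and the chunk labels are illustrative assumptions.

```r
# Minimal sketch: write a small R Markdown EDA report and render it if possible.
# The file name, title, and chunk labels are illustrative, not from the lesson.
fence <- strrep("`", 3)  # three backticks, built here to keep this listing readable

rmd_lines <- c(
  "---",
  'title: "EDA Report: mtcars"',
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r setup, include=FALSE}"),
  "knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)",
  "library(ggplot2)",
  fence,
  "",
  "## Distribution of MPG",
  "",
  paste0(fence, "{r mpg-histogram}"),
  "ggplot(mtcars, aes(x = mpg)) +",
  "  geom_histogram(binwidth = 2, fill = \"lightblue\", color = \"black\")",
  fence
)

rmd_path <- file.path(tempdir(), "eda_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Rendering needs the rmarkdown package and a pandoc installation.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, quiet = TRUE)
}
```

Each later snippet in this guide could go into its own named chunk in such a file, so the code, plots, and test output are regenerated together whenever the report is rendered.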

2. Grammar of Graphics in ggplot2

ggplot2 builds a plot from layers that are combined with the + operator:

● Data Layer: The dataset (a data frame) used for visualization.
● Aesthetics Layer (aes()): Maps variables to visual properties such as position, color, and size.
● Geometry Layer (geom_*()): Defines the type of visualization, such as points, bars, or boxplots.
● Faceting (facet_wrap() or facet_grid()): Splits a plot into panels by categorical variables.
● Theme and Labels (theme(), labs()): Customize the appearance, titles, and axis labels.

# Basic ggplot structure (data, variable1, and variable2 are placeholders)
ggplot(data, aes(x = variable1, y = variable2)) +
  geom_point()
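
To make the layers concrete, here is a small sketch that uses every layer from the list above on the built-in mtcars data. The choice of variables, the factor labels, and theme_minimal() are illustrative assumptions rather than part of the lesson.

```r
library(ggplot2)

# Data layer: mtcars, with cyl and am recoded as labelled factors
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

layered_plot <- ggplot(cars, aes(x = wt, y = mpg, color = cyl)) +  # aesthetics
  geom_point(size = 2) +                                            # geometry
  facet_wrap(~ am) +                                                # faceting
  labs(
    title = "MPG vs Weight by Transmission",
    x = "Weight (1000 lbs)", y = "MPG", color = "Cylinders"
  ) +                                                               # labels
  theme_minimal()                                                   # theme

print(layered_plot)
```

Each + adds one layer, so a plot can be built up and checked step by step while exploring.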

3. Data Structure Analysis

3.1. Understanding the Dataset


# Load dataset (Example: mtcars)
data <- mtcars

# Check dataset structure
str(data)

# Summary statistics
summary(data)
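
str() shows that every mtcars column is stored as a number, including variables that are really categories. A possible follow-up, sketched below, recodes those columns as factors in a separate copy. Treating cyl, vs, am, and gear as categorical is an assumption, and the name data_typed is hypothetical.

```r
# Sketch: recode category-like columns as factors in a copy of the data.
# Which columns count as categorical is a judgment call, assumed here.
data_typed <- mtcars
cat_vars <- c("cyl", "vs", "am", "gear")
data_typed[cat_vars] <- lapply(data_typed[cat_vars], factor)

# Column classes after recoding
sapply(data_typed, class)

# Frequency table for one categorical variable
table(data_typed$cyl)
```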

3.2. Data Quality and Handling Missing Values


# Check for missing values (total count across the whole data frame)
sum(is.na(data))

# Handling missing values (one simple strategy: drop incomplete rows)
data <- na.omit(data) # Remove rows with missing values
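
A single total does not show where the gaps are. The sketch below counts missing values per column and plots them. Since mtcars has no missing values, it uses a hypothetical copy named demo with a few values deliberately blanked out.

```r
library(ggplot2)

# Hypothetical example data: a copy of mtcars with a few values set to NA
demo <- mtcars
demo$mpg[c(2, 5)] <- NA
demo$hp[7] <- NA

# Missing values per column instead of one overall total
na_counts <- colSums(is.na(demo))
na_df <- data.frame(variable = names(na_counts), n_missing = as.vector(na_counts))

na_plot <- ggplot(na_df, aes(x = reorder(variable, n_missing), y = n_missing)) +
  geom_col(fill = "lightblue", color = "black") +
  coord_flip() +
  labs(title = "Missing Values per Variable", x = "Variable", y = "Missing values")
print(na_plot)

# Number of rows that na.omit() would remove
n_dropped <- nrow(demo) - nrow(na.omit(demo))
n_dropped
```

Checking n_dropped before calling na.omit() shows how much data the row-removal strategy would cost.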

4. Univariate Analysis

4.1. Understanding Distribution and Normality


# Shapiro-Wilk test for normality
# H0: the sample comes from a normal distribution; a small p-value is evidence against normality
shapiro.test(data$mpg)
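
A normality test is easier to interpret next to a plot. One common companion is a normal Q-Q plot, sketched here with the test result placed in the subtitle. The titles and the red reference line are illustrative choices.

```r
library(ggplot2)

# Run the test once and reuse its result in the plot
sw <- shapiro.test(mtcars$mpg)

qq_plot <- ggplot(mtcars, aes(sample = mpg)) +
  stat_qq() +                      # sample quantiles vs theoretical normal quantiles
  stat_qq_line(color = "red") +    # reference line through the quartiles
  labs(
    title = "Normal Q-Q Plot of MPG",
    subtitle = sprintf("Shapiro-Wilk W = %.3f, p = %.3f", sw$statistic, sw$p.value),
    x = "Theoretical quantiles", y = "Sample quantiles"
  )
print(qq_plot)
```

Points that stay close to the line suggest an approximately normal distribution, and systematic curvature suggests skew or heavy tails.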

4.2. Visualizing Data Distribution with ggplot2


# Histogram
ggplot(data, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "lightblue", color = "black") +
  labs(title = "Histogram of MPG", x = "MPG", y = "Count")

# Boxplot
ggplot(data, aes(y = mpg)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Boxplot of MPG", y = "MPG")
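
The histogram can also be combined with a smoothed density curve and a marker for the mean, which helps when judging skew. This is a sketch under one assumption: after_stat() requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

mpg_mean <- mean(mtcars$mpg)

dist_plot <- ggplot(mtcars, aes(x = mpg)) +
  # Scale the histogram to density so it shares a y-axis with the curve
  geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                 fill = "lightblue", color = "black") +
  geom_density(color = "red") +
  # Dashed vertical line at the sample mean
  geom_vline(xintercept = mpg_mean, linetype = "dashed") +
  labs(title = "Distribution of MPG", x = "MPG", y = "Density")
print(dist_plot)
```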

4.3. Outlier Detection


# Boxplot-based outliers (1.5 * IQR rule)
Q1 <- quantile(data$mpg, 0.25)
Q3 <- quantile(data$mpg, 0.75)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
outliers <- data$mpg[data$mpg < lower_bound | data$mpg > upper_bound]

# 3-SD Rule Outliers
data_mean <- mean(data$mpg)
data_sd <- sd(data$mpg)
lower_sd_bound <- data_mean - 3 * data_sd
upper_sd_bound <- data_mean + 3 * data_sd
outliers_sd <- data$mpg[data$mpg < lower_sd_bound | data$mpg > upper_sd_bound]
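
The IQR rule above can be wrapped in a small function and used to highlight flagged points in a plot. The helper name is_iqr_outlier and the colors are hypothetical choices for this sketch.

```r
library(ggplot2)

# Hypothetical helper: TRUE where x lies outside the k * IQR fences
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  reach <- k * (q[2] - q[1])
  x < q[1] - reach | x > q[2] + reach
}

cars <- mtcars
cars$model <- rownames(mtcars)
cars$outlier <- is_iqr_outlier(cars$mpg)

# Which cars are flagged
cars[cars$outlier, c("model", "mpg")]

outlier_plot <- ggplot(cars, aes(x = "", y = mpg)) +
  geom_boxplot(fill = "lightblue", outlier.shape = NA) +  # hide default outlier dots
  geom_jitter(aes(color = outlier), width = 0.1, height = 0) +
  scale_color_manual(values = c("FALSE" = "grey40", "TRUE" = "red")) +
  labs(title = "MPG with IQR Outliers Highlighted", x = NULL, y = "MPG",
       color = "Outlier")
print(outlier_plot)
```

With R's default quantile method, the IQR rule flags a single mtcars value (33.9 mpg), while the 3-SD rule flags none. The two rules can disagree, so it is worth stating in the report which one was used.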

5. Bivariate Analysis

5.1. Categorical vs Categorical (Chi-Square Test & Stacked Bar Plots)


# Stacked bar plot (each bar shows the share of gear counts within a cylinder group)
ggplot(data, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar(position = "fill") +
  labs(title = "Proportion of Gears by Cylinder Count", x = "Cylinders",
       y = "Proportion", fill = "Gears")

# Chi-Square Test
chisq.test(table(data$cyl, data$gear))
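
On a dataset as small as mtcars, chisq.test() warns that the chi-squared approximation may be incorrect because several expected cell counts are low. A possible check, sketched below, inspects the expected counts and falls back to Fisher's exact test. The threshold of 5 is a common rule of thumb, not a requirement from the lesson.

```r
# Contingency table of cylinders by gears
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Keep the test object so its expected counts can be inspected
chi <- suppressWarnings(chisq.test(tab))
round(chi$expected, 2)

# Number of cells with an expected count below 5 (rule-of-thumb threshold)
small_cells <- sum(chi$expected < 5)
small_cells

# Fisher's exact test does not rely on the large-sample approximation
fisher_result <- fisher.test(tab)
fisher_result$p.value
```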

5.2. Categorical vs Numerical (T-Test, ANOVA, Wilcoxon, Kruskal-Wallis)


# Boxplot comparison
ggplot(data, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "MPG by Cylinder Count", x = "Cylinders", y = "MPG")

# T-Test for two groups (am: 0 = automatic, 1 = manual)
t.test(mpg ~ am, data = data)

# ANOVA for multiple groups (cyl must be a factor; as a number it is fit as a single slope)
anova(lm(mpg ~ factor(cyl), data = data))

# Wilcoxon Test (non-parametric alternative to the t-test)
wilcox.test(mpg ~ am, data = data)

# Kruskal-Wallis Test (non-parametric alternative to ANOVA)
kruskal.test(mpg ~ cyl, data = data)
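
One way to visualize a test result is to print it on the plot it belongs to. The sketch below runs the Kruskal-Wallis test and the one-way ANOVA on cyl as a factor, then places the Kruskal-Wallis result in the boxplot subtitle. The wording of the subtitle is an illustrative choice.

```r
library(ggplot2)

cars <- mtcars
cars$cyl <- factor(cars$cyl)  # three groups: 4, 6, and 8 cylinders

kw <- kruskal.test(mpg ~ cyl, data = cars)
aov_table <- anova(lm(mpg ~ cyl, data = cars))
aov_table  # with cyl as a factor, the cyl row has 2 degrees of freedom

test_label <- sprintf(
  "Kruskal-Wallis chi-squared = %.2f, df = %d, p = %.2g",
  kw$statistic, as.integer(kw$parameter), kw$p.value
)

group_plot <- ggplot(cars, aes(x = cyl, y = mpg)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "MPG by Cylinder Count", subtitle = test_label,
       x = "Cylinders", y = "MPG")
print(group_plot)
```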

5.3. Numerical vs Numerical (Correlation & Regression)


# Scatterplot with trend line
ggplot(data, aes(x = hp, y = mpg)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "HP vs MPG", x = "Horsepower", y = "MPG")

# Correlation matrix
cor(data[, c("mpg", "hp", "wt")])
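
The correlation matrix can itself be plotted, and a single pair can be tested for significance with cor.test(). The following sketch reshapes the matrix into long form with base R and draws it as a heatmap. The column names var1, var2, and r and the color scale are assumptions made for illustration.

```r
library(ggplot2)

vars <- c("mpg", "hp", "wt")
cor_mat <- cor(mtcars[, vars])

# Long format: one row per pair of variables
cor_long <- as.data.frame(as.table(cor_mat), responseName = "r")
names(cor_long)[1:2] <- c("var1", "var2")

cor_plot <- ggplot(cor_long, aes(x = var1, y = var2, fill = r)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.2f", r))) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       limits = c(-1, 1)) +
  labs(title = "Correlation Heatmap", x = NULL, y = NULL, fill = "Pearson r")
print(cor_plot)

# Significance test and confidence interval for one pair
hp_mpg_test <- cor.test(mtcars$hp, mtcars$mpg)
hp_mpg_test
```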
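
6. Summary & Next Steps
● Key Takeaways: ggplot2's layered grammar gives a structured, repeatable way to build EDA visualizations, and R Markdown keeps the code, plots, and test results together in one reproducible report.
● Next Lesson: Advanced Data Visualization Techniques with ggplot2.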