DS-R Block 4 All
Setting Up ggplot2
1. Scatter Plot
2. Bar Plot
3. Line Plot
A line plot is ideal for showing trends over time or ordered categories.
4. Box Plot
A box plot is useful for visualizing the distribution of a dataset and identifying
outliers.
5. Histograms
6. Faceting
Faceting allows you to create multiple plots based on the values of one or more
categorical variables.
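As a minimal sketch of faceting, the snippet below splits a scatter plot of mtcars into one panel per cylinder count using facet_wrap(). The object name plot_with_facet is an assumption, chosen only so that it lines up with the theme example later in this section; the source does not show how that object was originally built.

```r
library(ggplot2)

# Assumed reconstruction: one scatter panel for each value of cyl
plot_with_facet <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl) +
  ggtitle("MPG vs Horsepower by Number of Cylinders")

print(plot_with_facet)
```

facet_wrap() lays panels out in a wrapped ribbon, which tends to suit a single categorical variable; facet_grid() is the usual choice when crossing two variables.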
You can modify the coordinate system. For example, you might want to use a log scale:
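The original code for this step is not present in the source, so the following is only an illustrative sketch. It puts the x axis on a log10 scale with scale_x_log10(), which transforms the data before any statistics are computed. ggplot2 also offers a coordinate-level transform that is applied after statistics; which of the two the original slide used is unknown. The name log_plot is hypothetical.

```r
library(ggplot2)

# Hypothetical example: horsepower on a log10 axis
log_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  scale_x_log10() +
  xlab("Horsepower (log10 scale)") +
  ylab("Miles per Gallon")

print(log_plot)
```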
Step 6: Themes
Finally, you can customize the plot’s theme:
# Customize theme
final_plot <- plot_with_facet + theme_minimal() +
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5))
print(final_plot)
Data Science using R
Exploring ggplot R programming
Sample Dataset
We will use the mtcars dataset, which is included with R. It contains information about
various car models, including miles per gallon (mpg), number of cylinders, horsepower,
and more.
# Load ggplot2 (install once with install.packages("ggplot2") if needed)
library(ggplot2)
# Load the mtcars dataset
data(mtcars)
# View the first few rows of the dataset
head(mtcars)
Basic Structure of a ggplot2 Plot
The basic structure of a ggplot2 plot is:
ggplot(data, aes(x = x_variable, y = y_variable)) + geom_type()
Types of Plots with ggplot2
1. Scatter Plot
A scatter plot visualizes the relationship between two continuous variables.
# Scatter plot of mpg vs. hp
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point() +
ggtitle("Scatter Plot of MPG vs Horsepower") +
xlab("Horsepower") +
ylab("Miles per Gallon")
2. Line Plot
A line plot is used to show trends over a continuous variable.
# Create a line plot of mpg vs. hp
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_line() +
ggtitle("Line Plot of MPG vs Horsepower") +
xlab("Horsepower") +
ylab("Miles per Gallon")
3. Bar Plot
A bar plot compares the counts or values of different categories.
# Bar plot of counts of cars by number of cylinders
ggplot(mtcars, aes(x = factor(cyl)) ) +
geom_bar(fill = "steelblue") +
ggtitle("Bar Plot of Number of Cylinders") +
xlab("Number of Cylinders") +
ylab("Count of Cars")
4. Box Plot
A box plot shows the distribution of a continuous variable and can highlight outliers.
# Box plot of mpg by number of cylinders
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot(fill = "lightgreen") +
ggtitle("Box Plot of MPG by Number of Cylinders") +
xlab("Number of Cylinders") +
ylab("Miles per Gallon")
5. Histogram
A histogram visualizes the distribution of a single continuous variable.
# Histogram of mpg
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "orange", color = "black") +
ggtitle("Histogram of Miles per Gallon") +
xlab("Miles per Gallon") +
ylab("Frequency")
Tidying data is a crucial step in the data analysis process, particularly when
using R and the tidyverse framework. The concept of tidying data was popularized
by Hadley Wickham and focuses on structuring data in a way that makes it easier to
visualize and analyze. The basic principles involve ensuring that each variable is a
column, each observation is a row, and each type of observational unit forms a table.
Tidying data - Variables to visuals R programming
Example Dataset
Let’s create a sample dataset that needs to be tidied:
# Create a sample dataset
data <- data.frame(
id = 1:4,
year_2020 = c(5, 3, 6, 2),
year_2021 = c(7, 8, 5, 6)
)
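The source does not show the tidying step itself, so here is one plausible sketch using pivot_longer() from the tidyr package. The column names year and sales are taken from the long-format table in this section; the object name tidy_data is an assumption.

```r
library(tidyr)

# Recreate the wide sample dataset so the block runs on its own
data <- data.frame(
  id = 1:4,
  year_2020 = c(5, 3, 6, 2),
  year_2021 = c(7, 8, 5, 6)
)

# Gather the year_* columns into one "year" column and one "sales" column
tidy_data <- pivot_longer(
  data,
  cols = starts_with("year_"),
  names_to = "year",
  values_to = "sales"
)

print(tidy_data)
```

Each row of the result is one observation (one id in one year), which is the shape ggplot2 expects when a variable such as year is mapped to an aesthetic.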
id year sales
1 year_2020 5
1 year_2021 7
2 year_2020 3
2 year_2021 8
3 year_2020 6
3 year_2021 5
4 year_2020 2
4 year_2021 6
Aesthetics in ggplot2
Aesthetics are defined inside the aes() function and can map data variables to
visual properties such as:
1. Position: Where the data point is located in the plot (x and y axes).
2. Color: The color of the points, lines, or bars.
3. Size: The size of points (in ggplot2 3.4.0 and later, the width of lines is set with linewidth instead).
4. Shape: The shape of points (e.g., circles, triangles).
5. Fill: The fill color for bars or areas.
6. Alpha: The transparency of points or lines.
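Alpha is the one property in this list that the later examples do not demonstrate, so here is a small sketch. It also illustrates the difference between mapping an aesthetic to a variable inside aes() (color here) and setting it to a constant outside aes() (size and alpha here). The value 0.6 and the name alpha_plot are arbitrary choices for illustration.

```r
library(ggplot2)

# color is mapped to a variable; size and alpha are fixed constants
alpha_plot <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.6)

print(alpha_plot)
```

Semi-transparent points make overlapping observations easier to see, since denser regions appear darker.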
Aesthetics – Attributes and visible aesthetics R programming
Basic Structure of Aesthetics
The general structure for defining aesthetics in a ggplot is as follows:
ggplot(data, aes(x = variable_x, y = variable_y, color = variable_color, size
= variable_size)) + geom_type()
Example Dataset
Let's use the mtcars dataset to demonstrate how to use aesthetics effectively:
# Load necessary library
library(ggplot2)
# Load the mtcars dataset
data(mtcars)
Using Aesthetics in ggplot2
1. Basic Scatter Plot with Aesthetics
Here’s how you can create a basic scatter plot using mpg (miles per gallon) vs.
hp (horsepower), with color representing the number of cylinders (cyl):
# Basic scatter plot with aesthetics
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
ggtitle("Scatter Plot of MPG vs Horsepower") +
xlab("Horsepower") +
ylab("Miles per Gallon")
Attributes of Aesthetics
1. Color
The color aesthetic is often used to differentiate groups or categories. In the example
above, the points are colored based on the number of cylinders.
2. Size
The size aesthetic allows you to indicate a third variable by varying the size of
points:
# Scatter plot with size aesthetic
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl), size = wt)) +
geom_point() +
ggtitle("Scatter Plot of MPG vs Horsepower with Weight") +
xlab("Horsepower") +
ylab("Miles per Gallon")
3. Shape
You can also use the shape aesthetic to differentiate categories visually:
# Scatter plot with shape aesthetic
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl), shape = factor(gear))) +
geom_point(size = 3) +
ggtitle("Scatter Plot with Color and Shape Aesthetics") +
xlab("Horsepower") +
ylab("Miles per Gallon")
4. Fill
The fill aesthetic is used primarily for bar plots and areas:
# Bar plot with fill aesthetic
ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
geom_bar(position = "dodge") +
ggtitle("Count of Cars by Number of Cylinders and Gears") +
xlab("Number of Cylinders") +
ylab("Count") +
scale_fill_brewer(palette = "Set2") # Custom color palette
A line plot connects data points to show trends over a continuous variable.
A stacked bar chart shows the breakdown of a total across different categories.
# Create a stacked bar chart comparing the number of cars by cylinders and gears
ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
geom_bar(position = "stack") +
ggtitle("Stacked Bar Chart of Cylinders and Gears") +
xlab("Number of Cylinders") +
ylab("Count of Cars") +
scale_fill_brewer(palette = "Set2", name = "Gears")