DS-R Block 4 All

The document provides an overview of data visualization in R using the ggplot2 package, covering basic concepts, installation, and various types of plots such as scatter, bar, line, box, and histogram. It emphasizes the importance of the Grammar of Graphics, which breaks down plots into components like data, aesthetics, geometries, and themes. Additionally, it discusses tidying data principles and how to visualize tidied data effectively.

Uploaded by Jeya preetha

Data Science using R

Data Visualization R programming

© Kalasalingam Academy of Research and Education


Data visualization is a critical aspect of data analysis, as it helps convey insights
and patterns in a dataset effectively. In R, there are several packages available for
creating visualizations, with ggplot2 being one of the most popular and versatile
options. Below, we’ll explore some basic concepts and examples of data visualization
in R using ggplot2.
Setting Up ggplot2

First, you need to install and load the ggplot2 package:

# Install ggplot2 if you haven't already


install.packages("ggplot2")

# Load the ggplot2 package


library(ggplot2)
Sample Dataset
For our visualization examples, let’s create a sample dataset:
# Create a sample data frame
data <- data.frame(
Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
Age = c(25, 30, 35, 40, 28),
Salary = c(50000, 60000, 70000, 80000, 65000),
Department = c("HR", "IT", "IT", "HR", "Finance")
)
# Print the sample data
print(data)
Basic Plotting with ggplot2

The basic structure of a ggplot2 plot is:

ggplot(data, aes(x = x_variable, y = y_variable)) +


geom_type()
Where:

• data is your dataset.

• aes() defines the aesthetic mappings (which variables to plot).

• geom_type() defines the type of plot (e.g., points, lines, bars).



1. Scatter Plot

A scatter plot is used to visualize the relationship between two continuous variables.

# Create a scatter plot


ggplot(data, aes(x = Age, y = Salary)) +
geom_point() +
ggtitle("Scatter Plot of Age vs Salary") +
xlab("Age") +
ylab("Salary")

2. Bar Plot

A bar plot is useful for comparing categorical data.

# Create a bar plot for Salary by Department


ggplot(data, aes(x = Department, y = Salary, fill = Department)) +
geom_bar(stat = "identity") +
ggtitle("Bar Plot of Salary by Department") +
xlab("Department") +
ylab("Salary") +
theme_minimal()

3. Line Plot

A line plot is ideal for showing trends over time or ordered categories.

# Create a line plot for Salary by Age


ggplot(data, aes(x = Age, y = Salary, group = 1)) +
geom_line() +
geom_point() +
ggtitle("Line Plot of Salary by Age") +
xlab("Age") +
ylab("Salary")

4. Box Plot

A box plot is useful for visualizing the distribution of a dataset and identifying
outliers.

# Create a box plot of Salary by Department


ggplot(data, aes(x = Department, y = Salary)) +
geom_boxplot(fill = "lightblue") +
ggtitle("Box Plot of Salary by Department") +
xlab("Department") +
ylab("Salary")

5. Histograms

Histograms are used to visualize the distribution of a single continuous variable.

# Create a histogram of Salary


ggplot(data, aes(x = Salary)) +
geom_histogram(binwidth = 5000, fill = "lightgreen", color = "black") +
ggtitle("Histogram of Salary") +
xlab("Salary") +
ylab("Frequency")

6. Faceting

Faceting allows you to create multiple plots based on the values of one or more
categorical variables.

# Create faceted plots for Salary by Department


ggplot(data, aes(x = Age, y = Salary)) +
geom_point() +
facet_wrap(~ Department) +
ggtitle("Faceted Scatter Plot of Salary by Department")
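Faceting on more than one categorical variable can be sketched with facet_grid(). The example below is only an illustration: it assumes the built-in mtcars dataset instead of the five-row sample data (which is too small to fill a grid), and the name p_grid is hypothetical.

```r
library(ggplot2)

# facet_grid() arranges panels in a grid:
# one row per value of gear, one column per value of cyl.
p_grid <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  facet_grid(gear ~ cyl, labeller = label_both) +
  ggtitle("MPG vs Horsepower by Gears (rows) and Cylinders (columns)")

print(p_grid)
```

facet_wrap() is usually the better fit for a single variable with many levels, while facet_grid() suits the crossing of two variables.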
Grammar of Graphics R programming

The Grammar of Graphics is a foundational concept behind the ggplot2
package in R. It provides a systematic way to describe and build a wide range of data
visualizations based on the components that make up a plot. Understanding this
grammar will help you effectively create complex visualizations by layering different
components.
Components of the Grammar of Graphics
The Grammar of Graphics breaks down a plot into several key components:
1. Data: The dataset you want to visualize.
2. Aesthetics (aes): Aesthetic mappings that define how data variables are mapped to
visual properties (e.g., x and y positions, colors, shapes).
3. Geometries (geom): The geometric objects that represent the data points (e.g.,
points, lines, bars).
4. Statistics (stat): Statistical transformations that summarize or compute data values
(e.g., calculating means).
5. Coordinates: The coordinate system used for the plot (e.g., Cartesian, polar).
6. Facets: The arrangement of multiple plots based on a factor variable.
7. Theme: The overall visual appearance of the plot (background, gridlines, fonts).
Building a Plot with ggplot2

The basic structure of a ggplot2 plot is:

ggplot(data, aes(x = x_variable, y = y_variable)) +


geom_type() +
stat_type() +
coord_type() +
facet_grid() +
theme()
Example: Building a Plot Step by Step
Let’s use a sample dataset to demonstrate the Grammar of Graphics in action.
# Load necessary libraries
library(ggplot2)

# Create a sample dataset


data <- data.frame(
Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
Age = c(25, 30, 35, 40, 28),
Salary = c(50000, 60000, 70000, 80000, 65000),
Department = c("HR", "IT", "IT", "HR", "Finance")
)
Step 1: Basic Plot with Data and Aesthetics

First, we define the data and map the aesthetics:

# Basic plot with data and aesthetics


base_plot <- ggplot(data, aes(x = Age, y = Salary, color = Department)) +
ggtitle("Scatter Plot of Salary by Age")
Step 2: Add Geometries
Next, we add the geometric layer (scatter plot in this case):
# Add geometric layer (scatter plot)
plot_with_geom <- base_plot + geom_point(size = 3)
print(plot_with_geom)
Step 3: Add Statistics (Optional)
You can add statistical transformations, like smoothing:
# Add a smooth line
plot_with_stat <- plot_with_geom + geom_smooth(method = "lm", se = FALSE)
print(plot_with_stat)
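Step 4: Customize Scales and Coordinates (Optional)

You can change how values are positioned along an axis. For example, you might want a log scale on the y-axis. Note that scale_y_log10() is a scale transformation rather than a coordinate system; functions such as coord_flip() and coord_polar() change the coordinate system itself.

# Change to log scale for y-axis
plot_with_coord <- plot_with_stat + scale_y_log10()
print(plot_with_coord)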
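To make the difference between a scale and a coordinate system concrete, here is a minimal sketch that applies each to the same base plot. It rebuilds the sample data so it runs on its own; the names p, p_log and p_flipped are hypothetical.

```r
library(ggplot2)

# Same sample values as in the step-by-step example
data <- data.frame(
  Age = c(25, 30, 35, 40, 28),
  Salary = c(50000, 60000, 70000, 80000, 65000)
)

p <- ggplot(data, aes(x = Age, y = Salary)) +
  geom_point(size = 3)

# Scale: transforms the Salary values, the coordinate system stays Cartesian
p_log <- p + scale_y_log10()

# Coordinate system: swaps the x and y axes without touching the values
p_flipped <- p + coord_flip()

print(p_log)
print(p_flipped)
```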
Step 5: Faceting
You can create separate plots for each level of a categorical variable:
# Facet by Department
plot_with_facet <- plot_with_coord + facet_wrap(~ Department)
print(plot_with_facet)

Step 6: Themes
Finally, you can customize the plot’s theme:
# Customize theme
final_plot <- plot_with_facet + theme_minimal() +
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5))
print(final_plot)
Exploring ggplot R programming

Exploring ggplot2 in R involves understanding its capabilities for creating various
types of visualizations and how to customize them for better data presentation. Below is
a guide to help you get started with ggplot2, including creating different types of plots
and using advanced features.

Getting Started with ggplot2


First, ensure that you have the ggplot2 package installed and loaded:
# Install ggplot2 if you haven't already
install.packages("ggplot2")
# Load the ggplot2 package
library(ggplot2)
Sample Dataset
We will use the mtcars dataset, which is included with R. It contains information about
various car models, including miles per gallon (mpg), number of cylinders, horsepower,
and more.
# Load the mtcars dataset
data(mtcars)
# View the first few rows of the dataset
head(mtcars)
Basic Structure of a ggplot2 Plot
The basic structure of a ggplot2 plot is:
ggplot(data, aes(x = x_variable, y = y_variable)) + geom_type()
Types of Plots with ggplot2
1. Scatter Plot
A scatter plot visualizes the relationship between two continuous variables.
# Scatter plot of mpg vs. hp
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point() +
ggtitle("Scatter Plot of MPG vs Horsepower") +
xlab("Horsepower") +
ylab("Miles per Gallon")
2. Line Plot
A line plot is used to show trends over a continuous variable.
# Create a line plot of mpg vs. hp
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_line() +
ggtitle("Line Plot of MPG vs Horsepower") +
xlab("Horsepower") +
ylab("Miles per Gallon")
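In mtcars several cars share the same horsepower, so geom_line() on the raw rows draws vertical jumps wherever hp values repeat. One possible refinement, sketched here as an illustration rather than as part of the original example, is to let stat_summary() average mpg at each hp value before drawing the line. The name p_mean_line is hypothetical.

```r
library(ggplot2)

p_mean_line <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  # Raw observations, kept faint for context
  geom_point(alpha = 0.4) +
  # One line vertex per distinct hp value: the mean mpg at that hp
  stat_summary(fun = mean, geom = "line", color = "blue") +
  ggtitle("Mean MPG at Each Horsepower Value") +
  xlab("Horsepower") +
  ylab("Miles per Gallon")

print(p_mean_line)
```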
3. Bar Plot
A bar plot compares the counts or values of different categories.
# Bar plot of counts of cars by number of cylinders
ggplot(mtcars, aes(x = factor(cyl)) ) +
geom_bar(fill = "steelblue") +
ggtitle("Bar Plot of Number of Cylinders") +
xlab("Number of Cylinders") +
ylab("Count of Cars")
4. Box Plot
A box plot shows the distribution of a continuous variable and can highlight outliers.
# Box plot of mpg by number of cylinders
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot(fill = "lightgreen") +
ggtitle("Box Plot of MPG by Number of Cylinders") +
xlab("Number of Cylinders") +
ylab("Miles per Gallon")
5. Histogram
A histogram visualizes the distribution of a single continuous variable.
# Histogram of mpg
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "orange", color = "black") +
ggtitle("Histogram of Miles per Gallon") +
xlab("Miles per Gallon") +
ylab("Frequency")
Tidying data - Variables to visuals R programming

Tidying data is a crucial step in the data analysis process, particularly when
using R and the tidyverse framework. The concept of tidying data was popularized
by Hadley Wickham and focuses on structuring data in a way that makes it easier to
visualize and analyze. The basic principles involve ensuring that each variable is a
column, each observation is a row, and each type of observational unit forms a table.
Principles of Tidy Data

1. Each variable must have its own column.

2. Each observation must have its own row.

3. Each type of observational unit must form a table.


The tidyverse Package


To work with tidy data in R, you typically use the tidyverse package, which includes several
important packages, such as dplyr, ggplot2, and tidyr. Here's how to get started:
# Install tidyverse if you haven't already
install.packages("tidyverse")
# Load the tidyverse package
library(tidyverse)

Example Dataset
Let’s create a sample dataset that needs to be tidied:
# Create a sample dataset
data <- data.frame(
id = 1:4,
year_2020 = c(5, 3, 6, 2),
year_2021 = c(7, 8, 5, 6)
)
# View the original dataset


print(data)
This dataset contains observations of some variable (let’s say sales) across two years but
is in a wide format. We’ll tidy it into a long format.
Tidying Data with tidyr
We can use the pivot_longer() function from the tidyr package to convert our dataset
from wide to long format:
# Tidy the dataset
tidy_data <- data %>%
pivot_longer(cols = starts_with("year"),
names_to = "year",
values_to = "sales")

# View the tidied dataset


print(tidy_data)
Resulting Tidy Data


The resulting tidy_data dataset will look like this:

id  year       sales
1   year_2020  5
1   year_2021  7
2   year_2020  3
2   year_2021  8
3   year_2020  6
3   year_2021  5
4   year_2020  2
4   year_2021  6
Visualizing Tidied Data


Now that the data is tidy, we can easily visualize it using ggplot2. For instance, we
can create a line plot to show the trends over the years:
# Create a line plot
ggplot(tidy_data, aes(x = year, y = sales, group = id, color = as.factor(id))) +
geom_line() +
geom_point() +
ggtitle("Sales Over Years") +
xlab("Year") +
ylab("Sales") +
theme_minimal()
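In the plot above the year column still holds text such as "year_2020". A possible variation, shown as a sketch and not as the original author's method, strips the prefix and converts the year to an integer during pivoting by using the names_prefix and names_transform arguments of pivot_longer(). The name tidy_numeric is hypothetical.

```r
library(tidyr)
library(ggplot2)

# Same wide dataset as in the example above
data <- data.frame(
  id = 1:4,
  year_2020 = c(5, 3, 6, 2),
  year_2021 = c(7, 8, 5, 6)
)

tidy_numeric <- pivot_longer(
  data,
  cols = starts_with("year_"),
  names_to = "year",
  names_prefix = "year_",                     # drop the "year_" prefix
  names_transform = list(year = as.integer),  # "2020" -> 2020L
  values_to = "sales"
)

# With a numeric year the x-axis is continuous, so set the breaks explicitly
ggplot(tidy_numeric, aes(x = year, y = sales, group = id, color = factor(id))) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = c(2020, 2021)) +
  theme_minimal()
```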
Handling Multiple Variables


If you have a dataset with multiple variables that you want to visualize
simultaneously, you can use the pivot_longer() function to convert it into a tidy
format. For example, consider a dataset with sales data across different products and
years:
# Create a more complex dataset
complex_data <- data.frame(
product = c("A", "A", "B", "B"),
year = c(2020, 2021, 2020, 2021),
sales = c(5, 7, 3, 8)
)

# View the complex dataset


print(complex_data)
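As created here, complex_data already has one row per product and year, so it is in long (tidy) format and can be passed to ggplot2 directly. The sketch below plots it and, for comparison, uses pivot_wider() to show what the wide layout of the same data would look like. The name wide_data is hypothetical.

```r
library(tidyr)
library(ggplot2)

complex_data <- data.frame(
  product = c("A", "A", "B", "B"),
  year = c(2020, 2021, 2020, 2021),
  sales = c(5, 7, 3, 8)
)

# One line per product; year is treated as a category on the x-axis
ggplot(complex_data, aes(x = factor(year), y = sales,
                         group = product, color = product)) +
  geom_line() +
  geom_point() +
  xlab("Year") +
  ylab("Sales") +
  theme_minimal()

# The wide layout of the same data: one row per product, one column per year
wide_data <- pivot_wider(complex_data,
                         names_from = year,
                         values_from = sales,
                         names_prefix = "year_")
print(wide_data)
```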
Aesthetics – Attributes and visible aesthetics R programming

In ggplot2, aesthetics (often abbreviated as aes) are visual properties that help
convey the meaning of your data in a plot. They are essential for understanding how
data variables are represented visually, and they provide the foundation for creating
informative visualizations.

Aesthetics in ggplot2

Aesthetics are defined inside the aes() function and can map data variables to
visual properties such as:
1. Position: Where the data point is located in the plot (x and y axes).
2. Color: The color of the points, lines, or bars.
3. Size: The size of points or the width of lines.
4. Shape: The shape of points (e.g., circles, triangles).
5. Fill: The fill color for bars or areas.
6. Alpha: The transparency of points or lines.
Basic Structure of Aesthetics
The general structure for defining aesthetics in a ggplot is as follows:
ggplot(data, aes(x = variable_x, y = variable_y, color = variable_color, size = variable_size)) +
geom_type()
Example Dataset
Let's use the mtcars dataset to demonstrate how to use aesthetics effectively:
# Load necessary library
library(ggplot2)
# Load the mtcars dataset
data(mtcars)
Using Aesthetics in ggplot2
1. Basic Scatter Plot with Aesthetics
Here’s how you can create a basic scatter plot using mpg (miles per gallon) vs.
hp (horsepower), with color representing the number of cylinders (cyl):
# Basic scatter plot with aesthetics
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
ggtitle("Scatter Plot of MPG vs Horsepower") +
xlab("Horsepower") +
ylab("Miles per Gallon")
Attributes of Aesthetics
1. Color
The color aesthetic is often used to differentiate groups or categories. In the example
above, the points are colored based on the number of cylinders.
2. Size
The size aesthetic allows you to indicate a third variable by varying the size of
points:
# Scatter plot with size aesthetic
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl), size = wt)) +
geom_point() +
ggtitle("Scatter Plot of MPG vs Horsepower with Weight") +
xlab("Horsepower") +
ylab("Miles per Gallon")
3. Shape
You can also use the shape aesthetic to differentiate categories visually:
# Scatter plot with shape aesthetic
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl), shape = factor(gear))) +
geom_point(size = 3) +
ggtitle("Scatter Plot with Color and Shape Aesthetics") +
xlab("Horsepower") +
ylab("Miles per Gallon")
4. Fill
The fill aesthetic is used primarily for bar plots and areas:
# Bar plot with fill aesthetic
ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
geom_bar(position = "dodge") +
ggtitle("Count of Cars by Number of Cylinders and Gears") +
xlab("Number of Cylinders") +
ylab("Count") +
scale_fill_brewer(palette = "Set2") # Custom color palette
Additional Aesthetic Attributes


1. Alpha
The alpha aesthetic controls the transparency of the plotted points or lines:
# Scatter plot with alpha aesthetic
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl), alpha = wt)) +
geom_point(size = 3) +
ggtitle("Scatter Plot with Alpha Aesthetic") +
xlab("Horsepower") +
ylab("Miles per Gallon")
Mapping vs. Setting Aesthetics
It’s important to understand the difference between mapping aesthetics and
setting them:
• Mapping: When you map an aesthetic inside the aes() function, ggplot2 takes the
value of that aesthetic from a variable in the dataset, so it can differ from point to point.
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) + geom_point() # Mapping
• Setting: When you set an aesthetic outside the aes() function, a single fixed
value is applied to that aesthetic across all data points.
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point(color = "blue") # Setting
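A common pitfall follows from this distinction: writing a constant such as "blue" inside aes() maps it as if it were a one-level category, so the points get the default palette colour instead of blue. The sketch below checks this with layer_data(), which returns the values ggplot2 computed for a layer. The variable names are hypothetical.

```r
library(ggplot2)

# Setting: every point is drawn in blue
p_set <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "blue")

# Mapping a constant: "blue" is treated as a data value, not as a colour
p_mapped_constant <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(aes(color = "blue"))

# Inspect the colours actually assigned to the points
set_colours <- unique(layer_data(p_set)$colour)
mapped_colours <- unique(layer_data(p_mapped_constant)$colour)

print(set_colours)     # the literal colour that was set
print(mapped_colours)  # a single palette colour chosen by the colour scale
```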
Geometrics – Histogram, Scatter, Line, Bar, Stacked Bar charts

In ggplot2, geometric objects (often referred to as "geoms") are the visual
representations of your data. Different geoms allow you to create various types of plots,
such as histograms, scatter plots, line plots, bar plots, and stacked bar charts. Below is an
overview of these common geoms, including example code for each type using the mtcars
dataset.
1. Histogram
A histogram displays the distribution of a continuous variable by dividing the data into bins.
# Load necessary libraries
library(ggplot2)
# Create a histogram of the 'mpg' variable (Miles per Gallon)
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "blue", color = "black", alpha = 0.7) +
ggtitle("Histogram of Miles per Gallon") +
xlab("Miles per Gallon") +
ylab("Frequency")
2. Scatter Plot

A scatter plot visualizes the relationship between two continuous variables.

# Create a scatter plot of 'hp' (Horsepower) vs 'mpg' (Miles per Gallon)


ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
ggtitle("Scatter Plot of MPG vs Horsepower") +
xlab("Horsepower") +
ylab("Miles per Gallon") +
scale_color_manual(values = c("red", "green", "blue"), name = "Cylinders")
3. Line Plot

A line plot connects data points to show trends over a continuous variable.

# Create a line plot of 'mpg' over 'hp'


ggplot(mtcars, aes(x = hp, y = mpg, group = 1)) + # group = 1 connects all points
geom_line(color = "blue") +
geom_point(size = 3, color = "red") + # Add points for visibility
ggtitle("Line Plot of MPG vs Horsepower") +
xlab("Horsepower") +
ylab("Miles per Gallon")
4. Bar Plot

A bar plot compares the counts or values of different categories.

# Create a bar plot of the count of cars by number of cylinders


ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar(fill = "lightblue", color = "black") +
ggtitle("Bar Plot of Number of Cylinders") +
xlab("Number of Cylinders") +
ylab("Count of Cars")
5. Stacked Bar Chart

A stacked bar chart shows the breakdown of a total across different categories.

# Create a stacked bar chart comparing the number of cars by cylinders and gears
ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
geom_bar(position = "stack") +
ggtitle("Stacked Bar Chart of Cylinders and Gears") +
xlab("Number of Cylinders") +
ylab("Count of Cars") +
scale_fill_brewer(palette = "Set2", name = "Gears")
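When the cylinder groups differ in size, a stacked bar of counts makes the gear shares hard to compare. A possible variation, offered only as a sketch, is position = "fill", which rescales every bar to a height of 1 so the segments show proportions. The name p_fill and the percent label function are illustrative choices.

```r
library(ggplot2)

p_fill <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  # position = "fill" stacks the segments and normalises each bar to 1
  geom_bar(position = "fill") +
  # Show the 0-1 proportions as percentages on the axis
  scale_y_continuous(labels = function(x) paste0(x * 100, "%")) +
  ggtitle("Share of Gears within Each Cylinder Group") +
  xlab("Number of Cylinders") +
  ylab("Share of Cars") +
  scale_fill_brewer(palette = "Set2", name = "Gears")

print(p_fill)
```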
