Data Visualization With R Ggplot2
Getting started with ggplot2
Toolbox
Content
Build a plot layer by layer
Geoms Plot
geom_area() area plot
geom_bar(stat = "identity") bar chart
geom_line() line plot
geom_path()
geom_point() scatterplot
geom_polygon() polygons
geom_rect(), geom_tile() and geom_raster() rectangles
Examples
df <- data.frame(
x = c(3, 1, 5),
y = c(2, 4, 6),
label = c("a","b","c")
)
p <- ggplot(df, aes(x, y, label = label)) +
labs(x = NULL, y = NULL) + # Hide axis label
theme(plot.title = element_text(size = 12)) # Shrink plot title
p + geom_point() + ggtitle("point")
p + geom_text() + ggtitle("text")
p + geom_bar(stat = "identity") + ggtitle("bar")
p + geom_tile() + ggtitle("raster")
p + geom_line() + ggtitle("line")
p + geom_area() + ggtitle("area")
p + geom_path() + ggtitle("path")
p + geom_polygon() + ggtitle("polygon")
Exercise 2
1. What geoms would you use to draw each of the following named plots?
• Scatterplot
• Line chart
• Histogram
• Bar chart
• Pie chart
2. What’s the difference between geom_path() and geom_polygon()? What’s the difference between
geom_path() and geom_line()?
3. What low-level geoms are used to draw geom_smooth()? What about geom_boxplot() and
geom_violin()?
Labels
• Adding text to a plot can be quite tricky. ggplot2 doesn’t have all the answers, but does provide
some tools to make your life a little easier. The main tool is geom_text(), which adds labels at the
specified x and y positions. geom_text() has the most aesthetics of any geom, because there are so
many ways to control the appearance of a text:
• family gives the name of a font. There are only three fonts that are guaranteed to work
everywhere: “sans” (the default), “serif”, or “mono”:
df <- data.frame(x = 1, y = 3:1, family = c("sans", "serif", "mono"))
ggplot(df, aes(x, y)) +
geom_text(aes(label = family, family = family))
• fontface specifies the face: “plain” (the default), “bold” or “italic”.
df <- data.frame(x = 1, y = 3:1, face = c("plain", "bold", "italic"))
ggplot(df, aes(x, y)) +
geom_text(aes(label = face, fontface = face))
• You can adjust the alignment of the text with the hjust (“left”, “center”, “right”, “inward”, “outward”) and
vjust (“bottom”, “middle”, “top”, “inward”, “outward”) aesthetics. The default alignment is centered. One of
the most useful alignments is “inward”: it aligns text towards the middle of the plot:
df <- data.frame(
x = c(1, 1, 2, 2, 1.5),
y = c(1, 2, 1, 2, 1.5),
text = c("bottom-left", "top-left",
"bottom-right", "top-right", "center")
)
ggplot(df, aes(x, y)) +
geom_text(aes(label = text))
ggplot(df, aes(x, y)) +
geom_text(aes(label = text), vjust = "inward", hjust = "inward")
• size controls the font size. Unlike most tools, ggplot2 uses mm, rather than the usual points (pts).
This makes it consistent with other size units in ggplot2. (There are 72.27 pts and 25.4 mm in an inch,
so to convert from points to mm, just multiply by 25.4 / 72.27.)
• angle specifies the rotation of the text in degrees.
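As a quick check of the size conversion, the sketch below turns a font size in points into the mm value that geom_text() expects. The helper name pt_to_mm() and the one-row data frame are made up for illustration; .pt is the constant that ggplot2 exports for 72.27 / 25.4.

```r
library(ggplot2)

# Hypothetical helper: convert a font size in points to ggplot2's mm units.
# ggplot2 exports .pt = 72.27 / 25.4 (points per mm), so we divide by it.
pt_to_mm <- function(pt) pt / .pt

# Illustrative data, not part of the slides
df_size <- data.frame(x = 1, y = 1, label = "12 pt text")

p_size <- ggplot(df_size, aes(x, y, label = label)) +
  geom_text(size = pt_to_mm(12))  # 12 pt is roughly 4.22 mm

pt_to_mm(12)
```

Writing the size as pt_to_mm(12) keeps the intended point size visible in the code instead of a bare mm value.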
• You can map data values to these aesthetics, but use restraint: it is hard to perceive the relationship
between variables mapped to these aesthetics. geom_text() also has three parameters. Unlike the
aesthetics, these only take single values, so they must be the same for all labels:
• Often you want to label existing points on the plot. You don’t want the text to overlap with the
points (or bars etc), so it’s useful to offset the text a little. The nudge_x and nudge_y parameters
allow you to nudge the text a little horizontally or vertically:
df <- data.frame(trt = c("a", "b", "c"), resp = c(1.2, 3.4, 2.5))
ggplot(df, aes(resp, trt)) +
geom_point() +
geom_text(aes(label = paste0("(", resp, ")")), nudge_y = -0.25) +
xlim(1, 3.6)
• If check_overlap = TRUE, overlapping labels will be automatically removed. The algorithm is
simple: labels are plotted in the order they appear in the data frame; if a label would overlap with
a label that has already been drawn, it’s omitted. This is not incredibly useful, but can be handy.
ggplot(mpg, aes(displ, hwy)) +
geom_text(aes(label = model)) +
xlim(1, 8)
ggplot(mpg, aes(displ, hwy)) +
geom_text(aes(label = model), check_overlap = TRUE) +
xlim(1, 8)
Displaying distributions
ggplot(diamonds, aes(depth)) +
geom_histogram(binwidth = 0.1) +
xlim(55, 70)
• If you want to compare the distribution between groups, you have a few options:
• Show small multiples of the histogram, facet_wrap(~ var).
• Use colour and a frequency polygon, geom_freqpoly().
• Use a “conditional density plot”, geom_histogram(position = "fill").
• The frequency polygon and conditional density plots are shown below. The conditional density plot uses
position_fill() to stack each bin, scaling it to the same height. This plot is perceptually challenging because
you need to compare bar heights, not positions, but you can see the strongest patterns.
ggplot(diamonds, aes(depth)) +
geom_freqpoly(aes(colour = cut), binwidth = 0.1, na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
ggplot(diamonds, aes(depth)) +
geom_histogram(aes(fill = cut), binwidth = 0.1, position = "fill",
na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
• An alternative to a bin-based visualisation is a density estimate. geom_density() places a little normal
distribution at each data point and sums up all the curves. It has desirable theoretical properties, but is more
difficult to relate back to the data. Use a density plot when you know that the underlying density is smooth,
continuous and unbounded. You can use the adjust parameter to make the density more or less smooth.
ggplot(diamonds, aes(depth)) +
geom_density(na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
ggplot(diamonds, aes(depth, fill = cut, colour = cut)) +
geom_density(alpha = 0.2, na.rm = TRUE) +
xlim(58, 68) +
theme(legend.position = "none")
• The histogram, frequency polygon and density display a detailed view of the distribution. However,
sometimes you want to compare many distributions, and it’s useful to have alternative options that sacrifice
quality for quantity. Here are three options:
1. geom_boxplot(): the box-and-whisker plot shows five summary statistics along with individual “outliers”. It
displays far less information than a histogram, but also takes up much less space. You can use boxplot with both
categorical and continuous x. For continuous x, you’ll also need to set the group aesthetic to define how the x
variable is broken up into bins. A useful helper function is cut_width():
ggplot(diamonds, aes(clarity, depth)) +
geom_boxplot()
ggplot(diamonds, aes(carat, depth)) +
geom_boxplot(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
2. geom_violin(): the violin plot is a compact version of the density plot. The underlying computation is the
same, but the results are displayed in a similar fashion to the boxplot:
ggplot(diamonds, aes(clarity, depth)) +
geom_violin()
ggplot(diamonds, aes(carat, depth)) +
geom_violin(aes(group = cut_width(carat, 0.1))) +
xlim(NA, 2.05)
3. geom_dotplot(): draws one point for each observation, carefully adjusted in space to avoid
overlaps and show the distribution. It is useful for smaller datasets.
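A minimal sketch of a dot plot, assuming the built-in mtcars data as a stand-in small dataset (it is not used elsewhere in these slides). The binwidth of 1.5 is an arbitrary choice.

```r
library(ggplot2)

# mtcars has only 32 rows, so one dot per observation stays readable.
p_dot <- ggplot(mtcars, aes(x = mpg)) +
  geom_dotplot(binwidth = 1.5) +
  # The y axis of a dot plot is not meaningful, so hide it
  scale_y_continuous(NULL, breaks = NULL)

p_dot
```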
Dealing with overplotting
• The scatterplot is a very important tool for assessing the relationship between two continuous variables.
However, when the data is large, points will often be plotted on top of each other, obscuring the true
relationship. In extreme cases, you will only be able to see the extent of the data, and any conclusions drawn
from the graphic will be suspect. This problem is called overplotting.
• For smaller datasets: Very small amounts of overplotting can sometimes be alleviated by making the points
smaller, or using hollow glyphs. The following code shows some options for 2000 points sampled from a
bivariate normal distribution.
df <- data.frame(x = rnorm(2000), y = rnorm(2000))
norm <- ggplot(df, aes(x, y)) + xlab(NULL) + ylab(NULL)
norm + geom_point()
norm + geom_point(shape = 1) # Hollow circles
norm + geom_point(shape = ".") # Pixel sized
• For larger datasets with more overplotting, you can use alpha blending (transparency) to make the points
transparent. If you specify alpha as a ratio, the denominator gives the number of points that must be
overplotted to give a solid colour. Values smaller than ~1/500 are rounded down to zero, giving completely
transparent points.
norm + geom_point(alpha = 1 / 3)
norm + geom_point(alpha = 1 / 5)
norm + geom_point(alpha = 1 / 10)
Statistical summaries
• geom_histogram() and geom_bin2d() use familiar geoms, geom_bar() and geom_raster(), combined with
new statistical transformations, stat_bin() and stat_bin2d(). stat_bin() and stat_bin2d() combine the data into
bins and count the number of observations in each bin. But what if we want a
summary other than count? So far, we’ve just used the default statistical transformation associated with each
geom. Now we’re going to explore how to use stat_summary_bin() and stat_summary_2d() to compute
different summaries.
• Let’s start with a couple of examples with the diamonds data. The first example in each pair shows how we
can count the number of diamonds in each bin; the second shows how we can compute the average price.
ggplot(diamonds, aes(color)) +
geom_bar()
ggplot(diamonds, aes(color, price)) +
geom_bar(stat = "summary_bin", fun = mean)
ggplot(diamonds, aes(table, depth)) +
geom_bin2d(binwidth = 1, na.rm = TRUE) +
xlim(50, 70) +
ylim(50, 70)
ggplot(diamonds, aes(table, depth, z = price)) +
geom_raster(binwidth = 1, stat = "summary_2d", fun = mean,
na.rm = TRUE) +
xlim(50, 70) +
ylim(50, 70)
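The averages behind the mean-price bar chart can be checked by hand. This sketch uses base R's aggregate() as a cross-check. It is not what ggplot2 runs internally, and the object names are made up for illustration.

```r
library(ggplot2)  # provides the diamonds data

# Mean price for each colour grade, computed directly
mean_price <- aggregate(price ~ color, data = diamonds, FUN = mean)
mean_price

# Plotting the pre-computed means should give the same bar heights
# as letting the stat compute them inside the plot
p_mean <- ggplot(mean_price, aes(color, price)) +
  geom_col()
```

Pre-computing the summary is useful when the same numbers are also needed in a table; letting the stat do the work keeps the plotting code shorter.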
Example: building a scatterplot
How are engine size and fuel economy related? We might create a scatterplot of engine displacement and
highway mpg with points coloured by number of cylinders. geom_point() is a shortcut for a call to layer(),
which spells out every component of the layer:
p <- ggplot(mpg, aes(displ, hwy, colour = factor(cyl)))
p + layer(
mapping = NULL,
data = NULL,
geom = "point",
stat = "identity",
position = "identity"
)
Data
• The data on each layer doesn’t need to be the same, and it’s often useful to combine multiple
datasets in a single plot. To illustrate that idea I’m going to generate two new datasets related to
the mpg dataset. First I’ll fit a loess model and generate predictions from it. (This is what
geom_smooth() does behind the scenes)
mod <- loess(hwy ~ displ, data = mpg)
grid <- tibble::tibble(displ = seq(min(mpg$displ), max(mpg$displ), length.out = 50))
grid$hwy <- predict(mod, newdata = grid)
grid
• Next, I’ll isolate observations that are particularly far away from their predicted values:
library(dplyr) # provides filter()
std_resid <- resid(mod) / mod$s
outlier <- filter(mpg, abs(std_resid) > 2)
outlier
• I’ve generated these datasets because it’s common to enhance the display of raw data with
a statistical summary and some annotations. With these new datasets, I can improve our
initial scatterplot by overlaying a smoothed line, and labelling the outlying points:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_line(data = grid, colour = "blue", size = 1.5) +
geom_text(data = outlier, aes(label = model))
• In this example, every layer uses a different dataset. We could define the same plot in
another way, omitting the default dataset, and specifying a dataset for each layer:
ggplot(mapping = aes(displ, hwy)) +
geom_point(data = mpg) +
geom_line(data = grid) +
geom_text(data = outlier, aes(label = model))
Exercise 4
• The following code uses dplyr to generate some summary statistics about each class of car:
library(dplyr)
class <- mpg %>%
group_by(class) %>%
summarise(n = n(), hwy = mean(hwy))
Use the data to recreate this plot:
Aesthetic mappings
• The aesthetic mappings, defined with aes(), describe how variables are mapped to visual
properties or aesthetics. aes() takes a sequence of aesthetic-variable pairs like this:
aes(x = displ, y = hwy, colour = class)
Or
aes(displ, hwy, colour = class)
Specifying the aesthetics in the plot vs. in the layers
• Aesthetic mappings can be supplied in the initial ggplot() call, in individual layers, or in
some combination of both. All of these calls create the same plot specification:
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point()
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
ggplot(mpg, aes(displ)) +
geom_point(aes(y = hwy, colour = class))
ggplot(mpg) +
geom_point(aes(displ, hwy, colour = class))
Each geom has a set of aesthetics that it understands, some of which must be provided. For
example, the point geom requires x and y position, and understands colour, size and shape
aesthetics. A bar requires height (ymax), and understands width, border colour and fill
colour. Each geom lists its aesthetics in the documentation.
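Besides the help pages, the same information can be read off the ggproto object behind each geom. This is a sketch that assumes the exported GeomPoint object and its aesthetics() method; other geoms should expose the same fields.

```r
library(ggplot2)

# Aesthetics that geom_point() cannot be drawn without
GeomPoint$required_aes

# Every aesthetic the point geom understands, including optional ones
GeomPoint$aesthetics()
```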
Geoms
• Some geoms differ primarily in the way that they are parameterised. For example, you
can draw a square in three ways:
• By giving geom_tile() the location (x and y) and dimensions (width and height).
• By giving geom_rect() top (ymax), bottom (ymin), left (xmin) and right (xmax)
positions.
• By giving geom_polygon() a four row data frame with the x and y positions of each
corner.
• Other related geoms are:
• geom_segment() and geom_line()
• geom_area() and geom_ribbon().
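The three parameterisations of a square can be sketched as follows. The data frames and object names are made up for illustration; each plot draws a square with corners at (0, 0) and (2, 2).

```r
library(ggplot2)

# geom_tile(): centre (x, y) plus width and height
sq_tile <- data.frame(x = 1, y = 1, width = 2, height = 2)
p_tile <- ggplot(sq_tile, aes(x, y, width = width, height = height)) +
  geom_tile()

# geom_rect(): positions of the four edges
sq_rect <- data.frame(xmin = 0, xmax = 2, ymin = 0, ymax = 2)
p_rect <- ggplot(sq_rect, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax)) +
  geom_rect()

# geom_polygon(): one row per corner, in drawing order
sq_poly <- data.frame(x = c(0, 2, 2, 0), y = c(0, 0, 2, 2))
p_poly <- ggplot(sq_poly, aes(x, y)) +
  geom_polygon()
```

Which form is most convenient depends on how the data is stored: centres and sizes suit geom_tile(), bounding edges suit geom_rect(), and arbitrary outlines need geom_polygon().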
Exercise 6
• For each of the plots below, identify the geom used to draw it.
Stats
• A statistical transformation, or stat, transforms the data, typically by summarising it in
some manner. For example, a useful stat is the smoother, which calculates the smoothed
mean of y, conditional on x. You’ve already used many of ggplot2’s stats because they’re
used behind the scenes to generate many important geoms:
• stat_bin(): geom_freqpoly(), geom_histogram()
• stat_count(): geom_bar()
• stat_bin2d(): geom_bin2d()
• stat_bindot(): geom_dotplot()
• stat_binhex(): geom_hex()
• stat_boxplot(): geom_boxplot()
• stat_contour(): geom_contour()
• stat_quantile(): geom_quantile()
• stat_smooth(): geom_smooth()
• stat_sum(): geom_count()
• Other stats can’t be created with a geom_ function:
• stat_ecdf(): compute an empirical cumulative distribution plot.
• stat_function(): compute y values from a function of x values.
• stat_summary(): summarise y values at distinct x values.
• stat_summary_2d(), stat_summary_hex(): summarise binned values.
• stat_qq(): perform calculations for a quantile-quantile plot.
• stat_spoke(): convert angle and radius to position.
• stat_unique(): remove duplicated rows.
• There are two ways to use these functions. You can either add a stat_() function and
override the default geom, or add a geom_() function and override the default stat:
ggplot(mpg, aes(trans, cty)) +
geom_point() +
stat_summary(geom = "point", fun = "mean", colour = "red", size = 4)
ggplot(diamonds, aes(price)) +
geom_histogram(aes(y = ..density..), binwidth = 500)
Generated variables
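• Stats also generate new variables, such as the density computed by stat_bin(), which can be mapped to
aesthetics. This is particularly useful when you want to compare the distribution of multiple groups
that have very different sizes. For example, it’s hard to compare the distribution of price within cut
because some groups are quite small. It’s easier to compare if we standardise each group to take up
the same area:
ggplot(diamonds, aes(price, colour = cut)) +
geom_freqpoly(binwidth = 500) +
theme(legend.position = "none")
ggplot(diamonds, aes(price, colour = cut)) +
geom_freqpoly(aes(y = ..density..), binwidth = 500) +
theme(legend.position = "none")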
3. Read the help for stat_sum() then use geom_count() to create a plot that
shows the proportion of cars that have each combination of drv and trans.
Position adjustments
• Position adjustments apply minor tweaks to the position of elements within a layer. Three
adjustments apply primarily to bars:
• position_stack(): stack overlapping bars (or areas) on top of each other.
• position_fill(): stack overlapping bars, scaling so the top is always at 1.
• position_dodge(): place overlapping bars (or boxplots) side-by-side.
dplot <- ggplot(diamonds, aes(color, fill = cut)) + xlab(NULL) + ylab(NULL) +
theme(legend.position = "none")
# position stack is the default for bars, so `geom_bar()` is equivalent to
# `geom_bar(position = "stack")`.
dplot + geom_bar()
dplot + geom_bar(position = "fill")
dplot + geom_bar(position = "dodge")
• There’s also a position adjustment that does nothing: position_identity(). The identity position
adjustment is not useful for bars, because each bar obscures the bars behind, but there are many
geoms that don’t need adjusting, like the frequency polygon:
dplot + geom_bar(position = "identity", alpha = 1 / 2, colour = "grey50")
Modifying scales
Default scales are named according to the aesthetic and the variable type: scale_y_continuous(),
scale_colour_discrete(), etc.
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous("A really awesome x axis label") +
scale_y_continuous("An amazingly great y axis label")
Compare two specifications?
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous("Label 1") +
scale_x_continuous("Label 2")
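A sketch of the comparison, assuming the mpg data used throughout. When a scale for the same aesthetic is added twice, ggplot2 keeps only the most recent one and prints a message saying that the existing scale is being replaced, so the two plots below should end up with the same x axis label. The object names are made up for illustration.

```r
library(ggplot2)

base_mpg <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Two scales for the same aesthetic: the second replaces the first
p_twice <- base_mpg +
  scale_x_continuous("Label 1") +
  scale_x_continuous("Label 2")

# A single scale carrying only the second label
p_once <- base_mpg +
  scale_x_continuous("Label 2")
```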
Limits
It’s most natural to think about the limits of position scales: they map directly to the ranges of the
axes. But limits also apply to scales that have legends, like colour, size, and shape. This is
particularly important to realise if you want your colours to match up across multiple plots in your
paper.
• You can modify the limits using the limits parameter of the scale:
• For continuous scales, this should be a numeric vector of length two. If you only want to set
the upper or lower limit, you can set the other value to NA.
• For discrete scales, this is a character vector which enumerates all possible values.
df <- data.frame(x = 1:3, y = 1:3)
base <- ggplot(df, aes(x, y)) + geom_point()
base
base + scale_x_continuous(limits = c(1.5, 2.5))
#> Warning: Removed 2 rows containing missing values (geom_point).
base + scale_x_continuous(limits = c(0, 4))
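The code above only covers continuous limits. The sketch below shows discrete limits on a colour scale, which is one way to keep colours consistent across plots. It assumes the mpg data and its drv variable; the object names are made up for illustration.

```r
library(ggplot2)

drv_levels <- c("4", "f", "r")         # all drive types present in mpg
front_rear <- subset(mpg, drv != "4")  # a subset that lacks "4"

# Without limits, the palette is spread over "f" and "r" only
p_default <- ggplot(front_rear, aes(displ, hwy, colour = drv)) +
  geom_point()

# With limits, "f" and "r" get the colours they would have
# in a plot of the full data
p_fixed <- ggplot(front_rear, aes(displ, hwy, colour = drv)) +
  geom_point() +
  scale_colour_discrete(limits = drv_levels)
```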
• Because modifying the limits is such a common task, ggplot2 provides some helpers to make this
even easier: xlim(), ylim() and lims(). These functions inspect their input and then create the
appropriate scale, as follows:
• xlim(10, 20): a continuous scale from 10 to 20
• ylim(20, 10): a reversed continuous scale from 20 to 10
• xlim("a", "b", "c"): a discrete scale
• xlim(as.Date(c("2008-05-01", "2008-08-01"))): a date scale from May 1 to August 1 2008.
base + xlim(0, 4)
base + xlim(4, 0)
base + lims(x = c(0, 4))
Components of ggplot2
• Essentially, every plot built with the ggplot2 package has three main components:
The input data,
A set of aesthetic mappings (written aes) between the variables in the dataset and
visual properties,
At least one layer that describes the data. Layers are usually created with a geom function.
• Using the data file "ch4bt8.wf1"
library(ggplot2)
library(hexView)
test = readEViews("ch4bt8.wf1")
Scatter plot in ggplot2
ggplot(test, aes(x = EDUC, y = WAGE)) + geom_point() # Option 1
ggplot(test, aes(EDUC, WAGE)) + geom_point() # Option 2
• For two groups of workers, one for each gender:
test$MALE[test$MALE == 1] = "MALE"
test$MALE[test$MALE == 0] = "FEMALE"
ggplot(test, aes(EDUC, WAGE, colour = MALE)) + geom_point(show.legend = F) +
facet_wrap(~ MALE)
• The same two groups combined in a single scatter plot:
ggplot(test, aes(EDUC, WAGE, colour = MALE)) + geom_point()
Regression line in ggplot2
• A regression line for the whole sample, with EDUC as the independent variable and WAGE as the
dependent variable:
ggplot(test, aes(EDUC, WAGE)) + geom_point() +
geom_smooth(method = "lm", se = FALSE)
• Two regression lines, one for each group of workers, shown on the same plot:
ggplot(test, aes(EDUC, WAGE, colour = MALE)) + geom_point() +
geom_smooth(method = "lm")
Histogram in ggplot2
• Histograms of IQ (intelligence quotient), drawn separately for the two gender groups:
ggplot(test, aes(IQ, fill = MALE)) + geom_histogram(show.legend = F) +
facet_wrap(~ MALE)
• With a slightly lighter fill:
ggplot(test, aes(IQ, fill = MALE)) + geom_histogram(show.legend = F, alpha = 0.5) +
facet_wrap(~ MALE)
Box plot in ggplot2
library(hexView)
test = readEViews("ch4bt8.wf1") # Example with the "ch4bt8.wf1" data
test$MALE[test$MALE == 1] = "MALE"
test$MALE[test$MALE == 0] = "FEMALE"
# Box plot of IQ for the two gender groups
ggplot(test, aes(MALE, IQ, colour = MALE)) + geom_boxplot(show.legend = F)
Probability density plots in ggplot2
• Plot the probability density of log(wage) for four groups of workers from four geographic
regions:
library(AER) # provides the CPS1988 data
data("CPS1988")
ggplot(CPS1988, aes(log(wage), colour = region)) + geom_density() # Option 1
ggplot(CPS1988, aes(log(wage), fill = region)) + geom_density(alpha = 0.3) # Option 2
• Plot the probability density of log(wage) for all observations:
ggplot(CPS1988, aes(log(wage))) + geom_density(color = "darkblue", fill = "lightblue")
• Show the histogram and the density together:
ggplot(CPS1988, aes(log(wage))) + geom_density(alpha = 0.3, fill = "blue", color = "blue") +
geom_histogram(aes(y = ..density..), fill = "red", color = "red", alpha = 0.3) + theme_bw()
Bar charts in ggplot2
install.packages("tidyverse")
library(AER)
data("CPS1988")
library(tidyverse) # Plot the number of workers by geographic region
k1 = CPS1988 %>% group_by(region) %>% count() %>% ggplot(aes(region, n)) +
geom_col() + theme_minimal() # Style 1
k1
k2 = CPS1988 %>% group_by(region) %>% count() %>% ggplot(aes(region, n, fill = region)) +
geom_col() + theme_minimal() # Style 2
k2
k3 = CPS1988 %>% group_by(region) %>% count() %>%
ggplot(aes(reorder(region, n), n, fill = region)) + geom_col(show.legend = FALSE) # Style 3
k3
k4 = k3 + coord_flip() + labs(x = NULL, y = NULL, title = "Observations by Region",
caption = "Data Source: US Census Bureau") # Flip k3 and adjust the labels
k4
# Number of workers by geographic region and ethnicity
k5 = CPS1988 %>% group_by(region, ethnicity) %>% count() %>% ggplot(aes(region, n)) +
geom_col() + facet_wrap(~ ethnicity, scales = "free") +
geom_text(aes(label = n), color = "white", vjust = 1.2, size = 3) +
labs(x = NULL, y = NULL, title = "Observations by Region and Ethnicity",
caption = "Data Source: US Census Bureau")
k5 # Style 1
k6 = CPS1988 %>% group_by(region, ethnicity) %>% count() %>%
ggplot(aes(region, n, fill = ethnicity)) + geom_col(position = "stack") +
labs(x = NULL, y = NULL, title = "Observations by Region and Ethnicity",
caption = "Data Source: US Census Bureau")
k6 # Style 2
k7 = CPS1988 %>% group_by(region, ethnicity) %>% count() %>%
ggplot(aes(region, n, fill = ethnicity)) + geom_col(position = "dodge") +
labs(x = NULL, y = NULL, title = "Observations by Region and Ethnicity",
caption = "Data Source: US Census Bureau")
k7 # Style 3
Pie chart in ggplot2
library(dplyr)
p = CPS1988 %>% group_by(region) %>% count() %>% ggplot(aes(x = "", n, fill = region)) +
geom_col(width = 1)
p # Bar chart
p + coord_polar("y", start = 0) + labs(x = NULL, y = NULL) # Pie chart
Line charts in ggplot2
library(foreign)
test = read.dta("Table13_1.dta")
ggplot(test, aes(time, ex)) + geom_line(col = "blue")
library(data.table)
test = fread("cophieu.csv") # Read the file "cophieu.csv"
library(tidyverse) # Provides mutate()
library(lubridate) # Parses the date column
test %>% mutate(Date = ymd(X.DTYYYYMMDD.)) %>% ggplot(aes(Date, X.Open.)) +
geom_line(color = "blue") + facet_wrap(~ X.Ticker., drop = TRUE, scales = "free") +
theme_bw() # Style 1
test %>% mutate(Date = ymd(X.DTYYYYMMDD.)) %>% ggplot(aes(Date, X.Open., color =
X.Ticker.)) + geom_line() + theme_bw() # Style 2
test %>% mutate(Date = ymd(X.DTYYYYMMDD.)) %>% rename(Price = X.Open., Symbol =
X.Ticker.) %>% ggplot(aes(Date, Price, color = Symbol)) + geom_line() + theme_bw()
# Style 3
Bar chart
library(ggplot2)
data(Marriage, package = "mosaicData")
# plot the distribution of race
ggplot(Marriage, aes(x = race)) + geom_bar()
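Pie chart
# create a basic pie chart (data preparation assumed; it is not shown on the slide)
library(dplyr)
plotdata <- Marriage %>%
count(race) %>%
arrange(desc(race)) %>%
mutate(prop = round(n * 100 / sum(n), 1))
ggplot(plotdata,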
aes(x = "",
y = prop,
fill = race)) +
geom_bar(width = 1,
stat = "identity",
color = "black") +
coord_polar("y",
start = 0,
direction = -1) +
theme_void()
Pie chart: add labels, remove the legend
# create a pie chart with slice labels
plotdata <- Marriage %>%
count(race) %>%
arrange(desc(race)) %>%
mutate(prop = round(n*100/sum(n), 1),
lab.ypos = cumsum(prop) - 0.5*prop)
plotdata$label <- paste0(plotdata$race, "\n",
round(plotdata$prop), "%")
ggplot(plotdata,
aes(x = "",
y = prop,
fill = race)) +
geom_bar(width = 1,
stat = "identity",
color = "black") +
geom_text(aes(y = lab.ypos, label = label),
color = "black") +
coord_polar("y",
start = 0,
direction = -1) +
theme_void() +
theme(legend.position = "none") +
labs(title = "Participants by race")
Tree map
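library(treemapify)
# count marriages by officiate (data preparation assumed; it is not shown on the slide)
plotdata <- Marriage %>%
count(officialTitle)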
ggplot(plotdata,
aes(fill = officialTitle,
area = n)) +
geom_treemap() +
labs(title = "Marriages by officiate")
Tree map: with labels
# create a treemap with tile labels
ggplot(plotdata,
aes(fill = officialTitle,
area = n,
label = officialTitle)) +
geom_treemap() +
geom_treemap_text(colour = "white",
place = "centre") +
labs(title = "Marriages by officiate") +
theme(legend.position = "none")
Histogram
library(ggplot2)
data(Marriage, package="mosaicData")
# plot the age distribution using a histogram
ggplot(Marriage, aes(x = age)) +
geom_histogram() +
labs(title = "Participants by age",
x = "Age")
Histogram: with colors
# plot the histogram with blue bars and white borders
ggplot(Marriage, aes(x = age)) +
geom_histogram(fill = "cornflowerblue",
color = "white") +
labs(title="Participants by age",
x = "Age")
Histogram: bins and binwidths
# plot the histogram with 20 bins
ggplot(Marriage, aes(x = age)) +
geom_histogram(fill = "cornflowerblue",
color = "white",
bins = 20) +
labs(title="Participants by age",
subtitle = "number of bins = 20",
x = "Age")
Histogram: binwidth
# plot the histogram with a binwidth of 5
ggplot(Marriage, aes(x = age)) +
geom_histogram(fill = "cornflowerblue",
color = "white",
binwidth = 5) +
labs(title="Participants by age",
subtitle = "binwidth = 5 years",
x = "Age")
Histogram: percentage
# plot the histogram with percentages on the y-axis
library(scales)
ggplot(Marriage,
aes(x = age,
y= ..count.. / sum(..count..))) +
geom_histogram(fill = "cornflowerblue",
color = "white",
binwidth = 5) +
labs(title="Participants by age",
y = "Percent",
x = "Age") +
scale_y_continuous(labels = percent)
Kernel Density plot
# Create a kernel density plot of age
ggplot(Marriage, aes(x = age)) +
geom_density() +
labs(title = "Participants by age")
Kernel Density plot: the fill and border colors
# Create a kernel density plot of age
ggplot(Marriage, aes(x = age)) +
geom_density(fill = "indianred3") +
labs(title = "Participants by age")
Kernel Density plot: smoothing parameter
# default bandwidth for the age variable
bw.nrd0(Marriage$age)
## [1] 5.181946
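The slide title refers to the smoothing parameter, which geom_density() exposes as bw. A minimal sketch follows, assuming mpg$hwy from ggplot2 as a stand-in variable so that the code runs without mosaicData. The value bw = 1 is an arbitrary choice that is narrower than the default for this variable.

```r
library(ggplot2)

# Default bandwidth that bw.nrd0() selects for this variable
default_bw <- bw.nrd0(mpg$hwy)
default_bw

# A smaller bandwidth follows the data more closely (less smoothing);
# a larger one gives a smoother curve.
p_bw <- ggplot(mpg, aes(x = hwy)) +
  geom_density(fill = "indianred3", bw = 1) +
  labs(title = "Highway mpg, bw = 1")
```

Comparing the plot at a few bandwidths around the default is a simple way to judge how much of the shape is signal and how much is an artefact of smoothing.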
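Segmented bar chart: percentages and labels
# summarise mpg by class and drive train (data preparation assumed; it is not shown on the slide)
plotdata <- mpg %>%
group_by(class, drv) %>%
summarize(n = n()) %>%
mutate(pct = n / sum(n),
lbl = scales::percent(pct))
ggplot(plotdata,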
aes(x = factor(class,
levels = c("2seater", "subcompact",
"compact", "midsize",
"minivan", "suv", "pickup")),
y = pct,
fill = factor(drv,
levels = c("f", "r", "4"),
labels = c("front-wheel",
"rear-wheel",
"4-wheel")))) +
geom_bar(stat = "identity",
position = "fill") +
scale_y_continuous(breaks = seq(0, 1, .2),
labels = percent) +
geom_text(aes(label = lbl),
size = 3,
position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette = "Set2") +
labs(y = "Percent",
fill = "Drive Train",
x = "Class",
title = "Automobile Drive by Class") +
theme_minimal()
Scatterplot
library(ggplot2)
data(Salaries, package="carData")
# simple scatterplot
ggplot(Salaries,
aes(x = yrs.since.phd,
y = salary)) +
geom_point()
Vietnam
https://future.ueh.edu.vn/