DSAAct 6

The document outlines an activity focused on data visualization using the ggplot2 library in R, specifically analyzing the Family Income and Expenditure Survey (FIES) data. It includes tasks for creating histograms, bar graphs, and scatter plots to explore demographics and financial data, with detailed instructions on improving visualizations through color, themes, and statistical annotations. The final task emphasizes adding regression lines and themes to scatter plots to enhance data interpretation.

Data Science and Analytics - Activity 6

Chelsea C. Ancheta

2024-11-30

Data Visualization using ggplot2

Preparations

Ensure that the ggplot2 library is loaded. The code is already provided.
If the ggplot2 package is not yet installed, click the Console tab and enter: install.packages("ggplot2")

library(ggplot2)

Also, for this activity, we will go back to the Family Income and Expenditure Survey (FIES). So, import the
fies data frame.

FIES <- read.csv("FIES.csv")


FIES <- data.frame(FIES)

Task 1

We start by limiting our data set. Let us take the data from Region 1 only. Create a subset data frame
fies_ilocos wherein the Region is “I - Ilocos Region”.

fies_ilocos <- FIES[FIES$Region == "I - Ilocos Region", ]
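It can help to check that a subset like this kept only the rows you wanted. The sketch below uses a small made-up data frame (fies_demo and its values are hypothetical; only the column names come from the activity) to show the same row-filtering pattern and a quick check of the result.

```r
# Hypothetical stand-in for the FIES data; the values are invented for illustration.
fies_demo <- data.frame(
  Region = c("I - Ilocos Region", "NCR", "I - Ilocos Region", "NCR"),
  Household.Head.Age = c(45, 38, 62, 51)
)

# Keep only the rows whose Region matches; the empty slot after the comma keeps every column.
ilocos_demo <- fies_demo[fies_demo$Region == "I - Ilocos Region", ]

# Quick checks: how many rows survived, and which regions remain.
print(nrow(ilocos_demo))
print(unique(ilocos_demo$Region))
```

Running the same two checks on fies_ilocos should show a single region name if the subset worked as intended.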

Task 2 (Histogram)

Let’s study our demographics. First, create a simple histogram of the Household.Head.Age.

hist(fies_ilocos$Household.Head.Age)

[Figure: Histogram of fies_ilocos$Household.Head.Age. x-axis: fies_ilocos$Household.Head.Age; y-axis: Frequency]

Improve the graph by defining the starting and ending value of each bar. Use breaks = seq() command.
Copy your original script/code and add the needed code.

hist(fies_ilocos$Household.Head.Age,
breaks = seq (10,100, by = 5),
xlab = "Age",
ylab = "Frequency",
col = "lightblue",
main = "Histogram of Household Head Age")

[Figure: Histogram of Household Head Age. x-axis: Age; y-axis: Frequency]
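To see what breaks = seq() does, here is a minimal sketch on a few made-up ages (ages_demo is hypothetical). It asks hist() for the bin counts without drawing, so you can inspect how the break points split the data.

```r
# Hypothetical ages, not taken from the FIES file.
ages_demo <- c(18, 23, 24, 31, 36, 40, 58, 77)

# The same break points as in the activity: 10, 15, 20, ..., 100.
age_breaks <- seq(10, 100, by = 5)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(ages_demo, breaks = age_breaks, plot = FALSE)

# Each bar spans two neighboring break points, so there is one bar fewer than breaks.
print(length(age_breaks))
print(length(h$counts))

# Every age falls into exactly one bin, so the counts add up to the number of ages.
print(sum(h$counts))
```

Note that hist() stops with an error if any value lies outside the range covered by the breaks, so the sequence should span the smallest and largest ages in the data.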

Let’s further improve the histogram by including colors. Select a basic color of your choice. Use that color
for the color and a lighter shade of it for the fill.
Copy your original script/code and add the needed code.

hist(fies_ilocos$Household.Head.Age, breaks = seq(10, 100, by = 5), xlab = "Age", ylab = "Frequency", co

[Figure: Histogram of fies_ilocos$Household.Head.Age. x-axis: Age; y-axis: Frequency]

Let’s finish the histogram by adding labels.


Copy your original script/code and add the needed code.

hist(fies_ilocos$Household.Head.Age, breaks = seq(10, 100, by = 5), xlab = "Age", ylab = "Frequency",

[Figure: Histogram of fies_ilocos$Household.Head.Age. x-axis: Age; y-axis: Frequency]

Next, add a theme. Use theme_minimal()


Copy your original script/code and add the needed code.

if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")


library(ggplot2)
ggplot(fies_ilocos, aes(x = Household.Head.Age)) + geom_histogram(breaks = seq(10, 100, by = 5), col =

[Figure: ggplot2 histogram with theme_minimal(). x-axis: Household.Head.Age; y-axis: count]
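For reference, a ggplot2 histogram with an outline color, a fill, axis labels, and theme_minimal() might look like the sketch below. The data frame demo and the color choices are assumptions made for illustration; swap in fies_ilocos and your own colors.

```r
library(ggplot2)

# Hypothetical ages standing in for fies_ilocos$Household.Head.Age.
demo <- data.frame(Household.Head.Age = c(18, 23, 24, 31, 36, 40, 58, 77))

# color sets the bar outline and fill sets the bar interior (both colors are assumed).
p <- ggplot(demo, aes(x = Household.Head.Age)) +
  geom_histogram(breaks = seq(10, 100, by = 5),
                 color = "blue",
                 fill = "lightblue") +
  labs(title = "Histogram of Household Head Age",
       x = "Age",
       y = "Frequency") +
  theme_minimal()

print(p)
```

In ggplot2, layers and themes are added with +, so the theme can simply be appended as the last line.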

For statisticians, we can also add vertical lines representing the mean and the median. The code is provided
below; just edit the ... parts to complete it.

ggplot(fies_ilocos, aes(x = Household.Head.Age)) +
  geom_histogram(breaks = seq(10, 100, by = 5), color = "purple", fill = "lightblue") +
  geom_vline(aes(xintercept = mean(Household.Head.Age)), color = "yellow", linetype = "dashed") +
  geom_vline(aes(xintercept = median(Household.Head.Age)), color = "red", linetype = "dotted") +
  annotate("text", x = mean(fies_ilocos$Household.Head.Age) + 7.5, y = 325, label = "Mean", color = "green") +
  annotate("text", x = median(fies_ilocos$Household.Head.Age) + 7.5, y = 350, label = "Median", color = "orange")

[Figure: ggplot2 histogram with vertical lines labeled "Mean" and "Median". x-axis: Household.Head.Age; y-axis: y]
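One possible variation, shown here only as a sketch on made-up ages, is to compute the mean and median once, store them in variables, and give each label the same color as its line so the two are easy to match. The names demo, age_mean, and age_median and the colors are assumptions.

```r
library(ggplot2)

# Hypothetical ages standing in for fies_ilocos$Household.Head.Age.
demo <- data.frame(Household.Head.Age = c(18, 23, 24, 31, 36, 40, 58, 77))

# Compute the two statistics once so the lines and labels use the same numbers.
age_mean <- mean(demo$Household.Head.Age)
age_median <- median(demo$Household.Head.Age)
print(c(mean = age_mean, median = age_median))

p <- ggplot(demo, aes(x = Household.Head.Age)) +
  geom_histogram(breaks = seq(10, 100, by = 5), color = "purple", fill = "lightblue") +
  # One vertical line per statistic, each with its own color and line type.
  geom_vline(xintercept = age_mean, color = "red", linetype = "dashed") +
  geom_vline(xintercept = age_median, color = "blue", linetype = "dotted") +
  # y = Inf pins each label to the top of the panel; vjust moves it down inside the panel.
  annotate("text", x = age_mean, y = Inf, label = "Mean",
           color = "red", hjust = -0.1, vjust = 1.5) +
  annotate("text", x = age_median, y = Inf, label = "Median",
           color = "blue", hjust = -0.1, vjust = 3)

print(p)
```

When the mean sits to the right of the median, as it does for these made-up ages, the distribution has a longer tail toward older ages.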

Your histogram is complete.

Task 3 (Bar Graph)

Next, let us look at the Household.Head.Marital.Status by creating a simple bar graph.

Marital.Status.Counts <- table(fies_ilocos$Household.Head.Marital.Status)


barplot(Marital.Status.Counts, main = "Bar Graph of Household Head Marital Status", xlab = "Marital Sta

[Figure: Bar Graph of Household Head Marital Status. x-axis: Marital Status; visible bar labels: Annulled, Married, Single, Widowed]

Let’s prepare the chart for a Pareto analysis. Arrange the bars from largest to smallest.
Copy your original script/code and add the needed code.

Marital.Status.Counts <- sort(table(fies_ilocos$Household.Head.Marital.Status), decreasing = TRUE)


barplot(Marital.Status.Counts, main = "Bar Graph of Household Head Marital Status", xlab = "Marital Sta

[Figure: Bar Graph of Household Head Marital Status, bars sorted from largest to smallest. x-axis: Marital Status; visible bar labels: Married, Widowed, Single, Annulled]
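Sorting is the first step of a Pareto chart; the second is usually the cumulative percentage of each category. The sketch below shows both steps on a short made-up vector (status_demo is hypothetical), in case you want to carry the analysis one step further.

```r
# Hypothetical marital statuses, not taken from the FIES file.
status_demo <- c("Married", "Single", "Married", "Widowed", "Married", "Single")

# Step 1: count each category and sort from largest to smallest.
counts <- sort(table(status_demo), decreasing = TRUE)
print(counts)

# Step 2: running total as a percentage of all observations.
cum_pct <- cumsum(counts) / sum(counts) * 100
print(round(cum_pct, 1))
```

The last cumulative percentage is always 100, which is a handy sanity check.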

Next, add colors.


Copy your original script/code and add the needed code.

Marital.Status.Counts <- sort(table(fies_ilocos$Household.Head.Marital.Status), decreasing = TRUE)


bar_colors <- c("purple", "lightgreen", "orange", "red", "lightblue")
barplot(Marital.Status.Counts,
main = "Bar Graph of Household Head Marital Status",
xlab = "Marital Status",
ylab = "Count",
col = bar_colors)

[Figure: Bar Graph of Household Head Marital Status, colored bars. x-axis: Marital Status; y-axis: Count; visible bar labels: Married, Widowed, Single, Annulled]

Next add the labels.


Copy your original script/code and add the needed code.

Marital.Status.Counts <- sort(table(fies_ilocos$Household.Head.Marital.Status), decreasing = TRUE)


bar_colors <- c("purple", "lightgreen", "orange", "red", "lightblue")
bar_midpoints <- barplot(Marital.Status.Counts, main = "Bar Graph of Household Head Marital Status", xl
text(x = bar_midpoints, y = Marital.Status.Counts, labels = Marital.Status.Counts, pos = 3, cex = 0.8,

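[Figure: Bar Graph of Household Head Marital Status with count labels above the bars. x-axis: Marital Status; y-axis: Count; count labels from largest to smallest: 1713, 425, 136, 73, 1; visible bar labels: Married, Widowed, Single, Annulled]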
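The labeling step relies on the fact that barplot() returns the x positions of the bar midpoints. Here is a minimal sketch on made-up counts (counts_demo and the color are hypothetical). The ylim argument is an addition that is not part of the activity; it leaves some headroom so the label on the tallest bar is not cut off.

```r
# Hypothetical counts, already sorted from largest to smallest.
counts_demo <- c(Married = 3, Single = 2, Widowed = 1)

# barplot() draws the bars and returns the x position of each bar's midpoint.
mids <- barplot(counts_demo,
                main = "Bar Graph of Household Head Marital Status",
                xlab = "Marital Status",
                ylab = "Count",
                col = "lightblue",
                ylim = c(0, max(counts_demo) * 1.15))

# pos = 3 places each label just above the top of its bar.
text(x = mids, y = counts_demo, labels = counts_demo, pos = 3, cex = 0.8)
```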

Finally, add a theme. Try theme_bw()


Copy your original script/code and add the needed code.

Marital.Status <- as.data.frame(Marital.Status.Counts)
colnames(Marital.Status) <- c("Marital.Status", "Count")
ggplot(Marital.Status, aes(x = reorder(Marital.Status, -Count), y = Count, fill = Marital.Status)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(legend.position = "none") +
  labs(title = "Bar Graph of Household Head Marital Status",
       x = "Marital Status",
       y = "Count")

[Figure: Bar Graph of Household Head Marital Status with theme_bw(). x-axis: Marital Status (Married, Widowed, Single, Divorced/Separated, Annulled); y-axis: Count]

Task 4 (Scatter Plot)

Since the demographics have already been explored, let’s try to find patterns in the relationship between two
variables in the data set.
Create a basic scatterplot using geom_point() on Total.Food.Expenditure and Total.Household.Income.

ggplot(fies_ilocos, aes(x = Total.Household.Income, y = Total.Food.Expenditure)) + geom_point(color = "

[Figure: Scatterplot. x-axis: Total.Household.Income; y-axis: Total.Food.Expenditure]

Since the data set is concentrated on one corner of the plot, let’s transform by using log10().
Copy your original script/code and add the needed code.

ggplot(fies_ilocos, aes(x = Total.Household.Income, y = Total.Food.Expenditure)) +


geom_point(color = "purple", alpha = 0.6) + scale_x_log10() + scale_y_log10() + labs(title = "Total Foo

[Figure: Total Food Expenditure vs Total Household Income (Log-Transformed). x-axis: Total.Household.Income (log10 scale); y-axis: Total.Food.Expenditure (log10 scale)]
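To see why log10 helps, consider the sketch below on a few made-up income and food values (demo is hypothetical). On the raw scale the largest income is 100 times the smallest, so most points crowd into one corner; on the log10 scale the same values are spread over only two units.

```r
library(ggplot2)

# Hypothetical incomes and food expenditures, invented for illustration.
demo <- data.frame(
  Total.Household.Income = c(30000, 60000, 120000, 500000, 3000000),
  Total.Food.Expenditure = c(20000, 35000, 60000, 150000, 400000)
)

# Raw incomes span a factor of 100; their log10 values span only 2 units.
print(range(demo$Total.Household.Income))
print(log10(demo$Total.Household.Income))

# scale_x_log10() and scale_y_log10() transform the axes, not the data frame.
p <- ggplot(demo, aes(x = Total.Household.Income, y = Total.Food.Expenditure)) +
  geom_point(color = "purple", alpha = 0.6) +
  scale_x_log10() +
  scale_y_log10()

print(p)
```

Because the scales only change how the axes are drawn, the columns in the data frame keep their original values. Keep in mind that log10 is undefined for zero or negative amounts, so such rows cannot be shown on a log scale.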

We can now slowly see the pattern. Try to add colors.


Copy your original script/code and add the needed code.

ggplot(fies_ilocos, aes(x = Total.Household.Income,
                        y = Total.Food.Expenditure,
                        color = Household.Head.Marital.Status)) +
  geom_point(alpha = 0.7, size = 3) +
  labs(title = "Income vs. Food Expenditure by Marital Status",
       x = "Total Household Income",
       y = "Total Food Expenditure",
       color = "Marital Status")

[Figure: Income vs. Food Expenditure by Marital Status. x-axis: Total Household Income; y-axis: Total Food Expenditure; legend "Marital Status": Annulled, Divorced/Separated, Married, Single, Widowed]

Add labels.
Copy your original script/code and add the needed code.

if (!requireNamespace("ggrepel", quietly = TRUE)) install.packages("ggrepel")


library(ggplot2)
library(ggrepel)
ggplot(fies_ilocos, aes(x = Total.Household.Income,
y = Total.Food.Expenditure,
color = Household.Head.Marital.Status,
label = Household.Head.Marital.Status)) +
geom_point(alpha = 0.7, size = 3) +
geom_text_repel(size = 3, max.overlaps = 10) +
labs(title = "Income vs. Food Expenditure by Marital Status",
x = "Total Household Income",
y = "Total Food Expenditure",
color = "Marital Status")

## Warning: ggrepel: 2343 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

[Figure: Income vs. Food Expenditure by Marital Status, with a few points labeled by marital status. x-axis: Total Household Income; y-axis: Total Food Expenditure; legend "Marital Status": Annulled, Divorced/Separated, Married, Single, Widowed]

Now try to add the regression line. Use geom_smooth().


Copy your original script/code and add the needed code.

ggplot(fies_ilocos, aes(x = Total.Household.Income,
                        y = Total.Food.Expenditure,
                        color = Household.Head.Marital.Status)) +
  geom_point(alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, linetype = "solid") +
  labs(title = "Income vs. Food Expenditure by Marital Status",
       x = "Total Household Income",
       y = "Total Food Expenditure",
       color = "Marital Status")

## `geom_smooth()` using formula = 'y ~ x'

[Figure: Income vs. Food Expenditure by Marital Status, with regression lines. x-axis: Total Household Income; y-axis: Total Food Expenditure; legend "Marital Status": Annulled, Divorced/Separated, Married, Single, Widowed]
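The line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit, so you can get its intercept and slope with lm(). The sketch below uses made-up numbers (demo and its columns income, food, and status are hypothetical). It also shows one way to draw a single overall line instead of one line per group: map color only inside geom_point(), so geom_smooth() does not split the data by status.

```r
library(ggplot2)

# Hypothetical incomes, food expenditures, and marital statuses.
demo <- data.frame(
  income = c(100000, 200000, 300000, 400000, 500000),
  food   = c(52000, 78000, 111000, 139000, 170000),
  status = c("Married", "Single", "Married", "Widowed", "Single")
)

# The same straight-line model that geom_smooth(method = "lm") fits.
fit <- lm(food ~ income, data = demo)
print(coef(fit))

# Color is mapped in geom_point() only, so geom_smooth() fits one line to all points.
# Writing formula = y ~ x explicitly avoids the informational message from geom_smooth().
p <- ggplot(demo, aes(x = income, y = food)) +
  geom_point(aes(color = status), alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Total Household Income",
       y = "Total Food Expenditure",
       color = "Marital Status")

print(p)
```

The slope from coef(fit) can be read as the change in food expenditure for each additional unit of income. When color is mapped in the main ggplot() call instead, as in the activity, each marital status gets its own regression line.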

Finally, add a theme, try theme_classic()


Copy your original script/code and add the needed code.

ggplot(fies_ilocos, aes(x = Total.Household.Income,
                        y = Total.Food.Expenditure,
                        color = Household.Head.Marital.Status)) +
  geom_point(alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, linetype = "solid") +
  labs(title = "Income vs. Food Expenditure by Marital Status",
       x = "Total Household Income",
       y = "Total Food Expenditure",
       color = "Marital Status") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

[Figure: Income vs. Food Expenditure by Marital Status, with regression lines and theme_classic(). x-axis: Total Household Income; y-axis: Total Food Expenditure; legend "Marital Status": Annulled, Divorced/Separated, Married, Single, Widowed]

Congratulations! You can now make statistical reports!
