Exercise 2
In the previous lab we worked with univariate and bivariate plots. What these plots had in common is
that the data had no temporal order. In this lab we will work mainly with time series
data and other types of grouped and longitudinal data.
Time series
What is a time series? It is a sequence of observations collected over a period of time, with a time interval
between consecutive observations (the interval may or may not be constant). Univariate or multivariate data
can be collected at each time point.
When working with time series data it makes sense to plot lines instead of points as we are interested in
following a trend in the data over the temporal variable.
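As a minimal sketch, assuming the data in question are ggplot2's built-in txhousing dataset (the text above
does not name the dataset, so this is an assumption), a first attempt at a line plot of sales over time
could look like this:

```r
library(ggplot2)

# txhousing ships with ggplot2: monthly housing figures for Texas cities.
# `date` is a decimal year (e.g. 2000.083), `sales` is the number of sales.
p_sales <- ggplot(txhousing, aes(x = date, y = sales)) +
  geom_line()   # joins the observations in order of the x variable

p_sales
```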
What does the plot look like? Is it something you can interpret or not?
Task 1
1. Explore how to better visualize the sales for this dataset. Use the tools we learned in the previous lab,
but also explore the dataset to find what can be changed to get a better visualization.
In the economics dataset, plot unemployment against date. It is quite simple, right? What if we want to plot
the relationship between two variables over time?
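For the single-variable case, a minimal sketch could be the following. It assumes the economics dataset
bundled with ggplot2 and takes its unemploy column as the unemployment variable:

```r
library(ggplot2)

# economics ships with ggplot2; `unemploy` is the number of unemployed (in thousands)
p_unemp <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

p_unemp
```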
Task 2
1. Use geom_path() to explore the connection between expenditure and unemployment. What
do you see?
2. What if you plot the relationship between unemployment and savings? Do you notice a trend? If
not, how can you solve this issue?
Hint: color by date
3. Visualize the development of house prices in Houston vs. the offer, and color by year. What do
you notice?
Hint: use subset()
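To illustrate the technique without solving the tasks, here is a small sketch of geom_path() with color
mapped to date. It uses a different pair of economics columns (uempmed and psavert) chosen purely for
illustration; the same pattern should carry over to the variables named in the tasks.

```r
library(ggplot2)

# geom_path() joins points in the order they appear in the data (here: by date),
# whereas geom_line() would join them in order of the x variable.
p_path <- ggplot(economics, aes(x = uempmed, y = psavert)) +
  geom_path(aes(colour = date))   # colour shows the direction of time

p_path
```

Mapping date to color turns an otherwise tangled path into something readable, because the color
gradient shows which end of the path is early and which is late.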
Task 3
1. We can now try to plot the order of the US presidents. In this task you should plot the order in which
they took office and color by party.
2. Now that we have some understanding of the housing data and the presidential data, let's try to plot the
two datasets together. If you focus on Dallas, does the president's party affiliation influence the listing of houses?
To complete the task above, you will need to convert the date column into actual Date values.
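One possible way to do the conversion is sketched below. It assumes the txhousing dataset, builds the
date from its year and month columns, and stores it in a new column called day (a name chosen here for
illustration only):

```r
library(ggplot2)

# Assumption: build a Date from the year and month columns,
# using the first day of each month as a placeholder day.
housing <- txhousing
housing$day <- as.Date(paste(housing$year, housing$month, 1, sep = "-"),
                       format = "%Y-%m-%d")

class(housing$day)
range(housing$day)
```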
You will have to use the geom_rect() layer to draw boxes spanning the start and end dates of each
presidential term. As this might be quite advanced, you should add
geom_rect(
  data = presidential,
  aes(xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = party),
  inherit.aes = FALSE, alpha = 0.2
)
to your plot. Doing that will create the shading. Just remember that the original ggplot is on the housing
data!
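Putting the pieces together, a hedged sketch of the full plot might look as follows. The column name day
and the choice to clip the presidential terms to the range of the housing data are assumptions made
here so that the x axis is not stretched back to the 1950s:

```r
library(ggplot2)

# Housing data for Dallas, with a proper Date column (hypothetical name `day`)
dallas <- subset(txhousing, city == "Dallas")
dallas$day <- as.Date(paste(dallas$year, dallas$month, 1, sep = "-"),
                      format = "%Y-%m-%d")

# Keep only the presidential terms that overlap the housing data,
# and clip them to the observed date range
first_day <- min(dallas$day)
last_day  <- max(dallas$day)
terms <- subset(presidential, end > first_day & start < last_day)
terms$start[terms$start < first_day] <- first_day
terms$end[terms$end > last_day] <- last_day

p_terms <- ggplot(dallas, aes(x = day, y = listings)) +
  geom_rect(
    data = terms,
    aes(xmin = start, xmax = end, ymin = -Inf, ymax = Inf, fill = party),
    inherit.aes = FALSE,   # the rectangles have no `day` or `listings` columns
    alpha = 0.2
  ) +
  geom_line()

p_terms
```

The rectangles are added before geom_line() so that the shading sits behind the housing curve.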
Grouped data
What about when we have groups in the data? In the Texas housing dataset, we have many cities. In the
Midwest data we had states. It is not unusual to have multiple observations at different time points for
separate groups (patients, states, or brands) to see the trend of each over time. Such data are
usually called longitudinal data.
Let's try to plot the Texas housing prices by year for the different cities. To achieve that, you
will need to aggregate the prices by city and year.
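One way to aggregate is sketched below with base R's aggregate(). It assumes that "prices" refers to the
median column of txhousing and that the yearly summary is the mean of the monthly values; both are
assumptions, and city_year is a name chosen for illustration:

```r
library(ggplot2)   # only needed for the txhousing dataset

# Mean of the monthly median sale price per city and year.
# With the formula interface, rows where `median` is NA are dropped by default.
city_year <- aggregate(median ~ city + year, data = txhousing, FUN = mean)

head(city_year)
```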
Now that you have the new dataset, you can try to plot by group (exactly the same as adding color, but
use the group aesthetic instead of color).
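A minimal sketch of plotting by group, assuming a yearly per-city summary of the txhousing median column
(recomputed here so the example is self-contained):

```r
library(ggplot2)

# Assumed aggregation: mean of the monthly median price per city and year
city_year <- aggregate(median ~ city + year, data = txhousing, FUN = mean)

# group = city draws one line per city without adding a colour legend
p_group <- ggplot(city_year, aes(x = year, y = median, group = city)) +
  geom_line(alpha = 0.5)

p_group
```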
Task 4
1. Having done the grouping, add a linear regression line to the plot using geom_smooth() with method = "lm".
What do you conclude? Can you think of situations where we would choose group over color?
Heatmap
Sometimes it is not easy to go through large amounts of data. When we want to look for patterns
across a large number of variables, we can plot the data as a heatmap to see how the values of the variables
change between groups or over time. To work with heatmaps we
will have to work outside ggplot, as it is not the optimal library for this type of visualization.
Please install the package pheatmap.
Returning to the economics dataset, create a heatmap of the dataset (excluding the first column) using the
pheatmap() function. What do you notice? Why do you think this is, and how can you solve it?
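A minimal sketch of the basic call, assuming the economics dataset from ggplot2 and an installed pheatmap
package. Row clustering is switched off here as a design choice, so that the rows stay in time order:

```r
library(ggplot2)    # only needed for the economics dataset
library(pheatmap)

# Drop the first column (date) and convert to a numeric matrix
econ_mat <- as.matrix(economics[, -1])

# Heatmap of the raw values; rows are kept in their original (time) order
hm <- pheatmap(econ_mat, cluster_rows = FALSE, show_rownames = FALSE)
```

pheatmap() also takes a scale argument ("none", "row" or "column"), which may be worth exploring when
answering the questions above.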
Task 5
In the lab room, you will find the file Genes. This file contains the expression levels of 1000 genes for 40
individuals. The first 20 individuals are healthy and the last 20 suffer from a disease. Import this file into your
workspace and create a heatmap of the genes. Can you differentiate between the healthy and the
diseased population?
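The layout of the Genes file is not described here, so the sketch below uses simulated data as a stand-in:
a hypothetical 1000 x 40 matrix with genes in rows and individuals in columns. The names expr and status
are illustrative, and the random values carry no biological meaning. It only shows how a column
annotation can mark the two groups:

```r
library(pheatmap)

set.seed(1)
# Simulated stand-in for the real file: 1000 genes (rows) x 40 individuals (columns)
expr <- matrix(rnorm(1000 * 40), nrow = 1000,
               dimnames = list(paste0("gene", 1:1000), paste0("id", 1:40)))

# Sample annotation: first 20 healthy, last 20 diseased (as stated in the task)
status <- data.frame(
  status = rep(c("healthy", "diseased"), each = 20),
  row.names = colnames(expr)
)

# The annotation bar above the columns shows which group each individual belongs to
hm_genes <- pheatmap(expr, annotation_col = status, show_rownames = FALSE)
```

With the real data, replace expr with the imported matrix, making sure the row names of the annotation
data frame match the column names of the matrix.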