Exercise 2

1. The document discusses visualizing time series data and grouped/longitudinal data using R. It introduces time series datasets like economics and txhousing and explores plotting trends over time. 2. Tasks involve better visualizing housing sales data, plotting relationships between economic variables over time, and comparing housing prices and presidential terms. 3. Grouped data from the txhousing dataset is aggregated by city and year to plot trends for each group. 4. A heatmap is introduced to visualize large datasets and look for patterns across many variables or groups. Tasks involve creating heatmaps to explore economic data and compare gene expression in healthy and diseased individuals.

Uploaded by

Yassine Azzimani

Lab session 2: Time series, heatmaps, and other

Data analysis and Visualization – Ilias Thomas

Red: code to copy paste in RStudio

Italics: variable and dataset names

Blue: functions and arguments

In the previous lab we worked with univariate and bivariate plots. What these plots had in common is
that the data had no temporal order. In this lab we will work mainly with time series
data and other types of grouped and longitudinal data.

Time series
What is a time series? It is a series of observations collected over a long period of time, with a time
interval between consecutive observation points (the interval may be constant or not). Univariate or
multivariate data can be collected at each point.

Let us consider some time series data in the ggplot2 package:

• economics: US economic time series
• presidential: Terms of 12 presidents from Eisenhower to Trump
• txhousing: Housing sales in Texas

Try to initially understand the datasets.
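
One possible way to get a first impression of the three datasets is sketched below. It only uses base R inspection functions; the helper variables n_cities and year_range are names introduced here for illustration and are not part of the lab.

```r
library(ggplot2)

# Structure: column names, types, and the first few values
str(economics)
str(presidential)
str(txhousing)

# First rows and basic summaries
head(txhousing)
summary(economics)

# How many cities does txhousing cover, and over which years?
# (n_cities and year_range are illustrative names)
n_cities <- length(unique(txhousing$city))
year_range <- range(txhousing$year)
n_cities
year_range
```

Knowing how many cities and years are in txhousing is useful when interpreting the first plot below.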

When working with time series data it makes sense to plot lines instead of points as we are interested in
following a trend in the data over the temporal variable.

Let’s try to explore the txhousing data. Plot the following:

ggplot(txhousing, aes(date, sales)) + geom_line()

What does the plot look like? Is it something you can interpret or not?

Task 1
1. Explore how to better visualize the sales for this dataset. Use the tools we learned in the previous lab,
but also explore the dataset to find what can be changed to get a better visualization.
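
As a possible starting point only (not the intended solution), the sketch below draws one line per city instead of a single line through all observations. The log scale and the transparency are assumptions made here to keep large and small cities readable on the same axis.

```r
library(ggplot2)

# One line per city instead of a single zig-zag through all cities
p_sales <- ggplot(txhousing, aes(date, sales, group = city)) +
  geom_line(alpha = 0.4, na.rm = TRUE) +  # na.rm: some months have no sales value
  scale_y_log10()                         # assumption: log scale to compare city sizes
p_sales
```

Other options worth trying are faceting by city or plotting only a subset of cities.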
In the economics data, plot unemployment vs. date. It is quite simple, right? But what if we want to plot
the relationship between two variables over time?

Task 2
1. Use the geom_path() function to explore the connection between expenditure and unemployment. What
do you see?

2. How about if you plot the relationship between unemployment and savings? Do you notice a trend? If
not, then how can you solve this issue?
Hint: color by date

Do you see a trend now? What would be the conclusion?
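
The hint can be sketched roughly as follows. The choice of columns is an assumption: unemploy / pop is used here as the unemployment rate and psavert as the savings rate. The year is extracted as a number so that a continuous color scale is used.

```r
library(ggplot2)

# geom_path() connects the observations in the order they appear in the data,
# which for economics is the time order
p_path <- ggplot(economics, aes(unemploy / pop, psavert)) +
  geom_path(aes(colour = as.numeric(format(date, "%Y")))) +
  labs(colour = "year")
p_path
```

Without the color, the direction of time along the path cannot be read from the plot.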

3. Visualize the house pricing development in Houston vs. the offer, and color based on the year. What do
you notice?
Hint: use subset()

Task 3
1. We can now try to plot the order of the US presidents. In this task you should plot the order in which
the presidents took office and color based on the party.

2. Now that we have some understanding of the housing data and the presidential data, let's try to plot the
two datasets together. If you focus on Dallas, does the president's party affiliation influence the listing of houses?

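To complete the task above you need to convert the decimal date column into actual dates. The
date_decimal() function comes from the lubridate package, so load it first with library(lubridate):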

txhousing$date <- as.Date(format(date_decimal(txhousing$date), "%Y-%m-%d"))

You will have to use the geom_rect() function to create boxes spanning the start and end dates of each
presidential term. As this might be quite advanced, you should add

geom_rect(
  aes(xmin = start, xmax = end, fill = party),
  ymin = -Inf, ymax = Inf, alpha = 0.2,
  data = presidential
)

to your plot. Doing that will create the shading. Just remember that the original ggplot is on the housing
data!
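
A minimal sketch of how the pieces could fit together is given below. Several choices are assumptions made here, not part of the lab: the data is copied into housing instead of overwriting txhousing, the presidential terms are filtered to those overlapping the housing period (stored in terms), and the x and y aesthetics are placed in geom_line() rather than in ggplot(), so that the shading layer does not look for date and listings inside the presidential data.

```r
library(ggplot2)
library(lubridate)  # provides date_decimal()

# Dallas only, on a copy so that txhousing itself stays unchanged
housing <- subset(txhousing, city == "Dallas")
housing$date <- as.Date(format(date_decimal(housing$date), "%Y-%m-%d"))

# Assumption: keep only the terms that overlap the housing period
terms <- subset(presidential,
                end >= min(housing$date) & start <= max(housing$date))

p_dallas <- ggplot(housing) +
  geom_rect(
    aes(xmin = start, xmax = end, fill = party),
    ymin = -Inf, ymax = Inf, alpha = 0.2,
    data = terms
  ) +
  geom_line(aes(date, listings), na.rm = TRUE) +
  # Zoom to the housing period without dropping the partially visible terms
  coord_cartesian(xlim = range(housing$date))
p_dallas
```

If you prefer to keep aes(date, listings) inside ggplot(), adding inherit.aes = FALSE to geom_rect() is another way to keep the two layers independent.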

Grouped data
How about when we have groups in the data? In the Texas housing dataset, we have many cities. In the
Midwest data we had states. It is not unusual to have multiple observations at different time points for
separate groups (patients, states, or brands, for example) in order to see the trend of each group over
time. These data are usually called longitudinal data.

Let’s try to plot the Texas housing prices by year for the different cities. To achieve that you
will need to average the monthly values for each city and year. You can use this code:

data_aggregate <- aggregate(txhousing[4:8], by = list(txhousing$year, txhousing$city), mean)

Now that you have the new dataset you can try to plot by group (use the group aesthetic exactly as you
used color before, just change the word).
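
The aggregation and the grouped plot could look roughly like the sketch below. Two details are assumptions added here: the grouping columns are given the names year and city (without names, aggregate() calls them Group.1 and Group.2), and na.rm = TRUE is passed to mean() so that a single missing month does not turn a whole year into NA.

```r
library(ggplot2)

# Columns 4 to 8 of txhousing: sales, volume, median, listings, inventory
data_aggregate <- aggregate(
  txhousing[4:8],
  by = list(year = txhousing$year, city = txhousing$city),
  FUN = mean, na.rm = TRUE   # assumption: ignore missing months
)

# One line per city; 'median' is the median sale price
p_group <- ggplot(data_aggregate, aes(year, median, group = city)) +
  geom_line(na.rm = TRUE)
p_group
```

Replacing group with color in aes() gives the same lines, but with a legend entry for every city.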

Task 4
1. Having done the grouping, add a linear regression line to the plot using method = "lm". What do you
conclude? Can you think of situations where we would choose group over color?

Heatmap
Sometimes it is not easy to go through large amounts of data. When we want to look for patterns
across a large number of variables, we can plot the data as a heatmap to see how the values of the
variables change between groups or over time. To work with heatmaps we
will have to work outside ggplot, as it is not the optimal library for this type of visualization.
Please install the package pheatmap.

Returning to the economics dataset, create a heatmap of the dataset (excluding the first column) using the
pheatmap function. What do you notice? Why do you think this is and how can you solve it?
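
A rough sketch of the call is shown below. Converting to a matrix, switching off row clustering, and hiding the row names are choices made here to keep the months in time order and the plot readable. The second call uses the scale argument of pheatmap, which is one option to consider if the variables turn out to be on very different scales.

```r
library(ggplot2)    # only for the economics dataset
library(pheatmap)

# Drop the first column (date) and keep the numeric variables as a matrix
econ <- as.matrix(economics[, -1])
rownames(econ) <- as.character(economics$date)

# Raw values
pheatmap(econ, cluster_rows = FALSE, show_rownames = FALSE)

# Each column standardized separately
pheatmap(econ, scale = "column", cluster_rows = FALSE, show_rownames = FALSE)
```

Compare the two plots before deciding which one answers the questions above.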

Task 5
In the lab room, you will find the file Genes. This file contains the expression levels of 1000 genes for 40
individuals. The first 20 are healthy and the last 20 suffer from a disease. Import this file into your
workspace and create a heatmap of the genes. Can you differentiate between the healthy and the
diseased population?
