Data Visualization - Spring 2017
Install R:
https://cloud.r-project.org/
Online classes:
Online books:
Data Visualization
35094 - C SC 83060
Spring 2017
Instructor:
Dr. Lev Manovich
Professor, PhD Program in Computer Science, The Graduate Center, CUNY.
Director, Cultural Analytics Lab.
room - 4422
Course schedule
Course description
final project
selected resources
LIST
Screenshot from selfiecity.net (2014-2015). The project received the Gold Award in the Best
Visualization Project of the Year category in the “Information is Beautiful” competition,
2014.
Course description
Summary
This is a hands-on course where students (1) learn and practice modern
techniques for visualization of different kinds of data. In addition, the
course also covers the following: (2) how to prepare and explore datasets
using the R language; (3) how to design visualizations for the public and to write
about them; (4) how to work with large datasets. Students will also be
introduced to principles of modern graphic design as they are used in
visualization.
Details
We will study and practice common visualization techniques for
single and multiple variables, quantitative and categorical data, spatial and
temporal data, image collections (and networks, if time allows).
Both in academic research and in industry, visualization goes together
with preparing and analyzing data. Some visualization techniques can’t be used
before the data is transformed using methods from modern statistics or
data science. Additionally, real-life datasets often need to be cleaned and
organized in proper formats before they can be visualized. Therefore, we
will also devote time to learning the basics of data cleaning and data analysis.
Students will also be introduced to basic principles of modern
design as they apply to design of static, animated and interactive
visualizations, data-centric publications, maps, and other common types of
data design. The principles cover use of form, proportion, color,
composition, design grids, basics of typography, hierarchical organization
of information, systematic use of design variables, and rhythm.
Students will complete a number of practical assignments to
understand and start mastering principles and techniques being introduced
in class.
The class time will be divided into three parts – 1/3 for instructor
presentations, 1/3 for discussions of important visualizations, design
projects, and readings, and 1/3 for critique of student work.
Selected historical and theoretical readings will be used to introduce
students to the histories of visualization and modern design and to help
them start thinking critically about the common practices of these fields,
and their use in commercial, non-profit, and scientific settings. In this way,
the class aims to teach students both solid practical skills and a critical,
reflexive attitude towards the material.
This short online class is similar to the approach we will use for part of the
course:
https://www.class-central.com/mooc/1478/udacity-data-analysis-with-r
Rationale
Data visualization is increasingly important today across many fields.
Its growing popularity corresponds to important cultural and technological
shifts in our societies – the adoption of data-centric analysis, research, and
arguments across dozens of new areas, and the arrival of massive datasets.
Data visualization techniques allow people to use perception and cognition
to see patterns in data, and communicate and form research hypotheses.
The goal of this course is to introduce students to the fundamentals of data
visualization and relevant design principles. Students will learn the basic
data visualization techniques, when and how to use them, how to design
visualizations that best exploit human visual perception, and how to
visualize various types of data (quantitative, categorical, spatial, temporal,
networks).
Learning Goals/Outcomes
The key goals of this course are to learn how to use modern visualization
techniques to help analysis and understanding of data, how to prepare and
analyze data sets using selected statistical and data science methods, how to
use principles of design in creating effective and engaging visualizations,
and how to approach visualization of various data types.
Assessments
a. Class participation: students are expected to participate in
discussions of the assigned material.
b. Practical assignments: students will complete a number of
homeworks. No late homework is accepted.
c. Final project: create a short visual essay about a topic of your
choice - so you can use your educational background and interests. The
topic should be of interest to general audiences as opposed to narrow
professional audiences. The essay should include a few visualizations of
some relevant dataset(s) you find or create. The essay should include
discussions/explanation of patterns in the visualizations.
Because our class meets only once a week, I will not be able to go over every
topic listed for every class. For the topics which we will not cover in class, I
have linked lecture notes and other material. Therefore, in addition
to the assigned readings and sites to view in homeworks, you should also go
through lecture notes and linked material for each class. You
should do that after each class. Feel free to research any subjects which
interest you in more detail.
If you are already familiar with any of the readings, projects, concepts, or
data analysis/visualization techniques covered in any of the homework -
skip them.
Recommended Textbooks:
Some of the chapters of these textbooks listed below will be assigned during
the semester:
Yanchang Zhao. R and Data Mining: Examples and Case Studies. Elsevier,
2012.
COURSE SCHEDULE:
[may change during the semester depending on students’ progress and interests]
1 Class introduction
Homework for class 6 - note: I added a new version of the van Gogh dataset
which now has all image features and genres.
10 Class cancelled
RESOURCES:
[last update was 5/2016 - so now there are new tools, resources and classes
available]
http://schoolofdata.org/handbook/courses/data-to-diagrams/
http://www.kdnuggets.com/2011/02/free-public-datasets.html
Best general overview of working with data for non-technical audiences (it’s written for
journalists but many parts are quite general):
Best textbook that teaches you data analysis (using R) - for people with very little technical
background:
For more advanced students - computer science text teaching you data analysis using R:
Yanchang Zhao, R and Data Mining: Examples and Case Studies. Elsevier, 2012.
Analysis of (literary) texts in R - written for digital humanities audience, very gentle and
gradual:
http://www.springer.com/statistics/computational+statistics/book/978-3-319-03163-7
http://codecondo.com/9-free-books-for-learning-data-mining-data-analysis/
Data cleaning:
http://schoolofdata.org/courses/#IntroDataCleaning
This online class is similar to the approach taken in this class (i.e. this syllabus):
https://www.class-central.com/mooc/1478/udacity-data-analysis-with-r
http://flowingdata.com/2016/03/08/what-i-use-to-visualize-data/
http://www.tableau.com/
d3: http://d3js.org/
Mapbox, Carto.
plot.ly
http://radar.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html
Final Project
Deadline:
June 1, 3 pm
Description:
Create a short visual essay about a topic of your choice - so you can use your
educational background and interests. The topic should be of interest to general audiences as
opposed to narrow professional audiences.
The essay should include a few visualizations of some relevant dataset(s) you find or
create.
You can also include other visual material - photos, video, maps, etc.
Format can be anything: Google doc, Word doc, PDF, a webpage, a long blog post, etc.
Visualizations can be static, animated or interactive (which is easy to do using
Google Docs, plot.ly, or another interactive datavis tool).
Here are some examples of such essays - some of them use sophisticated
interactive visualizations, and I don't expect you to produce
something like this, but the overall structure - presenting a story
using a number of visualizations - is what you should also use. (Note that these essays are longer
than what you need to write.)
http://www.nytimes.com/interactive/2014/12/12/upshot/where-men-arent-working-
map.html?
http://qz.com/465820/how-brand-new-words-are-spreading-across-america/
http://www.nytimes.com/interactive/2014/09/19/travel/reif-larsen-norway.html
http://www.nytimes.com/interactive/2014/12/23/us/gender-gaps-stanford-94.html
http://blog.okcupid.com/index.php/race-attraction-2009-2014/
http://blog.okcupid.com/index.php/the-best-questions-for-first-dates/
You can create your own data. For example, let’s say you want to count and plot the types of
objects, and the proportions of these types, that appear across a number of Instagram photos in the “flat
lay” genre. Or maybe you want to spend time in a cafe and record activities and their counts
(working on a laptop, chatting on a phone, talking to another person, etc.).
Or you can use existing dataset(s) available online about some subjects.
Economics data:
http://www.bls.gov/cps/cpsaat11.htm
http://www.nber.org/data/
Museums data:
https://github.com/cooperhewitt/collection
https://github.com/MuseumofModernArt/collection
City data:
https://www.citibikenyc.com/system-data
https://nycopendata.socrata.com/
https://snap.stanford.edu/data/
Lists of datasets:
http://www.kdnuggets.com/datasets/index.html
https://github.com/caesar0301/awesome-public-datasets
Notes for Class 2
Examples of current web visualization tools:
www.datawrapper.de
http://infosthetics.com/
http://www.visualcomplexity.com/vc/
Some of the “best visualizations” lists - linked here - see versions of these lists for
2015 and 2016:
http://manovich.net/index.php/exhibitions/selfiecity
http://www.informationisbeautifulawards.com/news/116-2015-the-winners
https://www.ted.com/talks/aaron_koblin
https://www.ted.com/talks/jer_thorp_make_data_more_human
https://www.ted.com/talks/manuel_lima_a_visual_history_of_human_knowle
dge
https://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualizati
on
Data visualization and data design/art conferences:
http://visualized.com/2016/
http://giorgialupi.com/
http://www.stefanieposavec.co.uk/
https://bost.ocks.org/mike/
http://truth-and-beauty.net/
http://feltron.com/
http://tulpinteractive.com/
http://blog.threestory.com/wordpress/tag/new-york-times
-Data Driven NYC (by far the most professional one, all at Bloomberg. But events get
sold out very quickly)
-Data Skeptics
2) Watch these TED talks by some of the key people in data visualization
community:
https://www.ted.com/talks/aaron_koblin
https://www.ted.com/talks/jer_thorp_make_data_more_human
https://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization
https://www.ted.com/talks/manuel_lima_a_visual_history_of_human_knowledge
http://www.datakind.org/
http://schoolofdata.org/
http://www.law.nyu.edu/centers/ili (NYC) -
http://towcenter.org/ (NYC)
Note:
You may find that some R tutorials / textbooks use “<-” and others use “=” - both are
assignment operators, and the two statements below are equivalent:
X <- 5
X = 5
1. data assembled for On Broadway project in our lab - Broadway street in Manhattan
(13 miles) broken into 713 rectangles, most data from 2/2014-7/2014
- nyc
2.
- tags
3. data assembled for Selfiecity project in our lab - as sample of 120,000 images shared
in six global cities during one week, 12/2013
- xx
You can either run the commands from the script, or copy the commands
from the demo below
ls()
rm("tags")
dim(xx)
str(xx)
colnames(xx)
Using head and tail commands:
head(xx)
tail(xx)
head(xx, n=20)
head(xx$username, n=40)
> colnames(xx)
[1] "just_filename" "instagram_id"
[3] "updated" "updated_trans_from_UT"
[5] "updated_trans_from_UT_value" "datetime_number"
[7] "hour" "date"
[9] "username" "city"
xx2 = xx[c(1:2,5:8,9)]
colnames(xx2)
[1] "just_filename" "instagram_id"
[3] "updated_trans_from_UT_value" "datetime_number"
[5] "hour" "date"
[7] "username"
http://www.statmethods.net/management/subset.html
dim(xx)
xx.sample = xx[sample(1:nrow(xx), 10000, replace=FALSE),] # create a sample first (10,000 rows is a hypothetical size)
dim(xx.sample)
plot this:
barplot(table(xx$hour))
count the number of unique values in a data object (the 120K dataset):
length(unique(xx$username))
Other commonly used R commands (they are not in the script for this week):
http://www.computerworld.com/article/2497164/business-intelligence-beginner-s-guide-to-r-
get-your-data-into-r.html
In addition to using data in standard .txt and .csv files, R can also store data in its own native
format - .rda
If R is running, you can double click on .Rda file and it will open in R workspace
load("/Users/levmanovich/Documents/xxx.Rda")
After you read the .Rda file, you can find the name of the corresponding object in R:
ls()
For example, we want to sample the .Rda file and then write a new file to the hard drive:
nyc.sample <- nyc[sample(1:nrow(nyc), 40000, replace=FALSE),]
save(nyc.sample,file="nyc.May.sample.Rda")
sort:
http://www.statmethods.net/management/sorting.html
head(nyc)
head(nyc[order(nyc$numPix..Instagram),])
newdata = nyc[order(nyc$numPix..Instagram),]
tail(sort(table(xx$username)))
tail(sort(table(xx$username)), n=50)
barplot(tail(sort(table(xx$username)), n=50))
subset:
http://www.statmethods.net/management/subset.html
table(nyc$Neighborhood)
nyc.Mid = nyc[which(nyc$Neighborhood=="Midtown (34th-42nd)"),]
http://stackoverflow.com/questions/9847054/how-to-get-summary-statistics-by-group
http://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r
http://www.statmethods.net/management/aggregate.html
https://schoolofdata.org/handbook/courses/what-is-data/
http://r4ds.had.co.nz/
Alternative book chapter teaching you how to work with data using the dplyr
package (similar to Chapter 5 in R for Data Science):
https://bookdown.org/rdpeng/exdata/managing-data-frames-with-the-
dplyr-package.html#data-frames
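As a quick preview of what these dplyr readings cover, here is a minimal sketch of the main verbs, using the built-in mtcars data rather than our class datasets:

```r
library(dplyr)

mtcars %>%
  filter(mpg > 20) %>%            # keep rows where mpg exceeds 20
  select(mpg, cyl, hp) %>%        # keep only three columns
  arrange(desc(mpg)) %>%          # sort by mpg, highest first
  group_by(cyl) %>%               # group by number of cylinders
  summarise(mean_hp = mean(hp))   # one summary row per group
```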
Notes for class 4:
Structured vs unstructured
Note: You can think of a basic map as a scatterplot of two data columns: latitude and longitude.
Note: sometimes we have one variable which is recorded at regular intervals. For
example, we can record temperature at 1 hour intervals. Or we can record child’s height at 1 year
intervals. In such a case, we can plot this one column using a bar plot or line plot, without using
the second column that indicates the intervals. This type of data is often called a time series.
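For example, a short sketch with made-up hourly temperatures (the values are invented for illustration):

```r
# 24 hourly temperature readings treated as a time series.
# Because the measurements are evenly spaced, the hour is implicit,
# so we can plot the single temperature column directly.
temp <- c(12, 11, 11, 10, 10, 11, 13, 15, 17, 19, 21, 22,
          23, 23, 22, 21, 20, 18, 16, 15, 14, 13, 13, 12)
plot(temp, type = "l", xlab = "Hour", ylab = "Temperature (C)")          # line plot
barplot(temp, names.arg = 0:23, xlab = "Hour", ylab = "Temperature (C)") # bar plot
```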
Visualizations can be made using the full data in column(s), or using aggregated data. An alternative
term for aggregated data is summarized data. Typically we summarize data using categories
that already exist in the data, or we can create new categories (as a histogram does, for example).
For one variable, sort the data in ascending or descending order before plotting it as a bar plot / line
plot - unless the data already has a particular logical order, which should then be preserved.
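A small self-contained illustration of this sorting step (the vector is made up):

```r
# Tabulate a made-up categorical vector, then plot it unsorted and sorted.
# Patterns are easier to read when bars are in ascending/descending order -
# unless the categories have a natural order (e.g. months), which should be kept.
counts <- table(c("a", "b", "b", "c", "c", "c", "d"))
barplot(counts)                           # default (alphabetical) order
barplot(sort(counts, decreasing = TRUE))  # descending order
```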
Visualization proportions:
If you show data which changes over time, make the time/sequence axis longer than the
other axis (i.e., use a horizontal format).
If you plot two variables and neither of them is time, use a square format. (This is the default
format for scatter plots.)
http://www.the-everyday.net/
You can see some of the visualizations similar to the ones I will show in the demo in our
published project:
http://www.the-everyday.net/p/the-extraordinary-and-everyday.html
Data file:
Kiev-feb17-feb22-1row-per-image_QTIP.txt
vg = read.delim("van_gogh_additional_measurements.txt")
library(ggplot2)
1) R for Data Science - go through chapters 7-16 (work through all examples and
commands in the chapters’ text on your computer)
http://r4ds.had.co.nz/
Notes for class 5: Calculating and Visualizing
Descriptive Statistics in R
http://www.statmethods.net/stats/descriptives.html
xx = read.delim("van_Gogh_genres.txt")
hist(xx$image_proportion, n=40)
mean(xx$image_proportion)
median(xx$image_proportion)
sd(xx$image_proportion)
summary(xx$image_proportion)
fivenum(xx$image_proportion)
Typically we want to count how many cases we have in one categorical variable:
table(xx$Genre_gen)
Or two variables:
table(xx$Genre_gen, xx$Year)
- For a single variable - bar plot, point plot, or line plot (point and line plots are the same
as bar plots but they show the data using points or connected lines);
barplot(table(xx$Genre_gen))
plot(table(xx$Genre_gen), type="l")
plot(table(xx$Genre_gen), type="p")
- These plots often do not print all labels by default - to force them to print all labels,
use the las=2 option:
http://www.statmethods.net/graphs/bar.html
# count how many genres appear in van Gogh paintings in each place
aa = colSums( xtabs( ~ Genre_gen + Label_Place , xx ) !=0 )
http://www.statmethods.net/management/aggregate.html
http://stackoverflow.com/questions/9847054/how-to-get-summary-statistics-by-group
http://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r
Using tapply():
tapply(xx$image_proportion, xx$Genre_gen, mean)
http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-
vs-tapply-vs-by-vs-aggrega
tapply - “For when you want to apply a function to subsets of a vector and the subsets are
defined by some other vector, usually a factor.”
Using aggregate()
aggregate() does the same as tapply but it produces a data frame which is easier to further
analyze and visualize:
This format uses categories in one variable to aggregate all other variables:
attach(mtcars)
This format uses categories in two variables to aggregate all other variables:
This format uses categories in ONE variable to aggregate another SINGLE variable:
We can define our own function for aggregation - for example, to count the number of cases in
a categorical variable (in this case, we are counting how many genres van Gogh painted in
each place he lived):
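The aggregate() calls for the formats described above appear to be missing from these notes; the following is a reconstruction of the three formats plus a custom function, using the built-in mtcars data rather than our van Gogh file:

```r
attach(mtcars)

# ONE variable's categories (cyl) aggregate ALL other variables:
aggregate(mtcars, by = list(cyl), FUN = mean)

# TWO variables' categories (cyl, gear) aggregate all other variables:
aggregate(mtcars, by = list(cyl, gear), FUN = mean)

# ONE variable's categories aggregate another SINGLE variable (formula syntax):
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# our own aggregation function - here, counting cases per category:
aggregate(mpg ~ cyl, data = mtcars, FUN = length)
```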
“One very convenient feature of ggplot2 is its range of functions to summarize your R data in
the plot. This means that you often don’t have to pre-summarize your data.”
[note - some R functions have been updated so not all these functions may work now as in this
table]
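To illustrate the quote above: stat_summary() lets ggplot2 compute group summaries inside the plot call, so we don't have to aggregate first. A minimal sketch with the built-in mtcars data (note: ggplot2 versions before 3.3.0 use fun.y instead of fun):

```r
library(ggplot2)

# Plot the mean mpg per cylinder count; the mean is computed by ggplot2 itself,
# not pre-summarized by us.
ggplot(mtcars, aes(factor(cyl), mpg)) +
  stat_summary(fun = mean, geom = "bar")
```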
Data file
You need to convert your data frame from its standard format (called “wide” format in R) - where
each variable is in its own column - to a “long” format.
# make a copy of the data frame keeping only the columns containing the variables you want to
# plot
xx = x[c(1,8:10)]
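One common way to do the wide-to-long conversion (as of this class) is reshape2::melt(); here is a sketch with a made-up data frame, assuming the reshape2 package is installed:

```r
library(reshape2)

# Wide format: one column per variable.
wide <- data.frame(id = 1:3,
                   brightness = c(100, 150, 120),
                   saturation = c(40, 60, 55))

# Long format: one row per (id, variable) pair,
# with columns id / variable / value.
long <- melt(wide, id.vars = "id")
head(long)
```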
xx = read.delim("van_gogh_all.txt")
ggplot(x, aes(x=Brightness_Median, colour=Genre_gen,group=Genre_gen)) +
geom_density()
More resources - visualising distributions of data and data parts using ggplot2:
http://www.r-bloggers.com/ggplot2-cheatsheet-for-visualizing-distributions/
http://www.fdawg.org/FDAWG/Tutorials/ggplot2.html
Practical Homework 1
Goal: use R and ggplot2 to visualize relations between selected variables in van Gogh data.
Note - I added a new version of the van Gogh data file that has both genres and
image proportions - use this version:
https://www.dropbox.com/s/yw692nrku58dwcu/van_gogh_all.txt?dl=0
1) Year, Month
2) Label_Place, season
3) Genre, image_proportions
In these visualizations, you can show number of paintings in each data group, or statistics of
brightness and saturation, or image proportions, or other data.
Make sure that all labels are descriptive and easy to understand. Rename default labels if
needed.
After you create 3 visualizations you are happy with, combine them into one PDF and submit it.
The PDF should be < 10 MB.
You will receive email from Dropbox telling you where to upload your file.
SUBMIT HOMEWORK BY 6PM March 6.
The following are three examples of possible visualizations for this homework:
ggplot(mm, aes(reorder(Genre_gen,image_proportion),image_proportion)) +
geom_point(size=3) + coord_flip()
----------------------------------------------------------
Class demo:
x = read.delim("~/Documents/_CUNY GC class Spring 2017/van_gogh_all.txt")
colnames(x)
colnames(x)[5] <- "Place"
colnames(x)[9] <- "Genres_a"
colnames(x)[10] <- "Genres_b"
colnames(x)[15] <- "Proportion"
colnames(x)
library(ggplot2)
ggplot(x, aes(Year_Month, Proportion)) + geom_point()
ggplot(x, aes(Year_Month, Proportion)) + geom_point(alpha=0.2) + theme_minimal()
Class 8
A few well-known examples of using standard data visualization
techniques with “big data” - and how to write about big social
data for non-technical audiences:
Google n-Gram viewer:
http://ngrams.googlelabs.com
Media coverage:
http://www.nytimes.com/2013/12/08/technology/in-a-scoreboard-of-words-a-
cultural-guide.html?pagewanted=all&_r=0
Further developments:
http://www.theatlantic.com/technology/archive/2013/10/googles-ngram-
viewer-goes-wild/280601/
http://larryferlazzo.edublogs.org/2014/07/24/ny-times-creates-their-own-
version-of-googles-ngram-viewer/
http://blog.stephenwolfram.com/2012/03/the-
personal-analytics-of-my-life/
OK Cupid blog:
http://blog.okcupid.com/
http://feltron.com
Class 9:
Practical Homework 2 review
Development of statistics in the 18th-19th century and the idea of “social physics”:
Philip Ball. Chapter 3: The Law of Large Numbers from his book Critical Mass. 2006.
The PDF of the chapter you need to read.
“All that is necessary to reduce the whole of Nature to laws similar to those which Newton
discovered with the aid of calculus, is to have a sufficient number of observations and a
mathematics that is complex enough.” - Condorcet (French mathematician), Essay on
Applications of Analysis to the Probability of Majority Decisions, 1785.
“Now that human mind has grasped celestial and terrestrial physics, mechanical and chemical,
organic physics, both vegetable and animal, there remains one science, to fill up the series of
sciences of observations - social physics.” - Auguste Comte, Cours de philosophie positive (1830-
1842).
Reality Mining - take a look at this Wikipedia article that lists different ways to capture social
data at multiple scales.
Alex Pentland (MIT) is one of the pioneers in using big data to study social phenomena. Read
his text: https://www.edge.org/conversation/reinventing-society-in-the-wake-of-
big-data
Critical response (“The Limits of Social Engineering,” http://www.technologyreview.com/review/526561/the-limits-of-social-engineering/):
“The power of social physics,” he writes, “comes from the fact that almost all of our day-to-day
actions are habitual, based mostly on what we have learned from observing the behavior of
others.”
“Political and economic classes, he contends, are “oversimplified stereotypes of a fluid and
overlapping matrix of peer groups.” Peer groups, unlike classes, are defined by “shared norms”
rather than just “standard features such as income” or “their relationship to the means of
production.”
“Pentland may be right that our behavior is determined largely by social norms and the
influences of our peers, but what he fails to see is that those norms and influences are
themselves shaped by history, politics, and economics, not to mention power and prejudice.
People don’t have complete freedom in choosing their peer groups. Their choices are
constrained by where they live, where they come from, how much money they have, and what
they look like. A statistical model of society that ignores issues of class, that takes patterns of
influence as givens rather than as historical contingencies, will tend to perpetuate existing social
structures and dynamics. It will encourage us to optimize the status quo rather than challenge
it.”
“What big data can’t account for is what’s most unpredictable, and most interesting, about us.”
class 11:
changing continuous variables into categorical
variables (cut); work with bigger datasets (sample);
read large datasets into R; using data tables.
http://www.r-bloggers.com/r-function-of-the-day-cut/
http://stackoverflow.com/questions/5746544/r-cut-by-defined-interval
Example using van_gogh_genres.txt
x2=x
One method:
x2$Seasons = cut(x2$Month, breaks = 4)
Better method:
x2$Seasons = cut(x2$Month, breaks = seq(0,12, by=3))
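We can also name the intervals directly with the labels argument, instead of keeping cut()'s default "(0,3]"-style labels. A self-contained sketch with made-up months (the season names are illustrative):

```r
month <- sample(1:12, 100, replace = TRUE)   # stand-in for x2$Month
seasons <- cut(month, breaks = seq(0, 12, by = 3),
               labels = c("Winter", "Spring", "Summer", "Fall"))
table(seasons)   # counts per season
```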
http://www.statisticshowto.com/sampling-with-replacement-without/
https://en.wikipedia.org/wiki/Simple_random_sample
load("tw5cities.Rda") # loads the data objects (e.g. london) into the workspace
summary(london$lat)
summary(london.sample$lat)
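The sampling step itself follows the same pattern as the earlier nyc.sample example; here is a self-contained sketch using mtcars as a stand-in for a large data frame:

```r
# Draw 10 random rows without replacement.
idx <- sample(1:nrow(mtcars), 10, replace = FALSE)
mtcars.sample <- mtcars[idx, ]

summary(mtcars$mpg)          # full data
summary(mtcars.sample$mpg)   # sample - distribution should be similar
```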
library(dplyr)
install.packages('data.table')
library(data.table)
x = fread("~/Documents/_Twitter 81 cities/city-tables-wide-combined-10km-
bracketless/London.csv")
“The data.table R package provides an enhanced version of data.frame that allows you to do
blazing fast data manipulations. The data.table R package is being used in different fields such as
finance and genomics, and is especially useful for those of you that are working with large data
sets (e.g. 1GB to 100GB in RAM).”
source: https://www.datacamp.com/community/tutorials/data-table-cheat-sheet#gs.nsWlMGk
Another reason to use data tables is that for many operations, the syntax is easier.
You can convert a data frame in your workspace into a data table. Assume that you
have a data frame called DF:
DT = data.table(DF)
There are lots of tutorials online showing how to use data tables, for example:
https://www.r-bloggers.com/intro-to-the-data-table-package/
dt <- data.table(mtcars)
class(dt)
dt[,mean(mpg)]
dt[,mean(mpg),by=am]
dt[,mean(mpg),by=.(am,cyl)]
class 12:
Using colors and working with colors in R; basic
design principles; creating and publishing
interactive web visualizations
https://en.wikipedia.org/wiki/HSL_and_HSV
http://www.rapidtables.com/web/color/RGB_Color.htm
There are many designed color palettes on the web and also color combinations generators, for
example:
http://www.colourlovers.com/
R has many functions and methods for assigning colors to objects in plots. Here are some of
them:
http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/
By default, the colors for discrete scales are evenly spaced around a HSL color circle. For
example, if there are two colors, then they will be selected from opposite points on the
circle; if there are three colors, they will be 120° apart on the color circle; and so on.
colors()
http://www.statmethods.net/advgraphs/parameters.html
http://ggplot2.tidyverse.org/reference/scale_gradient.html
http://colorbrewer2.org/
( http://ggplot2.tidyverse.org/reference/scale_brewer.html )
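For example, here is a sketch applying one ColorBrewer palette ("Set1") to a discrete fill scale, using the built-in mtcars data:

```r
library(ggplot2)

# Stacked bar chart of cars by cylinder count, filled by gear count,
# with colors taken from the "Set1" ColorBrewer palette.
ggplot(mtcars, aes(factor(cyl), fill = factor(gear))) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1")
```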
Display palettes:
library(RColorBrewer)
display.brewer.all()
ggthemes Package:
https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html
Plot.ly - very powerful system for generating and publishing web graphs
Practical Homework 3:
You need to create a single visualization that uses data from the tw5cities.Rda file that I already
shared with you. If you like, you can instead create a few copies of a single visualization to
show parts of the data separately (corresponding to different cities).
Your visualization(s) should compare growth in visual tweets in five cities during 2011-2014.
Each row in the data file corresponds to one post. You have metadata for geo-location, city,
country, the level of economic development of the country, and date and time of the post.
You are allowed to aggregate the data, but only in limited ways. For example, the data file has the
year, month, day, hour and minute of each post. You are allowed to aggregate this into some
small intervals, such as 5, 10, 15, or 30 minutes. Thus, instead of plotting every tweet separately,
you can for example plot the number of tweets shared in a given city for every 30 minutes during
2011-2014.
Or maybe you want to aggregate over space, to count and visualize the number of posts shared
over small parts of the city. (Note that the data was collected for a 10 km x 10 km central area in
each city.)
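A hedged sketch of the 30-minute aggregation idea (not a required solution), using made-up hour/minute columns; the real column names in tw5cities.Rda may differ:

```r
# Simulate 1,000 posts with random timestamps.
posts <- data.frame(hour   = sample(0:23, 1000, replace = TRUE),
                    minute = sample(0:59, 1000, replace = TRUE))

# Assign each post to a half-hour slot: 0, 0.5, 1, 1.5, ..., 23.5.
posts$halfhour <- posts$hour + ifelse(posts$minute < 30, 0, 0.5)

counts <- table(posts$halfhour)  # posts per 30-minute slot
barplot(counts)
```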
Whatever aggregation you use, remember that you still need to visualize the data in detail. I.e., if
you visualize every single post as a separate point, your plot will have 2,732,305 separate
points. Now, let’s say you aggregated the posts in 15-minute intervals. If you now plot this data,
the plot will show approximately 100,000 separate points.
- if you are visualizing aggregated data as points, your visualization needs to contain > 50,000
points.
- if you are visualizing aggregated data as lines, your visualization needs to contain > 1,000 lines.
- if you are visualizing aggregated data as a heatmap, your visualization needs to contain >
50,000 heatmap elements (i.e. cells in a grid).