Data Visualization - Spring 2017
Install R:
https://cloud.r-project.org/
Online classes:
Online books:
Data Visualization
35094 - C SC 83060
Spring 2017
Instructor:
Dr. Lev Manovich
Professor, PhD Program in Computer Science, The Graduate Center, CUNY.
Director, Cultural Analytics Lab.
room - 4422
Course schedule
Course description
final project
selected resources
LIST
Screenshot from selfiecity.net (2014-2015). The project received the Gold Award in the Best
Visualization Project of the Year category in the “Information is Beautiful” competition,
2014.
Course description
Summary
This is a hands-on course where students (1) learn and practice modern
techniques for visualization of different kinds of data. In addition, the
course also covers the following: (2) how to prepare and explore datasets
using the R language; (3) how to design visualizations for the public and to write
about them; (4) how to work with large datasets. Students will also be
introduced to principles of modern graphic design as they are used in
visualization.
Details
We will study and practice common visualization techniques for
single and multiple variables, quantitative and categorical data, spatial and
temporal data, image collections (and networks, if time allows).
Both in academic research and in industry, visualization goes together
with preparing and analyzing data. Some visualization techniques can’t be used
before the data is transformed using methods from modern statistics or
data science. Additionally, real-life datasets often need to be cleaned and
organized in proper formats before they can be visualized. Therefore, we
will also devote time to learning the basics of data cleaning and data analysis.
Students will also be introduced to basic principles of modern
design as they apply to design of static, animated and interactive
visualizations, data-centric publications, maps, and other common types of
data design. The principles cover use of form, proportion, color,
composition, design grids, basics of typography, hierarchical organization
of information, systematic use of design variables, and rhythm.
Students will complete a number of practical assignments to
understand and start mastering principles and techniques being introduced
in class.
The class time will be divided into three parts – 1/3 for instructor
presentations, 1/3 for discussions of important visualizations, design
projects, and readings, and 1/3 for critique of student work.
Selected historical and theoretical readings will be used to introduce
students to the histories of visualization and modern design and to help
them start thinking critically about the common practices of these fields,
and their use in commercial, non-profit, and scientific settings. In this way,
the class aims to teach students both solid practical skills and a critical,
reflexive attitude towards the material.
This short online class is similar to the approach we will use for part of the
course:
https://www.class-central.com/mooc/1478/udacity-data-analysis-with-r
Rationale
Data visualization is increasingly important today across many fields.
Its growing popularity corresponds to important cultural and technological
shifts in our societies – the adoption of data-centric analysis, research, and
arguments across dozens of new areas, and the arrival of massive datasets.
Data visualization techniques allow people to use perception and cognition
to see patterns in data, and communicate and form research hypotheses.
The goal of this course is to introduce students to the fundamentals of data
visualization and relevant design principles. Students will learn the basic
data visualization techniques, when and how to use them, how to design
visualizations that best exploit human visual perception, and how to
visualize various types of data (quantitative, categorical, spatial, temporal,
networks).
Learning Goals/Outcomes
The key goals of this course are to learn how to use modern visualization
techniques to help analysis and understanding of data, how to prepare and
analyze data sets using selected statistical and data science methods, how to
use principles of design in creating effective and engaging visualizations,
and how to approach visualization of various data types.
Assessments
a. Class participation: students are expected to participate in
discussions of the assigned material.
b. Practical assignments: students will complete a number of
homeworks. No late homework is accepted.
c. Final project: create a short visual essay about a topic of your
choice - so you can use your educational background and interests. The
topic should be of interest to general audiences as opposed to narrow
professional audiences. The essay should include a few visualizations of
some relevant dataset(s) you find or create. The essay should include
discussions/explanation of patterns in the visualizations.
Because our class meets only once a week, I will not be able to go over every
topic listed for every class. For the topics which we will not cover in class, I
have linked lecture notes and other material. Therefore, in addition
to the assigned readings and sites to view in homeworks, you should also go
through lecture notes and linked material for each class. You
should do that after each class. Feel free to research any subjects which
interest you in more detail.
If you are already familiar with any of the readings, projects, concepts, or
data analysis/visualization techniques covered in any of the homework -
skip them.
Recommended Textbooks:
Some of the chapters of these textbooks listed below will be assigned during
the semester:
Yanchang Zhao. R and Data Mining: Examples and Case Studies. Elsevier,
2012.
COURSE SCHEDULE:
[may change during the semester depending on students’ progress and interests]
1 Class introduction
Homework for class 6 - note: I added a new version of the van Gogh dataset
which now has all image features and genres.
10 Class cancelled
RESOURCES:
[last update was 5/2016 - so now there are new tools, resources and classes
available]
http://schoolofdata.org/handbook/courses/data-to-diagrams/
http://www.kdnuggets.com/2011/02/free-public-datasets.html
Best general overview of working with data for non-technical audiences (it’s written for
journalists but many parts are quite general):
Best textbook that teaches you data analysis (using R) - for people with very little technical
background:
For more advanced students - computer science text teaching you data analysis using R:
Yanchang Zhao, R and Data Mining: Examples and Case Studies. Elsevier, 2012.
Analysis of (literary) texts in R - written for digital humanities audience, very gentle and
gradual:
http://www.springer.com/statistics/computational+statistics/book/978-3-319-03163-7
http://codecondo.com/9-free-books-for-learning-data-mining-data-analysis/
Data cleaning:
http://schoolofdata.org/courses/#IntroDataCleaning
This online class is similar to the approach taken in this class (i.e. this syllabus):
https://www.class-central.com/mooc/1478/udacity-data-analysis-with-r
http://flowingdata.com/2016/03/08/what-i-use-to-visualize-data/
http://www.tableau.com/
d3: http://d3js.org/
Mapbox, Carto.
plot.ly
http://radar.oreilly.com/2013/03/python-data-tools-just-keep-getting-better.html
Final Project
Deadline:
June 1, 3 pm
Description:
Create a short visual essay about a topic of your choice - so you can use your
educational background and interests. The topic should be of interest to general audiences as
opposed to narrow professional audiences.
The essay should include a few visualizations of some relevant dataset(s) you find or
create.
You can also include other visual material - photos, video, maps, etc.
Format can be anything: Google doc, Word doc, PDF, a webpage, a long blog post, etc.
Visualizations can be static, animated or interactive (which is easy to do using
Google Docs, plot.ly, or another interactive datavis tool).
Here are some examples of such essays - some of them use sophisticated
interactive visualizations, and I don't expect you to produce
something like this, but the overall structure - presenting a story
using a number of visualizations - is what you should also use. (Note that these essays are longer
than what you need to write.)
http://www.nytimes.com/interactive/2014/12/12/upshot/where-men-arent-working-
map.html?
http://qz.com/465820/how-brand-new-words-are-spreading-across-america/
http://www.nytimes.com/interactive/2014/09/19/travel/reif-larsen-norway.html
http://www.nytimes.com/interactive/2014/12/23/us/gender-gaps-stanford-94.html
http://blog.okcupid.com/index.php/race-attraction-2009-2014/
http://blog.okcupid.com/index.php/the-best-questions-for-first-dates/
You can create your own data. For example, let’s say you want to count and plot the types of
objects, and the proportions of these types, that appear across a number of Instagram photos in the “flat
lay” genre. Or maybe you want to spend time in a cafe and record activities and their counts
(working on a laptop, chatting on a phone, talking to another person, etc.).
Or you can use existing dataset(s) available online about some subjects.
Economics data:
http://www.bls.gov/cps/cpsaat11.htm
http://www.nber.org/data/
Museums data:
https://github.com/cooperhewitt/collection
https://github.com/MuseumofModernArt/collection
City data:
https://www.citibikenyc.com/system-data
https://nycopendata.socrata.com/
https://snap.stanford.edu/data/
Lists of datasets:
http://www.kdnuggets.com/datasets/index.html
https://github.com/caesar0301/awesome-public-datasets
Notes for Class 2
Examples of current web visualization tools:
www.datawrapper.de
http://infosthetics.com/
http://www.visualcomplexity.com/vc/
Some of the “best visualizations” lists - linked here - see versions of these lists for
2015 and 2016:
http://manovich.net/index.php/exhibitions/selfiecity
http://www.informationisbeautifulawards.com/news/116-2015-the-winners
https://www.ted.com/talks/aaron_koblin
https://www.ted.com/talks/jer_thorp_make_data_more_human
https://www.ted.com/talks/manuel_lima_a_visual_history_of_human_knowle
dge
https://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualizati
on
Data visualization and data design/art conferences:
http://visualized.com/2016/
http://giorgialupi.com/
http://www.stefanieposavec.co.uk/
https://bost.ocks.org/mike/
http://truth-and-beauty.net/
http://feltron.com/
http://tulpinteractive.com/
http://blog.threestory.com/wordpress/tag/new-york-times
-Data Driven NYC (by far the most professional one, all at Bloomberg. But events get
sold out very quickly)
-Data Skeptics
2) Watch these TED talks by some of the key people in data visualization
community:
https://www.ted.com/talks/aaron_koblin
https://www.ted.com/talks/jer_thorp_make_data_more_human
https://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization
https://www.ted.com/talks/manuel_lima_a_visual_history_of_human_knowledge
http://www.datakind.org/
http://schoolofdata.org/
http://www.law.nyu.edu/centers/ili (NYC) -
http://towcenter.org/ (NYC)
Note:
You may find that some R tutorials / textbooks use “<-” and others use “=” - both are
assignment operators, and the two statements below are equivalent:
X <- 5
X = 5
1. data assembled for On Broadway project in our lab - Broadway street in Manhattan
(13 miles) broken into 713 rectangles, most data from 2/2014-7/2014
- nyc
2.
- tags
3. data assembled for Selfiecity project in our lab - as sample of 120,000 images shared
in six global cities during one week, 12/2013
- xx
You can either run the commands from the script, or copy the commands
from the demo below
ls()
rm("tags")
dim(xx)
str(xx)
colnames(xx)
Using head and tail commands:
head(xx)
tail(xx)
head(xx, n=20)
head(xx$username, n=40)
> colnames(xx)
[1] "just_filename" "instagram_id"
[3] "updated" "updated_trans_from_UT"
[5] "updated_trans_from_UT_value" "datetime_number"
[7] "hour" "date"
[9] "username" "city"
xx2 = xx[c(1:2,5:8,9)]
colnames(xx2)
[1] "just_filename" "instagram_id"
[3] "updated_trans_from_UT_value" "datetime_number"
[5] "hour" "date"
[7] "username"
http://www.statmethods.net/management/subset.html
dim(xx)
xx.sample = xx[sample(1:nrow(xx), 10000, replace=FALSE),] # create a sample first (10,000 rows is a hypothetical size)
dim(xx.sample)
plot this:
barplot(table(xx$hour))
count the number of unique values in a data object (the 120K dataset):
length(unique(xx$username))
Other commonly used R commands (they are not in the script for this week):
http://www.computerworld.com/article/2497164/business-intelligence-beginner-s-guide-to-r-
get-your-data-into-r.html
In addition to using data in standard .txt and .csv files, R can also store data in its own native
format - .rda
If R is running, you can double click on .Rda file and it will open in R workspace
load("/Users/levmanovich/Documents/xxx.Rda")
After you read the .Rda file, you can find the name of the corresponding object in R:
ls()
For example, we want to sample the .Rda file and then write a new file to the hard drive:
nyc.sample <- nyc[sample(1:nrow(nyc), 40000, replace=FALSE),]
save(nyc.sample,file="nyc.May.sample.Rda")
sort:
http://www.statmethods.net/management/sorting.html
head(nyc)
head(nyc[order(nyc$numPix..Instagram),])
newdata = nyc[order(nyc$numPix..Instagram),]
tail(sort(table(xx$username)))
tail(sort(table(xx$username)), n=50)
barplot(tail(sort(table(xx$username)), n=50))
subset:
http://www.statmethods.net/management/subset.html
table(nyc$Neighborhood)
nyc.Mid = nyc[which(nyc$Neighborhood=="Midtown (34th-42nd)"),]
http://stackoverflow.com/questions/9847054/how-to-get-summary-statistics-by-group
http://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r
http://www.statmethods.net/management/aggregate.html
https://schoolofdata.org/handbook/courses/what-is-data/
http://r4ds.had.co.nz/
Alternative book chapter teaching you how to work with data using the dplyr
package (similar to Chapter 5 in R for Data Science):
https://bookdown.org/rdpeng/exdata/managing-data-frames-with-the-
dplyr-package.html#data-frames
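As a quick preview of what these dplyr readings cover, here is a minimal sketch of the main verbs, using the built-in mtcars data rather than our class datasets:

```r
library(dplyr)

mtcars %>%
  filter(mpg > 20) %>%            # keep rows where mpg exceeds 20
  select(mpg, cyl, hp) %>%        # keep only three columns
  arrange(desc(mpg)) %>%          # sort by mpg, highest first
  group_by(cyl) %>%               # group by number of cylinders
  summarise(mean_hp = mean(hp))   # one summary row per group
```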
Notes for class 4:
Structured vs unstructured
Note: You can think of a basic map as a scatterplot of two data columns: latitude and longitude.
Note: sometimes we have one variable which is recorded at regular intervals. For
example, we can record temperature at 1 hour intervals. Or we can record child’s height at 1 year
intervals. In such a case, we can plot this one column using a bar plot or line plot, without using
the second column that indicates the intervals. This type of data is often called a time series.
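For example, a short sketch with made-up hourly temperatures (the values are invented for illustration):

```r
# 24 hourly temperature readings treated as a time series.
# Because the measurements are evenly spaced, the hour is implicit,
# so we can plot the single temperature column directly.
temp <- c(12, 11, 11, 10, 10, 11, 13, 15, 17, 19, 21, 22,
          23, 23, 22, 21, 20, 18, 16, 15, 14, 13, 13, 12)
plot(temp, type = "l", xlab = "Hour", ylab = "Temperature (C)")          # line plot
barplot(temp, names.arg = 0:23, xlab = "Hour", ylab = "Temperature (C)") # bar plot
```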
Visualizations can be made using the full data in column(s), or using aggregated data. An alternative
term for aggregated data is summarized data. Typically we summarize data using categories
that already exist in the data, or we can create new categories (as a histogram does, for example).
For one variable, sort the data in ascending or descending order before plotting it as a bar plot / line
plot - unless the data already has a particular logical order, which should then be preserved.
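A small self-contained illustration of this sorting step (the vector is made up):

```r
# Tabulate a made-up categorical vector, then plot it unsorted and sorted.
# Patterns are easier to read when bars are in ascending/descending order -
# unless the categories have a natural order (e.g. months), which should be kept.
counts <- table(c("a", "b", "b", "c", "c", "c", "d"))
barplot(counts)                           # default (alphabetical) order
barplot(sort(counts, decreasing = TRUE))  # descending order
```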
Visualization proportions:
If you show data which changes over time, make the time/sequence axis longer than the
other axis (i.e., use a horizontal format).
If you plot two variables and neither of them is time, use a square format. (This is the default
format for scatter plots.)
http://www.the-everyday.net/
You can see some of the visualizations similar to the ones I will show in the demo in our
published project:
http://www.the-everyday.net/p/the-extraordinary-and-everyday.html
Data file:
Kiev-feb17-feb22-1row-per-image_QTIP.txt
vg = read.delim("van_gogh_additional_measurements.txt")
library(ggplot2)
1) R for Data Science - go through chapters 7-16 (work through all examples and
commands in the chapters’ text on your computer)
http://r4ds.had.co.nz/
Notes for class 5: Calculating and Visualizing
Descriptive Statistics in R
http://www.statmethods.net/stats/descriptives.html
xx = read.delim("van_Gogh_genres.txt")
hist(xx$image_proportion, n=40)
mean(xx$image_proportion)
median(xx$image_proportion)
sd(xx$image_proportion)
summary(xx$image_proportion)
fivenum(xx$image_proportion)
Typically we want to count how many cases we have in one categorical variable:
table(xx$Genre_gen)
Or two variables:
table(xx$Genre_gen, xx$Year)
- For a single variable - bar plot, point plot, or line plot (point and line plots are the same
as bar plots but they show the data using points or connected lines);
barplot(table(xx$Genre_gen))
plot(table(xx$Genre_gen), type="l")
plot(table(xx$Genre_gen), type="p")
- These plots often do not print all labels by default - to force them to print all labels,
use the las=2 option:
http://www.statmethods.net/graphs/bar.html
# count how many genres appear in van Gogh paintings in each place
aa = colSums( xtabs( ~ Genre_gen + Label_Place , xx ) !=0 )
http://www.statmethods.net/management/aggregate.html
http://stackoverflow.com/questions/9847054/how-to-get-summary-statistics-by-group
http://stats.stackexchange.com/questions/8225/how-to-summarize-data-by-group-in-r
Using tapply():
tapply(xx$image_proportion, xx$Genre_gen, mean)
http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-
vs-tapply-vs-by-vs-aggrega
tapply - “For when you want to apply a function to subsets of a vector and the subsets are
defined by some other vector, usually a factor.”
Using aggregate()
aggregate() does the same as tapply but it produces a data frame which is easier to further
analyze and visualize:
This format uses categories in one variable to aggregate all other variables:
attach(mtcars)
This format uses categories in two variables to aggregate all other variables:
This format uses categories in ONE variable to aggregate another SINGLE variable:
We can define our own function for aggregation - for example, to count the number of cases in
a categorical variable (in this case, we are counting how many genres van Gogh painted in
each place he lived):
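The aggregate() calls for the formats described above appear to be missing from these notes; the following is a reconstruction of the three formats plus a custom function, using the built-in mtcars data rather than our van Gogh file:

```r
attach(mtcars)

# ONE variable's categories (cyl) aggregate ALL other variables:
aggregate(mtcars, by = list(cyl), FUN = mean)

# TWO variables' categories (cyl, gear) aggregate all other variables:
aggregate(mtcars, by = list(cyl, gear), FUN = mean)

# ONE variable's categories aggregate another SINGLE variable (formula syntax):
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# our own aggregation function - here, counting cases per category:
aggregate(mpg ~ cyl, data = mtcars, FUN = length)
```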
“One very convenient feature of ggplot2 is its range of functions to summarize your R data in
the plot. This means that you often don’t have to pre-summarize your data.”
[note - some R functions have been updated so not all these functions may work now as in this
table]
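To illustrate the quote above: stat_summary() lets ggplot2 compute group summaries inside the plot call, so we don't have to aggregate first. A minimal sketch with the built-in mtcars data (note: ggplot2 versions before 3.3.0 use fun.y instead of fun):

```r
library(ggplot2)

# Plot the mean mpg per cylinder count; the mean is computed by ggplot2 itself,
# not pre-summarized by us.
ggplot(mtcars, aes(factor(cyl), mpg)) +
  stat_summary(fun = mean, geom = "bar")
```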
Data file
You need to convert your data frame from its standard format (called “wide” format in R) - where
each variable is in its own column - to a “long” format.
# make a copy of the data frame keeping only the columns containing the variables you want to
# plot
xx = x[c(1,8:10)]
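One common way to do the wide-to-long conversion (as of this class) is reshape2::melt(); here is a sketch with a made-up data frame, assuming the reshape2 package is installed:

```r
library(reshape2)

# Wide format: one column per variable.
wide <- data.frame(id = 1:3,
                   brightness = c(100, 150, 120),
                   saturation = c(40, 60, 55))

# Long format: one row per (id, variable) pair,
# with columns id / variable / value.
long <- melt(wide, id.vars = "id")
head(long)
```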
xx = read.delim("van_gogh_all.txt")
ggplot(x, aes(x=Brightness_Median, colour=Genre_gen,group=Genre_gen)) +
geom_density()
More resources - visualising distributions of data and data parts using ggplot2:
http://www.r-bloggers.com/ggplot2-cheatsheet-for-visualizing-distributions/
http://www.fdawg.org/FDAWG/Tutorials/ggplot2.html
Practical Homework 1
Goal: use R and ggplot2 to visualize relations between selected variables in van Gogh data.
Note - I added a new version of the van Gogh data file that has both genres and
image proportions - use this version:
https://www.dropbox.com/s/yw692nrku58dwcu/van_gogh_all.txt?dl=0
1) Year, Month
2) Label_Place, season
3) Genre, image_proportions
In these visualizations, you can show number of paintings in each data group, or statistics of
brightness and saturation, or image proportions, or other data.
Make sure that all labels are descriptive and easy to understand. Rename default labels if
needed.
After you create 3 visualizations you are happy with, combine them into one PDF and submit it.
The PDF should be < 10 MB.
You will receive email from Dropbox telling you where to upload your file.
SUBMIT HOMEWORK BY 6PM March 6.
The following are three examples of possible visualizations for this homework:
ggplot(mm, aes(reorder(Genre_gen,image_proportion),image_proportion)) +
geom_point(size=3) + coord_flip()
----------------------------------------------------------
Class demo:
x = read.delim("~/Documents/_CUNY GC class Spring 2017/van_gogh_all.txt")
colnames(x)
colnames(x)[5] <- "Place"
colnames(x)[9] <- "Genres_a"
colnames(x)[10] <- "Genres_b"
colnames(x)[15] <- "Proportion"
colnames(x)
library(ggplot2)
ggplot(x, aes(Year_Month, Proportion)) + geom_point()
ggplot(x, aes(Year_Month, Proportion)) + geom_point(alpha=0.2) + theme_minimal()
Class 8
A few well-known examples of using standard data visualization
techniques with “big data” - and how to write about big social
data for non-technical audiences:
Google n-Gram viewer:
http://ngrams.googlelabs.com
Media coverage:
http://www.nytimes.com/2013/12/08/technology/in-a-scoreboard-of-words-a-
cultural-guide.html?pagewanted=all&_r=0
Further developments:
http://www.theatlantic.com/technology/archive/2013/10/googles-ngram-
viewer-goes-wild/280601/
http://larryferlazzo.edublogs.org/2014/07/24/ny-times-creates-their-own-
version-of-googles-ngram-viewer/
http://blog.stephenwolfram.com/2012/03/the-
personal-analytics-of-my-life/
OK Cupid blog:
http://blog.okcupid.com/
http://feltron.com
Class 9:
Practical Homework 2 review
Development of statistics in the 18th-19th century and the idea of “social physics”:
Philip Ball. Chapter 3: The Law of Large Numbers from his book Critical Mass. 2006.
The PDF of the chapter you need to read.
“All that is necessary to reduce the whole of Nature to laws similar to those which Newton
discovered with the aid of calculus, is to have a sufficient number of observations and a
mathematics that is complex enough.” - Condorcet (French mathematician), Essay on
Applications of Analysis to the Probability of Majority Decisions, 1785.
“Now that human mind has grasped celestial and terrestrial physics, mechanical and chemical,
organic physics, both vegetable and animal, there remains one science, to fill up the series of
sciences of observations - social physics.” - Auguste Comte, Cours de philosophie positive (1830-
1842).
Reality Mining - take a look at this Wikipedia article that lists different ways to capture social
data at multiple scales.
Alex Pentland (MIT) is one of the pioneers in using big data to study social phenomena. Read
his text: https://www.edge.org/conversation/reinventing-society-in-the-wake-of-
big-data
Critical response (“The Limits of Social Engineering,” http://www.technologyreview.com/review/526561/the-limits-of-social-engineering/):
“The power of social physics,” he writes, “comes from the fact that almost all of our day-to-day
actions are habitual, based mostly on what we have learned from observing the behavior of
others.”
“Political and economic classes, he contends, are “oversimplified stereotypes of a fluid and
overlapping matrix of peer groups.” Peer groups, unlike classes, are defined by “shared norms”
rather than just “standard features such as income” or “their relationship to the means of
production.”
“Pentland may be right that our behavior is determined largely by social norms and the
influences of our peers, but what he fails to see is that those norms and influences are
themselves shaped by history, politics, and economics, not to mention power and prejudice.
People don’t have complete freedom in choosing their peer groups. Their choices are
constrained by where they live, where they come from, how much money they have, and what
they look like. A statistical model of society that ignores issues of class, that takes patterns of
influence as givens rather than as historical contingencies, will tend to perpetuate existing social
structures and dynamics. It will encourage us to optimize the status quo rather than challenge
it.”
“What big data can’t account for is what’s most unpredictable, and most interesting, about us.”
class 11:
changing continuous variables into categorical
variables (cut); work with bigger datasets (sample);
read large datasets into R; using data tables.
http://www.r-bloggers.com/r-function-of-the-day-cut/
http://stackoverflow.com/questions/5746544/r-cut-by-defined-interval
Example using van_gogh_genres.txt
x2=x
One method:
x2$Seasons = cut(x2$Month, breaks = 4)
Better method:
x2$Seasons = cut(x2$Month, breaks = seq(0,12, by=3))
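We can also name the intervals directly with the labels argument, instead of keeping cut()'s default "(0,3]"-style labels. A self-contained sketch with made-up months (the season names are illustrative):

```r
month <- sample(1:12, 100, replace = TRUE)   # stand-in for x2$Month
seasons <- cut(month, breaks = seq(0, 12, by = 3),
               labels = c("Winter", "Spring", "Summer", "Fall"))
table(seasons)   # counts per season
```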
http://www.statisticshowto.com/sampling-with-replacement-without/
https://en.wikipedia.org/wiki/Simple_random_sample
load("tw5cities.Rda") # loads the data objects (e.g. london) into the workspace
summary(london$lat)
summary(london.sample$lat)
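The sampling step itself follows the same pattern as the earlier nyc.sample example; here is a self-contained sketch using mtcars as a stand-in for a large data frame:

```r
# Draw 10 random rows without replacement.
idx <- sample(1:nrow(mtcars), 10, replace = FALSE)
mtcars.sample <- mtcars[idx, ]

summary(mtcars$mpg)          # full data
summary(mtcars.sample$mpg)   # sample - distribution should be similar
```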
library(dplyr)
install.packages('data.table')
library(data.table)
x = fread("~/Documents/_Twitter 81 cities/city-tables-wide-combined-10km-
bracketless/London.csv")
“The data.table R package provides an enhanced version of data.frame that allows you to do
blazing fast data manipulations. The data.table R package is being used in different fields such as
finance and genomics, and is especially useful for those of you that are working with large data
sets (e.g. 1GB to 100GB in RAM).”
source: https://www.datacamp.com/community/tutorials/data-table-cheat-sheet#gs.nsWlMGk
Another reason to use data tables is that for many operations, the syntax is easier.
You can convert a data frame in your workspace into a data table. Assume that you
have a data frame called DF:
DT = data.table(DF)
There are lots of tutorials online showing how to use data tables, for example:
https://www.r-bloggers.com/intro-to-the-data-table-package/
dt <- data.table(mtcars)
class(dt)
dt[,mean(mpg)]
dt[,mean(mpg),by=am]
dt[,mean(mpg),by=.(am,cyl)]
class 12:
Using colors and working with colors in R; basic
design principles; creating and publishing
interactive web visualizations
https://en.wikipedia.org/wiki/HSL_and_HSV
http://www.rapidtables.com/web/color/RGB_Color.htm
There are many designed color palettes on the web and also color combinations generators, for
example:
http://www.colourlovers.com/
R has many functions and methods for assigning colors to objects in plots. Here are some of
them:
http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/
By default, the colors for discrete scales are evenly spaced around a HSL color circle. For
example, if there are two colors, then they will be selected from opposite points on the
circle; if there are three colors, they will be 120° apart on the color circle; and so on.
colors()
http://www.statmethods.net/advgraphs/parameters.html
http://ggplot2.tidyverse.org/reference/scale_gradient.html
http://colorbrewer2.org/
( http://ggplot2.tidyverse.org/reference/scale_brewer.html )
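For example, here is a sketch applying one ColorBrewer palette ("Set1") to a discrete fill scale, using the built-in mtcars data:

```r
library(ggplot2)

# Stacked bar chart of cars by cylinder count, filled by gear count,
# with colors taken from the "Set1" ColorBrewer palette.
ggplot(mtcars, aes(factor(cyl), fill = factor(gear))) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1")
```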
Display palettes:
library(RColorBrewer)
display.brewer.all()
ggthemes Package:
https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html
Plot.ly - very powerful system for generating and publishing web graphs
Practical Homework 3:
You need to create a single visualization that uses data from the tw5cities.Rda file that I already
shared with you. If you like, you can instead create a few copies of a single visualization to
show parts of the data separately (corresponding to different cities).
Your visualization(s) should compare growth in visual tweets in five cities during 2011-2014.
Each row in the data file corresponds to one post. You have metadata for geo-location, city,
country, the level of economic development of the country, and date and time of the post.
You are allowed to aggregate the data, but only in limited ways. For example, the data file has the
year, month, day, hour and minute of each post. You are allowed to aggregate this into some
small intervals, such as 5, 10, 15, or 30 minutes. Thus, instead of plotting every tweet separately,
you can for example plot the number of tweets shared in a given city for every 30 minutes during
2011-2014.
Or maybe you want to aggregate over space, to count and visualize the number of posts shared
over small parts of the city. (Note that the data was collected for a 10 km x 10 km central area in
each city.)
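A hedged sketch of the 30-minute aggregation idea (not a required solution), using made-up hour/minute columns; the real column names in tw5cities.Rda may differ:

```r
# Simulate 1,000 posts with random timestamps.
posts <- data.frame(hour   = sample(0:23, 1000, replace = TRUE),
                    minute = sample(0:59, 1000, replace = TRUE))

# Assign each post to a half-hour slot: 0, 0.5, 1, 1.5, ..., 23.5.
posts$halfhour <- posts$hour + ifelse(posts$minute < 30, 0, 0.5)

counts <- table(posts$halfhour)  # posts per 30-minute slot
barplot(counts)
```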
Whatever aggregation you use, remember that you still need to visualize the data in detail. I.e., if
you visualize every single post as a separate point, your plot will have 2,732,305 separate
points. Now, let’s say you aggregated the posts in 15-minute intervals. If you now plot this data,
the plot will show approximately 100,000 separate points.
- if you are visualizing aggregated data as points, your visualization needs to contain > 50,000
points.
- if you are visualizing aggregated data as lines, your visualization needs to contain > 1,000 lines.
- if you are visualizing aggregated data as a heatmap, your visualization needs to contain >
50,000 heatmap elements (i.e. cells in a grid).