Part 6 - Section 1: Introduction to data visualization

Dr. Nguyen Quang Huy

May 16, 2020

Dr. Nguyen Quang Huy Part 6 - Section 1: Introduction to data visualization May 16, 2020 1 / 16
Introduction to data visualization

Data visualization is the graphical representation of data. By using visual
elements such as charts, graphs, and maps, data visualization provides an
accessible way to understand trends, outliers, and patterns in data.
For most people, it is difficult to extract information by looking at raw
numbers and characters.
However, we can quickly tell red from blue, or a square from a circle.

Introduction to data visualization
Looking at the raw numbers and character strings in a dataset is rarely
useful. How much information do you get from looking at the murders dataset?
library(dslabs)
murders

##                    state abb    region population total
## 1               Alabama  AL     South    4779736   135
## 2                Alaska  AK      West     710231    19
## 3               Arizona  AZ      West    6392017   232
## 4              Arkansas  AR     South    2915918    93
## 5            California  CA      West   37253956  1257
## 6              Colorado  CO      West    5029196    65
## 7           Connecticut  CT Northeast    3574097    97
## 8              Delaware  DE     South     897934    38
## 9  District of Columbia  DC     South     601723    99
## 10              Florida  FL     South   19687653   669
Introduction to data visualization
However, as people say, "a picture is worth a thousand words". We get
useful information from examining this plot:
[Figure: "US Gun Murders in 2010". Scatter plot of the total number of murders (y-axis, log scale) against populations in millions (x-axis, log scale). Each point is labeled with its state and colored by region: Northeast, South, North Central, West.]
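
A plot like the one above can be sketched in a few lines of ggplot2. This is a minimal sketch, not the exact code behind the slide: the point size, text size, label nudging, and the object name p are assumptions chosen for illustration.

library(dslabs)
library(ggplot2)

# Total murders against population (in millions), both on log scales
p <- ggplot(murders, aes(x = population / 10^6, y = total, label = state)) +
  geom_point(aes(color = region), size = 3) +                   # one point per state
  geom_text(size = 2.5, nudge_y = 0.05, check_overlap = TRUE) + # state labels
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "US Gun Murders in 2010",
       x = "Populations in millions",
       y = "Total number of murders",
       color = "Region")
p

Setting check_overlap = TRUE drops labels that would overprint earlier ones, so some states may appear without a name.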

Introduction to data visualization

We can combine murders data with map data


[Figure: map of the US states, each labeled with its name. x-axis: long (longitude, from -120 to -70); y-axis: lat (latitude, from 25 to 50).]
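
One possible way to combine the murders data with map data is sketched below. It assumes the maps package is installed (map_data() relies on it) and that each state is filled by its murder rate per 100,000 people. The fill variable and the names states_map, murder_rates, murders_map and p_map are assumptions, not taken from the slide.

library(dslabs)
library(dplyr)
library(ggplot2)

# State outlines; the state name is stored in lower case in the column "region"
states_map <- map_data("state")

# Assumption: color each state by murders per 100,000 people
murder_rates <- murders %>%
  transmute(region = tolower(state), rate = total / population * 10^5)

# Attach each state's rate to the vertices of its outline
murders_map <- left_join(states_map, murder_rates, by = "region")

p_map <- ggplot(murders_map, aes(long, lat, group = group, fill = rate)) +
  geom_polygon(color = "white") +
  coord_quickmap()
p_map

The murders data has its own region column (South, West, ...), so transmute() is used here to replace it with the lower-case state name that the map data expects as the join key.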

Introduction to data visualization
Why is data visualization important?
Make data easier to understand and remember
Discover unknown facts, outliers and trends
Visualize relationships and patterns quickly
Ask better questions and make better decisions
What makes a good data visualization?
Step 1. Clean the data so that it is ready to visualize
Step 2. Pick the right chart
Step 3. Design and customize your visualization
Step 4. Publish, share and communicate
Remember, simplicity is the key.
Introduction to data visualization

Storytelling in the work of Charles Joseph Minard (1781-1870) about
Napoleon’s Russian campaign of 1812

Introduction to data visualization

“The greatest value of a picture is when it forces us to notice what we
never expected to see” (John W. Tukey)

Introduction to data visualization

What is the problem with the NYC Regents Exam in 2010 where you need a
score of 65 to pass?

Introduction to ggplot2

ggplot2 is an R package for producing statistical graphics with a deep
underlying grammar.
The grammar, based on the grammar of graphics (Wilkinson, 2005), is
made up of a set of independent components that can be composed in
many different ways.
The grammar of graphics tells us that a statistical graphic is a mapping
from data to aesthetic attributes (color, shape, size, ...) of geometric
objects (points, lines, bars, ...).
ggplot2 is powerful because users are not limited to a set of
pre-specified graphics; they can create new graphics that are
appropriate for their problem.
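
The idea of a graphic as a mapping from data to aesthetic attributes of geometric objects can be illustrated with a small sketch. The choice of the murders data, the mapped variables, and the names p_points and p_labels are assumptions made for illustration.

library(dslabs)
library(ggplot2)

# Data: murders. Mappings: population -> x, total -> y,
# region -> color and shape. Geometric object: points.
p_points <- ggplot(murders, aes(x = population, y = total,
                                color = region, shape = region)) +
  geom_point()

# Same data and x/y mappings, but a different geometric object:
# the state abbreviation is drawn as text instead of a point.
p_labels <- ggplot(murders, aes(x = population, y = total,
                                color = region, label = abb)) +
  geom_text()

p_points
p_labels

Only the geom and one mapping change between the two plots, which is what it means for the components to be independent.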

Introduction to ggplot2

Advantages of ggplot2
Users are not limited to a set of pre-specified graphics. You can build
graphics that tell your story precisely.
Disadvantages of ggplot2
ggplot2 is useful only to users with some basic knowledge of R.
ggplot2 does not suggest which graphics you should use to answer the
questions you are interested in.
ggplot2 is not designed to create dynamic or interactive graphics, i.e.
it is best suited to static graphics.

Grammar of graphics
The grammar of graphics describes the deep features that underlie all
statistical graphics:

Grammar of graphics
1. Data that you want to visualise.
2. Aesthetic mappings (aes) describe how variables in the data are
mapped to aesthetic attributes.
3. Geometric objects (geoms) represent what you actually see on
the plot: points, lines, polygons, etc.
4. A faceting specification describes how to break up the data into
subsets.
5. Statistical transformations (stats) summarise data in many
useful ways.
6. A coordinate system describes how data coordinates are mapped
to the plane of the graphic.
7. A theme controls the finer points of display, like the font size and
background colour.
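
The seven components listed above can be seen together in one ggplot2 call. This is a hedged sketch: the murders data, the linear smoother, and the name p_all are assumptions chosen only to give each component a visible role.

library(dslabs)
library(ggplot2)

p_all <- ggplot(data = murders,                             # 1. data
                aes(x = population / 10^6, y = total)) +    # 2. aesthetic mappings
  geom_point() +                                            # 3. geometric object
  facet_wrap(~ region) +                                    # 4. faceting
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) + # 5. statistical transformation
  coord_cartesian() +                                       # 6. coordinate system
  theme_bw(base_size = 12)                                  # 7. theme
p_all

coord_cartesian() is already the default coordinate system; it is written out here only to make the sixth component explicit.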
Introduction to data visualization
The measles vaccine was licensed in the United States in 1963
[Figure: heat map with one row per state (y-axis, from Alabama at the bottom to Wyoming at the top) and years from 1940 to 2000 on the x-axis.]

Introduction to data visualization

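library(dslabs)   # provides us_contagious_diseases
library(dplyr)    # provides %>%, filter() and mutate()
library(ggplot2)

# Measles cases per 1,000 people, by state and year
p1 <- us_contagious_diseases %>%
  filter(disease == "Measles") %>%
  mutate(rate = count * 1000 / population) %>%
  ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "grey")

p1 +
  # white-to-red fill on a square-root scale
  scale_fill_gradientn(colors = c(rgb(1, 1, 1), rgb(1, 0, 0), rgb(0.8, 0, 0)),
                       trans = "sqrt") +
  # vertical line at 1963, the year the vaccine was licensed
  geom_vline(xintercept = 1963, col = "green", size = 2) +
  scale_x_continuous(expand = c(0, 0)) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        legend.position = "bottom",
        text = element_text(size = 12)) +
  xlab(label = "") +
  ylab(label = "")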

Introduction to data visualization

End of Section 1
