4 Data-Visualization
Data visualization is the presentation of data with graphics. It is a way to summarize your findings and display them in a
form that facilitates interpretation and helps in identifying patterns or trends.
Importance of Data Visualization
A picture is worth a thousand words. Humans are easily attracted to visuals, and most of us prefer to understand a
scenario through pictures instead of text.
It is faster to recognize a result than to read a paragraph. When done well, visualizations explain complex ideas simply.
Charts often tell the story much faster than a prose description or a table.
Below is the sample data: Sales Products
Retailer country | Order method type | Retailer type | Product line | Product type | Product | Year | Revenue | Quantity | Gross margin
United States | Telephone | Golf Shop | Personal Accessories | Navigation | Trail Master | 2012 | 1095 | 3 | 0.34767123
United States | Telephone | Department Store | Camping Equipment | Sleeping Bags | Hibernator | 2012 | 160103.2 | 1160 | 0.3769019
United States | Telephone | Department Store | Camping Equipment | Sleeping Bags | Hibernator Pillow | 2012 | 33520.42 | 2475 | 0.46298346
United States | Fax | Outdoors Shop | Camping Equipment | Cooking Gear | TrailChef Cook Set | 2013 | 42304.32 | 794 | 0.34365616
United States | Fax | Outdoors Shop | Camping Equipment | Cooking Gear | TrailChef Single Flame | 2013 | 52266.32 | 824 | 0.26880025
United States | Telephone | Department Store | Camping Equipment | Packs | Canyon Mule Cooler | 2012 | 53822.16 | 1652 | 0.50890117
United States | Web | Golf Shop | Personal Accessories | Watches | Venue | 2014 | 73949 | 1013 | 0.42858294
United States | Web | Golf Shop | Personal Accessories | Watches | Infinity | 2014 | 157890.2 | 665 | 0.45986527
United States | Web | Golf Shop | Personal Accessories | Watches | Lux | 2014 | 67265.2 | 396 | 0.48654044
• Which order method type has the highest revenue?
• Which product line has the highest quantity?
After displaying the data as graphs, let's try answering the questions.
[Figures: a chart of order method types (E-mail, Fax, Sales visit, Telephone, Web) and a bar chart of Quantity by Product line (Camping Equipment, Golf Equipment, Mountaineering Equipment, Outdoor Protection, Personal Accessories)]
Which order method type has the highest revenue?
[Figure: pie chart of revenue by order method type (E-mail, Fax, Sales visit, Telephone, Web)]
The Web order method type has the highest revenue among all order method types.
The pie chart also shows which order method type has the lowest revenue: E-mail.
Which product line has the highest quantity?
[Figure: bar chart of Quantity by Product line (Camping Equipment, Golf Equipment, Mountaineering Equipment, Outdoor Protection, Personal Accessories)]
From the graph it is easy to see that Personal Accessories has the highest quantity among all product lines.
Base Graphics in R
Bar plots are suitable for comparing totals across several groups.
Bar Plot
# plot() draws a bar plot when the column is a factor
plot(customer_churn$Dependents)
Adding color
plot(customer_churn$Dependents, col="coral")
# xlab sets the x-axis label and main sets the plot title
plot(customer_churn$Dependents,
     col="coral", xlab="Dependents",
     main="Distribution of Dependents")
plot(customer_churn$PhoneService, col="aquamarine4")
plot(customer_churn$Contract, col="palegreen4",
     xlab="Contract",
     main="Distribution of Contract")
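The plot() calls above draw bar plots only when the column is stored as a factor. The following is a minimal sketch, assuming a small made-up data frame named churn_demo in place of customer_churn (its values are illustrative, not taken from the real data). It counts the categories with table() and draws them with barplot(), which also works when the column is stored as character.

```r
# Hypothetical stand-in for customer_churn; the values are made up for illustration
churn_demo <- data.frame(
  Dependents = c("No", "Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# table() counts how often each category occurs
dependents_counts <- table(churn_demo$Dependents)

# barplot() draws one bar per category count
barplot(dependents_counts,
        col = "coral",
        xlab = "Dependents",
        main = "Distribution of Dependents")

# Converting the column to a factor lets plot() draw the same kind of bar plot
churn_demo$Dependents <- factor(churn_demo$Dependents)
plot(churn_demo$Dependents, col = "coral")
```

If plot() reports an error for a text column, checking is.factor() on that column is a reasonable first step.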
Histogram
A histogram is a plot that breaks the data into bins (or breaks) and shows the frequency distribution across these bins.
hist(customer_churn$tenure)
Adding Color
hist(customer_churn$tenure, col="olivedrab")
# breaks suggests the number of bins
hist(customer_churn$tenure,
     col="olivedrab",
     breaks=30)
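To see what the breaks argument does, it can help to inspect the object that hist() returns. This is a minimal sketch, assuming a made-up vector tenure_demo in place of customer_churn$tenure. Note that a single number passed to breaks is treated as a suggestion: hist() rounds the bin edges to "pretty" values, so the actual number of bins may differ from 30.

```r
# Hypothetical tenure values in months; made up for illustration
tenure_demo <- c(1, 2, 2, 5, 8, 12, 12, 24, 30, 36, 48, 60, 70, 72, 72)

# Draw the histogram and keep the returned object for inspection
h <- hist(tenure_demo,
          col = "olivedrab",
          breaks = 30,
          main = "Distribution of tenure",
          xlab = "tenure")

# h$breaks holds the bin boundaries, h$counts the frequency in each bin
print(h$breaks)
print(h$counts)
```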
Grammar of Graphics
Every form of communication needs a grammar. Since visualization is also a form of communication, it needs
a foundation of grammar.
For example, "I am John" follows the rules of grammar, while "Am John I" does not.
Components of Grammar of Graphics
Element Description
geom_histogram()
ggplot(data = customer_churn,
       aes(x=tenure)) + geom_histogram()
ggplot(data = customer_churn,
       aes(x=tenure)) + geom_histogram(bins = 50)
# fill=Partner colors the bars by the Partner column
ggplot(data = customer_churn,
       aes(x=tenure, fill=Partner)) +
  geom_histogram(position = "identity")
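The calls above combine a data layer, an aesthetic mapping and a geometry layer with `+`. Below is a minimal, self-contained sketch of the same pattern, assuming the ggplot2 package is installed and using a made-up data frame churn_demo in place of customer_churn. The alpha value is an addition for illustration: with position = "identity" the groups overlap, so some transparency keeps both visible.

```r
library(ggplot2)

# Hypothetical stand-in for customer_churn; values are made up for illustration
churn_demo <- data.frame(
  tenure  = c(1, 2, 2, 5, 8, 12, 12, 24, 30, 36, 48, 60, 70, 72, 72),
  Partner = c("No", "No", "Yes", "No", "Yes", "No", "Yes", "Yes",
              "No", "Yes", "Yes", "No", "Yes", "Yes", "No")
)

# Data layer + aesthetic mapping + geometry layer, combined with `+`
p_hist <- ggplot(data = churn_demo, aes(x = tenure, fill = Partner)) +
  geom_histogram(bins = 10, position = "identity", alpha = 0.5)

print(p_hist)
```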
geom_bar()
ggplot(data = customer_churn,
       aes(x=Dependents)) + geom_bar()
Add fill color
# Keep the + at the end of the line so R continues the expression
ggplot(data = customer_churn,
       aes(x=Dependents, fill=DeviceProtection)) +
  geom_bar()
ggplot(data = customer_churn,
       aes(x=Dependents, fill=DeviceProtection)) +
  geom_bar(position="dodge")
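The difference between the default stacked bars and position = "dodge" is easiest to see side by side. This is a minimal sketch, assuming ggplot2 is installed and using a made-up data frame churn_demo in place of customer_churn.

```r
library(ggplot2)

# Hypothetical stand-in for customer_churn; values are made up for illustration
churn_demo <- data.frame(
  Dependents       = c("No", "No", "Yes", "Yes", "No", "Yes"),
  DeviceProtection = c("Yes", "No", "Yes", "No", "Yes", "Yes")
)

# Default position: the fill groups are stacked on top of each other
p_stacked <- ggplot(data = churn_demo,
                    aes(x = Dependents, fill = DeviceProtection)) +
  geom_bar()

# position = "dodge": the fill groups are placed side by side
p_dodged <- ggplot(data = churn_demo,
                   aes(x = Dependents, fill = DeviceProtection)) +
  geom_bar(position = "dodge")

print(p_stacked)
print(p_dodged)
```

Both plots show the same counts; only the arrangement of the bars changes.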
geom_point()
The geom_point() function makes scatter plots with ggplot2. A scatter plot helps in understanding how one
variable changes with respect to another. It is used for two continuous variables.
ggplot(data = customer_churn,
       aes(y=TotalCharges, x=tenure)) + geom_point()
Adding color
ggplot(data = customer_churn,
       aes(y=TotalCharges, x=tenure)) +
  geom_point(col="slateblue3")
# Mapping col to a column colors the points by that column
ggplot(data = customer_churn,
       aes(y=TotalCharges, x=tenure, col=Partner)) +
  geom_point()
ggplot(data = customer_churn,
       aes(y=TotalCharges, x=tenure,
           col=InternetService)) +
  geom_point()
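Note the difference between a fixed color passed to geom_point() and a color mapped inside aes(). The following is a minimal sketch of both, assuming ggplot2 is installed and using a made-up data frame churn_demo in place of customer_churn.

```r
library(ggplot2)

# Hypothetical stand-in for customer_churn; values are made up for illustration
churn_demo <- data.frame(
  tenure       = c(1, 6, 12, 24, 36, 48, 60, 72),
  TotalCharges = c(30, 250, 700, 1500, 2900, 3200, 5400, 6100),
  Partner      = c("No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes")
)

# Fixed color: every point is drawn in the same color
p_fixed <- ggplot(data = churn_demo, aes(x = tenure, y = TotalCharges)) +
  geom_point(col = "slateblue3")

# Mapped color: each point is colored by its Partner value and a legend is added
p_mapped <- ggplot(data = churn_demo,
                   aes(x = tenure, y = TotalCharges, col = Partner)) +
  geom_point()

print(p_fixed)
print(p_mapped)
```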
geom_boxplot()
The geom_boxplot() function makes box plots with ggplot2. A box plot shows a five-number summary: the minimum, the
25th percentile, the median, the 75th percentile and the maximum. It is thus useful for visualizing the spread of the data and drawing
inferences from it.
geom_boxplot()
ggplot(data = customer_churn,
       aes(y=MonthlyCharges, x=InternetService)) + geom_boxplot()
ggplot(data = customer_churn,
       aes(y=MonthlyCharges, x=InternetService)) +
  geom_boxplot(fill="violetred4")
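The five numbers behind each box can be computed directly and compared with the plot. This is a minimal sketch, assuming ggplot2 is installed and using a made-up data frame churn_demo (the category names and charges are illustrative only). One detail worth knowing: in geom_boxplot() the whiskers extend only to the most extreme value within 1.5 times the interquartile range of the box, and points beyond that are drawn individually as outliers.

```r
library(ggplot2)

# Hypothetical stand-in for customer_churn; values are made up for illustration
churn_demo <- data.frame(
  InternetService = rep(c("DSL", "Fiber optic"), each = 5),
  MonthlyCharges  = c(20, 40, 60, 80, 100, 70, 80, 90, 100, 110)
)

# quantile() with default probabilities returns the minimum, 25th percentile,
# median, 75th percentile and maximum; tapply() applies it per group
five_num <- tapply(churn_demo$MonthlyCharges,
                   churn_demo$InternetService,
                   quantile)
print(five_num)

# One box per InternetService group
p_box <- ggplot(data = churn_demo,
                aes(y = MonthlyCharges, x = InternetService)) +
  geom_boxplot(fill = "violetred4")

print(p_box)
```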
Faceting the data
facet_grid() is used to facet the data. Faceting helps when a plot is too cluttered: the data is split by one or more variables into
separate panels (facets) for a clearer visualization.
facet_grid()
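As an illustration of faceting, here is a minimal sketch, assuming ggplot2 is installed and using a made-up data frame churn_demo in place of customer_churn. The choice of InternetService as the faceting variable is an assumption made for this example. The formula `. ~ InternetService` asks for one column of panels per InternetService value.

```r
library(ggplot2)

# Hypothetical stand-in for customer_churn; values are made up for illustration
churn_demo <- data.frame(
  tenure          = c(1, 5, 12, 24, 36, 48, 60, 72, 10),
  InternetService = c("DSL", "DSL", "DSL",
                      "Fiber optic", "Fiber optic", "Fiber optic",
                      "No", "No", "No")
)

# One histogram panel per InternetService value, laid out in columns
p_facet <- ggplot(data = churn_demo, aes(x = tenure)) +
  geom_histogram(bins = 5, fill = "tomato3") +
  facet_grid(. ~ InternetService)

print(p_facet)
```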
Theme Layer
ggplot(data = customer_churn, aes(x=tenure)) +
  geom_histogram(fill="tomato3",
                 col="mediumaquamarine") -> g1
g1 + labs(title = "Distribution of tenure") -> g2
g2 + theme(panel.background =
             element_rect(fill = "olivedrab3")) -> g3
g3 + theme(plot.background =
             element_rect(fill = "palegreen4")) -> g4
Which of these is the correct code to make a bar plot for the ‘TechSupport’ column? The color of the bars
should be ‘blue’ and the title of the plot should be ‘Distribution of Tech Support’.
Which of these is the correct code to make a histogram for the ‘tenure’ column? The fill color of the bins
should be ‘azure’ and the number of bins should be 87.
1. ggplot(data = customer_churn,aes(x=tenure,col='azure'))+geom_histogram(bins=87)
2. ggplot(data = customer_churn,aes(x=tenure))+geom_histogram(col="azure",bins=87)
3. ggplot(data = customer_churn,aes(x=tenure))+geom_histogram(fill="azure",bins=87)
4. ggplot(data = customer_churn,aes(x=tenure,fill='azure'))+geom_histogram(bins=87)
Quiz
Which of these is the correct code to make a bar plot for the ‘OnlineBackup’ column? The color of the bars should
be determined by the ‘PhoneService’ column.
1. ggplot(data = customer_churn,aes(fill=OnlineBackup,x=PhoneService))+geom_bar()
2. ggplot(data = customer_churn,aes(y=OnlineBackup,fill=PhoneService))+geom_bar()
3. ggplot(data = customer_churn,aes(x=OnlineBackup))+geom_bar(fill=PhoneService)
4. ggplot(data = customer_churn,aes(x=OnlineBackup,fill=PhoneService))+geom_bar()
Quiz
1. geom_bar()
2. geom_histogram()
3. geom_point()