
# Visualization-script.R
library(dslabs)
data(murders)
head(murders)
#type head(murders) to see the common order in data science: observations in rows and variables in columns
#data visualization is powerful to communicate a data-driven finding
#a picture is worth a thousand words: that's data visualization's advantage
#data visualization is the strongest tool of EDA, exploratory data analysis
#EDA, exploratory data analysis, is the most important and often overlooked part of a data analysis
#in EDA, data properties are explored through data visualization and summarization techniques
#data visualization helps to discover biases, mistakes and systematic errors, and to avoid flawed analyses and false discoveries
#basics of EDA and data visualization with ggplot2; other useful tools to learn: interactive graphics
#the most basic statistical summary of a list of objects or numbers is its DISTRIBUTION
#once a vector has been summarized (average, standard deviation) there are ways to visualize and check whether those summaries describe the distribution well
#before visualizing data it is necessary to know what type of data we have
#variable types: 1_categorical: a_ordinal, b_non-ordinal; 2_numeric: a_discrete, b_continuous
#categorical-ordinal data: variables defined by a small number of groups whose categories have an inherent order
#example, spiciness: mild, medium, hot
#categorical-non-ordinal data: variables defined by a small number of groups whose categories don't have any order
#example, sex: female, male; regions: northeast, south, north central, west
#numeric-continuous data: variables that can take any numeric value, including decimals
#example, heights, murder rates
#numeric-discrete data: variables that must take rounded, integer numeric values
#example, population sizes, counts of things
#heights as we take them are numeric-continuous; if heights are rounded they become numeric-discrete
#numeric-discrete data can be considered categorical-ordinal
#example, the number of packs a person smokes a day rounded to 0, 1, 2 is an ordinal variable: a small number of groups with many members in each group
#example, the number of cigarettes a person smokes, 0, 1...39, 40, would be a discrete variable: a big number of groups with few members in each group
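#a small illustrative sketch of the four variable types as R objects (hypothetical example data, not from the datasets used below)
spiciness <- factor(c("mild","medium","hot"), levels=c("mild","medium","hot"), ordered=TRUE) # categorical-ordinal
region_ex <- factor(c("northeast","south","west"))  # categorical-non-ordinal
height_ex <- c(68.3, 70.1, 61.5)                    # numeric-continuous
packs_ex  <- c(0L, 1L, 2L)                          # numeric-discrete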
library(tidyverse)
library(dslabs)
data(heights)
heights$sex
names(heights)
#unique() to see how many unique values are in a vector
#table() computes the frequency of each unique value, creating a contingency table of the counts for each level
x <- c(3,3,3,3,4,4,2)
unique(x)
table(x)
sum(x)
sum(unique(x))
sum(table(x))
sum(x==3)
head(heights)
#the distribution is the most basic statistical summary of a list of objects/numbers
prop.table(table(heights$sex))
#the proportion of females is 0.227 and the proportion of males is 0.773
#this two-category frequency table is the simplest form of a distribution; here the two numbers describe everything
x
prop.table(x)
prop.table(table(x))
#prop.table(x) divides each entry by the sum of the vector; with table() inside we get the proportion of each unique value
options(digits = 3)
prop.table(table(murders$region))
barplot(prop.table(table(murders$region)))
#0.176 NE, 0.333 S, 0.235 NC, 0.255 W
#the barplot lets us see the four proportions in a graph
#a distribution is a description/function that shows the possible values of a variable and how often those values occur
#distribution of categorical data <- description: frequency table
#distribution of numerical data <- function: cumulative distribution function, CDF
#when we have numerical data, reporting the frequency is NOT an effective summary because most values are unique
#the cumulative distribution function (CDF) reports the proportion of the data at or below a, for all possible values of a
#F(a) = Proportion(x <= a)
#F(b)-F(a) reports the proportion between two values
#histograms are easier to interpret than the CDF when showing the distribution of numerical data
#histograms divide the data into non-overlapping bins (intervals) of the same size and plot the counted values (frequency) in each one
hist(heights$height)
#hist: the base of each bar is an interval defined by a range of values, so we can read the proportion of the data per interval
ecdf(heights$height)
ecdf(x)
#ecdf() function computes the empirical CDF of a numeric variable
#the CDF has the "a" values on the x-axis and F(a), the proportion of values lower than or equal to "a", on the y-axis
#use the CDF to get summaries and probabilities when the data is numeric-continuous (with decimals)
#the CDF gives the proportion of values above a as 1-F(a)
#the CDF also gives the probability of finding a random value between a and b as F(b)-F(a)
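#a minimal sketch using the empirical CDF on the heights data: the proportion of heights between 65 and 70 inches is F(70)-F(65)
cdf_fn <- ecdf(heights$height)
cdf_fn(70) - cdf_fn(65)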
#smooth density plots are similar to histograms, but the sharp edges are smoothed out and the y-axis changes from counts to density
#when the list of values is large, a smooth density plot shows the distribution in a more general way
#smooth density plots are like histograms with very small bins (intervals): fewer edges and jumps, so the histogram becomes smooth
#the smooth density is the CURVE that goes through the top of the histogram bars
#steps to make a smooth curve: 1_histogram with frequencies, 2_take the top points, 3_connect the points, 4_hide the bars
#the degree of smoothness can be controlled by an argument in ggplot2
#the degree of smoothness selected should be representative of the underlying data, to avoid visualization and analysis mistakes
#the area under the smooth density curve sums to 1; to get the proportion between two values, compute the area under the curve between them
#comparing two distributions is easier with smooth densities than with histograms because the jagged edges add clutter
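#a quick base-R sketch: density() computes the smooth curve that runs through the top of a fine-binned histogram
hist(heights$height, freq=FALSE, breaks=30)
lines(density(heights$height), lwd=2)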
#normal distribution <- bell curve <- Gaussian distribution <- defined by two parameters, the average and the standard deviation
#the ND is symmetric and centered at the average, and about 95% of values are within 2 standard deviations of the average
#the ND appears in gambling winnings, heights, weights, blood pressure, standardized test scores, experimental measurement errors
#mean() and sd() summarize the normal distribution of a dataset
average <- sum(x)/length(x)
SD <- sqrt(sum((x-average)^2)/length(x))
index <- heights$sex=="Male"
x <- heights$height[index]
x
average <- mean(x)
SD <- sd(x)
c(average,SD)
c(average=average,SD=SD)
#compute the mean() and sd() for male heights
mean(heights$height[heights$sex=="Male"])
mean(x)
sd(heights$height[heights$sex=="Male"])
sd(x)
c(mean(x),sd(x))
c(average=mean(x),SD=sd(x))
#for male heights the mean is 69.314 and the sd is 3.611; those two numbers summarize the normal distribution of our data
#standard units <- z-score <- scale() function: how many standard deviations a value is away from the mean (average)
#when the distribution is approximately normal, converting data into z-scores (standard units) is useful to compare between datasets
z <- scale(x)
z
scale(heights$height[heights$sex=="Male"])
(x-mean(x))/sd(x)
#z=(x-mean(x))/sd(x)
#how many men are within two SDs of the average? the average is z=0, tall is z=2, short is z=-2
#compute the proportion of absolute z-scores smaller than 2
mean(abs(z)<2)
#about 95% of the male heights in normally distributed data are inside +-2 standard units
#standard normal distribution <- z-scores with mean = 0 and SD = 1
#a z-score near 0 is average; z-scores beyond +-2 are far from the mean, and z-scores beyond +-3 are extreme values
#the normal distribution is associated with the 68-95-99.7% rule: the percent of observations that fall under the curve
#68.3% is within 1 SD of the mean, abs(z)<=1; 95.4% is within 2 SD, abs(z)<=2; 99.7% is within 3 SD, abs(z)<=3
#abs() function computes the absolute value
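#checking the 68-95-99.7 rule against the theoretical standard normal with pnorm()
pnorm(1)-pnorm(-1)  # ~0.683
pnorm(2)-pnorm(-2)  # ~0.954
pnorm(3)-pnorm(-3)  # ~0.997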
#pnorm() function gives the PROBABILITY from the cumulative distribution function (CDF) of a normal distribution
#F(a) = pnorm(a, mean, sd): distribution function = pnorm(observation, average, SD)
#what is the probability that a randomly selected student is taller than 70.5 inches?
pnorm(70.5,mean(x),sd(x))
pnorm(70.5,mean(heights$height[heights$sex=="Male"]),sd(heights$height[heights$sex=="Male"]))
#F(a) = pnorm(observation,mean,SD) is the probability of being at or below the observation; 1-F(a) is the probability of being above it
1-pnorm(70.5,mean(x),sd(x))
#0.628 is the probability of a male height below 70.5 inches: F(a) = pnorm(obs,mean,sd)
#answer: 0.371 is the probability of a male height above (taller than) 70.5 inches: 1 - F(a)
#to get the probability of a male height between two values, use F(b) - F(a)
#heights data is numeric-continuous; treating each height as a categorical entry isn't useful = discretization
#divide the data into intervals to compute probabilities for continuous distributions
#intervals of length 1 centered on integers make the normal approximation work best for these data
library(tidyverse)
library(dslabs)
x <- heights %>% filter(sex=="Male") %>% pull(height)
x
index <- heights$sex=="Male"
x <- heights$height[index]
x
#two ways to define x containing the male heights
#pull() function to pull out (extract) a single variable
#what is the proportion of males with heights between 69.5 and 70.5? an interval of length 1
options(digits = 3)
#heights data with decimals, numeric-continuous: probabilities by mean() and by pnorm() are approximately equal when the interval has length 1 around an integer, following the normal distribution
mean(x<=68.5)-mean(x<=67.5)
pnorm(68.5,mean(x),sd(x))-pnorm(67.5,mean(x),sd(x))
mean(x<=69.5)-mean(x<=68.5)
pnorm(69.5,mean(x),sd(x))-pnorm(68.5,mean(x),sd(x))
mean(x<=70.5)-mean(x<=69.5)
pnorm(70.5,mean(x),sd(x))-pnorm(69.5,mean(x),sd(x))
#heights data with rounded values, numeric-discrete: probabilities by mean() and pnorm() don't match the normal distribution when the interval is not a full unit around an integer (decimals) = discretization
mean(x<=70.9)-mean(x<=70.1)
pnorm(70.9,mean(x),sd(x))-pnorm(70.1,mean(x),sd(x))
#assigning a small probability to each single height is not useful; the CDF works on intervals of heights to give the probability distribution of a random value
#dnorm() returns the PDF, the probability density function, of a normal distribution for a given random variable x
#pnorm() returns the CDF, the cumulative distribution function, of a normal distribution for a given quantile q
#qnorm() returns the inverse CDF (the quantile function) for a given probability p
#rnorm() returns a vector of normally distributed random values for a given number of observations n
#rnorm() is useful to generate random samples and simulate data collected from a normally distributed population with a specified mean & SD
#what could happen by chance? if we pick 800 males at random, what is the distribution of the tallest person?
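#a minimal simulation sketch of that question, assuming male heights are normal with mean 69 and SD 3 inches (the values used later in this script)
tallest <- replicate(1000, max(rnorm(800, 69, 3)))
hist(tallest)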
#13.14 Exercises
xx <- heights %>% filter(sex=="Female") %>% pull(height)
xx <- heights$height[heights$sex=="Female"]
xx
#xx defined as the female heights; the mean is about 65 and the SD about 3.8 inches
#if we pick a female at random, what is the probability that she is 5 feet or shorter? 6 feet or taller? between 61-67 inches?
5*12 # 5 feet = 60 inches
mean(xx)
sd(xx)
pnorm(5*12,mean(xx),sd(xx))
pnorm(60,mean(xx),sd(xx))
#the probability of finding a female shorter than (below) 5 feet = 60 inches is 0.094
1-pnorm(6*12,mean(xx),sd(xx))
#the probability of finding a female taller than (above) 6 feet = 72 inches is 0.0302
pnorm(67,mean(xx),sd(xx))-pnorm(61,mean(xx),sd(xx))
#the probability of finding a female between 61 and 67 inches tall is 0.561
#repeat the previous exercise with heights in centimeters
5*12*2.54
60*2.54
pnorm(5*12*2.54,mean(xx*2.54),sd(xx*2.54))
#the probability of finding a female shorter than (below) 5 feet = 60 inches = 152.4 cm is 0.094
1-pnorm(6*12*2.54,mean(xx*2.54),sd(xx*2.54))
#the probability of finding a female taller than (above) 6 feet = 72 inches = 182.88 cm is 0.0302
pnorm(67*2.54,mean(xx*2.54),sd(xx*2.54))-
pnorm(61*2.54,mean(xx*2.54),sd(xx*2.54))
#the probability of finding a female between 61 and 67 inches = 155-170 cm tall is 0.561
mean(xx)
sd(xx)
options(digits = 3)
pnorm(-1.96)
pnorm(1.96)
pnorm(-2)
pnorm(2)
qnorm(0.025)
qnorm(0.975)
qnorm(0.0228)
qnorm(0.977)
#pnorm and qnorm are inverse functions; qnorm gives the theoretical quantiles of a normal distribution
#datacamp assessment:
#what proportion of the data is between 69 and 72 inches, taller than 69 but shorter than or equal to 72?
mean(x<=72&x>69)
mean(x<=72)-mean(x<=69)
pnorm(72,mean(x),sd(x))-pnorm(69,mean(x),sd(x))
#ratio: how many times bigger the exact proportion is compared to the approximation
exact <- mean(x > 79 & x <= 81)
approx <- pnorm(81,mean(x),sd(x))-pnorm(79,mean(x),sd(x))
exact/approx
exact
approx
#between 79 and 81 inches, the exact proportion of male heights (by mean()) is 1.61 times bigger than the approximate proportion (by pnorm())
#get the proportion of men seven feet or taller, then multiply this value by 1 billion (10^9), the number of adult males, and round the value
1-pnorm(7*12,mean=69,sd=3)
#the proportion of men seven feet or taller is 2.866e-07
(1-pnorm(7*12,mean=69,sd=3))*(10^9)
#about 1 billion (10^9) men are between 18-40 years old; answer: there are about 287 men seven feet or taller in the world
round((1-pnorm(7*12,mean=69,sd=3))*(10^9))
10/round((1-pnorm(7*12,mean=69,sd=3))*(10^9))
10/((1-pnorm(7*12,mean=69,sd=3))*(10^9))
#if there are 10 players seven feet or taller in the NBA, they represent a proportion of about 0.0348 of such men in the world
(1-pnorm((6*12)+8,mean=69,sd=3))*(10^9)
150/((1-pnorm((6*12)+8,mean=69,sd=3))*(10^9))
#there are about 122866 men in the world with LeBron James' height (6 feet 8 inches); if 150 of them play in the NBA, they represent a proportion of about 0.0012
#quantiles: cutoff points that divide datasets into intervals with set probabilities
#the q-th quantile is the value at or below which a proportion q of the observations fall, from 0 (smallest) to 1 (largest)
quantile(x)
#quantile() on the male heights x gives the quartiles by default
#percentiles and quartiles are kinds of quantiles
#percentiles divide datasets into 100 intervals, each with 1% = 0.01 of probability
#quartiles divide datasets into 4 intervals, each with 25% = 0.25 of probability
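#a small check of the definition on the male heights x: about 25% of observations fall at or below the 0.25 quantile
q25 <- quantile(x, 0.25)
mean(x <= q25)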
summary(heights$height)
#summary() function gives the min, 1st quartile, median (2nd quartile), mean, 3rd quartile and max
quantile(heights$height)
#use summary() or quantile() to show the quartiles
#percentiles: define p as seq(0.01, 0.99, 0.01) and then use quantile(data, p)
p <- seq(0.01,0.99,0.01)
quantile(heights$height,p)
percentiles <- quantile(heights$height,p)
percentiles[names(percentiles)=="25%"]
percentiles[names(percentiles)=="75%"]
percentiles["25%"]
percentiles["75%"]
percentiles["99%"]
#quantile() returns a named vector; stored here in "percentiles", we can access each percentile from 1% to 99% by name
#qnorm() function gives the theoretical quantiles of a dataset that follows the normal distribution
#qnorm(p = probability of observations, mean, sd); if mean and sd are not defined, the defaults are mean=0 and sd=1, the standard normal quantiles
pnorm(-1.96)
qnorm(0.025)
pnorm(qnorm(0.025))
#pnorm on the z-score -1.96 gives the probability that a value from a standard normal distribution will be less than or equal to z
#theoretical quantiles obtained by qnorm can be compared to sample quantiles to check whether the data follow the normal distribution
qnorm(p,69,3)
#q: vector of quantiles, p: vector of probabilities-proportions; p = mean(x <= q), the proportion of the vector x less than or equal to the quantile q
#QQ-plots: to check whether a data distribution is well approximated by the normal distribution
#p = proportions 0.05, 0.10, up to 0.95; q = the quantiles such that the proportion of values in the data below q is p
options(digits = 3)
mean(x)
mean(x <= 69.5)
#around 50% of males are shorter than or equal to 69.5 inches, near the mean value; if p = 0.51 then q = 69.5
mean(x >= 69.5)
#QQ-plots: sample (observed) quantiles are compared to the theoretical quantiles expected from the normal distribution
#the points on the QQ-plot fall near the identity line when sample = theoretical and the data is approximately normal
index <- heights$sex=="Male"
x <- heights$height[index]
z <- scale(x)
#to generate a QQ-plot, 1st define p as a vector of proportions
p <- seq(0.05,0.95,0.05)
#2nd, define a vector of sample quantiles for the proportions p with the quantile() function, which returns a vector by itself
sample_quantiles <- quantile(x,p)
#3rd, define the theoretical quantiles with the qnorm() function using the mean and sd of the x heights data and the proportions p
theoretical_quantiles <- qnorm(p, mean=mean(x), sd=sd(x))
theoretical_quantiles <- qnorm(p, mean(x), sd(x))
theoretical_quantiles
#4th, make the QQ-plot, sample quantiles against theoretical quantiles, to see if they match and therefore follow the normal distribution
plot(theoretical_quantiles,sample_quantiles)
abline(0,1)
#abline(0,1) adds a straight line with intercept 0 and slope 1 to see if the points fall near the identity line
#theoretical = sample observed = normal distribution; use z-scores (standard units) in qnorm to simplify the code
sample_quantiles <- quantile(z,p)
theoretical_quantiles <- qnorm(p)
#standard units: it's not necessary to define mean and sd in qnorm() because z-scores have mean = 0 and sd = 1, the defaults
plot(theoretical_quantiles,sample_quantiles)
abline(0,1)
#male heights follow a normal distribution with average 69.3 inches and standard deviation 3.6 inches
#percentiles are the quantiles obtained when p is defined from 0.01 (1%) up to 0.99 (99%)
#median = 50th percentile, 50% of the data is below the median; only in symmetric distributions like the normal does mean = median
#quartiles are the 0.25 (25%) percentile or 1st quartile, the median, and the 0.75 (75%) percentile or 3rd quartile
murder_rate <- murders$total/murders$population*10^5
hist(murder_rate)
#the murder rates do not follow the normal distribution
plot(murder_rate)
summary(murder_rate)
#boxplots are useful to show the info from summary() and to compare multiple distributions when they are not normal
#the box is defined by the 25th and 75th percentiles; the distance between them is called the interquartile range
#whiskers show the range, outliers are plotted separately as individual points, and the median is the horizontal line in the box
boxplot(murder_rate)
boxplot(heights$height~heights$sex)
#from the boxplots we see that men are on average taller than women and both groups have outliers... a description of the graphic
#stratification: data is divided into groups based on associated variables; the resulting groups are called strata
hist(x)
hist(xx)
#Exercises 8.15
library(dslabs)
data("heights")
male <- heights$height[heights$sex=="Male"]
female <- heights$height[heights$sex=="Female"]
length(male)
length(female)
#in this dataset we have 812 entries of male heights and 238 entries of female heights
p <- seq(0.01,0.99,0.01)
quantile(male,p)
male_percentiles <- quantile(male,p)
male_percentiles <- male_percentiles[c("10%","30%","50%","70%","90%")]
male_percentiles
quantile(female,p)
female_percentiles <- quantile(female,p)
female_percentiles <- female_percentiles[c("10%","30%","50%","70%","90%")]
female_percentiles
my_df <- data.frame(female=female_percentiles, male=male_percentiles)
my_df
#show and store the male and female percentiles 10, 30, 50, 70 and 90%
#NOT normal distributions: summarizing with mean and sd is not useful; provide a histogram or qqplot to show the distribution
#data visualization helps to discover flaws in the data: measurement mistakes, over/under estimates, sampling quirks, the quality of the data
#mad() function gives the median absolute deviation; if the distribution is normal, mean = median and sd = mad
#an entry mistake in 1 of 900 observations can increase the mean by half a unit, a big difference in practical terms
#an entry mistake in 1 of 900 observations can increase the sd by 15 units, a really big difference in practical terms
#the median and the mad (median absolute deviation) are ROBUST SUMMARIES compared with the mean and sd
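#a minimal robustness sketch on the male heights x: corrupt one entry (a hypothetical data-entry mistake, inches typed as centimeters) and compare summaries
x_err <- x
x_err[1] <- x_err[1]*2.54
c(mean=mean(x), mean_err=mean(x_err))
c(sd=sd(x), sd_err=sd(x_err))
c(median=median(x), median_err=median(x_err))
c(mad=mad(x), mad_err=mad(x_err))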
#GGPLOT2: uses the gg-grammar of graphics to break plots into building blocks with an intuitive syntax
#ggplot2 creates complex and aesthetically pleasing plots using simple and readable code
#ggplot2 is designed to work with tidy datasets: rows = observations & columns = variables
#the ggplot2 cheatsheet helps remember the basics of plots; other graphic resources to learn instead of ggplot: grid, lattice
library(ggplot2)
#to learn ggplot, break the graph into 3 main components: data, geometry, aesthetic mapping; additional components: scale, labels, title, legend, theme, style
#data component: the dataset to be summarized
#geometry component: the type of the plot: histogram, qqplot, boxplot, barplot, smooth density, scatterplot, etc.
#aesthetic mapping component: variables mapped (assigned) to visual cues that depend on the geometry used: x-axis values, y-axis values, colors, text
#scale component: the magnitudes, defined by the range of the data, e.g. on a log scale
#labels, title, legend components
#theme, style components of the graph
#create a ggplot graph, 1st step: create a ggplot object, associating the data with graph objects, geometries and mappings
#ggplot() function initializes the graph (ggplot object); the 1st argument associates the dataset with the new gg object
library(tidyverse)
library(dslabs)
data("murders")
ggplot(data = murders)
ggplot(murders)
murders %>% ggplot()
#three ways to associate the data with the gg object; we get a blank slate because the geometry is not defined yet
p <- ggplot(data = murders)
p <- ggplot(murders)
p <- murders %>% ggplot()
class(p)
#we see that p is a ggplot object by using the class() function
#assign the plot to an object p: we create the gg object, associate it with p, and can then render (print) it
print(p)
p
#ggplot creates graphs by adding layers with the + plus symbol
#layers define the components of the graph: geometries, summary statistics, scales to use, style changes
#data %>% ggplot() (gg object) + layer 1 + layer 2 + layer n
#check the ggplot cheatsheet to use the right function for the graph we want to generate
#layer 1 = geometry layer; the geom_* function defines the type of plot and needs data and aes mapping arguments to work
#layer 2 = aesthetic mappings; defines how the data connect with graph features, the aesthetic arguments x, y, colour, shape, etc.
#aes() function receives the features (arguments) of a geometry function such as the x-y axes, size, color, etc.; it's preferred to put aes() inside ggplot()
#geom_point() function creates scatterplots and requires x and y supplied through the aes() aesthetic mappings
library(dplyr)
library(ggplot2)
murders %>% ggplot() + geom_point(aes(x=population/10^6, y=total))
ggplot(murders) +
geom_point(aes(x=murders$population/10^6,y=murders$total))
#remember ggplot, like dplyr, works like the with() function: it's not necessary to access the variables with $ again in the code
ggplot(murders) + geom_point(aes(x=population/10^6,y=total))
murders <- murders %>% mutate(population_in_millions = population/10^6, total_gun_murders = total)
ggplot(murders) + geom_point(aes(x=population_in_millions,y=total_gun_murders))
#the same plot using the variables population_in_millions and total_gun_murders, defined here with mutate() so the code runs
ggplot(murders) + geom_point(aes(population/10^6,total))
#the aes features x and y are the 1st and 2nd arguments, so the code also works without naming them with =
p <- ggplot(data = murders)
p + geom_point(aes(x=population/10^6,y=total))
p + geom_point(aes(population/10^6,total))
#geom_text() adds text directly to the plot; requires the aes mapping features x, y and a label argument
#geom_label() adds rectangles (labels) around the text; requires the aes mapping features x, y and a label argument
ggplot(murders) +
geom_point(aes(x=population/10^6,y=total)) +
geom_text(aes(x=population/10^6,y=total,label=abb))
p + geom_point(aes(x=population/10^6,y=total)) +
geom_text(aes(x=population/10^6,y=total,label=abb))
ggplot(murders) + geom_point(aes(x=population/10^6,y=total)) +
geom_text(aes(x=population/10^6,y=total,label=abb))
#layers with different aes mappings can be added to the same plot; be careful adding arguments (features) inside the aes() function to avoid mistakes
ggplot(murders) + geom_point(aes(x=population/10^6,y=total)) +
geom_label(aes(x=population/10^6,y=total,label=abb))
#each geom_* type has many features-arguments specific to the function beyond the essential aes and data
args(ggplot)
args(geom_point)
args(aes)
args(geom_text)
args(geom_label)
#mappings need to be inside aes() because they use data from specific observations
#operations that affect all the points in the same way don't need to be inside aes(); they are not mappings
library(dslabs)
library(tidyverse)
library(dplyr)
library(ggplot2)
p + geom_point(aes(x=population/10^6,y=total), size=3) +
geom_text(aes(x=population/10^6,y=total,label=abb))
#size= goes outside aes() because it is an operation that works on all the points; it changes the size of the plotted points
p + geom_point(aes(x=population/10^6,y=total), size=3) +
geom_text(aes(x=population/10^6,y=total,label=abb), nudge_x=1)
#nudge_x= goes outside aes(); it is not a mapping but an operation on all the points, moving the labels a little to the right
#global aes mappings apply to all geometry layers; they are defined in the ggplot() object
#redefine the p gg-object with the aes() mapping inside the ggplot() function to simplify the code and avoid typing errors
p <- murders %>% ggplot(aes(x =population/10^6, y =total, label =abb))
p + geom_point(size =3) + geom_text(nudge_x =1.5)
p <- ggplot(data= murders, aes(x =population/10^6, y =total, label =abb))
p + geom_point(size =3) + geom_text(nudge_x =1.5)
#GLOBAL AES: gg-object <- data %>% ggplot(aes(arguments))
#GLOBAL AES: gg-object <- ggplot(data=, aes(arguments=))
#two ways of defining the gg-object, by %>% or by data=
#local aes mappings in layers add new information, overriding the global aes mappings defined as defaults in the gg-object
p + geom_point(size =3) + geom_text(x =10, y =800, label ="Hello there!")
p
#log10 scales are often preferred and are not the default; the change is added through a scales layer
#scale_x_continuous & scale_y_continuous are the default scale layers, switched to the log10 scale with trans=
p <- ggplot(data= murders, aes(x =population/10^6, y =total, label =abb))
p + geom_point(size =3) + geom_text(nudge_x =1.5) +
scale_x_continuous(trans ="log10") + scale_y_continuous(trans ="log10")
#the nudge should be smaller once we change to the log scale (to get the graph right)
p + geom_point(size =3) + geom_text(nudge_x =0.075) +
scale_x_continuous(trans ="log10") + scale_y_continuous(trans ="log10")
#scale_x_log10() & scale_y_log10() transform the scale layers directly to log10
p + geom_point(size =3) + geom_text(nudge_x =0.075) + scale_x_log10() +
scale_y_log10()
#xlab() and ylab() to add the axis names (labels), ggtitle() to add the title
p + geom_point(size =3) + geom_text(nudge_x =0.075) + scale_x_log10() +
  scale_y_log10() + xlab("Population in millions log scale") +
  ylab("Total number of murders log scale") +
  ggtitle("US total gun murders in US 2010")
#redefine p without geom_point and then add the color arguments one by one to learn how to change colours
p + geom_text(nudge_x =0.075) + scale_x_log10() + scale_y_log10() +
  xlab("Population in millions log scale") +
  ylab("Total number of murders log scale") +
  ggtitle("US total gun murders in US 2010")
#by adding the argument color="blue" inside the geom_point function we get all the points blue
p + geom_point(size =3, color ="blue")
#a color argument outside aes() is an operation that works over all the points
#a color argument inside aes() is a mapping that automatically assigns a color to each level of the categorical variable = region
p + geom_point(size =3, col ="blue")
#the arguments color= and col= are the same for colouring graphs
p + geom_point(aes(color =region), size =3)
p + geom_point(aes(col =region), size =3)
p + geom_point(size =3, aes(color =region))
#the order of arguments within geom_point does not change the rendering of the graph: size then aes = aes then size
#a reference legend is also added automatically with color; avoid it by setting the argument show.legend=FALSE in geom_point
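#a quick sketch of suppressing the automatic legend with show.legend=FALSE
p + geom_point(aes(color =region), size =3, show.legend =FALSE)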
library(dslabs)
library(tidyverse)
library(dplyr)
library(ggplot2)
#to add a line representing the average murder rate for the whole country, define r as the per-million rate: sum of totals / sum of populations * 10^6
10^6
identical(10^6,1000000)
#add a line representing the average murder rate; remember the rate is total/population*10^6
r <- murders %>% summarize(rate =sum(total)/sum(population)* 10^6) %>%
pull(rate)
r
r <- murders %>% summarize(rate =sum(total)/sum(population)* (10^6)) %>%
pull(rate)
r
#30.345 is the average murder rate, sum(total)/sum(population)*10^6
View(p)
#summarize() function creates scalar variables summarizing the selected variables of an existing dataframe
#geom_abline() adds a line, by default with intercept a = 0 and slope b = 1
p + geom_point(aes(color =region), size =3) + geom_abline(intercept
=log10(r)) + scale_x_log10() + scale_y_log10()
#on the log10 scale, slope = 1 and intercept = log10(r) put the line at the average murder rate
#to recreate the graph, change arguments: lty= changes the line type from solid to dashed, color="darkgrey", and put the geom_abline() layer before the geom_point() layer so the average murder rate line is drawn below the points
p <- p + geom_abline(intercept =log10(r), lty=2, color="darkgrey") +
geom_point(aes(color =region), size =3)
p
p <- p + geom_abline(intercept =log10(r), lty=2, color="darkgrey") +
geom_point(aes(color =region), size =3) + scale_x_log10() +
scale_y_log10()
p
#capitalize the legend title using scale_color_discrete(name="Region") with a capital R
#add layers with + and save the changes to the p gg-object to avoid losing them
p <- p + scale_color_discrete(name="Region")
p
p <- p + geom_text(nudge_x =0.075) + scale_x_log10() + scale_y_log10() +
  xlab("Population in millions log scale") +
  ylab("Total number of murders log scale") +
  ggtitle("US total gun murders in US 2010")
p
#add-on packages for ggplot give the graph its finishing touches: ggthemes and ggrepel
#many themes are included in ggplot and the dslabs package, added as a + theme(argument=) layer; ds_theme_set() sets the dslabs default theme
library(dslabs)
library(tidyverse)
library(dplyr)
library(ggplot2)
#load ggthemes with library() and then add, or store in p to avoid losing it, the theme_economist() layer needed for the example
library(ggthemes)
install.packages("ggthemes")
library(ggthemes)
p + theme_economist()
#if there is trouble loading a package with library(), try install.packages("packagename") and then library(packagename)
p + theme_clean()
p + theme_fivethirtyeight()
p + theme_economist_white()
p + theme_test()
#try the stored themes to see how they look and which one goes better with our plot
library(ggrepel)
install.packages("ggrepel")
library(ggrepel)
#ggrepel package contains extra geometries for ggplot2; change the geom_text layer to geom_text_repel to avoid labels falling on top of each other
#pull out the whole code clean:
r <- murders %>% summarize(rate =sum(total)/sum(population)* (10^6)) %>%
pull(rate)
p <- ggplot(data= murders, aes(x =population/10^6, y =total, label =abb)) +
  geom_abline(intercept =log10(r), lty=2, color="darkgrey") +
  geom_point(aes(color =region), size =3) + geom_text_repel() +
  scale_x_log10() + scale_y_log10() +
  xlab("Population in millions log scale") +
  ylab("Total number of murders log scale") +
  ggtitle("US total gun murders in US 2010") +
  scale_color_discrete(name="Region") +
  theme_economist()
p
p <- ggplot(data= murders, aes(x =population/10^6, y =total, label =abb)) +
geom_abline(intercept =log10(r), lty=2, color="darkgrey") +
geom_point(aes(color =region), size =3) +
geom_text_repel() + scale_x_log10() +
scale_y_log10() +
xlab("Population in millions log scale") +
ylab("Total number of murders log scale") +
ggtitle("US total gun murders in US 2010") +
scale_color_discrete(name="Region") +
theme_economist()
p
#put the + at the end of the line, never at the beginning of the next one; otherwise the code errors just because of wrong typing
library(dslabs)
library(tidyverse)
library(dplyr)
library(ggplot2)
#create another summary plot using ggplot2
#geom_histogram() layer creates histograms more aesthetic than base-R hist(); the x axis is divided into bins (intervals)
p <- heights %>% filter(sex =="Male") %>% ggplot(aes(x =height))
p
p + geom_histogram()
#dplyr pipes work better than base subsetting like dataframe$var[] to feed data and aes() into ggplot objects
#the histogram uses default bins; define the binwidth (width of the intervals) as an argument
p <- heights %>% filter(sex =="Male") %>% ggplot(aes(x =height))
p + geom_histogram(binwidth =1)
p + geom_histogram(binwidth =1, fill ="blue", color ="black") +
  xlab("Male heights in inches") + ggtitle("Histogram Male heights")
p + geom_histogram(binwidth =1, fill ="blue") +
  xlab("Male heights in inches") + ggtitle("Histogram Male heights")
p + geom_histogram(binwidth =1, color ="blue") +
  xlab("Male heights in inches") + ggtitle("Histogram Male heights")
#in histograms the argument color= is for the outline of the bars, fill= is for the body of the bars
p + geom_histogram(binwidth =1, fill ="blue", color ="black") +
  xlab("Male heights in inches") + ggtitle("Histogram Male heights")
#geom_density() layer creates smooth density plots, the smoothed version of histograms, to compare distributions easily
p + geom_density()
p + geom_density(color ="blue")
p + geom_density(fill ="blue")
#the argument color= puts color on the line, fill= puts color on the area below the line
#geom_qq() layer needs a sample= aesthetic argument: the data sample is compared to the normal distribution, sample against theoretical
#first redefine p because we need aes(sample=) instead of aes(x=) for the qqplot
p <- heights %>% filter(sex =="Male") %>% ggplot(aes(sample =height))
p + geom_qq()
#by default the qqplot is compared to the standard normal distribution, mean=0 and sd=1; use the dparams= argument to supply the sample mean and sd, or convert the sample into z-scores with scale()
params <- heights %>% filter(sex =="Male") %>%
summarize(mean=mean(height), sd=sd(height))
#create a params object and assign it to the dparams argument inside the qq geometry
p + geom_qq(dparams =params)
#now the qqplot is plotted against a normal distribution with the same mean and sd as the heights dataset
p + geom_qq(dparams =params) + geom_abline()
#the points fall on the line, so the data is approximately normal; if the sample data is converted to z-scores (standard units) with the scale() function, the code looks cleaner
p <- heights %>% filter(sex =="Male") %>% ggplot(aes(sample =scale(height)))
p + geom_qq() + geom_abline()
install.packages("gridExtra")
library(gridExtra)
#load the gridExtra package and use the grid.arrange() function to put plots next to each other
p <- heights %>% filter(sex =="Male") %>% ggplot(aes(x =height))
p1 <- p + geom_histogram(binwidth =1, fill ="blue", color ="black")
p2 <- p + geom_histogram(binwidth =2, fill ="blue", color ="black")
p3 <- p + geom_histogram(binwidth =3, fill ="blue", color ="black")
grid.arrange(p1,p2,p3, ncol=3)
grid.arrange(p1,p2,p3)
#grid.arrange() is useful to compare ggplots in the same image by listing the plots and its layout arguments; by default the plots are stacked in rows
#quick plots with the qplot() function: not as complete as ggplot, but it produces graphs easily by guessing the type of plot
x <- heights %>% filter(sex=="Male") %>% pull(height)
qplot(x)
#qplot(x) makes a quick histogram; to obtain a full histogram and change the arguments, first create a gg-object and follow all the usual steps
qplot(sample=scale(x))
qplot(sample=scale(x)) + geom_abline()
#qplot is a ggplot function so we can add layers as when we define a gg-object; with sample= we get a qqplot drawn as points
#the dot placeholder . stands for the data piped in by %>%
heights %>% qplot(sex,height, data=.)
#the previous code renders just the points; add the geom="plot name" argument to define the graph wanted
heights %>% qplot(sex,height, data=., geom="boxplot")
qplot(x, geom ="density")
#I() function to avoid (inhibit) the evaluation, conversion or interpretation of an object
qplot(x, bins =15, color ="black", xlab ="Population")
qplot(x, bins =15, color =I("black"), xlab ="Population")
#I() = keep it as it is
grid.arrange(p1,p2, ncol=2)
#define group=sex, color=sex inside aes() to create 1 density plot with the 2 groups, female and male, in 2 colors
heights %>% ggplot(aes(height, group =sex, color =sex)) + geom_density()
heights %>% ggplot(aes(height, group =sex, fill =sex)) +
geom_density(alpha =0.2)
#if we use fill=sex for the 2 groups female-male, the curves overlap; define alpha=0.2 to show both fills
library(tidyverse)
library(dplyr)
library(dslabs)
data("heights")
#summarize() function computes summary statistics on data frames and returns a new dataframe with the variable names defined in the function call and their values
s <- heights %>% filter(sex=="Male") %>% summarize(average=mean(height),
standard_deviation=sd(height))
s
class(s)
#remember summarize() is a dplyr function aware of the variable names stored in dataframes, so the names can be used directly
#the resulting object stored in s is of class dataframe and we can access its variables average and standard_deviation with $
s$average
s$standard_deviation
heights %>% filter(sex=="Male") %>% summarize(median=median(height),
minimum=min(height), maximum=max(height))
quantile(x,c(0,0.5,1))
#summarize or quantile give the same result, but quantile cannot be used directly inside summarize, which expects functions that return a single value
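#a hedged workaround sketch: call quantile() once per probability so each summary returns a single value
heights %>% filter(sex=="Male") %>%
  summarize(minimum=quantile(height,0), median=quantile(height,0.5), maximum=quantile(height,1))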
murders <- murders %>% mutate(murder_rate = total/population*100000)
summarize(murders, mean(murder_rate))
mean(murders$murder_rate)
#the plain mean of murder_rate is not the real national average because states have big and small populations; computing sum(total)/sum(population)*100000 gives the correct average
us_murder_rate <- murders %>% summarize(rate =
sum(total)/sum(population)*100000)
us_murder_rate
class(us_murder_rate)
sum(murders$total)/sum(murders$population)*100000
#if we know the formula to compute us_murder_rate, we can use the previous code sum(total)/sum(population)*100000 directly
#the $ accessor, the pull() function and the dot placeholder . are useful when we have a data frame and need to extract a numeric value to work with
us_murder_rate %>% .$rate
class(us_murder_rate %>% .$rate)
class(us_murder_rate)
#the class of us_murder_rate is numeric once we access .$rate variable
us_murder_rate$rate
class(us_murder_rate$rate)
us_murder_rate %>% pull(rate)
class(us_murder_rate %>% pull(rate))
#use the pipe %>% to get only the number, not the whole dataframe, in one line of code
us_murder_rate <- murders %>% summarize(rate =
sum(total)/sum(population)*100000) %>% .$rate
us_murder_rate
class(us_murder_rate)
#we get the same numeric result using the pull() function; .$ and pull() are equivalent in dplyr pipelines
us_murder_rate <- murders %>% summarize(rate =
sum(total)/sum(population)*100000) %>% pull(rate)
us_murder_rate
class(us_murder_rate)
#compute the median murder rate for the southern states without defining extra objects, using the pipe %>%
filter(murders, region=="South") %>% mutate(rate=total/population*10^5) %>%
  summarize(median=median(rate)) %>% pull(median)
library(dslabs)
library(tidyverse)
library(dplyr)
#first group_by() and then summarize() is a common operation in exploratory data analysis
#group_by() function splits the data by one or more variables: group_by(variable name)
heights %>% group_by(sex)
heights %>% group_by(sex) %>% summarize(average=mean(height),
standard_deviation=sd(height))
#summarize() applied after group_by() makes a summary over each group, females and males separately: sex, average and standard_deviation in columns and two rows, female and male
murders %>% group_by(region) %>% summarize(median_rate=median(murder_rate))
#to examine datasets the data needs to be sorted; in the dplyr package arrange() is more useful than the sort and order functions
#arrange() function sorts a dataframe by a given column, by default in ascending order, lowest to highest
murders %>% arrange(population) %>% head()
#the code above sorts (arrange()) the whole murders dataframe from lowest to highest population
murders %>% arrange(murder_rate) %>% head()
#the code above sorts (arrange()) the whole murders dataframe from lowest to highest murder_rate
#desc() function sorts a vector in descending order, highest to lowest; it can be used within arrange()
murders %>% arrange(desc(population)) %>% head()
murders %>% arrange(desc(murder_rate)) %>% head()
#arrange() by multiple levels orders first by the 1st argument, then within it by the 2nd argument, the 3rd, and so on
#the code below sorts the dataset 1st by region and 2nd by murder rate
murders %>% arrange(region, murder_rate) %>% head()
murders %>% arrange(population, murder_rate) %>% head()
#top_n() function shows the top results ranked by a given variable but NOT in order; combine with arrange() to get the results in order
murders %>% top_n(10, murder_rate)
#the code above shows the top 10 highest murder rates
murders %>% arrange(desc(murder_rate)) %>% top_n(10)
#now the output is sorted (arrange()) in descending order, showing the top 10 highest murder rates
#top_n(data, nrows, variable): the function takes the dataframe as 1st argument, the number of rows as 2nd, and the variable to rank by as 3rd
top_n(murders,10,murder_rate)
#the pipe %>% is very useful with dplyr functions, making many data analyses easy
library(dslabs)
library(tidyverse)
library(dplyr)
#Assessment section 3
#load the data from the survey of the United States National Center for Health Statistics (NHANES), from the dedicated NHANES package
install.packages("NHANES")
library(NHANES)
data("NHANES")
#remember how to remove NA's by defining na.rm=TRUE inside the function where the variable with NA's will be used
data(na_example)
mean(na_example)
sd(na_example)
#both return NA because of the missing values; na.rm=TRUE removes them
mean(na_example, na.rm=TRUE)
sd(na_example, na.rm=TRUE)
#na.rm=TRUE is a useful argument when a dataframe has NA's and we need to apply functions to variables with NA's
head(NHANES)
names(NHANES)
levels(NHANES$Gender)
levels(NHANES$AgeDecade)
summary(NHANES$BPSysAve)
levels(NHANES$Race1)
#levels() shows the groups inside a factor variable and summary() shows the distribution of numeric variables such as blood pressure
#filter the dataframe to 20-29 year-old females: AgeDecade " 20-29" (note the leading space) and Gender "female"
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female")
#what is the average and sd of the systolic blood pressure, stored in BPSysAve, for 20-29 females?
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female") %>%
summarize(mean(BPSysAve, na.rm=TRUE), sd(BPSysAve, na.rm=TRUE))
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female") %>%
summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE))
#without NA's, the systolic blood pressure for 20-29 females is 108 (AVG) and 10.1 (SD)
library(dplyr)
library(tidyverse)
library(NHANES)
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female") %>%
  summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE)) %>%
  pull(AVG)
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female") %>%
  summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE)) %>%
  .$AVG
#pull(AVG) or .$AVG gives the same result, outputting just the average for 20-29 females
#report the min and max values for the same group
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female") %>%
summarize(min=min(BPSysAve, na.rm=TRUE), max=max(BPSysAve, na.rm=TRUE))
NHANES %>% filter(Gender=="female") %>% group_by(AgeDecade) %>%
summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE))
#the code above shows the blood pressure mean and sd grouped by all AgeDecades, still using filter for Gender female
NHANES %>% group_by(AgeDecade,Gender) %>% summarize(AVG=mean(BPSysAve,
na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE))
#the code above groups by AgeDecade & Gender; now we have all the ages and both males and females, since group_by() allows splitting by two variables
NHANES %>% filter(Gender=="male") %>% group_by(AgeDecade) %>%
summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE))
#the code above shows, for males selected with filter, the blood pressure mean and sd grouped by all AgeDecades
#group by race and obtain the average systolic blood pressure for males aged 40-49
NHANES %>% group_by(Race1) %>% filter(AgeDecade==" 40-49"&Gender=="male") %>%
  summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE)) %>%
  arrange(AVG)
NHANES %>% group_by(Race1) %>% filter(AgeDecade==" 40-49"&Gender=="male") %>%
  summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE)) %>%
  arrange(desc(AVG))
#manage data and data visualization to dispel common myths about sensationalized world topics
#Gapminder foundation by Hans Rosling
library(tidyverse)
library(dslabs)
data(gapminder)
head(gapminder)
names(gapminder)
#which countries had the highest child mortality rates in 2015? two ways to answer: by intuition or by exploratory data analysis
#compare Sri Lanka vs Turkey infant mortality in 2015
gapminder %>% filter(year==2015 & country %in%c("Sri Lanka","Turkey")) %>
% select(country, infant_mortality)
#answer: Sri Lanka 8.4 has a lower infant mortality than Turkey 11.6
#compare Poland vs South Korea infant mortality in 2015
gapminder %>% filter(year==2015 & country %in%c("Poland","South Korea"))
%>% select(country, infant_mortality)
#answer: South Korea 2.9 has a lower infant mortality than Poland 4.5
#compare Malaysia vs Russia infant mortality in 2015
gapminder %>% filter(year==2015 & country %in%c("Malaysia","Russia")) %>%
select(country, infant_mortality)
#answer: Malaysia 6.0 has a lower infant mortality than Russia 8.2
#compare Thailand vs south Africa infant mortality in 2015
gapminder %>% filter(year==2015 & country %in%c("Thailand","South
Africa")) %>% select(country, infant_mortality)
#answer: Thailand 10.5 has a lower infant mortality than South Africa
33.6
library(tidyverse)
library(dslabs)
library(dplyr)
data(gapminder)
#what is the relationship between life expectancy and fertility, the number of children born per woman, in each continent? answer by intuition and by exploratory data analysis
#create a scatterplot of the relationship fertility~life_expectancy in 1962
ds_theme_set()
filter(gapminder,year==1962) %>% ggplot(aes(x=fertility,
y=life_expectancy)) + geom_point()
gapminder %>% filter(year==1962) %>% ggplot(aes(x=fertility,
y=life_expectancy)) + geom_point()
#from the plot we see that the points fall into two distinct clusters for the year 1962
#life expectancy around 70 years with fertility of 3 or fewer children, and life expectancy lower than 65 years with fertility of 5 or more children
#add color=continent (a factor variable) within aes(), which assigns colors automatically, to see the scatterplot by continent
gapminder %>% filter(year==1962) %>% ggplot(aes(x=fertility,
y=life_expectancy, color=continent)) + geom_point()
#indeed, in 1962 Europe and North America had higher life expectancy and lower fertility than the developing regions of America, Asia and Africa
library(gridExtra)
p1 <- gapminder %>% filter(year==1962) %>%
  ggplot(aes(x=fertility, y=life_expectancy, color=continent)) +
  geom_point() + ggtitle("1962 plot")
p2 <- gapminder %>% filter(year==2012) %>%
  ggplot(aes(x=fertility, y=life_expectancy, color=continent)) +
  geom_point() + ggtitle("2012 plot")
grid.arrange(p1,p2)
#grid.arrange() could be one option to see both plots and compare them; the faceting functions are another good option
#faceting makes multiple side-by-side plots stratified by one variable, facet_wrap(), or two or more variables, facet_grid()
#facet_grid() layer separates the plots by up to two variables with ~: facet_grid(variable1 on the rows ~ variable2 on the columns)
filter(gapminder, year %in% c(1962,2012)) %>% ggplot(aes(x=fertility,
y=life_expectancy, color=continent)) + geom_point() +
facet_grid(continent~year)
gapminder %>% filter(year %in% c(1962,2012)) %>% ggplot(aes(x=fertility,
y=life_expectancy, color=continent)) + geom_point() +
facet_grid(continent~year)
#the data has been stratified on the right by rows (continent, variable1) and on the top by columns (year, variable2)
#the dot operator . stands for no variable; use it inside facet_grid(.~variable2) to facet by only one variable, in columns
gapminder %>% filter(year %in% c(1962,2012)) %>% ggplot(aes(x=fertility,
y=life_expectancy, color=continent)) + geom_point() + facet_grid(.~year)
#now we get a plot split by year only, and we see the continents only through color
#the plot shows that the developing continents, America, Asia and Africa, have moved toward the western regions, Europe and North America
#facet_wrap() layer lays the plots out over the whole screen space, columns and rows, better than facet_grid(), which here splits in only one direction
#compare Europe against Asia through the years 1962, 1970, 1980, 1990, 2000, 2012
years <- c(1962,1970,1980,1990,2000,2012)
continents <- c("Europe","Asia")
gapminder %>% filter(year %in% years & continent %in% continents) %>%
ggplot(aes(x=fertility, y=life_expectancy, color=continent)) +
geom_point() + facet_wrap(~year)
gapminder %>% filter(year %in% years & continent %in% continents) %>%
ggplot(aes(x=fertility, y=life_expectancy, color=continent)) +
geom_point() + facet_wrap(.~year)
#facet_wrap() creates a tidy screen space to compare plots; remember to use ~ or .~variablename within the function
#the plot shows that Asian countries have made great improvements through the years, catching up with Europe
#the facet_ functions fix the scales across panels for better comparisons; if we use grid.arrange() to compare plots, the scales are not the same because each plot keeps its own free scale
#time series plots have time on the x axis and a variable (measurement) of interest on the y axis
gapminder %>% filter(country=="United States") %>%
ggplot(aes(year,fertility)) + geom_point()
gapminder %>% filter(country=="United States") %>%
ggplot(aes(year,fertility)) + geom_line()
#use geom_line() instead of geom_point() when the points are regularly spaced and densely packed; it is useful to compare two series of the same variable
#compare the fertility between countries, for example South Korea in Asia with Germany in Europe
gapminder %>% filter(country %in% c("South Korea","Germany")) %>%
ggplot(aes(year,fertility,group=country)) + geom_line()
#filter the country variable of the dataset to the two countries selected and assign the group argument within aes() by country
#by adding the color argument within aes(color=country), the data is automatically grouped by country
gapminder %>% filter(country %in% c("South Korea","Germany")) %>%
ggplot(aes(year,fertility,color=country)) + geom_line()
#labels are usually preferred over legends in most plots; although legends are the default, labelling is visually better
#store the position of the labels in an object "labels", defining x = position along the x axis and y = position along the y axis, then add the labels within geom_text()
labels <- data.frame(country=c("South Korea","Germany"), x =c(1975,1965), y =c(60,72))
labels
#in the labels object we set the coordinates where we want the labels to appear later on the graph via geom_text()
gapminder %>% filter(country %in% c("South Korea","Germany")) %>%
ggplot(aes(year,life_expectancy,color=country)) + geom_line() +
geom_text(data=labels,aes(x,y,label=country), size=5) +
theme(legend.position ="none")
#South Korea improved in life expectancy, catching up with Germany
gapminder %>% filter(country %in% c("South Korea","Germany")) %>%
ggplot(aes(year, life_expectancy, color=country)) + geom_line() +
geom_label(data=labels, aes(x,y,label=country)) + theme(legend.position =
"none")
names(gapminder)
str(gapminder$gdp)
#gdp, gross domestic product, is the market value of the products and services generated by a country in a year
#gdp per person estimates how rich a country is; divided by the 365 days in a year it equals dollars per day
#gdp/population/365 = dollars per day
#add the dollars per day variable to the gapminder dataframe with the mutate function
gapminder <- gapminder %>% mutate(dollars_per_day=gdp/population/365)
names(gapminder)
#gdp values are adjusted for inflation and represent current US dollars, so they can be compared across the years
#what are the levels of poverty by country? dollars_per_day = gdp/population/365 is a good measure to compare countries
past_year <- 1970
gapminder %>% filter(year==past_year & !is.na(gdp)) %>%
ggplot(aes(dollars_per_day)) + geom_histogram(binwidth = 1, color =
"black")
#remove NA's using data %>% filter(!is.na(variablename))
gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black")
#the histogram quickly shows that most of the country averages are below 10 dollars per day
#dollars/day country averages: 1 dollar/day extremely poor, 2 dollars/day very poor, 4 dollars/day poor, 8 dollars/day middle, 16 dollars/day well off, 32 dollars/day rich, 64 dollars/day very rich country
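#note these cut points are consecutive powers of 2 (1, 2, 4, 8, 16, 32, 64), which is why the log2 transformation below spaces them evenly
log2(c(1,2,4,8,16,32,64))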
#log transformations convert multiplicative changes into additive changes
log2(2)
log10(2)
log2(4)
log10(4)
#log2 means that every time a value doubles (x2), the log transformation increases by 1: log2(2)=1, log2(4)=2
#log10 means that every time a value is multiplied by 10 the log transformation increases by 1; a doubling adds about 0.3: log10(2)=0.3, log10(4)=0.6
#transform the data from the previous histogram to log base 2
gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
  ggplot(aes(log2(dollars_per_day))) + geom_histogram(binwidth=1, color="black")
#bumps (modes) of a distribution are values with the highest frequency
#the mode of a normal distribution = the mean or average
#NOT normal distributions can have multiple modes (bumps), also called local modes; bimodality is consistent with a high frequency of both lower and higher dollars/day countries
#the natural log (base e) is not recommended to scale data; log2 is easier than log10 to interpret here
#log base 2 works well with moderate values, log base 10 works well with large numbers
gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
ggplot(aes(log10(population))) + geom_histogram(binwidth=1,
color="black")
#population stores large numbers, so scaling by log10 makes more sense
gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
ggplot(aes(log2(population))) + geom_histogram(binwidth=1, color="black")
#transform plots to log in two ways: log the data values before plotting, or log the x-y axis scales of the plot
#logged data values are easy to read off; the advantage of using log scales (axis layers) is that we see the original values on the axis
gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
  ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black") +
  scale_x_continuous(trans="log2")
#scale_x_continuous(trans="log2") and scale_x_log10() change the x (or y) axis of the plot and preserve the original values on the axis
library(tidyverse)
library(dslabs)
library(dplyr)
#to see the dollars/day distribution by region, histograms or smooth plots are not useful because of the large number of regions; there are 22 levels
length(levels(gapminder$region))
#boxplots next to each other allow comparing the data by region, with some important adjustments to visualize the data in the best way
p <- gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
ggplot(aes(region, dollars_per_day))
p + geom_boxplot()
#rotate the region labels on the x axis by adding a theme() layer, which changes the non-data components of the plot
#theme(axis.text.x = element_text(angle= argument to rotate the labels on the x axis, hjust= horizontal justification argument))
p + geom_boxplot() + theme(axis.text.x = element_text(angle=90,hjust=1))
#reorder() function reorders the levels of a variable (1st argument, treated as categorical) based on the values of a second variable (2nd argument), usually numeric
#reorder() changes the order of the levels of a factor variable based on a summary computed on a numeric vector
fac <- factor(c("Asia","Asia","West","West","West"))
levels(fac)
#by default the levels of the factor are in alphabetical order; reorder() orders the levels of fac by the value variable, from lowest to highest
value <- c(10,11,12,6,4)
fac
value
names(value) <- fac
fac
value
fac <- reorder(fac, value, FUN=mean)
levels(fac)
#the fac variable (1st argument) is now reordered by the mean of value (2nd argument)
#reorder the regions in the gapminder dataset by median() income level, dollars_per_day
p <- gapminder %>% filter(year==1970 & !is.na(gdp)) %>% mutate(region=
reorder(region,dollars_per_day, FUN=median)) %>% ggplot(aes(region,
dollars_per_day, fill=continent)) + geom_boxplot() + theme(axis.text.x =
element_text(angle=90,hjust=1)) + xlab("Region")
p
#mutate() with reorder() reorders regions by the dollars/day median; fill= inside aes() colors the boxes automatically; theme(element_text()) rotates the x labels; xlab("") with an empty string erases the default region label
p <- gapminder %>% filter(year==1970 & !is.na(gdp)) %>% mutate(region=
reorder(region,dollars_per_day, FUN=median)) %>% ggplot(aes(region,
dollars_per_day, fill=continent)) + geom_boxplot() + theme(axis.text.x =
element_text(angle=90,hjust=1)) + xlab("")
p
reorder(gapminder$region,gapminder$dollars_per_day, FUN=mean)
#the boxplots are in order by the median value income-dollars_per_day
against regions, and each continent has its color with fill=continent
within aes()
#change the y axis of the boxplot to log2 with scale_y_continuous(trans="log2")
p + scale_y_continuous(trans="log2")
#the log2 scale lets us see the differences between regions; the boxes look bigger now and can be compared with ease
#add points of data to show the data on the plot only when the graph is
clear, use geom_point(show.legend=FALSE) layer
p + scale_y_continuous(trans="log2") + geom_point()
p + scale_y_continuous(trans="log2") + geom_point(show.legend = FALSE)
p + scale_y_continuous(trans="log2") + geom_point(size=0.5)
p + scale_y_continuous(trans="log2") + geom_point(size=1)
#change the size of the points defining args inside geom_point() layer
names(gapminder)
levels(gapminder$region)
length(levels(gapminder$region))
#check the bimodality of the dataset by adding a group variable with mutate(), then plot both histograms using facet_grid() by one variable (group)
#define the West regions in the vector west first, since the mutate() below uses it
west <- c("Western Europe","Northern Europe","Southern Europe","Northern America","Australia and New Zealand")
gapminder %>% filter(year==1970 & !is.na(gdp)) %>% mutate(group=ifelse(region%in%west,"West","Developing")) %>% ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black") + scale_x_continuous(trans="log2") + facet_grid(.~group)
#above we check the bimodality with a histogram; countries in the West have higher incomes than developing countries
#compare the differences in distribution of the western world across past
and present years
past_year <- 1970
present_year <- 2010
gapminder %>% filter(year %in%c(1970,2010) & !is.na(gdp)) %>%
mutate(group=ifelse(region%in%west,"West","Developing")) %>%
ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black")
+ scale_x_continuous(trans="log2") + facet_grid(year~group)
#the code shows income (dollars_per_day) filter()ed to two years, past & present, and the graph is rendered with facet_grid() by year and group so the screen space stays clear
gapminder %>% filter(year %in%c(past_year,present_year) & !is.na(gdp)) %>% mutate(group=ifelse(region%in%west,"West","Developing")) %>% ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black") + scale_x_continuous(trans="log2") + facet_grid(year~group)
#some countries were defined after 1970 & data availability is high now,
that's why there are more countries in 2010 and this could change the
plot
#we are going to use only the countries with data availability for both
years using the intersect() function
country_list_1 <- gapminder %>% filter(year==1970 & !
is.na(dollars_per_day)) %>% pull(country)
country_list_2 <- gapminder %>% filter(year==2010 & !
is.na(dollars_per_day)) %>% pull(country)
#define country lists for both years 1970&2010, remember pull() is equal
to .$ to extract the countries
#then define a vector country_list with intersect(clist_1, clist_2) containing the data available for both years, to remake the plot
#the intersect() function finds the overlap between two vectors
country_list <- intersect(country_list_1,country_list_2)
length(country_list)
length(country_list_1)
length(country_list_2)
#length() allows to see the data for dollars_per_day, in 1970 = 113
countries, in 2010 = 176 countries and there are 108 countries with both
data availability
gapminder %>% filter(year %in%c(1970,2010) & country %in% country_list)
%>% mutate(group=ifelse(region%in%west,"West","Developing")) %>%
ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black")
+ scale_x_continuous(trans="log2") + facet_grid(year~group)
#the previous code remakes the plot using countries that are %in%
country_list-data available for both years
#the developing group improved in dollars_per_day, as did the West group
#the boxplot of median income-dollars_per_day by regions can be compared
by year 1970-2010
p <- gapminder %>% filter(year%in%c(1970,2010) & country%in%country_list)
%>% mutate(region= reorder(region,dollars_per_day, FUN=median)) %>%
ggplot(aes(region, dollars_per_day, fill=continent)) + geom_boxplot() +
theme(axis.text.x = element_text(angle=90,hjust=1)) + xlab("") +
scale_y_continuous(trans="log2") + geom_point()
p
p + facet_grid(.~year)
p + facet_grid(year~.)
#define p filtering by years and country %in% country_list, and use facet_grid(rows~columns) with year on rows
#comparing boxplots one above the other is not useful; the comparison is easier with one boxplot next to the other
#remember fill= inside aes() splits by color, we want to split by year so
we need 1970&2010 to be a categorical-factor variable
#ggplot automatically puts the boxplots next to each other and assigns a
color to each factor aes(fill-color=factor-categorical)
p <- gapminder %>% filter(year%in%c(1970,2010) & country%in%country_list)
%>% mutate(region= reorder(region,dollars_per_day, FUN=median)) %>%
ggplot(aes(region, dollars_per_day, fill=factor(year))) + geom_boxplot()
+ theme(axis.text.x = element_text(angle=90,hjust=1)) + xlab("") +
scale_y_continuous(trans="log2")
p
#remember to take out facet_grid to cancel the split of the graph and
allow the split by year on the same plot
#smooth density plots convey the gapminder story more clearly than rough histograms
#check the bimodality using a smooth density instead of a histogram and easily see how the two modes get closer in 2010
gapminder %>% filter(year %in%c(1970,2010) & country %in%country_list) %>% ggplot(aes(dollars_per_day)) + geom_density(fill="grey") + scale_x_continuous(trans="log2") + facet_grid(year~.)
#which is the reason of the changes? poor countries become rich or rich
countries become poor?
#how many countries are in each group west-developing?
gapminder %>% filter(year%in%c(1970,2010) & country%in%country_list) %>%
mutate(group= ifelse(region%in%west, "West","Developing")) %>%
group_by(group) %>% summarize(n=n())
gapminder %>% filter(year%in%c(1970,2010) & country%in%country_list) %>%
mutate(group= ifelse(region%in%west, "West","Developing")) %>%
group_by(group) %>% summarize(len=length(group))
#the number of countries in the West group is different from the number in the developing group
#density plots are each scaled to total area 1, so if the groups are not the same size, letting them look the same size would be a big error
#accessing computed variables within geom_density can solve the density scaling mistake
#the areas of the density plots should be proportional to the sizes of the groups
#count is a variable computed by geom_density: count = density * number of points, which multiplies the y-axis values by the size of the group so areas become proportional
#wrap a computed variable name in dotdots to access it in ggplot: ..name_variable..
#redefine aes(x=dollars_per_day, y=..count..) to put the group-proportional count on the y axis
p <- gapminder %>% filter(year%in%c(1970,2010) & country%in%country_list)
%>% mutate(group= ifelse(region%in%west, "West","Developing")) %>%
ggplot(aes(x=dollars_per_day, y=..count.., fill=group)) +
scale_x_continuous(trans="log2")
p + geom_density(alpha=0.2) + facet_grid(year~.)
#in this density plot the y-axis is now proportional for both groups even though they have different sizes; there are no more scale errors on this plot
#remember to define y inside aes() as ..count.., the computed count variable wrapped in dotdots
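#note: recent versions of ggplot2 prefer after_stat() over the ..var.. notation; a minimal sketch of the same plot written that way, assuming the west vector and country_list defined above
gapminder %>% filter(year%in%c(1970,2010) & country%in%country_list) %>% mutate(group= ifelse(region%in%west, "West","Developing")) %>% ggplot(aes(x=dollars_per_day, y=after_stat(count), fill=group)) + scale_x_continuous(trans="log2") + geom_density(alpha=0.2) + facet_grid(year~.)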
#we can also change the bw=0.75 argument within geom_density to get smoother densities
p + geom_density(alpha=0.2, bw=0.75) + facet_grid(year~.)
#the graph clearly shows the developing world moving toward the right side; incomes (dollars_per_day) grew between 1970 & 2010
#how are the income-dollars_per_day changes against regions?
levels(gapminder$region)
#case_when() function defines a factor-categorical whose levels are
defined by logical operations to group data
install.packages("ggridges")
library(ggridges)
#note: the y=group call below needs the group variable, which is only added to gapminder further down with case_when(); region and continent already exist
gapminder %>% filter(year%in%c(1970,2010) & !is.na(dollars_per_day)) %>% ggplot(aes(x=dollars_per_day, y=group)) + scale_x_continuous(trans="log2") + geom_density_ridges(adjust=1.5) + facet_grid(.~year)
gapminder %>% filter(year%in%c(1970,2010) & !is.na(dollars_per_day)) %>%
ggplot(aes(x=dollars_per_day, y=region)) +
scale_x_continuous(trans="log2") + geom_density_ridges(adjust=1.5) +
facet_grid(.~year)
gapminder %>% filter(year%in%c(1970,2010) & !is.na(dollars_per_day)) %>%
ggplot(aes(x=dollars_per_day, y=continent)) +
scale_x_continuous(trans="log2") + geom_density_ridges(adjust=1.5) +
facet_grid(.~year)
#plot stacked densities using position="stack" in geom_density
gapminder %>% mutate(group=case_when(.$region %in% west ~ "West", .$region %in% c("Eastern Asia","South-Eastern Asia") ~ "East Asia", .$region %in% c("Caribbean","Central America","South America") ~ "Latin America", .$continent == "Africa" & .$region !="Northern Africa" ~ "Sub-Saharan Africa", TRUE ~ "Others"))
#assign groups depending on region; case_when() sets those groups and the names they take
#mutate(group = case_when(.$region %in% west ~ "West", ...)): .$ accesses the variables, and the tilde operator means "assign this value when the condition is TRUE"
gapminder <- gapminder %>% mutate(group=case_when(.$region %in% west ~ "West", .$region %in% c("Eastern Asia","South-Eastern Asia") ~ "East Asia", .$region %in% c("Caribbean","Central America","South America") ~ "Latin America", .$continent == "Africa" & .$region !="Northern Africa" ~ "Sub-Saharan Africa", TRUE ~ "Others"))
gapminder %>% filter(year%in%c(1970,2010) & !is.na(dollars_per_day)) %>% mutate(group=case_when(.$region %in% west ~ "West", .$region %in% c("Eastern Asia","South-Eastern Asia") ~ "East Asia", .$region %in% c("Caribbean","Central America","South America") ~ "Latin America", .$continent == "Africa" & .$region !="Northern Africa" ~ "Sub-Saharan Africa", TRUE ~ "Others")) %>% ggplot(aes(x=dollars_per_day, color=group)) + scale_x_continuous(trans="log2") + geom_density(alpha=0.2, bw=0.75) + facet_grid(year~.)
gapminder %>% filter(year%in%c(1970,2010) & !is.na(dollars_per_day)) %>% mutate(group=case_when(.$region %in% west ~ "West", .$region %in% c("Eastern Asia","South-Eastern Asia") ~ "East Asia", .$region %in% c("Caribbean","Central America","South America") ~ "Latin America", .$continent == "Africa" & .$region !="Northern Africa" ~ "Sub-Saharan Africa", TRUE ~ "Others")) %>% ggplot(aes(x=dollars_per_day, fill=group)) + scale_x_continuous(trans="log2") + geom_density(alpha=0.2, bw=0.75) + facet_grid(year~.)
#remember to change y=group to color=group or fill=group to get the correct lines, and facet_grid with year in rows
#what is the relation between child survival and average income across regions? case_when() is useful to define groups, which we then add to the dataset
gapminder <- gapminder %>% mutate(group=case_when(.$region %in% west ~ "West", .$region %in% c("Eastern Asia","South-Eastern Asia") ~ "East Asia", .$region %in% c("Caribbean","Central America","South America") ~ "Latin America", .$continent == "Africa" & .$region !="Northern Africa" ~ "Sub-Saharan Africa", .$region %in% "Northern Africa" ~ "Northern Africa", .$region %in% c("Melanesia","Micronesia","Polynesia") ~ "Pacific Islands"))
#redefine the gapminder dataset adding two more regions
#store in surv_income the average income (gdp) and the infant survival rate derived from infant_mortality; remember to filter by 2010 and remove NAs from gdp, infant_mortality and group
surv_income <- gapminder %>% filter(year %in% 2010 & !is.na(gdp) & !is.na(infant_mortality) & !is.na(group)) %>% group_by(group) %>% summarize(income = sum(gdp)/sum(population)/365, infant_survival_rate = 1 - sum(infant_mortality/1000*population)/sum(population))
surv_income
#limit= argument to change the range of the axis, use it inside x&y layer
scale_x_continuous(limit=c(0.25,150))
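#a sketch of plotting surv_income with axis limits; the limit= and breaks= values here are illustrative choices, not fixed by the data
surv_income %>% ggplot(aes(income, infant_survival_rate, label=group, color=group)) + scale_x_continuous(trans="log2", limit=c(0.25,150)) + scale_y_continuous(trans="logit", limit=c(0.875,0.9981), breaks=c(0.85,0.90,0.95,0.99,0.995,0.998)) + geom_label(size=3, show.legend=FALSE)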
#logistic or logit transformation for a proportion/rate p: f(p) = log(p/(1-p)), the same as f(p) = log(odds), where the odds are p/(1-p), e.g. the odds of a child surviving
#the logit scale and odds are useful to compare small differences near 0 and near 1
#an acceptable survival rate is no less than 98% (0.98); a survival rate of 0.9 is not acceptable
#log(odds) turns fold changes in the odds into constant increases
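#a minimal sketch of the logit transformation; logit() is defined by hand here, it is not a base R function
logit <- function(p) log(p/(1-p))
logit(c(0.90, 0.98, 0.998))  #small differences near 1 become clearly separated on the logit scale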
#data visualization follows some principles to create effective figures and tables adapted to the audience
#encoding-data principles: position, aligned lengths, angles, area, brightness, color hue
#position & length are preferred to display quantities, followed by angles, which are preferred over area; brightness & color are hard to quantify but can be useful
#humans are not good at visually quantifying angles and areas; pie charts rely on angles & area, donut charts on area only, so neither is recommended
#use a barplot instead of a pie or donut chart, or show just a few values with percentage labels
#humans are good at visually quantifying linear measures; barplots use the position and length visual cues well
#barplots encode data with length, so the scale always has to start at 0 to avoid under- and overestimation
#scatterplots and boxplots encode data with position, so starting the scale at 0 is not necessary; adjusting the scale to the data is a visual possibility
#when sizing points, ggplot defaults to scaling the area rather than the radius; plots have to encode the quantities correctly
#encoded visual cues must be proportional to the quantities they represent
#ordering by a meaningful value is better than the default alphabetical order: reorder(factor, value, FUN=mean); factor() turns numerics into factors (categoricals)
library(tidyverse)
library(dslabs)
library(dplyr)
library(gridExtra)
data(murders)
murders %>% mutate(murder_rate =total/population*100000) %>% mutate(state
=reorder(state, murder_rate)) %>% ggplot(aes(state, murder_rate)) +
geom_bar(stat="identity") + coord_flip() + theme(axis.text.y
=element_text(size=6)) + xlab("")
#the previous code shows the states ordered by murder_rate from highest
to lowest, not anymore alphabetical, now the graph is meaningful
#previously we display the quantities of the data, now let's show the
data with focus comparing groups male-female
#standard error IS NOT the same as standard deviation
heights %>% ggplot(aes(x=sex, y=height)) + geom_point()
#geom_point allows a better comparison between the female and male groups than a barplot, where we cannot see how each point (observation) behaves
#two ways to improve a plot that shows all the points: use a geom_jitter() layer or add the alpha=0.2 argument to geom_point
#the geom_jitter() layer adds a small random shift to each point
#alpha blending (the alpha= argument) makes points translucent so overlapping points do not hide each other
#if there are many points, showing distributions is more convenient than showing the points; distribution lines can be contrasted
heights %>% ggplot(aes(x=sex, y=height)) + geom_jitter(alpha=0.2,
width=0.1)
#keep the same axes when comparing data across plots to avoid
interpretation mistakes
heights %>% ggplot(aes(height)) + geom_histogram(binwidth = 1,
color="black") + facet_grid(.~sex)
#align plots vertically to see horizontal changes & align plots horizontally to see vertical changes
#for the heights data, two histograms aligned vertically show horizontal (left-right) changes; two boxplots aligned horizontally (next to each other) show vertical (up-down) changes
heights %>% ggplot(aes(x=sex, y=height)) + geom_boxplot()
heights %>% ggplot(aes(x=sex, y=height)) + geom_boxplot() +
geom_jitter(alpha=0.2, width=0.05)
#barplots are useful to show only one number, they are not good to show
and compare distributions
#combining barplots with data that needs a log transformation is especially distorting
#boxplots are much more informative than barplots, especially when we have many values
#log transformation: useful for data with multiplicative changes
#logit transformation: useful for fold changes in odds
#sqrt transformation: useful for count data
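#a quick sketch of the three transformations on toy numbers (illustrative values only)
x <- c(1, 10, 100, 1000)
log10(x)  #multiplicative changes become additive steps
p <- c(0.5, 0.9, 0.99)
log(p/(1-p))  #logit: fold changes in the odds become constant steps
sqrt(c(0, 1, 4, 9, 100))  #sqrt spreads out small counts and compresses large ones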
#visual cues to be compared should be adjacent (next to each other)
#resources to select colorblind-friendly colors: http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/
#to the plot we want, stored in p, add scale_color_manual(values = the vector of colors for the palette)
#remember to redefine the region variable within mutate() to store the reordered object; otherwise we are just computing rate and the reorder() has no effect on the plotted variable
data("murders")
murders %>% mutate(rate = total/population*100000,
region=reorder(region,rate,FUN=mean)) %>% ggplot(aes(region,rate)) +
geom_boxplot() + geom_point()
#mutate(name_var1=compute code, name_var2=reorder(var1,var2,FUN=))
#compare two variables with scatterplots (geom_point); compare the same type of variable at different time points, with relatively few comparisons, using slope charts (geom_line)
west
#slope charts give an idea of the changes through the slopes of the lines; angles are the visual encoding, along with the positions of the points
#for a large number of observations, the Bland-Altman plot shows the difference between conditions on the y-axis and the mean of the conditions on the x-axis, with a reference line dividing the plot
dat <- gapminder %>% filter(year%in% c(2010,2015) & region%in%west & !is.na(life_expectancy) & population >10^7)
dat %>% mutate(location=ifelse(year==2010,1,2), location=ifelse(year==2015 & country%in% c("United Kingdom","Portugal"), location + 0.22, location), hjust=ifelse(year==2010, 1, 0)) %>% mutate(year= as.factor(year)) %>% ggplot(aes(year,life_expectancy, group=country)) + geom_line(aes(color=country)) + geom_text(aes(x=location, label=country, hjust=hjust))
dat %>% ggplot(aes(x=year, y=life_expectancy, label=country,
group=country, color=country)) + geom_line(show.legend = FALSE) +
geom_text(show.legend = FALSE)
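#a sketch of the Bland-Altman plot described above for the same dat, assuming spread() from tidyr (loaded with tidyverse); the dashed hline at 0 marks no change between years
library(ggrepel)
dat %>% mutate(year = paste0("life_expectancy_", year)) %>% select(country, year, life_expectancy) %>% spread(year, life_expectancy) %>% mutate(average = (life_expectancy_2015 + life_expectancy_2010)/2, difference = life_expectancy_2015 - life_expectancy_2010) %>% ggplot(aes(average, difference, label = country)) + geom_point() + geom_text_repel() + geom_hline(yintercept = 0, lty = 2) + xlab("Average of 2010 and 2015") + ylab("Difference (2015 minus 2010)")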
#encode a 3rd variable on the graph: categorical variables with color hue or the shape= argument, continuous variables with color intensity or size
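#a minimal sketch encoding extra variables, assuming the dollars_per_day column added to gapminder earlier in this script: continent as color (categorical) and population as size (continuous)
gapminder %>% filter(year==2010 & !is.na(gdp)) %>% ggplot(aes(dollars_per_day, life_expectancy, color=continent, size=population)) + geom_point(alpha=0.5) + scale_x_continuous(trans="log2")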
#effective communication of data is a strong antidote to misinformation and fear
#the impact of vaccines on battling infectious diseases, data for 7
diseases from 1928 to 2011 in 50 states
data("us_contagious_diseases")
str(us_contagious_diseases)
names(us_contagious_diseases)
#define the object dat containing only measles, with a rate per 10,000 inhabitants, states ordered by average disease rate, and Alaska and Hawaii removed because they weren't states for the whole period
dat <- us_contagious_diseases %>% filter(!state%in% c("Alaska","Hawaii")
& disease=="Measles") %>% mutate(rate
=count/population*10000*52/weeks_reporting) %>% mutate(state
=reorder(state,rate))
#the count variable stores weekly totals; *10000 gives a rate per 10,000 inhabitants; *52/weeks_reporting scales the count to a full 52-week year
#plot the measles data for California: cases per 10,000 inhabitants on the y axis by year on the x axis
dat %>% filter(state=="California" & !is.na(rate)) %>% ggplot(aes(x=year,
y=rate)) + geom_line() + ylab("Cases per 10,000") + geom_vline(xintercept
= 1963, color="blue")
#geom_vline(), geom_hline() and geom_abline() add vertical, horizontal and diagonal reference lines to a plot, specified by the xintercept=, yintercept= and slope=/intercept= arguments
#geom_vline(xintercept=1963, color="blue") marks 1963 on the x axis because it is the year the measles vaccine was introduced
#3 variables to show, year-x axis, state-y axis, rate-color hue to
represent the continuous variable rate
#the RColorBrewer package (install it first) provides color palettes with sequential and diverging options
#sequential palettes are suited for data that goes from high values to low values, where that distinction matters
#diverging palettes are suited for values that diverge from a center; distances above and below the center are represented equally
install.packages("RColorBrewer")
library(RColorBrewer)
display.brewer.all(type ="seq")
display.brewer.all(type ="div")
#display.brewer.all() shows all the color palettes, with type seq or div
#geom_tile() layer tiles the plot region with colored rectangles representing disease rates by region
#we have count data, so the sqrt transformation (trans="sqrt") is appropriate to keep a few very high counts from dominating the scale
dat %>% ggplot(aes(x=year, y=state, fill=rate)) +
geom_tile(color="grey50") + ylab("") + xlab("") + geom_vline(xintercept =
1963, color="blue")
dat %>% ggplot(aes(x=year, y=state, fill=rate)) +
geom_tile(color="grey50") + scale_x_continuous(expand= c(0,0)) +
scale_fill_gradientn(colors =RColorBrewer::brewer.pal(9,"Reds"),
trans="sqrt") + ylab("") + xlab("") + geom_vline(xintercept = 1963,
color="blue") + theme_minimal() + theme(panel.grid = element_blank()) +
ggtitle("Measles Disease")
#scale_fill_gradientn() contains RColorBrewer::brewer.pal(9,"Reds")
#position and length are better cues than color, so show the values with
position, compute the average and show it with a line plot of measles by
year, state and rate
AVG <- us_contagious_diseases %>% filter(disease=="Measles") %>%
group_by(year) %>% summarize(us_rate =
sum(count,na.rm=TRUE)/sum(population,na.rm=TRUE)*10000)
AVG
dat <- us_contagious_diseases %>% filter(!state%in% c("Alaska","Hawaii")
& disease=="Measles") %>% mutate(rate
=count/population*10000*52/weeks_reporting) %>% mutate(state
=reorder(state,rate))
dat %>% filter(!is.na(rate)) %>% ggplot(aes(x=year, y=rate ,
group=state)) + geom_line(alpha=0.2, size=1, show.legend =FALSE,
color="grey50") + geom_vline(xintercept = 1963, color="blue") + xlab("")
+ ylab("") + ggtitle("Cases per 10,000 by state")
#use regular 2 dimension plots, avoid pseudo-three-dimensional plots, the
3rd dimension doesn't represent any quantity
#avoid using too many significant digits on tables, by default R shows 7
significant digits, 2 significant digits is enough to see how values
behave
#options(digits=n), round(x, digits=n), signif(x, digits=n) three ways to
change the number of digits
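#a quick sketch of the three ways (illustrative numbers)
x <- 3.14159265
round(x, digits = 2)  #3.14, rounds the value itself
signif(x, digits = 3)  #3.14, keeps 3 significant digits
options(digits = 3)  #changes how many digits R prints by default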
#Assessment 5.3
#create a tile plot of smallpox cases per 10,000 population, exclude
Alaska and Hawaii
data("us_contagious_diseases")
dat <- us_contagious_diseases %>% filter(!state%in%c("Hawaii","Alaska") &
disease =="Smallpox") %>%
mutate(rate = count / population * 10000) %>%
mutate(state = reorder(state, rate))
dat %>% filter(!weeks_reporting<10) %>% ggplot(aes(year, state, fill =
rate)) + geom_tile(color = "grey50") + scale_x_continuous(expand=c(0,0))
+ scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
theme_minimal() + theme(panel.grid = element_blank()) +
ggtitle("Smallpox") + ylab("") + xlab("")
names(dat)
#dat has the state variable releveled with mutate() and reorder() by rate values, & the rate variable added with mutate(): count (totals) / population per 10,000
#the scale_fill_gradientn() layer creates an n-color gradient; scale_fill_gradient (without the final n) creates a two-color sequential gradient, and scale_fill_gradient2 a diverging one
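#a minimal sketch of a diverging fill with scale_fill_gradient2, assuming an illustrative midpoint= value for the rate scale
dat %>% filter(!weeks_reporting<10) %>% ggplot(aes(year, state, fill=rate)) + geom_tile(color="grey50") + scale_fill_gradient2(low="blue", mid="white", high="red", midpoint=10) + ylab("") + xlab("")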
#filter to 10 or more weeks_reporting; fill=rate maps the fill color to rate; brewer.pal makes the ColorBrewer palettes available
#create a time series plot of smallpox excluding less than 10 weeks
cases, compute the average and show it through years
dat <- us_contagious_diseases %>% filter(!state%in% c("Hawaii","Alaska")
& disease =="Smallpox") %>% mutate(rate = count / population * 10000) %>%
mutate(state = reorder(state, rate))
avg <- us_contagious_diseases %>% filter(!weeks_reporting<10 &
disease=="Smallpox") %>% group_by(year) %>% summarize(us_rate =
sum(count, na.rm=TRUE)/sum(population, na.rm=TRUE)*10000)
dat %>% filter(!weeks_reporting<10) %>% ggplot() + geom_line(aes(year,
rate, group = state), color = "grey50", show.legend = FALSE, alpha =
0.2, size = 1) + geom_line(mapping = aes(year, us_rate), data = avg,
size = 1, color = "black") + scale_y_continuous(trans = "sqrt", breaks =
c(5,25,125,300)) + ggtitle("Cases per 10,000 by state") + xlab("") +
ylab("") + geom_text(data = data.frame(x=1955, y=50), mapping = aes(x, y,
label="US average"), color="black") + geom_vline(xintercept=1963, col =
"blue")
avg
#store in avg the average us_rate per 10,000 inhabitants, after filtering by disease, weeks and year
#use geom_line two times because it's like two plots in one, light grey
lines year-rate grouped by state & the black line for the average us rate
#make a time series plot for California showing the rates of all diseases with 10 or more weeks reporting
us_contagious_diseases %>% filter(state=="California" & !
weeks_reporting<10) %>% group_by(year, disease) %>% summarize(rate =
sum(count)/sum(population)*10000) %>% ggplot(aes(year, rate,
color=disease)) + geom_line()
#group_by(1st variable year by 2nd variable disease) and color=disease in
aes ggplot allows to see how each disease behave, summarize creates the
rate variable of y-axis
#make a time series plots of the rates for all diseases in the US
us_contagious_diseases %>% filter(!is.na(population)) %>% group_by(year,
disease) %>% summarize(rate = sum(count)/sum(population)*10000) %>%
ggplot(aes(year, rate, color=disease)) + geom_line()
options(digits = 3)
library(tidyverse)
library(dslabs)
library(ggplot2)
library(dplyr)
install.packages("titanic")
library(titanic)
titanic <- titanic_train %>% select(Survived, Pclass, Sex, Age, SibSp,
Parch, Fare) %>% mutate(Survived =factor(Survived), Pclass
=factor(Pclass), Sex =factor(Sex))
titanic
str(titanic)
levels(titanic)
names(titanic)
?titanic_train
#?titanic_train to learn more about the variables stored on a dataset
head(titanic)
head(titanic_train)
str(titanic_train)
names(titanic)
#variable types in titanic: Survived & Sex are categorical non-ordinal, Pclass is categorical ordinal, Fare & Age are numeric continuous, SibSp & Parch are numeric discrete
#the following code shows a geom_density for titanic age grouped by sex,
on the y-axis we have count-totals, fill=sex divides as group_by and
facet_grid by columns
titanic %>% filter(!is.na(Age) & !is.na(Sex)) %>% group_by(Age, Sex) %>%
ggplot(aes(x=Age, y=..count.., fill=Sex)) + geom_density(alpha=0.2,
bw=1.5, position="stack") + facet_grid(.~Sex)
#define params object to apply it into geom_qq and show the sample vs
theoretical distribution of titanic ages
params <- titanic %>% filter(!is.na(Age)) %>% summarize(mean=mean(Age),
sd=sd(Age))
params
titanic %>% filter(!is.na(Age) & !is.na(Sex)) %>% ggplot(aes(sample=Age))
+ geom_qq(dparams =params) + geom_abline()
#create a barplot of survived by sex, remove NA from survived and
fill=sex to group by sex and put color
titanic %>% filter(!is.na(Survived)) %>% ggplot(aes(Survived, fill=Sex))
+ geom_bar()
titanic %>% filter(!is.na(Survived)) %>% ggplot(aes(Survived, fill=Sex))
+ geom_bar(position=position_dodge())
#position=position_dodge() inside geom_bar() layer to separate male-
female bars into life or death
#compare the age distributions of survivors vs deaths; there are two modes among survivors, around ages 0-8 and 25-35; y can be counts or density, here density is more readable
titanic %>% filter(!is.na(Age)) %>% group_by(Age, Survived) %>%
ggplot(aes(x=Age, y=..count.., fill=Survived)) + geom_density(alpha=0.2)
titanic %>% filter(!is.na(Age)) %>% group_by(Age, Survived) %>%
ggplot(aes(x=Age, y=..count.., fill=Survived)) + geom_density(alpha=0.2)
+ facet_grid(.~Survived)
titanic %>% filter(!is.na(Age)) %>% group_by(Age, Survived) %>%
ggplot(aes(x=Age, fill=Survived)) + geom_density(alpha=0.2) +
facet_grid(.~Survived)
titanic %>% filter(!is.na(Age)) %>% group_by(Age, Survived) %>%
ggplot(aes(x=Age, fill=Survived)) + geom_density(alpha=0.2)
#compare y=fare x=survivals in a boxplot, trans="log2" and jitter to see
how points change
titanic %>% filter(!Fare==0) %>% ggplot(aes(x=Survived, y=Fare)) +
geom_boxplot()
titanic %>% filter(!Fare==0) %>% ggplot(aes(x=Survived, y=Fare)) +
geom_boxplot() + scale_y_continuous(trans="log2")
titanic %>% filter(!Fare==0) %>% ggplot(aes(x=Survived, y=Fare)) +
geom_boxplot() + scale_y_continuous(trans="log2") +
geom_jitter(alpha=0.2)
#fares around 8 had more deaths; the median fare of survivors is higher than the median fare of deaths; there is an outlier who paid a fare of 500 and survived
#barplot of counts Pclass by survivals, the class 3 has the highest count
of deaths
titanic %>% filter(!is.na(Pclass)) %>% ggplot(aes(Pclass, fill=Survived))
+ geom_bar()
#proportional barplot from 0 to 1 of Pclass by survivals, the class 1 has
the highest survivals, class 2 about 50-50 life-death
titanic %>% filter(!is.na(Pclass)) %>% ggplot(aes(Pclass, fill=Survived))
+ geom_bar(position= position_fill())
#proportional barplot of Survived filled by Pclass: class 3 makes up about 70% of the deaths and about 35% of the survivors
titanic %>% filter(!is.na(Pclass)) %>% ggplot(aes(Survived, fill=Pclass))
+ geom_bar(position= position_fill())
#position= position_fill() inside geom_bar to get a proportional bar and
compare right between variables
#geom_density x=age, y=count of survivals split-fill=survived, facet by
sex~Pclass
titanic %>% filter(!is.na(Age)) %>% ggplot(aes(x=Age, y=..count..,
fill=Survived)) + geom_density(alpha=0.2) + facet_grid(Sex~Pclass)
titanic %>% filter(!is.na(Age)) %>% ggplot(aes(x=Age, y=..count..,
fill=Survived)) + geom_density(alpha=0.2) + facet_grid(Pclass~Sex)
library(tidyverse)
library(dslabs)
library(dplyr)
data("stars")
options(digits = 3)
str(stars)
head(stars)
levels(stars$star)
names(stars)
#magnitude is a function of star luminosity, negative values of magnitude
have higher luminosity
is.na(stars$magnitude)
mean(stars$magnitude)
sd(stars$magnitude)
summary(stars$magnitude)
stars %>% ggplot(aes(magnitude)) + geom_density()
stars %>% ggplot(aes(magnitude, ..count..)) + geom_density()
stars %>% ggplot(aes(temp)) + geom_density()
stars %>% ggplot(aes(temp, ..count..)) + geom_density()
stars %>% ggplot(aes(x=temp, y=magnitude)) + geom_point()
#most stars follow a decreasing exponential trend
stars %>% ggplot(aes(x=temp, y=magnitude)) + geom_point() +
scale_y_reverse()
stars %>% ggplot(aes(x=temp, y=magnitude)) + geom_point() +
scale_y_reverse() + scale_x_log10()
stars %>% ggplot(aes(x=temp, y=magnitude)) + geom_point() +
scale_y_reverse() + scale_x_log10() + scale_x_reverse()
#stars with negative magnitude are brighter and temperatures are also
high
library(ggrepel)
stars %>% ggplot(aes(x=temp, y=magnitude, label=star)) + geom_point() +
scale_y_reverse() + scale_x_log10() + scale_x_reverse() +
geom_text_repel(size=2)
stars %>% ggplot(aes(x=temp, y=magnitude, label=star)) + geom_point() +
scale_y_reverse() + scale_x_log10() + scale_x_reverse() +
geom_text_repel()
stars$temp
stars$star
stars %>% filter(star%in%
c("Antares","Castor","Mirfak","Polaris","vanMaanen'sStar")) %>%
pull(temp)
stars %>% filter(star%in%
c("Antares","Castor","Mirfak","Polaris","vanMaanen'sStar")) %>%
pull(magnitude)
stars %>% filter(star%in%
c("Sun","Antares","Castor","Mirfak","Polaris","vanMaanen'sStar")) %>%
ggplot(aes(x=temp, y=magnitude, label=star)) + geom_point() +
scale_y_reverse() + scale_x_log10() + scale_x_reverse() +
geom_text_repel()
stars %>% ggplot(aes(x=temp, y=magnitude, label=star, color=type)) +
geom_point() + scale_y_reverse() + scale_x_log10() + scale_x_reverse()
stars %>% filter(type%in%c("O","M")) %>% ggplot(aes(x=temp, y=magnitude,
label=star, color=type)) + geom_point() + scale_y_reverse() +
scale_x_log10() + scale_x_reverse()
stars %>% filter(type%in%c("O","M","G")) %>% ggplot(aes(x=temp,
y=magnitude, label=star, color=type)) + geom_point() + scale_y_reverse()
+ scale_x_log10() + scale_x_reverse() + geom_text_repel()
library(tidyverse)
library(dslabs)
library(dplyr)
data("temp_carbon")
data("greenhouse_gases")
data("historic_co2")
str(temp_carbon)
head(temp_carbon)
names(temp_carbon)
temp_carbon %>% .$year %>% max()
temp_carbon %>% filter(!is.na(carbon_emissions)) %>% pull(year) %>% max()
temp_carbon %>% filter(!is.na(carbon_emissions)) %>% .$year %>% max()
temp_carbon %>% filter(!is.na(carbon_emissions)) %>% select(year) %>%
max()
temp_carbon %>% .$year %>% min()
min(temp_carbon$year)
summary(temp_carbon$year)
summary(temp_carbon$carbon_emissions)
temp_carbon %>% filter(carbon_emissions & year) %>% summary()
#reach the first and last year for carbon emissions data available, how
many times bigger are carbon emissions now?
#min year 1751 max year 2014, ratio 9855/3 -the biggest value divided by
the smallest value
temp_carbon %>% filter(carbon_emissions & !is.na(carbon_emissions)) %>%
filter(year & !is.na(year)) %>% summarize(min=min(carbon_emissions),
max=max(carbon_emissions))
temp_carbon %>% filter(carbon_emissions & !is.na(carbon_emissions)) %>%
filter(year & !is.na(year)) %>% summarize(min=min(year), max=max(year))
9855/3
#reach the first and last year for temp_anomaly data available, how many
degrees C did temperature increase?
temp_carbon %>% filter(temp_anomaly & !is.na(temp_anomaly)) %>%
filter(year & !is.na(year)) %>% summarize(min=min(year), max=max(year))
temp_carbon %>% filter(temp_anomaly & !is.na(temp_anomaly)) %>%
filter(year & !is.na(year)) %>% summarize(mint=min(temp_anomaly),
maxt=max(temp_anomaly), miny=min(year), maxy=max(year))
options(digits = 3)
temp_carbon %>% filter(temp_anomaly & year) %>% summary()
?temp_carbon
temp_carbon %>% filter(year%in%c(1880,2018)) %>% pull(temp_anomaly)
0.82-(-0.11)
#the temperature increased 0.93 degrees Celsius; here we just subtract bigger minus smaller, not the ratio bigger/smaller as with carbon emissions
p <- temp_carbon %>% filter(!is.na(temp_anomaly))
p <- temp_carbon %>% filter(!is.na(temp_anomaly)) %>%
ggplot(aes(year,temp_anomaly)) + geom_line()
p
p + geom_hline(aes(yintercept=0), col="blue")
p + ylab("Temperature anomaly (Degrees C)") + ggtitle("Temperature
anomaly relative to 20th century mean 1880-2018") + geom_text(x=2000,
y=0.05, label="20th century mean", color="blue")
p + geom_hline(aes(yintercept=0), col="blue") + ylab("Temperature anomaly
(Degrees C)") + ggtitle("Temperature anomaly relative to 20th century
mean 1880-2018") + geom_text(x=2000, y=0.05, label="20th century mean",
color="blue")
p + geom_hline(aes(yintercept=0), col="blue") + ylab("Temperature anomaly
(Degrees C)") + ggtitle("Temperature anomaly relative to 20th century
mean 1880-2018") + geom_text(aes(x=2000, y=0.05, label="20th century
mean"), col="blue")
temp_carbon %>% filter(temp_anomaly & year) %>% summary()
temp_carbon %>% filter(temp_anomaly>=0.06) %>% pull(year)
temp_carbon %>% filter(temp_anomaly<0.06) %>% pull(year)
temp_carbon %>% filter(temp_anomaly>=0.5) %>% pull(year)
#show the years with temperatures below the 20th century mean 0.06, above
0.06 and the years above 0.5 C degrees
p + geom_hline(aes(yintercept=0), col="blue") + ylab("Temperature anomaly
(Degrees C)") + ggtitle("Temperature anomaly relative to 20th century
mean 1880-2018") + geom_text(x=2000, y=0.05, label="20th century mean",
color="blue") + geom_line(mapping= aes(year, land_anomaly), color="red")
p + geom_hline(aes(yintercept=0), col="blue") + ylab("Temperature anomaly
(Degrees C)") + ggtitle("Temperature anomaly relative to 20th century
mean 1880-2018") + geom_text(x=2000, y=0.05, label="20th century mean",
color="blue") + geom_line(mapping= aes(year, land_anomaly), color="red")
+ geom_line(mapping =aes(year, ocean_anomaly), color="blue")
#land temperature is the highest; ocean temperature follows the pattern of the global temperature; land temperature has changed the most since 1880
str(greenhouse_gases)
names(greenhouse_gases)
head(greenhouse_gases)
greenhouse_gases %>% ggplot(aes(year, concentration)) + geom_line() +
facet_grid(.~gas, scales="free") + geom_vline(xintercept =1850)
greenhouse_gases %>% ggplot(aes(year, concentration)) + geom_line() +
facet_grid(gas~., scales="free") + geom_vline(xintercept =1850)
greenhouse_gases %>% ggplot(aes(year, concentration)) + geom_line() +
facet_grid(gas~., scales="free") + geom_vline(xintercept =1850) +
ylab("Concentration (ch4/n2o ppb, co2 ppm)") + ggtitle("Atmospheric
greenhouse gas concentration by year 0-2000")
#greenhouse gas (ch4, n2o, co2) concentrations from year 0 to 2000 in three vertically aligned time series plots; the vline marks the industrial revolution in 1850
temp_carbon %>% filter(!is.na(carbon_emissions) & !is.na(year)) %>%
ggplot(aes(year, carbon_emissions)) + geom_line()
temp_carbon %>% filter(year%in%c(1960,2014)) %>% pull(carbon_emissions)
9855/2569
temp_carbon %>% filter(year%in%c(1970,1980)) %>% pull(carbon_emissions)
5301/4053
data("historic_co2")
str(historic_co2)
head(historic_co2)
names(historic_co2)
co2_time <- historic_co2 %>% ggplot(aes(year, co2, color=source)) +
geom_line()
co2_time
historic_co2 %>% ggplot(aes(year, co2, color=source)) + geom_line() +
facet_grid(source~.)
co2_time + scale_x_continuous(limit=c(-800000, -775000))
co2_time + scale_x_continuous(limit=c(-375000, -330000))
co2_time + scale_x_continuous(limit=c(-140000, -120000))
co2_time + scale_x_continuous(limit=c(-3000, 2018))
#change axis limits with limit= c(from-to) inside scale_x_continuous
#changing the limits is like zooming in on that period to see the ups & downs of the line