#Visualization - R script
library(dslabs)
data(murders)
head(murders)
#type head(murders) to see the common order in data science: observations in rows and variables in columns
#data visualization is powerful to communicate a data-driven finding
#a picture is worth a thousand words: that's the advantage of data visualization
#data visualization is the strongest tool of EDA, exploratory data analysis
#EDA (exploratory data analysis) is the most important and often overlooked part of a data analysis
#in EDA, data properties are explored through data visualization and summarization techniques
#data visualization helps to discover biases, mistakes and systematic errors, and to avoid flawed analyses and false discoveries
#basics of EDA and data visualization with ggplot2; other useful tools to learn: interactive graphics
#the most basic statistical summary of a list of objects or numbers is its DISTRIBUTION
#once a vector has been summarized (average, standard deviation) there are ways to visualize and check whether those summaries describe the distribution well
#before visualizing data it is necessary to know what type of data we have
#data-variable types: 1_Categorical: a_ordinal, b_non-ordinal; 2_Numeric: a_discrete, b_continuous
#Categorical-ordinal data: variables defined by a small number of groups whose categories have an inherent order
#example, spiciness: mild, medium, hot
#Categorical-non-ordinal data: variables defined by a small number of groups whose categories don't have any order
#example, sex: female, male; regions: northeast, south, north central, west
#Numeric-continuous data: variables that can take any numeric value, including decimals
#example, heights, murder rates
#Numeric-discrete data: variables that must take round (integer) numeric values
#example, population sizes, counts of things
#heights as we take them are numeric-continuous; if heights are rounded they become numeric-discrete
#numeric-discrete data can be considered categorical-ordinal
#example, the number of packs a person smokes a day rounded to 0, 1, 2 is an ordinal variable: a small number of groups with many members in each group
#example, the number of cigarettes a person smokes, 0, 1...39, 40, would be a discrete variable: a big number of groups with few members in each group; see the sketch below
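#a minimal sketch with a hypothetical packs-per-day vector (values invented for illustration):
packs <- c(0,1,1,2,0,1,2,2,1,0) #rounded counts: few groups, many members each, so ordinal-like
table(packs) #the frequency of each group summarizes the variable well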
library(tidyverse)
library(dslabs)
data(heights)
heights$sex
names(heights)
#unique() to see how many unique values are in a vector
#table() to compute the frequency of each unique value; it creates a contingency table of the counts for each level
x <- c(3,3,3,3,4,4,2)
unique(x)
table(x)
sum(x)
sum(unique(x))
sum(table(x))
sum(x==3)
head(heights)
#the distribution is the most basic statistical summary of a list of objects/numbers
prop.table(table(heights$sex))
#the proportion of females is 0.227 and the proportion of males is 0.773
#this two-category frequency table is the simplest form of a distribution: here those two numbers describe everything
x
prop.table(x)
prop.table(table(x))
#prop.table(x) gives the proportion of each entry over the total; with table() inside we get the proportion of each unique value
options(digits = 3)
prop.table(table(murders$region))
barplot(prop.table(table(murders$region)))
#0.176 NE, 0.333 S, 0.235 NC, 0.255 W
#the barplot let us see the four proportions in a graph
#a distribution is a description/function that shows the possible values of a variable & how often those values occur
#distribution for categorical data <- description, frequency table
#distribution for numerical data <- function, the cumulative distribution function (CDF)
#when we have numerical data, reporting the frequency of each value is NOT an effective summary because most values are unique
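#quick check (using the heights data loaded above): most height values are distinct, so a plain frequency table would be very long
length(unique(heights$height))
nrow(heights)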
#the cumulative distribution function (CDF) reports the proportion of the data below a, for all possible values of a
#F(a) = Proportion(x <= a)
#F(b)-F(a) reports the proportion between two values a and b
#histograms are better than the CDF to show the distribution of numerical data
#histograms divide data into non-overlapping bins (intervals) of the same size and plot the counts (frequency) in each bin
hist(heights$height)
#hist: the base of each bar is an interval defined by a range of values; we can read the proportion of the data by intervals
ecdf(heights$height)
ecdf(x)
#the ecdf() function computes the empirical CDF of a numeric variable
#the CDF has the "a" values on the x-axis and F(a), the proportion of values lower than or equal to "a", on the y-axis
#use the CDF to get summaries and probabilities when the data is numeric-continuous (with decimals)
#the CDF gives the proportion of values below a, F(a); the proportion above a is 1-F(a)
#the CDF also gives the probability of finding a random value between a & b: F(b)-F(a)
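#a quick standard-normal example of F(b)-F(a) (pnorm is covered further below):
pnorm(1) - pnorm(-1) #probability of a value between a=-1 and b=1, about 0.68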
#smooth density plots are similar to histograms, but the jagged peaks are smoothed out and the y-axis changes from counts to density
#when the list of values is big, a smooth density plot shows the distribution in a more general way
#smooth density plots are like histograms with very small bins (intervals): fewer edges & jumps, the histogram becomes smooth
#the smooth density is the CURVE that goes through the top of the histogram bars
#steps to make a smooth curve: 1_histogram with frequencies 2_take the top points 3_connect the points 4_hide the bars
#the degree of smoothness can be controlled by an argument in ggplot2
#the degree of smoothness selected should be representative of the underlying data, to avoid visualization and analysis mistakes
#the area under the smooth density curve sums to 1; to know the proportion between two values, compute the area under the curve between them
#comparing two distributions is easier with smooth densities than with histograms because the jagged edges add clutter
#normal distribution <- bell curve <- Gaussian distribution <- defined by two parameters, average & standard deviation
#the ND is symmetric and centered at the average; about 95% of values fall within 2 standard deviations of the average
#ND examples: gambling winnings, heights, weights, blood pressure, standardized test scores, experimental measurement errors
#mean() and sd() summarize the normal distribution of a dataset
average <- sum(x)/length(x)
SD <- sqrt(sum((x-average)^2)/length(x))
index <- heights$sex=="Male"
x <- heights$height[index]
x
average <- mean(x)
SD <- sd(x)
c(average,SD)
c(average=average,SD=SD)
#compute the mean() and sd() for male heights
mean(heights$height[heights$sex=="Male"])
mean(x)
sd(heights$height[heights$sex=="Male"])
sd(x)
c(mean(x),sd(x))
c(average=mean(x),SD=sd(x))
#for male heights the mean is 69.3 & the sd is 3.61; those two numbers summarize the normal approximation of our data
#standard units <- z-score <- scale() function: how many standard deviations a value is away from the mean (average)
#when the distribution is approximately normal, converting data into z-scores (standard units) is useful to compare between datasets
z <- scale(x)
z
scale(heights$height[heights$sex=="Male"])
(x-mean(x))/sd(x)
#z=(x-mean(x))/sd(x)
#how many men are within two SDs of the average? in standard units the average is z=0, tall is z=2, short is z=-2
#compute the proportion of absolute z-scores smaller than 2
mean(abs(z)<2)
#about 95% of male heights are inside +-2 standard units, as expected for normally distributed data
#standard normal distribution <- data in z-scores: mean = 0 & SD = 1
#a z-score near 0 is around the average; values within +-2 standard units are typical, and |z| of 3 or more marks extreme values
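#quick sanity check: the standardized male heights should behave like a standard normal
c(mean(z), sd(z)) #approximately 0 and 1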
#the normal distribution is associated with the 68-95-99.7 rule: the percent of observations within 1, 2 and 3 SDs of the mean
#68.3% is within 1 SD of the mean, abs(z)<=1; 95.4% within 2 SDs, abs(z)<=2; 99.7% within 3 SDs, abs(z)<=3
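#the rule can be checked with pnorm on standard units:
pnorm(1)-pnorm(-1) #~0.683
pnorm(2)-pnorm(-2) #~0.954
pnorm(3)-pnorm(-3) #~0.997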
#abs() computes the absolute value, dropping the +/- sign
#pnorm() gives the cumulative distribution function (CDF) of a normal distribution, i.e. a PROBABILITY
#F(a) = pnorm(a, mean, sd), i.e. pnorm(observation, average, SD)
#what is the probability that a randomly selected student is taller than 70.5 inches?
pnorm(70.5,mean(x),sd(x))
pnorm(70.5,mean(heights$height[heights$sex=="Male"]),sd(heights$height[heights$sex=="Male"]))
#what is the probability of this observation? F(a) = pnorm(observation, mean, SD), and the upper tail is 1-F(a)
1-pnorm(70.5,mean(x),sd(x))
#0.628 is the probability of males with heights below 70.5 inches, F(a) = pnorm(obs, mean, sd)
#answer: 0.371 is the probability of males with heights above (taller than) 70.5 inches, 1-F(a)
#for the probability of males with heights between two values, use F(b)-F(a)
#heights data is numeric-continuous; treating each height as a categorical entry isn't useful (discretization)
#divide the data into intervals to compute probabilities for continuous distributions
#intervals of length 1 centered on integers work well for continuous normal distributions because reported heights are rounded
library(tidyverse)
library(dslabs)
x <- heights %>% filter(sex=="Male") %>% pull(height)
x
index <- heights$sex=="Male"
x <- heights$height[index]
x
#two ways to define x containing male heights
#pull() function extracts a single variable from a data frame
#what is the proportion of males with heights between 69.5 and 70.5? an interval of length 1
options(digits = 3)
#heights data with decimals, as numeric-continuous: proportions by mean() and probabilities by pnorm() are approximately equal when the interval has length 1 and is centered on an integer, following the normal distribution
mean(x<=68.5)-mean(x<=67.5)
pnorm(68.5,mean(x),sd(x))-pnorm(67.5,mean(x),sd(x))
mean(x<=69.5)-mean(x<=68.5)
pnorm(69.5,mean(x),sd(x))-pnorm(68.5,mean(x),sd(x))
mean(x<=70.5)-mean(x<=69.5)
pnorm(70.5,mean(x),sd(x))-pnorm(69.5,mean(x),sd(x))
#heights data has rounded values, numeric-discrete in practice: proportions by mean() and pnorm() don't match when the interval does not include an integer (discretization)
mean(x<=70.9)-mean(x<=70.1)
pnorm(70.9,mean(x),sd(x))-pnorm(70.1,mean(x),sd(x))
#assigning a small probability to each single height is not useful; the CDF works on height intervals to give the probability distribution of a random value
#dnorm() returns the PDF (probability density function) of a normal distribution for a given value x
#pnorm() returns the CDF (cumulative distribution function) of a normal distribution for a given quantile q
#qnorm() returns the inverse CDF (the quantile function) for a given probability p
#rnorm() returns a vector of normally distributed random values for a given number of observations n
#rnorm is useful to generate random samples & simulate data collected from a normal population with a specified mean & SD
#what could happen by chance? if we pick 800 males at random, what is the distribution of the tallest person?
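#a minimal Monte Carlo sketch of that question, assuming heights follow N(69, 3):
tallest <- replicate(1000, max(rnorm(800, mean=69, sd=3))) #tallest of 800 random males, 1000 times
hist(tallest) #distribution of the tallest person
mean(tallest >= 7*12) #proportion of simulated groups whose tallest man is 7 feet or more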
#13.14 Exercises
xx <- heights %>% filter(sex=="Female") %>% pull(height)
xx <- heights$height[heights$sex=="Female"]
xx
#xx is defined as female heights; the mean is about 64.9 and the SD about 3.8 inches
#if we pick a female at random, what is the probability that she is 5 feet or shorter? 6 feet or taller? between 61-67 inches?
#convert feet to inches: 5 feet = 5*12 = 60 inches
5*12
mean(xx)
sd(xx)
pnorm(5*12,mean(xx),sd(xx))
pnorm(60,mean(xx),sd(xx))
#the probability of finding a female shorter than (below) 5 feet = 60 inches is about 0.094
1-pnorm(6*12,mean(xx),sd(xx))
#the probability of finding a female taller than (above) 6 feet = 72 inches is about 0.0302
pnorm(67,mean(xx),sd(xx))-pnorm(61,mean(xx),sd(xx))
#the probability of finding a female between 61 and 67 inches tall is about 0.561
#repeat the previous exercise with heights in centimeters
5*12*2.54
60*2.54
pnorm(5*12*2.54,mean(xx*2.54),sd(xx*2.54))
#the probability of finding a female shorter than 5 feet = 60 inches = 152.4 cm is about 0.094, the same as before
1-pnorm(6*12*2.54,mean(xx*2.54),sd(xx*2.54))
#the probability of finding a female taller than 6 feet = 72 inches = 182.88 cm is about 0.0302
pnorm(67*2.54,mean(xx*2.54),sd(xx*2.54))-pnorm(61*2.54,mean(xx*2.54),sd(xx*2.54))
#the probability of finding a female between 61 and 67 inches = about 155-170 cm tall is about 0.561
mean(xx)
sd(xx)
options(digits = 3)
pnorm(-1.96)
pnorm(1.96)
pnorm(-2)
pnorm(2)
qnorm(0.025)
qnorm(0.975)
qnorm(0.0228)
qnorm(0.977)
#pnorm and qnorm are inverse functions; qnorm gives the theoretical quantiles of a normal distribution
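#quick check that they invert each other:
qnorm(pnorm(1.96)) #returns 1.96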
#datacamp assessment:
#what proportion of the data is between 69 and 72 inches: taller than 69 but shorter than or equal to 72?
mean(x<=72&x>69)
mean(x<=72)-mean(x<=69)
pnorm(72,mean(x),sd(x))-pnorm(69,mean(x),sd(x))
#ratio: how many times bigger the exact proportion is compared to the approximation
exact <- mean(x > 79 & x <= 81)
approx <- pnorm(81,mean(x),sd(x))-pnorm(79,mean(x),sd(x))
exact/approx
exact
approx
#between 79 & 81 inches, the exact proportion (by mean() on male heights) is about 1.61 times the approximate proportion (by pnorm)
#get the proportion of men seven feet or taller, then multiply this value by 1 billion (10^9), the number of adult males, and round the value
1-pnorm(7*12,mean=69,sd=3)
#the proportion of men seven feet or taller is 2.866e-07
(1-pnorm(7*12,mean=69,sd=3))*(10^9)
#about 1 billion (10^9) men are between 18 and 40 years old; answer: there are about 287 men seven feet or taller in the world
round((1-pnorm(7*12,mean=69,sd=3))*(10^9))
10/round((1-pnorm(7*12,mean=69,sd=3))*(10^9))
10/((1-pnorm(7*12,mean=69,sd=3))*(10^9))
#if there are 10 players seven feet or taller in the NBA, they represent about 0.0348 of the seven-footers in the world
(1-pnorm((6*12)+8,mean=69,sd=3))*(10^9)
150/((1-pnorm((6*12)+8,mean=69,sd=3))*(10^9))
#there are about 122866 men in the world with LeBron James's height (6 feet 8 inches); if 150 of them are NBA players, that represents about 0.0012
#quantiles: cutoff points that divide datasets into intervals with given probabilities
#the q-th quantile is the value below which a proportion q of the observations fall, with q from 0 (smallest) to 1 (largest probability)
quantile(x)
#quantile() with the default probabilities returns the quartiles, here of the male heights x
#percentiles and quartiles are a kind of quantiles
#percentiles divide datasets into 100 intervals, each with 1% = 0.01 of the probability
#quartiles divide datasets into 4 intervals, each with 25% = 0.25 of the probability
summary(heights$height)
#summary() function gives the mean, min, max, 1st quartile, median (2nd quartile) and 3rd quartile
quantile(heights$height)
#use summary() or quantile() to show the quartiles
#percentiles: define p as seq(from 0.01 to 0.99 by 0.01) and then use quantile(data, p)
p <- seq(0.01,0.99,0.01)
quantile(heights$height,p)
percentiles <- quantile(heights$height,p)
percentiles[names(percentiles)=="25%"]
percentiles[names(percentiles)=="75%"]
percentiles["25%"]
percentiles["75%"]
percentiles["99%"]
#the quantile function creates a named vector; stored here as "percentiles", we can access each n% percentile from 1 to 99
#qnorm() gives the theoretical quantiles of a dataset that follows the normal distribution
#qnorm(p (probability of observations), mean, sd); if mean & sd are not defined, the defaults are mean=0 and sd=1, giving standard normal quantiles
pnorm(-1.96)
qnorm(0.025)
pnorm(qnorm(0.025))
#pnorm on the z-score -1.96 gives the probability that a value from a standard normal distribution is less than or equal to that z
#theoretical quantiles obtained by qnorm can be compared to sample quantiles to check whether the data follow the normal distribution
qnorm(p,69,3)
#q: vector of quantiles, p: vector of probabilities (proportions); p = mean(x <= q), the proportion of the vector x less than or equal to the quantile q
#QQ-plots: to check whether a data distribution is well approximated by a normal distribution
#p = proportions 0.05, 0.10, up to 0.95; the proportion of values in the data below the quantile q is p
options(digits = 3)
mean(x)
mean(x <= 69.5)
#around 50% of males are shorter than or equal to 69.5 inches (a value near the mean); if p = 0.51 then q = 69.5
mean(x >= 69.5)
#QQ-plots: sample (observed) quantiles are compared to the theoretical quantiles expected from the normal distribution
#the points on the QQ-plot will fall near the identity line when sample ≈ theoretical & the data is approximately normal
index <- heights$sex=="Male"
x <- heights$height[index]
z <- scale(x)
#to generate a QQ-plot, 1st define p as a vector of proportions
p <- seq(0.05,0.95,0.05)
#2nd, define a vector of sample quantiles for the proportions p with the quantile() function (it returns a vector by itself)
sample_quantiles <- quantile(x,p)
#3rd, define the theoretical quantiles with the qnorm() function, using the mean and sd of the male heights x and the proportions p
theoretical_quantiles <- qnorm(p, mean=mean(x), sd=sd(x))
theoretical_quantiles <- qnorm(p, mean(x), sd(x))
theoretical_quantiles
#4th, make the QQ-plot (sample vs theoretical quantiles) to see if they match and therefore whether the data follow the normal distribution
plot(theoretical_quantiles,sample_quantiles)
abline(0,1)
#abline(0, 1) adds a straight line with intercept 0 and slope 1, to see if the points fall near the line
#theoretical ≈ sample (observed) quantiles means the data follow the normal distribution; use z-scores (standard units) in qnorm to simplify the code
sample_quantiles <- quantile(z,p)
theoretical_quantiles <- qnorm(p)
#with standard units it's not necessary to define mean & sd in qnorm(), because z-scores match the defaults mean=0 & sd=1
plot(theoretical_quantiles,sample_quantiles)
abline(0,1)
#male heights follow a normal distribution with an average of about 69.3 inches and a standard deviation of about 3.6 inches
#percentiles are the quantiles obtained when p goes from 0.01 (1%) up to 0.99 (99%)
#median = 50th percentile: 50% of the data is below the median; only in symmetric distributions such as the normal does mean = median
#quartiles are the 0.25 (25%) percentile or 1st quartile, the median, and the 0.75 (75%) percentile or 3rd quartile
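#quick check with the male heights x (approximately normal): mean and median are close
c(mean(x), median(x))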
murder_rate <- murders$total/murders$population*10^5
hist(murder_rate)
#the murder rate data do not follow the normal distribution
plot(murder_rate)
summary(murder_rate)
#boxplots are useful to show the info from summary() & to compare multiple distributions when they are not normal
#the box is defined by the 25th-75th percentiles; the distance between them is called the interquartile range
#whiskers show the range, outliers are plotted separately as individual points, and the median is the horizontal line inside the box
boxplot(murder_rate)
boxplot(heights$height~heights$sex)
#from the boxplot we see that men are on average taller than women; both groups have outliers... a description of the graphic
#stratification: data is divided into groups based on associated variables; the resulting groups are called strata
hist(x)
hist(xx)
#Exercises 8.15
library(dslabs)
data("heights")
male <- heights$height[heights$sex=="Male"]
female <- heights$height[heights$sex=="Female"]
length(male)
length(female)
#in this dataset we have 812 entries of male heights and 238 entries of female heights
p <- seq(0.01,0.99,0.01)
quantile(male,p)
male_percentiles <- quantile(male,p)
male_percentiles <- male_percentiles[c("10%","30%","50%","70%","90%")]
male_percentiles
quantile(female,p)
female_percentiles <- quantile(female,p)
female_percentiles <- female_percentiles[c("10%","30%","50%","70%","90%")]
female_percentiles
my_df <- data.frame(female=female_percentiles, male=male_percentiles)
my_df
#show and store male and female percentiles 10,30,50,70 and 90%
#NOT normal distributions: summarizing with mean & sd is not useful; provide a histogram or qqplot to show the distribution
#data visualization helps to discover flaws in the data: measurement mistakes, over- and underestimates, sampling issues, the quality of the data
#the mad() function gives the median absolute deviation; if the distribution is normal, mean = median & sd = mad
#an entry mistake in 1 of 900 observations can increase the mean by half a unit, a big difference in practical terms
#an entry mistake in 1 of 900 observations can increase the sd by 15 units, a really big difference in practical terms
#the median and the mad (median absolute deviation) are ROBUST SUMMARIES compared to the mean and sd; see the sketch below
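#a minimal simulation sketch of those claims (simulated data, the exact shifts are illustrative):
set.seed(1)
y <- rnorm(900, 69, 3) #900 clean observations
y_err <- y
y_err[1] <- y_err[1]*10 #one simulated data-entry mistake
c(mean(y_err)-mean(y), sd(y_err)-sd(y)) #mean and sd shift a lot
c(median(y_err)-median(y), mad(y_err)-mad(y)) #median and mad barely move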
#GGPLOT2: uses the gg (grammar of graphics) to break plots into building blocks with an intuitive syntax
#ggplot2 creates complex, aesthetically pleasing plots with simple, readable code
#ggplot2 is designed to work with tidy datasets: rows = observations & columns = variables
#ggplot2 cheatsheet to remember the basics of plots; other graphics resources to learn instead of ggplot: grid, lattice
library(ggplot2)
#to learn ggplot, break the graph into 3 main components: data, geometry, aesthetic mapping. Additional components: scale, labels, title, legend, theme, style
#data component: the dataset being summarized
#geometry component: the type of plot: histogram, qqplot, boxplot, barplot, smooth density, scatterplot etc
#aesthetic mapping component: variables mapped (assigned) to visual cues that depend on the geometry used: x-axis values, y-axis values, colors, text
#scale component: the scale and magnitudes, defined by the range of the data, e.g. on a log scale
#labels, title, legend components
#theme, style components of the graph
#create a ggplot graph, 1st step: create a ggplot object to associate the data with geometries and mappings
#the ggplot() function initializes the graph (ggplot object); the 1st argument associates the dataset with the new gg object
library(tidyverse)
library(dslabs)
data("murders")
ggplot(data = murders)
ggplot(murders)
murders %>% ggplot()
#three ways to associate the data with the gg object; we get a blank slate because the geometry is not defined yet
p <- ggplot(data = murders)
p <- ggplot(murders)
p <- murders %>% ggplot()
class(p)
#we see that p is a ggplot object by using class() function
#assign the plot to an object p: we create the gg object, associate it with p, and can then render (print) it
print(p)
p
#ggplot creates graphs by adding layers with the + plus symbol
#layers define the components of the graph: geometries, summary statistics, scales to use, style changes
#data %>% ggplot() (gg object) + layer 1 + layer 2 + layer n
#check the ggplot cheatsheet to use the right function for the graph we want to generate
#layer 1 = geometry layer: the geom_ function defines the type of plot and needs data & aes mappings as arguments
#layer 2 = aesthetic mappings: defines how the data connect with graph features (aesthetic arguments x, y, colour, shape etc)
#the aes() function takes the features (arguments) of a geometry function, such as the x-y axes, size, color etc. It's preferred to put aes inside ggplot(aes())
#geom_point() creates scatterplots and requires x and y mappings via aes()
library(dplyr)
library(ggplot2)
murders %>% ggplot() + geom_point(aes(x=population/10^6, y=total))
ggplot(murders) + geom_point(aes(x=murders$population/10^6,y=murders$total))
#remember that ggplot, like dplyr, works like the with() function: it's not necessary to access the data frame again in the code
ggplot(murders) + geom_point(aes(x=population/10^6,y=total))
murders %>% mutate(population_in_millions=population/10^6, total_gun_murders=total) %>% ggplot(aes(x=population_in_millions, y=total_gun_murders)) + geom_point()
#the variables population_in_millions and total_gun_murders must exist in the data frame, so define them with mutate() before mapping them
ggplot(murders) + geom_point(aes(population/10^6,total))
#the aes features x and y are the 1st and 2nd arguments, so the code also works without naming them with =
p <- ggplot(data = murders)
p + geom_point(aes(x=population/10^6,y=total))
p + geom_point(aes(population/10^6,total))
#geom_text() adds text directly to the plot; requires the aes mapping features x, y and a label argument
#geom_label() adds labels (rectangles around the text); requires the aes mapping features x, y and a label argument
ggplot(murders) +
geom_point(aes(x=population/10^6,y=total)) +
geom_text(aes(x=population/10^6,y=total,label=abb))
p + geom_point(aes(x=population/10^6,y=total)) +
geom_text(aes(x=population/10^6,y=total,label=abb))
ggplot(murders) + geom_point(aes(x=population/10^6,y=total)) +
geom_text(aes(x=population/10^6,y=total,label=abb))
#layers with different aes mappings can be added to the same plot; be careful to put the mapped arguments (features) inside the aes() function to avoid mistakes
ggplot(murders) + geom_point(aes(x=population/10^6,y=total)) +
geom_label(aes(x=population/10^6,y=total,label=abb))
#each geom_ type has many features (arguments) specific to the function, beyond the essential aes & data
args(ggplot)
args(geom_point)
args(aes)
args(geom_text)
args(geom_label)
#mappings need to be inside aes() because they use data from specific observations
#operations that affect all the points in the same way don't need to be inside aes; they are not mappings
library(dslabs)
library(tidyverse)
library(dplyr)
library(ggplot2)
p + geom_point(aes(x=population/10^6,y=total), size=3) +
geom_text(aes(x=population/10^6,y=total,label=abb))
#size= goes outside aes() because it is an operation applied to all the points: it changes the size of the plotted points
p + geom_point(aes(x=population/10^6,y=total), size=3) +
geom_text(aes(x=population/10^6,y=total,label=abb), nudge_x=1)
#nudge_x= goes outside aes(); it is not a mapping but an operation on all the points, moving the labels a little to the right
#global aes mappings apply to all geometry layers; they are defined in the ggplot() object
#redefine the p gg-object with the aes() mapping inside ggplot() to simplify the code & avoid typing errors
p <- murders %>% ggplot(aes(x =population/10^6, y =total, label =abb))
p + geom_point(size =3) + geom_text(nudge_x =1.5)
p <- ggplot(data= murders, aes(x =population/10^6, y =total, label =abb))
p + geom_point(size =3) + geom_text(nudge_x =1.5)
#GLOBAL AES: gg-object <- data %>% ggplot(aes(arguments))
#GLOBAL AES: gg-object <- ggplot(data=, aes(arguments=))
#two ways of gg-object definition, by %>% or by data=
#local aes mappings (in layers) add new information, overriding the global aes mappings defined as defaults in the gg-object
p + geom_point(size =3) + geom_text(x =10, y =800, label ="Hello there!")
p
#log10 scales are often preferred and are not the default; this change is added through a scales layer
#scale_x_continuous & scale_y_continuous are the default scale layers, changed to the log10 scale via trans=
p <- ggplot(data= murders, aes(x =population/10^6, y =total, label =abb))
p + geom_point(size =3) + geom_text(nudge_x =1.5) +
scale_x_continuous(trans ="log10") + scale_y_continuous(trans ="log10")
#the nudge should be smaller when we change to the log scale (to get the graph right)
p + geom_point(size =3) + geom_text(nudge_x =0.075) +
scale_x_continuous(trans ="log10") + scale_y_continuous(trans ="log10")
#scale_x_log10() & scale_y_log10() transform the scale layers directly to log10
p + geom_point(size =3) + geom_text(nudge_x =0.075) + scale_x_log10() +
scale_y_log10()
#xlab("label") and ylab("label") name the axes, ggtitle("title") adds a title
p + geom_point(size =3) + geom_text(nudge_x =0.075) + scale_x_log10() + scale_y_log10() + xlab("Population in millions log scale") + ylab("Total number of murders log scale") + ggtitle("US total gun murders in US 2010")
#redefine p without geom_point and then add the color arguments one by one to learn how to change colours
p + geom_text(nudge_x =0.075) + scale_x_log10() + scale_y_log10() + xlab("Population in millions log scale") + ylab("Total number of murders log scale") + ggtitle("US total gun murders in US 2010")
#by adding the color argument ="blue" inside the geom_point function we get all the points blue
p + geom_point(size =3, color ="blue")
#a color argument outside aes() is an operation applied to all the points
#a color argument inside aes() is a mapping that automatically assigns a color to each level of the categorical variable mapped (region)
p + geom_point(size =3, col ="blue")
#arguments color= and col= are the same for colouring graphs
p + geom_point(aes(color =region), size =3)
p + geom_point(aes(col =region), size =3)
p + geom_point(size =3, aes(color =region))
#the order of arguments within geom_point does not change the rendering of the graph: size then aes = aes then size
#a reference legend is also added automatically with the color mapping; avoid it by setting show.legend=FALSE in geom_point
library(dslabs)
library(tidyverse)
library(dplyr)
library(ggplot2)
#to add a line representing the average murder rate, define r as the rate per million: sum(total)/sum(population)*10^6
10^6
identical(10^6,1000000)
#add a line representing the average murder rate; remember the rate is total/population*10^6
r <- murders %>% summarize(rate =sum(total)/sum(population)* 10^6) %>%
pull(rate)
r
r <- murders %>% summarize(rate =sum(total)/sum(population)* (10^6)) %>%
pull(rate)
r
#30.345 is the average murder rate, sum(total)/sum(population)*10^6
#summarize() creates scalar variables summarizing the selected variables of an existing data frame
#geom_abline() adds a line, by default with intercept a = 0 and slope b = 1
p + geom_point(aes(color =region), size =3) + geom_abline(intercept =log10(r)) + scale_x_log10() + scale_y_log10()
#on the log10 scale, keep slope=1 and set intercept=log10(r) to put the line at the average murder rate
#to recreate the graph: lty= changes the line type from solid to dashed, color="darkgrey" sets its color, and putting the geom_abline() layer before the geom_point() layer draws the average murder rate line beneath the points
p <- p + geom_abline(intercept =log10(r), lty=2, color="darkgrey") +
geom_point(aes(color =region), size =3)
p
p <- p + geom_abline(intercept =log10(r), lty=2, color="darkgrey") +
geom_point(aes(color =region), size =3) + scale_x_log10() +
scale_y_log10()
p
#capitalize the legend title using scale_color_discrete(name="Region") with a capital R
#add layers and save the changes in the p gg-object so no changes are lost
p <- p + scale_color_discrete(name="Region")
p
p <- p + geom_text(nudge_x =0.075) + scale_x_log10() + scale_y_log10() + xlab("Population in millions log scale") + ylab("Total number of murders log scale") + ggtitle("US total gun murders in US 2010")
p
#add-on packages for ggplot give the graph finishing touches: ggthemes and ggrepel
#many themes are included in ggplot2 and the dslabs package, added as + theme(argument=) layers; ds_theme_set() sets the dslabs defaults
library(dslabs)
library(tidyverse)
library(dplyr)
library(ggplot2)
#load ggthemes from the library, then add (or store in p, to avoid losing it) the theme_economist() layer, the one needed for the example
library(ggthemes)
install.packages("ggthemes")
library(ggthemes)
p + theme_economist()
#if there is trouble loading a package with library(), try install.packages("packagename") and then library(packagename)
p + theme_clean()
p + theme_fivethirtyeight()
p + theme_economist_white()
p + theme_test()
#try the stored themes to see how they look and which one goes better with our plot
library(ggrepel)
install.packages("ggrepel")
library(ggrepel)
#the ggrepel package contains extra geometries for ggplot2; change the geom_text layer to geom_text_repel to avoid labels falling on top of each other
#pull out the whole code clean (note: this first version errors because its lines start with +):
r <- murders %>% summarize(rate =sum(total)/sum(population)* (10^6)) %>%
pull(rate)
p <- ggplot(data= murders, aes(x =population/10^6, y =total, label =abb))
+ geom_abline(intercept =log10(r), lty=2, color="darkgrey")
+ geom_point(aes(color =region), size =3)
+ geom_text_repel()
+ scale_x_log10() + scale_y_log10()
+ xlab("Population in millions log scale") + ylab("Total number of murders log scale")
+ ggtitle("US total gun murders in US 2010")
+ scale_color_discrete(name="Region") + theme_economist()
p
p <- ggplot(data= murders, aes(x =population/10^6, y =total, label =abb)) +
geom_abline(intercept =log10(r), lty=2, color="darkgrey") +
geom_point(aes(color =region), size =3) +
geom_text_repel() + scale_x_log10() +
scale_y_log10() +
xlab("Population in millions log scale") +
ylab("Total number of murders log scale") +
ggtitle("US total gun murders in US 2010") +
scale_color_discrete(name="Region") +
theme_economist()
p
#put the + at the end of each layer line; otherwise the code errors just because of wrong typing
library(dslabs)
library(tidyverse)
library(dplyr)
library(ggplot2)
#create another summary plot using ggplot2
#the geom_histogram() layer creates histograms more aesthetic than base R hist(); the x axis is divided into bins (intervals)
p <- heights %>% filter(sex =="Male") %>% ggplot(aes(x =height))
p
p + geom_histogram()
#dplyr verbs like filter() and pull() feed data and aes into ggplot objects more cleanly than base subsetting such as dataframe$var[index]
#the histogram has default intervals; define the binwidth (width of the intervals) as an argument
p <- heights %>% filter(sex =="Male") %>% ggplot(aes(x =height))
p + geom_histogram(binwidth =1)
p + geom_histogram(binwidth =1, fill ="blue", color ="black") + xlab("Male heights in inches") + ggtitle("Histogram Male heights")
p + geom_histogram(binwidth =1, fill ="blue") + xlab("Male heights in inches") + ggtitle("Histogram Male heights")
p + geom_histogram(binwidth =1, color ="blue") + xlab("Male heights in inches") + ggtitle("Histogram Male heights")
#in histograms the argument color= colors the outline of the bars, fill= colors the body of the bars
p + geom_histogram(binwidth =1, fill ="blue", color ="black") + xlab("Male heights in inches") + ggtitle("Histogram Male heights")
#the geom_density() layer creates smooth density plots, the smoothed version of histograms, to compare distributions easily
p + geom_density()
p + geom_density(color ="blue")
p + geom_density(fill ="blue")
#the argument color= colors the line, fill= colors the area below the line
#the geom_qq() layer needs sample= as an aes argument; the data sample is compared to the normal distribution, sample vs theoretical quantiles
#first redefine p, because we need aes(sample=) instead of aes(x=) for the qqplot
p <- heights %>% filter(sex =="Male") %>% ggplot(aes(sample =height))
p + geom_qq()
#by default the qqplot is compared to a normal distribution with mean=0 and sd=1; use the dparams= argument to pass the sample mean & sd, or convert the sample into z-scores with scale()
params <- heights %>% filter(sex =="Male") %>%
summarize(mean=mean(height), sd=sd(height))
#create a params object and assign it to the dparams argument inside the qq geometry
p + geom_qq(dparams =params)
#now the qqplot is plotted against a normal distribution with the same mean and sd as the heights dataset
p + geom_qq(dparams =params) + geom_abline()
#the points fall on the line: the data is approximately normal. if the sample data is converted into z-scores (standard units) with the scale() function, the code looks cleaner
p <- heights %>% filter(sex =="Male") %>% ggplot(aes(sample =scale(height)))
p + geom_qq() + geom_abline()
install.packages("gridExtra")
library(gridExtra)
#load the gridExtra package and use the grid.arrange() function to put plots next to each other
p <- heights %>% filter(sex =="Male") %>% ggplot(aes(x =height))
p1 <- p + geom_histogram(binwidth =1, fill ="blue", color ="black")
p2 <- p + geom_histogram(binwidth =2, fill ="blue", color ="black")
p3 <- p + geom_histogram(binwidth =3, fill ="blue", color ="black")
grid.arrange(p1,p2,p3, ncol=3)
grid.arrange(p1,p2,p3)
#grid.arrange() is useful to compare ggplots in the same image, defining the plots to use and their arguments; by default the plots are stacked in rows
#quick plots with the qplot() function: not as complete as ggplot, but it easily produces graphs by guessing the type of plot
x <- heights %>% filter(sex=="Male") %>% pull(height)
qplot(x)
#qplot(x) makes a quick histogram; to obtain a fully customized histogram and change the arguments, first create a gg-object and follow all the later steps
qplot(sample=scale(x))
qplot(sample=scale(x)) + geom_abline()
#qplot is a ggplot function, so we can add layers as when we define a gg-object; with sample= we get a qq-plot
#the dot placeholder . stands for the piped data, so it can be passed to the data= argument
heights %>% qplot(sex,height, data=.)
#the previous code renders just the points; add the geom="plot name" argument to define the type of graph wanted
heights %>% qplot(sex,height, data=., geom="boxplot")
qplot(x, geom ="density")
#the I() function avoids (inhibits) the evaluation, conversion or interpretation of an object
qplot(x, bins =15, color ="black", xlab ="Population")
qplot(x, bins =15, color =I("black"), xlab ="Population")
#I() = keep it as it is
grid.arrange(p1,p2, ncol=2)
#define group=sex or color=sex inside aes() to create one density plot for the 2 groups (female-male) with 2 colors
heights %>% ggplot(aes(height, group =sex, color =sex)) + geom_density()
heights %>% ggplot(aes(height, group =sex, fill =sex)) +
geom_density(alpha =0.2)
#if we use fill=sex for the 2 groups female-male, the curves overlap each other; define alpha=0.2 to show both fills
library(tidyverse)
library(dplyr)
library(dslabs)
data("heights")
#summarize() computes summary statistics on data frames and returns a new data frame with the variable names defined in the function call & their values
s <- heights %>% filter(sex=="Male") %>% summarize(average=mean(height),
standard_deviation=sd(height))
s
class(s)
#remember summarize() is a dplyr function aware of the variable names stored in data frames, so the names can be used directly
#the resulting object stored in s is a data frame, and we can access its variables average & standard_deviation with $
s$average
s$standard_deviation
heights %>% filter(sex=="Male") %>% summarize(median=median(height),
minimum=min(height), maximum=max(height))
quantile(x,c(0,0.5,1))
#summarize or quantile give the same result, but quantile could not be used inside summarize, which expects functions that return a single value; see the sketch below
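#a possible workaround sketch: ask for each quantile as its own named column, so every call returns a single value
heights %>% filter(sex=="Male") %>% summarize(q0=quantile(height,0), q50=quantile(height,0.5), q100=quantile(height,1))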
murders <- murders %>% mutate(murder_rate = total/population*100000)
summarize(murders, mean(murder_rate))
mean(murders$murder_rate)
#the plain mean of murder_rate is not the real national average because there are big and small states; sum(total)/sum(population)*100000 gives the correct average
us_murder_rate <- murders %>% summarize(rate =
sum(total)/sum(population)*100000)
us_murder_rate
class(us_murder_rate)
sum(murders$total)/sum(murders$population)*100000
#if we know the formula, compute it directly as in the previous line: sum(total)/sum(population)*100000
#the $ accessor, the pull() function and the . dot placeholder are useful when we have a data frame and need to extract a numeric value to work with
us_murder_rate %>% .$rate
class(us_murder_rate %>% .$rate)
class(us_murder_rate)
#the class of us_murder_rate is numeric once we access .$rate variable
us_murder_rate$rate
class(us_murder_rate$rate)
us_murder_rate %>% pull(rate)
class(us_murder_rate %>% pull(rate))
#use the pipe %>% to get only the number, not the whole data frame, in one line of code
us_murder_rate <- murders %>% summarize(rate =
sum(total)/sum(population)*100000) %>% .$rate
us_murder_rate
class(us_murder_rate)
#we get the same numeric result using the pull() function; .$ and pull are equivalent in dplyr pipelines
us_murder_rate <- murders %>% summarize(rate =
sum(total)/sum(population)*100000) %>% pull(rate)
us_murder_rate
class(us_murder_rate)
#compute the median murder rate for the southern states without defining extra objects, using the pipe %>%
filter(murders, region=="South") %>% mutate(rate=total/population*10^5) %>% summarize(median=median(rate)) %>% pull(median)
library(dslabs)
library(tidyverse)
library(dplyr)
#first group_by() and then summarize() is a common operation in exploratory data analysis
#the group_by() function splits the data by one or more variables: group_by(variable name)
heights %>% group_by(sex)
heights %>% group_by(sex) %>% summarize(average=mean(height),
standard_deviation=sd(height))
#summarize applied after group_by makes a summary for each group (females and males separately): sex, average and standard_deviation in columns & two rows, female and male
murders %>% group_by(region) %>%
summarize(median_rate=median(murder_rate))
#to examine datasets the data often needs to be sorted; the dplyr arrange() function is more useful than the sort and order functions
#the arrange() function sorts a data frame by a given column, by default in ascending order (lowest to highest)
murders %>% arrange(population) %>% head()
#the code above sorts (arrange()) the whole murders data frame from lowest to highest population
murders %>% arrange(murder_rate) %>% head()
#the code above sorts (arrange()) the whole murders data frame from lowest to highest murder_rate
#the desc() function sorts a vector in descending order (highest to lowest); it can be used within arrange()
murders %>% arrange(desc(population)) %>% head()
murders %>% arrange(desc(murder_rate)) %>% head()
#arrange() with multiple levels orders 1st by the first argument, then within it by the 2nd argument, the 3rd and so on
#the code below sorts the dataset 1st by region and 2nd by murder rate
murders %>% arrange(region, murder_rate) %>% head()
murders %>% arrange(population, murder_rate) %>% head()
#the top_n() function shows the top results ranked by a given variable, NOT in order; combine it with arrange() to get the results in order
murders %>% top_n(10, murder_rate)
#the code above shows the top 10 highest murder rates
murders %>% arrange(desc(murder_rate)) %>% top_n(10)
#now the output is sorted (arrange()) in descending order, showing the top 10 highest murder rates
#top_n(data, nrows, variable): the function takes 1st the data frame, 2nd the number of rows, 3rd the variable to rank by
top_n(murders,10,murder_rate)
#the pipe %>% is very useful with dplyr functions, making many data analyses easy
library(dslabs)
library(tidyverse)
library(dplyr)
#Assessment section 3
#load the data from the survey of the United States National Center for Health Statistics (NHANES), from the dedicated NHANES package
install.packages("NHANES")
library(NHANES)
data("NHANES")
#remember how to remove NAs by defining na.rm=TRUE inside the function where the variable with NAs is used
data(na_example)
mean(na_example)
sd(na_example)
mean(na_example, na.rm=TRUE)
sd(na_example, na.rm=TRUE)
#na.rm=TRUE is a useful argument when a data frame has NAs and we need to apply functions to variables with NAs
head(NHANES)
names(NHANES)
levels(NHANES$Gender)
levels(NHANES$AgeDecade)
summary(NHANES$BPSysAve)
levels(NHANES$Race1)
#levels shows the groups inside a factor variable and summary shows the distribution of numeric variables such as blood pressure
#filter the data frame for 20-29 females: AgeDecade " 20-29", Gender "female"
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female")
#what are the average and sd of the systolic blood pressure, stored in BPSysAve, for 20-29 females?
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female") %>%
summarize(mean(BPSysAve, na.rm=TRUE), sd(BPSysAve, na.rm=TRUE))
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female") %>%
summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE))
#without NAs, the systolic blood pressure for 20-29 females is 108 (AVG) and 10.1 (SD)
library(dplyr)
library(tidyverse)
library(NHANES)
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female") %>% summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE)) %>% pull(AVG)
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female") %>% summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE)) %>% .$AVG
#pull(AVG) or .$AVG gets the same result: output just the average for 20-29 females
#report the min and max values for the same group
NHANES %>% filter(AgeDecade==" 20-29"&Gender=="female") %>%
summarize(min=min(BPSysAve, na.rm=TRUE), max=max(BPSysAve, na.rm=TRUE))
NHANES %>% filter(Gender=="female") %>% group_by(AgeDecade) %>%
summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE))
#the code above shows blood pressure mean & sd grouped by all AgeDecades, still filtering Gender == "female"
NHANES %>% group_by(AgeDecade,Gender) %>% summarize(AVG=mean(BPSysAve,
na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE))
#the code above groups by AgeDecade & Gender: now we have all the ages for both males and females; group_by() allows splitting by two variables
NHANES %>% filter(Gender=="male") %>% group_by(AgeDecade) %>%
summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE))
#the code above shows, for males (using filter), blood pressure mean & sd grouped by all AgeDecades
#group by race and obtain the average systolic blood pressure for males aged 40-49
NHANES %>% group_by(Race1) %>% filter(AgeDecade==" 40-49"&Gender=="male") %>% summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE)) %>% arrange(AVG)
NHANES %>% group_by(Race1) %>% filter(AgeDecade==" 40-49"&Gender=="male") %>% summarize(AVG=mean(BPSysAve, na.rm=TRUE), SD=sd(BPSysAve, na.rm=TRUE)) %>% arrange(desc(AVG))
#manage data & data visualization to dispel common myths about sensationalized world topics
#Gapminder foundation by Hans Rosling
library(tidyverse)
library(dslabs)
data(gapminder)
head(gapminder)
names(gapminder)
#which countries had the highest child mortality rates in 2015? two kinds of answers: by intuition or by exploratory data analysis
#compare Sri Lanka vs Turkey infant mortality in 2015
gapminder %>% filter(year==2015 & country %in%c("Sri Lanka","Turkey")) %>% select(country, infant_mortality)
#answer: Sri Lanka 8.4 has a lower infant mortality than Turkey 11.6
#compare Poland vs south Korea infant mortality in 2015
gapminder %>% filter(year==2015 & country %in%c("Poland","South Korea")) %>% select(country, infant_mortality)
#answer: South Korea 2.9 has a lower infant mortality than Poland 4.5
#compare Malaysia vs Russia infant mortality in 2015
gapminder %>% filter(year==2015 & country %in%c("Malaysia","Russia")) %>%
select(country, infant_mortality)
#answer: Malaysia 6.0 has a lower infant mortality than Russia 8.2
#compare Thailand vs south Africa infant mortality in 2015
gapminder %>% filter(year==2015 & country %in%c("Thailand","South Africa")) %>% select(country, infant_mortality)
#answer: Thailand 10.5 has a lower infant mortality than South Africa 33.6
library(tidyverse)
library(dslabs)
library(dplyr)
data(gapminder)
#what is the relationship between life expectancy and fertility (the number of children) in each continent? answer by intuition and by exploratory data analysis
#create a scatterplot of the relationship fertility ~ life_expectancy in 1962
ds_theme_set()
filter(gapminder,year==1962) %>% ggplot(aes(x=fertility,
y=life_expectancy)) + geom_point()
gapminder %>% filter(year==1962) %>% ggplot(aes(x=fertility,
y=life_expectancy)) + geom_point()
#from the plot we see that the points fall into two distinct categories for the year 1962
#life_expectancy around 70 years with fertility of 3 or fewer children, and life_expectancy lower than 65 years with fertility of 5 or more children
#add color=continent within aes(); it automatically assigns a color to each continent, so we see the scatterplot by continent
gapminder %>% filter(year==1962) %>% ggplot(aes(x=fertility,
y=life_expectancy, color=continent)) + geom_point()
#indeed, in 1962 Europe and North America had higher life expectancy and lower fertility than the developing regions of America, Asia and Africa
library(gridExtra)
p1 <- gapminder %>% filter(year==1962) %>% ggplot(aes(x=fertility, y=life_expectancy, color=continent)) + geom_point() + ggtitle("1962 plot")
p2 <- gapminder %>% filter(year==2012) %>% ggplot(aes(x=fertility, y=life_expectancy, color=continent)) + geom_point() + ggtitle("2012 plot")
grid.arrange(p1,p2)
#grid.arrange() is one option to see both plots and compare them; the faceting functions are another good option
#faceting makes multiple side-by-side plots stratified by one variable, facet_wrap(), or two or more variables, facet_grid()
#the facet_grid() layer separates the plots by up to two variables with ~: facet_grid(variable1 on the rows ~ variable2 on the columns)
filter(gapminder, year %in% c(1962,2012)) %>% ggplot(aes(x=fertility,
y=life_expectancy, color=continent)) + geom_point() +
facet_grid(continent~year)
gapminder %>% filter(year %in% c(1962,2012)) %>% ggplot(aes(x=fertility,
y=life_expectancy, color=continent)) + geom_point() +
facet_grid(continent~year)
#the data has been stratified on the right by rows (continent, variable1) & on the top by columns (year, variable2)
#the dot operator . stands for no variable; use it inside facet_grid(.~variable2) to separate by only one variable (in columns)
gapminder %>% filter(year %in% c(1962,2012)) %>% ggplot(aes(x=fertility,
y=life_expectancy, color=continent)) + geom_point() + facet_grid(.~year)
#now we get a plot split by year only, and the continents are shown only by color
#the plot shows that the developing continents (America, Asia, Africa) have moved toward the western countries (Europe, North America)
#the facet_wrap() layer separates the plots using all the screen space (columns & rows), better than facet_grid() which here splits only by rows
#compare Europe against Asia through the years 1962, 1970, 1980, 1990, 2000 and 2012
years <- c(1962,1970,1980,1990,2000,2012)
continents <- c("Europe","Asia")
gapminder %>% filter(year %in% years & continent %in% continents) %>%
ggplot(aes(x=fertility, y=life_expectancy, color=continent)) +
geom_point() + facet_wrap(~year)
gapminder %>% filter(year %in% years & continent %in% continents) %>%
ggplot(aes(x=fertility, y=life_expectancy, color=continent)) +
geom_point() + facet_wrap(.~year)
#facet_wrap() creates a tidy screen space to compare plots; remember to use ~variablename or .~variablename within the function
#the plot shows that Asian countries have made great improvements through the years, catching up with Europe
#the facet_ functions fix the scales across panels for better comparisons; with grid.arrange() the scales are not the same, because each plot defaults to its own data range (as if scales were free); see the sketch below
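#a sketch of the difference using the scales= argument of facet_grid (set to "free" here on purpose):
gapminder %>% filter(year %in% c(1962,2012)) %>% ggplot(aes(fertility, life_expectancy, color=continent)) + geom_point() + facet_grid(.~year, scales="free") #each panel now uses its own range, harder to compare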
#time series plots have time on the x axis and a variable (measurement) of interest on the y axis
gapminder %>% filter(country=="United States") %>%
ggplot(aes(year,fertility)) + geom_point()
gapminder %>% filter(country=="United States") %>%
ggplot(aes(year,fertility)) + geom_line()
#use geom_line() instead of geom_point() when the points are regularly spaced & densely packed; useful to compare two series of the same variable
#compare the fertility between countries, for example South Korea in Asia with Germany in Europe
gapminder %>% filter(country %in% c("South Korea","Germany")) %>%
ggplot(aes(year,fertility,group=country)) + geom_line()
#filter the country variable of the dataset by the two selected countries and assign the group argument within the aes() function by country
#by adding the color argument within aes(color=country), the data is automatically grouped by country
gapminder %>% filter(country %in% c("South Korea","Germany")) %>%
ggplot(aes(year,fertility,color=country)) + geom_line()
#labels are usually preferred over legends in most plots; although legends are the default, labelling is visually better
#store the position of the labels in an object "labels", defining x = position along the x axis and y = position along the y axis, then add the labels within geom_text()
labels <- data.frame(country=c("South Korea","Germany"), x =c(1975,1965), y =c(60,72))
labels
#in the labels object we set the coordinates where we want the labels to be placed later on the graph via geom_text()
gapminder %>% filter(country %in% c("South Korea","Germany")) %>%
ggplot(aes(year,life_expectancy,color=country)) + geom_line() +
geom_text(data=labels,aes(x,y,label=country), size=5) +
theme(legend.position ="none")
#South Korea improved its life expectancy, catching up with Germany
gapminder %>% filter(country %in% c("South Korea","Germany")) %>%
ggplot(aes(year, life_expectancy, color=country)) + geom_line() +
geom_label(data=labels, aes(x,y,label=country)) + theme(legend.position =
"none")
names(gapminder)
str(gapminder$gdp)
#the gdp (gross domestic product) variable is the market value of the products & services generated by a country in a year
#gdp per person estimates how rich a country is; dividing it by the 365 days in a year gives dollars per day
#gdp/population/365 = dollars per day
#add the dollars per day variable to the gapminder data frame with the mutate function
gapminder <- gapminder %>% mutate(dollars_per_day=gdp/population/365)
names(gapminder)
#gdp values are adjusted for inflation and represent current US dollars, so they can be compared across the years
#what are the levels of poverty by country? dollars_per_day = gdp/population/365 is a good measure to compare countries
past_year <- 1970
gapminder %>% filter(year==past_year & !is.na(gdp)) %>%
ggplot(aes(dollars_per_day)) + geom_histogram(binwidth = 1, color =
"black")
#remove NA's using data %>% filter(!is.na(variablename))
gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black")
#the histogram quickly shows that most of the country averages are below 10 dollars per day
#dollars/day country averages: 1 dollar/day extremely poor, 2 dollars/day very poor, 4 dollars/day poor, 8 dollars/day middle, 16 dollars/day well off, 32 dollars/day rich, 64 dollars/day very rich; a bucketing sketch follows below
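#a possible sketch bucketing countries into those levels with cut() (income_level is a name introduced here):
gapminder %>% filter(year==1970 & !is.na(gdp)) %>% mutate(income_level=cut(dollars_per_day, breaks=c(0,1,2,4,8,16,32,64,Inf))) %>% count(income_level)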
#log transformations convert multiplicative changes into additive changes
log2(2)
log10(2)
log2(4)
log10(4)
#log2 means that every time a value doubles (x2), the log transformation increases by 1: log2(2)=1, log2(4)=2
#on the log10 scale, doubling a value adds about 0.3, log10(2)=0.3, log10(4)=0.6, while multiplying by 10 adds 1
#transform the data from the previous histogram to log base 2
gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
ggplot(aes(log2(dollars_per_day))) + geom_histogram(binwidth=1,
color="black")
#the modes (bumps) of a distribution are the values with the highest frequency
#the mode of a normal distribution equals the mean (average)
#NOT normal distributions can have multiple modes (local modes); the bimodality here is consistent with high frequencies for both lower and higher dollars/day countries
#using the natural log (base e) to scale the data is not recommended; log2 is easier to interpret than log10 for these data
#log base 2 works well with moderate values, log base 10 works well with large numbers
gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
ggplot(aes(log10(population))) + geom_histogram(binwidth=1,
color="black")
#population stores large numbers, so scaling by log10 makes more sense
gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
ggplot(aes(log2(population))) + geom_histogram(binwidth=1, color="black")
#transform plots to log in two ways: log the data values before plotting, or log the x-y axis scales of the plot
#logged values make the logged quantity easy to read; the advantage of log scales (axis layers) is that the original values stay on the axis
gapminder %>% filter(year==1970 & !is.na(gdp)) %>% ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black") + scale_x_continuous(trans="log2")
#scale_x_continuous(trans="log2") and scale_x_log10() change the x (or y) axis of the plot and preserve the original values on the axis
library(tidyverse)
library(dslabs)
library(dplyr)
#to see the dollars/day distribution by region, histograms or smooth densities are not useful because of the large number of regions: there are 22 levels
length(levels(gapminder$region))
#boxplots next to each other allow comparing data by region, with some important adjustments to visualize the data in the best way
p <- gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
ggplot(aes(region, dollars_per_day))
p + geom_boxplot()
#rotate the region labels on the x axis by adding a theme() layer, which changes the non-data components of the plot
#theme(axis.text.x = element_text(angle= argument to rotate the labels on the x axis, hjust= horizontal justification argument))
p + geom_boxplot() + theme(axis.text.x = element_text(angle=90,hjust=1))
#reorder() reorders the levels of a variable (1st argument, treated as categorical) based on the values of a second variable (2nd argument), which usually is numeric
#reorder() changes the order of the levels of a factor variable based on a summary computed on a numeric vector
fac <- factor(c("Asia","Asia","West","West","West"))
levels(fac)
#by default the levels of the factor are in alphabetical order; reorder() orders the levels of the fac variable by the value variable from lowest to highest
value <- c(10,11,12,6,4)
fac
value
names(value) <- fac
fac
value
fac <- reorder(fac, value, FUN=mean)
levels(fac)
#the fac variable (1st argument) of the example is now reordered by the mean of value (2nd argument)
#reorder the regions in the gapminder dataset by the median() income level (dollars_per_day)
p <- gapminder %>% filter(year==1970 & !is.na(gdp)) %>% mutate(region=
reorder(region,dollars_per_day, FUN=median)) %>% ggplot(aes(region,
dollars_per_day, fill=continent)) + geom_boxplot() + theme(axis.text.x =
element_text(angle=90,hjust=1)) + xlab("Region")
p
#mutate() with reorder() orders the regions by the dollars/day median; fill= inside aes() colors the boxes automatically; theme(element_text()) rotates the x labels; xlab("") with empty quotes erases the default region label
p <- gapminder %>% filter(year==1970 & !is.na(gdp)) %>% mutate(region=
reorder(region,dollars_per_day, FUN=median)) %>% ggplot(aes(region,
dollars_per_day, fill=continent)) + geom_boxplot() + theme(axis.text.x =
element_text(angle=90,hjust=1)) + xlab("")
p
reorder(gapminder$region,gapminder$dollars_per_day, FUN=mean)
#the boxplots are now ordered by the median income (dollars_per_day) by region, and each continent has its own color via fill=continent within aes()
#change the y axis of the boxplot to log2 with scale_y_continuous(trans="log2")
p + scale_y_continuous(trans="log2")
#the log2 scale lets us see the differences between regions; the boxes look bigger now and we can compare with ease
#add the data points on top of the boxplot only when the graph stays clear; use the geom_point(show.legend=FALSE) layer
p + scale_y_continuous(trans="log2") + geom_point()
p + scale_y_continuous(trans="log2") + geom_point(show.legend = FALSE)
p + scale_y_continuous(trans="log2") + geom_point(size=0.5)
p + scale_y_continuous(trans="log2") + geom_point(size=1)
#change the size of the points defining args inside geom_point() layer
names(gapminder)
levels(gapminder$region)
length(levels(gapminder$region))
#define the west regions in the vector west first, since the next plot uses it to group countries
west <- c("Western Europe","Northern Europe","Southern Europe",
          "Northern America","Australia and New Zealand")
#check the bimodality of the dataset by adding a group variable with mutate, then plot both histograms using facet_grid by one variable (group)
gapminder %>% filter(year==1970 & !is.na(gdp)) %>%
  mutate(group=ifelse(region%in%west,"West","Developing")) %>%
  ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black") +
  scale_x_continuous(trans="log2") + facet_grid(.~group)
#the histogram confirms the bimodality: countries in the west have higher incomes than developing countries
#compare the differences in the distribution of the western world across past and present years
past_year <- 1970
present_year <- 2010
gapminder %>% filter(year %in% c(1970,2010) & !is.na(gdp)) %>%
  mutate(group=ifelse(region%in%west,"West","Developing")) %>%
  ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black") +
  scale_x_continuous(trans="log2") + facet_grid(year~group)
#the code shows income (dollars_per_day) filter()ed by two years, past & present, and the graph is rendered with facet_grid() by year and group so the screen space looks clear
gapminder %>% filter(year %in% c(past_year,present_year) & !is.na(gdp)) %>%
  mutate(group=ifelse(region%in%west,"West","Developing")) %>%
  ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black") +
  scale_x_continuous(trans="log2") + facet_grid(year~group)
#some countries were only defined after 1970 & data availability is higher now; that's why there are more countries in 2010, and this could distort the plot
#we are going to use only the countries with data available for both years, via the intersect() function
country_list_1 <- gapminder %>%
  filter(year==1970 & !is.na(dollars_per_day)) %>% pull(country)
country_list_2 <- gapminder %>%
  filter(year==2010 & !is.na(dollars_per_day)) %>% pull(country)
#define country lists for both years, 1970 & 2010; remember pull() is equivalent to .$ for extracting the countries
#then define a vector country_list with intersect(clist_1, clist_2) containing the data available for both years, to remake the plot
#intersect() finds the overlap between two vectors
country_list <- intersect(country_list_1,country_list_2)
length(country_list)
length(country_list_1)
length(country_list_2)
#length() shows the data availability for dollars_per_day: 113 countries in 1970, 176 countries in 2010, and 108 countries with data available for both years
gapminder %>% filter(year %in% c(1970,2010) & country %in% country_list) %>%
  mutate(group=ifelse(region%in%west,"West","Developing")) %>%
  ggplot(aes(dollars_per_day)) + geom_histogram(binwidth=1, color="black") +
  scale_x_continuous(trans="log2") + facet_grid(year~group)
#the previous code remakes the plot using only the countries that are %in% country_list, i.e. with data available for both years
#the developing group improves in dollars_per_day, and so does the west group
#the boxplots of median income (dollars_per_day) by region can be compared across the years 1970-2010
p <- gapminder %>% filter(year%in%c(1970,2010) & country%in%country_list) %>%
  mutate(region = reorder(region, dollars_per_day, FUN=median)) %>%
  ggplot(aes(region, dollars_per_day, fill=continent)) + geom_boxplot() +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + xlab("") +
  scale_y_continuous(trans="log2") + geom_point()
p
p + facet_grid(.~year)
p + facet_grid(year~.)
#define p filtering by years and country %in% country_list, and use facet_grid(rows~columns) with year on the rows
#comparing boxplots stacked one over the other is not useful; the comparison is easy when one boxplot sits next to the other
#remember fill= inside aes() splits by color; we want to split by year, so 1970 & 2010 need to be a categorical-factor variable
#ggplot automatically puts the boxplots next to each other and assigns a color to each factor: aes(fill = factor-categorical)
p <- gapminder %>% filter(year%in%c(1970,2010) & country%in%country_list) %>%
  mutate(region = reorder(region, dollars_per_day, FUN=median)) %>%
  ggplot(aes(region, dollars_per_day, fill=factor(year))) + geom_boxplot() +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + xlab("") +
  scale_y_continuous(trans="log2")
p
#remember to take out facet_grid to cancel the split of the graph and allow the split by year within the same plot
#smooth density plots convey the shape of the gapminder data more clearly than histograms
#check the bimodality using a smooth density instead of a histogram, and see easily how the two modes get closer in 2010
gapminder %>% filter(year %in% c(1970,2010) & country %in% country_list) %>%
  ggplot(aes(dollars_per_day)) + geom_density(fill="grey") +
  scale_x_continuous(trans="log2") + facet_grid(year~.)
#what is the reason for the change? did poor countries become rich, or did rich countries become poor?
#how many countries are in each group, west vs developing?
gapminder %>% filter(year%in%c(1970,2010) & country%in%country_list) %>%
  mutate(group= ifelse(region%in%west, "West","Developing")) %>%
  group_by(group) %>% summarize(n=n())
gapminder %>% filter(year%in%c(1970,2010) & country%in%country_list) %>%
  mutate(group= ifelse(region%in%west, "West","Developing")) %>%
  group_by(group) %>% summarize(len=length(group))
#the number of countries in the west group is different from the number of countries in the developing group
#if the groups are not the same size, each density is still scaled to total area 1, so the groups look the same size, which is a big error
#accessing variables computed by geom_density can solve this density-scale mistake
#the areas of the density plots should be proportional to the sizes of the groups
#count is a computed variable from geom_density that multiplies the y-axis density values by the size of the group, so the areas become proportional
#count = density * number of points
#surround the variable name with double dots, ..name_variable.., to access computed variables in ggplot
#redefine aes(x = dollars_per_day, y = ..count..), putting the group-proportional count on the y axis
p <- gapminder %>% filter(year%in%c(1970,2010) & country%in%country_list) %>%
  mutate(group= ifelse(region%in%west, "West","Developing")) %>%
  ggplot(aes(x=dollars_per_day, y=..count.., fill=group)) +
  scale_x_continuous(trans="log2")
p + geom_density(alpha=0.2) + facet_grid(year~.)
#in this density plot the y-axis scales are now proportional for both groups even though they have different sizes; there are no more scale errors in this plot
#remember to define y = ..count.. inside aes(), the computed count variable between double dots
#we can also set the bw=0.75 argument within geom_density to get smoother densities
p + geom_density(alpha=0.2, bw=0.75) + facet_grid(year~.)
#the graph clearly shows that the developing world is moving to the right: incomes (dollars_per_day) grew between 1970 & 2010
#how do the income (dollars_per_day) changes look across regions?
levels(gapminder$region)
#case_when() defines a factor-categorical variable whose levels are set by logical operations, used here to group the data
install.packages("ggridges")
library(ggridges)
#note: the first ridge plot below needs the group variable, which is added to gapminder with case_when further down; run that mutate first
gapminder %>% filter(year%in%c(1970,2010) & !is.na(dollars_per_day)) %>%
  ggplot(aes(x=dollars_per_day, y=group)) +
  scale_x_continuous(trans="log2") + geom_density_ridges(adjust=1.5) +
  facet_grid(.~year)
gapminder %>% filter(year%in%c(1970,2010) & !is.na(dollars_per_day)) %>%
  ggplot(aes(x=dollars_per_day, y=region)) +
  scale_x_continuous(trans="log2") + geom_density_ridges(adjust=1.5) +
  facet_grid(.~year)
gapminder %>% filter(year%in%c(1970,2010) & !is.na(dollars_per_day)) %>%
  ggplot(aes(x=dollars_per_day, y=continent)) +
  scale_x_continuous(trans="log2") + geom_density_ridges(adjust=1.5) +
  facet_grid(.~year)
#plot stacked densities using position="stack" in geom_density (see the sketch after the case_when redefinition below)
gapminder %>% mutate(group = case_when(
  .$region %in% west ~ "West",
  .$region %in% c("Eastern Asia","South-Eastern Asia") ~ "East Asia",
  .$region %in% c("Caribbean","Central America","South America") ~ "Latin America",
  .$continent == "Africa" & .$region != "Northern Africa" ~ "Sub-Saharan Africa",
  TRUE ~ "Others"))
#assign groups depending on region; case_when sets those groups and the names they will take
#mutate(group-name variable = case_when(.$region to access variables %in% west ~ "West")); the right side of the ~ operator is the value assigned when the left-side condition is TRUE
gapminder <- gapminder %>% mutate(group = case_when(
  .$region %in% west ~ "West",
  .$region %in% c("Eastern Asia","South-Eastern Asia") ~ "East Asia",
  .$region %in% c("Caribbean","Central America","South America") ~ "Latin America",
  .$continent == "Africa" & .$region != "Northern Africa" ~ "Sub-Saharan Africa",
  TRUE ~ "Others"))
gapminder %>% filter(year%in%c(1970,2010) & !is.na(dollars_per_day)) %>%
  mutate(group = case_when(
    .$region %in% west ~ "West",
    .$region %in% c("Eastern Asia","South-Eastern Asia") ~ "East Asia",
    .$region %in% c("Caribbean","Central America","South America") ~ "Latin America",
    .$continent == "Africa" & .$region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others")) %>%
  ggplot(aes(x=dollars_per_day, color=group)) +
  scale_x_continuous(trans="log2") +
  geom_density(alpha=0.2, bw=0.75) + facet_grid(year~.)
gapminder %>% filter(year%in%c(1970,2010) & !is.na(dollars_per_day)) %>%
  mutate(group = case_when(
    .$region %in% west ~ "West",
    .$region %in% c("Eastern Asia","South-Eastern Asia") ~ "East Asia",
    .$region %in% c("Caribbean","Central America","South America") ~ "Latin America",
    .$continent == "Africa" & .$region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others")) %>%
  ggplot(aes(x=dollars_per_day, fill=group)) +
  scale_x_continuous(trans="log2") +
  geom_density(alpha=0.2, bw=0.75) + facet_grid(year~.)
#remember to change y=group to color=group or fill=group to get the correct lines, and facet_grid with year on the rows
#what is the relation between a country's child survival and average income, by region? case_when is useful to divide the data into groups and add them to the dataset
gapminder <- gapminder %>% mutate(group = case_when(
  .$region %in% west ~ "West",
  .$region %in% c("Eastern Asia","South-Eastern Asia") ~ "East Asia",
  .$region %in% c("Caribbean","Central America","South America") ~ "Latin America",
  .$continent == "Africa" & .$region != "Northern Africa" ~ "Sub-Saharan Africa",
  .$region %in% "Northern Africa" ~ "Northern Africa",
  .$region %in% c("Melanesia","Micronesia","Polynesia") ~ "Pacific Islands"))
#redefine the gapminder dataset adding two more groups
#store in surv_income the average income (gdp) and the infant survival rate derived from infant_mortality; remember to filter by 2010 and remove NAs from gdp, infant_mortality and group
surv_income <- gapminder %>%
  filter(year %in% 2010 & !is.na(gdp) & !is.na(infant_mortality) & !is.na(group)) %>%
  group_by(group) %>%
  summarize(income = sum(gdp)/sum(population)/365,
            infant_survival_rate = 1 - sum(infant_mortality/1000*population/sum(population)))
surv_income
#the limits= argument changes the range of an axis; use it inside the x & y scale layers: scale_x_continuous(limits=c(0.25,150))
#logistic transformation, or logit, for a proportion or rate p: f(p) = log(p/(1-p)), the same as f(p) = log(odds), where the odds are p/(1-p), e.g. the odds of a child surviving
#the logit (logistic transformation) scale and odds are useful to compare small differences near 0 and 1
#an acceptable survival rate is no less than 98% (0.98); a survival rate of 0.90 is not acceptable
#log(odds) turns these changes into constant increases
#data visualization follows some principles to create effective figures and tables adapted to the audience
#encoding data principles: position, aligned lengths, angles, area, brightness, color hue
#position & length are preferred to display quantities, followed by angles, which are preferred over area; brightness & color are hard to quantify but can be useful
#humans are not good at visually quantifying angles and areas; pie charts represent angles & area, donut charts represent only area, and neither is recommended
#use a barplot instead of a pie or donut chart, or just a few categories with percentage labels
#humans are good at visually quantifying linear measures; barplots represent the position and length visual cues well
#barplots encode data with length, so the scale always has to start at 0 to avoid under- & overestimations
#scatterplots and boxplots encode data with position, so it is not necessary to start the scale at 0; adjusting the scale to the graph is a visual possibility
#when mapping a quantity to point size, ggplot defaults to scaling the area rather than the radius, so the plots encode the correct quantities
#encoded visual cues must be proportional to the quantity
#ordering by a meaningful value is better than the default alphabetical order: reorder(factor-categorical, value, FUN=mean); factor() to change numerics into factors-categoricals
library(tidyverse)
library(dslabs)
library(dplyr)
library(gridExtra)
data(murders)
murders %>% mutate(murder_rate = total/population*100000) %>%
  mutate(state = reorder(state, murder_rate)) %>%
  ggplot(aes(state, murder_rate)) +
  geom_bar(stat="identity") + coord_flip() +
  theme(axis.text.y = element_text(size=6)) + xlab("")
#the previous code shows the states ordered by murder_rate from highest to lowest, no longer alphabetical; now the graph is meaningful
#previously we displayed the quantities in the data; now let's show the data itself, focusing on comparing the male and female groups
#the standard error IS NOT the same as the standard deviation
heights %>% ggplot(aes(x=sex, y=height)) + geom_point()
#geom_point allows a better comparison between the female and male groups than a barplot, where we can't see how each point-observation behaves
#two ways to improve a plot that shows all the points: use a geom_jitter() layer, or add the alpha=0.2 argument to geom_point
#the geom_jitter() layer adds a small random shift to each point
#alpha blending, the alpha= argument, makes the points translucent so they don't hide each other when they overlap
#if there are many points, showing distributions is more convenient than showing the points; distribution lines can be contrasted
heights %>% ggplot(aes(x=sex, y=height)) + geom_jitter(alpha=0.2, width=0.1)
#keep the same axes when comparing data across plots to avoid interpretation mistakes
heights %>% ggplot(aes(height)) +
  geom_histogram(binwidth = 1, color="black") + facet_grid(.~sex)
#align plots vertically to see horizontal changes & align plots horizontally to see vertical changes
#for the heights data, two histograms aligned vertically will show horizontal (left-right) changes; two boxplots aligned horizontally, next to each other, will show vertical (up-down) changes
heights %>% ggplot(aes(x=sex, y=height)) + geom_boxplot()
heights %>% ggplot(aes(x=sex, y=height)) + geom_boxplot() +
geom_jitter(alpha=0.2, width=0.05)
#barplots are useful to show only one number; they are not good for showing and comparing distributions
#combining barplots with a log transformation is especially distorting, because bar length no longer encodes the quantity correctly
#boxplots are much more informative than barplots, especially if we have many values
#log transformation: useful for data with multiplicative changes
#logit transformation: useful for fold changes in odds
#sqrt transformation: useful for count data
#visual cues to be compared should be adjacent, next to each other
#http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/ resources to select colorblind-friendly colors
#to the plot stored in p, add scale_color_manual(values = the vector of color codes for the palette)
#remember to redefine the region variable within mutate to store the reordered object; otherwise we are just changing rate and not the variable where the reorder() function works
data("murders")
murders %>% mutate(rate = total/population*100000,
                   region = reorder(region, rate, FUN=mean)) %>%
  ggplot(aes(region, rate)) +
  geom_boxplot() + geom_point()
#mutate(name_var1 = compute code, name_var2 = reorder(var1, var2, FUN=))
#compare two variables with scatterplots, geom_point(); compare the same type of variable at different time points, with relatively few comparisons, using slope charts, geom_line()
west
#slope charts give an idea of the changes based on the slopes of the lines; angles are the visual encoding, along with the positions of the points
#for a large number of observations, the Bland-Altman plot shows the difference between conditions on the y-axis and the mean of the conditions on the x-axis, with a horizontal reference line dividing the screen space
dat <- gapminder %>%
  filter(year%in% c(2010,2015) & region%in%west &
           !is.na(life_expectancy) & population > 10^7)
dat %>% mutate(location = ifelse(year==2010, 1, 2),
               location = ifelse(year==2015 & country%in% c("United Kingdom","Portugal"),
                                 location + 0.22, location),
               hjust = ifelse(year==2010, 1, 0)) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(year, life_expectancy, group=country)) +
  geom_line(aes(color=country)) +
  geom_text(aes(x=location, label=country, hjust=hjust))
dat %>% ggplot(aes(x=year, y=life_expectancy, label=country,
                   group=country, color=country)) +
  geom_line(show.legend = FALSE) + geom_text(show.legend = FALSE)
#encode a 3rd variable on the graph using color hue or the shape= argument for categorical variables; continuous variables can use color intensity or size
#effective communication of data is a strong antidote to misinformation and fear
#the impact of vaccines on battling infectious diseases: data for 7 diseases from 1928 to 2011 in the 50 states
data("us_contagious_diseases")
str(us_contagious_diseases)
names(us_contagious_diseases)
#define the object dat containing only measles, the rate per 10,000 inhabitants, states ordered by the average disease rate, and with Alaska and Hawaii removed because they weren't states for the whole period
dat <- us_contagious_diseases %>%
  filter(!state%in% c("Alaska","Hawaii") & disease=="Measles") %>%
  mutate(rate = count/population*10000*52/weeks_reporting) %>%
  mutate(state = reorder(state, rate))
#the count variable stores the totals; *10000 gives the rate per 10,000 inhabitants; *52/weeks_reporting adjusts to a full year of reporting
#plot the measles data for California: cases per 10,000 inhabitants on the y axis by year on the x axis
dat %>% filter(state=="California" & !is.na(rate)) %>%
  ggplot(aes(x=year, y=rate)) + geom_line() +
  ylab("Cases per 10,000") +
  geom_vline(xintercept = 1963, color="blue")
#the geom_vline() layer adds reference lines to a plot: horizontal, vertical, or diagonal, specified by slope and the x/y intercept= arguments
#geom_vline(xintercept=1963, color="blue") marks 1963 on the x-axis because that is the year the vaccine was introduced
#3 variables to show: year on the x axis, state on the y axis, and rate as color hue to represent the continuous variable
#RColorBrewer package, install and load it, to choose color palettes between sequential and diverging options
#sequential palettes are suited for data that ranges from high values to low values
#diverging palettes are suited for data representing values that diverge from a center; values higher or lower than the center are treated equally
install.packages("RColorBrewer")
library(RColorBrewer)
display.brewer.all(type ="seq")
display.brewer.all(type ="div")
#display.brewer.all() to see all the color palettes, type seq or div
#geom_tile() layer tiles the plot, covering it with colors that represent the disease rate by region
#we have count data, so the sqrt transformation, trans="sqrt", is appropriate to keep a few very high counts from dominating
dat %>% ggplot(aes(x=year, y=state, fill=rate)) +
  geom_tile(color="grey50") + ylab("") + xlab("") +
  geom_vline(xintercept = 1963, color="blue")
dat %>% ggplot(aes(x=year, y=state, fill=rate)) +
  geom_tile(color="grey50") +
  scale_x_continuous(expand = c(0,0)) +
  scale_fill_gradientn(colors = RColorBrewer::brewer.pal(9,"Reds"), trans="sqrt") +
  ylab("") + xlab("") +
  geom_vline(xintercept = 1963, color="blue") +
  theme_minimal() + theme(panel.grid = element_blank()) +
  ggtitle("Measles Disease")
#scale_fill_gradientn() contains RColorBrewer::brewer.pal(9,"Reds")
#position and length are better cues than color, so show the values with position: compute the average and show it with a line plot of measles by year, state and rate
AVG <- us_contagious_diseases %>% filter(disease=="Measles") %>%
  group_by(year) %>%
  summarize(us_rate = sum(count, na.rm=TRUE)/sum(population, na.rm=TRUE)*10000)
AVG
dat <- us_contagious_diseases %>%
  filter(!state%in% c("Alaska","Hawaii") & disease=="Measles") %>%
  mutate(rate = count/population*10000*52/weeks_reporting) %>%
  mutate(state = reorder(state, rate))
dat %>% filter(!is.na(rate)) %>%
  ggplot(aes(x=year, y=rate, group=state)) +
  geom_line(alpha=0.2, size=1, show.legend=FALSE, color="grey50") +
  geom_vline(xintercept = 1963, color="blue") +
  xlab("") + ylab("") + ggtitle("Cases per 10,000 by state")
#use regular 2-dimensional plots; avoid pseudo-three-dimensional plots, since the 3rd dimension doesn't represent any quantity
#avoid using too many significant digits in tables; by default R shows 7 significant digits, and 2 significant digits are enough to see how values behave
#options(digits=n), round(x, digits=n), signif(x, digits=n): three ways to change the number of digits
#Assessment 5.3
#create a tile plot of smallpox cases per 10,000 population, excluding Alaska and Hawaii
data("us_contagious_diseases")
dat <- us_contagious_diseases %>%
  filter(!state%in%c("Hawaii","Alaska") & disease =="Smallpox") %>%
  mutate(rate = count / population * 10000) %>%
  mutate(state = reorder(state, rate))
dat %>% filter(!weeks_reporting<10) %>%
  ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand=c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  theme_minimal() + theme(panel.grid = element_blank()) +
  ggtitle("Smallpox") + ylab("") + xlab("")
names(dat)
#dat has the state variable changed with mutate() and reorder() so states are ordered by rate values, and a rate variable added with mutate(): count (total) / population per 10,000
#scale_fill_gradientn() creates an n-colour gradient; gradient without the final n creates a two-colour sequential gradient, and gradient2 a diverging colour gradient
#filter to 10 or more weeks_reporting; fill=rate colors the tiles by rate; brewer.pal makes the ColorBrewer palettes available
#create a time series plot of smallpox, excluding cases with fewer than 10 weeks reporting; compute the average and show it through the years
dat <- us_contagious_diseases %>%
  filter(!state%in% c("Hawaii","Alaska") & disease =="Smallpox") %>%
  mutate(rate = count / population * 10000) %>%
  mutate(state = reorder(state, rate))
avg <- us_contagious_diseases %>%
  filter(!weeks_reporting<10 & disease=="Smallpox") %>%
  group_by(year) %>%
  summarize(us_rate = sum(count, na.rm=TRUE)/sum(population, na.rm=TRUE)*10000)
dat %>% filter(!weeks_reporting<10) %>%
  ggplot() +
  geom_line(aes(year, rate, group = state),
            color = "grey50", show.legend = FALSE, alpha = 0.2, size = 1) +
  geom_line(mapping = aes(year, us_rate), data = avg, size = 1, color = "black") +
  scale_y_continuous(trans = "sqrt", breaks = c(5,25,125,300)) +
  ggtitle("Cases per 10,000 by state") + xlab("") + ylab("") +
  geom_text(data = data.frame(x=1955, y=50),
            mapping = aes(x, y, label="US average"), color="black") +
  geom_vline(xintercept=1963, col = "blue")
avg
#avg stores the average us_rate per 10,000 inhabitants, previously filtered by disease and weeks and grouped by year
#use geom_line twice because it's like two plots in one: light grey lines of rate by year grouped by state, & the black line for the average US rate
#make a time series plot for California showing the rates of all diseases with 10 or more weeks reporting
us_contagious_diseases %>%
  filter(state=="California" & !weeks_reporting<10) %>%
  group_by(year, disease) %>%
  summarize(rate = sum(count)/sum(population)*10000) %>%
  ggplot(aes(year, rate, color=disease)) + geom_line()
#group_by(1st variable year, 2nd variable disease) and color=disease inside aes() let us see how each disease behaves; summarize creates the rate variable for the y-axis
#make a time series plots of the rates for all diseases in the US
us_contagious_diseases %>% filter(!is.na(population)) %>% group_by(year,
disease) %>% summarize(rate = sum(count)/sum(population)*10000) %>%
ggplot(aes(year, rate, color=disease)) + geom_line()
options(digits = 3)
library(tidyverse)
library(dslabs)
library(ggplot2)
library(dplyr)
install.packages("titanic")
library(titanic)
titanic <- titanic_train %>% select(Survived, Pclass, Sex, Age, SibSp,
Parch, Fare) %>% mutate(Survived =factor(Survived), Pclass
=factor(Pclass), Sex =factor(Sex))
titanic
str(titanic)
levels(titanic)
names(titanic)
?titanic_train
#?titanic_train to learn more about the variables stored on a dataset
head(titanic)
head(titanic_train)
str(titanic_train)
names(titanic)
#variable types in titanic: Survived & Sex are categorical non-ordinal, Pclass is categorical ordinal, Fare & Age are numeric continuous, SibSp & Parch are numeric discrete
#the following code shows a geom_density of titanic ages grouped by sex; on the y-axis we have count-totals, fill=Sex splits the data like a group_by, and facet_grid splits by columns
titanic %>% filter(!is.na(Age) & !is.na(Sex)) %>%
  ggplot(aes(x=Age, y=..count.., fill=Sex)) +
  geom_density(alpha=0.2, bw=1.5, position="stack") +
  facet_grid(.~Sex)
#define a params object to pass into geom_qq and show the sample vs theoretical distribution of titanic ages
params <- titanic %>% filter(!is.na(Age)) %>%
  summarize(mean=mean(Age), sd=sd(Age))
params
titanic %>% filter(!is.na(Age) & !is.na(Sex)) %>%
  ggplot(aes(sample=Age)) + geom_qq(dparams = params) + geom_abline()
#create a barplot of Survived by sex: remove NAs from Survived, and fill=Sex to group by sex and add color
titanic %>% filter(!is.na(Survived)) %>%
  ggplot(aes(Survived, fill=Sex)) + geom_bar()
titanic %>% filter(!is.na(Survived)) %>%
  ggplot(aes(Survived, fill=Sex)) + geom_bar(position=position_dodge())
#position=position_dodge() inside the geom_bar() layer separates the male and female bars within survived and died
#compare the age distributions of survivors vs deaths; there are two modes among survivors, around ages 0-8 and 25-35; y can be counts or density, here density is more readable
titanic %>% filter(!is.na(Age)) %>%
  ggplot(aes(x=Age, y=..count.., fill=Survived)) + geom_density(alpha=0.2)
titanic %>% filter(!is.na(Age)) %>%
  ggplot(aes(x=Age, y=..count.., fill=Survived)) + geom_density(alpha=0.2) +
  facet_grid(.~Survived)
titanic %>% filter(!is.na(Age)) %>%
  ggplot(aes(x=Age, fill=Survived)) + geom_density(alpha=0.2) +
  facet_grid(.~Survived)
titanic %>% filter(!is.na(Age)) %>%
  ggplot(aes(x=Age, fill=Survived)) + geom_density(alpha=0.2)
#compare y=Fare by x=Survived in a boxplot, with trans="log2" and jitter to see how the points behave
titanic %>% filter(!Fare==0) %>%
  ggplot(aes(x=Survived, y=Fare)) + geom_boxplot()
titanic %>% filter(!Fare==0) %>%
  ggplot(aes(x=Survived, y=Fare)) + geom_boxplot() +
  scale_y_continuous(trans="log2")
titanic %>% filter(!Fare==0) %>%
  ggplot(aes(x=Survived, y=Fare)) + geom_boxplot() +
  scale_y_continuous(trans="log2") + geom_jitter(alpha=0.2)
#passengers who paid a fare around 8 were mostly deaths; the median fare of survivors is higher than the median fare of deaths; there is an outlier who paid about 500 and survived
#barplot of Pclass counts filled by Survived: class 3 has the highest count of deaths
titanic %>% filter(!is.na(Pclass)) %>%
  ggplot(aes(Pclass, fill=Survived)) + geom_bar()
#proportional barplot from 0 to 1 of Pclass filled by Survived: class 1 has the highest survival proportion, class 2 is about 50-50 survived-died
titanic %>% filter(!is.na(Pclass)) %>%
  ggplot(aes(Pclass, fill=Survived)) + geom_bar(position= position_fill())
#proportional barplot of Survived filled by Pclass: about 70% of deaths were class 3, while about 35% of survivors were class 3
titanic %>% filter(!is.na(Pclass)) %>%
  ggplot(aes(Survived, fill=Pclass)) + geom_bar(position= position_fill())
#position=position_fill() inside geom_bar gives a proportional bar, making comparisons between the variables fair
#geom_density with x=Age, y=count, split with fill=Survived, facet by Sex~Pclass
titanic %>% filter(!is.na(Age)) %>%
  ggplot(aes(x=Age, y=..count.., fill=Survived)) + geom_density(alpha=0.2) +
  facet_grid(Sex~Pclass)
titanic %>% filter(!is.na(Age)) %>%
  ggplot(aes(x=Age, y=..count.., fill=Survived)) + geom_density(alpha=0.2) +
  facet_grid(Pclass~Sex)
library(tidyverse)
library(dslabs)
library(dplyr)
data("stars")
options(digits = 3)
str(stars)
head(stars)
levels(stars$star)
names(stars)
#magnitude is a function of star luminosity; negative (lower) values of magnitude mean higher luminosity
is.na(stars$magnitude)
mean(stars$magnitude)
sd(stars$magnitude)
summary(stars$magnitude)
stars %>% ggplot(aes(magnitude)) + geom_density()
stars %>% ggplot(aes(magnitude, ..count..)) + geom_density()
stars %>% ggplot(aes(temp)) + geom_density()
stars %>% ggplot(aes(temp, ..count..)) + geom_density()
stars %>% ggplot(aes(x=temp, y=magnitude)) + geom_point()
#most stars follow a decreasing exponential trend
stars %>% ggplot(aes(x=temp, y=magnitude)) + geom_point() +
scale_y_reverse()
stars %>% ggplot(aes(x=temp, y=magnitude)) + geom_point() +
scale_y_reverse() + scale_x_log10()
stars %>% ggplot(aes(x=temp, y=magnitude)) + geom_point() +
scale_y_reverse() + scale_x_log10() + scale_x_reverse()
#note: adding a second scale_x_* layer replaces the previous one (ggplot prints a message), so scale_x_reverse() above replaces scale_x_log10()
#stars with negative magnitude are brighter, and their temperatures are also high
library(ggrepel)
stars %>% ggplot(aes(x=temp, y=magnitude, label=star)) + geom_point() +
scale_y_reverse() + scale_x_log10() + scale_x_reverse() +
geom_text_repel(size=2)
stars %>% ggplot(aes(x=temp, y=magnitude, label=star)) + geom_point() +
scale_y_reverse() + scale_x_log10() + scale_x_reverse() +
geom_text_repel()
stars$temp
stars$star
stars %>% filter(star%in%
c("Antares","Castor","Mirfak","Polaris","vanMaanen'sStar")) %>%
pull(temp)
stars %>% filter(star%in%
c("Antares","Castor","Mirfak","Polaris","vanMaanen'sStar")) %>%
pull(magnitude)
stars %>% filter(star%in%
c("Sun","Antares","Castor","Mirfak","Polaris","vanMaanen'sStar")) %>%
ggplot(aes(x=temp, y=magnitude, label=star)) + geom_point() +
scale_y_reverse() + scale_x_log10() + scale_x_reverse() +
geom_text_repel()
stars %>% ggplot(aes(x=temp, y=magnitude, label=star, color=type)) +
geom_point() + scale_y_reverse() + scale_x_log10() + scale_x_reverse()
stars %>% filter(type%in%c("O","M")) %>% ggplot(aes(x=temp, y=magnitude,
label=star, color=type)) + geom_point() + scale_y_reverse() +
scale_x_log10() + scale_x_reverse()
stars %>% filter(type%in%c("O","M","G")) %>% ggplot(aes(x=temp,
y=magnitude, label=star, color=type)) + geom_point() + scale_y_reverse()
+ scale_x_log10() + scale_x_reverse() + geom_text_repel()
library(tidyverse)
library(dslabs)
library(dplyr)
data("temp_carbon")
data("greenhouse_gases")
data("historic_co2")
str(temp_carbon)
head(temp_carbon)
names(temp_carbon)
temp_carbon %>% .$year %>% max()
temp_carbon %>% filter(!is.na(carbon_emissions)) %>% pull(year) %>% max()
temp_carbon %>% filter(!is.na(carbon_emissions)) %>% .$year %>% max()
temp_carbon %>% filter(!is.na(carbon_emissions)) %>% select(year) %>% max()
temp_carbon %>% .$year %>% min()
min(temp_carbon$year)
summary(temp_carbon$year)
summary(temp_carbon$carbon_emissions)
temp_carbon %>% filter(carbon_emissions & year) %>% summary()
#find the first and last years with carbon emissions data available; how many times bigger are carbon emissions now?
#min year 1751, max year 2014; the ratio is 9855/3, the biggest value divided by the smallest value
temp_carbon %>% filter(!is.na(carbon_emissions) & !is.na(year)) %>%
  summarize(min=min(carbon_emissions), max=max(carbon_emissions))
temp_carbon %>% filter(!is.na(carbon_emissions) & !is.na(year)) %>%
  summarize(min=min(year), max=max(year))
9855/3
#find the first and last years with temp_anomaly data available; by how many degrees C did the temperature increase?
temp_carbon %>% filter(!is.na(temp_anomaly) & !is.na(year)) %>%
  summarize(min=min(year), max=max(year))
temp_carbon %>% filter(!is.na(temp_anomaly) & !is.na(year)) %>%
  summarize(mint=min(temp_anomaly), maxt=max(temp_anomaly),
            miny=min(year), maxy=max(year))
options(digits = 3)
temp_carbon %>% filter(temp_anomaly & year) %>% summary()
?temp_carbon
temp_carbon %>% filter(year%in%c(1880,2018)) %>% pull(temp_anomaly)
0.82-(-0.11)
#the temperature increased 0.93 degrees Celsius: just subtract bigger minus smaller, not the ratio bigger/smaller as with carbon emissions
p <- temp_carbon %>% filter(!is.na(temp_anomaly))
p <- temp_carbon %>% filter(!is.na(temp_anomaly)) %>%
  ggplot(aes(year, temp_anomaly)) + geom_line()
p
p + geom_hline(aes(yintercept=0), col="blue")
p + ylab("Temperature anomaly (Degrees C)") +
  ggtitle("Temperature anomaly relative to 20th century mean, 1880-2018") +
  geom_text(x=2000, y=0.05, label="20th century mean", color="blue")
p + geom_hline(aes(yintercept=0), col="blue") +
  ylab("Temperature anomaly (Degrees C)") +
  ggtitle("Temperature anomaly relative to 20th century mean, 1880-2018") +
  geom_text(x=2000, y=0.05, label="20th century mean", color="blue")
p + geom_hline(aes(yintercept=0), col="blue") +
  ylab("Temperature anomaly (Degrees C)") +
  ggtitle("Temperature anomaly relative to 20th century mean, 1880-2018") +
  geom_text(aes(x=2000, y=0.05, label="20th century mean"), col="blue")
temp_carbon %>% filter(temp_anomaly & year) %>% summary()
temp_carbon %>% filter(temp_anomaly>=0.06) %>% pull(year)
temp_carbon %>% filter(temp_anomaly<0.06) %>% pull(year)
temp_carbon %>% filter(temp_anomaly>=0.5) %>% pull(year)
#show the years with temperatures below the 20th century mean (0.06), above 0.06, and the years more than 0.5 degrees C above
p + geom_hline(aes(yintercept=0), col="blue") +
  ylab("Temperature anomaly (Degrees C)") +
  ggtitle("Temperature anomaly relative to 20th century mean, 1880-2018") +
  geom_text(x=2000, y=0.05, label="20th century mean", color="blue") +
  geom_line(mapping= aes(year, land_anomaly), color="red")
p + geom_hline(aes(yintercept=0), col="blue") +
  ylab("Temperature anomaly (Degrees C)") +
  ggtitle("Temperature anomaly relative to 20th century mean, 1880-2018") +
  geom_text(x=2000, y=0.05, label="20th century mean", color="blue") +
  geom_line(mapping= aes(year, land_anomaly), color="red") +
  geom_line(mapping= aes(year, ocean_anomaly), color="blue")
#land temperature is the highest; ocean temperature follows the pattern of the global temperature, and land temperature is the one that changed the most since 1880
str(greenhouse_gases)
names(greenhouse_gases)
head(greenhouse_gases)
greenhouse_gases %>% ggplot(aes(year, concentration)) + geom_line() +
  facet_grid(.~gas, scales="free") + geom_vline(xintercept =1850)
greenhouse_gases %>% ggplot(aes(year, concentration)) + geom_line() +
  facet_grid(gas~., scales="free") + geom_vline(xintercept =1850)
greenhouse_gases %>% ggplot(aes(year, concentration)) + geom_line() +
  facet_grid(gas~., scales="free") + geom_vline(xintercept =1850) +
  ylab("Concentration (ch4/n2o ppb, co2 ppm)") +
  ggtitle("Atmospheric greenhouse gas concentration by year, 0-2000")
#greenhouse gas (ch4, n2o, co2) concentrations from year 0 to 2000 in three vertically aligned time series plots; the vline marks the industrial revolution around 1850
temp_carbon %>% filter(!is.na(carbon_emissions) & !is.na(year)) %>%
ggplot(aes(year, carbon_emissions)) + geom_line()
temp_carbon %>% filter(year%in%c(1960,2014)) %>% pull(carbon_emissions)
9855/2569
temp_carbon %>% filter(year%in%c(1970,1980)) %>% pull(carbon_emissions)
5301/4053
data("historic_co2")
str(historic_co2)
head(historic_co2)
names(historic_co2)
co2_time <- historic_co2 %>% ggplot(aes(year, co2, color=source)) +
  geom_line()
co2_time
historic_co2 %>% ggplot(aes(year, co2, color=source)) + geom_line() +
facet_grid(source~.)
co2_time + scale_x_continuous(limits=c(-800000, -775000))
co2_time + scale_x_continuous(limits=c(-375000, -330000))
co2_time + scale_x_continuous(limits=c(-140000, -120000))
co2_time + scale_x_continuous(limits=c(-3000, 2018))
#change the axis limits with limits = c(from, to) inside scale_x_continuous
#change the limits to see how the line behaves in a period of time; it's like zooming into that period to see the ups & downs