Exploratory Data Analysis in R

Exploring numerical data
EXPLORATORY DATA ANALYSIS IN R

Andrew Bray
Assistant Professor, Reed College
Cars dataset
str(cars)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 428 obs. of 19 variables:


$ name : chr "Chevrolet Aveo 4dr" "Chevrolet Aveo LS 4dr hatch" ...
$ sports_car : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ suv : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ wagon : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ minivan : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ pickup : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ all_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ rear_wheel : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ msrp : int 11690 12585 14610 14810 16385 13670 15040 13270 ...
$ dealer_cost: int 10965 11802 13697 13884 15357 12849 14086 12482 ...
$ eng_size : num 1.6 1.6 2.2 2.2 2.2 2 2 2 2 2 ...
$ ncyl : int 4 4 4 4 4 4 4 4 4 4 ...
$ horsepwr : int 103 103 140 140 140 132 132 130 110 130 ...
$ city_mpg : int 28 28 26 26 26 29 29 26 27 26 ...
$ hwy_mpg : int 34 34 37 37 37 36 36 33 36 33 ...
$ weight : int 2370 2348 2617 2676 2617 2581 2626 2612 2606 ...
$ wheel_base : int 98 98 104 104 104 105 105 103 103 103 ...
$ length : int 167 153 183 183 183 174 174 168 168 168 ...
$ width : int 66 66 69 68 69 67 67 67 67 67 ...

Dotplot
library(ggplot2)
ggplot(cars, aes(x = weight)) +
  geom_dotplot(dotsize = 0.4)

Histogram
ggplot(cars, aes(x = weight)) +
  geom_histogram()

Density plot
ggplot(cars, aes(x = weight)) +
  geom_density()

Boxplot
ggplot(cars, aes(x = 1, y = weight)) +
  geom_boxplot() +
  coord_flip()

Faceted histogram
ggplot(cars, aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_wrap(~pickup)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Warning message:
Removed 14 rows containing non-finite values (stat_bin).
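
The message and warning above can both be addressed directly. The following is a minimal sketch, not the course's own solution: `toy_cars` is a hypothetical stand-in for the cars data with invented values, and the bin width of 2 is an arbitrary choice. It counts the missing `hwy_mpg` values that the histogram would drop, then sets `binwidth` explicitly and passes `na.rm = TRUE` so the missing values are removed silently.

```r
library(ggplot2)

# Hypothetical stand-in for the course's cars data (values invented for illustration)
toy_cars <- data.frame(
  hwy_mpg = c(34, 37, 36, 33, NA, 24, 22, NA, 26, 31),
  pickup  = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Number of rows a histogram of hwy_mpg would drop as non-finite
n_missing <- sum(is.na(toy_cars$hwy_mpg))

# An explicit binwidth avoids the `bins = 30` message;
# na.rm = TRUE removes missing values without the warning
p <- ggplot(toy_cars, aes(x = hwy_mpg)) +
  geom_histogram(binwidth = 2, na.rm = TRUE) +
  facet_wrap(~pickup)
```

Counting the missing values first keeps the dropped rows visible instead of simply hiding the warning.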

Let's practice!
Distribution of one variable
Marginal vs. conditional
ggplot(cars, aes(x = hwy_mpg)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Warning message:
Removed 14 rows containing non-finite values (stat_bin).

Marginal vs. conditional
ggplot(cars, aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_wrap(~pickup)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Warning message:
Removed 14 rows containing non-finite values (stat_bin).

Building a data pipeline
library(dplyr)

cars2 <- cars %>%
  filter(eng_size < 2.0)

ggplot(cars2, aes(x = hwy_mpg)) +
  geom_histogram()

Building a data pipeline
cars %>%
  filter(eng_size < 2.0) %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram()

Filtered histogram
cars %>%
  filter(eng_size < 2.0) %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Wide bin width
cars %>%
  filter(eng_size < 2.0) %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram(binwidth = 5)

Density plot
cars %>%
  filter(eng_size < 2.0) %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_density()

Wide bandwidth
cars %>%
  filter(eng_size < 2.0) %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_density(bw = 5)
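
To see what the `bw` argument controls, the bandwidth can be inspected with base R's `density()`, which takes the same kind of `bw` value (the standard deviation of the smoothing kernel). This is a rough sketch under assumptions: the `hwy_mpg` vector below is invented for illustration and is not taken from the cars data.

```r
# Hypothetical highway mileage values (invented for illustration)
hwy_mpg <- c(24, 26, 27, 29, 30, 31, 31, 33, 34, 36, 37, 44, 46)

# Default: the bandwidth is chosen from the data by a rule of thumb (bw.nrd0)
d_default <- density(hwy_mpg)

# Fixed bandwidth of 5 mpg, matching the bw = 5 used in the slide
d_wide <- density(hwy_mpg, bw = 5)

# Compare the bandwidth each estimate actually used
c(default = d_default$bw, wide = d_wide$bw)
```

A larger bandwidth spreads each observation over a wider range, which typically gives a smoother curve that can hide smaller features such as a second mode.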

Let's practice!
Box plots
Side-by-side box plots
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()

Warning message:
Removed 11 rows containing non-finite values (stat_boxplot).
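
The slides use `common_cyl` without showing how it is built. One plausible construction, offered as an assumption rather than the author's confirmed code, keeps only cars with the most common cylinder counts (4, 6, and 8). The `toy_cars` data frame below is hypothetical, with invented values.

```r
library(dplyr)

# Hypothetical stand-in for the cars data (values invented for illustration)
toy_cars <- data.frame(
  ncyl     = c(4, 4, 6, 8, 12, 5, 6, 8),
  city_mpg = c(28, 26, 20, 15, 12, 21, NA, 16)
)

# Assumed definition: keep only the common cylinder counts
common_cyl <- toy_cars %>%
  filter(ncyl %in% c(4, 6, 8))

# Rows kept per cylinder count
table(common_cyl$ncyl)
```

Restricting to a few groups keeps the side-by-side box plots readable; rows with a missing `city_mpg` are still dropped by `geom_boxplot()` with a warning.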

Let's practice!
Visualization in higher dimensions
Plots for 3 variables
ggplot(cars, aes(x = msrp)) +
  geom_density() +
  facet_grid(pickup ~ rear_wheel)

Plots for 3 variables
ggplot(cars, aes(x = msrp)) +
  geom_density() +
  facet_grid(pickup ~ rear_wheel, labeller = label_both)

Plots for 3 variables
ggplot(cars, aes(x = msrp)) +
  geom_density() +
  facet_grid(pickup ~ rear_wheel, labeller = label_both)

table(cars$rear_wheel, cars$pickup)

        FALSE TRUE
  FALSE   306   12
  TRUE     98   12
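
The counts above show how many cars sit behind each facet panel; both pickup panels rest on only 12 cars, so their density curves deserve a cautious reading. A small sketch of the same check follows, using a hypothetical `toy_cars` data frame with invented values and named dimensions so the row and column roles are explicit.

```r
# Hypothetical stand-in for the cars data (values invented for illustration)
toy_cars <- data.frame(
  rear_wheel = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
  pickup     = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Cross-tabulate the two faceting variables; naming the arguments labels the dimensions
counts <- table(rear_wheel = toy_cars$rear_wheel, pickup = toy_cars$pickup)

# Size of the sparsest panel
smallest <- min(counts)

# Share of all cars in each panel
shares <- prop.table(counts)
```

Naming the dimensions in `table()` plays the same role as `labeller = label_both` in the plot: it makes clear which variable each `TRUE`/`FALSE` refers to.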

Higher dimensional plots
Shape

Size

Color

Pattern

Movement

x-coordinate

y-coordinate

Let's practice!