Summarizing Data
Methods to Summarise Data in R
1. apply
The apply function returns a vector, array, or list of values
obtained by applying a function to the rows or columns of a
matrix or array. It is the simplest of the functions covered
here, but it is limited to collapsing along rows or columns.
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
apply(m, 1, mean)
[1] 6 7 8 9 10 11 12 13 14 15
apply(m, 2, mean)
[1] 5.5 15.5
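As a quick check on the example above, here is a minimal sketch
showing that apply() with mean gives the same values as the base R
helpers rowMeans() and colMeans(). The anonymous "spread" function
at the end is only an illustration of passing your own function.

```r
# Same matrix as above: column 1 holds 1:10, column 2 holds 11:20
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)

# MARGIN = 1 collapses each row, MARGIN = 2 collapses each column
row_means <- apply(m, 1, mean)
col_means <- apply(m, 2, mean)

# For plain means, rowMeans() and colMeans() return the same values
all.equal(row_means, rowMeans(m))
all.equal(col_means, colMeans(m))

# Any function of a vector can be supplied, e.g. max minus min per column
col_spread <- apply(m, 2, function(x) max(x) - min(x))
col_spread
```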
2. lapply
“lapply” returns “a list of the same length as X, each element of
which is the result of applying FUN to the corresponding
element of X.”
l <- list(a = 1:10, b = 11:20)
lapply(l, mean)
$a
[1] 5.5
$b
[1] 15.5
3. sapply
“sapply” does the same thing as lapply but simplifies the result
to a vector or matrix where possible. Let’s consider the last
example again.
l <- list(a = 1:10, b = 11:20)
l.mean <- sapply(l, mean)
class(l.mean)
[1] "numeric"
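To illustrate the "vector or matrix" part, the sketch below uses a
function that returns two values per element, so sapply() simplifies
the result to a matrix. vapply() is a base R relative that also
checks each result against a template; the name l.range is just an
illustrative choice.

```r
l <- list(a = 1:10, b = 11:20)

# One value per element: simplified to a named numeric vector
sapply(l, mean)

# Two values per element (min and max): simplified to a 2 x 2 matrix
l.range <- sapply(l, range)
dim(l.range)

# vapply() does the same job but verifies each result is one number
l.mean <- vapply(l, mean, FUN.VALUE = numeric(1))
l.mean
```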
4. tapply
None of the functions discussed so far can do what SQL does with
a GROUP BY clause: summarize one variable by the levels of
another. tapply fills that gap in R. Its usage is “tapply(X,
INDEX, FUN = NULL, …, simplify = TRUE)”, where X is “an atomic
object, typically a vector” and INDEX is “a list of factors, each
of same length as X”. Here is an example which will make the
usage clear.
# mean petal length by species
tapply(iris$Petal.Length, iris$Species, mean)
setosa versicolor virginica
1.462 4.260 5.552
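Because INDEX may be a list of factors, tapply() can also group by
more than one variable at once. Here is a minimal sketch; it uses
the built-in mtcars data set, which is not part of the example
above, purely for illustration.

```r
# Mean miles per gallon by number of cylinders (rows)
# and transmission type (columns: 0 = automatic, 1 = manual)
mpg_by_group <- tapply(
  mtcars$mpg,
  list(cyl = mtcars$cyl, am = mtcars$am),
  mean
)
mpg_by_group

# The result is a matrix indexed by the factor level names
mpg_by_group["4", "1"]
```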
5. by
Next comes a slightly more involved function. ‘by’ is an
object-oriented wrapper for ‘tapply’ applied to data frames.
The example should make this clearer.
attach(iris)
by(iris[, 1:4], Species, colMeans)
Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
------------------------------------------------------------
Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.936 2.770 4.260 1.326
------------------------------------------------------------
Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.588 2.974 5.552 2.026
What did the function do? It split the data by a class variable,
in this case the species, and then computed a summary for each
group. In other words, it applies the function to each of the
split data frames. The returned object is of class “by”.
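The split-then-apply idea can be sketched by hand with split() and
lapply(). This is only an illustration of the concept, not a claim
about how by() is implemented internally; the names pieces, manual
and via_by are made up for this sketch.

```r
# Split the four numeric columns of iris into one data frame per species
pieces <- split(iris[, 1:4], iris$Species)

# Apply colMeans to each piece; the result is a plain named list
manual <- lapply(pieces, colMeans)
manual$setosa

# by() produces the same numbers, wrapped in an object of class "by"
via_by <- by(iris[, 1:4], iris$Species, colMeans)
all.equal(unname(manual$setosa), unname(via_by[["setosa"]]))
```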
6. sqldf
If you found any of the above functions difficult, don’t panic.
There is a lifeline you can use at any time: writing SQL queries
directly in R with the sqldf package. Here is how you can get
the same summary.
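library(sqldf)
# mean petal length by species; avg() is the SQL mean, and the
# dotted column name is quoted so SQL reads it as one identifier
summarization <- sqldf('select Species, avg("Petal.Length") as
Petal_Length_mean from iris where Species is not null group by
Species')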
And it’s done. Wasn’t that simple enough? One drawback of this
approach is the time it takes to execute. If you want the same
results with more speed, read the next section.
7. ddply
This is the fastest of the methods discussed here. You will need
an additional package, plyr. Let’s repeat exactly what we did in
the tapply section.
library(plyr)
# mean petal length by species
ddply(iris, "Species", summarise, Petal.Length_mean = mean(Petal.Length))
Additional notes: You can also use packages such as dplyr and
data.table to summarize data; see Faster Data Manipulation with
these 7 R Packages.
In general, if you are adding this summarization step in the
middle of a process and need a table as output, go for sqldf or
ddply. “ddply” is faster in these cases but offers little beyond
grouping, whereas “sqldf” gives you everything SQL statements
can express when summarizing data.
If you are interested in functions similar to pivot tables, or in
transposing tables, consider using “reshape”. We have covered a
few examples in our article: comprehensive guide for data
exploration in R.
Challenge: Here is a simple problem you can attempt to solve
using all the methods we have discussed. You have a table of
marks for all the school kids in a particular city.
Write code to find the mean marks of each school for both class
1 and class 2, for students with a roll number less than 6. Print
only the class whose mean score is higher for each school. For
instance, if school A has a mean score of 6 for class 1 and 4
for class 2, you will reject class 2 and take only the class 1
mean score for the school. In case of a tie, you can make a
random choice. Assume that the actual table is much bigger and
keep the code as general as possible.
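One possible base R approach is sketched below. The table itself is
not shown here, so the data frame scores and its column names
(school, class, roll_no, marks) are hypothetical stand-ins filled
with random numbers. Note that which.max() takes the first maximum,
so ties are not broken at random in this sketch.

```r
# Hypothetical toy table; the real column names and values may differ
set.seed(1)
scores <- data.frame(
  school  = rep(c("A", "B"), each = 20),
  class   = rep(rep(1:2, each = 10), times = 2),
  roll_no = rep(1:10, times = 4),
  marks   = sample(1:10, 40, replace = TRUE)
)

# Step 1: keep students with a roll number below 6
eligible <- scores[scores$roll_no < 6, ]

# Step 2: mean marks for every school/class pair
means <- aggregate(marks ~ school + class, data = eligible, FUN = mean)

# Step 3: within each school keep the class with the highest mean
best <- do.call(rbind, lapply(split(means, means$school), function(d) {
  d[which.max(d$marks), ]
}))
best
```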
Summarize in R
When we have a data set and need a clear idea of each variable,
a summary of the data is important. Summarized data gives a
clear picture of the data set.
In this tutorial we are going to talk about the summarise()
function from the dplyr package. Summarizing a data set by group
gives a better indication of the distribution of the data.
In this tutorial you will get an idea of summarise(), grouped
summaries with group_by(), and important functions used inside
summarise().
Load Library
library(dplyr)
Let’s load the iris data set for summarization and store it in a
new variable, say df.
df <- iris
df1 <- summarise(df, mean(Sepal.Length))
Output:-
mean(Sepal.Length)
5.843333
Let’s compute the mean and standard deviation of Sepal.Length.
df2 <- summarise(df, Mean = mean(Sepal.Length),
SD = sd(Sepal.Length))
Output:-
Mean SD
5.843333 0.8280661
Now we try to summarize based on groups.
df3 <- summarise(group_by(df, Species),
Mean = mean(Sepal.Length),
SD = sd(Sepal.Length))
Output:-
Species Mean SD
1 setosa 5.01 0.352
2 versicolor 5.94 0.516
3 virginica 6.59 0.636
You can use the pipe operator to summarise the data set. The
pipe operator comes from the magrittr package (dplyr also makes
it available once loaded). Let’s load the package.
library(magrittr)
df4<-df %>%
group_by(Species) %>%
summarise(Mean = mean(Sepal.Length),
SD=sd(Sepal.Length))
Output:-
Species Mean SD
1 setosa 5.01 0.352
2 versicolor 5.94 0.516
3 virginica 6.59 0.636
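Other summary functions can be used inside summarise() in the same
way. The sketch below is one illustration, assuming dplyr is
installed; the column names Count, Median, Min and Max and the
variable df5 are arbitrary choices for this example.

```r
library(dplyr)

df5 <- iris %>%
  group_by(Species) %>%
  summarise(
    Count  = n(),                   # number of rows in each group
    Median = median(Sepal.Length),  # middle value per group
    Min    = min(Sepal.Length),     # smallest value per group
    Max    = max(Sepal.Length)      # largest value per group
  )
df5
```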
Using the pipe operator, you can easily summarize the data and
plot the result with the help of ggplot2.
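As a rough sketch of that idea, assuming both dplyr and ggplot2 are
installed, the grouped summary can be piped straight into a bar
chart. The choice of bars with error bars of one standard
deviation, and the axis label, are illustrative assumptions.

```r
library(dplyr)
library(ggplot2)

p <- iris %>%
  group_by(Species) %>%
  summarise(Mean = mean(Sepal.Length), SD = sd(Sepal.Length)) %>%
  ggplot(aes(x = Species, y = Mean)) +
  geom_col() +                                      # one bar per species
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD),
                width = 0.2) +                      # mean plus/minus one SD
  labs(y = "Mean sepal length")
p
```

Note that the summary steps are chained with %>% while the ggplot2
layers are added with +.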