Summarizing Data

The document discusses various methods for summarizing data in R, including apply(), lapply(), sapply(), tapply(), by(), sqldf(), and ddply(). It provides examples of using these functions to calculate summary statistics like means, medians, and standard deviations for different groups within a dataset. Key functions mentioned include group_by() and summarise() from the dplyr package to group and summarize data in a pipeline.

Uploaded by

Nitish Ravuvari

Summarizing data

People often get confused when they need to summarize data
quickly in R, because there are several options.
People who transition from SAS or SQL are used to writing
simple queries in those languages to summarize data sets. For
this audience, the biggest concern is how to do the same thing
in R.
In this article I will cover the primary ways to summarize data
sets. Hopefully this will make your journey much easier than it
looks.
Generally, summarizing data means computing statistics such as
the mean and median, or drawing summaries such as box plots. If
you are comfortable with scatter plots and histograms, you can
refer to the guide on data visualization in R.

Methods to Summarise Data in R
1. apply
The apply function returns a vector, array, or list of values
obtained by applying a function to either the rows or the
columns of a matrix. The second argument selects the margin: 1
for rows, 2 for columns. This is the simplest of the functions
that can do this job. However, it is specific to collapsing
either rows or columns.
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
apply(m, 1, mean)
[1] 6 7 8 9 10 11 12 13 14 15
apply(m, 2, mean)
[1] 5.5 15.5
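As a further illustration of the margin argument, here is a minimal sketch that applies other summary functions to the numeric columns of the built-in iris data set. The variable names (x, col_medians, col_sds, row_ranges) are chosen for this example only.

```r
# Numeric measurements of the built-in iris data as a matrix
x <- as.matrix(iris[, 1:4])

# Margin 2: one value per column
col_medians <- apply(x, 2, median)
col_sds <- apply(x, 2, sd)

# Margin 1: one value per row, here the gap between the largest
# and the smallest measurement of each flower
row_ranges <- apply(x, 1, function(r) max(r) - min(r))

print(col_medians)
print(col_sds)
print(head(row_ranges))
```

Any function that takes a vector can be passed as the third argument, including an anonymous function as in the last call.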

2. lapply
“lapply” returns “a list of the same length as X, each element of
which is the result of applying FUN to the corresponding
element of X.”
l <- list(a = 1:10, b = 11:20)
lapply(l, mean)
$a
[1] 5.5
$b
[1] 15.5
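Because a data frame is itself a list of columns, lapply can also summarize every column of a data set in one call. The following is a small sketch on the built-in iris data; the names num_cols and col_stats are assumptions made for this example.

```r
# A data frame is a list of columns, so lapply() visits each column in turn
num_cols <- iris[, 1:4]
col_stats <- lapply(num_cols, function(col) c(mean = mean(col), sd = sd(col)))

# The result is a named list with one element per column
print(names(col_stats))
print(col_stats$Sepal.Length)
```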

3. sapply
“sapply” does the same thing as lapply but simplifies the result
to a vector or matrix where possible. Let’s consider the last
example again.
l <- list(a = 1:10, b = 11:20)
l.mean <- sapply(l, mean)
class(l.mean)
[1] "numeric"
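To make the simplification step more concrete, the sketch below shows how the shape of the sapply result depends on what the function returns, and how vapply (also part of base R) performs the same job with an explicit type check. The variable names are illustrative only.

```r
l <- list(a = 1:10, b = 11:20)

# One value per element: sapply() simplifies to a named numeric vector
means <- sapply(l, mean)

# Two values per element: sapply() simplifies to a matrix,
# with one column per list element
ranges <- sapply(l, range)

# vapply() does the same as sapply() but checks every result
# against a declared template, here a single number
means_checked <- vapply(l, mean, numeric(1))

print(means)
print(ranges)
print(means_checked)
```

vapply is the safer choice inside functions and scripts, because it stops with an error when a result does not match the template instead of silently returning a list.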

4. tapply
None of the functions discussed so far can do what SQL can
achieve: summarize by group. Here is a function which completes
the palette for R. Usage is “tapply(X, INDEX, FUN = NULL, ...,
simplify = TRUE)”, where X is “an atomic object, typically a
vector” and INDEX is “a list of factors, each of same length as
X”. Here is an example which will make the usage clear.
attach(iris)
# mean petal length by species
tapply(iris$Petal.Length, Species, mean)
    setosa versicolor virginica
     1.462      4.260      5.552
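Since INDEX may be a list of factors, tapply can also group by more than one variable at a time. The iris data has only one grouping column, so this sketch switches to the built-in mtcars data set and uses its cyl and gear columns as an assumed pair of grouping variables.

```r
# mtcars ships with R; cyl and gear serve as two grouping variables
mpg_by_group <- tapply(
  mtcars$mpg,
  list(cyl = mtcars$cyl, gear = mtcars$gear),
  mean
)

# Rows are cylinder counts, columns are gear counts.
# Combinations that never occur in the data come back as NA.
print(round(mpg_by_group, 1))
```

Referring to the columns as mtcars$mpg and so on keeps the example independent of attach().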

5. by
Now comes a slightly more complicated function. ‘by’ is an
object-oriented wrapper for ‘tapply’ applied to data frames.
Hopefully the example will make it clearer.
attach(iris)
by(iris[, 1:4], Species, colMeans)
Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
       5.006        3.428        1.462        0.246
------------------------------------------------------------
Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
       5.936        2.770        4.260        1.326
------------------------------------------------------------
Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
       6.588        2.974        5.552        2.026
What did the function do? It simply split the data by a class
variable, which in this case is the species, and then created a
summary at that level. In other words, it applied the function
to each of the split data frames. The returned object is of
class “by”.
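The split-then-apply behaviour described above can be reproduced by hand with split and lapply from base R. This is only a sketch of the idea, not a description of how by is implemented internally; the names pieces, manual and via_by are chosen for illustration.

```r
# Step 1: split the numeric columns into one data frame per species
pieces <- split(iris[, 1:4], iris$Species)

# Step 2: apply the summary function to each piece
manual <- lapply(pieces, colMeans)

# The same computation through by(), for comparison
via_by <- by(iris[, 1:4], iris$Species, colMeans)

print(manual$setosa)
# Compare the first group (setosa) from both approaches
print(all.equal(manual[[1]], via_by[[1]]))
```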

6. sqldf
If you found any of the above statements difficult, don’t panic. I
bring you a lifeline which you can use anytime: SQL queries
inside R. Here is how you can do the same thing.
library(sqldf)
# column names that contain a dot must be quoted inside the query
summarization <- sqldf('select Species, avg("Petal.Length") as "Petal.Length_mean"
from iris where Species is not null group by Species')
And it’s done. Wasn’t it simple enough? One setback of this
approach is the amount of time it takes to execute. If you are
interested in getting the same results faster, read the next
section.

7. ddply
The fastest of all the methods we discussed. You will need an
additional package, plyr. Let’s do exactly what we did in the
tapply section.
library(plyr)
# mean petal length by species
ddply(iris, "Species", summarise, Petal.Length_mean = mean(Petal.Length))

Additional notes: You can also use packages such as dplyr and
data.table to summarize data; see the article “Faster Data
Manipulation with these 7 R Packages”.
In general, if you are adding this summarization step in the
middle of a process and need a table as output, go for sqldf or
ddply. “ddply” is faster in these cases but will not give you
options beyond grouping. “sqldf” has all the features you need
to summarize the data in SQL statements.
If you are interested in functions similar to pivot tables or in
transposing tables, consider using “reshape”. We have covered a
few examples of this in our article “Comprehensive guide for
data exploration in R”.
Challenge: Here is a simple problem you can attempt to solve
using all the methods we have discussed. You have a table of
marks for all school kids in a particular city.
Write code to find the mean marks of each school for both class
1 and class 2, for students with a roll number less than 6, and
print only the class whose mean score is higher for that school.
For instance, if school A has a mean score of 6 for class 1 and
4 for class 2, you will reject class 2 and take only the class 1
mean score for the school. In case of a tie, you can make a
random choice. Assume that the actual table is much bigger and
keep the code as general as possible.
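One possible way to approach the challenge in base R is sketched below. The table itself is not reproduced in this text, so the data frame marks, its column names (school, class, roll_no, score) and all values are made up purely for illustration. The sketch also breaks ties by taking the first class rather than choosing at random.

```r
# Hypothetical data: column names and values are assumptions for illustration
marks <- data.frame(
  school  = rep(c("A", "B"), each = 8),
  class   = rep(rep(1:2, each = 4), times = 2),
  roll_no = rep(c(1, 2, 5, 7), times = 4),
  score   = c(6, 7, 5, 9,  4, 3, 5, 10,
              2, 4, 3, 8,  7, 8, 9, 1)
)

# Keep roll numbers below 6, then average by school and class
eligible <- marks[marks$roll_no < 6, ]
means <- aggregate(score ~ school + class, data = eligible, FUN = mean)

# For each school keep the class with the highest mean
# (which.max() returns the first maximum, so ties go to the first class)
best <- do.call(rbind, lapply(split(means, means$school), function(d) {
  d[which.max(d$score), ]
}))

print(best)
```

With these made-up values school A reproduces the example from the text: class 1 averages 6, class 2 averages 4, and only class 1 is kept.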

Summarize in R: when we have a data set and need a clear idea
about each parameter, a summary of the data is important.
Summarized data gives a clear picture of the data set.
In this tutorial we are going to talk about the summarise()
function from the dplyr package. Summarizing a data set by group
gives a better indication of the distribution of the data.
In this tutorial you will get an idea of summarise(), group_by(),
and the important functions used inside summarise().
Load Library
library(dplyr)
Let’s load the iris data set for summarization and store it in a
new variable, say df.
df <- iris
df1 <- summarise(df, mean(Sepal.Length))
Output:-
mean(Sepal.Length)
5.843333
Let’s compute the mean and sd of Sepal.Length.
df2 <- summarise(df, Mean = mean(Sepal.Length),
SD = sd(Sepal.Length))
Output:-
Mean SD
5.843333 0.8280661
Now we try to summarize based on groups.
df3 <- summarise(group_by(df, Species),
Mean = mean(Sepal.Length),
SD = sd(Sepal.Length))
Output:-
Species Mean SD
1 setosa 5.01 0.352
2 versicolor 5.94 0.516
3 virginica 6.59 0.636
You can use the pipe operator to summarise the data set. The pipe
operator %>% comes from the magrittr package (dplyr also
re-exports it). Let’s load the package.
library(magrittr)
df4<-df %>%
group_by(Species) %>%
summarise(Mean = mean(Sepal.Length),
SD=sd(Sepal.Length))
Output:-
Species Mean SD
1 setosa 5.01 0.352
2 versicolor 5.94 0.516
3 virginica 6.59 0.636
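For comparison, the same pipeline can be written with the native pipe |> that base R provides from version 4.1.0 onward, so magrittr is not needed for it. This is a sketch that assumes dplyr is installed; df4_native is a name chosen for this example.

```r
library(dplyr)

# Same grouping and summary as above, written with the base R pipe |>
df4_native <- iris |>
  group_by(Species) |>
  summarise(Mean = mean(Sepal.Length),
            SD = sd(Sepal.Length))

print(df4_native)
```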
With the pipe operator you can easily summarize the data and plot
it with the help of ggplot2.
library(ggplot2)
Plotting the data set takes four main steps:
Step 1: Select the appropriate data frame
Step 2: Group the data frame
Step 3: Summarize the data frame
Step 4: Plot the summary statistics based on your requirement
df %>%
group_by(Species) %>%
summarise(Mean = mean(Sepal.Length)) %>%
ggplot(aes(x = Species, y = Mean, fill = Species)) +
geom_bar(stat = "identity") +
theme_classic() +
labs(
x = "Species",
y = "Average Sepal.Length ",
title = paste(
"Summary Based on Groups"
)
)
Sum
Another useful function for aggregating a variable is sum().
df5<-df %>%
group_by(Species) %>%
summarise(sum = sum(Sepal.Length),
SD=sd(Sepal.Length))
Output:-
Species sum SD
1 setosa 250 0.352
2 versicolor 297 0.516
3 virginica 329 0.636
Minimum and maximum
Find the minimum and the maximum of a vector or variable with
the help of the functions min() and max().
df6<-df %>%
group_by(Species) %>%
summarise(Min = min(Sepal.Length),
Max=max(Sepal.Length))
Output:-
Species Min Max
1 setosa 4.3 5.8
2 versicolor 4.9 7
3 virginica 4.9 7.9
Count
If you want to count observations by group, you can aggregate the
number of occurrences with n().
df7<-df %>%
group_by(Species) %>%
summarise(Count = n())%>%
arrange(desc(Count))
Output:-
Species Count
1 setosa 50
2 versicolor 50
3 virginica 50
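dplyr also offers count() as a shorthand for this group-and-count pattern. The sketch below, which assumes dplyr is installed, places the shorthand next to the long form; the names by_count and by_summarise are illustrative.

```r
library(dplyr)

# count() groups by the given column and stores the group size in a column named n
by_count <- iris %>% count(Species, sort = TRUE)

# Equivalent long form with group_by(), summarise() and arrange()
by_summarise <- iris %>%
  group_by(Species) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

print(by_count)
print(by_summarise)
```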
First and Last
In some cases identifying the first case or a particular position
is important; you can then use the first, last, or nth position
of a group.
df8<-df %>%
group_by(Species) %>%
summarise(First = first(Sepal.Length),
Last=last(Sepal.Length))
Output:-
Species First Last
1 setosa 5.1 5
2 versicolor 7 5.7
3 virginica 6.3 5.9
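The nth position mentioned above works the same way through dplyr's nth() function, where a negative position counts from the end of the group. A minimal sketch, assuming dplyr is installed (df9 and the column names Second and SecondLast are made up for this example):

```r
library(dplyr)

# Second and second-to-last Sepal.Length value within each species
df9 <- iris %>%
  group_by(Species) %>%
  summarise(Second = nth(Sepal.Length, 2),
            SecondLast = nth(Sepal.Length, -2))

print(df9)
```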
In the same way you can use the following functions, some of
which were already covered in this tutorial. The important
functions for summarizing a data set are listed below.
Mean
summarise(df,mean = mean(x1))
Median
summarise(df,median = median(x1))
Sum
summarise(df,sum = sum(x1))
Standard Deviation
summarise(df,sd = sd(x1))
Interquartile
summarise(df,interquartile = IQR(x1))
Minimum
summarise(df,minimum = min(x1))
Maximum
summarise(df,maximum = max(x1))
Quantile
summarise(df,quantile = quantile(x1, probs = 0.25))
First Observation
summarise(df,first = first(x1))
Last observation
summarise(df,last = last(x1))
nth observation
summarise(df,nth = nth(x1, 2))
Number of occurrence
summarise(df,count = n())
Number of distinct occurrence
summarise(df,distinct = n_distinct(x1))
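To show several of these functions working together, here is a sketch that applies them to Sepal.Length in the iris data, grouped by species. It assumes dplyr is installed; the result name overview and the output column names are chosen for this example. Note that n() is called without arguments, and that quantile() is given a single probability so that each group yields one value.

```r
library(dplyr)

overview <- iris %>%
  group_by(Species) %>%
  summarise(
    Mean     = mean(Sepal.Length),
    Median   = median(Sepal.Length),
    IQR      = IQR(Sepal.Length),
    Q25      = quantile(Sepal.Length, probs = 0.25, names = FALSE),
    Count    = n(),                       # n() takes no arguments
    Distinct = n_distinct(Sepal.Length)
  )

print(overview)
```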