Data Tabulation and Frequencies

The document provides an introduction to business analytics focusing on data tabulation and frequency distributions. It covers the creation of frequency tables for categorical and numeric data, including methods for constructing histograms and calculating cumulative and relative frequencies. Additionally, it discusses the importance of data visualization and techniques for summarizing data effectively.

Uploaded by

gerald.tanwh
Copyright: © All Rights Reserved

BT1101 Introduction to Business Analytics

Data Tabulation & Frequencies


Learning objectives (TL;DR: learn to create tables for data)

• Appreciate the importance and role of data visualization through tabulation
• Be able to describe and summarize data using tabular techniques (e.g. frequency tables, contingency tables)
• Be able to use and construct frequency distributions, relative frequency distributions and histograms, and to compute cumulative relative frequencies, percentiles and quartiles for a data set
Frequency Table

• Frequency table (also known as a frequency distribution) - a table that shows the number of observations in each of several non-overlapping groups

• Categorical variables (variables that can be split into categories) naturally define the groups in a frequency distribution.


Frequency Distributions for Categorical Data – An Example
Home_Market_Value_Type

One-way frequency table for house type
Columns: House Type | Frequency or Number of Observations

This table shows the number of observations for each type of housing, expressing the frequency distribution for a categorical variable in tabular form.
Frequency Distributions for Categorical Data – An Example
Home_Market_Value_Type

METHOD 1
Using “dplyr” package, “group_by” and “summarise” functions
[Tabulates the number of observations for all non-overlapping groups.]

METHOD 2
Using Base R “table” function
[Creates a frequency distribution from the Home dataset for the variable Type; the table can then be converted into a data frame.]

Apart from tabular form, the frequency distribution for a categorical variable can also be expressed in graphical form.
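A minimal sketch of the two methods, assuming a small made-up `Home` data frame with a `Type` column (the real data set and its category names are not shown in these notes). The dplyr part only runs if the package is installed.

```r
# Hypothetical stand-in for the Home data set; only the Type column matters here.
Home <- data.frame(
  Type = c("Condo", "Landed", "Condo", "HDB", "HDB", "Condo"),
  stringsAsFactors = FALSE
)

# METHOD 2: base R table() counts the observations in each category.
freq_base <- table(Home$Type)

# Convert the table into a data frame (default column names are Var1 and Freq).
freq_df <- as.data.frame(freq_base, stringsAsFactors = FALSE)
names(freq_df) <- c("Type", "Freq")
print(freq_df)

# METHOD 1: dplyr group_by() + summarise(), skipped when dplyr is not installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  freq_dplyr <- dplyr::summarise(dplyr::group_by(Home, Type), Freq = dplyr::n())
  print(freq_dplyr)
}
```

Both methods should report the same counts; `table()` returns a table object, while the dplyr route returns a data frame directly.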

Plotting Frequency Distributions for Categorical Data – An Example
Home_Market_Value_Type

☆ Creating a bar plot from the tabulated frequency.

METHOD 1
Using “dplyr” package, “group_by” and “summarise” functions

METHOD 2
Using Base R “table” function
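A short sketch of plotting a tabulated frequency as a bar plot with base R `barplot()`. The house types are made up for illustration, and the plot is sent to a null device so that the snippet also runs without a display; drop the `pdf(NULL)` and `dev.off()` lines to see the plot.

```r
# Hypothetical house types standing in for Home$Type.
house_type <- c("Condo", "Landed", "Condo", "HDB", "HDB", "Condo")
type_freq <- table(house_type)

pdf(NULL)  # null graphics device: nothing is written to disk
bar_mids <- barplot(type_freq,
                    main = "Frequency of House Type",
                    xlab = "House Type",
                    ylab = "Number of Observations",
                    col = "purple")
invisible(dev.off())
```

`barplot()` returns the x positions of the bar midpoints, which is handy if labels need to be added above the bars.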
Frequency Distributions for Numeric Data
• Histogram: A graphical depiction of a frequency distribution for numerical data in the form of a bar chart
• Terminologies:
• Class: a category for grouping data [each bar is a class]
• Frequency: Number of data values in a class [represented by height of bar]
• Density: Relative frequency
• Upper class limit: largest value that can go in a class
• Lower class limit: smallest value that can go in a class
• Class width: Difference between lower class limit of a given class and the lower class limit of the next higher class [class width = 100 in the example]
• Class midpoint: Midpoint of a class
Creating a Histogram
• Plotting histogram using “hist” function
• “breaks” parameter: specifies the bins and hence the width of each bar, e.g. width = 1 (default value is based on Sturges's rule)

Sturges's Rule:
k = 1 + 3.322(log10 n) (k is the number of classes; n is the size of the data)
E.g.: k = 1 + 3.322(log10 42) = 1 + 5.39 = 6.39 --> 7 (rounded up)
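Working the rule through in R for n = 42, as a quick check: 3.322 × log10(42) is about 5.39, so adding the leading 1 gives about 6.39, which rounds up to 7 classes. The comparison with `nclass.Sturges()` assumes that R's built-in helper (which uses log2) is the rule `hist()` applies by default.

```r
n <- 42

k_raw <- 1 + 3.322 * log10(n)   # Sturges's rule with a base-10 logarithm
k <- ceiling(k_raw)             # round up to a whole number of classes

# R's built-in version computes ceiling(log2(n) + 1) from the data length.
k_builtin <- nclass.Sturges(seq_len(n))

print(c(k_raw = k_raw, k = k, k_builtin = k_builtin))
```

Note that `hist()` treats this number as a suggestion and then picks round breakpoints, so the number of bars actually drawn can differ from k.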

☆ Annotation (the slide notes on breaks are quite confusing):

1) Breaks as a single number, e.g. breaks = 6:
   1) R calculates the range of the data [the difference between the min and max values in the dataset].
   2) R divides the range of the data by the number of bins specified, which is 6.
   3) R creates 6 bins/bars of equal width, starting from the minimum value.
   4) The number of data points that fall within each bin is counted and represents the frequency (height) for that bin.
   Note: hist() treats a single number only as a suggestion. The breakpoints are set to "pretty" round values, so the final number of bins can differ from the number requested.

2) Breaks as a vector: we are explicitly defining the breakpoints [the exact edges of the bars].
   The histogram will have bins at the intervals between consecutive breakpoints, e.g. a first bin from 0 to 10.
   The width of each bar corresponds to the distance between consecutive breakpoints.

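Histograms for Numerical Data

[Tip: check which edge of each bin is inclusive. With the default right = TRUE, hist() uses right-closed bins (a, b], so a value of 10 is counted in the 0-10 bin. With right = FALSE the bins are left-inclusive [a, b), so a value of 10 is counted in the next bin, NOT in 0-10.]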

Some rules of thumb:

1. Number of groups - Choose between 5 and 15 groups; more for larger n; the range of each should be equal.
2. Choose the lower limit of the first group (LL) as a whole number smaller than the minimum data value and the upper limit of the last group (UL) as a whole number larger than the maximum data value.
3. Group or bar width = (UL - LL) / number of groups
   [Formula to determine the size of each bar.]


Histograms for Numerical Data

The “breaks” argument takes different kinds of values:
• a single number giving the number of cells for the histogram,
• a vector giving the breakpoints between histogram cells,
• a function to compute the vector of breakpoints,
• seq(from, to, by), where by gives the bar width

[Figure: the same histogram drawn with breaks = 2, breaks = 4 and breaks = 6]
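A small sketch of the different forms of `breaks`, using a made-up vector of house ages. `plot = FALSE` makes `hist()` return the histogram object without drawing, so the resulting breakpoints and counts can be inspected directly.

```r
# Hypothetical house ages; any numeric vector would do.
age <- c(27, 27, 28, 28, 28, 31, 32, 32, 32, 33)

# A single number: only a suggestion, R picks round breakpoints near it.
h_number <- hist(age, breaks = 4, plot = FALSE)

# A vector: the exact edges of the bars.
h_vector <- hist(age, breaks = c(27, 29, 31, 33), plot = FALSE)

# seq(from, to, by): the same edges, with `by` acting as the bar width.
h_seq <- hist(age, breaks = seq(27, 33, by = 2), plot = FALSE)

# A function that computes the vector of breakpoints from the data.
h_fun <- hist(age, breaks = function(x) seq(min(x), max(x), by = 3), plot = FALSE)

print(h_vector$breaks)
print(h_vector$counts)
```

The explicit vector and the `seq()` call describe the same bins here; the single number is the only form where R is free to choose the final edges.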



Histograms for Numerical Data
Changing the tick marks on x axis with `xaxp` argument → allows us to specify the number of intervals between the extreme tick marks

Without `xaxp`:
hist(Home$`House Age`,
     main="Histogram of House Age",
     col="purple",
     xlim = c(min(Home$`House Age`)-1, max(Home$`House Age`)+1),
     xlab="House Age")

With `xaxp`:
hist(Home$`House Age`,
     main="Histogram of House Age",
     col="purple",
     xaxp=c(26,34,8),
     xlim = c(min(Home$`House Age`)-1, max(Home$`House Age`)+1),
     xlab="House Age")

`xaxp`: a vector of the form c(x1, x2, n) giving the coordinates of the extreme tick marks and the number of intervals between tick-marks
Histogram – to frequency distribution tables

[A histogram can be turned into a frequency distribution table for each group of numeric data: cut() splits up the numeric vector into groups (intervals) that can be treated as categorical.]

① H1 <- hist(Home$`House Age`,
     main="Histogram of House Age",
     col="purple",
     xlim = c(min(Home$`House Age`)-1, max(Home$`House Age`)+1),
     xlab="House Age")

② > H1$breaks
[1] 27 28 29 30 31 32 33
[A numeric vector that shows the boundaries of the bars in the histogram.]

③ homeagegp<-cut(Home$`House Age`,H1$breaks, include.lowest=TRUE)
table(homeagegp)

[27,28] (28,29] (29,30] (30,31] (31,32] (32,33]
     22       0       0       0      14       6

cut(x, breaks, include.lowest = FALSE, right = TRUE, dig.lab = 3, …)
x: a numeric vector which is to be converted to a factor by cutting.
breaks: either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
include.lowest: logical, indicating if an ‘x[i]’ equal to the lowest (or highest, for right = FALSE) ‘breaks’ value should be included.
right: logical, indicating if the intervals should be closed on the right (and open on the left) or vice versa.
dig.lab: integer which is used when labels are not given. It determines the number of digits used in formatting the break numbers.

[In short: x is the numeric vector that you want to cut into intervals, breaks defines how the data will be divided into intervals, and include.lowest is a logical value indicating whether the lowest value should be included.]
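To see what `include.lowest` and `right` do in practice, here is a small sketch with made-up ages and the breakpoints 27 to 33. The values are illustrative only and are not taken from the Home data set.

```r
age  <- c(27, 27, 28, 31, 32, 33)   # hypothetical values
brks <- 27:33

# Default: intervals (a, b]; the value 27 sits on the lowest break and becomes NA.
g_default <- cut(age, brks)

# include.lowest = TRUE closes the first interval, giving [27,28].
g_lowest <- cut(age, brks, include.lowest = TRUE)

# right = FALSE gives left-closed intervals [a, b); include.lowest = TRUE
# then closes the last interval instead, giving [32,33].
g_left <- cut(age, brks, right = FALSE, include.lowest = TRUE)

print(table(g_lowest))
print(table(g_left))
```

Without `include.lowest = TRUE`, observations equal to the smallest breakpoint would silently drop out of the frequency table, which is why the slide sets it.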
Histogram (label and density)
Cumulative and Relative Frequencies

• Cumulative frequency is the sum of all previous frequencies up to the current point [for a dataset/table]
• Relative frequency is the proportion of observations associated with each value (or group):
  Relative frequency = Frequency of each group / Total number of observations
• Cumulative Relative Frequency is the proportion of the total number of observations that fall at or below the upper limit of each group [i.e. the sum of all previous relative frequencies up to the current point].
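The three definitions can be sketched in base R as follows. The group labels and counts reuse the House Age frequency table shown in these notes (22, 0, 0, 0, 14, 6); the data frame name `T1` and its column names are assumptions for illustration.

```r
# Grouped counts shaped like the output of table() on the age groups.
T1 <- data.frame(
  Group = c("[27,28]", "(28,29]", "(29,30]", "(30,31]", "(31,32]", "(32,33]"),
  Freq  = c(22, 0, 0, 0, 14, 6)
)

T1$cumfreq    <- cumsum(T1$Freq)            # running total of frequencies
T1$relfreq    <- T1$Freq / sum(T1$Freq)     # share of all observations
T1$cumrelfreq <- cumsum(T1$relfreq)         # running total of the shares

print(T1)
```

The last cumulative frequency equals the total number of observations (42) and the last cumulative relative frequency equals 1, which is a quick sanity check on any such table.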
Cumulative and Relative Frequencies
• Compute Cumulative Frequency, Relative Frequency and Cumulative Relative Frequency

[cumrelfreq = cumulative sum of relfreq = relfreq1 + relfreq2 + ...]

`T1` dataframe

T1.cumfreq <- T1 %>%
  dplyr::mutate(cumfreq = cumsum(Freq),
                relfreq = Freq/sum(Freq),
                cumrelfreq = cumfreq/sum(Freq))

We mutate to add these variables to the data frame:
• cumfreq = cumsum(Freq): the cumulative frequency
• relfreq = Freq/sum(Freq): the frequency of the group over the total number of observations
• cumrelfreq = cumfreq/sum(Freq): the cumulative frequency over the total number of observations
Plotting Cumulative Frequency (Ogive)

☆ Application: creating a plot of the cumulative frequency from the `T1.cumfreq` dataframe

#create vector of y coordinates to plot
cumfreq1 <- c(0, T1.cumfreq$cumfreq) # start with 0
plot(H1$breaks,
     cumfreq1,
     xlab = "Home Age",
     ylab = "Cumulative Frequency",
     main="Cumulative Frequency for Home Age")
lines(H1$breaks, cumfreq1)
Pareto Analysis

• An Italian economist, Vilfredo Pareto, observed in 1906 that a large proportion of wealth in Italy was owned by a small proportion of people.
• Similarly, businesses often find that a large proportion of sales comes from a small percentage of customers, a large percentage of quality defects stems from just a couple of sources, or a large percentage of inventory value corresponds to a small percentage of items.
• A Pareto analysis involves sorting data and calculating cumulative proportions.
Applying the Pareto Principle
[Table: sorted data with columns "Relative Frequencies in %" and "Cumulative Relative Frequencies in %"]

About 80% of the bicycle inventory value comes from 42% (10/24) of items.
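A minimal sketch of the sort-and-accumulate step, using five made-up inventory items rather than the bicycle data from the slide (which is not reproduced in these notes). Column names are assumptions.

```r
# Hypothetical inventory values per item.
inventory <- data.frame(
  item  = c("A", "B", "C", "D", "E"),
  value = c(50, 500, 150, 250, 50)
)

# Sort from largest to smallest value, then accumulate the percentages.
pareto <- inventory[order(inventory$value, decreasing = TRUE), ]
pareto$rel_pct   <- 100 * pareto$value / sum(pareto$value)    # relative frequency in %
pareto$cum_pct   <- cumsum(pareto$rel_pct)                    # cumulative relative frequency in %
pareto$items_pct <- 100 * seq_len(nrow(pareto)) / nrow(pareto) # share of items covered so far

print(pareto)
```

Reading `cum_pct` against `items_pct` row by row shows how much of the total value is covered by the top share of items, which is the comparison the Pareto principle is about.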
Percentiles
• The kth percentile is a value at or below which at least k percent of the observations lie.
• The most common way to compute the kth percentile is to order the data values from smallest to largest and calculate the rank of the kth percentile using the formula: rank = nk/100 + 0.5
Computing Percentiles
• Compute the kth percentile for a variable in a sample of size n
• Rank of kth percentile = nk/100 + 0.5
• n = 94; k = 90
• For the 90th percentile, rank is 94(90)/100 + 0.5 = 85.1 (round to 85)
• The 90th percentile is the value of the 85th observation
Now let’s use R to compute the 32nd, 57th and 98th percentiles for Room Size
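A sketch of both routes in R: the rank formula from the slide, and the built-in `quantile()` function. The vector `x` is a made-up sample of 94 values standing in for Room Size, and `percentile_rank` is a helper name introduced here for illustration.

```r
# Rank of the kth percentile in a sample of size n (formula from the slide).
percentile_rank <- function(n, k) n * k / 100 + 0.5

rank_90 <- percentile_rank(94, 90)   # 85.1
pos_90  <- round(rank_90)            # 85

# Hypothetical sample of 94 evenly spaced values.
x <- seq(10, by = 2.5, length.out = 94)
value_90 <- sort(x)[pos_90]          # value of the 85th ordered observation

# Built-in percentiles; probs are given as proportions, not percentages.
q <- quantile(x, probs = c(0.32, 0.57, 0.98))
print(q)
```

By default `quantile()` interpolates between observations (its `type = 7` method), so its answers can differ slightly from the rank-and-round rule above.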
Quartiles
• Quartiles break the data into four parts.
• 25th percentile is first quartile, Q1;
• 50th percentile is second quartile, Q2;
• 75th percentile is third quartile, Q3; and
• 100th percentile is fourth quartile, Q4.
• One-fourth of the data fall below the first quartile, one-half
are below the second quartile, and three-fourths are below
the third quartile.

Let’s use R to compute the 4 quartiles for Home Size
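A short sketch of the quartile computation with `quantile()`, assuming a made-up vector of nine home sizes in place of the Home Size column.

```r
# Hypothetical home sizes.
size <- c(1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600)

# Q1, Q2, Q3 and Q4 are the 25th, 50th, 75th and 100th percentiles.
quartiles <- quantile(size, probs = c(0.25, 0.50, 0.75, 1.00))
print(quartiles)
```

Q2 is the median and Q4 is simply the maximum, so `median(size)` and `max(size)` give the same two values.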


Contingency Tables
• One of the most basic statistical tools for summarizing categorical data
• A tabular method that displays number of observations in a
data set for different subcategories of two or more categorical
variables.
• Contingency tables can accept numerical variables but grouping
variable must be categorical.
• Subcategories of variables must be mutually exclusive and
exhaustive (i.e. each observation can be classified into only one
subcategory, and, taken together over all subcategories, they
must constitute the complete data set)
Examples of Contingency Tables
→ 3 ways to create these (see slides below)

① Base R function

Constructing a Contingency table for 2 categorical variables

DATA: Home_Market_Value_Type_R (assigned to HomeTR)
• Count number of units by type and region (both categorical variables)
[Arguments: row variable, then column variable]

② dplyr

Constructing a Contingency table for 2 categorical variables

DATA: Home_Market_Value_Type_R (assigned to HomeTR)

• Count number of units by type and region using dplyr::group_by and dplyr::summarize (or dplyr::count)

tab1 <- HomeTR %>%
  group_by(Type, Region) %>%
  summarise(n=n())

OR

tab1 <- HomeTR %>%
  count(Type, Region)
spread() in tidyr (a formatting function)
• Long to Wide dataset
spread() distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.

tab1w <- tab1 %>% spread(key=Region, value=n)

• tab1: dataset
• key=Region: column we want to spread
• value=n: column that contains the values to spread against

gather() in tidyr
• Wide to Long dataset

Rev_tab1w <- gather(tab1w, key=Region, value=n, -`Type`)

• tab1w: dataset we want to gather
• key=Region: column name that we want to gather the columns into
• value=n: column name to gather values into
• -`Type`: column that we don’t want to gather
③ Using the rpivotTable package → needs to be installed and imported

Constructing Contingency Tables using rpivotTable package

Options: Count, Count Unique Values, List Unique Values, Sum, Integer Sum, Average, Sum over Sum, 80% Upper Bound, 80% Lower Bound, Sum as Fraction of Total, Sum as Fraction of Rows, Sum as Fraction of Columns, Count as Fraction of Total, Count as Fraction of Rows, Count as Fraction of Columns

Reference: Help rpivotTable & https://cran.r-project.org/web/packages/rpivotTable/vignettes/rpivotTableIntroduction.html

Constructing Contingency Tables using rpivotTable package
• Count number of units by type and region.
[Figure labels: Sub-Categories of Region, Sub-Categories of Unit Type]

Constructing Contingency Tables using rpivotTable package [manipulating the aggregator name]
• Percentage of units over total by type and region.
[Figure labels: Sub-Categories of Region, Sub-Categories of Unit Type]

Constructing a Pivot table for 3 categorical vars
• Count number of units by type, region, and sub-region.
[Figure label: Sub-Categories of sub-region]

Slicers
• for drilling down to “slice” a PivotTable and display a subset of data
