Data Tabulation and Frequencies
☆ A frequency distribution expresses the frequency of each group of a categorical variable in tabular form.
Example: a table showing the number of observations for each housing type.
☆ Create a frequency distribution from the Home dataset for the variable Type.
METHOD 1
Using “dplyr” package, “group_by” and “summarise” functions
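A minimal sketch of Method 1, assuming the dplyr package is installed. The `home` data frame and its values are hypothetical stand-ins for the Home dataset; only the variable name Type comes from the slides.

```r
library(dplyr)

# Hypothetical stand-in for the Home dataset (illustrative values only)
home <- data.frame(
  Type = c("Condo", "Detached", "Condo", "Terrace", "Detached", "Condo")
)

# Group the rows by the categorical variable, then count the rows in each group
freq_tbl <- home %>%
  group_by(Type) %>%
  summarise(Frequency = n())

print(freq_tbl)
```

Each row of the result is one group of Type, and the Frequency column holds the number of observations in that group.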
☆ Creating a bar plot from the tabulated frequency
Plotting Frequency Distributions for Categorical Data: An Example (Home_Market_Value_Type)
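One possible way to draw such a bar plot in base R is sketched here. The `home_type` vector is hypothetical; the axis labels and title are illustrative choices, not taken from the slides.

```r
# Hypothetical housing types (illustrative values only)
home_type <- c("Condo", "Detached", "Condo", "Terrace", "Detached", "Condo")

# Tabulate the frequencies first, then plot one bar per category
freq <- table(home_type)
bar_mid <- barplot(freq,
                   xlab = "Type",
                   ylab = "Frequency",
                   main = "Frequency of Home Type")
```

barplot() returns the x positions of the bar midpoints, which can be handy for adding text labels above the bars.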
METHOD 2
Using Base R “table” function
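Method 2 can be sketched as follows. The `home_type` vector is a hypothetical stand-in for the Type column of the Home dataset.

```r
# Hypothetical housing types (illustrative values only)
home_type <- c("Condo", "Detached", "Condo", "Terrace", "Detached", "Condo")

# table() counts how many times each distinct value occurs
freq <- table(home_type)
print(freq)

# Convert to a data frame if a tabular object is needed for later steps
freq_df <- as.data.frame(freq)
print(freq_df)
```

The data frame version has one row per category, with the counts in a column named Freq.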
Frequency Distributions for Numeric Data
• Histogram: A graphical depiction of a
frequency distribution for numerical data in
the form of a bar chart
• Terminologies:
• Class: a category for grouping data [each bar represents a class]
• Frequency: number of data values in a class [represented by the height of the bar]
Histogram
• Plotting histogram using “hist”
function
• “breaks” parameter: specifies the bins, and hence the width of each bar
(default value is based on Sturges’s rule)
Sturges’s Rule:
k = 1 + 3.322(log10 n) (k is the number of classes; n is the size of the data)
Eg: k = 1 + 3.322(log10 42) = 1 + 5.39 = 6.39 --> 7
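The Sturges calculation can be checked in R with a short sketch. The `homeage` values are hypothetical (chosen only so that n = 42). Note that hist() treats the class count as a suggestion and then picks "pretty" breakpoints, so the number of bars actually drawn can differ from k.

```r
# Hypothetical home ages: n = 42 observations (illustrative values only)
homeage <- c(rep(27, 10), rep(28, 12), rep(32, 14), rep(33, 6))
n <- length(homeage)

k_exact <- 1 + 3.322 * log10(n)  # Sturges's rule before rounding
k <- ceiling(k_exact)            # round up to a whole number of classes

# Base R's own implementation uses the equivalent form ceiling(log2(n) + 1)
k_r <- nclass.Sturges(homeage)

# hist() uses Sturges's rule by default; plot = FALSE returns the bins only
h <- hist(homeage, plot = FALSE)
print(c(k_exact = k_exact, k = k, k_r = k_r))
print(h$breaks)  # the breakpoints hist() actually chose
print(h$counts)  # the frequency in each bin
```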
How hist chooses the bins by default:
1) R computes the suggested number of bins, k, from Sturges’s rule
2) R divides the range of the data by the number of bins to find the bars’ width
3) R creates bins of equal width, starting from the minimum
4) The number of data points that fall within each bin is counted and represents the frequency (height) for each bin
Alternatively, pass a vector of breakpoints, eg. breaks = c(0, 10, ...), to specify the edges of the bars explicitly [the first bar then runs from 0-10].
The width of each bar corresponds to the distance between consecutive breakpoints.
The height of each bar represents the number of data points in that bin.
Note: a value equal to a breakpoint is counted in only one of the two adjacent bins. With the default right = TRUE, bins are right-inclusive, so a value of 10 is counted in bin 0-10, NOT 10-30; with right = FALSE, bins are left-inclusive and it is counted in bin 10-30.
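The effect of explicit breakpoints, and of which bin edge is inclusive, can be seen in a small sketch. Both the data `x` and the breakpoints `brk` are hypothetical values picked for illustration.

```r
x   <- c(0, 5, 10, 10, 20, 30, 45, 60)  # hypothetical data
brk <- c(0, 10, 30, 60)                 # hypothetical bar edges

# Default right = TRUE: bins are (a, b], so 10 lands in the 0-10 bin
h_right <- hist(x, breaks = brk, plot = FALSE)

# right = FALSE: bins are [a, b), so 10 lands in the 10-30 bin
h_left <- hist(x, breaks = brk, right = FALSE, plot = FALSE)

print(h_right$counts)  # 4 2 2
print(h_left$counts)   # 2 3 3
```

In both cases include.lowest = TRUE (the hist default) keeps the outermost value, 0 or 60, inside the first or last bin.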
☆ cut() splits up a numeric vector into intervals that can be treated as categorical, eg. [27,28], (28,29], ..., (32,33]
cut(x, breaks, include.lowest = FALSE, right = TRUE, dig.lab = 3, …)
• x: a numeric vector which is to be converted to a factor by cutting
• breaks: either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.
• include.lowest: logical, indicating if an ‘x[i]’ equal to the lowest (or highest, for right = FALSE) ‘breaks’ value should be included.
• right: logical, indicating if the intervals should be closed on the right (and open on the left) or vice versa.
• dig.lab: integer which is used when labels are not given. It determines the number of digits used in formatting the break numbers.

table(homeagegp)
[27,28] (28,29] (29,30] (30,31] (31,32] (32,33]
     22       0       0       0      14       6
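A sketch of how a table like the one above could be produced with cut(). The name `homeagegp` and the bin counts come from the slides; the raw `homeage` vector is hypothetical, and the split between ages 27 and 28 is an assumption chosen so the counts match.

```r
# Hypothetical home ages (the 27/28 split is assumed, not from the source)
homeage <- c(rep(27, 10), rep(28, 12), rep(32, 14), rep(33, 6))

# Cut at every integer from 27 to 33; include.lowest = TRUE closes the
# first interval on the left so that age 27 is not dropped
homeagegp <- cut(homeage, breaks = 27:33, include.lowest = TRUE)

# The result is a factor, so table() can tabulate it like any category
print(table(homeagegp))
```

Without include.lowest = TRUE the first interval would be (27,28], and every observation equal to 27 would become NA.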
Histogram (label and density)
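One reading of "label and density" is sketched below: labels = TRUE prints the count above each bar, and freq = FALSE switches the y axis from counts to density. This interpretation and the `homeage` values are assumptions.

```r
# Hypothetical home ages (illustrative values only)
homeage <- c(rep(27, 10), rep(28, 12), rep(32, 14), rep(33, 6))

# labels = TRUE writes each bar's frequency above the bar
hist(homeage, labels = TRUE,
     xlab = "Home Age", main = "Home Age (frequency)")

# freq = FALSE plots density, so the bar areas add up to 1
h <- hist(homeage, freq = FALSE,
          xlab = "Home Age", main = "Home Age (density)")
total_area <- sum(h$density * diff(h$breaks))
print(total_area)
```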
Cumulative and Relative Frequencies
• Relative Frequency = Frequency / Total Number of observations
• Cumulative Relative Frequency is the proportion of the total number of observations that fall at or below the upper limit of each group, i.e. the sum of all relative frequencies up to the current point.
Cumulative and Relative Frequencies
• Compute Cumulative Frequency, Relative Frequency and Cumulative Relative Frequency
`T1` dataframe: the frequency table
cumfreq = cumulative sum of freq = cumsum(freq)
relfreq = freq / sum(freq)
cumrelfreq = relfreq1 + relfreq2 + ... = cumsum(relfreq)
`T1.cumfreq` dataframe: holds the cumulative frequency (column cumfreq)
☆ Creating a plot for cumulative frequency: syntax of hist, plot, lines
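# Create the vector of y coordinates to plot.
# H1 is assumed to be the object returned by hist(); T1.cumfreq holds the cumulative frequencies.
cumfreq1 <- c(0, T1.cumfreq$cumfreq)  # start with 0 so the length matches H1$breaks
plot(H1$breaks,
     cumfreq1,
     xlab = "Home Age",
     ylab = "Cumulative Frequency",
     main = "Cumulative Frequency for Home Age")
lines(H1$breaks, cumfreq1)  # join the points with line segments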
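A self-contained sketch of the whole computation, from histogram to cumulative frequency plot. The `homeage` values are hypothetical, the column names freq, relfreq and cumrelfreq are assumptions, and a single data frame `T1` is used here instead of separate `T1` and `T1.cumfreq` objects.

```r
# Hypothetical home ages (illustrative values only)
homeage <- c(rep(27, 10), rep(28, 12), rep(32, 14), rep(33, 6))

# Bin the data without drawing, then build the frequency table
H1 <- hist(homeage, plot = FALSE)
T1 <- data.frame(upper = H1$breaks[-1], freq = H1$counts)

T1$cumfreq    <- cumsum(T1$freq)          # running total of frequencies
T1$relfreq    <- T1$freq / sum(T1$freq)   # share of all observations
T1$cumrelfreq <- cumsum(T1$relfreq)       # running total of the shares
print(T1)

# One y value per breakpoint: 0 at the lowest break, then the running totals
cumfreq1 <- c(0, T1$cumfreq)
plot(H1$breaks, cumfreq1,
     xlab = "Home Age",
     ylab = "Cumulative Frequency",
     main = "Cumulative Frequency for Home Age")
lines(H1$breaks, cumfreq1)
```

The last cumulative frequency equals the number of observations, and the last cumulative relative frequency equals 1, which is a quick sanity check on the table.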
Pareto Analysis
About 80% of the bicycle inventory value comes from 42% (10/24) of items.
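A Pareto analysis of this kind can be sketched by sorting items from most to least valuable and accumulating their share of the total. The inventory values below are hypothetical and are not the bicycle data from the slide.

```r
# Hypothetical inventory values for six items (illustrative only)
value <- c(60, 500, 20, 350, 40, 30)

# Sort from largest to smallest, then accumulate the percentages
sorted        <- sort(value, decreasing = TRUE)
cum_pct_value <- 100 * cumsum(sorted) / sum(sorted)
cum_pct_items <- 100 * seq_along(sorted) / length(sorted)

pareto <- data.frame(value = sorted, cum_pct_items, cum_pct_value)
print(pareto)

# Smallest number of top items that accounts for at least 80% of total value
n_items <- which(cum_pct_value >= 80)[1]
print(n_items)
```

In this made-up example the top 2 of 6 items (about 33%) carry 85% of the value.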
Percentiles
• The kth percentile is a value at or below which at least k percent of the observations lie.
• The most common way to compute the kth percentile is to order the data values from smallest to largest and calculate the rank of the kth percentile using the formula: rank = nk/100 + 0.5, where n is the sample size.
Computing Percentiles
• Compute the kth percentile for a variable with sample size n
• Rank of kth percentile = nk/100 + 0.5
• Example: n = 94; k = 90
• For the 90th percentile, the rank is 94(90)/100 + 0.5 = 85.1 (round to 85)
• The 90th percentile is the value of the 85th observation in the sorted data
Now let’s use R to compute the 32nd, 57th and 98th percentiles for Room Size
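A sketch of both approaches: the rank formula by hand, then R's quantile(). The `room_size` values are randomly generated stand-ins for the Room Size variable. quantile() interpolates between observations by default (type 7), so its answers can differ slightly from the rank-formula value.

```r
# Rank formula from the slide
n <- 94
k <- 90
rank <- n * k / 100 + 0.5   # 85.1
pos  <- round(rank)         # 85

# Hypothetical Room Size data with 94 observations (illustrative only)
set.seed(42)
room_size <- round(runif(n, min = 10, max = 40), 1)

# By hand: sort the data and take the observation at the computed position
p90_manual <- sort(room_size)[pos]

# Built-in: probabilities are given as fractions, not percentages
pct <- quantile(room_size, probs = c(0.32, 0.57, 0.98))
print(p90_manual)
print(pct)
```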
Quartiles
• Quartiles break the data into four parts.
• 25th percentile is first quartile, Q1;
• 50th percentile is second quartile, Q2;
• 75th percentile is third quartile, Q3; and
• 100th percentile is fourth quartile, Q4.
• One-fourth of the data fall below the first quartile, one-half
are below the second quartile, and three-fourths are below
the third quartile.
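Since quartiles are just particular percentiles, they can be obtained with quantile() as well. The data vector here is hypothetical.

```r
# Hypothetical data, nine observations (illustrative only)
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12)

# Q1, Q2, Q3 and Q4 are the 25th, 50th, 75th and 100th percentiles
q <- quantile(x, probs = c(0.25, 0.50, 0.75, 1.00))
print(q)

# Q2 is the median and Q4 is the maximum
print(median(x))
print(max(x))
```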
Contingency Tables
• One of the most basic statistical tools for summarizing categorical data
• A tabular method that displays the number of observations in a data set for different subcategories of two or more categorical variables.
• Contingency tables can accept numerical variables, but the grouping variable must be categorical.
• Subcategories of variables must be mutually exclusive and
exhaustive (i.e. each observation can be classified into only one
subcategory, and, taken together over all subcategories, they
must constitute the complete data set)
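A minimal contingency table in base R, assuming a hypothetical `home` data frame with two categorical variables, Type and Region (the values are made up).

```r
# Hypothetical data: one row per home (illustrative values only)
home <- data.frame(
  Type   = c("Condo", "Condo", "Detached", "Detached", "Terrace", "Condo"),
  Region = c("North", "South", "North", "North", "South", "North")
)

# Two-way table: rows are Type, columns are Region
ct <- table(home$Type, home$Region)
print(ct)

# Add row and column totals, or convert the counts to proportions of the total
print(addmargins(ct))
print(prop.table(ct))
```

Every home falls into exactly one cell, so the cell counts add up to the number of rows in the data.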
Examples of Contingency Tables
3 ways to create these in R (slides below):
① Base function
OR
② spread() in tidyr
• Long to Wide dataset
spread() distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
Arguments:
• Dataset
• Column that contains values to spread against
• Column we want to spread
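A small long-to-wide sketch, assuming the tidyr package is installed. The `long` data frame and its column names are hypothetical.

```r
library(tidyr)

# Hypothetical long dataset: one row per Region/Type combination
long <- data.frame(
  Region = c("North", "North", "South", "South"),
  Type   = c("Condo", "Detached", "Condo", "Detached"),
  Units  = c(5, 3, 2, 4)
)

# key: column whose values become the new column names
# value: column whose values fill the new cells
wide <- spread(long, key = Type, value = Units)
print(wide)
```

The result has one row per Region and one column per Type, which is the layout of a contingency table.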
gather() in tidyr
• Wide to Long dataset
Arguments:
• Dataset
• Column name that we want to gather the columns into
• Column name to gather values into
• Column that we don’t want to gather
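The reverse direction can be sketched the same way, again assuming tidyr is installed. The `wide` data frame and the new column names Type and Units are hypothetical.

```r
library(tidyr)

# Hypothetical wide dataset: one column per housing type
wide <- data.frame(
  Region   = c("North", "South"),
  Condo    = c(5, 2),
  Detached = c(3, 4)
)

# key: name of the new column that receives the old column names
# value: name of the new column that receives the cell values
# -Region: keep Region as it is instead of gathering it
long <- gather(wide, key = "Type", value = "Units", -Region)
print(long)
```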
③ Using the rpivotTable package → needs to be installed and imported
Constructing Contingency Tables using the rpivotTable package
https://cran.r-project.org/web/packages/rpivotTable/vignettes/rpivotTableIntroduction.html
Manipulating the aggregator name: Percentage of units over total by type and region.
Figure labels: Sub-Categories of Region; Sub-Categories of sub-region
Slicers
• for drilling down to “slice” a PivotTable and display a subset of data