Frequency
This discussion of frequency tables restricts itself to the 1-dimensional variety, which in theory each of us learnt in primary school and could calculate manually, given time. But they are such a common tool, useful for all sorts of data validation and exploratory data analysis jobs, that finding a nice implementation can save a lifetime's worth of time and sanity otherwise spent counting how many things are of which type.
Whenever a variable takes a limited number of different values in R, you can get a quick summary of the data by calculating a frequency table. A frequency table lists each distinct value alongside the number of times it occurs.
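As a minimal sketch of the idea (the colour vector here is purely illustrative):

```r
# A made-up vector with a limited number of distinct values.
x <- c("red", "blue", "red", "green", "red")

# table() counts how many times each value occurs.
table(x)
# x
#  blue green   red
#     1     1     3
```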
Creating a Table in R
You can tabulate, for example, the number of cars with a manual and an automatic gearbox:
> table(mtcars$am)
 0  1 
19 13 
This outcome tells you that your data contains 19 cars with an automatic gearbox (am = 0) and 13 cars with a manual gearbox (am = 1).
A glance at the gear column shows the number of forward gears per car:
> head(mtcars["gear"])
                  gear
Mazda RX4            4
Mazda RX4 Wag        4
Datsun 710           4
Hornet 4 Drive       3
Hornet Sportabout    3
Valiant              3
What are the possible values of “gear”? Let’s use the factor() function to find out.
> factor(mtcars$gear)
[1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
Levels: 3 4 5
The cars in this data set have either 3, 4 or 5 forward gears. How many cars are there for each
number of forward gears?
Why the table() function does not work well
The table() function in base R does give the counts of a categorical variable, but the output is not a data
frame – it's a table, and it's not as easily accessible as a data frame.
> w = table(mtcars$gear)
> w
3 4 5
15 12 5
> class(w)
[1] "table"
You can convert this to a data frame, but the result does not retain the variable name “gear” in the
corresponding column name.
> t = as.data.frame(w)
> t
Var1 Freq
1 3 15
2 4 12
3 5 5
You could rename the column yourself:
> names(t)[1] = "gear"
> t
  gear Freq
1    3   15
2    4   12
3    5    5
A more convenient option is the count() function from the plyr package.
Install the plyr package if needed, then call its library, and the count() function will be ready for use.
> library(plyr)
> y = count(mtcars, 'gear')
> y
  gear freq
1    3   15
2    4   12
3    5    5
> class(y)
[1] "data.frame"
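If you prefer to stay in base R, a small trick gets the same tidy result: naming the argument inside table() sets the dimension name, which as.data.frame() then keeps as the column name. A sketch, not part of the original tutorial:

```r
# Naming the table dimension "gear" makes as.data.frame() keep that column name.
y <- as.data.frame(table(gear = mtcars$gear))
y
#   gear Freq
# 1    3   15
# 2    4   12
# 3    5    5
```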
Frequency Distribution
The frequency distribution of a data variable is a summary of the data occurrence in a collection of non-overlapping categories. To find the frequency distribution of the eruption durations in the built-in data set faithful:
1. We first find the range of eruption durations with the range function. It shows that the
observed eruptions are between 1.6 and 5.1 minutes in duration.
> duration = faithful$eruptions
> range(duration)
[1] 1.6 5.1
2. Break the range into non-overlapping sub-intervals by defining a sequence of equal distance
break points. If we round the endpoints of the interval [1.6, 5.1] to the closest half-integers,
we come up with the interval [1.5, 5.5]. Hence we set the break points to be the half-integer
sequence { 1.5, 2.0, 2.5, ... }.
> breaks = seq(1.5, 5.5, by=0.5) # half-integer sequence
> breaks
[1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
3. Classify the eruption durations according to the half-unit-length sub-intervals with cut. As
the intervals are to be closed on the left, and open on the right, we set the right argument
as FALSE.
> duration.cut = cut(duration, breaks, right=FALSE)
4. Compute the frequency of eruptions in each sub-interval with the table function.
> duration.freq = table(duration.cut)
Enhanced Solution
We apply the cbind function to print the result in column format.
> cbind(duration.freq)
duration.freq
[1.5,2) 51
[2,2.5) 41
[2.5,3) 5
[3,3.5) 7
[3.5,4) 30
[4,4.5) 73
[4.5,5) 61
[5,5.5) 4
Note
Per R documentation, you are advised to use the hist function to find the frequency distribution for
performance reasons.
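The note above can be sketched as follows: with plot=FALSE, hist computes the same class counts as the table/cut approach without drawing anything (breaks as defined earlier):

```r
duration <- faithful$eruptions
breaks <- seq(1.5, 5.5, by = 0.5)

# hist() with plot=FALSE returns the class counts without drawing a chart.
h <- hist(duration, breaks = breaks, right = FALSE, plot = FALSE)
h$counts
# [1] 51 41  5  7 30 73 61  4
```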
Exercise
1. Find the frequency distribution of the eruption waiting periods in faithful.
2. Find programmatically the duration sub-interval that has the most eruptions.
Histogram
A histogram consists of parallel vertical bars that graphically shows the frequency distribution of a
quantitative variable. The area of each bar is equal to the frequency of items found in each class.
Example
In the data set faithful, the histogram of the eruptions variable is a collection of parallel vertical bars
showing the number of eruptions classified according to their durations.
Problem
Find the histogram of the eruption durations in faithful.
Solution
We apply the hist function to produce the histogram of the eruptions variable.
> duration = faithful$eruptions
> hist(duration, # apply the hist function
+ right=FALSE) # intervals closed on the left
Enhanced Solution
To colorize the histogram, we select a color palette and set it in the col argument of hist. In addition,
we update the titles for readability.
> colors = c("red", "yellow", "green", "violet", "orange",
+ "blue", "pink", "cyan")
> hist(duration, # apply the hist function
+ right=FALSE, # intervals closed on the left
+ col=colors, # set the color palette
+ main="Old Faithful Eruptions", # the main title
+ xlab="Duration minutes") # x-axis label
Exercise
Find the histogram of the eruption waiting period in faithful.
Other Ways
1
dat <- data.frame(Treenumber=c(3,3,3,3,6,6,6,9,10,11,11,12,12,12))
dat$Freq <- table(dat$Treenumber)[as.character(dat$Treenumber)]
2
library(dplyr)
dat %>%
  add_count(Treenumber)
Relative Frequency Distribution
The relative frequency distribution of a data variable is a summary of the frequency proportion in a collection of non-overlapping categories.
Example
In the data set faithful, the relative frequency distribution of the eruptions variable shows the
frequency proportion of the eruptions according to a duration classification.
Problem
Find the relative frequency distribution of the eruption durations in faithful.
Solution
We first find the frequency distribution of the eruption durations as follows. Further details can be
found in the Frequency Distribution tutorial.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)
Then we find the sample size of faithful with the nrow function, and divide the frequency distribution
by it. As a result, the relative frequency distribution is:
> duration.relfreq = duration.freq / nrow(faithful)
Enhanced Solution
We can print with fewer digits and make it more readable by setting the digits option.
> old = options(digits=1)
> duration.relfreq
duration.cut
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
0.19 0.15 0.02 0.03 0.11 0.27 0.22
[5,5.5)
0.01
> options(old) # restore the old option
We then apply the cbind function to print both the frequency distribution and relative frequency
distribution in parallel columns.
> old = options(digits=1)
> cbind(duration.freq, duration.relfreq)
duration.freq duration.relfreq
[1.5,2) 51 0.19
[2,2.5) 41 0.15
[2.5,3) 5 0.02
[3,3.5) 7 0.03
[3.5,4) 30 0.11
[4,4.5) 73 0.27
[4.5,5) 61 0.22
[5,5.5) 4 0.01
> options(old) # restore the old option
Exercise
Find the relative frequency distribution of the eruption waiting periods in faithful.
Cumulative Frequency Distribution
The cumulative frequency distribution of a quantitative variable is a summary of data frequency
below a given level.
Example
In the data set faithful, the cumulative frequency distribution of the eruptions variable shows
the total number of eruptions whose durations are less than or equal to a set of chosen levels.
Problem
Find the cumulative frequency distribution of the eruption durations in faithful.
Solution
We first find the frequency distribution of the eruption durations as follows. Further details can be
found in the Frequency Distribution tutorial.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)
We then apply the cumsum function to compute the cumulative frequency distribution.
> duration.cumfreq = cumsum(duration.freq)
Enhanced Solution
We apply the cbind function to print the result in column format.
> cbind(duration.cumfreq)
duration.cumfreq
[1.5,2) 51
[2,2.5) 92
[2.5,3) 97
[3,3.5) 104
[3.5,4) 134
[4,4.5) 207
[4.5,5) 268
[5,5.5) 272
Exercise
Find the cumulative frequency distribution of the eruption waiting periods in faithful.
Cumulative Frequency Graph
A cumulative frequency graph plots the cumulative frequency distribution against the class boundaries. We compute the cumulative frequency of the eruption durations with cumsum, add a starting zero element, and plot the
graph.
> cumfreq0 = c(0, cumsum(duration.freq))
> plot(breaks, cumfreq0, # plot the data
+ main="Old Faithful Eruptions", # main title
+ xlab="Duration minutes", # x−axis label
+ ylab="Cumulative eruptions") # y−axis label
> lines(breaks, cumfreq0) # join the points
Exercise
Find the cumulative frequency graph of the eruption waiting periods in faithful.
Cumulative Relative Frequency Distribution
The cumulative relative frequency distribution of a quantitative variable is a summary of the frequency proportion below a given level. The relationship between cumulative frequency and cumulative relative frequency is:
cumulative relative frequency = cumulative frequency / sample size
Example
In the data set faithful, the cumulative relative frequency distribution of the eruptions variable shows
the frequency proportion of eruptions whose durations are less than or equal to a set of chosen
levels.
Problem
Find the cumulative relative frequency distribution of the eruption durations in faithful.
Solution
We first find the frequency distribution of the eruption durations as follows. Further details can be
found in the Frequency Distribution tutorial.
> duration = faithful$eruptions
> breaks = seq(1.5, 5.5, by=0.5)
> duration.cut = cut(duration, breaks, right=FALSE)
> duration.freq = table(duration.cut)
We then apply the cumsum function to compute the cumulative frequency distribution.
> duration.cumfreq = cumsum(duration.freq)
Then we find the sample size of faithful with the nrow function, and divide the cumulative frequency
distribution by it. As a result, the cumulative relative frequency distribution is:
> duration.cumrelfreq = duration.cumfreq / nrow(faithful)
Enhanced Solution
We can print with fewer digits and make it more readable by setting the digits option.
> old = options(digits=2)
> duration.cumrelfreq
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5)
0.19 0.34 0.36 0.38 0.49 0.76 0.99
[5,5.5)
1.00
> options(old) # restore the old option
We then apply the cbind function to print both the cumulative frequency distribution and cumulative
relative frequency distribution in parallel columns.
> old = options(digits=2)
> cbind(duration.cumfreq, duration.cumrelfreq)
duration.cumfreq duration.cumrelfreq
[1.5,2) 51 0.19
[2,2.5) 92 0.34
[2.5,3) 97 0.36
[3,3.5) 104 0.38
[3.5,4) 134 0.49
[4,4.5) 207 0.76
[4.5,5) 268 0.99
[5,5.5) 272 1.00
> options(old)
Exercise
Find the cumulative relative frequency distribution of the eruption waiting periods in faithful.
Cumulative Relative Frequency Graph
Alternative Solution
We create an empirical cumulative distribution function Fn with the built-in function ecdf, then plot Fn right away.
There is no need to compute the cumulative frequency distribution beforehand.
> Fn = ecdf(duration)
> plot(Fn,
+ main="Old Faithful Eruptions",
+ xlab="Duration minutes",
+ ylab="Cumulative eruption proportion")
Exercise
Find the cumulative relative frequency graph of the eruption waiting periods in faithful.
Here’s the top of an example dataset. Imagine a “tidy” dataset, such that each row is one
observation. I would like to know how many observations (e.g. people) are of which type
(e.g. demographic – here a category between A and E inclusive).
TYPE PERSON ID
E 1
E 2
B 3
B 4
B 5
B 6
C 7
I want to be able to say things like: “4 of my records are of type E”, or “10% of my records
are of type A”. The dataset I will use in my below example is similar to the above table, only
with more records, including some with a blank (missing) type.
What would I like my 1-dimensional frequency table tool to do in an ideal world?
table(data$Type)
A super simple way to count up the number of records by type. But it doesn’t show
percentages or any sort of cumulation. By default it hasn’t highlighted that there are some
records with missing data, although it does have a useNA parameter that will show them if
desired.
The output also isn’t tidy and doesn’t work well with knitr.
The table command can be wrapped in the prop.table command to show proportions.
prop.table(table(data$Type))
But you’d need to run both commands to understand the count and percentages, and the
latter inherits many of the limitations from the former.
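A base-R sketch that shows counts and percentages side by side in one tidy data frame (the Type vector here is a made-up stand-in; with the real data you would pass data$Type):

```r
# Illustrative stand-in for data$Type, including a missing value.
Type <- c("A", "B", "B", "E", "E", "E", NA)

tab <- table(Type, useNA = "ifany")   # keep the NA group visible
data.frame(Type = names(tab),
           n    = as.vector(tab),
           pct  = round(100 * as.vector(prop.table(tab)), 1))
```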
Install required packages first, if they are missing.
wants <- c("epitools")   # the package used later in this section
has   <- wants %in% rownames(installed.packages())
if(any(!has)) install.packages(wants[!has])
Absolute frequencies
> set.seed(123)
> (myLetters <- sample(LETTERS[1:5], 12, replace=TRUE))
 [1] "B" "D" "C" "E" "E" "A" "C" "E" "C" "C" "E" "C"
> (tab <- table(myLetters))
myLetters
A B C D E 
1 1 5 1 4 
> names(tab)
[1] "A" "B" "C" "D" "E"
> tab["B"]
B 
1 
> barplot(tab, main="Counts")
> (relFreq <- prop.table(tab))
myLetters
      A       B       C       D       E 
0.08333 0.08333 0.41667 0.08333 0.33333 
> cumsum(relFreq)
      A       B       C       D       E 
0.08333 0.16667 0.58333 0.66667 1.00000 
A factor can contain levels that never occur in the data; table() still counts them with frequency 0.
> (letFac <- factor(myLetters, levels=c(LETTERS[1:5], "Q")))
 [1] B D C E E A C E C C E C
Levels: A B C D E Q
> table(letFac)
letFac
A B C D E Q 
1 1 5 1 4 0 
Counting runs
> (vec <- rep(rep(c("f", "m"), 3), c(1, 3, 2, 4, 1, 2)))
 [1] "f" "m" "m" "m" "f" "f" "m" "m" "m" "m" "f" "m" "m"
> (res <- rle(vec))
Run Length Encoding
  lengths: int [1:6] 1 3 2 4 1 2
  values : chr [1:6] "f" "m" "f" "m" "f" "m"
> length(res$lengths)
[1] 6
> inverse.rle(res)
 [1] "f" "m" "m" "m" "f" "f" "m" "m" "m" "m" "f" "m" "m"
Contingency tables for two variables
> N <- 10
> (sex <- factor(sample(c("f", "m"), N, replace=TRUE)))
 [1] m m f m f f f m m m
Levels: f m
> (work <- factor(sample(c("home", "office"), N, replace=TRUE)))
 [1] office office office office office office home home office office
Levels: home office
> (cTab <- table(sex, work))
   work
sex home office
  f    1      3
  m    1      5
> summary(cTab)
Number of cases in table: 10 
Number of factors: 2 
Using xtabs()
> (persons <- data.frame(sex, work, counts=sample(0:5, N, replace=TRUE)))
   sex   work counts
1    m office      4
2    m office      4
3    f office      0
4    m office      2
5    f office      4
6    f office      1
7    f   home      1
8    m   home      1
9    m office      0
10   m office      2
> xtabs(~ sex + work, data=persons)
   work
sex home office
  f    1      3
  m    1      5
> xtabs(counts ~ sex + work, data=persons)
   work
sex home office
  f    1      5
  m    1     12
Marginal sums and means
> apply(cTab, MARGIN=1, FUN=sum)
f m 
4 6 
> colMeans(cTab)
  home office 
     1      4 
> addmargins(cTab, c(1, 2), FUN=mean)
Margins computed over dimensions
in the following order:
1: sex
2: work
      work
sex    home office mean
  f     1.0    3.0  2.0
  m     1.0    5.0  3.0
  mean  1.0    4.0  2.5
Relative frequencies
> (relFreq <- prop.table(cTab))
   work
sex home office
  f 0.1    0.3
  m 0.1    0.5
Conditional relative frequencies within each row (margin 1) or column (margin 2):
> prop.table(cTab, 1)
   work
sex   home office
  f 0.2500 0.7500
  m 0.1667 0.8333
> prop.table(cTab, 2)
   work
sex  home office
  f 0.500  0.375
  m 0.500  0.625
Flat contingency tables for more than two variables
> (group <- factor(sample(c("A", "B"), 10, replace=TRUE)))
 [1] A A A A A A A B A A
Levels: A B
> ftable(work, sex, group, row.vars="work", col.vars=c("sex", "group"))
       sex    f     m  
       group  A  B  A B
work                   
home          1  0  0 1
office        3  0  5 0
> library(epitools)
> expand.table(cTab)
   sex   work
1    f   home
2    f office
3    f office
4    f office
5    m   home
6    m office
7    m office
8    m office
9    m office
10   m office
> as.data.frame(cTab, stringsAsFactors=TRUE)
  sex   work Freq
1   f   home    1
2   m   home    1
3   f office    3
4   m office    5
Percentile rank
> (vec <- round(rnorm(10), 2))
 [1]  0.84  0.15 -1.14  1.25  0.43 -0.30  0.90  0.88  0.82  0.69
> Fn <- ecdf(vec)
> Fn(vec)
 [1] 0.7 0.3 0.1 1.0 0.4 0.2 0.9 0.8 0.6 0.5
> 100 * Fn(0.1)
[1] 20
> Fn(sort(vec))
 [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> knots(Fn)
 [1] -1.14 -0.30  0.15  0.43  0.69  0.82  0.84  0.88  0.90  1.25
tabyl, from the janitor package
Documentation
library(janitor)
tabyl(data$Type)
This is a pretty good start! By default, it shows counts, percents, and percent of non-missing
data. It can optionally sort in order of frequency. The output is tidy, and works with kable
just fine. The only thing missing really is a cumulative percentage option. But it’s a great
improvement over base table.
I do find myself constantly misspelling “tabyl” as “taybl” though, which is annoying, but not
really something I can really criticise anyone else for.
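The missing cumulative percentage is easy to bolt on in base R, computed after sorting so that the running total follows the displayed order (a sketch; the Type vector is an illustrative stand-in for data$Type):

```r
# Illustrative stand-in for data$Type.
Type <- c("E", "E", "E", "B", "B", "A")

# Sort first, then cumulate, so the running total follows the displayed order.
tab <- sort(table(Type), decreasing = TRUE)
df <- data.frame(Type = names(tab), n = as.vector(tab))
df$percent <- df$n / sum(df$n)
df$cum_percent <- cumsum(df$percent)
df
#   Type n   percent cum_percent
# 1    E 3 0.5000000   0.5000000
# 2    B 2 0.3333333   0.8333333
# 3    A 1 0.1666667   1.0000000
```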
freq, from the descr package
Documentation
library(descr)
freq(data$Type)
However it isn’t very tidy by default, and doesn’t work with knitr. I also don’t really like the
column names it assigns, although one can certainly claim that’s pure personal preference.
A greater issue may be that the cumulative columns don’t seem to work as I would expect
when the table is sorted, as in the above example. The first entry in the table is “E”,
because that’s the largest category. However, it isn’t 100% of the non-missing dataset, as
you might infer from the fifth numerical column. In reality it’s 31.7%, per column 4.
As far as I can tell, the function is working out the cumulative frequencies before sorting the
table – so as category E is the last category in the data file it has calculated that by the time
you reach the end of category E you have 100% of the non-missing data in hand. I can’t
envisage a situation where you would want this behaviour, but I’m open to correction if
anyone can.
freq, from the summarytools package
Documentation
library(summarytools)
This looks pretty great. Has all the variations of counts, percents and missing-data output I
want – here you can interpret the “% valid” column as “% of all non-missing”. Very readable
in the console, and works well with Knitr. In fact it has some further nice formatting options
that I wasn’t particularly looking for.
It is pretty much tidy, although it has a minor niggle in that the output always includes the total
row. It’s often important to know your totals, but if you’re piping the output to other tools or charts,
you may have to use another command to filter that row out each time, as there doesn’t
seem to be an obvious way to prevent it being included with the rest of the dataset when
running it directly.
Update 2018-04-28: thanks to Roland in the comments below for pointing out that a new feature to
disable the totals display has been added: set the “totals” parameter to false, and the totals row
won’t show up, potentially making it easier to pass on for further analysis.
CrossTable, from the gmodels library
Documentation
library(gmodels)
CrossTable(data$Type)
Here the results are displayed in a horizontal format, a bit like base table. The proportions
are clearly shown, albeit without a cumulative version. It doesn’t highlight
that there are missing values, and isn’t “tidy”. You can get it to display a vertical version
(add the parameter max.width = 1) which is visually distinctive, but untidy in the usual
R tidyverse sense.
It’s not a great tool for my particular requirements here, but most likely this is because, as
you may guess from the command name, it’s not particularly designed for 1-way frequency
tables. If you are crosstabulating multiple dimensions it may provide a powerful and visually
accessible way to see counts, proportions and even run hypothesis tests.
It is mostly tidy, but also has an annoyance in that the category values themselves (A–E) are
row labels rather than a standalone column. This means you may have to pop them into a
new column for best use in any downstream tidy tools. That’s easy enough with
e.g. dplyr’s add_rownames command. But that is another processing step to remember, which
is not a huge selling point.
There is a total row at the bottom, but it’s optional, so just don’t use the “total” parameter if
you plan to pass the data onwards in a way where you don’t want to risk double-counting
your totals. There’s an “exclude” parameter if you want to remove any particular categories
from analysis before performing the calculations as well as a couple of extra formatting
options that might be handy.
Stem-and-Leaf Plot
A stem-and-leaf plot of a quantitative variable is a textual graph that classifies data items
according to their most significant numeric digits. In addition, we often merge each alternating row
with its next row in order to simplify the graph for readability.
Example
In the data set faithful, a stem-and-leaf plot of the eruptions variable identifies durations with the
same two most significant digits, and queues them up in rows.
Problem
Find the stem-and-leaf plot of the eruption durations in faithful.
Solution
We apply the stem function to compute the stem-and-leaf plot of eruptions.
> duration = faithful$eruptions    # the eruption durations
> stem(duration)                   # apply the stem function
Answer
The decimal point is 1 digit(s) to the left of the |
16 | 070355555588
18 | 000022233333335577777777888822335777888
20 | 00002223378800035778
22 | 0002335578023578
24 | 00228
26 | 23
28 | 080
30 | 7
32 | 2337
34 | 250077
36 | 0000823577
38 | 2333335582225577
40 | 0000003357788888002233555577778
42 | 03335555778800233333555577778
44 | 02222335557780000000023333357778888
46 | 0000233357700000023578
48 | 00000022335800333
50 | 0370
Exercise
Find the stem-and-leaf plot of the eruption waiting periods in faithful.
Scatter Plot
A scatter plot pairs up values of two quantitative variables in a data set and displays them as
geometric points inside a Cartesian diagram.
Example
In the data set faithful, we pair up the eruptions and waiting values in the same observation as (x,y)
coordinates. Then we plot the points in the Cartesian plane. Here is a preview of the eruption data
value pairs with the help of the cbind function.
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting interval
> head(cbind(duration, waiting))
duration waiting
[1,] 3.600 79
[2,] 1.800 54
[3,] 3.333 74
[4,] 2.283 62
[5,] 4.533 85
[6,] 2.883 55
Problem
Find the scatter plot of the eruption durations and waiting intervals in faithful. Does it reveal any
relationship between the variables?
Solution
We apply the plot function to compute the scatter plot of eruptions and waiting.
> duration = faithful$eruptions # the eruption durations
> waiting = faithful$waiting # the waiting interval
> plot(duration, waiting, # plot the variables
+ xlab="Eruption duration", # x−axis label
+ ylab="Time waited") # y−axis label
Answer
The scatter plot of the eruption durations and waiting intervals is as follows. It reveals a positive
linear relationship between them.
Enhanced Solution
We can generate a linear regression model of the two variables with the lm function, and then draw
a trend line with abline.
> abline(lm(waiting ~ duration))
Numerical Measures
We explain how to compute various statistical measures in R with examples. The tutorials are based
on the previously discussed built-in data set faithful.
The measures covered include the mean, median, quartiles, percentiles, range, interquartile range,
variance, standard deviation, covariance and correlation coefficient.
Mean
The mean of an observation variable is a numerical measure of the central location of the data
values. It is the sum of its data values divided by data count.
Hence, for a data sample of size n, its sample mean x̄ is defined as follows:
x̄ = (x₁ + x₂ + ⋯ + xₙ) / n
Problem
Find the mean eruption duration in the data set faithful.
Solution
We apply the mean function to compute the mean value of eruptions.
> duration = faithful$eruptions # the eruption durations
> mean(duration) # apply the mean function
[1] 3.4878
Exercise
Find the mean eruption waiting periods in faithful.
Median
The median of an observation variable is the value at the middle when the data is sorted in
ascending order. It is an ordinal measure of the central location of the data values.
Problem
Find the median of the eruption duration in the data set faithful.
Solution
We apply the median function to compute the median value of eruptions.
> duration = faithful$eruptions # the eruption durations
> median(duration) # apply the median function
[1] 4
Exercise
Find the median of the eruption waiting periods in faithful.
Quartile
There are several quartiles of an observation variable. The first quartile, or lower quartile, is
the value that cuts off the first 25% of the data when it is sorted in ascending order. The second
quartile, or median, is the value that cuts off the first 50%. The third quartile, or upper
quartile, is the value that cuts off the first 75%.
Problem
Find the quartiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the quartiles of eruptions.
> duration = faithful$eruptions # the eruption durations
> quantile(duration) # apply the quantile function
0% 25% 50% 75% 100%
1.6000 2.1627 4.0000 4.4543 5.1000
Answer
The first, second and third quartiles of the eruption duration are 2.1627, 4.0000 and 4.4543
minutes respectively.
Exercise
Find the quartiles of the eruption waiting periods in faithful.
Note
There are several algorithms for the computation of quartiles. Details can be found in the R
documentation via help(quantile).
Percentile
The nth percentile of an observation variable is the value that cuts off the first n percent of the data
values when it is sorted in ascending order.
Problem
Find the 32nd, 57th and 98th percentiles of the eruption durations in the data set faithful.
Solution
We apply the quantile function to compute the percentiles of eruptions with the desired percentage
ratios.
> duration = faithful$eruptions # the eruption durations
> quantile(duration, c(.32, .57, .98))
32% 57% 98%
2.3952 4.1330 4.9330
Answer
The 32nd, 57th and 98th percentiles of the eruption duration are 2.3952, 4.1330 and 4.9330
minutes respectively.
Exercise
Find the 17th, 43rd, 67th and 85th percentiles of the eruption waiting periods in faithful.
Note
There are several algorithms for the computation of percentiles. Details can be found in the R
documentation via help(quantile).
Range
The range of an observation variable is the difference of its largest and smallest data values. It is a
measure of how far apart the entire data spreads in value.
Problem
Find the range of the eruption durations in the data set faithful.
Solution
We apply the max and min functions to find the largest and smallest values of eruptions, then take
the difference.
> duration = faithful$eruptions # the eruption durations
> max(duration) - min(duration) # the range of duration
[1] 3.5
Exercise
Find the range of the eruption waiting periods in faithful.
Interquartile Range
The interquartile range of an observation variable is the difference of its upper and lower
quartiles. It is a measure of how far apart the middle portion of data spreads in value.
Problem
Find the interquartile range of eruption duration in the data set faithful.
Solution
We apply the IQR function to compute the interquartile range of eruptions.
> duration = faithful$eruptions # the eruption durations
> IQR(duration) # apply the IQR function
[1] 2.2915
Exercise
Find the interquartile range of eruption waiting periods in faithful.
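As a cross-check on the definition, IQR(duration) should equal the difference of the 75th and 25th percentiles returned by quantile (a sketch):

```r
duration <- faithful$eruptions
q <- quantile(duration, c(0.25, 0.75))

# Difference of upper and lower quartiles equals IQR().
unname(q[2] - q[1])
# [1] 2.2915
```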
Box Plot
The box plot of an observation variable is a graphical representation based on its quartiles, as well
as its smallest and largest values. It attempts to provide a visual shape of the data distribution.
Problem
Find the box plot of the eruption duration in the data set faithful.
Solution
We apply the boxplot function to produce the box plot of eruptions.
> duration = faithful$eruptions # the eruption durations
> boxplot(duration, horizontal=TRUE) # horizontal box plot
A boxplot is a measure of how well the data in a data set is distributed. It
divides the data set into quartiles, and represents the
minimum, first quartile, median, third quartile and maximum of the data
set. It is also useful for comparing the distribution of data across data sets,
by drawing a boxplot for each of them.
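The five numbers a boxplot is built from can be listed directly with base R's fivenum, shown here on the faithful eruption durations used in earlier sections (a sketch; note fivenum uses hinges, which can differ slightly from quantile's quartiles):

```r
duration <- faithful$eruptions

# Minimum, lower hinge, median, upper hinge, maximum.
fivenum(duration)
```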
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
x is a vector or a formula.
data is the data frame.
notch is a logical value. Set as TRUE to draw a notch.
varwidth is a logical value. Set as TRUE to draw the width of the box proportionate
to the sample size.
names are the group labels which will be printed under each boxplot.
main is the title of the graph.
Example
We use the data set "mtcars" available in the R environment to create a
basic boxplot. Let's look at the columns "mpg" and "cyl" in mtcars.
# Select the columns of interest.
input <- mtcars[, c("mpg", "cyl")]
print(head(input))
# Give the chart file a name.
png(file = "boxplot.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon",
        main = "Mileage Data")
# Save the file.
dev.off()
Boxplot with Notch
The below script will create a boxplot graph with a notch for each data
group.
# Give the chart file a name.
png(file = "boxplot_with_notch.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon",
        main = "Boxplot with Notch",
        notch = TRUE,
        varwidth = TRUE,
        col = c("green","yellow","purple"),
        names = c("High","Medium","Low"))
# Save the file.
dev.off()
Mean
The mean is calculated by taking the sum of the values and dividing by the
number of values in a data series.
The function mean() is used to calculate this in R.
Syntax
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
x is the input vector.
trim is used to drop some observations from both ends of the sorted vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the mean.
result.mean <- mean(x)
print(result.mean)
[1] 8.22
Applying the Trim Option
When trim = 0.3, 3 values from each end will be dropped from the
calculation of the mean.
In this case the sorted vector is (−21, −5, 2, 3, 4.2, 7, 8, 12, 18, 54) and
the values removed from the vector for calculating the mean are (−21, −5, 2)
from the left and (12, 18, 54) from the right.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the mean with trim = 0.3.
result.mean <- mean(x, trim = 0.3)
print(result.mean)
[1] 5.55
Applying NA Option
If there are missing values, then the mean function returns NA.
To drop the missing values from the calculation, use na.rm = TRUE, which
means remove the NA values.
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
# Find the mean.
result.mean <- mean(x)
print(result.mean)
[1] NA
# Find the mean, dropping NA values.
result.mean <- mean(x, na.rm = TRUE)
print(result.mean)
[1] 8.22
Median
The middle-most value in a data series is called the median.
The median() function is used in R to calculate this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)
na.rm is used to remove the missing values from the input vector.
Example
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the median.
median.result <- median(x)
print(median.result)
[1] 5.6
Mode
The mode is the value that has the highest number of occurrences in a set of
data. Unlike the mean and median, the mode can apply to both numeric and character
data.
R does not have a standard built-in function to calculate the mode, so we create
a user function to do it.
Example
# Create the function.
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Create the vector with numbers.
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
# Calculate the mode using the user function.
result <- getmode(v)
print(result)
[1] 2
# Create the vector with characters.
charv <- c("o","it","the","it","it")
# Calculate the mode of character data.
result <- getmode(charv)
print(result)
[1] "it"
Variance
The variance is a numerical measure of how the data values are dispersed around the mean. In
particular, for a data sample of size n with sample mean x̄, the sample variance is defined as:
s² = Σ (xᵢ − x̄)² / (n − 1)
Similarly, the population variance is defined in terms of the population mean μ and population
size N:
σ² = Σ (xᵢ − μ)² / N
Problem
Find the variance of the eruption duration in the data set faithful.
Solution
We apply the var function to compute the variance of eruptions.
> duration = faithful$eruptions # the eruption durations
> var(duration) # apply the var function
[1] 1.3027
Exercise
Find the variance of the eruption waiting periods in faithful.
Standard Deviation
The standard deviation of an observation variable is the square root of its variance.
Problem
Find the standard deviation of the eruption duration in the data set faithful.
Solution
We apply the sd function to compute the standard deviation of eruptions.
> duration = faithful$eruptions # the eruption durations
> sd(duration) # apply the sd function
[1] 1.1414
Exercise
Find the standard deviation of the eruption waiting periods in faithful.
Covariance
The covariance of two variables x and y in a data set measures how the two are linearly related. A
positive covariance would indicate a positive linear relationship between the variables, and a
negative covariance would indicate the opposite.
The sample covariance is defined in terms of the sample means x̄ and ȳ as:
s_xy = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
Similarly, the population covariance is defined in terms of the population means μ_x and μ_y as:
σ_xy = Σ (xᵢ − μ_x)(yᵢ − μ_y) / N
Problem
Find the covariance of eruption duration and waiting time in the data set faithful. Observe if there is
any linear relationship between the two variables.
Solution
We apply the cov function to compute the covariance of eruptions and waiting.
> duration = faithful$eruptions # eruption durations
> waiting = faithful$waiting # the waiting period
> cov(duration, waiting) # apply the cov function
[1] 13.978
Answer
The covariance of eruption duration and waiting time is about 14. It indicates a positive linear
relationship between the two variables.
Correlation Coefficient
The correlation coefficient of two variables in a data set equals to their covariance divided by the
product of their individual standard deviations. It is a normalized measurement of how the two are
linearly related.
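The definition can be verified directly: dividing the covariance by the product of the standard deviations reproduces cor (a sketch using the faithful variables):

```r
duration <- faithful$eruptions
waiting  <- faithful$waiting

# Covariance normalized by both standard deviations gives the correlation.
r <- cov(duration, waiting) / (sd(duration) * sd(waiting))
r  # same value as cor(duration, waiting), about 0.90081
```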
Formally, the sample correlation coefficient is defined by the following formula, where s_x and s_y
are the sample standard deviations, and s_xy is the sample covariance:
r_xy = s_xy / (s_x s_y)
Similarly, the population correlation coefficient is defined as follows, where σ_x and σ_y are the
population standard deviations, and σ_xy is the population covariance:
ρ_xy = σ_xy / (σ_x σ_y)
If the correlation coefficient is close to 1, it indicates that the variables are positively linearly
related and the scatter plot falls almost along a straight line with positive slope. If it is close to -1,
the variables are negatively linearly related and the scatter plot falls almost along a straight
line with negative slope. A value near zero indicates a weak linear relationship between the
variables.
Problem
Find the correlation coefficient of eruption duration and waiting time in the data set faithful. Observe
if there is any linear relationship between the variables.
Solution
We apply the cor function to compute the correlation coefficient of eruptions and waiting.
> duration = faithful$eruptions # eruption durations
> waiting = faithful$waiting # the waiting period
> cor(duration, waiting) # apply the cor function
[1] 0.90081
Answer
The correlation coefficient of eruption duration and waiting time is 0.90081. Since it is rather close
to 1, we can conclude that the variables are positively linearly related.
R - Linear Regression
Regression analysis is a very widely used statistical tool to establish a
relationship model between two variables. One of these variables is called the
predictor variable, whose value is gathered through experiments. The other
variable is called the response variable, whose value is derived from the
predictor variable.
In linear regression these two variables are related through an equation of the
form y = ax + b. The steps to establish the relationship are to gather a sample of
observed values of the two variables, create a relationship model using the lm()
function, find the coefficients from the model created and create the mathematical
equation using these, and finally use the predict() function to estimate the
response for new values of the predictor.
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the
response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula, data)
formula is a symbol presenting the relation between x and y.
data is the vector on which the formula will be applied.
Example
# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y ~ x)
print(relation)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x  
   -38.4551       0.6746  
Get the Summary of the Relationship
# Create the vectors and the model, then print the summary.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
print(summary(relation))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
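The numbers in the summary table are also available programmatically; a sketch pulling the coefficient matrix and R-squared out of the summary object:

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

s <- summary(lm(y ~ x))

print(s$coefficients)   # the Estimate / Std. Error / t value / Pr(>|t|) matrix
print(s$r.squared)      # proportion of the variance in y explained by x
```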
predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
object is the model already created using the lm() function.
newdata is a data frame containing the new value(s) for the predictor variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# Wrap the new predictor value in a data frame (170 is an illustrative height).
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# Give the chart file a name.
png(file = "linearregression.png")

# Plot the data points and add the fitted regression line.
plot(x, y, main = "Height & Weight Regression", xlab = "Height", ylab = "Weight")
abline(relation)

# Save the file.
dev.off()
R - Time Series Analysis
Time series is a series of data points in which each data point is associated with a timestamp. In R, a time series is created with the ts() function.
Syntax
The basic syntax for ts() function in time series analysis is −
timeseries.object.name <- ts(data, start, end, frequency)
data is a vector or matrix containing the values used in the time series.
start specifies the start time for the first observation in the time series.
end specifies the end time for the last observation in the time series.
frequency specifies the number of observations per unit time.
Example
Consider the annual rainfall details at a place starting from January 2012.
We create an R time series object for a period of 12 months and plot it.
# Get the data points in form of an R vector.
rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)

# Convert it to a time series object, one observation per month from January 2012.
rainfall.timeseries <- ts(rainfall, start = c(2012,1), frequency = 12)

# Print the time series data.
print(rainfall.timeseries)
png(file = "rainfall.png")
plot(rainfall.timeseries)
dev.off()
When we execute the above code, it produces the following result and chart
−
Jan Feb Mar Apr May Jun Jul Aug Sep
2012 799.0 1174.8 865.1 1334.6 635.4 918.5 685.5 998.6 784.2
Oct Nov Dec
2012 985.0 882.8 1071.0
The frequency parameter decides the time intervals at which the data points are measured: frequency = 12, as above, pegs one data point for each month of a year, while frequency = 24*6 pegs the data points at every 10 minutes of a day.
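For instance, quarterly data uses frequency = 4; a minimal sketch (with made-up sales figures) showing how the printed layout follows the frequency:

```r
# Hypothetical quarterly figures for two years, starting Q1 2012.
sales <- c(210, 180, 250, 300, 220, 190, 270, 330)
sales.timeseries <- ts(sales, start = c(2012, 1), frequency = 4)

print(sales.timeseries)           # printed in a Qtr1..Qtr4 layout, one row per year
print(frequency(sales.timeseries))
```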
Multiple Time Series
We can plot multiple time series in one chart by combining both the series
into a matrix.
rainfall1 <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
rainfall2 <- c(655,1306.9,1323.4,1172.2,562.2,824,822.4,1265.5,799.6,1105.6,1106.7,1337.8)

# Combine both vectors into a matrix, one series per column.
combined.rainfall <- matrix(c(rainfall1, rainfall2), nrow = 12)

rainfall.timeseries <- ts(combined.rainfall, start = c(2012,1), frequency = 12)
print(rainfall.timeseries)

png(file = "rainfall_combined.png")
plot(rainfall.timeseries, main = "Multiple Time Series")
dev.off()
When we execute the above code, it produces the following result and chart
−
Series 1 Series 2
Jan 2012 799.0 655.0
Feb 2012 1174.8 1306.9
Mar 2012 865.1 1323.4
Apr 2012 1334.6 1172.2
May 2012 635.4 562.2
Jun 2012 918.5 824.0
Jul 2012 685.5 822.4
Aug 2012 998.6 1265.5
Sep 2012 784.2 799.6
Oct 2012 985.0 1105.6
Nov 2012 882.8 1106.7
Dec 2012 1071.0 1337.8
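An alternative to building the matrix by hand is to create two ts objects and combine them with cbind(), which yields a multivariate series ("mts") that plot() draws in one chart; a sketch with synthetic values:

```r
set.seed(42)
s1 <- ts(rnorm(12, mean = 900, sd = 200), start = c(2012, 1), frequency = 12)
s2 <- ts(rnorm(12, mean = 1000, sd = 250), start = c(2012, 1), frequency = 12)

combined <- cbind(s1, s2)   # a multivariate time series ("mts")
print(class(combined))
```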
R - Nonlinear Least Square
The nls() function fits a nonlinear model to data by minimizing the sum of squared residuals.
Syntax
The basic syntax for creating a nonlinear least square test in R is −
nls(formula, data, start)
Example
We will consider a nonlinear model with an assumption of initial values for its coefficients. Next we will see what the confidence intervals of these assumed values are, so that we can judge how well these values fit into the model.
Let's assume the initial coefficients to be 1 and 3 and fit these values into
nls() function.
# xvalues and yvalues are assumed to hold the observed data (the original vectors are not shown).
# Fit a nonlinear model of the assumed form b1*x^2 + b2, starting at b1 = 1 and b2 = 3.
model <- nls(yvalues ~ b1 * xvalues^2 + b2, start = list(b1 = 1, b2 = 3))

# Give the chart file a name.
png(file = "nls.png")

# Plot the observed data points.
plot(xvalues, yvalues)

# Plot the chart with new data by fitting it to a prediction from 100 data points.
new.data <- data.frame(xvalues = seq(min(xvalues), max(xvalues), len = 100))
lines(new.data$xvalues, predict(model, newdata = new.data))

# Save the file.
dev.off()

# Get the sum of the squared residuals.
print(sum(resid(model)^2))

# Get the confidence intervals on the chosen values of the coefficients.
print(confint(model))
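Since the original xvalues/yvalues vectors are not shown, here is a self-contained sketch of the same workflow on synthetic data generated from the assumed model form b1*x^2 + b2:

```r
set.seed(1)

# Synthetic data: true coefficients b1 = 2 and b2 = 5, plus noise.
xvalues <- runif(50, min = 1, max = 4)
yvalues <- 2 * xvalues^2 + 5 + rnorm(50, sd = 0.5)

# Fit with the assumed starting values b1 = 1, b2 = 3.
model <- nls(yvalues ~ b1 * xvalues^2 + b2, start = list(b1 = 1, b2 = 3))

print(coef(model))          # estimates should land near 2 and 5
print(sum(resid(model)^2))  # sum of squared residuals
print(confint(model))       # confidence intervals for b1 and b2
```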
R - Multiple Regression
Multiple regression is an extension of linear regression into relationship
between more than two variables. In simple linear relation we have one
predictor and one response variable, but in multiple regression we have
more than one predictor variable and one response variable.
We create the regression model using the lm() function in R. The model
determines the value of the coefficients using the input data. Next we can
predict the value of the response variable for a given set of predictor
variables using these coefficients.
lm() Function
This function creates the relationship model between the predictor and the
response variable.
Syntax
The basic syntax for lm() function in multiple regression is −
lm(y ~ x1+x2+x3...,data)
formula is a symbol presenting the relation between the response variable and the predictor variables.
data is the data frame on which the formula will be applied.
Example
Input Data
Consider the data set "mtcars" available in the R environment. It gives a
comparison between different car models in terms of mileage per gallon
(mpg), cylinder displacement("disp"), horse power("hp"), weight of the
car("wt") and some more parameters.
input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))

# Create the relationship model.
model <- lm(mpg ~ disp + hp + wt, data = input)

# Show the model.
print(model)

# Get the intercept and the coefficients as model variables.
a <- coef(model)[1]
print(a)

Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]

print(Xdisp)
print(Xhp)
print(Xwt)
For a car with disp = 221, hp = 102 and wt = 2.91, the predicted mileage is
−
Y = 37.1055 + (-0.000937)*221 + (-0.0311)*102 + (-3.8008)*2.91 = 22.666
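Instead of plugging the rounded coefficients into the equation by hand, the same prediction can be obtained with predict(); a sketch using the built-in mtcars data set:

```r
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Predicted mileage for a car with disp = 221, hp = 102, wt = 2.91.
new.car <- data.frame(disp = 221, hp = 102, wt = 2.91)
print(predict(model, new.car))
```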
R - Logistic Regression
Logistic regression is a regression model in which the response variable (dependent variable) has categorical values such as True/False or 0/1. It measures the probability of a binary response as a function of a mathematical equation relating it to the predictor variables.
The function used to create the regression model is the glm() function.
Syntax
The basic syntax for glm() function in logistic regression is −
glm(formula,data,family)
family is an R object specifying the details of the model. Its value is binomial for logistic regression.
Example
The in-built data set "mtcars" describes different models of a car with their
various engine specifications. In "mtcars" data set, the transmission mode
(automatic or manual) is described by the column am which is a binary
value (0 or 1). We can create a logistic regression model between the
columns "am" and 3 other columns - hp, wt and cyl.
input <- mtcars[,c("am","cyl","hp","wt")]
print(head(input))

# Create the logistic regression model with a binomial family.
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

print(summary(am.data))

When we execute the above code, it produces the following result −
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion
In the summary, the p-values (last column) for the variables "cyl" and "hp" are greater than 0.05, so we consider them insignificant in contributing to the value of the variable "am". Only weight (wt) has a significant impact on the "am" value in this regression model.
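The fitted model can also score new cars: with family = binomial, predict(..., type = "response") returns the probability that am = 1 (manual transmission); a sketch with hypothetical specifications:

```r
am.data <- glm(am ~ cyl + hp + wt, data = mtcars, family = binomial)

# Hypothetical light 4-cylinder car: the model assigns a high probability of a manual gearbox.
new.car <- data.frame(cyl = 4, hp = 100, wt = 2.2)
print(predict(am.data, new.car, type = "response"))
```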