Statistics
Statistics Introduction
Statistics is the science of collecting, analyzing, and drawing conclusions from data.
Some basic statistical numbers include:
Mean, median and mode
Minimum and maximum value
Percentiles
Variance and Standard Deviation
Covariance and Correlation
Probability distributions
Example
# Print the mtcars data set
mtcars
Example
# Use the question mark to get information about the data set
?mtcars
Get Information
Use the dim() function to find the dimensions of the data set, and the
names() function to view the names of the variables:
Example
Data_Cars <- mtcars # create a variable of the mtcars data set for better organization
# Use dim() to find the dimensions of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)
Use the rownames() function to get the name of each row, which is the name
of each car:
Example
Data_Cars <- mtcars
rownames(Data_Cars)
From the examples above, we have found out that the data set
has 32 observations (Mazda RX4, Mazda RX4 Wag, Datsun 710, etc)
and 11 variables (mpg, cyl, disp, etc).
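The counts above can also be read off directly. The following is a minimal sketch, assuming the same Data_Cars copy of mtcars; nrow(), ncol() and str() are base R functions, and the variable names n_observations and n_variables are chosen here only for illustration:

```r
Data_Cars <- mtcars

# Number of observations (rows) and variables (columns)
n_observations <- nrow(Data_Cars)
n_variables <- ncol(Data_Cars)
n_observations
n_variables

# Compact overview: type and first values of every variable
str(Data_Cars)
```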
Example
Data_Cars <- mtcars
Data_Cars$cyl
Example
Data_Cars <- mtcars
sort(Data_Cars$cyl)
From the examples above, we see that most cars have either 4 or 8 cylinders.
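Rather than counting the sorted values by eye, the number of cars per cylinder count can be tabulated. This is a small sketch using the base R table() function; the name cyl_counts is only illustrative:

```r
Data_Cars <- mtcars

# Count how many cars there are for each distinct number of cylinders
cyl_counts <- table(Data_Cars$cyl)
cyl_counts
```

In mtcars this gives 11 cars with 4 cylinders, 7 cars with 6 cylinders and 14 cars with 8 cylinders.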
Analyzing the Data
Now that we have some information about the data set, we can start to
analyze it with some statistical numbers.
For example, we can use the summary() function to get a statistical summary
of the data:
Example
Data_Cars <- mtcars
summary(Data_Cars)
The summary() function returns six statistical numbers for each variable:
Min
First quartile (25th percentile)
Median
Mean
Third quartile (75th percentile)
Max
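The same six numbers can be requested for a single variable. The sketch below assumes we are only interested in wt (weight); wt_summary is an illustrative name, and the element names are the labels that summary() attaches to its result:

```r
Data_Cars <- mtcars

# Summary of one variable instead of the whole data set
wt_summary <- summary(Data_Cars$wt)
wt_summary

# The six values are labeled, so a single one can be picked out by name
names(wt_summary)
wt_summary[["Median"]]
```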
Max Min
In the previous chapter, we introduced the mtcars data set. We learned from
the ‘R Math’ chapter that R has several built-in math functions.
For example, the min() and max() functions can be used to find the lowest or
highest value in a set:
Example
Find the largest and smallest value of the variable hp (horsepower).
max(Data_Cars$hp)
min(Data_Cars$hp)
Now we know that the largest horsepower value in the set is 335, and
the lowest is 52.
We could look through the data set and try to find out which cars these
two values belong to.
However, it is much easier (and safer) to let R find this out for us.
For example, we can use the which.max() and which.min() functions to find
the index position of the max and min value in the table:
Example
Data_Cars <- mtcars
which.max(Data_Cars$hp)
which.min(Data_Cars$hp)
Combine which.max() and which.min() with the rownames() function to get the
names of the cars with the highest and lowest horsepower:
Example
Data_Cars <- mtcars
rownames(Data_Cars)[which.max(Data_Cars$hp)]
rownames(Data_Cars)[which.min(Data_Cars$hp)]
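The index returned by which.max() and which.min() can also be used to pull out the complete row for a car, not just its name. A minimal sketch, with strongest_car and weakest_car as illustrative variable names:

```r
Data_Cars <- mtcars

# Select the full row (all 11 variables) of the car with the highest horsepower
strongest_car <- Data_Cars[which.max(Data_Cars$hp), ]
strongest_car

# Select the full row of the car with the lowest horsepower
weakest_car <- Data_Cars[which.min(Data_Cars$hp), ]
weakest_car
```

In mtcars, these rows belong to the Maserati Bora (335 hp) and the Honda Civic (52 hp).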
Outliers
Max and min can also be used to detect outliers. An outlier is a data point
that differs significantly from the rest of the observations.
Examples of data points that would have been outliers in the mtcars data set:
If the maximum number of forward gears of a car was 11
If the minimum horsepower of a car was 0
If the maximum weight of a car was 50 000 lbs
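A simple plausibility check along these lines could look as follows. This is only a sketch: the three limits are taken from the hypothetical examples above and are not an established outlier rule, and the name suspicious is illustrative. Note that wt is recorded in units of 1000 lbs, so 50 000 lbs corresponds to a wt value of 50:

```r
Data_Cars <- mtcars

# Look at the smallest and largest value of each variable of interest
range(Data_Cars$gear)
range(Data_Cars$hp)
range(Data_Cars$wt)

# Flag cars that break any of the (assumed) plausibility limits
suspicious <- Data_Cars$gear > 10 | Data_Cars$hp <= 0 | Data_Cars$wt > 50
rownames(Data_Cars)[suspicious]
```

For mtcars the last line returns an empty result, because no car breaks any of these limits.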
Mean
To calculate the average value (mean) of a variable from the mtcars
data set, find the sum of all values, and divide the sum by the number
of values.
Sorted observation of wt (weight)
1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424
Example
Find the average weight (wt) of a car:
mean(Data_Cars$wt)
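The definition above (sum of all values divided by the number of values) can be checked by hand. A minimal sketch, where manual_mean is an illustrative name:

```r
Data_Cars <- mtcars

# Sum of all weights divided by the number of weights
manual_mean <- sum(Data_Cars$wt) / length(Data_Cars$wt)
manual_mean

# The built-in function gives the same result
mean(Data_Cars$wt)
```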
Median
The median value is the value in the middle, after you have sorted all the
values.
If we look at the values of the wt variable (from the mtcars data set), we will
see that there are two numbers in the middle (3.215 and 3.435), because the
number of observations (32) is even.
Sorted observation of wt (weight)
1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424
Note: If there are two numbers in the middle, we must divide the sum of those
numbers by two, to find the median.
The median() function in R can find the middle value:
Example
Find the midpoint value of weight (wt):
median(Data_Cars$wt)
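The rule from the note can be reproduced step by step. The sketch below assumes an even number of observations, as is the case for wt; the names sorted_wt, lower_middle, upper_middle and manual_median are illustrative:

```r
Data_Cars <- mtcars

# Sort the values and locate the two middle positions (16 and 17 of 32)
sorted_wt <- sort(Data_Cars$wt)
n <- length(sorted_wt)
lower_middle <- sorted_wt[n / 2]
upper_middle <- sorted_wt[n / 2 + 1]

# The median is the sum of the two middle values divided by two
manual_median <- (lower_middle + upper_middle) / 2
manual_median

# Same result as the built-in function
median(Data_Cars$wt)
```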
Mode
The mode is the value that appears the greatest number of times.
R does not have a built-in function to calculate the statistical mode (the
mode() function returns the storage type of an object instead). However, we
can create our own function to find it.
If we look at the values of the wt variable (from the mtcars data set), we will
see that the number 3.440 appears more often than any other.
1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424
Example
Data_Cars <- mtcars
names(sort(-table(Data_Cars$wt)))[1]
From the example above, we now know that the value that appears most often
in the wt variable of mtcars is 3.44, or 3 440 lbs (wt is recorded in units
of 1000 lbs).
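One possible way to wrap this in a reusable function is sketched below. The name get_mode is hypothetical (it is not part of R), and the approach shown, counting each distinct value with match() and tabulate(), is just one of several options. Unlike the one-liner above, it keeps the original data type and returns every value that ties for the highest count:

```r
# Hypothetical helper: returns the most frequent value(s) of a vector
get_mode <- function(x) {
  unique_values <- unique(x)
  # For every distinct value, count how many times it occurs in x
  counts <- tabulate(match(x, unique_values))
  # Keep all values that reach the highest count (there may be ties)
  unique_values[counts == max(counts)]
}

Data_Cars <- mtcars
get_mode(Data_Cars$wt)
get_mode(Data_Cars$cyl)
```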
Percentiles
A percentile is a value at or below which a given percentage of the values in
a data set fall.
Consider the values of the wt (weight) variable from the mtcars data set.
What is the 75th percentile of the weight of the cars? The answer is 3.61, or
3 610 lbs, meaning that 75% of the cars weigh 3 610 lbs or less:
Example
Data_Cars <- mtcars
quantile(Data_Cars$wt, c(0.75)) # c() specifies which percentile you want
If you run the quantile() function without the second argument (the vector of
probabilities created with c()), you will get the 0th, 25th, 50th, 75th and
100th percentiles:
Example
Data_Cars <- mtcars
quantile(Data_Cars$wt)
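Several percentiles can be requested in one call by passing more than one probability. A short sketch, where the chosen probabilities (10%, 50% and 90%) and the name wt_percentiles are only illustrative:

```r
Data_Cars <- mtcars

# Ask for the 10th, 50th and 90th percentile of weight in a single call
wt_percentiles <- quantile(Data_Cars$wt, c(0.10, 0.50, 0.90))
wt_percentiles

# The 50th percentile is the same value as the median
wt_percentiles[["50%"]]
median(Data_Cars$wt)
```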
Quartiles
Quartiles divide the data, sorted in ascending order, into four equal parts:
The value of the first quartile cuts off the first 25% of the data
The value of the second quartile (the median) cuts off the first 50% of the data
The value of the third quartile cuts off the first 75% of the data
The value of the fourth quartile (the maximum) cuts off 100% of the data
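The list above can be verified on the wt variable. The sketch below computes the four quartile values with quantile() and then checks which share of the cars weighs at most that much; the names wt_quartiles and share_at_or_below are illustrative:

```r
Data_Cars <- mtcars

# The four quartile values of weight
wt_quartiles <- quantile(Data_Cars$wt, c(0.25, 0.50, 0.75, 1))
wt_quartiles

# For each quartile value: the share of cars weighing that much or less
share_at_or_below <- sapply(wt_quartiles, function(q) mean(Data_Cars$wt <= q))
share_at_or_below
```

For wt the shares come out as 0.25, 0.50, 0.75 and 1, matching the definition. With other data the shares may deviate slightly, because quantile() interpolates between neighboring values.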