
Statistics

The document provides an introduction to the "mtcars" data set in R, which contains information on 32 cars. It describes the variables in the data set like miles per gallon, number of cylinders, horsepower, weight, and others. It demonstrates how to access, manipulate, and analyze the data set using R functions like dim(), names(), sort(), summary(), mean(), median(), and quantiles. The goal is to help users understand this built-in data set and how to extract useful information from it.

Uploaded by

noufatcoursera

R Statistics

Statistics Introduction
 Statistics is the science of collecting, analyzing, and drawing conclusions from data.
 Some basic statistical numbers include:
 Mean, median and mode
 Minimum and maximum value
 Percentiles
 Variance and Standard Deviation
 Covariance and Correlation
 Probability distributions

 The R language was developed by two statisticians. It has many built-in
functions, as well as libraries written specifically for statistical
analysis.
Data Set
 A data set is a collection of data, often presented in a table.
 There is a popular built-in data set in R called "mtcars" (Motor Trend Car
Road Tests), which was extracted from the 1974 Motor Trend US magazine.

Example
# Print the mtcars data set
mtcars

“mtcars” data set:


Motor Trend Car Road Tests
Description
 The data was extracted from the 1974 Motor Trend US magazine and
comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973-74 models).
Usage
mtcars
Format
A data frame with 32 observations on 11 (numeric) variables:
[, 1]  mpg   Miles/(US) gallon
[, 2]  cyl   Number of cylinders
[, 3]  disp  Displacement (cu.in.)
[, 4]  hp    Gross horsepower
[, 5]  drat  Rear axle ratio
[, 6]  wt    Weight (1000 lbs)
[, 7]  qsec  1/4 mile time
[, 8]  vs    Engine (0 = V-shaped, 1 = straight)
[, 9]  am    Transmission (0 = automatic, 1 = manual)
[,10]  gear  Number of forward gears
[,11]  carb  Number of carburetors

Information About the Data Set


 We can use the question mark (?) to get information about the mtcars data
set.

 Example
# Use the question mark to get information about the data set

?mtcars
Get Information
 Use the dim() function to find the dimensions of the data set, and the
names() function to view the names of the variables:

 Example
# Create a variable of the mtcars data set for better organization
Data_Cars <- mtcars

# Use dim() to find the dimension of the data set


dim(Data_Cars)

# Use names() to find the names of the variables from the data set
names(Data_Cars)
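
As a small complement to dim() and names(), the sketch below uses nrow(), ncol() and str(), which are also base R functions. The variable names n_obs and n_vars are only chosen for this illustration.

```r
Data_Cars <- mtcars

# Number of rows (observations) and columns (variables)
n_obs <- nrow(Data_Cars)
n_vars <- ncol(Data_Cars)
n_obs   # 32
n_vars  # 11

# Compact overview: type and first values of every variable
str(Data_Cars)
```

dim() returns both numbers at once, while nrow() and ncol() are convenient when only one of them is needed.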

 Use the rownames() function to get the name of each row, which is the name
of each car. The row names are printed to the left of the first column, but
they are not one of the 11 variables:

 Example
Data_Cars <- mtcars

rownames(Data_Cars)

From the examples above, we have found out that the data set
has 32 observations (Mazda RX4, Mazda RX4 Wag, Datsun 710, etc.)
and 11 variables (mpg, cyl, disp, etc.).

 A variable is defined as something that can be measured or counted.


Here is a brief explanation of the variables from the mtcars data set:
Variable Name   Description
mpg             Miles/(US) gallon
cyl             Number of cylinders
disp            Displacement (cu.in.)
hp              Gross horsepower
drat            Rear axle ratio
wt              Weight (1000 lbs)
qsec            1/4 mile time
vs              Engine (0 = V-shaped, 1 = straight)
am              Transmission (0 = automatic, 1 = manual)
gear            Number of forward gears
carb            Number of carburetors

Print Variable Values


 If you want to print all values that belong to a variable, access the data frame
by using the $ sign followed by the name of the variable (for example, cyl
for cylinders).

 Example
Data_Cars <- mtcars

Data_Cars$cyl

Sort Variable Values


 To sort the values, use the sort() function:

Example
Data_Cars <- mtcars

sort(Data_Cars$cyl)

From the examples above, we see that most cars have either 4 or 8 cylinders.
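
Reading counts off a sorted list is error-prone. As a sketch, the base R table() function can count how often each cylinder value occurs. The variable name cyl_counts is only chosen for this illustration.

```r
Data_Cars <- mtcars

# Count how many cars have each number of cylinders
cyl_counts <- table(Data_Cars$cyl)
cyl_counts
# 4 cylinders: 11 cars, 6 cylinders: 7 cars, 8 cylinders: 14 cars
```

So 25 of the 32 cars have either 4 or 8 cylinders.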
Analyzing the Data
 Now that we have some information about the data set, we can start to
analyze it with some statistical numbers.
 For example, we can use the summary() function to get a statistical summary
of the data:

 Example
Data_Cars <- mtcars

summary(Data_Cars)

The summary() function returns six statistical numbers for each variable:
 Min
 First quartile (25th percentile)
 Median
 Mean
 Third quartile (75th percentile)
 Max
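
summary() also works on a single variable. The following is a minimal sketch using the wt (weight) variable. The name wt_summary is only chosen for this illustration.

```r
Data_Cars <- mtcars

# Summary of a single variable instead of the whole data frame
wt_summary <- summary(Data_Cars$wt)
wt_summary

# Individual values can be picked out by name
wt_summary[["Median"]]  # 3.325
```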
Max Min
 In the previous chapter, we introduced the mtcars data set. We learned from
the ‘R Math’ chapter that R has several built-in math functions.
 For example, the min() and max() functions can be used to find the lowest or
highest value in a set:

 Example
Find the largest and smallest value of the variable hp (horsepower).

Data_Cars <- mtcars

max(Data_Cars$hp)
min(Data_Cars$hp)
Now we know that the largest horsepower value in the set is 335, and
the lowest is 52.

We could take a look at the data set and try to find out which cars these
two values belong to:

By observing the table, it looks like the largest hp value belongs to a
Maserati Bora, and the lowest belongs to a Honda Civic.

 It is much easier (and safer) to let R find this out for us.
 For example, we can use the which.max() and which.min() functions to find
the index position of the max and min value in the table:

 Example
Data_Cars <- mtcars
which.max(Data_Cars$hp)
which.min(Data_Cars$hp)

 Or even better, combine which.max() and which.min() with the rownames()
function to get the name of the car with the largest and smallest horsepower:

 Example
Data_Cars <- mtcars

rownames(Data_Cars)[which.max(Data_Cars$hp)]
rownames(Data_Cars)[which.min(Data_Cars$hp)]

Now we know for sure:

Maserati Bora is the car with the highest horsepower, and Honda
Civic is the car with the lowest horsepower.
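
The index returned by which.max() can also be used to pull out the entire row, not just the row name. This is a small sketch. The names idx_max and strongest are only chosen for this illustration.

```r
Data_Cars <- mtcars

# Row index of the car with the highest horsepower
idx_max <- which.max(Data_Cars$hp)

# Use the index to select the whole row (all 11 variables)
strongest <- Data_Cars[idx_max, ]
strongest

rownames(strongest)  # "Maserati Bora"
```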

Outliers
 Max and min can also be used to detect outliers. An outlier is a data point
that differs markedly from the rest of the observations.
 Examples of data points that would be outliers in the mtcars data set:
 A car with 11 forward gears
 A car with 0 horsepower
 A car weighing 50 000 lbs
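
A plausibility check along these lines could be sketched as follows. The limits are the hypothetical ones from the list above; they are not taken from any standard, and real outlier detection usually needs more care.

```r
Data_Cars <- mtcars

# Hypothetical plausibility limits, taken from the examples above
any(Data_Cars$gear >= 11)          # FALSE: no car has 11 or more forward gears
any(Data_Cars$hp <= 0)             # FALSE: no car has zero horsepower
any(Data_Cars$wt * 1000 >= 50000)  # FALSE: wt is stored in units of 1000 lbs

# range() returns the minimum and maximum in one call
range(Data_Cars$gear)  # 3 5
```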

R Mean, Median, and Mode


In statistics, there are often three values that interest us:
 Mean - The average value
 Median - The middle value
 Mode - The most common value

Mean
 To calculate the average value (mean) of a variable from the mtcars
data set, find the sum of all values, and divide the sum by the number
of values.
Sorted observations of wt (weight)
1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424

The mean() function in R can do this for us.

 Example
Find the average weight (wt) of a car:

Data_Cars <- mtcars

mean(Data_Cars$wt)
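
To connect the definition with the function, the sketch below computes the mean "by hand" and compares it with mean(). The name manual_mean is only chosen for this illustration.

```r
Data_Cars <- mtcars

# Mean by hand: sum of all values divided by the number of values
manual_mean <- sum(Data_Cars$wt) / length(Data_Cars$wt)
manual_mean

# Agrees with the built-in function
# (all.equal() tolerates tiny floating-point differences)
all.equal(manual_mean, mean(Data_Cars$wt))  # TRUE
```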
Median
 The median value is the value in the middle, after you have sorted all the
values.
 If we look at the values of the wt variable (from the mtcars data set), we will
see that there are two numbers in the middle.
 Sorted observations of wt (weight)
1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424

Note: If there are two numbers in the middle, we must divide the sum of those
numbers by two, to find the median.
 The median() function in R can find the middle value:

 Example
Find the midpoint value of weight (wt):

Data_Cars <- mtcars

median(Data_Cars$wt)
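
The note about two middle numbers can be checked directly. In this sketch, the names sorted_wt, n and manual_median are only chosen for the illustration.

```r
Data_Cars <- mtcars

sorted_wt <- sort(Data_Cars$wt)
n <- length(sorted_wt)  # 32, an even number

# With an even count, average the two middle values (positions 16 and 17)
manual_median <- (sorted_wt[n / 2] + sorted_wt[n / 2 + 1]) / 2
manual_median  # 3.325
```

The two middle values are 3.215 and 3.435, and their average matches the result of median().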

Mode
 The mode value is the value that appears the greatest number of times.
 R does not have a built-in function to calculate the statistical mode (the
built-in mode() function returns the storage type of an object instead).
However, we can create our own function to find it.
 If we look at the values of the wt variable (from the mtcars data set), we will
see that the number 3.440 appears several times.
1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424

Instead of counting it ourselves, we can use R to find the mode:

 Example
Data_Cars <- mtcars

names(sort(-table(Data_Cars$wt)))[1]

From the example above, we now know that the value that appears most
often in the mtcars wt variable is 3.44, or 3 440 lbs.
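
The one-liner above can be wrapped in a reusable function. The following is only a sketch: get_mode is a hypothetical helper name, not a base R function, and returning all tied values is a design choice made for this illustration.

```r
# get_mode is a hypothetical helper name, not a base R function
get_mode <- function(x) {
  counts <- table(x)
  # Keep every value that reaches the highest count (there may be ties)
  modes <- names(counts)[counts == max(counts)]
  # table() stores the values as text, so convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

get_mode(mtcars$wt)   # 3.44
get_mode(mtcars$cyl)  # 8
```

Unlike taking only the first element of the sorted table, this version reports every value that shares the highest count.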

Percentiles
 A percentile is a value below which a given percentage of the observations
fall.
 If we look at the values of the wt (weight) variable from the mtcars data set:

1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424

 What is the 75th percentile of the weight of the cars? The answer is 3.61, or
3 610 lbs, meaning that 75% of the cars weigh 3 610 lbs or less:
 Example
Data_Cars <- mtcars

# c() specifies which percentile you want
quantile(Data_Cars$wt, c(0.75))

 If we run the quantile() function without specifying which percentiles we
want (the probs argument), we get the 0th, 25th, 50th, 75th and 100th
percentiles:

Example
Data_Cars <- mtcars

quantile(Data_Cars$wt)
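
Several percentiles can be requested in one call, and the meaning of a percentile can be checked against the data. This is a minimal sketch. The name p75 is only chosen for this illustration, and quantile() is used with its default interpolation method.

```r
Data_Cars <- mtcars

# Request several percentiles at once through the probs argument
quantile(Data_Cars$wt, probs = c(0.10, 0.50, 0.90))

# Check the 75th percentile: share of cars at or below it
p75 <- quantile(Data_Cars$wt, probs = 0.75)
mean(Data_Cars$wt <= p75)  # 0.75
```

Here 24 of the 32 cars weigh 3 610 lbs or less, which is exactly 75%.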

Quartiles
 Quartiles are values that divide the data into four equal parts, when sorted in
ascending order:
 The value of the first quartile cuts off the first 25% of the data
 The value of the second quartile cuts off the first 50% of the data
 The value of the third quartile cuts off the first 75% of the data
 The value of the fourth quartile cuts off 100% of the data (the maximum)

 Use the quantile() function to get the quartiles.
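
As a sketch of how this could look for the wt variable (the name quartiles is only chosen for this illustration):

```r
Data_Cars <- mtcars

# First, second and third quartiles of wt
quartiles <- quantile(Data_Cars$wt, probs = c(0.25, 0.50, 0.75))
quartiles

# The second quartile is the median
all.equal(unname(quartiles["50%"]), median(Data_Cars$wt))  # TRUE

# Interquartile range: third quartile minus first quartile
IQR(Data_Cars$wt)
```

IQR() is a base R function that measures the spread of the middle 50% of the data.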
