Rmarkdown
Contents
R data structures
  Vectors
References
• A first contact with a dynamic report that demonstrates some of its features.
• The contents are different excerpts from a book.
• A CSV file is read.
2018-09-10
By the end of these notes, you will understand:
• The basic R data structures and how to use them to store and extract data
• How to get data into R from a variety of source formats
• Common methods for understanding and visualizing complex data
R data structures
The R data structures used most frequently in machine learning are vectors, factors, lists, arrays, and data
frames.
To find out more about machine learning see (Andrieu et al. 2003; Goldberg and Holland 1988).
Vectors
The fundamental R data structure is the vector, which stores an ordered set of values called elements. A
vector can contain any number of elements. However, all the elements must be of the same type; for instance,
a vector cannot contain both numbers and text.
There are several vector types commonly used in machine learning: integer (numbers without decimals),
numeric (numbers with decimals), character (text data), or logical (TRUE or FALSE values). There are
also two special values: NULL, which is used to indicate the absence of any value, and NA, which indicates a
missing value.
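A short sketch with made-up values illustrates the types and the NA special value:
# illustrative vectors of each common type (hypothetical values)
age <- c(21L, 35L, 42L)            # integer
weight <- c(60.5, 72.0, 81.3)      # numeric
name <- c("Ana", "Luis", "Sara")   # character
smoker <- c(FALSE, TRUE, NA)       # logical, with one missing value
is.na(smoker)
## [1] FALSE FALSE  TRUE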
...
...
...
Create vectors of data for three medical patients:
# create vectors of data for three medical patients
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)
# access the second patient's temperature
temperature[2]
## [1] 98.6
The following examples show how to access items in a vector, for instance the items in the range 2 to 3.
## examples of accessing items in vector
# include items in the range 2 to 3
temperature[2:3]
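A couple of other indexing patterns, sketched with the same temperature vector:
# exclude the second item with a negative index
temperature[-2]
## [1]  98.1 101.4
# select items with a logical vector
temperature[c(TRUE, TRUE, FALSE)]
## [1] 98.1 98.6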
Since the dataset is stored in CSV form, we can use the read.csv() function to load the data into an R
data frame:
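The import call itself is not echoed in this extract; a minimal sketch, assuming the file is named usedcars.csv and lives in the working directory:
# load the used car data into a data frame (file name assumed)
usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)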
##### Exploring and understanding data --------------------
One of the first questions to ask in your investigation is how the data is organized. If you are
fortunate, your source will provide a data dictionary, a document that describes the data’s features. In our
case, the used car data does not come with this documentation, so we’ll need to create our own.
# get structure of used car data
str(usedcars)
summary(usedcars[c("price", "mileage")])
## price mileage
## Min. : 3800 Min. : 4867
## 1st Qu.:10995 1st Qu.: 27200
## Median :13592 Median : 36385
## Mean :12962 Mean : 44261
## 3rd Qu.:14904 3rd Qu.: 55125
## Max. :21992 Max. :151479
# calculate the mean income
(36000 + 44000 + 56000) / 3
## [1] 45333.33
mean(c(36000, 44000, 56000))
## [1] 45333.33
# the median income
median(c(36000, 44000, 56000))
## [1] 44000
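To see why both statistics are useful, a short sketch with a hypothetical extreme income added to the group:
# one very large income (hypothetical) pulls the mean up far more than the median
incomes <- c(36000, 44000, 56000, 500000)
mean(incomes)
## [1] 159000
median(incomes)
## [1] 50000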
# the min/max of used car prices
range(usedcars$price)
## [1]  3800 21992
# the span between min and max
diff(range(usedcars$price))
## [1] 18192
# IQR for used car prices
IQR(usedcars$price)
## [1] 3909.5
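The IQR is just the difference between the third and first quartiles, which can be checked directly:
# Q3 minus Q1; should match IQR(usedcars$price), i.e. 3909.5
quantile(usedcars$price, 0.75) - quantile(usedcars$price, 0.25)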
# use quantile to calculate five-number summary
quantile(usedcars$price)
# the 1st and 99th percentiles
quantile(usedcars$price, probs = c(0.01, 0.99))
##       1%      99%
##  5428.69 20505.00
# quintiles
quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))
The following table summarizes mileage and price:
library(knitr)  # provides kable() for formatted tables
mileage <- summary(usedcars$mileage)
price <- summary(usedcars$price)
kable(rbind(mileage, price), caption = "Descriptive statistics: mileage and price")
# arrange the plots in a 2 x 2 grid
par(mfrow = c(2, 2))
hist(usedcars$mileage, xlab = "Mileage", main = "Histogram of mileage", col = "grey85")
hist(usedcars$price, xlab = "Price", main = "Histogram of price", col = "grey85")
# convert transmission to a factor so it can color the scatterplot points
usedcars$transmission <- factor(usedcars$transmission)
plot(usedcars$mileage, usedcars$price, pch = 16,
     col = usedcars$transmission, xlab = "Mileage", ylab = "Price")
legend("topright", pch = 16, c("AUTO", "MANUAL"), col = 1:2, cex = 0.5)
var(usedcars$price)
## [1] 9749892
sd(usedcars$price)
## [1] 3122.482
var(usedcars$mileage)
## [1] 728033954
sd(usedcars$mileage)
## [1] 26982.1
[Figure: 2 x 2 panel with histograms of mileage and price (y-axis: Frequency) and a scatterplot of Price against Mileage, points colored by transmission (AUTO, MANUAL)]
[Figure: Boxplot of Used Car Prices; y-axis: Price ($)]
[Figure: Histogram of Used Car Prices; axes labeled Frequency, Price ($), and Odometer (mi.)]
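The chunks that produced these figures are not echoed in this extract; a minimal sketch of base-graphics calls that could reproduce similar plots, with titles and axis labels taken from the figures above:
# boxplot and histogram of used car prices (labels assumed from the figures)
boxplot(usedcars$price, main = "Boxplot of Used Car Prices", ylab = "Price ($)")
hist(usedcars$price, main = "Histogram of Used Car Prices", xlab = "Price ($)")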
Measuring spread - quartiles and the five-number summary
The five-number summary is a set of five statistics that roughly depict the spread of a dataset. All five
statistics are included in the output of the summary() function (a short example follows the list). Written in order, they are:
1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
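As a quick check, the same five statistics can be obtained directly; a minimal sketch (fivenum() uses Tukey’s hinges, so its quartiles can differ slightly from those reported by summary()):
# five-number summary of used car prices
fivenum(usedcars$price)
summary(usedcars$price)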
In order to calculate the standard deviation, we must first obtain the variance, which is defined as the
average of the squared differences between each value and the mean value. In mathematical notation, the
variance of a set of n values of x is defined by the following formula. The Greek letter mu (µ) (similar in
appearance to an m) denotes the mean of the values, and the variance itself is denoted by the Greek letter
sigma (σ) squared (similar to a b turned sideways):
$$\mathrm{Var}(X) = \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$
The standard deviation is the square root of the variance, and is denoted by sigma as shown in the following
formula:
$$\mathrm{StdDev}(X) = \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$
Note. For more details on using mathematical expressions in LaTeX (R Markdown) see https://es.sharelatex.com/learn/Mathematical_expressions.
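Note that the formulas above divide by n, whereas R’s var() and sd() compute the sample versions that divide by n - 1; a minimal sketch contrasting the two on the price variable:
x <- usedcars$price
n <- length(x)
# population variance and standard deviation (divide by n, as in the formulas above)
sum((x - mean(x))^2) / n
sqrt(sum((x - mean(x))^2) / n)
# sample variance and standard deviation reported by R (divide by n - 1)
var(x)
sd(x)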
Addenda
All these methods can be used to analyze data and to address problems such as the depletion of the ozone layer (López Zavala and others 1986) or socioeconomic issues such as precarious work (Beck 2000).
The main goal is to achieve long-term growth, as discussed in Doppelhofer, Miller, and others (2004).
References
Andrieu, Christophe, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. 2003. “An Introduction to MCMC for Machine Learning.” Machine Learning 50 (1-2). Springer: 5–43.
Beck, Ulrich. 2000. Un Nuevo Mundo Feliz: La Precariedad Del Trabajo En La Era de La Globalización.
Doppelhofer, Gernot, Ronald I Miller, and others. 2004. “Determinants of Long-Term Growth: A Bayesian Averaging of Classical Estimates (BACE) Approach.” The American Economic Review 94 (4). American Economic Association: 813–35.
Goldberg, David E, and John H Holland. 1988. “Genetic Algorithms and Machine Learning.” Machine
Learning 3 (2). Springer: 95–99.
López Zavala, A, and others. 1986. “Capa de Ozono.” In Congreso Nacional de Ingeniería Sanitaria Y
Ambiental, 5, 304–8. SMISAAC.