Managing and Understanding Data

Write your first and last names

September 10, 2018

Contents
R data structures
  Vectors

Exploring and understanding data
  Exploring the structure of data
  Show some records
  Exploring numeric variables
  Table with information about mileage and price
  Some descriptive graphics
  Visualizing numeric variables - boxplots
  Measuring spread - quartiles and the five-number summary
  Measuring spread - variance and standard deviation
  Addenda

References
• A first contact with a dynamic report, showing some of its features.
• The content consists of several excerpts from a book.
• A CSV file is read.

2018-09-10
By the end of these notes, you will understand:
• The basic R data structures and how to use them to store and extract data
• How to get data into R from a variety of source formats
• Common methods for understanding and visualizing complex data

R data structures
The R data structures used most frequently in machine learning are vectors, factors, lists, arrays, and data
frames.
To find out more about machine learning see (Andrieu et al. 2003; Goldberg and Holland 1988).

Vectors

The fundamental R data structure is the vector, which stores an ordered set of values called elements. A
vector can contain any number of elements. However, all the elements must be of the same type; for instance,
a vector cannot contain both numbers and text.
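The same-type rule can be seen directly. The following is a minimal sketch, not part of the original report: the vector name `mixed` is made up for illustration. When values of different types are combined with c(), R does not keep them as they are; it coerces every element to a common type, here character.

```r
# combine a number, a string and a logical value in one vector
# (the name "mixed" is hypothetical, used only for this illustration)
mixed <- c(1, "a", TRUE)

# every element has been coerced to character
print(class(mixed))
print(mixed)
```

This is why a column that should be numeric but contains a stray text entry ends up stored as text.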

There are several vector types commonly used in machine learning: integer (numbers without decimals),
numeric (numbers with decimals), character (text data), or logical (TRUE or FALSE values). There are
also two special values: NULL, which is used to indicate the absence of any value, and NA, which indicates a
missing value.
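As a rough illustration of these types and of the two special values, the sketch below builds one small vector of each kind. The vector names and values are assumptions made for this example. Note that typeof() reports a numeric vector with decimals as "double".

```r
# one small example vector per type (names and values are illustrative)
int_vec <- c(1L, 2L, 3L)              # the L suffix creates integers
num_vec <- c(98.1, 98.6, 101.4)       # numeric, stored as double
chr_vec <- c("John Doe", "Jane Doe")  # character
lgl_vec <- c(FALSE, TRUE)             # logical

types <- c(typeof(int_vec), typeof(num_vec), typeof(chr_vec), typeof(lgl_vec))
print(types)

# NA is a missing value that still occupies a position in the vector
with_na <- c(98.1, NA, 101.4)
print(length(with_na))

# NULL is the absence of any value and has length zero
print(length(NULL))

# most summary functions return NA unless missing values are removed
print(mean(with_na))
print(mean(with_na, na.rm = TRUE))
```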
...
...
...
Create vectors of data for three medical patients:
# create vectors of data for three medical patients
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)

Access the second element in the body temperature vector:

# access the second element in body temperature vector
temperature[2]

## [1] 98.6
Further examples of accessing items in a vector. Include the items in the range 2 to 3:
## examples of accessing items in vector
# include items in the range 2 to 3
temperature[2:3]

## [1] 98.6 101.4

Exclude item 2 using the minus sign
# exclude item 2 using the minus sign
temperature[-2]

## [1] 98.1 101.4

Use a logical vector to indicate whether to include each item:
# use a vector to indicate whether to include item
temperature[c(TRUE, TRUE, FALSE)]

## [1] 98.1 98.6
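
The logical vector does not have to be typed by hand. As a sketch under the assumption that the three patient vectors above are defined as shown (the name `feverish` and the threshold of 100 are made up for illustration), a comparison produces a logical vector that can be used to select elements from another vector of the same length:

```r
# the patient vectors from the examples above
subject_name <- c("John Doe", "Jane Doe", "Steve Graves")
temperature <- c(98.1, 98.6, 101.4)
flu_status <- c(FALSE, FALSE, TRUE)

# a comparison returns one TRUE/FALSE per element
# (the threshold 100 is an arbitrary value chosen for this illustration)
feverish <- temperature > 100

# select the names of patients with a temperature above the threshold
print(subject_name[feverish])

# an existing logical vector can be used in the same way
print(subject_name[flu_status])
```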

Exploring and understanding data

After collecting data and loading it into R data structures, the next step in the machine learning process
involves examining the data in detail. It is during this step that you will begin to explore the data’s features
and examples, and realize the peculiarities that make your data unique. The better you understand your data,
the better you will be able to match a machine learning model to your learning problem. The best way to
understand the process of data exploration is by example. In this section, we will explore the usedcars.csv
dataset, which contains actual data about used cars recently advertised for sale on a popular U.S. website.
...
...
...

Since the dataset is stored in CSV form, we can use the read.csv() function to load the data into an R
data frame:
##### Exploring and understanding data --------------------

## data exploration example using used car data

usedcars <- read.csv(file1, stringsAsFactors = FALSE)
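
To see what the stringsAsFactors argument changes, the sketch below writes a tiny CSV file to a temporary location and reads it back twice. It does not use the real usedcars.csv: the file path `csv_path` and the object names are assumptions for this illustration, and the two data rows are copied from the table of first records shown in this report.

```r
# write a two-row CSV file to a temporary path (illustrative data only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,model,price,mileage,color,transmission",
             "2011,SEL,21992,7413,Yellow,AUTO",
             "2012,SE,17500,8367,White,AUTO"),
           csv_path)

# text columns stay as character vectors
cars_chr <- read.csv(csv_path, stringsAsFactors = FALSE)
print(class(cars_chr$model))

# text columns are converted to factors
cars_fct <- read.csv(csv_path, stringsAsFactors = TRUE)
print(class(cars_fct$model))

# remove the temporary file
unlink(csv_path)
```

Setting the argument explicitly keeps the result the same regardless of which default the installed R version uses.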

Exploring the structure of data

One of the first questions to ask in your investigation should be about how data is organized. If you are
fortunate, your source will provide a data dictionary, a document that describes the data’s features. In our
case, the used car data does not come with this documentation, so we’ll need to create our own.
# get structure of used car data
str(usedcars)

## 'data.frame': 150 obs. of 6 variables:

## $ year : int 2011 2011 2011 2011 2012 2010 2011 2010 2011 2010 ...
## $ model : chr "SEL" "SEL" "SEL" "SEL" ...
## $ price : int 21992 20995 19995 17809 17500 17495 17000 16995 16995 16995 ...
## $ mileage : int 7413 10926 7351 11613 8367 25125 27393 21026 32655 36116 ...
## $ color : chr "Yellow" "Gray" "Silver" "Gray" ...
## $ transmission: chr "AUTO" "AUTO" "AUTO" "AUTO" ...

Show some records

# table of the first 6 records
knitr::kable(head(usedcars), caption = "First 6 records of the data")

Table 1: First 6 records of the data

year  model  price  mileage  color   transmission
2011  SEL    21992     7413  Yellow  AUTO
2011  SEL    20995    10926  Gray    AUTO
2011  SEL    19995     7351  Silver  AUTO
2011  SEL    17809    11613  Gray    AUTO
2012  SE     17500     8367  White   AUTO
2010  SEL    17495    25125  Silver  AUTO

Exploring numeric variables

## Exploring numeric variables -----

# summarize numeric variables

summary(usedcars$year)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 2000 2008 2009 2009 2010 2012

summary(usedcars[c("price", "mileage")])

## price mileage
## Min. : 3800 Min. : 4867
## 1st Qu.:10995 1st Qu.: 27200
## Median :13592 Median : 36385
## Mean :12962 Mean : 44261
## 3rd Qu.:14904 3rd Qu.: 55125
## Max. :21992 Max. :151479
# calculate the mean income
(36000 + 44000 + 56000) / 3

## [1] 45333.33
mean(c(36000, 44000, 56000))

## [1] 45333.33
# the median income
median(c(36000, 44000, 56000))

## [1] 44000
# the min/max of used car prices
range(usedcars$price)

## [1] 3800 21992

# the difference of the range
diff(range(usedcars$price))

## [1] 18192
# IQR for used car prices
IQR(usedcars$price)

## [1] 3909.5
# use quantile to calculate five-number summary
quantile(usedcars$price)

## 0% 25% 50% 75% 100%

## 3800.0 10995.0 13591.5 14904.5 21992.0
# the 1st and 99th percentiles
quantile(usedcars$price, probs = c(0.01, 0.99))

## 1% 99%
## 5428.69 20505.00
# quintiles
quantile(usedcars$price, seq(from = 0, to = 1, by = 0.20))

## 0% 20% 40% 60% 80% 100%

## 3800.0 10759.4 12993.8 13992.0 14999.0 21992.0

Table with information about mileage and price

mileage <- summary(usedcars$mileage)
price <- summary(usedcars$price)
knitr::kable(rbind(mileage, price), caption = "Descriptive statistics: mileage and price")

Table 2: Descriptive statistics: mileage and price

         Min.   1st Qu.   Median      Mean  3rd Qu.    Max.
mileage  4867  27200.25  36385.0  44260.65  55124.5  151479
price    3800  10995.00  13591.5  12961.93  14904.5   21992

Some descriptive graphics

# arrange the plots in a 2 x 2 grid
par(mfrow = c(2, 2))
hist(usedcars$mileage, xlab = "Mileage", main = "Histogram of mileage", col = "grey85")
hist(usedcars$price, xlab = "Price", main = "Histogram of price", col = "grey85")
# convert transmission to a factor so that its levels can be used as colours
usedcars$transmission <- factor(usedcars$transmission)
plot(usedcars$mileage, usedcars$price, pch = 16,
     col = usedcars$transmission, xlab = "Mileage", ylab = "Price")
legend("topright", pch = 16, legend = c("AUTO", "MANUAL"), col = 1:2, cex = 0.5)

Visualizing numeric variables - boxplots

# boxplot of used car prices and mileage
boxplot(usedcars$price, main = "Boxplot of Used Car Prices", ylab = "Price ($)")
boxplot(usedcars$mileage, main = "Boxplot of Used Car Mileage",
        ylab = "Odometer (mi.)")

# histograms of used car prices and mileage
hist(usedcars$price, main = "Histogram of Used Car Prices",
     xlab = "Price ($)")
hist(usedcars$mileage, main = "Histogram of Used Car Mileage",
     xlab = "Odometer (mi.)")
# variance and standard deviation of the used car data
var(usedcars$price)

## [1] 9749892
sd(usedcars$price)

## [1] 3122.482
var(usedcars$mileage)

## [1] 728033954
sd(usedcars$mileage)

## [1] 26982.1

Figure 1: Descriptive graphics (panels: histogram of mileage; histogram of price; scatter plot of price against mileage, with points coloured by transmission, AUTO or MANUAL)

Figure 2: Boxplot of prices (y-axis: Price ($))

Figure 3: Boxplot of Mileage (y-axis: Odometer (mi.))

Figure 4: Histogram of Used Car Prices (x-axis: Price ($); y-axis: Frequency)

Figure 5: Histogram of mileage (x-axis: Odometer (mi.); y-axis: Frequency)

Measuring spread - quartiles and the five-number summary

The five-number summary is a set of five statistics that roughly depict the spread of a dataset. All five of
the statistics are included in the output of the summary() function. Written in order, they are:
1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
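
As a small worked illustration of the five numbers, the sketch below applies quantile() to the three income values used earlier in these notes. The object names `five_num` and `iqr_manual` are made up for this example. With its default settings quantile() returns the minimum, Q1, the median, Q3 and the maximum, and the interquartile range is the difference between Q3 and Q1.

```r
# the three income values used earlier in these notes
income <- c(36000, 44000, 56000)

# default probabilities are 0%, 25%, 50%, 75% and 100%
five_num <- quantile(income)
print(five_num)

# the interquartile range is Q3 minus Q1
iqr_manual <- unname(five_num["75%"] - five_num["25%"])
print(iqr_manual)

# IQR() gives the same value
print(isTRUE(all.equal(iqr_manual, IQR(income))))
```

R interpolates between neighbouring values when a quartile falls between two observations, which is why Q1 and Q3 here are not values that occur in the data.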

Measuring spread - variance and standard deviation

In order to calculate the standard deviation, we must first obtain the variance, which is defined as the
average of the squared differences between each value and the mean value. In mathematical notation, the
variance of a set of n values of x is defined by the following formula. The Greek letter mu (µ) (similar in
appearance to an m) denotes the mean of the values, and the variance itself is denoted by the Greek letter
sigma (σ) squared (similar to a b turned sideways):

$$\mathrm{Var}(X) = \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

The standard deviation is the square root of the variance, and is denoted by sigma as shown in the following
formula:
$$\mathrm{StdDev}(X) = \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$
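
The formulas above divide by n, whereas R's var() and sd() functions divide by n - 1 (the sample versions), so the two give slightly different numbers. The sketch below, which reuses the three income values from earlier in these notes and introduces the illustrative names `var_pop`, `sd_pop` and `var_sample`, computes the variance exactly as written in the formula and shows how it relates to the result of var().

```r
# the three income values used earlier in these notes
x <- c(36000, 44000, 56000)
n <- length(x)
mu <- mean(x)

# variance as in the formula: average squared difference from the mean
var_pop <- sum((x - mu)^2) / n

# standard deviation as in the formula: square root of that variance
sd_pop <- sqrt(var_pop)

# R's built-in var() divides by n - 1 instead of n
var_sample <- var(x)

print(var_pop)
print(var_sample)

# rescaling by (n - 1) / n recovers the formula's value
print(isTRUE(all.equal(var_pop, var_sample * (n - 1) / n)))
```

For large n the difference between the two versions becomes negligible; for a vector of three values it is substantial.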

Note. For more details on using mathematical expressions in LaTeX (R Markdown) see
https://es.sharelatex.com/learn/Mathematical_expressions.

Addenda

All these methods should be used to analyze data and to solve problems such as the ozone layer (López Zavala
and others 1986) or socioeconomic problems such as precarious work (Beck 2000).
The main goal is to accomplish long-term growth as stated in Doppelhofer, Miller, and others (2004).

References
Andrieu, Christophe, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. 2003. “An Introduction to
MCMC for Machine Learning.” Machine Learning 50 (1-2). Springer: 5–43.
Beck, Ulrich. 2000. Un Nuevo Mundo Feliz: La Precariedad Del Trabajo En La Era de La Globalización.
Doppelhofer, Gernot, Ronald I Miller, and others. 2004. “Determinants of Long-Term Growth: A Bayesian
Averaging of Classical Estimates (BACE) Approach.” The American Economic Review 94 (4). American
Economic Association: 813–35.
Goldberg, David E, and John H Holland. 1988. “Genetic Algorithms and Machine Learning.” Machine
Learning 3 (2). Springer: 95–99.
López Zavala, A, and others. 1986. “Capa de Ozono.” In Congreso Nacional de Ingeniería Sanitaria Y
Ambiental, 5, 304–8. SMISAAC.
