Beginning Data Science in R 4
Data Analysis, Visualization, and Modelling for the Data Scientist
Second Edition
Thomas Mailund
Beginning Data Science in R 4: Data Analysis, Visualization, and Modelling for the
Data Scientist

Thomas Mailund
Aarhus, Denmark

ISBN-13 (pbk): 978-1-4842-8154-3 ISBN-13 (electronic): 978-1-4842-8155-0

https://doi.org/10.1007/978-1-4842-8155-0

Copyright © 2022 by Thomas Mailund

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the
trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not
identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Steve Anglin
Development Editor: James Markham
Coordinating Editor: Mark Powers
Cover designed by eStudioCalamar
Cover image by Pixabay (www.pixabay.com)
Distributed to the book trade worldwide by Apress Media, LLC, 1 New York Plaza, New York, NY 10004,
U.S.A. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit
www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science
+ Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail [email protected]; for reprint,
paperback, or audio rights, please e-mail [email protected].
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and
licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales
web page at http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available
to readers on GitHub (https://github.com/Apress). For more detailed information, please visit
http://www.apress.com/source-code.
Printed on acid-free paper
Table of Contents
About the Author

About the Technical Reviewer

Acknowledgments

Introduction

Chapter 1: Introduction to R Programming

Basic Interaction with R
Using R As a Calculator
Simple Expressions
Assignments
Indexing Vectors
Vectorized Expressions
Comments
Functions
Getting Documentation for Functions
Writing Your Own Functions
Summarizing and Vector Functions
A Quick Look at Control Flow
Factors
Data Frames
Using R Packages
Dealing with Missing Values
Data Pipelines
Writing Pipelines of Function Calls
Writing Functions That Work with Pipelines
The Magical “.” Argument
Other Pipeline Operations
Coding and Naming Conventions
Exercises
Mean of Positive Values
Root Mean Square Error

Chapter 2: Reproducible Analysis

Literate Programming and Integration of Workflow and Documentation
Creating an R Markdown/knitr Document in RStudio
The YAML Language
The Markdown Language
Formatting Text
Cross-Referencing
Bibliographies
Controlling the Output (Templates/Stylesheets)
Running R Code in Markdown Documents
Using chunks when analyzing data (without compiling documents)
Caching Results
Displaying Data
Exercises
Create an R Markdown Document
Different Output
Caching

Chapter 3: Data Manipulation

Data Already in R
Quickly Reviewing Data
Reading Data
Examples of Reading and Formatting Data Sets
Breast Cancer Data Set
Boston Housing Data Set
The readr Package
Manipulating Data with dplyr
Some Useful dplyr Functions
Breast Cancer Data Manipulation
Tidying Data with tidyr
Exercises
Importing Data
Using dplyr
Using tidyr

Chapter 4: Visualizing Data

Basic Graphics
The Grammar of Graphics and the ggplot2 Package
Using qplot()
Using Geometries
Facets
Scaling
Themes and Other Graphics Transformations
Figures with Multiple Plots
Exercises

Chapter 5: Working with Large Data Sets

Subsample Your Data Before You Analyze the Full Data Set
Running Out of Memory During an Analysis
Too Large to Plot
Too Slow to Analyze
Too Large to Load
Exercises
Subsampling
Hex and 2D Density Plots

Chapter 6: Supervised Learning

Machine Learning
Supervised Learning
Regression vs. Classification
Inference vs. Prediction
Specifying Models
Linear Regression
Logistic Regression (Classification, Really)
Model Matrices and Formula
Validating Models
Evaluating Regression Models
Evaluating Classification Models
Confusion Matrix
Accuracy
Sensitivity and Specificity
Other Measures
More Than Two Classes
Sampling Approaches
Random Permutations of Your Data
Cross-Validation
Selecting Random Training and Testing Data
Examples of Supervised Learning Packages
Decision Trees
Random Forests
Neural Networks
Support Vector Machines
Naive Bayes
Exercises
Fitting Polynomials
Evaluating Different Classification Measures
Breast Cancer Classification
Leave-One-Out Cross-Validation (Slightly More Difficult)
Decision Trees
Random Forests
Neural Networks
Support Vector Machines
Compare Classification Algorithms

Chapter 7: Unsupervised Learning

Dimensionality Reduction
Principal Component Analysis
Multidimensional Scaling
Clustering
k-means Clustering
Hierarchical Clustering
Association Rules
Exercises
Dealing with Missing Data in the HouseVotes84 Data
k-means

Chapter 8: Project 1: Hitting the Bottle

Importing Data
Exploring the Data
Distribution of Quality Scores
Is This Wine Red or White?
Fitting Models
Exercises
Exploring Other Formulas
Exploring Different Models
Analyzing Your Own Data Set

Chapter 9: Deeper into R Programming

Expressions
Arithmetic Expressions
Boolean Expressions
Basic Data Types
Numeric
Integer
Complex
Logical
Character
Data Structures
Vectors
Matrix
Lists
Indexing
Named Values
Factors
Formulas
Control Structures
Selection Statements
Loops
Functions
Named Arguments
Default Parameters
Return Values
Lazy Evaluation
Scoping
Function Names Are Different from Variable Names
Recursive Functions
Exercises
Fibonacci Numbers
Outer Product
Linear Time Merge
Binary Search
More Sorting
Selecting the k Smallest Element

Chapter 10: Working with Vectors and Lists

Working with Vectors and Vectorizing Functions
ifelse
Vectorizing Functions
The apply Family
apply
Nothing Good, It Would Seem
lapply
sapply and vapply
Advanced Functions
Special Names
Infix Operators
Replacement Functions
How Mutable Is Data Anyway?
Exercises
between
rmq

Chapter 11: Functional Programming

Anonymous Functions
Higher-Order Functions
Functions Taking Functions As Arguments
Functions Returning Functions (and Closures)
Filter, Map, and Reduce
Functional Programming with purrr
Functions As Both Input and Output
Ellipsis Parameters…
Exercises
apply_if
power
Row and Column Sums
Factorial Again…
Function Composition
Implement This Operator

Chapter 12: Object-Oriented Programming

Immutable Objects and Polymorphic Functions
Data Structures
Example: Bayesian Linear Model Fitting
Classes
Polymorphic Functions
Defining Your Own Polymorphic Functions
Class Hierarchies
Specialization As Interface
Specialization in Implementations
Exercises
Shapes
Polynomials

Chapter 13: Building an R Package

Creating an R Package
Package Names
The Structure of an R Package
.Rbuildignore
Description
Title
Version
Description
Author and Maintainer
License
Type, Date, LazyData
URL and BugReports
Dependencies
Using an Imported Package
Using a Suggested Package
NAMESPACE
R/ and man/
Checking the Package
Roxygen
Documenting Functions
Import and Export
Package Scope vs. Global Scope
Internal Functions
File Load Order
Adding Data to Your Package
NULL
Building an R Package
Exercises

Chapter 14: Testing and Package Checking

Unit Testing
Automating Testing
Using testthat
Writing Good Tests
Using Random Numbers in Tests
Testing Random Results
Checking a Package for Consistency
Exercise

Chapter 15: Version Control

Version Control and Repositories
Using Git in RStudio
Installing Git
Making Changes to Files, Staging Files, and Committing Changes
Adding Git to an Existing Project
Bare Repositories and Cloning Repositories
Pushing Local Changes and Fetching and Pulling Remote Changes
Handling Conflicts
Working with Branches
Typical Workflows Involve Lots of Branches
Pushing Branches to the Global Repository
GitHub
Moving an Existing Repository to GitHub
Installing Packages from GitHub
Collaborating on GitHub
Pull Requests
Forking Repositories Instead of Cloning
Exercises

Chapter 16: Profiling and Optimizing

Profiling
A Graph-Flow Algorithm
Speeding Up Your Code
Parallel Execution
Switching to C++
Exercises

Chapter 17: Project 2: Bayesian Linear Regression

Bayesian Linear Regression
Exercises: Priors and Posteriors
Predicting Target Variables for New Predictor Values
Formulas and Their Model Matrix
Working with Model Matrices in R
Exercises
Model Matrices Without Response Variables
Exercises
Interface to a blm Class
Constructor
Updating Distributions: An Example Interface
Designing Your blm Class
Model Methods
Building an R Package for blm
Deciding on the Package Interface
Organization of Source Files
Document Your Package Interface Well
Adding README and NEWS Files to Your Package
Testing
GitHub

Conclusions
Data Science
Machine Learning
Data Analysis
R Programming
The End

Index

About the Author
Thomas Mailund is an associate professor in bioinformatics at Aarhus University,
Denmark. His background is in math and computer science, but for the last decade, his
main focus has been on genetics and evolutionary studies, particularly comparative
genomics, speciation, and gene flow between emerging species.

About the Technical Reviewer
Jon Westfall is an associate professor of psychology at Delta
State University. He has authored Set Up and Manage Your
Virtual Private Server, Practical R 4, Beginning Android
Web Apps Development, Windows Phone 7 Made Simple,
and several works of fiction including One in the Same,
Mandate, and Franklin: The Ghost Who Successfully Evicted
Hipsters from His Home and Other Short Stories. He lives in
Cleveland, Mississippi, with his wife.

Acknowledgments
I would like to thank Asger Hobolth for many valuable comments on earlier versions of
this manuscript that helped me improve the writing and the presentation of the material.

Introduction
Welcome to Beginning Data Science in R 4. I wrote this book from a set of lecture notes
for two classes I taught a few years back, “Data Science: Visualization and Analysis”
and “Data Science: Software Development and Testing.” The book is written to fit the
structure of these classes, where each class consists of seven weeks of lectures followed
by project work. This means that the book’s first half consists of eight chapters with core
material, where the first seven focus on data analysis and the eighth is an example of
a data analysis project. The data analysis chapters are followed by eight chapters on
developing reusable software for data science and then a second project that ties the
software development chapters together. At the end of the book, you should have a good
sense of what data science can be, both as a field of data analysis and as a field of
developing new methods and reusable software products.

What Is Data Science?

That is a difficult question. I don’t know if it is easy to find someone who is entirely sure
what data science is, but I am pretty sure that it would be difficult to find two people
without having three opinions about it. It is undoubtedly a popular buzzword, and
everyone wants to hire data scientists these days, so data science skills are helpful to
have on the CV. But what is it?
Since I can’t give you an agreed-upon definition, I will just give you my own: data
science is the science of learning from data.
This definition is very broad—almost too broad to be useful. I realize this. But then, I
think data science is an incredibly general field. I don’t have a problem with that. Of course,
you could argue that any science is all about getting information out of data, and you might
be right. However, I would say that there is more to science than just transforming raw
data into useful information. The sciences focus on answering specific questions about
the world, while data science focuses on how to manipulate data efficiently and effectively.
The primary focus is not which questions to ask of the data but how we can answer them,
whatever they may be. It is more like computer science and mathematics than it is like
natural sciences, in this way. It isn’t so much about studying the natural world as it is about
computing efficiently on data and learning patterns from the data.
Included in data science is also the design of experiments. With the right data, we
can address the questions in which we are interested. This can be difficult with a poor
design of experiments or a poor choice of which data we gather. Study design might be
the most critical aspect of data science but is not the topic of this book. In this book, I
focus on the analysis of data, once gathered.
Computer science is mainly the study of computations, hinted at in the name, but is
a bit broader. It is also about representing and manipulating data. The name “computer
science” focuses on computation, while “data science” emphasizes data. But of course,
the fields overlap. If you are writing a sorting algorithm, are you then focusing on the
computation or the data? Is that even a meaningful question to ask?
There is considerable overlap between computer science and data science, and,
naturally, the skill sets you need overlap as well. To efficiently manipulate data, you
need the tools for doing that, so computer programming skills are a must, and some
knowledge about algorithms and data structures usually is as well. For data science,
though, the focus is always on the data. A data analysis project focuses on how the data
flows from its raw form through various manipulations until it is summarized in some
helpful way. Although the difference can be subtle, the focus is not on what operations
a program does during the analysis but how the data flows and is transformed. It is also
focused on why we do certain data transformations, what purpose those changes serve,
and how they help us gain knowledge about the data. It is as much about deciding what
to do with the data as it is about how to do it efficiently.
Statistics is, of course, also closely related to data science. So closely linked that many
consider data science as nothing more than a fancy word for statistics that looks slightly
more modern and sexy. I can’t say that I strongly disagree with this—data science does
sound hotter than statistics—but just as data science is slightly different from computer
science, data science is also somewhat different from statistics. Only, perhaps, somewhat
less so than computer science is.
A large part of doing statistics is building mathematical models for your data and
fitting the models to the data to learn about the data in this way. That is also what we
do in data science. As long as the focus is on the data, I am happy to call statistics data
science. But suppose the focus changes to the models and the mathematics. In that case,
we are drifting away from data science into something else—just as if the focus shifts
from the data to computations, we are straying from data science to computer science.

Data science is also related to machine learning and artificial intelligence, and
again, there are huge overlaps. Perhaps not surprising since something like machine
learning has its home both in computer science and statistics; if it focuses on data
analysis, it is also at home in data science. To be honest, it has never been clear to
me when a mathematical model changes from being a plain old statistical model to
becoming machine learning anyway.
For this book, we are just going to go with my definition, and, as long as we are
focusing on analyzing data, we will call it data science.

Prerequisites for Reading This Book

For the first eight chapters in this book, the focus is on data analysis and not
programming. For those eight chapters, I do not assume a detailed familiarity with
software design, algorithms, data structures, etc. I do not expect you to have any
experience with the R programming language either. However, I assume that you have
had some experience with programming, mathematical modelling, and statistics.
Programming R can be quite tricky at times if you are familiar with scripting
languages or object-oriented languages. R is a functional language that does not allow
you to modify data. While it does have systems for object-oriented programming, it
handles this programming paradigm very differently from languages you are likely to
have seen, such as Java or Python.
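As a rough illustration of what "does not allow you to modify data" means in practice, the sketch below passes a vector to a function that assigns to one of its elements. The helper name `set_first` is hypothetical and used only for this example. Under R's usual copy-on-modify semantics, the assignment inside the function affects a local copy, so the caller's vector is unchanged.

```r
# Hypothetical helper for illustration only (not a function from the book).
set_first <- function(v, value) {
  v[1] <- value  # this assignment changes the function's local copy of v
  v              # the modified copy is the return value
}

x <- c(1, 2, 3)
y <- set_first(x, 100)

x  # the caller's vector still holds 1 2 3
y  # the returned copy holds 100 2 3
```

To keep a change, you assign the result back to a name, as in `x <- set_first(x, 100)`.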
For the data analysis part of this book, the first eight chapters, we will only use R for
very straightforward programming tasks, so none of this should pose a problem. We
will have to write simple scripts for manipulating and summarizing data, so you should
be familiar with how to write basic expressions like function calls, if statements, loops,
and such—these things you will have to be comfortable with. I will introduce every such
construction in the book when we need them to let you see how they are written in R, but
I will not spend much time explaining them. Mostly, I will expect you to be able to pick it
up from examples.
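For orientation, here is a small sketch of the kind of constructions meant here: a function definition, an if statement, a loop, and a function call. The function name `describe_sign` and the example values are assumptions made up for this illustration.

```r
# Hypothetical helper: classify a single number by its sign.
describe_sign <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

values <- c(-2, 0, 3)
labels <- character(length(values))  # pre-allocate the result vector

# Loop over the indices of values and call the function on each element.
for (i in seq_along(values)) {
  labels[i] <- describe_sign(values[i])
}

labels
```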
Similarly, I do not expect you to already know how to fit data and compare models
in R. I do assume that you have had enough introduction to statistics to be comfortable
with basic terms like parameter estimation, model fitting, explanatory and response
variables, and model comparison. If not, I expect you to at least be able to pick up what
we are talking about when you need to.

I won’t expect you to know a lot about statistics and programming, but this isn’t
“Data Science for Dummies,” so I expect you to figure out examples without me
explaining everything in detail.
After the first seven chapters is a short description of a data analysis project that one
of my students did for my class the first time I held it. It shows how such a project could
look, but I suggest that you do not wait until you have finished the first seven chapters to
start doing such analysis yourself. To get the most benefit out of reading this book, you
should continuously apply what you learn. Already when you begin reading, I suggest
that you find a data set that you would be interested in finding out more about and then
apply what you learn in each chapter to that data.
For the following eight chapters of the book, the focus is on programming. To read
this part, you should be familiar with object-oriented programming—I will explain
how we handle it in R and how it differs from languages such as Python, Java, or C++.
Still, I will expect you to be familiar with terms such as class hierarchies, inheritance,
and polymorphic methods. I will not expect you to be already familiar with functional
programming (but if you are, there should still be plenty to learn in those chapters if you
are not already familiar with R programming). The final chapter is yet another project
description.

Plan for the Book

In the book, we will cover basic data manipulation:

• Filtering and selecting relevant data

• Transforming data into shapes readily analyzable

• Summarizing data

• Visualizing data in informative ways both for exploring data and presenting results

• Model building

These are the critical aspects of doing analysis in data science. After this, we will
cover how to develop R code that is reusable and works well with existing packages and
that is easy to extend, and we will see how to build new R packages that other people
will be able to use in their projects. These are the essential skills you will need to develop
your own methods and share them with the world.
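To give a feeling for some of the steps listed above, here is a minimal sketch using only base R and the built-in `mtcars` data set. The choice of data set, columns, and model is an assumption made for illustration; it is not an example taken from the book, which introduces its own tools for these tasks in later chapters.

```r
# Filtering rows and selecting columns: keep 4- and 6-cylinder cars
# and only the columns we need.
cars <- mtcars[mtcars$cyl %in% c(4, 6), c("mpg", "cyl", "wt")]

# Transforming: add a derived column (wt is recorded in units of 1000 lbs).
cars$wt_kg <- cars$wt * 1000 * 0.45359237

# Summarizing: mean miles per gallon for each cylinder count.
mean_mpg <- aggregate(mpg ~ cyl, data = cars, FUN = mean)
mean_mpg

# Model building: a simple linear model of fuel efficiency against weight.
fit <- lm(mpg ~ wt, data = cars)
coef(fit)
```

Each step takes a data frame and produces a new value, which is the style of analysis that data pipelines build on.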

R is one of the most popular (and open source) data analysis programming
languages around at the moment. Of course, popularity doesn’t imply quality. Still,
because R is so popular, it has a rich ecosystem of extensions (called “packages” in R) for
just about any kind of analysis you could be interested in. People who develop statistical
methods often implement them as R packages, so you can usually get the state-of-the-art
techniques very easily in R. The popularity also means that there is a large community
of people who can help if you have problems. Most problems you run into can be solved
with a few minutes on Google or Stack Overflow because you are unlikely to be the first
to run into any particular issue. There are also plenty of online tutorials for learning
more about R and specialized packages. And there are plenty of books you can buy if you
want to learn more.

Data Analysis and Visualization

I cover the topics focusing on data analysis and visualization in the first eight chapters:

1. Introduction to R Programming: In this chapter, we learn how to
work with data and write data pipelines.

2. Reproducible Analysis: In this chapter, we find out how to
integrate documentation and analysis in a single document and
how to use such documents to produce reproducible research.

3. Data Manipulation: In this chapter, we learn how to import data,
tidy up data, transform, and compute summaries from data.

4. Visualizing Data: In this chapter, we learn how to make plots for
exploring data features and presenting data features and analysis
results.

5. Working with Large Data Sets: In this chapter, we see how to deal
with data where the number of observations makes our usual
approaches too slow.

6. Supervised Learning: In this chapter, we learn how to train models
when we have data sets with known classes or regression values.

7. Unsupervised Learning: In this chapter, we learn how to search for
patterns we are not aware of in data.

8. Project 1: Hitting the Bottle: Following these chapters is the first
project, an analysis of physicochemical features of wine, where we
see the various techniques in use.

Software Development
The next nine chapters cover software and package development:

1. Deeper into R Programming: In this chapter, we explore more
advanced features of the R programming language.

2. Working with Vectors and Lists: In this chapter, we explore two
essential data structures, namely, vectors and lists.

3. Functional Programming: In this chapter, we explore an advanced
feature of the R programming language, namely, functional
programming.

4. Object-Oriented Programming: In this chapter, we learn how R
handles object orientation and how we can use it to write more
generic code.

5. Building an R Package: In this chapter, we learn the necessary
components of an R package and how we can program our own.

6. Testing and Package Checking: In this chapter, we learn
techniques for testing our R code and checking our R packages’
consistency.

7. Version Control: In this chapter, we learn how to manage code
under version control and how to collaborate using GitHub.

8. Profiling and Optimizing: In this chapter, we learn how to identify
code hotspots where inefficient solutions are slowing us down and
techniques for alleviating this.

9. Project 2: Bayesian Linear Regression: In the final chapter, we
get to the second project, where we build a package for Bayesian
linear regression.

Getting R and RStudio

You will need to install R on your computer to do the exercises in this book. I suggest that
you get an integrated environment since it can be slightly easier to keep track of a project
when you have your plots, documentation, code, etc., all in the same program.
I use RStudio (www.rstudio.com/products/RStudio), which I warmly recommend.
You can get it for free—just follow the link—and I will assume that you have it when I
need to refer to the software environment you are using in the following chapters. Little
of it will be specific to RStudio, though, and most tools for working with R have mostly
the same features, so if you want to use something else, you can probably follow the
notes without any difficulties.

Projects
You cannot learn how to analyze data without analyzing data, and you cannot
understand how to develop software without developing software either. Typing in
examples from the book is nothing like writing code on your own. Even doing exercises
from the book—which you really ought to do—is not the same as working on your own
projects. Exercises, after all, cover minor isolated aspects of problems you have just been
introduced to. There is not a chapter of material presented before every task you have to
deal with in the real world. You need to work out by yourself what needs to be done and
how. If you only do the exercises in this book, you will miss the most crucial lessons in
analyzing data:
• How to explore the data and get a feeling for it
• How to do the detective work necessary to pull out some understanding from the data
• How to deal with all the noise and weirdness found in any data set

And for developing a package, you need to think through how to design and
implement its functionality such that the various functions and data structures fit well
together.
In this book, I will go through a data analysis project to show you what that can look like.
To learn how to analyze data on your own, you need to do it yourself as well, and
you need to do it with a data set that I haven’t explored for you. You might have a data
set lying around you have worked on before, a data set from something you are just
interested in, or you can probably find something interesting at a public data repository,
for example, one of these:

• RDataMining.com: www.rdatamining.com/resources/data
• UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
• KDNuggets: www.kdnuggets.com/datasets/index.html
• Reddit R Data sets: www.reddit.com/r/datasets
• GitHub Awesome Public Data sets: https://github.com/caesar0301/awesome-public-datasets

I suggest that you find yourself a data set and that you, after each lesson, use the
skills you have learned to explore this data set. Pick data structured as a table with
observations as rows and variables as columns since that is the form of the data we will
consider in this book. At the end of the first eight chapters, you will have analyzed this
data. You can write a report about your analysis that others can follow, evaluate, and
maybe modify: you will be doing reproducible science.
For the programming topics, I will describe another project illustrating the design
and implementation issues involved in making an R package. There, you should be able
to learn from implementing your own version of the project I use, but you will, of course,
be more challenged by working on a project without any of my help at all. Whatever you
do, to get the full benefit of this book, you really ought to make your own package while
reading the programming chapters.

CHAPTER 1

Introduction to R Programming
We will use R for our data analysis, so we need to know the basics of programming in
the R language. R is a full programming language with both functional programming
and object-oriented programming features, and learning the complete language is
far beyond the scope of this chapter. We return to it later, when we have a little more
experience using R. The good news is, though, that to use R for data analysis, we rarely
need to do much programming. At least, if you do the right kind of programming, you
won’t need much.
For manipulating data—how to do this is the topic of the next chapter—you mainly
have to string together a couple of operations, such as “group the data by this feature”
followed by “calculate the mean value of these features within each group” and then
“plot these means.” This used to be more complicated in R, but a couple of
new ideas on how to structure data flow, and some clever implementations of these in
packages such as magrittr and dplyr, have significantly simplified it. We will see some
of this at the end of this chapter and more in the next chapter. First, though, we need to
get a taste of R.

Basic Interaction with R

Start by downloading RStudio if you haven’t done so already. If you open it, you should
get a window similar to Figure 1-1. Well, except that you will be in an empty project while
the figure shows (on the top right) that this RStudio is opened in a project called “Data
Science.” You always want to be working on a project. Projects keep track of the state of
your analysis by remembering variables and functions you have written and keep track
of which files you have opened and such. Go to File and then New Project to create a
project. You can create a project from an existing directory, but if this is the first time you
are working with R, you probably just want to create an empty project in a new directory,
so do that.

© Thomas Mailund 2022
T. Mailund, Beginning Data Science in R 4, https://doi.org/10.1007/978-1-4842-8155-0_1

Figure 1-1. RStudio

Once you have RStudio opened, you can type R expressions into the console, which
is the frame on the left of the RStudio window. When you write an expression there, R
will read it, evaluate it, and print the result. When you assign values to variables, and
we will see how to do this shortly, they will appear in the Environment frame on the top
right. At the bottom right, you have the directory where the project lives, and files you
create will go there.
To create a new file, you go to File and then New File…. There you can select
several different file types. Those we are interested in are the R Script, R Notebook,
and R Markdown types. The first is the file type for pure R code, while the latter two
we use for creating reports where documentation text is mixed with R code. For data
analysis projects, I would recommend using either Notebook or Markdown files. Writing
documentation for what you are doing is helpful when you need to go back to a project
several months down the line.
For most of this chapter, you can just write R code in the console, or you can create
an R Script file. If you create an R Script file, it will show up on the top left; see Figure 1-2.
You can evaluate single expressions using the Run button on the top right of this frame
or evaluate the entire file using the Source button. For writing longer expressions, you
might want to write them in an R Script file for now. In the next chapter, we will talk
about R Markdown, which is the better solution for data science projects.

Figure 1-2. RStudio with a new R Script file open

Using R As a Calculator
You can use the R console as a calculator where you type in an expression you want to
calculate, hit “enter,” and R gives you the result. You can play around with that a little bit
to get familiar with how to write expressions in R (there is some explanation for how to
write them in the following), and then moving from using R as a calculator to writing
more sophisticated analysis programs is only a matter of degree. A data analysis program
is little more than a sequence of calculations, after all.

Simple Expressions
Simple arithmetic expressions are written, as in most other programming languages, in
the typical mathematical notation that you are used to:
1 + 2

## [1] 3

4 / 2

## [1] 2

(2 + 2) * 3

## [1] 12

Here, the lines that start with ## show the output that R will give you. By convention,
and I don’t really know why, these two hash symbols are often used to indicate that in R
documentation.
It also works pretty much as you are used to, except, perhaps, that you might be used
to integers behaving as integers in a division. At least in some programming languages,
division between integers is integer division, but in R you can divide integers, and if
there is a remainder, you will get a floating-point number back as the result:

4 / 3

## [1] 1.333333

When you write numbers like 4 and 3, they are always interpreted as floating-point
numbers, even if they print as integers, that is, without a decimal point. To explicitly get
an integer, you must write 4L and 3L:

class(4)

## [1] "numeric"

class(4L)

## [1] "integer"

It usually doesn’t matter if you have an integer or a floating-point number, and
everywhere you see numbers in R, they are likely to be floats.
You will still get a floating-point number if you divide two integers, and there is no need to tell
R explicitly that you want floating-point division. If you do want integer division, on the
other hand, you need a different operator, %/%:

4 %/% 3

## [1] 1

In many languages, % is used for getting the remainder of a division, but this doesn’t
quite work with R where % is used for something else (creating new infix operators), so in
R the operator for this is %%:

4 %% 3

## [1] 1
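A quick way to convince yourself how %/% and %% fit together is to check that the quotient and the remainder rebuild the original number. The following is only a small sketch, and the numbers 17 and 5 are arbitrary choices for the illustration:

```r
17 %/% 5                  # the integer quotient
17 %% 5                   # the remainder
(17 %/% 5) * 5 + 17 %% 5  # quotient times divisor plus remainder
```

The last expression should give you back 17.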

In addition to the basic arithmetic operators (addition, subtraction, multiplication,
division, and the modulus operator we just saw), you also have an exponentiation
operator for taking powers. For this, you can use either ^ or ** as infix operators:

2 ^ 2

## [1] 4

2 ** 2

## [1] 4

2 ^ 3

## [1] 8

2 ** 3

## [1] 8

There are some other data types besides numbers, but we won’t go into an
exhaustive list here. There are two types you do need to know about early, though, since
they are frequently used and since not knowing about how they work can lead to all
kinds of grief. Those are strings and “factors.”

Strings work as you would expect. You write them in quotes, either double quotes or
single quotes, and that is about it:

"Hello,"

## [1] "Hello,"

'world!'

## [1] "world!"

Strings are not particularly tricky, but I mention them because they look a lot like
factors. Factors are not like strings; they just look sufficiently like them to cause some
confusion. I will explain factors a little later in this chapter when we have seen how
functions and vectors work.

Assignments
To assign a value to a variable, you use the arrow operators. So to assign the value 2 to
the variable x, you would write

x <- 2

and you can test that x now holds the value 2 by evaluating x:

x

## [1] 2

and of course, you can now use x in expressions:

2 * x

## [1] 4

You can assign with arrows in both directions, so you could also write

2 -> x

An assignment won’t print anything if you write it into the R terminal, but you can
get R to print it by putting the assignment in parentheses:

x <- "invisible"
(y <- "visible")

## [1] "visible"

Actually, all of the above are vectors of values…

If you were wondering why all the values printed earlier had a [1] in front of them, it
is because we are usually not working with single values anywhere in R. We are working
with vectors of values (and you will hear more about vectors in the next section). The
vectors we have seen have length one—they consist of a single value—so there is nothing
wrong about thinking about them as individual values. But they are vectors and what we
can do with a single number we can do with multiple in the same way.
The [1] does not indicate that we are looking at a vector of length one. The [1] tells
you that the first value after [1] is the first value in the vector. With longer vectors, you
get the index each time R moves to the next line of output. This output makes it easier to
count your way into a particular index.
You will see this if you make a longer vector, for example, we can make one of length
50 using the : operator:

1:50

##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
##  [16]  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30
##  [31]  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45
##  [46]  46  47  48  49  50

The : operator creates a sequence of numbers, starting at the number to the left of
the colon and increasing by one until it reaches the number to the right of the colon, or
just before if an increment of one would move past the last number:

-1:1

## [1] -1 0 1

0.1:2.9

## [1] 0.1 1.1 2.1

If you want other increments than 1, you can use the seq function instead:

seq(.1, .9, .2)

## [1] 0.1 0.3 0.5 0.7 0.9

Here, the first number is where we start, the second where we should stop, as with :,
but the third number gives us the increment to use.
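As a small supplement, seq in base R also lets you name the increment (by) or ask for a fixed number of evenly spaced values (length.out). Naming arguments like this is explained later in the chapter; the variable names below are made up for the illustration, so treat this as a sketch rather than something the examples above depend on:

```r
# Step from 0 to 1 in increments of 0.25
by_quarter <- seq(0, 1, by = 0.25)

# Ask for exactly five evenly spaced values between 0 and 1
five_values <- seq(0, 1, length.out = 5)

by_quarter
five_values
```

Both calls should produce the same five numbers, from 0 to 1 in steps of 0.25.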
Because we are practically always working on vectors, there is one caveat I want
to warn you about. If you want to know the length of a string, you might—reasonably
enough—think you can get that using the length function. You would be wrong. That
function gives you the length of a vector, so if you give it a single string, it will always
return 1:

length("qax")

## [1] 1

length("quux")

## [1] 1

length(c("foo", "bar"))

## [1] 2

In the last expression, we used the function c() to concatenate two vectors of strings.
Concatenating "foo" and "bar"

c("foo", "bar")

## [1] "foo" "bar"

creates a vector of two strings, and thus the result of calling length on that is 2. To get
the length of the actual string, you want nchar instead:

nchar("qax")

## [1] 3

nchar("quux")

## [1] 4

nchar(c("foo", "bar"))

## [1] 3 3

If you wanted to concatenate the strings "foo" and "bar", to get a vector with the
single string "foobar", you need to use paste:

paste("foo", "bar", sep = "")

## [1] "foobar"

The argument sep = "" tells paste not to put anything between the two strings. By
default, it would put a space between them:

paste("foo", "bar")

## [1] "foo bar"

Indexing Vectors
If you have a vector and want the i’th element of that vector, you can index the vector to
get it like this:

(v <- 1:5)

## [1] 1 2 3 4 5

v[1]

## [1] 1

v[3]

## [1] 3

We have parentheses around the first expression to see the output of the operation.
An assignment is usually silent in R, but by putting the expression in parentheses, we
make sure that R prints the result, which is the vector of integers from 1 to 5. Notice here
that the first element is at index 1. Many programming languages start indexing at zero,
but R starts indexing at one. A vector of length n is thus indexed from 1 to n, unlike in
zero-indexed languages where the indices go from 0 to n − 1.
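If you come from a zero-indexed language, a tiny sketch may help the one-based convention sink in. It rebuilds the same vector v as above so it can be run on its own:

```r
v <- 1:5
v[1]          # the first element lives at index 1, not 0
v[length(v)]  # the last element lives at index length(v)
```

The first expression should give 1 and the second 5.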

If you want to extract a subvector, you can also do this with indexing. You just use a
vector of the indices you want inside the square brackets. We can use the : operator for
this or the concatenate function, c():

v[1:3]

## [1] 1 2 3

v[c(1,3,5)]

## [1] 1 3 5

You can use a vector of boolean values to pick out those values that are “true”:

v[c(TRUE, FALSE, TRUE, FALSE, TRUE)]

## [1] 1 3 5

This indexing is particularly useful when you combine it with expressions. We can,
for example, get a vector of boolean values telling us which values of a vector are even
numbers and then use that vector to pick them out:

v %% 2 == 0

## [1] FALSE TRUE FALSE TRUE FALSE

v[v %% 2 == 0]

## [1] 2 4

You can get the complement of a vector of indices if you change the sign of them:

v[-(1:3)]

## [1] 4 5

It is also possible to give vector indices names, and if you do, you can use those to
index into the vector. You can set the names of a vector when constructing it or use the
names function:

v <- c("A" = 1, "B" = 2, "C" = 3)
v

## A B C
## 1 2 3

v["A"]

## A
## 1

names(v) <- c("x", "y", "z")
v

## x y z
## 1 2 3

v["x"]

## x
## 1

Names can be handy for making tables where you can look up a value by a key.
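To make the lookup-table idea concrete, here is a minimal sketch. The vector price and its values are invented for the illustration; the point is only that a name works as a key:

```r
# A hypothetical price list: names are the keys, numbers are the values
price <- c(apple = 3, banana = 2, cherry = 10)

price["banana"]              # look up one key
price[c("cherry", "apple")]  # look up several keys in one go
names(price)                 # list the available keys
```

Indexing with a vector of names returns the values in the order you asked for them, so the second lookup gives cherry first and apple second.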

Vectorized Expressions
Now, the reason that the expressions we saw earlier worked with vector values instead
of single values is that in R, arithmetic expressions all work component-wise on vectors.
When you write an expression such as

x <- 1:3 ; y <- 4:6

x ** 2 - y

## [1] -3 -1 3

you are telling R to take each element in the vector x, square it, and subtract the
corresponding element of y:

(x <- 1:3)

## [1] 1 2 3

x ** 2

## [1] 1 4 9

y <- 6:8

x ** 2 - y

## [1] -5 -3 1

This also works if the vectors have different lengths, as they do in the preceding
example: in x ** 2, the 2 is a vector of length 1 containing the number 2. When
vectors do not have the same length, R repeats the shorter vector as many times as
needed:

(x <- 1:4)

## [1] 1 2 3 4

(y <- 1:2)

## [1] 1 2

x - y

## [1] 0 0 2 2

If the length of the longer vector is not a multiple of the length of the shorter, you get
a warning. The expression still repeats the shorter vector a number of times, just not an
integer number of times:

(x <- 1:4)

## [1] 1 2 3 4

(y <- 1:3)

## [1] 1 2 3

x - y

## Warning in x - y: longer object length is not a
## multiple of shorter object length

## [1] 0 0 0 3

Here, y is used once against the 1:3 part of x, and the first element of y is then used
for the 4 in x.

Comments
You probably don’t want to write comments when you are just interacting with the
R terminal, but in your code, you do. Comments let you describe what your code is
intended to do, and how it is achieving it, so you don’t have to work that out again when
you return to it at a later point, having forgotten all the great thoughts you thought when
you wrote it.
R interprets as comments everything that follows the # character. From a # to the end
of the line, the R parser skips the text:

# This is a comment.

If you write your analysis code in R Markdown documents, which we will cover in the
next chapter, you won’t have much need for comments. In those kinds of files, you mix
text and R code differently. But if you develop R code, you will likely need it, and now you
know how to write comments.
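A comment can sit on a line of its own or at the end of a line of code. The following is just a sketch with made-up variable names (radius and area); pi is a constant built into R:

```r
# Compute the area of a circle from its radius.
radius <- 2
area <- pi * radius^2  # everything after the # is ignored by R
area
```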

Functions
You have already seen the use of functions, although you probably didn’t think much
about it when we saw expressions such as

length("qax")

You didn’t think about it because there wasn’t anything surprising about it. We just
use the usual mathematical notation for functions: f(x). If you want to call a function,
you simply use this notation and give the function its parameters in parentheses.
In R, you can also use the names of the parameters when calling a function, in
addition to the positions; we saw an example with sep = "" when we used paste to
concatenate two strings.
If you have a function f(x, y) of two parameters, x and y, calling f(5, 10) means calling
f with parameter x set to 5 and parameter y set to 10. In R, you can specify this explicitly,
and these two function calls are equivalent:

f(5, 10)
f(x = 5, y = 10)

(Don’t try to run this code; we haven’t defined the function f, so calling it will fail.
But if we had a function f, then the two calls would be equivalent.)
If you specify the names of the parameters, the order doesn’t matter anymore, so
another equivalent function call would be

f(y = 10, x = 5)

You can combine the two ways of passing parameters to functions as long as you put
all the positional parameters before the named ones:

f(5, y = 10)

Except for maybe making the code slightly more readable—it is usually easier to
remember what parameters do than which order they come in—there is not much need
for this in itself. Where it becomes useful is when combined with default parameters.
A lot of functions in R take many parameters. More than you really can remember
the use for and certainly the order of. They are a lot like programs that take a lot of
options but where you usually just use the defaults unless you need to tweak something.
These functions take a lot of parameters, but most of them have useful default values,
and you typically do not have to specify the values to set them to. When you do need it,
though, you can specify it with a named parameter.
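Two functions from base R illustrate the pattern. In this small sketch, round has a digits parameter that defaults to 0, and paste has the sep parameter we met earlier, which defaults to a single space; the input values are arbitrary:

```r
round(3.14159)              # digits is left at its default, 0
round(3.14159, digits = 2)  # override the default by name

paste("a", "b")             # sep is left at its default, a space
paste("a", "b", sep = "-")  # override the default by name
```

In both cases, you only mention the parameter when the default is not what you want.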

Getting Documentation for Functions

Since it can be hard to remember the details of what a function does, and especially what
all the parameters to a function do, you often have to look up the documentation for
functions. Luckily, this is very easy to do in R and RStudio. Whenever you want to know
what a function does, you can just ask R, and it will tell you (assuming that the author of
the function has written the documentation).
Take the function length from the example we saw earlier. If you want to know what
the function does, just write ?length in the R terminal. If you do this in RStudio, it will
show you the documentation in the frame on the right; see Figure 1-3.

Figure 1-3. RStudio’s help frame

Try looking up the documentation for a few functions, for example, the nchar
function we also saw earlier.
All infix operators, like + or %%, are also functions in R, and you can read the
documentation for them as well. But you cannot write ?+ in the R terminal and get the
information. The R parser doesn’t know how to deal with that. If you want help on an
infix operator, you need to quote it, and you do that using back quotes. So to read the
documentation for +, you would need to write

?`+`

You probably do not need help to figure out what addition does, but people can write
new infix operators, so this is useful to know when you need help with those.
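To give an idea of what a user-defined infix operator looks like, here is a minimal sketch. The operator name %+% is a hypothetical choice for this illustration, and it uses a function definition, which is covered in the next section. Note that the same back quotes are needed when you define the operator:

```r
# %+% is a made-up operator that glues two strings together
`%+%` <- function(a, b) paste(a, b, sep = "")

"foo" %+% "bar"
```

This should evaluate to "foobar".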

Writing Your Own Functions

You can easily write your own functions. You use function expressions to define a
function and an assignment to give a function a name. For example, to write a function
that computes the square of a number, or of each number in a vector, you can write

square <- function(x) x**2

square(2)

## [1] 4

square(1:4)

## [1] 1 4 9 16

The “function(x) x**2” expression defines the function, and anywhere you would
need a function, you can write the function explicitly like this. Assigning the function to
a name lets you use the name to refer to the function, just like assigning any other value,
like a number or a string to a name, will let you use the name for the value.
Functions you write yourself work just like any function already part of R or part of
an R package, with one exception, though: you will not have documentation for your
functions unless you write it, and that is beyond the scope of this chapter (but covered in
the chapter on building packages).
The square function just does a simple arithmetic operation on its input. Sometimes,
you want the function to do more than a single thing. If you want the function to do
several operations on its input, you need several statements for the function. In that case,
you need to give it a “body” of several statements, and such a body has to go in curly
brackets:

square_and_subtract <- function(x, y) {
    squared <- x ** 2
    squared - y
}
square_and_subtract(1:5, rev(1:5))

## [1] -4 0 6 14 24

(Check the documentation for rev to see what is going on here. Make sure you
understand what this example is doing.)

In this simple example, we didn’t really need several statements. We could just have
written the function as

square_and_subtract <- function(x, y) x ** 2 - y

As long as there is only a single expression in the function, we don’t need the curly
brackets. For more complex functions, you will need them, though.
The result of a function—what it returns as its value when you call it—is the last
statement or expression (there actually isn’t any difference between statements and
expressions in R; they are the same thing). You can make the return value explicit,
though, using the return() expression:

square_and_subtract <- function(x, y) return(x ** 2 - y)

Explicit returning is usually only used when you want to return a value before the
end of the function. To see examples of this, we need control structures, so we will have
to wait a little bit to see an example. It isn’t used as much as in many other programming
languages.
One crucial point here, though, if you are used to programming in other languages:
The return() expression needs to include the parentheses. In most programming
languages, you could just write

square_and_subtract <- function(x, y) return x ** 2 - y

Such an expression doesn’t work for R. Try it, and you will get an error.

Summarizing and Vector Functions

As we have already seen, when we write arithmetic expressions such as x**2 - y,
we have an expression that will work for both single numbers for x and y, but also
element-wise for vectors x and y. If you write functions where the body consists of such
expressions, the function will work element-wise as well. The square and square_and_
subtract functions we wrote earlier work like that.
Not all functions work like this, however. While we often can treat data one element
at a time, we also often need to extract some summary of a collection of data, and
functions handle this as well.

Take, for example, the function sum which adds together all the values in a vector you
give it as an argument (check ?sum now to see the documentation):

sum(1:4)

## [1] 10

This function summarizes its input into a single value. There are many similar
functions, and, naturally, these cannot be used element-wise on vectors; rather, they
reduce an entire vector into some smaller summary statistics, here the sum of all
elements.
Whether a function works on vector expressions or not depends on how it is defined.
While there are exceptions, most functions in R either work on vectors or summarize
vectors like sum. When you write your own functions, whether the function works
element-wise on vectors or not depends on what you put in the body of the function. If
you write a function that just does arithmetic on the input, like square, it will work in
vectorized expressions. If you write a function that does some summary of the data, it
will not. For example, if we write a function to compute the average of its input like this:

average <- function(x) {
    n <- length(x)
    sum(x) / n
}
average(1:5)

## [1] 3

This function will not give you values element-wise. Pretty obviously. It gets a little
more complicated when the function you write contains control structures, which we
will get to in the next section. In any case, this would be a nicer implementation since it
only involves one expression:

average <- function(x) sum(x) / length(x)

Oh, and by the way, don’t use this average function to compute the mean value of a
vector. R already has a function for that, mean, that deals much better with special cases
like missing data and vectors of length zero. Check out ?mean.
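To see what handling missing data can look like in practice, compare the hand-written calculation with mean on a vector containing NA, which is R's marker for a missing value. This is a small sketch; the vector values is made up, and na.rm is the argument to mean that drops missing values before averaging:

```r
values <- c(1, 2, NA, 4)

naive <- sum(values) / length(values)     # the missing value propagates
default_mean <- mean(values)              # also missing unless we ask otherwise
clean_mean <- mean(values, na.rm = TRUE)  # the mean of 1, 2, and 4

naive
default_mean
clean_mean
```

The first two results are NA, while the third is the mean of the three observed values.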

Just because you are summarizing doesn’t mean that you have to return a single
value. In this function, we return both the mean and the standard deviation of the values
in a vector:

mean_and_sd <- function(x) c(mean = mean(x), sd = sd(x))

mean_and_sd(1:10)

## mean sd
## 5.50000 3.02765

We use the functions mean and sd to compute the two summary statistics, and
then we combine them into a vector (with named elements) that contains the two
summaries. This isn’t a vectorized function, because we do not process the values in
the input element-wise. It doesn’t compute a single summary, but returns something
(ever so slightly) more complex. Complicated functions often return data more complex
than vectors or single values, and we shall see examples in later chapters. If you can
avoid it, though, do so. Simple functions, with simple input and output, are easier to
use, and when we write functions, we want to make things as simple for us as we can.
With this mean_and_sd function, we do not gain anything that we do not already have
with the mean and sd functions, and combining both operations in a single function only
complicates things needlessly.
The rough classification of functions into the vectorized, which operate element-
wise on data, and the summarizing functions, is only a classification of how we can use
them. If you compute a value for each element in one or more vectors, you have the
former, and if you summarize all the data in one or more vectors, you have the latter. The
implementation of a function can easily combine both.
Imagine, for example, that we wish to normalize data by subtracting the mean from
each element and then dividing by the standard deviation. We could implement it
like this:

normalise <- function(x) (x - mean(x)) / sd(x)

normalise(1:10)

## [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337
## [5] -0.1651446 0.1651446 0.4954337 0.8257228
## [9] 1.1560120 1.4863011

We compute a value for each element in the input, so we have a vectorized function,
but in the implementation, we use two summarizing functions, mean and sd. The
expression (x - mean(x)) / sd(x) is a vector expression because mean(x) and sd(x)
become vectors of length one, and we can use those in the expression involving x to get a
value for each element.
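One way to check that a normalization like this does what it claims is to look at the mean and standard deviation of the result, which should be (up to floating-point rounding) 0 and 1. The sketch below repeats the definition of normalise so it can be run on its own; the input vector and the name z are arbitrary choices for the illustration:

```r
normalise <- function(x) (x - mean(x)) / sd(x)

z <- normalise(c(2, 4, 4, 4, 5, 5, 7, 9))
length(z)  # one value per input element
mean(z)    # very close to 0
sd(z)      # very close to 1
```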

A Quick Look at Control Flow

While you get very far just using expressions, for many computations, you need more
complex programming. Not that it is particularly complex, but you do need to be able
to choose what to do based on data (selection, or if statements) and to iterate
through data (looping, or for statements).
If statements work like this:

if (<boolean expression>) <expression>

If the boolean expression evaluates to true, the expression is evaluated; if not, it
will not:

# this won't do anything

if (2 > 3) "false"

# this will
if (3 > 2) "true"

## [1] "true"

For expressions like these, where we do not alter the program state by evaluating
the expression, there isn’t much of an effect in evaluating the if expression. If we, for
example, are assigning to a variable, there will be an effect:

x <- "foo"
if (2 > 3) x <- "bar"
x

## [1] "foo"

if (3 > 2) x <- "baz"
x

## [1] "baz"

If you want to have effects for both true and false expressions, you have this:

if (<boolean expression>) <true expression> else <false expression>

if (2 > 3) "bar" else "baz"

## [1] "baz"

If you want newlines in if statements, whether you have an else part or not, you
should use curly brackets.
You don’t always have to. If you have a single expression in the if part, you can leave
them out:

if (3 > 2)
    x <- "bar"
x

## [1] "bar"

or if you have a single statement in the else part, you can leave out the brackets:

if (2 > 3) {
    x <- "bar"
} else
    x <- "qux"
x

## [1] "qux"

but we did need the brackets in the preceding if part for R to recognize that an else
bit was following. Without it, we would get an error:

if (2 > 3)
    x <- "bar"
else
    x <- "qux"

## Error: <text>:3:1: unexpected 'else'

## 2: x <- "bar"
## 3: else
## ^

If you always use brackets, you don’t have to worry about when you strictly need
them or when you do not, and a part can have multiple statements without you having
to worry about it. If you put a newline in an if or if-else expression, I recommend that
you always use brackets as well.
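As a sketch of the always-use-brackets style, here is an if-else statement where each part holds more than one statement. The variable names x, parity, and half are made up for the illustration:

```r
x <- 7
if (x %% 2 == 0) {
    parity <- "even"
    half <- x / 2
} else {
    parity <- "odd"
    half <- (x - 1) / 2
}
parity
half
```

Because the closing bracket and else are on the same line, R knows that the else part follows. With x set to 7, parity ends up as "odd" and half as 3.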
An if statement works like an expression:

if (2 > 3) "bar" else "baz"

## [1] "baz"

This evaluates to the result of the expression in the “if” or the “else” part, depending
on the truth value of the condition:

x <- if (2 > 3) "bar" else "baz"
x

## [1] "baz"

It works just as well with braces:

x <- if (2 > 3) { "bar" } else { "baz" }
x

## [1] "baz"

but when the entire statement is on a single line, and the two parts are both a single
expression, I usually do not bother with that.
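You cannot use it for vectorized expressions, though, since the boolean expression, if
you give it a vector, will only look at the first element in the vector (and from R 4.2.0 on,
a condition of length greater than one is an error rather than a warning):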

x <- 1:5
if (x > 3) "bar" else "baz"

## Warning in if (x > 3) "bar" else "baz": the

## condition has length > 1 and only the first
## element will be used

## [1] "baz"

If you want a vectorized version of if statements, you can instead use the ifelse()
function:

x <- 1:5
ifelse(x > 3, "bar", "baz")

## [1] "baz" "baz" "baz" "bar" "bar"

(read the ?ifelse documentation to get the details of this function).

This, of course, also has consequences for writing functions that use if statements.
If your function contains a body that isn’t vectorized, your function won’t be either. So,
if you have an if statement that depends on your input—and if it doesn’t depend on the
input, it is rather useless—then that input shouldn’t be a vector:

maybe_square <- function(x) {
    if (x %% 2 == 0) x ** 2 else x
}
maybe_square(1:5)

## Warning in if (x%%2 == 0) x^2 else x: the
## condition has length > 1 and only the first
## element will be used

## [1] 1 2 3 4 5

This function was supposed to square even numbers, and it will if we give it a single
number, but we gave it a vector. Since the first value in this vector, the only one that the
if statement looked at, was 1, it decided that x %% 2 == 0 was false—it is if x[1] is 1—
and then none of the values were squared. Clearly not what we wanted, and the warning
was warranted.
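If a function really is meant for single values only, one option is to say so in the code and refuse vectors outright rather than quietly using the first element. This is just a sketch of that idea; the name maybe_square_checked is invented here, and stopifnot() is only one of several ways to write the check.

```r
# Hypothetical variant: fail early unless x is a single value.
maybe_square_checked <- function(x) {
    stopifnot(length(x) == 1)
    if (x %% 2 == 0) x^2 else x
}
maybe_square_checked(4)

# tryCatch() captures the error message so a script can keep running.
tryCatch(maybe_square_checked(1:5),
         error = function(e) conditionMessage(e))
```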
If you want a vectorized function, you need to use ifelse():

maybe_square <- function(x) {
    ifelse(x %% 2 == 0, x ** 2, x)
}
maybe_square(1:5)

## [1] 1 4 3 16 5


or you can use the Vectorize() function to translate a function that isn’t vectorized
into one that is:

maybe_square <- function(x) {
    if (x %% 2 == 0) x ** 2 else x
}
maybe_square <- Vectorize(maybe_square)
maybe_square(1:5)

## [1] 1 4 3 16 5

The Vectorize function is what is known as a “functor” (more commonly called a
higher-order function): a function that takes a function as input and returns a new
function. It is beyond the scope of this chapter to
cover how we can manipulate functions like other data, but it is a very powerful feature
of R that we return to in later chapters.
For now, it suffices to know that Vectorize will take your function that can only take
single values as input and then create a function that handles an entire vector by calling
your function with each element. Your function only ever sees one element at a time, and
the function that Vectorize returns takes care of running it over the entire vector.
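Roughly speaking, the wrapped function does what you would get by applying your function to each element in turn and collecting the results. A minimal sketch, using sapply() for the by-hand version; the names square_if_even, by_hand, and via_vectorize are introduced here for illustration only.

```r
# A function that only works on single values.
square_if_even <- function(x) if (x %% 2 == 0) x^2 else x

# Apply it to each element by hand ...
by_hand <- sapply(1:5, square_if_even)

# ... or let Vectorize build the element-by-element wrapper.
via_vectorize <- Vectorize(square_if_even)(1:5)

by_hand
via_vectorize
```

Both approaches still call the function once per element, so neither is as fast as a version written with ifelse() or other natively vectorized operations.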
To loop over elements in a vector, you use for statements:

x <- 1:5
total <- 0
for (element in x) total <- total + element
total

## [1] 15

As with if statements, if you want the body to contain more than one expression,
you need to put it in curly brackets.
The for statement runs through the elements of a vector. If you want the indices
instead, you can use the seq_along() function, which given a vector as input returns a
vector of indices:

x <- 1:5
total <- 0

for (index in seq_along(x)) {
    element <- x[index]
    total <- total + element
}
total

## [1] 15
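A common alternative is to write 1:length(x) for the indices, and the following sketch shows why seq_along() is the safer choice: for an empty vector, 1:length(x) counts down from 1 to 0, so a loop over it would run twice, while seq_along(x) is empty and the loop body never runs. The variable names are chosen for this example.

```r
empty <- numeric(0)

1:length(empty)    # the vector 1, 0: a loop over this runs twice
seq_along(empty)   # an empty vector: a loop over this does not run

iterations <- 0
for (index in seq_along(empty)) {
    iterations <- iterations + 1
}
iterations         # still 0
```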

There are also while statements for looping. These repeat as long as an expression
is true:

x <- 1:5
total <- 0
index <- 1
while (index <= length(x)) {
    element <- x[index]
    index <- index + 1
    total <- total + element
}
total

## [1] 15

If you are used to zero-indexed vectors, pay attention to the index <= length(x)
here. You would normally write index < length(x) in zero-indexed languages. Here,
that would miss the last element.
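To see the off-by-one problem concretely, here is the same loop with the comparison deliberately written the zero-indexed way. It is a sketch of the mistake, not something to copy; short_total is a name made up for it.

```r
x <- 1:5
short_total <- 0
index <- 1
while (index < length(x)) {    # deliberately wrong: stops before index 5
    short_total <- short_total + x[index]
    index <- index + 1
}
short_total                    # 10, not 15: the last element was skipped
```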
There is also a repeat statement that loops until you explicitly exit using the break
statement:

x <- 1:5
total <- 0
index <- 1
repeat {
    element <- x[index]
    total <- total + element
    index <- index + 1
    if (index > length(x)) break
}
total

## [1] 15
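One thing to keep in mind with repeat is that the body always runs at least once before any test placed at the end. With an empty vector, the loop as written would read x[1], which is NA, and the total would become NA. A hedged sketch of one way around that is to put the exit test first; the variable names are chosen for the example.

```r
empty <- integer(0)
safe_total <- 0
index <- 1
repeat {
    if (index > length(empty)) break   # test first, so an empty vector is safe
    safe_total <- safe_total + empty[index]
    index <- index + 1
}
safe_total

# For a plain sum like this, the built-in vectorized function does the job.
sum(1:5)
```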
