
Technometrics

ISSN: 0040-1706 (Print) 1537-2723 (Online) Journal homepage: https://www.tandfonline.com/loi/utch20

Introduction to Data Science: Data Analysis and Prediction Algorithms With R
by Rafael A. Irizarry. Boca Raton, FL: Chapman and Hall/CRC, Taylor & Francis Group, 2020, xxx + 713 pp., $99.95, ISBN: 978-0-367-35798-6.

Stan Lipovetsky

To cite this article: Stan Lipovetsky (2020) Introduction to Data Science: Data Analysis and Prediction Algorithms With R, Technometrics, 62:2, 280–282, DOI: 10.1080/00401706.2020.1744905

To link to this article: https://doi.org/10.1080/00401706.2020.1744905

Published online: 07 May 2020.


280 BOOK REVIEWS

of the group to a source of supply, describing also the so-called Kar solution, Folk solution, the cycle-complete solution, and the weighted SV solution.

Dozens of recent references are given in each chapter. The book is innovative even for specialists in game theory, operations research, decision making, and applied socio-economic research in various fields. It is also worth noting that in practical implementations the SV has been successfully applied in marketing research, for example, in total unduplicated reach and frequency estimation and related problems (Conklin, Powaga, and Lipovetsky 2004; Conklin and Lipovetsky 2005, 2013; Lipovetsky 2007, 2008), and in regression modeling and key driver analysis (Lipovetsky and Conklin 2001, 2010; Lipovetsky 2012).

References

Aumann, R. J., and Shapley, L. S. (1974), Values of Non-Atomic Games, Princeton, NJ: Princeton University Press.
Conklin, M., and Lipovetsky, S. (2005), "Marketing Decision Analysis by TURF and Shapley Value," International Journal of Information Technology & Decision Making, 4, 5–19.
——— (2013), "The Shapley Value in Marketing Research: 15 Years and Counting," in Proceedings of the Sawtooth Software Conference, Dana Point, CA, pp. 267–274.
Conklin, M., Powaga, K., and Lipovetsky, S. (2004), "Customer Satisfaction Analysis: Identification of Key Drivers," European Journal of Operational Research, 154, 819–827.
Lipovetsky, S. (2007), "Antagonistic and Bargaining Games in Optimal Marketing Decisions," International Journal of Mathematical Education in Science and Technology, 38, 103–113.
——— (2008), "SURF—Structural Unduplicated Reach and Frequency: Latent Class TURF and Shapley Value Analyses," International Journal of Information Technology & Decision Making, 7, 203–216.
——— (2012), "Interpretation of Shapley Value Regression Coefficients as Approximation for Coefficients Derived by Elasticity Criterion," in Proceedings of the Joint Statistical Meeting of the American Statistical Association, July–August, San Diego, CA, pp. 3302–3307.
Lipovetsky, S., and Conklin, M. (2001), "Analysis of Regression in Game Theory Approach," Applied Stochastic Models in Business and Industry, 17, 319–330.
——— (2010), "Meaningful Regression Analysis in Adjusted Coefficients Shapley Value Model," Model Assisted Statistics and Applications, 5, 251–264.
Roth, A. E. (ed.) (1988), The Shapley Value: Essays in Honor of Lloyd S. Shapley, Cambridge: Cambridge University Press.
Shapley, L. S. (1953), "A Value for n-Person Games," in Contributions to the Theory of Games (Vol. II), eds. H. W. Kuhn and A. W. Tucker, Princeton, NJ: Princeton University Press, pp. 307–318.

Stan Lipovetsky
Minneapolis, MN


Introduction to Data Science: Data Analysis and Prediction Algorithms With R by Rafael A. Irizarry. Boca Raton, FL: Chapman and Hall/CRC, Taylor & Francis Group, 2020, xxx + 713 pp., $99.95, ISBN: 978-0-367-35798-6.

The textbook belongs to the Data Science series and presents a modern approach to statistical evaluation via the powerful abilities of the R language. The monograph is organized in six parts and thirty-eight chapters, each with multiple subsections, exercises, code examples, and discussions of outcomes. The Introduction describes the book's structure, and Chapter 1, "Getting Started With R and RStudio," describes the R console that executes typed commands, code saved as scripts, and RStudio, a user-friendly integrated development environment (IDE) that provides many useful tools. Multiple screenshots demonstrate running commands while editing scripts, changing global options, and installing R packages and libraries.

Part I, "R," starts with Chapter 2, "R Basics," which describes the main building blocks of R and its logistics: functions and prebuilt objects, variable names, data types and frames, vectors, matrices, and lists, data subsets and coercion, sorting and ranking, vector arithmetic and logical operators, workspace saving, and some plots, with a case study on US gun murders in comparison with other countries. Chapter 3, "Programming Basics," explains how to use conditional expressions and define functions, and describes namespaces, for-loops, vectorization, and functionals such as apply, sapply, replicate, and others. Chapter 4, "The tidyverse," introduces a specific data format referred to as tidy (one observation per row, a different variable in each column, all data available) for more efficient operations on data frames with the collection of packages called the tidyverse, loaded with library(tidyverse). This library includes such popular packages as dplyr for manipulating data frames, purrr for working with functions, ggplot2 for graphing, and many others. The new ways of working with data frames offered by dplyr are described, including the pipe operator %>% for applying one function after another, summarizing in groups, nested sorting, the tibble as a special modern kind of data frame obtained in data stratifying, and the tibble, dot, and do operators. Chapter 5, "Importing Data," describes paths and the working directory, and the readr and readxl packages providing the main tidyverse data-importing functions.

Part II, "Data Visualization," opens with Chapter 6, "Introduction to Data Visualization," which illustrates several datasets graphically, citing the father of exploratory data analysis (EDA), J. W. Tukey: "The greatest value of a picture is when it forces us to notice what we never expected to see." Chapter 7, "ggplot2," shows how to create various kinds of plots with different features using the package loaded with library(ggplot2). Chapter 8, "Visualizing Data Distributions," uses data on student heights to demonstrate their distributions by gender and region in histograms and smoothed densities, bar and box plots, percentiles and quantiles, QQ-plots, and cumulative graphs. Chapter 9, "Data Visualization in Practice," uses data on developing countries to present various scatterplots with multiple panels of faceting variables, time series plots, multimodal distributions with box and ridge plots, weighted densities, and data transformations. Chapter 10, "Data Visualization Principles," considers how humans detect patterns and make comparisons across a viewable number of quantities, and distributions by values or categories, with special features related to position, length, angles, area, ordering, colors, and brightness. Various kinds of plots are presented, and plots for two variables are discussed, including slope charts, the Bland–Altman plot, and plots with an encoded third variable. A case study on infectious diseases and vaccines in graph presentation is given as well.
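To give a flavor of the tidyverse style that these chapters revolve around, here is a minimal sketch (illustrative only, not code from the book) that pipes a built-in data frame through a dplyr group summary and plots the result with ggplot2; it assumes the dplyr and ggplot2 packages are installed.

```r
# Minimal tidyverse-style sketch (illustrative, not from the book):
# summarize the built-in mtcars data by group with dplyr's pipe,
# then draw the summary with ggplot2.
library(dplyr)
library(ggplot2)

mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%                  # stratify by number of cylinders
  summarize(mean_mpg = mean(mpg))    # one summary row per group

ggplot(mpg_by_cyl, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean miles per gallon")
```

The same chained style carries through the book's later chapters, where wrangling and visualization steps are composed with %>%.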

Chapter 11, "Robust Summaries," deals with finding outliers and using the median, the interquartile range (IQR), and median absolute deviations in graphs.

Part III, "Statistics With R," starts with Chapter 12, "Introduction to Statistics With R," which notes that the next chapters describe statistical concepts and explain them by implementing R code on case studies. Chapter 13, "Probability," defines and calculates on various datasets the relative frequency and discrete and continuous probability distributions, presents Monte Carlo simulations for categorical data, sampling with and without replacement, conditional probabilities, the addition and multiplication rules, and combinations and permutations, and discusses the Monty Hall and birthday problems. Chapter 14, "Random Variables," deals with data affected by chance because the data come from a random sample, carry measurement error, or arise from a source that is random by nature. The expected value and standard error, the central limit theorem (CLT) and the law of large numbers, and population versus sample and properties of averages are considered on the example of the financial crisis of 2007–2008, which occurred because the risks of mortgage-backed securities (MBS) and collateralized debt obligations (CDO) were grossly underestimated. Chapter 15, "Statistical Inference," describes polls and estimate properties, confidence intervals, power and p-values, chi-square and odds ratio tests, and the problem of small p-values for large samples. Chapter 16, "Statistical Models," continues with poll aggregators that combine data from different experts to improve predictions. Data-driven and hierarchical models, and Bayesian simulation and statistics, are used for election forecasting and for predicting the electoral college. Chapter 17, "Regression," focuses on the bivariate regression model. Chapter 18, "Linear Models," is devoted to one of the main tools in data science—multiple regression modeling. On the example of baseball data, least squares estimates (LSE) are applied for building regressions using various R tools, particularly the tidyverse and broom packages for stratified models. Chapter 19, "Association Is Not Causation," aka correlation is not causation, discusses spurious correlation, reversing cause and effect, confounders, and Simpson's paradox, on the example of the UC Berkeley admissions data.

Part IV, "Data Wrangling," opens with Chapter 20, "Introduction to Data Wrangling," which reminds the reader that the original data subsets can be obtained in different forms from string processing, HTML parsing, tables with times and dates, and text mining, so several preliminary steps are needed to present the whole dataset in the data frame or tidyverse format. Chapter 21, "Reshaping Data," describes the tidyr package, which includes several functions for tidying data. Chapter 22, "Joining Tables," characterizes several functions for binding and intersecting datasets. Chapter 23, "Web Scraping," or web harvesting, shows how to extract data from a website, for instance, from a Wikipedia page. The information used by a browser to render webpages comes as a text file from a server, and the text is coded in the hypertext markup language (HTML). Cascading style sheets (CSS) are widely used to make webpages look nice, and the rvest package helps to import a webpage into R. A format widely adopted on the internet is the JavaScript Object Notation (JSON), and the jsonlite package can be used to read it as a data frame. Chapter 24, "String Processing," describes how to extract numerical data and names contained in strings, and how to perform many other operations on them with the help of the stringr package, illustrated on several case studies. Chapter 25, "Parsing Dates and Times," describes the tidyverse functionality for working with dates through the lubridate package. Chapter 26, "Text Mining," is devoted to operating with free-form text, which is needed in such applications as spam filtering, cyber-crime prevention, counter-terrorism, and sentiment analysis. The tidytext package converts free text into a tidy table. A case study of the Twitter account of D. J. Trump during the 2016 election, and other cases, are presented.

Part V, "Machine Learning," opens with Chapter 27, "Introduction to Machine Learning," which defines this topic as the most popular of modern data science methodologies, widely applied, for instance, in the handwritten zip code readers implemented in the postal service, speech recognition technologies, movie recommendation systems, spam and malware detectors, housing price predictors, and driverless cars. Another term often used for this approach is artificial intelligence (AI), although AI is rather related to algorithms like those developed for chess-playing machines by programming rules, while machine learning is based on algorithms and decisions built with data. Machine learning uses available data to build a model and then applies it for prediction of a continuous output, or classification of a categorical output. The quality of a model or machine learning algorithm defined on training and test subsets is considered, together with the features of the confusion matrix, sensitivity and specificity, receiver operating characteristic (ROC) and precision-recall curves, balanced accuracy and the F1-score, and conditional probabilities and expectations for minimizing the squared loss function. Chapter 28, "Smoothing," considers curve fitting, aka low-pass filtering, extremely useful in machine learning because conditional probabilities reveal trends or shapes that are estimated in the presence of uncertainty in the data. Bin smoothing and kernels, local weighted regression (loess), and fitting parabolas are described. Chapter 29, "Cross Validation," discusses how to implement cross validation with the caret package, questions of k-nearest neighbors (kNN), over-training and over-smoothing, picking the k in kNN, K-fold cross validation, and the bootstrap. Chapter 30, "The caret Package," describes this package, which now contains 237 different methods, in more detail, including its train function and cross validation, and shows fitting with loess on examples. The manual on these package techniques is available at https://topepo.github.io/caret/available-models.html; see also https://topepo.github.io/caret/train-models-by-tag.html. Chapter 31, "Examples of Algorithms," includes methods of supervised learning, such as linear regression and the predict function, logistic regression and generalized linear models, kNN and generative models, naïve Bayes and controlling prevalence, linear and quadratic discriminant analyses, classification and regression trees (CART), and random forests and the true conditional probability. Chapter 32, "Machine Learning in Practice," demonstrates applications of kNN and random forests, with variable importance, visual assignments, and ensembles.
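The k-nearest-neighbors idea running through these chapters can be sketched in a few lines of base R (an illustration of the method only — the book itself works through the caret package):

```r
# Base-R sketch of k-nearest neighbors (illustrative; the book uses caret):
# classify a point by majority vote among its k closest training points.
knn_predict <- function(train_x, train_y, test_x, k = 5) {
  apply(test_x, 1, function(p) {
    d <- sqrt(colSums((t(train_x) - p)^2))  # Euclidean distance to each training row
    votes <- train_y[order(d)[1:k]]         # labels of the k nearest neighbors
    names(which.max(table(votes)))          # majority vote
  })
}

# Example on the built-in iris data:
x <- as.matrix(iris[, 1:4])
pred <- knn_predict(x, iris$Species, x[c(1, 60, 120), ], k = 5)
```

Choosing k by K-fold cross validation, as Chapter 29 describes, amounts to repeating such fits on held-out folds and picking the k with the best average accuracy.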

Chapter 33, "Large Datasets," deals with computational techniques and statistical concepts specifically oriented to the analysis of big data. Various approaches are described, including matrix algebra with vectorization, filtering based on summaries, indexing with matrices and data binarization, distances in higher dimensions and preserving distances, dimension reduction and orthogonal transformations, principal component analysis (PCA) and singular value decomposition (SVD), regularization and penalized least squares, and matrix factorization and factor analysis, with examples of movie and user effects. Chapter 34, "Clustering," focuses on the algorithms of unsupervised learning, including hierarchical clustering and k-means, using heatmaps and filtering features.

Part VI, "Productivity Tools," opens with Chapter 35, "Introduction to Productivity Tools," which describes the advantages of using the Unix shell as a tool for managing files and directories. The version control system Git is introduced, together with the service GitHub, which makes it possible to keep track of script and report changes and to host and share code—see http://github.com. Chapter 36, "Organizing With Unix," explains in more detail Linux, the shell, and the command line, and navigating the filesystem by changing directories, giving examples of the application of many Unix commands. Chapter 37, "Git and GitHub," provides illustrations with multiple screenshots on this topic, describing repositories, clones, and using these tools in RStudio. Chapter 38, "Reproducible Projects With RStudio and R Markdown," finalizes with creating reports on data analysis projects with the help of the knitr package, which minimizes the work. A detailed and useful index of nineteen pages closes the book.

The monograph presents a great introduction to data science and modern R programming, with tons of examples of the application of R's abilities throughout the whole volume. The book suggests multiple links to internet websites related to the topics under consideration, which makes it an incredibly useful source on contemporary data science and programming, helping students and researchers in their projects.

Stan Lipovetsky
Minneapolis, MN


Practical Multivariate Analysis (6th ed.) by Abdelmonem Afifi, Susanne May, Robin Donatello, and Virginia A. Clark. Boca Raton, FL: Chapman and Hall/CRC, Taylor & Francis Group, 2020, xv + 418 pp., $99.95, ISBN: 978-1-138-70222-6.

The monograph belongs to the series Texts in Statistical Science and presents the sixth, upgraded edition of the popular manual. It was first issued in 1984 and since that time has won recognition as one of the best textbooks on applied statistical modeling and analysis. The book is organized in two parts and eighteen chapters: the first part considers "Preparation for Analysis" (Chapters 1–6), and the second part is called "Regression Analysis" (Chapters 7–18), although it covers many other related methods as well.

Chapter 1 defines multivariate analysis in its exploratory and confirmatory aspects, gives examples, and outlines the book's structure. Chapter 2 characterizes data of different types, such as nominal, ordinal, interval, ratio, categorical, continuous, discrete, explanatory, and dependent. Chapter 3 deals with preparing data for analysis, briefing on the statistical software packages R, SAS, SPSS, and Stata, and describes techniques of data entry and problems with missing data and outliers. The new Chapter 4, added in this edition, discusses various methods of data visualization with specific tools for different types of bivariate and multivariate variables, and gives multiple illustrations in graphs and figures. Chapter 5 presents possible ways of data screening and transformation, needed for the normality of distribution assumed in many hypothesis tests. Chapter 6 describes the selection of appropriate methods of multivariate analysis according to the data types and purposes of research.

Chapter 7 presents simple linear regression, its residual errors, confidence intervals, parameter hypothesis testing, correlation, the fixed predictor variable, and bivariate regression as a conditional distribution. Models linearized by transformed variables, and computer programs, are also discussed. Chapter 8 considers multiple linear regression with its various features and characteristics of fit quality, tests on parameters, and extensions to polynomial models. Chapter 9 describes methods of variable selection in regression, including the Akaike and Bayesian information criteria, the F-criterion in stepwise modeling, and Lasso regression. Chapter 10 covers special regression topics, such as multiple imputation of missing values, dummy or indicator variables, constraints on parameters, multicollinearity, and ridge regression. Chapter 11 focuses on discriminant analysis and classification into groups by the Fisher function, adjusting the dividing points, measuring goodness of fit, cross-validation, and the jackknife procedure. Chapter 12 continues with logistic regression used for classification and other purposes, and describes the receiver operating characteristic, or ROC curve, nominal and ordinal logistic models, Poisson regression, and the generalized linear model, or GLM. Chapter 13 deals with regression for survival data, describing hazard functions for exponential and Weibull distributions, the log-linear model, and Cox proportional hazard regression. Chapter 14 presents principal components analysis, its features and applications. Chapter 15 describes factor analysis in its exploratory version, with techniques of factor rotation and interpretation of the results. Chapter 16 presents cluster analysis in its several algorithms by agglomerative and divisive methods, including K-means and hierarchical clustering. Chapter 17 is devoted to log-linear analysis for two-way and multi-way tables, with stepwise selection, assessing specific models, and comparison with the logit models. Chapter 18 concludes with regressions for correlated outcomes, when the dependent variable observations are no longer independent but can be viewed as related in subgroups or by measurements in longitudinal studies. Conditional and marginal models with fixed and random effects are described, and generalized estimating equations, GEE, are discussed as well. The book is finalized by the Appendix containing references to the data sources used in the examples, a bibliography of more than three hundred sources, and a detailed index given in eight pages.

Most chapters of the first part of the textbook contain such subsections as "Introduction" or "Definition," "Discussion" or "Examples," "Summary," and "Problems." And almost all chapters of the second part of the textbook start with the subsections "Chapter Outline," "When This Technique Is Used," "Data Example," and "Basic Concepts," and finish with
