Data Analytics Lesson 9 Notes
Introduction to R
Contents
Lesson outcomes
Introduction
Introducing R
Working in R
Additional resources
References
DATA ANALYTICS
Lesson outcomes
By the end of this lesson, you should be able to:
Introduction
This lesson adds a new tool, R, to your data analyst toolkit. You will go through the basic steps of downloading and
installing R and start exploring some of the packages that are available for it. The lesson concludes by introducing
another common method for estimating population parameters: the maximum likelihood method.
Introducing R
R defined
• R is a powerful statistical programming language and free software environment for data analysis.
• R provides a wide variety of statistical and graphical techniques, including linear modelling, classification
techniques and time series analysis.
• R is widely used among data analysts, statisticians and researchers.
• R is trusted by companies such as Uber, Google and Airbnb.
Why R
• R is open source software.
• R is growing in popularity; it currently ranks 8th on the TIOBE index. The index is calculated from the number of
search engine results for queries containing the name of the language.
• R runs on a variety of operating systems, including Windows, macOS and Linux.
• R offers a wide variety of open source packages for data importing, wrangling and visualisation.
Downloading R
You can download R from this link:
https://cran.r-project.org/
(R, 2020)
RStudio defined
RStudio acts as a front end to R. Technically, it is an IDE (integrated development environment): an application that
makes working with a programming language easier. RStudio provides the standard features of an IDE, including a
console, an editor for multiple script files and a display of the variables in the current environment.
Why RStudio
• RStudio is open source software.
• RStudio runs on a variety of operating systems, including Windows, macOS and Linux.
• RStudio is more user-friendly than Base R.
Downloading RStudio
You can download RStudio from this link:
https://rstudio.com/products/rstudio/download/
(RStudio, 2020)
Please note: Install R before you install RStudio, because RStudio needs R in order to run. The two applications are
separate and are updated independently, each on its own release schedule.
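Once both applications are installed, a quick way to confirm that RStudio has found R is to ask the console which version it is running. The following is a minimal sketch; it uses only functions that ship with base R, and the variable name r_version is chosen here purely for illustration.

```r
# Print the version of R that this session is running.
r_version <- R.version.string
print(r_version)

# List the R version, operating system and currently loaded packages.
print(sessionInfo())
```

If the console prints a version string, R is installed and RStudio is using it.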
Working in R
R packages
Packages are a fundamental component of R. A package is a collection of code that someone has shared, from which you
can import and use the functions they have created. Packages extend the functionality of the base R installation, much
as the Analysis ToolPak add-in extends Excel.
Some of the most popular packages are tidyr, plotly and ggplot2.
Fig. Click on the Tools tab and select “Install Packages” to install a package in RStudio.
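Packages can also be installed and loaded from the R console instead of the menu. The sketch below is illustrative only: the install and load lines are commented out because installing needs an internet connection, and the variable names wanted and installed are chosen for this example. The remaining lines check which of the packages mentioned above are already available on your machine.

```r
# Install once per machine (downloads the package from CRAN):
# install.packages("ggplot2")
# Load in every new R session before using the package:
# library(ggplot2)

# Check which of the packages named above are already installed.
wanted <- c("tidyr", "plotly", "ggplot2")
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
print(installed)
```

A value of FALSE next to a package name means that package still needs to be installed.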
ISLR package
The ISLR package was developed to accompany the book An Introduction to Statistical Learning with Applications in R.
The book is a useful resource that I would recommend should you wish to expand your data analytics and data science
skills in R after this course.
The package contains a collection of datasets used in the book. We will use this package to improve our understanding
of R and what it has to offer.
Fig. Search for the package and click “Install” in RStudio to install the ISLR package.
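After installation, the datasets in ISLR can be explored from the console. The sketch below assumes the package is installed and uses its Auto dataset as an example; the choice of dataset and of the mpg column is ours, not something the lesson prescribes. The code is skipped with a message if ISLR is missing.

```r
# One-time install, equivalent to the dialog described above:
# install.packages("ISLR")

if (requireNamespace("ISLR", quietly = TRUE)) {
  # Load the Auto dataset (details of cars, including fuel consumption).
  data("Auto", package = "ISLR")
  print(dim(Auto))            # number of rows and columns
  print(head(Auto))           # first six rows
  print(summary(Auto$mpg))    # summary statistics for miles per gallon
} else {
  message("ISLR is not installed; run install.packages(\"ISLR\") first.")
}
```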
Another method concerned with choosing the model that best fits a given set of data is maximum likelihood
estimation.
Maximum likelihood
Maximum likelihood estimation is a method used to estimate the parameters of a distribution. It chooses the parameter
values under which the observed data are most probable. The concept is easier to understand through an example.
(Brooks-Bartlett, 2018)
Suppose we measured the weights of children under a certain age. Using the maximum likelihood method, we want to
find the distribution that best fits the data. The normal distribution is only one of many candidate distributions. We
want the best-fitting distribution because we can then apply it to every experiment of the same type. Let’s assume that
the weights of the children are normally distributed.
When we assume that the data are normally distributed, we expect the measurements to be symmetrical around the
mean and most of the points to lie close to the mean.
(Brooks-Bartlett, 2018)
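Both properties of the normal distribution can be checked numerically in R. This is a small sketch using a hypothetical mean of 20 kg and standard deviation of 3 kg; these values are assumptions for illustration and do not come from the lesson.

```r
mu <- 20     # assumed mean weight in kg
sigma <- 3   # assumed standard deviation in kg

# Symmetry: points equally far below and above the mean have the same density.
density_below <- dnorm(mu - 2, mean = mu, sd = sigma)
density_above <- dnorm(mu + 2, mean = mu, sd = sigma)
print(all.equal(density_below, density_above))

# Concentration: probability that a measurement lies within one sd of the mean.
within_one_sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)
print(round(within_one_sd, 3))  # about 0.683
```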
A normal distribution can take many different shapes and positions, depending on its mean and standard deviation. In
our case, we can see from the plot that most of the measurements fall around the centre (the peak) of the blue line’s
distribution, so the probability of observing these measurements under that distribution is high. The location of the
blue line’s distribution maximizes the likelihood, and its centre is therefore the maximum likelihood estimate for the mean.
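The idea of sliding the distribution along the axis until the data become most probable can be sketched in R. The weights below are hypothetical values made up for illustration, and the standard deviation is held fixed at an assumed 1.5 kg so that only the mean varies.

```r
# Hypothetical weights in kg (illustrative values, not from the lesson).
weights <- c(18.2, 19.5, 20.1, 20.4, 21.0, 21.3, 22.8)

# Log-likelihood of the data for a candidate mean, with sd fixed at 1.5 (assumed).
log_lik <- function(mu) sum(dnorm(weights, mean = mu, sd = 1.5, log = TRUE))

# Try candidate means from 15 to 25 kg in steps of 0.1 and keep the best one.
candidates <- seq(15, 25, by = 0.1)
scores <- vapply(candidates, log_lik, numeric(1))
best <- candidates[which.max(scores)]

print(best)           # candidate mean with the highest likelihood
print(mean(weights))  # the sample mean, for comparison
```

The best candidate is the grid value closest to the sample mean, which reflects the fact that the sample mean is the maximum likelihood estimate of the mean for a normal distribution.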
The maximum likelihood estimate for the standard deviation can be found in the same way.
Together, the maximum likelihood estimates of the mean and standard deviation are the parameter values that maximize
the likelihood of the observed values under the chosen distribution.
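Both parameters can also be estimated together by letting R search for the values that maximize the likelihood. The sketch below uses simulated weights (the data, seed and starting values are assumptions for illustration) and the general-purpose optimiser optim() from base R. It minimises the negative log-likelihood, which is equivalent to maximizing the likelihood, and compares the result with the known closed-form answers for the normal distribution.

```r
# Hypothetical data: simulated weights (kg) of 50 children, not from the lesson.
set.seed(42)
weights <- rnorm(50, mean = 20, sd = 3)

# Negative log-likelihood of the data under a normal distribution.
# par[1] is the mean; par[2] is log(sd), so the sd tried is always positive.
neg_log_lik <- function(par, x) {
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
}

# Minimising the negative log-likelihood maximizes the likelihood.
fit <- optim(par = c(mean = 15, log_sd = 0), fn = neg_log_lik, x = weights)
mle_mean <- unname(fit$par[1])
mle_sd <- unname(exp(fit$par[2]))

# Closed-form maximum likelihood estimates for the normal distribution.
closed_mean <- mean(weights)
closed_sd <- sqrt(mean((weights - closed_mean)^2))  # divides by n, not n - 1

print(c(mle_mean = mle_mean, closed_mean = closed_mean))
print(c(mle_sd = mle_sd, closed_sd = closed_sd))
```

The numerical and closed-form estimates should agree closely. Note that the maximum likelihood estimate of the standard deviation divides by n, so it is slightly smaller than the value returned by R's sd() function, which divides by n - 1.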
Additional resources
James, G., Witten, D., Hastie, T. & Tibshirani, R., 2017, An Introduction to Statistical Learning with Applications in R,
Springer, available online at: https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
References
• Brownlee, J., 2019, A Gentle Introduction to Linear Regression with Maximum Likelihood
Estimation, Machine Learning Mastery,
https://machinelearningmastery.com/linear-regression-with-maximum-likelihood-estimation/
• Comtois, D., 2018, My favourite R package for: summarising data, Dabbling with Data,
https://dabblingwithdata.wordpress.com/2018/01/02/my-favourite-r-package-for-summarising-data/
• Data Carpentry, 2020, Introducing R and RStudio IDE, Data Carpentry,
https://datacarpentry.org/genomics-r-intro/01-introduction/index.html
• Hadley, C.J., 2016, RStudio Install, Lynda.com,
https://www.lynda.com/RStudio-tutorials/RStudio-install/452087/490022-4.html
• Howson, I., 2019, ISLR: Data for an Introduction to Statistical Learning with Applications in R,
rdrr.io, https://rdrr.io/cran/ISLR/
• Ismay, C. & Kennedy, P.C., 2019, Getting Used to R, RStudio and R Markdown,
https://ismayc.github.io/rbasics-book/3-rstudiobasics.html
• Jackson, C., 2015, Point & Interval Estimations: Definition & Differences, Study.com,
https://study.com/academy/lesson/point-interval-estimations-definition-differences-quiz.html
• James, G., Witten, D., Hastie, T. & Tibshirani, R., 2017, An Introduction to Statistical Learning
with Applications in R, Springer, available online at:
https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
• R Project, 2020, What is R, R, https://www.r-project.org/about.html
• RStudio, 2020, https://rstudio.com/
• Rungta, K., 2020, What is R Programming Language? Introduction & Basics of R, Guru99.com,
https://www.guru99.com/r-programming-introduction-basics.html