
Data Analytics Lesson 9 Notes

The document is a lesson on using R for data analytics, covering installation, basic functionalities, and the ISLR package. It introduces the maximum likelihood method for parameter estimation, explaining its significance in fitting data distributions. Additional resources and references are provided for further learning.

Diploma in Data Analytics

Introduction to R

Contents

Lesson outcomes
Introduction
Introducing R
Working in R
Maximum likelihood method
Additional resources
References

DATA ANALYTICS

Lesson outcomes
By the end of this lesson, you should be able to:

• Download and install R and RStudio
• Install and work with the ISLR package
• Understand more about parameter estimation

Introduction
This lesson adds a new tool to your data analyst toolkit: R. You will go through the basic steps of downloading and
installing R and start exploring some of the packages that are available for it.

The lesson concludes by introducing another common method for estimating population parameters, the maximum
likelihood method.

Introducing R
R defined
• R is a powerful statistical programming language and free software environment for data analysis.
• R provides a wide variety of statistical and graphical techniques, including linear modelling, classification
and time series analysis.
• R is widely used among data analysts, statisticians and researchers.
• R is used by companies such as Uber, Google and Airbnb.

Why R
• R is open-source software.
• R is growing in popularity. At the time of writing, it ranks 8th on the TIOBE index, which is calculated from the
number of search engine results for queries containing the name of the language.
• R runs on a variety of operating systems, including Windows, macOS and Linux.
• R offers a wide variety of open-source packages for importing, wrangling and visualising data.

Downloading R
You can download R from this link:

https://cran.r-project.org/


(R, 2020)

RStudio defined
RStudio acts as a front end to R. The technically correct term is IDE, which stands for integrated development
environment. An IDE makes working with a programming language easier. RStudio provides the standard features of an
IDE, including a console, an editor for multiple script files and a display of the variables in the current environment.

Why RStudio
• RStudio is open-source software.
• RStudio runs on a variety of operating systems, including Windows, macOS and Linux.
• RStudio is more user-friendly than base R.

Downloading RStudio
You can download RStudio from this link:

https://rstudio.com/products/rstudio/download/


(RStudio, 2020)

Please note: You must install R before you can install RStudio. The two applications are separate and each needs to be
updated on its own schedule.

Working in R
R packages
Packages are a fundamental component of R. You can think of a package as a piece of code that someone shares with you:
you import it and then use the functions its author has created. Packages extend the functionality of the base R
installation, much as the Data Analysis ToolPak add-in extends Excel.

At the time of writing, R has more than 15 000 packages available.

Some of the most popular ones are tidyr, plotly and ggplot2.


Fig. Click on the Tools tab and select “Install Packages” to install a package in RStudio.

ISLR package
The ISLR package in R was developed to accompany the book An Introduction to Statistical Learning with Applications in
R. This is a useful resource I would recommend should you wish to further expand your data analytics and data science
skills in R after this course.

The package contains a collection of datasets used in the book. We will utilize this package to improve our understanding
of R and what it has to offer.

Fig. Search for the package and click “Install” in RStudio to install the ISLR package.
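The same installation can also be done from the R console instead of the menu. The lines below are a minimal sketch: they assume an internet connection, and the CRAN mirror URL is just one of several that could be used. Auto is one of the datasets that ships with ISLR.

```r
# Install ISLR from CRAN only if it is not already installed.
# Assumption: an internet connection is available; any CRAN mirror would do.
if (!requireNamespace("ISLR", quietly = TRUE)) {
  install.packages("ISLR", repos = "https://cloud.r-project.org")
}

library(ISLR)  # attach the package for the current session

# Names of the datasets that come with the package
dataset_names <- data(package = "ISLR")$results[, "Item"]
print(dataset_names)

head(Auto)  # first six rows of the Auto dataset
str(Auto)   # column names and types
```

A package only needs to be installed once, but library(ISLR) has to be run in every new R session before the datasets can be used.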

Maximum likelihood method


Methods of estimation
In module 1, we learnt about one method for estimating the line of best fit through a set of data points, the least
squares method. The least squares method fits a straight line through the data points and chooses the line of best fit by
minimizing the sum of the squared residuals, that is, the squared differences between the observed and the fitted values.
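As a short refresher, the least squares method can be tried in R with the lm() function. The sketch below uses cars, a small dataset built into R (stopping distance versus speed). It is chosen here only for illustration and is not part of the lesson material.

```r
# cars is a built-in dataset: speed (mph) and stopping distance (ft) of 50 cars.
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))  # intercept and slope of the fitted line

# The same estimates from the closed-form least squares formulas
slope <- cov(cars$speed, cars$dist) / var(cars$speed)
intercept <- mean(cars$dist) - slope * mean(cars$speed)
print(c(intercept = intercept, slope = slope))

# The quantity that least squares minimizes: the sum of squared residuals
rss <- sum(residuals(fit)^2)

# Any other line, for example one with a slightly steeper slope, does worse
rss_other <- sum((cars$dist - (intercept + (slope + 0.5) * cars$speed))^2)
print(c(rss = rss, rss_other = rss_other))
```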


Another method for choosing the model that best fits a given set of data is called maximum likelihood estimation.

Maximum likelihood
Maximum likelihood estimation is a method used to estimate the parameters of a distribution. It chooses the parameter
values under which the observed data is most probable. This concept might be better understood through an example.

(Brooks-Bartlett, 2018)

Suppose we measured the weights of children under a certain age. Through the maximum likelihood method, we want to
find the best way to fit a distribution to the data. There are many distributions besides the normal distribution. We want
to fit the best distribution to the data because we will then be able to apply the fitted distribution to every
experiment of the same type. Let’s assume that the weights of the children are normally distributed.

When we assume that the data is normally distributed, we expect the measurements to be symmetrical around the
mean and most of the points to lie close to the mean.

(Brooks-Bartlett, 2018)
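These two properties of the normal distribution can be checked in R with the built-in dnorm() and pnorm() functions. The mean of 20 kg and standard deviation of 3 kg below are made-up values used only for illustration.

```r
# Hypothetical parameters for the weights (in kg); not taken from real data.
mu <- 20
sigma <- 3

# Symmetry: the density is the same at equal distances on either side of the mean
left <- dnorm(mu - 2, mean = mu, sd = sigma)
right <- dnorm(mu + 2, mean = mu, sd = sigma)
print(c(left = left, right = right))

# Concentration: share of values within one and two standard deviations of the mean
within_1sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)
within_2sd <- pnorm(mu + 2 * sigma, mu, sigma) - pnorm(mu - 2 * sigma, mu, sigma)
print(c(within_1sd = within_1sd, within_2sd = within_2sd))  # roughly 0.68 and 0.95
```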


The normal distribution can take many different shapes and positions, depending on its mean and standard deviation. In
our case, we can see from the plot that the blue curve is positioned so that most of the measurements fall around the
centre of the distribution, the peak. The probability of observing these measurements under the blue curve is therefore
high. The location of the blue curve maximizes the likelihood of the observed data, which makes it the maximum
likelihood estimate for the mean.
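The idea of sliding a normal curve along the axis until the observed data is most probable can be sketched in R. The weights below are simulated rather than real measurements, the standard deviation is held fixed at an assumed value of 3, and log_lik is a helper name introduced only for this illustration.

```r
set.seed(1)
# Simulated stand-in for the measured weights (kg); not real data.
weights <- rnorm(50, mean = 20, sd = 3)

# Log-likelihood of the observations under a normal curve with the given mean and sd
log_lik <- function(mu, sigma, obs) {
  sum(dnorm(obs, mean = mu, sd = sigma, log = TRUE))
}

# Try many candidate locations for the curve, keeping the sd fixed at 3
candidate_means <- seq(15, 25, by = 0.01)
ll <- sapply(candidate_means, log_lik, sigma = 3, obs = weights)

# The candidate with the highest log-likelihood sits next to the sample mean
best_mean <- candidate_means[which.max(ll)]
print(c(best_mean = best_mean, sample_mean = mean(weights)))
```

Working with the logarithm of the likelihood is a common choice because it turns a product of many small probabilities into a sum, and it has its maximum at the same location.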

In the same way, the maximum likelihood estimate for the standard deviation can be calculated.

Together, the maximum likelihood estimates of the mean and the standard deviation define the normal distribution under
which the observed values are most likely.
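Both parameters can also be estimated together by letting R search for the values that maximize the likelihood. The sketch below again uses simulated weights, and neg_log_lik is a helper name introduced here. It minimizes the negative log-likelihood with the built-in optim() function and compares the result with the known closed-form estimates for a normal distribution.

```r
set.seed(1)
# Simulated stand-in for the measured weights (kg); not real data.
weights <- rnorm(200, mean = 20, sd = 3)

# Negative log-likelihood; optim() minimizes, so we flip the sign.
# The second parameter is log(sd), which keeps the sd positive during the search.
neg_log_lik <- function(par, obs) {
  mu <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(obs, mean = mu, sd = sigma, log = TRUE))
}

start <- c(median(weights), log(IQR(weights)))  # rough starting values
fit <- optim(start, neg_log_lik, obs = weights, method = "BFGS")

mle_numeric <- c(mean = fit$par[1], sd = exp(fit$par[2]))

# Closed-form maximum likelihood estimates for a normal distribution
mle_closed <- c(mean = mean(weights),
                sd = sqrt(mean((weights - mean(weights))^2)))

print(rbind(mle_numeric, mle_closed))
```

Note that the maximum likelihood estimate of the standard deviation divides by n, whereas R's sd() function divides by n - 1, so the two differ slightly for small samples.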

Additional resources
James, G., Witten, D., Hastie, T. & Tibshirani, R., An Introduction to Statistical Learning with Applications in R, 2017,
Springer, available online at https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf


References
• Brownlee, J., 2019, A Gentle Introduction to Linear Regression with Maximum Likelihood Estimation, Machine Learning
Mastery, https://machinelearningmastery.com/linear-regression-with-maximum-likelihood-estimation/
• Comtois, D., 2018, My favourite R package for: summarising data, Dabbling with Data,
https://dabblingwithdata.wordpress.com/2018/01/02/my-favourite-r-package-for-summarising-data/
• Data Carpentry, 2020, Introducing R and RStudio IDE, Data Carpentry,
https://datacarpentry.org/genomics-r-intro/01-introduction/index.html
• Hadley, C.J., 2016, RStudio Install, Lynda.com,
https://www.lynda.com/RStudio-tutorials/RStudio-install/452087/490022-4.html
• Howson, I., 2019, ISLR: Data for an Introduction to Statistical Learning with Applications in R, rdrr.io,
https://rdrr.io/cran/ISLR/
• Ismay, C. & Kennedy, P.C., 2019, Getting Used to R, RStudio and R Markdown,
https://ismayc.github.io/rbasics-book/3-rstudiobasics.html
• Jackson, C., 2015, Point & Interval Estimations: Definition & Differences, Study.com,
https://study.com/academy/lesson/point-interval-estimations-definition-differences-quiz.html
• James, G., Witten, D., Hastie, T. & Tibshirani, R., 2017, An Introduction to Statistical Learning with Applications
in R, Springer, available online at: https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf
• R Project, 2020, What is R, R, https://www.r-project.org/about.html
• RStudio, 2020, https://rstudio.com/
• Rungta, K., 2020, What is R Programming Language? Introduction & Basics of R, Guru99.com,
https://www.guru99.com/r-programming-introduction-basics.html
