Module 1 - Introduction To R

Uploaded by

John David

BIA 5303 Big Data 2

Module 1

Introduction to R

Reference: Based on Prabhpreet Sidhu's slides
Lecturer: Anas Kuzechie

Objectives
✓ Understand the importance of big data in data science.
✓ See the evolution of big data.
✓ Look at the landscape of big data.
✓ Install R interpreter and RStudio IDE.
✓ Learn to program in R using RStudio.

Big Data

✓ Big Data is a discipline that studies four aspects:
  ✓ Application of Big Data analysis to enhance business performance.

Big Data History

✓ Enterprise Data Warehouse: for small, refined, and relational data storage.
✓ NoSQL database: stores data in a format other than relational tables.
✓ Enterprise Data Lake: for large, raw, and undefined data storage.
✓ Open-source framework that manages the storage and processing of large amounts of data for applications.
✓ Real-time data streaming.
✓ Data stored on company's premises.
✓ Cloud-based processing (Microsoft, Google, etc.).
✓ Hybrid processing (on premise and cloud).
✓ Business Intelligence.
✓ Machine Learning.
✓ Natural Language Processing.

Big Data Era 2

Streaming
✓ Stream Processing: continuously query and analyze data in real-time, as it arrives.
✓ Examples: Sensors, traffic, web events, health, social media, gaming.
Big Data Landscape Summary

✓ Big Data: data sets that are so large or complex that traditional software cannot
deal with them.
✓ Volume: terabytes to exabytes of data to store and process.
✓ Velocity: streaming data, milliseconds to respond.
✓ Variety: data in many forms.
✓ Data Storage:
  ✓ How? Data warehouse vs data lake.
  ✓ Where? On-premise vs cloud.
✓ Data Processing:
  ✓ Where? On-premise vs cloud.
  ✓ When? Batch vs streaming.

Structured vs Unstructured Data

✓ Structured data is data that fits neatly into a table with columns and rows, e.g.,
transactional data and financial data.
✓ Unstructured data is data that does not fit into a table, e.g., images, videos,
audio, and tweets. To interact with such data, we need special tools and database
structures such as the Hadoop ecosystem.

Installing R and RStudio

https://posit.co/download/rstudio-desktop/

✓ R Interpreter
✓ R Integrated Development Environment (IDE)

RStudio

✓ R script with extension .R containing R code.
✓ Environment shows objects in memory with assigned value(s).
✓ Console is where we can type commands and see output.
✓ Files show all files and folders in your default workspace.
✓ Plots will show all the graphs.
✓ Packages will list a series of packages needed to run certain processes.

Example

✓ Statement to generate a matrix having 2 rows and 3 columns.
✓ Click on the dotted square to see the data on the top left.
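
A statement of the kind described here could look like the following minimal sketch. The object name `m` and the values 1 to 6 are illustrative assumptions, not taken from the slide.

```r
# Create a matrix with 2 rows and 3 columns.
# By default R fills the values column by column.
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)

# In RStudio, View(m) opens the same data viewer as clicking the
# grid icon next to m in the Environment pane (commented out here
# because it needs an interactive session).
# View(m)
```

After running the statement, `m` appears in the Environment pane.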
History Tab

✓ History tab keeps a record of all previous commands.
✓ We can select the commands we want and send them to an R script.
✓ Click To Source to copy the selected commands to source file.
Setting Default Working Directory

✓ Set default working directory that will have all your R source files.
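
The working directory can also be inspected and changed from the console. The sketch below uses `tempdir()` as a stand-in for your own project folder (an assumption for illustration); `setwd()` only affects the current session, so it complements rather than replaces the default set in RStudio.

```r
# Remember the current working directory.
old_dir <- getwd()

# Point R at a different folder; tempdir() stands in for the folder
# that holds your R source files.
setwd(tempdir())
print(getwd())

# Switch back to the original working directory.
setwd(old_dir)
```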
R Script

✓ RStudio interface has four windows:
  ✓ Console.
  ✓ Environment and History.
  ✓ Files, Plots, Packages, and Help.
  ✓ R Scripts and Data View.

✓ Creating an R script:

Or

✓ Running an R script: select the commands to execute, then click Run to see output on the Console.
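
A whole script can also be run from the console with `source()`. In this minimal sketch the file name `hello.R` and its two commands are hypothetical, and the file is written to a temporary folder so the example is self-contained.

```r
# Write a two-line script to a temporary .R file
# (a stand-in for a script saved from the RStudio editor).
script_path <- file.path(tempdir(), "hello.R")
writeLines(c("total <- 2 + 3", "print(total)"), script_path)

# Run every command in the file; objects it creates (here: total)
# become available in the workspace.
source(script_path)
```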
Packages Tab

✓ Packages tab shows the list of add-on packages installed with R. If checked, the package is loaded into R.
✓ We can also install other add-ons by clicking on the Install icon.
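
The same actions are available as commands. This is a minimal sketch: `tools` is a package that ships with R, while `MASS` and `ggplot2` are only example package names chosen for illustration.

```r
# Load an installed package (the equivalent of ticking its checkbox).
library(tools)

# Check whether a package is installed without loading it.
has_mass <- requireNamespace("MASS", quietly = TRUE)
print(has_mass)

# Install a package from CRAN (the equivalent of the Install icon).
# Commented out because it needs an internet connection.
# install.packages("ggplot2")
```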
Plots Tab
R Programming Language

✓ R supports object-oriented programming (OOP). Everything we do in R can be saved in an object, and functions operate on those objects.
✓ OOP is designed to reduce the amount of code required to accomplish a task; in the case of R, the amount of code needed to perform statistical analysis.
✓ Numbers, datasets, or the output of a linear regression can all be stored in an
object (variable) using the <- operator.
✓ R is case sensitive! Check for this first when you get errors.
✓ Comments: in R, we can annotate our code with comments. Just preface the line with a hash mark (#), and anything that comes thereafter will be ignored by the interpreter.
✓ R Objects:
  ✓ Single entry
  ✓ Vector
  ✓ Matrix
  ✓ Data frame
  ✓ List
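
Assignment, comments, and case sensitivity can be seen together in a short sketch. The object names are illustrative assumptions; `cars` is a dataset that ships with R.

```r
# Anything after a hash mark is ignored by the interpreter.
my_value <- 5                            # store a number in an object
fit <- lm(dist ~ speed, data = cars)     # store the output of a linear regression

# R is case sensitive: my_value exists, My_Value does not.
print(exists("my_value"))
print(exists("My_Value"))
```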
Single Entry

✓ The most basic data class in R. A single entry is either a single number or a single string.

✓ Example
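
A minimal sketch of single entries; the object names and values are illustrative assumptions.

```r
age <- 25            # a single number
student <- "Ali"     # a single string

class(age)           # "numeric"
class(student)       # "character"

# Internally, a single entry is a vector of length 1.
length(age)          # 1
```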

Vector
✓ A vector contains a series of numbers or strings of one consistent type. We create
a vector using the c() function.
✓ Example
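
A minimal sketch of vectors built with `c()`; the names and values are illustrative assumptions.

```r
scores <- c(80, 92, 75)                    # numeric vector
fruits <- c("apple", "banana", "cherry")   # character vector

scores[2]        # second element: 92
length(fruits)   # 3

# Mixing types forces everything to one consistent type.
mixed <- c(1, "two", 3)
class(mixed)     # "character"
```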
Matrix

✓ A matrix is a series of vectors of the same type and length, arranged in rows and columns.

✓ Example
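
A minimal sketch that builds a matrix from vectors of the same type; the names and values are illustrative assumptions. `rbind()` stacks vectors as rows, and `cbind()` would place them side by side as columns.

```r
row1 <- c(1, 2, 3)
row2 <- c(4, 5, 6)

# Stack the two numeric vectors as rows of a matrix.
score_matrix <- rbind(row1, row2)
print(score_matrix)

dim(score_matrix)     # 2 3  (rows, columns)
score_matrix[2, 3]    # element in row 2, column 3: 6
```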

Data Frame
✓ A data frame is a series of equal-length vectors (columns) that can be of different types.

✓ Example
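
A minimal sketch of a data frame whose columns have different types; the column names and values are illustrative assumptions.

```r
students <- data.frame(
  name   = c("Ali", "Mei", "Sam"),   # text column
  age    = c(21, 23, 22),            # numeric column
  passed = c(TRUE, FALSE, TRUE)      # logical column
)

str(students)        # shows each column with its type
students$age         # access one column as a vector
nrow(students)       # 3
ncol(students)       # 3
```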

List
✓ A list can be a combination of the previous four types. For example, the output of a
regression is a list.
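
A minimal sketch of a list that combines the other object types, followed by a check that regression output is itself a list. The element names and values are illustrative assumptions; `cars` is a dataset that ships with R.

```r
course <- list(
  code   = "BIA 5303",                                   # single entry
  scores = c(80, 92, 75),                                # vector
  grid   = matrix(1:4, nrow = 2),                        # matrix
  roster = data.frame(id = 1:2, name = c("Ali", "Mei"))  # data frame
)

course$scores[1]     # first score: 80

# The output of a linear regression is stored as a list.
fit <- lm(dist ~ speed, data = cars)
is.list(fit)         # TRUE
names(fit)           # components such as "coefficients" and "residuals"
```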
R Pros & Cons

Pros
✓ Fast and free.
✓ R is way ahead of SPSS and SAS.
✓ Second only to Matlab for graphics.
✓ Active user community.
✓ Excellent for simulation, programming, computer intensive analysis, etc.
✓ Forces you to think about your analysis.
✓ Interfaces with database systems such as MySQL.

Cons
✓ Steep learning curve.
✓ No commercial support.
✓ Easy to make mistakes and not know.
✓ Working with large datasets is limited by RAM.
✓ Data preparation and cleaning can be messier and more mistake prone in R vs SPSS or SAS.
