Module 1 - Introduction To R
Module 1 - Introduction To R
Module 1
Introduction to R Reference:
Based on Prabhpreet Sidhu's slides
Objectives
✓ Understand the importance of big data in data science.
✓ See the evolution of big data.
✓ Look at the landscape of big data.
✓ Install R interpreter and RStudio IDE. Lecturer
✓ Learn to program in R using RStudio. Anas Kuzechie
Big Data
Real-time data
streaming
Cloud-based processing
(Microsoft, Google, etc.) Hybrid processing (on
Data stored on premise and cloud)
company's premises
Machine Learning
Business
Intelligence Natural Language Processing
Big Data Era 2
Streaming
✓ Stream Processing: continuously query and analyze data in real-time, as it arrives.
✓ Examples: Sensors, traffic, web events, health, social media, gaming.
Big Data Landscape Summary
✓ Big Data: data sets that are so large or complex that traditional software cannot
deal with them.
✓ Volume: terabytes to exabytes of data to store and process.
✓ Velocity: streaming data, milliseconds to respond.
✓ Variety: data in many forms.
✓ Data Storage:
✓ How? Data warehouse vs data lake.
✓ Where? On-premise vs cloud.
✓ Data Processing:
✓ Where? On-premise vs cloud.
✓ When? Batch vs streaming.
Structured vs Unstructured Data
✓ Structured Data is data that fits neatly into a table with columns and rows, e.g.,
transactional data, financial data, etc.
✓ Unstructured data is data that does not fit into a table, e.g., images, videos,
audios, tweets, etc. To interact with such data, we need special tools and database
structures like Hadoop ecosystem.
Installing R and RStudio
https://posit.co/download/rstudio-desktop/
✓ Creating an R script:
Or
✓ Running an R
script:
Select the commands
to execute, then click
Run to see output on
the Console.
Packages Tab
✓ Most basic data class in R. They are either single numbers or single strings.
✓ Example
Vector
✓ A vector contains a series of numbers or strings of one consistent type. We create
a vector using the c command.
✓ Example
Matrix
✓ Example
Data Frame
✓ Data frame is a series of vectors of different types.
✓ Example
List
✓ List can be a combination of the previous four types. For example, the output of a
regression is a list.
R Pros & Cons
Pros Cons
Fast and free Steep learning curve
R is way ahead of SPSS and SAS No commercial support
Second only to Matlab for graphics Easy to make mistakes and not know
Active user community Working with large datasets is limited by RAM
Excellent for simulation, programming, Data preparation and cleaning can be messier
computer intensive analysis, etc. and more mistake prone in R vs SPSS or SAS
Forces you to think about your analysis
Interfaces with database systems such as
MySQL.