
Unit 5 R

BIG DATA

Uploaded by

azhagu sundari

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues.
The R Language stands out as a powerful tool in the
modern era of statistical computing and data analysis.
Widely embraced by statisticians, data scientists, and
researchers, the R Language offers an extensive suite of
packages and libraries tailored for data manipulation,
statistical modeling, and visualization.
• R provides a wide variety of statistical (linear
and nonlinear modelling, classical statistical
tests, time-series analysis, classification,
clustering, …) and graphical techniques, and is
highly extensible.
Why Use R Language?
The R Language is a powerful tool widely used for data analysis, statistical computing, and machine learning.
1. Comprehensive Statistical Analysis:
R language is specifically designed for statistical analysis and provides a vast array of statistical techniques and tests,
making it ideal for data-driven research.
2. Extensive Packages and Libraries:
The R Language boasts a rich ecosystem of packages and libraries that extend its capabilities, allowing users to perform
advanced data manipulation, visualization, and machine learning tasks with ease.
3. Strong Data Visualization Capabilities:
R language excels in data visualization, offering powerful tools like ggplot2 and plotly, which enable the creation of
detailed and aesthetically pleasing graphs and plots.
4. Open Source and Free:
As an open-source language, R is free to use, which makes it accessible to everyone, from individual researchers to large
organizations, without the need for costly licenses.
5. Platform Independence:
The R Language is platform-independent, meaning it can run on various operating systems, including Windows, macOS,
and Linux, providing flexibility in development environments.
6. Integration with Other Languages:
R can easily integrate with other programming languages such as C, C++, Python, and Java, allowing for seamless
interaction with different data sources and statistical packages.
7. Growing Community and Support:
R language has a large and active community of users and developers who contribute to its continuous improvement and
provide extensive support through forums, mailing lists, and online resources.
8. High Demand in Data Science:
R is one of the most requested programming languages in the Data Science job market, making it a valuable skill for
professionals looking to advance their careers in this field.
Features of R Programming Language
The R Language is renowned for its extensive features that make it a powerful tool for data analysis, statistical computing, and visualization. Here
are some of the key features of R:
1. Comprehensive Statistical Analysis:
The R language provides a wide array of statistical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis,
classification, and clustering.
2. Advanced Data Visualization:
With packages like ggplot2, plotly, and lattice, R excels at creating complex and aesthetically pleasing data visualizations, including plots, graphs,
and charts.
3. Extensive Packages and Libraries:
The Comprehensive R Archive Network (CRAN) hosts thousands of packages that extend R’s capabilities in areas such as machine learning, data
manipulation, bioinformatics, and more.
4. Open Source and Free:
R is free to download and use, making it accessible to everyone. Its open-source nature encourages community contributions and continuous
improvement.
5. Platform Independence:
R is platform-independent, running on various operating systems, including Windows, macOS, and Linux, which ensures flexibility and ease of
use across different environments.
6. Integration with Other Languages:
R language can integrate with other programming languages such as C, C++, Python, Java, and SQL, allowing for seamless interaction with
various data sources and computational processes.
7. Powerful Data Handling and Storage:
R efficiently handles and stores data, supporting various data types and structures, including vectors, matrices, data frames, and lists.
8. Robust Community and Support:
R has a vibrant and active community that provides extensive support through forums, mailing lists, and online resources, contributing to its rich
ecosystem of packages and documentation.
9. Interactive Development Environment (IDE):
RStudio, the most popular IDE for R, offers a user-friendly interface with features like syntax highlighting, code completion, and integrated tools
for plotting, history, and debugging.
10. Reproducible Research:
R supports reproducible research practices with tools like R Markdown and knitr, enabling users to create dynamic reports, presentations, and
documents that combine code, text, and visualizations.
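The data structures named in feature 7 can be sketched in a few lines. This is a minimal illustration only; the variable names and values (scores, grid, record, students) are made up for the example.

```r
# A vector holds values of a single type
scores <- c(72, 85, 91)

# A matrix is a two-dimensional arrangement of values of one type
grid <- matrix(1:6, nrow = 2, ncol = 3)

# A list can mix types and lengths
record <- list(name = "Asha", scores = scores)

# A data frame is a table whose columns may have different types
students <- data.frame(
  name = c("Asha", "Ravi", "Mei"),
  score = scores
)

print(mean(scores))                      # average of the vector
print(dim(grid))                         # rows and columns of the matrix
print(record$name)                       # access a list element by name
print(students[students$score > 80, ])   # rows where score exceeds 80
```

Vectors and matrices suit homogeneous numeric work, while lists and data frames are the usual choice when values of different types belong together.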
Advantages of R language
R is one of the most comprehensive statistical analysis packages, and new statistical techniques and concepts often
appear first in R.
R is open source, so you can install and run it freely wherever you need it.
R is cross-platform: it runs on GNU/Linux, Windows, and macOS.
In R, everyone is welcome to provide new packages, bug fixes, and code enhancements.
Disadvantages of R language
The quality of some R packages is less than perfect.
R commands give little thought to memory management, so R may consume all available memory.
R has no vendor behind it, so there is nobody to complain to if something doesn't work.
R can be slower than other programming languages such as Python and MATLAB for some workloads.
• Applications of R language
• We use R for data science. It gives us a broad variety of
libraries related to statistics, and it provides an
environment for statistical computing and design.
• Many quantitative analysts use R as their programming
tool, since it helps with data importing and cleaning.
• R is widely used by data analysts and research
programmers, and it serves as a fundamental tool
in finance.
• Tech giants like Google, Facebook, Bing, Twitter, Accenture,
Wipro, and many more use R nowadays.
| Feature | R | Python |
|---|---|---|
| Introduction | R is a language and environment for statistical programming, which includes statistical computing and graphics. | Python is a general-purpose programming language for data analysis and scientific computing. |
| Objective | It has many features which are useful for statistical analysis and representation. | It can be used to develop GUI applications and web applications, as well as with embedded systems. |
| Workability | It has many easy-to-use packages for performing tasks. | It can easily perform matrix computation as well as optimization. |
| Integrated development environment | Various popular R IDEs are RStudio, RKWard, R Commander, etc. | Various popular Python IDEs are Spyder, Eclipse+PyDev, Atom, etc. |
| Libraries and packages | There are many packages and libraries like ggplot2, caret, etc. | Some essential packages and libraries are pandas, NumPy, SciPy, etc. |
| Scope | It is mainly used for complex data analysis in data science. | It takes a more streamlined approach for data science projects. |
| Features | R | Python |
|---|---|---|
| Data collection | It is used by data analysts to import data from Excel, CSV, and text files. | It is used with all kinds of data formats, including SQL tables. |
| Data exploration | It is optimized for the statistical analysis of large datasets. | You can explore data with pandas. |
| Data modeling | It supports the tidyverse, which makes it easy to import, manipulate, visualize, and report on data. | You can use NumPy, SciPy, scikit-learn, and TensorFlow. |
| Data visualization | You can use ggplot2 and related ggplot tools to plot complex scatter plots with regression lines. | You can use Matplotlib, pandas, and Seaborn. |
Statistical Analysis and Machine Learning In R and Python

| Capability | R | Python |
|---|---|---|
| Basic Statistics | Built-in functions (mean, median, etc.) | NumPy (mean, median, etc.) |
| Linear Regression | lm() function and formulas | Statsmodels (OLS), the Ordinary Least Squares (OLS) method |
| Generalized Linear Models (GLM) | glm() function | Statsmodels (GLM) |
| Time Series Analysis | Time series packages (forecast) | Statsmodels (Time Series) |
| ANOVA and t-tests | Built-in functions (aov, t.test) | SciPy (ANOVA, t-tests) |
| Hypothesis Tests | Built-in functions (wilcox.test, etc.) | SciPy (Mann-Whitney, Kruskal-Wallis) |
| Principal Component Analysis (PCA) | princomp() function | scikit-learn (PCA) |
| Clustering (K-Means, Hierarchical) | kmeans(), hclust() | scikit-learn (KMeans, AgglomerativeClustering) |
| Decision Trees | rpart() function | scikit-learn (DecisionTreeClassifier) |
| Random Forest | randomForest() function | scikit-learn (RandomForestClassifier) |
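A few of the R entries in the table can be tried without installing anything. The sketch below is an illustration only: it uses the mtcars dataset that ships with R, and the choice of columns (mpg, wt, am) and of two clusters is an assumption made for the example. rpart() and randomForest() are left out because they depend on add-on packages that may not be installed.

```r
# Linear regression with lm() and a formula: fuel economy explained by weight
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Two-sample t-test with t.test(): mpg for automatic (am = 0) vs manual (am = 1)
tt <- t.test(mpg ~ am, data = mtcars)
print(tt$p.value)

# K-means clustering with kmeans() on two standardized columns
set.seed(1)  # k-means starts from random centers, so fix the seed
km <- kmeans(scale(mtcars[, c("mpg", "wt")]), centers = 2)
print(table(km$cluster))
```

The formula interface (response ~ predictor) is shared by lm(), glm(), aov(), and t.test(), which is why the table lists formulas alongside lm().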


Advantages in R Programming and Python Programming

| R Programming | Python Programming |
|---|---|
| It supports large datasets for statistical analysis. | It is a general-purpose programming language that can also be used to analyze data. |
| Primary users are scholars and R&D. | Primary users are programmers and developers. |
| Supports packages like tidyverse, ggplot2, caret, zoo. | Supports packages like pandas, SciPy, scikit-learn, TensorFlow. |
| Supports RStudio, and it has a wide range of statistics and general data analysis and visualization capabilities. | Supports Conda environments with Spyder and IPython Notebook. |
RStudio
• RStudio is an integrated development environment (IDE)
for R. The IDE is a GUI where you can write your code, see
the results, and also see the variables that are generated
during the course of programming.
• RStudio is available as both open-source and
commercial software.
• RStudio is also available in both Desktop and Server
versions.
• RStudio is also available for various platforms such as
Windows, Linux, and macOS.
• The Console panel (left panel) is the place where R is waiting for you to tell it
what to do, and where you see the results that are generated when you type in
commands.
• To the top right, you have the Environment/History panel. It contains 2 tabs:
– Environment tab: It shows the variables that are generated during the course of
programming in a workspace that is temporary.
– History tab: In this tab, you'll see all the commands that have been used since the
start of the RStudio session.
• To the bottom right, you have another panel, which contains multiple tabs, such
as Files, Plots, Packages, Help, and Viewer.
– The Files tab shows the files and directories that are available within the default
workspace of R.
– The Plots tab shows the plots that are generated during the course of programming.
– The Packages tab lists the packages that are already installed
and also gives a user interface to install new packages.
– The Help tab is the most important one, where you can get help from the R
documentation on the functions that are built into R.
– The last tab is the Viewer tab, which can be used to see local web
content that is generated using R.
• Features of RStudio
• A friendly user interface
• Writing and storing reusable programs
• All imported data and newly created objects (such as variables,
functions, etc.) are easily accessible.
• Comprehensive help for any item
• Code autocompletion
• The capacity to organise and share your work with your partners
more effectively through the creation of projects.
• Plot snippets
• Simple terminal and console switching
• Tracking of operational history
• There are numerous articles from RStudio Support on using the IDE.
Installing R packages
install.packages('package_name')
Loading R package
library(package_name)
Help on an R package
help(package = "package_name")
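The install and load steps above are often combined so that a package is installed only when it is missing. The sketch below is one way to do that; the package name "tools" is a hypothetical choice for illustration, picked because it ships with R and so needs no download.

```r
pkg <- "tools"  # hypothetical choice: a package that ships with R

# Install only when the package is missing, so repeated runs stay fast
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() take the package name from a variable
library(pkg, character.only = TRUE)

# .packages() lists the packages currently attached to the session
print(pkg %in% .packages())
```

Note that install.packages() takes the name as a quoted string, while library() accepts a bare name unless character.only = TRUE is set.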
Types of Comments in R
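In general, programming languages offer single-line and multi-line comments.
R only has single-line comments, which start with #, so a multi-line comment
is written as several single-line comments in a row: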

single-line comments
# this code prints Hello World
print("Hello World")
multi-line comments
# this is a print statement
# it prints Hello World
print("Hello World")
R Variables and Constants
Types of R Variables
1. Boolean Variables
a = TRUE
print(a)
print(class(a))
Output
[1] TRUE
[1] "logical"
2. Integer Variables
A = 14L
print(A)
print(class(A))
Output
[1] 14
[1] "integer"
3. Floating Point Variables
x = 13.4
print(x)
print(class(x))
Output
[1] 13.4
[1] "numeric"
4. Character Variables
alphabet = "a"
print(alphabet)
print(class(alphabet))
Output
[1] "a"
[1] "character"
5. String Variables
message = "Welcome to R"
print(message)
print(class(message))
Output
[1] "Welcome to R"
[1] "character"
R Constants
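Constants are those entities whose values aren't meant to be changed anywhere
throughout the code. R has no dedicated keyword for declaring constants; a constant
value is simply bound to a name with the <- assignment operator, and by convention
the name is not reassigned afterwards.
x <- "Welcome to R"
print(x)
Output
[1] "Welcome to R"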
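Because <- is ordinary assignment, nothing stops a later line from overwriting x. If a value really must not change, base R's lockBinding() can make the name read-only. The sketch below is an illustration of that option, not something the slides prescribe; the result variable is introduced only for the example.

```r
x <- "Welcome to R"

# lockBinding() marks the name x as read-only in the current environment
lockBinding("x", environment())

# Any later assignment to x now raises an error, which tryCatch() captures here
result <- tryCatch({
  x <- "changed"
  "assignment succeeded"
}, error = function(e) "assignment blocked")

print(result)
print(x)
```

unlockBinding("x", environment()) reverses the lock if the value has to be updated later.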

• Types of R Constants
• In R, we have the following types of constants.
• The five types of R constants are numeric, integer, complex, logical, and string.
• In addition to these, there are 4 special R constants:
NULL, NA, Inf, and NaN.
1. Integer Constants
x <- 15L
print(typeof(x))
print(class(x))
Output
[1] "integer"
[1] "integer"
# hexadecimal value
x <- 0x15L
print(x)
# exponential value
x <- 1e5L
print(x)
Output
[1] 21
[1] 100000
2. Numeric Constants
z <- 3e-3
print(z) # 0.003
print(class(z)) # "numeric"
y <- 3.4
print(y) # 3.4
print(class(y)) # "numeric"
Output
[1] 0.003
[1] "numeric"
[1] 3.4
[1] "numeric"
3. Logical Constants
x <- TRUE
y <- FALSE
print(x)
print(y)
Output
[1] TRUE
[1] FALSE
4. String Constants
message <- "Welcome to R"
print(message)
Output
[1] "Welcome to R"
5. Complex Constants
y <- 3.2e-1i
print(y)
print(typeof(y))
Output
[1] 0+0.32i
[1] "complex"
NULL
x <- NULL
print(x) # NULL
Inf (Infinite)
a <- 2^2020
print(a) # Inf
NaN (Not a Number)
print(0/0) # NaN
print(Inf/Inf) # NaN
NA (Not Available)
print(NA + 20) # NA
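These special values cannot be compared reliably with ==, so base R provides a test function for each. The sketch below shows the usual checks; the vector x is made up for the example.

```r
print(is.null(NULL))          # TRUE
print(is.na(NA))              # TRUE
print(is.na(NaN))             # TRUE: NaN also counts as missing
print(is.nan(NA))             # FALSE: NA is not NaN
print(is.infinite(2^2020))    # TRUE
print(length(NULL))           # 0: NULL is an empty object

# NA propagates through arithmetic unless it is removed explicitly
x <- c(4, NA, 10)
print(mean(x))                # NA
print(mean(x, na.rm = TRUE))  # 7
```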
Rules to Declare R Variables
• A variable name in R can be created using letters,
digits, periods, and underscores.
• You can start a variable name with a letter or a
period, but not with digits.
• If a variable name starts with a dot, you can't follow
it with digits.
• R is case sensitive. This means that age and Age are
treated as different variables.
• We have some reserved words that cannot be used
as variable names.
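The rules above can be checked with base R's make.names(), which rewrites an invalid name into a valid one. A name that comes back unchanged was already valid. The helper is_valid_name and the candidate names are hypothetical, introduced only for this sketch.

```r
# A name is syntactically valid when make.names() leaves it unchanged
is_valid_name <- function(name) {
  make.names(name) == name
}

candidates <- c("age", "Age", ".rate", "total_2",
                "2total", ".2rate", "_tmp", "if")

# Show each candidate next to the verdict
print(data.frame(name = candidates, valid = is_valid_name(candidates)))
```

The first four names follow the rules. The last four fail because they start with a digit, start with a dot followed by a digit, start with an underscore, or are a reserved word.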
Built-In R Constants
# print list of uppercase letters
print(LETTERS)
# print list of lowercase letters
print(letters)
# print 3 letters abbreviation of English months
print(month.abb)
# print numerical value of constant pi
print(pi)

Output
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
[1] 3.141593
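The built-in constants are ordinary vectors, so they can be indexed and used in calculations like any other value. A short sketch follows; month.name, the full-name companion to month.abb, is also part of base R, and the radius value is made up for the example.

```r
print(LETTERS[1:3])      # first three uppercase letters
print(letters[24:26])    # last three lowercase letters
print(month.name[1])     # full month names are available as month.name
print(round(pi, 2))      # pi rounded to two decimals

# area of a circle with radius 2, using the built-in pi
radius <- 2
print(pi * radius^2)
```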
Different Types of Data Types
• logical
• numeric
• integer
• complex
• character
• raw

logical
bool1 <- TRUE
print(bool1)
print(class(bool1))
Output
[1] TRUE
[1] "logical"

numeric
# floating point values
weight <- 63.5
print(weight)
print(class(weight))
# real numbers
height <- 182
print(height)
print(class(height))
Output
[1] 63.5
[1] "numeric"
[1] 182
[1] "numeric"

integer
var1 <- 186L
print(class(var1))
Output
[1] "integer"

complex
# 2i represents imaginary part
complex_value <- 3 + 2i
# print class of complex_value
print(class(complex_value))
Output
[1] "complex"

character
# create a string variable
fruit <- "Apple"
print(class(fruit))
# create a character variable
my_char <- 'A'
print(class(my_char))
Output
[1] "character"
[1] "character"
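Values can be converted between these types with the as.*() family and tested with the is.*() family. The sketch below is a small illustration; the variable names and values are chosen only for the example.

```r
height <- 182
print(class(height))            # "numeric": a plain number is a double

# as.integer() converts the double to an integer
height_int <- as.integer(height)
print(class(height_int))        # "integer"
print(is.numeric(height_int))   # TRUE: integers also count as numeric

# Conversions between character, numeric, and logical
print(as.numeric("63.5") + 1)   # 64.5
print(as.character(186L))       # "186"
print(as.logical("TRUE"))       # TRUE
```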
A raw data type specifies values as raw bytes. You can use the following methods to convert
character data types to a raw data type and vice-versa:
charToRaw() - converts character data to raw data
rawToChar() - converts raw data to character data

# convert character to raw
raw_variable <- charToRaw("Welcome to R")
print(raw_variable)
print(class(raw_variable))
# convert raw back to character
char_variable <- rawToChar(raw_variable)
print(char_variable)
print(class(char_variable))
Output
[1] 57 65 6c 63 6f 6d 65 20 74 6f 20 52
[1] "raw"
[1] "Welcome to R"
[1] "character"
# Python program to add two numbers
numb1 = 8
numb2 = 4
sum = numb1 + numb2
print("The sum is", sum)

# R program to print a message
message <- "Hello World!"
print(message)
R Print Output
# print values
print("R is fun")
# print variables
x <- "Welcome to R"
print(x)
Output
[1] "R is fun"
[1] "Welcome to R"

paste(str, var) - print a string and variable together
v1 <- "R prog"
# print string and variable together
print(paste("Welcome to", v1))
# paste0() joins the pieces without a separating space
print(paste0("Welcome to", v1))
Output
[1] "Welcome to R prog"
[1] "Welcome toR prog"
sprintf() - print formatted strings
myString <- "Welcome to R"
sprintf("String: %s", myString)
myInteger <- 123
sprintf("Integer Value: %d", myInteger)
myFloat <- 12.34
sprintf("Float Value: %f", myFloat)
Output
[1] "String: Welcome to R"
[1] "Integer Value: 123"
[1] "Float Value: 12.340000"
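The format specifiers also accept a width and a precision, which is how the six decimal places printed by a bare %f can be trimmed. A short sketch follows; the variables price and count are made up for the example.

```r
price <- 12.3456
count <- 7L

print(sprintf("%.2f", price))   # two decimal places: "12.35"
print(sprintf("%5d", count))    # right-aligned in a field of width 5
print(sprintf("%05d", count))   # zero-padded to width 5: "00007"

# several values can be formatted in one call
print(sprintf("%s has %d items costing %.1f", "cart", count, price))
```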
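cat() - print text without the [1] index or quotes
msg <- "R prog"
cat("Welcome to", msg)
Output
Welcome to R prog

Print Variables in R Terminal
# typing a variable name at the console prints its value
msg
Output
[1] "R prog"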
