DS R Unit-1
R PROGRAMMING
UNIT - I
Syllabus
Defining Data Science and Big Data, Benefits and Uses, Facets of Data, Data Science
Process. History and Overview of R, Getting Started with R, R Nuts and Bolts
Data:
Data is a collection of information gathered by observation, research,
analysis, etc.
Data Science:
Data science is advanced statistical computing.
Data science is the study of data. It is a field of applied
mathematics and statistics that extracts useful information from large amounts
of complex data, or big data. This field uses scientific methods, processes,
algorithms, and patterns to extract knowledge and insights from big data.
Data science relies on powerful hardware, programming systems, and
efficient algorithms to solve data-related problems. Data science focuses
on past data, present data, and also future predictions.
Big Data:
Big Data is a collection of data that is huge in volume. Big data is a
combination of structured, semi-structured and
unstructured data collected by organizations that can be used in machine
learning projects, predictive modeling and other advanced analytics applications.
Companies use big data in their systems to improve operations and provide better
customer service. Big data is also used by medical researchers to identify disease
and medical conditions of patients. Financial services use big data systems for risk
management and real-time analysis of market data.
2. Increased Efficiency
Business operations can be made more efficient and costs can be cut
with the use of data science.
1.Search Engines
Search engines are one of the most visible applications of data science. When we want
to search for something on the internet, we mostly use search engines
such as Google, Bing, and Yahoo.
2. Transport
Data science is also used in the transport field. It can optimize
shipping routes in real time.
3. Finance
Data Science plays a key role in Financial Industries. Data Science is
widely used in the banking and finance sectors for fraud detection.
4. E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use data science to
provide a better user experience.
5. Health Care
Data science can identify and predict disease, and personalize
healthcare recommendations.
6. Image Recognition
Data science is also used in image recognition. For example, when
we upload a picture with our friends on Facebook, Facebook
suggests tagging the people in the picture. This is done with the help of
machine learning and data science.
Facets of data
Big data and data science generate very large amounts of data. This
data comes in various types; the main categories are as follows:
1. Structured
2. Unstructured
3. Natural language
4. Machine-generated
5. Graph-based
6. Audio, video, and images
7. Streaming
1.Structured data
Structured data is arranged in rows and columns. This helps applications to
retrieve and process data easily. A database management system is used for storing
structured data. An Excel table is an example of structured data.
2.Unstructured data
Unstructured data is data that does not follow a specified format. Rows and
columns are not used for unstructured data, so it is difficult to retrieve
the required information. Unstructured data can be in the form of text. Email is an
example of unstructured data.
3.Natural language
Natural language is a special type of unstructured data. Natural language
processing enables machines to recognize characters, words and sentences. This
helps machines to understand language like humans.
4.Machine-generated data
Machine-generated data is information that is automatically created by a
computer, process, application, etc. without human interaction. Examples of
machine data are web server logs, call detail records, network event logs, etc.
7.Streaming data
Streaming data, also known as real-time data, event data, stream data
processing, or data-in-motion, refers to a continuous flow of information
generated by various sources.
2. Retrieving data
The second step is to collect data. You have stated in the project charter which data you need and
where you can find it. In this step you ensure that you can use the data in your program, which means
checking the existence of, quality, and access to the data.
3. Data preparation
Data collection is an error-prone process. In this phase you enhance the quality of the data and prepare
it for use in subsequent steps. This phase consists of three subphases: data cleansing, data integration and
data transformation.
4. Data exploration
Data exploration is concerned with building a deeper understanding of your data. You try to
understand how variables interact with each other, the distribution of the data, and whether there are
outliers. To achieve this you mainly use descriptive statistics, visual techniques, and simple modeling.
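The exploration step described above can be sketched in R with descriptive statistics. This is only an illustration: the `sales` values are made up, and the outlier check uses the standard box plot rule from `boxplot.stats()`, which is one of several possible choices.

```r
# Hypothetical sample data (not from the notes): six sales figures,
# one of which looks suspicious
sales <- c(10, 12, 11, 13, 12, 95)

# Descriptive statistics: minimum, quartiles, median, mean and maximum
print(summary(sales))

# Values lying more than 1.5 times the interquartile range outside
# the hinges are reported as outliers (box plot rule)
outliers <- boxplot.stats(sales)$out
print(outliers)
```

A value flagged this way is only a candidate outlier; whether to correct or drop it is decided in the data preparation step.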
10. R plays a crucial role in the field of data science. Its extensive
set of packages and libraries makes it well suited for data
analysis.
R Get Started
R – Environment Setup for Windows
You can download the latest Windows installer version of R from
CRAN (Comprehensive R Archive Network). CRAN is a network of web servers
around the world that store up-to-date versions and documentation of R.
Installing R on Windows OS
To install R on Windows OS:
1. Go to the CRAN website.
2. Click on "Download R for Windows".
3. Click on "install R for the first time" link to download the R executable (.exe) file.
4. Run the R executable file to start installation.
5. Select the installation language.
R is now successfully installed on your Windows OS. Open the R GUI to start
writing R code.
The screenshot below shows R console on a Windows PC.
RStudio
RStudio is a freely available, open-source Integrated Development Environment
(IDE) for R. It provides many features that make working with R easier.
RStudio is a graphical user interface, not just a command prompt.
Installing RStudio Desktop
To install RStudio Desktop on your computer, do the following:
1. Go to the RStudio website.
2. Download RStudio Desktop recommended for your computer.
3. Run the RStudio Executable file (.exe) for Windows.
RStudio is now successfully installed on your computer. The RStudio Desktop IDE interface is
shown in the figure below:
RStudio Interface
The RStudio interface has four main panels:
Console: You can type commands and see output.
Comments
Non-executable lines in an R script or at the R console are called comments.
Comment lines are for documentation purposes and are ignored by
the interpreter.
Note:
Unlike some other languages, R does not support multi-line
comments.
In R, you can annotate your code with comments. Just preface the line with a hash
mark (#).
eg:
# - It is a comment in R
> 1+1 # This works out the result of one plus one!
[1] 2
Example:
# print string
> print("Welcome to R Programming")
[1] "Welcome to R Programming"
# print variable
> x <- 100
> print(x)
[1] 100
Note:
The [1] at the start of each output line is the index of the first element
of the output vector shown on that line.
Keyboard Shortcuts
Ctrl+Shift+N — New document
Ctrl+O — Open document
Ctrl+S — Save active document
Ctrl+1 — Move to the Source Editor
Ctrl+2 — Move to the Console
Ctrl+L - clear console
R - Data Types
Data types refer to the format in which data is stored in a program. In
any programming language, you use
variables to store information. Variables are reserved memory
locations that store values.
Integer Datatype
R supports integer data types which are the set of all integers. You can use the
capital ‘L’ notation as a suffix to denote that a particular value is an integer
datatype.
eg:
10L,20L,etc.
Logical Datatype
R has logical data types that take either a value of TRUE or FALSE.
eg:
x=TRUE
y=FALSE
Complex Datatype
R supports the complex data type, which covers all complex numbers. A
complex number has a real and an imaginary component. For example, 2+3i is a
complex number, where 2 is the real component and 3i is the imaginary component;
3i is equal to √−9 (√9 × √−1 = 3i).
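The parts of a complex number can be inspected with the base R functions Re() and Im(). The short sketch below also checks the √−9 = 3i statement; the variable names are illustrative only.

```r
z <- 2 + 3i

print(Re(z))  # real component
print(Im(z))  # imaginary component

# sqrt(-9) on an ordinary number gives NaN with a warning,
# so the value is converted to complex first
root <- sqrt(as.complex(-9))
print(root)
```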
Character Datatype
R supports the character data type, which covers letters, digits and special
characters. It stores character values or strings. A string can be written in either
single quotes (or) double quotes.
eg:
ch = 'R'
st="Welcome to R Programming"
typeof()
typeof() is a function used to find the data type of a value in R
programming.
Syntax: typeof(x)
Parameters: x: the value or object to inspect
eg: data_test.R
typeof(100)
typeof(12.8)
typeof('R')
typeof("Welcome to R Programming")
typeof(4 + 3i)
Output:
[1] "double"
[1] "double"
[1] "character"
[1] "character"
[1] "complex"
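The example above does not cover the integer and logical types. A small additional sketch, with illustrative variable names, shows that a number is stored as a double unless the L suffix is used:

```r
t_int <- typeof(10L)   # L suffix: stored as an integer
t_dbl <- typeof(10)    # no suffix: stored as a double
t_lgl <- typeof(TRUE)  # logical value

print(c(t_int, t_dbl, t_lgl))
```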
scan()
The scan() function reads input values continuously. To terminate the
input process, press the Enter key twice at the console.
syntax:
x=scan()
eg:
scan.R
print("Enter values press Enter key 2 times to stop")
x = scan()
print(x)
output1:
[1] "Enter values press Enter key 2 times to stop"
1: 1 2 3
4: 4 5 6
7: 7 8 9
10:
Read 9 items
[1] 1 2 3 4 5 6 7 8 9
output2:
[1] "Enter values press Enter key 2 times to stop"
1: 1
2: 2
3: 3
4:
Read 3 items
[1] 1 2 3
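When no keyboard is available (for example when trying scan() in a non-interactive script), the same function can read from a string through its text argument. This is a sketch for experimenting with scan(), not a replacement for the interactive usage shown above; quiet = TRUE only suppresses the "Read n items" message.

```r
# Read numbers from a string instead of the console
values <- scan(text = "1 2 3 4 5 6", quiet = TRUE)
print(values)

# By default scan() expects numbers; use what = character() to read words
words <- scan(text = "red green yellow", what = character(), quiet = TRUE)
print(words)
```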
R Script File
Usually, you will write your programs
in script files and then execute those scripts at your
command prompt with the help of the R interpreter called Rscript.
A script is an ordinary text file with the .R extension, for example
test.R.
Executing Code
RStudio supports the direct execution of code from within the
source editor. The output is displayed in the RStudio console.
Executing a Single Line
To execute the line of source code where the cursor currently
resides, press Ctrl+Enter or use the Run toolbar
button.
After executing the line of code, RStudio automatically moves the
cursor to the next line.
Executing Multiple Lines
There are three ways to execute multiple lines from within the
editor:
1. Select the lines and press Ctrl+Enter (or) use
the Run toolbar button.
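cat()
The cat() function prints its arguments one after another, separated by a
space, without the [1] index prefix.
syntax:
cat(value/variable1, value/variable2,...)
eg:
> n=10
> cat("Value of n =", n)
Value of n = 10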
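The cat() syntax above can be tried as follows. The sep argument and the "\n" newline are standard features of cat(); capture.output() is used here only so that the printed text can be stored and inspected.

```r
n <- 10

# Arguments are separated by a single space by default;
# "\n" ends the line explicitly because cat() does not add a newline
cat("Value of n =", n, "\n")

# sep changes the separator placed between the arguments
cat("a", "b", "c", sep = "-")
cat("\n")

# Capture what cat() prints as a character string
line <- capture.output(cat("Value of n =", n))
print(line)
```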
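readline()
The readline() function reads one line of input from the console and
returns it as a character string.
Syntax:
var = readline(prompt)
eg1:
var = readline()
print(var)
eg2:
var = readline("Enter data :")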
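as. functions in R
readline() always returns a character string. To convert the input value to
the desired data type, R provides conversion functions such as
as.integer(), as.numeric() and as.character().
eg:
script
var = readline("Enter data :")
var = as.integer(var)
print(var)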
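The conversion functions can be tried without keyboard input by applying them to fixed strings. The sketch below uses illustrative variable names and shows two details worth knowing: text that is not a number converts to NA (with a warning), and as.integer() drops the fractional part instead of rounding.

```r
count <- as.integer("42") + 1L   # character to integer, then arithmetic
ratio <- as.numeric("3.14")      # character to double
label <- as.character(100)       # number to character

# Text that cannot be parsed becomes NA; the warning is silenced here
bad <- suppressWarnings(as.integer("abc"))

# The fractional part is truncated, not rounded
whole <- as.integer(3.9)

print(list(count, ratio, label, bad, whole))
```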
R – Objects
There are 6 basic types of objects in the R language
1. Vectors
2. Lists
3. Matrices
4. Arrays
5. Factors
6. Data Frames
Vectors
A vector is a sequence of elements of the same type. When you want to
create a vector with more than one element,
use the c() function, which combines the
elements into a vector.
Create a vector
syntax:
var <- c(element-1,element-2,.....,element-n)
eg:
script
apple <- c('red','green',"yellow")
print(apple)
output:
[1] "red" "green" "yellow"
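A few common vector operations are sketched below with an illustrative `marks` vector. Indexing in R starts at 1, arithmetic is applied to every element, and mixing types inside c() converts all elements to a common type.

```r
marks <- c(70, 85, 90)

n_marks <- length(marks)  # number of elements
second  <- marks[2]       # indexing starts at 1
bonus   <- marks + 5      # the 5 is added to every element

# Mixing types: all elements are converted to character here
mixed <- c(1, "a", TRUE)

print(bonus)
print(mixed)
```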
Lists
Lists are R objects that can contain elements of
different types, such as numbers, strings, vectors and
even another list. A list is created using the list() function.
Create a list containing a string, a numeric vector, a
logical value and a number.
script(lst1.R)
lst <- list("WELCOME", c(100,200), TRUE, 51.23)
print(lst)
o/p:
[[1]]
[1] "WELCOME"
[[2]]
[1] 100 200
[[3]]
[1] TRUE
[[4]]
[1] 51.23
script(lst1.R)
vec <- c(3,4,5,6)
char_vec<-c("C","JAVA","PYTHON","R")
logic_vec<-c(TRUE,FALSE,FALSE,TRUE)
out_list<-list(vec,char_vec,logic_vec)
print(out_list)
Output:
[[1]]
[1] 3 4 5 6
[[2]]
[1] "C" "JAVA" "PYTHON" "R"
[[3]]
[1] TRUE FALSE FALSE TRUE
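List elements can also be given names, which makes them easier to access. The `student` list below is a made-up example; `$` and `[[ ]]` are the standard ways to extract a single element.

```r
# Hypothetical record mixing a string, a numeric vector and a logical
student <- list(name = "Ravi", marks = c(80, 90), passed = TRUE)

student_name <- student$name            # access by name with $
second_mark  <- student[["marks"]][2]   # [[ ]] extracts the element itself
n_elements   <- length(student)         # number of top-level elements

print(student_name)
print(second_mark)
```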
Matrices
Matrices are R objects in which the elements are
arranged in a two-dimensional rectangular layout. They
contain elements of the same atomic type. Matrices
containing numeric elements are most commonly used in
mathematical calculations.
Syntax
The basic syntax for creating a matrix in R is:
matrix(data, nrow, ncol, byrow, dimnames)
eg:
> m <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
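eg:
rownames <- c("row1", "row2", "row3")
colnames <- c("col1", "col2", "col3")
m <- matrix(c(3:11), nrow = 3, byrow = TRUE, dimnames = list(rownames,
colnames))
print(m)
o/p:
     col1 col2 col3
row1    3    4    5
row2    6    7    8
row3    9   10   11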
A = matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9),nrow = 3,
ncol = 3, byrow = TRUE )
# Naming rows
rownames(A) = c("A", "B", "C")
# Naming columns
colnames(A) = c("X", "Y", "Z")
print(A)
Output:
X Y Z
A 1 2 3
B 4 5 6
C 7 8 9
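Once a matrix exists, its size and elements can be inspected as sketched below. The matrix `M` is an illustrative example filled row by row; `dim()`, `t()` and `[row, column]` indexing are base R features.

```r
M <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

size      <- dim(M)      # number of rows and columns
element   <- M[2, 3]     # row 2, column 3
first_col <- M[, 1]      # leaving the row index empty selects a whole column
swapped   <- t(M)[1, 2]  # transpose swaps rows and columns

print(size)
print(first_col)
```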
Diagonal matrix:
A diagonal matrix is a matrix in which the entries
outside the main diagonal are all zero.
Syntax: diag(data, m, n)
Parameters:
data: diagonal elements
m: number of rows
n: number of columns
Example:
dm <- diag(c(5, 3, 3), 3, 3)
print(dm)
Output:
     [,1] [,2] [,3]
[1,]    5    0    0
[2,]    0    3    0
[3,]    0    0    3
Arrays
Arrays are data storage structures defined by a fixed number
of dimensions. Array elements are stored at contiguous
memory locations. One-dimensional arrays are called vectors.
Two-dimensional arrays are called matrices. All elements of an array
have the same data type.
Creating an Array
The array() function is used to create an array.
Syntax:
array(data, dim = c(nrow, ncol, nmat),
dimnames = names)
where,
data: the input vector that supplies the elements
dim: the number of rows, columns and matrices
dimnames: optional names for each dimension
Example
The following example creates an array with 3 rows and 3 columns.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
result <- array(c(vector1,vector2),dim = c(3,3))
print(result)
output
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
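The nmat part of dim comes into play when a third dimension is given. The sketch below, using illustrative data 1 to 12, builds two 2 x 3 matrices stacked in one array and reads elements with three indices.

```r
# 2 rows, 3 columns, 2 matrices; values are filled column by column
arr <- array(1:12, dim = c(2, 3, 2))

arr_dim <- dim(arr)
value   <- arr[1, 2, 2]  # row 1, column 2 of the second matrix
first   <- arr[, , 1]    # the whole first matrix

print(value)
print(first)
```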
R - Factors
Factors are data objects used to categorize data
and store it as levels. When a data frame is created with
stringsAsFactors = TRUE, R treats text columns as categorical data and
creates factors from them.
Creating a Factor in R
Factors are created using the factor() function by taking a vector as
input.
Syntax:
factor(data, levels)
fc1.R
data <- c("East","West","East","North","South","East",
"West","South","West","East","North")
factor_data <- factor(data)
print(factor_data)
output:
[1] East West East North South East West South West East North
Levels: East North South West
data <- c("East","West","East","North","South","East",
"West","South","West","East","North")
factor_data <- factor(data)
print(factor_data)
new_order_data <- factor(data,levels=c("East","West","North","South"))
print(new_order_data)
output:
[1] East West East North South East West South West East North
Levels: East North South West
[1] East West East North South East West South West East North
Levels: East West North South
output:
[1] East West East North North East West West West East North
Levels: East North West
[1] East West East North North East West West West East North
Levels: East West North
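A factor can be summarised with a few base R functions. The sketch below reuses the direction data from fc1.R; `levels()`, `nlevels()` and `table()` are standard, and `as.integer()` reveals the level codes that R stores internally.

```r
data <- c("East", "West", "East", "North", "South", "East",
          "West", "South", "West", "East", "North")
factor_data <- factor(data)

lvls   <- levels(factor_data)      # distinct categories, sorted alphabetically
counts <- table(factor_data)       # how often each level occurs
codes  <- as.integer(factor_data)  # internal integer code of each element

print(counts)
print(codes)
```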
R Data Frames
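A data frame is a table in which each column holds the values of one
variable and each row holds one record.
Creating a data frame
The data.frame() function is used to create a data frame.
syntax:
data.frame(vector-1,vector-2,...,vector-n,stringsAsFactors=Logical)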
df1.R
emp_id = c (100,200,300,400,500)
emp_name = c("Hari","Ravi","Raju","Gopi","Vasu")
salary = c(30000.00,50000.00,20000.00,25000.00,15000.00)
emp_data <- data.frame(emp_id,emp_name,salary)
print(emp_data)
Output
emp_id emp_name salary
1 100 Hari 30000
2 200 Ravi 50000
3 300 Raju 20000
4 400 Gopi 25000
5 500 Vasu 15000
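After a data frame is created, columns and rows can be selected as sketched below. The data is the same as in df1.R; the threshold of 20000 is an arbitrary value chosen for illustration, and stringsAsFactors = FALSE is set explicitly so that names stay character strings in every R version.

```r
emp_id   <- c(100, 200, 300, 400, 500)
emp_name <- c("Hari", "Ravi", "Raju", "Gopi", "Vasu")
salary   <- c(30000, 50000, 20000, 25000, 15000)
emp_data <- data.frame(emp_id, emp_name, salary, stringsAsFactors = FALSE)

# A column is accessed with $
avg_salary <- mean(emp_data$salary)

# Rows are filtered with a logical condition before the comma
high_paid <- emp_data[emp_data$salary > 20000, ]

print(avg_salary)
print(high_paid)
```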
is. functions in R
The is. functions test whether an object is of a specified type. These
functions return a logical value (TRUE/FALSE).
1.is.integer
3. is.character
eg:
is.complex()
is.vector()
is.list()
is.array()
is.matrix()
is.factor()
is.data.frame()
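The functions listed above can be tried on small values, as in the sketch below. Note that 10 without the L suffix is a double, so is.integer(10) is FALSE.

```r
checks <- c(
  int_with_L    = is.integer(10L),
  int_without_L = is.integer(10),
  character     = is.character("R"),
  complex       = is.complex(2 + 3i),
  vector        = is.vector(c(1, 2, 3)),
  list          = is.list(list(1, "a")),
  matrix        = is.matrix(matrix(1:4, nrow = 2)),
  factor        = is.factor(factor("a")),
  data_frame    = is.data.frame(data.frame(x = 1))
)
print(checks)
```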
class function in R
The class() function in R returns the class of an R object.
Syntax
class(x)
x: the R object to inspect
eg:
print(class(100))
print(class(100L))
print(class('A'))
print(class("WELCOME"))
print(class(2+3i))
output:
[1] "numeric"
[1] "integer"
[1] "character"
[1] "character"
[1] "complex"
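For simple values class() and typeof() give similar answers, but they differ for structured objects: class() reports how the object behaves, while typeof() reports how it is stored. A short sketch with illustrative objects:

```r
f  <- factor(c("a", "b"))
df <- data.frame(x = 1:2)

f_class  <- class(f)    # behaves as a factor
f_type   <- typeof(f)   # stored as integer codes
df_class <- class(df)   # behaves as a data frame
df_type  <- typeof(df)  # stored as a list of columns

print(c(f_class, f_type, df_class, df_type))
```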
R - Operators
Operators are symbols used to perform various kinds of
operations on operands. Operators carry out
mathematical, logical, and decision operations on
numeric, integer, and complex values as input operands.
Types of Operators
We have the following types of operators in R programming
1. Arithmetic Operators
2. Relational Operators
3. Logical Operators
4. Assignment Operators
Arithmetic Operators
Arithmetic operations simulate various mathematical operations,
like addition, subtraction, multiplication, division, and modulus using the
specified operator between operands. The operations are performed
element-wise at the corresponding positions of the vectors.
+ Addition
eg:
> 2+3
[1] 5
- Subtraction
eg:
> 8-5
[1] 3
* Multiplication
eg:
> 2*5
[1] 10
/ Division
eg:
> 10/3
[1] 3.333333
^ power
eg:
> 2^3
[1] 8
%% Modulus :
eg:
> 10%%3
[1] 1
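The examples above use single numbers. The element-wise behaviour mentioned earlier becomes visible with vectors, as in this sketch with illustrative values; `%/%` (integer division) is an additional base R operator not listed above.

```r
x <- c(10, 20, 30)
y <- c(3, 4, 5)

total     <- x + y    # 10+3, 20+4, 30+5
remainder <- x %% y   # remainder of each division
quotient  <- x %/% y  # integer part of each division

print(total)
print(remainder)
print(quotient)
```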
Script
Write an R script to enter any two integers and perform all arithmetic
operations.
ari.R
print("Enter any two integer values")
a <- as.integer(readline())
b <- as.integer(readline())
cat("Addition :",a+b)
cat("\nSubtraction :",a-b)
cat("\nMultiplication :",a*b)
cat("\nDivision :",a/b)
cat("\nPower :",a^b)
cat("\nModulus :",a%%b)
output (for the inputs 10 and 3):
Addition : 13
Subtraction : 7
Multiplication : 30
Division : 3.333333
Power : 1000
Modulus : 1
Relational Operators
These are used to test the relation between the operands. These
operators return a logical value (TRUE or FALSE).
eg:
> 10<20
[1] TRUE
> 5==10
[1] FALSE
> 5!=10
[1] TRUE
> 10<=20
[1] TRUE
Logical Operators
These are used to combine the results of two or more expressions or values. These
operators return a logical value, TRUE or FALSE.
Logical Not(!)
It takes the value of the operand and gives the opposite logical value. For a
number, zero is treated as FALSE and any non-zero value as TRUE.
eg1:
> n=10
> !n
[1] FALSE
> n=0
> !n
[1] TRUE
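Besides `!`, R provides AND and OR operators. The sketch below uses illustrative vectors: `&` and `|` work element by element, while `&&` and `||` combine two single values, which is the form normally used in conditions.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

both   <- a & b  # TRUE only where both elements are TRUE
either <- a | b  # TRUE where at least one element is TRUE
negate <- !a     # opposite of each element

single_and <- (5 > 3) && (2 > 4)  # one value: TRUE and FALSE
single_or  <- (5 > 3) || (2 > 4)  # one value: TRUE or FALSE

print(both)
print(either)
```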
Assignment Operators
Assignment operators are used to assign values to various data objects in R.
There are two kinds of assignment operators
1. Left Assignment (<- or <<- or =):
eg:
n <- 10
(or)
n=10
(or)
n <<- 10
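The difference between `<-` and `<<-` only shows up inside a function. In the sketch below (the `counter` and `increment` names are illustrative), `<<-` changes a variable that already exists outside the function, whereas `<-` would only create a local copy. R also has right assignment with `->`, shown on the first line.

```r
10 -> m  # right assignment: the value is on the left, the name on the right

counter <- 0

increment <- function() {
  # <<- updates the existing counter outside the function
  counter <<- counter + 1
  invisible(counter)
}

increment()
increment()
print(counter)
```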