
Unit I

The document discusses data science, including what it is, how it works, and advice for new data science students. It covers topics like problem formulation, data collection and cleaning, data analysis and exploration, data modeling, and model optimization and deployment. It also provides an introduction to basic data analytics using R.

Uploaded by

malai54215421

SETHU INSTITUTE OF TECHNOLOGY

(An Autonomous Institution)


Pulloor, Kariapatti, Virudhunagar (Dist.) -Pin: 626 115.

Department of Artificial Intelligence and Data Science

UNIT I INTRODUCTION AND R FOR DATA SCIENCE 9


Introduction of Data Science – Basic Data Analytics using R – Overview of R – R data types
variables – operators – Decision making – loops – functions – strings – vector – list – Matrices –
Arrays – Factors – data frames – packages – data Reshaping – R data interfaces – R charts and
graphs – R statistics Example – Data Versus Presentation
Prepared by: M.Senthilkumar, AP/CSE

Introduction of Data Science:

Organizations now deal with petabytes and exabytes of data, and as the era of Big
Data emerged, so did the need to store it. Until around 2010, storing data was a
great challenge and concern for industry. Once frameworks such as Hadoop solved
the problem of storage, the focus shifted to processing the data, and this is where
Data Science plays a big role. Many of the ideas in the sci-fi movies you love to
watch can become reality through Data Science. The field has grown in many
directions, so it is worth preparing for the future by learning what it is and how
we can add value to it. At this point you may have many questions: What is Data
Science? Why do we need it? How can I become a Data Scientist? Let us clear up
this confusion.
Data science is a field that involves using statistical and computational techniques
to extract insights and knowledge from data. It encompasses a wide range of tasks,
including data cleaning and preparation, data visualization, statistical modeling,
machine learning, and more. Data scientists use these techniques to discover
patterns and trends in data, make predictions, and support decision-making. They
may work with a variety of data types, including structured data (such as numbers
and dates in a spreadsheet) and unstructured data (such as text, images, or
audio). Data science is used in a wide range of industries, including finance,
healthcare, retail, and more.
What is Data Science?
Data Science blends various tools, algorithms, and machine learning principles.
Most simply, it involves obtaining meaningful information or insights from
structured or unstructured data through a combination of analysis, programming
and business skills. It is a field that draws on many elements such as
mathematics, statistics, and computer science. Those who are good at these fields
and know enough about the domain in which they want to work can call themselves
Data Scientists. It is not easy, but it is not impossible either. You need to work
through the data, its visualization, programming, and the formulation,
development, and deployment of your model. In the future, data scientist jobs are
expected to be in great demand, so prepare yourself with that in mind.
Data science is a field that involves using statistical and computational techniques
to extract insights and knowledge from data. It is a multi-disciplinary field that
encompasses aspects of computer science, statistics, and domain-specific
expertise. Data scientists use a variety of tools and methods, such as machine
learning, statistical modeling, and data visualization, to analyze and make
predictions from data. They work with both structured and unstructured data, and
use the insights gained to inform decision making and support business
operations. Data science is applied in a wide range of industries, including finance,
healthcare, retail, and more. It helps organizations to make data-driven decisions
and gain a competitive advantage.
How Does Data Science Work?
Data science is not a one-step process that you can learn in a short time before
calling yourself a Data Scientist. It passes through many stages, and every stage
is important. Follow the steps in order, as you would climb a ladder: every step
has its value and counts towards your model. The steps are described below.
• Problem Statement: No work starts without motivation, and data science is no
exception. It is really important to formulate your problem statement clearly
and precisely, because your whole model and the way it works depend on it.
Many scientists consider this the main and most important step of Data
Science. So make sure you know what your problem statement is and how well it
can add value to a business or any other organization.
• Data Collection: After defining the problem statement, the next obvious step is to
go in search of the data that you might require for your model. You must do good
research and find all that you need. Data can be in any form, i.e. unstructured or
structured. It might come as videos, spreadsheets, coded forms, etc. You must
collect data from all these kinds of sources.
• Data Cleaning: Once you have formulated your goal and collected your data, the
next step is cleaning. Yes, it is! Data cleaning is a favorite task of data
scientists. It is all about the removal of missing, redundant, unnecessary and
duplicate data from your collection. Various tools help with this through
programming in either R or Python, and it is entirely up to you to choose one of
them; scientists have differing opinions on which to choose. For the statistical
part, R is often preferred over Python, as it has more than 12,000 packages.
Python is used because it is fast and easily accessible, and with the help of
various packages it can do the same things as R.
• Data Analysis and Exploration: This is one of the prime tasks in data science, and
the time to bring out your inner Holmes. It is about analyzing the structure of
the data, finding hidden patterns in it, studying behaviors, visualizing the
effect of one variable on others, and then drawing conclusions. We can explore
the data with various graphs produced by libraries in any programming language.
In R, ggplot2 is one of the best-known plotting packages, while Python has
Matplotlib.
• Data Modeling: Once you have studied the data through visualization, you must
start building a hypothesis model that can yield good predictions in the future.
Here you must choose an algorithm that best fits your problem. There are
different kinds of algorithms, from regression to classification, SVM (support
vector machines), clustering, etc. Your model can be a Machine Learning
algorithm. You train your model with the training data and then test it with the
test data. There are various methods to do so. The simplest is a hold-out split,
where you divide the whole data into two parts, one for training and the other
for testing. Another is the K-fold method, where the data is divided into K parts
and each part in turn serves as the test data while the model is trained on the
rest.
• Optimization and Deployment: You have followed every step and built a model that
you feel is the best fit. But how can you decide how well your model is
performing? This is where optimization comes in. You test the model and measure
how well it performs, for example by checking its accuracy. In short, you check
the efficiency of the model and try to optimize it for more accurate predictions.
Deployment deals with launching your model and letting people outside benefit
from it. You can also obtain feedback from organizations and people to learn
their needs and then work further on your model.
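The train-and-test idea from the Data Modeling step can be sketched in base R. This is only an illustration under stated assumptions, not part of the original notes: it uses the built-in mtcars data set and a simple linear model (lm) as stand-ins, and the fold count k = 5 is an arbitrary choice.

```r
# Illustrative K-fold cross-validation in base R.
# Assumptions: built-in mtcars data, a linear model, k = 5 folds.
set.seed(1)                                  # repeatable random fold assignment
k <- 5
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))    # a fold label (1..k) for every row

rmse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]              # k - 1 folds used for training
  test  <- mtcars[folds == i, ]              # the remaining fold used for testing
  model <- lm(mpg ~ wt, data = train)        # fit on the training rows only
  pred  <- predict(model, newdata = test)    # predict the held-out rows
  rmse[i] <- sqrt(mean((test$mpg - pred)^2)) # test error for this fold
}
print(mean(rmse))                            # average test error over the k folds
```

Averaging the error over all folds gives a steadier estimate of model quality than a single train/test split, at the cost of fitting the model k times.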
Advice for new data science students
• Curiosity: If you are not curious, you will not know what to do with the data.
• Judgmental: If you do not have preconceived notions about things, you will not
know where to begin.
• Argumentative: If you can argue and plead a case, you can at least start
somewhere; then you can learn from the data and modify your assumptions.
• Start by gaining a solid understanding of the basics of programming, statistics,
and linear algebra.
• Learn the tools of the trade such as Python, R, and SQL. Familiarize yourself
with the most popular libraries and frameworks like numpy, pandas, and
scikit-learn.
• Practice, practice, practice. Participate in online coding challenges and
hackathons to improve your skills and gain experience.
• Learn the basics of machine learning and familiarize yourself with the most
popular algorithms.
• Read research papers and stay up-to-date with the latest developments in the
field.
• Learn how to communicate your findings effectively. Being able to present your
work in a clear and compelling way is just as important as the technical skills
you possess.
• Build a portfolio of projects that showcase your skills and experience.
• Network with other data scientists and professionals in the field. Attend
meetups and conferences, and connect with people on LinkedIn.
• Be curious, and don't be afraid to ask questions.
• Finally, don't be discouraged if you encounter challenges or roadblocks along
the way. Learning to become a data scientist is a journey, and it takes time,
effort, and dedication to succeed.

Basic Data Analytics using R:

Data analysis is a subset of data analytics. It is a process in which the objective is
made clear, the relevant data is collected and preprocessed, the analysis is performed
(understanding the data and exploring insights), and the results are visualized. The
last step, visualization, is important for helping people understand what is happening
in the firm.

Steps involved in data analysis:

The process of data analysis includes all of these steps for the given problem
statement. Example: analyze the products that are rapidly selling out and the
details of frequent customers of a retail shop.
• Defining the problem statement – Understand the goal and what needs to be
done. In this case, our problem statement is: "Find the products that sell out
most often and list the customers who often visit the store."
• Collection of data – Not all of the company's data is necessary; identify the
data relevant to the problem. Here the required columns are product ID,
customer ID, and date visited.
• Preprocessing – Cleaning the data is mandatory to put it in a structured format
before performing analysis.
1. Remove outliers (noisy data).
2. Remove null or irrelevant values in the columns (for example, change null
values to the mean value of that column).
3. If there is any missing data, either ignore the tuple or fill it with the mean
value of the column.
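The preprocessing steps above can be sketched in base R as follows. The data frame, its column names and its values are hypothetical, made up purely for illustration, and the 1.5 × IQR rule is one common way of flagging outliers rather than the method prescribed by these notes.

```r
# Hypothetical retail data; column names and values are made up for illustration.
sales <- data.frame(
  product_id  = c(101, 102, 101, 103, 102),
  customer_id = c(1, 2, 1, 3, 2),
  amount      = c(20, NA, 25, 1000, 22)
)

# 1. Remove outliers using the 1.5 * IQR rule (an assumed, commonly used choice).
q     <- quantile(sales$amount, c(0.25, 0.75), na.rm = TRUE)
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
# Keep rows that are missing (handled next) or inside the bounds.
keep  <- is.na(sales$amount) | (sales$amount >= lower & sales$amount <= upper)
clean <- sales[keep, ]

# 2. Fill the remaining missing values with the mean of the column.
clean$amount[is.na(clean$amount)] <- mean(clean$amount, na.rm = TRUE)
print(clean)
```

Outliers are removed before the mean is computed so that an extreme value such as 1000 does not distort the value used to fill the gaps.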
Data Analysis using the Titanic dataset
You can download the Titanic dataset (it contains data on real passengers of the
Titanic) from here. Save the dataset in the current working directory; we can then
start the analysis (getting to know our data).

titanic=read.csv("train.csv")
head(titanic)

Output:
  PassengerId Survived Pclass                             Name    Sex
1         892        0      3                 Kelly, Mr. James   male
2         893        1      3 Wilkes, Mrs. James (Ellen Needs) female
3         894        0      2        Myles, Mr. Thomas Francis   male
4         895        0      3                 Wirz, Mr. Albert   male
   Age SibSp Parch Ticket   Fare Cabin Embarked
1 34.5     0     0 330911 7.8292              Q
2 47.0     1     0 363272 7.0000              S
3 62.0     0     0 240276 9.6875              Q
4 27.0     0     0 315154 8.6625              S
Our dataset contains columns such as the name, age and gender of each passenger,
the class they traveled in, and whether they survived. To find the class (data
type) of each column, the sapply() function can be used.
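As a minimal sketch of that check, the code below applies sapply() with class to a small stand-in data frame. The data frame passengers is hypothetical (a few values copied from the output above) so that the example runs without train.csv; with the real file, the same call would be sapply(titanic, class).

```r
# Small stand-in for the Titanic data so the example runs without train.csv.
passengers <- data.frame(
  Name   = c("Kelly, Mr. James", "Wirz, Mr. Albert"),
  Age    = c(34.5, 27),
  Pclass = c(3L, 3L),
  stringsAsFactors = FALSE     # keep text columns as character
)

# Apply class() to every column; the result is a named character vector.
col_classes <- sapply(passengers, class)
print(col_classes)
```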

Overview of R:

R is a programming language and software environment for statistical analysis,
graphics representation and reporting. R was created by Ross Ihaka and Robert
Gentleman at the University of Auckland, New Zealand, and is currently developed
by the R Development Core Team.
The core of R is an interpreted computer language which allows branching and
looping as well as modular programming using functions. For efficiency, R allows
integration with procedures written in the C, C++, .Net, Python or FORTRAN
languages.
R is freely available under the GNU General Public License, and pre-compiled
binary versions are provided for various operating systems like Linux, Windows
and Mac.
R is free software distributed under a GNU-style copyleft, and an official part of
the GNU project called GNU S.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of
Statistics of the University of Auckland in Auckland, New Zealand. R made its first
appearance in 1993.
• A large group of individuals has contributed to R by sending code and bug
reports.
• Since mid-1997 there has been a core group (the "R Core Team") who can
modify the R source code archive.
Features of R
As stated earlier, R is a programming language and software environment for
statistical analysis, graphics representation and reporting. The following are the
important features of R −
• R is a well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions and input and
output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists, vectors and
matrices.
• R provides a large, coherent and integrated collection of tools for data
analysis.
• R provides graphical facilities for data analysis and display, either directly on
the computer screen or printed on paper.

R data types Variables:

Variables are nothing but reserved memory locations to store values. This means
that, when you create a variable you reserve some space in memory.
You may like to store information of various data types like character, wide
character, integer, floating point, double floating point, Boolean etc. Based on the
data type of a variable, the operating system allocates memory and decides what
can be stored in the reserved memory.
In contrast to other programming languages like C and Java, variables in R are
not declared with a data type. Variables are assigned R-objects, and the data type
of the R-object becomes the data type of the variable. The frequently used
R-objects are:
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames

The simplest of these objects is the vector object and there are six data types of
these atomic vectors, also termed as six classes of vectors. The other R-Objects are
built upon the atomic vectors.

Data Type   Example                              Verify

Logical     TRUE, FALSE                          v <- TRUE
                                                 print(class(v))
                                                 it produces the following result:
                                                 [1] "logical"

Numeric     12.3, 5, 999                         v <- 23.5
                                                 print(class(v))
                                                 it produces the following result:
                                                 [1] "numeric"

Integer     2L, 34L, 0L                          v <- 2L
                                                 print(class(v))
                                                 it produces the following result:
                                                 [1] "integer"

Complex     3 + 2i                               v <- 2+5i
                                                 print(class(v))
                                                 it produces the following result:
                                                 [1] "complex"

Character   'a', "good", "TRUE", '23.4'          v <- "TRUE"
                                                 print(class(v))
                                                 it produces the following result:
                                                 [1] "character"

Raw         "Hello" is stored as 48 65 6c 6c 6f  v <- charToRaw("Hello")
                                                 print(class(v))
                                                 it produces the following result:
                                                 [1] "raw"
In R programming, the most basic data types are the R-objects called vectors, which
hold elements of one of the classes shown above. Please note that in R the number of
classes is not confined to the above six types. For example, we can combine
atomic vectors into an array, whose class is then array.
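A short sketch of that point: the atomic vector below has class "numeric", and wrapping the same values in a three-dimensional array gives an object of class "array", while the underlying storage type is unchanged. The variable names are illustrative only.

```r
# An atomic numeric vector.
v <- c(1.5, 2.5, 3.5, 4.5)
print(class(v))                  # "numeric"

# The same values arranged as a 2 x 2 x 1 array.
a <- array(v, dim = c(2, 2, 1))
print(class(a))                  # "array"
print(typeof(a))                 # "double": the elements are still numeric
```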
Vectors
When you want to create a vector with more than one element, you should
use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)

# Get the class of the vector.


print(class(apple))
When we execute the above code, it produces the following result −
[1] "red" "green" "yellow"
[1] "character"
Lists
A list is an R-object which can contain many different types of elements inside it
like vectors, functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)

# Print the list.


print(list1)
When we execute the above code, it produces the following result −
[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x)  .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector
input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
     [,1] [,2] [,3]
[1,] "a"  "a"  "b"
[2,] "c"  "b"  "a"
Arrays
While matrices are confined to two dimensions, arrays can have any number of
dimensions. The array function takes a dim attribute which creates the required
number of dimensions. In the example below we create an array with two elements,
each of which is a 3x3 matrix.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"
Factors
Factors are R-objects which are created using a vector. A factor stores the vector
along with the distinct values of its elements as labels. The labels are always
character, irrespective of whether the input vector is numeric, character, Boolean,
etc. Factors are useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the
count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object.


factor_apple <- factor(apple_colors)

# Print the factor.


print(factor_apple)
print(nlevels(factor_apple))
When we execute the above code, it produces the following result −
[1] green green yellow red red red green
Levels: green red yellow
[1] 3
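To make the relationship between values and labels concrete, the sketch below reuses the same apple_colors vector and inspects the factor with levels(), as.integer() and table(). These are base R functions; the comments describe what each call returns for this input.

```r
apple_colors <- c('green','green','yellow','red','red','red','green')
factor_apple <- factor(apple_colors)

# The distinct labels, sorted alphabetically by default.
print(levels(factor_apple))       # "green" "red" "yellow"

# Internally each element is stored as an integer index into the levels.
print(as.integer(factor_apple))   # 1 1 3 2 2 2 1

# Frequency of each level, a typical first step in statistical work.
print(table(factor_apple))
```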
Data Frames
Data frames are tabular data objects. Unlike in a matrix, each column of a data
frame can contain a different mode of data. The first column can be numeric while
the second column can be character and the third column can be logical. A data
frame is a list of vectors of equal length.
Data Frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
When we execute the above code, it produces the following result −
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
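The statement that a data frame is a list of equal-length vectors can be checked directly. The sketch below rebuilds the same BMI data frame and uses base R functions (is.list, lengths, the $ operator and logical row indexing); stringsAsFactors = FALSE is added here only so that the gender column is character in every R version.

```r
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age    = c(42, 38, 26),
  stringsAsFactors = FALSE
)

print(is.list(BMI))          # TRUE: a data frame is a list of columns
print(lengths(BMI))          # every column has the same length (3)
print(BMI$height)            # one column, extracted as a plain vector
print(BMI[BMI$Age > 30, ])   # rows selected with a logical condition
```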
R Operators:
An operator is a symbol that tells the interpreter to perform specific mathematical
or logical manipulations. The R language is rich in built-in operators and provides
the following types of operators.
Types of Operators
We have the following types of operators in R programming −
• Arithmetic Operators
• Relational Operators
• Logical Operators
• Assignment Operators
• Miscellaneous Operators
Arithmetic Operators
Following table shows the arithmetic operators supported by R language. The
operators act on each element of the vector.

Operator  Description                          Example

+         Adds two vectors                     v <- c(2, 5.5, 6)
                                               t <- c(8, 3, 4)
                                               print(v+t)
                                               it produces the following result:
                                               [1] 10.0  8.5 10.0

-         Subtracts second vector from         v <- c(2, 5.5, 6)
          the first                            t <- c(8, 3, 4)
                                               print(v-t)
                                               it produces the following result:
                                               [1] -6.0  2.5  2.0

*         Multiplies both vectors              v <- c(2, 5.5, 6)
                                               t <- c(8, 3, 4)
                                               print(v*t)
                                               it produces the following result:
                                               [1] 16.0 16.5 24.0

/         Divide the first vector with the     v <- c(2, 5.5, 6)
          second                               t <- c(8, 3, 4)
                                               print(v/t)
                                               it produces the following result:
                                               [1] 0.250000 1.833333 1.500000

%%        Give the remainder of the first      v <- c(2, 5.5, 6)
          vector with the second               t <- c(8, 3, 4)
                                               print(v%%t)
                                               it produces the following result:
                                               [1] 2.0 2.5 2.0

%/%       The result of division of first      v <- c(2, 5.5, 6)
          vector with second (quotient)        t <- c(8, 3, 4)
                                               print(v%/%t)
                                               it produces the following result:
                                               [1] 0 1 1

^         The first vector raised to the       v <- c(2, 5.5, 6)
          exponent of second vector            t <- c(8, 3, 4)
                                               print(v^t)
                                               it produces the following result:
                                               [1]  256.000  166.375 1296.000

Relational Operators
Following table shows the relational operators supported by R language. Each
element of the first vector is compared with the corresponding element of the
second vector. The result of comparison is a Boolean value.

Operator  Description                          Example

>         Checks if each element of the        v <- c(2,5.5,6,9)
          first vector is greater than the     t <- c(8,2.5,14,9)
          corresponding element of the         print(v>t)
          second vector.                       it produces the following result:
                                               [1] FALSE  TRUE FALSE FALSE

<         Checks if each element of the        v <- c(2,5.5,6,9)
          first vector is less than the        t <- c(8,2.5,14,9)
          corresponding element of the         print(v < t)
          second vector.                       it produces the following result:
                                               [1]  TRUE FALSE  TRUE FALSE

==        Checks if each element of the        v <- c(2,5.5,6,9)
          first vector is equal to the         t <- c(8,2.5,14,9)
          corresponding element of the         print(v == t)
          second vector.                       it produces the following result:
                                               [1] FALSE FALSE FALSE  TRUE

<=        Checks if each element of the        v <- c(2,5.5,6,9)
          first vector is less than or equal   t <- c(8,2.5,14,9)
          to the corresponding element of      print(v<=t)
          the second vector.                   it produces the following result:
                                               [1]  TRUE FALSE  TRUE  TRUE

>=        Checks if each element of the        v <- c(2,5.5,6,9)
          first vector is greater than or      t <- c(8,2.5,14,9)
          equal to the corresponding           print(v>=t)
          element of the second vector.        it produces the following result:
                                               [1] FALSE  TRUE FALSE  TRUE

!=        Checks if each element of the        v <- c(2,5.5,6,9)
          first vector is unequal to the       t <- c(8,2.5,14,9)
          corresponding element of the         print(v!=t)
          second vector.                       it produces the following result:
                                               [1]  TRUE  TRUE  TRUE FALSE

Logical Operators
The following table shows the logical operators supported by the R language. They
are applicable only to vectors of type logical, numeric or complex. Zero is treated
as FALSE and every non-zero number is treated as the logical value TRUE.
Each element of the first vector is combined with the corresponding element of the
second vector. The result is a Boolean value.

Operator  Description                          Example

&         It is called Element-wise Logical    v <- c(3,1,TRUE,2+3i)
          AND operator. It combines each       t <- c(4,1,FALSE,2+3i)
          element of the first vector with     print(v&t)
          the corresponding element of         it produces the following result:
          the second vector and gives an       [1]  TRUE  TRUE FALSE  TRUE
          output TRUE if both the
          elements are TRUE.

|         It is called Element-wise Logical    v <- c(3,0,TRUE,2+2i)
          OR operator. It combines each        t <- c(4,0,FALSE,2+3i)
          element of the first vector with     print(v|t)
          the corresponding element of         it produces the following result:
          the second vector and gives an       [1]  TRUE FALSE  TRUE  TRUE
          output TRUE if one of the
          elements is TRUE.

!         It is called Logical NOT             v <- c(3,0,TRUE,2+2i)
          operator. Takes each element of      print(!v)
          the vector and gives the             it produces the following result:
          opposite logical value.              [1] FALSE  TRUE FALSE FALSE

The logical operators && and || work on single (length-one) values and give a
single logical value as output. Older versions of R silently used only the first
element of longer vectors; since R 4.3.0, passing a vector of length greater than
one is an error, so the examples below select the first element explicitly.

Operator  Description                          Example

&&        Called Logical AND operator.         v <- c(3,0,TRUE,2+2i)
          Takes the first element of both      t <- c(1,3,TRUE,2+3i)
          vectors and gives TRUE only          print(v[1] && t[1])
          if both are TRUE.                    it produces the following result:
                                               [1] TRUE

||        Called Logical OR operator.          v <- c(0,0,TRUE,2+2i)
          Takes the first element of both      t <- c(0,3,TRUE,2+3i)
          vectors and gives TRUE if            print(v[1] || t[1])
          one of them is TRUE.                 it produces the following result:
                                               [1] FALSE
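The difference between the element-wise and the single-value logical operators can be seen side by side in the short sketch below. The vectors x, a and b are illustrative values chosen for this example; as.logical() is used only to show how numbers are converted to logical values.

```r
# How numbers are converted: only 0 becomes FALSE.
x <- c(0, 0.5, 1, -2)
print(as.logical(x))        # FALSE TRUE TRUE TRUE

a <- c(3, 0, TRUE)          # stored as the numbers 3, 0, 1
b <- c(1, 3, FALSE)         # stored as the numbers 1, 3, 0

# Element-wise AND / OR: one result per position.
print(a & b)                # TRUE FALSE FALSE
print(a | b)                # TRUE TRUE TRUE

# Single-value AND / OR: operands must have length one.
print(a[1] && b[1])         # TRUE
print(a[2] || b[3])         # FALSE
```

In conditions such as if (...) the single-value forms && and || are the usual choice, because a condition must evaluate to exactly one TRUE or FALSE.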

Assignment Operators
These operators are used to assign values to vectors.

Operator  Description                          Example

<-        Called Left Assignment               v1 <- c(3,1,TRUE,2+3i)
or                                             v2 <<- c(3,1,TRUE,2+3i)
=                                              v3 = c(3,1,TRUE,2+3i)
or                                             print(v1)
<<-                                            print(v2)
                                               print(v3)
                                               it produces the following result:
                                               [1] 3+0i 1+0i 1+0i 2+3i
                                               [1] 3+0i 1+0i 1+0i 2+3i
                                               [1] 3+0i 1+0i 1+0i 2+3i

->        Called Right Assignment              c(3,1,TRUE,2+3i) -> v1
or                                             c(3,1,TRUE,2+3i) ->> v2
->>                                            print(v1)
                                               print(v2)
                                               it produces the following result:
                                               [1] 3+0i 1+0i 1+0i 2+3i
                                               [1] 3+0i 1+0i 1+0i 2+3i

Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or
logical computation.

Operator  Description                          Example

:         Colon operator. It creates the       v <- 2:8
          series of numbers in sequence        print(v)
          for a vector.                        it produces the following result:
                                               [1] 2 3 4 5 6 7 8

%in%      This operator is used to             v1 <- 8
          identify if an element belongs       v2 <- 12
          to a vector.                         t <- 1:10
                                               print(v1 %in% t)
                                               print(v2 %in% t)
                                               it produces the following result:
                                               [1] TRUE
                                               [1] FALSE

%*%       This operator performs matrix        M = matrix( c(2,6,5,1,10,4), nrow = 2, ncol = 3, byrow = TRUE)
          multiplication. In the example,      t = M %*% t(M)
          a matrix is multiplied with its      print(t)
          transpose.                           it produces the following result:
                                                    [,1] [,2]
                                               [1,]   65   82
                                               [2,]   82  117

Decision making:

Decision making is about deciding the order of execution of statements based on
certain conditions. The programmer provides a condition which is evaluated by the
program, together with statements which are executed if the condition is true and,
optionally, other statements which are executed if the condition is false.

The decision making statements in R are as follows:
• if statement
• if-else statement
• if-else-if ladder
• nested if-else statement
• switch statement
if statement
The keyword if tells the interpreter that this is a decision control instruction. The condition
following the keyword if is always enclosed within a pair of parentheses. If the condition is TRUE,
the statement gets executed; if the condition is FALSE, the statement does not get executed.
Syntax:
if(condition is true){
execute this statement
}
Example:

# R program to illustrate
# if statement
a <- 76
b <- 67

# TRUE condition
if(a > b)
{
c <- a - b
print("condition a > b is TRUE")
print(paste("Difference between a, b is : ", c))
}

# FALSE condition
if(a < b)
{
c <- a - b
print("condition a < b is TRUE")
print(paste("Difference between a, b is : ", c))
}

Output:
[1] "condition a > b is TRUE"
[1] "Difference between a, b is :  9"
if-else statement
If-else provides an optional else block which gets executed if the condition of the if block is
false. If the condition provided to the if block is true, the statement within the if block gets
executed; otherwise the statement within the else block gets executed.
Syntax:
if(condition is true) {
execute this statement
} else {
execute this statement
}
Example :

# R if-else statement Example


a <- 67
b <- 76
# This example will execute else block
if(a > b)
{
c <- a - b
print("condition a > b is TRUE")
print(paste("Difference between a, b is : ", c))
} else
{
c <- a - b
print("condition a > b is FALSE")
print(paste("Difference between a, b is : ", c))
}

Output:
[1] "condition a > b is FALSE"
[1] "Difference between a, b is :  -9"
if-else-if ladder
It is similar to the if-else statement; the only difference is that an if statement is attached to
the else. If the condition provided to the if block is true, the statement within the if block gets
executed. Otherwise the next condition (else if) is checked, and if it is true, the statement within
that block gets executed.
Syntax:
if(condition 1 is true) {
execute this statement
} else if(condition 2 is true) {
execute this statement
} else {
execute this statement
}
Example :

# R if-else-if ladder Example


a <- 67
b <- 76
c <- 99
if(a > b && b > c)
{
print("condition a > b > c is TRUE")
} else if(a < b && b > c)
{
print("condition a < b > c is TRUE")
} else if(a < b && b < c)
{
print("condition a < b < c is TRUE")
}

Output:
[1] "condition a < b < c is TRUE"
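The example above has no final else, so nothing would be printed if none of the conditions held. The sketch below shows a ladder that ends with a plain else as a catch-all. The function name compare_values and its messages are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper: classify how a compares with b.
compare_values <- function(a, b) {
  if (a > b) {
    "a is greater than b"
  } else if (a < b) {
    "a is less than b"
  } else {
    "a is equal to b"      # reached only when both conditions above are FALSE
  }
}

print(compare_values(67, 76))
print(compare_values(76, 67))
print(compare_values(5, 5))
```

Because if is an expression in R, the value of the branch that runs becomes the return value of the function.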
Nested if-else statement
When an if-else block appears as a statement within an if block, or optionally within an else block,
it is called a nested if-else statement. When the parent if condition is true, the child if
condition is evaluated, and if the child condition is false, its else statement is executed; all of
this happens within the parent if block. If the parent if condition is false, the parent else block
is executed, which may also contain a child if-else statement.
Syntax:
if(parent condition is true) {
if( child condition 1 is true) {
execute this statement
} else {
execute this statement
}
} else {
if(child condition 2 is true) {
execute this statement
} else {
execute this statement
}
}
Example:

# R Nested if else statement Example
a <- 10
b <- 11
if(a == 10)
{
  if(b == 10)
  {
    print("a:10 b:10")
  } else
  {
    print("a:10 b:11")
  }
} else
{
  # The child condition tests b, matching the messages printed below
  if(b == 10)
  {
    print("a:11 b:10")
  } else
  {
    print("a:11 b:11")
  }
}

Output:
[1] "a:10 b:11"

Switch statement
In the switch function, the expression is matched against a list of cases. If a match is found, the
value of that case is returned. The examples below do not supply a default case, so if no case is
matched, switch returns NULL, as shown in the last example.
Syntax:
switch (expression, case1, case2, case3,…,case n )
Example:

# R switch statement example

# Expression in terms of the index value


x <- switch(
2, # Expression
"Geeks1", # case 1
"for", # case 2
"Geeks2" # case 3
)
print(x)
# Expression in terms of the string value
y <- switch(
"GfG3", # Expression
"GfG0"="Geeks1", # case 1
"GfG1"="for", # case 2
"GfG3"="Geeks2" # case 3
)
print(y)

z <- switch(
"GfG", # Expression
"GfG0"="Geeks1", # case 1
"GfG1"="for", # case 2
"GfG3"="Geeks2" # case 3
)
print(z)

Output:
[1] "for"
[1] "Geeks2"
NULL
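When the expression is a string, switch() also accepts one unnamed argument at the end, which acts as a default, and a case with an empty value falls through to the next one. The sketch below illustrates both features of base R's switch(); the function name day_type and its labels are hypothetical.

```r
# Hypothetical helper: label a day name as weekend or weekday.
day_type <- function(day) {
  switch(day,
    "Saturday" = ,            # empty case: falls through to the next value
    "Sunday"   = "weekend",
    "weekday"                 # unnamed last argument: the default
  )
}

print(day_type("Sunday"))     # "weekend"
print(day_type("Saturday"))   # "weekend"
print(day_type("Monday"))     # "weekday"
```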

Loops:
In R programming, we need a control structure to run a block of code multiple times. Loops are
among the most fundamental and powerful programming concepts. A loop is a control
statement that allows a statement or a set of statements to be executed multiple times. The word
'looping' means cycling or iterating.
A loop asks a query in its loop structure. If the answer to that query requires an action, the action
is executed. The same query is asked again and again until no further action is required. Each time
the query is asked, it is known as an iteration of the loop. There are two components of a loop:
the control statement and the loop body. The control statement controls the execution of
statements depending on the condition, and the loop body consists of the set of statements to be
executed.
To execute the same lines of code numerous times in a program, a programmer can
simply use a loop.
There are three types of loop in R programming:
• For Loop
• While Loop
• Repeat Loop

For Loop in R
It is a type of control statement that makes it easy to construct a loop which runs a statement or
a set of statements multiple times. The for loop is commonly used to iterate over the items of
a sequence. It is an entry-controlled loop: the test condition is checked first, and the body of the
loop is executed only if the condition is true.

R – For loop Syntax:

for (value in sequence)
{
statement
}
For Loop Flow Diagram:

Example 1: Program to display numbers from 1 to 5 using for loop in R.

for (val in 1: 5)
{
print(val)
}

Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Example 2: Program to display days of a week.

week <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
          'Saturday')
for (day in week)
{
print(day)
}

Output:
[1] "Sunday"
[1] "Monday"
[1] "Tuesday"
[1] "Wednesday"
[1] "Thursday"
[1] "Friday"
[1] "Saturday"
In the above program, all the days (strings) of the week are first assigned to the vector week.
The for loop then iterates over each string in week, and in each iteration the current day is
displayed.
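When the position of each item is needed as well as its value, a common approach is to loop over the indices instead. The following is a small sketch using seq_along(); the variable names labels and runs are chosen here only for illustration. It also shows that the body never runs for an empty vector.

```r
week <- c("Sunday", "Monday", "Tuesday")
labels <- character(length(week))   # pre-allocate the result vector

# seq_along(week) yields 1, 2, 3: one index per element
for (i in seq_along(week)) {
  labels[i] <- paste(i, week[i])
}
print(labels)

# With an empty vector the body never runs, because seq_along() is empty
runs <- 0
for (i in seq_along(character(0))) {
  runs <- runs + 1
}
print(runs)
```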
While Loop in R
A while loop is a control statement that runs a statement or a set of statements repeatedly as long
as the given condition is true. It is also an entry-controlled loop: the test condition is checked
first and then the body of the loop is executed, so the loop body is not executed at all if the
test condition is false at the start.
R – While loop Syntax:

while ( condition )
{
statement
}

While loop Flow Diagram:

Below are some programs to illustrate the use of the while loop in R programming.

Example 1: Program to display numbers from 1 to 5 using while loop in R.


val = 1
while (val <= 5)
{
print(val)
val = val + 1
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Initially, the variable val is initialized to 1. In each iteration of the while loop the condition
is checked, the value of val is displayed, and val is incremented. When val becomes 6 the
condition val <= 5 becomes false and the loop terminates.
Example 2: Program to calculate factorial of a number.

n <- 5
factorial <- 1
i <- 1
while (i <= n)
{
factorial = factorial * i
i = i + 1
}
print(factorial)

Output:
[1] 120

Here, the variable n is first assigned 5, the number whose factorial is to be calculated, and the
variables i and factorial are both assigned 1. i is used for iterating over the loop, and factorial
accumulates the product. In each iteration the condition i <= n is checked, factorial is multiplied
by the value of i, and then i is incremented. When i becomes 6 the condition is false, the loop
terminates, and the factorial of 5, i.e. 120, is displayed outside the loop.
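The same loop can be wrapped in a function so that it works for any n. This is only a sketch: the name factorial_while is hypothetical, and the result is compared with R's built-in factorial() as a check.

```r
# Hypothetical helper wrapping the while loop from the example above
factorial_while <- function(n) {
  result <- 1
  i <- 1
  while (i <= n) {
    result <- result * i
    i <- i + 1
  }
  result   # the value of the last expression is returned
}

print(factorial_while(5))
print(factorial_while(5) == factorial(5))   # compare with the built-in
print(factorial_while(0))                   # the loop body is skipped entirely
```

For n = 0 the condition is false on entry, so the initial value 1 is returned, which matches 0! = 1.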
Repeat Loop in R:
A repeat loop runs the same statement or group of statements repeatedly until a stop condition is
met. The repeat loop has no condition of its own to terminate it; the programmer must place a
condition within the loop's body and use a break statement to terminate the loop. If no such
condition is present in the body of the repeat loop, it iterates forever.

R – Repeat loop Syntax:


repeat
{
statement

if( condition )
{
break
}
}
Repeat loop Flow Diagram:
To terminate the repeat loop, we use a jump statement that is the break keyword.
Below are some programs to illustrate the use of repeat loops in R programming.
Example 1: Program to display numbers from 1 to 5 using repeat loop in R.
# R program to demonstrate the use of repeat loop
val = 1
# using repeat loop
repeat
{
# statements
print(val)
val = val + 1
# checking stop condition
if(val > 5)
{
# using break statement
# to terminate the loop
break
}
}

Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
In the above program, the variable val is initialized to 1, then in each iteration of
the repeat loop the value of val is displayed and then it is incremented until it
becomes greater than 5. If the value of val becomes greater than 5 then break
statement is used to terminate the loop.
Example 2: Program to display a statement five times.
# R program to illustrate
# the application of repeat loop
# initializing the iteration variable with 0
i <- 0
# using repeat loop
repeat
{
# statement to be executed multiple times
print("SIT AIDS!")

# incrementing the iteration variable
i = i + 1

# checking the stop condition
if (i == 5)
{
# using break statement
# to terminate the loop
break
}
}
Output:
[1] "SIT AIDS!"
[1] "SIT AIDS!"
[1] "SIT AIDS!"
[1] "SIT AIDS!"
[1] "SIT AIDS!"
Here, the variable i is initialized with 0. In each iteration of the repeat
loop, after printing SIT AIDS!, the value of i is incremented. When i becomes 5 the
condition in the if statement becomes true, and the break statement is executed to
terminate the repeat loop.

Jump Statements in Loop


We use a jump statement in a loop to terminate the loop at a particular iteration or
to skip a particular iteration. The two most commonly used jump statements in loops
are:
• Break Statement: The break keyword is a jump statement that is used to
terminate the loop at a particular iteration.
Example:
# R program to illustrate
# the use of break statement

# using for loop
# to iterate over a sequence
for (val in 1:5)
{
# checking condition
if (val == 3)
{
# using break keyword
break
}

# displaying items in the sequence
print(val)
}

Output:
[1] 1
[1] 2

In the above program, when the value of val becomes 3 the break statement is
executed and the loop terminates.

• Next Statement: The next keyword is a jump statement which is used to skip a
particular iteration in the loop.
Example:
# R program to illustrate
# the use of next statement

# using for loop
# to iterate over the sequence
for (val in 1:5)
{
# checking condition
if (val == 3)
{
# using next keyword
next
}

# displaying items in the sequence
print(val)
}

Output:
[1] 1
[1] 2
[1] 4
[1] 5
In the above program, when the value of val becomes 3 the next statement is
executed, so the current iteration of the loop is skipped and 3 is not
displayed in the output.
As we can conclude from the above two programs the basic difference between the
two jump statements is that the break statement terminates the loop and
the next statement skips a particular iteration of the loop.
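Both jump statements can appear in the same loop. The sketch below, with a made-up task of summing the odd numbers up to 7, uses next to skip even values and break to stop once the limit is passed.

```r
total <- 0
for (val in 1:10) {
  if (val %% 2 == 0) {
    next    # skip even numbers, continue with the next value
  }
  if (val > 7) {
    break   # stop the loop entirely once val exceeds 7
  }
  total <- total + val
}
print(total)   # 1 + 3 + 5 + 7
```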

Functions:
Functions are useful when you want to perform a certain task multiple times. A
function accepts input arguments and produces output by executing the valid R
commands inside the function. In the R programming language, the function name and
the name of the file in which the function is created need not be the same, and a
single R file can hold one or more function definitions.
Types of function in R Language
• Built-in Function: Built-in functions in R include sqrt(), mean() and max(); users
can call these functions directly in a program.
• User-defined Function: The R language allows us to write our own functions.
Functions in R Language

Functions are created in R by using the command function().
The general structure of the function file is as follows:
f = function(arguments)
{
statements
}
Note: In the above syntax f is the function name; this means that you are creating
a function with name f which takes certain arguments and executes the following
statements.
Built-in Function in R Programming Language
Here we will use built-in functions like sum(), max() and min().
# Find sum of numbers 4 to 6.
print(sum(4:6))

# Find max of numbers 4 to 6.
print(max(4:6))

# Find min of numbers 4 to 6.
print(min(4:6))

Output:
[1] 15
[1] 6
[1] 4

User-defined Functions in R Programming Language


R provides built-in functions like print(), cat(), etc. but we can also create our own
functions. These functions are called user-defined functions.
Example:
evenOdd = function(x)
{
if(x %% 2 == 0)
return("even")
else
return("odd")
}
print(evenOdd(4))
print(evenOdd(3))
Output:
[1] "even"
[1] "odd"
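A user-defined function can also take a default argument and return several values at once. The following is an illustrative sketch; the function describe and its digits parameter are hypothetical names, not part of R.

```r
# Hypothetical function: summarise a numeric vector
describe <- function(x, digits = 2) {
  # a list lets one function return several named results
  list(mean = round(mean(x), digits), range = range(x))
}

res <- describe(c(2, 4, 4, 4, 5, 5, 7, 9))      # digits falls back to 2
print(res$mean)
print(res$range)
print(describe(c(1, 2, 2), digits = 1)$mean)    # default overridden
```

Arguments with defaults may be omitted in the call, while the caller can still override them by name.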

Strings:
String Operations in R

R provides various built-in functions that allow us to perform different
operations on strings. Here, we will look at some of the commonly used string
functions.

1. Find Length of R String

We use the nchar() function to find the length of a string. For example,

message1 <- "Programiz"

# use of nchar() to find length of message1
nchar(message1) # 9

Here, nchar() returns the number of characters present inside the string.

2. Join Strings Together

In R, we can use the paste() function to join two or more strings together. For
example,

message1 <- "Programiz"
message2 <- "Pro"

# use paste() to join two strings
paste(message1, message2)

Output

[1] "Programiz Pro"

Here, we have used the paste() function to join two strings: message1 and message2 .
3. Compare Two Strings in R Programming

We use the == operator to compare two strings. If two strings are equal, the
operator returns TRUE . Otherwise, it returns FALSE . For example,

message1 <- "Hello, World!"
message2 <- "Hola, Mundo!"
message3 <- "Hello, World!"

# compare message1 and message2
print(message1 == message2)

# compare message1 and message3
print(message1 == message3)

Output

[1] FALSE
[1] TRUE

In the above example,

• message1 == message2 - returns FALSE because the two strings are not equal
• message1 == message3 - returns TRUE because both strings are equal

4. Change Case of R String

In R, we can change the case of a string using

• toupper() - convert string to uppercase
• tolower() - convert string to lowercase

Let's see an example,

message <- "R Programming"

message_upper <- toupper(message)
cat("Uppercase:", message_upper)
message_lower <- tolower(message)
cat("\nLowercase:", message_lower)
Output

Uppercase: R PROGRAMMING
Lowercase: r programming

Here, we have used the toupper() and tolower() functions to change the case of
the message string variable to uppercase and lowercase respectively.
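A few more base R string functions are often used alongside the ones above. As a brief sketch, paste() accepts a custom separator through sep, paste0() joins without any separator, and substr() extracts part of a string; the variable names are chosen only for this example.

```r
first <- "R"
second <- "Programming"

joined <- paste(first, second, sep = "-")   # custom separator
tight <- paste0(first, second)              # no separator
prefix <- substr(second, 1, 7)              # characters 1 to 7

print(joined)
print(tight)
print(prefix)
```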

Packages in R Programming:
A package is a convenient way to organize your work and share it with others.
Typically, a package includes code (not only R code!), documentation for the
package and the functions inside, some tests to check that everything works as it
should, and data sets.
There are some mostly used and popular packages which are as follows:

Packages in R
Packages in the R programming language are collections of R functions, compiled
code, and sample data. They are stored under a directory called "library" within the
R environment. By default, R installs a group of packages during installation. When
we start the R console, only the default packages are available. Other packages
that are already installed must be loaded explicitly before the R program that uses
them can call them.
What are Repositories?
A repository is a place where packages are stored so that you can install them
from it. Organizations and developers may also keep a local repository, but
repositories are typically online and accessible to everyone. Some of the most
popular repositories for R packages are:
• CRAN: The Comprehensive R Archive Network (CRAN) is the official repository. It
is a network of FTP and web servers maintained by the R community around the
world. The R community coordinates it, and for a package to be published on
CRAN, it needs to pass several tests that ensure the package follows CRAN
policies.
• Bioconductor: Bioconductor is a topic-specific repository intended for open
source software for bioinformatics. Like CRAN, it has its own submission
and review processes, and its community is very active, holding several
conferences and meetings per year in order to maintain quality.
• GitHub: GitHub is the most popular repository for open source projects. Its
popularity comes from the unlimited space for open source projects, the
integration with git (a version control software), and the ease of sharing and
collaborating with others.
Install an R Package
There are multiple ways to install an R package; some of them are:
• Installing Packages From CRAN: To install a package from CRAN we need
the name of the package and the following command:
install.packages("package name")
• Installing a package from CRAN is the most common and easiest way, as it takes
only one command. To install more than one package at a time, we write their
names as a character vector in the first argument of the install.packages()
function:
Example:
install.packages(c("vioplot", "MASS"))
• Installing Bioconductor Packages: In older Bioconductor releases, the standard
way to install a package was to first execute the following script:
source("https://bioconductor.org/biocLite.R")
• This installs some basic functions needed to install Bioconductor
packages, such as the biocLite() function. To install the core packages of
Bioconductor, just type it without further arguments:
biocLite()
• If we just want a few particular packages from this repository, we type their
names directly as a character vector:
Example:
biocLite(c("GenomicFeatures", "AnnotationDbi"))
Current Bioconductor releases replace biocLite() with the BiocManager package,
which is installed from CRAN with install.packages("BiocManager") and used as
BiocManager::install(c("GenomicFeatures", "AnnotationDbi")).
Update, Remove and Check Installed Packages in R
To check what packages are installed on your computer, type this command:
installed.packages()
To update all the packages, type this command:
update.packages()
To update a specific package, type this command:
install.packages("PACKAGE NAME")
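The commands above can be combined with a check that a package is present before it is loaded. The sketch below uses the stats package, which ships with R, so that nothing is downloaded; the removal command is left commented out so that nothing is uninstalled.

```r
pkg <- "stats"   # ships with every R installation

# requireNamespace() returns FALSE instead of stopping when a package is missing
if (requireNamespace(pkg, quietly = TRUE)) {
  library(pkg, character.only = TRUE)   # attach the package by name
}

# installed.packages() returns a matrix whose row names are package names
is_installed <- pkg %in% rownames(installed.packages())
print(is_installed)

# Removing a package (commented out on purpose):
# remove.packages("PACKAGE NAME")
```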

R - Data Reshaping:
Data Reshaping in R is about changing the way data is organized into rows and columns.
Most of the time data processing in R is done by taking the input data as a data frame. It
is easy to extract data from the rows and columns of a data frame but there are situations
when we need the data frame in a format that is different from format in which we
received it. R has many functions to split, merge and change the rows to columns and
vice-versa in a data frame.

The various forms of reshaping data in a data frame are:

• Transpose of a Matrix
• Joining Rows and Columns
• Merging of Data Frames
• Melting and Casting
Why is Data Reshaping in R Important?
The data obtained from an experiment or study is often organized differently from
the layout that an analysis or an analytic function expects. Such data usually has
one or more columns that identify a row, followed by a number of columns that hold
the measured values. The identifying columns together play the role of a composite
key of a table in a database.
Transpose of a Matrix
We can easily calculate the transpose of a matrix in R with the help of the t()
function. The t() function takes a matrix or data frame as input and returns the
transpose of that matrix or data frame as its output.
Syntax:
t(Matrix/ Data frame)
Example:
# R program to find the transpose of a matrix
first <- matrix(c(1:12), nrow=4, byrow=TRUE)
print("Original Matrix")
print(first)
first <- t(first)
print("Transpose of the Matrix")
print(first)
Output:
[1] "Original Matrix"
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[4,] 10 11 12
[1] "Transpose of the Matrix"
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12

Joining Rows and Columns in Data Frame


In R, we can join vectors or combine data frames using functions. There are
basically two functions that perform these tasks:
cbind():
We can combine vectors, matrices or data frames by columns using the cbind() function.
Syntax: cbind(x1, x2, x3)
where x1, x2 and x3 can be vectors or matrices or data frames.
rbind():
We can combine vectors, matrices or data frames by rows using the rbind() function.
Syntax:
rbind(x1, x2, x3)
where x1, x2 and x3 can be vectors or matrices or data frames.
Example:
# Cbind and Rbind function in R
name <- c("Shaoni", "esha", "soumitra", "soumi")
age <- c(24, 53, 62, 29)
address <- c("puducherry", "kolkata", "delhi", "bangalore")

# Cbind function
info <- cbind(name, age, address)
print("Combining vectors into data frame using cbind ")
print(info)

# creating new data frame


newd <- data.frame(name=c("sounak", "bhabani"),
age=c("28", "87"),
address=c("bangalore", "kolkata"))

# Rbind function
new.info <- rbind(info, newd)
print("Combining data frames using rbind ")
print(new.info)
Output:
[1] "Combining vectors into data frame using cbind "
name age address
[1,] "Shaoni" "24" "puducherry"
[2,] "esha" "53" "kolkata"
[3,] "soumitra" "62" "delhi"
[4,] "soumi" "29" "bangalore"
[1] "Combining data frames using rbind "
name age address
1 Shaoni 24 puducherry
2 esha 53 kolkata
3 soumitra 62 delhi
4 soumi 29 bangalore
5 sounak 28 bangalore
6 bhabani 87 kolkata
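Note that in the output above the ages appear in quotes after cbind(): combining plain vectors of different types produces a character matrix rather than a data frame. A minimal sketch of the difference, using a shortened version of the same vectors:

```r
name <- c("Shaoni", "esha")
age <- c(24, 53)

m <- cbind(name, age)                      # matrix: every cell becomes character
df <- data.frame(name = name, age = age)   # data frame: each column keeps its type

print(is.matrix(m))
print(typeof(m))
print(class(df$age))
print(mean(df$age))    # arithmetic works because age is still numeric
```

Building the table with data.frame() is usually preferable when the columns hold different types.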

Merging two Data Frames


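In R, we can merge two data frames using the merge() function. By default the data
frames are matched on the columns whose names they have in common, so they should
share at least one column name; the shared column acts as the key value for the
merge. With all=TRUE, rows that have no match in the other data frame are kept as
well.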
Syntax: merge(dfA, dfB, …)
Example:
# Merging two data frames in R
d1 <- data.frame(name=c("shaoni", "soumi", "arjun"),
                 ID=c("111", "112", "113"))

d2 <- data.frame(name=c("sounak", "esha"),
                 ID=c("114", "115"))

total <- merge(d1, d2, all=TRUE)
print(total)

Output:
name ID
1 arjun 113
2 shaoni 111
3 soumi 112
4 esha 115
5 sounak 114
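merge() is most useful when the two data frames share a key column but carry different information. The following sketch uses hypothetical students and marks tables joined on ID; the names and scores are invented for illustration.

```r
# Hypothetical tables sharing the key column ID
students <- data.frame(ID = c(111, 112, 113),
                       name = c("shaoni", "soumi", "arjun"))
marks <- data.frame(ID = c(112, 113, 114),
                    score = c(81, 67, 90))

# Keep only the IDs present in both tables
inner <- merge(students, marks, by = "ID")
print(inner)

# Keep every ID; missing values are filled with NA
outer <- merge(students, marks, by = "ID", all = TRUE)
print(outer)
```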
Melting and Casting
Data reshaping often involves several steps to obtain the required format.
One of the popular methods is melting the data, which converts each row into
unique id-variable combinations, and then casting it into the desired shape. The
two functions used for this process are:
melt():
It is used to convert a data frame into a molten data frame.
Syntax: melt(data, …, na.rm=FALSE, value.name="value")
where,
data: data to be melted
… : arguments
na.rm: converts explicit missings into implicit missings
value.name: name of the column that stores the values
dcast():
It is used to aggregate the molten data frame into a new form.
Syntax: dcast(data, formula, fun.aggregate)
where,
data: molten data to be cast
formula: formula that defines how to cast
fun.aggregate: aggregation function, used if there is a data aggregation
Example:
# melt and cast
# melt() and dcast() are provided by the reshape2 package
library(reshape2)
a <- data.frame(id=c("1", "1", "2", "2"),
                points=c("1", "2", "1", "2"),
                x1=c(5, 3, 6, 2),
                x2=c(6, 5, 1, 4))

print("Melting")
m <- melt(a, id.vars=c("id", "points"))
print(m)

print("Casting")
# cast the molten data m, averaging the values for each id
idmn <- dcast(m, id~variable, mean)
print(idmn)

Output:
[1] "Melting"
  id points variable value
1  1      1       x1     5
2  1      2       x1     3
3  2      1       x1     6
4  2      2       x1     2
5  1      1       x2     6
6  1      2       x2     5
7  2      1       x2     1
8  2      2       x2     4
[1] "Casting"
  id x1  x2
1  1  4 5.5
2  2  4 2.5
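The casting result can be cross-checked without any extra package. As a sketch, base R's aggregate() computes the same per-id means directly from the wide data frame; this is offered as a check, not as a replacement for melt() and dcast().

```r
a <- data.frame(id = c("1", "1", "2", "2"),
                points = c("1", "2", "1", "2"),
                x1 = c(5, 3, 6, 2),
                x2 = c(6, 5, 1, 4))

# Mean of x1 and x2 for each id, using only base R
agg <- aggregate(cbind(x1, x2) ~ id, data = a, FUN = mean)
print(agg)
```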
R - Data interfaces:

In R, we can read data from files stored outside the R environment. We can also
write data into files which will be stored and accessed by the operating system. R
can read and write into various file formats like csv, excel, xml etc.

Getting Started:
Before we start working with data files, first make sure your working
directory is set to the right location. You can check it by using the getwd()
function. You can also set a new working directory using the setwd() function.
Read/Write CSV
The read.csv() function reads a CSV file available in your current working
directory.
A csv file is a text file in which the values in the columns are separated by
commas. Let's consider the following data present in the file named input.csv.
You can create this file using Windows Notepad or Ubuntu gedit by copying and
pasting this data. Save the file as input.csv using the Save As, All files (*.*)
option in Notepad/gedit.

Reading a CSV file


Analysing a CSV file
Writing into a CSV file
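The three steps named above can be sketched as follows. The sample data is not reproduced in these notes, so the employee records and column names here are hypothetical, and a temporary file stands in for input.csv.

```r
# Hypothetical employee records; the column names are made up for this sketch
emp <- data.frame(id = 1:3,
                  name = c("Asha", "Ravi", "Mina"),
                  salary = c(620.5, 515.2, 722.0))

path <- tempfile(fileext = ".csv")   # temporary file instead of input.csv

# Writing into a CSV file (row.names = FALSE avoids an extra index column)
write.csv(emp, path, row.names = FALSE)

# Reading a CSV file: the result is a data frame
data <- read.csv(path)
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))

# Analysing a CSV file: the row with the highest salary
top <- subset(data, salary == max(salary))
print(top)
```

Once the file is read, the data frame can be filtered with subset() or summarised with functions such as max() and mean() like any other data frame.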

XLSX file
• Microsoft Excel is the most widely used spreadsheet program; it stores
data in the .xls or .xlsx format.
• R can read directly from these files using some Excel-specific packages.
• A few such packages are XLConnect, xlsx, gdata etc.
• We will be using the xlsx package. R can also write into an Excel file using this
package.

XLSX package installation

• You can use the following command in the R console to install the "xlsx"
package.
• It may ask to install some additional packages on which this package
depends.
• Use the same command with the required package name to install the
additional packages.
install.packages("xlsx")
Verify installation

Create an xlsx file

Reading xlsx file

• The input.xlsx file is read by using the read.xlsx() function as shown below. The
result is stored as a data frame in the R environment.
# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)
R charts and graphs:
The R language is mostly used for statistics and data analytics, and representing
data graphically is a large part of that work. Charts and graphs are used in R to
represent data graphically.
R – graphs
There are hundreds of charts and graphs present in R. For example, bar plot, box
plot, mosaic plot, dot chart, coplot, histogram, pie chart, scatter graph, etc.
Types of R – Charts
• Bar Plot or Bar Chart
• Pie Diagram or Pie Chart
• Histogram
• Scatter Plot
• Box Plot
Bar Plot or Bar Chart
A bar plot or bar chart in R represents the values in a data vector as the heights
of bars. The data vector passed to the function is represented on the y-axis of the
graph. A bar chart can behave like a histogram by passing the result of the table()
function instead of a data vector.
Syntax: barplot(data, xlab, ylab)
where:
• data is the data vector to be represented on y-axis
• xlab is the label given to x-axis
• ylab is the label given to y-axis
Note: To know about more optional parameters in barplot() function, use the below
command in R console:
help("barplot")
Example:
# defining vector
x <- c(7, 15, 23, 12, 44, 56, 32)
# output to be present as PNG file
png(file = "barplot.png")

# plotting vector
barplot(x, xlab = "GeeksforGeeks Audience",
ylab = "Count", col = "white",
col.axis = "darkgreen",
col.lab = "darkgreen")

# saving the file


dev.off()

Output:

Pie Diagram or Pie Chart


A pie chart is a circular chart divided into segments according to the ratio of
the data provided. The whole pie represents 100%, and each segment shows a fraction
of the whole. It is another way to represent statistical data in graphical form,
and the pie() function is used to draw it.
Syntax: pie(x, labels, col, main, radius)
where,
• x is data vector
• labels shows names given to slices
• col fills the color in the slices as given parameter
• main shows title name of the pie chart
• radius indicates radius of the pie chart. It can be between -1 and +1
Note: To know about more optional parameters in pie() function, use the below
command in the R console:
help("pie")
Example:
Assume, vector x indicates the number of articles present on the GeeksforGeeks
portal in categories names(x)
# defining vector x with number of articles
x <- c(210, 450, 250, 100, 50, 90)

# defining labels for each value in x


names(x) <- c("Algo", "DS", "Java", "C", "C++", "Python")

# output to be present as PNG file


png(file = "piechart.png")

# creating pie chart


pie(x, labels = names(x), col = "white",
main = "Articles on GeeksforGeeks", radius = -1,
col.main = "darkgreen")

# saving the file


dev.off()

Output:

A 3D pie chart can also be created in R by using the following syntax, but it
requires the plotrix library.
Syntax: pie3D(x, labels, radius, main)
Note: To know about more optional parameters in pie3D() function, use below
command in R console:
help("pie3D")
Example:
# importing library plotrix for pie3D()
library(plotrix)

# defining vector x with number of articles


x <- c(210, 450, 250, 100, 50, 90)

# defining labels for each value in x


names(x) <- c("Algo", "DS", "Java", "C", "C++", "Python")

# output to be present as PNG file


png(file = "piechart3d.png")

# creating pie chart


pie3D(x, labels = names(x), col = "white",
main = "Articles on GeeksforGeeks",
labelcol = "darkgreen", col.main = "darkgreen")

# saving the file


dev.off()

Output:

Histogram
A histogram is a graph with bars that represent the frequency of grouped data in a
vector. A histogram looks like a bar chart; the difference between them is that a
histogram represents the frequency of grouped data rather than the data itself.
Syntax: hist(x, col, border, main, xlab, ylab)
where:
• x is data vector
• col specifies the color of the bars to be filled
• border specifies the color of border of bars
• main specifies the title name of histogram
• xlab specifies the x-axis label
• ylab specifies the y-axis label
Note: To know about more optional parameters in hist() function, use below
command in R console:
help("hist")
Example:
# defining vector
x <- c(21, 23, 56, 90, 20, 7, 94, 12,
57, 76, 69, 45, 34, 32, 49, 55, 57)

# output to be present as PNG file


png(file = "hist.png")

# plotting histogram
hist(x, main = "Histogram of Vector x",
     xlab = "Values",
     col.lab = "darkgreen",
     col.main = "darkgreen")

# saving the file


dev.off()

Output:
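To see the grouping that a histogram is built from, hist() can be called with plot = FALSE, which returns the bin boundaries and frequencies without drawing anything. A minimal sketch with the same vector (the exact bins are chosen by R's default rule, so they are printed rather than assumed):

```r
x <- c(21, 23, 56, 90, 20, 7, 94, 12,
       57, 76, 69, 45, 34, 32, 49, 55, 57)

# Compute the histogram without drawing it
h <- hist(x, plot = FALSE)

print(h$breaks)   # boundaries of the groups (bins)
print(h$counts)   # number of values falling into each group
```

Every value lands in exactly one bin, so the counts add up to the length of x.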
Scatter Plot
A Scatter plot is another type of graphical representation used to plot the points to
show relationship between two data vectors. One of the data vectors is represented
on x-axis and another on y-axis.
Syntax: plot(x, y, type, xlab, ylab, main)
Where,
• x is the data vector represented on x-axis
• y is the data vector represented on y-axis
• type specifies the type of plot to be drawn. For example, "l" for lines, "p" for
points, "s" for stair steps, etc.
• xlab specifies the label for x-axis
• ylab specifies the label for y-axis
• main specifies the title name of the graph
Note: To know about more optional parameters in plot() function, use the below
command in R console:
help("plot")
Example:
# taking input from dataset Orange already
# present in R
orange <- Orange[, c('age', 'circumference')]

# output to be present as PNG file


png(file = "plot.png")

# plotting
plot(x = orange$age, y = orange$circumference, xlab = "Age",
ylab = "Circumference", main = "Age VS Circumference",
col.lab = "darkgreen", col.main = "darkgreen",
col.axis = "darkgreen")

# saving the file


dev.off()

Output:
If a scatter plot has to be drawn to show the relation between two or more vectors,
or to plot the scatter plot matrix between the vectors, then the pairs() function
is used.
Syntax: pairs(~formula, data)
where,
• ~formula is the mathematical formula such as ~a+b+c
• data is the dataset from which the data in the formula is taken
Note: To know about more optional parameters in pairs() function, use the below
command in R console:
help("pairs")
Example :
# output to be present as PNG file
png(file = "plotmatrix.png")

# plotting scatterplot matrix


# using dataset Orange
pairs(~age + circumference, data = Orange,
col.axis = "darkgreen")

# saving the file


dev.off()

Output:

Box Plot
A box plot shows how the data in a data vector is distributed. It represents five
values in the graph: the minimum, first quartile, second quartile (median), third
quartile, and maximum value of the data vector.
Syntax: boxplot(x, xlab, ylab, notch)
where,
• x specifies the data vector
• xlab specifies the label for x-axis
• ylab specifies the label for y-axis
• notch, if TRUE, creates a notch on both sides of the box
Note: To know about more optional parameters in boxplot() function, use the
below command in R console:
help("boxplot")
Example:
# defining vector with ages of employees
x <- c(42, 21, 22, 24, 25, 30, 29, 22,
23, 23, 24, 28, 32, 45, 39, 40)

# output to be present as PNG file


png(file = "boxplot.png")

# plotting
boxplot(x, xlab = "Box Plot", ylab = "Age",
col.axis = "darkgreen", col.lab = "darkgreen")

# saving the file


dev.off()

Output:
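The five values that a box plot draws can also be inspected numerically. As a sketch with the same age vector, fivenum() returns the minimum, lower hinge, median, upper hinge and maximum, and boxplot.stats() additionally reports any points that would be drawn as outliers.

```r
x <- c(42, 21, 22, 24, 25, 30, 29, 22,
       23, 23, 24, 28, 32, 45, 39, 40)

# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
print(five)

# statistics actually used by boxplot(), plus any outliers
stats <- boxplot.stats(x)
print(stats$stats)
print(stats$out)
```

For this vector no point lies beyond the whiskers, so the whisker ends coincide with the minimum and maximum.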

R – Statistics
Statistics is a form of mathematical analysis that concerns the collection,
organization, analysis, interpretation, and presentation of data. The statistical
analysis helps to make the best use of the vast data available and improves the
efficiency of solutions.
R – Statistics
R is a programming language and environment for statistical computing and
graphics. The following is an introduction to basic statistical concepts such as
the normal distribution (bell curve), central tendency (the mean, median, and mode),
variability (the 25%, 50%, and 75% quartiles), variance, standard deviation,
modality, and skewness.
Data Concepts
Data can come in different structures and formats, so before starting with the
concepts of statistics we need to know the data formats.
These are some formats:
• Vector
• Data frame
• Variable
• Continuous Data
• Discrete Data
• Normal Data
Statistics in R
• Average, Variance and Standard Deviation in R
• Mean, Median and Mode in R Programming
Average in R Programming
An average is a number expressing the central or typical value in a set of data, in
particular the mode, median, or (most commonly) the mean, which is calculated by
dividing the sum of the values in the set by their number. The basic formula for the
average of n numbers x_1, x_2, \dots, x_n is
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}

Example:
Suppose there are 8 data points,
2, 4, 4, 4, 5, 5, 7, 9
The average of these 8 data points is,
\bar{x} = \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = \frac{40}{8} = 5
Computing Average in R Programming


To compute the average of values, R provides the pre-defined function mean(). This
function takes a numeric vector as an argument and returns the average/mean of
that vector.
Syntax: mean(x, na.rm)
Parameters:
• x: Numeric Vector
• na.rm: Boolean value; if TRUE, NA values are ignored
Example 1:
# R program to get average of a list

# Taking a list of elements


list = c(2, 4, 4, 4, 5, 5, 7, 9)

# Calculating average using mean()


print(mean(list))

Output:
[1] 5
Example 2:
# R program to get average of a list

# Taking a list of elements


list = c(2, 40, 2, 502, 177, 7, 9)

# Calculating average using mean()


print(mean(list))

Output:
[1] 105.5714

Variance in R Programming Language


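Variance is the average of the squared differences between each number and the
mean. The var() function in R computes the sample variance, which divides the sum of
squares by n - 1. The mathematical formula for variance is as follows,
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2

Example:
Let's consider the same dataset that we have taken in average. First, calculate the
deviations of each data point from the mean, and square the result of each,
(2-5)^2 = 9, (4-5)^2 = 1, (4-5)^2 = 1, (4-5)^2 = 1, (5-5)^2 = 0, (5-5)^2 = 0, (7-5)^2 = 4, (9-5)^2 = 16
The sum of these squares is 32, so the sample variance is 32 / 7 ≈ 4.571429.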
Computing Variance in R Programming
One can calculate the variance by using var() function in R.
Syntax: var(x)
Parameters:
x: numeric vector
Example 1:
# R program to get variance of a list

# Taking a list of elements


list = c(2, 4, 4, 4, 5, 5, 7, 9)

# Calculating variance using var()


print(var(list))

Output:
[1] 4.571429

Example 2:
# R program to get variance of a list

# Taking a list of elements


list = c(212, 231, 234, 564, 235)

# Calculating variance using var()


print(var(list))

Output:
[1] 22666.7

Standard Deviation in R Programming Language


Standard Deviation is the square root of variance. It is a measure of the extent to
which data varies from the mean. The mathematical formula for calculating the
sample standard deviation, as computed by sd() in R, is as follows,
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

Computing Standard Deviation in R


One can calculate the standard deviation by using sd() function in R.
Syntax: sd(x)
Parameters:
x: numeric vector
Example 1:
# R program to get
# standard deviation of a list

# Taking a list of elements


list = c(2, 4, 4, 4, 5, 5, 7, 9)

# Calculating standard
# deviation using sd()
print(sd(list))

Output:
[1] 2.13809

Example 2:
# R program to get
# standard deviation of a list

# Taking a list of elements


list = c(290, 124, 127, 899)

# Calculating standard
# deviation using sd()
print(sd(list))

Output:
[1] 367.6076
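The built-in functions can be checked against the definitions by computing each quantity by hand. The sketch below does so for the first dataset. It also derives the mode from a frequency table, since base R has no built-in function for the statistical mode; that approach is one common workaround, not the only one.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

avg <- sum(x) / n                        # mean by definition
variance <- sum((x - avg)^2) / (n - 1)   # sample variance (n - 1 divisor)
std_dev <- sqrt(variance)                # standard deviation

print(c(avg, mean(x)))
print(c(variance, var(x)))
print(c(std_dev, sd(x)))

# Median is built in; the mode is taken from a frequency table
print(median(x))
freq <- table(x)
mode_value <- as.numeric(names(freq)[which.max(freq)])
print(mode_value)
```

Each pair printed above shows the hand-computed value next to the result of the corresponding built-in function.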

Data versus Presentation:


Data visualization is what R is good at. R comes with built-in support for many standard
graphs and provides advanced tools like ggplot2 that improve the quality and aesthetics of
your graphs.

In R, we can create visually appealing data visualizations by writing a few lines of code,
using the diverse functionality of R and its packages. Data visualization is an efficient
technique for gaining insight into data through a visual medium.

Popular data visualization tools include Tableau, Plotly, R, Google Charts, Infogram, and
Kibana. These platforms differ in capabilities, functionality, and use cases, and they
require different skill sets. This section discusses the use of R for data visualization.
R is a language designed for statistical computing, graphical data analysis, and
scientific research. It is often preferred for data visualization because its packages
offer flexibility while requiring little code.
