Unit I
In a world where organizations deal with petabytes and exabytes of data, the era
of Big Data emerged, and the need to store that data grew with it. Until around
2010, storage was a major challenge and concern for industry. Once frameworks
such as Hadoop addressed the storage problem, the focus shifted to processing
the data, and this is where Data Science plays a big role. Many of the ideas you
see in science-fiction movies can become reality through Data Science. The field
is growing in many directions, so it is worth learning what it is and how we can
add value to it. After a first look at the idea, you may have questions such as:
What is Data Science? Why do we need it? How can I become a Data Scientist?
Let's clear up these questions.
Data science is a field that involves using statistical and computational techniques
to extract insights and knowledge from data. It encompasses a wide range of tasks,
including data cleaning and preparation, data visualization, statistical modeling,
machine learning, and more. Data scientists use these techniques to discover
patterns and trends in data, make predictions, and support decision-making. They
may work with a variety of data types, including structured data (such as numbers
and dates in a spreadsheet) and unstructured data (such as text, images, or
audio). Data science is used in a wide range of industries, including finance,
healthcare, retail, and more.
What is Data Science?
Data Science blends various tools, algorithms, and machine learning principles.
Most simply, it involves obtaining meaningful information or insights from
structured or unstructured data by combining analysis, programming, and business
skills. It draws on many fields, such as mathematics, statistics, and computer
science. Those who are proficient in these fields and have enough knowledge of
the domain in which they want to work can call themselves Data Scientists. It is
not easy, but it is not impossible either. You need to work through the data, its
visualization, programming, and the formulation, development, and deployment of
your model. Demand for data scientists is expected to remain high, so keep that
in mind and prepare yourself for this field.
Data science is a field that involves using statistical and computational techniques
to extract insights and knowledge from data. It is a multi-disciplinary field that
encompasses aspects of computer science, statistics, and domain-specific
expertise. Data scientists use a variety of tools and methods, such as machine
learning, statistical modeling, and data visualization, to analyze and make
predictions from data. They work with both structured and unstructured data, and
use the insights gained to inform decision making and support business
operations. Data science is applied in a wide range of industries, including finance,
healthcare, retail, and more. It helps organizations to make data-driven decisions
and gain a competitive advantage.
How Does Data Science Work?
Data science is not a one-step process that you can learn in a short time before
calling yourself a Data Scientist. It passes through many stages, and every stage
matters. Follow the steps in order, because each one contributes to your final
model. The steps are described below.
Problem Statement: No work starts without motivation, and data science is no
exception. It is important to formulate your problem statement clearly and
precisely, because your whole model and the way it works depend on it. Many
data scientists consider this the most important step of Data Science. So be
sure what your problem statement is and how it can add value to a business or
other organization.
Data Collection: After defining the problem statement, the next step is to
search for the data your model may require. Research thoroughly and find
everything you need. Data can be structured or unstructured, and it may come in
various forms such as videos, spreadsheets, and coded forms. Collect data from
all of these kinds of sources.
Data Cleaning: Once you have formulated your goal and collected your data, the
next step is cleaning. Data cleaning is a favorite task of many data
scientists. It is about handling missing values and removing redundant,
unnecessary, and duplicate data from your collection. There are various tools
for this in both R and Python, and the choice between them is yours. Data
scientists differ on which to choose. For statistical work, R is often
preferred, as it has more than 12,000 packages. Python is popular because it
is quick to work with, easily accessible, and, with the help of its own
packages, can do the same things as R.
Data Analysis and Exploration: This is one of the core activities in data science,
and it is time to bring out your inner Holmes. It is about analyzing the
structure of the data, finding hidden patterns in it, studying behaviors,
visualizing the effect of one variable on others, and then drawing conclusions.
We can explore the data with graphs produced by plotting libraries in any
programming language. In R, ggplot2 is one of the best-known plotting
libraries, as Matplotlib is in Python.
Data Modeling: Once you have studied the data through visualization, you start
building a model that can yield good predictions in the future. Here, you must
choose an algorithm that best fits your problem. There are different kinds of
algorithms, from regression to classification, SVM (support vector machines),
clustering, etc. Your model can be a machine learning algorithm. You train
your model with the training data and then test it with the test data. There
are various methods to do so. The simplest is a hold-out split, where you split
your whole data into two parts: one is the training data and the other is the
test data. Another is the K-fold method, where you split the data into K parts
and, in turn, hold out each part for testing while training on the remaining
parts.
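The two splitting strategies can be sketched in base R as follows. This is a minimal illustration rather than a full modeling workflow: the built-in mtcars dataset, the 80/20 ratio, the seed, the linear model mpg ~ wt, and K = 4 are all assumptions chosen for the example.

```r
# Hold-out split and K-fold split in base R.
# Assumptions: built-in mtcars data, 80/20 split, model mpg ~ wt, K = 4.
set.seed(42)
n <- nrow(mtcars)

# Hold-out: one training part and one test part
train_idx <- sample(n, size = floor(0.8 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
model <- lm(mpg ~ wt, data = train)
holdout_rmse <- sqrt(mean((test$mpg - predict(model, newdata = test))^2))

# K-fold: every row is assigned to exactly one of K folds,
# and each fold is held out once for testing
k <- 4
fold <- sample(rep(1:k, length.out = n))
fold_rmse <- numeric(k)
for (j in 1:k) {
  fit  <- lm(mpg ~ wt, data = mtcars[fold != j, ])
  held <- mtcars[fold == j, ]
  fold_rmse[j] <- sqrt(mean((held$mpg - predict(fit, newdata = held))^2))
}

print(holdout_rmse)
print(mean(fold_rmse))
```

Averaging the error over the K folds usually gives a steadier estimate than a single hold-out split, at the cost of fitting the model K times.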
Optimization and Deployment: You have followed every step and built a model that
you feel is the best fit. But how can you decide how well your model is
performing? This is where optimization comes in. You test the model on your
test data and measure how well it performs, for example by checking its
accuracy. In short, you check the performance of the model and try to optimize
it for more accurate predictions. Deployment deals with launching your model so
that people outside can benefit from it. You can also obtain feedback from
organizations and people to learn their needs and then improve your model
further.
Advice for new data science students
Curiosity: If you are not curious, you will not know what to do with the data.
Judgmental: If you do not have preconceived notions about things, you will not
know where to begin.
Argumentative: If you can argue and plead a case, you can at least start
somewhere, then learn from the data and modify your assumptions.
Start by gaining a solid understanding of the basics of programming, statistics,
and linear algebra.
Learn the tools of the trade such as Python, R, and SQL. Familiarize yourself
with the most popular libraries and frameworks like numpy, pandas, and scikit-
learn.
Practice, practice, practice. Participate in online coding challenges and
hackathons to improve your skills and gain experience.
Learn the basics of machine learning and familiarize yourself with the most
popular algorithms.
Read research papers and stay up-to-date with the latest developments in the
field.
Learn how to communicate your findings effectively. Being able to present your
work in a clear and compelling way is just as important as the technical skills
you possess.
Build a portfolio of projects that showcase your skills and experience.
Network with other data scientists and professionals in the field. Attend
meetups and conferences, and connect with people on LinkedIn.
Be curious, and don't be afraid to ask questions.
Finally, don't be discouraged if you encounter challenges or roadblocks along
the way. Learning to become a data scientist is a journey, and it takes time,
effort, and dedication to succeed.
Data analysis is a subset of data analytics. It is a process in which you make the
objective clear, collect the relevant data, preprocess the data, perform the analysis
(understand the data, explore insights), and then visualize the results. The last step,
visualization, is important for helping people understand what is happening in the firm.
The process of data analysis includes all these steps for the given problem
statement. Example: analyze the products that sell out quickly and the details
of frequent customers of a retail shop.
Defining the problem statement – Understand the goal and what needs to be
done. In this case, our problem statement is: “Find the products that sell out
most often and list the customers who visit the store frequently.”
Collection of data – Not all of the company's data is necessary; identify the
data relevant to the problem. Here the required columns are product ID,
customer ID, and date visited.
Preprocessing – Cleaning the data is mandatory to put it in a structured format
before performing analysis.
1. Remove outliers (noisy data).
2. Remove null or irrelevant values in the columns, or change null values to the
mean value of that column.
3. If any data is missing, either ignore the tuple or fill it with the mean value of
the column.
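Preprocessing steps 2 and 3 can be sketched in base R as follows. The visits table, its column names, and its values are hypothetical and exist only to illustrate the idea; duplicate removal is included because duplicates are a common cleaning target as well.

```r
# Hypothetical retail visits table; NA marks a missing value.
visits <- data.frame(
  product_id  = c(101, 102, 101, 103, 101),
  customer_id = c(1, 2, 1, NA, 3),
  amount      = c(20, NA, 20, 40, 30)
)

# Ignore tuples (rows) whose customer ID is missing
visits <- visits[!is.na(visits$customer_id), ]

# Fill missing amounts with the mean of the remaining values in the column
visits$amount[is.na(visits$amount)] <- mean(visits$amount, na.rm = TRUE)

# Remove exact duplicate rows
visits <- unique(visits)

print(visits)
```

Whether to drop a row or fill it with the mean depends on the column: an identifier cannot be averaged, while a numeric measurement often can.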
Data Analysis using the Titanic dataset
You can download the Titanic dataset (it contains data from real passengers of the
Titanic) and save it in the current working directory. Now we will start the
analysis (getting to know our data).
titanic <- read.csv("train.csv")
head(titanic)
Output:
  PassengerId Survived Pclass                             Name    Sex
1         892        0      3                 Kelly, Mr. James   male
2         893        1      3 Wilkes, Mrs. James (Ellen Needs) female
3         894        0      2        Myles, Mr. Thomas Francis   male
4         895        0      3                 Wirz, Mr. Albert   male
Overview of R:
Variables are reserved memory locations that store values. This means that when
you create a variable, you reserve some space in memory.
You may want to store information of various data types, such as character, wide
character, integer, floating point, double floating point, and Boolean. Based on
the data type of a variable, memory is allocated and the kinds of values that can
be stored in the reserved memory are decided.
In contrast to other programming languages like C and Java, variables in R are
not declared with a data type. Variables are assigned R-objects, and the data
type of the R-object becomes the data type of the variable. The frequently used
R-objects are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector object and there are six data types of
these atomic vectors, also termed as six classes of vectors. The other R-Objects are
built upon the atomic vectors.
v <- TRUE
print(class(v))
It produces the following result −
[1] "logical"
v <- 23.5
print(class(v))
It produces the following result −
[1] "numeric"
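All six classes of atomic vectors can be checked in the same way. The sketch below uses one illustrative value per class; the values themselves are arbitrary.

```r
# One example value for each of the six atomic vector classes
examples <- list(TRUE, 23.5, 2L, 2+5i, "TRUE", charToRaw("Hello"))

# class() is applied to each value in turn
classes <- vapply(examples, class, character(1))
print(classes)
# logical, numeric, integer, complex, character, raw
```

Note that 2L (with the L suffix) is an integer, while a plain 2 would be numeric, and that "TRUE" in quotes is a character string rather than a logical value.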
Relational Operators
The following table shows the relational operators supported by the R language.
Each element of the first vector is compared with the corresponding element of
the second vector. The result of the comparison is a Boolean value.
==  Checks if each element of the first vector is equal to the corresponding
    element of the second vector.
    v <- c(2,5.5,6,9)
    t <- c(8,2.5,14,9)
    print(v == t)
    It produces the following result −
    [1] FALSE FALSE FALSE TRUE
!=  Checks if each element of the first vector is unequal to the corresponding
    element of the second vector.
    v <- c(2,5.5,6,9)
    t <- c(8,2.5,14,9)
    print(v != t)
    It produces the following result −
    [1] TRUE TRUE TRUE FALSE
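R also provides the ordering comparisons >, <, >= and <=, which work element by element in the same way. A short sketch reusing the vectors v and t from the rows above:

```r
v <- c(2, 5.5, 6, 9)
t <- c(8, 2.5, 14, 9)

gt <- v > t    # FALSE TRUE FALSE FALSE
lt <- v < t    # TRUE FALSE TRUE FALSE
ge <- v >= t   # FALSE TRUE FALSE TRUE
le <- v <= t   # TRUE FALSE TRUE TRUE

print(gt)
print(lt)
print(ge)
print(le)
```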
Logical Operators
The following table shows the logical operators supported by the R language. They
are applicable only to vectors of type logical, numeric, or complex. All non-zero
numbers are treated as the logical value TRUE, and zero is treated as FALSE.
With the element-wise operators & and |, each element of the first vector is
compared with the corresponding element of the second vector. The result of the
comparison is a Boolean value.
The logical operators && and || work on single values and give a single-element
result. Older versions of R silently used only the first element of longer
vectors; since R 4.3.0, passing a vector of length greater than one is an error,
so the first elements are selected explicitly below.
||  Called the logical OR operator. Takes the first element of both vectors and
    gives TRUE if one of them is TRUE.
    v <- c(0,0,TRUE,2+2i)
    t <- c(0,3,TRUE,2+3i)
    print(v[1] || t[1])
    It produces the following result −
    [1] FALSE
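The element-wise operators &, | and ! can be sketched as follows. The vectors are illustrative; any non-zero number counts as TRUE and zero counts as FALSE.

```r
v <- c(3, 0, TRUE, 2)
t <- c(4, 0, FALSE, 2)

and_result <- v & t   # TRUE FALSE FALSE TRUE
or_result  <- v | t   # TRUE FALSE TRUE TRUE
not_result <- !v      # FALSE TRUE FALSE FALSE

print(and_result)
print(or_result)
print(not_result)
```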
Assignment Operators
These operators are used to assign values to vectors.
Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical
or logical computation.
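The operator tables for these two groups are not reproduced in these notes. As a minimal sketch, the following shows the common assignment forms (<-, -> and =) and three miscellaneous operators (the colon operator, %in% and %*%); the values are illustrative.

```r
# Assignment operators: all three lines create the same vector
v1 <- c(3, 1, 4)      # leftward assignment
c(3, 1, 4) -> v2      # rightward assignment
v3 = c(3, 1, 4)       # leftward assignment with =

# Colon operator: creates the sequence 2, 3, ..., 8
s <- 2:8

# %in%: tests whether a value belongs to a vector
print(5 %in% s)    # TRUE
print(12 %in% s)   # FALSE

# %*%: matrix multiplication (here a 2x3 matrix times its transpose)
m <- matrix(c(2, 6, 5, 1, 10, 4), nrow = 2, ncol = 3, byrow = TRUE)
p <- m %*% t(m)
print(p)
```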
Decision making:
# R program to illustrate
# if statement
a <- 76
b <- 67
# TRUE condition
if(a > b)
{
c <- a - b
print("condition a > b is TRUE")
print(paste("Difference between a, b is : ", c))
}
# FALSE condition
if(a < b)
{
c <- a - b
print("condition a < b is TRUE")
print(paste("Difference between a, b is : ", c))
}
Output:
[1] "condition a > b is TRUE"
[1] "Difference between a, b is : 9"
if-else statement
The if-else statement provides an optional else block that is executed if the condition of the if
block is false. If the condition provided to the if block is true, the statement within the if block is
executed; otherwise the statement within the else block is executed.
Syntax:
if(condition is true) {
execute this statement
} else {
execute this statement
}
Example :
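The example code is not reproduced in these notes. A minimal sketch of an if-else statement follows; the values a = 67 and b = 76 are assumptions, chosen so that the else branch runs and the difference is -9.

```r
a <- 67
b <- 76

if (a > b) {
  d <- a - b
  print("condition a > b is TRUE")
  print(paste("Difference between a, b is:", d))
} else {
  # executed because a > b is FALSE
  d <- a - b
  print("condition a > b is FALSE")
  print(paste("Difference between a, b is:", d))
}
```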
Output:
[1] "condition a > b is FALSE"
[1] "Difference between a, b is: -9"
if-else-if ladder
It is similar to the if-else statement; the only difference is that an if statement is attached to the
else. If the condition provided to the if block is true, the statement within the if block is executed;
otherwise the next condition is checked, and if it is true, the statement within that else-if block is
executed.
Syntax:
if(condition 1 is true) {
execute this statement
} else if(condition 2 is true) {
execute this statement
} else {
execute this statement
}
Example :
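The example code is not reproduced in these notes. A minimal sketch of an if-else-if ladder follows; the values of a, b and c are assumptions, chosen so that the second condition is the one that holds.

```r
a <- 67
b <- 76
c <- 99

if (a > b && b > c) {
  msg <- "condition a > b > c is TRUE"
} else if (a < b && b < c) {
  # reached only because the first condition is FALSE
  msg <- "condition a < b < c is TRUE"
} else {
  msg <- "no condition matched"
}
print(msg)
```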
Output:
[1] "condition a < b < c is TRUE"
Nested if-else statement
When an if-else block appears as a statement within an if block, or optionally within an else block,
it is called a nested if-else statement. When the parent if condition is true, the child if
condition is evaluated, and if it is false the child else statement is executed; this happens within
the parent if block. If the parent if condition is false, the parent else block is executed, which
may also contain a child if-else statement.
Syntax:
if(parent condition is true) {
if( child condition 1 is true) {
execute this statement
} else {
execute this statement
}
} else {
if(child condition 2 is true) {
execute this statement
} else {
execute this statement
}
}
Example:
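The example code is not reproduced in these notes. A minimal sketch of a nested if-else statement follows; the values a = 10 and b = 11 and the messages are assumptions for illustration.

```r
a <- 10
b <- 11

if (a == 10) {
  # parent condition is TRUE, so the child condition is checked
  if (b == 10) {
    msg <- "a:10 b:10"
  } else {
    msg <- "a:10 b:11"
  }
} else {
  # parent condition is FALSE: a separate child if-else runs
  if (b == 10) {
    msg <- "a is not 10, b:10"
  } else {
    msg <- "a is not 10, b is not 10"
  }
}
print(msg)
```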
Output:
[1] "a:10 b:11"
Switch statement
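The switch function matches an expression against a list of cases. If a match is found, it returns
that case's value. No default case is defined in these examples, so if no case matches, switch
returns NULL, as shown in the third example.
Syntax:
switch(expression, case1, case2, case3, …, case n)
Example:
# matching by position: a numeric expression selects the 2nd case
x <- switch(2, "Geeks1", "for", "Geeks2")
print(x)
# matching by name
y <- switch("GfG3", "GfG0"="Geeks1", "GfG1"="for", "GfG3"="Geeks2")
print(y)
# no case matches, so NULL is returned
z <- switch(
"GfG", # Expression
"GfG0"="Geeks1", # case 1
"GfG1"="for", # case 2
"GfG3"="Geeks2" # case 3
)
print(z)
Output:
[1] "for"
[1] "Geeks2"
NULL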
Loops:
In R programming, we need a control structure to run a block of code multiple times. Loops are
among the most fundamental and powerful programming concepts. A loop is a control statement
that allows multiple executions of a statement or a set of statements. The word 'looping' means
cycling or iterating.
A loop asks a query in the loop structure. If the answer to that query requires an action, the action
is executed. The same query is asked again and again until no further action is required. Each time
the query is asked is known as an iteration of the loop. A loop has two components: the control
statement and the loop body. The control statement controls the execution of statements
depending on the condition, and the loop body consists of the set of statements to be executed.
To execute the same lines of code many times in a program, a programmer can simply use a loop.
There are three types of loop in R programming:
For Loop
While Loop
Repeat Loop
For Loop in R
It is a type of control statement that makes it easy to construct a loop that runs a statement or a
set of statements multiple times. The for loop is commonly used to iterate over the items of a
sequence. It is an entry-controlled loop: the test condition is checked first and then the body of
the loop is executed, so the loop body is not executed if the test condition is false.
for (val in 1: 5)
{
print(val)
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Example 2: Program to display days of a week.
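The code for this example is not reproduced in these notes. A minimal sketch that stores the days in a vector named week and iterates over it with a for loop:

```r
# assigning the days (strings) of the week to the vector week
week <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")

# the loop variable day takes each element of week in turn
for (day in week) {
  print(day)
}
```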
Output:
[1] "Sunday"
[1] "Monday"
[1] "Tuesday"
[1] "Wednesday"
[1] "Thursday"
[1] "Friday"
[1] "Saturday"
In the above program, initially, all the days (strings) of the week are assigned to the vector week.
Then for loop is used to iterate over each string in a week. In each iteration, each day of the week is
displayed.
While Loop in R
It is a type of control statement that runs a statement or a set of statements repeatedly as long as
the given condition is true. It is also an entry-controlled loop: the test condition is checked
first and then the body of the loop is executed, so the loop body is not executed if the test
condition is false.
R – While loop Syntax:
while ( condition )
{
statement
}
Below are some programs to illustrate the use of the while loop in R programming.
n <- 5
factorial <- 1
i <- 1
while (i <= n)
{
factorial = factorial * i
i = i + 1
}
print(factorial)
Output:
[1] 120
Here, the variable n is first assigned the value 5, whose factorial is going to be calculated, and the
variables i and factorial are assigned 1. i is used for iterating over the loop, and factorial is used
for calculating the factorial. In each iteration of the loop, the condition is checked, i.e. i should be
less than or equal to 5; then factorial is multiplied by the value of i, and i is incremented. When i
becomes 6, the condition is false, the loop terminates, and the factorial of 5, i.e. 120, is displayed
outside the loop.
Repeat Loop in R:
It is a simple loop that runs the same statement or group of statements repeatedly until a stop
condition is encountered. The repeat loop has no condition of its own to terminate the loop; the
programmer must place a condition within the loop's body and use a break statement to terminate
it. If no such condition is present in the body of the repeat loop, it iterates infinitely.
Repeat loop Syntax:
repeat
{
statement
if( condition )
{
break
}
}
Repeat loop Flow Diagram:
To terminate the repeat loop, we use a jump statement that is the break keyword.
Below are some programs to illustrate the use of repeat loops in R programming.
Example 1: Program to display numbers from 1 to 5 using repeat loop in R.
# R program to demonstrate the use of repeat loop
val = 1
# using repeat loop
repeat
{
# statements
print(val)
val = val + 1
# checking stop condition
if(val > 5)
{
# using break statement
# to terminate the loop
break
}
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
In the above program, the variable val is initialized to 1, then in each iteration of
the repeat loop the value of val is displayed and then it is incremented until it
becomes greater than 5. If the value of val becomes greater than 5 then break
statement is used to terminate the loop.
Example 2: Program to display a statement five times.
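# R program to illustrate
# the application of repeat loop
# initializing the iteration variable with 0
i <- 0
# using repeat loop
repeat
{
# statement to be executed multiple times
print("SIT AIDS!")
# incrementing the iteration variable
i <- i + 1
# checking the stop condition
if (i == 5)
{
break
}
}
Output:
[1] "SIT AIDS!"
[1] "SIT AIDS!"
[1] "SIT AIDS!"
[1] "SIT AIDS!"
[1] "SIT AIDS!"
Break Statement: The break keyword is a jump statement which is used to
terminate the loop at a particular iteration. For example, a loop that displays
val and executes break when val becomes 3 produces:
Output:
[1] 1
[1] 2
In such a program, if the value of val becomes 3 then the break statement is
executed and the loop terminates.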
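The break behaviour just described can be sketched as follows. The program itself is not reproduced in these notes, so the loop range 1:5 is an assumption; the vector shown is kept only so the result can be checked.

```r
shown <- c()   # records the values that were displayed

for (val in 1:5) {
  # terminate the whole loop as soon as val becomes 3
  if (val == 3) {
    break
  }
  print(val)
  shown <- c(shown, val)
}
```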
Next Statement: The next keyword is a jump statement which is used to skip a
particular iteration in the loop.
Example:
# R program to illustrate
# the use of next statement
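The body of this program is not reproduced in these notes. A minimal sketch, assuming a for loop over 1:5 that skips the iteration in which val is 3; the vector shown is kept only so the result can be checked.

```r
shown <- c()   # records the values that were displayed

for (val in 1:5) {
  # skip the rest of this iteration when val is 3
  if (val == 3) {
    next
  }
  print(val)
  shown <- c(shown, val)
}
```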
Output:
[1] 1
[1] 2
[1] 4
[1] 5
In the above program, if the value of val becomes 3 then the next statement is
executed, so the current iteration of the loop is skipped. Therefore 3 is not
displayed in the output.
As we can conclude from the above two programs the basic difference between the
two jump statements is that the break statement terminates the loop and
the next statement skips a particular iteration of the loop.
Functions:
Functions are useful when you want to perform a certain task multiple times. A
function accepts input arguments and produces the output by executing valid R
commands that are inside the function. In R Programming Language when you are
creating a function the function name and the file in which you are creating the
function need not be the same and you can have one or more function definitions
in a single R file.
Types of function in R Language
Built-in Function: Built-in functions in R include seq(), mean(), and max(); users
can call these functions directly in a program.
User-defined Function: The R language allows us to write our own functions.
Functions in R Language
Note: In the above syntax f is the function name, this means that you are creating
a function with name f which takes certain arguments and executes the following
statements.
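The syntax block that the note refers to is not reproduced in these notes. In R, a function is created with the function keyword and assigned to a name. A minimal sketch follows; the name f matches the note, while the arguments x and y and the body are illustrative.

```r
# f is the function name; x and y are its arguments (y has a default value)
f <- function(x, y = 2) {
  # the value of the last expression is returned
  x^y
}

print(f(3))      # 9, because y defaults to 2
print(f(2, 3))   # 8
```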
Built-in Function in R Programming Language
Here we will use built-in function like sum(), max() and min().
# Find the sum, maximum and minimum of the numbers 4 to 6.
print(sum(4:6))
print(max(4:6))
print(min(4:6))
Output:
[1] 15
[1] 6
[1] 4
Strings:
String Operations in R
We use the nchar() function to find the length of a string; nchar() returns the
number of characters present inside the string.
In R, we can use the paste() function to join two or more strings together, for
instance two strings stored in the variables message1 and message2.
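Both operations can be sketched as follows. The string values are assumptions for illustration; message1 and message2 are the variable names used in the text.

```r
message1 <- "R Programming"
message2 <- "is fun"

# number of characters in the string, spaces included
len <- nchar(message1)
print(len)      # 13

# paste() joins the strings, separated by a space by default
joined <- paste(message1, message2)
print(joined)   # "R Programming is fun"
```

A different separator can be chosen with the sep argument, for example paste(message1, message2, sep = "-").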
3. Compare Two Strings in R Programming
We use the == operator to compare two strings. If two strings are equal, the
operator returns TRUE . Otherwise, it returns FALSE . For example,
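The example code is not reproduced in these notes. A minimal sketch with three illustrative strings, where message1 and message3 hold the same text:

```r
message1 <- "Hello, World!"
message2 <- "Hi, World!"
message3 <- "Hello, World!"

cmp12 <- message1 == message2   # strings differ
cmp13 <- message1 == message3   # strings are identical

print(cmp12)
print(cmp13)
```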
Output
[1] FALSE
[1] TRUE
message1 == message2 - returns FALSE because two strings are not equal
message1 == message3 - returns TRUE because both strings are equal
Uppercase: R PROGRAMMING
Lowercase: r programming
Here, we have used the toupper() and the tolower() method to change the case of
the message1 string variable to uppercase and lowercase respectively.
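The code for the case conversion is not reproduced in these notes. A minimal sketch, assuming message1 holds the text "R Programming":

```r
message1 <- "R Programming"

upper <- toupper(message1)   # every letter converted to uppercase
lower <- tolower(message1)   # every letter converted to lowercase

cat("Uppercase:", upper, "\n")
cat("Lowercase:", lower, "\n")
```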
Packages in R Programming:
The package is an appropriate way to organize the work and share it with others.
Typically, a package will include code (not only R code!), documentation for the
package and the functions inside, some tests to check everything works as it
should, and data sets.
There are some mostly used and popular packages which are as follows:
Packages in R
Packages in the R programming language are collections of R functions, compiled
code, and sample data. They are stored under a directory called “library” within
the R environment. By default, R installs a set of packages during installation.
When we start the R console, only the default packages are loaded. Other
packages that are already installed must be loaded explicitly before an R
program can use them.
What are Repositories?
A repository is a place where packages are located and stored so you can install
packages from it. Organizations and Developers have a local repository; typically
they are online and accessible to everyone. Some of the most popular repositories
for R packages are:
CRAN: Comprehensive R Archive Network(CRAN) is the official repository, it is a
network of ftp and web servers maintained by the R community around the
world. The R community coordinates it, and for a package to be published in
CRAN, the Package needs to pass several tests to ensure that the package is
following CRAN policies.
Bioconductor: Bioconductor is a topic-specific repository, intended for open
source software for bioinformatics. Similar to CRAN, it has its own submission
and review processes, and its community is very active having several
conferences and meetings per year in order to maintain quality.
Github: Github is the most popular repository for open source projects. Its
popularity comes from the unlimited space it offers for open source projects,
its integration with git (a version control software), and the ease of sharing
and collaborating with others.
Installing R Packages
There are multiple ways to install an R package; some of them are:
Installing Packages From CRAN: For installing Package from CRAN we need
the name of the package and use the following command:
install.packages("package name")
Installing a package from CRAN is the most common and easiest way, as we need
only one command. To install more than one package at a time, we write their
names as a character vector in the first argument of the install.packages()
function:
Example:
install.packages(c("vioplot", "MASS"))
Installing Bioconductor Packages: In Bioconductor, the standard way to
install a package used to be to first execute the following script:
source("https://bioconductor.org/biocLite.R")
This installs some basic functions needed to install Bioconductor packages,
such as the biocLite() function. To install the core packages of
Bioconductor, just call it without further arguments:
biocLite()
If we just want a few particular packages from this repository, we pass their
names as a character vector:
Example:
biocLite(c("GenomicFeatures", "AnnotationDbi"))
Note: biocLite() has been retired in current Bioconductor releases, which
install packages through the BiocManager package (BiocManager::install())
instead.
Update, Remove and Check Installed Packages in R
To check what packages are installed on your computer, type this command:
installed.packages()
To update all the packages, type this command:
update.packages()
To update a specific package, type this command:
install.packages("PACKAGE NAME")
To remove a package, type this command:
remove.packages("PACKAGE NAME")
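Checking for and loading a single package can be sketched as follows. The package name stats is an assumption chosen because it ships with R; any installed package name could be used instead.

```r
pkg <- "stats"   # assumption: a package that is shipped with R

# installed.packages() returns a matrix with one row per installed package
is_installed <- pkg %in% rownames(installed.packages())
print(is_installed)

# load the package only if it is available
if (requireNamespace(pkg, quietly = TRUE)) {
  library(pkg, character.only = TRUE)
}
print(pkg %in% loadedNamespaces())
```

The character.only = TRUE argument is needed because the package name is held in a variable rather than typed directly.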
R - Data Reshaping:
Data Reshaping in R is about changing the way data is organized into rows and columns.
Most of the time data processing in R is done by taking the input data as a data frame. It
is easy to extract data from the rows and columns of a data frame but there are situations
when we need the data frame in a format that is different from format in which we
received it. R has many functions to split, merge and change the rows to columns and
vice-versa in a data frame.
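# Vectors to be combined
name <- c("Shaoni", "esha", "soumitra", "soumi")
age <- c(24, 53, 62, 29)
address <- c("puducherry", "kolkata", "delhi", "bangalore")
# Cbind function
info <- cbind(name, age, address)
print("Combining vectors into data frame using cbind ")
print(info)
# New rows to be appended
newd <- data.frame(name = c("sounak", "bhabani"),
age = c("28", "87"),
address = c("bangalore", "kolkata"))
# Rbind function
new.info <- rbind(info, newd)
print("Combining data frames using rbind ")
print(new.info)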
Output:
[1] "Combining vectors into data frame using cbind "
name age address
[1,] "Shaoni" "24" "puducherry"
[2,] "esha" "53" "kolkata"
[3,] "soumitra" "62" "delhi"
[4,] "soumi" "29" "bangalore"
[1] "Combining data frames using rbind "
name age address
1 Shaoni 24 puducherry
2 esha 53 kolkata
3 soumitra 62 delhi
4 soumi 29 bangalore
5 sounak 28 bangalore
6 bhabani 87 kolkata
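The next output block lists names with IDs and appears to come from a merge() example whose code is not reproduced in these notes. A minimal sketch of merging two data frames follows; how the five rows are split between d1 and d2 is an assumption.

```r
# Two data frames with the same columns (the split is an assumption)
d1 <- data.frame(name = c("shaoni", "soumi", "arjun"),
                 ID = c("111", "112", "113"))
d2 <- data.frame(name = c("sounak", "esha"),
                 ID = c("114", "115"))

# all = TRUE keeps the rows of both data frames, including unmatched ones
total <- merge(d1, d2, all = TRUE)
print(total)
```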
Output:
name ID
1 arjun 113
2 shaoni 111
3 soumi 112
4 esha 115
5 sounak 114
Melting and Casting
Data reshaping can involve many steps to obtain the desired format. One popular
method is melting the data, which converts each row into a unique id-variable
combination, and then casting it. The two functions used for this process are:
melt():
It is used to convert a data frame into a molten data frame.
Syntax: melt(data, …, na.rm=FALSE, value.name="value")
where,
data: data to be melted
… : further arguments
na.rm: whether NA values are removed (converts explicit missings into implicit
missings)
value.name: name of the variable used to store values
dcast():
It is used to aggregate the molten data frame into a new form.
Syntax: dcast(data, formula, fun.aggregate)
where,
data: molten data to be cast
formula: formula that defines how to cast
fun.aggregate: aggregation function, used if there is a data aggregation
Example:
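# melt and cast (melt() and dcast() are provided by the reshape2 package)
library(reshape2)
# x1 and x2 are numeric so that their mean can be computed
a <- data.frame(id=c("1", "1", "2", "2"),
points=c("1", "2", "1", "2"),
x1=c(5, 3, 6, 2),
x2=c(6, 5, 1, 4))
print("Melting")
m <- melt(a, id.vars=c("id", "points"))
print(m)
print("Casting")
# cast the molten data m, averaging the values for each id
idmn <- dcast(m, id~variable, mean)
print(idmn)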
Output:
Melting
id points variable value
 1      1       x1     5
 1      2       x1     3
 2      1       x1     6
 2      2       x1     2
 1      1       x2     6
 1      2       x2     5
 2      1       x2     1
 2      2       x2     4
Casting
id x1  x2
 1  4 5.5
 2  4 2.5
R - Data interfaces:
In R, we can read data from files stored outside the R environment. We can also
write data into files which will be stored and accessed by the operating system. R
can read and write into various file formats like csv, excel, xml etc.
Getting Started:
Before we start working with data (interface data), first make sure your working
directory is set correctly. You can check it by using the getwd() function.
You can also set a new working directory using the setwd() function.
Read/Write CSV
The csv file is a text file in which the values in the columns are separated by a
comma. Consider data stored in a file named input.csv. You can create this file
using Windows Notepad or Ubuntu gedit by typing or pasting the data and saving
the file as input.csv using the Save As All files(*.*) option in Notepad/gedit.
The read.csv() function reads a CSV file available in your current working
directory.
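Writing and reading a CSV file can be sketched as follows. The contents of input.csv are not reproduced in these notes, so the employees table is hypothetical, and the file is written to a temporary directory instead of the working directory.

```r
# Hypothetical data standing in for the contents of input.csv
employees <- data.frame(
  id     = 1:3,
  name   = c("Asha", "Ravi", "Meena"),
  salary = c(623.3, 515.2, 611.0)
)

# Write the data to a CSV file (row names are left out)
csv_path <- file.path(tempdir(), "input.csv")
write.csv(employees, csv_path, row.names = FALSE)

# Read the file back; read.csv() returns a data frame
data <- read.csv(csv_path)
print(data)
print(is.data.frame(data))
print(max(data$salary))
```

Because read.csv() returns a data frame, the usual functions such as nrow(), ncol() and subset() can be applied to the result directly.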
XLSX file
Microsoft Excel is the most widely used spreadsheet program, and it stores
data in the .xls or .xlsx format.
R can read directly from these files using Excel-specific packages, such as
XLConnect, xlsx, and gdata.
We will be using the xlsx package. R can also write to Excel files using this
package.
# plotting vector
barplot(x, xlab = "GeeksforGeeks Audience",
ylab = "Count", col = "white",
col.axis = "darkgreen",
col.lab = "darkgreen")
Output:
Histogram
A histogram is a graphical representation in which bars represent the frequency
of grouped data in a vector. A histogram is similar to a bar chart; the only
difference between them is that a histogram represents the frequency of grouped
data rather than the data itself.
Syntax: hist(x, col, border, main, xlab, ylab)
where:
x is data vector
col specifies the color of the bars to be filled
border specifies the color of border of bars
main specifies the title name of histogram
xlab specifies the x-axis label
ylab specifies the y-axis label
Note: To know about more optional parameters in hist() function, use below
command in R console:
help("hist")
Example:
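# defining vector
x <- c(21, 23, 56, 90, 20, 7, 94, 12,
57, 76, 69, 45, 34, 32, 49, 55, 57)
# plotting the histogram (the title and label text are illustrative)
hist(x, main = "Histogram of Vector x", xlab = "Values")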
Output:
Scatter Plot
A Scatter plot is another type of graphical representation used to plot the points to
show relationship between two data vectors. One of the data vectors is represented
on x-axis and another on y-axis.
Syntax: plot(x, y, type, xlab, ylab, main)
Where,
x is the data vector represented on x-axis
y is the data vector represented on y-axis
type specifies the type of plot to be drawn. For example, “l” for lines, “p” for
points, “s” for stair steps, etc.
xlab specifies the label for x-axis
ylab specifies the label for y-axis
main specifies the title name of the graph
Note: To know about more optional parameters in plot() function, use the below
command in R console:
help("plot")
Example:
# taking input from dataset Orange already
# present in R
orange <- Orange[, c('age', 'circumference')]
# plotting
plot(x = orange$age, y = orange$circumference, xlab = "Age",
ylab = "Circumference", main = "Age VS Circumference",
col.lab = "darkgreen", col.main = "darkgreen",
col.axis = "darkgreen")
Output:
To show the relationships among two or more vectors at once, that is, to draw a
scatter plot matrix of the vectors, the pairs() function is used.
Syntax: pairs(~formula, data)
where,
~formula is a one-sided formula such as ~a+b+c that names the variables to plot
data is the dataset from which the variables in the formula are taken
Note: To learn about more optional parameters of the pairs() function, use the
command below in the R console:
help("pairs")
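Example:
# output to be present as PNG file
png(file = "plotmatrix.png")
# plotting (call missing in the source; the Orange variables are assumed)
pairs(~age + circumference, data = Orange)
# closing the device writes the file
dev.off()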
Output:
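A formula of the form ~a+b+c can be illustrated with three variables. The sketch below uses the built-in mtcars dataset and its columns mpg, wt and hp; this dataset is an assumption chosen for illustration and does not appear in the original example.

```r
# scatter plot matrix of three variables from the built-in mtcars dataset
pairs(~mpg + wt + hp, data = mtcars,
      main = "Scatter Plot Matrix of mtcars")
```

Each off-diagonal panel is an ordinary scatter plot of one pair of variables.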
Box Plot
A box plot shows how the data in a vector is distributed. It represents five
values in the graph: the minimum, the first quartile, the second quartile (median),
the third quartile, and the maximum of the data vector.
Syntax: boxplot(x, xlab, ylab, notch)
where,
x specifies the data vector
xlab specifies the label for x-axis
ylab specifies the label for y-axis
notch, if TRUE, draws a notch on each side of the box
Note: To learn about more optional parameters of the boxplot() function, use the
command below in the R console:
help("boxplot")
Example:
# defining vector with ages of employees
x <- c(42, 21, 22, 24, 25, 30, 29, 22,
23, 23, 24, 28, 32, 45, 39, 40)
# plotting
boxplot(x, xlab = "Box Plot", ylab = "Age",
col.axis = "darkgreen", col.lab = "darkgreen")
Output:
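The five values that the box plot draws can also be printed as numbers. The following sketch uses boxplot.stats(), the function that boxplot() relies on for its statistics, on the same vector of ages; the name five_values is an assumption made for this illustration.

```r
# same vector of employee ages as in the box plot example
x <- c(42, 21, 22, 24, 25, 30, 29, 22,
       23, 23, 24, 28, 32, 45, 39, 40)

# lower whisker, lower hinge, median, upper hinge, upper whisker
five_values <- boxplot.stats(x)$stats
print(five_values)
```

Because this vector has no outliers, the whiskers coincide with the minimum and maximum. The box edges are computed as hinges, which can differ slightly from the quartiles returned by quantile().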
R – Statistics
Statistics is a form of mathematical analysis that concerns the collection,
organization, analysis, interpretation, and presentation of data. Statistical
analysis helps to make the best use of the vast amount of data available and
improves the efficiency of solutions.
R is a programming language and environment for statistical computing and
graphics. The following is an introduction to basic statistical concepts such as
the normal distribution (bell curve), central tendency (the mean, median, and
mode), variability (the 25%, 50%, and 75% quartiles), variance, standard
deviation, modality, and skewness.
Data Concepts
Data can come in different structures and formats, so before starting with the
concepts of statistics we need to know how data is represented.
These are some of the structures and kinds of data:
Vector
Data frame
Variable
Continuous Data
Discrete Data
Normal Data
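A few of these terms can be illustrated in R. The sketch below is only an example: the employees data frame, its column names and all of its values are hypothetical and were made up for this illustration.

```r
# vector: a sequence of values of the same type
ages <- c(42, 21, 22, 24)

# data frame: a table whose columns (variables) are vectors of equal length
employees <- data.frame(
  name     = c("A", "B", "C", "D"),      # hypothetical names
  age      = ages,
  children = c(2L, 0L, 1L, 3L),          # discrete data: whole-number counts
  height   = c(1.72, 1.65, 1.80, 1.58)   # continuous data: any value in a range
)

# a variable is one column of the data frame
print(employees$height)

# overview of the structure and type of each variable
str(employees)
```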
Statistics in R
Average, Variance and Standard Deviation in R
Mean, Median and Mode in R Programming
Average in R Programming
The average is a number expressing the central or typical value in a set of data, in
particular the mode, median, or (most commonly) the mean, which is calculated by
dividing the sum of the values in the set by their number. The basic formula for the
average of n numbers x1, x2, ..., xn is
average = (x1 + x2 + ... + xn) / n
Example:
Suppose there are 8 data points,
2, 4, 4, 4, 5, 5, 7, 9
The average of these 8 data points is (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8 = 40 / 8 = 5.
Output:
[1] 5
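The result above can be reproduced in R. This is a minimal sketch for the 8 data points of the example; the variable names values and manual_average are assumptions made for this illustration.

```r
# the 8 data points from the example
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# average from the formula: sum of the values divided by their number
manual_average <- sum(values) / length(values)
print(manual_average)

# the built-in mean() function gives the same result
print(mean(values))
```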
Example 2:
# R program to get average of a list
Output:
[1] 105.5714
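Example:
Let's consider the same dataset that we used for the average. First, calculate the
deviation of each data point from the mean (5), and square each result:
9, 1, 1, 1, 0, 0, 4, 16
The squared deviations sum to 32. Dividing by n - 1 = 7 gives the sample variance,
32 / 7 = 4.571429, which is the value that var() returns for this dataset.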
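The calculation of squared deviations can be sketched step by step in R for the same 8 data points. The variable names are assumptions made for this illustration. Note that var() computes the sample variance, which divides by n - 1 rather than n.

```r
# the 8 data points from the average example
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# deviation of each data point from the mean, squared
squared_dev <- (values - mean(values))^2
print(squared_dev)

# sample variance: sum of squared deviations divided by n - 1
manual_var <- sum(squared_dev) / (length(values) - 1)
print(manual_var)

# the built-in var() function uses the same n - 1 denominator
print(var(values))
```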
Computing Variance in R Programming
One can calculate the variance by using var() function in R.
Syntax: var(x)
Parameters:
x: numeric vector
Example 1:
# R program to get variance of a list
Output:
[1] 4.571429
Example 2:
# R program to get variance of a list
Output:
[1] 22666.7
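Computing Standard Deviation in R Programming
Example 1:
# same 8 data points as above (assumed; they reproduce the output shown)
list <- c(2, 4, 4, 4, 5, 5, 7, 9)
# Calculating standard
# deviation using sd()
print(sd(list))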
Output:
[1] 2.13809
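The standard deviation is the square root of the variance, which can be checked directly. The sketch below assumes the same 8 data points used for the average; the variable name values is an assumption made for this illustration.

```r
# the 8 data points from the average example
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# standard deviation as the square root of the sample variance
print(sqrt(var(values)))

# the built-in sd() function gives the same result
print(sd(values))
```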
Example 2:
# R program to get
# standard deviation of a list
# Calculating standard
# deviation using sd()
print(sd(list))
Output:
[1] 367.6076
In R, we can create visually appealing data visualizations with only a few lines of code,
using the wide range of plotting functionality that R provides. Data visualization is an
efficient technique for gaining insight into data through a visual medium.
Popular data visualization tools include Tableau, Plotly, R, Google Charts, Infogram,
and Kibana. These platforms differ in their capabilities, functionality, and use cases,
and they require different skill sets. This article discusses the use of R for data
visualization.
R is a language designed for statistical computing, graphical data analysis, and
scientific research. It is often preferred for data visualization because its packages
offer flexibility and require minimal coding.