Data Analysis Using R (Student Copy)
4. Programs on Operators in R.
5. Control Structures in R.
7. Decision Tree
13. Visualizations
Plot The Histogram, Bar Chart And Pie Chart On Sample Data
14. Implementation of various charts
EX.NO: 1
DATE:
INSTALLING R AND PACKAGES IN R
AIM
To install R and packages in R.
PROCEDURE
To Install R and R Packages
To Install RStudio
1. Go to www.rstudio.com and click on the "Download RStudio" button.
2. Click on "Download RStudio Desktop."
3. Click on the version recommended for your system, or the latest Mac version, save the .dmg
file on your computer, double-click it to open, and then drag and drop it to your applications
folder.
To Install R Packages
The capabilities of R are extended through user-created packages, which provide
specialized statistical techniques, graphical devices, import/export capabilities, reporting tools
(knitr, Sweave), etc. These packages are developed primarily in R, and sometimes in Java, C,
C++, and Fortran. The R packaging system is also used by researchers to create compendia to
organize research data, code and report files in a systematic way for sharing and public
archiving.
A core set of packages is included with the installation of R, with more than 12,500 additional
packages (as of May 2018) available at the Comprehensive R Archive Network
(CRAN).
Packages are collections of R functions, data, and compiled code in a well-defined
format. The directory where packages are stored is called the library. R comes with a standard
set of packages. Others are available for download and installation. Once installed, they have
to be loaded into the session to be used.
● .libPaths() # get library location
● library() # see all packages installed
● search() # see packages currently loaded
Adding R Packages
You can expand the types of analyses you do by adding other packages. A complete list
of contributed packages is available from CRAN.
Follow these steps:
1.Download and install a package (you only need to do this once).
2.To use the package, invoke the library(package) command to load it into the current session.
(You need to do this once in each session, unless you customize your environment to
automatically load it each time.)
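The two steps above can be sketched as follows. The helper name ensure_package is hypothetical, and the stats package is used only because it ships with every R installation; substitute the name of any CRAN package you need.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # step 1: needed only once per machine
  }
  library(pkg, character.only = TRUE)   # step 2: needed once per session
}

ensure_package("stats")   # "stats" is bundled with R, so nothing is downloaded
print(.libPaths())        # library locations
print(search())           # packages currently attached
```

Checking with requireNamespace() first avoids re-downloading a package each time the script is run.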
OUTPUT
AIM
To make a simple calculator using R
PROCEDURE
PROGRAM
# Program to make a simple calculator that can add, subtract, multiply and divide
# using functions
add <- function(x, y) {
  return(x + y)
}
subtract <- function(x, y) {
  return(x - y)
}
multiply <- function(x, y) {
  return(x * y)
}
divide <- function(x, y) {
  return(x / y)
}
# take input from the user
print("Select operation.")
print("1.Add")
print("2.Subtract")
print("3.Multiply")
print("4.Divide")
choice = as.integer(readline(prompt="Enter choice[1/2/3/4]: "))
num1 = as.integer(readline(prompt="Enter first number: "))
num2 = as.integer(readline(prompt="Enter second number: "))
# switch() picks the entry whose position matches choice
operator <- switch(choice,"+","-","*","/")
result <- switch(choice, add(num1, num2),
  subtract(num1, num2), multiply(num1, num2),
  divide(num1, num2))
print(paste(num1, operator, num2, "=", result))
OUTPUT
RESULT
Thus the process has been executed successfully
EX.NO: 3
DATE:
CREATE A NUMERIC DATA VECTORS USING R
AIM
Write an R program to create three vectors holding numeric data, character data and logical
data. Display the contents of the vectors and their types.
PROCEDURE
PROGRAM
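A minimal sketch of a program that meets the aim above. The variable names and sample values are illustrative assumptions, not part of the exercise statement.

```r
# Sample values are assumed for illustration.
num_vec  <- c(1.5, 2, 3.25)        # numeric data
char_vec <- c("a", "b", "c")       # character data
logi_vec <- c(TRUE, FALSE, TRUE)   # logical data

# Display the content and the type of each vector
print(num_vec)
print(typeof(num_vec))
print(char_vec)
print(typeof(char_vec))
print(logi_vec)
print(typeof(logi_vec))
```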
OUTPUT
RESULT
This process has been executed successfully
EX.NO:4
DATE:
PROGRAMS ON OPERATORS IN R.
AIM
To demonstrate operators in R
PROCEDURE
1. Open a new file
2. To demonstrate operators in R
3. Arithmetic operators(+,-,*,/,^,%%)
4. Logical operators (&,|,!)
5. Relational operators (<, >, <=, >=, ==, !=)
6. Assignment operator(<-,->,=)
7. Exit from the program
1. Arithmetic Operators
These operators are used to carry out mathematical operations like addition
and multiplication. Here is a list of arithmetic operators available in R.
Program 1
# R program to illustrate
# the use of Arithmetic operators
vec1 <- c(0, 2)
vec2 <- c(2, 3)
Output
Addition of vectors : 2 5
Subtraction of vectors : -2 -1
Multiplication of vectors : 0 6
Division of vectors : 0 0.6666667
Modulo of vectors : 0 2
Power operator : 0 8
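The output above is consistent with applying each arithmetic operator element-wise to vec1 and vec2. A sketch of statements that would print it, assuming the same cat() style used in Program 4:

```r
vec1 <- c(0, 2)
vec2 <- c(2, 3)

# Each operator works element by element on the two vectors
cat("Addition of vectors :", vec1 + vec2, "\n")
cat("Subtraction of vectors :", vec1 - vec2, "\n")
cat("Multiplication of vectors :", vec1 * vec2, "\n")
cat("Division of vectors :", vec1 / vec2, "\n")
cat("Modulo of vectors :", vec1 %% vec2, "\n")
cat("Power operator :", vec1 ^ vec2, "\n")
```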
R Relational Operators
Relational operators are used to compare values. Here is a list of relational operators
available in R.
Program 2
# R program to illustrate
# the use of Relational operators
vec1 <- c(0, 2)
vec2 <- c(2, 3)
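The comparisons for these two vectors can be sketched as below. The cat() labels are assumptions modelled on the other programs in this exercise; each comparison is element-wise and yields a logical vector.

```r
vec1 <- c(0, 2)
vec2 <- c(2, 3)

# Element-wise comparisons between the two vectors
cat("Vector1 less than Vector2 :", vec1 < vec2, "\n")
cat("Vector1 less than or equal to Vector2 :", vec1 <= vec2, "\n")
cat("Vector1 greater than Vector2 :", vec1 > vec2, "\n")
cat("Vector1 greater than or equal to Vector2 :", vec1 >= vec2, "\n")
cat("Vector1 equal to Vector2 :", vec1 == vec2, "\n")
cat("Vector1 not equal to Vector2 :", vec1 != vec2, "\n")
```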
Output
Assignment operators
Assignment operators are used to assign values to various data objects in R. The objects may
be integers, vectors, or functions. These values are then stored under the assigned variable names.
There are two kinds of assignment operators: Left and Right.
Program 3
# R program to illustrate
# the use of Assignment operators
vec1 <- c(0,2)           # leftward assignment
c(TRUE,FALSE) -> vec2    # rightward assignment
vec3 = c(0,2)            # leftward assignment with =
Output
Logical Operators
Logical operators perform element-wise decisions between the operands, and each comparison
evaluates to either TRUE or FALSE. Any nonzero numeric value, whether real or complex,
is treated as TRUE; zero is treated as FALSE.
Program 4
# R program to illustrate
# the use of Logical operators
vec1 <- c(0,2)
vec2 <- c(TRUE,FALSE)
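# Performing operations on Operands
cat ("Element wise AND :", vec1 & vec2, "\n")
cat ("Element wise OR :", vec1 | vec2, "\n")
# && and || need length-one operands (longer operands are an error from R 4.3.0),
# so only the first elements are compared
cat ("Logical AND :", vec1[1] && vec2[1], "\n")
cat ("Logical OR :", vec1[1] || vec2[1], "\n")
cat ("Negation :", !vec1)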
Output
Element wise AND : FALSE FALSE
Element wise OR : TRUE TRUE
Logical AND : FALSE
Logical OR : TRUE
Negation : TRUE FALSE
EX.NO: 5
DATE:
CONTROL STRUCTURES
AIM
To implement control structures in R
PROCEDURE
R if statement
The syntax of if statement is:
if (test_expression)
{
statement
}
If the test_expression is TRUE, the statement gets executed. But if it’s FALSE, nothing
happens.
Here, test_expression must be a single logical or numeric value. From R 4.2.0 onwards a
condition of length greater than one is an error; earlier versions used only the first element
and issued a warning.
In the case of a numeric value, zero is taken as FALSE, the rest as TRUE.
Example: if statement
x <- 5
if(x > 0)
{
print("Positive number")
}
[1] "Positive number"
year = as.integer(readline(prompt="Enter a year: "))
if((year %% 4) == 0) {
  if((year %% 100) == 0) {
    # century years are leap years only if divisible by 400
    if((year %% 400) == 0) {
      print(paste(year,"is a leap year"))
    } else {
      print(paste(year,"is not a leap year"))
    }
  } else {
    print(paste(year,"is a leap year"))
  }
} else {
  print(paste(year,"is not a leap year"))
}
OUTPUT
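num = as.integer(readline(prompt="Enter a number: "))
factorial = 1
if(num < 0) {
  print("Sorry, factorial does not exist for negative numbers")
} else if(num == 0) {
  print("The factorial of 0 is 1")
} else {
  for(i in 1:num) {
    factorial = factorial * i
  }
  print(paste("The factorial of", num ,"is",factorial))
}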
OUTPUT
Enter a number: 8
[1] "The factorial of 8 is 40320"
# A number is even if division by 2 gives a remainder of 0.
# If the remainder is 1, it is odd.
num = as.integer(readline(prompt="Enter a number: "))
if((num %% 2) == 0) {
  print(paste(num,"is Even"))
} else {
  print(paste(num,"is Odd"))
}
OUTPUT
Enter a number: 89
[1] "89 is Odd"
R for loop
The syntax of a for loop is:
for (val in sequence)
{
statement
}
Here, sequence is a vector and val takes on each of its values during the loop. In each iteration,
statement is evaluated.
# Count the even numbers in x
x <- c(2,5,3,9,8,11,6)
count <- 0
for (val in x) {
  if(val %% 2 == 0) count = count + 1
}
print(count)
Output
[1] 3
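num = as.integer(readline(prompt="Enter a number: "))
flag = 0
# prime numbers are greater than 1
if(num > 1) {
  flag = 1
  for(i in 2:(num-1)) {
    if((num %% i) == 0) {
      flag = 0
      break
    }
  }
}
# 2 is prime, but the loop above marks it otherwise because 2:(2-1) counts down
if(num == 2) flag = 1
if(flag == 1) {
  print(paste(num,"is a prime number"))
} else {
  print(paste(num,"is not a prime number"))
}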
OUTPUT
Enter a number: 25
[1] "25 is not a prime number"
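num = as.integer(readline(prompt="Enter a number: "))
for(i in 1:10)
{
print(paste(num, "x", i, "=", num*i))
}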
Output
Enter a number: 7
[1] "7 x 8 = 56"
[1] "7 x 9 = 63"
[1] "7 x 10 = 70"
WHILE LOOP
In R programming, while loops are used to loop until a specific condition is met.
while (test_expression)
statement
Here, test_expression is evaluated and the body of the loop is entered if the result is TRUE.
The statements inside the loop are executed and the flow returns to evaluate the
test_expression again. This is repeated until test_expression evaluates to FALSE, in
which case the loop exits.
Example of while Loop
i <- 1
while (i < 6) {
  print(i)
  i = i + 1
}
Output
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
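# take input from the user
num = as.integer(readline(prompt="Enter a number: "))
# sum of the cubes of the digits
sum = 0
temp = num
while(temp > 0) {
  digit = temp %% 10
  sum = sum + (digit ^ 3)
  temp = floor(temp / 10)
}
if(num == sum) {
  print(paste(num, "is an Armstrong number"))
} else {
  print(paste(num, "is not an Armstrong number"))
}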
OUTPUT
Enter a number: 23
[1] "23 is not an Armstrong number"
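num = as.integer(readline(prompt="Enter a number: "))
if(num < 0) {
  print("Enter a positive number")
} else {
  sum = 0
  # add num, num-1, ..., 1 until num reaches zero
  while(num > 0) {
    sum = sum + num
    num = num - 1
  }
  print(paste("The sum is", sum))
}
OUTPUT
Enter a number: 10
[1] "The sum is 55"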
OUTPUT
How many terms?
7
[1] "Fibonacci sequence:"
[1] 0
[1] 1
[1] 1
[1] 2
[1] 3
[1] 5
[1] 8
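A sketch of a loop that produces the sequence shown in the output above. The helper name fib_terms is hypothetical, and the number of terms is passed as an argument instead of being read with readline() so that the example runs non-interactively.

```r
# Hypothetical helper: return the first n Fibonacci terms, starting from 0.
fib_terms <- function(n) {
  if (n <= 0) return(numeric(0))
  terms <- numeric(n)            # filled with zeros, so terms[1] is already 0
  if (n >= 2) terms[2] <- 1
  if (n >= 3) {
    for (i in 3:n) {
      terms[i] <- terms[i - 1] + terms[i - 2]   # each term is the sum of the previous two
    }
  }
  terms
}

print("Fibonacci sequence:")
for (term in fib_terms(7)) print(term)
```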
RESULT
The process has been executed successfully.
EX.NO: 6
DATE:
CREATING MATRIX AND MANIPULATING MATRIX IN R.
AIM
To create and manipulate matrix in R
PROCEDURE
Creation of matrix
1. matrix1<- matrix ( data = 1, nrow = 3, ncol = 3)
>matrix1 <- matrix ( data = 1, nrow = 3, ncol = 3)
>matrix1
Sol
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
Sol
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Sol
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
matrix1+2
[,1] [,2] [,3]
[1,] 3 6 9
[2,] 4 7 10
[3,] 5 8 11
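The 3 x 3 matrix holding 1 to 9 shown above is what matrix() returns when it fills column by column, which is its default. A minimal sketch, assuming the matrix was built from 1:9:

```r
# matrix() fills column-wise unless byrow = TRUE is given
matrix1 <- matrix(data = 1:9, nrow = 3, ncol = 3)
print(matrix1)

# Arithmetic with a scalar is applied to every element
print(matrix1 + 2)
```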
Manipulation of Matrix
1. matrix1
>matrix1
Sol
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
2. matrix1[1, 3]
>matrix1[1, 3]
Sol
[1] 7
3. matrix1[ 2, ]
Sol
> matrix1[ 2, ]
[1] 2 5 8
4. matrix1[,-2]
Sol
>matrix1[,-2]
[,1] [,2]
[1,] 1 7
[2,] 2 8
[3,] 3 9
5.matrix1[1, 1] = 15
Sol
>matrix1[1, 1] = 15
>matrix1
[,1] [,2] [,3]
[1,] 15 4 7
[2,] 2 5 8
[3,] 3 6 9
6.matrix1[ ,2 ] = 1
matrix1
Sol
[,1] [,2] [,3]
[1,] 15 1 7
[2,] 2 1 8
[3,] 3 1 9
8. >m<-matrix(nrow=2,ncol=4,data=c(1,3,5,7,2,4,6,8) , byrow=TRUE)
>m
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
9. Calculate Transpose.
>t(m)
Sol
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
OUTPUT
RESULT
The process has been executed successfully
EX.NO: 7
DATE:
DECISION TREE
AIM
To build decision tree model and check its performance on titanic dataset
PROCEDURE
1. Start
2. Ensure data set is in current working directory
3. Read .CSV file and load into a data frame
4. Display the structure of a dataset
5. Pre-process the dataset and make the dataset ready for analysis
6. Fix randomization procedure with seed method
7. Sample 60% of data for training and 40% of the data for testing
8. Split data into training and testing
9. Install and import “rpart” package and library
10. Build decision tree model by defining “survived” as target variable and remaining
variables as independent variables
11. Evaluate the model with test data
12. Install and import “caret” package and library
13. Generate confusion matrix
14. Stop
PROGRAM
#load dataset
data=read.csv("titanic.csv")
#exploratory analysis on dataset
str(data)
head(data)
tail(data)
sum(is.na(data)) # count missing values
summary(data)
#Rename columns 6 and 7
colnames(data)[6:7]<-c("sib","par")
str(data)
#dropping columns from dataset
data_new<-data[-c(3)]
str(data_new)
#split dataset into training and testing
set.seed(45)
train.index<- sample(row.names(data_new), dim(data_new)[1]*0.6)
test.index<- sample(setdiff(row.names(data_new), train.index),
dim(data_new)[1]*0.4)
train<- data_new[train.index, ]
test<- data_new[test.index, ]
#Implement Algorithm
#Classification Tree
#Full-grown Tree
library(rpart)
library(rpart.plot)
# Full grown tree
str(data_new)
class.tree<- rpart(Survived ~., data = train, method = "class")
tree.pred.test<- predict(class.tree, test, type = "class")
install.packages("caret")
library(caret)
# confusionMatrix() expects both arguments to be factors with the same levels
confusionMatrix(tree.pred.test, as.factor(test$Survived))
OUTPUT
RESULT
Thus, the process has been executed successfully
EX.NO: 8
DATE:
K-NEAREST NEIGHBOUR (KNN) ALGORITHM
AIM
To build machine learning model using KNN and check its performance on iris dataset
PROCEDURE
1. Start
2. Load the built-in dataset “iris” into the workspace
3. Display the structure of “iris” dataset
4. Install and import “e1071” package and library
5. Install and import “caTools” package and library
6. Install and import “class” package and library
7. Sample and split 70% of data for training and 30% of data for testing
8. Scale features (Columns) of training and testing except target column “species”
9. Build the K-NN model and test it with the test data by passing the training dataset, testing
dataset, target column and number of nearest neighbours (5)
10. Build confusion matrix by passing actual and predicted values of test data
11. Stop
PROGRAM
# Loading data
data(iris)
# Structure
str(iris)
# Installing Packages
install.packages("e1071")
install.packages("caTools")
install.packages("class")
# Importing libraries
library(e1071)
library(caTools)
library(class)
head(iris)
# Splitting data into train and test data (70% / 30%, stratified by Species)
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)
# Feature Scaling
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])
# Fitting K-NN with k = 5 and predicting the classes of the test data
classifier_knn <- knn(train = train_scale, test = test_scale,
  cl = train_cl$Species, k = 5)
# Confusion Matrix
cm <- table(test_cl$Species, classifier_knn)
cm
OUTPUT
RESULT
Thus, the process has been executed successfully
EX.NO: 9
DATE:
NAÏVE BAYES ALGORITHM
AIM
To learn Naïve Bayes classifier and its implementation in R Programming
Theory
The Naive Bayes algorithm is based on Bayes' theorem. Bayes' theorem gives the conditional
probability of an event A given that another event B has occurred:
P(A|B) = [P(B|A) * P(A)] / P(B)
where,
P(A|B) = Conditional probability of A given B.
P(B|A) = Conditional probability of B given A.
P(A) = Probability of event A.
P(B) = Probability of event B.
Example:
Consider a sample space:
{HH, HT, TH, TT}
where,
H: Head
T: Tail
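Find the probability that the second coin is a head given that the first coin is a tail.
Let A be the event that the second coin is a head and B the event that the first coin is a tail. Then:
P(A|B) = [P(B|A) * P(A)] / P(B)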
= [P(First coin is tail given second coin is head) *
P(Second coin being Head)] / P(first coin being tail)
= [(1/2) * (1/2)] / (1/2)
= (1/2)
= 0.5
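The worked example can be checked numerically by enumerating the sample space. This is a sketch in base R; the variable names are illustrative, and A and B are taken to mean "second coin is a head" and "first coin is a tail".

```r
# Sample space for two fair coin tosses: HH, TH, HT, TT
space <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                     stringsAsFactors = FALSE)

A <- space$second == "H"   # event A: second coin is a head
B <- space$first == "T"    # event B: first coin is a tail

p_A <- mean(A)                        # P(A)
p_B <- mean(B)                        # P(B)
p_B_given_A <- sum(A & B) / sum(A)    # P(B|A)

# Bayes' theorem
p_A_given_B <- p_B_given_A * p_A / p_B
print(p_A_given_B)

# Direct count for comparison: outcomes in both A and B over outcomes in B
print(sum(A & B) / sum(B))
```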
RESULT
Thus, the process has been executed successfully
EX.NO: 10
DATE:
RANDOM FOREST ALGORITHM
AIM
To build machine learning model using Random Forest and check its performance on
titanic dataset
PROCEDURE
1. Start
2. Ensure data set is in current working directory
3. Read .CSV file and load into a data frame
4. Display the structure of a dataset
5. Pre-process the dataset and make the dataset ready for analysis
6. Fix randomization procedure with seed method
7. Sample 60% of data for training and 40% of the data for testing
8. Split data into training and testing
9. Convert the variables “Survived” and “Pclass” of training and testing data into
categorical using factor method
10. Install and import “randomForest” package and library
11. Build random forest model by defining “survived” as target variable and remaining
variables as independent variables
12. Evaluate the model with test data
13. Install and import “caret” package and library
14. Generate confusion matrix
15. Stop
PROGRAM
#load dataset
data=read.csv("titanic.csv")
#exploratory analysis on dataset
str(data)
head(data)
tail(data)
sum(is.na(data)) # count missing values
summary(data)
#Rename columns 6 and 7
colnames(data)[6:7]<-c("sib","par")
str(data)
#dropping columns from dataset
data_new<-data[-c(3)]
str(data_new)
#split dataset into training and testing
set.seed(45)
train.index<- sample(row.names(data_new), dim(data_new)[1]*0.6)
test.index<- sample(setdiff(row.names(data_new), train.index), dim(data_new)[1]*0.4)
train<- data_new[train.index, ]
test<- data_new[test.index, ]
#Implement Algorithm
train$Survived<- as.factor(train$Survived)
train$Pclass<- as.factor(train$Pclass)
sapply(train, class)
set.seed(1234)
install.packages("randomForest")
library(randomForest)
RF_model1 <- randomForest(Survived ~., data = train,importance=TRUE)
test$Survived<- as.factor(test$Survived)
test$Pclass<- as.factor(test$Pclass)
sapply(test, class)
RF_prediction<- predict(RF_model1, test)
install.packages("caret")
library(caret)
conMat<- confusionMatrix(RF_prediction, test$Survived)
conMat
OUTPUT
RESULT
Thus, the process has been executed successfully
EX.NO: 11
DATE:
AIM
To create 2-D plot and perform clustering based on the data age and spending
PROCEDURE
1. Start
2. Create a data frame with two columns using vector c()
3. Read .CSV file and load into a data frame
4. Define ggplot by passing the data frame, x-axis and y-axis, and plot the geometrical points in
the graph
5. Install and import “cluster” package and library
6. Create clusters by passing the data frame, number of clusters and initial configuration value to
the K-means function
7. Display cluster results
8. Plot clusters
9. Stop
PROGRAM
library(ggplot2)
df<-data.frame(age=c(18,21,40,24),spend=c(10,11,22,15))
ggplot(df,aes(x=age,y=spend))+geom_point()
library(cluster)
# store the result under its own name so it is not confused with the kmeans() function
km<-kmeans(df,centers=2,nstart=20)
str(km)
clusplot(df,km$cluster,labels=2,lines=0)
Output
RESULT
Thus, the process has been executed successfully
EX.NO: 12
DATE :
Aim:
To learn about hierarchical cluster analysis using R programming.
Cluster analysis or clustering is a technique to find subgroups of data points within a data
set. The data points belonging to the same subgroup have similar features or properties.
Clustering is an unsupervised machine learning approach and has a wide variety of
applications such as market research, pattern recognition, recommendation systems, and so
on. The most common algorithms used for clustering are K-means clustering and
Hierarchical cluster analysis. In this exercise, we will learn about hierarchical cluster analysis
and its implementation in R programming.
Hierarchical cluster analysis (also known as hierarchical clustering) is a clustering
technique where clusters have a hierarchy or a predetermined order. Hierarchical clustering
can be represented by a tree-like structure called a Dendrogram. There are two types of
hierarchical clustering:
● Agglomerative hierarchical clustering: This is a bottom-up approach where each
data point starts in its own cluster and as one moves up the hierarchy, similar pairs of
clusters are merged.
● Divisive hierarchical clustering: This is a top-down approach where all data points
start in one cluster and as one moves down the hierarchy, clusters are split recursively.
To measure the similarity or dissimilarity between a pair of data points, we use distance
measures (Euclidean distance, Manhattan distance, etc.). However, to find the dissimilarity
between two clusters of observations, we use agglomeration methods. The most common
agglomeration methods are:
● Complete linkage clustering: It computes all pairwise dissimilarities between the
observations in two clusters, and considers the longest (maximum) distance between
two points as the distance between two clusters.
● Single linkage clustering: It computes all pairwise dissimilarities between the
observations in two clusters, and considers the shortest (minimum) distance as the
distance between two clusters.
● Average linkage clustering: It computes all pairwise dissimilarities between the
observations in two clusters, and considers the average distance as the distance
between two clusters.
Performing Hierarchical Cluster Analysis using R
For computing hierarchical clustering in R, the commonly used functions are as follows:
● hclust in the stats package and agnes in the cluster package for agglomerative
hierarchical clustering.
● diana in the cluster package for divisive hierarchical clustering.
We will use the Iris flower data set from the datasets package in our implementation. We will
use sepal width, sepal length, petal width, and petal length column as our data points. First,
we load and normalize the data. Then the dissimilarity values are computed with dist function
and these values are fed to clustering functions for performing hierarchical clustering.
# Load the data and normalize the four measurement columns
df <- scale(iris[, 1:4])
# Dissimilarity matrix
d <- dist(df, method = "euclidean")
Agglomerative hierarchical clustering implementation
The dissimilarity matrix obtained is fed to hclust. The method parameter of hclust specifies
the agglomeration method to be used (i.e. complete, average, single). We can then plot the
dendrogram.
hc1 <- hclust(d, method = "complete" )
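The full workflow described above (normalize, compute dissimilarities, cluster, plot) can be sketched end to end as follows. Cutting the tree into k = 3 groups is an assumption made here because iris contains three species; the plot arguments are only cosmetic choices.

```r
# Normalize the four measurement columns of the built-in iris data
df <- scale(iris[, 1:4])

# Dissimilarity matrix
d <- dist(df, method = "euclidean")

# Agglomerative clustering with complete linkage
hc1 <- hclust(d, method = "complete")

# Dendrogram
plot(hc1, cex = 0.6, hang = -1)

# Cut the tree into 3 groups (assumed) and compare with the known species
groups <- cutree(hc1, k = 3)
print(table(groups, iris$Species))
```

Changing method to "single" or "average" applies the other linkage rules listed earlier.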
Output:
Result:
Hierarchical clustering with R programming is learnt with an example.
EX.NO: 13
DATE:
HISTOGRAM REPRESENTATION USING IRIS DATASET
Sol:
This chapter shows examples of data exploration with R. It starts with inspecting the
dimensionality, structure and data of an R object, followed by basic statistics and various charts
like pie charts and histograms. Exploration of multiple variables is then demonstrated,
including grouped distributions, grouped boxplots, scatter plots and pairs plots. After that,
examples are given of level plots, contour plots and 3D plots. It also shows how to save charts
into files of various formats.
The iris data is used in this chapter for demonstration of data exploration with R.
We first check the size and structure of data. The dimension and names of data can be
obtained respectively with dim() and names(). Functions str() and attributes() return the
structure and attributes of data.
dim(iris)
[1] 150 5
names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
iris[1:5,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
tail(iris)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
var(iris$Sepal.Length)
[1] 0.6856935
hist(iris$Sepal.Length)
plot(density(iris$Sepal.Length))
pie(table(iris$Species))
barplot(table(iris$Species))
After checking the distributions of individual variables, we then investigate the relationships
between two variables. Below we calculate covariance and correlation between variables with
cov() and cor().
cov(iris$Sepal.Length, iris$Petal.Length)
[1] 1.274315
cov(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
RESULT
Thus, the process has been executed successfully
EX.NO: 14
DATE :
Implementation of various charts
Procedure
Bar Plot or Bar Chart
Bar plot or Bar Chart in R is used to represent the values in data vector as height of the bars.
The data vector passed to the function is represented over y-axis of the graph. Bar chart can
behave like histogram by using table() function instead of data vector.
Syntax: barplot(data, xlab, ylab)
where:
● data is the data vector to be represented on y-axis
● xlab is the label given to x-axis
● ylab is the label given to y-axis
Program:
# defining vector
x <- c(7, 15, 23, 12, 44, 56, 32)
# output to be present as PNG file
png(file = "barplot.png")
# plotting vector
barplot(x, xlab = "GeeksforGeeks Audience",
ylab = "Count", col = "white",
col.axis = "darkgreen",
col.lab = "darkgreen")
# saving the file
dev.off()
Output:
Pie Diagram or Pie Chart
Pie chart is a circular chart divided into different segments according to the ratio of data
provided. The total value of the pie is 100 and the segments tell the fraction of the whole pie.
It is another method to represent statistical data in graphical form and pie() function is used to
perform the same.
Syntax: pie(x, labels, col, main, radius)
where,
● x is the data vector
● labels gives the names of the slices
● col specifies the fill colors of the slices
● main shows the title of the pie chart
● radius indicates the radius of the pie chart. It can be between -1 and +1
Program:
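# defining vector x with number of articles
x <- c(210, 450, 250, 100, 50, 90)
# output to be present as PNG file (file name assumed)
png(file = "piechart.png")
# plotting
pie(x, main = "Pie Chart of Vector x")
# saving the file
dev.off()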
Output:
Histogram
A histogram is a graphical representation whose bars show the frequency of grouped data
in a vector. It resembles a bar chart; the only difference is that a histogram represents
the frequency of grouped data rather than the data itself.
Syntax: hist(x, col, border, main, xlab, ylab)
where:
● x is data vector
● col specifies the color of the bars to be filled
● border specifies the color of border of bars
● main specifies the title name of histogram
● xlab specifies the x-axis label
● ylab specifies the y-axis label
Program:
# defining vector
x <- c(21, 23, 56, 90, 20, 7, 94, 12,
57, 76, 69, 45, 34, 32, 49, 55, 57)
# output to be present as PNG file
png(file = "hist.png")
# plotting
hist(x, main = "Histogram of Vector x",
xlab = "Values",
col.lab = "darkgreen",
col.main = "darkgreen")
# saving the file
dev.off()
Output:
Scatter Plot
A Scatter plot is another type of graphical representation used to plot the points to show
relationship between two data vectors. One of the data vectors is represented on x-axis and
another on y-axis.
Syntax: plot(x, y, type, xlab, ylab, main)
Where,
● x is the data vector represented on x-axis
● y is the data vector represented on y-axis
● type specifies the type of plot to be drawn. For example, “l” for lines, “p” for points,
“s” for stair steps, etc.
● xlab specifies the label for x-axis
● ylab specifies the label for y-axis
● main specifies the title name of the graph
Program:
# taking input from dataset Orange already
# present in R
orange <- Orange[, c('age', 'circumference')]
# output to be present as PNG file
png(file = "plot.png")
# plotting
plot(x = orange$age, y = orange$circumference, xlab = "Age",
ylab = "Circumference", main = "Age VS Circumference",
col.lab = "darkgreen", col.main = "darkgreen",
col.axis = "darkgreen")
# saving the file
dev.off()
Output:
Box Plot
Box plot shows how the data is distributed in the data vector. It represents five values in the
graph i.e., minimum, first quartile, second quartile(median), third quartile, the maximum
value of the data vector.
Syntax: boxplot(x, xlab, ylab, notch)
where,
● x specifies the data vector
● xlab specifies the label for x-axis
● ylab specifies the label for y-axis
● notch, if TRUE then creates notch on both the sides of the box
Program:
# defining vector with ages of employees
x <- c(42, 21, 22, 24, 25, 30, 29, 22,
23, 23, 24, 28, 32, 45, 39, 40)
# output to be present as PNG file (file name assumed)
png(file = "boxplot.png")
# plotting
boxplot(x, xlab = "Box Plot", ylab = "Age",
col.axis = "darkgreen", col.lab = "darkgreen")
# saving the file
dev.off()
Output:
Result:
Implementation of various charts in R programming is learnt.
EX.NO: 15
DATE :
Predictive Analysis in R Programming
Aim:
To learn about predictive analysis and its applications in R programming.
● Gain competition in the market: With predictive analysis, businesses or companies
can make their way to grow fast and stand out as a competition to other businesses by
finding out their weakness and strengths.
● Learn new opportunities to increase revenue: Companies can create new offers or
discounts based on the pattern of the customers providing an increase in revenue.
● Find areas of weakening: Using these methods, companies can gain back their lost
customers by finding out the past actions taken by the company which customers
didn’t like.
Applications of Predictive Analysis
● Health care: Predictive analysis can be used to study the history of a patient and
thus determine the risks.
● Financial modelling: Financial modelling is another aspect where predictive analysis
plays a major role in finding out the trending stocks helping the business in decision
making process.
● Customer Relationship Management: Predictive analysis helps firms in creating
marketing campaigns and customer services based on the analysis produced by the
predictive algorithms.
● Risk Analysis: While forecasting the campaigns, predictive analysis can show an
estimation of profit and helps in evaluating the risks too.
Example:
Let us take an example of time series analysis, which is a method of predictive analysis in R
programming:
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843, 471497,
936851, 1508725, 2072113)
Forecasting Data:
Now, forecasting sales and revenue based on historical data.
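# lubridate provides decimal_date() and ymd()
library(lubridate)
# forecast provides auto.arima() and forecast()
library(forecast)
# creating time series object
# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
frequency = 365.25 / 7)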
# forecasting model using arima model
fit <- auto.arima(mts)
# Next 5 forecasted values
forecast(fit, 5)
# output to be present as PNG file (file name assumed)
png(file = "forecast.png")
# plotting the graph with next
# 5 weekly forecasted values
plot(forecast(fit, 5), xlab ="Weekly Data of Sales",
ylab ="Total Revenue",
main ="Sales vs Revenue", col.main ="darkgreen")
# saving the file
dev.off()
Output:
Result:
Predictive analysis and its applications in R programming are learnt with examples.
EX.NO: 16
DATE :
OPERATIONS ON LISTS IN R.
AIM
To perform operations on lists in R.
PROCEDURE
List is a data structure having components of mixed data types.
Creating a list
List can be created using the list() function
>x<-list("a"=2.5,"b"=TRUE,"c"=1:3)
Here, we create a list x of three components with data types double, logical and integer vector
respectively.
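Components of such a list can be read and changed by name or by position. A short sketch using the same list x; the added component d is an illustrative assumption.

```r
x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)

# [[ ]] and $ return the component itself
print(x$a)
print(x[["c"]][2])   # second element of the integer vector

# [ ] returns a sub-list that still carries the name
print(x["b"])

# Components can be added or replaced by assignment
x[["d"]] <- "new"
print(names(x))
```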
Its structure can be examined with the str() function.
>str(x)
We can create the same list without the tags as follows. In such a scenario, numeric indices are
used by default.
>x <-list(2.5,TRUE,1:3)
>x
Program 1
p<-c(2,7,8)
q<-c("A","B","C")
x<-list(p,q)
What is the value of x[2]?
Sol:
p <- c(2,7,8)
q <- c("A", "B", "C")
x <- list(p, q)
x[2]
[[1]]
[1] "A" "B" "C"
II Given
w<-c(2,7,8)
v<-c("A","B","C")
x<-list(w,v),
which R statement will replace "A" in x with "K".
w <- c(2, 7, 8)
v <- c("A", "B", "C")
x <- list(w, v)
x[[2]][1] <- "K"
>x
Sol
[[1]]
[1] 2 7 8
[[2]]
[1] "K" "B" "C"
VIII. If x<-list(y=1:10,t="Hello",f="TT",r=5:20), write an R statement that will give the
length of vector r of x.
x<-list(y=1:10,t="Hello",f="TT",r=5:20)
length(x$r)
Sol
[1] 16
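IX. Let string <- "GrandOpening". Write an R statement to split this string into two and return
the output shown below.
Sol
> string <- "GrandOpening"
> # insert a space at the lowercase-to-uppercase boundary, then split on it
> a <- strsplit(gsub("([a-z])([A-Z])", "\\1 \\2", string), " ")
> list(a[[1]][1], a[[1]][2])
[[1]]
[1] "Grand"
[[2]]
[1] "Opening"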
OUTPUT:
EX.NO: 17
DATE:
BUILT-IN FUNCTIONS IN R
AIM
To work with built-in functions in R
PROCEDURE
Built-in Functions
Almost everything in R is done through functions. Here I'm only referring to numeric and
character functions that are commonly used in creating or recoding variables.
Numeric Functions
Function                Description
abs(x)                  absolute value
sqrt(x)                 square root
ceiling(x)              ceiling(3.475) is 4
floor(x)                floor(3.475) is 3
trunc(x)                trunc(5.99) is 5
round(x, digits=n)      round(3.475, digits=2) is 3.48
signif(x, digits=n)     signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)  also acos(x), cosh(x), acosh(x), etc.
log(x)                  natural logarithm
log10(x)                common logarithm
exp(x)                  e^x
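The functions in the table can be tried directly at the console. A short sketch with illustrative inputs chosen so that the results are easy to verify by hand:

```r
# Illustrative calls to the numeric functions listed above
results <- c(
  abs     = abs(-3.5),
  sqrt    = sqrt(16),
  ceiling = ceiling(3.475),
  floor   = floor(3.475),
  trunc   = trunc(5.99),
  round   = round(3.14159, digits = 2),
  signif  = signif(3.475, digits = 2),
  log     = log(exp(1)),
  log10   = log10(1000),
  exp     = exp(0)
)
print(results)
```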
1.Calculate the cumulative sum (’running total’) of the numbers 2, 3, 4, 5, 6. Hint: use
cumsum() Function.
Sol: >sum(2:6)
[1] 20
>cumsum(2:6)
[1] 2 5 9 14 20
2. Print the 1 to10 numbers in reverse order. Hint: use the rev function.
Sol:
>rev(1:10)
[1] 10 9 8 7 6 5 4 3 2 1
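4. Find 10 random numbers between 0 and 100. (Hint: you can use the sample() function)
Sol: >sample(0:100, 10) draws 10 distinct values at random. Calling sample(1:100) without a
size returns all 100 integers in random order, as in the run below:
>sample(1:100)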
[1] 92 86 59 88 19 2 37 23 89 29 18 87 15 30 32 63 14 75
[19] 12 49 72 66 24 20 54 68 48 69 5 99 22 61 83 90 7 94
[37] 81 3 84 43 26 82 80 53 41 27 71 9 38 1 47 10 51 40
[55] 46 44 13 45 100 34 42 79 6 96 4 97 57 28 73 95 91 65
[73] 93 58 39 8 16 17 78 60 36 35 74 85 55 31 76 25 98 70
[91] 33 77 21 56 52 67 50 62 11 64
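Because sample() draws at random, the result differs between runs. The following sketch shows one way to draw exactly 10 values and make the draw repeatable; the seed value 42 and the variable names are arbitrary choices made here, not part of the exercise.

```r
set.seed(42)                                  # any fixed seed makes the draw repeatable
draw_int  <- sample(0:100, 10)                # 10 distinct integers from 0 to 100
draw_real <- runif(10, min = 0, max = 100)    # 10 real numbers between 0 and 100

print(draw_int)
print(draw_real)

# Resetting the same seed reproduces the same integer draw
set.seed(42)
same_again <- identical(draw_int, sample(0:100, 10))
print(same_again)
```

sample() without replacement never repeats a value, whereas runif() returns continuous values, so which one fits depends on whether whole numbers are required.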
5. Calculate and verify the value of x where x = 5, 5*x -> x, x
Sol:
> x <- 5
> 5*x -> x
> x
[1] 25
6. Compute log to the base 10 (log10) of the sqrt of 100. Do not use variables.
Sol: >log10(sqrt(100))
[1] 1
OUTPUT
RESULT
This process has been executed successfully
EX.NO: 18
DATE:
PROGRAM TO CONVERT A GIVEN PH LEVELS OF SOIL TO AN ORDERED
FACTOR.
AIM
To convert given pH levels of soil to an ordered factor in R.
PROCEDURE
Note: Soil pH is a measure of the acidity or basicity of a soil. pH is defined as the negative
logarithm of the activity of hydronium ions in a solution. In soils, it is measured in a slurry of
soil mixed with water, and normally falls between 3 and 10, with 7 being neutral.
PROGRAM
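# pH readings of the soil samples
ph <- c(1,3,10,7,5,4,3,7,8,7,5,3,10,10,7)
print("Original data:")
print(ph)
# Ordered factor with levels 3 < 7 < 10; readings that are not one of these levels become NA
ph_f <- factor(ph, levels=c(3,7,10), ordered=TRUE)
print(ph_f)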
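The program above keeps only the readings that exactly match a listed level. As an alternative sketch, cut() can bin every reading into an ordered class. The cut points (6.5 and 7.5) and the labels below are assumptions made for illustration; they are not taken from this exercise.

```r
ph <- c(1, 3, 10, 7, 5, 4, 3, 7, 8, 7, 5, 3, 10, 10, 7)

# Hypothetical classes: (0, 6.5] acidic, (6.5, 7.5] neutral, (7.5, 14] alkaline
ph_class <- cut(ph,
                breaks = c(0, 6.5, 7.5, 14),
                labels = c("acidic", "neutral", "alkaline"),
                ordered_result = TRUE)

print(ph_class)
print(table(ph_class))                    # number of readings in each class

# Ordered factors support comparisons between levels
first_below_third <- ph_class[1] < ph_class[3]
print(first_below_third)
```

With this binning no reading is turned into NA, and the ordering acidic < neutral < alkaline is carried by the factor itself.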
OUTPUT:
RESULT
This process has been executed successfully
EX.NO: 19
DATE:
CREATING AND MANIPULATING A VECTOR IN R.
AIM
To create and manipulate a vector in R
PROCEDURE
Creating Vector
Vectors are generally created using the c() function. Since a vector must have elements of the
same type, this function will try to coerce elements to the same type if they are
different. Coercion is from lower to higher types: from logical to integer to double to character.
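The coercion order described above can be checked with typeof(). The following is a minimal sketch; the variable names are chosen here for illustration.

```r
t1 <- typeof(c(TRUE, 2L))       # logical + integer   -> "integer"
t2 <- typeof(c(2L, 3.5))        # integer + double    -> "double"
t3 <- typeof(c(3.5, "a"))       # double + character  -> "character"

mixed <- c(TRUE, 2L, 3.5, "a")  # everything is promoted to character

print(c(t1, t2, t3))
print(mixed)                    # "TRUE" "2" "3.5" "a"
```

The highest type present decides the type of the whole vector, which is why a single string turns every element into a string.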
[1] 1.000000 2.333333 3.666667 5.000000
EXERCISE – I
1. Consider two vectors, x and y:
x=c(4,6,5,7,10,9,4,15)
y=c(0,10,1,8,2,3,4,1)
What is the value of x*y and x+y?
Sol:
> x<-c(4,6,5,7,10,9,4,15)
> y<-c(0,10,1,8,2,3,4,1)
>x
[1] 4 6 5 7 10 9 4 15
>y
[1] 0 10 1 8 2 3 4 1
> x*y
[1] 0 60 5 56 20 27 16 15
> x+y
[1] 4 16 6 15 12 12 8 16
3. If x = c(1:12), what is the value of dim(x)? What is the value of length(x)?
Sol:
> x<-c(1:12)
> dim(x)
NULL
> length(x)
[1] 12
>a<-c(12:5)
>typeof(a)
[1] "integer"
>is.numeric(a)
[1] TRUE
6. If x = c('blue', 'red', 'green', 'yellow'), what is the value of is.character(x)?
Sol:
>x<-c ('blue', 'red', 'green', 'yellow')
>typeof(x)
[1] "character"
>is.character(x)
[1] TRUE
EXERCISE - II
[1]13.66473
4. Which day saw the highest rainfall? Hint: which.max()
Sol
rainfall
[1] 0.1 0.6 33.8 1.9 9.6 4.3 33.7 0.3 0.0 0.1
max(rainfall)
[1] 33.8
which.max(rainfall)
[1] 3
5. Compute the problem sum((x-mean(x))^2).
Sol
x<-c(1:10)
sum((x-mean(x))^2)
[1] 82.5
Read the 'before' and 'after' values into two different vectors called
before and after. Use R to evaluate the amount of weight lost for each
participant. What is the average amount of weight lost?
Sol
before
[1] 78 72 78 79 105
after
[1] 67 65 79 70 93
weightlost<-before-after
weightlost
[1] 11 7 -1 9 12
mean(weightlost)
[1] 7.6
RESULT
This process has been executed successfully
EX.NO: 20
DATE:
CLASSIFICATION MODEL
AIM:
To build and evaluate a classification model in R.
PROCEDURE:
Before we dive into classification, let us take a look at what supervised learning is.
Suppose you are learning a new concept in maths. After solving a problem, you may
refer to the solutions to see whether you were right. Once you are confident in your ability to
solve a particular type of problem, you stop referring to the answers and solve the questions
put before you by yourself.
This is also how Supervised Learning works with machine learning models. In
Supervised Learning, the model learns by example. Along with our input variable, we also give
our model the corresponding correct labels. While training, the model gets to look at which
label corresponds to our data and hence can find patterns between our data and those labels.
Some examples of supervised learning include:
● Spam detection, by teaching a model which mail is spam and which is not.
● Speech recognition, where you teach a machine to recognize your voice.
● Object recognition, by showing a machine what an object looks like and having it pick
that object from among other objects.
Figure 1: Supervised Learning Subdivisions
Classification algorithms used in machine learning use input training data to predict
the likelihood or probability that subsequent data will fall into one
of the predetermined categories. One of the most common applications of classification is
filtering emails into "spam" or "non-spam", as used by today's top email service providers.
Classification is the process of predicting a categorical label of a data object based on
its features and properties. In classification, we locate identifiers or boundary conditions that
correspond to a particular label or category. We then try to place various unknown objects into
those categories, by using the identifiers. An example of this would be to predict the type of
water (mineral, tap, smart, etc.), based on its purity and mineral content.
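The idea of predicting a categorical label from features can be sketched with logistic regression from base R. This is an illustration only: the choice of the built-in iris data, the two petal features, and the 0.5 threshold are assumptions made here, and the accuracy is measured on the training data itself.

```r
# Two-class subset of the built-in iris data (chosen here only for illustration)
two <- droplevels(subset(iris, Species != "setosa"))

# Logistic regression: models the probability that a flower is "virginica"
fit <- glm(Species ~ Petal.Length + Petal.Width, data = two, family = binomial)

# Predicted probabilities, thresholded at 0.5 to obtain a class label
prob <- predict(fit, type = "response")
pred <- factor(ifelse(prob > 0.5, "virginica", "versicolor"),
               levels = levels(two$Species))

# Compare predicted labels with the known labels
print(table(predicted = pred, actual = two$Species))
accuracy <- mean(pred == two$Species)
print(accuracy)
```

Here the fitted coefficients play the role of the "boundary conditions" described above: a flower is assigned to whichever side of the decision boundary its features fall on.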
1. R Logistic Regression
2. Decision Trees in R
3. Support Vector Machines in R
4. Naive Bayes Classifier
5. Artificial Neural Networks in R
6. K – Nearest Neighbor in R
1. Logistic regression
● Weather forecast
● Word classification
● Symptom classification
2. Decision trees
● Pattern recognition
● Pricing decisions
● Data exploration
3. Support Vector Machines
● Investment suggestions
● Stock comparison
4. Naive Bayes Classifier
● Spam filters
● Disease prediction
● Document classification
5. Artificial Neural Networks
● Handwriting analysis
● Object recognition
● Voice recognition
6. k-Nearest Neighbor
Unsupervised learning is a type of algorithm that learns patterns from untagged data.
The hope is that through mimicry, which is an important mode of learning in people, the
machine is forced to build a compact internal representation of its world and then generate
imaginative content from it.
In contrast to supervised learning where data is tagged by an expert, e.g. as a "ball" or
"fish", unsupervised methods exhibit self-organization that captures patterns as probability
densities or a combination of neural feature preferences.
The other levels in the supervision spectrum are reinforcement learning where the
machine is given only a numerical performance score as guidance, and semi-supervised
learning where a smaller portion of the data is tagged. Two broad methods in Unsupervised
Learning are Neural Networks and Probabilistic Methods.
● Sentiment Analysis
● Email Spam Classification
● Document Classification
● Image Classification
Types of Unsupervised Machine Learning Techniques
Unsupervised learning problems are further grouped into clustering and association problems.
Clustering
Association
Association rules allow you to establish associations amongst data objects inside large
databases. This unsupervised technique is about discovering interesting relationships between
variables in large databases. For example, people who buy a new home are most likely to buy new
furniture.
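Clustering, the other group of unsupervised problems named above, can be sketched with kmeans() from base R. The choice of the iris measurements, three clusters, and the seed value are assumptions made here for illustration; the species labels are used only afterwards, to inspect the clusters.

```r
set.seed(7)                                  # k-means starts from random centres

# Use only the four numeric measurements; the species labels are not given to the algorithm
features <- scale(iris[, 1:4])               # standardise so every column has equal weight

# Partition the rows into 3 clusters, keeping the best of 20 random starts
km <- kmeans(features, centers = 3, nstart = 20)

print(km$size)                               # number of rows in each cluster

# Cross-tabulate clusters against the known species to see how well they line up
print(table(cluster = km$cluster, species = iris$Species))
```

Unlike the classification code, nothing here is trained on labels: the groups emerge from the distances between observations alone.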
Coding:
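library(caret) # "lda" needs MASS, "svmRadial" needs kernlab, "rf" needs randomForest
df <- iris
levels(df$Species)
summary(df)
# Create a list of 80% of the rows in the original dataset we can use for training
set.seed(7)
inTraining <- createDataPartition(df$Species, p = 0.8, list = FALSE)
# The next four definitions are missing from the original listing; the values are assumptions
validation <- df[-inTraining, ] # held-out 20% of the rows for validation
df <- df[inTraining, ] # remaining 80% of the rows for training
control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
metric <- "Accuracy"
#LDA
set.seed(7)
fit.lda <- train(Species~., data=df, method="lda", metric=metric, trControl=control)
predictions <- predict(fit.lda, validation)
predictions
confusionMatrix(predictions, validation$Species)
#KNN
set.seed(7)
fit.knn <- train(Species~., data=df, method="knn", metric=metric, trControl=control)
#c)advanced algorithms
#SVM
set.seed(7)
fit.svm <- train(Species~., data=df, method="svmRadial", metric=metric, trControl=control)
#Random Forest
set.seed(7)
fit.rf <- train(Species~., data=df, method="rf", metric=metric, trControl=control)
# results is undefined in the original listing; assumed to collect the resampled accuracies
results <- resamples(list(lda=fit.lda, knn=fit.knn, svm=fit.svm, rf=fit.rf))
dotplot(results)
importance <- varImp(fit.rf)
plot(importance)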
Cart result:
RESULT
Thus, the process has been executed successfully