Data Science Lab Manual

The document discusses reading and writing different types of data in R. It describes functions like read.table(), read.csv(), readLines() for reading data and write.table() for writing tabular data to files. It also discusses the readxl package for reading Excel files into R and functions to read data from text, CSV and Excel files as well as write data to files in R.

Uploaded by

mmrmathsiubd
Copyright
© © All Rights Reserved

MUFFAKHAM JAH COLLEGE OF ENGINEERING AND TECHNOLOGY
(Affiliated to Osmania University and Recognized by AICTE)
Mount Pleasant, 8-2-249, Road No. 3, Banjara Hills, Hyderabad, Telangana-500034.

DEPARTMENT OF COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE (CS&AI)

DATA SCIENCE LAB (PC453AD)

LAB MANUAL
B.E IV SEM (2021-2022)
INDEX
S.NO.  DATA SCIENCE LAB LIST OF PROGRAMS  CO

1  Write R program for calculator application  CO2
2  Write R program for performing descriptive statistics  CO1
   a) Using summary()  CO1
   b) Using subset()  CO1
3  Write R program for reading and writing different types of data sets  CO2
   a) Reading different types of data sets (.txt, .csv) from web and disk and writing in specific disk location
   b) Reading Excel data set in R
4  Write R program for visualizations  CO2
5  Write R program to find Correlation and Covariance  CO2
6  Write R program for Regression Modeling  CO2
7  Write R program to build classification model using KNN algorithm  CO3
8  Write R program to build clustering model using K-means algorithm  CO3

BEYOND SYLLABUS PROGRAMS

9  Write R program to read an XML file  CO2

Experiment – 1
Calculator Application
R operators – R has many operators for carrying out mathematical and logical
operations. Operators in R can mainly be classified into the following categories.
1. Arithmetic operators - +, -, *, /
2. Assignment operators – <- , ->
3. Relational operators - <, >, ==, !=, <=, >=
4. Logical operators - !, &, &&, |, ||

We use the four fundamental arithmetic operations of mathematics to build a
calculator application. Those operations are –
1. Addition
2. Subtraction
3. Multiplication
4. Division

User-defined Functions in R – In R programming, user-defined functions are functions
that the user creates for a specific purpose that the built-in functions of R do not
cover.
Syntax -
functionName <- function(arguments) {
  commands to perform
}
Parameters –
functionName: the name under which the function is stored and later called
function(arguments): the arguments (input variables) of the function are listed here
commands to perform: the block of code that forms the body of the function
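
As a minimal sketch of the syntax above, the following defines and calls one small function. The name safe_divide and its zero check are hypothetical illustrations and are not part of the lab program.

```r
# Hypothetical helper: divides x by y, but returns NA (with a warning)
# instead of Inf when the divisor is zero.
safe_divide <- function(x, y) {
  if (y == 0) {
    warning("division by zero; returning NA")
    return(NA_real_)
  }
  x / y  # the value of the last expression is returned
}

safe_divide(10, 4)   # 2.5
result <- suppressWarnings(safe_divide(1, 0))
is.na(result)        # TRUE
```

Returning the value rather than printing it lets the caller store the result and reuse it in later calculations.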

1. Aim: To implement Calculator Application in R

a. Performing calculations on the console with and without R objects
b. Using mathematical functions on the console
c. Writing an R script that creates an R object for the calculator application and saves it in a
specified location on disk.

Program:

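# Arithmetic on the console without R objects
1 + 2
3 - 1
4 * 2
5 / 2

# Arithmetic on the console with R objects
a <- 1
b <- 4
c <- 2
a + b
a - b
a * b
b / c

# User-defined functions for the four operations
add <- function(x, y) {
  print(x + y)
}
add(2, 3)

subt <- function(x, y) {
  print(x - y)
}
subt(7, 2)

mul <- function(x, y) {
  print(x * y)
}
mul(6, 3)

div <- function(x, y) {
  print(x / y)
}
div(10, 2)

# Menu-driven calculator: read the operation and the operands from the user
choice <- readline(prompt = "Enter add for addition
subt for subtraction
mul for multiplication
div for division
Choice: ")
num1 <- readline(prompt = "Enter first number : ")
num2 <- readline(prompt = "Enter second number : ")
# readline() returns text, so convert to numbers (as.numeric also accepts decimals)
num1 <- as.numeric(num1)
num2 <- as.numeric(num2)
# switch() evaluates the expression whose name matches choice;
# the unnamed last argument is the fallback for any other input
cal <- switch(choice,
              "add"  = num1 + num2,
              "subt" = num1 - num2,
              "mul"  = num1 * num2,
              "div"  = num1 / num2,
              stop("Unknown choice: ", choice))
print(cal)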

Output –

Experiment – 2
Descriptive Analysis
Dataset – mtcars
Description – The mtcars dataset is a built-in dataset in R that contains fuel consumption
and 10 aspects of automobile design and performance for 32 cars (11 variables in all). The
data was extracted from the 1974 Motor Trend US magazine.
Attributes –
1. mpg
2. cyl
3. disp
4. hp
5. drat
6. wt
7. qsec
8. vs
9. am
10. gear
11. carb

Dataset – cars
Description – This built-in dataset contains 50 observations of 2 variables. It records the
speed of cars and the distance taken to stop.
Attributes –
1. speed
2. dist
Dataset – iris
Description – The data set contains 3 classes of 50 instances each, where each class refers to
a type of iris plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.

Attributes -

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. species:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica

Subset function –

The subset() function in R programming is used to create a subset of vectors, matrices,
or data frames.

Syntax – subset(x, subset, select)

Parameters –

• x: indicates the object
• subset: indicates the logical expression on the basis of which subsetting has to
be done
• select: indicates columns to select
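
A minimal sketch of subset() on the built-in iris data. The 7 cm threshold and the chosen columns are arbitrary illustrations.

```r
data(iris)

# Keep only the rows whose sepal length exceeds 7 cm,
# and only the Sepal.Length and Species columns.
long_sepals <- subset(iris, Sepal.Length > 7, select = c(Sepal.Length, Species))
head(long_sepals)
```

Inside subset() the column names can be written directly, without the iris$ prefix.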
Aggregate function –
The aggregate() function splits the data into groups and computes a summary statistic for
each group, so it is often used to derive descriptive statistics.

Syntax – aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)

Parameters –
• x: R object
• by: list of grouping variables
• FUN: function to be applied to compute the summary statistic
• ...: additional arguments to be passed to FUN
• simplify: whether to simplify results as much as possible or not
• drop: whether to drop unused combinations of grouping values or not
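
The two common ways of calling aggregate() can be sketched on the iris data as follows. The choice of columns and of mean as the summary function is only an example.

```r
data(iris)

# Data-frame method: x is the data to summarise, by is a named list of groups.
by_species <- aggregate(x = iris[, 1:4], by = list(Species = iris$Species), FUN = mean)
by_species

# Formula method: mean petal length for each species.
petal_means <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
petal_means
```

Both calls return a data frame with one row per species.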
mean() function – Calculates the arithmetic mean of all the observations of the specified
attribute.
min() function – Returns the smallest observation in the data being used.
max() function - Returns the largest observation in the data being used.
summary() function – Shows a summary of each attribute separately. For numeric attributes the
summary consists of the minimum value, 1st quartile, median, mean, 3rd quartile and maximum
value.
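
A short illustration of these functions on a small made-up vector (the values are chosen only so that the results are easy to check by hand):

```r
x <- c(2, 4, 4, 5, 10)  # made-up sample values

mean(x)     # 5
min(x)      # 2
max(x)      # 10
summary(x)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
```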

2. Aim: To perform Descriptive Statistics in R

a. To write an R script to find basic descriptive statistics using the summary, str and quantile
functions on the mtcars dataset
b. To apply the above functions on the cars dataset
c. To apply subset(), aggregate() functions on iris dataset.

Datasets used:
1. mtcars
2. cars
3. iris

Program :

a. Descriptive Statistics Analysis on mtcars dataset

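data(mtcars)
head(mtcars)              # first 6 rows
tail(mtcars)              # last 6 rows
head(mtcars, 10)          # first 10 rows
str(mtcars)               # structure: type of each column
mtcars[1]                 # first column
mtcars[15, ]              # row 15 (mtcars has only 11 columns, so mtcars[15] would fail)
mtcars[1:4]               # columns 1 to 4
mtcars[c(1, 4)]           # columns 1 and 4
mtcars[-2]                # all columns except the second
max(mtcars$cyl)
min(mtcars$mpg)
mean(mtcars$mpg)
median(mtcars$mpg)
quantile(mtcars$mpg)      # quartiles of mpg
summary(mtcars)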

Output:

b. Descriptive Statistics Analysis on cars dataset

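data(cars)
head(cars, 10)
tail(cars, 20)
str(cars)
head(cars)
max(cars)                 # largest value in the whole data frame
max(cars$speed)
min(cars$speed)
mean(cars$speed)
median(cars$speed)
mode(cars$speed)          # storage mode ("numeric"), not the statistical mode
names(which.max(table(cars$speed)))   # most frequent speed (statistical mode)
summary(cars$speed)
summary(cars)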

Output:

c. Applying subset and aggregate functions on iris dataset

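data(iris)
head(iris)
tail(iris)
# rows whose sepal length is exactly 6.1 cm
subset(iris, Sepal.Length == 6.1)
# mean of every numeric column for each species
aggregate(. ~ Species, data = iris, mean)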

Output –

Experiment – 3
Reading and writing different types of data
Package used – readxl

The readxl package makes it easy to get data out of Excel and into R. Compared to
many of the existing packages, readxl has no external dependencies, so it's easy to install and
use on all operating systems. It is designed to work with tabular data.
Functions for Reading Data into R –

There are a few very useful functions for reading data into R.

1. read.table() and read.csv() are two popular functions used for reading tabular data
into R.
2. read.delim() is used to read delimited (by default tab-separated) text files.
3. readLines() is used for reading lines from a text file.
4. source() is a very useful function for reading in R code files from another R
program.
5. dget() function is also used for reading in R code files.
6. load() function is used for reading in saved workspaces.
7. unserialize() function is used for reading single R objects in binary format.

Functions for Writing Data to Files –

There are similar functions for writing data to files.

1. write.table() is used for writing tabular data to text files (e.g. CSV).
2. writeLines() function is useful for writing character data line-by-line to a file or
connection.
3. dump() is a function for dumping a textual representation of multiple R objects.
4. dput() function is used for outputting a textual representation of an R object.
5. serialize() is used for converting an R object into a binary format for outputting to a
connection.
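
A minimal sketch of how the reading and writing functions pair up. The data frame is made up, and tempfile() is used here only so that the example does not depend on any particular disk location.

```r
# A small made-up data frame
st <- data.frame(id = 1:3, name = c("a", "b", "c"), marks = c(20, 30, 40))

# Tabular data: write to a CSV file and read it back
csv_path <- tempfile(fileext = ".csv")   # temporary file, assumed location
write.csv(st, csv_path, row.names = FALSE)
st_back <- read.csv(csv_path)
st_back

# Character data: write and read line by line
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)
lines_back <- readLines(txt_path)
lines_back
```

Passing row.names = FALSE keeps write.csv() from adding an extra column of row numbers to the file.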

3. Aim: To read and write different types of datasets

a. To read different types of datasets from the web and from disk and write them to a file in a
specific disk location.
b. To read an Excel data sheet in R.

name=c("a","b","c","d","e")
marks=c(20,30,40,10,15)
id=c(1:5)
st=data.frame(id,name,marks)
View(st)

#1. writing data frame into CSV file
# (the data frame created above is named st)
write.csv(st,"student.csv",row.names=FALSE)

#2. reading CSV file
st1=read.csv("student.csv")
View(st1)

#3. writing data frame to a text file
# sep="\t" writes tab-separated values, which read.delim() expects
write.table(st1,file="st1.txt",sep="\t",quote=F,row.names=F)

#4. reading from text
st2=read.delim('st1.txt')
View(st2)

#5. reading a file from web
webfile = read.delim("http://www.sthda.com/upload/boxplot_format.txt")
print(webfile)
head(webfile)
write.table(webfile,file="webfile.txt",quote=F,row.names=FALSE)

# install package readxl first (needed only once)
install.packages("readxl")
library(readxl)

#6. reading excel datasheet (second sheet of the workbook)
df=read_excel("d:/ex1.xlsx",sheet=2)
View(df)

Output:
> name=c("a","b","c","d","e")
> marks=c(20,30,40,10,15)
> id=c(1:5)
> st=data.frame(id,name,marks)
> View(st)
> #1. writing data frame into CSV file
> write.csv(st,"student.csv",row.names=FALSE)
> #2. reading CSV file
> st1=read.csv("student.csv")
> View(st1)
> #3. writing data frame to a text file
> write.table(st1,file="st1.txt",sep="\t",quote=F,row.names=F)
>
> #4. reading from text
> st2=read.delim('st1.txt')
> View(st2)
> #5. reading a file from web
> webfile = read.delim("http://www.sthda.com/upload/boxplot_format.txt")
> print(webfile)
Nom variable Group
1 IND1 10 A
2 IND2 7 A
3 IND3 20 A
4 IND4 14 A
5 IND5 14 A
6 IND6 12 A
7 IND7 10 A
8 IND8 23 A
9 IND9 17 A
10 IND10 20 A
11 IND11 14 A
12 IND12 13 A
13 IND13 11 B
14 IND14 17 B
15 IND15 21 B
16 IND16 11 B
17 IND17 16 B
18 IND18 14 B
19 IND19 17 B
20 IND20 17 B
21 IND21 19 B
22 IND22 21 B
23 IND23 7 B
24 IND24 13 B
25 IND25 0 C
26 IND26 1 C
27 IND27 7 C
28 IND28 2 C
29 IND29 3 C
30 IND30 1 C
31 IND31 2 C
32 IND32 1 C
33 IND33 3 C
34 IND34 0 C
35 IND35 1 C
36 IND36 4 C
37 IND37 3 D
38 IND38 5 D
39 IND39 12 D
40 IND40 6 D
41 IND41 4 D
42 IND42 3 D
43 IND43 5 D
44 IND44 5 D
45 IND45 5 D
46 IND46 5 D
47 IND47 2 D
48 IND48 4 D
49 IND49 3 E
50 IND50 5 E
51 IND51 3 E
52 IND52 5 E
53 IND53 3 E
54 IND54 6 E
55 IND55 1 E
56 IND56 1 E
57 IND57 3 E

58 IND58 2 E
59 IND59 6 E
60 IND60 4 E
61 IND61 11 F
62 IND62 9 F
63 IND63 15 F
64 IND64 22 F
65 IND65 15 F
66 IND66 16 F
67 IND67 13 F
68 IND68 10 F
69 IND69 26 F
70 IND70 26 F
71 IND71 24 F
72 IND72 13 F
> head(webfile)
Nom variable Group
1 IND1 10 A
2 IND2 7 A
3 IND3 20 A
4 IND4 14 A
5 IND5 14 A
6 IND6 12 A
> write.table(webfile,file="webfile.txt",quote=F,row.names=FALSE)
>
> # install package readxl first
>
> install.packages("readxl")
Error in install.packages : Updating loaded packages
> library(readxl)
>
> #6. reading excel datasheet
> df=read_excel("d:/ex1.xlsx",sheet=2)
> View(df)

Experiment – 4
Visualization
Data visualization is an efficient technique for gaining insight about data through a
visual medium. With the help of visualization techniques, we can easily obtain information
about hidden patterns in data and also we can work with large datasets to efficiently obtain
key insights.
Dataset used – mtcars
Description – The mtcars dataset is a built-in dataset in R that contains fuel consumption
and 10 aspects of automobile design and performance for 32 cars (11 variables in all). The
data was extracted from the 1974 Motor Trend US magazine.
Attributes –
1. mpg
2. cyl
3. disp
4. hp
5. drat
6. wt
7. qsec
8. vs
9. am
10. gear
11. carb

Package - ggplot2
The ggplot2 package allows us to create graphics declaratively. It is known for its elegant,
high-quality graphs, which sets it apart from other visualization packages.
Boxplot – The boxplot() function is used to create a boxplot. A boxplot shows how the values
in a data set are distributed. The graph represents the minimum, first quartile, median,
third quartile and maximum of the data set.
Syntax – boxplot(x, data, names, main)

parameters –
• x is a vector or a formula.
• data is the data frame.
• names are the group labels which will be printed under each boxplot.
• main is used to give a title to the graph.
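
As a sketch of the formula form of boxplot(), the following draws one box of mileage per cylinder count from mtcars. The group labels and the title are illustrative choices.

```r
data(mtcars)

# Formula interface: one box of mpg for each number of cylinders.
# boxplot() also returns the statistics it draws, invisibly.
bp <- boxplot(mpg ~ cyl, data = mtcars,
              names = c("4 cyl", "6 cyl", "8 cyl"),
              main = "Mileage by number of cylinders")

# Rows of bp$stats: lower whisker, first quartile (hinge), median,
# third quartile (hinge), upper whisker; one column per group.
bp$stats
```

The third row of bp$stats holds the medians, which is the line drawn inside each box.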

Scatterplot –
Scatter plots are used to compare variables. A comparison between variables is
required when we need to find out how much one variable is affected by another variable.
Syntax – plot(x, y, main, xlab, ylab)
Parameters –
• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• main is the title of the graph.
• xlab is the label of the horizontal axis.
• ylab is the label of the vertical axis.

Outliers using plots –
An outlier is a point or set of points that differs markedly from the other points; it may be
very high or very low. It is often a good idea to detect and remove outliers, because
outliers are one of the primary causes of a less accurate model.
Outliers can often be seen in a box plot, where they are drawn as individual points beyond
the whiskers.
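
A minimal sketch of detecting outliers numerically with the same rule that the box plot uses. The sample values are made up, with 400 added deliberately as an extreme point.

```r
# Made-up sample: 400 is far away from the other values
v <- c(50, 25, 30, 12, 78, 99, 400)

# boxplot.stats() applies the rule boxplot() uses by default: points more than
# 1.5 times the box length beyond the box are reported in $out
outliers <- boxplot.stats(v)$out
outliers

# Drop the flagged values before further analysis
v_clean <- v[!v %in% outliers]
v_clean
```

Whether a flagged value should really be removed depends on the data; the rule only marks candidates.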
R Histogram
A histogram is a type of bar chart that shows how many values fall into each of a set of
consecutive value ranges (bins). For creating a histogram, R provides the hist()
function. The histogram is used to examine the distribution of a variable.
Syntax - hist(v, main, xlab, col)
Parameters –
• v is a vector containing numeric values used in histogram.
• main indicates title of the chart.
• xlab is used to give description of x-axis.
• col is used to set color of the bars.
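
To make the idea of bins concrete, the sketch below fixes the bin edges and inspects the counts that hist() computes. The breaks argument is an assumed choice for illustration; without it R picks the bins itself.

```r
H <- c(9, 13, 28, 36, 4, 54, 99, 98)

# breaks is an assumed choice of bin edges: 0-20, 20-40, ..., 80-100
h <- hist(H, breaks = seq(0, 100, by = 20),
          main = "Histogram", xlab = "Value", col = "blue")

h$breaks  # bin edges
h$counts  # number of values in each bin: 3 2 1 0 2
```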

R Bar Charts
A bar chart is a pictorial representation in which numerical values of variables are
represented by length or height of lines or rectangles of equal width. R provides the
barplot() function.
Syntax – barplot(H, xlab, ylab, main, names.arg, col)

Parameters –
• H is a vector or matrix containing numeric values used in bar chart.
• xlab is the label for x axis.
• ylab is the label for y axis.
• main is the title of the bar chart.
• names.arg is a vector of names appearing under each bar.
• col is used to give colours to the bars in the graph.
R Pie Charts
A pie chart is a representation of values in the form of slices of a circle with different colors.
Pie charts are created with the help of the pie() function, which takes a vector of positive
numbers as input.
Syntax - pie(x, labels, main, col)
Parameters –
• x is a vector containing the numeric values used in the pie chart.
• labels is used to give description to the slices.
• main indicates the title of the chart.
• col indicates the colour palette.
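
A small sketch of pie() with percentage labels. The variable names, the colours and the label format are illustrative assumptions.

```r
marks <- c(90, 78, 80, 25)
subjects <- c("OS", "DBMS", "Java", "OE")

# Share of each slice as a whole-number percentage
pct <- round(100 * marks / sum(marks))
slice_labels <- paste0(subjects, " (", pct, "%)")

pie(marks, labels = slice_labels, main = "piechart",
    col = c("red", "green", "blue", "yellow"))
```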

4. Aim: To perform visualizations
a. To find the data distribution using box and scatter plot
b. To find the outliers using plot.
c. To plot the histogram, bar chart and pie chart on sample data.

Dataset used: mtcars

Program:
#Line plot
x=1:10
y=x^2
plot(x,y,type="l",main="Line Plot Example")
#installing package (needed only once; the base R plots below do not use it)
install.packages("ggplot2")
#scatter plot
data("mtcars")
plot(
  mtcars$wt,mtcars$mpg,
  main = "scatter plot example",
  xlab = "car weight",
  ylab = "miles per gallon"
)
#box plot
boxplot(
  mtcars$mpg,
  main = "box plot example",
  ylab = "miles per gallon"
)
#outliers: points beyond the whiskers would be drawn individually
#(none of these sample values lies that far out)
v<-c(50,25,30,12,78,99)
boxplot(v,main="outliers")
#Histogram
H<-c(9,13,28,36,4,54,99,98)
hist(H,main="Histogram",col="blue")
#Barchart
h<-c(9,13,28,36,4,54)
m<-c("MAR","APR","MAY","JUN","JUL","AUG")
barplot(h,names.arg=m,xlab="Month",ylab="revenue",main="barchart",border="blue")
#pie chart
h<-c(90,78,80,25)
m<-c("OS","DBMS","Java","OE")
pie(h,m,main = "piechart")

Output:

Experiment – 5
Correlation and Covariance
Correlation and Covariance are terms used in statistics to measure the relationship
between two random variables. Both measure the linear dependency between a
pair of random variables or bivariate data.
Correlation in R Programming Language –
The cor() function in R programming computes the correlation coefficient. Correlation is
a statistical measure, derived from the covariance, of how strongly two
vectors are related. Mathematically,

$$\operatorname{cor}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}$$

where,
x represents the x data vector
y represents the y data vector
$\bar{x}$ and $\bar{y}$ represent the means of x and y
N represents total observations
Syntax: cor(x, y, method)
where,
• x and y represent the data vectors
• method defines the type of correlation coefficient to be computed: "pearson" (the default),
"kendall" or "spearman".

Covariance in R Programming Language –
In R programming, covariance can be measured using the cov() function. Covariance is a
statistical term that measures the direction of the linear relationship between the data
vectors. Mathematically, the sample covariance that cov() computes is

$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

Syntax: cov(x, y, method)
where,
• x and y represent the data vectors
• method defines the type of method to be used to compute covariance.
• N represents total observations
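
The relationship between the two measures can be checked numerically. The sketch below uses two small made-up vectors and computes covariance and correlation by hand, assuming the N - 1 (sample) denominator that R's cov() uses.

```r
x <- c(1, 2, 3, 4, 5)   # made-up data
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Covariance from the definition, with the N - 1 denominator that cov() uses
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation is the covariance scaled by both standard deviations
cor_manual <- cov_manual / (sd(x) * sd(y))

all.equal(cov_manual, cov(x, y))   # TRUE
all.equal(cor_manual, cor(x, y))   # TRUE
```

Because of the scaling, the correlation always lies between -1 and 1, while the covariance depends on the units of x and y.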

Package – corrplot

The R package corrplot provides a visual exploratory tool for correlation matrices that
supports automatic variable reordering to help detect hidden patterns among
variables.

corrplot is very easy to use and provides a rich array of plotting options in visualization
method, graphic layout, color, legend, text labels, etc. It also provides p-values and
confidence intervals to help users determine the statistical significance of the
correlations.

corrplot() - The most commonly used parameters include method, type, order and diag.
Correlation matrix –
A correlation matrix is a table of correlation coefficients for a set of variables, used to
determine whether a relationship exists between the variables. Each coefficient indicates both the
strength of the relationship and its direction.
Syntax: cor(x, use = , method = )
Parameters:
• x: a numeric matrix or a data frame.
• use: specifies how missing data is handled (for example "everything" or "complete.obs").
• method: the type of correlation coefficient ("pearson", "kendall" or "spearman").
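
A brief sketch of a correlation matrix on a few mtcars columns. The three columns are picked only for illustration.

```r
data(mtcars)

# Three numeric columns chosen only for illustration
vars <- mtcars[, c("mpg", "wt", "hp")]

# use = "complete.obs" drops rows with missing values before computing
cm <- cor(vars, use = "complete.obs", method = "pearson")
round(cm, 2)
```

The matrix is symmetric with ones on the diagonal, since every variable is perfectly correlated with itself.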

Dataset - iris
Description –
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT linearly separable
from each other.

Attributes -

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. species:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica

Analysis of variance (ANOVA) –
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits the observed
aggregate variability found inside a data set into two parts: systematic factors and random
factors. The systematic factors have a statistical influence on the given data set, while the
random factors do not.
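
As a minimal sketch, a one-way ANOVA on the iris data can be run with aov(). The choice of Sepal.Length as the response is an assumption made for illustration.

```r
data(iris)

# One-way ANOVA: does mean sepal length differ between the three species?
fit <- aov(Sepal.Length ~ Species, data = iris)
anova_table <- summary(fit)[[1]]
anova_table

# p-value of the Species term; a small value indicates that at least one
# species mean differs from the others
p_value <- anova_table[["Pr(>F)"]][1]
p_value
```

Here Species is the systematic factor and the residual row of the table captures the random variation.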

5. Aim: To Calculate Correlation and Covariance

a. To find the correlation matrix.
b. To draw the correlation plot for the iris data and visualize an overview of the
relationships among its variables.
c. To perform analysis of variance (ANOVA), if the data have categorical variables, on the iris
data.
Dataset used: iris
Program:
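install.packages('corrplot')   # needed only once
# correlation and covariance of two random vectors
# (10 values each: with only 2 values the correlation is always +1 or -1)
x<-rnorm(10)
x
y<-rnorm(10)
y
mat<-cbind(x,y)
mat
cor(mat)
cov(mat)
# correlation matrix of the four numeric iris columns
data(iris)
iris
mydata<-iris[,c(1,2,3,4)]
mydata
str(mydata)
d1<-cor(mydata)
d1
library(corrplot)
corrplot(d1,method="circle")
# scatter plot matrix, one colour per species
color<-c('red','green','blue')[iris$Species]
pairs(mydata,col=color,bg=color,pch=21)
cov(iris$Petal.Length,iris$Petal.Width)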

Output:

Experiment – 6
Regression Model
Dataset – crashdata.csv
Description – This dataset has 80 observations of 6 variables.
Attributes –
1. ManHI
2. ManBI
3. IntI
4. HVACi
5. Safety
6. CarType

Dataset – crashdataset.csv
Description – This dataset has 20 observations of 6 variables.
Attributes –
1. ManHI
2. ManBI
3. IntI
4. HVACi
5. Safety
6. CarType
GLM – glm() is used to fit generalised linear models, specified by giving a symbolic
description of the linear predictor and a description of the error distribution.
Syntax - glm(formula, family, data)
Parameters –
• formula: a symbolic description of the model, for example response ~ predictors
• family: the error distribution; the available families include binomial, poisson, gaussian,
Gamma and quasi
• data: the dataset being used
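
Since the crash datasets are not included here, the following sketch shows the same glm() workflow on the built-in mtcars data instead. The model am ~ wt and the 0.5 cut-off are assumptions chosen for illustration.

```r
data(mtcars)

# Logistic regression: am (0 = automatic, 1 = manual) explained by weight
fit <- glm(am ~ wt, family = binomial, data = mtcars)
summary(fit)

# Fitted probabilities that a car has a manual transmission
prob <- predict(fit, type = "response")

# Classify with a 0.5 cut-off and compare with the actual values
predicted <- ifelse(prob > 0.5, 1, 0)
table(Predicted = predicted, Actual = mtcars$am)
```

With family = binomial, predict(..., type = "response") returns probabilities rather than values on the log-odds scale.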
Package used – caret
caret (short for Classification And REgression Training) is one of the most widely used
machine learning packages in R. It follows a consistent syntax for data preparation, model
building, and model evaluation, which makes these tasks easier for data science practitioners.

6. Aim: To evaluate the performance of Regression Model
a. Import data from web storage and name the dataset.
b. Perform Logistic Regression to find out the relation between the variables in the model.
c. Check whether the model is fit or not [require(foreign), require(MASS)].

Datasets used are
crashdata.csv
crashdataset.csv

Program:
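#logistic regression
mydata <- read.csv('crashdata.csv')
mytestdata <- read.csv('crashtestdata.csv')
mydata
mytestdata
str(mydata)
summary(mydata)
# the response must be a factor for a binomial model
mydata[6] <- as.factor(mydata$CarType)
mydata
# CarType ~ . uses every other column as a predictor
# (writing mydata$CarType ~ . would also put CarType among the predictors)
fit <- glm(formula=CarType~.,family='binomial', data=mydata)
fit
summary(fit)
# fitted probabilities on the training data
train <- predict(fit, type='response')
plot(train)
tapply(train, mydata$CarType, mean)
# predicted probabilities on the test data
pred <- predict(fit,newdata = mytestdata, type='response')
plot(pred)
# classify with a 0.5 cut-off
mytestdata[pred<=0.5,'Predict'] <- 'Hatchback'
mytestdata[pred>0.5,'Predict'] <- 'SUV'
mytestdata
#install.packages("caret") run on console
library(caret)
# column 7 holds the predictions, column 6 the actual car type
confusionMatrix(table(mytestdata[,7],mytestdata[,6]),positive='Hatchback')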

Output:

Experiment – 7
Classification Model
Packages for classification:
1. Caret package –
caret (short for Classification And REgression Training) is one of the most widely used
machine learning packages in R. It follows a consistent syntax for data preparation, model
building, and model evaluation, which makes these tasks easier for data science practitioners.
2. Class package –
The class package provides functions for classification, including knn() for
k-nearest neighbour classification, which is used in this experiment.
Dataset – Servicetraindata.csv
Description – This data set contains 315 observations of 6 variables.
Attributes –
1. OilQual
2. EnginePerf
3. NormMileage
4. TypeWear
5. HVACwear
6. Service
Dataset – Servicetestdata.csv
Description – This dataset contains 135 observations of 6 variables.
Attributes –
1. OilQual
2. EnginePerf
3. NormMileage
4. TypeWear
5. HVACwear
6. Service
Predictknn –
Predictions are calculated for each test case by aggregating the responses of the
k-nearest neighbours among the training cases. k may be specified to be any positive integer
less than the number of training cases, but is generally between 1 and 10.
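
Because the service datasets are not included here, the idea is sketched below on the built-in iris data with knn() from the class package. The 70/30 split, the seed and k = 3 are assumptions chosen for illustration.

```r
library(class)   # provides knn()

data(iris)
set.seed(1)      # assumed seed, only to make the random split repeatable

# Hold out 45 rows (30% of the 150 rows) as test cases
test_rows <- sample(nrow(iris), size = 45)
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

# Each test case gets the majority class of its 3 nearest training cases
pred <- knn(train = train[, 1:4], test = test[, 1:4], cl = train$Species, k = 3)

# Confusion table and accuracy on the test cases
table(Predicted = pred, Actual = test$Species)
mean(pred == test$Species)
```

Only the numeric columns are passed as train and test; the class labels go in separately through cl.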

7. Aim: To find the performance of Classification Model
a. To install relevant packages for classification.
b. To choose a classifier for classification problems.
c. To evaluate the performance of the classifier.

Datasets used are servicetraindata.csv and servicetestdata.csv

Program:

# install.packages("caret") run command on console
# install.packages("class") run command on console

mytraindata <- read.csv('servicetraindata.csv')
mytestdata <- read.csv('servicetestdata.csv')
mytraindata
mytestdata
str(mytraindata)
str(mytestdata)
summary(mytraindata)
summary(mytestdata)
# the class label (column 6, Service) must be a factor
mytraindata[6] <- as.factor(mytraindata$Service)
summary(mytraindata)
mytestdata[6] <- as.factor(mytestdata$Service)
summary(mytestdata)
library(class)
# classify each test case from its 3 nearest training cases
predictknn <- knn(train=mytraindata[,-6],
                  test=mytestdata[,-6],
                  cl=mytraindata$Service,
                  k = 3)
predictknn
library(caret)
# compare the predictions with the actual labels
confusionMatrix(data=predictknn,reference=mytestdata$Service)

Output:

Experiment – 8
Clustering Model
8a - K-Means Clustering in R Programming Language
K-Means is an iterative hard clustering technique that uses an unsupervised learning
algorithm. The total number of clusters is pre-defined by the user, and the data points are
clustered based on their similarity. The algorithm also finds the centroid of each cluster.
Algorithm -
• Specify the number of clusters (K)
• Randomly assign each data point to a cluster
• Calculate the cluster centroids
• Re-allocate each data point to its nearest cluster centroid
• Recalculate the cluster centroids, and repeat the last two steps until the assignments stop changing
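
The steps above are what R's kmeans() function carries out internally. A minimal sketch on the built-in iris data follows; K = 3, the seed and nstart = 20 are assumptions chosen for illustration.

```r
data(iris)
set.seed(42)   # assumed seed: the random start makes results vary between runs

# Cluster on the four numeric columns; centers is K, nstart repeats the
# random start several times and keeps the best result
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

km$centers   # centroid of each cluster
km$size      # number of points in each cluster

# Compare the clusters with the known species (not used for clustering)
table(Cluster = km$cluster, Species = iris$Species)
```

The cluster numbers themselves are arbitrary labels, so only the grouping matters, not which cluster is called 1, 2 or 3.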
8b -
1. We will use the built in read.csv(...) function call, which reads the data in as a data frame,
and assign the data frame to a variable (using <-) so that it is stored in R’s memory. Then we
will explore some of the basic arguments that can be supplied to the function.
2. The default for read.csv(...) is to set the header argument to TRUE. This means that the
first row of values in the .csv is set as header information (column names). If your data set
does not have a header, set the header argument to FALSE
3. To see the internal structure, we can use another function, str(). In this case, the data
frame’s internal structure includes the format of each column.
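
The effect of the header argument can be sketched without any file on disk by passing inline text to read.csv(). The tiny data values are made up.

```r
# Inline text stands in for a file here; with a real file, pass its path instead
with_header <- read.csv(text = "id,marks\n1,20\n2,30")
names(with_header)        # "id" "marks"

# header = FALSE treats the first row as data and names the columns V1, V2, ...
no_header <- read.csv(text = "1,20\n2,30", header = FALSE)
names(no_header)          # "V1" "V2"

str(no_header)            # internal structure: type of each column
```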
Library – factoextra
"factoextra" is an R package that makes it easy to extract and visualize the output of exploratory
multivariate data analyses.
• It produces elegant ggplot2-based data visualizations with less typing.
• It also contains many functions that facilitate clustering analysis and visualization.

8 . Aim: To evaluate the performance of Clustering Model
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data using R visualizations.

Datasets used : tripdetails.csv


Program:

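mydata<-read.csv('tripdetails.csv')
mydata
str(mydata)
summary(mydata)
# cluster on every column except the first, with K = 5
myclusters<-kmeans(mydata[-1],5)
myclusters
library(factoextra)
# plot the clusters; data must be the same columns that were clustered
fviz_cluster(myclusters,data=mydata[-1],geom="point")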

Output:

Experiment-9

Reading Xml File

Aim: To read an XML file.

XML file:

<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>

<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>

</RECORDS>
Program:

# Install (needed only once) and load the package required to read XML files.
install.packages("XML")
library("XML")

# Also load the other required package.
library("methods")

# Give the input file name to the function.
result <- xmlParse(file = "D:/emp.xml")

# Print the result.
print(result)

Output:
