Data Science Lab Manual
Experiment – 1
Calculator Application
R operators – R has many operators to carry out different mathematical and logical
operations. Operators in R can mainly be classified into the following categories.
1. Arithmetic operators - +, -, *, /
2. Assignment operators – <- , ->
3. Relational operators - <, >, ==, !=, <=, >=
4. Logical operators - !, &, &&, |, ||
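The four operator categories above can be tried directly at the R console. The following is a minimal sketch; the variable names a, b, v and w are chosen only for illustration.

```r
# Arithmetic operators
7 + 2    # addition
7 - 2    # subtraction
7 * 2    # multiplication
7 / 2    # division

# Assignment operators
a <- 5   # leftward assignment
10 -> b  # rightward assignment

# Relational operators return TRUE or FALSE
a < b
a == b
a != b

# Logical operators: & and | work element-wise on vectors,
# while && and || compare single values
v <- c(TRUE, FALSE, TRUE)
w <- c(TRUE, TRUE, FALSE)
v_and_w <- v & w
v_or_w <- v | w
not_v <- !v
both <- (a < b) && (b > 0)
```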
1. Aim: To implement Calculator Application in R
Program:
1+2
3-1
4*2
5*2
a<-1
b<-4
c<-2
a+b
a-b
a*b
b/c
add<-function(x,y)
{
print(x+y)
}
add(2,3)
subt<-function(x,y)
{
print(x-y)
}
subt(7,2)
mul<-function(x,y)
{
print(x*y)
}
mul(6,3)
div<-function(x,y)
{
print(x/y);
}
div(10,2)
choice=readline(prompt="Enter add for addition
subt for subtraction
mul for multiplication
div for division
Choice: ");
num1=readline(prompt = "Enter first number : ");
num2=readline(prompt = "Enter second number : ");
# as.numeric keeps decimal input such as 2.5, which as.integer would truncate
num1=as.numeric(num1)
num2=as.numeric(num2)
cal<-switch(choice,"add"=print(num1+num2),
"subt"=print(num1-num2),
"mul"=print(num1*num2),
"div"=print(num1/num2))
Output –
Experiment – 2
Descriptive Analysis
Dataset – mtcars
Description – The mtcars dataset is a built-in dataset in R that contains measurements on 11
aspects of automobile design and performance for 32 cars. The data was extracted from the
1974 Motor Trend US magazine.
Attributes –
1. mpg
2. cyl
3. disp
4. hp
5. drat
6. wt
7. qsec
8. vs
9. am
10. gear
11. carb
Dataset – cars
Description – This dataset contains 50 observations of 2 variables. It records the
speed of cars and the distances taken to stop.
Attributes –
1. speed
2. dist
Dataset – iris
Description – The data set contains 3 classes of 50 instances each, where each class refers to
a type of iris plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.
Attributes -
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. species:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
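Subset function –
Syntax – subset(x, subset, select)
Parameters –
x: Object to be subsetted
subset: Logical expression indicating the rows to keep
select: Expression indicating the columns to select
Aggregate function –
Syntax – aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)
Parameters –
x: R object
by: List of grouping variables
FUN: Function to be applied for summary statistics
... : Additional arguments to be passed to FUN
simplify: Whether to simplify results as much as possible or not
drop: Whether to drop unused combinations of grouping values or not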
mean() function – Returns the arithmetic mean of the observations of the specified
attribute.
min() function – Returns the smallest observation in the data being used.
max() function – Returns the largest observation in the data being used.
summary() function – Summarises each attribute separately. For numeric attributes it
reports the minimum value, 1st quartile, median, mean, 3rd quartile, and maximum
value.
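The functions described above can be combined as in the following sketch on the built-in iris dataset. The threshold of 7 and the variable names long_sepals and species_means are assumptions made for illustration only.

```r
data(iris)

# Rows where Sepal.Length exceeds 7, keeping only two columns
long_sepals <- subset(iris, Sepal.Length > 7,
                      select = c(Sepal.Length, Species))
long_sepals

# Mean of every numeric column for each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means

# Statistics for a single attribute
mean(iris$Petal.Length)
min(iris$Petal.Length)
max(iris$Petal.Length)
quantile(iris$Petal.Length)  # quartiles of one attribute
summary(iris$Petal.Length)   # min, quartiles, median, mean, max
```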
2. Aim: To perform Descriptive Statistics in R
a. To write an R script to find basic descriptive statistics using the summary, str and quantile functions
on the mtcars dataset.
b. To apply the above functions on the cars dataset.
c. To apply subset() and aggregate() functions on the iris dataset.
Datasets used:
1. mtcars
2. cars
3. iris
Program :
data(mtcars)
head(mtcars)
tail(mtcars)
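head(mtcars,10)
str(mtcars)
mtcars[1]        # first column
mtcars[15, ]     # 15th row; mtcars[15] fails because there are only 11 columns
mtcars[1:4]      # columns 1 to 4
mtcars[c(1,4)]   # columns 1 and 4
mtcars[-2]       # all columns except the second
max(mtcars$cyl)
min(mtcars$mpg)
mean(mtcars$mpg)
median(mtcars$mpg)
quantile(mtcars$mpg)
summary(mtcars)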
Output:
b. Descriptive Statistics Analysis on cars dataset
data(cars)
head(cars,10)
tail(cars,20)
str(cars)
head(cars)
max(cars)
max(cars$speed)
min(cars$speed)
mean(cars$speed)
median(cars$speed)
# mode() returns the storage mode ("numeric"), not the statistical mode
names(which.max(table(cars$speed)))
summary(cars$speed)
summary(cars)
Output:
c. Applying subset and aggregate functions on iris dataset
data(iris)
head(iris)
tail(iris)
subset(iris,Sepal.Length==6.1)
aggregate(.~Species,data=iris,mean)
Output –
Experiment – 3
Reading and writing different types of data
Package used – readxl
The readxl package makes it easy to get data out of Excel and into R. Compared to
many of the existing packages, readxl has no external dependencies, so it's easy to install and
use on all operating systems. It is designed to work with tabular data.
Functions for Reading Data into R –
There are a few very useful functions for reading data into R.
1. read.table() and read.csv() are two popular functions used for reading tabular data
into R.
2. read.delim() is used to read delimited text files.
3. readLines() is used for reading lines from a text file.
4. source() is used for reading in R code files from another R program.
5. dget() is used for reading in an R object written by dput().
6. load() is used for reading in saved workspaces.
7. unserialize() is used for reading single R objects in binary format.
Functions for Writing Data from R –
1. write.table() is used for writing tabular data to text files (e.g. CSV).
2. writeLines() is useful for writing character data line-by-line to a file or
connection.
3. dump() is used for dumping a textual representation of multiple R objects.
4. dput() is used for outputting a textual representation of an R object.
5. serialize() is used for converting an R object into a binary format for outputting to a
connection.
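Several of the reading and writing functions listed above come in pairs. The sketch below writes a small data frame and reads it back with three of those pairs. The sample data and the use of tempfile() for file locations are assumptions for illustration; any writable path would do.

```r
# Small data frame to write and read back (hypothetical sample data)
st <- data.frame(id = 1:3, name = c("a", "b", "c"), marks = c(20, 30, 40))

# Tabular data: write.csv() / read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(st, csv_path, row.names = FALSE)
st_csv <- read.csv(csv_path)

# Plain text: writeLines() / readLines()
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)
lines_in <- readLines(txt_path)

# Textual representation of one R object: dput() / dget()
r_path <- tempfile(fileext = ".R")
dput(st, file = r_path)
st_dget <- dget(r_path)
```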
3. Aim: To read and write different types of datasets
a. To read different types of datasets from the web and from disk, and to write them to a file in a
specific disk location.
b. To read Excel data sheet in R.
name=c("a","b","c","d","e")
marks=c(20,30,40,10,15)
id=c(1:5)
st=data.frame(id,name,marks)
View(st)
install.packages("readxl")
library(readxl)
Output:
> name=c("a","b","c","d","e")
> marks=c(20,30,40,10,15)
> id=c(1:5)
> st=data.frame(id,name,marks)
> View(st)
> #1. writing data frame into CSV file
> write.csv(st,"student.csv",row.names=FALSE)
Nom variable Group
1 IND1 10 A
2 IND2 7 A
3 IND3 20 A
4 IND4 14 A
5 IND5 14 A
6 IND6 12 A
7 IND7 10 A
8 IND8 23 A
9 IND9 17 A
10 IND10 20 A
11 IND11 14 A
12 IND12 13 A
13 IND13 11 B
14 IND14 17 B
15 IND15 21 B
16 IND16 11 B
17 IND17 16 B
18 IND18 14 B
19 IND19 17 B
20 IND20 17 B
21 IND21 19 B
22 IND22 21 B
23 IND23 7 B
24 IND24 13 B
25 IND25 0 C
26 IND26 1 C
27 IND27 7 C
28 IND28 2 C
29 IND29 3 C
30 IND30 1 C
31 IND31 2 C
32 IND32 1 C
33 IND33 3 C
34 IND34 0 C
35 IND35 1 C
36 IND36 4 C
37 IND37 3 D
38 IND38 5 D
39 IND39 12 D
40 IND40 6 D
41 IND41 4 D
42 IND42 3 D
43 IND43 5 D
44 IND44 5 D
45 IND45 5 D
46 IND46 5 D
47 IND47 2 D
48 IND48 4 D
49 IND49 3 E
50 IND50 5 E
51 IND51 3 E
52 IND52 5 E
53 IND53 3 E
54 IND54 6 E
55 IND55 1 E
56 IND56 1 E
57 IND57 3 E
58 IND58 2 E
59 IND59 6 E
60 IND60 4 E
61 IND61 11 F
62 IND62 9 F
63 IND63 15 F
64 IND64 22 F
65 IND65 15 F
66 IND66 16 F
67 IND67 13 F
68 IND68 10 F
69 IND69 26 F
70 IND70 26 F
71 IND71 24 F
72 IND72 13 F
> head(webfile)
Nom variable Group
1 IND1 10 A
2 IND2 7 A
3 IND3 20 A
4 IND4 14 A
5 IND5 14 A
6 IND6 12 A
> write.table(webfile,file="webfile.txt",quote=F,row.names=FALSE)
>
> # install package readxl first
>
> install.packages("readxl")
Error in install.packages : Updating loaded packages
> library(readxl)
>
> #6. reading excel datasheet
> df=read_excel("d:/ex1.xlsx",sheet=2)
> View(df)
Experiment – 4
Visualization
Data visualization is an efficient technique for gaining insight about data through a
visual medium. With the help of visualization techniques, we can easily obtain information
about hidden patterns in data and also we can work with large datasets to efficiently obtain
key insights.
Dataset used – mtcars
Description – The mtcars dataset is a built-in dataset in R that contains measurements on 11
aspects of automobile design and performance for 32 cars. The data was extracted from the
1974 Motor Trend US magazine.
Attributes –
1. mpg
2. cyl
3. disp
4. hp
5. drat
6. wt
7. qsec
8. vs
9. am
10. gear
11. carb
Package - ggplot2
ggplot2 allows us to create graphics declaratively. This package is known for its elegant,
high-quality graphs, which set it apart from other visualization packages.
Boxplot – The boxplot() function is used to create a boxplot. A boxplot shows how the
data is distributed across a data set. The graph represents the minimum, maximum, median,
first quartile, and third quartile of the data set.
Syntax – boxplot(x, data, names, main)
parameters –
x is a vector or a formula.
data is the data frame.
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph.
Scatterplot –
Scatter plots are used to compare variables. A comparison between variables is
required when we need to determine how much one variable is affected by another variable.
Syntax – plot(x, y, main, xlab, ylab)
Parameters –
x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label in the horizontal axis.
ylab is the label in the vertical axis.
R Bar Charts
A bar chart is a pictorial representation in which numerical values of variables are
represented by length or height of lines or rectangles of equal width. R provides the
barplot() function.
Syntax – barplot(H, xlab, ylab, main, names.arg, col)
Parameters –
H is a vector or matrix containing numeric values used in bar chart.
xlab is the label for x axis.
ylab is the label for y axis.
main is the title of the bar chart.
names.arg is a vector of names appearing under each bar.
col is used to give colours to the bars in the graph.
R Pie Charts
A pie chart is a representation of values in the form of slices of a circle with different colors.
Pie charts are created with the help of the pie() function, which takes positive numbers as
vector input.
Syntax - pie(x, labels, main, col)
Parameters –
x is a vector containing the numeric values used in the pie chart.
labels is used to give description to the slices.
main indicates the title of the chart.
col indicates the colour palette.
4. Aim: To perform visualizations
a. To find the data distribution using box and scatter plot
b. To find the outliers using plot.
c. To plot the histogram, bar chart and pie chart on sample data.
Program:
#Line plot
x=1:10
y=x^2
plot(x,y,type="l",main="Line Plot Example")
#installing package
install.packages("ggplot2")
#scatter plot
data("mtcars")
plot(
mtcars$wt,mtcars$mpg,
main = "scatter plot example",
xlab = "car weight",
ylab="miles per gallon"
)
#box plot
data("mtcars")
boxplot(
mtcars$mpg,
main = "box plot example",
ylab="miles per gallon"
)
#outliers
v<-c(50,25,30,12,78,99)
boxplot(v,main="outliers")
#Histogram
H<-c(9,13,28,36,4,54,99,98)
hist(H,main="Histogram",col="blue")
#Barchart
h<-c(9,13,28,36,4,54)
m<-c("MAR","APR","MAY","JUN","JUL","AUG")
barplot(h,names.arg=m,xlab="Month",ylab="revenue",main="barchart",border ="blue")
#pie chart
h<-c(90,78,80,25)
m<-c("OS","DBMS","Java","OE")
pie(h,m,main = "piechart")
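A boxplot marks a value as an outlier when it lies more than 1.5 times the interquartile range beyond the box. The sketch below uses boxplot.stats() to list such values without drawing anything. The vector is a hypothetical sample with one deliberately extreme value (250) added so that an outlier is present.

```r
# Hypothetical sample with one extreme value (250)
v <- c(50, 25, 30, 12, 78, 99, 250)

stats <- boxplot.stats(v)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values lying beyond 1.5 * IQR from the hinges
```

Calling boxplot(v, main = "outliers") on the same vector draws the extreme value as a separate point above the upper whisker.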
Output:
Experiment – 5
Correlation and Covariance
Correlation and covariance are terms used in statistics to measure the relationship
between two random variables. Both measure the linear dependency between a
pair of random variables or bivariate data.
Correlation in R Programming Language –
The cor() function in R computes the correlation coefficient. Correlation is
a statistical measure that scales the covariance to express how strongly two
vectors are related. Mathematically,
$$r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}$$
where,
$\operatorname{cov}(x, y)$ is the covariance of x and y, and
$\sigma_x$ and $\sigma_y$ are the standard deviations of x and y.
Package – corrplot
The R package corrplot provides a visual exploratory tool for correlation matrices that
supports automatic variable reordering to help detect hidden patterns among
variables.
corrplot is easy to use and provides a rich array of plotting options for the visualization
method, graphic layout, color, legend, text labels, etc. It also provides p-values and
confidence intervals to help users determine the statistical significance of the
correlations.
corrplot() – The most commonly used parameters include method, type, order, and diag.
Correlation matrix –
A correlation matrix is a table of correlation coefficients for a set of variables used to
determine if a relationship exists between the variables. The coefficient indicates both the
strength of the relationship as well as the direction.
Syntax: cor (x, use = , method = )
Parameters:
x: It is a numeric matrix or a data frame.
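use: Specifies how missing values are handled (for example "everything", "complete.obs" or "pairwise.complete.obs").
method: Specifies the correlation coefficient to compute: "pearson" (default), "kendall" or "spearman".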
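As an illustration of cov() and cor(), the sketch below works on the four numeric columns of the built-in iris dataset. The choice of columns and the variable names num, manual and cor_matrix are assumptions for this example, not the manual's own program.

```r
data(iris)
num <- iris[, 1:4]  # the four numeric measurements

# Covariance and correlation of two variables
cov(iris$Petal.Length, iris$Petal.Width)
cor(iris$Petal.Length, iris$Petal.Width)  # Pearson by default

# Correlation is covariance scaled by both standard deviations
manual <- cov(iris$Petal.Length, iris$Petal.Width) /
  (sd(iris$Petal.Length) * sd(iris$Petal.Width))

# Correlation matrix for all numeric columns
cor_matrix <- cor(num, use = "complete.obs", method = "pearson")
round(cor_matrix, 2)

# Rank-based alternative
cor(num, method = "spearman")
```

If the corrplot package is installed, passing cor_matrix to corrplot() draws the matrix as a plot.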
Dataset - iris
Description –
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT linearly separable
from each other.
Attributes -
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. species:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
Variance (ANOVA) –
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed
aggregate variability found inside a data set into two parts: systematic factors and random
factors. The systematic factors have a statistical influence on the given data set, while the
random factors do not.
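A one-way ANOVA can be run in R with aov(). The sketch below assumes, purely for illustration, that Species is the systematic factor and Sepal.Length is the response in the iris dataset.

```r
data(iris)

# One-way ANOVA: does mean sepal length differ between species?
fit <- aov(Sepal.Length ~ Species, data = iris)
anova_table <- summary(fit)[[1]]
anova_table

# Row 1 is the systematic factor (Species), row 2 the residuals
p_value <- anova_table[["Pr(>F)"]][1]
p_value
```

A small p-value suggests that the variability between species is larger than random variation alone would explain.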
5. Aim: To Calculate Correlation and Covariance
Output:
Experiment – 6
Regression Model
Dataset – crashdata.csv
Description – This dataset has 80 observations of 6 variables.
Attributes –
1. ManHI
2. ManBI
3. IntI
4. HVACi
5. Safety
6. CarType
Dataset – crashtestdata.csv
Description – This dataset has 20 observations of 6 variables.
Attributes –
1. ManHI
2. ManBI
3. IntI
4. HVACi
5. Safety
6. CarType
GLM – glm() is used to fit generalised linear models, specified by giving a symbolic
description of the linear predictor and a description of the error distribution.
Syntax - glm(formula, family, data)
Parameters –
formula: symbolic description of the model to be fitted.
family: the error distribution, such as binomial, poisson, gaussian, Gamma or quasi.
data: refers to the dataset being used.
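The glm() syntax can be tried on a built-in dataset. The sketch below is only an illustration and does not use the crash data: it assumes mtcars, with the binary column am (0 = automatic, 1 = manual) as the response and wt (weight) as the single predictor.

```r
data(mtcars)

# Logistic regression: probability of a manual transmission given weight
fit <- glm(am ~ wt, family = binomial, data = mtcars)
summary(fit)

# Predicted probabilities on the training data
prob <- predict(fit, type = "response")

# Classify with a 0.5 cut-off and compare against the actual values
predicted <- ifelse(prob > 0.5, 1, 0)
table(Predicted = predicted, Actual = mtcars$am)
```

The 0.5 cut-off is a common default rather than a requirement; other thresholds trade false positives against false negatives.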
Package used – caret
caret is short for Classification And REgression Training. The package follows a
consistent syntax for data preparation, model building, and model evaluation, which makes
these steps easier for data science practitioners.
6. Aim: To evaluate the performance of Regression Model
a. Import data from web storage and name the dataset.
b. Perform Logistic Regression to find out the relation between variables in the model.
c. Check whether the model is fit or not [require(foreign), require(MASS)].
Program:
#logistic regression
mydata <- read.csv('crashdata.csv')
mytestdata <- read.csv('crashtestdata.csv')
mydata
mytestdata
str(mydata)
summary(mydata)
mydata$CarType <- as.factor(mydata$CarType)
mydata
# Use the column name in the formula so that "." excludes CarType from the predictors
fit <- glm(formula=CarType~.,family='binomial', data=mydata)
fit
summary(fit)
train <- predict(fit, type='response')
plot(train)
tapply(train, mydata$CarType, mean)
pred <- predict(fit,newdata = mytestdata, type='response')
plot(pred)
mytestdata[pred<=0.5,'Predict'] <- 'Hatchback'
mytestdata[pred>0.5,'Predict'] <- 'SUV'
mytestdata
#install.packages("caret") run on console
library(caret)
confusionMatrix(table(mytestdata[,7],mytestdata[,6]),positive='Hatchback')
Output:
Experiment – 7
Classification Model
Packages for classification:
1. Caret package –
caret is short for Classification And REgression Training. The package follows a
consistent syntax for data preparation, model building, and model evaluation, which makes
these steps easier for data science practitioners.
2. Class package –
The class package provides functions for classification, including knn() for
k-nearest neighbour classification.
Dataset – Servicetraindata.csv
Description – This data set contains 315 observations of 6 variables.
Attributes –
1. OilQual
2. EnginePerf
3. NormMileage
4. TypeWear
5. HVACwear
6. Service
Dataset – Servicetestdata.csv
Description – This dataset contains 135 observations of 6 variables.
Attributes –
1. OilQual
2. EnginePerf
3. NormMileage
4. TypeWear
5. HVACwear
6. Service
Predictknn –
Predictions are calculated for each test case by aggregating the responses of the k-
nearest neighbours among the training cases. k may be specified to be any positive integer
less than the number of training cases, but is generally between 1 and 10.
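The k-nearest neighbour prediction described above can be sketched with knn() from the class package. Because the service data files are not reproduced here, this example substitutes the built-in iris dataset; the 70/30 split, the seed, and k = 5 are assumptions chosen for illustration.

```r
library(class)  # recommended package shipped with standard R installations

data(iris)
set.seed(42)  # arbitrary seed so the split is reproducible
train_idx <- sample(nrow(iris), round(0.7 * nrow(iris)))

train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

# Each test case gets the majority class of its 5 nearest training cases
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# Confusion table and overall accuracy
confusion <- table(Predicted = pred, Actual = test_y)
accuracy <- sum(diag(confusion)) / sum(confusion)
confusion
accuracy
```

With caret installed, confusionMatrix(pred, test_y) reports the same table together with further statistics.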
7. Aim: To find the performance of Classification Model
a. To install relevant packages for classification.
b. To choose a classifier for classification problems.
c. To evaluate the performance of the classifier.
Program:
Output:
Experiment – 8
Clustering Model
8a -
K-Means Clustering in R Programming Language – K-Means is an iterative hard clustering
technique based on unsupervised learning. The number of clusters is
pre-defined by the user, and the data points are grouped according to their
similarity. The algorithm also finds the centroid of each cluster.
Algorithm -
• Specify number of clusters (K)
• Randomly assign each data point to a cluster
• Calculate cluster centroids
• Re-allocate each data point to their nearest cluster centroid.
• Re-figure cluster centroid.
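The steps above can be sketched with the built-in kmeans() function. This example assumes the iris measurements as input and K = 3; note that kmeans() starts from randomly chosen centres and uses the Hartigan-Wong algorithm by default, so it is a variant of the listed steps rather than a literal transcription.

```r
data(iris)
set.seed(123)  # arbitrary seed; the random start makes results vary otherwise

# Scale the numeric columns so each contributes equally to distances
features <- scale(iris[, 1:4])

# K = 3 clusters; nstart = 25 repeats the random start and keeps the best fit
km <- kmeans(features, centers = 3, nstart = 25)

km$centers  # centroid of each cluster
km$size     # number of points per cluster
table(Cluster = km$cluster, Species = iris$Species)
```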
8b -
1. We will use the built in read.csv(...) function call, which reads the data in as a data frame,
and assign the data frame to a variable (using <-) so that it is stored in R’s memory. Then we
will explore some of the basic arguments that can be supplied to the function.
2. The default for read.csv(...) is to set the header argument to TRUE. This means that the
first row of values in the .csv is set as header information (column names). If your data set
does not have a header, set the header argument to FALSE
3. To see the internal structure, we can use another function, str(). In this case, the data
frame’s internal structure includes the format of each column.
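The effect of the header argument can be seen in the following sketch. The file contents and column names (trip, distance, fare) are hypothetical and written to a temporary file only for illustration.

```r
# Create a small CSV file with a header row (hypothetical contents)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("trip,distance,fare", "1,2.5,10", "2,7.1,24"), csv_path)

# Default: the first row becomes the column names
with_header <- read.csv(csv_path)
str(with_header)

# header = FALSE: the first row is treated as data and columns are named V1, V2, ...
no_header <- read.csv(csv_path, header = FALSE)
str(no_header)
```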
Library – factoextra
factoextra is an R package that makes it easy to extract and visualize the output of exploratory
multivariate data analyses.
• It produces elegant ggplot2-based data visualizations with less typing.
• It also contains many functions that facilitate clustering analysis and visualization.
8 . Aim: To evaluate the performance of Clustering Model
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data using R visualizations.
mydata<-read.csv('tripdetails.csv')
mydata
str(mydata)
summary(mydata)
myclusters<-kmeans(mydata[-1],5)
myclusters
library(factoextra)
fviz_cluster(myclusters,data=mydata[-1],geom="point")
Output:
Experiment – 9
XML file:
<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
</RECORDS>
Program:
Output: