Dsda Manual

The document provides an introduction to R programming, detailing basic arithmetic and logical operations, as well as file import/export functionalities. It includes algorithms and sample programs for various operations such as addition, subtraction, and data frame manipulations, along with visualization techniques like bar plots and histograms. The document aims to demonstrate the capabilities of R for data exploration and analysis through practical examples.

Uploaded by

dharsanimv
Copyright
© © All Rights Reserved

Ex.No. 1 INTRODUCTION TO R STUDIO, BASIC OPERATIONS AND
IMPORT AND EXPORT OF DATA USING R TOOL

AIM
To demonstrate basic arithmetic and logical operations, as well as file
input/output operations, using the R programming language.

ALGORITHM
1. ADDITION OPERATOR (+)
- Define vectors a and b with values.
- Print the sum of vectors a and b.
2. SUBTRACTION OPERATOR (-)
- Define variables a and b with values.
- Print the result of subtracting b from a.
3. MULTIPLICATION OPERATOR (*)
- Define vectors b and c with values.
- Print the element-wise product of vectors b and c.
4. DIVISION OPERATOR (/)
- Define variables a and b with values.
- Print the result of dividing a by b.
5. ELEMENT-WISE LOGICAL AND OPERATOR (&)
- Define vectors list1 and list2 with values.
- Print the element-wise logical AND of list1 and list2.
6. ELEMENT-WISE LOGICAL OR OPERATOR (|)
- Define vectors list1 and list2 with values.
- Print the element-wise logical OR of list1 and list2.
7. NOT OPERATOR (!)
- Define vector list1 with values.
- Print the logical NOT of list1.
8. LOGICAL AND OPERATOR (&&)
- Define one-element vectors list1 and list2.
- Print the logical AND of list1 and list2.
9. LESS THAN (<)
- Define vectors list1 and list2 with values.
- Print the result of comparing elements of list1 and list2 for less than.
10. LESS THAN OR EQUAL TO (<=)
- Define vectors list1 and list2 with values.
- Convert both vectors to character vectors.
- Print the result of comparing elements of list1 and list2 for less than or equal to.
11. GREATER THAN (>)
- Define vectors list1 and list2 with values.
- Print the result of comparing elements of list1 and list2 for greater than.
12. GREATER THAN OR EQUAL TO (>=)
- Define vectors list1 and list2 with values.
- Print the result of comparing elements of list1 and list2 for greater than or equal to.
13. LEFT ASSIGNMENT (<-, <<- OR =)
- Define a vector vec1 using left assignment.
- Print vec1.
14. RIGHT ASSIGNMENT (-> OR ->>)
- Define a vector vec1 using right assignment.
- Print vec1.
15. %IN% OPERATOR
- Define a value and a vector.
- Print whether the value is in the vector using the %in% operator.
16. %*% OPERATOR
- Define a matrix.
- Print the matrix.
17. EXPORT FILE INTO R
- Create vectors for name and age.
- Create a data frame using these vectors.
- Print the data frame.
18. CSV FILE FORMAT SAVE
- Create vectors for name and age.
- Create a data frame using these vectors.
- View the data frame.
- Write the data frame to a CSV file named "newdata.csv".
19. SAVE FILE PATH VIEW
- Create vectors for name and age.
- Create a data frame using these vectors.
- View the data frame.
- Write the data frame to a CSV file named "newdata.csv".
- Print the current working directory.
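
Steps 5 and 8 look alike but behave differently: `&` compares two vectors element by element, while `&&` is meant for single values. The following is a minimal sketch of that difference; the vectors `a` and `b` are hypothetical and not part of the exercise.

```r
# Hypothetical logical vectors, for illustration only
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# Element-wise AND: compares a[1] with b[1], a[2] with b[2], and so on
elementwise <- a & b
print(elementwise)

# Scalar AND: intended for single values, for example inside an if () condition
scalar <- a[1] && b[1]
print(scalar)
```

R 4.3.0 and later raise an error when either side of `&&` has more than one element, so the single-value form shown here is the safe way to use it.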

PROGRAM
Addition operator (+):
Sample Input:
a <- c (1, 0.1)
b <- c (2.33, 4)
print (a+b)
Sample Output : 3.33 4.10

Subtraction Operator (-):


Sample Input:
a <- 6
b <- 8.4
print (a-b)
Sample Output : -2.4

Multiplication Operator (*) :


Sample Input:
B= c(4,4)
C= c(5,5)
print (B*C)
Sample Output : 20 20

Division Operator (/) :


Sample Input:
a <- 10
b <- 5
print (a/b)
Sample Output : 2

Element-wise Logical AND operator (&):


Sample Input:
list1 <- c(TRUE, 0.1)
list2 <- c(0,4+3i)
print(list1 & list2)
Sample Output : FALSE TRUE

Element-wise Logical OR operator (|):


Sample Input:
list1 <- c(TRUE, 0.1)
list2 <- c(0,4+3i)
print(list1|list2)
Sample Output : TRUE TRUE

NOT operator (!):


Sample Input:
list1 <- c(0,FALSE)
print(!list1)
Sample Output : TRUE TRUE

Logical AND operator (&&):


Sample Input:
list1 <- c(0.1)
list2 <- c(4+3i)
print(list1 && list2)
Sample Output : TRUE

Less than (<):


Sample Input:
list1 <- c(TRUE, 0.1,"apple")
list2 <- c(0,0.1,"bat")
print(list1<list2)
Sample Output : FALSE FALSE TRUE

Less than equal to (<=):


Sample Input:
list1 <- c(TRUE, 0.1, "apple")
list2 <- c(TRUE, 0.1, "bat")
list1_char <- as.character(list1)
list2_char <- as.character(list2)
# Compare character strings
print(list1_char <= list2_char)
Sample Output : TRUE TRUE TRUE

Greater than (>):


Sample Input:
list1 <- c(TRUE, 0.1, "apple")
list2 <- c(TRUE, 0.1, "bat")
print(list1 > list2)
Sample Output : FALSE FALSE FALSE

Greater than equal to (>=) :


Sample Input:
list1 <- c(TRUE, 0.1, "apple")
list2 <- c(TRUE, 0.1, "bat")
print(list1 >= list2)
Sample Output : TRUE TRUE FALSE
Left Assignment (<- or <<- or =):
Sample Input:
vec1 = c("ab", TRUE)
print (vec1)
Sample Output : "ab" "TRUE"

Right Assignment (-> or ->>):


Sample Input:
c("ab", TRUE) ->> vec1
print (vec1)
Sample Output : "ab" "TRUE"

%in% Operator:
Sample Input:
val <- 0.1
list1 <- c(TRUE, 0.1,"apple")
print (val %in% list1)
Sample Output : TRUE

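%*% Operator:
Sample Input:
mat = matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
print (mat)
# Matrix product of mat (2 x 3) with its transpose (3 x 2)
print (mat %*% t(mat))
Sample Output :
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
     [,1] [,2]
[1,]   35   44
[2,]   44   56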
EXPORT FILE INTO R :
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
Sample output:
Name Age
xxx 20
yyy 30
zzz 25
aaa 21
bbb 23

CSV FILE FORMAT SAVE :


Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
View(data)
write.csv(data,"newdata.csv")

SAVE FILE PATH VIEW:


Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
View(data)
write.csv(data,"newdata.csv")
getwd()

Sample output:
"C:/Users/ELCOT/Documents/newdata"

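The programs above only export data, although the exercise title also covers import. As a hedged sketch of the missing half, the block below writes the same data frame to a CSV file and reads it back with `read.csv()`. Writing to `tempdir()` and passing `row.names = FALSE` are assumptions made here so that the example leaves the working directory untouched and the file holds only the two data columns.

```r
# Same illustrative data frame as in the exercise
Name <- c("xxx", "yyy", "zzz", "aaa", "bbb")
Age <- c(20, 30, 25, 21, 23)
data <- data.frame(Name, Age)

# Write to a temporary folder (an assumption, to avoid touching the working directory)
csv_path <- file.path(tempdir(), "newdata.csv")
write.csv(data, csv_path, row.names = FALSE)

# Import the file back into R and inspect it
imported <- read.csv(csv_path)
print(imported)
str(imported)
```
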
RESULT
The program executes various arithmetic and logical operations, showcasing their
outcomes along with file input/output operations, ultimately generating files and
displaying results of operations on provided data.

Ex.No. 2 IMPLEMENT DATA EXPLORATION AND VISUALIZATION
ON DIFFERENT DATASETS TO EXPLORE MULTIPLE AND
INDIVIDUAL VARIABLES

AIM
To implement data exploration and visualization on different datasets to explore
multiple and individual variables

ALGORITHM
1. TOTAL ROWS AND COLUMNS:
- Create vectors for name and age.
- Combine vectors into a data frame.
- Print the dimensions of the data frame.
2. PROJECT VIEW:
- Create vectors for name and age.
- Combine vectors into a data frame.
- Print the first few rows of the data frame.
3. TITLES VIEW:
- Create vectors for name and age.
- Combine vectors into a data frame.
- Print the names of columns in the data frame.
4. DATA FRAME:
- Create vectors for name and age.
- Combine vectors into a data frame.
- Print the structure of the data frame.
5. PROJECT VIEW:
- Create vectors for name and age.
- Combine vectors into a data frame.
- View the data frame in the RStudio viewer.
6. NUMERIC OR CHARACTER:
- Create vectors for name and age.
- Combine vectors into a data frame.
- Print the class of the name and age columns.
7. TABLE VIEW:
- Create vectors for name and age.
- Combine vectors into a data frame.
- Print the frequency table of the name column.
8. MEAN, MEDIAN, MODE, QUARTILES, MINIMUM, MAXIMUM:
- Create vectors for name and age.
- Combine vectors into a data frame.
- Print summary statistics for the age column.
9. BARPLOT:
- Load the airquality dataset into a data frame.
- View the data frame.
- Create a barplot of the data.
10. HISTOGRAM:
- Load the airquality dataset into a data frame.
- View the data frame.
- Create a histogram of the data.
11. BOX PLOT:
- Load the airquality dataset into a data frame.
- View the data frame.
- Create a box plot of the data.
12. SCATTER PLOT:
- Load the airquality dataset into a data frame.
- View the data frame.
- Create a scatter plot of the data.
13. HEAT MAP:
- Create a random matrix.
- Assign row and column names to the matrix.
- Create a heatmap of the matrix.
14. 3D GRAPHS:
- Define a function for a cone.
- Prepare variables for x, y, and z.
- Plot a 3D surface of the cone.
15. CLASS() FUNCTION:
- Define variables with different data types.
- Print the class of each variable.
16. LS() FUNCTION:
- Define variables using different assignment operators.
- Print the list of variables in the current environment.
17. RM() FUNCTION:
- Define variables using different assignment operators.
- Remove a variable.
- Print the removed variable (to demonstrate removal).
- Print an existing variable (to demonstrate non-removal).

PROGRAM
Total Rows and Columns:
Sample Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
dim(data)
Output : 5 2

Project view:
Sample Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
head(data)
Output : Name Age
xxx 20
yyy 30
zzz 25
aaa 21
bbb 23

Titles view:
Sample Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
names(data)
Output : "Name" "Age"

Data Frame:
Sample Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
str(data)
Output : 'data.frame': 5 obs. of 2 variables:
$ Name: chr "xxx" "yyy" "zzz" "aaa" ...
$ Age : num 20 30 25 21 23

Numeric or Character:
Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
class(data$Name)
Output : "character"

Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
class(data$Age)
Output : "numeric"

Table view:
Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
table(data$Name)
Output : aaa bbb xxx yyy zzz
1 1 1 1 1

Mean, Median, Mode, Quartiles, Minimum, Maximum:
Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
summary(data$Age)
Output :
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.0 21.0 23.0 23.8 25.0 30.0
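
The output of `summary()` covers the minimum, quartiles, median, mean and maximum, but not the mode, and base R has no built-in function for the statistical mode. One possible sketch is shown below; the helper name `stat_mode` is hypothetical and was chosen for this illustration.

```r
# Hypothetical helper: returns the most frequent value(s) of a vector
stat_mode <- function(x) {
  ux <- unique(x)                      # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))     # how often each distinct value occurs
  ux[counts == max(counts)]            # all values that reach the highest count
}

Age <- c(20, 30, 25, 21, 23)
print(stat_mode(Age))                  # every age occurs once, so all five are returned

ages_with_repeat <- c(20, 21, 21, 30)  # illustrative data with a clear mode
print(stat_mode(ages_with_repeat))
```

When every value occurs equally often, as in the Age column of the exercise, the helper returns all of them, which signals that the data has no single mode.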

BARPLOT:
Input:
data<-data.frame(airquality)
View(airquality)
Output :

barplot(airquality$Ozone,
main = 'Ozone Concentration in air',
xlab = 'ozone levels', horiz = TRUE)
Output :
HISTOGRAM:
Input:
data<-data.frame(airquality)
View(airquality)
Output :

data(airquality)
hist(airquality$Temp, main ="La Guardia Airport's Maximum Temperature (Daily)",
xlab ="Temperature(Fahrenheit)",
xlim = c(50, 125), col ="yellow",
freq = TRUE)

Output :

BOX PLOT:
Input:
data<-data.frame(airquality)
View(airquality)
Output :

# Box plot for average wind speed


data(airquality)
boxplot(airquality$Wind, main = "Average wind speed at La Guardia Airport",
xlab = "Miles per hour", ylab = "Wind",
col = "orange", border = "brown",
horizontal = TRUE, notch = TRUE)
Output :

SCATTER PLOT:
Input:
data<-data.frame(airquality)
View(airquality)
Output :
# Scatter plot for Ozone Concentration per month
data(airquality)
plot(airquality$Ozone, airquality$Month,
main ="Scatterplot Example",
xlab ="Ozone Concentration in parts per billion",
ylab =" Month of observation ", pch = 19)

Output :
HEAT MAP:
Input:
# Set seed for reproducibility
set.seed(110)
# Create example data: 25 random values fill the 5 x 5 matrix
data <- matrix(rnorm(25, 0, 5), nrow = 5, ncol = 5)
# Column and row names
colnames(data) <- paste0("col", 1:5)
rownames(data) <- paste0("row", 1:5)
View(data)
# Draw a heatmap
heatmap(data)
Output :

3D GRAPHS:
Input:
# Height of the cone surface at the point (x, y)
cone <- function(x, y){
sqrt(x ^ 2 + y ^ 2)
}

# prepare variables.
x <- y <- seq(-1, 1, length.out = 30)
z <- outer(x, y, cone)

# plot the 3D surface
# Adding Titles and Labeling Axes to Plot
persp(x, y, z,
main="Perspective Plot of a Cone",
zlab = "Height",
theta = 30, phi = 15,
col = "orange", shade = 0.4)

Output :

class() function:
Input:
#Character
var1 = "hello"
print(class(var1))
Output : "character"
Input:
#Numeric
var1 = 10
print(class(var1))
Output : "numeric"

ls() function:
Input:
var1 = "hello"
var2 <- "hello"
"hello" -> var3
print(ls())
Output : "var1" "var2" "var3"

rm() function:
Input:
# using equal to operator
var1 = "hello"
# using leftward operator
var2 <- "hello"
# using rightward operator
"hello" -> var3
# Removing variable
rm(var3)
print(var2)
Output : "hello"
RESULT
Thus the data exploration and visualization on different datasets to explore
multiple and individual variables has been implemented.

Ex.No. 3 BUILD A DECISION TREE USING PARTY AND RPART
PACKAGES

AIM
To build a decision tree using party and rpart packages.

PACKAGE - PARTY
ALGORITHM
1. Load necessary libraries required for the analysis.

2. Load the dataset named readingSkills, which ships with the party package, and
display the first few rows using head() function.

3. Split the dataset into training and testing datasets using the sample.split()
function from caTools package.

4. Build a classification tree model (ctree()) using the training data

5. Plot the decision tree model using the plot() function.

PROGRAM
library(datasets)
library(caTools)
library(party)
library(dplyr)
library(magrittr)
data("readingSkills")
head(readingSkills)
# Split on the outcome vector so each row gets its own TRUE/FALSE flag
sample_data = sample.split(readingSkills$nativeSpeaker, SplitRatio = 0.8)
train_data <- subset(readingSkills, sample_data == TRUE)
test_data <- subset(readingSkills, sample_data == FALSE)
model<- ctree(nativeSpeaker ~ ., train_data)
plot(model)

OUTPUT

PACKAGE – RPART
ALGORITHM
1. Load necessary libraries required for the analysis.

2. Load the iris dataset from the datasets package into a data frame and view
its structure.
3. Set a seed for reproducibility, then randomly sample 70% of the rows
from the iris dataset for training and the rest for testing.

4. Build a decision tree model (rpart()) using the training data. The target
variable is Species and all other variables are predictors.

5. Plot the decision tree model using the rpart.plot() function.


PROGRAM
library(rpart)
library(rpart.plot)
data <- data.frame(iris)
View(data)
set.seed(123)
train_index <- sample(1:nrow(iris), size = 0.7 * nrow(iris))
train <- iris[train_index, ]
test <- iris[-train_index, ]
tree <- rpart(Species ~ ., data = train, method = "class")
rpart.plot(tree, main = "decision tree for the iris dataset")
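
The program sets aside a test set but never uses it. A minimal sketch of one way to evaluate the fitted tree on those held-out rows follows. The confusion matrix and accuracy calculation are additions made here, not part of the original exercise, and `round()` is used for the sample size as a precaution against floating-point truncation.

```r
library(rpart)

# Same 70/30 split as in the program above
set.seed(123)
train_index <- sample(seq_len(nrow(iris)), size = round(0.7 * nrow(iris)))
train <- iris[train_index, ]
test <- iris[-train_index, ]

# Fit the classification tree on the training rows
tree <- rpart(Species ~ ., data = train, method = "class")

# Predict the class of each held-out row
pred <- predict(tree, newdata = test, type = "class")

# Cross-tabulate actual against predicted species
confusion <- table(Actual = test$Species, Predicted = pred)
print(confusion)

# Share of test rows classified correctly
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```
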

OUTPUT
RESULT
Thus a decision tree using party and rpart packages has been built.

Ex.No. 4 BUILD A PREDICTIVE MODEL USING RANDOMFOREST
PACKAGE

AIM
To build a predictive model using randomForest package.

ALGORITHM
1. Load the required libraries: 'lubridate', 'randomForest', and 'forecast'.

2. Define the sales data vector 'x' containing observed sales data.

3. Convert 'x' into a time series object 'mts' with start date and frequency.

4. Build a data frame 'df' with 'Week' (the observation index) and 'Sales' columns.

5. Build a random forest model 'rf_model' using 'randomForest()' with
'Sales' as the dependent variable and 'Week' as the independent variable.

6. Generate forecasts for the next 5 periods using 'predict()' with new data
representing the next 5 weeks.

7. Plot the observed sales data and the forecasted sales values for the next 5
periods using 'plot()'.

PROGRAM
# Load Required Libraries
library(lubridate)
library(randomForest)
library(forecast)
# Define Sales Data
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843,
471497, 936851, 1508725, 2072113)

# Convert Data to Time Series


mts <- ts(x, start = decimal_date(ymd("2020-01-22")), frequency = 365.25 / 7)

# Convert to Data Frame


df <- data.frame(Week = seq_len(length(x)), Sales = x)

# Build Random Forest Model


rf_model <- randomForest(Sales ~ Week, data = df)

# Generate Forecast for Next 5 Periods


forecast_values <- predict(rf_model, newdata = data.frame(Week =
seq(length(x) + 1, length(x) + 5)))

# Plot Forecast
plot(1:(length(x) + 5), c(x, forecast_values), type = "l",
xlab = "Week", ylab = "Total Revenue",
main = "Sales vs Revenue", col.main = "darkgreen",
ylim = c(0, max(x, forecast_values)))
lines(1:length(x), x, col = "blue", lwd = 2) # Plot observed sales
# Plot forecasted sales
lines((length(x) + 1):(length(x) + 5), forecast_values, col = "red", lwd = 2)
legend("topright", legend = c("Observed Sales", "Forecasted Sales"),
col = c("blue", "red"), lwd = 2)

OUTPUT

RESULT
Thus a predictive model using randomForest package has been built.

Ex.No. 5 IMPLEMENT LINEAR AND LOGISTIC REGRESSION ON
DATASETS TO PREDICT THE PROBABILITY

AIM
To implement linear and logistic regression on datasets to predict the probability
LINEAR REGRESSION
ALGORITHM
1. Define two vectors 'x' and 'y' containing height (in cm) and weight (in kg)
data points respectively.
2. Fit a linear regression model 'relation' using the 'lm()' function, predicting
'y' based on 'x'.

3. Create a scatter plot of 'y' versus 'x' using 'plot()', with a regression line
added using 'abline()' with the linear regression model. Customize plot
appearance including title, axis labels, color, and point shape.

4. Save the plot as an image file named "linearregression.png" using 'png()'
function. Close the graphics device and finalize the image using 'dev.off()'.

PROGRAM
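# x holds heights in cm, y holds weights in kg
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
png(file = "linearregression.png")
# Weight on the horizontal axis, height on the vertical axis
plot(y,x,col = "blue",main = "Height & Weight Regression",
cex = 1.3,pch = 16,xlab = "Weight in Kg",
ylab = "Height in cm")
# Add the fitted line for these axes (height regressed on weight)
abline(lm(x~y))
dev.off()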
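
The model `relation` is fitted but never used for prediction. A hedged sketch of how it could estimate the weight for a new height with `predict()` is given below; the height of 170 cm is a hypothetical input chosen for illustration.

```r
# Heights in cm (x) and weights in kg (y), as in the program above
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Fit weight as a linear function of height
relation <- lm(y ~ x)

# Intercept and slope of the fitted line
print(coef(relation))

# Hypothetical new observation: a person 170 cm tall
new_height <- data.frame(x = 170)
predicted_weight <- predict(relation, newdata = new_height)
print(predicted_weight)
```

The column name in `new_height` has to match the predictor name used in the formula (`x`), otherwise `predict()` cannot map the new value onto the model.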
OUTPUT
LOGISTIC REGRESSION
ALGORITHM
1. Install and load required packages: 'dplyr', 'caTools', and 'ROCR'.

2. Generate summary statistics for the 'mtcars' dataset using 'summary()'.

3. Split the 'mtcars' dataset into training and testing sets using
'sample.split()' function.

4. Subset the dataset into training and testing sets.

5. Build a logistic regression model using 'glm()' with 'vs' as dependent and
'wt' and 'disp' as independent variables.

6. Generate summary of the logistic regression model.

7. Make predictions on the testing set and calculate classification error.

8. Calculate area under the ROC curve (AUC) using 'prediction()' and
'performance()' functions.

9. Plot the ROC curve with labeled cutoffs and a legend showing the
calculated AUC.
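
Steps 5 to 7 depend on how a logistic model turns a linear combination of the predictors into a probability. The sketch below checks that relationship by applying the logistic function to the linear predictor and comparing the result with the `type = "response"` output of `predict()`. Fitting on the full `mtcars` data rather than a training split is a simplification made here so that the example needs only base R.

```r
# Fit the same model form as in the exercise, on all rows for simplicity
model <- glm(vs ~ wt + disp, data = mtcars, family = "binomial")

# Linear predictor: b0 + b1 * wt + b2 * disp for every car
eta <- predict(model, newdata = mtcars, type = "link")

# Logistic function maps the linear predictor to a probability in (0, 1)
p_manual <- 1 / (1 + exp(-eta))

# Probabilities as reported by predict()
p_glm <- predict(model, newdata = mtcars, type = "response")

# The two should agree up to numerical tolerance
print(all.equal(p_manual, p_glm))
```
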

PROGRAM
install.packages("dplyr")
library(dplyr)
summary(mtcars)
install.packages("caTools")
install.packages("ROCR")
library(caTools)
library(ROCR)
# Split on the outcome column so every row receives its own TRUE/FALSE flag
split <- sample.split(mtcars$vs, SplitRatio = 0.8)
split
train_reg <- subset(mtcars, split == TRUE)
test_reg <- subset(mtcars, split == FALSE)
logistic_model <- glm(vs ~ wt + disp,data = train_reg,family = "binomial")
logistic_model
summary(logistic_model)
predict_reg <- predict(logistic_model,test_reg, type = "response")
predict_reg
predict_reg <- ifelse(predict_reg >0.5, 1, 0)
table(test_reg$vs, predict_reg)
missing_classerr <- mean(predict_reg != test_reg$vs)
print(paste('Accuracy =', 1 - missing_classerr))
ROCPred <- prediction(predict_reg, test_reg$vs)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
plot(ROCPer)
plot(ROCPer, colorize = TRUE, print.cutoffs.at = seq(0.1, by = 0.1), main
= "ROC CURVE")
abline(a = 0, b = 1)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)

OUTPUT
RESULT
Thus the linear and logistic regression on datasets to predict the probability has
been implemented.

Ex.No. 6 IMPLEMENT K-MEANS, K-MEDOIDS, HIERARCHICAL AND
DENSITY BASED CLUSTERING TECHNIQUES

AIM
To implement K-means, K-medoids, hierarchical and density based clustering
techniques.

K-MEANS
ALGORITHM
1. Load the Iris dataset using 'data(iris)'.

2. Examine the structure of the dataset using 'str()' to understand its variables
and types.
3. Install required packages using 'install.packages()': 'ClusterR' and 'cluster'.

4. Load required libraries using 'library()': 'ClusterR' and 'cluster'.

5. Preprocess the dataset:


- Remove the species column to perform clustering only on numerical
variables.

6. Perform K-Means Clustering:


- Set seed for reproducibility using 'set.seed()'.
- Apply K-means clustering to the preprocessed dataset with 3 centers
and 20 random starts.

7. Explore Clustering Results:


- Display clustering results obtained from K-means.
- Compute confusion matrix to evaluate clustering performance.

8. Visualize Clustering Results:


- Create scatter plots of the iris dataset using 'plot()'.
- Color points based on assigned clusters obtained from K-means.
- Add cluster centers to the plot using different symbols and colors.
- Visualize clustering results using 'clusplot()' to generate a cluster
plot.

PROGRAM
data(iris)
str(iris)
install.packages("ClusterR")
install.packages("cluster")
library(ClusterR)
library(cluster)
iris_1 <- iris[, -5]
set.seed(240)
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re

kmeans.re$cluster
cm <- table(iris$Species, kmeans.re$cluster)
cm

plot(iris_1[c("Sepal.Length", "Sepal.Width")])
plot(iris_1[c("Sepal.Length", "Sepal.Width")], col = kmeans.re$cluster)
plot(iris_1[c("Sepal.Length", "Sepal.Width")], col = kmeans.re$cluster,
main = "K-means with 3 clusters")

kmeans.re$centers
kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")]
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],col = 1:3,
pch = 8, cex = 3)
y_kmeans <- kmeans.re$cluster
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")],
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = paste("Cluster iris"),
xlab = 'Sepal.Length',
ylab = 'Sepal.Width')

OUTPUT
K-MEDOIDS
ALGORITHM
1. Load the 'factoextra' and 'cluster' packages using 'library()'.

2. Prepare the dataset:


- Load the 'USArrests' dataset.
- Remove missing values using 'na.omit()'.
- Standardize the variables using 'scale()'.

3. Explore the optimal number of clusters:


- Visualize the number of clusters using the Within Sum of Squares
(WSS) method.
- Use 'fviz_nbclust()' to calculate and visualize the number of
clusters based on Partitioning Around Medoids (PAM) algorithm.
- Select the number of clusters based on the "elbow" or a significant
decrease in WSS.

4. Compute Gap statistics:


- Calculate the Gap statistic to determine the optimal number of
clusters.
- Use 'clusGap()' to compute the Gap statistic.
- Specify the range of candidate cluster numbers and the number of
bootstrap replicates.

5. Visualize Gap statistics:


- Plot the Gap statistic results using 'fviz_gap_stat()'.
- Compare the observed Gap statistic to its expected value to identify
the optimal number of clusters.

PROGRAM
library(factoextra)
library(cluster)
df <- USArrests
df <- na.omit(df)
df <- scale(df)
head(df)

fviz_nbclust(df, pam, method = "wss")

gap_stat <- clusGap(df, FUN = pam, K.max = 10, B = 50)


fviz_gap_stat(gap_stat)
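
The program above explores how many clusters to use but stops before fitting the final K-medoids model. A minimal sketch of that last step with `pam()` from the cluster package follows. The choice of `k = 4` is an assumption made for illustration; in practice it would come from the WSS and gap statistic plots.

```r
library(cluster)

# Same preparation as in the program above
df <- USArrests
df <- na.omit(df)
df <- scale(df)

# Fit K-medoids (PAM) with an assumed number of clusters
pam_fit <- pam(df, k = 4)

# The medoids are actual observations (states) that represent each cluster
print(pam_fit$medoids)

# Number of states assigned to each cluster
print(table(pam_fit$clustering))
```
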

OUTPUT
HIERARCHICAL CLUSTERING
ALGORITHM
1. Install and load the 'dplyr' package.

2. Load the dataset 'mtcars' and examine its first few rows using 'head()'
function.

3. Compute the Euclidean distance matrix from the dataset using 'dist()'
function.

4. Perform hierarchical clustering on the distance matrix using 'hclust()'
function with the "average" linkage method.

5. Plot the dendrogram obtained from hierarchical clustering using 'plot()'
function.

6. Optionally, add a horizontal line at a specified height to cut the
dendrogram into clusters using 'abline()'.

7. Cut the dendrogram into a specified number of clusters using 'cutree()'
function.

8. Assign each observation to a cluster based on the cut obtained
in the previous step.

9. Generate a table showing the number of observations assigned to each
cluster using 'table()' function.

10. Visualize the clusters in the dendrogram by drawing rectangles around
them using 'rect.hclust()' function.

PROGRAM
install.packages("dplyr")
library(dplyr)
head(mtcars)
distance_mat <- dist(mtcars, method = 'euclidean')
distance_mat

set.seed(240)
Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl

plot(Hierar_cl)
abline(h = 110, col = "green")


fit <- cutree(Hierar_cl, k = 3 )
fit

table(fit)
rect.hclust(Hierar_cl, k = 3, border = "green")

OUTPUT
DENSITY-BASED CLUSTERING
ALGORITHM
1. Load the Iris dataset using 'data()' function.

2. Examine the structure of the dataset using 'str()' function.

3. Install the 'dbscan' package using 'install.packages()' function.

4. Load the 'dbscan' package using 'library()' function.

5. Convert the iris dataset into a matrix format excluding the species column
using 'as.matrix()'.

6. Plot the k-nearest neighbor distance plot ('kNNdistplot') of the iris data
matrix with a specified value of k.

7. Optionally, add a horizontal line at a specific distance threshold using
'abline()'.

8. Set the random seed using 'set.seed()' for reproducibility.

9. Perform DBSCAN clustering ('dbscan()') on the iris data matrix with the
specified epsilon (eps) and minimum points (minPts) parameters.

10.Store the result in 'db'.

11.Plot the cluster hulls ('hullplot()') of the iris data matrix based on the
DBSCAN clustering result.

12.The hulls represent the convex hulls of each cluster.
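
The two DBSCAN parameters in step 9 decide which observations count as core points: a point is a core point when at least minPts observations, itself included, lie within distance eps of it. The base R sketch below illustrates that rule with the values used in this exercise (eps = 0.4, minPts = 4). It is an illustration of the definition only, not a reimplementation of the dbscan package.

```r
# Numeric part of the iris data, as in the exercise
iris_matrix <- as.matrix(iris[, -5])

# Pairwise Euclidean distances between all observations
d <- as.matrix(dist(iris_matrix))

# Parameters taken from the program in this section
eps <- 0.4
min_pts <- 4

# Number of observations within eps of each point (the point itself counts)
neighbour_counts <- rowSums(d <= eps)

# Core points have at least min_pts observations in their eps-neighbourhood
is_core <- neighbour_counts >= min_pts
print(table(is_core))
```
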

PROGRAM
data(iris)
str(iris)
install.packages("dbscan")
library(dbscan)
iris_matrix <- as.matrix(iris[, -5])
kNNdistplot(iris_matrix, k=4)

abline(h=0.4, col="red")
set.seed(1234)
db = dbscan(iris_matrix, 0.4, 4)
db
hullplot(iris_matrix, db$cluster)
OUTPUT

RESULT
Thus the K-means, K-medoids, hierarchical and density based clustering
techniques have been implemented.

Ex.No. 7 IMPLEMENT TIME SERIES ANALYSIS USING
CLASSIFICATION AND CLUSTERING TECHNIQUES

AIM
To implement time series analysis using classification and clustering
techniques.

ALGORITHM
1. Install the "lubridate" package using `install.packages("lubridate")`.
2. Load the "lubridate" package using `library(lubridate)`.
3. Define the weekly data vector `x` representing total positive COVID-19
cases.

4. Convert the data into a time series object `mts` using the `ts()` function.

5. Specify the start date as January 22, 2020, and the frequency as weekly.

6. Open a PNG device for saving the plot using `png(file =
"timeSeries.png")`.

7. Plot the time series data `mts` using the `plot()` function.

8. Customize the plot with appropriate axis labels, title, and color.

9. Save the plot as a PNG file named "timeSeries.png" using `dev.off()`.
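
Steps 4 and 5 determine how `ts()` places each observation on the time axis. The sketch below shows this with base R only. The start value `2020 + 21 / 366` is an assumption that approximates what `decimal_date(ymd("2020-01-22"))` returns (21 days into a 366-day year), used here so the example runs without lubridate.

```r
# Weekly totals, as in the exercise
x <- c(580, 7813, 28266, 59287, 75700,
       87820, 95314, 126214, 218843, 471497,
       936851, 1508725, 2072113)

# Assumed decimal-year equivalent of 22 January 2020
start_decimal <- 2020 + 21 / 366

# frequency = observations per year; 365.25 / 7 means roughly 52.18 weeks per year
mts <- ts(x, start = start_decimal, frequency = 365.25 / 7)

# Time stamp (in decimal years) attached to each observation
print(time(mts))

# Gap between consecutive observations, in years (one week)
print(deltat(mts))
```
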

PROGRAM
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843, 471497,
936851, 1508725, 2072113)
install.packages("lubridate")
library(lubridate)
png(file ="timeSeries.png")
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
frequency = 365.25 / 7)
plot(mts, xlab ="Weekly Data",
ylab ="Total Positive Cases",
main ="COVID-19 Pandemic",
col.main ="darkgreen")
dev.off()
OUTPUT

CLUSTERING
ALGORITHM
1. Install the "factoextra" package if it's not already installed.
2. Load the "factoextra" library.
3. Load the dataset (in this case, mtcars).
4. Remove any rows with missing values from the dataset.
5. Scale the dataset to standardize the variables.
6. Open a PNG device for saving the plot.
7. Perform K-means clustering on the scaled dataset with specified
parameters (centers = 5, nstart = 25).
8. Visualize the clustering results using the fviz_cluster() function from the
factoextra package.
9. Save the plot as a PNG file.
10.Close the PNG device.
PROGRAM
# Install the "factoextra" package if not installed
# install.packages("factoextra")

# Load the "factoextra" library


library(factoextra)

# Load the dataset (in this case, mtcars)


df <- mtcars

# Remove any rows with missing values from the dataset


df <- na.omit(df)

# Scale the dataset to standardize the variables


df <- scale(df)

# Open a PNG device for saving the plot


png(file = "KMeansExample2.png")

# Perform K-means clustering on the scaled dataset with specified parameters


km <- kmeans(df, centers = 5, nstart = 25)

# Visualize the clustering results using the fviz_cluster() function


fviz_cluster(km, data = df)

# Save the plot as a PNG file


dev.off()
OUTPUT

RESULT
Thus the time series analysis using classification and clustering techniques has
been implemented.

Ex.No. 8 IMPLEMENT APRIORI ALGORITHM IN ASSOCIATION
RULE MINING

AIM
To implement apriori algorithm in association rule mining.

ALGORITHM
1. Install the "arules" package if it's not already installed.
2. Load the "arules" library.
3. Install the "arulesViz" package if it's not already installed.
4. Load the "arulesViz" library.
5. Install the "RColorBrewer" package if it's not already installed.
6. Load the "RColorBrewer" library.
7. Load the "Groceries" dataset.
8. Generate association rules using the apriori() function from the "arules"
package. Set parameters for minimum support and confidence levels.
9. Inspect the first 10 association rules using the inspect() function.
10.Plot the relative item frequency of the top 20 items using the
itemFrequencyPlot() function. Customize the plot with colors from the
RColorBrewer palette.

PROGRAM
install.packages("arules")
library(arules)
install.packages("arulesViz")
library(arulesViz)
install.packages("RColorBrewer")
library(RColorBrewer)
data("Groceries")
rules <- apriori(Groceries,
parameter = list(supp = 0.01, conf = 0.2))
inspect(rules[1:10])

arules::itemFrequencyPlot(Groceries, topN = 20,


col = brewer.pal(8, 'Pastel2'),
main = 'Relative Item Frequency Plot',
type = "relative",
ylab = "Item Frequency (Relative)")

OUTPUT
ASSOCIATION RULE MINING
ALGORITHM
1. Define a list representing market basket transactions.
2. Assign names to the transactions.
3. Load the "arules" library.
4. Convert the list to transactions using the as() function.
5. Display the dimensions of the transactions.
6. Display a summary of the transactions.
7. Display an image of the transactions.
8. Plot the item frequency of the transactions.
9. Generate association rules using the apriori() function with specified
parameters.
10.Display a summary of the generated rules.
11.Inspect the generated rules.
12.Extract association rules with "beer" on the right-hand side (rhs).
13.Inspect the rules with "beer" on the right-hand side (rhs).
14.Load the "arulesViz" library.
15.Plot the association rules using different visualization methods.
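
The support and confidence thresholds in step 9 are easier to interpret once computed by hand for a single rule. The base R sketch below does this for the rule {rice} => {beer} on the same eight transactions. The helper `has_items` is hypothetical and written for this illustration; it is not part of the arules package.

```r
# Same market basket transactions as in the program below
market_basket <- list(
  c("apple", "beer", "rice", "meat"),
  c("apple", "beer", "rice"),
  c("apple", "beer"),
  c("apple", "pear"),
  c("milk", "beer", "rice", "meat"),
  c("milk", "beer", "rice"),
  c("milk", "beer"),
  c("milk", "pear")
)

# Hypothetical helper: TRUE for each transaction that contains all given items
has_items <- function(items) {
  sapply(market_basket, function(basket) all(items %in% basket))
}

# Support: share of transactions containing both rice and beer
support_rule <- mean(has_items(c("rice", "beer")))

# Confidence: among transactions with rice, the share that also contain beer
confidence_rule <- support_rule / mean(has_items("rice"))

# Lift: confidence relative to the overall share of transactions with beer
lift_rule <- confidence_rule / mean(has_items("beer"))

print(c(support = support_rule, confidence = confidence_rule, lift = lift_rule))
```

Here four of the eight transactions contain both items, every transaction with rice also contains beer, and six of eight contain beer, so the rule has support 0.5, confidence 1 and a lift above 1.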
PROGRAM
# Define market basket transactions
market_basket <- list(
c("apple", "beer", "rice", "meat"),
c("apple", "beer", "rice"),
c("apple", "beer"),
c("apple", "pear"),
c("milk", "beer", "rice", "meat"),
c("milk", "beer", "rice"),
c("milk", "beer"),
c("milk", "pear")
)
names(market_basket) <- paste("T", c(1:8), sep = "")

# Load the "arules" library


library(arules)

# Convert list to transactions


trans <- as(market_basket, "transactions")

# Display dimensions of transactions


dim(trans)

# Display summary of transactions


summary(trans)
# Display an image of the transactions
image(trans)

# Plot item frequency of transactions


itemFrequencyPlot(trans, topN = 10, cex.names = 1)

# Generate association rules


rules <- apriori(trans, parameter = list(supp = 0.3, conf = 0.5, maxlen = 10,
target = "rules"))

# Display summary of generated rules


summary(rules)

# Inspect generated rules


inspect(rules)

# Extract association rules with "beer" on the right-hand side (rhs)


beer_rules_rhs <- apriori(trans, parameter = list(supp = 0.3, conf = 0.5, maxlen
= 10, minlen = 2),
appearance = list(default = "lhs", rhs = "beer"))
# Inspect rules with "beer" on the right-hand side (rhs)
inspect(beer_rules_rhs)

# Load the "arulesViz" library


library(arulesViz)

# Plot association rules using different visualization methods


plot(rules)
plot(rules, measure = "confidence")

plot(rules, method = "two-key plot")


plot(rules, engine = "plotly")

subrules <- head(rules, n = 10, by = "confidence")


plot(subrules, method = "graph", engine = "htmlwidget")
plot(subrules, method = "paracoord")

RESULT
Thus the Apriori Algorithm in Association rule mining has been implemented.

Ex.No. 9 IMPLEMENT TEXT MINING ON TWITTER DATA USING
twitteR PACKAGE

AIM
To implement text mining on Twitter data using the twitteR package (the program
below uses the rtweet package for the same purpose).

ALGORITHM
1. Install required packages: "rtweet", "ggplot2", "dplyr", "tidytext",
"igraph", and "ggraph".
2. Load the required libraries: "rtweet", "ggplot2", "dplyr", "tidytext",
"igraph", and "ggraph".
3. Use search_tweets() function from "rtweet" package to search for tweets
containing the hashtag "#climatechange".
4. Clean the text of tweets by removing URLs using regular expressions.
5. Tokenize the cleaned text to extract individual words.
6. Count the frequency of unique words in the tweets.
7. Plot the count of the top 15 unique words found in the tweets using
ggplot2.
8. Load the built-in stop words dataset from "tidytext".
9. Remove stop words from the tokenized text.
10.Count the frequency of unique words after removing stop words.
11.Plot the count of the top 15 unique words after stop words removal using
ggplot2.
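
Steps 4 to 10 can be tried without Twitter API access. The base R sketch below applies the same sequence (strip URLs, tokenize, remove stop words, count) to two made-up tweets. The sample texts, the example.com and example.org URLs, and the short stop word list are all hypothetical stand-ins for the real search results and the tidytext `stop_words` dataset.

```r
# Hypothetical tweets standing in for the search_tweets() result
tweets <- c("Climate change is real https://example.com/a",
            "Act on climate change now! http://example.org")

# Remove each URL up to the next whitespace
stripped <- gsub("https?://\\S+", "", tweets)

# Tokenize: lower-case the text and split on anything that is not a letter or apostrophe
words <- unlist(strsplit(tolower(stripped), "[^a-z']+"))
words <- words[nzchar(words)]

# Hypothetical, very short stop word list
stop_list <- c("is", "on", "the", "a")
kept <- words[!words %in% stop_list]

# Word frequencies, most frequent first
freq <- sort(table(kept), decreasing = TRUE)
print(freq)
```
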

PROGRAM
# Load required libraries
library(rtweet)
library(ggplot2)
library(dplyr)
library(tidytext)
library(igraph)
library(ggraph)

# Search for tweets containing the hashtag "#climatechange"


climate_tweets <- search_tweets(q = "#climatechange", n = 10000, lang = "en",
include_rts = FALSE)

# Clean the text of tweets by removing URLs using regular expressions


# "https?://\\S+" removes each URL up to the next whitespace and keeps the rest of the tweet
climate_tweets$stripped_text <- gsub("https?://\\S+", "", climate_tweets$text)

# Tokenize the cleaned text to extract individual words


climate_tweets_clean <- climate_tweets %>%
dplyr::select(stripped_text) %>%
unnest_tokens(word, stripped_text)

# Count the frequency of unique words in the tweets


word_freq <- climate_tweets_clean %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n))

# Plot the count of the top 15 unique words found in tweets


ggplot(word_freq, aes(x = word, y = n)) +
geom_col() +
coord_flip() +
labs(x = "Unique words", y = "Count",
title = "Count of unique words found in tweets")

# Load the built-in stop words dataset from "tidytext"


data("stop_words")

# Remove stop words from the tokenized text


cleaned_tweet_words <- anti_join(climate_tweets_clean, stop_words)

# Count the frequency of unique words after removing stop words


word_freq_no_stopwords <- cleaned_tweet_words %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n))

# Plot the count of the top 15 unique words after stop words removal
ggplot(word_freq_no_stopwords, aes(x = word, y = n)) +
geom_col() +
coord_flip() +
labs(x = "Unique words", y = "Count",
title = "Count of unique words found in tweets (Stop words removed)")

RESULT
Thus text mining on Twitter data has been implemented using the rtweet
package.
