DSDA Manual
AIM
To demonstrate basic arithmetic and logical operations, as well as file
input/output operations, using the R programming language.
ALGORITHM
1. ADDITION OPERATOR (+)
Define vectors a and b with values.
Print the sum of vectors a and b.
2. SUBTRACTION OPERATOR (-)
Define variables a and b with values.
Print the result of subtracting b from a.
3. MULTIPLICATION OPERATOR (*)
Define vectors b and c with values.
Print the element-wise product of vectors b and c.
4. DIVISION OPERATOR (/)
Define variables a and b with values.
Print the result of dividing a by b.
5. ELEMENT-WISE LOGICAL AND OPERATOR (&)
Define lists list1 and list2 with values.
Print the result of the element-wise logical AND of list1 and list2.
6. ELEMENT-WISE LOGICAL OR OPERATOR (|)
Define lists list1 and list2 with values.
Print the result of the element-wise logical OR of list1 and list2.
7. NOT OPERATOR (!)
Define list list1 with values.
Print the result of the logical NOT of list1.
8. LOGICAL AND OPERATOR (&&)
Define lists list1 and list2 with values.
Print the result of the logical AND (&&) of list1 and list2.
9. LESS THAN (<)
Define lists list1 and list2 with values.
Print the result of comparing elements of list1 and list2 for less
than.
10. LESS THAN OR EQUAL TO (<=)
Define lists list1 and list2 with values.
Convert lists to character vectors.
Print the result of comparing elements of list1 and list2 for less than or equal to.
11. GREATER THAN (>)
Define lists list1 and list2 with values.
Print the result of comparing elements of list1 and list2 for greater than.
12. GREATER THAN OR EQUAL TO (>=)
Define lists list1 and list2 with values.
Print the result of comparing elements of list1 and list2 for greater than or equal to.
13. LEFT ASSIGNMENT (<-, <<- OR =)
Define a vector vec1 using left assignment.
Print vec1.
14. RIGHT ASSIGNMENT (-> OR ->>)
Define a vector vec1 using right assignment.
Print vec1.
15. %IN% OPERATOR
Define a value and a list.
Print whether the value is in the list using the %in% operator.
16. %*% OPERATOR
Define a matrix.
Print the matrix.
17. EXPORT FILE INTO R
Create vectors for name and age.
Create a data frame using these vectors.
Print the data frame.
18. CSV FILE FORMAT SAVE
Create vectors for name and age.
Create a data frame using these vectors.
View the data frame.
Write the data frame to a CSV file named "newdata.csv".
19. SAVE FILE PATH VIEW
Create vectors for name and age.
Create a data frame using these vectors.
View the data frame.
Write the data frame to a CSV file named "newdata.csv".
Print the current working directory.
PROGRAM
Addition operator (+):
Sample Input:
a <- c(1, 0.1)
b <- c(2.33, 4)
print(a + b)
Sample Output : 3.33 4.10
%in% Operator:
Sample Input:
val <- 0.1
list1 <- c(TRUE, 0.1, "apple")
print(val %in% list1)
Sample Output : TRUE
%*% Operator:
Sample Input:
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
print(mat)
Sample Output : 1 3 5
2 4 6
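The %*% operator itself performs matrix multiplication; a brief sketch multiplying mat by a conformable matrix (the 3 x 1 matrix of ones is an illustrative choice):
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
ones <- matrix(1, nrow = 3, ncol = 1)   # a 3 x 1 matrix of ones
print(mat %*% ones)                     # 2 x 1 result: the row sums of mat (9 and 12)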
EXPORT FILE INTO R:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
Sample output:
Name Age
xxx 20
yyy 30
zzz 25
aaa 21
bbb 23
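CSV file format save and save file path view:
Sample Input (a minimal sketch of steps 18 and 19, assuming the same data frame; the file name comes from the algorithm):
Name <- c("xxx", "yyy", "zzz", "aaa", "bbb")
Age <- c(20, 30, 25, 21, 23)
data <- data.frame(Name, Age)
View(data)
write.csv(data, "newdata.csv")   # save the data frame in CSV format
print(getwd())                   # print the current working directory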
Sample output:
"C:/Users/ELCOT/Documents/newdata"
RESULT
Thus the basic arithmetic and logical operations, as well as the file input/output
operations, have been executed, generating the required files and displaying the
results for the given data.
AIM
To implement data exploration and visualization on different datasets to explore
multiple and individual variables.
ALGORITHM
1. TOTAL ROWS AND COLUMNS:
Create vectors for name and age.
Combine vectors into a data frame.
Print the dimensions of the data frame.
2. PROJECT VIEW:
Create vectors for name and age.
Combine vectors into a data frame.
Print the first few rows of the data frame.
3. TITLES VIEW:
Create vectors for name and age.
Combine vectors into a data frame.
Print the names of columns in the data frame.
4. DATA FRAME:
Create vectors for name and age.
Combine vectors into a data frame.
Print the structure of the data frame.
5. PROJECT VIEW:
Create vectors for name and age.
Combine vectors into a data frame.
View the data frame in the rstudio project view.
6. NUMERIC OR CHARACTER:
Create vectors for name and age.
Combine vectors into a data frame.
Print the class of the name and age columns.
7. TABLE VIEW:
Create vectors for name and age.
Combine vectors into a data frame.
Print the frequency table of the name column.
8. MEAN, MEDIAN, MODE, QUARTILE, MINIMUM, MAXIMUM:
Create vectors for name and age.
Combine vectors into a data frame.
Print summary statistics for the age column.
9. BARPLOT:
Load the airquality dataset into a data frame.
View the data frame.
Create a barplot of the data.
10. HISTOGRAM:
Load the airquality dataset into a data frame.
View the data frame.
Create a histogram of the data.
11. BOX PLOT:
Load the airquality dataset into a data frame.
View the data frame.
Create a box plot of the data.
12. SCATTER PLOT:
Load the airquality dataset into a data frame.
View the data frame.
Create a scatter plot of the data.
13. HEAT MAP:
Create a random matrix.
Assign row and column names to the matrix.
Create a heatmap of the matrix.
14. 3D GRAPHS:
Define a function for a cone.
Prepare variables for x, y, and z.
Plot a 3D surface of the cone.
15. CLASS() FUNCTION:
Define variables with different data types.
Print the class of each variable.
16. LS() FUNCTION:
Define variables using different assignment operators.
Print the list of variables in the current environment.
17. RM() FUNCTION:
Define variables using different assignment operators.
Remove a variable.
Print the removed variable (to demonstrate removal).
Print an existing variable (to demonstrate non-removal).
PROGRAM
Total Rows and Columns:
Sample Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
dim(data)
Output : 5 2
Project view:
Sample Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
head(data)
Output : Name Age
xxx 20
yyy 30
zzz 25
aaa 21
bbb 23
Titles view:
Sample Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
names(data)
Output : "Name" "Age"
Data Frame:
Sample Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
str(data)
Output : 'data.frame': 5 obs. of 2 variables:
$ Name: chr "xxx" "yyy" "zzz" "aaa" ...
$ Age : num 20 30 25 21 23
Numeric or Character:
Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
class(data$Name)
Output : "character"
Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
class(data$Age)
Output : " "numeric""
Table view:
Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
table(data$Name)
Output : aaa bbb xxx yyy zzz
1 1 1 1 1
Mean, Median, Mode, Quartile, Minimum, Maximum:
Input:
Name<- c("xxx","yyy","zzz","aaa","bbb")
Age<- c(20,30,25,21,23)
data<-data.frame(Name,Age)
data
summary(data$Age)
Output :
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.0 21.0 23.0 23.8 25.0 30.0
BARPLOT:
Input:
data<-data.frame(airquality)
View(airquality)
Output :
barplot(airquality$Ozone,
main = 'Ozone Concentration in air',
xlab = 'ozone levels', horiz = TRUE)
Output :
HISTOGRAM:
Input:
data<-data.frame(airquality)
View(airquality)
Output :
data(airquality)
hist(airquality$Temp,
     main = "La Guardia Airport's Maximum Temperature (Daily)",
     xlab = "Temperature (Fahrenheit)",
     xlim = c(50, 125), col = "yellow",
     freq = TRUE)
Output :
BOX PLOT:
Input:
data<-data.frame(airquality)
View(airquality)
Output :
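The box-plot call itself can be sketched as follows (the Ozone column, title, and colors are assumed choices):
boxplot(airquality$Ozone,
        main = 'Ozone Concentration in air',
        ylab = 'ozone levels',
        col = 'orange', border = 'brown')
Output :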
SCATTER PLOT:
Input:
data<-data.frame(airquality)
View(airquality)
Output :
# Scatter plot for Ozone Concentration per month
data(airquality)
plot(airquality$Ozone, airquality$Month,
main ="Scatterplot Example",
xlab ="Ozone Concentration in parts per billion",
ylab =" Month of observation ", pch = 19)
Output :
HEAT MAP:
Input:
data <- matrix(rnorm(25, 0, 5), nrow = 5, ncol = 5)
colnames(data) <- paste0("col", 1:5)
rownames(data) <- paste0("row", 1:5)
heatmap(data)
View(data)
Output :
3D GRAPHS:
Input:
# Adding Titles and Labeling Axes to Plot
cone <- function(x, y){
sqrt(x ^ 2 + y ^ 2)
}
# prepare variables.
x <- y <- seq(-1, 1, length = 30)
z <- outer(x, y, cone)
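The surface plot itself can be sketched with base R's persp() (the viewing angles, labels, and color are assumed choices):
persp(x, y, z, theta = 30, phi = 15,
      main = "3D Cone", zlab = "Height",
      col = "orange", shade = 0.4)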
Output :
class() function:
Input:
#Character
var1 = "hello"
print(class(var1))
Output : "character"
Input:
# Numeric
var1 = 10
print(class(var1))
Output : "numeric"
ls() function:
Input:
var1 = "hello"
var2 <- "hello"
"hello" -> var3
print(ls())
Output : "var1" "var2" "var3"
rm() function:
Input:
# using equal to operator
var1 = "hello"
# using leftward operator
var2 <- "hello"
# using rightward operator
"hello" -> var3
# Removing variable
rm(var3)
print(var2)
Output : "hello"
RESULT
Thus data exploration and visualization on different datasets, exploring multiple
and individual variables, have been implemented.
AIM
To build a decision tree using party and rpart packages.
PACKAGE - PARTY
ALGORITHM
1. Load necessary libraries required for the analysis.
2. Load the readingSkills dataset (provided with the party package) and
display the first few rows using the head() function.
3. Split the dataset into training and testing datasets using the sample.split()
function from caTools package.
PROGRAM
library(datasets)
library(caTools)
library(party)
library(dplyr)
library(magrittr)
data("readingSkills")
head(readingSkills)
sample_data <- sample.split(readingSkills, SplitRatio = 0.8)
train_data <- subset(readingSkills, sample_data == TRUE)
test_data <- subset(readingSkills, sample_data == FALSE)
model <- ctree(nativeSpeaker ~ ., train_data)
plot(model)
OUTPUT
PACKAGE - RPART
ALGORITHM
1. Load necessary libraries required for the analysis.
2. Load the iris dataset from the datasets package into a data frame and view
its structure.
3. Set a seed for reproducibility, then randomly sample 70% of the rows
from the iris dataset for training and the rest for testing.
4. Build a decision tree model using rpart() on the training data, with
Species as the target variable and all other variables as predictors.
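PROGRAM
A minimal program sketch following the algorithm above (the seed value, the 70/30 split, and the rpart.plot() call used for visualization are assumptions):
library(datasets)
library(rpart)
library(rpart.plot)   # assumed helper package for plotting the tree
data(iris)
str(iris)
set.seed(123)   # seed value is an assumption
train_index <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]
model <- rpart(Species ~ ., data = train_data, method = "class")
rpart.plot(model)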
OUTPUT
RESULT
Thus a decision tree using party and rpart packages has been built.
AIM
To build a predictive model using randomForest package.
ALGORITHM
1. Load the required libraries: 'lubridate', 'randomForest', and 'forecast'.
2. Define the sales data vector 'x' containing observed sales data.
3. Convert 'x' into a time series object 'mts' with start date and frequency.
4. Convert 'mts' into a data frame 'df' with 'Week' and 'Sales' columns.
5. Fit a random forest regression model with 'randomForest()', predicting 'Sales'
from 'Week' in 'df'.
6. Generate forecasts for the next 5 periods using 'predict()' with new data
representing the next 5 weeks.
7. Plot the observed sales data and the forecasted sales values for the next 5
periods using 'plot()'.
PROGRAM
# Load Required Libraries
library(lubridate)
library(randomForest)
library(forecast)
# Define Sales Data
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843,
471497, 936851, 1508725, 2072113)
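# The conversion, model-fitting, and forecasting steps are sketched below under
# stated assumptions (weekly frequency starting 2020-01-22; a random forest
# regression of Sales on Week); adjust to the intended model as needed.
# Convert Sales Data to a Weekly Time Series and a Data Frame
mts <- ts(x, start = decimal_date(ymd("2020-01-22")), frequency = 365.25 / 7)
df <- data.frame(Week = as.numeric(time(mts)), Sales = as.numeric(mts))
# Fit Random Forest Model
rf_model <- randomForest(Sales ~ Week, data = df)
# Forecast Next 5 Weeks
new_weeks <- data.frame(Week = max(df$Week) + (1:5) * (7 / 365.25))
forecast_values <- predict(rf_model, newdata = new_weeks)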
# Plot Forecast
plot(1:(length(x) + 5), c(x, forecast_values), type = "l",
xlab = "Week", ylab = "Total Revenue",
main = "Sales vs Revenue", col.main = "darkgreen",
ylim = c(0, max(x, forecast_values)))
lines(1:length(x), x, col = "blue", lwd = 2) # Plot observed sales
lines((length(x) + 1):(length(x) + 5), forecast_values, col = "red", lwd = 2) # Plot forecasted sales
legend("topright", legend = c("Observed Sales", "Forecasted Sales"),
col = c("blue", "red"), lwd = 2)
OUTPUT
RESULT
Thus a predictive model using randomForest package has been built.
AIM
To implement linear and logistic regression on datasets to predict values and the probability of an outcome.
LINEAR REGRESSION
ALGORITHM
1. Define two vectors 'x' and 'y' containing weight and height data points
respectively.
2. Fit a linear regression model 'relation' using the 'lm()' function, predicting
'y' based on 'x'.
3. Create a scatter plot of 'y' versus 'x' using 'plot()', with a regression line
added using 'abline()' with the linear regression model. Customize plot
appearance including title, axis labels, color, and point shape.
PROGRAM
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
png(file = "linearregression.png")
plot(y,x,col = "blue",main = "Height & Weight Regression",
abline(lm(x~y)),cex = 1.3,pch = 16,xlab = "Weight in Kg",
ylab = "Height in cm")
dev.off()
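The fitted model can also be used for prediction; a brief sketch predicting the weight for a new height (the value 170 cm is an arbitrary illustration):
a <- data.frame(x = 170)         # new height of 170 cm (illustrative)
result <- predict(relation, a)   # predicted weight in kg
print(result)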
OUTPUT
LOGISTIC REGRESSION
ALGORITHM
1. Install and load the required packages: 'dplyr', 'caTools', and 'ROCR'.
2. Explore the 'mtcars' dataset using 'summary()'.
3. Split the 'mtcars' dataset into training and testing sets using the
'sample.split()' function.
4. Build a logistic regression model using 'glm()' with 'vs' as the dependent
variable and 'wt' and 'disp' as the independent variables.
5. Predict on the test set with 'predict()', convert the probabilities to class
labels with a 0.5 threshold, and compute the confusion matrix and accuracy.
6. Calculate the area under the ROC curve (AUC) using the 'prediction()' and
'performance()' functions.
7. Plot the ROC curve with labeled cutoffs and a legend showing the
calculated AUC.
PROGRAM
install.packages("dplyr")
library(dplyr)
summary(mtcars)
install.packages("caTools")
install.packages("ROCR")
library(caTools)
library(ROCR)
split <- sample.split(mtcars, SplitRatio = 0.8)
split
train_reg <- subset(mtcars, split == "TRUE")
test_reg <- subset(mtcars, split == "FALSE")
logistic_model <- glm(vs ~ wt + disp,data = train_reg,family = "binomial")
logistic_model
summary(logistic_model)
predict_reg <- predict(logistic_model,test_reg, type = "response")
predict_reg
predict_reg <- ifelse(predict_reg >0.5, 1, 0)
table(test_reg$vs, predict_reg)
missing_classerr <- mean(predict_reg != test_reg$vs)
print(paste('Accuracy =', 1 - missing_classerr))
ROCPred <- prediction(predict_reg, test_reg$vs)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
plot(ROCPer)
plot(ROCPer, colorize = TRUE, print.cutoffs.at = seq(0.1, by = 0.1), main = "ROC CURVE")
abline(a = 0, b = 1)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
OUTPUT
RESULT
Thus linear and logistic regression on datasets to predict values and the
probability of an outcome have been implemented.
AIM
To implement K-means, K-medoids, hierarchical and density-based clustering
techniques.
K-MEANS
ALGORITHM
1. Load the Iris dataset using 'data(iris)'.
2. Examine the structure of the dataset using 'str()' to understand its variables
and types.
3. Install required packages using 'install.packages()': 'ClusterR' and 'cluster'.
PROGRAM
data(iris)
str(iris)
install.packages("ClusterR")
install.packages("cluster")
library(ClusterR)
library(cluster)
iris_1 <- iris[, -5]
set.seed(240)
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re
kmeans.re$cluster
cm <- table(iris$Species, kmeans.re$cluster)
cm
plot(iris_1[c("Sepal.Length", "Sepal.Width")])
plot(iris_1[c("Sepal.Length", "Sepal.Width")], col = kmeans.re$cluster)
plot(iris_1[c("Sepal.Length", "Sepal.Width")], col = kmeans.re$cluster,
main = "K-means with 3 clusters")
kmeans.re$centers
kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")]
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],col = 1:3,
pch = 8, cex = 3)
y_kmeans <- kmeans.re$cluster
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")],
y_kmeans,
lines = 0,
shade = TRUE,
color = TRUE,
labels = 2,
plotchar = FALSE,
span = TRUE,
main = paste("Cluster iris"),
xlab = 'Sepal.Length',
ylab = 'Sepal.Width')
OUTPUT
K-MEDOIDS
ALGORITHM
1. Load the 'factoextra' and 'cluster' packages using 'library()'.
PROGRAM
library(factoextra)
library(cluster)
df <- USArrests
df <- na.omit(df)
df <- scale(df)
head(df)
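A minimal sketch of the K-medoids step itself, using pam() from the cluster package (k = 4 is an assumed number of clusters):
kmed <- pam(df, k = 4)          # partition the scaled data around 4 medoids
kmed
fviz_cluster(kmed, data = df)   # visualize the clusters with factoextra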
OUTPUT
HIERARCHICAL CLUSTERING
ALGORITHM
1. Install and load the 'dplyr' package.
2. Load the dataset 'mtcars' and examine its first few rows using 'head()'
function.
3. Compute the Euclidean distance matrix from the dataset using 'dist()'
function.
PROGRAM
install.packages("dplyr")
library(dplyr)
head(mtcars)
distance_mat <- dist(mtcars, method = 'euclidean')
distance_mat
set.seed(240)
Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl
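A minimal sketch of the remaining steps, plotting the dendrogram and cutting the tree (k = 3 is an assumed number of clusters):
plot(Hierar_cl)                   # dendrogram of the average-linkage clustering
fit <- cutree(Hierar_cl, k = 3)   # cut the tree into 3 clusters
table(fit)                        # cluster sizes
rect.hclust(Hierar_cl, k = 3, border = "green")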
OUTPUT
DENSITY-BASED CLUSTERING
ALGORITHM
1. Load the Iris dataset using 'data()' function.
5. Convert the iris dataset into a matrix format excluding the species column
using 'as.matrix()'.
6. Plot the k-nearest neighbor distance plot ('kNNdistplot') of the iris data
matrix with a specified value of k.
9. Perform DBSCAN clustering ('dbscan()') on the iris data matrix with the
specified epsilon (eps) and minimum points (minPts) parameters.
11.Plot the cluster hulls ('hullplot()') of the iris data matrix based on the
DBSCAN clustering result.
PROGRAM
data(iris)
str(iris)
install.packages("dbscan")
library(dbscan)
iris_matrix <- as.matrix(iris[, -5])
kNNdistplot(iris_matrix, k=4)
abline(h=0.4, col="red")
set.seed(1234)
db <- dbscan(iris_matrix, eps = 0.4, minPts = 4)
db
hullplot(iris_matrix, db$cluster)
OUTPUT
RESULT
Thus the K-means, K-medoids, hierarchical and density-based clustering
techniques have been implemented.
AIM
To implement time series analysis using classification and clustering
techniques.
ALGORITHM
1. Install the "lubridate" package using `install.packages("lubridate")`.
2. Load the "lubridate" package using `library(lubridate)`.
3. Define the weekly data vector `x` representing total positive COVID-19
cases.
4. Convert the data into a time series object `mts` using the `ts()` function.
5. Specify the start date as January 22, 2020, and the frequency as weekly.
6. Open a PNG graphics device using `png()` so the plot is saved to a file.
7. Plot the time series data `mts` using the `plot()` function.
8. Customize the plot with appropriate axis labels, title, and color, then close
the device with `dev.off()`.
PROGRAM
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843, 471497,
936851, 1508725, 2072113)
install.packages("lubridate")
library(lubridate)
png(file ="timeSeries.png")
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
frequency = 365.25 / 7)
plot(mts, xlab ="Weekly Data",
ylab ="Total Positive Cases",
main ="COVID-19 Pandemic",
col.main ="darkgreen")
dev.off()
OUTPUT
CLUSTERING
ALGORITHM
1. Install the "factoextra" package if it's not already installed.
2. Load the "factoextra" library.
3. Load the dataset (in this case, mtcars).
4. Remove any rows with missing values from the dataset.
5. Scale the dataset to standardize the variables.
6. Open a PNG device for saving the plot.
7. Perform K-means clustering on the scaled dataset with specified
parameters (centers = 5, nstart = 25).
8. Visualize the clustering results using the fviz_cluster() function from the
factoextra package.
9. Save the plot as a PNG file.
10. Close the PNG device.
PROGRAM
# Install the "factoextra" package if not installed
# install.packages("factoextra")
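A minimal sketch of the clustering program following the algorithm above (the output file name is an assumption):
library(factoextra)
df <- mtcars
df <- na.omit(df)                    # remove rows with missing values
df <- scale(df)                      # standardize the variables
png(file = "kmeansCluster.png")      # file name is an assumption
km <- kmeans(df, centers = 5, nstart = 25)
print(fviz_cluster(km, data = df))   # visualize the clustering result
dev.off()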
RESULT
Thus the time series analysis using classification and clustering techniques has
been implemented.
AIM
To implement apriori algorithm in association rule mining.
ALGORITHM
1. Install the "arules" package if it's not already installed.
2. Load the "arules" library.
3. Install the "arulesViz" package if it's not already installed.
4. Load the "arulesViz" library.
5. Install the "RColorBrewer" package if it's not already installed.
6. Load the "RColorBrewer" library.
7. Load the "Groceries" dataset.
8. Generate association rules using the apriori() function from the "arules"
package. Set parameters for minimum support and confidence levels.
9. Inspect the first 10 association rules using the inspect() function.
10. Plot the relative item frequency of the top 20 items using the
itemFrequencyPlot() function. Customize the plot with colors from the
RColorBrewer palette.
PROGRAM
install.packages("arules")
library(arules)
install.packages("arulesViz")
library(arulesViz)
install.packages("RColorBrewer")
library(RColorBrewer)
data("Groceries")
rules <- apriori(Groceries,
parameter = list(supp = 0.01, conf = 0.2))
inspect(rules[1:10])
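A minimal sketch of the item-frequency plot described in step 10 (the plot title is an assumption):
itemFrequencyPlot(Groceries, topN = 20, type = "relative",
                  col = brewer.pal(8, "Pastel2"),
                  main = "Relative Item Frequency Plot")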
OUTPUT
ASSOCIATION RULE MINING
ALGORITHM
1. Define a list representing market basket transactions.
2. Assign names to the transactions.
3. Load the "arules" library.
4. Convert the list to transactions using the as() function.
5. Display the dimensions of the transactions.
6. Display a summary of the transactions.
7. Display an image of the transactions.
8. Plot the item frequency of the transactions.
9. Generate association rules using the apriori() function with specified
parameters.
10. Display a summary of the generated rules.
11. Inspect the generated rules.
12. Extract association rules with "beer" on the right-hand side (rhs).
13. Inspect the rules with "beer" on the left-hand side (lhs).
14. Load the "arulesViz" library.
15. Plot the association rules using different visualization methods.
PROGRAM
# Define market basket transactions
market_basket <- list(
c("apple", "beer", "rice", "meat"),
c("apple", "beer", "rice"),
c("apple", "beer"),
c("apple", "pear"),
c("milk", "beer", "rice", "meat"),
c("milk", "beer", "rice"),
c("milk", "beer"),
c("milk", "pear")
)
names(market_basket) <- paste("T", c(1:8), sep = "")
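A minimal sketch of the remaining steps following the algorithm above (the support and confidence thresholds are assumed values):
library(arules)
trans <- as(market_basket, "transactions")   # convert the list to transactions
dim(trans)
summary(trans)
image(trans)                                 # binary incidence matrix of the transactions
itemFrequencyPlot(trans, topN = 10)
rules <- apriori(trans,
                 parameter = list(supp = 0.3, conf = 0.5, minlen = 2))
summary(rules)
inspect(rules)
# Rules with "beer" on the right-hand side
beer_rules_rhs <- apriori(trans,
                          parameter = list(supp = 0.3, conf = 0.5, minlen = 2),
                          appearance = list(default = "lhs", rhs = "beer"))
inspect(beer_rules_rhs)
# Rules with "beer" on the left-hand side
beer_rules_lhs <- apriori(trans,
                          parameter = list(supp = 0.3, conf = 0.5, minlen = 2),
                          appearance = list(default = "rhs", lhs = "beer"))
inspect(beer_rules_lhs)
library(arulesViz)
plot(rules)
plot(rules, method = "graph")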
RESULT
Thus the Apriori Algorithm in Association rule mining has been implemented.
AIM
To implement text mining on Twitter data using the rtweet package.
ALGORITHM
1. Install required packages: "rtweet", "ggplot2", "dplyr", "tidytext",
"igraph", and "ggraph".
2. Load the required libraries: "rtweet", "ggplot2", "dplyr", "tidytext",
"igraph", and "ggraph".
3. Use search_tweets() function from "rtweet" package to search for tweets
containing the hashtag "#climatechange".
4. Clean the text of tweets by removing URLs using regular expressions.
5. Tokenize the cleaned text to extract individual words.
6. Count the frequency of unique words in the tweets.
7. Plot the count of the top 15 unique words found in the tweets using
ggplot2.
8. Load the built-in stop words dataset from "tidytext".
9. Remove stop words from the tokenized text.
10. Count the frequency of unique words after removing stop words.
11. Plot the count of the top 15 unique words after stop words removal using
ggplot2.
PROGRAM
# Load required libraries
library(rtweet)
library(ggplot2)
library(dplyr)
library(tidytext)
library(igraph)
library(ggraph)
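# The tweet-collection and word-frequency steps are sketched below under stated
# assumptions (hashtag, tweet count, and column handling follow the algorithm and
# typical rtweet/tidytext usage); Twitter API authentication is required.
# Search recent tweets containing #climatechange
climate_tweets <- search_tweets("#climatechange", n = 1000, include_rts = FALSE)
# Remove URLs from the tweet text
climate_tweets$stripped_text <- gsub("http\\S+", "", climate_tweets$text)
# Tokenize the cleaned text into individual words
climate_words <- climate_tweets %>%
  select(stripped_text) %>%
  unnest_tokens(word, stripped_text)
# Count the frequency of unique words
word_freq <- climate_words %>% count(word, sort = TRUE)
# Remove stop words and keep the 15 most frequent remaining words
data("stop_words")
word_freq_no_stopwords <- climate_words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  head(15)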
# Plot the count of the top 15 unique words after stop words removal
ggplot(word_freq_no_stopwords, aes(x = word, y = n)) +
geom_col() +
coord_flip() +
labs(x = "Unique words", y = "Count",
title = "Count of unique words found in tweets (Stop words removed)")
RESULT
Thus the text mining on Twitter data using the rtweet package has been
implemented.