
TERM WORK

ON
Data Science Using R
LAB
PMC 401

Graphic Era Hill University, Dehradun


School of Computing
Department of Computer Applications

SUBMITTED BY:                              SUBMITTED TO:

Name: Navjot Pant                          Dr. Poonam Verma
Course: MCA (B)                            Assistant Professor
Roll No: 33                                Department of Computer Applications, School of Computing

1. Write a program to implement Linear Regression on california_housing.csv

install.packages("caret")
# Load necessary package
library(caret)housing <- read.csv("/content/california_housing_train.csv")
head(housing)
set.seed(123)trainIndex <- createDataPartition(housing$median_house_value, p = 0.7, list = FALSE)
trainData <- housing[trainIndex, ]
testData <- housing[-trainIndex, ]
model <- lm(median_house_value ~ ., data = trainData)
summary(model)
predictions <- predict(model, newdata = testData)
actual <- testData$median_house_value
rmse <- sqrt(mean((actual - predictions)^2))
print(rmse)
Output :

Installing package into ‘/usr/local/lib/R/site-library’


(as ‘lib’ is unspecified)

A data.frame: 6 × 9

  longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
      <dbl>    <dbl>              <int>       <int>          <int>      <int>      <int>         <dbl>              <int>
1   -114.31    34.19                 15        5612           1283       1015        472        1.4936              66900
2   -114.47    34.40                 19        7650           1901       1129        463        1.8200              80100
3   -114.56    33.69                 17         720            174        333        117        1.6509              85700
4   -114.57    33.64                 14        1501            337        515        226        3.1917              73400
5   -114.57    33.57                 20        1454            326        624        262        1.9250              65500
6   -114.58    33.63                 29        1387            236        671        239        3.3438              74000

Call:
lm(formula = median_house_value ~ ., data = trainData)
Residuals:
Min 1Q Median 3Q Max
-562309 -43874 -11983 30091 762778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.712e+06 8.363e+04 -44.386 < 2e-16 ***
longitude -4.409e+04 9.541e+02 -46.216 < 2e-16 ***
latitude -4.354e+04 9.011e+02 -48.321 < 2e-16 ***
housing_median_age 1.151e+03 5.760e+01 19.976 < 2e-16 ***
total_rooms -8.212e+00 1.033e+00 -7.948 2.06e-15 ***

total_bedrooms 1.298e+02 9.571e+00 13.557 < 2e-16 ***
population -3.685e+01 1.375e+00 -26.794 < 2e-16 ***
households 2.829e+01 1.054e+01 2.683 0.00732 **
median_income 4.018e+04 4.415e+02 91.019 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 69950 on 11893 degrees of freedom


Multiple R-squared: 0.6365, Adjusted R-squared: 0.6362
F-statistic: 2603 on 8 and 11893 DF, p-value: < 2.2e-16
[1] 68425.51

2. Write an R program to implement Linear Regression on iris.csv

# Load built-in Iris dataset


data(iris)

# View the first few rows


head(iris)

# Build Linear Regression model to predict Sepal.Length


model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris)

# Show model summary (coefficients, R-squared, etc.)


summary(model)

# Predict Sepal.Length using the trained model


iris$predicted <- predict(model, iris)

# View actual and predicted values


head(iris)

# Calculate Root Mean Squared Error (RMSE)


actual <- iris$Sepal.Length
predicted <- iris$predicted

rmse <- sqrt(mean((actual - predicted)^2))


print(rmse)
Output :

A data.frame: 6 × 5

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <fct>
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Call:
lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = iris)

Residuals:
Min 1Q Median 3Q Max
-0.82816 -0.21989 0.01875 0.19709 0.84570

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.85600 0.25078 7.401 9.85e-12 ***
Sepal.Width 0.65084 0.06665 9.765 < 2e-16 ***
Petal.Length 0.70913 0.05672 12.502 < 2e-16 ***
Petal.Width -0.55648 0.12755 -4.363 2.41e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3145 on 146 degrees of freedom


Multiple R-squared: 0.8586, Adjusted R-squared: 0.8557
F-statistic: 295.5 on 3 and 146 DF, p-value: < 2.2e-16

A data.frame: 6 × 6

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species predicted
         <dbl>       <dbl>        <dbl>       <dbl>   <fct>     <dbl>
1          5.1         3.5          1.4         0.2  setosa  5.015416
2          4.9         3.0          1.4         0.2  setosa  4.689997
3          4.7         3.2          1.3         0.2  setosa  4.749251
4          4.6         3.1          1.5         0.2  setosa  4.825994
5          5.0         3.6          1.4         0.2  setosa  5.080499
6          5.4         3.9          1.7         0.4  setosa  5.377194

[1] 0.3103268
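As a quick visual check of the fit, the predicted values can be plotted against the actual values. The following is a minimal sketch in base R, using the predicted column created above:

# Plot actual vs predicted Sepal.Length (points near the diagonal indicate a good fit)
plot(iris$Sepal.Length, iris$predicted,
     main = "Actual vs Predicted Sepal.Length",
     xlab = "Actual", ylab = "Predicted",
     pch = 19, col = "blue")
abline(a = 0, b = 1, col = "red", lwd = 2)  # reference line y = x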

Write a program to implement a Decision Tree Classifier on iris.csv

# Load required libraries
library(caret)
library(rpart)

# Load the dataset


data(iris)

# Split data using caret (70% train, 30% test)


set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
trainData <- iris[trainIndex, ]
testData <- iris[-trainIndex, ]

# Train Decision Tree model using rpart


model <- rpart(Species ~ ., data = trainData, method = "class")

# Predict on test data


predictions <- predict(model, testData, type = "class")

# Calculate accuracy
accuracy <- sum(predictions == testData$Species) / nrow(testData)
print(accuracy)

Output :

[1] 0.9333333
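Since caret is already loaded, its confusionMatrix() function gives a per-class breakdown in addition to overall accuracy. A minimal sketch, assuming the predictions and testData objects created above:

# Detailed confusion matrix with per-class statistics from caret
conf <- confusionMatrix(predictions, testData$Species)
print(conf)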

3. Write a program to implement a regression Decision Tree (ANOVA method) on the iris dataset with a type 2 rpart.plot visualization.

install.packages(c("rpart", "rpart.plot", "caret")) # Install required packages


library(rpart)
library(rpart.plot)
library(caret)

data(iris)
head(iris)

set.seed(123)
trainIndex<-createDataPartition(iris$Petal.Length, p=0.7, list=FALSE)
trainData<-iris[trainIndex, ]
testData<-iris[-trainIndex, ]
tree_model<-rpart(Petal.Length ~ ., data=trainData, method="anova")

print(tree_model)
predictions <-predict(tree_model,testData)
actual <-testData$Petal.Length
rmse <- sqrt(mean((actual - predictions)^2))   # root mean squared error on the test set
print(rmse)

rpart.plot(tree_model,type=2, fallen.leaves = TRUE)
Output :

Installing packages into ‘/usr/local/lib/R/site-library’


(as ‘lib’ is unspecified)

A data.frame: 6 × 5

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>   <fct>
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa


n= 106

node), split, n, deviance, yval


* denotes terminal node

1) root 106 328.091000 3.776415


2) Petal.Width< 0.8 35 1.061714 1.462857 *
3) Petal.Width>=0.8 71 47.339720 4.916901
6) Species=versicolor 38 8.107632 4.307895
12) Sepal.Length< 5.85 16 2.497500 3.937500 *
13) Sepal.Length>=5.85 22 1.818636 4.577273 *
7) Species=virginica 33 8.909091 5.618182
14) Sepal.Length< 7 23 2.696522 5.356522 *
15) Sepal.Length>=7 10 1.016000 6.220000 *
[1] 0.02590396

5. Write an R program to handle missing values in a data frame.


# Create a dataframe with missing values
data <- data.frame(
  Age = c(20, NA, 56, NA, 78),
  Marks = c(80, NA, 45, 16, NA)
)

# Print original data


print("Original Data:")
print(data)

# Count of missing values in each column


print("Count of Missing Values:")
print(colSums(is.na(data)))

# Replace missing values with mean


data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
data$Marks[is.na(data$Marks)] <- mean(data$Marks, na.rm = TRUE)

# Print updated data


print("Data after handling missing values:")
print(data)

# Find mode (most frequent value) in Marks column


mode <- names(sort(table(data$Marks), decreasing = TRUE))[1]
print(paste("Most frequent Marks value (Mode):", mode))

Output :
[1] "Original Data:"
Age Marks
1 20 80
2 NA NA
3 56 45
4 NA 16
5 78 NA
[1] "Count of Missing Values:"
Age Marks
2 2
[1] "Data after handling missing values:"
Age Marks
1 20.00000 80
2 51.33333 47
3 56.00000 45
4 51.33333 16
5 78.00000 47
[1] "Most frequent Marks value (Mode): 47"

6. Write an R program to perform descriptive statistics (Mean, Median, Mode) on a given dataframe.

# Sample Data

data <- data.frame(
Age = c(25, 30, 30, 35, 40),
Marks = c(80, 85, 85, 90, 85)
)

# Mean and Median


mean_marks <- mean(data$Marks)
median_marks <- median(data$Marks)

# Mode (one-liner)
mode_marks <- names(sort(table(data$Marks), decreasing = TRUE))[1]

# Print results
cat("Mean (Marks):", mean_marks, "\n")
cat("Median (Marks):", median_marks, "\n")
cat("Mode (Marks):", mode_marks, "\n")

Output :
Mean (Marks): 85
Median (Marks): 85
Mode (Marks): 85

7. Write an R program to perform descriptive statistics

# Sample dataset
data <- data.frame(
Age = c(22, 25, 30, 35, 40, 45, 50),
Marks = c(70, 85, 90, 75, 60, 95, 80)
)

# View data
print("Original Data:")
print(data)

# Basic descriptive statistics


print("Summary:")
print(summary(data))

# Mean
mean_age <- mean(data$Age)
mean_marks <- mean(data$Marks)

# Median
median_age <- median(data$Age)
median_marks <- median(data$Marks)

# Standard deviation

sd_age <- sd(data$Age)
sd_marks <- sd(data$Marks)

# Variance
var_age <- var(data$Age)
var_marks <- var(data$Marks)

# Min and Max


min_age <- min(data$Age)
max_age <- max(data$Age)

# Mode (simple one-liner using table)


mode_age <- as.numeric(names(sort(table(data$Age), decreasing = TRUE))[1])
mode_marks <- as.numeric(names(sort(table(data$Marks), decreasing = TRUE))[1])

# Output
cat("Mean Age:", mean_age, "\n")
cat("Mean Marks:", mean_marks, "\n")
cat("Median Age:", median_age, "\n")
cat("Median Marks:", median_marks, "\n")
cat("Standard Deviation (Age):", sd_age, "\n")
cat("Standard Deviation (Marks):", sd_marks, "\n")
cat("Variance (Age):", var_age, "\n")
cat("Variance (Marks):", var_marks, "\n")
cat("Min Age:", min_age, " | Max Age:", max_age, "\n")
cat("Mode Age:", mode_age, "\n")
cat("Mode Marks:", mode_marks, "\n")
Output :
[1] "Original Data:"
Age Marks
1 22 70
2 25 85
3 30 90
4 35 75
5 40 60
6 45 95
7 50 80
[1] "Summary:"
Age Marks
Min. :22.00 Min. :60.00
1st Qu.:27.50 1st Qu.:72.50
Median :35.00 Median :80.00
Mean :35.29 Mean :79.29
3rd Qu.:42.50 3rd Qu.:87.50
Max. :50.00 Max. :95.00
Mean Age: 35.28571
Mean Marks: 79.28571

Median Age: 35
Median Marks: 80
Standard Deviation (Age): 10.35558
Standard Deviation (Marks): 12.05148
Variance (Age): 107.2381
Variance (Marks): 145.2381
Min Age: 22 | Max Age: 50
Mode Age: 22
Mode Marks: 60

8. Write an R program to visualize the parameters of a dataset

# Create a simple dataset


data <- data.frame(
Age = c(18, 19, 20, 21, 22, 23, 24),
Marks = c(75, 80, 85, 78, 92, 88, 95)
)

# Scatter Plot: Age vs Marks


plot(data$Age, data$Marks,
main = "Scatter Plot: Age vs Marks",
xlab = "Age",
ylab = "Marks",
pch = 19,
col = "red"
)

# Histogram of Marks
hist(data$Marks,
main = "Histogram of Marks",
xlab = "Marks",
col = "lavender"
)
Output :

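The output of this program is the scatter plot and the histogram, shown as figures. Other base R plots can be added in the same way; for example, a boxplot shows the spread of Marks. A minimal sketch using the same data frame:

# Boxplot of Marks to visualize the median, quartiles and potential outliers
boxplot(data$Marks,
        main = "Boxplot of Marks",
        ylab = "Marks",
        col = "lightblue")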
9. Write a program to implement Naïve Bayes on iris.csv

# Install and load required package


if (!require("e1071")) install.packages("e1071")
library(e1071)

# Load the dataset


iris_data <- read.csv("/content/Iris.csv")
print(names(iris_data)) # View column names

# Convert Species to factor


iris_data$Species <- as.factor(iris_data$Species)

# Split the data (70% training, 30% testing)


set.seed(123)
train_index <- sample(seq_len(nrow(iris_data)), size = 0.7 * nrow(iris_data))
train_data <- iris_data[train_index, ]
test_data <- iris_data[-train_index, ]

# Train Naive Bayes model
nb_model <- naiveBayes(Species ~ SepalLengthCm + SepalWidthCm + PetalLengthCm + PetalWidthCm,
data = train_data)

# Predict on test data


nb_predictions <- predict(nb_model, newdata = test_data)

# Confusion matrix
conf_matrix_nb <- table(Predicted = nb_predictions, Actual = test_data$Species)
print("Confusion Matrix:")
print(conf_matrix_nb)

# Accuracy
accuracy <- sum(diag(conf_matrix_nb)) / sum(conf_matrix_nb)
print(paste("Accuracy:", round(accuracy * 100, 2), "%"))
Output :
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

[1] "Id" "SepalLengthCm" "SepalWidthCm" "PetalLengthCm"


[5] "PetalWidthCm" "Species"
[1] "Confusion Matrix:"
Actual
Predicted Iris-setosa Iris-versicolor Iris-virginica
Iris-setosa 14 0 0
Iris-versicolor 0 18 0
Iris-virginica 0 0 13
[1] "Accuracy: 100 %"

10. Write a program to implement DBSCAN on iris.csv.

# Install and load dbscan if not already installed


if (!require("dbscan")) install.packages("dbscan")
library(dbscan)

# Load the iris dataset from CSV


iris_data <- read.csv("/content/Iris.csv")

# View column names


print(names(iris_data))

# Select only numeric columns


numeric_data <- iris_data[, c("SepalLengthCm", "PetalLengthCm", "SepalWidthCm", "PetalWidthCm")]

# Scale the numeric data


numeric_data_scaled <- scale(numeric_data)

# Apply DBSCAN on scaled data
set.seed(123)
dbscan_result <- dbscan(numeric_data_scaled, eps = 0.5, minPts = 3)

# Print result
print(dbscan_result)

Output :
Loading required package: dbscan

Warning message in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :


“there is no package called ‘dbscan’”
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Attaching package: ‘dbscan’

The following object is masked from ‘package:stats’:

as.dendrogram

[1] "Id" "SepalLengthCm" "SepalWidthCm" "PetalLengthCm"


[5] "PetalWidthCm" "Species"
DBSCAN clustering for 150 objects.
Parameters: eps = 0.5, minPts = 3
Using euclidean distances and borderpoints = TRUE
The clustering contains 7 cluster(s) and 18 noise points.

0 1 2 3 4 5 6 7
18 44 73 3 3 3 3 3

Available fields: cluster, eps, minPts, metric, borderPoints
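The number of clusters found depends strongly on eps (here 7 clusters and 18 noise points). A common way to choose eps is the k-nearest-neighbour distance plot provided by the dbscan package; a minimal sketch on the scaled data from above:

# Plot sorted k-NN distances; a visible "knee" in the curve suggests a reasonable eps
kNNdistplot(numeric_data_scaled, k = 3)   # k is usually set to minPts
abline(h = 0.5, col = "red", lty = 2)     # the eps value used above, for reference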

11. Write a program to implement k-Nearest Neighbors (KNN), a supervised algorithm using Euclidean distance.

if (!require("class")) install.packages("class", repos = "http://cran.rstudio.com/")


library(class)
iris_data<-read.csv("/content/Iris.csv")
print(names(iris_data))

iris_data$Species <- as.factor(iris_data$Species)   # convert the CSV's Species column to a factor

set.seed(123)

train_index<- sample(seq_len(nrow(iris_data)), size=0.7*nrow(iris_data))
train_data<-iris_data[train_index, ]
test_data<-iris_data[-train_index, ]
predictor_cols<-c("SepalLengthCm", "SepalWidthCm" , "PetalLengthCm","PetalWidthCm")
train_features<-train_data[,predictor_cols]
test_features<-test_data[,predictor_cols]

train_labels<-train_data$Species
test_labels<-test_data$Species
knn_pred<-knn(train =train_features,test=test_features,cl=train_labels, k=3)
conf_matrix<-table(Predicted = knn_pred, Actual = test_labels)
print(conf_matrix)

accuracy<-sum(diag(conf_matrix))/sum(conf_matrix)
print(accuracy)
Output :
[1] "Id" "SepalLengthCm" "SepalWidthCm" "PetalLengthCm"
[5] "PetalWidthCm" "Species"
Actual
Predicted setosa versicolor virginica
setosa 14 0 0
versicolor 0 17 0
virginica 0 1 13
[1] 0.9777778

12. Write a program to implement K-Means clustering on california_housing.csv.

# Load the data


california_data <- read.csv("//content/california_housing_train.csv")

# Select numeric features


features <- c("longitude", "latitude", "housing_median_age", "total_rooms",
"total_bedrooms", "population", "households", "median_income")

data_for_clustering <- california_data[, features]

# Remove missing values


data_for_clustering <- na.omit(data_for_clustering)

# Scale the data


data_scaled <- scale(data_for_clustering)

# Set seed for reproducibility


set.seed(123)

# K-Means clustering with 3 clusters


kmeans_result <- kmeans(data_scaled, centers = 3, nstart = 25)

# Add cluster labels to original data
california_data$Cluster <- kmeans_result$cluster

# Print only first 6 rows with clusters


print(head(california_data[, c(features, "Cluster")]))

# Print cluster sizes summary


cat("Cluster sizes:\n")
print(table(california_data$Cluster))
Output :
longitude latitude housing_median_age total_rooms total_bedrooms population
1 -114.31 34.19 15 5612 1283 1015
2 -114.47 34.40 19 7650 1901 1129
3 -114.56 33.69 17 720 174 333
4 -114.57 33.64 14 1501 337 515
5 -114.57 33.57 20 1454 326 624
6 -114.58 33.63 29 1387 236 671
households median_income Cluster
1 472 1.4936 1
2 463 1.8200 2
3 117 1.6509 1
4 226 3.1917 1
5 262 1.9250 1
6 239 3.3438 1
Cluster sizes:

1 2 3
8943 1394 6663
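The choice of 3 clusters is arbitrary; a common heuristic is the elbow method, which compares the total within-cluster sum of squares for several values of k. A minimal sketch using the scaled data from above:

# Elbow method: total within-cluster sum of squares for k = 1..8
set.seed(123)
wss <- sapply(1:8, function(k) kmeans(data_scaled, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow method for choosing k")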

13. Write a program to implement KNN on california_housing.csv with clusters = 3.

# Load 'class' library for knn


if (!require("class")) install.packages("class", repos = "http://cran.rstudio.com/")
library(class)

# Load dataset
data <- read.csv("/content/california_housing_train.csv")

# Create a new categorical target by binning median_house_value into 3 classes


data$HouseValueCategory <- cut(data$median_house_value,
breaks = 3,
labels = c("Low", "Medium", "High"))

# Remove rows with missing values


data <- na.omit(data)

# Select features (numeric predictors)
features <- c("longitude", "latitude", "housing_median_age", "total_rooms",
"total_bedrooms", "population", "households", "median_income")

# Split data into training and testing (70%-30%)


set.seed(123)
train_index <- sample(nrow(data), 0.7 * nrow(data))
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

# Prepare features and labels


train_features <- scale(train_data[, features]) # scale features
test_features <- scale(test_data[, features])

train_labels <- train_data$HouseValueCategory


test_labels <- test_data$HouseValueCategory

# Run KNN with k=3


knn_pred <- knn(train = train_features, test = test_features, cl = train_labels, k = 3)

# Confusion matrix and accuracy


conf_matrix <- table(Predicted = knn_pred, Actual = test_labels)
print(conf_matrix)

accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)


print(paste("Accuracy:", accuracy))
Output :
Actual
Predicted Low Medium High
Low 2103 393 35
Medium 375 1310 246
High 27 183 428
[1] "Accuracy: 0.753137254901961"

14. Write a program to implement K-Means clustering on iris.csv with clusters = 3.

# Load the dataset


iris_data <- read.csv("/content/Iris.csv")

# Print column names to check structure


print(names(iris_data))

# Remove non-numeric columns (Id and Species)


iris_features <- iris_data[, c("SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm")]

# Apply K-Means clustering with 3 clusters


set.seed(123) # for reproducibility

kmeans_result <- kmeans(iris_features, centers = 3, nstart = 25)

# Print cluster assignments


print("Cluster assignments:")
print(kmeans_result$cluster)

# Print cluster centers


print("Cluster centers:")
print(kmeans_result$centers)

# Compare cluster assignment with actual species using a contingency table


print("Contingency table:")
print(table(Cluster = kmeans_result$cluster, Actual = iris_data$Species))
Output :
[1] "Id" "SepalLengthCm" "SepalWidthCm" "PetalLengthCm"
[5] "PetalWidthCm" "Species"
[1] "Cluster assignments:"
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
[112] 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
[149] 2 3
[1] "Cluster centers:"
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
1 5.006000 3.418000 1.464000 0.244000
2 6.850000 3.073684 5.742105 2.071053
3 5.901613 2.748387 4.393548 1.433871
[1] "Contingency table:"
Actual
Cluster Iris-setosa Iris-versicolor Iris-virginica
1 50 0 0
2 0 2 36
3 0 48 14

15. Write a program to implement Linear Regression on churn.csv

# Load the dataset


data <- read.csv("/content/customer_churn.csv", stringsAsFactors = TRUE)

# Keep only numeric predictor and target, remove missing values


data <- na.omit(data[, c("Years", "Total_Purchase")])

# Split into training and test sets (80/20)


set.seed(42)
sample_index <- sample(1:nrow(data), 0.8 * nrow(data))

train_data <- data[sample_index, ]
test_data <- data[-sample_index, ]

# Train linear regression model


model <- lm(Total_Purchase ~ Years, data = train_data)

# Show model summary


summary(model)

# Predict on test data


predictions <- predict(model, newdata = test_data)

# Evaluate model
rmse <- sqrt(mean((test_data$Total_Purchase - predictions)^2))
r2 <- cor(test_data$Total_Purchase, predictions)^2

cat("RMSE:", rmse, "\n")


cat("R-squared:", r2, "\n")
Output:

Call:
lm(formula = Total_Purchase ~ Years, data = train_data)

Residuals:
Min 1Q Median 3Q Max
-9954.8 -1576.0 14.4 1689.0 6824.3

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9826.35 390.69 25.151 <2e-16 ***
Years 43.35 71.85 0.603 0.546
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2415 on 718 degrees of freedom


Multiple R-squared: 0.0005067, Adjusted R-squared: -0.0008854
F-statistic: 0.364 on 1 and 718 DF, p-value: 0.5465
RMSE: 2395.241
R-squared: 0.01212856
16. Write an R program to implement Multinomial Logistic Regression on iris.csv.
install.packages("nnet")
library(nnet)
data(iris)
head(iris)
model<-multinom(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris)
summary(model)
predicted<-predict(model,iris)

conf<-table(Predicted=predicted,Actual=iris$Species)
print(conf)
accuracy<-sum(diag(conf)/sum(conf))
cat("Accuracy",accuracy)
OUTPUT:
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

A data.frame: 6 × 5

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

<dbl> <dbl> <dbl> <dbl> <fct>

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa


# weights: 18 (10 variable)
initial value 164.791843
iter 10 value 16.177348
iter 20 value 7.111438
iter 30 value 6.182999
iter 40 value 5.984028
iter 50 value 5.961278
iter 60 value 5.954900
iter 70 value 5.951851
iter 80 value 5.950343
iter 90 value 5.949904
iter 100 value 5.949867
final value 5.949867
stopped after 100 iterations
Call:
multinom(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length +
Petal.Width, data = iris)

Coefficients:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 18.69037 -5.458424 -8.707401 14.24477 -3.097684
virginica -23.83628 -7.923634 -15.370769 23.65978 15.135301

Std. Errors:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 34.97116 89.89215 157.0415 60.19170 45.48852
virginica 35.76649 89.91153 157.1196 60.46753 45.93406

Residual Deviance: 11.89973


AIC: 31.89973
Actual
Predicted setosa versicolor virginica
setosa 50 0 0
versicolor 0 49 1
virginica 0 1 49
Accuracy 0.9866667
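The accuracy above is computed on the same data the model was trained on. A minimal sketch of evaluating on a held-out split instead, reusing the caret partitioning pattern from the earlier programs:

# Hold-out evaluation for the multinomial model (70% train / 30% test)
library(caret)
set.seed(123)
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
fit <- multinom(Species ~ ., data = iris[idx, ], trace = FALSE)
pred <- predict(fit, newdata = iris[-idx, ])
mean(pred == iris$Species[-idx])   # test-set accuracy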

18. Write an R program to perform tokenization on a given text using the following packages: tokenizers, tidytext, quanteda, and text2vec.

install.packages(c("tokenizers", "tidytext", "quanteda", "text2vec"))


library(tokenizers)
library(tidytext)
library(quanteda)
library(text2vec)
library(dplyr)

library(tibble)
text <- "Tokenization is the process of breaking text into tokens."
# Using tokenizers
tokenizer <- tokenize_words(text)
print("Tokens using tokenizers:")
print(tokenizer)
# Using tidytext
text_df <- tibble(line = 1, text = text)
tokens_tidytext <- text_df %>% unnest_tokens(word, text)
print("Tokens using tidytext:")
print(tokens_tidytext$word)
# Using quanteda
tokens_quanteda <- tokens(text, what = "word")
print("Tokens using quanteda:")
print(as.list(tokens_quanteda)[[1]])
# Using text2vec (word_tokenizer is text2vec's basic word-level tokenizer;
# itoken() wraps the tokens in an iterator for building a vocabulary or DTM)
tokens_text2vec <- word_tokenizer(text)
it <- itoken(tokens_text2vec, progressbar = FALSE)
print("Tokens using text2vec:")
print(tokens_text2vec)
19. Write an R program to perform stemming on a list of words using the following libraries: SnowballC, quanteda, and tidytext.

# Install required packages


install.packages(c("SnowballC", "quanteda", "tidytext", "dplyr", "tibble"))

# Load libraries
library(SnowballC)
library(quanteda)
library(tidytext)
library(dplyr)
library(tibble)

# Sample word list


words <- c("playing", "played", "plays", "player", "happily", "happiness")

# --- Stemming using SnowballC ---


snowball_stems <- wordStem(words, language = "en")
cat("Stemming using SnowballC:\n")
print(data.frame(original = words, stemmed = snowball_stems))

# --- Stemming using quanteda ---


quanteda_stems <- tokens(words, what = "word") %>%
tokens_wordstem(language = "english")
cat("\nStemming using quanteda:\n")
print(unlist(as.list(quanteda_stems), use.names = FALSE))   # one stem per input word

# --- Stemming using tidytext + SnowballC ---


word_df <- tibble(word = words)
tidytext_stems <- word_df %>%
mutate(stem = wordStem(word, language = "en"))
cat("\nStemming using tidytext:\n")
print(tidytext_stems)
20. Write an R program to implement SVM on Iris.csv

install.packages(c("e1071", "caret"))
library(e1071)
library(caret)
data(iris)
iris_data <- iris
iris_data$Species <- as.factor(iris_data$Species)
set.seed(123)
sample_index <- createDataPartition(iris_data$Species, p = 0.8, list = FALSE)
train_data <- iris_data[sample_index, ]
test_data <- iris_data[-sample_index, ]
svm_model <- svm(Species ~ ., data = train_data, kernel = "linear")
predictions <- predict(svm_model, test_data)
conf_matrix <- table(Predicted = predictions, Actual = test_data$Species)
print(conf_matrix)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(accuracy)

OUTPUT:

Actual
Predicted setosa versicolor virginica
setosa 10 0 0
versicolor 0 10 1
virginica 0 0 9
[1] 0.9666667

21. Write an R program to implement SVM on the Boston dataset
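No code was included for this task in the file. The following is a minimal sketch, assuming the Boston housing data from the MASS package and the same e1071/caret pattern used in the neighbouring SVM programs:

# SVM regression on the Boston dataset (medv, the median home value, is the target)
library(MASS)     # provides the Boston data frame
library(e1071)
library(caret)

data(Boston)
set.seed(123)
idx <- createDataPartition(Boston$medv, p = 0.8, list = FALSE)
trainData <- Boston[idx, ]
testData  <- Boston[-idx, ]

svm_model <- svm(medv ~ ., data = trainData)   # eps-regression with a radial kernel by default
pred <- predict(svm_model, testData)
rmse <- sqrt(mean((pred - testData$medv)^2))
print(rmse)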

22. Write an R program to implement SVM on CAhousing.csv

install.packages(c("e1071","caret"))

library(caret)

library(e1071)

housing<-read.csv("/content/sample_data/california_housing_train.csv")

head(housing)

set.seed(123)

trainIndex=createDataPartition(housing$median_house_value,p=0.8,list=FALSE)

trainData<-housing[trainIndex,]

testData<-housing[-trainIndex,]

model<-svm(median_house_value~.,data=trainData)

summary(model)

predicted=predict(model,testData)

actual=testData$median_house_value

mse <- mean((predicted - actual)^2)   # mean squared error

rmse <- sqrt(mse)                     # root mean squared error

print(mse)

print(rmse)

OUTPUT:

A data.frame: 6 × 9

  longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
      <dbl>    <dbl>              <dbl>       <dbl>          <dbl>      <dbl>      <dbl>         <dbl>              <dbl>
1   -114.31    34.19                 15        5612           1283       1015        472        1.4936              66900
2   -114.47    34.40                 19        7650           1901       1129        463        1.8200              80100
3   -114.56    33.69                 17         720            174        333        117        1.6509              85700
4   -114.57    33.64                 14        1501            337        515        226        3.1917              73400
5   -114.57    33.57                 20        1454            326        624        262        1.9250              65500
6   -114.58    33.63                 29        1387            236        671        239        3.3438              74000

Call:
svm(formula = median_house_value ~ ., data = trainData)

Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 1
gamma: 0.125
epsilon: 0.1

Number of Support Vectors: 10101

[1] 7433.255
[1] 86.21633

23. Write an R program to implement Lemmatization

install.packages("textstem")
library(textstem)
words <- c("playing", "played", "happily", "better", "studies", "running", "flies")
lemmatized_words <- lemmatize_words(words)
cat("Original Words:\n")
print(words)
cat("\nLemmatized Words:\n")

print(lemmatized_words)

OUTPUT:

also installing the dependencies ‘NLP’, ‘zoo’, ‘dtt’, ‘sylly.en’, ‘sylly’, ‘syuzhet’, ‘english’, ‘mgsub’,
‘qdapRegex’, ‘slam’, ‘koRpus.lang.en’, ‘hunspell’, ‘koRpus’, ‘lexicon’, ‘textclean’, ‘textshape’

Loading required package: koRpus.lang.en

Loading required package: koRpus

Loading required package: sylly

For information on available language packages for 'koRpus', run

available.koRpus.lang()

and see ?install.koRpus.lang()

Attaching package: ‘koRpus’

The following objects are masked from ‘package:quanteda’:

tokens, types

Original Words:
[1] "playing" "played" "happily" "better" "studies" "running" "flies"

Lemmatized Words:
[1] "play" "play" "happily" "good" "study" "run" "fly"

24. Write an R program to implement different types of stemmers on a set of words using the Porter Stemmer and the SnowballC Stemmer.

install.packages("SnowballC")
library(SnowballC)
words <- c("running", "runs", "easily", "fairly", "happily", "flying", "flies")
# Porter Stemmer (default in SnowballC for English)
porter_stems <- wordStem(words, language = "en")
cat("Porter Stemmer (SnowballC default):\n")
print(data.frame(Original = words, Stemmed = porter_stems))
OUTPUT:
Porter Stemmer (SnowballC default):
Original Stemmed
1 running run
2 runs run
3 easily easili
4 fairly fair
5 happily happili
6 flying fli
7 flies fli
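The English stemmer used above is the Snowball (Porter2) algorithm. SnowballC also exposes the original Porter algorithm under the language name "porter" (assuming the installed version lists it in getStemLanguages()), so the two can be compared side by side. A minimal sketch using the same word vector:

# Compare the original Porter algorithm with the Snowball English (Porter2) stemmer
porter_original  <- wordStem(words, language = "porter")
snowball_english <- wordStem(words, language = "english")
print(data.frame(Original = words, Porter = porter_original, Snowball = snowball_english))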

25. Write an R program to implement Sentiment Analysis

install.packages(c("tidytext", "dplyr", "ggplot2", "tibble"))


library(tidytext)
library(dplyr)
library(tibble)
library(ggplot2)
text <- "I love the way this product works. It is amazing and delightful, though a bit expensive."
text_df <- tibble(line = 1, text = text)
tokens <- text_df %>% unnest_tokens(word, text)
sentiments <- get_sentiments("bing")
sentiment_analysis <- tokens %>%
inner_join(sentiments, by = "word")
cat("Words with Sentiments:\n")
print(sentiment_analysis)
sentiment_summary <- sentiment_analysis %>%
count(sentiment)
cat("\nSentiment Summary:\n")
print(sentiment_summary)
ggplot(sentiment_summary, aes(x = sentiment, y = n, fill = sentiment)) +
geom_col() +
labs(title = "Sentiment Analysis", x = "Sentiment", y = "Word Count") +
theme_minimal()
