
Week Ten Exercise

Abhishek Srivastava

2024-08-10

Introduction to Machine Learning


# Load necessary libraries
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.1

## Warning: package 'tidyr' was built under R version 4.4.1

## Warning: package 'purrr' was built under R version 4.4.1

## Warning: package 'dplyr' was built under R version 4.4.1

## Warning: package 'forcats' was built under R version 4.4.1

## Warning: package 'lubridate' was built under R version 4.4.1

## ── Attaching core tidyverse packages ──────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(class)    # provides knn()
library(ggplot2)  # already attached via tidyverse; loading again is harmless

# Load the datasets
binary_data <- read.csv("binary-classifier-data.csv")
trinary_data <- read.csv("trinary-classifier-data.csv")

# Plot the data from the binary dataset
ggplot(binary_data, aes(x = x, y = y, color = factor(label))) +
  geom_point() +
  labs(title = "Binary Classifier Data", x = "X", y = "Y")

# Plot the data from the trinary dataset
ggplot(trinary_data, aes(x = x, y = y, color = factor(label))) +
  geom_point() +
  labs(title = "Trinary Classifier Data", x = "X", y = "Y")
# Function to calculate the Euclidean distance between two points
euclidean_distance <- function(p1, p2) {
  sqrt((p1$x - p2$x)^2 + (p1$y - p2$y)^2)
}
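This helper is illustrative: knn() from the class package computes distances internally, so euclidean_distance() is never called by the code below. A quick check of what it computes, using the first two rows of the binary dataset:

# Distance between the first two observations (illustrative only)
euclidean_distance(binary_data[1, ], binary_data[2, ])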

# Function to predict labels with the k-nearest neighbors algorithm
knn_predict <- function(train_data, test_data, k) {
  predicted_labels <- knn(train = train_data[, c("x", "y")],
                          test = test_data[, c("x", "y")],
                          cl = train_data$label,
                          k = k)
  return(predicted_labels)
}

# Function to calculate classification accuracy
calculate_accuracy <- function(true_labels, predicted_labels) {
  accuracy <- sum(true_labels == predicted_labels) / length(true_labels)
  return(accuracy)
}

# Fit a k-nearest neighbors model to each dataset and compute accuracy
k_values <- c(3, 5, 10, 15, 20, 25)
accuracy_results <- data.frame()

for (k in k_values) {
  binary_predicted_labels <- knn_predict(binary_data, binary_data, k)
  binary_accuracy <- calculate_accuracy(binary_data$label,
                                        binary_predicted_labels)

  trinary_predicted_labels <- knn_predict(trinary_data, trinary_data, k)
  trinary_accuracy <- calculate_accuracy(trinary_data$label,
                                         trinary_predicted_labels)

  accuracy_results <- rbind(accuracy_results,
                            data.frame(k = k,
                                       binary_accuracy = binary_accuracy,
                                       trinary_accuracy = trinary_accuracy))
}
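Note that the loop above evaluates each model on the same data it was trained on, which tends to overstate accuracy (with k = 1 it would be nearly perfect). A minimal hold-out sketch, assuming an arbitrary 70/30 split and k = 5:

# Hold-out evaluation sketch (split proportion and k are arbitrary choices)
set.seed(42)  # for a reproducible split
train_idx <- sample(nrow(binary_data), size = 0.7 * nrow(binary_data))
train_set <- binary_data[train_idx, ]
test_set <- binary_data[-train_idx, ]

holdout_preds <- knn_predict(train_set, test_set, k = 5)
holdout_accuracy <- calculate_accuracy(test_set$label, holdout_preds)
holdout_accuracy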

# Plot accuracy results
ggplot(accuracy_results, aes(x = k)) +
  geom_line(aes(y = binary_accuracy, color = "Binary Dataset")) +
  geom_line(aes(y = trinary_accuracy, color = "Trinary Dataset")) +
  labs(title = "Accuracy of k-Nearest Neighbors Model",
       x = "k", y = "Accuracy") +
  scale_color_manual(values = c("blue", "red"))

# A linear classifier is unlikely to work well on these datasets because the
# data points are not linearly separable.

# The accuracy of last week's logistic regression classifier may differ from
# k-nearest neighbors because of the models' underlying assumptions. Logistic
# regression assumes a linear relationship between the features and the
# log-odds of the outcome, while k-nearest neighbors predicts from the
# similarity of nearby points, so it can capture non-linear decision
# boundaries. Which method performs better depends on the distribution of
# the data and the relationships between the features.
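To make that comparison concrete, a minimal logistic regression baseline could be fit on the binary dataset and scored the same way. This is a sketch, assuming the label is coded 0/1; the model from last week's exercise may have been specified differently.

# Logistic regression baseline on the binary dataset (training accuracy)
logit_model <- glm(label ~ x + y, data = binary_data, family = binomial)
logit_probs <- predict(logit_model, type = "response")
logit_preds <- ifelse(logit_probs > 0.5, 1, 0)
calculate_accuracy(binary_data$label, logit_preds)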

Clustering Solution

### Summary:

1. Data Preparation:
• Load the dataset: load clustering-data.csv.
• Inspect the data: examine the structure and content of the dataset.

2. Data Visualization:
• Scatter plot: plot the data points in a scatter plot to visualize the dataset.

3. K-Means Clustering:
• Fit the model: apply the k-means clustering algorithm to the dataset for each value of k from 2 to 12.
• Visualize clusters: create scatter plots showing the resulting clusters for each value of k.

4. Average Distance Calculation:
• Calculate distance: compute the average distance of each data point to the center of its assigned cluster.
• Plot results: plot these average distances with k on the x-axis and the average distance on the y-axis.

5. Determine Elbow Point:
• Elbow point: analyze the graph to identify the “elbow point,” the value of k at which the average distance begins to decrease more slowly.

Detailed Steps:

a. Load and Inspect Data:

# Load libraries (both are already attached via tidyverse above)
library(ggplot2)
library(dplyr)

# Load the dataset
data <- read.csv("clustering-data.csv")

# Inspect the dataset
head(data)

## x y
## 1 46 236
## 2 69 236
## 3 144 236
## 4 171 236
## 5 194 236
## 6 195 236

b. Plot Data:

# Scatter plot of the dataset
ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  ggtitle("Clustering Data")

c. Implement K-Means and Visualize Clusters:

# Function to fit k-means and plot the resulting clusters
plot_clusters <- function(data, k) {
  kmeans_result <- kmeans(data, centers = k)
  data$cluster <- as.factor(kmeans_result$cluster)

  ggplot(data, aes(x = x, y = y, color = cluster)) +
    geom_point() +
    ggtitle(paste("K-Means Clustering with k =", k))
}

# Plot clusters for k from 2 to 12
for (k in 2:12) {
  print(plot_clusters(data, k))
}
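One caveat: kmeans() starts from randomly chosen centers, so plot_clusters() here and average_distance() below can converge to different solutions for the same k. A small sketch of one way to stabilize a run (the seed value and nstart = 25 are arbitrary choices):

# Make a k-means run reproducible and less sensitive to initialization
set.seed(123)                                            # arbitrary seed
stable_result <- kmeans(data, centers = 3, nstart = 25)  # 25 random restarts
stable_result$tot.withinss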
d. Calculate Average Distance:

# Function to calculate the average distance of points from their cluster center
average_distance <- function(data, k) {
  kmeans_result <- kmeans(data, centers = k)
  data$cluster <- kmeans_result$cluster
  centers <- kmeans_result$centers

  distances <- sapply(1:nrow(data), function(i) {
    cluster_center <- centers[data$cluster[i], ]
    sqrt(sum((data[i, 1:2] - cluster_center)^2))
  })

  avg_distance <- mean(distances)
  return(avg_distance)
}

# Calculate average distances for k from 2 to 12
k_values <- 2:12
avg_distances <- sapply(k_values, average_distance, data = data)
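As an aside, kmeans() already reports the within-cluster sum of squares, so an alternative elbow curve can be built without recomputing distances by hand. Its scale differs from avg_distances (it sums squared distances rather than averaging unsquared ones), but the elbow shape is usually comparable; a sketch:

# Alternative: total within-cluster sum of squares straight from kmeans()
wss <- sapply(k_values, function(k) kmeans(data, centers = k)$tot.withinss)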

e. Plot Average Distances:

# Line plot of average distances
avg_distance_data <- data.frame(k = k_values, avg_distance = avg_distances)

ggplot(avg_distance_data, aes(x = k, y = avg_distance)) +
  geom_line() +
  geom_point() +
  ggtitle("Average Distance from Cluster Center vs. k") +
  xlab("Number of Clusters (k)") +
  ylab("Average Distance")

f. Determine Elbow Point:

• Elbow point analysis: the “elbow point” is typically where the curve starts to flatten out, indicating that adding more clusters no longer reduces the average distance significantly. In this example, the elbow appears to be at k = 6.
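That visual reading can be sanity-checked numerically, for example by finding where the drop in average distance slows the most (the largest second difference). This is a rough heuristic, not a formal test:

# Rough numeric check of the elbow: largest second difference in the curve
drops <- diff(avg_distances)                 # change between consecutive k
elbow_k <- k_values[which.max(diff(drops)) + 1]
elbow_k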
