DSR LAB MANUAL - 10 Programs

DATA SCIENCE LABORATORY

VII Semester: CSE

Course Code: PC 751 CS
Category: Core
Hours / Week: L: -  T: -  P: 3
Credits: 1.5
Maximum Marks: CIE: 25  SEE: 50  Total: 75
OBJECTIVES:
The course should enable the students to:
I. Understand the R programming language.
II. Gain exposure to solving data science problems.
III. Understand classification and regression models.
LIST OF EXPERIMENTS

1 R AS CALCULATOR APPLICATION
a. Using with and without R objects on console
b. Using mathematical functions on console
c. Write an R script to create R objects for a calculator application and save it in a specified location on disk
2 DESCRIPTIVE STATISTICS IN R

a. Write an R script to find basic descriptive statistics using the summary(), str(), and quantile() functions on the mtcars dataset.
b. Write an R script to find a subset of a dataset using subset()
3 READING AND WRITING DIFFERENT TYPES OF DATASETS

a. Reading different types of data sets (.txt, .csv) from the web and disk, and writing them to a file in a specific disk location.
b. Reading Excel data sheet in R.
c. Reading XML dataset in R.
4 VISUALIZATIONS

a. Find the data distributions using box and scatter plot.


b. Find the outliers using plot.
c. Plot the histogram, bar chart and pie chart on sample data
5 CORRELATION AND COVARIANCE

a. Find the correlation matrix.


b. Plot the correlation plot on the dataset and visualize it, giving an overview of relationships among the iris data.
c. Analysis of covariance / variance (ANOVA), if the data have categorical variables, on the iris data

6 REGRESSION MODEL

Import data from web storage. Name the dataset and perform logistic regression to find the relation between the variables that affect the admission of a student to an institute, based on his or her GRE score, GPA, and rank. Also check whether the model fits.
require(foreign), require(MASS)
7 MULTIPLE REGRESSION MODEL

Apply multiple regression, if the data have a continuous independent variable. Apply it to the above dataset.
8 REGRESSION MODEL FOR PREDICTION
Apply regression model techniques to predict values on the above dataset

9 CLASSIFICATION MODEL
a. Install relevant package for classification.
b. Choose classifier for classification problem.
c. Evaluate the performance of classifier.
10 CLUSTERING MODEL
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data using R visualizations.

Reference Books:
Yanchang Zhao, “R and Data Mining: Examples and Case Studies”, Elsevier, 1st Edition, 2012
Web References:
1. http://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
2. http://www.ats.ucla.edu/stat/r/dae/rreg.htm
3. http://www.coastal.edu/kingw/statistics/R-tutorials/logistic.html
4. http://www.ats.ucla.edu/stat/r/data/binary.csv
SOFTWARE AND HARDWARE REQUIREMENTS FOR 18 STUDENTS:
SOFTWARE: R, RStudio
HARDWARE: 18 Intel desktop computers with 4 GB RAM

1 - R AS CALCULATOR APPLICATION

a. Using without R objects on console

> 2587+2149
Output:- [1] 4736

> 287954-135479
Output:- [1] 152475

> 257*52
Output:- [1] 13364

> 257/21
Output:- [1] 12.2381

Using with R objects on console:

> A = 1000
> B = 2000
> c = A + B
> c
Output:- [1] 3000

b. Using mathematical functions on console

> a = 100
> class(a)
[1] "numeric"

> b = 500
> c = a - b
> class(b)
[1] "numeric"

> sum <- a - b
> sum
[1] -400
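The listing above exercises only arithmetic and class(); a few of R's built-in mathematical functions on the console are shown below as a minimal sketch (the input values are illustrative):

> sqrt(144)
[1] 12
> log(100)    # natural logarithm
[1] 4.60517
> log10(100)  # base-10 logarithm
[1] 2
> exp(1)
[1] 2.718282
> abs(-25)
[1] 25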

c. Write an R script to create R objects for a calculator application and save it in a specified location on disk.

> getwd()
[1] "C:/Users/Administrator/Documents"
> write.csv(a, 'a.csv')
> write.csv(a, 'C:\\Users\\Administrator\\Documents\\a.csv')
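A fuller sketch of this task is given below; the object names, the operations chosen, and the output file name results.csv are illustrative assumptions, not part of the original exercise.

# calculator.R - create R objects for a simple calculator and save the results to disk
a <- 2587   # first operand
b <- 2149   # second operand

add      <- a + b   # addition
subtract <- a - b   # subtraction
multiply <- a * b   # multiplication
divide   <- a / b   # division

# Collect the results in a data frame
results <- data.frame(operation = c("add", "subtract", "multiply", "divide"),
                      value     = c(add, subtract, multiply, divide))

# Save the results in a specified location on disk (path is illustrative)
write.csv(results, "C:\\Users\\Administrator\\Documents\\results.csv", row.names = FALSE)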
2 - DESCRIPTIVE STATISTICS IN R

a. Write an R script to find basic descriptive statistics using the summary(), str(), and quantile() functions on the mtcars and cars datasets.

# Load and explore the 'mtcars' dataset


mtcars # Display the 'mtcars' dataset in the console

# Summary statistics of the 'mtcars' dataset


summary(mtcars) # Summary of each column in 'mtcars': min, max, mean, median, and quartiles

# Structure of the 'mtcars' dataset


str(mtcars) # Structure of the dataset: column names, data types, and a preview of values

# Quantiles of the 'mpg' (miles per gallon) column in 'mtcars'


quantile(mtcars$mpg) # Calculates the quantiles (0%, 25%, 50%, 75%, 100%) of the 'mpg' variable

# Load and explore the 'cars' dataset


cars # Display the 'cars' dataset in the console (default dataset with 'speed' and 'dist' columns)

# Summary statistics of the 'cars' dataset


summary(cars) # Provides summary statistics for 'speed' and 'dist' columns in the 'cars' dataset

# Check the class of the 'cars' dataset


class(cars) # Returns the class/type of the 'cars' object, usually "data.frame"

# Check the dimensions of the 'cars' dataset


dim(cars) # Displays the number of rows and columns in the 'cars' dataset

# Structure of the 'cars' dataset


str(cars) # Structure of the dataset: column names, data types, and a preview of values

# Quantiles of the 'speed' column in the 'cars' dataset


quantile(cars$speed) # Calculates the quantiles (0%, 25%, 50%, 75%, 100%) of the 'speed' variable

Explanation of Datasets:

1. mtcars: A built-in dataset in R containing specifications and performance metrics of 32 car


models, including variables like mpg (miles per gallon), cyl (cylinders), hp (horsepower), and
more.
2. cars: A built-in dataset in R containing data on speed (mph) and stopping distances (feet) of cars,
often used for linear regression examples.
b. Write an R script to find a subset of a dataset using the subset() and aggregate() functions on the iris dataset.

The subset() function:

The subset() function in R is used to filter rows and select specific columns from a dataset. Here's how
you can use it with the iris dataset:

1. Filter Rows Based on a Condition

Extract rows where Sepal.Length is greater than 5.0:

subset(iris, Sepal.Length > 5.0)

2. Filter Rows with Multiple Conditions

Extract rows where Sepal.Length is greater than 5.0 and Species is "setosa":

subset(iris, Sepal.Length > 5.0 & Species == "setosa")

3. Select Specific Columns

Extract rows where Sepal.Width is less than 3.0, but only display Sepal.Length and
Sepal.Width:

subset(iris, Sepal.Width < 3.0, select = c(Sepal.Length, Sepal.Width))

4. Filter Rows with Exact Matches

Extract rows where Species is "versicolor":

subset(iris, Species == "versicolor")

5. Filter Rows with Column Subsetting

Extract rows where Sepal.Length equals 5.0 and only include the Sepal.Length and Species
columns:

subset(iris, Sepal.Length == 5.0, select = c(Sepal.Length, Species))

6. Use a Range for Filtering

Extract rows where Petal.Length is between 3.0 and 5.0:


subset(iris, Petal.Length >= 3.0 & Petal.Length <= 5.0)

7. Filter Rows Based on Factors

Extract rows where Species is not "setosa":

subset(iris, Species != "setosa")

The aggregate() function:

1. Calculate the mean of all numeric variables grouped by Species:

aggregate(. ~ Species, data = iris, mean)

● Groups the dataset by Species and computes the mean for all numeric columns
(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width).

2. Calculate the sum of all numeric variables grouped by Species:

aggregate(. ~ Species, data = iris, sum)

● Groups the dataset by Species and computes the sum for all numeric columns.

3. Calculate the maximum value of each numeric variable grouped by Species:

aggregate(. ~ Species, data = iris, max)

● Groups the dataset by Species and returns the maximum value for each numeric column.

4. Calculate the minimum value of each numeric variable grouped by Species:

aggregate(. ~ Species, data = iris, min)

● Groups the dataset by Species and returns the minimum value for each numeric column.

5. Calculate the median of each numeric variable grouped by Species:

aggregate(. ~ Species, data = iris, median)

● Groups the dataset by Species and returns the median value for each numeric column.
6. Calculate the standard deviation for each numeric variable grouped by Species:

aggregate(. ~ Species, data = iris, sd)

● Groups the dataset by Species and calculates the standard deviation for each numeric column.

7. Use only specific columns:

aggregate(Sepal.Length ~ Species, data = iris, mean)

● Groups the dataset by Species and calculates the mean for the Sepal.Length column only.
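Several statistics can also be computed in a single call by passing a custom function to aggregate(); a small sketch:

aggregate(. ~ Species, data = iris, FUN = function(x) c(mean = mean(x), sd = sd(x)))

● Groups the dataset by Species and returns both the mean and the standard deviation of every numeric column in one result.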
3 - READING AND WRITING DIFFERENT TYPES OF DATASETS

a. Reading different types of data sets (.txt, .csv) from the web and disk, and writing them to a file in a specific disk location.

library(utils)
data<- read.csv("input.csv")
data

Output :-

data<- read.csv("input.csv")

print(is.data.frame(data))
print(ncol(data))
print(nrow(data))

Output:-

# Create a data frame.


data<- read.csv("input.csv")

# Get the max salary from data frame.


sal<- max(data$salary)
sal
Output:- [1] 843.25
# Create a data frame.
data<- read.csv("input.csv")

# Get the max salary from data frame.


sal<- max(data$salary)

# Get the person detail having max salary.


retval <- subset(data, salary == max(salary))
retval

Output:-

Get all the people working in the IT department


# Create a data frame.

data<- read.csv("input.csv")
retval <- subset(data, dept == "IT")
retval

Output:-

#Create a data frame.


data<- read.csv("input.csv")
retval<- subset(data, as.Date(start_date) >as.Date("2014-01-01"))

# Write filtered data into a new file.

write.csv(retval, "output.csv")
newdata <- read.csv("output.csv")
newdata

Output:-

b. Reading Excel data sheet in R.

install.packages("xlsx")
library("xlsx")
data<- read.xlsx("input.xlsx", sheetIndex = 1)
data

Output:-

c. Reading XML dataset in R.

install.packages("XML")
library("XML")
library("methods")
result<- xmlParse(file = "input.xml")
result

Output:-
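If input.xml holds a flat set of repeated record nodes, the parsed document can also be converted to a data frame; a minimal sketch using xmlToDataFrame() from the same XML package (whether this applies depends on the structure of the file):

xmldf <- xmlToDataFrame("input.xml")
xmldf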

4 – VISUALIZATIONS
a. Find the data distributions using box and scatter plot.

install.packages("ggplot2")
library(ggplot2)
input <- mtcars[, c('mpg', 'cyl')]
input

boxplot(mpg ~ cyl, data = mtcars, xlab = "number of cylinders", ylab = "miles per gallon", main =
"mileage data")

dev.off()

Output :-
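The exercise also asks for a scatter plot; a minimal sketch using base R plot() on mtcars is given below (the choice of wt and mpg as the two variables is an illustrative assumption):

# Scatter plot of weight vs. mileage
plot(mtcars$wt, mtcars$mpg,
     xlab = "weight (1000 lbs)", ylab = "miles per gallon",
     main = "mtcars scatter plot", pch = 19, col = "darkblue")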
b. Find the outliers using plot.

# Define the vector


v <- c(50, 75, 100, 125, 150, 175, 200)

# Create the boxplot


boxplot(v, main = "Boxplot of v", horizontal = TRUE, col = "lightblue")

# Compute Q1, Q3, and IQR


q1 <- quantile(v, 0.25) # First quartile
q3 <- quantile(v, 0.75) # Third quartile
iqr <- q3 - q1 # Interquartile range

# Calculate the lower and upper bounds for outliers


lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Print the bounds


cat("Lower Bound for Outliers:", lower_bound, "\n")
cat("Upper Bound for Outliers:", upper_bound, "\n")

# Identify outliers
outliers <- v[v < lower_bound | v > upper_bound]
cat("Outliers:", outliers, "\n")

output :
c. Plot the histogram, bar chart and pie chart on sample
data.

Histogram

# Load the necessary library


library(graphics) # For creating visualizations, like histograms (part of base R)

# Create a sample vector


v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Create the histogram


hist(
v, # Data to plot
xlab = "Weight", # Label for the x-axis
col = "blue", # Color for the bars
border = "green", # Color for the bar borders
main = "Histogram of Sample Data" # Title of the histogram
)

# Turn off the current graphics device (if needed)


dev.off()

Output:-
Bar chart

# Load the necessary library


library(graphics) # This is part of base R, used for plotting functions

# Sample data
H <- c(7, 12, 28, 3, 41) # Revenue data for each month
M <- c("Jan", "Feb", "Mar", "Apr", "May") # Month names

# Create the bar chart


barplot(
H, # Heights of the bars (Revenue)
names.arg = M, # Labels for each bar (Months)
xlab = "Month", # Label for the x-axis
ylab = "Revenue", # Label for the y-axis
col = "blue", # Color of the bars
main = "Revenue Chart", # Title of the chart
border = "red" # Color of the bar borders
)

# Close the current graphics device (not always necessary in RStudio)


dev.off()

output :

Pie Chart

# Load the necessary library


library(graphics) # This is part of base R, used for plotting functions

# Sample data
x <- c(21, 62, 10, 53) # Data values representing some quantities (e.g., population, sales, etc.)
labels <- c("London", "New York", "Singapore", "Mumbai") # Labels for each segment of the pie chart

# Create the Pie chart


pie(
x, # Values for the pie chart
labels = labels, # Labels for each segment
col = rainbow(length(x)), # Color the segments with a rainbow color palette
main = "City Distribution" # Title of the pie chart
)

# Turn off the graphics device (not always necessary in RStudio)


dev.off()
output :
5 - CORRELATION AND COVARIANCE

a. Find the correlation matrix and plot the correlation on the iris dataset

# Load necessary libraries


library(corrplot) # For plotting the correlation matrix
library(ggplot2) # For the iris dataset

# Load the iris dataset


data(iris)

# Select only numeric columns from the iris dataset (to avoid categorical 'Species')
iris_numeric <- iris[, 1:4] # Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

# Calculate the correlation matrix for the numeric columns


cor_matrix <- cor(iris_numeric)

# Print the correlation matrix


print(cor_matrix)

# Visualize the correlation matrix using corrplot


corrplot(cor_matrix, method = "square", type = "upper",
         col = colorRampPalette(c("blue", "white", "red"))(200),
         title = "Correlation Matrix for Iris Data", mar = c(0, 0, 1, 0))

b. Plot the correlation plot on the dataset and visualize it, giving an overview of relationships among the iris data.

SOURCE CODE:
# Load necessary libraries
library(corrplot) # For plotting the correlation matrix
library(ggplot2) # For the iris dataset

# Load the iris dataset


data(iris)

# Remove the Species column as it is categorical


iris_numeric <- iris[, 1:4] # Select only the numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)

# Calculate the correlation matrix for the numeric columns


cor_matrix <- cor(iris_numeric)

# Print the correlation matrix


print(cor_matrix)
# Visualize the correlation matrix using corrplot
corrplot(cor_matrix, method = "square", type = "upper",
col = colorRampPalette(c("blue", "white", "red"))(200),
title = "Correlation Matrix for Iris Data", mar = c(0, 0, 1, 0))

c. Analysis of variance (ANOVA), if the data have categorical variables, on the iris data.

To analyze the variance (ANOVA) on the Iris dataset using categorical variables, we need to look at
how categorical variables (such as Species) influence the continuous variables (like Sepal.Length
and Petal.Length). One of the common ways to assess this is by performing Analysis of Variance
(ANOVA), which helps determine if there are statistically significant differences between the means of
different groups (in this case, the different species of flowers).

The analysis includes:

1. Visualizing relationships using ggplot2.


2. Performing ANOVA to check for significant differences based on the categorical variable
Species.

Steps for ANOVA analysis on Iris dataset:

1. Visualize relationships between variables (Sepal.Length, Petal.Length) by species.


2. Perform ANOVA to check if there are significant differences in the means of the numeric
variables (Sepal.Length, Petal.Length) across the different species.

SOURCE CODE:
# Load necessary libraries
library(ggplot2)

# Load the iris dataset


data(iris)

# Structure of the iris dataset


str(iris)

# Scatter plot to visualize the relationship between Sepal.Length and Petal.Length by Species
ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) +
geom_point(size=2) +
geom_smooth(method="lm", aes(color=Species), se=FALSE) +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") +
ylab("Petal Length") +
theme(legend.position="bottom")
# Perform ANOVA to see if Sepal.Length differs by Species
anova_sepal <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_sepal)

# Perform ANOVA to see if Petal.Length differs by Species


anova_petal <- aov(Petal.Length ~ Species, data = iris)
summary(anova_petal)
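A significant ANOVA result only indicates that at least one species mean differs. As an optional extension of the listing above, Tukey's HSD test identifies which pairs of species differ:

# Post-hoc pairwise comparisons for the Sepal.Length ANOVA
TukeyHSD(anova_sepal)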
6 - REGRESSION MODEL

PROBLEM DEFINITION:
Import data from web storage. Name the dataset and perform logistic regression to find the relation between the variables that affect the admission of a student to an institute, based on his or her GRE score, GPA, and rank. Also check whether the model fits.
require(foreign), require(MASS)

Steps for Logistic Regression:


1. Load the Data: First, we need a dataset to work with. The original dataset from UCLA (http://www.ats.ucla.edu/stat/data/binary.csv) contains data on student admissions; if it is not accessible, a dataset with the same structure is simulated for illustration.
2. Preprocess the Data: The dataset is set up for binary classification (admitted = 1, not admitted = 0). Any similar dataset can be used, or one can be loaded locally for the exercise.
3. Logistic Regression Model: The model will predict the probability of a student being admitted
based on their GRE score, GPA, and rank. Logistic regression is ideal for binary classification tasks
like this.
4. Model Evaluation: After fitting the model, we will evaluate whether the model fits the data well by
inspecting the coefficients and summary statistics.

SOURCE CODE:

Solution using a Dummy Dataset:

The full solution uses a dummy dataset that replicates the structure of the student admission problem, and then fits a logistic regression model.

Step-by-Step Code:
1. Load and Prepare the Data:

We'll simulate a dataset for student admission.

# Load required libraries


require(foreign)
require(MASS)

# Simulate a dummy dataset

set.seed(123)  # For reproducibility
mydata <- data.frame(
  admit = sample(0:1, 100, replace = TRUE),  # Binary outcome: 0 = Not Admitted, 1 = Admitted
  gre   = rnorm(100, 310, 30),               # GRE score: normally distributed, mean = 310, SD = 30
  gpa   = rnorm(100, 3.0, 0.4),              # GPA: normally distributed, mean = 3.0, SD = 0.4
  rank  = sample(1:4, 100, replace = TRUE)   # Institution rank: 1 (best) to 4 (worst)
)

# Treat rank as a categorical predictor with 4 levels
mydata$rank <- factor(mydata$rank)

# Preview the dataset

head(mydata)

2. Logistic Regression Model:

We'll use glm (Generalized Linear Model) for logistic regression. The dependent variable is admit, and
the independent variables are gre, gpa, and rank.

# Logistic regression model


model <- glm(admit ~ gre + gpa + rank, data = mydata, family = binomial)

# Display the summary of the model


summary(model)

3. Model Interpretation:

● The summary(model) will provide the coefficients, standard errors, z-values, and p-values for
each predictor (gre, gpa, and rank).
● The coefficients represent the change in the log-odds of being admitted for a one-unit change in the
predictor variable, holding other variables constant.

For example, if the coefficient for gre is positive, it means that as GRE score increases, the probability of
admission increases, and similarly for other predictors.
4. Model Evaluation (Goodness-of-Fit):

We can check the goodness-of-fit for the model using the deviance and AIC (Akaike Information
Criterion).

# Deviance and AIC for the logistic regression model


deviance(model) # A measure of goodness-of-fit
AIC(model) # Akaike Information Criterion (lower AIC is better)

5. Predicted Probabilities and Classification:

Now we can use the predict() function to predict the probabilities of admission for each observation.

# Predicted probabilities
predicted_probs <- predict(model, type = "response")

# Predicted outcomes (classifications)


predicted_class <- ifelse(predicted_probs > 0.5, 1, 0)

# Compare predicted vs actual outcomes


table(predicted_class, mydata$admit)

This will give us a confusion matrix showing the number of correct and incorrect predictions.
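As a quick follow-up (not part of the original listing), overall accuracy can be read off the same vectors:

# Proportion of observations classified correctly
mean(predicted_class == mydata$admit)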

OUTPUT:
7 - MULTIPLE REGRESSION MODEL

PROBLEM DEFINITION:
Apply multiple regression, if the data have a continuous independent variable. Apply it to the above dataset.

This experiment applies multiple regression on a dataset where the independent variables are continuous. The dataset includes a binary outcome (admit: 0 = not admitted, 1 = admitted) and continuous variables such as GRE scores and GPA. We fit a logistic regression model to predict the probability of admission from these independent variables (GRE and GPA). Since the outcome is binary (admit = 0 or 1), logistic regression is the appropriate form of the model.

Where the dependent variable itself is continuous, ordinary multiple regression would be used to model the relationship between several continuous independent variables and the response. Logistic regression is the analogous model for binary outcomes, so the problem is treated here as a logistic regression problem (a sketch of an ordinary multiple regression on the same data appears at the end of this experiment).

SOURCE CODE:

Step 1: Creating the Dataset


# Create a dataset
mydata <- data.frame(
admit = c(0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0),
# Binary outcome (0 = not admitted, 1 = admitted)
gre = c(380, 660, 800, 640, 520, 520, 760, 400, 540, 700, 800, 450, 360, 720, 560,
710, 440, 680, 760, 500), # GRE scores
gpa = c(3.61, 3.67, 4.00, 3.19, 2.93, 2.75, 3.00, 3.08, 3.39, 3.92, 3.75, 2.88,
2.52, 3.68, 3.32, 3.94, 2.98, 3.50, 3.89, 3.15), # GPA scores
rank = c(3, 1, 1, 4, 4, 2, 1, 3, 2, 2, 1, 4, 4, 2, 3, 1, 3, 2, 1, 4)
# Rank of undergraduate institution (1 = best, 4 = worst)
)

Explanation:

● A dataset called mydata is created using data.frame().


○ admit: Binary variable (0 = not admitted, 1 = admitted).
○ gre: GRE scores (continuous variable).
○ gpa: GPA scores (continuous variable).
○ rank: The rank of the undergraduate institution (categorical variable: 1 is best, 4 is worst).

This dataset simulates 20 students and their admission results based on the given predictors.

Step 2: Preview the Dataset

# Preview the dataset


head(mydata)

Explanation:

● The head() function displays the first few rows of the dataset (mydata).
● This helps to inspect the structure and values in the dataset to ensure the data was loaded correctly.
Step 3: Convert 'rank' to a Factor

# Ensure 'rank' variable is a factor


mydata$rank <- factor(mydata$rank)

Explanation:

● The rank variable is a categorical variable (with integer values 1, 2, 3, and 4). In regression
modeling, it is important to convert categorical variables to factors so they can be treated correctly
in statistical models.
● The factor() function is used to convert the rank variable into a factor.

Step 4: Fit a Logistic Regression Model

# Fit a logistic regression model


mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")

Explanation:

● We use the glm() function to fit a logistic regression model.


○ admit: The dependent variable (binary outcome: 0 or 1).
○ gre, gpa, rank: These are the independent variables (predictors) in the model.
○ family = "binomial": This specifies that we are fitting a logistic regression model
for binary outcomes (since admit is binary).

This model estimates the probability of a student being admitted based on their GRE score, GPA, and rank
of their undergraduate institution.

Step 5: Display the Summary of the Model


# Display the summary of the model
summary(mylogit)

OUTPUT:
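As noted in the problem discussion, a minimal sketch of an ordinary multiple regression on the same data is shown below; treating gpa as the continuous response and gre and rank as predictors is an assumption made purely for illustration, since the original admit outcome is binary:

# Ordinary multiple linear regression with a continuous response (illustrative only)
linmod <- lm(gpa ~ gre + rank, data = mydata)
summary(linmod)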
8 - REGRESSION MODEL FOR PREDICTION

Linear Regression with Categorical Data (Region as a Predictor)

The goal is to analyze how the average SAT score (csat) varies across different geographical regions using a linear regression model. The dataset includes SAT scores for several states, along with the region each state belongs to. Since region is a categorical variable, it is treated accordingly in the regression model.

Step 1: Create a Fictional Dataset

# Create a fictional dataset


states.data <- data.frame(
  state = c("California", "New York", "Texas", "Florida", "Illinois",
            "Ohio", "Georgia", "Pennsylvania", "Arizona", "Michigan"),
  csat = c(1010, 1100, 980, 1020, 1050, 990, 970, 1080, 950, 1005),  # Example SAT scores
  region = c("West", "N. East", "South", "South", "Midwest",
             "Midwest", "South", "N. East", "West", "Midwest")       # Geographic regions
)

● A dataset called states.data is created using data.frame().


● The dataset contains the following columns:
○ state: The name of the state.
○ csat: The average SAT score (a continuous variable).
○ region: The geographic region to which each state belongs (a categorical variable, which
will be treated as a factor).

This dataset represents the SAT scores of 10 states from different regions of the United States.

Step 2: Ensure the region Column is a Factor

# Ensure the 'region' column is a factor


states.data$region <- factor(states.data$region)

● The region variable is categorical (non-numeric), so it is important to convert it to a factor using


factor().
● This ensures that R treats region as a categorical variable in the regression model, and it will be
used to create dummy variables during the regression process.

Step 3: Inspect the Structure of the Dataset

# Inspect the structure of the dataset


str(states.data)
● The str() function provides a compact view of the dataset's structure, showing the types of each
variable (e.g., whether they are numeric, factor, etc.), the first few values of each column, and the
number of observations.
● This helps us confirm that the data has been correctly imported and the region variable is now a
factor.

Step 4: Fit a Linear Regression Model

# Fit a linear regression model with 'region' as a predictor


sat.region <- lm(csat ~ region, data = states.data)

● We use the lm() function to fit a linear regression model where csat (the SAT scores) is the
dependent variable, and region (the geographical region) is the independent variable.
● The formula csat ~ region specifies that csat is modeled as a function of region. This
tells R to predict SAT scores based on the region in which the state is located.
● The lm() function automatically treats region as a categorical predictor (factor) and creates
dummy variables for each level of the factor (e.g., for West, N. East, South, and Midwest).

Step 5: Show the Regression Coefficients Table

# Show the regression coefficients table


coef(summary(sat.region))

Output:
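As an optional check (not part of the original listing), an overall F-test shows whether region as a whole explains a significant share of the variance in csat:

# Overall F-test for the region factor
anova(sat.region)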
9 - CLASSIFICATION MODEL

a. Install relevant package for classification:


To begin a classification task in R, we need to install and load the necessary libraries that provide the tools
for classification, decision trees, and model evaluation. These libraries help in building classification
models like regression trees and support advanced visualization tools.

b. Choose classifier for classification problem and evaluate the performance of the classifier:
In this problem, we will use a decision tree classifier to predict the Salary of baseball players based on
their performance (measured in Hits and Years of experience). The steps include preparing the data,
training the decision tree model, performing cross-validation, pruning the tree to improve performance, and
finally evaluating the performance of the model using Mean Squared Error (MSE).

1. Install and Load Relevant Packages:

We need several libraries:

● rpart.plot: For plotting regression and classification trees.


● tree: For decision tree models.
● ISLR: Contains datasets like Hitters for analysis.
● rattle: Provides tools for data mining, including model visualization.

# Install the required packages (if not already installed)


install.packages("rpart.plot")
install.packages("tree")
install.packages("ISLR")
install.packages("rattle")

# Load the libraries


library(tree) # For decision tree functions
library(ISLR)        # For datasets like 'Hitters'
library(rpart.plot) # For plotting trees
library(rattle) # For advanced visualization and exploration

2. Load and Prepare the Data:

We use the Hitters dataset, which contains various variables like Salary, Hits, and Years. The
Salary variable is continuous, and we will apply a regression tree model to predict it.

# Load the Hitters dataset


data(Hitters)

# Open the dataset in the data viewer

View(Hitters)

# Remove rows with missing values in 'Salary' or other columns


Hitters <- na.omit(Hitters)

# Log-transform the Salary variable to make it more normally distributed


Hitters$Salary <- log(Hitters$Salary)

# Visualize the distribution of the Salary variable after transformation


hist(Hitters$Salary, main="Log-transformed Salary", xlab="Log of Salary")

output:

Explanation:

● na.omit(Hitters): Removes any rows with missing data.


● log(Hitters$Salary): The salary variable is log-transformed to make it more normally
distributed, which is often beneficial for regression models.
● hist(Hitters$Salary): This shows the distribution of the log-transformed Salary
variable.

3. Fit a Decision Tree Model:

We create a regression tree model using Hits and Years as predictors to predict the Salary.

# Fit a decision tree model with 'Hits' and 'Years' as predictors


tree.fit <- tree(Salary ~ Hits + Years, data = Hitters)

# Show a summary of the fitted tree model


summary(tree.fit)

# Plot the tree


plot(tree.fit, uniform=TRUE, margin=0.2)
text(tree.fit, use.n=TRUE, all=TRUE, cex=.8)

output:

Explanation:

● tree(Salary ~ Hits + Years, data = Hitters): Fits a regression tree model to


predict Salary based on Hits and Years.
● summary(tree.fit): Displays the tree structure, including the number of terminal nodes and
residual mean deviance.
● plot(tree.fit): Visualizes the structure of the tree.
● text(tree.fit): Adds node labels with additional information (like the number of
observations in each node).

4. Split the Data into Training and Test Sets:

We will split the dataset into a training set (for model fitting) and a test set (for model evaluation).

# Split the data into training and testing sets (50% each)
library(caret)  # provides createDataPartition()
split <- createDataPartition(y = Hitters$Salary, p = 0.5, list = FALSE)

# Training set
train <- Hitters[split, ]

# Testing set
test <- Hitters[-split, ]

output:

Explanation:

● createDataPartition() (from the caret package): Splits the dataset into training and testing sets. Here, 50% of the data is used for training and the remaining 50% for testing.

5. Fit the Tree Model on the Training Data:

We fit the regression tree model on the training data using all available predictors (Hits, Years, etc.).

# Create the regression tree model on the training set


trees <- tree(Salary ~ ., train)

# Plot the tree


plot(trees)
text(trees, pretty = 0)

output:

Explanation:

● tree(Salary ~ ., train): Fits a regression tree model using all available predictors
(denoted by the .) on the training set.
● plot(trees) and text(trees): Visualize the fitted tree structure.

6. Cross-Validation to Prune the Tree:

Pruning helps improve the performance of the decision tree by removing unnecessary splits and avoiding
overfitting.

# Perform cross-validation to determine the best size of the tree


cv.trees <- cv.tree(trees)

# Plot the cross-validation results to determine the best tree size


plot(cv.trees)

# Prune the tree to the optimal size (e.g., 4 terminal nodes)


prune.trees <- prune.tree(trees, best = 4)

# Plot the pruned tree


plot(prune.trees)
text(prune.trees, pretty = 0)

output:
Explanation:

● cv.tree(trees): Performs cross-validation to assess the performance of trees with different


sizes.
● plot(cv.trees): Visualizes the cross-validation results. The x-axis represents the number of
terminal nodes, and the y-axis represents the deviance (error).
● prune.tree(trees, best = 4): Prunes the tree to the optimal size based on cross-
validation results (in this case, 4 terminal nodes).
● plot(prune.trees) and text(prune.trees): Visualizes the pruned tree.

7. Make Predictions and Evaluate Performance:

After pruning the tree, we predict the Salary for the test data and evaluate the model's performance by
calculating the Mean Squared Error (MSE).

# Make predictions using the pruned tree on the test data


yhat <- predict(prune.trees, test)

# Plot the predicted vs. actual values


plot(yhat, test$Salary)
abline(0, 1) # Add a reference line for perfect prediction

# Calculate Mean Squared Error (MSE) for model evaluation


mean((yhat - test$Salary)^2)

output:

Explanation:

● predict(prune.trees, test): Uses the pruned tree to predict the Salary for the test set.
● plot(yhat, test$Salary): Plots the predicted values (yhat) against the actual Salary
values in the test set. A line with slope 1 (abline(0, 1)) is added to indicate perfect
predictions.
● mean((yhat - test$Salary)^2): Computes the Mean Squared Error (MSE) to evaluate
the model's accuracy. The lower the MSE, the better the model fits the data.

10 - CLUSTERING MODEL

PROBLEM DEFINITION:
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data using R visualizations.

Clustering is an unsupervised learning technique used to group similar data points into clusters based on
certain characteristics or features. Unlike supervised learning, clustering does not rely on labeled data and
aims to identify the inherent structure of the data. It's often used in exploratory data analysis to find
patterns or groupings in data.

In a clustering model, data points within the same group (or cluster) are more similar to each other than to
those in other groups. These models are particularly useful in tasks like customer segmentation, anomaly
detection, and pattern recognition.

Types of Clustering Algorithms:


There are several types of clustering algorithms, but the most commonly used ones include:

1. K-Means Clustering:
○ It is one of the most popular clustering algorithms.
○ K-means is based on partitioning the data into k clusters, where each cluster is defined by
its centroid (the mean of all the points in the cluster).
○ The algorithm iterates between assigning data points to the nearest centroid and updating the
centroids based on the points assigned to them, until convergence.
2. Hierarchical Clustering:
○ Hierarchical clustering creates a tree-like structure of clusters called a dendrogram.
○ The data points are recursively merged or split into clusters based on similarity, either using
an agglomerative (bottom-up) or divisive (top-down) approach.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
○ DBSCAN clusters points that are closely packed together while marking points in low-density regions as noise.
○ Unlike K-means, DBSCAN does not require specifying the number of clusters beforehand (a minimal sketch follows this list).
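DBSCAN itself is not demonstrated in the experiments below; for completeness, a minimal sketch is shown here, assuming the add-on dbscan package is available (the eps and minPts values are illustrative):

# install.packages("dbscan")   # if not already installed
library(dbscan)
db <- dbscan(iris[, 3:4], eps = 0.4, minPts = 5)   # cluster the petal measurements
db$cluster                                         # cluster labels; 0 marks points treated as noise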

K-Means Clustering Algorithm:

In this example, we will use the K-Means algorithm for clustering. The K-means algorithm requires
specifying the number of clusters, k, in advance. It tries to minimize the sum of squared distances between
the points in the cluster and their centroid.

SOURCE CODE:
1. Clustering algorithms for unsupervised classification.

The K-Means clustering algorithm is used to group the iris dataset based on two of its features: Petal.Length and Petal.Width (columns 3 and 4 of the iris dataset). Here is a breakdown of the code:

# Load the necessary library for clustering


library(cluster)
# Set the seed for reproducibility
set.seed(20)

# Apply K-means clustering on Petal.Length and Petal.Width (columns 3 and 4) of the iris dataset
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)

# Output the clustering results


irisCluster

2. K-means on the petal features, with cluster assignments added to the data and visualized with ggplot2:

library(cluster)

# Set seed for reproducibility


set.seed(20)

# Perform K-Means clustering on Petal.Length and Petal.Width with 3 clusters


irisCluster <- kmeans(iris[, 3:4], centers = 3, nstart = 20)

# Print the clustering result


print(irisCluster)

# Add cluster assignments to the original iris dataset


iris$Cluster <- as.factor(irisCluster$cluster)

# Load ggplot2 for visualization


library(ggplot2)

# Visualize the clusters using a scatter plot


ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Cluster)) +
geom_point(size = 3) +
labs(
title = "K-Means Clustering of Iris Dataset (Petal Features)",
x = "Petal Length",
y = "Petal Width"
)+
theme_minimal()

OUTPUT:

Cluster means:
  Petal.Length Petal.Width
1     1.462000    0.246000
2     4.269231    1.342308
3     5.595833    2.037500

Clustering vector:

[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[42] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2
[83] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3
[124] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3

Within cluster sum of squares by cluster:

[1] 2.02200 13.05769 16.29167


(between_SS / total_SS = 94.3 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss"

[6] "betweenss" "size" "iter" "ifault"

OUTPUT:

SOURCE CODE:

# Load dataset
data(mtcars)

# Step 1: Compute the distance matrix


d <- dist(as.matrix(mtcars)) # Calculates the Euclidean distance between rows of the mtcars dataset

# Step 2: Perform hierarchical clustering


hc <- hclust(d) # Applies hierarchical clustering on the distance matrix

# Step 3: Plot the dendrogram


plot(hc, main = "Hierarchical Clustering Dendrogram", xlab = "Car Models", sub = "", cex = 0.8)
OUTPUT:
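To turn the dendrogram into concrete group assignments, the tree can be cut at a chosen number of clusters (the value k = 3 below is illustrative):

# Cut the dendrogram into 3 groups and count the cars in each
groups <- cutree(hc, k = 3)
table(groups)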

3. Plot the cluster data using R visualizations.

To plot cluster data using R visualizations, the pam (Partitioning Around Medoids) clustering method and the clusplot() function from the cluster package are used. The complete code to generate the clusters and visualize them with clusplot() follows:

Complete Source Code


Step 1: Install and Load Necessary Packages

# Install the 'cluster' package if not already installed


if (!require("cluster")) install.packages("cluster")

# Load the 'cluster' package


library(cluster)

Step 2: Generate and Plot Cluster Data (Without Noise)

# Generate 25 objects divided into 2 clusters


set.seed(123) # Set seed for reproducibility
x <- rbind(
cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5))
)

# Perform clustering using Partitioning Around Medoids (PAM)


pam_result <- pam(x, 2)
# Visualize the clusters
clusplot(pam_result, main = "Cluster Plot (Without Noise)",
         color = TRUE, shade = TRUE, labels = 2, lines = 0)

Step 3: Add Noise and Re-Cluster the Data

# Add noise to the dataset


x4 <- cbind(x, rnorm(25), rnorm(25))

# Perform clustering again with the noisy dataset


pam_result_with_noise <- pam(x4, 2)

# Visualize the clusters with noise


clusplot(pam_result_with_noise, main = "Cluster Plot (With Noise)",
color = TRUE, shade = TRUE, labels = 2, lines = 0)

Explanation

1. Cluster Data Generation:


○ Two clusters are generated with rnorm() around (0, 0) and (5, 5) with a small
standard deviation of 0.5.
2. Adding Noise:
○ Extra random dimensions are added using rnorm(25) to simulate noise in the data.
3. Clustering Method:
○ PAM (Partitioning Around Medoids) is used, which is robust to noise and selects
medoids as cluster centers.
4. Visualization:
○ clusplot() creates a 2D representation of the clusters.
○ Arguments:
■ color = TRUE: Adds colors to the clusters.
■ shade = TRUE: Adds shading for better visualization.
■ labels = 2: Labels both points and clusters.
■ lines = 0: Disables connecting lines between clusters.

OUTPUT:

OUTPUT:
