DSR LAB MANUAL - 10 Programs

DATA SCIENCE LABORATORY

VII Semester: CSE

Course Code: PC 751 CS
Category: Core
Hours / Week: L: -  T: -  P: 3
Credits: 1.5
Maximum Marks: CIE: 25  SEE: 50  Total: 75
OBJECTIVES:
The course should enable the students to:
I. Understand the R programming language.
II. Gain exposure to solving data science problems.
III. Understand classification and regression models.
LIST OF EXPERIMENTS

1 R AS CALCULATOR APPLICATION
a. Using with and without R objects on console
b. Using mathematical functions on console
c. Write an R script to create R objects for a calculator application and save it in a specified location on disk
2 DESCRIPTIVE STATISTICS IN R

a. Write an R script to find basic descriptive statistics using the summary(), str(), and quantile() functions on the mtcars dataset.
b. Write an R script to find a subset of a dataset using subset()
3 READING AND WRITING DIFFERENT TYPES OF DATASETS

a. Reading different types of data sets (.txt, .csv) from the web and disk, and writing them to a file in a specific disk location.
b. Reading Excel data sheet in R.
c. Reading XML dataset in R.
4 VISUALIZATIONS

a. Find the data distributions using box and scatter plot.


b. Find the outliers using plot.
c. Plot the histogram, bar chart and pie chart on sample data
5 CORRELATION AND COVARIANCE

a. Find the correlation matrix.


b. Plot the correlation plot on the dataset and visualize it, giving an overview of relationships among the iris data.
c. Analysis of covariance / variance (ANOVA), if the data have categorical variables, on the iris data

6 REGRESSION MODEL

Import data from web storage. Name the dataset and perform logistic regression to find the relation between the variables that affect the admission of a student to an institute, based on his or her GRE score, GPA, and rank. Also check whether the model fits.
require(foreign), require(MASS)
7 MULTIPLE REGRESSION MODEL

Apply multiple regression, if the data have a continuous independent variable. Apply it to the above dataset.
8 REGRESSION MODEL FOR PREDICTION
Apply regression model techniques to predict values on the above dataset

9 CLASSIFICATION MODEL
a. Install relevant package for classification.
b. Choose classifier for classification problem.
c. Evaluate the performance of classifier.
10 CLUSTERING MODEL
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data using R visualizations.

Reference Books:
Yanchang Zhao, “R and Data Mining: Examples and Case Studies”, Elsevier, 1st Edition, 2012
Web References:
1. http://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
2. http://www.ats.ucla.edu/stat/r/dae/rreg.htm
3. http://www.coastal.edu/kingw/statistics/R-tutorials/logistic.html
4. http://www.ats.ucla.edu/stat/r/data/binary.csv
SOFTWARE AND HARDWARE REQUIREMENTS FOR 18 STUDENTS:
SOFTWARE: R, RStudio
HARDWARE: 18 Intel desktop computers with 4 GB RAM

1 - R AS CALCULATOR APPLICATION

a. Using without R objects on console

> 2587+2149
Output:- [1] 4736

> 287954-135479
Output:- [1] 152475

> 257*52
Output:- [1] 13364

> 257/21
Output:- [1] 12.2381

Using with R objects on console:

> A = 1000
> B = 2000
> c = A + B
> c
Output:- [1] 3000

b. Using mathematical functions on console

> a = 100
> class(a)
[1] "numeric"

> b = 500
> c = a - b
> class(b)
[1] "numeric"

> sum <- a - b
> sum
[1] -400
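The listing above exercises only arithmetic and class(); a few of R's built-in mathematical functions on the console are shown below as a minimal sketch (the input values are illustrative):

> sqrt(144)
[1] 12
> log(100)    # natural logarithm
[1] 4.60517
> log10(100)  # base-10 logarithm
[1] 2
> exp(1)
[1] 2.718282
> abs(-25)
[1] 25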

c. Write an R script to create R objects for a calculator application and save it in a specified location on disk.

> getwd()
[1] "C:/Users/Administrator/Documents"
> write.csv(a, 'a.csv')
> write.csv(a, 'C:\\Users\\Administrator\\Documents\\a.csv')
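A fuller sketch of this task is given below; the object names, the operations chosen, and the output file name results.csv are illustrative assumptions, not part of the original exercise.

# calculator.R - create R objects for a simple calculator and save the results to disk
a <- 2587   # first operand
b <- 2149   # second operand

add      <- a + b   # addition
subtract <- a - b   # subtraction
multiply <- a * b   # multiplication
divide   <- a / b   # division

# Collect the results in a data frame
results <- data.frame(operation = c("add", "subtract", "multiply", "divide"),
                      value     = c(add, subtract, multiply, divide))

# Save the results in a specified location on disk (path is illustrative)
write.csv(results, "C:\\Users\\Administrator\\Documents\\results.csv", row.names = FALSE)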
2 - DESCRIPTIVE STATISTICS IN R

a. Write an R script to find basic descriptive statistics using the summary(), str(), and quantile() functions on the mtcars and cars datasets.

# Load and explore the 'mtcars' dataset


mtcars # Display the 'mtcars' dataset in the console

# Summary statistics of the 'mtcars' dataset


summary(mtcars) # Summary of each column in 'mtcars': min, max, mean, median, and quartiles

# Structure of the 'mtcars' dataset


str(mtcars) # Structure of the dataset: column names, data types, and a preview of values

# Quantiles of the 'mpg' (miles per gallon) column in 'mtcars'


quantile(mtcars$mpg) # Calculates the quantiles (0%, 25%, 50%, 75%, 100%) of the 'mpg' variable

# Load and explore the 'cars' dataset


cars # Display the 'cars' dataset in the console (default dataset with 'speed' and 'dist' columns)

# Summary statistics of the 'cars' dataset


summary(cars) # Provides summary statistics for 'speed' and 'dist' columns in the 'cars' dataset

# Check the class of the 'cars' dataset


class(cars) # Returns the class/type of the 'cars' object, usually "data.frame"

# Check the dimensions of the 'cars' dataset


dim(cars) # Displays the number of rows and columns in the 'cars' dataset

# Structure of the 'cars' dataset


str(cars) # Structure of the dataset: column names, data types, and a preview of values

# Quantiles of the 'speed' column in the 'cars' dataset


quantile(cars$speed) # Calculates the quantiles (0%, 25%, 50%, 75%, 100%) of the 'speed' variable

Explanation of Datasets:

1. mtcars: A built-in dataset in R containing specifications and performance metrics of 32 car


models, including variables like mpg (miles per gallon), cyl (cylinders), hp (horsepower), and
more.
2. cars: A built-in dataset in R containing data on speed (mph) and stopping distances (feet) of cars,
often used for linear regression examples.
b. Write an R script to find a subset of a dataset using the subset() and aggregate() functions on the iris dataset.

The subset() function:

The subset() function in R is used to filter rows and select specific columns from a dataset. Here's how
you can use it with the iris dataset:

1. Filter Rows Based on a Condition

Extract rows where Sepal.Length is greater than 5.0:

subset(iris, Sepal.Length > 5.0)

2. Filter Rows with Multiple Conditions

Extract rows where Sepal.Length is greater than 5.0 and Species is "setosa":

subset(iris, Sepal.Length > 5.0 & Species == "setosa")

3. Select Specific Columns

Extract rows where Sepal.Width is less than 3.0, but only display Sepal.Length and
Sepal.Width:

subset(iris, Sepal.Width < 3.0, select = c(Sepal.Length, Sepal.Width))

4. Filter Rows with Exact Matches

Extract rows where Species is "versicolor":

subset(iris, Species == "versicolor")

5. Filter Rows with Column Subsetting

Extract rows where Sepal.Length equals 5.0 and only include the Sepal.Length and Species
columns:

subset(iris, Sepal.Length == 5.0, select = c(Sepal.Length, Species))

6. Use a Range for Filtering

Extract rows where Petal.Length is between 3.0 and 5.0:


subset(iris, Petal.Length >= 3.0 & Petal.Length <= 5.0)

7. Filter Rows Based on Factors

Extract rows where Species is not "setosa":

subset(iris, Species != "setosa")

The aggregate() function:

1. Calculate the mean of all numeric variables grouped by Species:

aggregate(. ~ Species, data = iris, mean)

● Groups the dataset by Species and computes the mean for all numeric columns
(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width).

2. Calculate the sum of all numeric variables grouped by Species:

aggregate(. ~ Species, data = iris, sum)

● Groups the dataset by Species and computes the sum for all numeric columns.

3. Calculate the maximum value of each numeric variable grouped by Species:

aggregate(. ~ Species, data = iris, max)

● Groups the dataset by Species and returns the maximum value for each numeric column.

4. Calculate the minimum value of each numeric variable grouped by Species:

aggregate(. ~ Species, data = iris, min)

● Groups the dataset by Species and returns the minimum value for each numeric column.

5. Calculate the median of each numeric variable grouped by Species:

aggregate(. ~ Species, data = iris, median)

● Groups the dataset by Species and returns the median value for each numeric column.
6. Calculate the standard deviation for each numeric variable grouped by Species:

aggregate(. ~ Species, data = iris, sd)

● Groups the dataset by Species and calculates the standard deviation for each numeric column.

7. Use only specific columns:

aggregate(Sepal.Length ~ Species, data = iris, mean)

● Groups the dataset by Species and calculates the mean for the Sepal.Length column only.
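Several statistics can also be computed in a single call by passing a custom function to aggregate(); a small sketch:

aggregate(. ~ Species, data = iris, FUN = function(x) c(mean = mean(x), sd = sd(x)))

● Groups the dataset by Species and returns both the mean and the standard deviation of every numeric column in one result.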
3 - READING AND WRITING DIFFERENT TYPES OF DATASETS

a. Reading different types of data sets (.txt, .csv) from the web and disk, and writing them to a file in a specific disk location.

library(utils)
data<- read.csv("input.csv")
data

Output :-

data<- read.csv("input.csv")

print(is.data.frame(data))
print(ncol(data))
print(nrow(data))

Output:-

# Create a data frame.


data<- read.csv("input.csv")

# Get the max salary from data frame.


sal<- max(data$salary)
sal
Output:- [1] 843.25
# Create a data frame.
data<- read.csv("input.csv")

# Get the max salary from data frame.


sal<- max(data$salary)

# Get the person detail having max salary.


retval <- subset(data, salary == max(salary))
retval

Output:-

Get all the people working in the IT department


# Create a data frame.

data<- read.csv("input.csv")
retval <- subset(data, dept == "IT")
retval

Output:-

#Create a data frame.


data<- read.csv("input.csv")
retval<- subset(data, as.Date(start_date) >as.Date("2014-01-01"))

# Write filtered data into a new file.

write.csv(retval, "output.csv")
newdata <- read.csv("output.csv")
newdata

Output:-

b. Reading Excel data sheet in R.

install.packages("xlsx")
library("xlsx")
data<- read.xlsx("input.xlsx", sheetIndex = 1)
data

Output:-

c. Reading XML dataset in R.

install.packages("XML")
library("XML")
library("methods")
result<- xmlParse(file = "input.xml")
result

Output:-
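If input.xml holds a flat set of repeated record nodes, the parsed document can also be converted to a data frame; a minimal sketch using xmlToDataFrame() from the same XML package (whether this applies depends on the structure of the file):

xmldf <- xmlToDataFrame("input.xml")
xmldf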

4 – VISUALIZATIONS
a. Find the data distributions using box and scatter plot.

install.packages("ggplot2")
library(ggplot2)
input <- mtcars[, c('mpg', 'cyl')]
input

boxplot(mpg ~ cyl, data = mtcars, xlab = "number of cylinders", ylab = "miles per gallon", main =
"mileage data")

dev.off()

Output :-
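The exercise also asks for a scatter plot; a minimal sketch using base R plot() on mtcars is given below (the choice of wt and mpg as the two variables is an illustrative assumption):

# Scatter plot of weight vs. mileage
plot(mtcars$wt, mtcars$mpg,
     xlab = "weight (1000 lbs)", ylab = "miles per gallon",
     main = "mtcars scatter plot", pch = 19, col = "darkblue")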
b. Find the outliers using plot.

# Define the vector


v <- c(50, 75, 100, 125, 150, 175, 200)

# Create the boxplot


boxplot(v, main = "Boxplot of v", horizontal = TRUE, col = "lightblue")

# Compute Q1, Q3, and IQR


q1 <- quantile(v, 0.25) # First quartile
q3 <- quantile(v, 0.75) # Third quartile
iqr <- q3 - q1 # Interquartile range

# Calculate the lower and upper bounds for outliers


lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Print the bounds


cat("Lower Bound for Outliers:", lower_bound, "\n")
cat("Upper Bound for Outliers:", upper_bound, "\n")

# Identify outliers
outliers <- v[v < lower_bound | v > upper_bound]
cat("Outliers:", outliers, "\n")

output :
c. Plot the histogram, bar chart and pie chart on sample
data.

Histogram

# Load the necessary library


library(graphics) # For creating visualizations, like histograms (part of base R)

# Create a sample vector


v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Create the histogram


hist(
v, # Data to plot
xlab = "Weight", # Label for the x-axis
col = "blue", # Color for the bars
border = "green", # Color for the bar borders
main = "Histogram of Sample Data" # Title of the histogram
)

# Turn off the current graphics device (if needed)


dev.off()

Output:-
Bar chart

# Load the necessary library


library(graphics) # This is part of base R, used for plotting functions

# Sample data
H <- c(7, 12, 28, 3, 41) # Revenue data for each month
M <- c("Jan", "Feb", "Mar", "Apr", "May") # Month names

# Create the bar chart


barplot(
H, # Heights of the bars (Revenue)
names.arg = M, # Labels for each bar (Months)
xlab = "Month", # Label for the x-axis
ylab = "Revenue", # Label for the y-axis
col = "blue", # Color of the bars
main = "Revenue Chart", # Title of the chart
border = "red" # Color of the bar borders
)

# Close the current graphics device (not always necessary in RStudio)


dev.off()

output :

Pie Chart

# Load the necessary library


library(graphics) # This is part of base R, used for plotting functions

# Sample data
x <- c(21, 62, 10, 53) # Data values representing some quantities (e.g., population, sales, etc.)
labels <- c("London", "New York", "Singapore", "Mumbai") # Labels for each segment of the pie chart

# Create the Pie chart


pie(
x, # Values for the pie chart
labels = labels, # Labels for each segment
col = rainbow(length(x)), # Color the segments with a rainbow color palette
main = "City Distribution" # Title of the pie chart
)

# Turn off the graphics device (not always necessary in RStudio)


dev.off()
output :
5 - CORRELATION AND COVARIANCE

a. Find the correlation matrix and plot the correlation on the iris dataset

# Load necessary libraries


library(corrplot) # For plotting the correlation matrix
library(ggplot2) # For the iris dataset

# Load the iris dataset


data(iris)

# Select only numeric columns from the iris dataset (to avoid categorical 'Species')
iris_numeric <- iris[, 1:4] # Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

# Calculate the correlation matrix for the numeric columns


cor_matrix <- cor(iris_numeric)

# Print the correlation matrix


print(cor_matrix)

# Visualize the correlation matrix using corrplot


corrplot(cor_matrix, method = "square", type = "upper",
         col = colorRampPalette(c("blue", "white", "red"))(200),
         title = "Correlation Matrix for Iris Data", mar = c(0, 0, 1, 0))

b. Plot the correlation plot on the dataset and visualize it, giving an overview of relationships among the iris data.

SOURCE CODE:
# Load necessary libraries
library(corrplot) # For plotting the correlation matrix
library(ggplot2) # For the iris dataset

# Load the iris dataset


data(iris)

# Remove the Species column as it is categorical


iris_numeric <- iris[, 1:4] # Select only the numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)

# Calculate the correlation matrix for the numeric columns


cor_matrix <- cor(iris_numeric)

# Print the correlation matrix


print(cor_matrix)
# Visualize the correlation matrix using corrplot
corrplot(cor_matrix, method = "square", type = "upper",
col = colorRampPalette(c("blue", "white", "red"))(200),
title = "Correlation Matrix for Iris Data", mar = c(0, 0, 1, 0))

c. Analysis of variance (ANOVA), if the data have categorical variables, on the iris data.

To analyze the variance (ANOVA) on the Iris dataset using categorical variables, we need to look at
how categorical variables (such as Species) influence the continuous variables (like Sepal.Length
and Petal.Length). One of the common ways to assess this is by performing Analysis of Variance
(ANOVA), which helps determine if there are statistically significant differences between the means of
different groups (in this case, the different species of flowers).

The analysis includes:

1. Visualizing relationships using ggplot2.


2. Performing ANOVA to check for significant differences based on the categorical variable
Species.

Steps for ANOVA analysis on Iris dataset:

1. Visualize relationships between variables (Sepal.Length, Petal.Length) by species.


2. Perform ANOVA to check if there are significant differences in the means of the numeric
variables (Sepal.Length, Petal.Length) across the different species.

SOURCE CODE:
# Load necessary libraries
library(ggplot2)

# Load the iris dataset


data(iris)

# Structure of the iris dataset


str(iris)

# Scatter plot to visualize the relationship between Sepal.Length and Petal.Length by Species
ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) +
geom_point(size=2) +
geom_smooth(method="lm", aes(color=Species), se=FALSE) +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") +
ylab("Petal Length") +
theme(legend.position="bottom")
# Perform ANOVA to see if Sepal.Length differs by Species
anova_sepal <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_sepal)

# Perform ANOVA to see if Petal.Length differs by Species


anova_petal <- aov(Petal.Length ~ Species, data = iris)
summary(anova_petal)
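A significant ANOVA result only indicates that at least one species mean differs. As an optional extension of the listing above, Tukey's HSD test identifies which pairs of species differ:

# Post-hoc pairwise comparisons for the Sepal.Length ANOVA
TukeyHSD(anova_sepal)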
6 - REGRESSION MODEL

PROBLEM DEFINITION:
Import data from web storage. Name the dataset and perform logistic regression to find the relation between the variables that affect the admission of a student to an institute, based on his or her GRE score, GPA, and rank. Also check whether the model fits.
require(foreign), require(MASS)

Steps for Logistic Regression:


1. Load the Data: First, we need a dataset to work with. The original dataset from UCLA (http://www.ats.ucla.edu/stat/data/binary.csv) contains data on student admissions; if it is not accessible, a dataset with the same structure is simulated for illustration.
2. Preprocess the Data: The dataset is set up for binary classification (admitted = 1, not admitted = 0). Any similar dataset can be used, or one can be loaded locally for the exercise.
3. Logistic Regression Model: The model will predict the probability of a student being admitted
based on their GRE score, GPA, and rank. Logistic regression is ideal for binary classification tasks
like this.
4. Model Evaluation: After fitting the model, we will evaluate whether the model fits the data well by
inspecting the coefficients and summary statistics.

SOURCE CODE:

Solution using a Dummy Dataset:

The full solution uses a dummy dataset that replicates the structure of the student admission problem, and then fits a logistic regression model.

Step-by-Step Code:
1. Load and Prepare the Data:

We'll simulate a dataset for student admission.

# Load required libraries


require(foreign)
require(MASS)

# Simulate a dummy dataset

set.seed(123)  # For reproducibility
mydata <- data.frame(
  admit = sample(0:1, 100, replace = TRUE),  # Binary outcome: 0 = Not Admitted, 1 = Admitted
  gre   = rnorm(100, 310, 30),               # GRE score: normally distributed, mean = 310, SD = 30
  gpa   = rnorm(100, 3.0, 0.4),              # GPA: normally distributed, mean = 3.0, SD = 0.4
  rank  = sample(1:4, 100, replace = TRUE)   # Institution rank: 1 (best) to 4 (worst)
)

# Treat rank as a categorical predictor with 4 levels
mydata$rank <- factor(mydata$rank)

# Preview the dataset

head(mydata)

2. Logistic Regression Model:

We'll use glm (Generalized Linear Model) for logistic regression. The dependent variable is admit, and
the independent variables are gre, gpa, and rank.

# Logistic regression model


model <- glm(admit ~ gre + gpa + rank, data = mydata, family = binomial)

# Display the summary of the model


summary(model)

3. Model Interpretation:

● The summary(model) will provide the coefficients, standard errors, z-values, and p-values for
each predictor (gre, gpa, and rank).
● The coefficients represent the change in the log-odds of being admitted for a one-unit change in the
predictor variable, holding other variables constant.

For example, if the coefficient for gre is positive, it means that as GRE score increases, the probability of
admission increases, and similarly for other predictors.
4. Model Evaluation (Goodness-of-Fit):

We can check the goodness-of-fit for the model using the deviance and AIC (Akaike Information
Criterion).

# Deviance and AIC for the logistic regression model


deviance(model) # A measure of goodness-of-fit
AIC(model) # Akaike Information Criterion (lower AIC is better)

5. Predicted Probabilities and Classification:

Now we can use the predict() function to predict the probabilities of admission for each observation.

# Predicted probabilities
predicted_probs <- predict(model, type = "response")

# Predicted outcomes (classifications)


predicted_class <- ifelse(predicted_probs > 0.5, 1, 0)

# Compare predicted vs actual outcomes


table(predicted_class, mydata$admit)

This will give us a confusion matrix showing the number of correct and incorrect predictions.
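As a quick follow-up (not part of the original listing), overall accuracy can be read off the same vectors:

# Proportion of observations classified correctly
mean(predicted_class == mydata$admit)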

OUTPUT:
7 - MULTIPLE REGRESSION MODEL

PROBLEM DEFINITION:
Apply multiple regression, if the data have a continuous independent variable. Apply it to the above dataset.

This experiment applies multiple regression on a dataset where the independent variables are continuous. The dataset includes a binary outcome (admit: 0 = not admitted, 1 = admitted) and continuous variables such as GRE scores and GPA. We fit a logistic regression model to predict the probability of admission from these independent variables (GRE and GPA). Since the outcome is binary (admit = 0 or 1), logistic regression is the appropriate form of the model.

Where the dependent variable itself is continuous, ordinary multiple regression would be used to model the relationship between several continuous independent variables and the response. Logistic regression is the analogous model for binary outcomes, so the problem is treated here as a logistic regression problem (a sketch of an ordinary multiple regression on the same data appears at the end of this experiment).

SOURCE CODE:

Step 1: Creating the Dataset


# Create a dataset
mydata <- data.frame(
admit = c(0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0),
# Binary outcome (0 = not admitted, 1 = admitted)
gre = c(380, 660, 800, 640, 520, 520, 760, 400, 540, 700, 800, 450, 360, 720, 560,
710, 440, 680, 760, 500), # GRE scores
gpa = c(3.61, 3.67, 4.00, 3.19, 2.93, 2.75, 3.00, 3.08, 3.39, 3.92, 3.75, 2.88,
2.52, 3.68, 3.32, 3.94, 2.98, 3.50, 3.89, 3.15), # GPA scores
rank = c(3, 1, 1, 4, 4, 2, 1, 3, 2, 2, 1, 4, 4, 2, 3, 1, 3, 2, 1, 4)
# Rank of undergraduate institution (1 = best, 4 = worst)
)

Explanation:

● A dataset called mydata is created using data.frame().


○ admit: Binary variable (0 = not admitted, 1 = admitted).
○ gre: GRE scores (continuous variable).
○ gpa: GPA scores (continuous variable).
○ rank: The rank of the undergraduate institution (categorical variable: 1 is best, 4 is worst).

This dataset simulates 20 students and their admission results based on the given predictors.

Step 2: Preview the Dataset

# Preview the dataset


head(mydata)

Explanation:

● The head() function displays the first few rows of the dataset (mydata).
● This helps to inspect the structure and values in the dataset to ensure the data was loaded correctly.
Step 3: Convert 'rank' to a Factor

# Ensure 'rank' variable is a factor


mydata$rank <- factor(mydata$rank)

Explanation:

● The rank variable is a categorical variable (with integer values 1, 2, 3, and 4). In regression
modeling, it is important to convert categorical variables to factors so they can be treated correctly
in statistical models.
● The factor() function is used to convert the rank variable into a factor.

Step 4: Fit a Logistic Regression Model

# Fit a logistic regression model


mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")

Explanation:

● We use the glm() function to fit a logistic regression model.


○ admit: The dependent variable (binary outcome: 0 or 1).
○ gre, gpa, rank: These are the independent variables (predictors) in the model.
○ family = "binomial": This specifies that we are fitting a logistic regression model
for binary outcomes (since admit is binary).

This model estimates the probability of a student being admitted based on their GRE score, GPA, and rank
of their undergraduate institution.

Step 5: Display the Summary of the Model


# Display the summary of the model
summary(mylogit)

OUTPUT:
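As noted in the problem discussion, a minimal sketch of an ordinary multiple regression on the same data is shown below; treating gpa as the continuous response and gre and rank as predictors is an assumption made purely for illustration, since the original admit outcome is binary:

# Ordinary multiple linear regression with a continuous response (illustrative only)
linmod <- lm(gpa ~ gre + rank, data = mydata)
summary(linmod)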
8 - REGRESSION MODEL FOR PREDICTION

Linear Regression with Categorical Data (Region as a Predictor)

The goal is to analyze how the average SAT score (csat) varies across different geographical regions using a linear regression model. The dataset includes SAT scores for several states, along with the region each state belongs to. Since region is a categorical variable, it is treated accordingly in the regression model.

Step 1: Create a Fictional Dataset

# Create a fictional dataset


states.data <- data.frame(
  state = c("California", "New York", "Texas", "Florida", "Illinois",
            "Ohio", "Georgia", "Pennsylvania", "Arizona", "Michigan"),
  csat = c(1010, 1100, 980, 1020, 1050, 990, 970, 1080, 950, 1005),  # Example SAT scores
  region = c("West", "N. East", "South", "South", "Midwest",
             "Midwest", "South", "N. East", "West", "Midwest")       # Geographic regions
)

● A dataset called states.data is created using data.frame().


● The dataset contains the following columns:
○ state: The name of the state.
○ csat: The average SAT score (a continuous variable).
○ region: The geographic region to which each state belongs (a categorical variable, which
will be treated as a factor).

This dataset represents the SAT scores of 10 states from different regions of the United States.

Step 2: Ensure the region Column is a Factor

# Ensure the 'region' column is a factor


states.data$region <- factor(states.data$region)

● The region variable is categorical (non-numeric), so it is important to convert it to a factor using


factor().
● This ensures that R treats region as a categorical variable in the regression model, and it will be
used to create dummy variables during the regression process.

Step 3: Inspect the Structure of the Dataset

# Inspect the structure of the dataset


str(states.data)
● The str() function provides a compact view of the dataset's structure, showing the types of each
variable (e.g., whether they are numeric, factor, etc.), the first few values of each column, and the
number of observations.
● This helps us confirm that the data has been correctly imported and the region variable is now a
factor.

Step 4: Fit a Linear Regression Model

# Fit a linear regression model with 'region' as a predictor


sat.region <- lm(csat ~ region, data = states.data)

● We use the lm() function to fit a linear regression model where csat (the SAT scores) is the
dependent variable, and region (the geographical region) is the independent variable.
● The formula csat ~ region specifies that csat is modeled as a function of region. This
tells R to predict SAT scores based on the region in which the state is located.
● The lm() function automatically treats region as a categorical predictor (factor) and creates
dummy variables for each level of the factor (e.g., for West, N. East, South, and Midwest).

Step 5: Show the Regression Coefficients Table

# Show the regression coefficients table


coef(summary(sat.region))

Output:
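As an optional check (not part of the original listing), an overall F-test shows whether region as a whole explains a significant share of the variance in csat:

# Overall F-test for the region factor
anova(sat.region)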
9 - CLASSIFICATION MODEL

a. Install relevant package for classification:


To begin a classification task in R, we need to install and load the necessary libraries that provide the tools
for classification, decision trees, and model evaluation. These libraries help in building classification
models like regression trees and support advanced visualization tools.

b. Choose classifier for classification problem and evaluate the performance of the classifier:
In this problem, we will use a decision tree classifier to predict the Salary of baseball players based on
their performance (measured in Hits and Years of experience). The steps include preparing the data,
training the decision tree model, performing cross-validation, pruning the tree to improve performance, and
finally evaluating the performance of the model using Mean Squared Error (MSE).

1. Install and Load Relevant Packages:

We need several libraries:

● rpart.plot: For plotting regression and classification trees.


● tree: For decision tree models.
● ISLR: Contains datasets like Hitters for analysis.
● rattle: Provides tools for data mining, including model visualization.

# Install the required packages (if not already installed)


install.packages("rpart.plot")
install.packages("tree")
install.packages("ISLR")
install.packages("rattle")

# Load the libraries


library(tree) # For decision tree functions
library(ISLR)        # For datasets like 'Hitters'
library(rpart.plot) # For plotting trees
library(rattle) # For advanced visualization and exploration

2. Load and Prepare the Data:

We use the Hitters dataset, which contains various variables like Salary, Hits, and Years. The
Salary variable is continuous, and we will apply a regression tree model to predict it.

# Load the Hitters dataset


data(Hitters)

# Open the dataset in the data viewer

View(Hitters)

# Remove rows with missing values in 'Salary' or other columns


Hitters <- na.omit(Hitters)

# Log-transform the Salary variable to make it more normally distributed


Hitters$Salary <- log(Hitters$Salary)

# Visualize the distribution of the Salary variable after transformation


hist(Hitters$Salary, main="Log-transformed Salary", xlab="Log of Salary")

output:

Explanation:

● na.omit(Hitters): Removes any rows with missing data.


● log(Hitters$Salary): The salary variable is log-transformed to make it more normally
distributed, which is often beneficial for regression models.
● hist(Hitters$Salary): This shows the distribution of the log-transformed Salary
variable.

3. Fit a Decision Tree Model:

We create a regression tree model using Hits and Years as predictors to predict the Salary.

# Fit a decision tree model with 'Hits' and 'Years' as predictors


tree.fit <- tree(Salary ~ Hits + Years, data = Hitters)

# Show a summary of the fitted tree model


summary(tree.fit)

# Plot the tree


plot(tree.fit, uniform=TRUE, margin=0.2)
text(tree.fit, use.n=TRUE, all=TRUE, cex=.8)

output:

Explanation:

● tree(Salary ~ Hits + Years, data = Hitters): Fits a regression tree model to


predict Salary based on Hits and Years.
● summary(tree.fit): Displays the tree structure, including the number of terminal nodes and
residual mean deviance.
● plot(tree.fit): Visualizes the structure of the tree.
● text(tree.fit): Adds node labels with additional information (like the number of
observations in each node).

4. Split the Data into Training and Test Sets:

We will split the dataset into a training set (for model fitting) and a test set (for model evaluation).

# Split the data into training and testing sets (50% each)
library(caret)  # provides createDataPartition()
split <- createDataPartition(y = Hitters$Salary, p = 0.5, list = FALSE)

# Training set
train <- Hitters[split, ]

# Testing set
test <- Hitters[-split, ]

output:

Explanation:

● createDataPartition() (from the caret package): Splits the dataset into training and testing sets. Here, 50% of the data is used for training and the remaining 50% for testing.

5. Fit the Tree Model on the Training Data:

We fit the regression tree model on the training data using all available predictors (Hits, Years, etc.).

# Create the regression tree model on the training set


trees <- tree(Salary ~ ., train)

# Plot the tree


plot(trees)
text(trees, pretty = 0)

output:

Explanation:

● tree(Salary ~ ., train): Fits a regression tree model using all available predictors
(denoted by the .) on the training set.
● plot(trees) and text(trees): Visualize the fitted tree structure.

6. Cross-Validation to Prune the Tree:

Pruning helps improve the performance of the decision tree by removing unnecessary splits and avoiding
overfitting.

# Perform cross-validation to determine the best size of the tree


cv.trees <- cv.tree(trees)

# Plot the cross-validation results to determine the best tree size


plot(cv.trees)

# Prune the tree to the optimal size (e.g., 4 terminal nodes)


prune.trees <- prune.tree(trees, best = 4)

# Plot the pruned tree


plot(prune.trees)
text(prune.trees, pretty = 0)

output:
Explanation:

● cv.tree(trees): Performs cross-validation to assess the performance of trees with different


sizes.
● plot(cv.trees): Visualizes the cross-validation results. The x-axis represents the number of
terminal nodes, and the y-axis represents the deviance (error).
● prune.tree(trees, best = 4): Prunes the tree to the optimal size based on cross-
validation results (in this case, 4 terminal nodes).
● plot(prune.trees) and text(prune.trees): Visualizes the pruned tree.

7. Make Predictions and Evaluate Performance:

After pruning the tree, we predict the Salary for the test data and evaluate the model's performance by
calculating the Mean Squared Error (MSE).

# Make predictions using the pruned tree on the test data


yhat <- predict(prune.trees, test)

# Plot the predicted vs. actual values


plot(yhat, test$Salary)
abline(0, 1) # Add a reference line for perfect prediction

# Calculate Mean Squared Error (MSE) for model evaluation


mean((yhat - test$Salary)^2)

output:

Explanation:

● predict(prune.trees, test): Uses the pruned tree to predict the Salary for the test set.
● plot(yhat, test$Salary): Plots the predicted values (yhat) against the actual Salary
values in the test set. A line with slope 1 (abline(0, 1)) is added to indicate perfect
predictions.
● mean((yhat - test$Salary)^2): Computes the Mean Squared Error (MSE) to evaluate
the model's accuracy. The lower the MSE, the better the model fits the data.

10 - CLUSTERING MODEL

PROBLEM DEFINITION:
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data using R visualizations.

Clustering is an unsupervised learning technique used to group similar data points into clusters based on
certain characteristics or features. Unlike supervised learning, clustering does not rely on labeled data and
aims to identify the inherent structure of the data. It's often used in exploratory data analysis to find
patterns or groupings in data.

In a clustering model, data points within the same group (or cluster) are more similar to each other than to
those in other groups. These models are particularly useful in tasks like customer segmentation, anomaly
detection, and pattern recognition.

Types of Clustering Algorithms:


There are several types of clustering algorithms, but the most commonly used ones include:

1. K-Means Clustering:
○ It is one of the most popular clustering algorithms.
○ K-means is based on partitioning the data into k clusters, where each cluster is defined by
its centroid (the mean of all the points in the cluster).
○ The algorithm iterates between assigning data points to the nearest centroid and updating the
centroids based on the points assigned to them, until convergence.
2. Hierarchical Clustering:
○ Hierarchical clustering creates a tree-like structure of clusters called a dendrogram.
○ The data points are recursively merged or split into clusters based on similarity, either using
an agglomerative (bottom-up) or divisive (top-down) approach.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
○ DBSCAN clusters points that are closely packed together while marking points in low-density regions as noise.
○ Unlike K-means, DBSCAN does not require specifying the number of clusters beforehand (a minimal sketch follows this list).
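DBSCAN itself is not demonstrated in the experiments below; for completeness, a minimal sketch is shown here, assuming the add-on dbscan package is available (the eps and minPts values are illustrative):

# install.packages("dbscan")   # if not already installed
library(dbscan)
db <- dbscan(iris[, 3:4], eps = 0.4, minPts = 5)   # cluster the petal measurements
db$cluster                                         # cluster labels; 0 marks points treated as noise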

K-Means Clustering Algorithm:

In this example, we will use the K-Means algorithm for clustering. The K-means algorithm requires
specifying the number of clusters, k, in advance. It tries to minimize the sum of squared distances between
the points in the cluster and their centroid.

SOURCE CODE:
1. Clustering algorithms for unsupervised classification.

The K-Means clustering algorithm is used to group the iris dataset based on two of its features: Petal.Length and Petal.Width (columns 3 and 4 of the iris dataset). Here is a breakdown of the code:

# Load the necessary library for clustering


library(cluster)
# Set the seed for reproducibility
set.seed(20)

# Apply K-means clustering on Petal.Length and Petal.Width (columns 3 and 4) of the iris dataset
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)

# Output the clustering results


irisCluster

2. K-means on the petal features, with cluster assignments added to the data and visualized with ggplot2:

library(cluster)

# Set seed for reproducibility


set.seed(20)

# Perform K-Means clustering on Petal.Length and Petal.Width with 3 clusters


irisCluster <- kmeans(iris[, 3:4], centers = 3, nstart = 20)

# Print the clustering result


print(irisCluster)

# Add cluster assignments to the original iris dataset


iris$Cluster <- as.factor(irisCluster$cluster)

# Load ggplot2 for visualization


library(ggplot2)

# Visualize the clusters using a scatter plot


ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Cluster)) +
geom_point(size = 3) +
labs(
title = "K-Means Clustering of Iris Dataset (Petal Features)",
x = "Petal Length",
y = "Petal Width"
)+
theme_minimal()

OUTPUT:

Cluster means:
  Petal.Length Petal.Width
1     1.462000    0.246000
2     4.269231    1.342308
3     5.595833    2.037500

Clustering vector:

[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[42] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2
[83] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3
[124] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3

Within cluster sum of squares by cluster:

[1] 2.02200 13.05769 16.29167


(between_SS / total_SS = 94.3 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss"

[6] "betweenss" "size" "iter" "ifault"

OUTPUT:

SOURCE CODE:

# Load dataset
data(mtcars)

# Step 1: Compute the distance matrix


d <- dist(as.matrix(mtcars)) # Calculates the Euclidean distance between rows of the mtcars dataset

# Step 2: Perform hierarchical clustering


hc <- hclust(d) # Applies hierarchical clustering on the distance matrix

# Step 3: Plot the dendrogram


plot(hc, main = "Hierarchical Clustering Dendrogram", xlab = "Car Models", sub = "", cex = 0.8)
OUTPUT:
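To turn the dendrogram into concrete group assignments, the tree can be cut at a chosen number of clusters (the value k = 3 below is illustrative):

# Cut the dendrogram into 3 groups and count the cars in each
groups <- cutree(hc, k = 3)
table(groups)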

3. Plot the cluster data using R visualizations.

To plot cluster data using R visualizations, the pam (Partitioning Around Medoids) clustering method and the clusplot() function from the cluster package are used. The complete code to generate the clusters and visualize them with clusplot() follows:

Complete Source Code


Step 1: Install and Load Necessary Packages

# Install the 'cluster' package if not already installed


if (!require("cluster")) install.packages("cluster")

# Load the 'cluster' package


library(cluster)

Step 2: Generate and Plot Cluster Data (Without Noise)

# Generate 25 objects divided into 2 clusters


set.seed(123) # Set seed for reproducibility
x <- rbind(
cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5))
)

# Perform clustering using Partitioning Around Medoids (PAM)


pam_result <- pam(x, 2)
# Visualize the clusters
clusplot(pam_result, main = "Cluster Plot (Without Noise)",
         color = TRUE, shade = TRUE, labels = 2, lines = 0)

Step 3: Add Noise and Re-Cluster the Data

# Add noise to the dataset


x4 <- cbind(x, rnorm(25), rnorm(25))

# Perform clustering again with the noisy dataset


pam_result_with_noise <- pam(x4, 2)

# Visualize the clusters with noise


clusplot(pam_result_with_noise, main = "Cluster Plot (With Noise)",
color = TRUE, shade = TRUE, labels = 2, lines = 0)

Explanation

1. Cluster Data Generation:


○ Two clusters are generated with rnorm() around (0, 0) and (5, 5) with a small
standard deviation of 0.5.
2. Adding Noise:
○ Extra random dimensions are added using rnorm(25) to simulate noise in the data.
3. Clustering Method:
○ PAM (Partitioning Around Medoids) is used, which is robust to noise and selects
medoids as cluster centers.
4. Visualization:
○ clusplot() creates a 2D representation of the clusters.
○ Arguments:
■ color = TRUE: Adds colors to the clusters.
■ shade = TRUE: Adds shading for better visualization.
■ labels = 2: Labels both points and clusters.
■ lines = 0: Disables connecting lines between clusters.

OUTPUT:

OUTPUT:
