DSR LAB MANUAL - 10 Programs
1 R AS CALCULATOR APPLICATION
a. Using with and without R objects on console
b. Using mathematical functions on console
c. Write an R script, to create R objects for calculator application and save in a specified
location in disk
2 DESCRIPTIVE STATISTICS IN R
a. Write an R script to find basic descriptive statistics using the summary, str and quantile
functions on the mtcars dataset.
b. Write an R script to find a subset of a dataset by using subset()
3 READING AND WRITING DIFFERENT TYPES OF DATASETS
a. Reading different types of data sets (.txt, .csv) from web and disk and writing them to a file in a
specific disk location.
b. Reading Excel data sheet in R.
c. Reading XML dataset in R.
4 VISUALIZATIONS
a. Find the data distributions using box and scatter plot.
b. Find the outliers using plot.
c. Plot the histogram, bar chart and pie chart on sample data.
5 CORRELATION AND ANOVA
a. Find the correlation matrix on the iris dataset.
b. Perform analysis of variance (ANOVA) on the iris dataset.
6 REGRESSION MODEL
Import data from web storage. Name the dataset and now do logistic regression to find out the
relation between variables that are affecting the admission of a student in an institute, based on his
or her GRE score, GPA obtained and the rank of the student. Also check whether the model fits or not.
require(foreign), require(MASS)
7 MULTIPLE REGRESSION MODEL
Apply multiple regression, if data have continuous independent variables. Apply on the above
dataset.
8 REGRESSION MODEL FOR PREDICTION
Apply regression model techniques to predict the data on the above dataset.
9 CLASSIFICATION MODEL
a. Install relevant package for classification.
b. Choose classifier for classification problem.
c. Evaluate the performance of classifier.
10 CLUSTERING MODEL
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data using R visualizations.
Reference Books:
Yanchang Zhao, “R and Data Mining: Examples and Case Studies”, Elsevier, 1st Edition, 2012
Web References:
1. http://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
2. http://www.ats.ucla.edu/stat/r/dae/rreg.htm
3. http://www.coastal.edu/kingw/statistics/R-tutorials/logistic.html
4. http://www.ats.ucla.edu/stat/r/data/binary.csv
SOFTWARE AND HARDWARE REQUIREMENTS FOR 18 STUDENTS:
SOFTWARE: R, RStudio
HARDWARE: 18 Intel desktop computers with 4 GB RAM each
1 - R AS CALCULATOR APPLICATION
a. Using with and without R objects on console:
> 2587 + 2149
Output:- [1] 4736
> 287954 - 135479
Output:- [1] 152475
> 257 * 52
Output:- [1] 13364
> 257 / 21
Output:- [1] 12.2381
> A <- 1000
> B <- 2000
> c <- A + B
> c
Output:- [1] 3000
> a <- 100
> class(a)
Output:- [1] "numeric"
> b <- 500
> c <- a - b
> class(b)
Output:- [1] "numeric"
> sum <- a - b
> sum
Output:- [1] -400
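b. Using mathematical functions on console (a minimal sketch of common built-in functions):
> sqrt(144)
[1] 12
> abs(-7.5)
[1] 7.5
> log(100, base = 10)
[1] 2
> exp(1)
[1] 2.718282
> round(257/21, 2)
[1] 12.24
c. R script to create R objects for the calculator application and save them to a specified location on disk (the file path below is an assumption; adjust it to your system):
# calculator.R - create R objects for a calculator application
x <- 2587
y <- 2149
addition <- x + y
subtraction <- x - y
multiplication <- x * y
division <- x / y
# Save the objects to a chosen location on disk
save(addition, subtraction, multiplication, division, file = "D:/RLab/calculator.RData")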
2 - DESCRIPTIVE STATISTICS IN R
a. Write an R script to find basic descriptive statistics using the summary, str and quantile functions on the
mtcars and cars datasets.
Explanation of Datasets: mtcars records fuel consumption and ten aspects of design and performance for 32 cars (from the 1974 Motor Trend US magazine); cars records the speed and stopping distance of 50 cars.
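A minimal sketch of the part (a) script:
# Load the built-in datasets
data(mtcars)
data(cars)
# Structure of each dataset
str(mtcars)
str(cars)
# Basic descriptive statistics
summary(mtcars)
summary(cars)
# Quantiles of selected columns
quantile(mtcars$mpg)
quantile(cars$speed, probs = c(0.25, 0.50, 0.75))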
subset() function:
The subset() function in R is used to filter rows and select specific columns from a dataset. Here is how
you can use it with the iris dataset; a sketch of each call follows the list:
Extract rows where Sepal.Length is greater than 5.0 and Species is "setosa":
Extract rows where Sepal.Width is less than 3.0, but only display Sepal.Length and
Sepal.Width:
Extract rows where Sepal.Length equals 5.0 and only include the Sepal.Length and Species
columns:
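A minimal sketch of the three subset() calls described above:
# 1. Rows where Sepal.Length > 5.0 and Species is "setosa"
subset(iris, Sepal.Length > 5.0 & Species == "setosa")
# 2. Rows where Sepal.Width < 3.0, keeping only Sepal.Length and Sepal.Width
subset(iris, Sepal.Width < 3.0, select = c(Sepal.Length, Sepal.Width))
# 3. Rows where Sepal.Length equals 5.0, keeping only Sepal.Length and Species
subset(iris, Sepal.Length == 5.0, select = c(Sepal.Length, Species))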
aggregate() function:
The aggregate() function computes summary statistics for subsets of a dataset; a sketch of the calls follows the list:
● Groups the dataset by Species and computes the mean for all numeric columns
(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width).
● Groups the dataset by Species and computes the sum for all numeric columns.
● Groups the dataset by Species and returns the maximum value for each numeric column.
● Groups the dataset by Species and returns the minimum value for each numeric column.
● Groups the dataset by Species and returns the median value for each numeric column.
● Groups the dataset by Species and calculates the standard deviation for each numeric column.
● Groups the dataset by Species and calculates the mean for the Sepal.Length column only.
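A minimal sketch of the aggregate() calls described above:
# Mean of all numeric columns, grouped by Species
aggregate(. ~ Species, data = iris, FUN = mean)
# Sum, maximum, minimum and median, grouped by Species
aggregate(. ~ Species, data = iris, FUN = sum)
aggregate(. ~ Species, data = iris, FUN = max)
aggregate(. ~ Species, data = iris, FUN = min)
aggregate(. ~ Species, data = iris, FUN = median)
# Standard deviation of each numeric column, grouped by Species
aggregate(. ~ Species, data = iris, FUN = sd)
# Mean of Sepal.Length only, grouped by Species
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)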
3 - READING AND WRITING DIFFERENT TYPES OF DATASETS
a. Reading different types of data sets (.txt, .csv) from web and disk and writing them to a file in a specific
disk location.
library(utils)
data <- read.csv("input.csv")
data
Output :-
data <- read.csv("input.csv")
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
Output:-
data <- read.csv("input.csv")
retval <- subset(data, dept == "IT")
retval
Output:-
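Part (a) also asks for reading a dataset from the web and writing it to a specific disk location; a minimal sketch (the URL is web reference 4 from this manual, and the output path is an assumption):
# Read a CSV file directly from the web
webdata <- read.csv("http://www.ats.ucla.edu/stat/r/data/binary.csv")
head(webdata)
# Write the filtered records to a specific location on disk
write.csv(retval, "D:/RLab/output.csv", row.names = FALSE)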
install.packages("xlsx")
library("xlsx")
data<- read.xlsx("input.xlsx", sheetIndex = 1)
data
Output:-
install.packages("XML")
library("XML")
library("methods")
result<- xmlParse(file = "input.xml")
result
Output:-
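To work with the parsed XML as a data frame, the XML package also provides xmlToDataFrame(); a sketch, assuming input.xml holds a flat list of records:
# Convert the XML records into a data frame
xmldf <- xmlToDataFrame("input.xml")
xmldf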
4 – VISUALIZATIONS
a. Find the data distributions using box and scatter plot.
install.packages("ggplot2")
library(ggplot2)
input <- mtcars[, c('mpg', 'cyl')]
input
# Box plot of mileage grouped by number of cylinders
boxplot(mpg ~ cyl, data = mtcars, xlab = "number of cylinders",
ylab = "miles per gallon", main = "mileage data")
dev.off()
Output :-
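Part (a) also asks for a scatter plot; a minimal sketch using two mtcars columns (wt is an assumed choice for the second variable):
# Scatter plot of weight vs. mileage
plot(mtcars$wt, mtcars$mpg, xlab = "weight (1000 lbs)",
ylab = "miles per gallon", main = "weight vs mileage")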
b. Find the outliers using plot.
# Sample data (hypothetical values that contain outliers)
v <- c(12, 15, 11, 8, 103, 14, 9, 10, 115, 13)
# Bounds from the 1.5 * IQR rule
lower_bound <- quantile(v, 0.25) - 1.5 * IQR(v)
upper_bound <- quantile(v, 0.75) + 1.5 * IQR(v)
# Identify outliers
outliers <- v[v < lower_bound | v > upper_bound]
cat("Outliers:", outliers, "\n")
Output:-
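Since this part asks to find the outliers using a plot, a box plot shows them directly (a sketch using the same vector v):
# Outliers appear as individual points beyond the whiskers
boxplot(v, main = "Outlier detection with a box plot")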
c. Plot the histogram, bar chart and pie chart on sample
data.
Histogram
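No histogram code survives here; a minimal sketch on hypothetical sample data:
# Sample data
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
# Histogram of the values
hist(v, xlab = "value", col = "green", border = "black", main = "Histogram of sample data")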
Output:-
Bar chart
# Sample data
H <- c(7, 12, 28, 3, 41) # Revenue data for each month
M <- c("Jan", "Feb", "Mar", "Apr", "May") # Month names
Output:-
Pie Chart
# Sample data
x <- c(21, 62, 10, 53) # Data values representing some quantities (e.g., population, sales, etc.)
labels <- c("London", "New York", "Singapore", "Mumbai") # Labels for each segment of the pie chart
5 - CORRELATION AND ANOVA
a. Find the correlation matrix on the iris dataset.
SOURCE CODE:
# Load necessary libraries
library(corrplot) # For plotting the correlation matrix
library(ggplot2) # For later plots (the iris dataset itself is built into R)
# Select only numeric columns from the iris dataset (to avoid the categorical Species column)
iris_numeric <- iris[, 1:4] # Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
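The code that computes and plots the correlation matrix is missing; a minimal sketch continuing from the lines above:
# Compute the correlation matrix of the numeric columns
cor_matrix <- cor(iris_numeric)
cor_matrix
# Visualize the correlation matrix
corrplot(cor_matrix, method = "circle")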
b. Analysis of variance (ANOVA) on the iris dataset:
To analyze the variance (ANOVA) on the iris dataset using categorical variables, we need to look at
how categorical variables (such as Species) influence the continuous variables (like Sepal.Length
and Petal.Length). One of the common ways to assess this is by performing Analysis of Variance
(ANOVA), which helps determine if there are statistically significant differences between the means of
different groups (in this case, the different species of flowers).
SOURCE CODE:
# Load necessary libraries
library(ggplot2)
# Scatter plot to visualize the relationship between Sepal.Length and Petal.Length by Species
ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) +
geom_point(size=2) +
geom_smooth(method="lm", aes(color=Species), se=FALSE) +
ggtitle("Sepal Length vs Petal Length") +
xlab("Sepal Length") +
ylab("Petal Length") +
theme(legend.position="bottom")
# Perform ANOVA to see if Sepal.Length differs by Species
anova_sepal <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_sepal)
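The same test can be repeated for Petal.Length, which the introduction above also mentions (a sketch):
# Perform ANOVA to see if Petal.Length differs by Species
anova_petal <- aov(Petal.Length ~ Species, data = iris)
summary(anova_petal)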
6 - REGRESSION MODEL: Import data from web storage. Name the dataset and now do logistic
regression to find out the relation between variables that are affecting the admission of a student in an
institute, based on his or her GRE score, GPA obtained and the rank of the student. Also check whether the
model fits or not. require(foreign), require(MASS)
SOURCE CODE:
Step-by-Step Code:
1. Load and Prepare the Data:
We'll use glm (Generalized Linear Model) for logistic regression. The dependent variable is admit, and
the independent variables are gre, gpa, and rank.
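A minimal sketch of steps 1 and 2 (loading the data and fitting the model); the dataset URL is taken from web reference 4 at the top of this manual:
# Step 1: Load the data from web storage
mydata <- read.csv("http://www.ats.ucla.edu/stat/r/data/binary.csv")
head(mydata)
# rank is categorical, so convert it to a factor
mydata$rank <- factor(mydata$rank)
# Step 2: Fit the logistic regression model
model <- glm(admit ~ gre + gpa + rank, data = mydata, family = binomial(link = "logit"))
summary(model)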
3. Model Interpretation:
● The summary(model) will provide the coefficients, standard errors, z-values, and p-values for
each predictor (gre, gpa, and rank).
● The coefficients represent the change in the log-odds of being admitted for a one-unit change in the
predictor variable, holding other variables constant.
For example, if the coefficient for gre is positive, it means that as GRE score increases, the probability of
admission increases, and similarly for other predictors.
4. Model Evaluation (Goodness-of-Fit):
We can check the goodness-of-fit for the model using the deviance and AIC (Akaike Information
Criterion).
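A sketch of the fit checks (summary(model) also prints these values):
# Residual deviance and AIC of the fitted model
model$deviance
model$aic
# Likelihood-ratio test of each term against the null model
anova(model, test = "Chisq")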
Now we can use the predict() function to predict the probabilities of admission for each observation.
# Predicted probabilities
predicted_probs <- predict(model, type = "response")
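A sketch of the confusion matrix computation (the 0.5 cutoff is an assumed threshold):
# Convert probabilities to predicted classes
predicted_class <- ifelse(predicted_probs > 0.5, 1, 0)
# Compare predictions with the actual admit values
table(Predicted = predicted_class, Actual = mydata$admit)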
This will give us a confusion matrix showing the number of correct and incorrect predictions.
OUTPUT:
7 - MULTIPLE REGRESSION MODEL
PROBLEM DEFINITION:
Apply multiple regression, if data have continuous independent variables. Apply on the above dataset.
We apply multiple regression on a dataset where the independent variables are continuous. The provided dataset
includes binary outcomes (admit: 0 = not admitted, 1 = admitted) and continuous variables such as GRE
scores and GPA. We will fit a logistic regression model to predict the probability of admission based on
these independent variables (GRE and GPA). Since we are dealing with binary outcomes (admit = 0 or 1),
logistic regression is appropriate for this problem.
However, in the case where the independent variables are continuous, we would use multiple regression
to model the relationship between multiple continuous independent variables and the dependent variable.
Logistic regression is a specific case of multiple regression for binary outcomes, and in this case, we treat
the problem as a logistic regression problem.
SOURCE CODE:
Explanation:
This dataset simulates 20 students and their admission results based on the given predictors.
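The simulation code itself is missing; a minimal sketch (all values and distribution parameters are hypothetical):
# Step 1: Simulate admission data for 20 students
set.seed(1)
mydata <- data.frame(
gre = round(rnorm(20, mean = 580, sd = 80)),
gpa = round(runif(20, min = 2.5, max = 4.0), 2),
rank = sample(1:4, 20, replace = TRUE),
admit = sample(0:1, 20, replace = TRUE)
)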
Explanation:
● The head() function displays the first few rows of the dataset (mydata).
● This helps to inspect the structure and values in the dataset to ensure the data was loaded correctly.
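A one-line sketch of this step:
# Step 2: Inspect the first rows of the dataset
head(mydata)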
Step 3: Convert 'rank' to a Factor
Explanation:
● The rank variable is a categorical variable (with integer values 1, 2, 3, and 4). In regression
modeling, it is important to convert categorical variables to factors so they can be treated correctly
in statistical models.
● The factor() function is used to convert the rank variable into a factor.
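A sketch of the conversion:
# Step 3: Treat rank as a categorical predictor
mydata$rank <- factor(mydata$rank)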
Explanation:
This model estimates the probability of a student being admitted based on their GRE score, GPA, and rank
of their undergraduate institution.
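The model-fitting code is missing; a minimal sketch:
# Step 4: Fit the logistic regression model
model <- glm(admit ~ gre + gpa + rank, data = mydata, family = binomial)
summary(model)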
OUTPUT:
8 - REGRESSION MODEL FOR PREDICTION
The goal is to analyze how the average SAT score (csat) varies across different geographical regions using a linear
regression model. The dataset includes the SAT scores for different states, along with the region each state
belongs to. Since the region is a categorical variable, this will be treated appropriately in the regression
model.
This dataset represents the SAT scores of 10 states from different regions of the United States.
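The dataset construction is missing; a minimal sketch (the state names, regions and csat values below are hypothetical stand-ins; the region levels match those described later):
# SAT scores for 10 states and their regions
states <- data.frame(
state = c("California", "New York", "Texas", "Florida", "Ohio",
"Oregon", "Georgia", "Illinois", "Maine", "Kansas"),
region = c("West", "N. East", "South", "South", "Midwest",
"West", "South", "Midwest", "N. East", "Midwest"),
csat = c(897, 892, 893, 889, 975, 947, 854, 945, 896, 1060)
)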
● We use the lm() function to fit a linear regression model where csat (the SAT scores) is the
dependent variable, and region (the geographical region) is the independent variable.
● The formula csat ~ region specifies that csat is modeled as a function of region. This
tells R to predict SAT scores based on the region in which the state is located.
● The lm() function automatically treats region as a categorical predictor (factor) and creates
dummy variables for each level of the factor (e.g., for West, N. East, South, and Midwest).
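A minimal sketch of the model fit described above:
# Fit the linear regression model of csat on region
model <- lm(csat ~ region, data = states)
summary(model)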
Output:
9 - CLASSIFICATION MODEL
b, c. Choose a classifier for the classification problem and evaluate its performance:
In this problem, we will use a decision tree classifier to predict the Salary of baseball players based on
their performance (measured in Hits and Years of experience). The steps include preparing the data,
training the decision tree model, performing cross-validation, pruning the tree to improve performance, and
finally evaluating the performance of the model using Mean Squared Error (MSE).
We use the Hitters dataset, which contains various variables like Salary, Hits, and Years. The
Salary variable is continuous, and we will apply a regression tree model to predict it.
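The data preparation code is missing; a minimal sketch (the Hitters dataset ships with the ISLR package, which also covers part a, installing the relevant packages):
# Part a: install and load the relevant packages
install.packages(c("ISLR", "tree", "caret"))
library(ISLR) # provides the Hitters dataset
library(tree) # regression/classification trees
library(caret) # createDataPartition() for splitting
# Remove rows with missing values (Salary has NAs)
Hitters <- na.omit(Hitters)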
output:
Explanation:
We create a regression tree model using Hits and Years as predictors to predict the Salary.
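A sketch of the two-predictor tree:
# Regression tree with Hits and Years as predictors
trees <- tree(Salary ~ Hits + Years, data = Hitters)
plot(trees)
text(trees)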
output:
Explanation:
We will split the dataset into a training set (for model fitting) and a test set (for model evaluation).
# Split the data into training and testing sets (50% each)
set.seed(123) # assumed seed, for a reproducible split
split <- createDataPartition(y = Hitters$Salary, p = 0.5, list = FALSE)
# Training set
train <- Hitters[split, ]
# Testing set
test <- Hitters[-split, ]
output:
Explanation:
● createDataPartition(): Splits the dataset into training and testing sets. Here, we use 50%
of the data for training and the remaining 50% for testing.
We fit the regression tree model on the training data using all available predictors (Hits, Years, etc.).
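A sketch of the fit on the training set (these calls match the explanation below):
# Regression tree using all available predictors
trees <- tree(Salary ~ ., train)
plot(trees)
text(trees)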
output:
Explanation:
● tree(Salary ~ ., train): Fits a regression tree model using all available predictors
(denoted by the .) on the training set.
● plot(trees) and text(trees): Visualize the fitted tree structure.
Pruning helps improve the performance of the decision tree by removing unnecessary splits and avoiding
overfitting.
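The pruning code is missing; a minimal sketch (the choice of 4 terminal nodes is an assumption; in practice the cross-validation plot guides it):
# Cross-validate to find a good tree size
cv.trees <- cv.tree(trees)
plot(cv.trees)
# Prune the tree to the chosen number of terminal nodes
prune.trees <- prune.tree(trees, best = 4)
plot(prune.trees)
text(prune.trees)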
output:
Explanation:
After pruning the tree, we predict the Salary for the test data and evaluate the model's performance by
calculating the Mean Squared Error (MSE).
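A sketch of the prediction and MSE computation (these calls match the explanation below):
# Predict Salary on the test set with the pruned tree
yhat <- predict(prune.trees, test)
plot(yhat, test$Salary)
abline(0, 1) # line of perfect prediction
# Mean Squared Error
mean((yhat - test$Salary)^2)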
output:
Explanation:
● predict(prune.trees, test): Uses the pruned tree to predict the Salary for the test set.
● plot(yhat, test$Salary): Plots the predicted values (yhat) against the actual Salary
values in the test set. A line with slope 1 (abline(0, 1)) is added to indicate perfect
predictions.
● mean((yhat - test$Salary)^2): Computes the Mean Squared Error (MSE) to evaluate
the model's accuracy. The lower the MSE, the better the model fits the data.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
10 - CLUSTERING MODEL
PROBLEM DEFINITION:
a, b. Clustering algorithms for unsupervised classification; plot the cluster data using R visualizations.
Clustering is an unsupervised learning technique used to group similar data points into clusters based on
certain characteristics or features. Unlike supervised learning, clustering does not rely on labeled data and
aims to identify the inherent structure of the data. It's often used in exploratory data analysis to find
patterns or groupings in data.
In a clustering model, data points within the same group (or cluster) are more similar to each other than to
those in other groups. These models are particularly useful in tasks like customer segmentation, anomaly
detection, and pattern recognition.
1. K-Means Clustering:
○ It is one of the most popular clustering algorithms.
○ K-means is based on partitioning the data into k clusters, where each cluster is defined by
its centroid (the mean of all the points in the cluster).
○ The algorithm iterates between assigning data points to the nearest centroid and updating the
centroids based on the points assigned to them, until convergence.
2. Hierarchical Clustering:
○ Hierarchical clustering creates a tree-like structure of clusters called a dendrogram.
○ The data points are recursively merged or split into clusters based on similarity, either using
an agglomerative (bottom-up) or divisive (top-down) approach.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
○ DBSCAN clusters points that are closely packed together while marking points in low-density regions as noise.
○ Unlike K-means, DBSCAN does not require specifying the number of clusters beforehand.
In this example, we will use the K-Means algorithm for clustering. The K-means algorithm requires
specifying the number of clusters, k, in advance. It tries to minimize the sum of squared distances between
the points in the cluster and their centroid.
SOURCE CODE:
1. Clustering algorithms for unsupervised classification.
We apply the K-Means clustering algorithm to the iris dataset using two of its features: Petal.Length
and Petal.Width (columns 3 and 4 of the iris dataset). Here's a breakdown of the code:
library(cluster)
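The K-means call itself is missing; a minimal sketch that produces a clustering vector like the one shown below (the seed and nstart values are assumptions):
# K-means with 3 clusters on the petal measurements
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], centers = 3, nstart = 20)
irisCluster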
OUTPUT:
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[42] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2
[83] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3
[124] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
OUTPUT:
2. Plot the cluster data using R visualizations.
To plot cluster data using R visualizations, the pam (Partitioning Around Medoids) clustering method
and the clusplot() function from the cluster package are used on the mtcars dataset.
SOURCE CODE:
# Load dataset
data(mtcars)
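A minimal sketch of the pam/clusplot combination (k = 3 is an assumed number of clusters):
# Partition the cars into 3 clusters around medoids
pam_result <- pam(mtcars, k = 3)
# Visualize the clusters in two dimensions
clusplot(pam_result, color = TRUE, shade = TRUE, labels = 2, lines = 0,
main = "Cluster plot of mtcars")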
Explanation:
pam() partitions the observations around k medoids, and clusplot() projects the resulting clusters onto the first two principal components for visualization.
OUTPUT:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++