
Lab Experiment File

of
DATA ANALYTICS USING R (150701)

Submitted By:

Anirban Das
(0901CS211140)

Submitted To:

Prof. Khushboo Agarwal
Assistant Professor
Computer Science and Engineering

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


MADHAV INSTITUTE OF TECHNOLOGY & SCIENCE
GWALIOR - 474005 (MP), Estd. 1957

JULY-DEC 2024
Experiment 1

Write a program in R to understand the basics of R programming, including data types, variables, and basic operations.

# 1. Data Types
# Numeric
num_var <- 10.5
print(paste("Numeric:", num_var))

# Integer
int_var <- 5L
print(paste("Integer:", int_var))

# Character
char_var <- "Hello, R!"
print(paste("Character:", char_var))

# Logical
log_var <- TRUE
print(paste("Logical:", log_var))

# 2. Variables
# Assigning values to variables
a <- 5
b <- 10

# 3. Basic Operations
# Addition
sum_result <- a + b
print(paste("Sum:", sum_result))

# Subtraction
sub_result <- a - b
print(paste("Subtraction:", sub_result))

# Multiplication
mul_result <- a * b
print(paste("Multiplication:", mul_result))

# Division
div_result <- a / b
print(paste("Division:", div_result))

OUTPUT :

[1] "Numeric: 10.5"
[1] "Integer: 5"
[1] "Character: Hello, R!"
[1] "Logical: TRUE"
[1] "Sum: 15"
[1] "Subtraction: -5"
[1] "Multiplication: 50"
[1] "Division: 0.5"

Experiment 2

Write a program in R to import and export data in different formats such as CSV and Excel.

library(readr)    # For CSV handling
library(openxlsx) # For Excel handling

# Import data from CSV
csv_data <- read_csv("data.csv")
print("Imported CSV Data:")
print(csv_data)

# Export data to CSV
write_csv(csv_data, "data/exported_sample.csv")

# Import data from Excel
excel_data <- read.xlsx("data.xlsx", sheet = 1)
print("Imported Excel Data:")
print(excel_data)

# Export data to Excel
write.xlsx(excel_data, "data/exported_sample.xlsx", sheetName = "Sheet1") # Change the path as needed

OUTPUT :

File created

Experiment 3

Write a program in R using the dplyr package to perform data manipulation tasks like filtering, selecting, and summarizing data.

library(dplyr)

# Sample data frame
data <- data.frame(
Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
Age = c(25, 30, 35, 40, 22),
Score = c(85.5, 90.0, 78.5, 88.0, 92.5)
)

print("Original Data:")
print(data)

# 1. Filtering Data
# Filter for rows where Age is greater than 30
filtered_data <- data %>%
filter(Age > 30)

print("Filtered Data (Age > 30):")


print(filtered_data)

# 2. Selecting Columns
# Select only the Name and Score columns
selected_data <- data %>%
select(Name, Score)

print("Selected Columns (Name and Score):")


print(selected_data)

# 3. Summarizing Data
# Calculate the average score
average_score <- data %>%
summarize(Average_Score = mean(Score))

print("Average Score:")
print(average_score)
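
The same dplyr verbs compose with group_by() for per-group summaries. A supplementary sketch on the sample data above (the age cutoff of 30 is an arbitrary illustration):

# Group-wise summary: average score and count per age group
grouped_summary <- data %>%
  mutate(AgeGroup = ifelse(Age <= 30, "30 and under", "over 30")) %>%
  group_by(AgeGroup) %>%
  summarize(Average_Score = mean(Score), Count = n())
print(grouped_summary)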

OUTPUT :

Experiment 4

Write a program in R using the ggplot2 package to create various data visualizations, including scatter plots, bar charts, and histograms.

library(ggplot2)

# Sample data frame
data <- data.frame(
Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
Age = c(25, 30, 35, 40, 22),
Score = c(85.5, 90.0, 78.5, 88.0, 92.5)
)

# 1. Scatter Plot
scatter_plot <- ggplot(data, aes(x = Age, y = Score)) +
geom_point(color = "blue", size = 3) +
labs(title = "Scatter Plot of Age vs Score",
x = "Age",
y = "Score") +
theme_minimal()

print(scatter_plot)

# 2. Bar Chart
# Create a bar chart for Scores by Name
bar_chart <- ggplot(data, aes(x = Name, y = Score, fill = Name)) +
geom_bar(stat = "identity") +
labs(title = "Bar Chart of Scores by Name",
x = "Name",
y = "Score") +
theme_minimal()

# Print the bar chart
print(bar_chart)

# 3. Histogram
# Create a histogram of Scores
histogram <- ggplot(data, aes(x = Score)) +
geom_histogram(binwidth = 5, fill = "lightblue", color = "black") +
labs(title = "Histogram of Scores",
x = "Score",
y = "Frequency") +
theme_minimal()

print(histogram)
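
To keep the figures beyond the interactive session, ggsave() writes a named plot to disk. A minimal sketch; the filenames and sizes are illustrative assumptions:

# Save each plot as a PNG file (filenames are illustrative)
ggsave("scatter_plot.png", plot = scatter_plot, width = 6, height = 4)
ggsave("bar_chart.png", plot = bar_chart, width = 6, height = 4)
ggsave("histogram.png", plot = histogram, width = 6, height = 4)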

OUTPUT :

Experiment 5

Write a program in R to perform Exploratory Data Analysis (EDA) by generating summary statistics, visualizing distributions, and identifying correlations.

library(ggplot2)
library(dplyr)
library(corrplot)

# Sample data frame
set.seed(123)
data <- data.frame(
Age = sample(20:50, 100, replace = TRUE),
Score = rnorm(100, mean = 75, sd = 10),
Height = rnorm(100, mean = 160, sd = 15),
Weight = rnorm(100, mean = 65, sd = 10)
)

# 1. Summary Statistics
summary_statistics <- summary(data)
print("Summary Statistics:")
print(summary_statistics)

# 2. Visualizing Distributions
# Histogram for Age
hist_age <- ggplot(data, aes(x = Age)) +
geom_histogram(binwidth = 2, fill = "lightblue", color = "black") +
labs(title = "Histogram of Age", x = "Age", y = "Frequency") +
theme_minimal()

print(hist_age)

# 3. Correlation Matrix
correlation_matrix <- cor(data)
print("Correlation Matrix:")
print(correlation_matrix)

# Visualizing the correlation matrix
corrplot(correlation_matrix, method = "circle", type = "upper",
tl.col = "black", tl.srt = 45,
title = "Correlation Matrix", mar = c(0, 0, 1, 0))

OUTPUT :

Experiment 6

Write a program in R to clean and preprocess data, including handling missing values, detecting outliers, and normalizing data.

library(dplyr)
library(ggplot2)

# Sample data frame creation with missing values and outliers
set.seed(123)
n <- 100

data <- data.frame(
  Age = c(sample(20:50, n - 2, replace = TRUE), NA, 100),  # 1 missing value and 1 outlier
  Score = c(rnorm(n - 1, mean = 75, sd = 10), NA),         # 1 missing value
  Height = c(rnorm(n - 2, mean = 160, sd = 15), 180, 250)  # 1 outlier (250); no missing values
)

print("Original Data:")
print(head(data))

# 1. Handling Missing Values
# Check for missing values
missing_summary <- sapply(data, function(x) sum(is.na(x)))
print("Missing Values Summary:")
print(missing_summary)

# Impute missing values with the mean of each column
data <- data %>%
mutate(
Age = ifelse(is.na(Age), mean(Age, na.rm = TRUE), Age),
Score = ifelse(is.na(Score), mean(Score, na.rm = TRUE), Score)
)

# Print data after handling missing values
print("Data after Handling Missing Values:")
print(head(data))

# 2. Detecting Outliers
# Function to detect outliers using IQR method
detect_outliers <- function(x) {
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  return(x < lower_bound | x > upper_bound)
}

# Detect outliers in Age and Height
data$Age_Outlier <- detect_outliers(data$Age)
data$Height_Outlier <- detect_outliers(data$Height)

# Print detected outliers
print("Detected Outliers:")
print(data %>% filter(Age_Outlier | Height_Outlier))

# 3. Normalizing Data
# Normalizing using min-max scaling
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}

data_normalized <- data %>%
  mutate(
Age = normalize(Age),
Score = normalize(Score),
Height = normalize(Height)
)

print("Normalized Data:")
print(head(data_normalized))
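
Min-max scaling is sensitive to the very outliers detected above, because min(x) and max(x) may themselves be outliers. A common alternative is z-score standardization; a minimal sketch over the same columns using base R's scale():

# Alternative: z-score standardization ((x - mean(x)) / sd(x))
data_standardized <- data %>%
  mutate(
    Age = as.numeric(scale(Age)),
    Score = as.numeric(scale(Score)),
    Height = as.numeric(scale(Height))
  )
print(head(data_standardized))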

OUTPUT :

Experiment 7

Write a program in R to perform linear regression analysis, fit a model, interpret coefficients, and make predictions.

library(ggplot2)

# Sample Data: Create a simple data frame
set.seed(42)
data <- data.frame(
x = rnorm(100, mean = 5, sd = 2), # Independent variable
y = rnorm(100, mean = 10, sd = 3) # Dependent variable
)

# Introduce a linear relationship
data$y <- 3 * data$x + rnorm(100) # y = 3x + noise

# Explore the data
head(data)
summary(data)

# Fit a linear regression model
model <- lm(y ~ x, data = data)

# Summarize the model
summary(model)

# Make predictions
new_data <- data.frame(x = c(3, 5, 7)) # New x values for prediction
predictions <- predict(model, new_data)

# Show predictions
predictions

# Visualize the data and the fitted model
ggplot(data, aes(x = x, y = y)) +
geom_point(color = 'blue') + # Scatter plot of the data
geom_smooth(method = 'lm', color = 'red') + # Regression line
labs(title = "Linear Regression Analysis",
x = "Independent Variable (x)",
y = "Dependent Variable (y)")

OUTPUT :

A data.frame: 6 × 2

           x        y
       <dbl>    <dbl>
1   7.741917 21.22482
2   3.870604 11.94559
3   5.726257 18.35010
4   6.265725 20.85671
5   5.808537 16.04875
6   4.787751 13.21240

       x                 y
 Min.   :-0.9862   Min.   :-3.976
 1st Qu.: 3.7666   1st Qu.:11.827
 Median : 5.1796   Median :15.581
 Mean   : 5.0650   Mean   :15.185
 3rd Qu.: 6.3231   3rd Qu.:19.132
 Max.   : 9.5733   Max.   :30.757

Call:
lm(formula = y ~ x, data = data)

Residuals:
Min 1Q Median 3Q Max
-2.5640 -0.6301 -0.1046 0.6518 2.4347

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.34771 0.26710 1.302 0.196
x 2.92930 0.04881 60.018 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.011 on 98 degrees of freedom
Multiple R-squared: 0.9735, Adjusted R-squared: 0.9732
F-statistic: 3602 on 1 and 98 DF, p-value: < 2.2e-16
        1         2         3
 9.135622 14.994229 20.852836
`geom_smooth()` using formula = 'y ~ x'

Experiment 8

Write a program in R to apply logistic regression for classification, interpret model coefficients, and evaluate performance with a confusion matrix and ROC curve.

library(ggplot2)
library(caret)
library(pROC)

# Sample Data: Using the built-in iris dataset for binary classification
data <- iris
data$Species <- ifelse(data$Species == "setosa", 1, 0)

# Split the data into training and testing sets
set.seed(42)
train_index <- createDataPartition(data$Species, p = 0.7, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

# Fit a logistic regression model
model <- glm(Species ~ Sepal.Length + Sepal.Width, data = train_data, family = binomial)

# Summarize the model
summary(model)

# Make predictions on the test set
test_data$predicted_prob <- predict(model, newdata = test_data, type = "response")
test_data$predicted_class <- ifelse(test_data$predicted_prob > 0.5, 1, 0)

# Create a confusion matrix
conf_matrix <- confusionMatrix(as.factor(test_data$predicted_class),
                               as.factor(test_data$Species))
print(conf_matrix)

# ROC Curve
roc_obj <- roc(test_data$Species, test_data$predicted_prob)
auc_value <- auc(roc_obj)
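
The script computes roc_obj and auc_value but never displays them; a minimal addition plots the ROC curve and reports the AUC (print.auc is a pROC plotting option):

# Plot the ROC curve and report the area under it
plot(roc_obj, main = "ROC Curve", print.auc = TRUE)
print(paste("AUC:", round(auc_value, 4)))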

OUTPUT :

Number of Fisher Scoring iterations: 25

Confusion Matrix and Statistics

Reference
Prediction 0 1
0 33 0
1 0 12

Accuracy : 1
95% CI : (0.9213, 1)
No Information Rate : 0.7333
P-Value [Acc > NIR] : 8.681e-07

Kappa : 1

Mcnemar's Test P-Value : NA

Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.7333
Detection Rate : 0.7333
Detection Prevalence : 0.7333
Balanced Accuracy : 1.0000

'Positive' Class : 0

Setting levels: control = 0, case = 1

Setting direction: controls < cases

Experiment 9

Write a program in R to perform clustering analysis using the k-means algorithm, determine optimal clusters, and visualize results.

library(ggplot2)

data <- iris[, -5]

# Standardize the data
data_scaled <- scale(data)

# Determine optimal number of clusters using the Elbow method
wss <- numeric(10)
for (k in 1:10) {
kmeans_model <- kmeans(data_scaled, centers = k, nstart = 20)
wss[k] <- kmeans_model$tot.withinss
}

# Plot the Elbow curve
elbow_plot <- data.frame(Clusters = 1:10, WSS = wss)
ggplot(elbow_plot, aes(x = Clusters, y = WSS)) +
geom_line() +
geom_point() +
labs(title = "Elbow Method for Optimal k",
x = "Number of Clusters (k)",
y = "Total Within-Cluster Sum of Squares (WSS)") +
theme_minimal()

# Fit the k-means model with the optimal number of clusters
optimal_k <- which.min(diff(wss)) + 1 # Change this based on elbow observation
kmeans_model <- kmeans(data_scaled, centers = optimal_k, nstart = 20)

# Add cluster assignments to the original data
iris$Cluster <- as.factor(kmeans_model$cluster)

# Visualize the clusters using a scatter plot
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Cluster)) +
geom_point(size = 3) +
labs(title = paste("K-Means Clustering (k =", optimal_k, ")"),
x = "Sepal Length",
y = "Sepal Width") +
theme_minimal()
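
The elbow heuristic above is subjective; the average silhouette width is a common complement. A sketch using the cluster package (an assumed extra dependency, not loaded in the original program):

library(cluster) # assumed extra dependency for silhouette analysis
dist_mat <- dist(data_scaled)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(data_scaled, centers = k, nstart = 20)
  mean(silhouette(km$cluster, dist_mat)[, "sil_width"])
})
best_k <- which.max(avg_sil) + 1 # silhouette is defined only for k >= 2
print(paste("Best k by average silhouette width:", best_k))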

OUTPUT :

Experiment 10

Write a program in R to analyze and forecast time series data using techniques like moving averages, exponential smoothing, and ARIMA models.

library(forecast)
library(ggplot2)

# Load the AirPassengers dataset
data("AirPassengers")

# Plot the original time series data
autoplot(AirPassengers) +
labs(title = "Air Passengers Over Time",
x = "Year",
y = "Number of Passengers") +
theme_minimal()

# Moving Average
# Calculate moving averages
ma_12 <- ma(AirPassengers, order = 12)
autoplot(AirPassengers) +
autolayer(ma_12, series = "12-Month Moving Average") + # PI applies only to forecast objects, so it is dropped here
labs(title = "Air Passengers with 12-Month Moving Average",
x = "Year",
y = "Number of Passengers") +
theme_minimal()

# Exponential Smoothing
# Fit exponential smoothing model
ets_model <- ets(AirPassengers)
summary(ets_model)

# Forecast using exponential smoothing
ets_forecast <- forecast(ets_model, h = 12)
autoplot(ets_forecast) +
labs(title = "Exponential Smoothing Forecast",
x = "Year",
y = "Number of Passengers") +
theme_minimal()

# ARIMA Model
# Fit an ARIMA model
arima_model <- auto.arima(AirPassengers)
summary(arima_model)

# Forecast using ARIMA
arima_forecast <- forecast(arima_model, h = 12)
autoplot(arima_forecast) +
labs(title = "ARIMA Forecast",
x = "Year",
y = "Number of Passengers") +
theme_minimal()

# Compare forecasts
autoplot(AirPassengers) +
autolayer(ets_forecast, series = "ETS Forecast", PI = FALSE) +
autolayer(arima_forecast, series = "ARIMA Forecast", PI = FALSE) +
labs(title = "Forecast Comparison: ETS vs ARIMA",
x = "Year",
y = "Number of Passengers") +
theme_minimal()
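
The visual comparison above does not say which model forecasts better. A minimal sketch holds out the final two years with window() and scores both models with accuracy(); the 1958/1959 split point is an assumption for illustration:

# Hold out the last two years, refit both models, and score the forecasts
train <- window(AirPassengers, end = c(1958, 12))
test <- window(AirPassengers, start = c(1959, 1))
ets_holdout <- forecast(ets(train), h = length(test))
arima_holdout <- forecast(auto.arima(train), h = length(test))
print(accuracy(ets_holdout, test))   # "Test set" rows show out-of-sample error
print(accuracy(arima_holdout, test))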

OUTPUT :

ETS(M,Ad,M)

Call:
ets(y = AirPassengers)

Smoothing parameters:
alpha = 0.7096
beta = 0.0204
gamma = 1e-04
phi = 0.98

Initial states:
l = 120.9939
b = 1.7705
s = 0.8944 0.7993 0.9217 1.0592 1.2203 1.2318
1.1105 0.9786 0.9804 1.011 0.8869 0.9059

sigma: 0.0392

     AIC     AICc      BIC
1395.166 1400.638 1448.623

Training set error measures:
                   ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
Training set 1.567359 10.74726 7.791605 0.4357799 2.857917 0.2432573 0.03945056

Series: AirPassengers
ARIMA(2,1,1)(0,1,0)[12]

Coefficients:
ar1 ar2 ma1
0.5960 0.2143 -0.9819
s.e. 0.0888 0.0880 0.0292

sigma^2 = 132.3: log likelihood = -504.92
AIC=1017.85 AICc=1018.17 BIC=1029.35

Training set error measures:
                   ME     RMSE     MAE       MPE     MAPE     MASE         ACF1
Training set 1.342299 10.84619 7.86754 0.4206976 2.800458 0.245628 -0.001248475
