R Record-1

The document outlines several R scripts aimed at performing various data operations including import/export, pre-processing, descriptive statistics, visualization, and distribution implementations. Each exercise includes an algorithm, program code, and results confirming successful execution. The scripts utilize libraries like dplyr and ggplot2 for data manipulation and visualization.

Uploaded by

balajics3112


Write an R Script to perform the data Import and Export Operations

Ex.No:01
Date:

AIM:

To write an R script to perform data import and export operations.

ALGORITHM:

Step 1: Define a data frame df with the columns Name, Language, and Age, and fill in the values.
Step 2: Call write.table() to export df to a tab-separated text file, naming the file "myDataFrame.txt" through the file parameter.
Step 3: Check the working directory to confirm the export, and open the file to verify its structure and formatting.
Step 4: Call file.choose() inside read.csv() so the user can manually select a CSV file from their system.
Step 5: Pass header = TRUE so the first row of the file is treated as column names.
Step 6: Type the name of the data frame to display all of its rows and columns in the console.

PROGRAM:

# EXPORTING DATA
df <- data.frame(
  Name = c("Amiya", "Raj", "Asish"),
  Language = c("R", "python", "java"),
  Age = c(22, 25, 45)
)

# Write a tab-separated file; col.names = NA leaves a blank header cell above the row names
write.table(df,
            file = "myDataFrame.txt",
            sep = "\t",
            row.names = TRUE,
            col.names = NA)

# IMPORTING DATA
# Import and store the dataset in data1
data1 <- read.csv(file.choose(), header = TRUE)

# Display the data
data1

OUTPUT:
RESULT:

The data Import and Export Operations are done in R Studio.
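
The export and import steps can also be checked without a file dialog. The following is a minimal sketch, assuming a temporary file created with tempfile() stands in for the file picked through file.choose(); the names tmp and df_back are illustrative and not part of the original record.

```r
# Round trip: write a data frame to a tab-separated file, then read it back
df <- data.frame(
  Name = c("Amiya", "Raj", "Asish"),
  Language = c("R", "python", "java"),
  Age = c(22, 25, 45)
)

tmp <- tempfile(fileext = ".txt")  # assumed stand-in for a user-chosen file

# Export without row names so the file has exactly three columns
write.table(df, file = tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# Import with the same separator; header = TRUE reads the first row as column names
df_back <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

print(df_back)
unlink(tmp)  # remove the temporary file
```

Using the same sep value on both sides is what keeps the columns aligned; a mismatch typically produces a single wide column on import.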

Ex.No:02
Date:
Write an R Script to perform the Data Pre-processing techniques

AIM:

To write an R script to perform data pre-processing techniques in RStudio.

ALGORITHM:

Step 1: Load the dplyr and tidyr packages for data manipulation.
Step 2: Load the mtcars dataset and display the first few rows using head(mtcars).
Step 3: Introduce a few missing values for demonstration, then count and print the total number of missing values in the dataset.
Step 4: Impute missing values using the mean of each column.
Step 5: Introduce a duplicate row by appending the first row to the dataset, then remove duplicate rows using distinct().
Step 6: Normalize the mpg column using the scale() function.
Step 7: Create the power_to_weight_ratio feature by dividing hp by wt.
Step 8: Create an additional dataset with car names and randomly assigned country values (USA, Japan, Europe).
Step 9: Add a new column car to store row names, and merge the datasets on the car column.
Step 10: Display the first few rows of the final dataset using head(mtcars).

PROGRAM:

# Load necessary libraries

library(dplyr)

library(tidyr)

# Load the dataset

data(mtcars)

# View the first few rows of the dataset

head(mtcars)

# 1. Handling Missing Data

# For demonstration, let's introduce some missing values

mtcars[1, "mpg"] <- NA

mtcars[5, "hp"] <- NA

# Identify missing values

missing_values <- sum(is.na(mtcars))

print(paste("Number of missing values:", missing_values))

# Impute missing values with the mean of the column

mtcars <- mtcars %>%

mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# 2. Removing Duplicates

# For demonstration, let's introduce a duplicate row

mtcars <- rbind(mtcars, mtcars[1, ])

# Remove duplicate rows

mtcars <- mtcars %>%

distinct()

# 3. Data Transformation

# Normalize the 'mpg' column

mtcars <- mtcars %>%
  mutate(mpg_normalized = as.numeric(scale(mpg)))  # scale() returns a one-column matrix

# 4. Feature Engineering

# Create a new feature 'power_to_weight_ratio'

mtcars <- mtcars %>%

mutate(power_to_weight_ratio = hp / wt)
# 5. Data Integration

# For demonstration, let's create a second dataset

additional_data <- data.frame(
  car = rownames(mtcars),
  country = sample(c("USA", "Japan", "Europe"), nrow(mtcars), replace = TRUE)
)

# Merge the datasets

mtcars <- mtcars %>%

mutate(car = rownames(mtcars)) %>%

left_join(additional_data, by = "car")

# View the final pre-processed dataset

head(mtcars)

OUTPUT:

RESULT:

The Data Pre-processing techniques are executed successfully.
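
Mean imputation and normalization are easier to verify on a handful of values than on the full dataset. The following is a minimal sketch in base R, assuming a made-up vector x; it is not part of the original program and only illustrates the two operations.

```r
# Toy vector with one missing value (illustrative data)
x <- c(10, NA, 30, 20)

# Mean imputation: replace NA with the mean of the observed values
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# z-score normalization: subtract the mean, divide by the standard deviation
z_manual <- (x_imputed - mean(x_imputed)) / sd(x_imputed)

# scale() performs the same computation but returns a one-column matrix
z_scaled <- as.numeric(scale(x_imputed))

print(x_imputed)
print(round(z_manual, 3))
print(all.equal(z_manual, z_scaled))
```

After normalization the values have mean 0 and standard deviation 1, which is a quick sanity check for the mpg_normalized column as well.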

Ex.No:03
Date:
Write an R Script to perform the descriptive statistics concepts.

AIM:

To write an R script to perform the descriptive statistics concepts.

ALGORITHM:

Step 1: Check whether the dplyr package is installed; if not, install it. Load the dplyr library.
Step 2: Set a random seed (set.seed(123)) for reproducibility and create a sample data frame with the columns ID, Age, Height, Weight, and Gender.
Step 3: Display the first six rows of data using head(data).
Step 4: Use summary(data) to get basic statistics (min, max, median, mean, etc.) for the numeric columns, then compute the mean, median, variance, and standard deviation of Age.
Step 5: Compute the frequency distribution of the Gender column using table(data$Gender).
Step 6: Compute the percentage distribution using prop.table() and multiply by 100.
Step 7: Calculate the Pearson correlation between Height and Weight using cor(data$Height, data$Weight).
Step 8: Group the data by Gender and compute the mean and standard deviation of Height and of Weight.
Step 9: Plot a histogram with hist(data$Age), using blue bars and a black border, and a boxplot of Height by Gender.

PROGRAM:
# Load necessary library

if (!require("dplyr")) install.packages("dplyr")

library(dplyr)

# Create a sample dataset

set.seed(123)

data <- data.frame(
  ID = 1:100,
  Age = sample(18:60, 100, replace = TRUE),
  Height = round(rnorm(100, mean = 165, sd = 10), 1),
  Weight = round(rnorm(100, mean = 65, sd = 15), 1),
  Gender = sample(c("Male", "Female"), 100, replace = TRUE)
)

# View the first few rows of the dataset

head(data)

# Descriptive statistics

# Summary statistics for numeric variables

summary(data)
# Mean, Median, Variance, and Standard Deviation for a specific column

mean_age <- mean(data$Age)

median_age <- median(data$Age)

var_age <- var(data$Age)

sd_age <- sd(data$Age)

cat("Mean Age:", mean_age, "\n")

cat("Median Age:", median_age, "\n")

cat("Variance of Age:", var_age, "\n")

cat("Standard Deviation of Age:", sd_age, "\n")

# Frequency distribution for categorical variable

gender_distribution <- table(data$Gender)

cat("\nGender Distribution:\n")

print(gender_distribution)

# Percentage distribution

gender_percentage <- prop.table(gender_distribution) * 100

cat("\nGender Percentage Distribution:\n")

print(round(gender_percentage, 2))

# Correlation between two numeric variables

correlation <- cor(data$Height, data$Weight)

cat("\nCorrelation between Height and Weight:", correlation, "\n")

# Grouped statistics: Mean Height and Weight by Gender

grouped_stats <- data %>%
  group_by(Gender) %>%
  summarise(
    Mean_Height = mean(Height),
    Mean_Weight = mean(Weight),
    SD_Height = sd(Height),
    SD_Weight = sd(Weight)
  )

cat("\nGrouped Statistics by Gender:\n")

print(grouped_stats)

# Histogram for Age

hist(data$Age, main = "Histogram of Age", xlab = "Age", col = "blue", border = "black")

# Boxplot for Height by Gender

boxplot(Height ~ Gender, data = data,

main = "Boxplot of Height by Gender",

xlab = "Gender", ylab = "Height (cm)",

col = c("lightblue", "pink"))

OUTPUT:

RESULT:

Ex.No:04
Date:
Visualizing the data in different graphics using R Script.

AIM:

To write an R Script for visualizing the data in different graphics.

ALGORITHM:

Step 1: Check whether the ggplot2 and dplyr libraries are installed; if not, install them using install.packages(). Load the libraries using library().
Step 2: Set a random seed for reproducibility and create a sample data frame (data) with the variables ID, Age, Height, Weight, and Gender.
Step 3: Scatter plot: plot Height vs. Weight using geom_point() with point size, transparency, and Gender color coding. Set the title, labels, and theme.
Step 4: Histogram: plot Age using geom_histogram() with binwidth = 5 to group ages into intervals of 5 years.
Step 5: Boxplot: plot Height by Gender using geom_boxplot().
Step 6: Bar plot: group the dataset by Gender, count occurrences using summarise(), and plot the counts using geom_bar().
Step 7: Line chart: sort the dataset by ID using arrange() and draw the chart using geom_line().
Step 8: Density plot: plot Weight using geom_density() with fill color based on Gender.
Step 9: Pie chart: group the dataset by Gender, count occurrences, and draw the bar in polar coordinates using coord_polar().
Step 10: Faceted plot: repeat the Height vs. Weight scatter plot with one panel per Gender using facet_wrap().

PROGRAM:

# Load necessary library

if (!require("ggplot2")) install.packages("ggplot2")

if (!require("dplyr")) install.packages("dplyr")

library(ggplot2)

library(dplyr)

# Create a sample dataset

set.seed(123)

data <- data.frame(
  ID = 1:100,
  Age = sample(18:60, 100, replace = TRUE),
  Height = round(rnorm(100, mean = 165, sd = 10), 1),
  Weight = round(rnorm(100, mean = 65, sd = 15), 1),
  Gender = sample(c("Male", "Female"), 100, replace = TRUE)
)

# 1. Scatter Plot: Height vs. Weight

ggplot(data, aes(x = Height, y = Weight, color = Gender)) +

geom_point(size = 3, alpha = 0.7) +

labs(title = "Scatter Plot of Height vs. Weight",

x = "Height (cm)",

y = "Weight (kg)") +

theme_minimal()

# 2. Histogram: Distribution of Age

ggplot(data, aes(x = Age)) +

geom_histogram(binwidth = 5, fill = "blue", color = "black", alpha = 0.7) +

labs(title = "Histogram of Age",

x = "Age (years)",

y = "Frequency") +

theme_minimal()

# 3. Boxplot: Height by Gender

ggplot(data, aes(x = Gender, y = Height, fill = Gender)) +

geom_boxplot() +

labs(title = "Boxplot of Height by Gender",

x = "Gender",

y = "Height (cm)") +

theme_minimal()

# 4. Bar Plot: Gender Distribution

gender_distribution <- data %>%

group_by(Gender) %>%

summarise(Count = n())

ggplot(gender_distribution, aes(x = Gender, y = Count, fill = Gender)) +

geom_bar(stat = "identity", width = 0.6) +

labs(title = "Bar Plot of Gender Distribution",

x = "Gender",

y = "Count") +

theme_minimal()

# 5. Line Chart: Age Trend (sorted by ID)

data <- data %>% arrange(ID)

ggplot(data, aes(x = ID, y = Age, group = 1)) +

geom_line(color = "darkgreen", size = 1) +

geom_point(color = "red", size = 2) +

labs(title = "Line Chart of Age Trend by ID",

x = "ID",

y = "Age (years)") +

theme_minimal()

# 6. Density Plot: Weight Distribution

ggplot(data, aes(x = Weight, fill = Gender)) +

geom_density(alpha = 0.5) +

labs(title = "Density Plot of Weight Distribution",

x = "Weight (kg)",

y = "Density") +

theme_minimal()

# 7. Pie Chart: Gender Proportion

gender_proportion <- data %>%

group_by(Gender) %>%

summarise(Count = n())

ggplot(gender_proportion, aes(x = "", y = Count, fill = Gender)) +

geom_bar(stat = "identity", width = 1) +

coord_polar("y") +

labs(title = "Pie Chart of Gender Proportion") +

theme_void()

# 8. Faceted Plot: Scatter Plot of Height vs. Weight by Gender

ggplot(data, aes(x = Height, y = Weight)) +

geom_point(aes(color = Gender), size = 2, alpha = 0.7) +

facet_wrap(~ Gender) +

labs(title = "Scatter Plot of Height vs. Weight (Faceted by Gender)",

x = "Height (cm)",

y = "Weight (kg)") +

theme_minimal()

OUTPUT:

RESULT:

Visualizing the data in different graphics using R Script was executed successfully.
Ex.No:05
Date:
Write an R Script to implement the Normal and binomial distribution.

AIM:

To write an R script to implement the normal and binomial distributions.

ALGORITHM:

Step 1: Load the ggplot2 package, which is used for visualization.
Step 2: Set a random seed to ensure reproducibility of results.
Step 3: Generate 1000 random observations from a normal distribution with mean = 0 and sd = 1.
Step 4: Create a histogram to visualize the distribution: use geom_histogram() to plot the density of the normal data and stat_function() to overlay the theoretical density curve.
Step 5: Generate 1000 random observations from a binomial distribution with number of trials (size) = 10 and probability of success (prob) = 0.5.
Step 6: Create a bar plot to visualize the binomial distribution, using geom_bar() to plot the proportion of each outcome.

PROGRAM:

# Load necessary library

library(ggplot2)

# Generate a sample of 1000 observations from a normal distribution

set.seed(123)

normal_data <- rnorm(1000, mean = 0, sd = 1)

# Plot the histogram of the normal distribution

ggplot(data.frame(x = normal_data), aes(x = x)) +

geom_histogram(aes(y = ..density..), bins = 30, fill = "blue", alpha = 0.5) +

stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", size = 1) +

labs(title = "Normal Distribution", x = "Value", y = "Density")

# Load necessary library

library(ggplot2)

# Generate a sample of 1000 observations from a binomial distribution

set.seed(123)

binomial_data <- rbinom(1000, size = 10, prob = 0.5)

# Plot a bar chart of the binomial distribution (proportion of each outcome)

ggplot(data.frame(x = binomial_data), aes(x = factor(x))) +

geom_bar(aes(y = ..prop.., group = 1), fill = "green", alpha = 0.5) +

labs(title = "Binomial Distribution", x = "Number of Successes", y = "Proportion")

OUTPUT:

RESULT:

Normal and binomial distribution was executed successfully in R language.
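
The simulated binomial proportions can be compared against the theoretical probabilities from dbinom(). The following is a minimal sketch, assuming the same seed, size, and prob as the program above; the names draws, empirical, theoretical, and comparison are illustrative.

```r
# Compare simulated proportions with the theoretical binomial probabilities
set.seed(123)
draws <- rbinom(1000, size = 10, prob = 0.5)

# Proportion of draws at each possible outcome 0..10 (outcomes never drawn count as 0)
empirical <- as.numeric(table(factor(draws, levels = 0:10))) / length(draws)

# Exact probabilities P(X = k) for k = 0..10
theoretical <- dbinom(0:10, size = 10, prob = 0.5)

comparison <- data.frame(
  successes = 0:10,
  empirical = empirical,
  theoretical = round(theoretical, 4)
)
print(comparison)
```

With 1000 draws the two columns should be close but not identical; the gap shrinks as the number of draws grows.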

Write an R Script to convert numerical data to categorical and binomial
distribution.
Ex.No:06
Date:

AIM:

To write an R script to convert numerical data to categorical and binomial distribution.

ALGORITHM:

Step 1: Set a random seed so that the randomly generated numbers are the same every time the script runs.
Step 2: Create a data frame data with 20 observations of ID, Age, and Income.
Step 3: Use the cut() function to group ages into predefined age groups and incomes into income brackets.
Step 4: Print the dataset after the categorical variables have been added.
Step 5: Count the occurrences of each age group and of each income bracket using table().
Step 6: Check whether the ggplot2 package is installed; if not, install it. Load the library.
Step 7: Draw a bar plot of the age groups, with Age_Group on the x-axis.
Step 8: Draw a similar bar plot of the income brackets in a different color.
Step 9: The output is the dataset with numerical and categorical variables, the frequency counts of the categories, and two bar plots visualizing age groups and income brackets.

PROGRAM:

# Create a sample dataset with numerical data

set.seed(123)

data <- data.frame(
  ID = 1:20,
  Age = sample(18:60, 20, replace = TRUE),
  Income = round(rnorm(20, mean = 50000, sd = 10000), 0)
)

# View the dataset

print("Original Dataset:")

print(data)

# Convert numerical data to categorical variables

# 1. Age categorization into age groups

data$Age_Group <- cut(
  data$Age,
  breaks = c(0, 18, 30, 45, 60, Inf),
  labels = c("Child", "Young Adult", "Adult", "Middle Age", "Senior"),
  right = FALSE  # intervals are closed on the left: [18, 30) and so on
)

# 2. Income categorization into income brackets

data$Income_Bracket <- cut(
  data$Income,
  breaks = c(-Inf, 30000, 50000, 70000, Inf),
  labels = c("Low", "Medium", "High", "Very High"),
  right = TRUE  # intervals are closed on the right: (30000, 50000] and so on
)

# View the updated dataset

print("Updated Dataset with Categorical Variables:")

print(data)

# Frequency tables for the new categorical variables

cat("\nFrequency of Age Groups:\n")

print(table(data$Age_Group))

cat("\nFrequency of Income Brackets:\n")

print(table(data$Income_Bracket))

# Visualize the categorical variables

if (!require("ggplot2")) install.packages("ggplot2")

library(ggplot2)

# Bar plot for Age Groups

ggplot(data, aes(x = Age_Group)) +

geom_bar(fill = "skyblue", color = "black") +

labs(title = "Bar Plot of Age Groups",

x = "Age Group",

y = "Count") +

theme_minimal()

# Bar plot for Income Brackets

ggplot(data, aes(x = Income_Bracket)) +

geom_bar(fill = "orange", color = "black") +

labs(title = "Bar Plot of Income Brackets",

x = "Income Bracket",

y = "Count") +

theme_minimal()

OUTPUT:

RESULT:
The R script to convert numerical data to categorical and binomial distribution was executed
successfully.
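
The right argument of cut() decides which group a value on a boundary falls into. The following is a minimal sketch, assuming a few hand-picked ages that sit on or next to the breaks used above; the vector ages is illustrative.

```r
# Ages chosen to sit on or next to the break points
ages <- c(17, 18, 29, 30, 60)
breaks <- c(0, 18, 30, 45, 60, Inf)
labels <- c("Child", "Young Adult", "Adult", "Middle Age", "Senior")

# right = FALSE: intervals are [0,18), [18,30), [30,45), [45,60), [60,Inf)
left_closed <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

# right = TRUE: intervals are (0,18], (18,30], (30,45], (45,60], (60,Inf]
right_closed <- cut(ages, breaks = breaks, labels = labels, right = TRUE)

print(data.frame(ages, left_closed, right_closed))
```

With right = FALSE, an 18-year-old is a "Young Adult" and a 60-year-old is a "Senior"; with right = TRUE the same two values fall into "Child" and "Middle Age", so the choice should match how the categories are defined.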

Ex.No:07
Date:
Write an R Script to implement Bayes’ Theorem.

Aim:

To Write an R Script for Bayes’ Theorem.

ALGORITHM:

Step 1: Define the events: A is "the person has the disease" and B is "the test result is positive".
Step 2: Set the prior P(A), the probability that a randomly chosen person has the disease.
Step 3: Set the likelihood P(B|A), the probability that the test is positive given the person has the disease.
Step 4: Set P(B|¬A), the probability that the test is positive given the person does not have the disease.
Step 5: Compute the marginal probability of a positive test, P(B) = P(B|A) * P(A) + P(B|¬A) * P(¬A).
Step 6: Compute the posterior P(A|B) = P(B|A) * P(A) / P(B), the probability that a person has the disease given a positive test result.
Step 7: Print the prior, likelihood, marginal, and posterior probabilities.
Step 8: Draw a bar plot showing the prior, likelihood, and posterior probabilities.

PROGRAM:

# Define probabilities

# Example Scenario: Testing for a disease

# A: Person has the disease

# B: Test result is positive

# Prior probability (P(A)): Probability of having the disease

P_A <- 0.01 # 1% of the population has the disease

# Likelihood (P(B|A)): Probability of a positive test result given the person has the disease

P_B_given_A <- 0.95 # Sensitivity: 95% of people with the disease test positive

# Marginal probability of B (P(B)):

# P(B) = P(B|A) * P(A) + P(B|¬A) * P(¬A)

P_B_given_not_A <- 0.05 # 5% false positive rate

P_not_A <- 1 - P_A # Probability of not having the disease

P_B <- (P_B_given_A * P_A) + (P_B_given_not_A * P_not_A)

# Posterior probability (P(A|B)): Using Bayes' Theorem

P_A_given_B <- (P_B_given_A * P_A) / P_B

# Print the results

cat("P(A): Prior Probability of Having the Disease =", P_A, "\n")

cat("P(B|A): Likelihood of Positive Test Given Disease =", P_B_given_A, "\n")

cat("P(B): Marginal Probability of a Positive Test =", P_B, "\n")

cat("P(A|B): Posterior Probability of Having the Disease Given a Positive Test =",
P_A_given_B, "\n")

# Additional Example: Visualization

# Visualize the probabilities using a bar plot

if (!require("ggplot2")) install.packages("ggplot2")

library(ggplot2)

data <- data.frame(
  Category = c("Prior: P(A)", "Likelihood: P(B|A)", "Posterior: P(A|B)"),
  Probability = c(P_A, P_B_given_A, P_A_given_B)
)

ggplot(data, aes(x = Category, y = Probability, fill = Category)) +

geom_bar(stat = "identity", color = "black") +

scale_fill_brewer(palette = "Set3") +

labs(title = "Bayes' Theorem Probabilities",

x = "Probability Component",

y = "Probability") +

theme_minimal()

OUTPUT:
RESULT:

Bayes’ Theorem was executed successfully in an R script.
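
The calculation can be wrapped in a small function to see how the posterior responds to the prior. The following is a minimal sketch; bayes_posterior is a hypothetical helper written for this illustration, not a function from any package, and the list of priors is made up.

```r
# Hypothetical helper: posterior P(A|B) from the prior P(A),
# the sensitivity P(B|A), and the false positive rate P(B|not A)
bayes_posterior <- function(prior, sensitivity, false_positive_rate) {
  marginal <- sensitivity * prior + false_positive_rate * (1 - prior)  # P(B)
  sensitivity * prior / marginal                                       # P(A|B)
}

# Values from the exercise: 1% prevalence, 95% sensitivity, 5% false positives
posterior <- bayes_posterior(prior = 0.01, sensitivity = 0.95,
                             false_positive_rate = 0.05)
cat("P(A|B) =", round(posterior, 4), "\n")

# Illustrative priors: the same test is far more convincing when the disease is common
priors <- c(0.001, 0.01, 0.1, 0.5)
print(data.frame(prior = priors,
                 posterior = round(bayes_posterior(priors, 0.95, 0.05), 4)))
```

With the exercise's values, P(B) = 0.0095 + 0.0495 = 0.059 and the posterior is 0.0095 / 0.059, about 0.161. Most positive results are false positives here because the disease is rare.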

Write an R Script to implement the Time series data analysis and forecasting.
Ex.No:08
Date:
AIM:

To write an R script that implements time series data analysis and forecasting.

ALGORITHM:

Step 1: Before running the analysis, ensure the required packages (forecast, ggplot2, and tseries) are installed and loaded.
Step 2: Generate synthetic monthly time series data and visualize the original series.
Step 3: Decompose the time series into trend, seasonal, and residual components using decompose().
Step 4: Perform the Augmented Dickey-Fuller (ADF) test to check whether the series is stationary.
Step 5: Fit an ARIMA model using auto.arima().
Step 6: Forecast future values and plot the forecast.
Step 7: Analyze the residuals and perform the Ljung-Box test to check residual independence.
Step 8: Use STL (Seasonal and Trend decomposition using LOESS) as an alternative, LOESS-based decomposition.

PROGRAM:

# Install and load necessary packages

if (!require("forecast")) install.packages("forecast")
if (!require("ggplot2")) install.packages("ggplot2")

library(forecast)

library(ggplot2)

# 1. Generate Sample Time Series Data

set.seed(123)

time_series_data <- ts(rnorm(120, mean = 50, sd = 10),

start = c(2010, 1), frequency = 12) # Monthly data from 2010

# Plot the time series data

autoplot(time_series_data) +

labs(title = "Original Time Series Data",

x = "Time",

y = "Value") +

theme_minimal()

# 2. Decompose the Time Series (Trend, Seasonal, Residuals)

decomposed <- decompose(time_series_data)

autoplot(decomposed) +

labs(title = "Decomposed Time Series")

# 3. Check Stationarity using Augmented Dickey-Fuller Test

if (!require("tseries")) install.packages("tseries")

library(tseries)

adf_test <- adf.test(time_series_data)

cat("Augmented Dickey-Fuller Test:\n")

print(adf_test)

# 4. Fit an ARIMA Model for Forecasting

arima_model <- auto.arima(time_series_data)

# Print the ARIMA model summary

cat("\nFitted ARIMA Model:\n")

print(summary(arima_model))

# 5. Forecast the Future Values

forecast_horizon <- 12 # Forecast for the next 12 months

forecast_values <- forecast(arima_model, h = forecast_horizon)

# Plot the forecast

autoplot(forecast_values) +

labs(title = "ARIMA Forecast",

x = "Time",

y = "Value") +

theme_minimal()

# 6. Validate the Model with Residual Analysis

autoplot(arima_model$residuals) +

labs(title = "Residual Analysis",

x = "Time",

y = "Residuals") +
theme_minimal()

# Perform Ljung-Box test for residuals

lb_test <- Box.test(arima_model$residuals, lag = 20, type = "Ljung-Box")

cat("\nLjung-Box Test for Residuals:\n")

print(lb_test)

# 7. Seasonal Decomposition of Time Series using LOESS (STL)

stl_decomposed <- stl(time_series_data, s.window = "periodic")

autoplot(stl_decomposed) +

labs(title = "STL Decomposition")

# 8. Export Forecasted Values (Optional)

# write.csv(data.frame(forecast_values), "forecasted_values.csv")

OUTPUT:

RESULT:

Implemented the Time series data analysis and forecasting in R script.

Hypothesis Testing in R Programming.
Ex.No:09
Date:

Aim:

To Write an R Script for Hypothesis Testing.

ALGORITHM:

Step 1: Check whether the ggplot2 package is installed; if not, install it. Load the ggplot2 library.
Step 2: Set a random seed for reproducibility.
Step 3: Generate group1 with 30 values from a normal distribution (mean = 50, sd = 5).
Step 4: Generate group2 with 30 values from a normal distribution (mean = 52, sd = 5).
Step 5: Perform a one-sample t-test. H0: the mean of group1 is equal to 50.
Step 6: Perform an independent two-sample t-test assuming equal variance. H0: the means of group1 and group2 are equal. H1: the means of group1 and group2 are not equal.
Step 7: Perform a paired t-test on simulated paired data.
Step 8: Perform a chi-square test. H0: the observed frequencies match the expected frequencies.
Step 9: Perform a one-way ANOVA on three groups and visualize the groups with a boxplot.
Step 10: Perform a Shapiro-Wilk test to check normality.
Step 11: Perform a Wilcoxon rank-sum test (non-parametric alternative to the t-test).
Step 12: Perform a correlation test to check the strength of the linear relationship between two variables. H0: there is no correlation between x and y.

PROGRAM:

# Load necessary library

if (!require("ggplot2")) install.packages("ggplot2")

library(ggplot2)

# Create sample data

set.seed(123)

group1 <- rnorm(30, mean = 50, sd = 5) # Sample from Group 1

group2 <- rnorm(30, mean = 52, sd = 5) # Sample from Group 2

# 1. One-Sample t-test

# H0: The mean of group1 is equal to 50

# H1: The mean of group1 is not equal to 50

t_test_one_sample <- t.test(group1, mu = 50)

cat("\nOne-Sample t-test:\n")

print(t_test_one_sample)

# 2. Two-Sample t-test (Independent)

# H0: The means of group1 and group2 are equal

# H1: The means of group1 and group2 are not equal

t_test_two_sample <- t.test(group1, group2, var.equal = TRUE)

cat("\nTwo-Sample t-test:\n")

print(t_test_two_sample)

# 3. Paired t-test

# H0: The means of paired samples are equal

# Simulate paired data

paired_group1 <- rnorm(20, mean = 10, sd = 2)

paired_group2 <- paired_group1 + rnorm(20, mean = 0.5, sd = 1) # Add small difference

t_test_paired <- t.test(paired_group1, paired_group2, paired = TRUE)

cat("\nPaired t-test:\n")

print(t_test_paired)

# 4. Chi-Square Test

# H0: The observed frequencies match the expected frequencies

observed <- c(50, 30, 20) # Observed frequencies

expected <- c(40, 40, 20) # Expected frequencies

chi_sq_test <- chisq.test(observed, p = expected / sum(expected))

cat("\nChi-Square Test:\n")

print(chi_sq_test)

# 5. ANOVA (Analysis of Variance)

# H0: The means of all groups are equal

group_a <- rnorm(20, mean = 5, sd = 1)

group_b <- rnorm(20, mean = 6, sd = 1)

group_c <- rnorm(20, mean = 7, sd = 1)

anova_data <- data.frame(
  values = c(group_a, group_b, group_c),
  groups = rep(c("A", "B", "C"), each = 20)
)

anova_result <- aov(values ~ groups, data = anova_data)

cat("\nANOVA Test:\n")

summary(anova_result)

# 6. Visualizing ANOVA Results

ggplot(anova_data, aes(x = groups, y = values, fill = groups)) +

geom_boxplot() +

labs(title = "Boxplot of Groups for ANOVA",

x = "Group",

y = "Values") +

theme_minimal()

# 7. Shapiro-Wilk Test (Normality Test)

# H0: Data is normally distributed

shapiro_test <- shapiro.test(group1)

cat("\nShapiro-Wilk Test:\n")

print(shapiro_test)

# 8. Wilcoxon Rank-Sum Test (Non-parametric test)

# H0: The distributions of group1 and group2 are the same

wilcox_test <- wilcox.test(group1, group2)

cat("\nWilcoxon Rank-Sum Test:\n")

print(wilcox_test)

# 9. Correlation Test

# H0: There is no correlation between x and y

x <- rnorm(50, mean = 10, sd = 2)

y <- 2 * x + rnorm(50, mean = 0, sd = 1)

cor_test <- cor.test(x, y)

cat("\nCorrelation Test:\n")

print(cor_test)

OUTPUT:

RESULT:

Hypothesis Testing was implemented in R script.
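
The two-sample t statistic reported by t.test() can be reproduced by hand from the pooled variance, which shows what the test actually computes. The following is a minimal sketch, assuming the same seed and group definitions as the program above and a significance level of 0.05 that the original record does not state.

```r
# Recreate the two groups used in the exercise
set.seed(123)
group1 <- rnorm(30, mean = 50, sd = 5)
group2 <- rnorm(30, mean = 52, sd = 5)

n1 <- length(group1)
n2 <- length(group2)

# Pooled variance under the equal-variance assumption
pooled_var <- ((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / (n1 + n2 - 2)

# t statistic and two-sided p-value with n1 + n2 - 2 degrees of freedom
t_manual <- (mean(group1) - mean(group2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
p_manual <- 2 * pt(-abs(t_manual), df = n1 + n2 - 2)

# Built-in test for comparison
res <- t.test(group1, group2, var.equal = TRUE)

alpha <- 0.05  # assumed significance level
decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"

cat("Manual t:", t_manual, " t.test t:", unname(res$statistic), "\n")
cat("Manual p:", p_manual, " t.test p:", res$p.value, "\n")
cat("Decision at alpha =", alpha, ":", decision, "\n")
```

The same pattern, comparing p.value with a chosen alpha, applies to the other test objects in the program, since each of them also stores a p.value element.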

Predictive Analysis R Programming.
Ex.No:10
Date:

AIM:

Write an R script for Predictive Analysis.

ALGORITHM:

Step 1: Install and load the necessary libraries: caret (for machine learning operations such as data partitioning and evaluation), rpart, randomForest, and ggplot2.
Step 2: Generate a synthetic dataset using random values, and derive the target variable Default from Credit_Score.
Step 3: Split the data into training (70%) and testing (30%) sets.
Step 4: Fit a linear regression model for the continuous variable Income.
Step 5: Fit a decision tree and a random forest for classifying Default.
Step 6: Make predictions using the decision tree and the random forest.
Step 7: Evaluate model performance using confusion matrices.
Step 8: Visualize the decision tree and the feature importance from the random forest.

PROGRAM:

# Install and load required libraries

if (!require("caret")) install.packages("caret")

if (!require("rpart")) install.packages("rpart")

if (!require("randomForest")) install.packages("randomForest")

if (!require("ggplot2")) install.packages("ggplot2")

library(caret)

library(rpart)

library(randomForest)

library(ggplot2)

# 1. Load or Create Sample Dataset

set.seed(123)
data <- data.frame(
  Age = sample(18:70, 100, replace = TRUE),
  Income = round(rnorm(100, mean = 50000, sd = 10000), 0),
  Education = sample(c("High School", "Bachelor's", "Master's", "PhD"), 100, replace = TRUE),
  Credit_Score = round(rnorm(100, mean = 700, sd = 50), 0)
)

# Target Variable: Whether the person defaults on a loan

data$Default <- ifelse(data$Credit_Score < 650, 1, 0) # 1 = Default, 0 = No Default

# View the dataset

cat("Sample Dataset:\n")

print(head(data))

# Convert categorical variables to factors

data$Education <- as.factor(data$Education)

data$Default <- as.factor(data$Default)

# 2. Split Data into Training and Testing Sets

set.seed(123)

trainIndex <- createDataPartition(data$Default, p = 0.7, list = FALSE)

trainData <- data[trainIndex, ]

testData <- data[-trainIndex, ]

cat("\nTraining Set Size:", nrow(trainData))

cat("\nTesting Set Size:", nrow(testData))

# 3. Build Predictive Models

# (a) Linear Regression (For continuous target variables)

lm_model <- lm(Income ~ Age + Credit_Score, data = trainData)

cat("\nLinear Regression Model Summary:\n")

print(summary(lm_model))

# (b) Decision Tree (For classification tasks)

tree_model <- rpart(Default ~ Age + Income + Education + Credit_Score, data = trainData,

method = "class")

cat("\nDecision Tree Summary:\n")

print(tree_model)

# (c) Random Forest (For classification tasks)

rf_model <- randomForest(Default ~ Age + Income + Education + Credit_Score,
                         data = trainData, ntree = 100)

cat("\nRandom Forest Model:\n")

print(rf_model)

# 4. Make Predictions

# Predict using Decision Tree

tree_predictions <- predict(tree_model, testData, type = "class")

# Predict using Random Forest

rf_predictions <- predict(rf_model, testData)

# 5. Evaluate Models

# Confusion Matrix for Decision Tree

cat("\nDecision Tree Confusion Matrix:\n")

print(confusionMatrix(tree_predictions, testData$Default))

# Confusion Matrix for Random Forest

cat("\nRandom Forest Confusion Matrix:\n")

print(confusionMatrix(rf_predictions, testData$Default))

# 6. Visualize the Results

# (a) Decision Tree Plot

if (!require("rpart.plot")) install.packages("rpart.plot")

library(rpart.plot)

rpart.plot(tree_model, main = "Decision Tree")

# (b) Feature Importance from Random Forest

importance <- importance(rf_model)

cat("\nFeature Importance:\n")

print(importance)

# Plot Feature Importance

importance_df <- data.frame(Feature = rownames(importance), Importance = importance[, 1])

ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance, fill = Feature)) +

geom_bar(stat = "identity") +

coord_flip() +

labs(title = "Feature Importance (Random Forest)",

x = "Features",
y = "Importance") +

theme_minimal()

OUTPUT:

RESULT:

Predictive Analysis in R script was executed successfully.
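
The numbers in a confusion matrix can be derived by hand from a cross-tabulation of predicted and actual labels. The following is a minimal sketch in base R, assuming two short made-up label vectors; it does not use the models or data from the program above.

```r
# Illustrative labels: 1 = Default, 0 = No Default
actual    <- factor(c(1, 0, 0, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 1, 0, 0, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are the true labels
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity for class "1": correctly predicted defaults / all actual defaults
sensitivity_default <- cm["1", "1"] / sum(cm[, "1"])

cat("Accuracy:", accuracy, "\n")
cat("Sensitivity for Default:", sensitivity_default, "\n")
```

When reading caret's confusionMatrix() output, note that it treats the first factor level as the positive class unless the positive argument is set, so with levels 0 and 1 its reported sensitivity refers to "No Default" by default.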

Ex.No:11
Date:
Write an R script to implement the Cross-Validation.

AIM:

Implement R Script for Cross-Validation.

ALGORITHM:

Step 1: Load the required libraries using library().
Step 2: Create a sample dataset and convert the Default column into a factor, since this is a classification problem.
Step 3: Display the first few rows of the dataset.
Step 4: Define 10-fold cross-validation settings using trainControl().
Step 5: Use the train() function to fit a Random Forest model with those settings.
Step 6: Print the model summary, the best tuning parameters, and the per-fold results.
Step 7: Use the plot() function to visualize the model performance across tuning parameters.
Step 8: Create a new dataset with sample values, predict with the trained model, and print the predictions along with the input data.

PROGRAM:

# Install and load required libraries

if (!require("caret")) install.packages("caret")

if (!require("randomForest")) install.packages("randomForest")

if (!require("e1071")) install.packages("e1071") # Needed for SVM in caret

library(caret)

library(randomForest)

# 1. Create Sample Dataset

set.seed(123)

data <- data.frame(
  Age = sample(18:70, 200, replace = TRUE),
  Income = round(rnorm(200, mean = 50000, sd = 10000), 0),
  Credit_Score = round(rnorm(200, mean = 700, sd = 50), 0),
  Default = sample(c("Yes", "No"), 200, replace = TRUE)
)

# Convert target variable to factor

data$Default <- as.factor(data$Default)

# View dataset
cat("Sample Dataset:\n")

print(head(data))

# 2. Define Train Control for Cross-Validation

train_control <- trainControl(
  method = "cv",      # Cross-validation method
  number = 10,        # Number of folds
  verboseIter = TRUE  # Display training progress
)

# 3. Train a Model with Cross-Validation

# Example: Random Forest

set.seed(123)

rf_model <- train(
  Default ~ Age + Income + Credit_Score,  # Formula
  data = data,                            # Dataset
  method = "rf",                          # Random forest method
  trControl = train_control,              # Cross-validation settings
  tuneLength = 3                          # Number of candidate values per tuning parameter
)

# 4. Model Performance

cat("\nRandom Forest Model Summary:\n")

print(rf_model)
# Print the best model parameters

cat("\nBest Tuning Parameters:\n")

print(rf_model$bestTune)

# 5. Evaluate Model Performance

cat("\nCross-Validation Results:\n")

print(rf_model$resample)

# 6. Visualize Model Performance

plot(rf_model)

# 7. Predictions on a New Dataset (Optional)

new_data <- data.frame(
  Age = c(25, 40, 65),
  Income = c(40000, 55000, 70000),
  Credit_Score = c(650, 720, 680)
)

# Predict using the trained model

predictions <- predict(rf_model, new_data)

cat("\nPredictions on New Data:\n")

print(data.frame(new_data, Predicted_Default = predictions))

OUTPUT:
RESULT:

The R script for Cross-Validation was implemented and the output verified.
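
What trainControl(method = "cv") automates can be written out as an explicit loop over folds. The following is a minimal sketch in base R. To stay dependency-free it swaps in a linear model on the built-in mtcars data instead of the random forest classifier above, and it uses 5 folds with RMSE as the metric, all of which are assumptions made for illustration.

```r
# Manual k-fold cross-validation (illustrative: lm on mtcars, not the rf model above)
set.seed(123)
k <- 5
n <- nrow(mtcars)

# Assign each row to one of k folds at random, with near-equal fold sizes
fold_id <- sample(rep(1:k, length.out = n))

rmse <- numeric(k)
for (i in 1:k) {
  train_rows <- mtcars[fold_id != i, ]  # k - 1 folds for fitting
  test_rows  <- mtcars[fold_id == i, ]  # held-out fold for evaluation

  fit  <- lm(mpg ~ wt + hp, data = train_rows)
  pred <- predict(fit, newdata = test_rows)

  rmse[i] <- sqrt(mean((test_rows$mpg - pred)^2))
}

cat("RMSE per fold:", round(rmse, 3), "\n")
cat("Mean cross-validated RMSE:", round(mean(rmse), 3), "\n")
```

Every row is used for evaluation exactly once, and the averaged metric plays the same role as the resampled accuracy that caret reports in rf_model$resample.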

Write an R script to implement the Ordinary Least Squares (OLS).

Ex.No:12
Date:
AIM:

Implement the Ordinary Least Squares (OLS) in R script.

ALGORITHM:

Step 1: Load the ggplot2 package to enable visualization of the regression results.
Step 2: Load the built-in mtcars dataset into the R environment.
Step 3: Use head() to display the first six rows of the mtcars dataset for a quick preview.
Step 4: Use the lm() function to fit a linear regression model of mpg on wt, hp, and disp.
Step 5: Use the summary() function to view key details of the regression results: the coefficients (the estimated effect of each independent variable), R-squared (how well the model explains the variance in mpg), and the residuals (the differences between actual and predicted values).
Step 6: Plot mpg against wt with a fitted regression line using geom_smooth(method = "lm").

PROGRAM:

# Load necessary libraries

library(ggplot2)

# Load the dataset

data(mtcars)
# View the first few rows of the dataset

head(mtcars)

# Perform OLS regression

model <- lm(mpg ~ wt + hp + disp, data = mtcars)

# View the summary of the model

summary(model)

# Plot the regression results

ggplot(mtcars, aes(x = wt, y = mpg)) +

geom_point() +

geom_smooth(method = "lm", formula = y ~ x, col = "red") +

labs(title = "OLS Regression: MPG vs Weight",

x = "Weight (1000 lbs)",

y = "Miles per Gallon (MPG)")

OUTPUT:

RESULT:

The R script for Ordinary Least Squares (OLS) regression was executed successfully.
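
As a cross-check on what lm() computes, the OLS estimate can also be obtained by solving the normal equations (X'X) b = X'y directly. The sketch below does this for the same formula as the program above (mpg ~ wt + hp + disp). The names X, y, beta_hat and r_squared are introduced here for illustration only. Note that lm() itself uses a QR decomposition rather than this formula, so the comparison is a numerical check and not a description of its internals.

```r
data(mtcars)

# Design matrix with an intercept column, and the response vector
X <- cbind(Intercept = 1, as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg

# Solve the normal equations (X'X) b = X'y for the coefficient vector
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Fitted values, residuals and R-squared computed by hand
fitted_values <- X %*% beta_hat
residuals_manual <- y - fitted_values
r_squared <- 1 - sum(residuals_manual^2) / sum((y - mean(y))^2)

# Compare with the results from lm()
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
print(cbind(manual = drop(beta_hat), lm = coef(model)))
print(c(manual = r_squared, lm = summary(model)$r.squared))
```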

Write an R script to implement the Linear regression algorithm.
Ex.No:13
Date:

AIM:

To implement the linear regression algorithm in the R programming language.

ALGORITHM:

Step 1:Check whether the ggplot2 package is installed.

Step 2:If it is not, install the package; then load ggplot2.

Step 3:Set a seed for reproducibility using set.seed(123).

Step 4:Create a data frame named data with three variables: Age, Income, and Spending.

Step 5:Use the lm() function to fit a linear regression model of Spending on Age and Income.

Step 6:Call summary(lm_model) to print details such as coefficients, R-squared value, p-values, residuals, and standard errors.

Step 7:Use ggplot2 to plot the relationship between Age and Spending with a fitted regression line.

PROGRAM:

# Load necessary library

if (!require("ggplot2")) install.packages("ggplot2")

library(ggplot2)

# Sample dataset

set.seed(123)

data <- data.frame(

Age = sample(20:70, 100, replace = TRUE),

Income = round(rnorm(100, mean = 50000, sd = 10000), 0)

)

data$Spending <- 2000 + 50 * data$Age + 0.3 * data$Income + rnorm(100, mean = 0, sd = 5000)

# Fit Linear Regression Model

lm_model <- lm(Spending ~ Age + Income, data = data)

# Model Summary

summary(lm_model)

# Scatter plot with regression line

ggplot(data, aes(x = Age, y = Spending)) +

geom_point() +

geom_smooth(method = "lm", formula = y ~ x, color = "red") +

labs(title = "Linear Regression: Spending vs Age", x = "Age", y = "Spending")

OUTPUT:

RESULT:

The linear regression algorithm was implemented successfully in an R script.
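
The quantities that summary(lm_model) prints (coefficients, standard errors, p-values, R-squared) can also be pulled out of the summary object and reused, and the fitted model can be applied to new rows with predict(). The sketch below rebuilds the same simulated dataset so that it runs on its own. The names model_summary, coef_table, new_customers and predicted_spending, and the values in new_customers, are hypothetical.

```r
# Rebuild the simulated dataset used in the program above
set.seed(123)
data <- data.frame(
  Age = sample(20:70, 100, replace = TRUE),
  Income = round(rnorm(100, mean = 50000, sd = 10000), 0)
)
data$Spending <- 2000 + 50 * data$Age + 0.3 * data$Income +
  rnorm(100, mean = 0, sd = 5000)

lm_model <- lm(Spending ~ Age + Income, data = data)
model_summary <- summary(lm_model)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- model_summary$coefficients
print(coef_table)

# R-squared and residual standard error
print(model_summary$r.squared)
print(model_summary$sigma)

# Predict spending for hypothetical new customers (values are made up)
new_customers <- data.frame(Age = c(30, 55), Income = c(45000, 60000))
predicted_spending <- predict(lm_model, newdata = new_customers)
print(predicted_spending)
```

Since Spending was generated with a coefficient of 0.3 on Income, the estimated Income coefficient should land near that value, which is a quick sanity check on the fit.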

Write an R script to implement the K-Means clustering algorithm.
Ex.No:14
Date:

AIM:

To write an R script to implement the K-Means clustering algorithm in RStudio.

ALGORITHM:

Step 1:Load the ggplot2 library, which is used to create visualizations in R.

Step 2:Load the iris dataset, a built-in dataset in R that contains measurements of iris flowers for three species (setosa, versicolor, virginica).

Step 3:Apply K-Means clustering to the four measurement columns so that the algorithm groups the data into 3 clusters.

Step 4:Add the computed cluster labels (1, 2, or 3) to the iris dataset.

Step 5:Use as.factor() to ensure the cluster labels are treated as categorical values.

Step 6:Visualize the clusters with a ggplot2 scatter plot: X-axis: Sepal.Length, Y-axis: Sepal.Width, Color: Cluster assignment.

PROGRAM:

# Load necessary library

library(ggplot2)

# Load the dataset

data(iris)

# Perform K-Means clustering with 3 clusters

set.seed(123) # For reproducibility

kmeans_result <- kmeans(iris[, 1:4], centers = 3)

# Add the cluster results to the dataset

iris$Cluster <- as.factor(kmeans_result$cluster)

# Plot the clusters

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Cluster)) +

geom_point() +

labs(title = "K-Means Clustering of Iris Dataset",

x = "Sepal Length",

y = "Sepal Width")

OUTPUT:
RESULT:

The K-Means clustering algorithm was implemented successfully.
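
Because iris also carries the true species labels, the clustering can be checked against them. The sketch below is one possible way to do that, not part of the original exercise. It cross-tabulates cluster labels against Species and prints the within-cluster sum of squares. The nstart = 25 argument is an added assumption: it reruns K-Means from several random starts and keeps the best solution, which makes the result less dependent on the initial centers. The names km and cluster_vs_species are illustrative.

```r
data(iris)

set.seed(123)
# nstart = 25 (an assumed setting) reruns K-Means from 25 random starts
# and keeps the solution with the lowest total within-cluster sum of squares
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Cross-tabulate cluster labels against the known species
cluster_vs_species <- table(Cluster = km$cluster, Species = iris$Species)
print(cluster_vs_species)

# Cluster sizes, centers, and total within-cluster sum of squares
print(km$size)
print(km$centers)
print(km$tot.withinss)

# Share of the total variance explained by the clustering
print(km$betweenss / km$totss)
```

Cluster numbers are arbitrary, so the table should be read by looking at which species dominates each row rather than by matching cluster 1 to the first species.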

Write an R script to implement the Naive Bayes classifier.
Ex.No:15
Date:

AIM:

To implement the Naive Bayes classifier in an R script.

ALGORITHM:

Step 1:Load the e1071 package, which provides an implementation of the Naive Bayes classifier in R.

Step 2:Load the built-in iris dataset, which contains 150 samples of iris flowers with four features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width).

Step 3:Split the data into a training set (70%) and a test set (30%).

Step 4:Train the model with the formula Species ~ ., which means predicting Species based on all other features.

Step 5:Use the trained model to predict the species of flowers in the test set.

Step 6:Create a confusion matrix comparing the actual vs. predicted species; this helps measure the accuracy of the model.

Step 7:Compute the accuracy by dividing correct predictions by the total number of test samples.

Step 8:Internally, the model estimates the probability of each class, P(Species), from the training set.

Step 9:It estimates the probability of each feature value given a class using the Gaussian (Normal) distribution.

Step 10:The class with the highest probability is selected as the prediction.

PROGRAM:

# Load necessary library

library(e1071)
# Load the dataset

data(iris)

# Split the data into training and test sets

set.seed(123) # For reproducibility

train_index <- sample(1:nrow(iris), 0.7 * nrow(iris))

train_data <- iris[train_index, ]

test_data <- iris[-train_index, ]

# Train the Naive Bayes model

model <- naiveBayes(Species ~ ., data = train_data)

# Predict on the test set

predictions <- predict(model, test_data)

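# Evaluate the model

confusion_matrix <- table(Predicted = predictions, Actual = test_data$Species)

print(confusion_matrix)

# Accuracy: correct predictions divided by the number of test samples

accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

print(accuracy)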

OUTPUT:

RESULT:

The Naive Bayes classifier was implemented successfully.
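
Steps 8 to 10 of the algorithm (class priors, Gaussian likelihoods, and choosing the most probable class) can be written out by hand in base R. The sketch below is a simplified illustration under that Gaussian assumption and is not the e1071 source code. The helper log_posterior and the names priors, class_means and class_sds are introduced here for illustration. Log-probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow.

```r
data(iris)
set.seed(123)
train_index <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_index, ]
test_data  <- iris[-train_index, ]

features <- names(iris)[1:4]
classes  <- levels(iris$Species)

# Prior probability P(Species) estimated from the training set
priors <- table(train_data$Species) / nrow(train_data)

# Per-class mean and standard deviation of each feature (features x classes)
class_means <- sapply(classes, function(cl)
  colMeans(train_data[train_data$Species == cl, features]))
class_sds <- sapply(classes, function(cl)
  apply(train_data[train_data$Species == cl, features], 2, sd))

# Log-posterior (up to a constant) of one observation x for each class:
# log prior + sum of Gaussian log-densities of the individual features
log_posterior <- function(x) {
  sapply(classes, function(cl) {
    log(priors[[cl]]) +
      sum(dnorm(x, mean = class_means[, cl], sd = class_sds[, cl], log = TRUE))
  })
}

# Select the class with the highest posterior for every test row
test_matrix <- as.matrix(test_data[, features])
predicted <- apply(test_matrix, 1,
                   function(x) classes[which.max(log_posterior(x))])
predicted <- factor(predicted, levels = classes)

confusion_matrix <- table(Predicted = predicted, Actual = test_data$Species)
print(confusion_matrix)

accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(accuracy)
```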
