R Record-1

The document outlines several R scripts aimed at performing various data operations including import/export, pre-processing, descriptive statistics, visualization, and distribution implementations. Each exercise includes an algorithm, program code, and results confirming successful execution. The scripts utilize libraries like dplyr and ggplot2 for data manipulation and visualization.

Uploaded by

balajics3112

Write an R Script to perform the data Import and Export Operations

Ex.No:01
Date:

AIM:

To write an R script to perform data import and export operations.

ALGORITHM:

Step 1: Define a data frame df with the columns Name, Language, and Age.
Step 2: Supply the column names and their values when constructing the data frame.
Step 3: Call write.table() to export df to a text file, giving the file name "myDataFrame.txt" through the file parameter.
Step 4: Specify the separator and the row and column name options (sep, row.names, col.names) in the same call.
Step 5: Check the working directory to confirm the export succeeded, and open the file to verify its structure and formatting.
Step 6: Use file.choose() inside read.csv() so that the user can manually select a CSV file from their system.
Step 7: Pass header = TRUE so that the first row of the file is treated as column names.
Step 8: Print the imported data frame to display all of its rows and columns in the console.

PROGRAM:

# EXPORTING DATA

# Build a small data frame and write it to a tab-separated text file

df = data.frame(

"Name" = c("Amiya", "Raj", "Asish"),

"Language" = c("R", "python", "java"),

"Age" = c(22, 25, 45))

write.table(df,

file = "myDataFrame.txt",

sep = "\t",

row.names = TRUE,

col.names = NA)

# IMPORTING DATA

#import and store the dataset in data1

data1 <- read.csv(file.choose(), header=T)

# display the data

data1
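
Because file.choose() needs a user at the keyboard, the import step cannot be checked automatically. The following is a minimal sketch of a non-interactive round trip, assuming a temporary file and row.names = FALSE (both choices are illustrative and not part of the exercise); the name df_back is hypothetical.

```r
# Same data frame as in the exercise
df <- data.frame(
  Name = c("Amiya", "Raj", "Asish"),
  Language = c("R", "python", "java"),
  Age = c(22, 25, 45)
)

# Export to a temporary tab-separated file (assumption: no row names written)
tmp <- tempfile(fileext = ".txt")
write.table(df, file = tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# Import it again; read.delim() expects tab-separated input with a header row
df_back <- read.delim(tmp, header = TRUE, stringsAsFactors = FALSE)
print(df_back)

# Remove the temporary file
unlink(tmp)
```

Writing without row names keeps the header line aligned with the data columns, which makes the file easier to read back.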

OUTPUT:
RESULT:

The data import and export operations were carried out in RStudio.

Ex.No:02
Date:
Write an R Script to perform the Data Pre-processing techniques

AIM:

To write an R script to perform data pre-processing techniques in RStudio.

ALGORITHM:

Step 1: Load the dplyr and tidyr packages for data manipulation.

Step 2: Load the mtcars dataset and display the first few rows using head(mtcars).

Step 3: Introduce a few missing values for demonstration, then count and print the total number of missing values in the dataset.

Step 4: Impute the missing values using the mean of each column.

Step 5: Introduce a duplicate row by appending the first row to the dataset, then remove duplicate rows using distinct().

Step 6: Normalize the mpg column using the scale() function.

Step 7: Create a new feature, power_to_weight_ratio, by dividing hp by wt.

Step 8: Create an additional dataset with car names and randomly assigned country values (USA, Japan, Europe).

Step 9: Add a new column car to store the row names, and merge the two datasets on the car column.

Step 10: Display the first few rows of the final dataset using head(mtcars).

PROGRAM:

# Load necessary libraries

library(dplyr)

library(tidyr)

# Load the dataset

data(mtcars)

# View the first few rows of the dataset

head(mtcars)

# 1. Handling Missing Data

# For demonstration, let's introduce some missing values

mtcars[1, "mpg"] <- NA


mtcars[5, "hp"] <- NA

# Identify missing values

missing_values <- sum(is.na(mtcars))

print(paste("Number of missing values:", missing_values))

# Impute missing values with the mean of the column

mtcars <- mtcars %>%

mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# 2. Removing Duplicates

# For demonstration, let's introduce a duplicate row

mtcars <- rbind(mtcars, mtcars[1, ])

# Remove duplicate rows

mtcars <- mtcars %>%

distinct()

# 3. Data Transformation

# Normalize the 'mpg' column

mtcars <- mtcars %>%

mutate(mpg_normalized = as.numeric(scale(mpg))) # as.numeric() stores a plain numeric column instead of a one-column matrix

# 4. Feature Engineering

# Create a new feature 'power_to_weight_ratio'

mtcars <- mtcars %>%

mutate(power_to_weight_ratio = hp / wt)
# 5. Data Integration

# For demonstration, let's create a second dataset

additional_data <- data.frame(

car = rownames(mtcars),

country = sample(c("USA", "Japan", "Europe"), nrow(mtcars), replace = TRUE)

)

# Merge the datasets

mtcars <- mtcars %>%

mutate(car = rownames(mtcars)) %>%

left_join(additional_data, by = "car")

# View the final pre-processed dataset

head(mtcars)
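
The imputation, duplicate removal, and normalization steps can also be checked on a tiny example in base R. This is a minimal sketch under the assumption of made-up values; the names x, x_imputed, z, and small are hypothetical and do not appear in the exercise.

```r
# A short numeric vector with two missing values (illustrative data)
x <- c(21.0, NA, 22.8, 21.4, NA, 18.1)

# Mean imputation: replace each NA with the mean of the observed values
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# z-score normalization: subtract the mean and divide by the standard deviation
z <- as.numeric(scale(x_imputed))

# Duplicate removal in base R: keep only the first occurrence of each row
small <- data.frame(a = c(1, 2, 1), b = c("x", "y", "x"))
small_unique <- small[!duplicated(small), ]

print(x_imputed)
print(round(z, 3))
print(small_unique)
```

After scaling, the values have mean 0 and standard deviation 1, which is what the mpg_normalized column holds for mpg.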

OUTPUT:

RESULT:

The Data Pre-processing techniques are executed successfully.


Ex.No:03
Date:
Write an R Script to perform the descriptive statistics concepts.

AIM:

To write an R script to perform the descriptive statistics concepts.

ALGORITHM:

Step 1: Check whether the dplyr package is installed; if not, install it. Load the dplyr library.

Step 2: Set a random seed (set.seed(123)) for reproducibility and create a sample dataset with the columns ID, Age, Height, Weight, and Gender.

Step 3: Display the first six rows of data using head(data).

Step 4: Use summary(data) to get basic statistics (minimum, maximum, median, mean, and quartiles) for the numeric columns.

Step 5: Compute the mean, median, variance, and standard deviation of Age.

Step 6: Compute the frequency distribution of the Gender column using table(data$Gender).

Step 7: Compute the percentage distribution using prop.table() and multiply by 100.

Step 8: Calculate the Pearson correlation between Height and Weight using cor(data$Height, data$Weight).

Step 9: Compute the mean and standard deviation of Height and of Weight for each Gender.

Step 10: Draw a histogram with hist(data$Age), using blue bars and a black border, and a boxplot of Height by Gender.

PROGRAM:
# Load necessary library

if (!require("dplyr")) install.packages("dplyr")

library(dplyr)

# Create a sample dataset

set.seed(123)

data <- data.frame(

ID = 1:100,

Age = sample(18:60, 100, replace = TRUE),

Height = round(rnorm(100, mean = 165, sd = 10), 1),

Weight = round(rnorm(100, mean = 65, sd = 15), 1),

Gender = sample(c("Male", "Female"), 100, replace = TRUE)

)

# View the first few rows of the dataset

head(data)

# Descriptive statistics

# Summary statistics for numeric variables

summary(data)
# Mean, Median, Variance, and Standard Deviation for a specific column

mean_age <- mean(data$Age)

median_age <- median(data$Age)

var_age <- var(data$Age)

sd_age <- sd(data$Age)

cat("Mean Age:", mean_age, "\n")

cat("Median Age:", median_age, "\n")

cat("Variance of Age:", var_age, "\n")

cat("Standard Deviation of Age:", sd_age, "\n")

# Frequency distribution for categorical variable

gender_distribution <- table(data$Gender)

cat("\nGender Distribution:\n")

print(gender_distribution)

# Percentage distribution

gender_percentage <- prop.table(gender_distribution) * 100

cat("\nGender Percentage Distribution:\n")

print(round(gender_percentage, 2))

# Correlation between two numeric variables


correlation <- cor(data$Height, data$Weight)

cat("\nCorrelation between Height and Weight:", correlation, "\n")

# Grouped statistics: Mean Height and Weight by Gender

grouped_stats <- data %>%

group_by(Gender) %>%

summarise(

Mean_Height = mean(Height),

Mean_Weight = mean(Weight),

SD_Height = sd(Height),

SD_Weight = sd(Weight)

)

cat("\nGrouped Statistics by Gender:\n")

print(grouped_stats)

# Histogram for Age

hist(data$Age, main = "Histogram of Age", xlab = "Age", col = "blue", border = "black")

# Boxplot for Height by Gender

boxplot(Height ~ Gender, data = data,

main = "Boxplot of Height by Gender",


xlab = "Gender", ylab = "Height (cm)",

col = c("lightblue", "pink"))
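
Two of the statistics used above are easy to verify by hand. The sketch below assumes small made-up vectors (ages and gender are hypothetical values, not the generated dataset) and shows that var() divides by n - 1 and that prop.table() turns counts into shares.

```r
# Sample variance: sum of squared deviations divided by (n - 1)
ages <- c(20, 30, 40, 50)
manual_var <- sum((ages - mean(ages))^2) / (length(ages) - 1)
builtin_var <- var(ages)

# Frequency and percentage distribution of a categorical vector
gender <- c("Male", "Female", "Female", "Male", "Female")
counts <- table(gender)
percent <- prop.table(counts) * 100

cat("Manual variance:", manual_var, " Built-in variance:", builtin_var, "\n")
print(counts)
print(percent)
```

The standard deviation reported by sd() is the square root of this sample variance.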

OUTPUT:

RESULT:

The R script for the descriptive statistics concepts was executed successfully.

Ex.No:04
Date:
Visualizing the data in different graphics using R Script.

AIM:

To write an R Script for visualizing the data in different graphics.


ALGORITHM:

Step 1: Check whether the ggplot2 and dplyr libraries are installed. If not, install them using install.packages(), then load them using library().
Step 2: Set a random seed for reproducibility and create a sample data frame (data) with the variables ID, Age, Height, Weight, and Gender.
Step 3: Scatter plot: plot Height against Weight using geom_point() with size and transparency, color-coded by Gender. Set the title, labels, and theme.
Step 4: Histogram: plot the bars using geom_histogram() with binwidth = 5 to group ages into intervals of 5 years.
Step 5: Boxplot: plot Height by Gender using geom_boxplot().
Step 6: Bar plot: group the dataset by Gender, count occurrences using summarise(), and plot the counts.
Step 7: Line chart: sort the dataset by ID using arrange() and draw the line using geom_line().
Step 8: Density plot: draw density curves of Weight using geom_density(), with the fill color based on Gender.
Step 9: Pie chart: group the dataset by Gender, count occurrences, and draw the bars in polar coordinates.
Step 10: Faceted plot: draw the Height vs. Weight scatter plot in one panel per Gender using facet_wrap().

PROGRAM:

# Load necessary library

if (!require("ggplot2")) install.packages("ggplot2")

if (!require("dplyr")) install.packages("dplyr")

library(ggplot2)

library(dplyr)

# Create a sample dataset

set.seed(123)

data <- data.frame(


ID = 1:100,

Age = sample(18:60, 100, replace = TRUE),

Height = round(rnorm(100, mean = 165, sd = 10), 1),

Weight = round(rnorm(100, mean = 65, sd = 15), 1),

Gender = sample(c("Male", "Female"), 100, replace = TRUE)

)

# 1. Scatter Plot: Height vs. Weight

ggplot(data, aes(x = Height, y = Weight, color = Gender)) +

geom_point(size = 3, alpha = 0.7) +

labs(title = "Scatter Plot of Height vs. Weight",

x = "Height (cm)",

y = "Weight (kg)") +

theme_minimal()

# 2. Histogram: Distribution of Age

ggplot(data, aes(x = Age)) +

geom_histogram(binwidth = 5, fill = "blue", color = "black", alpha = 0.7) +

labs(title = "Histogram of Age",

x = "Age (years)",

y = "Frequency") +

theme_minimal()

# 3. Boxplot: Height by Gender


ggplot(data, aes(x = Gender, y = Height, fill = Gender)) +

geom_boxplot() +

labs(title = "Boxplot of Height by Gender",

x = "Gender",

y = "Height (cm)") +

theme_minimal()

# 4. Bar Plot: Gender Distribution

gender_distribution <- data %>%

group_by(Gender) %>%

summarise(Count = n())

ggplot(gender_distribution, aes(x = Gender, y = Count, fill = Gender)) +

geom_bar(stat = "identity", width = 0.6) +

labs(title = "Bar Plot of Gender Distribution",

x = "Gender",

y = "Count") +

theme_minimal()

# 5. Line Chart: Age Trend (sorted by ID)

data <- data %>% arrange(ID)

ggplot(data, aes(x = ID, y = Age, group = 1)) +

geom_line(color = "darkgreen", size = 1) +

geom_point(color = "red", size = 2) +


labs(title = "Line Chart of Age Trend by ID",

x = "ID",

y = "Age (years)") +

theme_minimal()

# 6. Density Plot: Weight Distribution

ggplot(data, aes(x = Weight, fill = Gender)) +

geom_density(alpha = 0.5) +

labs(title = "Density Plot of Weight Distribution",

x = "Weight (kg)",

y = "Density") +

theme_minimal()

# 7. Pie Chart: Gender Proportion

gender_proportion <- data %>%

group_by(Gender) %>%

summarise(Count = n())

ggplot(gender_proportion, aes(x = "", y = Count, fill = Gender)) +

geom_bar(stat = "identity", width = 1) +

coord_polar("y") +

labs(title = "Pie Chart of Gender Proportion") +

theme_void()

# 8. Faceted Plot: Scatter Plot of Height vs. Weight by Gender


ggplot(data, aes(x = Height, y = Weight)) +

geom_point(aes(color = Gender), size = 2, alpha = 0.7) +

facet_wrap(~ Gender) +

labs(title = "Scatter Plot of Height vs. Weight (Faceted by Gender)",

x = "Height (cm)",

y = "Weight (kg)") +

theme_minimal()
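
The effect of grouping ages into 5-year intervals, as the histogram does with binwidth = 5, can be inspected numerically in base R. This is a sketch under stated assumptions: the ages vector and the break points are made up, and ggplot2 positions its bin boundaries differently by default, so its counts need not match these exactly.

```r
# Illustrative ages (not the generated dataset)
ages <- c(18, 22, 23, 31, 34, 35, 47, 59, 60)

# Break points every 5 years; hist() uses right-closed intervals, e.g. (20, 25]
breaks <- seq(15, 60, by = 5)
h <- hist(ages, breaks = breaks, plot = FALSE)

# One count per 5-year interval
bin_table <- data.frame(
  from = head(h$breaks, -1),
  to = tail(h$breaks, -1),
  count = h$counts
)
print(bin_table)
```

Each row of bin_table corresponds to one bar of a histogram drawn with these breaks.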

OUTPUT:

RESULT:

Visualizing the data in different graphics using R Script was executed successfully.
Ex.No:05
Date:
Write an R Script to implement the Normal and binomial distribution.

AIM:

To write an R script to implement the normal and binomial distributions.

ALGORITHM:

Step 1: Load the ggplot2 package, which is used for visualization.
Step 2: Set a random seed to ensure reproducibility of the results.
Step 3: Generate 1000 random observations from a normal distribution with mean 0 and standard deviation 1.
Step 4: Create a histogram to visualize the distribution: use geom_histogram() to plot the density of the normal data and overlay the theoretical normal curve.
Step 5: Generate 1000 random observations from a binomial distribution with number of trials (size) = 10 and probability of success (prob) = 0.5.
Step 6: Create a bar plot to visualize the binomial distribution, using geom_bar() to plot the proportion of each outcome.

PROGRAM:

# Load necessary library

library(ggplot2)

# Generate a sample of 1000 observations from a normal distribution


set.seed(123)

normal_data <- rnorm(1000, mean = 0, sd = 1)

# Plot the histogram of the normal distribution

ggplot(data.frame(x = normal_data), aes(x = x)) +

geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "blue", alpha = 0.5) + # after_stat() replaces the older ..density.. notation (ggplot2 3.3.0 or later)

stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "red", size = 1) +

labs(title = "Normal Distribution", x = "Value", y = "Density")

# Load necessary library

library(ggplot2)

# Generate a sample of 1000 observations from a binomial distribution

set.seed(123)

binomial_data <- rbinom(1000, size = 10, prob = 0.5)

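# Plot a bar chart of the binomial distribution (the outcomes are discrete counts)

ggplot(data.frame(x = binomial_data), aes(x = factor(x))) +

geom_bar(aes(y = after_stat(prop), group = 1), fill = "green", alpha = 0.5) + # after_stat() replaces the older ..prop.. notation (ggplot2 3.3.0 or later)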

labs(title = "Binomial Distribution", x = "Number of Successes", y = "Proportion")
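
The bar heights in the binomial plot are empirical proportions, and they can be compared with the exact probabilities from dbinom(). A minimal sketch, assuming the same seed, size, and prob as the program; the names theoretical, empirical, and comparison are hypothetical.

```r
# Exact binomial probabilities for 0..10 successes in 10 trials
k <- 0:10
theoretical <- dbinom(k, size = 10, prob = 0.5)

# Empirical proportions from 1000 simulated draws
set.seed(123)
draws <- rbinom(1000, size = 10, prob = 0.5)
empirical <- as.numeric(table(factor(draws, levels = k))) / length(draws)

# Side-by-side comparison
comparison <- data.frame(
  successes = k,
  theoretical = round(theoretical, 4),
  empirical = empirical
)
print(comparison)
```

With 1000 draws the two columns should be close but not identical; the difference is sampling variation.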

OUTPUT:

RESULT:

The normal and binomial distributions were implemented successfully in R.


Write an R Script to convert numerical data to categorical and binomial distribution.
Ex.No:06
Date:

AIM:

To write an R script to convert numerical data to categorical and binomial distribution.

ALGORITHM:

Step 1: Set a random seed so that the randomly generated numbers are the same every time the script runs.

Step 2: Create a data frame data with 20 observations of ID, Age, and Income.

Step 3: Use the cut() function to group ages into predefined age groups and incomes into income brackets.

Step 4: Print the dataset after the categorical variables have been added.

Step 5: Count the occurrences of each age group and of each income bracket using table().

Step 6: Check whether the ggplot2 package is installed; if not, install it, then load it.

Step 7: Draw a bar plot of the age groups, passing the dataset and the x-axis variable to ggplot().

Step 8: Draw a similar bar plot of the income brackets with different colors.

Step 9: The output consists of the dataset with numerical and categorical variables, the frequency counts of the categories, and the two bar plots.

PROGRAM:

# Create a sample dataset with numerical data

set.seed(123)

data <- data.frame(

ID = 1:20,

Age = sample(18:60, 20, replace = TRUE),

Income = round(rnorm(20, mean = 50000, sd = 10000), 0)

)

# View the dataset

print("Original Dataset:")

print(data)

# Convert numerical data to categorical variables


# 1. Age categorization into age groups

data$Age_Group <- cut(

data$Age,

breaks = c(0, 18, 30, 45, 60, Inf),

labels = c("Child", "Young Adult", "Adult", "Middle Age", "Senior"),

right = FALSE

)

# 2. Income categorization into income brackets

data$Income_Bracket <- cut(

data$Income,

breaks = c(-Inf, 30000, 50000, 70000, Inf),

labels = c("Low", "Medium", "High", "Very High"),

right = TRUE

)

# View the updated dataset

print("Updated Dataset with Categorical Variables:")

print(data)

# Frequency tables for the new categorical variables

cat("\nFrequency of Age Groups:\n")

print(table(data$Age_Group))

cat("\nFrequency of Income Brackets:\n")


print(table(data$Income_Bracket))

# Visualize the categorical variables

if (!require("ggplot2")) install.packages("ggplot2")

library(ggplot2)

# Bar plot for Age Groups

ggplot(data, aes(x = Age_Group)) +

geom_bar(fill = "skyblue", color = "black") +

labs(title = "Bar Plot of Age Groups",

x = "Age Group",

y = "Count") +

theme_minimal()

# Bar plot for Income Brackets

ggplot(data, aes(x = Income_Bracket)) +

geom_bar(fill = "orange", color = "black") +

labs(title = "Bar Plot of Income Brackets",

x = "Income Bracket",

y = "Count") +

theme_minimal()
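
The right argument of cut() decides which bracket a boundary value falls into, which matters for ages such as 18 or 30. The sketch below uses the same breaks and labels as the program on a few hand-picked boundary values (the test values themselves are an assumption for illustration).

```r
# Boundary ages with right = FALSE: intervals are [0,18), [18,30), [30,45), [45,60), [60,Inf)
ages <- c(17, 18, 29, 30, 45, 60)
age_group <- cut(
  ages,
  breaks = c(0, 18, 30, 45, 60, Inf),
  labels = c("Child", "Young Adult", "Adult", "Middle Age", "Senior"),
  right = FALSE
)

# Boundary incomes with right = TRUE: intervals are (-Inf,30000], (30000,50000], ...
incomes <- c(30000, 50000, 50001)
income_bracket <- cut(
  incomes,
  breaks = c(-Inf, 30000, 50000, 70000, Inf),
  labels = c("Low", "Medium", "High", "Very High"),
  right = TRUE
)

print(data.frame(ages, age_group))
print(data.frame(incomes, income_bracket))
```

With right = FALSE an age of exactly 18 is a "Young Adult", while with right = TRUE an income of exactly 50000 still counts as "Medium".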

OUTPUT:

RESULT:
The R script to convert numerical data to categorical and binomial distribution was executed successfully.

Ex.No:07
Date:
Write an R Script to implement Bayes’ Theorem.

Aim:

To write an R script for Bayes’ Theorem.

ALGORITHM:

Step 1: Define the prior P(A), the probability that a randomly chosen person has the disease.

Step 2: Define the likelihood P(B|A), the probability that the test is positive given that the person has the disease.

Step 3: Define P(B|¬A), the probability that the test is positive given that the person does not have the disease (the false positive rate).

Step 4: Compute the marginal probability of a positive test: P(B) = P(B|A) P(A) + P(B|¬A) P(¬A).

Step 5: Apply Bayes' theorem to compute the posterior P(A|B), the probability that a person has the disease given a positive test result.

Step 6: Print the prior, the likelihood, the marginal probability, and the posterior.

Step 7: Draw a bar plot showing the prior, likelihood, and posterior probabilities.

PROGRAM:

# Define probabilities

# Example Scenario: Testing for a disease

# A: Person has the disease

# B: Test result is positive

# Prior probability (P(A)): Probability of having the disease

P_A <- 0.01 # 1% of the population has the disease

# Likelihood (P(B|A)): Probability of a positive test result given the person has the disease

P_B_given_A <- 0.95 # Sensitivity: 95% of people with the disease test positive

# Marginal probability of B (P(B)):

# P(B) = P(B|A) * P(A) + P(B|¬A) * P(¬A)

P_B_given_not_A <- 0.05 # 5% false positive rate

P_not_A <- 1 - P_A # Probability of not having the disease

P_B <- (P_B_given_A * P_A) + (P_B_given_not_A * P_not_A)

# Posterior probability (P(A|B)): Using Bayes' Theorem

P_A_given_B <- (P_B_given_A * P_A) / P_B

# Print the results

cat("P(A): Prior Probability of Having the Disease =", P_A, "\n")

cat("P(B|A): Likelihood of Positive Test Given Disease =", P_B_given_A, "\n")

cat("P(B): Marginal Probability of a Positive Test =", P_B, "\n")


cat("P(A|B): Posterior Probability of Having the Disease Given a Positive Test =",
P_A_given_B, "\n")

# Additional Example: Visualization

# Visualize the probabilities using a bar plot

if (!require("ggplot2")) install.packages("ggplot2")

library(ggplot2)

data <- data.frame(

Category = c("Prior: P(A)", "Likelihood: P(B|A)", "Posterior: P(A|B)"),

Probability = c(P_A, P_B_given_A, P_A_given_B)

)

ggplot(data, aes(x = Category, y = Probability, fill = Category)) +

geom_bar(stat = "identity", color = "black") +

scale_fill_brewer(palette = "Set3") +

labs(title = "Bayes' Theorem Probabilities",

x = "Probability Component",

y = "Probability") +

theme_minimal()
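
The calculation above can be wrapped in a small function to see how strongly the posterior depends on the prior. This is a sketch only: bayes_posterior() is a hypothetical helper, not part of the exercise, and the list of priors is an assumption for illustration.

```r
# Hypothetical helper: posterior P(A|B) from the prior, sensitivity, and false positive rate
bayes_posterior <- function(prior, sensitivity, false_positive_rate) {
  # Marginal probability of a positive test, P(B)
  marginal <- sensitivity * prior + false_positive_rate * (1 - prior)
  # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
  sensitivity * prior / marginal
}

# Values from the exercise: 1% prevalence, 95% sensitivity, 5% false positives
posterior <- bayes_posterior(0.01, 0.95, 0.05)
cat("Posterior for a 1% prior:", round(posterior, 4), "\n")

# The same test applied to populations with different prevalence
priors <- c(0.001, 0.01, 0.1, 0.5)
posteriors <- bayes_posterior(priors, 0.95, 0.05)
print(data.frame(prior = priors, posterior = round(posteriors, 4)))
```

Even with a sensitive test, a low prior keeps the posterior small because most positive results come from the much larger disease-free group.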

OUTPUT:
RESULT:

Bayes’ Theorem was implemented successfully in an R script.

Write an R Script to implement the Time series data analysis and forecasting.
Ex.No:08
Date:
AIM:

To write an R script that implements time series data analysis and forecasting.

ALGORITHM:

Step 1: Ensure the required packages (forecast, ggplot2, and tseries) are installed and loaded.

Step 2: Generate synthetic monthly time series data and visualize the original series.

Step 3: Decompose the time series into trend, seasonal, and residual components using decompose().

Step 4: Perform the Augmented Dickey-Fuller (ADF) test to check whether the series is stationary.

Step 5: Fit an ARIMA model with auto.arima(), forecast future values, and plot the forecast.

Step 6: Analyze the residuals and perform the Ljung-Box test to check residual independence.

Step 7: Use STL (Seasonal and Trend decomposition using LOESS) as an alternative decomposition.

PROGRAM:

# Install and load necessary packages

if (!require("forecast")) install.packages("forecast")
if (!require("ggplot2")) install.packages("ggplot2")

library(forecast)

library(ggplot2)

# 1. Generate Sample Time Series Data

set.seed(123)

time_series_data <- ts(rnorm(120, mean = 50, sd = 10),

start = c(2010, 1), frequency = 12) # Monthly data from 2010

# Plot the time series data

autoplot(time_series_data) +

labs(title = "Original Time Series Data",

x = "Time",

y = "Value") +

theme_minimal()

# 2. Decompose the Time Series (Trend, Seasonal, Residuals)

decomposed <- decompose(time_series_data)

autoplot(decomposed) +

labs(title = "Decomposed Time Series")

# 3. Check Stationarity using Augmented Dickey-Fuller Test

if (!require("tseries")) install.packages("tseries")

library(tseries)

adf_test <- adf.test(time_series_data)


cat("Augmented Dickey-Fuller Test:\n")

print(adf_test)

# 4. Fit an ARIMA Model for Forecasting

arima_model <- auto.arima(time_series_data)

# Print the ARIMA model summary

cat("\nFitted ARIMA Model:\n")

print(summary(arima_model))

# 5. Forecast the Future Values

forecast_horizon <- 12 # Forecast for the next 12 months

forecast_values <- forecast(arima_model, h = forecast_horizon)

# Plot the forecast

autoplot(forecast_values) +

labs(title = "ARIMA Forecast",

x = "Time",

y = "Value") +

theme_minimal()

# 6. Validate the Model with Residual Analysis

autoplot(arima_model$residuals) +

labs(title = "Residual Analysis",

x = "Time",

y = "Residuals") +
theme_minimal()

# Perform Ljung-Box test for residuals

lb_test <- Box.test(arima_model$residuals, lag = 20, type = "Ljung-Box")

cat("\nLjung-Box Test for Residuals:\n")

print(lb_test)

# 7. Seasonal Decomposition of Time Series using LOESS (STL)

stl_decomposed <- stl(time_series_data, s.window = "periodic")

autoplot(stl_decomposed) +

labs(title = "STL Decomposition")

# 8. Export Forecasted Values (Optional)

# write.csv(data.frame(forecast_values), "forecasted_values.csv")
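
The series generated in this exercise is pure random noise, so its decomposition has little real structure to find. The sketch below, which uses only base R, assumes a made-up series with a known linear trend and a 12-month cycle (the coefficients are illustrative) to show what decompose() and the Ljung-Box test return in that case.

```r
# Synthetic monthly series: level 50, linear trend, yearly sine cycle, small noise
set.seed(123)
months <- 1:120
seasonal_pattern <- 10 * sin(2 * pi * months / 12)
y <- ts(50 + 0.2 * months + seasonal_pattern + rnorm(120, sd = 2),
        start = c(2010, 1), frequency = 12)

# Classical additive decomposition into trend, seasonal, and random parts
parts <- decompose(y)

# The estimated seasonal figure: one value per month, centered on zero
print(round(parts$figure, 2))

# Ljung-Box test on the random component (the NA values at both ends are dropped)
lb <- Box.test(na.omit(parts$random), lag = 20, type = "Ljung-Box")
print(lb)
```

The seasonal component repeats the same 12 values every year, and a small Ljung-Box p-value would indicate that autocorrelation remains in the random part.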

OUTPUT:

RESULT:

Implemented the Time series data analysis and forecasting in R script.


Hypothesis Testing in R Programming.
Ex.No:09
Date:

Aim:

To Write an R Script for Hypothesis Testing.

ALGORITHM:

Step 1: Check whether the ggplot2 package is installed; if not, install it. Load the ggplot2 library.

Step 2: Set a random seed for reproducibility.

Step 3: Generate group1 with 30 values from a normal distribution (mean = 50, sd = 5).

Step 4: Generate group2 with 30 values from a normal distribution (mean = 52, sd = 5).

Step 5: Perform a one-sample t-test. H0: the mean of group1 is equal to 50.

Step 6: Perform an independent two-sample t-test assuming equal variances. H0: the means of group1 and group2 are equal. H1: the means of group1 and group2 are not equal.

Step 7: Perform a paired t-test. H0: the means of the paired samples are equal.

Step 8: Perform a chi-square test. H0: the observed frequencies match the expected frequencies.

Step 9: Perform an ANOVA. H0: the means of all groups are equal. Visualize the groups with a boxplot.

Step 10: Perform a Shapiro-Wilk test. H0: the data are normally distributed.

Step 11: Perform a Wilcoxon rank-sum test (the non-parametric alternative to the two-sample t-test).

Step 12: Perform a correlation test to check the strength of the linear relationship between two variables. H0: there is no correlation between x and y.

PROGRAM:

# Load necessary library

if (!require("ggplot2")) install.packages("ggplot2")

library(ggplot2)

# Create sample data


set.seed(123)

group1 <- rnorm(30, mean = 50, sd = 5) # Sample from Group 1

group2 <- rnorm(30, mean = 52, sd = 5) # Sample from Group 2

# 1. One-Sample t-test

# H0: The mean of group1 is equal to 50

# H1: The mean of group1 is not equal to 50

t_test_one_sample <- t.test(group1, mu = 50)

cat("\nOne-Sample t-test:\n")

print(t_test_one_sample)

# 2. Two-Sample t-test (Independent)

# H0: The means of group1 and group2 are equal

# H1: The means of group1 and group2 are not equal

t_test_two_sample <- t.test(group1, group2, var.equal = TRUE)

cat("\nTwo-Sample t-test:\n")

print(t_test_two_sample)

# 3. Paired t-test

# H0: The means of paired samples are equal

# Simulate paired data


paired_group1 <- rnorm(20, mean = 10, sd = 2)

paired_group2 <- paired_group1 + rnorm(20, mean = 0.5, sd = 1) # Add small difference

t_test_paired <- t.test(paired_group1, paired_group2, paired = TRUE)

cat("\nPaired t-test:\n")

print(t_test_paired)

# 4. Chi-Square Test

# H0: The observed frequencies match the expected frequencies

observed <- c(50, 30, 20) # Observed frequencies

expected <- c(40, 40, 20) # Expected frequencies

chi_sq_test <- chisq.test(observed, p = expected / sum(expected))

cat("\nChi-Square Test:\n")

print(chi_sq_test)

# 5. ANOVA (Analysis of Variance)

# H0: The means of all groups are equal

group_a <- rnorm(20, mean = 5, sd = 1)

group_b <- rnorm(20, mean = 6, sd = 1)

group_c <- rnorm(20, mean = 7, sd = 1)

anova_data <- data.frame(

values = c(group_a, group_b, group_c),


groups = rep(c("A", "B", "C"), each = 20)

)

anova_result <- aov(values ~ groups, data = anova_data)

cat("\nANOVA Test:\n")

print(summary(anova_result)) # print() so the table also appears when the script is run with source()

# 6. Visualizing ANOVA Results

ggplot(anova_data, aes(x = groups, y = values, fill = groups)) +

geom_boxplot() +

labs(title = "Boxplot of Groups for ANOVA",

x = "Group",

y = "Values") +

theme_minimal()

# 7. Shapiro-Wilk Test (Normality Test)

# H0: Data is normally distributed

shapiro_test <- shapiro.test(group1)

cat("\nShapiro-Wilk Test:\n")

print(shapiro_test)

# 8. Wilcoxon Rank-Sum Test (Non-parametric test)


# H0: The distributions of group1 and group2 are the same

wilcox_test <- wilcox.test(group1, group2)

cat("\nWilcoxon Rank-Sum Test:\n")

print(wilcox_test)

# 9. Correlation Test

# H0: There is no correlation between x and y

x <- rnorm(50, mean = 10, sd = 2)

y <- 2 * x + rnorm(50, mean = 0, sd = 1)

cor_test <- cor.test(x, y)

cat("\nCorrelation Test:\n")

print(cor_test)
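
The two-sample t statistic reported by t.test() can be reproduced by hand, which makes the equal-variance assumption explicit. A minimal sketch, assuming the same seed and group definitions as the program and a significance level of 0.05 (the level is an assumption; the exercise does not state one).

```r
# Same sample data as in the exercise
set.seed(123)
group1 <- rnorm(30, mean = 50, sd = 5)
group2 <- rnorm(30, mean = 52, sd = 5)

# Pooled variance under the equal-variance assumption
n1 <- length(group1)
n2 <- length(group2)
pooled_var <- ((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / (n1 + n2 - 2)

# t statistic: difference in means divided by its standard error
t_manual <- (mean(group1) - mean(group2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# Built-in test for comparison
res <- t.test(group1, group2, var.equal = TRUE)

# Decision rule at the assumed 5% significance level
alpha <- 0.05
decision <- if (res$p.value < alpha) "Reject H0" else "Fail to reject H0"

cat("Manual t:", t_manual, " t.test t:", unname(res$statistic), "\n")
cat("p-value:", res$p.value, " Decision:", decision, "\n")
```

The other tests in the program return objects of the same kind, so their p.value element can be read and compared with alpha in the same way.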

OUTPUT:

RESULT:

Hypothesis Testing was implemented in R script.


Predictive Analysis R Programming.
Ex.No:10
Date:

AIM:

Write an R script for Predictive Analysis.

ALGORITHM:

Step 1: Install and load the necessary libraries (caret, rpart, randomForest, ggplot2).

Step 2: Use caret for machine learning operations such as data partitioning and evaluation.

Step 3: Generate a synthetic dataset using random values.

Step 4: Split the data into training (70%) and testing (30%) sets.

Step 5: Fit a linear regression model for the continuous variable Income.

Step 6: Fit a decision tree and a random forest for classifying Default.

Step 7: Make predictions using the decision tree and the random forest.

Step 8: Evaluate model performance using confusion matrices.

Step 9: Visualize the decision tree and the feature importance from the random forest.

PROGRAM:

# Install and load required libraries

if (!require("caret")) install.packages("caret")

if (!require("rpart")) install.packages("rpart")

if (!require("randomForest")) install.packages("randomForest")

if (!require("ggplot2")) install.packages("ggplot2")

library(caret)

library(rpart)

library(randomForest)

library(ggplot2)

# 1. Load or Create Sample Dataset

set.seed(123)
data <- data.frame(

Age = sample(18:70, 100, replace = TRUE),

Income = round(rnorm(100, mean = 50000, sd = 10000), 0),

Education = sample(c("High School", "Bachelor's", "Master's", "PhD"), 100, replace = TRUE),

Credit_Score = round(rnorm(100, mean = 700, sd = 50), 0)

)

# Target Variable: Whether the person defaults on a loan

data$Default <- ifelse(data$Credit_Score < 650, 1, 0) # 1 = Default, 0 = No Default

# View the dataset

cat("Sample Dataset:\n")

print(head(data))

# Convert categorical variables to factors

data$Education <- as.factor(data$Education)

data$Default <- as.factor(data$Default)

# 2. Split Data into Training and Testing Sets

set.seed(123)

trainIndex <- createDataPartition(data$Default, p = 0.7, list = FALSE)

trainData <- data[trainIndex, ]

testData <- data[-trainIndex, ]

cat("\nTraining Set Size:", nrow(trainData))

cat("\nTesting Set Size:", nrow(testData))


# 3. Build Predictive Models

# (a) Linear Regression (For continuous target variables)

lm_model <- lm(Income ~ Age + Credit_Score, data = trainData)

cat("\nLinear Regression Model Summary:\n")

print(summary(lm_model))

# (b) Decision Tree (For classification tasks)

tree_model <- rpart(Default ~ Age + Income + Education + Credit_Score,

data = trainData, method = "class")

cat("\nDecision Tree Summary:\n")

print(tree_model)

# (c) Random Forest (For classification tasks)

rf_model <- randomForest(Default ~ Age + Income + Education + Credit_Score,

data = trainData, ntree = 100)

cat("\nRandom Forest Model:\n")

print(rf_model)

# 4. Make Predictions

# Predict using Decision Tree

tree_predictions <- predict(tree_model, testData, type = "class")

# Predict using Random Forest

rf_predictions <- predict(rf_model, testData)

# 5. Evaluate Models

# Confusion Matrix for Decision Tree


cat("\nDecision Tree Confusion Matrix:\n")

print(confusionMatrix(tree_predictions, testData$Default))

# Confusion Matrix for Random Forest

cat("\nRandom Forest Confusion Matrix:\n")

print(confusionMatrix(rf_predictions, testData$Default))

# 6. Visualize the Results

# (a) Decision Tree Plot

if (!require("rpart.plot")) install.packages("rpart.plot")

library(rpart.plot)

rpart.plot(tree_model, main = "Decision Tree")

# (b) Feature Importance from Random Forest

# Stored under a different name so that the importance() function is not masked

rf_importance <- importance(rf_model)

cat("\nFeature Importance:\n")

print(rf_importance)

# Plot Feature Importance

importance_df <- data.frame(Feature = rownames(rf_importance), Importance = rf_importance[, 1])

ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance, fill = Feature)) +

geom_bar(stat = "identity") +

coord_flip() +

labs(title = "Feature Importance (Random Forest)",

x = "Features",
y = "Importance") +

theme_minimal()
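
The numbers that confusionMatrix() reports can be traced back to a plain cross-tabulation. The sketch below uses base R only and assumes ten made-up actual and predicted labels (not output from the models above); class "1" is treated as the positive class here, whereas caret uses the first factor level as the positive class unless told otherwise.

```r
# Illustrative labels: 1 = Default, 0 = No Default
actual    <- factor(c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 0, 1, 0, 1, 0, 1, 0), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are actual classes
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Accuracy: share of correct predictions (the diagonal of the matrix)
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity for class "1": true positives / all actual positives
sensitivity <- cm["1", "1"] / sum(cm[, "1"])

cat("Accuracy:", accuracy, "\n")
cat("Sensitivity (class 1):", sensitivity, "\n")
```

Reading the matrix directly also shows which kind of error dominates, which a single accuracy figure hides.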

OUTPUT:

RESULT:

Predictive Analysis in R script was executed successfully.


Ex.No:11
Date:
Write an R script to implement the Cross-Validation.

AIM:

Implement R Script for Cross-Validation.

ALGORITHM:

Step 1: Install (if needed) and load the required libraries using library().

Step 2: Create a sample dataset and convert the Default column into a factor, since this is a classification problem.

Step 3: Display the first few rows of the dataset.

Step 4: Define the 10-fold cross-validation settings with trainControl().

Step 5: Use the train() function to fit a Random Forest model with those settings.

Step 6: Print the model summary, the best tuning parameters, and the per-fold results, and use the plot() function to visualize the model performance across tuning parameters.

Step 7: Create a new dataset with sample values for prediction.

Step 8: Print the predictions along with the input data.

PROGRAM:

# Install and load required libraries

if (!require("caret")) install.packages("caret")

if (!require("randomForest")) install.packages("randomForest")

if (!require("e1071")) install.packages("e1071") # Needed for SVM in caret

library(caret)

library(randomForest)

# 1. Create Sample Dataset

set.seed(123)

data <- data.frame(

Age = sample(18:70, 200, replace = TRUE),

Income = round(rnorm(200, mean = 50000, sd = 10000), 0),

Credit_Score = round(rnorm(200, mean = 700, sd = 50), 0),

Default = sample(c("Yes", "No"), 200, replace = TRUE)

)

# Convert target variable to factor

data$Default <- as.factor(data$Default)

# View dataset
cat("Sample Dataset:\n")

print(head(data))

# 2. Define Train Control for Cross-Validation

train_control <- trainControl(

method = "cv", # Cross-validation method

number = 10, # Number of folds

verboseIter = TRUE # Display training progress

)

# 3. Train a Model with Cross-Validation

# Example: Random Forest

set.seed(123)

rf_model <- train(

Default ~ Age + Income + Credit_Score, # Formula

data = data, # Dataset

method = "rf", # Random forest method

trControl = train_control, # Cross-validation settings

tuneLength = 3 # Number of candidate tuning values to try

)

# 4. Model Performance

cat("\nRandom Forest Model Summary:\n")

print(rf_model)
# Print the best model parameters

cat("\nBest Tuning Parameters:\n")

print(rf_model$bestTune)

# 5. Evaluate Model Performance

cat("\nCross-Validation Results:\n")

print(rf_model$resample)

# 6. Visualize Model Performance

plot(rf_model)

# 7. Predictions on a New Dataset (Optional)

new_data <- data.frame(

Age = c(25, 40, 65),

Income = c(40000, 55000, 70000),

Credit_Score = c(650, 720, 680)

)

# Predict using the trained model

predictions <- predict(rf_model, new_data)

cat("\nPredictions on New Data:\n")

print(data.frame(new_data, Predicted_Default = predictions))
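
What trainControl(method = "cv", number = 10) arranges behind the scenes can be written out as an explicit loop. This is a sketch under stated assumptions: it swaps in a simple linear regression on made-up data instead of the random forest so that it runs in base R, and the names fold_id and rmse are hypothetical.

```r
# Synthetic regression data (illustrative, not the credit dataset)
set.seed(123)
n <- 200
df <- data.frame(x = rnorm(n))
df$y <- 3 + 2 * df$x + rnorm(n)

# Assign every row to one of k folds at random, with equal fold sizes
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: train on the other k - 1 folds, evaluate on the held-out fold
rmse <- numeric(k)
for (i in seq_len(k)) {
  train_rows <- df[fold_id != i, ]
  test_rows <- df[fold_id == i, ]
  fit <- lm(y ~ x, data = train_rows)
  pred <- predict(fit, newdata = test_rows)
  rmse[i] <- sqrt(mean((test_rows$y - pred)^2))
}

# The cross-validated error is the average over the folds
cv_rmse <- mean(rmse)
print(round(rmse, 3))
cat("Mean RMSE over", k, "folds:", round(cv_rmse, 3), "\n")
```

Every observation is used for testing exactly once, which is why the averaged error is a fairer estimate than the error on the training data.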

OUTPUT:
RESULT:

The R script for cross-validation was implemented and its output verified.

Write an R script to implement the Ordinary Least Squares (OLS).


Ex.No:12
Date:
AIM:

To implement Ordinary Least Squares (OLS) regression in an R script.

ALGORITHM:

Step 1: Load the ggplot2 package to enable visualization of the regression results.

Step 2: Load the built-in mtcars dataset into the R environment.

Step 3: Use the head() function to display the first six rows of the mtcars dataset for a quick preview.

Step 4: Use the lm() function to fit a linear regression model of mpg on wt, hp and disp.

Step 5: Use the summary() function to obtain the key details of the regression results.

Step 6: Read the coefficients: the estimated effect of each independent variable on mpg.

Step 7: Read the R-squared value: how well the model explains the variance in mpg.

Step 8: Inspect the residuals: the differences between actual and predicted values.

Step 9: Plot mpg against wt with a fitted regression line using ggplot2.

PROGRAM:

# Load necessary libraries

library(ggplot2)

# Load the dataset

data(mtcars)
# View the first few rows of the dataset

head(mtcars)

# Perform OLS regression

model <- lm(mpg ~ wt + hp + disp, data = mtcars)

# View the summary of the model

summary(model)

# Plot the regression results

ggplot(mtcars, aes(x = wt, y = mpg)) +

geom_point() +

geom_smooth(method = "lm", formula = y ~ x, col = "red") +

labs(title = "OLS Regression: MPG vs Weight",

x = "Weight (1000 lbs)",

y = "Miles per Gallon (MPG)")

OUTPUT:

RESULT:

The R script for Ordinary Least Squares (OLS) regression was executed successfully.
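
As a supplementary illustration of what lm() computes, the sketch below solves the OLS normal equations directly and compares the result with lm() on the same mtcars model. The names X, y, beta_hat and r_squared are introduced here for illustration only; lm() itself uses a QR decomposition internally rather than this formula, so this is a check, not a description of its implementation.

```r
# Sketch: OLS coefficients from the normal equations (X'X) beta = X'y
data(mtcars)

# Design matrix with an explicit intercept column
X <- cbind(Intercept = 1, as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg

# solve(A, b) solves A %*% beta = b without forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Residuals and R-squared computed by hand
fitted_values <- X %*% beta_hat
residuals_manual <- y - fitted_values
r_squared <- 1 - sum(residuals_manual^2) / sum((y - mean(y))^2)

# Compare with the model fitted by lm()
model <- lm(mpg ~ wt + hp + disp, data = mtcars)
print(cbind(manual = drop(beta_hat), lm = coef(model)))
cat("R-squared (manual):", r_squared, "\n")
cat("R-squared (lm):    ", summary(model)$r.squared, "\n")
```

The two coefficient columns should agree up to floating-point error, which confirms that the values reported by summary(model) are the least-squares solution.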


Write an R script to implement the Linear regression algorithm.
Ex.No:13
Date:

AIM:

To implement the linear regression algorithm in the R programming language.

ALGORITHM:

Step 1: Check whether the ggplot2 package is installed; if not, install it.

Step 2: Load ggplot2.

Step 3: Set a seed for reproducibility using set.seed(123).

Step 4: Create a data frame, data, with three variables: Age, Income and Spending.

Step 5: Use the lm() function to fit a linear regression model of Spending on Age and Income.

Step 6: Call summary(lm_model) to print the coefficients, R-squared value, p-values, residuals and standard errors.

Step 7: Use ggplot2 to plot the relationship between Age and Spending with a fitted regression line.

PROGRAM:

# Load necessary library

if (!require("ggplot2")) install.packages("ggplot2")

library(ggplot2)

# Sample dataset

set.seed(123)

data <- data.frame(
  Age = sample(20:70, 100, replace = TRUE),
  Income = round(rnorm(100, mean = 50000, sd = 10000), 0)
)

# Spending depends linearly on Age and Income, plus random noise
data$Spending <- 2000 + 50 * data$Age + 0.3 * data$Income +
  rnorm(100, mean = 0, sd = 5000)

# Fit Linear Regression Model


lm_model <- lm(Spending ~ Age + Income, data = data)

# Model Summary

summary(lm_model)

# Scatter plot with regression line

ggplot(data, aes(x = Age, y = Spending)) +

geom_point() +

geom_smooth(method = "lm", formula = y ~ x, color = "red") +

labs(title = "Linear Regression: Spending vs Age", x = "Age", y = "Spending")

OUTPUT:

RESULT:

The linear regression algorithm was implemented in an R script.
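
As a supplementary illustration, the sketch below refits the same simulated model, compares the estimated coefficients with the values used to generate the data (2000, 50 and 0.3), and predicts Spending for new observations. The objects sim, fit and new_customers, and the two example customers, are hypothetical and introduced only for this sketch.

```r
# Sketch: inspect coefficients and predict with a fitted linear model
set.seed(123)
sim <- data.frame(
  Age = sample(20:70, 100, replace = TRUE),
  Income = round(rnorm(100, mean = 50000, sd = 10000), 0)
)
sim$Spending <- 2000 + 50 * sim$Age + 0.3 * sim$Income +
  rnorm(100, mean = 0, sd = 5000)

fit <- lm(Spending ~ Age + Income, data = sim)

# Estimates with 95% confidence intervals; the data were generated with
# intercept 2000, Age coefficient 50 and Income coefficient 0.3
print(cbind(estimate = coef(fit), confint(fit)))

# Hypothetical new customers (values chosen only for illustration)
new_customers <- data.frame(Age = c(30, 50), Income = c(45000, 60000))

# interval = "prediction" adds lower and upper bounds for a single new observation
pred <- predict(fit, newdata = new_customers, interval = "prediction")
print(cbind(new_customers, pred))
```

With a noise standard deviation of 5000 and only 100 rows, the intervals are wide, so the estimates need not lie close to the generating values.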


Write an R script to implement the K-Means clustering algorithm.
Ex.No:14
Date:

AIM:

To write an R script that implements the K-Means clustering algorithm in RStudio.

ALGORITHM:

Step 1: Load the ggplot2 library, which is used to create visualizations in R.

Step 2: Load the iris dataset, a built-in dataset in R that contains measurements of iris flowers for three species (setosa, versicolor, virginica).

Step 3: Set a seed for reproducibility and apply K-Means clustering to the four measurement columns to group the data into 3 clusters.

Step 4: Add the computed cluster labels (1, 2, or 3) to the iris dataset; as.factor() ensures the cluster labels are treated as categorical values.

Step 5: Visualize the clusters with a ggplot2 scatter plot: X-axis Sepal.Length, Y-axis Sepal.Width, color by cluster assignment.

PROGRAM:

# Load necessary library

library(ggplot2)

# Load the dataset

data(iris)

# Perform K-Means clustering with 3 clusters

set.seed(123) # For reproducibility

kmeans_result <- kmeans(iris[, 1:4], centers = 3)

# Add the cluster results to the dataset

iris$Cluster <- as.factor(kmeans_result$cluster)

# Plot the clusters

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Cluster)) +

geom_point() +

labs(title = "K-Means Clustering of Iris Dataset",

x = "Sepal Length",

y = "Sepal Width")

OUTPUT:
RESULT:

The K-Means clustering algorithm was implemented successfully.
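
As a supplementary illustration, the sketch below checks how the three clusters line up with the known species labels. It also passes nstart = 25, which is an addition not used in the program above, and the names km and comparison are introduced only for this sketch.

```r
# Sketch: compare K-Means clusters with the known species labels
data(iris)
set.seed(123)

# nstart = 25 reruns K-Means from 25 random starts and keeps the best solution
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Cross-tabulate species against cluster labels
# (cluster numbers are arbitrary, so only the pattern of counts matters)
comparison <- table(Species = iris$Species, Cluster = km$cluster)
print(comparison)

# Share of the total variance captured by the between-cluster separation
cat("Between-cluster SS / total SS:", km$betweenss / km$totss, "\n")
```

K-Means never sees the Species column, so the table is only an external check on how well the unsupervised grouping matches the true classes.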

Write an R script to implement the Naive Bayes classifier.
Ex.No:15
Date:

AIM:

To implement the Naive Bayes classifier in an R script.

ALGORITHM:

Step 1: Load the e1071 package, which provides an implementation of the Naive Bayes classifier in R.

Step 2: Load the built-in iris dataset, which contains 150 samples of iris flowers with four features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and a Species label.

Step 3: Split the data into a 70% training set and a 30% test set.

Step 4: Train the model with the formula Species ~ ., which means predicting Species based on all other features.

Step 5: Use the trained model to predict the species of flowers in the test set.

Step 6: Create a confusion matrix comparing the predicted and actual species; this helps measure the accuracy of the model.

Step 7: Compute the accuracy by dividing the number of correct predictions by the total number of test samples.

Step 8: Internally, the classifier estimates the probability of each class, P(Species), from the training set.

Step 9: It estimates the probability of each feature value given a class using the Gaussian (Normal) distribution.

Step 10: The class with the highest probability is selected as the prediction.

PROGRAM:

# Load necessary library

library(e1071)
# Load the dataset

data(iris)

# Split the data into training and test sets

set.seed(123) # For reproducibility

train_index <- sample(1:nrow(iris), 0.7 * nrow(iris))

train_data <- iris[train_index, ]

test_data <- iris[-train_index, ]

# Train the Naive Bayes model

model <- naiveBayes(Species ~ ., data = train_data)

# Predict on the test set

predictions <- predict(model, test_data)

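# Evaluate the model

confusion_matrix <- table(Predicted = predictions, Actual = test_data$Species)

print(confusion_matrix)

# Accuracy: correct predictions (the diagonal) divided by all test samples

accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

cat("Accuracy:", accuracy, "\n")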

OUTPUT:

RESULT:

The Naive Bayes classifier was implemented successfully.
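
As a supplementary illustration of the last three algorithm steps (class priors, Gaussian likelihoods, and choosing the most probable class), the following is a minimal base-R sketch of a Gaussian Naive Bayes classifier written from scratch. It is not the e1071 implementation; the names priors, class_means, class_sds and log_posterior are introduced here only for illustration.

```r
# Sketch: Gaussian Naive Bayes from scratch in base R (not the e1071 code)
data(iris)
set.seed(123)
train_index <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]

features <- names(iris)[1:4]
classes <- levels(iris$Species)

# Prior P(Species): class frequencies in the training set
class_counts <- table(train_data$Species)
priors <- setNames(as.numeric(class_counts) / nrow(train_data),
                   names(class_counts))

# Per-class mean and standard deviation of every feature (4 x 3 matrices)
class_means <- sapply(classes, function(cl)
  colMeans(train_data[train_data$Species == cl, features]))
class_sds <- sapply(classes, function(cl)
  apply(train_data[train_data$Species == cl, features], 2, sd))

# Log-posterior up to a constant: log prior + sum of Gaussian log-likelihoods
log_posterior <- sapply(classes, function(cl) {
  log_lik <- sapply(features, function(f)
    dnorm(test_data[[f]], mean = class_means[f, cl],
          sd = class_sds[f, cl], log = TRUE))
  log(priors[[cl]]) + rowSums(log_lik)
})

# Pick the class with the highest posterior for each test row
predicted <- factor(classes[apply(log_posterior, 1, which.max)],
                    levels = classes)

print(table(Predicted = predicted, Actual = test_data$Species))
accuracy <- mean(predicted == test_data$Species)
cat("Accuracy:", accuracy, "\n")
```

Working with log-probabilities and summing them, rather than multiplying raw densities, avoids numerical underflow when many features are combined.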
