DEV Lab Manual
Exp No: 1 INSTALLATION OF STANDALONE R
Date:
Aim:
To install standalone R.
Procedure:
1. Go to the CRAN website (https://cran.r-project.org).
2. Click on "Download R for Windows".
3. Run the downloaded R installer (e.g., R-4.x.x-win.exe).
4. Select the language of the install process.
5. Read the License Agreement and accept its conditions.
6. Choose the destination folder for the installation.
7. If necessary, change the program group.
Result:
Thus the above experiment to install Standalone R has been executed successfully
and the output is verified.
Exp No: 2 USE R TOOLS TO EXPLORE VARIOUS COMMANDS FOR DESCRIPTIVE DATA ANALYTICS USING BENCHMARK DATASETS
Date:
Aim:
To use R commands to implement descriptive Data Analytics on datasets.
Procedure:
Make sure R is installed to begin programming.
Import the required dataset as a CSV file to perform descriptive analysis.
Declare the required variables.
Declare variables for mean, median, mode, range, minimum, maximum, variance, etc.
Apply the functions and store the output in the variables declared.
Print the outputs.
• MEDIAN: It splits the data into two halves. If the number of elements in the data set is odd, the middle element is the median; if it is even, the median is the average of the two central elements.
ODD: the element at position (n + 1) / 2
EVEN: the average of the elements at positions n/2 and (n/2 + 1)
• MODE: It is the value that has the highest frequency in the given dataset.
• RANGE: The range is the difference between the maximum and minimum values in a dataset.
• VARIANCE: It measures how far the data points are spread out from the mean.
Variance = Σ (x - μ)² / (n - 1),
where x is each data point, μ is the mean, and n is the number of data points.
• STANDARD DEVIATION: It measures how much individual data points deviate from the mean. It is the square root of the variance.
• INTERQUARTILE RANGE (IQR): The IQR measures the spread of the middle 50% of the data and is the range between the first quartile (Q1) and the third quartile (Q3).
IQR = Q3 - Q1
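The lab code itself uses R, but the formulas above are language-independent. As a quick sanity check, they can be reproduced with Python's standard statistics module on a small, hand-computable dataset (the dataset below is made up for illustration, not taken from CardioGoodFitness.csv):

```python
import statistics

# Small illustrative dataset (made up; easy to verify by hand)
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)          # sum / n = 40 / 8 = 5
median = statistics.median(data)      # even n: average of the two central elements = 4.5
mode = statistics.mode(data)          # most frequent value = 4
rng = max(data) - min(data)           # range = max - min = 7
variance = statistics.variance(data)  # sample variance, divides by n - 1
std = statistics.stdev(data)          # square root of the variance

print(mean, median, mode, rng, variance, std)
```

Working the variance by hand for this data gives Σ(x - μ)² = 32 over n - 1 = 7, i.e. about 4.571, which the module reproduces.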
CODE:
data = read.csv("CardioGoodFitness.csv")
mean_age = mean(data$Age)
print("1. mean_age:")
print(mean_age)
median = median(data$Age)
print("2. median:")
print(median)
library(modeest) # library used to calculate the mode
mode = mfv(data$Age)
print("3. mode:")
print(mode)
max = max(data$Age)
min = min(data$Age)
range = max - min
print("4. Range is:")
print(range)
variance = var(data$Age)
print("5. Variance is:")
print(variance)
std = sd(data$Age)
print("6. Standard deviation is:")
print(std)
quartiles = quantile(data$Age)
print("7. Quartiles are:")
print(quartiles)
IQR = IQR(data$Age)
print("8. InterQuartile Range is:")
print(IQR)
summary = summary(data$Age)
print("9. Summary is:")
print(summary)
OUTPUT:
"1. mean_age:"
28.7889
"2. median:"
26
"3. mode:"
25
"4. Range is:"
32
"5. Variance is:"
48.21217
"6. Standard deviation is:"
6.943498
"7. Quartiles are:"
0% 25% 50% 75% 100%
18 24 26 33 50
"8. InterQuartile Range is:"
9
"9. Summary is:"
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 24.00 26.00 28.79 33.00 50.00
Result:
Thus the program to use R commands to implement descriptive analytics on datasets has been executed and verified successfully.
EXP NO: 3 EXPLORE THE VARIOUS VARIABLES AND ROW FILTER IN R FOR DATA CLEANING
DATE:
AIM:
To develop a program to create various variables and row filter for data cleaning.
ALGORITHM:
1. Load the sample dataset "mtcars" into the variable mydata.
2. Display the first few rows of the dataset to get an overview.
3. Calculate summary statistics for numeric variables.
4. Create a frequency table for a categorical variable (e.g., 'cyl').
5. Compute a correlation matrix for selected numeric variables.
6. Filter rows based on specific conditions (e.g., keep cars with more than 20 mpg
and less than 200 horsepower).
7. Check for missing values in the dataset.
8. Remove rows with missing values to create a cleaned dataset.
9. Display the filtered dataset and the cleaned dataset to inspect the results.
PROGRAM:
data(mtcars)
mydata <- mtcars
head(mydata)
#Explore Variables
summary(mydata)
table(mydata$cyl)
correlation_matrix <- cor(mydata[, c("mpg", "hp", "wt")])
print(correlation_matrix)
#Row Filtering
filtered_data <- mydata[mydata$mpg > 20 & mydata$hp < 200, ]
head(filtered_data)
missing_values <- sum(is.na(mydata))
cat("Number of missing values in the dataset: ", missing_values, "\n")
cleaned_data <- mydata[complete.cases(mydata), ]
head(cleaned_data)
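The row-filtering and missing-value steps above are not specific to R. The sketch below is a minimal Python illustration of the same idea, with a small hand-made list standing in for mtcars (the first three rows use real mtcars values; "Mystery Car" is a hypothetical row added to demonstrate missing-value removal):

```python
# Mini-dataset standing in for a few mtcars rows; None marks a missing value
cars = [
    {"name": "Datsun 710", "mpg": 22.8, "hp": 93},
    {"name": "Duster 360", "mpg": 14.3, "hp": 245},
    {"name": "Toyota Corolla", "mpg": 33.9, "hp": 65},
    {"name": "Mystery Car", "mpg": None, "hp": 110},  # hypothetical incomplete row
]

# Row filter: keep complete rows with mpg > 20 and hp < 200
filtered = [row for row in cars
            if row["mpg"] is not None and row["hp"] is not None
            and row["mpg"] > 20 and row["hp"] < 200]

# Cleaned dataset: drop any row containing a missing value
cleaned = [row for row in cars if all(v is not None for v in row.values())]

print([row["name"] for row in filtered])
print(len(cleaned))
```

This mirrors the `mydata[mydata$mpg > 20 & mydata$hp < 200, ]` and `complete.cases` lines in the R program.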
OUTPUT:
RESULT:
Thus, the program has been executed and the output was verified
successfully.
Exp No: 4 USE R COMMANDS FOR PROBABILITY DISTRIBUTION AND PROBABILITY STATISTICS
Date:
Aim:
To use R commands to implement the probability distribution and
probability statistics
Procedure:
Make sure R is installed to begin programming.
Generate or import the data required for the analysis.
Declare variables for the probability distributions and statistics.
Apply the functions and print the outputs.
Normal Distribution:
- Generate random data from a normal distribution and calculate
the mean and standard deviation:
Binomial Distribution:
- Compute the probability of getting exactly 3 successes in 10
trials with a success probability of 0.3:
Poisson Distribution:
- Calculate the probability of observing exactly 2 events when
the average event rate is 1 event per unit of time:
Descriptive Statistics:
- Calculate the mean and standard deviation of a dataset:
Hypothesis Testing (t-test):
- Perform a two-sample t-test to compare the means of two groups
(e.g., group1 and group2):
CODE:
NORMAL DISTRIBUTION:
data <- rnorm(100, mean = 0, sd = 1)
mean(data)
sd(data)
BINOMIAL DISTRIBUTION:
dbinom(3, size = 10, prob = 0.3)
POISSON DISTRIBUTION:
dpois(2, lambda = 1)
DESCRIPTIVE STATISTICS:
data <- c(10, 15, 20, 25, 30)
mean(data)
sd(data)
HYPOTHESIS TESTING:
group1 <- c(20, 22, 18, 24, 25)
group2 <- c(15, 17, 21, 19, 23)
t.test(group1, group2)
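The values returned by dbinom and dpois come directly from the probability mass functions P(X = k) = C(n, k) p^k (1 - p)^(n - k) and P(X = k) = e^(-λ) λ^k / k!. As an independent cross-check of the R calls above, the same numbers can be computed in Python from the formulas:

```python
import math

# Binomial: probability of exactly 3 successes in 10 trials with p = 0.3
n, k, p = 10, 3, 0.3
binom_prob = math.comb(n, k) * p**k * (1 - p)**(n - k)

# Poisson: probability of exactly 2 events when the average rate is lambda = 1
lam, k2 = 1, 2
pois_prob = math.exp(-lam) * lam**k2 / math.factorial(k2)

print(round(binom_prob, 7))  # matches dbinom(3, size = 10, prob = 0.3)
print(round(pois_prob, 7))   # matches dpois(2, lambda = 1)
```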
OUTPUT:
NORMAL DISTRIBUTION:
[1] -0.01515646
[1] 1.067949
BINOMIAL DISTRIBUTION:
[1] 0.2668279
POISSON DISTRIBUTION:
[1] 0.1839397
DESCRIPTIVE STATISTICS:
[1] 20
[1] 7.071068
HYPOTHESIS TESTING:
	Welch Two Sample t-test
data:  group1 and group2
t = 1.4676, df = 7.9225, p-value = 0.1806
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.608263  7.208263
sample estimates:
mean of x mean of y
     21.8      19.0
Result:
Thus the program to use R commands to implement the probability
distribution and probability statistics has been executed and verified
successfully.
EXP NO: 5 FORMULATE REAL BUSINESS SCENARIOS INTO HYPOTHESES AND SOLVE USING R STATISTICAL TESTING FEATURES
DATE:
Aim:
To formulate a real business scenario as a hypothesis and solve it using R statistical testing features.
Code:
# Simulated data: sales volume before and after a price change
set.seed(123)
before_price <- rnorm(50, mean = 50, sd = 10) # before the price change
after_price <- rnorm(50, mean = 55, sd = 10)  # after the price change
hist(before_price, main = "Sales Volume Before Price Change",
     xlab = "Sales Volume", col = "lightblue")
hist(after_price, main = "Sales Volume After Price Change",
     xlab = "Sales Volume", col = "lightgreen")
# 3. Hypothesis testing:
# Perform an independent t-test (assuming independence between the two samples)
t_test_result <- t.test(before_price, after_price)
# 4. Result analysis:
cat("Independent t-Test Results:\n")
cat("Before Price Change - Mean Sales Volume:", mean(before_price), "\n")
cat("After Price Change - Mean Sales Volume:", mean(after_price), "\n")
cat("Test statistic:", t_test_result$statistic, "\n")
cat("P-value:", t_test_result$p.value, "\n")
# 5. Make a decision based on the p-value
alpha <- 0.05
if (t_test_result$p.value < alpha) {
  cat("Conclusion: Reject the null hypothesis. The price change has a statistically significant effect on sales volume.\n")
} else {
  cat("Conclusion: Fail to reject the null hypothesis. There is no statistically significant effect of the price change on sales volume.\n")
}
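By default, R's t.test performs a Welch two-sample t-test: the statistic is t = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂), with degrees of freedom from the Welch–Satterthwaite formula. A minimal Python sketch of just the statistic (on two small made-up samples; the p-value lookup is left to a statistics library):

```python
import statistics

def welch_t(a, b):
    """Welch two-sample t statistic and its degrees of freedom."""
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances (n - 1)
    se2 = v1 / len(a) + v2 / len(b)                          # squared standard error
    t = (m1 - m2) / se2 ** 0.5
    # Welch-Satterthwaite degrees of freedom
    df = se2**2 / ((v1 / len(a))**2 / (len(a) - 1) + (v2 / len(b))**2 / (len(b) - 1))
    return t, df

# Illustrative samples (made up for this sketch)
group1 = [20, 22, 18, 24, 25]
group2 = [15, 17, 21, 19, 23]
t, df = welch_t(group1, group2)
print(round(t, 4), round(df, 4))
```

Comparing t against the null's p-value, as the R code does with alpha = 0.05, is what turns the statistic into a business decision.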
Output:
Result:
Thus R statistical testing features are used to implement hypothesis for real
business scenarios.
EXP NO: 6 PROGRAM TO APPLY VARIOUS PLOT FEATURES IN R ON SAMPLE DATASETS AND VISUALIZE
DATE:
AIM:
To Apply Various Plot features in R on sample datasets and visualize.
PROCEDURE:
1. Load Libraries or generate data
2. Initialize a plot
3. Add geometric objects and customize the plot.
4. Save or display the plot in R environment
5. Iterate and experiment to refine your plot as required.
PROGRAM:
1. Bar Plot: Bar plots are generally used for plotting continuous and categorical variables.
barplot(airquality$Ozone, main = 'Ozone Concentration in air',
        ylab = 'Ozone levels', col = 'blue', horiz = FALSE)
Output:
2. Histogram: In a histogram, values are grouped into consecutive intervals called bins.
data(airquality)
hist(airquality$Temp, main = "La Guardia Airport's Maximum Temperature (Daily)",
     xlab = "Temperature (Fahrenheit)", xlim = c(50, 125), col = "yellow", freq = TRUE)
Output:
3. Box Plot: A boxplot depicts information such as the minimum and maximum data points, the median, and the quartiles.
data(airquality)
boxplot(airquality$Wind, main = "Average wind speed at La Guardia Airport",
        xlab = "Miles per hour", ylab = "Wind", col = "orange", border = "brown",
        horizontal = TRUE, notch = TRUE)
Output:
5. Heat Map: A heatmap is a graphical representation of data that uses colors to visualize the values of a matrix.
data <- matrix(rnorm(50, 0, 5), nrow = 5, ncol = 5)
colnames(data) <- paste0("col", 1:5)
rownames(data) <- paste0("row", 1:5)
heatmap(data)
Result:
Thus, the various plot features in R have been applied on sample datasets and the plots have been visualized and displayed.
Expt. No: 7 WRITE AND EXECUTE WORD COUNT, WORD SEARCH AND PATTERN SEARCH PROBLEMS FROM LARGE TEXT FILES
Date:
Aim:
To develop a program to write and execute word count, word search and pattern search problems from large text files using R.
A) Word Count:
Procedure:
Initialize count to 0.
Open the text file and read its contents.
Split the text into words on whitespace.
Set count to the number of words.
Close the text file.
Return count.
Code:
text <- suppressWarnings(readLines("Products.txt"))
text <- paste(text, collapse = " ")
words <- unlist(strsplit(text, "\\s+"))
word_count <- length(words)
cat("Word Count:", word_count)
Output:
B) Word Search:
Procedure:
Initialize foundWords as an empty list.
Open the text file and loop through each line in the file.
Split the line into words.
For each word in the line, if the word equals the target word, add it to the foundWords list.
Close the text file.
Return foundWords.
Code:
text <- suppressWarnings(readLines("Products.txt"))
text <- paste(text, collapse = " ")
search_word <- "monitor"
word_found <- grep(search_word, text)
if (length(word_found) > 0) {
  cat("Word found at positions:", word_found)
} else {
  cat("Word not found in the text.")
}
Output:
C) Pattern Search:
Procedure:
Initialize foundLines as an empty list.
Open the text file and loop through each line in the file.
If the target pattern is found in the line, add the line to the foundLines list.
Close the text file.
Return foundLines.
Code:
text <- suppressWarnings(readLines("Products.txt"))
text <- paste(text, collapse = " ")
pattern <- "800"
pattern_found <- grep(pattern, text, perl = TRUE)
if (length(pattern_found) > 0) {
  cat("Pattern found at positions:", pattern_found)
} else {
  cat("Pattern not found in the text.")
}
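The three tasks map directly onto basic string operations in any language. The Python sketch below mirrors the same logic; since Products.txt is not reproduced here, a small made-up in-memory string stands in for the file's contents:

```python
import re

# In-memory text standing in for the contents of Products.txt (made up)
text = "HP monitor 800x600 Dell monitor 1920x1080 Logitech keyboard"

# A) Word count: split on whitespace and count the pieces
words = re.split(r"\s+", text.strip())
word_count = len(words)

# B) Word search: find every index at which the target word occurs
search_word = "monitor"
positions = [i for i, w in enumerate(words) if w == search_word]

# C) Pattern search: regular-expression match anywhere in the text
pattern = "800"
pattern_found = re.search(pattern, text) is not None

print(word_count, positions, pattern_found)
```

Unlike the R version, which reports only whether the collapsed text matched, this sketch reports word positions, which makes the search result easier to interpret.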
Output:
RESULT:
Thus, the program to write and execute word count, word search and pattern search problems from large text files using R has been executed and the output was verified successfully.
Exp No: 8 EXPLORE VARIOUS DATA PREPROCESSING OPTIONS USING BENCHMARK DATASETS
Date:
Aim:
To use Python commands to implement data preprocessing using datasets.
Procedure:
Check that Python is installed to implement data preprocessing.
Import pandas and train_test_split.
Create a dataset using a DataFrame, or import a dataset as a CSV file if required.
Apply the required data preprocessing commands and print the output.
CODE:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({'Age': [25, 30, 22, 35, 28, 40],
                     'Salary': [50000, 60000, 45000, 75000, 55000, 80000],
                     'Education': ['Bachelor', 'Master', 'Bachelor', 'Ph.D', 'Master', 'Bachelor']})
print("First few rows of the dataset:")
print(data.head())
print("Statistical summary of numeric columns:")
print(data.describe())
missing_values = data.isna().sum()
print("Missing values in the dataset:")
print(missing_values)
data.rename(columns={'Salary': 'Income'}, inplace=True)
data['Age'] = data['Age'].astype(float)
data.drop_duplicates(inplace=True)
X = data.drop('Education', axis=1)
y = data['Education']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
data.to_csv('preprocessed_data.csv', index=False)
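The 80/20 split performed by train_test_split can be illustrated without scikit-learn. The sketch below (pure Python, with a made-up 10-row index) shuffles row indices with a fixed seed and slices off the last 20% as the test set, which is the basic idea behind the library call:

```python
import random

# Made-up row indices standing in for a 10-row dataset
rows = list(range(10))

test_size = 0.2
rng = random.Random(42)   # fixed seed, analogous to random_state=42
shuffled = rows[:]
rng.shuffle(shuffled)

# First 20% of the shuffled indices become the test set, the rest the training set
n_test = int(len(rows) * test_size)
test_rows = shuffled[:n_test]
train_rows = shuffled[n_test:]

print(len(train_rows), len(test_rows))  # 8 2
```

The fixed seed makes the split reproducible across runs, which is why the lab code passes random_state=42.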
OUTPUT:
First few rows of the dataset:
   Age  Salary Education
0   25   50000  Bachelor
1   30   60000    Master
2   22   45000  Bachelor
3   35   75000     Ph.D
4   28   55000    Master
Statistical summary of numeric columns:
             Age        Salary
count   6.000000      6.000000
mean   30.000000  60833.333333
std     6.603030  13934.369977
min    22.000000  45000.000000
25%    25.750000  51250.000000
50%    29.000000  57500.000000
75%    33.750000  71250.000000
max    40.000000  80000.000000
Missing values in the dataset:
Age          0
Salary       0
Education    0
dtype: int64
Result:
Thus the program to use Python commands for data preprocessing options has been implemented using a dataset.