
DEV Lab Manual

The document outlines a laboratory manual for a Data Exploration and Visualization course at SRI Shakthi Institute, detailing various experiments using R programming. It includes procedures for installing R, performing descriptive data analytics, data cleaning, probability distributions, hypothesis testing, and data visualization. Each experiment concludes with a successful execution and verification of the R commands used.


SRI SHAKTHI

INSTITUTE OF ENGINEERING AND TECHNOLOGY


COIMBATORE – 62

Autonomous Institution, Accredited by NAAC with “A” Grade

21AD512 – DATA EXPLORATION AND VISUALIZATION


LABORATORY
DEPARTMENT OF
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
EXP NO: 1
Installation of Standalone R
DATE:

Aim:
To install standalone R

Procedure:
1. Go to the CRAN website (https://cran.r-project.org).
2. Click on “Download R for Windows”.
3. Run the downloaded R installer file.
4. Select the language of the install process.
5. Run the Setup wizard.

6. Read the License Agreement and accept its conditions.
7. If necessary, change the destination folder for the program.
8. If necessary, change the program group.
9. Wait while the program is being installed.
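After the wizard finishes, the installation can be checked from inside R itself; a minimal sketch:

```r
# Print the version string of the newly installed R
print(R.version.string)
# Show where add-on packages will be installed
print(.libPaths())
```

If the commands run and print a version string such as "R version 4.x.x", the installation succeeded.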

Result:
Thus the above experiment to install Standalone R has been executed successfully
and the output is verified.

Exp No: 2
USE R TOOLS TO EXPLORE VARIOUS COMMANDS FOR DESCRIPTIVE DATA ANALYTICS USING BENCHMARK DATASETS
Date:
Aim:
To use R commands to implement descriptive Data Analytics on datasets.
Procedure:
 Make sure R is installed to begin programming.
 Import the required dataset as a CSV file to perform descriptive analysis.
 Declare the required variables.
 Declare variables for mean, median, mode, range, minimum, maximum, variance, etc.
 Apply the functions and store the output in the declared variables.
 Print the outputs.

• MEAN: It is the sum of observations divided by the total number of observations.

 Mean = Σx / n

• MEDIAN: It splits the data into two halves. If the number of elements in the dataset is odd, the centre element is the median; if it is even, the median is the average of the two central elements.
 Odd n: the element at position (n + 1) / 2
 Even n: the average of the elements at positions n/2 and n/2 + 1

• MODE: It is the value that has the highest frequency in the given dataset.

• RANGE: The range is the difference between the maximum and minimum values
in a dataset.

 Range = Max Val – Min Val

• MINIMUM: The minimum value is the smallest value in the dataset.


• MAXIMUM: The maximum value is the largest value in the dataset.
• VARIANCE: Variance measures the spread or dispersion of data points from
the mean. A higher variance indicates greater variability.

 Variance = Σ(x - μ)² / (n - 1)
where x is each data point, μ is the mean, and n is the number of data points.

• STANDARD DEVIATION: The standard deviation is a measure of how much individual data points deviate from the mean. It is the square root of the variance.

 Standard Deviation = √(Σ(x - μ)² / (n - 1))

• QUANTILE: A quantile divides a dataset into equal-sized portions. Common quantiles include the median (50th percentile) and the quartiles (25th and 75th percentiles).

• INTERQUARTILE RANGE (IQR): The IQR measures the spread of the middle 50% of the data and is the range between the first quartile (Q1) and the third quartile (Q3).

 IQR = Q3 - Q1
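The formulas above can be checked in R against the built-in functions, using a small illustrative vector (the values are made up, not taken from any dataset):

```r
x <- c(18, 24, 26, 33, 50)                 # illustrative data
n <- length(x)
m <- sum(x) / n                            # Mean = Σx / n
v <- sum((x - m)^2) / (n - 1)              # Variance with the n - 1 denominator
s <- sqrt(v)                               # Standard deviation = √variance
iqr <- unname(quantile(x, 0.75) - quantile(x, 0.25))  # IQR = Q3 - Q1
# Each manual result agrees with the corresponding built-in
all.equal(m, mean(x))    # TRUE
all.equal(v, var(x))     # TRUE
all.equal(s, sd(x))      # TRUE
all.equal(iqr, IQR(x))   # TRUE
```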

CODE:
data = read.csv("CardioGoodFitness.csv")
mean_age = mean(data$Age)
print("1. mean_age:")
print(mean_age)
median = median(data$Age)
print("2. median:")
print(median)
library(modeest) # package providing mfv(), the most frequent value (mode)
mode = mfv(data$Age)
print("3. mode:")
print(mode)
max = max(data$Age)
min = min(data$Age)
range = max - min
print("4. Range is:")
print(range)
variance = var(data$Age)
print("5. Variance is:")
print(variance)
std = sd(data$Age)
print("6. Standard deviation is:")
print(std)
quartiles = quantile(data$Age)
print("7. Quartiles are:")
print(quartiles)
IQR = IQR(data$Age)
print("8. InterQuartile Range is:")
print(IQR)
summary = summary(data$Age)
print("9. SUMMARY IS:")
print(summary)

OUTPUT:
"1. mean_age:"
28.7889
"2. median:"
26
"3. mode:"
25
"4. Range is:"
32
"5. Variance is:"
48.21217
"6. Standard deviation is:"
6.943498
"7. Quartiles are:"
0% 25% 50% 75% 100%
18 24 26 33 50
"8. InterQuartile Range is:"
9
"9. SUMMARY IS:"
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 24.00 26.00 28.79 33.00 50.00

Result:
Thus, the program to use R commands to implement descriptive analytics on datasets has been executed and verified successfully.

EXP NO: 3
EXPLORE THE VARIOUS VARIABLES AND ROW FILTER IN R FOR DATA CLEANING
DATE:

AIM:
To develop a program to create various variables and row filter for data cleaning.
ALGORITHM:
1. Load the sample dataset "mtcars" into the variable mydata.
2. Display the first few rows of the dataset to get an overview.
3. Calculate summary statistics for numeric variables.
4. Create a frequency table for a categorical variable (e.g., 'cyl').
5. Compute a correlation matrix for selected numeric variables.
6. Filter rows based on specific conditions (e.g., keep cars with more than 20 mpg
and less than 200 horsepower).
7. Check for missing values in the dataset.
8. Remove rows with missing values to create a cleaned dataset.
9. Display the filtered dataset and the cleaned dataset to inspect the results.

PROGRAM:
data(mtcars)
mydata <- mtcars
head(mydata)
#Explore Variables
summary(mydata)
table(mydata$cyl)
correlation_matrix <- cor(mydata[, c("mpg", "hp", "wt")])
print(correlation_matrix)
#Row Filtering
filtered_data <- mydata[mydata$mpg > 20 & mydata$hp < 200, ]
head(filtered_data)
missing_values <- sum(is.na(mydata))
cat("Number of missing values in the dataset: ", missing_values, "\n")

cleaned_data <- mydata[complete.cases(mydata), ]
head(cleaned_data)

OUTPUT:

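As a side note, the row filter in the program above can also be written with base R's subset(), which avoids repeating the data-frame name; a sketch under the same conditions (mpg > 20 and hp < 200):

```r
data(mtcars)
# Keep cars with more than 20 mpg and less than 200 horsepower
filtered_subset <- subset(mtcars, mpg > 20 & hp < 200)
head(filtered_subset)
```

subset() selects the same rows as the bracket-indexing form mydata[mydata$mpg > 20 & mydata$hp < 200, ].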
RESULT:
Thus, the program has been executed and the output was verified
successfully.

Exp No: 4
USE R COMMANDS FOR PROBABILITY DISTRIBUTION AND PROBABILITY STATISTICS
Date:

Aim:
To use R commands to implement the probability distribution and
probability statistics
Procedure:
 Make sure R is installed to begin programming.
 Import the required dataset as a CSV file if needed for the analysis.
 Declare the required variables.
 Declare variables for the probability distributions and statistics.
 Apply the functions and print the outputs.

 Normal Distribution:
- Generate random data from a normal distribution and calculate
the mean and standard deviation:
 Binomial Distribution:
- Compute the probability of getting exactly 3 successes in 10
trials with a success probability of 0.3:
 Poisson Distribution:
- Calculate the probability of observing exactly 2 events when
the average event rate is 1 event per unit of time:
 Descriptive Statistics:
- Calculate the mean and standard deviation of a dataset:
 Hypothesis Testing (t-test):
- Perform a two-sample t-test to compare the means of two groups
(e.g., group1 and group2):
CODE:
NORMAL DISTRIBUTION:
data <- rnorm(100, mean = 0, sd = 1)
mean(data)
sd(data)
BINOMIAL DISTRIBUTION:

dbinom(3, size = 10, prob = 0.3)
POISSON DISTRIBUTION:
dpois(2, lambda = 1)
DESCRIPTIVE STATISTICS:
data <- c(10, 15, 20, 25, 30)
mean(data)
sd(data)
HYPOTHESIS TESTING:
group1 <- c(20, 22, 18, 24, 25)
group2 <- c(15, 17, 21, 19, 23)
t.test(group1, group2)

OUTPUT:

NORMAL DISTRIBUTION:
[1] -0.01515646
[1] 1.067949
BINOMIAL DISTRIBUTION:
[1] 0.2668279
POISSON DISTRIBUTION:
[1] 0.1839397
DESCRIPTIVE STATISTICS:
[1] 20
[1] 7.071068
HYPOTHESIS TESTING:
	Welch Two Sample t-test

data:  group1 and group2
t = 3.1091, df = 6.8347, p-value = 0.01867
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.323624 11.676376
sample estimates:
mean of x mean of y
21.8 19.0
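The d* density functions used in the code above have cumulative counterparts (the p* functions); for example, the probability of at most 3 successes is the sum of the point probabilities from 0 to 3. A short sketch:

```r
# P(X <= 3) for X ~ Binomial(10, 0.3), computed two equivalent ways
pbinom(3, size = 10, prob = 0.3)
sum(dbinom(0:3, size = 10, prob = 0.3))
# P(X <= 2) for X ~ Poisson(1)
ppois(2, lambda = 1)
sum(dpois(0:2, lambda = 1))
```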

Result:
Thus the program to use R commands to implement the probability
distribution and probability statistics has been executed and verified
successfully.

EXP NO: 5
FORMULATE REAL BUSINESS PROBLEM SCENARIOS INTO HYPOTHESES AND SOLVE USING R STATISTICAL TESTING FEATURES
DATE:

Aim:
To use R statistical testing features to implement hypothesis testing for real business scenarios.
Algorithm:

1. Simulate two datasets for sales volume before and after a price change.
2. Create histograms to visually represent the distribution of sales volume before and after the price change.
3. Conduct an independent t-test to compare the means of the "before_price" and "after_price" samples, and store the test result in the variable "t_test_result".
4. Output the means of sales volume before and after the
price change.
5. Present the test statistic, degrees of freedom, and the p-value
from the t-test.
6. If the p-value is less than the significance level, conclude that
the price change has a statistically significant effect on sales volume.
7. If the p-value is greater than or equal to the significance level,
conclude that there is no statistically significant effect of the price
change on sales volume.

Code:
# Simulated data: sales volume before and after a price change
set.seed(123)
before_price <- rnorm(50, mean = 50, sd = 10)  # Before price change
after_price <- rnorm(50, mean = 55, sd = 10)   # After price change
# 2. Visual exploration:
hist(before_price, main = "Sales Volume Before Price Change",
     xlab = "Sales Volume", col = "lightblue")
hist(after_price, main = "Sales Volume After Price Change",
     xlab = "Sales Volume", col = "lightgreen")
# 3. Hypothesis testing:
# Perform an independent t-test (assuming independence between the two samples)
t_test_result <- t.test(before_price, after_price)
# 4. Result analysis:
cat("Independent t-Test Results:\n")
cat("Before Price Change - Mean Sales Volume:", mean(before_price), "\n")
cat("After Price Change - Mean Sales Volume:", mean(after_price), "\n")
cat("Test statistic:", t_test_result$statistic, "\n")
cat("P-value:", t_test_result$p.value, "\n")
# 5. Make a decision based on the p-value
alpha <- 0.05
if (t_test_result$p.value < alpha) {
  cat("Conclusion: Reject the null hypothesis. The price change has a statistically significant effect on sales volume.\n")
} else {
  cat("Conclusion: Fail to reject the null hypothesis. There is no statistically significant effect of the price change on sales volume.\n")
}

Output:
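Step 5 of the algorithm uses only the p-value, but the htest object returned by t.test() also carries the confidence interval and the group means; a minimal sketch, re-simulating the same data:

```r
set.seed(123)
before_price <- rnorm(50, mean = 50, sd = 10)
after_price  <- rnorm(50, mean = 55, sd = 10)
t_test_result <- t.test(before_price, after_price)
# The result is a list; individual components can be extracted by name
t_test_result$conf.int   # 95% confidence interval for the difference in means
t_test_result$estimate   # the two sample means
t_test_result$statistic  # the t statistic
```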

Result:
Thus R statistical testing features are used to implement hypothesis for real
business scenarios.

EXP NO: 6
PROGRAM TO APPLY VARIOUS PLOT FEATURES IN R ON SAMPLE DATASETS AND VISUALIZE
DATE:

AIM:
To Apply Various Plot features in R on sample datasets and visualize.
PROCEDURE:
1. Load Libraries or generate data
2. Initialize a plot
3. Add geometric objects and customize the plot.
4. Save or display the plot in R environment
5. Iterate and experiment to refine your plot as required.

PROGRAM:
1. Bar Plot: Bar plots are generally used for plotting categorical and continuous variables.
barplot(airquality$Ozone, main = 'Ozone Concentration in air',
        ylab = 'Ozone levels', col = 'blue', horiz = FALSE)

Output:

2. Histogram: In a histogram values are grouped into consecutive
intervals called bins.
data(airquality)
hist(airquality$Temp, main = "La Guardia Airport's Maximum Temperature (Daily)",
     xlab = "Temperature (Fahrenheit)", xlim = c(50, 125), col = "yellow", freq = TRUE)

Output:

3. Box Plot: A boxplot depicts information such as the minimum, the quartiles, the median, and the maximum of the data.
data(airquality)
boxplot(airquality$Wind, main = "Average wind speed at La Guardia Airport",
        xlab = "Miles per hour", ylab = "Wind", col = "orange", border = "brown",
        horizontal = TRUE, notch = TRUE)

Output:

4. Scatter Plot: A scatter plot is composed of many points on a Cartesian plane.
data(airquality)
plot(airquality$Ozone, airquality$Month, main = "Scatterplot Example",
     xlab = "Ozone Concentration in parts per billion", ylab = "Month of observation",
     pch = 19)
Output:

5. Heat Map: A heatmap is a graphical representation of data that uses colours to visualize the values of a matrix.
data <- matrix(rnorm(50, 0, 5), nrow = 5, ncol = 5)
colnames(data) <- paste0("col", 1:5)
rownames(data) <- paste0("row", 1:5)
heatmap(data)
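Step 4 of the procedure mentions saving the plot. A sketch of writing a plot to a PNG file with the base grDevices functions (the filename here is illustrative):

```r
# Open a PNG device, draw the plot, then close the device to flush the file
png("temp_hist.png", width = 800, height = 600)
data(airquality)
hist(airquality$Temp, main = "Temperature", xlab = "Fahrenheit", col = "yellow")
dev.off()
file.exists("temp_hist.png")
```

The same pattern works for pdf(), jpeg(), and the other grDevices functions.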

Result:
Thus, the various plot features in R have been applied on sample datasets and the plots were displayed.

Expt. No: 7
WRITE AND EXECUTE WORD COUNT, WORD SEARCH AND PATTERN SEARCH PROBLEMS FROM LARGE TEXT FILES
Date:

Aim:
To develop a program to write and execute word count, word search and pattern search problems from large text files using R.

A) Word Count:
Procedure:
 Initialize count to 0.
 Open the text file and loop through each line in the file.
 Split each line into words and add the number of words to count.
 Close the text file.
 Return count.

Code:
text <- suppressWarnings(readLines("Products.txt"))
text <- paste(text, collapse = " ")
words <- unlist(strsplit(text, "\\s+"))
word_count <- length(words)
cat("Word Count:", word_count)
Output:

B) Word Search:
Procedure:

22
 Initialize foundWords as an empty list.
 Open the text file and loop through each line in the file.
 Split the line into words.
 For each word in the line, if the word equals targetWord, add it to the foundWords list.
 Close the text file.
 Return foundWords.
Code:
text <- suppressWarnings(readLines("Products.txt"))
text <- paste(text, collapse = " ")
search_word <- "monitor"
word_found <- grep(search_word, text)
if (length(word_found) > 0) {
cat("Word found at positions:", word_found)
} else {
cat("Word not found in the text.")
}
Output:

C) Pattern Search:
Procedure:
 Initialize foundLines as an empty list.
 Open the text file and loop through each line in the file.
 If targetPattern is found in the line, add the line to the foundLines list.
 Close the text file.
 Return foundLines.

Code:
text <- suppressWarnings(readLines("Products.txt"))
text <- paste(text, collapse = " ")
pattern <- "800"
pattern_found <- grep(pattern, text, perl = TRUE)
if (length(pattern_found) > 0) {
cat("Pattern found at positions:", pattern_found)
} else {
cat("Pattern not found in the text.")
}
Output:
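grep() above reports only where the pattern occurs; to count how many times a word or pattern appears, gregexpr() can be used. A sketch on an inline string, since Products.txt is not reproduced here:

```r
text <- "monitor keyboard monitor mouse monitor"
# gregexpr returns the start position of every match (-1 if there is none)
matches <- gregexpr("monitor", text)[[1]]
n_matches <- if (matches[1] == -1) 0 else length(matches)
cat("Occurrences of 'monitor':", n_matches, "\n")   # 3
```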

RESULT:
Thus, the program to write and execute word count, word search and pattern search problems from large text files using R has been executed and the output was verified successfully.

Exp No: 8
EXPLORE VARIOUS DATA PREPROCESSING OPTIONS USING BENCHMARK DATASETS
Date:

Aim:
To use Python commands to implement data preprocessing using datasets.
Procedure:
 Check that Python is installed to implement data preprocessing.
 Import pandas and scikit-learn's train_test_split.
 Create a dataset using a DataFrame, or import a dataset as a CSV file if required.
 Apply the required data preprocessing commands and print the output.

CODE:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({'Age': [25, 30, 22, 35, 28, 40],
                     'Salary': [50000, 60000, 45000, 75000, 55000, 80000],
                     'Education': ['Bachelor', 'Master', 'Bachelor', 'Ph.D', 'Master', 'Bachelor']})
print("First few rows of the dataset:")
print(data.head())
print("Statistical summary of numeric columns:")
print(data.describe())
missing_values = data.isna().sum()
print("Missing values in the dataset:")
print(missing_values)
data.rename(columns={'Salary': 'Income'}, inplace=True)
data['Age'] = data['Age'].astype(float)
data.drop_duplicates(inplace=True)
X = data.drop('Education', axis=1)
y = data['Education']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
data.to_csv('preprocessed_data.csv', index=False)

OUTPUT:

First few rows of the dataset:

   Age  Salary Education
0   25   50000  Bachelor
1   30   60000    Master
2   22   45000  Bachelor
3   35   75000      Ph.D
4   28   55000    Master

Statistical summary of numeric columns:

             Age        Salary
count   6.000000      6.000000
mean   30.000000  60833.333333
std     6.603030  13934.369977
min    22.000000  45000.000000
25%    25.750000  51250.000000
50%    29.000000  57500.000000
75%    33.750000  71250.000000
max    40.000000  80000.000000

Missing values in the dataset:

Age          0
Salary       0
Education    0
dtype: int64

Result:
Thus, the program to use Python commands for data preprocessing options has been implemented using a dataset and verified successfully.
