Ds Practical

Uploaded by

Akula Tharuni
Copyright: © All Rights Reserved

6 a) Write an R script to find basic descriptive statistics using the summary(), str(), and quantile() functions on the mtcars dataset.

● Loading the Dataset: data(mtcars) loads the mtcars dataset into the R environment.
● Structure of the Dataset: str(mtcars) displays the structure of the dataset: the number of observations and variables, each variable's data type, and its first few values.
● Summary Statistics: summary(mtcars) provides summary statistics for each numerical variable in the mtcars dataset: the minimum, 1st quartile, median (2nd quartile), mean, 3rd quartile, and maximum.
● Calculating Quartiles: quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75)) calculates the quartiles of the mpg column. The probs argument specifies which quantiles to compute (0.25 for the 1st quartile, 0.5 for the median, and 0.75 for the 3rd quartile).
● Printing Quartiles: The cat() and print() functions display the quartile values for the mpg column.

# Step 1: Load the mtcars dataset (print(mtcars) displays the full table)
data(mtcars)

# Step 2: Display the structure of the dataset
str(mtcars)

# Step 3: Summary statistics of the dataset
summary(mtcars)

# Step 4: Calculate quartiles for a specific column (mpg in this example)
quantiles <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))
cat("\nQuartiles for mpg column:\n")
print(quantiles)
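
As a quick check on the steps above, the following sketch compares the quartiles reported by summary() with those returned by quantile() for the mpg column. The names mpg_summary and mpg_quartiles are introduced here for illustration only.

```r
data(mtcars)

# Six-number summary of a single column
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# The same quartiles computed explicitly
mpg_quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))
print(mpg_quartiles)

# The 50% quantile is the median, so the two values should agree
print(isTRUE(all.equal(unname(mpg_quartiles["50%"]), median(mtcars$mpg))))
```

Both functions use the same default quantile method, so the 1st and 3rd quartiles should agree as well.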

6 b) Write an R script to find a subset of a dataset using the subset() and aggregate() functions on the iris dataset.

# Step 1: Load the iris dataset (print(iris) displays the full table)
data(iris)

# Step 2: Display the structure of the dataset
str(iris)

# Step 3: Subset the dataset using the subset() function
# Example: keep rows where Sepal.Length > 5.5
subset_iris <- subset(iris, Sepal.Length > 5.5)

# Display the first few rows of the subsetted dataset
cat("\nSubset of iris dataset where Sepal.Length > 5.5:\n")
print(head(subset_iris))
# Aggregate using aggregate() function
The aggregate() function splits a data frame into groups and applies a summary function to each group. Common summary functions include sum(), mean(), min(), max(), median(), and sd(). Here's a basic program that demonstrates how to use these functions with aggregate() on a sample dataset.
# Create a sample data frame
data <- data.frame(
group = rep(c("A", "B", "C"), each = 5),
value = c(10, 15, 10, 14, 20, 5, 7, 12, 8, 6, 8, 9, 6, 3, 2)
)

print(data)
# Sum of values by group
sum_by_group <- aggregate(value ~ group, data, sum)
print(sum_by_group)

# Mean of values by group


mean_by_group <- aggregate(value ~ group, data, mean)
print(mean_by_group)
# Minimum value by group
min_by_group <- aggregate(value ~ group, data, min)
print(min_by_group)

# Maximum value by group


max_by_group <- aggregate(value ~ group, data, max)
print(max_by_group)

# Median of values by group


median_by_group <- aggregate(value ~ group, data, median)
print(median_by_group)

# Standard deviation of values by group


sd_by_group <- aggregate(value ~ group, data, sd)
print(sd_by_group)
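
The exercise asks for aggregate() on the iris dataset, while the program above uses a hand-made data frame. A minimal sketch of the same idea applied to iris might look like this; the result names mean_sepal_by_species and mean_all_by_species are assumptions chosen for illustration.

```r
data(iris)

# Mean of Sepal.Length for each species
mean_sepal_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(mean_sepal_by_species)

# Mean of all four measurements for each species;
# the dot stands for every column other than Species
mean_all_by_species <- aggregate(. ~ Species, data = iris, FUN = mean)
print(mean_all_by_species)
```

Swapping FUN for median, sd, min, or max gives the other per-species summaries in the same way.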
__________________________________________________________
5 a) Using R Objects (Variables):
# Define variables for operands
a <- 10
b <- 5
# Perform arithmetic operations using variables
sum_result <- a + b
diff_result <- a - b
prod_result <- a * b
div_result <- a / b

# Print the results


cat("Using R objects (variables):\n")
cat("Sum:", sum_result, "\n")
cat("Difference:", diff_result, "\n")
cat("Product:", prod_result, "\n")
cat("Division:", div_result, "\n")

Explanation:

● Define Variables: a <- 10 and b <- 5 create two variables a and b with values 10 and
5, respectively.
● Perform Arithmetic Operations: sum_result <- a + b, diff_result <- a - b,
prod_result <- a * b, div_result <- a / b perform addition, subtraction,
multiplication, and division using the variables a and b, storing the results in respective
variables (sum_result, diff_result, prod_result, div_result).
● Print Results: cat() functions are used to print each result to the console.

Without Using R Objects (Directly on Console):

# Perform arithmetic operations directly

cat("Without using R objects (directly):\n")


cat("Sum:", 10 + 5, "\n")

cat("Difference:", 10 - 5, "\n")

cat("Product:", 10 * 5, "\n")

cat("Division:", 10 / 5, "\n")

Explanation:

● Perform Arithmetic Operations: cat("Sum:", 10 + 5, "\n"), cat("Difference:", 10 - 5, "\n"), cat("Product:", 10 * 5, "\n"), and cat("Division:", 10 / 5, "\n") directly perform addition, subtraction, multiplication, and division using numeric literals (10 and 5).
● Print Results: cat() functions are used to print each result directly to the console.

Comparison:

● Using R Objects: This approach is useful when you want to store intermediate results for
further calculations or when you want to reuse values multiple times.
● Without Using R Objects: This approach is simpler and more concise for one-off
calculations where the result does not need to be stored or reused.
_________________________________________________________________

5 b) R AS CALCULATOR APPLICATION

Using mathematical functions on console

# Addition
cat("Addition (10 + 5):\n")
print(sum(10, 5))

# Subtraction (base R has no subtraction counterpart to sum(), so use the - operator)
cat("\nSubtraction (10 - 5):\n")
print(10 - 5)

# Multiplication
cat("\nMultiplication (10 * 5):\n")
print(prod(10, 5))

# Division
cat("\nDivision (10 / 5):\n")
print(10 / 5)

Explanation:

● Addition: sum(10, 5) computes the sum of 10 and 5.
● Subtraction: 10 - 5 computes the difference between 10 and 5. Note that diff(10, 5) does not subtract: diff() computes lagged differences within a vector, and diff(10, 5) returns numeric(0).
● Multiplication: prod(10, 5) computes the product of 10 and 5.
● Division: 10 / 5 directly performs division of 10 by 5.
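
The four operations can also be wrapped in a single function, which is one possible way to organize a calculator application. This is only a sketch: calculate() is a hypothetical helper written for this example, not a base R function.

```r
# calculate() is a hypothetical helper: it applies the operator named in op
calculate <- function(a, b, op) {
  switch(op,
    "+" = a + b,
    "-" = a - b,
    "*" = a * b,
    "/" = a / b,
    stop("Unsupported operator: ", op)  # default branch for anything else
  )
}

cat("10 + 5 =", calculate(10, 5, "+"), "\n")
cat("10 - 5 =", calculate(10, 5, "-"), "\n")
cat("10 * 5 =", calculate(10, 5, "*"), "\n")
cat("10 / 5 =", calculate(10, 5, "/"), "\n")
```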

5 c) Write an R script to create R objects for the calculator application and save them in a specified location on disk.
# Specify the file path where you want to save the R objects

save_file <- "/path/to/your/directory/calculator_results.RData"
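
The script for this exercise stops after the file path. A minimal sketch of how such objects might be created, saved, and restored follows. It writes to a temporary file under tempdir() rather than the placeholder path, and the object names and demo_file are assumptions chosen for illustration.

```r
# Create the calculator objects (names are illustrative)
a <- 10
b <- 5
sum_result <- a + b
diff_result <- a - b
prod_result <- a * b
div_result <- a / b

# Save to a temporary location; replace with your own directory as needed
demo_file <- file.path(tempdir(), "calculator_results.RData")
save(a, b, sum_result, diff_result, prod_result, div_result, file = demo_file)

# Remove the objects, then restore them from disk to confirm the round trip
rm(a, b, sum_result, diff_result, prod_result, div_result)
load(demo_file)
cat("Restored sum_result:", sum_result, "\n")
```

save() stores several named objects in one file and load() restores them under their original names; saveRDS() and readRDS() are the alternative when a single object is enough.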

8. CORRELATION AND COVARIANCE


a. Find the correlation matrix.
b. Plot the correlation plot on the dataset and visualize giving an overview of
relationships among data on iris data.
c. Analysis of covariance.

Explanation:
Load the Dataset: The iris dataset is loaded, and the first few rows are printed
for inspection.

Calculate the Correlation Matrix: The cor() function computes the correlation
matrix, excluding the Species column because it is a categorical variable.

Visualize the Correlation Matrix: The corrplot package is used to create a visual representation of the correlation matrix.

The iris dataset is a classic dataset used for machine learning and statistical
analysis. It contains measurements of sepal length, sepal width, petal length, and
petal width for three species of iris flowers.

Here's an R script to calculate the correlation matrix for the iris dataset and
visualize it using the corrplot package:

correlation_matrix <- cor(iris[, 1:4])

print(correlation_matrix)

● Sepal.Length and Petal.Length: There is a strong positive correlation (0.87), meaning as sepal length increases, petal length also tends to increase.
● Sepal.Length and Petal.Width: There is a strong positive correlation (0.82), indicating that sepal length and petal width increase together.
● Sepal.Length and Sepal.Width: There is a very weak negative correlation (-0.12), suggesting almost no linear relationship between sepal length and sepal width.
● Sepal.Width and Petal.Length: There is a moderate negative correlation (-0.43), indicating that as sepal width increases, petal length tends to decrease.
● Sepal.Width and Petal.Width: There is a moderate negative correlation (-0.37), indicating that as sepal width increases, petal width tends to decrease.
● Petal.Length and Petal.Width: There is a very strong positive correlation (0.96), meaning petal length and petal width increase together.

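# Visualize the correlation matrix using corrplot
# install.packages("corrplot")  # run once if the package is not installed
library(corrplot)
corrplot(correlation_matrix, type = "upper")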
Steps to Analyze the Correlation Plot

1. Identify the Color Scale:


○ Typically, the color scale ranges from blue to red. Blue usually
indicates positive correlations, while red indicates negative
correlations.
○ The intensity of the color (darker shades) represents the strength of the
correlation. Dark blue means a strong positive correlation, while dark
red means a strong negative correlation. Light colors indicate weaker
correlations.
2. Circle Sizes and Shapes:
○ The size of the circles indicates the strength of the correlation. Larger
circles represent stronger correlations (either positive or negative).
○ Some corrplot methods use shapes (like ellipses) where the
orientation also indicates the direction (positive or negative) and
strength.
3. Correlation Coefficients:
○ Look at the values in the correlation matrix for exact correlation
coefficients.
○ A value close to +1 indicates a strong positive correlation.
○ A value close to -1 indicates a strong negative correlation.
○ A value close to 0 indicates little or no linear correlation.
4. Upper vs. Lower Triangle:
○ In the type = "upper" option used in the script, only the upper
triangle of the matrix is shown. The lower triangle is a mirror image
of the upper triangle.
○ Focus on the unique pairs of variables in the upper triangle for
analysis.

Example Analysis of the Iris Dataset

Let's interpret the correlation plot for the iris dataset:

1. Sepal.Length vs. Petal.Length:


○ Color: Dark blue
○ Correlation Coefficient: 0.87
○ Interpretation: There is a strong positive correlation. As the sepal
length increases, the petal length also tends to increase.
2. Sepal.Length vs. Petal.Width:
○ Color: Dark blue
○ Correlation Coefficient: 0.82
○ Interpretation: There is a strong positive correlation. As the sepal
length increases, the petal width also tends to increase.
3. Sepal.Length vs. Sepal.Width:
○ Color: Very light (close to white)
○ Correlation Coefficient: -0.12
○ Interpretation: There is a very weak negative correlation, indicating
almost no relationship between sepal length and sepal width.
4. Petal.Length vs. Petal.Width:
○ Color: Dark blue
○ Correlation Coefficient: 0.96
○ Interpretation: There is a very strong positive correlation. As the
petal length increases, the petal width also tends to increase.
5. Sepal.Width vs. Petal.Length:
○ Color: Light red
○ Correlation Coefficient: -0.43
○ Interpretation: There is a moderate negative correlation. As the sepal
width increases, the petal length tends to decrease.
6. Sepal.Width vs. Petal.Width:
○ Color: Light red
○ Correlation Coefficient: -0.37
○ Interpretation: There is a moderate negative correlation. As the sepal
width increases, the petal width tends to decrease.

General Insights from the Iris Dataset

● Strong Positive Correlations: Petal length and petal width, sepal length and
petal length, sepal length and petal width.
● Moderate Negative Correlations: Sepal width with petal length and petal
width.
● Weak Correlations: Sepal length with sepal width.
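
Part (c) of this exercise mentions covariance, for which no script is given. As a minimal sketch, assuming the intended starting point is the covariance matrix of the four numeric columns, cov() can be used in the same way as cor(). The names covariance_matrix and rescaled are illustrative. The last step checks the standard relationship cor(x, y) = cov(x, y) / (sd(x) * sd(y)) with cov2cor().

```r
data(iris)

# Covariance matrix of the four numeric columns (Species is excluded)
covariance_matrix <- cov(iris[, 1:4])
print(covariance_matrix)

# Rescaling the covariance matrix by the standard deviations
# should reproduce the correlation matrix
rescaled <- cov2cor(covariance_matrix)
print(isTRUE(all.equal(rescaled, cor(iris[, 1:4]))))
```

Unlike correlations, covariances depend on the units of measurement, so their magnitudes are not directly comparable across pairs of variables.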

__________________________________________________________________

7. Write an R script to perform the F-test, T-test, and Z-test on a given dataset.

F test:

The F-test is a statistical test used to compare two variances to see if they are
significantly different. It is often used in the context of comparing the variances of
two populations or in the analysis of variance (ANOVA) to determine if there are
significant differences among group means.

Here's a step-by-step explanation of how to perform an F-test in R using a sample dataset.

1. Load Sample Data
We'll use the built-in iris dataset for this example. The iris dataset contains measurements of different characteristics of iris flowers from three species.
data("iris")  # print(iris) would display the full dataset
2. Exploring the Data
Let's have a quick look at the dataset to understand its structure.
head(iris)

3. Formulating the Hypothesis


Suppose we want to compare the variances of Sepal.Length between two species of
iris: setosa and versicolor.
Null Hypothesis (H0): The variances of Sepal.Length for setosa and
versicolor are equal.
Alternative Hypothesis (H1): The variances of Sepal.Length for setosa and
versicolor are not equal.

4. Extracting the Data


We need to extract the Sepal.Length data for setosa and versicolor.

setosa <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

5. Performing the F-test


Use the var.test() function to perform the F-test.
f_test_result <- var.test(setosa, versicolor)

print(f_test_result)

The var.test() function returns the F-statistic, the degrees of freedom, the p-value, and a confidence interval for the ratio of the variances. The p-value helps us determine whether to reject the null hypothesis.

6. Interpreting the Results
If the p-value is less than the significance level (commonly 0.05), we reject the null hypothesis and conclude that there is a significant difference between the variances of the two groups.

Example Code

Here's the complete code for the steps above:


# Load necessary package

library(datasets)

# Load the iris dataset

data("iris")

# View the first few rows of the dataset

head(iris)

# Extract Sepal.Length for setosa and versicolor

setosa <- iris$Sepal.Length[iris$Species == "setosa"]

versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

# Perform the F-test

f_test_result <- var.test(setosa, versicolor)

# Print the result

print(f_test_result)

Output Interpretation
The printed output includes the following fields:

● F: The calculated F-statistic, the ratio of the two sample variances.
● num df: Degrees of freedom for the numerator (setosa), n - 1 = 49.
● denom df: Degrees of freedom for the denominator (versicolor), n - 1 = 49.
● p-value: The probability of obtaining a result at least as extreme as the one observed, under the assumption that the null hypothesis is true.

In this case, if the p-value is less than 0.05, we reject the null hypothesis and
conclude that the variances of Sepal.Length for setosa and versicolor are
significantly different.
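
To make the fields above concrete, the F-statistic and two-sided p-value can be recomputed by hand and compared with what var.test() reports. This is a sketch for checking understanding, not a replacement for var.test(); the names f_manual, df1, df2, p_lower, and p_manual are introduced here for illustration.

```r
data(iris)
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

# F-statistic: ratio of the two sample variances
f_manual <- var(setosa) / var(versicolor)

# Degrees of freedom: sample size minus one for each group
df1 <- length(setosa) - 1
df2 <- length(versicolor) - 1

# Two-sided p-value from the F distribution
p_lower <- pf(f_manual, df1, df2)
p_manual <- 2 * min(p_lower, 1 - p_lower)

# Compare with the built-in test
f_test_result <- var.test(setosa, versicolor)
cat("Manual F:", f_manual, " var.test F:", unname(f_test_result$statistic), "\n")
cat("Manual p:", p_manual, " var.test p:", f_test_result$p.value, "\n")
```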

_________________________________________________________

T test
A T-test is a statistical test used to compare the means of two groups. It helps
determine if the means are significantly different from each other.

In this example, we'll focus on the two-sample T-test using the iris dataset in R
to compare the means of Sepal.Length between two species: setosa and
versicolor.
Load Sample Data
We'll use the built-in iris dataset for this example. The iris dataset contains measurements of different characteristics of iris flowers from three species.
data("iris")  # print(iris) would display the full dataset

1. Exploring the Data
Let's have a quick look at the dataset to understand its structure.

head(iris)

2. Formulating the Hypothesis


Suppose we want to compare the means of Sepal.Length between two species of
iris: setosa and versicolor.
Null Hypothesis (H0): The means of Sepal.Length for setosa and
versicolor are equal.
Alternative Hypothesis (H1): The means of Sepal.Length for setosa and
versicolor are not equal.

3. Extracting the Data


We need to extract the Sepal.Length data for setosa and versicolor.
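setosa <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

4. Performing the Two-sample T-test
Use the t.test() function to perform the two-sample T-test. By default, t.test() runs Welch's T-test, which does not assume equal variances; pass var.equal = TRUE for the pooled-variance version.
t_test_result <- t.test(setosa, versicolor)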

print(t_test_result)

The t.test() function returns the T-statistic, degrees of freedom, p-value, and confidence interval for the difference in means.

5. Interpreting the Results
If the p-value is less than the significance level (commonly 0.05), we reject the null hypothesis and conclude that there is a significant difference between the means of the two groups.

Example Code

Here's the complete code for the steps above:

# Load necessary package

library(datasets)

# Load the iris dataset

data("iris")

# View the first few rows of the dataset

head(iris)

# Extract Sepal.Length for setosa and versicolor

setosa <- iris$Sepal.Length[iris$Species == "setosa"]

versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

# Perform the two-sample T-test

t_test_result <- t.test(setosa, versicolor)

# Print the result

print(t_test_result)

Output Interpretation:
● t: The calculated T-statistic.
● df: Degrees of freedom.
● p-value: The probability of obtaining a result at least as extreme as the one
observed, under the assumption that the null hypothesis is true.
● 95 percent confidence interval: An interval estimate for the true difference in means, built by a procedure that captures the true value in 95% of repeated samples.
● mean of x: Mean of Sepal.Length for setosa.
● mean of y: Mean of Sepal.Length for versicolor.

In this case, if the p-value is less than 0.05, we reject the null hypothesis and
conclude that the means of Sepal.Length for setosa and versicolor are significantly
different.
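
The quantities listed above can be reproduced by hand for the default (Welch) form of the test. The sketch below is meant as a check on the formulas, assuming the default arguments of t.test(); the names v1, v2, t_manual, df_manual, and p_manual are illustrative.

```r
data(iris)
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

n1 <- length(setosa)
n2 <- length(versicolor)

# Variance of each sample mean
v1 <- var(setosa) / n1
v2 <- var(versicolor) / n2

# Welch T-statistic: difference in means over its standard error
t_manual <- (mean(setosa) - mean(versicolor)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df_manual)

# Compare with the built-in test
t_test_result <- t.test(setosa, versicolor)
cat("Manual t:", t_manual, " t.test t:", unname(t_test_result$statistic), "\n")
cat("Manual df:", df_manual, " t.test df:", unname(t_test_result$parameter), "\n")
```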

________________________________________________________________

Z-Test

A Z-test is used to determine if there is a significant difference between the means of two groups, especially when the sample size is large (typically n > 30) or the population standard deviation is known. It is similar to the T-test, but it is based on the Z-distribution (the standard normal distribution).

Step-by-Step Guide
1. Install and Load Necessary Packages
For the Z-test, we'll use the BSDA package in R, which provides a function
for performing Z-tests.
install.packages("BSDA")

library(BSDA)

2. Generate Sample Data


Suppose we have two large samples from two different populations, and we
want to compare their means. We'll generate synthetic data for this example.
# Set seed for reproducibility

set.seed(123)

# Generate sample data

sample1 <- rnorm(100, mean = 50, sd = 10) # Sample 1 with mean 50 and sd 10

sample2 <- rnorm(100, mean = 52, sd = 10) # Sample 2 with mean 52 and sd 10

3. Formulating the Hypothesis


We want to compare the means of sample1 and sample2.
Null Hypothesis (H0): The means of sample1 and sample2 are equal.
Alternative Hypothesis (H1): The means of sample1 and sample2 are
not equal.
4. Performing the Z-test
Use the z.test() function from the BSDA package to perform the Z-test.
Note that the Z-test assumes that the population standard deviations are
known. For this example, we will assume they are both 10.

z_test_result <- z.test(sample1, sample2, sigma.x = 10, sigma.y = 10)
print(z_test_result)

The z.test() function returns the Z-statistic, p-value, and confidence interval for the difference in means.

5. Interpreting the Results
If the p-value is less than the significance level (commonly 0.05), we reject the null hypothesis and conclude that there is a significant difference between the means of the two groups.

Example Code

Here's the complete code for the steps above:

# Install and load necessary package

install.packages("BSDA")

library(BSDA)

# Set seed for reproducibility

set.seed(123)

# Note: set.seed() fixes the seed of the random number generator, so the same
# random numbers are produced each time the script is run.

# Generate sample data

sample1 <- rnorm(100, mean = 50, sd = 10) # Sample 1 with mean 50 and sd 10

sample2 <- rnorm(100, mean = 52, sd = 10) # Sample 2 with mean 52 and sd 10

# Note: rnorm() generates random numbers from a normal distribution.

# Perform the Z-test

z_test_result <- z.test(sample1, sample2, sigma.x = 10, sigma.y = 10)

# Print the result

print(z_test_result)

Output (illustrative; the values printed when you run the script may differ):

Two-sample z-Test

data: sample1 and sample2

z = -1.4142, p-value = 0.1573

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-4.6919583 0.6919583

sample estimates:

mean of x mean of y

50.29094 51.91209

● z: The calculated Z-statistic.


● p-value: The probability of obtaining a result at least as extreme as the one
observed, under the assumption that the null hypothesis is true.
● 95 percent confidence interval: An interval estimate for the true difference in means, built by a procedure that captures the true value in 95% of repeated samples.
● mean of x: Mean of sample1.
● mean of y: Mean of sample2.

If the p-value is greater than 0.05, we fail to reject the null hypothesis: the data do not provide sufficient evidence of a difference between the means of sample1 and sample2. This is not the same as showing that the means are equal.
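
The two-sample Z-statistic can also be computed in base R without the BSDA package, which makes the formula explicit. The sketch below assumes, as in the example above, that both population standard deviations are known to be 10; the names sigma_x, sigma_y, se, z_manual, p_manual, and ci are illustrative.

```r
set.seed(123)
sample1 <- rnorm(100, mean = 50, sd = 10)
sample2 <- rnorm(100, mean = 52, sd = 10)

# Population standard deviations, assumed known
sigma_x <- 10
sigma_y <- 10

# Standard error of the difference in means
se <- sqrt(sigma_x^2 / length(sample1) + sigma_y^2 / length(sample2))

# Z-statistic and two-sided p-value from the standard normal distribution
mean_diff <- mean(sample1) - mean(sample2)
z_manual <- mean_diff / se
p_manual <- 2 * pnorm(-abs(z_manual))

# 95% confidence interval for the difference in means
ci <- mean_diff + c(-1, 1) * qnorm(0.975) * se

cat("z =", z_manual, " p-value =", p_manual, "\n")
cat("95% CI:", ci[1], "to", ci[2], "\n")
```

With 100 observations per group and both standard deviations equal to 10, the standard error is sqrt(2), about 1.414, so the confidence interval has a half-width of about 2.77 whatever the sample means turn out to be.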

__________________________________________________________

11. Scraping data from a website without using an API.

Web scraping in R can be done using packages such as rvest, httr, and xml2.
Here’s a step-by-step guide to scraping data from a website without using an API.

Example Scenario

Let's scrape data from a website like Wikipedia. For this example, we'll scrape the
table of "List of countries by GDP (nominal)" from Wikipedia.

Step-by-Step Guide
1. Install and Load Necessary Packages
install.packages("rvest")

install.packages("httr")

install.packages("xml2")

library(rvest)

library(httr)

library(xml2)

2. Specify the URL to Scrape


url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
3. Read the HTML Content of the Page
webpage <- read_html(url)
4. Extract the Table of Interest
Use CSS selectors to extract the table. In this case, the table containing the
GDP data has a class wikitable.
tables <- webpage %>% html_nodes("table.wikitable")
5. Convert the Table to a Data Frame
Convert the first matching table to a data frame. Check that it is the table you need, since the layout of the page can change over time.
gdp_table <- tables[[1]] %>% html_table(fill = TRUE)
6. Clean the Data (if necessary)
Clean the data by renaming columns, removing unnecessary rows, etc.
# View the first few rows of the table

head(gdp_table)

Complete Code

Here’s the complete R script for scraping the GDP data from Wikipedia:

# Install and load necessary packages

install.packages("rvest")

install.packages("httr")

install.packages("xml2")
library(rvest)

library(httr)

library(xml2)

# Specify the URL to scrape

url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Read the HTML content of the page

webpage <- read_html(url)

# Extract the table of interest

tables <- webpage %>% html_nodes("table.wikitable")

# Convert the table to a data frame

gdp_table <- tables[[1]] %>% html_table(fill = TRUE)

# View the first few rows of the table

head(gdp_table)
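
Because a live page can change or be unreachable, it can help to practice the same steps on a small HTML fragment first. The sketch below assumes the rvest package is installed and parses a hypothetical table held in a string; demo_html, demo_page, demo_table, and the table contents are all made up for illustration.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded page
demo_html <- '<html><body>
<table class="wikitable">
<tr><th>Country</th><th>Value</th></tr>
<tr><td>Alpha</td><td>10</td></tr>
<tr><td>Beta</td><td>20</td></tr>
</table>
</body></html>'

# read_html() accepts literal HTML as well as a URL
demo_page <- read_html(demo_html)

# Select the table by its CSS class and convert it to a data frame
demo_table <- demo_page %>% html_node("table.wikitable") %>% html_table()
print(demo_table)
```

The same selector and html_table() call are then applied to the page returned by read_html(url).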

__________________________________________________________

Completed
