R Cac1

CAC – 1 :

ASSIGNMENT COMPONENT

Name : Aaditya Kumar Dhaka


Registration no : 2448001
Course Code : MDS 272
Course Name : Inferential Statistics Using R
R-Codes:

# Load the 'mtcars' dataset


data("mtcars")

# Example 1: Population vs. Sample

Objective: Visualize the difference between the entire population's distribution of mpg (miles per
gallon) in the mtcars dataset and a random sample of that data.

# Set a seed for reproducibility and draw a random sample of 10 rows


set.seed(123)
sample_data <- mtcars[sample(1:nrow(mtcars), 10), ]

# Plot population vs. sample for 'mpg' (Miles per Gallon)


hist(mtcars$mpg, breaks = 10, col = rgb(0, 0, 1, 0.5),
     main = "Population vs Sample Distribution of 'mpg'",
     xlab = "Miles per Gallon (mpg)", ylab = "Frequency")
hist(sample_data$mpg, breaks = 10, col = rgb(1, 0, 0, 0.5), add = TRUE)

# Add a legend
legend("topright", legend = c("Population", "Sample"),
       fill = c(rgb(0, 0, 1, 0.5), rgb(1, 0, 0, 0.5)))

Interpretation: The blue bars show the mpg distribution for all 32 cars in mtcars, which this example treats as the population, while the red bars show the mpg distribution for a random sample of 10 cars. Although the sample is only a subset, it is meant to reflect the broader pattern of the population. With a sample this small, its distribution can differ noticeably from the population's, but it still gives a snapshot that is useful for estimation or preliminary analysis.
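
The visual comparison can be backed up with numbers. The following is a minimal sketch, assuming the same seed and sample size as above; the `comparison` matrix is an illustrative name that does not appear in the original code.

```r
data("mtcars")
set.seed(123)
sample_data <- mtcars[sample(1:nrow(mtcars), 10), ]

# Side-by-side summary statistics for the population and the sample
comparison <- rbind(
  Population = c(n = nrow(mtcars), mean = mean(mtcars$mpg),
                 median = median(mtcars$mpg), sd = sd(mtcars$mpg)),
  Sample     = c(n = nrow(sample_data), mean = mean(sample_data$mpg),
                 median = median(sample_data$mpg), sd = sd(sample_data$mpg))
)
print(round(comparison, 2))
```

Re-running with a different seed gives a different sample row, which is a quick way to see how much a 10-car sample can vary.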

# Example 2: Sampling Distribution of the Mean

Objective: Visualize the sampling distribution of the sample means of mpg to understand how sample
means approximate the population mean when sampling repeatedly.

set.seed(123)
sample_means <- replicate(1000,
                          mean(sample(mtcars$mpg, size = 10, replace = TRUE)))

# Plot the sampling distribution of the sample means


hist(sample_means, breaks = 20, col = "lightblue",
main = "Sampling Distribution of Sample Means (mpg)",
xlab = "Sample Mean of mpg", ylab = "Frequency")

# Add a red line representing the population mean of mpg


abline(v = mean(mtcars$mpg), col = "red", lwd = 2, lty = 2)

# Annotate the population mean for clarity


text(mean(mtcars$mpg), max(table(round(sample_means, 1))) - 5,
paste("Population Mean =", round(mean(mtcars$mpg), 2)),
col = "red", pos = 4)

Interpretation: The histogram shows the sampling distribution of the mean mpg across 1,000 samples of 10 cars each, drawn with replacement. The red dashed line marks the population mean of mpg. The sample means cluster around the population mean because the sample mean is an unbiased estimator, and the roughly bell-shaped histogram is what the Central Limit Theorem predicts. The Law of Large Numbers is a related but separate result: it says that the mean of a single sample approaches the population mean as the sample size grows, not as more samples are drawn. This behaviour is fundamental in inferential statistics, as it justifies using sample statistics (like the mean) to estimate population parameters.
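
To see how the sample size affects the sampling distribution, the simulation can be repeated for several sizes. This is a sketch only: `spread_of_means` is a hypothetical helper introduced here, and the sizes 5, 10, 30 and 100 are arbitrary choices.

```r
data("mtcars")
set.seed(123)

# Hypothetical helper: standard deviation of `reps` sample means
# for samples of size n drawn with replacement from mpg
spread_of_means <- function(n, reps = 1000) {
  sd(replicate(reps, mean(sample(mtcars$mpg, size = n, replace = TRUE))))
}

sizes <- c(5, 10, 30, 100)
spreads <- sapply(sizes, spread_of_means)

print(data.frame(sample_size = sizes, sd_of_sample_means = round(spreads, 3)))
```

The spread of the sample means should shrink as the sample size grows, which is the pattern expected when larger samples give more precise estimates of the mean.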

# Example 3: Standard Error

Objective: Calculate the standard error of the sample means to quantify the variability of the sampling
distribution and understand the precision of sample means as estimators.

# Calculate the standard error of sample means


standard_error <- sd(sample_means)
cat("Standard Error of Sample Means:", standard_error, "\n")

## Standard Error of Sample Means: 1.85098

Interpretation: The standard error (SE) is the standard deviation of the sampling distribution of the mean, so it measures how far sample means typically fall from the population mean. A smaller SE indicates that the sample means are clustered closely around the population mean, implying greater precision in estimating it. In our example, the SE of about 1.85 mpg shows how much the mean of a 10-car sample typically deviates from the true mpg mean for all cars in the mtcars dataset.
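
The simulated value can be checked against the textbook formula SE = σ/√n. The sketch below assumes, as the examples above do, that the 32 cars are the whole population, so σ is computed with divisor N rather than N - 1; the names `sigma`, `empirical_se` and `theoretical_se` are illustrative.

```r
data("mtcars")
set.seed(123)
n <- 10

# Empirical SE: standard deviation of 1000 simulated sample means
sample_means <- replicate(1000,
                          mean(sample(mtcars$mpg, size = n, replace = TRUE)))
empirical_se <- sd(sample_means)

# Population standard deviation (divisor N, since mtcars is treated as the population)
sigma <- sqrt(mean((mtcars$mpg - mean(mtcars$mpg))^2))

# Theoretical SE for samples of size n drawn with replacement
theoretical_se <- sigma / sqrt(n)

cat("Empirical SE:  ", round(empirical_se, 3), "\n")
cat("Theoretical SE:", round(theoretical_se, 3), "\n")
```

The theoretical value works out to about 1.88, so the two numbers should agree to within simulation noise.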

# Example 4: Hypothesis Testing (Null and Alternative Hypotheses)

Objective: Perform a hypothesis test to check if the population mean of mpg significantly differs from a
hypothesized mean (e.g., 20).

# Perform a t-test to test if the population mean is equal to 20


t_test_result <- t.test(mtcars$mpg, mu = 20)
print(t_test_result)

##
## One Sample t-test
##
## data: mtcars$mpg
## t = 0.08506, df = 31, p-value = 0.9328
## alternative hypothesis: true mean is not equal to 20
## 95 percent confidence interval:
## 17.91768 22.26357
## sample estimates:
## mean of x
## 20.09062

Interpretation: The test treats the 32 cars as a sample and asks whether the mean mpg of the population they represent differs from 20. The null hypothesis H0 : mean mpg = 20 is tested against the alternative hypothesis H1 : mean mpg ≠ 20. The result is t = 0.085 on 31 degrees of freedom with a p-value of 0.9328, far above the usual significance level of 0.05, so we fail to reject H0: the data are consistent with a mean of 20. The 95% confidence interval (17.92, 22.26) contains 20, which leads to the same conclusion. Had the p-value been below 0.05, we would have rejected the null hypothesis and concluded that the population mean likely differs from 20.
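
The numbers reported by `t.test` can be reproduced by hand from the one-sample formula t = (x̄ - μ0) / (s / √n). This is a minimal sketch; the variable names (`mu0`, `t_stat`, `p_value`, `decision`) are illustrative and not part of the original code.

```r
data("mtcars")
x <- mtcars$mpg
mu0 <- 20            # hypothesized mean under H0
alpha <- 0.05        # significance level
n <- length(x)

# Test statistic: distance of the sample mean from mu0 in standard-error units
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

decision <- if (p_value < alpha) "Reject H0" else "Fail to reject H0"

cat("t =", round(t_stat, 5), " p-value =", round(p_value, 4), "\n")
cat("Decision at alpha = 0.05:", decision, "\n")
```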

# Example 5: Critical Region and Level of Significance

Objective: Illustrate the critical regions in a t-distribution for a significance level of 0.05, demonstrating
the rejection areas for a two-tailed hypothesis test.

alpha <- 0.05


df <- length(mtcars$mpg) - 1 # Degrees of freedom
critical_value <- qt(1 - alpha / 2, df = df)

# Generate t-distribution data


x <- seq(-4, 4, length = 100) # Range of t-values
y <- dt(x, df = df) # t-distribution density values

# Plot the t-distribution curve


plot(x, y, type = "l", col = "blue", lwd = 2,
main = "Critical Regions in t-Distribution (alpha = 0.05)",
xlab = "t-value", ylab = "Density")

# Add vertical lines for the critical regions (two-tailed)


abline(v = c(-critical_value, critical_value), col = "red", lty = 2, lwd = 2)

# Shade critical regions to show rejection areas


left <- x <= -critical_value
polygon(c(-4, x[left], -critical_value, -critical_value),
        c(0, y[left], dt(-critical_value, df = df), 0),
        col = rgb(1, 0, 0, 0.2), border = NA)
right <- x >= critical_value
polygon(c(critical_value, critical_value, x[right], 4),
        c(0, dt(critical_value, df = df), y[right], 0),
        col = rgb(1, 0, 0, 0.2), border = NA)

# Add text labels for critical regions


text(-critical_value, 0.05, paste("Critical t =", round(-critical_value, 2)),
col = "red", pos = 4)
text(critical_value, 0.05, paste("Critical t =", round(critical_value, 2)),
col = "red", pos = 2)

# Add a legend to indicate the critical and acceptance regions


legend("topright",
       legend = c("Acceptance Region", "Critical Regions (Reject H0)"),
       fill = c("white", rgb(1, 0, 0, 0.2)), border = c("blue", "red"),
       lty = c(1, 2), col = c("blue", "red"), lwd = c(2, 2))

Interpretation: The blue curve represents the t-distribution with 31 degrees of freedom, and the red dashed lines mark the critical values. The shaded areas in the tails are the critical regions (where |t| > 2.04 at α = 0.05), each with area 0.025, indicating where we would reject H0 in a two-tailed test. If the test statistic falls within these shaded regions, we reject the null hypothesis and conclude that the sample mean is significantly different from the hypothesized mean; otherwise we fail to reject it.
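
The decision rule described above can be applied directly to the mpg test with a hypothesized mean of 20. This is a sketch under that assumption; `in_critical_region` and `tail_area` are illustrative names.

```r
data("mtcars")
alpha <- 0.05
n <- length(mtcars$mpg)

# Two-tailed critical value for n - 1 = 31 degrees of freedom
critical_value <- qt(1 - alpha / 2, df = n - 1)

# Observed test statistic for H0: mean mpg = 20
t_stat <- unname(t.test(mtcars$mpg, mu = 20)$statistic)

# Reject H0 only if the statistic lands in either shaded tail
in_critical_region <- abs(t_stat) > critical_value

# The two tails together carry probability alpha
tail_area <- 2 * pt(-critical_value, df = n - 1)

cat("Critical value:", round(critical_value, 4), "\n")
cat("Observed t:", round(t_stat, 4), "\n")
cat("Reject H0?", in_critical_region, "\n")
```

Comparing |t| with the critical value always gives the same decision as comparing the p-value with α.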

# Example 6: Characteristics of a Good Estimator: Unbiasedness

Objective: Demonstrate the concept of unbiasedness by comparing the sample mean to the true
population mean.

# True mean of 'mpg' in population


true_mean <- mean(mtcars$mpg)
set.seed(123)
sample_data <- mtcars[sample(1:nrow(mtcars), 10), ]
sample_mean <- mean(sample_data$mpg)

# Display the true mean and sample mean


cat("True Mean of mpg:", true_mean, "\n")

## True Mean of mpg: 20.09062

cat("Sample Mean of mpg (unbiased):", sample_mean, "\n")

## Sample Mean of mpg (unbiased): 19.74


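Interpretation: The true mean is the average mpg of all cars in the mtcars dataset, which serves as the population parameter. The sample mean, 19.74, is calculated from one random sample of 10 cars and falls close to the true mean of 20.09. A single comparison cannot establish unbiasedness, however. Unbiasedness is a property of the estimator rather than of one sample: an estimator is unbiased when its expected value equals the parameter, that is, when its average over many repeated samples matches the true value. The sample mean has this property, so on average it neither overestimates nor underestimates the population mean, even though any individual sample mean will miss it by some amount.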
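
Because unbiasedness concerns the average behaviour of an estimator, it is better illustrated by repeated sampling. The sketch below assumes 5,000 samples of 10 cars drawn without replacement; both numbers are arbitrary choices, and `estimated_bias` is an illustrative name.

```r
data("mtcars")
set.seed(123)
true_mean <- mean(mtcars$mpg)

# Draw many samples of 10 cars and record each sample mean
sample_means <- replicate(5000, mean(sample(mtcars$mpg, size = 10)))

# Estimated bias: average of the sample means minus the true mean
estimated_bias <- mean(sample_means) - true_mean

cat("True mean:              ", round(true_mean, 4), "\n")
cat("Average of sample means:", round(mean(sample_means), 4), "\n")
cat("Estimated bias:         ", round(estimated_bias, 4), "\n")
```

Individual sample means scatter on both sides of the true mean, but their average should sit very close to it, so the estimated bias should be near zero.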

# Example 7: Scatter Plot for Correlation

Objective: Visualize the relationship between two continuous variables, horsepower and mpg, to assess
correlation.

# Create a scatter plot for horsepower vs. mpg


plot(mtcars$hp, mtcars$mpg, main = "Scatter Plot of Horsepower vs. mpg",
     xlab = "Horsepower (hp)", ylab = "Miles per Gallon (mpg)",
     pch = 19, col = "blue")
abline(lm(mpg ~ hp, data = mtcars), col = "red", lwd = 2)

Interpretation: This scatter plot displays each car's horsepower (x-axis) against its miles per gallon (y-axis). The blue points represent individual cars. The red line is the least-squares regression line, showing the best-fit linear relationship between horsepower and mpg. The line slopes downward, indicating a negative correlation (r ≈ -0.78): higher horsepower tends to be associated with lower fuel efficiency (lower mpg). This visual helps identify trends, patterns, and potential outliers in the data.
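
The strength of the relationship seen in the plot can be quantified. This is a minimal sketch using Pearson's correlation and the same linear model as the plotted line; the names `r`, `test`, `fit` and `slope` are illustrative.

```r
data("mtcars")

# Pearson correlation between horsepower and mpg
r <- cor(mtcars$hp, mtcars$mpg)

# Test of H0: the true correlation is zero
test <- cor.test(mtcars$hp, mtcars$mpg)

# Slope of the fitted line: change in mpg per additional unit of horsepower
fit <- lm(mpg ~ hp, data = mtcars)
slope <- unname(coef(fit)["hp"])

cat("Correlation r:", round(r, 3), "\n")
cat("p-value:", signif(test$p.value, 3), "\n")
cat("Slope (mpg per hp):", round(slope, 4), "\n")
```
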
# Example 8: Histogram with Density Plot

Objective: Illustrate the distribution of the mpg variable and visualize its density for better
understanding.

# Create a histogram for 'mpg' with density overlay


hist(mtcars$mpg, breaks = 10, probability = TRUE, col = "lightblue",
     main = "Histogram with Density Plot",
     xlab = "Miles per Gallon (mpg)", ylab = "Density")
lines(density(mtcars$mpg), col = "red", lwd = 2)

Interpretation: The histogram shows the distribution of mpg values on a density scale (probability = TRUE) rather than as raw counts, so the area of each bar equals the proportion of cars within that range of mpg and the bar areas sum to 1. The red line is a kernel density estimate of mpg, a smoothed curve that helps visualize the overall shape of the distribution. Together they let us observe the central tendency, spread, and any potential skewness or outliers in the data.
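
The link between bar area and proportion can be verified numerically. The sketch below assumes the same `breaks = 10` setting as the plot and uses `plot = FALSE` so that only the computed bins are returned; `bar_areas` and `total_area` are illustrative names.

```r
data("mtcars")

# Compute the histogram bins without drawing them
h <- hist(mtcars$mpg, breaks = 10, plot = FALSE)

# Area of each bar = density (height) * bin width
bar_areas <- h$density * diff(h$breaks)
total_area <- sum(bar_areas)

# Each bar's area should match the share of cars falling in that bin
proportions <- h$counts / sum(h$counts)

print(data.frame(bin_start = head(h$breaks, -1),
                 bin_end   = h$breaks[-1],
                 area      = round(bar_areas, 3),
                 proportion = round(proportions, 3)))
cat("Total area:", total_area, "\n")
```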

# Example 9: Cumulative Frequency Plot

Objective: Display the cumulative frequency of the mpg variable to understand how values accumulate
across a range.

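# Calculate cumulative frequency over 10 equal-width bins


mpg_breaks <- seq(min(mtcars$mpg), max(mtcars$mpg), length.out = 11)
cum_freq <- cumsum(table(cut(mtcars$mpg, breaks = mpg_breaks,
                             include.lowest = TRUE)))

# Plot cumulative frequency against the upper edge of each bin,
# so that the x-axis is in mpg units rather than bin index

plot(mpg_breaks[-1], cum_freq, type = "b", main = "Cumulative Frequency of mpg",
     xlab = "Miles per Gallon (mpg)", ylab = "Cumulative Frequency",
     col = "purple")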

Interpretation: This cumulative frequency plot shows how the count of mpg values accumulates as mpg
increases. Each point on the plot represents the total number of cars that have an mpg less than or
equal to a specific value. This type of visualization is helpful for identifying percentiles and
understanding the distribution of the data. For example, if you know the cumulative frequency at a
certain mpg value, you can determine how many cars perform better or worse in terms of fuel
efficiency.
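
For reading off percentiles, the empirical cumulative distribution function gives the same information without binning. This is a sketch; the cut-off of 20 mpg is an arbitrary example value, and `F_mpg`, `prop_le_20` and `count_le_20` are illustrative names.

```r
data("mtcars")

# Empirical CDF: F_mpg(v) is the proportion of cars with mpg <= v
F_mpg <- ecdf(mtcars$mpg)

prop_le_20  <- F_mpg(20)
count_le_20 <- sum(mtcars$mpg <= 20)

# Quartiles, i.e. the mpg values at cumulative proportions 25%, 50% and 75%
quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))

cat("Cars with mpg <= 20:", count_le_20, "of", nrow(mtcars), "\n")
cat("Proportion with mpg <= 20:", prop_le_20, "\n")
print(quartiles)
```

Calling `plot(F_mpg)` draws the step-function version of the cumulative frequency plot.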

# Example 10: Boxplot for Comparing Groups

Objective: Compare the distribution of mpg across different groups defined by the number of cylinders
in cars.

# Create a boxplot for 'mpg' grouped by the number of cylinders


boxplot(mpg ~ cyl, data = mtcars,
        main = "Boxplot of mpg by Number of Cylinders",
        xlab = "Number of Cylinders", ylab = "Miles per Gallon (mpg)",
        col = c("lightblue", "lightgreen", "lightpink"))
grid()

Interpretation: The boxplot displays the distribution of mpg for cars grouped by the number of cylinders (4, 6, and 8). Each box spans the interquartile range (IQR), with the line inside indicating the median mpg for that group. The whiskers extend to the most extreme values that lie within 1.5 × IQR of the box, and any points beyond them are drawn individually as potential outliers. Boxplots are useful for comparing central tendency and variability across multiple groups and for identifying potential outliers within each group.
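
The quantities drawn in the boxplot can also be tabulated. The following sketch computes them per cylinder group with `tapply`; `group_summary` and `outliers` are illustrative names, and `boxplot.stats` applies the same 1.5 × IQR rule that `boxplot` uses by default.

```r
data("mtcars")

# Group size, median and IQR of mpg for each cylinder count
counts  <- tapply(mtcars$mpg, mtcars$cyl, length)
medians <- tapply(mtcars$mpg, mtcars$cyl, median)
iqrs    <- tapply(mtcars$mpg, mtcars$cyl, IQR)

group_summary <- data.frame(cyl    = names(medians),
                            n      = as.vector(counts),
                            median = as.vector(medians),
                            IQR    = as.vector(iqrs))
print(group_summary)

# Points that the boxplot would draw individually as outliers, per group
outliers <- tapply(mtcars$mpg, mtcars$cyl, function(v) boxplot.stats(v)$out)
print(outliers)
```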

# Example 11: Pie Chart for Proportions of Car Origin

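Objective: Visualize the proportions of a categorical variable with a pie chart. The mtcars dataset has no origin column, so an illustrative "origin" label is derived from the number of cylinders.

# Create an illustrative 'origin' variable from the number of cylinders
# (4 = "European", 6 = "American", 8 = "Japanese"); these labels are
# placeholders for the example, not the cars' real countries of origin
mtcars$origin <- ifelse(mtcars$cyl == 4, "European",
                        ifelse(mtcars$cyl == 6, "American", "Japanese"))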

# Count the number of cars in each origin category


origin_counts <- table(mtcars$origin)

# Create a pie chart for the proportion of car origins


pie(origin_counts,
    main = "Proportion of Cars by Origin",
    col = c("lightblue", "lightgreen", "lightcoral"),
    labels = paste(names(origin_counts), "\n(", origin_counts, " cars)", sep = ""),
    radius = 0.9)
legend("topright", legend = names(origin_counts),
       fill = c("lightblue", "lightgreen", "lightcoral"))

Interpretation: The pie chart shows the proportion of cars in each origin category: European, American, and Japanese. Because these labels were assigned from the cylinder count, the slices actually correspond to 4-cylinder (11 cars), 6-cylinder (7 cars), and 8-cylinder (14 cars) models. The size of each slice indicates its share of the 32 cars in the dataset. This visualization is effective for quickly understanding the distribution of categories, and here it shows that the "Japanese" (8-cylinder) category has the most cars.
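
Slice sizes are easier to compare when the exact proportions are printed alongside the chart. This is a minimal sketch using `prop.table`; it rebuilds the illustrative origin labels in a separate vector, `origin`, so it does not depend on the modified data frame.

```r
data("mtcars")

# Same illustrative labelling as above: derived from cylinder count only
origin <- ifelse(mtcars$cyl == 4, "European",
                 ifelse(mtcars$cyl == 6, "American", "Japanese"))

origin_counts <- table(origin)

# Convert counts to proportions of the whole dataset
origin_props <- prop.table(origin_counts)

print(origin_counts)
print(round(100 * origin_props, 1))   # percentages
```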
