Team 3 Stats

The document outlines an assignment on importing and analyzing a dataset using R Studio, including descriptive statistics and visualizations such as histograms and box plots. Key financial metrics like Return on Capital, Return on Equity, and Earnings Per Share are examined, revealing insights into company performance and distribution trends. The analysis highlights moderate returns for most firms, balanced debt levels, and significant income disparity among companies.

Uploaded by

patelpranjal1085

Statistics with R Assignment

Importing the data on R Studio

getwd()
## [1] "C:/Users/rajni/Downloads"
data <- read.csv("C:/Users/rajni/Downloads/Data_For group assignment/Data_Project/Return.csv")

Getting the dataset description

head(data)
## roe rok dkr eps netinc sp90 sp94 salary return lsalary
## 1 18.7 17.4 4.0 48.1 1144 59.375 47.000 1090 -20.842110 6.993933
## 2 1.6 2.4 27.3 -85.3 35 47.875 43.500 1923 -9.138381 7.561642
## 3 4.9 4.6 36.8 -44.1 127 39.000 72.625 1012 86.217949 6.919684
## 4 11.1 8.6 46.4 192.4 367 61.250 142.000 579 131.836700 6.361302
## 5 5.6 4.5 36.2 -60.4 214 58.000 53.250 600 -8.189655 6.396930
## 6 3.5 2.9 18.7 -79.8 118 68.250 50.500 735 -26.007330 6.599871
## lsp90
## 1 4.083873
## 2 3.868593
## 3 3.663562
## 4 4.114964
## 5 4.060443
## 6 4.223177

Loading the required packages

library(ggsci)
library(haven)
library(plotrix)
library(glue)
library(ggtext)

Descriptive summary of the given data

summary(data)
## roe rok dkr eps
## Min. : 1.10 Min. : 1.400 Min. : 0.00 Min. :-89.30
## 1st Qu.:12.15 1st Qu.: 8.425 1st Qu.:12.90 1st Qu.:-15.70
## Median :15.80 Median :12.050 Median :26.40 Median : 6.70
## Mean :17.73 Mean :13.711 Mean :25.65 Mean : 2.02
## 3rd Qu.:21.18 3rd Qu.:16.850 3rd Qu.:35.48 3rd Qu.: 17.08
## Max. :61.10 Max. :43.300 Max. :79.50 Max. :266.60
## netinc sp90 sp94 salary
## Min. : 4.0 Min. : 13.38 Min. : 9.625 Min. : 267.0
## 1st Qu.: 151.2 1st Qu.: 36.25 1st Qu.: 30.781 1st Qu.: 733.5
## Median : 229.5 Median : 46.25 Median : 44.900 Median : 1040.5
## Mean : 512.3 Mean : 55.67 Mean : 47.489 Mean : 1325.1
## 3rd Qu.: 520.8 3rd Qu.: 62.19 3rd Qu.: 57.688 3rd Qu.: 1364.2
## Max. :4237.0 Max. :564.12 Max. :242.500 Max. :14336.0
## return lsalary lsp90
## Min. :-84.888 Min. :5.587 Min. :2.593
## 1st Qu.:-30.740 1st Qu.:6.598 1st Qu.:3.590
## Median : -8.673 Median :6.947 Median :3.834
## Mean : -4.043 Mean :6.958 Mean :3.868
## 3rd Qu.: 14.786 3rd Qu.:7.218 3rd Qu.:4.130
## Max. :131.837 Max. :9.571 Max. :6.335

Interpretation

• ROK (Return on Capital):

Return on Capital represents a financial performance ratio evaluating a company's efficiency in using its
capital.
Range: 1.40 to 43.30

• ROE (Return on Equity):

Return on Equity measures a company's profitability in relation to shareholders' equity.
Range: 1.10 to 61.10

• DKR (Debt-to-Capital Ratio):

A leverage ratio measuring a company's debt relative to its capital (plotted later as Debt/Capital).
Range: 0.00 to 79.50

• EPS (Earnings Per Share):

EPS indicates a company's profitability allocated to each outstanding share of common stock.
Range: -89.30 to 266.60

• NETINC (Net Income):

Net Income is the total profit of a company after all expenses and taxes are deducted.
Range: 4.0 to 4237.0

• SP90 (Stock Price 1990):

This may represent the stock price or index level for companies as of the year 1990.
Range: 13.38 to 564.12

• SP94 (Stock Price 1994):

This may represent the stock price or index level for companies as of the year 1994.
Range: 9.625 to 242.50

• SALARY:
Refers to employee or executive compensation within a company.
Range: 267.0 to 14336.0

• RETURN (Stock Return):

Likely represents the return generated on stocks over a period, including gains or losses.
Range: -84.888 to 131.837

• LSALARY (Log of Salary):

Represents the natural logarithmic transformation of salary values.
Range: 5.587 to 9.571

• LSP90 (Log of SP90):

Represents the natural logarithmic transformation of SP90 values.
Range: 2.593 to 6.335

Summary Interpretation

o The return column shows both gains and losses across companies, with a mean return of
-4.043% and a median of -8.673%.
o The average return on equity (ROE) is approximately 17.73%, reflecting the typical
profitability relative to shareholder equity.
o Net income values vary significantly across companies (4.0 to 4237.0), although no firm in the
sample reports negative net income.
o Earnings per share (EPS) exhibit significant variability and include negative values, indicating
losses for some firms.
o Salaries are skewed due to a few high values (up to 14336.0), which justifies the use of log
salary (lsalary) to reduce skewness.
o Similar trends are observed in stock prices (sp90 and sp94) and their log versions.

Displaying the Dataset

str(data)
## 'data.frame': 142 obs. of 11 variables:
## $ roe : num 18.7 1.6 4.9 11.1 5.6 ...
## $ rok : num 17.4 2.4 4.6 8.6 4.5 ...
## $ dkr : num 4 27.3 36.8 46.4 36.2 ...
## $ eps : num 48.1 -85.3 -44.1 192.4 -60.4 ...
## $ netinc : int 1144 35 127 367 214 118 175 1692 157 315 ...
## $ sp90 : num 59.4 47.9 39 61.2 58 ...
## $ sp94 : num 47 43.5 72.6 142 53.2 ...
## $ salary : int 1090 1923 1012 579 600 735 994 1227 913 733 ...
## $ return : num -20.84 -9.14 86.22 131.84 -8.19 ...
## $ lsalary: num 6.99 7.56 6.92 6.36 6.4 ...
## $ lsp90 : num 4.08 3.87 3.66 4.11 4.06 ...

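# Note: print(str) prints the definition of the str() function itself,
# not the structure of the data; str(data) above already shows the structure.
print(str)
## function (object, ...)
## UseMethod("str")
## <bytecode: 0x000002e52fbc1bf8>
## <environment: namespace:utils>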

Displaying column names, number of rows, number of columns, and dimensions

names(data)
## [1] "roe" "rok" "dkr" "eps" "netinc" "sp90" "sp94"
## [8] "salary" "return" "lsalary" "lsp90"
nrow(data)
## [1] 142
ncol(data)
## [1] 11
dim(data)
## [1] 142 11

Calculating the trimmed mean values of 9 variables (with trim = 0.5, mean() returns the median, so the values below match the medians in summary(data))

mean(data$roe, trim = 0.5)

## [1] 15.8
mean(data$rok, trim = 0.5)
## [1] 12.05
mean(data$dkr, trim = 0.5)
## [1] 26.4
mean(data$eps, trim = 0.5)
## [1] 6.7
mean(data$netinc, trim = 0.5)
## [1] 229.5
mean(data$sp90, trim = 0.5)
## [1] 46.25
mean(data$sp94, trim = 0.5)
## [1] 44.9
mean(data$salary, trim = 0.5)
## [1] 1040.5
mean(data$return, trim = 0.5)
## [1] -8.673019
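
To see how the trim argument behaves, the following sketch compares the ordinary mean, a 20% trimmed mean, and trim = 0.5 on a small made-up vector (the values are illustrative, not taken from the dataset).

```r
x <- c(1, 2, 3, 5, 100)   # made-up values with one extreme observation

plain_mean   <- mean(x)               # pulled upward by the value 100
trimmed_20   <- mean(x, trim = 0.2)   # drops the lowest and highest 20% (one value each here)
trimmed_half <- mean(x, trim = 0.5)   # trim >= 0.5 returns the median
med          <- median(x)

print(c(plain_mean, trimmed_20, trimmed_half, med))
```

A smaller trim value such as 0.1 or 0.2 is the usual choice when the aim is a mean that is robust to outliers but still distinct from the median.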
Histograms

A histogram is a graphical representation of the distribution of quantitative data. To construct a
histogram, one must first divide ("bin" or "bucket") the entire range of values into a series of
intervals. Subsequently, the frequency of values within each interval is counted. The bins are
generally specified as consecutive, non-overlapping intervals of a variable and are typically (though
not necessarily) of equal size.
Histograms provide an approximate visualization of the density of the underlying data distribution
and are often utilized for density estimation: estimating the probability density function of the
underlying variable. The total area of a histogram used for probability density purposes is always
normalized to 1. If the intervals on the x-axis are the same length (equal to 1), then the histogram
corresponds directly to a relative frequency plot.
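
The normalization property can be checked numerically. The sketch below uses simulated right-skewed data (an assumption, not the assignment data) and confirms that the bar areas of a density-scaled histogram sum to 1.

```r
set.seed(42)                     # arbitrary seed for reproducibility
x <- rexp(200, rate = 1 / 15)    # simulated right-skewed values

h <- hist(x, plot = FALSE)       # compute the bins without drawing
bin_widths <- diff(h$breaks)     # width of each bin
total_area <- sum(h$density * bin_widths)
print(total_area)

# freq = FALSE draws densities instead of counts on the y-axis
hist(x, freq = FALSE, main = "Density-scaled histogram", xlab = "Value")
```

By default hist() plots counts when the bins are equally wide, so freq = FALSE is needed whenever the histogram is meant to be read as a density estimate.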
Plotting histogram for Return on Capital

hist(data$rok, main = "Return on Capital", xlab = "Observations", col = "coral2")

The Return on Capital (ROC) histogram has a pattern similar to ROE. It shows right-skewness, with
most companies having ROC values between 5 and 20. As ROC increases beyond this range, the
number of observations decreases significantly. This suggests that most companies produce
moderate returns on their invested capital, with only a few firms converting capital into profits more
efficiently. The data distribution allows for identifying consistent performance among firms and
points out outliers with very high ROC.

Plotting histogram for Return on Equity

hist(data$roe, main = "Return on Equity", xlab = "Observations", col = "burlywood3")

The histogram for Return on Equity (ROE) shows the distribution of companies based on their
equity returns. It is right-skewed, indicating that a larger number of companies have ROE values in
the lower range (5 to 20). The frequency decreases as ROE increases, with few companies reaching
ROE values above 30 or 40. This indicates that higher equity returns are less common, and most
firms have moderate profitability regarding their equity base. The distribution also shows positive
skewness, where some firms have significantly higher ROE than the majority.

Plotting histogram for Debt/Capital

hist(data$dkr, main = "Debt/Capital", xlab = "Observations", col = "grey")

The histogram for the Debt/Capital ratio demonstrates an initially more even distribution,
particularly within the 0 to 40 range, followed by a notable decrease in frequency. This indicates that
a significant number of companies maintain moderate debt levels relative to their capital.
Moreover, there are fewer companies exhibiting very high debt levels, suggesting a general
inclination towards balanced financial leverage or conservative debt management strategies. The
gradual decline beyond the 40 mark further highlights the rarity of extremely high debt levels within
the dataset.

Plotting histogram for Earning per share

hist(data$eps, main = "Earnings Per Share", xlab = "Observations", col = "skyblue1")

The Earnings Per Share (EPS) histogram is right-skewed, indicating a wide range of profitability
across firms. Most companies report EPS values between 0 and 50, with a notable peak near the
lower end of this range.
Some companies have high EPS values, exceeding 100 or even 200, which extends the distribution.
This long tail to the right indicates the presence of firms with exceptional earnings. It also shows
that while many firms have modest earnings, a few highly profitable companies contribute to the
higher end of the scale.

Summary

▪ Histograms reveal financial traits of companies:

▪ ROE and ROC show moderate returns for most firms, with some high performers.
▪ Debt/Capital indicates balanced leverage, with few extremes.
▪ EPS highlights income disparity, with a few firms leading in profitability.
▪ These visuals help understand distribution, central tendency, and outliers in financial
performance.
Box Plot

The concept of the box plot was introduced by John Tukey in 1970 and elaborated upon in his book
"Exploratory Data Analysis" published in 1977. Known alternatively as a whisker plot, box-and-
whisker plot, or simply a box-and-whisker diagram, this graphical representation illustrates the
distribution of a dataset. It succinctly displays key summary statistics such as the median, quartiles,
and potential outliers. Box plots offer a compact visual summary of data distributions, aiding in the
identification of potential outliers and facilitating the comparison of different datasets.

A box plot provides a five-number summary of a dataset, which includes:

Minimum - The smallest value in the dataset, excluding outliers.

First Quartile (Q1) - Represents the value below which 25% of the data falls.

Median (Q2) - Marks the midpoint of the dataset, with half of the values below and half above it.

Third Quartile (Q3) - Indicates the value below which 75% of the data falls.

Maximum - The largest value in the dataset, excluding outliers.

Skewness: A box plot also gives a quick read on skewness. When the median sits closer to the first
quartile and the upper whisker or the outliers stretch far to the right, the distribution is
right-skewed: a few unusually large values pull it toward the higher end.
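
The numbers behind a box plot can be inspected directly with boxplot.stats() from base R. The sketch below uses a small made-up vector in which the value 40 lies far from the rest.

```r
x <- c(2, 4, 5, 7, 8, 9, 11, 40)   # made-up values; 40 is far from the others

bs <- boxplot.stats(x)
print(bs$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)     # points more than 1.5 * IQR beyond the hinges
```

Note that the whisker ends are the most extreme values that are not flagged as outliers, which is why the upper whisker stops at 11 rather than 40.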

Constructing a Box-Plot of Average Return

Description: The boxplot of Average Return (variable return) shows the central tendency and spread of
average returns across companies.
library(ggplot2)
ggplot(data, aes(x = return)) +
geom_boxplot(fill = "aquamarine4") +
labs(title = "BOXPLOT", x = "Average Return", y = "")
Interpretation:

The box is centered slightly below zero (the median return is about -8.7), showing that most companies
have modest average returns. The outliers on the right side indicate that some companies have very
high returns, possibly due to exceptional performance or one-time events. Negative values are visible
on the left tail, indicating underperforming firms.

The distribution is right-skewed, reflecting positive outliers pulling the average up.

Constructing a Box-Plot of Net Income

Description: The boxplot for Net Income reveals how company earnings are distributed.

library(ggplot2)
ggplot(data, aes(x = netinc)) +
geom_boxplot(fill = "pink") +
labs(title = "BOXPLOT", x = "Net Income", y = "")
Interpretation:
• Most data is near 0, with few outliers far to the right.
• This shows most firms earn modest profits, with only a few making high net income.
• Highly profitable firms are the extreme rightward outliers.
• The data is highly right-skewed, indicating significant income inequality among firms.
Scatter plot
A scatter plot uses dots to show values for two numeric variables. Each dot's position on the axes
represents an individual data point’s values. Scatter plots help observe relationships between
variables.
A classic example plots tree diameters against heights: each dot represents one tree's diameter
(horizontal) and height (vertical). Such a plot typically reveals a positive correlation between
diameter and height and makes outliers easy to spot, for instance a tree with a large diameter but
relatively short height, which may need further investigation.

Common issues:
Interpreting correlation as causation
This issue pertains to the interpretation of scatter plots rather than their creation. Observing a
relationship between two variables in a scatter plot does not necessarily imply that changes in one
variable cause changes in the other. This is encapsulated by the phrase "correlation does not imply
causation." The observed relationship could be driven by a third variable affecting both plotted
variables, the causal link could be reversed, or the pattern might be coincidental. For instance,
drawing conclusions about the amount of green space and crime rates in cities based solely on their
correlation can be misleading, as larger cities with more people tend to have higher values for both
measures. Establishing a causal link requires further analysis to control for other potential variables
and rule out alternative explanations.
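
A small simulation makes the third-variable explanation concrete. In this sketch, which uses entirely simulated data, a hypothetical confounder z (think of city size) drives both x and y; the two are correlated even though neither affects the other, and the correlation largely disappears once z is accounted for.

```r
set.seed(1)
n <- 5000
z <- rnorm(n)          # hypothetical confounder, e.g. city size
x <- z + rnorm(n)      # depends on z only
y <- z + rnorm(n)      # depends on z only, not on x

r_raw <- cor(x, y)     # clearly positive despite no causal link between x and y

# Remove the part of x and y explained by z, then correlate what is left
r_adj <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

print(c(raw = r_raw, adjusted = r_adj))
```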

Overplotting
When plotting a large number of data points, overplotting can occur. Overplotting is when data
points overlap to such an extent that it becomes difficult to discern relationships between points and
variables. It can be challenging to see how densely packed data points are if many are concentrated
in a small area. To address this issue, several methods can be used. One approach is to sample only a
subset of data points randomly, which should still provide an idea of the patterns in the full dataset.
Another method involves altering the appearance of dots by adding transparency or reducing point
size to minimize overlaps. Additionally, using a different chart type, such as a heatmap where color
indicates the number of points in each bin, can be effective. Heatmaps in this context are also
known as 2D histograms.
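
Two of these remedies can be sketched in base R, which is used here only to keep the example free of package dependencies; in ggplot2 the alpha argument of geom_point() and the geom_bin2d() layer serve the same purposes. The data are simulated.

```r
set.seed(7)
n <- 5000
x <- rnorm(n)
y <- x + rnorm(n)

# Remedy 1: semi-transparent, smaller points so dense regions appear darker
plot(x, y, pch = 16, cex = 0.6,
     col = adjustcolor("darkorchid1", alpha.f = 0.15),
     main = "Transparency against overplotting")

# Remedy 2: a 2D histogram, counting points in a 20 x 20 grid of bins
counts <- table(cut(x, breaks = 20), cut(y, breaks = 20))
image(unclass(counts), main = "2D histogram (bin counts)",
      xlab = "x bin (scaled 0 to 1)", ylab = "y bin (scaled 0 to 1)")
```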

Plotting a scatter plot of stock price in 1990 vs stock price in 1994

ggplot(data, aes(x = sp90, y = sp94)) +
geom_point(color = "darkorchid1") +
labs(title = "SCATTER PLOT", x = "Stock price 1990", y = "Stock price 1994")

The scatter plot shows a cluster of points at the lower end of both axes, implying that most firms had
modest stock prices in both 1990 and 1994.

Some companies with a high 1990 price also show a higher 1994 price, but a high 1990 price does not
ensure a high 1994 price: some firms that were highly priced in 1990 are priced much lower in 1994.
Overall, the plot suggests a weak but noticeable positive relationship between the two prices. The
dataset contains no direct risk measure, so this plot cannot by itself confirm the risk-return tradeoff
theory in finance; it only shows how the 1990 and 1994 prices move together.

Sample Mean
The sample mean represents the average value of a set of observations taken from a larger
population. It indicates the central point around which the sample data clusters and serves as an
estimate of the overall population mean.
The sample standard error, on the other hand, quantifies how much the sample mean is likely to
vary if different samples of the same size were repeatedly taken from the population. It reflects the
precision of the sample mean as an estimate of the true population mean — the smaller the standard
error, the more precise the estimate.

sample.mean <- mean(data$roe)

print(sample.mean)
## [1] 17.73099
sample.n <- length(data$roe)
sample.sd <- sd(data$roe)
sample.se <- sample.sd/sqrt(sample.n)
print(sample.se)
## [1] 0.8295018
alpha = 0.05
degrees.freedom = sample.n- 1
t.score = qt(p=alpha/2, df=degrees.freedom,lower.tail=F)
print(t.score)
## [1] 1.976931
margin.error <- t.score * sample.se
print(margin.error)
## [1] 1.639868
lower.bound <- sample.mean- margin.error
upper.bound <- sample.mean + margin.error
print(c(lower.bound,upper.bound))
## [1] 16.09112 19.37085
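
The same steps can be wrapped in a small helper and checked against the interval that t.test() reports. This is a sketch: ci_mean is a hypothetical helper name, and the vector holds only the first six ROE values from head(data), not the full sample.

```r
# Hypothetical helper: two-sided t-based confidence interval for a mean
ci_mean <- function(x, conf = 0.95) {
  n      <- length(x)
  se     <- sd(x) / sqrt(n)                      # standard error of the mean
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)   # critical t-value
  mean(x) + c(-1, 1) * t_crit * se               # lower and upper bound
}

x <- c(18.7, 1.6, 4.9, 11.1, 5.6, 3.5)   # first six ROE values from head(data)

manual  <- ci_mean(x)
builtin <- as.numeric(t.test(x)$conf.int)
print(rbind(manual, builtin))
```

Both rows agree, which shows that the manual calculation and the built-in two-sided t-test construct the interval in the same way.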

Sample Mean (ROE):

• Average return on equity is 17.73%.
• Indicates firms earn this percentage on their equity on average.

Standard Error (SE):

• 0.83 (approximately).
• Indicates how accurate the sample mean is as an estimate of the true population mean. A
lower value means more precision.

Alpha (α):

• 0.05, which corresponds to a 95% confidence level.

• This means that, over repeated sampling, about 5% of intervals built this way would miss the true population mean.

t-score:

• 1.98 (approximately).
• Represents the critical value from the t-distribution for a 95% confidence interval.
Margin of Error:

• 1.64 (approximately).
• This value is added and subtracted from the sample mean to create a range for the
confidence interval.

Confidence Interval (95%):

• Ranges from 16.09% to 19.37%.

• We are 95% confident that the actual average ROE for the entire population of firms lies
within this range.

Implications for Decision-Making

For investors or analysts, the average firm in the population is estimated to generate an ROE of roughly 16–19%.
▪ Firms with ROE below 16% fall below the estimated average, although the interval describes the
mean and not the spread of individual firms.
▪ Firms with ROE above 19% are above the estimated average, but their sustainability needs further
investigation.

Confidence Interval
A confidence interval is the mean of your estimate plus and minus the variation in that estimate.
This is the range of values you expect your estimate to fall between if you redo your test, within a
certain level of confidence. Confidence, in statistics, is another way to describe probability.

The sample mean ROE is calculated as 17.73%.

• The standard error of the mean is computed to be approximately 0.83.

• A 95% confidence interval is constructed using the t-distribution with degrees of freedom equal to
the sample size minus one.
• The critical t-value for a two-tailed test at α = 0.05 is approximately 1.98.
• The resulting margin of error is about 1.64.

Confidence Interval:
• The 95% confidence interval for the true mean ROE is (16.09%, 19.37%).

Interpretation:
• There is 95% confidence that the true mean ROE for the population lies between 16.09% and
19.37%.
• The interval provides a precise estimate of the mean ROE.
• The interval is above zero, indicating that firms in the dataset have a positive ROE on average.
Statistical Summary: ROE Confidence Interval

• 95% Confidence Interval for Mean ROE: [16.09, 19.37]

• Sample Size: 142 (stored in sample.n)

• Standard Error (SE): 0.8295

• t-score (for 95% confidence): 1.9769

• Margin of Error: 1.6399

• Mean ROE: 17.73

Interpretation:

▪ The true mean ROE of the population is estimated to lie between 16.09% and 19.37% with
a 95% confidence level.
▪ This range indicates moderate profitability for the average firm in the dataset.
▪ The narrow confidence interval shows that the sample provides a precise estimate of the
mean ROE due to a relatively small standard error.

Overall Insights:

▪ Average Return and Net Income are largely influenced by a few high-performing firms.
▪ Boxplots identify outliers that may require further investigation (top performers).
▪ The scatter plot of sp90 against sp94 indicates a positive pattern, though it is not very strong.
▪ The mean ROE is statistically reliable, estimated at about 17.7% (95% confidence interval 16.09% to
19.37%), indicating a generally healthy return on equity.

Purpose of the Code: The goal is to calculate a 95% confidence interval for the mean ROE of a
sample.

Performing Hypothesis Testing

roe <- data$roe

mu <- 17
mr <- mean(roe)
s <- sd(roe)

n <- length(roe)

z <- (mr - mu) / (s / sqrt(n))

alpha <- 0.05

p_value<-1-pnorm(z)

if(p_value<alpha){
cat("Reject the null hypothesis. There is evidence to support the alternative hypothesis.")
}else{
cat("Fail to reject the null hypothesis. There is not enough evidence to
support the alternative hypothesis.")
}
## Fail to reject the null hypothesis. There is not enough evidence to
## support the alternative hypothesis.
cat("\nP-value:",p_value,"\n")
##
## P-value: 0.1890954
critical_value<-qnorm(1-alpha)

mu_alternative <- 17.5

z_alternative <- (mr- mu_alternative) / (s / sqrt(n))

beta <- pnorm(critical_value- z_alternative)

cat("Probability of Type II error:", beta, "\n")

## Probability of Type II error: 0.9140917

ROE: This variable refers to the Return on Equity values derived from the dataset. It constitutes
a numerical vector of observations, which serves as the basis for hypothesis testing.

Mu: The hypothesized population mean value of Return on Equity is set at 17. This benchmark is
utilized for conducting the one-sample hypothesis test.

Mean, Standard Deviation, and Sample Size:

▪ The sample mean (mr) represents the average of the observed ROE values.
▪ The sample standard deviation (s) indicates the spread or variability of ROE values.
▪ The sample size (n) denotes the number of observations in the dataset.
These statistics are instrumental in computing the test statistic and assessing the validity of the null
hypothesis.

Z-value: This is the z-test statistic calculated using the formula:

\[ z = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]
It quantifies how many standard errors the sample mean deviates from the hypothesized population
mean.

Significance Level (Alpha): The significance level is set at 0.05, implying a 5% threshold for
rejecting the null hypothesis. This level reflects the probability of committing a Type I error.

P-value: The p-value represents the probability of observing a test statistic as extreme as, or more
extreme than, the one calculated, assuming the null hypothesis holds true. It is compared to alpha to
determine statistical significance.
Interpretation of Hypothesis Testing Result
The calculated p-value is approximately 0.1891, which exceeds the significance level of 0.05.
Conclusion: Fail to reject the null hypothesis.

There is not enough statistical evidence to suggest that the population mean ROE is greater than 17,
the alternative tested by this upper-tailed z-test.
The sample does not provide strong support for the alternative hypothesis.

Type II Error and Power Consideration

Alternative Hypothesis Mean: The hypothetical mean for ROE is 17.5, used to assess Type II error
probability.

Critical Value: The critical z-score at a 5% significance level for a one-tailed test, determining
whether to reject the null hypothesis.

Z-value under the Alternative Hypothesis: This statistic measures how many standard errors the
sample mean lies from the alternative mean.

Type II Error Probability (Beta): The likelihood of failing to reject the null hypothesis when the true
population mean is 17.5 is approximately 0.9141.

Conclusion on Type II Error:

A Type II error probability of 91.41% is considerably high, suggesting that the test has low power in
detecting a small difference from the hypothesized value. To improve the reliability of the test, it
may be necessary to increase the sample size or re-evaluate the effect size being tested.
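
For comparison, the textbook Type II error for an upper-tailed z-test measures the shift between the hypothesized mean and the alternative mean, rather than between the sample mean and the alternative mean as in the code above. The sketch below implements that standard formula; type2_upper is a hypothetical helper name, and the inputs are the ROE figures reported earlier (standard error 0.8295018, n = 142). Under these assumptions beta comes out at roughly 0.85 instead of 0.91, so the conclusion of low power is unchanged.

```r
# Hypothetical helper: Type II error of an upper-tailed one-sample z-test
type2_upper <- function(mu0, mu_alt, s, n, alpha = 0.05) {
  se     <- s / sqrt(n)            # standard error of the mean
  z_crit <- qnorm(1 - alpha)       # rejection threshold on the z scale
  shift  <- (mu_alt - mu0) / se    # distance between the two means, in standard errors
  pnorm(z_crit - shift)            # P(fail to reject | true mean = mu_alt)
}

se_roe <- 0.8295018                # standard error reported for ROE
n_roe  <- 142
beta   <- type2_upper(mu0 = 17, mu_alt = 17.5, s = se_roe * sqrt(n_roe), n = n_roe)
power  <- 1 - beta
print(c(beta = beta, power = power))
```

When mu_alt equals mu0 the function returns 1 - alpha, which is a quick sanity check on the formula.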
Upper-Tailed Test
t.test(data$roe, mu = 17, alternative = 'greater')
##
## One Sample t-test
##
## data: data$roe
## t = 0.88123, df = 141, p-value = 0.1898
## alternative hypothesis: true mean is greater than 17
## 95 percent confidence interval:
## 16.35755 Inf
## sample estimates:
## mean of x
## 17.73099

ROE refers to the Return on Equity values in the dataset, used for one sample t-tests to see if the
sample mean differs from a hypothesized population mean.

▪ Hypothesized Mean (Mu): The assumed population mean is 17. Each test checks if the
sample mean significantly differs from this value.

▪ Sample Mean: The ROE sample mean is approximately 17.73 based on 142 observations.

▪ Degrees of Freedom: Tests are conducted with 141 degrees of freedom, corresponding to
a sample size of 142.

Upper Tailed Test:

Objective: To test whether the true population mean is greater than 17.

Test Statistic and Results:

t-value equals 0.88123, p-value equals 0.1898, 95 percent confidence interval ranges from 16.36 to
infinity.

Conclusion: Since the p-value is greater than 0.05, we fail to reject the null hypothesis.

There is not enough statistical evidence to conclude that the true mean ROE is greater than 17.
Lower-Tailed Test
t.test(data$roe, mu = 17, alternative = 'less')
##
## One Sample t-test
##
## data: data$roe
## t = 0.88123, df = 141, p-value = 0.8102
## alternative hypothesis: true mean is less than 17
## 95 percent confidence interval:
## -Inf 19.10442
## sample estimates:
## mean of x
## 17.73099

Lower Tailed Test

Objective: To test whether the true population mean is less than 17.

Test Statistic and Results:

t-value equals 0.88123, p-value equals 0.8102, 95 percent confidence interval ranges from negative
infinity to 19.10.

Conclusion: The p-value is much greater than 0.05, so we fail to reject the null hypothesis.

There is no statistical evidence to support the claim that the true mean ROE is less than 17.

Two-Tailed Test
t.test(data$roe, mu = 17)
##
## One Sample t-test
##
## data: data$roe
## t = 0.88123, df = 141, p-value = 0.3797
## alternative hypothesis: true mean is not equal to 17
## 95 percent confidence interval:
## 16.09112 19.37085
## sample estimates:
## mean of x
## 17.73099

Objective: To test whether the true population mean is different from 17 regardless of direction.

Test Statistic and Results:

t-value equals 0.88123, p-value equals 0.3797, 95 percent confidence interval ranges from 16.09 to
19.37
Conclusion: The p-value exceeds the significance level of 0.05, hence we do not reject the null
hypothesis. There is insufficient evidence to indicate that the true mean ROE differs from 17.

Overall Summary
Across all three hypothesis tests (upper-tailed, lower-tailed, and two-tailed), the conclusion
remains consistent: we fail to reject the null hypothesis. The sample data does not provide sufficient
evidence to conclude that the true mean Return on Equity (ROE) is significantly greater than, less
than, or different from 17. The hypothesized value of 17 falls within the confidence intervals in all
cases, and the high p-values further reinforce the lack of statistical significance.
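
The three p-values are tied together: the upper- and lower-tailed p-values sum to 1, and the two-tailed p-value is twice the smaller of the two (in the output above, 0.1898 + 0.8102 = 1 and 2 × 0.1898 ≈ 0.3797). The sketch below checks this relationship on a short vector holding only the first six ROE values from head(data), so its p-values differ from those of the full sample.

```r
x <- c(18.7, 1.6, 4.9, 11.1, 5.6, 3.5)   # first six ROE values from head(data)

p_upper <- t.test(x, mu = 17, alternative = "greater")$p.value
p_lower <- t.test(x, mu = 17, alternative = "less")$p.value
p_two   <- t.test(x, mu = 17)$p.value

print(c(upper = p_upper, lower = p_lower, two_sided = p_two))
print(p_upper + p_lower)          # sums to 1
print(2 * min(p_upper, p_lower))  # equals the two-sided p-value
```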

Next, we'll test the variable salary similarly.

sample.mean <- mean(data$salary)

print(sample.mean)
## [1] 1325.106
sample.n <- length(data$salary)
sample.sd <- sd(data$salary)
sample.se <- sample.sd/sqrt(sample.n)
print(sample.se)
## [1] 129.225
alpha = 0.05
degrees.freedom = sample.n- 1
t.score = qt(p=alpha/2, df=degrees.freedom,lower.tail=F)
print(t.score)
## [1] 1.976931
margin.error <- t.score * sample.se
print(margin.error)
## [1] 255.469
lower.bound <- sample.mean- margin.error
upper.bound <- sample.mean + margin.error
print(c(lower.bound,upper.bound))
## [1] 1069.637 1580.575

Salary

Salary refers to the CEO salary in the 1990s.

Mu: The hypothesized population mean value of salary is set to ₹1325.

Mean, Standard Deviation, and Sample Size

• The sample mean (mr) is approximately ₹1325.106, representing the average of the observed
salary values.

• The sample standard deviation (s) measures the variability of the salary data. Though not
printed directly, it was used to compute the standard error.

• The sample size (n) refers to the total number of salary observations in the dataset.

These descriptive statistics are used to calculate the test statistic and assess the validity of the null
hypothesis.

Hypothesis Testing-salary
salary <- data$salary

mu <- 1325

mr <- mean(salary)
s <- sd(salary)

n <- length(salary)

z <- (mr - mu) / (s / sqrt(n))

alpha <- 0.05

p_value<-1-pnorm(z)

Z-value
The z-test statistic is computed as mentioned above and this value indicates how many standard
errors the sample mean deviates from the hypothesized population mean of ₹1325.
Significance Level (Alpha)

The significance level is set at 0.05, implying a 5% threshold for rejecting the null hypothesis. This
reflects the maximum probability of committing a Type I error (false positive).

P-value

The calculated p-value is approximately 0.4997. This is the probability of observing a test statistic as
extreme as (or more extreme than) the one calculated, assuming the null hypothesis is true.

Interpretation of Hypothesis Testing Result

• Since the p-value (0.4997) > α (0.05), we fail to reject the null hypothesis.

Conclusion

There is no strong statistical evidence to suggest that the true mean salary is greater than ₹1325, the
alternative tested by this upper-tailed z-test. The hypothesized value (₹1325) lies well within the 95%
confidence interval computed earlier (1069.637 to 1580.575), and the test result does not support the
alternative hypothesis.

Type-II Error and Power Analysis- Salary

cat("\nP-value:",p_value,"\n")
##
## P-value: 0.4996739
critical_value<-qnorm(1-alpha)

mu_alternative <- 1330

z_alternative <- (mr- mu_alternative) / (s / sqrt(n))

beta <- pnorm(critical_value- z_alternative)

cat("Probability of Type II error:", beta, "\n")

## Probability of Type II error: 0.9537862

Type II Error Probability (Beta)

The probability of a Type II error is approximately 0.9538. This means there is a 95.38% chance of failing to
reject the null hypothesis even when the true population mean is actually ₹1330.
Conclusion on Type II Error

A Type II error probability of 95.38% is very high, indicating that the test has low power to detect a small
deviation from the hypothesized mean. To enhance the reliability of the test, increasing the sample size or
testing against a larger effect size (difference from the null) would be recommended.

One sample T-test

t.test(data$salary, mu = 1325, alternative = 'greater')

##
## One Sample t-test
##
## data: data$salary
## t = 0.00081744, df = 141, p-value = 0.4997
## alternative hypothesis: true mean is greater than 1325
## 95 percent confidence interval:
## 1111.144 Inf
## sample estimates:
## mean of x
## 1325.106
t.test(data$salary, mu = 1325,alternative='less')
##
## One Sample t-test
##
## data: data$salary
## t = 0.00081744, df = 141, p-value = 0.5003
## alternative hypothesis: true mean is less than 1325
## 95 percent confidence interval:
## -Inf 1539.068
## sample estimates:
## mean of x
## 1325.106
t.test(data$salary, mu = 1325)
##
## One Sample t-test
##
## data: data$salary
## t = 0.00081744, df = 141, p-value = 0.9993
## alternative hypothesis: true mean is not equal to 1325
## 95 percent confidence interval:
## 1069.637 1580.575
## sample estimates:
## mean of x
## 1325.106
Interpretation:

The statistical analysis displays three variations of one-sample t-tests examining different alternative
hypotheses regarding the average salary value, with each test comparing against a reference value of
1325:

Test Results Analysis

One-Tailed Tests:

For the "greater than" alternative hypothesis (H₁: μ > 1325), the test yielded a t-statistic of
0.00081744 (df = 141) with a p-value of 0.4997.

For the "less than" alternative hypothesis (H₁: μ < 1325), the test produced the same t-statistic with
a p-value of 0.5003.

Two-Tailed Test:

For the two-sided alternative hypothesis (H₁: μ ≠ 1325), the test produced the same t-statistic with
a p-value of 0.9993.

95% Confidence Intervals:

Greater than alternative: [1111.144, Inf]

Less than alternative: [-Inf, 1539.068]

Two-tailed test: [1069.637, 1580.575]

Statistical Significance

All three tests yield p-values substantially greater than the conventional significance level of 0.05,
indicating a failure to reject the null hypothesis in each scenario. The sample mean of 1325.106 is
remarkably close to the hypothesized value of 1325, resulting in a very small t-statistic and
correspondingly high p-values.

Overall Conclusion

The comprehensive statistical analysis demonstrates that the available evidence does not substantiate
the assertion that the mean salary differs significantly from the specified null hypothesis value of
1325. The sample mean of 1325.106 shows only a negligible deviation of 0.106 from the
hypothesized value, which is not statistically significant.

These high p-values indicate a lack of evidence for any deviation from the null hypothesis: the
observed sample mean is consistent with what would be expected if the true population mean were
indeed 1325.
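
The relationship between the three p-values can be sketched by computing the test by hand. The example below uses simulated data (`x` and `mu0` are illustrative, not the salary column) and shows why the "greater" and "less" p-values sum to 1 and why the two-sided p-value is twice the smaller of the two, which matches the pattern in the output above up to rounding.

```r
# Manual one-sample t-test; `x` is simulated, not the salary data.
set.seed(1)
x <- rnorm(30, mean = 50, sd = 10)
mu0 <- 48

n <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
df <- n - 1

p_greater <- pt(t_stat, df, lower.tail = FALSE)  # H1: mu > mu0
p_less    <- pt(t_stat, df)                      # H1: mu < mu0
p_two     <- 2 * min(p_greater, p_less)          # H1: mu != mu0

print(c(t = t_stat, greater = p_greater, less = p_less, two.sided = p_two))
```

All three tests share one t-statistic; only the tail area that is read off the t-distribution changes.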
Inferences Based on two samples

t_test_result <- t.test(data$roe, data$salary, alternative = "greater")
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: data$roe and data$salary
## t = -10.117, df = 141.01, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -1521.341 Inf
## sample estimates:
## mean of x mean of y
## 17.73099 1325.10563

Interpretation:

This analysis uses a two-sample t-test to determine if there is a significant difference in the mean
values of ROE and salary. It includes sections on descriptive statistics, hypothesis testing, and final
inference based on the test result.

Descriptive Statistics:

Before conducting hypothesis testing, it is important to review the descriptive statistics of the
variables ROE and salary. The t-test compares these two variables to determine if a significant
difference exists between them. The mean ROE is 17.73099, and the mean salary is 1325.10563,
indicating a noticeable gap in central tendencies. This difference highlights the necessity of
hypothesis testing to establish if the disparity is statistically significant.

Hypothesis Testing:

The Welch Two Sample t-test was used to test whether the average Return on Equity (ROE) exceeds
the average salary. The test produced a t-value of -10.117 with 141.01 degrees of freedom and a
p-value of 1, so the null hypothesis cannot be rejected. The negative t-value shows that the sample
mean of ROE lies below the sample mean of salary, which is the opposite of the direction stated in
the alternative hypothesis. Therefore, there is no statistical evidence supporting the alternative
hypothesis.

Interpretation of Results:

The Welch Two Sample t-test gives no evidence of a difference in the hypothesized direction
between ROE and salary. The negative test statistic and the p-value of essentially 1 mean that the
claim of ROE being greater than salary is unsupported: if the mean difference were zero, a
t-statistic of -10.117 or larger would be observed almost every time. A statistic this far below
zero indicates that mean salary is well above mean ROE, so the true mean difference lies in the
opposite direction.
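
As a rough illustration of what the Welch test computes, the sketch below reproduces the t-statistic and the Welch-Satterthwaite degrees of freedom by hand on simulated samples (`a` and `b` are illustrative, not the ROE and salary columns). When one sample's variance dominates, the degrees of freedom approach that sample's n - 1, which is consistent with the reported value of 141.01.

```r
# Welch t-statistic and Welch-Satterthwaite df; `a` and `b` are simulated.
set.seed(2)
a <- rnorm(40, mean = 10, sd = 2)
b <- rnorm(40, mean = 12, sd = 6)

va <- var(a) / length(a)   # squared standard error of mean(a)
vb <- var(b) / length(b)   # squared standard error of mean(b)

t_welch <- (mean(a) - mean(b)) / sqrt(va + vb)
df_welch <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))
p_greater <- pt(t_welch, df_welch, lower.tail = FALSE)  # H1: mean(a) > mean(b)

ref <- t.test(a, b, alternative = "greater")  # Welch is t.test's default
print(c(t = t_welch, df = df_welch, p = p_greater))
```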

Interpretation for Box Plots:

A box plot, or box-and-whisker plot, shows the distribution of numerical data through its quartiles. It
includes a box covering the interquartile range (IQR) with a line for the median. Whiskers extend to the
smallest and largest data points within 1.5 times the IQR of the box edges, while points beyond that are
plotted separately as outliers. Box plots effectively summarize the data’s spread, central tendency, and
outliers, which makes them useful for comparing groups.
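
The whisker rule described above can be sketched as follows. The sample `v` is simulated and right-skewed; it is not the salary column. Note that `boxplot()` itself uses Tukey hinges, which can differ slightly from the `quantile()` values used here.

```r
# Quartiles, fences, and whisker ends for a simulated right-skewed sample.
set.seed(3)
v <- rexp(100, rate = 1 / 1000)

q <- quantile(v, c(0.25, 0.5, 0.75))
iqr <- q[3] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

whisker_low  <- min(v[v >= lower_fence])   # smallest point inside the fences
whisker_high <- max(v[v <= upper_fence])   # largest point inside the fences
outliers <- v[v < lower_fence | v > upper_fence]

print(c(whisker_low = whisker_low, whisker_high = whisker_high))
print(length(outliers))
```

Because whiskers end at observed data points, they cannot extend past the sample minimum or maximum, even when a fence falls outside the range of the data.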

ROE:

The boxplot for “ROE” shows a median value slightly below 18, with the interquartile range (IQR)
spanning approximately 12 to 23. The whiskers extend from around 5 to 30, capturing most of the
variation in the data. A few outliers beyond this range are visible, indicating the presence of extreme
values either much higher or lower than the central mass of the data. Overall, the ROE distribution
appears moderately spread out, with a few unusually high or low entries.

Salary:

The boxplot for “Salary” depicts a much higher median, around 1500, and an interquartile range
stretching from about 600 to 2400. The lower fence (Q1 − 1.5 × IQR) falls below zero, so the lower
whisker stops at the smallest observed salary, while the upper whisker reaches past 5500. Several
outliers are visible far above the upper whisker, highlighting significant variability and the
presence of extremely high salary values in the dataset.

Comparison:

Salary displays greater variability and a higher central value than ROE, which is more consistent and
lower. This contrast reflects significant differences in scale and distribution between the two
variables.

Prediction Interval-roe
sample.mean <- mean(data$roe)
sample.n <- length(data$roe)
sample.sd <- sd(data$roe)

# Scales the SD by (1 + 1/n); the textbook prediction SE is sd * sqrt(1 + 1/n)
sample.se <- sample.sd * (1 + (1 / sample.n))

alpha <- 0.05
degrees.freedom <- sample.n - 1

# Two-sided critical value from the t-distribution
t.score <- qt(p = alpha / 2, df = degrees.freedom, lower.tail = FALSE)
margin.error <- t.score * sample.se

lower_bound <- sample.mean - margin.error
upper_bound <- sample.mean + margin.error

print(c(lower_bound, upper_bound))
## [1] -1.947915 37.409887

A prediction interval is a range within which a future individual observation is expected to fall with
a specified probability (in this instance, 95%). Unlike a confidence interval, which estimates the
population mean, a prediction interval accounts for both the uncertainty in the sample mean and the
inherent variability of individual data points. This results in a broader interval, making it the
appropriate tool for predicting individual values.

Descriptive Statistics

Key statistics calculated from the ROE data:

• Sample Mean (Central Tendency): The average ROE across all firms
• Sample Size (n): Number of firms in the dataset
• Sample Standard Deviation: Measures dispersion of ROE values around the mean
• Standard Error: Adjusted for prediction uncertainty (incorporates 1/n term)
• Prediction Interval: Spans from -1.95 to 37.41, indicating substantial variability

These statistics provide the foundation for calculating the prediction interval and understanding the
distribution of ROE values.

Prediction Interval Calculation

The interval was computed through these steps:

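• Determine t-score: Using the t-distribution (qt function) with α = 0.05 and n − 1 degrees of freedom
• Calculate margin of error: t-score × standard error, where the code uses SE = sd × (1 + 1/n); the
textbook prediction standard error is sd × √(1 + 1/n), which is slightly smaller
• Construct interval: Mean ± margin of error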

Key formula components:

Standard Error: sample.sd * (1 + (1/sample.n))

Margin of Error: t.score * sample.se

Interval Bounds: sample.mean ± margin.error

The calculation yields: [-1.947915, 37.409887]
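
As a point of comparison, here is a minimal sketch of the textbook prediction interval for one new observation, which scales the standard deviation by sqrt(1 + 1/n). It assumes the data are roughly normal. The helper name `prediction_interval` and the simulated vector `x` are illustrative and are not part of the assignment code.

```r
# Prediction interval for one new observation, assuming roughly normal data.
# `x` is simulated; `prediction_interval` is an illustrative helper name.
prediction_interval <- function(x, level = 0.95) {
  n <- length(x)
  se_pred <- sd(x) * sqrt(1 + 1 / n)             # textbook prediction SE
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)  # two-sided critical value
  mean(x) + c(lower = -1, upper = 1) * t_crit * se_pred
}

set.seed(4)
x <- rnorm(142, mean = 18, sd = 10)
pi_sqrt <- prediction_interval(x)
print(pi_sqrt)

# Half-width when the SD is scaled by (1 + 1/n) instead of sqrt(1 + 1/n)
half_linear <- qt(0.975, df = length(x) - 1) * sd(x) * (1 + 1 / length(x))
print(half_linear / (diff(pi_sqrt) / 2))   # ratio equals sqrt(1 + 1/n)
```

For a sample of this size the two scalings differ by well under one percent, so the choice mainly matters for small n.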

Interpretation of Prediction Interval

The 95% prediction interval suggests:

• Range Interpretation: A single future ROE observation is expected to fall between -1.95 and 37.41
with 95% probability, assuming ROE is roughly normally distributed
• Variability: The wide range (-1.95 to 37.41) indicates high dispersion in firm profitability
• Negative Values: The lower bound is negative, so a negative ROE (a firm operating at a loss) is
within the plausible range
• Symmetry: The interval is symmetric about the sample mean of 17.73 by construction, so it cannot
by itself reveal skewness or exceptionally high performers

Prediction interval – salary

sample.mean <- mean(data$salary)
sample.n <- length(data$salary)
sample.sd <- sd(data$salary)

alpha <- 0.05

# Scales the SD by (1 + 1/n); the textbook prediction SE is sd * sqrt(1 + 1/n)
sample.se <- sample.sd * (1 + (1 / sample.n))

degrees.freedom <- sample.n - 1

# Two-sided critical value from the t-distribution
t.score <- qt(p = alpha / 2, df = degrees.freedom, lower.tail = FALSE)
margin.error <- t.score * sample.se

lower_bound <- sample.mean - margin.error
upper_bound <- sample.mean + margin.error

print(c(lower_bound, upper_bound))
## [1] -1740.597 4390.809

Prediction Interval

• Lower Bound: Sample Mean − Margin of Error

• Upper Bound: Sample Mean + Margin of Error

From the output:

• Lower Bound ≈ -1740.60

• Upper Bound ≈ 4390.81

Interpretation of Prediction Interval

The interval [-1740.60, 4390.81] is a 95% prediction interval for a future individual salary
observation. The negative lower bound (-1740.60) points to several possible issues:

▪ The salary distribution might be highly skewed or contain extreme outliers.

▪ The model may not be suitable for predicting salaries, since negative values are not
applicable.
▪ A large sample standard deviation or a small sample size could increase the margin of error.
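
One common way to avoid a negative lower bound for a strictly positive, right-skewed variable is to build the interval on the log scale and exponentiate the bounds. The sketch below assumes log-salary is roughly normal and uses simulated data (`sal` is illustrative); with the real dataset, the existing `lsalary` column shown in `head(data)` could play the role of `lx`.

```r
# Prediction interval on the log scale, back-transformed to original units.
# `sal` is simulated log-normal data standing in for a positive, skewed salary.
set.seed(5)
sal <- rlnorm(142, meanlog = 6.9, sdlog = 0.6)

lx <- log(sal)
n <- length(lx)
half_width <- qt(0.975, df = n - 1) * sd(lx) * sqrt(1 + 1 / n)

pi_log <- mean(lx) + c(lower = -1, upper = 1) * half_width
pi_salary <- exp(pi_log)   # bounds are strictly positive by construction
print(pi_salary)
```

The back-transformed interval is asymmetric around the typical salary, which is usually a better match for skewed data than a symmetric interval on the raw scale.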
