Team 3 Stats
getwd()
## [1] "C:/Users/rajni/Downloads"
data <- read.csv("C:/Users/rajni/Downloads/Data_For group assignment/Data_Project/Return.csv")
head(data)
## roe rok dkr eps netinc sp90 sp94 salary return lsalary
## 1 18.7 17.4 4.0 48.1 1144 59.375 47.000 1090 -20.842110 6.993933
## 2 1.6 2.4 27.3 -85.3 35 47.875 43.500 1923 -9.138381 7.561642
## 3 4.9 4.6 36.8 -44.1 127 39.000 72.625 1012 86.217949 6.919684
## 4 11.1 8.6 46.4 192.4 367 61.250 142.000 579 131.836700 6.361302
## 5 5.6 4.5 36.2 -60.4 214 58.000 53.250 600 -8.189655 6.396930
## 6 3.5 2.9 18.7 -79.8 118 68.250 50.500 735 -26.007330 6.599871
## lsp90
## 1 4.083873
## 2 3.868593
## 3 3.663562
## 4 4.114964
## 5 4.060443
## 6 4.223177
library(ggsci)
library(haven)
library(plotrix)
library(glue)
library(ggtext)
Interpretation
• SALARY:
Refers to employee or executive compensation within a company.
Range: 267.0 to 14336.0
Summary Interpretation
o The return column shows both gains and losses across companies, with a mean return of
8.193%.
o The average return on equity (ROE) is approximately 17.73%, reflecting the typical
profitability relative to shareholder equity.
o Net income values vary significantly across companies, with some reporting negative
earnings.
o Earnings per share (EPS) exhibit significant variability and include negative values, indicating
losses for some firms.
o Salaries are skewed due to a few high values (up to 14336.0), which justifies the use of log
salary (lsalary) to reduce skewness.
o Similar trends are observed in stock prices (sp90 and sp94) and their log versions.
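The effect of the log transformation can be checked on the six rows printed by head(data). The sketch below is illustrative only: the values are copied by hand from that output, and `toy` is a hypothetical stand-in for the full data frame.

```r
# Hypothetical stand-in: salary and lsalary from the six rows shown by head(data).
toy <- data.frame(
  salary  = c(1090, 1923, 1012, 579, 600, 735),
  lsalary = c(6.993933, 7.561642, 6.919684, 6.361302, 6.396930, 6.599871)
)

# lsalary appears to be the natural log of salary; measure the largest gap.
toy$log_salary <- log(toy$salary)
max_gap <- max(abs(toy$log_salary - toy$lsalary))
print(max_gap)

# The log scale compresses large values: compare max / median on each scale.
ratio_raw <- max(toy$salary) / median(toy$salary)
ratio_log <- max(toy$lsalary) / median(toy$lsalary)
print(c(raw = ratio_raw, log = ratio_log))
```

On the full dataset the same comparison could be run on data$salary and data$lsalary.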
str(data)
## 'data.frame': 142 obs. of 11 variables:
## $ roe : num 18.7 1.6 4.9 11.1 5.6 ...
## $ rok : num 17.4 2.4 4.6 8.6 4.5 ...
## $ dkr : num 4 27.3 36.8 46.4 36.2 ...
## $ eps : num 48.1 -85.3 -44.1 192.4 -60.4 ...
## $ netinc : int 1144 35 127 367 214 118 175 1692 157 315 ...
## $ sp90 : num 59.4 47.9 39 61.2 58 ...
## $ sp94 : num 47 43.5 72.6 142 53.2 ...
## $ salary : int 1090 1923 1012 579 600 735 994 1227 913 733 ...
## $ return : num -20.84 -9.14 86.22 131.84 -8.19 ...
## $ lsalary: num 6.99 7.56 6.92 6.36 6.4 ...
## $ lsp90 : num 4.08 3.87 3.66 4.11 4.06 ...
print(str)
## function (object, ...)
## UseMethod("str")
## <bytecode: 0x000002e52fbc1bf8>
## <environment: namespace:utils>
names(data)
## [1] "roe" "rok" "dkr" "eps" "netinc" "sp90" "sp94"
## [8] "salary" "return" "lsalary" "lsp90"
nrow(data)
## [1] 142
ncol(data)
## [1] 11
dim(data)
## [1] 142 11
The Return on Capital (ROC) histogram has a pattern similar to ROE. It shows right-skewness, with
most companies having ROC values between 5 and 20. As ROC increases beyond this range, the
number of observations decreases significantly. This suggests that most companies produce
moderate returns on their invested capital, with only a few firms converting capital into profits more
efficiently. The data distribution allows for identifying consistent performance among firms and
points out outliers with very high ROC.
Summary
The concept of the box plot was introduced by John Tukey in 1970 and elaborated upon in his book
"Exploratory Data Analysis" published in 1977. Known alternatively as a whisker plot, box-and-
whisker plot, or simply a box-and-whisker diagram, this graphical representation illustrates the
distribution of a dataset. It succinctly displays key summary statistics such as the median, quartiles,
and potential outliers. Box plots offer a compact visual summary of data distributions, aiding in the
identification of potential outliers and facilitating the comparison of different datasets.
First Quartile (Q1) - Represents the value below which 25% of the data falls.
Median (Q2) - Marks the midpoint of the dataset, with half of the values below and half above it.
Third Quartile (Q3) - Indicates the value below which 75% of the data falls.
Skewness: A box plot also reveals skewness. When the upper whisker and the outliers extend much
farther than the lower ones, the data are right-skewed. For example, in a dataset of per-country
consumption with a median of approximately 1.2 liters, a few countries with significantly higher
consumption pull the distribution to the right.
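The quartiles and the usual 1.5 × IQR outlier rule behind a box plot can be computed directly. This is a minimal sketch using the six salary values printed by head(data) as a hypothetical stand-in for the full column; quantile() is called with R's default interpolation, which can differ slightly from the hinges used by some box plot functions.

```r
# Hypothetical stand-in: the six salary values shown by head(data).
salary <- c(1090, 1923, 1012, 579, 600, 735)

# Q1, median (Q2), and Q3 with quantile()'s default settings.
q <- quantile(salary, probs = c(0.25, 0.50, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Tukey's fences: points more than 1.5 * IQR beyond the box are flagged.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- salary[salary < lower_fence | salary > upper_fence]

print(q)
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

With only six values the result is not representative of the dataset; the point is the mechanics of the rule.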
Description: The boxplot of Average Return (variable return) shows the central tendency and spread of
average returns across companies.
library(ggplot2)
ggplot(data, aes(x = return)) +
geom_boxplot(fill = "aquamarine4") +
labs(title = "BOXPLOT", x = "Average Return", y = "")
Interpretation:
The box is centered around a small positive value, showing that most companies have modest
average returns. The outliers on the right side indicate that some companies have very high returns,
possibly due to exceptional performance or one-time events. Negative values are visible on the left
tail, indicating a few underperforming firms.
The distribution is right-skewed, reflecting positive outliers pulling the average up.
Description: The boxplot for Net Income reveals how company earnings are distributed.
library(ggplot2)
ggplot(data, aes(x = netinc)) +
geom_boxplot(fill = "pink") +
labs(title = "BOXPLOT", x = "Net Income", y = "")
Interpretation:
• Most data is near 0, with few outliers far to the right.
• This shows most firms earn modest profits, with only a few making high net income.
• Highly profitable firms are the extreme rightward outliers.
• The data is highly right-skewed, indicating significant income inequality among firms.
Scatter plot
A scatter plot uses dots to show values for two numeric variables. Each dot's position on the axes
represents an individual data point’s values. Scatter plots help observe relationships between
variables.
For example, in a scatter plot of tree diameters and heights, each dot represents one tree’s diameter
(horizontal) and height (vertical). Such a plot can reveal a positive correlation between diameter and
height, along with an outlier: a tree with a large diameter but relatively short height, which may need
further investigation.
Common issues:
Interpreting correlation as causation
This issue pertains to the interpretation of scatter plots rather than their creation. Observing a
relationship between two variables in a scatter plot does not necessarily imply that changes in one
variable cause changes in the other. This is encapsulated by the phrase "correlation does not imply
causation." The observed relationship could be driven by a third variable affecting both plotted
variables, the causal link could be reversed, or the pattern might be coincidental. For instance,
drawing conclusions about the amount of green space and crime rates in cities based solely on their
correlation can be misleading, as larger cities with more people tend to have higher values for both
measures. Establishing a causal link requires further analysis to control for other potential variables
and rule out alternative explanations.
Overplotting
When plotting a large number of data points, overplotting can occur. Overplotting is when data
points overlap to such an extent that it becomes difficult to discern relationships between points and
variables. It can be challenging to see how densely packed data points are if many are concentrated
in a small area. To address this issue, several methods can be used. One approach is to sample only a
subset of data points randomly, which should still provide an idea of the patterns in the full dataset.
Another method involves altering the appearance of dots by adding transparency or reducing point
size to minimize overlaps. Additionally, using a different chart type, such as a heatmap where color
indicates the number of points in each bin, can be effective. Heatmaps in this context are also
known as 2D histograms.
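The three remedies for overplotting described above can be sketched in base R. The data here are simulated purely for illustration and have nothing to do with the Return.csv dataset; the variable names are hypothetical.

```r
set.seed(1)  # simulated points, for illustration only
n <- 5000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# Remedy 1: plot a random subset of the points.
idx <- sample(n, size = 500)

# Remedy 2: semi-transparent points, so dense regions appear darker.
faded <- rgb(0, 0, 0, alpha = 0.1)

# Remedy 3: a 2D histogram, counting points in a 10 x 10 grid of bins.
counts <- table(cut(x, breaks = 10), cut(y, breaks = 10))

# Draw the three versions only in an interactive session.
if (interactive()) {
  plot(x[idx], y[idx], pch = 16)
  plot(x, y, pch = 16, col = faded)
  image(unclass(counts))
}
```

With 142 firms, as in this report, overplotting is unlikely to be severe, so transparency alone would probably suffice.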
Some companies with high risk exhibit higher returns, which aligns with the risk-return tradeoff
theory in finance. However, the data also indicates that higher risk does not necessarily ensure
higher returns—some high-risk firms experience low or negative returns. Overall, the plot suggests a
weak but noticeable positive relationship between risk and return.
Sample Mean
The sample mean represents the average value of a set of observations taken from a larger
population. It indicates the central point around which the sample data clusters and serves as an
estimate of the overall population mean.
The sample standard error, on the other hand, quantifies how much the sample mean is likely to
vary if different samples of the same size were repeatedly taken from the population. It reflects the
precision of the sample mean as an estimate of the true population mean — the smaller the standard
error, the more precise the estimate.
Standard Error:
• 0.83 (approximately).
• Indicates how accurate the sample mean is as an estimate of the true population mean. A
lower value means more precision.
Alpha (α):
• 0.05, corresponding to a 95% confidence level.
t-score:
• 1.98 (approximately).
• Represents the critical value from the t-distribution for a 95% confidence interval.
Margin of Error:
• 1.64 (approximately).
• This value is added to and subtracted from the sample mean to create the
confidence interval.
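The standard error, t-score, and margin of error listed above combine into a confidence interval as sketched below. The inputs are read off the t.test() output reported for ROE in this report; recovering the standard error from the reported t statistic is an assumption made so that the example runs without the data file.

```r
# Values read off the one-sample t.test() output for ROE.
x_bar  <- 17.73099   # sample mean of ROE
n      <- 142        # sample size (df = 141)
t_stat <- 0.88123    # reported t statistic for H0: mu = 17
mu0    <- 17

# Assumption: recover the standard error from t = (x_bar - mu0) / se.
se <- (x_bar - mu0) / t_stat

alpha   <- 0.05
t_score <- qt(1 - alpha / 2, df = n - 1)  # two-sided critical value
margin  <- t_score * se
ci      <- c(lower = x_bar - margin, upper = x_bar + margin)

print(round(c(se = se, t_score = t_score, margin = margin), 2))
print(round(ci, 2))
```

With the full data available, se would normally be computed as sd(data$roe) / sqrt(length(data$roe)).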
Confidence Interval
A confidence interval is the mean of your estimate plus and minus the variation in that estimate.
This is the range of values you expect your estimate to fall between if you redo your test, within a
certain level of confidence. Confidence, in statistics, is another way to describe probability.
Confidence Interval:
• The 95% confidence interval for the true mean ROE is (16.09%, 19.37%).
Interpretation:
• There is 95% confidence that the true mean ROE for the population lies between 16.09% and
19.37%.
• The interval provides a precise estimate of the mean ROE.
• The interval is above zero, indicating that firms in the dataset have a positive ROE on average.
Statistical Summary: ROE Confidence Interval
Interpretation:
▪ The true mean ROE of the population is estimated to lie between 16.09% and 19.37% with
a 95% confidence level.
▪ This range indicates moderate profitability for the average firm in the dataset.
▪ The narrow confidence interval shows that the sample provides a precise estimate of the
mean ROE due to a relatively small standard error.
Overall Insights:
▪ Average Return and Net Income are largely influenced by a few high-performing firms.
▪ Boxplots identify outliers that may require further investigation (top performers).
▪ The scatter plot indicates a risk-return pattern, though it is not very strong.
▪ The mean ROE is estimated with reasonable precision at about 17.7% (95% confidence interval
16.09% to 19.37%), indicating a generally healthy return on equity.
Purpose of the Code: The goal is to calculate a 95% confidence interval for the mean ROE of a
samp
roe <- data$roe
mu <- 17
alpha <- 0.05
mr <- mean(roe)
s <- sd(roe)
n <- length(roe)
z <- (mr - mu) / (s / sqrt(n))  # one-sample z statistic
p_value <- 1 - pnorm(z)  # upper-tailed p-value
if (p_value < alpha) {
  cat("Reject the null hypothesis. There is evidence to support the alternative hypothesis.")
} else {
  cat("Fail to reject the null hypothesis. There is not enough evidence to
support the alternative hypothesis.")
}
## Fail to reject the null hypothesis. There is not enough evidence to
## support the alternative hypothesis.
cat("\nP-value:",p_value,"\n")
##
## P-value: 0.1890954
critical_value<-qnorm(1-alpha)
ROE: This variable refers to the Return on Equity values derived from the dataset. It constitutes
a numerical vector of observations, which serves as the basis for hypothesis testing.
Mu: The hypothesized population mean value of Return on Equity is set at 17. This benchmark is
utilized for conducting the one-sample hypothesis test.
Significance Level (Alpha): The significance level is set at 0.05, implying a 5% threshold for
rejecting the null hypothesis. This level reflects the probability of committing a Type I error.
P-value: The p-value represents the probability of observing a test statistic as extreme as, or more
extreme than, the one calculated, assuming the null hypothesis holds true. It is compared to alpha to
determine statistical significance.
Interpretation of Hypothesis Testing Result
The calculated p-value is approximately 0.1891, which exceeds the significance level of 0.05.
Conclusion: Fail to reject the null hypothesis.
There is not enough statistical evidence to suggest that the population mean ROE exceeds 17, the
direction tested by the upper-tail p-value. The sample does not provide strong support for the
alternative hypothesis.
Alternative Hypothesis Mean: The hypothetical mean for ROE is 17.5, used to assess Type II error
probability.
Critical Value: The critical z-score at a 5% significance level for a one-tailed test, determining
whether to reject the null hypothesis.
Z-value under the Alternative Hypothesis: This statistic measures the difference between the
sample mean and the alternative mean.
Type II Error Probability (Beta): The likelihood of failing to reject the null hypothesis when the true
population mean is 17.5 is approximately 0.9141.
A Type II error probability of 91.41% is considerably high, suggesting that the test has low power in
detecting a small difference from the hypothesized value. To improve the reliability of the test, it
may be necessary to increase the sample size or re-evaluate the effect size being tested.
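For comparison, the textbook Type II error calculation for an upper-tailed z-test can be sketched as follows. The standard error is back-calculated from the reported t statistic, which is an assumption. Under these inputs the sketch gives a value near 0.85, somewhat below the 0.9141 quoted above; the original computation is not shown, so it may have defined the z-value under the alternative differently (for instance relative to the sample mean rather than the null mean).

```r
# Assumed inputs, taken from figures quoted in this report.
mu0    <- 17      # null mean
mu_alt <- 17.5    # alternative mean
alpha  <- 0.05
se     <- (17.73099 - 17) / 0.88123  # standard error implied by the reported t

# Upper-tailed z-test: reject H0 when the sample mean exceeds this cutoff.
critical_value <- qnorm(1 - alpha)
cutoff <- mu0 + critical_value * se

# beta = P(sample mean <= cutoff) when the true mean is mu_alt.
beta  <- pnorm((cutoff - mu_alt) / se)
power <- 1 - beta
print(c(beta = beta, power = power))
```

Either way the conclusion is the same: a 0.5-point difference is small relative to the standard error, so the test has low power.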
Upper-Tailed Test
t.test(data$roe, mu = 17, alternative = 'greater')
##
## One Sample t-test
##
## data: data$roe
## t = 0.88123, df = 141, p-value = 0.1898
## alternative hypothesis: true mean is greater than 17
## 95 percent confidence interval:
## 16.35755 Inf
## sample estimates:
## mean of x
## 17.73099
ROE refers to the Return on Equity values in the dataset, used for one-sample t-tests to see if the
sample mean differs from a hypothesized population mean.
▪ Hypothesized Mean (Mu): The assumed population mean is 17. Each test checks if the
sample mean significantly differs from this value.
▪ Sample Mean: The ROE sample mean is approximately 17.73 based on 142 observations.
▪ Degrees of Freedom: Tests are conducted with 141 degrees of freedom, corresponding to
a sample size of 142.
Objective: To test whether the true population mean is greater than 17.
The t-value is 0.88123, the p-value is 0.1898, and the 95 percent confidence interval ranges from 16.36 to
infinity.
Conclusion: Since the p-value is greater than 0.05, we fail to reject the null hypothesis.
There is not enough statistical evidence to conclude that the true mean ROE is greater than 17.
Lower-Tailed Test
t.test(data$roe, mu = 17, alternative = 'less')
##
## One Sample t-test
##
## data: data$roe
## t = 0.88123, df = 141, p-value = 0.8102
## alternative hypothesis: true mean is less than 17
## 95 percent confidence interval:
## -Inf 19.10442
## sample estimates:
## mean of x
## 17.73099
Objective: To test whether the true population mean is less than 17.
The t-value is 0.88123, the p-value is 0.8102, and the 95 percent confidence interval ranges from negative
infinity to 19.10.
Conclusion: The p-value is much greater than 0.05, so we fail to reject the null hypothesis.
There is no statistical evidence to support the claim that the true mean ROE is less than 17.
Two-Tailed Test
t.test(data$roe, mu = 17)
##
## One Sample t-test
##
## data: data$roe
## t = 0.88123, df = 141, p-value = 0.3797
## alternative hypothesis: true mean is not equal to 17
## 95 percent confidence interval:
## 16.09112 19.37085
## sample estimates:
## mean of x
## 17.73099
Objective: To test whether the true population mean is different from 17 regardless of direction.
The t-value is 0.88123, the p-value is 0.3797, and the 95 percent confidence interval ranges from 16.09 to
19.37.
Conclusion: The p-value exceeds the significance level of 0.05, hence we do not reject the null
hypothesis. There is insufficient evidence to indicate that the true mean ROE differs from 17.
Overall Summary
Across all three hypothesis tests (upper-tailed, lower-tailed, and two-tailed), the conclusion
remains consistent: we fail to reject the null hypothesis. The sample data does not provide sufficient
evidence to conclude that the true mean Return on Equity (ROE) is significantly greater than, less
than, or different from 17. The hypothesized value of 17 falls within the confidence interval in all
three cases, and the high p-values further reinforce the lack of statistical significance.
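The three p-values all come from the same t statistic and differ only in which tail of the t-distribution is used. A short sketch, using the t-value and degrees of freedom reported in the outputs above:

```r
# Reported by t.test() for data$roe with mu = 17.
t_stat <- 0.88123
df     <- 141

p_upper <- pt(t_stat, df, lower.tail = FALSE)           # H1: mean > 17
p_lower <- pt(t_stat, df)                               # H1: mean < 17
p_two   <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # H1: mean != 17

print(round(c(upper = p_upper, lower = p_lower, two_sided = p_two), 4))
```

The upper- and lower-tailed p-values sum to 1, and the two-tailed p-value is twice the smaller of the two.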
Salary
Mu
This is the hypothesized population mean value of salary, which is set to ₹1325.
Mean, Standard Deviation, and Sample Size
• The sample mean (mr) is approximately ₹1325.106, representing the average of the observed
salary values.
• The sample standard deviation (s) measures the variability of the salary data. Though not
printed directly, it was used to compute the standard error.
• The sample size (n) refers to the total number of salary observations in the dataset.
These descriptive statistics are used to calculate the test statistic and assess the validity of the null
hypothesis.
Hypothesis Testing-salary
salary <- data$salary
mu <- 1325
alpha <- 0.05
mr <- mean(salary)
s <- sd(salary)
n <- length(salary)
z <- (mr - mu) / (s / sqrt(n))  # one-sample z statistic
p_value <- 1 - pnorm(z)  # upper-tailed p-value
if (p_value < alpha) {
  cat("Reject the null hypothesis. There is evidence to support the alternative hypothesis.")
} else {
  cat("Fail to reject the null hypothesis. There is not enough evidence to
support the alternative hypothesis.")
}
## Fail to reject the null hypothesis. There is not enough evidence to
## support the alternative hypothesis.
Z-value
The z-test statistic is computed as mentioned above and this value indicates how many standard
errors the sample mean deviates from the hypothesized population mean of ₹1325.
Significance Level (Alpha)
The significance level is set at 0.05, implying a 5% threshold for rejecting the null hypothesis. This
reflects the maximum probability of committing a Type I error (false positive).
P-value
The calculated p-value is approximately 0.4997. This is the probability of observing a test statistic as
extreme as (or more extreme than) the one calculated, assuming the null hypothesis is true.
• Since the p-value (0.4997) > α (0.05), we fail to reject the null hypothesis.
Conclusion
Since the p-value is greater than 0.05, we fail to reject the null hypothesis.
There is no strong statistical evidence to suggest that the true mean salary exceeds ₹1325.
The sample mean (₹1325.106) is almost identical to the hypothesized value, and the test result does not support the
alternative hypothesis.
The probability of a Type II error is approximately 0.9538. This means there is a 95.38% chance of failing to
reject the null hypothesis even when the true population mean is actually ₹1330.
Conclusion on Type II Error
A Type II error probability of 95.38% is very high, indicating that the test has low power to detect a small
deviation from the hypothesized mean. To enhance the reliability of the test, increasing the sample size or the
effect size (difference from the null) would be recommended.
The statistical analysis displays three variations of one-sample t-tests examining different alternative
hypotheses regarding the average salary value, with each test comparing against a reference value of
1325:
One-Tailed Tests:
For the "greater than" alternative hypothesis (H₁: μ > 1325), the test yielded a t-statistic of
0.00081744 (df = 141) with a p-value of 0.4997.
For the "less than" alternative hypothesis (H₁: μ < 1325), the test produced the same t-statistic with
a p-value of 0.5003.
Statistical Significance
All three tests yield p-values substantially greater than the conventional significance level of 0.05,
indicating a failure to reject the null hypothesis in each scenario. The sample mean of 1325.106 is
remarkably close to the hypothesized value of 1325, resulting in a very small t-statistic and
correspondingly high p-values.
Overall Conclusion
The comprehensive statistical analysis demonstrates that the available evidence does not substantiate
the assertion that the mean salary differs significantly from the specified null hypothesis value of
1325. The sample mean of 1325.106 shows only a negligible deviation of 0.106 from the
hypothesized value, which is not statistically significant.
These high p-values indicate a lack of substantial evidence to
support any significant deviation from the null hypothesis. The statistical tests suggest that the
observed sample mean is consistent with what would be expected if the true population mean were
indeed 1325.
Inferences Based on two samples
t_test_result <- t.test(data$roe, data$salary, alternative = "greater")
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: data$roe and data$salary
## t = -10.117, df = 141.01, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -1521.341 Inf
## sample estimates:
## mean of x mean of y
## 17.73099 1325.10563
Interpretation:
This analysis uses a two-sample t-test to determine if there is a significant difference in the mean
values of ROE and salary. It includes sections on descriptive statistics, hypothesis testing, and final
inference based on the test result.
Descriptive Statistics:
Before conducting hypothesis testing, it is important to review the descriptive statistics of the
variables ROE and salary. The t-test compares these two variables to determine if a significant
difference exists between them. The mean ROE is 17.73099, and the mean salary is 1325.10563,
indicating a noticeable gap in central tendencies. This difference highlights the necessity of
hypothesis testing to establish if the disparity is statistically significant.
Hypothesis Testing:
The Welch Two Sample t-test has been utilized to determine whether the average Return on Equity
(ROE) exceeds the average salary. The test produced a t-value of -10.117, with 141.01 degrees of
freedom, accompanied by a p-value of 1. These results indicate that the null hypothesis
cannot be rejected. Moreover, the negative t-value shows that the sample mean of ROE is far below
the sample mean of salary, the opposite of the hypothesized direction. Therefore, there is no statistical
evidence supporting the alternative hypothesis.
Interpretation of Results:
The Welch Two Sample t-test results give no support for a difference in the
hypothesized direction between ROE and salary. The negative test statistic and the p-value of 1
signify that the claim of mean ROE being greater than mean salary is unsupported. A t-value this far
below zero would be very unlikely if mean ROE were in fact greater than mean salary. The data are
therefore consistent with the null hypothesis, and the observed mean difference lies in the opposite
direction: average salary is far higher than average ROE.
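The Welch statistic and its degrees of freedom can be reproduced from summary figures alone. In this sketch the two standard errors are back-calculated from the one-sample t statistics reported earlier (an assumption made so the example runs without the data file); the variable names are hypothetical.

```r
# Sample means and size as reported in the t.test() outputs.
mean_roe <- 17.73099
mean_sal <- 1325.10563
n        <- 142

# Assumption: standard errors implied by the reported one-sample t statistics.
se_roe <- (17.73099 - 17) / 0.88123
se_sal <- (1325.10563 - 1325) / 0.00081744

# Welch t statistic: difference in means over the combined standard error.
se_diff <- sqrt(se_roe^2 + se_sal^2)
t_welch <- (mean_roe - mean_sal) / se_diff

# Welch-Satterthwaite degrees of freedom (both samples have size n).
df_welch <- se_diff^4 / (se_roe^4 / (n - 1) + se_sal^4 / (n - 1))

# Upper-tailed p-value for H1: mean ROE > mean salary.
p_upper <- pt(t_welch, df_welch, lower.tail = FALSE)
print(c(t = t_welch, df = df_welch, p = p_upper))
```

Because the salary standard error dominates, the combined standard error and the degrees of freedom are almost entirely determined by the salary column.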
ROE:
The boxplot for “ROE” shows a median value slightly below 18, with the interquartile range (IQR)
spanning approximately 12 to 23. The whiskers extend from around 5 to 30, capturing most of the
variation in the data. A few outliers beyond this range are visible, indicating the presence of extreme
values either much higher or lower than the central mass of the data. Overall, the ROE distribution
appears moderately spread out, with a few unusually high or low entries.
Salary:
The boxplot for “Salary” depicts a much higher median, around 1500, and an interquartile range
stretching from about 600 to 2400. The upper whisker extends to over 5500; the lower whisker cannot
fall below the smallest observed salary of 267. Several outliers are visible far above the upper whisker, highlighting
significant variability and the presence of extremely high salary values in the dataset.
Comparison:
Salary displays greater variability and a higher central value than ROE, which is more consistent and
lower. This contrast reflects significant differences in scale and distribution between the two
variables.
Prediction Interval-roe
sample.mean <- mean(data$roe)
alpha = 0.05
degrees.freedom = sample.n- 1
print(c(lower_bound,upper_bound))
## [1] -1.947915 37.409887
A prediction interval is a statistical range that calculates where future individual observations are
likely to fall, with a specified probability (in this instance, 95%). Unlike a confidence interval, which
estimates the population mean, a prediction interval considers both the variability in the sample
mean and the inherent variability of individual data points. This results in a broader interval, making
it more suitable for predicting individual values accurately.
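The difference between the two kinds of interval can be sketched with the textbook t-based formulas, which assume a roughly normal sample. The helper `interval()` and the vector `roe_head` are hypothetical names; the six values are the ROE figures printed by head(data), so the resulting bounds will not match those reported for the full dataset, and the report's own intermediate steps are not shown.

```r
# Textbook t-based intervals for a roughly normal sample (an assumption).
interval <- function(x, level = 0.95, predict = FALSE) {
  n <- length(x)
  t_score <- qt(1 - (1 - level) / 2, df = n - 1)
  # A prediction interval adds the spread of one new observation (the "1 +").
  scale <- if (predict) sqrt(1 + 1 / n) else sqrt(1 / n)
  half_width <- t_score * sd(x) * scale
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# Hypothetical stand-in: the six ROE values shown by head(data).
roe_head <- c(18.7, 1.6, 4.9, 11.1, 5.6, 3.5)

conf_int <- interval(roe_head)                  # interval for the mean
pred_int <- interval(roe_head, predict = TRUE)  # interval for a new value

print(round(rbind(confidence = conf_int, prediction = pred_int), 2))
```

Both intervals are centered on the sample mean; the prediction interval is always the wider of the two.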
Descriptive Statistics
Sample Mean (Central Tendency): The average ROE across all firms
Sample Size (n): Number of firms in the dataset
Sample Standard Deviation: Measures dispersion of ROE values around the mean
Standard Error: Adjusted for prediction uncertainty (incorporates 1/n term)
Range: The interval spans from -1.95 to 37.41, indicating substantial variability
These statistics provide the foundation for calculating the prediction interval and understanding the
distribution of ROE values.
alpha = 0.05
degrees.freedom = sample.n- 1
print(c(lower_bound,upper_bound))
## [1] -1740.597 4390.809
Prediction Interval
----------------------------------------------------------------------------------------------------------------------------