Introduction To Statistics (4485) : Semester: Spring, 2023
ASSIGNMENT NO: 02
Ans: (a) Discuss the different measures of dispersion. Also indicate their
merits and demerits.
Range:
Merits: It is the simplest measure of dispersion, calculated as the difference
between the maximum and minimum values in a dataset. It is easy to understand
and compute.
Demerits: It is sensitive to extreme values and does not consider the distribution
of data between the minimum and maximum values. Consequently, it may not
accurately represent the overall variability.
Variance:
Merits: It considers the squared deviations of each data point from the mean,
providing a measure of average dispersion. It is widely used in statistical
analysis and has important theoretical properties.
Demerits: The variance is not in the original units of the data, making it less
interpretable. Additionally, it amplifies the effect of outliers due to squaring the
deviations.
Standard Deviation:
Merits: It is the square root of the variance and is expressed in the original units
of the data. It is widely used, easily interpretable, and considered one of the
most important measures of dispersion.
Demerits: Like the variance, the standard deviation is sensitive to outliers due to
the squaring of deviations.
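The three measures can be compared on a small dataset. The sketch below is only an illustration: the values in `data` and the added outlier are hypothetical, and the population forms of the variance and standard deviation (division by n) are assumed.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)       # maximum value minus minimum value
variance = statistics.pvariance(data)    # mean squared deviation (population form)
std_dev = statistics.pstdev(data)        # square root of the variance

# Adding one extreme value shows how each measure reacts to an outlier.
with_outlier = data + [100]
range_out = max(with_outlier) - min(with_outlier)
std_dev_out = statistics.pstdev(with_outlier)

print("range:", data_range, "variance:", variance, "standard deviation:", std_dev)
print("with outlier -> range:", range_out, "standard deviation:", round(std_dev_out, 2))
```

Both the range and the standard deviation grow once the extreme value is included, which is the sensitivity to outliers noted in the demerits.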
To find the mean, we multiply each group frequency (f) by its respective group
midpoint (the average of the lower and upper limits of the group) and sum up
these values. Then we divide by the total frequency (sum of all f values).
Group midpoint (x) is calculated by adding the lower and upper limits of each
group and dividing by 2.
μ = (Σ(f * x)) / Σf
= (13 * 37 + 15 * 42 + 17 * 47 + 28 * 52 + 12 * 57 + 10 * 62 + 5 * 67) / (13 +
15 + 17 + 28 + 12 + 10 + 5)
= (481 + 630 + 799 + 1456 + 684 + 620 + 335) / 100
= 5005 / 100
= 50.05
The squared deviation for each group is calculated by subtracting the mean from
the group midpoint (x) and squaring the result. Then, we multiply the squared
deviation by the frequency (f) for each group.
The variance (σ^2) is obtained by summing up all the frequency-weighted squared
deviations and dividing by the total frequency.
σ^2 = Σ(f * (x - μ)^2) / Σf
= (13 * 170.3025 + 15 * 64.8025 + 17 * 9.3025 + 28 * 3.8025 + 12 * 48.3025 +
10 * 142.8025 + 5 * 287.3025) / (13 + 15 + 17 + 28 + 12 + 10 + 5)
= (2213.9325 + 972.0375 + 158.1425 + 106.4700 + 579.6300 + 1428.0250 +
1436.5125) / 100
= 6894.75 / 100
≈ 68.95
The coefficient of variation (CV) is the ratio of the standard deviation to the
mean, expressed as a percentage.
CV = (σ / μ) * 100
First, let's calculate the standard deviation (σ), which is the square root of the
variance.
σ = √σ^2
= √68.9475
≈ 8.303
CV = (σ / μ) * 100
= (8.303 / 50.05) * 100
≈ 16.59%
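As a cross-check, the grouped-data formulas can be evaluated directly. This is a minimal sketch that assumes the midpoints and frequencies listed in the mean calculation and weights each squared deviation by its group frequency; the variable names are illustrative.

```python
import math

# Group midpoints (x) and frequencies (f) as listed in the mean calculation.
midpoints = [37, 42, 47, 52, 57, 62, 67]
frequencies = [13, 15, 17, 28, 12, 10, 5]

n = sum(frequencies)

# Mean of grouped data: sum of f * x divided by the total frequency.
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Variance: frequency-weighted squared deviations divided by the total frequency.
variance = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints)) / n

std_dev = math.sqrt(variance)
cv = std_dev / mean * 100  # coefficient of variation, in percent

print(f"mean = {mean:.2f}, variance = {variance:.4f}")
print(f"standard deviation = {std_dev:.3f}, CV = {cv:.2f}%")
```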
Q. 2 (a) What is a linear regression model? Explain the
assumptions underlying the linear regression model.
(b) Define the term correlation. Find the correlation coefficient between
X and Y, given
X 78 89 97 69 59 79 68 61
Y 125 137 156 112 107 136 123 108
Linearity:
The relationship between the independent variables and the dependent variable
is assumed to be linear. This means that the change in the dependent variable is
proportional to the change in the independent variables.
Independence:
The observations used to build the model are assumed to be independent of each
other. This assumption is important because if observations are not independent,
it can lead to biased or inefficient estimates.
Homoscedasticity:
The variance of the errors (residuals) should be constant across all levels of the
independent variables. In other words, the spread of the residuals should be
consistent throughout the range of the predictor variables.
No or little multicollinearity:
The independent variables should not be highly correlated with each other. High
multicollinearity can make it difficult to determine the individual effects of the
independent variables on the dependent variable.
Normality of residuals:
The residuals are assumed to follow a normal distribution. This assumption is
necessary for conducting hypothesis tests, constructing confidence intervals,
and obtaining reliable statistical measures such as p-values.
No endogeneity:
The error term should be uncorrelated with the independent variables.
Endogeneity can arise, for example, from a two-way causal relationship between
the dependent variable and one or more of the independent variables, or from an
omitted variable, and it leads to biased and inconsistent estimates.
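To make the model concrete, a simple linear regression with one independent variable can be fitted by least squares. The sketch below uses hypothetical data and the standard least-squares formulas for the slope and intercept; it is an illustration, not part of the assignment data.

```python
# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates for the model y = intercept + slope * x + error.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Residuals are the observed values minus the fitted values; plotting them
# against x is a common informal check of linearity and constant variance.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print("residuals:", [round(e, 3) for e in residuals])
```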
(b) Define the term correlation. Find the correlation coefficient between
X and Y, given
X 78 89 97 69 59 79 68 61
Y 125 137 156 112 107 136 123 108
To find the correlation coefficient between variables X and Y, we can use the
Pearson correlation coefficient formula. The Pearson correlation coefficient,
often denoted as "r," ranges from -1 to 1. A positive value indicates a positive
correlation, a negative value indicates a negative correlation, and a value close
to 0 suggests little or no linear correlation.
Let's calculate the correlation coefficient between X and Y using the given data:
Mean of X (x̄):
x̄ = (78 + 89 + 97 + 69 + 59 + 79 + 68 + 61) / 8
= 600 / 8
= 75
Mean of Y (ȳ):
ȳ = (125 + 137 + 156 + 112 + 107 + 136 + 123 + 108) / 8
= 1004 / 8
= 125.5
Next, we calculate the deviations from the means for each value of X and Y:
Deviation of X (x - x̄):
78 - 75 = 3
89 - 75 = 14
97 - 75 = 22
69 - 75 = -6
59 - 75 = -16
79 - 75 = 4
68 - 75 = -7
61 - 75 = -14
Deviation of Y (y - ȳ):
125 - 125.5 = -0.5
137 - 125.5 = 11.5
156 - 125.5 = 30.5
112 - 125.5 = -13.5
107 - 125.5 = -18.5
136 - 125.5 = 10.5
123 - 125.5 = -2.5
108 - 125.5 = -17.5
Now, we multiply the deviations of X and Y for each data point:
(3)(-0.5) = -1.5
(14)(11.5) = 161
(22)(30.5) = 671
(-6)(-13.5) = 81
(-16)(-18.5) = 296
(4)(10.5) = 42
(-7)(-2.5) = 17.5
(-14)(-17.5) = 245
Next, we square the deviations of X:
(3)^2 = 9
(14)^2 = 196
(22)^2 = 484
(-6)^2 = 36
(-16)^2 = 256
(4)^2 = 16
(-7)^2 = 49
(-14)^2 = 196
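The remaining steps (squaring the deviations of Y and combining the sums) can be sketched in code. The block below assumes the Pearson formula r = Σ(x - x̄)(y - ȳ) / √(Σ(x - x̄)^2 * Σ(y - ȳ)^2) and takes the fourth Y value as 112, as in the mean of Y computed above.

```python
import math

x = [78, 89, 97, 69, 59, 79, 68, 61]
y = [125, 137, 156, 112, 107, 136, 123, 108]  # 112 as used in the mean of Y

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sum of products of deviations and sums of squared deviations.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Pearson correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"sum of products = {s_xy}, sum of squares of X = {s_xx}, of Y = {s_yy}")
print(f"r = {r:.3f}")
```

With these data the sums are 1512, 1242 and 2010, giving r ≈ 0.957, a strong positive linear correlation.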
(b) How many dice must be thrown so that the probability of obtaining
at least one 6 is at least 0.99?
(c) From a group of 6 men and 8 women, 5 people are chosen at random.
Find the probability that there are more men chosen than women.
Ans: (a) The terms are defined as follows:
4. Sample Space: The sample space is the set of all possible outcomes of an
experiment. It includes every possible result that can occur. It is denoted by the
symbol Ω or S.
8. Sure Event: A sure event, also known as a certain event, is an event that is
guaranteed to occur. It includes the entire sample space and represents the
outcome that is certain to happen. It is denoted by the symbol Ω or S.
9. Mutually Exclusive Events: Mutually exclusive events are events that cannot
occur simultaneously. If one event happens, the other event(s) cannot occur at
the same time. In other words, the occurrence of one event excludes the
occurrence of the other event(s).
(b) How many dice must be thrown so that the probability of obtaining
at least one 6 is at least 0.99?
1 - (5/6)^n ≥ 0.99
(5/6)^n ≤ 0.01
n * log(5/6) ≤ log(0.01)
n ≥ log(0.01) / log(5/6) (the inequality reverses because log(5/6) is negative)
n ≥ 25.26
Since we cannot have a fraction of a die, the minimum number of dice required
is 26 to ensure a probability of at least 0.99 of obtaining at least one 6.
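The result can be checked numerically. This is a small sketch, assuming fair six-sided dice thrown independently; it computes the bound from the logarithm formula and confirms it by direct search.

```python
import math

target = 0.99        # required probability of at least one 6
p_no_six = 5 / 6     # probability that a single die does not show a 6

# Closed form: smallest integer n with (5/6)^n <= 1 - target.
n_formula = math.ceil(math.log(1 - target) / math.log(p_no_six))

# Direct search: increase n until P(at least one 6) reaches the target.
n = 1
while 1 - p_no_six ** n < target:
    n += 1

print("from the formula:", n_formula)
print("from direct search:", n)
print("P(at least one 6) with n dice:", round(1 - p_no_six ** n, 4))
```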
(c) From a group of 6 men and 8 women, 5 people are chosen at random.
Find the probability that there are more men chosen than women.
To find the probability that there are more men chosen than women from a
group of 6 men and 8 women when 5 people are chosen at random, we need to
consider the different possible scenarios.
The total number of ways to choose 5 people from the total group of 14 (6 men
+ 8 women) is given by the combination formula: C(14, 5) = 14! / (5! * (14 -
5)!) = 2002.
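Case 1: Selecting 3 men and 2 women:
The number of ways to choose 3 men from 6 men is C(6, 3) = 20, and the number
of ways to choose 2 women from 8 women is C(8, 2) = 28.
Therefore, the probability of selecting 3 men and 2 women is: (20 * 28) / 2002
= 560 / 2002 = 0.2797 (approximately).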
Case 2: Selecting 4 men and 1 woman:
The number of ways to choose 4 men from 6 men is given by the combination
formula: C(6, 4) = 6! / (4! * (6 - 4)!) = 15.
Similarly, the number of ways to choose 1 woman from 8 women is given by
the combination formula: C(8, 1) = 8! / (1! * (8 - 1)!) = 8.
Therefore, the probability of selecting 4 men and 1 woman is: (15 * 8) / 2002 =
120 / 2002 = 0.0599 (approximately).
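Case 3: Selecting 5 men and no women:
The number of ways to choose 5 men from 6 men is C(6, 5) = 6, and there is
C(8, 0) = 1 way to choose no women. Therefore, the probability of selecting 5
men and no women is: (6 * 1) / 2002 = 6 / 2002 = 0.0030 (approximately).
To find the probability that there are more men chosen than women, we sum up
the probabilities from all three cases:
(560 + 120 + 6) / 2002 = 686 / 2002 ≈ 0.3427
So, the probability that there are more men chosen than women is
approximately 0.3427 or 34.27%.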
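The counting argument can be verified with a short script. The sketch below assumes every group of 5 is equally likely and sums over all selections in which men outnumber women (3, 4 or 5 men); the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

men, women, chosen = 6, 8, 5

# Total number of ways to choose 5 people from 14.
total = comb(men + women, chosen)

# Count selections with m men and (chosen - m) women where m exceeds the
# number of women, i.e. m = 3, 4 or 5.
favourable = sum(
    comb(men, m) * comb(women, chosen - m)
    for m in range(chosen + 1)
    if m > chosen - m
)

probability = Fraction(favourable, total)

print("favourable selections:", favourable, "out of", total)
print("probability:", probability, "≈", round(float(probability), 4))
```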
Ans: Sampling techniques are used in research and data collection to select a
subset of individuals or elements from a larger population. Each sampling
technique has its own advantages and disadvantages, which should be carefully
considered to ensure the reliability and validity of the collected information.
Here are some common types of sampling techniques and their respective pros
and cons:
Stratified Sampling:
Advantages: The population is divided into homogeneous groups (strata), and a
proportional number of individuals are selected from each stratum. It ensures
representation from all subgroups, resulting in more precise estimates for each
subgroup.
Disadvantages: Proper stratification requires accurate prior knowledge about the
population characteristics, which may not always be available. It can be
challenging to classify individuals into mutually exclusive strata.
Cluster Sampling:
Advantages: The population is divided into clusters (e.g., geographical areas),
and a random selection of clusters is made. It is cost-effective and practical
when it is difficult to obtain a complete list of the population.
Disadvantages: There is a risk of high within-cluster similarity, which may
reduce the representativeness. The selection of clusters and subsequent sampling
within clusters can introduce bias.
Systematic Sampling:
Advantages: Every nth individual is selected from a population list after a
random start. It is less time-consuming than simple random sampling and can
still provide a representative sample if the list is randomized.
Disadvantages: If there is a pattern or periodicity in the list, systematic sampling
may introduce bias. It may also miss certain population characteristics if they
are related to the periodicity.
Convenience Sampling:
Advantages: It is convenient and easily accessible, requiring minimal effort. It
can be useful for exploratory or qualitative research when generalizability is not
a primary concern.
Disadvantages: The sample may not be representative of the population, as
individuals are selected based on their availability or accessibility. There is a
high risk of selection bias, limiting the generalizability of the findings.
Snowball Sampling:
Advantages: It is useful when the population is hard to reach or hidden, such as
marginalized or stigmatized groups. Existing participants refer additional
participants, creating a network of connections.
Disadvantages: It may lead to biased samples, as the initial participants may not
accurately represent the target population. The sample size may also be small,
limiting the generalizability of the findings.
These are just a few examples of sampling techniques, and researchers should
choose the most appropriate method based on their research objectives,
available resources, and population characteristics. It is important to
acknowledge and address the limitations and potential biases associated with
each technique to ensure the reliability and validity of the collected information.
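Two of the techniques above, systematic and stratified sampling, can be sketched in a few lines. The population, the stratum labels and the helper functions `systematic_sample` and `stratified_sample` are hypothetical and serve only to illustrate the selection rules.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100 units: 60 "urban" and 40 "rural".
population = [
    {"id": i, "stratum": "urban" if i < 60 else "rural"} for i in range(100)
]

def systematic_sample(units, k):
    """Select every k-th unit after a random start among the first k units."""
    start = random.randrange(k)
    return units[start::k]

def stratified_sample(units, key, fraction):
    """Draw the same fraction at random from each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        size = round(len(members) * fraction)
        sample.extend(random.sample(members, size))
    return sample

systematic = systematic_sample(population, k=10)
stratified = stratified_sample(population, key="stratum", fraction=0.1)

print("systematic sample ids:", [u["id"] for u in systematic])
print("stratified sample ids:", [u["id"] for u in stratified])
```

The stratified draw keeps the 60:40 split of the population in the sample, whereas the systematic draw depends on the order of the list.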
The sampling distribution of the means, also known as the sampling distribution
of the sample mean, is the distribution of the sample means obtained from all
possible samples of a given size drawn from the population.
Here are some key properties of the sampling distribution of the means:
2. Mean: The mean of the sampling distribution of the means is equal to the
population mean. In other words, the average of all possible sample means is
the same as the mean of the population.
4. Shape: As mentioned earlier, for large sample sizes, the sampling distribution
of the means approaches a normal distribution, even if the population
is not normally distributed. For smaller sample sizes, the
distribution may still exhibit some skewness, but it becomes more symmetrical
and bell-shaped as the sample size increases.
5. Independence: The samples drawn for the calculation of sample means are
assumed to be independent of each other. This assumption is usually met when
samples are collected through simple random sampling or other randomization
methods.
The properties of the sampling distribution of the means play a crucial role in
statistical inference. They allow us to make statements about the likelihood of
observing certain sample means and help us estimate population parameters
with a certain level of confidence.
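A small simulation illustrates the mean property. The sketch assumes a skewed (exponential) population with mean 1; the sample size, the number of samples and the seed are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

population_mean = 1.0   # mean of the exponential population used below
sample_size = 30
num_samples = 2000

# Draw many independent samples and record the mean of each one.
sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(sample_size)]
    sample_means.append(statistics.fmean(sample))

# The average of the sample means should be close to the population mean.
mean_of_means = statistics.fmean(sample_means)

print("population mean:", population_mean)
print("mean of sample means:", round(mean_of_means, 3))
```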
Q. 5 (a) Explain how you test the hypothesis on proportions?
(b) A thousand households are taken at random and divided into three
groups A, B and C, according to the total monthly income. The following
table shows the numbers in each group having a colour television receiver,
a black and white receiver, or no television at all.
Attributes          A     B     C
Colour TV           56    51    93
Black & White TV    118   207   375
None                26    42    32
Test the hypothesis that there is no association between total income
and television ownership.
Ans: (a) Explain how you test the hypothesis on proportions
For example:
- H₀: The population proportion is equal to a specific value.
- H₁: The population proportion is not equal to the specific value.
Make a decision:
Compare the test statistic (or p-value) to the critical value(s). If the test statistic
falls within the critical region or the p-value is smaller than the significance
level, reject the null hypothesis in favor of the alternative hypothesis.
Otherwise, fail to reject the null hypothesis.
It's important to note that this is a general outline of the process, and the specific
calculations or statistical tests may vary slightly depending on the context or
any additional assumptions made.
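The decision rule can be illustrated with a one-sample test for a proportion. The sketch below assumes the usual large-sample z statistic, z = (p̂ - p₀) / √(p₀(1 - p₀) / n), and a two-sided alternative; all figures are hypothetical.

```python
import math

# Hypothetical figures, used only to illustrate the procedure.
successes = 60   # sample units that have the attribute
n = 100          # sample size
p0 = 0.5         # proportion stated in H0
alpha = 0.05     # significance level

p_hat = successes / n

# Standard error of the sample proportion, computed under H0.
standard_error = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / standard_error

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

reject_h0 = p_value < alpha

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```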
(b) A thousand households are taken at random and divided into three
groups A, B and C, according to the total monthly income.
The following table shows the numbers in each group having a colour television
receiver, a black and white receiver, or no television at all.
Attributes          A     B     C
Colour TV           56    51    93
Black & White TV    118   207   375
None                26    42    32
Test the hypothesis that there is no association between total income and
television ownership.
To test the hypothesis that there is no association between total income and
television ownership, we can use the chi-square test of independence. This test
assesses whether there is a significant relationship between two categorical
variables.
Using the given data, we can set up the observed and expected frequency tables
as follows:
Observed Frequencies:
Attributes          A     B     C     Total
Colour TV           56    51    93    200
Black & White TV    118   207   375   700
None                26    42    32    100
Total               200   300   500   1000
Expected Frequencies:
Attributes          A                          B                          C                           Total
Colour TV           (200 * 200) / 1000 = 40    (200 * 300) / 1000 = 60    (200 * 500) / 1000 = 100    200
Black & White TV    (700 * 200) / 1000 = 140   (700 * 300) / 1000 = 210   (700 * 500) / 1000 = 350    700
None                (100 * 200) / 1000 = 20    (100 * 300) / 1000 = 30    (100 * 500) / 1000 = 50     100
Total               200                        300                        500                         1000
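Using the observed (O) and expected (E) frequencies, we can calculate the
chi-square statistic:
χ^2 = Σ (O - E)^2 / E
= (56 - 40)^2 / 40 + (51 - 60)^2 / 60 + (93 - 100)^2 / 100
+ (118 - 140)^2 / 140 + (207 - 210)^2 / 210 + (375 - 350)^2 / 350
+ (26 - 20)^2 / 20 + (42 - 30)^2 / 30 + (32 - 50)^2 / 50
= 6.400 + 1.350 + 0.490 + 3.457 + 0.043 + 1.786 + 1.800 + 4.800 + 6.480
≈ 26.61
Next, we need to determine the degrees of freedom for the test. The degrees of
freedom can be calculated using the formula:
df = (number of rows - 1) * (number of columns - 1) = (3 - 1) * (3 - 1) = 4
The critical value and the final conclusion depend on the chosen significance
level (α). At α = 0.05 with 4 degrees of freedom, the critical value is 9.488.
Since 26.61 > 9.488, we reject the hypothesis of no association between total
income and television ownership.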
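The whole test can be reproduced from the observed table. This is a minimal sketch in plain Python; the 5% significance level is an assumption, and 9.488 is the tabulated chi-square critical value for 4 degrees of freedom at that level.

```python
# Observed frequencies: rows are television ownership, columns are groups A, B, C.
observed = [
    [56, 51, 93],      # Colour TV
    [118, 207, 375],   # Black & White TV
    [26, 42, 32],      # None
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Tabulated critical value for 4 degrees of freedom at alpha = 0.05 (assumed level).
critical_value = 9.488
reject_h0 = chi_square > critical_value

print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}")
print("reject H0" if reject_h0 else "fail to reject H0")
```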