Assignment of Biostatistics
Q1. Differentiate between data and information [2]. Define variable and its different types, levels of
measurements and give suitable examples [1+3+1+3].
Ans.
Data are raw facts or figures, recorded as individual units, that do not carry any specific meaning on their own. Information is
data that has been processed or organised so that it collectively carries a logical meaning. Data does not depend on information, but information
depends on data.
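A variable is a characteristic that can take different values from one individual or observation to another. Variables are of two broad types:
Quantitative/Numerical: values are numbers obtained by counting or measuring, for example number of children or body weight.
Qualitative/Categorical: values are categories or labels, for example blood group or sex.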
Levels of measurement:
Discrete variables: take countable values, usually whole numbers. Example: how old you are in completed years: 21, 23, 59, and so on. In practice the set of possible ages has an upper end (probably a little over 100).
Interval variables: numeric, with equal and meaningful intervals between values but no true zero. Example: temperature in degrees Celsius, where the difference between 20 and 30 degrees equals the difference between 30 and 40 degrees, but 0 degrees does not mean “no temperature”.
Ratio variables: interval variables with a meaningful zero. For example, 0 pounds means that you weigh nothing, so 100 pounds is twice as heavy as 50 pounds.
Nominal (categorical) variables: can be placed into categories that have no natural order, such as blood group (A, B, AB, O).
Ordinal (ranked) variables: have categories with a natural order, like 1st, 2nd, 3rd, or age bands such as “under 10” and “65 or older”.
Q2. Define Population and Sample in Statistics and explain how they differ from each other [1+1+2]. Explain
what is meant by descriptive statistics and inferential statistics with suitable examples [2+2].
Ans.
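A population is the entire group that you want to draw conclusions about. A sample is the specific group, drawn from that population, that
you will collect data from. Because a sample is a subset of the population, it is almost always smaller than the population, and statistics computed from it are used to estimate the corresponding population values.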
Descriptive statistics describes data (for example, a chart or graph) and inferential statistics allows you to
make predictions (“inferences”) from that data. With inferential statistics, you take data from samples and
make generalizations about a population.
For example, you might stand in a mall and ask a sample of 100 people if they like shopping at Sears. You could
make a bar chart of yes or no answers (that would be descriptive statistics) or you could use your research
(and inferential statistics) to reason that around 75-80% of the population (all shoppers in all malls) like
shopping at Sears.
Let’s say you have some sample data about a potential new cancer drug. You could use descriptive statistics to
describe your sample, including:
Sample mean
Sample standard deviation
Making a bar chart or boxplot
Describing the shape of the sample probability distribution
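The descriptive summaries listed above can be sketched in Python with the standard `statistics` module. The measurements are hypothetical values chosen only for illustration; they are not data from any real drug study.

```python
import statistics

# Hypothetical measurements from a small sample (illustrative values only).
sample = [4.1, 5.0, 5.3, 5.8, 6.2, 6.4, 7.0, 7.9]

sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample SD, divides by n - 1

# Quartiles plus minimum and maximum: the numbers a boxplot is drawn from.
q1, q2, q3 = statistics.quantiles(sample, n=4)
five_number_summary = (min(sample), q1, q2, q3, max(sample))

print(f"mean = {sample_mean:.2f}, SD = {sample_sd:.2f}")
print("five-number summary:", five_number_summary)
```

These numbers only describe the sample itself. Saying anything about the wider population would need inferential methods.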
Q3. We have learned about measures of central tendency [MCT], mainly, Mean, Median and Mode. Explain
their uses [1], strengths [1] and limitations [1]. Also mention the different type of variables associated with
the preferred use of different type of MCT with examples.
Ans.
The mode is the most commonly occurring value in a distribution.
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data.
Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2
The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
The mode has an advantage over the median and the mean as it can be found for both numerical and
categorical (non-numerical) data.
There are some limitations to using the mode. In some distributions, the mode may not reflect the centre of the
distribution very well. When the distribution of retirement age is ordered from lowest to highest value, it is
easy to see that the centre of the distribution is 57 years, but the mode is lower, at 54 years.
It is also possible for there to be more than one mode for the same distribution of data (bi-modal or multi-modal).
The presence of more than one mode can limit the ability of the mode to describe the centre or
typical value of the distribution because a single value to describe the centre cannot be identified.
In some cases, particularly where the data are continuous, the distribution may have no mode at all (i.e., if all
values are different).
In cases such as these, it may be better to consider using the median or mean, or to group the data into
appropriate intervals and find the modal class.
The median is the middle value in a distribution when the values are arranged in ascending or descending order.
The median divides the distribution in half (there are 50% of observations on either side of the median value).
In a distribution with an odd number of observations, the median value is the middle value.
Looking at the retirement age distribution (which has 11 observations), the median is the middle (6th) value, which
is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the mean of the two middle
values. For example, if the two middle values are 56 and 57, the median equals 56.5
years.
The median is less affected by outliers and skewed data than the mean, and is usually the preferred measure
of central tendency when the distribution is not symmetrical.
The median cannot be identified for categorical nominal data, as it cannot be logically ordered.
The mean is the sum of the value of each observation in a dataset divided by the number of observations. This
is also known as the arithmetic average.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean is calculated by adding together all the values (54+54+54+55+56+57+57+58+58+60+60 = 623) and
dividing by the number of observations (11) which equals 56.6 years.
The mean can be used for both continuous and discrete numeric data.
The mean cannot be calculated for categorical data, as the values cannot be summed.
As the mean includes every value in the distribution the mean is influenced by outliers and skewed
distributions.
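As a quick check of the three measures discussed above, the following sketch recomputes the mode, median, and mean of the retirement age data using Python's standard `statistics` module. The variable names are illustrative.

```python
import statistics
from collections import Counter

# Retirement ages of the 11 people, in whole years (from the example above).
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

frequency = Counter(ages)            # frequency distribution: age -> count
mode_age = statistics.mode(ages)     # most commonly occurring value
median_age = statistics.median(ages) # middle value of the ordered data
mean_age = statistics.mean(ages)     # sum of values / number of values

print("frequencies:", dict(frequency))
print("mode:", mode_age)
print("median:", median_age)
print("mean:", round(mean_age, 1))
```

Here the mode sits below the median and mean, which illustrates why the mode may not reflect the centre of the distribution.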
(3) What do you mean by skewness of a given data set [1]? What are the different types of skewness [2]? What is
the relationship between the MCT for given symmetrical data [1]?
Ans.
Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal
distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed.
Types of Skewness
Broadly speaking, there are two types of skewness: They are (1) Positive skewness and (2) Negative skewness.
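Positive skewness
A series is said to have positive skewness when the following characteristics are noticed:
The right-hand tail of the distribution is longer than the left-hand tail.
Typically, the mean is greater than the median, which is greater than the mode.
Negative skewness
A series is said to have negative skewness when the following characteristics are noticed:
The left-hand tail of the distribution is longer than the right-hand tail.
Typically, the mean is less than the median, which is less than the mode.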
When you have a symmetrical distribution for continuous data, the mean, median, and mode are equal. In this
case, analysts tend to use the mean because it includes all of the data in the calculations. However, if we have
a skewed distribution, the median is often the best measure of central tendency.
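The direction of skewness can be checked numerically. The sketch below uses the moment coefficient of skewness (the mean cubed deviation divided by the cube of the population standard deviation), which is one common definition among several. The data sets are hypothetical.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: mean cubed deviation / (population SD)^3."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    return sum((x - mean) ** 3 for x in values) / (n * sd ** 3)

# Hypothetical data sets, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]  # long right-hand tail
left_skewed = [-x for x in right_skewed]        # mirror image, long left-hand tail

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name,
          "skewness:", round(skewness(data), 3),
          "mean:", statistics.fmean(data),
          "median:", statistics.median(data))
```

In the right-skewed data the single large value pulls the mean above the median, while the median stays close to the bulk of the observations.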
Q4. We have also learned about measures of dispersion [MD], mainly, Range, variance, and standard
deviation.
Why is standard deviation the preferred measure of dispersion over variance, though it is calculated
from variance [1]?
What is standard error in statistics [1], and how is it useful in the estimation of a parameter [2]?
Ans.
1) Merits of Range
Easy to calculate
Easy to understand
Merits of Variance
Squaring the deviations overcomes the drawback of ignoring signs in mean deviations
2) Variance is the square of the standard deviation. Being a squared term, it is non-negative, and its unit
is the square of the unit of the data, which makes it less intuitive. Standard deviation is preferred over
variance because it is expressed in the same unit as the data and can therefore be compared directly with the mean.
3) A parameter is a number describing a whole population (e.g., population mean), while a statistic is a number
describing a sample (e.g., sample mean).
The goal of quantitative research is to understand characteristics of populations by finding parameters. In
practice, it’s often too difficult, time-consuming or unfeasible to collect data from every member of a
population. Instead, data is collected from samples.
4) A sampling distribution is the probability distribution of a statistic obtained from a large number of samples
of the same size drawn from a specific population. It shows the range of values the statistic could take
and how often each value would occur across repeated samples.
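5) The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard deviation of its
sampling distribution or an estimate of that standard deviation. For the sample mean, SE = σ/√n, estimated by s/√n, where s is the sample standard deviation and n is the sample size. A smaller standard error means a more precise estimate of the parameter. The standard error falls as the sample size increases, because
sample means cluster more closely around the population mean.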
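A minimal sketch of the standard error of the mean, assuming the usual estimate s/√n (sample standard deviation divided by the square root of the sample size). It reuses the retirement age data from Q3; the function names are illustrative.

```python
import math
import statistics

def standard_error_of_mean(values):
    """Estimated SE of the sample mean: s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

def se_from_sd(sd, n):
    """SE of the mean for a given standard deviation and sample size."""
    return sd / math.sqrt(n)

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
se = standard_error_of_mean(ages)
print(f"estimated SE of the mean: {se:.3f} years")

# With the same spread, quadrupling the sample size halves the standard error.
print(se_from_sd(10, 100), se_from_sd(10, 400))
```

The second print shows why larger samples give more precise estimates: the standard error shrinks in proportion to 1/√n.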
Q5. We have discussed the types of random variables associated with different probability distributions, namely, Binomial
Distribution [BD], Poisson Distribution [PD] and Normal Distribution [ND].
1. Please mention the type of random variable associated with BD, PD and ND [1.5].
2. What is the relationship between mean and variance in BD, PD, ND [1.5]?
3. What is a standardized normal variate [0.5] and what is a test statistic [0.5]?
4. Why is the Normal Distribution important in statistics? Mention its 2 key properties [0.5+0.5].
Ans.
1) A random variable is a numerical description of the outcome of a statistical experiment. A random variable
that may assume only a finite number or an infinite sequence of values is said to be discrete; one that may
assume any value in some interval on the real number line is said to be continuous. For instance, a random
variable representing the number of automobiles sold at a particular dealership on one day would be discrete,
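while a random variable representing the weight of a person in kilograms (or pounds) would be continuous. The binomial and Poisson distributions describe discrete random variables (counts), while the normal distribution describes a continuous random variable.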
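2) For a binomial distribution with n trials and success probability p (with q = 1 − p), the mean of the number of successes is np and the variance is npq, so the variance is smaller than the mean. Expressed as a sample proportion, the mean is p and the variance is pq/n. For a Poisson distribution the mean and variance are equal (both λ). For a normal distribution the mean µ and variance σ² are separate parameters, so neither determines the other.
Moreover, for reasonable sample sizes and for values of p between about 0.20 and 0.80, the binomial distribution is
roughly normally distributed.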
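The mean and variance relationships can be verified directly from the probability mass functions. This sketch assumes the binomial variable is the number of successes in n trials, and uses arbitrary illustrative parameters (n = 10, p = 0.3, λ = 4). The Poisson sum is truncated at 60 terms, where the remaining probability is negligible for this λ.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def mean_and_variance(pmf_values):
    """Mean and variance of a distribution on 0, 1, 2, ... given its pmf values."""
    mean = sum(k * pk for k, pk in enumerate(pmf_values))
    variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf_values))
    return mean, variance

n, p = 10, 0.3  # illustrative parameters
binom_mean, binom_var = mean_and_variance([binomial_pmf(k, n, p) for k in range(n + 1)])
print("binomial:", round(binom_mean, 6), round(binom_var, 6), "vs np, npq:", n * p, n * p * (1 - p))

lam = 4.0       # illustrative parameter
pois_mean, pois_var = mean_and_variance([poisson_pmf(k, lam) for k in range(60)])
print("Poisson:", round(pois_mean, 6), round(pois_var, 6), "vs lambda:", lam)
```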
3) A standard normal variate (SNV) is a normal variate with mean µ = 0 and
standard deviation σ = 1. Any normal variate X with mean µ and standard deviation σ is standardized by $Z = (X - \mu)/\sigma$, and the probability density function of Z is $\varphi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^{2}/2}$. The probability that the variate takes
a value between 0 and z is given by the area under this curve between 0 and z.
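A short sketch of standardization, z = (x − µ)/σ, and of the area under the standard normal curve between 0 and z, using `statistics.NormalDist` from the Python standard library. The values x = 130, µ = 100, σ = 15 are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x: number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

standard_normal = NormalDist(mu=0, sigma=1)

def area_between_zero_and(z):
    """P(0 < Z < z) for a standard normal variate Z, with z >= 0."""
    return standard_normal.cdf(z) - standard_normal.cdf(0)

z = z_score(x=130, mu=100, sigma=15)  # hypothetical values
print("z =", z)
print("P(0 < Z < z) =", round(area_between_zero_and(z), 4))
print("P(0 < Z < 1.96) =", round(area_between_zero_and(1.96), 4))
```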
The test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely your
observed data match the distribution expected under the null hypothesis of that statistical test.
The test statistic is used to calculate the p-value of your results, helping to decide whether to reject your null
hypothesis.
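As one concrete instance of a test statistic, the sketch below computes a one-sample z statistic and its two-sided p-value. It assumes the population standard deviation is known, and all numbers are hypothetical. Other tests (t, chi-square, F) use different statistics but follow the same logic.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return (z, two-sided p-value) for H0: population mean equals mu0.

    Assumes a known population standard deviation sigma.
    """
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical study: n = 100, sample mean 52, null value 50, sigma 10.
z, p_value = one_sample_z_test(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) would lead to rejecting the null hypothesis.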
4) As with any probability distribution, the normal distribution describes how the values of a variable are
distributed. It is the most important probability distribution in statistics because it accurately describes the
distribution of values for many natural phenomena. Characteristics that are the sum of many independent
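processes frequently follow normal distributions. For example, heights, blood pressure, measurement error,
and IQ scores approximately follow the normal distribution. Two key properties: the curve is symmetric and bell-shaped about the mean, so the mean, median, and mode are equal; and about 68%, 95%, and 99.7% of values lie within 1, 2, and 3 standard deviations of the mean, respectively.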
Q6. We have also discussed the different type of variables (categorical etc) and different type of statistical
test based on these types of variables. Please mention which type of variables you will use for these tests
[0.5 each]
Example:
Ans.
9) The dependent variable is dichotomous, and the independent variable can be categorical or interval level.
Q7. We have also discussed two types of error in statistical hypothesis testing. Explain them with examples
[1]. What is confidence interval estimation and its importance in statistical inferences [1]?
Ans.
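A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the
population; a type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is
actually false in the population. For example, in a drug trial where the null hypothesis is that the drug has no effect, a type I error is concluding that the drug works when it does not, and a type II error is concluding that there is no evidence it works when it actually does.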
A confidence interval provides a range of population values with which a sample statistic is consistent at a
given level of confidence (usually 95%). Conventional hypothesis testing serves only to either reject or retain a null
hypothesis, whereas a confidence interval also shows the likely size of the effect and the precision of the estimate.
Decision (based on sample)        Truth (for population studied)
                                  Null Hypothesis True    Null Hypothesis False
Reject Null Hypothesis            Type I Error            Correct Decision
Fail to reject Null Hypothesis    Correct Decision        Type II Error
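Returning to confidence interval estimation, the sketch below computes an approximate 95% confidence interval for a mean as mean ± z × SE, using the retirement age data from Q3. It assumes the normal approximation is acceptable. With a sample as small as 11, a t-based interval would be somewhat wider and is usually preferred.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(values, confidence=0.95):
    """Approximate CI for the population mean, using the normal approximation."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)        # estimated standard error
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2) # about 1.96 for 95%
    return mean - z_crit * se, mean + z_crit * se

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
lower, upper = mean_confidence_interval(ages)
print(f"95% CI for the mean retirement age: {lower:.2f} to {upper:.2f} years")

# A higher confidence level gives a wider interval.
lower99, upper99 = mean_confidence_interval(ages, confidence=0.99)
print(f"99% CI: {lower99:.2f} to {upper99:.2f} years")
```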