Module 3 - Assignment Rakesh Thakor
Statistical methods
• Statistical methods are mathematical formulas, models, and techniques that are used in
statistical analysis of raw research data.
• Statistical methods involved in carrying out a study include planning, designing, collecting
data, analysing, drawing meaningful interpretations and reporting the research findings.
• Statistical analysis gives meaning to otherwise meaningless numbers, thereby breathing life
into lifeless data.
• The results and inferences are precise only if proper statistical tests are used.
• The application of statistical methods extracts information from research data and provides
different ways to assess the robustness of research outputs.
Types of Statistics
Statistics is broadly categorised into two types:
1. Descriptive statistics
2. Inferential statistics
Descriptive Statistics
In this type of statistics, the data is summarised through the given observations. The
summarisation is done from a sample of a population using parameters such as the mean
or standard deviation.
Descriptive statistics is a way to organise, represent and describe a collection of data using
tables, graphs and summary measures. For example, counting the number of people in a city
who use the internet or television.
Descriptive statistics are further categorised into four different categories:
Measure of frequency
Measure of central tendency
Measure of dispersion
Measure of position
The frequency measurement displays the number of times a particular data value occurs.
Central tendencies are the mean, median and mode of the data. Range, variance and
standard deviation are measures of dispersion; they identify the spread of the data. The
measure of position describes percentile and quartile ranks.
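The sketch below computes one example of each category of measure with Python's
standard library; the marks data is hypothetical.

```python
# A minimal sketch of the four groups of descriptive measures,
# using hypothetical exam marks.
import statistics
from collections import Counter

marks = [72, 85, 85, 60, 91, 78, 85, 66, 72, 88]  # hypothetical data

# Measure of frequency: how often each value occurs
frequency = Counter(marks)

# Measures of central tendency
mean = statistics.mean(marks)
median = statistics.median(marks)
mode = statistics.mode(marks)

# Measures of dispersion: the spread of the data
data_range = max(marks) - min(marks)
variance = statistics.variance(marks)  # sample variance
std_dev = statistics.stdev(marks)      # sample standard deviation

# Measure of position: quartiles (quantiles with n=4)
q1, q2, q3 = statistics.quantiles(marks, n=4)

print(frequency, mean, median, mode, data_range, variance, std_dev, q1, q2, q3)
```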
Inferential Statistics
This type of statistics is used to interpret the meaning of descriptive statistics. Once the
data has been collected, analysed and summarised, we use these statistics to describe the
meaning of the collected data. In other words, it is used to draw conclusions from data that
depend on random variations such as observational errors, sampling variation, etc.
Inferential statistics is a method that allows us to use information collected from a sample to
make decisions, predictions or inferences about a population. It allows us to make
statements that go beyond the available data or information. For example, deriving
estimates from hypothetical research.
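As a minimal sketch of sample-to-population inference, the following computes a point
estimate and a 95% confidence interval for a population mean using scipy; the sample
values are hypothetical.

```python
# Inferring a population mean from a sample: point estimate + 95% CI.
import numpy as np
from scipy import stats

sample = np.array([88, 92, 79, 85, 90, 84, 87, 91, 83, 86])  # hypothetical sample

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"Point estimate: {mean:.2f}, 95% CI: {ci}")
```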
Statistics Example
In a class, the collection of marks obtained by 50 students is the description of data. When
we take the mean of the data, the result is the average mark of the 50 students. If the
average mark obtained by the 50 students is 88 out of 100, then we can reach a conclusion
or give a judgment on the basis of that result.
2. Mention the types of probability distribution and explain them.
Answer:
A distribution in statistics is a function that shows the possible values for a variable and
how often they occur.
• The distribution of a statistical data set (or a population) is a listing or function showing
all the possible values (or intervals) of the data and how often they occur. When a
distribution of categorical data is organized, you see the number or percentage of
individuals in each group.
• The distribution of an event consists not only of the input values that can be observed,
but is made up of all possible values.
• The distribution in statistics is defined by the underlying probabilities and not the graph.
The graph is just a visual representation.
Binomial distribution
• The binomial distribution measures the probabilities of the number of successes over a
given number of trials with a specified probability of success in each try.
• In the simplest scenario of a coin toss (with a fair coin), where the probability of getting a
head with each toss is 0.50 and there are a hundred trials, the binomial distribution will
measure the likelihood of getting anywhere from no heads in a hundred tosses (very
unlikely) to 50 heads (the most likely) to 100 heads (also very unlikely).
• The binomial distribution in this case will be symmetric, reflecting the even odds; as the
probabilities shift from even odds, the distribution will get more skewed.
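A minimal sketch of these coin-toss probabilities using scipy's binom:

```python
# Probability of k heads in 100 tosses of a fair coin.
from scipy.stats import binom

n, p = 100, 0.5
print(binom.pmf(0, n, p))    # no heads: very unlikely
print(binom.pmf(50, n, p))   # 50 heads: the most likely single outcome
print(binom.pmf(100, n, p))  # all heads: very unlikely
```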
• Binomial distributions can be plotted for different scenarios: for example, two with a 50%
probability of success and one with a 70% probability of success, across different trial sizes.
As the probability of success moves away from 50%, the distribution shifts shape, becoming
positively skewed for probabilities less than 50% and negatively skewed for probabilities
greater than 50%.
• Negative binomial distribution: Returning to the coin toss example, assume that you hold
the number of successes fixed at a given number and estimate the number of tries you will
have before you reach the specified number of successes. The resulting distribution is
called the negative binomial and it very closely resembles the Poisson. In fact, the negative
binomial distribution converges on the Poisson distribution, but will be more skewed to the
right (positive values) than the Poisson distribution with similar parameters.
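A minimal sketch of the negative binomial using scipy's nbinom, which counts failures
before the r-th success (so total tries = failures + r); the parameter values are illustrative:

```python
# Distribution of the number of tries needed to reach r successes.
from scipy.stats import nbinom

r, p = 5, 0.5  # 5 successes with a fair coin
for failures in range(6):
    tries = failures + r
    print(tries, nbinom.pmf(failures, r, p))  # P(exactly this many tries)
```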
Poisson distribution
• The Poisson distribution measures the likelihood of a number of events occurring within
a given time interval, where the key parameter required is the average number of events in
the given interval (λ).
• The resulting distribution looks similar to the binomial, with the skewness being positive
but decreasing with λ.
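A minimal sketch using scipy's poisson; the average rate is illustrative:

```python
# Probability of k events in an interval with an average of lam (λ) events.
from scipy.stats import poisson

lam = 4.0  # hypothetical average number of events per interval
for k in range(9):
    print(k, poisson.pmf(k, lam))  # probability of exactly k events
```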
Geometric distribution
Consider again the coin toss example used to illustrate the binomial. Rather than focus on
the number of successes in n trials, assume that you were measuring the likelihood of
when the first success will occur. For instance, with a fair coin toss, there is a 50% chance
that the first success will occur on the first try, a 25% chance that it will occur on the
second try and a 12.5% chance that it will occur on the third try. The resulting geometric
distribution is steepest with high probabilities of success and flattens out as the probability
decreases. However, the distribution is always positively skewed.
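A minimal sketch reproducing the 50%/25%/12.5% figures with scipy's geom:

```python
# Probability that the first success occurs on try k, for a fair coin.
from scipy.stats import geom

p = 0.5
for k in (1, 2, 3):
    print(k, geom.pmf(k, p))  # 0.5, 0.25, 0.125
```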
Hypergeometric distribution
The hypergeometric distribution measures the probability of a specified number of
successes in n trials, without replacement, from a finite population. Since the sampling is
without replacement, the probabilities can change as a function of previous draws.
Consider, for instance, the possibility of getting four face cards in a hand of ten, over
repeated draws from a pack. Since there are 16 face cards and the total pack contains 52
cards, the probability of getting four face cards in a hand of ten can be estimated. The
hypergeometric distribution converges on the binomial distribution as the population size
increases.
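A minimal sketch of the card example using scipy's hypergeom:

```python
# Probability of exactly 4 face cards in a 10-card hand drawn without
# replacement from a 52-card pack containing 16 face cards.
from scipy.stats import hypergeom

M, n, N = 52, 16, 10               # population size, successes in population, draws
print(hypergeom.pmf(4, M, n, N))   # ≈ 0.22
```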
Continuous distributions
• With continuous data, we cannot specify all possible outcomes, since they are too
numerous to list, but we have two choices.
• The first is to convert the continuous data into a discrete form and then go through the
same process that we went through for discrete distributions to estimate probabilities.
• For instance, we could take a variable such as market share and break it down into
discrete blocks – market share between 3% and 3.5%, between 3.5% and 4% and so
on – and consider the likelihood that we will fall into each block.
• The second is to find a continuous distribution that best fits the data and to specify the
parameters of the distribution.
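A minimal sketch of the first choice, binning a hypothetical market-share series into
discrete blocks with numpy:

```python
# Converting a continuous variable into discrete blocks and estimating
# the empirical likelihood of each block. The data is hypothetical.
import numpy as np

market_share = np.array([3.1, 3.4, 3.6, 3.8, 4.2, 3.3, 3.9, 4.4, 3.7, 3.5])  # in %

bins = np.arange(3.0, 5.0, 0.5)  # blocks: 3.0-3.5%, 3.5-4.0%, 4.0-4.5%
counts, edges = np.histogram(market_share, bins=bins)
probabilities = counts / counts.sum()  # likelihood of falling in each block
print(list(zip(edges[:-1], edges[1:], probabilities)))
```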
3. Discuss in detail the types of statistical inference.
Answer:
There are different types of statistical inference that are extensively used for drawing
conclusions:
• Confidence Interval
• Pearson Correlation
• Bi-variate regression
• Multi-variate regression
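As a minimal sketch, two of these tools (a Pearson correlation and a bivariate, i.e. simple,
regression) can be computed with scipy; the x and y data are hypothetical:

```python
# Pearson correlation and bivariate regression on hypothetical data.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])  # hypothetical outcome

r, p_value = stats.pearsonr(x, y)  # correlation coefficient and p-value
result = stats.linregress(x, y)    # slope and intercept of the fitted line
print(r, p_value, result.slope, result.intercept)
```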
Statistical estimation
Statistical estimation is concerned with best estimating a value or range of values for a
particular population parameter, and hypothesis testing is concerned with deciding whether
the study data are consistent at some level of agreement with a particular population
parameter. We briefly describe statistical estimation and then devote the remainder of this
section to providing a conceptual overview of hypothesis testing.
There are two types of statistical estimation. The first type is point estimation, which
addresses what particular value of a parameter is most consistent with the data. The
second type is interval estimation, which addresses the range of parameter values that is
consistent with the data. Hypothesis testing then proceeds along these lines:
• Accumulate a sample (for example, of children) from the population and continue the
study.
• Conduct statistical tests to see if the collected sample properties are sufficiently different
from what would be expected under the null hypothesis to be able to reject the null
hypothesis.
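A minimal sketch of point estimation and a hypothesis test with scipy; the measurements
and the null value of 100 are hypothetical:

```python
# Point estimate plus a one-sample t-test against a null value.
import numpy as np
from scipy import stats

sample = np.array([102, 98, 110, 105, 97, 103, 108, 99])  # hypothetical data

point_estimate = sample.mean()  # point estimation of the population mean

# Hypothesis test: is the population mean different from 100?
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(point_estimate, t_stat, p_value)
# A small p-value would justify rejecting the null hypothesis.
```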
Importance
• It is widely used for making predictions about future observations in different fields.
• Statistical inference has a wide range of applications in different fields such as:
o Business Analysis
o Artificial Intelligence
o Financial Analysis
o Fraud Detection
o Machine Learning
o Share Market
o Pharmaceutical Sector
4. Discuss in detail the classes of models.
Answer:
Fixed-effects models
The fixed-effects model (class I) of analysis of variance applies to situations in which the
experimenter applies one or more treatments to the subjects of the experiment to see
whether the response variable values change. This allows the experimenter to estimate the
ranges of response variable values that the treatment would generate in the population as a
whole. The fixed-effects model assumes that the omitted effects of the model can be
arbitrarily correlated with the included variables. This is useful whenever you are only
interested in analysing the impact of variables that vary over time (the time effects).
FE models explore the relationship between predictor and outcome variables within an entity
(country, person, company, etc.). Each entity has its own individual characteristics that may
or may not influence the predictor variables (for example, being male or female could
influence the opinion toward a certain issue; the political system of a particular country
could have some effect on trade or GDP; or the business practices of a company may
influence its stock price).
When using FE we assume that something within the individual may impact or bias the
predictor or outcome variables and we need to control for this. This is the rationale behind
the assumption of correlation between the entity's error term and the predictor variables. FE
removes the effect of those time-invariant characteristics so we can assess the net
effect of the predictors on the outcome variable.
The FE regression model has n different intercepts, one for each entity. These intercepts
can be represented by a set of binary variables, and these binary variables absorb the
influences of all omitted variables that differ from one entity to the next but are constant over
time.
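A minimal sketch of this dummy-variable formulation using statsmodels; the entities,
variables and values are hypothetical:

```python
# Fixed effects via one binary (dummy) intercept per entity; C(entity)
# creates the dummies, absorbing time-invariant differences across entities.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "entity": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "x":      [1.0, 2.0, 3.0, 1.5, 2.5, 3.5, 1.2, 2.2, 3.2],
    "y":      [2.0, 3.1, 4.2, 4.8, 5.9, 7.1, 1.1, 2.0, 3.2],
})

fe_model = smf.ols("y ~ x + C(entity)", data=df).fit()
print(fe_model.params)  # slope on x plus one intercept shift per entity
```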
Random-effects models
Random-effects model (class II) is used when the treatments are not fixed. This occurs
when the various factor levels are sampled from a larger population. Because the levels
themselves are random variables, some assumptions and the method of contrasting the
treatments (a multivariable generalization of simple differences) differ from the fixed-effects
model. If the individual effects are strictly uncorrelated with the regressors it may be
appropriate to model the individual specific constant terms as randomly distributed across
cross-sectional units. This view would be appropriate if we believed that sampled cross-
sectional units were drawn from a large population.
If you have reason to believe that differences across entities have some influence on your
dependent variable then you should use random effects. In random-effects you need to
specify those individual characteristics that may or may not influence the predictor
variables. The problem with this is that some variables may not be available therefore
leading to omitted variable bias in the model.
An advantage of random effects is that you can include time-invariant variables (e.g.
gender). In the fixed-effects model these variables are absorbed by the intercept. The cost
is the possibility of inconsistent estimators, if the assumption is inappropriate.
Mixed-effects models
A mixed-effects model (class III) contains experimental factors of both fixed and random-
effects types, with appropriately different interpretations and analysis for the two types.
Example: Teaching experiments could be performed by a college or university department
to find a good introductory textbook, with each text considered a treatment. The fixed-
effects model would compare a list of candidate texts. The random-effects model would
determine whether important differences exist among a list of randomly selected texts. The
mixed-effects model would compare the (fixed) incumbent texts to randomly selected
alternatives.
Mixed-effects regression models are a powerful tool for linear regression when your
data contain both global and group-level trends. Robust frameworks for fitting them exist in
both R and Python; a small Python sketch follows.
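A minimal sketch using statsmodels' MixedLM, with a global fixed slope and a random
intercept per group; the data is simulated:

```python
# Mixed-effects model: one fixed (global) slope for x, plus a random
# intercept for each group. Groups and effect sizes are made up.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
groups = np.repeat(["g1", "g2", "g3", "g4"], 25)
x = rng.normal(size=100)
group_effect = {"g1": -1.0, "g2": 0.0, "g3": 1.0, "g4": 2.0}
y = 2.0 * x + np.array([group_effect[g] for g in groups]) \
    + rng.normal(scale=0.5, size=100)
df = pd.DataFrame({"y": y, "x": x, "group": groups})

mixed = smf.mixedlm("y ~ x", data=df, groups=df["group"]).fit()
print(mixed.summary())  # fixed slope near 2.0, plus group variance
```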
Analysis of variance
• The ANOVA is based on the law of total variance, where the observed variance in a
particular variable is partitioned into components attributable to different sources of
variation.
• In its simplest form, ANOVA provides a statistical test of whether two or more population
means are equal, and therefore generalizes the t-test beyond two means. ANOVA is a form
of statistical hypothesis testing heavily used in the analysis of experimental data.
• A test result (calculated from the null hypothesis and the sample) is called statistically
significant if it is deemed unlikely to have occurred by chance, assuming the truth of the null
hypothesis. A statistically significant result, when a probability (p-value) is less than a
pre-specified threshold (significance level), justifies the rejection of the null hypothesis, but
only if the a priori probability of the null hypothesis is not high.
• In the typical application of ANOVA, the null hypothesis is that all groups are random
samples from the same population. For example, when studying the effect of different
treatments on similar samples of patients, the null hypothesis would be that all treatments
have the same effect (perhaps none). Rejecting the null hypothesis is taken to mean that
the differences in observed effects between treatment groups are unlikely to be due to
random chance.
• ANOVA is the synthesis of several ideas and it is used for multiple purposes. It is difficult
to define concisely or precisely. "Classical" ANOVA for balanced data does three things at
once:
• As exploratory data analysis, an ANOVA employs an additive data decomposition, and its
sums of squares indicate the variance of each component of the decomposition (or,
equivalently, each set of terms of a linear model).
• Comparisons of mean squares, along with an F-test ... allow testing of a nested sequence
of models.
• Closely related to the ANOVA is a linear model fit with coefficient estimates and standard
errors.
ANOVA is a statistical tool used in several ways to develop and confirm an explanation for
the observed data. It is computationally elegant and relatively robust against violations of
its assumptions. ANOVA provides strong (multiple sample comparison) statistical analysis.
It has been adapted to the analysis of a variety of experimental designs. ANOVA "has long
enjoyed the status of being the most used (some would say abused) statistical technique in
psychological research" and "is probably the most useful technique in the field of statistical
inference". ANOVA is difficult to teach, particularly for complex experiments, with split-plot
designs being notorious.
One-way ANOVA
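A one-way ANOVA tests whether the means of two or more independent groups, which
differ on a single factor, are equal. A minimal sketch using scipy's f_oneway on hypothetical
group data:

```python
# One-way ANOVA across three hypothetical treatment groups.
from scipy.stats import f_oneway

group_a = [85, 86, 88, 75, 78, 94, 98]
group_b = [91, 92, 93, 85, 87, 84, 82]
group_c = [79, 78, 88, 94, 92, 85, 83]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # small p-value: group means unlikely to be equal
```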
Factorial ANOVA
• ANOVA generalizes to the study of the effects of multiple factors.
• When the experiment includes observations at all combinations of levels of each factor, it
is termed factorial.
• Factorial experiments are more efficient than a series of single-factor experiments and the
efficiency grows as the number of factors increases. Consequently, factorial designs are
heavily used.
• A variety of techniques are used with multiple-factor ANOVA to reduce expense.
• One technique used in factorial designs is to minimize replication (possibly no replication
with support of analytical trickery) and to combine groups when effects are found to be
statistically (or practically) insignificant.
• An experiment with many insignificant factors may collapse into one with a few factors
supported by many replications.
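A minimal sketch of a two-factor (factorial) ANOVA with an interaction term, using
statsmodels; the factors and responses are hypothetical:

```python
# Two-factor ANOVA: observations at all combinations of two factor levels,
# with F-tests for each factor and their interaction.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "dose":     ["low", "low", "high", "high"] * 4,
    "method":   ["oral", "iv"] * 8,
    "response": [4.1, 5.0, 6.2, 7.9, 3.8, 5.2, 6.0, 8.1,
                 4.3, 4.9, 6.5, 7.7, 4.0, 5.1, 6.1, 8.0],
})

model = smf.ols("response ~ C(dose) * C(method)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # ANOVA table for the factorial design
```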
Repeated measures ANOVA
• Repeated measures design is a research design that involves multiple measures of the
same variable taken on the same or matched subjects either under different conditions or
over two or more time periods.
• For instance, repeated measurements are collected in a longitudinal study in which
change over time is assessed.
• A popular repeated-measures design is the crossover study. A crossover study is a
longitudinal study in which subjects receive a sequence of different treatments (or
exposures). While crossover studies can be observational studies, many important
crossover studies are controlled experiments.
• Crossover designs are common for experiments in many scientific disciplines, for example
psychology, education, pharmaceutical science, and health care, especially medicine.
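A minimal sketch of a repeated-measures ANOVA using statsmodels' AnovaRM, where
every subject is measured under every condition; the data is hypothetical:

```python
# Repeated-measures ANOVA: the same subjects measured under each condition.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

df = pd.DataFrame({
    "subject":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "condition": ["t1", "t2", "t3"] * 4,
    "score":     [5.1, 6.0, 7.2, 4.8, 5.9, 7.0,
                  5.3, 6.2, 7.5, 4.9, 6.1, 7.1],
})

result = AnovaRM(df, depvar="score", subject="subject",
                 within=["condition"]).fit()
print(result)  # F-test for the within-subject condition effect
```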