Module 3 - Assignment Rakesh Thakor

ASSIGNMENT

Unit 3: Statistical methods

1. Describe the two types of statistical methods.


Answer:

Statistical methods

• Statistical methods are mathematical formulas, models, and techniques that are used in
statistical analysis of raw research data.

• Statistical methods involved in carrying out a study include planning, designing, collecting
data, analysing, drawing meaningful interpretations and reporting the research findings.

• Statistical analysis gives meaning to otherwise meaningless numbers, thereby breathing life
into lifeless data.

• The results and inferences are precise only if proper statistical tests are used.

• The application of statistical methods extracts information from research data and provides
different ways to assess the robustness of research outputs.

Types of Statistics
Statistics is broadly categorised into two types:
1. Descriptive statistics

2. Inferential statistics

Descriptive Statistics
In this type of statistics, the data is summarised through the given observations. The
summarisation is done from a sample of the population using parameters such as the mean
or the standard deviation.
Descriptive statistics is a way to organise, represent and describe a collection of data using
tables, graphs, and summary measures. For example, the collection of people in a city using
the internet or watching television.
Descriptive statistics is further divided into four categories:
 Measure of frequency
 Measure of dispersion

 Measure of central tendency

 Measure of position

The measure of frequency shows the number of times a particular data value occurs. Range,
variance and standard deviation are measures of dispersion; they identify the spread of the data.
The measures of central tendency are the mean, median and mode of the data, and the measures
of position describe percentile and quartile ranks.
Inferential Statistics
This type of statistics is used to interpret the meaning of descriptive statistics. That means
once the data has been collected, analysed and summarised, we use these statistics to
describe the meaning of the collected data. In other words, it is used to draw conclusions from
data that are subject to random variation, such as observational errors and sampling variation.
Inferential statistics is a method that allows us to use information collected from a sample to
make decisions, predictions or inferences about a population. It allows us to make statements
that go beyond the available data or information. For example, deriving estimates from
hypothetical research.
Statistics Example
In a class, the collection of marks obtained by 50 students is the description of the data. When
we take the mean of the data, the result is the average mark of the 50 students. If the average
mark obtained by the 50 students is 88 out of 100, then we can reach a conclusion or make a
judgement on the basis of that result.
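The summary measures in this example can be computed directly. Below is a minimal Python sketch (the marks are illustrative, not the actual 50-student data):

import statistics

marks = [88, 92, 75, 81, 95, 67, 88, 90, 84, 79]   # hypothetical marks out of 100

print("Mean:", statistics.mean(marks))                 # measure of central tendency
print("Median:", statistics.median(marks))             # measure of central tendency
print("Mode:", statistics.mode(marks))                 # most frequent mark
print("Standard deviation:", statistics.stdev(marks))  # measure of dispersion
print("Range:", max(marks) - min(marks))               # measure of dispersion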
2. Mention the types of probability distribution and explain them.
Answer:

A distribution in statistics is a function that shows the possible values for a variable and
how often they occur.

• The distribution of a statistical data set (or a population) is a listing or function showing
all the possible values (or intervals) of the data and how often they occur. When a
distribution of categorical data is organized, you see the number or percentage of
individuals in each group.

• The distribution of an event consists not only of the input values that can be observed,
but is made up of all possible values.

• The distribution in statistics is defined by the underlying probabilities and not the graph.
The graph is just a visual representation.

Types of Probability Distributions

A probability distribution is a function that describes the likelihood of obtaining the possible
values that a random variable can assume. In other words, the values of the variable vary based
on the underlying probability distribution. It is a function (or mapping) of events to probabilities.

Motivation: using historical data and experience (or assumptions), it provides a convenient way
to estimate or predict the probabilities of events.

Methods:

• Using histograms

• Using probability density functions

• Using cumulative distribution functions

Probability distributions are generally divided into two classes. A discrete probability
distribution (applicable to scenarios where the set of possible outcomes is discrete, such as a
coin toss or a roll of dice) can be encoded by a discrete list of the probabilities of the outcomes,
known as a probability mass function. A continuous probability distribution (applicable to
scenarios where the set of possible outcomes can take on values in a continuous range, e.g. real
numbers, such as the temperature on a given day) is typically described by a probability density
function (with the probability of any individual outcome actually being 0).

Binomial distribution

• The binomial distribution measures the probabilities of the number of successes over a
given number of trials with a specified probability of success in each try.

• In the simplest scenario of a coin toss (with a fair coin), where the probability of getting a
head with each toss is 0.50 and there are a hundred trials, the binomial distribution will
measure the likelihood of getting anywhere from no heads in a hundred tosses (very
unlikely) to 50 heads (the most likely) to 100 heads (also very unlikely).

• The binomial distribution in this case will be symmetric, reflecting the even odds; as the
probabilities shift from even odds, the distribution will get more skewed.

• Negative Binomial distribution: Returning again to the coin toss example, assume that
you hold the number of successes fixed at a given number and estimate the number of
tries you will have before you reach the specified number of successes. The resulting
distribution is called the negative binomial and it very closely resembles the Poisson. In
fact, the negative binomial distribution converges on the Poisson distribution, but will be
more skewed to the right (positive values) than the Poisson distribution with similar
parameters.

• The figure shows binomial distributions for three scenarios – two with a 50% probability of
success and one with a 70% probability of success, with different trial sizes. As the probability
of success is varied away from 50%, the distribution shifts its shape, becoming positively
skewed for probabilities less than 50% and negatively skewed for probabilities greater than 50%.
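As an illustration of the coin-toss scenario above, the following Python sketch (assuming SciPy is available) evaluates the binomial probability mass function for 100 fair tosses and for a skewed 70%-success case:

from scipy.stats import binom

n, p = 100, 0.5                      # 100 tosses of a fair coin
print(binom.pmf(0, n, p))            # P(no heads)  - very unlikely
print(binom.pmf(50, n, p))           # P(50 heads)  - the most likely single outcome
print(binom.pmf(100, n, p))          # P(100 heads) - very unlikely

# A skewed case: 70% probability of success, 10 trials
print(binom.pmf(7, 10, 0.7))         # P(exactly 7 successes in 10 trials)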

Poisson distribution

• The Poisson distribution measures the likelihood of a number of events occurring within
a given time interval, where the key parameter required is the average number of events in
the given interval (λ).

• The resulting distribution looks similar to the binomial, with the skewness being positive
but decreasing with λ.

• The figure presents three Poisson distributions, with λ ranging from 1 to 10.
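A short sketch of Poisson probabilities for different values of λ (assuming SciPy is available):

from scipy.stats import poisson

for lam in (1, 4, 10):               # average number of events in the interval
    print(f"lambda={lam}: P(X=2) = {poisson.pmf(2, lam):.4f}")   # probability of exactly 2 events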


Geometric distribution

Consider again the coin toss example used to illustrate the binomial. Rather than focus on
the number of successes in n trials, assume that you are measuring the likelihood of when the
first success will occur. For instance, with a fair coin toss, there is a 50% chance that the first
success will occur on the first try, a 25% chance that it will occur on the second try and a 12.5%
chance that it will occur on the third try. The resulting distribution is positively skewed; the
figure shows it for three different probability scenarios. The distribution is steepest with high
probabilities of success and flattens out as the probability decreases. However, the distribution
is always positively skewed.
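The 50%/25%/12.5% probabilities quoted above can be reproduced with SciPy's geometric distribution (a minimal sketch):

from scipy.stats import geom

p = 0.5                                                      # fair coin
for k in (1, 2, 3):
    print(f"P(first success on try {k}) = {geom.pmf(k, p)}")  # 0.5, 0.25, 0.125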

Hypergeometric distribution
The hypergeometric distribution measures the probability of a specified number of
successes in n trials, without replacement, from a finite population. Since the sampling is
without replacement, the probabilities can change as a function of previous draws.
Consider, for instance, the possibility of getting four face cards in a hand of ten, over
repeated draws from a pack. Since there are 16 face cards and the total pack contains 52
cards, the probability of getting four face cards in a hand of ten can be estimated. The figure
provides a graph of the hypergeometric distribution. The hypergeometric distribution
converges on the binomial distribution as the population size increases.
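The face-card example above can be evaluated with SciPy's hypergeometric distribution (a minimal sketch, using the counts given in the text):

from scipy.stats import hypergeom

M, n, N = 52, 16, 10                 # deck size, face cards in the deck, hand size
print(hypergeom.pmf(4, M, n, N))     # P(exactly 4 face cards in a hand of ten, without replacement)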

Continuous distributions

• With continuous data, we cannot specify all possible outcomes, since they are too
numerous to list, but we have two choices.

• The first is to convert the continuous data into a discrete form and then go through the
same process that we went through for discrete distributions of estimating
probabilities.

• For instance, we could take a variable such as market share and break it down into
discrete blocks – market share between 3% and 3.5%, between 3.5% and 4% and so
on – and consider the likelihood that we will fall into each block.

• The second is to find a continuous distribution that best fits the data and to specify the
parameters of the distribution.
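Both choices can be sketched briefly in Python (the market-share figures below are hypothetical, and NumPy/SciPy are assumed to be available):

import numpy as np
from scipy import stats

market_share = np.array([3.2, 3.7, 4.1, 3.9, 3.4, 4.4, 3.6, 3.8, 4.0, 3.5])   # illustrative data (%)

# Choice 1: break the continuous data into discrete blocks and estimate block probabilities
counts, edges = np.histogram(market_share, bins=[3.0, 3.5, 4.0, 4.5])
print(counts / counts.sum())          # estimated probability of falling in each block

# Choice 2: fit a continuous distribution (here a normal) and report its parameters
mu, sigma = stats.norm.fit(market_share)
print(mu, sigma)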
3. Discuss in detail the types of statistical inference.
Answer:

Statistical inference is the process of using data analysis to deduce properties of an
underlying distribution of probability. Statistical inference makes propositions about a
population, using data drawn from the population with some form of sampling. Given a
hypothesis about a population, for which we wish to draw inferences, statistical inference
consists of (first) selecting a statistical model of the process that generates the data and
(second) deducing propositions from the model. Inferential statistical analysis infers
properties of a population, for example by testing hypotheses and deriving estimates. It is
assumed that the observed data set is sampled from a larger population. Descriptive
statistics, by contrast, is solely concerned with properties of the observed data and does not
rest on the assumption that the data come from a larger population.

The conclusion of a statistical inference is a statistical proposition. Some common forms of
statistical proposition are the following:

• A point estimate, i.e. a particular value that best approximates some parameter of interest;

• An interval estimate, e.g. a confidence interval (or set estimate), i.e. an interval constructed
using a dataset drawn from a population so that, under repeated sampling of such datasets,
the intervals would contain the true parameter value with the probability at the stated
confidence level;

• A credible interval, i.e. a set of values containing, for example, 95% of posterior belief;

• Rejection of a hypothesis;

• Clustering or classification of data points into groups.

The ingredients used for making statistical inference are:

• Sample size

• Variability in the sample

• Size of the observed differences.

Types of statistical inference

There are different types of statistical inference that are extensively used for drawing
conclusions. They are listed below; a small sketch of one of them (Pearson correlation) follows
the list.

• One sample hypothesis testing

• Confidence Interval

• Pearson Correlation

• Bi-variate regression

• Multi-variate regression

• Chi-square statistics and contingency table

• ANOVA or T-test Procedure
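A minimal sketch of one method from the list (Pearson correlation), using small hypothetical samples and SciPy:

from scipy.stats import pearsonr

hours_studied = [2, 4, 5, 7, 8, 10]
exam_score    = [55, 60, 62, 70, 74, 80]

r, p_value = pearsonr(hours_studied, exam_score)
print(f"r = {r:.3f}, p = {p_value:.4f}")   # strength of the linear relationship and its significance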

Statistical estimation

Statistical estimation is concerned with best estimating a value or range of values for a
particular population parameter, and hypothesis testing is concerned with deciding whether
the study data are consistent at some level of agreement with a particular population
parameter. We briefly describe statistical estimation and then devote the remainder of this
section to providing a conceptual overview of hypothesis testing.

There are two types of statistical estimation. The first type is point estimation, which
addresses what particular value of a parameter is most consistent with the data.

The second type of statistical estimation is interval estimation. Interval estimation is
concerned with quantifying the uncertainty or variability associated with the estimate. This
approach supplements point estimation because it gives important information about the
variability (or confidence) in the point estimate.
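A brief sketch of both types of estimation for a hypothetical sample (NumPy/SciPy assumed):

import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])

point_estimate = sample.mean()                     # point estimation: the sample mean
sem = stats.sem(sample)                            # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=point_estimate, scale=sem)   # interval estimation: 95% CI
print(point_estimate, (ci_low, ci_high))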

Statistical hypothesis testing

Hypothesis testing has a complementary perspective. The framework addresses whether a
particular value (often called the null hypothesis) of the parameter is consistent with the
sample data. We then address how much evidence we have to reject (or fail to reject) the
null hypothesis.
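For example, a one-sample t-test of the null hypothesis that the population mean equals 12 could be sketched as follows (hypothetical data, SciPy assumed):

from scipy import stats

sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]
t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)
print(t_stat, p_value)    # a small p-value is evidence against the null hypothesis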
The procedures involved in inferential statistics are:

• Begin with a theory

• Create a research hypothesis

• Operationalize the variables

• Recognize the population to which the study results should apply

• Formulate a null hypothesis for this population

• Accumulate a sample from the population and conduct the study

• Conduct statistical tests to see whether the collected sample properties are sufficiently
different from what would be expected under the null hypothesis to be able to reject the null
hypothesis.

Importance

• Inferential statistics is important to examine the data properly.

• To make an accurate conclusion, proper data analysis is important to interpret the research
results.

• It is mainly used for making future predictions about various observations in different fields.

• It helps us to make inference about the data.

• Statistical inference has a wide range of applications in different fields, such as:

o Business Analysis

o Artificial Intelligence

o Financial Analysis

o Fraud Detection

o Machine Learning

o Share Market

o Pharmaceutical Sector
4. Discuss in detail the classes of models.
Answer:

Fixed-effects models

The fixed-effects model (class I) of analysis of variance applies to situations in which the
experimenter applies one or more treatments to the subjects of the experiment to see
whether the response variable values change. This allows the experimenter to estimate the
ranges of response variable values that the treatment would generate in the population as a
whole. The fixed-effects model assumes that the omitted effects of the model can be
arbitrarily correlated with the included variables. This is useful whenever you are only
interested in analysing the impact of variables that vary over time (the time effects).

FE models explore the relationship between predictor and outcome variables within an entity
(country, person, company, etc.). Each entity has its own individual characteristics that may
or may not influence the predictor variables (for example, being male or female could
influence the opinion toward a certain issue; the political system of a particular country
could have some effect on trade or GDP; or the business practices of a company may
influence its stock price).

When using FE we assume that something within the individual may impact or bias the
predictor or outcome variables and we need to control for this. This is the rationale behind
the assumption of the correlation between entity’s error term and predictor variables. FE
remove the effect of those time-invariant characteristics so we can assess the net
effect of the predictors on the outcome variable.
The FE regression model has n different intercepts, one for each entity. These intercepts
can be represented by a set of binary variables, and these binary variables absorb the
influences of all omitted variables that differ from one entity to the next but are constant over
time.
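A minimal sketch of this dummy-variable (binary-variable) representation, assuming statsmodels is available and using a small hypothetical panel with columns y, x and entity:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "entity": ["A", "A", "B", "B", "C", "C"],
    "x":      [1.0, 2.0, 1.5, 2.5, 1.2, 2.2],
    "y":      [2.1, 3.9, 3.0, 5.1, 1.8, 3.7],
})

# C(entity) expands into one dummy per entity, giving each entity its own intercept
fe_model = smf.ols("y ~ x + C(entity)", data=df).fit()
print(fe_model.params)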

Another important assumption of the FE model is that those time-invariant characteristics
are unique to the individual and should not be correlated with other individual
characteristics. Each entity is different; therefore the entity's error term and the constant
(which captures individual characteristics) should not be correlated with those of the others.
If the error terms are correlated, then FE is not suitable, since inferences may not be correct,
and you would need to model that relationship.

Random-effects models

Random-effects model (class II) is used when the treatments are not fixed. This occurs
when the various factor levels are sampled from a larger population. Because the levels
themselves are random variables, some assumptions and the method of contrasting the
treatments (a multivariable generalization of simple differences) differ from the fixed-effects
model. If the individual effects are strictly uncorrelated with the regressors it may be
appropriate to model the individual specific constant terms as randomly distributed across
cross-sectional units. This view would be appropriate if we believed that sampled cross-
sectional units were drawn from a large population.
If you have reason to believe that differences across entities have some influence on your
dependent variable then you should use random effects. In random-effects you need to
specify those individual characteristics that may or may not influence the predictor
variables. The problem with this is that some variables may not be available therefore
leading to omitted variable bias in the model.

An advantage of random effects is that you can include time-invariant variables (e.g.
gender). In the fixed-effects model these variables are absorbed by the intercept. The cost
is the possibility of inconsistent estimators if the assumption is inappropriate.

Mixed-effects models

A mixed-effects model (class III) contains experimental factors of both fixed and random-
effects types, with appropriately different interpretations and analysis for the two types.
Example: Teaching experiments could be performed by a college or university department
to find a good introductory textbook, with each text considered a treatment. The fixed-
effects model would compare a list of candidate texts. The random-effects model would
determine whether important differences exist among a list of randomly selected texts. The
mixed-effects model would compare the (fixed) incumbent texts to randomly selected
alternatives.

Mixed-effects regression models are a powerful tool for linear regression when your data
contain both global and group-level trends. A common illustration uses fictitious data relating
exercise to mood to introduce the concept; robust frameworks for fitting such models exist in
both R and Python.

Mixed-effects models are common in political polling analysis, where national-level
characteristics are assumed to occur at a state level while state-level sample sizes may be
too small to drive those characteristics on their own. They are also common in scientific
experiments where a given effect is assumed to be present among all study individuals and
needs to be teased out from a specific effect on a treatment group. In a similar vein, this
framework can be helpful in pre/post studies of interventions.
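A minimal sketch of a mixed-effects model with a global (fixed) slope and a random intercept per subject, echoing the exercise/mood illustration above; the data and column names are hypothetical and statsmodels is assumed to be available:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "subject":  ["s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3"],
    "exercise": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "mood":     [3.0, 3.5, 4.1, 2.0, 2.6, 3.1, 4.0, 4.4, 5.0],
})

# Fixed effect: exercise; random effect: an intercept for each subject (group)
mixed = smf.mixedlm("mood ~ exercise", data=df, groups=df["subject"]).fit()
print(mixed.summary())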
5. Discuss in detail Analysis of variance.
Answer:

Analysis of variance

• Analysis of variance (ANOVA) is a collection of statistical models and their associated
estimation procedures (such as the "variation" among and between groups) used to analyze
the differences among group means in a sample.

• ANOVA was developed by statistician and evolutionary biologist Ronald Fisher.

• The ANOVA is based on the law of total variance, where the observed variance in a
particular variable is partitioned into components attributable to different sources of
variation.

• In its simplest form, ANOVA provides a statistical test of whether two or more population
means are equal, and therefore generalizes the t-test beyond two means. ANOVA is a form
of statistical hypothesis testing heavily used in the analysis of experimental data. A test
result (calculated from the null hypothesis and the sample) is called statistically significant if
it is deemed unlikely to have occurred by chance, assuming the truth of the null hypothesis.
A statistically significant result, when a probability (p-value) is less than a pre-specified
threshold (significance level), justifies the rejection of the null hypothesis, but only if the a
priori probability of the null hypothesis is not high.

In the typical application of ANOVA, the null hypothesis is that all groups are random samples
from the same population. For example, when studying the effect of different treatments on
similar samples of patients, the null hypothesis would be that all treatments have the same
effect (perhaps none). Rejecting the null hypothesis is taken to mean that the differences in
observed effects between treatment groups are unlikely to be due to random chance.

ANOVA is the synthesis of several ideas and it is used for multiple purposes. It is difficult to
define concisely or precisely. "Classical" ANOVA for balanced data does three things at once:

• As exploratory data analysis, an ANOVA employs an additive data decomposition, and its
sums of squares indicate the variance of each component of the decomposition (or,
equivalently, each set of terms of a linear model).

• Comparisons of mean squares, along with an F-test ... allow testing of a nested sequence
of models.

• Closely related to the ANOVA is a linear model fit with coefficient estimates and standard
errors.

ANOVA is a statistical tool used in several ways to develop and confirm an explanation for the
observed data. It is computationally elegant and relatively robust against violations of its
assumptions. ANOVA provides strong (multiple sample comparison) statistical analysis. It has
been adapted to the analysis of a variety of experimental designs. ANOVA "has long enjoyed
the status of being the most used (some would say abused) statistical technique in
psychological research" and "is probably the most useful technique in the field of statistical
inference". ANOVA is difficult to teach, particularly for complex experiments, with split-plot
designs being notorious.

One-way ANOVA

One-way analysis of variance (abbreviated one-way ANOVA) is a technique that can be
used to compare the means of two or more samples (using the F distribution). This technique
can be used only for numerical response data, the "Y", usually one variable, and numerical
or (usually) categorical input data, the "X", always one variable, hence "one-way". The one-
way ANOVA is typically used to test for differences among at least three groups, since the
two-group case can be covered by a t-test. When there are only two means to compare, the
t-test and the F-test are equivalent; the relation between ANOVA and t is given by F = t². An
extension of one-way ANOVA is two-way analysis of variance, which examines the influence
of two different categorical independent variables on one dependent variable.
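A minimal sketch of a one-way ANOVA on three hypothetical groups, using SciPy's F-test implementation:

from scipy.stats import f_oneway

group_a = [23, 25, 27, 22, 26]
group_b = [30, 31, 29, 32, 28]
group_c = [24, 26, 25, 27, 23]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)    # a small p-value suggests at least one group mean differs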

Factorial ANOVA

• ANOVA generalizes to the study of the effects of multiple factors.

• When the experiment includes observations at all combinations of levels of each factor, it is
termed factorial.

• Factorial experiments are more efficient than a series of single-factor experiments, and the
efficiency grows as the number of factors increases. Consequently, factorial designs are
heavily used.

• A variety of techniques are used with multiple-factor ANOVA to reduce expense.

• One technique used in factorial designs is to minimize replication (possibly no replication
with support of analytical trickery) and to combine groups when effects are found to be
statistically (or practically) insignificant.

• An experiment with many insignificant factors may collapse into one with a few factors
supported by many replications.
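A minimal sketch of a two-factor (factorial) ANOVA, assuming statsmodels and a hypothetical data frame with two categorical factors and a numeric response:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "fertilizer": ["low", "low", "high", "high"] * 4,
    "water":      ["low", "high"] * 8,
    "yield_":     [4.1, 5.0, 5.5, 6.8, 4.3, 5.2, 5.7, 7.0,
                   4.0, 4.9, 5.6, 6.9, 4.2, 5.1, 5.4, 6.7],
})

# Main effects of both factors plus their interaction
model = smf.ols("yield_ ~ C(fertilizer) * C(water)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))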

Repeated measures ANOVA

• Repeated measures design is a research design that involves multiple measures of the
same variable taken on the same or matched subjects, either under different conditions or
over two or more time periods.

• For instance, repeated measurements are collected in a longitudinal study in which change
over time is assessed.

• A popular repeated-measures design is the crossover study.

• A crossover study is a longitudinal study in which subjects receive a sequence of different
treatments (or exposures). While crossover studies can be observational studies, many
important crossover studies are controlled experiments. Crossover designs are common for
experiments in many scientific disciplines, for example psychology, education, pharmaceutical
science, and health care, especially medicine.

Multivariate analysis of Variance (MANOVA)

• Multivariate analysis of variance (MANOVA) is a procedure for comparing multivariate
sample means.

• As a multivariate procedure, it is used when there are two or more dependent variables.

• It is often followed by significance tests involving individual dependent variables separately.

• MANOVA is a generalized form of univariate analysis of variance (ANOVA).

• It uses the covariance between outcome variables in testing the statistical significance of
the mean differences.
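A minimal sketch of a MANOVA with two dependent variables and one grouping factor, assuming statsmodels and hypothetical data:

import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "y1":    [2.1, 2.4, 2.0, 2.6, 2.3, 3.1, 3.4, 3.0, 3.6, 3.2],
    "y2":    [5.0, 5.2, 4.9, 5.5, 5.1, 6.0, 6.3, 5.8, 6.4, 6.1],
})

# Wilks' lambda and related multivariate tests of the group effect on (y1, y2) jointly
manova = MANOVA.from_formula("y1 + y2 ~ group", data=df)
print(manova.mv_test())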
