RM Sec 3 1

1. Descriptive Statistics
The term “descriptive statistics” refers to the analysis, summary, and presentation of findings related to a
data set derived from a sample or entire population. Descriptive statistics comprises three main categories
– Frequency Distribution, Measures of Central Tendency, and Measures of Variability. Descriptive
statistics summarize and organize characteristics of a data set. A data set is a collection of responses or
observations from a sample or entire population. In quantitative research, after collecting data, the first step
of statistical analysis is to describe characteristics of the responses, such as the average of one variable (e.g.,
age), or the relation between two variables (e.g., age and creativity).
Types of descriptive statistics: -
1. The distribution concerns the frequency of each value.
2. The central tendency concerns the averages of the values.
3. The variability or dispersion concerns how spread out the values are.
You can apply these to assess only one variable at a time, in univariate analysis, or to compare two or more,
in bivariate and multivariate analysis.
1. Frequency distribution: -A data set is made up of a distribution of values, or scores. In tables or
graphs, you can summarize the frequency of every possible value of a variable in numbers or
percentages.
• Simple frequency distribution table- For the variable of gender, you list all possible answers
on the left hand column. You count the number or percentage of responses for each answer
and display it on the right hand column.
Gender      Number
Man         182
Woman       235
No answer   27

• Grouped frequency distribution table- In a grouped frequency distribution, you can group
numerical response values and add up the number of responses for each group. You can also
convert each of these numbers to percentages.
Library visits in the past year   Percent
0–4                               6%
5–8                               20%
9–12                              42%
13–16                             24%
17+                               8%
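As a minimal sketch of how such tables can be produced programmatically (assuming Python with only the standard library; the response values are invented for illustration):

from collections import Counter

# Hypothetical raw responses for the gender variable
responses = ["Man", "Woman", "Woman", "No answer", "Man", "Woman"]

# Simple frequency distribution: count each distinct value
for value, count in Counter(responses).items():
    print(f"{value:<10} {count}")

# Grouped frequency distribution: bin hypothetical numeric values
# (library visits) and report each group's share as a percentage
visits = [15, 3, 12, 0, 24, 3, 7, 9, 11, 14]
bins = [(0, 4), (5, 8), (9, 12), (13, 16), (17, 10**9)]
for low, high in bins:
    n = sum(low <= v <= high for v in visits)
    label = f"{low}-{high}" if high < 10**9 else f"{low}+"
    print(f"{label:<6} {100 * n / len(visits):.0f}%")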
2. Measures of central tendency: -Measures of central tendency estimate the center, or average, of a
data set. The mean, median and mode are 3 ways of finding the average. Here we will demonstrate
how to calculate the mean, median, and mode using the first 6 responses of our survey.
• Mean -The mean, or M, is the most commonly used method for finding the average.
To find the mean, simply add up all response values and divide the sum by the total number
of responses. The total number of responses or observations is called N.
Mean number of library visits
Data set                    15, 3, 12, 0, 24, 3
Sum of all values           15 + 3 + 12 + 0 + 24 + 3 = 57
Total number of responses   N = 6
Mean                        Divide the sum of values by N to find M: 57/6 = 9.5

• Median -The median is the value that’s exactly in the middle of a data set.
To find the median, order each response value from the smallest to the biggest. Then, the
median is the number in the middle. If there are two numbers in the middle, find their mean.
Median number of library visits
Ordered data set   0, 3, 3, 12, 15, 24
Middle numbers     3, 12
Median             Find the mean of the two middle numbers: (3 + 12)/2 = 7.5

• Mode - The mode is simply the most popular or most frequent response value. A data set
can have no mode, one mode, or more than one mode.
To find the mode, order your data set from lowest to highest and find the response that occurs
most frequently.
Mode number of library visits
Ordered data set   0, 3, 3, 12, 15, 24
Mode               Find the most frequently occurring response: 3

3. Measures of variability or Dispersion: -Measures of variability give you a sense of how spread out
the response values are. The range, standard deviation and variance each reflect different aspects of
spread.
• Range- The range gives you an idea of how far apart the most extreme response scores are.
To find the range, simply subtract the lowest value from the highest value.

Range of visits to the library in the past year
Ordered data set   0, 3, 3, 12, 15, 24
Range              24 – 0 = 24
• Standard deviation - The standard deviation (s) is the average amount of variability in your
dataset. It tells you, on average, how far each score lies from the mean. The larger the
standard deviation, the more variable the data set is.
There are six steps for finding the standard deviation:
i. List each score and find their mean.
ii. Subtract the mean from each score to get the deviation from the mean.
iii. Square each of these deviations.
iv. Add up all of the squared deviations.
v. Divide the sum of the squared deviations by N – 1.
vi. Find the square root of the number you found.
Standard deviation of visits to the library in the past year. In the table below, you
complete Steps 1 through 4.
Raw data (X)   Deviation from mean (d = X − mean)   Squared deviation (d²)
15             15 − 9.5 = 5.5                       30.25
3              3 − 9.5 = −6.5                       42.25
12             12 − 9.5 = 2.5                       6.25
0              0 − 9.5 = −9.5                       90.25
24             24 − 9.5 = 14.5                      210.25
3              3 − 9.5 = −6.5                       42.25
Mean = 9.5     Sum of d = 0                         Sum of d² = 421.5


Step 5: 421.5/5 = 84.3
Step 6: s = √84.3 = 9.18
From learning that s = 9.18, you can say that on average, each score deviates from the mean
by 9.18 points.
• Variance- The variance is the average of squared deviations from the mean. Variance reflects
the degree of spread in the data set. The more spread the data, the larger the variance is in
relation to the mean.
To find the variance, simply square the standard deviation. The symbol for variance is s².
Variance of visits to the library in the past year. Data set: 15, 3, 12, 0, 24, 3
s = 9.18
s² = 84.3
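The figures above can be checked with a short sketch (assuming Python; the standard library's statistics module uses the same N − 1 divisor for the sample standard deviation as the six-step procedure above):

import statistics

data = [15, 3, 12, 0, 24, 3]  # the data set used in the examples above

print("Mean:", statistics.mean(data))              # 57/6 = 9.5
print("Median:", statistics.median(data))          # (3 + 12)/2 = 7.5
print("Mode:", statistics.mode(data))              # 3
print("Range:", max(data) - min(data))             # 24 - 0 = 24
print("Variance s^2:", statistics.variance(data))  # 421.5/5 = 84.3
print("Std dev s:", statistics.stdev(data))        # sqrt(84.3) ≈ 9.18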

2. Tabulation
Tabulation refers to the system of processing data or information by arranging it into a table. With tabulation,
numeric data is arrayed logically and systematically into columns and rows, to aid in their statistical
analysis. The purpose of tabulation is to present a large mass of complicated information in an orderly
fashion and allow viewers to draw reasonable conclusions and interpretations from them.
To tabulate data correctly, one must learn about the eight essential parts of a table. These are as follows –
• Table Number – This is the first part of a table and is given on top of any table to facilitate easy
identification and for further reference.
• Title of the Table – One of the most important parts of any table, its title is placed on top of the same
and narrates its contents. It is imperative that the title be brief, crisp, and carefully worded to describe
the table's contents effectively.
• Headnote – The headnote of a table is presented in the portion just below the title. It provides
information about the unit of data in the table, like “amount in Rupees” or “quantity in kilograms”,
etc.
• Column Headings or Captions – Captions are the portion of the table on top of each column which
explains the figures under each column.
• Row Headings or Stubs – The title of each horizontal row is called a stub.
• Body of a Table – This is the portion that contains the numeric information collected from
investigated facts. The data in the body is presented in rows which are read horizontally from left to
right and in columns, read vertically from top to bottom.
• Footnote – Given at the bottom of a table above the source note, a footnote is used to state any fact
that is not clear from the table’s title, headings, caption or stub. For instance, if a table denotes the
profit earned by a company, a footnote can be used to state whether said profit is calculated before or
after tax.
• Source Note – As its name suggests, a source note refers to the source from where the table’s
information has been collected.
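A hypothetical example showing how these eight parts fit together (all figures invented for illustration):

Table 1: Profit of ABC Ltd. by Division, 2021–2022      <- Table Number and Title
(Amount in Rupees, lakhs)                               <- Headnote
Division        2021        2022                        <- Captions (column headings)
Textiles        120         135
Chemicals       85          90                          <- Stubs (row headings) and Body
*Profit is stated before tax.                           <- Footnote
Source: Company annual report.                          <- Source Note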
Objectives of Tabulation: - Tabulation essentially bridges the gap between collecting data and
analysing it. The primary objectives of tabulation are as follows –
• For Simplification of Complex Data – When any information is tabulated, the volume of raw data
is compressed and presented in a much more simplified manner. This facilitates easy comprehension
and analysis of previously complex data.
• To Highlight Important Information – Representing any data in tabular form increases the scope to
highlight important information. Since data is presented in a concise manner without any textual
explanation, any crucial information is automatically highlighted without difficulty.
• To Enable Easy Comparison – When data is presented in an orderly fashion in rows and columns,
it becomes easier to compare between them on the basis of several parameters. For example, it
becomes easier to determine the month when a country has received the maximum amount of rainfall
if the data is presented in a table. Otherwise, there always remains room for making a mistake in
processing the data correctly.
• To Help in the Statistical Analysis of Data – Statistical analysis involves computing the correlation,
average, dispersion, etc., of data. When information is presented in an organised manner in a table,
statistical analysis becomes a lot simpler.
• Saves Space - Even though it might not seem as important as the other objectives of tabulation, saving
space without sacrificing the quality of data can be extremely helpful in the long run. Additionally,
a table helps to present facts in a much more concise manner than page after page of text.
Types of Tabulation – Generally, tabulation can be classified into two types – simple and complex
tabulation.
A. Simple Tabulation- This is the process of tabulation through which information regarding one or
more independent questions is illustrated. It is also known as one-way tabulation.
B. Complex Tabulation- These are the types of tables which represent the division of data into two or
more categories based on two or more characteristics. This type of data tabulation can be divided
into three types. These are –
• Two Way Tables – These tables illustrate information collected from two mutually
dependent questions. For instance, say that a table has to illustrate the population of the
different states of India. This can be done in a one-way table. But if the population has to be
compared in terms of the total number of males and females in each state, it will require a
two way table.
• Three-Way Table – Like the above mentioned category, three-way tables illustrate
information collected from three mutually dependent and inter-related questions. Let us take
the above example and elaborate on that further with another category added to the table –
the position of literacy amongst the male and female population in each state. The tabulation
for these categories has to be put down in a three-way table.
• Manifold Table – These tables are utilised to illustrate information collected from more than
three interrelated questions or characteristics.
Rules of Tabulation- There are a few general rules that have to be followed while constructing tables. These
are –
i. Tables should be self-explanatory. Even though footnotes form a part of tables, they
should not be necessary for explaining the meaning of the data presented in a table.
ii. If the volume of information is substantial, it is best to put them down in multiple tables instead of
a single one. This reduces the chances of mistakes that would otherwise defeat the purpose of forming a table.
However, each table formed should also be complete in itself and serve the purpose of analysis.
iii. The number of rows and columns should be kept minimal to present information in a crisp and
concise manner.
iv. Before tabulating, data should be approximated, wherever necessary.
v. Stubs and captions should be self-explanatory and should not require the help of footnotes to be
comprehended.
vi. If certain portions of the collected data cannot be tabulated under any stub or caption, they should be
put down in a separate table under the heading of miscellaneous.
vii. Quantity and quality of data should not be compromised under any scenario while forming a table.

3. Diagrammatic Representation of Data
Diagrammatic Representations: - Diagrams are an important tool for representing statistical data. The
representation of numerical data using diagrams is known as diagrammatic representation, and it is one of
the best techniques for presenting numerical data collected in statistics. As the famous quote goes, “A
picture speaks more than a thousand words”; similarly, a diagrammatic representation of data conveys a
great deal of information about the numerical data at a glance. Compared with the tabular or textual form
of the same data, diagrammatic representations give a simpler and easier understanding. They translate the
complex ideas contained in numerical data into a concrete, simple, and understandable form, using
geometrical figures as diagrams to improve the representation, and they serve as visual assistance to the
reader. We have different types of diagrammatic representations; let us learn about them in detail below.
In one-dimensional diagrammatic representations of data, we consider only the length of the
diagram. The different types of one-dimensional diagrams are listed below:
• Simple bar diagram
• Multiple bar diagram
• Subdivided bar diagram
• Percentage bar diagram
• Deviation bar diagram
Types of Diagrammatic Representations: - Diagrammatic representations use geometrical figures as
diagrams, such as cartographs, pictographs, pie charts, bar diagrams, etc.
1) Line Diagrams - In the linear diagrammatic representation of data, a line connects the points
plotted for the data, with the two variables taken on the horizontal and vertical axes.
2) Bar Diagrams -In the bar diagrammatic representation of data, the data can be represented by
rectangular bars. The height of the bars gives the value or frequency of the variable. All rectangular
bars should have equal width. This is one of the best-used tools for the comparison of the data.
3) Histograms -Histograms are also similar to bar diagrams; they use rectangular bars to represent the
data. But all the rectangular bars are kept without any gaps.
4) Pie Diagrams -Pie Diagram is a diagrammatic representation of data by using circles and spheres.
In the pie diagrams, a circle is divided into parts, such that each part shows the proportion of various
data.
5) Pictographs - A pictographic representation shows the given data graphically by using images or
symbols. The symbol or image used in a pictograph describes the frequency of the object in the
given set of data.
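As a minimal sketch of two of these diagram types (assuming Python with the matplotlib library installed; the data reuse the hypothetical library-visit percentages from the earlier table):

import matplotlib.pyplot as plt

labels = ["0-4", "5-8", "9-12", "13-16", "17+"]
percent = [6, 20, 42, 24, 8]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar diagram: bar heights give the frequency of each class,
# and all bars have equal width
ax1.bar(labels, percent)
ax1.set_title("Bar diagram")
ax1.set_ylabel("Percent of respondents")

# Pie diagram: each sector's angle is proportional to its share
ax2.pie(percent, labels=labels, autopct="%1.0f%%")
ax2.set_title("Pie diagram")

plt.tight_layout()
plt.show()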
Advantages: -
i. The diagrammatic representations of data are more attractive and impressive compared
with the tabular form of the data or textual form of the data.
ii. The diagrammatic representations of the data are easy to remember as they use the geometrical
figures as the diagrams.
iii. The diagrammatic representation of data is easy to understand.
iv. Diagrammatic data representations translate the complex ideas included in the given
numerical data into a concrete, simple, and understandable form.
v. Diagrammatic representations also help identify hidden facts or relations in the data that are not
observed in the tabular form.
vi. Diagrammatic representations of the data are a handy tool in the comparison of data.

4. Difference Between Tabulation and Diagrammatic Representation
The following are the cardinal points of difference between a tabular and a diagrammatic representation of
data:
• Visual appeal: The tabular presentation of data does not have any visual appeal whereas the
diagrammatic presentation of data has a visual appeal, and as such, it proves to be more impressive
for a layman.
• Nature of information: The tabular presentation of data provides information in a precise manner,
whereas the diagrammatic presentation provides the information in an approximate manner.
• Nature of reading for interpretation: Tabular presentation needs closer, and intensive reading for
making purposeful interpretation of the data, whereas diagrammatic presentation does not need so.
• Amount of information: In tabular representation of data, more than one type of information relating
to the problem can be exhibited easily. In diagrammatic representation of data, only one type of
information can be presented clearly. However, in certain diagrams, information relating to the
different components of the main information can be displayed.

5. Correlation
Correlation is one of the most common and most useful statistics. A correlation is a single number that
describes the degree of relationship between two variables. It is a measure of the extent to which two
variables are related. There are three possible results of a correlational study: a positive correlation, a
negative correlation, and no correlation.
• A positive correlation is a relationship between two variables in which both variables move in the
same direction: when one variable increases, the other increases, and when one decreases, the other
also decreases. An example of positive correlation would be height and weight; taller people tend
to be heavier.
• A negative correlation is a relationship between two variables in which an increase in one variable
is associated with a decrease in the other. An example of negative correlation would be height above
sea level and temperature. As you climb the mountain (increase in height) it gets colder (decrease in
temperature).
• A zero correlation exists when there is no relationship between two variables. For example, there is
no relationship between the amount of tea drunk and level of intelligence.
A correlation can be expressed visually. This is done by drawing a scattergram (also known as a scatterplot,
scatter graph, scatter chart, or scatter diagram). A scattergram is a graphical display that shows the
relationships or associations between two numerical variables (or co-variables), which are represented as
points (or dots) for each pair of scores. A scatter graph indicates the strength and direction of the correlation
between the co-variables.
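A minimal sketch of computing a correlation coefficient and drawing a scattergram (assuming Python with numpy and matplotlib installed; the height/weight pairs are invented for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired observations: height (cm) and weight (kg)
height = np.array([150, 158, 163, 170, 175, 182, 188])
weight = np.array([52, 58, 61, 69, 74, 80, 88])

# Pearson correlation: +1 perfect positive, -1 perfect negative, 0 none
r = np.corrcoef(height, weight)[0, 1]
print(f"r = {r:.3f}")  # close to +1: a strong positive correlation

# Scattergram: one dot per (height, weight) pair
plt.scatter(height, weight)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scattergram of height vs. weight")
plt.show()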
Uses of Correlations: -
• Prediction- If there is a relationship between two variables, we can make predictions about one from
another.
• Validity- Concurrent validity (correlation between a new measure and an established measure).
• Reliability- Test-retest reliability (are measures consistent).
Inter-rater reliability (are observers consistent).
• Theory verification- Predictive validity.

6. Regression Analysis
Regression analysis is a set of statistical methods used for the estimation of relationships between a
dependent variable and one or more independent variables. It can be utilized to assess the strength of the
relationship between variables and for modelling the future relationship between them.

Regression analysis includes several variations, such as linear, multiple linear, and nonlinear. The most
common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used for
more complicated data sets in which the dependent and independent variables show a nonlinear relationship.
Types of Regression Analysis: -
1. Linear Regression- Linear regression is one of the most basic types of regression in machine
learning. The linear regression model consists of a predictor variable and a dependent variable
related linearly to each other. If the data involve more than one independent variable, the model
is called multiple linear regression.
The following equation is used to denote the linear regression model:
y = mx + c + e
where m is the slope of the line, c is the intercept, and e represents the error in the model.
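A minimal sketch of fitting this model by least squares (assuming Python with numpy; the data points are invented and roughly follow a straight line plus noise):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

# Least-squares fit of a degree-1 polynomial gives slope m and intercept c
m, c = np.polyfit(x, y, 1)
print(f"m = {m:.3f}, c = {c:.3f}")

# Residuals e = y - (mx + c): the part of y the line does not explain
print("residuals:", np.round(y - (m * x + c), 3))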
2. Logistic Regression: -Logistic regression is a type of regression analysis used when the dependent
variable is discrete, for example 0 or 1, or true or false. This means the target variable can have
only two values, and a sigmoid curve denotes the relation between the target variable and the
independent variable. The logit function is used in Logistic Regression to measure the relationship
between the target variable and the independent variables. Below is the equation that denotes
logistic regression.
logit(p) = ln(p/(1 − p)) = b0 + b1X1 + b2X2 + b3X3 + … + bkXk
where p is the probability of occurrence of the feature.
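A minimal sketch of fitting these coefficients (assuming Python with scikit-learn installed; the hours-studied/passed data are invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary outcome (1 = passed, 0 = failed) vs. hours studied
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# Coefficients b0 and b1 of the logit: ln(p/(1 - p)) = b0 + b1*X
print("b0 =", model.intercept_[0], "b1 =", model.coef_[0][0])

# The sigmoid output: predicted probability of success at X = 2.25
print("P(pass | 2.25 h) =", model.predict_proba([[2.25]])[0][1])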

3. Ridge Regression: - This is another type of regression in machine learning, usually used when there
is a high correlation between the independent variables. With multicollinear data, the least squares
estimates are unbiased, but their variances are so large that the estimates can lie far from the true
values. Therefore, a bias matrix (penalty term) is introduced in the equation of Ridge Regression,
trading a little bias for a large reduction in variance. This is a powerful regression method where
the model is less susceptible to overfitting.
4. Lasso Regression: -Lasso Regression is one of the types of regression in machine learning that
performs regularization along with feature selection. It penalizes the absolute size of the regression
coefficients. As a result, some coefficient values are shrunk all the way to zero, which does not
happen in the case of Ridge Regression.

5. Polynomial Regression: -Polynomial Regression is another type of regression analysis in machine
learning, and it is the same as Multiple Linear Regression with a small modification. In Polynomial
Regression, the relationship between the independent variable X and the dependent variable Y is
modelled as an n-th degree polynomial in X. It is a linear model when viewed as an estimator, and
the least squares method is used in Polynomial Regression as well. The best-fit line in Polynomial
Regression is not a straight line but a curve fitted through the data points, whose shape depends
upon the power of X, that is, the value of n.
6. Bayesian Linear Regression: -Bayesian Regression is one of the types of regression in machine
learning that uses Bayes' theorem to find the values of the regression coefficients. In this method
of regression, the posterior distribution of the coefficients is determined instead of finding least-
squares estimates. Bayesian Linear Regression is like both Linear Regression and Ridge Regression
but is more stable than simple Linear Regression.

7. Probability Distribution
A probability distribution is a table or an equation that interconnects each outcome of a statistical experiment
with its probability of occurrence. To understand the concept of a probability distribution, it is important to
know variables, random variables, and some other notations.
• Variables: A variable is defined as any symbol that can take on values from a particular set.
• Random Variable: When the value of a variable is the outcome of a statistical experiment, that
variable is called a random variable. It can be discrete or continuous.
• Notations: Mostly, statisticians use capital letters to denote random variables and lower-case
letters to represent particular values.
✓ X denotes the random variable X.
✓ P(X) denotes the probability distribution of X.
✓ P(X = r) denotes the probability that the random variable X is equal to a particular value,
represented by r. For example, P(X = 1) denotes the probability that the random
variable X equals 1.
Probability Distribution: - A probability distribution gives the possible outcomes of any random event
together with their probabilities. It is also identified on the grounds of the underlying sample space, the set
of possible outcomes of the random experiment. This set may be a set of prime numbers, a set of real
numbers, a set of complex numbers, or a set of any other entities. The probability distribution is a part of
probability and statistics. A random experiment is an experiment whose result cannot be predicted in
advance. For example, if we toss a coin, we cannot predict what will appear, either the head or the tail. A
possible result of a random experiment is known as an outcome, and the set of all outcomes is termed the
sample space. From these possibilities, we can design a probability table on the basis of the variable's
values and their probabilities.
Types of Probability Distribution: - There are two types of probability distribution which are used for
distinct purposes and various types of data generation processes.
A. Discrete Probability Distribution
B. Continuous probability Distribution

A. Discrete Probability Distribution: -


1) Binomial Distribution: - The Binomial distribution is a discrete probability distribution, where the
set of outcomes is discrete in nature. For example, if a die is rolled, all its possible outcomes are
discrete, and the distribution gives the probability mass of each outcome; it is therefore described
by a probability mass function. The formula for the binomial distribution is
P(X = r) = nCr · p^r · (1 − p)^(n − r)
Here,
n = total number of trials
r = total number of successful trials
p = probability of success on a single trial
1 − p = probability of failure
Properties: -
• It involves a sequence of “n identical trials”.
• The trials are independent as the outcome of past events doesn’t decide or affect the outcome
of the present event.
• Two outcomes are possible, “success or failure”, “win or lose” or “gain or lose” for each
outcome.
• The probability of success on each trial, denoted by “p”, doesn’t alter from trial to trial.
When the probabilities of success and failure are equal (p = 0.5), the graph of the binomial
distribution is symmetric.
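A minimal sketch of evaluating this formula (assuming Python 3.8+ for math.comb; the dice example is invented for illustration):

from math import comb

def binomial_pmf(n: int, r: int, p: float) -> float:
    # P(X = r) = nCr * p^r * (1 - p)^(n - r)
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# Hypothetical example: exactly 3 sixes in 10 rolls of a fair die
print(binomial_pmf(10, 3, 1 / 6))  # ~0.155

# With p = 0.5 the distribution is symmetric: P(X = r) = P(X = n - r)
print(binomial_pmf(10, 2, 0.5), binomial_pmf(10, 8, 0.5))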

2) Poisson Distribution: - A Poisson random variable is the number of successes that result from a
Poisson experiment, and its probability distribution is known as the Poisson distribution. It can be
expressed as
P(X = x) = (µ^x · e^(−µ)) / x!
where X is the Poisson random variable, x is the number of successes, and the mean µ is the
fundamental parameter of this distribution.
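A minimal sketch of the Poisson probability mass function (assuming Python with only the standard library; the help-desk figures are invented for illustration):

from math import exp, factorial

def poisson_pmf(x: int, mu: float) -> float:
    # P(X = x) = (mu^x * e^(-mu)) / x!
    return mu**x * exp(-mu) / factorial(x)

# Hypothetical example: a help desk averages mu = 4 calls per hour;
# probability of exactly 2 calls in the next hour
print(poisson_pmf(2, 4.0))  # ~0.147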

B. Continuous Probability Distribution: - A random variable X has a continuous probability distribution
if it can take infinitely many, uncountable values. A continuous distribution is made up of continuous
variables and is expressed with a probability density function that describes the shape of the
distribution.
Properties: -
• The graph of the continuous probability distribution is mostly a smooth curve. It is usually
represented by an equation of a function.
• The total area beneath the curve is 1.
• The area that is present in between the horizontal axis and the curve from value a to value b is
called the probability of the random variable that can take the value in the interval (a, b).

1) Normal Distribution: - This is the most common distribution in probability and statistics, used
frequently in finance, investing, science, and engineering. The probability density function for the
normal distribution is defined as
f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))
where µ and σ represent the mean (the centre of the distribution) and the standard deviation (how
spread out the distribution is) of the population, respectively.
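A minimal sketch of evaluating this density (assuming Python with only the standard library):

from math import sqrt, pi, exp

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    # f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# The density peaks at the mean and falls off symmetrically
print(normal_pdf(0.0, 0.0, 1.0))   # ~0.3989 at the peak
print(normal_pdf(1.0, 0.0, 1.0))   # ~0.2420, one sd away
print(normal_pdf(-1.0, 0.0, 1.0))  # the same value, by symmetry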
8. Properties and Applications of Normal Curves
A random variable following the normal distribution can take any value in a given range. For example,
consider the height of the students in a school: the distribution can take any value, but it will be bounded
in a range, say 0 to 6 ft. This limitation is imposed physically by our query. The normal distribution itself,
however, is not bounded: its range can extend from −∞ to +∞ and we still get a smooth curve. Such random
variables are called continuous variables, and the normal distribution then provides the probability of the
value lying in a particular range for a given experiment.
A normal distribution is symmetric from the peak of the curve, where the mean is. This means that most of
the observed data is clustered near the mean, while the data become less frequent when farther away from
the mean. The resultant graph appears as bell-shaped where the mean, median, and mode are of the same
values and appear at the peak of the curve. The graph is perfectly symmetrical: if you fold it at the
middle, you will get two equal halves, since one-half of the observable data points fall on each side of the
graph.
Properties: - All normal distributions share the following characteristics:
1. Symmetric: -A normal distribution comes with a perfectly symmetrical shape. This means that the
distribution curve can be divided in the middle to produce two equal halves. The symmetric shape occurs
when one-half of the observations fall on each side of the curve.
2. The mean, median, and mode are equal: - The middle point of a normal distribution is the point with the
maximum frequency, which means that it possesses the most observations of the variable. The midpoint is
also the point where these three measures fall. The measures are equal in a perfectly normal
distribution.
3. Empirical rule: -In normally distributed data, there is a constant proportion of the data lying under the
curve between the mean and a specific number of standard deviations from the mean: approximately 68.27%
of all cases fall within ±1 standard deviation of the mean, 95.45% fall within ±2 standard deviations, and
99.73% fall within ±3 standard deviations (see the sketch after this list).
4. Skewness and kurtosis: -Skewness and kurtosis are coefficients that measure how different a distribution
is from a normal distribution. Skewness measures the symmetry of a normal distribution while kurtosis
measures the thickness of the tail ends relative to the tails of a normal distribution.
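The empirical-rule percentages quoted in property 3 can be verified with a short sketch (assuming Python; for a normal distribution, the fraction of cases within k standard deviations of the mean equals erf(k/√2)):

from math import erf, sqrt

# Fraction of a normal population within ±k standard deviations of the mean
for k in (1, 2, 3):
    print(f"within ±{k} sd: {100 * erf(k / sqrt(2)):.2f}%")
# within ±1 sd: 68.27%
# within ±2 sd: 95.45%
# within ±3 sd: 99.73%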
Applications: -
1. To determine the percentage of cases (in a normal distribution) within given limits or scores.
2. To determine the percentage of cases that are above or below a given score or reference point.
3. To determine the limits of scores which include a given percentage of cases.
4. To determine the percentile rank of a student in his group.
5. To find out the percentile value of a student’s percentile rank.
6. To compare the two distributions in terms of overlapping.
7. To determine the relative difficulty of test items, and
8. To divide a group into sub-groups according to ability and assign grades.

9. Statistical Inference
Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, and
presentation of numerical data. In other words, it is defined as the collection of quantitative data. The
main purpose of statistics is to draw an accurate conclusion about a larger population using a limited
sample.
Types of Statistics: -Statistics can be classified into two different categories. The two different types of
Statistics are:
• Descriptive Statistics
• Inferential Statistics
In statistics, descriptive statistics describe the data, whereas inferential statistics help you make predictions
from the data. In inferential statistics, data taken from a sample are used to generalize about the
population. In general, to infer means to draw conclusions about something, so statistical inference means
drawing conclusions about the population. To draw conclusions about the population, it uses various
statistical analysis techniques.
Statistical Inference: - Statistical inference is the process of analysing the result and making conclusions
from data subject to random variation. It is also called inferential statistics. Hypothesis testing
and confidence intervals are the applications of the statistical inference. Statistical inference is a method of
making decisions about the parameters of a population, based on random sampling. It helps to assess the
relationship between the dependent and independent variables. The purpose of statistical inference is to
estimate the uncertainty or sample-to-sample variation. It allows us to provide a probable range of values
for the true values of something in the population. The components used for making statistical inference
are:
• Sample Size
• Variability in the sample
• Size of the observed differences
Types of Statistical Inference: - There are different types of statistical inferences that are extensively used
for making conclusions. They are:
• One sample hypothesis testing
• Confidence Interval
• Pearson Correlation
• Bi-variate regression
• Multi-variate regression
• Chi-square statistics and contingency table
• ANOVA or T-test
Statistical Inference Procedure: - The procedure involved in inferential statistics are:
• Begin with a theory
• Create a research hypothesis
• Operationalize the variables
• Recognize the population to which the study results should apply
• Formulate a null hypothesis for this population
• Accumulate a sample from the population and continue the study
• Conduct statistical tests to see if the collected sample properties are adequately different from what
would be expected under the null hypothesis to be able to reject the null hypothesis
Statistical Inference Solution: - Statistical inference solutions make efficient use of statistical data relating
to groups of individuals or trials. They deal with all aspects of the data, including its collection,
investigation, analysis, and organization. Through statistical inference solutions, people working in diverse
fields can acquire knowledge from their data. Some statistical inference solution facts are:
• It is common to assume that the observed sample consists of independent observations from a
population of a given type, such as a Poisson or normal population
• Statistical inference solutions are used to evaluate the parameter(s) of the assumed model, like a normal
mean or a binomial proportion
Importance of Statistical Inference: - Inferential statistics is important for examining data properly. To
reach an accurate conclusion, proper data analysis is needed to interpret the research results. Statistical
inference is mainly used for making future predictions from observations in different fields, and it helps
us make inferences about the data. It has a wide range of applications in different fields, such as:
• Business Analysis
• Artificial Intelligence
• Financial Analysis
• Fraud Detection
• Machine Learning
• Share Market
• Pharmaceutical Sector

10. Hypothesis Testing: Types and Process
Hypothesis testing in statistics refers to analysing an assumption about a population parameter. It is used to
make an educated judgement about an assumption using statistics. Using sample data, hypothesis testing
assesses how plausible the assumption is for the entire population from which the sample is drawn.
Any hypothetical statement we make may or may not be valid, and it is then our responsibility to provide
evidence for its possibility. To approach any hypothesis, we follow these four simple steps that test its
validity.
1) First, we formulate two hypothetical statements such that only one of them is true. By doing so, we
can check the validity of our own hypothesis.
2) The next step is to formulate the statistical analysis to be followed based upon the data points.
3) Then we analyse the given data using our methodology.
4) The final step is to analyse the result and judge whether to reject or retain the null hypothesis.
Types Of Hypothesis Testing: - There are several types of hypothesis testing, and they are used based on
the data provided. Depending on the sample size and the data given, we choose among different hypothesis
testing methodologies. Here starts the use of hypothesis testing tools in research methodology.
1) Normality - This type of testing is used to check for a normal distribution in a population sample. If
the data points are grouped around the mean, they are equally likely to lie above or below it. The
shape of the distribution resembles a bell curve equally distributed on either side of the mean.
2) T-test - This test is used when the sample size in a normally distributed population is comparatively
small, and the standard deviation is unknown. Usually, if the sample size drops below 30, we use a
T-test to find the confidence intervals of the population.
3) Chi-Square Test - The Chi-Square test is used to test the population variance against the known or
assumed value of the population variance. It is also a better choice to test the goodness of fit of a
distribution of data. The two most common Chi-Square tests are the Chi-Square test of independence
and the chi-square test of variance.
4) ANOVA- Analysis of Variance or ANOVA compares the data sets of two different populations or
samples. It is similar in its use to the t-test or the Z-test, but it allows us to compare more than two
sample means. ANOVA allows us to test the significance between an independent variable and a
dependent variable, namely X and Y, respectively.
5) Z-test - It is a statistical test of whether the means of two population samples are different when
their variances are known. For a Z-test, the population is assumed to be normally distributed. A z-test
is better suited in the case of large sample sizes greater than 30. This is due to the central limit
theorem that as the sample size increases, the samples are considered to be distributed normally.
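A minimal sketch of one of these tests, the one-sample t-test (assuming Python with scipy installed; the exam scores are invented for illustration):

import numpy as np
from scipy import stats

# Hypothetical small sample (n < 30, population sd unknown): exam scores
scores = np.array([72, 69, 75, 71, 68, 74, 70, 73])

# One-sample t-test of H0: population mean = 70
t_stat, p_value = stats.ttest_1samp(scores, popmean=70)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Reject H0 at alpha = 0.05 only if p < 0.05
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")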
Procedure Of Hypothesis Testing: -
i. State the Null Hypothesis: - The null hypothesis can be thought of as the opposite of the "guess" the
researchers made (in this example, a biologist thinks that plant height will differ across the
fertilizers). So the null would be that there is no difference among the groups of plants.
Specifically, in more statistical language, the null for an ANOVA is that the means are the same. We
state the null hypothesis as:
H0: μ1 = μ2 = ⋯ = μT
for T levels of an experimental treatment.
ii. State the Alternative Hypothesis
HA: treatment level means not all equal
The reason we state the alternative hypothesis this way is that if the null is rejected, there are many
possibilities.
For example, μ1 ≠ μ2 = ⋯ = μT is one possibility, as is μ1 = μ2 ≠ μ3 = ⋯ = μT. Many people make the
mistake of stating the alternative hypothesis as μ1 ≠ μ2 ≠ ⋯ ≠ μT, which says that every mean differs
from every other mean. This is a possibility, but only one of many possibilities. To cover all
alternative outcomes, we resort to a verbal statement of ‘not all equal’ and then follow up with mean
comparisons to find out where differences among means exist. In our example, this means that
fertilizer 1 may result in plants that are really tall, but fertilizers 2, 3, and the plants with no fertilizers
don't differ from one another. A simpler way of thinking about this is that at least one mean is
different from all others.
iii. Set α: - If we look at what can happen in a hypothesis test, we can construct the following
contingency table:

            In Reality
Decision    H0 is TRUE     H0 is FALSE
Accept H0   Correct        Type II Error (β = probability of a Type II error)
Reject H0   Type I Error   Correct
            (α = probability of a Type I error)
You should be familiar with Type I and Type II errors from your introductory course. It is important to
note that we want to set α before the experiment (a priori), because the Type I error is the more grievous
error to make. The typical value of α is 0.05, establishing a 95% confidence level. For this course, we
will assume α = 0.05, unless stated otherwise.
iv. Collect Data: -Remember the importance of recognizing whether data is collected through an
experimental design or observational study.
v. Calculate a test statistic: -For categorical treatment level means, we use an F statistic, named after
R.A. Fisher. The F value we get from the data is labelled F(calculated).
vi. Construct Acceptance / Rejection regions: - As with all other test statistics, a threshold (critical)
value of F is established. This F value can be obtained from statistical tables or software and is
referred to as F(critical) or Fα. As a reminder, this critical value is the minimum value of the test
statistic (in this case the F test) for us to be able to reject the null: values of F(calculated) below Fα
fall in the acceptance region of the F distribution, and values above Fα fall in the rejection region.

vii. Based on steps 5 and 6, draw a conclusion about H0: -If the F(calculated) from the data is larger than
the Fα, then you are in the rejection region and you can reject the null hypothesis with (1−α) level
of confidence.
Note that modern statistical software condenses steps 6 and 7 by providing a p-value. The p-value
here is the probability of getting an F(calculated) even greater than the one you observe, assuming
the null hypothesis is true. If, by chance, F(calculated) = Fα, then the p-value would exactly equal
α. With larger F(calculated) values, we move further into the rejection region and the p-value
becomes less than α. So, the decision rule is as follows:
If the p-value obtained from the ANOVA is less than α, then reject H0 and accept HA.
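The whole procedure can be illustrated with a short sketch (assuming Python with scipy installed; the fertilizer measurements are invented to echo the plant-height example above):

from scipy import stats

# Hypothetical plant heights (cm) under three fertilizer treatments
fert1 = [21.0, 23.5, 22.8, 24.1, 22.2]
fert2 = [18.9, 19.5, 20.1, 18.4, 19.8]
fert3 = [19.2, 20.0, 19.6, 20.5, 19.1]

# One-way ANOVA: H0 is that all treatment means are equal
f_calc, p_value = stats.f_oneway(fert1, fert2, fert3)
print(f"F(calculated) = {f_calc:.2f}, p = {p_value:.4f}")

# Decision rule from step vii: reject H0 when the p-value is below alpha
alpha = 0.05
print("Reject H0" if p_value < alpha else "Fail to reject H0")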
