RM Sec 3 1
1. Descriptive Statistics
The term “descriptive statistics” refers to the analysis, summary, and presentation of findings related to a data set derived from a sample or an entire population. Descriptive statistics comprises three main categories – Frequency Distribution, Measures of Central Tendency, and Measures of Variability. It summarizes and organizes the characteristics of a data set, where a data set is a collection of responses or observations from a sample or an entire population. In quantitative research, after collecting data, the first step of statistical analysis is to describe the characteristics of the responses, such as the average of one variable (e.g., age) or the relationship between two variables (e.g., age and creativity).
Types of descriptive statistics: -
1. The distribution concerns the frequency of each value.
2. The central tendency concerns the averages of the values.
3. The variability or dispersion concerns how spread out the values are.
You can apply these to assess only one variable at a time, in univariate analysis, or to compare two or more,
in bivariate and multivariate analysis.
1. Frequency distribution: -A data set is made up of a distribution of values, or scores. In tables or
graphs, you can summarize the frequency of every possible value of a variable in numbers or
percentages.
• Simple frequency distribution table- For the variable of gender, you list all possible answers in the left-hand column. You count the number or percentage of responses for each answer and display it in the right-hand column.
Gender Number
Man 182
Woman 235
No answer 27
• Grouped frequency distribution table- In a grouped frequency distribution, you can group
numerical response values and add up the number of responses for each group. You can also
convert each of these numbers to percentages.
Library visits in the past year Percent
0–4 6%
5–8 20%
9–12 42%
13–16 24%
17+ 8%
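As a minimal sketch of how such frequency tables can be produced programmatically (the visit counts below are hypothetical, not taken from the survey above), Python's standard library is enough:

from collections import Counter

# Hypothetical raw responses for "library visits in the past year"
visits = [3, 7, 11, 2, 15, 9, 10, 4, 12, 18, 6, 11]

# Simple frequency distribution: count each distinct value
simple = Counter(visits)
print(simple)

# Grouped frequency distribution: bin the values as in the table above
bins = [(0, 4), (5, 8), (9, 12), (13, 16)]
grouped = {f"{lo}-{hi}": sum(lo <= v <= hi for v in visits) for lo, hi in bins}
grouped["17+"] = sum(v >= 17 for v in visits)

# Convert each count to a percentage of all responses
n = len(visits)
percent = {group: 100 * count / n for group, count in grouped.items()}
print(percent)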
2. Measures of central tendency: -Measures of central tendency estimate the center, or average, of a data set. The mean, median, and mode are three ways of finding the average. Here we will demonstrate how to calculate each of them using the first six responses of our survey.
• Mean -The mean, or M, is the most commonly used method for finding the average.
To find the mean, simply add up all response values and divide the sum by the total number
of responses. The total number of responses or observations is called N.
Mean number of library visits: M = ΣX / N (add up the six response values and divide the sum by 6).
• Median -The median is the value that’s exactly in the middle of a data set.
To find the median, order the response values from smallest to largest. Then the median is the number in the middle. If there are two numbers in the middle, find their mean.
Median number of library visits: ordering the six responses leaves two middle numbers, 3 and 12, so the median is the mean of the two middle numbers: (3 + 12) / 2 = 7.5.
• Mode - The mode is simply the most popular or most frequent response value. A data set can have no mode, one mode, or more than one mode.
To find the mode, order your data set from lowest to highest and find the response that occurs
most frequently.
Mode number of library visits: the response value that occurs most frequently among the six responses.
3. Measures of variability or Dispersion: -Measures of variability give you a sense of how spread out
the response values are. The range, standard deviation and variance each reflect different aspects of
spread.
• Range- The range gives you an idea of how far apart the most extreme response scores are.
To find the range, simply subtract the lowest value from the highest value.
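As a minimal sketch tying these measures together (the six responses below are hypothetical, chosen only to be consistent with the middle values 3 and 12 shown in the median example), Python's statistics module computes each one:

import statistics

# Hypothetical six survey responses for number of library visits
responses = [15, 3, 12, 0, 24, 3]

mean = statistics.mean(responses)               # sum of values / N = 9.5
median = statistics.median(responses)           # (3 + 12) / 2 = 7.5
mode = statistics.mode(responses)               # most frequent value: 3
value_range = max(responses) - min(responses)   # highest minus lowest: 24
stdev = statistics.stdev(responses)             # sample standard deviation
variance = statistics.variance(responses)       # sample variance

print(mean, median, mode, value_range, stdev, variance)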
2. Tabulation
Tabulation refers to the system of processing data or information by arranging it into a table. With tabulation, numeric data is arrayed logically and systematically into columns and rows to aid in its statistical analysis. The purpose of tabulation is to present a large mass of complicated information in an orderly fashion and allow viewers to draw reasonable conclusions and interpretations from it.
To tabulate data correctly, one must learn about the eight essential parts of a table. These are as follows –
• Table Number – This is the first part of a table and is given on top of any table to facilitate easy
identification and for further reference.
• Title of the Table – One of the most important parts of any table, the title is placed at the top of the table and describes its contents. It is imperative that the title be brief, crisp, and carefully worded to describe the table's contents effectively.
• Headnote – The headnote of a table is presented in the portion just below the title. It provides
information about the unit of data in the table, like “amount in Rupees” or “quantity in kilograms”,
etc.
• Column Headings or Captions – Captions are the portion of the table on top of each column which
explains the figures under each column.
• Row Headings or Stubs – The title of each horizontal row is called a stub.
• Body of a Table – This is the portion that contains the numeric information collected from
investigated facts. The data in the body is presented in rows which are read horizontally from left to
right and in columns, read vertically from top to bottom.
• Footnote – Given at the bottom of a table above the source note, a footnote is used to state any fact that is not clear from the table's title, headings, captions or stubs. For instance, if a table denotes the profit earned by a company, a footnote can be used to state whether said profit was calculated before or after tax.
• Source Note – As its name suggests, a source note refers to the source from where the table’s
information has been collected.
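As a sketch, a small table with hypothetical figures shows all eight parts in place:

Table 1                                   (table number)
Profit of Company X, 2021 to 2023         (title)
Amount in Rupees lakh                     (headnote)

Year      Profit*
2021      120
2022      150
2023      180

*Profit before tax.                       (footnote)
Source: Annual reports of Company X.      (source note)

Here "Profit" is a caption, the years in the left column are the stubs, and the figures form the body of the table.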
Objectives of Tabulation: - Tabulation essentially bridges the gap between collecting data and analysing it. The primary objectives of tabulation can be summarized as below –
• For Simplification of Complex Data – When any information is tabulated, the volume of raw data
is compressed and presented in a much more simplified manner. This facilitates easy comprehension
and analysis of previously complex data.
• To Highlight Important Information – Representing any data in tabular form increases the scope to
highlight important information. Since data is presented in a concise manner without any textual
explanation, any crucial information is automatically highlighted without difficulty.
• To Enable Easy Comparison – When data is presented in an orderly fashion in rows and columns, it becomes easier to compare it on the basis of several parameters. For example, it becomes easier to determine the month in which a country received the maximum amount of rainfall if the data is presented in a table. Otherwise, there always remains room for mistakes in processing the data correctly.
• To Help in the Statistical Analysis of Data – Statistical analysis involves computing the correlation, average, dispersion, etc., of data. When information is presented in an organised manner in a table, statistical analysis becomes a lot simpler.
• Saves Space – Even though it might not seem as important as the other objectives of tabulation, saving space without sacrificing the quality of data can be extremely helpful in the long run. Additionally, a table helps to present facts in a much more concise manner than page after page of text.
Types of Tabulation – Generally, tabulation can be classified into two types – simple and complex
tabulation.
A. Simple Tabulation- This is the process of tabulation through which information regarding one or
more independent questions is illustrated. It is also known as one-way tabulation.
B. Complex Tabulation- These are the types of tables which represent the division of data into two or
more categories based on two or more characteristics. This type of data tabulation can be divided
into three types. These are –
• Two-Way Tables – These tables illustrate information collected from two mutually dependent questions. For instance, say that a table has to illustrate the population of different states of India. This can be done in a one-way table. But if the population has to be compared in terms of the total number of males and females in each state, it will require a two-way table (a small illustration follows this list).
• Three-Way Table – Like the above-mentioned category, three-way tables illustrate information collected from three mutually dependent and interrelated questions. Let us take the above example and elaborate on it further with another category added to the table – the level of literacy amongst the male and female population in each state. The tabulation for these categories has to be put down in a three-way table.
• Manifold Table – These tables are utilised to illustrate information collected from more than
three interrelated questions or characteristics.
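As a sketch of the two-way case, with hypothetical placeholder figures rather than census data:

State      Male population    Female population
State A    1,050,000          1,010,000
State B    820,000            830,000

Splitting each of the male and female columns further by literacy would turn this into a three-way table.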
Rules of Tabulation- There are a few general rules that have to be followed while constructing tables. These
are –
i. Tables should be self-explanatory. Even though footnotes form a part of tables, it should not be necessary to rely on them to explain the meaning of the data presented in a table.
ii. If the volume of information is substantial, it is best to put it down in multiple tables instead of a single one. This reduces the chance of mistakes, whereas a single overloaded table defeats the purpose of forming a table in the first place. However, each table formed should also be complete in itself and serve the purpose of the analysis.
iii. The number of rows and columns should be kept minimal to present information in a crisp and
concise manner.
iv. Before tabulating, data should be approximated, wherever necessary.
v. Stubs and captions should be self-explanatory and should not require the help of footnotes to be
comprehended.
vi. If certain portions of the data collected cannot be tabulated under any stub or caption, they should be put down in a separate table under the heading of miscellaneous.
vii. Quantity and quality of data should not be compromised under any scenario while forming a table.
5. Correlation
The correlation is one of the most common and most useful statistics. A correlation is a single number that
describes the degree of relationship between two variables. It is a measure of the extent to which two
variables are related. There are three possible results of a correlational study: a positive correlation, a
negative correlation, and no correlation.
• A positive correlation is a relationship between two variables in which both variables move in the same direction: as one variable increases, the other increases, and as one decreases, the other decreases. An example of positive correlation would be height and weight, since taller people tend to be heavier.
• A negative correlation is a relationship between two variables in which an increase in one variable
is associated with a decrease in the other. An example of negative correlation would be height above
sea level and temperature. As you climb the mountain (increase in height) it gets colder (decrease in
temperature).
• A zero correlation exists when there is no relationship between two variables. For example, there is no relationship between the amount of tea drunk and level of intelligence.
A correlation can be expressed visually by drawing a scattergram (also known as a scatterplot, scatter graph, scatter chart, or scatter diagram). A scattergram is a graphical display that shows the relationship or association between two numerical variables (or co-variables), which are represented as points (or dots) for each pair of scores. A scatter graph indicates the strength and direction of the correlation between the co-variables.
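As a minimal sketch (the height/weight pairs are hypothetical), the Pearson correlation coefficient behind such a scattergram can be computed with NumPy:

import numpy as np

# Hypothetical paired observations: height (cm) and weight (kg)
height = np.array([155, 160, 165, 170, 175, 180, 185])
weight = np.array([52, 58, 61, 66, 70, 76, 82])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r for the two co-variables
r = np.corrcoef(height, weight)[0, 1]
print(f"Pearson r = {r:.3f}")   # close to +1: a strong positive correlation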
Uses of Correlations: -
• Prediction- If there is a relationship between two variables, we can make predictions about one from the other.
• Validity- Concurrent validity (correlation between a new measure and an established measure).
• Reliability- Test-retest reliability (are measures consistent?).
Inter-rater reliability (are observers consistent?).
• Theory verification- Predictive validity.
6. Regression Analysis
Regression analysis is a set of statistical methods used for the estimation of relationships between a
dependent variable and one or more independent variables. It can be utilized to assess the strength of the
relationship between variables and for modelling the future relationship between them.
Regression analysis includes several variations, such as linear, multiple linear, and nonlinear. The most
common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used for
more complicated data sets in which the dependent and independent variables show a nonlinear relationship.
Types of Regression Analysis: -
1. Linear Regression- Linear regression is one of the most basic types of regression in machine learning. The linear regression model consists of a predictor variable and a dependent variable related linearly to each other. If the data involve more than one independent variable, the model is called multiple linear regression.
The following equation denotes the linear regression model:
y = mx + c + e
where m is the slope of the line, c is the intercept, and e represents the error in the model.
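A minimal sketch of fitting such a model by ordinary least squares (the data are hypothetical; np.polyfit is only one of several ways to do this):

import numpy as np

# Hypothetical predictor x and response y with a roughly linear relation
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Fit y = m*x + c as a degree-1 polynomial
m, c = np.polyfit(x, y, 1)

# Residuals e = y - (m*x + c) capture the error term in the model
e = y - (m * x + c)
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")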
2. Logistic Regression: -Logistic regression is a type of regression analysis technique used when the dependent variable is discrete, for example 0 or 1, or true or false. This means the target variable can take only two values, and a sigmoid curve denotes the relation between the target variable and the independent variables. The logit function is used in logistic regression to measure the relationship between the target variable and the independent variables. The following equation denotes the logistic regression model:
logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + … + bkXk
where p is the probability of occurrence of the feature.
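As a sketch on hypothetical data, scikit-learn's LogisticRegression estimates the coefficients b0…bk (assuming scikit-learn is available):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical single feature X and binary target y (0 or 1)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba applies the sigmoid to b0 + b1*x and returns p, the
# probability of class 1, for a new observation
p = model.predict_proba(np.array([[2.2]]))[0, 1]
print(f"b0 = {model.intercept_[0]:.3f}, b1 = {model.coef_[0, 0]:.3f}, p = {p:.3f}")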
3. Ridge Regression: - This is another type of regression in machine learning, usually used when there is a high correlation between the independent variables. With multicollinear data, the least-squares estimates remain unbiased, but their variances become very large, making the estimates unreliable. Therefore, a bias term (a penalty on the size of the coefficients) is introduced in the equation of Ridge Regression, trading a small amount of bias for a large reduction in variance. This is a powerful regression method where the model is less susceptible to overfitting.
4. Lasso Regression: -Lasso Regression is a type of regression in machine learning that performs regularization along with feature selection. It penalizes the absolute size of the regression coefficients. As a result, some coefficient values shrink all the way to zero, which does not happen in the case of Ridge Regression.
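A sketch contrasting the two penalties on hypothetical collinear data (scikit-learn assumed; alpha controls the penalty strength):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)    # x2 is nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=50)

# Ridge shrinks both coefficients toward zero but keeps both features
print(Ridge(alpha=1.0).fit(X, y).coef_)

# Lasso can drive one of the redundant coefficients exactly to zero
print(Lasso(alpha=0.1).fit(X, y).coef_)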
7. Probability Distribution
A probability distribution is a table or an equation that links each outcome of a statistical experiment with its probability of occurrence. To understand the concept of a probability distribution, it is important to know about variables, random variables, and the standard notation.
• Variables: A variable is defined as any symbol that can take values from a particular set.
• Random Variable: When the value of a variable is the outcome of a statistical experiment, that variable is called a random variable. It can be discrete (taking a countable set of values), continuous (taking any value in a range), or a mix of both.
• Notations: Mostly, statisticians use capital letters to denote a random variable and lower-case letters to represent any of its particular values.
✓ X denotes the random variable X.
✓ P(X) denotes the probability distribution of X.
✓ P(X = r) denotes the probability that the random variable X is equal to a particular value, represented by r. For example, P(X = 1) denotes the probability that the random variable X is equal to 1.
Probability Distribution: - Probability distributions give the possible outcomes of any random event together with their probabilities. A distribution is also identified in terms of its underlying sample space, the set of possible outcomes of the random experiment. These settings can be a set of prime numbers, a set of real numbers, a set of complex numbers, or a set of any other entities. Probability distributions are a part of probability and statistics. A random experiment is an experiment whose result cannot be predicted in advance. For example, if we toss a coin, we cannot predict what will appear, either the head or the tail. A possible result of a random experiment is known as an outcome (or sample point), and the set of all outcomes is termed the sample space. From these possibilities, we can design a probability table on the basis of the variable's values and their probabilities.
Types of Probability Distribution: - There are two types of probability distribution which are used for
distinct purposes and various types of data generation processes.
A. Discrete Probability Distribution
B. Continuous probability Distribution
1) Normal Distribution: - It is the most common distribution in all of probability and statistics and is used frequently in finance, investing, science, and engineering. The probability density function for the normal distribution is defined as
f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²))
where µ and σ represent the mean (the point at the center of the distribution) and the standard deviation (how spread out the distribution is) of the population, respectively.
2) Poisson Distribution: - Poisson random variables count the number of successes that result from a Poisson experiment, and their corresponding probability is known as the Poisson distribution, which can be expressed as
P(X = x) = (e^(−µ) µ^x) / x!
where X is the Poisson random variable, x is the number of successes, and the mean µ is the fundamental parameter of this distribution.
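A sketch evaluating both distributions numerically (SciPy assumed; the parameter values are arbitrary):

from scipy import stats

# Poisson: probability of x successes when the mean is mu
mu = 4
print(stats.poisson.pmf(2, mu))              # P(X = 2)

# Normal: density and cumulative probability for mean 0, sd 1
print(stats.norm.pdf(0.0, loc=0, scale=1))   # height of the bell curve at the mean
print(stats.norm.cdf(1.96, loc=0, scale=1))  # P(X <= 1.96), about 0.975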
8. Properties and Applications of Normal Curves
The random variables following the normal distribution are those whose values can take any value in a given range. An example is finding the height of the students in a school. Here, the measured values are physically bounded, say between 0 and 6 ft, but this limitation is imposed by the nature of our query, not by the distribution. The normal distribution itself places no such restriction: its range can extend from −∞ to +∞, and we still find a smooth curve. Such random variables are called continuous variables, and the normal distribution then provides the probability of the value lying in a particular range for a given experiment.
A normal distribution is symmetric about the peak of the curve, where the mean is. This means that most of the observed data are clustered near the mean, while data become less frequent the farther they lie from the mean. The resulting graph is bell-shaped, with the mean, median, and mode taking the same value and appearing at the peak of the curve. The graph is perfectly symmetric: if you fold it at the middle, you get two equal halves, since one half of the observable data points falls on each side of the graph.
Properties: - All normal distributions share the following characteristics:
1. Symmetric: -A normal distribution has a perfectly symmetrical shape. This means that the distribution curve can be divided in the middle to produce two equal halves. The symmetric shape occurs when one half of the observations falls on each side of the curve.
2. The mean, median, and mode are equal: - The middle point of a normal distribution is the point with the maximum frequency, which means that it possesses the most observations of the variable. The midpoint is also the point where all three measures fall, and the three measures are exactly equal in a perfectly normal distribution.
3. Empirical rule: -In normally distributed data, a constant proportion of the area under the curve lies between the mean and a given number of standard deviations from the mean. Approximately 68% of all cases fall within +/- one standard deviation from the mean, about 95% of all cases fall within +/- two standard deviations, and about 99.7% of all cases fall within +/- three standard deviations from the mean (verified numerically in the sketch after this list).
4. Skewness and kurtosis: -Skewness and kurtosis are coefficients that measure how different a distribution is from a normal distribution. Skewness measures the asymmetry of a distribution, while kurtosis measures the thickness of its tail ends relative to the tails of a normal distribution.
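A sketch verifying the empirical rule with the standard normal CDF (SciPy assumed):

from scipy import stats

# Area under the standard normal curve within k standard deviations of the mean
for k in (1, 2, 3):
    area = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"within +/- {k} sd: {area:.4f}")
# Prints approximately 0.6827, 0.9545 and 0.9973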
Applications: -
1. To determine the percentage of cases (in a normal distribution) within given limits or scores.
2. To determine the percentage of cases that are above or below a given score or reference point.
3. To determine the limits of scores which include a given percentage of cases.
4. To determine the percentile rank of a student in his group.
5. To find out the percentile value of a student’s percentile rank.
6. To compare the two distributions in terms of overlapping.
7. To determine the relative difficulty of test items, and
8. To divide a group into sub-groups according to a certain ability and assign grades.
9. Statistical Inference
Statistics is a branch of Mathematics that deals with the collection, analysis, interpretation, and presentation of numerical data. In other words, it is the study of quantitative data. The main purpose of Statistics is to make an accurate conclusion about a greater population using a limited sample.
Types of Statistics: -Statistics can be classified into two different categories. The two different types of
Statistics are:
• Descriptive Statistics
• Inferential Statistics
In Statistics, descriptive statistics describe the data, whereas inferential statistics help you make predictions from the data. In inferential statistics, data taken from a sample are used to make generalizations about the population. In general, inference means drawing a conclusion from evidence, so statistical inference means drawing conclusions about the population. To draw a conclusion about the population, it uses various statistical analysis techniques.
Statistical Inference: - Statistical inference is the process of analysing results and drawing conclusions from data subject to random variation. It is also called inferential statistics. Hypothesis testing and confidence intervals are applications of statistical inference. Statistical inference is a method of
making decisions about the parameters of a population, based on random sampling. It helps to assess the
relationship between the dependent and independent variables. The purpose of statistical inference is to estimate the uncertainty or sample-to-sample variation. It allows us to provide a probable range of values
for the true values of something in the population. The components used for making statistical inference
are:
• Sample Size
• Variability in the sample
• Size of the observed differences
Types of Statistical Inference: - There are different types of statistical inferences that are extensively used
for making conclusions. They are:
• One sample hypothesis testing
• Confidence Interval
• Pearson Correlation
• Bi-variate regression
• Multi-variate regression
• Chi-square statistics and contingency table
• ANOVA or T-test
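As a sketch of the first two inferences in this list, a one-sample hypothesis test and a confidence interval on hypothetical data (SciPy assumed):

import numpy as np
from scipy import stats

# Hypothetical sample; H0: the population mean equals 10
sample = np.array([12.1, 9.8, 11.4, 10.9, 12.6, 9.5, 11.1, 10.4])

t_stat, p_value = stats.ttest_1samp(sample, popmean=10)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")   # reject H0 if p < alpha

# 95% confidence interval for the population mean
mean = sample.mean()
sem = stats.sem(sample)   # standard error of the mean
lo, hi = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI: ({lo:.2f}, {hi:.2f})")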
Statistical Inference Procedure: - The procedure involved in inferential statistics is as follows:
• Begin with a theory
• Create a research hypothesis
• Operationalize the variables
• Recognize the population to which the study results should apply
• Formulate a null hypothesis for this population
• Accumulate a sample from the population and continue the study
• Conduct statistical tests to see if the collected sample properties are adequately different from what
would be expected under the null hypothesis to be able to reject the null hypothesis
Statistical Inference Solution: - Statistical inference solutions make efficient use of statistical data relating to groups of individuals or trials. They deal with all aspects of the data, including their collection, investigation, analysis, and organization. Through statistical inference solutions, people can acquire knowledge to apply in diverse fields. Some statistical inference solution facts are:
• It is common to assume that the observed sample consists of independent observations from a population of a known type, such as Poisson or normal
• Statistical inference solutions are used to evaluate the parameter(s) of the assumed model, such as a normal mean or a binomial proportion
Importance of Statistical Inference: - Inferential statistics is important for examining data properly. To make an accurate conclusion, proper data analysis is needed to interpret the research results. It is widely used for making future predictions from observations in different fields. It helps us to make inferences about the data. Statistical inference has a wide range of applications in different fields, such as:
• Business Analysis
• Artificial Intelligence
• Financial Analysis
• Fraud Detection
• Machine Learning
• Share Market
• Pharmaceutical Sector
You should be familiar with type I and type II errors from your introductory course. It is important to note that we want to set α before the experiment (a priori) because the Type I error is the more 'grievous' error to make. The typical value of α is 0.05, establishing a 95% confidence level. For this course, we will assume α = 0.05, unless stated otherwise.
iv. Collect Data: -Remember the importance of recognizing whether data is collected through an
experimental design or observational study.
v. Calculate a test statistic: -For categorical treatment level means, we use an F statistic, named after
R.A. Fisher. We will explore the mechanics of computing the F statistic beginning in Lesson 2.
The F value we get from the data is labelled F(Calculated).
vi. Construct Acceptance / Rejection regions: - As with all other test statistics, a threshold (critical)
value of F is established. This F value can be obtained from statistical tables or software and is
referred to as F(critical) or Fα. As a reminder, this critical value is the minimum value for the test
statistic (in this case the F statistic) for us to be able to reject the null. The rejection region lies in the right tail of the F distribution, beyond Fα; values of F(calculated) below Fα fall in the acceptance region.
vii. Based on steps 5 and 6, draw a conclusion about H0: -If the F(calculated) from the data is larger than
the Fα, then you are in the rejection region and you can reject the null hypothesis with (1−α) level
of confidence.
Note that modern statistical software condenses steps 6 and 7 by providing a p-value. The p-value
here is the probability of getting an F(calculated) even greater than what you observe assuming the
null hypothesis is true. If, by chance, F(calculated) = Fα, then the p-value would exactly equal α. With larger F(calculated) values, we move further into the rejection region and the p-value becomes less than α. So, the decision rule is as follows:
If the p-value obtained from the ANOVA is less than α, then reject H0 and accept HA.
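As a sketch of this decision rule on hypothetical treatment groups (SciPy assumed):

from scipy import stats

# Hypothetical responses for three categorical treatment levels
group_a = [23, 25, 21, 24, 26]
group_b = [30, 31, 28, 29, 32]
group_c = [22, 24, 23, 21, 25]

alpha = 0.05
f_calculated, p_value = stats.f_oneway(group_a, group_b, group_c)

print(f"F(calculated) = {f_calculated:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0 and accept HA")   # the group means differ
else:
    print("Fail to reject H0")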