Statistics


What Is Statistics?

Statistics is a branch of applied mathematics that involves the collection, description, analysis, and inference of conclusions from quantitative data. The mathematical theories behind statistics rely heavily on differential and integral calculus, linear algebra, and probability theory.

People who do statistics are referred to as statisticians. They're particularly concerned with determining how to draw reliable conclusions about large groups and general events from the behavior and other observable characteristics of small samples. These small samples represent a portion of the large group or a limited number of instances of a general phenomenon.

Understanding Statistics

Statistics are used in virtually all scientific disciplines, including the physical and social sciences, as well as in business, the humanities, government, and manufacturing. Statistics is fundamentally a branch of applied mathematics that developed from the application of mathematical tools, including calculus and linear algebra, to probability theory.

In practice, statistics rests on the idea that we can learn about the properties of large sets of objects or events (a population) by studying the characteristics of a smaller number of similar objects or events (a sample). Because gathering comprehensive data about an entire population is too costly, difficult, or impossible in many cases, statistics starts with a sample that can be conveniently or affordably observed.

Descriptive and Inferential Statistics


The two major areas of statistics are known as descriptive statistics,
which describes the properties of sample and population data, and
inferential statistics, which uses those properties to test hypotheses
and draw conclusions. Descriptive statistics include mean (average),
variance, skewness, and kurtosis. Inferential statistics include linear
regression analysis, analysis of variance (ANOVA), logit/Probit
models, and null hypothesis testing.

Descriptive statistics are brief informational coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, while measures of variability include standard deviation, variance, minimum and maximum values, kurtosis, and skewness.
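These summary measures are straightforward to compute. A minimal sketch with hypothetical data, using only Python's standard library (skewness and kurtosis are computed from raw moments, since the `statistics` module does not provide them):

```python
import statistics

data = [2.0, 3.5, 3.5, 4.0, 5.5, 7.0, 8.5]  # hypothetical sample

mean = statistics.mean(data)          # central tendency
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.variance(data)  # variability (sample variance)
stdev = statistics.stdev(data)

# Skewness and kurtosis from raw central moments (population formulas):
n = len(data)
mu = sum(data) / n
m2 = sum((x - mu) ** 2 for x in data) / n
m3 = sum((x - mu) ** 3 for x in data) / n
m4 = sum((x - mu) ** 4 for x in data) / n
skewness = m3 / m2 ** 1.5
kurtosis = m4 / m2 ** 2   # equals 3.0 for a normal distribution

print(mean, median, mode)
print(round(stdev, 3), round(skewness, 3), round(kurtosis, 3))
```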
A population is the complete set of individuals, whether that group comprises a nation or a group of people with a common characteristic. In statistics, a population is the pool of individuals from which a statistical sample is drawn for a study. Thus, any selection of individuals grouped by a common feature can be said to be a population. A sample may also refer to a statistically significant portion of a population, not an entire population. For this reason, a statistical analysis of a sample must report the approximate standard deviation, or standard error, of its results relative to the entire population. Only an analysis of an entire population would have no standard error.
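The standard error mentioned above shrinks as the sample grows. A minimal sketch, with made-up measurements, of the standard error of the mean, s/√n:

```python
import math
import statistics

sample = [12, 15, 14, 10, 18, 20, 16, 13]  # hypothetical measurements

n = len(sample)
s = statistics.stdev(sample)   # sample standard deviation
sem = s / math.sqrt(n)         # standard error of the mean

print(round(s, 3), round(sem, 3))
```

Doubling the sample size (all else equal) cuts the standard error by a factor of √2, which is why larger samples yield tighter estimates.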

A sample refers to a smaller, manageable version of a larger group. It is a subset containing the characteristics of a larger population. Samples are used in statistical testing when population sizes are too large for the test to include all possible members or observations. A sample should represent the population as a whole and not reflect any bias toward a specific attribute.

Inferential Statistics

Inferential statistics are tools that statisticians use to draw conclusions about the characteristics of a population from the characteristics of a sample, and to determine how certain they can be of the reliability of those conclusions. Based on the sample size and distribution, statisticians can calculate the probability that statistics, which measure the central tendency, variability, distribution, and relationships between characteristics within a data sample, provide an accurate picture of the corresponding parameters of the whole population from which the sample is drawn.

Inferential statistics are used to make generalizations about large groups, such as estimating average demand for a product by surveying a sample of consumers' buying habits, or attempting to predict future events. This might mean projecting the future return of a security or asset class based on its historical returns.

Understanding Statistical Data


Statistics is rooted in variables. A variable is a countable characteristic or attribute of an item. For example, a car can have variables such as make, model, year, mileage, color, or condition. By combining the variables across a set of data, such as the colors of all cars in a given parking lot, statistics allows us to better understand trends and outcomes.

There are two main types of variables. First, qualitative variables are
specific attributes that are often non-numeric. Many of the examples
given in the car example are qualitative. Other examples of
qualitative variables in statistics are gender, eye color, or city of
birth. Qualitative data is most often used to determine what
percentage of an outcome occurs for any given qualitative
variable. Qualitative analysis often does not rely on numbers. For
example, trying to determine what percentage of women own a
business analyzes qualitative data.

The second type of variable in statistics is quantitative variables. Quantitative variables are studied numerically and only have meaning when tied to a non-numerical descriptor. Like quantitative analysis, this information is rooted in numbers. In the car example above, the mileage driven is a quantitative variable, but the number 60,000 holds no value unless it is understood that this is the total number of miles driven.
Quantitative variables can be further broken into two categories. First, discrete variables take only certain values, implying that there are gaps between potential discrete variable values. The number of points scored in a football game is a discrete variable because:

1. There can be no decimals, and
2. It is impossible for a team to score only one point.

Statistics also makes use of continuous quantitative variables. These values run along a scale. Discrete values have limitations, but continuous variables are often measured in decimals. Any value within possible limits can be obtained when measuring the height of football players, and the heights can be measured down to 1/16th of an inch, if not further.

Statistical Levels of Measurement


Variables and outcomes can be analyzed at several different levels of measurement. Statistics can quantify outcomes in four ways.

Nominal-level Measurement

There's no numerical or quantitative value, and qualities are not ranked. Nominal-level measurements are instead simply labels or categories assigned to other variables. It's easiest to think of nominal-level measurements as non-numerical facts about a variable.

Example: The name of the President elected in 2020 was Joseph Robinette Biden, Jr.

Ordinal-level Measurement

Outcomes can be arranged in an order, but the differences between data values are not meaningful. Although numerical, ordinal-level measurements can't be subtracted from each other in statistics because only the position of each data point matters. Ordinal levels are often incorporated into nonparametric statistics and compared against the total variable group.

Example: American Fred Kerley was the 2nd fastest man at the 2020 Tokyo Olympics based on 100-meter sprint times.

Interval-level Measurement

Outcomes can be arranged in order, and differences between data values now have meaning. Two data points are often used to compare the passing of time or changing conditions within a data set. There is often no "starting point" for the range of data values, and calendar dates or temperatures may not have a meaningful intrinsic zero value.

Example: Inflation hit 8.6% in May 2022. The last time inflation was this high was in December 1981.

Ratio-level Measurement

Outcomes can be arranged in order, and differences between data values now have meaning. But there's also a starting point or "zero value" that gives further meaning to a statistical value. The ratio between data values is meaningful, including a value's distance from zero.

Example: The lowest meteorological temperature recorded was -128.6 degrees Fahrenheit in Antarctica.

Statistics Sampling Techniques


It is often not possible to gather data from every data point within a population. Statistics instead relies on different sampling techniques to create a representative subset of the population that's easier to analyze. There are several primary types of sampling in statistics.
Simple Random Sampling

A simple random sample is a subset of a statistical population in which each member of the subset has an equal probability of being chosen. A simple random sample is meant to be an unbiased representation of a group.

Simple random sampling calls for every member within the population to have an equal chance of being selected for analysis. The entire population is used as the basis for sampling, and any random generator based on chance can select the sample items. For example, 100 individuals are lined up and 10 are chosen at random.

An example of a simple random sample would be the names of 25 employees being chosen out of a hat from a company of 250 employees. In this case, the population is all 250 employees, and the sample is random because each employee has an equal chance of being chosen. Random sampling is used in science to conduct randomized control tests or for blinded experiments.

Simple Random Sampling

Advantages

 Each item within a population has an equal chance of being selected.
 There is less of a chance of sampling bias, as every item is randomly selected.
 This sampling method is easy and convenient for data sets already listed or digitally stored.

Disadvantages

 Incomplete population demographics may exclude certain groups from being sampled.
 Random selection means the sample may not be truly representative of the population.
 Depending on the data set size and format, random sampling may be a time-intensive process.

Random Sampling Techniques


There is no single method for determining the random values to be selected. The analyst cannot simply choose numbers off the top of their head, as human choices are not truly random. For example, the analyst's wedding anniversary may be the 24th, so they may consciously (or subconsciously) pick the random value 24. Instead, the analyst may choose one of the following methods:

 Random lottery. Whether by ping-pong ball or slips of paper, each population number receives an equivalent item that is stored in a box or other indistinguishable container. Then, random numbers are selected by pulling or selecting items without view from the container.
 Physical methods. Simple, early methods of random selection may use dice, flipped coins, or spinning wheels. Each outcome is assigned a value or outcome relating to the population.
 Random number table. Many statistics and research books contain sample tables with randomized numbers.
 Online random number generator. Many online tools exist where the analyst inputs the population size and sample size to be selected.
 Random numbers from Excel. Numbers can be selected in Excel using the =RANDBETWEEN formula. A cell containing =RANDBETWEEN(1,5) will select a single random number between 1 and 5.
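For analysts working outside Excel, the same draw can be reproduced with any standard library random number generator; for example, Python's `random.randint` mirrors `=RANDBETWEEN(1,5)`:

```python
import random

# Python analog of Excel's =RANDBETWEEN(1, 5):
# a random integer between 1 and 5, inclusive.
value = random.randint(1, 5)
print(value)
```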

Systematic sampling is a type of probability sampling method in which sample members from a larger population are selected according to a random starting point but with a fixed, periodic interval. This interval, called the sampling interval, is calculated by dividing the population size by the desired sample size. Despite the sample population being selected in advance, systematic sampling is still thought of as being random if the periodic interval is determined beforehand and the starting point is random.

When carried out correctly on a large population of a defined size, systematic sampling can help researchers, including marketing and sales professionals, obtain representative findings on a huge group of people without having to reach out to each and every one of them.

Systematic Sampling

Systematic sampling calls for a random sample as well, but its technique is slightly modified to make it easier to conduct. A single random number is generated, and individuals are then selected at a specified regular interval until the sample size is complete. For example, 100 individuals are lined up and numbered. The 7th individual is selected for the sample, followed by every subsequent 9th individual, until 10 sample items have been selected.

Examples of Systematic Sampling

As a hypothetical example of systematic sampling, assume that, in a population of 10,000 people, a statistician selects every 100th person for sampling. The sampling intervals can also be systematic, such as choosing a new sample to draw from every 12 hours.

As another example, if you wanted to select a random group of 1,000 people from a population of 50,000 using systematic sampling, all the potential participants must be placed on a list and a starting point would be selected. Once the list is formed, every 50th person on the list (starting the count at the selected starting point) would be chosen as a participant, since 50,000 ÷ 1,000 = 50.

For example, if the selected starting point was 20, the 70th person
on the list would be chosen followed by the 120th, and so on. Once
the end of the list was reached and if additional participants are
required, the count loops to the beginning of the list to finish the
count.
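The 50,000-person example above can be sketched directly. This hypothetical helper computes the sampling interval, steps through the list from the chosen starting point, and wraps back to the beginning if the count runs past the end:

```python
import random

def systematic_sample(population, sample_size, start=None):
    """Select every k-th member after a starting point, wrapping if needed."""
    interval = len(population) // sample_size       # sampling interval k
    if start is None:
        start = random.randrange(interval)          # random starting point
    indices = [(start + i * interval) % len(population)
               for i in range(sample_size)]
    return [population[i] for i in indices]

people = list(range(1, 50_001))                     # population of 50,000
chosen = systematic_sample(people, 1000, start=19)  # start at the 20th person

print(chosen[:3])  # → [20, 70, 120]
```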

Types of Systematic Sampling


Generally, there are three ways to generate a systematic sample:

 Systematic random sampling: The classic form of systematic sampling, where the subject is selected at a predetermined interval.
 Linear systematic sampling: Rather than randomly selecting
the sampling interval, a skip pattern is created following a
linear path.
 Circular systematic sampling: A sample starts again at the
same point after ending.

Systematic Sampling vs. Cluster Sampling


Systematic sampling and cluster sampling differ in how they pull
sample points from the population included in the sample. Cluster
sampling breaks the population down into clusters, while systematic
sampling uses fixed intervals from the larger population to create the
sample.

Systematic sampling selects a random starting point from the population, and then a sample is taken from regular fixed intervals of the population depending on its size. Cluster sampling divides the population into clusters and then takes a simple random sample from each cluster.

Cluster sampling is considered less precise than other methods of sampling. However, it may save costs on obtaining a sample. Cluster sampling is a two-step sampling procedure. It may be used when completing a list of the entire population is difficult. For example, it could be difficult to construct a list of all the customers of a grocery store to interview.

However, a person could create a random subset of stores, which is the first step in the process. The second step is to interview a random sample of the customers of those stores. This is a simple, manual process that can save time and money.
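The two-step grocery-store procedure might look like this in Python, with hypothetical store and customer names:

```python
import random

# Hypothetical population: 30 stores (the clusters), 200 customers each.
stores = {f"store_{s}": [f"store_{s}_customer_{c}" for c in range(200)]
          for s in range(1, 31)}

# Step 1: take a random subset of the clusters (stores).
chosen_stores = random.sample(list(stores), 5)

# Step 2: take a simple random sample of customers within each chosen store.
interviewees = [customer
                for store in chosen_stores
                for customer in random.sample(stores[store], 20)]

print(len(interviewees))  # 5 stores × 20 customers = 100
```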

Limitations of Systematic Sampling


One risk that statisticians must consider when conducting systematic
sampling involves how the list used with the sampling interval is
organized. If the population placed on the list is organized in a
cyclical pattern that matches the sampling interval, the selected
sample may be biased.

For example, a company’s human resources department wants to pick a sample of employees and ask how they feel about company policies. Employees are grouped in teams of 20, with each team headed by a manager. If the list used to pick the sample size is organized with teams clustered together, the statistician risks picking only managers (or no managers at all) depending on the sampling interval.

What are the advantages of systematic sampling?
Systematic sampling is simple to conduct and easy to understand, which is why it’s generally favored by researchers. The central assumption, that the results represent the majority of normal populations, guarantees that the entire population is evenly sampled.

Also, systematic sampling provides an increased degree of control compared with other sampling methodologies because of its process. Systematic sampling also carries a low risk factor because there is a low chance that the data can be contaminated.

What are the disadvantages of systematic sampling?
The main disadvantage of systematic sampling is that the size of the
population is needed. Without knowing the specific number of
participants in a population, systematic sampling does not work well.
For example, if a statistician would like to examine the age of
homeless people in a specific region but cannot accurately obtain
how many homeless people there are, then they won’t have a
population size or a starting point. Another disadvantage is that the
population needs to exhibit a natural amount of randomness to it or
else the risk of choosing similar instances is increased, defeating the
purpose of the sample.
What Is Stratified Random Sampling?
Stratified random sampling is a method of sampling that involves the
division of a population into smaller subgroups known as strata. In
stratified random sampling, or stratification, the strata are formed
based on members’ shared attributes or characteristics, such as
income or educational attainment. Stratified random sampling has
numerous applications and benefits, such as studying population
demographics and life expectancy.

Stratified random sampling is also called proportional random sampling or quota random sampling.

How Stratified Random Sampling Works


When completing analysis or research on a group of entities with similar characteristics, a researcher may find that the population size is too large to complete research on it. To save time and money, an analyst may take a more feasible approach by selecting a small group from the population. The small group is referred to as a sample: a subset of the population used to represent the entire population. A sample may be selected from a population in a number of ways, one of which is the stratified random sampling method.

Stratified random sampling involves dividing the entire population into homogeneous groups called strata (plural of stratum). Random samples are then selected from each stratum. For example, consider an academic researcher who would like to know the number of MBA students in 2021 who received a job offer within three months of graduation.

The researcher will soon find that there were almost 200,000 MBA graduates
for the year. They might decide just to take a simple random sample of
50,000 graduates and run a survey. Better still, they could divide the
population into strata and take a random sample from the strata. To do this,
they would create population groups based on gender, age range, race,
country of nationality, and career background. A random sample from each
stratum is taken in a number proportional to the stratum’s size compared with
the population. These subsets of the strata are then pooled to form a random
sample.

Example of Stratified Random Sampling


Suppose a research team wants to determine the grade point
average (GPA) of college students across the United States. The
research team has difficulty collecting data from all 21 million college
students; it decides to take a random sample of the population by
using 4,000 students.

Now assume that the team looks at the different attributes of the sample participants and wonders if there are any differences in GPAs and students’ majors. Suppose it finds that 560 students are English majors, 1,135 are science majors, 800 are computer science majors, 1,090 are engineering majors, and 415 are math majors. The team wants to use a proportional stratified random sample, where each stratum of the sample is proportional to its share of the population.

Assume the team researches the demographics of college students in the U.S. and finds the percentage of what students major in: 12% major in English, 28% major in science, 24% major in computer science, 21% major in engineering, and 15% major in mathematics. Thus, five strata are created from the stratified random sampling process.

The team then needs to confirm that the stratum of the population is
in proportion to the stratum in the sample; however, they find the
proportions are not equal. The team then needs to resample 4,000
students from the population and randomly select 480 English, 1,120
science, 960 computer science, 840 engineering, and 600
mathematics students.
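The proportional allocation in this example can be reproduced by multiplying each stratum's population share by the total sample size. A sketch using the percentages above (the student names in each stratum are hypothetical):

```python
import random

population_shares = {"English": 0.12, "science": 0.28,
                     "computer science": 0.24, "engineering": 0.21,
                     "mathematics": 0.15}
sample_size = 4000

# Allocate each stratum in proportion to its share of the population.
allocation = {major: round(share * sample_size)
              for major, share in population_shares.items()}
print(allocation)
# → {'English': 480, 'science': 1120, 'computer science': 960,
#    'engineering': 840, 'mathematics': 600}

# Hypothetical pool of students per major to draw from, then a
# simple random sample within each stratum.
students = {major: [f"{major}_{i}" for i in range(1200)]
            for major in population_shares}
sample = {major: random.sample(students[major], n)
          for major, n in allocation.items()}

print(sum(len(s) for s in sample.values()))  # 4000
```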

With those groups, it has a proportionate stratified random sample of college students, which provides a better representation of students’ college majors in the U.S. The researchers can then highlight specific strata, observe the varying types of studies of U.S. college students, and observe the various GPAs.

Advantages of Stratified Random Sampling


The main advantage of stratified random sampling is that it captures
key population characteristics in the sample. Similar to a weighted
average, this method of sampling produces characteristics in the
sample that are proportional to the overall population. Stratified
random sampling works well for populations with a variety of
attributes but is otherwise ineffective if subgroups cannot be formed.
Stratification gives a smaller error in estimation and greater precision
than the simple random sampling method. The greater the
differences among the strata, the greater the gain in precision.

Disadvantages of Stratified Random Sampling


Unfortunately, this method of research cannot be used in every
study. The method’s disadvantage is that several conditions must be
met for it to be used properly. Researchers must identify every
member of a population being studied and classify each of them into
one, and only one, subpopulation. As a result, stratified random
sampling is disadvantageous when researchers can’t confidently
classify every member of the population into a subgroup. Also,
finding an exhaustive and definitive list of an entire population can
be challenging.

Overlapping can be an issue if there are subjects that fall into multiple subgroups. When simple random sampling is performed, those who are in multiple subgroups are more likely to be chosen. The result could be a misrepresentation or inaccurate reflection of the population.

The above examples make it easy: undergraduate, graduate, male, and female are clearly defined groups. In other situations, however, it might be far more difficult. Imagine incorporating characteristics such as race, ethnicity, or religion. The sorting process becomes more difficult, rendering stratified random sampling an ineffective and less-than-ideal method.

When would you use stratified random sampling?
Stratified random sampling is often used when researchers want to know about different subgroups or strata within the population being studied, for instance, if one is interested in differences among groups based on race, gender, or education.
