
What Is Statistics?

Statistics is a branch of applied mathematics that involves the collection, description,
analysis, and inference of conclusions from quantitative data. The mathematical theories
behind statistics rely heavily on differential and integral calculus, linear algebra, and
probability theory.

Key points:
● Statistics is the study and manipulation of data, including ways to gather, review,
analyze, and draw conclusions from data.
● The two major areas of statistics are descriptive and inferential statistics.
● Statistics can be communicated at different levels, ranging from non-numerical
descriptors (nominal level) to numbers measured from a true zero point (ratio level).
● A number of sampling techniques can be used to compile statistical data, including
simple random, systematic, stratified, and cluster sampling.
● Statistics are present in almost every department of every company and are an integral
part of investing as well.

Descriptive and Inferential Statistics


The two major areas of statistics are known as descriptive statistics, which describes the
properties of sample and population data, and inferential statistics, which uses those
properties to test hypotheses and draw conclusions. Descriptive statistics include mean
(average), variance, skewness, and kurtosis. Inferential statistics include linear regression
analysis, analysis of variance (ANOVA), logit/probit models, and null hypothesis testing.

Descriptive Statistics
Descriptive statistics mostly focus on the central tendency, variability, and distribution of
sample data. Central tendency refers to an estimate of the typical or central value of a
sample or population, and includes descriptive statistics such as mean, median,
and mode. Variability refers to a set of statistics that show how much difference there is
among the elements of a sample or population along the characteristics measured, and
includes metrics such as range, variance, and standard deviation.

The distribution refers to the overall "shape" of the data, which can be depicted on a chart
such as a histogram or dot plot, and includes properties such as the probability distribution
function, skewness, and kurtosis. Descriptive statistics can also describe differences
between observed characteristics of the elements of a data set. Descriptive statistics help
us understand the collective properties of the elements of a data sample and form the
basis for testing hypotheses and making predictions using inferential statistics.
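The descriptive measures named above can be computed directly with Python's standard `statistics` module. A minimal sketch, using a hypothetical sample of exam scores:

```python
import statistics

# Hypothetical sample of exam scores (illustration only)
scores = [72, 85, 85, 90, 64, 78, 85, 95, 70, 81]

# Central tendency
mean = statistics.mean(scores)      # the average
median = statistics.median(scores)  # the middle value
mode = statistics.mode(scores)      # the most frequent value

# Variability
rng = max(scores) - min(scores)         # range
variance = statistics.variance(scores)  # sample variance
stdev = statistics.stdev(scores)        # sample standard deviation

print(mean, median, mode, rng)
```

Skewness and kurtosis are not in the standard library; libraries such as SciPy provide them.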

Inferential Statistics
Inferential statistics are tools that statisticians use to draw conclusions about the
characteristics of a population, drawn from the characteristics of a sample, and to decide
how certain they can be of the reliability of those conclusions. Based on the sample size
and distribution, statisticians can calculate the probability that statistics, which measure the
central tendency, variability, distribution, and relationships between characteristics within
a data sample, provide an accurate picture of the corresponding parameters of the whole
population from which the sample is drawn.

Inferential statistics are used to make generalizations about large groups, such as
estimating average demand for a product by surveying a sample of consumers' buying
habits or to attempt to predict future events, such as projecting the future return of a
security or asset class based on returns in a sample period.

Regression analysis is a widely used technique of statistical inference used to determine


the strength and nature of the relationship (i.e., the correlation) between a dependent
variable and one or more explanatory (independent) variables. The output of a regression
model is often analyzed for statistical significance, which refers to the claim that a result
from findings generated by testing or experimentation is not likely to have occurred
randomly or by chance but is likely to be attributable to a specific cause elucidated by the
data. Having statistical significance is important for academic disciplines or practitioners
that rely heavily on analyzing data and research.

Types of Variables

A. Qualitative or Attribute variable - the characteristic being studied is nonnumeric.
EXAMPLES: gender, religious affiliation, type of automobile owned, state of birth, eye color.
B. Quantitative variable - information is reported numerically.
EXAMPLES: balance in your checking account, minutes remaining in class, or number of
children in a family.
Quantitative Variables - Classifications
Quantitative variables can be classified as either discrete or continuous.
Discrete variables: can only assume certain values and there are usually “gaps” between
values.
EXAMPLE: the number of bedrooms in a house, or the number of hammers sold at the local
Home Depot (1, 2, 3, …).

A continuous variable can assume any value within a specified range.
EXAMPLE: the pressure in a tire, the weight of a pork chop, or the height of students in a
class.

Statistical Levels of Measurement

After analyzing variables and outcomes as part of statistics, there are several resulting
levels of measurement. Statistics can quantify outcomes in these different ways:

1. Nominal Level Measurement: There is no numerical or quantitative value, and qualities
are not ranked. Instead, nominal level measurements are simply labels or categories
assigned to other variables. It's easiest to think of nominal level measurements as
non-numerical facts about a variable. Example: The name of the President elected in 2020
was Joseph Robinette Biden, Jr.
2. Ordinal Level Measurement: Outcomes can be arranged in an order; however, all data
values have the same weight. Although numerical, ordinal level measurements
in statistics can't be subtracted from one another, as only the position of the data point
matters. Often incorporated into nonparametric statistics, ordinal levels are often
compared against the total variable group. Example: American Fred Kerley was the 2nd
fastest man at the 2020 Tokyo Olympics based on 100-meter sprint times.
3. Interval Level Measurement: Outcomes can be arranged in order; however, differences
between data values may now have meaning. Two different data points are often used
to compare the passing of time or changing conditions within a data set. There is often
no "starting point" for the range of data values, and calendar dates or temperatures
may not have a meaningful intrinsic zero value. Example: Inflation hit 8.6% in May
2022. The last time inflation was this high was December 1981.
4. Ratio Level Measurement: Outcomes can be arranged in order, and differences
between data values now have meaning. However, there is now a starting point or
"zero value" that can be used to further provide value to a statistical value. The ratio
between data values now has meaning, including its distance away from zero. Example:
The lowest meteorological temperature recorded was -128.6 degrees Fahrenheit in
Antarctica.

Bar chart

A bar chart (aka bar graph, column chart) plots numeric values for levels of a categorical
feature as bars. Levels are plotted on one chart axis, and values are plotted on the other
axis. Each categorical value claims one bar, and the length of each bar corresponds to the
bar’s value. Bars are plotted on a common baseline to allow for easy comparison of values.

This example bar chart depicts the number of purchases made on a site by different types of
users. The categorical feature, user type, is plotted on the horizontal axis, and each bar’s
height corresponds to the number of purchases made under each user type. We can see
from this chart that while there are about three times as many purchases from new users
who create user accounts as from those who do not (guests), both are
dwarfed by the number of purchases made by repeat users.
What is a Pie Chart?
A pie chart is a graphical representation technique that displays data in a circular-shaped
graph. It is a composite static chart that works best with few variables. Pie charts are often
used to represent sample data—with data points belonging to a combination of different
categories. Each of these categories is represented as a “slice of the pie.” The size of each
slice is directly proportional to the number of data points that belong to a particular
category.

For example, take a pie chart representing the sales a retail store makes in a month. The
slices represent the sales of toys, home decor, furniture, and electronics.

The below pie chart represents sales data, where 43 percent of sales were from furniture,
28 percent from electronics, 15 percent from toys, and 14 percent from home decor. These
figures add up to 100 percent, as should always be the case with pie graphs.

Pie charts were invented by William Playfair in 1801. The first pie charts appeared in his
book Commercial and Political Atlas and Statistical Breviary. Playfair, a Scottish engineer, is
considered the founder of graphical methods in statistics.

When Should Pie Charts Be Used?

Distinct Parts

When the data comprises distinct parts, a pie chart is best suited to represent it. The aim
of using a pie chart is to compare the contribution of each part to the whole. In the
above example, the data is made up of sales from the furniture department (43.3 percent),
electronics (28.2 percent), home decor (14.4 percent), and toys (14.1 percent). The chart
visualizes how each department contributes to total sales.

No Time Representation
A pie chart is suitable when data visualization doesn’t need to represent time. While some
other types of graphs have a dimension that represents time, the pie chart doesn’t have
one. In the above example, the sales data representation doesn’t indicate when these sales
were made. Pie charts cannot represent the change in data over time.
Few Components
A pie chart works best if the sample data only has a few components. In the above example,
the sales data from the four departments are represented. As the number of categories
increases, so does the number of slices. It might be challenging to interpret a pie chart with
many small slices.

Easy Visualization
Pie charts are the best to visualize how much each category contributes to the sample data.
In the above example, without even reading the numbers, it is easy to visualize that the
furniture department contributes most to the organization’s sales.
Frequency Distribution | Tables, Types & Examples
Published on June 7, 2022 by Shaun Turney. Revised on November 10, 2022.
A frequency distribution describes the number of observations for each possible value of
a variable. Frequency distributions are depicted using graphs and frequency tables.
Example: Frequency distribution. In the 2022 Winter Olympics, Team USA won 25 medals.
This frequency table gives the medals' values (gold, silver, and bronze) and frequencies:

What is a frequency distribution?


The frequency of a value is the number of times it occurs in a dataset. A frequency
distribution is the pattern of frequencies of a variable. It’s the number of times each
possible value of a variable occurs in a dataset.

Types of frequency distributions


There are four types of frequency distributions:

● Ungrouped frequency distributions: The number of observations of each value of a
variable.
  – You can use this type of frequency distribution for categorical variables.
● Grouped frequency distributions: The number of observations of each class interval of a
variable. Class intervals are ordered groupings of a variable's values.
  – You can use this type of frequency distribution for quantitative variables.
● Relative frequency distributions: The proportion of observations of each value or class
interval of a variable.
  – You can use this type of frequency distribution for any type of variable when
you're more interested in comparing frequencies than the actual number of
observations.
● Cumulative frequency distributions: The sum of the frequencies less than or equal to
each value or class interval of a variable.
  – You can use this type of frequency distribution for ordinal or quantitative
variables when you want to understand how often observations fall below
certain values.

How to make a frequency table


Frequency distributions are often displayed using frequency tables. A frequency table is an
effective way to summarize or organize a dataset. It's usually composed of two columns:

● The values or class intervals
● Their frequencies

The method for making a frequency table differs between the four types of frequency
distributions. You can follow the guides below or use software such as Excel, SPSS, or R to
make a frequency table.

How to make an ungrouped frequency table

1. Create a table with two columns and as many rows as there are values of the
variable. Label the first column using the variable name and label the second column
“Frequency.” Enter the values in the first column.
o For ordinal variables, the values should be ordered from smallest to largest in
the table rows.
o For nominal variables, the values can be in any order in the table. You may wish
to order them alphabetically or in some other logical order.
2. Count the frequencies. The frequencies are the number of times each value occurs.
Enter the frequencies in the second column of the table beside their corresponding
values.
o Especially if your dataset is large, it may help to count the frequencies
by tallying. Add a third column called “Tally.” As you read the observations, make
a tick mark in the appropriate row of the tally column for each observation.
Count the tally marks to determine the frequency.

Example: Making an ungrouped frequency table. A gardener set up a bird feeder in their
backyard. To help them decide how much and what type of birdseed to buy, they decide to
record the bird species that visit their feeder. Over the course of one morning, the following
birds visit their feeder:
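The original visit log is not reproduced above, but the counting-and-tallying steps can be sketched with `collections.Counter` on a hypothetical list of sightings:

```python
from collections import Counter

# Hypothetical observations (the article's actual visit log is not shown here)
visits = ["finch", "chickadee", "finch", "sparrow", "finch",
          "chickadee", "sparrow", "finch", "jay", "chickadee"]

# Counter tallies the frequency of each value in one pass
freq = Counter(visits)

# Print an ungrouped frequency table, most frequent species first
for species, count in freq.most_common():
    print(f"{species:10s} {count}")
```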
How to make a grouped frequency table

● Class frequencies can be converted to relative class frequencies to show the fraction
of the total number of observations in each class.
● A relative frequency captures the relationship between a class total and the total
number of observations.

Constructing a Frequency Table - Example


● Step 1: Decide on the number of classes.
A useful recipe to determine the number of classes (k) is the
"2 to the k rule": choose the smallest k such that 2^k > n.
There were 80 vehicles sold, so n = 80. If we try k = 6, which
means we would use 6 classes, then 2^6 = 64, somewhat less
than 80. Hence, 6 is not enough classes. If we let k = 7, then
2^7 = 128, which is greater than 80. So the recommended number
of classes is 7.
● Step 2: Determine the class interval or width.
The formula is: i ≥ (H-L)/k where i is the class interval, H is the
highest observed value, L is the lowest observed value, and k is the
number of classes.
($35,925 - $15,546)/7 = $2,911
Round up to some convenient number, such as a multiple of 10 or 100. Use a class width
of $3,000.
● Step 3: Set the individual class limits.
● Step 4: Tally the vehicle selling prices into the classes.
● Step 5: Count the number of items in each class.
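Steps 1 and 2 can be checked numerically. A short sketch using the vehicle-price figures from the text:

```python
import math

# Vehicle-price example from the text: n = 80 observations
n = 80
high, low = 35_925, 15_546  # highest and lowest observed selling prices

# Step 1: "2 to the k rule" -- smallest k with 2**k > n
k = 1
while 2 ** k <= n:
    k += 1

# Step 2: class interval i >= (H - L) / k, then round up to a convenient number
raw_width = (high - low) / k
width = math.ceil(raw_width / 1000) * 1000  # round up to a multiple of $1,000

print(k, raw_width, width)
```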

How to make a relative frequency table

1. Create an ungrouped or grouped frequency table.
2. Add a third column to the table for the relative frequencies. To calculate the
relative frequencies, divide each frequency by the sample size. The sample size is the
sum of the frequencies.
Example: Relative frequency distribution

From this table, the gardener can make observations, such as that 19% of the bird feeder
visits were from chickadees and 25% were from finches.
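A relative frequency table is a small dictionary comprehension in Python. The counts below are hypothetical, chosen so the chickadee and finch proportions land near the percentages quoted above:

```python
# Hypothetical bird-feeder counts (the article's exact tallies are not shown)
freq = {"chickadee": 6, "finch": 8, "sparrow": 12, "jay": 6}

n = sum(freq.values())  # sample size = sum of the frequencies
relative = {species: count / n for species, count in freq.items()}

for species, rel in relative.items():
    print(f"{species:10s} {rel:.2%}")
```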

How to make a cumulative frequency table

1. Create an ungrouped or grouped frequency table for an ordinal or quantitative variable.
Cumulative frequencies don't make sense for nominal variables because the values have
no order—one value isn't more than or less than another value.
2. Add a third column to the table for the cumulative frequencies. The cumulative
frequency is the number of observations less than or equal to a certain value or class
interval. To calculate the cumulative frequencies, add each frequency to the frequencies in
the previous rows.
3. Optional: If you want to calculate the cumulative relative frequency, add another
column and divide each cumulative frequency by the sample size.

Example: Cumulative frequency distribution


From this table, the sociologist can make observations such as 13 respondents (65%) were
under 39 years old, and 16 respondents (80%) were under 49 years old.
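The cumulative and cumulative relative columns are running totals, which `itertools.accumulate` computes directly. The age-class frequencies below are hypothetical, chosen to reproduce the 13 (65%) and 16 (80%) figures above:

```python
from itertools import accumulate

# Hypothetical age-class frequencies for 20 survey respondents
classes = ["19-29", "30-39", "40-49", "50-59", "60-69"]
freq = [6, 7, 3, 2, 2]

# Running total: each entry is the number of observations up to that class
cumulative = list(accumulate(freq))
n = sum(freq)

# Optional extra column: cumulative relative frequency
cumulative_relative = [c / n for c in cumulative]

for cls, c, cr in zip(classes, cumulative, cumulative_relative):
    print(f"{cls}: {c} ({cr:.0%})")
```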
Histogram
A histogram is a display of statistical information that uses rectangles to show the frequency
of data items in successive numerical intervals of equal size. In the most common form of
histogram, the independent variable is plotted along the horizontal axis and the dependent
variable is plotted along the vertical axis. The data appears as colored or shaded rectangles
of variable area.

The illustration, below, is a histogram showing the results of a final exam given to a
hypothetical class of students. Each score range is denoted by a bar of a certain color. If this
histogram were compared with those of classes from other years that received the same
test from the same professor, conclusions might be drawn about intelligence changes
among students over the years. Conclusions might also be drawn concerning the
improvement or decline of the professor's teaching ability with the passage of time. If this
histogram were compared with those of other classes in the same semester who had
received the same final exam but who had taken the course from different professors, one
might draw conclusions about the relative competence of the professors.
Some histograms are presented with the independent variable along the vertical axis and
the dependent variable along the horizontal axis. That format is less common than the one
shown here.

Frequency Polygons
Frequency polygons are a graphical representation of data distribution that helps in
understanding the data through a specific shape. Frequency polygons are very similar to
histograms but are helpful and useful when comparing two or more sets of data. The graph
showcases frequency distribution data in the form of a line graph. Let us learn
about the frequency polygons graph, the steps in creating a graph, and solve a few examples
to understand the concept better.

Definition of Frequency Polygons

Frequency Polygons can be defined as a form of a graph that interprets information or data
that is widely used in statistics. This visual form of data representation helps in depicting the
shape and trend of the data in an organized and systematic manner. Frequency polygons
through the shape of the graph depict the number of occurrence of class intervals. This type
of graph is usually drawn with a histogram but can be drawn without a histogram as well.
While a histogram is a graph with rectangular bars without spaces, a frequency polygon
graph is a line graph that represents frequency distribution data. Frequency
polygons look like the image below:
Steps to Construct Frequency Polygons

The curve in a frequency polygon is drawn on an x-axis and y-axis. As with a regular graph,
the x-axis represents the values in a dataset and the y-axis shows the number of occurrences
of each category. While plotting a frequency polygon graph, the most important aspect is the
midpoint of each class interval, which is called the class mark. The frequency polygon curve can
be drawn with or without a histogram. For drawing with a histogram, we first draw
rectangular bars against the class intervals and join the midpoints of the bars to get the
frequency polygons. Here are the steps to drawing a frequency polygon graph without a
histogram:

● Step 1: Mark the class intervals for each class on the x-axis and the frequencies on the
y-axis.
● Step 2: Calculate the midpoint of each of the class intervals, which is the class mark.
(The formula is mentioned in the next section.)
● Step 3: Once the class marks are obtained, mark them on the x-axis.
● Step 4: Since the height always depicts the frequency, plot the frequency against
each class mark. It should be plotted against the class mark itself and not on the upper or
lower limit.
● Step 5: Once the points are marked, join them with line segments similar to a line
graph.
● Step 6: The curve that is obtained by these line segments is the frequency polygon.
Formula to Find the Frequency Polygons Midpoint

While plotting a frequency polygon graph, we need to calculate the midpoint, or
class mark, for each of the class intervals. The formula to do so is:

Class Mark (Midpoint) = (Upper Limit + Lower Limit) / 2
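The formula is a one-liner in code; the class intervals below are hypothetical, for illustration:

```python
# Class mark (midpoint) = (upper limit + lower limit) / 2
def class_mark(lower, upper):
    return (lower + upper) / 2

# Midpoints of hypothetical class intervals of width 10
intervals = [(0, 10), (10, 20), (20, 30)]
marks = [class_mark(lo, hi) for lo, hi in intervals]

print(marks)  # these x-values anchor the frequency-polygon points
```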

Difference Between Frequency Polygons and Histogram


Even though a frequency polygon graph is similar to a histogram and can be plotted with or
without a histogram, the two graphs are yet different from each other. The two graphs have
their own unique properties that show the difference visually. The differences are:

Frequency Polygons | Histograms
A frequency polygon graph is a curve that is depicted by line segments. | A histogram is a graph that depicts data through rectangular-shaped bars with no spaces between them.
In a frequency polygon graph, the midpoint of the frequencies is used. | In a histogram, the frequencies are evenly spread over the class intervals.
The accurate points in a frequency polygon graph represent the data of the particular class interval. | The height of the bars in a histogram only depicts the quantity of the data.
Comparison of data is visually more accurate in a frequency polygon graph. | Comparison of data is not visually appealing in a histogram graph.

Numerical Measures
Characteristics of Arithmetic Mean
Some of the important characteristics of the arithmetic mean are:

1. The sum of the deviations of the individual items from the arithmetic mean is always
zero. This means Σ(x − x̄) = 0, where x is the value of an item and x̄ is the arithmetic
mean. Since the sum of the deviations in the positive direction is equal to the sum of
the deviations in the negative direction, the arithmetic mean is regarded as a
measure of central tendency.
2. The sum of the squared deviations of the individual items from the arithmetic mean
is always minimum. In other words, the sum of the squared deviations taken from
any value other than the arithmetic mean will be higher.
3. As the arithmetic mean is based on all the items in a series, a change in the value of
any item will lead to a change in the value of the arithmetic mean.
4. In the case of highly skewed distribution, the arithmetic mean may get distorted on
account of a few items with extreme values. In such a case, it may cease to be the
representative characteristic of the distribution.
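Characteristics 1 and 2 can be verified numerically on a small hypothetical dataset:

```python
import statistics

# Hypothetical data to illustrate the first two characteristics of the mean
x = [4, 8, 6, 10, 2]
mean = statistics.mean(x)  # 6

# Characteristic 1: deviations from the mean sum to zero
dev_sum = sum(xi - mean for xi in x)

# Characteristic 2: the sum of squared deviations is minimized at the mean;
# measuring from any other center gives a larger total
def sse(center):
    return sum((xi - center) ** 2 for xi in x)

print(dev_sum, sse(mean), sse(mean + 1))
```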

POPULATION MEAN:

The population mean is the mean or average of all values in the given population. It is
calculated by taking the sum of all values in the population, denoted by the summation of X,
and dividing by the number of population values, denoted by N.

It is arrived at by summing up all the observations in the group and dividing the summation
by the number of observations. When one uses the whole data set for computing a statistical
parameter, the data set is the population. For example, the returns of all the stocks listed on
the NASDAQ stock exchange form the population of that group. So, for this example, the
population mean of the returns of all the stocks listed on the NASDAQ stock exchange will
be the average return of all the stocks listed on that exchange.
To calculate the population mean for a group, we first need to find out the sum of all the
observed values. So, if the total number of observed values is denoted by X, then the
summation of all the observed values will be ∑X. And let the number of observations in the
population be N.

The formula is represented as follows,

µ= ∑X/N

SAMPLE MEAN

The sample mean is the average mathematical value of a sample's values. It is a statistical
indicator used to analyze various variables over time. Statisticians and researchers obtain it
by taking the total sum of the values of all data points and then dividing it by the total
number of data points. Statistical researchers mathematically represent it as follows:

X̄ = ∑X/N

Where X̄ = sample average or mean,
X = variable with different values, and
N = total number of variable observations.

In actual practice, the center of any data set is the mean. However, surveying
every individual in a population to get the population mean accurately is often impossible: it
is time-consuming, requires huge capital, and makes the work cumbersome. Also,
forecasting population behavior and assessing the population are very
important for making policy decisions. Hence, in such cases, the sample mean
comes to our rescue. Statisticians and researchers further use the sample mean to
estimate the variance and standard deviation of the population.
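A sketch contrasting the two calculations on hypothetical stock-return data:

```python
import statistics

# Hypothetical population: annual returns (%) of five stocks
population = [4.2, 7.1, -1.3, 5.0, 10.0]

# Population mean: mu = sum(X) / N, using every value in the group
mu = sum(population) / len(population)

# Sample mean: the same arithmetic, but over a subset of the population
sample = population[:3]
x_bar = statistics.mean(sample)

print(mu, x_bar)
```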

Weighted Mean

In Mathematics, the weighted mean is used to calculate the average value of the data. In
the weighted mean calculation, the average value can be calculated by providing different
weights to some of the individual values. We need to calculate the weighted mean when
data is given in a different way compared to the arithmetic mean or sample mean. Different
types of means are used to calculate the average of the data values. Let’s understand what
is weighted mean and how to define weighted mean along with solved examples.

The weighted mean is defined as an average computed by giving different weights to some
of the individual values. When all the weights are equal, then the weighted mean is similar
to the arithmetic mean. A free online tool called the weighted mean calculator is used to
calculate the weighted mean for the given range of values.
Weighted Mean Formula
To calculate the weighted mean for a given set of non-negative data x1, x2, x3, ..., xn with
non-negative weights w1, w2, w3, ..., wn, we use the formula given below.

Weighted mean = Σ(wi·xi) / Σwi
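The formula translates directly into code. The scores and weights below are hypothetical (a course grade built from weighted assessments):

```python
# Weighted mean = sum(w_i * x_i) / sum(w_i)
scores  = [80, 90, 70]     # x values: homework, midterm, final
weights = [0.2, 0.3, 0.5]  # w values: relative importance of each

weighted_mean = sum(w * x for w, x in zip(weights, scores)) / sum(weights)
print(weighted_mean)

# With equal weights, this reduces to the ordinary arithmetic mean
equal = sum(scores) / len(scores)
```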

The Median

● The Median is the midpoint of the values after they have been ordered from the smallest to
the largest.
– There are as many values above the median as below it in the data array.
– For an even set of values, the median will be the arithmetic average of the two middle
numbers.

Properties of Median:
● There is a unique median for each data set.

● It is not affected by extremely large or small values and is
therefore a valuable measure of central tendency when
such values occur.

● It can be computed for ratio-level, interval-level, and
ordinal-level data.

● It can be computed for an open-ended frequency
distribution if the median does not lie in an open-ended
class.
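These properties are easy to confirm with `statistics.median` on small hypothetical datasets:

```python
import statistics

# Odd number of values: the median is the middle value after ordering
odd = [7, 1, 9, 3, 5]
print(statistics.median(odd))  # middle of 1, 3, 5, 7, 9

# Even number of values: the average of the two middle values
even = [2, 4, 6, 8]
print(statistics.median(even))

# Robust to extremes: one huge outlier does not move the median
with_outlier = [7, 1, 9, 3, 5_000_000]
print(statistics.median(with_outlier))
```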

The mode

● The mode is the value of the observation that appears most frequently.

Relative position of Mean, Median, Mode


GEOMETRIC MEAN

● Useful in finding the average change of percentages, ratios, indexes, or
growth rates over time.
● It has a wide application in business and economics because we are often
interested in finding the percentage changes in sales, salaries, or economic
figures, such as the GDP, which compound or build on each other.
● The geometric mean will always be less than or equal to the
arithmetic mean.
● The geometric mean of a set of n positive numbers is defined as the
nth root of the product of the n values.
● The formula for the geometric mean is written:
GM = (x1 · x2 · … · xn)^(1/n)

Suppose you receive a 5 percent increase in salary this year and a 15 percent
increase next year. The average annual percent increase is 9.886 percent, not
10.0 percent. Why is this so? We begin by calculating the geometric mean.
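The figure can be verified by averaging the growth factors geometrically; `statistics.geometric_mean` (Python 3.8+) does the nth-root-of-product calculation:

```python
import statistics

# Salary raises of 5% and 15%: average the growth factors, not the percents
factors = [1.05, 1.15]
gm = statistics.geometric_mean(factors)  # nth root of the product
avg_increase = (gm - 1) * 100

print(round(avg_increase, 3))
```

Checking: two years at 9.886 percent compound to the same total raise as 5 percent followed by 15 percent, which a flat 10 percent would overstate.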

DISPERSION
Why Study Dispersion?
– A measure of location, such as the mean or the median, only
describes the center of the data. It is valuable from that
standpoint, but it does not tell us anything about the spread of the
data.
– For example, if your nature guide told you that the river ahead averaged 3
feet in depth, would you want to wade across on foot without additional
information? Probably not. You would want to know something about the
variation in the depth.
– A second reason for studying the dispersion in a set of data is to compare
the spread in two or more distributions.
● Range

● Mean Deviation

● Population variance

● Population Standard Deviation

The number of traffic citations issued during the last five months in Beaufort
County, South Carolina, is 38, 26, 13, 41, and 22. What is the population
variance?
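The question above can be answered with `statistics.pvariance`, which treats the five months as the entire population:

```python
import statistics

# Traffic citations in Beaufort County over five months (from the text)
citations = [38, 26, 13, 41, 22]

mu = statistics.mean(citations)           # population mean
sigma2 = statistics.pvariance(citations)  # population variance
sigma = statistics.pstdev(citations)      # population standard deviation

print(mu, sigma2, sigma)
```

The mean is 28, the squared deviations sum to 534, and 534/5 gives a population variance of 106.8.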
Relative Measures of Dispersion
Measures of dispersion can not be used for comparing the variability of two
or more distributions given in different units as these are absolute measures
of dispersion. Considering this limitation, relative measures of dispersion are
introduced.
(i) Coefficient of Variation (CV) = (SD/Mean) × 100
(ii) Coefficient of Mean Deviation (CMD) = (MD/Mean) × 100
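Because the CV divides out the units, it lets us compare the spread of two hypothetical datasets measured in different units:

```python
import statistics

# Hypothetical datasets in different units (cm vs. kg)
heights_cm = [160, 170, 175, 180, 165]
weights_kg = [55, 70, 80, 90, 60]

def cv(data):
    """Coefficient of Variation = (SD / mean) * 100, a unit-free percentage."""
    return statistics.pstdev(data) / statistics.mean(data) * 100

print(cv(heights_cm), cv(weights_kg))
```

Here the weights vary far more, relative to their mean, than the heights do, even though the raw numbers are on different scales.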

CHEBYSHEV’S THEOREM
The arithmetic mean biweekly amount contributed by the Dupree Paint
employees to the company's profit-sharing plan is $51.54, and the
standard deviation is $7.51. At least what percent of the contributions lie
within plus or minus 3.5 standard deviations of the mean?
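Chebyshev's theorem answers this directly: for any distribution, at least 1 − 1/k² of the values lie within k standard deviations of the mean (for k > 1). A one-line check for k = 3.5:

```python
# Chebyshev's theorem: at least 1 - 1/k**2 of any distribution lies
# within k standard deviations of the mean (for k > 1)
k = 3.5
proportion = 1 - 1 / k ** 2

print(round(proportion, 3))  # fraction of contributions within +/- 3.5 SD
```

So at least about 91.8 percent of the contributions lie within 3.5 standard deviations of $51.54.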

THE EMPIRICAL RULE


STANDARD DEVIATION OF GROUPED DATA
