What Is Statistics
Key points:
Statistics is the study and manipulation of data, including ways to gather, review,
analyze, and draw conclusions from data.
The two major areas of statistics are descriptive and inferential statistics.
Statistics can be communicated at different levels ranging from non-numerical
descriptor (nominal-level) to numerical in reference to a zero-point (ratio-level).
A number of sampling techniques can be used to compile statistical data including
simple random, systematic, stratified, or cluster sampling.
Statistics are present in almost every department of every company and are an integral
part of investing as well.
Descriptive Statistics
Descriptive statistics mostly focus on the central tendency, variability, and distribution of
sample data. Central tendency is an estimate of the characteristics of a typical element
of a sample or population, and includes descriptive statistics such as mean, median,
and mode. Variability refers to a set of statistics that show how much difference there is
among the elements of a sample or population along the characteristics measured, and
includes metrics such as range, variance, and standard deviation.
The distribution refers to the overall "shape" of the data, which can be depicted on a chart
such as a histogram or dot plot, and includes properties such as the probability distribution
function, skewness, and kurtosis. Descriptive statistics can also describe differences
between observed characteristics of the elements of a data set. Descriptive statistics help
us understand the collective properties of the elements of a data sample and form the
basis for testing hypotheses and making predictions using inferential statistics.
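As a rough illustration of the measures named above, the following sketch computes the central tendency and variability of a small sample with Python's standard `statistics` module. The sample values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import statistics

# Hypothetical sample; the values are illustrative only.
sample = [2, 4, 4, 5, 7, 9, 11]

# Central tendency
mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)

# Variability
data_range = max(sample) - min(sample)
variance = statistics.variance(sample)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(sample)      # square root of the sample variance

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range, "variance:", variance, "std dev:", round(std_dev, 4))
```

Note that `statistics.variance` and `statistics.stdev` treat the data as a sample; `statistics.pvariance` and `statistics.pstdev` are the population versions.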
Inferential Statistics
Inferential statistics are tools that statisticians use to draw conclusions about the
characteristics of a population, drawn from the characteristics of a sample, and to decide
how certain they can be of the reliability of those conclusions. Based on the sample size
and distribution, statisticians can calculate the probability that statistics, which measure the
central tendency, variability, distribution, and relationships between characteristics within
a data sample, provide an accurate picture of the corresponding parameters of the whole
population from which the sample is drawn.
Inferential statistics are used to make generalizations about large groups, such as
estimating average demand for a product by surveying a sample of consumers' buying
habits or to attempt to predict future events, such as projecting the future return of a
security or asset class based on returns in a sample period.
Types of Variables
When variables and outcomes are analyzed, the results fall into several
levels of measurement. Statistics can quantify outcomes in these different ways:
Bar chart
A bar chart (aka bar graph, column chart) plots numeric values for levels of a categorical
feature as bars. Levels are plotted on one chart axis, and values are plotted on the other
axis. Each categorical value claims one bar, and the length of each bar corresponds to the
bar’s value. Bars are plotted on a common baseline to allow for easy comparison of values.
This example bar chart depicts the number of purchases made on a site by different types of
users. The categorical feature, user type, is plotted on the horizontal axis, and each bar’s
height corresponds to the number of purchases made under each user type. We can see
from this chart that while there are about three times as many purchases from new users
who create user accounts as from those who do not create user accounts (guests), both are
dwarfed by the number of purchases made by repeat users.
What is a Pie Chart?
A pie chart is a graphical representation technique that displays data in a circular-shaped
graph. It is a composite static chart that works best with few variables. Pie charts are often
used to represent sample data—with data points belonging to a combination of different
categories. Each of these categories is represented as a “slice of the pie.” The size of each
slice is directly proportional to the number of data points that belong to a particular
category.
For example, take a pie chart representing the sales a retail store makes in a month. The
slices represent the sales of toys, home decor, furniture, and electronics.
The below pie chart represents sales data, where 43 percent of sales were from furniture,
28 percent from electronics, 15 percent from toys, and 14 percent from home decor. These
figures add up to 100 percent, as should always be the case with pie graphs.
Pie charts were invented by William Playfair in 1801. The first pie charts appeared in his
book Statistical Breviary. Playfair, a Scottish engineer, is
considered the founder of graphical methods in statistics.
When the data comprises distinctive parts, a pie chart is best suited to represent it. The aim
of using a pie chart is to compare the contribution of each part to the whole data. In the
above example, the data is made up of sales from the furniture department (43.3 percent),
electronics (28.2 percent), groceries (14.4 percent), and toys (14.1 percent). The chart
visualizes how each department contributes to the total sales.
No Time Representation
A pie chart is suitable when data visualization doesn’t need to represent time. While some
other types of graphs have a dimension that represents time, the pie chart doesn’t have
one. In the above example, the sales data representation doesn’t indicate when these sales
were made. Pie charts cannot represent the change in data over time.
Few Components
A pie chart works best if the sample data only has a few components. In the above example,
the sales data from the four departments are represented. As the number of categories
increases, so does the number of slices. It might be challenging to interpret a pie chart with
many small slices.
Easy Visualization
Pie charts are well suited to visualizing how much each category contributes to the sample data.
In the above example, without even reading the numbers, it is easy to visualize that the
furniture department contributes most to the organization’s sales.
Frequency Distribution | Tables, Types & Examples
Published on June 7, 2022 by Shaun Turney. Revised on November 10, 2022.
A frequency distribution describes the number of observations for each possible value of
a variable. Frequency distributions are depicted using graphs and frequency tables.
Example: Frequency distribution
In the 2022 Winter Olympics, Team USA won 25 medals.
This frequency table gives the medals’ values (gold, silver, and bronze) and frequencies:
The method for making a frequency table differs between the four types of frequency
distributions. You can follow the guides below or use software such as Excel, SPSS, or R to
make a frequency table.
1. Create a table with two columns and as many rows as there are values of the
variable. Label the first column using the variable name and label the second column
“Frequency.” Enter the values in the first column.
– For ordinal variables, the values should be ordered from smallest to largest in
the table rows.
– For nominal variables, the values can be in any order in the table. You may wish
to order them alphabetically or in some other logical order.
2. Count the frequencies. The frequencies are the number of times each value occurs.
Enter the frequencies in the second column of the table beside their corresponding
values.
– Especially if your dataset is large, it may help to count the frequencies
by tallying. Add a third column called “Tally.” As you read the observations, make
a tick mark in the appropriate row of the tally column for each observation.
Count the tally marks to determine the frequency.
Example: Making an ungrouped frequency table
A gardener set up a bird feeder in their
backyard. To help them decide how much and what type of birdseed to buy, they decide to
record the bird species that visit their feeder. Over the course of one morning, the following
birds visit their feeder:
How to make a grouped frequency table
● Class frequencies can be converted to relative class frequencies to show the fraction
of the total number of observations in each class.
● A relative frequency captures the relationship between a class total and the total
number of observations.
From this table, the gardener can make observations, such as that 19% of the bird feeder
visits were from chickadees and 25% were from finches.
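The counting and relative-frequency steps described above can be sketched in a few lines of Python. The list of observations below is hypothetical (it is not the gardener's actual record, so the percentages differ from those quoted above); `collections.Counter` plays the role of the tally column.

```python
from collections import Counter

# Hypothetical observations; not the gardener's actual data.
observations = [
    "finch", "sparrow", "finch", "chickadee", "sparrow",
    "finch", "cardinal", "sparrow", "finch", "chickadee",
]

# Frequency: number of times each value occurs.
frequencies = Counter(observations)

# Relative frequency: class total divided by the total number of observations.
total = sum(frequencies.values())
relative = {value: count / total for value, count in frequencies.items()}

# Nominal values can be listed in any order; alphabetical is used here.
for value in sorted(frequencies):
    print(f"{value:<10} {frequencies[value]:>3} {relative[value]:>6.0%}")
```

The relative frequencies always sum to 1 (100%), which is a quick check that no observation was missed.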
The illustration, below, is a histogram showing the results of a final exam given to a
hypothetical class of students. Each score range is denoted by a bar of a certain color. If this
histogram were compared with those of classes from other years that received the same
test from the same professor, conclusions might be drawn about intelligence changes
among students over the years. Conclusions might also be drawn concerning the
improvement or decline of the professor's teaching ability with the passage of time. If this
histogram were compared with those of other classes in the same semester who had
received the same final exam but who had taken the course from different professors, one
might draw conclusions about the relative competence of the professors.
Some histograms are presented with the independent variable along the vertical axis and
the dependent variable along the horizontal axis. That format is less common than the one
shown here.
Frequency Polygons
Frequency polygons are a graphical representation of data distribution that helps in
understanding the data through a specific shape. Frequency polygons are very similar to
histograms but are especially useful when comparing two or more data sets. The graph
shows frequency distribution data in the form of a line graph. Let us learn
about the frequency polygon graph and the steps in creating one.
A frequency polygon can be defined as a form of graph, widely used in statistics, that
interprets information or data. This visual form of data representation helps in depicting the
shape and trend of the data in an organized and systematic manner. Through the shape of
the graph, a frequency polygon depicts the number of occurrences in each class interval. This type
of graph is usually drawn with a histogram but can be drawn without a histogram as well.
While a histogram is a graph with rectangular bars without spaces, a frequency polygon
is a line graph that joins the frequencies plotted at the class midpoints. Frequency
polygons look like the image below:
Steps to Construct Frequency Polygons
The curve in a frequency polygon is drawn on an x-axis and a y-axis. As in a regular graph, the
x-axis represents the values in the dataset and the y-axis shows the number of occurrences in
each class. While plotting a frequency polygon graph, the most important element is the
midpoint of each class interval, which is called the classmark. The frequency polygon curve can
be drawn with or without a histogram. To draw it with a histogram, we first draw
rectangular bars against the class intervals and then join the midpoints of the tops of the bars
to get the frequency polygon. Here are the steps to drawing a frequency polygon graph without a
histogram:
Step 1: Mark the class intervals on the x-axis and the frequencies on the
y-axis.
Step 2: Calculate the midpoint of each class interval, which is its classmark. (The
formula is given in the next section.)
Step 3: Once the classmarks are obtained, mark them on the x-axis.
Step 4: Since the height always depicts the frequency, plot the frequency against
each classmark. It should be plotted against the classmark itself and not against the upper or
lower limit.
Step 5: Once the points are marked, join them with line segments, as in a line
graph.
Step 6: The curve obtained from these line segments is the frequency polygon.
Formula to Find the Frequency Polygons Midpoint
While plotting a frequency polygon graph, we need to calculate the midpoint, or
classmark, of each class interval. The formula to do so is:
Classmark = (Upper Limit + Lower Limit) / 2
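As a small sketch of the construction steps, the code below takes the classmark of each class as the average of its lower and upper limits and pairs it with the class frequency. The grouped data and the helper name `class_mark` are hypothetical; joining the resulting points in order with line segments would give the frequency polygon.

```python
# Hypothetical grouped data: (lower limit, upper limit, frequency).
classes = [(0, 10, 3), (10, 20, 7), (20, 30, 12), (30, 40, 6), (40, 50, 2)]


def class_mark(lower, upper):
    """Return the midpoint (classmark) of a class interval."""
    return (lower + upper) / 2


# One (classmark, frequency) point per class. The frequency is plotted
# against the classmark, not against the upper or lower limit.
points = [(class_mark(lower, upper), freq) for lower, upper, freq in classes]

for x, y in points:
    print(f"classmark {x:>5.1f}  frequency {y}")
```

The points are left as plain tuples so that any plotting tool can be used to draw the line segments.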
Numerical Measures
Characteristics of Arithmetic Mean
Some of the important characteristics of the arithmetic mean are:
1. The sum of the deviations of the individual items from the arithmetic mean is always
zero. This means ∑(x - x̄) = 0, where x is the value of an item and x̄ is the arithmetic
mean. Since the sum of the deviations in the positive direction is equal to the sum of
the deviations in the negative direction, the arithmetic mean is regarded as a
measure of central tendency.
2. The sum of the squared deviations of the individual items from the arithmetic mean
is always minimum. In other words, the sum of the squared deviations taken from
any value other than the arithmetic mean will be higher.
3. As the arithmetic mean is based on all the items in a series, a change in the value of
any item will lead to a change in the value of the arithmetic mean.
4. In the case of highly skewed distribution, the arithmetic mean may get distorted on
account of a few items with extreme values. In such a case, it may cease to be the
representative characteristic of the distribution.
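The characteristics listed above can be checked numerically. The sketch below uses a small hypothetical data set: it confirms that the deviations from the mean sum to zero, that the sum of squared deviations is smaller around the mean than around nearby values, and that one extreme value pulls the mean far more than the median.

```python
import statistics

# Hypothetical data; chosen so that the mean is a whole number.
values = [3, 8, 10, 15, 24]
mean = sum(values) / len(values)

# Characteristic 1: deviations from the mean sum to zero.
sum_of_deviations = sum(x - mean for x in values)


def sum_sq_dev(center):
    """Sum of squared deviations of the values from a chosen center."""
    return sum((x - center) ** 2 for x in values)


# Characteristic 2: the sum of squared deviations is smallest at the mean.
at_mean = sum_sq_dev(mean)
below_mean = sum_sq_dev(mean - 1)
above_mean = sum_sq_dev(mean + 1)

# Characteristic 4: one extreme value distorts the mean but not the median.
with_outlier = [3, 8, 10, 15, 240]
outlier_mean = sum(with_outlier) / len(with_outlier)
outlier_median = statistics.median(with_outlier)

print(mean, sum_of_deviations, at_mean, below_mean, above_mean)
print(outlier_mean, outlier_median)
```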
POPULATION MEAN:
The population mean is the mean or average of all values in the given population. It is
calculated as the sum of all values in the population, denoted by the summation of X,
divided by the number of values in the population, denoted by N.
It is obtained by summing up all the observations in the group and dividing the sum by
the number of observations. When one uses the whole data set for computing a statistical
parameter, the data set is the population. For example, the returns of all the stocks listed on
the NASDAQ stock exchange form the population of that group. So, for this example, the
population mean is the average return of all the stocks listed on that exchange.
To calculate the population mean for a group, we first need to find the sum of all the
observed values. So, if each observed value is denoted by X, then the
summation of all the observed values will be ∑X. And let the number of observations in the
population be N.
µ= ∑X/N
SAMPLE MEAN
The sample mean is the arithmetic average of a sample’s values. It is a statistical
indicator used to analyze various variables over time. Statisticians and researchers obtain it
by summing the values of all observations in the sample and dividing the sum by the
number of observations in the sample. It is represented mathematically as follows:
x̄ = ∑x/n
In practice, the mean serves as the center of a data set. However, surveying
every individual to obtain the population mean exactly is often impractical: it
is time-consuming, expensive, and cumbersome. At the same time,
forecasting population behavior and assessing the population are very
important for making policy decisions. In such cases, the sample mean
comes to our rescue. Statisticians and researchers also use the sample mean to
calculate the sample variance and sample standard deviation.
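To make the distinction concrete, here is a minimal sketch that computes a population mean (the sum of all values divided by N) and a sample mean (the sum of the sampled values divided by n). The numbers are hypothetical, and the sample is simply every second value, a choice made only so the result is reproducible.

```python
# Hypothetical population of eight returns (in percent).
population = [4, 8, 6, 5, 3, 7, 9, 6]

# Population mean: sum of all values divided by N.
N = len(population)
mu = sum(population) / N

# A sample drawn from the population. Taking every second value is an
# assumption made for reproducibility, not a recommended sampling scheme.
sample = population[::2]

# Sample mean: sum of the sampled values divided by n.
n = len(sample)
x_bar = sum(sample) / n

print("population mean:", mu)
print("sample mean:", x_bar)
```

The sample mean (5.5) is close to, but not the same as, the population mean (6.0); that gap is the sampling error that inferential statistics tries to quantify.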
Weighted Mean
In mathematics, the weighted mean is used to calculate the average value of data when
some values should count more than others. In the weighted mean calculation, the average
is computed by assigning different weights to the individual values. We need the weighted
mean when the values do not all contribute equally, unlike in the arithmetic mean or
sample mean, where every value carries the same weight. Different
types of means are used to calculate the average of data values. Let’s understand what
the weighted mean is and how to define it.
The weighted mean is defined as an average computed by giving different weights to some
of the individual values. When all the weights are equal, the weighted mean is equal
to the arithmetic mean. Online weighted mean calculators can also be used to
calculate the weighted mean for a given range of values.
Weighted Mean Formula
To calculate the weighted mean for a given set of non-negative data x1, x2, x3, ..., xn with
non-negative weights w1, w2, w3, ..., wn, we use the formula given below.
Weighted mean = (w1x1 + w2x2 + ... + wnxn) / (w1 + w2 + ... + wn) = ∑(wi xi) / ∑wi
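A minimal sketch of the calculation: each value is multiplied by its weight, the products are summed, and the result is divided by the sum of the weights. The function name `weighted_mean` and the scores and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical scores and weights.
scores = [80, 90, 70]
weights = [2, 3, 5]

print(weighted_mean(scores, weights))    # (160 + 270 + 350) / 10
print(weighted_mean(scores, [1, 1, 1]))  # equal weights: the arithmetic mean
```

With equal weights the result matches the ordinary arithmetic mean of the scores.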
The Median
● The Median is the midpoint of the values after they have been ordered from the smallest to
the largest.
– There are as many values above the median as below it in the data array.
– For an even set of values, the median will be the arithmetic average of the two middle
numbers.
Properties of Median:
● There is a unique median for each data set.
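The definition above translates directly into code: order the values, then take the middle one, or the average of the two middle ones when the count is even. This is a sketch with hypothetical inputs; Python's built-in `statistics.median` does the same job.

```python
def median(values):
    """Return the midpoint of the values after ordering them."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: arithmetic average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))     # odd number of values
print(median([7, 1, 5, 3]))  # even number of values
```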
The mode
The Geometric Mean
Suppose you receive a 5 percent increase in salary this year and a 15 percent
increase next year. The average annual percent increase is 9.886 percent, not
10.0 percent. Why is this so? We begin by calculating the geometric mean.
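The salary example can be worked through as follows. Each increase is converted to a growth factor (1.05 and 1.15), the factors are multiplied, and the n-th root of the product is taken; subtracting 1 gives the average rate. The variable names are illustrative, and `math.prod` assumes Python 3.8 or later.

```python
import math

# Annual increases from the example: 5 percent and 15 percent.
increases = [0.05, 0.15]

# Convert each percent increase to a growth factor.
growth_factors = [1 + r for r in increases]

# Geometric mean: n-th root of the product of the growth factors.
geometric_mean = math.prod(growth_factors) ** (1 / len(growth_factors))
average_increase_pct = (geometric_mean - 1) * 100

# The arithmetic mean of the two increases, for comparison.
arithmetic_pct = sum(increases) / len(increases) * 100

print(round(average_increase_pct, 3))
print(round(arithmetic_pct, 3))
```

The geometric mean is lower because the second raise is applied to an already increased salary: two raises of about 9.886 percent compound to the same final salary (1.2075 times the original) as a 5 percent raise followed by a 15 percent raise.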
DISPERSION
Why Study Dispersion?
– A measure of location, such as the mean or the median, only
describes the center of the data. It is valuable from that
standpoint, but it does not tell us anything about the spread of the
data.
– For example, if your nature guide told you that the river ahead averaged 3
feet in depth, would you want to wade across on foot without additional
information? Probably not. You would want to know something about the
variation in the depth.
– A second reason for studying the dispersion in a set of data is to compare
the spread in two or more distributions.
● Range
● Mean Deviation
● Population variance
The number of traffic citations issued during the last five months in Beaufort
County, South Carolina, is 38, 26, 13, 41, and 22. What is the population
variance?
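The traffic-citation question can be worked step by step. The sketch below treats the five monthly counts as the whole population, so the variance divides by N rather than N - 1; it also computes the range and mean deviation listed above for comparison.

```python
# Monthly traffic citations from the example.
citations = [38, 26, 13, 41, 22]
n = len(citations)

# Population mean.
mu = sum(citations) / n

# Range: largest value minus smallest value.
data_range = max(citations) - min(citations)

# Mean deviation: average absolute distance from the mean.
mean_deviation = sum(abs(x - mu) for x in citations) / n

# Population variance: average squared distance from the mean.
population_variance = sum((x - mu) ** 2 for x in citations) / n

print("mean:", mu)
print("range:", data_range)
print("mean deviation:", mean_deviation)
print("population variance:", population_variance)
```

The squared deviations are 100, 4, 225, 169, and 36, which sum to 534; dividing by 5 gives a population variance of 106.8.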
Relative Measures of Dispersion
Measures of dispersion cannot be used for comparing the variability of two
or more distributions given in different units as these are absolute measures
of dispersion. Considering this limitation, relative measures of dispersion are
introduced.
(i) Coefficient of Variation (CV) = (SD/Mean) × 100
(ii) Coefficient of Mean Deviation (CMD) = (MD/Mean) × 100
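Because both coefficients are ratios, they are unit-free and can be compared across distributions measured in different units. The sketch below applies the two formulas to hypothetical summary figures for incomes (in dollars) and ages (in years); the function names and numbers are illustrative only.

```python
def coefficient_of_variation(sd, mean):
    """CV = (SD / Mean) x 100, expressed as a percentage."""
    return sd / mean * 100


def coefficient_of_mean_deviation(md, mean):
    """CMD = (MD / Mean) x 100, expressed as a percentage."""
    return md / mean * 100


# Hypothetical summaries of two distributions in different units.
income_cv = coefficient_of_variation(sd=5000, mean=50000)  # dollars
age_cv = coefficient_of_variation(sd=8, mean=40)           # years
age_cmd = coefficient_of_mean_deviation(md=6, mean=40)     # years

print(f"income CV: {income_cv:.1f}%")
print(f"age CV: {age_cv:.1f}%")
print(f"age CMD: {age_cmd:.1f}%")
```

Although the income standard deviation is far larger in absolute terms, the ages are relatively more variable (20 percent versus 10 percent).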
CHEBYSHEV’S THEOREM
The arithmetic mean biweekly amount contributed by the Dupree Paint
employees to the company’s profit-sharing plan is $51.54, and the
standard deviation is $7.51. At least what percent of the contributions lie
within plus 3.5 standard deviations and minus 3.5 standard deviations of the mean?
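Chebyshev's theorem states that, for any set of observations, the proportion of values lying within k standard deviations of the mean is at least 1 - 1/k², for k greater than 1. A short sketch applying it to the Dupree Paint question, with an illustrative function name:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


mean = 51.54     # mean biweekly contribution, in dollars
std_dev = 7.51   # standard deviation, in dollars
k = 3.5

bound = chebyshev_lower_bound(k)
low = mean - k * std_dev
high = mean + k * std_dev

print(f"At least {bound:.1%} of contributions lie between ${low:.2f} and ${high:.2f}")
```

With k = 3.5 the bound is 1 - 1/12.25, or about 92 percent, regardless of the shape of the distribution.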