Statistical Machine Learning


STATISTICAL MACHINE LEARNING

Prepared By

D.Deva Hema

Inferential statistics

Inferential statistics can help us understand the collective properties of the elements of a data sample.
Knowing the sample mean, variance, and distribution of a variable can help us understand the world
around us.

Descriptive statistics

Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be
either a representation of the entire population or a sample of a population. Descriptive statistics are
broken down into measures of central tendency and measures of variability. Measures of central tendency
include the mean, median, and mode, while measures of variability include standard deviation, variance,
minimum and maximum values, kurtosis, and skewness.

Descriptive statistics, in short, help describe and understand the features of a specific data set by giving
short summaries about the sample and measures of the data. The most recognized types of descriptive
statistics are measures of center: the mean, median, and mode, which are used at almost all levels of math
and statistics. The mean, or the average, is calculated by adding all the figures within the data set and then
dividing by the number of figures within the set.

For example, the sum of the data set (2, 3, 4, 5, 6) is 20, so the mean is 4 (20/5). The mode of a
data set is the value appearing most often, and the median is the figure situated in the middle of the
sorted data set: it separates the higher figures from the lower figures. There are also less common
types of descriptive statistics that are still very important.
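
The arithmetic in this example can be checked with a short Python sketch. The second data set
(data_with_repeat) is a made-up illustration, since the example set above has no repeated value and
therefore no single mode.

```python
import statistics

data = [2, 3, 4, 5, 6]

total = sum(data)                 # 2 + 3 + 4 + 5 + 6 = 20
mean = total / len(data)          # 20 / 5 = 4.0
median = statistics.median(data)  # middle value of the sorted data

# Hypothetical data set with one repeated value, used only to show the mode.
data_with_repeat = [2, 3, 3, 5, 6]
mode = statistics.mode(data_with_repeat)

print("sum:", total, "mean:", mean, "median:", median, "mode:", mode)
```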

A frequency distribution shows how often each different value in a set of data occurs. A histogram is the
most commonly used graph to show frequency distributions. It looks very much like a bar chart, but there
are important differences between them. This helpful data collection and analysis tool is considered one
of the seven basic quality tools.

Histogram

A histogram is used to summarize discrete or continuous data. In other words, it provides a visual
interpretation of numerical data by showing the number of data points that fall within a specified range of
values (called "bins"). It is similar to a vertical bar graph. However, a histogram, unlike a vertical bar
graph, shows no gaps between the bars.

WHEN TO USE A HISTOGRAM

Use a histogram when:

• The data are numerical
• You want to see the shape of the data's distribution, especially when determining whether the output of a
process is distributed approximately normally
• You are analyzing whether a process can meet the customer's requirements
• You are analyzing what the output from a supplier's process looks like
• You want to see whether a process change has occurred from one time period to another
• You want to determine whether the outputs of two or more processes are different
• You wish to communicate the distribution of data quickly and easily to others

HOW TO CREATE A HISTOGRAM

• Collect at least 50 consecutive data points from a process.
• Use a histogram worksheet to set up the histogram. It will help you determine the number of bars, the
range of numbers that go into each bar, and the labels for the bar edges. After calculating W in Step 2 of
the worksheet, use your judgment to adjust it to a convenient number. For example, you might decide to
round 0.9 to an even 1.0. The value for W must not have more decimal places than the numbers you will
be graphing.
• Draw x- and y-axes on graph paper. Mark and label the y-axis for counting data values. Mark and label
the x-axis with the L values from the worksheet. The spaces between these numbers will be the bars of
the histogram. Do not allow for spaces between bars.
• For each data point, mark off one count above the appropriate bar with an X or by shading that portion of
the bar.
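
The counting step can be sketched in Python. This is only an illustration under stated assumptions:
bin_width is assumed to play the role of W, the returned edges are assumed to play the role of the
L values, the helper histogram_counts is hypothetical, and the ten sample values are made up (a real
histogram would use at least 50 points, as noted above).

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall into each bar [edge, edge + bin_width)."""
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        # Index of the bar whose range contains v.
        counts[int((v - start) // bin_width)] += 1
    edges = [start + i * bin_width for i in range(n_bins + 1)]
    return edges, counts


# Made-up sample, kept short for readability.
values = [1.2, 1.9, 2.1, 2.4, 2.5, 3.3, 3.7, 4.0, 4.4, 4.8]
bin_width = 1.0
edges, counts = histogram_counts(values, bin_width, start=1.0)

# Text version of the chart: one X per data point, no gaps between bars.
for left, count in zip(edges, counts):
    print(f"{left:.1f}-{left + bin_width:.1f} | {'X' * count}")
```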

What is the central tendency?

How would you describe a data set with a single value? The most common approach is to define a central
position of your data distribution. This is what the statisticians call the central tendency. Being a core
concept in statistics, the central tendency summarizes the entire data set, thus giving an idea of its typical
value.

What are the measures of central tendency?

The arithmetic mean (or average) is the first measure that comes to one’s mind when talking about a
center point in the data or its typical value. Nevertheless, there are also other measures that describe the
central tendency more accurately in certain scenarios. This time we’ll break down the purposes of the
three main measures to describe the central position within a data set, namely:

Mean

Median

Mode

Mean

The mean is the most common way to summarize a data set. You can use the mean with either discrete or
continuous data. Yet, it's mostly used with continuous data. There are two important properties the mean
has:

• The calculation of the mean considers each data point of your data set.
• The sum of deviations of each data point from the mean is always zero.
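
The second property can be verified numerically. The sketch below uses a small made-up data set;
the comparison allows for floating-point rounding.

```python
import math

data = [3, 2, 5, 6]

mean = sum(data) / len(data)            # every data point contributes to the mean
deviations = [x - mean for x in data]   # signed distance of each point from the mean
total_deviation = sum(deviations)       # positive and negative deviations cancel out

print("mean:", mean)
print("deviations:", deviations)
print("sum of deviations is zero:", math.isclose(total_deviation, 0.0, abs_tol=1e-12))
```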

Median

The median is the middlemost number in the data distribution.


How to calculate the median

First, arrange the values in order from the least to the greatest. Next, select the data point which is located
in the middle. If the number of values is even, take the average of the two middle values. This number is
the median of your data set.
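
A minimal Python sketch of this procedure follows. The function name median and the sample lists are
illustrative; for an even number of values it averages the two middle values, which is the usual
convention.

```python
def median(values):
    """Return the middle value of the sorted data."""
    ordered = sorted(values)          # arrange from least to greatest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values


odd_median = median([7, 1, 5, 3, 6])   # sorted: 1, 3, 5, 6, 7
even_median = median([7, 1, 5, 3])     # sorted: 1, 3, 5, 7
print(odd_median, even_median)
```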

Mode

A mode is the most common data point across all the observations. In other words, it’s the value that
occurs most often. The mode is rarely used with continuous data. The data set can have two or more
modes. In such a case, it’s said that data has two or more peaks. The corresponding types of distributions
are called bimodal or multimodal. The mode is not the best way to represent the central tendency since it
may lie quite far from the rest of the data points.
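
As an illustration, Python's standard statistics module can list every mode of a data set. Both
sample lists below are made up.

```python
import statistics

unimodal = [1, 2, 2, 3, 4]   # one value occurs most often
bimodal = [1, 1, 2, 5, 5]    # two values tie for most frequent

# multimode returns all most-frequent values, in order of first appearance.
unimodal_modes = statistics.multimode(unimodal)
bimodal_modes = statistics.multimode(bimodal)

print(unimodal_modes, bimodal_modes)
```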

Measures of dispersion

Measures of dispersion describe the spread of the data. They include the range, interquartile range,
standard deviation and variance. The range is reported as the smallest and largest observations, or as the
difference between them. This is the simplest measure of variability. Dispersion is the state of getting
dispersed or spread. Statistical dispersion means the extent to which numerical data are likely to vary
about an average value. In other words, dispersion helps to understand the distribution of the data.

Measures of Dispersion

In statistics, the measures of dispersion help to interpret the variability of data, i.e. to know how
homogeneous or heterogeneous the data are. In simple terms, they show how squeezed or scattered the
variable is.

Types of Measures of Dispersion

There are two main types of dispersion methods in statistics, which are:

• Absolute Measure of Dispersion
• Relative Measure of Dispersion


Absolute Measure of Dispersion

An absolute measure of dispersion is expressed in the same unit as the original data set. The absolute
dispersion method expresses the variations in terms of the average of deviations of observations, like
standard or mean deviations. It includes range, standard deviation, quartile deviation, etc.

The types of absolute measures of dispersion are:

1. Range: It is simply the difference between the maximum value and the minimum value given in a
data set. Example: 1, 3, 5, 6, 7 => Range = 7 - 1 = 6
2. Variance: Subtract the mean from each value in the set, square each difference, add the squares,
and divide the sum by the total number of values in the data set. Variance:
$\sigma^2 = \sum (X - \mu)^2 / N$
3. Standard Deviation: The square root of the variance is known as the standard deviation, i.e.
S.D. = $\sigma = \sqrt{\sigma^2}$.
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers into
quarters. The quartile deviation is half of the distance between the third and the first quartile.
5. Mean and Mean Deviation: The average of numbers is known as the mean, and the arithmetic
mean of the absolute deviations of the observations from a measure of central tendency is known
as the mean deviation (also called mean absolute deviation).
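
Several of these measures can be computed for the range example (1, 3, 5, 6, 7) with a short Python
sketch. It assumes the population form of the variance (division by N, as in the formula above) and
takes the mean deviation about the mean.

```python
import math

data = [1, 3, 5, 6, 7]
n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)                    # maximum minus minimum
variance = sum((x - mean) ** 2 for x in data) / n     # average squared deviation
std_dev = math.sqrt(variance)                         # square root of the variance
mean_deviation = sum(abs(x - mean) for x in data) / n # average absolute deviation

print("range:", data_range)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 3))
print("mean deviation:", round(mean_deviation, 2))
```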

Range in Statistics

In statistics, the range is the simplest of all the measures of dispersion. It is the difference between the
two extreme observations of the distribution. In other words, the range is the difference between the
maximum and the minimum observation of the distribution.

It is defined by

Range = Xmax – Xmin

Where Xmax is the largest observation and Xmin is the smallest observation of the variable values.

Interquartile Range Definition


The interquartile range defines the difference between the third and the first quartile. Quartiles are the
partitioned values that divide the whole series into 4 equal parts. So, there are 3 quartiles. First Quartile is
denoted by Q1 known as the lower quartile, the second Quartile is denoted by Q2 and the third Quartile is
denoted by Q3 known as the upper quartile. Therefore, the interquartile range is equal to the upper
quartile minus lower quartile.

The difference between the upper and lower quartile is known as the interquartile range. The formula for
the interquartile range is given below

Interquartile range = Upper Quartile – Lower Quartile = Q3 – Q1

where Q1 is the first quartile and Q3 is the third quartile of the series.

How to Calculate the Interquartile Range?

The procedure to calculate the interquartile range is given as follows:

• Arrange the given set of numbers in increasing order.
• Count the given values. If the count is odd, the center value is the median; otherwise the median is
the mean of the two center values. This is the Q2 value.
• The median cuts the given values into two equal parts: a lower half and an upper half.
• The median of the data values below the median is Q1.
• The median of the data values above the median is Q3.
• Finally, subtract Q1 from Q3.
• The resulting value is the interquartile range.
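
The procedure above can be sketched in Python as follows. The function name interquartile_range and
the sample values are illustrative. The sketch leaves the median out of both halves when the count
is odd; other quartile conventions exist and may give slightly different values.

```python
import statistics


def interquartile_range(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # values below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # values above the median
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return q1, q3, q3 - q1


values = [12, 1, 9, 3, 7, 5, 6]   # made-up data; the median (Q2) is 6
q1, q3, iqr = interquartile_range(values)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```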

Standard deviation

Standard deviation is a measure of dispersion in statistics. Dispersion tells you how much your
data is spread out. Specifically, the standard deviation shows you how much your data is spread out
around the mean or average. For example, are all your scores close to the average? Or are lots of scores
way above (or way below) the average score?

Steps to Calculate Standard Deviation

• Find the mean, which is the arithmetic mean of the observations.
• Find the squared differences from the mean: $(\text{data value} - \text{mean})^2$
• Find the average of the squared differences. (Variance = the sum of squared differences ÷ the
number of observations)
• Find the square root of the variance. (Standard deviation = √Variance)

Standard Deviation of Ungrouped Data

The calculation of the standard deviation differs for different kinds of data. The standard deviation
measures the deviation of data from its mean or average position. There are two methods to find the
standard deviation:

• actual mean method
• assumed mean method

Standard Deviation by the Actual Mean Method

$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$

Consider the data observations 3, 2, 5, 6. Here the mean of these data points is 16/4 = 4.

The sum of squared differences from the mean = $(3-4)^2 + (2-4)^2 + (5-4)^2 + (6-4)^2 = 10$

Variance = Sum of squared differences from the mean / number of data points = 10/4 = 2.5

Standard deviation = √2.5 ≈ 1.58
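
The same worked example can be reproduced step by step in Python. As a cross-check, the sketch
compares the result with statistics.pstdev, which computes the population standard deviation
(division by n).

```python
import math
import statistics

data = [3, 2, 5, 6]

mean = sum(data) / len(data)                     # 16 / 4
squared_diffs = [(x - mean) ** 2 for x in data]  # squared difference of each point from the mean
sum_squared = sum(squared_diffs)
variance = sum_squared / len(data)               # average of the squared differences
std_dev = math.sqrt(variance)                    # square root of the variance

print("mean:", mean)
print("sum of squared differences:", sum_squared)
print("variance:", variance)
print("standard deviation:", round(std_dev, 2))
print("matches statistics.pstdev:", math.isclose(std_dev, statistics.pstdev(data)))
```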

Standard deviation by Assumed Mean Method

When the x values are large, an arbitrary value (A) is chosen as the mean. The deviation from this
assumed mean is calculated as d = x - A.

$\sigma = \sqrt{\frac{\sum d^2}{n} - \left(\frac{\sum d}{n}\right)^2}$
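
A short Python sketch of the assumed mean method follows. The data values and the assumed mean
A = 1000 are made up for illustration; the result is compared with the direct (actual mean)
calculation to show that both methods agree.

```python
import math
import statistics

data = [1002, 1005, 1003, 1006]   # hypothetical large values
A = 1000                          # arbitrary assumed mean

d = [x - A for x in data]         # deviations from the assumed mean: 2, 5, 3, 6
n = len(d)

sum_d = sum(d)
sum_d_squared = sum(v ** 2 for v in d)

# sigma = sqrt( sum(d^2)/n - (sum(d)/n)^2 )
sigma = math.sqrt(sum_d_squared / n - (sum_d / n) ** 2)

print("sigma (assumed mean method):", round(sigma, 2))
print("agrees with the actual mean method:", math.isclose(sigma, statistics.pstdev(data)))
```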

Standard Deviation of Grouped Data

When the data points are grouped, we first construct a frequency distribution.

(1) Standard Deviation of Grouped Discrete Frequency Distribution

(2) Standard Deviation of Grouped Continuous Frequency Distribution

(3) Standard Deviation of Random Variables

(4) Standard Deviation of Probability Distribution

The coefficient of variation (CV)


The coefficient of variation (CV) is a statistical measure of the relative dispersion of data points in a data
series around the mean. In finance, the coefficient of variation allows investors to determine how much
volatility, or risk, is assumed in comparison to the amount of return expected from investments.

The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the standard
deviation to the mean (average). For example, the expression "The standard deviation is 15% of the
mean" is a CV. The CV is particularly useful when you want to compare results from two different
surveys or tests that have different measures or values, for example two tests that have different scoring
mechanisms. If sample A has a CV of 12% and sample B has a CV of 25%, you would say that sample B
has more variation, relative to its mean.

Formula

The formula for the coefficient of variation is:

Coefficient of Variation = (Standard Deviation / Mean) * 100.

In symbols: $CV = (SD / \bar{x}) \times 100$.

Multiplying the coefficient by 100 is an optional step to get a percentage, as opposed to a decimal.
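
The comparison of two tests with different scoring scales can be sketched in Python. The helper
coefficient_of_variation and both score lists are made up for illustration, and the sketch assumes
the population standard deviation (division by n).

```python
import statistics


def coefficient_of_variation(values):
    """Return the CV as a percentage: (standard deviation / mean) * 100."""
    return statistics.pstdev(values) / statistics.mean(values) * 100


test_a = [48, 50, 52, 50]       # hypothetical scores on a 100-point scale
test_b = [480, 520, 450, 550]   # hypothetical scores on a 1000-point scale

cv_a = coefficient_of_variation(test_a)
cv_b = coefficient_of_variation(test_b)

print(f"CV of test A: {cv_a:.2f}%")
print(f"CV of test B: {cv_b:.2f}%")
print("Test B varies more relative to its mean:", cv_b > cv_a)
```

Because the CV is a ratio, the two results are comparable even though the raw scores are on
different scales.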


The empirical rule and Chebyshev Rule


The Empirical Rule is an approximation that applies only to data sets with a bell-shaped relative
frequency histogram. It estimates the proportion of the measurements that lie within one, two, and three
standard deviations of the mean: approximately 68%, 95%, and 99.7%, respectively. Chebyshev's
Theorem is a fact that applies to all possible data sets.

The Chebyshev Rule is used when the data are not bell-shaped.

At least $(1 - 1/k^2) \times 100\%$ of the values will fall within k standard deviations of the mean, for k > 1.
For example, when k = 2, at least 75% of the values will fall within $\mu \pm 2\sigma$:
$(1 - 1/2^2) \times 100 = (1 - 1/4) \times 100 = 75$
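
The Chebyshev bound can be computed and checked against a data set that is not bell-shaped. In this
Python sketch the function name chebyshev_lower_bound and the skewed sample are made up for
illustration.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's rule is informative only for k > 1")
    return (1 - 1 / k ** 2) * 100


# Made-up, strongly right-skewed data set.
data = [1, 1, 1, 2, 2, 3, 4, 5, 8, 20]
mean = statistics.mean(data)
sd = statistics.pstdev(data)

k = 2
bound = chebyshev_lower_bound(k)
within = sum(1 for x in data if abs(x - mean) <= k * sd)
within_pct = within / len(data) * 100

print(f"Chebyshev bound for k={k}: at least {bound}%")
print(f"Observed share within {k} standard deviations: {within_pct}%")
```

The observed share may be well above the bound; the rule only guarantees a minimum.
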
Five Number Summary, Boxplot and other plots

The five-number summary presents these values together, ordered from lowest to highest: minimum value,
lower quartile (Q1), median value (Q2), upper quartile (Q3), maximum value. These five numbers help to
describe the centre, spread and shape of the data.

It indicates the shape of the distribution: left skewed, symmetric or right skewed.
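
A five-number summary can be sketched in Python as follows. The function five_number_summary and the
sample data are illustrative, and the quartiles are taken as the medians of the lower and upper
halves (one of several conventions).

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # values below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # values above the median
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


data = [1, 2, 3, 4, 5, 9, 20]   # made-up data with a long upper tail
minimum, q1, med, q3, maximum = five_number_summary(data)
print(minimum, q1, med, q3, maximum)

# The upper part (median to maximum) is much longer than the lower part
# (minimum to median), which suggests a right-skewed distribution.
right_skewed = (maximum - med) > (med - minimum)
print("right skewed:", right_skewed)
```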

Box plot
A box and whisker plot (also called a box plot) displays the five-number summary of a set of data. The
five-number summary is the minimum, first quartile, median, third quartile, and maximum. In a box plot,
we draw a box from the first quartile to the third quartile. A vertical line goes through the box at the
median. The whiskers extend from each quartile to the minimum or maximum.
