Statistical Machine Learning
Prepared By
D.Deva Hema
Inferential statistics:
Inferential statistics can help us draw conclusions about a population from the collective properties of the
elements of a data sample. Knowing the sample mean, variance, and distribution of a variable can help us
understand the world around us.
Descriptive statistics
Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be
either a representation of the entire population or a sample of a population. Descriptive statistics are
broken down into measures of central tendency and measures of variability. Measures of central tendency
include the mean, median, and mode, while measures of variability include standard deviation, variance,
minimum and maximum values, kurtosis, and skewness.
Descriptive statistics, in short, help describe and understand the features of a specific data set by giving
short summaries about the sample and measures of the data. The most recognized types of descriptive
statistics are measures of center: the mean, median, and mode, which are used at almost all levels of math
and statistics. The mean, or the average, is calculated by adding all the figures within the data set and then
dividing by the number of figures within the set.
For example, the sum of the following data set is 20: (2, 3, 4, 5, 6). The mean is 4 (20/5). The mode of a
data set is the value appearing most often, and the median is the figure situated in the middle of the data
set. It is the figure separating the higher figures from the lower figures within a data set. However, there
are less common types of descriptive statistics that are still very important.
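The three measures of center described above can be checked with a minimal Python sketch. It uses the data set from the text for the mean and median; the second list, `scores`, is a hypothetical data set added only so that one value repeats and the mode is well defined.

```python
import statistics

# Data set from the text: the sum is 20 and there are 5 figures.
data = [2, 3, 4, 5, 6]
mean = sum(data) / len(data)         # add all figures, divide by their count
median = statistics.median(data)     # middle figure of the sorted data

# Hypothetical data set with a repeated value (not from the notes).
scores = [2, 3, 3, 5, 6]
mode = statistics.mode(scores)       # value appearing most often

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```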
A frequency distribution shows how often each different value in a set of data occurs. A histogram is the
most commonly used graph to show frequency distributions. It looks very much like a bar chart, but there
are important differences between them. This helpful data collection and analysis tool is considered one
of the seven basic quality tools.
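As a small illustration of a frequency distribution, the sketch below counts how often each value occurs. The list `values` is a hypothetical data set, not one taken from these notes.

```python
from collections import Counter

# Hypothetical observations (assumption for illustration only).
values = [1, 2, 2, 3, 3, 3, 4]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(values)

for value in sorted(frequency):
    print(value, "occurs", frequency[value], "time(s)")
```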
Histogram
A histogram is used to summarize discrete or continuous data. In other words, it provides a visual
interpretation of numerical data by showing the number of data points that fall within a specified range of
values (called "bins"). It is similar to a vertical bar graph. However, a histogram, unlike a vertical bar
graph, shows no gaps between the bars.
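The idea of bins can be sketched in a few lines of Python. The function name `bin_counts`, the equal-width binning rule, and the `heights` data are all assumptions made for illustration; plotting libraries may treat bin edges differently.

```python
def bin_counts(values, low, width, n_bins):
    """Count the values falling into each of n_bins equal-width bins.

    Bin i covers [low + i*width, low + (i+1)*width); the last bin also
    includes its upper edge. Values outside the overall range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - low) // width)
        if i == n_bins:              # value sits exactly on the top edge
            i = n_bins - 1
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical heights in cm (not from the notes).
heights = [150, 152, 158, 161, 163, 167, 170, 171, 179, 180]
counts = bin_counts(heights, low=150, width=10, n_bins=3)

# Text version of the histogram: one adjacent bar per bin.
for i, c in enumerate(counts):
    start = 150 + i * 10
    print(f"{start}-{start + 10}: " + "#" * c)
```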
WHEN TO USE A HISTOGRAM
How would you describe a data set with a single value? The most common approach is to define a central
position of your data distribution. This is what the statisticians call the central tendency. Being a core
concept in statistics, the central tendency summarizes the entire data set, thus giving an idea of its typical
value.
The arithmetic mean (or average) is the first measure that comes to one’s mind when talking about a
center point in the data or its typical value. Nevertheless, there are also other measures that describe the
central tendency more accurately in certain scenarios. This time we’ll break down the purposes of the
three main measures to describe the central position within a data set, namely:
Mean
Median
Mode
Mean
The mean is the most common way to summarize a data set. You can use the mean with either discrete or
continuous data. Yet, it’s mostly used with continuous data. There are two important properties the mean
has:
The calculation of the mean considers each data point of your data set.
The sum of deviations of each data point from the mean is always zero.
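The second property can be verified numerically. This is only a quick check on one small data set (the one used earlier in these notes), not a proof.

```python
data = [2, 3, 4, 5, 6]
mean = sum(data) / len(data)

# Deviation of each data point from the mean.
deviations = [x - mean for x in data]
total = sum(deviations)

print(deviations)
print(total)
```

With non-integer data the total may differ from zero by a tiny floating-point rounding error.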
Median
First, arrange the values in order from the least to the greatest. Next, select the data point which is located
in the middle. This number is the median of your data set.
Mode
A mode is the most common data point across all the observations. In other words, it’s the value that
occurs most often. The mode is rarely used with continuous data. The data set can have two or more
modes. In such a case, it’s said that data has two or more peaks. The corresponding types of distributions
are called bimodal or multimodal. The mode is not always the best way to represent the central tendency,
since it may lie quite far from the rest of the data points.
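Both points can be illustrated with a short sketch using Python's `statistics` module (`multimode` requires Python 3.8 or later). The two data sets are hypothetical and chosen only to show a bimodal case and a mode that sits far from the bulk of the data.

```python
import statistics

# Hypothetical bimodal data: 2 and 4 both occur twice.
bimodal = [1, 2, 2, 3, 4, 4, 5]
modes = statistics.multimode(bimodal)    # all most-frequent values

# Hypothetical data where the mode lies far from most observations.
skewed = [1, 1, 50, 51, 52, 53]
far_mode = statistics.mode(skewed)
middle = statistics.median(skewed)

print("modes:", modes)
print("mode:", far_mode, "median:", middle)
```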
Measures of dispersion
Measures of dispersion describe the spread of the data. They include the range, interquartile range,
standard deviation and variance. The range is given as the smallest and largest observations. This is the
simplest measure of variability. Dispersion is the state of being dispersed or spread. Statistical
dispersion means the extent to which numerical data are likely to vary about an average value. In other
words, dispersion helps to understand the distribution of the data.
In statistics, the measures of dispersion help to interpret the variability of data, i.e. to know how
homogeneous or heterogeneous the data are. In simple terms, they show how squeezed or scattered the
variable is.
There are two main types of dispersion measures in statistics: absolute and relative.
An absolute measure of dispersion has the same unit as the original data set. It expresses the
variation in terms of the average of deviations of observations, such as the standard or mean
deviation. Absolute measures include the range, standard deviation, quartile deviation, etc.
1. Range: It is simply the difference between the maximum value and the minimum value given in a
data set. Example: 1, 3, 5, 6, 7 => Range = 7 - 1 = 6
2. Variance: Subtract the mean from each value in the set, square each difference, add the
squares, and finally divide the sum by the total number of values in the data set. Variance:
$\sigma^2 = \sum (X - \mu)^2 / N$
3. Standard Deviation: The square root of the variance is known as the standard deviation, i.e.
S.D. = $\sigma = \sqrt{\sigma^2}$.
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers into
quarters. The quartile deviation is half of the distance between the third and the first quartile.
5. Mean and Mean Deviation: The average of numbers is known as the mean and the arithmetic
mean of the absolute deviations of the observations from a measure of central tendency is known
as the mean deviation (also called mean absolute deviation).
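The measures in this list can be computed directly from their definitions. The sketch below uses the data set from the range example and divides by N (the population form used above). The quartile deviation is left out here because its value depends on the convention used to locate the quartiles.

```python
import math
import statistics

data = [1, 3, 5, 6, 7]                     # data set from the range example

data_range = max(data) - min(data)          # maximum minus minimum
mu = sum(data) / len(data)                  # mean
variance = sum((x - mu) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)               # square root of the variance
# Mean (absolute) deviation, measured here from the mean.
mean_deviation = sum(abs(x - mu) for x in data) / len(data)

print("range:", data_range)
print("variance:", variance)
print("standard deviation:", std_dev)
print("mean deviation:", mean_deviation)
```

The standard library functions `statistics.pvariance` and `statistics.pstdev` give the same variance and standard deviation.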
Range in Statistics
In statistics, the range is the simplest of all the measures of dispersion. It is the difference between the
two extreme observations of the distribution. In other words, the range is the difference between the
maximum and the minimum observation of the distribution.
It is defined by
$\text{Range} = X_{max} - X_{min}$
where $X_{max}$ is the largest observation and $X_{min}$ is the smallest observation of the variable values.
The difference between the upper and lower quartile is known as the interquartile range. The formula for
the interquartile range is given below
$\text{IQR} = Q_3 - Q_1$
where $Q_1$ is the first quartile and $Q_3$ is the third quartile of the series.
To find the quartiles, first arrange the given values in ascending order.
Then count the given values. If the count is odd, the center value is the median; if it is even,
the median is the mean of the two center values. This is known as the Q2 value.
The median cuts the given values into two equal parts, a lower half and an upper half.
The median of the data values below the median value represents Q1.
The median of the data values above the median value represents Q3.
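The quartile procedure can be sketched as follows. The function name `quartiles` and the data set are hypothetical. Excluding the middle value from both halves when the count is odd is one common convention and is an assumption here; other conventions give slightly different quartiles.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves procedure.

    Assumption: for an odd count the middle value is left out of
    both the lower and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

# Hypothetical data set (not from the notes).
q1, q2, q3 = quartiles([1, 3, 5, 6, 7, 9, 12])
iqr = q3 - q1                       # interquartile range
quartile_deviation = iqr / 2        # half the distance between Q3 and Q1

print(q1, q2, q3, iqr, quartile_deviation)
```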
Standard deviation
Standard deviation is a measure of dispersion in statistics. Dispersion tells you how much your
data is spread out. Specifically, the standard deviation shows you how much your data is spread out
around the mean or average. For example, are all your scores close to the average? Or are lots of scores
way above (or way below) the average score?
The calculation of the standard deviation differs for ungrouped and grouped data. The standard deviation
measures the deviation of data from its mean or average position. There are two methods to find the
standard deviation of ungrouped data: using the actual mean and using an assumed mean.
$\sigma = \sqrt{\sum (x - \bar{x})^2 / n}$
Consider the data observations 3, 2, 5, 6. Here the mean of these data points is 16/4 = 4.
Variance = sum of squared differences from the mean / number of data points = (1 + 4 + 1 + 4)/4 = 10/4 = 2.5
Standard deviation = √2.5 ≈ 1.58
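The worked example can be reproduced step by step. This sketch follows the population formula (division by n); `statistics.pstdev` from the standard library is used only as a cross-check.

```python
import math
import statistics

data = [3, 2, 5, 6]                        # observations from the example

mean = sum(data) / len(data)               # 16 / 4
squared_diffs = [(x - mean) ** 2 for x in data]
variance = sum(squared_diffs) / len(data)  # 10 / 4
sd = math.sqrt(variance)

print("mean:", mean)
print("squared differences:", squared_diffs)
print("variance:", variance)
print("standard deviation:", round(sd, 4))
```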
When the x values are large, an arbitrary value (A) is chosen as an assumed mean. The deviation from this
assumed mean is calculated as d = x - A.
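The notes do not give the formula that goes with the assumed mean. A standard shortcut, assumed here to be the one intended, is σ = √(Σd²/n − (Σd/n)²). The sketch below applies it to a hypothetical data set with large values and compares the result with the direct calculation.

```python
import math

# Hypothetical large observations (not from the notes).
data = [1002, 1005, 1003, 1006]
A = 1000                                   # assumed mean, chosen arbitrarily

d = [x - A for x in data]                  # deviations d = x - A
n = len(d)

# Shortcut formula (assumption): sqrt(sum(d^2)/n - (sum(d)/n)^2)
sd_assumed = math.sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)

# Direct calculation from the actual mean, for comparison.
mean = sum(data) / n
sd_direct = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

print(sd_assumed, sd_direct)
```

The choice of A does not change the result; it only keeps the intermediate numbers small.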
When the data points are grouped, we first construct a frequency distribution.
(1)Standard Deviation of Grouped Discrete Frequency Distribution
The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the standard
deviation to the mean (average). For example, the expression "The standard deviation is 15% of the
mean" is a CV. The CV is particularly useful when you want to compare results from two different
surveys or tests that have different measures or values, for example two tests that have different
scoring mechanisms. If sample A has a CV of 12% and sample B has a CV of 25%, you would say that
sample B has more variation, relative to its mean.
Formula
The formula for the coefficient of variation is:
Coefficient of Variation = (Standard Deviation / Mean) * 100.
In symbols, CV = (SD / x̄) * 100, where x̄ is the mean.
Multiplying the coefficient by 100 is an optional step to get a percentage, as opposed to a decimal.
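A minimal sketch of the formula, comparing two tests scored on different scales. The function name `coefficient_of_variation` and both score lists are hypothetical. The population standard deviation is used here as an assumption; the sample standard deviation could be used instead.

```python
import statistics

def coefficient_of_variation(values):
    """Return the CV as a percentage: (standard deviation / mean) * 100."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

# Hypothetical scores from two tests with different scoring scales.
test_a = [48, 50, 52]
test_b = [400, 500, 600]

cv_a = coefficient_of_variation(test_a)
cv_b = coefficient_of_variation(test_b)

print(f"CV of test A: {cv_a:.2f}%")
print(f"CV of test B: {cv_b:.2f}%")
```

Test B has the larger CV, so its scores vary more relative to their mean, even though the two tests use different scales.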
At least $(1 - 1/k^2) \times 100\%$ of the values will fall within k standard deviations of the mean, for
any k > 1 (Chebyshev's theorem).
For example, when k = 2, at least 75% of the values will fall within $\mu \pm 2\sigma$:
$(1 - 1/2^2) \times 100 = (1 - 1/4) \times 100 = 75$
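The bound can be computed for any k > 1 and compared with an actual data set. The function name `chebyshev_bound` and the data, which include one extreme value, are hypothetical.

```python
import math

def chebyshev_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return (1 - 1 / k ** 2) * 100

# Hypothetical data with one extreme value (not from the notes).
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
mu = sum(data) / len(data)
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))

k = 2
inside = sum(1 for x in data if abs(x - mu) <= k * sigma)
share_within = 100 * inside / len(data)

print("guaranteed at least:", chebyshev_bound(k), "%")
print("observed:", share_within, "%")
```

The observed share is never below the bound, whatever the shape of the distribution.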
Five Number Summary, Boxplot and other plots
The five-number summary consists of five values presented together and ordered from lowest to highest:
minimum value, lower quartile (Q1), median value (Q2), upper quartile (Q3), maximum value. These five
numbers help to describe the centre, spread and shape of the data.
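A five-number summary can be computed with a short function. The name `five_number_summary` and the data are hypothetical, and the quartiles are found as medians of the lower and upper halves, which is one of several conventions in use.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    with the middle value left out of both halves when the count is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0],
            statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper),
            ordered[-1])

# Hypothetical data set (not from the notes).
summary = five_number_summary([7, 1, 12, 3, 9, 5, 6, 15])
print(summary)
```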
Box plot
A box and whisker plot, also called a box plot, displays the five-number summary of a set of data. The
five-number summary is the minimum, first quartile, median, third quartile, and maximum. In a box plot,
we draw a box from the first quartile to the third quartile. A vertical line goes through the box at the
median.