Statistical Machine Learning

Statistical machine learning involves descriptive statistics and inferential statistics. Descriptive statistics help summarize and describe data through measures of central tendency like mean, median, and mode, and measures of dispersion like range, variance, and standard deviation. Inferential statistics help understand properties of a data sample and make predictions about a population based on that sample. Histograms are useful for visualizing the distribution of numerical data by binning data into ranges, and can help determine if a process is normally distributed.

Uploaded by

Deva Hema

STATISTICAL MACHINE LEARNING

Prepared By

D.Deva Hema

Inferential statistics

Inferential statistics can help us understand the collective properties of the elements of a data sample.
Knowing the sample mean, variance, and distribution of a variable can help us understand the world
around us.

Descriptive statistics

Descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be
either a representation of the entire population or a sample of a population. Descriptive statistics are
broken down into measures of central tendency and measures of variability. Measures of central tendency
include the mean, median, and mode, while measures of variability include standard deviation, variance,
minimum and maximum values, kurtosis, and skewness.

Descriptive statistics, in short, help describe and understand the features of a specific data set by giving
short summaries about the sample and measures of the data. The most recognized types of descriptive
statistics are measures of center: the mean, median, and mode, which are used at almost all levels of math
and statistics. The mean, or the average, is calculated by adding all the figures within the data set and then
dividing by the number of figures within the set.

For example, the sum of the following data set is 20: (2, 3, 4, 5, 6). The mean is 4 (20/5). The mode of a
data set is the value appearing most often, and the median is the figure situated in the middle of the data
set. It is the figure separating the higher figures from the lower figures within a data set. However, there
are less common types of descriptive statistics that are still very important.

A frequency distribution shows how often each different value in a set of data occurs. A histogram is the
most commonly used graph to show frequency distributions. It looks very much like a bar chart, but there
are important differences between them. This helpful data collection and analysis tool is considered one
of the seven basic quality tools.

Histogram

A histogram is used to summarize discrete or continuous data. In other words, it provides a visual
interpretation of numerical data by showing the number of data points that fall within a specified range of
values (called "bins"). It is similar to a vertical bar graph. However, a histogram, unlike a vertical bar
graph, shows no gaps between the bars.

WHEN TO USE A HISTOGRAM

Use a histogram when:

• The data are numerical
• You want to see the shape of the data’s distribution, especially when determining whether the output of a
  process is distributed approximately normally
• You are analyzing whether a process can meet the customer’s requirements
• You are analyzing what the output from a supplier’s process looks like
• You want to see whether a process change has occurred from one time period to another
• You want to determine whether the outputs of two or more processes are different
• You wish to communicate the distribution of data quickly and easily to others

HOW TO CREATE A HISTOGRAM

• Collect at least 50 consecutive data points from a process.
• Use a histogram worksheet to set up the histogram. It will help you determine the number of bars, the
  range of numbers that go into each bar, and the labels for the bar edges. After calculating W in Step 2 of
  the worksheet, use your judgment to adjust it to a convenient number. For example, you might decide to
  round 0.9 to an even 1.0. The value for W must not have more decimal places than the numbers you will
  be graphing.
• Draw x- and y-axes on graph paper. Mark and label the y-axis for counting data values. Mark and label
  the x-axis with the L values from the worksheet. The spaces between these numbers will be the bars of
  the histogram. Do not allow for spaces between bars.
• For each data point, mark off one count above the appropriate bar with an X or by shading that portion of
  the bar.
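
The manual procedure above amounts to counting how many values fall into each equal-width bin. The following is a minimal Python sketch of that counting step, assuming equal-width bins that span the minimum to the maximum value. The function name `histogram_counts`, the choice of four bins, and the sample values are illustrative assumptions, not part of the worksheet.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value belongs to the last bin, so clamp the index.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Illustrative data only (far fewer than the 50 points recommended above).
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram_counts(sample, num_bins=4)

# Text histogram: one X per data point. Each bin includes its left edge;
# the last bin also includes its right edge.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:g} to {right:g}: {'X' * count}")
```

Each printed row corresponds to one bar, and the number of X marks is the bar height.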

What is the central tendency?

How would you describe a data set with a single value? The most common approach is to define a central
position of your data distribution. This is what the statisticians call the central tendency. Being a core
concept in statistics, the central tendency summarizes the entire data set, thus giving an idea of its typical
value.

What are the measures of central tendency?

The arithmetic mean (or average) is the first measure that comes to one’s mind when talking about a
center point in the data or its typical value. Nevertheless, there are also other measures that describe the
central tendency more accurately in certain scenarios. This time we’ll break down the purposes of the
three main measures to describe the central position within a data set, namely:

Mean

Median

Mode

Mean

The mean is the most common way to summarize a data set. You can use the mean with either discrete or
continuous data. Yet, it’s mostly used with continuous data. There are two important properties the mean
has:

• The calculation of the mean considers each data point of your data set.
• The sum of deviations of each data point from the mean is always zero.
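
As a quick check of the second property, the short Python sketch below computes the mean of a small data set and sums the deviations from it. The data values are an illustrative choice.

```python
from statistics import mean

data = [2, 3, 4, 5, 6]
m = mean(data)  # (2 + 3 + 4 + 5 + 6) / 5

# Deviation of each data point from the mean.
deviations = [x - m for x in data]
total_deviation = sum(deviations)

print(m, deviations, total_deviation)
```

With non-integer data the sum may differ from zero by a tiny floating-point rounding error.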

Median

The median is the middlemost number in the data distribution.

How to calculate the median

First, arrange the values in order from the least to the greatest. Next, select the data point which is located
in the middle. This number is the median of your data set. If the data set has an even number of values,
the median is the average of the two middle values.
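
The steps can be sketched in Python as follows. The function name `median_of` is a hypothetical helper for illustration, and the even-count branch averages the two middle values.

```python
def median_of(values):
    """Return the middle value of the sorted data (average of two if even)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    # Even number of values: average the two middle ones.
    return (data[mid - 1] + data[mid]) / 2


odd_case = median_of([5, 2, 6, 3, 4])   # sorted: 2, 3, 4, 5, 6
even_case = median_of([3, 2, 5, 6])     # sorted: 2, 3, 5, 6
print(odd_case, even_case)
```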

Mode

A mode is the most common data point across all the observations. In other words, it’s the value that
occurs most often. The mode is rarely used with continuous data. The data set can have two or more
modes. In such a case, it’s said that data has two or more peaks. The corresponding types of distributions
are called bimodal or multimodal. The mode is not always the best way to represent the central tendency,
since it may lie quite far from the rest of the data points.
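
Both points can be illustrated with Python's standard `statistics` module (`multimode` requires Python 3.8 or later). The two data sets below are made-up examples: one with two modes, and one where the mode sits far from most of the data.

```python
from statistics import median, multimode

# Two values tie for most frequent, so the data set is bimodal.
bimodal = [1, 2, 2, 3, 3, 4]
modes = multimode(bimodal)

# The most frequent value (1) lies far from the bulk of the data.
skewed = [1, 1, 50, 51, 52, 53]
skewed_mode = multimode(skewed)
skewed_median = median(skewed)

print(modes, skewed_mode, skewed_median)
```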

Measures of dispersion

Measures of dispersion describe the spread of the data. They include the range, interquartile range,
standard deviation and variance. The range is determined by the smallest and largest observations and is the
simplest measure of variability. Dispersion is the state of being dispersed or spread. Statistical
dispersion means the extent to which numerical data are likely to vary about an average value. In other
words, dispersion helps to understand the distribution of the data.

Measures of Dispersion

In statistics, the measures of dispersion help to interpret the variability of data, i.e. to know how
homogeneous or heterogeneous the data are. In simple terms, they show how squeezed or scattered the
variable is.

Types of Measures of Dispersion

There are two main types of dispersion methods in statistics, which are:

• Absolute Measure of Dispersion
• Relative Measure of Dispersion

Absolute Measure of Dispersion

An absolute measure of dispersion is expressed in the same unit as the original data set. The absolute dispersion
method expresses the variations in terms of the average of deviations of observations, like standard or
mean deviations. It includes range, standard deviation, quartile deviation, etc.

The types of absolute measures of dispersion are:

1. Range: It is simply the difference between the maximum value and the minimum value given in a
data set. Example: 1, 3, 5, 6, 7 => Range = 7 - 1 = 6
2. Variance: Subtract the mean from each value in the set, square each difference, add the
squares, and finally divide the sum by the total number of values in the data set. Variance:
$\sigma^2 = \sum (X - \mu)^2 / N$
3. Standard Deviation: The square root of the variance is known as the standard deviation, i.e. S.D.
= $\sqrt{\sigma^2} = \sigma$.
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers into
quarters. The quartile deviation is half of the distance between the third and the first quartile.
5. Mean and Mean Deviation: The average of numbers is known as the mean, and the arithmetic
mean of the absolute deviations of the observations from a measure of central tendency is known
as the mean deviation (also called mean absolute deviation).
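
A minimal Python sketch of several of these measures, applied to the data set from the range example (1, 3, 5, 6, 7). It assumes the population form of the variance (division by N), as in the formula above, and takes the mean deviation about the mean. The variable names are illustrative.

```python
from math import sqrt

data = [1, 3, 5, 6, 7]
n = len(data)

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Population variance: average squared deviation from the mean.
mu = sum(data) / n
variance = sum((x - mu) ** 2 for x in data) / n

# Standard deviation: square root of the variance.
std_dev = sqrt(variance)

# Mean (absolute) deviation about the mean.
mean_deviation = sum(abs(x - mu) for x in data) / n

print(data_range, variance, std_dev, mean_deviation)
```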

Range in Statistics

In statistics, the range is the simplest of all the measures of dispersion. It is the difference between the
two extreme observations of the distribution. In other words, the range is the difference between the
maximum and the minimum observation of the distribution.

It is defined by

Range = Xmax – Xmin

Where Xmax is the largest observation and Xmin is the smallest observation of the variable values.

Interquartile Range Definition

The interquartile range is the difference between the third and the first quartile. Quartiles are the
partition values that divide the whole series into 4 equal parts, so there are 3 quartiles. The first quartile,
denoted by Q1, is known as the lower quartile; the second quartile is denoted by Q2; and the third quartile,
denoted by Q3, is known as the upper quartile. Therefore, the interquartile range is equal to the upper
quartile minus the lower quartile.

The formula for the interquartile range is given below:

Interquartile range = Upper Quartile - Lower Quartile = $Q_3 - Q_1$

where Q1 is the first quartile and Q3 is the third quartile of the series.

How to Calculate the Interquartile Range?

The procedure to calculate the interquartile range is as follows:

• Arrange the given set of numbers in increasing order.
• Count the given values. If the count is odd, the center value is the median; if it is even, the
  median is the mean of the two center values. This is the Q2 value.
• The median cuts the ordered values into two equal parts, a lower half and an upper half.
• The median of the data values below the median is Q1.
• The median of the data values above the median is Q3.
• Finally, subtract Q1 from Q3.
• The resulting value is the interquartile range.
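
The procedure can be sketched in Python as below. This follows the median-of-halves convention described in these notes, where the median itself is excluded from both halves when the count is odd. Statistical libraries often use other interpolation conventions and may return slightly different quartiles. The function name and the data are illustrative assumptions.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, skip the middle value so it is in neither half.
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)


q1, q2, q3 = quartiles_by_halves([7, 1, 12, 5, 3, 9, 6])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```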

Standard deviation

Standard deviation is a measure of dispersion in statistics. "Dispersion" tells you how much your
data is spread out. Specifically, the standard deviation shows you how much your data is spread out around
the mean or average. For example, are all your scores close to the average? Or are lots of scores way
above (or way below) the average score?

Steps to Calculate Standard Deviation

• Find the mean, which is the arithmetic mean of the observations.
• Find the squared differences from the mean: (data value - mean)$^2$.
• Find the average of the squared differences. (Variance = the sum of squared differences ÷ the
  number of observations)
• Find the square root of the variance. (Standard deviation = √Variance)

Standard Deviation of Ungrouped Data

The calculation of the standard deviation differs for different kinds of data. The standard deviation measures
the deviation of data from its mean or average position. There are two methods to find the standard deviation:

• actual mean method
• assumed mean method

Standard Deviation by the Actual Mean Method

$\sigma = \sqrt{\sum (x - \bar{x})^2 / n}$

Consider the data observations 3, 2, 5, 6. Here the mean of these data points is 16/4 = 4.

The sum of squared differences from the mean = $(3-4)^2 + (2-4)^2 + (5-4)^2 + (6-4)^2 = 10$

Variance = Sum of squared differences from the mean / number of data points = 10/4 = 2.5

Standard deviation = √2.5 ≈ 1.58
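
The worked example can be reproduced step by step in Python. This is a small sketch using the same four observations; the variable names are illustrative.

```python
from math import sqrt

data = [3, 2, 5, 6]
n = len(data)

# Step 1: the mean of the observations.
mean_value = sum(data) / n

# Step 2: squared difference of each observation from the mean.
squared_diffs = [(x - mean_value) ** 2 for x in data]

# Step 3: variance is the average of the squared differences.
variance = sum(squared_diffs) / n

# Step 4: standard deviation is the square root of the variance.
sigma = sqrt(variance)

print(mean_value, sum(squared_diffs), variance, round(sigma, 2))
```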

Standard deviation by Assumed Mean Method

When the x values are large, an arbitrary value (A) is chosen as an assumed mean. The deviation from this
assumed mean is calculated as d = x - A.

$\sigma = \sqrt{\sum d^2 / n - \left(\sum d / n\right)^2}$
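
The sketch below applies the assumed mean formula to a set of large values and compares the result with the actual mean method. The data set, the assumed mean A = 1000, and the function names are illustrative assumptions.

```python
from math import sqrt


def std_actual_mean(values):
    """Standard deviation from deviations about the actual mean."""
    n = len(values)
    mean_value = sum(values) / n
    return sqrt(sum((x - mean_value) ** 2 for x in values) / n)


def std_assumed_mean(values, assumed_mean):
    """Standard deviation from deviations d = x - A about an assumed mean A."""
    n = len(values)
    d = [x - assumed_mean for x in values]
    return sqrt(sum(di ** 2 for di in d) / n - (sum(d) / n) ** 2)


data = [1003, 1002, 1005, 1006]
sigma_actual = std_actual_mean(data)
sigma_assumed = std_assumed_mean(data, assumed_mean=1000)
print(sigma_actual, sigma_assumed)
```

Any choice of A gives the same result; a value close to the data simply keeps the intermediate numbers small.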

Standard Deviation of Grouped Data

When the data points are grouped, we first construct a frequency distribution.

(1) Standard Deviation of Grouped Discrete Frequency Distribution

(2) Standard Deviation of Grouped Continuous Frequency Distribution

(3) Standard Deviation of Random Variables

(4) Standard Deviation of Probability Distribution

The coefficient of variation (CV)

The coefficient of variation (CV) is a statistical measure of the relative dispersion of data points in a data
series around the mean. In finance, the coefficient of variation allows investors to determine how much
volatility, or risk, is assumed in comparison to the amount of return expected from investments.

The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the standard
deviation to the mean (average). For example, the expression "The standard deviation is 15% of the
mean" is a CV. The CV is particularly useful when you want to compare results from two different
surveys or tests that have different measures or values, for example two tests that have different
scoring mechanisms. If sample A has a CV of 12% and sample B has a CV of
25%, you would say that sample B has more variation, relative to its mean.

Formula

The formula for the coefficient of variation is:
Coefficient of Variation = (Standard Deviation / Mean) × 100.
In symbols: $CV = (SD / \bar{x}) \times 100$.

Multiplying the coefficient by 100 is an optional step to get a percentage, as opposed to a decimal.
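
A minimal Python sketch of the formula. It assumes the population standard deviation (`statistics.pstdev`), in line with the division by the number of observations used earlier in these notes; a sample standard deviation could be substituted. The function name and both data sets are made-up examples.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the CV as a percentage: (standard deviation / mean) * 100."""
    return pstdev(values) / mean(values) * 100


# Two hypothetical tests scored on different scales.
test_a = [2, 4, 4, 4, 5, 5, 7, 9]
test_b = [48, 50, 52]

cv_a = coefficient_of_variation(test_a)
cv_b = coefficient_of_variation(test_b)
print(round(cv_a, 1), round(cv_b, 1))
```

The data set with the larger CV has more variation relative to its own mean, regardless of the scale it is measured on. The CV is undefined when the mean is zero.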


The empirical rule and Chebyshev Rule

The Empirical Rule is an approximation that applies only to data sets with a bell-shaped relative
frequency histogram. It estimates the proportion of the measurements that lie within one, two, and three
standard deviations of the mean: approximately 68%, 95%, and 99.7%, respectively. Chebyshev's Theorem,
by contrast, is a fact that applies to all possible data sets.

The Chebyshev Rule is used when the data are not bell-shaped.

At least $(1 - 1/k^2) \times 100\%$ of the values fall within k standard deviations of the mean, for k > 1.
For example, when k = 2, at least 75% of the values fall within $\mu \pm 2\sigma$:
$(1 - 1/2^2) \times 100 = 75$
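
The bound is easy to evaluate for other values of k. A small Python sketch follows; the function name `chebyshev_min_percent` is a hypothetical helper.

```python
def chebyshev_min_percent(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return (1 - 1 / k ** 2) * 100


bound_k2 = chebyshev_min_percent(2)
bound_k3 = chebyshev_min_percent(3)
print(bound_k2, round(bound_k3, 1))
```

These are guaranteed minimums for any data set; for bell-shaped data the Empirical Rule gives much tighter estimates.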

Five Number Summary, Boxplot and other plots

The five-number summary consists of these values, presented together and ordered from lowest to highest:
minimum value, lower quartile (Q1), median value (Q2), upper quartile (Q3), maximum value. These five
numbers help to describe the centre, spread and shape of the data.

The summary indicates the shape of the distribution: left-skewed, symmetric or right-skewed.

Box plot

A box and whisker plot, also called a box plot, displays the five-number summary of a set of data. The
five-number summary is the minimum, first quartile, median, third quartile, and maximum. In a box plot,
we draw a box from the first quartile to the third quartile. A vertical line goes through the box at the
median.
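
The five numbers that a box plot displays can be computed with a short Python sketch. It assumes the median-of-halves convention for the quartiles (plotting libraries may interpolate differently), and the function name and data are illustrative.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for the data."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd count, the middle value is excluded from both halves.
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


summary = five_number_summary([39, 7, 41, 15, 40, 36])
print(summary)
```

In this example the distance from the minimum to Q1 is much larger than the distance from Q3 to the maximum, which suggests a longer lower tail (left skew).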
