Math for Machine Learning
Vectors

  Definitions
  - Vectors as Directions
  - Geometry of Column Vectors

  Operations
  - Addition as Displacement
  - Scalar Multiplication
  - Subtraction as Mapping

  Distances
  - Measures of Magnitude (norms, the lengths of vectors)

  Dot product and how to extract angles
  - 2D intuition, and the same picture in 3D
  - Angles: v.w = |v| |w| cos(theta), so the angle can be read off from the dot product
  - Sign: v.w > 0 when the vectors point in broadly the same direction, v.w < 0 when they point in broadly opposite directions
  - Orthogonality: v.w = 0

  Hyperplane
  - The thing orthogonal to a given vector
  - Can be presented as a perpendicular line in 2D and a perpendicular surface (plane) in 3D
  - Decision plane

  Linear dependence
  - Vectors v1, ..., vk are linearly dependent if there are some a's, not all zero, with a1*v1 + a2*v2 + ... + ak*vk = 0
  - Example: a1 = 1, a2 = -2, a3 = -1
  - A linearly dependent set lies in a lower-dimensional space
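A minimal sketch of the dot-product ideas above, assuming NumPy is available; the two vectors are made up for illustration. It extracts the angle from v.w and checks orthogonality.

```python
import numpy as np

v = np.array([3.0, 4.0])
w = np.array([4.0, -3.0])

# Measures of magnitude: the Euclidean norm (length) of a vector.
print(np.linalg.norm(v))           # 5.0

# Dot product: v.w = |v||w|cos(theta), so the angle can be extracted.
dot = v @ w
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
print(np.degrees(theta))           # 90.0 degrees

# Orthogonality: v.w = 0 means the vectors are perpendicular.
print(np.isclose(dot, 0.0))        # True
```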
Matrices

  Matrix Products
  - If A is a matrix where the rows are features w_i and B is a matrix where the columns are data vectors v_j, then the (i, j)-th entry of the product is w_i.v_j, which is to say the i-th feature of the j-th vector.
  - In formulae: if C = AB, where A is an n x m matrix and B is an m x k matrix, then C is an n x k matrix where c_ij = sum over l of a_il * b_lj.

  Matrix product properties
  - Distributivity: A(B + C) = AB + AC
  - Associativity: A(BC) = (AB)C
  - Not commutativity: in general AB != BA

  The Identity Matrix
  - All ones on the diagonal, zeros elsewhere
  - IA = A

  Hadamard product
  - An (often less useful) method of multiplying matrices is element-wise, like ordinary multiplication of numbers: (A o B)_ij = a_ij * b_ij
  - Properties of the Hadamard product: distributivity A o (B + C) = A o B + A o C, associativity A o (B o C) = (A o B) o C, commutativity A o B = B o A
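A short sketch of the two products above, assuming NumPy; the "features as rows" matrix A and the random data matrix B are made up for illustration.

```python
import numpy as np

# A: rows are feature vectors w_i (2 features of 3-dimensional data).
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])
# B: columns are data vectors v_j (3-dimensional, 4 data points).
B = np.random.rand(3, 4)

# Standard matrix product: C[i, j] = w_i . v_j, the i-th feature of the j-th vector.
C = A @ B
print(C.shape)                                   # (2, 4): n x m times m x k gives n x k

# Check one entry against the sum formula c_ij = sum_l a_il * b_lj.
print(np.isclose(C[0, 2], A[0, :] @ B[:, 2]))    # True

# Hadamard (element-wise) product needs matrices of the same shape.
D = np.arange(6.0).reshape(2, 3)
print(A * D)                                     # (A o D)_ij = a_ij * d_ij
```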
  The Determinant
  - Intuition: det(A) is the factor the area is multiplied by when A is applied to any region; det(A) is negative if A flips the plane over.
  - The two-by-two determinant: for A = [[a, b], [c, d]], det(A) = ad - bc.
  - Computation for larger matrices: expanding an m x m determinant by hand needs m determinants of (m-1) x (m-1) matrices; a computer does it more simply, in O(m^3) time, using what are called matrix factorizations.
  - det(A) = 0 only if the columns of A are linearly dependent.

  Matrix invertibility
  - The computational complexity of inverting an n x n matrix is not actually known, but the best-known algorithm is O(n^2.373).

  Geometry of matrix operations
  - Intuition from two dimensions: suppose A is a 2 x 2 matrix (mapping R^2 to itself). Any such matrix can be expressed uniquely as a stretching, followed by a skewing, followed by a rotation.
  - Any vector can be written as a sum of scalar multiples of two specific vectors (a basis).
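The "area scaling" intuition can be checked numerically. This sketch (NumPy assumed, values arbitrary) applies a 2 x 2 matrix to the unit square and compares the resulting area to det(A), then shows det = 0 for linearly dependent columns.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# Two-by-two formula: det(A) = ad - bc.
det_formula = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
print(det_formula, np.linalg.det(A))         # 6.0  6.0

# The unit square spanned by e1, e2 maps to the parallelogram spanned by A e1, A e2;
# its area is |det(A)| (the sign flips if the plane is flipped over).
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
u, w = A @ e1, A @ e2
area = abs(u[0] * w[1] - u[1] * w[0])
print(area)                                  # 6.0

# det(A) = 0 only if the columns are linearly dependent.
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])                   # second column = 2 * first column
print(np.isclose(np.linalg.det(B), 0.0))     # True
```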
Derivatives

  Univariate
  - Definition of the derivative f'(x).
  - Interpretation: the derivative is the slope of f at x. Alternative view: let's approximate, f(x + e) ≈ f(x) + f'(x)e for small e; a better approximation also uses the second derivative, f(x + e) ≈ f(x) + f'(x)e + (1/2) f''(x) e^2.
  - Rules: Sum Rule, Product Rule, Quotient Rule, and the Chain Rule (the one used most heavily in machine learning). Reference: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html
  - The second derivative f''(x) shows how the slope is changing.
  - Maximum / minimum (derivative condition): at a critical point where f'(x) = 0,
    max -> f''(x) < 0
    min -> f''(x) > 0
    can't tell -> f''(x) = 0, proceed with higher derivatives.
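A quick numerical check of the approximation above (and of the second-order version), using f(x) = sin(x) as an arbitrary example; NumPy assumed.

```python
import numpy as np

def f(x): return np.sin(x)          # an arbitrary example function
def df(x): return np.cos(x)         # its first derivative
def d2f(x): return -np.sin(x)       # its second derivative

x, eps = 1.0, 0.01
exact = f(x + eps)
first_order = f(x) + df(x) * eps
second_order = first_order + 0.5 * d2f(x) * eps**2

print(abs(exact - first_order))     # ~4.2e-05: error shrinks like eps^2
print(abs(exact - second_order))    # ~9.0e-08: error shrinks like eps^3
```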
  Multivariate
  - The Gradient: the collection of partial derivatives of f.
  - Matrix Calculus: vector derivatives and matrix derivatives. The majority of this will be just bookkeeping, but it will be terribly messy bookkeeping.
  - Critical points investigation with the Hessian Hf (two variables):
    det(Hf) > 0 and tr(Hf) > 0 -> local minimum
    det(Hf) > 0 and tr(Hf) < 0 -> local maximum
    det(Hf) < 0 -> saddle point
    det(Hf) = 0 -> unclear, need more info, investigate further
    det(Hf) > 0 and tr(Hf) = 0 -> does not happen
  - If Hf is positive semi-definite everywhere (v^T Hf v >= 0 for every vector v), this implies f is convex.
  - A Warning: when going to more complex models, i.e. neural networks, there are many local minima & many saddle points, so they are not convex.
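A small sketch of the trace/determinant test above, classifying a critical point of a two-variable function from its Hessian; the example functions are arbitrary.

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point of a two-variable function from its 2x2 Hessian H."""
    det, tr = np.linalg.det(H), np.trace(H)
    if det > 0 and tr > 0:
        return "local minimum"
    if det > 0 and tr < 0:
        return "local maximum"
    if det < 0:
        return "saddle point"
    return "unclear - need more info (higher derivatives)"

# f(x, y) = x^2 - y^2 has Hessian [[2, 0], [0, -2]] everywhere: a saddle at (0, 0).
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))   # saddle point
# f(x, y) = x^2 + y^2 has Hessian [[2, 0], [0, 2]]: a local minimum at (0, 0).
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))    # local minimum
```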
  Gradient Descent
  - Goal: minimize f(x). Works in 1D and in the multivariate case.
  - ALGORITHM:
    1. Start with some initial guess x0.
    2. Update step for minimization: x_{n+1} = x_n - eta * f'(x_n) (multivariate: x_{n+1} = x_n - eta * grad f(x_n)).
    3. Stop after some condition is met: the value of x doesn't change by more than 0.001, a fixed number of steps, or fancier things TBD.
  - How to pick eta: recall that an improperly chosen learning rate will cause the entire optimization procedure to either fail or operate too slowly to be of practical use.
  - As simplistic as this is, almost all machine learning you have heard of uses some version of this in the learning process.
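A minimal 1D gradient descent following the three steps above, on the arbitrary example f(x) = (x - 3)^2, with the 0.001 change-based stopping condition.

```python
def grad_descent(df, x0, eta=0.1, tol=1e-3, max_steps=1000):
    """Minimize f by repeatedly stepping downhill along its derivative df."""
    x = x0                                   # 1. start with some initial guess
    for _ in range(max_steps):               # 3b. ... or stop after a fixed number of steps
        x_new = x - eta * df(x)              # 2. update step: x <- x - eta * f'(x)
        if abs(x_new - x) < tol:             # 3a. stop when x changes by less than tol
            return x_new
        x = x_new
    return x

# f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3); the minimum is at x = 3.
print(grad_descent(lambda x: 2 * (x - 3), x0=0.0))    # ~2.996
```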
  Newton's Method for minimization
  - We may start with some initial guess x0 and then iterate Newton's Method on f' to get the update step for minimization: x_{n+1} = x_n - f'(x_n) / f''(x_n) (multivariate: x_{n+1} = x_n - Hf(x_n)^{-1} grad f(x_n)).
  - Issues: every multivariate step requires inverting the Hessian, and (as noted under matrix invertibility) the best-known inversion algorithm is O(n^2.373). For high-dimensional data sets, anything past linear time in the dimensions is often impractical, so Newton's Method is reserved for a few hundred dimensions at most.
  - Sometimes we can circumvent this issue.
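A sketch of the Newton update above (iterating Newton's Method on f'), on the arbitrary example f(x) = x^4 - 3x, chosen so that f'' is not constant.

```python
def newton_minimize(df, d2f, x0, tol=1e-8, max_steps=50):
    """Minimize f by running Newton's Method on f', i.e. solving f'(x) = 0."""
    x = x0
    for _ in range(max_steps):
        x_new = x - df(x) / d2f(x)           # update step: x <- x - f'(x) / f''(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# f(x) = x^4 - 3x: f'(x) = 4x^3 - 3, f''(x) = 12x^2; the minimum is at x = (3/4)^(1/3).
x_star = newton_minimize(lambda x: 4 * x**3 - 3, lambda x: 12 * x**2, x0=1.0)
print(x_star)                                # ~0.9086 = 0.75 ** (1/3)
```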
Probability

  Definition
  - The fraction of an experiment where an event occurs.
  - Intuition: the probability of an event is the expected fraction of the time that the outcome would occur with repeated experiments.

  Terminology
  - Outcome: a single possibility from the experiment.
  - Sample Space: the set of all possible outcomes (written as capital Omega).
  - Event: something you can observe with a yes/no answer (written as capital E).

  Axioms of probability
  1. The fraction of the times an event occurs is between 0 and 1: P{E} is in [0, 1].
  2. Something always happens.
  3. If two events can't happen at the same time (disjoint events), then the fraction of the time that at least one of them occurs is the sum of the fractions of the time either one occurs separately.

  Visualizing Probability using Venn diagrams
  - General picture: Sample Space <-> Region; Outcomes <-> Points; Events <-> Subregions; Disjoint events <-> Disjoint subregions; Probability <-> Area of subregion.
  - Useful set operations: intersection of two sets, union of two sets, symmetric difference of two sets, relative complement of A (left) in B (right), absolute complement of A in U, inclusion/exclusion.
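The "expected fraction of the time" intuition and the axioms can be simulated directly; this sketch rolls a fair die many times (an arbitrary example) and checks that observed fractions behave as the axioms say.

```python
import random

random.seed(0)
n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]

# Event E = "the roll is even"; its probability is 3/6 = 0.5.
fraction = sum(r % 2 == 0 for r in rolls) / n
print(fraction)                        # ~0.5, and always within [0, 1]

# Disjoint events add: P{roll is 1 or roll is 2} = P{1} + P{2}.
frac_1 = sum(r == 1 for r in rolls) / n
frac_2 = sum(r == 2 for r in rolls) / n
frac_1_or_2 = sum(r in (1, 2) for r in rolls) / n
print(frac_1 + frac_2, frac_1_or_2)    # the two numbers agree
```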
  Conditional probability
  - If I know B occurred, the probability that A occurred is the fraction of the area of B which is occupied by A.
  - Can be leveraged to understand competing hypotheses.

  Bayes' rule
  - Odds are a ratio of two probabilities, e.g. 2/1.
  - Posterior odds = (ratio of the probabilities of generating the data under each hypothesis) * prior odds.
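A tiny worked example of the odds form of Bayes' rule above; the two hypotheses and all the numbers are made up for illustration.

```python
# Two competing hypotheses H1 and H2, with made-up prior probabilities.
prior_odds = 0.2 / 0.8                 # P(H1) / P(H2) = 1/4

# Made-up likelihoods: how probable the observed data D is under each hypothesis.
p_data_given_h1 = 0.9
p_data_given_h2 = 0.3

# Posterior odds = (ratio of probabilities of generating the data) * prior odds.
posterior_odds = (p_data_given_h1 / p_data_given_h2) * prior_odds
print(posterior_odds)                  # 0.75, i.e. odds of 3/4 in favour of H1
```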
  Independence
  - Two events are independent if one event doesn't influence the other.

  Maximum Likelihood Estimation
  - Given a probability model with some vector of parameters (theta) and observed data D, the best-fitting model is the one that maximizes the probability of the observed data.
  - This is the basic recipe for building machine learning models.
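A minimal sketch of maximum likelihood estimation on a made-up example: estimating the heads probability theta of a coin by maximizing the probability of the observed flips over a grid of candidate values. NumPy assumed.

```python
import numpy as np

data = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])   # made-up coin flips, 1 = heads

thetas = np.linspace(0.01, 0.99, 99)
# Log-probability of the observed data D under each candidate parameter theta.
log_likelihood = (data.sum() * np.log(thetas)
                  + (len(data) - data.sum()) * np.log(1 - thetas))

theta_hat = thetas[np.argmax(log_likelihood)]
print(theta_hat)      # ~0.7, the observed fraction of heads, as expected for this model
```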
  Central limit theorem
  - A statistical theory which states that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population, and the distribution of those sample means will be approximately normal.
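The statement above can be illustrated by simulation; this sketch (NumPy assumed) draws many samples from a deliberately non-normal uniform population and looks at the sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
population_mean = 0.5                       # mean of a Uniform(0, 1) population

# Draw 10,000 samples of size 100 and compute each sample's mean.
sample_means = rng.uniform(0, 1, size=(10_000, 100)).mean(axis=1)

print(sample_means.mean())                  # ~0.5: close to the population mean
print(sample_means.std())                   # ~0.029: roughly sigma / sqrt(n)
```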
  Entropy