Math for Machine Learning
Vectors

  Definitions
  - Vectors as Directions
  - Geometry of Column Vectors

  Operations
  - Addition as Displacement
  - Scalar Multiplication
  - Subtraction as Mapping

  Distances
  - Measures of Magnitude (norms, the lengths of vectors)

  Dot product and how to extract angles
  - 2D intuition, and the same picture in 3D
  - Angles: v.w = |v| |w| cos(theta), so the angle can be read off from the dot product
  - Sign: v.w > 0 when the vectors point in broadly the same direction, v.w < 0 when they point in broadly opposite directions
  - Orthogonality: v.w = 0

  Hyperplane
  - The thing orthogonal to a given vector
  - Can be presented as a perpendicular line in 2D and a perpendicular surface (plane) in 3D
  - Decision plane

  Linear dependence
  - Vectors v1, ..., vk are linearly dependent if there are some a's, not all zero, with a1*v1 + a2*v2 + ... + ak*vk = 0
  - Example: a1 = 1, a2 = -2, a3 = -1
  - A linearly dependent set lies in a lower-dimensional space
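A minimal sketch of the dot-product ideas above, assuming NumPy is available; the two vectors are made up for illustration. It extracts the angle from v.w and checks orthogonality.

```python
import numpy as np

v = np.array([3.0, 4.0])
w = np.array([4.0, -3.0])

# Measures of magnitude: the Euclidean norm (length) of a vector.
print(np.linalg.norm(v))           # 5.0

# Dot product: v.w = |v||w|cos(theta), so the angle can be extracted.
dot = v @ w
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
print(np.degrees(theta))           # 90.0 degrees

# Orthogonality: v.w = 0 means the vectors are perpendicular.
print(np.isclose(dot, 0.0))        # True
```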
Matrices

  Matrix Products
  - If A is a matrix where the rows are features w_i and B is a matrix where the columns are data vectors v_j, then the (i, j)-th entry of the product is w_i.v_j, which is to say the i-th feature of the j-th vector.
  - In formulae: if C = AB, where A is an n x m matrix and B is an m x k matrix, then C is an n x k matrix where c_ij = sum over l of a_il * b_lj.

  Matrix product properties
  - Distributivity: A(B + C) = AB + AC
  - Associativity: A(BC) = (AB)C
  - Not commutativity: in general AB != BA

  The Identity Matrix
  - All ones on the diagonal, zeros elsewhere
  - IA = A

  Hadamard product
  - An (often less useful) method of multiplying matrices is element-wise, like ordinary multiplication of numbers: (A o B)_ij = a_ij * b_ij
  - Properties of the Hadamard product: distributivity A o (B + C) = A o B + A o C, associativity A o (B o C) = (A o B) o C, commutativity A o B = B o A
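A short sketch of the two products above, assuming NumPy; the "features as rows" matrix A and the random data matrix B are made up for illustration.

```python
import numpy as np

# A: rows are feature vectors w_i (2 features of 3-dimensional data).
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])
# B: columns are data vectors v_j (3-dimensional, 4 data points).
B = np.random.rand(3, 4)

# Standard matrix product: C[i, j] = w_i . v_j, the i-th feature of the j-th vector.
C = A @ B
print(C.shape)                                   # (2, 4): n x m times m x k gives n x k

# Check one entry against the sum formula c_ij = sum_l a_il * b_lj.
print(np.isclose(C[0, 2], A[0, :] @ B[:, 2]))    # True

# Hadamard (element-wise) product needs matrices of the same shape.
D = np.arange(6.0).reshape(2, 3)
print(A * D)                                     # (A o D)_ij = a_ij * d_ij
```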
  The Determinant
  - Intuition: det(A) is the factor the area is multiplied by when A is applied to any region; det(A) is negative if A flips the plane over.
  - The two-by-two determinant: for A = [[a, b], [c, d]], det(A) = ad - bc.
  - Computation for larger matrices: expanding an m x m determinant by hand needs m determinants of (m-1) x (m-1) matrices; a computer does it more simply, in O(m^3) time, using what are called matrix factorizations.
  - det(A) = 0 only if the columns of A are linearly dependent.

  Matrix invertibility
  - The computational complexity of inverting an n x n matrix is not actually known, but the best-known algorithm is O(n^2.373).

  Geometry of matrix operations
  - Intuition from two dimensions: suppose A is a 2 x 2 matrix (mapping R^2 to itself). Any such matrix can be expressed uniquely as a stretching, followed by a skewing, followed by a rotation.
  - Any vector can be written as a sum of scalar multiples of two specific vectors (a basis).
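The "area scaling" intuition can be checked numerically. This sketch (NumPy assumed, values arbitrary) applies a 2 x 2 matrix to the unit square and compares the resulting area to det(A), then shows det = 0 for linearly dependent columns.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# Two-by-two formula: det(A) = ad - bc.
det_formula = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
print(det_formula, np.linalg.det(A))         # 6.0  6.0

# The unit square spanned by e1, e2 maps to the parallelogram spanned by A e1, A e2;
# its area is |det(A)| (the sign flips if the plane is flipped over).
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
u, w = A @ e1, A @ e2
area = abs(u[0] * w[1] - u[1] * w[0])
print(area)                                  # 6.0

# det(A) = 0 only if the columns are linearly dependent.
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])                   # second column = 2 * first column
print(np.isclose(np.linalg.det(B), 0.0))     # True
```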
Derivatives

  Univariate
  - Definition of the derivative f'(x).
  - Interpretation: the derivative is the slope of f at x. Alternative view: let's approximate, f(x + e) ≈ f(x) + f'(x)e for small e; a better approximation also uses the second derivative, f(x + e) ≈ f(x) + f'(x)e + (1/2) f''(x) e^2.
  - Rules: Sum Rule, Product Rule, Quotient Rule, and the Chain Rule (the one used most heavily in machine learning). Reference: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html
  - The second derivative f''(x) shows how the slope is changing.
  - Maximum / minimum (derivative condition): at a critical point where f'(x) = 0,
    max -> f''(x) < 0
    min -> f''(x) > 0
    can't tell -> f''(x) = 0, proceed with higher derivatives.
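A quick numerical check of the approximation above (and of the second-order version), using f(x) = sin(x) as an arbitrary example; NumPy assumed.

```python
import numpy as np

def f(x): return np.sin(x)          # an arbitrary example function
def df(x): return np.cos(x)         # its first derivative
def d2f(x): return -np.sin(x)       # its second derivative

x, eps = 1.0, 0.01
exact = f(x + eps)
first_order = f(x) + df(x) * eps
second_order = first_order + 0.5 * d2f(x) * eps**2

print(abs(exact - first_order))     # ~4.2e-05: error shrinks like eps^2
print(abs(exact - second_order))    # ~9.0e-08: error shrinks like eps^3
```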
  Multivariate
  - The Gradient: the collection of partial derivatives of f.
  - Matrix Calculus: vector derivatives and matrix derivatives. The majority of this will be just bookkeeping, but it will be terribly messy bookkeeping.
  - Critical points investigation with the Hessian Hf (two variables):
    det(Hf) > 0 and tr(Hf) > 0 -> local minimum
    det(Hf) > 0 and tr(Hf) < 0 -> local maximum
    det(Hf) < 0 -> saddle point
    det(Hf) = 0 -> unclear, need more info, investigate further
    det(Hf) > 0 and tr(Hf) = 0 -> does not happen
  - If Hf is positive semi-definite everywhere (v^T Hf v >= 0 for every vector v), this implies f is convex.
  - A Warning: when going to more complex models, i.e. neural networks, there are many local minima & many saddle points, so they are not convex.
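A small sketch of the trace/determinant test above, classifying a critical point of a two-variable function from its Hessian; the example functions are arbitrary.

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point of a two-variable function from its 2x2 Hessian H."""
    det, tr = np.linalg.det(H), np.trace(H)
    if det > 0 and tr > 0:
        return "local minimum"
    if det > 0 and tr < 0:
        return "local maximum"
    if det < 0:
        return "saddle point"
    return "unclear - need more info (higher derivatives)"

# f(x, y) = x^2 - y^2 has Hessian [[2, 0], [0, -2]] everywhere: a saddle at (0, 0).
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))   # saddle point
# f(x, y) = x^2 + y^2 has Hessian [[2, 0], [0, 2]]: a local minimum at (0, 0).
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))    # local minimum
```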
  Gradient Descent
  - Goal: minimize f(x). Works in 1D and in the multivariate case.
  - ALGORITHM:
    1. Start with some initial guess x0.
    2. Update step for minimization: x_{n+1} = x_n - eta * f'(x_n) (multivariate: x_{n+1} = x_n - eta * grad f(x_n)).
    3. Stop after some condition is met: the value of x doesn't change by more than 0.001, a fixed number of steps, or fancier things TBD.
  - How to pick eta: recall that an improperly chosen learning rate will cause the entire optimization procedure to either fail or operate too slowly to be of practical use.
  - As simplistic as this is, almost all machine learning you have heard of uses some version of this in the learning process.
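A minimal 1D gradient descent following the three steps above, on the arbitrary example f(x) = (x - 3)^2, with the 0.001 change-based stopping condition.

```python
def grad_descent(df, x0, eta=0.1, tol=1e-3, max_steps=1000):
    """Minimize f by repeatedly stepping downhill along its derivative df."""
    x = x0                                   # 1. start with some initial guess
    for _ in range(max_steps):               # 3b. ... or stop after a fixed number of steps
        x_new = x - eta * df(x)              # 2. update step: x <- x - eta * f'(x)
        if abs(x_new - x) < tol:             # 3a. stop when x changes by less than tol
            return x_new
        x = x_new
    return x

# f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3); the minimum is at x = 3.
print(grad_descent(lambda x: 2 * (x - 3), x0=0.0))    # ~2.996
```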
  Newton's Method for minimization
  - We may start with some initial guess x0 and then iterate Newton's Method on f' to get the update step for minimization: x_{n+1} = x_n - f'(x_n) / f''(x_n) (multivariate: x_{n+1} = x_n - Hf(x_n)^{-1} grad f(x_n)).
  - Issues: every multivariate step requires inverting the Hessian, and (as noted under matrix invertibility) the best-known inversion algorithm is O(n^2.373). For high-dimensional data sets, anything past linear time in the dimensions is often impractical, so Newton's Method is reserved for a few hundred dimensions at most.
  - Sometimes we can circumvent this issue.
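A sketch of the Newton update above (iterating Newton's Method on f'), on the arbitrary example f(x) = x^4 - 3x, chosen so that f'' is not constant.

```python
def newton_minimize(df, d2f, x0, tol=1e-8, max_steps=50):
    """Minimize f by running Newton's Method on f', i.e. solving f'(x) = 0."""
    x = x0
    for _ in range(max_steps):
        x_new = x - df(x) / d2f(x)           # update step: x <- x - f'(x) / f''(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# f(x) = x^4 - 3x: f'(x) = 4x^3 - 3, f''(x) = 12x^2; the minimum is at x = (3/4)^(1/3).
x_star = newton_minimize(lambda x: 4 * x**3 - 3, lambda x: 12 * x**2, x0=1.0)
print(x_star)                                # ~0.9086 = 0.75 ** (1/3)
```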
Probability

  Definition
  - The fraction of an experiment where an event occurs.
  - Intuition: the probability of an event is the expected fraction of the time that the outcome would occur with repeated experiments.

  Terminology
  - Outcome: a single possibility from the experiment.
  - Sample Space: the set of all possible outcomes (written as capital Omega).
  - Event: something you can observe with a yes/no answer (written as capital E).

  Axioms of probability
  1. The fraction of the times an event occurs is between 0 and 1: P{E} is in [0, 1].
  2. Something always happens.
  3. If two events can't happen at the same time (disjoint events), then the fraction of the time that at least one of them occurs is the sum of the fractions of the time either one occurs separately.

  Visualizing Probability using Venn diagrams
  - General picture: Sample Space <-> Region; Outcomes <-> Points; Events <-> Subregions; Disjoint events <-> Disjoint subregions; Probability <-> Area of subregion.
  - Useful set operations: intersection of two sets, union of two sets, symmetric difference of two sets, relative complement of A (left) in B (right), absolute complement of A in U, inclusion/exclusion.
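The "expected fraction of the time" intuition and the axioms can be simulated directly; this sketch rolls a fair die many times (an arbitrary example) and checks that observed fractions behave as the axioms say.

```python
import random

random.seed(0)
n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]

# Event E = "the roll is even"; its probability is 3/6 = 0.5.
fraction = sum(r % 2 == 0 for r in rolls) / n
print(fraction)                        # ~0.5, and always within [0, 1]

# Disjoint events add: P{roll is 1 or roll is 2} = P{1} + P{2}.
frac_1 = sum(r == 1 for r in rolls) / n
frac_2 = sum(r == 2 for r in rolls) / n
frac_1_or_2 = sum(r in (1, 2) for r in rolls) / n
print(frac_1 + frac_2, frac_1_or_2)    # the two numbers agree
```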
  Conditional probability
  - If I know B occurred, the probability that A occurred is the fraction of the area of B which is occupied by A.
  - Can be leveraged to understand competing hypotheses.

  Bayes' rule
  - Odds are a ratio of two probabilities, e.g. 2/1.
  - Posterior odds = (ratio of the probabilities of generating the data under each hypothesis) * prior odds.
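A tiny worked example of the odds form of Bayes' rule above; the two hypotheses and all the numbers are made up for illustration.

```python
# Two competing hypotheses H1 and H2, with made-up prior probabilities.
prior_odds = 0.2 / 0.8                 # P(H1) / P(H2) = 1/4

# Made-up likelihoods: how probable the observed data D is under each hypothesis.
p_data_given_h1 = 0.9
p_data_given_h2 = 0.3

# Posterior odds = (ratio of probabilities of generating the data) * prior odds.
posterior_odds = (p_data_given_h1 / p_data_given_h2) * prior_odds
print(posterior_odds)                  # 0.75, i.e. odds of 3/4 in favour of H1
```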
  Independence
  - Two events are independent if one event doesn't influence the other.

  Maximum Likelihood Estimation
  - Given a probability model with some vector of parameters (theta) and observed data D, the best-fitting model is the one that maximizes the probability of the observed data.
  - This is the basic recipe for building machine learning models.
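A minimal sketch of maximum likelihood estimation on a made-up example: estimating the heads probability theta of a coin by maximizing the probability of the observed flips over a grid of candidate values. NumPy assumed.

```python
import numpy as np

data = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])   # made-up coin flips, 1 = heads

thetas = np.linspace(0.01, 0.99, 99)
# Log-probability of the observed data D under each candidate parameter theta.
log_likelihood = (data.sum() * np.log(thetas)
                  + (len(data) - data.sum()) * np.log(1 - thetas))

theta_hat = thetas[np.argmax(log_likelihood)]
print(theta_hat)      # ~0.7, the observed fraction of heads, as expected for this model
```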
  Central limit theorem
  - A statistical theory which states that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population, and the distribution of those sample means will be approximately normal.
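The statement above can be illustrated by simulation; this sketch (NumPy assumed) draws many samples from a deliberately non-normal uniform population and looks at the sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
population_mean = 0.5                       # mean of a Uniform(0, 1) population

# Draw 10,000 samples of size 100 and compute each sample's mean.
sample_means = rng.uniform(0, 1, size=(10_000, 100)).mean(axis=1)

print(sample_means.mean())                  # ~0.5: close to the population mean
print(sample_means.std())                   # ~0.029: roughly sigma / sqrt(n)
```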
  Entropy