Lesson02-Python Calculus Maths
Activation Functions
Advantages and Drawbacks
1. Linear and Nonlinear Functions
• Sigmoid Activation Function:
• Output range: (0, 1)
• Not zero-centered
• Involves an exponential operation (computationally expensive)
• Hyperbolic Tangent Activation Function (tanh):
• Output range: (-1, 1)
• Zero-centered
• Rectified Linear Unit Activation Function (ReLU):
• Does not saturate (in the positive region)
• Converges faster in practice than saturating activations such as sigmoid and tanh (a small NumPy sketch of these three functions follows below)
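As a quick illustration (not from the original slides), here is a minimal NumPy sketch of the three activations above; the function names are my own:

```python
# Minimal NumPy sketch of sigmoid, tanh and ReLU (illustration only).
import numpy as np

def sigmoid(x):
    # Squashes inputs into (0, 1); not zero-centered, uses an exponential.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs into (-1, 1); zero-centered.
    return np.tanh(x)

def relu(x):
    # max(0, x); does not saturate for positive inputs.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))   # values strictly between 0 and 1
print(tanh(x))      # values between -1 and 1, symmetric around 0
print(relu(x))      # negative inputs clipped to 0
```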
• Leaky ReLU:
• An improvement over the ReLU activation function
• Has all the beneficial properties of ReLU
• Never suffers from the dead ReLU problem, since negative inputs keep a small non-zero gradient
• Maxout:
• Piecewise linear, so it keeps the benefits of linearity
• Never saturates and never dies
• But it is expensive, as it doubles the number of parameters per neuron
• ELU (Exponential Linear Units):
• No dead ReLU situation
• Outputs are closer to zero mean than with Leaky ReLU
• More computation because of the exponential function (a sketch of Leaky ReLU and ELU follows below)
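A similar sketch (illustration only; the alpha values below are common defaults, not taken from the slides) for Leaky ReLU and ELU:

```python
# Leaky ReLU and ELU sketches in NumPy (illustration only).
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small non-zero slope for x < 0 avoids the dead ReLU problem.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth exponential curve for x < 0; outputs are closer to zero mean.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(leaky_relu(x))   # [-0.03 -0.01  0.    1.    3.  ]
print(elu(x))          # approximately [-0.95 -0.63  0.    1.    3.  ]
```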
2. Derivatives and Finding Extreme Points
• Suppose we have a function y = f(x) that depends on x. The derivative of this function is the rate at which the value y changes with a change in x.
• In geometry, slope represents the steepness of a line. It answers the question: how much does y (or f(x)) change for a given change in x?
• Using this definition we can easily calculate the slope between two points. But what if, instead of the slope between two points, we want the slope at a single point on the line? In this case there isn’t any obvious “rise over run” to calculate. Derivatives help us answer this question (a small numerical sketch follows below).
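To make this concrete, here is a small sketch (not from the slides) that approximates the slope at a single point with a central finite difference:

```python
# Approximate the derivative (slope at a single point) numerically.
def derivative(f, x, h=1e-5):
    # Central difference: rise over a very small run centred at x.
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2           # example function y = x^2
print(derivative(f, 3.0))      # ~6.0, matching the analytic derivative 2x = 6
```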
Finding Extreme Points
• A function can have a local maximum or minimum only where its derivative is zero; the sign of the second derivative tells which one it is.
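A hedged sketch of this idea (assumes SymPy is installed; the function f(x) = x^3 - 3x is just an example):

```python
# Find extreme points of f(x) = x^3 - 3x by solving f'(x) = 0,
# then classify them with the second derivative.
import sympy as sp

x = sp.symbols("x")
f = x**3 - 3*x
critical_points = sp.solve(sp.diff(f, x), x)      # where the slope is zero
for c in critical_points:
    second = sp.diff(f, x, 2).subs(x, c)
    kind = "minimum" if second > 0 else "maximum"
    print(c, kind)                                # x=-1 maximum, x=1 minimum
```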
Partial Derivative
• The partial derivative of a multivariable function is its derivative with respect to one variable while the other variables are held constant.
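A small numerical sketch (illustration only) of partial derivatives for f(x, y) = x^2 * y, differentiating one variable at a time:

```python
# Numerical partial derivatives via finite differences.
def partial_x(f, x, y, h=1e-5):
    # Vary x while y is held fixed.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-5):
    # Vary y while x is held fixed.
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

f = lambda x, y: x**2 * y
print(partial_x(f, 2.0, 3.0))   # ~12.0 (analytic: 2xy)
print(partial_y(f, 2.0, 3.0))   # ~4.0  (analytic: x^2)
```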
3. Gradient Descent
• A gradient is a vector that stores the partial derivatives of a multivariable function. It lets us calculate the slope at a specific point for functions with multiple independent variables.
• The gradient vector is orthogonal to the tangent hyperplane and points in the direction of steepest ascent. To descend, you take the opposite of this vector (hence “descent”) and multiply it by the learning rate lr.
• The projection of this vector onto the parameter space (here: the x-axis) gives you the new (updated) parameter. You then repeat this operation several times to move down the cost (error) function, with the goal of reaching a value of w where the cost is minimal.
• The parameter is thus updated as follows at each step (see the sketch below):
parameter <-- parameter - lr*gradient
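A minimal sketch of this update rule; the cost function J(w) = (w - 4)^2, the learning rate and the starting point are illustrative choices, not values from the slides:

```python
# Gradient descent on J(w) = (w - 4)^2 using the update rule above.
def gradient(w):
    # dJ/dw for J(w) = (w - 4)^2
    return 2 * (w - 4)

w = 0.0        # initial parameter
lr = 0.1       # learning rate
for step in range(100):
    w = w - lr * gradient(w)   # parameter <-- parameter - lr*gradient
print(w)       # converges toward 4, where the cost is minimal
```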
4. Loss Function
• Let’s say you are at the top of a hill and need to climb down. How do you decide which way to walk? Here’s what I would do:
• Look around to see all the possible paths
• Reject the ones going up, because these paths would cost me more energy and make my task even more difficult
• Finally, take the path with the steepest downhill slope
• A loss function maps decisions to their associated costs (a small example follows below).
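As a concrete example (illustration only, not one of the losses covered in the slides below), mean squared error maps a set of predictions to a single cost:

```python
# Mean squared error: each prediction is scored by its squared distance
# from the true value; the loss is the average of those costs.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0])
print(mse(y_true, np.array([2.5, 0.0, 2.0])))   # small cost: good predictions
print(mse(y_true, np.array([0.0, 0.0, 0.0])))   # larger cost: poor predictions
```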
Log Loss
Log Loss = -(1/N) * Σ_{i=1..N} Σ_{j=1..M} y_ij * log(p_ij)
where,
N : number of samples
M : number of classes
y_ij : indicates whether the i-th sample belongs to the j-th class or not
p_ij : predicted probability of the i-th sample belonging to the j-th class
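A NumPy sketch of this formula (illustration only; scikit-learn's sklearn.metrics.log_loss computes the same quantity):

```python
# Multi-class log loss for one-hot labels y (N x M) and probabilities p (N x M).
import numpy as np

def log_loss(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)              # avoid log(0)
    return -np.mean(np.sum(y * np.log(p), axis=1))

y = np.array([[1, 0, 0], [0, 1, 0]])                 # two samples, three classes
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])     # predicted probabilities
print(log_loss(y, p))                                # ~0.29
```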
Focal Loss
FL(p_t) = -(1 - p_t)^γ * log(p_t), where p_t is the predicted probability of the true class and γ ≥ 0 down-weights easy examples
Exponential Loss
L = exp(-y * f(x)), with labels y in {-1, +1} (used e.g. in AdaBoost)
Hinge Loss
L = max(0, 1 - y * f(x)), with labels y in {-1, +1} (used e.g. in SVMs)
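Sketches of these three losses for a single prediction (illustration only; hinge and exponential loss assume labels in {-1, +1}, focal loss takes the predicted probability of the true class):

```python
# Hinge, exponential and focal loss for a single prediction.
import numpy as np

def hinge_loss(y, score):
    return np.maximum(0.0, 1.0 - y * score)

def exponential_loss(y, score):
    return np.exp(-y * score)

def focal_loss(p_t, gamma=2.0):
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

print(hinge_loss(1, 0.3))         # 0.7: correct side of the boundary, but inside the margin
print(exponential_loss(-1, 0.3))  # ~1.35: wrong-sign score is penalised
print(focal_loss(0.9))            # ~0.001: easy example is heavily down-weighted
```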
Cross Entropy Loss
CE = -Σ_j y_j * log(p_j); for binary classification this reduces to -(y * log(p) + (1 - y) * log(1 - p)), and averaging over samples gives the Log Loss above.
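A short sketch of the binary case (illustration only):

```python
# Binary cross entropy: the two-class special case of the Log Loss above.
import numpy as np

def binary_cross_entropy(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.4])
print(binary_cross_entropy(y, p))   # ~0.34
```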