Lesson02-Python Calculus Maths

The document covers key concepts in machine learning, including linear and nonlinear functions, derivatives, gradient descent, and loss functions. It explains the importance of activation functions in neural networks and various types such as Sigmoid, ReLU, and ELU. Additionally, it discusses how gradient descent is used to minimize loss functions through iterative parameter updates.


CONTENT

1. Linear and Nonlinear Functions
2. Derivatives and Finding Extreme Points
3. Gradient Descent
4. Loss Function
1. Linear and Nonlinear Functions
• A linear function increases or decreases at a constant rate, so its graph is a
straight line (y = wx + b).
• Example: Linear Regression (a minimal fitting sketch follows below)
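As a hedged illustration of fitting a linear function to data, here is a minimal NumPy sketch of linear regression by least squares. The synthetic data, variable names, and noise level are illustrative assumptions, not taken from the lesson.

    # Minimal sketch: fitting a linear function y = w*x + b with NumPy.
    import numpy as np

    # Synthetic data: y is roughly 2*x + 1 plus some noise (made-up example)
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

    # np.polyfit with degree 1 solves the least-squares straight line through the points
    w, b = np.polyfit(x, y, deg=1)
    print(f"fitted slope w = {w:.3f}, intercept b = {b:.3f}")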
1. Linear and Nonlinear Functions

[Figure: linear regression fit and searching for the minimal value of the loss function]

1. Linear and Nonlinear Functions
• A non-linear function does not increase or decrease at a constant rate, so its
graph is not a straight line.
• Activation functions are an important concept in machine learning, especially in
deep learning. They decide whether a neuron should be activated and introduce a
non-linear transformation into the neural network. Their main purpose is to
transform a neuron's input signal and produce an output that is fed to the
neurons in the next layer.
• Example: Activation Functions
1. Linear and Nonlinear Functions

[Table: common activation functions and their advantages]
1. Linear and Nonlinear Functions
• Sigmoid Activation Function:
• Output range: [0, 1]
• Not zero-centered
• Requires an exponential operation (relatively expensive to compute)
• Hyperbolic Tangent Activation Function (tanh):
• Output range: [-1, 1]
• Zero-centered
• Rectified Linear Unit Activation Function (ReLU):
• Does not saturate for positive inputs
• Converges faster in practice than saturating activations such as sigmoid and tanh
1. Linear and Nonlinear Functions
• Leaky ReLU:
• An improvement over the ReLU activation function
• Keeps all the properties of ReLU
• Avoids the dying-ReLU problem, since it keeps a small slope for negative inputs
• Maxout:
• Piecewise linear
• Never saturates or dies
• Expensive, since it doubles the number of parameters per neuron
• ELU (Exponential Linear Units):
• No dying-ReLU situation
• Outputs closer to zero mean than Leaky ReLU
• More computation because of the exponential function
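As a minimal sketch, the activation functions discussed above can be written in NumPy as follows. The function names, default slopes, and the example input are illustrative choices, not taken from the lesson.

    import numpy as np

    def sigmoid(x):
        # Squashes inputs into (0, 1); not zero-centered
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # Squashes inputs into (-1, 1); zero-centered
        return np.tanh(x)

    def relu(x):
        # Zero for negative inputs, identity for positive inputs
        return np.maximum(0.0, x)

    def leaky_relu(x, alpha=0.01):
        # Keeps a small slope alpha for negative inputs to avoid the dying-ReLU problem
        return np.where(x > 0, x, alpha * x)

    def elu(x, alpha=1.0):
        # Smoothly approaches -alpha for large negative inputs
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                    ("leaky_relu", leaky_relu), ("elu", elu)]:
        print(name, f(x))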
2. Derivatives and Finding Extreme Points
• Suppose we have a function y = f(x) that depends on x. The derivative of this
function is the rate at which the value y changes as x changes.
• In geometry, slope represents the steepness of a line. It answers the question:
how much does y, or f(x), change for a given change in x?
• Using this definition we can easily calculate the slope between two points. But
what is the slope at a single point on the curve? In that case there is no obvious
"rise over run" to calculate. Derivatives answer this question: the derivative at a
point is the limit of the slope between two points as those points move
arbitrarily close together.
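A minimal sketch of this idea uses a central finite difference to approximate the slope at a single point. The function f, the point x0, and the step size h below are illustrative assumptions, not part of the lesson.

    # Approximate the derivative of f at x0 with a central difference:
    # f'(x0) ~ (f(x0 + h) - f(x0 - h)) / (2h) for a small step h.

    def numerical_derivative(f, x0, h=1e-5):
        return (f(x0 + h) - f(x0 - h)) / (2.0 * h)

    def f(x):
        return x ** 2  # example function; its exact derivative is 2x

    print(numerical_derivative(f, 3.0))  # close to 6.0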
2. Derivatives and Finding Extreme Points

[Figure: finding extreme points, i.e. points where the derivative f'(x) = 0]
2. Derivatives and Finding Extreme Points

[Figure: partial derivatives of a multivariable function]
3. Gradient Descent
• A gradient is a vector that stores the partial derivatives of a multivariable
function. It lets us calculate the slope at a specific point for functions with
multiple independent variables.
• The gradient vector is orthogonal to the tangent hyperplane and points in the
direction of steepest ascent. Gradient descent takes the opposite of this vector
(hence "descent") and multiplies it by the learning rate lr.
3. Gradient Descent
• The projection of this vector onto the parameter space (here: the x-axis) gives
the new (updated) parameter. Repeating this operation several times moves you
down the cost (error) function, with the goal of reaching a value of w where the
cost function is minimal.
• The parameter is therefore updated as follows at each step:
parameter <-- parameter - lr * gradient
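A minimal sketch of this update rule, minimizing a simple one-parameter cost function. The cost function, starting point, learning rate, and number of steps are illustrative assumptions, not from the lesson.

    # Gradient descent on the cost function cost(w) = (w - 3)^2,
    # whose gradient is d(cost)/dw = 2 * (w - 3) and whose minimum is at w = 3.

    def gradient(w):
        return 2.0 * (w - 3.0)

    w = 0.0    # initial parameter value (arbitrary)
    lr = 0.1   # learning rate
    for step in range(50):
        w = w - lr * gradient(w)   # parameter <-- parameter - lr * gradient

    print(w)  # close to 3.0, the minimizer of the cost function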
3. Gradient Descent

[Figure: gradient descent iterations moving down the cost function toward its minimum]
4. Loss Function
• Let's say you are on the top of a hill and need to climb down. How do you
decide which way to walk? Here's what I would do:
• Look around to see all the possible paths
• Reject the ones going up, because these paths would cost me more energy and
make my task even more difficult
• Finally, take the path that I think has the steepest downhill slope
• A loss function maps decisions to their associated costs.
4. Loss Function

Log Loss (multi-class):

    LogLoss = -(1/N) * sum_{i=1..N} sum_{j=1..M} y_ij * log(p_ij)

where,
N : number of samples
M : number of classes
y_ij : indicates whether the i-th sample belongs to the j-th class (1) or not (0)
p_ij : predicted probability that the i-th sample belongs to the j-th class
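A minimal NumPy sketch of this log-loss computation. The example labels and predicted probabilities below are made up for illustration.

    import numpy as np

    def log_loss(y, p, eps=1e-12):
        # y: one-hot labels of shape (N, M); p: predicted probabilities of shape (N, M)
        p = np.clip(p, eps, 1.0 - eps)   # avoid log(0)
        return -np.mean(np.sum(y * np.log(p), axis=1))

    # Two samples, three classes (illustrative values)
    y = np.array([[1, 0, 0],
                  [0, 1, 0]])
    p = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
    print(log_loss(y, p))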

Focal Loss:

    FL(p_t) = -(1 - p_t)^γ * log(p_t)

The hyperparameter γ of the focal loss is used to tune the weight of different
samples. When γ > 0, it reduces the relative loss for well-classified examples
(those with p_t close to 1).
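A minimal NumPy sketch of the focal loss for binary classification. The labels, predicted probabilities, and value of γ are illustrative assumptions.

    import numpy as np

    def focal_loss(y, p, gamma=2.0, eps=1e-12):
        # y: binary labels (0 or 1); p: predicted probability of class 1
        p = np.clip(p, eps, 1.0 - eps)
        p_t = np.where(y == 1, p, 1.0 - p)   # probability assigned to the true class
        return -np.mean((1.0 - p_t) ** gamma * np.log(p_t))

    y = np.array([1, 0, 1, 1])
    p = np.array([0.9, 0.1, 0.6, 0.3])
    print(focal_loss(y, p))           # gamma > 0 down-weights well-classified examples
    print(focal_loss(y, p, gamma=0))  # gamma = 0 reduces to standard cross entropy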
4. Loss Function

Exponential
Loss

Hinge Loss
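A minimal NumPy sketch of these two losses. The labels and scores below are illustrative values, not from the lesson.

    import numpy as np

    def exponential_loss(y, f):
        # y in {-1, +1}; f is the raw model score
        return np.mean(np.exp(-y * f))

    def hinge_loss(y, f):
        # Zero loss once the score is on the correct side of the margin (y * f >= 1)
        return np.mean(np.maximum(0.0, 1.0 - y * f))

    y = np.array([1, -1, 1, -1])
    f = np.array([2.0, -0.5, 0.3, 1.5])
    print(exponential_loss(y, f))
    print(hinge_loss(y, f))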
4. Loss Function

Cross Entropy Loss (binary case, for label y in {0, 1} and predicted probability p):

    L = -[y * log(p) + (1 - y) * log(1 - p)]
