0% found this document useful (0 votes)

67 views10 pages

On Loss Functions For Deep Neural Networks in Classification Katarzyna Janocha, Wojciech Marian Czarnecki

This document discusses different loss functions that can be used for deep neural networks in classification tasks. It begins with an introduction noting that while deep learning research has explored many different network architectures and components, the loss function used for classification has been largely limited to log loss. The document then reviews 12 different loss functions for classification, including L1, L2, hinge, and log losses. It provides theoretical analysis showing that L1 and L2 losses can be justified as classification objectives when interpreted probabilistically as minimizing expected misclassification. Experiments are also conducted to analyze the effects of different loss functions on factors like learning speed, performance, and robustness.

Uploaded by

ingaleharshal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

67 views10 pages

On Loss Functions For Deep Neural Networks in Classification Katarzyna Janocha, Wojciech Marian Czarnecki

Uploaded by

ingaleharshal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

On Loss Functions for Deep Neural Networks

in Classification

Katarzyna Janocha1 , Wojciech Marian Czarnecki2,1

1
Faculty of Mathematics and Computer Science,
Jagiellonian University, Krakow, Poland
2
DeepMind, London, UK
arXiv:1702.05659v1 [cs.LG] 18 Feb 2017

e-mail: [email protected], [email protected]

Abstract
Deep neural networks are currently among the most commonly used
classifiers. Despite easily achieving very good performance, one of the
best selling points of these models is their modular design – one can con-
veniently adapt their architecture to specific needs, change connectivity
patterns, attach specialised layers, experiment with a large amount of
activation functions, normalisation schemes and many others. While one
can find impressively wide spread of various configurations of almost every
aspect of the deep nets, one element is, in authors’ opinion, underrepre-
sented – while solving classification problems, vast majority of papers and
applications simply use log loss. In this paper we try to investigate how
particular choices of loss functions affect deep models and their learning
dynamics, as well as resulting classifiers robustness to various effects. We
perform experiments on classical datasets, as well as provide some addi-
tional, theoretical insights into the problem. In particular we show that L1
and L2 losses are, quite surprisingly, justified classification objectives for
deep nets, by providing probabilistic interpretation in terms of expected
misclassification. We also introduce two losses which are not typically
used as deep nets objectives and show that they are viable alternatives to
the existing ones.

1 Introduction
For the last few years the Deep Learning (DL) research has been rapidly
developing. It evolved from tricky pretraining routines [6] to a highly
modular, customisable framework for building machine learning systems
for various problems, spanning from image recognition [5], voice recog-
nition and synthesis [9] to complex AI systems [11]. One of the biggest
advantages of DL is enormous flexibility in designing each part of the ar-
chitecture, resulting in numerous ways of putting priors over data inside
the model itself [6], finding the most efficient activation functions [2] or
learning algorithms [4]. However, to authors’ best knowledge, most of
the community still keeps one element nearly completely fixed – when it
comes to classification, we use log loss (applied to softmax activation of
the output of the network). In this paper we try to address this issue by
performing both theoretical and empirical analysis of effects various loss
functions have on the training of deep nets.

1
2

It is worth noting that Tang et al. [13] showed that well fitted hinge
loss can outperform log loss based networks in typical classification tasks.
Lee et al. [8] used squared hinge loss for classification tasks, achieving very
good results. From slightly more theoretical perspective Choromanska et
al. [1] also considered L1 loss as a deep net objective. However, these
works seem to be exceptions, appear in complete separation from one
another, and usually do not focus on any effect of the loss function but
the final performance. Our goal is to show these losses in a wider context,
comparing one another under various criteria and provide insights into
when – and why – one should use them.

Table 1: List of losses analysed in this paper. y is true label as one-hot encoding,
ŷ is true label as +1/-1 encoding, o is the output of the last layer of the net-
work, ·(j) denotes jth dimension of a given vector, and σ(·) denotes probability
estimate.
symbol name equation
L1 L1 loss ky − ok1
L2 L2 loss ky − ok22
L1 ◦ σ expectation loss ky − σ(o)k1
L2 ◦ σ regularised expectation loss1 ky − σ(o)k22
(j) (j)
L∞ ◦ σ Chebyshev loss
P j |σ(o) 1 − y(j) | (j)
max
hinge hinge [13] (margin) loss max(0, 2 − ŷ o )
Pj
hinge2 squared hinge (margin) loss max(0, 12 − ŷ(j) o(j) )2
Pj 1
hinge3 cubed hinge (margin) loss j max(0, 2 − ŷ
(j) (j) 3
o )
− j y log σ(o)(j)
(j)
P
log log (cross entropy) loss
log2 − j [y(j) log σ(o)(j) ]2
P
squared log loss
− j σ(o)(j) y(j)
P
tan Tanimoto loss kσ(o)k2 +kyk22 − j σ(o)(j) y(j)
2
P
P (j) (j)
j σ(o) y
DCS Cauchy-Schwarz Divergence [3] − log kσ(o)k2 kyk2

This work focuses on 12 loss functions, described in Table 1. Most of

them appear in deep learning (or more generally – machine learning) liter-
ature, however some in slightly different context than a classification loss.
In the following section we present new insights into theoretical properties
of a couple of these losses and then provide experimental evaluation of re-
sulting models’ properties, including the effect on speed of learning, final
performance, input data and label noise robustness as well as convergence
for simple dataset under limited resources regime.

2 Theory
Let us begin with showing interesting properties of Lp functions, typi-
cally considered as purely regressive losses, which should not be used in
classification. L1 is often used as an auxiliary loss in deep nets to en-
sure sparseness of representations. Similarly, L2 is sometimes (however
nowadays quite rarely) applied to weights in order to prevent them from
growing to infinity. In this section we show that – despite their regression
1 See Proposition 1
3

roots – they still have reasonable probabilistic interpretation for classifi-

cation and can be used as a main classification objective.
We use the following notation: {(xi , yi )}N d
i=1 ⊂ R × {0, 1}
K
is a train-
ing set, an iid sample from unknown P (x, y) and σ denotes a function
producing probability estimates (usually sigmoid or softmax).
Proposition 1. L1 loss applied to the probability estimates p̂(y|x) leads
to minimisation of expected misclassification probability (as opposed to
maximisation of fully correct labelling given by the log loss). Similarly L2
minimises the same factor, but regularised with a half of expected squared
L2 norm of the predictions probability estimates.

Proof. In K-class classification dependent variables are vectors yi ∈ {0, 1}K

with L1 (yi ) = 1, thus using notation pi = p̂(y|xi )
X X (j) (j)
X hX (j) (j) (j) (j)
i
L1 = N1 |pi − yi | = N1 yi (1 − pi ) + (1 − yi )pi
i j i j
X hX (j) X (j) (j) X (j) i X hX (j) (j) i
= N1 yi − 2 yi pi + pi = 2 − 2 N1 yi pi .
i j j j i j

Consequently if we sample label according to pi then probability that it

actually matches one hot encoded label in yi equals P (ˆ
l = l|ˆ
l ∼ pi , l ∼
P (j) (j)
yi ) = j yi pi , and consequently
X hX (j) (j)
i h i
L1 = 2 − 2 N1 yi pi ≈ −2EP (x,y) P (ˆ
l = l|ˆ
l ∼ pi , l ∼ yi ) + const.
i j

Analogously for L2 ,
X hX (j) (j) i X X
L2 = −2 N1 yi pi + N1 L2 (yi )2 + N1 L2 (pi )2
i j i i
h i
≈ −2EP (x,y) P (ˆ
l = l|ˆ
l ∼ pi , l ∼ yi ) + EP (x,y) [L2 (pi )2 ] + const.

For this reason we refer to these losses as expectation loss and regu-
larised expectation loss respectively. One could expect that this should
lead to higher robustness to the outliers/noise, as we try to maximise the
expected probability of good classification as opposed to the probability
of completely correct labelling (which log loss does). Indeed, as we show
in the experimental section – this property is true for all losses sharing
connection with expectation losses.
So why is using these two loss functions unpopular? Is there any-
thing fundamentally wrong with this formulation from the mathematical
perspective? While the following observation is not definitive, it shows
an insight into what might be the issue causing slow convergence of such
methods.
Proposition 2. L1 , L2 losses applied to probabilities estimates coming
from sigmoid (or softmax) have non-monotonic partial derivatives wrt. to
the output of the final layer (and the loss is not convex nor concave wrt.
to last layer weights). Furthermore, they vanish in both infinities, which
slows down learning of heavily misclassified examples.

Proof. Let us denote sigmoid activation as σ(x) = (1 + e−x )−1 and, with-
out loss of generality, compute partial derivative of L1 when network is
4

presented with xp with positive label. Let op denote the output activation
for this sample.

∂(L1 ◦ σ) ∂ e−op
|1 − (1 + e−o )−1 | (op ) = − −op

(op ) =
∂o ∂o (e + 1)2
e−o e−o
lim − = 0 = lim − ,
o→−∞ (e−o + 1)2 o→∞ (e−o + 1)2
0
while at the same time − (e0e+1)2 = − 41 < 0, completing the proof of both
non-monotonicity as well as the fact it vanishes when point is heavily mis-
classified. Lack of convexity comes from the same argument since second
derivative wrt. to any weight in the final layer of the model changes sign
(as it is equivalent to first derivative being non-monotonic). This comes
directly from the above computations and the fact that op = hw, hp i + b
for some internal activation hp , layer weights w and layer bias b. In a
natural way this is true even if we do not have any hidden layers (model
is linear). Proofs for L2 and softmax are completely analogous.

Given this negative result, it seems natural to ask whether a similar

property can be proven to show which loss functions should lead to fast
convergence. It seems like the answer is again positive, however based on
the well known deep learning hypothesis that deep models learn well when
dealing with piece-wise linear functions. An interesting phenomenon in
classification based on neural networks is that even in a deep linear model
or rectifier network the top layer is often non-linear, as it uses softmax or
sigmoid activation to produce probability estimates. Once this is intro-
duced, also the partial derivatives stop being piece-wise linear. We believe
that one can achieve faster, better convergence when we ensure that ar-
chitecture together with loss function, produces a piecewise linear partial
derivatives (but not constant) wrt. to final layer activations, especially
while using first order optimisation methods. This property is true only
for L2 loss and squared hinge loss (see Figure 1) among all considered
ones in this paper.

Figure 1: Left: Visualisation of analysed losses as functions of activation on pos-

itive sample. Middle: Visualisation of partial derivatives wrt. to output neuron
for losses based on linear output. Right: Visualisation of partial derivatives wrt.
to output neuron for losses based on probability estimates.

Finally we show relation between Cauchy-Schwarz Divergence loss and

the log loss, justifying its introduction as an objective for neural nets.
Proposition 3. Cauchy-Schwarz Divergence loss is equivalent to cross
entropy loss regularised with half of expected Renyi’s quadratic entropy of
the predictions.
5

(j) P (j) (j)

Proof. Using the fact that ∀i ∃!j : yi = 1 we get that log j pi yi =
P (j) (j)
j yi log pi as well as kyi k2 = 1, consequently
P (j) (j)
X p y X X (j) (j) X
DCS = − N1 log kpji k2i kyiik2 = − N1 log pi yi + N1 log kpi k2 kyi k2
i i j i
X X (j) (j)
X
= − N1 yi log pi + 2N 1
log kpi k22 ≈ Llog + 21 EP (x,y) [H2 (pi )]
i j i

3 Experiments
We begin the experimental section with two simple 2D toy datasets. The
first one is checkerboard – 4 class classification problem where [-1,1] square
is divided into 64 small squares with cyclic class assignment. The second
one, spiral, is a 4 class generalisation of the well known 2 spirals dataset.
Both datasets have 800 training and 800 testing samples. We train rectifier
neural network having from 0 to 5 hidden layers with 200 units in each
of them. Training is performed using Adam [4] with learning rate of
0.00003 for 60,000 iterations with batch size of 50 samples. In these simple

Figure 2: Top row: Learning curves for toy datasets. Bottom row: examples of
decision boundaries, from left: L1 loss, log loss, L1 ◦ σ loss, hinge2 loss.

problems one can distinguish (Figure 2) two groups of losses – one able to
fit to our very dense, low-dimensional data and one struggling to reduce
error to 0. The second group consists of L1 , Chebyshev, Tanimoto and
expectation loss. This division becomes clear once we build a relatively
deep model (5 hidden layers), while for shallow ones this distinction is not
very clear (3 hidden layers) or is even completely lost (1 hidden layer or
linear model). To further confirm the lack of ability to easily overfit we
also ran an experiment in which we tried to fit 800 samples from uniform
distribution over [−1, 1] with randomly assigned 4 labels and achieved
analogous partitioning.
6

During following, real data-based experiments, we focus on further

investigation of loss functions properties emerging after application to
deep models, as well as characteristics of the created models. In particular,
we show that lack of ability to reduce training error to 0 is often correlated
with robustness to various types of noise (despite not underfitting the
data).
Let us now proceed with one of the most common datasets used in
deep learning community – MNIST [7]. We train network consisting
from 0 to 5 hidden layers, each followed by ReLU activation function
and dropout [12] with 50% probability. Each hidden layer consists of 512
neurons, and whole model is trained using Adam [4] with learning rate of
0.00003 for 100,000 iterations using batch size of 100 samples. There are

Figure 3: Top two rows: learning curves for MNIST dataset. Bottom row:
(left) speed of learning expressed as expected training/testing accuracy when
we sample iteration uniformly between 10k and 100k; (right) learning curves for
CIFAR10 dataset.
few interesting findings, visible on Figure 3. First, results obtained for a
linear model (lack of hidden layers) are qualitatively different from all the
remaining ones. For example, using regularised expectation loss leads to
the strongest model in terms of both training accuracy and generalisation
capabilities, while the same loss function is far from being the best one
once we introduce non-linearities. This shows two important things: first
– observations and conclusions drawn from linear models do not seem to
transfer to deep nets, and second – there seems to be an interesting co-
dependence between learning dynamics coming from training rectifier nets
and loss functions used. As a side note, 93% testing accuracy, obtained
by L2 ◦ σ and DCS , is a very strong result on MNIST using linear model
without any data augmentation or model regularisation.
Second interesting observation regards the speed of learning. It ap-
pears that (apart from linear models) hinge2 and hinge3 losses are con-
sistently the fastest in training, and once we have enough hidden layers
(basically more than 1) also L2 . This matches our theoretical analysis of
7

these losses in the previous section. At the same time both expectation
losses are much slower to train, which we believe to be a result of their van-
ishing partial derivatives in heavily misclassified points (Proposition 2).
It is important to notice that while higher order hinge losses (especially
2nd ) actually help in terms of both speed and final performance, the same
property does not hold for higher order log losses. One possible expla-
nation is that taking a square of log loss only reduces model’s certainty
in classification (since any number between 0 and 1 taken to 2nd power
decreases), while for hinge losses the metric used for penalising margin-
outliers is changed, and both L1 metric (leading to hinge) as well as any
other Lp norm (leading to hingep ) make perfect sense.
Third remark is that pure L1 does not learn at all (ending up with
20% accuracy) due to causing serious “jumps” in the model because of
its partial derivatives wrt. to net output always being either -1 or 1.
Consequently, even after classifying a point correctly, we are still heavily
penalised for it, while with losses like L2 the closer we are to the correct
classification - the smaller the penalty is.
Finally, in terms of generalisation capabilities margin-based losses seem
to outperform the remaining families. One could argue that this is just
a result of lack of regularisation in the rest of the losses, however we un-
derline that all the analysed networks use strong dropout to counter the
overfitting problem, and that typical L1 or L2 regularisation penalties do
not work well in deep networks.
For CIFAR10 dataset we used a simple convnet, consisting of 3 layers
of convolutions, each of size 5x5 and 64 filters, with ReLU activation
functions, batch-normalisation and pooling operations in between them
(max pooling after first layer and then two average poolings, all 3x3 with
stride 2), followed by a single fully connected hidden layer with 128 ReLU
neurons, and final softmax layer with 10 neurons. As one can see in
Figure 3, despite completely different architecture than before, we obtain
very similar results – higher order margin losses lead to faster training and
significantly stronger models. Quite surprisingly – L2 loss also exhibits
similar property. Expectation losses again learn much slower (with the
regularised one – training at the level of log loss and unregularised –
significantly worse). We would like to underline that this is a very simple
architecture, far from the state-of-the art models for CIFAR10, however
we wish to avoid using architectures which are heavily overfitted to the
log loss. Furthermore, the aim of this paper is not to provide any state-
of-the-art models, but rather to characterise effects of loss functions on
deep networks.
As the final interesting result in these experiments, we notice that
Cauchy-Schwarz Divergence as the optimisation criterion seems to be a
consistently better choice than log loss. It performs equally well or better
on both MNIST and CIFAR10 in terms of both learning speed and the
final performance. At the same time this information theoretic measure
is very rarely used in DL community, and rather exploited in shallow
learning (for both classification [3] and clustering [10]).
Now we focus on the impact these losses have on noise robustness of
the deep nets. We start by performing the following experiment on pre-
viously trained MNIST classifiers: we add noise sampled from N (0, I) to
each xi and observe how quickly (in terms of growing ) network’s train-
ing accuracy drops (Figure 4). The first crucial observation is that both
expectation losses perform very well in terms of input noise robustness.
We believe that this is a consequence of what Proposition 1 showed about
8

Figure 4: Top row: Training accuracy curves for the MNIST trained models,
when presented with training examples with added noise from N (0, I), plotted
as a function of . Middle and bottom rows: Testing accuracy curves for the
MNSIT experiment with of training labels changed, plotted as a function of
training iteration. If L1 ◦ σ is not visible, it is almost perfectly overlapped by
L∞ ◦ σ.

their probabilistic interpretation – that they lead to minimisation of the

expected misclassification, which is less biased towards outliers than log
loss (or other losses that focus on maximisation of probability of correct
labelling of all samples at the same time). For log loss a single heavily
misclassified point has an enormous impact on the overall error surface,
while for these two losses – it is minor. Secondly, margin based losses
also perform well on this test, usually slightly worse than the expectation
losses, but still better than log loss. This shows that despite no longer
maximising the misclassification margin while being used in deep nets –
they still share some characteristics with their linear origins (SVM). In
another, similar experiment, we focus on the generalisation capabilities of
the networks trained with increasing amount of label noise in the training
set (Figure 4) and obtain analogous results, showing that robustness to
the noise of expectation and margin losses is high for both input and label
noise for deep nets, while again – slightly different results are obtained for
linear models, where log loss is more robust than the margin-based ones.
What is even more interesting, a completely non-standard loss function –
Tanimoto loss – performs extremely well on this task. We believe that its
exact analysis is one of the important future research directions.

4 Conclusions
This paper provides basic analysis of effects the choice of the classifica-
tion loss function has on deep neural networks training as well as their
final characteristics. We believe the obtained results will lead to a wider
9

adoption of various losses in DL work – where up till now log loss is

unquestionable favourite.
In the theoretical section we show that, surprisingly, losses which are
believed to be applicable mostly to regression, have a valid probabilistic
interpretation when applied to deep network-based classifiers. We also
provide theoretical arguments explaining why using them might lead to
slower training, which might be one of the reasons DL practitioners have
not yet exploited this path. Our experiments lead to two crucial con-
clusions. First, that intuitions drawn from linear models rarely transfer
to highly-nonlinear deep networks. Second, that depending on the appli-
cation of the deep model – losses other than log loss are preferable. In
particular, for purely accuracy focused research, squared hinge loss seems
to be a better choice at it converges faster as well as provides better per-
formance. It is also more robust to noise in the training set labelling and
slightly more robust to noise in the input space. However, if one works
with highly noised dataset (both input and output spaces) – the expecta-
tion losses described in detail in this paper – seem to be the best choice,
both from theoretical and empirical perspective.
At the same time this topic is far from being exhausted, with a large
amount of possible paths to follow and questions to be answered. In
particular, non-classical loss functions such as Tanimoto loss and Cauchy-
Schwarz Divergence are worth further investigation.

References
[1] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben
Arous, and Yann LeCun. The loss surfaces of multilayer networks.
In AISTATS, 2015.
[2] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast
and accurate deep network learning by exponential linear units (elus).
arXiv preprint arXiv:1511.07289, 2015.
[3] Wojciech Marian Czarnecki, Rafal Jozefowicz, and Jacek Tabor.
Maximum entropy linear manifold for learning discriminative low-
dimensional representation. In Joint European Conference on Ma-
chine Learning and Knowledge Discovery in Databases, pages 52–67.
Springer, 2015.
[4] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980, 2014.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. In Advances
in neural information processing systems, pages 1097–1105, 2012.
[6] Hugo Larochelle, Yoshua Bengio, Jérôme Louradour, and Pascal
Lamblin. Exploring strategies for training deep neural networks.
Journal of Machine Learning Research, 10(Jan):1–40, 2009.
[7] Yann LeCun, Corinna Cortes, and Christopher JC Burges. The mnist
database of handwritten digits, 1998.
[8] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and
Zhuowen Tu. Deeply-supervised nets. In AISTATS, volume 2, page 6,
2015.
[9] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan,
Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and
10

Koray Kavukcuoglu. Wavenet: A generative model for raw audio.

arXiv preprint arXiv:1609.03499, 2016.
[10] Jose C Principe, Dongxin Xu, and John Fisher. Information theoretic
learning. Unsupervised adaptive filtering, 1:265–319, 2000.
[11] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Lau-
rent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis
Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering
the game of go with deep neural networks and tree search. Nature,
529(7587):484–489, 2016.
[12] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya
Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to
prevent neural networks from overfitting. Journal of Machine Learn-
ing Research, 15(1):1929–1958, 2014.
[13] Yichuan Tang. Deep learning using linear support vector machines.
arXiv preprint arXiv:1306.0239, 2013.

Lecture-4 Emprical Risk and Optimization
No ratings yet
Lecture-4 Emprical Risk and Optimization
20 pages
DL 02 Basics
No ratings yet
DL 02 Basics
94 pages
Lect 8
No ratings yet
Lect 8
117 pages
7 TrainingNN-2
No ratings yet
7 TrainingNN-2
84 pages
Lecture 4 - Linear Classification
No ratings yet
Lecture 4 - Linear Classification
34 pages
Chap 2 Slides
No ratings yet
Chap 2 Slides
74 pages
Lecture 07
No ratings yet
Lecture 07
29 pages
Roz-4 - Janocha
No ratings yet
Roz-4 - Janocha
11 pages
Nptel Lec
No ratings yet
Nptel Lec
22 pages
04 LossFunctions
No ratings yet
04 LossFunctions
22 pages
Mathematical Foundations of Computational Linguistics: Manfred Klenner and Jannis Vamvas
No ratings yet
Mathematical Foundations of Computational Linguistics: Manfred Klenner and Jannis Vamvas
32 pages
DL 02 Basics
No ratings yet
DL 02 Basics
95 pages
What Are Probabilistic Machine Learning Models?
No ratings yet
What Are Probabilistic Machine Learning Models?
61 pages
DL145611 03 Shallow
No ratings yet
DL145611 03 Shallow
92 pages
05 AIS302 ANN-Optimization
No ratings yet
05 AIS302 ANN-Optimization
44 pages
3 - Loss Functions
No ratings yet
3 - Loss Functions
14 pages
Loss Functions in Deep Learning - MLearning - Ai
No ratings yet
Loss Functions in Deep Learning - MLearning - Ai
14 pages
Lecture 11
No ratings yet
Lecture 11
26 pages
Lecture 6
No ratings yet
Lecture 6
19 pages
Cross Entropy Loss Intro, Applications
No ratings yet
Cross Entropy Loss Intro, Applications
21 pages
practicalMachineLearning Lecture3
No ratings yet
practicalMachineLearning Lecture3
25 pages
Hybrid Discriminative-Generative Training Via COnstrrastive Leanring
No ratings yet
Hybrid Discriminative-Generative Training Via COnstrrastive Leanring
14 pages
Polyloss A Polynomial Expansion Perspec
No ratings yet
Polyloss A Polynomial Expansion Perspec
16 pages
03-Linear Classification
No ratings yet
03-Linear Classification
17 pages
Module 6 - Loss Function
No ratings yet
Module 6 - Loss Function
22 pages
Loss Function - Ipynb - Colaboratory
No ratings yet
Loss Function - Ipynb - Colaboratory
6 pages
Loss Functions
No ratings yet
Loss Functions
7 pages
Lecture 19
No ratings yet
Lecture 19
8 pages
7.losses and Activations
No ratings yet
7.losses and Activations
79 pages
Loss Functions Types
No ratings yet
Loss Functions Types
11 pages
Loss Functions
No ratings yet
Loss Functions
29 pages
Lecun 05 A
No ratings yet
Lecun 05 A
8 pages
02 - Linear Models - D (Multiclass Classification)
No ratings yet
02 - Linear Models - D (Multiclass Classification)
9 pages
Loss Functions
No ratings yet
Loss Functions
8 pages
Lec 04 Deep Networks 2
No ratings yet
Lec 04 Deep Networks 2
78 pages
Practical-5 - 2CEIT606 - Artificial Intelligence
No ratings yet
Practical-5 - 2CEIT606 - Artificial Intelligence
14 pages
Lecture 7 Loss Function and Regularization
No ratings yet
Lecture 7 Loss Function and Regularization
38 pages
On Expected Accuracy: Ozan İrsoy
No ratings yet
On Expected Accuracy: Ozan İrsoy
6 pages
Why You Don't Overfit, and Don't Need Bayes If You Only Train For One Epoch
No ratings yet
Why You Don't Overfit, and Don't Need Bayes If You Only Train For One Epoch
6 pages
Loss Functions
No ratings yet
Loss Functions
15 pages
Introduction To Machine Learning
No ratings yet
Introduction To Machine Learning
1 page
CS6910 Tutorial5
No ratings yet
CS6910 Tutorial5
9 pages
9.b Handout-1-Loss Functions
No ratings yet
9.b Handout-1-Loss Functions
3 pages
10 Gradient Based Learning 10-08-2024
No ratings yet
10 Gradient Based Learning 10-08-2024
22 pages
DL Assi02
No ratings yet
DL Assi02
9 pages
AI and Math - Python Multiple-Choice Questions
No ratings yet
AI and Math - Python Multiple-Choice Questions
16 pages
Loss Function
No ratings yet
Loss Function
2 pages
Loss Functions in Deep Learning: A Comprehensive Review
No ratings yet
Loss Functions in Deep Learning: A Comprehensive Review
36 pages
Unit 2b
No ratings yet
Unit 2b
11 pages
Loss Function
No ratings yet
Loss Function
9 pages
CS-31002 (ML) - CS End April 2025
No ratings yet
CS-31002 (ML) - CS End April 2025
19 pages
Statistical Learning Theory
No ratings yet
Statistical Learning Theory
4 pages
Supervised Learning
No ratings yet
Supervised Learning
5 pages
Cheat Sheet 1 Microsoft Azure Ai Fundamentals Ai 900 Ai Concepts
100% (1)
Cheat Sheet 1 Microsoft Azure Ai Fundamentals Ai 900 Ai Concepts
1 page
Lecture 2
No ratings yet
Lecture 2
6 pages
Lecture 1
No ratings yet
Lecture 1
6 pages
NLP-NeuralNetworks Reading Notes
No ratings yet
NLP-NeuralNetworks Reading Notes
13 pages
SP18 CS182 Midterm Solutions - Edited
No ratings yet
SP18 CS182 Midterm Solutions - Edited
14 pages
Unit 6 - Week 5: Assignment 5
No ratings yet
Unit 6 - Week 5: Assignment 5
3 pages
ML - 1 - Sovan - Introduction To ML
No ratings yet
ML - 1 - Sovan - Introduction To ML
83 pages
ML Unit-Ii
No ratings yet
ML Unit-Ii
38 pages
Victor Hugo C. de Albuquerque, Pethuru Raj, Satya Prakash Yadav (Eds.) Toward Artificial General Intelligence
No ratings yet
Victor Hugo C. de Albuquerque, Pethuru Raj, Satya Prakash Yadav (Eds.) Toward Artificial General Intelligence
424 pages
12 Neural Network
No ratings yet
12 Neural Network
52 pages
Report
No ratings yet
Report
7 pages
Deep Learning Question Paper 2025
No ratings yet
Deep Learning Question Paper 2025
4 pages
High Frequency Volatility
No ratings yet
High Frequency Volatility
14 pages
English Paper
No ratings yet
English Paper
13 pages
p1851 PDF
No ratings yet
p1851 PDF
7 pages
Menu - 634641189420527500 - CS5106 Soft Computing Lab Assignments
0% (1)
Menu - 634641189420527500 - CS5106 Soft Computing Lab Assignments
3 pages
Transformer NLP
No ratings yet
Transformer NLP
64 pages
Srinidhi Kannan Resume AI
No ratings yet
Srinidhi Kannan Resume AI
1 page
Bilkis Yakub Rasool Vs Union of India On 8 January 2024
No ratings yet
Bilkis Yakub Rasool Vs Union of India On 8 January 2024
105 pages
Self-Supervised Learning Generative or Contrastive
No ratings yet
Self-Supervised Learning Generative or Contrastive
20 pages
Image Captioning Final
No ratings yet
Image Captioning Final
31 pages
List of Artificial Intelligence Projects
No ratings yet
List of Artificial Intelligence Projects
5 pages
What Is Deep Learning and How Does It Work - Towards Data Science
No ratings yet
What Is Deep Learning and How Does It Work - Towards Data Science
38 pages
Harsh Chaudhary: 2018UME1704@mnit - Ac.in
No ratings yet
Harsh Chaudhary: 2018UME1704@mnit - Ac.in
1 page
Mid-Price Prediction in A Limit Order Book: Deepan Palguna and Ilya Pollak
No ratings yet
Mid-Price Prediction in A Limit Order Book: Deepan Palguna and Ilya Pollak
9 pages
Xia Text2Loc 3D Point Cloud Localization From Natural Language CVPR 2024 Paper
No ratings yet
Xia Text2Loc 3D Point Cloud Localization From Natural Language CVPR 2024 Paper
10 pages
Deep Neural Networks and Tabular Data: A Survey
No ratings yet
Deep Neural Networks and Tabular Data: A Survey
22 pages
FISS GAN A Generative Adversarial Network For Foggy Image Semantic Segmentation
No ratings yet
FISS GAN A Generative Adversarial Network For Foggy Image Semantic Segmentation
12 pages
Curse of Dimensionality and Its Reduction
No ratings yet
Curse of Dimensionality and Its Reduction
5 pages
Neural Networks - Comprehensive Foundation (Introduction)
No ratings yet
Neural Networks - Comprehensive Foundation (Introduction)
47 pages
Modularity in Technology and Society
No ratings yet
Modularity in Technology and Society
41 pages
Lightweight Deep Learning Framework For Speech Emotion Recognition
No ratings yet
Lightweight Deep Learning Framework For Speech Emotion Recognition
13 pages
3 ICT Nawel
No ratings yet
3 ICT Nawel
6 pages
1223-Article Text-5565-1-10-20200702
No ratings yet
1223-Article Text-5565-1-10-20200702
7 pages
Using HTK
No ratings yet
Using HTK
36 pages
Impact of Openai On Tech
No ratings yet
Impact of Openai On Tech
4 pages
Neural Networks: Aroob Amjad Farrukh
No ratings yet
Neural Networks: Aroob Amjad Farrukh
6 pages
1611 03530 PDF
No ratings yet
1611 03530 PDF
15 pages
Application of Kalman Filter in The Prediction of Stock Price
No ratings yet
Application of Kalman Filter in The Prediction of Stock Price
2 pages
Realtime Power System Security Assessment Using Artifical Neural Networks
No ratings yet
Realtime Power System Security Assessment Using Artifical Neural Networks
13 pages
Printed Arabic Letter Recognition Based On Image
No ratings yet
Printed Arabic Letter Recognition Based On Image
6 pages
Analytical Views: Factoring
No ratings yet
Analytical Views: Factoring
6 pages
The Tech Interview Playbook: From DSA to System Design
From Everand
The Tech Interview Playbook: From DSA to System Design
Chinmoy Mukherjee
No ratings yet
Reflective Trigate Design for Classical Computers
From Everand
Reflective Trigate Design for Classical Computers
Ylia Callan
No ratings yet
IGNOU PGDCA MCS 208 Data Structure and Algorithm Previous Years Unsolved Papers
From Everand
IGNOU PGDCA MCS 208 Data Structure and Algorithm Previous Years Unsolved Papers
Manish Soni
No ratings yet