ANN Unit-2
• Classical machine learning models reach their learning capacity early because they
correspond to simple, shallow neural networks.
• When we have more data, we can add more computational units to improve
performance.
• Exploring the neural models for traditional machine learning is useful because it
exposes the cases in which deep learning has an advantage.
– Add capacity with more nodes for more data.
– Controlling the structure of the architecture provides a way to incorporate domain-
specific insights (e.g., recurrent networks and convolutional networks).
• In some cases, making minor changes to the architecture leads to interesting models:
– Adding a sigmoid/softmax layer at the output of a neural model for (linear) matrix
factorization results in logistic/multinomial matrix factorization (e.g., word2vec).
• The perceptron criterion is a minor variation of the hinge loss, with the identical update
W ⇐ W + αyX in both cases.
• In the perceptron we update only for misclassified instances, whereas in the SVM we also
update for “marginally correct” instances inside the margin (contrast sketched below).
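A minimal NumPy sketch of this contrast (the weight vector, data point, and learning rate below are illustrative): the update W ⇐ W + αyX is identical in both models; only the condition that triggers it differs.

```python
import numpy as np

def perceptron_update(W, X, y, lr=0.1):
    # Perceptron criterion: update only when the point is misclassified,
    # i.e. y * (W . X) < 0.
    if y * np.dot(W, X) < 0:
        W = W + lr * y * X
    return W

def hinge_update(W, X, y, lr=0.1):
    # Hinge loss (unregularized SVM): also update for "marginally correct"
    # points inside the margin, i.e. y * (W . X) < 1.
    if y * np.dot(W, X) < 1:
        W = W + lr * y * X
    return W

# A point that is correctly classified but inside the margin: y * (W . X) = 0.25.
W = np.array([0.25, 0.25, 0.5])
X, y = np.array([1.0, 2.0, -1.0]), +1
print(perceptron_update(W.copy(), X, y))  # unchanged: [0.25, 0.25, 0.5]
print(hinge_update(W.copy(), X, y))       # updated:   [0.35, 0.45, 0.4]
```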
What About the Kernel SVM?
• These connections tell us in which cases it makes sense to use conventional machine
learning:
– If you have limited, noisy data, you want to use conventional machine learning.
– If you have a lot of data with rich structure, you want to use neural networks.
– Structure is often learned by using deep neural architectures.
• Architectures like convolutional neural networks can use domain-specific insights.
• In linear regression, we have training pairs (Xi, yi) for i ∈ {1 . . . n}, where Xi
contains d-dimensional features and yi is a numerical target.
• We use a linear parameterized function to predict ŷi = W · Xi.
• The goal is to learn W so that the sum of squared differences between the observed yi and
the predicted ŷi is minimized over the entire training data.
• Solution exists in closed form, but requires the inversion of a potentially large
matrix.
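A short NumPy sketch of the closed-form solution on a small synthetic dataset (the dimensions and noise level are illustrative); note that it requires solving/inverting the d × d matrix XᵀX.

```python
import numpy as np

# Synthetic data: n = 100 points with d = 3 features, targets from a known W plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_W = np.array([2.0, -1.0, 0.5])
y = X @ true_W + 0.1 * rng.normal(size=100)

# Closed-form least squares: W = (X^T X)^{-1} X^T y.
# Solving the normal equations directly costs O(d^3) and can be ill-conditioned,
# which is one reason gradient-based updates are often preferred for large d.
W_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(W_hat)          # close to [2.0, -1.0, 0.5]
print(X @ W_hat)      # predictions y_hat_i = W . X_i
```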
Convert the binary loss functions and updates to a form more easily comparable to the
perceptron, using yi² = 1 (since yi ∈ {−1, +1}).
Consider the training pair (Xi, yi) with d-dimensional feature variables in Xi and class
variable yi ∈ {−1,+1}.
• In logistic regression, the sigmoid function is applied to W · Xi to obtain ŷi, the
predicted probability that yi is +1.
• We want to maximize ŷi for positive class instances and 1 − ŷi for negative class
instances.
– Same as minimizing −log(ŷi) for positive class instances and −log(1 − ŷi) for
negative instances.
– Same as minimizing the loss Li = −log(|yi/2 − 0.5 + ŷi|).
– Alternative form of the loss: Li = log(1 + exp[−yi(W · Xi)])
• The factor multiplying the gradient-based update is 1 − ŷi for positive instances and ŷi
for negative instances ⇒ the probability of a mistake (see the check below)!
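A quick NumPy check (W and Xi are illustrative values) that the two loss forms above coincide, and that the "probability of mistake" factor equals 1 − ŷi for yi = +1 and ŷi for yi = −1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([0.5, -0.25])
Xi = np.array([1.0, 1.0])

for yi in (+1, -1):
    y_hat = sigmoid(W @ Xi)                        # predicted P(yi = +1)
    loss_a = -np.log(np.abs(yi / 2 - 0.5 + y_hat)) # Li = -log(|yi/2 - 0.5 + y_hat|)
    loss_b = np.log(1 + np.exp(-yi * (W @ Xi)))    # Li = log(1 + exp[-yi (W . Xi)])
    p_mistake = 1 - y_hat if yi == +1 else y_hat
    print(yi, round(loss_a, 4), round(loss_b, 4), round(p_mistake, 4))
# Both loss columns match; the last column is the probability of a mistake.
```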
The unregularized updates of the perceptron, SVM, Widrow-Hoff, and logistic
regression can all be written in the following form:
W ⇐ W + α yi δ(Xi, yi) Xi
• The quantity δ(Xi, yi) is a mistake function, which is:
– Raw mistake value (1 − yi(W · Xi)) for Widrow-Hoff.
– Mistake indicator of whether (0 − yi(W · Xi)) > 0 for the perceptron.
– Margin/mistake indicator of whether (1 − yi(W · Xi)) > 0 for the SVM.
– Probability of mistake on (Xi, yi) for logistic regression.
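A sketch of the unified update with δ swapped per model; the function names, weight values, and learning rate are illustrative, but each δ follows the corresponding mistake function listed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Mistake functions delta(Xi, yi), given the current W.
def delta_widrow_hoff(W, X, y):
    return 1 - y * (W @ X)                  # raw mistake value

def delta_perceptron(W, X, y):
    return float((0 - y * (W @ X)) > 0)     # mistake indicator

def delta_svm(W, X, y):
    return float((1 - y * (W @ X)) > 0)     # margin/mistake indicator

def delta_logistic(W, X, y):
    return 1 - sigmoid(y * (W @ X))         # probability of mistake

def unified_update(W, X, y, delta, lr=0.1):
    # W <= W + alpha * yi * delta(Xi, yi) * Xi
    return W + lr * y * delta(W, X, y) * X

W = np.array([0.25, 0.25, 0.5])
X, y = np.array([1.0, 2.0, -1.0]), +1   # y * (W . X) = 0.25
for delta in (delta_widrow_hoff, delta_perceptron, delta_svm, delta_logistic):
    print(delta.__name__, unified_update(W, X, y, delta))
```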
▪ Autoencoders are a specific type of feedforward neural network in which the output is
the same as the input.
▪ They compress the input into a lower-dimensional code and then reconstruct the
output from this representation.
▪ The code is a compact “summary” or “compression” of the input, also called the
latent space representation.
▪ An autoencoder consists of three components: an encoder, a code, and a decoder.
▪ The encoder compresses the input and produces the code; the decoder then
reconstructs the input using only this code.
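A minimal Keras sketch of the encoder–code–decoder structure; the layer sizes (64-dimensional input, 8-dimensional code) and random training data are illustrative.

```python
import numpy as np
import tensorflow as tf

input_dim, code_dim = 64, 8   # illustrative sizes

# Encoder: compresses the input down to the code (latent representation).
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(input_dim,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(code_dim, activation="relu"),
])
# Decoder: reconstructs the input using only the code.
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(code_dim,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(input_dim, activation="linear"),
])
autoencoder = tf.keras.Sequential([encoder, decoder])

# Input and target are the same: the network learns to reproduce its input.
autoencoder.compile(optimizer="adam", loss="mse")
X = np.random.rand(1000, input_dim).astype("float32")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)

codes = encoder.predict(X[:3], verbose=0)        # latent space representation
recon = autoencoder.predict(X[:3], verbose=0)    # reconstruction of the input
print(codes.shape, recon.shape)                  # (3, 8) (3, 64)
```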
Vocabulary:{"queen","man","woman","child","king","prince","princess","throne","palac
e","royal"}
Indices: {0: "queen", 1: "man", 2: "woman", 3: "child", ..., 9: "royal"}.
Parameters:
Vocabulary size (V) = 10
Embedding size (N) = 3
Context size (C) = 3
Learning rate (η) = 0.1
Training Example:
Context words: ["queen", "man", "woman"] ([0,1,2])
Target word: "king" ([4])
L = − Σ_{i=1}^{V} target_i · log(y_i)
Where:
• V: Vocabulary size.
• yi: Predicted probability for word i.
• targeti: One-hot encoded value for the target word.
y_i = softmax(u)_i = e^{u_i} / Σ_{j=1}^{V} e^{u_j}
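A NumPy sketch of one CBOW forward pass with the numbers above (V = 10, N = 3, context indices [0, 1, 2] for ["queen", "man", "woman"], target index 4 for "king"); the two weight matrices are random placeholders.

```python
import numpy as np

V, N = 10, 3                            # vocabulary size, embedding size
context_ids, target_id = [0, 1, 2], 4   # ["queen", "man", "woman"] -> "king"

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))    # input -> hidden embeddings
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden -> output weights

# Hidden layer: average of the context word embeddings.
h = W_in[context_ids].mean(axis=0)           # shape (N,)

# Scores and softmax over the vocabulary: y_i = e^{u_i} / sum_j e^{u_j}.
u = h @ W_out                                # shape (V,)
y = np.exp(u - u.max()) / np.exp(u - u.max()).sum()

# Cross-entropy loss L = -sum_i target_i * log(y_i); target is one-hot,
# so the sum reduces to -log(y[target_id]).
target = np.zeros(V)
target[target_id] = 1.0
L = -np.sum(target * np.log(y))
print(y.round(3), L)
```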
The Skip-Gram model learns word embeddings by predicting the context words given a
target word.
For w = 2, the context words are the 2 words before and after the target word.
Vocabulary: {"We", "love", "machine", "learning"}.
Input: One-hot vector of the target word.
Output: Probabilities of the context words.
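A small sketch generating skip-gram (target, context) training pairs for the sentence "We love machine learning" with window size w = 2, plus the one-hot input vector for a target word (the helper names are illustrative).

```python
import numpy as np

sentence = ["We", "love", "machine", "learning"]
vocab = {word: i for i, word in enumerate(sentence)}   # {"We": 0, "love": 1, ...}
window = 2

# For each target word, the context is up to `window` words before and after it.
pairs = []
for t, target in enumerate(sentence):
    for c in range(max(0, t - window), min(len(sentence), t + window + 1)):
        if c != t:
            pairs.append((target, sentence[c]))
print(pairs)
# e.g. ("love", "We"), ("love", "machine"), ("love", "learning"), ...

# Skip-gram input: one-hot vector of the target word;
# the output layer predicts probabilities for the context words.
def one_hot(word, size=len(vocab)):
    v = np.zeros(size)
    v[vocab[word]] = 1.0
    return v

print(one_hot("love"))   # [0. 1. 0. 0.]
```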