ANN Unit-2

The document outlines the principles and architectures of shallow neural networks, emphasizing their role in binary and multiclass classification, as well as autoencoders. It discusses the relationship between classical machine learning models and shallow neural networks, highlighting how minor architectural changes can yield different models. Additionally, it covers the advantages of deep learning, the structure of autoencoders, and the importance of hyperparameters in training these models.

Artificial Neural Network



Syllabus
UNIT 2: Shallow Neural Networks (9 hours, P: 6 hours)
Neural Architectures for Binary Classification Models, Neural Architectures for Multiclass Models, Autoencoder: Basic Principles, Neural Embedding with Continuous Bag of Words, Simple Neural Architectures for Graph Embeddings



Neural Networks and Machine Learning

• Neural networks are optimization-based learning models.


• Many classical machine learning models use continuous optimization:
– SVMs, Linear Regression, and Logistic Regression
– Singular Value Decomposition
– (Incomplete) Matrix factorization for Recommender Systems
• All these models can be represented as special cases of shallow neural networks!



The Continuum Between Machine Learning and Deep Learning

• Classical machine learning models reach their learning capacity early because they are equivalent to simple (shallow) neural networks.
• When we have more data, we can add more computational units to improve performance.



The Deep Learning Advantage

• Exploring the neural models for traditional machine learning is useful because it exposes the cases in which deep learning has an advantage:
  – Add capacity with more nodes for more data.
  – Controlling the structure of the architecture provides a way to incorporate domain-specific insights (e.g., recurrent networks and convolutional networks).
• In some cases, making minor changes to the architecture leads to interesting models:
  – Adding a sigmoid/softmax layer in the output of a neural model for (linear) matrix factorization can result in logistic/multinomial matrix factorization (e.g., word2vec).



Neural Architectures for Binary Classification Models
Recap: Perceptron versus Linear Support Vector Machine

• The perceptron criterion is a minor variation of the hinge loss, with the identical update W ⇐ W + αyX in both cases.
• We update only for misclassified instances in the perceptron, but also update for "marginally correct" instances (those inside the margin) in the SVM.
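A minimal NumPy sketch of the two update rules, highlighting the single difference: the perceptron updates only when y(W · X) ≤ 0, while the hinge (SVM) criterion also updates when the point is inside the margin. The data point and learning rate below are illustrative assumptions.

import numpy as np

def perceptron_update(W, X, y, lr=0.1):
    # Update only on misclassified points: y * (W . X) <= 0
    if y * np.dot(W, X) <= 0:
        W = W + lr * y * X
    return W

def hinge_update(W, X, y, lr=0.1):
    # Update on misclassified AND marginally correct points: y * (W . X) < 1
    if y * np.dot(W, X) < 1:
        W = W + lr * y * X
    return W

# Illustrative point that is correctly classified but inside the margin
W = np.array([0.2, 0.1])
X = np.array([1.0, 2.0])
y = +1                                     # y * (W . X) = 0.4: perceptron skips, SVM updates
print(perceptron_update(W.copy(), X, y))   # unchanged
print(hinge_update(W.copy(), X, y))        # weights updated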
What About the Kernel SVM?

• RBF network for unsupervised feature engineering:
  – Unsupervised feature engineering is good for noisy data.
  – Supervised feature engineering (with deep learning) is good for learning rich structure.



Much of Machine Learning is a Shallow Neural Model

By making minor changes to the architecture of the perceptron, we can obtain:
  – Linear regression, the Fisher discriminant, and Widrow-Hoff learning ⇒ linear activation in the output node
  – Logistic regression ⇒ sigmoid activation in the output node
  – Multinomial logistic regression ⇒ softmax activation in the final layer
  – Singular value decomposition ⇒ linear autoencoder
  – Incomplete matrix factorization for recommender systems ⇒ autoencoder-like architecture with a single hidden layer (also used in word2vec)



Why Do We Care about These Connections?

• The connections tell us in which cases it makes sense to use conventional machine learning:
  – With limited, noisy data, conventional machine learning is preferable.
  – With a lot of data containing rich structure, neural networks are preferable.
  – Structure is often learned by using deep neural architectures.
• Architectures like convolutional neural networks can use domain-specific insights.



Widrow-Hoff Rule: The Neural Avatar of Linear Regression

• The perceptron (1958) was historically followed by Widrow-Hoff Learning (1960).


• Identical to linear regression when applied to numerical targets.
– Originally proposed by Widrow and Hoff for binary targets (not natural for regression).
• The Widrow-Hoff method, when applied to mean-centered features and mean-centered
binary class encoding, learns the Fisher discriminant.



Linear Regression: An Introduction

• In linear regression, we have training pairs (Xi, yi) for i ∈ {1 . . . n}, where Xi contains d-dimensional features and yi is a numerical target.
• We use a linear parameterized function ŷi = W · Xi to predict the target.
• The goal is to learn W so that the sum of squared differences between the observed yi and the predicted ŷi is minimized over the entire training data.
• A closed-form solution exists, but it requires the inversion of a potentially large matrix, so gradient-descent (neural) training is often preferred.
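A small NumPy sketch contrasting the closed-form least-squares solution with iterative gradient descent (the neural view of the same problem); the synthetic data, learning rate, and iteration count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed form: requires solving a d x d linear system (matrix inversion)
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the squared-error loss
w = np.zeros(d)
lr = 0.01
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y) / n   # gradient of the mean squared error
    w -= lr * grad

print(np.allclose(w_closed, w, atol=1e-2))   # both recover (approximately) the same W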



Linear Regression with Numerical Targets: Neural Model



Widrow-Hoff: Linear Regression with Binary Targets



Comparison of Widrow-Hoff with Perceptron and SVM

Convert the binary loss function and updates to a form more easily comparable to the perceptron by using the fact that yi² = 1 for yi ∈ {−1, +1}.



Connections with the Fisher Discriminant

• Consider a binary classification problem with training instances (Xi, yi), where yi ∈ {−1, +1}.
• Mean-center each feature vector by subtracting the mean of the features.
• Mean-center the binary class by subtracting the mean of the targets from each yi.
• Use the delta rule (Widrow-Hoff) for learning; the resulting W is the Fisher discriminant direction.


Neural Models for Logistic Regression

• Consider the training pair (Xi, yi), with d-dimensional feature variables in Xi and a class variable yi ∈ {−1, +1}.
• In logistic regression, the sigmoid function is applied to W · Xi, which predicts the probability ŷi that yi is +1.
• We want to maximize ŷi for positive-class instances and 1 − ŷi for negative-class instances.
  – Same as minimizing −log(ŷi) for positive instances and −log(1 − ŷi) for negative instances.
  – Same as minimizing the loss Li = −log(|yi/2 − 0.5 + ŷi|).
  – An equivalent form of the loss is Li = log(1 + exp[−yi(W · Xi)]).
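A quick NumPy check, on arbitrary illustrative values, that the three ways of writing the logistic loss above agree.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
W = rng.normal(size=3)

for _ in range(5):
    X = rng.normal(size=3)
    y = rng.choice([-1.0, 1.0])
    y_hat = sigmoid(np.dot(W, X))          # predicted probability that y = +1

    nll   = -np.log(y_hat) if y == 1 else -np.log(1 - y_hat)
    form2 = -np.log(abs(y / 2 - 0.5 + y_hat))
    form3 = np.log(1 + np.exp(-y * np.dot(W, X)))

    print(np.allclose([nll, form2], form3))   # True: all three forms coincide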



Interpreting the Logistic Update

• An important multiplicative factor in the update increment is 1/(1 + exp[yi(W · Xi)]).
• This factor equals 1 − ŷi for positive instances and ŷi for negative instances ⇒ the probability of a mistake!
• Interpretation: W ⇐ W + α [Probability of mistake on (Xi, yi)] (yi Xi)



Comparing Updates of Different Models

The unregularized updates of the perceptron, SVM, Widrow-Hoff, and logistic regression can all be written in the form:
  W ⇐ W + α yi δ(Xi, yi) Xi
• The quantity δ(Xi, yi) is a mistake function:
  – The raw mistake value (1 − yi(W · Xi)) for Widrow-Hoff
  – An indicator of whether −yi(W · Xi) > 0 for the perceptron
  – A margin/mistake indicator of whether (1 − yi(W · Xi)) > 0 for the SVM
  – The probability of a mistake on (Xi, yi) for logistic regression
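A NumPy sketch of this unified update, with δ swapped per model exactly as listed above; the starting weights, data point, and learning rate are illustrative assumptions.

import numpy as np

def delta(model, W, X, y):
    margin = y * np.dot(W, X)
    if model == "widrow-hoff":           # raw mistake value
        return 1.0 - margin
    if model == "perceptron":            # indicator of a misclassification
        return 1.0 if -margin > 0 else 0.0
    if model == "svm":                   # indicator of a margin violation
        return 1.0 if 1.0 - margin > 0 else 0.0
    if model == "logistic":              # probability of a mistake
        return 1.0 / (1.0 + np.exp(margin))
    raise ValueError(model)

def sgd_step(model, W, X, y, lr=0.1):
    return W + lr * y * delta(model, W, X, y) * X

W0 = np.array([0.3, -0.2])
X, y = np.array([1.0, 1.0]), +1.0
for m in ["widrow-hoff", "perceptron", "svm", "logistic"]:
    print(m, sgd_step(m, W0, X, y))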



Comparing Loss Functions of Different Models



Autoencoders

▪ Autoencoders are a specific type of feedforward neural network in which the output is trained to reproduce the input.
▪ They compress the input into a lower-dimensional code and then reconstruct the output from this representation.
▪ The code is a compact “summary” or “compression” of the input, also called the latent-space representation.
▪ An autoencoder consists of 3 components: encoder, code, and decoder.
▪ The encoder compresses the input and produces the code; the decoder then reconstructs the input using only this code.



Architecture
• Both the encoder and the decoder are fully connected feedforward neural networks.
• The code is a single layer of the network, with a dimensionality of our choice.
• The number of nodes in the code layer (the code size) is a hyperparameter that we set before training the autoencoder.





Encoder: The encoder is a feedforward, fully connected neural network that compresses the input into a latent-space representation, encoding the input (e.g., an image) as a compressed representation in a reduced dimension. The compressed representation is a distorted version of the original input.
Code: This part of the network contains the reduced representation of the input that is fed into the decoder.
Decoder: The decoder is also a feedforward network with a structure similar to the encoder. It is responsible for reconstructing the input back to its original dimensions from the code.
Latent space: An abstract multi-dimensional space that encodes a meaningful internal representation of externally observed events.



Hyperparameters of Autoencoders

There are 4 hyperparameters that we need to set before training an autoencoder:
1. Code size: the number of nodes in the middle layer. A smaller code size results in more compression.
2. Number of layers: the autoencoder can consist of as many layers as we want.
3. Number of nodes per layer: the number of nodes per layer decreases with each subsequent layer of the encoder and increases back in the decoder; the decoder is symmetric to the encoder in terms of layer structure.
4. Loss function: we use either mean squared error or binary cross-entropy. If the input values are in the range [0, 1], we typically use cross-entropy; otherwise, we use mean squared error.
Autoencoders are trained the same way as other ANNs, via backpropagation; a minimal sketch follows below.
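A minimal sketch of such an autoencoder using Keras (assuming TensorFlow is installed). The layer sizes, the code size of 32, the activation choices, the 784-dimensional input, and the random placeholder data are illustrative assumptions, not values from the slides.

import numpy as np
from tensorflow.keras import layers, models

input_dim = 784   # e.g., flattened 28x28 images (assumption)
code_size = 32    # hyperparameter 1: number of nodes in the middle layer

autoencoder = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),          # encoder hidden layer
    layers.Dense(code_size, activation="relu"),    # the code (latent representation)
    layers.Dense(128, activation="relu"),          # decoder hidden layer (symmetric)
    layers.Dense(input_dim, activation="sigmoid"), # reconstruction of the input
])

# Inputs scaled to [0, 1], so binary cross-entropy is a reasonable loss choice
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

x = np.random.rand(256, input_dim).astype("float32")    # placeholder data (assumption)
autoencoder.fit(x, x, epochs=2, batch_size=64, verbose=0)   # target == input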



Autoencoder – Example

▪ Two input neurons
▪ One hidden layer with two neurons (encoder)
▪ Latent layer with two neurons
▪ One hidden layer with two neurons (decoder)
▪ Two output neurons (same as the input)



Autoencoder – Forward Pass

▪ Encoder: Z = f_encoder(x; W1, b1) = σ(W1 x + b1)
  where W1 are the encoder weights, b1 the encoder biases, and σ the activation function (e.g., ReLU or sigmoid).
▪ Decoder: x̂ = f_decoder(Z; W2, b2) = σ(W2 Z + b2)
  where W2 are the decoder weights and b2 the decoder biases.
▪ Loss function: L = Σi (x̂i − xi)²

Autoencoder – Backpropagation

▪ Decoder gradients (applying the chain rule):
  ∂L/∂W2 = (∂L/∂x̂) · (∂x̂/∂(W2 Z)) · (∂(W2 Z)/∂W2) = [2(x̂ − x) ⊙ σ′(W2 Z + b2)] Z^T
  Similarly for the bias: ∂L/∂b2 = 2(x̂ − x) ⊙ σ′(W2 Z + b2)
▪ Encoder gradients:
  ∂L/∂W1 = (∂L/∂x̂) · (∂x̂/∂Z) · (∂Z/∂(W1 x)) · (∂(W1 x)/∂W1)
         = [(W2^T [2(x̂ − x) ⊙ σ′(W2 Z + b2)]) ⊙ σ′(W1 x + b1)] x^T
  Similarly for the bias (replace x^T by 1 in the equation above).
Autoencoder – Weight Updates

Decoder weights:
  W2 ← W2 − α ∂L/∂W2 ;  b2 ← b2 − α ∂L/∂b2
Encoder weights:
  W1 ← W1 − α ∂L/∂W1 ;  b1 ← b1 − α ∂L/∂b1
This process propagates errors from the output back through the decoder to the encoder, ensuring the autoencoder improves at reconstructing the input.

Autoencoder – with L2 Norm

New loss function: L = Σi (x̂i − xi)² + λ(‖W1‖² + ‖W2‖²)
  where λ is the regularization parameter (it controls the strength of the penalty).
Decoder weights:
  W2 ← W2 − α (∂L/∂W2 + λ W2)
Encoder weights:
  W1 ← W1 − α (∂L/∂W1 + λ W1)


Numerical Example

Input x = [1, 0.5]^T
Encoder weights W1 = [[0.5, 0.3], [0.2, 0.7]]; b1 = [0, 0]^T
Decoder weights W2 = [[0.6, 0.4], [0.1, 0.9]]; b2 = [0, 0]^T
Activation function: identity

Forward pass: encoder Z = [0.65, 0.55]^T; decoder x̂ = [0.65, 0.56]^T
Loss = 0.1261

Decoder gradients:
  ∂L/∂W2 = 2(x̂ − x) ⊙ Z^T = [[−0.455, −0.385], [0.078, 0.066]];  ∂L/∂b2 = [−0.7, 0.12]^T
Encoder gradients:
  ∂L/∂W1 = 2(x̂ − x) ⊙ W2^T ⊙ x^T = [[−0.402, −0.201], [0.0468, 0.0234]];  ∂L/∂b1 = [−0.402, 0.0468]^T

Updated weights (learning rate α = 0.1):
  Decoder: W2 = [[0.6455, 0.4385], [0.0922, 0.8934]];  b2 = [0.07, −0.012]^T
  Encoder: W1 = [[0.5402, 0.3201], [0.1953, 0.6977]];  b1 = [0.0402, 0.00468]^T

Gradients with the L2 norm (λ = 0.1):
  Decoder: ∂L/∂W2 + λW2 = [[−0.395, −0.345], [0.088, 0.156]]
  Encoder: ∂L/∂W1 + λW1 = [[−0.352, −0.171], [0.0668, 0.0934]]
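A NumPy sketch of this worked example, using the slide's weights, biases, and identity activation, and following the gradient formulas from the previous slides. Where the slide rounds intermediate values (e.g., the decoder output), the printed numbers can differ slightly from those listed above.

import numpy as np

x  = np.array([1.0, 0.5])
W1 = np.array([[0.5, 0.3], [0.2, 0.7]]); b1 = np.zeros(2)
W2 = np.array([[0.6, 0.4], [0.1, 0.9]]); b2 = np.zeros(2)
alpha, lam = 0.1, 0.1   # learning rate and L2 strength (implied by the slide's numbers)

# Forward pass (identity activation)
Z = W1 @ x + b1
x_hat = W2 @ Z + b2
loss = np.sum((x_hat - x) ** 2)

# Backward pass
e = 2 * (x_hat - x)                 # dL/dx_hat
dW2 = np.outer(e, Z);  db2 = e
delta = W2.T @ e                    # error propagated back to the hidden layer
dW1 = np.outer(delta, x);  db1 = delta

# Plain gradient-descent updates, and the L2-regularized gradient variants
W2_new = W2 - alpha * dW2
W1_new = W1 - alpha * dW1
dW2_l2 = dW2 + lam * W2
dW1_l2 = dW1 + lam * W1

print(Z, x_hat, round(loss, 4))
print(dW2, db2, dW1, db1, sep="\n")
print(W2_new, W1_new, sep="\n")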



Text Embedding with Word2vec

Consider a sentence containing the words w1 w2 ... wn in that sequence. The words w(i−t), ..., w(i−1), w(i+1), ..., w(i+t) are used to predict the target word wi. This model is referred to as the continuous bag-of-words (CBOW) model.

Example sentence: "The cat sits on the mat"
Vocabulary (d): ["the", "cat", "sits", "on", "mat"]
Let the context size (m) = 2.
Each word is represented as a one-hot vector:
  "the"  = [1, 0, 0, 0, 0]
  "cat"  = [0, 1, 0, 0, 0]
  "sits" = [0, 0, 1, 0, 0] ...

Context → Target
  ["the", "sits"] → "cat"
  ["cat", "on"]   → "sits"
  ["sits", "mat"] → "on"

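A short Python sketch that generates such (context, target) pairs and one-hot vectors for the example sentence; the helper function below is ad hoc, not from any library.

import numpy as np

def cbow_pairs(tokens, window=1):
    """Yield (context, target) pairs taking `window` words on each side of the target."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target

sentence = "the cat sits on the mat".split()
vocab = sorted(set(sentence), key=sentence.index)          # ["the", "cat", "sits", "on", "mat"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

for context, target in cbow_pairs(sentence, window=1):     # context size m = 2 (one word per side)
    print(context, "->", target)
print(one_hot["cat"])                                       # [0. 1. 0. 0. 0.]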


Continuous Bag of Words (CBOW)

Number of hidden neurons p (say 3)
Encoder weights (shared weights) W: dimensions 5 × 3
Decoder weights V: dimensions 3 × 5
Output: softmax layer, dimensions 1 × 5
The highest probability predicts the target word.



Numerical Example (Lab 5 code)

Vocabulary: {"queen", "man", "woman", "child", "king", "prince", "princess", "throne", "palace", "royal"}
Indices: {0: "queen", 1: "man", 2: "woman", 3: "child", ..., 9: "royal"}
Parameters:
  Vocabulary size (V) = 10
  Embedding size (N) = 3
  Context size (C) = 3
  Learning rate (η) = 0.1
Training example:
  Context words: ["queen", "man", "woman"] ([0, 1, 2])
  Target word: "king" ([4])



Neural embedding with continuous bag of words



Random Encoder and Decoder Weights

Input-to-hidden weights (W): V × N
Hidden-to-output weights (W′): N × V

Step 1: Embed the Context Words
The one-hot encoded vectors for the context words (queen, man, woman) are multiplied by W:
  Embedding = W^T · one-hot(context)
  Embedding_queen = [0.1, 0.2, 0.3]
  Embedding_man   = [0.4, 0.5, 0.6]
  Embedding_woman = [0.7, 0.8, 0.9]



Forward Pass

Step 2: Compute the Hidden Layer (Average Embedding)
  h = (1/C) Σ_{i=1..C} Embedding_i
    = (1/3) ([0.1, 0.2, 0.3] + [0.4, 0.5, 0.6] + [0.7, 0.8, 0.9]) = [0.4, 0.5, 0.6]

Step 3: Compute the Scores for Each Word (Pre-Softmax)
  u = h · W′ = [0.4, 0.5, 0.6] · [[0.3, 0.5, ...], [0.7, 0.3, ...], [0.5, 0.9, ...]]
    = [0.74, 0.83, 0.74, 0.99, 0.98, 1.09, 0.58, 0.89, 1.09, 0.86]

Step 4: Apply Softmax to Compute Probabilities
  y = softmax(u),  y_i = e^{u_i} / Σ_{j=1..V} e^{u_j}
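A NumPy sketch of this forward pass. The slide shows only part of W′, so a placeholder random W′ is used here (an assumption); the embedding matrix W is filled so that its first three rows match the embeddings given above.

import numpy as np

V, N, C = 10, 3, 3                            # vocabulary size, embedding size, context size
rng = np.random.default_rng(3)

W = np.linspace(0.1, 3.0, 30).reshape(V, N)   # rows 0-2 are [0.1,0.2,0.3], [0.4,0.5,0.6], [0.7,0.8,0.9]
W_prime = rng.uniform(0.1, 1.0, size=(N, V))  # placeholder hidden-to-output weights (assumption)

context_idx = [0, 1, 2]                       # "queen", "man", "woman"

# Steps 1-2: embed the context words and average them
h = W[context_idx].mean(axis=0)               # -> [0.4, 0.5, 0.6] with this W

# Step 3: scores for every word in the vocabulary
u = h @ W_prime

# Step 4: softmax probabilities
y = np.exp(u) / np.exp(u).sum()
print(h, y.round(3), y.sum())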
Loss and Error Correction

In the CBOW model, the categorical cross-entropy loss is used:
  L = − Σ_{i=1..V} Target_i · log(y_i)
where:
• V: vocabulary size
• y_i: predicted probability for word i
• Target_i: one-hot encoded value for the target word
and y = softmax(u), with y_i = e^{u_i} / Σ_{j=1..V} e^{u_j}.



Loss and Error Correction

To compute the gradient ∂L/∂u_i, we use the chain rule:
  ∂L/∂u_i = (∂L/∂y_i) · (∂y_i/∂u_i)
The derivative of the loss with respect to y_i (for i = t, the target word) is:
  ∂L/∂y_i = − Target_i / y_i
The derivative of the softmax function, for i = t, is:
  ∂y_i/∂u_i = y_i (1 − y_i)
Combining the terms (and using the off-diagonal softmax derivatives ∂y_i/∂u_j = −y_i y_j for the outputs with i ≠ t), the final derivative for the i-th output is:
  ∂L/∂u_i = y_i − target_i = error
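A small numerical check of this result on arbitrary illustrative scores: the analytic gradient y − target of the softmax cross-entropy loss is compared against a finite-difference estimate.

import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def loss(u, target):
    return -np.sum(target * np.log(softmax(u)))

rng = np.random.default_rng(4)
u = rng.normal(size=10)               # illustrative pre-softmax scores
target = np.eye(10)[4]                # one-hot target, e.g. "king" at index 4

analytic = softmax(u) - target        # dL/du = y - target

# Finite-difference estimate of the same gradient
eps, numeric = 1e-6, np.zeros_like(u)
for i in range(len(u)):
    d = np.zeros_like(u); d[i] = eps
    numeric[i] = (loss(u + d, target) - loss(u - d, target)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True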
Loss and Backward Propagation

Target word "king" (index 4): T = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
Output probabilities: y = [0.085, 0.093, 0.085, 0.121, 0.12, 0.133, 0.072, 0.103, 0.133, 0.106]

Backward Propagation
Step 1: Compute the Error at the Output Layer
  e = y − T = [0.085, 0.093, 0.085, 0.121, −0.88, 0.133, 0.072, 0.103, 0.133, 0.106]
Step 2: Compute the Gradients for W′
  ∇W′ = h^T · error = [0.4, 0.5, 0.6]^T · [0.085, 0.093, 0.085, 0.121, −0.88, 0.133, 0.072, 0.103, 0.133, 0.106]
Backward Pass

Step 2: Compute the Gradients for W′ (outer product)
  ∇W′ = h^T · error =
    [0.4 × 0.085, 0.4 × 0.093, ...]
    [0.5 × 0.085, 0.5 × 0.093, ...]
    [0.6 × 0.085, 0.6 × 0.093, ...]

Step 3: Propagate the Error Back to the Hidden Layer
  δh = W′ · error^T
Backward Pass

Step 3: Propagate the Error Back to the Hidden Layer
  δh = [0.0535, 0.0726, 0.0618]
Step 4: Compute the Gradients for W
  ∇W = (1/C) Σ_{i=1..C} δh



Backward Pass

Step 4: Compute the Gradients for W
Each context word contributes equally to W, so the gradient for each word's embedding is simply δh scaled by the context size C:
  ∇W_queen = ∇W_man = ∇W_woman = (1/3) [0.0535, 0.0726, 0.0618] = [0.01783, 0.0242, 0.0206]
Step 5: Update the Weights W
  W_new = W_old − η ∇W, with learning rate η = 0.1
  W_queen = W_queen − η ∇W_queen = [0.1, 0.2, 0.3] − 0.1 · [0.01783, 0.0242, 0.0206] = [0.09822, 0.19758, 0.29794]



Backward Pass

Step 5: Update the Weights W (continued)
  W_man = W_man − η ∇W_man = [0.4, 0.5, 0.6] − 0.1 · [0.01783, 0.0242, 0.0206] = [0.39822, 0.49758, 0.59794]
  W_woman = W_woman − η ∇W_woman = [0.7, 0.8, 0.9] − 0.1 · [0.01783, 0.0242, 0.0206] = [0.69822, 0.79758, 0.89794]
The embeddings of the other words remain unchanged.
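A NumPy sketch of Steps 1-5 above. The slide does not show the full W′ matrix, so δh is taken directly from the slide's stated value rather than recomputed, and the unspecified rows of W are filled with placeholder values; everything else follows the steps as written.

import numpy as np

eta, C = 0.1, 3
h = np.array([0.4, 0.5, 0.6])
y = np.array([0.085, 0.093, 0.085, 0.121, 0.12, 0.133, 0.072, 0.103, 0.133, 0.106])
T = np.eye(10)[4]                              # one-hot target "king"

# Step 1: output-layer error
error = y - T

# Step 2: gradient for the hidden-to-output weights (outer product, shape N x V)
grad_W_prime = np.outer(h, error)

# Step 3: error at the hidden layer (taken from the slide; the full W' would be needed to recompute it)
delta_h = np.array([0.0535, 0.0726, 0.0618])

# Step 4: each context word's embedding receives delta_h / C
grad_embed = delta_h / C                       # [0.01783, 0.0242, 0.0206]

# Step 5: update the context-word embeddings
W = np.linspace(0.1, 3.0, 30).reshape(10, 3)   # rows 0,1,2 are the queen/man/woman embeddings
for idx in [0, 1, 2]:
    W[idx] = W[idx] - eta * grad_embed

print(grad_embed.round(5))
print(W[[0, 1, 2]].round(5))                   # matches the updated embeddings above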



Skip-gram Model

The skip-gram model learns word embeddings by predicting the context words given a target word.
For w = 2, the context words are the 2 words before and after the target word.
Vocabulary: {"We", "love", "machine", "learning"}
Input: one-hot vector of the target word.
Output: probabilities of the context words.
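A short sketch generating the skip-gram (target, context) training pairs for this example with w = 2; the helper function is ad hoc, not from any library.

def skipgram_pairs(tokens, w=2):
    """Return (target, context_word) pairs for every word within w positions of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - w), min(len(tokens), i + w + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["We", "love", "machine", "learning"]
for target, context in skipgram_pairs(sentence, w=2):
    print(target, "->", context)   # e.g. "love" -> "We", "love" -> "machine", ...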



Sentiment Classification using Word2Vec

▪ Gather a labeled dataset for sentiment analysis (e.g., positive and negative sentiments).
▪ Use a pre-trained Word2Vec model (e.g., Gensim's pre-trained Word2Vec, Google Word2Vec). Convert sentences into feature vectors by averaging the Word2Vec embeddings of the words in each sentence.
▪ Use a machine learning model such as logistic regression, random forest, or a neural network.
▪ Split the data into training and test sets, and evaluate the model using metrics such as accuracy. A sketch of this pipeline follows.
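A hedged sketch of this pipeline using Gensim and scikit-learn. The toy dataset, the choice to train a small Word2Vec model in place of a large pre-trained one, and all parameter values are illustrative assumptions.

import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy labeled data (assumption); 1 = positive, 0 = negative
texts = ["i love this movie", "great acting and story", "what a wonderful film",
         "i hate this movie", "terrible plot and acting", "what a boring film"]
labels = [1, 1, 1, 0, 0, 0]
tokenized = [t.split() for t in texts]

# Train a small Word2Vec model here (a large pre-trained model could be loaded instead)
w2v = Word2Vec(tokenized, vector_size=50, window=2, min_count=1, seed=0)

def sentence_vector(tokens, model):
    # Average the embeddings of the words present in the model's vocabulary
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.array([sentence_vector(t, w2v) for t in tokenized])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))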
Applications of Word Embedding Vectors

Text Classification: classifying customer reviews as positive or negative using sentiment-based embeddings.
Machine Translation: using embeddings in neural machine translation systems like Google Translate to improve language understanding.
Named Entity Recognition (NER): extracting "John" as a person and "New York" as a location from sentences.
Semantic Search and Information Retrieval: retrieving relevant documents for the query "car" by matching it with related terms like "vehicle" or "automobile."
Text Generation: using embeddings in models like GPT to generate human-like responses in conversational AI systems.



Thank You
