Traditional Neural Networks (TNNs) - Simple Explanation
What Are Traditional Neural Networks?
It’s like a big “function” that takes some inputs, processes them
through layers, and gives an output.
They work best for problems where the input and output have a fixed
size.
Examples: classifying an image, predicting a house price.
Input Layer: Takes in the raw data (e.g., pixel values of an image,
or features of a house).
Key points:
Simple flow: Data flows straight from input to output without loops
or cycles.
Analogy:
Imagine a pipeline where raw materials (input data) enter one end, get processed stage by stage (the layers), and a finished product (the prediction) comes out the other end.
Key building blocks:
1. Neurons (Nodes)
Small units that take in numbers, do a tiny calculation, and pass the result to the next layer.
2. Weights
Each connection between neurons has a weight — a number that
controls how important that input is.
3. Bias
An extra number added to the weighted sum so a neuron can shift its output up or down.
4. Weighted Sum
Each neuron multiplies its inputs by their weights, adds them together, and adds the bias: z = w1·x1 + w2·x2 + ... + b.
5. Activation Function
A non-linear function applied to the weighted sum, e.g., ReLU: f(z) = max(0, z).
Imagine you want to classify apples and bananas based on their shape and color.
If we use a linear function, it can only separate them using a straight line.
But real-world data is often more complex (e.g., overlapping colors, different lighting).
By adding a non-linear activation function (like ReLU, Sigmoid, or Tanh), the network can
create curved decision boundaries to separate them correctly.
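To make the weighted sum and activation concrete, here is a tiny sketch; the numbers, feature names, and use of NumPy are illustrative assumptions, not from the notes.

```python
import numpy as np

x = np.array([0.8, 0.2])   # toy inputs, e.g., "roundness" and "yellowness"
w = np.array([1.5, -2.0])  # weights: how important each input is
b = 0.5                    # bias shifts the result up or down

z = np.dot(w, x) + b       # weighted sum: z = w1*x1 + w2*x2 + b ≈ 1.3
a = max(0.0, z)            # ReLU activation keeps positives, zeroes out negatives
print(z, a)                # roughly 1.3 and 1.3
```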
6. Layers
Neurons are organized into an input layer, one or more hidden layers, and an output layer.
7. Forward Propagation
Data flows through the layers from input to output to produce a prediction.
8. Loss Function
Measures how far the network’s output is from the correct answer.
This process repeats for many iterations (epochs) until the network
learns to predict well.
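A hedged sketch of the whole training loop described above (forward propagation, loss, backward pass, repeat over epochs), assuming PyTorch; the layer sizes and the random data are made up.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 16),   # input layer -> hidden layer (4 features, e.g., house data)
    nn.ReLU(),          # non-linear activation
    nn.Linear(16, 1),   # hidden layer -> single output (e.g., price)
)
loss_fn = nn.MSELoss()                                  # how far predictions are from targets
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(32, 4)          # fake batch of 32 examples
y = torch.randn(32, 1)          # fake targets

for epoch in range(100):        # repeat for many epochs
    pred = model(X)             # forward propagation
    loss = loss_fn(pred, y)     # measure the error
    optimizer.zero_grad()
    loss.backward()             # send the error backwards
    optimizer.step()            # adjust weights and biases
```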
RNNs (Recurrent Neural Networks) build on these same building blocks, but they are
designed to process sequences of data step by step,
remembering previous information to understand the context.
They come in a few common shapes:
o Vector-to-sequence: input is a single vector, output is a sequence.
o Sequence-to-vector: input is a sequence, output is a single vector (e.g., summarizing a sentence for classification).
o Sequence-to-sequence: input and output are both sequences (e.g., translation).
Summary:
They have "memory" that carries info forward as they process one
step at a time.
1. Purpose:
RNNs are designed to process sequences — like sentences, time
series, or any data where order matters.
2. Key Idea:
Unlike traditional neural nets, RNNs have a memory of what
happened before. They take the current input and remember
information from previous steps.
3. How it works:
o At time step t, the RNN takes the current input x_t (like the
current word in a sentence) and combines it with the previous hidden state h_{t-1}.
4. Mathematically:
h_t = tanh(W_x · x_t + W_h · h_{t-1} + b), where:
o W_x and W_h are weight matrices,
o b is the bias.
5. Output:
At each step, the RNN can output a prediction based on the current
hidden state.
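As a sketch of the math above, here is one RNN step in NumPy; the dimensions (input size 3, hidden size 5) and random weights are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(5, 3))   # input-to-hidden weights
W_h = rng.normal(size=(5, 5))   # hidden-to-hidden weights (the "memory" connection)
b = np.zeros(5)                 # bias

def rnn_step(x_t, h_prev):
    # combine the current input with the previous hidden state
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(5)                                     # start with an empty memory
sequence = [rng.normal(size=3) for _ in range(4)]   # 4 time steps of fake input
for x_t in sequence:
    h = rnn_step(x_t, h)                            # h carries info forward step by step
```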
Why do they work for sequences? Because they:
o process inputs in order, one step at a time,
o carry a hidden state ("memory") forward from step to step.
However, plain RNNs struggle to remember information over long sequences (the vanishing-gradient problem). LSTMs (Long Short-Term Memory networks) fix this with gates. Here is the step-by-step logic:
You give a sentence to the LSTM one word at a time, and at each step it:
1. Forget Gate
It looks at the current word (x_t) and what it remembers from before
(h_{t-1}), and decides which parts of the old memory to erase.
2. Input Gate
It decides which parts of the new input to actually keep and store.
3. Cell State Update
It combines what it kept from the past with the new info it chose to
store.
✅ Result: The cell now has an updated, smarter memory of the sentence
so far.
4. Output Gate
It decides what part of the updated memory to output as the hidden state (h_t) for this step.
🔁 This Repeats
It does this for each word, updating and remembering better as it goes.
🎯 Why It Works
The gates let the network keep what matters and drop what doesn't, even over long sentences.
🧠 Memory Recap
Cell state = long-term memory; hidden state = the short-term output at each step.
💡 Example Use-Cases
Sentiment analysis, machine translation, speech recognition, time-series forecasting.
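A minimal sketch of the word-by-word loop above, assuming PyTorch's LSTMCell; the vocabulary size, embedding size, hidden size, and token IDs are made-up examples.

```python
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=1000, embedding_dim=16)  # word -> vector
cell = nn.LSTMCell(input_size=16, hidden_size=32)

token_ids = torch.tensor([4, 27, 311, 9])   # pretend: "the movie was amazing"
h = torch.zeros(1, 32)                      # hidden state (short-term output)
c = torch.zeros(1, 32)                      # cell state (long-term memory)

for tid in token_ids:                       # one word per step
    x_t = embed(tid).unsqueeze(0)           # shape (1, 16)
    h, c = cell(x_t, (h, c))                # the gates update both memories
# h now summarizes the sentence so far
```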
After the LSTM has processed the whole sentence, here's
what happens next in a typical setup:
1. Get the final hidden state
Once the LSTM has processed every word one by one, it ends up
with a final hidden state (h_t from the last time step).
2. Make a prediction
The final hidden state is usually passed to a small output layer (e.g., a dense layer with softmax) to produce the prediction.
Other tasks: the output could instead be fed into other layers or models depending on
what you want to achieve.
3. Calculate Loss
The prediction is compared with the correct answer to measure how wrong it was.
4. Backpropagation Through Time
The error is sent backwards through the LSTM layers across all
time steps.
The model adjusts its weights and biases to reduce the loss —
basically “learning” from its mistakes.
5. Repeat
The whole cycle runs again over many batches and epochs.
Summary:
o Process the full sentence → get the final hidden state(s) summarizing the input.
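Putting that pipeline together (process the sentence, take the final hidden state, predict, compute the loss, backpropagate), here is a hedged PyTorch sketch with invented sizes and labels.

```python
import torch
import torch.nn as nn

embed = nn.Embedding(1000, 16)
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
classifier = nn.Linear(32, 2)             # e.g., positive vs. negative
loss_fn = nn.CrossEntropyLoss()

tokens = torch.tensor([[4, 27, 311, 9]])  # one sentence, batch size 1
label = torch.tensor([1])                 # pretend "positive"

outputs, (h_n, c_n) = lstm(embed(tokens)) # h_n holds the final hidden state(s)
logits = classifier(h_n[-1])              # 2. make a prediction from the last state
loss = loss_fn(logits, label)             # 3. calculate loss
loss.backward()                           # 4. send the error back through time
# an optimizer.step() would then adjust the weights (5. repeat over many epochs)
```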
What is a GRU?
A GRU (Gated Recurrent Unit) is a lighter cousin of the LSTM: it also uses gates to manage its memory, but with fewer parts.
Step 1: Receive the Inputs
At each time step the GRU receives:
o The current input x_t (e.g., the current word).
o The hidden state from the previous time step (summary of all
past inputs).
Step 2: Reset Gate
The reset gate decides how much past info to forget when
processing the current input.
This helps the GRU decide if the previous info is relevant for the
current step.
Step 3: Candidate Hidden State (ĥ)
This candidate contains new info from the current input mixed with
relevant past info.
Step 4: Update Gate
The update gate now decides how much of the candidate hidden
state (new info) and how much of the previous hidden state
(old info) to keep.
Step 5: Final Hidden State
The final hidden state is a blend of old and new info, controlled by the
update gate.
Step 6: Move to Next Time Step
The final hidden state becomes the previous hidden state for the
next word.
Why GRU?
It’s simpler than LSTM because it has two gates instead of three.
Great for sequence tasks where speed matters but you still want
good memory.
Say you want to know if “The movie was amazing” is positive or negative:
After the last word, the final hidden state summarizes the whole
sentence sentiment.
Summary Table:
o Candidate state (ĥ): new info combined with the reset past state.
o Move to next word: repeat for the next input in the sequence.
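To mirror the six steps above, here is a minimal NumPy sketch of one GRU step; the weight shapes, the concatenation layout, and the sigmoid helper are illustrative assumptions (biases are omitted for brevity).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    xh = np.concatenate([x_t, h_prev])            # step 1: current input + past summary
    r = sigmoid(W_r @ xh)                         # step 2: reset gate (how much past to forget)
    z = sigmoid(W_z @ xh)                         # step 4: update gate (old vs. new info)
    h_cand = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]))  # step 3: candidate state ĥ
    return (1 - z) * h_prev + z * h_cand          # step 5: blend of old and new info

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W_r = rng.normal(size=(d_h, d_in + d_h))
W_z = rng.normal(size=(d_h, d_in + d_h))
W_h = rng.normal(size=(d_h, d_in + d_h))

h = np.zeros(d_h)
for x_t in [rng.normal(size=d_in) for _ in range(5)]:
    h = gru_step(x_t, h, W_r, W_z, W_h)           # step 6: move to the next time step
```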
Even though CNNs were made for images, they can be adapted for
sequences, like sentences or time series, by treating the input as a 1D
sequence instead of a 2D image.
CNN applies filters that slide over small groups of words (like 2-3
words at a time) — these are called n-grams.
CNNs are fast and parallelizable — unlike RNNs, they don’t have
to process word-by-word sequentially.
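A short sketch of sliding filters over word windows, assuming PyTorch's Conv1d; the vocabulary, embedding size, number of filters, and token IDs are placeholders.

```python
import torch
import torch.nn as nn

embed = nn.Embedding(1000, 16)                   # word -> 16-dim vector
conv = nn.Conv1d(in_channels=16, out_channels=8, kernel_size=3)  # 8 filters over 3-word windows

tokens = torch.tensor([[4, 27, 311, 9, 52, 7]])  # one sentence of 6 words
x = embed(tokens).transpose(1, 2)                # (batch, channels=16, seq_len=6)
features = torch.relu(conv(x))                   # (batch, 8, 4): one value per 3-word window
pooled = features.max(dim=2).values              # keep the strongest signal per filter
```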
Summary:
o Transformer: captures long-range dependencies and processes tokens in parallel, but needs more compute.
Activation (e.g., ReLU)
This helps the network learn complex patterns beyond simple linear
combinations.
Max Pooling
This takes small regions (like 2x2) of the feature map and keeps
only the maximum value.
o Decreases computation
6. Flattening
The 2D feature maps are flattened into one long vector. This vector contains all the features extracted from the image, now
ready for classification.
7. Fully Connected Layer(s)
One or more dense layers combine the extracted features.
8. Output Layer
The final layer is a fully connected layer with neurons equal to the
number of classes (e.g., 10 for CIFAR-10).
Summary flow:
Image → Convolution → Activation → Pooling → Flatten → Fully connected → Output
Quick example:
Convolution layers detect edges and textures, pooling keeps the strongest signals, and the fully connected layers combine all features and say, "This looks like
a cat!" with 95% confidence.
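A hedged sketch of that image pipeline for CIFAR-10-sized inputs (3×32×32, 10 classes), assuming PyTorch; the channel counts and layer count are illustrative choices, not a prescribed architecture.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: detect edges/textures
    nn.ReLU(),                                   # non-linear activation
    nn.MaxPool2d(2),                             # 2x2 max pooling -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # -> 8x8
    nn.Flatten(),                                # flatten feature maps into one vector
    nn.Linear(32 * 8 * 8, 128),                  # fully connected layer combines features
    nn.ReLU(),
    nn.Linear(128, 10),                          # output layer: one score per class
)

images = torch.randn(4, 3, 32, 32)               # fake batch of 4 images
probs = torch.softmax(model(images), dim=1)      # e.g., "cat" with 95% confidence
```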
Convert tokens into token IDs (numbers), e.g., [101, 2203, 2017]
❓Why?
Neural networks work with numbers, not text. Each ID is then looked up in an embedding table and turned into a vector.
🔹 A. Multi-Head Self-Attention
Key idea: each word looks at all other words and decides what's
important.
Example: in "I love you", "love" might pay more attention to "you"
than "I".
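Here is a minimal NumPy sketch of (single-head) scaled dot-product self-attention; the multi-head version simply runs several of these in parallel and combines the results. The sizes and random matrices are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(3, d))        # embeddings for "I", "love", "you"
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d)      # how much each word attends to every other word
weights = softmax(scores, axis=-1) # e.g., "love" may weight "you" more than "I"
output = weights @ V               # context-aware representation of each word
```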
Each encoder layer contains:
o Multi-head self-attention
o Feedforward net
✅ You stack multiple such layers (e.g., 6 or 12) to extract deeper features.
🔹 A. Masked Self-Attention
Each position can only attend to the words generated so far; the future is masked out.
🔹 B. Encoder-Decoder Attention
The decoder looks at the encoder's output to pull in information from the input sentence.
🔹 C. Feedforward Layer
A small fully connected network applied to each position.
Decoder outputs a vector for each position (like the word “Je” or
“t’aime”).
These go through:
o a linear layer that maps each vector to scores over the vocabulary,
o a softmax that turns the scores into probabilities for the next word.
Example:
Input: "I love you"
Output: "Je" → "t’" → "aime" → <eos>
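A small sketch of how those decoder vectors become words (linear layer, then softmax, then pick the most likely token), assuming PyTorch; the tiny vocabulary and random decoder vectors are hypothetical.

```python
import torch
import torch.nn as nn

vocab = ["<eos>", "Je", "t'", "aime"]             # made-up target vocabulary
to_vocab = nn.Linear(16, len(vocab))              # linear projection to vocabulary scores

decoder_vectors = torch.randn(3, 16)              # pretend output vectors for 3 positions
logits = to_vocab(decoder_vectors)                # scores over the vocabulary
probs = torch.softmax(logits, dim=-1)             # probabilities per position
words = [vocab[i] for i in probs.argmax(dim=-1).tolist()]  # most likely word each step
```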
🧠 Summary Flow:
Input tokens → embeddings → encoder (self-attention + feedforward) → decoder (masked self-attention + encoder-decoder attention + feedforward) → linear + softmax → output tokens.
🤖 What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a stack of Transformer encoders pretrained on large amounts of text.
🧠 Core Idea
It reads the whole sentence in both directions at once and learns by predicting words that have been hidden (masked).
Example:
👉 Input sentence:
[CLS] The cat sat on the mat [SEP]
2. Tokenization
Example:
The sentence is split into WordPiece tokens, e.g., [CLS], the, cat, sat, on, the, mat, [SEP].
3. Embedding Layer
Each token gets a token embedding, a position embedding, and a segment embedding.
🔁 All these get added together and passed into the Transformer encoder.
4. Transformer Encoder
Each encoder layer applies:
o Multi-head self-attention
o Feedforward layer
5. Pretraining: Masked Language Modeling
The model predicts the missing word using context from both sides.
Example:
The cat sat on the [MASK] → Predict: mat
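As a hedged illustration of masked-word prediction, this sketch uses the Hugging Face transformers library (assumed to be installed); "bert-base-uncased" is one commonly used checkpoint.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for guess in fill("The cat sat on the [MASK].")[:3]:
    print(guess["token_str"], round(guess["score"], 3))  # top guesses, e.g., "mat"
```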
6. Fine-tuning for downstream tasks
You can feed it one or two sentences for tasks like:
o Sentence classification (e.g., sentiment)
o Sentence similarity
o etc.
🔁 Just add a simple output layer (like Softmax), and fine-tune on your
specific data.
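A sketch of that fine-tuning setup, assuming the Hugging Face transformers library: pretrained BERT plus a small classification head, trained on your own labels.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2      # e.g., positive vs. negative
)

inputs = tokenizer("The movie was amazing", return_tensors="pt")
labels = torch.tensor([1])                 # pretend "positive"
outputs = model(**inputs, labels=labels)   # returns loss and logits
outputs.loss.backward()                    # fine-tune on your specific data
```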
TL;DR Summary (Step → What Happens):
o Tokenize: split the text into tokens and add [CLS] / [SEP].
o Embed: add token + position + segment embeddings.
o Encode: pass everything through stacked self-attention + feedforward layers.
o Pretrain: predict masked words using context from both sides.
o Fine-tune: add a simple output layer for your specific task.