Deep Learning
MODULE 1
• In today's fast-growing AI world, deep learning is a crucial technology changing how
machines process and understand data.
• Deep learning mimics how the human brain works by using artificial neural networks to help
computers recognize patterns and make decisions from large amounts of data.
• This technology has led to major improvements in areas like image recognition, language
understanding, healthcare, and self-driving cars.
• Deep learning is changing many industries, expanding what AI can do and pushing the limits
of technology.
• It’s helping create intelligent systems that can see, understand, and even come up with new
ideas on their own.
BIOLOGICAL NEURAL NETWORK
• A biological neural network refers to the
network of neurons (nerve cells) in the brain
and nervous system that work together to
process information
Speech Recognition:
• A major example of a transcription task is speech recognition.
• In this task, the computer is given an audio recording (like
someone talking) and is asked to convert the spoken words
into written text.
• The audio recording is just a continuous waveform of sound,
and the program’s job is to break it down and figure out what
words are being said.
• For example, if someone says "hello," the system listens to the
sound, understands it, and outputs the word "hello" as text.
• Speech Recognition in Use: This technology is used by major
companies like Microsoft, IBM, and Google in systems such
as virtual assistants (e.g., Google Assistant, Siri, Cortana).
When you speak a command like "What’s the weather today?"
the system transcribes your voice into text so that it can
process the command.
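• A minimal speech-to-text sketch in Python (an illustrative assumption: the third-party speech_recognition package is installed and a hypothetical file hello.wav exists; this is not the system used by the companies above):

```python
# Minimal speech-to-text sketch (assumes the third-party
# "speech_recognition" package and a hypothetical file "hello.wav").
import speech_recognition as sr

recognizer = sr.Recognizer()

# Load the raw audio waveform from a WAV file.
with sr.AudioFile("hello.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

# Send the waveform to a recognition backend and get text back.
# recognize_google calls a free web API, so this needs internet access.
try:
    print("Transcription:", recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand the audio")
```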
• Why is Transcription Important?
• Transcription tasks are essential in many real-world applications because
they help turn raw, unstructured data (like images or sounds) into something
usable (like text).
• For example, being able to automatically convert a scanned document into text saves
time and effort compared to manually typing it out.
• Similarly, speech recognition allows people to interact with computers or devices
using their voice, which is more natural than typing in many situations.
Machine translation
• Machine translation is a machine learning task where the computer takes
text in one language (like English) and converts it into text in another language
(like French).
• The input is a sequence of symbols (like letters or words) in one language,
and the output is a sequence of symbols in a different language.
• An example is Google Translate, which takes a sentence in English and gives
you the equivalent sentence in French.
• Deep learning has made a big impact in this area.
• Deep learning uses complex models (like neural networks) that can handle
sequences of words and learn the meaning of sentences.
• Traditional translation systems used rules or statistical methods, but deep
learning systems learn from large amounts of data to provide more accurate
and fluent translations.
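• A minimal machine-translation sketch (assuming the Hugging Face transformers package is installed; the first call downloads a default pretrained English-to-French model):

```python
# English-to-French translation with a pretrained neural model
# (assumes the Hugging Face "transformers" package is installed).
from transformers import pipeline

translator = pipeline("translation_en_to_fr")  # downloads a default model on first use

result = translator("Deep learning has made a big impact on machine translation.")
print(result[0]["translation_text"])  # the French version of the sentence
```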
Structured output task
• Structured output tasks are a broader category of machine learning tasks
where the computer is asked to produce a set of values that are related to
each other.
• This could be something like a sequence of words (for example, a sentence)
or a structure like a tree
How is Structured Output Different from Simple Output?
• In simple output tasks (like regression or classification), the computer
gives a single number or category as the answer.
• In structured output tasks, the computer needs to give a set of related
answers. For example:
• In machine translation, the output is a sequence of words that form a sentence.
• In other tasks, the output might be more complex structures like a tree (which
represents how words in a sentence are grammatically related).
• Parsing Sentences:
• One example of a structured output task is parsing. In parsing, the computer takes
a sentence in natural language (like English) and breaks it down into its
grammatical parts (like nouns, verbs, etc.).
• The output is a tree structure that shows how the words in the sentence are
connected grammatically. For example, in the sentence “The cat sat on the mat,”
the computer would identify "The cat" as a noun phrase and "sat on the mat" as a
verb phrase.
• Deep learning has been used to improve parsing by making the system better at
understanding sentence structure.
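• A small parsing sketch (assuming spaCy and its small English model are installed; it produces a dependency parse rather than a full constituency tree, but it illustrates the structured, tree-like output):

```python
# Structured output sketch: parse a sentence into grammatical structure.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

# Each word is attached to a head word, forming a dependency tree.
for token in doc:
    print(f"{token.text:<5} --{token.dep_}--> {token.head.text}")

# Noun phrases such as "The cat" and "the mat" are grouped automatically.
print([chunk.text for chunk in doc.noun_chunks])
```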
• Image Segmentation:
• Another example is pixel-wise image segmentation, where the computer looks at
an image and labels each pixel (the tiny dots that make up the image) with a
category.
• For instance, if the image is an aerial photo of a city, the program might label some
pixels as roads, others as buildings, and others as trees.
• Deep learning helps with this by using convolutional neural networks (a type of
deep learning model designed for images) to process the image and understand
the different parts.
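• A toy pixel-wise segmentation sketch (assuming PyTorch; a tiny fully-convolutional network on a random image stands in for the much larger models used in practice):

```python
# Toy pixel-wise segmentation sketch (assumes PyTorch).
# A small fully-convolutional network assigns a class label to every pixel.
import torch
import torch.nn as nn

num_classes = 3  # hypothetical classes: road, building, tree

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, num_classes, kernel_size=1),  # per-pixel class scores
)

image = torch.rand(1, 3, 64, 64)   # one random 64x64 RGB "aerial photo"
scores = model(image)              # shape: (1, num_classes, 64, 64)
labels = scores.argmax(dim=1)      # one class label per pixel
print(labels.shape)                # torch.Size([1, 64, 64])
```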
Anomaly Detection
• Anomaly detection is a task where a computer program looks through a set
of events or objects and tries to find things that don’t fit the normal pattern or
are unusual. These “unusual” things are called anomalies.
• Credit Card Fraud Detection:
• One common example is how credit card companies use anomaly detection
to spot fraud.
• The credit card company has information about your purchasing habits—for example,
the kinds of stores you usually shop at, how much you typically spend, and where you
live.
• If someone steals your credit card or the card details and tries to use it for purchases
that are different from your usual pattern—like spending large amounts of money in
a different country—the company can detect this as anomalous behavior.
• When the system detects a purchase that doesn’t match your typical spending, it
might flag the transaction as suspicious and temporarily stop the card from working to
prevent further fraud.
• How it Works:
• The computer builds a model of what “normal” looks like for each person or
object by looking at past data.
• If something happens that looks very different from that model, it’s flagged as
an anomaly.
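• A toy sketch of this idea (NumPy only; the spending history and the 3-standard-deviation threshold are illustrative assumptions, not how a real fraud system works):

```python
# Toy anomaly detection: model "normal" spending from past transactions,
# then flag new amounts that fall far outside that model.
import numpy as np

past_amounts = np.array([12.5, 30.0, 18.2, 25.9, 22.4, 15.0, 27.3])  # hypothetical history
mean, std = past_amounts.mean(), past_amounts.std()

def is_anomalous(amount, threshold=3.0):
    """Flag a transaction more than `threshold` standard deviations
    away from the customer's usual spending."""
    return abs(amount - mean) / std > threshold

print(is_anomalous(24.0))    # False: close to typical spending
print(is_anomalous(950.0))   # True: flagged as possible fraud
```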
• Why It’s Useful:
• Anomaly detection is helpful for tasks like fraud detection, error detection
in systems, or finding unusual events in medical data. Anytime you need to
find something that stands out from the norm, anomaly detection can be
used.
Synthesis and sampling
• Synthesis and sampling is a task where a computer program is asked to
create new examples that are similar to the ones it has seen during training.
• The program doesn’t just repeat what it has seen before but generates new
data that follows similar patterns to the training data.
Generating Textures for Video Games:
• In video games, artists usually need to create textures (like the appearance of
grass, mountains, or walls). If they had to manually create textures for every
surface in a large game world, it would take a lot of time.
• Machine learning can generate textures automatically, using patterns that it has
learned from existing textures. This helps reduce the work artists need to do and can
save time and resources.
Speech Synthesis:
• Another example is speech synthesis, where the program is given a written
sentence (like “Hello, how are you?”) and is asked to produce a spoken
version of that sentence.
• The machine learns to generate a realistic-sounding audio waveform that sounds like a
person speaking the sentence.
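• A minimal synthesis-and-sampling sketch (assuming NumPy and SciPy; a simple density model stands in for the deep generative models used for real textures or speech):

```python
# Fit a simple density model to training examples, then draw brand-new
# samples that follow the same pattern (they are not copies of the data).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
training_data = rng.normal(loc=5.0, scale=1.0, size=200)  # hypothetical 1-D training set

model = gaussian_kde(training_data)   # learn the pattern of the data
new_samples = model.resample(5)       # generate 5 new, similar examples
print(new_samples.round(2))
```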
Imputation of missing values
• What it means: In this task, the machine learning algorithm
is given a set of data, but some parts of the data are
missing. The algorithm’s job is to predict the missing values.
Example:
• Imagine you have a table with information about people—
age, height, and weight. But for some people, the height is
missing. The machine learning algorithm would try to guess
(or impute) the missing heights based on the available data.
How it works:
• The algorithm looks at the patterns in the data and uses these patterns to fill
in the missing values. For example, if taller people usually weigh more, and
the weight is known, the algorithm might guess a taller height for someone
with a higher weight.
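• A toy imputation sketch (NumPy only; the data and the simple linear relationship are illustrative assumptions):

```python
# Guess (impute) missing heights from weights using a fitted linear relationship.
import numpy as np

heights = np.array([160.0, 175.0, np.nan, 182.0, np.nan])  # cm, some values missing
weights = np.array([55.0, 72.0, 64.0, 80.0, 90.0])          # kg, all observed

known = ~np.isnan(heights)

# Fit height ~ weight on the complete rows only.
slope, intercept = np.polyfit(weights[known], heights[known], deg=1)

# Fill in the missing heights using that relationship.
heights[~known] = slope * weights[~known] + intercept
print(heights.round(1))
```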
Why it’s important:
• Missing data happens often in real-world scenarios, especially in fields like
medicine or finance, where not every measurement is always available.
Being able to accurately fill in the gaps is crucial for making good predictions
or analyses.
Denoising
• In this type of task, the machine learning algorithm is given a corrupted example x̃ ∈ R^n obtained by an unknown corruption process from a clean example x ∈ R^n. The learner must predict the clean example x from its corrupted version x̃, or more generally predict the conditional probability distribution p(x | x̃).
• In denoising tasks, the machine learning algorithm is given a corrupted or
noisy version of data, and it must try to restore the data back to its original,
clean state.
Example:
• Imagine you have a blurry, noisy photo and you want to clean it up so that the
details are visible again. The machine learning algorithm would try to take
this noisy image and recover the clean image.
• Another example is when you have a corrupted audio recording and you
want to recover the original, clear sound.
How it works:
• The algorithm learns how the noise or corruption usually affects the data. It then
tries to reverse this effect to bring back the original data, or at least predict what
the original data might have looked like.
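• A toy denoising sketch (assuming PyTorch; a small network is trained to map corrupted examples x̃ back to the clean examples x):

```python
# Learn to predict the clean example x from its corrupted version x_tilde.
import torch
import torch.nn as nn

torch.manual_seed(0)
clean = torch.rand(512, 10)                 # clean examples x (synthetic)
noisy = clean + 0.1 * torch.randn(512, 10)  # corrupted versions x_tilde

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), clean)  # distance between output and clean x
    loss.backward()
    optimizer.step()

print("final reconstruction error:", loss.item())
```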
Why it’s important:
• This kind of task is useful in areas like image and audio processing, where data
often gets corrupted by noise (e.g., static in an audio recording, or low resolution in
an image). By removing the noise, we can improve the quality of the data.
Density estimation or probability mass function
estimation
• In this task, the machine learning algorithm tries to learn the underlying
probability distribution of the data. It needs to figure out how likely different
examples are to occur in the data.
• Example:
• Let’s say you have a dataset of people’s heights. The algorithm would try to
learn the distribution of heights, such as how common it is for someone to
be 5 feet tall, 6 feet tall, and so on.
How it works:
• The machine learning algorithm creates a probability model that tells us
how likely different values (or combinations of values) are. For example, it
might tell us that a person being 5 feet tall is much more common than being
7 feet tall.
• In a more complex case, if you have multiple features (like height, weight, and
age), the algorithm would learn how these features are related to each other
and how likely different combinations of them are.
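• A minimal density-estimation sketch (assuming NumPy and SciPy; a Gaussian is fit to hypothetical heights, so the exact numbers are illustrative):

```python
# Fit a Gaussian to a sample of heights (in feet) and ask how likely
# different heights are under the learned distribution.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
heights = rng.normal(loc=5.6, scale=0.3, size=1000)  # hypothetical data

mu, sigma = norm.fit(heights)     # estimate the distribution's parameters

# Density at 5 ft vs. 7 ft: 5 ft is far more probable than 7 ft.
print("p(5 ft) ~", norm.pdf(5.0, mu, sigma))
print("p(7 ft) ~", norm.pdf(7.0, mu, sigma))
```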
Performance Measure, P
The Performance Measure, P is a way to check how well a machine
learning algorithm is doing at solving a particular task. Just like in
school, we give students tests and grades to measure how well they
understand the material, in machine learning, we need to measure
the algorithm’s success in a similar way.
Why It’s Important
Every machine learning algorithm is created to perform a specific
task, like classification, transcription, or regression. To know how
well the algorithm is working, we must quantify its performance.
This means coming up with a numerical score that tells us if the
algorithm is doing a good or bad job.
Different Performance Measures for Different Tasks
• For tasks like classification (e.g., identifying cats vs. dogs in images):
• We often measure accuracy, which is simply the proportion of examples the model
got right. For example, if the model correctly identifies 90 out of 100 images, its
accuracy is 90%.
• We could also look at the error rate, which is the opposite of accuracy. If the model
gets 10 wrong out of 100, its error rate is 10%.
Evaluating the Algorithm on New Data
• Test set: We want to see how well the model works on data it hasn’t seen
before. In the real world, when the algorithm is deployed, it will be making
predictions on new data. To mimic this, we set aside some data (called the
test set) that the algorithm doesn’t use during training. We measure the
model’s performance on this unseen data to see how well it might perform in
real-world situations.
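• A sketch of measuring performance on a held-out test set (assuming scikit-learn; the synthetic dataset and logistic regression model are illustrative choices):

```python
# Hold out a test set, train on the rest, and report accuracy / error rate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Keep 25% of the data aside; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print("accuracy:", accuracy)        # proportion of test examples classified correctly
print("error rate:", 1 - accuracy)  # proportion classified incorrectly
```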
Challenges in Choosing a Performance Measure
What should we measure?
• It can be hard to decide exactly what to measure. For example:
• In a transcription task (e.g., converting spoken words to text), do we give credit only
when the entire sentence is correct? Or should we give partial credit if some words
are correct?
• In a regression task (e.g., predicting prices), should we be more concerned if the
model frequently makes medium-sized mistakes, or if it rarely makes large
mistakes?
What if the best measure is hard to compute?
• Sometimes we know what we want to measure, but it's too difficult or
expensive to calculate. For example, when trying to estimate the
probability of certain data points, some models make it very hard to
compute an actual number. In such cases, we might need to use an
approximation or a different measure that’s easier to compute, but still
reflects what we care about.
Experience, E
• The Experience, E in machine learning refers to the data that the algorithm is
exposed to during the learning process. This "experience" allows the
algorithm to learn patterns, relationships, and rules from the data.
• Two Types of Experience in Machine Learning
• Supervised learning:
• In this type, the algorithm is given both input data (examples) and correct answers
(labels). For example, if the task is to classify images as either "cats" or "dogs," the
algorithm will see images (inputs) and be told which ones are cats and which ones are
dogs (labels). The algorithm's experience includes knowing what the right answers are
while learning.
• Unsupervised learning:
• In this type, the algorithm only receives input data, without being given the correct
answers. For example, if you gave the algorithm a bunch of pictures without telling it
which ones are cats and which ones are dogs, it has to figure out patterns or groupings
on its own. The experience here is more open-ended.
• The Dataset: A Collection of Examples
• Most machine learning algorithms are trained using a dataset, which is a
collection of examples. Each example represents a data point, which is
made up of different features or measurements. The algorithm looks at all
these examples during training, learning from them in order to make
predictions about new, unseen data.
• One of the oldest and most famous datasets in machine learning is the Iris
dataset
• The Iris dataset contains measurements from 150 iris plants.
• Each plant is one example (or data point) in the dataset.
• For each plant, there are four measurements recorded:
• Sepal length (the length of the outer part of the flower)
• Sepal width
• Petal length (the length of the inner part of the flower)
• Petal width
• The dataset also includes labels, which tell us the species of each iris plant.
There are three species in this dataset, so for each plant, the algorithm
knows which species it belongs to.
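• A quick way to inspect the Iris dataset (assuming scikit-learn, which bundles a copy of it):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)       # (150, 4): 150 plants, 4 measurements each
print(iris.feature_names)    # sepal length/width, petal length/width
print(iris.target_names)     # the 3 species labels
print(iris.target[:5])       # species labels of the first 5 examples
```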
Support Vector Machines and the Kernel Trick
• A support vector machine expresses its decision function in terms of a subset of the training examples, called support vectors, combined through a kernel function.
• This reduces the computational cost of classifying new examples, as only the support vectors need to be evaluated.
• Advantages of the Kernel Trick:
• Nonlinear Decision Boundaries: By using kernel functions, SVMs can create
highly complex, nonlinear decision boundaries in the original space.
• Efficient Computation: Kernels often provide a computationally efficient
way to compute the decision function without explicitly mapping data into
higher dimensions.
• Convex Optimization: The optimization problem in SVMs remains convex,
ensuring that we can find a global optimum efficiently.
Limitations of Kernel Machines:
• Computational Cost: The cost of evaluating the decision function is linear
in the number of training examples because each example contributes to the
decision function.
• Training Time: Training SVMs with large datasets can be computationally
expensive, especially when using complex kernels like the RBF kernel.
• Generalization: Kernel machines, especially with generic kernels,
sometimes struggle to generalize well to unseen data.
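• A small kernel-SVM sketch (assuming scikit-learn; the two-moons dataset is an illustrative choice of data that is not linearly separable):

```python
# An RBF-kernel SVM separates data that no straight line can separate.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("training accuracy:", model.score(X, y))
# Only the support vectors contribute to the decision function,
# so they determine the cost of classifying new examples.
print("support vectors per class:", model.n_support_)
```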
Deep Learning vs. Kernel Machines:
• Deep learning models were designed to overcome some of the limitations of
kernel machines, especially in terms of scalability and generalization.
• For instance, deep neural networks can learn hierarchical representations
from data, which allows them to perform better than kernel-based SVMs on
large-scale problems like image classification (e.g., MNIST).
Other Simple Supervised Learning Algorithms
• The k-nearest neighbors (k-NN) algorithm is a simple yet powerful non-parametric method used for both classification and regression tasks in supervised learning. Here's a breakdown of its key concepts:
Non-Parametric Nature:
• k-NN is a non-parametric algorithm, meaning it does not assume any fixed form or
distribution for the data.
• Unlike models like linear regression, which have a fixed number of parameters (weights),
k-NN doesn't have a training process in the traditional sense. Instead, it memorizes the
training data and makes predictions based on the nearest examples during testing.
How k-NN Works:
• At the testing stage, when a new input x needs to be classified or a value y predicted, the
algorithm identifies the k-nearest neighbors to x from the training data.
• For classification, it returns the most frequent label among these nearest neighbors.
• For regression, it returns the average of the output values (y) of the nearest neighbors.
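• A from-scratch sketch of this prediction rule (NumPy only; the tiny dataset is illustrative):

```python
# k-NN prediction: measure distances, pick the k nearest neighbors,
# then vote (classification) or average (regression).
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    distances = np.linalg.norm(X_train - x_new, axis=1)  # distance to every example
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    neighbor_labels = y_train[nearest]
    if task == "classification":
        values, counts = np.unique(neighbor_labels, return_counts=True)
        return values[np.argmax(counts)]                 # most frequent label
    return neighbor_labels.mean()                        # average output for regression

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0])))  # -> 1
```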
Averaging for Classification:
• In classification tasks, k-NN can be thought of as averaging over one-hot
vectors. A one-hot vector is used to represent class labels, where the
position corresponding to the class is 1, and the rest are 0s. Averaging these
vectors gives a probability distribution over the possible classes.
• This allows k-NN to handle probabilistic classification by assigning a
likelihood to each class based on the neighbors' votes.
High Capacity and Bayes Error:
• High capacity means that k-NN can theoretically model very complex
relationships in the data, especially as the size of the training set increases.
• As the size of the training set approaches infinity, the one-nearest-neighbor (1-NN) algorithm converges to at most twice the Bayes error. The Bayes error rate is the minimum possible error any classifier can achieve, given the inherent noise in the data.
• If ties for the nearest neighbor are broken by averaging over all tied neighbors rather than randomly choosing one of them, the algorithm can approach the Bayes error rate itself, leading to highly accurate predictions.
Weaknesses of k-NN:
• High computational cost: During testing, k-NN must calculate the distance
between the test point and every training example, making it slow and
computationally expensive for large datasets.
• Poor generalization on small datasets: With a limited amount of training
data, k-NN may perform poorly because it heavily relies on the proximity of
training examples, and small training sets may not capture the underlying
patterns in the data.
• Feature sensitivity: k-NN cannot distinguish between more and less
important features. For example, if only one feature is relevant to the output,
but the dataset has many irrelevant features, k-NN may get "confused" by the
irrelevant features. This is because distance calculations will be affected by
all features, even those that are irrelevant to the target.
Decision Tree
A decision tree uses a tree representation to solve a problem: each leaf node corresponds to a class label, and attributes are tested at the internal nodes of the tree. Any boolean function on discrete attributes can be represented with a decision tree.
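• A small decision-tree sketch (assuming scikit-learn; the XOR function on two binary attributes is an illustrative boolean function):

```python
# A decision tree can represent a boolean function on discrete attributes,
# such as XOR of two bits.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # all combinations of two binary attributes
y = [0, 1, 1, 0]                       # XOR of the two attributes

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["a", "b"]))  # internal nodes test a or b
print(tree.predict([[1, 0]]))                       # leaves give the class label
```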
Diagrams describing how a decision tree works.
(Top) Each node of the tree chooses to send the input
example to the child node on the left (0) or the
child node on the right (1). Internal nodes are drawn
as circles and leaf nodes as squares. Each node is
displayed with a binary string identifier
corresponding to its position in the tree, obtained by
appending a bit to its parent identifier (0=choose left
or top, 1=choose right or bottom). (Bottom) The tree
divides space into regions. The 2D plane shows how
a decision tree might divide R2. The nodes of the tree
are plotted in this plane, with each internal node
drawn along the dividing line it uses to categorize
examples, and leaf nodes drawn in the center of the
region of examples they receive. The result is a
piecewise-constant function, with one piece per
leaf. Each leaf requires at least one training example
to define, so it is not possible for the decision tree to
learn a function that has more local maxima than the
number of training examples.
Unsupervised Learning Algorithms
Features without Supervision:
• Unsupervised learning algorithms only use features from the data, without
needing human-annotated labels or targets.
• These algorithms often deal with tasks such as density estimation,
denoising, manifold learning, and clustering.
Representation Learning:
• One of the core goals of unsupervised learning is to find the "best"
representation of the data.
• The term "best" refers to how much information about the data is retained in
a simpler or more accessible form.
Types of Representations: Three common representations in unsupervised
learning are:
• Lower-dimensional representations: Compress the data into fewer
dimensions, preserving as much information as possible. These help in
reducing complexity while keeping key features intact.
• Sparse representations: Involve representations where most of the values
are zero, capturing important information while ignoring irrelevant or
redundant features. Typically, this increases the dimensionality but
emphasizes sparsity in the data structure.
• Independent representations: Aim to disentangle the factors of variation
within the data, ensuring that the dimensions are statistically independent.
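• A small sketch of learning a lower-dimensional (and decorrelated) representation without labels (assuming scikit-learn; PCA is a classic example of this idea):

```python
# PCA compresses the features into fewer, nearly uncorrelated dimensions
# without using any labels.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                    # 150 examples, 4 features, no labels used
pca = PCA(n_components=2).fit(X)

Z = pca.transform(X)                    # new 2-D representation of each example
print(Z.shape)                          # (150, 2)
print(pca.explained_variance_ratio_)    # how much information each dimension keeps
print(np.round(np.corrcoef(Z.T), 3))    # the new dimensions are (nearly) uncorrelated
```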
Interconnections between Representations:
• These types of representations are often intertwined. For example, low-
dimensional representations typically reduce dependencies and
redundancies in the data. Similarly, sparse representations may lead to
independent representations by isolating key factors of variation
Importance of Representation in Deep Learning:
• Representation learning is central to deep learning, as models aim to learn meaningful
representations of the data. Unsupervised learning plays a key role in this process,
enabling models to capture underlying patterns and structures without needing
labeled data.