
DEEP LEARNING

MODULE 1
• In today's fast-growing AI world, deep learning is a crucial technology changing how
machines process and understand data.
• Deep learning mimics how the human brain works by using artificial neural networks to help
computers recognize patterns and make decisions from large amounts of data.
• This technology has led to major improvements in areas like image recognition, language
understanding, healthcare, and self-driving cars.
• Deep learning is changing many industries, expanding what AI can do and pushing the limits
of technology.
• It’s helping create intelligent systems that can see, understand, and even come up with new
ideas on their own.
BIOLOGICAL NEURAL NETWORK
• A biological neural network refers to the
network of neurons (nerve cells) in the brain
and nervous system that work together to
process information

• Neurons are the basic building blocks of the brain. Each neuron has a cell body, dendrites (which receive signals from other neurons), and an axon (which sends signals to other neurons).

• Synapses are tiny gaps between neurons where the exchange of neurotransmitters happens. This is how one neuron sends a signal to the next. Stronger or more frequent signals can lead to stronger synaptic connections.
ARTIFICIAL NEURAL NETWORK
In Artificial Neural Networks, the dendrites of biological neurons correspond to inputs, cell nuclei to nodes, synapses to weights, and axons to outputs.
The Artificial Neural Network is made up of three layers:
1. Input layer: It accepts inputs in a variety of formats specified by the programmer, as the name implies.
2. Hidden layer: The hidden layer sits between the input and output layers. It does all the math to uncover hidden features and patterns.
3. Output layer: The input goes through a series of transformations in the hidden layer, and the final result is conveyed through this layer.

The artificial neural network takes the inputs, computes the weighted sum of the inputs, and adds a bias. This computation is represented in the form of a transfer function.
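A minimal NumPy sketch of this computation; the weights, bias, and step activation below are illustrative choices, not values from the text:

```python
import numpy as np

def transfer(x, w, b):
    """Weighted sum of the inputs plus a bias: z = w.x + b."""
    return np.dot(w, x) + b

def neuron_output(x, w, b):
    """Apply a simple step activation to the transfer function's result."""
    return 1 if transfer(x, w, b) >= 0 else 0

# Example with two inputs and hand-chosen weights and bias
x = np.array([0.5, 0.8])
w = np.array([0.4, 0.6])
b = -0.5
print(transfer(x, w, b))       # 0.18
print(neuron_output(x, w, b))  # 1
```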
• Early achievements in AI happened in simple, controlled environments. For
example, IBM's Deep Blue computer beat the world chess champion in 1997.
Chess is easy for computers because it has clear, fixed rules that can be
programmed in advance.
• Tasks like chess, which are mentally challenging for humans, are often easy
for computers because they follow strict, formal rules. On the other hand,
everyday human tasks like recognizing objects or speech are harder for
computers because they involve understanding complex, informal
knowledge.
• One of the big challenges in AI is teaching computers to understand informal,
everyday knowledge about the world, which is often intuitive for humans but
difficult to explain or program.
• The struggles with hard-coding knowledge suggest that AI needs to learn on
its own from data, which is the idea behind machine learning. Instead of
giving computers all the rules, we allow them to figure out patterns by
themselves.
• Machine learning allows computers to solve real-world problems and make
decisions that feel subjective. For example, a simple algorithm like logistic
regression can help doctors decide whether a patient should have a
cesarean delivery.
• Logistic regression and naive Bayes are examples of basic machine learning
algorithms. Logistic regression can predict outcomes based on features, like
medical history, while naive Bayes can separate spam emails from legitimate
ones.
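As an illustration of how such basic algorithms are applied, here is a minimal scikit-learn sketch on invented data (the feature values and labels are made up for demonstration and do not come from any real medical or email dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Hypothetical feature vectors (e.g., age and one binary history feature) and labels
X = np.array([[35, 1], [42, 0], [28, 1], [55, 0], [60, 1], [31, 0]])
y = np.array([0, 1, 0, 1, 1, 0])

# Logistic regression predicts an outcome from the features
clf = LogisticRegression().fit(X, y)
print(clf.predict([[50, 1]]))        # predicted class for a new example
print(clf.predict_proba([[50, 1]]))  # class probabilities

# Naive Bayes works the same way on the same kind of structured data
nb = GaussianNB().fit(X, y)
print(nb.predict([[50, 1]]))
```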
• The success of these algorithms heavily depends on how the data is presented.
For instance, logistic regression works well with structured information, like
medical reports, but can't handle raw data like MRI images effectively because it
can't interpret individual pixels.
• Some AI tasks can be solved by designing the right features, such as using vocal
tract size to identify whether a speaker is a man, woman, or child. These features
help machine learning models make better predictions.
• However, for some tasks it is hard to decide which features to use. For instance, if you want to detect cars in photos, a wheel might seem like a good feature to identify, but it is hard to describe exactly what a wheel looks like using just pixel values. Real-world images of wheels can be complicated by factors like shadows, bright reflections from the sun, parts of the car covering the wheel, or other objects partially hiding it. All of these factors make it challenging to extract useful features like a wheel in a consistent way.
Historical Trends in Deep Learning
• Deep learning has had a long and rich history, but has gone by
many names reflecting different philosophical viewpoints, and
has waxed and waned in popularity.
• Deep learning has become more useful as the amount of
available training data has increased.
• Deep learning models have grown in size over time as computer
infrastructure (both hardware and software) for deep learning has
improved.
• Deep learning has solved increasingly complicated applications
with increasing accuracy over time.
• Many people think deep learning is a recent technology, but it actually dates back to the
1940s. It has only become popular again in recent years after a period of being less well-
known.
• Over the decades, deep learning has gone by different names depending on who was
studying it and how they viewed it. This is why it might seem unfamiliar even though it's
been around for a long time.
• There have been three key periods in deep learning's development:
• The 1940s-1960s, when it was called cybernetics.
• The 1980s-1990s, when it was called connectionism.
• From 2006 onward, it became known as deep learning.
• Many early deep learning algorithms were designed to model how learning happens in the
brain. That's why deep learning is often referred to as artificial neural networks (ANNs), as
they were inspired by the way biological neural networks (like the brain) work.
• Although neural networks are inspired by the brain, they are not accurate models of how
the brain works. The idea was that if the brain can produce intelligent behavior, we could
learn how by studying its functions and replicating them in machines.
• Today, the term "deep learning" refers to a broader idea. It involves building models that
can learn from multiple layers of data, and these models aren’t necessarily trying to mimic
how the brain works—they're focused on solving complex problems in various fields.
• Early deep learning models were simple linear models inspired by how the
brain works. They took inputs, applied weights to them, and produced an
output by adding everything up.
f(x, w) = x1·w1 + ··· + xn·wn
1. McCulloch-Pitts Model of Neuron
The McCulloch-Pitts neural model, which was the earliest ANN model, has only two types of inputs — excitatory and inhibitory. The excitatory inputs have weights of positive magnitude, and the inhibitory inputs have weights of negative magnitude. The inputs of the McCulloch-Pitts neuron can be either 0 or 1, and it uses a threshold function as its activation function. The output signal y_out is 1 if the weighted input sum y_sum is greater than or equal to a given threshold value, else 0.
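A minimal sketch of the McCulloch-Pitts neuron; the weights and threshold below are illustrative (they implement a simple AND gate):

```python
import numpy as np

def mcculloch_pitts(inputs, weights, threshold):
    """Binary inputs, fixed weights (+ for excitatory, - for inhibitory);
    output 1 if the weighted sum y_sum >= threshold, else 0."""
    y_sum = np.dot(inputs, weights)
    return 1 if y_sum >= threshold else 0

# AND gate: two excitatory inputs with weight +1 and threshold 2
print(mcculloch_pitts([1, 1], [1, 1], threshold=2))  # 1
print(mcculloch_pitts([1, 0], [1, 1], threshold=2))  # 0
```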
• What is Machine Learning?
Machine learning (ML) is a type of Artificial Intelligence (AI) that allows
computers to learn without being explicitly programmed. It involves feeding data
into algorithms that can then identify patterns and make predictions on new
data. Machine learning is used in a wide variety of applications, including image
and speech recognition, natural language processing, and recommender systems.
• 1. Supervised learning:
• Supervised learning is the machine learning task of learning a function that maps
an input to an output based on example input-output pairs. The given data is
labeled. Both classification and regression problems are supervised learning
problems.
• Example – Consider the following data regarding patients entering a clinic. The data consists of the gender and age of the patients, and each patient is labeled as “healthy” or “sick”.
2. Unsupervised learning:
Unsupervised learning is a type of machine learning algorithm used to draw
inferences from datasets consisting of input data without labeled responses. In
unsupervised learning algorithms, classification or categorization is not included in
the observations. Example: Consider the following data regarding patients entering
a clinic. The data consists of the gender and age of the patients.
Learning Algorithms
A machine learning algorithm is an algorithm that is able to learn
from data.
Definition of ML:
“A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E.”
The Task, T
A task (T) in machine learning refers to the specific problem or
objective that we want the system to accomplish. It could be
anything from recognizing an image, predicting stock prices, or
controlling a robot to walk.
• What makes machine learning different from traditional programming is that
some tasks are too difficult or complex for humans to solve using fixed,
hand-written rules. Instead of manually programming every possible
scenario, machine learning allows the system to learn how to perform the
task from data and experience.
Learning is Not the Task:
• A key distinction made here is that learning itself is not the task. Instead,
learning is a means to accomplish the task.
• For example, if the task is for a robot to walk, then the task is walking.
• Machine learning is one way to achieve that task, by allowing the robot to learn how to
walk based on data and feedback from its movements, rather than programming it
with explicit instructions on how to walk.
• Alternatively, you could try to manually write a program that tells the robot step-by-
step how to walk, but this approach might be difficult due to the complexity of the
task.
Many kinds of tasks can be solved with machine learning.
Some of the most common machine learning tasks include
the following
1. Classification:
• Classification is a type of task where the computer is asked
to figure out which category (or class) an input belongs to.
• For example, if you show the program a picture, it needs to
decide whether the picture is of a cat, a dog, or some other
object.
• How Classification Works:
• In classification, the program is trained to take an input, like an image, and
assign it to one of k categories. In math, this is written as a function:
• f:R^n→{1,2,...,k}
• This means that the input, represented as a vector (a list of numbers), is mapped to
one of k categories.
• For example, if the task is to recognize objects in images, each image (input) can be
described by numbers that represent things like the brightness of each pixel (the small
dots that make up the image). The output is a number y that identifies the category,
like 1 for "cat", 2 for "dog", etc.
Variants of Classification:
• In some variations of classification, instead of directly outputting a category,
the program can give a probability distribution. This means it doesn’t just
say, “This is a cat,” but it might say, “There’s a 70% chance this is a cat, 20%
chance it’s a dog, and 10% chance it’s a rabbit.”
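A small NumPy sketch of both variants — a hard class decision via argmax and a probability distribution via softmax; the scores and class names below are made-up values for illustration:

```python
import numpy as np

classes = ["cat", "dog", "rabbit"]      # illustrative category names
scores = np.array([2.0, 0.8, 0.1])      # made-up model scores for one input

# Hard classification: map the input to the highest-scoring category
print(classes[int(np.argmax(scores))])  # "cat"

# Probabilistic variant: softmax turns the scores into a distribution over classes
probs = np.exp(scores) / np.sum(np.exp(scores))
print(dict(zip(classes, np.round(probs, 2))))  # roughly {cat: 0.69, dog: 0.21, rabbit: 0.10}
```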
• Example: Object Recognition:
• One common example of a classification task is object recognition, where
the program looks at an image and identifies what object is in it.
• For example, there’s a robot called Willow Garage PR2, which can act as a waiter. It
looks at different drinks and can recognize them (e.g., it knows whether the drink is
water, soda, juice, etc.) and delivers them to people on command.
• To do this, the robot uses a classification system that looks at each drink and assigns it
to a category, such as water, soda, or juice.
• Deep Learning and Classification:
• The best way to solve complex classification tasks like object recognition is
through deep learning. Deep learning is a type of machine learning that uses
neural networks with many layers to analyze and learn from large amounts of
data.
• Deep learning has been successfully used for tasks like recognizing objects in images
(such as recognizing different types of drinks) and even recognizing faces.
Face Recognition Example:
A specific example of classification is face recognition.
Programs can now look at a picture and recognize who is in
it. This technology can be used to automatically tag people in
photo collections, like how Facebook automatically suggests
names when you upload pictures of your friends.
• It’s also useful for other applications, like allowing computers or
phones to recognize their users and interact with them more
naturally.
Classification with missing inputs
Classification with Complete Inputs:
Normally, in a classification task, the machine learning
model takes a set of inputs (called a vector) and uses that
information to predict a category or class.
• For example, if the task is to classify whether a patient has a
disease, the input might be different medical test results, and
the output would be a prediction of whether the patient is
healthy or sick.
• In this case, the model expects all the inputs (all the test
results) to be available every time it makes a prediction.
Challenge of Missing Inputs:
• However, in some cases, not all the input data is available. For example, a patient might not have undergone all the medical tests because of the cost of the tests.
• When some inputs are missing, classification becomes more difficult. The
algorithm can’t rely on always having a full set of information to make a
decision.
• When inputs are missing, the algorithm now needs to learn a set of
functions—one function for every possible combination of missing inputs.
Each function would handle a different situation where some inputs are
missing.
• For example, if a test result is missing, the algorithm needs to be able to still
make a prediction based on the other available test results.
• Efficient Solution: Probability Distribution:
• Learning a different function for every possible combination of missing inputs
would be very inefficient because there are 2^n combinations of inputs
(where n is the number of input variables, like medical tests).
• For example, if there are 5 medical tests, there would be 32 different
combinations of inputs that could be missing (since 2^5 = 32).
• Instead of learning separate functions for each case, the algorithm can learn
a single probability distribution over all the variables (all the input data).
• A probability distribution describes how likely different combinations of
inputs and outputs are.
• Once the algorithm knows the overall distribution of the data, it can fill in
the gaps for the missing inputs by estimating what those missing values
could be. This process is called marginalization, which means the
algorithm averages over the possible values of the missing inputs to make
its decision.
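A toy sketch of marginalization over a missing input, assuming a small, invented discrete joint distribution p(x1, x2, y) over two binary test results and a binary label:

```python
import numpy as np

# Invented joint distribution p(x1, x2, y); axes are x1, x2, y in {0, 1} and entries sum to 1.
p = np.array([
    [[0.20, 0.05], [0.10, 0.05]],   # x1 = 0
    [[0.05, 0.15], [0.05, 0.35]],   # x1 = 1
])

# Both tests observed: condition directly, p(y | x1=1, x2=1)
joint = p[1, 1, :]
print(joint / joint.sum())          # [0.125, 0.875]

# Test x2 missing: marginalize it out, p(y | x1=1) = sum over x2 of p(x1=1, x2, y), renormalized
marg = p[1, :, :].sum(axis=0)
print(marg / marg.sum())            # approx. [0.167, 0.833]
```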
Deep Learning Example:
Deep learning models can handle this kind of problem very efficiently. They
can learn a probability distribution over all the relevant variables and use that
to make predictions even when inputs are missing.
• For example, in the 2013 research by Goodfellow et al., they applied a deep
probabilistic model to solve a classification task where inputs were missing. This
model allowed the system to make accurate predictions even when it didn’t have all
the input data.
Regression
• What is Regression?
• Regression is a machine learning task where the computer is asked to predict a
continuous number given some input data.
• For example, if you want to predict the price of a house based on its features (like size,
location, number of bedrooms), the task is to estimate a numerical value (the house price).
• This is different from classification, where the goal is to predict a category (like
whether an email is spam or not). In regression, the output is not a category but a
number.
• Mathematically, this is written as f: R^n →R, which means the function takes a
vector (list of n numbers, where each number represents a feature) as input and
produces a single number as output.
• For example, if you're predicting house prices, the input vector might include
features like the size of the house, the number of bedrooms, and the year it was
built. The output will be a single number: the predicted price of the house.
• Difference Between Regression and Classification:
• While classification and regression are similar in that both involve
predicting something based on input data, the key difference is in the type of
output:
• In classification, the output is a category (e.g., is this email spam or not?).
• In regression, the output is a continuous numerical value (e.g., predicting the
temperature for tomorrow, the price of a stock, or how much money someone will
claim on their insurance).
• Regression is crucial in many fields because it helps make quantitative
predictions. These predictions can help businesses make better decisions,
optimize pricing, or predict future trends.
• For example, in insurance, predicting claim amounts helps companies manage risk
and set appropriate premiums.
• In finance, predicting stock prices helps investors decide when to buy or sell stocks.
• In real estate, predicting house prices helps agents and buyers understand the market
and make informed decisions.
• List at least five examples of regression tasks.
Transcription
Transcription is a type of task where a machine learning system
takes unstructured data (like an image or sound) and converts it
into a structured text format.

Speech Recognition:
A major example of transcription is speech recognition.
• In this task, the computer is given an audio recording (like
someone talking) and is asked to convert the spoken words
into written text.
• The audio recording is just a continuous waveform of sound,
and the program’s job is to break it down and figure out what
words are being said.
• For example, if someone says "hello," the system listens to the
sound, understands it, and outputs the word "hello" as text.
• Speech Recognition in Use: This technology is used by major
companies like Microsoft, IBM, and Google in systems such
as virtual assistants (e.g., Google Assistant, Siri, Cortana).
When you speak a command like "What’s the weather today?"
the system transcribes your voice into text so that it can
process the command.
• Why is Transcription Important?
• Transcription tasks are essential in many real-world applications because
they help turn raw, unstructured data (like images or sounds) into something
usable (like text).
• For example, being able to automatically convert a scanned document into text saves
time and effort compared to manually typing it out.
• Similarly, speech recognition allows people to interact with computers or devices
using their voice, which is more natural than typing in many situations.
Machine translation
• Machine translation is a machine learning task where the computer takes
text in one language (like English) and converts it into text in another language
(like French).
• The input is a sequence of symbols (like letters or words) in one language,
and the output is a sequence of symbols in a different language.
• An example is Google Translate, which takes a sentence in English and gives
you the equivalent sentence in French.
• Deep learning has made a big impact in this area.
• Deep learning uses complex models (like neural networks) that can handle
sequences of words and learn the meaning of sentences.
• Traditional translation systems used rules or statistical methods, but deep
learning systems learn from large amounts of data to provide more accurate
and fluent translations.
Structured output task
• Structured output tasks are a broader category of machine learning tasks
where the computer is asked to produce a set of values that are related to
each other.
• This could be something like a sequence of words (for example, a sentence)
or a structure like a tree
How is Structured Output Different from Simple Output?
• In simple output tasks (like regression or classification), the computer
gives a single number or category as the answer.
• In structured output tasks, the computer needs to give a set of related
answers. For example:
• In machine translation, the output is a sequence of words that form a sentence.
• In other tasks, the output might be more complex structures like a tree (which
represents how words in a sentence are grammatically related).
• Parsing Sentences:
• One example of a structured output task is parsing. In parsing, the computer takes
a sentence in natural language (like English) and breaks it down into its
grammatical parts (like nouns, verbs, etc.).
• The output is a tree structure that shows how the words in the sentence are
connected grammatically. For example, in the sentence “The cat sat on the mat,”
the computer would identify "The cat" as a noun phrase and "sat on the mat" as a
verb phrase.
• Deep learning has been used to improve parsing by making the system better at
understanding sentence structure.
• Image Segmentation:
• Another example is pixel-wise image segmentation, where the computer looks at
an image and labels each pixel (the tiny dots that make up the image) with a
category.
• For instance, if the image is an aerial photo of a city, the program might label some
pixels as roads, others as buildings, and others as trees.
• Deep learning helps with this by using convolutional neural networks (a type of
deep learning model designed for images) to process the image and understand
the different parts.
Anomaly Detection
• Anomaly detection is a task where a computer program looks through a set
of events or objects and tries to find things that don’t fit the normal pattern or
are unusual. These “unusual” things are called anomalies.
• Credit Card Fraud Detection:
• One common example is how credit card companies use anomaly detection
to spot fraud.
• The credit card company has information about your purchasing habits—for example,
the kinds of stores you usually shop at, how much you typically spend, and where you
live.
• If someone steals your credit card or the card details and tries to use it for purchases
that are different from your usual pattern—like spending large amounts of money in
a different country—the company can detect this as anomalous behavior.
• When the system detects a purchase that doesn’t match your typical spending, it
might flag the transaction as suspicious and temporarily stop the card from working to
prevent further fraud.
• How it Works:
• The computer builds a model of what “normal” looks like for each person or
object by looking at past data.
• If something happens that looks very different from that model, it’s flagged as
an anomaly.
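One common baseline, sketched below, models "normal" with the mean and standard deviation of past values and flags points that fall too far away; the spending history and the 3-sigma threshold are illustrative assumptions:

```python
import numpy as np

# Hypothetical history of a customer's purchase amounts
history = np.array([25.0, 40.0, 32.0, 28.0, 45.0, 38.0, 30.0])
mu, sigma = history.mean(), history.std()

def is_anomalous(amount, k=3.0):
    """Flag a purchase more than k standard deviations from the usual spending."""
    return abs(amount - mu) > k * sigma

print(is_anomalous(35.0))    # False: consistent with past behaviour
print(is_anomalous(900.0))   # True: flagged as possible fraud
```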
• Why It’s Useful:
• Anomaly detection is helpful for tasks like fraud detection, error detection
in systems, or finding unusual events in medical data. Anytime you need to
find something that stands out from the norm, anomaly detection can be
used.
Synthesis and sampling
• Synthesis and sampling is a task where a computer program is asked to
create new examples that are similar to the ones it has seen during training.
• The program doesn’t just repeat what it has seen before but generates new
data that follows similar patterns to the training data.
Generating Textures for Video Games:
• In video games, artists usually need to create textures (like the appearance of
grass, mountains, or walls). If they had to manually create textures for every
surface in a large game world, it would take a lot of time.
• Machine learning can generate textures automatically, using patterns that it has
learned from existing textures. This helps reduce the work artists need to do and can
save time and resources.
Speech Synthesis:
• Another example is speech synthesis, where the program is given a written
sentence (like “Hello, how are you?”) and is asked to produce a spoken
version of that sentence.
• The machine learns to generate a realistic-sounding audio waveform that sounds like a
person speaking the sentence.
Imputation of missing values
• What it means: In this task, the machine learning algorithm
is given a set of data, but some parts of the data are
missing. The algorithm’s job is to predict the missing values.
Example:
• Imagine you have a table with information about people—
age, height, and weight. But for some people, the height is
missing. The machine learning algorithm would try to guess
(or impute) the missing heights based on the available data.
How it works:
• The algorithm looks at the patterns in the data and uses these patterns to fill
in the missing values. For example, if taller people usually weigh more, and
the weight is known, the algorithm might guess a taller height for someone
with a higher weight.
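A minimal sketch using scikit-learn's KNNImputer, which fills a missing height from the most similar rows; the table values are invented:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: age, height (cm), weight (kg); np.nan marks the missing height
data = np.array([
    [25, 170.0, 68.0],
    [32, 180.0, 82.0],
    [41, 165.0, 60.0],
    [29, np.nan, 80.0],   # height missing; its neighbours are used to impute it
])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(data))  # the nan is replaced by an estimate
```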
Why it’s important:
• Missing data happens often in real-world scenarios, especially in fields like
medicine or finance, where not every measurement is always available.
Being able to accurately fill in the gaps is crucial for making good predictions
or analyses.
Denoising
• In this type of task, the machine learning algorithm is given a corrupted example x̃ ∈ R^n obtained by an unknown corruption process from a clean example x ∈ R^n. The learner must predict the clean example x from its corrupted version x̃, or more generally predict the conditional probability distribution p(x | x̃).
• In denoising tasks, the machine learning algorithm is given a corrupted or
noisy version of data, and it must try to restore the data back to its original,
clean state.
Example:
• Imagine you have a blurry, noisy photo and you want to clean it up so that the
details are visible again. The machine learning algorithm would try to take
this noisy image and recover the clean image.
• Another example is when you have a corrupted audio recording and you
want to recover the original, clear sound.
How it works:
• The algorithm learns how the noise or corruption usually affects the data. It then
tries to reverse this effect to bring back the original data, or at least predict what
the original data might have looked like.
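As a toy illustration (much simpler than the probabilistic formulation above), the sketch below corrupts a clean signal with Gaussian noise and recovers an estimate with a moving-average filter; the signal, noise level, and window size are all arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean signal x and its corrupted version x_tilde = x + noise
x = np.sin(np.linspace(0, 2 * np.pi, 200))
x_tilde = x + rng.normal(scale=0.3, size=x.shape)

# Very simple denoiser: average each point with its neighbours
window = np.ones(9) / 9
x_hat = np.convolve(x_tilde, window, mode="same")

print(np.mean((x_tilde - x) ** 2))  # error before denoising
print(np.mean((x_hat - x) ** 2))    # error after denoising (smaller)
```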
Why it’s important:
• This kind of task is useful in areas like image and audio processing, where data
often gets corrupted by noise (e.g., static in an audio recording, or low resolution in
an image). By removing the noise, we can improve the quality of the data.
Density estimation or probability mass function
estimation
• In this task, the machine learning algorithm tries to learn the underlying
probability distribution of the data. It needs to figure out how likely different
examples are to occur in the data.
• Example:
• Let’s say you have a dataset of people’s heights. The algorithm would try to
learn the distribution of heights, such as how common it is for someone to
be 5 feet tall, 6 feet tall, and so on.
How it works:
• The machine learning algorithm creates a probability model that tells us
how likely different values (or combinations of values) are. For example, it
might tell us that a person being 5 feet tall is much more common than being
7 feet tall.
• In a more complex case, if you have multiple features (like height, weight, and
age), the algorithm would learn how these features are related to each other
and how likely different combinations of them are.
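A small sketch that fits a single Gaussian density to a set of heights and reports how likely different values are; the heights (in cm) are invented, and a one-Gaussian model is a deliberately simple choice:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical heights in centimetres
heights = np.array([160, 165, 168, 170, 172, 175, 178, 182, 185, 190])

# Density model: a Gaussian with the sample mean and standard deviation
mu, sigma = heights.mean(), heights.std()
density = norm(loc=mu, scale=sigma)

print(density.pdf(175))  # a fairly common height under this model
print(density.pdf(215))  # a far less likely height
```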
Performance Measure, P
The Performance Measure, P is a way to check how well a machine
learning algorithm is doing at solving a particular task. Just like in
school, we give students tests and grades to measure how well they
understand the material, in machine learning, we need to measure
the algorithm’s success in a similar way.
Why It’s Important
Every machine learning algorithm is created to perform a specific
task, like classification , transcription or regression. To know how
well the algorithm is working, we must quantify its performance.
This means coming up with a numerical score that tells us if the
algorithm is doing a good or bad job.
Different Performance Measures for Different Tasks
• For tasks like classification (e.g., identifying cats vs. dogs in images):
• We often measure accuracy, which is simply the proportion of examples the model
got right. For example, if the model correctly identifies 90 out of 100 images, its
accuracy is 90%.
• We could also look at the error rate, which is the opposite of accuracy. If the model
gets 10 wrong out of 100, its error rate is 10%.
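A minimal sketch of computing accuracy and error rate from predictions; the labels are made up:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # ground-truth labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])  # model predictions (2 mistakes)

accuracy = np.mean(y_pred == y_true)
error_rate = 1.0 - accuracy
print(accuracy, error_rate)   # 0.8 0.2
```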
Evaluating the Algorithm on New Data
• Test set: We want to see how well the model works on data it hasn’t seen
before. In the real world, when the algorithm is deployed, it will be making
predictions on new data. To mimic this, we set aside some data (called the
test set) that the algorithm doesn’t use during training. We measure the
model’s performance on this unseen data to see how well it might perform in
real-world situations.
Challenges in Choosing a Performance Measure
What should we measure?
• It can be hard to decide exactly what to measure. For example:
• In a transcription task (e.g., converting spoken words to text), do we give credit only
when the entire sentence is correct? Or should we give partial credit if some words
are correct?
• In a regression task (e.g., predicting prices), should we be more concerned if the
model frequently makes medium-sized mistakes, or if it rarely makes large
mistakes?
What if the best measure is hard to compute?
• Sometimes we know what we want to measure, but it's too difficult or
expensive to calculate. For example, when trying to estimate the
probability of certain data points, some models make it very hard to
compute an actual number. In such cases, we might need to use an
approximation or a different measure that’s easier to compute, but still
reflects what we care about.
Experience, E
• The Experience, E in machine learning refers to the data that the algorithm is
exposed to during the learning process. This "experience" allows the
algorithm to learn patterns, relationships, and rules from the data.
• Two Types of Experience in Machine Learning
• Supervised learning:
• In this type, the algorithm is given both input data (examples) and correct answers
(labels). For example, if the task is to classify images as either "cats" or "dogs," the
algorithm will see images (inputs) and be told which ones are cats and which ones are
dogs (labels). The algorithm's experience includes knowing what the right answers are
while learning.
• Unsupervised learning:
• In this type, the algorithm only receives input data, without being given the correct
answers. For example, if you gave the algorithm a bunch of pictures without telling it
which ones are cats and which ones are dogs, it has to figure out patterns or groupings
on its own. The experience here is more open-ended.
• The Dataset: A Collection of Examples
• Most machine learning algorithms are trained using a dataset, which is a
collection of examples. Each example represents a data point, which is
made up of different features or measurements. The algorithm looks at all
these examples during training, learning from them in order to make
predictions about new, unseen data.
• One of the oldest and most famous datasets in machine learning is the Iris
dataset
• The Iris dataset contains measurements from 150 iris plants.
• Each plant is one example (or data point) in the dataset.
• For each plant, there are four measurements recorded:
• Sepal length (the length of the outer part of the flower)
• Sepal width
• Petal length (the length of the inner part of the flower)
• Petal width
• The dataset also includes labels, which tell us the species of each iris plant.
There are three species in this dataset, so for each plant, the algorithm
knows which species it belongs to.
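A quick scikit-learn sketch of loading the Iris dataset to see the 150 examples, four measurements, and three species labels described above:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)               # (150, 4): 150 examples, 4 measurements each
print(iris.feature_names)            # sepal/petal length and width
print(iris.target_names)             # the three species
print(iris.data[0], iris.target[0])  # one example and its label
```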

Why This Matters
• By exposing the algorithm to a dataset, we give it the "experience" it needs to
recognize patterns. For instance, in the Iris dataset, the algorithm might learn
that certain sepal and petal lengths and widths are more common in one
species than another. Once it learns these patterns, it can make predictions
about new plants it hasn't seen before.
• Unsupervised learning algorithms experience a dataset containing many
features, then learn useful properties of the structure of this dataset. In the
context of deep learning, we usually want to learn the entire probability
distribution that generated a dataset, whether explicitly as in density
estimation or implicitly for tasks like synthesis or denoising. Some other
unsupervised learning algorithms perform other roles, like clustering, which
consists of dividing the dataset into clusters of similar examples.
• Supervised learning algorithms experience a dataset containing features, but
each example is also associated with a label or target. For example, the Iris
dataset is annotated with the species of each iris plant. A supervised learning
algorithm can study the Iris dataset and learn to classify iris plants into three
different species based on their measurements.
• Roughly speaking, unsupervised learning involves observing
several examples of a random vector x, and attempting to implicitly
or explicitly learn the probability distribution p(x), or some
interesting properties of that distribution, while supervised
learning involves observing several examples of a random vector x
and an associated value or vector y, and learning to predict y from
x, usually by estimating p(y | x ).
• The term supervised learning originates from the view of the target y
being provided by an instructor or teacher who shows the machine
learning system what to do. In unsupervised learning, there is no
instructor or teacher, and the algorithm must learn to make sense
of the data without this guide.
Blurry Lines Between Supervised and Unsupervised Learning
• In practice, these two types of learning aren’t always clearly separated.
Sometimes, the same techniques can be applied to both tasks. This is
where things get a bit more interesting.
• For example, the chain rule of probability allows you to break down a
complex problem into smaller, simpler parts. It says that you can represent
the probability of a complex event happening as a combination of smaller
events.
• Semi-Supervised Learning:
• In semi-supervised learning, some examples have labels (targets) that tell
the algorithm what the correct answer is, while others do not. This is useful
when labeling data is expensive or time-consuming, but we have access to a
lot of unlabeled data.
• For example, in a dataset of 1000 images, only 200 images might be labeled
with their respective categories (e.g., dog, cat), while the other 800 images
are unlabeled. The algorithm learns from both the labeled and unlabeled
data.
• Multi-Instance Learning:
• Here, instead of labeling each individual example, we label an entire group or
collection of examples. The collection might contain a certain class, but we
don't know which specific examples in the collection belong to that class.
• Think of a bag of multiple objects. The bag might be labeled as containing at
least one specific type of object (e.g., a fruit), but we don’t know which
objects in the bag are fruits.
• Reinforcement Learning:
• This type of learning is different from standard supervised or unsupervised
learning. In reinforcement learning, the algorithm interacts with an environment
and learns by receiving feedback (rewards or penalties) based on its actions.
• Predictive text, text summarization, question answering, and machine translation
are all examples of natural language processing (NLP) that uses reinforcement
learning. By studying typical language patterns, RL agents can mimic and predict
how people speak to each other every day. This includes the actual language
used, as well as syntax (the arrangement of words and phrases) and diction (the
choice of words).
Datasets and Design Matrix
• Dataset: A dataset is a collection of examples, where each example
contains multiple features.
• For instance, in the famous Iris dataset, each example represents a different
iris plant, and the features are the measurements of the plant (like sepal
length, sepal width, etc.).
• Design Matrix:
• In some cases, datasets are represented as a matrix. In this matrix, each row
is a different example, and each column corresponds to a feature.
• This matrix format is commonly used in machine learning algorithms to
process datasets, where all examples must have the same number of
features.
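For instance, a sketch of the Iris design matrix, where each row is one plant and each column one measurement:

```python
from sklearn.datasets import load_iris

X = load_iris().data   # design matrix of shape (150, 4)
print(X.shape)         # rows = examples, columns = features
print(X[:3])           # the first three examples (rows)
```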
Linear Regression
• Linear regression is a simple machine learning algorithm used to solve regression
problems—where the task is to predict a continuous value (like temperature or
price).
• Linear regression is a type of supervised machine learning algorithm that
computes the linear relationship between the dependent variable and one or
more independent features by fitting a linear equation to observed data.
• When there is only one independent feature, it is known as Simple Linear
Regression, and when there are more than one feature, it is known as Multiple
Linear Regression.
Here Y is called a dependent or target variable and X is called an
independent variable also known as the predictor of Y.
Home Price Prediction
Given a table of home prices, find the prices of homes whose area is:
• 3300 square feet
• 5000 square feet
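A minimal scikit-learn sketch for this exercise. The area/price table below is invented for illustration, since the original slide's table is not reproduced here; with real data, only the arrays would change:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: area in square feet -> price
area = np.array([[2600], [3000], [3200], [3600], [4000]])
price = np.array([550000, 565000, 610000, 680000, 725000])

model = LinearRegression().fit(area, price)
print(model.coef_, model.intercept_)    # slope and intercept of the fitted line
print(model.predict([[3300], [5000]]))  # predicted prices for the two query areas
```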
Supervised Learning Algorithms
• learning algorithms that learn to associate some input with some output,
given a training set of examples of inputs x and outputs y. In many cases the
outputs y may be difficult to collect automatically and must be provided by a
human “supervisor,” but the term still applies even when the training set
targets were collected automatically.
Probabilistic Supervised Learning
Support Vector Machines
✓Support Vector Machine (SVM) is a powerful machine learning algorithm
used for linear or nonlinear classification, regression, and even outlier
detection tasks. SVMs can be used for a variety of tasks, such as text
classification, image classification, spam detection, handwriting
identification, gene expression analysis, face detection, and anomaly
detection.
✓Support Vector Machine (SVM) is a supervised machine
learning algorithm used for both classification and regression.
✓The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional space that separates the data points of different classes in the feature space. The hyperplane is chosen so that the margin between the closest points of different classes is as large as possible.
• Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog. Such a model can be created using the SVM algorithm. We first train the model with many images of cats and dogs so that it learns the different features of cats and dogs, and then we test it on this strange creature. Because SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (the support vectors), it looks at the extreme cases of cats and dogs. On the basis of the support vectors, it classifies the creature as a cat.
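A minimal scikit-learn sketch of training an SVM classifier; the toy 2-D points and labels are invented, and the RBF kernel is just one possible choice:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D feature vectors for two classes (e.g., 0 = "cat", 1 = "dog")
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],    # class 0
              [6.0, 6.5], [6.5, 7.0], [7.0, 6.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.support_vectors_)        # the extreme cases that define the decision boundary
print(clf.predict([[2.5, 2.0]]))   # classify a new, ambiguous-looking example
```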
• Feature Transformation via Kernels:
The kernel trick allows SVMs to learn nonlinear decision boundaries in the original input space by implicitly mapping the input data into a higher-dimensional feature space.
Kernels allow SVMs to operate in infinite-dimensional spaces without explicitly computing the transformation, which can be computationally intractable.
• Support vectors are the key training examples that lie closest to the decision
boundary and are used to define the SVM model. The coefficients for non-
support vectors are zero, meaning they do not affect the classification.
• Only support vectors contribute to the final decision function.
• This reduces the computational cost of classifying new examples, as only the support vectors need to be evaluated.
• Advantages of the Kernel Trick:
• Nonlinear Decision Boundaries: By using kernel functions, SVMs can create
highly complex, nonlinear decision boundaries in the original space.
• Efficient Computation: Kernels often provide a computationally efficient
way to compute the decision function without explicitly mapping data into
higher dimensions.
• Convex Optimization: The optimization problem in SVMs remains convex,
ensuring that we can find a global optimum efficiently.
Limitations of Kernel Machines:
• Computational Cost: The cost of evaluating the decision function is linear
in the number of training examples because each example contributes to the
decision function.
• Training Time: Training SVMs with large datasets can be computationally
expensive, especially when using complex kernels like the RBF kernel.
• Generalization: Kernel machines, especially with generic kernels,
sometimes struggle to generalize well to unseen data.
Deep Learning vs. Kernel Machines:
• Deep learning models were designed to overcome some of the limitations of
kernel machines, especially in terms of scalability and generalization.
• For instance, deep neural networks can learn hierarchical representations
from data, which allows them to perform better than kernel-based SVMs on
large-scale problems like image classification (e.g., MNIST).
Other Simple Supervised Learning Algorithms
• The k-nearest neighbors (k-NN) algorithm is a simple yet powerful non-parametric
method used for both classification and regression tasks in supervised learning. Here's a
breakdown of key concepts from the text:
Non-Parametric Nature:
• k-NN is a non-parametric algorithm, meaning it does not assume any fixed form or
distribution for the data.
• Unlike models like linear regression, which have a fixed number of parameters (weights),
k-NN doesn't have a training process in the traditional sense. Instead, it memorizes the
training data and makes predictions based on the nearest examples during testing.
How k-NN Works:
• At the testing stage, when a new input x needs to be classified or a value y predicted, the
algorithm identifies the k-nearest neighbors to x from the training data.
• For classification, it returns the most frequent label among these nearest neighbors.
• For regression, it returns the average of the output values (y) of the nearest neighbors.
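A short scikit-learn sketch of both uses of k-NN on invented 1-D data (k = 3 is an arbitrary choice):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1], [2], [3], [10], [11], [12]])      # invented 1-D features
y_class = np.array([0, 0, 0, 1, 1, 1])               # class labels
y_value = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.8])   # continuous targets

# Classification: most frequent label among the 3 nearest neighbours
print(KNeighborsClassifier(n_neighbors=3).fit(X, y_class).predict([[2.5]]))

# Regression: average target value of the 3 nearest neighbours
print(KNeighborsRegressor(n_neighbors=3).fit(X, y_value).predict([[2.5]]))
```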
Averaging for Classification:
• In classification tasks, k-NN can be thought of as averaging over one-hot
vectors. A one-hot vector is used to represent class labels, where the
position corresponding to the class is 1, and the rest are 0s. Averaging these
vectors gives a probability distribution over the possible classes.
• This allows k-NN to handle probabilistic classification by assigning a
likelihood to each class based on the neighbors' votes.
High Capacity and Bayes Error:
• High capacity means that k-NN can theoretically model very complex
relationships in the data, especially as the size of the training set increases.
• With only one nearest neighbor (1-NN), the algorithm's error converges to at most twice the Bayes error as the training set grows. The Bayes error rate is the minimum possible error any classifier can achieve, given the inherent noise in the data.
• As the training set grows infinitely large, k-NN can approach the Bayes error rate by averaging over all nearest neighbors rather than choosing one at random, leading to highly accurate predictions.
Weaknesses of k-NN:
• High computational cost: During testing, k-NN must calculate the distance
between the test point and every training example, making it slow and
computationally expensive for large datasets.
• Poor generalization on small datasets: With a limited amount of training
data, k-NN may perform poorly because it heavily relies on the proximity of
training examples, and small training sets may not capture the underlying
patterns in the data.
• Feature sensitivity: k-NN cannot distinguish between more and less
important features. For example, if only one feature is relevant to the output,
but the dataset has many irrelevant features, k-NN may get "confused" by the
irrelevant features. This is because distance calculations will be affected by
all features, even those that are irrelevant to the target.
Decision Tree
Decision tree uses the tree
representation to solve the
problem in which each leaf
node corresponds to a class
label and attributes are
represented on the internal
node of the tree. We can
represent any boolean function
on discrete attributes using the
decision tree.
Diagrams describing how a decision tree works.
(Top) Each node of the tree chooses to send the input example to the child node on the left (0) or the child node on the right (1). Internal nodes are drawn as circles and leaf nodes as squares. Each node is displayed with a binary string identifier corresponding to its position in the tree, obtained by appending a bit to its parent identifier (0 = choose left or top, 1 = choose right or bottom). (Bottom) The tree
divides space into regions. The 2D plane shows how
a decision tree might divide R2. The nodes of the tree
are plotted in this plane, with each internal node
drawn along the dividing line it uses to categorize
examples, and leaf nodes drawn in the center of the
region of examples they receive. The result is a
piecewise-constant function, with one piece per
leaf. Each leaf requires at least one training example
to define, so it is not possible for the decision tree to
learn a function that has more local maxima than the
number of training examples.
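A brief scikit-learn sketch of fitting a decision tree classifier to the Iris data used earlier; the depth limit is an arbitrary choice to keep the printed tree small:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

# Each internal node tests a feature; each leaf assigns a class label
print(export_text(tree, feature_names=iris.feature_names))
print(tree.predict(iris.data[:3]))
```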
Unsupervised Learning Algorithms
Features without Supervision:
• Unsupervised learning algorithms only use features from the data, without
needing human-annotated labels or targets.
• These algorithms often deal with tasks such as density estimation,
denoising, manifold learning, and clustering.
Representation Learning:
• One of the core goals of unsupervised learning is to find the "best"
representation of the data.
• The term "best" refers to how much information about the data is retained in
a simpler or more accessible form.
Types of Representations: Three common representations in unsupervised
learning are:
• Lower-dimensional representations: Compress the data into fewer
dimensions, preserving as much information as possible. These help in
reducing complexity while keeping key features intact.
• Sparse representations: Involve representations where most of the values
are zero, capturing important information while ignoring irrelevant or
redundant features. Typically, this increases the dimensionality but
emphasizes sparsity in the data structure.
• Independent representations: Aim to disentangle the factors of variation
within the data, ensuring that the dimensions are statistically independent.
Interconnections between Representations:
• These types of representations are often intertwined. For example, low-
dimensional representations typically reduce dependencies and
redundancies in the data. Similarly, sparse representations may lead to
independent representations by isolating key factors of variation
Importance of Representation in Deep Learning:
• Representation learning is central to deep learning, as models aim to learn meaningful
representations of the data. Unsupervised learning plays a key role in this process,
enabling models to capture underlying patterns and structures without needing
labeled data.

Applications of Unsupervised Learning:
• Clustering: Grouping similar data points together.
• Dimensionality reduction: Techniques like PCA (Principal Component
Analysis) and t-SNE that reduce data complexity.
• Anomaly detection: Identifying data points that differ significantly from the
norm.
• Generative modeling: Learning to generate new samples from the data
distribution (e.g., GANs).
Principal Component Analysis (PCA)
PCA is a popular unsupervised learning algorithm used for dimensionality
reduction. It helps in compressing large datasets by transforming them into a
simpler form, capturing the most important patterns and structures in the data.
• Lower Dimensionality Representation:
• PCA helps reduce the complexity of the data by creating a new version of it with
fewer dimensions (variables), while still keeping most of the important information.
• In this new representation, the different components (or variables) are not linearly
correlated, meaning they don't influence each other in a straight-line relationship.
• Step Toward Independence:
• By removing the linear relationships between variables (making them
uncorrelated), PCA moves toward creating variables that are independent. But to
make them fully independent, more advanced techniques are needed to remove
nonlinear relationships too.
• Left side (Original Data): The data points in their original form (x) have variation in different directions, but this variation isn't aligned with the current axes (like the x- and y-axis). The variation happens in some other direction that doesn't match these straight lines.
• Right side (Transformed Data): After applying PCA, the data is transformed into a new space. The new axes (z₁ and z₂) are aligned with the directions where the data varies the most. The biggest variation is along the first new axis (z₁), and the second biggest variation is along the second axis (z₂).
• Linear Transformation in PCA:
• PCA uses a mathematical operation called a linear transformation to rotate
the data in such a way that it finds the directions where the data varies the
most. This transformation projects the original data (x) onto new variables (z)
as seen in the figure.
• One-Dimensional Representation (Principal Component):
• PCA can reduce the data to just one dimension (a line) that best represents
the original data. This line is called the first principal component, and it
captures the largest amount of variation in the data.
• This ability to reduce the data to fewer dimensions while still retaining most
of the important information is why PCA is a powerful dimensionality
reduction method.
• PCA Decorrelates Data:
• In the following section, the text will explain how PCA not only reduces the
dimensionality but also decorrelates the data. This means it transforms the
data so that the new variables (z) don't have any linear relationships with
each other.
• We are dealing with Principal Component Analysis (PCA), which helps
reduce the dimensionality of data while preserving important information.
• The goal is to transform the original data x into a new representation z, such
that the new variables are uncorrelated and arranged according to the
direction of maximum variance.
• We are working with a design matrix X of size m×n , where m is the number
of samples, and n is the number of features or dimensions in the data.
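A small scikit-learn sketch of PCA on randomly generated, correlated 2-D data; both the data generation and the choice to keep a single component are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# An m x n design matrix X with two correlated features (m = 200 samples, n = 2)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 0.5 * t + 0.1 * rng.normal(size=(200, 1))])

pca = PCA(n_components=1)
Z = pca.fit_transform(X)               # 1-D representation along the first principal component
print(pca.explained_variance_ratio_)   # share of the variance captured by z1
print(X.shape, Z.shape)                # (200, 2) -> (200, 1)
```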
k-means Clustering
• K-Means Clustering is an Unsupervised Learning algorithm, which
groups the unlabeled dataset into different clusters. Here K defines the
number of pre-defined clusters that need to be created in the process,
as if K=2, there will be two clusters, and for K=3, there will be three
clusters, and so on.
• Clustering: It divides the data into k clusters (where "k" is the number of groups
you want to create). Each data point is assigned to the nearest cluster, and all the
points in one cluster are similar to each other.
• One-Hot Representation: After clustering, each data point is represented by a
one-hot code, a vector that is mostly zeros except for a "1" at the position of the
cluster it belongs to. For example, if the data point is in cluster 3, the one-hot
vector would look like this: [0, 0, 1, 0, 0].
• Sparse Representation: This one-hot code is called a sparse representation
because most of its entries are zero. A sparse representation is useful because it is
simple and makes it clear which cluster a point belongs to.
Process of K-means:
• Start by randomly choosing k points (centroids) to represent the clusters.
• Repeat two steps until the clusters stop changing:
• Assign each data point to the nearest centroid (the one it’s closest to).
• Update each centroid to be the average (mean) of all the points assigned to
it.
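A brief scikit-learn sketch of k-means with k = 2 on toy 2-D points, followed by the one-hot (sparse) encoding of the resulting cluster labels; the points are invented:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],    # one group of points
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])   # another group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two centroids

# One-hot (sparse) representation of each point's cluster
one_hot = np.eye(2)[kmeans.labels_]
print(one_hot)
```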
• Clustering Problem: One challenge with clustering is that there's no perfect
way to measure how well the clusters reflect the real world. We can measure
how close data points are to their centroid, but it’s hard to know if the clusters
really represent meaningful groups in reality
Example:
Imagine a dataset of images with red trucks, red cars, gray trucks, and gray cars. If
we ask the algorithm to make two clusters:
• One algorithm might group by vehicle type (cars and trucks).
• Another algorithm might group by color (red vehicles and gray vehicles).
• If we allow the algorithm to choose more clusters, it could create four clusters: red
cars, red trucks, gray cars, and gray trucks. However, in this case, we lose
information about the similarity between red and gray cars (even though both are
cars, they are now in different clusters).
Why Distributed Representation is Better
A one-hot representation (like in k-means) tells us which cluster a point
belongs to but doesn’t capture all the similarities between data points. A
distributed representation is more flexible. Instead of assigning each point
to just one cluster, we can describe data with multiple attributes. For
example, for vehicles:
• One attribute could represent the color (red or gray).
• Another could represent the type (car or truck).
This allows us to compare data points based on multiple features, giving us a
more detailed and meaningful way to understand similarities between
objects.