Module 4

The document explains key concepts in information theory, including information content, entropy, and cross-entropy, which quantify the amount of surprise or uncertainty associated with events and probability distributions. It also discusses their applications in machine learning, particularly in classification tasks, and introduces information gain and mutual information as measures for feature selection and variable dependence. Additionally, it covers Natural Language Generation (NLG), a process that produces human-readable language from data, highlighting its reliance on computational linguistics and natural language processing.


Entropy & Cross Entropy

Information Content
Information content, also known as self-information, is a
fundamental concept in information theory that quantifies the
amount of surprise or unexpectedness associated with an event or
outcome. It represents the minimum number of bits needed to
encode or communicate the occurrence of the event.

The key idea behind information content is that rare or improbable events convey more information than common or expected events.
For example, if you toss a fair coin and it lands heads up, the
information content of this event is lower compared to the
information content of the coin landing on its edge (an extremely
rare event).

The formula for calculating the information content of an event “X” with probability “P(X)” is:

Information(X) = -log2(P(X))

Where:

• “log2” is the base-2 logarithm.

• “P(X)” is the probability of event “X” occurring.


The negative sign ensures that information content is non-negative (since probabilities are at most 1, the logarithm itself is never positive). The base-2 logarithm means the result is measured in bits: the information content of an event is the number of bits needed to represent that event’s occurrence in a binary system.
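
As a quick illustration, here is a minimal Python sketch of the formula above; the probabilities used are invented for the example.

```python
import math

def self_information(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

print(self_information(0.5))    # fair coin lands heads: 1.0 bit
print(self_information(0.001))  # a rare event (p = 0.001): ~9.97 bits

# Additivity for independent events: two fair coin flips carry 1 + 1 = 2 bits,
# the same as the information of their joint outcome (probability 0.25).
print(self_information(0.5) + self_information(0.5))  # 2.0
print(self_information(0.25))                         # 2.0
```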

Key points to note about information content:

1. Rare events have higher information content because their probability is low, requiring more bits to communicate them.

2. Common events have lower information content because they are expected and convey less surprise.

3. Information content is additive: if two events are independent, the information content of their joint occurrence is the sum of their individual information contents.

4. Information content provides a way to measure uncertainty or the level of surprise associated with a particular event.

Entropy
Entropy is a fundamental concept in information theory,
thermodynamics, and probability theory. In the context of
information theory, entropy measures the average amount of
uncertainty or surprise associated with a random variable or
probability distribution. It provides a way to quantify the amount of
information required to describe or represent a set of outcomes.
In information theory, entropy is often denoted by “H” and is
calculated using the probabilities of the various outcomes of a
random variable. For a discrete random variable “X” with
probability distribution “P(X)”, the entropy “H(X)” is given by the
formula:

H(X) = -Σ [ P(x) * log2(P(x)) ]

Where:

• “Σ” represents the sum taken over all possible values of “X”.

• “P(x)” is the probability of the event “x” occurring.

• “log2” is the base-2 logarithm.

Key points of entropy:

1. Measuring Uncertainty: Entropy quantifies the level of uncertainty or randomness associated with a probability distribution. High entropy implies high uncertainty, while low entropy implies low uncertainty.

2. Equally Likely Outcomes: Entropy is maximized when all possible outcomes are equally likely, since uncertainty is greatest when no option is more probable than any other.

3. Coding Efficiency: Entropy gives the minimum average number of bits needed to encode the outcomes of a random variable. Efficient coding schemes allocate fewer bits to more probable outcomes and more bits to less probable outcomes.

4. Information Content: Entropy is the average information content per outcome. It can be thought of as a measure of surprise: a low-entropy distribution has less surprising outcomes, while a high-entropy distribution has more surprising outcomes.

5. Unit of Measurement: Entropy is typically measured in bits (when using base-2 logarithms) or nats (when using natural logarithms). One bit of entropy represents the amount of information needed to make a binary choice between two equally likely alternatives.

6. Application in Machine Learning: In machine learning, entropy is often used to measure the impurity or disorder of a set of data points. It is commonly used in decision tree algorithms to determine the best attribute for splitting data.
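
Tying these points back to the formula, here is a minimal Python sketch; the example distributions are invented for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy H(X) in bits for a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit (maximum uncertainty for 2 outcomes)
print(entropy([0.9, 0.1]))   # biased coin: ~0.47 bits (less uncertainty)
print(entropy([0.25] * 4))   # four equally likely outcomes: 2.0 bits
```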

Cross-Entropy
Cross-entropy is a concept used to compare two probability
distributions and quantify the difference between them. It is a
fundamental concept in information theory and is widely applied in
machine learning, particularly in tasks involving classification and
probabilistic modeling.

Using cross-entropy as a loss function is a common practice in machine learning, particularly in classification tasks. It measures the difference between the predicted probability distribution and the true distribution (one-hot encoded labels).

In the context of comparing two probability distributions “P(X)” and “Q(X)”, where “P(X)” represents the true distribution (e.g., actual labels) and “Q(X)” represents the predicted or estimated distribution (e.g., the model’s probabilities), the cross-entropy “H(P, Q)” is calculated using the formula:

H(P, Q) = -Σ [ P(x) * log2(Q(x)) ]

Where:

• “Σ” represents the sum taken over all possible values of “X”.

• “P(x)” is the probability of the event “x” occurring according to the true distribution.

• “Q(x)” is the probability of the event “x” occurring according to the predicted distribution.

• “log2” is the base-2 logarithm.

Key points to understand about cross-entropy:

1. Measuring Difference: Cross-entropy measures how well the predicted distribution (“Q(X)”) approximates the true distribution (“P(X)”). It quantifies the difference between the two distributions in terms of information content.

2. Minimization Objective: In machine learning, especially in tasks like classification, the goal is often to minimize the cross-entropy. This effectively aims to make the predicted distribution as close as possible to the true distribution.

3. Relationship with Entropy: Cross-entropy is related to the entropy of the true distribution (“P(X)”), but it accounts for the differences introduced by the predicted distribution (“Q(X)”). When the predicted distribution perfectly matches the true distribution, the cross-entropy becomes equal to the entropy of the true distribution.

4. Loss Function: In machine learning, cross-entropy is commonly used as a loss function to optimize models during training. The cross-entropy loss penalizes larger discrepancies between predicted and true distributions, encouraging the model to make more accurate predictions.

5. Numerical Stability: In practice, probabilities close to zero can lead to undefined logarithms, so it is common to use a small positive constant (epsilon) to ensure numerical stability in the calculation of cross-entropy.

6. Applications: Cross-entropy is used in various machine learning algorithms, including logistic regression, neural networks, and decision trees. It is particularly useful in classification, where the goal is to predict the probability distribution over classes.
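
Following these points, here is a minimal Python sketch of the calculation; the one-hot label and the two predicted distributions are invented for illustration, and the small eps constant is the epsilon mentioned in point 5.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(P, Q) in bits between a true and a predicted distribution."""
    return -sum(pi * math.log2(max(qi, eps)) for pi, qi in zip(p, q))

true_dist = [1.0, 0.0, 0.0]      # one-hot encoded label: the true class is class 0
good_pred = [0.9, 0.05, 0.05]    # confident, correct prediction
poor_pred = [0.2, 0.5, 0.3]      # mostly wrong prediction

print(cross_entropy(true_dist, good_pred))  # ~0.15 bits (low loss)
print(cross_entropy(true_dist, poor_pred))  # ~2.32 bits (high loss)
```
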
Example
Let’s consider an example involving email classification. Imagine
you’re building a spam filter, and you want to use information
theory concepts like information content, entropy, and cross-
entropy to improve its performance.

Scenario: Email Spam Filter

Information Content

In the context of your spam filter, information content would help you quantify the significance of certain words or phrases in an email.
For instance, the word “viagra” might have a high information
content because it’s rare in legitimate emails but often appears in
spam. On the other hand, common words like “the” or “and” might
have low information content.

Entropy

Entropy would help you evaluate the uncertainty or disorder within your dataset of emails. Let’s say you have a large dataset of both
spam and legitimate emails. You can calculate the entropy of the
dataset’s label distribution to determine how well-balanced it is. If
you have almost an equal number of spam and legitimate emails, the
entropy would be high, indicating significant uncertainty in
predicting the class of a new email.

Cross-Entropy
Now, let’s consider the predicted probabilities from your spam filter.
You’ve trained a machine learning model to predict the probability
that an email is spam. For each email, the model assigns a
probability distribution (predicted probabilities for spam and not
spam). You can calculate the cross-entropy between the predicted
distribution and the true distribution (actual label) for a set of
emails. A low cross-entropy indicates that your model’s predictions
are close to the actual labels, while a high cross-entropy suggests
differences.

If your cross-entropy is high, it means that the model’s predicted distribution doesn’t match the true distribution well. This could
signal areas where your model needs improvement. By minimizing
cross-entropy during training, your model would learn to make
more accurate predictions and reduce the differences between
predicted and actual distributions.

Application Steps:

1. Data Preparation: Gather a labeled dataset of emails, with each email labeled as spam or legitimate.

2. Information Content: Calculate the information content of words or phrases in the emails using their occurrence frequencies. Identify high-information-content terms that are likely indicative of spam.

3. Entropy: Calculate the entropy of the email labels to assess the uncertainty in the dataset.

4. Model Training: Train a machine learning model (e.g., logistic regression, neural network) using features derived from the emails (word frequencies, etc.).

5. Cross-Entropy: Evaluate your model’s performance using cross-entropy on a validation or test dataset. Minimize cross-entropy during training to improve the model’s accuracy (a minimal sketch follows this list).

6. Filtering Emails: Deploy your trained model as a spam filter. When a new email arrives, the model predicts its probability of being spam. If the predicted probability is high, the email is flagged as spam.
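
A minimal sketch of steps 4 and 5 is shown below. It assumes scikit-learn is available and uses a tiny invented inline dataset; for brevity it evaluates on the training data, whereas in practice you would use a held-out validation or test set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Invented toy dataset: 1 = spam, 0 = legitimate.
emails = [
    "cheap viagra offer now", "win money fast", "free prize claim today",
    "lunch meeting tomorrow", "project report attached", "see you at the office",
]
labels = [1, 1, 1, 0, 0, 0]

# Step 4: train a model on word-frequency features.
features = CountVectorizer().fit_transform(emails)
model = LogisticRegression().fit(features, labels)

# Step 5: evaluate with cross-entropy (log loss, in nats); lower is better.
predicted_probs = model.predict_proba(features)
print("cross-entropy:", log_loss(labels, predicted_probs))
```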

What is information gain?

• Information Gain (IG) is a measure used in decision trees to quantify the effectiveness of a feature in splitting the dataset into classes. It calculates the reduction in entropy (uncertainty) of the target variable (class labels) when a particular feature is known.

• In simpler terms, Information Gain helps us understand how much a particular feature contributes to making accurate predictions in a decision tree. Features with higher Information Gain are considered more informative and are preferred for splitting the dataset, as they lead to nodes with more homogeneous classes.
IG(D, A) = H(D) − H(D|A)

Where:

• IG(D, A) is the Information Gain of feature A concerning dataset D.

• H(D) is the entropy of dataset D.

• H(D|A) is the conditional entropy of dataset D given feature A.
1. Entropy H(D)
H(D) = -Σi=1..n [ P(xi) * log2(P(xi)) ]

• n represents the number of different outcomes in the dataset.

• P(xi) is the probability of outcome xi occurring.
2. Conditional Entropy H(D|A)
H(D|A) = Σj=1..m [ P(aj) * H(D|aj) ]

• P(aj) is the probability of feature value aj in feature A, and

• H(D|aj) is the entropy of dataset D given that feature A has value aj.
Advantages of Information Gain (IG)

• Simple to Compute: IG is straightforward to calculate, making it easy to implement in machine learning algorithms.

• Effective for Feature Selection: IG is particularly useful in decision tree algorithms for selecting the most informative features, which can improve model accuracy and reduce overfitting.

• Interpretability: The concept of IG is intuitive and easy to understand, as it measures how much knowing a feature reduces uncertainty in predicting the target variable.

Limitations of Information Gain (IG)

• Ignores Feature Interactions: IG treats features independently and may not consider interactions between features, potentially missing important relationships that could improve model performance.

• Biased Towards Features with Many Categories: Features with a large number of categories or levels may have higher IG simply due to their granularity, leading to bias in feature selection towards such features.
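
Before moving on to Mutual Information, here is a small Python sketch that ties the two formulas above together and computes IG(D, A) for a single feature; the toy "outlook"/"play" data is invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """H(D): entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(D, A) = H(D) - H(D|A) for one feature, given per-example values and labels."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [lbl for fv, lbl in zip(feature_values, labels) if fv == value]
        conditional += (len(subset) / n) * entropy(subset)   # P(aj) * H(D|aj)
    return entropy(labels) - conditional

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(information_gain(outlook, play))   # ~0.667 bits
```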

What is Mutual Information?

Mutual Information (MI) is a measure of the mutual dependence between two random variables. In the context of machine learning, MI quantifies the amount of information obtained about one variable through the other variable. It is a non-negative value that indicates the degree of dependence between the variables: the higher the MI, the greater the dependence.
I(X; Y) = Σx∈X Σy∈Y [ P(x, y) * log( P(x, y) / (P(x) * P(y)) ) ]

Where:

• P(x, y) is the joint probability of X and Y.

• P(x) and P(y) are the marginal probabilities of X and Y respectively.

Advantages of Mutual Information (MI)

• Captures Nonlinear Relationships: MI can capture both linear and nonlinear relationships between variables, making it suitable for identifying complex dependencies in the data.

• Versatile: MI can be used in various machine learning tasks such as feature selection, clustering, and dimensionality reduction, providing valuable insights into the relationships between variables.

• Handles Continuous and Discrete Variables: MI is effective for both continuous and discrete variables, making it applicable to a wide range of datasets.
Limitations of Mutual Information (MI)

• Sensitive to Feature Scaling: MI can be sensitive to feature scaling, where the magnitude or range of values in different features may affect the calculated mutual information values.

• Affected by Noise: MI may be influenced by noise or irrelevant features in the dataset, potentially leading to overestimation or underestimation of the true dependencies between variables.

• Computational Complexity: Calculating MI for large datasets with many features can be computationally intensive, especially when dealing with high-dimensional data.
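
Before comparing IG and MI, here is a small Python sketch that computes I(X; Y) from a joint probability table; the 2x2 distributions are invented, and the base-2 logarithm is used so the result is in bits.

```python
import math

def mutual_information(joint):
    """I(X; Y) in bits, given a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal P(x)
    py = [sum(col) for col in zip(*joint)]      # marginal P(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

dependent   = [[0.4, 0.1],
               [0.1, 0.4]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]

print(mutual_information(dependent))     # ~0.28 bits: X tells us something about Y
print(mutual_information(independent))   # 0.0 bits: no dependence
```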

Difference between Information Gain vs Mutual Information

Definition
• Information Gain (IG): Measures the reduction in uncertainty of the target variable when a feature is known.
• Mutual Information (MI): Measures the mutual dependence between two variables, indicating how much information one variable provides about the other.

Focus
• IG: Individual feature importance.
• MI: Mutual dependence and information exchange between variables.

Usage
• IG: Commonly used in decision trees for feature selection.
• MI: Versatile application in feature selection, clustering, and dimensionality reduction.

Interactions
• IG: Ignores feature interactions.
• MI: Considers interactions between variables, capturing complex relationships.

Applicability
• IG: Effective for discrete features with clear categories.
• MI: Suitable for both continuous and discrete variables, capturing linear and nonlinear relationships.

Computation
• IG: Simple to compute.
• MI: Can be computationally intensive for large datasets or high-dimensional data.

What is Natural Language Generation?

Natural Language Generation, otherwise known as NLG, is a software process driven by artificial intelligence that produces natural written or spoken language from structured and unstructured data. It helps computers respond to users in human language that they can comprehend, rather than in a way a computer might.

For example, NLG can be used after analysing customer input (such as commands to
voice assistants, queries to chatbots, calls to help centres or feedback on survey forms)
to respond in a personalised, easily-understood way. This makes human-seeming
responses from voice assistants and chatbots possible.

It can also be used for transforming numerical data input and other complex data into
reports that we can easily understand. For example, NLG might be used to generate
financial reports or weather updates automatically.
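
The simplest way to picture this data-to-text idea is a template filled from structured fields, as in the toy Python sketch below; the field names and wording are invented, and real NLG systems go well beyond fixed templates.

```python
# Structured input data (invented for the example).
weather = {"city": "Pune", "condition": "partly cloudy", "high_c": 31, "low_c": 22}

# A fixed sentence template that the data is slotted into.
template = ("Today in {city} it will be {condition}, "
            "with a high of {high_c}°C and a low of {low_c}°C.")

print(template.format(**weather))
# Today in Pune it will be partly cloudy, with a high of 31°C and a low of 22°C.
```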

Natural Language Generation technology is enabled through a range of computer science processes. These include:

Computational linguistics

The scientific understanding of written and spoken language from the perspective of
computer-based analysis. This involves breaking down written or spoken dialogue and
creating a system of understanding that computer software can use. It uses semantic
and grammatical frameworks to help create a language model system that computers
can utilise to accurately analyse our speech.

Natural Language Processing (NLP)

Natural Language Processing (NLP) is the actual application of computational linguistics to written or spoken human language. NLG is classified as a sub-category of Natural Language Processing.

Natural Language Understanding (NLU)


Natural Language Understanding (NLU) tries to determine not just the words or phrases
being said, but the emotion, intent, effort or goal behind the speaker’s communication. It
takes the understanding a step further and makes the analysis more akin to a human’s
understanding of what is being said. Natural Language Understanding takes machine
learning to a deeper level to help make comprehension even more detailed.

How does NLG work?


NLG is a multi-stage process, with each step further refining the data being
used to produce content with natural-sounding language. The six stages of
NLG are as follows:

1. Content analysis. Data is filtered to determine what should be included in the content produced at the end of the process. This stage includes identifying the main topics in the source document and the relationships between them.

2. Data understanding. The data is interpreted, patterns are identified and it is put into context. Machine learning is often used at this stage.

3. Document structuring. A document plan is created and a narrative structure is chosen based on the type of data being interpreted.

4. Sentence aggregation. Relevant sentences or parts of sentences are combined in ways that accurately summarize the topic.

5. Grammatical structuring. Grammatical rules are applied to generate natural-sounding text. The program deduces the syntactical structure of the sentence and then uses this information to rewrite the sentence in a grammatically correct manner.

6. Language presentation. The final output is generated based on a template or format the user or programmer has selected.

What does using Natural Language Generation entail?

There are a few different ways in which Natural Language Generation can work, but the most common methods are called extractive and abstractive.

An extractive approach takes a large body of text, pulls out sentences that are most
representative of key points, and combines them in a grammatically accurate way to
generate a summary of the larger text.
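
As a rough illustration of the extractive idea, the toy Python sketch below scores sentences by word frequency and keeps the highest-scoring ones; this is a simplistic heuristic, not how production summarisers work.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Pick the sentences whose words occur most often in the text overall."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
                    reverse=True)
    chosen = set(ranked[:num_sentences])
    return " ".join(s for s in sentences if s in chosen)  # keep original order

print(extractive_summary(
    "NLG turns data into text. Extractive methods reuse existing sentences. "
    "Abstractive methods write new sentences. Both aim to capture the key points."))
```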

An abstractive approach creates novel text by identifying key concepts and then
generating new language that attempts to capture the key points of a larger body of text
intelligibly.

Whichever approach is used, Natural Language Generation involves multiple steps to understand human language, analyse for insights and generate responsive text.

Natural Language Generation in six steps

Data analysis

First, data (both structured data like financial information and unstructured data like
transcribed call audio) must be analysed. The data is filtered, to make sure that the end
text that is generated is relevant to the user’s needs, whether it’s to answer a query or
generate a specific report. At this stage, your NLG tools will pick out the main topics in
your source data and the relationships between each topic.

Data understanding

Here is where Natural Language Processing, machine learning and a language model come in. Your software identifies the patterns in the data and, based on its algorithmic training, is able to interpret what is being said and the context of these statements. For numerical data or other types of non-textual data, your software spots the data it has been taught to recognise and is able to understand how it relates to actual text.
Document creation and structuring

At this stage, your NLG solutions are working to create data-driven narratives based on
the data being analysed and the result you’ve requested (report, chat response etc.). A
subsequent document plan is created.

Sentence aggregation

Sentences and parts of sentences that have been identified as relevant are put together
to summarise the information to be presented.

Grammatical structuring

Your software begins generating its text, using natural language grammatical rules to make the text read the way we naturally understand language.

Language presentation

Finally, the software will create the final output in whatever format the user has chosen.
As mentioned, this could be in the form of a report, a customer-directed email or a voice
assistant response.
