Module 4

The document explains key concepts in information theory, including information content, entropy, and cross-entropy, which quantify the amount of surprise or uncertainty associated with events and probability distributions. It also discusses their applications in machine learning, particularly in classification tasks, and introduces information gain and mutual information as measures for feature selection and variable dependence. Additionally, it covers Natural Language Generation (NLG), a process that produces human-readable language from data, highlighting its reliance on computational linguistics and natural language processing.


Entropy & Cross Entropy

Information Content
Information content, also known as self-information, is a
fundamental concept in information theory that quantifies the
amount of surprise or unexpectedness associated with an event or
outcome. It represents the minimum number of bits needed to
encode or communicate the occurrence of the event.

The key idea behind information content is that rare or improbable events convey more information than common or expected events.
For example, if you toss a fair coin and it lands heads up, the
information content of this event is lower compared to the
information content of the coin landing on its edge (an extremely
rare event).

The formula for calculating the information content of an event “X” with probability “P(X)” is:

Information(X) = -log2(P(X))

Where:

• “log2” is the base-2 logarithm.

• “P(X)” is the probability of event “X” occurring.


The negative sign ensures that information content is non-negative (since probabilities are at most 1, the logarithm itself is never positive). The base-2 logarithm means the result is measured in bits: the information content of an event is the number of bits needed to represent that event’s occurrence in a binary system.
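
As a quick illustration, here is a minimal Python sketch of the formula above; the probabilities used are invented for the example.

```python
import math

def self_information(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

print(self_information(0.5))    # fair coin lands heads: 1.0 bit
print(self_information(0.001))  # a rare event (p = 0.001): ~9.97 bits

# Additivity for independent events: two fair coin flips carry 1 + 1 = 2 bits,
# the same as the information of their joint outcome (probability 0.25).
print(self_information(0.5) + self_information(0.5))  # 2.0
print(self_information(0.25))                         # 2.0
```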

Key points to note about information content:

1. Rare events have higher information content because their probability is low, requiring more bits to communicate them.

2. Common events have lower information content because they are expected and convey less surprise.

3. Information content is additive: if two events are independent, the information content of their joint occurrence is the sum of their individual information contents.

4. Information content provides a way to measure uncertainty or the level of surprise associated with a particular event.

Entropy
Entropy is a fundamental concept in information theory,
thermodynamics, and probability theory. In the context of
information theory, entropy measures the average amount of
uncertainty or surprise associated with a random variable or
probability distribution. It provides a way to quantify the amount of
information required to describe or represent a set of outcomes.
In information theory, entropy is often denoted by “H” and is
calculated using the probabilities of the various outcomes of a
random variable. For a discrete random variable “X” with
probability distribution “P(X)”, the entropy “H(X)” is given by the
formula:

H(X) = -Σ [ P(x) * log2(P(x)) ]

Where:

• “Σ” represents the sum taken over all possible values of “X”.

• “P(x)” is the probability of the event “x” occurring.

• “log2” is the base-2 logarithm.

Key points of entropy:

1. Measuring Uncertainty: Entropy quantifies the level of uncertainty or randomness associated with a probability distribution. High entropy implies high uncertainty, while low entropy implies low uncertainty.

2. Equally Likely Outcomes: Entropy is maximized when all possible outcomes are equally likely, since uncertainty is greatest when no option is more probable than any other.

3. Coding Efficiency: Entropy gives the minimum average number of bits needed to encode the outcomes of a random variable. Efficient coding schemes allocate fewer bits to more probable outcomes and more bits to less probable outcomes.

4. Information Content: Entropy is the average information content per outcome. It can be thought of as a measure of surprise: a low-entropy distribution has less surprising outcomes, while a high-entropy distribution has more surprising outcomes.

5. Unit of Measurement: Entropy is typically measured in bits (when using base-2 logarithms) or nats (when using natural logarithms). One bit of entropy represents the amount of information needed to make a binary choice between two equally likely alternatives.

6. Application in Machine Learning: In machine learning, entropy is often used to measure the impurity or disorder of a set of data points. It is commonly used in decision tree algorithms to determine the best attribute for splitting data.
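
Tying these points back to the formula, here is a minimal Python sketch; the example distributions are invented for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy H(X) in bits for a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit (maximum uncertainty for 2 outcomes)
print(entropy([0.9, 0.1]))   # biased coin: ~0.47 bits (less uncertainty)
print(entropy([0.25] * 4))   # four equally likely outcomes: 2.0 bits
```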

Cross-Entropy
Cross-entropy is a concept used to compare two probability
distributions and quantify the difference between them. It is a
fundamental concept in information theory and is widely applied in
machine learning, particularly in tasks involving classification and
probabilistic modeling.

Using cross-entropy as a loss function is a common practice in machine learning, particularly in classification tasks. It measures the difference between the predicted probability distribution and the true distribution (one-hot encoded labels).

In the context of comparing two probability distributions “P(X)” and “Q(X)”, where “P(X)” represents the true distribution (e.g., actual labels) and “Q(X)” represents the predicted or estimated distribution (e.g., the model’s probabilities), the cross-entropy “H(P, Q)” is calculated using the formula:

H(P, Q) = -Σ [ P(x) * log2(Q(x)) ]

Where:

• “Σ” represents the sum taken over all possible values of “X”.

• “P(x)” is the probability of the event “x” occurring according to the true distribution.

• “Q(x)” is the probability of the event “x” occurring according to the predicted distribution.

• “log2” is the base-2 logarithm.

Key points to understand about cross-entropy:

1. Measuring Difference: Cross-entropy measures how well the predicted distribution (“Q(X)”) approximates the true distribution (“P(X)”). It quantifies the difference between the two distributions in terms of information content.

2. Minimization Objective: In machine learning, especially in tasks like classification, the goal is often to minimize the cross-entropy. This effectively aims to make the predicted distribution as close as possible to the true distribution.

3. Relationship with Entropy: Cross-entropy is related to the entropy of the true distribution (“P(X)”), but it accounts for the differences introduced by the predicted distribution (“Q(X)”). When the predicted distribution perfectly matches the true distribution, the cross-entropy becomes equal to the entropy of the true distribution.

4. Loss Function: In machine learning, cross-entropy is commonly used as a loss function to optimize models during training. The cross-entropy loss penalizes larger discrepancies between predicted and true distributions, encouraging the model to make more accurate predictions.

5. Numerical Stability: In practice, probabilities close to zero can lead to undefined logarithms, so it is common to use a small positive constant (epsilon) to ensure numerical stability in the calculation of cross-entropy.

6. Applications: Cross-entropy is used in various machine learning algorithms, including logistic regression, neural networks, and decision trees. It is particularly useful in classification, where the goal is to predict the probability distribution over classes.
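
Following these points, here is a minimal Python sketch of the calculation; the one-hot label and the two predicted distributions are invented for illustration, and the small eps constant is the epsilon mentioned in point 5.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(P, Q) in bits between a true and a predicted distribution."""
    return -sum(pi * math.log2(max(qi, eps)) for pi, qi in zip(p, q))

true_dist = [1.0, 0.0, 0.0]      # one-hot encoded label: the true class is class 0
good_pred = [0.9, 0.05, 0.05]    # confident, correct prediction
poor_pred = [0.2, 0.5, 0.3]      # mostly wrong prediction

print(cross_entropy(true_dist, good_pred))  # ~0.15 bits (low loss)
print(cross_entropy(true_dist, poor_pred))  # ~2.32 bits (high loss)
```
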
Example
Let’s consider an example involving email classification. Imagine
you’re building a spam filter, and you want to use information
theory concepts like information content, entropy, and cross-
entropy to improve its performance.

Scenario: Email Spam Filter

Information Content

In the context of your spam filter, information content would help you quantify the significance of certain words or phrases in an email.
For instance, the word “viagra” might have a high information
content because it’s rare in legitimate emails but often appears in
spam. On the other hand, common words like “the” or “and” might
have low information content.

Entropy

Entropy would help you evaluate the uncertainty or disorder within your dataset of emails. Let’s say you have a large dataset of both
spam and legitimate emails. You can calculate the entropy of the
dataset’s label distribution to determine how well-balanced it is. If
you have almost an equal number of spam and legitimate emails, the
entropy would be high, indicating significant uncertainty in
predicting the class of a new email.

Cross-Entropy
Now, let’s consider the predicted probabilities from your spam filter.
You’ve trained a machine learning model to predict the probability
that an email is spam. For each email, the model assigns a
probability distribution (predicted probabilities for spam and not
spam). You can calculate the cross-entropy between the predicted
distribution and the true distribution (actual label) for a set of
emails. A low cross-entropy indicates that your model’s predictions
are close to the actual labels, while a high cross-entropy suggests
differences.

If your cross-entropy is high, it means that the model’s predicted distribution doesn’t match the true distribution well. This could
signal areas where your model needs improvement. By minimizing
cross-entropy during training, your model would learn to make
more accurate predictions and reduce the differences between
predicted and actual distributions.

Application Steps:

1. Data Preparation: Gather a labeled dataset of emails, with each email labeled as spam or legitimate.

2. Information Content: Calculate the information content of words or phrases in the emails using their occurrence frequencies. Identify high-information-content terms that are likely indicative of spam.

3. Entropy: Calculate the entropy of the email labels to assess the uncertainty in the dataset.

4. Model Training: Train a machine learning model (e.g., logistic regression, neural network) using features derived from the emails (word frequencies, etc.).

5. Cross-Entropy: Evaluate your model’s performance using cross-entropy on a validation or test dataset. Minimize cross-entropy during training to improve the model’s accuracy (a minimal sketch follows this list).

6. Filtering Emails: Deploy your trained model as a spam filter. When a new email arrives, the model predicts its probability of being spam. If the predicted probability is high, the email is flagged as spam.
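
A minimal sketch of steps 4 and 5 is shown below. It assumes scikit-learn is available and uses a tiny invented inline dataset; for brevity it evaluates on the training data, whereas in practice you would use a held-out validation or test set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Invented toy dataset: 1 = spam, 0 = legitimate.
emails = [
    "cheap viagra offer now", "win money fast", "free prize claim today",
    "lunch meeting tomorrow", "project report attached", "see you at the office",
]
labels = [1, 1, 1, 0, 0, 0]

# Step 4: train a model on word-frequency features.
features = CountVectorizer().fit_transform(emails)
model = LogisticRegression().fit(features, labels)

# Step 5: evaluate with cross-entropy (log loss, in nats); lower is better.
predicted_probs = model.predict_proba(features)
print("cross-entropy:", log_loss(labels, predicted_probs))
```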

What is information gain?

• Information Gain (IG) is a measure used in decision trees to quantify the effectiveness of a feature in splitting the dataset into classes. It calculates the reduction in entropy (uncertainty) of the target variable (class labels) when a particular feature is known.

• In simpler terms, Information Gain helps us understand how much a particular feature contributes to making accurate predictions in a decision tree. Features with higher Information Gain are considered more informative and are preferred for splitting the dataset, as they lead to nodes with more homogeneous classes.
IG(D, A) = H(D) − H(D|A)

Where:

• IG(D, A) is the Information Gain of feature A concerning dataset D.

• H(D) is the entropy of dataset D.

• H(D|A) is the conditional entropy of dataset D given feature A.
1. Entropy H(D)
H(D) = -Σi=1..n [ P(xi) * log2(P(xi)) ]

• n represents the number of different outcomes in the dataset.

• P(xi) is the probability of outcome xi occurring.
2. Conditional Entropy H(D|A)
H(D|A) = Σj=1..m [ P(aj) * H(D|aj) ]

• P(aj) is the probability of feature value aj in feature A, and

• H(D|aj) is the entropy of dataset D given that feature A has value aj.
Advantages of Information Gain (IG)

• Simple to Compute: IG is straightforward to calculate, making it easy to implement in machine learning algorithms.

• Effective for Feature Selection: IG is particularly useful in decision tree algorithms for selecting the most informative features, which can improve model accuracy and reduce overfitting.

• Interpretability: The concept of IG is intuitive and easy to understand, as it measures how much knowing a feature reduces uncertainty in predicting the target variable.

Limitations of Information Gain (IG)

• Ignores Feature Interactions: IG treats features independently and may not consider interactions between features, potentially missing important relationships that could improve model performance.

• Biased Towards Features with Many Categories: Features with a large number of categories or levels may have higher IG simply due to their granularity, leading to bias in feature selection towards such features.
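
Before moving on to Mutual Information, here is a small Python sketch that ties the two formulas above together and computes IG(D, A) for a single feature; the toy "outlook"/"play" data is invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """H(D): entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(D, A) = H(D) - H(D|A) for one feature, given per-example values and labels."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [lbl for fv, lbl in zip(feature_values, labels) if fv == value]
        conditional += (len(subset) / n) * entropy(subset)   # P(aj) * H(D|aj)
    return entropy(labels) - conditional

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(information_gain(outlook, play))   # ~0.667 bits
```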

What is Mutual Information?

Mutual Information (MI) is a measure of the mutual dependence between two random variables. In the context of machine learning, MI quantifies the amount of information obtained about one variable through the other variable. It is a non-negative value that indicates the degree of dependence between the variables: the higher the MI, the greater the dependence.
I(X; Y) = Σx∈X Σy∈Y [ P(x, y) * log( P(x, y) / (P(x) * P(y)) ) ]

Where:

• P(x, y) is the joint probability of X and Y.

• P(x) and P(y) are the marginal probabilities of X and Y respectively.

Advantages of Mutual Information (MI)

• Captures Nonlinear Relationships: MI can capture both linear and nonlinear relationships between variables, making it suitable for identifying complex dependencies in the data.

• Versatile: MI can be used in various machine learning tasks such as feature selection, clustering, and dimensionality reduction, providing valuable insights into the relationships between variables.

• Handles Continuous and Discrete Variables: MI is effective for both continuous and discrete variables, making it applicable to a wide range of datasets.
Limitations of Mutual Information (MI)

• Sensitive to Feature Scaling: MI can be sensitive to feature scaling, where the magnitude or range of values in different features may affect the calculated mutual information values.

• Affected by Noise: MI may be influenced by noise or irrelevant features in the dataset, potentially leading to overestimation or underestimation of the true dependencies between variables.

• Computational Complexity: Calculating MI for large datasets with many features can be computationally intensive, especially when dealing with high-dimensional data.
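
Before comparing IG and MI, here is a small Python sketch that computes I(X; Y) from a joint probability table; the 2x2 distributions are invented, and the base-2 logarithm is used so the result is in bits.

```python
import math

def mutual_information(joint):
    """I(X; Y) in bits, given a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal P(x)
    py = [sum(col) for col in zip(*joint)]      # marginal P(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

dependent   = [[0.4, 0.1],
               [0.1, 0.4]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]

print(mutual_information(dependent))     # ~0.28 bits: X tells us something about Y
print(mutual_information(independent))   # 0.0 bits: no dependence
```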

Difference between Information Gain vs Mutual Information

Definition
• Information Gain (IG): Measures the reduction in uncertainty of the target variable when a feature is known.
• Mutual Information (MI): Measures the mutual dependence between two variables, indicating how much information one variable provides about the other.

Focus
• IG: Individual feature importance.
• MI: Mutual dependence and information exchange between variables.

Usage
• IG: Commonly used in decision trees for feature selection.
• MI: Versatile application in feature selection, clustering, and dimensionality reduction.

Interactions
• IG: Ignores feature interactions.
• MI: Considers interactions between variables, capturing complex relationships.

Applicability
• IG: Effective for discrete features with clear categories.
• MI: Suitable for both continuous and discrete variables, capturing linear and nonlinear relationships.

Computation
• IG: Simple to compute.
• MI: Can be computationally intensive for large datasets or high-dimensional data.

What is Natural Language Generation?

Natural Language Generation, otherwise known as NLG, is a software process driven by artificial intelligence that produces natural written or spoken language from structured and unstructured data. It helps computers respond to users in human language that they can comprehend, rather than in a way a computer might.

For example, NLG can be used after analysing customer input (such as commands to
voice assistants, queries to chatbots, calls to help centres or feedback on survey forms)
to respond in a personalised, easily-understood way. This makes human-seeming
responses from voice assistants and chatbots possible.

It can also be used for transforming numerical data input and other complex data into
reports that we can easily understand. For example, NLG might be used to generate
financial reports or weather updates automatically.
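
The simplest way to picture this data-to-text idea is a template filled from structured fields, as in the toy Python sketch below; the field names and wording are invented, and real NLG systems go well beyond fixed templates.

```python
# Structured input data (invented for the example).
weather = {"city": "Pune", "condition": "partly cloudy", "high_c": 31, "low_c": 22}

# A fixed sentence template that the data is slotted into.
template = ("Today in {city} it will be {condition}, "
            "with a high of {high_c}°C and a low of {low_c}°C.")

print(template.format(**weather))
# Today in Pune it will be partly cloudy, with a high of 31°C and a low of 22°C.
```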

Natural Language Generation technology is enabled through a range of computer science processes. These include:

Computational linguistics

The scientific understanding of written and spoken language from the perspective of
computer-based analysis. This involves breaking down written or spoken dialogue and
creating a system of understanding that computer software can use. It uses semantic
and grammatical frameworks to help create a language model system that computers
can utilise to accurately analyse our speech.

Natural Language Processing (NLP)

Natural Language Processing (NLP) is the actual application of computational linguistics to written or spoken human language. NLG is classified as a sub-category of Natural Language Processing.

Natural Language Understanding (NLU)


Natural Language Understanding (NLU) tries to determine not just the words or phrases
being said, but the emotion, intent, effort or goal behind the speaker’s communication. It
takes the understanding a step further and makes the analysis more akin to a human’s
understanding of what is being said. Natural Language Understanding takes machine
learning to a deeper level to help make comprehension even more detailed.

How does NLG work?


NLG is a multi-stage process, with each step further refining the data being
used to produce content with natural-sounding language. The six stages of
NLG are as follows:

1. Content analysis. Data is filtered to determine what should be included in the content produced at the end of the process. This stage includes identifying the main topics in the source document and the relationships between them.

2. Data understanding. The data is interpreted, patterns are identified and it is put into context. Machine learning is often used at this stage.

3. Document structuring. A document plan is created and a narrative structure is chosen based on the type of data being interpreted.

4. Sentence aggregation. Relevant sentences or parts of sentences are combined in ways that accurately summarize the topic.

5. Grammatical structuring. Grammatical rules are applied to generate natural-sounding text. The program deduces the syntactical structure of the sentence and then uses this information to rewrite the sentence in a grammatically correct manner.

6. Language presentation. The final output is generated based on a template or format the user or programmer has selected.

What does using Natural Language Generation entail?

There are a few different ways in which Natural Language Generation can work, but the most common methods are called extractive and abstractive.

An extractive approach takes a large body of text, pulls out sentences that are most
representative of key points, and combines them in a grammatically accurate way to
generate a summary of the larger text.
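
As a rough illustration of the extractive idea, the toy Python sketch below scores sentences by word frequency and keeps the highest-scoring ones; this is a simplistic heuristic, not how production summarisers work.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Pick the sentences whose words occur most often in the text overall."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
                    reverse=True)
    chosen = set(ranked[:num_sentences])
    return " ".join(s for s in sentences if s in chosen)  # keep original order

print(extractive_summary(
    "NLG turns data into text. Extractive methods reuse existing sentences. "
    "Abstractive methods write new sentences. Both aim to capture the key points."))
```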

An abstractive approach creates novel text by identifying key concepts and then
generating new language that attempts to capture the key points of a larger body of text
intelligibly.

Whichever approach is used, Natural Language Generation involves multiple steps to understand human language, analyse for insights and generate responsive text.

Natural Language Generation in six steps

Data analysis

First, data (both structured data like financial information and unstructured data like
transcribed call audio) must be analysed. The data is filtered, to make sure that the end
text that is generated is relevant to the user’s needs, whether it’s to answer a query or
generate a specific report. At this stage, your NLG tools will pick out the main topics in
your source data and the relationships between each topic.

Data understanding

Here is where Natural Language Processing, machine learning and a language model come in. Your software identifies the patterns in the data and, based on its algorithmic training, is able to interpret what is being said and the context of these statements. For numerical data or other types of non-textual data, your software spots the data it has been taught to recognise and is able to understand how it relates to actual text.
Document creation and structuring

At this stage, your NLG solutions are working to create data-driven narratives based on
the data being analysed and the result you’ve requested (report, chat response etc.). A
subsequent document plan is created.

Sentence aggregation

Sentences and parts of sentences that have been identified as relevant are put together
to summarise the information to be presented.

Grammatical structuring

Your software begins generating its text, using natural language grammatical rules to make the text read the way we naturally understand language.

Language presentation

Finally, the software will create the final output in whatever format the user has chosen.
As mentioned, this could be in the form of a report, a customer-directed email or a voice
assistant response.
