
Finding the Structure of Documents

• In human language, words and sentences do not appear randomly but usually have a structure.

• For example, combinations of words form sentences: meaningful grammatical units such as statements, requests, and commands.

• Automatic extraction of the structure of documents helps subsequent NLP tasks such as parsing, machine translation, and semantic labelling that use sentences as the basic processing unit.

• Sentence boundary detection is the task of deciding where sentences start and end, given a sequence of characters.
Finding the Structure of Documents
• Topic segmentation is the task of determining when a topic starts and ends in a sequence of sentences.

• Statistical classification approaches try to detect the presence of sentence and topic boundaries, given human-annotated training data for segmentation.

• These methods base their predictions on features of the input and local characteristics that give evidence toward the presence or absence of a boundary.

• Features are the core of classification approaches and require careful design and selection in order to be successful and to prevent overfitting and noise problems.
Finding the Structure of Documents
• For example:

• For processing Chinese documents, the processor may need to first segment the character sequences into words, as words are usually not separated by spaces in Chinese text.

• Similarly, for morphologically rich languages, the word structure may need to be analyzed to extract additional features.

• Such processing is usually done in a pre-processing step that produces a sequence of tokens, as illustrated below.
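• As a minimal illustration of this pre-processing step, the sketch below segments an unsegmented Chinese sentence into word tokens. It assumes the third-party jieba package (an assumption for illustration; any word segmenter would do):

```python
# A minimal word-segmentation sketch, assuming the third-party
# "jieba" package is installed (pip install jieba).
import jieba

text = "我来到北京清华大学"   # no spaces separate the words
tokens = jieba.lcut(text)    # returns a list of word tokens
print(tokens)                # ['我', '来到', '北京', '清华大学']
```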
Finding the Structure of Documents
1. Sentence Boundary Detection

• Sentence boundary detection (sentence segmentation) deals with automatically segmenting a sequence of word tokens into sentence units.

• In written text, for example in English and some other languages, the beginning of a sentence is usually marked with an uppercase letter, and the end of a sentence is explicitly marked with a period (.), a question mark (?), an exclamation mark (!), or another type of punctuation.

• In addition to their role as sentence boundary markers, capitalized initial letters are used to distinguish proper nouns.
Finding the Structure of Documents
For Example:

• I spoke with Dr. Smith. My house is on Mountain Dr.

• In the first sentence, the abbreviation Dr. does not end a sentence, and in the second it does.

• An automatic method that marks word boundaries as sentence endings based only on the presence of such punctuation marks would sometimes cut sentences incorrectly, as the sketch below shows.
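• The sketch below illustrates this on the example above: a naive punctuation splitter cuts "Dr. Smith" apart, while a small rule that consults a toy abbreviation list (an illustrative assumption, not a complete system) does not:

```python
import re

TEXT = "I spoke with Dr. Smith. My house is on Mountain Dr."

# Naive approach: treat every ".", "!" or "?" followed by whitespace
# as a sentence end.
print(re.split(r"(?<=[.!?])\s+", TEXT))
# ['I spoke with Dr.', 'Smith.', 'My house is on Mountain Dr.']
# -> "Dr. Smith" is cut incorrectly, as described above.

# A small rule-based refinement: never end a sentence right after a
# known abbreviation. The abbreviation list is a toy assumption.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof."}

def rule_based_split(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        words = text[start:m.end()].split()
        if words and words[-1].lower() in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(rule_based_split(TEXT))
# ['I spoke with Dr. Smith.', 'My house is on Mountain Dr.']
```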
• Sentence Boundary Detection (SBD) is a natural language processing (NLP) task that involves identifying the boundaries between the sentences in a text.
Finding the Structure of Documents
• It is a crucial preprocessing step in many NLP applications, as
correctly segmenting a text into sentences is necessary for tasks
such as text summarization, machine translation, sentiment
analysis, and part-of-speech tagging.

• Objective of Sentence Boundary Detection:

1. The main goal of Sentence Boundary Detection is to determine where one sentence ends and the next one begins within a given text.

2. Sentence boundary detection (sentence segmentation) deals with automatically segmenting a sequence of word tokens into sentence units.
Finding the Structure of Documents
• Challenges in Sentence Boundary Detection:

• Detecting sentence boundaries can be challenging due to various factors, including:

1. Ambiguities: Sentences can end with different punctuation marks (e.g., ".", "!", "?") or even none at all.

2. Abbreviations: Periods can appear in the middle of sentences within abbreviations, making it hard to distinguish sentence boundaries.

3. Domain-Specific Text: In some domains, like legal or medical texts, sentences may have unconventional structures that are different from typical written language.
Finding the Structure of Documents
4. Multilingual Text: Different languages may have different sentence boundary conventions, making SBD more complex in multilingual contexts.

Techniques for Sentence Boundary Detection

Several techniques and approaches are used for Sentence Boundary Detection:

1. Rule-Based Methods: These methods rely on a set of rules and heuristics to determine sentence boundaries. Rules may consider punctuation marks, abbreviations, and contextual information.
Finding the Structure of Documents
2. Machine Learning: Machine learning techniques, such as sequence labeling using Conditional Random Fields (CRFs) or Recurrent Neural Networks (RNNs), can be trained to predict sentence boundaries based on labeled training data.

3. Language-Specific Models: Some languages may require language-specific models and resources (e.g., dictionaries, grammatical rules) to perform accurate Sentence Boundary Detection.

4. Pre-trained Models: Pre-trained language models, like BERT or GPT, can be fine-tuned for Sentence Boundary Detection tasks, leveraging their contextual understanding of text.
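• To make the machine-learning option concrete, the sketch below shows the kind of features a statistical boundary classifier might extract for each candidate; the feature names and the toy abbreviation list are illustrative assumptions, not a specific published system. The resulting feature dictionaries would then be fed to a classifier such as a CRF or logistic regression trained on labeled boundaries:

```python
# A minimal sketch of feature extraction for a statistical
# sentence-boundary classifier (all names are illustrative).
def boundary_features(tokens, i):
    """Features for the candidate boundary after tokens[i]."""
    word = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "<END>"
    return {
        "ends_with_period": word.endswith("."),
        "is_known_abbreviation": word.lower() in {"dr.", "mr.", "prof."},
        "next_is_capitalized": nxt[:1].isupper(),
        "word_length": len(word),
    }

tokens = "I spoke with Dr. Smith .".split()
print(boundary_features(tokens, 3))   # candidate boundary after "Dr."
# {'ends_with_period': True, 'is_known_abbreviation': True,
#  'next_is_capitalized': True, 'word_length': 3}
```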
Finding the Structure of Documents
• Topic Boundary Detection (topic segmentation)
• Topic boundary detection in NLP is the task of
identifying where one topic ends and another begins in a
text.
• Given a sequence of (written or spoken) words, the aim
of topic segmentation is to find the boundaries where
topics change.
• Topics can be expressed in a variety of ways, including
explicitly (e.g., using topic headings or keywords) or
implicitly (e.g., through the use of discourse markers or
logical transitions).
Finding the Structure of Documents
• Objective of Topic Boundary Detection:

1. The primary goal of Topic Boundary Detection is to identify points in a text where the topic or subject matter changes significantly.

2. These boundaries can be identified at the sentence, paragraph, or document level, depending on the specific application.

• Topic segmentation is an important task for various language understanding applications, such as information retrieval, machine translation, and text summarization.
Finding the Structure of Documents
• For example:

• In information retrieval, if a long document can be segmented into shorter, topically coherent segments, then only the segment that is about the user's query could be retrieved.

• Challenges in Topic Boundary Detection:

1. Topic Ambiguity: It can be difficult to define what constitutes a topic, and different people may have different interpretations of the same text.

2. Language Variations: Different languages and writing styles may have distinct ways of signaling topic changes.
Finding the Structure of Documents
• Several techniques and approaches are used for Topic Boundary Detection:

1. Rule-Based Methods: Rule-based methods use predefined rules and heuristics to identify topic boundaries.

2. Machine Learning: Machine learning techniques, such as supervised classification or sequence labeling, can be trained to predict topic boundaries based on labeled training data.

3. Topic Models: Topic modelling techniques, like Non-negative Matrix Factorization (NMF), can be used to identify topic transitions based on changes in topic distributions within a document or corpus.
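• As a minimal illustration (in the spirit of lexical-cohesion methods such as TextTiling, an assumption rather than one of the methods named above), the sketch below hypothesizes a topic boundary wherever the word overlap between adjacent sentences drops below a threshold:

```python
# A minimal lexical-cohesion sketch: topic boundaries are hypothesized
# where the cosine similarity between the bag-of-words of adjacent
# sentences falls below a (toy) threshold.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def topic_boundaries(sentences, threshold=0.1):
    bows = [Counter(s.lower().split()) for s in sentences]
    # Score the gap between each adjacent pair of sentences.
    return [i + 1 for i in range(len(bows) - 1)
            if cosine(bows[i], bows[i + 1]) < threshold]

sents = ["The match ended in a draw.",
         "The match referee praised both teams.",
         "Inflation rose sharply last quarter.",
         "Central banks reacted to the inflation data."]
print(topic_boundaries(sents))  # [2]: a boundary before the 3rd sentence
```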
Methods for Sentence segmentation and Topic segmentation

• Sentence segmentation and topic segmentation have been considered as boundary classification problems.

• Given a boundary candidate (between two word tokens for sentence segmentation, and between two sentences for topic segmentation), the goal is to predict whether or not the candidate is an actual boundary (sentence or topic boundary).

• Formally, let x ∈ X be the vector of features (the observation) associated with a candidate, and y ∈ Y be the label predicted for that candidate.

• The label y can be b for boundary and b̄ for non-boundary.


Methods for Sentence segmentation and topic segmentation

• Classification problem: Given a set of training examples (x, y), find a function that will assign the most accurate possible label y to unseen examples x.

• As an alternative to the binary classification problem, it is possible to model boundary types using finer-grained categories.

• For text, segmentation can be framed as a three-class problem: sentence boundary with an abbreviation (b_a), sentence boundary without an abbreviation (b_ā), and abbreviation not as a boundary (b̄_a).

• Similarly, for spoken language, a classification can be made among non-boundaries, statement boundaries (b_s), and question boundaries (b_q).
Methods for Sentence segmentation and topic segmentation

• For sentence or topic segmentation, the problem is defined as finding the most probable sentence or topic boundaries.

• The natural unit of sentence segmentation is the word, and of topic segmentation the sentence, as we can assume that topics typically do not change in the middle of a sentence.

• The words or sentences are then grouped into categories belonging to one sentence or topic, as boundaries and non-boundaries.

• The classification can be done at each potential boundary i (local modelling); the aim is then to estimate the most probable boundary type ŷ_i for each candidate x_i, as sketched below.
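• For illustration, a minimal sketch of local modelling is given below; `model` stands for any trained classifier that returns P(y | x_i), and all names here are assumptions for illustration, not a specific system:

```python
# Local modelling: classify each candidate boundary independently by
# choosing the label with the highest conditional probability.
LABELS = ["b", "non-b"]  # boundary / non-boundary

def classify_locally(candidates, model):
    decisions = []
    for x_i in candidates:
        probs = model(x_i)                           # {label: P(label | x_i)}
        y_hat = max(LABELS, key=lambda y: probs[y])  # argmax_y P(y | x_i)
        decisions.append(y_hat)
    return decisions

# Toy usage with a dummy model that always prefers "non-b":
print(classify_locally([1, 2], lambda x: {"b": 0.3, "non-b": 0.7}))
# -> ['non-b', 'non-b']
```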
Methods for Sentence segmentation and topic segmentation

• Here, the ^ (hat) is used to denote estimated categories, and a variable without a ^ is used to show possible categories.

• It is also possible to see the candidate boundaries as a sequence and search for the sequence of boundary types Ŷ = ŷ_1, ..., ŷ_n that has the maximum probability given the candidate examples X = x_1, ..., x_n.

• We categorize the methods into local and sequence classification.


Methods for Sentence segmentation and topic segmentation

• Another categorization of methods is done according to the type of machine learning algorithm: generative versus discriminative.

• Generative sequence models estimate the joint distribution P(X, Y) of the observations (words, punctuation) and the labels (sentence boundary, topic boundary).

• Discriminative sequence models, however, focus on features that categorize the differences between the labellings of the examples.
Generative and Discriminative Sequence Models
• NLP encompasses various tasks, such as text classification, sentiment analysis, machine translation, and language generation.

• Several models have been developed to address these tasks, the two primary model families in NLP being generative and discriminative models.

• Generative and discriminative sequence models are two categories of statistical models used in natural language processing (NLP) and in various other fields.

• They take different approaches to modeling and solving sequence-related tasks:
Generative and Discriminative Sequence Models
• Discriminative models map an input to an output and are trained on labeled data, following the supervised learning paradigm.

• These models learn to identify patterns and correlations between the input and output, allowing them to make highly accurate predictions.

• The discriminative model indirectly learns certain features of the dataset that make the task easier.

• Discriminative models work by learning the decision boundary that separates the input data into different classes.
Generative and Discriminative Sequence Models

• Discriminative modelling learns to model the conditional probability of the class label y given a set of features x, as P(Y|X).

• Generative models are probabilistic models that can generate new text based on the input given to them.
Generative and Discriminative Sequence Models
• These models are trained on large amounts of unlabeled data and can be fine-tuned to perform various NLP tasks.

• They work by learning the probability distribution of words in a language and use this knowledge to generate new text that matches the input's context.

• Generative modelling defines how a dataset is generated.

• It tries to understand the distribution of data points, providing a model of how the data is actually generated in terms of a probabilistic model.
Generative and Discriminative Sequence Models

• The aim is to generate new samples from the distribution learned from the training data.

• The advantage of using a generative model is that it helps to represent the data more realistically.

• Generative modelling learns to approximate P(X), the probability of making the observation X.

• For example:

• Let's say you have input data x and you want to classify the data into labels y.

• A generative model learns the joint probability distribution P(X, Y), and a discriminative model learns the conditional probability distribution P(Y|X).
Generative and Discriminative Sequence Models
• Suppose you have the following data in the form (x, y): (1,0), (1,0), (2,0), (2,1).

• The joint distribution p(x,y) and the conditional distribution p(y|x) are then:

  p(x,y):        y=0    y=1
        x=1      0.5    0
        x=2      0.25   0.25

  p(y|x):        y=0    y=1
        x=1      1      0
        x=2      0.5    0.5

• The distribution p(y|x) is the natural distribution for classifying a given example x into a class y, which is why algorithms that model this directly are called discriminative algorithms.

• Generative algorithms model p(x,y), which can be transformed into p(y|x) by applying Bayes' rule and then used for classification.
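• The sketch below reproduces this worked example by counting: it estimates the joint p(x,y) and the conditional p(y|x) from the four samples, and shows that Bayes' rule connects the two:

```python
# Estimate p(x, y) and p(y | x) by counting the four training samples.
from collections import Counter

samples = [(1, 0), (1, 0), (2, 0), (2, 1)]

joint = Counter(samples)
p_xy = {xy: c / len(samples) for xy, c in joint.items()}
print(p_xy)          # {(1, 0): 0.5, (2, 0): 0.25, (2, 1): 0.25}

x_counts = Counter(x for x, _ in samples)
p_y_given_x = {(x, y): c / x_counts[x] for (x, y), c in joint.items()}
print(p_y_given_x)   # {(1, 0): 1.0, (2, 0): 0.5, (2, 1): 0.5}

# Bayes' rule recovers the discriminative view from the generative one:
# p(y | x) = p(x, y) / p(x), e.g. p(y=0 | x=2) = 0.25 / 0.5 = 0.5
```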
Generative and Discriminative Sequence Models
• Generative models are not as simple as discriminative ones. A generative model describes how likely each topic is, and how likely words are given the topic.

• This is how it says documents are actually "generated": a topic arises according to some distribution, words arise because of the topic, and together they make up a document.

• Classifying a document of words W into a topic T is then a matter of maximizing the joint likelihood: P(T, W) = P(W|T) P(T).

• A discriminative model operates by only describing how likely a topic is given the words, without modelling how likely the words or the topics themselves are.
Generative and Discriminative Sequence Models
• The task is to model P(T|W) directly and find the T that maximizes it. Discriminative approaches do not care about P(T) or P(W) directly.

• In discriminative models, to predict the label y from the training example x, we must evaluate:

  ŷ = argmax_y P(y|x)

• which merely chooses the most likely class y given x.

• It is as if we were trying to model the decision boundary between the classes.

• Now, using Bayes' rule, let us replace P(y|x) in the equation by

  P(x|y) P(y) / P(x)
Generative and Discriminative Sequence Models
• Since we are just interested in the argmax, we can eliminate the denominator, which is the same for every y.

• So we get:

  ŷ = argmax_y P(x|y) P(y)

• which is the equation used in generative models.

• In the first case, we modeled the conditional probability distribution P(y|x), which models the boundary between classes.

• In the second case, we modeled the joint probability distribution P(x, y), since P(x|y) P(y) = P(x, y), which explicitly models the actual distribution of each class.
Generative and Discriminative Sequence Models
• With the joint probability distribution function, given a y, you can calculate ("generate") its respective x.

• For this reason, such models are called "generative" models.

• Here:

• Ŷ = the predicted class (boundary) label

• Y = (y_1, y_2, ..., y_k) = the set of class (boundary) labels

• X = (x_1, x_2, ..., x_n) = the set of feature vectors

• P(Y|X) = the probability, given the feature vectors X, that X belongs to the class (boundary) label Y
Methods for Sentence segmentation and topic segmentation

• P(X) = the probability of the word sequence

• P(Y) = the probability of the class (boundary)

• P(X) in the denominator is dropped because it is fixed for different Y and hence does not change the argument of the max.

• P(X|Y) and P(Y) can be estimated as

  P(X|Y) = ∏_{i=1}^{n} P(x_i | y_1, ..., y_i)  ----------------- (2.4)

  and

  P(Y) = ∏_{i=1}^{n} P(y_i | y_1, ..., y_{i-1})  ----------------- (2.5)
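• To make Eqs. 2.4 and 2.5 concrete, the sketch below scores a label sequence by P(X,Y) = P(X|Y) · P(Y), simplifying the conditioning to a first-order Markov chain over labels and per-label emissions; the toy probability tables are assumptions for illustration only:

```python
# Scoring P(X, Y) = P(X | Y) * P(Y) with a bigram approximation of
# Eq. 2.5 and per-label emissions approximating Eq. 2.4 (toy numbers).
import math

# P(y_i | y_{i-1}): boundaries rarely follow boundaries (assumed values).
TRANS = {("<s>", "b"): 0.1, ("<s>", "nb"): 0.9,
         ("b", "b"): 0.05,  ("b", "nb"): 0.95,
         ("nb", "b"): 0.2,  ("nb", "nb"): 0.8}

# P(x_i | y_i): probability of the observed token given the label.
EMIT = {("b", "."): 0.8, ("b", "the"): 0.2,
        ("nb", "."): 0.1, ("nb", "the"): 0.9}

def log_joint(tokens, labels):
    """log P(X, Y) = log P(Y) + log P(X | Y) under the toy model."""
    score, prev = 0.0, "<s>"
    for x, y in zip(tokens, labels):
        score += math.log(TRANS[(prev, y)]) + math.log(EMIT[(y, x)])
        prev = y
    return score

# The boundary-after-"." labelling scores higher than the alternative:
print(log_joint(["the", "."], ["nb", "b"]))   # ~ -2.04
print(log_joint(["the", "."], ["nb", "nb"]))  # ~ -2.74
```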
Methods for Sentence segmentation and topic segmentation

• Generative vs. Discriminative Modeling

1. Both generative (G) and discriminative (D) models can be used for classification.

2. G models work with both supervised and unsupervised learning, but D models are only used for supervised learning problems.

3. The goal of a D model is to estimate the conditional probability P(Y|X). In contrast, a G model learns to approximate P(X) and P(X|Y) in an unsupervised setting, and then deduces P(Y|X) in a supervised setting.
Methods for Sentence segmentation and topic segmentation

4. D models learn a linear or non-linear decision boundary from the data points and their respective labels, without knowing how the data was generated. In learning to model the probability distribution of the data, G models get to understand the data's underlying characteristics.

5. In a supervised setting, D models are known to outperform G models, especially when the G models do not fit the data well.

6. G models can provide rich insights about the data when you do not have labels.
Complexity of Approaches
• Sentence/topic segmentation approaches can be rated in terms of the complexity (time and memory) of their training and prediction algorithms, and in terms of their performance on real-world datasets.

• Some approaches may require specific pre-processing, such as converting or normalizing continuous features to discrete features.

• The complexity of the approaches can be seen in:

1. Discriminative approaches

2. Generative models

3. Discriminative classifiers

4. Sequence approaches
Complexity of Approaches
1. Discriminative approaches

• In terms of complexity, training of discriminative approaches is more complex than training of generative ones, since they require multiple passes over the training data to adjust their feature weights.

2. Generative models

• Generative models such as HELMs (hidden event language models) can handle training sets that are multiple orders of magnitude larger, but they do not cope well with unseen events.

3. Discriminative classifiers

• They allow for a wider variety of features and perform better on smaller training sets.
Complexity of Approaches
• Predicting with discriminative classifiers is also slower, even though the models are relatively simple (linear or log-linear).

4. Sequence approaches

• Compared to local approaches, sequence approaches bring the additional complexity of decoding: finding the best sequence of decisions requires searching over all possible sequences of decisions, as sketched below.
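• For illustration, the sketch below shows how dynamic programming tames this search: a brute-force decoder would score all |Y|^n label sequences, whereas the Viterbi recursion finds the same maximizing sequence in O(n · |Y|²) time. The `trans`, `emit`, and `start` log-score lookups are assumed to be supplied by whatever model is in use:

```python
def viterbi(observations, labels, trans, emit, start):
    """Find the label sequence with the highest total log score.

    trans(q, y): log score of moving from label q to label y
    emit(y, o):  log score of label y emitting observation o
    start[y]:    log score of starting in label y
    """
    # best[y] = score of the best partial sequence ending in label y
    best = {y: start[y] + emit(y, observations[0]) for y in labels}
    back = []  # back-pointers, one dict per time step after the first
    for obs in observations[1:]:
        prev, best, ptr = best, {}, {}
        for y in labels:
            q = max(labels, key=lambda q: prev[q] + trans(q, y))
            best[y] = prev[q] + trans(q, y) + emit(y, obs)
            ptr[y] = q
        back.append(ptr)
    # Trace the back-pointers from the best final label.
    y = max(labels, key=lambda q: best[q])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))
```

• The recursion keeps only the best-scoring partial sequence per label at each step, which is what avoids enumerating every sequence explicitly.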
Performance of the Approaches
• The performance of sentence segmentation approaches can be analyzed in:

• A. Sentence segmentation in text

• B. Sentence segmentation in speech

• A. Sentence segmentation in text

• For sentence segmentation in text, researchers have reported error rate results on a subset of the Wall Street Journal Corpus of about 27,000 sentences.

• For instance, Mikheev reports that his rule-based system performs at an error rate of 1.41%.
Performance of the Approaches
• Even though the error rates presented seem low, sentence segmentation is one of the first processing steps for any NLP task, and each error impacts subsequent steps, especially if the resulting sentences are presented to the user.

• B. Sentence segmentation in speech

• For sentence segmentation in speech, performance is usually evaluated using:

• 1. The error rate (the ratio of the number of errors to the number of examples).

• 2. The F1-measure (the harmonic mean of recall and precision).

• Recall is defined as the ratio of the number of correctly returned sentence boundaries to the number of sentence boundaries in the reference annotations.
Performance of the Approaches
• Precision is the ratio of the number of correctly returned sentence
boundaries to the number of all automatically estimated sentence
boundaries.
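• As a small worked example of these measures (the boundary positions are assumed for illustration), the sketch below computes precision, recall, and the F1-measure from a set of reference and predicted sentence boundaries:

```python
# Evaluation as defined above: compare automatically estimated boundary
# positions against reference annotations (toy positions).
reference = {5, 12, 20, 31}       # true sentence-boundary positions
predicted = {5, 12, 18, 31, 40}   # system output

correct = len(reference & predicted)      # 3 correctly returned boundaries
precision = correct / len(predicted)      # 3/5 = 0.6
recall = correct / len(reference)         # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 3))    # 0.6 0.75 0.667
```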
