
Natural Language Processing

Prepared By
J. Kamal Vijetha | Anuradha Surabhi | Asha Jyothi
Unit 5a
Best practices in developing NLP applications
Objectives
• Design an NLP model
• Train an NLP model
• Making neural network inference more efficient by sorting, padding, and masking
tokens
• Applying character-based and BPE tokenization for splitting text into tokens
• Avoiding overfitting
• Dealing with imbalanced datasets by using upsampling, downsampling, and loss
weighting
• Optimizing hyperparameters
5.0 Building NLP applications
Typical structure of a modern NLP application

Figure: training a chatbot / spell checker


Train the NLP Model
NLP Models
• Deep neural network models such as RNNs, CNNs, and the Transformer, built with modern NLP frameworks (e.g., Hugging Face)
• Practical questions arise when training the model and running inference, for example:
⮚ How do you train and make predictions efficiently?
⮚ How do you avoid having your model overfit?
⮚ How do you optimize hyperparameters?
⮚ These factors can have a huge impact on the final performance and generalizability of your model.
NLP Models
• Logistic regression – two-class (binary) sentiment analysis
• RNN – sequential labeling
• RNN with feedback – language models, generation and detection
• Seq2Seq (encoder-decoder) – machine translation, chatbots
• CNN – text classification, document classification
• Seq2Seq with attention – machine translation
• Seq2Seq with self-attention (Transformers) – machine translation, spell checking
• BERT – sentence classification, sentiment analysis
• NLI – sentence-pair classification
How to build robust and accurate NLP applications
Techniques used to build the model

• Batching instances: padding, sorting, masking

• Handling unknown words (tokenization for neural models): character models, subword models

• Avoiding overfitting: regularization, early stopping, cross-validation

• Dealing with imbalanced datasets: using appropriate evaluation metrics, upsampling and downsampling, weighting losses

• Hyperparameter tuning: epochs, parameters, grid search vs. random search


5.1 Batching
• Batching is a machine learning technique where instances are grouped together to form batches and sent to the processor (CPU or, more often, the GPU).
• Batching is necessary when training large neural networks—it is critical for efficient
and stable training.
• Batching also makes the model's computation more efficient.
• Methods used when batching instances:
⮚ Padding,
⮚ Sorting ,
⮚ Masking
Batching- Padding
• Training large neural networks requires a large number of linear algebra operations such as matrix addition and multiplication.
• This requires specialized hardware such as GPUs, processors designed to execute such operations in a highly parallelized manner.
• Input data is sent to the GPU as tensors (high-dimensional arrays), the mathematical operations are performed there, and the result is sent back as another tensor.
• In NLP, we must handle text sequences of different lengths.
• Because all sequences in a batch must have the same length (batches must be rectangular), we pad shorter sequences, i.e., append special <PAD> tokens until they are as long as the longest sequence in the same batch. This is illustrated in the figure below.
Batching- Padding
Padding and batching. Black squares are tokens, gray ones are EOS tokens, and
white ones are padding.
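A minimal sketch of padding in plain Python; the pad ID of 0 and the example token IDs are made up for illustration:

def pad_batch(batch, pad_id=0):
    # Pad every sequence to the length of the longest sequence in the batch.
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

batch = [[5, 8, 2], [7, 3], [9, 4, 6, 1]]
print(pad_batch(batch))   # [[5, 8, 2, 0], [7, 3, 0, 0], [9, 4, 6, 1]]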
Batching- Sorting
• Padding and batching of embedded sequences create rectangular, three-dimensional tensors.
Batching- Sorting
Sorting instances by length before batching (right) reduces the amount of padding needed and therefore the total size of the tensors.
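A small sketch of sorting by length before batching, so sequences of similar lengths land in the same batch and little padding is wasted; the batch size and token lists are assumed:

def make_batches(instances, batch_size=2):
    # Sort by length, then slice into batches of similar-length sequences.
    ordered = sorted(instances, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

instances = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10], [11]]
print(make_batches(instances))   # [[[11], [6, 7]], [[8, 9, 10], [1, 2, 3, 4, 5]]]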
Batching- Masking
Masking is an operation where you ignore the parts of the network that correspond to padding.

This becomes especially relevant when you are dealing with a sequential-labeling or a language-generation model.
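A hedged PyTorch sketch of building a mask from padded token IDs and using it to average a per-token loss over real tokens only; the tensors here are illustrative:

import torch

tokens = torch.tensor([[5, 8, 2, 0], [7, 3, 0, 0]])                          # 0 = <PAD>
per_token_loss = torch.tensor([[0.4, 0.1, 0.3, 0.9], [0.2, 0.5, 0.8, 0.7]])

mask = (tokens != 0).float()                      # 1 for real tokens, 0 for padding
masked_loss = (per_token_loss * mask).sum() / mask.sum()
print(masked_loss)                                # loss averaged over real tokens only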
5.2 challenges in Tokenization for neural models
• The basic linguistic units are words, characters, and n-grams and how to
compute their embeddings.

• Here we will focus on how to analyze texts and obtain these units —a
process called tokenization.

• Neural network models pose a set of unique challenges for how to deal with tokens that are unknown words:
⮚ Character models
⮚ Sub-word models
5.2 challenges in Tokenization
• An NLP model deals with a fixed set of tokens known as its vocabulary.

• Many neural NLP models operate within a fixed, finite set of tokens.

• Unknown words are a huge problem when we are building a machine translation system or a conversational engine: an MT system or a chatbot is of little use if it produces "I don't know" every time it sees new words!

• In general, the OOV (unknown word) problem is more serious for language-generation systems (including machine translation and conversational AI) than for NLP systems that make predictions (sentiment analysis, POS tagging, and so on).
How to solve OOV tokens problem in NLP
• Handling OOV tokens is a big problem in NLP, and a lot of research work has gone into dealing with them.
• Two commonly used techniques for building robust neural NLP models are character-based and subword-based models.
Handling OOV-Character Models
• An effective solution to the OOV problem is to treat characters as tokens.
• Break the input text into individual characters, including punctuation and whitespace, and treat them as if they were regular tokens.
• The rest of the application is unchanged: "word" embeddings are assigned to characters, which are further processed by the model. If the model produces text, it does so character by character, as in a character-level language model or text generator.
• Instead of generating text word by word, the RNN produces text one character at a time, as illustrated in the figure below. Thanks to this strategy, the model is able to produce words that look like English but actually aren't.
• If the model operated on words, it would produce only known words (or UNKs when unsure), and this wouldn't have been possible.
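A minimal sketch of character-level tokenization, where every character, including whitespace and punctuation, becomes a token with its own ID; the vocabulary built here is just for illustration:

text = "I don't know."
chars = list(text)                                          # each character is a token
vocab = {ch: i for i, ch in enumerate(sorted(set(chars)))}  # character-to-ID mapping
ids = [vocab[ch] for ch in chars]
print(chars)
print(ids)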
How to solve OOV tokens problem in NLP
• A language-generation model that generates text character by character (including whitespace)
How to solve OOV tokens problem in NLP
• The word-based approach is efficient but not great with unknown words.
• The character-based approach is great with unknown words but is inefficient.
• Is there something in between?
Tokenization - unknown words
• Is there a tokenization scheme that is both efficient and robust to unknown words?
• Subword models are a recent invention that addresses this problem for neural
networks.
• In subword models, the input text is segmented into a unit called subwords, which
simply means something smaller than words.
• There is no formal linguistic definition of what subwords actually are, but they roughly correspond to parts of words that appear frequently.
• For example, one way to segment “dishwasher” is “dish + wash + er,” although
some other segmentation is possible.
• Several algorithms (such as WordPiece and SentencePiece) tokenize input into subwords, but by far the most widely used is byte-pair encoding (BPE).
Byte-pair encoding (BPE)
• Byte-pair encoding (BPE) is originally a compression algorithm that is used as a tokenization method for neural models, particularly in machine translation.
• The idea of BPE is to keep frequent words (such as "the" and "you") and n-grams (such as "-able" and "anti-") unsegmented, while breaking up rarer words (such as "dishwasher") into subwords ("dish + wash + er").
• Keeping frequent words and n-grams together helps the model process those
tokens efficiently, whereas breaking up rare words ensures there are no UNK
tokens, because everything can be ultimately broken up into individual characters.
• By flexibly choosing where to tokenize based on the frequency, BPE achieves the
best of two worlds—being efficient while addressing the unknown word problem.
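A toy sketch of a single BPE merge step, illustrating how BPE learns subword units by merging the pair of units that co-occurs most often; the corpus is made up, and real BPE repeats this step until a target vocabulary size is reached.

from collections import Counter

corpus = [list("dishwasher"), list("washing"), list("washer"), list("dish")]

def merge_step(corpus):
    # Count how often each adjacent pair of units co-occurs.
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    best = max(pairs, key=pairs.get)                   # the most frequent pair
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])      # merge the pair into one unit
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return best, merged

best, corpus = merge_step(corpus)
print(best, corpus[0])   # ('s', 'h') ['d', 'i', 'sh', 'w', 'a', 'sh', 'e', 'r']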
N-grams
An n-gram is a contiguous sequence of one or more occurrences of linguistic units, such as characters and words.

Unigram (n = 1), bigram (n = 2), trigram (n = 3)

Example: "I have a book" vs. "<start> I have a book <end>"

"I have a book": 4 unigrams, 3 bigrams, 2 trigrams
"<start> I have a book <end>": 6 unigrams, 5 bigrams, 4 trigrams

in search and information retrieval, n-grams often mean character n-grams used for indexing documents.
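A small sketch of extracting word n-grams, reproducing the counts in the example above:

def ngrams(tokens, n):
    # All contiguous subsequences of length n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I have a book".split()
print(len(ngrams(tokens, 1)), ngrams(tokens, 1))   # 4 unigrams
print(len(ngrams(tokens, 2)), ngrams(tokens, 2))   # 3 bigrams
print(len(ngrams(tokens, 3)), ngrams(tokens, 3))   # 2 trigrams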
BPE
Figure: BPE learns subword units by iteratively merging consecutive units that co-occur frequently.
5.3 Avoiding overfitting
• Overfitting is one of the most common and important issues when building any
machine learning applications.
• An ML model is said to overfit when it fits the given data so well that it loses its
generalization ability to unseen data.
• The model may fit the training data very well and show good performance on it, but it may fail to capture the data's inherent patterns and show poor performance on test data that it has never seen before.
5.3 Avoiding overfitting
• A number of algorithms and techniques help to avoid overfitting:
⮚ Regularization: L2 regularization (weight decay)
⮚ Dropout
⮚ Early stopping
⮚ Cross-validation
⮚ Callbacks
These are popular in ML applications in general (not just NLP) and worth getting under your belt.
Regularization
• Regularization in ML refers to techniques that encourage the simplicity and generalizability of the model.

Figure : Classification boundaries with increasing complexity


Regularization
• L2 regularization, also called weight decay, is one of the most common regularization techniques.
• L2 regularization adds a penalty for the complexity of a model measured by
how large its parameters are.
• To represent a complex classification boundary, an ML model needs to
adjust a large number of parameters (the “magic constants”) to extreme
values, measured by the L2 loss, which captures how far away they are from
zero. Such models incur a larger L2 penalty, which is why L2 encourages
simpler models.
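In PyTorch, L2 regularization is typically applied through the optimizer's weight_decay argument; a brief sketch with a placeholder model and an assumed decay value:

import torch

model = torch.nn.Linear(100, 2)   # placeholder model
# weight_decay adds an L2 penalty on the parameters at every update,
# pushing them toward zero and discouraging overly complex models.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)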
DROPOUT
• Dropout is another popular regularization technique commonly used with
neural networks.
• Dropout works by randomly "dropping" neurons during training, where a "neuron" is basically a dimension of an intermediate layer.
• Here, "dropping" means masking the value with zero.
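A minimal PyTorch sketch of adding dropout between layers; the layer sizes and dropout probability are assumed for illustration:

import torch.nn as nn

# Dropout zeroes each hidden dimension with probability p during training
# and is a no-op at evaluation time (after model.eval() is called).
model = nn.Sequential(
    nn.Linear(300, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # drop half of the hidden units on average
    nn.Linear(128, 2),
)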
Early stopping
• Early stopping is a technique where you stop training your model when the
model performance stops improving, usually measured by the validation set
loss.
• E.g., in an English-Spanish machine translation model, the validation loss curve may flatten out around the eighth epoch and start to creep up after that, which is a sign of overfitting.
• Early stopping would detect this, stop the training, and use the result from the best epoch, where the loss is lowest.
• Early stopping usually has a "patience" parameter, which is the number of non-improving epochs required before early stopping kicks in. When patience is 10 epochs, for example, the training pipeline will wait 10 epochs after the loss stops improving before stopping the training.
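A minimal sketch of an early-stopping loop with a patience parameter; validation_loss() here is a synthetic stand-in for evaluating a real model after each training epoch:

import random

def validation_loss(epoch):
    # Fake validation curve: drops until about epoch 8, then flattens out.
    return max(0.2, 1.0 - 0.1 * epoch) + random.uniform(0, 0.02)

best_loss, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(100):
    # a real pipeline would train the model for one epoch here
    loss = validation_loss(epoch)
    if loss < best_loss:
        best_loss, bad_epochs = loss, 0    # new best: save a checkpoint here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"early stopping at epoch {epoch}, best loss {best_loss:.3f}")
            break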
Early stopping

The validation loss curve flattens out around the eighth epoch and creeps back up.
Cross-validation
• Cross-validation is not exactly a regularization method, but a related technique for estimating model performance reliably.
• If the training data is small, the model is validated and tested on just a few dozen instances, which can make the estimated metrics unstable.

In k-fold cross-validation, the dataset is split into k equally sized folds, and each fold takes a turn being used for validation while the rest are used for training.
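A short sketch of k-fold splitting with scikit-learn's KFold; the dataset here is a placeholder list of instance indices:

from sklearn.model_selection import KFold

dataset = list(range(100))   # placeholder instances
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5, shuffle=True).split(dataset)):
    # Train on the instances at train_idx, validate on those at val_idx,
    # then average the metrics over all five folds.
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")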
5.4 How to deal with imbalanced datasets
• The class imbalance problem is often encountered when building NLP and ML models.
• E.g., the goal of a classification task is to assign one of the classes (e.g., spam or non-spam).
• In document classification, some topics (such as politics or sports) are usually more popular than other topics.
• A dataset where some classes have far more instances than others is called imbalanced.
• Techniques used to deal with imbalanced datasets:
⮚ Using appropriate evaluation metrics: the F1-measure instead of accuracy
⮚ Upsampling and downsampling
⮚ Weighting losses
How to deal with imbalanced datasets
a) Calculating the F1 score

b) Upsampling and downsampling (a sketch follows below)
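A plain-Python sketch of upsampling the minority class and downsampling the majority class; the toy spam/ham dataset and the 10:90 split are made up:

import random

data = [("...", "spam")] * 10 + [("...", "ham")] * 90   # imbalanced toy dataset
spam = [x for x in data if x[1] == "spam"]
ham = [x for x in data if x[1] == "ham"]

# Upsampling: repeat minority-class instances (sampling with replacement).
upsampled = ham + random.choices(spam, k=len(ham))
# Downsampling: keep only a random subset of majority-class instances.
downsampled = spam + random.sample(ham, k=len(spam))

print(len(upsampled), len(downsampled))   # 180 20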


How to deal with imbalanced datasets
c) Weighting losses
• Apply weighting when computing the loss, instead of modifying the training data.
• The loss penalizes mistakes more heavily when the ground truth belongs to the minority class.
• In the binary cross-entropy loss, when the prediction is perfectly correct (probability = 1) there is no penalty, whereas as the prediction gets worse (probability < 1) the loss goes up.

Binary cross-entropy loss: L = -( y · log p + (1 - y) · log(1 - p) )
Weighted binary cross-entropy loss: L = -( w1 · y · log p + w0 · (1 - y) · log(1 - p) ), where the minority class gets the larger weight
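A PyTorch sketch of a class-weighted loss; the 1:9 weighting, the two-class setup, and the dummy logits are assumed for illustration:

import torch
import torch.nn as nn

# Mistakes on the minority class (index 1 here) are penalized nine times as much.
weights = torch.tensor([1.0, 9.0])
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(4, 2)            # dummy model outputs for 4 instances
labels = torch.tensor([0, 1, 0, 1])   # ground-truth classes
print(loss_fn(logits, labels))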


5.5 Hyperparameter tuning-optimization
• Hyperparameters are parameters about the model and the training algorithm. This term is used in contrast with parameters, which are the numbers used by the model to make predictions from the input (the "magic constants").
• Hyperparameters
⮚ how many hidden units (dimensions) to use for representing words
⮚ number of RNN layers to use
⮚ the number of attention heads
⮚ the learning rate
⮚ number of epochs (iterations through the training dataset)
Hyperparameter tuning
• A common manual approach is to pick a set of hyperparameters that look reasonable and measure the model's performance on a validation set.
• One issue with this manual tuning approach is that it is slow and arbitrary.
• There are two more organized ways of tuning hyperparameters: grid search and random search.

• Hyperparameter tuning can also be automated with the Optuna plug-in (see the sketch below).
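A hedged sketch of hyperparameter search with Optuna; the objective function here is a synthetic stand-in for training a model and returning its validation loss, and the search ranges are assumed:

import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)   # sampled on a log scale
    hidden = trial.suggest_int("hidden_size", 64, 512)
    # In practice: train a model with these values and return the validation loss.
    return abs(lr - 1e-3) + abs(hidden - 256) / 1000        # pretend validation loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)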


Summary
▪ Instances are sorted, padded, and batched together for more efficient computation.

▪ Subword tokenization algorithms such as BPE split words into units smaller than
words to mitigate the out-of-vocabulary problem in neural network models.

▪ Regularization (such as L2 and dropout) is a technique to encourage model simplicity and generalizability in machine learning.

▪ Data upsampling, downsampling, or loss weighting can be used to address the data imbalance issue.

▪ Hyperparameters are parameters about the model or the training algorithm. They
can be optimized using manual, grid, or random search.
Calculate the accuracy, precision, and recall
2-Class problem
Instances Actual Labels Predicted Labels
Review 1 Positive Positive
Review 2 Negative Negative
Review 3 Positive Positive
Review 4 Positive Negative
Review 5 Negative Positive
Review 6 Positive Negative
Review 7 Negative Negative
Review 8 Negative Positive
Review 9 Positive Positive
Review 10 Negative Negative
Calculate the accuracy, precision, and recall

Confusion matrix (from the table above):

                   Predicted Positive   Predicted Negative
Actual Positive          TP = 3               FN = 2
Actual Negative          FP = 2               TN = 3

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 6/10 = 0.6

Recall = TP / (TP + FN) = 3/5 = 0.6

Precision = TP / (TP + FP) = 3/5 = 0.6
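A quick Python check of the 2-class counts and metrics above, computed directly from the actual and predicted labels in the table:

actual    = ["P", "N", "P", "P", "N", "P", "N", "N", "P", "N"]
predicted = ["P", "N", "P", "N", "P", "N", "N", "P", "P", "N"]

tp = sum(a == "P" and p == "P" for a, p in zip(actual, predicted))   # 3
tn = sum(a == "N" and p == "N" for a, p in zip(actual, predicted))   # 3
fp = sum(a == "N" and p == "P" for a, p in zip(actual, predicted))   # 2
fn = sum(a == "P" and p == "N" for a, p in zip(actual, predicted))   # 2

print((tp + tn) / (tp + tn + fp + fn))   # accuracy  = 0.6
print(tp / (tp + fn))                    # recall    = 0.6
print(tp / (tp + fp))                    # precision = 0.6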
3-Class problem

Instances   Actual Labels   Predicted Labels
Review 1    Positive        Positive
Review 2    Negative        Negative
Review 3    Positive        Positive
Review 4    Neutral         Neutral
Review 5    Neutral         Positive
Review 6    Positive        Negative
Review 7    Negative        Negative
Review 8    Negative        Positive
Review 9    Positive        Positive
Review 10   Negative        Negative
Review 11   Positive        Neutral
Review 12   Neutral         Neutral
Review 13   Negative        Positive
Review 14   Positive        Neutral
Review 15   Negative        Neutral

Confusion matrix (rows = predicted label, columns = actual label):

                     positive   neutral   negative
Predicted positive       3          1         2
Predicted neutral        2          2         1
Predicted negative       1          0         3
Per-class (one-vs-rest) counts from the table above:

Positive vs. others:   TP = 3, FP = 3, FN = 3, TN = 6
Neutral vs. others:    TP = 2, FP = 3, FN = 1, TN = 9
Negative vs. others:   TP = 3, FP = 1, FN = 3, TN = 8

For each class:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)

Macro average accuracy = (A1 + A2 + A3) / 3
Macro average recall = (R1 + R2 + R3) / 3
Macro average precision = (P1 + P2 + P3) / 3
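A scikit-learn sketch that recomputes the macro-averaged metrics for the 3-class example from its actual and predicted labels (P = positive, U = neutral, N = negative):

from sklearn.metrics import accuracy_score, precision_score, recall_score

actual    = ["P", "N", "P", "U", "U", "P", "N", "N", "P", "N", "P", "U", "N", "P", "N"]
predicted = ["P", "N", "P", "U", "P", "N", "N", "P", "P", "N", "U", "U", "P", "U", "U"]

print(accuracy_score(actual, predicted))                       # 8/15 ≈ 0.53
print(recall_score(actual, predicted, average="macro"))        # mean of per-class recalls
print(precision_score(actual, predicted, average="macro"))     # mean of per-class precisions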
