UNIT 5a
Prepared By
J. Kamal Vijetha | Anuradha Surabhi | Asha Jyothi
Unit 5a
Best practices in developing NLP applications
Objectives
• Design an NLP model
• Train an NLP model
• Make neural network inference more efficient by sorting, padding, and masking tokens
• Apply character-based and BPE tokenization to split text into tokens
• Avoid overfitting
• Deal with imbalanced datasets by using upsampling, downsampling, and loss weighting
• Optimize hyperparameters
5.0 Building NLP applications
The typical structure of a modern NLP application covers:
• Handling unknown words: tokenization for neural models with character models and subword models
• Dealing with imbalanced datasets: using appropriate evaluation metrics, upsampling and downsampling, and weighting losses
• Here we will focus on how to analyze texts and break them down into tokens, a process called tokenization.
• Many neural NLP models operate with a fixed, finite vocabulary of tokens.
• In general, the out-of-vocabulary (OOV) problem of unknown words is more serious for language-generation systems (including machine translation and conversational AI) than for NLP systems that make predictions (sentiment analysis, POS tagging, and so on).
How to solve the OOV token problem in NLP
• Handling OOV tokens is a major problem in NLP, and a lot of research work has gone into dealing with it.
• Two commonly used techniques for building robust neural NLP models are character-based and subword-based models.
Handling OOV: Character models
• An effective solution to the OOV problem is to treat characters as tokens.
• Break the input text into individual characters, including punctuation and whitespace, and treat them as if they were regular tokens.
• The rest of the application is unchanged: "word" embeddings are assigned to characters, which are processed by the model as before. If the model produces text, it does so character by character, as in a character-level language model or text generator.
• Instead of generating text word by word, the RNN produces text one character at a time, as illustrated in the figure below. Thanks to this strategy, the model can produce words that look like English but actually aren't.
• If the model operated on words, it would produce only known words (or UNKs when unsure), and this would not be possible. A minimal character-tokenization sketch follows the figure below.
Figure: A language-generation model that generates text character by character (including whitespace)
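
The following is a minimal Python sketch of character-level tokenization; the function and variable names are illustrative, not from any particular library. Every character, including whitespace and punctuation, becomes a token, so even unseen words can be represented.

# A minimal sketch of character-level tokenization (illustrative names only).

def char_tokenize(text):
    """Treat every character, including whitespace and punctuation, as a token."""
    return list(text)

def build_char_vocab(corpus, unk_token="<unk>"):
    """Map each character seen in the corpus to an integer ID."""
    vocab = {unk_token: 0}
    for text in corpus:
        for ch in char_tokenize(text):
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab

corpus = ["The dishwasher is running.", "I bought a new dishwasher!"]
vocab = build_char_vocab(corpus)

# Even an unseen word is representable, because all of its characters are known.
ids = [vocab.get(ch, vocab["<unk>"]) for ch in char_tokenize("dishwashing")]
print(ids)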
• The word-based approach is efficient but not great with unknown words.
• The character-based approach is great with unknown words but is inefficient.
• Is there something in between?
Tokenization - unknown words
• Is there a tokenization scheme that is both efficient and robust to unknown words?
• Subword models are a relatively recent invention that addresses this problem for neural networks.
• In subword models, the input text is segmented into units called subwords, which simply means something smaller than words.
• There is no formal linguistic definition of what subwords actually are, but they roughly correspond to parts of words that appear frequently.
• For example, one way to segment "dishwasher" is "dish + wash + er," although other segmentations are possible.
• Several algorithms (such as WordPiece and SentencePiece) tokenize input into subwords, but by far the most widely used is byte-pair encoding (BPE).
Byte-pair encoding (BPE)
• Byte-pair encoding (BPE) was originally a compression algorithm and is now used as a tokenization method for neural models, particularly in machine translation.
• The idea behind BPE is to keep frequent words (such as "the" and "you") and frequent character n-grams (such as "-able" and "anti-") unsegmented, while breaking up rarer words (such as "dishwasher") into subwords ("dish + wash + er").
• Keeping frequent words and n-grams together helps the model process those tokens efficiently, whereas breaking up rare words ensures there are no UNK tokens, because everything can ultimately be broken up into individual characters.
• By flexibly choosing where to tokenize based on frequency, BPE achieves the best of both worlds: it is efficient while also addressing the unknown-word problem.
N-grams
An n-gram is a contiguous sequence of one or more linguistic units, such as characters or words. Unigrams, bigrams, and trigrams are n-grams with n = 1, 2, and 3, respectively.
For example, a string of 4 characters yields 4 unigrams, 3 bigrams, and 2 trigrams, and a string of 6 characters yields 6 unigrams, 5 bigrams, and 4 trigrams.
In search and information retrieval, n-grams often mean character n-grams used for indexing documents.
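
As a small illustration of character n-grams, here is a short sketch; the helper function and the example word "book" are my own, chosen to match the counts above.

# Illustrative helper for extracting character n-grams from a string.

def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

word = "book"  # 4 characters
print(char_ngrams(word, 1))  # 4 unigrams: ['b', 'o', 'o', 'k']
print(char_ngrams(word, 2))  # 3 bigrams:  ['bo', 'oo', 'ok']
print(char_ngrams(word, 3))  # 2 trigrams: ['boo', 'ook']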
BPE
Figure: BPE learns subword units by iteratively merging consecutive units that co-occur frequently.
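
Below is a toy Python sketch of the merge loop described in the figure: it repeatedly merges the most frequent pair of adjacent units. The corpus and function names are illustrative; real BPE implementations also use word frequency tables and end-of-word markers.

from collections import Counter

def learn_bpe(words, num_merges):
    # Start from character sequences, one list of units per word occurrence.
    sequences = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count how often each pair of adjacent units co-occurs across the corpus.
        pair_counts = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge everywhere it occurs.
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return merges, sequences

words = ["dishwasher", "dishwashing", "washer", "wash", "dish", "dish"]
merges, segmented = learn_bpe(words, num_merges=10)
print(merges)      # learned merge operations, most frequent first
print(segmented)   # how each word is segmented after the merges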
5.3 Avoiding overfitting
• Overfitting is one of the most common and important issues in building any machine learning application.
• An ML model is said to overfit when it fits the given data so well that it loses its ability to generalize to unseen data.
• Such a model may capture the training data very well and show good performance on it, but it fails to capture the underlying patterns and shows poor performance on test data that it has never seen before.
• A number of algorithms and techniques help avoid overfitting:
⮚ Regularization, e.g., L2 regularization (weight decay)
⮚ Dropout
⮚ Early stopping
⮚ Cross-validation
⮚ Callbacks
These are popular in ML applications in general (not just NLP) and are worth getting under your belt; a combined sketch follows the notes on regularization below.
Regularization
• Regularization in ML refers to techniques that encourage the simplicity and generalization of the model.
The validation loss curve flattens out around the eighth epoch and then creeps back up; this is the point where early stopping would halt training.
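
The following is a minimal sketch (using PyTorch with made-up synthetic data) combining three of the techniques above: dropout, L2 regularization via the optimizer's weight_decay argument, and early stopping based on the validation loss.

import torch
from torch import nn

# Synthetic data standing in for real features and labels.
torch.manual_seed(0)
X_train, y_train = torch.randn(200, 10), torch.randint(0, 2, (200,))
X_val, y_val = torch.randn(50, 10), torch.randint(0, 2, (50,))

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # dropout: randomly zero activations during training
    nn.Linear(32, 2),
)
loss_fn = nn.CrossEntropyLoss()
# weight_decay adds an L2 penalty on the weights (weight decay).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()

    # Early stopping: stop once validation loss stops improving for `patience` epochs.
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Stopping early at epoch {epoch}, best val loss {best_val:.3f}")
            break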
Cross-validation
• Cross-validation is not exactly a regularization method, but it addresses a related problem: if the training data is small, the model is validated and tested on just a few dozen instances, which can make the estimated metrics unstable.
In k-fold cross-validation, the dataset is split into k equally sized folds; each fold is used for validation once while the remaining folds are used for training, and the resulting metrics are averaged over the k runs.
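
A minimal pure-Python sketch of k-fold cross-validation follows; the dataset and the placeholder "majority-class" model are invented for illustration, and in practice the training and evaluation steps would train and score your NLP model.

import random

def k_fold_indices(n_examples, k, seed=0):
    """Split example indices into k roughly equal, shuffled folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(dataset, k=5):
    folds = k_fold_indices(len(dataset), k)
    scores = []
    for val_idx in folds:
        val_set = [dataset[j] for j in val_idx]
        train_set = [dataset[j] for fold in folds if fold is not val_idx for j in fold]
        # Placeholder "model": predict the majority label of the training set.
        labels = [label for _, label in train_set]
        majority = max(set(labels), key=labels.count)
        accuracy = sum(label == majority for _, label in val_set) / len(val_set)
        scores.append(accuracy)
    return sum(scores) / len(scores)  # average metric across the k folds

dataset = [("text %d" % i, i % 2) for i in range(100)]
print(cross_validate(dataset, k=5))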
5.4 How to deal with imbalanced datasets
• The class imbalance problem is often encountered when building NLP and ML models.
• E.g., the goal of a classification task is to assign one of the classes (e.g., spam or non-spam) to each instance.
• In document classification, some topics (such as politics or sports) are usually more popular than other topics.
• A dataset in which some classes have far more instances than others is called imbalanced.
• Techniques used to deal with an imbalanced dataset (minimal sketches follow this list):
⮚ Using appropriate evaluation metrics, e.g., the F1-measure instead of accuracy
⮚ Upsampling and downsampling
⮚ Weighting losses
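
The following minimal sketches, using an invented spam/non-spam dataset, illustrate upsampling, downsampling, and loss weighting; the class weights shown are illustrative only.

import random
from collections import Counter

random.seed(0)
data = [("spam text", "spam")] * 10 + [("normal text", "nonspam")] * 90

by_class = {}
for example in data:
    by_class.setdefault(example[1], []).append(example)
largest = max(len(v) for v in by_class.values())
smallest = min(len(v) for v in by_class.values())

# Upsampling: sample every class with replacement up to the size of the largest class.
upsampled = []
for examples in by_class.values():
    upsampled += random.choices(examples, k=largest)

# Downsampling: sample every class without replacement down to the size of the smallest class.
downsampled = []
for examples in by_class.values():
    downsampled += random.sample(examples, k=smallest)

print(Counter(label for _, label in upsampled))    # balanced at 90 per class
print(Counter(label for _, label in downsampled))  # balanced at 10 per class

# Loss weighting (PyTorch): give the rare class a larger weight so each class
# contributes comparably to the loss, instead of resampling the data.
# import torch
# weights = torch.tensor([1.0, 9.0])           # e.g., inverse class frequencies
# loss_fn = torch.nn.CrossEntropyLoss(weight=weights)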
How to deal with imbalanced datasets
a) Calculating the F1 score: when classes are imbalanced, the F1 score is a more informative evaluation metric than accuracy (see the worked example and formulas below).
Key points:
▪ Subword tokenization algorithms such as BPE split words into units smaller than words to mitigate the out-of-vocabulary problem in neural network models.
▪ Data upsampling, downsampling, or loss weighting can be used to address the data imbalance issue.
▪ Hyperparameters are parameters of the model or the training algorithm. They can be optimized using manual, grid, or random search (a random-search sketch follows).
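
A small sketch of random search over hyperparameters follows; the search space and the train_and_evaluate placeholder are invented for illustration.

import random

search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "hidden_size": [64, 128, 256],
    "dropout": [0.1, 0.3, 0.5],
}

def train_and_evaluate(config):
    """Placeholder: train a model with `config` and return a validation metric."""
    return random.random()  # stands in for, e.g., validation F1

random.seed(0)
best_score, best_config = float("-inf"), None
for _ in range(20):  # random search: sample 20 configurations
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = train_and_evaluate(config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
# Grid search would instead loop over every combination in `search_space`;
# manual search tweaks one hyperparameter at a time by hand.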
Calculating accuracy, precision, and recall
2-class problem
Instance     Actual label    Predicted label
Review 1     Positive        Positive
Review 2     Negative        Negative
Review 3     Positive        Positive
Review 4     Positive        Negative
Review 5     Negative        Positive
Review 6     Positive        Negative
Review 7     Negative        Negative
Review 8     Negative        Positive
Review 9     Positive        Positive
Review 10    Negative        Negative
3-class problem (positive / neutral / negative)
Confusion matrix over the three classes (rows: actual class, columns: predicted class):

            positive   neutral   negative
positive        3          1         2
neutral         2          2         1
negative        1          0         3

For each class, TP, FP, FN, and TN are obtained by treating that class as "positive" and the remaining two classes as "others" (positive vs. others, neutral vs. others, negative vs. others). For the positive class, for example: TP = 3, FP = 3, FN = 3, TN = 6.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 score = 2 * Precision * Recall / (Precision + Recall)
Macro-averaged accuracy = (A1 + A2 + A3) / 3, i.e., the mean of the per-class accuracies A1, A2, and A3.
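
As a check on the formulas, the following Python snippet computes accuracy, precision, recall, and the F1 score for the 10-review 2-class example above (labels abbreviated to pos/neg).

# Actual and predicted labels copied from the 2-class review table.
actual    = ["pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "pos", "neg", "pos", "neg", "neg", "pos", "pos", "neg"]

tp = sum(a == "pos" and p == "pos" for a, p in zip(actual, predicted))
tn = sum(a == "neg" and p == "neg" for a, p in zip(actual, predicted))
fp = sum(a == "neg" and p == "pos" for a, p in zip(actual, predicted))
fn = sum(a == "pos" and p == "neg" for a, p in zip(actual, predicted))

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)                   # 3 3 2 2
print(accuracy, precision, recall, f1)  # 0.6 0.6 0.6 0.6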