Unit - 1
For training a language model, a number of probabilistic approaches are used. These approaches
vary depending on the purpose for which the language model is created. The amount of text data
to be analyzed and the mathematics applied for analysis both make a difference in the approach
followed for creating and training a language model.
For example, a language model used for predicting the next word in a search query will be
quite different from one used to predict the next word in a long document (such as in
Google Docs). The approach followed to train the model would be unique in each case.
You could develop a language model and use it standalone, for example to generate new
sequences of text that appear to have come from the training corpus.
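As a rough illustration of this standalone use, the sketch below builds a simple bigram model from a tiny hypothetical corpus and samples new word sequences from it. The corpus, function names, and the <s>/</s> boundary markers are all assumptions made for this example, not part of any particular library.

```python
import random
from collections import defaultdict

def train_bigram_model(corpus):
    """Record which words follow each word in the tokenized corpus."""
    model = defaultdict(list)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev].append(nxt)
    return model

def generate(model, max_len=20):
    """Sample a new sequence by walking the bigram transitions."""
    word, output = "<s>", []
    for _ in range(max_len):
        word = random.choice(model[word])
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

# Toy corpus (hypothetical) standing in for a real body of text
corpus = ["the horse runs fast", "the white horse runs", "a horse runs"]
model = train_bigram_model(corpus)
print(generate(model))  # e.g. "the white horse runs fast"
```

Each generated word is drawn only from words that actually followed the previous word somewhere in the corpus, so the output reads as if it came from the source text without reproducing it verbatim.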
Language modeling is a core problem in a wide range of natural language
processing tasks. Language models are generally used on the front end or back end of a more
sophisticated model for a task that requires language understanding.
2. Machine Translation: Google Translate and Microsoft Translator are examples of how
NLP models can help translate one language into another.
3. Sentiment Analysis: This helps in analyzing the sentiment behind a phrase. This use case
of NLP models appears in products that allow businesses to understand the intent
behind opinions or attitudes a customer expresses in text. HubSpot's Service Hub is an
example of how language models can help in sentiment analysis.
4. Text Suggestions: Google services such as Gmail or Google Docs use language models
to help users get text suggestions while they compose an email or create long text
documents, respectively.
5. Parsing Tools: Parsing involves analyzing whether sentences or words comply with syntax
or grammar rules. Spell-checking tools are perfect examples of language modeling and
parsing.
Language models are also used to generate text in other similar language processing tasks like
optical character recognition, handwriting recognition, image captioning, etc.
2. Exponential Growth
The second challenge is that the number of possible n-grams grows as the nth power of the
vocabulary size: a vocabulary of size V yields Vⁿ possible n-grams. A 10,000-word vocabulary
has 10¹² possible trigrams, and a 100,000-word vocabulary has 10¹⁵.
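These figures follow directly from the combinatorics: with V words there are V choices at each of the n positions, so a quick check reproduces the counts above.

```python
# The number of possible n-grams for a vocabulary of size V is V**n
for V in (10_000, 100_000):
    print(f"V = {V:,}: possible trigrams = {V**3:.0e}")
# V = 10,000: possible trigrams = 1e+12
# V = 100,000: possible trigrams = 1e+15
```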
3. Generalization
The last issue with MLE techniques is the lack of generalization. If the model sees the term
'white horse' in the training data but never sees 'black horse', the MLE estimate assigns zero
probability to 'black horse', even though it is a perfectly plausible phrase.
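A minimal sketch of the problem, using hypothetical counts: the MLE estimate P(word | prev) = count(prev, word) / count(prev) is zero for any bigram that never occurred in training, no matter how plausible it is.

```python
from collections import Counter

# Hypothetical training counts: 'white horse' occurs, 'black horse' never does
bigram_counts = Counter({("white", "horse"): 12, ("white", "house"): 3})
unigram_counts = Counter({"white": 15, "black": 4})

def mle_prob(prev, word):
    """MLE estimate: P(word | prev) = count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_prob("white", "horse"))  # 0.8 -- seen in training
print(mle_prob("black", "horse"))  # 0.0 -- unseen, so MLE assigns zero probability
```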