Assignment 1_NLP
2. Describe the process of tokenization in text processing and its importance in NLP applications.
Ans. Tokenization is the process of breaking down text into smaller units, such as words or sentences, known as tokens. It’s
important because most NLP tasks (e.g., sentiment analysis, translation) require working with individual tokens rather than
entire text blocks, enabling better analysis, feature extraction, and text understanding.
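A minimal sketch of word and sentence tokenization using NLTK is given below (the library choice is an assumption; any tokenizer would illustrate the idea). The 'punkt' tokenizer data must be available, and the exact resource name can vary by NLTK version.

# Minimal tokenization sketch using NLTK (assumes the 'punkt' tokenizer data can be downloaded;
# newer NLTK releases may also require the 'punkt_tab' resource).
import nltk
nltk.download("punkt", quiet=True)

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization splits text into units. Each unit is called a token."

print(sent_tokenize(text))  # sentence tokens: ['Tokenization splits text into units.', 'Each unit is called a token.']
print(word_tokenize(text))  # word tokens: ['Tokenization', 'splits', 'text', 'into', 'units', '.', ...]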
3. Discuss the advantages and limitations of the Bag of Words model for text representation.
Ans. Advantages: The Bag of Words (BoW) model is simple to implement and effective for representing text by counting word
occurrences, enabling easy text classification and clustering. Limitations: It disregards word order, context, and meaning,
leading to loss of semantic information and an inability to handle polysemy (words with multiple meanings).
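As a quick illustration, a Bag of Words representation can be sketched with scikit-learn's CountVectorizer (the tool choice is an assumption; any word-counting routine would do):

# Bag of Words sketch: each document becomes a vector of raw word counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # sparse document-term matrix of counts

print(vectorizer.get_feature_names_out())  # vocabulary: ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(bow.toarray())  # counts per document; note that word order and context are discarded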
4. Explain how TF-IDF improves upon the Bag of Words model for text analysis. Provide an example of how TF-IDF might be
used in practice.
Ans. TF-IDF (Term Frequency-Inverse Document Frequency) improves BoW by giving less weight to common words and more
weight to rare but important words in the text, thereby enhancing relevance. For example, in document classification, TF-IDF
helps identify key terms that distinguish one document from others by reducing the impact of frequent but non-informative
words (e.g., "the," "is").
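The standard weighting multiplies a term's frequency in a document by the log of (total number of documents / number of documents containing the term). A short sketch using scikit-learn's TfidfVectorizer (an assumed tool choice; its variant of the formula adds smoothing and normalization) shows how common words are down-weighted relative to rarer, more distinctive ones:

# TF-IDF sketch: term frequency scaled by inverse document frequency.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on a mat",
    "the dog chased a cat",
    "the bird flew over a mat",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# "the" appears in every document, so its IDF (and hence its weight) is the lowest here;
# words unique to one document (e.g., "chased", "flew") receive higher weights.
for term, score in zip(vectorizer.get_feature_names_out(), tfidf.toarray()[1]):
    print(f"{term}: {score:.2f}")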
5. What is NLTK, and how does it assist in Natural Language Processing tasks? Mention at least two functionalities provided by
the NLTK library.
Ans. NLTK (Natural Language Toolkit) is a comprehensive library in Python that supports NLP tasks such as text preprocessing,
tokenization, and sentiment analysis. Two key functionalities of NLTK include:
Tokenization: Breaking text into words or sentences.
Stemming and Lemmatization: Reducing words to their base or root form.
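A small sketch of the second functionality, stemming and lemmatization with NLTK (the 'wordnet' corpus is assumed to be downloadable):

# Stemming vs. lemmatization in NLTK: both reduce words to a base form.
import nltk
nltk.download("wordnet", quiet=True)  # WordNet data needed by the lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better"]

print([stemmer.stem(w) for w in words])                   # ['run', 'studi', 'better'] (rule-based suffix stripping)
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # ['run', 'study', 'better'] (dictionary-based, uses a POS hint)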
6. In text processing, what are stop words, and why are they typically removed from text data before analysis? Provide
examples of common stop words.
Ans. Stop words are common words (e.g., "the," "is," "in") that appear frequently in text but carry little meaning. They are
removed before analysis to reduce noise and focus on more significant words. Removing stop words helps improve the
accuracy and efficiency of NLP tasks such as text classification or search engine indexing.
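A short sketch of stop word removal using NLTK's built-in English stop word list (the 'stopwords' and 'punkt' resources are assumed to be downloadable):

# Stop word removal sketch: filter out frequent, low-information words before analysis.
import nltk
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The cat is sitting in the garden")

filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['cat', 'sitting', 'garden']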