Understanding Each Pre-Processing Aspect
1. Stopword Removal
- What it is: Stopwords are frequent words like “the,” “is,” or “and” that often
carry little unique meaning. Removing them reduces noise and focuses on content-
bearing words.
- When to use: Useful in tasks like text classification, clustering, or search engines
where common words don’t add value. However, retain stopwords in tasks like
sentiment analysis where words like “not” affect meaning.
- How to apply: Use a pre-defined stopword list (e.g., from libraries like NLTK or
spaCy) or create a custom list tailored to your domain.
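As a minimal, dependency-free sketch (using a small hand-picked stopword list here rather than NLTK's or spaCy's full lists, so the snippet is self-contained):

```python
# Small illustrative stopword list; in practice, load a fuller list
# (e.g., nltk.corpus.stopwords.words("english")) and extend it with
# domain-specific terms.
STOPWORDS = {"the", "is", "and", "a", "an", "of", "to", "in"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "the cat sat in the hat".split()
print(remove_stopwords(tokens))  # ['cat', 'sat', 'hat']
```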
2. Thesauri
- What it is: A thesaurus maps words to synonyms and related terms, so that queries
and documents using different vocabulary (e.g., “car” vs. “automobile”) can still
match.
3. Soundex
- What it is: Soundex is a phonetic algorithm that encodes a word into a short code
based on how it sounds, so spelling variants of the same name (e.g., “Robert” and
“Rupert”) receive the same code.
- When to use: Ideal for matching names or words with spelling variations, such
as in database searches or error-tolerant systems.
- How to apply: Apply the Soundex algorithm to convert words into phonetic
codes, then match or group based on these codes.
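A self-contained implementation of the classic American Soundex rules (keep the first letter, map remaining consonants to digits, drop h/w so they don't separate identical codes, collapse adjacent duplicates, pad to four characters) might look like:

```python
def soundex(name):
    """Encode a word as a 4-character American Soundex code."""
    name = "".join(c for c in name.upper() if c.isalpha())
    digit = {}
    for letters, d in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
        for c in letters:
            digit[c] = d
    # Vowels (and Y) become '0' markers that separate runs; H and W are
    # dropped entirely, so they do NOT separate identical codes.
    encoded = "".join(digit.get(c, "" if c in "HW" else "0") for c in name)
    # Collapse adjacent identical codes.
    collapsed = ""
    for d in encoded:
        if not collapsed or d != collapsed[-1]:
            collapsed += d
    # Keep the first letter, drop the vowel markers, pad/truncate to 4 chars.
    return (name[0] + collapsed[1:].replace("0", "") + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Words with the same code (here, R163) are then treated as matches, which is what makes the scheme useful for error-tolerant name search.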
4. Stemming
- What it is: Stemming reduces words to their root form (e.g., “running” → “run”)
using heuristic rules.
- When to use: Useful in search engines or text matching to handle different word
forms, especially in morphologically rich languages.
- How to apply: Use algorithms like the Porter Stemmer or Snowball Stemmer,
available in libraries like NLTK.
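In NLTK the real thing is `nltk.stem.PorterStemmer`; to keep this example dependency-free, here is a toy suffix-stripping stemmer in the same spirit (the rules below are illustrative, not Porter's actual rules):

```python
def simple_stem(word):
    """Toy heuristic stemmer: strip one common suffix, then undouble a
    trailing consonant left behind by -ing/-ed removal (running -> run).
    Real stemmers (Porter, Snowball) use far more careful rule sets."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if (suffix in ("ing", "ed") and len(stem) >= 2
                    and stem[-1] == stem[-2] and stem[-1] not in "aeiouls"):
                stem = stem[:-1]
            return stem
    return word

print([simple_stem(w) for w in ["running", "jumped", "cats", "falling"]])
# ['run', 'jump', 'cat', 'fall']
```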
5. Morphological Analysis
- What it is: Morphological analysis determines a word’s structure (root, affixes,
grammatical features) and, via lemmatization, maps it to its dictionary form (e.g.,
“better” → “good”) using vocabulary and grammar rather than heuristic suffix-
stripping.
- When to use: Preferred in tasks requiring precise word forms (e.g., machine
translation) or in languages with complex morphology.
- How to apply: Use lemmatization tools (e.g., spaCy, Stanford NLP) with part-of-
speech tagging, or advanced morphological analyzers for specific languages.
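With spaCy, lemmas are read from `token.lemma_` after running text through a loaded pipeline. As a dependency-free illustration of the underlying idea, here is a dictionary-based lookup; the tiny table is a hypothetical toy lexicon, nothing more:

```python
# Toy lemma lexicon -- real lemmatizers (spaCy, Stanford NLP) combine large
# lexicons with POS-aware rules; this table is purely illustrative.
LEMMAS = {"better": "good", "running": "run", "mice": "mouse", "was": "be"}

def lemmatize(token):
    """Return the dictionary form of a token, or the token itself."""
    return LEMMAS.get(token.lower(), token.lower())

print([lemmatize(w) for w in ["Better", "mice", "cat"]])
# ['good', 'mouse', 'cat']
```

Note the contrast with stemming: “better” maps to the distinct lemma “good”, which no suffix-stripping rule could produce.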
6. N-grams with Stemming
- What it is: N-grams are sequences of *n* items (words or characters) from text
(e.g., bigrams: “machine learning”). Combining with stemming means generating
N-grams from stemmed words.
- When to use: N-grams capture context for tasks like language modeling or text
classification. Stemming beforehand reduces vocabulary size and groups related
phrases.
- How to apply: Stem words first, then generate N-grams (e.g., “running fast” →
“run fast” → [“run fast”] as a bigram).
- N-grams are generated after stemming if combining the two, or from raw tokens
if preserving full words.
- Thesauri and Soundex are typically applied later or separately (e.g., during
query processing or matching).
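Generating N-grams from already-stemmed tokens, as described above, is a short sliding-window operation; a sketch:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# "running fast" -> stemmed to ["run", "fast"] -> one bigram
print(ngrams(["run", "fast"], 2))           # [('run', 'fast')]
print(ngrams(["run", "fast", "today"], 2))  # [('run', 'fast'), ('fast', 'today')]
```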
7. Addressing Limitations
- N-grams: Higher *n* increases context but also data size. Tune *n* (e.g., 1-3).
8. Implementation
1. Clean text
2. Tokenize
3. Remove stopwords
4. Stem
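The steps above can be put together in one dependency-free sketch; the stopword list and stemming rules here are small illustrative toys, standing in for NLTK/spaCy resources:

```python
import re

STOPWORDS = {"the", "is", "are", "and", "a", "an", "of", "to", "in"}

def clean(text):
    """Lowercase and strip everything except letters and spaces."""
    return re.sub(r"[^a-z\s]", "", text.lower())

def stem(word):
    """Toy suffix stripper (illustrative; use Porter/Snowball in practice)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
                word = word[:-1]
            return word
    return word

def preprocess(text):
    """clean -> tokenize -> remove stopwords -> stem."""
    tokens = clean(text).split()
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The cats are running fast!"))  # ['cat', 'run', 'fast']
```

From here, the stemmed tokens can be fed into N-gram generation or directly into a downstream model.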