NLP PDF
1. Tokenization
Tokenization is the process of splitting text into smaller units, such as words or sentences.
```python
from nltk.tokenize import word_tokenize  # requires the 'punkt' tokenizer data

tokens = word_tokenize(sentence)
```
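To make the idea concrete, here is a minimal regex-based tokenizer, a simplified stand-in for NLTK's `word_tokenize` (which additionally handles contractions and many punctuation conventions):

```python
import re

def simple_word_tokenize(text):
    # Match runs of word characters, or single punctuation marks.
    # A rough approximation of what a real tokenizer does.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("NLP is fun, isn't it?")
```

Note that punctuation becomes its own token, which is usually what downstream steps expect.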
2. Stop Words
Stop words are common words (e.g., "the", "is", "and") that are usually removed from text because they carry little meaning on their own.
```python
from nltk.corpus import stopwords  # requires the 'stopwords' corpus data

stop_words = set(stopwords.words('english'))
```
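Filtering then becomes a simple membership test over the tokens. A sketch using a small hand-picked stop-word set (NLTK's English list is far larger):

```python
# Tiny illustrative stop-word set; in practice use stopwords.words('english').
stop_words = {"the", "is", "a", "an", "of", "and", "in"}

tokens = ["the", "cat", "sat", "in", "the", "hat"]
filtered = [t for t in tokens if t not in stop_words]
# filtered == ["cat", "sat", "hat"]
```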
3. Stemming and Lemmatization
- Stemming cuts words down to their root form by stripping suffixes, e.g., "running" becomes "run".
- Lemmatization reduces words to their base form using language rules, e.g., "better" to "good".
Example using NLTK:
```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
```
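The difference between the two approaches can be sketched without NLTK. The toy stemmer below strips common suffixes (a crude caricature of the Porter algorithm), while the toy lemmatizer looks words up in a dictionary of known forms; both the suffix list and the lemma table are illustrative assumptions, not NLTK's actual rules:

```python
def toy_stem(word):
    # Crude suffix stripping in the spirit of (but much simpler than) Porter.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer consults a dictionary of word forms rather than suffix rules.
toy_lemmas = {"better": "good", "ran": "run", "mice": "mouse"}

def toy_lemmatize(word):
    return toy_lemmas.get(word, word)

print(toy_stem("running"))      # "runn" -- stems need not be valid words
print(toy_lemmatize("better"))  # "good" -- lemmas always are
```

This also shows the key trade-off: stemming is fast but can produce non-words, while lemmatization returns real dictionary forms but needs linguistic resources.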
4. Word Embeddings
Word embeddings are dense vector representations of words capturing semantic relationships.
```python
from gensim.models import Word2Vec
# sentences: a list of tokenized sentences (lists of token strings)
model = Word2Vec(sentences, vector_size=100, min_count=1)
vector = model.wv['hello']
```
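"Capturing semantic relationships" means that related words end up close together in the vector space, usually measured by cosine similarity. A sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions and are learned from data):

```python
import math

# Toy hand-written "embeddings" for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vectors["king"], vectors["queen"]))  # close to 1
print(cosine(vectors["king"], vectors["apple"]))  # much lower
```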
5. Bag of Words and TF-IDF
Bag of Words represents text as raw word counts, while TF-IDF weights each word by its term frequency and inverse document frequency.
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# text must be a list of document strings, not a single string
bow = CountVectorizer().fit_transform(text)
tfidf = TfidfVectorizer().fit_transform(text)
```
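The weighting can also be computed by hand, which makes the intuition visible: a word appearing in every document gets an idf of zero and therefore no weight. A minimal sketch using the classic tf-idf formula tf(w, d) * log(N / df(w)) (scikit-learn uses a smoothed variant, so its numbers differ slightly):

```python
import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]

# Bag of Words: each document becomes a word-count vector.
bow = [Counter(doc) for doc in docs]

# Document frequency: in how many documents does each word appear?
n_docs = len(docs)
df = Counter(word for doc in docs for word in set(doc))

def tfidf(word, doc):
    tf = doc.count(word) / len(doc)
    idf = math.log(n_docs / df[word])
    return tf * idf

print(tfidf("the", docs[0]))  # 0.0 -- "the" appears in every document
print(tfidf("cat", docs[0]))  # > 0 -- "cat" is distinctive to this document
```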
6. Vector Databases
Vector databases store word embeddings or document vectors for efficient similarity search.
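At its core, a vector database answers nearest-neighbor queries: given a query vector, return the stored vectors most similar to it. The sketch below does this by brute force over a toy in-memory store; production systems (e.g., FAISS, Milvus, Pinecone) add approximate indexes such as HNSW so they can avoid scanning every vector:

```python
import math

# Toy in-memory "vector store": document id -> embedding.
store = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.9, 0.1, 0.0],
    "doc3": [0.0, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query, k=2):
    # Brute-force nearest-neighbor search: rank every stored vector
    # by cosine similarity to the query and return the top k ids.
    ranked = sorted(store, key=lambda doc_id: cosine(store[doc_id], query),
                    reverse=True)
    return ranked[:k]

print(search([1.0, 0.05, 0.0]))  # doc1 and doc2 rank above doc3
```

Brute force is exact but O(N) per query; the approximate indexes trade a little recall for large speedups at scale.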