
# NLP (Natural Language Processing) Cheat Sheet - Part 7

---

## 1. **Natural Language Processing Pipeline**

1. **Typical NLP pipeline** (see the sketch after this list)

- **Data Collection**: gather text data (articles, tweets, emails, etc.).
- **Preprocessing**:
  - Cleaning (removing special characters, normalization).
  - Tokenization (splitting into words or sentences).
  - Stopword removal.
- **Feature Extraction**:
  - TF-IDF, Word2Vec, or embeddings.
- **Modeling**:
  - Supervised models (classifiers).
  - Unsupervised models (clustering).
- **Evaluation**:
  - Metrics such as precision, recall, or F1-score.
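
To tie these stages together, here is a minimal end-to-end sketch, assuming scikit-learn and an invented toy dataset (the example sentences, labels, and the `clean` helper are illustrative, not from the original): cleaning, TF-IDF feature extraction with stopword removal, a supervised classifier, and evaluation with precision, recall, and F1-score.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Data collection step: a tiny made-up corpus with invented labels
texts = [
    "I love this phone, the battery lasts forever!",
    "Terrible service, I will never come back.",
    "Great movie, wonderful acting.",
    "The product broke after two days, very disappointing.",
    "Fantastic experience, highly recommended.",
    "Awful quality and a waste of money.",
]
labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

# Preprocessing: strip URLs and punctuation, lowercase
def clean(text):
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    return text.lower()

texts = [clean(t) for t in texts]

# Feature extraction (TF-IDF with stopword removal) + supervised model
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(),
)

# Evaluation on a held-out split: precision, recall, F1-score per class
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```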

---

## 2. **Text Cleaning and Preprocessing**

1. **Text cleaning with regex**

```python
import re

text = "Hello!!! NLP is amazing... Visit https://example.com"

cleaned_text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)  # Remove punctuation
print(cleaned_text)
```

2. **Tokenization with NLTK**

```python
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # uncomment on first run to fetch the tokenizer data

text = "Natural Language Processing enables machines to understand human language."
tokens = word_tokenize(text)
print(tokens)
```

3. **Stopword removal**

```python
from nltk.corpus import stopwords

# nltk.download('stopwords')  # uncomment on first run to fetch the stopword list

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
```

---

## 3. **Text Vectorization**

1. **TF-IDF with scikit-learn**


```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["I love NLP.", "NLP is amazing.", "I enjoy learning NLP."]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())
```

2. **Bag of Words (CountVectorizer)**


```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(bow_matrix.toarray())
```
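
The pipeline in section 1 also lists Word2Vec and embeddings as feature-extraction options, but no snippet is given for them. Below is a minimal sketch using gensim (an assumed library choice, not named in the original), trained on the same toy `documents` list.

```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

documents = ["I love NLP.", "NLP is amazing.", "I enjoy learning NLP."]

# Word2Vec expects tokenized sentences
sentences = [word_tokenize(doc.lower()) for doc in documents]

# Train a tiny Word2Vec model (parameters kept small for the toy corpus)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=20)

print(model.wv["nlp"])               # dense vector for the word "nlp"
print(model.wv.most_similar("nlp"))  # nearest neighbours in the embedding space
```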

---

## 4. **Sentiment Analysis**

1. **Sentiment analysis with VADER**


```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
sentence = "I absolutely love this product. It's fantastic!"
sentiment_score = analyzer.polarity_scores(sentence)
print(sentiment_score)
```

2. **Sentiment analysis with TextBlob**

```python
from textblob import TextBlob

sentence = "This movie is great, but the ending was disappointing."
blob = TextBlob(sentence)
print(blob.sentiment)
```

---

## 5. **Text Classification**

1. **Text classification with a Hugging Face pipeline**


```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
text = "I really enjoyed the movie. It was fantastic!"
result = classifier(text)
print(result)
```

2. **Supervised classification with scikit-learn**

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

data = ["I love NLP.", "I hate math.", "NLP is fun.", "Math is boring."]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(data, labels)

test_text = ["I enjoy studying NLP."]
print(model.predict(test_text))
```

---

## 6. **Named Entity Recognition (NER)**

1. **Entity extraction with spaCy**

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple is looking to buy a startup in the UK for $1 billion."
doc = nlp(text)

for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
```

2. **NER with Hugging Face**

```python
from transformers import pipeline

ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
text = "Barack Obama was born in Hawaii and became the President of the USA."
entities = ner_pipeline(text)
print(entities)
```

---

## 7. **Machine Translation**

1. **Translation with MarianMT (Hugging Face)**

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

text = "Machine learning is the future of technology."
input_ids = tokenizer.encode(text, return_tensors="pt")  # tokenize the source sentence
result = model.generate(input_ids)  # generate the French translation
translated_text = tokenizer.decode(result[0], skip_special_tokens=True)
print(translated_text)
```

2. **Translation with the Google Translate API (googletrans)**

```bash
pip install googletrans==4.0.0-rc1
```

```python
from googletrans import Translator

translator = Translator()
text = "Natural Language Processing is amazing."
translation = translator.translate(text, src="en", dest="fr")
print(translation.text)
```

---

## 8. **Text Summarization**

1. **Summarization with BART**

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = """Natural Language Processing (NLP) is a fascinating field of artificial intelligence.
It focuses on enabling machines to understand, interpret, and respond to human language."""
summary = summarizer(text, max_length=50, min_length=20, do_sample=False)
print(summary[0]['summary_text'])
```

2. **Summarization with T5**

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

text = "Machine learning and artificial intelligence are transforming industries worldwide."
input_text = f"summarize: {text}"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 9. **Language Models and Text Generation**

1. **Text generation with GPT-2**

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

result = generator("Natural Language Processing is", max_length=50, num_return_sequences=1)
print(result[0]["generated_text"])
```

2. **Text generation with GPT-3 (OpenAI API)**

```python
import openai

# Uses the legacy Completions API (openai<1.0); text-davinci-003 has since been deprecated
openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    engine="text-davinci-003",
    prompt="Explain the importance of NLP in modern technology.",
    max_tokens=100
)
print(response.choices[0].text.strip())
```

---

## 10. **Real-World NLP Applications**

1. **Recommendation Systems**: filter and recommend content based on user preferences.
2. **Chatbots**: virtual assistants that answer customer queries.
3. **Information Retrieval**: extract specific information from a data corpus.
4. **Spam Detection**: identify spam in emails or comments.
5. **Real-Time Translation**: translate conversations across different languages.
6. **Social Media Analysis**: analyze opinions on Twitter or Facebook to understand trends.
