
SocrAI Day 3

Definition and Overview of NLP

● What is NLP?
○ Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the
interaction between computers and humans through natural language.
○ NLP applies algorithms that identify and extract the rules of natural language so that unstructured language data can be converted into a form computers can understand.
● Importance of NLP
○ NLP is crucial in enabling computers to understand and process human language, facilitating
more natural interactions between humans and machines.
○ Applications include chatbots, language translation, sentiment analysis, and more.
Text Tokenization
● Text Tokenization
○ Tokenization is the process of splitting text into individual words or phrases, called
tokens.
○ Importance: Tokenization helps in breaking down text into manageable pieces for
further processing.

Code Example:
import nltk
nltk.download('punkt')  # tokenizer models, needed on the first run

from nltk.tokenize import word_tokenize

text = "Natural Language Processing with Python."
tokens = word_tokenize(text)
print(tokens)
Stop Word Removal

Stop Word Removal

● Stop words are common words that carry little meaningful information, such as 'and', 'the', 'is'.
● Importance: Removing stop words reduces the size of the dataset and improves the performance of
NLP models.
Code Example

import nltk
nltk.download('stopwords')  # stop word lists, needed on the first run

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "Natural Language Processing with Python."  # same sentence as in the tokenization example
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
filtered_words = [w for w in words if w.lower() not in stop_words]  # case-insensitive check
print(filtered_words)
Stemming and Lemmatization

Stemming

● Reduces words to their base or root form (e.g., 'running' to 'run').


Code Example

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmed_words = [stemmer.stem(word) for word in filtered_words]

print(stemmed_words)
Lemmatization

● Converts words to their base dictionary form (lemma), e.g., 'better' to 'good' when the word is tagged as an adjective (see the note after the code example).
Code Example

import nltk
nltk.download('wordnet')  # lemmatizer data, needed on the first run

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print(lemmatized_words)
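
Note: without a part-of-speech hint, WordNetLemmatizer treats words as nouns and leaves 'better' unchanged; the 'better' to 'good' mapping only appears when the word is tagged as an adjective. A quick check, reusing the lemmatizer above:

print(lemmatizer.lemmatize('better'))           # 'better' (default POS is noun)
print(lemmatizer.lemmatize('better', pos='a'))  # 'good'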
Feature Extraction in NLP
Bag of Words (BoW) Model

● Represents text as a collection of word frequencies, ignoring grammar and word order.
● Importance: Simple and effective method for text representation in machine learning.
Code Example

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Natural language processing is fun.", "Python is great for NLP."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF)

● Weighs words based on their frequency in a document and their rarity across documents.
● Importance: Reduces the impact of frequently occurring common words that are less informative.
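
For reference, with scikit-learn's defaults (smooth_idf=True followed by L2 normalization of each document vector), the weight of term t in document d is approximately

\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \left( \log \frac{1 + n}{1 + \text{df}(t)} + 1 \right)
\]

where n is the number of documents and df(t) is the number of documents containing t.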
Code Example

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())

print(X.toarray())
Word Embeddings

Word Embeddings

● Represents words as dense vectors in a continuous vector space.
● Importance: Captures semantic relationships between words (e.g., king - man + woman ≈ queen; an analogy query is sketched after the code example below).
● Word2Vec and GloVe
○ Word2Vec: Learns vectors by predicting context words from a target word (skip-gram) or a target word from its context (CBOW).
○ GloVe: Combines global word co-occurrence statistics and local context to generate word vectors.
Code Example

from gensim.models import Word2Vec

sentences = [["natural", "language", "processing"], ["word", "embeddings", "are", "fun"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=4)

print(model.wv['natural'])
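
The toy corpus above is far too small to exhibit analogies; with vectors trained on a large corpus, the king - man + woman ≈ queen relationship can be queried via most_similar. A sketch, assuming a pretrained model from gensim's downloader (the model name is illustrative of the approach, and the download is large):

import gensim.downloader as api

wv = api.load('word2vec-google-news-300')  # pretrained word vectors
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))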
Introduction to Transformers

Transformer Architecture

● Introduced in the paper "Attention is All You Need".
● Relies entirely on self-attention mechanisms, eschewing recurrent and convolutional layers.
● Components:
○ Encoder: Processes the input text.
○ Decoder: Generates the output text.
○ Attention Mechanisms: Focus on relevant parts of the input (a minimal self-attention sketch follows).
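
To make the attention idea concrete, here is a minimal scaled dot-product self-attention sketch in PyTorch (single head, no masking; shapes and names are illustrative, not the full transformer layer):

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # pairwise similarity, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)      # attention weights over the sequence
    return weights @ v                       # weighted sum of value vectors

x = torch.randn(5, 16)                       # 5 tokens, d_model = 16
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 8])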
BERT: Bidirectional Encoder Representations
from Transformers
BERT Overview

● Bidirectional Context: Attends to left and right context simultaneously, i.e., reads text in both directions.
● Applications: Question answering, sentiment analysis, named entity recognition.
Code Example

from transformers import BertTokenizer, BertModel


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
text = "Transformers are revolutionary for NLP tasks."
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)

print(outputs.last_hidden_state)
GPT: Generative Pre-trained Transformer

GPT Overview

● Unidirectional Context: Reads text in one direction (left-to-right).


● Applications: Text generation, summarization, translation.
Code Example

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
text = "Transformers are"
inputs = tokenizer(text, return_tensors='pt')
outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Fine-tuning Overview

Importance of Fine-Tuning

● Adapts pre-trained models to specific tasks or datasets.


● Enhances model performance by leveraging large-scale pre-training.
Code Example
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

training_args = TrainingArguments(output_dir='./results', num_train_epochs=3, per_device_train_batch_size=4)

# train_dataset and eval_dataset are assumed to be tokenized datasets prepared beforehand.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)

trainer.train()
Performance Metrics

Performance Metrics

● Accuracy: Proportion of correctly predicted instances.


● Precision: Proportion of true positive results among all positive predictions.
● Recall: Proportion of true positive results among all actual positives.
● F1 Score: Harmonic mean of precision and recall.
Code Example

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1]

y_pred = [0, 1, 0, 0, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))

print("Precision:", precision_score(y_true, y_pred))

print("Recall:", recall_score(y_true, y_pred))

print("F1 Score:", f1_score(y_true, y_pred))


Pandas Overview

● What is Pandas?
○ Pandas is a powerful data manipulation and analysis library for Python.
○ It provides the core data structures (such as the DataFrame) and functions needed to work on structured data seamlessly (a minimal example follows this list).
● Importance of Pandas
○ Simplifies data cleaning, transformation, and analysis.
○ Widely used in data science and machine learning for preprocessing
data.
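
As a minimal illustration of the central data structure, a DataFrame can be built directly from a dictionary (the column names and values here are made up):

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Cara'], 'score': [88, 92, 79]})
print(df)
print(df['score'].mean())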
Loading and Inspecting Data
● Loading Data with Pandas

Code Example:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

● Inspecting Data
○ Checking for missing values, data types, and basic statistics (a missing-value count is shown after the snippet below).

Code Example:
print(df.info())

print(df.describe())
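
For the missing-value check mentioned above, a common one-liner is a per-column null count (same df assumed):

print(df.isnull().sum())  # number of missing values in each column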
Data Cleaning

Handling Missing Values

● Techniques: Dropping, filling, or imputing missing values.


Code Example

df = df.dropna()   # Option 1: drop rows with missing values
df = df.fillna(0)  # Option 2: fill missing values with 0 (pick one strategy, not both)

● Removing Duplicates
○ Code Example:
df = df.drop_duplicates()
Data Aggregation and Grouping

Grouping Data

● Group data by columns and apply aggregate functions.


Code Example

grouped = df.groupby('category').mean(numeric_only=True)  # per-group mean of the numeric columns

print(grouped)
Data Visualization with Pandas

Code Example:
import matplotlib.pyplot as plt

df['column_name'].plot(kind='hist')

plt.show()
Merging DataFrames

Code Example:
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})

merged_df = pd.merge(df1, df2, on='key', how='inner')

print(merged_df)
Handling Time Series Data

Code Example:
df['date'] = pd.to_datetime(df['date'])

df.set_index('date', inplace=True)

print(df.resample('M').mean()) # Resample data to monthly frequency


Overview of PyTorch Library

● What is PyTorch?
○ PyTorch is an open-source machine learning library developed by
Facebook's AI Research lab.
○ Provides flexibility and speed in building and training neural networks.
● Importance of PyTorch
○ Dynamic computation graph, making it intuitive and easier to debug.
○ Extensive support for GPU acceleration (see the device check below).
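
A quick way to see the GPU support mentioned above is the standard device-selection idiom (a minimal sketch):

import torch

# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.ones(3, 3, device=device)
print(x.device)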
Tensors in PyTorch
● Creating Tensors
Code Example:
import torch
x = torch.tensor([1.0, 2.0, 3.0])
print(x)
y = torch.ones_like(x)
print(y)

● Tensor Operations
Code Example:
z = x + y
print(z)
Automatic Differentiation

Autograd

Code Example:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = x * 2

z = y.mean()

z.backward()

print(x.grad) # Gradient of z w.r.t x


Defining a Neural Network with PyTorch
● Building a Simple Neural Network

Code Example:
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(10, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleNN()
print(model)
Training a Neural Network
● Training Loop

Code Example:
import torch.optim as optim

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# inputs and targets are assumed to be tensors of shape (N, 10) and (N, 1).
for epoch in range(100):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')


Advanced PyTorch Techniques

Transfer Learning Overview

● Using pre-trained models on new tasks.


Code Example

from torchvision import models

model = models.resnet18(pretrained=True)

# Freeze the pretrained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# num_classes is assumed to be the number of classes in the new task.
model.fc = nn.Linear(model.fc.in_features, num_classes)


Model Evaluation and Inference

Code Example:
model.eval()  # Set model to evaluation mode

# test_inputs is assumed to be a batch of preprocessed inputs for the model.
with torch.no_grad():
    outputs = model(test_inputs)
    predicted = torch.argmax(outputs, dim=1)

print(predicted)
Summary

● Key Takeaways
○ Importance of NLP, transformers, and their applications.
○ Efficient data manipulation and analysis with Pandas.
○ Building and training neural networks using PyTorch.
