
SocrAI Day 3

Definition and Overview of NLP

● What is NLP?
○ Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the
interaction between computers and humans through natural language.
○ NLP applies algorithms that identify and extract the rules of natural language so that unstructured language data can be converted into a form computers can understand.
● Importance of NLP
○ NLP is crucial in enabling computers to understand and process human language, facilitating
more natural interactions between humans and machines.
○ Applications include chatbots, language translation, sentiment analysis, and more.
Text Tokenization
● Text Tokenization
○ Tokenization is the process of splitting text into individual words or phrases, called
tokens.
○ Importance: Tokenization helps in breaking down text into manageable pieces for
further processing.

Code Example:
import nltk
nltk.download('punkt')  # tokenizer models, needed on the first run

from nltk.tokenize import word_tokenize

text = "Natural Language Processing with Python."
tokens = word_tokenize(text)
print(tokens)
Stop Word Removal

Stop Word Removal

● Stop words are common words that carry little meaningful information, such as 'and', 'the', 'is'.
● Importance: Removing stop words reduces the size of the dataset and improves the performance of
NLP models.
Code Example

import nltk
nltk.download('stopwords')  # stop word lists, needed on the first run

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "Natural Language Processing with Python."  # same sentence as in the tokenization example
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
filtered_words = [w for w in words if w.lower() not in stop_words]  # case-insensitive check
print(filtered_words)
Stemming and Lemmatization

Stemming

● Reduces words to their base or root form (e.g., 'running' to 'run').


Code Example

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmed_words = [stemmer.stem(word) for word in filtered_words]

print(stemmed_words)
Lemmatization

● Converts words to their base dictionary form (lemma), e.g., 'better' to 'good' when the word is tagged as an adjective (see the note after the code example).
Code Example

import nltk
nltk.download('wordnet')  # lemmatizer data, needed on the first run

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print(lemmatized_words)
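
Note: without a part-of-speech hint, WordNetLemmatizer treats words as nouns and leaves 'better' unchanged; the 'better' to 'good' mapping only appears when the word is tagged as an adjective. A quick check, reusing the lemmatizer above:

print(lemmatizer.lemmatize('better'))           # 'better' (default POS is noun)
print(lemmatizer.lemmatize('better', pos='a'))  # 'good'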
Feature Extraction in NLP
Bag of Words (BoW) Model

● Represents text as a collection of word frequencies, ignoring grammar and word order.
● Importance: Simple and effective method for text representation in machine learning.
Code Example

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Natural language processing is fun.", "Python is great for NLP."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF)

● Weighs words based on their frequency in a document and their rarity across documents.
● Importance: Reduces the impact of frequently occurring common words that are less informative.
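
For reference, with scikit-learn's defaults (smooth_idf=True followed by L2 normalization of each document vector), the weight of term t in document d is approximately

\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \left( \log \frac{1 + n}{1 + \text{df}(t)} + 1 \right)
\]

where n is the number of documents and df(t) is the number of documents containing t.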
Code Example

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())

print(X.toarray())
Word Embeddings

Word Embeddings

● Represents words as dense vectors in a continuous vector space.
● Importance: Captures semantic relationships between words (e.g., king - man + woman ≈ queen; an analogy query is sketched after the code example below).
● Word2Vec and GloVe
○ Word2Vec: Learns vectors by predicting context words from a target word (skip-gram) or a target word from its context (CBOW).
○ GloVe: Combines global word co-occurrence statistics and local context to generate word vectors.
Code Example

from gensim.models import Word2Vec

sentences = [["natural", "language", "processing"], ["word", "embeddings", "are", "fun"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=4)

print(model.wv['natural'])
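
The toy corpus above is far too small to exhibit analogies; with vectors trained on a large corpus, the king - man + woman ≈ queen relationship can be queried via most_similar. A sketch, assuming a pretrained model from gensim's downloader (the model name is illustrative of the approach, and the download is large):

import gensim.downloader as api

wv = api.load('word2vec-google-news-300')  # pretrained word vectors
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))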
Introduction to Transformers

Transformer Architecture

● Introduced in the paper "Attention is All You Need".
● Relies entirely on self-attention mechanisms, eschewing recurrent and convolutional layers.
● Components:
○ Encoder: Processes the input text.
○ Decoder: Generates the output text.
○ Attention Mechanisms: Focus on relevant parts of the input (a minimal self-attention sketch follows).
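
To make the attention idea concrete, here is a minimal scaled dot-product self-attention sketch in PyTorch (single head, no masking; shapes and names are illustrative, not the full transformer layer):

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # pairwise similarity, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)      # attention weights over the sequence
    return weights @ v                       # weighted sum of value vectors

x = torch.randn(5, 16)                       # 5 tokens, d_model = 16
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 8])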
BERT: Bidirectional Encoder Representations
from Transformers
BERT Overview

● Bidirectional Context: Attends to left and right context simultaneously, i.e., reads text in both directions.
● Applications: Question answering, sentiment analysis, named entity recognition.
Code Example

from transformers import BertTokenizer, BertModel


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
text = "Transformers are revolutionary for NLP tasks."
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)

print(outputs.last_hidden_state)
GPT: Generative Pre-trained Transformer

GPT Overview

● Unidirectional Context: Reads text in one direction (left-to-right).


● Applications: Text generation, summarization, translation.
Code Example

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
text = "Transformers are"
inputs = tokenizer(text, return_tensors='pt')
outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Fine-tuning Overview

Importance of Fine-Tuning

● Adapts pre-trained models to specific tasks or datasets.


● Enhances model performance by leveraging large-scale pre-training.
Code Example
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

training_args = TrainingArguments(output_dir='./results', num_train_epochs=3, per_device_train_batch_size=4)

# train_dataset and eval_dataset are assumed to be tokenized datasets prepared beforehand.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)

trainer.train()
Performance Metrics

Performance Metrics

● Accuracy: Proportion of correctly predicted instances.


● Precision: Proportion of true positive results among all positive predictions.
● Recall: Proportion of true positive results among all actual positives.
● F1 Score: Harmonic mean of precision and recall.
Code Example

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1]

y_pred = [0, 1, 0, 0, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))

print("Precision:", precision_score(y_true, y_pred))

print("Recall:", recall_score(y_true, y_pred))

print("F1 Score:", f1_score(y_true, y_pred))


Pandas Overview

● What is Pandas?
○ Pandas is a powerful data manipulation and analysis library for Python.
○ It provides the core data structures (such as the DataFrame) and functions needed to work on structured data seamlessly (a minimal example follows this list).
● Importance of Pandas
○ Simplifies data cleaning, transformation, and analysis.
○ Widely used in data science and machine learning for preprocessing
data.
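
As a minimal illustration of the central data structure, a DataFrame can be built directly from a dictionary (the column names and values here are made up):

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Cara'], 'score': [88, 92, 79]})
print(df)
print(df['score'].mean())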
Loading and Inspecting Data
● Loading Data with Pandas

Code Example:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.head())

● Inspecting Data
○ Checking for missing values, data types, and basic statistics (a missing-value count is shown after the snippet below).

Code Example:
print(df.info())

print(df.describe())
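
For the missing-value check mentioned above, a common one-liner is a per-column null count (same df assumed):

print(df.isnull().sum())  # number of missing values in each column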
Data Cleaning

Handling Missing Values

● Techniques: Dropping, filling, or imputing missing values.


Code Example

df = df.dropna()   # Option 1: drop rows with missing values
df = df.fillna(0)  # Option 2: fill missing values with 0 (pick one strategy, not both)

● Removing Duplicates
○ Code Example:
df = df.drop_duplicates()
Data Aggregation and Grouping

Grouping Data

● Group data by columns and apply aggregate functions.


Code Example

grouped = df.groupby('category').mean(numeric_only=True)  # per-group mean of the numeric columns

print(grouped)
Data Visualization with Pandas

Code Example:
import matplotlib.pyplot as plt

df['column_name'].plot(kind='hist')

plt.show()
Merging DataFrames

Code Example:
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})

merged_df = pd.merge(df1, df2, on='key', how='inner')

print(merged_df)
Handling Time Series Data

Code Example:
df['date'] = pd.to_datetime(df['date'])

df.set_index('date', inplace=True)

print(df.resample('M').mean()) # Resample data to monthly frequency


Overview of PyTorch Library

● What is PyTorch?
○ PyTorch is an open-source machine learning library developed by
Facebook's AI Research lab.
○ Provides flexibility and speed in building and training neural networks.
● Importance of PyTorch
○ Dynamic computation graph, making it intuitive and easier to debug.
○ Extensive support for GPU acceleration (see the device check below).
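
A quick way to see the GPU support mentioned above is the standard device-selection idiom (a minimal sketch):

import torch

# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.ones(3, 3, device=device)
print(x.device)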
Tensors in PyTorch
● Creating Tensors
Code Example:
import torch
x = torch.tensor([1.0, 2.0, 3.0])
print(x)
y = torch.ones_like(x)
print(y)

● Tensor Operations
Code Example:
z = x + y
print(z)
Automatic Differentiation

Autograd

Code Example:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = x * 2

z = y.mean()

z.backward()

print(x.grad) # Gradient of z w.r.t x


Defining a Neural Network with PyTorch
● Building a Simple Neural Network

Code Example:
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(10, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleNN()
print(model)
Training a Neural Network
● Training Loop

Code Example:
import torch.optim as optim

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# inputs and targets are assumed to be tensors of shape (N, 10) and (N, 1).
for epoch in range(100):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')


Advanced PyTorch Techniques

Transfer Learning Overview

● Using pre-trained models on new tasks.


Code Example

from torchvision import models

model = models.resnet18(pretrained=True)

# Freeze the pretrained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# num_classes is assumed to be the number of classes in the new task.
model.fc = nn.Linear(model.fc.in_features, num_classes)


Model Evaluation and Inference

Code Example:
model.eval()  # Set model to evaluation mode

# test_inputs is assumed to be a batch of preprocessed inputs for the model.
with torch.no_grad():
    outputs = model(test_inputs)
    predicted = torch.argmax(outputs, dim=1)

print(predicted)
Summary

● Key Takeaways
○ Importance of NLP, transformers, and their applications.
○ Efficient data manipulation and analysis with Pandas.
○ Building and training neural networks using PyTorch.
