SocrAI Day 3
● What is NLP?
○ Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the
interaction between computers and humans through natural language.
○ NLP applies algorithms to identify and extract the rules of natural language, converting
unstructured language data into a form that computers can process.
● Importance of NLP
○ NLP is crucial in enabling computers to understand and process human language, facilitating
more natural interactions between humans and machines.
○ Applications include chatbots, language translation, sentiment analysis, and more.
Text Tokenization
● Text Tokenization
○ Tokenization is the process of splitting text into individual words or phrases, called
tokens.
○ Importance: Tokenization helps in breaking down text into manageable pieces for
further processing.
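Code Example:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer data; recent NLTK releases may require 'punkt_tab' instead
text = "NLP makes computers understand text."  # sample input
tokens = word_tokenize(text)
print(tokens)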
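To make the idea of a token concrete without any NLTK data downloads, here is a minimal regex-based sketch. The function name simple_tokenize and the sample sentence are illustrative assumptions; this is not how word_tokenize works internally, and it splits contractions differently.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Hello, world!"  # hypothetical input
tokens = simple_tokenize(sample)
print(tokens)
```

Punctuation comes out as separate tokens, which is usually what later steps such as stop word removal expect.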
Stop Word Removal
● Stop words are common words that carry little meaningful information, such as 'and', 'the', 'is'.
● Importance: Removing stop words reduces the size of the dataset and can improve the performance of
NLP models, although tasks that depend on negation or word order may need some of them kept.
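Code Example
from nltk.corpus import stopwords  # needs nltk.download('stopwords')
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
filtered_words = [w for w in words if w.lower() not in stop_words]  # case-insensitive match
print(filtered_words)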
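The filtering step itself is just a membership test against a set. The sketch below shows it in plain Python with a tiny hand-written stop list; STOP_WORDS and remove_stop_words are hypothetical names, and NLTK's English list is much longer.

```python
# A tiny hand-written stop list for illustration only.
STOP_WORDS = {"and", "the", "is", "a", "of", "to", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "cat", "is", "on", "the", "mat"]  # hypothetical tokens
filtered = remove_stop_words(tokens)
print(len(tokens), "->", len(filtered), filtered)
```

Lowercasing before the lookup matters: without it, a sentence-initial "The" would slip through.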
Stemming and Lemmatization
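Stemming
● Reduces words to a root form by stripping suffixes (e.g., 'running' to 'run'); the result is
not always a dictionary word.
Code Example
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in filtered_words]
print(stemmed_words)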
Lemmatization
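● Converts words to their dictionary base form, or lemma (e.g., 'better' to 'good' when
treated as an adjective).
Code Example
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()  # needs nltk.download('wordnet')
lemmatized_words = [lemmatizer.lemmatize(w) for w in filtered_words]  # default part of speech is noun
print(lemmatized_words)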
Feature Extraction in NLP
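Bag of Words (BoW) Model
● Represents each document as a vector of word counts, ignoring word order.
Code Example
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['NLP is fun', 'NLP is useful']  # sample documents
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())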
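What the count matrix contains can be sketched in plain Python. The corpus below is a hypothetical sample, and the whitespace split is a simplification: CountVectorizer also lowercases text and, by default, drops single-character tokens.

```python
from collections import Counter

corpus = ["the cat sat", "the cat saw the dog"]  # hypothetical sample documents

# Vocabulary: sorted unique words across all documents
vocab = sorted({word for doc in corpus for word in doc.split()})

# One row per document, one column per vocabulary word
matrix = []
for doc in corpus:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
print(matrix)
```

Each row has the same length as the vocabulary, so documents of different lengths become vectors of the same size.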
TF-IDF
● TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how often it appears in a
document and how rare it is across all documents.
● Importance: Reduces the impact of common words that occur in most documents and are less informative.
Code Example
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # corpus: a list of document strings
print(vectorizer.get_feature_names_out())
print(X.toarray())
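The weighting can also be worked out by hand. This sketch uses the plain textbook form, tf = count / document length and idf = log(N / df), on a hypothetical three-document corpus; TfidfVectorizer's defaults use a smoothed idf and L2 normalization, so its numbers will differ.

```python
import math
from collections import Counter

corpus = ["the cat sat", "the dog sat", "the dog barked"]  # hypothetical sample documents
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: the number of documents each word appears in
df = Counter(word for doc in docs for word in set(doc))

def tf_idf(doc):
    """Return {word: tf * idf} for one tokenized document."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }

weights = [tf_idf(doc) for doc in docs]
for w in weights:
    print(w)
```

"the" appears in every document, so its idf is log(1) = 0 and its weight vanishes, while "cat", which appears in only one document, gets the largest weight in the first document.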
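Word Embeddings
Code Example
from gensim.models import Word2Vec
sentences = [['natural', 'language', 'processing'], ['language', 'models']]  # sample tokenized corpus
model = Word2Vec(sentences, vector_size=50, min_count=1)  # hyperparameters are illustrative
print(model.wv['natural'])  # the learned vector for 'natural'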
Introduction to Transformers
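Transformer Architecture
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # assumed checkpoint
model = BertModel.from_pretrained('bert-base-uncased')
outputs = model(**tokenizer('NLP is fun', return_tensors='pt'))  # sample sentence
print(outputs.last_hidden_state)  # one contextual vector per input token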
GPT: Generative Pre-trained Transformer
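GPT Overview
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')  # assumed checkpoint
model = GPT2LMHeadModel.from_pretrained('gpt2')
inputs = tokenizer('Natural language processing is', return_tensors='pt')  # sample prompt
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))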
Fine-tuning Overview
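Importance of Fine-Tuning
● Fine-tuning adapts a pre-trained model to a specific task using a comparatively small labeled
dataset, instead of training from scratch.
Code Example
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
training_args = TrainingArguments(output_dir='./results', num_train_epochs=3,
                                  per_device_train_batch_size=4)
# train_dataset: a tokenized, labeled dataset prepared beforehand (not shown)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()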
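Performance Metrics
Code Example
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))  # fraction of correct predictions
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))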
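For the labels above, the standard binary metrics can be computed by hand from the confusion counts. This is a plain-Python sketch that treats 1 as the positive class; it does not guard against division by zero, which would matter if a class were never predicted.

```python
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the model never predicts a false positive, so precision is perfect, but it misses one of the three positives, which lowers recall.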
● What is Pandas?
○ Pandas is a powerful data manipulation and analysis library for Python.
○ It provides data structures and functions needed to work on structured
data seamlessly.
● Importance of Pandas
○ Simplifies data cleaning, transformation, and analysis.
○ Widely used in data science and machine learning for preprocessing
data.
Loading and Inspecting Data
● Loading Data with Pandas
Code Example:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
● Inspecting Data
○ Checking for missing values, data types, basic statistics.
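Code Example:
df.info()  # prints dtypes and non-null counts itself; it returns None, so no print() is needed
print(df.describe())  # basic statistics for numeric columns
print(df.isnull().sum())  # missing values per column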
Data Cleaning
● Removing Duplicates
○ Code Example:
df = df.drop_duplicates()
Data Aggregation and Grouping
Grouping Data
grouped = df.groupby('category').mean(numeric_only=True)  # average only the numeric columns
print(grouped)
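Grouping follows a split, apply, combine pattern. The plain-Python sketch below shows what a per-category mean computes, using hypothetical (category, value) records rather than a real DataFrame.

```python
from collections import defaultdict

rows = [  # hypothetical records: (category, value)
    ("fruit", 2.0), ("fruit", 4.0), ("veg", 3.0),
]

# Split: collect the values that belong to each category
groups = defaultdict(list)
for category, value in rows:
    groups[category].append(value)

# Apply and combine: reduce each group to its mean
means = {category: sum(values) / len(values) for category, values in groups.items()}
print(means)
```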
Data Visualization with Pandas
Code Example:
import matplotlib.pyplot as plt
df['column_name'].plot(kind='hist')
plt.show()
Merging DataFrames
Code Example:
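df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})  # illustrative second frame
merged_df = pd.merge(df1, df2, on='key', how='inner')  # keep only keys present in both frames
print(merged_df)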
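An inner merge keeps only the keys found in both tables. A minimal plain-Python sketch of that logic follows; the right-hand table is hypothetical, and the dictionary lookup assumes its keys are unique, whereas pandas emits one row per matching pair when keys repeat.

```python
left = [("A", 1), ("B", 2), ("C", 3)]   # (key, value1), matching df1
right = [("A", 4), ("B", 5), ("D", 6)]  # (key, value2), hypothetical second table

# Index the right table by key for constant-time lookups
right_by_key = {key: value2 for key, value2 in right}

# Keep left rows whose key also exists on the right, attaching value2
merged = [(key, value1, right_by_key[key]) for key, value1 in left if key in right_by_key]
print(merged)
```

"C" and "D" each appear on only one side, so neither reaches the result.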
Handling Time Series Data
Code Example:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
● What is PyTorch?
○ PyTorch is an open-source machine learning library originally developed by
Facebook's AI Research lab (now Meta AI).
○ Provides flexibility and speed in building and training neural networks.
● Importance of PyTorch
○ Dynamic computation graph, making it intuitive and easier to debug.
○ Extensive support for GPU acceleration.
Tensors in PyTorch
● Creating Tensors
Code Example:
import torch
x = torch.tensor([1.0, 2.0, 3.0])
print(x)
y = torch.ones_like(x)
print(y)
● Tensor Operations
Code Example:
z = x + y
print(z)
Automatic Differentiation
Autograd
Code Example:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2
z = y.mean()
z.backward()  # compute dz/dx and store it in x.grad
print(x.grad)  # each entry is 2/3, since z = mean(2 * x)
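The gradient that autograd stores in x.grad after z.backward() can be sanity-checked numerically. This plain-Python sketch approximates the same derivatives with central differences; numerical_grad and the step size eps are illustrative choices, not part of PyTorch.

```python
def z(x):
    """z = mean(2 * x), the same function as in the autograd example."""
    return sum(2 * xi for xi in x) / len(x)

def numerical_grad(f, x, eps=1e-6):
    """Approximate df/dx_i for each i with central differences."""
    grads = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

grads = numerical_grad(z, [1.0, 2.0, 3.0])
print(grads)  # each entry is close to 2/3
```

Analytically, z = (2/3)(x1 + x2 + x3), so every partial derivative is 2/3 regardless of the input values.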
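Building a Neural Network
Code Example:
import torch.nn as nn
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)  # input size 10 is an assumed placeholder
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 1)
    def forward(self, x):
        x = self.fc1(x)   # hidden layer
        x = self.relu(x)  # non-linearity
        x = self.fc2(x)   # single output value
        return x
model = SimpleNN()
print(model)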
Training a Neural Network
● Training Loop
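Code Example:
import torch.optim as optim
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)  # optimizer and learning rate are assumed
# inputs, targets: training tensors prepared beforehand (not shown)
for epoch in range(100):  # epoch count is illustrative
    optimizer.zero_grad()               # clear gradients from the previous step
    outputs = model(inputs)             # forward pass
    loss = criterion(outputs, targets)  # compute the loss
    loss.backward()                     # backpropagate
    optimizer.step()                    # update the weights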
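The same forward, loss, backward, update cycle can be written out by hand for a one-parameter model, which shows what each call in the loop is responsible for. This is a plain-Python sketch with hypothetical data drawn from y = 2x and an assumed learning rate; it does not use PyTorch.

```python
# Hypothetical data generated from y = 2x, so the ideal weight is 2.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0    # single trainable weight
lr = 0.01  # learning rate (assumed)

for epoch in range(200):
    # Forward pass and MSE loss: mean((w*x - y)^2)
    preds = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # Backward pass: d(loss)/dw = mean(2 * (w*x - y) * x)
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # Update step, the role optimizer.step() plays for plain SGD
    w -= lr * grad

print(round(w, 3), loss)
```

Because the gradient is recomputed from scratch on every pass, there is nothing to reset here; in PyTorch gradients accumulate across backward() calls, which is why the loop calls optimizer.zero_grad() first.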
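Transfer Learning
Code Example:
from torchvision import models
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # replaces the deprecated pretrained=True
for param in model.parameters():
    param.requires_grad = False  # freeze the pre-trained layers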
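Model Evaluation
Code Example:
model.eval()  # Set model to evaluation mode
with torch.no_grad():  # disable gradient tracking during inference
    outputs = model(test_inputs)  # test_inputs: a batch of test tensors (not shown)
    predicted = outputs.argmax(dim=1)  # assumes one score per class; pick the highest
print(predicted)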
Summary
● Key Takeaways
○ Importance of NLP, transformers, and their applications.
○ Efficient data manipulation and analysis with Pandas.
○ Building and training neural networks using PyTorch.