DSC 253 Homework 1
import pandas as pd
from sklearn.model_selection import train_test_split
print(nyt.head())
print(ag.head())
text label
0 (reuters) - carlos tevez sealed his move to ju... sports
1 if professional pride and strong defiance can ... sports
2 palermo, sicily — roberta vinci beat top-seede... sports
3 spain's big two soccer teams face a pair of it... sports
4 the argentine soccer club san lorenzo complete... sports
text
0 wall st. bears claw back into the black (reute...
1 carlyle looks toward commercial aerospace (reu...
2 oil and economy cloud stocks' outlook (reuters...
3 iraq halts oil exports from main southern pipe...
4 oil prices soar to all-time record, posing new...
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
# Preprocessing function
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\W', ' ', text)
    return text.split()
# Apply preprocessing
nyt['processed_text'] = nyt['text'].apply(preprocess_text)
ag['processed_text'] = ag['text'].apply(preprocess_text)
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
micro_f1 = f1_score(y_test, y_pred, average='micro')
# Binary bag-of-words vectorizer (1 if the word occurs in a document, 0 otherwise)
binary_vectorizer = CountVectorizer(binary=True)
X_train_binary = binary_vectorizer.fit_transform(train_data['text'])
X_val_binary = binary_vectorizer.transform(val_data['text'])
X_test_binary = binary_vectorizer.transform(test_data['text'])
import pandas as pd
import gensim
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
import re
# Preprocessing function
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\W', ' ', text)
    return text.split()
# Apply preprocessing
nyt['processed_text'] = nyt['text'].apply(preprocess_text)
ag['processed_text'] = ag['text'].apply(preprocess_text)
text label \
0 (reuters) - carlos tevez sealed his move to ju... sports
1 if professional pride and strong defiance can ... sports
2 palermo, sicily — roberta vinci beat top-seede... sports
3 spain's big two soccer teams face a pair of it... sports
4 the argentine soccer club san lorenzo complete... sports
processed_text
0 [reuters, carlos, tevez, sealed, his, move, to...
1 [if, professional, pride, and, strong, defianc...
2 [palermo, sicily, roberta, vinci, beat, top, s...
3 [spain, s, big, two, soccer, teams, face, a, p...
4 [the, argentine, soccer, club, san, lorenzo, c...
text \
0 wall st. bears claw back into the black (reute...
1 carlyle looks toward commercial aerospace (reu...
2 oil and economy cloud stocks' outlook (reuters...
3 iraq halts oil exports from main southern pipe...
4 oil prices soar to all-time record, posing new...
processed_text
0 [wall, st, bears, claw, back, into, the, black...
1 [carlyle, looks, toward, commercial, aerospace...
2 [oil, and, economy, cloud, stocks, outlook, re...
3 [iraq, halts, oil, exports, from, main, southe...
4 [oil, prices, soar, to, all, time, record, pos...
import numpy as np
def load_glove_model(File):
    print("Loading Glove Model")
    glove_model = {}
    with open(File, 'r') as f:
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array(split_line[1:], dtype=np.float64)
            glove_model[word] = embedding
    print(f"{len(glove_model)} words loaded!")
    return glove_model
# Glove - 2 (a)
y_train = train_data['label']
y_val = val_data['label']
y_test = test_data['label']
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
micro_f1 = f1_score(y_test, y_pred, average='micro')
# Word2Vec AG -> NYT - 2 (b)
# Create document vectors for NYT dataset using AG News Word2Vec model
nyt['w2v_ag_vector'] = nyt['processed_text'].apply(get_document_vector_w2v_ag)
X_train_w2v_ag = np.vstack(train_data['w2v_ag_vector'])
X_val_w2v_ag = np.vstack(val_data['w2v_ag_vector'])
X_test_w2v_ag = np.vstack(test_data['w2v_ag_vector'])
y_train = train_data['label']
y_val = val_data['label']
y_test = test_data['label']
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
micro_f1 = f1_score(y_test, y_pred, average='micro')
# Word2Vec - 2 (c)
# Create document vectors for NYT dataset using NYT Word2Vec model
nyt['w2v_nyt_vector'] = nyt['processed_text'].apply(get_document_vector_w2v_nyt)
X_train_w2v_nyt = np.vstack(train_data['w2v_nyt_vector'])
X_val_w2v_nyt = np.vstack(val_data['w2v_nyt_vector'])
X_test_w2v_nyt = np.vstack(test_data['w2v_nyt_vector'])
y_train = train_data['label']
y_val = val_data['label']
y_test = test_data['label']
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average='macro')
micro_f1 = f1_score(y_test, y_pred, average='micro')
# BERT - 3 (a)
# Importing libraries
import numpy as np
import pandas as pd
from sklearn import metrics
import transformers
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertModel, BertConfig
# Setting up GPU
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)
cuda
df = pd.read_csv('nyt.csv')
df['list'] = df[df.columns[2:]].values.tolist()
new_df = df[['text', 'label']].copy()
new_df.head()
new_df['label'].value_counts()
count
label
sports 8639
politics 1451
business 1429
dtype: int64
mapping = {
    'sports': [1, 0, 0],
    'politics': [0, 1, 0],
    'business': [0, 0, 1]
}
new_df['label'] = df['label'].map(mapping)
new_df.head()
# config
MAX_LEN = 64
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
EPOCHS = 3
LEARNING_RATE = 1e-05
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
class CustomDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.text = dataframe.text
        self.targets = dataframe.label
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        # Tokenize, add [CLS]/[SEP], and pad to max_len
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]
        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }
train_size = 0.8
train_dataset=new_df.sample(frac=train_size,random_state=42)
test_dataset=new_df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)
print(f"Dataset: {new_df.shape}")
print(f"Train set: {train_dataset.shape}")
print(f"Test set: {test_dataset.shape}")
Dataset: (11519, 2)
Train set: (9215, 2)
Test set: (2304, 2)
class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        self.l1 = transformers.BertModel.from_pretrained('bert-base-uncased')
        self.l2 = torch.nn.Dropout(0.3)
        self.l3 = torch.nn.Linear(768, 3)

    def forward(self, ids, mask, token_type_ids):
        # Pooled [CLS] representation -> dropout -> 3-way linear classifier
        _, pooled = self.l1(ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False)
        return self.l3(self.l2(pooled))
model = BERTClass()
model.to(device)
def train(epoch):
    model.train()
    for _, data in enumerate(training_loader, 0):
        ids = data['ids'].to(device, dtype=torch.long)
        mask = data['mask'].to(device, dtype=torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
        targets = data['targets'].to(device, dtype=torch.float)
        # Forward pass and loss
        outputs = model(ids, mask, token_type_ids)
        loss = loss_fn(outputs, targets)
        if _ % 5000 == 0:
            print(f'Epoch: {epoch}, Loss: {loss.item()}')
        # Backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Epoch: 0, Loss: 0.6873428225517273
Epoch: 1, Loss: 0.014567049220204353
Epoch: 2, Loss: 0.003150239121168852
def validation(epoch):
    model.eval()
    fin_targets = []
    fin_outputs = []
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype=torch.long)
            mask = data['mask'].to(device, dtype=torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
            targets = data['targets'].to(device, dtype=torch.float)
            outputs = model(ids, mask, token_type_ids)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets
DSC 253 - Homework 1
Bag of Words Model:
Dataset Preparation
Preprocessing
For the binary BoW representation, I used CountVectorizer with the binary=True parameter. This
configuration creates a binary vector where:
● The value is set to 1 if a word is present in the document, regardless of its frequency.
● The value is 0 if the word is not present.
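A minimal sketch of this setup, assuming the train_data/test_data splits and the LogisticRegression classifier used elsewhere in the notebook (the evaluate_vectorizer helper name is illustrative, not part of the original code):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def evaluate_vectorizer(vectorizer, train_data, test_data):
    # Fit the vectorizer on the training text only, then transform both splits
    X_train = vectorizer.fit_transform(train_data['text'])
    X_test = vectorizer.transform(test_data['text'])
    # Train a logistic regression classifier on the vectorized documents
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_data['label'])
    # Report accuracy, macro-F1, and micro-F1 on the test split
    y_pred = clf.predict(X_test)
    return (accuracy_score(test_data['label'], y_pred),
            f1_score(test_data['label'], y_pred, average='macro'),
            f1_score(test_data['label'], y_pred, average='micro'))

# Binary bag-of-words: 1 if a word occurs in the document, 0 otherwise
binary_metrics = evaluate_vectorizer(CountVectorizer(binary=True), train_data, test_data)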
Model Performance
The model achieved the following performance metrics on the test set:
● Accuracy: 97.83%
● Macro-F1 Score: 94.02%
● Micro-F1 Score: 97.83%
Frequency Bag of Words Model:
For the frequency BoW representation, I used CountVectorizer without setting the binary
parameter. This configuration creates a frequency vector where the value represents the
number of times a word appears in the document.
The binary model's high accuracy and F1 scores indicate that simply capturing the presence or
absence of words is already largely sufficient for this classification task; the frequency
representation tests whether raw counts add further signal.
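With the evaluate_vectorizer helper sketched above, the frequency variant only swaps the vectorizer:

# Frequency bag-of-words: raw term counts instead of 0/1 indicators
count_metrics = evaluate_vectorizer(CountVectorizer(), train_data, test_data)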
Model Performance
The model achieved the following performance metrics on the test set:
● Accuracy: 98.26%
● Macro-F1 Score: 95.36%
● Micro-F1 Score: 98.26%
TF-IDF Model:
For the TF-IDF representation, I used TfidfVectorizer. This configuration creates a vector where
the value represents the TF-IDF score of each word in the document, reflecting both its
frequency in the document and its importance across the entire corpus.
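Again reusing the evaluate_vectorizer sketch from above, only the vectorizer changes:

# TF-IDF: term frequency re-weighted by inverse document frequency
tfidf_metrics = evaluate_vectorizer(TfidfVectorizer(), train_data, test_data)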
Model Performance
The model achieved the following performance metrics on the test set:
● Accuracy: 97.83%
● Macro-F1 Score: 94.41%
● Micro-F1 Score: 97.83%
The high accuracy and F1 scores indicate that the TF-IDF model effectively weights word
occurrences by their informativeness, which supports strong classification performance on this task.
GloVe Model:
I loaded pre-trained GloVe vectors (100-dimensional) to create word embeddings. For each
document, I calculated the average of its word vectors to obtain a document vector. If no word
from the document was found in the GloVe model, a zero vector was assigned.
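A minimal sketch of the averaging step, assuming glove_model is the word-to-vector dictionary returned by load_glove_model in the notebook (the function name and column name below are illustrative):

import numpy as np

def get_document_vector_glove(tokens, glove_model, dim=100):
    # Average the GloVe vectors of all tokens found in the vocabulary
    vectors = [glove_model[word] for word in tokens if word in glove_model]
    if not vectors:
        # No token found in GloVe: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Example usage on the preprocessed NYT text
nyt['glove_vector'] = nyt['processed_text'].apply(
    lambda toks: get_document_vector_glove(toks, glove_model))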
Model Evaluation
A Logistic Regression model was trained using the GloVe-based document vectors:
● Training: The model was trained with the training set (80% of the data).
● Validation: Performance was monitored using the validation set (10% of the data) to
fine-tune model parameters.
● Testing: The final performance was evaluated using the test set (10% of the data).
Model Performance
The model achieved the following performance metrics on the test set:
● Accuracy: 97.48%
● Macro-F1 Score: 93.51%
● Micro-F1 Score: 97.48%
The high accuracy and F1 scores indicate that the GloVe-based model effectively captures
semantic information, improving classification performance for this task.
Word2Vec Model (AG News -> NYT):
I trained a Word2Vec model on the AG News dataset using gensim.
The trained Word2Vec model was then used to create document vectors for the NYT dataset by
averaging the word vectors of each document. If no word from a document was present in the
Word2Vec vocabulary, a zero vector was assigned.
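A sketch of this step with gensim; the hyperparameters shown (vector_size, window, min_count) are placeholders, since the exact values are not listed here, and get_document_vector_w2v_ag mirrors the function name used in the notebook:

from gensim.models import Word2Vec
import numpy as np

# Train Word2Vec on the tokenized AG News text (hyperparameters are illustrative)
w2v_ag = Word2Vec(sentences=ag['processed_text'], vector_size=100,
                  window=5, min_count=1, workers=4)

def get_document_vector_w2v_ag(tokens):
    # Average the vectors of tokens present in the Word2Vec vocabulary
    vectors = [w2v_ag.wv[word] for word in tokens if word in w2v_ag.wv]
    if not vectors:
        return np.zeros(w2v_ag.vector_size)  # zero vector if nothing matched
    return np.mean(vectors, axis=0)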
Model Evaluation
The model achieved the following performance metrics on the test set:
● Accuracy: 96.26%
● Macro-F1 Score: 90.42%
● Micro-F1 Score: 96.26%
Word2Vec Model (NYT):
In this task, I trained a Word2Vec model on the NYT dataset's preprocessed text and used it to
create document vectors for text classification, following the same averaging approach as above.
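The same recipe as the AG News sketch applies here, only trained on the NYT tokens (hyperparameters again illustrative):

w2v_nyt = Word2Vec(sentences=nyt['processed_text'], vector_size=100,
                   window=5, min_count=1, workers=4)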
Model Evaluation
A Logistic Regression model was trained using the document vectors from the Word2Vec
model. The evaluation metrics were as follows:
● Accuracy: 96.96%
● Macro-F1 Score: 92.63%
● Micro-F1 Score: 96.96%
The results indicate that the Word2Vec model effectively captures the semantic meaning of
words, leading to high accuracy and F1 scores for the classification task.
What are the disadvantages of averaging word vectors for the document representation?
Describe an idea to overcome this. The document vectors should be formed using word vectors.
-> Averaging word vectors ignores the order in which words appear in a document. This can
lead to situations where sentences with completely different meanings (e.g., "The cat chased
the mouse" vs. "The mouse chased the cat") produce the same document vector.
Instead of averaging the word vectors equally, apply weights based on the positions of the
words. For example, you can assign higher weights to words that appear earlier or later in the
document, depending on the nature of the text.
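A sketch of this idea using a simple linear decay so earlier words receive higher weight; word_vectors can be any word-to-vector mapping (e.g., the GloVe dictionary), and this particular weighting scheme is just one possible choice:

import numpy as np

def position_weighted_doc_vector(tokens, word_vectors, dim=100):
    vecs, weights = [], []
    n = len(tokens)
    for i, word in enumerate(tokens):
        if word in word_vectors:
            vecs.append(word_vectors[word])
            weights.append(1.0 - i / max(n, 1))  # weight decays with position
    if not vecs:
        return np.zeros(dim)
    # Weighted average instead of a plain mean, so word order affects the result
    return np.average(np.array(vecs), axis=0, weights=np.array(weights))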
BERT Model
Dataset Preparation:
Config:
MAX_LEN = 64
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
EPOCHS = 3
LEARNING_RATE = 1e-05
Preprocessing:
● Tokenize the input text into tokens that BERT can process.
● Pad the sequences to ensure uniform length.
● Create attention masks to distinguish between real tokens and padding.
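A small illustration of these steps (a sketch, not the original code); padding='max_length' with truncation=True is the non-deprecated equivalent of the pad_to_max_length flag used in the notebook:

sample = tokenizer.encode_plus(
    "oil prices soar to all-time record",
    add_special_tokens=True,
    max_length=MAX_LEN,
    padding='max_length',
    truncation=True,
    return_token_type_ids=True,
)
# input_ids are padded to MAX_LEN; the attention mask is 1 for real tokens, 0 for padding
print(len(sample['input_ids']), sample['attention_mask'][:10])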
The pre-trained BERT model was fine-tuned for the three-class classification task (sports,
politics, business). A linear layer was added on top of BERT's output to classify the inputs.
Model Training:
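The cells defining the data loaders, loss, and optimizer are not reproduced above, so the following is a sketch of a setup consistent with the code shown: BCEWithLogitsLoss matches the one-hot label vectors and the sigmoid applied during validation, and Adam with LEARNING_RATE is an assumption.

import torch
from torch.utils.data import DataLoader

# Wrap the train/test splits with the CustomDataset defined earlier
training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN)
testing_set = CustomDataset(test_dataset, tokenizer, MAX_LEN)
training_loader = DataLoader(training_set, batch_size=TRAIN_BATCH_SIZE, shuffle=True)
testing_loader = DataLoader(testing_set, batch_size=VALID_BATCH_SIZE, shuffle=False)

# Assumed loss and optimizer, consistent with the one-hot targets and validation sigmoid
loss_fn = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

for epoch in range(EPOCHS):
    train(epoch)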
Model Performance:
The model achieved the following performance metrics on the test set:
● Accuracy: 96.39%
● Macro-F1 Score: 94.45%
● Micro-F1 Score: 97.57%
These metrics indicate that the BERT model performs well on the text classification task.