Hugging Face
● Hugging Face's Transformers library is a powerful tool for working with Transformer-based models, which are widely used in natural language processing (NLP) tasks like text classification, sentiment analysis, translation, and more. Popular models like BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and DistilBERT are based on the Transformer architecture. Here's an introduction to Transformers and how you can fine-tune pre-trained models such as DistilBERT for tasks like text classification and sentiment analysis on your custom dataset.
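As a quick illustration of the library (a minimal sketch using the default sentiment-analysis pipeline rather than a fine-tuned model; the example sentence is made up):
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default pre-trained sentiment model on first use
print(classifier("Hugging Face makes working with Transformers easy!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]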
● Hugging Face provides a variety of pre-trained models that you can directly use or fine-tune for specific NLP tasks. A common use case is text classification or sentiment analysis, where you want to assign a label to a given text.
● DistilBERT is one of the most popular pre-trained models for such tasks. It is a smaller and faster version of BERT, with 40% fewer parameters while retaining 97% of BERT's performance. It's a great choice when you need efficiency without sacrificing much accuracy.
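A rough way to see the size difference is to count parameters of the two base checkpoints; a brief sketch (the printed numbers are only approximate expectations):
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

# bert-base has roughly 110M parameters, distilbert roughly 66M (about 40% fewer)
print(bert.num_parameters(), distilbert.num_parameters())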
from transformers import DistilBertForSequenceClassification
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)  # 2 labels for binary classification
c) Preprocess the dataset:
You can use the datasets library from Hugging Face to load and preprocess your custom dataset. You need to tokenize your text data using
the tokenizer.
from datasets import load_dataset
from transformers import DistilBertTokenizerFast

# Load the tokenizer that matches the DistilBERT checkpoint
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

def preprocess_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

# Example with a custom dataset; replace these files with your own (they need 'text' and 'label' columns)
dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'})

# Apply the tokenizer to every split of the dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True)
e) Initialize the Trainer:
The Trainer class simplifies the training and evaluation loop for most Transformer models.
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,                        # a TrainingArguments instance (defined in the full example below)
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)
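With the Trainer in place, fine-tuning and evaluation are one call each. A brief sketch (trainer.train() also appears in the full example below):
trainer.train()               # fine-tune DistilBERT on the training split
metrics = trainer.evaluate()  # run evaluation on the test split
print(metrics)                # e.g. eval_loss, plus any metrics you configure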
Sentiment Analysis
The above setup can be used for sentiment analysis by assigning labels like 0 for negative and 1 for positive
sentiments in the dataset. After fine-tuning, the model can classify whether a given text has a positive or negative
sentiment.
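After fine-tuning, a prediction for a single text might look like the following sketch (the example sentence is made up, and label 1 is assumed to mean positive as described above):
import torch

model.eval()  # disable dropout for inference
text = "I really enjoyed this movie!"  # hypothetical example sentence
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = int(torch.argmax(logits, dim=-1))  # 0 = negative, 1 = positive
print(predicted_label)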
Fine-Tuning on a Custom Dataset
Fine-tuning a pre-trained model like DistilBERT is effective when you want to adapt the model to a specific task or
domain, such as legal text classification, medical document classification, etc. All you need is a labeled dataset with
the target labels (for example, positive/negative in sentiment analysis).
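For illustration, a small labeled dataset can also be built in memory instead of from CSV files; a sketch with hypothetical examples (the column names 'text' and 'label' match what the preprocessing code above expects):
from datasets import Dataset

data = {
    'text': ["The contract terms are unacceptable.", "The diagnosis was confirmed quickly."],  # hypothetical texts
    'label': [0, 1],                                                                           # 0 = negative, 1 = positive
}
custom_dataset = Dataset.from_dict(data).train_test_split(test_size=0.5)
tokenized = custom_dataset.map(preprocess_function, batched=True)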
Fine-tuning Pre-trained Models for Specific Tasks
Pre-trained models like DistilBERT are general-purpose, meaning they are trained on vast amounts of diverse data and can be
adapted (fine-tuned) for specific tasks with relatively few task-specific labeled examples.
Fine-tuning Process:
• Load the pre-trained model and tokenizer: The model and tokenizer are loaded from the Hugging Face Model Hub.
• Prepare your dataset: Your custom dataset needs to be tokenized, i.e. your text is converted into numerical input.
• Train the model: The model is trained (fine-tuned) on your dataset using Hugging Face's Trainer class, which abstracts away the complexity of model training.
• Evaluate the model: After training, the model is evaluated on a test set to check how well it generalizes to unseen data (see the metrics sketch after this list).
The process involves setting training arguments such as learning rate, batch size, number of epochs, and more.
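The evaluation step can report task metrics by passing a compute_metrics function to the Trainer; a minimal sketch, assuming accuracy is the metric of interest:
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': (predictions == labels).mean()}

# Passed to the Trainer alongside the other arguments:
# trainer = Trainer(..., compute_metrics=compute_metrics)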
Example of fine-tuning DistilBERT for text classification:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',              # output directory
    evaluation_strategy="epoch",         # evaluate after each epoch
    per_device_train_batch_size=16,      # batch size for training
    per_device_eval_batch_size=64,       # batch size for evaluation
    num_train_epochs=3,                  # number of training epochs
    weight_decay=0.01,                   # strength of weight decay
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)

trainer.train()  # Fine-tune the model
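After training, the fine-tuned model and tokenizer can be saved and reloaded for later use; a brief sketch, with the directory name './fine-tuned-distilbert' chosen only for illustration:
trainer.save_model('./fine-tuned-distilbert')         # saves the model weights and config
tokenizer.save_pretrained('./fine-tuned-distilbert')  # saves the tokenizer files alongside

# Reload later for inference
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
model = DistilBertForSequenceClassification.from_pretrained('./fine-tuned-distilbert')
tokenizer = DistilBertTokenizerFast.from_pretrained('./fine-tuned-distilbert')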