A Simple Guide On Using BERT For Binary Text Classification
Update Notice II
Please consider using the Simple Transformers library as it is easy to use, feature-
packed, and regularly updated. The article still stands as a reference to BERT models
and is likely to be helpful with understanding how BERT works. However, Simple
Transformers offers a lot more features, much more straightforward tuning options, all
the while being quick and easy to use! The links below should help you get started
quickly.
1. Binary Classification
2. Multi-Class Classification
3. Multi-Label Classification
5. Question Answering
7. Conversational AI
Update Notice I
In light of the update to the library used in this article (HuggingFace updated the
pytorch-pretrained-bert library to pytorch-transformers ), I have written a new guide as
well as a new repo. If you are starting out with Transformer models, I recommend
using those as the code has been cleaned up both on my end and in the Pytorch-
Transformers library, greatly streamlining the whole process. The new repo also
supports XLNet, XLM, and RoBERTa models out of the box, in addition to BERT, as of
September 2019.
1. Intro
Let’s talk about what we are going to (and not going to) do.
Before we begin, let me point you towards the GitHub repo containing all the code used in this guide. All the code in the repo is included in the guide here, and vice versa. Feel free to refer to it at any time, or clone the repo to follow along with the guide.
If your internet wanderings have led you here, I guess it’s safe to assume that you have
heard of BERT, the powerful new language representation model, open-sourced by
Google towards the end of 2018. If you haven’t, or if you’d like a refresher, I recommend
giving their paper a read as I won’t be going into the technical details of how BERT
works. If you are unfamiliar with the Transformer model (or if words like “attention”,
“embeddings”, and “encoder-decoder” sound scary), check out this brilliant article by
Jay Alammar. You don’t necessarily need to know everything about BERT (or
Transformers) to follow the rest of this guide, but the above links should help if you wish
to learn more about BERT and Transformers.
Now that we’ve gotten what we won’t do out of the way, let’s dig into what we will do,
shall we?
Getting BERT downloaded and set up. We will be using the PyTorch version
provided by the amazing folks at Hugging Face.
Converting a dataset in the .csv format to the .tsv format that BERT knows and loves.
Loading the .tsv files into a notebook and converting the text representations to a
feature representation (think numerical) that the BERT model can work with.
One last thing before we dig in: I'll be using three Jupyter Notebooks, one each for data preparation, training, and evaluation. It's not strictly necessary, but it felt cleaner to separate those three processes.
2. Getting set up
Time to get BERT up and running.
1. Create a virtual environment with the required packages. You can use any
package/environment manager, but I’ll be using Conda.
conda create -n bert python pytorch pandas tqdm
(Note: If you run into any missing-package errors while following the guide, go ahead and install them using your package manager. A Google search should tell you how to install a specific package.)
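The guide also leans on Hugging Face's pytorch-pretrained-bert package (you'll see it in the imports and in the log messages later on) and on scikit-learn for the evaluation metric. A minimal sketch of pulling those in, assuming pip is available inside the activated conda environment:

conda activate bert
pip install pytorch-pretrained-bert scikit-learn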
3. To do text classification, we’ll obviously need a text classification dataset. For this
guide, I’ll be using the Yelp Reviews Polarity dataset which you can find here on
fast.ai. (Direct download link for any lazy asses, I mean busy folks.)
Decompress the downloaded file to get the train.csv and test.csv files. For reference, the path to my train.csv file is <starting_directory>/data/train.csv
3. Preparing data
Before we can cook the meal, we need to prepare the ingredients! (Or
something like that. <Insert proper analogy here>)
Most datasets you find will typically come in the csv format and the Yelp Reviews dataset
is no exception. Let’s load it in with pandas and take a look.
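The loading itself boils down to something like this (a sketch, assuming the csv files sit under data/ as described above; the files have no header row, so header=None matters):

import pandas as pd

# The Yelp Reviews Polarity csv files have no header row:
# column 0 is the label (1 = bad, 2 = good), column 1 is the review text.
train_df = pd.read_csv('data/train.csv', header=None)
test_df = pd.read_csv('data/test.csv', header=None)

train_df.head()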
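For reference, here is the InputExample class that the explanation below refers to. This is the standard definition from the HuggingFace run_classifier example (reproduced here as a sketch; the copy in the repo may differ cosmetically):

class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid        # unique id for the example
        self.text_a = text_a    # the review text
        self.text_b = text_b    # unused for single-sequence classification
        self.label = label      # the label, as a string ("0" or "1")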
62 """Processor for binary classification dataset."""
63
64 def get_train_examples(self, data_dir):
65 """See base class."""
66 return self._create_examples(
67 self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
68
69 def get_dev_examples(self, data_dir):
70 """See base class."""
71 return self._create_examples(
72 self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
73
74 def get_labels(self):
75 """See base class."""
76 return ["0", "1"]
77
78 def create examples(self lines set type):
The first class, InputExample, is the format that a single example of our dataset should
be in. We won’t be using the text_b attribute since that is not necessary for our binary
classification task. The other attributes should be fairly self-explanatory.
The BinaryClassificationProcessor class can read in the train.tsv and dev.tsv files
and convert them into lists of InputExample objects.
So far, we have the capability to read in tsv datasets and convert them into
InputExample objects. BERT, being a neural network, cannot directly deal with text as
we have in InputExample objects. The next step is to convert them into InputFeatures.
BERT has a constraint on the maximum length of a sequence after tokenizing. For any
BERT model, the maximum sequence length after tokenization is 512. But we can set
any sequence length equal to or below this value. For faster training, I’ll be using 128 as
the maximum sequence length. A bigger number may give better results if there are
sequences longer than this value.
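To make the 128-token cap concrete, here is roughly what happens to a single review (a sketch; BertTokenizer comes from pytorch_pretrained_bert, and the review text is made up):

from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

review = "The food was great but the service was painfully slow."
tokens = tokenizer.tokenize(review)   # WordPiece tokens
tokens = tokens[:128 - 2]             # leave room for the [CLS] and [SEP] tokens
input_ids = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens + ["[SEP]"])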
As you can see, the data is in the two csv files train.csv and test.csv . They contain no
headers, and two columns for the label and the text. The labels used here feel a little
weird to me, as they have used 1 and 2 instead of the typical 0 and 1. Here, a label of 1
means the review is bad, and a label of 2 means the review is good. I’m going to change
this to the more familiar 0 and 1 labelling, where a label 0 indicates a bad review, and a
label 1 indicates a good review.
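The swap is a one-liner per dataframe (a sketch; the exact line in my notebook may differ slightly):

# Map the original labels: 2 (good) -> 1, 1 (bad) -> 0
train_df[0] = (train_df[0] == 2).astype(int)
test_df[0] = (test_df[0] == 2).astype(int)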
BERT, however, wants data to be in a tsv file with a specific format as given below (four columns, and no header row).
Column 0: An ID for the row.
Column 1: The label for the row (as an int, which is why we switched to 0 and 1).
Column 2: A column of the same letter for all rows. BERT wants this so we'll give it, but we don't have a use for it.
Column 3: The text for the row.
An InputFeature consists of purely numerical data (with the proper sequence lengths) that can be fed directly into the BERT model. It is prepared by tokenizing the text of each example, truncating longer sequences and padding shorter ones to the given maximum sequence length (128). I found the conversion of InputExample objects to InputFeature objects to be quite slow by default, so I modified the conversion code to use Python's multiprocessing library and significantly speed up the process.
class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def convert_example_to_feature(example_row):
    # Each example_row bundles the example with everything a worker process needs.
    example, label_map, max_seq_length, tokenizer, output_mode = example_row

    tokens_a = tokenizer.tokenize(example.text_a)

    tokens_b = None
    if example.text_b:
        tokens_b = tokenizer.tokenize(example.text_b)
    # ... (the rest of the function truncates/pads the tokens, maps them to
    # input_ids, input_mask and segment_ids, looks up the label_id, and
    # returns an InputFeatures object; see the repo for the full version)
train_df_bert = pd.DataFrame({
    'id': range(len(train_df)),
    'label': train_df[0],
    'alpha': ['a'] * train_df.shape[0],
    'text': train_df[1].replace(r'\n', ' ', regex=True)
})
# The same construction is applied to test_df to build the dev dataframe.

train_df_bert.head()

Out[17]:   id  label  alpha  text
For convenience, I've named the test data as dev data. The convenience stems from the fact that BERT comes with data loading classes that expect train and dev files in the above format. We can use the train data to train our model, and the dev data to evaluate its performance. BERT's data loading classes can also use a test file, but they expect it to be unlabelled. Therefore, I will be using the train and dev files instead.
Now that we have the data in the correct form, all we need to do is to save the train and
dev data as .tsv files.
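Saving is just pandas' to_csv with a tab separator (a sketch; I'm assuming the dev dataframe was built from test_df the same way as train_df_bert above and named dev_df_bert):

# BERT's loading classes expect tab-separated files with no header and no index column.
train_df_bert.to_csv('data/train.tsv', sep='\t', index=False, header=False)
dev_df_bert.to_csv('data/dev.tsv', sep='\t', index=False, header=False)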
That’s the eggs beaten, the chicken thawed, and the veggies sliced. Let’s get cooking!
4. Data to Features
The final step before fine-tuning is to convert the data into features that BERT
uses. Most of the remaining code was adapted from the HuggingFace example
run_classifier.py, found here.
Now, we will see the reason for us rearranging the data into the .tsv format in the
previous section. It enables us to easily reuse the example classes that come with BERT
for our own binary classification task. Here’s how they look.
First, let’s import all the packages that we’ll need, and then get our paths straightened
out.
(Tip: The model will be downloaded into a temporary folder. Find the folder by following the
path printed on the output once the download completes and copy the downloaded file to the
cache/ directory. The file should be a compressed file in .tar.gz format. Next time, you can
just use this downloaded file without having to download it all over again. All you need to do
is comment out the line that downloaded the model, and uncomment the line below it.)
We just need to do a tiny bit more configuration for the training. Here, I’m just using the
default parameters.
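For reference, the configuration amounts to a handful of constants plus the BertAdam optimizer from pytorch_pretrained_bert. A sketch with typical values (the exact numbers I used are in the training notebook; train_examples is the list returned by the processor):

from pytorch_pretrained_bert.optimization import BertAdam

TRAIN_BATCH_SIZE = 24
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 1
WARMUP_PROPORTION = 0.1

num_train_optimization_steps = int(
    len(train_examples) / TRAIN_BATCH_SIZE) * NUM_TRAIN_EPOCHS

optimizer = BertAdam(model.parameters(),
                     lr=LEARNING_RATE,
                     warmup=WARMUP_PROPORTION,
                     t_total=num_train_optimization_steps)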
Training time!
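The loop itself is plain PyTorch; with pytorch_pretrained_bert, calling the model with the label ids makes it return the loss directly. A minimal sketch, assuming all_input_ids, all_input_mask, all_segment_ids and all_label_ids are tensors built from the list of InputFeatures:

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from tqdm import tqdm

train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
train_dataloader = DataLoader(train_data,
                              sampler=RandomSampler(train_data),
                              batch_size=TRAIN_BATCH_SIZE)

model.train()
for _ in range(NUM_TRAIN_EPOCHS):
    for batch in tqdm(train_dataloader, desc="Iteration"):
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch

        # With labels passed in, BertForSequenceClassification returns the loss.
        loss = model(input_ids, segment_ids, input_mask, label_ids)
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()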
Now that we've trained the BERT model for one epoch, we can evaluate the results. Of course, more training will likely yield better results, but even one epoch should be sufficient as a proof of concept (hopefully!).
In order to be able to easily load our fine-tuned model, we should save it in a specific
way, i.e. the same way the default BERT models are saved. Here is how you can do that.
Go into the outputs/yelp directory where the fine-tuned models will be saved. There, you should find three files: config.json, pytorch_model.bin, and vocab.txt.
Archive the two files config.json and pytorch_model.bin into a .tar file (I use 7zip for archiving).
Compress the .tar file into gzip format. The file should now be something like yelp.tar.gz. (A scripted version of these two steps follows below.)
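If you'd rather script the archiving than click through 7zip, Python's tarfile module gets you the same result (a sketch; the paths assume the outputs/yelp directory mentioned above):

import tarfile

# Bundle the fine-tuned weights and config the same way the stock BERT archives are packaged.
with tarfile.open('outputs/yelp/yelp.tar.gz', 'w:gz') as archive:
    archive.add('outputs/yelp/config.json', arcname='config.json')
    archive.add('outputs/yelp/pytorch_model.bin', arcname='pytorch_model.bin')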
6. Evaluation
Time to see what our fine-tuned model can do. (We’ve cooked the meal, let’s see
how it tastes.)
(Note: I’m switching to the evaluation notebook)
Most of the code for the evaluation is very similar to the training process, so I won’t go
into too much detail but I’ll list some important points.
The tokenizer should be loaded from the vocabulary file created in the training stage. In my case, that would be outputs/yelp/vocab.txt (or the path can be set as OUTPUT_DIR + 'vocab.txt'; see the sketch just below).
Double check to make sure you are loading the fine-tuned model and not the
original BERT model. 😅
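Concretely, the two loading calls look something like this (a sketch; OUTPUT_DIR would be outputs/yelp/ here, and from_pretrained can read the .tar.gz archive we created above):

from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

# The vocabulary saved during training, not the stock one.
tokenizer = BertTokenizer.from_pretrained(OUTPUT_DIR + 'vocab.txt', do_lower_case=False)

# The fine-tuned weights, not the original pre-trained checkpoint.
model = BertForSequenceClassification.from_pretrained(OUTPUT_DIR + 'yelp.tar.gz',
                                                      cache_dir=CACHE_DIR,
                                                      num_labels=2)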
With just a single epoch of training, our BERT model achieves a 0.914 Matthews correlation coefficient (a good measure for evaluating unbalanced datasets; see the sklearn docs here). With more training, and perhaps some hyperparameter tuning, we can almost certainly improve upon what is already an impressive score.
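For reference, the metric is a single call in scikit-learn once the evaluation loop has collected the predicted and true labels (a sketch; preds and label_ids stand for whatever arrays your loop produces):

from sklearn.metrics import matthews_corrcoef

mcc = matthews_corrcoef(label_ids, preds)
print(f"MCC: {mcc:.3f}")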
7. Conclusion
BERT is an incredibly powerful language representation model that shows great promise
in a wide variety of NLP tasks. Here, I’ve tried to give a basic guide to how you might use
it for binary text classification.
As the results show, BERT is a very effective tool for binary text classification, not to
mention all the other tasks it has already been used for.
Reminder: Github repo with all the code can be found here.
In [2]: # The input data dir. Should contain the .tsv files (or other data)
In the first cell, we are importing the necessary packages. In the next cell, we are setting
some paths for where files should be stored and where certain files can be found. We are
also setting some configuration options for the BERT model. Finally, we will create the
directories if they do not already exist.
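In outline, that cell looks something like the following (a sketch of my setup; the constant names follow the HuggingFace example, and I'm assuming the cased base model, consistent with the cased_base_bert_pytorch archive mentioned later):

import os
import pickle
from multiprocessing import Pool, cpu_count

import pandas as pd
import torch
from tqdm import tqdm
from pytorch_pretrained_bert import BertTokenizer

# The input data dir. Should contain the .tsv files for the task.
DATA_DIR = 'data/'

# The pre-trained BERT model to fine-tune.
BERT_MODEL = 'bert-base-cased'

# The name of the task, used to keep outputs for different tasks separate.
TASK_NAME = 'yelp'

# Where the fine-tuned model and evaluation results will be written.
OUTPUT_DIR = f'outputs/{TASK_NAME}/'

# Where the downloaded pre-trained model will be cached.
CACHE_DIR = 'cache/'

# Maximum sequence length after tokenization (see the earlier discussion).
MAX_SEQ_LENGTH = 128

for path in (OUTPUT_DIR, CACHE_DIR):
    os.makedirs(path, exist_ok=True)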
Next, we will use our BinaryClassificationProcessor to load in the data, and get
everything ready for the tokenization step.
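That step boils down to something like this (a sketch; the tuple layout matches what convert_example_to_feature unpacks in the code shown earlier):

tokenizer = BertTokenizer.from_pretrained(BERT_MODEL, do_lower_case=False)

processor = BinaryClassificationProcessor()
train_examples = processor.get_train_examples(DATA_DIR)
label_list = processor.get_labels()                       # ["0", "1"]
label_map = {label: i for i, label in enumerate(label_list)}

# Bundle each example with everything a worker process will need.
examples_for_processing = [
    (example, label_map, MAX_SEQ_LENGTH, tokenizer, 'classification')
    for example in train_examples
]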
Now, we can use the multi-core goodness of modern CPUs to process the examples (relatively) quickly. My Ryzen 7 2700X took about one and a half hours for this part.
Your notebook should show the progress of the processing rather than the ‘HBox’ thing I have here. It’s an
issue with uploading the notebook to Gist.
(Note: If you have any issues getting the multiprocessing to work, just copy paste all the code
up to, and including, the multiprocessing into a python script and run it from the command
line or an IDE. Jupyter Notebooks can sometimes get a little iffy with multiprocessing. I’ve
included an example script on github named converter.py )
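The multiprocessing part itself is only a few lines with Pool and tqdm (a sketch; process_count simply leaves one core free):

process_count = cpu_count() - 1

with Pool(process_count) as p:
    train_features = list(tqdm(p.imap(convert_example_to_feature, examples_for_processing),
                               total=len(examples_for_processing)))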
Once all the examples are converted into features, we can pickle them to disk for
safekeeping (I, for one, do not want to run the processing for another one and a half
hours). Next time, you can just unpickle the file to get the list of features.
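Pickling and unpickling are each a couple of lines (a sketch; the file name is just illustrative):

# Save the converted features so the 90-minute conversion never has to run again.
with open(DATA_DIR + 'train_features.pkl', 'wb') as f:
    pickle.dump(train_features, f)

# Later (e.g. in the training notebook), just load them back.
with open(DATA_DIR + 'train_features.pkl', 'rb') as f:
    train_features = pickle.load(f)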
Well, that was a lot of data preparation. You deserve a coffee, I’ll see you for the training
part in a bit. (Unless you already had your coffee while the processing was going on. In
which case, kudos to efficiency!)
# model = BertForSequenceClassification.from_pretrained(CACHE_DIR + 'cased_base_bert_pytorch.tar.gz', cache_dir=CACHE_DIR, num_labels=num_labels)
1%|▌          | 3306496/404400730 [00:19<08:08, 820603.15B/s]
In [11]: model.to(device)
Out[11]: BertForSequenceClassification(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(28996, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
          ...
Don't panic if you see the following output once the model is downloaded. I know it looks panic-inducing, but this is actually the expected behaviour. The weights reported as not initialized belong to the new classification layer, which is supposed to start from scratch and be learned during fine-tuning.
INFO:pytorch_pretrained_bert.modeling:Weights of
BertForSequenceClassification not initialized from pretrained model:
['classifier.weight', 'classifier.bias']
INFO:pytorch_pretrained_bert.modeling:Weights from pretrained model
not used in BertForSequenceClassification: ['cls.predictions.bias',
'cls.predictions.transform.dense.weight',
'cls.predictions.transform.dense.bias',
'cls.predictions.decoder.weight', 'cls.seq_relationship.weight',
'cls.seq_relationship.bias',
'cls.predictions.transform.LayerNorm.weight',
'cls.predictions.transform.LayerNorm.bias']