
A Simple Guide On Using BERT for Binary Text Classification

The A-to-Z guide on how you can use Google’s BERT for binary text classification tasks. I’ll be aiming to explain, as simply and straightforwardly as possible, how to fine-tune a BERT model (with PyTorch) and use it for a binary text classification task.

Thilina Rajapakse


Jun 9, 2019 · 10 min read


Photo by Andy Kelly on Unsplash

Update Notice II
Please consider using the Simple Transformers library as it is easy to use, feature-
packed, and regularly updated. The article still stands as a reference to BERT models
and is likely to be helpful with understanding how BERT works. However, Simple
Transformers offers a lot more features, much more straightforward tuning options, all
the while being quick and easy to use! The links below should help you get started
quickly.

1. Binary Classification

2. Multi-Class Classification

3. Multi-Label Classification

4. Named Entity Recognition (Part-of-Speech Tagging)

5. Question Answering

6. Sentence-Pair Tasks and Regression

7. Conversational AI

8. Language Model Fine-Tuning

9. ELECTRA and Language Model Training from Scratch

10. Visualising Model Training

Update Notice I
In light of the update to the library used in this article (HuggingFace updated the
pytorch-pretrained-bert library to pytorch-transformers ), I have written a new guide as

well as a new repo. If you are starting out with Transformer models, I recommend
using those as the code has been cleaned up both on my end and in the Pytorch-
Transformers library, greatly streamlining the whole process. The new repo also
supports XLNet, XLM, and RoBERTa models out of the box, in addition to BERT, as of
September 2019.


1. Intro
Let’s talk about what we are going to (and not going to) do.
Before we begin, let me point you towards the github repo containing all the code used in this
guide. All code in the repo is included in the guide here, and vice versa. Feel free to refer to it
anytime, or clone the repo to follow along with the guide.

If your internet wanderings have led you here, I guess it’s safe to assume that you have
heard of BERT, the powerful new language representation model, open-sourced by
Google towards the end of 2018. If you haven’t, or if you’d like a refresher, I recommend
giving their paper a read as I won’t be going into the technical details of how BERT
works. If you are unfamiliar with the Transformer model (or if words like “attention”,
“embeddings”, and “encoder-decoder” sound scary), check out this brilliant article by
Jay Alammar. You don’t necessarily need to know everything about BERT (or
Transformers) to follow the rest of this guide, but the above links should help if you wish
to learn more about BERT and Transformers.

Now that we’ve gotten what we won’t do out of the way, let’s dig into what we will do,
shall we?

Getting BERT downloaded and set up. We will be using the PyTorch version
provided by the amazing folks at Hugging Face.

Converting a dataset in the .csv format to the .tsv format that BERT knows and loves.

Loading the .tsv files into a notebook and converting the text representations to a
feature representation (think numerical) that the BERT model can work with.

Setting up a pretrained BERT model for fine-tuning.

Fine-tuning a BERT model.

Evaluating the performance of the BERT model.

One last thing before we dig in: I’ll be using three Jupyter Notebooks, one each for data preparation, training, and evaluation. It’s not strictly necessary, but it felt cleaner to separate those three processes.


2. Getting set up
Time to get BERT up and running.
1. Create a virtual environment with the required packages. You can use any
package/environment manager, but I’ll be using Conda.
conda create -n bert python pytorch pandas tqdm

conda install -c anaconda scikit-learn

(Note: If you run into any missing-package errors while following the guide, go ahead and install them using your package manager. A Google search should tell you how to install a specific package.)

2. Install the PyTorch version of BERT from Hugging Face.


pip install pytorch-pretrained-bert

3. To do text classification, we’ll obviously need a text classification dataset. For this
guide, I’ll be using the Yelp Reviews Polarity dataset which you can find here on
fast.ai. (Direct download link for any lazy asses, I mean busy folks.)
Decompress the downloaded file and get the train.csv and test.csv files. For reference, the path to my train.csv file is <starting_directory>/data/train.csv
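Before moving on, a quick sanity check that the environment is wired up correctly can save headaches later. A minimal sketch (the exact versions and tokenizer output you see will differ):

import torch
from pytorch_pretrained_bert import BertTokenizer

# Confirm PyTorch is installed and whether a GPU is visible (CPU-only training will be slow)
print(torch.__version__, torch.cuda.is_available())

# Download the cased BERT vocabulary and tokenize a toy sentence
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
print(tokenizer.tokenize("BERT is ready to go."))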

3. Preparing data
Before we can cook the meal, we need to prepare the ingredients! (Or
something like that. <Insert proper analogy here>)
Most datasets you find will typically come in the csv format and the Yelp Reviews dataset
is no exception. Let’s load it in with pandas and take a look.

In [1]: import pandas as pd

In [2]: train_df = pd.read_csv('data/train.csv', header=None)


train_df.head()

Out[2]: 0 1

0 1 Unfortunately, the frustration of being Dr. Go...

1 2 Been going to Dr. Goldberg for over 10 years. ...

2 1 I don't know what Dr. Goldberg was like before...


        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
                sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
                Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
                specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


class DataProcessor(object):
    """Base class for data converters for sequence classification data sets."""

    def get_train_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of `InputExample`s for the dev set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                if sys.version_info[0] == 2:
                    line = list(unicode(cell, 'utf-8') for cell in line)
                lines.append(line)
            return lines


class BinaryClassificationProcessor(DataProcessor):

62 """Processor for binary classification dataset."""
63
64 def get_train_examples(self, data_dir):
65 """See base class."""
66 return self._create_examples(
67 self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
68
69 def get_dev_examples(self, data_dir):
70 """See base class."""
71 return self._create_examples(
72 self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")
73
74 def get_labels(self):
75 """See base class."""
76 return ["0", "1"]
77
78 def create examples(self lines set type):

The first class, InputExample, is the format that a single example of our dataset should
be in. We won’t be using the text_b attribute since that is not necessary for our binary
classification task. The other attributes should be fairly self-explanatory.
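For illustration, a single Yelp review wrapped as an InputExample might look like this (a hypothetical instance, not a cell from the notebooks):

# text_b stays None for a single-sequence task like ours; labels are the strings "0" and "1"
example = InputExample(guid='train-0',
                       text_a='All the food is great here.',
                       label='1')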

The other two classes, DataProcessor and BinaryClassificationProcessor, are helper


classes that can be used to read in .tsv files and prepare them to be converted into
features that will ultimately be fed into the actual BERT model.

The BinaryClassificationProcessor class can read in the train.tsv and dev.tsv files
and convert them into lists of InputExample objects.

So far, we have the capability to read in tsv datasets and convert them into
InputExample objects. BERT, being a neural network, cannot directly deal with text as
we have in InputExample objects. The next step is to convert them into InputFeatures.

BERT has a constraint on the maximum length of a sequence after tokenizing. For any
BERT model, the maximum sequence length after tokenization is 512. But we can set
any sequence length equal to or below this value. For faster training, I’ll be using 128 as
the maximum sequence length. A bigger number may give better results if there are
sequences longer than this value.
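To make that concrete, here is roughly what happens to one review at this stage (a sketch using the same bert-base-cased tokenizer loaded later in the guide; the sample text is only an illustration):

from pytorch_pretrained_bert import BertTokenizer

MAX_SEQ_LENGTH = 128  # BERT's hard limit is 512; 128 keeps training fast

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
tokens = tokenizer.tokenize("All the food is great here. But the best thing is the service.")

# Leave room for the [CLS] and [SEP] special tokens, then map tokens to vocabulary ids
tokens = ["[CLS]"] + tokens[:MAX_SEQ_LENGTH - 2] + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)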


3 1 I'm writing this review to give you a heads up...

4 2 All the food is great here. But the best thing...

In [3]: test_df = pd.read_csv('data/test.csv', header=None)


test_df.head()

Out[3]: 0 1

0 2 Contrary to other reviews, I have zero complai...

1 1 Last summer I had an appointment to get new ti...

2 2 Friendly staff same starbucks fair you get an



As you can see, the data is in the two csv files train.csv and test.csv. They contain no headers and have two columns: the label and the text. The labels used here feel a little weird to me, as they use 1 and 2 instead of the typical 0 and 1. Here, a label of 1 means the review is bad, and a label of 2 means the review is good. I’m going to change this to the more familiar 0 and 1 labelling, where a label of 0 indicates a bad review and a label of 1 indicates a good review.
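The remapping itself is a one-liner per dataframe; a minimal sketch of the idea (my exact notebook cell isn’t reproduced here):

# Map the original labels {1: bad, 2: good} to {0: bad, 1: good}
train_df[0] = (train_df[0] == 2).astype(int)
test_df[0] = (test_df[0] == 2).astype(int)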

Much better, am I right?

BERT, however, wants data to be in a tsv file with a specific format as given below (Four
columns, and no header row).

Column 0: An ID for the row

Column 1: The label for the row (should be an int)

Column 2: A column of the same letter for all rows. BERT wants this so we’ll give it,
but we don’t have a use for it.

Column 3: The text for the row

Let’s make things a little BERT-friendly.

In [17]: train_df_bert = pd.DataFrame({
             'id':range(len(train_df)),
             'label':train_df[0],

An InputFeature consists of purely numerical data (with the proper sequence lengths)
that can then be fed into the BERT model. This is prepared by tokenizing the text of each
example and truncating the longer sequence while padding the shorter sequences to the
given maximum sequence length (128). I found the conversion of InputExample
objects to InputFeature objects to be quite slow by default, so I modified the conversion
code to utilize the multiprocessing library of Python to significantly speed up the process.

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def convert_example_to_feature(example_row):
    # return example_row
    example, label_map, max_seq_length, tokenizer, output_mode = example_row

    tokens_a = tokenizer.tokenize(example.text_a)

    tokens_b = None
    if example.text_b:
        tokens_b = tokenizer.tokenize(example.text_b)
             'label':train_df[0],
'alpha':['a']*train_df.shape[0],
'text': train_df[1].replace(r'\n', ' ', regex=True)
})

train_df_bert.head()

Out[17]:
id label alpha text

0 0 0 a Unfortunately, the frustration of being Dr. Go...

1 1 1 a Been going to Dr. Goldberg for over 10 years. ...

2 2 0 a I don't know what Dr. Goldberg was like before...

3 3 0 a I'm writing this review to give you a heads up...

4 4 1 a All the food is great here. But the best thing...

In [18]: dev_df_bert = pd.DataFrame({


'id':range(len(test_df)),
'label':test_df[0],
'alpha':['a']*test_df.shape[0],
'text': test_df[1].replace(r'\n', ' ', regex=True)
})

For convenience, I’ve named the test data as dev data. The convenience stems from the fact that BERT comes with data loading classes that expect train and dev files in the above format. We can use the train data to train our model, and the dev data to evaluate its performance. BERT’s data loading classes can also use a test file, but they expect the test file to be unlabelled. Therefore, I will be using the train and dev files instead.

Now that we have the data in the correct form, all we need to do is to save the train and
dev data as .tsv files.

In [20]: train_df_bert.to_csv('data/train.tsv', sep='\t', index=False, header=False)

In [21]: dev_df_bert.to_csv('data/dev.tsv', sep='\t', index=False, header=False)



That’s the eggs beaten, the chicken thawed, and the veggies sliced. Let’s get cooking!

4. Data to Features
The final step before fine-tuning is to convert the data into features that BERT
uses. Most of the remaining code was adapted from the HuggingFace example
run_classifier.py, found here.
Now, we will see the reason for us rearranging the data into the .tsv format in the
previous section. It enables us to easily reuse the example classes that come with BERT
for our own binary classification task. Here’s how they look.

from __future__ import absolute_import, division, print_function

import csv
import os
import sys
import logging

logger = logging.getLogger()
csv.field_size_limit(2147483647)  # Increase CSV reader's field limit in case we have long text.


class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.
        tokens_b = tokenizer.tokenize(example.text_b)
        # Modifies `tokens_a` and `tokens_b` in place so that the total
        # length is less than the specified length.
        # Account for [CLS], [SEP], [SEP] with "- 3"
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # Account for [CLS] and [SEP] with "- 2"
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[:(max_seq_length - 2)]

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)

    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    padding = [0] * (max_seq_length - len(input_ids))
    input_ids += padding
    input_mask += padding
    segment_ids += padding

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    if output_mode == "classification":
        label_id = label_map[example.label]
    elif output_mode == "regression":

We will see how to use these methods in just a bit.

(Note: I’m switching to the training notebook.)

First, let’s import all the packages that we’ll need, and then get our paths straightened
out.


(Tip: The model will be downloaded into a temporary folder. Find the folder by following the
path printed on the output once the download completes and copy the downloaded file to the
cache/ directory. The file should be a compressed file in .tar.gz format. Next time, you can
just use this downloaded file without having to download it all over again. All you need to do
is comment out the line that downloaded the model, and uncomment the line below it.)

We just need to do a tiny bit more configuration for the training. Here, I’m just using the
default parameters.
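For reference, a typical set of values looks something like the sketch below. It mirrors the defaults in HuggingFace’s run_classifier.py example; treat the specific numbers and paths as assumptions rather than the exact cell from my notebook.

# Assumed configuration values, in the spirit of the run_classifier.py defaults
DATA_DIR = 'data/'
BERT_MODEL = 'bert-base-cased'
OUTPUT_DIR = 'outputs/yelp/'
CACHE_DIR = 'cache/'

MAX_SEQ_LENGTH = 128
TRAIN_BATCH_SIZE = 24
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 1
WARMUP_PROPORTION = 0.1
GRADIENT_ACCUMULATION_STEPS = 1
OUTPUT_MODE = 'classification'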

Setting up our DataLoader for training..
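If you want to see the shape of that cell, here is a sketch of the standard pattern (assuming train_features is the list we pickled earlier):

import torch
from torch.utils.data import TensorDataset, RandomSampler, DataLoader

# Stack the InputFeatures into tensors and wrap them in a shuffling DataLoader
all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)

train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
train_dataloader = DataLoader(train_data, sampler=RandomSampler(train_data),
                              batch_size=TRAIN_BATCH_SIZE)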

Training time!
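The loop itself closely follows the run_classifier.py example; a condensed sketch (assuming the model, hyperparameters, and DataLoader from the cells above):

from torch.nn import CrossEntropyLoss
from pytorch_pretrained_bert.optimization import BertAdam

# BertAdam handles weight decay, warmup, and the learning-rate schedule
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = BertAdam(optimizer_grouped_parameters, lr=LEARNING_RATE,
                     warmup=WARMUP_PROPORTION, t_total=num_train_optimization_steps)

loss_fct = CrossEntropyLoss()
model.train()
for _ in trange(int(NUM_TRAIN_EPOCHS), desc="Epoch"):
    for batch in tqdm_notebook(train_dataloader, desc="Iteration"):
        input_ids, input_mask, segment_ids, label_ids = tuple(t.to(device) for t in batch)
        logits = model(input_ids, segment_ids, input_mask, labels=None)
        loss = loss_fct(logits.view(-1, num_labels), label_ids.view(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()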

Now that we’ve trained the BERT model for one epoch, we can evaluate the results. Of course, more training will likely yield better results, but even one epoch should be sufficient as a proof of concept (hopefully!).

In order to be able to easily load our fine-tuned model, we should save it in a specific
way, i.e. the same way the default BERT models are saved. Here is how you can do that.

Go into the outputs/yelp directory where the fine-tuned model will be saved. There, you should find 3 files: config.json, pytorch_model.bin, and vocab.txt.

Archive the two files config.json and pytorch_model.bin into a .tar file (I use 7zip for archiving).

Compress the .tar file into gzip format. Now the file should be something like
yelp.tar.gz

Copy the compressed file into the cache/ directory.

We will load this fine-tuned model in the next step.
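If you’d rather script those steps than use an archiving tool, a minimal sketch with Python’s standard library (assuming the outputs/yelp and cache/ paths used above):

import os
import shutil
import tarfile

OUTPUT_DIR = 'outputs/yelp/'
CACHE_DIR = 'cache/'

# Bundle config.json and pytorch_model.bin into yelp.tar.gz, then place it in the cache directory
with tarfile.open('yelp.tar.gz', 'w:gz') as archive:
    for filename in ('config.json', 'pytorch_model.bin'):
        archive.add(os.path.join(OUTPUT_DIR, filename), arcname=filename)

shutil.copy('yelp.tar.gz', CACHE_DIR)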

6. Evaluation


Time to see what our fine-tuned model can do. (We’ve cooked the meal, let’s see
how it tastes.)
(Note: I’m switching to the evaluation notebook)

Most of the code for the evaluation is very similar to the training process, so I won’t go into too much detail, but I’ll list some important points.

The BERT_MODEL parameter should be the name of your fine-tuned model. For example, yelp.tar.gz.

The tokenizer should be loaded from the vocabulary file created in the training stage. In my case, that would be outputs/yelp/vocab.txt (or the path can be set as OUTPUT_DIR + 'vocab.txt').

This time, we’ll be using the BinaryClassificationProcessor to load in the dev.tsv

file by calling the get_dev_examples method.

Double check to make sure you are loading the fine-tuned model and not the
original BERT model. 😅

Here’s my notebook for the evaluation.
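Since the gist isn’t reproduced here, this is a condensed sketch of what the evaluation cells do (assuming an eval_dataloader built from the dev features the same way the training DataLoader was, just with a SequentialSampler instead of a RandomSampler):

import numpy as np
import torch
from sklearn.metrics import matthews_corrcoef

model.eval()
all_logits, all_labels = [], []
for input_ids, input_mask, segment_ids, label_ids in tqdm_notebook(eval_dataloader, desc="Evaluating"):
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)

    with torch.no_grad():
        logits = model(input_ids, segment_ids, input_mask, labels=None)

    all_logits.append(logits.detach().cpu().numpy())
    all_labels.append(label_ids.numpy())

preds = np.argmax(np.concatenate(all_logits), axis=1)
true_labels = np.concatenate(all_labels)
print(matthews_corrcoef(true_labels, preds))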

With just a single epoch of training, our BERT model achieves a 0.914 Matthews correlation coefficient (a good measure for evaluating unbalanced datasets; sklearn doc here). With more training, and perhaps some hyperparameter tuning, we can almost certainly improve upon what is already an impressive score.

7. Conclusion
BERT is an incredibly powerful language representation model that shows great promise
in a wide variety of NLP tasks. Here, I’ve tried to give a basic guide to how you might use
it for binary text classification.

As the results show, BERT is a very effective tool for binary text classification, not to
mention all the other tasks it has already been used for.

Reminder: Github repo with all the code can be found here.


In [1]: import torch
        import pickle
        from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
        from torch.nn import CrossEntropyLoss, MSELoss

        from tqdm import tqdm_notebook, trange

        import os
        from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM, BertForSequenceClassification
        from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule

        from multiprocessing import Pool, cpu_count

        from tools import *
        import convert_examples_to_features

        # OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
        import logging
        logging.basicConfig(level=logging.INFO)

        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]: # The input data dir. Should contain the .tsv files (or other data)

In the first cell, we are importing the necessary packages. In the next cell, we are setting
some paths for where files should be stored and where certain files can be found. We are
also setting some configuration options for the BERT model. Finally, we will create the
directories if they do not already exist.

Next, we will use our BinaryClassificationProcessor to load in the data, and get
everything ready for the tokenization step.

In [6]: processor = BinaryClassificationProcessor()
        train_examples = processor.get_train_examples(DATA_DIR)
        train_examples_len = len(train_examples)

In [7]: label_list = processor.get_labels() # [0, 1] for binary classification
        num_labels = len(label_list)

In [8]: num_train_optimization_steps = int(
            train_examples_len / TRAIN_BATCH_SIZE / GRADIENT_ACCUMULATION_STEPS) * NUM_TRAIN_EPOCHS

In [5]: # Load pre-trained model tokenizer (vocabulary)
        tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at C:\Users\chatu\.pytorch_pretrained_bert\5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1

In [9]: label_map = {label: i for i, label in enumerate(label_list)}
        train_examples_for_processing = [(example, label_map, MAX_SEQ_LENGTH, tokenizer, OUTPUT_MODE) for example in train_examples]


Here, we are creating our BinaryClassificationProcessor and using it to load in the train examples. Then, we are setting some variables that we’ll use while training the model. Next, we are loading BERT’s pretrained tokenizer. In this case, we’ll be using the bert-base-cased model.

The convert_example_to_feature function expects a tuple containing an example, the


label map, the maximum sequence length, a tokenizer, and the output mode. So lastly, we
will create an examples list ready to be processed (tokenized, truncated/padded, and
turned into InputFeatures) by the convert_example_to_feature function.

Now, we can use the multi-core goodness of modern CPUs to process the examples (relatively) quickly. My Ryzen 7 2700X took about one and a half hours for this part.

In [10]: process_count = cpu_count() - 1

         if __name__ == '__main__':
             print(f'Preparing to convert {train_examples_len} examples..')
             print(f'Spawning {process_count} processes..')
             with Pool(process_count) as p:
                 train_features = list(tqdm_notebook(p.imap(convert_examples_to_features.convert_example_to_feature, train_examples_for_processing), total=train_examples_len))

Preparing to convert 560000 examples..
Spawning 15 processes..
HBox(children=(IntProgress(value=0, max=560000), HTML(value='')))

In [11]: with open(DATA_DIR + "train_features.pkl", "wb") as f:
             pickle.dump(train_features, f)


Your notebook should show the progress of the processing rather than the ‘HBox’ thing I have here. It’s an
issue with uploading the notebook to Gist.

(Note: If you have any issues getting the multiprocessing to work, just copy paste all the code
up to, and including, the multiprocessing into a python script and run it from the command
line or an IDE. Jupyter Notebooks can sometimes get a little iffy with multiprocessing. I’ve
included an example script on github named converter.py )

Once all the examples are converted into features, we can pickle them to disk for
safekeeping (I, for one, do not want to run the processing for another one and a half
hours). Next time, you can just unpickle the file to get the list of features.
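Reloading the cached features later is just the reverse (a small sketch):

import pickle

# Load the previously converted features instead of re-running the conversion
with open(DATA_DIR + "train_features.pkl", "rb") as f:
    train_features = pickle.load(f)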

Well, that was a lot of data preparation. You deserve a coffee, I’ll see you for the training
part in a bit. (Unless you already had your coffee while the processing was going on. In
which case, kudos to efficiency!)

5. Fine-tuning BERT (finally!)


Had your coffee? Raring to go? Let’s show BERT how it’s done! (Fine tune. Show
how it’s done. Get it? I might be bad at puns.)
Not much left now, let’s hope for smooth sailing. (Or smooth.. cooking? I forgot my
analogy somewhere along the way. Anyway, we now have all the ingredients in the pot,
and all we have to do is turn on the stove and let thermodynamics work its magic.)

In [ ]: # Load pre-trained model (weights)
        model = BertForSequenceClassification.from_pretrained(BERT_MODEL, cache_dir=CACHE_DIR, num_labels=num_labels)

        # model = BertForSequenceClassification.from_pretrained(CACHE_DIR + 'cased_base_bert_pytorch.tar.gz', cache_dir=CACHE_DIR, num_labels=num_labels)

1%|▌ | 3306496/404400730 [00:19<08:08, 820603.15B/s]

In [11]: model.to(device)

Out[11]: BertForSequenceClassification(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(28996, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bi


HuggingFace’s pytorch implementation of BERT comes with a function that


automatically downloads the BERT model for us (have I mentioned I love these dudes?).
I stopped my download since I have terrible internet, but it shouldn’t take long. It’s only
about 400 MB in total for the base models. Just wait for the download to complete and
you are good to go.

Don’t panic if you see the following output once the model is downloaded. I know it looks panic-inducing, but this is actually the expected behavior. The "not initialized" weights are not meant to be initialized. Intentionally.

INFO:pytorch_pretrained_bert.modeling:Weights of
BertForSequenceClassification not initialized from pretrained model:
['classifier.weight', 'classifier.bias']
INFO:pytorch_pretrained_bert.modeling:Weights from pretrained model
not used in BertForSequenceClassification: ['cls.predictions.bias',
'cls.predictions.transform.dense.weight',
'cls.predictions.transform.dense.bias',
'cls.predictions.decoder.weight', 'cls.seq_relationship.weight',
'cls.seq_relationship.bias',
'cls.predictions.transform.LayerNorm.weight',
'cls.predictions.transform.LayerNorm.bias']
