Reproducibility at ICLR 2019

Reproducibility document from the ICLR 2019 conference

Reproducibility as a Vehicle for Engineering Best Practices

Joel Grus (@joelgrus)
ICLR 2019, "Reproducibility in ML" workshop
Why ML Researchers Should Care About Reproducibility
Why (As Someone Who Cares About Reproducibility) You Should Care About Software Engineering Best Practices
Replicability / Reproducibility / Repeatability
Reproducibility is good
Non-definition
other people (including future-you) should be able to
● train your models
● make predictions using your trained models
● get the same results as you (modulo whatever)
About Me
(This talk also applies to data scientists!)
The Story Arc of This Talk
As usual, it started with me tweeting while angry
Which somehow led to me giving a talk at JupyterCon
One of whose many complaints was about reproducibility
Which got me invited to give a talk at AAAI
Which led to me being here (I think)
But I'm not going to talk about notebooks
Reproducibility and Software Engineering Best Practices
I take a somewhat expansive view of why reproducibility matters
Reproducibility Helps With Correctness
Reproducibility Protects Against Bad Actors
Reproducibility Helps Ensure Robustness
Reproducibility Makes It Easier to Try New Datasets
Reproducibility Makes It Easier to Create New Datasets
Reproducibility Makes It Easier to Try New Tasks
Reproducibility Enables Strong Baselines

Wouldn't you like your model to be the standard by which new models are judged?
Reproducibility Is Necessary For Extensibility

"you can't stand on the shoulders of giants if they keep their shoulders private"
Extensibility Leads to Progress
[Figure: a stack of ideas, bottom to top: attention, self-attention, transformer, BERT and GPT-2, "? ?"]
Fundamental Premise:

The tools you choose and the processes you adopt can make reproducibility either a lot harder or a lot easier
Fundamental Co-Premise:

The desire for reproducibility can get people to adopt better tools and processes
[Image labels: "reproducibility"; code reviews, unit tests, Docker; me; researchers]
If you are a researcher
If you are simply someone who cares about researchers
Your ML Experiments Are Software Engineering!
Many Software Engineering Best Practices are Intended to Facilitate Collaboration
Collaboration is a Forcing Function for Reproducibility
If nothing else, you will almost certainly need to collaborate with future-you
"But it runs on my computer!"
Use Source Control
Using GitHub (or similar) is pretty much the de facto way to collaborate
It also serves as a "time capsule" for your code
Separate "Library" Code and "Experiment" Code
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Load the pretrained tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask the token we want the model to predict
masked_index = 8
tokenized_text[masked_index] = '[MASK]'

# Convert tokens to vocabulary indices; segment ids mark first vs. second sentence
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Load the pretrained model weights and switch to evaluation mode
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Predict all tokens without tracking gradients
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)

# The highest-scoring token at the masked position should be 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'
Write Unit Tests
Unit tests check that small pieces of your logic are correct

(If you run them)
Yeah, but what do unit tests for ML experiments even look like?
Unit Tests for ML Experiments

tiny known dataset
check that model runs
check that output has the right fields
check that output has the right shape
check that output has reasonable values
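
The checklist above could translate into tests along these lines. This is a minimal sketch, not code from the talk: `toy_model`, `TINY_DATASET`, and `run_tests` are hypothetical names, and the toy model stands in for whatever your experiment actually trains.

```python
import math
import unittest
from typing import Dict, List


def toy_model(tokens: List[str]) -> Dict[str, object]:
    """Hypothetical stand-in for a real model: scores each token by its
    length and squashes the total through a sigmoid into a 'probability'."""
    scores = [float(len(token)) for token in tokens]
    probability = 1.0 / (1.0 + math.exp(-sum(scores) / 10.0))
    return {"scores": scores, "probability": probability}


# A tiny known dataset, small enough that the tests run in milliseconds.
TINY_DATASET = [["the", "cat", "sat"], ["reproducibility", "is", "good"]]


class ToyModelTest(unittest.TestCase):
    def test_model_runs(self):
        for tokens in TINY_DATASET:
            toy_model(tokens)  # should not raise

    def test_output_has_right_fields(self):
        output = toy_model(TINY_DATASET[0])
        self.assertEqual(set(output), {"scores", "probability"})

    def test_output_has_right_shape(self):
        for tokens in TINY_DATASET:
            self.assertEqual(len(toy_model(tokens)["scores"]), len(tokens))

    def test_output_has_reasonable_values(self):
        for tokens in TINY_DATASET:
            probability = toy_model(tokens)["probability"]
            self.assertFalse(math.isnan(probability))
            self.assertTrue(0.0 <= probability <= 1.0)


def run_tests() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToyModelTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

None of these tests says anything about model quality; they only catch wiring mistakes cheaply, which is the point of running them before a long experiment.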


The best time to find mistakes is before you run your experiments!
Unit tests are documentation that's necessarily correct
Use Code Reviews
Code reviews are an early protection against incorrect code (and hence incorrect science)
Code reviews help you grow as a coder and as a scientist
Make It Easy To Vary Parameters
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
--task_name=MRPC \
--do_train=true \
--do_eval=true \
--data_dir=$GLUE_DIR/MRPC \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=/tmp/mrpc_output/
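
One common way to get flags like these in a Python script is the standard library's `argparse`. The sketch below is an illustration under assumptions, not the actual `run_classifier.py`: it borrows a few flag names from the command above, and `parse_args` and `describe_run` are hypothetical helpers.

```python
import argparse
import json
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Declare every tunable value as a flag with an explicit default."""
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    parser.add_argument("--task_name", required=True)
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--num_train_epochs", type=float, default=3.0)
    parser.add_argument("--output_dir", default="/tmp/output/")
    return parser.parse_args(argv)


def describe_run(args: argparse.Namespace) -> str:
    """Serialize all parameters so the exact settings can be stored with the results."""
    return json.dumps(vars(args), sort_keys=True)


# Example: vary one parameter without touching the code.
example_args = parse_args(["--task_name", "MRPC", "--learning_rate", "3e-5"])
print(describe_run(example_args))
```

Writing the output of `describe_run` next to the results is a cheap way to remember which settings produced them.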
Be Explicit About Your Dependencies
#### ESSENTIAL LIBRARIES FOR MAIN FUNCTIONALITY ####

# This installs Pytorch for CUDA 8 only. If you are using a newer version,
# please visit http://pytorch.org/ and install the relevant version.
# For now AllenNLP works with both PyTorch 1.0 and 0.4.1. Expect that in
# the future only >=1.0 will be supported.
torch>=0.4.1

# Parameter parsing (but not on Windows).
jsonnet==0.10.0 ; sys.platform != 'win32'

# Adds an @overrides decorator for better documentation and error checking when using subclasses.
overrides

# Used by some old code. We moved away from it because it's too slow, but some old code still
# imports this.
nltk

# Mainly used for the faster tokenizer.
spacy>=2.0,<2.1
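
A complementary habit is recording which versions were actually installed when an experiment ran, since a range like `torch>=0.4.1` can resolve differently over time. A minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); `pinned_requirements` is a hypothetical helper, not part of AllenNLP or pip.

```python
from importlib import metadata
from typing import List


def pinned_requirements(package_names: List[str]) -> List[str]:
    """Return one 'name==version' line per installed package, and a
    comment line for any package that is not installed."""
    lines = []
    for name in package_names:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name} is not installed")
    return lines


# Example: snapshot the packages named in the requirements file above.
print("\n".join(pinned_requirements(["torch", "overrides", "nltk", "spacy"])))
```

Saving this snapshot alongside the experiment output gives future readers the exact versions rather than the allowed ranges.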
Consider Docker Images
● Create container with OS + environment + code (+ data?)
● Can share and anyone can run (in theory)
● Build up step by step with smart caching when you change it
$ docker run -it --entrypoint /bin/bash allennlp/allennlp:v0.8.0

root@7d1f120a83e9:/stage/allennlp# echo '{"sentence": "Did Uriah honestly think he could beat the game in under three hours?"}' | allennlp predict https://s3-us-west-2.amazonaws.com/allennlp/models/srl-model-2018.05.25.tar.gz -

input: {"sentence": "Did Uriah honestly think he could beat the game in under three
hours?"}
prediction: {"verbs": [{"verb": "Did", "description": "[V: Did] Uriah honestly think he
could beat the game in under three hours ?", "tags": ["B-V", "O", "O", "O", "O", "O", "O",
"O", "O", "O", "O", "O", "O", "O"]}, {"verb": "think", "description": "Did [ARG0: Uriah]
[ARGM-MNR: honestly] [V: think] [ARG1: he could beat the game in under three hours] ?",
"tags": ["O", "B-ARG0", "B-ARGM-MNR", "B-V", "B-ARG1", "I-ARG1", "I-ARG1", "I-ARG1",
"I-ARG1", "I-ARG1", "I-ARG1", "I-ARG1", "I-ARG1", "O"]}, {"verb": "could", "description": "Did
Uriah honestly think he [V: could] beat the game in under three hours ?", "tags": ["O",
"O", "O", "O", "O", "B-V", "O", "O", "O", "O", "O", "O", "O", "O"]}, {"verb": "beat",
"description": "Did Uriah honestly think [ARG0: he] [ARGM-MOD: could] [V: beat] [ARG1: the
game] [ARGM-TMP: in under three hours] ?", "tags": ["O", "O", "O", "O", "B-ARG0",
"B-ARGM-MOD", "B-V", "B-ARG1", "I-ARG1", "B-ARGM-TMP", "I-ARGM-TMP", "I-ARGM-TMP", "I-ARGM-TMP",
"O"]}], "words": ["Did", "Uriah", "honestly", "think", "he", "could", "beat", "the",
"game", "in", "under", "three", "hours", "?"]}
Provide Instructions
Reproducibility requires you to design dynamically
less good: "I did some science, now here is an artifact capturing what I did"

good: "I did some science, now you do some more science on top of it"
Case Study: Reproducibility and AllenNLP
What is AllenNLP?

opinionated
Programming to Higher-Level Abstractions
# models/crf_tagger.py

class CrfTagger(Model):
    """
    The ``CrfTagger`` encodes a sequence of text with a ``Seq2SeqEncoder``,
    then uses a Conditional Random Field model to predict a tag for each token in the
    sequence.
    """
    def __init__(self, vocab: Vocabulary,
                 text_field_embedder: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 label_namespace: str = "labels",
                 feedforward: Optional[FeedForward] = None,
                 label_encoding: Optional[str] = None,
                 include_start_end_transitions: bool = True,
                 constrain_crf_decoding: bool = None,
                 calculate_span_f1: bool = None,
                 dropout: Optional[float] = None,
                 verbose_metrics: bool = False,
                 initializer: InitializerApplicator = InitializerApplicator(),
                 regularizer: Optional[RegularizerApplicator] = None) -> None:
        super().__init__(vocab, regularizer)
Declarative Configuration
"model": {
  "type": "crf_tagger",
  "label_encoding": "BIOUL",
  "dropout": 0.5,
  "include_start_end_transitions": false,
  "text_field_embedder": {
    "token_embedders": {
      "tokens": {
        "type": "embedding",
        "embedding_dim": 50,
        "pretrained_file": "/path/to/glove.txt.gz",
        "trainable": true
      },
      "elmo": {
        "type": "elmo_token_embedder",
        "options_file": "/path/to/elmo/options.json",
        "weight_file": "/path/to/elmo/weights.hdf5",
        "do_layer_norm": false,
        "dropout": 0.0
      },
      "token_characters": {
        "type": "character_encoding",
        "embedding": {
          "embedding_dim": 16
        },
        "encoder": {
          "type": "cnn",
          "embedding_dim": 16,
          "num_filters": 128,
          "ngram_filter_sizes": [3],
          "conv_layer_activation": "relu"
        }
      }
    }
  },
  "encoder": {
    "type": "lstm",
    "input_size": 1202,
    "hidden_size": 200,
    "num_layers": 2,
    "dropout": 0.5,
    "bidirectional": true
  },
  "regularizer": [
    [
      "scalar_parameters",
      {
        "type": "l2",
        "alpha": 0.1
      }
    ]
  ]
},
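
To show why a "type" key plus parameters is enough to describe a component, here is a minimal sketch of a registry that turns a config dictionary into an object. It is an illustration under assumptions, not AllenNLP's actual implementation: `REGISTRY`, `register`, `from_config`, and `LstmEncoderSpec` are hypothetical names, and the stand-in class only stores hyperparameters.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping a config "type" string to a constructor.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator that records a class under the given type name."""
    def decorator(cls: type) -> type:
        REGISTRY[name] = cls
        return cls
    return decorator


@register("lstm")
class LstmEncoderSpec:
    """Stand-in for a real encoder; it only stores its hyperparameters."""

    def __init__(self, input_size: int, hidden_size: int, num_layers: int = 1,
                 dropout: float = 0.0, bidirectional: bool = False) -> None:
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.dropout = dropout
        self.bidirectional = bidirectional

    def output_dim(self) -> int:
        # A bidirectional encoder concatenates forward and backward states.
        return self.hidden_size * (2 if self.bidirectional else 1)


def from_config(config: Dict[str, Any]) -> Any:
    """Build an object from a config dict: "type" picks the class,
    the remaining keys become constructor arguments."""
    params = dict(config)           # copy so the caller's config is untouched
    type_name = params.pop("type")
    return REGISTRY[type_name](**params)


# Example: the "encoder" block from the configuration above.
encoder = from_config({
    "type": "lstm", "input_size": 1202, "hidden_size": 200,
    "num_layers": 2, "dropout": 0.5, "bidirectional": True,
})
print(encoder.output_dim())
```

With this shape, swapping one component for another is a change to the config file rather than to the experiment code.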
Command-Line Tools
Interactive Demos and Visualization of Model Internals
Higher-Level NLP Abstractions as Library Primitives
● Field + Instance (nice representation of examples)
○ question_field = TextField("What is the …", token_indexers)
○ instance = Instance({"question": question_field, "passage": passage_field})
● Vocabulary (map word <-> index, label <-> index, etc)
● TokenIndexer (map token -> [index1, ..., indexn])
○ could be one per word
○ could be one per character
○ could be one per wordpiece
● TokenEmbedder (map indices -> embedding vectors)
● Seq2VecEncoder (map [v1, …, vn] -> w)
● Seq2SeqEncoder (map [v1, …, vn] -> [w1, …, wn])
● ...and many more
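
To make the Vocabulary and TokenIndexer ideas concrete, here is a toy sketch of the two mappings. It is not AllenNLP's API: `ToyVocabulary`, `index_per_word`, and `index_per_character` are hypothetical names, and reserving index 0 for unknown tokens is an assumption made for this example.

```python
from typing import Dict, List


class ToyVocabulary:
    """Maps tokens to integer indices and back; index 0 means 'unknown'."""

    UNKNOWN = "@@UNKNOWN@@"

    def __init__(self, tokens: List[str]) -> None:
        self.index_to_token: List[str] = [self.UNKNOWN]
        self.token_to_index: Dict[str, int] = {self.UNKNOWN: 0}
        for token in tokens:
            if token not in self.token_to_index:
                self.token_to_index[token] = len(self.index_to_token)
                self.index_to_token.append(token)

    def index(self, token: str) -> int:
        return self.token_to_index.get(token, 0)

    def token(self, index: int) -> str:
        return self.index_to_token[index]


def index_per_word(words: List[str], vocab: ToyVocabulary) -> List[int]:
    """One index per word."""
    return [vocab.index(word) for word in words]


def index_per_character(words: List[str], vocab: ToyVocabulary) -> List[List[int]]:
    """One index per character, grouped by word."""
    return [[vocab.index(char) for char in word] for word in words]


word_vocab = ToyVocabulary(["the", "cat", "sat"])
char_vocab = ToyVocabulary(list("abcdefghijklmnopqrstuvwxyz"))

print(index_per_word(["the", "dog"], word_vocab))   # "dog" is unknown
print(index_per_character(["cab"], char_vocab))
```

The same list of words yields different index structures depending on the indexer, which is why the choice of indexer can live in configuration rather than in model code.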
"I want to
try using
BERT
vectors
instead of
GloVe
vectors"
"Oh, great,
now I'm going
to make lots of
changes to my
"I'll just make a
code and
new config file
maintain all
for the BERT
these different
version!"
versions so
that my results
are
reproducible"
"token_indexers": { "token_indexers": {
"tokens": { "bert": {
"type": "single_id", "type": "bert-pretrained",
"lowercase_tokens": true "pretrained_model":
}, std.extVar("BERT_VOCAB"),
"do_lowercase": false,
"use_starting_offsets": true
"token_characters": { },
"type": "characters", "token_characters": {
"min_padding_length": 3 "type": "characters",
} "min_padding_length": 3
} }
}
"text_field_embedder": { "text_field_embedder": {
"allow_unmatched_keys": true,
"embedder_to_indexer_map": {
"bert": ["bert", "bert-offsets"],
"token_embedders": { "token_characters": ["token_characters"],
"tokens": { },
"type": "embedding", "token_embedders": {
"embedding_dim": 50, "bert": {
"pretrained_file": "/path/to/glove.tar.gz", "type": "bert-pretrained",
"trainable": true "pretrained_model":
}, std.extVar("BERT_WEIGHTS")
"token_characters": { },
"type": "character_encoding", "token_characters": {
"embedding": { "type": "character_encoding",
"embedding_dim": 16 "embedding": {
}, "embedding_dim": 16
"encoder": { },
"type": "cnn", "encoder": {
"embedding_dim": 16, "type": "cnn",
"num_filters": 128, "embedding_dim": 16,
"ngram_filter_sizes": [3], "num_filters": 128,
"conv_layer_activation": "relu" "ngram_filter_sizes": [3],
} "conv_layer_activation": "relu"
} }
}, }
}, }
},
"Want to Reproduce My Results?"
● create a virtual environment
● clone my GitHub repo and pip install its dependencies
○ including a specific version of AllenNLP
○ which includes a specific version of PyTorch
● each experiment has its own JSON configuration file
● allennlp train specific_experiment.json -s /tmp/results
● open an issue on GitHub asking why you didn't get the exact exact same results
Case Study: Reasoning about Actions and State Changes
Case Study: Reproducibility and Beaker
What is Beaker?
● Kubernetes-based platform for rapid experimentation
● Specify experiments as Docker containers + config.yaml
● Request GPUs + Memory + etc
● Upload or mount existing datasets
● Track results
● (Currently runs on GKE, working on a "runs on your machines" version)
● Mostly internal-only for now (sorry)
What is Beaker?
beaker experiment run \
--name wordcount-moby \
--blueprint examples/wordcount \
--source examples/moby:/input \
--result-path /output
[Screenshot labels: description, parameters, logs, docker image, metrics, datasets, cost]
Organize Experiments Into Groups

ideally you'd give them more descriptive names, though
Beaker and Reproducibility
● old code + new data => upload the dataset, reuse the blueprint
● new code + old data => create the blueprint, point at existing dataset
● want to see previous results?
○ inputs + logs + outputs stored "forever"
○ record of every experiment run + results
○ share with a link
To Sum Up
● Reproducibility is important for more than the obvious reasons
● Your choices of tools and processes make reproducibility easier or harder
● Search out tools that make reproducibility easier
● Adopt processes that make reproducibility easier
● If nothing else, be kind to future-you
● But also be kind to everyone else who might build on your research
A Few Related Presentations
● I Don't Like Notebooks

The talk that launched a thousand arguments. Despite all the jokes and memes, it's deeply rooted in the idea that tools and processes matter.

● Writing Code for NLP Research

EMNLP 2018 tutorial from me + Matt Gardner + Mark Neumann, goes much deeper into "what good research code looks like"

● How Becoming Not a Data Scientist Made Me a Better Data Scientist

Explores some similar themes in the context of "why data scientists should care about software engineering best practices"
Any questions?
me: @joelgrus
AI2: allenai.org
AllenNLP: allennlp.org
will tweet out slides from @joelgrus and @ai2_allennlp
