Jay Alammar - Visualizing Machine Learning One Concept at A Time.

This document summarizes Jay Alammar's blog and provides visualizations and explanations of machine learning concepts. It discusses interfaces for explaining transformer language models and how they work. It also summarizes posts on GPT-3, BERT, word embeddings, and NumPy for data representation. The document aims to increase understanding of machine learning techniques through visualizations and examples.

Uploaded by

Alon Gonen

Jay Alammar (/)

Visualizing machine learning one concept at a time.
@JayAlammar (https://twitter.com/JayAlammar) on Twitter. YouTube Channel
(https://www.youtube.com/channel/UCmOwsoHty5PrmE-3QhUBfPQ)

Blog (/) About (/about)

Interfaces for Explaining Transformer Language Models

(/explaining-transformers/)
Interfaces for exploring transformer language models by looking at input saliency and neuron activation.

Explorable #1: Input saliency of a list of countries generated by a language model

Tap or hover over the output tokens:

1. Austria 2. Belgium 3. >> Brazil 4. Hungary 5. Romania 6. Luxembourg 7. Slovakia 8.

Explorable #2: Neuron activation analysis reveals four groups of neurons, each associated with generating a certain type of token
Tap or hover over the sparklines on the left to isolate a certain factor:

1. Austria 2. Belgium 3. >> Brazil 4. Hungary 5. Romania 6. Luxembourg 7. Slovakia 8.

[Interactive figure: one activation sparkline per factor; horizontal axis ticks at 0, 5, 10, 15, 20.]

The Transformer architecture has been powering a number of the recent advances in NLP. A breakdown of this
architecture is provided here (https://jalammar.github.io/illustrated-transformer/). Pre-trained language models based
on the architecture, in both its auto-regressive (models that use their own output as input to next time-steps and that
process tokens from left-to-right, like GPT2) and denoising (models trained by corrupting/masking the input and that
process tokens bidirectionally, like BERT) variants, continue to push the envelope in various tasks in NLP and, more
recently, in computer vision. Our understanding of why these models work so well, however, still lags behind these
developments.
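
One way to picture the difference between the two variants is through the attention mask, which marks the positions each token is allowed to look at. The sketch below is only an illustration: the function names are made up for this example, and real models apply such masks inside their attention layers rather than as standalone lists.

```python
def causal_mask(n):
    """Auto-regressive: token i may look at itself and earlier tokens only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Denoising (BERT-style): every token may look at every position."""
    return [[1] * n for _ in range(n)]


tokens = ["1.", "Austria", "2.", "Belgium"]
masks = {
    "auto-regressive": causal_mask(len(tokens)),
    "bidirectional": bidirectional_mask(len(tokens)),
}
for name, mask in masks.items():
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>8} {row}")
```

In the auto-regressive mask each row only has ones up to its own position, which is what "left-to-right" means in practice.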

This exposition series continues the pursuit to interpret and visualize the inner workings of transformer-based
language models. We illustrate how some key interpretability methods apply to these models.
This article focuses on auto-regressive models, but the methods are applicable to other architectures and tasks as
well.

This is the first article in the series. In it, we present explorables and visualizations aiding the intuition of:

Input Saliency methods that score each input token's importance to generating a token.
Neuron Activations and how individual neurons and groups of neurons spike in response to inputs and to produce
outputs.

The next article addresses Hidden State Evolution across the layers of the model and what it may tell us about each
layer's role.
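
To make the idea of input saliency concrete, here is a minimal sketch using occlusion (leave-one-out), one simple way to score inputs: remove each token in turn and see how much the score for the output drops. The scoring function `toy_score` and its weights are invented stand-ins for a real language model, so the numbers mean nothing beyond the illustration.

```python
def toy_score(tokens):
    """Hypothetical stand-in for a model's score for one output token.

    The weights are invented for illustration only.
    """
    weights = {"1.": 0.5, "Austria": 2.0, "2.": 0.5, "Belgium": 1.5, "3.": 1.0}
    return sum(weights.get(token, 0.0) for token in tokens)


def occlusion_saliency(tokens, score_fn):
    """Score each token by how much the output score drops when it is removed."""
    base = score_fn(tokens)
    drops = [base - score_fn(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]
    total = sum(abs(d) for d in drops) or 1.0  # avoid dividing by zero
    return [d / total for d in drops]


prompt = ["1.", "Austria", "2.", "Belgium", "3."]
saliency = occlusion_saliency(prompt, toy_score)
for token, value in zip(prompt, saliency):
    print(f"{token:>8}  {value:.2f}")
```

The normalized values can then be mapped to colors over the input tokens, which is roughly what a saliency visualization displays.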

The tech world is abuzz (https://www.theverge.com/21346343/gpt-3-explainer-openai-examples-errors-agi-potential)
with GPT3 hype. Massive language models (like GPT3) are starting to surprise us with their abilities. While not yet
completely reliable for most businesses to put in front of their customers, these models are showing sparks of
cleverness that are sure to accelerate the march of automation and the possibilities of intelligent computer systems.
Let’s remove the aura of mystery around GPT3 and learn how it’s trained and how it works.

A trained language model generates text.

We can optionally pass it some text as input, which influences its output.

The output is generated from what the model “learned” during its training period, when it scanned vast amounts of
text.
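
The same three ideas (training on text, generating text, and letting an optional input steer the output) can be sketched with a toy model that only counts which word follows which. This is nowhere near how GPT3 works internally; the corpus and the function names are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """'Training': count which word follows which in the text."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_words=5):
    """Extend a non-empty prompt by repeatedly picking the most frequent next word."""
    words = prompt.split()
    for _ in range(max_new_words):
        followers = counts.get(words[-1])
        if not followers:  # word never seen during training
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


corpus = "the robot must obey the orders given to it . the robot must protect itself ."
model = train_bigram(corpus)
print(generate(model, "the robot"))
print(generate(model, "orders given"))
```

Changing the prompt changes the continuation, and a word the model never saw in training leaves it with nothing to say.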

YouTube Series - Jay's Intro to AI (/jays-intro-to-ai/)

Jay's Visual Intro to AI

Check out the first video in my new series introducing the general public to AI and machine learning.

My aim for this series is to help people integrate ML into their world-view away from all the hype and overpromises
that plague the topic.

QCon 2020 - Visual Intro to Machine Learning and Deep Learning

(/qcon-2020-intro-to-ai/)

I had an incredible time organizing and speaking at the AI/machine learning track at QCon London 2020
(https://qconlondon.com/) where I invited and shared the stage with incredible speakers Vincent Warmerdam
(https://twitter.com/fishnets88), Susanne Groothuis (https://www.linkedin.com/in/susanne-groothuis/), Peter Elger
(https://www.linkedin.com/in/peterelger/), and Hien Luu (https://www.linkedin.com/in/hienluu/).

QCon is a global software conference for software engineers, architects, and team leaders, with over 1,600 attendees
in London. All speakers have a software background.
READ MORE (/QCON-2020-INTRO-TO-AI/)

A Visual Guide to Using BERT for the First Time (/a-visual-guide-to-using-bert-for-the-first-time/)
Translations: Chinese (http://www.junphy.com/wordpress/index.php/2020/10/20/a-visual-guide-using-bert/), Russian (https://habr.com/ru/post/498144/)

Progress has been rapidly accelerating in machine learning models that process language over the last couple of
years. This progress has left the research lab and started powering some of the leading digital products. A great
example of this is the recent announcement of how the BERT model is now a major force behind Google Search
(https://www.blog.google/products/search/search-language-understanding-bert/). Google believes this step (or
progress in natural language understanding as applied in search) represents “the biggest leap forward in the past five
years, and one of the biggest leaps forward in the history of Search”.

This post is a simple tutorial on how to use a variant of BERT to classify sentences. The example is basic
enough to serve as a first intro, yet advanced enough to showcase some of the key concepts involved.

Alongside this post, I’ve prepared a notebook. You can view the notebook
(https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ip
or run it on colab
(https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT

Language Models and Skipgram Recommenders Talk @ MIT

(/mit-analytics-lab-talk/)
I had a great time speaking at the MIT Analytics Lab about some of my favorite ideas in natural language processing
and their practical applications.

The Illustrated GPT-2 (Visualizing Transformer Language Models)

(/illustrated-gpt2/)
Discussions: Hacker News (64 points, 3 comments) (https://news.ycombinator.com/item?id=20677411), Reddit r/MachineLearning (219 points, 18
comments) (https://www.reddit.com/r/MachineLearning/comments/cp8prq/p_the_illustrated_gpt2_visualizing_transformer/)

Translations: Russian (https://habr.com/ru/post/490842/)

This year, we saw a dazzling application of machine learning. The OpenAI GPT-2 (https://openai.com/blog/better-language-models/)
exhibited an impressive ability to write coherent and passionate essays that exceed what we
anticipated current language models are able to produce. GPT-2 wasn’t a particularly novel architecture – its
architecture is very similar to the decoder-only transformer. GPT-2 was, however, a very large, transformer-based
language model trained on a massive dataset. In this post, we’ll look at the architecture that enabled the model to
produce its results. We will go into the depths of its self-attention layer. And then we’ll look at applications for the
decoder-only transformer beyond language modeling.
My goal here is to also supplement my earlier post, The Illustrated Transformer (/illustrated-transformer/), with more
visuals explaining the inner workings of transformers, and how they’ve evolved since the original paper. My hope is
that this visual language will make it easier to explain later Transformer-based models as their inner
workings continue to evolve.

A Visual Intro to NumPy and Data Representation (/visual-numpy/)
Discussions: Hacker News (366 points, 21 comments) (https://news.ycombinator.com/item?id=20282985), Reddit r/MachineLearning (256 points, 18
comments) (https://www.reddit.com/r/MachineLearning/comments/c5nc89/p_a_visual_intro_to_numpy_and_data_representation/)
Translations: Chinese 1 (http://www.junphy.com/wordpress/index.php/2019/10/24/visual-numpy/), Chinese 2
(https://github.com/kevingo/blog/blob/master/ML/visual-numpy.md), Japanese (https://note.mu/sayajewels/n/n95edaedb0fc5)

The NumPy (https://www.numpy.org/) package is the workhorse of data analysis, machine learning, and scientific
computing in the Python ecosystem. It vastly simplifies manipulating and crunching vectors and matrices. Some of
Python’s leading packages rely on NumPy as a fundamental piece of their infrastructure (examples include scikit-learn,
SciPy, pandas, and TensorFlow). Beyond the ability to slice and dice numeric data, mastering NumPy will give you an
edge when dealing with and debugging advanced use cases in these libraries.

In this post, we’ll look at some of the main ways to use NumPy and how it can represent different types of data
(tables, images, text, etc.) before we can serve them to machine learning models.
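
As a small taste of what that looks like, the sketch below stores a table, an image, and a sentence as NumPy arrays. The values, the tiny vocabulary, and the variable names are invented for illustration.

```python
import numpy as np

# A table: 3 rows (records) by 2 columns (features).
table = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

# An image: height x width x color channels (RGB), one byte per value.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[:, :, 0] = 255  # fill the red channel, giving a solid red square

# Text: map each word to an integer id using a (made-up) vocabulary.
vocab = {"have": 0, "a": 1, "good": 2, "day": 3}
sentence = np.array([vocab[word] for word in "have a good day".split()])

print(table.shape, table.mean(axis=0))  # per-column averages
print(image.shape)
print(sentence)
```

In each case the data ends up as an array of numbers with a known shape, which is the form machine learning models expect.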

Video: Intuition & Use-Cases of Embeddings in NLP & beyond

(/skipgram-recommender-talk/)
I gave a talk at QCon London (https://qconlondon.com/) this year. Watch it here:

Intuition & Use-Cases of Embeddings in NLP & beyond (https://www.youtube.com/watch?v=4-QoMdSqG_I) [YouTube]

https://www.infoq.com/presentations/nlp-word-embedding/ (https://www.infoq.com/presentations/nlp-word-
embedding/) [infoQ]

In this video, I introduced word embeddings and the word2vec algorithm. I then proceeded to discuss how the
word2vec algorithm is used to create recommendation engines in companies like Airbnb and Alibaba. I close by
glancing at real-world consequences of popular recommendation systems like those of YouTube and Facebook.

My Illustrated Word2vec (/illustrated-word2vec/) post used and built on the materials I created for this talk (but didn’t
include anything on the recommender application of word2vec). This was my first talk at a technical conference and I
spent quite a bit of time preparing for it. In the six weeks prior to the conference I spent about 100 hours working on
the presentation and ended up with 200 slides. It was an interesting balancing act of trying to make it introductory but
not shallow, suitable for senior engineers and architects yet not necessarily ones who have machine learning
experience.

The Illustrated Word2vec (/illustrated-word2vec/)

Discussions: Hacker News (347 points, 37 comments) (https://news.ycombinator.com/item?id=19498356), Reddit r/MachineLearning (151 points, 19
comments) (https://www.reddit.com/r/MachineLearning/comments/b60jtg/p_the_illustrated_word2vec/)
Translations: Chinese (Simplified) (https://mp.weixin.qq.com/s?
__biz=MjM5MTQzNzU2NA==&mid=2651669277&idx=2&sn=bc8f0590f9e340c1f1359982726c5a30&chksm=bd4c648e8a3bed9817f30c5a512e79fe0cc6fbc58544f97c857c30b1
Korean (https://databreak.netlify.com/2019-04-25-illustrated_word2vec/), Portuguese (https://pessoalex.wordpress.com/2019/03/29/o-word2vec-
ilustrado/), Russian (https://habr.com/ru/post/446530/)
“There is in all things a pattern that is part of our universe. It has symmetry, elegance, and grace - those qualities you find always in that
which the true artist captures. You can find it in the turning of the seasons, in the way sand trails along a ridge, in the branch clusters of the
creosote bush or the pattern of its leaves.

We try to copy these patterns in our lives and our society, seeking the rhythms, the dances, the forms that comfort. Yet, it is possible to see
peril in the finding of ultimate perfection. It is clear that the ultimate pattern contains its own fixity. In such perfection, all things move toward
death.” ~ Dune (1965)

I find the concept of embeddings to be one of the most fascinating ideas in machine learning. If you’ve ever used Siri,
Google Assistant, Alexa, Google Translate, or even a smartphone keyboard with next-word prediction, then chances are
you’ve benefitted from this idea that has become central to Natural Language Processing models. There has been
quite a development over the last couple of decades in using embeddings for neural models (recent developments
include contextualized word embeddings leading to cutting-edge models like BERT
(https://jalammar.github.io/illustrated-bert/) and GPT2).

Word2vec is a method to efficiently create word embeddings and has been around since 2013. But in addition to its
utility as a word-embedding method, some of its concepts have been shown to be effective in creating
recommendation engines and making sense of sequential data even in commercial, non-language tasks. Companies
like Airbnb (https://www.kdd.org/kdd2018/accepted-papers/view/real-time-personalization-using-embeddings-for-
search-ranking-at-airbnb), Alibaba (https://www.kdd.org/kdd2018/accepted-papers/view/billion-scale-commodity-
embedding-for-e-commerce-recommendation-in-alibaba), Spotify (https://www.slideshare.net/AndySloane/machine-
learning-spotify-madison-big-data-meetup), and Anghami (https://towardsdatascience.com/using-word2vec-for-music-
recommendations-bb9649ac2484) have all benefitted from carving out this brilliant piece of machinery from the world
of NLP and using it in production to empower a new breed of recommendation engines.

In this post, we’ll go over the concept of embedding, and the mechanics of generating embeddings with word2vec. But
let’s start with an example to get familiar with using vectors to represent things. Did you know that a list of five
numbers (a vector) can represent so much about your personality?
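
Here is a minimal sketch of that idea: each person is a list of five numbers, and cosine similarity tells us which two people are most alike. The names and scores are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented five-number "personality" vectors, each score between -1 and 1.
people = {
    "person_a": [-0.4, 0.8, 0.5, -0.2, 0.3],
    "person_b": [-0.3, 0.2, 0.3, -0.4, 0.9],
    "person_c": [-0.5, 0.7, 0.4, -0.1, 0.2],
}

query = people["person_a"]
others = {name: vec for name, vec in people.items() if name != "person_a"}
closest = max(others, key=lambda name: cosine_similarity(query, others[name]))
print("most similar to person_a:", closest)
```

The same comparison works unchanged when the vectors are word embeddings instead of personality scores.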

A Gentle Visual Intro to Data Analysis in Python Using Pandas

(/gentle-visual-intro-to-data-analysis-python-pandas/)
Discussions: Hacker News (195 points, 51 comments) (https://news.ycombinator.com/item?id=18351685), Reddit r/Python (140 points, 18 comments)
(https://www.reddit.com/r/Python/comments/9scznd/a_gentle_visual_intro_to_data_analysis_in_python/)

If you’re planning to learn data analysis, machine learning, or data science tools in Python, you’re most likely going to
be using the wonderful pandas (https://pandas.pydata.org/) library. Pandas is an open source library for data
manipulation and analysis in Python.

Loading Data
One of the easiest ways to think about it is that you can load tables (and Excel files) and then slice and dice them
in multiple ways:
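
A minimal sketch of that load-then-slice workflow is below. The table is a made-up CSV kept in a string so the example is self-contained; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Invented sample data, standing in for a CSV file on disk.
csv_text = """name,genre,plays
Song A,rock,120
Song B,jazz,45
Song C,rock,300
"""

df = pd.read_csv(io.StringIO(csv_text))

rock = df[df["genre"] == "rock"]                         # filter rows
top = df.sort_values("plays", ascending=False).head(2)   # sort, keep top 2
plays_by_genre = df.groupby("genre")["plays"].sum()      # aggregate

print(rock)
print(top)
print(plays_by_genre)
```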
READ MORE (/GENTLE-VISUAL-INTRO-TO-DATA-ANALYSIS-PYTHON-PANDAS/)

The Illustrated Transformer (/illustrated-transformer/)

Discussions: Hacker News (65 points, 4 comments) (https://news.ycombinator.com/item?id=18351674), Reddit r/MachineLearning (29 points, 3
comments) (https://www.reddit.com/r/MachineLearning/comments/8uh2yz/p_the_illustrated_transformer_a_visual_look_at/)
Translations: Chinese (Simplified) (https://blog.csdn.net/yujianmin1990/article/details/85221271), French (https://a-coles.github.io/post/transformer-
illustre/), Japanese (https://tips-memo.com/translation-jayalmmar-transformer), Korean (https://nlpinkorean.github.io/illustrated-transformer/), Russian
(https://habr.com/ru/post/486358/), Spanish (https://hackernoon.com/el-transformador-ilustrado-una-traduccion-al-espanol-0y73wwp)
Watch: MIT’s Deep Learning State of the Art (https://youtu.be/53YvP6gdD7U?t=432) lecture referencing this post

In the previous post, we looked at Attention (https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)
– a ubiquitous method in modern deep learning models. Attention is a
concept that helped improve the performance of neural machine translation applications. In this post, we will look at
The Transformer – a model that uses attention to boost the speed with which these models can be trained. The
Transformer outperforms the Google Neural Machine Translation model in specific tasks. The biggest benefit,
however, comes from how The Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation
to use The Transformer as a reference model to use their Cloud TPU (https://cloud.google.com/tpu/) offering. So let’s
try to break the model apart and look at how it functions.

The Transformer was proposed in the paper Attention is All You Need (https://arxiv.org/abs/1706.03762). A
TensorFlow implementation of it is available as a part of the Tensor2Tensor
(https://github.com/tensorflow/tensor2tensor) package. Harvard’s NLP group created a guide annotating the paper
with a PyTorch implementation (http://nlp.seas.harvard.edu/2018/04/03/attention.html). In this post, we will attempt to
oversimplify things a bit and introduce the concepts one by one to hopefully make it easier to understand for people
without in-depth knowledge of the subject matter.

2020 Update: I’ve created a “Narrated Transformer” video which is a gentler approach to the topic:

The Narrated Transformer Language Model

A High-Level Look
Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a
sentence in one language, and output its translation in another.

Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) (/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)
Translations: Chinese (Simplified) (https://blog.csdn.net/qq_41664845/article/details/84245520), Japanese (https://tips-memo.com/translation-
jayalmmar-attention), Korean (https://nlpinkorean.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/),
Russian (https://habr.com/ru/post/486158/), Turkish (https://medium.com/@SenemAktas/n%C3%B6ral-makine-%C3%A7eviri-modelini-
g%C3%B6rselle%C5%9Ftirme-seq2seq-modelinin-attention-mekanizmas%C4%B1-b12581b5a1df)
Watch: MIT’s Deep Learning State of the Art (https://youtu.be/53YvP6gdD7U?t=335) lecture referencing this post

May 25th update: New graphics (RNN animation, word embedding graph), color coding, elaborated on the final
attention example.

Note: The animations below are videos. Touch or hover on them (if you’re using a mouse) to get play controls so you
can pause if needed.

Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine
translation, text summarization, and image captioning. Google Translate started using
(https://blog.google/products/translate/found-translation-more-accurate-fluent-sentences-google-translate/) such a
model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014
(https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf), Cho et al., 2014
(http://emnlp2014.org/papers/pdf/EMNLP2014179.pdf)).

I found, however, that understanding the model well enough to implement it requires unraveling a series of concepts
that build on top of each other. I thought that a bunch of these ideas would be more accessible if expressed visually.
That’s what I aim to do in this post. You’ll need some previous understanding of deep learning to get through this post.
I hope it can be a useful companion to reading the papers mentioned above (and the attention papers linked later in
the post).

A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.)
and outputs another sequence of items. A trained model would work like this:

Visualizing Pandas' Pivoting and Reshaping Functions

(/visualizing-pandas-pivoting-and-reshaping/)


I love using Python’s pandas (https://pandas.pydata.org/) package for data analysis. The 10 Minutes to pandas
(https://pandas.pydata.org/pandas-docs/stable/10min.html) guide is a great place to start learning how to use it for data
analysis.

Things get a lot more interesting once you’re comfortable with the fundamentals and start with Reshaping and Pivot
Tables (https://pandas.pydata.org/pandas-docs/stable/reshaping.html). That guide shows some of the more
interesting functions of reshaping data. Below are some visualizations to go along with the Pandas reshaping guide.
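
For readers who prefer code next to the pictures, here is a minimal sketch of two of those reshaping functions, `pivot` and `melt`. The temperature readings are invented for the example.

```python
import pandas as pd

# "Long" format: one row per (date, city) observation. Values are made up.
long_df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "city": ["London", "Paris", "London", "Paris"],
    "temp": [7, 9, 6, 10],
})

# pivot: spread the city values out into columns, one row per date.
wide = long_df.pivot(index="date", columns="city", values="temp")
print(wide)

# melt: go back from wide to long, one row per (date, city) pair again.
back = wide.reset_index().melt(id_vars="date", var_name="city", value_name="temp")
print(back)
```

`pivot` expects each (index, column) pair to appear only once; when there are duplicates that need aggregating, `pivot_table` is the usual alternative.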

A Visual And Interactive Look at Basic Neural Network Math

(/feedforward-neural-networks-visual-interactive/)
In the previous post, we looked at the basic concepts of neural networks (https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/).
Let us now use another example to explore some of the
basic mathematical ideas involved in prediction with neural networks.
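
As a rough sketch of the kind of math involved, the simplest possible "network" multiplies an input by a weight, adds a bias, and is judged by how far its predictions land from the known answers (mean squared error). The data and parameter values below are invented and are not taken from the post.

```python
def predict(x, weight, bias):
    """A single-neuron prediction: weight * input + bias."""
    return weight * x + bias


def mean_squared_error(xs, ys, weight, bias):
    """Average of the squared gaps between predictions and known answers."""
    squared_errors = [(predict(x, weight, bias) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(squared_errors)


# Invented training data: the target is exactly 200 * input.
inputs = [1.0, 2.0, 3.0]
targets = [200.0, 400.0, 600.0]

for weight in (150.0, 180.0, 200.0):
    error = mean_squared_error(inputs, targets, weight, bias=0.0)
    print(f"weight={weight:.0f}  error={error:.1f}")
```

The error shrinks as the weight approaches the value that generated the data, and training is essentially an automated search for such values.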


A Visual and Interactive Guide to the Basics of Neural Networks

(/visual-interactive-guide-basics-neural-networks/)
Discussions: Hacker News (63 points, 8 comments) (https://news.ycombinator.com/item?id=13183171), Reddit r/programming (312 points, 37
comments) (https://www.reddit.com/r/programming/comments/5igdix/a_visual_and_interactive_guide_to_the_basics_of/)
Translations: French (https://rr0.org/people/a/AlammarJay/visual-interactive-guide-basics-neural-networks/index_fr.html), Spanish
(https://camporeale.github.io/guia-interactiva-visual-conceptos-basicos-redes-neuronales/)
Update: Part 2 is now live: A Visual And Interactive Look at Basic Neural Network Math
(https://jalammar.github.io/feedforward-neural-networks-visual-interactive/)

Motivation
I’m not a machine learning expert. I’m a software engineer by training and I’ve had little interaction with AI. I had
always wanted to delve deeper into machine learning, but never really found my “in”. That’s why when Google open
sourced TensorFlow in November 2015, I got super excited and knew it was time to jump in and start the learning
journey. Not to sound dramatic, but to me, it actually felt kind of like Prometheus handing down fire to mankind from
the Mount Olympus of machine learning. In the back of my head was the idea that the entire field of Big Data and
technologies like Hadoop were vastly accelerated when Google researchers released their MapReduce paper. This
time it’s not a paper – it’s the actual software they use internally after years and years of evolution.

So I started learning what I could about the basics of the topic, and saw the need for gentler resources for people with
no experience in the field. This is my attempt at that.

Supercharging Android Apps With TensorFlow (Google's Open Source Machine Learning Library) (/Supercharging-android-apps-using-tensorflow/)
Discussion: Reddit r/Android (80 points, 16 comments)
(https://www.reddit.com/r/androiddev/comments/3zpkb6/supercharging_android_apps_with_tensorflow/)
In November 2015, Google announced (https://googleblog.blogspot.com/2015/11/tensorflow-smarter-machine-
learning-for.html) and open sourced TensorFlow (https://www.tensorflow.org/), its latest and greatest machine learning
library. This is a big deal for three reasons:

1. Machine Learning expertise: Google is a dominant force in machine learning. Its prominence in search owes a lot
to the strides it achieved in machine learning.
2. Scalability: the announcement noted that TensorFlow was initially designed for internal use and that it’s already in
production for some live product features.
3. Ability to run on Mobile.

This last reason is the operating reason for this post since we’ll be focusing on Android. If you examine the tensorflow
repo on GitHub (https://github.com/tensorflow/tensorflow), you’ll find a little tensorflow/examples/android
(https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android) directory. I’ll try to shed some light
on the Android TensorFlow example and some of the things going on under the hood.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-nc-sa/4.0/).
Attribution example:
Alammar, Jay (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/ (https://jalammar.github.io/illustrated-
transformer/)

Note: If you translate any of the posts, let me know so I can link your translation to the original post. My email is in the about page (/about).

(https://github.com/jalammar) (https://www.linkedin.com/in/jalammar)
(https://www.twitter.com/jayalammar)