A Generative Vision Model That Trains With High Data Efficiency and Breaks Text-Based CAPTCHAs

D. George, W. Lehrach, K. Kansky, M. Lázaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang, A. Lavin, D. S. Phoenix
Learning from few examples and generalizing to dramatically different situations are capabilities of human visual intelligence that are yet to be matched by leading machine learning models. By drawing inspiration from systems neuroscience, we introduce a probabilistic generative model for vision in which message-passing-based inference handles recognition, segmentation, and reasoning in a unified way. […]
The ability to learn and generalize from a few examples is a hallmark of human intelligence (1). CAPTCHAs, images used by websites to block automated interactions, are examples of problems that are easy for humans but difficult for computers. CAPTCHAs are hard for algorithms because they add clutter and crowd letters together to create a chicken-and-egg problem for character classifiers: the classifiers work well for characters that have been segmented out, but segmenting the individual characters requires an understanding of the characters, each of which might be rendered in a combinatorial number of ways (2–5). A recent deep-learning approach for parsing one specific CAPTCHA style required millions of labeled examples from it (6), and earlier approaches mostly relied on hand-crafted style-specific heuristics to segment out the character (3, 7); whereas humans can solve new styles without explicit training (Fig. 1A). The wide variety of ways in which letterforms could be rendered and still be understood by people is illustrated in Fig. 1.

Building models that generalize well beyond their training distribution is an important step toward the flexibility Douglas Hofstadter envisioned when he said that for any program to handle letterforms with the flexibility that human beings do, it would have to possess full-scale artificial intelligence (8). Many researchers have conjectured that this could be achieved by incorporating the inductive biases of the visual cortex (9–12), utilizing the wealth of data generated by neuroscience and cognitive science research. In the mammalian brain, feedback connections in the visual cortex play roles in figure-ground segmentation and in object-based top-down attention that isolates the contours of an object even when partially transparent objects occupy the same spatial locations (13–16). Lateral connections in the visual cortex are implicated in enforcing contour continuity (17, 18). Contours and surfaces are represented using separate mechanisms that interact (19–21), enabling the recognition and imagination of objects with unusual appearance, for example a chair made of ice. The timing and topography of cortical activations give clues about contour-surface representations and inference algorithms (22, 23). These insights based on cortical function are yet to be incorporated into leading machine learning models.

We introduce a hierarchical model called the Recursive Cortical Network (RCN) that incorporates these neuroscience insights in a structured probabilistic generative model framework (5, 24–27).

In addition to developing RCN and its learning and inference algorithms, we applied the model to a variety of visual cognition tasks that required generalizing from one or a few training examples: parsing of CAPTCHAs, one-shot and few-shot recognition and generation of handwritten digits, occlusion reasoning, and scene text recognition. We then compared its performance to state-of-the-art models.

Recursive cortical network

RCN builds on existing compositional models (24, 28–32) in important ways [section 6 of (33)]. Although grammar-based models (24) have the advantage of being based on well-known ideas from linguistics, they either limit interpretations to single trees or are computationally infeasible when using attributed relations (32).
The seminal work on AND-OR templates and tree-structured compositional models (34) has the advantage of simplified inference, but is lacking in selectivity owing to the absence of lateral constraints (35). Models from another important class (25, 29) use lateral constraints, but rather than gradually building invariance through a pooling structure (36), they use parametric transformations for complete scale, rotation, and translation invariance at each level. Custom inference algorithms are required, but those are not effective in propagating the effect of lateral constraints beyond local interactions. The representation of contours and surfaces in (37) does not model their interactions, choosing instead to model these as independent mechanisms. RCNs and Composition Machines (CM) (32) share the motivation of placing compositional model ideas in a graphical […]

[…] representations in detail.

Figure 2B shows two subnetworks (black and blue) within a level of the RCN contour hierarchy. The filled and empty circular nodes in the graph are binary random variables that correspond to features and pools, respectively. Each feature node encodes an AND relation of its child pools, and each pool variable encodes the OR of its child features, similar to AND-OR graphs (34). Lateral constraints, represented as rectangular factor nodes, coordinate the choices between the pools they connect to. The two subnetworks, which can correspond to two objects or object parts, share lower-level features.
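To make the AND-of-ORs semantics concrete, the following is a minimal Python sketch. All names here (Feature, Pool, compatible) are ours, not from the paper or its supplement, and the sequential compatibility "hint" is only a crude stand-in for the lateral factor nodes of Fig. 2B; a sketch under those assumptions, not the paper's implementation.

    import random

    def compatible(a, b):
        # Toy continuity check: edge primitives are (x, y, orientation)
        # tuples; require spatial adjacency so sampled contours connect.
        return (isinstance(a, tuple) and isinstance(b, tuple)
                and abs(a[0] - b[0]) + abs(a[1] - b[1]) <= 1)

    class Pool:
        # OR node: when the pool is ON, exactly one child feature is chosen.
        def __init__(self, alternatives):
            self.alternatives = alternatives

        def sample(self, lateral_hint=None):
            # Stand-in for a lateral factor (rectangular nodes, Fig. 2B):
            # prefer choices compatible with the neighboring pool's choice.
            ok = [a for a in self.alternatives
                  if lateral_hint is None or compatible(a, lateral_hint)]
            return random.choice(ok or self.alternatives)

    class Feature:
        # AND node: instantiating it instantiates every child pool.
        def __init__(self, child_pools):
            self.child_pools = child_pools

        def sample(self):
            rendered, prev = [], None
            for pool in self.child_pools:  # pools linked left to right
                choice = pool.sample(lateral_hint=prev)
                prev = choice
                if isinstance(choice, Feature):
                    rendered.extend(choice.sample())  # recurse to primitives
                else:
                    rendered.append(choice)  # an edge primitive
            return rendered

    # Example: a two-pool "corner" whose pools allow slightly shifted edges;
    # the lateral hint keeps the two sampled primitives spatially adjacent.
    corner = Feature([Pool([(0, 0, "|"), (0, 1, "|")]),
                      Pool([(1, 0, "-"), (1, 1, "-")])])
    print(corner.sample())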
Figure 2C shows a three-level network that represents the contours of a square. The features at the lowest, intermediate […]
[…] the sharing of features between different objects (42). Both of these result in efficient learning and inference through shared computations.

Surfaces are modeled using a pairwise CRF (Fig. 3C). Local surface-patch properties like color, texture, or surface normal are represented by categorical variables, whose smoothness of variation is enforced by the lateral factors (gray squares in Fig. 2). Contours generated by the contour hierarchy interact with the surface CRF in a specific way: contours signal the breaks in continuity of surfaces that occur both within an object and between the object and its background, a representational choice inspired by neurobiology (19). Figure 3, B and D, shows samples generated from an RCN.
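As a rough illustration of this contour-surface interaction, a toy version of the pairwise factor might look as follows. This is our own Potts-style formulation with hypothetical names; the paper's actual potentials are specified in (33). The term rewards neighboring patches for sharing a label except where the contour hierarchy asserts a break, so appearance may change freely across object and object/background boundaries.

    def pairwise_log_factor(s_i, s_j, contour_break, beta=2.0):
        # Toy smoothness factor between two neighboring surface patches.
        # s_i, s_j: categorical surface properties (quantized color,
        # texture, or surface normal). contour_break: True if the contour
        # hierarchy places an edge between the two patches.
        if contour_break:
            return 0.0  # a contour licenses a surface discontinuity
        return beta if s_i == s_j else -beta  # enforce smoothness elsewhere

    def surface_log_score(labels, h_breaks, v_breaks):
        # Log-score of a labeling on a 4-connected grid: labels[r][c] is the
        # label of patch (r, c); h_breaks[r][c] marks a contour between
        # (r, c) and (r, c + 1), v_breaks[r][c] between (r, c) and (r + 1, c).
        rows, cols = len(labels), len(labels[0])
        score = 0.0
        for r in range(rows):
            for c in range(cols - 1):
                score += pairwise_log_factor(labels[r][c], labels[r][c + 1],
                                             h_breaks[r][c])
        for r in range(rows - 1):
            for c in range(cols):
                score += pairwise_log_factor(labels[r][c], labels[r + 1][c],
                                             v_breaks[r][c])
        return score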
[…] mask on the input image. By considering combinations of object hypotheses, i.e., parses, that produce spatially contiguous masks when their 2D masks overlap, we create a topological ordering of the parses by sorting them according to masks that are contained in other masks. This results in a recursive computation of the score where only a linear number of candidate parses need to be evaluated in searching for the best parse. See section 4.7 of (33) for more details.
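The containment-based ordering admits a very short sketch (hypothetical names; the full scoring recursion is in section 4.7 of (33)). A mask properly contained in another is necessarily smaller, so sorting hypotheses by mask size yields a topological order in which every contained hypothesis is scored before any parse that contains it:

    def order_parses(hypotheses):
        # Topologically order object hypotheses by 2D-mask containment.
        # Each hypothesis exposes .mask, a set of (row, col) pixels. A mask
        # properly contained in another is necessarily smaller, so sorting
        # by mask size guarantees contained hypotheses are scored first and
        # each candidate parse is evaluated exactly once, in this order.
        return sorted(hypotheses, key=lambda h: len(h.mask))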
Learning

Features and lateral connections up to the penultimate level of the network are trained unsupervised using a generic 3D object data set that is task agnostic and rendered only as contour images. The resulting learned features vary from simple […]
[…] at the lowest level have a smoothing parameter that sets an estimate on the probability that an edge pixel is ON owing to noise. This parameter can be set according to the noise levels in a domain.
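One way to read this parameter (our interpretation, with hypothetical names) is as the false-positive rate of a per-pixel Bernoulli noise model over edge observations; raising it makes inference more tolerant of speckled inputs.

    import math

    def edge_log_likelihood(observed_on, predicted_on,
                            p_noise=0.02, p_miss=0.05):
        # Toy per-pixel edge likelihood. p_noise plays the role of the
        # smoothing parameter: the probability that an edge pixel is ON
        # owing to noise alone. p_miss is the chance that a true contour
        # pixel fails to register as an edge.
        p_on = (1.0 - p_miss) if predicted_on else p_noise
        return math.log(p_on if observed_on else 1.0 - p_on)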
Results

A CAPTCHA is considered broken if it can be automatically solved at a rate above 1% (3). RCN was effective in breaking a wide variety of text-based CAPTCHAs with very little training data, and without using CAPTCHA-specific heuristics (Fig. 5). It was able to solve reCAPTCHAs at an accuracy rate of 66.6% (character-level accuracy of 94.3%), BotDetect at 64.4%, Yahoo at 57.4%, and PayPal at 57.1%, significantly above the 1% rate at which CAPTCHAs are considered ineffective […]

[…] the network deteriorates rapidly with even minor perturbations to the spacing of characters that are barely perceptible to humans: 15% more spacing reduced accuracy to 38.4%, and 25% more spacing reduced accuracy to just 7%. This suggests that the deep-learning method learned to exploit the specifics of a particular CAPTCHA rather than learning models of characters that are then used for parsing the scene. For RCN, increasing the spacing of the characters results in an improvement in the recognition accuracy (Fig. 5B).

The wide variety of character appearances in BotDetect (Fig. 5C) demonstrates why the factorization of contours and surfaces is important: models without this factorization could latch on to the specific appearance details of a font, thereby limiting their generalization. The RCN results are […]
The effect of increasing the number of levels in the hierarchy is to reduce the inference time, as detailed in section 8.11 of (33).

As a generative model, RCN outperformed Variational Autoencoders (VAE) (49) and DRAW (50) on reconstructing corrupted MNIST images (Fig. 7, A and B). DRAW's advantage over RCN for the clean test set is not surprising, because DRAW learns an overly flexible model that almost copies the input image in the reconstruction, which hurts its performance on more cluttered data sets [section 8.9 of (33)]. On the Omniglot data (1), examples generated from RCN after one-shot training showed significant variations, while still being identifiable as the original category [Fig. 7D and section 8.6 of (33)].

[…] segmentation of the characters (Fig. 7E) that the competing methods do not provide.

Discussion

Segmentation resistance, the primary defense of text-based CAPTCHAs, has been a general principle that enabled their automated generation (2, 3). Although specific CAPTCHAs have been broken before using style-specific segmentation heuristics (3, 7), those attacks could be foiled easily by minor alterations to CAPTCHAs. RCN breaks the segmentation defense in a fundamental way and with very little training data, which suggests that websites should move to more robust mechanisms for blocking bots.

Compositional models have been successfully used in the […]
[…] combined to obtain the expressive power and efficient inference required to model the parallel and sequential processes (60) involved in perception and cognition.

Of course, Douglas Hofstadter's challenge, understanding letterforms with the same efficiency and flexibility as humans, still stands as a grand goal for artificial intelligence. People use a lot more commonsense knowledge, in context-sensitive and dynamic ways, when they identify letterforms (Fig. 1C, iii). Our work suggests that incorporating inductive biases from systems neuroscience can lead to robust, generalizable machine-learning models that demonstrate high data efficiency. We hope that this work inspires improved models of cortical circuits (61, 62) and investigations that combine the power of neural networks and structured probabilistic […]

[…] from Google Fonts, which resulted in 25,584 character training images. From this we selected a set of training images using an automated greedy font-selection approach. We rendered binary images for all fonts and then used the resulting images of the same letter to train an RCN. This RCN is then used to recognize the exact images it was trained on, providing a compatibility score (between 0.0 and 1.0) for all pairs of fonts of the same letter. Finally, using a threshold (=0.8) as the stopping criterion, we greedily select the most representative fonts until 90% of all fonts are represented, which resulted in 776 unique training images. The parser is trained using 630 word images, and the character n-grams are trained using words from Wikipedia.
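A minimal sketch of that greedy selection follows; the data structures and names are our assumptions, with scores[(f, g)] holding the RCN compatibility score between renderings of the same letter in fonts f and g.

    def select_representative_fonts(fonts, scores, threshold=0.8, coverage=0.9):
        # Greedily pick fonts for one letter until `coverage` of all fonts
        # is within `threshold` compatibility of some picked font. Assumes
        # scores[(f, f)] == 1.0, so every pick covers at least itself and
        # the loop always makes progress.
        covered, picks = set(), []
        while len(covered) < coverage * len(fonts):
            best = max((f for f in fonts if f not in picks),
                       key=lambda f: sum(1 for g in fonts
                                         if g not in covered
                                         and scores[(f, g)] >= threshold))
            picks.append(best)
            covered.update(g for g in fonts if scores[(best, g)] >= threshold)
        return picks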
RCN classification experiments on the MNIST data set are […]
REFERENCES AND NOTES

10. […] models of cognition: Exploring representations and inductive biases. Trends Cogn. Sci. 14, 357–364 (2010). doi:10.1016/j.tics.2010.05.004
11. T. S. Lee, The visual system's internal model of the world, in Proceedings of the IEEE, vol. 103 (IEEE, 2015), pp. 1359–1378.
12. D. Kersten, A. Yuille, Bayesian models of object perception. Curr. Opin. Neurobiol. 13, 150–158 (2003). doi:10.1016/S0959-4388(03)00042-4
13. C. D. Gilbert, W. Li, Top-down influences on visual processing. Nat. Rev. Neurosci. 14, 350–363 (2013). doi:10.1038/nrn3476
14. V. A. F. Lamme, P. R. Roelfsema, The distinct modes of vision offered by feedforward and recurrent processing. Trends Neurosci. 23, 571–579 (2000). doi:10.1016/S0166-2236(00)01657-X
15. P. R. Roelfsema, V. A. F. Lamme, H. Spekreijse, Object-based attention in the primary visual cortex of the macaque monkey. Nature 395, 376–381 (1998). doi:10.1038/26475
16. E. H. Cohen, F. Tong, Neural mechanisms of object-based attention. Cereb. Cortex 25, 1080–1092 (2015). doi:10.1093/cercor/bht303
17. D. J. Field, A. Hayes, R. F. Hess, Contour integration by the human visual system: Evidence for a local association field. Vision Res. 33, 173–193 (1993).
35. […] 100, 212–224 (2006). doi:10.1016/j.jphysparis.2007.01.001
36. T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, T. Poggio, Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell. 29, 411–426 (2007). doi:10.1109/TPAMI.2007.56
37. C. Guo, S.-C. Zhu, Y. N. Wu, Primal sketch: Integrating structure and texture. Comput. Vis. Image Underst. 106, 5–19 (2007). doi:10.1016/j.cviu.2005.09.004
38. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, 1988).
39. T. Wu, S.-C. Zhu, A numerical study of the bottom-up and top-down inference processes in and-or graphs. Int. J. Comput. Vis. 93, 226–252 (2011). doi:10.1007/s11263-010-0346-6
40. J. Xu, T. L. Wickramarathne, N. V. Chawla, Representing higher-order dependencies in networks. Sci. Adv. 2, e1600028 (2016). doi:10.1126/sciadv.1600028
41. D. Sontag, A. Globerson, T. Jaakkola, Introduction to dual decomposition for inference, in Optimization for Machine Learning (MIT Press, 2010), pp. 1–37.
42. E. Bienenstock, S. Geman, D. Potter, Compositionality, MDL priors, and object recognition, in Advances in Neural Information Processing Systems 10 (1997), pp. […]
57. […] compositions. Comput. Vis. Image Underst. 138, 102–113 (2015). doi:10.1016/j.cviu.2015.04.006
58. S. M. Ali Eslami, N. Heess, T. Weber, Y. Tassa, K. Kavukcuoglu, Attend, infer, repeat: Fast scene understanding with generative models, paper presented at the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5 to 10 December 2016.
59. M. Lázaro-Gredilla, Y. Liu, D. S. Phoenix, D. George, Hierarchical compositional feature learning, arXiv:1611.02252 [cs.LG] (7 November 2016).
60. S. Ullman, Visual routines. Cognition 18, 97–159 (1984). doi:10.1016/0010-0277(84)90023-4
61. S. Litvak, S. Ullman, Cortical circuitry implementing graphical models. Neural Comput. 21, 3010–3056 (2009). doi:10.1162/neco.2009.05-08-783
62. D. George, J. Hawkins, Towards a mathematical theory of cortical micro-circuits. PLOS Comput. Biol. 5, e1000532 (2009). doi:10.1371/journal.pcbi.1000532
63. N. Le Roux, N. Heess, J. Shotton, J. Winn, Learning a generative model of images by factoring appearance and shape. Neural Comput. 23, 593–650 (2011). doi:10.1162/NECO_a_00086
82. […] Trends Cogn. Sci. 11, 58–64 (2007). doi:10.1016/j.tics.2006.11.009
83. L. Bottou, Y. Bengio, Y. Le Cun, Global training of document processing systems using graph transformer networks, in 1997 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 1997), pp. 489–494.
84. Z. Kourtzi, N. Kanwisher, Representation of perceived object shape by the human lateral occipital complex. Science 293, 1506–1509 (2001). doi:10.1126/science.1061133
85. J. J. DiCarlo, D. Zoccolan, N. C. Rust, How does the brain solve visual object recognition? Neuron 73, 415–434 (2012). doi:10.1016/j.neuron.2012.01.010
86. D. H. Hubel, T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160, 106–154 (1962). doi:10.1113/jphysiol.1962.sp006837
87. L. Wiskott, T. J. Sejnowski, Slow feature analysis: Unsupervised learning of invariances. Neural Comput. 14, 715–770 (2002). doi:10.1162/089976602317318938
88. R. D. S. Raizada, S. Grossberg, Towards a theory of the laminar architecture of cerebral cortex: Computational clues from the visual system. Cereb. Cortex 13, […]
106. […] language of thought, in The Conceptual Mind: New Directions in the Study of Concepts, E. Margolis, S. Laurence, Eds. (MIT Press, 2015), pp. 623–654.
107. H. Lee, R. Grosse, R. Ranganath, A. Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in Proceedings of the 26th Annual International Conference on Machine Learning (ACM, 2009), pp. 609–616.
108. D. Kingma, J. Ba, Adam: A method for stochastic optimization, paper presented at the International Conference on Learning Representations (ICLR) 2015, San Diego, CA, 7 to 9 May 2015.
ACKNOWLEDGMENTS
We thank the anonymous reviewers for their careful review and helpful suggestions
that greatly improved the manuscript. We thank B. Olshausen, T. Dean, B. Lake,
and B. Jaros for suggesting improvements after reading early versions of this
manuscript. We are grateful to A. Yuille and F.-F. Li for insightful discussions
leading to this work. Data sets used in the paper are available for download at
www.vicarious.com. The inventions described in this paper are protected by U.S.
patents 9262698, 9373085, 9607262, and 9607263. As text-based CAPTCHAs […]
Fig. 1. Flexibility of letterform perception in humans. (A) Humans are good at parsing unfamiliar CAPTCHAs. (B) The same character shape can be rendered in a wide variety of appearances, and people can detect the 'A' in these images regardless. (C) Common sense and context affect letterform perception: (i) 'm' vs. 'u' and 'n'; (ii) the same line segments are interpreted as 'N' or 'S' depending on occluder positions; (iii) perception of the shapes aids the recognition of 'b,i,s,o,n' and 'b,i,k,e'. [Bison logo with permission from Seamus Leonard, http://www.steadynow.com]
Fig. 2. Structure of the RCN. (A) A hierarchy generates the contours of an object, and a Conditional Random Field (CRF) generates its surface appearance. (B) Two subnetworks at the same level of the contour hierarchy keep separate lateral connections by making parent-specific copies of child features and connecting them with parent-specific laterals; nodes within the green rectangle are copies of the feature marked 'e'. (C) A three-level RCN representing the contours of a square. Features at Level 2 represent the four […]
Fig. 3. Samples from RCN. (A) Samples from a corner feature with and without lateral connections. (B) Samples from character 'A' for different deformability settings, determined by pooling and lateral perturb-factors, in a 3-level hierarchy similar to Fig. 2D, where the lowest-level features are edges. Column 2 shows a balanced setting where deformability is distributed between the levels to produce local deformations and global translations. The other columns show some extreme configurations. (C) Contour to surface-CRF interaction for a cube. Green factors: foreground-to-background edges; blue: within-object […]
Fig. 4. (A) (i) Forward pass, including lateral propagation, produces hypotheses about the multiple letters present in the input image. PreProc is a bank of Gabor-like filters that convert from pixels to edge likelihoods [section 4.2 of (33)]. (ii) Backward pass and lateral propagation create the segmentation mask for a selected forward-pass hypothesis, here the letter 'A' [section 4.4 of (33)]. (iii) A false hypothesis 'V' is hallucinated at the intersection of 'A' and 'K'; false hypotheses are resolved via parsing [section 4.7 of (33)]. (iv) Multiple hypotheses can be activated to produce a joint explanation that involves explaining away and occlusion reasoning. (B) Learning features at the second feature level. Colored circles represent feature activations. The dotted circle is a proposed feature [see text and section 5 of (33)]. (C) Learning of laterals from contour adjacency (see text).
Fig. 5. Parsing CAPTCHAs with RCN. (A) Representative reCAPTCHA parses showing the top two solutions, their segmentations, and labels by two different Amazon Mechanical Turk workers. (B) Word accuracy rates of RCN and CNN on the control CAPTCHA data set. CNN is brittle and RCN is robust when character spacing is changed. (C) Accuracies for different CAPTCHA styles. (D) Representative BotDetect parses and segmentations (indicated by […]
Fig. 6. MNIST classification results for training with few examples. (A) MNIST classification accuracy for RCN, CNN, and CPM. (B) Classification accuracy on corrupted MNIST tests. Legends show the total number of training examples. (C) MNIST classification accuracy for different RCN configurations.
Fig. 7. Generation, occlusion reasoning, and scene-text parsing with RCN. Examples of reconstructions (A) and reconstruction error (B) from RCN, VAE, and DRAW on corrupted MNIST. Legends show the number of training examples. (C) Occlusion reasoning. The third column shows edges remaining after RCN explains away the edges of the first detected object. Ground-truth masks reflect the occlusion relationships between the square and the digit: the portions of the digit that are in front of the square are indicated in brown, and the portions that are behind the square are indicated in orange. The last column shows the predicted occlusion mask. (D) One-shot generation from Omniglot. In each column, row 1 shows the training example, and the remaining rows show generated samples. (E) Examples of ICDAR images successfully parsed by RCN. The yellow outlines show segmentations.
Fig. 8. Application of RCN to parsing scenes with objects. Shown are the detections and instance segmentations obtained when RCN was applied to a scene-parsing task with multiple real-world objects in cluttered scenes on random backgrounds. Our experiments suggest that RCN could be generalized beyond text parsing [see section 8.12 of (33) and Discussion].
Table 1. Accuracy and number of training images for different methods on the ICDAR-13 robust reading data set.
SUPPLEMENTARY MATERIALS: http://science.sciencemag.org/content/suppl/2017/10/25/science.aag2612.DC1