ML Science

Machine learning is a discipline focused on two interrelated questions: How can one construct computer systems that automatically improve through experience? and What are the fundamental statistical-computational-information-theoretic laws that govern all learning systems, including computers, humans, and organizations? The study of machine learning is important both for addressing these fundamental scientific and engineering questions and for the highly practical computer software it has produced and fielded across many applications.

Machine learning has progressed dramatically over the past two decades, from laboratory curiosity to a practical technology in widespread commercial use. Within artificial intelligence (AI), machine learning has emerged as the method of choice for developing practical software for computer vision, speech recognition, natural language processing, robot control, and other applications. Many developers of AI systems now recognize that, for many applications, it can be far easier to train a system by showing it examples of desired input-output behavior than to program it manually by anticipating the desired response for all possible inputs. The effect of machine learning has also been felt broadly across computer science and across a range of industries concerned with data-intensive issues, such as consumer services, the diagnosis of faults in complex systems, and the control of logistics chains. There has been a similarly broad range of effects across empirical sciences, from biology to cosmology to social science, as machine-learning methods have been developed to analyze high-throughput experimental data in novel ways. See Fig. 1 for a depiction of some recent areas of application of machine learning.

A learning problem can be defined as the problem of improving some measure of performance when executing some task, through some type of training experience. For example, in learning to detect credit-card fraud, the task is to assign a label of "fraud" or "not fraud" to any given credit-card transaction. The performance metric to be improved might be the accuracy of this fraud classifier, and the training experience might consist of a collection of historical credit-card transactions, each labeled in retrospect as fraudulent or not. Alternatively, one might define a different performance metric that assigns a higher penalty when "fraud" is labeled "not fraud" than when "not fraud" is incorrectly labeled "fraud." One might also define a different type of training experience—for example, by including unlabeled credit-card transactions along with labeled examples.

A diverse array of machine-learning algorithms has been developed to cover the wide variety of data and problem types exhibited across different machine-learning problems (1, 2). Conceptually, machine-learning algorithms can be viewed as searching through a large space of candidate programs, guided by training experience, to find a program that optimizes the performance metric. Machine-learning algorithms vary greatly, in part by the way in which they represent candidate programs (e.g., decision trees, mathematical functions, and general programming languages) and in part by the way in which they search through this space of programs (e.g., optimization algorithms with well-understood convergence guarantees and evolutionary search methods that evaluate successive generations of randomly mutated programs). Here, we focus on approaches that have been particularly successful to date.

Many algorithms focus on function approximation problems, where the task is embodied in a function (e.g., given an input transaction, output a "fraud" or "not fraud" label), and the learning problem is to improve the accuracy of that function, with experience consisting of a sample of known input-output pairs of the function. In some cases, the function is represented explicitly as a parameterized functional form; in other cases, the function is implicit and obtained via a search process, a factorization, an optimization

1Department of Electrical Engineering and Computer Sciences, Department of Statistics, University of California, Berkeley, CA, USA. 2Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA.
*Corresponding author. E-mail: [email protected] (M.I.J.); [email protected] (T.M.M.)
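The credit-card example can be made concrete with a toy sketch in plain Python: a logistic-regression fraud classifier fit by gradient ascent on synthetic labeled transactions. All features, numbers, and costs below are invented for illustration, and the model is one simple point in the "space of candidate programs" discussed above; note how the same trained classifier scores differently under plain accuracy versus a cost-sensitive metric that penalizes a missed fraud more heavily than a false alarm.

```python
import math
import random

random.seed(0)

# Synthetic training experience: historical transactions labeled in retrospect.
# Invented features: (amount in $1000s, distance from home in 100s of km); label 1 = fraud.
data = [((random.gauss(0.3, 0.2), random.gauss(0.2, 0.2)), 0) for _ in range(200)] + \
       [((random.gauss(1.5, 0.4), random.gauss(1.2, 0.4)), 1) for _ in range(40)]

def predict(w, x):
    """Logistic model: probability that transaction x is fraudulent (numerically stable)."""
    z = w[0] + w[1] * x[0] + w[2] * x[1]
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Training: search the space of candidate programs (here, weight vectors)
# by stochastic gradient ascent on the log-likelihood of the labeled examples.
w = [0.0, 0.0, 0.0]
for _ in range(500):
    for x, y in data:
        p = predict(w, x)
        w[0] += 0.05 * (y - p)
        w[1] += 0.05 * (y - p) * x[0]
        w[2] += 0.05 * (y - p) * x[1]

def evaluate(w, data, miss_cost=10.0, false_alarm_cost=1.0):
    """Two performance metrics for the same classifier: accuracy, and a
    cost-sensitive total that charges a missed fraud ten times a false alarm."""
    correct, cost = 0, 0.0
    for x, y in data:
        yhat = 1 if predict(w, x) >= 0.5 else 0
        if yhat == y:
            correct += 1
        elif y == 1:            # fraud labeled "not fraud"
            cost += miss_cost
        else:                   # "not fraud" labeled "fraud"
            cost += false_alarm_cost
    return correct / len(data), cost

acc, cost = evaluate(w, data)
print(f"training accuracy {acc:.2f}, weighted cost {cost:.1f}")
```

Swapping in a different `miss_cost` changes which candidate classifier looks best, which is exactly the sense in which the performance metric, not just the data, defines the learning problem.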
[Figure 2 (top): the input image passes through convolutional feature extraction (a 14 x 14 feature map) into an LSTM-based recurrent network with attention over the image, which generates the caption word by word: "A bird flying over a body of water."]

Fig. 2. Automatic generation of text captions for images with deep networks. A convolutional neural network is trained to interpret images, and its output is then used by a recurrent neural network trained to generate a text caption (top). The sequence at the bottom shows the word-by-word focus of the network on different parts of the input image while it generates the caption word by word. [Adapted with permission from (30)]
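The "focus" in Fig. 2 comes from a soft attention mechanism: at each word, the network assigns a weight to every image region and forms a weighted average of the region features. The following is a minimal sketch of that one step, with made-up two-dimensional region features and fixed relevance scores; in the model of (30), both the features and the scores are produced by the trained networks.

```python
import math

# Hypothetical features for four image regions (e.g., cells of a 14 x 14
# feature map, collapsed here to 2-D vectors purely for illustration).
regions = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

# Hypothetical relevance scores for the next word to be generated; in a
# real captioning model these come from the recurrent network's state.
scores = [2.0, 1.5, 0.1, 0.2]

def softmax(xs):
    """Turn arbitrary scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

weights = softmax(scores)               # attention: where the network "looks"
context = [sum(w * r[d] for w, r in zip(weights, regions))
           for d in range(2)]           # weighted average of region features

print("attention weights:", [round(w, 2) for w in weights])
print("context vector:", [round(c, 2) for c in context])
```

Because the weights concentrate on the first two regions, the context vector is dominated by their features; it is this context vector that the caption generator conditions on when emitting the next word.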
[Figure 3: two example topics shown as word distributions ("gene 0.04, dna 0.02, genetic 0.01, ..." and "brain 0.04, neuron 0.02, nerve 0.01, ..."), alongside a document whose words are color-coded by their topic assignments and a histogram of the document's topic proportions.]

Fig. 3. Topic models. Topic modeling is a methodology for analyzing documents, where a document is viewed as a collection of words, and the words in the document are viewed as being generated by an underlying set of topics (denoted by the colors in the figure). Topics are probability distributions across words (leftmost column), and each document is characterized by a probability distribution across topics (histogram). These distributions are inferred based on the analysis of a collection of documents and can be used to classify, index, and summarize the content of documents. [From (31). Copyright 2012, Association for Computing Machinery, Inc. Reprinted with permission]
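The generative story behind Fig. 3 can be sketched in a few lines of Python: each word of a document is produced by first drawing a topic from the document's topic proportions and then drawing a word from that topic's distribution. The two topics and all probabilities below are illustrative stand-ins (loosely echoing the figure, renormalized so each distribution sums to 1); a real topic model runs this story in reverse, inferring the distributions from a corpus.

```python
import random

random.seed(1)

# Two illustrative topics: probability distributions across words.
topics = {
    "genetics":     {"gene": 0.4, "dna": 0.3, "genetic": 0.2, "organism": 0.1},
    "neuroscience": {"brain": 0.4, "neuron": 0.3, "nerve": 0.2, "signal": 0.1},
}

def sample(dist):
    """Draw one key from a dict mapping outcomes to probabilities."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r <= acc:
            return outcome
    return outcome  # guard against floating-point rounding

def generate_document(topic_proportions, length=15):
    """Each word: first draw a topic, then draw a word from that topic."""
    return [sample(topics[sample(topic_proportions)]) for _ in range(length)]

# A document that is 70% genetics, 30% neuroscience.
doc = generate_document({"genetics": 0.7, "neuroscience": 0.3})
print(" ".join(doc))
```

Inference, the hard part that topic-modeling algorithms actually solve, goes the other way: given many such documents and neither the topics nor the proportions, recover both.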
(see Fig. 2). Deep network methods are being actively pursued in a variety of additional applications from natural language translation to collaborative filtering.

The internal layers of deep networks can be viewed as providing learned representations of the input data. While much of the practical success in deep learning has come from supervised learning methods for discovering such representations, efforts have also been made to develop deep learning algorithms that discover useful representations of the input without the need for labeled training data (13). The general problem is referred to as unsupervised learning, a second paradigm in machine-learning research (2).

Broadly, unsupervised learning generally involves the analysis of unlabeled data under assumptions about structural properties of the data (e.g., algebraic, combinatorial, or probabilistic). For example, one can assume that data lie on a low-dimensional manifold and aim to identify that manifold explicitly from data. Dimension reduction methods—including principal components analysis, manifold learning, factor analysis, random projections, and autoencoders (1, 2)—make different specific assumptions regarding the underlying manifold (e.g., that it is a linear subspace, a smooth nonlinear manifold, or a collection of submanifolds). Another example of dimension reduction is the topic modeling framework depicted in Fig. 3. A criterion function is defined that embodies these assumptions—often making use of general statistical principles such as maximum likelihood, the method of moments, or Bayesian integration—and optimization or sampling algorithms are developed to optimize the criterion. As another example, clustering is the problem of finding a partition of the observed data (and a rule for predicting future data) in the absence of explicit labels indicating a desired partition. A wide range of clustering procedures has been developed, all based on specific assumptions regarding the nature of a "cluster." In both clustering and dimension reduction, the concern with computational complexity is paramount, given that the goal is to exploit the particularly large data sets that are available if one dispenses with supervised labels.

A third major machine-learning paradigm is reinforcement learning (14, 15). Here, the information available in the training data is intermediate between supervised and unsupervised learning. Instead of training examples that indicate the correct output for a given input, the training data in reinforcement learning are assumed to provide only an indication as to whether an action is correct or not; if an action is incorrect, there remains the problem of finding the correct action. More generally, in the setting of sequences of inputs, it is assumed that reward signals refer to the entire sequence; the assignment of credit or blame to individual actions in the sequence is not directly provided. Indeed, although simplified versions of reinforcement learning known as bandit problems are studied, where it is assumed that rewards are provided after each action, reinforcement learning problems typically involve a general control-theoretic setting in which the learning task is to learn a control strategy (a "policy") for an agent acting in an unknown dynamical environment, where that learned strategy is trained to choose actions for any given state, with the objective of maximizing its expected reward over time. The ties to research in control theory and operations research have increased over the years, with formulations such as Markov decision processes and partially observed Markov decision processes providing points of contact (15, 16). Reinforcement-learning algorithms generally make use of ideas that are familiar from the control-theory literature, such as policy iteration, value iteration, rollouts, and variance reduction, with innovations arising to address the specific needs of machine learning (e.g., large-scale problems, few assumptions about the unknown dynamical environment, and the use of supervised learning architectures to represent policies). It is also worth noting the strong ties between reinforcement learning and many decades of work on learning in psychology and neuroscience, one notable example being the use of reinforcement-learning algorithms to predict the response of dopaminergic neurons in monkeys learning to associate a stimulus light with subsequent sugar reward (17).

Although these three learning paradigms help to organize ideas, much current research involves blends across these categories. For example, semisupervised learning makes use of unlabeled data to augment labeled data in a supervised learning context, and discriminative training blends architectures developed for unsupervised learning with optimization formulations that make use of labels. Model selection is the broad activity of using training data not only to fit a model but also to select from a family of models, and the fact that training data do not directly indicate
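The value-iteration idea mentioned above can be shown on a toy Markov decision process; the four-state chain, rewards, and discount factor below are invented for illustration. Note one simplification: value iteration as written here assumes the transition dynamics are known, whereas reinforcement-learning algorithms proper must estimate such quantities from interaction with an unknown environment.

```python
# Toy deterministic MDP: states 0..3 in a line, actions move left or right,
# entering state 3 yields reward 1; discounted return with gamma = 0.9.
states = [0, 1, 2, 3]
actions = ["left", "right"]
gamma = 0.9

def step(s, a):
    """Deterministic transition and reward; state 3 is absorbing."""
    if s == 3:
        return s, 0.0
    s2 = max(0, s - 1) if a == "left" else min(3, s + 1)
    return s2, (1.0 if s2 == 3 else 0.0)

# Value iteration: repeatedly back up the Bellman optimality equation
# V(s) = max_a [ r(s, a) + gamma * V(s') ] until the values converge.
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: max(step(s, a)[1] + gamma * V[step(s, a)[0]] for a in actions)
         for s in states}

# The greedy policy with respect to V is the learned control strategy.
policy = {s: max(actions, key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
          for s in states}
print({s: round(V[s], 2) for s in states})
print(policy)
```

The values decay geometrically with distance from the reward (1.0, then 0.9, then 0.81), and the greedy policy moves right from every non-absorbing state, illustrating how a reward signal delivered only at the end of a sequence still determines the best action at every earlier state.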
ments and explicitly allow users to express and control trade-offs among resources.

As an example of resource constraints, let us suppose that the data are provided by a set of individuals who wish to retain a degree of privacy

the product in some fashion (perhaps by purchasing that item in the past). The machine-learning problem is to suggest other items to a given user that he or she may also be interested in, based on the data across all users.

machine learning remains a young field with many underexplored research opportunities. Some of these opportunities can be seen by contrasting current machine-learning approaches to the types of learning we observe in naturally occurring systems such as humans and other animals, organizations, economies, and biological evolution. For example, whereas most machine-learning algorithms are targeted to learn one specific function or data model from one single data source, humans clearly learn many different skills and types of knowledge, from years of diverse training experience, supervised and unsupervised, in a simple-to-more-difficult sequence (e.g., learning to crawl, then walk, then run). This has led some researchers to begin exploring the question of how to construct computer lifelong or never-ending learners that operate nonstop for years, learning thousands of interrelated skills or functions within an overall architecture that allows the system to improve its ability to learn one skill based on having learned another (26-28). Another aspect of the analogy to natural learning systems suggests the idea of team-based, mixed-initiative learning. For example, whereas current machine-learning systems typically operate in isolation to analyze the given data, people often work in teams to collect and analyze data (e.g., biologists have worked as teams to collect and analyze genomic data, bringing together diverse experiments and perspectives to make progress on this difficult problem). New machine-learning methods capable of working collaboratively with humans to jointly analyze complex data sets might bring together the abilities of machines to tease out subtle statistical regularities from massive data sets with the abilities of humans to draw on diverse background knowledge to generate plausible explanations and suggest new hypotheses. Many theoretical results in machine learning apply to all learning systems, whether they are computer algorithms, animals, organizations, or natural evolution. As the field progresses, we may see machine-learning theory and algorithms increasingly providing models for understanding learning in neural systems, organizations, and biological evolution and see machine learning benefit from ongoing studies of these other types of learning systems.

As with any powerful technology, machine learning raises questions about which of its potential uses society should encourage and discourage. The push in recent years to collect new kinds of personal data, motivated by its economic value, leads to obvious privacy issues, as mentioned above. The increasing value of data also raises a second ethical issue: Who will have access to, and ownership of, online data, and who will reap its benefits? Currently, much data are collected by corporations for specific uses leading to improved profits, with little or no motive for data sharing. However, the potential benefits that society could realize, even from existing online data, would be considerable if those data were to be made available for public good.

To illustrate, consider one simple example of how society could benefit from data that are already online today by using these data to decrease the risk of global pandemic spread from infectious diseases. By combining location data from online sources (e.g., location data from cell phones, from credit-card transactions at retail outlets, and from security cameras in public places and private buildings) with online medical data (e.g., emergency room admissions), it would be feasible today to implement a simple system to telephone individuals immediately if a person they were in close contact with yesterday was just admitted to the emergency room with an infectious disease, alerting them to the symptoms they should watch for and precautions they should take. Here, there is clearly a tension and trade-off between personal privacy and public health, and society at large needs to make the decision on how to make this trade-off. The larger point of this example, however, is that, although the data are already online, we do not currently have the laws, customs, culture, or mechanisms to enable the data to benefit us. Considerations such as these suggest that machine learning is likely to be one of the most transformative technologies of the 21st century. Although it is impossible to predict the future, it appears essential that society begin now to consider how to maximize its benefits.

[Figure 5: diagram of the layered stack. Access and processing libraries (GraphX, SparkR, Splash, Velox, MLPipelines, SparkSQL, MLlib) sit atop the Spark Core processing engine; below are a storage layer (Succinct, Tachyon; HDFS, S3, Ceph, …) and a resource-virtualization layer (Mesos, Hadoop YARN). Components are marked as AMPLab-developed, Spark community, or third party.]

Fig. 5. Data analytics stack. Scalable machine-learning systems are layered architectures that are built on parallel and distributed computing platforms. The architecture depicted here—an open-source data analysis stack developed in the Algorithms, Machines and People (AMP) Laboratory at the University of California, Berkeley—includes layers that interface to underlying operating systems; layers that provide distributed storage, data management, and processing; and layers that provide core machine-learning competencies such as streaming, subsampling, pipelines, graph processing, and model serving.

REFERENCES

1. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, New York, 2011).
2. K. Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, Cambridge, MA, 2012).
3. L. Valiant, Commun. ACM 27, 1134-1142 (1984).
4. V. Chandrasekaran, M. I. Jordan, Proc. Natl. Acad. Sci. U.S.A. 110, E1181-E1190 (2013).
5. S. Decatur, O. Goldreich, D. Ron, SIAM J. Comput. 29, 854-879 (2000).
6. S. Shalev-Shwartz, O. Shamir, E. Tromer, Using more data to speed up training time, Proceedings of the Fifteenth Conference on Artificial Intelligence and Statistics, Canary Islands, Spain, 21 to 23 April 2012.
7. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, in Foundations and Trends in Machine Learning 3 (Now Publishers, Boston, 2011), pp. 1-122.
8. S. Sra, S. Nowozin, S. Wright, Optimization for Machine Learning (MIT Press, Cambridge, MA, 2011).
9. J. Schmidhuber, Neural Netw. 61, 85-117 (2015).
10. Y. Bengio, in Foundations and Trends in Machine Learning 2 (Now Publishers, Boston, 2009), pp. 1-127.
11. A. Krizhevsky, I. Sutskever, G. Hinton, Adv. Neural Inf. Process. Syst. 25, 1097-1105 (2012).
12. G. Hinton et al., IEEE Signal Process. Mag. 29, 82-97 (2012).
13. G. E. Hinton, R. R. Salakhutdinov, Science 313, 504-507 (2006).
14. V. Mnih et al., Nature 518, 529-533 (2015).
15. R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, 1998).
16. E. Yaylali, J. S. Ivy, Partially observable MDPs (POMDPs): Introduction and examples, in Encyclopedia of Operations Research and Management Science (John Wiley, New York, 2011).
17. W. Schultz, P. Dayan, P. R. Montague, Science 275, 1593-1599 (1997).
18. C. Dwork, F. McSherry, K. Nissim, A. Smith, in Proceedings of the Third Theory of Cryptography Conference, New York, 4 to 7 March 2006, pp. 265-284.
19. A. Blum, K. Ligett, A. Roth, J. ACM 20 (2013).
20. J. Duchi, M. I. Jordan, J. Wainwright, J. ACM 61, 1-57 (2014).
21. M.-F. Balcan, A. Blum, S. Fine, Y. Mansour, Distributed learning, communication complexity and privacy, Proceedings of the 29th Conference on Computational Learning Theory, Edinburgh, UK, 26 June to 1 July 2012.
22. Y. Zhang, J. Duchi, M. Jordan, M. Wainwright, in Advances in Neural Information Processing Systems 26, L. Bottou, C. Burges, Z. Ghahramani, M. Welling, Eds. (Curran Associates, Red Hook, NY, 2014), pp. 1-23.
23. Q. Berthet, P. Rigollet, Ann. Stat. 41, 1780-1815 (2013).
24. A. Kleiner, A. Talwalkar, P. Sarkar, M. I. Jordan, J. R. Stat. Soc. B 76, 795-816 (2014).
25. M. Mahoney, Found. Trends Mach. Learn. 3, 123-224 (2011).
26. T. Mitchell et al., Proceedings of the Twenty-Ninth Conference on Artificial Intelligence (AAAI-15), Austin, TX, 25 to 30 January 2015.
27. M. Taylor, P. Stone, J. Mach. Learn. Res. 10, 1633-1685 (2009).
28. S. Thrun, L. Pratt, Learning To Learn (Kluwer Academic Press, Boston, 1998).
29. L. Wehbe et al., PLOS ONE 9, e112575 (2014).
30. K. Xu et al., Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6 to 11 July 2015, vol. 37, pp. 2048-2057.
31. D. Blei, Commun. ACM 55, 77-84 (2012).

10.1126/science.aaa8415