AutoML: A Perspective Where Industry Meets Academy
1
Machine Learning Pipeline
So MANY choices
• Which feature transformation?
• Which model architecture?
• Which hyperparameters?
2
Machine Learning Pipeline
AutoML
• Auto Feature Generation
• Neural Architecture Search
• Hyperparameter Optimization
• Meta Learning
3
Automated Machine Learning
Hyperparameter Optimization
Neural Architecture Search
Inductive bias (prior α): how we represent data, which kinds of models to consider, how to tune hyper-parameters, how to transfer knowledge across tasks, etc.
5
Tutorial Schedule
6
Hyperparameter
Optimization (HPO)
7
Hyperparameter Optimization
Hyperparameter Optimization → Best Hyperparameter → Model Training → Best Model
8
Hyperparameter Configuration v.s. Schedule
9
Hyperparameter Optimization
q Hyperparameter Configuration
• Random search, Grid Search
• Successive-halving, Hyperband
• Hypergradient
10
Search Methods
14
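To make the budget-allocation idea concrete, here is a minimal, self-contained sketch of successive halving (one of the search methods above): the surviving configurations each receive a per-round budget, and only the top fraction advances to the next round with more budget. The `train_and_eval` function is a hypothetical stand-in for training a model under a given configuration and budget.

```python
import random

def successive_halving(configs, train_and_eval, min_budget=1, eta=3):
    """Allocate budget to the configurations, repeatedly keeping the top 1/eta.

    `configs` is a list of hyperparameter settings; `train_and_eval(config, budget)`
    returns a validation score (higher is better) after training with that budget.
    """
    budget = min_budget
    while len(configs) > 1:
        # Evaluate every surviving configuration with the current budget.
        scores = [(train_and_eval(c, budget), c) for c in configs]
        scores.sort(key=lambda s: s[0], reverse=True)
        # Keep the best 1/eta fraction and give them eta times more budget.
        keep = max(1, len(configs) // eta)
        configs = [c for _, c in scores[:keep]]
        budget *= eta
    return configs[0]

if __name__ == "__main__":
    # Toy example: "configurations" are learning rates, the score is a noisy function.
    def train_and_eval(lr, budget):
        return -(lr - 0.1) ** 2 + random.gauss(0, 0.01 / budget)

    candidates = [10 ** random.uniform(-3, 0) for _ in range(27)]
    print(successive_halving(candidates, train_and_eval))
```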
Bayesian Optimization
Given some tried {hyperparameter, performance} pairs,
which hyperparameter should be the next one to try?
Bayesian Optimization
15
Bayesian Optimization
Fit a probabilistic function f(x) to model {x=hyperparameter, f(x)=performance}
• Exploration-exploitation trade-off
• Costly
16
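As a concrete illustration of the exploration-exploitation trade-off above, here is a minimal Bayesian optimization sketch: a Gaussian-process surrogate fits the tried {hyperparameter, performance} pairs and an expected-improvement criterion picks the next one to try. The toy objective and the scikit-learn GP are illustrative choices for this sketch, not the tutorial's own implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, X_cand, y_best, xi=0.01):
    # EI is large where the predicted mean is good OR the uncertainty is high.
    mu, std = gp.predict(X_cand, return_std=True)
    std = np.maximum(std, 1e-9)
    z = (mu - y_best - xi) / std
    return (mu - y_best - xi) * norm.cdf(z) + std * norm.pdf(z)

def bayesian_optimization(f, bounds, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))   # tried hyperparameters
    y = np.array([f(x[0]) for x in X])                        # observed performance
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)                                          # fit the probabilistic surrogate f(x)
        X_cand = rng.uniform(bounds[0], bounds[1], size=(256, 1))
        x_next = X_cand[np.argmax(expected_improvement(gp, X_cand, y.max()))]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))                        # the costly evaluation
    return X[np.argmax(y)], y.max()

if __name__ == "__main__":
    # Toy objective: validation performance as a function of one hyperparameter.
    best_x, best_y = bayesian_optimization(lambda x: -(x - 0.3) ** 2, bounds=(0.0, 1.0))
    print(best_x, best_y)
```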
Hyperparameter Optimization
q Hyperparameter Configuration
• Random search, Grid Search
• Successive-halving, Hyperband
• Hypergradient
17
Hyperparameter Schedule
A generalized framework for population based training. KDD, 2019.
Self-tuning networks: Bilevel optimization of hyperparameters using structured best-response functions. ICLR, 2019.
18
Practical Challenge (1)
Hyperparameter Optimization → Best Hyperparameter → Model Training → Best Model
The loop becomes costly as Model Size and Data Size grow.
19
ABC: Sampling
Training Data —(sampling ratio: r)→ Model Training with hyperparameter configuration C → Model → Testing on Testing Data → Performance P, producing tuples {r, C, P}
Efficient Identification of Approximate Best Configuration of Training in Large Datasets. AAAI, 2019.
20
A New Method: 𝜀GE
q Existing methods
• Search-strategy based: Successive-halving, Hyperband, etc.
• Evolutionary algorithm: Population Based Training, etc.
• Bayesian optimization
q εGE combines three strategies (when terminated, output the configuration C with the best P):
• Random strategy: randomly choose a configuration with probability ε
• Greedy strategy: choose the best configuration
• Evolution strategy: choose the best configuration and perturb it with mutation and crossover
22
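The three strategies listed above can be combined into a single proposal rule. The sketch below is a simplified illustration of the ε-greedy/evolution idea, not the authors' exact εGE algorithm; the configuration dictionary and its fields are hypothetical.

```python
import random

def egreedy_evolution_step(history, epsilon=0.2, mutation_scale=0.2):
    """Pick the next configuration to try, given history = [(config_dict, performance)]."""
    if random.random() < epsilon:
        # Random strategy: explore a fresh configuration with probability epsilon.
        return {"lr": 10 ** random.uniform(-4, -1), "dropout": random.uniform(0.0, 0.5)}
    # Greedy strategy: start from the configuration with the best performance so far.
    best = max(history, key=lambda cp: cp[1])[0]
    other = random.choice(history)[0]
    # Evolution strategy: crossover with another member, then mutate each value.
    child = {k: random.choice([best[k], other[k]]) for k in best}
    return {k: v * random.uniform(1 - mutation_scale, 1 + mutation_scale) for k, v in child.items()}

if __name__ == "__main__":
    history = [({"lr": 0.01, "dropout": 0.1}, 0.80), ({"lr": 0.1, "dropout": 0.3}, 0.72)]
    print(egreedy_evolution_step(history))
```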
HPO: Sampling method-𝜀GE
Task-adaptively combining different hyperparameter optimization methods leads to faster solutions!
24
Hyperparameter Schedule
How to learn a good trade-off between the global search and local search?
25
HyperMutation (HPM)
Figure: HPM learns mutation values σ = 1 + tanh(λ) and mutates the exploited hyperparameters via h ← σ ⊙ h*, guided by hypergradients ∇h of the validation loss L_val, which are aggregated across the population members with a dot-product, softmax-weighted sum.
30
Continue hypertraining after exploit & explore
31
Experiments on test functions
Figure: (a)-(b) The mean performance computed by different methods along with the standard
deviation over 10 trials, in terms of different given budget of iterations. (c) The average mutation
values learned by HPM over 10 trials. In each trial, HPM runs 30 iterations in total with a
population size of 5, resulting in 6 training steps and 5 mutations.
32
Experiments on benchmark datasets
34
Takeaways
q Hyperparameter Configuration
• Random search, Grid Search
q Hyperparameter Schedule
• Population-based training
• Hypergradient
• HyperMutation (HPM)
35
Future Directions
Ø Faster, Green
§ HPO via Meta-Learning
Ø Interactive, Human-in-the-loop
36
Neural Architecture
Search (NAS)
37
Neural Architecture Search
q What is neural architecture search (NAS)?
• To find the optimal topology and/or size configuration for the neural network.
• E.g., select a filter from {CNN3×3, CNN5×5, DilatedCNN5×5}.
• E.g., determine the depth and width of a neural network.
q Why NAS?
• Architecture matters a lot for performance!
• The choices cannot be exhausted.
• Useful prior knowledge, e.g., the invariance possessed by the task, has been exploited.
38
Elements of NAS
q Search space
• All the possible configurations.
• E.g., filter size, activation functions, depth, etc.
q Search strategy
• How to utilize experience?
• How to propose new configuration to try?
• E.g., RL, ES, and differentiable search.
39
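To ground these two elements, the sketch below defines a toy search space and applies the simplest search strategy, random search; the space, its dimensions, and the stand-in evaluator are all hypothetical.

```python
import random

# A toy NAS search space: per-layer choices of op, width, depth, activation (illustrative only).
SEARCH_SPACE = {
    "op":         ["conv3x3", "conv5x5", "dilated_conv5x5"],
    "width":      [16, 32, 64],
    "depth":      [4, 8, 12],
    "activation": ["relu", "swish"],
}

def sample_architecture(rng):
    # A "configuration" fixes one choice per dimension of the search space.
    return {dim: rng.choice(choices) for dim, choices in SEARCH_SPACE.items()}

def random_search(evaluate, n_trials=20, seed=0):
    """The simplest search strategy: propose configurations at random and keep the best."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        arch = sample_architecture(rng)
        score = evaluate(arch)            # e.g., validation accuracy after (proxy) training
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

if __name__ == "__main__":
    # Stand-in evaluator; a real one would train the sampled network.
    fake_eval = lambda a: a["width"] / 64 + a["depth"] / 12 + random.random() * 0.1
    print(random_search(fake_eval))
```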
And the Theme of NAS
Exploitation v.s. Exploration
q Search space
• All the possible configurations.
• E.g., filter size, activation functions, depth, etc.
• Incorporating prior knowledge reduces the search space but makes it constrained to some extent, e.g., Inception-v2/3 → stacked cells [Zoph et al. 2018].
q Search strategy
• How to utilize experience?
• How to propose new configurations to try?
• E.g., RL, ES, and differentiable search.
• Instead of asymptotic regret, practitioners balance exploitation and exploration to achieve the best solution under a given finite horizon.
40
Pioneer Works of NAS
q Search space
• Consider both CNN and RNN cells.
• The configuration of each layer can be
determined respectively.
q Search strategy
• RL with the policy parameterized by an RNN.
q Performance estimation strategy
• Standard train & validation
Figure: An overview of the trial-and-error process of NAS.
Figure: How the controller (i.e., an RNN) samples a CNN with skip connections.
41
Neural Architecture Search with Reinforcement Learning. ICLR, 2017, https://arxiv.org/pdf/1611.01578.pdf
Weight Sharing for One-shot NAS
q Weight sharing
• Represent NAS's search space using a single DAG.
• An architecture can be realized by taking a subgraph.
• E.g., deducing an RNN cell as follows:
q One-shot NAS
• Each architecture (i.e., subgraph) is evaluated by inheriting the shared parameters.
• Shared parameters are trained with sampled architectures.
• Parameters and the controller are updated alternately.
Efficient Neural Architecture Search via Parameter Sharing. ICML, 2018, http://proceedings.mlr.press/v80/pham18a/pham18a.pdf 43
Differentiable NAS
q Continuous relaxation
• Each edge denotes a mixture of ops in O = {CNN3×3, DilatedCNN3×3, Zero, Identity, …}.
• For each edge (i, j), the weights of the ops are parameterized by an architecture parameter α^(i,j).
• Suppose the tensor at node_i is x; then the tensor propagated to node_j is the softmax-weighted mixture ō^(i,j)(x) = Σ_{o∈O} [exp(α_o^(i,j)) / Σ_{o'∈O} exp(α_{o'}^(i,j))] · o(x).
q Differentiable learning
• Formulated as a bilevel optimization problem: min_α L_val(w*(α), α) s.t. w*(α) = argmin_w L_train(w, α).
q Binarized architecture
• Transform real-valued path weights to binary gates.
• Only one path is active in memory at runtime (note the straight-through estimator (STE) trick).
Figure: ProxylessNAS directly optimizes the neural architecture on the target task and hardware.
46
ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. ICLR, 2019, https://arxiv.org/pdf/1812.00332.pdf
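The continuous relaxation above can be expressed compactly in code. Below is a minimal DARTS-style mixed operation in PyTorch (an assumption of this sketch; the slide's ProxylessNAS additionally binarizes the path weights): each edge holds architecture parameters α and propagates the softmax-weighted mixture of candidate ops.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge (i, j): a softmax-weighted mixture over the candidate ops O."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),              # CNN3x3
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),  # DilatedCNN3x3
            nn.Identity(),                                            # Identity
        ])
        # Architecture parameters alpha^(i,j): one logit per candidate op.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        # Continuous relaxation: propagate the weighted sum of all ops' outputs.
        return sum(w * op(x) for w, op in zip(weights, self.ops))

if __name__ == "__main__":
    edge = MixedOp(channels=8)
    out = edge(torch.randn(2, 8, 16, 16))
    # In bilevel training, self.alpha would be updated on the validation loss
    # while the op weights are updated on the training loss.
    print(out.shape, F.softmax(edge.alpha, dim=0))
```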
Rethinking the Search Space of NAS
q Explore less constrained search spaces
[Xie et al. 19]
• Consider stochastic network generator, e.g., ER, BA,
and WS.
• All yield >73% mean accuracy on ImageNet with a
low variance!
• Presented graph damage ablation.
Figure: randomly remove one node/edge.
Figure: two steps of refinement with the error distribution constantly improved.
47
Rethinking the Search Space of NAS
q From the view of graph structure [You et al. 20a]
• From DAG to relational graph.
• Sweet spots are consistent across different datasets and
architectures.
48
Graph Structure of Neural Networks. ICML, 2020, http://proceedings.mlr.press/v119/you20b/you20b.pdf
Size Search Space
q Model scaling
• Keep the architecture but adjust the size:
• Depth 𝐿
• Width 𝐶
• And resolution 𝐻, 𝑊
• Maximize the performance w.r.t. the size.
50
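As a worked example of joint size scaling, the sketch below follows the compound-scaling rule popularized by EfficientNet [Tan et al. 2019]: one coefficient φ scales depth, width, and resolution together; the base sizes and coefficients here are illustrative defaults, not values from this tutorial.

```python
import math

def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15,
                   base_depth=18, base_width=32, base_resolution=224):
    """Scale depth L, width C, and resolution (H, W) jointly with one coefficient phi.

    Under the compound-scaling rule, depth *= alpha**phi, width *= beta**phi, and
    resolution *= gamma**phi, with alpha * beta**2 * gamma**2 roughly 2, so that
    each increment of phi approximately doubles the FLOPs.
    """
    depth = math.ceil(base_depth * alpha ** phi)
    width = math.ceil(base_width * beta ** phi)
    resolution = math.ceil(base_resolution * gamma ** phi)
    return depth, width, resolution

if __name__ == "__main__":
    for phi in range(4):
        print(phi, compound_scale(phi))
```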
From CNN/RNN to GNN
q Uniqueness in search space
• More dimensions of choices:
• Micro: mainly aggregation and combine functions.
• Macro: how node embeddings in each layer produce the final one.
• Nodes are not independent, so how about searching in a node-wise manner?
Figure: General message passing.
51
Figure: Comparing the correlations.
Beyond Accuracy: Efficiency and Robustness
q Making latency differentiable [Cai et al. 19]
52
Figure: Performance of 1k sampled architectures.
Figure: Analysis of the top 300 robust v.s. non-robust architectures.
Beyond Accuracy: Compressed Model Search
53
AdaBERT: Task-Adaptive BERT Compression with D-NAS. IJCAI, 2020, https://arxiv.org/abs/2001.04246
Beyond Accuracy: Compressed Model Search
54
AdaBERT: Task-Adaptive BERT Compression with D-NAS, IJCAI, 2020, https://arxiv.org/abs/2001.04246
NAS Benchmarks
q NAS-Bench-101 [Ying et al. 19]
• Provides a lookup table for the 423k architectures.
• Including their train/valid/test accuracies, number of parameters, and training time.
q NATS-Bench [Dong et al. 21]
• Search space considers both size and topology factors.
55
Figure: The search space of NATS-Bench.
Takeaways
q Search space
• Layer by layer → repeated normal & reduction cells
• Pre-defined restricted design space → search for the design space
• Pre-defined size → also search for the optimal size
q Search strategy
• Trial-and-error, e.g., RL and ES
• One-shot NAS
• Differentiable (+ sampling ops)
57
Future Directions
q Reduce the variance of one-shot NAS
• The interference between child models is a main factor [Zhang et al. 2020].
• E.g., sharing unless some condition(s) are satisfied.
Figure: Validation performance of each child model during the last 120 steps.
58
q Beyond NAS: From static to dynamic neural architecture.
• Fine-grained tuning.
• Mainly focusing on CNNs and efficiency issues now.
59
References of NAS
• [Zoph et al. 2017] Neural Architecture Search with Reinforcement Learning. ICLR. 2017.
• [Zoph et al. 2018] Learning Transferable Architectures for Scalable Image Recognition. CVPR. 2018.
• [Bender et al. 2018] Understanding and Simplifying One-Shot Architecture Search. ICML. 2018.
• [Zhang et al. 2020] Deeper Insights into Weight Sharing in Neural Architecture Search. arXiv. 2020.
• [Li et al. 2021] Geometry-aware Gradient Algorithms for Neural Architecture Search. ICLR. 2021.
• [Xie et al. 2019] Exploring Randomly Wired Neural Networks for Image Recognition. ICCV. 2019.
• [Radosavovic et al. 2020] Designing Network Design Spaces. CVPR. 2020.
• [You et al. 2020a] Graph Structure of Neural Networks. ICML. 2020a.
• [Zhou et al. 2019] Auto-GNN: Neural Architecture Search of Graph Neural Networks. Arxiv. 2019.
• [You et al. 2020b] Design Space for Graph Neural Networks. NeurIPS. 2020b.
• [Cai et al. 2019] ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. ICLR. 2019.
• [Guo et al. 2020] When NAS Meets Robustness: In Search of Robust Architectures against Adversarial Attacks.
CVPR. 2020.
• [Ying et al. 2019] NAS-Bench-101: Towards Reproducible Neural Architecture Search. ICML. 2019.
• [Dong et al. 2021] NATS-Bench: Benchmarking NAS Algorithms for Architecture Topology and Size. TPAMI.
2021.
• [Tan et al. 2019] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML. 2019.
• [Wang and Cheng et al. 2021] Rethinking Architecture Selection in Differentiable NAS. ICLR. 2021.
60
Meta-Learning
61
Meta-learning
q What is meta-learning?
• Training on a meta-dataset consisting of many datasets, where each is a different task.
• Extract prior knowledge from it that accelerates the learning of new tasks.
AutoML: Hyperparameter Optimization, NAS, Automatic Feature Generation
Figure: The distribution of the scales of datasets (source: https://cs330.stanford.edu/).
63
When Meta-learning Meets AutoML
q AutoML as a service
• Assume different tasks share some common principles.
• Can we exploit the cumulated experience?
AutoML: Hyperparameter Optimization, NAS, Automatic Feature Generation — Meta-Learning 😄
64
Meta-learning Basics
q Exploit the meta-dataset
• Conventional ML:
• Meta-learning:
q Replace the meta-dataset by meta-parameters
• Sufficient to represent the meta-dataset.
65
Optimization-based Meta-learning
q Adaptation problem
• Acquire φ_i via optimization: φ_i = argmax_φ [log p(D_i^tr | φ) + log p(φ | θ)].
• θ serves as a prior.
q Which form of prior to take?
• Initialization and fine-tuning!
Figure: Illustrating the idea of optimization-based meta-learning (source: https://arxiv.org/pdf/1703.03400.pdf).
The meta-gradient involves g_i = ∂L(θ_i, D^tr)/∂θ_i, ḡ_i = ∂L(θ_i, D^tr)/∂θ, and H̄_i = ∂ḡ_i/∂θ (the Hessian w.r.t. θ).
66
Optimization-based Meta-learning
q Probabilistic interpretation
• Maximize a posterior (MAP) with 𝜃 as the prior.
q MAML [Finn et al. 17] approximates hierarchical
Bayesian inference!
• Gradient descent with early stop = MAP inference under Gaussian prior
with mean at initial parameters.
• Other forms, e.g.,
😄 Model-agnostic
😄 Maximally expressive with sufficiently deep neural networks
☹ Typically requires second-order computation / memory intensive
67
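To make the adaptation-as-optimization view concrete, here is a tiny first-order MAML sketch (deliberately dropping the second-order term noted above) on hypothetical 1-D linear-regression tasks; it is an illustrative toy, not the MAML reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Each task is a 1-D linear regression y = a * x with its own slope a."""
    a = rng.uniform(-2.0, 2.0)
    x_tr, x_val = rng.normal(size=10), rng.normal(size=10)
    return (x_tr, a * x_tr), (x_val, a * x_val)

def grad_mse(w, x, y):
    # d/dw mean((w*x - y)^2)
    return np.mean(2 * (w * x - y) * x)

# First-order MAML: theta is the meta-learned initialization (the prior).
theta, inner_lr, outer_lr = 0.0, 0.1, 0.01
for _ in range(2000):
    meta_grad = 0.0
    for _ in range(8):                                          # a meta-batch of tasks
        (x_tr, y_tr), (x_val, y_val) = sample_task()
        phi = theta - inner_lr * grad_mse(theta, x_tr, y_tr)    # adapt: phi_i = theta - lr * g_i
        meta_grad += grad_mse(phi, x_val, y_val)                # first-order: ignore the Hessian term
    theta -= outer_lr * meta_grad / 8                           # update the initialization
print("meta-learned initialization:", theta)
```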
Model-based Meta-learning
q Adaptation problem
• From solving an optimization problem to black-box adaptation: φ_i = f_θ(D_i^tr) = argmax_φ log p(φ | D_i^tr, θ)
• Train a neural network to represent p(φ_i | D_i^tr, θ)
• E.g., RNN, Neural Turing Machine, memory-augmented NN [Santoro et al. 16], etc.
😄 Expressive
☹ Often sample-inefficient
68
Figure: Memory-augmented neural networks (source: https://proceedings.mlr.press/v48/santoro16.pdf).
Metric-based Meta-learning
q Use a non-parametric learner
Figure: The idea of metric-based meta-learning (source: https://cs330.stanford.edu/slides/cs330_nonparametric_2020.pdf).
😄 Entirely feedforward
😄 Easy to optimize
☹ Harder to generalize to varying k-ways (especially for very large k)
69
Metric-based Meta-learning
q Use Siamese neural networks
Figure: Architecture of Siamese neural networks and its application to one-shot learning.
70
Siamese Neural Networks for One-shot Image Recognition. ICML, 2015, https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf
Metric-based Meta-learning
q Match the train&test phases by Matching networks
• Fix the mismatch between meta-training and meta-test.
• Map a (support) set S = {(x_i, y_i)} to a classifier: ŷ = Σ_i a(x̂, x_i) y_i
71
Matching Networks for One Shot Learning. NeurIPS, 2016, https://arxiv.org/abs/2001.00745
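The mapping from a support set to a classifier can be written as an attention-weighted sum over support labels. The sketch below shows that non-parametric prediction rule with cosine-similarity attention over raw vectors; a real matching network would learn the embedding functions.

```python
import numpy as np

def matching_net_predict(support_x, support_y, query_x, n_classes):
    """Classify queries as an attention-weighted sum over the support labels.

    a(query, x_i) is a softmax over cosine similarities; the prediction is
    sum_i a(query, x_i) * one_hot(y_i). The "embeddings" here are the raw vectors.
    """
    def normalize(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-9)

    sims = normalize(query_x) @ normalize(support_x).T          # (n_query, n_support)
    attn = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
    one_hot = np.eye(n_classes)[support_y]                      # (n_support, n_classes)
    return attn @ one_hot                                       # class probabilities

if __name__ == "__main__":
    support_x = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]])
    support_y = np.array([0, 1, 1])
    query_x = np.array([[0.8, 0.2]])
    print(matching_net_predict(support_x, support_y, query_x, n_classes=2))
```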
GPT-3: meta-learning as pre-training
72
Generalization v.s. Customization
73
Generalization v.s. Customization
Cumulated Experience → Global Prior → Customized Initialization → Task-specific Model
74
Relational Meta-Learning
Most meta-learning methods don't capture the relations among tasks/learners.
Figure: a meta-learner connected to Learner-1, Learner-2, Learner-3, and Learner-4.
75
Automated Relational Meta-learning, ICLR, 2020, https://arxiv.org/abs/2001.00745
Summary and Future Directions
q How to utilize existing experience --- meta-learning
• Learn a meta-parameter, so that we can quickly transfer to new tasks.
• Optimization-based, model-based, metric-based
76
References of Meta-learning
[Santoro et al. 2016] Meta-Learning with Memory-Augmented Neural Networks. ICML. 2016.
[Finn et al. 2017] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML. 2017.
[Nichol et al. 2018] On First-Order Meta-Learning Algorithm. arXiv. 2018.
[Koch et al. 2015] Siamese Neural Networks for One-shot Image Recognition. ICML. 2015.
[Vinyals et al. 2016] Matching Networks for One Shot Learning. NeurIPS. 2016.
77
Auto Feature Generation
78
Automatic Feature Generation
• In practice, many data scientists search for useful interactive features in a trial-and-error manner, which has occupied a lot of their workload.
• Therefore, automatic feature generation (AutoFeature), as one major topic of automated machine learning (AutoML), has received a lot of attention from both academia and industry.
Data → AutoFeature Model → Useful Interactive Features → Downstream Applications
79
Automatic Feature Generation
80
Automatic Feature Generation
q The related works on automatic feature generation can be roughly divided
into two categories:
• DNN-based methods
• Search-based methods
81
AutoInt
q Map the original features into a low-dimensional feature space and model the high-order feature interactions via self-attention.
Figure: Overview of the proposed model AutoInt (input layer of sparse features → embedding layer → interacting layer → output layer for the estimated CTR).
82
AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. CIKM, 2019.
AutoInt
q The interacting layer uses multi-head key-value self-attention to determine which features should be combined to form meaningful high-order combinatorial features. For feature m under attention head h:
α^(h)_{m,k} = exp(ψ^(h)(e_m, e_k)) / Σ_{l=1..M} exp(ψ^(h)(e_m, e_l)),  with ψ^(h)(e_m, e_k) = ⟨W^(h)_Query e_m, W^(h)_Key e_k⟩,
and the updated representation ẽ^(h)_m = Σ_{k=1..M} α^(h)_{m,k} (W^(h)_Value e_k) is combined with the original embedding through a residual connection.
Figure: The architecture of the interacting layer (multi-head self-attention over the feature-field embeddings).
83
AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. CIKM, 2019.
AutoInt
q Experimental results on four real-world datasets show the advantages of AutoInt:
• Performance comparison in offline AUC evaluation for click-through rate (CTR) prediction: AutoInt+ achieves the best AUC and Logloss on Criteo and Avazu against Wide&Deep (LR), DeepFM (FM), Deep&Cross (CN), and xDeepFM (CIN) (** 0.01 and * 0.05 level, unpaired t-test).
• Efficiency comparison
• Explainable recommendations
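A minimal sketch of the interacting-layer idea summarized above: single-head key-value self-attention over feature-field embeddings with a residual connection (AutoInt itself uses multiple heads and learned embeddings; the random weights here are only for illustration).

```python
import numpy as np

def interacting_layer(E, d_out, rng):
    """One AutoInt-style self-attention ("interacting") layer over field embeddings.

    E: (M, d) matrix of M feature-field embeddings. Each field attends to all fields
    with softmax(<W_Q e_m, W_K e_k>) weights and aggregates W_V e_k, which models
    combinatorial features; a residual connection keeps the original information.
    """
    M, d = E.shape
    W_Q, W_K, W_V = (rng.normal(scale=d ** -0.5, size=(d, d_out)) for _ in range(3))
    W_res = rng.normal(scale=d ** -0.5, size=(d, d_out))
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    scores = Q @ K.T                                   # psi(e_m, e_k) as inner products
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return np.maximum(attn @ V + E @ W_res, 0.0)       # ReLU(attention output + residual)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fields = rng.normal(size=(4, 8))                   # 4 feature fields, embedding size 8
    print(interacting_layer(fields, d_out=8, rng=rng).shape)
```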
Fi-GNN
q Fi-GNN represents the multi-field features in a graph structure and captures the feature interactions through node representation learning in the graph.
• Feature interaction via a graph view: nodes represent features and edges denote their interactions
• Model feature interactions via Graph Neural Networks (GNN)
• Attentional scoring for predictions
Figure: Overview of the proposed Fi-GNN (field-aware embedding layer → multi-head self-attention layer → feature graph with stacked graph neural network layers → attentional scoring layer).
85
Fi-GNN: Modeling Feature Interactions via Graph Neural Networks for CTR Prediction. CIKM, 2019.
converted to �eld embed-
Fi-GNN
q Feature interaction in Fi-GNN: The nodes interact with neighbors and update their states
in a recurrent fashion.
Feature Graph:
The edge weights
3.4 Multi-head reflect theLayer
Self-attention importance of
interacPons
Transformer between
[29] is prevalent theand
in NLP connected
has achievednodes
great suc-
cess in (features),
many tasks. Atwhich areoflearned
the core via an
Transformer, theaYenPon
multi-head
self-attention mechanism is able to model complicated dependen-
mechanism.
cies between word pairs in multiple semantic subspaces. In the
literature of CTR prediction, we take advantage of the multi-head
Node Aggrega<on:
self-attention mechanism to capture the complex dependencies
between
Thefeature �eld pairs, i.e,the
node aggregates pairwise feature interactions,
transformed informaPon in
di�erent semantic subspaces.
from neighbors and update its state according to
Following [26], given the feature embeddings E, we obtain the
therepresentation
feature aggregatedofinformaPon and the
features that cover history viainterac-
pairwise GRU
tionsand
of anresidual
attention connecPon.
head i via scaled dot-product: Feature interaction in Fi-GNN.
Figure 2: Framework of Fi-GNN. The nodes interact with
QKT neighbors and update their states in a recurrent fashion. At
Hi = softmaxi ( p )V, 86
Fi-GNN: Modeling
dK Feature Interac]ons via Grapheach interaction
Neural Networks forstep, each node
CTR Predic]on. will
CIKM, �rst aggregate trans-
2019.
Fi-GNN
q Taking advantage of the strong representative power of graphs, Fi-GNN captures high-order
feature interaction in an efficient way.
q Fi-GNN also provides good model explanations for CTR prediction.
Table 2: Performance comparison of different methods. The best performance on each dataset and metric is highlighted.
Criteo Avazu
Model Type Model
AUC RI-AUC Logloss RI-Logloss AUC RI-AUC Logloss RI-Logloss
First-order LR 0.7820 3.00% 0.4695 5.43% 0.7560 2.60% 0.3964 3.63%
FM [23] 0.7836 2.80% 0.4700 5.55% 0.7706 0.72% 0.3856 0.76%
Second-order
AFM[34] 0.7938 1.54% 0.4584 2.94% 0.7718 0.57% 0.3854 0.81%
DeepCrossing [25] 0.8009 0.66% 0.4513 1.35% 0.7643 1.53% 0.3889 1.67%
NFM [8] 0.7957 1.57% 0.4562 2.45% 0.7708 0.70% 0.3864 1.02%
High-order CrossNet [31] 0.7907 1.92% 0.4591 3.10% 0.7667 1.22% 0.3868 1.12%
CIN [15] 0.8009 0.63% 0.4517 1.44% 0.7758 0.05% 0.3829 0.10%
Fi-GNN (ours) 0.8062 0.00% 0.4453 0.00% 0.7762 0.00% 0.3825 0.00%
Performance comparison.
Figure 5: Heat map of attentional edge weights at the global level on Avazu, which reflects the importance of the relations between feature fields.
87
Fi-GNN: Modeling Feature Interactions via Graph Neural Networks for CTR Prediction. CIKM, 2019.
Automatic Feature Generation
q The related works on automatic feature generation can be roughly divided
into two categories:
• DNN-based methods
• Search-based methods
88
AutoCross
q AutoCross searches useful feature interactions in the high-order interactive feature space by incrementally constructing a locally optimal feature set
• Multi-granularity discretization: an original numerical feature is discretized into several features with decreasing granularity.
• Greedy & beam search: a tree-structured search space with the original features as the root; each child adds one pair-wise crossing to its parent, and only the most promising child is expanded during the search.
• Field-wise logistic regression and successive mini-batch gradient descent: degraded but cheap feature-set evaluation, which is acceptable as long as it can recognize the best candidate with high probability.
Useful feature sets are iteratively constructed in a loop of two steps: 1) feature set generation, where candidate feature sets with new cross features are generated; and 2) feature set evaluation, where candidate feature sets are evaluated and the best is selected as the new solution. This iterative procedure terminates once some conditions are met.
Figure 3: An illustration of the search space and the beam search strategy employed in AutoCross. In beam search, only the best node at each level is expanded; two colors indicate the two features used to construct the new cross feature.
90
AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. KDD, 2019.
AutoCross
q AutoCross searches useful feature interactions in the high-order interactive feature space by incrementally constructing a locally optimal feature set
• Multi-granularity discretization
• Greedy & beam search
• Field-wise logistic regression
• Successive mini-batch gradient descent
91
AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. KDD, 2019.
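A minimal sketch of the greedy/beam search over the tree of cross features described above; the `evaluate` callback (e.g., field-wise LR with AUC) and the string-based cross naming are hypothetical simplifications, not AutoCross's actual implementation.

```python
import itertools

def beam_search_crosses(fields, evaluate, max_levels=3, beam_width=1):
    """Beam search over the tree of cross features: the root is the original feature
    set, each child adds one pairwise crossing of existing features, and only the
    best `beam_width` children are expanded at each level."""
    beam = [tuple(fields)]                                  # start from the original features
    best = max(beam, key=evaluate)
    for _ in range(max_levels - 1):
        children = []
        for feature_set in beam:
            for a, b in itertools.combinations(feature_set, 2):
                cross = a + "x" + b                         # name of the new cross feature
                children.append(feature_set + (cross,))
        if not children:
            break
        children.sort(key=evaluate, reverse=True)
        beam = children[:beam_width]                        # expand only the most promising sets
        best = max([best] + beam, key=evaluate)
    return best

if __name__ == "__main__":
    # Toy evaluator: pretend crosses involving both A and B help a bit.
    score = lambda s: sum(len(f.split("x")) for f in s if "A" in f and "B" in f)
    print(beam_search_crosses(["A", "B", "C"], score, max_levels=3))
```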
Figure 7: Validation AUC curves in real-business datasets.
AutoCross
q The advantages of AutoCross:
• Explicit high-order feature generation
• Fast inference
• Interpretability
Figure 6: The number of second/high-order cross features generated for each dataset.
Table 5: Test AUC improvement, second- vs. high-order features, on benchmark datasets: averaged over Bank, Adult, Credit, Employee, and Criteo, AC+LR improves over LR by 2.141%, compared with 0.678% for CMI+LR.
Table 7: Inference latency comparison (unit: millisecond) on benchmark and real-world business datasets: AC+LR is orders of magnitude faster than AC+W&D, Deep, and xDeepFM, which makes it especially suitable for real-time inference.
92
AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. KDD, 2019.
AutoFIS
q AutoFIS automatically identifies important feature interactions for Factorization Models (FM).
Overview of AutoFIS.
93
AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction. KDD, 2020.
Table: results of AutoFM and AutoDeepFM on Avazu and Criteo; AutoFM compares with FM and AutoDeepFM compares with all baselines.
FIVES
q To possess both feature interpretability and search efficiency, the proposed method FIVES formulates the task of interactive feature generation as searching for edges on the defined feature graph.
1. Search Strategy
Proposition 1. Let X1, X2 and Y be Bernoulli random variables with joint conditional probability mass function p_{x1,x2|y} := P(X1 = x1; X2 = x2 | Y = y). Suppose the mutual information satisfies I(Xi; Y) < L for i ∈ {1, 2}, and X1 and X2 are weakly correlated given y, i.e., Cov(X1, X2 | Y = y) / (σ_{X1|Y=y} σ_{X2|Y=y}) ≤ ρ; then I(X1 X2; Y) < 2L + log(2ρ² + 1).   (1)
• This proposition states that informative interactive features are unlikely to come from uninformative lower-order ones. (The random variable X stands for a feature, Y for the label, and the joint of Xs for their interaction; the proof is deferred to the appendix.)
• The theory motivates the bottom-up search strategy in FIVES: searching for a group of informative k-order features from the interactions between the original features and the group of (k−1)-order features identified in the previous step.
Theoretical support for the search strategy.
95
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. KDD, 2021.
FIVES
q To possess both feature interpretability and search efficiency, the proposed method FIVES formulates
the task of interactive feature generation as searching for edges on the defined feature graph.
2. Feature Graph
• To instantiate the proposed search strategy, the original features are conceptually
regarded as a feature graph and their interactions are modeled by a designed GNN.
• Each node n_i corresponds to a feature f_i. Each edge e_{i,j} indicates an interaction between n_i and n_j.
Figure: feature graphs of increasing order over nodes n1–n4.
2. Feature Graph
• The feature graph consists of K subgraphs to represent high-order interactive features. Each subgraph indicates a layer-wise interaction between features, represented by an adjacency matrix A^(k) ∈ {0,1}^{m×m}. The graph convolutional operator for aggregation is defined as:
n_i^(k) = p_i^(k) ⊙ n_i^(k−1),  where p_i^(k) = MEAN_{j | A^(k)_{i,j} = 1} { W_j n_j^(0) }.   (1)
97
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. KDD, 2021.
FIVES
q To possess both feature interpretability and search efficiency, the proposed method FIVES formulates
the task of interactive feature generation as searching for edges on the defined feature graph.
• To make the optimization more efficient, A is regarded as Bernoulli random variables parameterized by H ∈ [0,1]^{K×m×m}, and a soft A^(k) is allowed to be used for propagation at the k-th layer.
98
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. KDD, 2021.
Algorithm (excerpt): 4: Propagate the graph signal for K times according to Eq. (2); 5: Update Θ by descending α1 ∇_Θ L(D_train | A, Θ); 6: Update H by descending α2 ∇_H L(D_val | A, Θ); 7: end for
FIVES
q To possess both feature interpretability and search efficiency, the proposed method FIVES formulates the task of interactive feature generation as searching for edges on the defined feature graph.
4. Interactive Feature Derivation
• The learned adjacency tensor can explicitly indicate which interactive features are useful, which means that FIVES can also serve as a feature generator.
• One can inductively derive useful high-order interactive features by specifying layer-wise thresholds for binarizing the learned A.
• FIVES serves as a feature generator for lightweight models to meet the requirement of inference speed.
Figure 2: An example of interactive feature derivation (e.g., f1 ⊗ f2, f1 ⊗ f2 ⊗ f4).
99
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. KDD, 2021.
FIVES
q Extensive experiments on five public datasets and two business datasets confirm that FIVES can
generate useful interactive features.
• FIVES as a predictive model for downstream tasks, such as CTR prediction
• FIVES as the feature generator for lightweight models to meet the requirement of inference
speed
Correlation between the entries of A and the AUC of the corresponding indicated feature.
100
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. KDD, 2021.
Takeaways
ü Feature Interpretability
ü Search Efficiency
AutoFeature Model → Useful Interactive Features
102
References of AutoFeature
[1] AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In CIKM 2019.
[2] Fi-GNN: Modeling Feature Interactions via Graph Neural Networks for CTR Prediction. In CIKM 2019.
[3] AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. In KDD 2019.
[4] AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate
Prediction. In KDD 2020.
[5] FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. In KDD 2021.
103
VolcanoML: End-to-End AutoML via
Scalable Search Space Decomposition
104
Two Complications of AutoML going E2E
Personal perspectives, from our past experiences
AutoML: α* = argmax_α f(D′, θ*_α), s.t. θ*_α = argmax_θ P(D | θ) P(θ | α)
• VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition. VLDB 2021.
• AutoML from Service Provider's Perspective: Multi-device, Multi-tenant Model Selection with GP-EI. AISTATS 2019.
• Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads. VLDB 2018.
Two Complications
1. α is not a homogeneous space; it is rather heterogeneous (auto-sklearn): α ∈ Feature × HP × Model
2. From single-tenant to multi-tenant scenarios
105
Disclaimer
This segment of the tutorial is more opinionated and closer to our own experience than the previous segments.
It is less about how much we know about these two problems, and more about discussing some observations and preliminary explorations to show you what we don't know --- a "cry for help".
106
Heterogeneous Search Space
• 𝑭𝒆𝒂𝒕𝒖𝒓𝒆 × 𝑯𝑷 × 𝑴𝒐𝒅𝒆𝒍
• A strong baseline: Treat the heterogeneous space as a single joint space.
• Model it with a single Bayesian optimization problem, a single genetic algorithm, or a single hyperband problem
• Good? Very powerful approach, yet simple.
• Could be improved?
• "The curse of dimensionality": often it is not easy to scale up when the dimensionality of the space is high.
• Heterogeneity in algorithm: Different subspaces might benefit from different algorithms.
• Can we do better?
107
Heterogeneous Search Space
• Different ways to conduct search. Let’s take for example the
space 𝜶 ∈ 𝑿 ×𝒀
• Strategy 1. Joint
• Treating the space 𝑿 ×𝒀 as a single search space
• (If you are doing BO) Create a surrogate model M to approximate 𝑓(𝛼)
• Use M to select the next candidate α̂
• Evaluate f(α̂) and update the surrogate model M
• One can implement such a strategy using methods beyond BO.
108
Heterogeneous Search Space
• Different ways to conduct search. Let’s take for example the space
𝜶 ∈ 𝑿 ×𝒀
• Strategy 2. Conditioning
• Idea: decompose 𝑿 ×𝒀 into multiple subspaces, e.g., one for each value of 𝑿
• min_{x,y} f(x, y) ⇒ min_{x∈X} min_y g_x(y)
• Then treat each x ∈ X as a subproblem min_y g_x(y)
• Can be modeled as a multi-armed bandit problem – each arm corresponds to a possible value of x ∈ X, and playing an arm means optimizing min_y g_x(y) for one step
109
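A minimal sketch of the conditioning strategy as a multi-armed bandit, assuming a hypothetical `optimize_one_step(x)` helper that advances the sub-search for one subspace and reports its best loss; the UCB-style scoring is one illustrative way to balance the arms.

```python
import math
import random

def bandit_conditioning_search(subspaces, optimize_one_step, n_rounds=50):
    """Treat each value x in X as an arm; playing an arm runs one step of min_y g_x(y)."""
    counts = {x: 0 for x in subspaces}
    best_loss = {x: float("inf") for x in subspaces}
    for t in range(1, n_rounds + 1):
        def ucb_score(x):
            if counts[x] == 0:
                return float("-inf")                      # force trying every arm once
            bonus = math.sqrt(2 * math.log(t) / counts[x])
            return best_loss[x] - bonus                   # lower is better, so subtract the bonus
        x = min(subspaces, key=ucb_score)                 # pick the most promising subspace
        best_loss[x] = min(best_loss[x], optimize_one_step(x))
        counts[x] += 1
    return min(best_loss, key=best_loss.get)

if __name__ == "__main__":
    # Toy: X = {"rf", "svm", "mlp"}, and each step samples a noisy loss per model family.
    landscape = {"rf": 0.30, "svm": 0.25, "mlp": 0.20}
    step = lambda x: landscape[x] + abs(random.gauss(0, 0.05))
    print(bandit_conditioning_search(list(landscape), step))
```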
Heterogeneous Search Space
• Different ways to conduct search. Let’s take for example the space
𝜶 ∈ 𝑿 ×𝒀
• Strategy 3. Alternating
• Idea: decompose X × Y into two subspaces, X and Y
• Solve two problems alternately:
• min_x g_ȳ(x), where ȳ is the current best value for subspace Y
• min_y g_x̄(y), where x̄ is the current best value for subspace X
• Each subproblem can be solved either jointly or via some conditioning strategy
• At each iteration, pick the subproblem with the largest expected improvement
• For example, think about X as Feature and Y as HP – alternating the process of searching for features and searching for HPs
110
Heterogeneous Search Space
• Different ways to conduct search
• Strategy 1. Joint
• Pros: Simple, works well when dimensionality is low
• Cons: Might suffer when the dimensionality is high
• Strategy 2. Conditioning
• Pros: Effective when some dimension is categorical variable with small cardinality
• Cons: Might not be applicable to other scenarios.
• Strategy 3. Alternating
• Pros: Very effective in reducing dimensions
• Cons: Assuming conditional independence of two subspaces
111
Heterogeneous Search Space
• A single search space can be decomposed in different ways.
112
Heterogeneous Search Space
• Moving Forward
• Build up a suite of different building blocks – what is the
unified framework to talk about different search algorithms?
• How to automatically construct search space decomposition?
• How to automatically conduct building block selection?
AutoML for AutoML?
113
AutoML: From Single-tenant to Multi-tenant
Existing Models
Single-tenant Scenario: One target dataset
M1 M2 M3 M4 M5 M6 M7
Existing Datasets
D1 0.9 0.2 0.2 0.6 0.5 0.6 0.2
D2 0.6 0.7 0.2 0.4 0.6 0.1 0.7
D3 0.9 0.2 0.2 0.6 0.5 0.6 0.2
D4 0.9 0.2 0.2 0.6 0.5 0.6 0.2
D6 ? ? 0.5 ? ? ? ?
What if multiple users run their own AutoML workloads over a shared infrastructure?
An interesting problem, especially when AutoML as a service becomes more and more popular.
Pool of Resources
114
AutoML: From Single-tenant to Multi-tenant
Existing Models
How to balance resource allocations to different users?
M1 M2 M3 M4 M5 M6 M7
Existing Datasets
D6 ? ? 0.5 ? ? ? ?
… ? ? ? ? ? ? ?
Dn ? ? ? ? ? ? ?
116
AutoML: From Single-tenant to Multi-tenant
Figure: Quality vs. # Trials curves for two users; which user should we serve next?
117
AutoML: From Single-tenant to Multi-tenant
User 1: [0.99] [0.99]
119
AutoML: From Single-tenant to Multi-tenant
Machine Learning Models
Existing Models New Models
Serve the user with a factor that is very similar to expected improvement (directly comparing each user's UCB does not work, for obvious reasons).
Datasets (Users)
D2 0.6 0.7 0.2 0.4 0.6 0.1 0.7 ? ? ?
D3 0.9 0.2 0.2 0.6 0.5 0.6 0.2 ? 0.2 ?
D4 0.9 0.2 0.2 0.6 0.5 0.6 0.2 ? ? ?
D5 0.6 0.7 0.2 0.4 0.6 0.1 0.7 ? ? ?
D6 ? ? 0.5 ? ? ? ? ? ? ?
New Datasets
… ? ? ? ? ? ? ? 0.9 ? ?
Dn ? ? ? ? 0.7 ? ? ? ? ?
Computation Resource
120
AutoML: From Single-tenant to Multi-tenant
Figure: In the multi-tenant setting, the modeling error dominates.
121
AutoML: From Single-tenant to Multi-tenant
Machine Learning Models
Existing Models New Models
M1 M2 M3 M4 M5 M6 M7 M8 … Mk
Existing Datasets
123
Two Complications of AutoML going E2E
AutoML: α* = argmax_α f(D′, θ*_α), s.t. θ*_α = argmax_θ P(D | θ) P(θ | α)
A lot of challenges and exciting opportunities when bringing AutoML to an end-to-end production scenario!
Two Complications
1. α is not a homogeneous space; it is rather heterogeneous (auto-sklearn): α ∈ Feature × HP × Model
2. From single-tenant to multi-tenant scenarios
124
AutoML: A Small Personal Remark
ML today is now a Data Problem
• For many tasks, given the raw features from Kaggle, most AutoML platforms rank in the bottom 50%.
• It is the data that we need to improve, and knowledge that we need to integrate, to build better ML applications.
• To improve data, we need to first understand them.
Moving from a Model-driven development to a Data-driven development.
MLBench, VLDB (2018), http://www.vldb.org/pvldb/vol11/p1220-liu.pdf
125
ML-Guided Database
126
Where DB Meets ML
• Human involved in research/engineering/analyzing/administrating:
• Building and maintaining indexes
• Query optimization
• Physical design tuning
• Optimizing view materialization
127
Where DB Meets ML: Learning to Index
• Human involved in research/engineering/analyzing/administrating:
• Building and maintaining indexes
• Query optimization
• Physical design tuning
• Optimizing view materialization
128
B-Tree Index from Learning Perspective
B-Tree Index: Input: Key → Output: Position; position = B-tree(Key)
Learned Index: Input: Key → Output: Position; position = function(Key)
[Image source] Kraska et al., The case for learned index structures. SIGMOD, 2018 129
Why Learning Index from Data?
• Consider this (ideal) case: build an index to store and query over a
table of n rows with continuous integer keys, i.e., Keys = [11, 12, 13,
14, 15, ...] and Pos = [0, 1, 2, 3, 4, …]
• B-Tree: seeking Pos in time O(log n)
• a learned function Pos = M(Key) = Key + offset : O(1)
130
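A minimal sketch of this idea: fit a linear model position ≈ a·key + b, record its worst-case error, and correct each prediction with a bounded local search, so lookups cost O(log err) instead of O(log n). This is an illustration of the general learned-index recipe, not the implementation from any of the cited papers.

```python
import bisect

class LearnedLinearIndex:
    """A minimal learned index: fit position ~ a * key + b, then correct the
    bounded prediction error with a binary search inside the error window."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position against key (closed form for a 1-D linear model).
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        var = sum((k - mean_k) ** 2 for k in self.keys) or 1.0
        self.a = cov / var
        self.b = mean_p - self.a * mean_k
        # Record the worst-case prediction error to bound the correction search.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.a * key + self.b))

    def lookup(self, key):
        pos = self._predict(key)
        lo = max(0, pos - self.err)
        hi = min(len(self.keys), pos + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)   # search only the error window
        return i if i < len(self.keys) and self.keys[i] == key else None

if __name__ == "__main__":
    idx = LearnedLinearIndex(range(11, 1000, 7))
    print(idx.lookup(11 + 7 * 42), idx.lookup(12))
```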
Recursive-Model Index (RMI)
Figure: Recursive-Model Index — a root model dispatches to sub-models over the data to be indexed.
[Image source] Kraska et al., The case for learned index structures. SIGMOD, 2018 131
FITing-Tree
Error-Bounded Linear Segment: Given threshold 𝑒𝑟𝑟𝑜𝑟, ShrinkingCone (building a segment): Point 1 is the origin of the
a segment from (𝑥5 , 𝑦5 ) to (𝑥7 , 𝑦7 ) is not valid if (𝑥8 , 𝑦8 ) cone. Point 2 is then added, resulting in the dashed cone. Point
is further than 𝑒𝑟𝑟𝑜𝑟 from the interpolated line. 3 is added next, yielding in the dotted cone. Point 4 is outside
the dotted cone and therefore starts a new segment.
[Image source] Galakatos et al., FITing-Tree: A Data-aware Index Structure. SIGMOD, 2019 132
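A minimal sketch of the error-bounded segmentation idea behind ShrinkingCone: keep a cone of feasible slopes and start a new segment whenever a point falls outside it (assumes strictly increasing keys; a simplification of the FITing-Tree algorithm).

```python
def shrinking_cone_segments(points, error):
    """Greedy error-bounded linear segmentation (ShrinkingCone-style sketch).

    `points` is a sorted list of (key, position) pairs with distinct keys. A segment
    keeps a cone of feasible slopes from its origin; a point outside the cone starts
    a new segment, so every point stays within `error` of the interpolated line.
    """
    segments = []
    x0, y0 = points[0]
    lo_slope, hi_slope = float("-inf"), float("inf")
    for x, y in points[1:]:
        dx = x - x0
        # Slopes that keep this point within +/- error of the line through the origin.
        lo, hi = (y - error - y0) / dx, (y + error - y0) / dx
        new_lo, new_hi = max(lo_slope, lo), min(hi_slope, hi)
        if new_lo <= new_hi:
            lo_slope, hi_slope = new_lo, new_hi            # shrink the cone
        else:
            segments.append((x0, y0, (lo_slope + hi_slope) / 2))
            x0, y0 = x, y                                  # this point starts a new segment
            lo_slope, hi_slope = float("-inf"), float("inf")
    last_slope = (lo_slope + hi_slope) / 2 if lo_slope != float("-inf") else 0.0
    segments.append((x0, y0, last_slope))
    return segments                                        # (origin_key, origin_pos, slope) per segment

if __name__ == "__main__":
    pts = [(k, i) for i, k in enumerate([1, 2, 3, 10, 20, 30, 31, 32, 33])]
    print(shrinking_cone_segments(pts, error=1))
```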
RMI v.s. FITing-Tree
Figure: RMI (a root model over learned sub-models, e.g., y = ax + b) vs. FITing-Tree (a B-Tree organization over learned segments).
133
More Learned Index Methods
• PGM [1] improves FITing-Tree by finding the optimal number of learned
segments given an error bound.
• Multi-dimensional indexes: NEIST [4], Flood [5], Tsunami [6] and LISA [7].
134
More Learned Index Methods
• [1] The PGM-Index: A Fully-Dynamic Compressed Learned Index with Provable
Worst-Case Bounds. PVLDB, 2020.
• [2] ALEX: An Updatable AdapCve Learned Index. SIGMOD, 2020.
• [3] RadixSpline: A Single-Pass Learned Index. In aiDM Workshop on SIGMOD, 2020.
• [4] NEIST: a Neural-Enhanced Index for SpaCo-Temporal Queries. TKDE, 2019.
• [5] Learning MulC-dimensional Indexes. SIGMOD, 2020.
• [6] Tsunami: A Learned MulC-dimensional Index for Correlated Data and Skewed
Workloads. PVLDB, 2020.
• [7] LISA: A Learned Index Structure for SpaCal Data. SIGMOD, 2020.
135
Questions about Learned Indexes
How to systematically analyze and design
machine learning based indexing methods?
136
Task Definition
• Given a database D with n records (rows), let’s assume that a range
index structure will be built on a specific column x. For each record i ∈ [n], the value of this column, x_i, is adopted as the key, and y_i is the position where the record is stored.
137
Learning Index: A Machine Learning Task
A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf 138
Learning Index: A Machine Learning Task
objective function
A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf 139
Benefits of Learned Index
• Smaller Size
• Faster Index Seek
• Better Handling Index Update
• Generalization ability of machine learning
• Incremental learning
• Question Mark
• Is model training/inference scalable enough?
140
Learned Index with Sampling
• How large does the sample need to be?
• 𝑛 is the data size
• 𝑀∗ is fully optimized
A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf 141
Learned Index with Sampling
• Up to 78x building speedup
• Non-degraded performance in terms of query time and prediction error
A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf 142
Is Linear Model Sufficient?
• Linearization of a learned model
A learned model: ŷ = M(x)
143
Is Linear Model Sufficient?
• Linearization of a learned model
A learned model: ŷ = M(x)
Landmark points: …, (x_l, y_l), (x_r, y_r), …
144
Is Linear Model Sufficient?
• Linearization of a learned model
A learned model: ŷ = M(x)
Landmark points: …, (x_l, y_l), (x_r, y_r), …
Linearized model: ỹ = M_L(x), connecting (x_l, ŷ_l = M(x_l)) to (x_r, ŷ_r = M(x_r))
145
Is Linear Model Sufficient? Yes! As long as landmark
points are dense enough
Linearized model: ỹ = M_L(x), connecting (x_l, ŷ_l = M(x_l)) to (x_r, ŷ_r = M(x_r))
Theorem 2. Suppose ∀x, |ŷ − y| ≤ ε; then after linearization we have ∀x, |ỹ − y| ≤ 3ε + 2(y_r − y_l).
A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf 146
Sampling-Restriction-Linearization
Sampled data points as landmark points:
… , 𝒙𝒍 , 𝒚𝒍 , 𝒙𝒓 , 𝒚𝒓 , …
A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf 147
Open Questions
• How to handle extremely outlier keys?
148
AutoML Tools
149
Availability
Hyperparameter Optimization — Learning to Mutate with Hypergradient Guided Population, NeurIPS 2020.
Compressed Model Search — AdaBERT: Task-Adaptive BERT Compression with D-NAS, IJCAI 2020. https://arxiv.org/abs/2001.04246
Auto Feature Generation — FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data, KDD 2021. https://arxiv.org/abs/2007.14573
Meta-Learning — Automated Relational Meta-learning, ICLR 2020. https://arxiv.org/abs/2001.00745
AutoML
150
Availability
Hyperparameter Optimization — publicly available at Alibaba Platform of A.I., AutoML product
Compressed Model Search — publicly available at Alibaba Platform of A.I., EasyTransfer product
AutoML: Hyperparameter Optimization, Compressed Model Search, Feature Generation, Meta-Learning
151
A Summary of AutoML Tools
Name | Authors | Functionalities | Algorithms | Language
Auto Tune Models (ATM) | MIT | AutoFeature, Model Selection, HPO | BO and Bandit | Python
AutoKeras | Texas A&M University | NAS | BO | Python
NNI | Microsoft | AutoFeature, HPO, NAS, Model Selection | Comprehensive | Python
emukit | Amazon | HPO | Meta-surrogate model | Python
Ray Tune | Berkeley | HPO | Comprehensive | Python
152
Tutorial Schedule
153
Thank you!
Yaliang Li, Zhen Wang, Yuexiang Xie, Bolin Ding, and Ce Zhang
Email: [email protected]
Please feel free to contact us if you have any questions,
or you are interested in full-time or research intern positions.
154