
AutoML: A Perspective where

Industry Meets Academy


Yaliang Li, Zhen Wang, Yuexiang Xie, Bolin Ding, and Ce Zhang

Alibaba Group, ETH Zürich

1
Machine Learning Pipeline

Data/Feature Model Hyperparameter


Preprocessing Selection Tuning

So MANY choices
• Which feature transformation?
• Which model architecture?
• Which hyperparameters?

2
Machine Learning Pipeline

Data/Feature Model Hyperparameter


Preprocessing Selection Tuning

AutoML
• Auto Feature Generation
• Neural Architecture Search
• Hyperparameter Optimization
• Meta Learning

3
Automated Machine Learning
Hyperparameter Optimization · Neural Architecture Search · Auto Feature Generation · Meta-Learning · ML-Guided Database

AutoML: How to automate the process of applying machine


learning components to various real-world tasks?
4
Automated Machine Learning

Inductive bias (prior α): how we represent data, which kinds of models to consider, how to tune hyperparameters, how to transfer knowledge across tasks, etc.

5
Tutorial Schedule

Yaliang Li, Background and Overview of AutoML


Hyperparameter Optimization

Zhen Wang, Neural Architecture Search


Meta-Learning

Yuexiang Xie, Automatic Feature Generation

Ce Zhang, VolcanoML: End-to-End AutoML via


Scalable Search Space Decomposition

Bolin Ding, Machine Learning Guided Database

6
Hyperparameter
Optimization (HPO)

7
Hyperparameter Optimization
Hyperparameter
Optimization Best
Hyperparameter

Model
Training
Best
Model

8
Hyperparameter Configuration v.s. Schedule

• Hyperparameter configuration search methods find a fixed hyperparameter setting to maximize the model performance.
• Hyperparameter schedule search methods seek a dynamic hyperparameter schedule in the model training process.

Illustration: Hyperparameter Optimization → Best Hyperparameter → Model Training → Best Model

9
Hyperparameter Optimization
q Hyperparameter Configuration
• Random search, Grid search
• Successive-halving, Hyperband
• Bayesian optimization

q Hyperparameter Schedule
• Population-based training
• Hypergradient

Illustration: Hyperparameter Optimization → Best Hyperparameter → Model Training → Best Model

10
Search Methods

Image source: Bergstra & Bengio. JMLR, 2012. 11


Successive-Halving

• Uniformly allocate a budget to a set of


hyperparameter configurations
• Evaluate the performance of all configurations
• Throw out the worst half

Repeat until one configuration remains

Non-stochastic best arm identification and hyperparameter optimization. 2016.
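A minimal Python sketch of the loop described above; the search space (`sample_config`) and `evaluate` routine are placeholders a user would supply, not part of the original method description.

import random

def successive_halving(sample_config, evaluate, n=16, min_budget=1, eta=2):
    """Evaluate all configs, throw out the worst half, repeat until one remains."""
    configs = [sample_config() for _ in range(n)]
    budget = min_budget
    while len(configs) > 1:
        # Uniformly allocate the current budget to every surviving configuration.
        scores = [evaluate(c, budget) for c in configs]
        ranked = sorted(zip(scores, configs), key=lambda p: p[0], reverse=True)
        # Keep the best half and give the survivors a larger budget.
        configs = [c for _, c in ranked[: max(1, len(configs) // eta)]]
        budget *= eta
    return configs[0]

# Toy usage: tune a learning rate against a noisy objective whose noise shrinks with budget.
if __name__ == "__main__":
    sample = lambda: {"lr": 10 ** random.uniform(-4, 0)}
    def evaluate(cfg, budget):
        return -(cfg["lr"] - 0.1) ** 2 + random.gauss(0, 0.05 / budget)
    print(successive_halving(sample, evaluate))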


12
Hyperband
• Successive-Halving needs to determine the number of configurations (i.e., 𝑛)
• Outer loop
• Grid search for different 𝑛
• Inner loop
• Successive-Halving for given 𝑛 configs
• s.t. at least one config is trained for 𝑅

Figure: the Hyperband bracket table — for each bracket, the number of configurations n_i and the per-configuration budget r_i at each round of successive halving.
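A sketch of Hyperband's outer loop on top of a successive-halving inner loop, following the bracket schedule of the JMLR paper; R (maximum per-configuration budget) and eta (elimination rate) are the usual knobs, and rounding details are simplified.

import math

def hyperband(sample_config, evaluate, R=81, eta=3):
    """Outer loop: try different n (number of configs) via brackets s = s_max .. 0."""
    s_max = int(math.log(R, eta))
    best, best_score = None, float("-inf")
    for s in range(s_max, -1, -1):
        # Bracket s starts with n configs at budget r, so that at least one
        # configuration is eventually trained with the full budget R.
        n = int(math.ceil((s_max + 1) / (s + 1) * eta ** s))
        r = R * eta ** (-s)
        configs = [sample_config() for _ in range(n)]
        for i in range(s + 1):                       # inner successive halving
            n_i = int(n * eta ** (-i))
            r_i = r * eta ** i
            scores = [evaluate(c, r_i) for c in configs]
            ranked = sorted(zip(scores, configs), key=lambda p: p[0], reverse=True)
            configs = [c for _, c in ranked[: max(1, n_i // eta)]]
        score = evaluate(configs[0], R)
        if score > best_score:
            best, best_score = configs[0], score
    return best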

Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 2018. 13


Bayesian Optimization
Given some tried {hyperparameter, performance} pairs,
which hyperparameter should be the next one to try?

14
Bayesian Optimization
Given some tried {hyperparameter, performance} pairs,
which hyperparameter should be the next one to try?

Unlike grid/random search, which treats trials under an independence assumption, Bayesian optimization assumes the tried {hyperparameter, performance} pairs follow a certain distribution and uses it to pick the next trial.

Bayesian Optimization

15
Bayesian Optimization
Fit a probabilistic function f(x) to model {x=hyperparameter, f(x)=performance}

• Function f(x) isn’t required to be convex, differentiable


• Rich theoretical results: convergence, sync v.s. async, various model choices

• Exploration-exploitation trade-off
• Costly
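A compact sketch of one Bayesian-optimization loop with a Gaussian-process surrogate and the expected-improvement acquisition, using scikit-learn and SciPy; the objective `f` and the search bounds are placeholders, not part of the tutorial.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(x_cand, gp, y_best, xi=0.01):
    mu, sigma = gp.predict(x_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(f, bounds, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))   # tried hyperparameters
    y = np.array([f(x[0]) for x in X])                        # observed performance
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)                                          # fit probabilistic model f(x)
        cand = rng.uniform(bounds[0], bounds[1], size=(1000, 1))
        ei = expected_improvement(cand, gp, y.max())          # exploration vs. exploitation
        x_next = cand[np.argmax(ei)]                          # next hyperparameter to try
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    return X[np.argmax(y)], y.max()

# e.g. bayes_opt(lambda lr: -(np.log10(lr) + 1.0) ** 2, bounds=(1e-4, 1.0))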

16
Hyperparameter Optimization
q Hyperparameter Configuration
• Random search, Grid search
• Successive-halving, Hyperband
• Bayesian optimization

Illustration: Hyperparameter Optimization → Best Hyperparameter → Model Training → Best Model
q Hyperparameter Schedule
• Population-based training

• Hypergradient

17
Hyperparameter Schedule

Population-based training Hypergradient

A generalized framework for population based Self-tuning networks: Bilevel optimization of hyperparameters
training. KDD, 2019. using structured best-response functions. ICLR, 2019.
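To make the schedule idea concrete, here is a simplified population-based-training style loop (an illustration only, not the KDD 2019 framework itself): workers train for a while, then poor performers copy a top performer's weights and hyperparameters (exploit) and perturb them (explore).

import copy
import random

def pbt(workers, train_step, evaluate, rounds=20, perturb=(0.8, 1.2)):
    """workers: list of dicts like {"theta": model_state, "h": {"lr": 0.01, ...}}."""
    for _ in range(rounds):
        for w in workers:
            train_step(w["theta"], w["h"])               # partial training with current h
            w["score"] = evaluate(w["theta"])
        workers.sort(key=lambda w: w["score"], reverse=True)
        q = max(1, len(workers) // 4)
        top, bottom = workers[:q], workers[-q:]
        for w in bottom:
            parent = random.choice(top)
            w["theta"] = copy.deepcopy(parent["theta"])  # exploit: copy weights
            w["h"] = {k: v * random.uniform(*perturb)    # explore: perturb hyperparameters
                      for k, v in parent["h"].items()}
    return max(workers, key=lambda w: w["score"])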
18
Practical Challenge (1)

Illustration: the Hyperparameter Optimization → Model Training loop becomes expensive as the model size and the data size grow.
19
ABC: Sampling
Train the model on a sample of the training data (sampling ratio r) under hyperparameter configuration C, test it on the testing data to obtain performance P, and record the triple {r, C, P}.

Efficient Identification of Approximate Best Configuration of Training in Large Datasets. AAAI, 2019.
20
A New Method: 𝜀GE
Each category of hyperparameter optimization methods has its advantages and disadvantages. Can we adaptively combine them and utilize their advantages for different tasks?

Illustration of hyperparameter optimization

q Existing methods
• Search-strategy based: Successive-halving, Hyperband, etc.
• Evolutionary algorithm: Population Based Training, etc.
• Bayesian optimization

Hyperparameter Recommendation for Machine Learning Method. Patent, 2019. 21


A New Method: 𝜀GE
Loop: choose {r, C, P} → increase C′ and r′ → train and test the model to obtain P′; when terminated, output the C with the best P.

• Random strategy: randomly choose a configuration with probability 𝜀
• Greedy strategy: choose the best configuration
• Evolution strategy: choose the best configuration and perturb it with mutation and crossover
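The patent text is not public in detail, so the following is only an illustrative sketch of the adaptive choice the slide describes: with probability ε pick a random configuration, otherwise pick the best one seen so far, optionally perturbing it with mutation/crossover; all names here are hypothetical.

import random

def choose_next(history, sample_config, eps=0.2, mutate_prob=0.5, sigma=0.1):
    """history: list of (r, C, P) triples = (sampling ratio, configuration, performance)."""
    if not history or random.random() < eps:
        return sample_config()                       # random strategy
    _, best_cfg, _ = max(history, key=lambda t: t[2])
    if random.random() < mutate_prob:                # evolution strategy
        other = random.choice(history)[1]
        child = {k: random.choice([best_cfg[k], other[k]]) for k in best_cfg}   # crossover
        return {k: v * random.gauss(1.0, sigma) if isinstance(v, float) else v  # mutation
                for k, v in child.items()}
    return dict(best_cfg)                            # greedy strategy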

22
HPO: Sampling method-𝜀GE
The task-adaptive combination of different hyperparameter optimization methods leads to faster solutions!

• A soft version of Hyperband


• Evolutionary operation
• A simplified version of
Bayesian optimization
(i.e., local smoothness
assumption)

Hyperparameter Recommendation for Machine Learning Method. Patent, 2019. 23


Practical Challenge (2)

Mutation-driven global search Hypergradient-guided local search


PBT, KDD2019 STN, ICLR2019

24
Hyperparameter Schedule

A smooth optimization problem v.s. a non-smooth optimization problem

Trade-off between Evolutionary algorithm (PBT) and Hyper-gradient based method:


• Hyper-gradient based methods perform better than PBT on smooth optimization problems.
• Hyper-gradient based methods perform worse than PBT when there are many local minima (non-smooth).

How to learn a good trade-off between the global search and local search?
25
HyperMutation (HPM)

Figure: the HPM population — after random initialization, each student alternates hypertraining steps (gradient updates of the model parameters θ and hyperparameters h) with learnable mutation steps, and students are ranked into top, middle, and bottom performers.

Learning to Mutate with Hypergradient Guided Population. NeurIPS, 2020.


26
Hypertraining: a joint optimization over 𝜃 and ℎ

Figure: the HPM population — hypertraining jointly updates θ and h by gradients (∇θ, ∇h) between learnable mutation steps.

Learning to Mutate with Hypergradient Guided Population. NeurIPS, 2020.


27
Exploit by a truncation selection

Figure: the HPM population — at each mutation step, a truncation selection replaces the bottom students with copies of the top students (exploit).

Learning to Mutate with Hypergradient Guided Population. NeurIPS, 2020.


28
Explore by the learnable mutation

Figure: the HPM population — the copied students then explore via the learnable mutation of their hyperparameters h.

Learning to Mutate with Hypergradient Guided Population. NeurIPS, 2020.


29
Learning mutations with a teacher network
§ Student-teaching schema   § Teacher model with attention networks

• The teacher attends over the students' hyperparameters (dot product, softmax, weighted sum) and outputs a mutation vector g = 1 + tanh(z); the selected student's hyperparameters are mutated as h ← g ⊙ h*, and the teacher is trained with an MSE loss.

Figure: the teacher network guiding the learnable mutation within the HPM population.

30
Continue hypertraining after exploit & explore

Figure: the HPM population — after the exploit and explore (mutation) step, every student continues hypertraining with its updated (θ, h).

31
Experiments on test functions

Figure: (a)-(b) The mean performance computed by different methods along with the standard
deviation over 10 trials, in terms of different given budget of iterations. (c) The average mutation
values learned by HPM over 10 trials. In each trial, HPM runs 30 iterations in total with a
population size of 5, resulting in 6 training steps and 5 mutations.

32
Experiments on test functions

Figure: (a)-(b) The mean performance computed by different methods along with the standard
deviation over 10 trials, in terms of different given budget of iterations. (c) The average mutation
values learned by HPM over 10 trials. In each trial, HPM runs 30 iterations in total with a
population size of 5, resulting in 6 training steps and 5 mutations.

33
Experiments on benchmark datasets

34
Takeaways
q Hyperparameter Configuration
• Random search, Grid search
• Successive-halving, Hyperband
• Bayesian optimization

q Hyperparameter Schedule
• Population-based training

• Hypergradient

• HyperMutation (HPM)

35
Future Directions
Ø Faster, Green
§ HPO via Meta-Learning

Ø HPO for a specific domain


§ a group of algorithms, e.g., graph-related models

Ø Interactive, Human-in-the-loop

36
Neural Architecture
Search (NAS)

37
Neural Architecture Search
q What is neural architecture search (NAS)?
• To find the optimal topology and/or size configuration for the neural network.
• E.g., select a filter from {CNN3×3, CNN5×5, DilatedCNN5×5}.
• E.g., determine the depth and width of a neural network.

q Why NAS?
• Architecture matters a lot for the performance!
• The choices cannot be exhausted.
• Useful prior knowledge, e.g., the invariance possessed by the task, has been exploited.

Figure: Image classification on ImageNet (source: https://paperswithcode.com/sota/image-


classification-on-imagenet).

38
Elements of NAS

q Search space
• All the possible configurations.
• E.g., filter size, activation functions, depth, etc.

q Search strategy
• How to utilize experience?
• How to propose new configuration to try?
• E.g., RL, ES, and differentiable search.

q Performance estimation strategy
• How to evaluate a configuration?
• E.g., standard training and surrogate objective.

39
And the Theme of NAS
Exploitation v.s. Exploration

q Search space
• All the possible configurations, e.g., filter size, activation functions, depth, etc.
• Incorporating prior knowledge reduces the search space but makes it constrained to some extent, e.g., Inception-v2/3 → stacked cells [Zoph et al. 2018].

q Search strategy
• How to utilize experience? How to propose new configurations to try? E.g., RL, ES, and differentiable search.
• Instead of asymptotic regret, practitioners balance exploitation and exploration to achieve the best solution under a given finite horizon.

q Performance estimation strategy
• How to evaluate a configuration? E.g., standard training and surrogate objective.
• Standard training & validation is expensive but accurate; the proposed surrogate objectives are efficient but less correlated.

40
Pioneer Works of NAS
q Search space
• Consider both CNN and RNN cells.
• The configuration of each layer can be
determined respectively.
q Search strategy
• RL with the policy parameterized by a RNN.
q Performance estimation strategy
Figure: An overview of the trial-and-error process of NAS.

• Standard train&validation

Figure: How the controller (i.e., a RNN) samples a CNN with skip connection.
41
Neural Architecture Search with Reinforcement Learning. ICLR, 2017, https://arxiv.org/pdf/1611.01578.pdf
Pioneer Works of NAS
q Search space
• Consider both CNN and RNN cells.
• The configuration of each layer can be
determined respectively.
q Search strategy
• RL with the policy parameterized by a RNN.
q Performance estimation strategy
Figure: An overview of the trial-and-error process of NAS.
• Standard train&validation

q Unfold the gain of NAS 😄 and also its pain ☹
• Searched CNN and RNN cells achieve competitive performance against manually designed architectures on CIFAR-10 and PTB respectively.
• The searched architecture can be transferred to other tasks.
• Trained 12,800 models in total on 800 GPUs.

42
Neural Architecture Search with Reinforcement Learning. ICLR, 2017, https://arxiv.org/pdf/1611.01578.pdf
Weight Sharing for One-shot NAS
q Weight sharing q One-shot NAS
• Represent NAS’s search space using a single DAG. • Each architecture (i.e., subgraph) is evaluated
• An architecture can be realized by taking a subgraph. by inheriting the shared parameters.
• E.g., deducing a RNN cell as follow: • Shared parameters are trained with sampled
architecture.
• Parameters and the controller are updated
alternatively.

q Advantage and concern


• ENAS [Pham and Guan et al., 2018] uses 10h
of one GTX1080Ti, which is 1000x faster than
[Zoph et al., 2017].
• Does the performance of a stand-alone
training correlate with that of one-shot NAS
[Bender et al., 2018, Zhang et al., 2020]?
Figure: How a RNN cell (i.e., highlighted subgraph)
inherits the shared parameters.

Efficient Neural Architecture Search via Parameter Sharing. ICML, 2018, http://proceedings.mlr.press/v80/pham18a/pham18a.pdf 43
Differentiable NAS
q Continuous relaxation
• Each edge denotes a mixture of ops in O = {CNN3×3, DilatedCNN3×3, Zero, Identity, …}.
• For each edge (i, j), the weights of the ops are parameterized by architecture parameters α^(i,j).
• Suppose the tensor at node_i is x; the tensor propagated to node_j is the softmax-weighted mixture
  ō^(i,j)(x) = Σ_{o∈O} [ exp(α_o^(i,j)) / Σ_{o'∈O} exp(α_{o'}^(i,j)) ] · o(x).

q Differentiable learning
• Formulated as a bilevel optimization problem:
  min_α L_val(w*(α), α)   s.t.   w*(α) = argmin_w L_train(w, α).

• Regarded as a Stackelberg game


• Architecture parameters as leader Figure: An overview of DARTS. (a) Operations are initially unknown. (b) Continuous relaxation.
(c) architecture parameters are optimized jointly. (d) Inducing the final architecture.
• Model parameters as follower
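A minimal PyTorch sketch of the continuous relaxation: each edge keeps architecture parameters α and outputs a softmax-weighted mixture of candidate operations (the candidate set below is a simplified stand-in for DARTS's full operation set).

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One DARTS edge: o_bar(x) = sum_o softmax(alpha)_o * o(x)."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),              # CNN 3x3
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),  # dilated CNN 3x3
            nn.Identity(),                                            # identity
        ])
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))  # architecture params

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Bilevel training (simplified, first-order): alternately update alpha on the
# validation loss and the operation weights on the training loss.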
44
DARTS: Differentiable Architecture Search. ICLR, 2019, https://arxiv.org/pdf/1806.09055.pdf
Differentiable NAS
q Differentiable learning (cont'd)
• There is no way to compute ∇_α L_val(w*(α), α) exactly.
• DARTS approximates the gradient by looking one step ahead on w, like meta-learning.
• It is further simplified by treating the parameters equally [Li et al., 2021].

q Deriving the discrete architecture
• Retain the top-k strongest predecessors for each node j, where the strength of edge (i, j) is defined as max_{o∈O, o≠zero} exp(α_o^(i,j)) / Σ_{o'∈O} exp(α_{o'}^(i,j)).
• Replace each edge by the most likely op: o^(i,j) = argmax_{o∈O} α_o^(i,j).

q Improve DARTS by annealing and pruning
ASAP: Architecture Search, Anneal and Prune. AISTATS, 2020, http://proceedings.mlr.press/v108/noy20a/noy20a.pdf
45
Dealing with Scalability Issue
q Horrible memory occupation of one-shot NAS
• The supergraph cannot fit into GPU memory for large datasets.
• Usually search architecture on CIFAR-10 and transfer to ImageNet.

Figure: ProxylessNAS directly optimizes neural architecture on target task and hardware.

q Binarized architecture
• Transform real-valued path weights to binary gates.
• Only one path is active in memory at runtime. Figure: Note the straight-through estimator (STE) trick.

46
ProxylessNAS: Direct Neural Architecture Search of Target Task and Hardware. ICLR, 2019, https://arxiv.org/pdf/1812.00332.pdf
Rethinking the Search Space of NAS
q Explore less constrained search spaces
[Xie et al. 19]
• Consider stochastic network generator, e.g., ER, BA,
and WS.
• All yield >73% mean accuracy on ImageNet with a
low variance!
• Presented graph damage ablation.
Figure: randomly remove one node/edge.

q Design search space [Radosavovic et al. 20]


• Evaluate a search space by its error distribution.
• Input a search space and output a refined one.

Figure: two steps of refinement with the error distribution constantly improved.
47
Rethinking the Search Space of NAS
q From the view of graph structure [You et al. 20a]
• From DAG to relational graph.
• Sweet spots are consistent across different datasets and
architectures.

Figure: proposed WS-flex provides a larger search space.

48
Graph Structure of Neural Networks. ICML, 2020, http://proceedings.mlr.press/v119/you20b/you20b.pdf
Rethinking the Search Space of NAS
q From the view of graph structure [You et al.
20a]
• From DAG to relational graph.
• Sweet spots are consistent across different datasets and
architectures.

Figure: Key results.

49
Graph Structure of Neural Networks. ICML, 2020, http://proceedings.mlr.press/v119/you20b/you20b.pdf
Size Search Space
q Model scaling
• Keep the architecture but adjust the size:
• Depth 𝐿
• Width 𝐶
• And resolution 𝐻, 𝑊
• Maximize the performance w.r.t. the size.

q EfficientNet [Tan et al. 19]

Figure: Model size v.s. ImageNet accuracy.

• Compound scaling method: depth d = α^φ, width w = β^φ, resolution r = γ^φ, where α·β²·γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1.
• Step 1: Fix φ = 1 and do a small grid search for α, β, γ. Step 2: Fix α, β, γ and scale up φ.
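A tiny sketch of the compound-scaling rule; the default α, β, γ are the grid-searched values reported in the EfficientNet paper, used here only for illustration.

def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) multipliers d=alpha^phi, w=beta^phi, r=gamma^phi."""
    assert alpha * beta ** 2 * gamma ** 2 <= 2.1, "FLOPs should grow roughly by 2^phi"
    return alpha ** phi, beta ** phi, gamma ** phi

# e.g. compound_scale(1) -> (1.2, 1.1, 1.15): scale depth, width, and input
# resolution together for roughly 2x the FLOPs of the baseline network.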

50
From CNN/RNN to GNN
q Uniqueness in search space
• More dimensions of choices:
• Micro: mainly aggregation and combine functions.
• Macro: how node embeddings in each layer produce the final one.
• Nodes are not independent, so how about in a node-wise manner? Figure: General message passing.

q Challenges of weight-sharing one-shot NAS


• Different options lead to quite different output statistics [Zhou et al. 19].
q Transfer across datasets and tasks [You et al. 20b]
• Collect 32 (diverse) tasks.
• Use anchor models to calculate task similarities.

51
Figure: Comparing the correlations.
Beyond Accuracy: Efficiency and Robustness
q Making latency differentiable [Cai et al. 19]

Figure: Introducing a latency regularization loss.

q Searching robust architecture [Guo et al. 20]

52
Figure: Performance of 1k sampled architecture. Figure: Analysis of top 300 robust v.s. non-robust architectures.
Beyond Accuracy: Compressed Model Search

• Pre-trained language models such as BERT achieve great performance on various tasks, but they are difficult to deploy in real-time applications.
• Can we task-adaptively compress the original BERT for different tasks?

Method        Averaged Performance   Inference Speed
BERT          82.5                   1x
BERT-PKD      80.6                   1.9x
DistillBERT   76.8                   3.0x
TinyBERT      80.6                   9.4x
AdaBERT       80.1                   12.7x ~ 29.3x

The proposed AdaBERT achieves significant speedup in inference time while maintaining comparable performance compared to the uncompressed model.

53
AdaBERT: Task-Adaptive BERT Compression with D-NAS. IJCAI, 2020, https://arxiv.org/abs/2001.04246
Beyond Accuracy: Compressed Model Search

Table: Performance of searched structures across different tasks

These results demonstrate that the proposed


AdaBERT compresses original BERT adaptively
for different downstream tasks.
Figure: Searched structures of compressed models for different tasks

54
AdaBERT: Task-Adaptive BERT Compression with D-NAS, IJCAI, 2020, https://arxiv.org/abs/2001.04246
NAS Benchmarks
q NAS-Bench-101 [Ying et al. 19]
• Provides a lookup table for the 423k architectures.
• Including their train/valid/test accuracies, number of parameters, and training time.
q NATS-Bench [Dong et al. 21]
• Search space considers both size and topology factors.

55
Figure: The search space of NATS-Bench.
NAS Benchmarks
q NAS-Bench-101 [Ying et al. 19]
• Provides a lookup table for the 423k architectures.
• Including their train/valid/test accuracies, number of parameters, and training time.
q NATS-Bench [Dong et al. 21]
• Search space considers both size and topology factors.

Figure: Comparison the benchmarks.

56
Takeaways

q Search space
• Layer by layer → repeated normal & reduction cells
• Pre-defined restricted design space → search for the design space
• Pre-defined size → also search for the optimal size

q Search strategy
• Trial-and-error, e.g., RL and ES → one-shot NAS
• Differentiable (+ sampling ops)

q Performance estimation strategy
• Standard training & validation → with weight-sharing
• Single objective → multiple objectives

57
Future Directions
q Reduce the variance of one-shot NAS
• The interference between child models is a main factor [Zhang et al. 2020].
• E.g., sharing unless some condition(s) are satisfied.

Figure: Validation
performance of each
child model during the
last 120 steps.

q Select the truly useful architecture


• The magnitude of architecture parameters does not necessarily indicate how much the operation
contributes to the supernet’s performance [Wang and Cheng et al. 2021].

58
q Beyond NAS: From static to dynamic neural architectures.
• Fine-grained tuning.
• Mainly focusing on CNNs and the efficiency issue for now.

Dynamic Neural Networks: A Survey,


https://arxiv.org/pdf/2102.04906.pdf

59
References of NAS
• [Zoph et al. 2017] Neural Architecture Search with Reinforcement Learning. ICLR. 2017.
• [Zoph et al. 2018] Learning Transferable Architectures for Scalable Image Recognition. CVPR. 2018.
• [Bender et al. 2018] Understanding and Simplifying One-Shot Architecture Search. ICML. 2018.
• [Zhang et al. 2020] Deeper Insights into Weight Sharing in Neural Architecture Search. arXiv. 2020.
• [Li et al. 2021] Geometry-aware Gradient Algorithms for Neural Architecture Search. ICLR. 2021.
• [Xie et al. 2019] Exploring Randomly Wired Neural Networks for Image Recognition. ICCV. 2019.
• [Radosavovic et al. 2020] Designing Network Design Spaces. CVPR. 2020.
• [You et al. 2020a] Graph Structure of Neural Networks. ICML. 2020a.
• [Zhou et al. 2019] Auto-GNN: Neural Architecture Search of Graph Neural Networks. Arxiv. 2019.
• [You et al. 2020b] Design Space for Graph Neural Networks. NeurIPS. 2020b.
• [Cai et al. 2019] ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. ICLR. 2019.
• [Guo et al. 2020] When NAS Meets Robustness: In Search of Robust Architectures against Adversarial Attacks.
CVPR. 2020.
• [Ying et al. 2019] NAS-Bench-101: Towards Reproducible Neural Architecture Search. ICML. 2019.
• [Dong et al. 2021] NATS-Bench: Benchmarking NAS Algorithms for Architecture Topology and Size. TPAMI.
2021.
• [Tan et al. 2019] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML. 2019.
• [Wang and Cheng et al. 2021] Rethinking Architecture Selection in Differentiable NAS. ICLR. 2021.

60
Meta-Learning

61
Meta-learning
q What is meta-learning?
• Training on a meta-dataset consisting of many datasets, where each is a different task.
• Extract prior knowledge from it that accelerates the learning of new tasks.

Figure: Example of how meta-learning works (source: https://cs330.stanford.edu/slides/cs330_metalearning_bbox_2020.pdf).


62
When Meta-learning Meets AutoML
q AutoML as a service
• What if users do not have a large dataset for training a deep model?
• What if users want to quickly learn a new task?

Hyperparameter
Optimization
Figure: The distribution of the scales of datasets (source: https://cs330.stanford.edu/).

NAS

AutoML

Automatic
Feature
Generation 63
When Meta-learning Meets AutoML
q AutoML as a service
• Assume different tasks share some common principles.
• Can we exploit the cumulated experience?

Hyperparameter
Optimization

NAS Meta-Learning😄

AutoML

Automatic
Feature
Generation 64
Meta-learning Basics
q Exploit the meta-dataset
• Conventional (supervised) ML: learn parameters from a single dataset, e.g., argmax_φ log p(φ | D).
• Meta-learning: also condition on the meta-dataset, argmax_φ log p(φ | D, D_meta-train).

q Replace the meta-dataset by meta-parameters θ
• θ is sufficient to represent the meta-dataset: learn θ once, then adapt φ = argmax_φ log p(φ | D, θ) — this is the adaptation problem.

65
Optimization-based Meta-learning
q Adaptation problem
• Acquire φ_i via optimization: φ_i = argmax_φ [ log p(D_i^tr | φ) + log p(φ | θ) ].
• θ serves as a prior.

q Which form of prior to take?
• Initialization and fine-tuning! E.g., one gradient step from θ: φ_i = θ − α ∇_θ L(θ, D_i^tr).
Figure: Illustrating the idea of optimization-based meta-learning (source: https://arxiv.org/pdf/1703.03400.pdf).
• The meta-gradient then involves second-order terms (a Hessian of the inner-loop loss w.r.t. θ).
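A PyTorch sketch of the "initialization + fine-tuning" prior in the style of MAML; for brevity it is the first-order variant (the Hessian term above is dropped), and the task sampler and loss are placeholders.

import copy
import torch

def maml_first_order(model, sample_task, loss_fn, meta_steps=1000,
                     inner_lr=0.01, meta_lr=1e-3, inner_steps=1):
    meta_opt = torch.optim.Adam(model.parameters(), lr=meta_lr)
    for _ in range(meta_steps):
        (x_tr, y_tr), (x_ts, y_ts) = sample_task()          # D_i^tr, D_i^ts
        learner = copy.deepcopy(model)                       # phi_i starts at theta
        for _ in range(inner_steps):                         # adaptation: fine-tune on D_i^tr
            loss = loss_fn(learner(x_tr), y_tr)
            grads = torch.autograd.grad(loss, learner.parameters())
            with torch.no_grad():
                for p, g in zip(learner.parameters(), grads):
                    p -= inner_lr * g
        meta_loss = loss_fn(learner(x_ts), y_ts)             # evaluate adapted phi_i
        meta_grads = torch.autograd.grad(meta_loss, learner.parameters())
        meta_opt.zero_grad()
        for p, g in zip(model.parameters(), meta_grads):     # first-order: apply grads to theta
            p.grad = g.clone()
        meta_opt.step()
    return model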
66
Optimization-based Meta-learning

q Probabilistic interpretation
• Maximize a posterior (MAP) with 𝜃 as the prior.
q MAML [Finn et al. 17] approximates hierarchical
Bayesian inference!
• Gradient descent with early stop = MAP inference under Gaussian prior
with mean at initial parameters.
• Other forms, e.g.,

Figure: Probabilistic interpretation of optimization-based


meta-learning (source: https://cs330.stanford.edu/).

😄 Model-agnostic
😄 Maximally expressive with sufficiently deep neural networks
☹ Typically requires second-order computation / memory intensive

67
Model-based Meta-learning
q Adaptation problem
• From solving an optimization problem to black-box adaptation: φ_i = f_θ(D_i^tr) = argmax_φ log p(φ | D_i^tr, θ).
• Train a neural network to represent p(φ_i | D_i^tr, θ).
• E.g., RNN, Neural Turing Machine, memory-augmented NN [Santoro et al. 16], etc.

😄 Expressive
☹ Often sample-inefficient
68
Figure: Memory-augmented neural networks (source: https://proceedings.mlr.press/v48/santoro16.pdf).
Metric-based Meta-learning
q Use Non-parametric learner
Figure: The idea of metric-based
meta-learning (source:
https://cs330.stanford.edu/slide
s/cs330_nonparametric_2020.p
df).

😄 Entirely feedforward
😄 Easy to optimize
☹ Harder to generalize to varying k-ways (especially for very large k)

69
Metric-based Meta-learning
q Use Siamese neural networks

Figure: Architecture of
Siamese neural networks
and its application to
one-shot learning.

• Meta-training: binary classification.
• Meta-test: k-way classification.

70
Siamese Neural Networks for One-shot Image Recognition. ICML, 2015, https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf
Metric-based Meta-learning
q Match the train&test phases by Matching networks
• Fix the mismatch between meta-training and meta-test.
• Map a (support) set S = {(x_i, y_i)} to a classifier: P(ŷ | x̂, S) = Σ_i a(x̂, x_i) · y_i.
• The attention mechanism a(·,·) fully specifies the classifier.

Figure: Architecture of Matching network.
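A toy non-parametric classifier in the spirit of the formula above: cosine-similarity attention over the support set, with the learned embedding left as an identity placeholder.

import numpy as np

def matching_predict(support_x, support_y, query_x, embed=lambda x: x):
    """P(y|query) = sum_i a(query, x_i) * onehot(y_i), with a = softmax of cosine similarity."""
    s = embed(support_x)                            # (k, d) embedded support examples
    q = embed(query_x)                              # (d,)  embedded query
    sims = s @ q / (np.linalg.norm(s, axis=1) * np.linalg.norm(q) + 1e-9)
    attn = np.exp(sims) / np.exp(sims).sum()        # attention a(query, x_i)
    onehot = np.eye(support_y.max() + 1)[support_y]  # (k, n_classes)
    return attn @ onehot                            # class probabilities

# Example: 1-shot, 3-way episode with 2-d "embeddings".
support_x = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
support_y = np.array([0, 1, 2])
print(matching_predict(support_x, support_y, np.array([0.9, 0.1])))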

71
Matching Networks for One Shot Learning. NeurIPS, 2016, https://arxiv.org/abs/2001.00745
GPT-3: meta-learning as pre-training

q What’s the meta-dataset?


• Crawled text corpora.
• D_i^tr: a sequence of characters; D_i^ts: the following sequence of characters.
q What’s the meta-learning problem?
• Put different tasks all in the form of text.
• Thus trained on language generation tasks.
q What’s the extracted prior knowledge?
• A “Transformer” model as the initialization.

Figure: The model is far from perfect (source: https://github.com/shreyashankar/gpt3-sandbox/blob/master/docs/priming.md).

72
Generalization v.s. Customization

q Key assumption of meta-learning


• Meta-training and meta-test tasks are drawn i.i.d. from the same task
distribution.
• E.g., Omniglot:
• 1623 characters from 50 different alphabets.
• 20 instances for each character.

Figure: Characters of different alphabets (source: https://omniglot.com/).

73
Generalization v.s. Customization

q Key assumption of meta-learning
• Meta-training and meta-test tasks are drawn i.i.d. from the same task distribution. ☹ This can NOT be strictly satisfied!
• E.g., Omniglot: 1623 characters from 50 different alphabets, 20 instances for each character.

q Experience cumulated on the cloud
• Different user experiments can be quite different.
• Learning a global prior may be insufficient.

Cumulated Experience —(Meta-learning)→ Global Prior —(Customization)→ Customized Initialization —(Adaptation)→ Task-specific Model
74
Relational Meta-Learning
Learner-2

Learner-1
Most Meta-learning
Meta
methods don’t capture
Learner the relations among
Learner-4
tasks/learners.
Learner-3

The proposed relational meta-


learning method can capture the
relations among different tasks,
which enhances the
effectiveness of meta-learners.

75
Automated Relational Meta-learning. ICLR, 2020, https://arxiv.org/abs/2001.00745
Summary and Future Directions
q How to utilize existing experience --- meta-learning
• Learn a meta-parameter, so that we can quickly transfer to new tasks.
• Optimization-based, model-based, metric-based

q What if tasks are heterogeneous?
• Trade-off between generalization v.s. customization

q Use meta-learning for improving real-world services
• AutoML as a service has cumulated a lot of experience.
• Learning tasks on different domains and/or with different models share some intrinsic patterns of machine learning.
• What kinds of features are transferable? How to represent a task, a model, and an objective?

76
References of Meta-learning
[Santoro et al. 2016] Meta-Learning with Memory-Augmented Neural Networks. ICML. 2016.
[Finn et al. 2017] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML. 2017.
[Nichol et al. 2018] On First-Order Meta-Learning Algorithm. arXiv. 2018.
[Koch et al. 2015] Siamese Neural Networks for One-shot Image Recognition. ICML. 2015.
[Vinyals et al. 2016] Matching Networks for One Shot Learning. NeurIPS. 2016.

77
Auto Feature Generation

78
Automatic Feature Generation
• In practice, many data scientists search for useful interactive features in a trial-and-error manner, which has occupied a lot of their workload.
• Therefore, automatic feature generation (AutoFeature), as one major topic of automated machine learning (AutoML), has received a lot of attention from both academia and industry.

Data → AutoFeature Model → Useful Interactive Features → Downstream Applications

79
Automatic Feature Generation

Feature Interpretability
• Industries such as healthcare and finance need interpretability.
• Interpretable features can be applied to train lightweight models for real-time requirements.

Search Efficiency
• The number of possible interactive features is too large to be traversed (O(2^m) for m original features).

80
Automatic Feature Generation
q The related works on automatic feature generation can be roughly divided into two categories:
• DNN-based methods
• Search-based methods

DNN-based methods design specific neural architectures to express the interactions among different features.
• Implicit feature generation
• One-shot training course
• Lack of interpretable rules for feature interactions

Search-based methods focus on designing search strategies that prune as many of the candidates to be evaluated as possible, while aiming to keep the most useful interactive features.
• Explicit feature generation
• Trial-and-error training manner
• Need lots of time and computing resources

81
AutoInt
q Map the original features into a low-dimensional feature space and model the high-order feature interactions via deep self-attention.

① Input Layer: Each feature field is represented as a one-hot vector (for a categorical feature) or a scalar value (for a numerical feature), giving the sparse feature x = [x1; x2; ...; xM].
② Embedding Layer: Transform the sparse, high-dimensional features into a low-dimensional feature space via a learnable embedding matrix, e_i = V_i x_i.

Figure: Overview of the proposed model AutoInt (input layer → embedding layer → interacting layer with multi-head self-attention → output layer: estimated CTR).
82
AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. CIKM, 2019.
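For intuition, a single-head numpy version of the key-value attention that the interacting layer applies to the field embeddings e_1..e_M; the projection matrices are random placeholders, and the real model additionally uses multiple heads and a residual connection.

import numpy as np

def interacting_layer(E, d_out=8, seed=0):
    """E: (M, d) field embeddings -> (M, d_out) interaction-aware embeddings (one head)."""
    rng = np.random.default_rng(seed)
    M, d = E.shape
    W_q, W_k, W_v = (rng.normal(size=(d, d_out)) for _ in range(3))
    Q, K, V = E @ W_q, E @ W_k, E @ W_v
    scores = Q @ K.T                                   # psi(e_m, e_k) = <W_q e_m, W_k e_k>
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return attn @ V                                    # e_m_tilde = sum_k a_{m,k} W_v e_k

# Example: 4 feature fields embedded in 16 dimensions.
print(interacting_layer(np.random.randn(4, 16)).shape)   # (4, 8)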
AutoInt
③ Interacting Layer: A multi-head key-value self-attention mechanism models the correlations between different feature fields and determines which feature combinations are meaningful. For head h, the attention weight between features m and k is
  α_{m,k}^(h) = exp(ψ^(h)(e_m, e_k)) / Σ_l exp(ψ^(h)(e_m, e_l)),   with   ψ^(h)(e_m, e_k) = ⟨W_Query^(h) e_m, W_Key^(h) e_k⟩,
and the updated representation of feature m in subspace h combines its relevant features, ẽ_m^(h) = Σ_k α_{m,k}^(h) (W_Value^(h) e_k). Stacking interacting layers captures combinatorial features of arbitrary order, each with its own correlation weights.

Figure: Overview of the proposed model AutoInt. Figure: The architecture of the interacting layer.
83
AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. CIKM, 2019.

AutoInt
q Experimental results on four real-world datasets show the advantages of AutoInt
• Performance comparison in offline AUC evaluation for click-through rate (CTR) prediction: AutoInt+ outperforms the strongest baseline on Criteo and Avazu (** 0.01 and * 0.05 level, unpaired t-test).
• Efficiency comparison, e.g., on MovieLens-1M (“DC” and “CN” denote DeepCrossing and CrossNet).
• Explainable recommendations: heat maps of attention weights reveal both case-level and global-level feature interactions on MovieLens-1M (feature fields <Gender, Age, Occupation, …>).

Table: Performance comparison (AUC / Logloss) on Criteo and Avazu. Figure: Efficiency comparison on MovieLens-1M. Figure: Heat maps of attention weights for feature interactions on MovieLens-1M.
84
KDD12 dataset, extra communication cost makes it axises represent feature �elds <Gender, Age, Occupation,
AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. CIKM, 2019.
Fi-GNN
q Fi-GNN proposes to represent multi-field features in a graph structure, and captures the feature interactions through node representation learning in the graph.
• Feature interaction via a graph view: nodes represent features and edges denote their interactions
• Model feature interactions via Graph Neural Networks (GNN)
• Attentional scoring for predictions

Figure: Overview of the proposed Fi-GNN (field-aware embedding layer → multi-head self-attention layer → graph neural network layers over the feature graph → attentional scoring layer → logistic loss).
Fi-GNN: Modeling Feature Interac]ons via Graph Neural Networksfeature
for CTR Predic]on. m
multi-�eld vector isCIKM,
�rst2019.
converted to �eld embed-
Fi-GNN
q Feature interaction in Fi-GNN: The nodes interact with neighbors and update their states
in a recurrent fashion.

Feature Graph: The edge weights reflect the importance of interactions between the connected nodes (features), and are learned via an attention mechanism.

Node Aggregation: Each node aggregates the transformed information from its neighbors and updates its state according to the aggregated information and its history, via a GRU and a residual connection.

(A multi-head self-attention layer first captures pairwise dependencies of the field embeddings, e.g., H_i = softmax_i(Q K^T / √d_K) V for head i.)

Figure: Framework of Fi-GNN — at each interaction step, nodes aggregate information from neighbors and update their states in a recurrent fashion.
Fi-GNN: Modeling
dK Feature Interac]ons via Grapheach interaction
Neural Networks forstep, each node
CTR Predic]on. will
CIKM, �rst aggregate trans-
2019.
Fi-GNN
q Taking advantage of the strong representative power of graphs, Fi-GNN captures high-order
feature interaction in an efficient way.
q Fi-GNN also provides good model explanations for CTR prediction.
Table 2: Performance Comparison of Di�erent methods. The best performance on each dataset and metric are highlighted.
Further analysis is provided in Section 4.2.

Criteo Avazu
Model Type Model
AUC RI-AUC Logloss RI-Logloss AUC RI-AUC Logloss RI-Logloss
First-order LR 0.7820 3.00% 0.4695 5.43% 0.7560 2.60% 0.3964 3.63%
FM [23] 0.7836 2.80% 0.4700 5.55% 0.7706 0.72% 0.3856 0.76%
Second-order
AFM[34] 0.7938 1.54% 0.4584 2.94% 0.7718 0.57% 0.3854 0.81%
DeepCrossing [25] 0.8009 0.66% 0.4513 1.35% 0.7643 1.53% 0.3889 1.67%
NFM [8] 0.7957 1.57% 0.4562 2.45% 0.7708 0.70% 0.3864 1.02%
High-order CrossNet [31] 0.7907 1.92% 0.4591 3.10% 0.7667 1.22% 0.3868 1.12%
CIN [15] 0.8009 0.63% 0.4517 1.44% 0.7758 0.05% 0.3829 0.10%
Fi-GNN (ours) 0.8062 0.00% 0.4453 0.00% 0.7762 0.00% 0.3825 0.00%
gl
Performance comparison.
AFM [34] (B) is a extent of FM, which considers the weight This indicates the Heat
second-order
map offeature interactions
auen+onal edgeisweights.
not Figure 6:
of di�erent second-order feature interactions by using attention su�cient. Figure 5: Heat map of attentional edge weights at the global-87
level for
on Avazu, which re�ects global- an
Fi-GNN: Modeling
mechanism. It is one of the state-of-the-art Feature
models thatInterac]ons
model via Graph
(4) Neural Networks
DeepCrossing CTR Predic]on.
outperforms CIKM, the
NFM, proving 2019.importance of relations
the e�ectiveness tance of d
Automatic Feature Generation
q The related works on automatic feature generation can be roughly divided
into two categories:
• DNN-based methods
• Search-based methods

DNN-based methods design specific neural Search-based methods focus on designing


architectures to express the interactions different search strategies that prune as much
among different features. of the candidates to be evaluated as possible,
• Implicit feature generation while aiming to keep the most useful interactive
• One-shot training course features.
• Lack of interpretable rules for feature • Explicit feature generation
interactions • Trial-and-error training manner
• Need lots of time and computing resource

88
AutoCross
q AutoCross searches useful feature interactions in the high-order interactive feature space by incrementally constructing a locally optimal feature set
• Multi-granularity discretization
• Greedy & beam search
• Field-wise logistic regression
• Successive mini-batch gradient descent

Multi-granularity discretization: to automate discretization and spare its dependence on human experts, each numerical feature is discretized into several categorical features with different granularities.

Figure: An illustration of multi-granularity discretization; shade indicates the value taken by each discretized feature.
89
AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. KDD, 2019.
AutoCross
q AutoCross searches useful feature interactions in the high-order interactive feature space by incrementally constructing a locally optimal feature set
• Multi-granularity discretization
• Greedy & beam search
• Field-wise logistic regression
• Successive mini-batch gradient descent

Greedy & beam search (see the sketch below):
• Tree-structured search space with the original features as the root.
• The children are generated by adding one pair-wise crossing to the parent.
• Only the most promising child is expanded during the search.

Figure: An illustration of the search space and the beam search strategy employed in AutoCross. In beam search, only the best node at each level is expanded; two colors indicate the two features used to construct the new cross feature.
90
AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. KDD, 2019.
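An illustrative beam-search loop over cross features in the spirit of the tree search above; `evaluate_set` stands in for the field-wise LR evaluation and is an arbitrary scorer here.

from itertools import combinations

def beam_search_crosses(base_features, evaluate_set, rounds=3, beam_width=1):
    """Grow feature sets by adding one pairwise cross per round, keeping the best beam_width sets."""
    beam = [list(base_features)]
    for _ in range(rounds):
        candidates = []
        for feats in beam:
            for a, b in combinations(feats, 2):
                cross = f"({a})x({b})"
                if cross not in feats:
                    candidates.append(feats + [cross])
        if not candidates:
            break
        candidates.sort(key=evaluate_set, reverse=True)
        beam = candidates[:beam_width]          # beam_width=1 gives the slide's greedy search
    return max(beam, key=evaluate_set)

# Toy usage: prefer feature sets containing more crosses that involve feature "A".
# beam_search_crosses(["A", "B", "C"], evaluate_set=lambda s: sum("A" in f for f in s))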
AutoCross
q AutoCross searches useful feature interactions in the high-order interactive feature space by incrementally constructing a locally optimal feature set
• Multi-granularity discretization
• Greedy & beam search
• Field-wise logistic regression
• Successive mini-batch gradient descent

Field-wise logistic regression: for each node, the weights of the newly added interactive features are updated during training, while the other weights are inherited from the parent and fixed.

Successive mini-batch gradient descent: the data are split into several blocks and gradually added into the training process, along with narrowing the set of candidate interactive features.

91
AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. KDD, 2019.
AutoCross

q Advantages of AutoCross:
• Explicit high-order feature generation
• Fast inference
• Interpretability

Figure 6: The number of second/high-order cross features generated for each dataset.
Figure 7: Validation AUC curves in real-business datasets.

Table 5: Test AUC improvement: second- vs. high-order features on benchmark datasets.
v.s. LR (base)   Bank     Adult     Credit   Employee   Criteo    Average
CMI+LR           0.330%   -0.175%   0.531%   2.842%     -0.140%   0.678%
AC+LR            0.585%   1.211%    3.316%   3.316%     2.279%    2.141%

Table 7: Inference latency comparison (unit: millisecond).
Benchmark Datasets
Method    Bank      Adult     Credit    Employee   Criteo
AC+LR     0.00048   0.00048   0.00062   0.00073    0.00156
AC+W&D    0.01697   0.01493   0.00974   0.02807    0.02698
Deep      0.01413   0.01142   0.00726   0.02166    0.01941
xDeepFM   0.08828   0.05522   0.04466   0.06467    0.18985
Real-World Business Datasets
Method    Data1     Data2     Data3     Data4      Data5
AC+LR     0.00367   0.00111   0.00185   0.00393    0.00279
AC+W&D    0.03537   0.01706   0.04042   0.02434    0.02582
Deep      0.02616   0.01348   0.03150   0.01414    0.01406
xDeepFM   0.32435   0.11415   0.40746   0.12467    0.13235

AC+LR is orders of magnitude faster than the other methods in inference. The validation AUC curves (Figure 7) are visible to the user, who can terminate AutoCross at any time to get the current result. Due to the simplicity of AutoCross, no hyper-parameter needs to be fine-tuned and no extra time is needed to get it to work; in contrast, deep-learning-based methods spend plenty of time on network architecture design and hyper-parameter tuning.

92
AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. KDD, 2019.
AutoFIS
q AutoFIS automatically identifies important feature interactions for Factorization Models (FM).

• Search Stage: Learn the relative importance of each feature interaction
via architecture parameters within one full training process.
• Re-train Stage: Remove the unimportant interactions and re-train the
resulting neural networks.

Overview of AutoFIS.
93
AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction. KDD, 2020.
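A minimal sketch of the gating idea behind the search stage: every pair-wise interaction is scaled by a learnable architecture parameter α, and interactions whose learned |α| is small are dropped before re-training. The threshold rule and the NumPy tensors are illustrative assumptions; AutoFIS trains the gates with the GRDA optimizer inside the full model.

```python
import numpy as np

def gated_fm_interactions(embeddings, alpha):
    """Sum of pair-wise inner products <e_i, e_j>, each scaled by a learnable gate alpha[i, j]."""
    m = embeddings.shape[0]
    total = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            total += alpha[i, j] * float(embeddings[i] @ embeddings[j])
    return total

def select_interactions(alpha, threshold=1e-2):
    """Re-train stage keeps only interactions whose gate magnitude exceeds a threshold."""
    m = alpha.shape[0]
    return [(i, j) for i in range(m) for j in range(i + 1, m) if abs(alpha[i, j]) > threshold]

rng = np.random.default_rng(0)
m, d = 5, 8                    # 5 feature fields, embedding size 8
emb = rng.normal(size=(m, d))
alpha = np.ones((m, m)) * 0.5  # gates, learned jointly with the model in the search stage
print(gated_fm_interactions(emb, alpha))
print(select_interactions(alpha))
```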
AutoFIS

q Experiments on large-scale datasets demonstrate that AutoFIS can improve various FM
based models in CTR prediction tasks.

• In the search stage, the architecture parameters are optimized by the GRDA optimizer and the
other parameters v are optimized by the Adam optimizer. In the re-train stage, all parameters
are optimized by the Adam optimizer.
• The performance boost can be achieved with marginal search cost (e.g., it takes 24 minutes and
128 minutes to search important 2nd- and 3rd-order feature interactions in Avazu and Criteo
with a single GPU card).

Table 3: Performance in Private Dataset. "Rel. Impr." is the relative AUC improvement over the FM model.
Model            AUC      log loss  top   Rel. Impr.
FM               0.8880   0.08881   100%  0
FwFM             0.8897   0.08826   100%  0.19%
AFM              0.8915   0.08772   100%  0.39%
FFM              0.8921   0.08816   100%  0.46%
DeepFM           0.8948   0.08735   100%  0.77%
AutoFM(2nd)      0.8944*  0.08665*  37%   0.72%
AutoDeepFM(2nd)  0.8979*  0.08560*  15%   1.11%
* denotes statistically significant improvement (measured by t-test with p-value < 0.005).
AutoFM compares with FM and AutoDeepFM compares with all baselines.

Figure 3: Relationship between the architecture parameter 𝛼 and the statistics_AUC value for
each second-order interaction (correlations between the architecture parameters and AUC).

94
AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction. KDD, 2020.
FIVES

q To possess both feature interpretability and search efficiency, the proposed method FIVES formulates
the task of interactive feature generation as searching for edges on the defined feature graph.

1. Search Strategy
• Proposition 1 (theoretical support for the search strategy): Let X₁, X₂ and Y be Bernoulli random
variables with a joint conditional probability mass function p_{x₁,x₂|y} := P(X₁ = x₁; X₂ = x₂ | Y = y)
such that x₁, x₂, y ∈ {0, 1}. Suppose further that the mutual information between Xᵢ and Y satisfies
I(Xᵢ; Y) < L, where i ∈ {1, 2} and L is a non-negative constant. If X₁ and X₂ are weakly correlated
given y ∈ {0, 1}, that is, Cov(X₁, X₂ | Y = y) / (σ_{X₁|Y=y} σ_{X₂|Y=y}) ≤ ρ, we have
I(X₁X₂; Y) < 2L + log(2ρ² + 1).   (1)
Here the random variables X stand for features, Y for the label, and the joint of the Xs for their
interaction; as the considered raw features are categorical, modeling each feature as a Bernoulli
random variable does not sacrifice much generality.
• This proposition states that informative interactive features are unlikely to come from
synthesizing uninformative lower-order ones.
• The theory motivates the bottom-up search strategy in FIVES: searching for a group of
informative 𝑘-order features from the interactions between the original features and the
group of (𝑘−1)-order features identified in the previous step.

95
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. KDD, 2021.
FIVES
q To possess both feature interpretability and search efficiency, the proposed method FIVES formulates
the task of interactive feature generation as searching for edges on the defined feature graph.

2. Feature Graph
• To instantiate the proposed search strategy, the original features are conceptually
regarded as a feature graph and their interactions are modeled by a designed GNN.
• Each node 𝑛_𝑖 corresponds to a feature 𝑓_𝑖. Each edge 𝑒_{𝑖,𝑗} indicates an interaction between
𝑛_𝑖 and 𝑛_𝑗.

The constructed feature graph to represent high-order feature interactions.


96
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. KDD, 2021.
FIVES
q To possess both feature interpretability and search efficiency, the proposed method FIVES formulates
the task of interactive feature generation as searching for edges on the defined feature graph.

2. Feature Graph
• The feature graph consists of 𝐾 subgraphs to represent high-order interactive features. Each
subgraph indicates a layer-wise interaction between features, represented by an adjacency
matrix 𝐴^(𝑘) ∈ {0,1}^{𝑚×𝑚}. The graph convolutional operator for aggregation is defined as:

    𝑛_𝑖^(𝑘) = 𝑝_𝑖^(𝑘) ⨀ 𝑛_𝑖^(𝑘−1),  where  𝑝_𝑖^(𝑘) = MEAN_{𝑗 | 𝐴^(𝑘)_{𝑖,𝑗}=1} ( 𝑊_𝑗 𝑛_𝑗^(0) )        (1)

• The node representation at the 𝑘-th layer corresponds to the generated features:

    𝑛_𝑖^(𝑘) = MEAN_{𝑗 | 𝐴^(𝑘)_{𝑖,𝑗}=1} ( 𝑊_𝑗 𝑛_𝑗^(0) ) ⨀ 𝑛_𝑖^(𝑘−1) ≈ MEAN_{𝑗_1, …, 𝑗_𝑘} { 𝑓_{𝑗_1} ⨂ … ⨂ 𝑓_{𝑗_𝑘} ⨂ 𝑓_𝑖 }        (2)

    (the mean in the approximation is taken over the index tuples selected by the layer-wise
    adjacency matrices 𝐴^(1), …, 𝐴^(𝑘))

97
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. KDD, 2021.
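A minimal NumPy sketch of the layer-wise propagation in Eq. (1): each layer selects neighbors with its own adjacency matrix and multiplies the aggregated message element-wise into the previous representation. The tensor shapes, the random adjacency tensor, and the aggregation details are illustrative assumptions, not the FIVES implementation.

```python
import numpy as np

def fives_propagate(n0, A, W):
    """Layer-wise propagation following Eq. (1):
       p_i^(k) = mean over {j : A[k, i, j] = 1} of (W_j @ n_j^(0))
       n_i^(k) = p_i^(k) * n_i^(k-1)   (element-wise product)."""
    K, m, _ = A.shape
    d = n0.shape[1]
    transformed = np.stack([W[j] @ n0[j] for j in range(m)])  # W_j n_j^(0), shape (m, d)
    n_prev = n0.copy()
    layers = []
    for k in range(K):
        n_k = np.zeros((m, d))
        for i in range(m):
            neighbors = np.nonzero(A[k, i])[0]
            if len(neighbors) > 0:
                p_i = transformed[neighbors].mean(axis=0)
                n_k[i] = p_i * n_prev[i]
        layers.append(n_k)   # each additional layer raises the interaction order by one
        n_prev = n_k
    return layers

rng = np.random.default_rng(0)
m, d, K = 4, 6, 2                        # 4 original features, embedding size 6, 2 extra layers
n0 = rng.normal(size=(m, d))             # initial node embeddings (the original features)
W = rng.normal(size=(m, d, d))           # per-node transformation matrices
A = rng.integers(0, 2, size=(K, m, m))   # binarized layer-wise adjacency matrices
print([layer.shape for layer in fives_propagate(n0, A, W)])
```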
FIVES
q To possess both feature interpretability and search efficiency, the proposed method FIVES formulates
the task of interactive feature generation as searching for edges on the defined feature graph.

3. Differentiable Edge Search

• The task of generating useful interactive features is equivalent to learning an optimal adjacency
tensor 𝐴, so-called edge search:

    min_𝐴  ℒ(𝒟_val | 𝐴, Θ(𝐴))
    s.t.  Θ(𝐴) = argmin_Θ ℒ(𝒟_train | 𝐴, Θ)        (3)

• To make the optimization more efficient, 𝐴 is regarded as Bernoulli random variables parameterized
by 𝐻 ∈ [0,1]^{𝐾×𝑚×𝑚}, and a soft 𝐴^(𝑘) is allowed to be used for propagation at the 𝑘-th layer.

98
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. KDD, 2021.
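A minimal sketch of how the bi-level problem in Eq. (3) can be approximated by alternating gradient steps: Θ is updated against the training loss under the current soft adjacency tensor sigmoid(H), and H is updated against the validation loss. The gradient oracles, learning rates, and toy losses below are placeholders, not the FIVES implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_search(H, theta, grad_theta_train, grad_H_val, steps=100, lr_theta=1e-2, lr_H=1e-2):
    """Alternating approximation of the bi-level problem in Eq. (3):
       inner step: theta <- theta - lr * d L(D_train | A(H), theta) / d theta
       outer step: H     <- H     - lr * d L(D_val   | A(H), theta) / d H
    grad_theta_train and grad_H_val are user-supplied gradient oracles (placeholders here)."""
    for _ in range(steps):
        A_soft = sigmoid(H)  # soft adjacency tensor used for propagation
        theta = theta - lr_theta * grad_theta_train(A_soft, theta)
        H = H - lr_H * grad_H_val(A_soft, theta)
    return H, theta

# Toy usage: simple surrogate "gradients" stand in for the real GNN training/validation losses.
rng = np.random.default_rng(0)
K, m = 2, 4
H = rng.normal(size=(K, m, m))
theta = rng.normal(size=10)
g_train = lambda A, th: th - A.mean()        # pretend gradient of the training loss w.r.t. theta
g_val = lambda A, th: (A - 0.5) * th.mean()  # pretend gradient of the validation loss w.r.t. H
H_final, theta_final = edge_search(H, theta, g_train, g_val)
print(H_final.shape, theta_final.shape)
```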
FIVES
q To possess both feature interpretability and search efficiency, the proposed method FIVES formulates
the task of interactive feature generation as searching for edges on the defined feature graph.

(Search stage, from Algorithm 1: propagate the graph signal for 𝐾 times according to Eq. (2);
update Θ by descending ∇_Θ ℒ(𝒟_train | 𝐴, Θ); update 𝐻 by descending ∇_𝐻 ℒ(𝒟_val | 𝐴, Θ); repeat.)

4. Interactive Feature Derivation
• The learned adjacency tensor can explicitly indicate which interactive features are useful.
• One can inductively derive useful 𝑘-order (1 ≤ 𝑘 ≤ 𝐾) interactive features by specifying
layer-wise thresholds for binarizing the learned 𝐴.
• FIVES serves as a feature generator for lightweight models to meet the requirement of
inference speed.

Figure 2: An example of interactive feature derivation. Given the binarized adjacency matrix 𝐴^(𝑘)
(the 𝑘-th slice of the binarized adjacency tensor 𝐴), the generated 2nd- and 3rd-order cross
features can be read off the feature graph layer by layer.

99
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. KDD, 2021.
FIVES
q Extensive experiments on five public datasets and two business datasets confirm that FIVES can
generate useful interactive features.
• FIVES as a predictive model for downstream tasks, such as CTR prediction
• FIVES as the feature generator for lightweight models to meet the requirement of inference
speed

Correlation between the entries of 𝐴 and the AUC of the corresponding indicated feature.
Efficiency comparisons.
100
FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. KDD, 2021.
Takeaways

AutoFeature Model → Useful Interactive Features
ü Feature Interpretability
ü Search Efficiency

DNN-based methods:
• Implicit feature generation
• One-shot training course
• Lack of interpretable rules for feature interactions

Search-based methods:
• Explicit feature generation
• Trial-and-error training manner
• Need lots of time and computing resources
101
Future Directions
q How to introduce human experience as prior knowledge for AutoFeature?

q Causal features or spurious correlations?

q How to balance the trade-off between the usefulness of the generated
features and their completeness?

102
References of AutoFeature
[1] AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. In CIKM 2019.
[2] Fi-GNN: Modeling Feature Interactions via Graph Neural Networks for CTR Prediction. In CIKM 2019.
[3] AutoCross: Automatic Feature Crossing for Tabular Data in Real-World Applications. In KDD 2019.
[4] AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate
Prediction. In KDD 2020.
[5] FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data. In KDD 2021.

103
VolcanoML: End-to-End AutoML via
Scalable Search Space Decomposition

104
Two Complications of AutoML going E2E
Personal perspectives, from our past experiences:
• VolcanoML: Speeding up End-to-End AutoML via Scalable Search Space Decomposition. VLDB 2021.
• AutoML from Service Provider's Perspective: Multi-device, Multi-tenant Model Selection with GP-EI. AISTATS 2019.
• Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads. VLDB 2018.

AutoML
𝛼∗ = argmax_𝛼 𝑓(𝐷′, 𝜃_𝛼∗)
s.t. 𝜃_𝛼∗ = argmax_𝜃 𝑃(𝐷|𝜃) 𝑃(𝜃|𝛼)

Two Complications
1. 𝛼 is not a homogeneous space, it is rather heterogeneous:
𝛼 ∈ 𝐹𝑒𝑎𝑡𝑢𝑟𝑒 × 𝐻𝑃 × 𝑀𝑜𝑑𝑒𝑙   (auto-sklearn)
2. From single-tenant to multi-tenant scenarios

105
Disclaimer

This segment of the tutorial is more opinionated and closer to our own
experience than the previous segments.

It is less about how much we know about these two problems, and more about
discussing some observations and preliminary explorations, to show you what
we don’t know and a “cry for help”.

106
Heterogeneous Search Space
• 𝑭𝒆𝒂𝒕𝒖𝒓𝒆 × 𝑯𝑷 × 𝑴𝒐𝒅𝒆𝒍
• A strong baseline: Treat the heterogeneous space as a single
joint space.
• Model it with a single Bayesian optimization problem, a
single genetic algorithm, or a single hyperband problem
• Good? Very powerful approach, yet simple.
• Could be improved?
• “The curse of dimensionality”: often it is not easy to
scale up when the dimensionality of the space is high.
• Heterogeneity in algorithm: Different subspaces might
benefit from different algorithms.
• Can we do better?

107
Heterogeneous Search Space
• Different ways to conduct search. Let’s take for example the
space 𝜶 ∈ 𝑿 × 𝒀
• Strategy 1. Joint
• Treating the space 𝑿 × 𝒀 as a single search space
• (If you are doing BO) Create a surrogate model M to approximate 𝑓(𝜶)
• Use M to select the next candidate 𝜶
• Evaluate 𝑓(𝜶) and update the surrogate model M
• One can implement such a strategy using methods beyond BO.

108
Heterogeneous Search Space
• Different ways to conduct search. Let’s take for example the space
𝜶 ∈ 𝑿 × 𝒀
• Strategy 2. Conditioning
• Idea: decompose 𝑿 × 𝒀 into multiple subspaces, e.g., one for each value of 𝑿
• min_{𝑥,𝑦} 𝑓(𝑥, 𝑦) ⇒ min_{𝑥∈𝑋} min_𝑦 𝑔_𝑥(𝑦)
• Then treat each 𝑥 ∈ 𝑋 as a subproblem min_𝑦 𝑔_𝑥(𝑦)
• Can be modeled as a multi-armed bandit problem – each arm corresponds to a
possible value of 𝑥 ∈ 𝑋, and playing an arm means optimizing min_𝑦 𝑔_𝑥(𝑦) for one step

• For example, think about X as Algorithm and Y as Feature – for each
Algorithm, search for the best feature, and pick the best Algorithm

109
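A minimal sketch of the conditioning strategy above: each value of X (e.g., each algorithm) is an arm of a bandit, and playing an arm runs one step of search in that arm's subspace of Y. The random per-arm sampler and the UCB-style selection rule are illustrative stand-ins for whatever sub-optimizer and bandit policy one actually plugs in.

```python
import math
import random

def conditioned_search(arms, eval_fn, budget=50):
    """min_{x,y} f(x, y)  =>  min_{x in X} min_y g_x(y), outer min treated as a bandit.
    `arms` maps each x to a sampler over its subspace Y; `eval_fn(x, y)` returns the loss."""
    stats = {x: {"best": float("inf"), "pulls": 0} for x in arms}
    total = 0
    for _ in range(budget):
        total += 1
        def score(x):
            s = stats[x]
            if s["pulls"] == 0:
                return float("-inf")  # always try an unplayed arm first
            # Lower-confidence-bound style score: good best-so-far loss plus exploration bonus.
            return s["best"] - math.sqrt(2.0 * math.log(total) / s["pulls"])
        x = min(arms, key=score)
        y = arms[x]()             # one step of (random) search inside subspace Y for arm x
        loss = eval_fn(x, y)
        stats[x]["pulls"] += 1
        stats[x]["best"] = min(stats[x]["best"], loss)
    return min(stats.items(), key=lambda kv: kv[1]["best"])

# Toy usage: two "algorithms" whose loss depends on a single hyper-parameter y.
arms = {"algo_a": lambda: random.uniform(0, 1), "algo_b": lambda: random.uniform(0, 1)}
eval_fn = lambda x, y: (y - 0.3) ** 2 if x == "algo_a" else (y - 0.8) ** 2 + 0.05
print(conditioned_search(arms, eval_fn))
```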
Heterogeneous Search Space
• Different ways to conduct search. Let’s take for example the space
𝜶 ∈ 𝑿 × 𝒀
• Strategy 3. Alternating
• Idea: decompose 𝑿 × 𝒀 into two subspaces, 𝑿 and 𝒀
• Solve two problems alternately:
• min_𝑥 𝑔_𝑦̄(𝑥), where 𝑦̄ is the current best value for subspace 𝑌
• min_𝑦 𝑔_𝑥̄(𝑦), where 𝑥̄ is the current best value for subspace 𝑋
• Each subproblem can be solved either jointly or via some conditioning strategy
• At each iteration, pick the subproblem with the largest expected improvement
• For example, think about X as Feature and Y as HP – alternating the
process of searching for features and searching for HPs

110
Heterogeneous Search Space
• Different ways to conduct search
• Strategy 1. Joint
• Pros: Simple, works well when dimensionality is low
• Cons: Might suffer when the dimensionality is high
• Strategy 2. Conditioning
• Pros: Effective when some dimension is a categorical variable with small cardinality
• Cons: Might not be applicable to other scenarios.
• Strategy 3. Alternating
• Pros: Very effective in reducing dimensions
• Cons: Assuming conditional independence of two subspaces

111
Heterogeneous Search Space
• A single search space can be decomposed in different ways.

Different plans have different performance.

Potentially, we can learn to decompose given a target workload.

112
Heterogeneous Search Space
• Moving Forward
• Build up a suite of different building blocks – what is the
unified framework to talk about different search algorithms?
• How to automatically construct search space decomposition?
• How to automatically conduct building block selection?
AutoML for AutoML?

113
AutoML: From Single-tenant to Multi-tenant
Single-tenant Scenario: One target dataset

What if multiple users run their own AutoML workloads over a shared infrastructure?
An interesting problem, especially when AutoML as a service becomes more and more popular.

Existing Models:   M1   M2   M3   M4   M5   M6   M7
Existing Datasets
D1                 0.9  0.2  0.2  0.6  0.5  0.6  0.2
D2                 0.6  0.7  0.2  0.4  0.6  0.1  0.7
D3                 0.9  0.2  0.2  0.6  0.5  0.6  0.2
D4                 0.9  0.2  0.2  0.6  0.5  0.6  0.2
D5                 0.6  0.7  0.2  0.4  0.6  0.1  0.7
New Dataset
D6                 ?    ?    0.5  ?    ?    ?    ?

Pool of Resources
114
AutoML: From Single-tenant to Multi-tenant
Existing Models
How to balance resource allocations to different users?
M1 M2 M3 M4 M5 M6 M7
Existing Datasets

D1 0.9 0.2 0.2 0.6 0.5 0.6 0.2


D2 0.6 0.7 0.2 0.4 0.6 0.1 0.7
D3 0.9 0.2 0.2 0.6 0.5 0.6 0.2
D4 0.9 0.2 0.2 0.6 0.5 0.6 0.2
D5 0.6 0.7 0.2 0.4 0.6 0.1 0.7
New Datasets

D6 ? ? 0.5 ? ? ? ?
… ? ? ? ? ? ? ?
Dn ? ? ? ? ? ? ?

Pool of Resources
115


AutoML: From Single-tenant to Multi-tenant
• Regret: A Single User’s Unhappiness
(Regret: we could have served the user a better model if we magically knew the best model to try)
Regret after T trials: R_T

Decisions   Quality
M1          0.5
M2          0.7
M3          0.76
M4          0.79
M5          0.85
M6          0.87

Figure: quality versus number of trials (1–6).

116
AutoML: From Single-tenant to Multi-tenant

Figure: quality versus number of trials for two different users.

Which user should we serve next?

117
AutoML: From Single-tenant to Multi-tenant
User 1: [0.99] [0.99]

User 2: [0.10] [0.35]

Extreme Case: User 1 is not worth serving anymore

How about the more general case?


118
AutoML: From Single-tenant to Multi-tenant
01 Each user runs their own GP-EI model selection.
02 Serve the user with the highest expected improvement.

Informal Theorem. If the performance of all models is a linear combination of a finite, shared set
of hidden Gaussian variables, the global regret converges to 0 with rate O(1 / runtime).

Machine Learning Models (existing M1–M7 and new M8–Mk), Datasets (Users):
                M1   M2   M3   M4   M5   M6   M7   M8   …    Mk
D1 (existing)   0.9  0.2  0.2  0.6  0.5  0.6  0.2  ?    ?    ?
D2 (existing)   0.6  0.7  0.2  0.4  0.6  0.1  0.7  ?    ?    ?
D3 (existing)   0.9  0.2  0.2  0.6  0.5  0.6  0.2  ?    0.2  ?
D4 (existing)   0.9  0.2  0.2  0.6  0.5  0.6  0.2  ?    ?    ?
D5 (existing)   0.6  0.7  0.2  0.4  0.6  0.1  0.7  ?    ?    ?
D6 (new)        ?    ?    0.5  ?    ?    ?    ?    ?    ?    ?
…  (new)        ?    ?    ?    ?    ?    ?    ?    0.9  ?    ?
Dn (new)        ?    ?    ?    ?    0.7  ?    ?    ?    ?    ?

Computation Resource

119
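A minimal sketch of the scheduling rule on this slide: every user runs their own model-selection loop, and the shared resource serves the user whose next trial has the highest expected improvement. The `expected_improvement` and `run_next_trial` methods are placeholders for the per-user GP-EI machinery.

```python
def multi_tenant_scheduler(users, rounds=10):
    """Each entry of `users` exposes:
       expected_improvement() -> EI of that user's next candidate model (from their own GP)
       run_next_trial()       -> trains/evaluates the candidate and updates the user's GP."""
    for _ in range(rounds):
        # Serve the user whose next trial is expected to improve their best model the most.
        user = max(users, key=lambda u: u.expected_improvement())
        user.run_next_trial()

# Toy usage with a stub user whose expected improvement shrinks as it gets served.
class StubUser:
    def __init__(self, name, ei):
        self.name, self.ei, self.served = name, ei, 0
    def expected_improvement(self):
        return self.ei / (1 + self.served)
    def run_next_trial(self):
        self.served += 1
        print(f"serving {self.name} (remaining EI {self.expected_improvement():.3f})")

multi_tenant_scheduler([StubUser("user1", 0.9), StubUser("user2", 0.4)], rounds=5)
```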
AutoML: From Single-tenant to Multi-tenant
01 Each user runs their own GP-UCB algorithm.
02 Serve the user with a factor that is very similar to expected improvement
(directly comparing each user’s UCB does not work, for obvious reasons).

Machine Learning Models / Datasets (Users): the same dataset × model matrix as on the previous slide.

Computation Resource

120
AutoML: From Single-tenant to Multi-tenant
Figure: multi-tenant results (two panels); in both, the modeling error dominates.

121
AutoML: From Single-tenant to Multi-tenant
• Need some special care on diversity: don’t put all GPUs on a single user.
• Theorem. Near-linear speedup with respect to the number of devices when # devices << # users.

Machine Learning Models / Datasets (Users): the same dataset × model matrix as on the previous slides.

Pool of Resources
122


AutoML: From Single-tenant to Multi-tenant
• Moving Forward
• In my opinion, it is an exciting future direction to try to understand
resource allocation and scheduling for AutoML workloads
• What’s the unified way to talk about and think about different
AutoML workloads, e.g., those we have been talking about over the
last two hours
• Fairness? Efficiency? How should we aggregate unhappiness from
multiple users?

123
Two Complications of AutoML going E2E
AutoML
𝛼∗ = argmax_𝛼 𝑓(𝐷′, 𝜃_𝛼∗)
s.t. 𝜃_𝛼∗ = argmax_𝜃 𝑃(𝐷|𝜃) 𝑃(𝜃|𝛼)

A lot of challenges and exciting opportunities when bringing AutoML to an end-to-end production scenario!

Two Complications
1. 𝛼 is not a homogeneous space, it is rather heterogeneous:
𝛼 ∈ 𝐹𝑒𝑎𝑡𝑢𝑟𝑒 × 𝐻𝑃 × 𝑀𝑜𝑑𝑒𝑙   (auto-sklearn)
2. From single-tenant to multi-tenant scenarios

124
AutoML: A Small Personal Remark
ML today is now a Data Problem
• For many tasks, given the raw features from Kaggle, most
AutoML platforms rank in the bottom 50%.
• It is the data that we need to improve, and knowledge that
we need to integrate, to build better ML applications.
• To improve data, we need to first understand them.
Moving from a Model-driven development to a Data-driven
development.

MLBench, VLDB (2018): http://www.vldb.org/pvldb/vol11/p1220-liu.pdf

125
ML-Guided Database

126
Where DB Meets ML
• Human involved in research/engineering/analyzing/administrating:
• Building and maintaining indexes
• Query optimization
• Physical design tuning
• Optimizing view materialization

• Learning to automatically designing/optimizing/tuning?

127
Where DB Meets ML: Learning to Index
• Human involved in research/engineering/analyzing/administrating:
• Building and maintaining indexes
• Query optimization
• Physical design tuning
• Optimizing view materialization

• Learning to automatically designing/optimizing/tuning?

128
B-Tree Index from Learning Perspective

B-Tree Index: Input: Key; Output: Position; position = B-tree(Key)
Learned Index: Input: Key; Output: Position; position = function(Key)

[Image source] Kraska et al., The case for learned index structures. SIGMOD, 2018
129
Why Learning Index from Data?
• Consider this (ideal) case: build an index to store and query over a
table of n rows with consecutive integer keys, i.e., Keys = [11, 12, 13,
14, 15, ...] and Pos = [0, 1, 2, 3, 4, …]
• B-Tree: seeking Pos takes O(log n) time
• A learned function Pos = M(Key) = Key + offset (here offset = −11) takes O(1)

• Main motivation: the hidden yet useful distribution information


about the data to be indexed has not been fully explored and utilized
in the classic index techniques
• learned index: an automatic way to explore and utilize such information

130
Recursive-Model Index (RMI)

Root model

Sub-models

Data to be indexed

[Image source] Kraska et al., The case for learned index structures. SIGMOD, 2018
131
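A minimal sketch of a two-stage recursive-model index: a root model routes each key to a second-stage model, the chosen model predicts a position, and a bounded local search corrects the prediction. Linear least-squares models and the error-bound bookkeeping below are illustrative choices, not the paper's exact design.

```python
import bisect
import numpy as np

class TwoStageRMI:
    """Root model routes a key to a sub-model; the sub-model predicts a position;
    a binary search within the sub-model's recorded error bound finds the exact slot."""
    def __init__(self, keys, num_submodels=4):
        self.keys = np.asarray(sorted(keys))
        self.n = len(self.keys)
        pos = np.arange(self.n)
        self.num_submodels = num_submodels
        self.root = np.polyfit(self.keys, pos, 1)          # stage-1 linear model
        self.submodels, self.errors = [], []
        routed = self._route(self.keys)
        for s in range(num_submodels):
            mask = routed == s
            model = np.polyfit(self.keys[mask], pos[mask], 1) if mask.sum() >= 2 else self.root
            err = 0 if mask.sum() == 0 else int(np.ceil(np.abs(np.polyval(model, self.keys[mask]) - pos[mask]).max()))
            self.submodels.append(model)
            self.errors.append(err)                         # max prediction error of this sub-model

    def _route(self, keys):
        pred = np.polyval(self.root, keys) / max(self.n, 1) * self.num_submodels
        return np.clip(np.asarray(pred, dtype=int), 0, self.num_submodels - 1)

    def lookup(self, key):
        s = int(self._route(np.asarray([key]))[0])
        guess = int(np.clip(np.polyval(self.submodels[s], key), 0, self.n - 1))
        lo = max(0, guess - self.errors[s] - 1)             # search only inside the error bound
        hi = min(self.n, guess + self.errors[s] + 2)
        i = bisect.bisect_left(self.keys.tolist(), key, lo, hi)
        return i if i < self.n and self.keys[i] == key else None

keys = np.random.default_rng(0).integers(0, 10_000, size=1_000)
rmi = TwoStageRMI(keys)
assert all(rmi.lookup(int(k)) is not None for k in keys)
print("all lookups found their keys")
```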
FITing-Tree

Error-Bounded Linear Segment: Given a threshold 𝑒𝑟𝑟𝑜𝑟, a segment from its first point to its last
point is not valid if some intermediate point (𝑥, 𝑦) is further than 𝑒𝑟𝑟𝑜𝑟 from the interpolated line.

ShrinkingCone (building a segment): Point 1 is the origin of the cone. Point 2 is then added,
resulting in the dashed cone. Point 3 is added next, yielding the dotted cone. Point 4 is outside
the dotted cone and therefore starts a new segment.

[Image source] Galakatos et al., FITing-Tree: A Data-aware Index Structure. SIGMOD, 2019
132
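A minimal sketch of the ShrinkingCone idea: walk over the sorted keys, maintain the cone of slopes that keep every seen point within the error bound, and start a new segment when the next point falls outside the cone. This simplified version glosses over corner cases (e.g., long runs of duplicate keys) that FITing-Tree handles explicitly.

```python
def shrinking_cone_segments(keys, error):
    """Greedily split sorted keys into linear segments whose interpolation error stays within `error`.
    Returns a list of (start_key, start_pos, slope) triples."""
    segments = []
    i, n = 0, len(keys)
    while i < n:
        origin_key, origin_pos = keys[i], i
        lo_slope, hi_slope = 0.0, float("inf")   # the cone of feasible slopes
        j = i + 1
        while j < n:
            dx = keys[j] - origin_key
            if dx == 0:                          # duplicate key: keep it in the current segment
                j += 1
                continue
            # Slopes that keep point j within +/- error of the segment line.
            lo_j = (j - origin_pos - error) / dx
            hi_j = (j - origin_pos + error) / dx
            new_lo, new_hi = max(lo_slope, lo_j), min(hi_slope, hi_j)
            if new_lo > new_hi:                  # point j is outside the cone: close the segment
                break
            lo_slope, hi_slope = new_lo, new_hi
            j += 1
        slope = lo_slope if hi_slope == float("inf") else (lo_slope + hi_slope) / 2.0
        segments.append((origin_key, origin_pos, slope))
        i = j
    return segments

keys = [1, 3, 4, 8, 20, 21, 22, 23, 50, 90]
print(shrinking_cone_segments(keys, error=2))
```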
RMI v.s. FITing-Tree
Figure: both indexes organize linear sub-models (e.g., y = ax + b); RMI places a learned root model
on top of its sub-models, while FITing-Tree organizes its segments with a B-Tree.

133
More Learned Index Methods
• PGM [1] improves FITing-Tree by finding the optimal number of learned
segments given an error bound.

• ALEX [2] proposes an adaptive RMI with workload-specific optimization,


achieving high performance on dynamic workloads.

• RadixSpline [3] gains competitive performance with a radix structure


while using a single-pass training.

• Multi-dimensional indexes: NEIST [4], Flood [5], Tsunami [6] and LISA [7].
134
More Learned Index Methods
• [1] The PGM-Index: A Fully-Dynamic Compressed Learned Index with Provable Worst-Case Bounds. PVLDB, 2020.
• [2] ALEX: An Updatable Adaptive Learned Index. SIGMOD, 2020.
• [3] RadixSpline: A Single-Pass Learned Index. In aiDM Workshop on SIGMOD, 2020.
• [4] NEIST: a Neural-Enhanced Index for Spatio-Temporal Queries. TKDE, 2019.
• [5] Learning Multi-dimensional Indexes. SIGMOD, 2020.
• [6] Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads. PVLDB, 2020.
• [7] LISA: A Learned Index Structure for Spatial Data. SIGMOD, 2020.

135
Questions about Learned Indexes
How to systematically analyze and design
machine learning based indexing methods?

More scalable index learning methods?

Which class of models suffice?

136
Task Definition
• Given a database D with n records (rows), let’s assume that a range
index structure will be built on a specific column x. For each record
𝑖 ∈ [𝑛], the value of this column, 𝑥_𝑖, is adopted as the key, and 𝑦_𝑖 is
the position where the record is stored.

• We want to learn a mechanism 𝑀 that takes the key as input and
outputs a predicted position 𝑦̂ ← 𝑀(𝑥) for accessing data.

137
Learning Index: A Machine Learning Task

Figure: the formulation of index learning; the highlighted term measures the cost of calculating 𝑀(𝑥).

A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf
138
Learning Index: A Machine Learning Task

Figure: the objective function combines the training loss and a regularization term, balanced by a
trade-off parameter.

A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf
139
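A minimal sketch of the objective on the slides above, using a piecewise-linear model family: the training loss is the squared position error, and the regularizer charges for the cost of computing M(x), here simply proxied by the number of segments. The cost proxy and the trade-off weight are illustrative assumptions.

```python
import numpy as np

def piecewise_linear_objective(keys, num_segments, tradeoff=0.5):
    """objective = training loss (mean squared position error)
                 + tradeoff * regularization (number of segments, a proxy for the cost of M(x))."""
    keys = np.asarray(keys, dtype=float)
    pos = np.arange(len(keys), dtype=float)
    loss = 0.0
    for idx in np.array_split(np.arange(len(keys)), num_segments):
        if len(idx) >= 2:
            coef = np.polyfit(keys[idx], pos[idx], 1)            # one linear segment
            loss += float(((np.polyval(coef, keys[idx]) - pos[idx]) ** 2).sum())
    loss /= len(keys)
    return loss + tradeoff * num_segments

# More segments reduce the training loss but increase the regularization term.
keys = np.sort(np.random.default_rng(0).uniform(0, 1_000, size=2_000))
for k in (1, 4, 16, 64):
    print(k, round(piecewise_linear_objective(keys, k), 3))
```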
Benefits of Learned Index
• Smaller Size
• Faster Index Seek
• Better Handling Index Update
• Generalization ability of machine learning
• Incremental learning

• Question Mark
• Is model training/inference scalable enough?

140
Learned Index with Sampling
• How large does the sample need to be?
• 𝑛 is the data size
• 𝑀∗ is fully optimized

Fig: Illustration of sampling

A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf
141
Learned Index with Sampling

• Up to 78x building speedup
• Non-degraded performance in terms of query time and prediction error

Fig: Illustration of sampling

A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf
142
Is Linear Model Sufficient?
• Linearization of a learned model
A learned model: 𝒚̂ = 𝑴(𝒙)

143
Is Linear Model Sufficient?
• Linearization of a learned model
A learned model: 𝒚̂ = 𝑴(𝒙)
Landmark points: …, (𝒙_𝒍, 𝒚_𝒍), (𝒙_𝒓, 𝒚_𝒓), …

144
Is Linear Model Sufficient?
• Linearization of a learned model
A learned model: 𝒚̂ = 𝑴(𝒙)
Landmark points: …, (𝒙_𝒍, 𝒚_𝒍), (𝒙_𝒓, 𝒚_𝒓), …

Linearized model: 𝒚̃ = 𝑴_𝐋(𝒙),
connecting (𝒙_𝒍, 𝒚̂_𝒍 = 𝑴(𝒙_𝒍)) to (𝒙_𝒓, 𝒚̂_𝒓 = 𝑴(𝒙_𝒓))

145
Is Linear Model Sufficient? Yes! As long as landmark
points are dense enough

• Linearization of a learned model


A learned model: 𝒚̂ = 𝑴(𝒙)
Landmark points: …, (𝒙_𝒍, 𝒚_𝒍), (𝒙_𝒓, 𝒚_𝒓), …

Linearized model: 𝒚̃ = 𝑴_𝐋(𝒙),
connecting (𝒙_𝒍, 𝒚̂_𝒍 = 𝑴(𝒙_𝒍)) to (𝒙_𝒓, 𝒚̂_𝒓 = 𝑴(𝒙_𝒓))

Theorem 2. Suppose ∀𝑥, |𝒚̂ − 𝑦| ≤ 𝜖. After linearization, we have ∀𝑥, |𝒚̃ − 𝑦| ≤ 3𝜖 + 2(𝒚_𝒓 − 𝒚_𝒍).

A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf
146
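A minimal sketch of the linearization step behind Theorem 2: given any learned model M and a set of landmark keys, the linearized model interpolates between M's predictions at the two landmarks surrounding the query key. The toy "learned model" and the choice of landmarks (every 50th key) are illustrative assumptions.

```python
import bisect
import numpy as np

def linearize(model, landmark_keys):
    """Return M_L: for a query x, interpolate between the model's predictions at the two
    landmark keys x_l <= x <= x_r (cf. Theorem 2's error bound 3*eps + 2*(y_r - y_l))."""
    xs = sorted(landmark_keys)
    ys = [model(x) for x in xs]
    def m_linear(x):
        r = min(max(bisect.bisect_left(xs, x), 1), len(xs) - 1)
        l = r - 1
        if xs[r] == xs[l]:
            return ys[l]
        t = (x - xs[l]) / (xs[r] - xs[l])
        return ys[l] + t * (ys[r] - ys[l])
    return m_linear

# Toy usage: the "learned model" is the exact rank plus bounded noise (an epsilon-accurate model).
rng = np.random.default_rng(0)
keys = np.sort(rng.uniform(0, 100, size=2_000))
model = lambda x: np.searchsorted(keys, x) + rng.normal(scale=3.0)
m_lin = linearize(model, keys[::50])
max_err = max(abs(m_lin(float(k)) - i) for i, k in enumerate(keys))
print("max |M_L(x) - y| =", round(max_err, 1))
```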
Sampling-Restriction-Linearization
Sampled data points as landmark points:
…, (𝒙_𝒍, 𝒚_𝒍), (𝒙_𝒓, 𝒚_𝒓), …

Learned Index on Sampled Data Restriction Linearization

A Pluggable Learned Index Method via Sampling and Gap Insertion, https://arxiv.org/pdf/2101.00808.pdf
147
Open Questions
• How to handle extremely outlier keys?

• How to maintain index on updating data? [2]

• How to handle multi-dim data? [5, 6, 7]

• How to build it into real DB systems?


• without too much modification to the current system

148
AutoML Tools

149
Availability
• Hyperparameter Optimization: Learning to Mutate with Hypergradient Guided Population, NeurIPS 2020.
• Compressed Model Search: AdaBERT: Task-Adaptive BERT Compression with D-NAS, IJCAI 2020.
  https://arxiv.org/abs/2001.04246
• Auto Feature Generation: FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data, KDD 2021.
  https://arxiv.org/abs/2007.14573
• Meta-Learning: Automated Relational Meta-learning, ICLR 2020.
  https://arxiv.org/abs/2001.00745

AutoML

150
Availability
• Hyperparameter Optimization: publicly available at Alibaba Platform of A.I., AutoML product
• Compressed Model Search: publicly available at Alibaba Platform of A.I., EasyTransfer product
• Feature Generation
• Meta-Learning

AutoML

151
A Summary of AutoML Tools
Name                    Authors                     Functionalities                          Algorithms            Language
Auto Tune Models (ATM)  MIT                         AutoFeature, Model Selection, HPO        BO and Bandit         Python
AutoKeras               Texas A&M University        NAS                                      BO                    Python
NNI                     Microsoft                   AutoFeature, HPO, NAS, Model Selection   Comprehensive         Python
emukit                  Amazon                      HPO                                      Meta-surrogate model  Python
Ray Tune                Berkeley                    HPO                                      Comprehensive         Python
TPOT                    University of Pennsylvania  AutoFeature, Model Selection, HPO        Genetic programming   Python
More AutoML packages include AutoFolio, Auto-sklearn, Auto-PyTorch, Auto-WEKA, etc.

152
Tutorial Schedule

Yaliang Li, Background and Overview of AutoML


Hyperparameter Optimization

Zhen Wang, Neural Architecture Search


Meta-Learning

Yuexiang Xie, Automatic Feature Generation

Ce Zhang, VolcanoML: End-to-End AutoML via


Scalable Search Space Decomposition

Bolin Ding, Machine Learning Guided Database

153
Thank you!

Yaliang Li, Zhen Wang, Yuexiang Xie, Bolin Ding, and Ce Zhang

Email: [email protected]
Please feel free to contact us if you have any questions,
or you are interested in full-time or research intern positions.

154
