
SketchX, CVSSP, University of Surrey
{h.bandyopadhyay, p.chowdhury, a.sain, s.koley, t.xiang, a.bhunia, y.song}@surrey.ac.uk

Do Generalised Classifiers really work
on Human Drawn Sketches?

Abstract

This paper, for the first time, marries large foundation models with human sketch understanding. We demonstrate what this brings – a paradigm shift in terms of generalised sketch representation learning (e.g., classification). This generalisation happens on two fronts: (i) generalisation across unknown categories (i.e., open-set), and (ii) generalisation traversing abstraction levels (i.e., good and bad sketches), both being timely challenges that remain unsolved in the sketch literature. Our design is intuitive and centred around transferring the already stellar generalisation ability of CLIP to benefit generalised learning for sketches. We first “condition” the vanilla CLIP model by learning sketch-specific prompts using a novel auxiliary head of raster to vector sketch conversion. This importantly makes CLIP “sketch-aware”. We then make CLIP acute to the inherently different sketch abstraction levels. This is achieved by learning a codebook of abstraction-specific prompt biases, a weighted combination of which facilitates the representation of sketches across abstraction levels – low abstract edge-maps, medium abstract sketches in TU-Berlin, and highly abstract doodles in QuickDraw. Our framework surpasses popular sketch representation learning algorithms in both zero-shot and few-shot setups and in novel settings across different abstraction boundaries.

Keywords:
Sketch Classification Sketch Abstraction Zero-Shot

Naively training CLIP + prompt learning on QuickDraw (QD), or TU-Berlin (TU) sketches, or Edgemaps (EM) fails to generalise to multiple abstraction levels.

Training    Evaluation   CLIP    Ours
QD          QD+TU+EM     43.24   47.35 (↑4.1)
TU          QD+TU+EM     43.71   51.03 (↑7.3)
EM          QD+TU+EM     42.91   43.76 (↑0.8)
QD+TU+EM    QD+TU+EM     45.59   62.96 (↑17.4)
Figure 1: Unlike photos, sketch classification poses additional challenges such as abstraction – humans draw differently based on their subjective interpretations, sketching ability, and drawing time. Existing datasets such as TU-Berlin [23] and QuickDraw [31] only capture the time axis and collect human sketches drawn under 280 and 20 seconds, respectively. Following [23, 31, 33, 70] we consider Edgemaps (EM) as low abstract, TU-Berlin (TU) sketches as medium abstract, and QuickDraw (QD) ones as highly abstract (left). Naively training CLIP via prompt learning [91] on sketches (right) from one abstraction level (QD, TU, or EM) individually does not generalise across varying abstractions (QD + TU + EM). Jointly training on multiple abstractions (QD + TU + EM) is also sub-optimal (45.6 on CoOp [91] vs 62.9 on Ours). (middle) Importantly, our proposed method predicts a classification score and an abstraction level for input sketches. Plotting classification accuracy vs predicted abstraction level reveals a scope for improvement (shaded region), showing that, despite our significant improvement in classification (↑17.4%) over naive CLIP + prompt learning, generalisation across varying sketch abstractions is still an open problem. We hope this will motivate future works to democratise existing methods [51, 90] for human drawn sketches.

1 Introduction

The vision community is witnessing a paradigm shift in the face of large foundation models [51, 54]. Instead of learning visual semantics from scratch, the rich semantics inherent to foundation models are explored to enrich visual learning, as in [3, 24, 39] for retrieval, and [14, 43, 73] for generation. The most salient advantage that foundation models bring is their generalisation ability [91, 26, 90, 38], which made a significant impact on zero-shot and few-shot learning.

In this paper, we marry human sketches with foundation models to naturally tackle two of the most significant bottlenecks in the sketch community, with little effort, piggybacking on the generalisability of foundation models. First is the data scarcity problem of human drawn sketches – the largest sketch dataset (QuickDraw [31]) contains 350 categories compared with the easily >1000 categories for photos [18], which makes generalised learning for sketch even more pronounced a problem. The second is that, although everyone can sketch, most people sketch differently as a result of varying drawing skills and diverse subjective interpretations – see Fig. 1 for distinctly different sketches of a “bike”. These sketch-specific challenges call for a single model generalising along two axes – (i) across unseen categories for the first data scarcity challenge, and (ii) across abstraction levels for the second challenge of sketches exhibiting varying abstraction levels.

Solving these challenges, we show that it all comes down to making CLIP [51] sketch-specific. For the former (data scarcity problem), we learn a set of continuous vectors (visual prompts) injected into CLIP’s vision transformer encoder. This enables CLIP (pre-trained on ∼400M image-text pairs) to adapt to sketches while preserving its generalisation ability – resulting in a generalised sketch recogniser that works across unknown categories. More specifically, we first design two sets of visual prompts – shallow prompts injected into the initial layer of the CLIP transformer and deep prompts injected up to a depth of 9 layers. Keeping the rest of CLIP frozen, we train these prompts on the extracted sketch and [category] embeddings (as class labels) from the CLIP text encoder using cross-entropy loss. Although shallow+deep prompts encourage CLIP to become “sketch-aware”, they do not model any sketch-specific traits during training. Hence, we additionally use an auxiliary task of raster to vector sketch conversion that exploits the dual representation of sketch [7] to reinforce that awareness.

The latter challenge of dealing with sketch abstraction (denoted as $\mathbb{A}$ for the rest of the paper) is less obvious and extends the status quo of what is possible with foundation models. While there is no consensus on what constitutes an abstract sketch [17], prior works follow: (i) number of strokes (more strokes → less abstract) [70], and (ii) sketch drawing time (more time → less abstract) [31, 23]. Since precisely annotating the abstraction score for each sketch is ill-posed, we learn sketch abstraction in a semi-supervised manner. Particularly, we assign semi-accurate, coarse abstraction levels as: (i) doodles from the QuickDraw dataset [31] drawn under 20 seconds as high abstraction, (ii) freehand sketches from TU-Berlin [23] drawn under 280 seconds as medium abstraction, and (iii) Edgemaps from [86] as low abstraction. To model the continuous abstraction spectrum (low → high), we employ codebook-learning [46]. Our abstraction codebook comprises three “codes” (each a learned feature embedding), where the continuous abstraction is modelled by mixing (a weighted average) the discrete codes. To make CLIP “abstraction-aware”, the abstraction vector (from mixing codes) is injected as an additional prompt to the generalised sketch classifier. To the best of our knowledge, ours is the first work showing the potential of combining codebook learning [46] and prompt learning [92] for generalised sketch representation learning.
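The mixing of discrete codes into a continuous abstraction vector can be illustrated with a minimal sketch. It assumes the mixing weights come from a softmax over three scalar scores; the text above only states that the codes are combined by a weighted average, so the softmax, the 4-dimensional toy codes, and the names `softmax` and `mix_codes` are hypothetical choices made for illustration.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_codes(codebook, scores):
    """Weighted average of the codebook rows (one row per abstraction level)."""
    weights = softmax(scores)  # assumption: weights are produced by a softmax
    dim = len(codebook[0])
    return [sum(w * code[i] for w, code in zip(weights, codebook))
            for i in range(dim)]

# Toy codebook with three codes (low, medium, high abstraction).
# In practice these would be learned embeddings, not fixed values.
codebook = [
    [1.0, 0.0, 0.0, 0.0],  # low abstraction (edge-maps)
    [0.0, 1.0, 0.0, 0.0],  # medium abstraction (TU-Berlin)
    [0.0, 0.0, 1.0, 0.0],  # high abstraction (QuickDraw)
]

# Equal scores give an equal blend of the three codes.
abstraction_prompt = mix_codes(codebook, [0.0, 0.0, 0.0])
```

Because the weights vary continuously, the mixed vector can represent abstraction levels lying between the three discrete codes.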

In summary, our contributions are: (i) we, for the first time, marry human sketches with foundation models to tackle two of the most significant bottlenecks facing the sketch community – data scarcity and abstraction levels. (ii) For data scarcity, we achieve generalisation across unseen categories by adapting CLIP for sketch classification via prompt learning. (iii) To further make CLIP “sketch-aware”, we exploit sketch-specific traits like raster-to-vector sketch conversion as an auxiliary loss. (iv) For abstraction, we achieve generalisation across varying abstraction levels using a codebook-driven approach, where a mixup of learned codebook vectors acts as prompts that interface with CLIP, making our model robust in recognising sketches from multiple abstraction levels, including those not seen during training.

2 Related Works

Sketch for Visual Understanding: Sketches can depict visual concepts [33] using intuitive free-hand drawings, overcoming linguistic barriers often faced in text representations. Fine-grained in nature, a sketch is an attractive query medium for tasks like sketch-based image [21, 57, 15, 64] and 3D shape retrieval [77]. Creative sketches [27] encouraged sketch-based synthesis and editing of images [84, 42, 82], natural objects or scenes [13, 25, 72], faces [12], and animation frames [81]. Despite being expressed as monochrome lines on a 2D plane, sketches convey complex 3D structures and find use in 3D shape modelling [30, 89, 78, 5]. As an interactive medium, a sketch is an important input modality in vision tasks like sketch-based object detection [67], image-inpainting [82], representation learning [71], incremental learning [9], image-segmentation [34], etc. Beyond its discriminative [57] or representative [71] potential, sketch has also been employed for pictionary style gaming [8]. The success of sketch-based visual understanding leads us to propose a framework for recognising any (open-set [4]) free-hand drawing on unseen categories.

Sketch Classification: Early approaches to sketch understanding [63] extract hand-crafted features like shape primitives [48], bag-of-words [23], or Fisher Vectors [60] from raster (static pixel-map) sketches. Better representations were formulated with deep learning algorithms, dramatically improving sketch classification on complete [59, 61] and partial sketches [62], surpassing even humans [83] in recognition. Alternate architectural designs encode the temporal order of vector strokes (pen coordinates & motions) [31] using RNNs [31] or Transformers [53]. Fusion of both raster representations for encoding structural information and vector representations for temporal cues has been employed with CNN-RNN feature fusion [40] for million-scale recognition with hashing [75], or with Graph Neural Networks [80, 76, 66, 50]. Various sketch-specific self-supervised methods have recently emerged that employ BERT-like [19] pre-training [41], jigsaw solving [47], or cross-modal translation between raster-vector dual representation of sketches [7]. While recognition in existing works is limited to seen classes, we, for the first time, introduce a zero-shot recognition pipeline that can recognise unseen classes under varying sketch abstraction levels.

CLIP in Vision Tasks: Contrastive Language-Image Pre-training (CLIP) [51] pairs rich semantics from text-descriptions with large-scale (∼400M) image datasets, exploiting underlying relationships in cross-modal data (image+text) by representing them in the same embedding space. As such, CLIP features are highly generalisable in downstream tasks using very few (few-shot) or no labels (zero-shot) [91] as opposed to traditional features (trained on discrete labels). This adaptation to downstream tasks like few-shot recognition [26], retrieval [3, 55], object detection [29], semantic segmentation [52], image generation [14], etc., is done primarily through prompting, first introduced for NLP [10]. Prompting, in general, involves construction of a task-specific template (e.g., ‘[MASK] is the capital of France’), which is then filled with word labels (e.g., ‘London/Paris’). Prompts give context to the word labels, forming an appropriate text description. Engineering prompts by hand requires domain expertise, hence recent works (prompt tuning) model them as learnable continuous vectors [90, 91] to be optimised directly during fine-tuning. Beyond language prompts, visual prompts in the form of continuous vectors have also been explored in visual feature extractors like ViT [35]. In contrast to learning general [91, 35, 2] or instance-specific prompts [90], we learn continuous abstraction prompts modelled on the abstraction level of the input sketch. This enables the recognition of a wide range of sketches, from professional edge-map like drawings to free-hand abstract doodles [31].

Abstraction in Sketches: Sketch abstraction ($\mathbb{A}$) was first defined as a factor of time [6] with an observation that users tend to draw only salient regions in a constrained time setting. This idea was later modelled as a trade-off between compactness and recognisability [45, 44] for the ‘generation’ of abstract strokes by removing strokes based on their salience. Parametric representations model abstraction through Bézier curves [16] and differential equations [17], controlling stroke “complexity” with Bézier control points and sinusoidal frequencies respectively. Alternate representations of abstraction include modelling sketch as a combination of appearance and structure [79] in a coarse-to-fine manner through hierarchical feature encoder learning [56], or via a composition of predefined drawing primitives [1] to identify the most distinctive parts of the sketch, and ground them into interpretable features. In contrast, we represent abstraction on a continuous spectrum, varying from Edgemaps (equivalent to professional sketches) to human-drawn doodles (highly abstract) by learning abstraction-specific codebook vectors, which we interpolate in a zero/few-shot sketch recognition setup.

3 Background

CLIP: The generalisability of CLIP makes it a popular choice [55] for open-set vision-language tasks. Specifically, CLIP uses 2 independent encoders: (i) a ResNet [32] or Vision Transformer [20] image encoder, and (ii) a transformer-based [68] text encoder. The Vision Transformer image encoder $\mathrm{F}$ processes input images as $r$ fixed-sized patches $\mathrm{I}=\{p_1,\dots,p_r\}$; $p_j\in\mathbb{R}^{3\times h\times w}$ that are embedded along with an extra learnable class token [19] $c^p$. These are then passed through transformer layers [68] with multi-head attention to obtain the visual features $f_p=\mathrm{F}(\mathrm{I},c^p)\in\mathbb{R}^d$.
The text input (say, $n$ words) is pre-processed to word-embeddings $W_0=\{\mathbf{w}_0^j\}_{j=1}^{n}$, from which the text transformer $\mathrm{T}$ extracts textual features as $t_0=\mathrm{T}(W_0)\in\mathbb{R}^d$. Since CLIP maps cross-modal (image+text) features on the same embedding space, features of text-photo pairs have maximal similarity $\texttt{sim}(\cdot)$ compared to features from unpaired (mismatched) samples after contrastive training. For zero-shot classification, textual prompts like ‘a photo of a [category]’ (from a list of $K$ categories) are used to obtain category-specific textual features $\{t_j\}_{j=1}^{K}$. The probability of input photo $\mathrm{I}$ (with photo-feature $f_p$) belonging to the $y^{\text{th}}$ category can be calculated as

$$\mathbb{P}(y|\mathrm{I})=\frac{\exp\,(\texttt{sim}\,(f_p,t_y)/\tau)}{\sum_{j=1}^{K}\exp\,(\texttt{sim}\,(f_p,t_j)/\tau)} \tag{1}$$
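As a rough illustration of Eq. 1, the sketch below computes class probabilities from one image feature and $K$ text features. It assumes that $\texttt{sim}(\cdot)$ is cosine similarity, and the temperature value, the 2-dimensional toy features, and the function names `cosine_sim` and `zero_shot_probs` are hypothetical.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two vectors (assumed form of sim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_probs(f_p, text_feats, tau=0.01):
    """Eq. 1: softmax over temperature-scaled similarities to each class."""
    logits = [cosine_sim(f_p, t_j) / tau for t_j in text_feats]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy features: the image feature is aligned with the first class.
f_p = [1.0, 0.0]
text_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = zero_shot_probs(f_p, text_feats)
predicted = max(range(len(probs)), key=lambda j: probs[j])
```

Since the class set only enters through `text_feats`, adding a new category amounts to encoding one more prompt, which is what makes this classifier open-set.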

Prompt Learning: Prompt learning uses foundation models, like BERT [19] and CLIP [51], as knowledge bases from which useful information can be extracted for downstream tasks [49]. Apart from using handcrafted prompts like ‘a photo of a’, recent methods like CoOp [91] and CoCoOp [90] learn $n$ continuous context vectors $\{v_1,\dots,v_n\}$, each $v_j\in\mathbb{R}^{d_t}$ having the same dimension as the word embeddings. With base CLIP frozen, the continuous context vectors $v_j$ are learned by backpropagating gradients through the text encoder $\mathrm{T}(\cdot)$. Using the word embedding for the $k$-th ‘[category]’, denoted as $w_0^k\in\mathbb{R}^{d_t}$, the prompt is constructed as $[v_1,\dots,v_n,w_0^k]$, and passed to the transformer to obtain the text feature $f_t=\mathrm{T}([v_1,\dots,v_n,w_0^k])$. Recent works learn continuous context vectors (i.e., prompts) for text [91], for images [2], or jointly as multimodal prompts [38]. Variations of prompt learning also include shallow vs. deep [74] prompts, depending on the layers where the context vectors are injected. We use deep prompt learning to adapt CLIP to design a generalised sketch classifier.
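The difference between shallow and deep prompts can be sketched with a toy stack of layers. This is only an illustration under stated assumptions: `toy_layer` is a hypothetical stand-in for a frozen transformer block (it is not attention), `forward_with_prompts` is a hypothetical helper, and prompt tokens are prepended to the content tokens as in the CoOp-style prompt $[v_1,\dots,v_n,w_0^k]$.

```python
def toy_layer(tokens):
    """Stand-in for a frozen block: add the mean of all tokens to each token."""
    n = len(tokens)
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / n for i in range(dim)]
    return [[t[i] + mean[i] for i in range(dim)] for t in tokens]

def forward_with_prompts(content, prompts, num_layers):
    """Run content tokens through num_layers toy layers.

    prompts[j] holds the learnable prompt tokens injected at layer j.
    Layers beyond len(prompts) reuse the prompt outputs of the previous
    layer instead of fresh learnable prompts. Assumes len(prompts) >= 1.
    """
    depth = len(prompts)
    n_prompt = len(prompts[0])
    carried = None
    for layer in range(num_layers):
        # Fresh prompts up to the chosen depth, carried-over outputs afterwards.
        prompt_tokens = prompts[layer] if layer < depth else carried
        out = toy_layer(prompt_tokens + content)
        carried, content = out[:n_prompt], out[n_prompt:]
    return content

content = [[1.0, 0.0]]
v = [[0.0, 1.0]]  # one prompt token; learned during training in practice

shallow_out = forward_with_prompts(content, [v], num_layers=2)   # depth 1
deep_out = forward_with_prompts(content, [v, v], num_layers=2)   # depth 2
```

With shallow prompts only the first layer receives learnable context, whereas deep prompts replace the context at every layer up to the chosen depth, so the two settings produce different content features even in this toy setup.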

4 Proposed Methodology

In this paper, we build a generalised sketch classifier that works in an unseen setup. This “unseen” problem in sketch representation learning has two axes: (a) generalisation across unseen categories – train on ‘cats’ or ‘dogs’ (not on ‘zebras’) but evaluate on ‘zebras’; and (b) generalisation across abstractions – a sketch can be drawn in 20 seconds (i.e., highly abstract doodle) or 280 seconds (TU-Berlin sketches [23]). For generalisation across categories, we use the open-vocabulary potential of CLIP [51], which has excellent generalisation across several downstream tasks. Particularly, we show how off-the-shelf CLIP is sub-optimal, and how a simple yet significant sketch-specific adaptation with prompt learning [92] and a raster-to-vector self-reconstruction objective [7] can help generalisation to unseen categories. Generalising across abstraction ($\mathbb{A}$) levels is challenging as $\mathbb{A}$ is a hard-to-label and subjective metric (e.g., it is hard to quantify “how abstract is that sketch” on a scale of 0 to 1.0). Hence, we propose a weakly-supervised codebook-learning paradigm [46] to learn generalisation across sketch abstractions.

4.1 Generalisation Across Unseen Categories

Baseline Sketch Classifier: Sketch classification aims at predicting the category a given query-sketch belongs to. An input raster sketch $\mathrm{I}_S\in\mathbb{R}^{3\times H\times W}$ is encoded using a backbone feature extractor $f_s=\mathrm{F}(\mathrm{I}_S)\in\mathbb{R}^{d}$ like ResNet-101 [32], followed by a classifier $\mathrm{F}_c:\mathbb{R}^{d}\rightarrow\mathbb{R}^{K}$ that maps it to a $K$-dimensional vector, classifying $\mathrm{I}_S$ into $K$ predefined categories as $f_c=\mathrm{F}_c(f_s)\in\mathbb{R}^{K}$. Both backbone $\mathrm{F}(\cdot)$ and classifier $\mathrm{F}_c(\cdot)$ are learned given ground-truth class $\hat{f}_{c,k}$ as,

$$\mathcal{L}_{\text{CE}}=-\hat{f}_{c,k}\log\frac{\exp(f_{c,k})}{\sum_{j=1}^{K}\exp(f_{c,j})} \tag{2}$$
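A minimal sketch of this closed-set baseline is given below. It assumes $\mathrm{F}_c$ is a single linear layer and that the ground truth is one-hot, so that Eq. 2 reduces to the negative log-softmax of the true class. The names `linear_head` and `cross_entropy` and all numeric values are hypothetical.

```python
import math

def linear_head(f_s, weight, bias):
    """F_c: map a d-dimensional feature to K class scores (assumed linear)."""
    return [sum(w * x for w, x in zip(row, f_s)) + b
            for row, b in zip(weight, bias)]

def cross_entropy(f_c, k):
    """Eq. 2 with a one-hot ground truth at index k."""
    m = max(f_c)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(s - m) for s in f_c))
    return log_z - f_c[k]

# Toy setup: d = 2 features, K = 3 fixed categories.
f_s = [1.0, 2.0]
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]

f_c = linear_head(f_s, weight, bias)
loss = cross_entropy(f_c, k=1)
```

The number of rows in `weight` fixes $K$ at training time, which is why this head cannot score a category it was never trained on.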

Prompt Learning to Adapt CLIP for Sketches: We use CLIP with ViT visual encoder [51] which extends the fixed set classifier $\mathrm{F}_c$ in Eq. 2 into an open-set setup. We now learn $J$ sketch prompts $\mathbf{v^s}=\{v^s_0,\dots,v^s_{J-1}\}$, where $v^s_j\in\mathbb{R}^{5\times d_p}$. First, we divide the raster sketch into $r$ fixed-sized patches $\mathrm{I}_S=\{s_1,\dots,s_r\}$ where each patch $s_i\in\mathbb{R}^{3\times h\times w}$ is embedded as $E_0=\{e_0^j\}_{j=1}^{r}$.
Next, the learnable prompts are injected into each transformer block of CLIP vision transformer $\mathrm{F}(\cdot)$ up to a specific depth $J$, as

$$\begin{split}[c^p_j,E_j,\underline{\phantom{a}}\;]&=\mathrm{F}_j([c^p_{j-1},E_{j-1},v^s_{j-1}])\;\big|_{j=1}^{J}\\ [c^p_i,E_i,v^s_i]&=\mathrm{F}_i([c^p_{i-1},E_{i-1},v^s_{i-1}])\;\big|_{i=J+1}^{N}\\ f_s&=\texttt{ImageProj}(c^p_N)\end{split} \tag{3}$$

where, $c^p_0$ is a pre-trained [CLS] token (see Sec. 3). To classify the visual feature $f_s\in\mathbb{R}^{d}$, a natural choice is handcrafted prompts like ‘a photo of a [category]’, encoded using the CLIP text encoder $\mathrm{T}(\cdot)$ into $f_t$ as in Eq. 1. However, (i) our input is a ‘sketch’, not a ‘photo’, and (ii) handcrafted prompts are sub-optimal compared to learnable prompts [91], $\mathbf{v^t}=\{v^t_0,\dots,v^t_{J-1}\}$; $v^t_j\in\mathbb{R}^{5\times d_t}$. Hence, we inject prompts $\mathbf{v^t}$ in $\mathrm{T}(\cdot)$ up to depth $J$ as,

$$\begin{split}[\;\underline{\phantom{a}}\;,w_j^k]&=\mathrm{T}_j([v^t_{j-1},w_{j-1}^k])\;\big|_{j=1}^{J}\\ [v^t_i,w_i^k]&=\mathrm{T}_i([v^t_{i-1},w_{i-1}^k])\;\big|_{i=J+1}^{N}\\ f_t&=\texttt{TextProj}(w_N^k)\end{split} \tag{4}$$

where $w^{k}_{0}$ is the word embedding of ‘[category]’. Naively using learnable prompts $\mathbf{v^{t}}$ overfits to training/seen categories, lacking generalisation to unseen categories [90]. Hence, we use a lightweight Meta-Net $\pi=\mathrm{H}(f_{s})$ to predict an instance-specific context $\pi\in\mathbb{R}^{5\times d_{t}}$ that shifts $\mathbf{v^{t}}$ as $\mathbf{v^{t}}(f_{s})=\{v^{t}_{0}+\pi,\dots,v^{t}_{J-1}+\pi\}$. Intuitively, Meta-Net (a two-layer Linear-ReLU-Linear network) reduces overfitting of $\mathbf{v^{t}}$ to training/seen categories, generalising better to unseen classes using sketch-conditional prompts $\mathbf{v^{t}}(f_{s})$. Finally,

\[
\mathcal{L}_{\text{CE}} = -\log\frac{\exp\big(\texttt{sim}(f_{s}, \mathrm{T}([\mathbf{v^{t}}(f_{s}), w^{i}_{0}]))/\tau\big)}{\sum_{j=1}^{K}\exp\big(\texttt{sim}(f_{s}, \mathrm{T}([\mathbf{v^{t}}(f_{s}), w^{j}_{0}]))/\tau\big)}
\tag{5}
\]
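As a rough illustration of Eq. 5, the sketch below computes the loss for one sketch feature against $K$ already-encoded text features. It assumes that `sim` is cosine similarity and that $\tau = 0.07$; neither is stated in this section. The names `cosine_sim` and `clip_ce_loss` are hypothetical, and the text encoder with its prompts is deliberately left out.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors (assumed form of `sim`)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_ce_loss(f_s, text_feats, target, tau=0.07):
    """Eq. 5: negative log-softmax over temperature-scaled similarities.

    f_s        : sketch feature (list of floats)
    text_feats : one text feature per category, assumed precomputed
    target     : index i of the ground-truth category
    tau        : temperature (0.07 is an assumption, not taken from the paper)
    """
    logits = [cosine_sim(f_s, f_t) / tau for f_t in text_feats]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denom - logits[target]


# Toy 2-D features: category 0 is aligned with the sketch, category 2 is opposite.
f_s = [1.0, 0.0]
text_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_correct = clip_ce_loss(f_s, text_feats, target=0)
loss_wrong = clip_ce_loss(f_s, text_feats, target=2)
```

The loss is small when the ground-truth text feature is the most similar one and grows as other categories become more similar to $f_{s}$.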

Auxiliary Loss using Sketch Specific Traits: Sketches are uniquely characterised by their existence in dual modalities: rasterised images $\mathrm{I}_{S}\in\mathbb{R}^{3\times H\times W}$ and vector coordinate sequences $\mathrm{I}_{V}\in\mathbb{R}^{N\times 5}$. Translating $\mathrm{I}_{S}\rightarrow\mathrm{I}_{V}$ forces the image encoder $\mathrm{F}(\cdot)$, particularly its learnable prompts $\mathbf{v^{s}}$, to learn sparse stroke information. Accordingly, we use a linear embedding layer [7] to obtain the initial hidden state of a Gated Recurrent Unit (GRU) decoder as $h_{0}=W_{h}f_{s}+b_{h}$. Its hidden state $h_{t}$ is updated as $h_{t}=\texttt{GRU}(h_{t-1};[f_{s},P_{t-1}])$, where $P_{t-1}$ is the last predicted point.
Next, an embedding layer is used to predict a five-element vector at each time step as $P_{t}=W_{p}h_{t}+b_{p}$, where $P_{t}=(x_{t},y_{t},q_{t}^{1},q_{t}^{2},q_{t}^{3})\in\mathbb{R}^{2+3}$, whose first two logits represent the absolute coordinate $(x,y)$ and the latter three the pen state $(q^{1},q^{2},q^{3})$.
We use mean-square error and categorical cross-entropy losses to train the raster $\to$ vector translation on ground-truth absolute coordinates $(\hat{x}_{t},\hat{y}_{t})$ and pen states $(\hat{q}_{t}^{1},\hat{q}_{t}^{2},\hat{q}_{t}^{3})$ as,

\[
\begin{split}
\mathcal{L}_{\text{S}\to\text{V}} &= \frac{1}{N}\sum_{t=1}^{N}\Big(||\hat{x}_{t}-x_{t}||_{2}+||\hat{y}_{t}-y_{t}||_{2}\Big) \\
&\quad -\frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{3}\hat{q}_{t}^{i}\log\Big(\frac{\exp(q_{t}^{i})}{\sum_{j=1}^{3}\exp(q_{t}^{j})}\Big)
\end{split}
\tag{6}
\]
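The two terms of Eq. 6 can be sketched on plain Python lists as follows. This is an illustrative reading, not the authors' implementation. It averages over the $N$ time steps, sums the cross-entropy over the three pen states, and uses squared error for the coordinate term, following the "mean-square error" wording in the text. The function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a short list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def raster_to_vector_loss(pred, target):
    """Illustrative version of Eq. 6.

    pred   : list of (x, y, q1, q2, q3); q1..q3 are unnormalised pen-state logits
    target : list of (x, y, q1, q2, q3); q1..q3 form a one-hot pen state
    """
    n = len(pred)
    coord_term = 0.0
    pen_term = 0.0
    for p, g in zip(pred, target):
        # Coordinate regression (squared error is an assumption, see lead-in).
        coord_term += (g[0] - p[0]) ** 2 + (g[1] - p[1]) ** 2
        # Categorical cross-entropy over the three pen states.
        log_probs = log_softmax(p[2:5])
        pen_term -= sum(q_hat * lp for q_hat, lp in zip(g[2:5], log_probs))
    return (coord_term + pen_term) / n


# One time step: exact coordinates, uninformative pen logits.
pred = [(0.5, 0.5, 0.0, 0.0, 0.0)]
target = [(0.5, 0.5, 1.0, 0.0, 0.0)]
loss_uniform = raster_to_vector_loss(pred, target)

# Same step, but with a confident and correct pen-state prediction.
loss_confident = raster_to_vector_loss([(0.5, 0.5, 5.0, 0.0, 0.0)], target)
```

With exact coordinates and uniform pen logits, only the cross-entropy term contributes, and it equals $\log 3$.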

4.2 Generalisation Across Sketch Abstractions

Pilot Study: Here we elaborate on Fig. 1 (right), which examines the generalisation of CLIP when abstractions vary from EM $\to$ TU $\to$ QD. We randomly select $40$ classes common across QD, TU and EM: $20$ seen classes to adapt CLIP via prompt learning (CoOp [91]), and $20$ unseen classes for zero-shot evaluation. We observe: (i) training and evaluating on the same abstraction (QD, or TU, or EM) performs $\sim 5.09\%$ better than training on one and jointly evaluating on QD + TU + EM. This drop signifies that naive CLIP + prompt learning fails to generalise across abstractions. (ii) Jointly training on sketches from QD + TU + EM only marginally improves accuracy, by $2.30\%$. This highlights an even deeper problem: simply scaling the training data will likely$^{2}$ not solve the varying sketch abstraction problem.

$^{2}$ Exploring scaling laws [36] for human sketch abstraction is an interesting and non-trivial problem for future work.

Overview: Although we can easily sketch at varying abstraction levels, collecting precise abstraction annotations is difficult. Fig. 1 (left) shows that, in general, QD doodles are more abstract than TU sketches; however, precisely annotating an abstraction score from $0\to 1$ for every sample (e.g., “bike” in QD) is an ill-posed problem. Prior works developed proxy metrics for sketch abstraction, like the number of strokes (fewer strokes $\Rightarrow$ higher abstraction) [70], drawing skills (amateur vs. professional) [28], and time to sketch (less time $\Rightarrow$ higher abstraction) [58]. Instead, we design our method based on the general consensus [23, 33, 31, 58] that (i) EMs are (usually) less abstract than TU, and (ii) QD doodles are (usually) more abstract than TU. In particular, while QD + TU + EM do not cover all possible sketch abstractions, EM and QD are a good approximation of the lower and upper bounds of sketch abstraction.

Figure 2: Number of sketches vs. class membership for $600$ sketch instances, defined by class labels ($\hat{\mathbb{A}}_{l},\hat{\mathbb{A}}_{m},\hat{\mathbb{A}}_{h}$) and softmax-normalised distributions ($\mathbb{A}_{l},\mathbb{A}_{m},\mathbb{A}_{h}$). Sketches are taken from $20$ unseen categories common across QD, TU, and EM, with $10$ sketches per category. Despite the expected peaks, a significant number of sketches lie in the continuous spectrum between $\hat{\mathbb{A}}_{l}\to\hat{\mathbb{A}}_{m}$ and $\hat{\mathbb{A}}_{m}\to\hat{\mathbb{A}}_{h}$ (overlaps).

Abstraction ($\mathbb{A}$) Learning without Annotations: Although drawing a sketch at varying abstractions is easy, annotating its precise abstraction level is hard, implying the need for a weakly-supervised approach to abstraction modelling. Given that EM $\to$ QD roughly provides a low $\to$ high abstraction ordering [23, 33, 31, 58], we define a classification problem: represent QD (high $\mathbb{A}$) by the ground-truth class label $\hat{\mathbb{A}}_{h}$, TU (medium $\mathbb{A}$) by $\hat{\mathbb{A}}_{m}$, and EM (low $\mathbb{A}$) by $\hat{\mathbb{A}}_{l}$. For each abstraction level, we learn a codebook vector $\{\theta_{l},\theta_{m},\theta_{h}\}$, where $\theta_{l,m,h}\in\mathbb{R}^{5\times d_{t}}$.
Given an input sketch $\mathrm{I}_{S}$, our backbone sketch encoder extracts $f_{s}=\mathrm{F}(\mathrm{I}_{S},\mathbf{v^{s}})$, which is then fed into a codebook classifier $\mathcal{C}_{\theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{3}$ to get a softmax-normalised probability distribution over the three abstraction class labels as $[\mathbb{A}_{l},\mathbb{A}_{m},\mathbb{A}_{h}]=\mathcal{C}_{\theta}(f_{s})$. We train $\mathcal{C}_{\theta}(\cdot)$ via a categorical cross-entropy loss as,

\[
\mathcal{L}_{\text{CB}} = -\big(\hat{\mathbb{A}}_{l}\log\mathbb{A}_{l}+\hat{\mathbb{A}}_{m}\log\mathbb{A}_{m}+\hat{\mathbb{A}}_{h}\log\mathbb{A}_{h}\big)
\tag{7}
\]

The predicted scores are used to combine (by weighted summation) the learned codebooks as $\eta=\mathbb{A}_{l}\theta_{l}+\mathbb{A}_{m}\theta_{m}+\mathbb{A}_{h}\theta_{h}$, which acts as an abstraction prompt $\eta\in\mathbb{R}^{5\times d_{t}}$ and shifts the sketch-conditional prompt $\mathbf{v^{t}}(f_{s})$ (Sec. 4.1) like a bias: $\mathbf{v^{t}}(f_{s}^{*})=\{(v_{0}^{t}+\pi+\eta),\dots,(v_{J-1}^{t}+\pi+\eta)\}$. Next, we use Eq. 5 to classify.
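The weighted codebook combination and the resulting prompt shift can be sketched in a few lines. This is a minimal sketch on nested lists with toy $2\times 2$ matrices in place of the $5\times d_{t}$ shapes. The helper names `abstraction_prompt` and `shift_prompts` are hypothetical, and the abstraction scores are assumed to be softmax-normalised already.

```python
def abstraction_prompt(scores, codebook):
    """Weighted sum of codebook matrices: eta = A_l*theta_l + A_m*theta_m + A_h*theta_h."""
    n_rows = len(codebook[0])
    n_cols = len(codebook[0][0])
    return [
        [sum(a * theta[r][c] for a, theta in zip(scores, codebook))
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


def shift_prompts(layer_prompts, pi, eta):
    """Add the Meta-Net context pi and the abstraction prompt eta to every v_j^t."""
    return [
        [[v + p + e for v, p, e in zip(v_row, pi_row, eta_row)]
         for v_row, pi_row, eta_row in zip(v_j, pi, eta)]
        for v_j in layer_prompts
    ]


# Toy sizes (2 x 2 instead of 5 x d_t), purely for illustration.
theta_l = [[0.0, 0.0], [0.0, 0.0]]
theta_m = [[2.0, 2.0], [2.0, 2.0]]
theta_h = [[4.0, 4.0], [4.0, 4.0]]
scores = [0.5, 0.5, 0.0]  # [A_l, A_m, A_h]: a sketch halfway between EM and TU
eta = abstraction_prompt(scores, [theta_l, theta_m, theta_h])

pi = [[1.0, 1.0], [1.0, 1.0]]          # stand-in for the Meta-Net output
prompts = [[[0.0, 0.0], [0.0, 0.0]]]   # J = 1 learnable text prompt
shifted = shift_prompts(prompts, pi, eta)
```

Because the scores form a probability distribution, $\eta$ is a convex combination of the three codebook vectors, so a sketch between two abstraction levels receives a prompt bias between the corresponding codebook entries.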

Augmentations using Abstraction-Mixup: Eq. 7 helps us model abstraction in human-drawn sketches by learning a codebook vector for each of the $3$ levels $[\theta_{l},\theta_{m},\theta_{h}]$ and predicting their softmax-normalised probabilities $[\mathbb{A}_{l},\mathbb{A}_{m},\mathbb{A}_{h}]$. Unseen sketches, however, do not adhere to these predefined levels and often lie on a continuous spectrum of abstraction ($\mathbb{A}_{l\leftrightarrow h}$). To verify this, we compute the class membership among EM, TU and QD for $600$ sketches using the softmax-normalised distribution: $\mathbb{A}_{l}$ for class $\hat{\mathbb{A}}_{l}$, $\mathbb{A}_{m}$ for class $\hat{\mathbb{A}}_{m}$, and $\mathbb{A}_{h}$ for class $\hat{\mathbb{A}}_{h}$. These $600$ sketches are taken from the unseen split of $20$ categories common across EM, TU and QD, where each category has $10$ sketches. From Fig. 2, while there is an expected peak near the class labels $(\hat{\mathbb{A}}_{l},\hat{\mathbb{A}}_{m},\hat{\mathbb{A}}_{h})$, we observe: (i) a significant number of sketches lie in the continuous spectrum between $\hat{\mathbb{A}}_{l}\leftrightarrow\hat{\mathbb{A}}_{m}$ and $\hat{\mathbb{A}}_{m}\leftrightarrow\hat{\mathbb{A}}_{h}$. This indicates that sketch abstraction is not discrete at $\hat{\mathbb{A}}_{l},\hat{\mathbb{A}}_{m},\hat{\mathbb{A}}_{h}$ but continuous from EM $\leftrightarrow$ QD. (ii) The abstraction of sketches in TU varies widely, overlapping with that of sketches in QD and EM. Assigning all sketches in TU (class $\hat{\mathbb{A}}_{m}$) to a fixed discrete level modelled by $\theta_{m}$ can corrupt generalisation [85].
To alleviate this hard assumption, we propose abstraction-mixup, a simple extension of the mixup augmentation routine [85, 69]. We randomly sample sketches $\{\mathrm{I}^{l}_{S},\mathrm{I}^{m}_{S},\mathrm{I}^{h}_{S}\}$ from {EM, TU, QD} and obtain $\{f^{l}_{s},f^{m}_{s},f^{h}_{s}\}$ (using $\mathrm{F}(\cdot)$) respectively. Next, we randomly sample the mixing coefficients from a $3$-dimensional Dirichlet distribution as $\{\lambda_{1},\lambda_{2},\lambda_{3}\}\sim\texttt{Dir}(\alpha)$ where,

\[
\texttt{Dir}(\lambda_{1},\lambda_{2},\lambda_{3};\alpha)=\frac{\Gamma(3\alpha)}{\Gamma(\alpha)^{3}}\prod_{i=1}^{3}\lambda_{i}^{\alpha-1}
\tag{8}
\]

and $\Gamma(\cdot)$ is the gamma function with $\alpha>0$. Next, we compute a mixup sketch feature $f_{s}^{\alpha}=\lambda_{1}^{*}f_{s}^{l}+\lambda_{2}^{*}f_{s}^{m}+\lambda_{3}^{*}f_{s}^{h}$, where $\lambda_{i}^{*}=\lambda_{i}/(\sum_{j=1}^{3}\lambda_{j})$ are the normalised coefficients.
Using $f^{\alpha}_{s}$, we train the codebook classifier $[\mathbb{A}_{l}^{\alpha},\mathbb{A}_{m}^{\alpha},\mathbb{A}_{h}^{\alpha}]=\mathcal{C}_{\theta}(f_{s}^{\alpha})$ by modifying Eq. 7 as,

\[
\mathcal{L}_{\text{mix}} = -\big(\lambda_{1}^{*}\log\mathbb{A}_{l}^{\alpha}+\lambda_{2}^{*}\log\mathbb{A}_{m}^{\alpha}+\lambda_{3}^{*}\log\mathbb{A}_{h}^{\alpha}\big)
\tag{9}
\]
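The abstraction-mixup steps (sample Dirichlet coefficients, mix the three features, score the classifier output against the soft labels of Eq. 9) might look as follows in plain Python. This is a sketch under stated assumptions. The Dirichlet sample is drawn by normalising independent Gamma draws, a standard construction. The function names and the value $\alpha=1$ are illustrative, and the classifier probabilities are supplied by hand rather than produced by $\mathcal{C}_{\theta}$.

```python
import math
import random


def sample_dirichlet(alpha, k=3, rng=None):
    """Draw from a symmetric Dir(alpha) by normalising k Gamma(alpha, 1) samples."""
    rng = rng or random.Random()
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [g / total for g in draws]


def mix_features(lams, feats):
    """f_s^alpha = lam_1 * f_s^l + lam_2 * f_s^m + lam_3 * f_s^h (element-wise)."""
    dim = len(feats[0])
    return [sum(l * f[i] for l, f in zip(lams, feats)) for i in range(dim)]


def mixup_loss(lams, probs):
    """Eq. 9: cross-entropy between soft labels lams and predicted probabilities."""
    return -sum(l * math.log(p) for l, p in zip(lams, probs))


rng = random.Random(0)                      # fixed seed for repeatability
lams = sample_dirichlet(alpha=1.0, rng=rng)  # alpha = 1.0 is an arbitrary choice

# Toy 2-D features standing in for f_s^l (EM), f_s^m (TU), f_s^h (QD).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f_mix = mix_features(lams, feats)

# Hand-written classifier output [A_l, A_m, A_h] for the mixed feature.
loss = mixup_loss(lams, [0.5, 0.25, 0.25])
```

A Dirichlet sample already sums to one, so the normalisation $\lambda_{i}^{*}=\lambda_{i}/\sum_{j}\lambda_{j}$ leaves it unchanged up to floating-point error. With one-hot coefficients, the loss reduces to the hard-label cross-entropy of Eq. 7.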

Essentially, $\mathcal{L}_{\text{mix}}$ helps to augment synthetic latent representations of sketches across a continuous spectrum of sketch abstraction. Our final training loss combines Eqs. 5, 6, 7 and 9 with coefficients (hyper-parameters) $\beta_{1,2,3}$ as,

\[
\mathcal{L}_{\text{tot}}=\mathcal{L}_{\text{CE}}+\beta_{1}\,\mathcal{L}_{\text{S}\rightarrow\text{V}}+\beta_{2}\,\mathcal{L}_{\text{CB}}+\beta_{3}\,\mathcal{L}_{\text{mix}}
\tag{10}
\]
Figure 3: Given an input sketch, we compute the visual feature $f_{s}$ with sketch prompts using a CLIP image encoder. Next, $f_{s}$ is fed into 4 pipelines: (i) An auxiliary raster-to-vector (sketch2vec) translation module that distils sketch-specific traits. (ii) A Meta-Net to predict an instance-specific context $\pi$ to generalise on unseen sketches. (iii) A codebook classifier $\mathcal{C}_{\theta}$ to compute an abstraction prompt $\eta$, given codebook vectors $\{\theta_{l},\theta_{m},\theta_{h}\}$. (iv) Combine text prompts $\mathbf{v^{t}}$ with $\pi$, $\eta$, and category embedding $w_{0}^{k}$ to generate the text feature $f_{t}$ using the CLIP text encoder $\mathrm{T}(\cdot)$. Finally, class-specific text features $f_{t}$ are matched with $f_{s}$ to compute class probabilities.

4.3 Inference Pipeline

First, we use CLIP vision transformer $\mathrm{F}(\cdot)$ and our learned sketch prompts $\mathbf{v^s} = \{v^s_0, \dots, v^s_{J-1}\}$, where $v^s_j \in \mathbb{R}^{5 \times d_p}$, to encode an input sketch $\mathrm{I}_S$ into a visual feature $f_s = \mathrm{F}(\mathrm{I}_S; \mathbf{v^s}) \in \mathbb{R}^{d}$.
Second, $f_s$ is simultaneously given to two modules: (i) a lightweight Meta-Net to predict instance-specific context, $\pi = \mathrm{H}(f_s)$, where $\pi \in \mathbb{R}^{5 \times d_t}$, and (ii) a codebook classifier $\mathcal{C}_\theta : \mathbb{R}^{d} \to \mathbb{R}^{3}$ to get a softmax-normalised probability distribution over the three abstraction class labels, $[\mathbb{A}_l, \mathbb{A}_m, \mathbb{A}_h] = \mathcal{C}_\theta(f_s)$.
The predicted scores are used to get the abstraction prompt $\eta = \mathbb{A}_l\theta_l + \mathbb{A}_m\theta_m + \mathbb{A}_h\theta_h$, where $\eta \in \mathbb{R}^{5 \times d_t}$, and $\{\theta_l, \theta_m, \theta_h\}$ are the codebook vectors. Third, for classification, we compute the word embedding for $K$ categories as $\{w_0^k\}_{k=1}^{K}$.
Finally, we concatenate word embedding $w_0^k$ to the sum of our learned text prompt $\mathbf{v^t}$, instance-specific context $\pi$, and abstraction prompt $\eta$ to obtain the final text feature $f_t$ using CLIP text encoder as $f_t = \mathrm{T}([(\mathbf{v^t} + \pi + \eta), w_0^k]) \in \mathbb{R}^{d}$. Classification probability is calculated from Eq. 1.
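
The inference steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_text_encoder` is a hypothetical mean-pooling stand-in for CLIP's $\mathrm{T}(\cdot)$, all dimensions and values are made up, and since Eq. 1 lies outside this section, `class_probabilities` assumes the usual CLIP-style softmax over temperature-scaled cosine similarities.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def abstraction_prompt(scores, codebook):
    """eta = A_l*theta_l + A_m*theta_m + A_h*theta_h (element-wise, tokens x d_t)."""
    n_tokens, dim = len(codebook[0]), len(codebook[0][0])
    return [[sum(a * theta[t][c] for a, theta in zip(scores, codebook))
             for c in range(dim)] for t in range(n_tokens)]

def compose_text_input(v_t, pi, eta, w_k):
    """Build [(v^t + pi + eta), w_0^k]: sum the prompts, then append the class token."""
    summed = [[a + b + c for a, b, c in zip(r_v, r_pi, r_eta)]
              for r_v, r_pi, r_eta in zip(v_t, pi, eta)]
    return summed + [w_k]

def toy_text_encoder(tokens):
    """Hypothetical stand-in for CLIP's T(.): mean-pool the token embeddings."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def class_probabilities(f_s, text_features, temperature=0.01):
    """Assumed CLIP-style rule: softmax of cosine similarity / temperature."""
    return softmax([cosine(f_s, f_t) / temperature for f_t in text_features])

# Toy run: 2 context tokens of width 3 and K = 2 categories (all values made up).
scores = softmax([2.0, 0.5, -1.0])                    # [A_l, A_m, A_h]
codebook = [[[0.1 * (i + 1)] * 3 for _ in range(2)]   # theta_l, theta_m, theta_h
            for i in range(3)]
v_t = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1]]             # learned text prompt
pi = [[0.05, 0.05, 0.05], [0.0, 0.0, 0.0]]            # Meta-Net output
eta = abstraction_prompt(scores, codebook)
class_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # w_0^k for k = 1, 2
f_s = [0.9, 0.1, 0.2]                                 # stand-in visual feature
text_features = [toy_text_encoder(compose_text_input(v_t, pi, eta, w))
                 for w in class_tokens]
probs = class_probabilities(f_s, text_features)
```

Because $\pi$ and $\eta$ both depend on $f_s$, the text features have to be recomputed for every input sketch; they cannot be cached once per class as in zero-shot CLIP.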

5 Experiments

Implementation Details: We use pre-trained CLIP with ViT-B/16 (vision transformer) as visual encoder $\mathrm{F}(\cdot)$ and transformer-based text encoder $\mathrm{T}(\cdot)$. For the text encoder, we learn five $512$-dimensional context vectors as prompt $v^t_j \in \mathbb{R}^{5 \times 512}$. The class token for the $k$-th ‘[category]’ is given by $w^k_0 \in \mathbb{R}^{512}$. We learn five $768$-dimensional sketch prompts $v^s_j \in \mathbb{R}^{5 \times 768}$.
The learnable prompts are injected up to a depth $J = 9$, where $\{v^s_0, v^t_0\}$ are shallow prompts, while the deep prompts are $\{v^s_1, \dots, v^s_{J-1}\} \in \mathbb{R}^{8 \times 5 \times 768}$ and $\{v^t_1, \dots, v^t_{J-1}\} \in \mathbb{R}^{8 \times 5 \times 512}$ for vision and text encoders, respectively. Although CLIP’s weights are frozen during training, we fine-tune the layer-norm parameters for improved performance [55]. Our method, consisting of Meta-Net + Codebooks + layer-norm + vision prompts ($\mathbf{v^s}$) + text prompts ($\mathbf{v^t}$), is trained with the Adam optimizer for $7$ epochs with a learning rate of $10^{-4}$ and a batch size of $64$.
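
For bookkeeping, the hyperparameters above can be collected in a small configuration object. The sketch below is illustrative only: `PromptSetup` and its method names are hypothetical, and it assumes the deep vision prompts keep the 768-dimensional width of the shallow sketch prompts. The count covers the prompt tensors alone, not the Meta-Net, the codebook, or the layer-norm parameters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptSetup:
    """Hypothetical container for the prompt-learning hyperparameters."""
    prompt_depth: int = 9        # J: number of layers that receive prompts
    context_tokens: int = 5      # prompt tokens per layer
    text_width: int = 512        # width of each text prompt token
    vision_width: int = 768      # width of each sketch prompt token (ViT-B/16)
    epochs: int = 7
    learning_rate: float = 1e-4
    batch_size: int = 64

    def shallow_shape(self, width):
        # Layer 0 carries the shallow prompt.
        return (self.context_tokens, width)

    def deep_shape(self, width):
        # Layers 1 .. J-1 carry the deep prompts.
        return (self.prompt_depth - 1, self.context_tokens, width)

    def prompt_parameter_count(self):
        # Shallow + deep prompts for both encoders.
        per_layer = self.context_tokens * (self.text_width + self.vision_width)
        return self.prompt_depth * per_layer

setup = PromptSetup()
text_deep = setup.deep_shape(setup.text_width)
vision_deep = setup.deep_shape(setup.vision_width)
```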

Datasets: We use sketches from QuickDraw [31] and TU-Berlin [23] along with Edgemaps [11] of TU-Berlin Extended [86]. Ranked from highest to lowest abstraction, QuickDraw has $50$M sketches across $345$ categories, TU-Berlin has $20$K sketches from $250$ categories, and the Edgemaps are generated using [11] from $204$K images across $250$ categories in TU-Berlin Extended. For few-shot training, we randomly pick $10$ sketches per class from a list of $125$ classes common to all three datasets and reserve the remaining classes ($220$ for QuickDraw, $125$ for TU-Berlin, and $125$ for Edgemaps) for zero-shot inference. Generating Edgemaps from complex scene images (with noisy backgrounds) leads to noisy sketches. We therefore retain only images with high classification scores (higher score $\Rightarrow$ less noisy background) under pre-trained zero-shot CLIP (details in supplementary).
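
The seen/unseen class split described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the class names are placeholders, `split_classes` and `sample_few_shot` are hypothetical helpers, and only the class counts follow the text.

```python
import random

def split_classes(all_classes, common_classes):
    """Seen (few-shot) classes are the common list; the rest are unseen (zero-shot)."""
    common = set(common_classes)
    seen = [c for c in all_classes if c in common]
    unseen = [c for c in all_classes if c not in common]
    return seen, unseen

def sample_few_shot(sketches_by_class, seen_classes, shots=10, seed=0):
    """Randomly pick `shots` training sketches per seen class."""
    rng = random.Random(seed)
    return {c: rng.sample(sketches_by_class[c], shots) for c in seen_classes}

# Placeholder class names; only the counts (345 / 250 / 250, 125 common) follow the text.
common = [f"class_{i:03d}" for i in range(125)]
dataset_sizes = {"QuickDraw": 345, "TU-Berlin": 250, "Edgemaps": 250}
unseen_counts = {}
for name, n_classes in dataset_sizes.items():
    classes = common + [f"{name}_extra_{i:03d}" for i in range(n_classes - 125)]
    seen, unseen = split_classes(classes, common)
    unseen_counts[name] = len(unseen)
```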

Evaluation Setup: We evaluate our algorithm on two fronts: (i) few-shot accuracy, where we train our model on $10$ randomly sampled sketches from each of the $125$ classes common to all three datasets and evaluate it on previously unseen samples from the same class list; and (ii) zero-shot accuracy, where we take the same few-shot-trained model and evaluate it on unseen samples from new classes in these datasets. This difficult evaluation setup helps us understand (a) how well the model generalises to unseen classes, i.e., how much of the generalisation potential of CLIP we adapted for sketch recognition, and (b) how well the model, trained on seen categories, generalises across varying abstractions using codebook vectors and their mix-up. We also evaluate the adaptability of our network by replacing the CLIP backbone with FLAVA [65].

| Methods | EM Seen | EM Unseen | TU [23] Seen | TU [23] Unseen | QD [31] Seen | QD [31] Unseen |
|---|---|---|---|---|---|---|
| CLIP-Z [51] | 52.09 | 50.10 | 56.57 | 47.71 | 20.00 | 13.27 |
| CoOp [91] | 55.06 | 50.66 | 58.92 | 47.92 | 22.80 | 12.64 |
| CoCoOp [90] | 56.03 | 51.57 | 59.74 | 50.25 | 24.48 | 12.68 |
| VPT-A [2] | 52.82 | 41.22 | 66.08 | 51.01 | 37.02 | 15.36 |
| MaPLe [38] | 61.01 | 52.90 | 71.66 | 53.91 | 36.24 | 17.74 |
| Linear Probe [51] | 34.84 | - | 57.24 | - | 36.94 | - |
| Tip-Adapter [87] | 60.58 | - | 65.74 | - | 42.24 | - |
| Sketch-A-Net [83] | - | - | 27.01 | 3.14 | 18.08 | 0.68 |
| ResNet [32] | 9.00 | 2.16 | 18.34 | 1.82 | 7.20 | 0.63 |
| ResNet-Adapt | 8.68 | 1.33 | 25.18 | 2.33 | 9.84 | 0.36 |
| Edge-Augment | 14.95 | 0.70 | 35.17 | 0.81 | 9.52 | 0.46 |
| VPT-P (Shallow) [35] | 27.92 | - | 46.56 | - | 24.33 | - |
| VPT-P (Deep) [35] | 42.08 | - | 55.83 | - | 34.00 | - |
| Ours | 66.72 | 59.05 | 76.96 | 60.51 | 45.20 | 22.41 |

Table 1: Recognition accuracy (%) for sketches in EM, TU, and QD. All competitors are jointly trained on QD + TU + EM. A dash marks entries that are not reported.
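| Prompt Depth | Context Tokens | Component removed | Accuracy (Seen) | Accuracy (Unseen) |
|---|---|---|---|---|
| 1 | 5 | none | 66.87 | 59.39 |
| 3 | 5 | none | 69.42 | 58.99 |
| 7 | 5 | none | 74.52 | 57.77 |
| 9 | 2 | none | 74.52 | 60.61 |
| 9 | 10 | none | 74.21 | 57.56 |
| 9 | 20 | none | 75.54 | 55.94 |
| 9 | 5 | Meta-Net | 73.41 | 54.19 |
| 9 | 5 | Layer Norm | 73.50 | 59.90 |
| 9 | 5 | Codebook Vectors | 73.09 | 60.20 |
| 9 | 5 | Mixup | 71.36 | 60.00 |
| 9 | 5 | Sketch2Vec | 73.70 | 59.49 |
| 9 | 5 | none (full model) | 76.96 | 60.51 |

Table 2: Ablation studies on TU-Berlin [23]: With varying Prompt Depth and number of Context Tokens trained with/without LayerNorm fine-tuning, Meta-Net, Codebook Vectors, auxiliary Sketch2Vec and Mixup.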

Competitors: We compare against (i) existing state-of-the-art (SOTA) zero-shot and few-shot recognition methods. CLIP [51] classifies sketches by comparing a sketch encoding from the visual encoder with class encodings from the text encoder using hand-crafted text prompts like ‘a photo of a [category]’. CoOp [91] extends CLIP by replacing hand-crafted prompts with learnable text prompts. VPT-A [2] instead learns a visual prompt that is added to the image to adapt the vision encoder for sketch classification via hand-crafted text prompts (similar to CLIP). MaPLe [38] learns a joint vision-text “deep prompt” inserted in multiple layers of the vision and text encoders for better sketch and class encodings respectively. We use the independent vision-language prompting mode of MaPLe for a fairer comparison. Contrary to prior works learning static prompts, CoCoOp [90] learns an adaptive text prompt: a bias vector $\pi$ conditioned on the input sketch and added to the learned text prompt in CoOp. Linear Probe [51] classifies sketches by adding a linear layer at the end of CLIP’s visual encoder. Tip-Adapter [87] uses a CLIP-based non-parametric query-key cache model [37] as an adapter with similarity-based retrieval to determine the class of a test sample from its feature encoding. (ii) Apart from adapting CLIP-based SOTAs, we provide a comprehensive comparison with widely used sketch classifiers like ResNet [32] and Sketch-A-Net [83]. ResNet-Adapt bridges the domain gap between abstract sketches in TU [23], QD [31], and EM using a domain discriminator to align ResNet visual features from all three domains. Edge-Augment [22] fine-tunes a ResNet, pre-trained on geometrically augmented Edgemaps, on real sketches. VPT-P [35] learns visual prompts for the vision transformer [20], injected as image patches in VPT-P (Shallow), or in multiple deeper layers [38] in VPT-P (Deep).

5.1 Sketch Recognition

We report few-shot and zero-shot recognition results of our algorithm on QD, TU, and EM in Tab. 1, using average accuracy across all datasets for reference.

Few-Shot Recognition: We obtain a Top-1 accuracy of 66.72%, 76.96%, and 45.20% on EM, TU, and QD respectively with our algorithm, beating SOTA MaPLe by an average margin of 6.66%. Works on shallow language prompts like CoOp (45.59%) and CoCoOp (46.75%) and shallow visual prompts like VPT-A (51.97%) beat zero-shot CLIP by an average of 2.71%, 3.87%, and 9.09% respectively, reinforcing the idea of prompt learning for better recognition. We note that visual prompting on CLIP (VPT-A) yields better results than language prompting (CoCoOp) in a shallow prompt setting, with an accuracy difference of 5.22%. As analysed in [38] and evident from our experiments, deeper prompts like those in MaPLe (56.33%) and VPT-P (Deep) (43.97%) outperform their shallow prompt counterparts like VPT-P (Shallow) (32.93%). Our work also outperforms few-shot methods that use hand-crafted prompts like Tip-Adapter [87] and Linear Probe [51] by 6.91% and 20.09% respectively. The superiority of our abstraction handling algorithm is demonstrated by its performance gain of 48.53% over naive adaptation-based solutions like ResNet-Adapt (14.56%) for multi-dataset training. Sketch-A-Net, having a sketch-specific network architecture, beats ResNet/ResNet-Adapt pre-trained on images by 8.67/1.83% and 10.88/8.24% on TU and QD, respectively. As Sketch-A-Net requires stroke order information, we do not report its accuracy on EM, which lack this data.
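
The averaged figures quoted above are means over the three seen-class columns (EM, TU, QD) of Tab. 1. The short sketch below recomputes a few of them; it assumes a plain unweighted mean across datasets, and the variable names are illustrative.

```python
# Seen-class Top-1 accuracy (%) on (EM, TU, QD), copied from Tab. 1.
SEEN_ACCURACY = {
    "CLIP-Z": (52.09, 56.57, 20.00),
    "CoOp": (55.06, 58.92, 22.80),
    "CoCoOp": (56.03, 59.74, 24.48),
    "VPT-A": (52.82, 66.08, 37.02),
    "MaPLe": (61.01, 71.66, 36.24),
    "Ours": (66.72, 76.96, 45.20),
}

def average(values):
    """Unweighted mean across datasets (an assumption about how averages are formed)."""
    return sum(values) / len(values)

avg = {name: average(acc) for name, acc in SEEN_ACCURACY.items()}

# Average few-shot margin of the proposed method over MaPLe.
margin_over_maple = avg["Ours"] - avg["MaPLe"]

# Average gains of the shallow prompt-learning methods over zero-shot CLIP.
gain_over_clip = {name: avg[name] - avg["CLIP-Z"] for name in ("CoOp", "CoCoOp", "VPT-A")}
```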

Zero-Shot Recognition: We find zero-shot performance in CLIP-based methods to be significantly higher than baseline networks, as CLIP is pre-trained on 400M image-text pairs, making it easy for CLIP to recognise sketch categories unseen during fine-tuning. Zero-shot CLIP beats non-CLIP baselines like Sketch-A-Net and ResNet-Adapt by as much as 29.27% and 35.69%, respectively. Amongst CLIP-based methods, we find our method to outperform MaPLe [38] marginally (5.80%) and zero-shot CLIP considerably (10.30%) in recognition accuracy. CLIP-based methods that are adapted specifically for few-shot training (Tip-Adapter [87] and Linear Probe [51]) are not reported under Zero-Shot Recognition. Furthermore, replacing the CLIP backbone with FLAVA [65], we find our model improves upon zero-shot recognition with FLAVA (38.37/35.21%) by +17.36/+9.49% for categories seen/unseen by our model.

5.2 Ablation

We ablate various components and hyperparameters in Tab. 2. (i) Varying prompt depth $J$ affects recognition accuracy, where $J = 1$ (shallow prompts) drops it to $66.87\%$ and $59.39\%$ in seen and unseen classes, respectively. (ii) Varying the length of the language prompt ($v^t_j \in \mathbb{R}^{5 \times 512}$) from $5$ to $2$, $10$, and $20$ drops accuracy to $74.52\%$, $74.21\%$, and $75.54\%$ respectively. (iii) Removing the Meta-Net drops accuracy by $4.94\%$, particularly zero-shot accuracy by $6.32\%$. (iv) Removing $\mathcal{L}_{\text{S}\rightarrow\text{V}}$ drops accuracy by $3.26\%$, due to lack of sketch-specific traits. (v) Removing codebook learning that models abstraction drops accuracy by $3.87\%$. (vi) Removing codebook mixup that models continuous abstraction drops accuracy by $3.06\%$. (vii) Removing layer-norm drops accuracy by $2.03\%$.
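
A simple way to read Tab. 2 is to difference each ablated row against the full model. The sketch below does this under two assumptions that should be checked against the original table: the row-to-component mapping is inferred by matching rows to the drops quoted in the text, and since the text does not say whether a quoted drop refers to seen classes, unseen classes, or their mean, all three are printed.

```python
# (seen, unseen) accuracy in % for the full model: depth J = 9, 5 context tokens.
FULL = (76.96, 60.51)

# Ablated rows of Tab. 2; the component labels are inferred, not given explicitly.
ABLATED = {
    "Meta-Net": (73.41, 54.19),
    "Layer Norm": (73.50, 59.90),
    "Codebook Vectors": (73.09, 60.20),
    "Mixup": (71.36, 60.00),
    "Sketch2Vec": (73.70, 59.49),
}

def drops(full, ablated):
    """Return (seen drop, unseen drop, mean drop) relative to the full model."""
    seen_drop = full[0] - ablated[0]
    unseen_drop = full[1] - ablated[1]
    return seen_drop, unseen_drop, (seen_drop + unseen_drop) / 2

report = {name: drops(FULL, row) for name, row in ABLATED.items()}
for name, (seen_drop, unseen_drop, mean_drop) in report.items():
    print(f"{name:17s} seen -{seen_drop:.2f}  unseen -{unseen_drop:.2f}  mean -{mean_drop:.2f}")
```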

5.3 Recognising the Degree of Abstraction

The codebook classifier predicts abstraction score as a softmax normalised probability distribution on $3$ coarse abstraction levels: $\hat{\mathbb{A}}_l$ (EM), $\hat{\mathbb{A}}_m$ (TU), $\hat{\mathbb{A}}_h$ (QD).³

³ Since there is no precise abstraction annotation for each sketch, we assign coarse abstraction levels to EM, TU, and QD.

Figure 4: Human study to rank $3$ sketches from same category and same dataset into low, medium, or high abstraction levels. (a) Model agrees with human abstraction ranking. (b) Model disagrees with human abstraction ranking.

Quantitative: Given an unseen sketch from QD, TU, or EM, we predict its abstraction score as a softmax normalised distribution over codebooks $\{\mathbb{A}_l, \mathbb{A}_m, \mathbb{A}_h\}$. Our model effectively classifies unseen sketches into their ground-truth abstraction level $\{\hat{\mathbb{A}}_l, \hat{\mathbb{A}}_m, \hat{\mathbb{A}}_h\}$ with an average accuracy of $95\%$. To verify that our abstraction classification is not merely a sketch dataset classification, we perform a human study, comparing human rankings for abstraction levels $(0, 0.5, 1.0)$ against our predicted abstraction ranks.

Human Study: Sketch abstraction is a subjective metric and hard to quantify. As prior works have no consensus on “what is sketch abstraction” [17, 70, 23, 31], we conduct a human study, where we select $20$ common categories across $3$ datasets. Next, from each dataset and each category, we select $3$ sketches (e.g., $3$ bicycle sketches all from TU-Berlin), giving us a total of $20 \text{ (categories)} \times 3 \text{ (datasets)} = 60$ triplets.

We then engage $5$ users across a diverse demographic in the age range of $20$-$30$. We provide each user a set of $12$ triplets (a triplet has $3$ sketches per category per dataset) and ask them to rank each sketch (Fig. 4) by its abstraction in the triplet.

Accordingly, the ranked list of $60$ sketch triplets is compared with our model’s prediction using the abstraction classifier. Our predicted ranking aligns with humans at an avg. correlation of $71.67$%, $78.33$%, and $66.67$% for low, medium, and high abstraction levels respectively (Tab. 3).

Figure 5: User opinion on abstraction rankings

Next, we shuffle the $5$ sets of $12$ triplets among the $5$ users, such that no one receives their old set. Now, for each triplet, we show participants our predicted abstraction ranking and ask them to rate their overall agreement $(1$-$5)$ with the provided ranking (Fig. 5), with $1$ being “Strongly Disagree” and $5$ being “Strongly Agree”. For a total of $60$ triplets, users report an average agreement score (Mean Opinion Score) of $4.3$.
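
The reassignment and scoring protocol can be sketched as below. The text does not say how the shuffle was performed, so the cyclic rotation here is only one simple scheme that guarantees nobody receives their own set; `rotate_sets` and `mean_opinion_score` are hypothetical helpers, and the triplet ids are made up.

```python
def rotate_sets(sets, shift=1):
    """Give user i the set originally held by user (i + shift) mod n."""
    n = len(sets)
    if n < 2 or shift % n == 0:
        raise ValueError("need at least two sets and a non-zero shift")
    return [sets[(i + shift) % n] for i in range(n)]

def mean_opinion_score(ratings):
    """Average of 1-5 agreement ratings (1 = Strongly Disagree, 5 = Strongly Agree)."""
    if not ratings or any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be non-empty and lie in 1..5")
    return sum(ratings) / len(ratings)

# 5 users x 12 triplets = 60 triplets, identified here by made-up integer ids.
original = [list(range(u * 12, (u + 1) * 12)) for u in range(5)]
reassigned = rotate_sets(original)
```

A random derangement would serve equally well; the rotation is chosen only because it is deterministic and easy to verify.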

Figure 6: Human rankings of sketch triplets.

| Human Study | EM | TU [23] | QD [31] |
|---|---|---|---|
| Rank-1 | 13 (65%) | 15 (75%) | 12 (60%) |
| Rank-2 | 16 (80%) | 17 (85%) | 15 (75%) |
| Rank-3 | 14 (70%) | 15 (75%) | 13 (65%) |

Table 3: No. of sketches on which our network agrees with the humans on most abstract (Rank-1), average (Rank-2), and least abstract (Rank-3).
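
The per-level agreement percentages quoted in the human study (71.67%, 78.33%, and 66.67%) can be recovered from Table 3 by averaging the three rank rows of each dataset column. The sketch below assumes each cell is a count out of 20 triplets per dataset, which is consistent with the percentages printed in the table.

```python
# Agreement counts from Table 3, ordered as (Rank-1, Rank-2, Rank-3).
AGREEMENT_COUNTS = {
    "EM": (13, 16, 14),   # low abstraction
    "TU": (15, 17, 15),   # medium abstraction
    "QD": (12, 15, 13),   # high abstraction
}
TRIPLETS_PER_DATASET = 20  # 20 categories, one triplet per category and dataset

def agreement_percent(counts, total=TRIPLETS_PER_DATASET):
    """Mean agreement over the rank positions, as a percentage."""
    return 100.0 * sum(counts) / (len(counts) * total)

agreement = {name: agreement_percent(c) for name, c in AGREEMENT_COUNTS.items()}
```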

High correlation and Mean Opinion Scores, even when sketch triplets come from the same dataset, indicate that our codebook classifier is not simply a dataset classifier.

5.4 Evaluating On Unseen Abstraction Levels

To judge our model’s generalisability, we simulate unseen sketch-abstraction levels using CLIPasso [70]. Specifically, we test on unseen CLIPasso-generated sketches having $12$, $16$, $24$, and $32$ strokes, representing decreasing order of abstraction.

Fig. 7 shows that our predicted abstraction scores (a) align with the notion of abstraction (c) in CLIPasso (more strokes $\Rightarrow$ less abstract). High recognition accuracy (b) indicates that our model generalises not only for abstraction prediction, but also for classification of unseen sketches.

Figure 7: Top: CLIPasso [70] sketches at different numbers of strokes (and abstraction levels). Left: For CLIPasso, (a) predicted abstraction scores vs. number of strokes; (b) sketch classification accuracy vs. number of strokes.

5.5 Interpreting Learned Abstraction Prompts

We aim to interpret the influence of abstraction prompt $\eta$ on learned text prompts ($\mathbf{v^t}$) to understand how it instils abstraction-aware knowledge into our sketch classifier. Recent studies [91] reveal that learned context tokens ($\mathbf{v^t}$) usually converge close to their initial embedding corresponding to the prompt of ‘a sketch of a [category]’. Influencing $\mathbf{v^t}$ with the codebook-derived abstraction prompt ($\eta$), however, pushes the embedding towards somewhat different words in the Euclidean space [38], which are sketch-specific in nature. This reading is not always reliable: when analysing sketches at lower abstraction levels, we found cases where the Euclidean distance from our prompt embedding to a word like ‘artistic’ is equivalent to that from irrelevant words like ‘box’, ‘camera’, etc. This confusion dismisses naive measures like Euclidean distance as a tool for interpreting the influence of abstraction prompt $\eta$, necessitating further investigation of alternatives in future.

6 Conclusion

We extend the notion of a generalised classifier from photos to sketches. Towards this goal, we adapt CLIP (with open-set generalisation) for sketches by learning prompts for both the vision and language branches. In addition, to learn sketch-specific traits, we employ an auxiliary raster $\rightarrow$ vector sketch reconstruction loss. Finally, we generalise CLIP across varying sketch abstractions. As sketches lack precise abstraction annotation, we assign coarse-level scores to Edgemaps as low abstraction, TU-Berlin sketches as medium, and QuickDraw doodles as high abstraction. We employ codebook learning and mixup to learn sketch abstraction in a semi-supervised setup. The resulting SketchCLIP serves as a foundation model for robust sketch recognition algorithms based on large-scale vision-language models to classify “any” abstract sketch.

References

  • [1] Alaniz, S., Mancini, M., Dutta, A., Marcos, D., Akata, Z.: Abstracting Sketches through Simple Primitives. In: ECCV (2022)
  • [2] Bahng, H., Jahanian, A., Sankaranarayanan, S., Isola, P.: Exploring Visual Prompts for Adapting Large-Scale Models. arXiv preprint arXiv:2203.17274 (2022)
  • [3] Baldrati, A., Bertini, M., Uricchio, T., Del Bimbo, A.: Effective Conditioned and Composed Image Retrieval Combining CLIP-Based Features. In: CVPR (2022)
  • [4] Bendale, A., Boult, T.: Towards Open World Recognition. In: CVPR (2015)
  • [5] Berardi, G., Gryaditskaya, Y.: Fine-Tuned but Zero-Shot 3D Shape Sketch View Similarity and Retrieval. In: ICCV SHARP Workshop (2023)
  • [6] Berger, I., Shamir, A., Mahler, M., Carter, E., Hodgins, J.: Style and abstraction in portrait sketching. ACM TOG (2013)
  • [7] Bhunia, A.K., Chowdhury, P.N., Yang, Y., Hospedales, T.M., Xiang, T., Song, Y.Z.: Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting. In: CVPR (2021)
  • [8] Bhunia, A.K., Das, A., Muhammad, U.R., Yang, Y., Hospedales, T.M., Xiang, T., Gryaditskaya, Y., Song, Y.Z.: Pixelor: A Competitive Sketching AI Agent. So you think you can sketch? ACM TOG (2020)
  • [9] Bhunia, A.K., Gajjala, V.R., Koley, S., Kundu, R., Sain, A., Xiang, T., Song, Y.Z.: Doodle it yourself: Class incremental learning by drawing a few sketches. In: CVPR (2022)
  • [10] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language Models are Few-Shot Learners. In: NeurIPS (2020)
  • [11] Chan, C., Durand, F., Isola, P.: Learning to generate line drawings that convey geometry and semantics. In: CVPR (2022)
  • [12] Chen, S.Y., Su, W., Gao, L., Xia, S., Fu, H.: DeepFaceDrawing: Deep generation of face images from sketches. ACM TOG (2020)
  • [13] Chen, W., Hays, J.: SketchyGAN: Towards diverse and realistic sketch to image synthesis. In: CVPR (2018)
  • [14] Chen, Z., Wang, G., Liu, Z.: Text2Light: Zero-Shot Text-Driven HDR Panorama Generation. ACM TOG (2022)
  • [15] Collomosse, J., Bui, T., Jin, H.: Livesketch: Query perturbations for guided sketch-based visual search. In: CVPR (2019)
  • [16] Das, A., Yang, Y., Hospedales, T., Xiang, T., Song, Y.Z.: BézierSketch: A generative model for scalable vector sketches. In: ECCV (2020)
  • [17] Das, A., Yang, Y., Hospedales, T., Xiang, T., Song, Y.Z.: SketchODE: Learning neural sketch representation in continuous time. In: ICLR (2021)
  • [18] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
  • [19] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: NAACL (2019)
  • [20] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: ICLR (2021)
  • [21] Dutta, A., Akata, Z.: Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In: CVPR (2019)
  • [22] Efthymiadis, N., Tolias, G., Chum, O.: Edge Augmentation for Large-Scale Sketch Recognition without Sketches. In: ICPR (2022)
  • [23] Eitz, M., Hays, J., Alexa, M.: How Do Humans Sketch Objects? ACM TOG (2012)
  • [24] Fang, H., Xiong, P., Xu, L., Chen, Y.: Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097 (2021)
  • [25] Gao, C., Liu, Q., Xu, Q., Wang, L., Liu, J., Zou, C.: Sketchycoco: Image generation from freehand scene sketches. In: CVPR (2020)
  • [26] Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: CLIP-Adapter: Better Vision-Language Models with Feature Adapters. IJCV (2023)
  • [27] Ge, S., Goswami, V., Zitnick, L., Parikh, D.: Creative Sketch Generation. In: ICLR (2021)
  • [28] Gryaditskaya, Y., Sypesteyn, M., Hoftijzer, J.W., Pont, S., Durand, F., Bousseau, A.: OpenSketch: A Richly-Annotated Dataset of Product Design Sketches. ACM SIGGRAPH (2019)
  • [29] Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. In: ICLR (2022)
  • [30] Guillard, B., Remelli, E., Yvernay, P., Fua, P.: Sketch2Mesh: Reconstructing and Editing 3D Shapes from Sketches. In: CVPR (2021)
  • [31] Ha, D., Eck, D.: A Neural Representation of Sketch Drawings. In: ICLR (2018)
  • [32] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • [33] Hertzmann, A.: Why do line drawings work? A realism hypothesis. Perception (2020)
  • [34] Hu, C., Li, D., Yang, Y., Hospedales, T.M., Song, Y.Z.: Sketch-a-segmenter: Sketch-based photo segmenter generation. IEEE TIP (2020)
  • [35] Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.N.: Visual prompt tuning. In: ECCV (2022)
  • [36] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361 (2020)
  • [37] Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., Lewis, M.: Generalization through Memorization: Nearest Neighbor Language Models. In: ICLR (2020)
  • [38] Khattak, M.U., Rasheed, H., Maaz, M., Khan, S., Khan, F.S.: MaPLe: Multi-modal Prompt Learning. In: CVPR (2023)
  • [39] Lei, J., Li, L., Zhou, L., Gan, Z., Berg, T.L., Bansal, M., Liu, J.: Less is more: Clipbert for video-and-language learning via sparse sampling. In: CVPR (2021)
  • [40] Li, H., Jiang, X., Guan, B., Wang, R., Thalmann, N.M.: Multistage Spatio-Temporal Networks for Robust Sketch Recognition. IEEE TIP (2022)
  • [41] Lin, H., Fu, Y., Xue, X., Jiang, Y.G.: Sketch-bert: Learning sketch bidirectional encoder representation from transformers by self-supervised learning of sketch gestalt. In: CVPR (2020)
  • [42] Liu, H., Wan, Z., Huang, W., Song, Y., Han, X., Liao, J., Jiang, B., Liu, W.: Deflocnet: Deep image editing via flexible low-level controls. In: CVPR (2021)
  • [43] Mirowski, P., Banarse, D., Malinowski, M., Osindero, S., Fernando, C.: Clip-clop: Clip-guided collage and photomontage. arXiv preprint arXiv:2205.03146 (2022)
  • [44] Muhammad, U.R., Yang, Y., Hospedales, T.M., Xiang, T., Song, Y.Z.: Goal-driven sequential data abstraction. In: CVPR (2019)
  • [45] Muhammad, U.R., Yang, Y., Song, Y.Z., Xiang, T., Hospedales, T.M.: Learning deep sketch abstraction. In: CVPR (2018)
  • [46] Oord, A.v.d., Vinyals, O., Kavukcuoglu, K.: Neural Discrete Representation Learning. In: NeurIPS (2017)
  • [47] Pang, K., Yang, Y., Hospedales, T.M., Xiang, T., Song, Y.Z.: Solving Mixed-modal Jigsaw Puzzle for Fine-Grained Sketch-Based Image Retrieval. In: CVPR (2020)
  • [48] Paulson, B., Hammond, T.: PaleoSketch: Accurate Primitive Sketch Recognition and Beautification. In: IUI (2008)
  • [49] Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A.H., Riedel, S.: Language models as knowledge bases? In: EMNLP (2019)
  • [50] Qi, Y., Su, G., Chowdhury, P.N., Li, M., Song, Y.Z.: SketchLattice: Latticed Representation for Sketch Manipulation. In: ICCV (2021)
  • [51] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
  • [52] Rao, Y., Zhao, W., Chen, G., Tang, Y., Zhu, Z., Huang, G., Zhou, J., Lu, J.: DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting. In: CVPR (2022)
  • [53] Ribeiro, L.S.F., Bui, T., Collomosse, J., Ponti, M.: Sketchformer: Transformer-based representation for sketched structure. In: CVPR (2020)
  • [54] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-Resolution Image Synthesis with Latent Diffusion Models. In: CVPR (2022)
  • [55] Sain, A., Bhunia, A.K., Chowdhury, P.N., Koley, S., Xiang, T., Song, Y.Z.: Clip for all things zero-shot sketch-based image retrieval, fine-grained or not. In: CVPR (2023)
  • [56] Sain, A., Bhunia, A.K., Yang, Y., Xiang, T., Song, Y.Z.: Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval. In: BMVC (2020)
  • [57] Sain, A., Bhunia, A.K., Yang, Y., Xiang, T., Song, Y.Z.: Stylemeup: Towards style-agnostic sketch-based image retrieval. In: CVPR (2021)
  • [58] Sangkloy, P., Burnell, N., Ham, C., Hays, J.: The sketchy database: learning to retrieve badly drawn bunnies. ACM TOG (2016)
  • [59] Sarvadevabhatla, R.K., Babu, R.V.: Freehand Sketch Recognition Using Deep Features. In: ICIP (2015)
  • [60] Schneider, R.G., Tuytelaars, T.: Sketch classification and classification-driven analysis using Fisher vectors. ACM TOG (2014)
  • [61] Seddati, O., Dupont, S., Mahmoudi, S.: DeepSketch: Deep Convolutional Neural Networks for Sketch Recognition and Similarity Search. In: CBMI (2015)
  • [62] Seddati, O., Dupont, S., Mahmoudi, S.: DeepSketch 2: Deep Convolutional Neural Networks for Partial Sketch Recognition. In: CBMI (2016)
  • [63] Sezgin, T.M., Stahovich, T., Davis, R.: Sketch Based Interfaces: Early Processing for Sketch Understanding. In: PUI (2001)
  • [64] Shen, Y., Liu, L., Shen, F., Shao, L.: Zero-shot sketch-image hashing. In: CVPR (2018)
  • [65] Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., Kiela, D.: Flava: A foundational language and vision alignment model. In: CVPR (2022)
  • [66] Su, G., Qi, Y., Pang, K., Yang, J., Song, Y.Z.: Sketchhealer: A graph-to-sequence network for recreating partial human sketches. In: BMVC (2020)
  • [67] Tripathi, A., Dani, R.R., Mishra, A., Chakraborty, A.: Sketch-guided object localization in natural images. In: ECCV (2020)
  • [68] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
  • [69] Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas, I., Lopez-Paz, D., Bengio, Y.: Manifold mixup: Better representations by interpolating hidden states. In: ICML (2019)
  • [70] Vinker, Y., Pajouheshgar, E., Bo, J.Y., Bachmann, R.C., Bermano, A.H., Cohen-Or, D., Zamir, A., Shamir, A.: Clipasso: Semantically-aware object sketching. ACM TOG (2022)
  • [71] Wang, A., Ren, M., Zemel, R.: Sketchembednet: Learning novel concepts by imitating drawings. In: ICML (2021)
  • [72] Wang, S.Y., Bau, D., Zhu, J.Y.: Sketch your own GAN. In: CVPR (2021)
  • [73] Wang, Z., Liu, W., He, Q., Wu, X., Yi, Z.: CLIP-GEN: Language-free training of a text-to-image generator with CLIP. arXiv preprint arXiv:2203.00386 (2022)
  • [74] Xing, Y., Wu, Q., Cheng, D., Zhang, S., Liang, G., Zhang, Y.: Class-aware visual prompt tuning for vision-language pre-trained model. arXiv preprint arXiv:2208.08340 (2022)
  • [75] Xu, P., Huang, Y., Yuan, T., Pang, K., Song, Y.Z., Xiang, T., Hospedales, T.M., Ma, Z., Guo, J.: Sketchmate: Deep hashing for million-scale human sketch retrieval. In: CVPR (2018)
  • [76] Xu, P., Joshi, C.K., Bresson, X.: Multi-Graph Transformer for Free-Hand Sketch Recognition. IEEE TNNLS (2022)
  • [77] Xu, R., Han, Z., Hui, L., Qian, J., Xie, J.: Domain Disentangled Generative Adversarial Network for Zero-Shot Sketch-Based 3D Shape Retrieval. In: AAAI (2022)
  • [78] Yan, G., Chen, Z., Yang, J., Wang, H.: Interactive liquid splash modeling by user sketches. ACM TOG (2020)
  • [79] Yang, L., Pang, K., Zhang, H., Song, Y.Z.: Sketchaa: Abstract representation for abstract sketches. In: CVPR (2021)
  • [80] Yang, L., Sain, A., Li, L., Qi, Y., Zhang, H., Song, Y.Z.: S3NET: Graph Representational Network For Sketch Recognition. In: ICME (2020)
  • [81] Yi, R., Ye, Z., Fan, R., Shu, Y., Liu, Y.J., Lai, Y.K., Rosin, P.L.: Animating portrait line drawings from a single face photo and a speech signal. In: ACM SIGGRAPH (2022)
  • [82] Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Free-form image inpainting with gated convolution. In: CVPR (2019)
  • [83] Yu, Q., Yang, Y., Liu, F., Song, Y.Z., Xiang, T., Hospedales, T.M.: Sketch-a-net: A deep neural network that beats humans. IJCV (2017)
  • [84] Zeng, Y., Lin, Z., Patel, V.M.: Sketchedit: Mask-free local image manipulation with partial sketches. In: CVPR (2022)
  • [85] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond Empirical Risk Minimization. In: ICLR (2018)
  • [86] Zhang, H., Liu, S., Zhang, C., Ren, W., Wang, R., Cao, X.: Sketchnet: Sketch classification with web images. In: CVPR (2016)
  • [87] Zhang, R., Fang, R., Gao, P., Zhang, W., Li, K., Dai, J., Qiao, Y., Li, H.: Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling. In: ECCV (2022)
  • [88] Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H.: Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification. In: ECCV (2022)
  • [89] Zhang, S.H., Guo, Y.C., Gu, Q.W.: Sketch2Model: View-aware 3D modeling from single free-hand sketches. In: CVPR (2021)
  • [90] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-language models. In: CVPR (2022)
  • [91] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. IJCV (2022)
  • [92] Zhu, B., Niu, Y., Han, Y., Wu, Y., Zhang, H.: Prompt-aligned Gradient for Prompt Tuning. In: ICCV (2023)