AE-ViT: Token Enhancement for Vision Transformers via CNN-based Autoencoder Ensembles.
Abstract. While Vision Transformers (ViTs) have revolutionized computer vision with their exceptional
results, they struggle to balance processing speed with visual detail preservation. This tension becomes
particularly evident when implementing larger patch sizes. Although larger patches reduce computational
costs, they lead to significant information loss during the tokenization process. We present AE-ViT, a novel
architecture that leverages an ensemble of autoencoders to address this issue by introducing specialized
latent tokens that integrate seamlessly with standard patch tokens, enabling ViTs to capture both global
and fine-grained features.
Our experiments on CIFAR-100 show that AE-ViT achieves a 23.67% relative accuracy improvement over
the baseline ViT when using 16×16 patches, effectively recovering fine-grained details typically lost with
larger patches. Notably, AE-ViT maintains reasonable performance (60.64%) even at 32×32 patches. We
further validate our method on CIFAR-10, confirming consistent benefits and adaptability across different
datasets.
Ablation studies on ensemble size and integration strategy underscore the robustness of AE-ViT, while
computational analysis shows that its efficiency scales favorably with increasing patch size. Overall, these
findings suggest that AE-ViT provides a practical solution to the patch-size dilemma in ViTs by striking
a balance between accuracy and computational cost, all within a simple, end-to-end trainable design.
1 Introduction
Originally introduced by Dosovitskiy et al. [9], Vision Transformers (ViTs) adapt the
transformer architecture [26] for image processing tasks by treating images as sequences of
patches. This approach employs self-attention mechanisms to capture global dependencies,
yielding remarkable performance across numerous vision applications. The success of the
original ViT has led to multiple improvements, such as DeiT [24], which proposes efficient
training techniques, and CaiT [25], which enhances feature representation by increasing
model depth.
Despite their success, ViTs face computational challenges stemming from self-attention
complexity, which grows quadratically with the number of tokens. Smaller patches yield
more tokens and capture finer details, but with significantly higher computational cost.
To address this, the Swin Transformer [20] introduces local attention windows to reduce
overall complexity, and PVT [27] employs a progressive shrinking pyramid to reduce the
token count in deeper layers. Another approach, TNT [13], processes patch-level and pixel-
level tokens in a nested structure, emphasizing multi-scale feature representation.
3 Method
3.1 Overview
Fig. 1 presents an overview of our ensemble architecture. For clarity, we illustrate the case of four autoencoders. The latent token is “plugged in” as an additional token for the transformer.
Each encoder applies convolutional layers with stride 2, each followed by batch normalization and ReLU activation. The feature dimensions evolve as follows:

$$3 \xrightarrow{\text{conv}} 32 \xrightarrow{\text{conv}} 64 \xrightarrow{\text{conv}} 128 \tag{1}$$

The final feature map is flattened and projected to a latent space of dimension 256 through a fully connected layer. This results in a compression ratio of:

$$\text{compression ratio} = \frac{64 \times 64 \times 3}{256} = 48:1 \tag{2}$$
The decoder mirrors this structure, using transposed convolutions to progressively reconstruct the spatial dimensions.
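To make this concrete, the following PyTorch-style sketch instantiates one such autoencoder with the dimensions stated above (64×64×3 input, stride-2 convolutions with channels 3→32→64→128, a 256-dimensional latent, and a mirrored transposed-convolution decoder). Kernel sizes, padding, and the output activation are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """One ensemble member: 64x64x3 image -> 256-d latent -> reconstruction.

    The channel progression (3 -> 32 -> 64 -> 128) and latent size (256) follow
    the text; kernel sizes, padding, and the Sigmoid output (which assumes
    inputs normalized to [0, 1]) are assumptions.
    """
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        # Encoder: three stride-2 convolutions, each followed by BatchNorm + ReLU.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),    # 64x64 -> 32x32
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),   # 32x32 -> 16x16
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 16x16 -> 8x8
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
        )
        self.to_latent = nn.Linear(128 * 8 * 8, latent_dim)   # flatten -> 256-d latent
        self.from_latent = nn.Linear(latent_dim, 128 * 8 * 8)
        # Decoder: mirrored transposed convolutions restoring the spatial size.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 8x8 -> 16x16
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),   # 16x16 -> 32x32
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),    # 32x32 -> 64x64
            nn.Sigmoid(),
        )

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return self.to_latent(self.encoder(x).flatten(1))     # z_i in R^256

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.from_latent(self.encode(x)).view(-1, 128, 8, 8)
        return self.decoder(h)
```

With a 64×64×3 input (12,288 values) compressed into a 256-dimensional latent, this sketch reproduces the 48:1 ratio of Eq. (2).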
Patch Tokenization The input image is divided into non-overlapping patches of size $s_p \times s_p$, resulting in $N = (H/s_p) \times (W/s_p)$ patches, where $s_p$ denotes the patch size in pixels. Each patch is linearly projected to dimension $D$ through a learnable embedding matrix $E \in \mathbb{R}^{(s_p \times s_p \times 3) \times D}$:

$$t_{\text{patch}} = [x_p^1 E;\; x_p^2 E;\; \ldots;\; x_p^N E] \in \mathbb{R}^{N \times D} \tag{6}$$
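For illustration, a minimal sketch of this tokenization step (PyTorch assumed; the embedding dimension D = 192 and the unfold-based patch extraction are choices made for the example, not values from the paper):

```python
import torch
import torch.nn as nn

def patch_tokens(images: torch.Tensor, proj: nn.Linear, patch_size: int) -> torch.Tensor:
    """Split images into non-overlapping s_p x s_p patches and project each to dimension D.

    images: (B, 3, H, W); proj: Linear(3 * s_p * s_p -> D).
    Returns t_patch of shape (B, N, D) with N = (H / s_p) * (W / s_p).
    """
    B, C, H, W = images.shape
    patches = (images
               .unfold(2, patch_size, patch_size)        # (B, C, H/s, W, s)
               .unfold(3, patch_size, patch_size)        # (B, C, H/s, W/s, s, s)
               .permute(0, 2, 3, 1, 4, 5)                # (B, H/s, W/s, C, s, s)
               .reshape(B, -1, C * patch_size * patch_size))
    return proj(patches)                                  # (B, N, D)

# Example: 64x64 inputs with 16x16 patches give N = 16 patch tokens.
proj = nn.Linear(3 * 16 * 16, 192)
tokens = patch_tokens(torch.randn(2, 3, 64, 64), proj, patch_size=16)
print(tokens.shape)  # torch.Size([2, 16, 192])
```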
Latent Encoding Simultaneously, our ensemble of five autoencoders processes the input image, each producing a latent vector $z_i \in \mathbb{R}^{256}$. These latents are concatenated to form the latent token.
Token Sequence Formation The final sequence presented to the transformer concatenates the class token $t_{\text{cls}}$, the patch tokens, and the latent token:

$$T_{\text{final}} = T + P \tag{10}$$

where $T$ denotes this concatenated token sequence and $P$ the positional embeddings.
For the final classification, we utilize both the class token and the latent token, concatenating their representations after the transformer processing.
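The sketch below summarizes our reading of this token-formation and classification path: the autoencoder latents are concatenated and projected into one extra token, the sequence [t_cls; t_patch; t_latent] receives positional embeddings as in Eq. (10), and the classifier consumes the concatenated CLS and latent-token outputs. The embedding dimension and the latent projection layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AEViTHead(nn.Module):
    """Builds [CLS; patch tokens; latent token] + positional embeddings and
    classifies from the concatenated CLS / latent outputs. Dimensions are
    illustrative, not the paper's exact configuration."""
    def __init__(self, num_patches: int, dim: int = 192, num_ae: int = 5,
                 latent_dim: int = 256, num_classes: int = 100):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Project the concatenated autoencoder latents into one extra token.
        self.latent_proj = nn.Linear(num_ae * latent_dim, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        self.classifier = nn.Linear(2 * dim, num_classes)

    def build_sequence(self, t_patch: torch.Tensor, latents: list) -> torch.Tensor:
        B = t_patch.size(0)
        z = torch.cat(latents, dim=-1)                  # (B, num_ae * latent_dim)
        t_latent = self.latent_proj(z).unsqueeze(1)     # (B, 1, dim)
        cls = self.cls_token.expand(B, -1, -1)          # (B, 1, dim)
        T = torch.cat([cls, t_patch, t_latent], dim=1)  # (B, N + 2, dim)
        return T + self.pos_embed                       # T_final = T + P (Eq. 10)

    def classify(self, encoded: torch.Tensor) -> torch.Tensor:
        # Use both the CLS token (index 0) and the latent token (last index).
        features = torch.cat([encoded[:, 0], encoded[:, -1]], dim=-1)
        return self.classifier(features)
```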
Our training process follows a two-phase approach designed to maximize the complemen-
tary strengths of both the autoencoder ensemble and the Vision Transformer.
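Since the two phases are not spelled out in detail here, the sketch below shows one plausible schedule under our assumptions: phase one pretrains each autoencoder with a reconstruction loss, and phase two trains the full AE-ViT end-to-end on the classification objective. Optimizers, learning rates, and epoch counts are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

def train_two_phase(autoencoders, ae_vit, pretrain_loader, train_loader,
                    device, ae_epochs=20, vit_epochs=100):
    """Hypothetical two-phase schedule for AE-ViT training."""
    # Phase 1: pretrain each ensemble member with an MSE reconstruction loss.
    for ae in autoencoders:
        opt = torch.optim.AdamW(ae.parameters(), lr=1e-3)
        for _ in range(ae_epochs):
            for images, _ in pretrain_loader:
                images = images.to(device)
                loss = nn.functional.mse_loss(ae(images), images)
                opt.zero_grad()
                loss.backward()
                opt.step()

    # Phase 2: train the full model (transformer + latent pathway) on classification.
    opt = torch.optim.AdamW(ae_vit.parameters(), lr=3e-4)
    for _ in range(vit_epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            loss = nn.functional.cross_entropy(ae_vit(images), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```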
4 Experiments
With 16×16 patches, our approach not only outperforms the baseline ViT but also
achieves better accuracy than the more computationally intensive 8×8 patch configuration,
and yields a 23.67% relative accuracy improvement over the baseline ViT with 16×16
patches. Most notably, with 32×32 patches, AE-ViT maintains reasonable performance
(60.64%) while the baseline ViT severely degrades (35.86%), demonstrating a remarkable
relative improvement of 69%.
Scaling to CIFAR-10. To validate the generality of our approach, we conduct experiments on CIFAR-10. Results in Table 2 show that AE-ViT maintains its effectiveness: the architecture also performs well on the CIFAR-10 dataset.
Conclusion. These results indicate that adding more than four autoencoders is not beneficial for the AE-ViT architecture; four autoencoders represent an optimal balance between accuracy, generalization, and computational efficiency.
We also compare AE-ViT (16×16 patches) against the baseline ViT with 8×8 patches, which achieves similar performance levels. This comparison is particularly interesting as it addresses the trade-off between accuracy and computational efficiency. Table 4 presents a detailed comparison.
While Swin Transformer achieves higher accuracy, AE-ViT offers significantly better
computational efficiency.
This efficiency gap widens further at higher resolutions, for example with 1080p images (1920×1080).
This scaling advantage stems from our efficient use of large patches (16×16) com-
bined with the fixed-cost autoencoder ensemble. While Swin Transformer’s computational
requirements grow quadratically with image size, AE-ViT maintains better efficiency, mak-
ing it particularly suitable for high-resolution applications where computational resources
are constrained.
The results reveal that AE-ViT achieves comparable (slightly better) accuracy while
requiring about 35% fewer FLOPs. This efficiency gain stems primarily from two factors:
1) Token Efficiency: AE-ViT processes only 18 tokens (16 patch tokens + 1 CLS token
+ 1 latent token) compared to 65 tokens in the 8×8 ViT, resulting in a 72.3% reduction
in the self-attention computational load.
2) Computational Distribution: While AE-ViT introduces additional computation through
its autoencoder ensemble (136.3M FLOPs), this is more than offset by the reduced trans-
former complexity (78.4M FLOPs vs 287.6M FLOPs).
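These token counts can be sanity-checked with a short back-of-the-envelope calculation (only the counts themselves, 18 vs. 65 tokens, come from the text):

```python
# Token counts from the comparison above (64x64 inputs).
tokens_ae_vit = 16 + 1 + 1   # 16 patch tokens + CLS + latent token = 18
tokens_vit_8x8 = 64 + 1      # 64 patch tokens (8x8 patches) + CLS = 65

# Reduction in token count quoted above; the linear terms of self-attention
# (projections, MLP) scale with this factor.
print(f"token reduction: {1 - tokens_ae_vit / tokens_vit_8x8:.1%}")                   # 72.3%

# The n x n attention matrix shrinks quadratically, which also drives the
# memory savings discussed below.
print(f"attention-matrix reduction: {1 - tokens_ae_vit**2 / tokens_vit_8x8**2:.1%}")  # 92.3%
```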
The memory footprint also favors AE-ViT, as the attention mechanism’s quadratic
memory scaling with respect to the number of tokens (O(n²)) makes the reduced token
count particularly significant. This demonstrates that our approach not only bridges the
performance gap of larger patches but does so in a computationally efficient manner.
5 Discussion
Our experimental results demonstrate several key findings about AE-ViT and provide
insights into the trade-offs between patch size, computational efficiency, and model per-
formance.
The systematic study of autoencoder ensemble size reveals a clear optimization pattern.
Starting from a single autoencoder (59.11%), we observe significant improvements with
each additional autoencoder up to four (61.76%), followed by diminishing returns with
five autoencoders (61.75%) and performance degradation with six (61.27%). This pattern
suggests that:
– The ensemble approach is fundamentally sound, with even two autoencoders outper-
forming a single autoencoder by 1.62%
– Four autoencoders represent an optimal balance between performance and complexity
– Additional autoencoders beyond four may introduce unnecessary redundancy or noise
Perhaps the most striking result is AE-ViT’s ability to maintain performance with larger
patch sizes:
– With 16×16 patches, AE-ViT (61.76%) matches the performance of standard ViT with
8×8 patches (61.23%) while using 35% fewer FLOPs
– With 32×32 patches, AE-ViT (60.64%) demonstrates remarkable resilience compared
to the baseline (35.86%), achieving a 69% relative improvement
This scaling behavior suggests that our autoencoder ensemble effectively compensates for
the information loss in larger patches, potentially offering a pathway to processing high-
resolution images efficiently.
The mixed-ensemble experiments further show that:
– The mixed ensemble (61.94%) slightly outperforms the pure CIFAR-100 ensemble
(61.76%)
– The improvement, while modest, suggests potential benefits from diverse training data
– The approach maintains effectiveness across datasets, as demonstrated by strong per-
formance on CIFAR-10 (84.35%)
Comparing AE-ViT with state-of-the-art models like Swin Transformer reveals an inter-
esting efficiency-accuracy trade-off:
– While Swin-T achieves higher accuracy (78.41%), it requires 24× more FLOPs
– AE-ViT’s efficiency advantage grows with image resolution due to fixed autoencoder
costs
– The parameter count remains modest (12.4M vs 28.3M for Swin-T)
5.5 Limitations and Future Work
Promising directions for future work include:
– Develop dynamic ensemble selection mechanisms that adapt to specific domains and
computational constraints
– Explore alternative autoencoder architectures for enhanced feature extraction
– Investigate integration with state-of-the-art transformer variants
– Explore the cross-dataset learning presented in Section 4.3 for further knowledge transfer
– Use various types of autoencoders to increase the robustness of transformers, aiming to improve on the results reported in [23]
Extended Applications
6 Conclusion
In this paper, we introduced AE-ViT, a novel approach that effectively addresses the
patch-size dilemma in Vision Transformers through an ensemble of autoencoders. Our
method achieves a 23.67% relative improvement over the baseline ViT on CIFAR-100
with 16×16 patches, while using significantly fewer computational resources than models
that rely on smaller patches. Through extensive experimentation, we identified an optimal
configuration of four autoencoders that balances performance and efficiency.
The effectiveness of AE-ViT is particularly evident in its ability to maintain strong
performance with large patches, achieving 60.64% accuracy with 32×32 patches compared
to the baseline’s 35.86%. This capability, combined with its strong showing on CIFAR-10
(84.35%) and efficient scaling properties, demonstrates the potential of our approach for
high-resolution image processing applications. Most importantly, AE-ViT offers a practical
solution for scenarios requiring efficient vision processing, providing a favorable trade-off
between accuracy and computational cost. While more computationally intensive models
like Swin Transformer achieve higher absolute accuracy, AE-ViT’s efficiency (24× fewer
FLOPs) makes it particularly attractive for resource-constrained settings or real-time pro-
cessing of high-resolution images.
In line with our findings, promising avenues for further research include exploring dy-
namic ensemble selection mechanisms to adapt AE-ViT to specific domains, developing
more sophisticated autoencoder architectures for enhanced feature extraction, and scal-
ing to larger datasets like ImageNet or higher resolutions (1080p and beyond). Extending
AE-ViT to dense prediction tasks (e.g., segmentation or detection) and time-series data
(e.g., video) would also be valuable, as would investigating transfer learning for special-
ized domains such as medical imaging or satellite imagery. These extensions may reinforce
AE-ViT’s robustness, accelerate its adoption in real-world scenarios, and deepen our un-
derstanding of hybrid autoencoder–transformer models.
7 Acknowledgments
We thank the anonymous reviewers for their valuable feedback. We also thank ISPM (https://ispm-edu.com/) for funding our research.
References
1. Baldi, P.: Autoencoders, unsupervised learning, and deep architectures. Proceedings of ICML Work-
shop on Unsupervised and Transfer Learning (2012)
2. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE
Transactions on Pattern Analysis and Machine Intelligence 35(8), 1798–1828 (2013)
3. Chen, C., Derakhshani, M.M., Liu, Z., Fu, J., Shi, Q., Xu, X., Yuan, L.: On the vision transformer
scaling: Parameter scaling laws and improved training strategies. arXiv preprint arXiv:2204.08476
(2022)
4. Chen, C.F., Fan, Q., Panda, R.: Crossvit: Cross-attention multi-scale vision transformer for image
classification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
(2021)
5. Cubuk, E.D., Zoph, B., Mané, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmentation
strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). pp. 113–123 (2019). https://doi.org/10.1109/CVPR.2019.00020
6. Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: Marrying Convolution and Attention for All Data Sizes.
In: Advances in Neural Information Processing Systems. vol. 34, pp. 3965–3977 (2021)
7. d’Ascoli, S., Touvron, H., Leavitt, M., Morcos, A., Biroli, G., Sagun, L.: Convit: Improving vision
transformers with soft convolutional inductive biases. In: Proceedings of the International Conference
on Machine Learning (ICML) (2021)
8. Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.: Attention in atten-
tion: Modeling context correlation for efficient vision transformers. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV) (2021)
9. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M.,
Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image
recognition at scale. In: International Conference on Learning Representations (ICLR) (2021)
10. Graham, B., El-Nouby, A., Touvron, H., Stock, P., Joulin, A., Jégou, H., Douze, M.: Levit: A vision
transformer in convnet’s clothing for faster inference. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision (ICCV) (2021)
11. Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., Xu, C.: Cmt: Convolutional neural networks
meet vision transformers (2022), https://arxiv.org/abs/2107.06263
12. Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., et al.: A
survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
13. Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. In: Advances in
Neural Information Processing Systems (NeurIPS) (2021)
14. Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M.: Transformers in vision: A survey.
ACM Computing Surveys (2021)
15. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep., University
of Toronto (2009)
16. Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced Research). http://www.cs.toronto.edu/~kriz/cifar.html (2009), dataset
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural
networks. Advances in Neural Information Processing Systems 25 (2012)
18. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
19. Liu, K., Zhang, W., Tang, K., Li, Y., Cheng, J., Liu, Q.: A survey of vision transformer. IEEE
Transactions on Pattern Analysis and Machine Intelligence (2022)
20. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical
vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV) (2021)
21. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Efficient transformers: A survey.
ACM Computing Surveys (2021)
22. Mehta, S., Rastegari, M.: MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In: International Conference on Learning Representations (ICLR) (2022), https://openreview.net/forum?id=7RkGY0FtwrY
23. Raboanary, H.A., Raboanary, R., Tahiridimbisoa, N.H.M.: Robustness assessment of neural
network architectures to geometric transformations: A comparative study with data augmen-
tation. In: 2023 3rd International Conference on Electrical, Computer, Communications and
Mechatronics Engineering (ICECCME). IEEE, Tenerife, Canary Islands, Spain (Jul 2023).
https://doi.org/10.1109/ICECCME57830.2023.10253075
24. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient
image transformers & distillation through attention. In: Proceedings of the International Conference
on Machine Learning (ICML) (2021)
25. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image trans-
formers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
(2021)
26. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems.
vol. 30 (2017)
27. Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision
transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
28. Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L.: Cvt: Introducing convolutions to
vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision
(ICCV) (2021)
29. Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.: Early convolutions help trans-
formers see better. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)