This repository contains the official Python implementation of "Do Foundational Audio Encoders Understand Music Structure?", presented at ICASSP 2026 (arXiv:2512.17209).
- Prepare corpus (Harmonix/RWC)
  - see corpus/README.md
- Calculate FAE features
  - see FAE/README.md
- Visualize FAE features
- Train/evaluate linear probing models
  - see MSA/README.md
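The probing step fits a linear classifier on frozen FAE features. Below is a minimal, self-contained sketch of what "linear probing" means — a plain-Python logistic regression on toy 2-D features standing in for encoder outputs. All names and data here are illustrative; the actual training code lives in MSA/.

```python
import math
import random

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe on frozen feature vectors.

    features: list of feature vectors (list[float]); labels: 0/1 ints.
    Returns the learned weights with the bias appended as the last entry.
    """
    dim = len(features[0])
    w = [0.0] * (dim + 1)  # last entry is the bias term
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss w.r.t. z
            for i, xi in enumerate(x):
                w[i] -= lr * g * xi
            w[-1] -= lr * g
    return w

def predict(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return 1 if z > 0 else 0

# Toy "FAE features": two well-separated clusters standing in for two
# structural functions (e.g. verse vs. chorus). Purely illustrative data.
random.seed(0)
feats = [[random.gauss(-1.0, 0.3), random.gauss(-1.0, 0.3)] for _ in range(50)] \
      + [[random.gauss(1.0, 0.3), random.gauss(1.0, 0.3)] for _ in range(50)]
labels = [0] * 50 + [1] * 50
w = train_linear_probe(feats, labels)
acc = sum(predict(w, x) == y for x, y in zip(feats, labels)) / len(feats)
```

The key property of a probe is that only the linear layer is trained; the encoder's features stay frozen, so accuracy reflects what the features already encode.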
| FAE | GitHub repository |
|---|---|
| MusicFM | https://github.com/minzwon/musicfm |
| MERT | https://github.com/yizhilll/MERT |
| AudioMAE (Huang et al.) | https://github.com/facebookresearch/AudioMAE |
| AudioMAE (Zhong et al.) | Not publicly released |
| MULE | https://github.com/PandoraMedia/music-audio-representations |
| EnCodec | https://github.com/facebookresearch/encodec |
| DAC | https://github.com/descriptinc/descript-audio-codec |
| PANNs | https://github.com/qiuqiangkong/audioset_tagging_cnn |
| PaSST | https://github.com/kkoutini/PaSST |
| CLAP | https://github.com/LAION-AI/CLAP |
| OpenL3 | https://github.com/torchopenl3/torchopenl3 |
HR.5F/HR3F: boundary detection; PW/ACC: function prediction.

| FAE | HR.5F (no pooling) | HR.5F (pooling) | HR3F (no pooling) | HR3F (pooling) | PW (no pooling) | PW (pooling) | ACC (no pooling) | ACC (pooling) |
|---|---|---|---|---|---|---|---|---|
| Self-supervised Learning: Masked Language Modeling (MLM) | ||||||||
| MusicFM (FMA) | 51.22±1.12 | 41.39±1.58 | 58.80±0.84 | 59.19±1.23 | 66.35±1.99 | 63.78±1.29 | 67.65±1.59 | 63.14±0.97 |
| MusicFM (MSD) | 54.19±0.94 | 49.76±0.64 | 60.58±0.76 | 63.91±1.18 | 66.89±1.52 | 64.66±1.14 | 68.13±1.84 | 64.44±1.55 |
| MERT (95M) | 42.94±0.85 | 42.23±1.93 | 52.25±0.56 | 60.99±1.27 | 62.44±1.22 | 63.40±1.33 | 62.58±1.16 | 62.01±1.27 |
| MERT (330M) | 45.39±1.01 | 40.63±1.88 | 54.46±0.99 | 57.72±1.96 | 64.16±1.30 | 64.17±1.37 | 63.77±1.37 | 62.30±1.46 |
| AudioMAE (Huang) | 36.57±1.17 | 36.95±1.18 | 51.09±1.54 | 58.11±1.09 | 60.33±0.88 | 64.58±1.49 | 59.65±1.24 | 63.07±1.93 |
| AudioMAE (Zhong) | 43.92±0.49 | 53.86±1.07 | 59.26±0.80 | 64.87±0.98 | 62.85±1.24 | 64.06±1.71 | 60.99±1.23 | 61.33±2.02 |
| Self-supervised Learning: Contrastive Learning ||||||||
| MULE | 20.40±0.66 | n/a | 43.61±0.89 | n/a | 57.67±1.43 | n/a | 57.38±1.85 | n/a |
| Self-supervised Learning: Tokenization (Codec) | ||||||||
| EnCodec (24kHz/3kbps) | 23.49±0.89 | 19.39±1.15 | 42.63±0.75 | 31.88±0.74 | 53.86±0.98 | 52.87±1.07 | 48.95±1.56 | 45.87±1.18 |
| EnCodec (24kHz/6kbps) | 23.47±0.68 | 19.15±1.38 | 42.78±0.65 | 31.90±0.80 | 53.72±1.16 | 52.71±1.12 | 48.73±1.42 | 46.09±1.84 |
| EnCodec (24kHz/12kbps) | 23.77±0.84 | 18.94±1.47 | 42.99±1.05 | 31.88±0.79 | 54.02±1.17 | 52.75±1.11 | 48.88±1.41 | 45.94±1.78 |
| EnCodec (24kHz/24kbps) | 23.98±0.73 | 19.25±1.47 | 43.00±0.72 | 31.81±0.85 | 53.94±1.13 | 52.87±1.14 | 48.52±1.39 | 45.77±2.14 |
| EnCodec (48kHz/3kbps) | 24.00±1.03 | 19.64±0.60 | 43.10±0.98 | 37.27±0.76 | 55.25±1.47 | 53.94±1.11 | 51.82±1.78 | 47.50±1.98 |
| EnCodec (48kHz/6kbps) | 23.89±1.15 | 19.04±0.79 | 42.80±1.08 | 36.08±0.57 | 55.34±1.09 | 54.30±1.14 | 52.74±1.59 | 47.99±1.52 |
| EnCodec (48kHz/12kbps) | 23.27±1.13 | 19.06±1.54 | 42.55±1.02 | 34.78±0.85 | 54.98±1.12 | 53.84±0.80 | 52.57±1.40 | 46.72±1.84 |
| EnCodec (48kHz/24kbps) | 23.42±1.09 | 19.67±1.40 | 42.60±0.94 | 34.77±1.33 | 54.42±0.96 | 53.44±0.96 | 52.40±1.59 | 44.63±1.75 |
| DAC | 23.33±1.06 | 19.10±0.93 | 42.73±0.81 | 39.63±0.96 | 54.79±1.35 | 55.06±0.83 | 50.34±1.76 | 50.21±1.42 |
| Supervised Fine-tuning (Audio Tagging) after MLM | ||||||||
| AudioMAE (Huang) | 44.26±0.70 | 38.41±1.34 | 57.23±0.89 | 59.14±0.42 | 63.30±1.61 | 63.95±1.92 | 63.25±1.70 | 63.14±1.64 |
| AudioMAE (Zhong) | 37.74±1.10 | 36.50±1.25 | 53.82±0.91 | 54.31±1.14 | 62.61±1.69 | 61.53±1.39 | 61.09±2.28 | 58.64±1.21 |
| Supervised Learning (Audio Tagging) | ||||||||
| PANNs | n/a | 26.12±0.76 | n/a | 46.37±0.89 | n/a | 59.29±0.94 | n/a | 57.55±1.59 |
| PaSST | 28.94±1.08 | 22.00±0.96 | 45.52±0.87 | 44.06±1.20 | 59.28±1.08 | 58.39±1.56 | 57.61±1.10 | 55.80±1.94 |
| Supervised Learning & Fine-tuning (Sound Event Detection) | ||||||||
| PANNs | 28.73±0.83 | 23.89±0.72 | 53.22±0.72 | 46.73±0.79 | 60.01±1.29 | 57.60±1.23 | 58.45±1.24 | 54.90±1.06 |
| Cross-modal Contrastive Learning (Audio-text) | ||||||||
| CLAP (music-audioset) | n/a | 29.21±0.96 | n/a | 46.60±1.30 | n/a | 60.36±1.08 | n/a | 58.56±1.21 |
| CLAP (music-speech-audioset) | n/a | 29.29±0.92 | n/a | 46.50±1.17 | n/a | 60.46±1.19 | n/a | 59.03±0.96 |
| Cross-modal Contrastive Learning (Audio-visual) | ||||||||
| OpenL3 | 38.33±1.24 | 22.65±0.86 | 50.24±0.95 | 44.48±1.20 | 60.30±1.88 | 60.15±1.05 | 58.09±2.40 | 58.45±1.23 |
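The HR.5F and HR3F columns are boundary hit-rate F-measures with ±0.5 s and ±3 s tolerance. A simplified sketch of the metric follows, using greedy one-to-one matching in time order; note the standard implementation (`mir_eval.segment.detection`) uses optimal bipartite matching, and the boundary times below are made up for illustration.

```python
def boundary_f_measure(reference, estimated, window):
    """F-measure for boundary detection: an estimated boundary counts as
    a hit if it lies within `window` seconds of a still-unmatched
    reference boundary (greedy matching, simplified)."""
    ref = sorted(reference)
    est = sorted(estimated)
    matched = set()
    hits = 0
    for e in est:
        for i, r in enumerate(ref):
            if i not in matched and abs(e - r) <= window:
                matched.add(i)
                hits += 1
                break
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref = [0.0, 12.5, 30.1, 45.0]   # annotated section boundaries (seconds)
est = [0.2, 13.8, 29.9, 50.0]   # predicted boundaries (seconds)
hr_05 = boundary_f_measure(ref, est, 0.5)  # strict ±0.5 s tolerance -> 0.5
hr_3 = boundary_f_measure(ref, est, 3.0)   # lenient ±3 s tolerance -> 0.75
```

The two tolerances explain why HR3F is consistently higher than HR.5F in the tables: a prediction 1.3 s off misses the strict window but hits the lenient one.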
HR.5F/HR3F: boundary detection; PW/ACC: function prediction.

| FAE | HR.5F (no pooling) | HR.5F (pooling) | HR3F (no pooling) | HR3F (pooling) | PW (no pooling) | PW (pooling) | ACC (no pooling) | ACC (pooling) |
|---|---|---|---|---|---|---|---|---|
| Self-supervised Learning: Masked Language Modeling (MLM) | ||||||||
| MusicFM (FMA) | 48.0 | 36.0 | 55.8 | 55.2 | 62.4 | 60.1 | 56.4 | 51.9 |
| MusicFM (MSD) | 49.5 | 45.0 | 55.6 | 59.1 | 61.3 | 59.5 | 54.9 | 50.6 |
| MERT (95M) | 37.5 | 38.7 | 49.6 | 57.6 | 57.3 | 57.9 | 49.4 | 47.7 |
| MERT (330M) | 36.6 | 32.2 | 48.3 | 49.6 | 58.7 | 58.1 | 50.7 | 47.9 |
| AudioMAE (Huang) | 30.6 | 30.0 | 46.4 | 50.9 | 54.9 | 59.0 | 49.2 | 48.8 |
| AudioMAE (Zhong) | 38.6 | 44.7 | 54.2 | 57.5 | 59.5 | 59.0 | 50.4 | 48.2 |
| Self-supervised Learning: Contrastive Learning ||||||||
| MULE | 17.3 | n/a | 41.3 | n/a | 55.4 | n/a | 48.5 | n/a |
| Self-supervised Learning: Tokenization (Codec) | ||||||||
| EnCodec (24kHz/3kbps) | 23.5 | 16.8 | 42.6 | 30.9 | 54.2 | 53.9 | 47.0 | 44.7 |
| EnCodec (24kHz/6kbps) | 23.3 | 17.4 | 42.7 | 31.3 | 54.1 | 53.9 | 46.9 | 43.3 |
| EnCodec (24kHz/12kbps) | 23.0 | 16.5 | 42.6 | 31.0 | 53.9 | 53.7 | 46.0 | 43.7 |
| EnCodec (24kHz/24kbps) | 23.3 | 17.1 | 42.6 | 31.0 | 54.1 | 53.5 | 46.2 | 43.3 |
| EnCodec (48kHz/3kbps) | 24.5 | 17.8 | 43.0 | 36.2 | 55.6 | 54.3 | 51.3 | 46.7 |
| EnCodec (48kHz/6kbps) | 23.5 | 16.2 | 42.6 | 35.8 | 55.7 | 54.4 | 51.3 | 46.9 |
| EnCodec (48kHz/12kbps) | 22.5 | 15.9 | 42.0 | 35.1 | 55.5 | 54.1 | 50.7 | 47.1 |
| EnCodec (48kHz/24kbps) | 22.0 | 16.0 | 41.7 | 34.7 | 55.4 | 54.2 | 51.2 | 47.3 |
| DAC | 21.2 | 17.1 | 41.0 | 38.9 | 53.8 | 53.5 | 44.1 | 41.7 |
| Supervised Fine-tuning (Audio Tagging) after MLM | ||||||||
| AudioMAE (Huang) | 34.9 | 24.3 | 49.0 | 45.6 | 55.3 | 56.2 | 44.6 | 45.2 |
| AudioMAE (Zhong) | 31.1 | 26.7 | 48.1 | 45.3 | 58.2 | 57.4 | 50.1 | 46.5 |
| Supervised Learning (Audio Tagging) | ||||||||
| PANNs | n/a | 25.2 | n/a | 43.0 | n/a | 56.3 | n/a | 44.5 |
| PaSST | 25.4 | 19.6 | 43.5 | 40.9 | 57.2 | 55.8 | 51.6 | 48.7 |
| Supervised Learning & Fine-tuning (Sound Event Detection) | ||||||||
| PANNs | 26.8 | 20.8 | 50.9 | 44.8 | 58.2 | 57.2 | 47.1 | 46.2 |
| Cross-modal Contrastive Learning (Audio-text) | ||||||||
| CLAP (music-audioset) | n/a | 27.0 | n/a | 43.8 | n/a | 56.2 | n/a | 44.4 |
| CLAP (music-speech-audioset) | n/a | 26.0 | n/a | 43.5 | n/a | 57.3 | n/a | 48.8 |
| Cross-modal Contrastive Learning (Audio-visual) | ||||||||
| OpenL3 | 29.4 | 19.3 | 43.5 | 41.6 | 55.0 | 54.8 | 44.8 | 42.6 |
HR.5F/HR3F: boundary detection; PW/ACC: function prediction.

| FAE | HR.5F (no pooling) | HR.5F (pooling) | HR3F (no pooling) | HR3F (pooling) | PW (no pooling) | PW (pooling) | ACC (no pooling) | ACC (pooling) |
|---|---|---|---|---|---|---|---|---|
| Self-supervised Learning: Masked Language Modeling (MLM) | ||||||||
| MusicFM (FMA) | 55.4 | 41.7 | 67.3 | 64.8 | 63.3 | 60.7 | 60.3 | 55.4 |
| MusicFM (MSD) | 59.3 | 53.1 | 69.5 | 68.9 | 66.8 | 64.5 | 61.1 | 57.4 |
| MERT (95M) | 46.8 | 42.3 | 60.5 | 66.3 | 60.6 | 62.8 | 52.9 | 54.0 |
| MERT (330M) | 48.0 | 36.2 | 61.2 | 60.6 | 62.3 | 61.2 | 53.2 | 52.9 |
| AudioMAE (Huang) | 38.6 | 36.4 | 59.5 | 61.7 | 56.5 | 64.5 | 53.4 | 53.9 |
| AudioMAE (Zhong) | 42.8 | 50.6 | 64.7 | 65.3 | 61.8 | 62.5 | 56.1 | 49.2 |
| Self-supervised Learning: Contrastive Learning ||||||||
| MULE | 19.7 | n/a | 46.8 | n/a | 55.4 | n/a | 51.4 | n/a |
| Self-supervised Learning: Tokenization (Codec) | ||||||||
| EnCodec (24kHz/3kbps) | 27.6 | 17.8 | 50.2 | 29.6 | 50.6 | 50.1 | 47.2 | 43.9 |
| EnCodec (24kHz/6kbps) | 28.4 | 17.5 | 51.9 | 34.3 | 51.0 | 50.6 | 47.8 | 45.6 |
| EnCodec (24kHz/12kbps) | 28.2 | 17.7 | 51.5 | 33.9 | 50.4 | 49.9 | 47.3 | 44.5 |
| EnCodec (24kHz/24kbps) | 28.0 | 18.1 | 51.8 | 34.1 | 50.1 | 50.4 | 47.1 | 44.5 |
| EnCodec (48kHz/3kbps) | 28.4 | 19.5 | 51.1 | 37.1 | 50.0 | 49.1 | 48.0 | 40.4 |
| EnCodec (48kHz/6kbps) | 28.0 | 17.9 | 52.6 | 32.2 | 49.9 | 49.3 | 47.8 | 38.6 |
| EnCodec (48kHz/12kbps) | 25.9 | 16.6 | 51.7 | 32.9 | 50.2 | 50.4 | 47.7 | 40.1 |
| EnCodec (48kHz/24kbps) | 26.1 | 17.8 | 51.5 | 32.3 | 49.4 | 49.6 | 47.8 | 38.6 |
| DAC | 26.9 | 20.5 | 53.0 | 39.3 | 51.0 | 50.8 | 48.6 | 46.1 |
| Supervised Fine-tuning (Audio Tagging) after MLM | ||||||||
| AudioMAE (Huang) | 45.4 | 35.1 | 64.5 | 63.8 | 62.8 | 63.5 | 53.8 | 51.2 |
| AudioMAE (Zhong) | 38.7 | 34.7 | 60.5 | 56.1 | 61.3 | 58.8 | 55.3 | 49.1 |
| Supervised Learning (Audio Tagging) | ||||||||
| PANNs | n/a | 27.1 | n/a | 55.9 | n/a | 64.0 | n/a | 54.7 |
| PaSST | 34.9 | 22.9 | 52.7 | 47.6 | 58.0 | 56.7 | 49.9 | 50.7 |
| Supervised Learning & Fine-tuning (Sound Event Detection) | ||||||||
| PANNs | 29.5 | 25.2 | 60.1 | 47.4 | 59.4 | 56.7 | 51.7 | 48.1 |
| Cross-modal Contrastive Learning (Audio-text) | ||||||||
| CLAP (music-audioset) | n/a | 28.4 | n/a | 51.5 | n/a | 63.1 | n/a | 52.3 |
| CLAP (music-speech-audioset) | n/a | 29.0 | n/a | 52.7 | n/a | 61.6 | n/a | 52.3 |
| Cross-modal Contrastive Learning (Audio-visual) | ||||||||
| OpenL3 | 42.4 | 21.5 | 53.4 | 28.9 | 52.6 | 47.8 | 45.0 | 37.7 |
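The "pooling" columns in the tables refer to temporal pooling of frame-level FAE features before probing. The sketch below assumes non-overlapping mean pooling over a fixed window — an illustrative choice; see the paper for the pooling actually used.

```python
def mean_pool(frames, window):
    """Average consecutive frame-level feature vectors over non-overlapping
    windows, producing one vector per window (lower temporal resolution)."""
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(f[i] for f in chunk) / len(chunk) for i in range(dim)])
    return pooled

# Six frames of 2-D features pooled with a window of 3 -> two vectors.
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
          [4.0, 1.0], [5.0, 1.0], [6.0, 1.0]]
pooled = mean_pool(frames, 3)  # -> [[2.0, 0.0], [5.0, 1.0]]
```

Pooling trades temporal precision for smoother, more abstract features, which is one way to read the no-pooling vs. pooling gaps in the boundary-detection columns above.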
- Goto et al., "RWC Music Database: Popular, Classical and Jazz Music Databases," in Proceedings of ISMIR, 2002
- Goto, "AIST Annotation for the RWC Music Database," in Proceedings of ISMIR, 2006
@inproceedings{toyama2026icassp,
author={Keisuke Toyama and Zhi Zhong and Akira Takahashi and Shusuke Takahashi and Yuki Mitsufuji},
title={Do Foundational Audio Encoders Understand Music Structure?},
booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing},
year={2026}
}
- Keisuke Toyama (keisuke.toyama@sony.com)