Do Foundational Audio Encoders Understand Music Structure?

This repository contains the official PyTorch implementation of "Do Foundational Audio Encoders Understand Music Structure?", presented at ICASSP 2026 (arXiv 2512.17209).

Usage

  1. Prepare the corpus (Harmonix/RWC)

  2. Calculate FAE features

  3. Visualize FAE features

  4. Train/evaluate linear probing models (a minimal sketch follows below)
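
The probing step is a frame-wise linear classifier trained on top of frozen FAE features. The snippet below is only a minimal sketch of that idea, not the repository's training script: the feature/label tensors, the number of section-function classes, and the hyperparameters are placeholders standing in for the outputs of steps 1–2.

```python
import torch
import torch.nn as nn

# Minimal linear-probing sketch (illustrative only; not the repo's training code).
# Assumes frame-level FAE features X: (num_frames, feat_dim) and per-frame
# section-function labels y: (num_frames,) were precomputed in steps 1-2.
feat_dim, num_classes = 1024, 7              # placeholder dimensions
X = torch.randn(5000, feat_dim)              # stand-in for frozen FAE features
y = torch.randint(0, num_classes, (5000,))   # stand-in for function labels

probe = nn.Linear(feat_dim, num_classes)     # the only trainable parameters
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(probe(X), y)
    loss.backward()
    optimizer.step()

# Frame-wise accuracy of the probe on the training features.
accuracy = (probe(X).argmax(dim=1) == y).float().mean()
print(f"train accuracy: {accuracy:.3f}")
```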

Foundational Audio Encoders (FAEs)

| FAE | GitHub repository |
| --- | --- |
| MusicFM | https://github.com/minzwon/musicfm |
| MERT | https://github.com/yizhilll/MERT |
| AudioMAE (Huang et al.) | https://github.com/facebookresearch/AudioMAE |
| AudioMAE (Zhong et al.) | Not publicly released |
| MULE | https://github.com/PandoraMedia/music-audio-representations |
| EnCodec | https://github.com/facebookresearch/encodec |
| DAC | https://github.com/descriptinc/descript-audio-codec |
| PANNs | https://github.com/qiuqiangkong/audioset_tagging_cnn |
| PaSST | https://github.com/kkoutini/PaSST |
| CLAP | https://github.com/LAION-AI/CLAP |
| OpenL3 | https://github.com/torchopenl3/torchopenl3 |
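
Most of the encoders above ship pretrained checkpoints through the linked repositories or Hugging Face. Purely as an illustration of the feature-calculation step (not the pipeline actually used in this repository), the sketch below pulls frame-level embeddings from the public MERT-95M checkpoint via transformers; the input file name, mono downmix, and use of all hidden layers are assumptions made here.

```python
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# Illustrative frame-level feature extraction with the public MERT-95M checkpoint.
# The audio path, resampling, and layer selection are placeholder choices.
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained(
    "m-a-p/MERT-v1-95M", trust_remote_code=True)
model.eval()

wav, sr = torchaudio.load("song.wav")            # hypothetical input file
wav = torchaudio.functional.resample(            # mono, resampled to the
    wav.mean(dim=0), sr, processor.sampling_rate)  # encoder's expected rate

inputs = processor(wav.numpy(), sampling_rate=processor.sampling_rate,
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Stack all transformer layers: (num_layers, num_frames, hidden_dim).
features = torch.stack(outputs.hidden_states).squeeze(1)
print(features.shape)
```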

Supplemental Results

Table A. 8-fold validation of linear probing on the Harmonix dataset. Values are mean±standard deviation. HR.5F and HR3F measure boundary detection; PW and ACC measure function prediction. Each metric is reported without temporal pooling (no pooling) and with pooling (pooling).

| FAE | HR.5F (no pooling) | HR.5F (pooling) | HR3F (no pooling) | HR3F (pooling) | PW (no pooling) | PW (pooling) | ACC (no pooling) | ACC (pooling) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-supervised Learning: Masked Language Modeling (MLM) | | | | | | | | |
| MusicFM (FMA) | 51.22±1.12 | 41.39±1.58 | 58.80±0.84 | 59.19±1.23 | 66.35±1.99 | 63.78±1.29 | 67.65±1.59 | 63.14±0.97 |
| MusicFM (MSD) | 54.19±0.94 | 49.76±0.64 | 60.58±0.76 | 63.91±1.18 | 66.89±1.52 | 64.66±1.14 | 68.13±1.84 | 64.44±1.55 |
| MERT (95M) | 42.94±0.85 | 42.23±1.93 | 52.25±0.56 | 60.99±1.27 | 62.44±1.22 | 63.40±1.33 | 62.58±1.16 | 62.01±1.27 |
| MERT (330M) | 45.39±1.01 | 40.63±1.88 | 54.46±0.99 | 57.72±1.96 | 64.16±1.30 | 64.17±1.37 | 63.77±1.37 | 62.30±1.46 |
| AudioMAE (Huang) | 36.57±1.17 | 36.95±1.18 | 51.09±1.54 | 58.11±1.09 | 60.33±0.88 | 64.58±1.49 | 59.65±1.24 | 63.07±1.93 |
| AudioMAE (Zhong) | 43.92±0.49 | 53.86±1.07 | 59.26±0.80 | 64.87±0.98 | 62.85±1.24 | 64.06±1.71 | 60.99±1.23 | 61.33±2.02 |
| Self-supervised Learning: Contrastive Learning | | | | | | | | |
| MULE | 20.40±0.66 | n/a | 43.61±0.89 | n/a | 57.67±1.43 | n/a | 57.38±1.85 | n/a |
| Self-supervised Learning: Tokenization (Codec) | | | | | | | | |
| EnCodec (24kHz/3kbps) | 23.49±0.89 | 19.39±1.15 | 42.63±0.75 | 31.88±0.74 | 53.86±0.98 | 52.87±1.07 | 48.95±1.56 | 45.87±1.18 |
| EnCodec (24kHz/6kbps) | 23.47±0.68 | 19.15±1.38 | 42.78±0.65 | 31.90±0.80 | 53.72±1.16 | 52.71±1.12 | 48.73±1.42 | 46.09±1.84 |
| EnCodec (24kHz/12kbps) | 23.77±0.84 | 18.94±1.47 | 42.99±1.05 | 31.88±0.79 | 54.02±1.17 | 52.75±1.11 | 48.88±1.41 | 45.94±1.78 |
| EnCodec (24kHz/24kbps) | 23.98±0.73 | 19.25±1.47 | 43.00±0.72 | 31.81±0.85 | 53.94±1.13 | 52.87±1.14 | 48.52±1.39 | 45.77±2.14 |
| EnCodec (48kHz/3kbps) | 24.00±1.03 | 19.64±0.60 | 43.10±0.98 | 37.27±0.76 | 55.25±1.47 | 53.94±1.11 | 51.82±1.78 | 47.50±1.98 |
| EnCodec (48kHz/6kbps) | 23.89±1.15 | 19.04±0.79 | 42.80±1.08 | 36.08±0.57 | 55.34±1.09 | 54.30±1.14 | 52.74±1.59 | 47.99±1.52 |
| EnCodec (48kHz/12kbps) | 23.27±1.13 | 19.06±1.54 | 42.55±1.02 | 34.78±0.85 | 54.98±1.12 | 53.84±0.80 | 52.57±1.40 | 46.72±1.84 |
| EnCodec (48kHz/24kbps) | 23.42±1.09 | 19.67±1.40 | 42.60±0.94 | 34.77±1.33 | 54.42±0.96 | 53.44±0.96 | 52.40±1.59 | 44.63±1.75 |
| DAC | 23.33±1.06 | 19.10±0.93 | 42.73±0.81 | 39.63±0.96 | 54.79±1.35 | 55.06±0.83 | 50.34±1.76 | 50.21±1.42 |
| Supervised Fine-tuning (Audio Tagging) after MLM | | | | | | | | |
| AudioMAE (Huang) | 44.26±0.70 | 38.41±1.34 | 57.23±0.89 | 59.14±0.42 | 63.30±1.61 | 63.95±1.92 | 63.25±1.70 | 63.14±1.64 |
| AudioMAE (Zhong) | 37.74±1.10 | 36.50±1.25 | 53.82±0.91 | 54.31±1.14 | 62.61±1.69 | 61.53±1.39 | 61.09±2.28 | 58.64±1.21 |
| Supervised Learning (Audio Tagging) | | | | | | | | |
| PANNs | n/a | 26.12±0.76 | n/a | 46.37±0.89 | n/a | 59.29±0.94 | n/a | 57.55±1.59 |
| PaSST | 28.94±1.08 | 22.00±0.96 | 45.52±0.87 | 44.06±1.20 | 59.28±1.08 | 58.39±1.56 | 57.61±1.10 | 55.80±1.94 |
| Supervised Learning & Fine-tuning (Sound Event Detection) | | | | | | | | |
| PANNs | 28.73±0.83 | 23.89±0.72 | 53.22±0.72 | 46.73±0.79 | 60.01±1.29 | 57.60±1.23 | 58.45±1.24 | 54.90±1.06 |
| Cross-modal Contrastive Learning (Audio-text) | | | | | | | | |
| CLAP (music-audioset) | n/a | 29.21±0.96 | n/a | 46.60±1.30 | n/a | 60.36±1.08 | n/a | 58.56±1.21 |
| CLAP (music-speech-audioset) | n/a | 29.29±0.92 | n/a | 46.50±1.17 | n/a | 60.46±1.19 | n/a | 59.03±0.96 |
| Cross-modal Contrastive Learning (Audio-visual) | | | | | | | | |
| OpenL3 | 38.33±1.24 | 22.65±0.86 | 50.24±0.95 | 44.48±1.20 | 60.30±1.88 | 60.15±1.05 | 58.09±2.40 | 58.45±1.23 |
Table B. Cross-dataset validation of linear probing: models trained on the RWC dataset and evaluated on the Harmonix dataset. Columns as in Table A.

| FAE | HR.5F (no pooling) | HR.5F (pooling) | HR3F (no pooling) | HR3F (pooling) | PW (no pooling) | PW (pooling) | ACC (no pooling) | ACC (pooling) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-supervised Learning: Masked Language Modeling (MLM) | | | | | | | | |
| MusicFM (FMA) | 48.0 | 36.0 | 55.8 | 55.2 | 62.4 | 60.1 | 56.4 | 51.9 |
| MusicFM (MSD) | 49.5 | 45.0 | 55.6 | 59.1 | 61.3 | 59.5 | 54.9 | 50.6 |
| MERT (95M) | 37.5 | 38.7 | 49.6 | 57.6 | 57.3 | 57.9 | 49.4 | 47.7 |
| MERT (330M) | 36.6 | 32.2 | 48.3 | 49.6 | 58.7 | 58.1 | 50.7 | 47.9 |
| AudioMAE (Huang) | 30.6 | 30.0 | 46.4 | 50.9 | 54.9 | 59.0 | 49.2 | 48.8 |
| AudioMAE (Zhong) | 38.6 | 44.7 | 54.2 | 57.5 | 59.5 | 59.0 | 50.4 | 48.2 |
| Self-supervised Learning: Contrastive Learning | | | | | | | | |
| MULE | 17.3 | n/a | 41.3 | n/a | 55.4 | n/a | 48.5 | n/a |
| Self-supervised Learning: Tokenization (Codec) | | | | | | | | |
| EnCodec (24kHz/3kbps) | 23.5 | 16.8 | 42.6 | 30.9 | 54.2 | 53.9 | 47.0 | 44.7 |
| EnCodec (24kHz/6kbps) | 23.3 | 17.4 | 42.7 | 31.3 | 54.1 | 53.9 | 46.9 | 43.3 |
| EnCodec (24kHz/12kbps) | 23.0 | 16.5 | 42.6 | 31.0 | 53.9 | 53.7 | 46.0 | 43.7 |
| EnCodec (24kHz/24kbps) | 23.3 | 17.1 | 42.6 | 31.0 | 54.1 | 53.5 | 46.2 | 43.3 |
| EnCodec (48kHz/3kbps) | 24.5 | 17.8 | 43.0 | 36.2 | 55.6 | 54.3 | 51.3 | 46.7 |
| EnCodec (48kHz/6kbps) | 23.5 | 16.2 | 42.6 | 35.8 | 55.7 | 54.4 | 51.3 | 46.9 |
| EnCodec (48kHz/12kbps) | 22.5 | 15.9 | 42.0 | 35.1 | 55.5 | 54.1 | 50.7 | 47.1 |
| EnCodec (48kHz/24kbps) | 22.0 | 16.0 | 41.7 | 34.7 | 55.4 | 54.2 | 51.2 | 47.3 |
| DAC | 21.2 | 17.1 | 41.0 | 38.9 | 53.8 | 53.5 | 44.1 | 41.7 |
| Supervised Fine-tuning (Audio Tagging) after MLM | | | | | | | | |
| AudioMAE (Huang) | 34.9 | 24.3 | 49.0 | 45.6 | 55.3 | 56.2 | 44.6 | 45.2 |
| AudioMAE (Zhong) | 31.1 | 26.7 | 48.1 | 45.3 | 58.2 | 57.4 | 50.1 | 46.5 |
| Supervised Learning (Audio Tagging) | | | | | | | | |
| PANNs | n/a | 25.2 | n/a | 43.0 | n/a | 56.3 | n/a | 44.5 |
| PaSST | 25.4 | 19.6 | 43.5 | 40.9 | 57.2 | 55.8 | 51.6 | 48.7 |
| Supervised Learning & Fine-tuning (Sound Event Detection) | | | | | | | | |
| PANNs | 26.8 | 20.8 | 50.9 | 44.8 | 58.2 | 57.2 | 47.1 | 46.2 |
| Cross-modal Contrastive Learning (Audio-text) | | | | | | | | |
| CLAP (music-audioset) | n/a | 27.0 | n/a | 43.8 | n/a | 56.2 | n/a | 44.4 |
| CLAP (music-speech-audioset) | n/a | 26.0 | n/a | 43.5 | n/a | 57.3 | n/a | 48.8 |
| Cross-modal Contrastive Learning (Audio-visual) | | | | | | | | |
| OpenL3 | 29.4 | 19.3 | 43.5 | 41.6 | 55.0 | 54.8 | 44.8 | 42.6 |
Table C. Cross-dataset validation of linear probing: models trained on the Harmonix dataset and evaluated on the RWC dataset. Columns as in Table A.

| FAE | HR.5F (no pooling) | HR.5F (pooling) | HR3F (no pooling) | HR3F (pooling) | PW (no pooling) | PW (pooling) | ACC (no pooling) | ACC (pooling) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-supervised Learning: Masked Language Modeling (MLM) | | | | | | | | |
| MusicFM (FMA) | 55.4 | 41.7 | 67.3 | 64.8 | 63.3 | 60.7 | 60.3 | 55.4 |
| MusicFM (MSD) | 59.3 | 53.1 | 69.5 | 68.9 | 66.8 | 64.5 | 61.1 | 57.4 |
| MERT (95M) | 46.8 | 42.3 | 60.5 | 66.3 | 60.6 | 62.8 | 52.9 | 54.0 |
| MERT (330M) | 48.0 | 36.2 | 61.2 | 60.6 | 62.3 | 61.2 | 53.2 | 52.9 |
| AudioMAE (Huang) | 38.6 | 36.4 | 59.5 | 61.7 | 56.5 | 64.5 | 53.4 | 53.9 |
| AudioMAE (Zhong) | 42.8 | 50.6 | 64.7 | 65.3 | 61.8 | 62.5 | 56.1 | 49.2 |
| Self-supervised Learning: Contrastive Learning | | | | | | | | |
| MULE | 19.7 | n/a | 46.8 | n/a | 55.4 | n/a | 51.4 | n/a |
| Self-supervised Learning: Tokenization (Codec) | | | | | | | | |
| EnCodec (24kHz/3kbps) | 27.6 | 17.8 | 50.2 | 29.6 | 50.6 | 50.1 | 47.2 | 43.9 |
| EnCodec (24kHz/6kbps) | 28.4 | 17.5 | 51.9 | 34.3 | 51.0 | 50.6 | 47.8 | 45.6 |
| EnCodec (24kHz/12kbps) | 28.2 | 17.7 | 51.5 | 33.9 | 50.4 | 49.9 | 47.3 | 44.5 |
| EnCodec (24kHz/24kbps) | 28.0 | 18.1 | 51.8 | 34.1 | 50.1 | 50.4 | 47.1 | 44.5 |
| EnCodec (48kHz/3kbps) | 28.4 | 19.5 | 51.1 | 37.1 | 50.0 | 49.1 | 48.0 | 40.4 |
| EnCodec (48kHz/6kbps) | 28.0 | 17.9 | 52.6 | 32.2 | 49.9 | 49.3 | 47.8 | 38.6 |
| EnCodec (48kHz/12kbps) | 25.9 | 16.6 | 51.7 | 32.9 | 50.2 | 50.4 | 47.7 | 40.1 |
| EnCodec (48kHz/24kbps) | 26.1 | 17.8 | 51.5 | 32.3 | 49.4 | 49.6 | 47.8 | 38.6 |
| DAC | 26.9 | 20.5 | 53.0 | 39.3 | 51.0 | 50.8 | 48.6 | 46.1 |
| Supervised Fine-tuning (Audio Tagging) after MLM | | | | | | | | |
| AudioMAE (Huang) | 45.4 | 35.1 | 64.5 | 63.8 | 62.8 | 63.5 | 53.8 | 51.2 |
| AudioMAE (Zhong) | 38.7 | 34.7 | 60.5 | 56.1 | 61.3 | 58.8 | 55.3 | 49.1 |
| Supervised Learning (Audio Tagging) | | | | | | | | |
| PANNs | n/a | 27.1 | n/a | 55.9 | n/a | 64.0 | n/a | 54.7 |
| PaSST | 34.9 | 22.9 | 52.7 | 47.6 | 58.0 | 56.7 | 49.9 | 50.7 |
| Supervised Learning & Fine-tuning (Sound Event Detection) | | | | | | | | |
| PANNs | 29.5 | 25.2 | 60.1 | 47.4 | 59.4 | 56.7 | 51.7 | 48.1 |
| Cross-modal Contrastive Learning (Audio-text) | | | | | | | | |
| CLAP (music-audioset) | n/a | 28.4 | n/a | 51.5 | n/a | 63.1 | n/a | 52.3 |
| CLAP (music-speech-audioset) | n/a | 29.0 | n/a | 52.7 | n/a | 61.6 | n/a | 52.3 |
| Cross-modal Contrastive Learning (Audio-visual) | | | | | | | | |
| OpenL3 | 42.4 | 21.5 | 53.4 | 28.9 | 52.6 | 47.8 | 45.0 | 37.7 |

References for the supplemental experiments

  • Goto et al., "RWC Music Database: Popular, Classical and Jazz Music Databases," in Proceedings of ISMIR, 2002
  • Goto, "AIST Annotation for the RWC Music Database," in Proceedings of ISMIR, 2006

Citation

@inproceedings{toyama2026icassp,
    author={Keisuke Toyama and Zhi Zhong and Akira Takahashi and Shusuke Takahashi and Yuki Mitsufuji},
    title={Do Foundational Audio Encoders Understand Music Structure?},
    booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing},
    year={2026}
}

Contact
