Do Foundational Audio Encoders Understand Music Structure?

This repository contains the official PyTorch implementation of "Do Foundational Audio Encoders Understand Music Structure?", presented at ICASSP 2026 (arXiv 2512.17209).

Usage

  1. Prepare the corpus (Harmonix/RWC)

  2. Calculate FAE features

  3. Visualize FAE features

  4. Train/evaluate linear probing models (a minimal sketch follows below)
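
The probing step is a frame-wise linear classifier trained on top of frozen FAE features. The snippet below is only a minimal sketch of that idea, not the repository's training script: the feature/label tensors, the number of section-function classes, and the hyperparameters are placeholders standing in for the outputs of steps 1–2.

```python
import torch
import torch.nn as nn

# Minimal linear-probing sketch (illustrative only; not the repo's training code).
# Assumes frame-level FAE features X: (num_frames, feat_dim) and per-frame
# section-function labels y: (num_frames,) were precomputed in steps 1-2.
feat_dim, num_classes = 1024, 7              # placeholder dimensions
X = torch.randn(5000, feat_dim)              # stand-in for frozen FAE features
y = torch.randint(0, num_classes, (5000,))   # stand-in for function labels

probe = nn.Linear(feat_dim, num_classes)     # the only trainable parameters
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(probe(X), y)
    loss.backward()
    optimizer.step()

# Frame-wise accuracy of the probe on the training features.
accuracy = (probe(X).argmax(dim=1) == y).float().mean()
print(f"train accuracy: {accuracy:.3f}")
```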

Foundational Audio Encoders (FAEs)

| FAE | GitHub repository |
| --- | --- |
| MusicFM | https://github.com/minzwon/musicfm |
| MERT | https://github.com/yizhilll/MERT |
| AudioMAE (Huang et al.) | https://github.com/facebookresearch/AudioMAE |
| AudioMAE (Zhong et al.) | Not publicly released |
| MULE | https://github.com/PandoraMedia/music-audio-representations |
| EnCodec | https://github.com/facebookresearch/encodec |
| DAC | https://github.com/descriptinc/descript-audio-codec |
| PANNs | https://github.com/qiuqiangkong/audioset_tagging_cnn |
| PaSST | https://github.com/kkoutini/PaSST |
| CLAP | https://github.com/LAION-AI/CLAP |
| OpenL3 | https://github.com/torchopenl3/torchopenl3 |
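
Most of the encoders above ship pretrained checkpoints through the linked repositories or Hugging Face. Purely as an illustration of the feature-calculation step (not the pipeline actually used in this repository), the sketch below pulls frame-level embeddings from the public MERT-95M checkpoint via transformers; the input file name, mono downmix, and use of all hidden layers are assumptions made here.

```python
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# Illustrative frame-level feature extraction with the public MERT-95M checkpoint.
# The audio path, resampling, and layer selection are placeholder choices.
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained(
    "m-a-p/MERT-v1-95M", trust_remote_code=True)
model.eval()

wav, sr = torchaudio.load("song.wav")            # hypothetical input file
wav = torchaudio.functional.resample(            # mono, resampled to the
    wav.mean(dim=0), sr, processor.sampling_rate)  # encoder's expected rate

inputs = processor(wav.numpy(), sampling_rate=processor.sampling_rate,
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Stack all transformer layers: (num_layers, num_frames, hidden_dim).
features = torch.stack(outputs.hidden_states).squeeze(1)
print(features.shape)
```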

Supplemental Results

Table A. 8-fold validation of linear probing on the Harmonix dataset. Values are mean±standard deviation. HR.5F and HR3F measure boundary detection; PW and ACC measure function prediction. Each metric is reported without temporal pooling (no pooling) and with pooling (pooling).

| FAE | HR.5F (no pooling) | HR.5F (pooling) | HR3F (no pooling) | HR3F (pooling) | PW (no pooling) | PW (pooling) | ACC (no pooling) | ACC (pooling) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-supervised Learning: Masked Language Modeling (MLM) | | | | | | | | |
| MusicFM (FMA) | 51.22±1.12 | 41.39±1.58 | 58.80±0.84 | 59.19±1.23 | 66.35±1.99 | 63.78±1.29 | 67.65±1.59 | 63.14±0.97 |
| MusicFM (MSD) | 54.19±0.94 | 49.76±0.64 | 60.58±0.76 | 63.91±1.18 | 66.89±1.52 | 64.66±1.14 | 68.13±1.84 | 64.44±1.55 |
| MERT (95M) | 42.94±0.85 | 42.23±1.93 | 52.25±0.56 | 60.99±1.27 | 62.44±1.22 | 63.40±1.33 | 62.58±1.16 | 62.01±1.27 |
| MERT (330M) | 45.39±1.01 | 40.63±1.88 | 54.46±0.99 | 57.72±1.96 | 64.16±1.30 | 64.17±1.37 | 63.77±1.37 | 62.30±1.46 |
| AudioMAE (Huang) | 36.57±1.17 | 36.95±1.18 | 51.09±1.54 | 58.11±1.09 | 60.33±0.88 | 64.58±1.49 | 59.65±1.24 | 63.07±1.93 |
| AudioMAE (Zhong) | 43.92±0.49 | 53.86±1.07 | 59.26±0.80 | 64.87±0.98 | 62.85±1.24 | 64.06±1.71 | 60.99±1.23 | 61.33±2.02 |
| Self-supervised Learning: Contrastive Learning | | | | | | | | |
| MULE | 20.40±0.66 | n/a | 43.61±0.89 | n/a | 57.67±1.43 | n/a | 57.38±1.85 | n/a |
| Self-supervised Learning: Tokenization (Codec) | | | | | | | | |
| EnCodec (24kHz/3kbps) | 23.49±0.89 | 19.39±1.15 | 42.63±0.75 | 31.88±0.74 | 53.86±0.98 | 52.87±1.07 | 48.95±1.56 | 45.87±1.18 |
| EnCodec (24kHz/6kbps) | 23.47±0.68 | 19.15±1.38 | 42.78±0.65 | 31.90±0.80 | 53.72±1.16 | 52.71±1.12 | 48.73±1.42 | 46.09±1.84 |
| EnCodec (24kHz/12kbps) | 23.77±0.84 | 18.94±1.47 | 42.99±1.05 | 31.88±0.79 | 54.02±1.17 | 52.75±1.11 | 48.88±1.41 | 45.94±1.78 |
| EnCodec (24kHz/24kbps) | 23.98±0.73 | 19.25±1.47 | 43.00±0.72 | 31.81±0.85 | 53.94±1.13 | 52.87±1.14 | 48.52±1.39 | 45.77±2.14 |
| EnCodec (48kHz/3kbps) | 24.00±1.03 | 19.64±0.60 | 43.10±0.98 | 37.27±0.76 | 55.25±1.47 | 53.94±1.11 | 51.82±1.78 | 47.50±1.98 |
| EnCodec (48kHz/6kbps) | 23.89±1.15 | 19.04±0.79 | 42.80±1.08 | 36.08±0.57 | 55.34±1.09 | 54.30±1.14 | 52.74±1.59 | 47.99±1.52 |
| EnCodec (48kHz/12kbps) | 23.27±1.13 | 19.06±1.54 | 42.55±1.02 | 34.78±0.85 | 54.98±1.12 | 53.84±0.80 | 52.57±1.40 | 46.72±1.84 |
| EnCodec (48kHz/24kbps) | 23.42±1.09 | 19.67±1.40 | 42.60±0.94 | 34.77±1.33 | 54.42±0.96 | 53.44±0.96 | 52.40±1.59 | 44.63±1.75 |
| DAC | 23.33±1.06 | 19.10±0.93 | 42.73±0.81 | 39.63±0.96 | 54.79±1.35 | 55.06±0.83 | 50.34±1.76 | 50.21±1.42 |
| Supervised Fine-tuning (Audio Tagging) after MLM | | | | | | | | |
| AudioMAE (Huang) | 44.26±0.70 | 38.41±1.34 | 57.23±0.89 | 59.14±0.42 | 63.30±1.61 | 63.95±1.92 | 63.25±1.70 | 63.14±1.64 |
| AudioMAE (Zhong) | 37.74±1.10 | 36.50±1.25 | 53.82±0.91 | 54.31±1.14 | 62.61±1.69 | 61.53±1.39 | 61.09±2.28 | 58.64±1.21 |
| Supervised Learning (Audio Tagging) | | | | | | | | |
| PANNs | n/a | 26.12±0.76 | n/a | 46.37±0.89 | n/a | 59.29±0.94 | n/a | 57.55±1.59 |
| PaSST | 28.94±1.08 | 22.00±0.96 | 45.52±0.87 | 44.06±1.20 | 59.28±1.08 | 58.39±1.56 | 57.61±1.10 | 55.80±1.94 |
| Supervised Learning & Fine-tuning (Sound Event Detection) | | | | | | | | |
| PANNs | 28.73±0.83 | 23.89±0.72 | 53.22±0.72 | 46.73±0.79 | 60.01±1.29 | 57.60±1.23 | 58.45±1.24 | 54.90±1.06 |
| Cross-modal Contrastive Learning (Audio-text) | | | | | | | | |
| CLAP (music-audioset) | n/a | 29.21±0.96 | n/a | 46.60±1.30 | n/a | 60.36±1.08 | n/a | 58.56±1.21 |
| CLAP (music-speech-audioset) | n/a | 29.29±0.92 | n/a | 46.50±1.17 | n/a | 60.46±1.19 | n/a | 59.03±0.96 |
| Cross-modal Contrastive Learning (Audio-visual) | | | | | | | | |
| OpenL3 | 38.33±1.24 | 22.65±0.86 | 50.24±0.95 | 44.48±1.20 | 60.30±1.88 | 60.15±1.05 | 58.09±2.40 | 58.45±1.23 |
Table B. Cross-dataset validation of linear probing: models trained on the RWC dataset and evaluated on the Harmonix dataset. Columns as in Table A.

| FAE | HR.5F (no pooling) | HR.5F (pooling) | HR3F (no pooling) | HR3F (pooling) | PW (no pooling) | PW (pooling) | ACC (no pooling) | ACC (pooling) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-supervised Learning: Masked Language Modeling (MLM) | | | | | | | | |
| MusicFM (FMA) | 48.0 | 36.0 | 55.8 | 55.2 | 62.4 | 60.1 | 56.4 | 51.9 |
| MusicFM (MSD) | 49.5 | 45.0 | 55.6 | 59.1 | 61.3 | 59.5 | 54.9 | 50.6 |
| MERT (95M) | 37.5 | 38.7 | 49.6 | 57.6 | 57.3 | 57.9 | 49.4 | 47.7 |
| MERT (330M) | 36.6 | 32.2 | 48.3 | 49.6 | 58.7 | 58.1 | 50.7 | 47.9 |
| AudioMAE (Huang) | 30.6 | 30.0 | 46.4 | 50.9 | 54.9 | 59.0 | 49.2 | 48.8 |
| AudioMAE (Zhong) | 38.6 | 44.7 | 54.2 | 57.5 | 59.5 | 59.0 | 50.4 | 48.2 |
| Self-supervised Learning: Contrastive Learning | | | | | | | | |
| MULE | 17.3 | n/a | 41.3 | n/a | 55.4 | n/a | 48.5 | n/a |
| Self-supervised Learning: Tokenization (Codec) | | | | | | | | |
| EnCodec (24kHz/3kbps) | 23.5 | 16.8 | 42.6 | 30.9 | 54.2 | 53.9 | 47.0 | 44.7 |
| EnCodec (24kHz/6kbps) | 23.3 | 17.4 | 42.7 | 31.3 | 54.1 | 53.9 | 46.9 | 43.3 |
| EnCodec (24kHz/12kbps) | 23.0 | 16.5 | 42.6 | 31.0 | 53.9 | 53.7 | 46.0 | 43.7 |
| EnCodec (24kHz/24kbps) | 23.3 | 17.1 | 42.6 | 31.0 | 54.1 | 53.5 | 46.2 | 43.3 |
| EnCodec (48kHz/3kbps) | 24.5 | 17.8 | 43.0 | 36.2 | 55.6 | 54.3 | 51.3 | 46.7 |
| EnCodec (48kHz/6kbps) | 23.5 | 16.2 | 42.6 | 35.8 | 55.7 | 54.4 | 51.3 | 46.9 |
| EnCodec (48kHz/12kbps) | 22.5 | 15.9 | 42.0 | 35.1 | 55.5 | 54.1 | 50.7 | 47.1 |
| EnCodec (48kHz/24kbps) | 22.0 | 16.0 | 41.7 | 34.7 | 55.4 | 54.2 | 51.2 | 47.3 |
| DAC | 21.2 | 17.1 | 41.0 | 38.9 | 53.8 | 53.5 | 44.1 | 41.7 |
| Supervised Fine-tuning (Audio Tagging) after MLM | | | | | | | | |
| AudioMAE (Huang) | 34.9 | 24.3 | 49.0 | 45.6 | 55.3 | 56.2 | 44.6 | 45.2 |
| AudioMAE (Zhong) | 31.1 | 26.7 | 48.1 | 45.3 | 58.2 | 57.4 | 50.1 | 46.5 |
| Supervised Learning (Audio Tagging) | | | | | | | | |
| PANNs | n/a | 25.2 | n/a | 43.0 | n/a | 56.3 | n/a | 44.5 |
| PaSST | 25.4 | 19.6 | 43.5 | 40.9 | 57.2 | 55.8 | 51.6 | 48.7 |
| Supervised Learning & Fine-tuning (Sound Event Detection) | | | | | | | | |
| PANNs | 26.8 | 20.8 | 50.9 | 44.8 | 58.2 | 57.2 | 47.1 | 46.2 |
| Cross-modal Contrastive Learning (Audio-text) | | | | | | | | |
| CLAP (music-audioset) | n/a | 27.0 | n/a | 43.8 | n/a | 56.2 | n/a | 44.4 |
| CLAP (music-speech-audioset) | n/a | 26.0 | n/a | 43.5 | n/a | 57.3 | n/a | 48.8 |
| Cross-modal Contrastive Learning (Audio-visual) | | | | | | | | |
| OpenL3 | 29.4 | 19.3 | 43.5 | 41.6 | 55.0 | 54.8 | 44.8 | 42.6 |
Table C. Cross-dataset validation of linear probing: models trained on the Harmonix dataset and evaluated on the RWC dataset. Columns as in Table A.

| FAE | HR.5F (no pooling) | HR.5F (pooling) | HR3F (no pooling) | HR3F (pooling) | PW (no pooling) | PW (pooling) | ACC (no pooling) | ACC (pooling) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-supervised Learning: Masked Language Modeling (MLM) | | | | | | | | |
| MusicFM (FMA) | 55.4 | 41.7 | 67.3 | 64.8 | 63.3 | 60.7 | 60.3 | 55.4 |
| MusicFM (MSD) | 59.3 | 53.1 | 69.5 | 68.9 | 66.8 | 64.5 | 61.1 | 57.4 |
| MERT (95M) | 46.8 | 42.3 | 60.5 | 66.3 | 60.6 | 62.8 | 52.9 | 54.0 |
| MERT (330M) | 48.0 | 36.2 | 61.2 | 60.6 | 62.3 | 61.2 | 53.2 | 52.9 |
| AudioMAE (Huang) | 38.6 | 36.4 | 59.5 | 61.7 | 56.5 | 64.5 | 53.4 | 53.9 |
| AudioMAE (Zhong) | 42.8 | 50.6 | 64.7 | 65.3 | 61.8 | 62.5 | 56.1 | 49.2 |
| Self-supervised Learning: Contrastive Learning | | | | | | | | |
| MULE | 19.7 | n/a | 46.8 | n/a | 55.4 | n/a | 51.4 | n/a |
| Self-supervised Learning: Tokenization (Codec) | | | | | | | | |
| EnCodec (24kHz/3kbps) | 27.6 | 17.8 | 50.2 | 29.6 | 50.6 | 50.1 | 47.2 | 43.9 |
| EnCodec (24kHz/6kbps) | 28.4 | 17.5 | 51.9 | 34.3 | 51.0 | 50.6 | 47.8 | 45.6 |
| EnCodec (24kHz/12kbps) | 28.2 | 17.7 | 51.5 | 33.9 | 50.4 | 49.9 | 47.3 | 44.5 |
| EnCodec (24kHz/24kbps) | 28.0 | 18.1 | 51.8 | 34.1 | 50.1 | 50.4 | 47.1 | 44.5 |
| EnCodec (48kHz/3kbps) | 28.4 | 19.5 | 51.1 | 37.1 | 50.0 | 49.1 | 48.0 | 40.4 |
| EnCodec (48kHz/6kbps) | 28.0 | 17.9 | 52.6 | 32.2 | 49.9 | 49.3 | 47.8 | 38.6 |
| EnCodec (48kHz/12kbps) | 25.9 | 16.6 | 51.7 | 32.9 | 50.2 | 50.4 | 47.7 | 40.1 |
| EnCodec (48kHz/24kbps) | 26.1 | 17.8 | 51.5 | 32.3 | 49.4 | 49.6 | 47.8 | 38.6 |
| DAC | 26.9 | 20.5 | 53.0 | 39.3 | 51.0 | 50.8 | 48.6 | 46.1 |
| Supervised Fine-tuning (Audio Tagging) after MLM | | | | | | | | |
| AudioMAE (Huang) | 45.4 | 35.1 | 64.5 | 63.8 | 62.8 | 63.5 | 53.8 | 51.2 |
| AudioMAE (Zhong) | 38.7 | 34.7 | 60.5 | 56.1 | 61.3 | 58.8 | 55.3 | 49.1 |
| Supervised Learning (Audio Tagging) | | | | | | | | |
| PANNs | n/a | 27.1 | n/a | 55.9 | n/a | 64.0 | n/a | 54.7 |
| PaSST | 34.9 | 22.9 | 52.7 | 47.6 | 58.0 | 56.7 | 49.9 | 50.7 |
| Supervised Learning & Fine-tuning (Sound Event Detection) | | | | | | | | |
| PANNs | 29.5 | 25.2 | 60.1 | 47.4 | 59.4 | 56.7 | 51.7 | 48.1 |
| Cross-modal Contrastive Learning (Audio-text) | | | | | | | | |
| CLAP (music-audioset) | n/a | 28.4 | n/a | 51.5 | n/a | 63.1 | n/a | 52.3 |
| CLAP (music-speech-audioset) | n/a | 29.0 | n/a | 52.7 | n/a | 61.6 | n/a | 52.3 |
| Cross-modal Contrastive Learning (Audio-visual) | | | | | | | | |
| OpenL3 | 42.4 | 21.5 | 53.4 | 28.9 | 52.6 | 47.8 | 45.0 | 37.7 |

References for the supplemental experiments

  • Goto et al., "RWC Music Database: Popular, Classical and Jazz Music Databases," in Proceedings of ISMIR, 2002
  • Goto, "AIST Annotation for the RWC Music Database," in Proceedings of ISMIR, 2006

Citation

@inproceedings{toyama2026icassp,
    author={Keisuke Toyama and Zhi Zhong and Akira Takahashi and Shusuke Takahashi and Yuki Mitsufuji},
    title={Do Foundational Audio Encoders Understand Music Structure?},
    booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing},
    year={2026}
}

Contact
