Section 1: Baseline Model
Dataset: Maestro / Context Length: 32,768 / Segmentation: Default / Cross Attention Mask: None
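As a rough illustration, the baseline settings above could be collected into a configuration object like the sketch below. The field names (`dataset`, `context_length`, `segmentation`, `cross_attention_mask`) are hypothetical and do not mirror the repository's actual config schema.

```python
# Hypothetical config sketch for the baseline run described above.
# Field names are illustrative only, not taken from the repository.
baseline_config = {
    "dataset": "Maestro",           # MIDI performance dataset used for training
    "context_length": 32_768,       # maximum number of tokens per sequence
    "segmentation": "default",      # no Effective Segmentation in the baseline
    "cross_attention_mask": None,   # no mask applied in the cross-attention stage
}
```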
In this work, we introduce a novel model, PerceiverS, which builds on the Perceiver AR architecture by incorporating Effective Segmentation and a Multi-Scale attention mechanism. The Effective Segmentation approach progressively expands the context segment during training, aligning training more closely with autoregressive generation and enabling smooth, coherent generation across ultra-long symbolic music sequences. The Multi-Scale attention mechanism further enhances the model's ability to capture both long-term structural dependencies and short-term expressive details.
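To make the Effective Segmentation idea concrete, the sketch below shows one plausible way a context segment could be expanded progressively over training. The linear schedule, function name, and parameters are assumptions for illustration only, not the exact procedure used in the paper.

```python
# Conceptual sketch of a progressively expanding context segment.
# Assumes a simple linear schedule; the paper's actual Effective
# Segmentation strategy may differ in shape and granularity.
def segment_length(step: int, total_steps: int,
                   min_len: int = 1024, max_len: int = 32_768) -> int:
    """Return the context-segment length to use at a given training step.

    The segment grows from `min_len` toward `max_len` as training
    progresses, so late-stage training more closely resembles
    autoregressive generation over the full ultra-long context.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    return int(min_len + frac * (max_len - min_len))


# Example: sample the schedule at a few checkpoints of a hypothetical run.
if __name__ == "__main__":
    for s in (0, 25_000, 50_000, 100_000):
        print(s, segment_length(s, total_steps=100_000))
```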
To cite this paper, please use the following format:
@misc{yi2024perceiversmultiscaleperceivereffective,
      title={PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation},
      author={Yungang Yi and Weihua Li and Matthew Kuo and Quan Bai},
      year={2024},
      eprint={2411.08307},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2411.08307},
}