PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation

In this work, we introduce PerceiverS, a novel model that builds on the Perceiver AR architecture by incorporating Effective Segmentation and a Multi-Scale attention mechanism. Effective Segmentation progressively expands the context segment during training, aligning training more closely with the conditions of autoregressive generation and enabling smooth, coherent generation of ultra-long symbolic music sequences. The Multi-Scale attention mechanism further enhances the model's ability to capture long-term structural dependencies while preserving short-term expressive detail.
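As a rough illustration of the Effective Segmentation idea (a sketch, not the authors' released code; the function and parameter names here are hypothetical), the snippet below grows the context segment over the course of training, so that by the end the model has learned to condition on prefixes as long as those it will face during autoregressive generation:

```python
import random

def sample_effective_segment(tokens, step, total_steps,
                             max_context=32768, target_len=1024):
    """Hypothetical sketch: progressively expand the context segment.

    Early in training the model conditions on short prefixes; by the end
    it conditions on prefixes up to `max_context` tokens, matching the
    situation it faces during long autoregressive generation.
    """
    # Linearly grow the allowed context length with training progress.
    progress = step / total_steps
    context_len = max(target_len, int(progress * max_context))

    # Pick a window of `context_len + target_len` tokens from the piece.
    window = context_len + target_len
    start = random.randint(0, max(0, len(tokens) - window))
    segment = tokens[start:start + window]

    # The final `target_len` tokens are the prediction targets; the rest
    # is context consumed through cross-attention.
    return segment[:-target_len], segment[-target_len:]
```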


Section 1: Baseline Model

Dataset: Maestro / Context Length: 32,768 / Segmentation: Default / Cross Attention Mask: None
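For orientation, here is a minimal PyTorch sketch of the Perceiver AR pattern this baseline follows (hypothetical class and dimensions; only the 32,768-token context length comes from the configuration above): a long token prefix is compressed through a single cross-attention into a shorter latent segment, which then passes through a causal self-attention stack.

```python
import torch
import torch.nn as nn

class PerceiverARSketch(nn.Module):
    """Hypothetical minimal Perceiver-AR-style block, not the paper's code."""

    def __init__(self, vocab=512, dim=256, heads=8, latent_len=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.latent_len = latent_len
        # One cross-attention squeezes the long context into the latent segment.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # A causal self-attention layer (a full stack in practice) refines it.
        self.self_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):  # tokens: (batch, seq_len), seq_len up to 32768
        x = self.embed(tokens)
        latents = x[:, -self.latent_len:]  # the segment whose tokens we predict
        # Baseline configuration: no mask on the cross-attention, so every
        # latent position sees the whole prefix; Section 3 swaps in a
        # multi-scale mask here.
        latents, _ = self.cross_attn(latents, x, x)
        n = latents.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        latents = self.self_layer(latents, src_mask=causal)
        return self.out(latents)  # per-position logits over the vocabulary
```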

Section 2: Applied Effective Segmentation

Dataset: Maestro / Context Length: 32,768 / Segmentation: Progressive / Cross Attention Mask: None

Section 3: Multi-Scale Mask Added to Cross Attention

Dataset: Maestro / Context Length: 32,768 / Segmentation: Progressive / Cross Attention Mask: Multi-Scale
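One way such a mask could look (a sketch; the banding scheme below is hypothetical, and the paper defines the actual pattern) is a set of bands at increasing strides: each latent query keeps full resolution over its recent past and sees the distant past only at coarser, subsampled scales, so a single cross-attention covers both expressive detail and long-range structure.

```python
import torch

def multi_scale_mask(latent_len, context_len, scales=(1, 4, 16, 64), band=512):
    """Hypothetical multi-scale cross-attention mask.

    Returns a boolean (latent_len, context_len) mask in PyTorch's
    attn_mask convention: True marks pairs that may NOT attend.
    """
    mask = torch.ones(latent_len, context_len, dtype=torch.bool)
    offset = context_len - latent_len  # latent i sits at context position offset + i
    for stride in scales:
        reach = band * stride          # coarser scales look further back
        for i in range(latent_len):
            pos = offset + i
            lo = max(0, pos - reach)
            # At stride s, only every s-th token in the band stays visible.
            mask[i, lo:pos + 1:stride] = False
    return mask
```

Passed as `attn_mask` to the cross-attention in the Section 1 sketch, a mask like this exposes the full 32,768-token prefix at coarse resolution while keeping the nearest tokens fully visible.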

Section 4: Fine-tuned on Large Datasets

Dataset: Maestro + GiantMIDI + ATEPP / Context Length: 32,768 / Segmentation: Progressive / Cross Attention Mask: Multi-Scale
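For the fine-tuning stage, the three corpora could simply be pooled at the dataset level (a sketch with stand-in tensors; the paper's actual preprocessing, tokenization, and any per-corpus sampling weights are not specified here):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins for the tokenized Maestro, GiantMIDI, and ATEPP corpora
# (hypothetical sizes and sequence lengths).
maestro = TensorDataset(torch.randint(0, 512, (1000, 2048)))
giantmidi = TensorDataset(torch.randint(0, 512, (1000, 2048)))
atepp = TensorDataset(torch.randint(0, 512, (1000, 2048)))

# Fine-tune on the union of all three, shuffled together.
combined = ConcatDataset([maestro, giantmidi, atepp])
loader = DataLoader(combined, batch_size=4, shuffle=True)
```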

Citing This Paper

To cite this paper, please use the following BibTeX entry:

@misc{yi2024perceiversmultiscaleperceivereffective,
      title={PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation},
      author={Yungang Yi and Weihua Li and Matthew Kuo and Quan Bai},
      year={2024},
      eprint={2411.08307},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2411.08307},
}