This repository contains experiments and tutorials for understanding VampNet, a masked acoustic token modeling approach for music generation.
VampNet applies masked token modeling on top of a neural audio codec to generate, compress, and transform music. This project breaks down VampNet's operation into understandable components, showing both high-level usage and low-level implementation details.
- `vampnet_tutorial.ipynb` - Main tutorial notebook demonstrating:
  - High-level VampNet interface usage
  - Step-by-step breakdown of the generation pipeline
  - Low-level implementation matching the high-level interface
  - Visualization of tokens, masks, and generation process
- `assets/` - Audio files for testing:
  - `stargazing.wav` - Example input audio
  - `example.wav` - Additional test audio
  - `vampnet.png` - VampNet architecture diagram
See `requirements.txt` for dependencies. Key packages include:

- `vampnet` - The VampNet model implementation
- `audiotools` - Audio processing utilities
- `torch` - PyTorch for model inference
- `matplotlib` - For visualizations
- `ipython` - For notebook audio playback
1. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
3. Download pretrained models:
   - All pretrained models (trained by Hugo) are stored at: https://huggingface.co/hugggof/vampnet
   - Download the models from this link: https://zenodo.org/record/8136629
   - Extract the models to the `models/` folder
   - Licensing for pretrained models: the weights are licensed CC BY-NC-SA 4.0, and any VampNet models fine-tuned from these pretrained models are likewise licensed CC BY-NC-SA 4.0.
4. Run the notebook:

   ```bash
   jupyter notebook vampnet_tutorial.ipynb
   ```
VampNet operates on discrete audio tokens obtained from a neural codec. It uses masking strategies to:
- Preserve periodic prompts (e.g., every 13th token)
- Mask upper codebooks while preserving lower ones
- Generate new tokens in masked positions
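The masking strategies above can be sketched in plain Python. This is an illustrative toy, not VampNet's actual API: the function name, argument names, and the convention that 1 = masked / 0 = preserved are all assumptions made for the example.

```python
# Sketch of the two masking strategies described above.
# Tokens are laid out as [n_codebooks][n_timesteps];
# 1 = masked (to be regenerated), 0 = preserved.
def build_mask(n_codebooks, n_timesteps, periodic_prompt=13, upper_codebook_mask=3):
    mask = [[1] * n_timesteps for _ in range(n_codebooks)]
    for cb in range(n_codebooks):
        if cb < upper_codebook_mask:
            # Preserve the lower codebooks entirely.
            mask[cb] = [0] * n_timesteps
        else:
            # Periodic prompt: keep every `periodic_prompt`-th token.
            for t in range(0, n_timesteps, periodic_prompt):
                mask[cb][t] = 0
    return mask

mask = build_mask(n_codebooks=4, n_timesteps=26)
```

With these settings, codebooks 0-2 are fully preserved, while codebook 3 keeps only the tokens at positions 0 and 13; everything else is left for the model to regenerate.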
- Preprocessing: Normalize audio to -24 dBFS
- Encoding: Convert audio to discrete tokens using neural codec
- Masking: Apply strategic masking patterns
- Coarse Generation: Generate coarse tokens with transformer
- Fine Generation: Refine with coarse-to-fine model
- Decoding: Convert tokens back to audio
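The preprocessing step can be illustrated with a minimal sketch. Note this is a simplification: VampNet's pipeline normalizes via `audiotools`, which measures perceptual loudness, whereas this example uses plain RMS level as a stand-in; the function name and signature are hypothetical.

```python
import math

def normalize_loudness(samples, target_dbfs=-24.0):
    # Scale a mono signal so its RMS level sits at target_dbfs.
    # (RMS is an approximation of the perceptual loudness measure
    # used in the real pipeline.)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    target_rms = 10 ** (target_dbfs / 20.0)
    gain = target_rms / max(rms, 1e-12)
    return [s * gain for s in samples]

# A quiet 440 Hz test tone, boosted to the -24 dBFS target level.
quiet = [0.01 * math.sin(2 * math.pi * 440 * t / 44100) for t in range(4410)]
loud = normalize_loudness(quiet)
```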
- `periodic_prompt`: Interval for keeping unmasked tokens (e.g., 13)
- `upper_codebook_mask`: Number of lower codebooks to preserve
- `temperature`: Sampling temperature for generation
- `typical_filtering`: Whether to use typical sampling
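To make the `temperature` parameter concrete, here is a generic temperature-scaled softmax sampler. This is a standard technique, not code from VampNet; typical filtering (not shown) would additionally prune the candidate set to tokens of near-typical information content before sampling.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, seed=None):
    # Higher temperature flattens the distribution (more diverse tokens);
    # lower temperature sharpens it toward the argmax.
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the categorical distribution.
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

At very low temperatures this behaves almost greedily, which is why low `temperature` values produce more conservative, repetitive generations.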
- Paper: VampNet: Music Generation via Masked Acoustic Token Modeling (García et al., 2023)
- Original implementation: https://github.com/hugofloresgarcia/vampnet