Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets [arXiv:2506.04598]
by Marianna Nezhurina, Tomer Porian, Giovanni Puccetti, Tommie Kerssies, Romain Beaumont, Mehdi Cherti, Jenia Jitsev
In this repository, we provide detailed results and code to reproduce all figures from the paper.
We demonstrate model and dataset comparison based on scaling law derivation. As a working example, we compare CLIP, trained with a contrastive loss, and MaMMUT, trained with a combined contrastive and text-generative (captioning) loss, using the open datasets Re-LAION-1.4B, DataComp-1.4B, and DFN-1.4B. The plots below illustrate the consistently stronger scalability of MaMMUT across datasets and downstream tasks (zero-shot evaluation), as well as the stronger performance obtained for both CLIP and MaMMUT when training on DFN-1.4B.
The established scaling laws allow us to make accurate predictions, shown here for the important L-14 12.8B scale, extrapolating roughly 4x beyond the compute range used for the scaling law measurements.
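To illustrate the kind of fit behind such predictions (a generic sketch, not the repository's actual fitting code), the snippet below fits a saturating power law L(C) = E + a·C^(-b) to hypothetical (compute, loss) measurements for two models and extrapolates both fits roughly 4x beyond the largest measured compute; all data values, parameter choices, and names are made-up placeholders.

```python
# Generic sketch of scaling-law fitting and extrapolation (not the repo's code).
# Fits L(C) = E + a * C**(-b) per model and extrapolates ~4x beyond the
# largest measured compute. All numbers below are made-up placeholders.
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(c, E, a, b):
    # E: irreducible loss floor, b: scaling exponent
    return E + a * np.power(c, -b)

compute = np.array([1e9, 4e9, 1.6e10, 6.4e10, 2.56e11])  # hypothetical compute budgets
losses = {
    "model_A": np.array([1.10, 0.95, 0.84, 0.76, 0.70]),  # hypothetical eval losses
    "model_B": np.array([1.05, 0.89, 0.78, 0.69, 0.62]),
}

c_norm = compute / compute[0]  # normalize compute for better numerical conditioning

fits = {}
for name, loss in losses.items():
    params, _ = curve_fit(saturating_power_law, c_norm, loss, p0=(0.5, 0.6, 0.3))
    fits[name] = params

target = 4 * c_norm[-1]  # extrapolate ~4x beyond the largest measured scale
for name, (E, a, b) in fits.items():
    pred = saturating_power_law(target, E, a, b)
    print(f"{name}: E={E:.3f} a={a:.3f} b={b:.3f} "
          f"predicted loss at {target * compute[0]:.2e}: {pred:.3f}")
```

Comparing the fitted exponents and the predicted loss at the target compute is one way to compare the scalability of two models or datasets measured under matched conditions.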
In overview.ipynb, you can view detailed results of all models that we trained.
To reproduce all figures from the paper, use the provided bash scripts in the scripts directory; each script generates one of the figures presented in the paper by running the necessary data processing and plotting commands.
To execute all scripts in the correct order and reproduce the figures from the paper, use the reproduce_all_figures.sh script. This master script runs each figure script (e.g., fig1.sh) sequentially:
bash reproduce_all_figures.sh
The master script will run each figure script and generate the corresponding plots in the scaling_laws/derived_scaling_laws
directory.
OpenMaMMUT-L-14 12.8B is a large-scale open vision-language foundation model, trained using insights from the scaling analysis to guide the choice of model and samples-seen scale.
OpenMaMMUT-L-14 12.8B achieves state-of-the-art performance on zero-shot classification and retrieval tasks among similarly sized models trained only on publicly available data (MetaCLIP, DataComp, OpenVision).
HuggingFace model repo, with examples of model usage: OpenMaMMUT-L-14 12.8B DataComp-1.4B
OpenMaMMUT-L-14 12.8B DataComp-1.4B was trained using code from a custom openCLIP+MaMMUT fork and the automated experiments workflow autoexperiment.
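For rough orientation, zero-shot classification would look approximately like the sketch below with an open_clip-style API. This is a hedged illustration, not the official example: the hub id is a placeholder, and loading may require the custom openCLIP+MaMMUT fork mentioned above rather than stock open_clip; please refer to the HuggingFace model card for the authoritative instructions.

```python
# Hedged usage sketch (not the official example): the hub id is a placeholder
# and loading may require the custom openCLIP+MaMMUT fork rather than stock
# open_clip; consult the HuggingFace model card for the actual instructions.
import torch
import open_clip
from PIL import Image

MODEL_ID = "hf-hub:laion/openMaMMUT-ViT-L-14"  # placeholder id, not verified

model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
texts = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("zero-shot probabilities:", probs)
```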
If you find this work helpful, please cite our paper:
@article{nezhurina2025scaling,
title={Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets},
author={Nezhurina, Marianna and Porian, Tomer and Puccetti, Giovanni and Kerssies, Tommie and Beaumont, Romain and Cherti, Mehdi and Jitsev, Jenia},
journal={arXiv preprint arXiv:2506.04598},
url={https://arxiv.org/abs/2506.04598},
year={2025}
}
The authors acknowledge funding by the Federal Ministry of Education and Research of Germany (BMBF) under grant no. 01IS24085C (OPENHAFM), grant no. 16HPC117K (MINERVA), and grant no. 01IS22094B (WestAI - AI Service Center West), as well as co-funding by the EU from the EuroHPC Joint Undertaking programme under grant no. 101182737 (MINERVA) and from the Digital Europe Programme under grant no. 101195233 (openEuroLLM).
The authors acknowledge the Gauss Centre for Supercomputing e.V. for funding this work by providing computing time through the John von Neumann Institute for Computing (NIC) on the supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC); the EuroHPC Joint Undertaking for computing time and storage on the EuroHPC supercomputer LEONARDO, hosted by CINECA (Italy) and the LEONARDO consortium, through the EuroHPC Extreme Access grant EHPC-EXT-2023E02-068; storage resources on JUST, granted and operated by JSC and supported by the Helmholtz Data Federation (HDF); computing time granted by JARA and JSC on the supercomputer JURECA at JSC; and computing time granted on the prototype JEDI via a JUREAP (JUPITER Early Access Programme) grant at JSC.
Further thanks go to the supercomputing facilities and their teams for their support, especially to Damian Alvarez and Mathis Bode from the Jülich Supercomputing Centre (JSC, Germany) and to Laura Morselli from CINECA (Italy).
The authors would also like to express their gratitude to all the people working on making code, models, and data publicly available, advancing community-based research and making research more reproducible. Specifically, we would like to thank all the members of the LAION Discord server community and Open-