lm-eval v0.4.8 Release Notes
Key Improvements
- New Backend Support:
  - Added SGLang as a new evaluation backend by @Monstertail
  - Enabled model steering with vector support via sparsify or sae_lens by @luciaquirke and @AMindToThink
- Breaking Change: Python 3.8 support has been dropped as it reached end of life. Please upgrade to Python 3.9 or newer.
- Added support for gen_prefix in config, which lets you append text after the <|assistant|> token (or at the end of non-chat prompts). This is particularly effective for evaluating instruct models.
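To make the gen_prefix behaviour concrete, here is a minimal sketch of where the prefix text ends up in a prompt. It is an illustration only, not the harness's implementation: build_prompt is a hypothetical helper, the <|user|> marker and the prompt layout are assumptions, and the exact whitespace handling in lm-eval may differ.

```python
from typing import Optional

# Chat marker taken from the note above; real chat templates come from the tokenizer.
ASSISTANT_TOKEN = "<|assistant|>"


def build_prompt(question: str, gen_prefix: Optional[str] = None, chat: bool = True) -> str:
    """Hypothetical helper showing where gen_prefix text is placed."""
    if chat:
        # Assumed chat layout: user turn, then the assistant marker.
        prompt = f"<|user|>\n{question}\n{ASSISTANT_TOKEN}\n"
    else:
        # Assumed plain (non-chat) layout.
        prompt = f"Question: {question}\nAnswer:"
    if gen_prefix:
        # Chat: the prefix follows the assistant marker.
        # Non-chat: the prefix is appended to the end of the prompt.
        separator = "" if chat else " "
        prompt += separator + gen_prefix
    return prompt


chat_prompt = build_prompt("What is 2 + 2?", gen_prefix="The answer is")
plain_prompt = build_prompt("What is 2 + 2?", gen_prefix="The answer is", chat=False)
print(chat_prompt)
print(plain_prompt)
```

In both cases the model continues generating from the prefix, which nudges an instruct model toward the answer format the task expects.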
New Benchmarks & Tasks
Code Evaluation
- HumanEval by @hjlee1371 in #1992
- MBPP by @hjlee1371 in #2247
- HumanEval+ and MBPP+ by @bzantium in #2734
Multilingual Expansion
- Global Coverage:
  - Global MMLU (Lite version by @shivalika-singh in #2567, Full version by @bzantium in #2636)
  - MLQA multilingual question answering by @KahnSvaer in #2622
- Asian Languages:
- European Languages:
- Middle Eastern Languages:
  - Arabic MMLU by @bodasadallah in #2541
  - AraDICE task by @firojalam in #2507
Ethics & Reasoning
- Moral Stories by @upunaprosk in #2653
- Histoires Morales by @upunaprosk in #2662
Others
- MMLU Pro Plus by @asgsaeid in #2366
- GroundCocoa by @HarshKohli in #2724
We extend our thanks to all contributors who made this release possible and to our users for their continued support and feedback.
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- drop python 3.8 support by @baberabb in #2575
- Add Global MMLU Lite by @shivalika-singh in #2567
- add warning for truncation by @baberabb in #2585
- Wandb step handling bugfix and feature by @sjmielke in #2580
- AraDICE task config file by @firojalam in #2507
- fix extra_match low if batch_size > 1 by @sywangyi in #2595
- fix model tests by @baberabb in #2604
- update scrolls by @baberabb in #2602
- some minor logging nits by @baberabb in #2609
- Fix gguf loading via Transformers by @CL-ModelCloud in #2596
- Fix Zeno visualizer on tasks like GSM8k by @pasky in #2599
- Fix the format of mgsm zh and ja. by @timturing in #2587
- Add HumanEval by @hjlee1371 in #1992
- Add MBPP by @hjlee1371 in #2247
- Add MLQA by @KahnSvaer in #2622
- assistant prefill by @baberabb in #2615
- fix gen_prefix by @baberabb in #2630
- update pre-commit by @baberabb in #2632
- add hrm8k benchmark for both Korean and English by @bzantium in #2627
- New arabicmmlu by @bodasadallah in #2541
- Add global_mmlu full version by @bzantium in #2636
- Update KorMedMCQA: ver 2.0 by @GyoukChu in #2540
- fix tmlu tmlu_taiwan_specific_tasks tag by @nike00811 in #2420
- fixed mmlu generative response extraction by @RawthiL in #2503
- revise mbpp prompt by @bzantium in #2645
- aggregate by group (total and categories) by @bzantium in #2643
- Fix max_tokens handling in vllm_vlms.py by @jkaniecki in #2637
- separate category for global_mmlu by @bzantium in #2652
- Add Moral Stories by @upunaprosk in #2653
- add TransformerLens example by @nickypro in #2651
- fix multiple input chat template by @baberabb in #2576
- Add Aggregation for Kobest Benchmark by @tryumanshow in #2446
- update pre-commit by @baberabb in #2660
- remove group from bigbench task configs by @baberabb in #2663
- Add Histoires Morales task by @upunaprosk in #2662
- MMLU Pro Plus by @asgsaeid in #2366
- fix early return for multiple dict in task process_results by @baberabb in #2673
- Turkish mmlu Config Update by @ArdaYueksel in #2678
- Fix typos by @omahs in #2679
- remove cuda device assertion by @baberabb in #2680
- Adding the Evalita-LLM benchmark by @m-resta in #2681
- Delete lm_eval/tasks/evalita_llm/single_prompt.zip by @baberabb in #2687
- Update unitxt task.py to bring in line with recent repo changes by @kiersten-stokes in #2684
- change ensure_ascii to False for JsonChatStr by @artemorloff in #2691
- Set defaults for BLiMP scores by @jmichaelov in #2692
- Update remaining references to assistant_prefill in docs to gen_prefix by @kiersten-stokes in #2683
- Update README.md by @upunaprosk in #2694
- fix construct_requests kwargs in python tasks by @baberabb in #2700
- arithmetic: set target delimiter to empty string by @baberabb in #2701
- fix vllm by @baberabb in #2708
- add math_verify to some tasks by @baberabb in #2686
- Logging by @lintangsutawika in #2203
- Replace missing lighteval/MATH-Hard dataset with DigitalLearningGmbH/MATH-lighteval by @f4str in #2719
- remove unused import by @baberabb in #2728
- README updates: Added IberoBench citation info in corresponding READMEs by @naiarapm in #2729
- add o3-mini support by @HelloJocelynLu in #2697
- add Basque translation of ARC and PAWS to BasqueBench by @naiarapm in #2732
- Add cocoteros_es task in spanish_bench by @sgs97ua in #2721
- Fix the import source for eval_logger by @kailashbuki in #2735
- add humaneval+ and mbpp+ by @bzantium in #2734
- Support SGLang as Potential Backend for Evaluation by @Monstertail in #2703
- fix log condition on main by @baberabb in #2737
- fix vllm data parallel by @baberabb in #2746
- [Readme change for SGLang] fix error in readme and add OOM solutions for sglang by @Monstertail in #2738
- Groundcocoa by @HarshKohli in #2724
- fix doc: generate_until only outputs the generated text! by @baberabb in #2755
- Enable steering HF models by @luciaquirke in #2749
- Add test for a simple Unitxt task by @kiersten-stokes in #2742
- add debug log by @baberabb in #2757
- increment version to 0.4.8 by @baberabb in #2760
New Contributors
- @shivalika-singh made their first contribution in #2567
- @sjmielke made their first contribution in #2580
- @firojalam made their first contribution in #2507
- @CL-ModelCloud made their first contribution in #2596
- @pasky made their first contribution in #2599
- @timturing made their first contribution in #2587
- @hjlee1371 made their first contribution in #1992
- @KahnSvaer made their first contribution in #2622
- @bzantium made their first contribution in #2627
- @bodasadallah made their first contribution in #2541
- @GyoukChu made their first contribution in #2540
- @nike00811 made their first contribution in #2420
- @RawthiL made their first contribution in #2503
- @jkaniecki made their first contribution in #2637
- @upunaprosk made their first contribution in #2653
- @nickypro made their first contribution in #2651
- @asgsaeid made their first contribution in #2366
- @omahs made their first contribution in #2679
- @m-resta made their first contribution in #2681
- @f4str made their first contribution in #2719
- @HelloJocelynLu made their first contribution in #2697
- @sgs97ua made their first contribution in #2721
- @kailashbuki made their first contribution in #2735
- @Monstertail made their first contribution in #2703
- @HarshKohli made their first contribution in #2724
- @luciaquirke made their first contribution in #2749
Full Changelog: v0.4.7...v0.4.8