lm-eval v0.4.8 Release Notes

Key Improvements

New Backend Support:
- Added SGLang as a new evaluation backend, by @Monstertail
- Enabled model steering with vector support via sparsify or sae_lens by @luciaquirke and @AMindToThink
Breaking Change: Python 3.8 support has been dropped as it reached end of life. Please upgrade to Python 3.9 or newer.
Added support for gen_prefix in config, which lets you append text after the <|assistant|> token (or at the end of non-chat prompts). This is particularly effective for evaluating instruct models.
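
To make the gen_prefix behaviour concrete, here is a minimal sketch of the idea in plain Python. It is not lm-eval's implementation: the `build_prompt` helper, the `<|user|>` marker, and the toy prompt layout are assumptions made for illustration only. The only details taken from the note above are the <|assistant|> token and the placement of the prefix.

```python
from typing import Optional

# Marker taken from the release note; real chat templates vary by model.
ASSISTANT_TOKEN = "<|assistant|>"


def build_prompt(question: str, gen_prefix: Optional[str] = None, chat: bool = True) -> str:
    """Toy prompt builder (hypothetical helper, not part of lm-eval)."""
    if chat:
        # Assumed chat layout: user turn, then an open assistant turn.
        prompt = f"<|user|>\n{question}\n{ASSISTANT_TOKEN}\n"
    else:
        # Assumed plain-text layout for non-chat prompts.
        prompt = f"Question: {question}\n"
    if gen_prefix:
        # The prefix goes last, so the model continues from it.
        prompt += gen_prefix
    return prompt


chat_prompt = build_prompt("What is 2 + 2?", gen_prefix="The answer is")
plain_prompt = build_prompt("What is 2 + 2?", gen_prefix="Answer:", chat=False)
print(chat_prompt)
print(plain_prompt)
```

Because the prefix is the last thing the model sees, an instruct model is nudged to start its reply in the expected format, which tends to make answer extraction more reliable.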

New Benchmarks & Tasks

Code Evaluation

HumanEval by @hjlee1371 in #1992
MBPP by @hjlee1371 in #2247
HumanEval+ and MBPP+ by @bzantium in #2734

Multilingual Expansion

Global Coverage:
- Global MMLU (Lite version by @shivalika-singh in #2567, Full version by @bzantium in #2636)
- MLQA multilingual question answering by @KahnSvaer in #2622
Asian Languages:
- HRM8K benchmark for Korean and English by @bzantium in #2627
- Updated KorMedMCQA to version 2.0 by @GyoukChu in #2540
- Fixed TMLU Taiwan-specific tasks tag by @nike00811 in #2420
European Languages:
- Added Evalita-LLM benchmark by @m-resta in #2681
- BasqueBench with Basque translations of ARC and PAWS by @naiarapm in #2732
- Updated Turkish MMLU configuration by @ArdaYueksel in #2678
Middle Eastern Languages:
- Arabic MMLU by @bodasadallah in #2541
- AraDICE task by @firojalam in #2507

Ethics & Reasoning

Moral Stories by @upunaprosk in #2653
Histoires Morales by @upunaprosk in #2662

Others

MMLU Pro Plus by @asgsaeid in #2366
GroundCocoa by @HarshKohli in #2724

We extend our thanks to all contributors who made this release possible and to our users for your continued support and feedback.

Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)

What's Changed

drop python 3.8 support by @baberabb in #2575
Add Global MMLU Lite by @shivalika-singh in #2567
add warning for truncation by @baberabb in #2585
Wandb step handling bugfix and feature by @sjmielke in #2580
AraDICE task config file by @firojalam in #2507
fix extra_match low if batch_size > 1 by @sywangyi in #2595
fix model tests by @baberabb in #2604
update scrolls by @baberabb in #2602
some minor logging nits by @baberabb in #2609
Fix gguf loading via Transformers by @CL-ModelCloud in #2596
Fix Zeno visualizer on tasks like GSM8k by @pasky in #2599
Fix the format of mgsm zh and ja. by @timturing in #2587
Add HumanEval by @hjlee1371 in #1992
Add MBPP by @hjlee1371 in #2247
Add MLQA by @KahnSvaer in #2622
assistant prefill by @baberabb in #2615
fix gen_prefix by @baberabb in #2630
update pre-commit by @baberabb in #2632
add hrm8k benchmark for both Korean and English by @bzantium in #2627
New arabicmmlu by @bodasadallah in #2541
Add global_mmlu full version by @bzantium in #2636
Update KorMedMCQA: ver 2.0 by @GyoukChu in #2540
fix tmlu tmlu_taiwan_specific_tasks tag by @nike00811 in #2420
fixed mmlu generative response extraction by @RawthiL in #2503
revise mbpp prompt by @bzantium in #2645
aggregate by group (total and categories) by @bzantium in #2643
Fix max_tokens handling in vllm_vlms.py by @jkaniecki in #2637
separate category for global_mmlu by @bzantium in #2652
Add Moral Stories by @upunaprosk in #2653
add TransformerLens example by @nickypro in #2651
fix multiple input chat template by @baberabb in #2576
Add Aggregation for Kobest Benchmark by @tryumanshow in #2446
update pre-commit by @baberabb in #2660
remove group from bigbench task configs by @baberabb in #2663
Add Histoires Morales task by @upunaprosk in #2662
MMLU Pro Plus by @asgsaeid in #2366
fix early return for multiple dict in task process_results by @baberabb in #2673
Turkish mmlu Config Update by @ArdaYueksel in #2678
Fix typos by @omahs in #2679
remove cuda device assertion by @baberabb in #2680
Adding the Evalita-LLM benchmark by @m-resta in #2681
Delete lm_eval/tasks/evalita_llm/single_prompt.zip by @baberabb in #2687
Update unitxt task.py to bring in line with recent repo changes by @kiersten-stokes in #2684
change ensure_ascii to False for JsonChatStr by @artemorloff in #2691
Set defaults for BLiMP scores by @jmichaelov in #2692
Update remaining references to assistant_prefill in docs to gen_prefix by @kiersten-stokes in #2683
Update README.md by @upunaprosk in #2694
fix construct_requests kwargs in python tasks by @baberabb in #2700
arithmetic: set target delimiter to empty string by @baberabb in #2701
fix vllm by @baberabb in #2708
add math_verify to some tasks by @baberabb in #2686
Logging by @lintangsutawika in #2203
Replace missing lighteval/MATH-Hard dataset with DigitalLearningGmbH/MATH-lighteval by @f4str in #2719
remove unused import by @baberabb in #2728
README updates: Added IberoBench citation info in corresponding READMEs by @naiarapm in #2729
add o3-mini support by @HelloJocelynLu in #2697
add Basque translation of ARC and PAWS to BasqueBench by @naiarapm in #2732
Add cocoteros_es task in spanish_bench by @sgs97ua in #2721
Fix the import source for eval_logger by @kailashbuki in #2735
add humaneval+ and mbpp+ by @bzantium in #2734
Support SGLang as Potential Backend for Evaluation by @Monstertail in #2703
fix log condition on main by @baberabb in #2737
fix vllm data parallel by @baberabb in #2746
[Readme change for SGLang] fix error in readme and add OOM solutions for sglang by @Monstertail in #2738
Groundcocoa by @HarshKohli in #2724
fix doc: generate_until only outputs the generated text! by @baberabb in #2755
Enable steering HF models by @luciaquirke in #2749
Add test for a simple Unitxt task by @kiersten-stokes in #2742
add debug log by @baberabb in #2757
increment version to 0.4.8 by @baberabb in #2760

New Contributors

@shivalika-singh made their first contribution in #2567
@sjmielke made their first contribution in #2580
@firojalam made their first contribution in #2507
@CL-ModelCloud made their first contribution in #2596
@pasky made their first contribution in #2599
@timturing made their first contribution in #2587
@hjlee1371 made their first contribution in #1992
@KahnSvaer made their first contribution in #2622
@bzantium made their first contribution in #2627
@bodasadallah made their first contribution in #2541
@GyoukChu made their first contribution in #2540
@nike00811 made their first contribution in #2420
@RawthiL made their first contribution in #2503
@jkaniecki made their first contribution in #2637
@upunaprosk made their first contribution in #2653
@nickypro made their first contribution in #2651
@asgsaeid made their first contribution in #2366
@omahs made their first contribution in #2679
@m-resta made their first contribution in #2681
@f4str made their first contribution in #2719
@HelloJocelynLu made their first contribution in #2697
@sgs97ua made their first contribution in #2721
@kailashbuki made their first contribution in #2735
@Monstertail made their first contribution in #2703
@HarshKohli made their first contribution in #2724
@luciaquirke made their first contribution in #2749

Full Changelog: v0.4.7...v0.4.8
