EleutherAI / lm-evaluation-harness Public

Issues: EleutherAI/lm-evaluation-harness

reproduce llama 3 evals

#2557 opened Dec 10, 2024 by baberabb

408 Open 984 Closed

Issues list

self-defined architecture model eval

#3042 opened Jun 3, 2025 by shaojintian

Passing sample based parameters for metric

Label: feature request

#3038 opened Jun 3, 2025 by elements72

Caught jinja2.exceptions.UndefinedError: 'context' is undefined when dealing with japanese_leaderboard

Label: asking questions

#3028 opened May 29, 2025 by Lynnzake

Add Support Conditional Generation Models like Mistral3

Label: feature request

#3027 opened May 29, 2025 by KyleMylonakisProtopia

Issue with quantization_config argument

Label: bug

#3026 opened May 28, 2025 by shanhx2000

Support for using a remote /tokenize API endpoint as the tokenizer

Label: feature request

#3017 opened May 24, 2025 by furkancoskun

Couldn't find file squad-v1.1/train-v1.1.json when evaluate Qwen3-A3B with vllm pipeline

Label: asking questions

#3015 opened May 23, 2025 by Lynnzake

Docker build fails due to missing pip module in Conda environment during setup.py develop on editable install

Label: feature request

#3014 opened May 22, 2025 by osmangoninahid

hellaswag not working: "no tasks specified" and "Keyerror: 'train'

Label: asking questions

#3010 opened May 22, 2025 by matthijsvk

zeno_visualize.py can't parse model_args

Labels: bug, good first issue

#3005 opened May 21, 2025 by login256

backward compatibility for unitxt (and others) after adding question_suffix to task.fewshot_context in #2876

Labels: bug, feature request

#3004 opened May 21, 2025 by baberabb

How to reproduce the Qwen2.5 base model results on GSM8K Task

#3003 opened May 21, 2025 by deema-A

unitxt with local-chat-completions gets stuck forever

Label: bug

#2986 opened May 15, 2025 by ivanbaldo

Longbench classification_score() missing 1 required positional argument: 'results'

#2976 opened May 12, 2025 by sustcsonglin

Move To A Saner Benchmark Definition Scheme

#2971 opened May 9, 2025 by aabdullah27182845

Performance bottleneck: consider multiprocessing for cached request checking

#2964 opened May 9, 2025 by justHungryMan

Ruler QA tasks do not work for max_seq_lengths < 4096

Label: bug

#2963 opened May 9, 2025 by sustcsonglin

Log truncation/max_length to logged samples

Label: feature request

#2961 opened May 8, 2025 by freshpearYoon

Is this result reasonable, please?

#2960 opened May 7, 2025 by kuang1216

Generation length is limited to 2048 tokens. Qwen3 model accuracy is low

#2953 opened May 3, 2025 by sravan500

Allow tasks to register a metric

#2950 opened May 2, 2025 by cbare

lm_eval multimodal failing for any num_fewshot>0

#2948 opened May 1, 2025 by brian-dellabetta

Livecodebench, AIME24 datasets

#2944 opened Apr 30, 2025 by sravan500

How to test those think/no_think model like Qwen3

#2941 opened Apr 30, 2025 by bash99

please add model phi-4-multimodal, Qwen omni

#2938 opened Apr 29, 2025 by HERIUN
