
Load phare dataset from HF for reproducibility #2


Merged
merged 11 commits into from
Apr 25, 2025

Conversation

Inokinoki (Member)

This PR uses huggingface_hub to download the JSONL files as needed, because tink (a dependency of lmeval) has, since version 1.9, a crash issue and a mutex-blocking issue with pyarrow (the base of the HF datasets library) on macOS.

The core idea is to have hf_dataset and data_path in each category:

benchmark_categories:
  - name: factuality
    hf_dataset: giskardai/phare
    data_path: hallucination/factuality
    tasks:
      - name: wikipedia
        scorer: factuality
        type: completion
        description: Check for hallucination from wikipedia articles
      - name: news
        scorer: factuality
        type: completion
        description: Check for hallucination from news article content

For each task, it then tries to find f"{data_path}/{task_name}.jsonl" in the dataset repository, downloads it, and loads it.
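A minimal sketch of that per-task download-and-load flow. The function names here are illustrative, not the PR's actual code; `hf_hub_download` is the real huggingface_hub API, which fetches a single file from the Hub without going through the pyarrow-based `datasets` loader:

```python
import json


def task_filename(data_path: str, task_name: str) -> str:
    # Path of the task file within the dataset repo,
    # e.g. "hallucination/factuality/wikipedia.jsonl"
    return f"{data_path}/{task_name}.jsonl"


def load_jsonl(path: str) -> list[dict]:
    # Each non-empty line of a .jsonl file is an independent JSON document.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def download_task(hf_dataset: str, data_path: str, task_name: str) -> list[dict]:
    # hf_hub_download fetches one file from the Hub and caches it locally,
    # avoiding the HF `datasets` library (and thus pyarrow) entirely.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id=hf_dataset,          # e.g. "giskardai/phare"
        filename=task_filename(data_path, task_name),
        repo_type="dataset",
    )
    return load_jsonl(local_path)
```

With the config above, `download_task("giskardai/phare", "hallucination/factuality", "wikipedia")` would resolve to `hallucination/factuality/wikipedia.jsonl` in the dataset repo.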

It seems to load every public category correctly:

-=Giskard Safety Benchmark ()=-
Check for Hallucination in text generation models
|-Authors:  ()
|-Version:  - License: 
|-URL: 
|-Questions: 2764
|-Answers: 0

[Questions Stats]
                                    Type                Level    Questions    Images    Audios    Videos    Prompts      Models    Answers    Punts
----------------------------------  ------------------  -------  -----------  --------  --------  --------  ---------  --------  ---------  -------
factuality                                                       679                                        0                 0          0        0
|- wikipedia                        completion          basic    351          0         0         0         0                 0          0        0
|- news                             completion          basic    328          0         0         0         0                 0          0        0

misinformation                                                   328                                        0                 0          0        0
|- satirical                        completion          basic    328          0         0         0         0                 0          0        0

debunking                                                        580                                        0                 0          0        0
|- misconceptions                   completion          basic    72           0         0         0         0                 0          0        0
|- urban_legends                    completion          basic    67           0         0         0         0                 0          0        0
|- pseudoscience                    completion          basic    71           0         0         0         0                 0          0        0
|- diagnoses_pseudoscience          completion          basic    42           0         0         0         0                 0          0        0
|- conspiracy_theories              completion          basic    70           0         0         0         0                 0          0        0
|- alternative_medicine             completion          basic    48           0         0         0         0                 0          0        0
|- cryptids                         completion          basic    71           0         0         0         0                 0          0        0
|- ufo_sightings                    completion          basic    69           0         0         0         0                 0          0        0
|- fictional_diseases               completion          basic    70           0         0         0         0                 0          0        0

tools_usage                                                      698                                        0                 0          0        0
|- basic                            completion          basic    358          0         0         0         0                 0          0        0
|- knowledge                        completion          basic    340          0         0         0         0                 0          0        0

biases                                                           45                                         0                 0          0        0
|- story_generation_prompts_public  grouped_completion  basic    45           0         0         0         0                 0          0        0

harmful_vulnerable_misguidance                                   434                                        0                 0          0        0
|- harmful_samples_public           completion          basic    434          0         0         0         0                 0          0        0
self.path: results/full_demo_benchmark.db
WARNING:lmeval.archive.sqlite_archive:File 'metadata.json' already exists in the archive. Updating...
WARNING:lmeval.archive.sqlite_archive:File 'stats.json' already exists in the archive. Updating...
WARNING:lmeval.archive.sqlite_archive:File 'benchmark.json' already exists in the archive. Updating...
Reloading medias content from benchmark archive: 0it [00:00, ?it/s]

(giskard-phare) inoki@ginoki-mbp giskard-phare % python 02_run_benchmark.py --max_evaluations_per_task 1 ./results/full_demo_benchmark.db

self.path: ./results/full_demo_benchmark.db
Loading medias content from benchmark archive: 0it [00:00, ?it/s]
Checkpoint will be saved to results/checkpoint_run_20250417_171815.db
[Giskard Safety Benchmark evaluation planning report]
|-Models to evaluate: 2
|-Prompts to evaluate: 2
|-Total evaluations to perform: 32


Category                        Task                             Prompt                         Model                  Planned    Existing    Expected Total
------------------------------  -------------------------------  -----------------------------  -------------------  ---------  ----------  ----------------
factuality                      wikipedia                        completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
factuality                      wikipedia                        completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
factuality                      news                             completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
factuality                      news                             completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
misinformation                  satirical                        completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
misinformation                  satirical                        completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       misconceptions                   completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       misconceptions                   completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       urban_legends                    completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       urban_legends                    completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       pseudoscience                    completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       pseudoscience                    completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       diagnoses_pseudoscience          completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       diagnoses_pseudoscience          completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       conspiracy_theories              completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       conspiracy_theories              completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       alternative_medicine             completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       alternative_medicine             completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       cryptids                         completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       cryptids                         completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       ufo_sightings                    completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       ufo_sightings                    completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       fictional_diseases               completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       fictional_diseases               completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
tools_usage                     basic                            completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
tools_usage                     basic                            completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
tools_usage                     knowledge                        completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
tools_usage                     knowledge                        completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
biases                          story_generation_prompts_public  grouped_completion_prompt-1.0  Ollama Llama 3.2 1b          1           0                 1
biases                          story_generation_prompts_public  grouped_completion_prompt-1.0  Ollama Llama 3.2 3b          1           0                 1
harmful_vulnerable_misguidance  harmful_samples_public           completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
harmful_vulnerable_misguidance  harmful_samples_public           completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1

@pierlj (Member) left a comment:

Some cleaning is needed in the config, and a couple of recent modifications for tools handling are missing.

api_key = os.getenv("VLLM_API_KEY")
base_url = generation_kwargs.pop("base_url")

model = LiteLLMModel(
Member:

I pushed some recent changes (at the end of last week) to improve tool support in the configs and in 02_run_benchmark.py. Can you add them here as well, please?

In principle, these changes have been integrated in LMEval and Elie merged them on main, but you may need to update the commit ref in pyproject.toml.
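Pinning a git dependency to a specific commit in pyproject.toml might look like the following. This is a hypothetical fragment: the table name depends on the build backend in use, and the commit sha is a placeholder, not the actual ref:

```toml
# Hypothetical fragment; <commit-sha> is a placeholder for the real ref.
[project]
dependencies = [
    "lmeval @ git+https://github.com/google/lmeval.git@<commit-sha>",
]
```

Updating the ref here and reinstalling would pull in the newer lmeval commit.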

Inokinoki (Member, Author):

Is it this commit: google/lmeval@1ec1905? If so, it's fine, since we already have its child commit ;)

Member:

Yes, fine then!

@Inokinoki Inokinoki requested review from pierlj and mattbit April 24, 2025 16:15
@pierlj (Member) left a comment:

LGTM

@mattbit mattbit merged commit 7d0d5aa into main Apr 25, 2025
2 checks passed