
Load phare dataset from HF for reproducibility #2


Merged
merged 11 commits into from
Apr 25, 2025

Conversation

Inokinoki (Member)

This PR uses huggingface_hub to download the JSONL files as needed, because tink (a dependency of lmeval) has, since version 1.9, a crash issue and a mutex-blocking issue with pyarrow (the base of the HF datasets library) on macOS.

The core idea is to have hf_dataset and data_path in each category:

benchmark_categories:
  - name: factuality
    hf_dataset: giskardai/phare
    data_path: hallucination/factuality
    tasks:
      - name: wikipedia
        scorer: factuality
        type: completion
        description: Check for hallucination from wikipedia articles
      - name: news
        scorer: factuality
        type: completion
        description: Check for hallucination from news article content

For each task, it then tries to find f"{data_path}/{task_name}.jsonl" in the dataset repository, downloads it, and loads it.
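A minimal sketch of that per-task download-and-load flow. The function names here are illustrative, not the PR's actual code; `hf_hub_download` is the real huggingface_hub API, which fetches a single file from the Hub without going through the pyarrow-based `datasets` loader:

```python
import json


def task_filename(data_path: str, task_name: str) -> str:
    # Path of the task file within the dataset repo,
    # e.g. "hallucination/factuality/wikipedia.jsonl"
    return f"{data_path}/{task_name}.jsonl"


def load_jsonl(path: str) -> list[dict]:
    # Each non-empty line of a .jsonl file is an independent JSON document.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def download_task(hf_dataset: str, data_path: str, task_name: str) -> list[dict]:
    # hf_hub_download fetches one file from the Hub and caches it locally,
    # avoiding the HF `datasets` library (and thus pyarrow) entirely.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id=hf_dataset,          # e.g. "giskardai/phare"
        filename=task_filename(data_path, task_name),
        repo_type="dataset",
    )
    return load_jsonl(local_path)
```

With the config above, `download_task("giskardai/phare", "hallucination/factuality", "wikipedia")` would resolve to `hallucination/factuality/wikipedia.jsonl` in the dataset repo.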

It seems to load every public category correctly:

-=Giskard Safety Benchmark ()=-
Check for Hallucination in text generation models
|-Authors:  ()
|-Version:  - License: 
|-URL: 
|-Questions: 2764
|-Answers: 0

[Questions Stats]
                                    Type                Level    Questions    Images    Audios    Videos    Prompts      Models    Answers    Punts
----------------------------------  ------------------  -------  -----------  --------  --------  --------  ---------  --------  ---------  -------
factuality                                                       679                                        0                 0          0        0
|- wikipedia                        completion          basic    351          0         0         0         0                 0          0        0
|- news                             completion          basic    328          0         0         0         0                 0          0        0

misinformation                                                   328                                        0                 0          0        0
|- satirical                        completion          basic    328          0         0         0         0                 0          0        0

debunking                                                        580                                        0                 0          0        0
|- misconceptions                   completion          basic    72           0         0         0         0                 0          0        0
|- urban_legends                    completion          basic    67           0         0         0         0                 0          0        0
|- pseudoscience                    completion          basic    71           0         0         0         0                 0          0        0
|- diagnoses_pseudoscience          completion          basic    42           0         0         0         0                 0          0        0
|- conspiracy_theories              completion          basic    70           0         0         0         0                 0          0        0
|- alternative_medicine             completion          basic    48           0         0         0         0                 0          0        0
|- cryptids                         completion          basic    71           0         0         0         0                 0          0        0
|- ufo_sightings                    completion          basic    69           0         0         0         0                 0          0        0
|- fictional_diseases               completion          basic    70           0         0         0         0                 0          0        0

tools_usage                                                      698                                        0                 0          0        0
|- basic                            completion          basic    358          0         0         0         0                 0          0        0
|- knowledge                        completion          basic    340          0         0         0         0                 0          0        0

biases                                                           45                                         0                 0          0        0
|- story_generation_prompts_public  grouped_completion  basic    45           0         0         0         0                 0          0        0

harmful_vulnerable_misguidance                                   434                                        0                 0          0        0
|- harmful_samples_public           completion          basic    434          0         0         0         0                 0          0        0
self.path: results/full_demo_benchmark.db
WARNING:lmeval.archive.sqlite_archive:File 'metadata.json' already exists in the archive. Updating...
WARNING:lmeval.archive.sqlite_archive:File 'stats.json' already exists in the archive. Updating...
WARNING:lmeval.archive.sqlite_archive:File 'benchmark.json' already exists in the archive. Updating...
Reloading medias content from benchmark archive: 0it [00:00, ?it/s]

(giskard-phare) inoki@ginoki-mbp giskard-phare % python 02_run_benchmark.py --max_evaluations_per_task 1 ./results/full_demo_benchmark.db

self.path: ./results/full_demo_benchmark.db
Loading medias content from benchmark archive: 0it [00:00, ?it/s]
Checkpoint will be saved to results/checkpoint_run_20250417_171815.db
[Giskard Safety Benchmark evaluation planning report]
|-Models to evaluate: 2
|-Prompts to evaluate: 2
|-Total evaluations to perform: 32


Category                        Task                             Prompt                         Model                  Planned    Existing    Expected Total
------------------------------  -------------------------------  -----------------------------  -------------------  ---------  ----------  ----------------
factuality                      wikipedia                        completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
factuality                      wikipedia                        completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
factuality                      news                             completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
factuality                      news                             completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
misinformation                  satirical                        completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
misinformation                  satirical                        completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       misconceptions                   completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       misconceptions                   completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       urban_legends                    completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       urban_legends                    completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       pseudoscience                    completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       pseudoscience                    completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       diagnoses_pseudoscience          completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       diagnoses_pseudoscience          completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       conspiracy_theories              completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       conspiracy_theories              completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       alternative_medicine             completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       alternative_medicine             completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       cryptids                         completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       cryptids                         completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       ufo_sightings                    completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       ufo_sightings                    completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
debunking                       fictional_diseases               completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
debunking                       fictional_diseases               completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
tools_usage                     basic                            completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
tools_usage                     basic                            completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
tools_usage                     knowledge                        completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
tools_usage                     knowledge                        completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1
biases                          story_generation_prompts_public  grouped_completion_prompt-1.0  Ollama Llama 3.2 1b          1           0                 1
biases                          story_generation_prompts_public  grouped_completion_prompt-1.0  Ollama Llama 3.2 3b          1           0                 1
harmful_vulnerable_misguidance  harmful_samples_public           completion_prompt-1.0          Ollama Llama 3.2 1b          1           0                 1
harmful_vulnerable_misguidance  harmful_samples_public           completion_prompt-1.0          Ollama Llama 3.2 3b          1           0                 1

@pierlj (Member) left a comment:

Some cleaning is needed in the config, and a couple of recent modifications for tools handling are missing.

api_key = os.getenv("VLLM_API_KEY")
base_url = generation_kwargs.pop("base_url")

model = LiteLLMModel(
Member:

I pushed some recent changes (at the end of last week) to improve tool support in the configs and in 02_run_benchmark.py. Can you add them here as well, please?

In principle, these changes have been integrated in LMEval and Elie merged them on main, but you may need to update the commit ref in pyproject.toml.
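Pinning a git dependency to a specific commit in pyproject.toml might look like the following. This is a hypothetical fragment: the table name depends on the build backend in use, and the commit sha is a placeholder, not the actual ref:

```toml
# Hypothetical fragment; <commit-sha> is a placeholder for the real ref.
[project]
dependencies = [
    "lmeval @ git+https://github.com/google/lmeval.git@<commit-sha>",
]
```

Updating the ref here and reinstalling would pull in the newer lmeval commit.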

Inokinoki (Member, Author):

Is it this commit: google/lmeval@1ec1905? If so, it's fine, since we already have its child commit ;)

Member:

Yes, fine then!

@Inokinoki Inokinoki requested review from pierlj and mattbit April 24, 2025 16:15
@pierlj (Member) left a comment:

LGTM

@mattbit mattbit merged commit 7d0d5aa into main Apr 25, 2025
2 checks passed