You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Introduce a HF_DATASET_CACHE_NETWORK_LOCATION configuration (e.g. an environment variable) together with a companion network cache service.
Enable a three-tier cache lookup for datasets:
Local on-disk cache
Configurable network cache proxy
Official Hugging Face Hub
Motivation
Distributed training & ephemeral jobs: In high-performance or containerized clusters, relying solely on a local disk cache either becomes a streaming bottleneck or incurs a heavy cold-start penalty as each job must re-download datasets.
Traffic & cost reduction: A pull-through network cache lets multiple consumers share a common cache layer, reducing duplicate downloads from the Hub and lowering egress costs.
Better streaming adoption: By offloading repeat dataset pulls to a locally managed cache proxy, streaming workloads can achieve higher throughput and more predictable latency.
Feature request
Introduce a HF_DATASET_CACHE_NETWORK_LOCATION configuration (e.g. an environment variable) together with a companion network cache service.
Enable a three-tier cache lookup for datasets:
Motivation
Your contribution
I’m happy to draft the initial PR for adding HF_DATASET_CACHE_NETWORK_LOCATION support in datasets and sketch out a minimal cache-service prototype.
I have limited bandwidth so I would be looking for collaborators if anyone else is interested.
The text was updated successfully, but these errors were encountered: