Closed
Description
Checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of Polars.
Reproducible example
```python
pl.scan_parquet(<list of files>, hive_partitioning=True, <storage_options>).collect()
```
Log output
```
polars-stream: updating graph state
polars-stream: running in-memory-sink in subgraph
polars-stream: running multi-scan[parquet] in subgraph
[MultiScanTaskInitializer]: spawn_background_tasks(), 34 sources, reader name: parquet, ReaderCapabilities(ROW_INDEX | PRE_SLICE | NEGATIVE_PRE_SLICE | PARTIAL_FILTER | FULL_FILTER)
[MultiScanTaskInitializer]: predicate: Some("<predicate>"), skip files mask: None, predicate to reader: Some("<predicate>")
[MultiScanTaskInitializer]: scan_source_idx: 0 extra_ops: ExtraOperations { row_index: None, pre_slice: None, cast_columns_policy: ErrorOnMismatch, missing_columns_policy: Raise, include_file_paths: None, predicate: Some(scan_io_predicate) }
[MultiScanTaskInitializer]: Readers init range: 0..34 (34 / 34 files)
[ReaderStarter]: max_concurrent_scans: 8
[MultiScan]: Initialize source 0
[MultiScan]: Initialize source 1
[MultiScan]: Initialize source 2
[ReaderStarter]: scan_source_idx: 0
[AttachReaderToBridge]: got reader, n_readers_received: 1
[MultiScan]: Initialize source 3
memory prefetch function: madvise_willneed
[ParquetFileReader]: project: 2 / 18, pre_slice: None, resolved_pre_slice: None, row_index: None, predicate: Some("<predicate>")
[ParquetFileReader]: Config { num_pipelines: 32, row_group_prefetch_size: 128, min_values_per_thread: 16777216 }
[ParquetFileReader]: Pre-filtered decode enabled (1 live, 1 non-live)
[ParquetFileReader]: ideal_morsel_size: 100000
[ParquetFileReader]: Predicate pushdown: reading 1 / 1 row groups
[PolarsObjectStore]: got error: Generic S3 error: HTTP error: request or response body error, will attempt re-build
[PolarsObjectStore]: got error: Generic S3 error: HTTP error: request or response body error, will attempt re-build
[PolarsObjectStore]: got error: Generic S3 error: HTTP error: request or response body error, will attempt re-build
[AttachReaderToBridge]: ApplyExtraOps::Noop
[MultiScanState]: Readers disconnected
```
Issue description
When querying an S3 dataset with `scan_parquet`, the read fails repeatedly with `Generic S3 error: HTTP error: request or response body error, will attempt re-build`. This only happens after the migration to the new streaming engine; the same query works on Polars v1.27.1. The dataset has ~32 files and ~12M rows.
Much smaller datasets (e.g. 1.3 MB, 42 rows, 31 columns) read fine, so this only occurs for larger datasets, although I'm not sure exactly where the breaking point is.
#14384 could be related, but the error there is `Connection reset by peer (os error 104)`.
Expected behavior
scan_parquet is able to read large datasets from S3
Installed versions
--------Version info---------
Polars: 1.29.0
Index type: UInt32
Platform: Linux-5.15.0-1081-aws-x86_64-with-glibc2.31
Python: 3.10.2 | packaged by conda-forge | (main, Jan 14 2022, 08:02:19) [GCC 9.4.0]
LTS CPU: False
----Optional dependencies----
Azure CLI <not installed>
adbc_driver_manager <not installed>
altair <not installed>
azure.identity <not installed>
boto3 <not installed>
cloudpickle <not installed>
connectorx <not installed>
deltalake <not installed>
fastexcel <not installed>
fsspec 2025.3.2
gevent <not installed>
google.auth <not installed>
great_tables <not installed>
matplotlib <not installed>
numpy 2.2.5
openpyxl <not installed>
pandas 2.2.3
polars_cloud <not installed>
pyarrow 16.1.0
pydantic <not installed>
pyiceberg <not installed>
sqlalchemy <not installed>
torch <not installed>
xlsx2csv <not installed>
xlsxwriter <not installed>
Activity
matthewbayer commented on May 20, 2025
+1. I think this is related to apache/arrow-rs-object-store#15, which is covering up some kind of error coming from the object store read.
I am able to minimally reproduce this by reading two 15 MB files using `scan_parquet`, passed as a list of files (no glob). My optimized query plan is quite literally:
On 1.27.1, this manifests as `polars.exceptions.ComputeError: Generic S3 error: error decoding response body` when using the new streaming engine. In 1.29.0, it matches the error message in the issue title. No combination of `POLARS_CONCURRENCY_BUDGET`, `POLARS_ASYNC_THREAD_COUNT`, `POLARS_STREAMING_CHUNK_SIZE`, or `POLARS_MAX_CONCURRENT_SCANS` appears to solve it. It actually appears to be related to passing a list of files to `scan_parquet`: I'm able to have success with concatenating two single-file calls:
This issue occurs on a hive-partitioned dataset, with no other configuration, just
Not sure if there is different logic for reading collections of objects that's more complex than just a concatenation.
matthewbayer commented on May 28, 2025
@coastalwhite any thoughts here? I'm struggling to find a reason that the Polars `scan_parquet` runtime would be performing different response parsing depending on whether it was passed a single file or a list containing two. Any pointers would be appreciated :)

lisasgoh commented on Jun 2, 2025
Closing since it was an implementation error.