scan_parquet(...).collect() fails with Generic S3 error: HTTP error: request or response body error, will attempt re-build after new-streaming enabled #22795

Closed

Description

@lisasgoh

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

pl.scan_parquet(<list of files>, hive_partitioning=True, <storage_options>).collect()

Log output

polars-stream: updating graph state
polars-stream: running in-memory-sink in subgraph
polars-stream: running multi-scan[parquet] in subgraph
[MultiScanTaskInitializer]: spawn_background_tasks(), 34 sources, reader name: parquet, ReaderCapabilities(ROW_INDEX | PRE_SLICE | NEGATIVE_PRE_SLICE | PARTIAL_FILTER | FULL_FILTER)
[MultiScanTaskInitializer]: predicate: Some("<predicate>"), skip files mask: None, predicate to reader: Some("<predicate>")
[MultiScanTaskInitializer]: scan_source_idx: 0 extra_ops: ExtraOperations { row_index: None, pre_slice: None, cast_columns_policy: ErrorOnMismatch, missing_columns_policy: Raise, include_file_paths: None, predicate: Some(scan_io_predicate) } 
[MultiScanTaskInitializer]: Readers init range: 0..34 (34 / 34 files)
[ReaderStarter]: max_concurrent_scans: 8
[MultiScan]: Initialize source 0
[MultiScan]: Initialize source 1
[MultiScan]: Initialize source 2
[ReaderStarter]: scan_source_idx: 0
[AttachReaderToBridge]: got reader, n_readers_received: 1
[MultiScan]: Initialize source 3
memory prefetch function: madvise_willneed
[ParquetFileReader]: project: 2 / 18, pre_slice: None, resolved_pre_slice: None, row_index: None, predicate: Some("<predicate>") 
[ParquetFileReader]: Config { num_pipelines: 32, row_group_prefetch_size: 128, min_values_per_thread: 16777216 }
[ParquetFileReader]: Pre-filtered decode enabled (1 live, 1 non-live)
[ParquetFileReader]: ideal_morsel_size: 100000
[ParquetFileReader]: Predicate pushdown: reading 1 / 1 row groups
[PolarsObjectStore]: got error: Generic S3 error: HTTP error: request or response body error, will attempt re-build
[PolarsObjectStore]: got error: Generic S3 error: HTTP error: request or response body error, will attempt re-build
[PolarsObjectStore]: got error: Generic S3 error: HTTP error: request or response body error, will attempt re-build
[AttachReaderToBridge]: ApplyExtraOps::Noop
[MultiScanState]: Readers disconnected

Issue description

When querying an S3 dataset with scan_parquet, it fails repeatedly with Generic S3 error: HTTP error: request or response body error, will attempt re-build. This only happens after the migration to new-streaming; it works on Polars v1.27.1. The dataset has ~32 files and ~12M rows.

Much smaller datasets (e.g. 1.3 MB, 42 rows, 31 columns) work, so this only occurs for larger datasets, although I'm not sure exactly where the breaking point is.

#14384 could be related, but the error there is Connection reset by peer (os error 104).

Expected behavior

scan_parquet is able to read large datasets from S3.

Installed versions

--------Version info---------
Polars:              1.29.0
Index type:          UInt32
Platform:            Linux-5.15.0-1081-aws-x86_64-with-glibc2.31
Python:              3.10.2 | packaged by conda-forge | (main, Jan 14 2022, 08:02:19) [GCC 9.4.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2025.3.2
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.2.5
openpyxl             <not installed>
pandas               2.2.3
polars_cloud         <not installed>
pyarrow              16.1.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

Activity

Labels added on May 17, 2025: bug (Something isn't working), python (Related to Python Polars), needs triage (Awaiting prioritization by a maintainer)

matthewbayer commented on May 20, 2025


+1. I think this is related to apache/arrow-rs-object-store#15, which is covering up some kind of error coming from the object-store read.

I am able to minimally reproduce this by reading two 15 MB files with scan_parquet, passed as a list of files (no glob).

My optimized query plan is quite literally:

Parquet SCAN [s3://<bucket>/file1.parquet, s3://<bucket>/file2.parquet]
PROJECT */34 COLUMNS

On 1.27.1, this manifests as polars.exceptions.ComputeError: Generic S3 error: error decoding response body when using the new streaming engine. In 1.29.0, it matches the error message in the issue title.

No combination of POLARS_CONCURRENCY_BUDGET, POLARS_ASYNC_THREAD_COUNT, POLARS_STREAMING_CHUNK_SIZE, or POLARS_MAX_CONCURRENT_SCANS appears to solve it.

It actually appears to be related to passing a list of files to scan_parquet. I'm able to have success by concatenating two single-file calls:

pl.concat([pl.scan_parquet("s3://<file1>"), pl.scan_parquet("s3://<file2>")])

This issue occurs on a hive partitioned dataset. No other configuration, just

pl.scan_parquet(file, hive_partitioning=True, storage_options={"access_key_id": <>, "secret_access_key": <>, "region": <>})

Not sure if there is different logic for reading collections of objects that's more complex than just a concatenation.


matthewbayer commented on May 28, 2025


@coastalwhite any thoughts here? I'm struggling to find a reason that the Polars scan_parquet runtime would be performing different response parsing depending on whether it was passed a single file, or a list containing two. Any pointers would be appreciated :)


lisasgoh commented on Jun 2, 2025


Closing since it was an implementation error.



Issue #22795 · pola-rs/polars