Closed
Description
Checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of Polars.
Reproducible example
```python
pl.scan_parquet(<list of files>, hive_partitioning=True, <storage_options>).collect()
```
Log output
```
polars-stream: updating graph state
polars-stream: running in-memory-sink in subgraph
polars-stream: running multi-scan[parquet] in subgraph
[MultiScanTaskInitializer]: spawn_background_tasks(), 34 sources, reader name: parquet, ReaderCapabilities(ROW_INDEX | PRE_SLICE | NEGATIVE_PRE_SLICE | PARTIAL_FILTER | FULL_FILTER)
[MultiScanTaskInitializer]: predicate: Some("<predicate>"), skip files mask: None, predicate to reader: Some("<predicate>")
[MultiScanTaskInitializer]: scan_source_idx: 0 extra_ops: ExtraOperations { row_index: None, pre_slice: None, cast_columns_policy: ErrorOnMismatch, missing_columns_policy: Raise, include_file_paths: None, predicate: Some(scan_io_predicate) }
[MultiScanTaskInitializer]: Readers init range: 0..34 (34 / 34 files)
[ReaderStarter]: max_concurrent_scans: 8
[MultiScan]: Initialize source 0
[MultiScan]: Initialize source 1
[MultiScan]: Initialize source 2
[ReaderStarter]: scan_source_idx: 0
[AttachReaderToBridge]: got reader, n_readers_received: 1
[MultiScan]: Initialize source 3
memory prefetch function: madvise_willneed
[ParquetFileReader]: project: 2 / 18, pre_slice: None, resolved_pre_slice: None, row_index: None, predicate: Some("<predicate>")
[ParquetFileReader]: Config { num_pipelines: 32, row_group_prefetch_size: 128, min_values_per_thread: 16777216 }
[ParquetFileReader]: Pre-filtered decode enabled (1 live, 1 non-live)
[ParquetFileReader]: ideal_morsel_size: 100000
[ParquetFileReader]: Predicate pushdown: reading 1 / 1 row groups
[PolarsObjectStore]: got error: Generic S3 error: HTTP error: request or response body error, will attempt re-build
[PolarsObjectStore]: got error: Generic S3 error: HTTP error: request or response body error, will attempt re-build
[PolarsObjectStore]: got error: Generic S3 error: HTTP error: request or response body error, will attempt re-build
[AttachReaderToBridge]: ApplyExtraOps::Noop
[MultiScanState]: Readers disconnected
```
Issue description
When querying an S3 dataset with `scan_parquet`, the read fails repeatedly with `Generic S3 error: HTTP error: request or response body error, will attempt re-build`. This only happens after the migration to the new streaming engine; the same query works on Polars v1.27.1. The dataset has ~32 files and ~12M rows.
Much smaller datasets (e.g. 1.3 MB, 42 rows, 31 columns) read fine, so this only occurs for larger datasets, although I'm not sure exactly where the breaking point is.
#14384 could be related, but the error there is `Connection reset by peer (os error 104)`.
Expected behavior
scan_parquet is able to read large datasets from S3
Installed versions
--------Version info---------
Polars: 1.29.0
Index type: UInt32
Platform: Linux-5.15.0-1081-aws-x86_64-with-glibc2.31
Python: 3.10.2 | packaged by conda-forge | (main, Jan 14 2022, 08:02:19) [GCC 9.4.0]
LTS CPU: False
----Optional dependencies----
Azure CLI <not installed>
adbc_driver_manager <not installed>
altair <not installed>
azure.identity <not installed>
boto3 <not installed>
cloudpickle <not installed>
connectorx <not installed>
deltalake <not installed>
fastexcel <not installed>
fsspec 2025.3.2
gevent <not installed>
google.auth <not installed>
great_tables <not installed>
matplotlib <not installed>
numpy 2.2.5
openpyxl <not installed>
pandas 2.2.3
polars_cloud <not installed>
pyarrow 16.1.0
pydantic <not installed>
pyiceberg <not installed>
sqlalchemy <not installed>
torch <not installed>
xlsx2csv <not installed>
xlsxwriter <not installed>
Activity
matthewbayer commented on May 20, 2025
+1. I think this is related to apache/arrow-rs-object-store#15, which is covering up some kind of error coming from the object store read.
I am able to minimally reproduce this by reading two 15 MB files using `scan_parquet`, passed as a list of files (no glob). My optimized query plan is quite literally:
On 1.27.1, this manifests as `polars.exceptions.ComputeError: Generic S3 error: error decoding response body` when using the new streaming engine. In 1.29.0, it matches the error message in the issue title. No combination of `POLARS_CONCURRENCY_BUDGET`, `POLARS_ASYNC_THREAD_COUNT`, `POLARS_STREAMING_CHUNK_SIZE`, or `POLARS_MAX_CONCURRENT_SCANS` appears to solve it. It actually appears to be related to passing a list of files to `scan_parquet`: I'm able to have success with concatenating two single-file calls:
This issue occurs on a hive-partitioned dataset, with no other configuration, just
Not sure if there is different logic for reading collections of objects that's more complex than just a concatenation.
matthewbayer commented on May 28, 2025
@coastalwhite any thoughts here? I'm struggling to find a reason that the Polars `scan_parquet` runtime would be performing different response parsing depending on whether it was passed a single file or a list containing two. Any pointers would be appreciated :)

lisasgoh commented on Jun 2, 2025
Closing since it was an implementation error.