
feat: Add stream method for HttpClient #1241


Merged · 17 commits · Jun 19, 2025

Conversation

Mantisus
Collaborator

Description

  • Add stream method for HttpClient
  • Add an async context manager for cleaning up resources when closing a HttpClient

Relates: #1169
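The shape of the new API can be sketched as follows. This is a minimal, self-contained toy (the `StreamedResponse` and `DemoHttpClient` classes here are stand-ins, not the actual crawlee implementation); it only illustrates the pattern of a `stream` method exposed as an async context manager whose response yields chunks:

```python
import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager


class StreamedResponse:
    """Stand-in for a streamed HTTP response."""

    def __init__(self, chunks: list[bytes]) -> None:
        self._chunks = chunks

    async def read_stream(self) -> AsyncIterator[bytes]:
        # Yield the response body chunk by chunk.
        for chunk in self._chunks:
            yield chunk


class DemoHttpClient:
    """Toy client showing the shape of the streaming API."""

    @asynccontextmanager
    async def stream(self, url: str) -> AsyncIterator[StreamedResponse]:
        # A real implementation would open the connection here...
        response = StreamedResponse([b'hello ', b'world'])
        try:
            yield response
        finally:
            pass  # ...and release it here when the block exits


async def fetch_chunks(url: str) -> list[bytes]:
    async with DemoHttpClient().stream(url) as response:
        return [chunk async for chunk in response.read_stream()]


print(asyncio.run(fetch_chunks('https://example.com')))
```

Wrapping the stream in a context manager ensures the underlying connection is released even if chunk processing raises.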

@Mantisus
Collaborator Author

Mantisus commented Jun 10, 2025

Adding the stream method is necessary so that the Sitemap utility can use the HttpClient instance instead of httpx directly.

#1169 (comment)

@Mantisus Mantisus self-assigned this Jun 10, 2025
@Mantisus Mantisus requested a review from Copilot June 10, 2025 22:34
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a new streaming API for HttpClient implementations, allowing responses to be processed in chunks via an async context manager. Key changes include:

  • Adding a stream method to both HttpxHttpClient and CurlImpersonateHttpClient.
  • Implementing async iter_bytes support in response adapters.
  • Updating tests to verify proper streaming functionality.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Summary per file:

  • tests/unit/http_clients/test_httpx.py: Adds a streaming test case for the Httpx client.
  • tests/unit/http_clients/test_curl_impersonate.py: Adds a streaming test case for the Curl impersonate client.
  • src/crawlee/http_clients/_httpx.py: Implements the stream method using asynccontextmanager and integrates it with request building.
  • src/crawlee/http_clients/_curl_impersonate.py: Implements the stream method with proper cookie and error handling.
  • src/crawlee/http_clients/_base.py: Adds an abstract stream method and updates context manager behavior for client activation.
  • src/crawlee/crawlers/_playwright/_playwright_http_client.py: Provides a stub implementation of stream that raises NotImplementedError.
  • src/crawlee/crawlers/_basic/_basic_crawler.py: Injects the new http_client into the crawler context managers.

@Mantisus Mantisus requested a review from janbuchar June 10, 2025 22:34
Collaborator

@janbuchar janbuchar left a comment


Nice, that seems pretty smooth! I have a couple of nits/questions.

@Mantisus
Collaborator Author

After further testing, and because of the possible issues with sharing read and iter_bytes noted by @janbuchar, I renamed iter_bytes to read_stream, which should be more transparent and clearer for users.

Comment on lines +188 to +212
async def __aenter__(self) -> HttpClient:
    """Initialize the client when entering the context manager.

    Raises:
        RuntimeError: If the context manager is already active.
    """
    if self._active:
        raise RuntimeError(f'The {self.__class__.__name__} is already active.')

    self._active = True
    return self

async def __aexit__(
    self, exc_type: BaseException | None, exc_value: BaseException | None, traceback: TracebackType | None
) -> None:
    """Deinitialize the client and clean up resources when exiting the context manager.

    Raises:
        RuntimeError: If the context manager is not active.
    """
    if not self._active:
        raise RuntimeError(f'The {self.__class__.__name__} is not active.')

    await self.cleanup()
    self._active = False
Collaborator


So we use the _active flag only for the detection of multiple aenter/aexit calls?

If so, I remember we had some discussion a while ago with either @janbuchar or @Pijukatel on how the context managers should behave in that case. Do you guys remember?

Collaborator Author


Thank you for pointing that out. I added this flag to match the logic in basic_crawler - https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_basic/_basic_crawler.py#L675

But I completely forgot about the public property.

Although you may have discussed some other details )

Collaborator


We have 3 options: re-entrant, re-usable, and single-use:
https://docs.python.org/3/library/contextlib.html#single-use-reusable-and-reentrant-context-managers

Re-entrant does not make sense here in my opinion.

So then second question. Does it make sense and is it possible to re-use context that was already left before? If yes, then it is re-usable. (Based on the flags, this is the setup in your change. But is it really possible to enter closed context again and everything will work as expected?)

If not, then it is single-use and we should probably throw an error if someone tries to enter it again.
Based on the code in cleanup, it might not be possible to re-use an already closed client (unless __aenter__ is implemented in a way that makes the client ready again).

Collaborator Author


Thank you for the detailed explanation.

Yes, HttpClient should be re-usable. Fixed, and added the tests.
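The re-usable (but not re-entrant) behavior discussed above can be sketched like this. The class below is illustrative, not the actual crawlee code: the key point is that `__aenter__` re-creates the client's resources, so entering again after a clean exit works, while entering while already active raises:

```python
import asyncio


class ReusableClient:
    """Sketch of a re-usable async context manager.

    Entering while already active fails; entering again after a clean
    exit works, because __aenter__ re-creates the resources.
    """

    def __init__(self) -> None:
        self._active = False
        self._session: dict | None = None  # stand-in for a connection pool

    async def __aenter__(self) -> 'ReusableClient':
        if self._active:
            raise RuntimeError(f'The {self.__class__.__name__} is already active.')
        self._session = {}  # re-create resources so re-entry after exit works
        self._active = True
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        if not self._active:
            raise RuntimeError(f'The {self.__class__.__name__} is not active.')
        self._session = None  # clean up resources
        self._active = False


async def demo() -> bool:
    client = ReusableClient()
    async with client:
        pass
    async with client:  # re-use after exit is allowed
        pass
    return True


print(asyncio.run(demo()))
```

A single-use client would instead keep a separate "was ever entered" flag and raise on any second `__aenter__`.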

Collaborator

@janbuchar janbuchar left a comment


LGTM

return self._response.read()

async def read_stream(self) -> AsyncIterator[bytes]:
Collaborator


I'm tempted to ask what's going to happen if somebody tries to iterate over the same stream twice "in parallel". On the other hand, it's hard enough to do that on purpose, so maybe we can ignore that case.
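One way to handle the double-iteration concern, if it ever mattered, would be to mark the stream consumed on first use and fail fast on a second iteration instead of silently interleaving chunks. This is a hypothetical guard, not something this PR implements:

```python
import asyncio
from collections.abc import AsyncIterator


class GuardedStream:
    """Illustrative response stream that refuses to be consumed twice."""

    def __init__(self, chunks: list[bytes]) -> None:
        self._chunks = chunks
        self._consumed = False

    async def read_stream(self) -> AsyncIterator[bytes]:
        if self._consumed:
            raise RuntimeError('The response stream was already consumed.')
        self._consumed = True
        for chunk in self._chunks:
            yield chunk


async def demo() -> str:
    stream = GuardedStream([b'a', b'b'])
    first = [chunk async for chunk in stream.read_stream()]
    assert first == [b'a', b'b']
    try:
        [chunk async for chunk in stream.read_stream()]
    except RuntimeError as exc:
        return str(exc)
    return 'no error'


print(asyncio.run(demo()))
```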

@Mantisus Mantisus force-pushed the stream-http-client branch from 71ac709 to f933954 Compare June 13, 2025 13:57
@Mantisus Mantisus requested a review from vdusek June 18, 2025 12:55
Collaborator

@vdusek vdusek left a comment


LGTM

Collaborator

@Pijukatel Pijukatel left a comment


It also looks good to me. My only concern is that the tests are all getting single-chunk responses, so we do not test fully consuming multiple chunks. I think you would have to add a new endpoint to the test server that returns such a chunked response and use it in the tests.
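The suggested multi-chunk test shape could look roughly like this. The endpoint here is faked with an async generator (the real test server and endpoint name are not part of this PR); the test asserts that more than one chunk actually arrives and that reassembly matches the full body:

```python
import asyncio
from collections.abc import AsyncIterator

CHUNKS = [b'first-chunk;', b'second-chunk;', b'third-chunk;']


async def chunked_body() -> AsyncIterator[bytes]:
    """Stand-in for a test-server endpoint that streams several chunks."""
    for chunk in CHUNKS:
        await asyncio.sleep(0)  # yield control between chunks, as a server would
        yield chunk


async def consume() -> tuple[int, bytes]:
    # Collect every chunk, then report the chunk count and reassembled body.
    received = [chunk async for chunk in chunked_body()]
    return len(received), b''.join(received)


count, body = asyncio.run(consume())
print(count, body)
```

A test against such an endpoint would assert `count > 1`, which the current single-chunk responses cannot exercise.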

@vdusek vdusek merged commit 95c68b0 into apify:master Jun 19, 2025
23 checks passed