
feat: Add stream method for HttpClient #1241


Merged · 17 commits · Jun 19, 2025

Conversation

Mantisus
Collaborator

Description

  • Add stream method for HttpClient
  • Add an async context manager for cleaning up resources when closing a HttpClient

Relates: #1169
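The shape of the new API can be sketched as follows. This is a minimal, self-contained toy (the `StreamedResponse` and `DemoHttpClient` classes here are stand-ins, not the actual crawlee implementation); it only illustrates the pattern of a `stream` method exposed as an async context manager whose response yields chunks:

```python
import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager


class StreamedResponse:
    """Stand-in for a streamed HTTP response."""

    def __init__(self, chunks: list[bytes]) -> None:
        self._chunks = chunks

    async def read_stream(self) -> AsyncIterator[bytes]:
        # Yield the response body chunk by chunk.
        for chunk in self._chunks:
            yield chunk


class DemoHttpClient:
    """Toy client showing the shape of the streaming API."""

    @asynccontextmanager
    async def stream(self, url: str) -> AsyncIterator[StreamedResponse]:
        # A real implementation would open the connection here...
        response = StreamedResponse([b'hello ', b'world'])
        try:
            yield response
        finally:
            pass  # ...and release it here when the block exits


async def fetch_chunks(url: str) -> list[bytes]:
    async with DemoHttpClient().stream(url) as response:
        return [chunk async for chunk in response.read_stream()]


print(asyncio.run(fetch_chunks('https://example.com')))
```

Wrapping the stream in a context manager ensures the underlying connection is released even if chunk processing raises.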

@Mantisus
Collaborator Author

Mantisus commented Jun 10, 2025

Adding the stream method is necessary so that the Sitemap utility can use the HttpClient instance instead of httpx directly.

#1169 (comment)

@Mantisus Mantisus self-assigned this Jun 10, 2025
@Mantisus Mantisus requested a review from Copilot June 10, 2025 22:34
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a new streaming API for HttpClient implementations, allowing responses to be processed in chunks via an async context manager. Key changes include:

  • Adding a stream method to both HttpxHttpClient and CurlImpersonateHttpClient.
  • Implementing async iter_bytes support in response adapters.
  • Updating tests to verify proper streaming functionality.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Summary per file:

  • tests/unit/http_clients/test_httpx.py: Adds a streaming test case for the Httpx client.
  • tests/unit/http_clients/test_curl_impersonate.py: Adds a streaming test case for the Curl impersonate client.
  • src/crawlee/http_clients/_httpx.py: Implements the stream method using asynccontextmanager and integrates it with request building.
  • src/crawlee/http_clients/_curl_impersonate.py: Implements the stream method with proper cookie and error handling.
  • src/crawlee/http_clients/_base.py: Adds an abstract stream method and updates context manager behavior for client activation.
  • src/crawlee/crawlers/_playwright/_playwright_http_client.py: Provides a stub implementation of stream that raises NotImplementedError.
  • src/crawlee/crawlers/_basic/_basic_crawler.py: Injects the new http_client into the crawler context managers.

@Mantisus Mantisus requested a review from janbuchar June 10, 2025 22:34
Collaborator

@janbuchar janbuchar left a comment


Nice, that seems pretty smooth! I have a couple of nits/questions.

@Mantisus
Collaborator Author

After further testing, and because of the possible issues with sharing read and iter_bytes noted by @janbuchar, I renamed iter_bytes to read_stream, which should be more transparent and clearer for users.

Comment on lines +188 to +212
async def __aenter__(self) -> HttpClient:
    """Initialize the client when entering the context manager.

    Raises:
        RuntimeError: If the context manager is already active.
    """
    if self._active:
        raise RuntimeError(f'The {self.__class__.__name__} is already active.')

    self._active = True
    return self

async def __aexit__(
    self, exc_type: BaseException | None, exc_value: BaseException | None, traceback: TracebackType | None
) -> None:
    """Deinitialize the client and clean up resources when exiting the context manager.

    Raises:
        RuntimeError: If the context manager is not active.
    """
    if not self._active:
        raise RuntimeError(f'The {self.__class__.__name__} is not active.')

    await self.cleanup()
    self._active = False
Collaborator


So we use the _active flag only for the detection of multiple aenter/aexit calls?

If so, I remember we had some discussion a while ago with either @janbuchar or @Pijukatel on how the context managers should behave in that case. Do you guys remember?

Collaborator Author


Thank you for pointing that out. I added this flag to match the logic in basic_crawler - https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_basic/_basic_crawler.py#L675

But I completely forgot about the public property.

Although you may have discussed some other details )

Collaborator


We have 3 options: re-entrant, re-usable, and single-use:
https://docs.python.org/3/library/contextlib.html#single-use-reusable-and-reentrant-context-managers

Re-entrant does not make sense here in my opinion.

So then second question. Does it make sense and is it possible to re-use context that was already left before? If yes, then it is re-usable. (Based on the flags, this is the setup in your change. But is it really possible to enter closed context again and everything will work as expected?)

If not, then it is single-use and we should probably throw an error if someone tries to enter it again.
Based on the code in cleanup, it might not be possible to re-use an already closed client (unless __aenter__ is implemented in a way that makes the client ready again).

Collaborator Author


Thank you for the detailed explanation.

Yes, HttpClient should be re-usable. Fixed, and added the tests.
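The re-usable (but not re-entrant) behavior discussed above can be sketched like this. The class below is illustrative, not the actual crawlee code: the key point is that `__aenter__` re-creates the client's resources, so entering again after a clean exit works, while entering while already active raises:

```python
import asyncio


class ReusableClient:
    """Sketch of a re-usable async context manager.

    Entering while already active fails; entering again after a clean
    exit works, because __aenter__ re-creates the resources.
    """

    def __init__(self) -> None:
        self._active = False
        self._session: dict | None = None  # stand-in for a connection pool

    async def __aenter__(self) -> 'ReusableClient':
        if self._active:
            raise RuntimeError(f'The {self.__class__.__name__} is already active.')
        self._session = {}  # re-create resources so re-entry after exit works
        self._active = True
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        if not self._active:
            raise RuntimeError(f'The {self.__class__.__name__} is not active.')
        self._session = None  # clean up resources
        self._active = False


async def demo() -> bool:
    client = ReusableClient()
    async with client:
        pass
    async with client:  # re-use after exit is allowed
        pass
    return True


print(asyncio.run(demo()))
```

A single-use client would instead keep a separate "was ever entered" flag and raise on any second `__aenter__`.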

Collaborator

@janbuchar janbuchar left a comment


LGTM

return self._response.read()

async def read_stream(self) -> AsyncIterator[bytes]:
Collaborator


I'm tempted to ask what's going to happen if somebody tries to iterate over the same stream twice "in parallel". On the other hand, it's hard enough to do that on purpose, so maybe we can ignore that case.
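One way to handle the double-iteration concern, if it ever mattered, would be to mark the stream consumed on first use and fail fast on a second iteration instead of silently interleaving chunks. This is a hypothetical guard, not something this PR implements:

```python
import asyncio
from collections.abc import AsyncIterator


class GuardedStream:
    """Illustrative response stream that refuses to be consumed twice."""

    def __init__(self, chunks: list[bytes]) -> None:
        self._chunks = chunks
        self._consumed = False

    async def read_stream(self) -> AsyncIterator[bytes]:
        if self._consumed:
            raise RuntimeError('The response stream was already consumed.')
        self._consumed = True
        for chunk in self._chunks:
            yield chunk


async def demo() -> str:
    stream = GuardedStream([b'a', b'b'])
    first = [chunk async for chunk in stream.read_stream()]
    assert first == [b'a', b'b']
    try:
        [chunk async for chunk in stream.read_stream()]
    except RuntimeError as exc:
        return str(exc)
    return 'no error'


print(asyncio.run(demo()))
```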

@Mantisus Mantisus force-pushed the stream-http-client branch from 71ac709 to f933954 Compare June 13, 2025 13:57
@Mantisus Mantisus requested a review from vdusek June 18, 2025 12:55
Collaborator

@vdusek vdusek left a comment


LGTM

Collaborator

@Pijukatel Pijukatel left a comment


It also looks good to me. My only concern is that the tests are all getting single-chunk responses, so we do not test fully consuming multiple chunks. I think you would have to add a new endpoint to the test server that returns such a chunked response and use it in the tests.
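The suggested multi-chunk test shape could look roughly like this. The endpoint here is faked with an async generator (the real test server and endpoint name are not part of this PR); the test asserts that more than one chunk actually arrives and that reassembly matches the full body:

```python
import asyncio
from collections.abc import AsyncIterator

CHUNKS = [b'first-chunk;', b'second-chunk;', b'third-chunk;']


async def chunked_body() -> AsyncIterator[bytes]:
    """Stand-in for a test-server endpoint that streams several chunks."""
    for chunk in CHUNKS:
        await asyncio.sleep(0)  # yield control between chunks, as a server would
        yield chunk


async def consume() -> tuple[int, bytes]:
    # Collect every chunk, then report the chunk count and reassembled body.
    received = [chunk async for chunk in chunked_body()]
    return len(received), b''.join(received)


count, body = asyncio.run(consume())
print(count, body)
```

A test against such an endpoint would assert `count > 1`, which the current single-chunk responses cannot exercise.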

@vdusek vdusek merged commit 95c68b0 into apify:master Jun 19, 2025
23 checks passed