feat: Add `stream` method for `HttpClient` #1241
Conversation
Pull Request Overview
This PR introduces a new streaming API for HttpClient implementations, allowing responses to be processed in chunks via an async context manager. Key changes include:
- Adding a stream method to both HttpxHttpClient and CurlImpersonateHttpClient.
- Implementing async iter_bytes support in response adapters.
- Updating tests to verify proper streaming functionality.
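The pieces above fit together as an async-context-manager pattern: `stream` opens the response, and the caller iterates chunks inside the `with` block. The following is a minimal standalone sketch of that pattern using `asynccontextmanager` (which the httpx implementation is described as using); the class and method names are illustrative stand-ins, not the actual crawlee API.

```python
import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager


class FakeResponse:
    """Stands in for an HTTP response adapter with async chunk iteration."""

    def __init__(self, chunks: list[bytes]) -> None:
        self._chunks = chunks

    async def read_stream(self) -> AsyncIterator[bytes]:
        for chunk in self._chunks:
            yield chunk


class FakeHttpClient:
    """Stands in for an HttpClient implementation; no real I/O happens."""

    @asynccontextmanager
    async def stream(self, url: str) -> AsyncIterator[FakeResponse]:
        # A real client would open the connection here and close it on exit.
        yield FakeResponse([b'first ', b'second ', b'third'])


async def main() -> bytes:
    body = b''
    client = FakeHttpClient()
    async with client.stream('https://example.com') as response:
        async for chunk in response.read_stream():
            body += chunk
    return body


print(asyncio.run(main()))  # b'first second third'
```

The context manager guarantees the connection is released even if chunk processing raises, which is the main reason to prefer it over returning a bare iterator.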
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `tests/unit/http_clients/test_httpx.py` | Adds a streaming test case for the httpx client. |
| `tests/unit/http_clients/test_curl_impersonate.py` | Adds a streaming test case for the curl-impersonate client. |
| `src/crawlee/http_clients/_httpx.py` | Implements the `stream` method using `asynccontextmanager` and integrates it with request building. |
| `src/crawlee/http_clients/_curl_impersonate.py` | Implements the `stream` method with proper cookie and error handling. |
| `src/crawlee/http_clients/_base.py` | Adds an abstract `stream` method and updates context manager behavior for client activation. |
| `src/crawlee/crawlers/_playwright/_playwright_http_client.py` | Provides a stub implementation of `stream` that raises `NotImplementedError`. |
| `src/crawlee/crawlers/_basic/_basic_crawler.py` | Injects the new `http_client` into the crawler context managers. |
Nice, that seems pretty smooth! I have a couple of nits/questions.
After further testing, and because of possible issues related to the sharing of …
```python
async def __aenter__(self) -> HttpClient:
    """Initialize the client when entering the context manager.

    Raises:
        RuntimeError: If the context manager is already active.
    """
    if self._active:
        raise RuntimeError(f'The {self.__class__.__name__} is already active.')

    self._active = True
    return self

async def __aexit__(
    self, exc_type: BaseException | None, exc_value: BaseException | None, traceback: TracebackType | None
) -> None:
    """Deinitialize the client and clean up resources when exiting the context manager.

    Raises:
        RuntimeError: If the context manager is not active.
    """
    if not self._active:
        raise RuntimeError(f'The {self.__class__.__name__} is not active.')

    await self.cleanup()
    self._active = False
```
So we use the `_active` flag only for the detection of multiple `__aenter__`/`__aexit__` calls?
If so, I remember we had some discussion a while ago with either @janbuchar or @Pijukatel on how the context managers should behave in that case. Do you guys remember?
Thank you for pointing that out. I added this flag to match the logic in `basic_crawler`:
- https://github.com/apify/crawlee-python/blob/master/src/crawlee/crawlers/_basic/_basic_crawler.py#L675

But I completely forgot about the public property.
Although you may have discussed some other details.
We have 3 options: re-entrant, re-usable, single-use:
https://docs.python.org/3/library/contextlib.html#single-use-reusable-and-reentrant-context-managers

Re-entrant does not make sense here in my opinion.

So then the second question: does it make sense, and is it possible, to re-use a context that was already exited? If yes, then it is re-usable. (Based on the flags, this is the setup in your change. But is it really possible to enter a closed context again and have everything work as expected?)

If not, then it is single-use and we should probably throw an error if someone tries to enter it again.

Based on the code in the cleanup, it might not be possible to re-use an already closed client (unless `__aenter__` is implemented in a way that makes the client ready again).
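The "re-usable" option described above means `__aenter__` must rebuild whatever `cleanup` tore down, so the same object can be entered again after exiting. A minimal sketch of that shape, with an illustrative stand-in resource (not the actual crawlee code):

```python
import asyncio


class ReusableClient:
    """Re-usable (but not re-entrant) async context manager sketch."""

    def __init__(self) -> None:
        self._active = False
        self._session: dict | None = None  # stand-in for a real HTTP session

    async def __aenter__(self) -> 'ReusableClient':
        if self._active:
            raise RuntimeError(f'The {self.__class__.__name__} is already active.')
        self._session = {}  # re-create the resource on every entry
        self._active = True
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        if not self._active:
            raise RuntimeError(f'The {self.__class__.__name__} is not active.')
        self._session = None  # cleanup tears the resource down
        self._active = False


async def demo() -> bool:
    client = ReusableClient()
    async with client:
        pass
    async with client:  # entering again after exit works: re-usable
        pass
    return True


print(asyncio.run(demo()))  # True
```

If `__aenter__` did not re-create the session, the second `async with` would hand out a client with a dead resource, which is exactly the single-use trap the comment warns about.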
Thank you for the detailed explanation.
Yes, `HttpClient` should be reusable. Fixed and added the tests.
LGTM
```python
return self._response.read()

async def read_stream(self) -> AsyncIterator[bytes]:
```
I'm tempted to ask what's going to happen if somebody tries to iterate over the same stream twice "in parallel". On the other hand, it's hard enough to do that on purpose, so maybe we can ignore that case.
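The concern above is easy to demonstrate in isolation: an async generator like `read_stream` is one-shot, so a second iteration over the same object silently yields nothing rather than failing loudly. A minimal illustration (generic code, not the crawlee implementation):

```python
import asyncio
from collections.abc import AsyncIterator


async def read_stream() -> AsyncIterator[bytes]:
    """One-shot async generator standing in for a response stream."""
    for chunk in (b'a', b'b'):
        yield chunk


async def demo() -> tuple[list[bytes], list[bytes]]:
    stream = read_stream()
    first = [chunk async for chunk in stream]
    second = [chunk async for chunk in stream]  # already exhausted: empty
    return first, second


print(asyncio.run(demo()))  # ([b'a', b'b'], [])
```

Two truly concurrent iterations would interleave chunks unpredictably, but as noted, that is hard enough to do by accident that guarding against it may not be worth the complexity.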
Force-pushed from 71ac709 to f933954.
LGTM
It also looks good to me. My only concern is that the tests all get single-chunk responses, so we do not test fully consuming multiple chunks. I think you would have to add a new endpoint to the test server that returns such a chunked response and use it in the tests.
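The suggested test would stream from a multi-chunk endpoint and assert that every chunk arrives. A sketch of the assertion logic, with a fake in-process response standing in for the proposed test-server endpoint (names are illustrative, not the real fixtures):

```python
import asyncio
from collections.abc import AsyncIterator

CHUNKS = [b'chunk-1', b'chunk-2', b'chunk-3']


class FakeStreamingResponse:
    """Stands in for a response from a chunked test-server endpoint."""

    async def read_stream(self) -> AsyncIterator[bytes]:
        for chunk in CHUNKS:
            await asyncio.sleep(0)  # simulate chunks arriving over time
            yield chunk


async def test_consumes_all_chunks() -> list[bytes]:
    response = FakeStreamingResponse()
    received = [chunk async for chunk in response.read_stream()]
    assert len(received) > 1, 'endpoint must produce multiple chunks'
    assert b''.join(received) == b''.join(CHUNKS)
    return received


print(asyncio.run(test_consumes_all_chunks()))
```

The `len(received) > 1` guard is the important part: it fails if the endpoint collapses the body into a single chunk, which is exactly the blind spot the comment points out.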
Description

Add `stream` method for `HttpClient`.

Relates: #1169