### Background

I'm working on integrating Crawlee with the Model Context Protocol (MCP), where each MCP tool maps to a specific crawler function. In my case I need two separate crawlers: one for product search and another for product details.
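For context, here is a minimal sketch of how such a tool-to-crawler mapping can look. It assumes the official MCP Python SDK (`mcp` package) and its `FastMCP` server; the tool names, the `crawlers` module, and the `product_details` function are illustrative rather than taken from my actual project.

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical module holding the crawler functions shown further below.
from crawlers import product_details, product_search

mcp = FastMCP("crawlee-tools")


@mcp.tool()
async def search_products(keyword: str) -> list[dict]:
    """Run the product-search crawler and return its dataset items."""
    return await product_search(keyword)


@mcp.tool()
async def get_product_details(url: str) -> dict:
    """Run the product-details crawler for a single product page."""
    return await product_details(url)


if __name__ == "__main__":
    mcp.run()
```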
### Current Challenges

### Current Workaround

I've implemented a temporary solution using custom datasets and manual cleanup:

```python
import urllib.parse

# Import paths follow a recent Crawlee for Python release and may differ by version.
from crawlee import Request
from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import HttpxHttpClient
from crawlee.storages import Dataset

# BASE_URL and router are defined elsewhere in the project.


async def product_search(keyword: str):
    query = urllib.parse.urlencode({"term": keyword})
    request = Request.from_url(f"{BASE_URL}/search?{query}", label="product search")
    # Open a dedicated dataset named after the request so each call gets its own storage.
    dataset = await Dataset.open(name=request.id)
    crawler = ParselCrawler(
        configure_logging=False,
        request_handler=router,
        http_client=HttpxHttpClient(),
    )
    await crawler.run([request])
    result = [item for item in (await dataset.get_data()).items]
    await dataset.drop()  # Manual cleanup
    crawler.stop()
    return result
```
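The workaround only returns data because the request handler pushes items into that per-request dataset. Here is a rough sketch of what such a handler can look like; the CSS selectors and field names are purely illustrative, and it assumes `push_data` accepts a `dataset_name` argument (as in recent Crawlee releases):

```python
from crawlee.crawlers import ParselCrawlingContext
from crawlee.router import Router

router = Router[ParselCrawlingContext]()


@router.handler(label="product search")
async def search_handler(context: ParselCrawlingContext) -> None:
    # Push each result into the dataset named after the request,
    # matching the Dataset.open(name=request.id) call above.
    for product in context.selector.css(".product"):  # illustrative selector
        await context.push_data(
            {
                "name": product.css(".name::text").get(),
                "url": product.css("a::attr(href)").get(),
            },
            dataset_name=context.request.id,
        )
```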
### Proposed Enhancement

I believe the API could be more intuitive if it allowed dataset injection through the crawler constructor:

```python
async def product_search(keyword: str):
    query = urllib.parse.urlencode({"term": keyword})
    request = Request.from_url(f"{BASE_URL}/search?{query}", label="product search")
    dataset = await Dataset.open()
    crawler = ParselCrawler(
        configure_logging=False,
        request_handler=router,
        dataset=dataset,  # Inject custom dataset
        http_client=HttpxHttpClient(),
    )
    await crawler.run([request])
    result = [item for item in (await crawler.get_data()).items]
    crawler.stop()
    return result
```

With this, the crawler would write to and read from the injected dataset, so the manual naming and cleanup in the current workaround would no longer be needed.
### Questions
Looking forward to your thoughts and suggestions!
---
I have an example repo here.
---
Hi @neviaumi and thanks for opening this discussion! The Crawlee storage system is currently undergoing a significant refactor - see #1194. With that, you should be able to easily set up a crawler in a way that each MCP "call" (forgive me for not knowing the correct terminology) has its own non-persistent storage (datasets, key-value stores and request queues).
In fact, you can already configure the `MemoryStorageClient` to not dump anything in the filesystem - see the highlighted code in https://crawlee.dev/python/docs/deployment/gcp-cloud-run-functions, for example.
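For illustration, the relevant setup looks roughly like this. It is only a sketch along the lines of that deployment guide; the exact import paths, the `MemoryStorageClient.from_config()` helper and the `storage_client` constructor argument may differ between Crawlee versions, so check the linked page for the variant matching your release:

```python
from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import MemoryStorageClient

# Keep all storages purely in memory; nothing is written to ./storage between calls.
storage_client = MemoryStorageClient.from_config(
    Configuration(
        persist_storage=False,
        write_metadata=False,
    ),
)

crawler = ParselCrawler(
    storage_client=storage_client,
    # ... request_handler, http_client, etc.
)
```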