### Background

I'm working on integrating Crawlee with the Model Context Protocol (MCP), where each MCP tool maps to a specific crawler function. In my case I need two separate crawlers: one for product search and another for product details.
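For context, here is a minimal sketch of how such a tool-to-crawler mapping can look. It assumes the official MCP Python SDK (`mcp` package) and its `FastMCP` server; the tool names, the `crawlers` module, and the `product_details` function are illustrative rather than taken from my actual project.

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical module holding the crawler functions shown further below.
from crawlers import product_details, product_search

mcp = FastMCP("crawlee-tools")


@mcp.tool()
async def search_products(keyword: str) -> list[dict]:
    """Run the product-search crawler and return its dataset items."""
    return await product_search(keyword)


@mcp.tool()
async def get_product_details(url: str) -> dict:
    """Run the product-details crawler for a single product page."""
    return await product_details(url)


if __name__ == "__main__":
    mcp.run()
```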
### Current Challenges

### Current Workaround

I've implemented a temporary solution using custom datasets and manual cleanup:

```python
import urllib.parse

# Import paths follow a recent Crawlee for Python release and may differ by version.
from crawlee import Request
from crawlee.crawlers import ParselCrawler
from crawlee.http_clients import HttpxHttpClient
from crawlee.storages import Dataset

# BASE_URL and router are defined elsewhere in the project.


async def product_search(keyword: str):
    query = urllib.parse.urlencode({"term": keyword})
    request = Request.from_url(f"{BASE_URL}/search?{query}", label="product search")
    # Open a dedicated dataset named after the request so each call gets its own storage.
    dataset = await Dataset.open(name=request.id)
    crawler = ParselCrawler(
        configure_logging=False,
        request_handler=router,
        http_client=HttpxHttpClient(),
    )
    await crawler.run([request])
    result = [item for item in (await dataset.get_data()).items]
    await dataset.drop()  # Manual cleanup
    crawler.stop()
    return result
```
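The workaround only returns data because the request handler pushes items into that per-request dataset. Here is a rough sketch of what such a handler can look like; the CSS selectors and field names are purely illustrative, and it assumes `push_data` accepts a `dataset_name` argument (as in recent Crawlee releases):

```python
from crawlee.crawlers import ParselCrawlingContext
from crawlee.router import Router

router = Router[ParselCrawlingContext]()


@router.handler(label="product search")
async def search_handler(context: ParselCrawlingContext) -> None:
    # Push each result into the dataset named after the request,
    # matching the Dataset.open(name=request.id) call above.
    for product in context.selector.css(".product"):  # illustrative selector
        await context.push_data(
            {
                "name": product.css(".name::text").get(),
                "url": product.css("a::attr(href)").get(),
            },
            dataset_name=context.request.id,
        )
```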
### Proposed Enhancement

I believe the API could be more intuitive if it allowed dataset injection through the crawler constructor:

```python
async def product_search(keyword: str):
    query = urllib.parse.urlencode({"term": keyword})
    request = Request.from_url(f"{BASE_URL}/search?{query}", label="product search")
    dataset = await Dataset.open()
    crawler = ParselCrawler(
        configure_logging=False,
        request_handler=router,
        dataset=dataset,  # Inject custom dataset
        http_client=HttpxHttpClient(),
    )
    await crawler.run([request])
    result = [item for item in (await crawler.get_data()).items]
    crawler.stop()
    return result
```

With this, the crawler would write to and read from the injected dataset, so the manual naming and cleanup in the current workaround would no longer be needed.
### Questions
Looking forward to your thoughts and suggestions!
---
I have an example repo here.
---
Hi @neviaumi and thanks for opening this discussion! The Crawlee storage system is currently undergoing a significant refactor - see #1194. With that, you should be able to easily set up a crawler in a way that each MCP "call" (forgive me for not knowing the correct terminology) has its own non-persistent storage (datasets, key-value stores and request queues).
In fact, you can already configure the `MemoryStorageClient` to not dump anything in the filesystem - see the highlighted code in https://crawlee.dev/python/docs/deployment/gcp-cloud-run-functions, for example.
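For illustration, the relevant setup looks roughly like this. It is only a sketch along the lines of that deployment guide; the exact import paths, the `MemoryStorageClient.from_config()` helper and the `storage_client` constructor argument may differ between Crawlee versions, so check the linked page for the variant matching your release:

```python
from crawlee.configuration import Configuration
from crawlee.crawlers import ParselCrawler
from crawlee.storage_clients import MemoryStorageClient

# Keep all storages purely in memory; nothing is written to ./storage between calls.
storage_client = MemoryStorageClient.from_config(
    Configuration(
        persist_storage=False,
        write_metadata=False,
    ),
)

crawler = ParselCrawler(
    storage_client=storage_client,
    # ... request_handler, http_client, etc.
)
```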