Integrate Crawl4Ai to scrape web content #383

Blakeinstein · 2024-10-17T21:41:39Z

Changelog

chore: add crawl4ai
feat: async data loaders implement web crawler
feat: Implement smart web crawler
fix: validate urls for web scraper
fix: remove async scrape logic
chore: include web preview and fix
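
The changelog does not show how URLs are validated, so the following is only a minimal sketch of one plausible check, assuming the scraper should accept absolute http(s) URLs only. The helper name `is_crawlable_url` is hypothetical and not taken from this PR.

```python
from urllib.parse import urlparse


def is_crawlable_url(url: str) -> bool:
    """Return True for absolute http(s) URLs that include a host.

    Hypothetical helper for illustration; the PR's actual validation may differ.
    """
    parsed = urlparse(url.strip())
    # Require an explicit web scheme and a non-empty network location.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A check like this rejects inputs such as `example.com` (no scheme) or `ftp://example.com` before any crawl is attempted.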

Summary

Refactors the frontend NewDataSource form by breaking it into sub-components, and fixes form validation on required fields. State management is cleaner, and the abstractions for the different flows are separated.
AbstractDataLoader::load_filtered_data is now async; this affects all data loaders.
Updates web_loader to use Crawl4Ai to:
Fetch the sitemap to get a list of URLs to crawl.
Crawl each URL using the user-provided configs (including using the model gateway to extract semantic data from the webpage). I wanted to crawl pages in parallel, but doing it async appears to cause all URLs to time out. This can be revisited later, perhaps with multi-threading.
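
The two-step flow described above (read the sitemap, then crawl one URL at a time) might look roughly like the sketch below. It is not the code from this PR: `urls_from_sitemap`, `crawl_sequentially`, and the injected `fetch_page` callable are hypothetical names, and `fetch_page` stands in for whatever Crawl4Ai call the loader actually makes.

```python
import asyncio
import xml.etree.ElementTree as ET
from typing import AsyncIterator, Awaitable, Callable, Iterable, List, Tuple

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml: str) -> List[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


async def crawl_sequentially(
    urls: Iterable[str],
    fetch_page: Callable[[str], Awaitable[str]],  # assumed stand-in for the crawler
    timeout_s: float = 30.0,
) -> AsyncIterator[Tuple[str, str]]:
    """Yield (url, content) pairs, awaiting one page at a time."""
    for url in urls:
        try:
            content = await asyncio.wait_for(fetch_page(url), timeout=timeout_s)
        except asyncio.TimeoutError:
            # Skip pages that exceed the per-page timeout.
            continue
        yield url, content
```

Because the function is an async generator, an async `load_filtered_data` could consume it with `async for`, and the per-page await keeps crawling sequential, as the summary describes.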

backend/Dockerfile

backend/modules/dataloaders/loader.py

backend/modules/dataloaders/web_loader.py

backend/modules/parsers/web_parser.py

Blakeinstein force-pushed the feat/web-crawler branch 4 times, most recently from e740a18 to cafffba, on October 22, 2024 21:01

mnvsk97 force-pushed the feat/web-crawler branch from cafffba to 21fa59c on October 23, 2024 05:11

Blakeinstein force-pushed the feat/web-crawler branch 2 times, most recently from efc3a15 to d710426, on October 24, 2024 20:17

Blakeinstein force-pushed the feat/web-crawler branch 2 times, most recently from 391fedd to 0728967, on November 1, 2024 19:28

chore: add crawl4ai

72af101

Blakeinstein force-pushed the feat/web-crawler branch from 0728967 to 72af101 on November 7, 2024 16:32

Blakeinstein added 3 commits November 7, 2024 11:29

fix: load data from direct media

ccc5d61

fix: use fit content from newer version of crawl4ai

2b57d43

fix: backport changes from #405

30c9bd9

mnvsk97 approved these changes Nov 8, 2024

mnvsk97 merged commit cd64ac6 into main Nov 8, 2024
1 check passed

mnvsk97 deleted the feat/web-crawler branch November 8, 2024 03:10

S1LV3RJ1NX pushed a commit that referenced this pull request Nov 29, 2024

Merge pull request #383 from truefoundry/feat/web-crawler

40b900e
