Integrate Crawl4Ai to scrape web content #383

Merged
merged 4 commits into from
Nov 8, 2024

Conversation

Blakeinstein (Contributor) commented Oct 17, 2024

Changelog

  • chore: add crawl4ai
  • feat: async data loaders implement web crawler
  • feat: Implement smart web crawler
  • fix: validate urls for web scraper
  • fix: remove async scrape logic
  • chore: include web preview and fix

Summary

Refactors the frontend NewDataSource form by breaking it into subcomponents, and fixes form validation on required fields. State management is cleaner, and the abstractions for the different flows are separated.
AbstractDataLoader::load_filtered_data is now async; this affects all data loaders.
Updates web_loader to use Crawl4Ai to:
  • Fetch the sitemap to get a list of URLs to crawl.
  • Crawl each URL using the user-provided configs (including using the model gateway to extract semantic data from the webpage). I wanted to crawl pages in parallel, but doing it async seems to cause all URLs to time out. This can be revisited later, perhaps with multi-threading.
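
For reviewers who want a picture of the flow described above, here is a minimal sketch, not the code in this PR. It assumes the sitemap is standard sitemaps.org XML, that load_filtered_data takes no arguments and returns a list of strings, and that pages are fetched through an injected coroutine. SketchWebLoader, urls_from_sitemap, is_valid_url, and fetch_page are hypothetical names; the real web_loader calls Crawl4Ai where fetch_page appears here.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import Awaitable, Callable, List
from urllib.parse import urlparse

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs (hypothetical validation rule)."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def urls_from_sitemap(sitemap_xml: str) -> List[str]:
    """Collect every valid <loc> entry from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        if is_valid_url(url):
            urls.append(url)
    return urls


class AbstractDataLoader(ABC):
    # Assumed signature; the PR only states that this method is now async.
    @abstractmethod
    async def load_filtered_data(self) -> List[str]:
        ...


class SketchWebLoader(AbstractDataLoader):
    """Hypothetical loader: the crawler is injected as a coroutine."""

    def __init__(
        self,
        sitemap_xml: str,
        fetch_page: Callable[[str], Awaitable[str]],
    ) -> None:
        self.sitemap_xml = sitemap_xml
        self.fetch_page = fetch_page

    async def load_filtered_data(self) -> List[str]:
        results = []
        # Pages are awaited one at a time, mirroring the PR's decision
        # to drop parallel crawling after it led to timeouts.
        for url in urls_from_sitemap(self.sitemap_xml):
            results.append(await self.fetch_page(url))
        return results
```

Because the method is a coroutine, every caller has to await it (or drive it with asyncio.run), which is why the change touches all data loaders.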

Blakeinstein force-pushed the feat/web-crawler branch 4 times, most recently from e740a18 to cafffba, October 22, 2024 21:01
Blakeinstein force-pushed the feat/web-crawler branch 2 times, most recently from efc3a15 to d710426, October 24, 2024 20:17
Blakeinstein force-pushed the feat/web-crawler branch 2 times, most recently from 391fedd to 0728967, November 1, 2024 19:28
mnvsk97 merged commit cd64ac6 into main Nov 8, 2024
1 check passed
mnvsk97 deleted the feat/web-crawler branch November 8, 2024 03:10
S1LV3RJ1NX pushed a commit that referenced this pull request Nov 29, 2024
Integrate Crawl4Ai to scrape web content