Unify HTTP fingerprinting across framework components #1081

Open
janbuchar opened this issue Mar 13, 2025 · 0 comments
Labels
    • debt: Code quality improvement or decrease of technical debt.
    • solutioning: The issue is not being implemented but only analyzed and planned.
    • t-tooling: Issues with this label are in the ownership of the tooling team.


janbuchar commented Mar 13, 2025

Background

Currently, our approach to HTTP fingerprinting is fragmented across different components. This can lead to inconsistencies where, for example, HTTP headers do not align with the TLS fingerprint or device characteristics, which makes our scrapers easier to detect. It also makes it difficult to track down the code responsible for each part of the fingerprinting functionality.

Objective

Create a unified approach to HTTP fingerprinting across all Crawlee components to produce more realistic and consistent scraper behavior. This will be ported to JS Crawlee as part of v4.

Proposed Solution

  1. Create a FingerprintProfile data structure that encapsulates:

    • HTTP headers collection
    • Browser type and version (for TLS impersonation)
    • Device characteristics (viewport, screen resolution, etc.)
    • Proxy configuration that aligns with the fingerprint's locale/behavior
    • Potentially anything else I forgot about or that will be added later on
  2. Integrate this structure across Crawlee components:

    • Components responsible for fingerprinting should accept a FingerprintProfile instance in the API that handles individual requests
    • HTTP clients should apply appropriate headers and proxy settings
    • Browser Pool should select browsers with matching TLS fingerprints and inject appropriate DOM properties (viewport, locale, ...)
    • The FingerprintProfile should probably be included in the Session objects
    • The way the FingerprintProfile is generated should be configurable, ideally in a way that allows plugging in custom code
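
To make the proposal more concrete, here is a minimal sketch of what such a structure could look like. Everything except the name FingerprintProfile is an assumption made for illustration: the field names, the DeviceCharacteristics helper, the stand-in Session class, the option keys such as "impersonate", and the generator callable are hypothetical and do not reflect existing Crawlee APIs.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Mapping, Optional


@dataclass(frozen=True)
class DeviceCharacteristics:
    """Hypothetical device-level properties a browser would expose."""

    viewport: tuple[int, int] = (1280, 720)
    screen_resolution: tuple[int, int] = (1920, 1080)
    locale: str = "en-US"


@dataclass(frozen=True)
class FingerprintProfile:
    """Hypothetical single source of truth for one consistent fingerprint."""

    headers: Mapping[str, str]
    browser_type: str  # used to pick a matching TLS impersonation target
    browser_version: str
    device: DeviceCharacteristics = field(default_factory=DeviceCharacteristics)
    proxy_url: Optional[str] = None  # proxy chosen to match the locale

    def http_client_options(self) -> dict[str, object]:
        """Options an HTTP client could apply (the key names are assumptions)."""
        return {
            "headers": dict(self.headers),
            "proxy": self.proxy_url,
            "impersonate": f"{self.browser_type}{self.browser_version}",
        }

    def browser_context_options(self) -> dict[str, object]:
        """Options a browser pool could inject (the key names are assumptions)."""
        width, height = self.device.viewport
        return {
            "viewport": {"width": width, "height": height},
            "locale": self.device.locale,
        }


# A generator is any zero-argument callable, so custom code can be plugged in.
ProfileGenerator = Callable[[], FingerprintProfile]


def default_profile_generator() -> FingerprintProfile:
    """Illustrative default; the header values are examples only."""
    return FingerprintProfile(
        headers={
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
            ),
            "Accept-Language": "en-US,en;q=0.9",
        },
        browser_type="chrome",
        browser_version="124",
    )


@dataclass
class Session:
    """Stand-in for a session object; not Crawlee's actual Session class."""

    id: str
    fingerprint: FingerprintProfile


def create_session(
    session_id: str,
    generate: ProfileGenerator = default_profile_generator,
) -> Session:
    """Attach one profile per session so every component reads the same data."""
    return Session(id=session_id, fingerprint=generate())
```

In this sketch, an HTTP client would read `session.fingerprint.http_client_options()` and the browser pool would read `session.fingerprint.browser_context_options()`. Both derive their settings from the same object, so the headers, TLS impersonation target, and device properties cannot drift apart.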

@Pijukatel @vdusek @B4nan
