
Web Source #4848

Open

kashifkhan0771 wants to merge 18 commits into trufflesecurity:main from kashifkhan0771:feature/web-source

Conversation

@kashifkhan0771 (Contributor) commented Mar 30, 2026

Description:

Adds a new web source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via --crawl, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.
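The option surface described above can be sketched with the standard flag package. This is a minimal illustration only: the flag names, defaults, and the webFlags type are assumptions for the sketch, not the PR's exact CLI.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// webFlags models the crawl options described in the PR.
// Names and defaults here are illustrative assumptions.
type webFlags struct {
	crawl        bool          // follow links (opt-in via --crawl)
	maxDepth     int           // configurable crawl depth
	domainDelay  time.Duration // per-domain request delay
	timeout      time.Duration // per-URL / overall timeout
	userAgent    string        // User-Agent header sent with requests
	ignoreRobots bool          // robots.txt is respected unless this is set
}

// parseWebFlags binds the sketched options to a FlagSet and parses args.
func parseWebFlags(args []string) (*webFlags, error) {
	fs := flag.NewFlagSet("web", flag.ContinueOnError)
	f := &webFlags{}
	fs.BoolVar(&f.crawl, "crawl", false, "follow same-domain links (opt-in)")
	fs.IntVar(&f.maxDepth, "depth", 2, "maximum crawl depth")
	fs.DurationVar(&f.domainDelay, "delay", time.Second, "per-domain request delay")
	fs.DurationVar(&f.timeout, "timeout", 5*time.Minute, "per-URL timeout")
	fs.StringVar(&f.userAgent, "user-agent", "trufflehog", "User-Agent header")
	fs.BoolVar(&f.ignoreRobots, "ignore-robots", false, "do not respect robots.txt")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return f, nil
}

func main() {
	f, err := parseWebFlags([]string{"--crawl", "--depth", "3"})
	if err != nil {
		panic(err)
	}
	fmt.Println(f.crawl, f.maxDepth, f.ignoreRobots) // true 3 false
}
```

Note that robots.txt enforcement defaults on in this sketch (you must opt out), matching the "respected by default" behavior the description claims.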

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; requires golangci-lint)?

Note

Medium Risk
Introduces a new network-facing crawl-and-scan source (configurable depth/robots/timeouts) plus new third-party scraping dependencies, which could impact performance and target-site interaction if misconfigured.

Overview
Adds a new web scan mode that fetches one or more seed URLs and optionally crawls same-domain links to scan page content (including linked scripts) for secrets, with CLI flags for crawl enablement, depth, per-domain delay, overall timeout, user-agent, and robots.txt enforcement.

Wires the new source through the engine (Engine.ScanWeb), adds WebConfig, extends protobufs/metadata to record per-page crawl details (URL, title, content-type, depth, timestamp), and includes basic Prometheus metrics plus a comprehensive test suite for crawl behavior and robots/domain constraints.

Written by Cursor Bugbot for commit d4b7a57. This will update automatically on new commits.

@kashifkhan0771 kashifkhan0771 requested a review from a team March 30, 2026 11:40
@kashifkhan0771 kashifkhan0771 requested review from a team as code owners March 30, 2026 11:40
case <-ctx.Done():
ctx.Logger().Info("Context cancelled or timeout reached")
<-done // Wait for goroutine to finish cleanup
return ctx.Err()
Timeout blocks indefinitely waiting for Colly cleanup

Medium Severity

When the context timeout fires, crawlURL enters the <-ctx.Done() case and then blocks on <-done, waiting for collector.Wait() to return. However, Colly's async collector has no context-awareness and no per-request HTTP timeout configured, so collector.Wait() blocks until all in-flight HTTP requests naturally complete. Against a slow or unresponsive server, this effectively makes the --timeout flag unreliable and can cause the crawl to hang well beyond the configured duration.


@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


if _, err := url.Parse(u); err != nil {
	return fmt.Errorf("invalid URL %q: %w", u, err)
}
}

URL validation too permissive to catch invalid input

Medium Severity

The URL validation uses url.Parse, which succeeds for almost any string — including empty strings, relative paths, and bare words like "not-a-url". This means truly invalid inputs pass validation silently, leading to confusing runtime failures instead of clear init-time errors. Checking for a non-empty scheme (e.g., http or https) and host would catch these cases.
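The stricter check suggested above can be sketched as follows. The helper name validateSeedURL is hypothetical; the point is the scheme and host checks on top of url.Parse.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateSeedURL rejects inputs that url.Parse alone would accept
// (empty strings, relative paths, bare words) by also requiring an
// http(s) scheme and a non-empty host. Hypothetical helper name.
func validateSeedURL(u string) error {
	parsed, err := url.Parse(u)
	if err != nil {
		return fmt.Errorf("invalid URL %q: %w", u, err)
	}
	if parsed.Scheme != "http" && parsed.Scheme != "https" {
		return fmt.Errorf("invalid URL %q: scheme must be http or https", u)
	}
	if parsed.Host == "" {
		return fmt.Errorf("invalid URL %q: missing host", u)
	}
	return nil
}

func main() {
	for _, u := range []string{"https://example.com", "not-a-url", ""} {
		if err := validateSeedURL(u); err != nil {
			fmt.Println("reject:", u)
		} else {
			fmt.Println("accept:", u)
		}
	}
}
```

With this check, "not-a-url" and "" fail at init time with a clear error instead of surfacing later as a confusing runtime failure.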

