case <-ctx.Done():
    ctx.Logger().Info("Context cancelled or timeout reached")
    <-done // Wait for goroutine to finish cleanup
    return ctx.Err()
Timeout blocks indefinitely waiting for Colly cleanup
Medium Severity
When the context timeout fires, crawlURL enters the <-ctx.Done() case and then blocks on <-done, waiting for collector.Wait() to return. However, Colly's async collector has no context-awareness and no per-request HTTP timeout configured, so collector.Wait() blocks until all in-flight HTTP requests naturally complete. Against a slow or unresponsive server, this effectively makes the --timeout flag unreliable and can cause the crawl to hang well beyond the configured duration.
if _, err := url.Parse(u); err != nil {
    return fmt.Errorf("invalid URL %q: %w", u, err)
}
}
URL validation too permissive to catch invalid input
Medium Severity
The URL validation uses url.Parse, which succeeds for almost any string — including empty strings, relative paths, and bare words like "not-a-url". This means truly invalid inputs pass validation silently, leading to confusing runtime failures instead of clear init-time errors. Checking for a non-empty scheme (e.g., http or https) and host would catch these cases.


Description:
Adds a new web source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via --crawl, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.

Checklist:
- Tests passing (make test-community)?
- Lint passing (make lint; this requires golangci-lint)?

Note
Medium Risk
Introduces a new network-facing crawl-and-scan source (configurable depth/robots/timeouts) plus new third-party scraping dependencies, which could impact performance and target-site interaction if misconfigured.
Overview
Adds a new web scan mode that fetches one or more seed URLs and optionally crawls same-domain links to scan page content (including linked scripts) for secrets, with CLI flags for crawl enablement, depth, per-domain delay, overall timeout, user-agent, and robots.txt enforcement.

Wires the new source through the engine (Engine.ScanWeb), adds WebConfig, extends protobufs/metadata to record per-page crawl details (URL, title, content-type, depth, timestamp), and includes basic Prometheus metrics plus a comprehensive test suite for crawl behavior and robots/domain constraints.

Written by Cursor Bugbot for commit d4b7a57. This will update automatically on new commits.