case <-ctx.Done():
    ctx.Logger().Info("Context cancelled or timeout reached")
    <-done // Wait for goroutine to finish cleanup
    return ctx.Err()
Timeout blocks indefinitely waiting for Colly cleanup
Medium Severity
When the context timeout fires, crawlURL enters the <-ctx.Done() case and then blocks on <-done, waiting for collector.Wait() to return. However, Colly's async collector has no context-awareness and no per-request HTTP timeout configured, so collector.Wait() blocks until all in-flight HTTP requests naturally complete. Against a slow or unresponsive server, this effectively makes the --timeout flag unreliable and can cause the crawl to hang well beyond the configured duration.
if _, err := url.Parse(u); err != nil {
    return fmt.Errorf("invalid URL %q: %w", u, err)
}
}
URL validation too permissive to catch invalid input
Medium Severity
The URL validation uses url.Parse, which succeeds for almost any string — including empty strings, relative paths, and bare words like "not-a-url". This means truly invalid inputs pass validation silently, leading to confusing runtime failures instead of clear init-time errors. Checking for a non-empty scheme (e.g., http or https) and host would catch these cases.


Description:
Adds a new web source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via --crawl, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.

Checklist:
- Tests passing (make test-community)?
- Lint passing (make lint; this requires golangci-lint)?

Note
Medium Risk
Introduces a new network-facing crawl-and-scan source (configurable depth/robots/timeouts) plus new third-party scraping dependencies, which could impact performance and target-site interaction if misconfigured.
Overview
Adds a new web scan mode that fetches one or more seed URLs and optionally crawls same-domain links to scan page content (including linked scripts) for secrets, with CLI flags for crawl enablement, depth, per-domain delay, overall timeout, user-agent, and robots.txt enforcement.

Wires the new source through the engine (Engine.ScanWeb), adds WebConfig, extends protobufs/metadata to record per-page crawl details (URL, title, content-type, depth, timestamp), and includes basic Prometheus metrics plus a comprehensive test suite for crawl behavior and robots/domain constraints.

Written by Cursor Bugbot for commit d4b7a57. This will update automatically on new commits.