Add HTML decoder for secret detection in HTML-formatted sources by alafiand · Pull Request #4840 · trufflesecurity/trufflehog

alafiand · 2026-03-25T21:00:13Z

Summary

Adds a new HTML decoder to the decoder pipeline that extracts clean text from HTML content, enabling secret detection in sources like MS Teams and Confluence that emit HTML rather than plain text.
Parses HTML to extract text nodes, high-signal attribute values (href, src, value, xlink:href, etc.), script/style/comment content, and code/pre blocks with proper token boundary preservation.
Handles syntax-highlight boundary detection (hljs-* classes), zero-width/invisible character stripping, URL decoding in attributes, and double-encoded HTML entity cleanup.
Adds HTML = 5 to the DecoderType protobuf enum and registers the decoder in DefaultDecoders().
feature is gated behind htmlDecoder feature flag in ConfigCat.

Test plan

Unit tests covering: split secrets across tags, attribute extraction, URL decoding, script/style/comment content, code blocks, syntax-highlight boundaries, zero-width character stripping, double-encoded entities, feature flag gating
Integration tested via thog dev deploy with live MS Teams and Confluence scanning (companion thog PR)

Made with Cursor

Note

Medium Risk
Adds a new decoder into the default decoding pipeline and extends the DecoderType protobuf enum, which can affect scan output and compatibility/performance when enabled. Runtime gating via feature.HTMLDecoderEnabled reduces blast radius but HTML parsing and normalization logic may still introduce edge-case false positives/negatives.

Overview
Adds a new HTML decoder to the default decoder chain (behind feature.HTMLDecoderEnabled) that parses HTML chunks and emits a normalized text view for secret detection, including text nodes, comments, script/style contents, and selected/high-signal attribute values with URL/entity cleanup and whitespace normalization.

Extends detectorspb.DecoderType with a new HTML = 5 enum value and adds comprehensive unit tests covering extraction behavior, token-boundary handling (including Teams hljs-* spans), invisible-character stripping, and feature-flag enable/disable behavior.

^{Written by Cursor Bugbot for commit 2a2997c. This will update automatically on new commits. Configure here.}

CLAassistant · 2026-03-25T21:00:20Z

All committers have signed the CLA.

pkg/decoders/html.go

Sources like MS Teams and Confluence emit HTML rather than plain text, causing secrets split across tags or embedded in attributes to be missed. This adds an HTML decoder to the pipeline that extracts text nodes, high-signal attribute values, script/style/comment content, and code blocks. It handles syntax-highlight boundary detection, zero-width character stripping, and double-encoded HTML entity decoding. Made-with: Cursor

pkg/decoders/html.go

- Remove unreachable "xlink:href" map entry: the html parser splits namespace-prefixed attributes into separate Namespace/Key fields, so attr.Key is "href" (already in the map), never "xlink:href". - Switch url.QueryUnescape to url.PathUnescape: QueryUnescape converts '+' to space per form-encoding spec, corrupting secrets that contain literal '+' characters (e.g. base64 values, API keys). Made-with: Cursor

casey-tran

Looks pretty good to me! Left a few comments.

casey-tran · 2026-04-02T18:26:36Z

pkg/decoders/html.go

+	// Enabled controls whether the decoder is active. When nil, the decoder
+	// is always active. Inject a function that checks a feature flag to
+	// allow dynamic toggling without restarting the scanner.
+	Enabled func() bool


Is Enabled really needed if nothing by default uses this HTML package?

I think the reason to include Enabled is so we have a kill switch for EE (my understanding is that rollout will be gradual), but OSS will be able to use it out of the box.

Yes but for EE blocking, you should just need to deal with the feature flag in thog. If the intention for OSS is to not have a flag at all.

Ah okay I hear you. I think the tradeoff is that checking the flag in pipeline.go only evaluates at startup, so after toggling in configcat, we would require a scanner restart for it to take effect (which may be expected behavior from customers). That would be fine for the initial release, but the Enabled callback would let us toggle per-customer via ConfigCat at runtime without restarts. I imagine something similar to this has been debated before, so it is very possible my position is known to be no good.

Confirming our offline chat, I agree with your suggestion to move this to thog.

The ConfigCat flags should be checked every 10 seconds here.

casey-tran · 2026-04-02T18:37:40Z

pkg/decoders/html.go

+// syntaxHighlightPrefixes lists CSS class prefixes used by syntax highlighting
+// libraries. Elements with these classes mark logical line boundaries in code
+// blocks where the platform (e.g. Teams) strips actual newlines.
+var syntaxHighlightPrefixes = []string{"hljs-"}


Do you think it would be helpful to note that hljs- is a MS Teams specific use case? In case people need to append more to the slice over time as sources change.

I agree, I edited the comment to call out the Teams use case and guide future non-Teams additions.

cursor · 2026-04-02T20:11:40Z

pkg/decoders/html.go

+	"details": true, "summary": true, "main": true, "nav": true, "aside": true,
+	"form": true, "fieldset": true, "legend": true,
+	"dd": true, "dt": true, "dl": true,
+}


Script/style content merges with adjacent inline text

Medium Severity

script and style elements are absent from blockElements, so they receive no newline boundaries in walkNode. When a script or style follows an inline element (e.g. <span>text</span><script>var key="secret";</script>), the script content concatenates directly with the preceding text (textvar key="secret";), breaking token boundaries and potentially preventing secret detection. The existing tests pass only because script happens to follow a <p> block element which provides the needed newline.

Additional Locations (1)

pkg/decoders/html.go#L145-L163

cursor · 2026-04-02T20:11:40Z

pkg/decoders/html.go

+
+		if isBlock || n.Data == "br" {
+			ensureNewline(buf)
+		}


Redundant br check since it's already in blockElements

Low Severity

The condition isBlock || n.Data == "br" at the trailing-newline check is redundant because "br" is already present in the blockElements map, so isBlock is always true when n.Data == "br". The extra disjunct adds confusion about whether br is intended to be a block element or a special case.

Additional Locations (1)

pkg/decoders/html.go#L74-L75

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

cursor · 2026-04-03T14:03:30Z

pkg/decoders/html.go

+
+	result := stripInvisible(buf.Bytes())
+	result = decodeResidualEntities(result)
+	return normalizeWhitespace(result)


Residual entity replacer corrupts raw script/style text

Low Severity

decodeResidualEntities is applied globally to all extracted text, but Go's HTML parser treats <script> and <style> content as raw text — entities are NOT decoded by the parser in those elements. So literal & in JavaScript (e.g., var url = "a=1&b=2") is preserved as-is by the parser but then incorrectly decoded to & by the residualEntityReplacer. The replacer was designed only for double-encoded content surviving a parser pass, but it also matches legitimate entity-like sequences in raw text blocks.

Additional Locations (1)

pkg/decoders/html.go#L217-L225

cursor bot reviewed Mar 25, 2026

View reviewed changes

pkg/decoders/html.go Show resolved Hide resolved

alafiand force-pushed the dl.276-new-html-decoder branch from 2e42a11 to cd28c03 Compare April 1, 2026 17:28

cursor bot reviewed Apr 1, 2026

View reviewed changes

pkg/decoders/html.go Outdated Show resolved Hide resolved

alafiand and others added 2 commits April 1, 2026 12:20

Merge branch 'main' into dl.276-new-html-decoder

86a9a9f

alafiand marked this pull request as ready for review April 1, 2026 20:15

alafiand requested a review from a team April 1, 2026 20:15

alafiand requested review from a team as code owners April 1, 2026 20:15

casey-tran reviewed Apr 2, 2026

View reviewed changes

Merge branch 'main' into dl.276-new-html-decoder

779dcef

cursor bot reviewed Apr 2, 2026

View reviewed changes

alafiand added 2 commits April 2, 2026 14:46

updated comment around syntaxHighlightPrefixes to guide future additions

38a1763

removed Enabled func from HTML struct to follow normal flag conventions

2a2997c

cursor bot reviewed Apr 3, 2026

View reviewed changes

Conversation

alafiand commented Mar 25, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

CLAassistant commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

casey-tran left a comment

Choose a reason for hiding this comment

Uh oh!

casey-tran Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

alafiand Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

casey-tran Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

alafiand Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

alafiand Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

casey-tran Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

casey-tran Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

alafiand Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

cursor bot Apr 2, 2026

Choose a reason for hiding this comment

Script/style content merges with adjacent inline text

Uh oh!

cursor bot Apr 2, 2026

Choose a reason for hiding this comment

Redundant br check since it's already in blockElements

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Apr 3, 2026

Choose a reason for hiding this comment

Residual entity replacer corrupts raw script/style text

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

alafiand commented Mar 25, 2026 •

edited by cursor bot

Loading

CLAassistant commented Mar 25, 2026 •

edited

Loading

casey-tran Apr 2, 2026 •

edited

Loading

Redundant `br` check since it's already in blockElements