Skip to content

Add HTML decoder for secret detection in HTML-formatted sources#4840

Open
alafiand wants to merge 6 commits intomainfrom
dl.276-new-html-decoder
Open

Add HTML decoder for secret detection in HTML-formatted sources#4840
alafiand wants to merge 6 commits intomainfrom
dl.276-new-html-decoder

Conversation

@alafiand
Copy link
Copy Markdown

@alafiand alafiand commented Mar 25, 2026

Summary

  • Adds a new HTML decoder to the decoder pipeline that extracts clean text from HTML content, enabling secret detection in sources like MS Teams and Confluence that emit HTML rather than plain text.
  • Parses HTML to extract text nodes, high-signal attribute values (href, src, value, xlink:href, etc.), script/style/comment content, and code/pre blocks with proper token boundary preservation.
  • Handles syntax-highlight boundary detection (hljs-* classes), zero-width/invisible character stripping, URL decoding in attributes, and double-encoded HTML entity cleanup.
  • Adds HTML = 5 to the DecoderType protobuf enum and registers the decoder in DefaultDecoders().
  • feature is gated behind htmlDecoder feature flag in ConfigCat.

Test plan

  • Unit tests covering: split secrets across tags, attribute extraction, URL decoding, script/style/comment content, code blocks, syntax-highlight boundaries, zero-width character stripping, double-encoded entities, feature flag gating
  • Integration tested via thog dev deploy with live MS Teams and Confluence scanning (companion thog PR)

Made with Cursor


Note

Medium Risk
Adds a new decoder into the default decoding pipeline and extends the DecoderType protobuf enum, which can affect scan output and compatibility/performance when enabled. Runtime gating via feature.HTMLDecoderEnabled reduces blast radius but HTML parsing and normalization logic may still introduce edge-case false positives/negatives.

Overview
Adds a new HTML decoder to the default decoder chain (behind feature.HTMLDecoderEnabled) that parses HTML chunks and emits a normalized text view for secret detection, including text nodes, comments, script/style contents, and selected/high-signal attribute values with URL/entity cleanup and whitespace normalization.

Extends detectorspb.DecoderType with a new HTML = 5 enum value and adds comprehensive unit tests covering extraction behavior, token-boundary handling (including Teams hljs-* spans), invisible-character stripping, and feature-flag enable/disable behavior.

Written by Cursor Bugbot for commit 2a2997c. This will update automatically on new commits. Configure here.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 25, 2026

CLA assistant check
All committers have signed the CLA.

Sources like MS Teams and Confluence emit HTML rather than plain text,
causing secrets split across tags or embedded in attributes to be missed.
This adds an HTML decoder to the pipeline that extracts text nodes,
high-signal attribute values, script/style/comment content, and code blocks.
It handles syntax-highlight boundary detection, zero-width character stripping,
and double-encoded HTML entity decoding.

Made-with: Cursor
@alafiand alafiand force-pushed the dl.276-new-html-decoder branch from 2e42a11 to cd28c03 Compare April 1, 2026 17:28
alafiand and others added 2 commits April 1, 2026 12:20
- Remove unreachable "xlink:href" map entry: the html parser splits
  namespace-prefixed attributes into separate Namespace/Key fields,
  so attr.Key is "href" (already in the map), never "xlink:href".
- Switch url.QueryUnescape to url.PathUnescape: QueryUnescape converts
  '+' to space per form-encoding spec, corrupting secrets that contain
  literal '+' characters (e.g. base64 values, API keys).

Made-with: Cursor
@alafiand alafiand marked this pull request as ready for review April 1, 2026 20:15
@alafiand alafiand requested a review from a team April 1, 2026 20:15
@alafiand alafiand requested review from a team as code owners April 1, 2026 20:15
Copy link
Copy Markdown
Contributor

@casey-tran casey-tran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me! Left a few comments.

// Enabled controls whether the decoder is active. When nil, the decoder
// is always active. Inject a function that checks a feature flag to
// allow dynamic toggling without restarting the scanner.
Enabled func() bool
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Enabled really needed if nothing by default uses this HTML package?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason to include Enabled is so we have a kill switch for EE (my understanding is that rollout will be gradual), but OSS will be able to use it out of the box.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but for EE blocking, you should just need to deal with the feature flag in thog. If the intention for OSS is to not have a flag at all.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay I hear you. I think the tradeoff is that checking the flag in pipeline.go only evaluates at startup, so after toggling in configcat, we would require a scanner restart for it to take effect (which may be expected behavior from customers). That would be fine for the initial release, but the Enabled callback would let us toggle per-customer via ConfigCat at runtime without restarts. I imagine something similar to this has been debated before, so it is very possible my position is known to be no good.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming our offline chat, I agree with your suggestion to move this to thog.

Copy link
Copy Markdown
Contributor

@casey-tran casey-tran Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ConfigCat flags should be checked every 10 seconds here.

// syntaxHighlightPrefixes lists CSS class prefixes used by syntax highlighting
// libraries. Elements with these classes mark logical line boundaries in code
// blocks where the platform (e.g. Teams) strips actual newlines.
var syntaxHighlightPrefixes = []string{"hljs-"}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be helpful to note that hljs- is a MS Teams specific use case? In case people need to append more to the slice over time as sources change.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I edited the comment to call out the Teams use case and guide future non-Teams additions.

"details": true, "summary": true, "main": true, "nav": true, "aside": true,
"form": true, "fieldset": true, "legend": true,
"dd": true, "dt": true, "dl": true,
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Script/style content merges with adjacent inline text

Medium Severity

script and style elements are absent from blockElements, so they receive no newline boundaries in walkNode. When a script or style follows an inline element (e.g. <span>text</span><script>var key="secret";</script>), the script content concatenates directly with the preceding text (textvar key="secret";), breaking token boundaries and potentially preventing secret detection. The existing tests pass only because script happens to follow a <p> block element which provides the needed newline.

Additional Locations (1)
Fix in Cursor Fix in Web


if isBlock || n.Data == "br" {
ensureNewline(buf)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant br check since it's already in blockElements

Low Severity

The condition isBlock || n.Data == "br" at the trailing-newline check is redundant because "br" is already present in the blockElements map, so isBlock is always true when n.Data == "br". The extra disjunct adds confusion about whether br is intended to be a block element or a special case.

Additional Locations (1)
Fix in Cursor Fix in Web

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.


result := stripInvisible(buf.Bytes())
result = decodeResidualEntities(result)
return normalizeWhitespace(result)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Residual entity replacer corrupts raw script/style text

Low Severity

decodeResidualEntities is applied globally to all extracted text, but Go's HTML parser treats <script> and <style> content as raw text — entities are NOT decoded by the parser in those elements. So literal &amp; in JavaScript (e.g., var url = "a=1&amp;b=2") is preserved as-is by the parser but then incorrectly decoded to & by the residualEntityReplacer. The replacer was designed only for double-encoded content surviving a parser pass, but it also matches legitimate entity-like sequences in raw text blocks.

Additional Locations (1)
Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants