Add HTML decoder for secret detection in HTML-formatted sources#4840
Add HTML decoder for secret detection in HTML-formatted sources#4840
Conversation
Sources like MS Teams and Confluence emit HTML rather than plain text, causing secrets split across tags or embedded in attributes to be missed. This adds an HTML decoder to the pipeline that extracts text nodes, high-signal attribute values, script/style/comment content, and code blocks. It handles syntax-highlight boundary detection, zero-width character stripping, and double-encoded HTML entity decoding. Made-with: Cursor
2e42a11 to
cd28c03
Compare
- Remove unreachable "xlink:href" map entry: the html parser splits namespace-prefixed attributes into separate Namespace/Key fields, so attr.Key is "href" (already in the map), never "xlink:href". - Switch url.QueryUnescape to url.PathUnescape: QueryUnescape converts '+' to space per form-encoding spec, corrupting secrets that contain literal '+' characters (e.g. base64 values, API keys). Made-with: Cursor
casey-tran
left a comment
There was a problem hiding this comment.
Looks pretty good to me! Left a few comments.
pkg/decoders/html.go
Outdated
| // Enabled controls whether the decoder is active. When nil, the decoder | ||
| // is always active. Inject a function that checks a feature flag to | ||
| // allow dynamic toggling without restarting the scanner. | ||
| Enabled func() bool |
There was a problem hiding this comment.
Is Enabled really needed if nothing by default uses this HTML package?
There was a problem hiding this comment.
I think the reason to include Enabled is so we have a kill switch for EE (my understanding is that rollout will be gradual), but OSS will be able to use it out of the box.
There was a problem hiding this comment.
Yes but for EE blocking, you should just need to deal with the feature flag in thog. If the intention for OSS is to not have a flag at all.
There was a problem hiding this comment.
Ah okay I hear you. I think the tradeoff is that checking the flag in pipeline.go only evaluates at startup, so after toggling in configcat, we would require a scanner restart for it to take effect (which may be expected behavior from customers). That would be fine for the initial release, but the Enabled callback would let us toggle per-customer via ConfigCat at runtime without restarts. I imagine something similar to this has been debated before, so it is very possible my position is known to be no good.
There was a problem hiding this comment.
Confirming our offline chat, I agree with your suggestion to move this to thog.
There was a problem hiding this comment.
The ConfigCat flags should be checked every 10 seconds here.
| // syntaxHighlightPrefixes lists CSS class prefixes used by syntax highlighting | ||
| // libraries. Elements with these classes mark logical line boundaries in code | ||
| // blocks where the platform (e.g. Teams) strips actual newlines. | ||
| var syntaxHighlightPrefixes = []string{"hljs-"} |
There was a problem hiding this comment.
Do you think it would be helpful to note that hljs- is a MS Teams specific use case? In case people need to append more to the slice over time as sources change.
There was a problem hiding this comment.
I agree, I edited the comment to call out the Teams use case and guide future non-Teams additions.
| "details": true, "summary": true, "main": true, "nav": true, "aside": true, | ||
| "form": true, "fieldset": true, "legend": true, | ||
| "dd": true, "dt": true, "dl": true, | ||
| } |
There was a problem hiding this comment.
Script/style content merges with adjacent inline text
Medium Severity
script and style elements are absent from blockElements, so they receive no newline boundaries in walkNode. When a script or style follows an inline element (e.g. <span>text</span><script>var key="secret";</script>), the script content concatenates directly with the preceding text (textvar key="secret";), breaking token boundaries and potentially preventing secret detection. The existing tests pass only because script happens to follow a <p> block element which provides the needed newline.
Additional Locations (1)
|
|
||
| if isBlock || n.Data == "br" { | ||
| ensureNewline(buf) | ||
| } |
There was a problem hiding this comment.
Redundant br check since it's already in blockElements
Low Severity
The condition isBlock || n.Data == "br" at the trailing-newline check is redundant because "br" is already present in the blockElements map, so isBlock is always true when n.Data == "br". The extra disjunct adds confusion about whether br is intended to be a block element or a special case.
Additional Locations (1)
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
|
|
||
| result := stripInvisible(buf.Bytes()) | ||
| result = decodeResidualEntities(result) | ||
| return normalizeWhitespace(result) |
There was a problem hiding this comment.
Residual entity replacer corrupts raw script/style text
Low Severity
decodeResidualEntities is applied globally to all extracted text, but Go's HTML parser treats <script> and <style> content as raw text — entities are NOT decoded by the parser in those elements. So literal & in JavaScript (e.g., var url = "a=1&b=2") is preserved as-is by the parser but then incorrectly decoded to & by the residualEntityReplacer. The replacer was designed only for double-encoded content surviving a parser pass, but it also matches legitimate entity-like sequences in raw text blocks.


Summary
HTMLdecoder to the decoder pipeline that extracts clean text from HTML content, enabling secret detection in sources like MS Teams and Confluence that emit HTML rather than plain text.href,src,value,xlink:href, etc.), script/style/comment content, and code/pre blocks with proper token boundary preservation.HTML = 5to theDecoderTypeprotobuf enum and registers the decoder inDefaultDecoders().htmlDecoderfeature flag in ConfigCat.Test plan
Made with Cursor
Note
Medium Risk
Adds a new decoder into the default decoding pipeline and extends the
DecoderTypeprotobuf enum, which can affect scan output and compatibility/performance when enabled. Runtime gating viafeature.HTMLDecoderEnabledreduces blast radius but HTML parsing and normalization logic may still introduce edge-case false positives/negatives.Overview
Adds a new
HTMLdecoder to the default decoder chain (behindfeature.HTMLDecoderEnabled) that parses HTML chunks and emits a normalized text view for secret detection, including text nodes, comments, script/style contents, and selected/high-signal attribute values with URL/entity cleanup and whitespace normalization.Extends
detectorspb.DecoderTypewith a newHTML = 5enum value and adds comprehensive unit tests covering extraction behavior, token-boundary handling (including Teamshljs-*spans), invisible-character stripping, and feature-flag enable/disable behavior.Written by Cursor Bugbot for commit 2a2997c. This will update automatically on new commits. Configure here.