Introduce OCR Handler for secret detection in images and videos #4863

Draft
amanfcp wants to merge 2 commits into main from hackathon/ocr-handler

@amanfcp
Contributor

@amanfcp amanfcp commented Apr 3, 2026

Problem Statement

Secret Leakage Through Visual Media is a Blind Spot in Secret Scanning

Secret scanning tools today operate exclusively on text-based content — source code, config files, logs, and documents. But credentials and secrets increasingly appear in visual media: screenshots of terminal sessions, screen recordings of deployments, documentation images showing API keys, and video tutorials where dashboards with tokens are briefly visible.

These secrets are invisible to current scanning pipelines because image and video files are treated as opaque binaries and skipped entirely. An AWS key pasted in a screenshot committed to a repo is just as dangerous as one in a .env file, but a text-only scanner will not catch it.

Our Solution

We extend TruffleHog's scanning pipeline with an OCR-powered handler that extracts text from images (PNG, JPG, JPEG) and video frames (MP4, MKV, WEBM), then feeds it through the existing secret detection engine. Same decoders, same detectors, same verification.

Team:

@mustansir14 @MuneebUllahKhan222 @amanfcp

Key design decisions:

  • Handler-level integration: Works for any source (filesystem, Git, S3, GCS) and is not coupled to a single source
  • Zero cgo, fully static binary: Uses tesseract and ffmpeg as CLI tools via os/exec, preserving TruffleHog's CGO_ENABLED=0 static binary model
  • Opt-in via feature flag (--enable-ocr): No performance impact or dependency burden when disabled
  • Video intelligence: Extracts frames at 1fps and OCRs each, catching secrets that appear even briefly

Accuracy Improvements

Out-of-the-box tesseract struggles with monospaced IDE/terminal fonts. We've tuned the pipeline in several ways:

  • Image preprocessing: Images are converted to grayscale and upscaled 2x before OCR, improving accuracy on small or low-contrast text (common in screenshots)
  • PSM 6 (uniform text block): Tesseract's page segmentation is set to "single uniform block of text" mode, better suited for screenshots of terminals, config files, and dashboards than the default auto-layout analysis
  • DPI hint (300): Signals tesseract to treat input at print-quality resolution, improving character recognition
  • Monospace-aware spacing: preserve_interword_spaces=1 and textord_space_size_is_variable=0 tell tesseract that spacing is uniform — reduces spurious space insertion that breaks secret patterns
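
For reference, the tuning above maps onto tesseract's command line roughly as follows. This is a minimal sketch, not the handler's actual code: the function names tesseractArgs and runTesseract are hypothetical, and it assumes the tesseract binary is on PATH. Passing "stdout" as the output base makes tesseract print the recognized text instead of writing a file.

```go
package main

import (
	"fmt"
	"os/exec"
)

// tesseractArgs builds the CLI arguments for one OCR pass over imagePath.
// (Hypothetical helper; the flag values mirror the settings described above.)
func tesseractArgs(imagePath string) []string {
	return []string{
		imagePath, "stdout", // input image, then "stdout" as the output base
		"--psm", "6", // page segmentation: single uniform block of text
		"--dpi", "300", // DPI hint
		"-c", "preserve_interword_spaces=1",
		"-c", "textord_space_size_is_variable=0",
	}
}

// runTesseract shells out to tesseract and returns the recognized text.
// No cgo is involved; the binary is resolved through PATH.
func runTesseract(imagePath string) (string, error) {
	out, err := exec.Command("tesseract", tesseractArgs(imagePath)...).Output()
	if err != nil {
		return "", fmt.Errorf("tesseract failed: %w", err)
	}
	return string(out), nil
}

func main() {
	// Print the argument list only; running OCR needs tesseract installed.
	fmt.Println(tesseractArgs("screenshot.png"))
}
```

Keeping the argument list in one function makes it easy to compare OCR output across settings when tuning for a particular font.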

Usage

Scan a directory for secrets in images and videos

trufflehog filesystem --enable-ocr /path/to/scan

Requirements:

tesseract and ffmpeg must be installed and available in PATH when --enable-ocr is set. Images work with tesseract alone; video requires both.
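
The requirement above could be checked at startup along these lines. This is a sketch under assumptions: missingOCRTools and installedOnly are hypothetical names, and the lookup function is injected so the logic can be exercised on a machine without the tools (exec.LookPath is the real lookup).

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingOCRTools reports which external binaries cannot be found.
// Images need tesseract only; video needs tesseract and ffmpeg.
func missingOCRTools(needVideo bool, lookPath func(string) (string, error)) []string {
	required := []string{"tesseract"}
	if needVideo {
		required = append(required, "ffmpeg")
	}
	var missing []string
	for _, tool := range required {
		if _, err := lookPath(tool); err != nil {
			missing = append(missing, tool)
		}
	}
	return missing
}

// installedOnly returns a lookup stub that "finds" only the named tools.
// It exists purely to simulate a machine with a partial install.
func installedOnly(names ...string) func(string) (string, error) {
	return func(tool string) (string, error) {
		for _, n := range names {
			if n == tool {
				return "/usr/bin/" + tool, nil
			}
		}
		return "", exec.ErrNotFound
	}
}

func main() {
	// Real check against PATH on the current machine.
	fmt.Println("missing here:", missingOCRTools(true, exec.LookPath))
	// Simulated machine with tesseract only: images can be scanned, video cannot.
	fmt.Println("missing for video:", missingOCRTools(true, installedOnly("tesseract")))
}
```

Returning the list of missing tools, rather than a single error, leaves the caller free to either fail or fall back to image-only OCR.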

Challenges / Constraints

  1. Character confusion: Tesseract can misread visually similar characters (0/O/Q, I/l/1, @/Q). This is inherent to OCR on rasterized text. Some secrets will be partially garbled, potentially causing missed detections
  2. Unintended spacing: OCR may insert extra spaces within tokens (e.g., AKIA IOSF instead of AKIAIOSF), which can break regex-based detector patterns
  3. Font sensitivity: Accuracy varies significantly by font. Monospaced IDE fonts (JetBrains Mono, Fira Code) generally OCR better than proportional or decorative fonts
  4. External tool dependency: Requires tesseract and ffmpeg as system-installed binaries. Not embedded in the Go binary

Future Improvements

  1. Prevent frame duplication: Deduplicate identical or near-identical video frames before OCR to avoid redundant processing and duplicate findings
  2. CI test coverage: Add tesseract and ffmpeg to CI environment so OCR tests run in the pipeline instead of being skipped
  3. Archive support: OCR images found inside archives (e.g., screenshots in a zip file)
  4. Additional format support: TIFF, BMP, GIF, WEBP for images; AVI, MOV for videos
  5. Custom tesseract models: Fine-tuned model trained on monospaced/IDE fonts for higher accuracy on code screenshots
  6. OCR text post-processing: Collapse whitespace and normalize common character confusions before feeding to detectors
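
As an illustration of item 6, one possible post-processing step is to hand the detectors a second variant of the OCR text with whitespace removed inside each line, so that a token split by a spurious space can still match. This is only a sketch of the idea: ocrVariants is a hypothetical name, and stripping all spaces is a blunt assumption that would merge adjacent words, which is why the original text is kept alongside the variant.

```go
package main

import (
	"fmt"
	"strings"
)

// ocrVariants returns the OCR text unchanged, plus a variant with all
// whitespace inside each line removed when that differs from the input.
func ocrVariants(text string) []string {
	lines := strings.Split(text, "\n")
	for i, line := range lines {
		// strings.Fields splits on any run of whitespace.
		lines[i] = strings.Join(strings.Fields(line), "")
	}
	collapsed := strings.Join(lines, "\n")
	if collapsed == text {
		return []string{text}
	}
	return []string{text, collapsed}
}

func main() {
	for _, v := range ocrVariants("key: AKIA IOSF") {
		fmt.Printf("%q\n", v)
	}
}
```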

Making It Production-Ready

  1. Standalone Dockerfile: Bundle tesseract-ocr and ffmpeg in the Docker image so --enable-ocr works out of the box without extra install steps
  2. Graceful degradation: Optionally warn instead of error when tools are missing, allowing image-only OCR when ffmpeg is absent
  3. Performance tuning: Parallel frame OCR for videos, configurable frame rate, memory-bounded processing for large media files

This closes a real gap in the secret scanning landscape: secrets don't stop being secrets just because they're in a screenshot.

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint; this requires golangci-lint)?

Note

Medium Risk
Adds an opt-in file handler that shells out to tesseract/ffmpeg and processes large binary inputs, which introduces new runtime dependencies and potential performance/resource risks when enabled.

Overview
Adds an opt-in OCR pipeline (--enable-ocr) that extracts text from supported image/video files and feeds it through the existing secret-scanning flow.

Introduces a new ocr file handler routed by MIME type (image/png, image/jpeg, video/mp4, video/x-matroska, video/webm), implements frame extraction (1fps) and image preprocessing before calling external tesseract/ffmpeg, and adds handler tests plus README documentation for installation and usage.
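
The MIME-based routing described above can be pictured as a small classifier. This is a sketch only; the type and function names (mediaKind, classifyMIME) are hypothetical and do not come from the PR, while the MIME types are the ones listed in the overview.

```go
package main

import "fmt"

// mediaKind says how a file would be treated by the OCR path.
type mediaKind int

const (
	notOCR     mediaKind = iota // not handled by OCR
	imageMedia                  // OCR the image directly
	videoMedia                  // extract frames first, then OCR each frame
)

// classifyMIME maps a detected MIME type to the OCR treatment it needs.
func classifyMIME(mime string) mediaKind {
	switch mime {
	case "image/png", "image/jpeg":
		return imageMedia
	case "video/mp4", "video/x-matroska", "video/webm":
		return videoMedia
	default:
		return notOCR
	}
}

func main() {
	for _, m := range []string{"image/png", "video/webm", "application/zip"} {
		fmt.Println(m, classifyMIME(m))
	}
}
```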

Written by Cursor Bugbot for commit e54f4b4.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Err: fmt.Errorf("%w: OCR processing error: %v", ErrProcessingWarning, err),
}
h.measureLatencyAndHandleErrors(ctx, start, err, dataOrErrChan)
return

OCR error handler sends duplicate errors to channel

Medium Severity

When OCR processing fails, the error is sent to dataOrErrChan twice: once explicitly on lines 80–82, and again inside measureLatencyAndHandleErrors on line 83, which also writes the error to the same channel. Every other handler (defaultHandler, arHandler, archiveHandler, apkHandler) relies solely on measureLatencyAndHandleErrors for error reporting. This causes duplicate error events for consumers of the channel. Worse, if the error is context.DeadlineExceeded, the second write wraps it differently and isFatal returns true, potentially causing unexpected early termination.
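
One way to picture the fix is to wrap the error once and let a single helper own the channel write. The sketch below uses simplified stand-ins, not TruffleHog's real types: result, reportErr, handleOCR, and errProcessingWarning are all hypothetical, with reportErr playing the role that the shared error-reporting helper plays in the other handlers.

```go
package main

import (
	"errors"
	"fmt"
)

// result is a simplified stand-in for the value sent on the handler's channel.
type result struct {
	Data []byte
	Err  error
}

// errProcessingWarning is a stand-in sentinel for non-fatal processing errors.
var errProcessingWarning = errors.New("processing warning")

// reportErr is the only place in this sketch that writes errors to the channel.
func reportErr(ch chan<- result, err error) {
	if err != nil {
		ch <- result{Err: err}
	}
}

// handleOCR wraps a failure once and hands it to reportErr,
// without also sending it on the channel itself.
func handleOCR(ch chan<- result, ocr func() ([]byte, error)) {
	data, err := ocr()
	if err != nil {
		reportErr(ch, fmt.Errorf("%w: OCR processing error: %v", errProcessingWarning, err))
		return
	}
	ch <- result{Data: data}
}

func main() {
	ch := make(chan result, 2)
	handleOCR(ch, func() ([]byte, error) { return nil, errors.New("ocr failed") })
	close(ch)
	for r := range ch {
		fmt.Println(r.Err)
	}
}
```

With one write path, consumers of the channel see each failure exactly once, and the wrapping decides in one place whether the error counts as a warning.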

const (
maxOCRImageSize = 50 * 1024 * 1024 // 50 MB
maxOCRVideoSize = 500 * 1024 * 1024 // 500 MB
frameIntervalSeconds = 1 // Extract 1 frame per second.

Interval constant incorrectly used as frame rate

Low Severity

The constant frameIntervalSeconds (named as a time interval) is passed directly to ffmpeg's fps filter, which expects a frame rate (frames per second). This works by coincidence because the value is 1 (1 fps = 1 second interval), but the semantics are inverted. If someone changes the value to 2 (intending a frame every 2 seconds), it would instead extract 2 frames per second — the exact opposite of the intent.
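
A possible way to keep the interval semantics is to convert the interval into a rate before building the filter string. The helper name fpsFilter is hypothetical; the sketch relies on ffmpeg's fps filter accepting a rational value such as 1/2, meaning one frame every two seconds.

```go
package main

import "fmt"

// fpsFilter converts "one frame every intervalSeconds" into an argument for
// ffmpeg's fps filter, which expects frames per second (here as a fraction).
func fpsFilter(intervalSeconds int) (string, error) {
	if intervalSeconds < 1 {
		return "", fmt.Errorf("frame interval must be at least 1 second, got %d", intervalSeconds)
	}
	return fmt.Sprintf("fps=1/%d", intervalSeconds), nil
}

func main() {
	filter, err := fpsFilter(2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The value would be passed to ffmpeg as: -vf <filter>
	fmt.Println(filter)
}
```

With an interval of 1 this yields fps=1/1, the same extraction rate as today, so the current behavior is unchanged while larger intervals do what the constant's name says.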
