Benchmark and calibration
This is a calibration and safety report, not a real-world accuracy score. It documents exactly what we have measured, how, and what we have deliberately not claimed. ChronoVerify returns investigative triage, so the number that matters most here is how often it would wrongly flag an authentic photo, and on the calibration set we hold that at zero.
The short version
- Zero false positives on authentic images in the calibration set: 0 of 40, with a wide safety margin.
- A manipulation verdict requires two corroborating pixel signals. No single signal, error level analysis included, is ever decisive on its own.
- The pipeline never returns a manipulation verdict on a C2PA-signed image, and it is biased toward inconclusive when evidence is thin.
- We do not publish a real-world false-positive or detection rate yet. Doing that credibly needs a large, diverse, labeled corpus of real photos, and we will not post a number we cannot defend.
What the pipeline optimizes for
For a verification tool, the worst outcome is calling a genuine photo fake. A false accusation is the failure that destroys trust, so the system is tuned to avoid it rather than to maximize raw detection.
Concretely, a manipulation_indicated verdict needs a localized error-level-analysis spike and a corroborating noise-dispersion signal. The internal gate is ela_localization >= 80 and noise_dispersion >= 0.45, and it is suppressed entirely when valid C2PA Content Credentials are present. When metadata and forensic signals are absent or ambiguous, the pipeline returns inconclusive rather than guess.
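The gate described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the production code: the names `Signals`, `decide_verdict`, `has_valid_c2pa`, and `evidence_sufficient` are hypothetical, and so is the rule used to choose between consistent and inconclusive. Only the two thresholds and the C2PA suppression come from the text.

```python
from dataclasses import dataclass

ELA_FLAG = 80.0     # ela_localization threshold stated in the text
NOISE_FLAG = 0.45   # noise_dispersion threshold stated in the text


@dataclass
class Signals:
    ela_localization: float
    noise_dispersion: float
    has_valid_c2pa: bool = False      # hypothetical field name
    evidence_sufficient: bool = True  # hypothetical: False when signals are absent or ambiguous


def decide_verdict(s: Signals) -> str:
    # Both pixel signals must clear their thresholds; neither is decisive alone.
    gate = s.ela_localization >= ELA_FLAG and s.noise_dispersion >= NOISE_FLAG
    # Valid Content Credentials suppress the manipulation verdict entirely.
    if gate and not s.has_valid_c2pa:
        return "manipulation_indicated"
    # Assumed fallback: thin evidence yields inconclusive rather than a guess.
    if not s.evidence_sufficient:
        return "inconclusive"
    return "consistent"
```

For example, `decide_verdict(Signals(1000, 0.1))` does not flag, because a high error-level-analysis value without the corroborating noise signal fails the gate.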
False positives on authentic images
We run a labeled corpus of 80 large images (4032 by 3024, a typical phone resolution) through the exact production pipeline at the production forensic resolution. The classes are authentic camera captures, authentic recompressed images, stripped or screenshot-like images, and edits.
Every authentic image, camera-original and recompressed alike, returned consistent. None were flagged. The authentic error-level-analysis values topped out at 1.31 against a flag threshold of 80, so authentic photos sit far from the line.
These thresholds were not tuned on synthetic data. They were hand-calibrated against real photos, where authentic images measured error-level-analysis localization up to about 23, still well under the 80 flag. An earlier synthetic-only calibration produced about a 1-in-18 false-positive rate on real photos; testing on real images and recalibrating brought that to zero. Honest caveat: synthetic error-level analysis is not representative of real photos (synthetic authentic values run near 1, real ones near 19 to 23). This set proves the verdict logic does not over-fire; it is not a real-world false-positive rate.
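The headroom behind these figures is plain arithmetic, and working it out shows why the caveat matters. The sketch below uses only the numbers quoted above; the helper name `headroom` is ours, for illustration.

```python
FLAG_THRESHOLD = 80.0  # ELA localization flag from the text


def headroom(max_authentic_ela: float, threshold: float = FLAG_THRESHOLD) -> float:
    """How many times larger the flag threshold is than the highest authentic value."""
    return threshold / max_authentic_ela


synthetic_headroom = headroom(1.31)  # calibration set: about 61x below the flag
real_headroom = headroom(23.0)       # real photos: about 3.5x below the flag
earlier_fp_rate = 1 / 18             # earlier synthetic-only calibration: about 5.6%
```

The synthetic set sits roughly 61 times under the flag, while real photos sit roughly 3.5 times under it, which is why the synthetic margin should not be read as a real-world one.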
Signal separation on a known edit
The other question is whether the pipeline can tell a real edit apart from an authentic file. On a controlled double-compression splice, where a region was recompressed at a different JPEG quality and composited back in, the localized error-level-analysis signal is unmistakable:
| Image | ELA localization | Verdict |
|---|---|---|
| Authentic capture | 1.3 | consistent |
| Double-compression splice | 1000 | manipulation_indicated |
The separation is large: the spliced region produces a localized signal roughly 770 times the authentic baseline (1000 against 1.3). We are not publishing a detection rate across diverse real edits. Pixel forensics degrade on recompressed, resized, screenshotted, or platform-stripped images, so a controlled example demonstrates that the signal separates, not that every edit in the wild will be caught.
Why the forensic resolution stays at 4000 pixels
A natural way to cut compute is to downscale the forensic working copy. We tested that directly, sweeping the same known splice across working resolutions, and it is not safe: the signal survived down to 1500 pixels but collapsed at 800 and below, where downscaling smooths the spliced patch and turns a real edit into a clean result.
| Working resolution | Splice ELA localization | Splice verdict | Authentic false positives |
|---|---|---|---|
| 4000 (production) | 1000 | manipulation_indicated | 0 |
| 2000 | 1000 | manipulation_indicated | 0 |
| 1500 | 1000 | manipulation_indicated | 0 |
| 800 | 2.8 | consistent (missed) | 0 |
| 500 | 3.1 | consistent (missed) | 0 |
Authentic photos only move further from the flag as resolution drops, so a lower resolution adds no false positives, but it trades away detection. We keep the forensic copy at 4000 pixels. More to the point: nothing that touches the forensic input ships without a sweep like this one. We measure first.
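A sweep like the one in the table can be checked mechanically. The sketch below is illustrative only: it hard-codes the table rows, the function names are hypothetical, and it tests only the error-level-analysis half of the gate because the table does not report noise dispersion.

```python
ELA_FLAG = 80.0  # ELA localization flag from the text

# (working resolution, splice ELA localization, authentic false positives),
# copied from the table above.
SWEEP = [
    (4000, 1000.0, 0),
    (2000, 1000.0, 0),
    (1500, 1000.0, 0),
    (800, 2.8, 0),
    (500, 3.1, 0),
]


def resolutions_that_still_detect(sweep, flag=ELA_FLAG):
    """Resolutions where the known splice clears the flag and no authentic image is flagged."""
    return [res for res, splice_ela, fps in sweep if splice_ela >= flag and fps == 0]


def highest_missed_resolution(sweep, flag=ELA_FLAG):
    """Highest resolution at which the known splice falls under the flag, or None."""
    misses = [res for res, splice_ela, _ in sweep if splice_ela < flag]
    return max(misses) if misses else None
```

On these rows the splice is still detected at 4000, 2000, and 1500, and the first miss appears at 800.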
What we have not measured, and will not fake
- A real-world false-positive and detection rate across a large, diverse, labeled set of real photos. This is the meaningful number, and it is pending a corpus big enough to mean something. We would rather show no number than a flattering one we cannot stand behind.
- Performance on AI-generated images. ChronoVerify is provenance-first and is deliberately not a deepfake detector.
- What a clean result proves. A consistent verdict means a file's saved data is internally consistent, not that the scene it shows is real.
Reproduce it, or help us improve it
The harness is a single script that builds the labeled corpus, runs every image through the exact production verify pipeline, and reports the false-positive rate, the inconclusive rate, and a per-image verdict diff across resolutions. The same script is run as a regression check before any change to the forensic path.
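The reporting step of such a harness might look like the sketch below. It is not the actual script: the record keys (`name`, `label`, `verdict`), the class labels, and both function names are assumptions made for illustration.

```python
# Hypothetical labels for the two authentic classes described in the text.
AUTHENTIC_CLASSES = {"authentic_camera", "authentic_recompressed"}


def summarize(results):
    """Compute false-positive and inconclusive rates from per-image records."""
    authentic = [r for r in results if r["label"] in AUTHENTIC_CLASSES]
    false_pos = sum(r["verdict"] == "manipulation_indicated" for r in authentic)
    inconclusive = sum(r["verdict"] == "inconclusive" for r in results)
    return {
        "false_positive_rate": false_pos / len(authentic) if authentic else 0.0,
        "inconclusive_rate": inconclusive / len(results) if results else 0.0,
    }


def verdict_diff(baseline, candidate):
    """Map image name to (baseline verdict, candidate verdict) where they differ."""
    base = {r["name"]: r["verdict"] for r in baseline}
    return {
        r["name"]: (base[r["name"]], r["verdict"])
        for r in candidate
        if r["name"] in base and base[r["name"]] != r["verdict"]
    }
```

Run as a regression check, a non-empty `verdict_diff` between the production resolution and a candidate change would be the signal to stop and investigate.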
The most useful thing you can do is send us real photos whose history you know: authentic captures, known edits, and screenshots. Upload one at the verifier and tell us what you expected, or email support@chronoverify.com. We will report where the verdict is right and where it is wrong, and fold real misses back into the calibration.