Run it
What it measures
1 · Value ceiling — I(X;Y)
The mutual information between the correct action X and the agent's choice
Y: the most value this agent could realize on this task. Reported against a
permutation null so it's above-chance, not raw. Plus H(X) and
saturation I/H(X) — how close to the ceiling it already is.
2 · Realized rate — ΔG
Two numbers, clearly labelled. In-sample ΔG = I is an arithmetic
identity (definitional, no empirical weight). The out-of-sample number —
calibrate the posterior on a fit split, score on a holdout — is the empirical one, with a
bootstrap CI.
3 · Dissipation — D(q‖p)
The Second-Law term: value lost to miscalibration. If your records carry a prob
(the agent's stated belief), it's measured against that. Otherwise a constructed
over-confident belief is used — and the output says which.
4 · Value per compute
I / tokens (primary, load-free) and I / median-latency (secondary;
median is load-robust). What each unit of compute actually buys in value.
The caveats are the instrument
The meter measures; it does not call an agent good or bad, and makes no benchmark-beating claim. Every run carries these:
- Task-relative. A per-(agent, task) profile, not a universal agent score.
- In-sample ΔG = I is definitional. The empirical content is the out-of-sample number and its CI.
- Dissipation is of a stated or constructed belief — the output says which.
- Intervals, not point estimates. Read the null and the bootstrap CI.
- Single-agent scope only. No multi-agent coordination/governance (that layer is unvalidated).
Run it on your own agent
For each task item, log the correct answer and your agent's answer (plus tokens/latency if you have them). Dump them as records JSON and paste above — or run it from the command line:
cd tools/value-meter python3 -m value_meter your_run.json # human-readable profile python3 -m value_meter your_run.json --json # machine-readable
There is no model dependency — it works on any agent's recorded outputs. The information-theoretic primitives are reused from the value theory's reference implementation, and a self-test reproduces the paper's published numbers exactly on the cached runs.