Agent Value-Meter

Run it

booting Python runtime…

Load an example:

fit split

What it measures

1 · Value ceiling — I(X;Y)

The mutual information between the correct action X and the agent's choice Y: the most value this agent could realize on this task. Reported against a permutation null so it's above-chance, not raw. Plus H(X) and saturation I/H(X) — how close to the ceiling it already is.

2 · Realized rate — ΔG

Two numbers, clearly labelled. In-sample ΔG = I is an arithmetic identity (definitional, no empirical weight). The out-of-sample number — calibrate the posterior on a fit split, score on a holdout — is the empirical one, with a bootstrap CI.

3 · Dissipation — D(q‖p)

The Second-Law term: value lost to miscalibration. If your records carry a prob (the agent's stated belief), it's measured against that. Otherwise a constructed over-confident belief is used — and the output says which.

4 · Value per compute

I / tokens (primary, load-free) and I / median-latency (secondary; median is load-robust). What each unit of compute actually buys in value.

The caveats are the instrument

The meter measures; it does not call an agent good or bad, and makes no benchmark-beating claim. Every run carries these:

Task-relative. A per-(agent, task) profile, not a universal agent score.
In-sample ΔG = I is definitional. The empirical content is the out-of-sample number and its CI.
Dissipation is of a stated or constructed belief — the output says which.
Intervals, not point estimates. Read the null and the bootstrap CI.
Single-agent scope only. No multi-agent coordination/governance (that layer is unvalidated).

Run it on your own agent

For each task item, log the correct answer and your agent's answer (plus tokens/latency if you have them). Dump them as records JSON and paste above — or run it from the command line:

cd tools/value-meter
python3 -m value_meter your_run.json          # human-readable profile
python3 -m value_meter your_run.json --json    # machine-readable

There is no model dependency — it works on any agent's recorded outputs. The information-theoretic primitives are reused from the value theory's reference implementation, and a self-test reproduces the paper's published numbers exactly on the cached runs.