Agent Value-Meter

Measure any agent's value profile on a task — how much value it can generate, how much it actually generates out-of-sample, how much it dissipates, and what it buys per unit of compute. All in nats, with nulls and confidence intervals, and the honesty caveats printed in the output.

This page runs the real tool — the published value_meter.py — in your browser via Pyodide. No server, no upload, no API key: your records never leave the page.

Run it

booting Python runtime…
Load an example:

What it measures

1 · Value ceiling — I(X;Y)

The mutual information between the correct action X and the agent's choice Y: the most value this agent could realize on this task. Reported against a permutation null so it's above-chance, not raw. Plus H(X) and saturation I/H(X) — how close to the ceiling it already is.

2 · Realized rate — ΔG

Two numbers, clearly labelled. In-sample ΔG = I is an arithmetic identity (definitional, no empirical weight). The out-of-sample number — calibrate the posterior on a fit split, score on a holdout — is the empirical one, with a bootstrap CI.

3 · Dissipation — D(q‖p)

The Second-Law term: value lost to miscalibration. If your records carry a prob (the agent's stated belief), it's measured against that. Otherwise a constructed over-confident belief is used — and the output says which.

4 · Value per compute

I / tokens (primary, load-free) and I / median-latency (secondary; median is load-robust). What each unit of compute actually buys in value.

The caveats are the instrument

The meter measures; it does not call an agent good or bad, and makes no benchmark-beating claim. Every run carries these:

  • Task-relative. A per-(agent, task) profile, not a universal agent score.
  • In-sample ΔG = I is definitional. The empirical content is the out-of-sample number and its CI.
  • Dissipation is of a stated or constructed belief — the output says which.
  • Intervals, not point estimates. Read the null and the bootstrap CI.
  • Single-agent scope only. No multi-agent coordination/governance (that layer is unvalidated).

Run it on your own agent

For each task item, log the correct answer and your agent's answer (plus tokens/latency if you have them). Dump them as records JSON and paste above — or run it from the command line:

cd tools/value-meter
python3 -m value_meter your_run.json          # human-readable profile
python3 -m value_meter your_run.json --json    # machine-readable

There is no model dependency — it works on any agent's recorded outputs. The information-theoretic primitives are reused from the value theory's reference implementation, and a self-test reproduces the paper's published numbers exactly on the cached runs.