Autonomous Treasury Model Tournament

3/26/2026 seed

Preamble

Model Tournament comes later, after the treasury can measure strategies without lying to itself. Models and agents can compete only when every entrant faces the same prompt, data window, cost model, risk rules, and audit burden, so that differences in results come from the entrants and not from the room.


Models Love Uneven Rooms

A model comparison is easy to fake by accident. Give one model a better prompt, a cleaner context window, a kinder tool route, a cheaper latency profile, or a more forgiving judge, and the tournament has already chosen its winner.
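One way to catch an uneven room before scoring anything is to record the conditions each entrant ran under and refuse to compare when they differ. The sketch below is a minimal illustration, not a prescribed design: the `RoomConditions` fields, the `uneven_fields` helper, and all the sample values are hypothetical names chosen to mirror the list above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RoomConditions:
    """Conditions the tournament holds constant (hypothetical field names)."""
    prompt_id: str
    data_window: str
    tool_route: str
    latency_budget_ms: int
    judge_id: str


def uneven_fields(runs: dict[str, RoomConditions]) -> dict[str, set]:
    """Return every condition that differs between entrants, with the values seen."""
    seen: dict[str, set] = {}
    for conditions in runs.values():
        for name, value in asdict(conditions).items():
            seen.setdefault(name, set()).add(value)
    # A field with more than one distinct value means the room was not the same.
    return {name: values for name, values in seen.items() if len(values) > 1}


# Illustrative entrants only: model_b was given a longer latency budget.
runs = {
    "model_a": RoomConditions("prompt_1", "window_1", "route_1", 2000, "judge_1"),
    "model_b": RoomConditions("prompt_1", "window_1", "route_1", 5000, "judge_1"),
}

differences = uneven_fields(runs)
if differences:
    print("Room is uneven, comparison is not valid:", differences)
```

An empty result does not prove the room is fair. It only shows that the recorded conditions match, which is why the record has to cover every lever named above.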

Financial benchmarks show how broad the test surface can get: extraction, question answering, sentiment, risk, forecasting, decision-making, and stock trading all pull on different parts of a model. A treasury agent adds another burden. It has to decide under cost, uncertainty, tool limits, and a record that remembers when it was wrong.

The Smartest Voice May Be The Worst Trader

A model can write the best rationale and make the worst decision. It can summarize the market beautifully while missing the fee that kills the trade. It can sound cautious while choosing a strategy with ugly tail risk. It can refuse too often and call that prudence. It can act quickly and call that edge.

Model Tournament has to score behavior before eloquence. Decision quality, cost, latency, hallucination rate, auditability, refusal behavior, tool discipline, and survival usefulness all belong in the same ledger.
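A ledger of this kind might look like the sketch below. It assumes one entry per decision and keeps the rationale for audit without ever scoring it. The field names, the `summarize` helper, and the sample numbers are all hypothetical. Auditability and survival usefulness are left out because nothing here defines how they would be measured, and no composite score is computed because any weighting would be an invention.

```python
from dataclasses import dataclass


@dataclass
class LedgerEntry:
    model: str
    decision_pnl: float   # outcome of the decision before costs
    cost: float           # fees plus inference spend for this decision
    latency_ms: float
    hallucinated: bool    # asserted something the record could not verify
    refused: bool
    tool_violations: int  # calls outside the permitted tool route
    rationale: str        # stored for audit, never scored


def summarize(entries: list[LedgerEntry]) -> dict[str, dict[str, float]]:
    """Aggregate behavior per model. Eloquence (the rationale) is ignored."""
    grouped: dict[str, list[LedgerEntry]] = {}
    for entry in entries:
        grouped.setdefault(entry.model, []).append(entry)

    summary: dict[str, dict[str, float]] = {}
    for model, rows in grouped.items():
        n = len(rows)
        summary[model] = {
            "decisions": n,
            "net_pnl": sum(r.decision_pnl - r.cost for r in rows),
            "mean_latency_ms": sum(r.latency_ms for r in rows) / n,
            "hallucination_rate": sum(r.hallucinated for r in rows) / n,
            "refusal_rate": sum(r.refused for r in rows) / n,
            "tool_violations": sum(r.tool_violations for r in rows),
        }
    return summary


# Illustrative numbers only: the longer rationale misses the fee that kills the trade.
entries = [
    LedgerEntry("eloquent", 10.0, 12.0, 900.0, False, False, 0,
                "A long, careful summary of market conditions."),
    LedgerEntry("terse", 6.0, 1.0, 300.0, False, False, 0, "Small edge, low fee."),
]

for model, stats in summarize(entries).items():
    print(model, stats)
```

Keeping the columns separate is deliberate: a model that refuses often or acts quickly shows up as a refusal rate or a latency figure next to its net result, not as prudence or edge.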

Agent Shape Matters

The model is only one piece of the creature. Retrieval, memory, tool permissions, prompt contract, evaluator, planner, execution route, and kill switch all change the result. A weaker base model inside a stricter cage may outperform a stronger model with too much freedom.
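One consequence is that results should be keyed by the whole setup and not by the model name. The sketch below assumes each piece can be described by a short label and derives an identifier from all of them together. `AgentSetup`, `setup_id`, and every label are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AgentSetup:
    """The whole creature, not just the model (hypothetical field names)."""
    model: str
    retrieval: str
    memory: str
    tool_permissions: tuple[str, ...]
    prompt_contract: str
    evaluator: str
    planner: str
    execution_route: str
    kill_switch: str


def setup_id(setup: AgentSetup) -> str:
    """Derive a stable identifier from every component of the setup."""
    canonical = json.dumps(asdict(setup), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Same base model, two cages: one read-only, one allowed to execute.
strict = AgentSetup(
    model="base_model_x",
    retrieval="retrieval_v1",
    memory="none",
    tool_permissions=("read_prices",),
    prompt_contract="contract_v1",
    evaluator="evaluator_v1",
    planner="single_step",
    execution_route="paper",
    kill_switch="enabled",
)
loose = replace(strict, tool_permissions=("read_prices", "submit_order"))

print(strict.model == loose.model)           # same model
print(setup_id(strict) == setup_id(loose))   # different tournament entrants
```

Under this scheme a change to tool permissions or the kill switch produces a new entrant, so an improvement cannot be credited to the model when the cage is what moved.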

That is the latent question beneath the tournament. Intelligence alone cannot define the treasury. The boundary around intelligence changes what intelligence becomes.

Comparison Waits Its Turn

Model Tournament waits until Research Stack, Simulation Arena, Strategy Tournament, Investor Personas, and Risk And Validity have made the room fair enough to matter. Before that, model comparison is mostly costume change.

The winning model earns a narrower claim than the public usually wants: this model-agent setup made better treasury decisions under this cage, with this data, this cost model, this wallet boundary, and this record of failure.