Benchmarks

We treat PQ readiness as a set of measurable bottlenecks (bytes, sync points, and latency distributions) and rerun the same workloads as the protocol evolves, so results stay comparable across versions. This page lists what we measure; specific numbers live in the dated benchmark reports.

Consensus metrics

Time to notarization / finalization (p50 / p95 / p99).
End-to-end bytes per view — total bytes a validator transmits and receives to advance one view.
Bytes broadcast per validator per view, which sets the slope of the byte budget as the validator count grows.
Durability sync points — WAL / fsync impact on latency tails. Larger PQ artifacts can make these more visible.
Sign-time tails under load — particularly for Falcon (rejection-sampling based).
Certificate size vs validator count, with and without thresholding. The headline metric for the PQ scaling cliff.
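The certificate-size metric can be sketched as simple arithmetic. The example below is illustrative only: it assumes a quorum of 2f + 1 out of n = 3f + 1 validators, a 666-byte Falcon-512 signature, a one-bit-per-validator signer bitmap, and a hypothetical threshold signature that costs about as much as one ordinary signature. None of these values come from the benchmark reports.

```python
# Illustrative sketch: every constant here is an assumption, not a measurement.

FALCON512_SIG_BYTES = 666  # Falcon-512 compressed signature size (Falcon spec)
THRESHOLD_SIG_BYTES = 666  # hypothetical: one aggregate signature of similar size


def quorum_size(n: int) -> int:
    """Assumed quorum rule: 2f + 1 votes, where f = floor((n - 1) / 3)."""
    f = (n - 1) // 3
    return 2 * f + 1


def cert_bytes_individual(n: int, sig_bytes: int = FALCON512_SIG_BYTES) -> int:
    """Certificate with one signature per quorum member plus a signer bitmap."""
    bitmap_bytes = (n + 7) // 8
    return quorum_size(n) * sig_bytes + bitmap_bytes


def cert_bytes_threshold(n: int, sig_bytes: int = THRESHOLD_SIG_BYTES) -> int:
    """Certificate with a single threshold signature; size does not depend on n."""
    return sig_bytes


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        print(n, cert_bytes_individual(n), cert_bytes_threshold(n))
```

Under these assumptions the individual-signature certificate grows roughly linearly with the validator count, while the thresholded one stays flat. The real numbers depend on the encoding and on the threshold scheme's actual output size.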

User layer metrics

Transaction byte size — pk-in-every-tx vs KeyVault key-id reference. The KeyVault-detached model reduces ML-DSA-44 wire size by ~35%.
Composite signature verification throughput — primary + cosigner, across the four supported schemes.
Precompile verification throughput — KeyVault, NonceManager, CryptoSwitchboard, ML-DSA verifier, Falcon verifier.
Block propagation vs tx size — as average tx size grows, how does propagation latency scale?
Cold-vault tx overhead — ML-DSA primary + SLH-DSA cosigner, vs a standard composite (ML-DSA primary + optional P256 / ECDSA cosigner).
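The ~35% figure for ML-DSA-44 can be checked with back-of-envelope arithmetic. The sketch below uses the FIPS 204 sizes for ML-DSA-44 (1,312-byte public key, 2,420-byte signature) and counts only the public key and signature, ignoring the rest of the transaction. The key-id width is a hypothetical parameter, since this page does not specify it.

```python
# Back-of-envelope check; payload bytes are deliberately ignored (assumption).

ML_DSA_44_PK_BYTES = 1312   # public key size per FIPS 204
ML_DSA_44_SIG_BYTES = 2420  # signature size per FIPS 204


def auth_bytes_inline() -> int:
    """Public key shipped in every transaction alongside the signature."""
    return ML_DSA_44_PK_BYTES + ML_DSA_44_SIG_BYTES


def auth_bytes_detached(key_id_bytes: int) -> int:
    """Signature plus a short key-id reference (width is hypothetical)."""
    return key_id_bytes + ML_DSA_44_SIG_BYTES


def saving_fraction(key_id_bytes: int) -> float:
    """Fraction of authentication bytes saved by referencing the key."""
    inline = auth_bytes_inline()
    return (inline - auth_bytes_detached(key_id_bytes)) / inline


if __name__ == "__main__":
    for key_id_bytes in (0, 8, 32):
        print(key_id_bytes, f"{saving_fraction(key_id_bytes):.1%}")
```

For any key-id width up to 32 bytes the saving stays between 34% and 36% of the authentication bytes, which is consistent with the ~35% quoted above. A transaction with a large payload would see a smaller relative saving.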

Crypto metrics

Sign / verify microbenchmarks across CPU targets.
Implementation variance and tail latency — same scheme, different implementations, different hardware.
Mithril threshold signing latency under realistic custody workflows (TEE coordination, policy checks, network round-trips).
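A sign / verify microbenchmark of the kind listed above needs one latency sample per call so that tails are visible, not just an average. The following is a minimal sketch of such a harness, not the project's tooling. The `measure_ns` helper and the SHA-256 stand-in workload are hypothetical; a real run would substitute the scheme's sign or verify call.

```python
import hashlib
import time
from typing import Callable, List


def measure_ns(op: Callable[[], object], iterations: int, warmup: int = 100) -> List[int]:
    """Return one wall-clock latency sample (nanoseconds) per call to op."""
    for _ in range(warmup):
        op()  # warm caches and lazy initialisation before recording
    samples: List[int] = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        op()
        samples.append(time.perf_counter_ns() - start)
    return samples


def stand_in_sign() -> bytes:
    # Placeholder workload, NOT a PQ signature: replace with the real sign call.
    return hashlib.sha256(b"benchmark message").digest()


if __name__ == "__main__":
    samples = sorted(measure_ns(stand_in_sign, iterations=10_000))
    print("min ns:", samples[0], "max ns:", samples[-1])
```

Keeping the raw samples, instead of a running mean, is what makes it possible to compare implementations and hardware on tail latency afterwards.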

PQ Wallet Layer metrics

PQ smart wallet gas costs and overhead vs an equivalent ECDSA UserOp.
Verifier contract gas costs per supported chain (Stylus on Arbitrum: ~374K gas; pure-EVM verifier costs on chains without Stylus are higher).
End-to-end UX latency — sign → submit → finalize.
Tooling friction for Foundry / Hardhat and common dev stacks.

P2P transport evaluation (post-mainnet)

If / when we evaluate ML-KEM for P2P key establishment:

Handshake size and fragmentation behavior.
Handshake CPU costs and tail latency.
Connection churn impacts — reconnect storms, NAT traversal, mobile links.
Operational debugging complexity and failure modes.
Compatibility with existing network stacks and observability tooling.

We do not plan a hybrid KEM design at this stage. The intent is to decide on a clean single-lane approach if and when the measured data justifies it.

Localnet baselines

Measured on a current dev laptop with a debug build, empty blocks, localnet init --nodes 4:

Total RSS ≈ 10 GB (≈ 2.5 GB per node × 4).
Total CPU < 1 core.
Steady-state at empty blocks.

These numbers grow under real transaction load. The repository's dated benchmark reports (e.g., docs/coding-egress-bench-2026-04-29.md) cover outbound bandwidth at N = 4 and N = 8 under both Standard and Coding marshal variants. Re-run before drawing conclusions: the protocol moves and the numbers move with it.

How to read benchmark numbers honestly

A few notes that apply to every chart we publish.

A devnet number is a lower bound on cost. Devnet charges nothing for L1 data, whereas on Arbitrum, Optimism, or mainnet, calldata pricing dominates what users actually pay.
Single-thread microbench ≠ end-to-end. Sign / verify microbenchmarks ignore scheduling, allocator pressure, and the cost of moving bytes between layers. We publish microbenchmarks for context, not for cost forecasting.
Tail latency is the metric that matters. Falcon signing has occasional rejection-sampling retries; Mithril threshold signing has more. p95 / p99 numbers tell a different story than p50, so the report includes all three.
Validator count matters more than block size. Quorum proofs scale linearly with validator count. Double the validators, double the per-view byte budget. Breaking that linear growth is the leverage threshold signatures are meant to provide.
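The tail-latency note above can be made concrete with a small percentile calculation. This is a sketch using the nearest-rank definition and synthetic samples (97 fast operations and 3 slow ones). The data is invented for illustration and is not a measurement of Falcon or Mithril.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Report the median together with the tails."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # Synthetic latencies in milliseconds: mostly fast, a few slow retries.
    synthetic_ms = [1.0] * 97 + [9.0] * 3
    print(summarize(synthetic_ms))
```

With this synthetic data the median and p95 both sit at the fast value while p99 lands on the slow one, which is why a p50-only chart can hide retry behaviour entirely.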

Related

PQ scaling cliff: the byte-budget analysis that drives these measurements
Consensus (Simplex + Falcon): what the consensus numbers are measuring
Roadmap: benchmark-gated milestones
