Benchmarks
We treat PQ readiness as a set of measurable bottlenecks (bytes, sync points, and latency distributions) and rerun the same workloads every time the protocol changes. This page lists what we measure; specific numbers live in the dated benchmark reports.
Consensus metrics
- Time to notarization / finalization (p50 / p95 / p99).
- End-to-end bytes per view — total bytes a validator transmits and receives to advance one view.
- Bytes broadcast per validator per view: the per-validator term that sets the slope of the byte budget as validator count grows.
- Durability sync points — WAL / fsync impact on latency tails. Larger PQ artifacts can make these more visible.
- Sign-time tails under load — particularly for Falcon (rejection-sampling based).
- Certificate size vs validator count, with and without thresholding. The headline metric for the PQ scaling cliff.
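As an illustration of how the p50 / p95 / p99 figures above can be derived from raw samples, here is a minimal sketch using the nearest-rank method. The sample values and the names `percentile` and `summarize` are hypothetical, and the benchmark reports may use a different estimator.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100.

    Returns the smallest sample such that at least q% of all samples
    are less than or equal to it.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, to avoid floating-point rounding surprises.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Report the three percentiles listed in the consensus metrics."""
    return {f"p{q}": percentile(latencies_ms, q) for q in (50, 95, 99)}


# Hypothetical finalization latencies in milliseconds:
# 90 fast views, 8 moderately slow views, 2 very slow views.
samples = [200.0] * 90 + [450.0] * 8 + [1500.0] * 2
summary = summarize(samples)
print(summary)
```

With this made-up distribution the median stays at the fast value while p95 and p99 pick up the slow views, which is why all three are reported together.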
User layer metrics
- Transaction byte size — pk-in-every-tx vs KeyVault key-id reference. The KeyVault-detached model reduces ML-DSA-44 wire size by ~35%, because the public key no longer travels with every transaction.
- Composite signature verification throughput — primary + cosigner, across the four supported schemes.
- Precompile verification throughput — KeyVault, NonceManager, CryptoSwitchboard, ML-DSA verifier, Falcon verifier.
- Block propagation vs tx size — as average tx size grows, how does propagation latency scale?
- Cold-vault tx overhead — ML-DSA primary + SLH-DSA cosigner, vs a standard composite (ML-DSA primary + optional P256 / ECDSA cosigner).
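The ~35% figure for ML-DSA-44 can be sanity-checked from the FIPS 204 parameter sizes alone. The sketch below counts only the authentication material (public key plus signature) and ignores every other transaction field. The 32-byte key-id width is an assumption for illustration, not the KeyVault encoding.

```python
# ML-DSA-44 sizes in bytes, as specified in FIPS 204.
ML_DSA_44_PUBLIC_KEY = 1312
ML_DSA_44_SIGNATURE = 2420

# Hypothetical key-id width; the real KeyVault reference may differ.
ASSUMED_KEY_ID_BYTES = 32


def auth_bytes_inline():
    """Public key carried in every transaction, next to the signature."""
    return ML_DSA_44_PUBLIC_KEY + ML_DSA_44_SIGNATURE


def auth_bytes_detached(key_id_bytes):
    """Public key stored once; the transaction carries a key id instead."""
    return key_id_bytes + ML_DSA_44_SIGNATURE


def reduction(key_id_bytes):
    """Fractional saving of the detached model over the inline model."""
    inline = auth_bytes_inline()
    return (inline - auth_bytes_detached(key_id_bytes)) / inline


best_case = reduction(0)                      # key id treated as free
with_key_id = reduction(ASSUMED_KEY_ID_BYTES)
print(f"{best_case:.1%} upper bound, {with_key_id:.1%} with a 32-byte key id")
```

The saving is bounded by the share of the public key in the inline payload (1312 of 3732 bytes), so schemes with a different key-to-signature ratio will see a different percentage.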
Crypto metrics
- Sign / verify microbenchmarks across CPU targets.
- Implementation variance and tail latency — same scheme, different implementations, different hardware.
- Mithril threshold signing latency under realistic custody workflows (TEE coordination, policy checks, network round-trips).
PQ Wallet Layer metrics
- PQ smart wallet gas costs and overhead vs an equivalent ECDSA UserOp.
- Verifier contract gas costs per supported chain (Stylus on Arbitrum: ~374K gas; pure-EVM verifier costs on chains without Stylus are higher).
- End-to-end UX latency — sign → submit → finalize.
- Tooling friction for Foundry / Hardhat and common dev stacks.
P2P transport evaluation (post-mainnet)
If / when we evaluate ML-KEM for P2P key establishment:
- Handshake size and fragmentation behavior.
- Handshake CPU costs and tail latency.
- Connection churn impacts — reconnect storms, NAT traversal, mobile links.
- Operational debugging complexity and failure modes.
- Compatibility with existing network stacks and observability tooling.
We do not plan a hybrid KEM design at this stage. The intent is to decide on a clean single-lane approach if and when the measured data justifies it.
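To make the handshake size and fragmentation question concrete, the following sketch estimates how many datagrams each handshake flight would need for the three ML-KEM parameter sets. The key and ciphertext sizes come from FIPS 203. The 1200-byte datagram budget and the 100 bytes of framing overhead are assumptions for illustration, not properties of the actual transport.

```python
# (encapsulation key bytes, ciphertext bytes) per FIPS 203.
ML_KEM_SIZES = {
    "ML-KEM-512": (800, 768),
    "ML-KEM-768": (1184, 1088),
    "ML-KEM-1024": (1568, 1568),
}

# Assumptions for illustration only.
ASSUMED_DATAGRAM_BUDGET = 1200   # usable payload bytes per datagram
ASSUMED_FRAMING_OVERHEAD = 100   # other handshake fields per flight


def datagrams_needed(payload_bytes, budget=ASSUMED_DATAGRAM_BUDGET):
    """Integer ceiling of payload_bytes / budget."""
    return -(-payload_bytes // budget)


def handshake_flights(name):
    """Datagram count for the initiator flight (key) and responder flight (ciphertext)."""
    encaps_key, ciphertext = ML_KEM_SIZES[name]
    return (
        datagrams_needed(encaps_key + ASSUMED_FRAMING_OVERHEAD),
        datagrams_needed(ciphertext + ASSUMED_FRAMING_OVERHEAD),
    )


report = {name: handshake_flights(name) for name in ML_KEM_SIZES}
for name, (initiator, responder) in report.items():
    print(f"{name}: initiator {initiator} datagram(s), responder {responder} datagram(s)")
```

Under these assumptions ML-KEM-512 fits each flight in one datagram, while the larger parameter sets spill into a second one, which is the behavior a fragmentation benchmark would need to confirm on real links.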
Localnet baselines
Measured on a current dev laptop with a debug build, empty blocks, localnet init --nodes 4:
- Total RSS ≈ 10 GB (≈ 2.5 GB per node × 4).
- Total CPU < 1 core.
- Both figures are steady-state values with empty blocks.
These numbers grow under real transaction load. The repository's dated benchmark reports (e.g., docs/coding-egress-bench-2026-04-29.md) cover outbound bandwidth at N = 4 and N = 8 under both Standard and Coding marshal variants. Re-run before drawing conclusions: the protocol moves and the numbers move with it.
How to read benchmark numbers honestly
A few notes that apply to every chart we publish.
- A devnet number is a lower bound on cost. L1 data costs are zero on devnet, whereas calldata pricing on Arbitrum, Optimism, or mainnet dominates the real user-layer cost.
- Single-thread microbench ≠ end-to-end. Sign / verify microbenchmarks ignore scheduling, allocator pressure, and the cost of moving bytes between layers. We publish microbenchmarks for context, not for cost forecasting.
- Tail latency is the metric that matters. Falcon signing has occasional rejection-sampling retries; Mithril threshold signing adds further sources of tail latency (TEE coordination, policy checks, network round-trips). p95 / p99 numbers tell a different story than p50, so all three go in the report.
- Validator count matters more than block size. Without thresholding, quorum proofs scale linearly with validator count: double the validators, double the per-view byte budget. This is the leverage threshold signatures are meant to unlock.
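The linear scaling in the last note can be sketched with a simple size model. This is an estimate under stated assumptions, not the protocol's certificate encoding: it assumes a quorum of floor(2n/3) + 1 signers, one 666-byte Falcon-512 signature per signer (the padded signature length in the Falcon specification), and a one-bit-per-validator signer bitmap.

```python
FALCON_512_SIGNATURE_BYTES = 666  # padded Falcon-512 signature length


def quorum(validators):
    """Assumed quorum rule: strictly more than two thirds of the validators."""
    return (2 * validators) // 3 + 1


def certificate_bytes(validators):
    """Estimated size of a certificate built from individual signatures.

    Assumed layout: one signature per quorum member plus a signer bitmap
    with one bit per validator. Real encodings will add further fields.
    """
    bitmap_bytes = (validators + 7) // 8
    return quorum(validators) * FALCON_512_SIGNATURE_BYTES + bitmap_bytes


sizes = {n: certificate_bytes(n) for n in (4, 8, 100, 200)}
doubling_ratio = sizes[200] / sizes[100]

for n, size in sizes.items():
    print(f"N={n}: quorum {quorum(n)}, about {size} bytes per certificate")
print(f"doubling 100 -> 200 validators multiplies size by {doubling_ratio:.2f}")
```

A threshold scheme would replace the per-signer term with a single signature whose size does not depend on the validator count, which is what flattens the curve.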
Related
- PQ scaling cliff — the byte-budget analysis that drives these measurements
- Consensus (Simplex + Falcon) — what the consensus numbers are measuring
- Roadmap — benchmark-gated milestones