Running Practical Worlds Onchain
A technical research paper on how blockchains can scale toward hundreds of millions of TPS, with a numerical throughput model and practical approaches for adding AI acceleration while preserving decentralization
Abstract
This paper builds a practical, technical argument, with a step-by-step numerical model, for how blockchains can be architected to reach very high sustained throughput (100M+ TPS) in realistic deployments. We synthesize proven building blocks from the literature (DAG & pipelined consensus, BLS aggregation, parallel execution engines, erasure-coded data availability, compact block propagation, hardware acceleration) and show how they fit together. We then present a numerical model (explicit equations and arithmetic) for a 100M TPS design point, analyze bottlenecks (network, consensus, execution, data availability), and describe how AI acceleration can be introduced without centralizing ML control, using federated learning, secure aggregation, TEEs/MPC, and verifiable computation. Key prior works used for inspiration and grounding include Bullshark/Narwhal (DAG BFT), Block-STM (highly parallel execution), BLS aggregation, Celestia-style data availability/erasure coding, Solana-style propagation ideas, and secure aggregation for federated learning.
1. Introduction & motivation
A single monolithic L1 that naively tries to process every transaction on every full validator will always hit a hard ceiling. Practical ultra-high-TPS architectures therefore combine horizontal parallelism (sharding / many sequencers), fast deterministic local execution (compiled & parallel VMs), sublinear consensus and signature primitives (BLS, aggregation), and data availability systems with erasure coding and sampling. Our goal is to present: (a) a reproducible numerical model; (b) a practical architecture that composes known building blocks; and (c) strategies for adding decentralized AI acceleration.
2. Building-blocks
Bullshark / Narwhal: DAG-based ordering with data separation that reduces latency and makes DAG BFT practical; a useful pattern for high-throughput ordering with short fast paths. arXiv
Block-STM: demonstrates how a pre-ordered block can be executed in parallel at very high throughput by dynamically detecting dependencies (reported 100k–170k-level TPS in experiments). This informs the per-node execution throughput assumptions below. arXiv
BLS signature aggregation: compresses many validator signatures into one constant-size signature; crucial to eliminating O(N) signature overhead in consensus rounds. IETF
Data availability & erasure coding (Celestia-style ideas): shifting from “everyone must store everything” to sampling-based availability checks with erasure coding reduces per-node bandwidth/storage pressure. Celestia Docs
Block propagation techniques (e.g., multi-layer / Reed-Solomon shreds): reduce leader upload bottlenecks and spread block dissemination load.
3. Model, assumptions & notation (so the math is reproducible)
We construct a numerical model for a target throughput T_target = 100,000,000 TPS (100M TPS). Below are the core variables and the base assumptions used for the worked example.
Variables / base assumptions
T_target = target TPS (100,000,000).
s_avg = average compressed transaction size (bytes). Base case: 80 bytes per tx (achievable with schema-aware + streaming compression for simple payments / short calldata). (We analyze sensitivity later.)
S_total = T_target * s_avg = bytes/sec global.
B_global = total aggregate network traffic required (bits/sec) = S_total * 8.
shards = number of independent ordered partitions (sequencer shards / execution domains). We pick shards = 1000 for the base scenario → per-shard TPS = T_target / shards.
t_per_core = per-core execution capacity (TPS per CPU core) under a highly optimized parallel VM. We use a conservative t_per_core = 5000 TPS/core inspired by Block-STM experimental figures (Block-STM achieved on the order of hundreds of thousands of TPS on multi-core machines; dividing their peak by thread counts gives a per-core ballpark). arXiv
cores_per_machine = 128 (dense server/edge machine with many cores).
E_machine = cores_per_machine * t_per_core = execution capacity per sequencer/validator machine.
Network & hardware efficiency assumptions are conservative; the final section outlines experimental checks.
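For reproducibility, the base-case arithmetic worked out in Section 4 can be captured in a short script. The following is a minimal Python sketch using only the assumptions listed above; nothing in it is measured.

```python
import math

# Base-case assumptions from Section 3 (all assumed, not measured).
T_TARGET = 100_000_000            # target global TPS
S_AVG = 80                        # average compressed tx size, bytes
SHARDS = 1_000                    # independent sequenced partitions
T_PER_CORE = 5_000                # assumed execution TPS per core (Block-STM-inspired)
CORES_PER_MACHINE = 128

# 4.1 Global data rate
s_total_bytes = T_TARGET * S_AVG                    # 8,000,000,000 bytes/sec
b_global_gbps = s_total_bytes * 8 / 1e9             # 64 Gbps

# 4.2 Per-shard throughput
t_shard = T_TARGET / SHARDS                         # 100,000 TPS per shard

# 4.3 Execution capacity per machine
e_machine = CORES_PER_MACHINE * T_PER_CORE          # 640,000 TPS per machine

# 4.4 Machines needed
machines_per_shard = t_shard / e_machine            # 0.15625
total_machines = machines_per_shard * SHARDS        # 156.25 -> 157 rounded up

# 4.5 Bandwidth per machine
bw_machine_mbps = b_global_gbps / total_machines * 1000   # ~410 Mbps

print(f"global payload: {b_global_gbps:.0f} Gbps")
print(f"sequencer machines (rounded up): {math.ceil(total_machines)}")
print(f"per-machine bandwidth: {bw_machine_mbps:.0f} Mbps")
```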
4. Step-by-step numerical calculation (explicit arithmetic)
We will compute all intermediate values carefully and explicitly.
4.1 Global data rate
T_target = 100,000,000 transactions/sec.
s_avg = 80 bytes/tx.
Compute bytes/sec:
S_total = T_target * s_avg = 100,000,000 * 80 = 8,000,000,000 bytes/sec.
Convert to Gbps:
bits/sec = 8,000,000,000 * 8 = 64,000,000,000 bits/sec = 64 Gbps.
Result 4.1: With 80-byte compressed txs, the global payload bandwidth required is 64 Gbps.
4.2 Partition into shards / sequencers
Choose shards = 1000.
Per-shard TPS:
T_shard = T_target / shards = 100,000,000 / 1000 = 100,000 TPS per shard.
4.3 Execution capacity per machine
t_per_core = 5,000 TPS/core (assumption from Block-STM scaling baseline). arXiv
cores_per_machine = 128.
E_machine = cores_per_machine * t_per_core = 128 * 5,000 = 640,000 TPS per machine.
4.4 Machines needed
machines_per_shard = T_shard / E_machine = 100,000 / 640,000 = 0.15625 machines per shard (i.e., one machine could handle ~6 shards at this efficiency).
total_sequencer_machines = machines_per_shard * shards = 0.15625 * 1000 = 156.25 → 157 machines (rounded up).
Interpretation: Using these (conservative) execution assumptions, ~157 sequencer/executor machines, each with 128 cores, could cover the execution workload for 100M TPS with ample CPU headroom.
4.5 Bandwidth per machine
Global bandwidth (from 4.1) = 64 Gbps.
Spread across total_sequencer_machines ≈ 156.25, the per-machine bandwidth is:
BW_machine = 64 Gbps / 156.25 = 0.4096 Gbps ≈ 410 Mbps.
Result 4.5: Each sequencer machine needs ≈ 410 Mbps of sustained network capacity to carry its share of payload, modest compared to modern datacenter NICs (10–100 Gbps).
4.6 Consensus & signature cost (high level)
If each block (per shard) is aggregated using BLS into one signature, the validator-side signature verification cost is O(1) per block rather than O(n) per signer; this removes a major per-block, per-validator overhead. BLS aggregation and constant-size commit proofs are essential to scaling to thousands of shards without multiplying signature traffic. IETF
The number of blocks per second per shard depends on the batching strategy. Example: if the shard batches B_tx = 1,000 tx per block, then blocks_per_sec_shard = T_shard / B_tx = 100,000 / 1,000 = 100 blocks/sec. Each block requires verifying one aggregate signature (100 verify ops/sec per node per shard). If nodes verify across all shards they subscribe to, verification load becomes block_rate × shards_subscribed, but aggregate signatures keep that load tiny and constant per block.
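As a sketch of this batching arithmetic (the block size B_tx = 1,000 is the illustrative figure above, not a recommendation):

```python
# Batching arithmetic from 4.6, using the base-case shard throughput.
T_SHARD = 100_000    # TPS per shard (from 4.2)
B_TX = 1_000         # transactions batched per block (illustrative)

blocks_per_sec_shard = T_SHARD / B_TX            # 100 blocks/sec

# With BLS aggregation, each block requires one aggregate-signature
# verification, independent of the number of signers.
verifications_per_sec = blocks_per_sec_shard     # 100 verify ops/sec per shard

# A node subscribed to k shards verifies k * blocks_per_sec_shard aggregates:
for k in (1, 10, 100):
    print(f"{k} shards subscribed -> {k * blocks_per_sec_shard:.0f} "
          "aggregate verifications/sec")
```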
5. Sensitivity & reality checks
Bandwidth sensitivity:
If s_avg increases to 200 bytes (less compression), global bits/sec = 100,000,000 × 200 × 8 = 160 Gbps. That implies per-machine bandwidth ≈ 1.02 Gbps with the earlier 157 machines, still achievable with datacenter NICs but more demanding for globally distributed validators.
Execution sensitivity:
If t_per_core drops to 2,500 TPS/core (less optimized VM), then E_machine = 128 × 2,500 = 320k, so the number of machines needed doubles to ~314, still within practical datacenter scale.
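The two sensitivity checks above can be swept programmatically; the sketch below only re-runs the Section 4 arithmetic over assumed values and measures nothing.

```python
# Sensitivity sweep over tx size and per-core execution rate.
import math

T_TARGET = 100_000_000
CORES = 128

for s_avg in (80, 200):                    # bytes per compressed tx (assumed)
    for t_core in (5_000, 2_500):          # assumed TPS per core
        gbps = T_TARGET * s_avg * 8 / 1e9
        machines = math.ceil(T_TARGET / (CORES * t_core))
        print(f"s_avg={s_avg}B, t_per_core={t_core}: "
              f"{gbps:.0f} Gbps total, {machines} machines, "
              f"{gbps / machines * 1000:.0f} Mbps per machine")
```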
Latency / propagation:
These numbers intentionally assume data partitioning, so each shard/sequencing domain is responsible for only a fraction of global data. The key global limit remains propagation & finality: to achieve subsecond finality across globally distributed parties you must (a) restrict the validator set per shard / use committees and aggregate commits, (b) use fast DAG ordering (e.g., Bullshark/Narwhal patterns) for low latency, and (c) use erasure coding / sampling for DA to avoid everyone needing full copies. arXiv Celestia Docs
6. Putting the architecture pieces together (practical blueprint)
To reach 100M+ TPS in practice, compose the following layers:
Many sequenced execution partitions (shards/sectors/rollups)
Partition the global transaction namespace into S sequenced domains (hundreds to ~10k). Each domain has independent sequencers + committees. This is what makes throughput scale roughly linearly with the number of domains.
High-performance local execution engine
Use a VM with compiled (or JIT/native) smart contract execution, a multi-threaded parallel engine (Block-STM style), and deterministic scheduling for reproducibility. This maximizes t_per_core. arXiv
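To make the Block-STM-style idea concrete, here is a toy sketch of optimistic parallel execution with read-set validation. It is not the actual Block-STM algorithm (no multi-version memory, no collaborative scheduler), and the transaction format is invented for illustration.

```python
# Toy sketch: execute a pre-ordered block speculatively in parallel, record
# read/write sets, then validate in block order and re-execute any transaction
# whose reads were invalidated by an earlier transaction's writes.
from concurrent.futures import ThreadPoolExecutor

class SpecView:
    """Records what a transaction read from / wrote over the base state."""
    def __init__(self, base):
        self.base, self.reads, self.writes = base, {}, {}
    def get(self, key):
        if key not in self.writes:
            self.reads[key] = self.base.get(key, 0)
            return self.reads[key]
        return self.writes[key]
    def set(self, key, value):
        self.writes[key] = value

def execute_block(state, txs, workers=8):
    def speculate(tx):
        view = SpecView(state)
        tx(view)
        return view

    with ThreadPoolExecutor(max_workers=workers) as pool:
        views = list(pool.map(speculate, txs))        # optimistic pass

    for tx, view in zip(txs, views):                  # validate in block order
        stale = any(state.get(k, 0) != v for k, v in view.reads.items())
        if stale:
            view = speculate(tx)                      # re-run on fresh state
        state.update(view.writes)                     # commit
    return state

# Example workload: simple payments touching mostly disjoint accounts.
def transfer(src, dst, amount):
    def tx(view):
        view.set(src, view.get(src) - amount)
        view.set(dst, view.get(dst) + amount)
    return tx

state = {f"acct{i}": 1_000 for i in range(100)}
block = [transfer(f"acct{i}", f"acct{(i + 1) % 100}", 10) for i in range(100)]
execute_block(state, block)
print(state["acct0"], state["acct1"])   # 1000 1000: each account sends and receives 10
```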
Streaming compression + compact tx formats
Use schema-aware binary formats + sliding-window streaming compression to reduce s_avg, since blocks are time-series data, not independent random bytes.
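A small sketch of the effect: the 52-byte transaction layout and the hot-account distribution below are invented for illustration, but they show why one streaming compressor over a batch beats compressing each transaction independently.

```python
# Schema-aware binary packing + streaming compression vs per-tx compression.
import os
import random
import struct
import zlib

accounts = [os.urandom(20) for _ in range(100)]   # hot accounts repeat across txs

def pack_tx(sender, recipient, amount, nonce):
    # 20B sender + 20B recipient + 8B amount + 4B nonce = 52-byte fixed layout
    return struct.pack(">20s20sQI", sender, recipient, amount, nonce)

txs = [pack_tx(random.choice(accounts), random.choice(accounts), 1_000 + i, i)
       for i in range(10_000)]

# Per-tx compression: every call starts with an empty dictionary.
per_tx = sum(len(zlib.compress(t)) for t in txs)

# Streaming compression: one compressor shares a sliding window across the batch,
# so repeated addresses and structured fields compress well.
comp = zlib.compressobj(level=6)
streamed = sum(len(comp.compress(t)) for t in txs) + len(comp.flush())

print(f"raw {len(txs) * 52} B | per-tx compressed {per_tx} B | streamed {streamed} B")
```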
Distributed data availability (erasure coding + sampling)
Use 2D Reed-Solomon or namespaced Merkle trees and data sampling to ensure availability without every node needing full copies; only a subset holds full data, while the rest sample. Celestia Docs
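The value of sampling can be seen with a back-of-the-envelope calculation; the withholding fractions and sample counts below are illustrative assumptions, not parameters of any specific DA design.

```python
# If a block producer withholds a fraction f of the erasure-coded shares, the
# chance that k independent uniform samples all miss the withheld portion is
# (1 - f)^k, so detection probability is 1 - (1 - f)^k. With 2D Reed-Solomon
# coding, blocking reconstruction requires withholding a sizeable fraction,
# so small k already gives high detection probability per light node.
for f in (0.25, 0.5):              # fraction of shares withheld (assumed)
    for k in (10, 20, 30):         # samples per light node (assumed)
        p_detect = 1 - (1 - f) ** k
        print(f"withheld {f:.0%}, {k} samples -> detect with p ≈ {p_detect:.6f}")
```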
Balanced propagation (multi-layer shreds / shards)
Avoid single-leader megablock broadcast. Use multi-layer dissemination (e.g., Turbine-style shreds and Reed-Solomon) to split, distribute and reconstruct blocks efficiently.
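A rough comparison of leader upload cost under naive broadcast versus shredded tree dissemination (block size, peer count, and coding overhead are assumed figures):

```python
# Why multi-layer dissemination removes the leader-upload bottleneck.
BLOCK_MB = 8              # block size (assumed)
PEERS = 1_000             # validators in the shard (assumed)
CODING_OVERHEAD = 2.0     # e.g. 1:1 Reed-Solomon parity shreds (assumed)

# Naive single-leader broadcast: the leader uploads the whole block to every peer.
naive_leader_upload_mb = BLOCK_MB * PEERS               # 8,000 MB per block

# Shredded tree dissemination: the leader sends each coded shred to exactly one
# first-layer node, so its upload is roughly block_size * coding_overhead, and
# every other node forwards only its own shreds to a bounded set of children.
shredded_leader_upload_mb = BLOCK_MB * CODING_OVERHEAD  # 16 MB per block

print(f"leader upload per block: naive {naive_leader_upload_mb} MB "
      f"vs shredded {shredded_leader_upload_mb} MB")
```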
7. How to add AI acceleration while maintaining decentralization
Many visions place powerful AI tightly coupled with on-chain systems. To add AI acceleration without centralization, follow a layered approach:
7.1 Training: Federated learning + secure aggregation
Use federated learning where model updates are computed locally by many participants and aggregated securely on-chain or by a committee using secure aggregation protocols (Bonawitz et al.). Secure aggregation prevents a central aggregator from seeing individual updates and scales with O(N log N) techniques (Turbo-Aggregate / other improvements). This keeps model training decentralized. Google Research arXiv
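To illustrate the pairwise-masking idea at the core of secure aggregation, here is a toy sketch; a real protocol (Bonawitz et al.) adds key agreement, dropout recovery via secret sharing, and finite-field arithmetic, none of which appears here.

```python
# Pairwise masking: clients i and j agree on a mask; i adds it, j subtracts it,
# so masks cancel in the sum and the aggregator only learns the aggregate update.
import random

N_CLIENTS, DIM = 5, 4
updates = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_CLIENTS)]

# masks[i][j] = -masks[j][i] for every client pair (i, j)
masks = [[None] * N_CLIENTS for _ in range(N_CLIENTS)]
for i in range(N_CLIENTS):
    for j in range(i + 1, N_CLIENTS):
        m = [random.uniform(-10, 10) for _ in range(DIM)]
        masks[i][j] = m
        masks[j][i] = [-x for x in m]

def masked_update(i):
    out = list(updates[i])
    for j in range(N_CLIENTS):
        if j != i:
            out = [a + b for a, b in zip(out, masks[i][j])]
    return out

# The aggregator sums masked updates; individual updates stay hidden.
agg = [sum(col) for col in zip(*(masked_update(i) for i in range(N_CLIENTS)))]
true_sum = [sum(col) for col in zip(*updates)]
print([round(a - b, 9) for a, b in zip(agg, true_sum)])   # ~[0, 0, 0, 0]
```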
7.2 Inference: decentralized inference marketplaces + verifiable compute
Real-time high-throughput inference can be offered by many inference providers (GPU nodes) that stake to provide correct service. Use verifiable computation primitives: providers return results together with succinct proofs (SNARK/STARK) or attestations from TEEs so that consumers can verify correctness cheaply. Hardware acceleration (GPUs / FPGAs / ASICs) powers the inference, while cryptographic proofs + economic staking provide decentralization incentives. Research on accelerating zk systems shows that GPU and hardware pipelines are practical for prover/verifier tasks. MIT CSAIL Wiley Online Library
7.3 Secure incentive & provenance on-chain
Record model commitments, training round hashes, and contributor staking on-chain. Use on-chain governance for model updates. Distribute rewards algorithmically for honest contributions. Use secure aggregation to preserve contributor privacy (Bonawitz). Google Research
7.4 Maintain non-centralized control of model weights
Avoid a single authoritative model owner: store model checkpoints as content-addressed artifacts in the DA layer. Updates occur by a consensus-proven aggregation round. Participants can fork, retrain, or contribute alternative models; the blockchain stores provenance and incentive metadata.
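A minimal sketch of what a content-addressed checkpoint commitment could look like; the field names and record shape are hypothetical, not an actual on-chain schema.

```python
# A checkpoint is identified by the hash of its serialized weights; the chain
# stores only the commitment plus round metadata, while the weights themselves
# live in the DA layer as a content-addressed artifact.
import hashlib
import json
import time

def content_address(weights_blob: bytes) -> str:
    return hashlib.sha256(weights_blob).hexdigest()

def provenance_record(weights_blob, round_id, contributors, parent_cid=None):
    return {
        "model_cid": content_address(weights_blob),   # pointer into the DA layer
        "round": round_id,                            # aggregation round number
        "parent_cid": parent_cid,                     # enables forks / retraining
        "contributors": sorted(contributors),         # addresses eligible for rewards
        "timestamp": int(time.time()),
    }

weights = b"\x00" * 1024                              # stand-in for serialized weights
record = provenance_record(weights, round_id=42, contributors=["0xabc", "0xdef"])
print(json.dumps(record, indent=2))
```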
8. Security & decentralization trade-offs (short)
Committee/Shard sizing: Smaller per-shard committees improve latency but reduce Byzantine robustness. Use rotating committees and randomized VRF selection to mitigate capture risk.
DA sampling vs full replication: Sampling reduces per-node cost but increases reliance on economic incentives & fraud proofs. Erasure coding + light sampling provides probabilistic guarantees. Celestia Docs
Hardware & centralized clouds: Relying on large datacenter-grade machines reduces node count and increases efficiency, but risks concentration; mitigate this by geographically dispersing sequencers and offering permissionless alternatives with slower performance tiers.
9. Experimental roadmap & benchmarks (what to measure)
Bench suite (per shard & global):
throughput: TPS achieved under steady state and under burst.
latency: 95th/99th/99.9th percentiles for commit/finality.
bandwidth: sustained per-node & per-sequencer.
CPU / memory: per-core utilization, per-machine heatmaps.
DA detection: probability of detecting data withholding under sampling.
failure modes: network partitions, sequential leader failure, equivocation.
Example microbenchmarks to replicate the model
Block-STM style execution tests to measure t_per_core on your VM and contract mix. arXiv
BLS aggregation verification per second using blst & pairings to estimate aggregate verification throughput. GitHub
Erasure coding / shreds throughput & latency experiments (Turbine style).
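As a starting point for the first microbenchmark above, here is a minimal single-threaded harness for estimating t_per_core; the apply_transfer workload is a stand-in and should be replaced with real VM execution over a representative contract mix.

```python
# Single-threaded on purpose, so the result is a per-core figure.
import time

def apply_transfer(state, i):
    # Stand-in for transaction execution: two balance updates.
    src, dst = f"acct{i % 1000}", f"acct{(i + 1) % 1000}"
    state[src] = state.get(src, 1_000) - 1
    state[dst] = state.get(dst, 1_000) + 1

def measure_tps(n_txs=1_000_000):
    state = {}
    start = time.perf_counter()
    for i in range(n_txs):
        apply_transfer(state, i)
    elapsed = time.perf_counter() - start
    return n_txs / elapsed

print(f"toy t_per_core ≈ {measure_tps():,.0f} TPS "
      "(interpreter overhead dominates; measure a compiled VM the same way)")
```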
10. Conclusions & practical takeaways
Theoretical possibility: With careful composition (thousands of sequenced partitions, aggressive streaming compression, BLS aggregation, Block-STM style parallel execution, erasure-coded DA, and compact propagation), 100M TPS is achievable in principle in datacenter-backed, well-provisioned deployments. The arithmetic above shows that execution and bandwidth are not the insurmountable part; they become practical with partitioning and compression.
Real-world caveats: Global decentralization/participation, cross-domain messaging, finality guarantees, and censorship resistance remain the true operational constraints. The bigger obstacles are synchronization and security trade-offs when pushing to these extremes.
AI acceleration can be integrated non-centrally via federated learning, secure aggregation, decentralized inference markets, and verifiable compute, combined with hardware acceleration for crypto and ML workloads.
References
Bullshark: A. Spiegelman et al., “Bullshark: DAG BFT Protocols Made Practical” (Bullshark / Narwhal analyses). arXiv
Block-STM: R. Gelashvili et al., “Block-STM: Scaling Blockchain Execution by Turning Ordering Curse to a Performance Blessing” (Aptos / Diem research). arXiv
BLS Signatures: IETF draft and practical references on BLS aggregation for consensus (Boneh–Lynn–Shacham families / BLS12-381). IETF, eth2book.info
Data Availability & Erasure Coding: Celestia docs and DA literature (namespaced Merkle trees + RS coding). Celestia Docs
Turbine / shreds-style propagation: Solana Turbine exposition & Reed-Solomon shreds method. Solana
Practical Secure Aggregation for Federated Learning: Bonawitz et al., Google research (secure aggregation). Google Research
Hardware acceleration for ZK and crypto kernels: recent papers showing GPU/FPGA acceleration for SNARK/NTT/KZG kernels. MIT CSAIL, Wiley Online Library