01 — The Problem

In an inference cluster, the GPU purchase price is roughly 15–25% of total cost of ownership over a 3-year depreciation cycle. The larger costs are power, cooling, networking, and operations, and choosing the wrong GPU generation (H100, H200, or B200) locks in those costs for 3+ years.

This article derives cost per million output tokens for a 70B-parameter dense model across H100 SXM5, H200 SXM5, and B200 SXM, using real bandwidth and power numbers rather than marketing claims.

02 — Hardware Specifications

The relevant numbers for inference, stripped of irrelevant marketing metrics:

| Spec | H100 SXM5 | H200 SXM5 | B200 SXM |
|---|---|---|---|
| HBM capacity | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e |
| HBM bandwidth | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s |
| FP8 TFLOPS | 3,958 | 3,958 | 9,000 |
| TDP | 700 W | 700 W | 1,000 W |
| NVLink bandwidth | 900 GB/s | 900 GB/s | 1,800 GB/s |
| List price (est.) | ~$30K | ~$40K | ~$60–70K |

Three observations:

  1. H200 and H100 have identical compute. H200’s entire value is bandwidth (+43%) and capacity (+76%).
  2. B200 raises bandwidth by a further 67% over H200 (2.4× H100) and adds FP8 compute headroom (2.3×), at the cost of 43% more power.
  3. Memory capacity determines which model sizes you can serve without tensor parallelism or offloading.
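
The ratios quoted in these observations follow directly from the spec table. A minimal sketch to reproduce them, assuming the table values as given; the `SPECS` layout and the `ratio` helper are illustrative names, not part of any vendor tool:

```python
# Spec values copied from the table above (capacity in GB, bandwidth in TB/s).
SPECS = {
    "H100": {"hbm_gb": 80, "bw_tbs": 3.35, "fp8_tflops": 3958, "tdp_w": 700},
    "H200": {"hbm_gb": 141, "bw_tbs": 4.8, "fp8_tflops": 3958, "tdp_w": 700},
    "B200": {"hbm_gb": 192, "bw_tbs": 8.0, "fp8_tflops": 9000, "tdp_w": 1000},
}


def ratio(new: str, old: str, key: str) -> float:
    """Return how many times larger SPECS[new][key] is than SPECS[old][key]."""
    return SPECS[new][key] / SPECS[old][key]


for key in ("hbm_gb", "bw_tbs", "fp8_tflops", "tdp_w"):
    print(
        f"{key:>10}: H200/H100 = {ratio('H200', 'H100', key):.2f}x, "
        f"B200/H200 = {ratio('B200', 'H200', key):.2f}x"
    )
```

Measured this way, B200's HBM bandwidth is about 1.67× H200 and about 2.39× H100, while its TDP is about 1.43× either Hopper part.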

03 — Model Fit and Parallelism Requirements

The first constraint is whether the model fits. Llama-3-70B in BF16 requires approximately 140 GB:

$$\text{params} \times \text{bytes/param} = 70 \times 10^9 \times 2\ \text{bytes} = 140\ \text{GB}$$

At FP8 (1 byte/param): 70 GB.

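| Model | BF16 | FP8 | H100 fit | H200 fit | B200 fit |
|---|---|---|---|---|---|
| 70B | 140 GB | 70 GB | 2 GPUs (TP=2) | 1 GPU | 1 GPU |
| 405B | 810 GB | 405 GB | 10 GPUs | 6 GPUs | 3 GPUs |
| 7B | 14 GB | 7 GB | 1 GPU | 1 GPU | 1 GPU |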
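
The weight footprint and a weights-only GPU count can be sketched as below. This is a lower bound under a loud assumption: it counts weights only and ignores KV cache, activations, and framework overhead, so practical deployments can need more GPUs than it returns. The function names are illustrative.

```python
import math

# HBM capacity per GPU in GB, from the spec table.
HBM_GB = {"H100": 80, "H200": 141, "B200": 192}


def weight_gb(params_billion: float, bytes_per_param: int) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes): params x bytes/param."""
    return params_billion * bytes_per_param


def min_gpus_for_weights(params_billion: float, bytes_per_param: int, gpu: str) -> int:
    """Smallest GPU count whose combined HBM holds the weights alone.

    Assumption: weights only. KV cache and activations are not counted.
    """
    return math.ceil(weight_gb(params_billion, bytes_per_param) / HBM_GB[gpu])


# 70B in BF16 (2 bytes/param) on each GPU type.
print(weight_gb(70, 2))
for gpu in HBM_GB:
    print(gpu, min_gpus_for_weights(70, 2, gpu))
```

For 70B in BF16 this gives 140 GB, which needs two H100s but fits on one H200 or one B200. The H200 case is tight: 140 GB of weights in 141 GB of HBM leaves almost nothing for KV cache, which is one reason FP8 is attractive for single-GPU serving.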

The single-GPU H200 vs two-GPU H100 tradeoff for 70B FP8 inference is material: removing TP=2 eliminates the NVLink all-reduce overhead at every transformer layer. NCCL all-reduce for 70B at TP=2 costs approximately 2–4 ms per forward pass on NVLink — roughly 5–10% of total decode time for a single token.

04 — Bandwidth-Bound Throughput Model

For decode (the production inference bottleneck), throughput is:

$$\text{tokens/s/GPU} = \frac{BW_{\text{HBM}}}{\text{model bytes}} \times \text{batch\_size}$$

At batch size 1 (single user, latency-optimized):

| GPU | BW | 70B FP8 tokens/s | 70B BF16 tokens/s |
|---|---|---|---|
| H100 | 3.35 TB/s | 47.9 t/s | 23.9 t/s |
| H200 | 4.8 TB/s | 68.6 t/s | 34.3 t/s |
| B200 | 8.0 TB/s | 114 t/s | 57 t/s |

These are theoretical ceilings assuming weights are the only HBM traffic. In production, KV cache adds 10–30% memory traffic for typical sequence lengths, bringing effective throughput down to roughly 70–80% of these numbers. The GPU memory hierarchy article derives the bandwidth analysis from first principles.
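
The bandwidth-bound ceiling and its derated value can be computed as follows. This is a sketch under the stated assumption that weights are the only HBM traffic; `efficiency` is a hypothetical parameter standing in for the 70–80% derating described above, not a measured quantity.

```python
# HBM bandwidth in TB/s, from the spec table.
BW_TBS = {"H100": 3.35, "H200": 4.8, "B200": 8.0}


def decode_tokens_per_s(
    bw_tbs: float, model_gb: float, batch_size: int = 1, efficiency: float = 1.0
) -> float:
    """Bandwidth-bound decode rate: every token step reads all weights once.

    efficiency < 1.0 is an assumed derating for KV cache and other traffic.
    """
    return bw_tbs * 1e12 / (model_gb * 1e9) * batch_size * efficiency


for gpu, bw in BW_TBS.items():
    ceiling = decode_tokens_per_s(bw, model_gb=70)            # 70B FP8
    derated = decode_tokens_per_s(bw, model_gb=70, efficiency=0.8)
    print(f"{gpu}: ceiling {ceiling:.1f} t/s, at 80% {derated:.1f} t/s")
```

The linear `batch_size` scaling holds only while weight reads dominate memory traffic; at large batches or long contexts, KV cache reads grow with the batch and the simple model overestimates throughput.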

05 — Power and TCO Derivation

The cost model over a 36-month depreciation:

$$\text{TCO} = C_{\text{HW}} + C_{\text{power}} + C_{\text{ops}}$$

Assumptions: PUE = 1.5, power cost $0.08/kWh, each GPU drawing full TDP for all 36 months (36 × 30 × 24 = 25,920 hours), ops = 15% of HW cost/year:

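| Cost component | H100 | H200 | B200 |
|---|---|---|---|
| HW cost | $30,000 | $40,000 | $65,000 |
| Power (36 mo, 700 W / 1,000 W) | $2,177 | $2,177 | $3,110 |
| Ops (15%/yr × 3 yr) | $13,500 | $18,000 | $29,250 |
| Total TCO | $45,677 | $60,177 | $97,360 |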
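
The table can be reproduced with a few lines of arithmetic. In this sketch the power row corresponds to a PUE of 1.5 with the GPU at full TDP for 36 × 30 × 24 hours; `pue` and `load` are left as parameters because the result is sensitive to them (a PUE of 1.3 at 90% load would give about $1,698 for a 700 W part instead). Function and parameter names are illustrative.

```python
HOURS = 36 * 30 * 24  # 36 months of 30 days = 25,920 hours


def power_cost(
    tdp_w: float, pue: float = 1.5, usd_per_kwh: float = 0.08, load: float = 1.0
) -> float:
    """Electricity cost in USD: GPU draw scaled by PUE and average load."""
    return tdp_w / 1000 * pue * load * HOURS * usd_per_kwh


def tco(hw_usd: float, tdp_w: float, ops_rate: float = 0.15, years: int = 3) -> float:
    """TCO = C_HW + C_power + C_ops, with ops as a yearly fraction of HW cost."""
    return hw_usd + power_cost(tdp_w) + hw_usd * ops_rate * years


for name, hw, tdp in [("H100", 30_000, 700), ("H200", 40_000, 700), ("B200", 65_000, 1000)]:
    print(f"{name}: power ${power_cost(tdp):,.0f}, TCO ${tco(hw, tdp):,.0f}")
```

Under these inputs power is a small share of the per-GPU total, so the comparison is driven mostly by hardware price and the ops percentage tied to it.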

Cost per million tokens at 70B FP8, 80% utilization, 512 output tokens/request:

$$\text{cost/M tokens} = \frac{\text{TCO}}{36 \times 30 \times 24 \times 3600 \times \text{tokens/s} \times 0.8 / 10^6}$$

| GPU | Tokens/s (eff.) | Cost/M tokens |
|---|---|---|
| H100 | 38 | $1.27 |
| H200 | 55 | $0.93 |
| B200 | 91 | $0.91 |
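
The cost formula can be evaluated directly from the TCO and effective tokens/s figures. In this sketch `batch_size` is a hypothetical extra factor that does not appear in the formula. It is included because the single-stream rates (38, 55, 91 tokens/s) yield roughly $14–16 per million tokens, whereas per-GPU costs around $1 correspond to something like 16 concurrent sequences. That batch size is an assumption made here for illustration only.

```python
SECONDS = 36 * 30 * 24 * 3600  # 36 months of 30 days = 93,312,000 s


def cost_per_m_tokens(
    tco_usd: float, tokens_per_s: float, utilization: float = 0.8, batch_size: int = 1
) -> float:
    """USD per million output tokens over the 36-month depreciation window.

    batch_size is an assumed multiplier on the per-stream token rate.
    """
    tokens_millions = SECONDS * tokens_per_s * batch_size * utilization / 1e6
    return tco_usd / tokens_millions


ROWS = [("H100", 45_677, 38), ("H200", 60_177, 55), ("B200", 97_360, 91)]
for name, tco_usd, tps in ROWS:
    single = cost_per_m_tokens(tco_usd, tps)
    batched = cost_per_m_tokens(tco_usd, tps, batch_size=16)
    print(f"{name}: ${single:.2f}/M at batch 1, ${batched:.2f}/M at batch 16")
```

The ranking is the same at either batch size: H100 is the most expensive per token, and H200 and B200 land within a few percent of each other.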

H200 beats H100 on cost/token despite its higher HW cost because the bandwidth improvement translates directly into throughput while TDP, and therefore the power bill, stays at 700 W. B200 is nearly identical to H200 at 70B scale because the extra FLOPs are wasted: decode is memory-bound, not compute-bound. B200’s advantage emerges at larger batch sizes, with speculative decoding drafts, or for models with very long prefill sequences where FP8 compute matters.
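
A rough roofline check illustrates why decode is memory-bound at small batch sizes. The sketch assumes the peak spec-sheet FP8 and bandwidth figures from the spec table (real kernels reach neither), weights-only HBM traffic, and the common approximation of 2 FLOPs per parameter per generated token. The function names are illustrative.

```python
# (peak FP8 TFLOPS, HBM bandwidth in TB/s) from the spec table.
GPUS = {"H100": (3958, 3.35), "H200": (3958, 4.8), "B200": (9000, 8.0)}


def ridge_flop_per_byte(fp8_tflops: float, bw_tbs: float) -> float:
    """Arithmetic intensity at which peak compute and peak bandwidth balance."""
    return fp8_tflops / bw_tbs  # TFLOP/s divided by TB/s gives FLOP/byte


def crossover_batch(
    fp8_tflops: float, bw_tbs: float, bytes_per_param: int = 1, flops_per_param: int = 2
) -> float:
    """Batch size where decode would stop being bandwidth-bound.

    Decode intensity is flops_per_param * batch / bytes_per_param FLOP/byte,
    so the crossover is where that equals the ridge point.
    """
    return ridge_flop_per_byte(fp8_tflops, bw_tbs) * bytes_per_param / flops_per_param


for name, (tflops, bw) in GPUS.items():
    print(f"{name}: ridge {ridge_flop_per_byte(tflops, bw):.0f} FLOP/byte, "
          f"crossover batch ~{crossover_batch(tflops, bw):.0f}")
```

Under these idealized assumptions the crossover sits at a batch size in the hundreds for all three GPUs, far above single-stream decode, so extra FP8 compute only pays off once batches are large or the workload is prefill-heavy.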

For distributed training, where the inter-node interconnect determines scaling efficiency, see the distributed interconnects deep dive.

BibTeX

@article{fp4-2606003,
  title   = {H100 vs H200 vs B200: TCO for Inference Infrastructure},
  author  = {fp4 editorial desk},
  year    = {2026},
  url     = {https://fp4.dev/silicon/h100-h200-b200-tco/},
  journal = {fp4}
}