Pay for Joules, Not Tokens: Energy-Based Inference Pricing Explained
NeuralWatt bills AI inference by actual GPU energy ($5/kWh) instead of per-token — and meters every request's watt-hours and CO₂. How it works, why MoE models make it up to 95% cheaper, the carbon angle, and where the math breaks down.
Every inference API you’ve used bills the same way: per token. It’s a clean abstraction — but it’s also a fiction. A token is not a unit of cost. The actual thing a provider pays for is GPU time and the electricity that feeds it, and that varies wildly with model architecture, prompt length, batching, and quantization. Per-token pricing papers over all of it with a flat rate and a margin.
NeuralWatt is the first provider we’ve cataloged that bills the underlying resource directly: energy. You pay a flat $5.00/kWh for the GPU energy your request actually consumed, and every response hands back the measured watt-hours — and grams of CO₂ — for that call. It’s a genuinely different model, and worth understanding even if you never switch to it.
How energy-based pricing actually works
This isn’t a marketing reframe of token pricing. NeuralWatt measures real hardware draw and attributes it per request:
- Hardware-level metering. Energy is read from NVIDIA’s NVML interface: actual GPU power counters, sampled multiple times per second, not an estimate. Per request, they capture the start counter, track draw through generation, and compute the joule delta.
- Fair-share attribution. Your request rarely has the GPU to itself, so your energy is your share of the server’s measured draw, weighted by your tokens in the in-flight pool:

  your_energy = server_energy × min(token_pool_ratio, attribution_cap)

  The cap (≈25% on single-GPU servers, 7–10% on dense multi-GPU boxes) stops a request that hits a quiet server from being billed for the whole idle-plus-generation draw, which is mostly fixed GPU overhead, not your work. It bounds the worst case so per-request cost stays predictable.
- Billed in kWh. Energy (kWh) = (end − start counter delta, in joules) / 3,600,000. The API even tells you which method produced the number (counter_exact_token_pool_weighted_multi_gpu_8, etc.) for auditability.
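To make the attribution and conversion steps above concrete, here is a minimal Python sketch. It is not NeuralWatt's implementation: the function names, the 216 J server reading, and the token counts are hypothetical, and only the formula, the 3,600,000 J/kWh divisor, the 25% single-GPU cap, and the $5/kWh rate come from the text.

```python
JOULES_PER_KWH = 3_600_000   # 1 kWh = 3.6 million joules
PRICE_PER_KWH_USD = 5.00     # PAYG rate quoted in this article


def attributed_energy_joules(server_joules, request_tokens, pool_tokens, attribution_cap):
    """Energy attributed to one request: server energy times the capped token-pool ratio."""
    if pool_tokens <= 0 or request_tokens < 0:
        raise ValueError("pool_tokens must be positive and request_tokens non-negative")
    token_pool_ratio = request_tokens / pool_tokens
    return server_joules * min(token_pool_ratio, attribution_cap)


def joules_to_kwh(joules):
    """Convert a joule delta to kilowatt-hours."""
    return joules / JOULES_PER_KWH


def energy_cost_usd(joules, price_per_kwh=PRICE_PER_KWH_USD):
    """Price an attributed joule delta at a flat per-kWh rate."""
    return joules_to_kwh(joules) * price_per_kwh


# Hypothetical quiet single-GPU server: the request holds every in-flight token
# (ratio 1.0), so the 25% cap applies instead of the full ratio.
solo_joules = attributed_energy_joules(
    216.0, request_tokens=500, pool_tokens=500, attribution_cap=0.25
)

# Hypothetical busy server: the request holds 10% of in-flight tokens, below the cap.
shared_joules = attributed_energy_joules(
    216.0, request_tokens=50, pool_tokens=500, attribution_cap=0.25
)

print(solo_joules, joules_to_kwh(solo_joules), energy_cost_usd(solo_joules))
print(shared_joules)
```

With these made-up inputs, the capped case works out to 54 J, or 0.000015 kWh, which is the same order of magnitude as the sample response in this section. The point of the sketch is the `min(...)`: on a quiet server the cap, not the token ratio, decides what you pay.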
The response object is the product:
"energy": { "energy_kwh": 0.000015, "energy_joules": 54.0, "avg_power_watts": 45.2,
"attribution_method": "counter_exact_token_pool_weighted", "attribution_ratio": 0.25 }
The killer insight: MoE makes this cheap
Here’s why energy pricing isn’t just a gimmick. Token pricing charges you the same per token regardless of how much compute that token took. But a Mixture-of-Experts model activates only a fraction of its parameters per token — so a 397B-parameter MoE can burn less energy per request than a 20B dense model while being far more capable.
Under per-token pricing you can’t capture that efficiency. Under energy pricing you pay for it directly. NeuralWatt’s own numbers (at the $5/kWh PAYG rate):
| Model | Energy cost / Mtok | Token market rate / Mtok | Savings |
|---|---|---|---|
| Qwen3.5-397B (MoE) | $0.14 | $2.34 | 94% |
| Qwen3.5-35B (MoE) | $0.05 | $0.97 | 95% |
| Devstral Small 2 | $0.01 | $0.22 | 95% |
That’s the whole pitch: for efficient MoE architectures, paying for electricity instead of tokens is an order of magnitude cheaper. Subscriptions (billed in kWh — $20/6kWh, $50/16kWh, $100/33kWh) push the effective rate to ~$3/kWh, cheaper still.
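The savings column and the subscription rates can be re-derived from the figures quoted above. The following Python sketch does only that arithmetic; the function and variable names are illustrative, and the prices are copied from this article rather than fetched from any API.

```python
def savings_pct(energy_cost_per_mtok, token_rate_per_mtok):
    """Percentage saved by paying the energy cost instead of the token market rate."""
    return (1 - energy_cost_per_mtok / token_rate_per_mtok) * 100


def effective_rate_usd_per_kwh(plan_price_usd, plan_kwh):
    """Effective per-kWh rate of a subscription that bundles plan_kwh for plan_price_usd."""
    return plan_price_usd / plan_kwh


# (energy cost per Mtok, token market rate per Mtok), as listed in the table above
TABLE = {
    "Qwen3.5-397B (MoE)": (0.14, 2.34),
    "Qwen3.5-35B (MoE)": (0.05, 0.97),
    "Devstral Small 2": (0.01, 0.22),
}

# Subscription price in USD -> included kWh, as quoted above
PLANS = {20: 6, 50: 16, 100: 33}

for model, (energy_cost, token_rate) in TABLE.items():
    print(f"{model}: {savings_pct(energy_cost, token_rate):.1f}% savings")

for price, kwh in PLANS.items():
    print(f"${price} plan: ${effective_rate_usd_per_kwh(price, kwh):.2f}/kWh")
```

Rounded to whole percentages, the results match the table, and the three plans land between roughly $3.03 and $3.33 per kWh, consistent with the "~$3/kWh" figure.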
NeuralWatt’s catalog is, tellingly, all efficient open-weight models — GLM-5.2, Kimi K2.6, Kimi K2.7 Code, Qwen3.5-397B, Qwen3.6-35B — the same Chinese open-weight models that already win on cost. Energy pricing compounds that advantage.
The carbon dimension
Because they’re already measuring joules, NeuralWatt also reports per-request CO₂: carbon = energy_kwh × grid_carbon_intensity, with live grid intensity from Electricity Maps and location-based accounting (the real grid where the GPU runs, not REC paper offsets). A request on French nuclear (~20–30 gCO₂/kWh) emits more than 10× less than the same request on the Carolinas grid (~350–450 gCO₂/kWh). For anyone doing sustainability reporting on AI workloads, per-request, auditable emissions data is something no token-priced API gives you.
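As a worked instance of the carbon formula, the sketch below applies it to a 0.000015 kWh request on two grids. The intensities are assumptions: they are the midpoints of the approximate ranges quoted above, hard-coded for illustration, not live Electricity Maps readings, and the names are hypothetical.

```python
def request_carbon_grams(energy_kwh, grid_intensity_g_per_kwh):
    """carbon = energy_kwh x grid_carbon_intensity, in grams of CO2."""
    return energy_kwh * grid_intensity_g_per_kwh


REQUEST_KWH = 0.000015          # energy of one example request
FRANCE_G_PER_KWH = 25.0         # assumed midpoint of the ~20-30 range
CAROLINAS_G_PER_KWH = 400.0     # assumed midpoint of the ~350-450 range

france_g = request_carbon_grams(REQUEST_KWH, FRANCE_G_PER_KWH)
carolinas_g = request_carbon_grams(REQUEST_KWH, CAROLINAS_G_PER_KWH)

print(f"France:    {france_g:.6f} g CO2")
print(f"Carolinas: {carolinas_g:.6f} g CO2")
print(f"Ratio:     {carolinas_g / france_g:.1f}x")
```

Because energy is the same in both cases, the emissions ratio is simply the ratio of grid intensities, so the exact multiple depends on where in each range the grid sits at the moment of the request.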
Where the math breaks down
The “up to 95% cheaper” headline is real for efficient MoE models — but NeuralWatt’s own docs are refreshingly honest that it’s not universal, and the Hacker News thread surfaced the rest:
- Frontier models narrow the gap. Large/dense models (GLM-5, MiniMax M2.5) activate more parameters and run on more GPUs. At solo PAYG load, energy pricing can approach token-market parity — the savings come from concurrency (amortizing fixed overhead) and subscription rates, not magic.
- Cache accounting is contested. One commenter argued the savings comparison assumes every input token is a cache miss; a heavy user countered with real data (1.1B tokens, 97% cached, ~$18/month). Both can be true — your mileage depends heavily on your cache-hit rate.
- The kWh→token mental model is hard. “$5/kWh” is not intuitively comparable to “$0.69/Mtok.” You have to trust (or measure) the energy-per-request figures, which vary with load. NeuralWatt publishes 7-day rolling real-traffic averages per model, which helps — but it’s a new muscle to build.
- Operational maturity. Early users reported rate-limit bugs and API-key hiccups — expected for a young provider.
This is why NeuralWatt also offers standard per-token pricing on every model (the rates on our provider page): you can migrate without changing billing integrations, and still get the energy/carbon numbers for free.
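One way to build the kWh-to-token intuition mentioned in the list above is to convert in both directions. The Python sketch below is a rough aid, assuming energy per million tokens is a stable average (the article notes it varies with load); the function names are hypothetical and the dollar figures are the Qwen3.5-397B row from the table earlier in the article.

```python
PRICE_PER_KWH_USD = 5.00  # PAYG rate quoted in this article


def usd_per_mtok(kwh_per_mtok, price_per_kwh=PRICE_PER_KWH_USD):
    """Energy-priced cost of one million tokens, given average kWh per million tokens."""
    return kwh_per_mtok * price_per_kwh


def kwh_per_mtok(cost_usd_per_mtok, price_per_kwh=PRICE_PER_KWH_USD):
    """Inverse conversion: the kWh per million tokens implied by a per-Mtok dollar figure."""
    return cost_usd_per_mtok / price_per_kwh


# Energy use implied by the quoted $0.14/Mtok energy cost
implied = kwh_per_mtok(0.14)

# Energy use at which energy pricing would equal the quoted $2.34/Mtok token rate
break_even = kwh_per_mtok(2.34)

print(f"Implied energy: {implied * 1000:.0f} Wh per Mtok")
print(f"Break-even:     {break_even * 1000:.0f} Wh per Mtok")
print(f"Headroom:       {break_even / implied:.1f}x")
```

Under these assumptions, the quoted energy cost implies about 28 Wh per million tokens, and energy use would need to rise to about 468 Wh per million tokens before the two billing models cost the same. Plugging in your own measured energy-per-request figures is the more reliable check.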
Does it matter beyond NeuralWatt?
Even if you stay on per-token APIs, the idea is worth internalizing. Per-token pricing is a margin product: the provider absorbs the variance between what your request cost them and what they charge. As MoE and quantization make models dramatically more efficient, that gap widens — and someone is capturing it. Energy pricing is the first attempt to hand that efficiency back to the buyer and make the real cost legible.
It won’t replace per-token billing soon — the abstraction is too convenient and too entrenched. But the trend it rides is the same one driving the whole market: inference is commoditizing toward its physical floor — GPU-seconds and electricity. A model that prices in joules is just being honest about where things are heading. For cost-sensitive, high-volume workloads on efficient open models, it’s worth a real benchmark on your own traffic.
Compare NeuralWatt’s per-token rates against every other provider on the NeuralWatt provider page, or browse the cheapest LLM APIs. Methodology and pricing per NeuralWatt’s energy documentation, as of June 2026.