by Inference Hub

Pay for Joules, Not Tokens: Energy-Based Inference Pricing Explained

NeuralWatt bills AI inference by actual GPU energy ($5/kWh) instead of per token, and meters every request's watt-hours and CO₂. How it works, why it can be up to 95% cheaper for MoE models, the carbon angle, and where the math breaks down.

Tags: pricing, energy, inference, moe, sustainability, neuralwatt

Every inference API you’ve used bills the same way: per token. It’s a clean abstraction — but it’s also a fiction. A token is not a unit of cost. The actual thing a provider pays for is GPU time and the electricity that feeds it, and that varies wildly with model architecture, prompt length, batching, and quantization. Per-token pricing papers over all of it with a flat rate and a margin.

NeuralWatt is the first provider we’ve cataloged that bills the underlying resource directly: energy. You pay a flat $5.00/kWh for the GPU energy your request actually consumed, and every response hands back the measured watt-hours — and grams of CO₂ — for that call. It’s a genuinely different model, and worth understanding even if you never switch to it.

How energy-based pricing actually works

This isn’t a marketing reframe of token pricing. NeuralWatt measures real hardware draw and attributes it per request:

  1. Hardware-level metering. Energy is read from NVIDIA’s NVML interface — actual GPU power counters, sampled multiple times per second, not an estimate. Per request, they capture the start counter, track draw through generation, and compute the Joule delta.

  2. Fair-share attribution. Your request rarely has the GPU to itself. So your energy is your share of the server’s measured draw, weighted by your tokens in the in-flight pool:

    your_energy = server_energy × min(token_pool_ratio, attribution_cap)

    The cap (≈25% on single-GPU servers, 7–10% on dense multi-GPU boxes) stops a request that hits a quiet server from being billed for the whole idle-plus-generation draw — which is mostly fixed GPU overhead, not your work. It bounds the worst case so per-request cost stays predictable.

  3. Billed in kWh. Energy (kWh) = (end − start counters, in joules) / 3,600,000, since 1 kWh = 3,600,000 J. The API even tells you which method produced the number (counter_exact_token_pool_weighted_multi_gpu_8, etc.) for auditability.
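The three steps above can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not NeuralWatt's implementation: the function names, the counter readings, and the token counts are all hypothetical, and the counters are assumed to already be in joules.

```python
JOULES_PER_KWH = 3_600_000   # 1 kWh = 3.6 MJ
PRICE_PER_KWH_USD = 5.00     # PAYG rate quoted in the article


def attributed_energy_joules(start_counter_j, end_counter_j,
                             request_tokens, pool_tokens, attribution_cap):
    """Return this request's share of the server's measured energy, in joules."""
    server_energy_j = end_counter_j - start_counter_j
    token_pool_ratio = request_tokens / pool_tokens
    # your_energy = server_energy x min(token_pool_ratio, attribution_cap)
    return server_energy_j * min(token_pool_ratio, attribution_cap)


def joules_to_kwh(joules):
    return joules / JOULES_PER_KWH


# Hypothetical request that is alone on a quiet single-GPU server:
# its pool ratio is 1.0, so the 25% cap is what limits the bill.
energy_j = attributed_energy_joules(
    start_counter_j=1_000.0, end_counter_j=1_216.0,
    request_tokens=800, pool_tokens=800, attribution_cap=0.25)
energy_kwh = joules_to_kwh(energy_j)
cost_usd = energy_kwh * PRICE_PER_KWH_USD
print(f"{energy_j:.1f} J = {energy_kwh:.6f} kWh -> ${cost_usd:.6f}")
```

With these made-up readings the server drew 216 J during the request, of which 54 J is attributed to it. Without the cap, the same request would have been billed for all 216 J.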

The response object is the product:

"energy": { "energy_kwh": 0.000015, "energy_joules": 54.0, "avg_power_watts": 45.2,
            "attribution_method": "counter_exact_token_pool_weighted", "attribution_ratio": 0.25 }

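To turn that object into a dollar figure, a client only needs the energy_kwh field and the rate. The sketch below parses the sample fields shown above, cross-checks that joules and kWh agree, and prices the call at the $5/kWh PAYG rate. The outer braces and the summarize_energy helper are assumptions added for illustration; the shape of the rest of the response is not documented here.

```python
import json

PRICE_PER_KWH_USD = 5.00  # PAYG rate quoted in the article

# The "energy" object from the article, wrapped in braces to form valid JSON.
# The rest of the response shape is an assumption.
raw = '''
{"energy": {"energy_kwh": 0.000015, "energy_joules": 54.0, "avg_power_watts": 45.2,
            "attribution_method": "counter_exact_token_pool_weighted",
            "attribution_ratio": 0.25}}
'''


def summarize_energy(response_text):
    """Price one response and sanity-check its energy fields."""
    energy = json.loads(response_text)["energy"]
    kwh_from_joules = energy["energy_joules"] / 3_600_000
    return {
        "cost_usd": energy["energy_kwh"] * PRICE_PER_KWH_USD,
        "fields_consistent": abs(kwh_from_joules - energy["energy_kwh"]) < 1e-9,
        "method": energy["attribution_method"],
    }


summary = summarize_energy(raw)
print(summary)
```

For the sample values, 54 J is exactly 0.000015 kWh, so the call costs $0.000075 at the PAYG rate.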
The killer insight: MoE makes this cheap

Here’s why energy pricing isn’t just a gimmick. Token pricing charges you the same per token regardless of how much compute that token took. But a Mixture-of-Experts model activates only a fraction of its parameters per token — so a 397B-parameter MoE can burn less energy per request than a 20B dense model while being far more capable.

Under per-token pricing you can’t capture that efficiency. Under energy pricing you pay for it directly. NeuralWatt’s own numbers (at the $5/kWh PAYG rate):

Model                 Energy cost / Mtok    Token market rate / Mtok    Savings
Qwen3.5-397B (MoE)    $0.14                 $2.34                       94%
Qwen3.5-35B (MoE)     $0.05                 $0.97                       95%
Devstral Small 2      $0.01                 $0.22                       95%

That’s the whole pitch: for efficient MoE architectures, paying for electricity instead of tokens is an order of magnitude cheaper. Subscriptions are billed in kWh ($20 for 6 kWh, $50 for 16 kWh, $100 for 33 kWh), which works out to about $3.33, $3.13, and $3.03 per kWh respectively, cheaper still than the $5 PAYG rate.
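The savings column and the subscription rates are plain arithmetic, and it is worth reproducing them before trusting any headline number. The sketch below uses only figures quoted in this article; the helper names are hypothetical, and the implied watt-hours per Mtok is a back-calculation from the rounded dollar figures, not a published measurement.

```python
PAYG_PRICE_PER_KWH = 5.00

# (model, energy cost $/Mtok, token market rate $/Mtok) from the table above
rows = [
    ("Qwen3.5-397B (MoE)", 0.14, 2.34),
    ("Qwen3.5-35B (MoE)", 0.05, 0.97),
    ("Devstral Small 2", 0.01, 0.22),
]


def savings_pct(energy_cost, token_rate):
    """Percentage saved by paying the energy cost instead of the token rate."""
    return (1 - energy_cost / token_rate) * 100


def implied_wh_per_mtok(energy_cost, price_per_kwh=PAYG_PRICE_PER_KWH):
    """Watt-hours per million tokens implied by a $/Mtok energy cost."""
    return energy_cost / price_per_kwh * 1000


for model, energy_cost, token_rate in rows:
    print(f"{model}: {savings_pct(energy_cost, token_rate):.1f}% saved, "
          f"~{implied_wh_per_mtok(energy_cost):.0f} Wh per Mtok")

# Subscription tiers as (price in $, included kWh)
tiers = [(20, 6), (50, 16), (100, 33)]
effective_rates = [price / kwh for price, kwh in tiers]
print([f"${rate:.2f}/kWh" for rate in effective_rates])
```

Because the energy costs are rounded to the cent, the smallest row is the least precise: a $0.01 figure could hide anything from roughly $0.005 to $0.015, which moves its savings by a couple of percentage points.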

NeuralWatt’s catalog is, tellingly, all efficient open-weight models — GLM-5.2, Kimi K2.6, Kimi K2.7 Code, Qwen3.5-397B, Qwen3.6-35B — the same Chinese open-weight models that already win on cost. Energy pricing compounds that advantage.

The carbon dimension

Because they’re already measuring joules, NeuralWatt also reports per-request CO₂: carbon = energy_kwh × grid_carbon_intensity, with live grid intensity from Electricity Maps and location-based accounting (the real grid where the GPU runs, not REC paper offsets). A request on French nuclear (~20–30 gCO₂/kWh) emits more than 10× less than the same request on the Carolinas grid (~350–450 gCO₂/kWh). For anyone doing sustainability reporting on AI workloads, per-request, auditable emissions data is something no token-priced API gives you.
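As a worked example of the carbon formula, the sketch below applies it to the 0.000015 kWh sample request from earlier, using the midpoints of the two intensity ranges quoted above. The midpoints and the carbon_grams helper are illustrative assumptions; a real integration would use the live intensity value reported for the request rather than a constant.

```python
def carbon_grams(energy_kwh, grid_intensity_g_per_kwh):
    """carbon = energy_kwh x grid_carbon_intensity, in grams of CO2."""
    return energy_kwh * grid_intensity_g_per_kwh


ENERGY_KWH = 0.000015  # the sample request from the response object

# Midpoints of the ranges quoted in the article (gCO2/kWh), for illustration only.
FRANCE_INTENSITY = 25
CAROLINAS_INTENSITY = 400

france_g = carbon_grams(ENERGY_KWH, FRANCE_INTENSITY)
carolinas_g = carbon_grams(ENERGY_KWH, CAROLINAS_INTENSITY)
print(f"France:    {france_g:.6f} g CO2")
print(f"Carolinas: {carolinas_g:.6f} g CO2")
print(f"Ratio:     {carolinas_g / france_g:.0f}x")
```

At the midpoints the same request emits 16× more on the Carolinas grid. Per-request figures this small only become meaningful once they are summed over a workload.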

Where the math breaks down

The “up to 95% cheaper” headline is real for efficient MoE models — but NeuralWatt’s own docs are refreshingly honest that it’s not universal, and the Hacker News thread surfaced the rest:

  • Frontier models narrow the gap. Large/dense models (GLM-5, MiniMax M2.5) activate more parameters and run on more GPUs. At solo PAYG load, energy pricing can approach token-market parity — the savings come from concurrency (amortizing fixed overhead) and subscription rates, not magic.
  • Cache accounting is contested. One commenter argued the savings comparison assumes every input token is a cache miss; a heavy user countered with real data (1.1B tokens, 97% cached, ~$18/month). Both can be true — your mileage depends heavily on your cache-hit rate.
  • The kWh→token mental model is hard. “$5/kWh” is not intuitively comparable to “$0.69/Mtok.” You have to trust (or measure) the energy-per-request figures, which vary with load. NeuralWatt publishes 7-day rolling real-traffic averages per model, which helps — but it’s a new muscle to build.
  • Operational maturity. Early users reported rate-limit bugs and API-key hiccups — expected for a young provider.
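The kWh-to-token comparison in the list above gets easier once it is reduced to a break-even figure: the energy per million tokens at which both billing models cost the same. A minimal sketch, with hypothetical helper names, the $0.69/Mtok example rate from above, and a made-up measured figure standing in for your own traffic:

```python
def breakeven_wh_per_mtok(token_price_per_mtok, price_per_kwh):
    """Watt-hours per million tokens at which energy billing equals token billing."""
    return token_price_per_mtok / price_per_kwh * 1000


def energy_cost_per_mtok(wh_per_mtok, price_per_kwh):
    """Dollar cost per million tokens for a measured energy figure."""
    return wh_per_mtok / 1000 * price_per_kwh


TOKEN_PRICE = 0.69  # $/Mtok, the example rate quoted in the article
for label, rate in [("PAYG", 5.00), ("subscription (approx.)", 3.00)]:
    limit = breakeven_wh_per_mtok(TOKEN_PRICE, rate)
    print(f"{label}: energy billing wins below ~{limit:.0f} Wh per Mtok")

# Hypothetical measured figure for your own traffic, not a NeuralWatt number.
measured_wh_per_mtok = 50.0
print(f"${energy_cost_per_mtok(measured_wh_per_mtok, 5.00):.2f} per Mtok at PAYG")
```

At $5/kWh, a $0.69/Mtok token price breaks even at 138 Wh per million tokens. If your measured energy per Mtok sits well below that under your real load and cache-hit rate, energy billing is cheaper; if it sits near or above it, it is not.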

This is why NeuralWatt also offers standard per-token pricing on every model (the rates on our provider page): you can migrate without changing billing integrations, and still get the energy and carbon numbers for free.

Does it matter beyond NeuralWatt?

Even if you stay on per-token APIs, the idea is worth internalizing. Per-token pricing is a margin product: the provider absorbs the variance between what your request cost them and what they charge. As MoE and quantization make models dramatically more efficient, that gap widens — and someone is capturing it. Energy pricing is the first attempt to hand that efficiency back to the buyer and make the real cost legible.

It won’t replace per-token billing soon: the abstraction is too convenient and too entrenched. But the trend it rides is the same one driving the whole market, as inference commoditizes toward its physical floor of GPU-seconds and electricity. A pricing model denominated in joules is just being honest about where things are heading. For cost-sensitive, high-volume workloads on efficient open models, it’s worth a real benchmark on your own traffic.


Compare NeuralWatt’s per-token rates against every other provider on the NeuralWatt provider page, or browse the cheapest LLM APIs. Methodology and pricing per NeuralWatt’s energy documentation, as of June 2026.