Edge Inference Latency Budget Planner

Split an edge inference SLA into capture, transport, compute, and post-processing budgets so you know exactly how many milliseconds remain for your model—even after safety margins.

Inputs

  • End-to-end SLA (ms): hard latency target from sensor to response.
  • Capture time (ms): frame exposure, buffering, or audio windowing time before inference.
  • Network time (ms): round-trip or uplink time budget between edge and host.
  • Post-processing time (ms): rendering, business logic, or safety checks after inference.
  • Safety margin (%): optional; defaults to 15% of the SLA. Reserves extra room for jitter, cold starts, and GC pauses.
  • Jitter allowance (ms): optional; defaults to 0 ms. Buffer for unpredictable spikes or handoffs.

Treat the output as guidance—validate against production traces and SLA monitoring before committing to launch.
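The planner's arithmetic can be sketched in a few lines. This is a minimal sketch, assuming the safety margin is taken as a percentage of the full SLA (consistent with the worked examples); the function name is ours, not the tool's:

```python
def inference_budget(sla_ms, capture_ms, network_ms, post_ms,
                     safety_pct=0.15, jitter_ms=0.0):
    """Return (model_budget_ms, reserved_ms, slack_ms)."""
    reserved = sla_ms * safety_pct + jitter_ms          # safety + jitter block
    budget = sla_ms - capture_ms - network_ms - post_ms - reserved
    slack = sla_ms - capture_ms - network_ms - post_ms - budget
    return budget, reserved, slack

budget, reserved, slack = inference_budget(80, 25, 10, 8)
print(budget, reserved, slack)  # 25.0 12.0 12.0
```

Because the model budget consumes everything left after the reserved block, the slack after the model equals the reserved overhead by construction.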

Examples

  • SLA 120 ms, capture 30 ms, network 15 ms, post 10 ms, safety 20%, jitter 5 ms ⇒ Model inference budget: 36.0 ms (30.0% of SLA). Reserved overhead: 29.0 ms (safety + jitter). Slack after model: 29.0 ms.
  • SLA 80 ms, capture 25 ms, network 10 ms, post 8 ms, safety default, jitter 0 ⇒ Model inference budget: 25.0 ms (31.3% of SLA). Reserved overhead: 12.0 ms (safety + jitter). Slack after model: 12.0 ms.

FAQ

How do I split inference across CPU, GPU, and NPU?

Allocate the model budget across accelerators in proportion to each one's relative throughput, so faster units absorb a larger share of the work. The per-accelerator allocations must sum to no more than the milliseconds returned by this planner.
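As an illustration, a proportional split might look like the sketch below; the throughput figures and helper name are hypothetical, not part of the planner:

```python
def split_budget(budget_ms, relative_throughput):
    """Divide a model budget across accelerators in proportion
    to each one's relative throughput."""
    total = sum(relative_throughput.values())
    return {name: budget_ms * share / total
            for name, share in relative_throughput.items()}

# Hypothetical relative throughputs (inferences/s per accelerator).
shares = split_budget(36.0, {"cpu": 50, "gpu": 300, "npu": 150})
print(shares)  # {'cpu': 3.6, 'gpu': 21.6, 'npu': 10.8}
```

The shares always sum back to the original budget, so the total stays within the planner's output.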

Can I translate the budget into throughput?

Divide 1,000 by the inference budget (ms) to estimate maximum requests per second assuming sequential execution.
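For example, treating the budget as the full service time of one request:

```python
budget_ms = 25.0                # inference budget from the planner
max_rps = 1000.0 / budget_ms    # one request in flight at a time
print(max_rps)                  # 40.0
```

Batching or concurrent execution raises this ceiling; the figure is only the sequential upper bound.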

What if network latency fluctuates heavily?

Plan against your P95 or P99 network measurements rather than the mean, then raise the safety margin and jitter allowance until the remaining slack stays positive.
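A sketch of that check, using a simple nearest-rank percentile and hypothetical sample data (both the helper names and the measurements are ours):

```python
def percentile(samples_ms, p):
    """Nearest-rank percentile of measured latencies."""
    s = sorted(samples_ms)
    return s[min(len(s) - 1, int(round(p * (len(s) - 1))))]

def slack_at_percentile(sla_ms, capture_ms, post_ms, model_ms,
                        network_samples_ms, safety_pct, jitter_ms, p=0.99):
    """Slack remaining when the network term is taken at percentile p."""
    network = percentile(network_samples_ms, p)
    reserved = sla_ms * safety_pct + jitter_ms
    return sla_ms - capture_ms - network - post_ms - reserved - model_ms

rtts = [10, 11, 12, 15, 14, 10, 22, 13, 11, 30]  # hypothetical round trips (ms)
print(slack_at_percentile(120, 30, 10, 36, rtts, 0.20, 5))  # -15.0
```

A negative result means the P99 network tail breaks the budget: raise the margins, shrink the model budget, or renegotiate the SLA.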

Can I budget separately for preprocessing and postprocessing pipelines?

Yes—split them across the capture and post-processing inputs so you can highlight which side of the model is eating most of the envelope before optimisation work.

Additional Information

  • Safety margin converts to a reserved block of the SLA so unplanned spikes do not bust real-time guarantees.
  • Jitter allowance removes a fixed millisecond budget before solving for inference time.
  • Slack after the model shows contingency room that remains for retries or larger batches.
  • Result unit: milliseconds available for inference