Edge Inference Latency Budget Planner
Split an edge inference SLA into capture, transport, compute, and post-processing budgets so you know exactly how many milliseconds remain for your model—even after safety margins.
Treat the output as guidance—validate against production traces and SLA monitoring before committing to launch.
Examples
- SLA 120 ms, capture 30 ms, network 15 ms, post 10 ms, safety 20%, jitter 5 ms ⇒ Model inference budget: 36.0 ms (30.0% of SLA). Reserved overhead: 29.0 ms (safety + jitter). Slack after model: 24.0 ms.
- SLA 80 ms, capture 25 ms, network 10 ms, post 8 ms, safety default, jitter 0 ⇒ Model inference budget: 25.0 ms (31.3% of SLA). Reserved overhead: 12.0 ms (safety + jitter). Slack after model: 12.0 ms.
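The arithmetic behind these figures can be sketched in a few lines of Python. The formula below is inferred from the examples and the notes under Additional Information (safety as a percentage of the SLA, jitter as a flat deduction, and a default safety margin assumed to be 15%); the planner's actual implementation may differ:

```python
def inference_budget(sla_ms, capture_ms, network_ms, post_ms,
                     safety_pct=15.0, jitter_ms=0.0):
    """Solve for the model inference budget inside an end-to-end SLA.

    The safety margin reserves a fixed share of the SLA; the jitter
    allowance removes a flat millisecond block before solving.
    """
    safety_ms = sla_ms * safety_pct / 100.0
    reserved_ms = safety_ms + jitter_ms        # "safety + jitter" overhead
    model_ms = sla_ms - capture_ms - network_ms - post_ms - reserved_ms
    slack_ms = safety_ms                       # contingency left after the model
    return model_ms, reserved_ms, slack_ms

# Second example above: SLA 80 ms, capture 25, network 10, post 8,
# default safety, no jitter.
model, reserved, slack = inference_budget(80, 25, 10, 8)
print(model, reserved, slack)  # 25.0 12.0 12.0
```

Under this assumed formula the "safety default" case above implies a 15% default margin (12 ms of an 80 ms SLA).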
FAQ
How do I split inference across CPU, GPU, and NPU?
Allocate the model budget across accelerators in proportion to each one's measured throughput. The combined total must stay within the milliseconds returned by this planner.
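A minimal sketch of that proportional split; the device names and throughput figures (inferences per second) here are hypothetical placeholders for your own measurements:

```python
def split_budget(model_budget_ms, throughput):
    """Split a model budget across accelerators in proportion to
    each device's measured throughput (inferences/sec)."""
    total = sum(throughput.values())
    return {dev: model_budget_ms * tps / total
            for dev, tps in throughput.items()}

# Hypothetical on-device throughput measurements:
shares = split_budget(36.0, {"CPU": 50, "GPU": 200, "NPU": 150})
print(shares)  # {'CPU': 4.5, 'GPU': 18.0, 'NPU': 13.5} -- sums to 36.0 ms
```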
Can I translate the budget into throughput?
Divide 1,000 by the inference budget (ms) to estimate maximum requests per second assuming sequential execution.
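For instance, applying that rule to a 25 ms budget:

```python
budget_ms = 25.0
# Sequential execution: each request occupies the accelerator for the
# full inference budget, so throughput is the reciprocal of latency.
max_rps = 1000.0 / budget_ms
print(max_rps)  # 40.0
```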
What if network latency fluctuates heavily?
Use your P95 and P99 latency measurements: raise the safety margin and jitter allowance until the remaining slack stays positive even at those tail percentiles.
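One way to derive a jitter allowance from raw network latency samples, sketched with Python's standard library. The percentile-gap heuristic (tail latency minus median) is an assumption for illustration, not the planner's definition:

```python
import statistics

def jitter_allowance(samples_ms, percentile=99):
    """Set the jitter allowance to the gap between a tail percentile
    and the median of measured network latency."""
    qs = statistics.quantiles(samples_ms, n=100)  # qs[k-1] ~ k-th percentile
    return qs[percentile - 1] - statistics.median(samples_ms)

# e.g. jitter_allowance(measured_network_ms) -> ms to reserve for tail spikes
```

Feed the result back into the planner's jitter input and re-check the slack.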
Can I budget separately for preprocessing and postprocessing pipelines?
Yes—split them across the capture and post-processing inputs so you can highlight which side of the model is eating most of the envelope before optimisation work.
Additional Information
- Safety margin converts to a reserved block of the SLA so unplanned spikes do not break real-time guarantees.
- Jitter allowance removes a fixed millisecond budget before solving for inference time.
- Slack after the model shows contingency room that remains for retries or larger batches.
- Result unit: milliseconds available for inference