Why Batch Inference Still Beats Real-Time for 90% of Use Cases
My real-time inference endpoint has a cold start of 43 seconds. At that point, "real-time" is more of a philosophical concept than an engineering decision.
There's a particular kind of over-engineering that's endemic in ML teams: building real-time serving infrastructure for predictions that nobody needs in real-time. I've watched teams spend months wiring up Kubernetes clusters, GPU-backed endpoints, and autoscaling policies for models that serve a dashboard updated once a day. The architecture diagram looks impressive. The cloud bill looks worse.
The default for most ML workloads should be batch. The bar for real-time should be high. Most teams have it backwards.
The Cost Gap Is Not Subtle
AWS Bedrock offers batch inference at 50% of on-demand pricing. Google Vertex AI matches that. Spot GPUs — which batch workloads can trivially use because they're interruptible and resumable — run 70–91% cheaper than on-demand instances on AWS, and 60–80% cheaper on GCP.
These aren't marginal savings. Snap processes 500 million images daily using around 1,000 T4 GPUs, with 90% of that workload running as batch on spot instances. The result: $6.2 million saved annually, a 78% cost reduction. Pinterest trained recommendation models on 2 billion pins using 200 V100 GPUs with 80% on spot, saving $4.8 million a year.
Batching 32 requests together on a GPU yields roughly a 6x throughput improvement over single-request inference. The hardware is the same. The cost per prediction drops dramatically just by not processing one request at a time.
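You can see the shape of the win without a GPU. In this sketch the "model" is a single NumPy matrix multiply: the per-request path pays the call overhead 32 times, the batched path pays it once, and the predictions come out identical. (The layer sizes and names here are illustrative, not any particular model.)

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((512, 128))  # stand-in for a model layer

def predict_one(x):
    # One forward pass per request: framework and kernel-launch
    # overhead is paid once per item.
    return x @ weights

def predict_batch(xs):
    # One forward pass for the whole batch: the same fixed overhead
    # is amortised across all 32 items.
    return xs @ weights

batch = rng.standard_normal((32, 512))

# Both paths produce the same predictions; only the cost differs.
singles = np.stack([predict_one(x) for x in batch])
batched = predict_batch(batch)
assert np.allclose(singles, batched)
```

On real hardware the batched call also keeps the GPU's compute units busy instead of stalling between requests, which is where the throughput multiple comes from.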
And then there's the infrastructure you don't need. A batch job runs, completes, and shuts down. A real-time endpoint sits there burning money whether anyone's calling it or not. Cast AI's 2025 benchmark across 2,100+ organisations found average Kubernetes CPU utilisation at just 10%. Up to 60% of GPU time in ML workflows is wasted on idle capacity. A single forgotten SageMaker endpoint drains over £1,000 a month. Scale that across a team with multiple models and you're haemorrhaging budget on infrastructure that's mostly doing nothing.
ML inference already accounts for up to 90% of total ML costs in deployed systems. Real-time serving makes that worse by requiring always-on monitoring stacks — Prometheus, Grafana, autoscaling policies — plus an MLOps specialist at £130,000–£200,000 a year to keep it all running. Batch needs a scheduler and a storage bucket.
The Latency Requirement That Doesn't Exist
Here's the question most teams skip: does anyone actually need this prediction in under a second?
Batch inference is like meal prep Sunday — boring, efficient, and your future self thanks you. Real-time inference is ordering Deliveroo for every meal and wondering where your money went.
Churn prediction models typically feed a weekly email campaign. Credit scoring runs when an application is submitted, not continuously. Customer segmentation updates daily at most. Demand forecasting, document classification, content recommendations — all batch-friendly. You precompute predictions, store them in a database or cache, and serve them via a simple lookup. No model loading at request time. No GPU required at the serving layer.
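The precompute-and-lookup pattern fits in a few lines. Here's a hedged sketch using SQLite as the serving store — the scoring function, table, and field names are all illustrative stand-ins, not anyone's production schema:

```python
import sqlite3

def score_churn(customer):
    # Stand-in for the real model; any batch-friendly model goes here.
    return round(min(1.0, customer["days_inactive"] / 90), 2)

def run_nightly_batch(conn, customers):
    # The scheduled job: score everyone offline, write results down.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(customer_id TEXT PRIMARY KEY, churn_score REAL)"
    )
    rows = [(c["id"], score_churn(c)) for c in customers]
    conn.executemany("INSERT OR REPLACE INTO predictions VALUES (?, ?)", rows)
    conn.commit()

def get_prediction(conn, customer_id):
    # The "serving layer": a primary-key lookup, no model in sight.
    row = conn.execute(
        "SELECT churn_score FROM predictions WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
run_nightly_batch(conn, [
    {"id": "c1", "days_inactive": 45},
    {"id": "c2", "days_inactive": 120},
])
print(get_prediction(conn, "c1"))  # → 0.5
```

Swap SQLite for Redis, DynamoDB, or whatever your serving stack already has; the point is that request-time work is a key lookup, not inference.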
Tecton's State of Applied Machine Learning survey found that 34.1% of teams building real-time ML models use exclusively batch data to power them. They've got real-time infrastructure serving predictions computed from data that's already hours or days old. The serving layer is real-time. The value isn't.
Even SageMaker's serverless inference — supposedly the lightweight alternative to always-on endpoints — has cold starts of 30–43 seconds for endpoints that haven't been hit in five minutes. If your "real-time" endpoint needs 43 seconds to wake up, you've built a very expensive batch job with extra steps.
How the Best Teams Actually Do It
Netflix, Spotify, Uber, DoorDash — these are companies with genuine scale and genuine latency requirements. None of them run full model inference in real-time for everything.
Netflix uses a three-layer architecture: offline batch systems crunch terabytes of viewing history, nearline systems update user embeddings seconds after an interaction, and online systems combine precomputed signals with lightweight real-time context like time of day and device type. The heavy computation is batch. Real-time is thin re-ranking on top of precomputed results.
Uber precomputes wait time estimates and caches them by GPS location and time of day. Their feature store hosts over 20,000 features, with heavy computation handled in the offline data plane. The online plane serves predictions with millisecond latency because the expensive work is already done.
Spotify does the same — batch-computed collaborative filtering and content-based recommendations, with real-time re-ranking layered on top.
The pattern is consistent: precompute as much as possible in batch, serve the results from a fast lookup layer, and only run real-time inference on the thin slice that genuinely needs it. The expensive model never sees a live request.
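Stripped to its bones, the hybrid pattern looks something like this: candidate scores come from a precomputed batch table, and request time only applies a cheap contextual adjustment. Every name, score, and boost rule below is illustrative, not a description of any company's actual ranker:

```python
# Written nightly by the batch pipeline: user -> scored candidates.
precomputed = {
    "user_42": [("doc_a", 0.91), ("doc_b", 0.87), ("doc_c", 0.80)],
}

def rerank(user_id, context):
    # The only work done at request time: a lightweight contextual
    # adjustment over a handful of precomputed candidates.
    candidates = precomputed[user_id]

    def adjusted(item):
        doc, score = item
        # Toy real-time signal: boost one item for mobile users.
        boost = 0.1 if context.get("device") == "mobile" and doc == "doc_c" else 0.0
        return score + boost

    return sorted(candidates, key=adjusted, reverse=True)

print(rerank("user_42", {"device": "mobile"}))
```

The expensive model ran hours ago over every user; the request path is a dictionary lookup and a sort over a few dozen items.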
When Real-Time Actually Matters
Real-time inference earns its cost in a narrow set of use cases. Fraud detection is the obvious one — transactions must be scored in milliseconds before authorisation, and a batch job that runs overnight is useless when someone's draining an account right now. Autonomous vehicles need single-digit millisecond inference for safety-critical decisions. Ad tech real-time bidding operates within a 100ms budget. High-frequency trading measures in microseconds.
Conversational AI needs real-time responses too, though even there the servers batch under the hood: techniques like continuous batching interleave many requests on the same GPU to claw back throughput.
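A toy version of the server-side idea: queue incoming requests and flush them as a batch when the batch fills or a deadline passes. Real continuous batching, as in LLM servers like vLLM, reschedules at every generation step rather than per batch, but the throughput intuition is the same. This is a simplified sketch, not any server's actual scheduler:

```python
import time
from collections import deque

class MicroBatcher:
    """Flush pending requests when the batch is full or a deadline
    passes, so the GPU always sees groups rather than single items."""

    def __init__(self, max_batch=4, max_wait_s=0.01):
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.pending = deque()
        self.flushed = []  # batches handed to the model

    def submit(self, request, now=None):
        now = time.monotonic() if now is None else now
        if not self.pending:
            # First request of a new batch starts the clock.
            self.deadline = now + self.max_wait_s
        self.pending.append(request)
        if len(self.pending) >= self.max_batch or now >= self.deadline:
            self.flush()

    def flush(self):
        if self.pending:
            self.flushed.append(list(self.pending))
            self.pending.clear()

b = MicroBatcher(max_batch=3)
for r in ["r1", "r2", "r3", "r4"]:
    b.submit(r)
b.flush()  # flush the straggler
print(b.flushed)  # → [['r1', 'r2', 'r3'], ['r4']]
```

Even a latency-sensitive service ends up amortising the model call — the real-time contract is on response time, not on processing one request at a time.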
The common thread: these are all cases where the prediction is worthless if it arrives late. A fraud score delivered an hour after the transaction cleared may as well not exist. A recommendation served from a cache that's six hours old is still perfectly useful.
If your prediction is still valuable after a few minutes — let alone a few hours — you don't need real-time inference. You need a cron job and a database.
The Resume-Driven Architecture Problem
I suspect a good chunk of real-time inference infrastructure exists because "I built a real-time ML serving platform on Kubernetes" looks better on a CV than "I wrote a SQL query that runs nightly." There's a gravitational pull towards complexity in ML engineering that doesn't exist to the same degree in other disciplines.
The business doesn't care whether the prediction was generated 50 milliseconds ago or 5 hours ago, as long as it's there when they need it. But engineers care about the architecture, and the architecture that gets talks at conferences is the one with the most boxes in the diagram.
Midjourney moved their inference from NVIDIA A100/H100 GPUs to TPU v6e and cut monthly spend from $2.1 million to under $700,000 — $16.8 million in annualised savings. The lesson isn't "use TPUs." It's that inference infrastructure choices have enormous cost implications, and the default should be the cheapest option that meets the actual latency requirement. Not the theoretical one. Not the one in the design doc that says "must be real-time" without defining what that means.
Start With Batch, Prove You Need More
The decision framework is simple: start with batch. If someone complains that predictions are stale, measure how stale and whether it matters. If it genuinely matters — if there's a measurable business impact from prediction latency — then invest in real-time for that specific use case. Not the whole platform.
Even then, consider the hybrid approach first. Precompute the expensive parts in batch, cache the results, and only run lightweight real-time logic (re-ranking, contextual adjustment) at request time. That's what Netflix does. It's probably good enough for your recommendation engine too.
Batch inference isn't glamorous. It doesn't require a Kubernetes cluster or a team of MLOps engineers. It runs overnight, costs a fraction of always-on serving, and for the vast majority of ML use cases, it delivers predictions that are just as useful as real-time ones.
The 10% of use cases that genuinely need real-time inference know exactly why they need it. If you have to ask, you probably don't.