Engineering

Edge AI vs Cloud AI: When to Run Models On-Device

A practical decision framework for choosing between on-device and cloud-based AI inference, with cost analysis, comparison tables, and real-world use cases.

Glenn Sonna
10 min read
edge-ai · cloud-ai · inference · architecture · cost-optimization

TL;DR: The choice between edge AI and cloud AI is not binary — it is a spectrum. Small, latency-sensitive models (ASR, TTS, image classification) belong on-device. Large models (70B+ LLMs, multi-modal reasoning) belong in the cloud. The majority of production workloads fall somewhere in between, and the right answer is usually a hybrid approach that routes intelligently based on model size, latency requirements, and connectivity.

It Is a Spectrum, Not a Binary Choice

The “edge vs cloud” framing is misleading. In practice, most AI-powered applications sit on a continuum spanning three deployment modes:

  • Fully on-device: The model runs entirely on the user’s hardware. No network calls, no server costs, no data leaving the device.
  • Hybrid: Some models run locally while others route to the cloud. The system makes routing decisions based on model size, device capabilities, and network conditions.
  • Fully cloud: All inference happens on remote servers. The client sends data, waits for a response, and renders the result.

Each mode has legitimate use cases. The goal is not to pick a side but to understand where each model in your stack belongs. A single application might run speech recognition on-device, route complex reasoning to the cloud, and synthesize audio locally — all in the same request pipeline.

A Decision Framework for On-Device vs Cloud Inference

Choosing where to run a model comes down to five factors. Evaluate each one for every model in your pipeline, not once for the entire application.

Latency Requirements

This is often the deciding factor. On-device inference eliminates network round-trips entirely.

| Scenario | Acceptable Latency | Recommendation |
| --- | --- | --- |
| Real-time voice interaction | Under 50ms | On-device |
| Live camera/video processing | Under 100ms | On-device |
| Interactive chat responses | Under 500ms | Either (depends on model size) |
| Document processing | Under 5s | Cloud is fine |
| Batch analytics | Minutes | Cloud is fine |

For voice-first applications, cloud latency is often a dealbreaker. A round-trip to a cloud API typically adds 100-300ms of network overhead before the model even begins processing. For text-to-speech, that delay breaks the conversational flow. For wake word detection, it makes the feature unusable.

Privacy Constraints

Some data should never leave the device. This is not just a preference — it is increasingly a legal requirement.

  • Regulated industries (healthcare, finance, legal): Patient records, financial data, and attorney-client communications often cannot be transmitted to third-party servers, even encrypted.
  • User expectations: Voice recordings, camera feeds, and biometric data create friction when users learn they are sent to the cloud.
  • Compliance overhead: Using cloud inference with sensitive data means managing data processing agreements, SOC 2 compliance, and regional data residency requirements. On-device inference sidesteps all of this.

If your data is sensitive and your model is small enough to run locally, on-device inference removes an entire category of compliance risk.

Model Complexity

Model size is the hard constraint. A device can only run what fits in its available memory and compute budget.

| Model Size | Parameter Count | On-Device Feasibility |
| --- | --- | --- |
| Under 50 MB | Under 10M params | Runs on any modern phone |
| 50-200 MB | 10-80M params | Runs on mid-range phones, all desktops |
| 200 MB - 1 GB | 80M-500M params | Runs on flagship phones, all desktops |
| 1-4 GB | 500M-3B params | Runs on desktops and high-end tablets |
| 4+ GB | 3B+ params | Desktop only, or cloud |
| 20+ GB | 13B+ params | Cloud (or high-end workstations with quantization) |

The practical ceiling for mobile on-device inference today is roughly 1-3B parameters with 4-bit quantization. Models like Whisper Tiny (39M params), MobileNet (4M params), and Kokoro TTS (82M params) run comfortably on any modern device. A 7B parameter LLM can run on a desktop with quantization. Anything beyond 13B parameters generally requires cloud infrastructure.
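A quick way to sanity-check these numbers: a model's weight footprint is roughly parameter count times bytes per parameter. A minimal sketch (it ignores activation memory, KV cache, and runtime overhead, which add real headroom on top of the weights):

```python
def weight_footprint_mb(params: float, bits_per_param: int = 16) -> float:
    """Approximate weight size in MB: parameter count x precision, in bytes."""
    return params * bits_per_param / 8 / (1024 ** 2)

# Whisper Tiny at fp16: 39M params -> ~74 MB, in line with its ~75 MB download.
print(round(weight_footprint_mb(39e6, 16)))          # 74

# A 3B-param LLM at 4-bit quantization: ~1.4 GB, near the mobile ceiling.
print(round(weight_footprint_mb(3e9, 4) / 1024, 1))  # 1.4
```

Before declaring a model mobile-feasible, budget extra memory beyond this estimate for activations and the inference runtime itself.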

Connectivity Reliability

If your users might be offline, on-device is not optional — it is required.

This applies more broadly than you might expect: aircraft cabins, rural areas, subway commutes, factory floors, field service operations, and disaster response scenarios all involve unreliable or nonexistent connectivity. If your application must work in these environments, the core inference pipeline needs to run locally.

Even in always-connected environments, on-device inference provides resilience against cloud outages. An application that falls back gracefully to local models when the API is down delivers a better experience than one that shows an error screen.
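That graceful degradation can be captured in a small wrapper. A sketch with placeholder callables (not a real API), where the cloud path is primary and the local model absorbs outages:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def with_local_fallback(run_cloud: Callable[[], T], run_local: Callable[[], T]) -> T:
    """Prefer the cloud model, but fall back to the on-device model on network failure."""
    try:
        return run_cloud()
    except (ConnectionError, TimeoutError):
        # Cloud unreachable: serve a local result instead of an error screen.
        return run_local()

def cloud_call() -> str:
    raise ConnectionError("API is down")  # simulate a cloud outage

result = with_local_fallback(cloud_call, lambda: "local transcription")
print(result)  # local transcription
```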

Cost at Scale

This is where the math gets interesting. Cloud inference has a per-call cost. On-device inference has a fixed cost (model download) and zero marginal cost per inference.

Cost Analysis: The Break-Even Point

Cloud AI pricing is straightforward: you pay per inference. On-device AI has a one-time cost (downloading the model to the device) and then runs for free.

Cloud Inference Costs

Typical cloud API pricing for common model types:

| Model Type | Cost Per Inference | Cost Per 1M Inferences |
| --- | --- | --- |
| Text classification | ~$0.001 | $1,000 |
| Speech recognition (per minute) | ~$0.006 | $6,000 |
| Text-to-speech (per 1K chars) | ~$0.015 | $15,000 |
| Image classification | ~$0.002 | $2,000 |
| LLM chat (per 1K tokens) | ~$0.002-0.06 | $2,000-60,000 |
| Embedding generation | ~$0.0001 | $100 |

These costs compound quickly. An application that makes 10 inference calls per user session, with 100,000 daily active users, generates 1 million inferences per day. At $0.005 per inference, that is $5,000 per day or $150,000 per month.
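That projection is worth keeping as a one-line helper when you model your own traffic:

```python
def monthly_cloud_cost(dau: int, calls_per_session: int,
                       cost_per_call: float, days: int = 30) -> float:
    """Projected monthly cloud inference spend from daily traffic."""
    return dau * calls_per_session * cost_per_call * days

# 100,000 DAU x 10 calls per session at $0.005 per inference:
print(f"${monthly_cloud_cost(100_000, 10, 0.005):,.0f} per month")  # $150,000 per month
```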

On-Device Inference Costs

On-device costs are fundamentally different:

| Cost Component | One-Time | Recurring |
| --- | --- | --- |
| Model download bandwidth | ~$0.01-0.10 per user (CDN cost for 50-500MB model) | $0 |
| Model storage on device | $0 (user’s storage) | $0 |
| Compute per inference | $0 (user’s CPU/GPU) | $0 |
| API server infrastructure | $0 | $0 |

The model download is typically 50-500MB, served from a CDN. At standard CDN rates ($0.08-0.12 per GB), downloading a 200MB model costs roughly $0.02 per user. After that, every inference is free.

Break-Even Analysis

The break-even point depends on how many inferences each user makes:

| Cloud Cost Per Inference | Model Download Cost | Break-Even Point |
| --- | --- | --- |
| $0.001 | $0.02 | 20 inferences per user |
| $0.005 | $0.02 | 4 inferences per user |
| $0.01 | $0.05 | 5 inferences per user |
| $0.05 | $0.10 | 2 inferences per user |

For any model where users make more than a handful of inferences, on-device wins on cost — often by orders of magnitude. A user who runs text-to-speech 100 times costs $1.50 in cloud API fees or $0.02 in CDN bandwidth for the on-device model download. At scale, this difference becomes the dominant line item in your infrastructure budget.
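The break-even calculation is a single division — the one-time download cost amortized across the cloud calls it avoids:

```python
def break_even_inferences(cloud_cost_per_call: float, download_cost: float) -> float:
    """Inferences per user at which the one-time download pays for itself."""
    return download_cost / cloud_cost_per_call

for cloud, download in [(0.001, 0.02), (0.005, 0.02), (0.01, 0.05), (0.05, 0.10)]:
    print(f"${cloud}/call, ${download} download -> "
          f"{round(break_even_inferences(cloud, download))} inferences per user")
# Prints 20, 4, 5, and 2 — the break-even points from the table.
```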

Side-by-Side Comparison

| Factor | On-Device | Cloud | Hybrid |
| --- | --- | --- | --- |
| Latency | 5-50ms (no network) | 100-500ms (network + inference) | Varies by route |
| Privacy | Data never leaves device | Data sent to server | Sensitive data stays local |
| Cost per inference | $0 (after model download) | $0.001-0.06 | Reduced cloud spend |
| Model size limit | ~1-3B params (mobile), ~13B (desktop) | Unlimited | Best of both |
| Accuracy | Limited by model size | Can use largest models | Largest where needed |
| Offline support | Full | None | Partial (on-device models work) |
| Scalability | Scales with user devices (free) | Scales with server spend | Balanced |
| Setup complexity | Model packaging, device testing | API key, HTTP calls | Routing logic required |
| Update cycle | Requires model re-download | Instant (server-side) | Mixed |

Use Cases That Favor On-Device Inference

These workloads share common traits: they are latency-sensitive, use relatively small models, run frequently, or handle sensitive data.

Real-time speech recognition (ASR). Whisper Tiny (39M params, ~75MB) transcribes audio with sub-100ms latency on modern phones. Streaming transcription for voice interfaces, dictation, and accessibility features should almost always run on-device. The latency improvement alone justifies it, and the cost savings at scale are substantial.

Text-to-speech (TTS). Models like Kokoro (82M params, ~180MB) generate natural-sounding speech on-device. For voice assistants, audiobook readers, and accessibility tools, on-device TTS eliminates the noticeable delay that cloud TTS introduces in conversational flows.

Image classification and object detection. MobileNet, EfficientNet, and YOLO variants are designed for edge deployment. Camera-based features (barcode scanning, plant identification, accessibility descriptions) benefit from the instant feedback loop of on-device inference.

Text embeddings. Small embedding models (MiniLM, all-MiniLM-L6-v2 at ~80MB) generate vector representations locally. This enables on-device semantic search, document similarity, and retrieval-augmented generation without sending user documents to a server.

Wake word and keyword detection. Always-on listening for trigger phrases must run on-device. Streaming audio to the cloud continuously is both a privacy concern and a bandwidth problem.

Use Cases That Favor Cloud Inference

These workloads require model sizes or compute budgets that exceed what consumer devices can provide.

Large language models (70B+ parameters). Frontier LLMs like GPT-4, Claude, and Llama 3 70B require tens of gigabytes of memory and significant GPU compute. These models will remain cloud-only for consumer devices for the foreseeable future.

Training and fine-tuning. Model training is fundamentally a cloud workload. Even fine-tuning requires GPU clusters with large memory pools that consumer hardware cannot provide.

Multi-modal reasoning. Models that process images, video, and text together (vision-language models) are typically large and compute-intensive. The accuracy gains from using a 13B+ multi-modal model in the cloud outweigh the latency cost for most use cases.

Low-volume or sporadic usage. If a feature is used rarely (a few times per month per user), the cost of downloading and storing a model on-device is harder to justify. Cloud inference makes more sense when the per-user inference count is low.

Rapidly evolving models. When you need to swap models frequently (A/B testing, weekly model updates), cloud deployment is simpler. On-device models require the user to download updates.

The Hybrid Pattern: Intelligent Routing

The most practical architecture for production applications is hybrid: run what you can on-device, route what you must to the cloud, and make the decision automatically based on runtime conditions.

This is the core idea behind Xybrid’s routing system. A pipeline definition declares a preference for each stage, and the runtime decides where to execute based on device capabilities, model availability, and network conditions.

pipeline:
  name: smart-assistant
  routing: auto
  stages:
    - model: whisper-tiny
      task: transcribe
      prefer: device        # Always on-device (fast, small model)
    - model: llama-3.2-1b
      task: generate
      prefer: device        # On-device when possible
      fallback: cloud       # Cloud fallback for complex queries
    - model: kokoro-82m
      task: synthesize
      prefer: device        # Always on-device (latency-sensitive)

In this pipeline, speech recognition and speech synthesis always run on-device because they are latency-sensitive and the models are small enough. The LLM stage prefers on-device execution with a 1B parameter model, but falls back to the cloud when the query exceeds what the local model can handle — or when the device does not have enough resources to run it.

Routing Strategies

Several routing patterns work well in practice:

  • Prefer-device with cloud fallback: Try on-device first. If the model is not downloaded, the device is under load, or the task requires a larger model, route to the cloud. This is the most common pattern.
  • Latency-aware routing: Measure on-device inference time. If it exceeds a threshold (device is too slow for the model), route to the cloud for subsequent requests.
  • Complexity-based routing: Use a small on-device classifier to estimate query complexity. Simple queries go to the local model; complex ones go to a larger cloud model.
  • Offline-first: Always attempt on-device execution. Only route to the cloud when the local model cannot handle the task and connectivity is available.
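The first pattern — prefer-device with cloud fallback — reduces to a small decision function. A sketch with hypothetical condition flags (a real runtime such as Xybrid's checks device capabilities in far more detail):

```python
from dataclasses import dataclass

@dataclass
class RuntimeConditions:
    model_downloaded: bool      # is the model already on the device?
    device_has_capacity: bool   # enough free memory/compute to run it?
    network_available: bool     # can we reach the cloud at all?

def route(prefer: str, cond: RuntimeConditions) -> str:
    """Prefer-device routing with cloud fallback."""
    can_run_locally = cond.model_downloaded and cond.device_has_capacity
    if prefer == "device" and can_run_locally:
        return "device"
    if cond.network_available:
        return "cloud"
    if can_run_locally:
        return "device"  # offline: the local model is all we have
    raise RuntimeError("no viable execution target")

print(route("device", RuntimeConditions(True, True, True)))    # device
print(route("device", RuntimeConditions(False, True, True)))   # cloud
print(route("device", RuntimeConditions(True, True, False)))   # device (offline)
```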

Decision Checklist

Before deploying a model, run through these questions. If you answer “yes” to three or more, on-device inference is likely the right default.

  1. Is latency critical? Does the user experience degrade noticeably with 200ms+ of added latency?
  2. Is the data sensitive? Would sending this data to a third-party server create privacy, compliance, or trust concerns?
  3. Is the model small enough? Is the model under 1GB (mobile) or under 4GB (desktop)?
  4. Will users run it frequently? Will each user trigger this model more than 10 times over the app’s lifetime?
  5. Do users need offline access? Must this feature work without an internet connection?
  6. Is the model stable? Will you keep the same model for weeks or months (not swapping daily)?
  7. Are you scaling to many users? Will cloud inference costs become a significant budget line at your projected user count?

For models where you answer “no” to most of these — large models, infrequent use, non-sensitive data, always-connected users — cloud inference is simpler to deploy and maintain.
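If you want the three-or-more rule as an executable default, it is a one-line score (the answer keys below are illustrative labels for the seven questions, not a fixed schema):

```python
def on_device_default(answers: dict[str, bool]) -> bool:
    """Three or more 'yes' answers make on-device the likely right default."""
    return sum(answers.values()) >= 3

answers = {
    "latency_critical": True,
    "data_sensitive": False,
    "model_small_enough": True,
    "frequent_use": True,
    "offline_needed": False,
    "model_stable": True,
    "scaling_to_many_users": False,
}
print(on_device_default(answers))  # True: four yes answers
```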

Conclusion

The edge AI vs cloud AI decision is ultimately about matching each model to the deployment mode where it performs best across latency, privacy, cost, and feasibility. Small, fast, frequently-used models belong on the device. Large, complex, infrequently-used models belong in the cloud. And for everything in between, a hybrid approach with intelligent routing gives you the flexibility to optimize for the constraints that matter most to your application.

The key insight is to make this decision per-model, not per-application. A single product can — and usually should — use both on-device and cloud inference for different parts of its AI pipeline. The architecture that enables this flexibility is what separates production-grade AI applications from prototypes.
