Engineering

Edge AI vs Cloud AI: When to Run Models On-Device

A practical decision framework for choosing between on-device and cloud-based AI inference, with cost analysis, comparison tables, and real-world use cases.

Glenn Sonna
10 min read
edge-ai · cloud-ai · inference · architecture · cost-optimization

TL;DR: The choice between edge AI and cloud AI is not binary — it is a spectrum. Small, latency-sensitive models (ASR, TTS, image classification) belong on-device. Large models (70B+ LLMs, multi-modal reasoning) belong in the cloud. The majority of production workloads fall somewhere in between, and the right answer is usually a hybrid approach that routes intelligently based on model size, latency requirements, and connectivity.

It Is a Spectrum, Not a Binary Choice

The “edge vs cloud” framing is misleading. In practice, most AI-powered applications sit on a continuum spanning three deployment modes:

  • Fully on-device: The model runs entirely on the user’s hardware. No network calls, no server costs, no data leaving the device.
  • Hybrid: Some models run locally while others route to the cloud. The system makes routing decisions based on model size, device capabilities, and network conditions.
  • Fully cloud: All inference happens on remote servers. The client sends data, waits for a response, and renders the result.

Each mode has legitimate use cases. The goal is not to pick a side but to understand where each model in your stack belongs. A single application might run speech recognition on-device, route complex reasoning to the cloud, and synthesize audio locally — all in the same request pipeline.

A Decision Framework for On-Device vs Cloud Inference

Choosing where to run a model comes down to five factors. Evaluate each one for every model in your pipeline, not once for the entire application.

Latency Requirements

This is often the deciding factor. On-device inference eliminates network round-trips entirely.

| Scenario | Acceptable Latency | Recommendation |
| --- | --- | --- |
| Real-time voice interaction | Under 50ms | On-device |
| Live camera/video processing | Under 100ms | On-device |
| Interactive chat responses | Under 500ms | Either (depends on model size) |
| Document processing | Under 5s | Cloud is fine |
| Batch analytics | Minutes | Cloud is fine |

For voice-first applications, cloud latency is often a dealbreaker. A round-trip to a cloud API typically adds 100-300ms of network overhead before the model even begins processing. For text-to-speech, that delay breaks the conversational flow. For wake word detection, it makes the feature unusable.

Privacy Constraints

Some data should never leave the device. This is not just a preference — it is increasingly a legal requirement.

  • Regulated industries (healthcare, finance, legal): Patient records, financial data, and attorney-client communications often cannot be transmitted to third-party servers, even encrypted.
  • User expectations: Voice recordings, camera feeds, and biometric data create friction when users learn they are sent to the cloud.
  • Compliance overhead: Using cloud inference with sensitive data means managing data processing agreements, SOC 2 compliance, and regional data residency requirements. On-device inference sidesteps all of this.

If your data is sensitive and your model is small enough to run locally, on-device inference removes an entire category of compliance risk.

Model Complexity

Model size is the hard constraint. A device can only run what fits in its available memory and compute budget.

| Model Size | Parameter Count | On-Device Feasibility |
| --- | --- | --- |
| Under 50 MB | Under 10M params | Runs on any modern phone |
| 50-200 MB | 10-80M params | Runs on mid-range phones, all desktops |
| 200 MB - 1 GB | 80M-500M params | Runs on flagship phones, all desktops |
| 1-4 GB | 500M-3B params | Runs on desktops and high-end tablets |
| 4+ GB | 3B+ params | Desktop only, or cloud |
| 20+ GB | 13B+ params | Cloud (or high-end workstations with quantization) |

The practical ceiling for mobile on-device inference today is roughly 1-3B parameters with 4-bit quantization. Models like Whisper Tiny (39M params), MobileNet (4M params), and Kokoro TTS (82M params) run comfortably on any modern device. A 7B parameter LLM can run on a desktop with quantization. Anything beyond 13B parameters generally requires cloud infrastructure.
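A quick way to sanity-check these numbers: a model's weight footprint is roughly parameter count times bytes per parameter. A minimal sketch (it ignores activation memory, KV cache, and runtime overhead, which add real headroom on top of the weights):

```python
def weight_footprint_mb(params: float, bits_per_param: int = 16) -> float:
    """Approximate weight size in MB: parameter count x precision, in bytes."""
    return params * bits_per_param / 8 / (1024 ** 2)

# Whisper Tiny at fp16: 39M params -> ~74 MB, in line with its ~75 MB download.
print(round(weight_footprint_mb(39e6, 16)))          # 74

# A 3B-param LLM at 4-bit quantization: ~1.4 GB, near the mobile ceiling.
print(round(weight_footprint_mb(3e9, 4) / 1024, 1))  # 1.4
```

Before declaring a model mobile-feasible, budget extra memory beyond this estimate for activations and the inference runtime itself.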

Connectivity Reliability

If your users might be offline, on-device is not optional — it is required.

This applies more broadly than you might expect: aircraft cabins, rural areas, subway commutes, factory floors, field service operations, and disaster response scenarios all involve unreliable or nonexistent connectivity. If your application must work in these environments, the core inference pipeline needs to run locally.

Even in always-connected environments, on-device inference provides resilience against cloud outages. An application that falls back gracefully to local models when the API is down delivers a better experience than one that shows an error screen.
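That graceful degradation can be captured in a small wrapper. A sketch with placeholder callables (not a real API), where the cloud path is primary and the local model absorbs outages:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def with_local_fallback(run_cloud: Callable[[], T], run_local: Callable[[], T]) -> T:
    """Prefer the cloud model, but fall back to the on-device model on network failure."""
    try:
        return run_cloud()
    except (ConnectionError, TimeoutError):
        # Cloud unreachable: serve a local result instead of an error screen.
        return run_local()

def cloud_call() -> str:
    raise ConnectionError("API is down")  # simulate a cloud outage

result = with_local_fallback(cloud_call, lambda: "local transcription")
print(result)  # local transcription
```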

Cost at Scale

This is where the math gets interesting. Cloud inference has a per-call cost. On-device inference has a fixed cost (model download) and zero marginal cost per inference.

Cost Analysis: The Break-Even Point

Cloud AI pricing is straightforward: you pay per inference. On-device AI has a one-time cost (downloading the model to the device) and then runs for free.

Cloud Inference Costs

Typical cloud API pricing for common model types:

| Model Type | Cost Per Inference | Cost Per 1M Inferences |
| --- | --- | --- |
| Text classification | ~$0.001 | $1,000 |
| Speech recognition (per minute) | ~$0.006 | $6,000 |
| Text-to-speech (per 1K chars) | ~$0.015 | $15,000 |
| Image classification | ~$0.002 | $2,000 |
| LLM chat (per 1K tokens) | ~$0.002-0.06 | $2,000-60,000 |
| Embedding generation | ~$0.0001 | $100 |

These costs compound quickly. An application that makes 10 inference calls per user session, with 100,000 daily active users, generates 1 million inferences per day. At $0.005 per inference, that is $5,000 per day or $150,000 per month.
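That projection is worth keeping as a one-line helper when you model your own traffic:

```python
def monthly_cloud_cost(dau: int, calls_per_session: int,
                       cost_per_call: float, days: int = 30) -> float:
    """Projected monthly cloud inference spend from daily traffic."""
    return dau * calls_per_session * cost_per_call * days

# 100,000 DAU x 10 calls per session at $0.005 per inference:
print(f"${monthly_cloud_cost(100_000, 10, 0.005):,.0f} per month")  # $150,000 per month
```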

On-Device Inference Costs

On-device costs are fundamentally different:

| Cost Component | One-Time | Recurring |
| --- | --- | --- |
| Model download bandwidth | ~$0.01-0.10 per user (CDN cost for 50-500MB model) | $0 |
| Model storage on device | $0 (user’s storage) | $0 |
| Compute per inference | $0 (user’s CPU/GPU) | $0 |
| API server infrastructure | $0 | $0 |

The model download is typically 50-500MB, served from a CDN. At standard CDN rates ($0.08-0.12 per GB), downloading a 200MB model costs roughly $0.02 per user. After that, every inference is free.

Break-Even Analysis

The break-even point depends on how many inferences each user makes:

| Cloud Cost Per Inference | Model Download Cost | Break-Even Point |
| --- | --- | --- |
| $0.001 | $0.02 | 20 inferences per user |
| $0.005 | $0.02 | 4 inferences per user |
| $0.01 | $0.05 | 5 inferences per user |
| $0.05 | $0.10 | 2 inferences per user |

For any model where users make more than a handful of inferences, on-device wins on cost — often by orders of magnitude. A user who runs text-to-speech 100 times costs $1.50 in cloud API fees or $0.02 in CDN bandwidth for the on-device model download. At scale, this difference becomes the dominant line item in your infrastructure budget.
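The break-even calculation is a single division — the one-time download cost amortized across the cloud calls it avoids:

```python
def break_even_inferences(cloud_cost_per_call: float, download_cost: float) -> float:
    """Inferences per user at which the one-time download pays for itself."""
    return download_cost / cloud_cost_per_call

for cloud, download in [(0.001, 0.02), (0.005, 0.02), (0.01, 0.05), (0.05, 0.10)]:
    print(f"${cloud}/call, ${download} download -> "
          f"{round(break_even_inferences(cloud, download))} inferences per user")
# Prints 20, 4, 5, and 2 — the break-even points from the table.
```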

Side-by-Side Comparison

| Factor | On-Device | Cloud | Hybrid |
| --- | --- | --- | --- |
| Latency | 5-50ms (no network) | 100-500ms (network + inference) | Varies by route |
| Privacy | Data never leaves device | Data sent to server | Sensitive data stays local |
| Cost per inference | $0 (after model download) | $0.001-0.06 | Reduced cloud spend |
| Model size limit | ~1-3B params (mobile), ~13B (desktop) | Unlimited | Best of both |
| Accuracy | Limited by model size | Can use largest models | Largest where needed |
| Offline support | Full | None | Partial (on-device models work) |
| Scalability | Scales with user devices (free) | Scales with server spend | Balanced |
| Setup complexity | Model packaging, device testing | API key, HTTP calls | Routing logic required |
| Update cycle | Requires model re-download | Instant (server-side) | Mixed |

Use Cases That Favor On-Device Inference

These workloads share common traits: they are latency-sensitive, use relatively small models, run frequently, or handle sensitive data.

Real-time speech recognition (ASR). Whisper Tiny (39M params, ~75MB) transcribes audio with sub-100ms latency on modern phones. Streaming transcription for voice interfaces, dictation, and accessibility features should almost always run on-device. The latency improvement alone justifies it, and the cost savings at scale are substantial.

Text-to-speech (TTS). Models like Kokoro (82M params, ~180MB) generate natural-sounding speech on-device. For voice assistants, audiobook readers, and accessibility tools, on-device TTS eliminates the noticeable delay that cloud TTS introduces in conversational flows.

Image classification and object detection. MobileNet, EfficientNet, and YOLO variants are designed for edge deployment. Camera-based features (barcode scanning, plant identification, accessibility descriptions) benefit from the instant feedback loop of on-device inference.

Text embeddings. Small embedding models (MiniLM, all-MiniLM-L6-v2 at ~80MB) generate vector representations locally. This enables on-device semantic search, document similarity, and retrieval-augmented generation without sending user documents to a server.

Wake word and keyword detection. Always-on listening for trigger phrases must run on-device. Streaming audio to the cloud continuously is both a privacy concern and a bandwidth problem.

Use Cases That Favor Cloud Inference

These workloads require model sizes or compute budgets that exceed what consumer devices can provide.

Large language models (70B+ parameters). Frontier LLMs like GPT-4, Claude, and Llama 3 70B require tens of gigabytes of memory and significant GPU compute. These models will remain cloud-only for consumer devices for the foreseeable future.

Training and fine-tuning. Model training is fundamentally a cloud workload. Even fine-tuning requires GPU clusters with large memory pools that consumer hardware cannot provide.

Multi-modal reasoning. Models that process images, video, and text together (vision-language models) are typically large and compute-intensive. The accuracy gains from using a 13B+ multi-modal model in the cloud outweigh the latency cost for most use cases.

Low-volume or sporadic usage. If a feature is used rarely (a few times per month per user), the cost of downloading and storing a model on-device is harder to justify. Cloud inference makes more sense when the per-user inference count is low.

Rapidly evolving models. When you need to swap models frequently (A/B testing, weekly model updates), cloud deployment is simpler. On-device models require the user to download updates.

The Hybrid Pattern: Intelligent Routing

The most practical architecture for production applications is hybrid: run what you can on-device, route what you must to the cloud, and make the decision automatically based on runtime conditions.

This is the core idea behind Xybrid’s routing system. A pipeline definition declares a preference for each stage, and the runtime decides where to execute based on device capabilities, model availability, and network conditions.

pipeline:
  name: smart-assistant
  routing: auto
  stages:
    - model: whisper-tiny
      task: transcribe
      prefer: device        # Always on-device (fast, small model)
    - model: llama-3.2-1b
      task: generate
      prefer: device        # On-device when possible
      fallback: cloud       # Cloud fallback for complex queries
    - model: kokoro-82m
      task: synthesize
      prefer: device        # Always on-device (latency-sensitive)

In this pipeline, speech recognition and speech synthesis always run on-device because they are latency-sensitive and the models are small enough. The LLM stage prefers on-device execution with a 1B parameter model, but falls back to the cloud when the query exceeds what the local model can handle — or when the device does not have enough resources to run it.

Routing Strategies

Several routing patterns work well in practice:

  • Prefer-device with cloud fallback: Try on-device first. If the model is not downloaded, the device is under load, or the task requires a larger model, route to the cloud. This is the most common pattern.
  • Latency-aware routing: Measure on-device inference time. If it exceeds a threshold (device is too slow for the model), route to the cloud for subsequent requests.
  • Complexity-based routing: Use a small on-device classifier to estimate query complexity. Simple queries go to the local model; complex ones go to a larger cloud model.
  • Offline-first: Always attempt on-device execution. Only route to the cloud when the local model cannot handle the task and connectivity is available.
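The first pattern — prefer-device with cloud fallback — reduces to a small decision function. A sketch with hypothetical condition flags (a real runtime such as Xybrid's checks device capabilities in far more detail):

```python
from dataclasses import dataclass

@dataclass
class RuntimeConditions:
    model_downloaded: bool      # is the model already on the device?
    device_has_capacity: bool   # enough free memory/compute to run it?
    network_available: bool     # can we reach the cloud at all?

def route(prefer: str, cond: RuntimeConditions) -> str:
    """Prefer-device routing with cloud fallback."""
    can_run_locally = cond.model_downloaded and cond.device_has_capacity
    if prefer == "device" and can_run_locally:
        return "device"
    if cond.network_available:
        return "cloud"
    if can_run_locally:
        return "device"  # offline: the local model is all we have
    raise RuntimeError("no viable execution target")

print(route("device", RuntimeConditions(True, True, True)))    # device
print(route("device", RuntimeConditions(False, True, True)))   # cloud
print(route("device", RuntimeConditions(True, True, False)))   # device (offline)
```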

Decision Checklist

Before deploying a model, run through these questions. If you answer “yes” to three or more, on-device inference is likely the right default.

  1. Is latency critical? Does the user experience degrade noticeably with 200ms+ of added latency?
  2. Is the data sensitive? Would sending this data to a third-party server create privacy, compliance, or trust concerns?
  3. Is the model small enough? Is the model under 1GB (mobile) or under 4GB (desktop)?
  4. Will users run it frequently? Will each user trigger this model more than 10 times over the app’s lifetime?
  5. Do users need offline access? Must this feature work without an internet connection?
  6. Is the model stable? Will you keep the same model for weeks or months (not swapping daily)?
  7. Are you scaling to many users? Will cloud inference costs become a significant budget line at your projected user count?

For models where you answer “no” to most of these — large models, infrequent use, non-sensitive data, always-connected users — cloud inference is simpler to deploy and maintain.
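If you want the three-or-more rule as an executable default, it is a one-line score (the answer keys below are illustrative labels for the seven questions, not a fixed schema):

```python
def on_device_default(answers: dict[str, bool]) -> bool:
    """Three or more 'yes' answers make on-device the likely right default."""
    return sum(answers.values()) >= 3

answers = {
    "latency_critical": True,
    "data_sensitive": False,
    "model_small_enough": True,
    "frequent_use": True,
    "offline_needed": False,
    "model_stable": True,
    "scaling_to_many_users": False,
}
print(on_device_default(answers))  # True: four yes answers
```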

Conclusion

The edge AI vs cloud AI decision is ultimately about matching each model to the deployment mode where it performs best across latency, privacy, cost, and feasibility. Small, fast, frequently-used models belong on the device. Large, complex, infrequently-used models belong in the cloud. And for everything in between, a hybrid approach with intelligent routing gives you the flexibility to optimize for the constraints that matter most to your application.

The key insight is to make this decision per-model, not per-application. A single product can — and usually should — use both on-device and cloud inference for different parts of its AI pipeline. The architecture that enables this flexibility is what separates production-grade AI applications from prototypes.
