
Building a Voice Agent That Runs Entirely On-Device

A step-by-step tutorial for building an on-device voice agent using Whisper, a local LLM, and Kokoro TTS — no cloud APIs, no internet required.

Glenn Sonna
11 min read
Tags: tutorial, voice-agent, tts, asr, on-device-ai, flutter

TL;DR — This tutorial walks you through building a fully offline voice assistant in Flutter using Xybrid. You will chain three on-device models — Whisper for speech recognition, Llama 3.2 1B for reasoning, and Kokoro 82M for text-to-speech — into a single pipeline that turns spoken questions into spoken answers with zero cloud dependencies.

What We Are Building

Most voice assistants send your audio to a server, process it remotely, and stream an answer back. That round trip adds latency, costs money per request, and requires a network connection. It also means your voice data leaves your device.

We are going to build the opposite: a voice agent where everything runs locally. The user taps a button, speaks a question, and the app transcribes it, generates a response, and reads it back out loud — all without touching the internet.

The full loop looks like this:

Audio In → Whisper (ASR) → Text → Llama 3.2 (LLM) → Text → Kokoro (TTS) → Audio Out

By the end of this tutorial you will have a working Flutter app that performs this entire pipeline on an iPhone or Android device.

Architecture Overview

The voice agent is a three-stage pipeline. Each stage is a separate ML model, and Xybrid handles the data flow between them automatically.

Stage 1 — Speech Recognition (ASR). Whisper Tiny takes raw audio and produces a text transcript. It runs well on mobile hardware and supports multiple languages, though we will stick with English here.

Stage 2 — Reasoning (LLM). Llama 3.2 1B takes the transcript and generates a conversational response. The 1B parameter variant is small enough to run on-device while still producing coherent, useful answers.

Stage 3 — Voice Synthesis (TTS). Kokoro 82M takes the generated text and produces natural-sounding speech audio. At only 82 million parameters, it is fast enough for real-time synthesis on modern phones.

Xybrid connects these stages through its pipeline system. You define the chain in a YAML file, and the runtime handles envelope passing, model loading, and execution sequencing.

Prerequisites

Before you start, make sure you have:

  • Flutter SDK 3.22 or later installed and configured
  • A physical device for testing (iPhone 12+ or Pixel 6+). The simulator works for development but performance benchmarks require real hardware.
  • About 2 GB of free storage on the test device for the three models
  • Basic familiarity with Flutter and async Dart

Step 1: Define the Pipeline

The pipeline configuration tells Xybrid which models to run and in what order. Create a file called voice_agent_pipeline.yaml in your project’s assets/ directory:

pipeline:
  name: voice-agent
  stages:
    - model: whisper-tiny
      task: transcribe
      config:
        language: en
    - model: llama-3.2-1b
      task: generate
      config:
        max_tokens: 256
        system_prompt: "You are a helpful voice assistant. Keep responses concise and conversational."
    - model: kokoro-82m
      task: synthesize
      config:
        voice: af_heart
        speed: 1.0

A few things to note about this configuration:

  • whisper-tiny is the smallest Whisper variant. If you need better accuracy and your device can handle it, swap in whisper-base or whisper-small.
  • max_tokens: 256 keeps LLM responses short. Voice responses that run longer than 20-30 seconds feel unnatural, so capping token count here is intentional.
  • voice: af_heart selects one of Kokoro’s built-in voice profiles. You can browse available voices in the model’s documentation.
  • speed: 1.0 is normal playback speed. Values between 0.8 and 1.2 sound natural.
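For example, swapping in the larger ASR variant mentioned above is a one-line change to the first stage; the rest of the pipeline stays as-is:

```yaml
pipeline:
  name: voice-agent
  stages:
    - model: whisper-base   # larger variant: better accuracy, more memory and latency
      task: transcribe
      config:
        language: en
    # ... llama-3.2-1b and kokoro-82m stages unchanged
```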

Step 2: Set Up the Flutter Project

Create a new Flutter project or open an existing one. Add the Xybrid Flutter package to your pubspec.yaml:

dependencies:
  flutter:
    sdk: flutter
  xybrid_flutter: ^0.1.0

Run flutter pub get, then initialize Xybrid in your app’s entry point:

import 'package:xybrid_flutter/xybrid_flutter.dart';

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  await Xybrid.init();
  runApp(const VoiceAgentApp());
}

The Xybrid.init() call sets up the runtime, configures the model cache directory, and prepares the execution backends. It needs to run before any model operations.

Make sure your pipeline YAML is registered in pubspec.yaml as an asset:

flutter:
  assets:
    - assets/voice_agent_pipeline.yaml

Step 3: Implement Speech Recognition

The first stage of our pipeline handles automatic speech recognition (ASR). Whisper takes raw audio bytes and returns a text transcript.

Before building the rest of the app, let us load the pipeline and verify transcription works:

// Load the pipeline
final pipelineYaml = await rootBundle.loadString('assets/voice_agent_pipeline.yaml');
final pipeline = await Xybrid.pipeline(yamlContent: pipelineYaml).load();

// Load all three models (shows progress for each)
await pipeline.loadModels(onProgress: (model, progress) {
  print('Loading $model: ${(progress * 100).toInt()}%');
});

For audio capture, you need to record from the device microphone. The recorded audio should be 16kHz mono WAV — the format Whisper expects. You can use any Flutter audio recording package that gives you raw bytes. Here is the pattern:

// After recording audio from the microphone...
final Uint8List audioBytes = await recorder.stopAndGetBytes();

// Run the pipeline and read the transcript the ASR stage produced
final result = await pipeline.run(
  envelope: Envelope.audio(bytes: audioBytes),
);

// The transcript is available in the result metadata
final transcript = result.metadata['transcript'];
print('User said: $transcript');

The first time you run this, Xybrid will download the Whisper model to the device cache. Subsequent runs load from cache, which takes under a second.
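If your recording package hands back raw PCM rather than a finished WAV file, you need to prepend a WAV header yourself before passing the bytes in. Here is a minimal sketch, assuming 16-bit little-endian PCM at 16 kHz mono (the format Whisper expects); the helper names are our own, not part of Xybrid:

```dart
import 'dart:typed_data';

/// Wraps raw 16-bit PCM samples in a standard 44-byte WAV (RIFF) header.
Uint8List pcmToWav(Uint8List pcm, {int sampleRate = 16000, int channels = 1}) {
  const bitsPerSample = 16;
  final byteRate = sampleRate * channels * bitsPerSample ~/ 8;
  final blockAlign = channels * bitsPerSample ~/ 8;
  final out = BytesBuilder();

  out.add('RIFF'.codeUnits);
  out.add(_u32(36 + pcm.length)); // remaining chunk size: header rest + data
  out.add('WAVE'.codeUnits);
  out.add('fmt '.codeUnits);
  out.add(_u32(16));              // fmt chunk size for plain PCM
  out.add(_u16(1));               // audio format 1 = PCM
  out.add(_u16(channels));
  out.add(_u32(sampleRate));
  out.add(_u32(byteRate));
  out.add(_u16(blockAlign));
  out.add(_u16(bitsPerSample));
  out.add('data'.codeUnits);
  out.add(_u32(pcm.length));
  out.add(pcm);
  return out.toBytes();
}

Uint8List _u16(int v) =>
    Uint8List(2)..buffer.asByteData().setUint16(0, v, Endian.little);
Uint8List _u32(int v) =>
    Uint8List(4)..buffer.asByteData().setUint32(0, v, Endian.little);
```

One second of 16 kHz mono 16-bit audio is 32,000 bytes of PCM, so the wrapped file comes out 44 bytes larger than the raw buffer.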

Step 4: Add the Reasoning Layer

With transcription working, the next stage passes that text into a local LLM. Llama 3.2 1B runs the prompt through the model and produces a text response.

In a pipeline, this happens automatically. The output envelope from the ASR stage becomes the input envelope for the LLM stage. Xybrid transforms the data between stages based on each model’s task type.

The system_prompt in our pipeline YAML is important. It tells the LLM to behave like a voice assistant — keeping answers short and conversational. Without this, the model might produce long, written-style responses that sound awkward when read aloud.

You can customize the system prompt for your use case:

# For a cooking assistant
system_prompt: "You are a cooking assistant. Give brief recipe instructions. Use simple language."

# For a fitness coach
system_prompt: "You are a fitness coach. Give short, motivating exercise tips."

When testing the LLM stage, watch for response length. If answers run too long, reduce max_tokens. If they feel cut off, increase it slightly. For voice output, 100-200 tokens usually hits the right balance.
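As a rough rule of thumb for picking max_tokens, you can estimate spoken duration from the token budget. The constants below are approximations (~0.75 words per token, ~150 spoken words per minute), not anything Xybrid provides:

```dart
/// Rough estimate of how long a response of [maxTokens] takes to speak.
/// Assumes ~0.75 words per token and ~150 spoken words per minute.
double estimatedSpeechSeconds(int maxTokens) {
  const wordsPerToken = 0.75;
  const wordsPerMinute = 150;
  return maxTokens * wordsPerToken / wordsPerMinute * 60;
}
```

Under these assumptions, 256 tokens works out to roughly 75 seconds of speech, while 100 tokens lands near 30 seconds — which is why the 100-200 token range tends to feel right for voice output.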

Step 5: Add Voice Synthesis

The final stage takes the LLM’s text response and converts it to speech using Kokoro TTS. Kokoro produces high-quality, natural-sounding audio at 24kHz sample rate.

The voice parameter selects from Kokoro’s built-in voice embeddings. Each voice has a distinct character and tone. The af_heart voice used in our pipeline is a warm, conversational American English voice — a good fit for an assistant.

After the full pipeline runs, the result envelope contains the synthesized audio:

final result = await pipeline.run(
  envelope: Envelope.audio(bytes: audioBytes),
);

// The result contains synthesized audio bytes
final audioPlayer = AudioPlayer();
await audioPlayer.playBytes(result.audioBytes!);

Kokoro generates PCM audio data. The AudioPlayer handles conversion to the device’s audio output format. If you need to save the audio or process it further, the raw bytes are available directly from the envelope.
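If you want to keep a response around for caching or debugging, writing the raw bytes out is straightforward. A sketch using dart:io plus the path_provider package (an assumption here, not part of Xybrid — it needs its own pubspec entry); note the bytes are raw PCM, so the .pcm extension is deliberate:

```dart
import 'dart:io';
import 'dart:typed_data';

import 'package:path_provider/path_provider.dart';

/// Saves synthesized response audio into the app's documents directory
/// so it can be replayed or inspected later.
Future<File> saveResponseAudio(Uint8List audioBytes) async {
  final dir = await getApplicationDocumentsDirectory();
  final file = File(
      '${dir.path}/response_${DateTime.now().millisecondsSinceEpoch}.pcm');
  return file.writeAsBytes(audioBytes, flush: true);
}
```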

Step 6: Wire It All Together

Now let us build the complete voice agent UI. This widget ties together recording, pipeline execution, and audio playback into a single interaction flow.

import 'dart:typed_data';

import 'package:flutter/material.dart';
import 'package:flutter/services.dart' show rootBundle;
import 'package:xybrid_flutter/xybrid_flutter.dart';

class VoiceAgentScreen extends StatefulWidget {
  const VoiceAgentScreen({super.key});

  @override
  State<VoiceAgentScreen> createState() => _VoiceAgentScreenState();
}

class _VoiceAgentScreenState extends State<VoiceAgentScreen> {
  late final Pipeline _pipeline;
  bool _isListening = false;
  String _transcript = '';
  String _response = '';

  @override
  void initState() {
    super.initState();
    _initPipeline();
  }

  Future<void> _initPipeline() async {
    final pipelineYaml =
        await rootBundle.loadString('assets/voice_agent_pipeline.yaml');
    _pipeline = await Xybrid.pipeline(yamlContent: pipelineYaml).load();
    await _pipeline.loadModels();
  }

  Future<void> _processAudio(Uint8List audioBytes) async {
    final result = await _pipeline.run(
      envelope: Envelope.audio(bytes: audioBytes),
    );
    setState(() {
      _transcript = result.metadata['transcript'] ?? '';
      _response = result.metadata['generated_text'] ?? '';
    });
  }

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      body: Center(
        child: Column(
          mainAxisAlignment: MainAxisAlignment.center,
          children: [
            Text(_transcript, style: Theme.of(context).textTheme.bodyLarge),
            const SizedBox(height: 16),
            Text(_response, style: Theme.of(context).textTheme.headlineSmall),
          ],
        ),
      ),
      floatingActionButton: FloatingActionButton(
        onPressed: () => setState(() => _isListening = !_isListening),
        child: Icon(_isListening ? Icons.stop : Icons.mic),
      ),
    );
  }
}

The interaction flow works like this:

  1. The user taps the microphone button to start recording.
  2. They tap again to stop. The recorded bytes are passed to _processAudio.
  3. The pipeline runs all three stages sequentially: transcribe, generate, synthesize.
  4. The UI updates with the transcript and response text.
  5. The synthesized audio plays back through the device speaker.
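The FloatingActionButton in the widget above only flips _isListening; the actual recorder wiring looks something like this, where recorder stands in for whichever audio package you chose — start() and stopAndGetBytes() are the same placeholder interface used earlier, not a specific package's API:

```dart
Future<void> _toggleListening() async {
  if (_isListening) {
    // Stop recording and hand the captured audio to the pipeline.
    final Uint8List audioBytes = await recorder.stopAndGetBytes();
    setState(() => _isListening = false);
    await _processAudio(audioBytes);
  } else {
    // Start capturing 16 kHz mono audio from the microphone.
    await recorder.start();
    setState(() => _isListening = true);
  }
}
```

Point the button at it with onPressed: _toggleListening in place of the inline setState call.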

In a production app, you would add a loading indicator during pipeline execution and handle errors gracefully. You would also want to manage the audio recorder lifecycle more carefully — requesting microphone permissions, handling interruptions, and releasing resources.
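For the permission piece, the permission_handler package is one common option — shown here as an assumption, since it is not part of Xybrid and needs its own pubspec entry plus the usual platform manifest settings:

```dart
import 'package:permission_handler/permission_handler.dart';

/// Requests microphone access before the first recording.
/// Returns true only if the user granted the permission.
Future<bool> ensureMicPermission() async {
  final status = await Permission.microphone.request();
  return status.isGranted;
}
```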

Adding a Loading State

Pipeline execution takes a moment, so providing visual feedback matters:

bool _isProcessing = false;

Future<void> _processAudio(Uint8List audioBytes) async {
  setState(() => _isProcessing = true);

  try {
    final result = await _pipeline.run(
      envelope: Envelope.audio(bytes: audioBytes),
    );
    setState(() {
      _transcript = result.metadata['transcript'] ?? '';
      _response = result.metadata['generated_text'] ?? '';
    });

    // Play the audio response
    final audioPlayer = AudioPlayer();
    await audioPlayer.playBytes(result.audioBytes!);
  } finally {
    setState(() => _isProcessing = false);
  }
}

Performance Results

We benchmarked the complete pipeline on two devices to give you realistic expectations. These numbers reflect end-to-end latency for each stage, measured with models loaded in memory (not including first-load time).

iPhone 15 Pro (A17 Pro, 8 GB RAM)

| Stage | Model | Latency |
| --- | --- | --- |
| ASR | Whisper Tiny | ~200ms |
| LLM | Llama 3.2 1B (Q4) | ~500ms to first token |
| TTS | Kokoro 82M | ~150ms |
| Total | | ~850ms to first audio |

Pixel 8 Pro (Tensor G3, 12 GB RAM)

| Stage | Model | Latency |
| --- | --- | --- |
| ASR | Whisper Tiny | ~280ms |
| LLM | Llama 3.2 1B (Q4) | ~650ms to first token |
| TTS | Kokoro 82M | ~180ms |
| Total | | ~1.1s to first audio |

A few observations:

  • Under one second to first audio on iPhone makes the interaction feel responsive. Users perceive latencies below 1.2 seconds as “fast” for voice interfaces.
  • CoreML acceleration on iOS provides a noticeable boost for the ASR and TTS stages, which use ONNX models routed through Apple’s Neural Engine.
  • The LLM is the bottleneck. The 500-650ms to first token is the biggest chunk of the total latency. Streaming the LLM output to TTS (covered in “Next Steps”) can reduce perceived latency significantly.
  • Memory usage peaks around 1.8 GB with all three models loaded. On devices with 6 GB or more RAM, this leaves plenty of room for the rest of the app.

These benchmarks use quantized models (Q4 for the LLM, float16 for ASR and TTS). Full-precision models would be larger and slower without meaningful quality improvement for this use case.

Next Steps

You now have a working on-device voice agent. Here are three directions to take it further.

Conversation Memory

Right now, each interaction is stateless — the LLM does not remember previous exchanges. Xybrid supports conversation context that carries history across turns:

final context = ConversationContext();

// Each call appends to the conversation history
final result = await pipeline.run(
  envelope: Envelope.audio(bytes: audioBytes),
  context: context,
);

The ConversationContext manages a rolling window of past messages (default: 50 turns) so the LLM can reference earlier parts of the conversation. This turns your voice agent from a single-shot Q&A tool into a genuine conversational assistant.
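Putting that together, a multi-turn session keeps one context object alive across taps. Pipeline, Envelope, and ConversationContext are the names shown above; holding them in a small session class is our own structuring choice:

```dart
class AgentSession {
  AgentSession(this._pipeline);

  final Pipeline _pipeline;
  // One context per session: the runtime threads its history into each LLM call.
  final ConversationContext _context = ConversationContext();

  /// Runs one voice turn and returns the LLM's text response.
  Future<String> ask(Uint8List audioBytes) async {
    final result = await _pipeline.run(
      envelope: Envelope.audio(bytes: audioBytes),
      context: _context,
    );
    return result.metadata['generated_text'] ?? '';
  }
}
```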

Wake Word Detection

Tapping a button to start listening works, but a hands-free experience requires wake word detection. You can add a lightweight wake word model as a pre-stage that listens continuously and triggers the main pipeline only when it hears a specific phrase.
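Conceptually, that could look like one extra stage at the top of the pipeline YAML. Note that wake-word-tiny and the detect task below are placeholders to illustrate the shape, not models or tasks that ship with Xybrid:

```yaml
pipeline:
  name: voice-agent
  stages:
    - model: wake-word-tiny    # hypothetical wake word model
      task: detect
      config:
        phrase: "hey agent"    # only audio following this phrase continues
    - model: whisper-tiny
      task: transcribe
    # ... remaining stages as before
```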

Streaming Responses

The current implementation waits for the LLM to finish generating before starting TTS. With streaming, you can begin synthesizing audio as soon as the first sentence is complete. This reduces perceived latency by starting playback while the LLM is still generating the rest of the response. Xybrid’s pipeline system supports stage-level streaming for exactly this pattern.


The complete source code for this tutorial is available in the Xybrid examples repository. If you run into issues or want to share what you have built, open a discussion on the GitHub repo.
