Speech-to-Text in Umbraco.AI

Speech-to-text has landed in Umbraco.AI. Alongside chat and embeddings, transcription is now a first-class capability in the platform — with a new profile type, a core IAISpeechToTextService, a management API endpoint, and two polished UI integrations that drop straight into the backoffice.

The most visible changes are two new microphone buttons: one in Copilot, so you can talk to the AI instead of typing, and one in the Tiptap rich text toolbar, so editors can dictate straight into a content field. But as with file uploads, the feature runs deeper than the UI surface — everything sits on a shared, middleware-aware service that you can call from your own C# code just as easily.

Let’s start with the bits you’ll see first.

Dictating into Copilot

Inside the Copilot chat input there’s now a microphone button next to the send button. Click it once and it starts recording — click it again and the audio is transcribed and the resulting text is inserted into the input, ready for you to review, tweak, or send. The button pulses red while recording and shows a spinner while the transcription is running, so there’s always clear feedback on what’s happening.

[Image: dictating a message into Copilot]

It’s particularly nice when you’re thinking out loud about a piece of content — dictate a rough brief, let the AI respond, and iterate from there. No keyboard required.

Dictating into Rich Text

The Tiptap rich text editor now ships with a Dictate toolbar button. It works much the same way — click to start recording, click again to stop — but instead of populating a chat input, the transcribed text is inserted straight into the editor at the cursor position. That position is captured when recording starts, so even if focus moves elsewhere while you're speaking, the text lands exactly where you expect it to.

[Image: dictating straight into a rich text field]

Like every other Tiptap extension in Umbraco, you enable it per property editor configuration — add the Dictate button to any rich text toolbar that should support voice input.

Configuring a Speech-to-Text Profile

Both experiences are driven by a new speech-to-text profile type. If you’ve worked with chat or embedding profiles before, there are no surprises: you pick a connection, pick a model, and optionally set a BCP-47 language hint ("en", "de", "ja", and so on) to help the model when you already know what language the audio is in.

Under the hood there’s a new AICapability.SpeechToText capability, and providers opt in by implementing IAISpeechToTextCapability. The OpenAI provider ships with it out of the box, defaulting to gpt-4o-transcribe but also matching whisper-* models — so you can pick whichever option fits your latency, cost, and accuracy needs.
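
If you’re building your own provider, opting in is a matter of implementing that interface on top of your provider’s transcription API. The interface’s members aren’t spelled out here, so in the sketch below everything except the IAISpeechToTextCapability name is an illustrative placeholder:

public class MyProviderSpeechToTextCapability : IAISpeechToTextCapability
{
    // Placeholder member: the real interface shape may differ, so check
    // the actual IAISpeechToTextCapability definition before implementing.
    public Task<string> TranscribeAsync(
        Stream audio,
        string? language,   // BCP-47 hint from the profile, if one is set
        CancellationToken ct)
    {
        // Forward the audio to your provider's transcription endpoint
        // and return the transcribed text.
        throw new NotImplementedException();
    }
}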

Calling It From Code

Like chat and embeddings, speech-to-text has a dedicated service — IAISpeechToTextService — with a fluent builder API for inline use. Here’s the minimum you need to transcribe an audio stream from your own C# code:

public class VoiceNotesService
{
    private readonly IAISpeechToTextService _speechToTextService;

    public VoiceNotesService(IAISpeechToTextService speechToTextService)
    {
        _speechToTextService = speechToTextService;
    }

    public async Task<string> TranscribeAsync(Stream audio, CancellationToken ct)
    {
        // Run the audio through the "default-speech-to-text" profile; the
        // alias identifies this call site in telemetry and auditing.
        var response = await _speechToTextService.TranscribeAsync(
            stt => stt
                .WithAlias("voice-notes")
                .WithProfile("default-speech-to-text"),
            audio,
            ct);

        return response.Text;
    }
}

The builder gives you access to the full Umbraco.AI middleware pipeline — telemetry, auditing, notifications, guardrails — just like inline chat and inline agents. The WithAlias call is what anchors the execution in your observability story: the same alias always resolves to the same deterministic ID, so you can correlate transcription runs across time in your logs and telemetry.

If you need a longer-lived client — say you’re wiring speech-to-text into a middleware or a background queue — you can use CreateSpeechToTextClientAsync to get a configured ISpeechToTextClient back. There’s also a matching StreamTranscriptionAsync for providers that support streaming results.
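
Here’s a rough sketch of that longer-lived route. The ISpeechToTextClient name matches the Microsoft.Extensions.AI abstraction, so the sketch below assumes its GetTextAsync method; treat that, and the exact CreateSpeechToTextClientAsync signature, as assumptions to verify rather than gospel:

public class VoiceNoteQueueWorker
{
    private readonly IAISpeechToTextService _speechToTextService;

    public VoiceNoteQueueWorker(IAISpeechToTextService speechToTextService)
    {
        _speechToTextService = speechToTextService;
    }

    public async Task ProcessAsync(IEnumerable<Stream> audioStreams, CancellationToken ct)
    {
        // Resolve a configured client once and reuse it for the whole batch,
        // rather than going through the fluent builder per item. (The builder
        // shape is assumed to mirror the inline TranscribeAsync call above.)
        var client = await _speechToTextService.CreateSpeechToTextClientAsync(
            stt => stt
                .WithAlias("voice-notes-queue")
                .WithProfile("default-speech-to-text"),
            ct);

        foreach (var audio in audioStreams)
        {
            // Assumes Microsoft.Extensions.AI semantics: GetTextAsync returns
            // a response whose Text property holds the transcript.
            var response = await client.GetTextAsync(audio, cancellationToken: ct);
            // ... persist response.Text for the queued item ...
        }
    }
}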

The Management API

The same capability is available through the backoffice management API — and it’s how both the Copilot button and the Tiptap toolbar talk to the server:

POST /umbraco/ai/management/api/v1/speech-to-text/transcribe
    ?profileIdOrAlias={idOrAlias}
    &language={bcp47}

multipart/form-data:
    audioFile: <audio blob>

The response shape is intentionally simple:

{ "text": "..." }

The endpoint accepts the common browser-recordable audio MIME types — webm, wav, mp3, mp4, flac, m4a, ogg — and routes the request through the same middleware pipeline as the C# service, so every transcription benefits from whatever auditing, telemetry, and guardrails you already have configured.
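
You can also hit the endpoint yourself, for example from an integration test or an external tool. Here’s a minimal sketch using HttpClient; authentication is elided, but since the management API sits behind backoffice authentication you’ll need to supply a valid access token:

using System.Net.Http.Headers;
using System.Net.Http.Json;
using System.Text.Json;

public class TranscribeApiClient
{
    // Assumes _httpClient.BaseAddress points at the Umbraco site.
    private readonly HttpClient _httpClient;

    public TranscribeApiClient(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    public async Task<string?> TranscribeAsync(Stream audio, string accessToken, CancellationToken ct)
    {
        using var form = new MultipartFormDataContent();
        var audioContent = new StreamContent(audio);
        audioContent.Headers.ContentType = new MediaTypeHeaderValue("audio/webm");
        form.Add(audioContent, "audioFile", "recording.webm");

        using var request = new HttpRequestMessage(
            HttpMethod.Post,
            "/umbraco/ai/management/api/v1/speech-to-text/transcribe" +
            "?profileIdOrAlias=default-speech-to-text&language=en")
        {
            Content = form
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", accessToken);

        using var response = await _httpClient.SendAsync(request, ct);
        response.EnsureSuccessStatusCode();

        // Response body: { "text": "..." }
        var payload = await response.Content.ReadFromJsonAsync<JsonElement>(cancellationToken: ct);
        return payload.GetProperty("text").GetString();
    }
}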

Building Your Own Voice UI

If you’d rather build your own voice input experience, the frontend library (@umbraco-ai/core) exposes two small building blocks:

  • UaiAudioRecorder — a reactive wrapper around the browser’s MediaRecorder API that exposes a state$ observable you can bind directly into Lit templates
  • UaiSpeechToTextController — a controller that calls the management API and returns typed results

Used together inside a component, they look like this:

import { UaiAudioRecorder, UaiSpeechToTextController } from "@umbraco-ai/core";

// Both take the host element (`this`), Lit controller-style.
const recorder = new UaiAudioRecorder(this);
const stt = new UaiSpeechToTextController(this);

await recorder.start();
// ...later...
const audioBlob = await recorder.stop();

// Send the recorded blob to the management API, with an optional
// BCP-47 language hint.
const { data } = await stt.transcribe(audioBlob, { language: "en" });

console.log(data?.text);

Both the Copilot voice button (<uai-voice-button>) and the Tiptap dictate toolbar (<uai-dictate-tiptap-toolbar>) are built on these exact primitives — you can read them as worked examples.

What’s Next

Speech-to-text opens up a few directions we’re keen to explore: real-time streaming transcription into the chat (so you see words appear as you speak), voice-driven editing across more property editors, and eventually a matching text-to-speech capability for read-aloud experiences.

For now though, give it a try — we’d love to hear what it’s like to actually talk to Umbraco instead of typing at it.

Until next time 👋