refactor: split STT and Audio-LLM into separate interfaces (#5928)

4 weeks ago · 5ccba98adc
parent 238f27dea1
commit 5ccba98adc
25 changed files with 2940 additions and 408 deletions
--- a/docs/superpowers/plans/2026-05-02-stt-audiollm-split.md
+++ b/docs/superpowers/plans/2026-05-02-stt-audiollm-split.md
--- a/docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
+++ b/docs/superpowers/specs/2026-05-02-stt-audiollm-split-design.md
@@ -0,0 +1,511 @@
+# STT and Audio-LLM Split — Design Spec
+
+**Date:** 2026-05-02
+**Status:** Draft, pending user review
+
+## 1. Goal
+
+Refactor `internal/ai/` to split **speech-to-text (STT)** and **audio-multimodal-LLM (Audio-LLM)** into two separate Go interfaces, aligning with mainstream OSS conventions (Vercel AI SDK, LiteLLM, the Go AI ecosystem). Update the public API handler to dispatch to the right interface based on provider type. Make two small comment improvements to `proto/store/instance_setting.proto` and the generated bindings; **no proto field changes**.
+
+## 2. Non-Goals
+
+The following are intentionally **out of scope** for this design:
+
+- **`enabled` boolean field on `TranscriptionConfig`** (improvement #1 from the brainstorming) — keep using `provider_id == ""` as the disabled signal.
+- **Direction C (audio → structured note pipeline)** — auto-summarization / tag extraction. Independent future feature.
+- **Multi-provider STT with default-model selector** (Dify-style) — Memos has a single transcription config; that stays.
+- **Per-model credential overrides**, **load balancing**, **capability YAML schemas** — Dify-style enterprise complexity.
+- **Streaming transcription, retry policy, OpenAI Translations endpoint** — YAGNI.
+- **`gpt-4o-audio-preview` user-facing support** — the `audiollm/openai` package will be implementable after this refactor, but UI support is a follow-up.
+- **TTS** (text-to-speech) — different concern, not affected.
+
+## 3. Background
+
+The current `internal/ai/` has one `Transcriber` interface with two implementations: `openAITranscriber` (calls `/audio/transcriptions`, a real STT endpoint) and `geminiTranscriber` (calls `generateContent`, a multimodal-LLM endpoint dressed up to act like STT). This conflation has caused real symptoms:
+
+- `TranscribeResponse.Language` and `Duration` are silently empty for Gemini (Gemini multimodal doesn't return them).
+- The `Prompt` field has different semantics across providers — Whisper treats it as a soft hint that may be ignored; Gemini treats it as a literal instruction.
+- Gemini's multimodal failure modes (safety filter, token-truncation, refusals) are flattened to a single "did not include text" error.
+- Gemini-specific code (WebM transcoding, `maxGeminiInlineAudioSize`, `genai` SDK) lives in the same package as the OpenAI Whisper integration.
+
+The brainstorming session (this conversation, 2026-05-02) ran two rounds of OSS research to validate the corrective direction.
+
+## 4. Research Findings Summary
+
+Detailed findings in conversation history; abridged here for design accountability.
+
+### 4.1 SDK-Layer Research
+
+| Source | Key Decision |
+|---|---|
+| **Vercel AI SDK** (`vercel/ai`) | `TranscriptionModelV3` is implemented **only** by providers with a dedicated STT endpoint (OpenAI Whisper/gpt-4o-transcribe, Deepgram, ElevenLabs, AssemblyAI). **Google provider deliberately does not implement it** — Gemini audio rides through `generateText` with `FilePart`. Two completely separate code paths. No "source" discriminator. Provider id is `vendor.modality` (`openai.transcription`); model is a free string. |
+| **LiteLLM** (`BerriAI/litellm`) | `litellm.transcription()` only routes to providers with `/audio/transcriptions`-style endpoints. **Gemini is absent** from the transcription router (`litellm/llms/gemini/` has no `audio_transcription/` subdirectory). Multimodal audio rides through `litellm.completion()` with `{"type":"input_audio"}` content parts. Response is `text + usage`, no provider discriminator. |
+| **Go AI SDKs** (`cloudwego/eino`, `tmc/langchaingo`, `sashabaranov/go-openai`) | One package per provider; provider identity = import path; **no provider enum**. `Model` is opaque string. OpenAI-compatible endpoints handled via `BaseURL` config field, never via separate package. go-openai's `audio.go` is structurally separate from `chat.go`. |
+
+**Convergent finding:** All three ecosystems split STT and multimodal-audio into separate interfaces. None expose a "this came from a multimodal LLM" discriminator. None encode wire-format into the provider type enum.
+
+### 4.2 Application-Layer Research
+
+| Source | STT-Storage Design |
+|---|---|
+| **Open WebUI** | STT is a **flat singleton config block** (`audio.stt.*` namespace), completely separate from chat providers (`openai.*` namespace). `STT_ENGINE` enum dispatches; per-engine credentials side-by-side in one config. |
+| **LobeChat** | STT is a **separate global user setting** (`UserTTSConfig`). But credentials silently piggyback on the `openai` chat provider's `keyVaults` — author has marked the helper `@deprecated`. |
+| **Dify** | `ProviderEntity.supported_model_types` declares capabilities; STT is the `SPEECH2TEXT` enum value. STT info lives in a **separate "system model" config row** (`tenant_default_models(model_type='speech2text', provider_name, model_name)`) that **references** an existing provider. |
+
+**Convergent finding:** Zero apps add STT-specific fields onto the AI provider entity. All three keep providers capability-agnostic and put STT config in a separate place.
+
+### 4.3 Proto Schema Assessment
+
+The current `proto/store/instance_setting.proto` `InstanceAISetting` + `TranscriptionConfig` is **already aligned with the mainstream pattern**:
+
+- ✅ `AIProviderConfig` carries no STT-specific field (capability-agnostic)
+- ✅ `TranscriptionConfig` is a separate pointer (`provider_id` references a provider)
+- ✅ `AIProviderType` is vendor-level (`OPENAI`, `GEMINI`) — no wire-format suffix
+- ✅ `model` is a free string
+- ✅ Comments already document Whisper vs Gemini prompt semantics (though could be clearer)
+
+The proto schema requires **no field changes**. Only two comment improvements are needed (§6.5 below).
+
+## 5. Current State (Files Touched)
+
+```
+internal/ai/
+  ai.go                # ProviderType, ProviderConfig, errors
+  client.go            # NewTranscriber factory, transcriberOptions, normalizeEndpoint, requireAPIKey
+  transcription.go     # Transcriber interface, TranscribeRequest, TranscribeResponse
+  openai.go            # openAITranscriber → /audio/transcriptions
+  openai_test.go
+  gemini.go            # geminiTranscriber → generateContent (multimodal)
+  gemini_test.go
+  models.go            # DefaultOpenAITranscriptionModel, DefaultGeminiTranscriptionModel
+  resolver.go          # FindProvider
+  errors.go            # ErrProviderNotFound, ErrCapabilityUnsupported
+  audio/
+    webm.go            # IsWebMContentType, WebMOpusToWAV (used by Gemini path)
+    webm_test.go
+
+server/router/api/v1/
+  ai_service.go        # Transcribe handler (lines 42–123)
+
+proto/store/
+  instance_setting.proto    # InstanceAISetting, AIProviderConfig, AIProviderType, TranscriptionConfig
+
+web/src/components/Settings/
+  AISection.tsx        # Provider list UI, TranscriptionForm
+```
+
+The handler at `server/router/api/v1/ai_service.go:42` is the **single integration point** between proto config and the `internal/ai/` SDK. It already discards Language/Duration (returns `{Text}` only), so the response narrowing is already in place.
+
+## 6. Target Design
+
+### 6.1 Package Structure
+
+```
+internal/ai/
+  ai.go                # ProviderType (unchanged: OPENAI, GEMINI), ProviderConfig (unchanged)
+  resolver.go          # FindProvider (unchanged)
+  errors.go            # add ErrSTTNotSupported, ErrAudioLLMNotSupported
+  audio/
+    webm.go            # unchanged; stays in place, now consumed by audiollm/gemini
+    webm_test.go
+
+  stt/
+    stt.go             # Transcriber interface, Request, Response, Segment
+    factory.go         # NewTranscriber(cfg ai.ProviderConfig, opts...) (Transcriber, error)
+    options.go         # TranscriberOption, WithHTTPClient
+    openai/
+      openai.go        # openAITranscriber → POST /audio/transcriptions
+      openai_test.go
+
+  audiollm/
+    audiollm.go        # Model interface, Request, Response, FinishReason
+    factory.go         # NewModel(cfg ai.ProviderConfig, opts...) (Model, error)
+    options.go         # ModelOption, WithHTTPClient
+    gemini/
+      gemini.go        # geminiModel → POST :generateContent (multimodal audio)
+      gemini_test.go
+    # openai/ — NOT created in this refactor; reserved for future gpt-4o-audio support
+```
+
+**Rationale (Go-ecosystem convention, per §4.1):** one package per provider; provider identity is import path; capability is implied by which umbrella package (`stt` vs `audiollm`) you import from. The runtime dispatch (factory) is the only place that translates `ProviderConfig.Type` enum → concrete implementation.
+
+### 6.2 Interfaces
+
+#### `internal/ai/stt/stt.go`
+
+```go
+package stt
+
+import (
+    "context"
+    "io"
+)
+
+// Transcriber transcribes audio into text using a provider's dedicated STT endpoint
+// (e.g. OpenAI /audio/transcriptions). Implementations are deterministic STT —
+// they are NOT for multimodal LLMs that happen to accept audio input. For
+// multimodal audio understanding, see internal/ai/audiollm.
+type Transcriber interface {
+    Transcribe(ctx context.Context, req Request) (*Response, error)
+}
+
+type Request struct {
+    Audio       io.Reader
+    Size        int64
+    Filename    string
+    ContentType string  // IANA media type, e.g. "audio/wav"
+    Model       string  // provider-specific model id (e.g. "whisper-1", "gpt-4o-transcribe")
+    Prompt      string  // soft spelling/vocabulary hint (Whisper "prompt" parameter)
+    Language    string  // ISO 639-1, optional
+}
+
+type Response struct {
+    Text     string
+    Language string    // empty if provider did not return it (best-effort)
+    Segments []Segment // empty unless provider returned timestamps
+}
+
+type Segment struct {
+    Text    string
+    Start   float64
+    End     float64
+    Speaker string // empty unless using a diarization-capable model (e.g. gpt-4o-transcribe-diarize)
+}
+```
+
+#### `internal/ai/audiollm/audiollm.go`
+
+```go
+package audiollm
+
+import (
+    "context"
+    "io"
+)
+
+// Model invokes a multimodal LLM with audio input. Implementations call
+// chat-completions or generate-content style APIs that happen to accept audio.
+// They are NOT deterministic STT — outputs may be refused, truncated, or
+// rephrased per the LLM's behavior. For pure transcription, prefer
+// internal/ai/stt where available.
+type Model interface {
+    GenerateFromAudio(ctx context.Context, req Request) (*Response, error)
+}
+
+type Request struct {
+    Audio        io.Reader
+    Size         int64
+    ContentType  string
+    Model        string
+    Instructions string   // literal instruction the model is expected to follow
+    Temperature  *float32 // optional; nil leaves provider default
+}
+
+type Response struct {
+    Text         string
+    FinishReason FinishReason
+}
+
+type FinishReason string
+
+const (
+    FinishStop     FinishReason = "stop"     // model finished normally
+    FinishLength   FinishReason = "length"   // truncated by max-tokens
+    FinishSafety   FinishReason = "safety"   // safety filter blocked output
+    FinishOther    FinishReason = "other"    // anything else (incl. unknown)
+)
+```
+
+#### Factory dispatch
+
+```go
+// internal/ai/stt/factory.go
+package stt
+
+import (
+    "github.com/pkg/errors"
+    "github.com/usememos/memos/internal/ai"
+    "github.com/usememos/memos/internal/ai/stt/openai"
+)
+
+func NewTranscriber(cfg ai.ProviderConfig, opts ...TranscriberOption) (Transcriber, error) {
+    switch cfg.Type {
+    case ai.ProviderOpenAI:
+        return openai.New(cfg, applyOptions(opts...))
+    case ai.ProviderGemini:
+        return nil, errors.Wrap(ai.ErrSTTNotSupported,
+            "Gemini does not provide a dedicated STT endpoint; use audiollm.NewModel instead")
+    default:
+        return nil, errors.Wrapf(ai.ErrCapabilityUnsupported, "provider type %q", cfg.Type)
+    }
+}
+```
+
+```go
+// internal/ai/audiollm/factory.go
+package audiollm
+
+import (
+    "github.com/pkg/errors"
+    "github.com/usememos/memos/internal/ai"
+    "github.com/usememos/memos/internal/ai/audiollm/gemini"
+)
+
+func NewModel(cfg ai.ProviderConfig, opts ...ModelOption) (Model, error) {
+    switch cfg.Type {
+    case ai.ProviderGemini:
+        return gemini.New(cfg, applyOptions(opts...))
+    case ai.ProviderOpenAI:
+        // NOTE: gpt-4o-audio-preview support belongs here but is out of scope;
+        // see §2 (Non-Goals).
+        return nil, errors.Wrap(ai.ErrAudioLLMNotSupported,
+            "OpenAI multimodal audio (gpt-4o-audio) is not yet implemented in this codebase")
+    default:
+        return nil, errors.Wrapf(ai.ErrCapabilityUnsupported, "provider type %q", cfg.Type)
+    }
+}
+```
+
+### 6.3 Backend Handler Dispatch
+
+The handler at `server/router/api/v1/ai_service.go:Transcribe` dispatches based on `provider.Type`:
+
+```go
+func (s *APIV1Service) Transcribe(ctx context.Context, request *v1pb.TranscribeRequest) (*v1pb.TranscribeResponse, error) {
+    // ... existing config loading, provider resolution, audio reading ...
+
+    switch provider.Type {
+    case ai.ProviderOpenAI:
+        text, err := s.transcribeViaSTT(ctx, provider, transcriptionCfg, audio, contentType)
+        if err != nil {
+            return nil, status.Errorf(codes.Internal, "failed to transcribe: %v", err)
+        }
+        return &v1pb.TranscribeResponse{Text: text}, nil
+
+    case ai.ProviderGemini:
+        text, err := s.transcribeViaAudioLLM(ctx, provider, transcriptionCfg, audio, contentType)
+        if err != nil {
+            return nil, status.Errorf(codes.Internal, "failed to transcribe: %v", err)
+        }
+        return &v1pb.TranscribeResponse{Text: text}, nil
+
+    default:
+        return nil, status.Errorf(codes.FailedPrecondition,
+            "provider type %q is not supported for transcription", provider.Type)
+    }
+}
+
+func (s *APIV1Service) transcribeViaSTT(
+    ctx context.Context,
+    provider ai.ProviderConfig,
+    cfg *storepb.TranscriptionConfig,
+    audio io.Reader,
+    contentType string,
+) (string, error) {
+    t, err := stt.NewTranscriber(provider)
+    if err != nil {
+        return "", err
+    }
+    resp, err := t.Transcribe(ctx, stt.Request{
+        Audio:       audio,
+        Filename:    "audio",
+        ContentType: contentType,
+        Model:       resolveModel(provider, cfg.Model),
+        Prompt:      cfg.Prompt, // Whisper: soft hint, may be ignored
+        Language:    cfg.Language,
+    })
+    if err != nil {
+        return "", err
+    }
+    return resp.Text, nil
+}
+
+func (s *APIV1Service) transcribeViaAudioLLM(
+    ctx context.Context,
+    provider ai.ProviderConfig,
+    cfg *storepb.TranscriptionConfig,
+    audio io.Reader,
+    contentType string,
+) (string, error) {
+    m, err := audiollm.NewModel(provider)
+    if err != nil {
+        return "", err
+    }
+    resp, err := m.GenerateFromAudio(ctx, audiollm.Request{
+        Audio:        audio,
+        ContentType:  contentType,
+        Model:        resolveModel(provider, cfg.Model),
+        Instructions: buildTranscriptionInstructions(cfg.Prompt, cfg.Language),
+    })
+    if err != nil {
+        return "", err
+    }
+    // Surface multimodal failure modes instead of returning partial text.
+    if resp.FinishReason != audiollm.FinishStop {
+        return "", errors.Errorf("transcription incomplete (finish reason: %s)", resp.FinishReason)
+    }
+    return resp.Text, nil
+}
+```
+
+`buildTranscriptionInstructions` lives next to the handler and centralizes the literal instruction sent to multimodal LLMs:
+
+```go
+func buildTranscriptionInstructions(prompt, language string) string {
+    parts := []string{
+        "Transcribe the audio accurately. Return only the transcript text. " +
+        "Do not summarize, explain, or add content that is not spoken.",
+    }
+    if language != "" {
+        parts = append(parts, fmt.Sprintf("The input language is %s.", language))
+    }
+    if prompt != "" {
+        parts = append(parts, "Context and spelling hints:\n"+prompt)
+    }
+    return strings.Join(parts, "\n\n")
+}
+```
+
+`resolveModel` returns the configured model (`cfg.Model`) if it is non-empty; otherwise it returns the per-provider default from `internal/ai/models.go` (unchanged from today).
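The fallback that `resolveModel` is described as performing could look roughly like the sketch below. This is an illustration only: `ProviderType` stands in for the real `ai.ProviderType`, the signature takes the provider type directly rather than an `ai.ProviderConfig`, the constant names are hypothetical, and trimming whitespace is an assumption. The default model names are the ones quoted in the existing proto comment.

```go
package main

import "strings"

// ProviderType is a stand-in for ai.ProviderType (assumption for this sketch).
type ProviderType string

const (
	ProviderOpenAI ProviderType = "OPENAI"
	ProviderGemini ProviderType = "GEMINI"
)

// Hypothetical constant names; the values come from the proto comment
// ("whisper-1 for OPENAI providers, gemini-2.5-flash for GEMINI providers").
const (
	defaultOpenAIModel = "whisper-1"
	defaultGeminiModel = "gemini-2.5-flash"
)

// resolveModel returns the configured model if one is set, otherwise the
// default for the given provider type. Unknown provider types yield "".
func resolveModel(providerType ProviderType, configured string) string {
	if m := strings.TrimSpace(configured); m != "" {
		return m
	}
	switch providerType {
	case ProviderOpenAI:
		return defaultOpenAIModel
	case ProviderGemini:
		return defaultGeminiModel
	default:
		return ""
	}
}
```

Because the model string is treated as opaque, a configured value such as `gpt-4o-transcribe` passes through untouched; only the empty case consults the provider type.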
+
+### 6.4 Implementation Notes Per Package
+
+#### `internal/ai/stt/openai/openai.go`
+
+- Identical wire behavior to current `internal/ai/openai.go::openAITranscriber.Transcribe`.
+- Uses `github.com/openai/openai-go/v3` SDK (already a dep).
+- Defaults `endpoint` to `https://api.openai.com/v1`. Trims trailing slash. Validates URL.
+- Honors `cfg.Endpoint` to support OpenAI-compatible providers (Groq Whisper, faster-whisper self-hosted, Azure Whisper deployments). The user simply adds another `AIProviderConfig` row with `Type=OPENAI` and a different `Endpoint`.
+- Supports any model the underlying endpoint accepts: `whisper-1`, `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-transcribe-diarize`, etc. The model string is opaque.
+- Returns `Response.Language` and `Response.Segments` populated when the API returns them; otherwise empty.
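The endpoint handling in the bullets above (default, trim trailing slash, validate) can be sketched as follows. This is a hedged illustration, not the existing `normalizeEndpoint` from `client.go`: the function body, the constant name, and the exact validation rule (absolute `http`/`https` URL with a host) are assumptions.

```go
package main

import (
	"errors"
	"net/url"
	"strings"
)

// defaultOpenAIEndpoint is the default named in the spec; the constant name
// itself is hypothetical.
const defaultOpenAIEndpoint = "https://api.openai.com/v1"

// normalizeEndpoint applies the three rules from the spec: fall back to the
// default when empty, trim trailing slashes, and reject values that are not
// absolute http(s) URLs. The validation rule is an assumption of this sketch.
func normalizeEndpoint(raw string) (string, error) {
	endpoint := strings.TrimSpace(raw)
	if endpoint == "" {
		return defaultOpenAIEndpoint, nil
	}
	endpoint = strings.TrimRight(endpoint, "/")
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return "", errors.New("endpoint must be an absolute http(s) URL")
	}
	return endpoint, nil
}
```

With this shape, an OpenAI-compatible provider only needs a different `Endpoint` value on its `AIProviderConfig` row; no separate package or provider type is involved.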
+
+#### `internal/ai/audiollm/gemini/gemini.go`
+
+- Identical wire behavior to current `internal/ai/gemini.go::geminiTranscriber.Transcribe`, EXCEPT:
+  - Reads `Instructions` from the caller (not hardcoded inside the package).
+  - Maps `genai.FinishReason` to `audiollm.FinishReason` (`STOP→FinishStop`, `MAX_TOKENS→FinishLength`, `SAFETY→FinishSafety`, anything else → `FinishOther`).
+  - Returns `Response{Text, FinishReason}` instead of swallowing the finish reason into a generic error.
+- Continues to use `internal/ai/audio.WebMOpusToWAV` for WebM transcoding (Gemini doesn't accept WebM).
+- Continues to enforce `maxGeminiInlineAudioSize` (14 MiB) — File API support is out of scope.
+- Uses `google.golang.org/genai` SDK (already a dep).
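The finish-reason mapping listed above might be written along these lines. To stay self-contained the sketch switches on plain strings; it assumes the provider reports the reason using the names in the bullet (`STOP`, `MAX_TOKENS`, `SAFETY`), whereas the real package would presumably switch on the `genai` SDK's own type. `mapFinishReason` is a hypothetical helper name.

```go
package main

// FinishReason mirrors audiollm.FinishReason from §6.2.
type FinishReason string

const (
	FinishStop   FinishReason = "stop"   // model finished normally
	FinishLength FinishReason = "length" // truncated by max-tokens
	FinishSafety FinishReason = "safety" // safety filter blocked output
	FinishOther  FinishReason = "other"  // anything else (incl. unknown)
)

// mapFinishReason converts a provider-reported finish reason into the
// provider-neutral FinishReason. Anything unrecognized, including the empty
// string, collapses to FinishOther so callers never see a raw provider value.
func mapFinishReason(raw string) FinishReason {
	switch raw {
	case "STOP":
		return FinishStop
	case "MAX_TOKENS":
		return FinishLength
	case "SAFETY":
		return FinishSafety
	default:
		return FinishOther
	}
}
```

Keeping the default branch as `FinishOther` means new provider-side reasons degrade to "incomplete" in the handler rather than being mistaken for a normal stop.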
+
+#### `internal/ai/errors.go`
+
+Add:
+```go
+var ErrSTTNotSupported = errors.New("provider does not support speech-to-text capability")
+var ErrAudioLLMNotSupported = errors.New("provider does not support multimodal audio capability")
+```
+Keep existing `ErrProviderNotFound` and `ErrCapabilityUnsupported`.
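Because the factories wrap these sentinels, callers can still detect them through the wrapping. The sketch below shows the idea with the standard library (`fmt.Errorf` with `%w` and `errors.Is`) in place of the `github.com/pkg/errors` wrapping used in the spec, so that it compiles without external dependencies. `newTranscriber` and `isCapabilityError` are hypothetical stand-ins, not functions from the codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors as proposed for internal/ai/errors.go.
var (
	ErrSTTNotSupported      = errors.New("provider does not support speech-to-text capability")
	ErrAudioLLMNotSupported = errors.New("provider does not support multimodal audio capability")
)

// newTranscriber is a hypothetical stand-in for stt.NewTranscriber that only
// reports whether the provider type can serve dedicated STT.
func newTranscriber(providerType string) error {
	if providerType == "GEMINI" {
		// %w keeps the sentinel reachable through errors.Is.
		return fmt.Errorf("gemini has no dedicated STT endpoint: %w", ErrSTTNotSupported)
	}
	return nil
}

// isCapabilityError reports whether err (or anything it wraps) is one of the
// capability sentinels, regardless of the message added by the wrapper.
func isCapabilityError(err error) bool {
	return errors.Is(err, ErrSTTNotSupported) || errors.Is(err, ErrAudioLLMNotSupported)
}
```

A handler could use a check like `isCapabilityError` to return a precondition-style status for misconfiguration while keeping an internal status for genuine provider failures; that mapping is a design option, not something the spec prescribes.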
+
+### 6.5 Proto Schema Changes
+
+**Two comment-only updates. No field additions, no field renames, no breaking changes.**
+
+#### Improvement #2 — `TranscriptionConfig.model` comment
+
+Replace lines 179–181 of `proto/store/instance_setting.proto`:
+
+```proto
+  // model is the provider-specific model identifier.
+  // Empty string falls back to the engine default
+  // (whisper-1 for OPENAI providers, gemini-2.5-flash for GEMINI providers).
+  string model = 2;
+```
+
+with:
+
+```proto
+  // model is the provider-specific model identifier.
+  // Empty string falls back to the engine default.
+  // OPENAI examples:
+  //   - whisper-1 (legacy, lower cost)
+  //   - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
+  //   - gpt-4o-transcribe-diarize (includes speaker labels)
+  // GEMINI examples:
+  //   - gemini-2.5-flash (default, multimodal call)
+  //   - gemini-2.5-pro
+  string model = 2;
+```
+
+**Rationale:** OpenAI's `/audio/transcriptions` endpoint now supports the `gpt-4o-transcribe` family in addition to `whisper-1`. The current comment is misleading — it implies Whisper is the only OpenAI option.
+
+#### Improvement #3 — `TranscriptionConfig.prompt` comment
+
+Replace lines 188–191:
+
+```proto
+  // prompt is a default spelling/vocabulary hint passed to the provider.
+  // Used as the OpenAI Whisper "prompt" parameter and folded into the Gemini
+  // generation prompt as a "Context and spelling hints" block.
+  string prompt = 4;
+```
+
+with:
+
+```proto
+  // prompt is a default spelling/vocabulary hint passed to the provider.
+  // Used as the OpenAI Whisper "prompt" parameter (a soft hint that the model
+  // may ignore) and folded into the Gemini generation prompt as a "Context and
+  // spelling hints" block (which the LLM will treat more literally).
+  string prompt = 4;
+```
+
+**Rationale:** Same field, two semantically different behaviors. Surfacing this in the schema documentation (which propagates to the generated Go doc comments and the TypeScript JSDoc) makes the cross-provider variability explicit for any caller reading the bindings cold.
+
+After editing the proto, regenerate via `cd proto && buf format -w && buf generate`. The two regenerated files are:
+- `proto/gen/store/instance_setting.pb.go`
+- `web/src/types/proto/store/instance_setting_pb.ts`
+
+#### Why NOT add an `enabled` field (improvement #1)
+
+Out of scope per §2. Doing it would add a new field that the frontend, backend, and migration logic all need to handle, for the sole benefit of letting users "disable but keep the config." The current `provider_id == ""` semantics work; the cost of the change exceeds the benefit at this moment.
+
+### 6.6 Frontend Impact
+
+Minimal. `web/src/components/Settings/AISection.tsx` already:
+
+- Switches the model placeholder per provider (`placeholderForProvider` at line 371, using `setting.ai.transcription-model-placeholder-gemini` / `-openai`).
+- Disables the form when `providerId == ""`.
+- Validates that the referenced provider exists.
+
+Recommended adjustments (in scope):
+
+1. **Update i18n model placeholder strings** in `web/src/locales/en.json` to reflect the new model examples:
+   - `setting.ai.transcription-model-placeholder-openai`: include `gpt-4o-transcribe` family alongside `whisper-1`.
+   - `setting.ai.transcription-model-placeholder-gemini`: confirm `gemini-2.5-flash` is the listed example.
+2. **Update the prompt help text** (`setting.ai.transcription-prompt-help`) to note the cross-provider semantic difference, mirroring the new proto comment in user-facing language.
+
+No structural component changes. No new fields. No state-shape changes.
+
+### 6.7 What Stays Identical
+
+- Database storage (`InstanceSetting` rows, `AISetting` blob) — proto field tags unchanged.
+- API surface (`TranscribeRequest`, `TranscribeResponse` messages) — unchanged.
+- gRPC/Connect endpoint paths — unchanged.
+- Frontend state shape (`LocalTranscription`) — unchanged.
+- All existing tests semantically unchanged (will be ported to new package paths).
+
+## 7. Migration Path
+
+The refactor is internal to the Go server. End-to-end behavior is preserved. Migration is staged so each stage is independently buildable, testable, and revertable.
+
+| Stage | What | Compiles? | Tests pass? |
+|---|---|---|---|
+| A | Add `internal/ai/stt/` and `internal/ai/audiollm/` with new interfaces and (empty) factories. Add new errors. | ✅ | ✅ (no callers yet) |
+| B | Implement `internal/ai/stt/openai/` — port behavior from current `openai.go::openAITranscriber`. Port tests to `stt/openai/openai_test.go`. | ✅ | ✅ |
+| C | Implement `internal/ai/audiollm/gemini/` — port behavior from current `gemini.go::geminiTranscriber`, but: lift instructions out into the caller, return `FinishReason` instead of swallowing it. Port tests. | ✅ | ✅ |
+| D | Refactor `server/router/api/v1/ai_service.go::Transcribe` to dispatch via the new factories. Add `transcribeViaSTT` and `transcribeViaAudioLLM`. Add `buildTranscriptionInstructions`. | ✅ | ✅ |
+| E | Delete old files: `internal/ai/transcription.go`, `client.go`, `openai.go`, `openai_test.go`, `gemini.go`, `gemini_test.go`. | ✅ | ✅ |
+| F | Update proto comments (#2 and #3), run `buf format -w && buf generate`. | ✅ | ✅ |
+| G | Update `web/src/locales/en.json` strings for model placeholders and prompt help. | ✅ | ✅ |
+
+Each stage is one commit. Reverting any single stage leaves the system in a working state.
+
+## 8. Anti-Patterns Avoided (and Why)
+
+| Anti-pattern | Where it would have come from | Why we're avoiding it |
+|---|---|---|
+| `ProviderType` enum with wire-format suffix (`OPENAI_TRANSCRIPTIONS`, `OPENAI_CHAT_AUDIO`) | Earlier brainstorming draft | Vercel/LiteLLM/Go ecosystem all use vendor-level identity; capability is implied by which interface you call. |
+| `Response.Source` enum (`NativeSTT`, `MultimodalLLM`) | Earlier brainstorming draft | None of the three SDKs surveyed has this. It re-introduces the "pretend STT" smell at a different layer. |
+| Adapter wrapping `audiollm.Model` as `stt.Transcriber` | Earlier brainstorming draft | Adapter would re-create the conflation we're trying to remove. Application-layer dispatch is honest. |
+| Adding `transcription_*` fields to `AIProviderConfig` | Naive instinct | Three of three OSS apps surveyed (Open WebUI, LobeChat, Dify) do **not** do this. Pollutes the provider entity; repeats with every new capability. |
+| Silently reusing chat provider credentials for STT (LobeChat's deprecated pattern) | LobeChat-style shortcut | LobeChat's own author marked the helper `@deprecated`. Memos's existing `provider_id` reference is more flexible (user can configure a different OpenAI-compatible endpoint, e.g. Groq, just for STT). |
+| Per-model credential overrides, capability YAML, load balancing | Dify | Enterprise complexity that doesn't fit Memos's scope. |
+| Auto-fallback from STT failure to multimodal-LLM transcription | Plausible "smart" idea | LiteLLM doesn't do this; failure modes and cost differ enough that fallback would surprise users. Explicit dispatch by provider type is what LiteLLM ships. |
+
+## 9. Open Decisions
+
+All resolved during brainstorming. None remain open. For the record:
+
+1. **Direction A (split STT/Audio-LLM into separate interfaces) over Direction B (capability-flag system) over Direction C (audio-to-structured-note pipeline).** Resolved: A. Rationale: most honest abstraction, matches mainstream SDKs, leaves the door open to C as a future addition without rework.
+2. **Provider type naming: vendor-level (`openai`/`gemini`) over wire-format-encoded.** Resolved: vendor-level. Rationale: matches Vercel/LiteLLM/Go convention; new OpenAI transcription model snapshots require zero schema or code changes.
+3. **`Duration` field on the transcription response.** Not present in current proto; not added. Audio duration belongs to resource metadata (computed from the file at upload time), not to the transcription response.
+4. **Multimodal failure-mode surface.** Resolved: expose `FinishReason` from `audiollm.Model` to the application layer; the Transcribe handler converts non-`Stop` reasons into informative errors.
+
+## 10. Implementation Plan Pointer
+
+Once this spec is approved, the implementation plan will be created at `docs/superpowers/plans/2026-05-02-stt-audiollm-split.md` covering Stages A–G from §7 above as discrete, bite-sized tasks with TDD steps and per-stage commits.
--- a/go.mod
+++ b/go.mod
@@ -4,6 +4,7 @@ go 1.26.2

 require (
 	connectrpc.com/connect v1.19.2
+	github.com/at-wat/ebml-go v0.18.0
 	github.com/aws/aws-sdk-go-v2 v1.41.6
 	github.com/aws/aws-sdk-go-v2/config v1.32.16
 	github.com/aws/aws-sdk-go-v2/credentials v1.19.15
@@ -20,6 +21,7 @@ require (
 	github.com/mark3labs/mcp-go v0.49.0
 	github.com/moby/moby/api v1.54.2
 	github.com/openai/openai-go/v3 v3.32.0
+	github.com/pion/opus v0.0.0-20260430223319-81a9c5dc5013
 	github.com/pkg/errors v0.9.1
 	github.com/spf13/cobra v1.10.2
 	github.com/spf13/viper v1.21.0
--- a/go.sum
+++ b/go.sum
@@ -20,6 +20,8 @@ github.com/Microsoft/go-winio v0.6.2 h1:F2VQgta7ecxGYO8k3ZZz3RS8fVIXVxONVUPlNERo
 github.com/Microsoft/go-winio v0.6.2/go.mod h1:yd8OoFMLzJbo9gZq8j5qaps8bJ9aShtEA8Ipt1oGCvU=
 github.com/antlr4-go/antlr/v4 v4.13.1 h1:SqQKkuVZ+zWkMMNkjy5FZe5mr5WURWnlpmOuzYWrPrQ=
 github.com/antlr4-go/antlr/v4 v4.13.1/go.mod h1:GKmUxMtwp6ZgGwZSva4eWPC5mS6vUAmOABFgjdkM7Nw=
+github.com/at-wat/ebml-go v0.18.0 h1:SNkpBFR4jCQV1rI4Bm1tSuIYnusxe2qQ4GHJia9eQg4=
+github.com/at-wat/ebml-go v0.18.0/go.mod h1:w1cJs7zmGsb5nnSvhWGKLCxvfu4FVx5ERvYDIalj1ww=
 github.com/aws/aws-sdk-go-v2 v1.41.6 h1:1AX0AthnBQzMx1vbmir3Y4WsnJgiydmnJjiLu+LvXOg=
 github.com/aws/aws-sdk-go-v2 v1.41.6/go.mod h1:dy0UzBIfwSeot4grGvY1AqFWN5zgziMmWGzysDnHFcQ=
 github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.9 h1:adBsCIIpLbLmYnkQU+nAChU5yhVTvu5PerROm+/Kq2A=
@@ -195,6 +197,8 @@ github.com/opencontainers/image-spec v1.1.1 h1:y0fUlFfIZhPF1W537XOLg0/fcx6zcHCJw
 github.com/opencontainers/image-spec v1.1.1/go.mod h1:qpqAh3Dmcf36wStyyWU+kCeDgrGnAve2nCC8+7h8Q0M=
 github.com/pelletier/go-toml/v2 v2.3.0 h1:k59bC/lIZREW0/iVaQR8nDHxVq8OVlIzYCOJf421CaM=
 github.com/pelletier/go-toml/v2 v2.3.0/go.mod h1:2gIqNv+qfxSVS7cM2xJQKtLSTLUE9V8t9Stt+h56mCY=
+github.com/pion/opus v0.0.0-20260430223319-81a9c5dc5013 h1:HDxWSNNH8R5G+y1xGM8AVsSu95rAmoOnVSdPTzoAtoI=
+github.com/pion/opus v0.0.0-20260430223319-81a9c5dc5013/go.mod h1:t5Xog2n682JnawoykACE6nKVmupFvmJvkpM7x6bTv6g=
 github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
 github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
 github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 h1:Jamvg5psRIccs7FGNTlIRMkT8wgtp5eCXdBlqhYGL6U=
--- a/internal/ai/audio/webm.go
+++ b/internal/ai/audio/webm.go
@@ -0,0 +1,161 @@
+// Package audio provides audio container/codec helpers for AI providers.
+//
+// The motivating use case is Gemini transcription: Gemini's audio inputs
+// require WAV/MP3/AIFF/AAC/OGG/FLAC, but browser MediaRecorder defaults to
+// WebM/Opus. This package converts WebM/Opus into 16-bit PCM WAV using
+// pure-Go decoders — no ffmpeg or other system dependency.
+package audio
+
+import (
+	"bytes"
+	"encoding/binary"
+	"io"
+	"strings"
+
+	"github.com/at-wat/ebml-go"
+	"github.com/at-wat/ebml-go/webm"
+	"github.com/pion/opus"
+	"github.com/pkg/errors"
+)
+
+const (
+	opusOutputSampleRate = 48000
+	// maxOpusPacketSamples is Opus's spec maximum: 120 ms at 48 kHz.
+	maxOpusPacketSamples = 5760
+	// opusCodecID is the WebM TrackEntry CodecID for an Opus audio track.
+	opusCodecID = "A_OPUS"
+	// opusHeadMinLength is the minimum size of the OpusHead identification
+	// header stored in TrackEntry.CodecPrivate.
+	opusHeadMinLength = 19
+)
+
+// WebMOpusToWAV decodes a WebM/Opus file into 16-bit PCM WAV bytes.
+//
+// The output is mono or stereo at 48 kHz (Opus's native decode rate),
+// regardless of the original encoder's hint. Pre-skip samples declared in
+// the OpusHead are discarded to avoid the encoder's startup padding.
+//
+// The function reads the entire WebM document into memory; callers should
+// enforce their own size limits before invoking it.
+func WebMOpusToWAV(input []byte) ([]byte, error) {
+	var doc struct {
+		Header  webm.EBMLHeader `ebml:"EBML"`
+		Segment webm.Segment    `ebml:"Segment"`
+	}
+	if err := ebml.Unmarshal(bytes.NewReader(input), &doc); err != nil && !errors.Is(err, io.EOF) {
+		return nil, errors.Wrap(err, "parse webm")
+	}
+
+	track := findOpusTrack(doc.Segment.Tracks.TrackEntry)
+	if track == nil {
+		return nil, errors.New("webm has no Opus audio track")
+	}
+	if len(track.CodecPrivate) < opusHeadMinLength {
+		return nil, errors.Errorf("invalid OpusHead: expected at least %d bytes, got %d", opusHeadMinLength, len(track.CodecPrivate))
+	}
+
+	channels := int(track.Audio.Channels)
+	if channels < 1 || channels > 2 {
+		return nil, errors.Errorf("unsupported Opus channel count: %d", channels)
+	}
+	preSkip := int(binary.LittleEndian.Uint16(track.CodecPrivate[10:12]))
+
+	decoder := opus.NewDecoder()
+	if err := decoder.Init(opusOutputSampleRate, channels); err != nil {
+		return nil, errors.Wrap(err, "init opus decoder")
+	}
+
+	pcm := make([]int16, 0, 1<<16)
+	frame := make([]int16, maxOpusPacketSamples*channels)
+
+	decodeBlock := func(block ebml.Block) error {
+		if block.TrackNumber != track.TrackNumber {
+			return nil
+		}
+		for _, packet := range block.Data {
+			if len(packet) == 0 {
+				continue
+			}
+			n, err := decoder.DecodeToInt16(packet, frame)
+			if err != nil {
+				return errors.Wrap(err, "decode opus packet")
+			}
+			pcm = append(pcm, frame[:n*channels]...)
+		}
+		return nil
+	}
+
+	for _, cluster := range doc.Segment.Cluster {
+		for _, sb := range cluster.SimpleBlock {
+			if err := decodeBlock(sb); err != nil {
+				return nil, err
+			}
+		}
+		for _, bg := range cluster.BlockGroup {
+			if err := decodeBlock(bg.Block); err != nil {
+				return nil, err
+			}
+		}
+	}
+
+	skip := preSkip * channels
+	if skip > len(pcm) {
+		skip = len(pcm)
+	}
+	pcm = pcm[skip:]
+
+	return encodeWAV(pcm, opusOutputSampleRate, channels), nil
+}
+
+// IsWebMContentType reports whether the MIME type is WebM audio.
+// Parameters are ignored: "audio/webm" and "audio/webm; codecs=opus" both match.
+func IsWebMContentType(contentType string) bool {
+	contentType = strings.TrimSpace(contentType)
+	if contentType == "" {
+		return false
+	}
+	if i := strings.IndexByte(contentType, ';'); i >= 0 {
+		contentType = contentType[:i]
+	}
+	return strings.EqualFold(strings.TrimSpace(contentType), "audio/webm")
+}
+
+func findOpusTrack(entries []webm.TrackEntry) *webm.TrackEntry {
+	for i := range entries {
+		entry := &entries[i]
+		if entry.CodecID == opusCodecID && entry.Audio != nil {
+			return entry
+		}
+	}
+	return nil
+}
+
+// encodeWAV writes a standard RIFF/WAVE container around 16-bit PCM samples.
+// Reference layout: http://soundfile.sapp.org/doc/WaveFormat/
+func encodeWAV(samples []int16, sampleRate, channels int) []byte {
+	const bitsPerSample = 16
+	const bytesPerSample = bitsPerSample / 8
+	blockAlign := channels * bytesPerSample
+	byteRate := sampleRate * blockAlign
+	dataSize := len(samples) * bytesPerSample
+
+	buf := bytes.NewBuffer(make([]byte, 0, 44+dataSize))
+	buf.WriteString("RIFF")
+	_ = binary.Write(buf, binary.LittleEndian, uint32(36+dataSize))
+	buf.WriteString("WAVE")
+
+	buf.WriteString("fmt ")
+	_ = binary.Write(buf, binary.LittleEndian, uint32(16))
+	_ = binary.Write(buf, binary.LittleEndian, uint16(1)) // PCM
+	_ = binary.Write(buf, binary.LittleEndian, uint16(channels))
+	_ = binary.Write(buf, binary.LittleEndian, uint32(sampleRate))
+	_ = binary.Write(buf, binary.LittleEndian, uint32(byteRate))
+	_ = binary.Write(buf, binary.LittleEndian, uint16(blockAlign))
+	_ = binary.Write(buf, binary.LittleEndian, uint16(bitsPerSample))
+
+	buf.WriteString("data")
+	_ = binary.Write(buf, binary.LittleEndian, uint32(dataSize))
+	_ = binary.Write(buf, binary.LittleEndian, samples)
+
+	return buf.Bytes()
+}
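Reviewer sketch, not part of this commit: the 44-byte header that `encodeWAV` writes can be checked in isolation. The stand-alone program below re-implements the same layout under the hypothetical names `encodeWAVSketch` and `headerUint32`, using only the standard library. It assumes the canonical PCM layout referenced in the comment above and is meant as an illustration, not as the code in `internal/ai/audio`.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeWAVSketch is a hypothetical stand-alone copy of the layout used by
// encodeWAV: a 44-byte RIFF/WAVE header followed by little-endian 16-bit PCM.
func encodeWAVSketch(samples []int16, sampleRate, channels int) []byte {
	const bytesPerSample = 2
	blockAlign := channels * bytesPerSample
	dataSize := len(samples) * bytesPerSample

	buf := bytes.NewBuffer(make([]byte, 0, 44+dataSize))
	buf.WriteString("RIFF")
	_ = binary.Write(buf, binary.LittleEndian, uint32(36+dataSize)) // file size minus 8
	buf.WriteString("WAVE")
	buf.WriteString("fmt ")
	_ = binary.Write(buf, binary.LittleEndian, uint32(16)) // size of the fmt chunk
	_ = binary.Write(buf, binary.LittleEndian, uint16(1))  // audio format 1 = PCM
	_ = binary.Write(buf, binary.LittleEndian, uint16(channels))
	_ = binary.Write(buf, binary.LittleEndian, uint32(sampleRate))
	_ = binary.Write(buf, binary.LittleEndian, uint32(sampleRate*blockAlign)) // byte rate
	_ = binary.Write(buf, binary.LittleEndian, uint16(blockAlign))
	_ = binary.Write(buf, binary.LittleEndian, uint16(16)) // bits per sample
	buf.WriteString("data")
	_ = binary.Write(buf, binary.LittleEndian, uint32(dataSize))
	_ = binary.Write(buf, binary.LittleEndian, samples)
	return buf.Bytes()
}

// headerUint32 reads a little-endian uint32 at the given byte offset.
func headerUint32(wav []byte, offset int) uint32 {
	return binary.LittleEndian.Uint32(wav[offset : offset+4])
}

func main() {
	// Two stereo frames: four interleaved samples, eight bytes of PCM.
	wav := encodeWAVSketch([]int16{0, 1000, -1000, 0}, 48000, 2)
	fmt.Println(len(wav))                            // 52
	fmt.Println(string(wav[0:4]), string(wav[8:12])) // RIFF WAVE
	fmt.Println(headerUint32(wav, 24))               // 48000 (sample rate)
	fmt.Println(headerUint32(wav, 28))               // 192000 (byte rate)
	fmt.Println(headerUint32(wav, 40))               // 8 (data chunk size)
}
```

The byte rate is `sampleRate * channels * 2`, so a minute of 48 kHz stereo output is roughly 11.5 MB. That is why the Gemini path applies its inline size limit to the transcoded bytes rather than to the original WebM upload.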
--- a/internal/ai/audio/webm_test.go
+++ b/internal/ai/audio/webm_test.go
@ -0,0 +1,48 @@
+package audio
+
+import (
+	"testing"
+
+	"github.com/stretchr/testify/require"
+)
+
+func TestIsWebMContentType(t *testing.T) {
+	cases := []struct {
+		in   string
+		want bool
+	}{
+		{"audio/webm", true},
+		{"audio/webm;codecs=opus", true},
+		{"audio/webm; codecs=opus", true},
+		{"AUDIO/WEBM", true},
+		{"  audio/webm  ", true},
+		{"audio/wav", false},
+		{"audio/mp4", false},
+		{"video/webm", false},
+		{"", false},
+		{"webm", false},
+	}
+	for _, tc := range cases {
+		t.Run(tc.in, func(t *testing.T) {
+			require.Equal(t, tc.want, IsWebMContentType(tc.in))
+		})
+	}
+}
+
+func TestWebMOpusToWAV_RejectsInvalidInput(t *testing.T) {
+	t.Run("empty", func(t *testing.T) {
+		_, err := WebMOpusToWAV(nil)
+		require.Error(t, err)
+	})
+
+	t.Run("not webm", func(t *testing.T) {
+		_, err := WebMOpusToWAV([]byte("hello world this is not webm"))
+		require.Error(t, err)
+	})
+
+	t.Run("truncated webm header bytes", func(t *testing.T) {
+		// Valid EBML magic but no Segment.
+		_, err := WebMOpusToWAV([]byte{0x1A, 0x45, 0xDF, 0xA3})
+		require.Error(t, err)
+	})
+}
--- a/internal/ai/audiollm/audiollm.go
+++ b/internal/ai/audiollm/audiollm.go
@ -0,0 +1,41 @@
+// Package audiollm defines the multimodal-audio capability for AI providers.
+// Implementations call chat-completions or generate-content style APIs that
+// accept audio as input. For deterministic transcription, prefer internal/ai/stt
+// where a dedicated STT endpoint exists.
+package audiollm
+
+import (
+	"context"
+	"io"
+)
+
+// Model invokes a multimodal LLM with audio input.
+type Model interface {
+	GenerateFromAudio(ctx context.Context, req Request) (*Response, error)
+}
+
+// Request is the input to a multimodal-audio call.
+type Request struct {
+	Audio        io.Reader
+	Size         int64
+	ContentType  string
+	Model        string
+	Instructions string   // literal instruction the model is expected to follow
+	Temperature  *float32 // optional; nil leaves the provider default in place
+}
+
+// Response is the output of a multimodal-audio call.
+type Response struct {
+	Text         string
+	FinishReason FinishReason
+}
+
+// FinishReason describes why the model stopped generating.
+type FinishReason string
+
+const (
+	FinishStop   FinishReason = "stop"   // model finished normally
+	FinishLength FinishReason = "length" // truncated by max-tokens
+	FinishSafety FinishReason = "safety" // safety filter blocked output
+	FinishOther  FinishReason = "other"  // anything else, including unknown
+)
--- a/internal/ai/audiollm/gemini/gemini.go
+++ b/internal/ai/audiollm/gemini/gemini.go
@ -0,0 +1,202 @@
+// Package gemini implements audiollm.Model against the Gemini generateContent
+// endpoint. Used by Memos transcription when the user picks a Gemini provider:
+// the handler issues a transcription instruction via audiollm.Request.Instructions.
+package gemini
+
+import (
+	"context"
+	"io"
+	"mime"
+	"net/url"
+	"strings"
+
+	"github.com/pkg/errors"
+	"google.golang.org/genai"
+
+	"github.com/usememos/memos/internal/ai"
+	"github.com/usememos/memos/internal/ai/audio"
+	"github.com/usememos/memos/internal/ai/audiollm"
+)
+
+const (
+	defaultEndpoint   = "https://generativelanguage.googleapis.com/v1beta"
+	defaultAPIVersion = "v1beta"
+	maxInlineSize     = 14 * 1024 * 1024
+	providerName      = "Gemini"
+)
+
+var supportedContentTypes = map[string]string{
+	"audio/wav":    "audio/wav",
+	"audio/x-wav":  "audio/wav",
+	"audio/mp3":    "audio/mp3",
+	"audio/mpeg":   "audio/mp3",
+	"audio/aiff":   "audio/aiff",
+	"audio/aac":    "audio/aac",
+	"audio/ogg":    "audio/ogg",
+	"audio/flac":   "audio/flac",
+	"audio/x-flac": "audio/flac",
+}
+
+// Model implements audiollm.Model for Gemini generateContent.
+type Model struct {
+	client *genai.Client
+}
+
+// New constructs a Model from a provider config.
+func New(cfg ai.ProviderConfig, options audiollm.Options) (*Model, error) {
+	endpoint, err := normalizeEndpoint(cfg.Endpoint)
+	if err != nil {
+		return nil, err
+	}
+	if cfg.APIKey == "" {
+		return nil, errors.Errorf("%s API key is required", providerName)
+	}
+	baseURL, apiVersion, err := splitEndpoint(endpoint)
+	if err != nil {
+		return nil, err
+	}
+	httpOptions := genai.HTTPOptions{BaseURL: baseURL, APIVersion: apiVersion}
+	if options.HTTPClient != nil && options.HTTPClient.Timeout > 0 {
+		timeout := options.HTTPClient.Timeout
+		httpOptions.Timeout = &timeout
+	}
+	client, err := genai.NewClient(context.Background(), &genai.ClientConfig{
+		APIKey:      cfg.APIKey,
+		Backend:     genai.BackendGeminiAPI,
+		HTTPClient:  options.HTTPClient,
+		HTTPOptions: httpOptions,
+	})
+	if err != nil {
+		return nil, errors.Wrap(err, "failed to create Gemini client")
+	}
+	return &Model{client: client}, nil
+}
+
+// GenerateFromAudio calls Gemini generateContent with the audio attached.
+func (m *Model) GenerateFromAudio(ctx context.Context, req audiollm.Request) (*audiollm.Response, error) {
+	if strings.TrimSpace(req.Model) == "" {
+		return nil, errors.New("model is required")
+	}
+	if req.Audio == nil {
+		return nil, errors.New("audio is required")
+	}
+	if strings.TrimSpace(req.Instructions) == "" {
+		return nil, errors.New("instructions are required")
+	}
+
+	audioBytes, err := io.ReadAll(req.Audio)
+	if err != nil {
+		return nil, errors.Wrap(err, "failed to read audio")
+	}
+	if len(audioBytes) == 0 {
+		return nil, errors.New("audio is required")
+	}
+
+	contentType := req.ContentType
+	if audio.IsWebMContentType(contentType) {
+		wav, err := audio.WebMOpusToWAV(audioBytes)
+		if err != nil {
+			return nil, errors.Wrap(err, "failed to transcode webm audio for Gemini")
+		}
+		audioBytes = wav
+		contentType = "audio/wav"
+	}
+
+	if len(audioBytes) > maxInlineSize {
+		return nil, errors.Errorf("audio is too large for Gemini inline request; maximum size is %d bytes", maxInlineSize)
+	}
+
+	contentType, err = normalizeContentType(contentType)
+	if err != nil {
+		return nil, err
+	}
+
+	cfg := &genai.GenerateContentConfig{}
+	if req.Temperature != nil {
+		t := *req.Temperature
+		cfg.Temperature = &t
+	}
+
+	resp, err := m.client.Models.GenerateContent(ctx, normalizeModelName(req.Model), []*genai.Content{
+		genai.NewContentFromParts([]*genai.Part{
+			genai.NewPartFromBytes(audioBytes, contentType),
+			genai.NewPartFromText(req.Instructions),
+		}, genai.RoleUser),
+	}, cfg)
+	if err != nil {
+		return nil, errors.Wrap(err, "failed to send Gemini request")
+	}
+
+	return &audiollm.Response{
+		Text:         strings.TrimSpace(resp.Text()),
+		FinishReason: mapFinishReason(resp),
+	}, nil
+}
+
+func mapFinishReason(resp *genai.GenerateContentResponse) audiollm.FinishReason {
+	if resp == nil || len(resp.Candidates) == 0 {
+		return audiollm.FinishOther
+	}
+	switch resp.Candidates[0].FinishReason {
+	case genai.FinishReasonStop:
+		return audiollm.FinishStop
+	case genai.FinishReasonMaxTokens:
+		return audiollm.FinishLength
+	case genai.FinishReasonSafety,
+		genai.FinishReasonRecitation,
+		genai.FinishReasonProhibitedContent,
+		genai.FinishReasonSPII,
+		genai.FinishReasonBlocklist,
+		genai.FinishReasonImageSafety,
+		genai.FinishReasonImageProhibitedContent,
+		genai.FinishReasonImageRecitation:
+		return audiollm.FinishSafety
+	default:
+		return audiollm.FinishOther
+	}
+}
+
+func normalizeEndpoint(endpoint string) (string, error) {
+	endpoint = strings.TrimSpace(endpoint)
+	if endpoint == "" {
+		endpoint = defaultEndpoint
+	}
+	if _, err := url.ParseRequestURI(endpoint); err != nil {
+		return "", errors.Wrapf(err, "invalid %s endpoint", providerName)
+	}
+	return strings.TrimRight(endpoint, "/"), nil
+}
+
+func splitEndpoint(endpoint string) (string, string, error) {
+	parsed, err := url.Parse(endpoint)
+	if err != nil {
+		return "", "", errors.Wrapf(err, "invalid %s endpoint", providerName)
+	}
+	path := strings.TrimRight(parsed.Path, "/")
+	apiVersion := defaultAPIVersion
+	for _, supported := range []string{"v1alpha", "v1beta", "v1"} {
+		if path == "/"+supported || strings.HasSuffix(path, "/"+supported) {
+			apiVersion = supported
+			parsed.Path = strings.TrimSuffix(path, "/"+supported)
+			break
+		}
+	}
+	return strings.TrimRight(parsed.String(), "/"), apiVersion, nil
+}
+
+func normalizeContentType(contentType string) (string, error) {
+	mediaType, _, err := mime.ParseMediaType(strings.TrimSpace(contentType))
+	if err != nil {
+		return "", errors.Wrap(err, "invalid audio content type")
+	}
+	mediaType = strings.ToLower(mediaType)
+	normalized, ok := supportedContentTypes[mediaType]
+	if !ok {
+		return "", errors.Errorf("audio content type %q is not supported by Gemini", mediaType)
+	}
+	return normalized, nil
+}
+
+func normalizeModelName(model string) string {
+	return strings.TrimPrefix(strings.TrimSpace(model), "models/")
+}
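Reviewer sketch, not part of this commit: the version-splitting rule in `splitEndpoint` is easier to follow with concrete inputs. The program below copies that logic into a hypothetical `splitEndpointSketch` using only the standard library. The proxy host names are made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// splitEndpointSketch mirrors splitEndpoint above: it strips a trailing API
// version segment from the URL path and returns it separately, falling back
// to "v1beta" when the path does not end in a known version.
func splitEndpointSketch(endpoint string) (baseURL, apiVersion string, err error) {
	parsed, err := url.Parse(endpoint)
	if err != nil {
		return "", "", err
	}
	path := strings.TrimRight(parsed.Path, "/")
	apiVersion = "v1beta"
	for _, v := range []string{"v1alpha", "v1beta", "v1"} {
		if strings.HasSuffix(path, "/"+v) {
			apiVersion = v
			parsed.Path = strings.TrimSuffix(path, "/"+v)
			break
		}
	}
	return strings.TrimRight(parsed.String(), "/"), apiVersion, nil
}

func main() {
	for _, endpoint := range []string{
		"https://generativelanguage.googleapis.com/v1beta",
		"https://proxy.example.com/gemini/v1/", // hypothetical reverse proxy
		"https://proxy.example.com/gemini",     // no version segment
	} {
		base, version, err := splitEndpointSketch(endpoint)
		fmt.Println(base, version, err)
	}
	// https://generativelanguage.googleapis.com v1beta <nil>
	// https://proxy.example.com/gemini v1 <nil>
	// https://proxy.example.com/gemini v1beta <nil>
}
```

Splitting the version out lets the caller pass the base URL and the API version to the client as separate settings, as `New` does with `genai.HTTPOptions`.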
--- a/internal/ai/audiollm/gemini/gemini_test.go
+++ b/internal/ai/audiollm/gemini/gemini_test.go
@ -1,4 +1,4 @@
-package ai
+package gemini_test

 import (
 	"context"
@ -11,9 +11,13 @@ import (
 	"time"

 	"github.com/stretchr/testify/require"
+
+	"github.com/usememos/memos/internal/ai"
+	"github.com/usememos/memos/internal/ai/audiollm"
+	audiollmgemini "github.com/usememos/memos/internal/ai/audiollm/gemini"
 )

-func TestGeminiTranscribe(t *testing.T) {
+func TestGenerateFromAudio(t *testing.T) {
 	t.Parallel()

 	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@ -42,14 +46,14 @@ func TestGeminiTranscribe(t *testing.T) {
 		audio, err := base64.StdEncoding.DecodeString(request.Contents[0].Parts[0].InlineData.Data)
 		require.NoError(t, err)
 		require.Equal(t, "audio bytes", string(audio))
-		require.Contains(t, request.Contents[0].Parts[1].Text, "Return only the transcript text")
-		require.Contains(t, request.Contents[0].Parts[1].Text, "Context and spelling hints")
+		require.Equal(t, "transcribe please", request.Contents[0].Parts[1].Text)
 		require.Equal(t, json.Number("0"), request.GenerationConfig["temperature"])

 		w.Header().Set("Content-Type", "application/json")
 		require.NoError(t, json.NewEncoder(w).Encode(map[string]any{
 			"candidates": []map[string]any{
 				{
+					"finishReason": "STOP",
 					"content": map[string]any{
 						"parts": []map[string]string{{"text": "hello from gemini"}},
 					},
@ -59,40 +63,43 @@ func TestGeminiTranscribe(t *testing.T) {
 	}))
 	defer server.Close()

-	transcriber, err := NewTranscriber(ProviderConfig{
-		Type:     ProviderGemini,
+	model, err := audiollmgemini.New(ai.ProviderConfig{
+		Type:     ai.ProviderGemini,
 		Endpoint: server.URL + "/v1beta",
 		APIKey:   "test-key",
-	})
+	}, audiollm.ApplyOptions(nil))
 	require.NoError(t, err)

+	temp := float32(0)
 	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
 	defer cancel()
-	response, err := transcriber.Transcribe(ctx, TranscribeRequest{
-		Model:       "models/gemini-2.5-flash",
-		ContentType: "audio/mpeg",
-		Audio:       strings.NewReader("audio bytes"),
-		Prompt:      "Memos, Steven",
-		Language:    "en",
+	resp, err := model.GenerateFromAudio(ctx, audiollm.Request{
+		Model:        "models/gemini-2.5-flash",
+		ContentType:  "audio/mpeg",
+		Audio:        strings.NewReader("audio bytes"),
+		Instructions: "transcribe please",
+		Temperature:  &temp,
 	})
 	require.NoError(t, err)
-	require.Equal(t, "hello from gemini", response.Text)
+	require.Equal(t, "hello from gemini", resp.Text)
+	require.Equal(t, audiollm.FinishStop, resp.FinishReason)
 }

-func TestGeminiTranscribeRejectsUnsupportedContentType(t *testing.T) {
+func TestGenerateFromAudioRejectsUnsupportedContentType(t *testing.T) {
 	t.Parallel()

-	transcriber, err := NewTranscriber(ProviderConfig{
-		Type:     ProviderGemini,
+	model, err := audiollmgemini.New(ai.ProviderConfig{
+		Type:     ai.ProviderGemini,
 		Endpoint: "https://example.com/v1beta",
 		APIKey:   "test-key",
-	})
+	}, audiollm.ApplyOptions(nil))
 	require.NoError(t, err)

-	_, err = transcriber.Transcribe(context.Background(), TranscribeRequest{
-		Model:       "gemini-2.5-flash",
-		ContentType: "video/mp4",
-		Audio:       strings.NewReader("video bytes"),
+	_, err = model.GenerateFromAudio(context.Background(), audiollm.Request{
+		Model:        "gemini-2.5-flash",
+		ContentType:  "video/mp4",
+		Audio:        strings.NewReader("video bytes"),
+		Instructions: "transcribe please",
 	})
 	require.Error(t, err)
 	require.Contains(t, err.Error(), "not supported by Gemini")
--- a/internal/ai/audiollm/options.go
+++ b/internal/ai/audiollm/options.go
@ -0,0 +1,34 @@
+package audiollm
+
+import (
+	"net/http"
+	"time"
+)
+
+const defaultHTTPTimeout = 2 * time.Minute
+
+// Options is the resolved option set passed to provider implementations.
+type Options struct {
+	HTTPClient *http.Client
+}
+
+// ModelOption customizes a Model.
+type ModelOption func(*Options)
+
+// WithHTTPClient overrides the HTTP client used by the model.
+func WithHTTPClient(client *http.Client) ModelOption {
+	return func(o *Options) {
+		if client != nil {
+			o.HTTPClient = client
+		}
+	}
+}
+
+// ApplyOptions resolves a ModelOption slice into Options with defaults.
+func ApplyOptions(opts []ModelOption) Options {
+	resolved := Options{HTTPClient: &http.Client{Timeout: defaultHTTPTimeout}}
+	for _, apply := range opts {
+		apply(&resolved)
+	}
+	return resolved
+}
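Reviewer sketch, not part of this commit: `ApplyOptions` and `WithHTTPClient` follow the functional-options pattern, with a two-minute default timeout and a nil-safe override. The stand-alone program below repeats those definitions so the behavior can be run directly. `timeoutOf` is a hypothetical helper added only for the demonstration.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

const defaultHTTPTimeout = 2 * time.Minute

// Options and ModelOption repeat the shapes declared in options.go.
type Options struct {
	HTTPClient *http.Client
}

type ModelOption func(*Options)

// WithHTTPClient overrides the client; a nil client leaves the default alone.
func WithHTTPClient(client *http.Client) ModelOption {
	return func(o *Options) {
		if client != nil {
			o.HTTPClient = client
		}
	}
}

// ApplyOptions starts from the defaults and applies each option in order.
func ApplyOptions(opts []ModelOption) Options {
	resolved := Options{HTTPClient: &http.Client{Timeout: defaultHTTPTimeout}}
	for _, apply := range opts {
		apply(&resolved)
	}
	return resolved
}

// timeoutOf is a hypothetical helper that resolves options and reports the
// timeout of the resulting HTTP client.
func timeoutOf(opts ...ModelOption) time.Duration {
	return ApplyOptions(opts).HTTPClient.Timeout
}

func main() {
	custom := &http.Client{Timeout: 10 * time.Second}
	fmt.Println(timeoutOf())                       // 2m0s
	fmt.Println(timeoutOf(WithHTTPClient(nil)))    // 2m0s
	fmt.Println(timeoutOf(WithHTTPClient(custom))) // 10s
}
```

Because `ApplyOptions(nil)` always returns a non-nil client, provider constructors such as `New` can pass `options.HTTPClient` to their SDK without a separate default branch.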
--- a/internal/ai/client.go
+++ b/internal/ai/client.go
@ -1,65 +0,0 @@
-package ai
-
-import (
-	"net/http"
-	"net/url"
-	"strings"
-	"time"
-
-	"github.com/pkg/errors"
-)
-
-const defaultHTTPTimeout = 2 * time.Minute
-
-type transcriberOptions struct {
-	httpClient *http.Client
-}
-
-// TranscriberOption configures a transcriber.
-type TranscriberOption func(*transcriberOptions)
-
-// WithHTTPClient sets the HTTP client used by a transcriber.
-func WithHTTPClient(client *http.Client) TranscriberOption {
-	return func(options *transcriberOptions) {
-		if client != nil {
-			options.httpClient = client
-		}
-	}
-}
-
-// NewTranscriber creates a transcriber for a provider.
-func NewTranscriber(config ProviderConfig, options ...TranscriberOption) (Transcriber, error) {
-	transcriberOptions := transcriberOptions{
-		httpClient: &http.Client{Timeout: defaultHTTPTimeout},
-	}
-	for _, applyOption := range options {
-		applyOption(&transcriberOptions)
-	}
-
-	switch config.Type {
-	case ProviderOpenAI:
-		return newOpenAITranscriber(config, transcriberOptions)
-	case ProviderGemini:
-		return newGeminiTranscriber(config, transcriberOptions)
-	default:
-		return nil, errors.Wrapf(ErrCapabilityUnsupported, "provider type %q", config.Type)
-	}
-}
-
-func normalizeEndpoint(endpoint string, defaultEndpoint string, providerName string) (string, error) {
-	endpoint = strings.TrimSpace(endpoint)
-	if endpoint == "" {
-		endpoint = defaultEndpoint
-	}
-	if _, err := url.ParseRequestURI(endpoint); err != nil {
-		return "", errors.Wrapf(err, "invalid %s endpoint", providerName)
-	}
-	return strings.TrimRight(endpoint, "/"), nil
-}
-
-func requireAPIKey(apiKey string, providerName string) error {
-	if apiKey == "" {
-		return errors.Errorf("%s API key is required", providerName)
-	}
-	return nil
-}
--- a/internal/ai/errors.go
+++ b/internal/ai/errors.go
@ -7,4 +7,11 @@ var (
 	ErrProviderNotFound = errors.New("AI provider not found")
 	// ErrCapabilityUnsupported indicates that the provider does not support the requested capability.
 	ErrCapabilityUnsupported = errors.New("AI provider capability unsupported")
+	// ErrSTTNotSupported indicates that the provider does not have a dedicated
+	// speech-to-text endpoint. Use the audiollm package for multimodal audio
+	// understanding when this is returned.
+	ErrSTTNotSupported = errors.New("provider does not support speech-to-text capability")
+	// ErrAudioLLMNotSupported indicates that the provider does not have a
+	// multimodal-audio LLM available in this codebase.
+	ErrAudioLLMNotSupported = errors.New("provider does not support multimodal audio capability")
 )
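Reviewer sketch, not part of this commit: the two new sentinel errors are meant to be matched by identity, so a caller can tell "no dedicated STT endpoint" apart from other failures and switch to the `audiollm` path. The program below shows one way that dispatch could look, using the standard library's `errors.Is` and `%w` wrapping instead of `github.com/pkg/errors`. The factory `newSTTSketch`, the selector `pickCapability`, and the provider strings `"OPENAI"` and `"GEMINI"` are assumptions for illustration, not the handler's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the sentinel errors declared in internal/ai/errors.go.
var (
	ErrSTTNotSupported      = errors.New("provider does not support speech-to-text capability")
	ErrAudioLLMNotSupported = errors.New("provider does not support multimodal audio capability")
)

// newSTTSketch is a hypothetical factory: in this sketch only "OPENAI" has a
// dedicated STT endpoint; every other type gets a wrapped sentinel error.
func newSTTSketch(providerType string) error {
	if providerType == "OPENAI" {
		return nil
	}
	return fmt.Errorf("provider type %q: %w", providerType, ErrSTTNotSupported)
}

// pickCapability reports which interface a caller would use for a provider.
func pickCapability(providerType string) (string, error) {
	err := newSTTSketch(providerType)
	switch {
	case err == nil:
		return "stt", nil
	case errors.Is(err, ErrSTTNotSupported) && providerType == "GEMINI":
		return "audiollm", nil
	case errors.Is(err, ErrSTTNotSupported):
		return "", fmt.Errorf("provider type %q: %w", providerType, ErrAudioLLMNotSupported)
	default:
		return "", err
	}
}

// isAudioLLMUnsupported reports whether err wraps ErrAudioLLMNotSupported.
func isAudioLLMUnsupported(err error) bool {
	return errors.Is(err, ErrAudioLLMNotSupported)
}

func main() {
	for _, providerType := range []string{"OPENAI", "GEMINI", "OTHER"} {
		capability, err := pickCapability(providerType)
		fmt.Printf("%s -> %q, unsupported=%v\n", providerType, capability, isAudioLLMUnsupported(err))
	}
}
```

Wrapping keeps the provider type in the message while `errors.Is` still matches the sentinel, so callers never need to compare error strings.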
--- a/internal/ai/gemini.go
+++ b/internal/ai/gemini.go
@ -1,162 +0,0 @@
-package ai
-
-import (
-	"context"
-	"io"
-	"mime"
-	"net/url"
-	"strings"
-
-	"github.com/pkg/errors"
-	"google.golang.org/genai"
-)
-
-const (
-	defaultGeminiEndpoint     = "https://generativelanguage.googleapis.com/v1beta"
-	geminiTranscriptionPrompt = `Transcribe the audio accurately. Return only the transcript text. Do not summarize, explain, or add content that is not spoken.`
-	maxGeminiInlineAudioSize  = 14 * 1024 * 1024
-	defaultGeminiAPIVersion   = "v1beta"
-	geminiProviderDisplayName = "Gemini"
-	geminiDefaultTemperature  = float32(0)
-)
-
-var geminiSupportedContentTypes = map[string]string{
-	"audio/wav":    "audio/wav",
-	"audio/x-wav":  "audio/wav",
-	"audio/mp3":    "audio/mp3",
-	"audio/mpeg":   "audio/mp3",
-	"audio/aiff":   "audio/aiff",
-	"audio/aac":    "audio/aac",
-	"audio/ogg":    "audio/ogg",
-	"audio/flac":   "audio/flac",
-	"audio/x-flac": "audio/flac",
-}
-
-type geminiTranscriber struct {
-	client *genai.Client
-}
-
-func newGeminiTranscriber(config ProviderConfig, options transcriberOptions) (*geminiTranscriber, error) {
-	endpoint, err := normalizeEndpoint(config.Endpoint, defaultGeminiEndpoint, geminiProviderDisplayName)
-	if err != nil {
-		return nil, err
-	}
-	if err := requireAPIKey(config.APIKey, geminiProviderDisplayName); err != nil {
-		return nil, err
-	}
-	baseURL, apiVersion, err := normalizeGeminiEndpoint(endpoint)
-	if err != nil {
-		return nil, err
-	}
-	httpOptions := genai.HTTPOptions{
-		BaseURL:    baseURL,
-		APIVersion: apiVersion,
-	}
-	if options.httpClient.Timeout > 0 {
-		timeout := options.httpClient.Timeout
-		httpOptions.Timeout = &timeout
-	}
-
-	client, err := genai.NewClient(context.Background(), &genai.ClientConfig{
-		APIKey:      config.APIKey,
-		Backend:     genai.BackendGeminiAPI,
-		HTTPClient:  options.httpClient,
-		HTTPOptions: httpOptions,
-	})
-	if err != nil {
-		return nil, errors.Wrap(err, "failed to create Gemini client")
-	}
-	return &geminiTranscriber{client: client}, nil
-}
-
-// Transcribe transcribes audio with Gemini generateContent.
-func (t *geminiTranscriber) Transcribe(ctx context.Context, request TranscribeRequest) (*TranscribeResponse, error) {
-	if strings.TrimSpace(request.Model) == "" {
-		return nil, errors.New("model is required")
-	}
-	if request.Audio == nil {
-		return nil, errors.New("audio is required")
-	}
-	audio, err := io.ReadAll(request.Audio)
-	if err != nil {
-		return nil, errors.Wrap(err, "failed to read audio")
-	}
-	if len(audio) == 0 {
-		return nil, errors.New("audio is required")
-	}
-	if len(audio) > maxGeminiInlineAudioSize {
-		return nil, errors.Errorf("audio is too large for Gemini inline transcription; maximum size is %d bytes", maxGeminiInlineAudioSize)
-	}
-
-	contentType, err := normalizeGeminiContentType(request.ContentType)
-	if err != nil {
-		return nil, err
-	}
-	prompt := buildGeminiTranscriptionPrompt(request.Prompt, request.Language)
-	temperature := geminiDefaultTemperature
-	response, err := t.client.Models.GenerateContent(ctx, normalizeGeminiModelName(request.Model), []*genai.Content{
-		genai.NewContentFromParts([]*genai.Part{
-			genai.NewPartFromBytes(audio, contentType),
-			genai.NewPartFromText(prompt),
-		}, genai.RoleUser),
-	}, &genai.GenerateContentConfig{
-		Temperature: &temperature,
-	})
-	if err != nil {
-		return nil, errors.Wrap(err, "failed to send Gemini transcription request")
-	}
-	text := strings.TrimSpace(response.Text())
-	if text == "" {
-		return nil, errors.New("Gemini transcription response did not include text")
-	}
-	return &TranscribeResponse{
-		Text: text,
-	}, nil
-}
-
-func normalizeGeminiEndpoint(endpoint string) (string, string, error) {
-	parsed, err := url.Parse(endpoint)
-	if err != nil {
-		return "", "", errors.Wrap(err, "invalid Gemini endpoint")
-	}
-	path := strings.TrimRight(parsed.Path, "/")
-	apiVersion := defaultGeminiAPIVersion
-	for _, supportedVersion := range []string{"v1alpha", "v1beta", "v1"} {
-		if path == "/"+supportedVersion || strings.HasSuffix(path, "/"+supportedVersion) {
-			apiVersion = supportedVersion
-			parsed.Path = strings.TrimSuffix(path, "/"+supportedVersion)
-			break
-		}
-	}
-	return strings.TrimRight(parsed.String(), "/"), apiVersion, nil
-}
-
-func normalizeGeminiContentType(contentType string) (string, error) {
-	mediaType, _, err := mime.ParseMediaType(strings.TrimSpace(contentType))
-	if err != nil {
-		return "", errors.Wrap(err, "invalid audio content type")
-	}
-	mediaType = strings.ToLower(mediaType)
-	normalized, ok := geminiSupportedContentTypes[mediaType]
-	if !ok {
-		return "", errors.Errorf("audio content type %q is not supported by Gemini", mediaType)
-	}
-	return normalized, nil
-}
-
-func buildGeminiTranscriptionPrompt(prompt string, language string) string {
-	parts := []string{geminiTranscriptionPrompt}
-	language = strings.TrimSpace(language)
-	if language != "" {
-		parts = append(parts, "The input language is "+language+".")
-	}
-	prompt = strings.TrimSpace(prompt)
-	if prompt != "" {
-		parts = append(parts, "Context and spelling hints:\n"+prompt)
-	}
-	return strings.Join(parts, "\n\n")
-}
-
-func normalizeGeminiModelName(model string) string {
-	return strings.TrimPrefix(strings.TrimSpace(model), "models/")
-}
--- a/internal/ai/openai.go
+++ b/internal/ai/openai.go
@ -1,98 +0,0 @@
-package ai
-
-import (
-	"context"
-	"mime"
-	"strings"
-
-	openaisdk "github.com/openai/openai-go/v3"
-	openaioption "github.com/openai/openai-go/v3/option"
-	"github.com/pkg/errors"
-)
-
-const defaultOpenAIEndpoint = "https://api.openai.com/v1"
-
-type openAITranscriber struct {
-	client openaisdk.Client
-}
-
-func newOpenAITranscriber(config ProviderConfig, options transcriberOptions) (*openAITranscriber, error) {
-	endpoint, err := normalizeEndpoint(config.Endpoint, defaultOpenAIEndpoint, "OpenAI")
-	if err != nil {
-		return nil, err
-	}
-	if err := requireAPIKey(config.APIKey, "OpenAI"); err != nil {
-		return nil, err
-	}
-
-	return &openAITranscriber{
-		client: openaisdk.NewClient(
-			openaioption.WithAPIKey(config.APIKey),
-			openaioption.WithBaseURL(endpoint),
-			openaioption.WithHTTPClient(options.httpClient),
-		),
-	}, nil
-}
-
-// Transcribe transcribes audio with the OpenAI /audio/transcriptions endpoint.
-func (t *openAITranscriber) Transcribe(ctx context.Context, request TranscribeRequest) (*TranscribeResponse, error) {
-	if strings.TrimSpace(request.Model) == "" {
-		return nil, errors.New("model is required")
-	}
-	if request.Audio == nil {
-		return nil, errors.New("audio is required")
-	}
-
-	filename, contentType, err := normalizeOpenAIAudioFileMetadata(request)
-	if err != nil {
-		return nil, err
-	}
-
-	params := openaisdk.AudioTranscriptionNewParams{
-		File:           openaisdk.File(request.Audio, filename, contentType),
-		Model:          openaisdk.AudioModel(request.Model),
-		ResponseFormat: openaisdk.AudioResponseFormatJSON,
-	}
-	if request.Prompt != "" {
-		params.Prompt = openaisdk.String(request.Prompt)
-	}
-	if request.Language != "" {
-		params.Language = openaisdk.String(request.Language)
-	}
-
-	response, err := t.client.Audio.Transcriptions.New(ctx, params)
-	if err != nil {
-		return nil, errors.Wrap(err, "failed to send OpenAI transcription request")
-	}
-	return &TranscribeResponse{
-		Text:     response.Text,
-		Language: response.Language,
-		Duration: response.Duration,
-	}, nil
-}
-
-func normalizeOpenAIAudioFileMetadata(request TranscribeRequest) (string, string, error) {
-	filename := strings.TrimSpace(request.Filename)
-	if filename == "" {
-		filename = "audio"
-	}
-	contentType := strings.TrimSpace(request.ContentType)
-	if contentType == "" {
-		contentType = "application/octet-stream"
-	} else {
-		mediaType, _, err := mime.ParseMediaType(contentType)
-		if err != nil {
-			return "", "", errors.Wrap(err, "invalid audio content type")
-		}
-		contentType = mediaType
-	}
-	return sanitizeFilename(filename), contentType, nil
-}
-
-func sanitizeFilename(filename string) string {
-	filename = strings.NewReplacer("\r", "_", "\n", "_").Replace(filename)
-	if strings.TrimSpace(filename) == "" {
-		return "audio"
-	}
-	return filename
-}
--- a/internal/ai/stt/openai/openai.go
+++ b/internal/ai/stt/openai/openai.go
@ -0,0 +1,116 @@
+// Package openai implements stt.Transcriber against the OpenAI
+// /audio/transcriptions endpoint (and any compatible third-party endpoint
+// such as Groq Whisper, self-hosted faster-whisper, or Azure Whisper).
+package openai
+
+import (
+	"context"
+	"mime"
+	"net/url"
+	"strings"
+
+	openaisdk "github.com/openai/openai-go/v3"
+	openaioption "github.com/openai/openai-go/v3/option"
+	"github.com/pkg/errors"
+
+	"github.com/usememos/memos/internal/ai"
+	"github.com/usememos/memos/internal/ai/stt"
+)
+
+const defaultEndpoint = "https://api.openai.com/v1"
+
+// Transcriber implements stt.Transcriber for OpenAI-compatible STT endpoints.
+type Transcriber struct {
+	client openaisdk.Client
+}
+
+// New constructs a Transcriber from a provider config.
+func New(cfg ai.ProviderConfig, options stt.Options) (*Transcriber, error) {
+	endpoint, err := normalizeEndpoint(cfg.Endpoint)
+	if err != nil {
+		return nil, err
+	}
+	if cfg.APIKey == "" {
+		return nil, errors.New("OpenAI API key is required")
+	}
+	return &Transcriber{
+		client: openaisdk.NewClient(
+			openaioption.WithAPIKey(cfg.APIKey),
+			openaioption.WithBaseURL(endpoint),
+			openaioption.WithHTTPClient(options.HTTPClient),
+		),
+	}, nil
+}
+
+// Transcribe sends the audio to /audio/transcriptions.
+func (t *Transcriber) Transcribe(ctx context.Context, req stt.Request) (*stt.Response, error) {
+	if strings.TrimSpace(req.Model) == "" {
+		return nil, errors.New("model is required")
+	}
+	if req.Audio == nil {
+		return nil, errors.New("audio is required")
+	}
+
+	filename, contentType, err := normalizeAudioMetadata(req)
+	if err != nil {
+		return nil, err
+	}
+
+	params := openaisdk.AudioTranscriptionNewParams{
+		File:           openaisdk.File(req.Audio, filename, contentType),
+		Model:          openaisdk.AudioModel(req.Model),
+		ResponseFormat: openaisdk.AudioResponseFormatJSON,
+	}
+	if req.Prompt != "" {
+		params.Prompt = openaisdk.String(req.Prompt)
+	}
+	if req.Language != "" {
+		params.Language = openaisdk.String(req.Language)
+	}
+
+	resp, err := t.client.Audio.Transcriptions.New(ctx, params)
+	if err != nil {
+		return nil, errors.Wrap(err, "failed to send OpenAI transcription request")
+	}
+	return &stt.Response{
+		Text:     resp.Text,
+		Language: resp.Language,
+	}, nil
+}
+
+func normalizeEndpoint(endpoint string) (string, error) {
+	endpoint = strings.TrimSpace(endpoint)
+	if endpoint == "" {
+		endpoint = defaultEndpoint
+	}
+	if _, err := url.ParseRequestURI(endpoint); err != nil {
+		return "", errors.Wrap(err, "invalid OpenAI endpoint")
+	}
+	return strings.TrimRight(endpoint, "/"), nil
+}
+
+func normalizeAudioMetadata(req stt.Request) (string, string, error) {
+	filename := strings.TrimSpace(req.Filename)
+	if filename == "" {
+		filename = "audio"
+	}
+	contentType := strings.TrimSpace(req.ContentType)
+	if contentType == "" {
+		contentType = "application/octet-stream"
+	} else {
+		mediaType, _, err := mime.ParseMediaType(contentType)
+		if err != nil {
+			return "", "", errors.Wrap(err, "invalid audio content type")
+		}
+		contentType = mediaType
+	}
+	return sanitizeFilename(filename), contentType, nil
+}
+
+func sanitizeFilename(filename string) string {
+	filename = strings.NewReplacer("\r", "_", "\n", "_").Replace(filename)
+	if strings.TrimSpace(filename) == "" {
+		return "audio"
+	}
+	return filename
+}
--- a/internal/ai/stt/openai/openai_test.go
+++ b/internal/ai/stt/openai/openai_test.go
@ -1,4 +1,4 @@
-package ai
+package openai_test

 import (
 	"context"
@ -10,9 +10,13 @@ import (
 	"time"

 	"github.com/stretchr/testify/require"
+
+	"github.com/usememos/memos/internal/ai"
+	"github.com/usememos/memos/internal/ai/stt"
+	sttopenai "github.com/usememos/memos/internal/ai/stt/openai"
 )

-func TestOpenAITranscribe(t *testing.T) {
+func TestTranscribe(t *testing.T) {
 	t.Parallel()

 	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@ -40,16 +44,16 @@ func TestOpenAITranscribe(t *testing.T) {
 	}))
 	defer server.Close()

-	transcriber, err := NewTranscriber(ProviderConfig{
-		Type:     ProviderOpenAI,
+	transcriber, err := sttopenai.New(ai.ProviderConfig{
+		Type:     ai.ProviderOpenAI,
 		Endpoint: server.URL,
 		APIKey:   "test-key",
-	})
+	}, stt.ApplyOptions(nil))
 	require.NoError(t, err)

 	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
 	defer cancel()
-	response, err := transcriber.Transcribe(ctx, TranscribeRequest{
+	response, err := transcriber.Transcribe(ctx, stt.Request{
 		Model:       "gpt-4o-transcribe",
 		Filename:    "voice.wav",
 		ContentType: "audio/wav",
@ -60,5 +64,5 @@ func TestOpenAITranscribe(t *testing.T) {
 	require.NoError(t, err)
 	require.Equal(t, "hello world", response.Text)
 	require.Equal(t, "en", response.Language)
-	require.Equal(t, 1.5, response.Duration)
+	// Note: Duration intentionally omitted from stt.Response — not exposed in the new contract.
 }
--- a/internal/ai/stt/options.go
+++ b/internal/ai/stt/options.go
@ -0,0 +1,34 @@
+package stt
+
+import (
+	"net/http"
+	"time"
+)
+
+const defaultHTTPTimeout = 2 * time.Minute
+
+// Options is the resolved option set passed to provider implementations.
+type Options struct {
+	HTTPClient *http.Client
+}
+
+// TranscriberOption customizes a Transcriber.
+type TranscriberOption func(*Options)
+
+// WithHTTPClient overrides the HTTP client used by the transcriber.
+func WithHTTPClient(client *http.Client) TranscriberOption {
+	return func(o *Options) {
+		if client != nil {
+			o.HTTPClient = client
+		}
+	}
+}
+
+// ApplyOptions resolves a TranscriberOption slice into Options with defaults.
+func ApplyOptions(opts []TranscriberOption) Options {
+	resolved := Options{HTTPClient: &http.Client{Timeout: defaultHTTPTimeout}}
+	for _, apply := range opts {
+		apply(&resolved)
+	}
+	return resolved
+}
--- a/internal/ai/stt/stt.go
+++ b/internal/ai/stt/stt.go
@ -0,0 +1,41 @@
+// Package stt defines the speech-to-text capability for AI providers.
+// Implementations call dedicated STT endpoints (e.g. OpenAI /audio/transcriptions)
+// and return deterministic transcription output. For multimodal LLMs that
+// happen to accept audio input, see internal/ai/audiollm.
+package stt
+
+import (
+	"context"
+	"io"
+)
+
+// Transcriber transcribes audio to text using a provider's dedicated STT endpoint.
+type Transcriber interface {
+	Transcribe(ctx context.Context, req Request) (*Response, error)
+}
+
+// Request is the input to a transcription call.
+type Request struct {
+	Audio       io.Reader
+	Size        int64
+	Filename    string
+	ContentType string // IANA media type, e.g. "audio/wav"
+	Model       string // provider-specific model id (e.g. "whisper-1", "gpt-4o-transcribe")
+	Prompt      string // soft spelling/vocabulary hint (Whisper "prompt" parameter)
+	Language    string // ISO 639-1, optional
+}
+
+// Response is the output of a transcription call.
+type Response struct {
+	Text     string
+	Language string    // empty if provider did not return it
+	Segments []Segment // empty unless provider returned timestamps
+}
+
+// Segment is a timestamped portion of the transcript.
+type Segment struct {
+	Text    string
+	Start   float64
+	End     float64
+	Speaker string // empty unless using a diarization-capable model
+}
--- a/internal/ai/transcription.go
+++ b/internal/ai/transcription.go
@ -1,29 +0,0 @@
-package ai
-
-import (
-	"context"
-	"io"
-)
-
-// Transcriber transcribes audio into text.
-type Transcriber interface {
-	Transcribe(ctx context.Context, request TranscribeRequest) (*TranscribeResponse, error)
-}
-
-// TranscribeRequest contains an audio transcription request.
-type TranscribeRequest struct {
-	Model       string
-	Filename    string
-	ContentType string
-	Audio       io.Reader
-	Size        int64
-	Prompt      string
-	Language    string
-}
-
-// TranscribeResponse contains an audio transcription response.
-type TranscribeResponse struct {
-	Text     string
-	Language string
-	Duration float64
-}
--- a/proto/gen/store/instance_setting.pb.go
+++ b/proto/gen/store/instance_setting.pb.go
@ -1098,15 +1098,23 @@ type TranscriptionConfig struct {
 	// Empty string means transcription is disabled.
 	ProviderId string `protobuf:"bytes,1,opt,name=provider_id,json=providerId,proto3" json:"provider_id,omitempty"`
 	// model is the provider-specific model identifier.
-	// Empty string falls back to the engine default
-	// (whisper-1 for OPENAI providers, gemini-2.5-flash for GEMINI providers).
+	// Empty string falls back to the engine default.
+	// OPENAI examples:
+	//   - whisper-1 (legacy, lower cost)
+	//   - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
+	//   - gpt-4o-transcribe-diarize (includes speaker labels)
+	//
+	// GEMINI examples:
+	//   - gemini-2.5-flash (default, multimodal call)
+	//   - gemini-2.5-pro
 	Model string `protobuf:"bytes,2,opt,name=model,proto3" json:"model,omitempty"`
 	// language is the default ISO 639-1 language hint sent to the provider.
 	// Empty string lets the provider auto-detect.
 	Language string `protobuf:"bytes,3,opt,name=language,proto3" json:"language,omitempty"`
 	// prompt is a default spelling/vocabulary hint passed to the provider.
-	// Used as the OpenAI Whisper "prompt" parameter and folded into the Gemini
-	// generation prompt as a "Context and spelling hints" block.
+	// Used as the OpenAI Whisper "prompt" parameter (a soft hint that the model
+	// may ignore) and folded into the Gemini generation prompt as a "Context and
+	// spelling hints" block (which the LLM will treat more literally).
 	Prompt        string `protobuf:"bytes,4,opt,name=prompt,proto3" json:"prompt,omitempty"`
 	unknownFields protoimpl.UnknownFields
 	sizeCache     protoimpl.SizeCache
--- a/proto/store/instance_setting.proto
+++ b/proto/store/instance_setting.proto
@ -177,8 +177,14 @@ message TranscriptionConfig {
  string provider_id = 1;

  // model is the provider-specific model identifier.
-  // Empty string falls back to the engine default
-  // (whisper-1 for OPENAI providers, gemini-2.5-flash for GEMINI providers).
+  // Empty string falls back to the engine default.
+  // OPENAI examples:
+  //   - whisper-1 (legacy, lower cost)
+  //   - gpt-4o-transcribe, gpt-4o-mini-transcribe (higher quality)
+  //   - gpt-4o-transcribe-diarize (includes speaker labels)
+  // GEMINI examples:
+  //   - gemini-2.5-flash (default, multimodal call)
+  //   - gemini-2.5-pro
  string model = 2;

  // language is the default ISO 639-1 language hint sent to the provider.
@ -186,7 +192,8 @@ message TranscriptionConfig {
  string language = 3;

  // prompt is a default spelling/vocabulary hint passed to the provider.
-  // Used as the OpenAI Whisper "prompt" parameter and folded into the Gemini
-  // generation prompt as a "Context and spelling hints" block.
+  // Used as the OpenAI Whisper "prompt" parameter (a soft hint that the model
+  // may ignore) and folded into the Gemini generation prompt as a "Context and
+  // spelling hints" block (which the LLM will treat more literally).
  string prompt = 4;
 }
--- a/server/router/api/v1/ai_service.go
+++ b/server/router/api/v1/ai_service.go
@ -7,10 +7,15 @@ import (
 	"net/http"
 	"strings"

+	"github.com/pkg/errors"
 	"google.golang.org/grpc/codes"
 	"google.golang.org/grpc/status"

 	"github.com/usememos/memos/internal/ai"
+	"github.com/usememos/memos/internal/ai/audiollm"
+	audiollmgemini "github.com/usememos/memos/internal/ai/audiollm/gemini"
+	"github.com/usememos/memos/internal/ai/stt"
+	sttopenai "github.com/usememos/memos/internal/ai/stt/openai"
 	v1pb "github.com/usememos/memos/proto/gen/api/v1"
 	storepb "github.com/usememos/memos/proto/gen/store"
 )
@ -99,26 +104,93 @@ func (s *APIV1Service) Transcribe(ctx context.Context, request *v1pb.TranscribeR
 		model = defaultModel
 	}

-	transcriber, err := ai.NewTranscriber(provider)
+	var text string
+	switch provider.Type {
+	case ai.ProviderOpenAI:
+		text, err = s.transcribeViaSTT(ctx, provider, persisted, model, content, filename, contentType)
+	case ai.ProviderGemini:
+		text, err = s.transcribeViaAudioLLM(ctx, provider, persisted, model, content, contentType)
+	default:
+		return nil, status.Errorf(codes.FailedPrecondition,
+			"provider type %q is not supported for transcription", provider.Type)
+	}
 	if err != nil {
-		return nil, status.Errorf(codes.InvalidArgument, "failed to create AI transcriber: %v", err)
+		return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
 	}
+	return &v1pb.TranscribeResponse{Text: text}, nil
+}

-	transcription, err := transcriber.Transcribe(ctx, ai.TranscribeRequest{
-		Model:       model,
-		Filename:    filename,
-		ContentType: contentType,
+func (*APIV1Service) transcribeViaSTT(
+	ctx context.Context,
+	provider ai.ProviderConfig,
+	persisted *storepb.TranscriptionConfig,
+	model string,
+	content []byte,
+	filename string,
+	contentType string,
+) (string, error) {
+	transcriber, err := sttopenai.New(provider, stt.ApplyOptions(nil))
+	if err != nil {
+		return "", errors.Wrap(err, "failed to create STT transcriber")
+	}
+	resp, err := transcriber.Transcribe(ctx, stt.Request{
 		Audio:       bytes.NewReader(content),
 		Size:        int64(len(content)),
+		Filename:    filename,
+		ContentType: contentType,
+		Model:       model,
 		Prompt:      persisted.GetPrompt(),
 		Language:    persisted.GetLanguage(),
 	})
 	if err != nil {
-		return nil, status.Errorf(codes.Internal, "failed to transcribe audio: %v", err)
+		return "", err
+	}
+	return resp.Text, nil
+}
+
+func (*APIV1Service) transcribeViaAudioLLM(
+	ctx context.Context,
+	provider ai.ProviderConfig,
+	persisted *storepb.TranscriptionConfig,
+	model string,
+	content []byte,
+	contentType string,
+) (string, error) {
+	m, err := audiollmgemini.New(provider, audiollm.ApplyOptions(nil))
+	if err != nil {
+		return "", errors.Wrap(err, "failed to create audio LLM")
+	}
+	resp, err := m.GenerateFromAudio(ctx, audiollm.Request{
+		Audio:        bytes.NewReader(content),
+		Size:         int64(len(content)),
+		ContentType:  contentType,
+		Model:        model,
+		Instructions: buildTranscriptionInstructions(persisted.GetPrompt(), persisted.GetLanguage()),
+	})
+	if err != nil {
+		return "", err
+	}
+	if resp.FinishReason != audiollm.FinishStop {
+		return "", errors.Errorf("transcription incomplete (finish reason: %s)", resp.FinishReason)
+	}
+	if strings.TrimSpace(resp.Text) == "" {
+		return "", errors.New("transcription response did not include text")
+	}
+	return resp.Text, nil
+}
+
+func buildTranscriptionInstructions(prompt, language string) string {
+	parts := []string{
+		"Transcribe the audio accurately. Return only the transcript text. " +
+			"Do not summarize, explain, or add content that is not spoken.",
+	}
+	if language = strings.TrimSpace(language); language != "" {
+		parts = append(parts, "The input language is "+language+".")
+	}
+	if prompt = strings.TrimSpace(prompt); prompt != "" {
+		parts = append(parts, "Context and spelling hints:\n"+prompt)
 	}
-	return &v1pb.TranscribeResponse{
-		Text: transcription.Text,
-	}, nil
+	return strings.Join(parts, "\n\n")
 }

 func (*APIV1Service) resolveAIProvider(setting *storepb.InstanceAISetting, providerID string) (ai.ProviderConfig, error) {
--- a/server/router/api/v1/test/ai_service_test.go
+++ b/server/router/api/v1/test/ai_service_test.go
@ -152,6 +152,7 @@ func TestTranscribe(t *testing.T) {
 			require.NoError(t, json.NewEncoder(w).Encode(map[string]any{
 				"candidates": []map[string]any{
 					{
+						"finishReason": "STOP",
 						"content": map[string]any{
 							"parts": []map[string]string{{"text": "gemini transcript"}},
 						},
--- a/web/src/components/Settings/AISection.tsx
+++ b/web/src/components/Settings/AISection.tsx
@ -400,9 +400,6 @@ const TranscriptionForm = ({ providers, transcription, referencedProvider, onCha
        {referencedProvider && !referencedProvider.apiKeySet && (
          <p className="text-xs text-destructive">{t("setting.ai.transcription-warning-no-key")}</p>
        )}
-        {referencedProvider?.type === InstanceSetting_AIProviderType.GEMINI && (
-          <p className="text-xs text-muted-foreground">{t("setting.ai.transcription-warning-gemini-webm")}</p>
-        )}
      </div>

      <div className="flex flex-col gap-1.5 sm:col-span-2">
--- a/web/src/locales/en.json
+++ b/web/src/locales/en.json
@ -442,16 +442,15 @@
      "transcription-language-placeholder": "auto-detect",
      "transcription-language": "Default language",
      "transcription-model-help": "Free text. Use the provider's model identifier — e.g. whisper-1, gpt-4o-transcribe, whisper-large-v3-turbo.",
-      "transcription-model-placeholder-gemini": "gemini-2.5-flash",
-      "transcription-model-placeholder-openai": "whisper-1",
+      "transcription-model-placeholder-gemini": "gemini-2.5-flash, gemini-2.5-pro",
+      "transcription-model-placeholder-openai": "whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize",
      "transcription-model": "Model",
      "transcription-no-provider": "None — transcription disabled",
-      "transcription-prompt-help": "Improves spelling of proper nouns and jargon. Whisper limit is roughly 224 tokens.",
+      "transcription-prompt-help": "Improves spelling of proper nouns and jargon. OpenAI Whisper treats this as a soft hint, limited to roughly 224 tokens. Gemini treats it as a literal instruction inside the generation prompt.",
      "transcription-prompt-placeholder": "Names: Alice, Bob. Glossary: kubernetes, OAuth.",
      "transcription-prompt": "Prompt hints",
      "transcription-provider": "Provider",
      "transcription-title": "Transcription",
-      "transcription-warning-gemini-webm": "Gemini does not accept browser-recorded audio/webm. For in-editor recording, use an OpenAI-compatible provider.",
      "transcription-warning-no-key": "The selected provider has no API key set. Edit the integration above to add one."
    },
    "instance": {