MOSS-TTSD-v1.0 8B (dialogue)

8B · pending

MOSS-TTS · moss_tts_delay · Apache 2.0 · 🤗 OpenMOSS-Team/MOSS-TTSD-v1.0

Performance

Success rate

100%

41/41

Cold start

169s

activation → loaded

RTF (median)

2.036×

1.23× – 4.93×

Latency (median)

6.5s

4.8s – 47.0s

Runs on the same A100 pod as v1.5 and Local. Cold start of 169s is in line with v1.5's 166s (same size class). The liveness fix held through 7+ min of harness and extended testing, with 0 pod restarts.
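The headline RTF figures can be cross-checked against the per-case values listed further down this report. The sketch below copies the 41 rounded RTF values from the case listings and recomputes the median and range; the `summarize` helper is a hypothetical name used only for illustration. Because the inputs are rounded to two decimals, the recomputed median (2.04×) can differ slightly from the dashboard's 2.036×.

```python
from statistics import median

# Per-case RTF values copied from the case listings in this report (41 cases).
rtfs = [
    4.93,                                # extended: voice cloning
    1.72, 2.10, 2.43, 1.93,              # extended: dialogue
    3.68, 1.79, 2.04, 3.20, 1.23,        # base English
    1.47, 2.47, 2.04, 1.57, 2.07,        # failure modes
    2.26, 3.27, 1.97, 3.36, 2.03, 1.63,  # multilingual
    1.45, 1.61, 1.91,                    # code switching
    2.03, 2.23, 3.57, 1.96, 2.33, 2.35, 1.80, 1.85, 1.95, 2.13,  # voice cloning
    2.17, 1.96, 1.81, 3.37,              # pronunciation
    2.27, 1.80, 2.92,                    # pauses
]


def summarize(values):
    """Return (count, median, min, max) for a list of per-case metrics."""
    return len(values), median(values), min(values), max(values)


count, rtf_median, rtf_min, rtf_max = summarize(rtfs)
print(f"{count} cases, RTF median {rtf_median:.2f}x, range {rtf_min:.2f}x to {rtf_max:.2f}x")
# prints "41 cases, RTF median 2.04x, range 1.23x to 4.93x"
```

The same helper applies to the latency column if the per-case `lat` values are collected the same way.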

Capabilities

Base En

pending

Sound Effects

Out of scope

not supported

Failure Modes

pending

Multilingual

pending

Code Switching

pending

Cloning

pending

Same backend pattern as v1.5; desk-ref preset on PVC. Harness cloning cases will fall back to default voice unless we seed the preset.

Voice Design

Out of scope

not supported

Pronunciation

pending

Pauses

pending

Streaming

Out of scope

not supported

Dialogue

satisfactory

TTSD's specialty. [S1]/[S2]/[S3] tag-based dialogue works in our existing mode='generation' synthesis path: distinct voices are produced without the more complex continuation-mode setup with per-speaker reference audio. 4 extended cases passed: 2-speaker short, 2-speaker long, 3-speaker, and bilingual.

Long Form

pending

Showcase / extended cases

Voice cloning

1 case

ext-clone-out

ext · clone

This is a cloning test on the eight billion parameter dialogue specialist using a reference recording.

Same desk-ref preset used for the Local and v1.5 cloning tests, so this case is directly comparable to ext-clone-out in those models. RTF 4.93× reflects cloning's reference-processing overhead relative to TTSD's median of 2.04×.

RTF 4.93×

lat 28.00s

—

Dialogue

4 cases

9.1-dialogue-2speaker-short

ext

[S1] Did you finish the report? [S2] Yes, I sent it to you this morning. [S1] Great, thanks.

[S1]/[S2] tag-based dialogue. Confirmed an audible voice change between speakers, so TTSD's dialogue feature works without continuation-mode prompts.

RTF 1.72×

lat 7.00s

—

9.2-dialogue-2speaker-long

ext

[S1] So tell me about the new project. [S2] We are building a real-time TTS system for customer support. [S1] What is the latency target? [S2] Sub two hundred milliseconds, which means we need streaming inference end to end. [S1] That is ambitious. Have you tested it yet? [S2] First prototype this week. Looking good so far.

6-turn back-and-forth yielding 22.40s of audio. Tests speaker-identity persistence across long dialogues.

RTF 2.10×

lat 47.00s

—
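The numbers for case 9.2 suggest how RTF is computed in this report: latency divided by audio duration, so values above 1× mean synthesis is slower than real time. This is an inference rather than something the dashboard states, and it assumes the 22.40s reported for case 9.2 is the duration of the generated audio. A minimal sketch under that assumption (the function name is hypothetical):

```python
def real_time_factor(latency_s: float, audio_s: float) -> float:
    """Synthesis latency divided by generated audio duration (assumed definition)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return latency_s / audio_s


# Case 9.2: 47.00s latency, 22.40s taken as the audio duration (assumption).
rtf = real_time_factor(latency_s=47.00, audio_s=22.40)
print(f"RTF {rtf:.2f}x")  # prints "RTF 2.10x"
```

The result matches the 2.10× shown for the case, which is consistent with the assumed definition.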

9.3-dialogue-3speaker

ext

[S1] I think we should ship this feature next week. [S2] Hold on, we have not tested the edge cases yet. [S3] I agree. Let us wait until QA signs off. [S1] Fine, but next week is the absolute deadline.

[S3] tag added. Tests whether 3+ speakers get distinct voices.

RTF 2.43×

lat 33.00s

—

9.4-dialogue-bilingual

ext

[S1] Welcome to our store. [S2] 你好,我想买一些礼物。 [S1] Of course, follow me.

Bilingual exchange (English [S1] + Mandarin [S2]). Tests whether both speaker identity and language switch correctly within the same dialogue.

RTF 1.93×

lat 15.00s

—
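The dialogue cases above all rely on inline [S1]/[S2]/[S3] speaker tags. As an illustration of that input format, the sketch below splits a tagged script into per-speaker turns, which a harness could use to check how many distinct voices to listen for. `split_turns` is a hypothetical helper; it is not part of MossTTSDBackend or of the upstream MOSS-TTSD code, and the model itself receives the tagged text unchanged.

```python
import re

# Matches speaker tags of the form [S1], [S2], ... as used in the dialogue cases.
SPEAKER_TAG = re.compile(r"\[S(\d+)\]")


def split_turns(script: str) -> list[tuple[int, str]]:
    """Split a tagged script into (speaker_number, text) turns."""
    # re.split with one capture group yields [prefix, id, text, id, text, ...].
    parts = SPEAKER_TAG.split(script)
    turns = []
    for i in range(1, len(parts), 2):
        text = parts[i + 1].strip()
        if text:  # skip tags that are not followed by any text
            turns.append((int(parts[i]), text))
    return turns


# Input text of case 9.1-dialogue-2speaker-short.
script = "[S1] Did you finish the report? [S2] Yes, I sent it to you this morning. [S1] Great, thanks."
turns = split_turns(script)
speakers = sorted({speaker for speaker, _ in turns})
print(len(turns), speakers)  # prints "3 [1, 2]"
```

Text that appears before the first tag is ignored by this sketch; whether the model treats untagged text as a default speaker is not covered by the cases above.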

Standard harness

Base English

5 cases

1.1-short

Hello, this is the first sentence.

RTF 3.68×

lat 6.47s

—

1.2-medium

The quick brown fox jumps over the lazy dog, and afterwards goes to sleep.

RTF 1.79×

lat 8.16s

—

1.3-question

Could you please confirm whether the deployment succeeded?

RTF 2.04×

lat 5.70s

—

1.4-exclamation

Watch out, that is dangerous!

RTF 3.20×

lat 5.88s

—

1.5-long-paragraph

The deployment process began at six in the morning. By half past seven, the first replicas were warm and serving traffic. Engineers checked the dashboards every few minutes, watching for the subtle latency increase that always preceded a regression. The new model had been tested for weeks in staging, but production traffic exposed edge cases that no synthetic load could simulate.

RTF 1.23×

lat 25.58s

—

Failure modes

5 cases

11.1-empty

(no text)

RTF 1.47×

lat 14.42s

—

11.2-punctuation-only

!?...?!

RTF 2.47×

lat 6.51s

—

11.3-mixed-script

Hello 世界 مرحبا こんにちは namaste

RTF 2.04×

lat 5.23s

—

11.4-symbols

$1,234.56 (75% off) @ 3PM EST

RTF 1.57×

lat 6.54s

—

11.5-markdown-residue

<b>Hello</b> **world** _italic_

RTF 2.07×

lat 5.80s

—

Multilingual

6 cases

2.1-zh

你好,今天天气很好,适合出去散步。

RTF 2.26×

lat 6.50s

—

2.2-ja

こんにちは、お元気ですか?今日もいい天気ですね。

RTF 3.27×

lat 9.68s

—

2.3-es

Hola, ¿cómo estás hoy? Espero que muy bien.

RTF 1.97×

lat 5.52s

—

2.4-fr

Bonjour, comment allez-vous aujourd'hui?

RTF 3.36×

lat 7.80s

—

2.5-ar

مرحبا، كيف حالك اليوم؟ أتمنى أن تكون بخير.

RTF 2.03×

lat 6.65s

—

2.6-hi

नमस्ते, आप कैसे हैं? आज मौसम बहुत अच्छा है।

RTF 1.63×

lat 7.04s

—

Code switching

3 cases

3.1-en-zh

I'll meet you at the 茶馆 at three in the afternoon.

RTF 1.45×

lat 8.94s

—

3.2-en-es

She said hola and then waved goodbye.

RTF 1.61×

lat 6.33s

—

3.3-en-ja

The Japanese word for thank you is ありがとう.

RTF 1.91×

lat 5.82s

—

Voice cloning

10 cases

4.1-clone-clean-5s

clone

The quick brown fox jumps over the lazy dog.

RTF 2.03×

lat 5.53s

—

4.2-clone-clean-15s

clone

The quick brown fox jumps over the lazy dog.

RTF 2.23×

lat 4.82s

—

4.3-clone-clean-30s

clone

The quick brown fox jumps over the lazy dog.

RTF 3.57×

lat 7.43s

—

4.4-clone-noisy

clone

The quick brown fox jumps over the lazy dog.

RTF 1.96×

lat 5.49s

—

4.5-clone-accented

clone

The quick brown fox jumps over the lazy dog.

RTF 2.33×

lat 6.17s

—

4.6-clone-whispered

clone

The quick brown fox jumps over the lazy dog.

RTF 2.35×

lat 7.16s

—

4.7-clone-raspy

clone

The quick brown fox jumps over the lazy dog.

RTF 1.80×

lat 6.49s

—

4.8-clone-reverb

clone

The quick brown fox jumps over the lazy dog.

RTF 1.85×

lat 5.49s

—

4.9-clone-child

clone

The quick brown fox jumps over the lazy dog.

RTF 1.95×

lat 5.00s

—

4.10-clone-cross-lang

clone

你好,今天天气很好。

RTF 2.13×

lat 4.78s

—

Pronunciation

4 cases

6.1-irish-name

Her name is Saoirse Ronan.

RTF 2.17×

lat 6.24s

—

6.2-brand-hyundai

I drive a Hyundai Ioniq.

RTF 1.96×

lat 6.12s

—

6.3-gif-vs-jif

Save the file as a GIF and not a JPEG.

RTF 1.81×

lat 6.67s

—

6.4-sql

We use SQL to query the database.

RTF 3.37×

lat 8.89s

—

Pauses

3 cases

7.1-pause-short

Wait, [pause 0.5s] for it.

RTF 2.27×

lat 9.62s

—

7.2-pause-medium

She paused, [pause 1.5s] then continued.

RTF 1.80×

lat 8.65s

—

7.3-pause-long

And then [pause 3.0s] silence.

RTF 2.92×

lat 11.68s

—

Issues

ttsd-no-explicit-dialogue-path

low · open

MossTTSDBackend.synthesize() uses processor mode='generation' (mirrors v1.5). Upstream TTSD usage recommends mode='continuation' with prompt audio + per-speaker reference for proper multi-speaker output. Our [S1]/[S2] dialogue extended_cases probe whether generation mode alone produces distinguishable voices. If not, a dedicated synthesize_dialogue() method is the follow-up.

Deployment

Service

era-tts-moss

Namespace

era-core

GPU

nvidia-tesla-a100

GPU mode

gke-spot (preemptible)

PVC

era-tts-model-cache-moss

Worker timeout

1800s