MOSS-TTS-Local-Transformer 1.7B

1.7B · pending
MOSS-TTS · moss_tts_delay · Apache 2.0 · 🤗 OpenMOSS-Team/MOSS-TTS-Local-Transformer

Performance

Success rate

76%

35/46
Cold start

45s

activation → loaded
RTF (median)

4.247×

min 3.53× · max 8.74×
Latency (median)

13.4s

min 3.7s · max 104.0s

Capabilities

Base En

pending listening

Sound Effects

Out of scope
not supported

Failure Modes

no crash in 4 of 5 cases; 11.5-markdown-residue failed with a dropped connection

Multilingual

pending listening

Code Switching

pending listening
RTF 8.5× outlier on en-zh (case 3.1-en-zh); investigate

Cloning

satisfactory
Desk-ref test passed: the cloned voice recognizably matches the reference. Standard 9-clip matrix still TBD.

Voice Design

Out of scope
not supported

Pronunciation

pending listening

Pauses

broken
[pause X.Ys] markers are pronounced literally; may be a v1.5-only feature

Streaming

Out of scope
not supported

Long Form

truncated
Max ~26s of audio per request, even with max_new_tokens=8192

Showcase / extended cases

Voice cloning

1 case

ext-clone-out

ext · clone

This is a cloning test using a reference recording.

First cloning attempt. Does output match reference voice?
lat 32.00s

Pauses

2 cases

ext-pause-zh

ext

我今天学习了一首中国的古诗,它的名字是[pause 3.2s]静夜思!

Upstream model card example. Does [pause 3.2s] produce silence here?
lat 52.90s

ext-pause-en

ext

Listen carefully.[pause 2.0s]The answer is forty two.

Pause re-test in English.
lat 20.60s

Long form

1 case

ext-longform

ext

The deployment process began at six in the morning. By half past seven, the first replicas were warm and serving traffic. Engineers checked the dashboards every few minutes, watching for the subtle latency increase that always preceded a regression. The new model had been tested for weeks in staging, but production traffic exposed edge cases that no synthetic load could simulate. By noon, the team had isolated the issue and shipped a fix.

Output truncated — only 26s despite ~80 words and max_new_tokens=8192
RTF 3.95×
lat 104.00s
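The ~26s figure can be cross-checked from the reported latency and RTF. This is a minimal sketch that assumes RTF here means synthesis latency divided by audio duration; the report does not define RTF, but that reading is consistent with the numbers shown. `audio_seconds` is a hypothetical helper, not part of the harness.

```python
def audio_seconds(latency_s: float, rtf: float) -> float:
    """Audio duration implied by latency and RTF.

    Assumption: RTF = synthesis latency / audio duration.
    """
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return latency_s / rtf


# ext-longform: lat 104.00s, RTF 3.95x
longform = audio_seconds(104.00, 3.95)
# 1.5-long-paragraph: lat 87.25s, RTF 3.63x
paragraph = audio_seconds(87.25, 3.63)

print(f"ext-longform: {longform:.1f}s of audio")
print(f"1.5-long-paragraph: {paragraph:.1f}s of audio")
```

Under that assumption both long inputs land in the mid-20s of seconds, which matches the observed cap.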

speed-determinism

3 cases

ext-warm-1

ext

Hello, this is a repeatability test.

Sequential synth attempt 1 of 3 with identical input. Different audio length each call → model non-deterministic without seed.
RTF 3.68×
lat 10.01s

ext-warm-2

ext

Hello, this is a repeatability test.

Sequential synth attempt 2 of 3 with identical input.
RTF 3.57×
lat 13.41s

ext-warm-3

ext

Hello, this is a repeatability test.

Sequential synth attempt 3 of 3 with identical input. RTF stable at ~3.6-3.7x across attempts; latency variance is from audio-length variance.
RTF 3.71×
lat 8.32s

speed-ttfb

3 cases

ext-ttfb-short

ext

Hello, this is a short sentence.

TTFB via /api/v1/tts/stream. TTFB == total latency — confirms /stream is chunked pre-rendered delivery, not incremental decode.
RTF 4.08×
lat 9.15s
no audio

ext-ttfb-zh

ext

你好,今天天气很好,适合出去散步。

TTFB on /stream for Chinese. TTFB and total latency differ by less than 1 ms.
RTF 8.74×
lat 36.38s
no audio

ext-ttfb-long

ext

The deployment process began at six in the morning. By half past seven the first replicas were warm.

TTFB on /stream for long English. TTFB and total latency differ by less than 1 ms.
RTF 3.53×
lat 25.68s
no audio

Standard harness

Base English

5 cases

1.1-short

Hello, this is the first sentence.

RTF 4.64×
lat 9.65s

1.2-medium

The quick brown fox jumps over the lazy dog, and afterwards goes to sleep.

RTF 3.70×
lat 28.71s

1.3-question

Could you please confirm whether the deployment succeeded?

RTF 4.54×
lat 10.53s

1.4-exclamation

Watch out, that is dangerous!

RTF 4.60×
lat 13.26s

1.5-long-paragraph

The deployment process began at six in the morning. By half past seven, the first replicas were warm and serving traffic. Engineers checked the dashboards every few minutes, watching for the subtle latency increase that always preceded a regression. The new model had been tested for weeks in staging, but production traffic exposed edge cases that no synthetic load could simulate.

RTF 3.63×
lat 87.25s

Failure modes

5 cases

11.1-empty

(no text)

RTF 5.50×
lat 4.84s

11.2-punctuation-only

!?...?!

RTF 5.75×
lat 3.68s

11.3-mixed-script

Hello 世界 مرحبا こんにちは namaste

RTF 4.14×
lat 13.25s

11.4-symbols

$1,234.56 (75% off) @ 3PM EST

RTF 3.93×
lat 16.34s

11.5-markdown-residue

error

<b>Hello</b> **world** _italic_

('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
no audio

Multilingual

6 cases

2.1-zh

你好,今天天气很好,适合出去散步。

RTF 4.06×
lat 11.70s

2.2-ja

こんにちは、お元気ですか?今日もいい天気ですね。

RTF 4.65×
lat 18.96s

2.3-es

Hola, ¿cómo estás hoy? Espero que muy bien.

RTF 5.64×
lat 15.79s

2.4-fr

Bonjour, comment allez-vous aujourd'hui?

RTF 5.20×
lat 9.97s

2.5-ar

مرحبا، كيف حالك اليوم؟ أتمنى أن تكون بخير.

RTF 4.16×
lat 16.29s

2.6-hi

नमस्ते, आप कैसे हैं? आज मौसम बहुत अच्छा है।

RTF 3.94×
lat 17.67s

Code switching

3 cases

3.1-en-zh

I'll meet you at the 茶馆 at three in the afternoon.

RTF 8.52×
lat 36.78s

3.2-en-es

She said hola and then waved goodbye.

RTF 4.64×
lat 11.14s

3.3-en-ja

The Japanese word for thank you is ありがとう.

RTF 4.35×
lat 12.19s

Voice cloning

10 cases

4.1-clone-clean-5s

error · clone

The quick brown fox jumps over the lazy dog.

('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
no audio

4.2-clone-clean-15s

error · clone

The quick brown fox jumps over the lazy dog.

HTTPConnectionPool(host='localhost', port=8090): Max retries exceeded with url: /api/v1/tts/generate (Caused by NewConne
no audio

4.3-clone-clean-30s

error · clone

The quick brown fox jumps over the lazy dog.

HTTPConnectionPool(host='localhost', port=8090): Max retries exceeded with url: /api/v1/tts/generate (Caused by NewConne
no audio

4.4-clone-noisy

error · clone

The quick brown fox jumps over the lazy dog.

HTTPConnectionPool(host='localhost', port=8090): Max retries exceeded with url: /api/v1/tts/generate (Caused by NewConne
no audio

4.5-clone-accented

error · clone

The quick brown fox jumps over the lazy dog.

HTTPConnectionPool(host='localhost', port=8090): Max retries exceeded with url: /api/v1/tts/generate (Caused by NewConne
no audio

4.6-clone-whispered

error · clone

The quick brown fox jumps over the lazy dog.

HTTPConnectionPool(host='localhost', port=8090): Max retries exceeded with url: /api/v1/tts/generate (Caused by NewConne
no audio

4.7-clone-raspy

error · clone

The quick brown fox jumps over the lazy dog.

HTTPConnectionPool(host='localhost', port=8090): Max retries exceeded with url: /api/v1/tts/generate (Caused by NewConne
no audio

4.8-clone-reverb

error · clone

The quick brown fox jumps over the lazy dog.

HTTPConnectionPool(host='localhost', port=8090): Max retries exceeded with url: /api/v1/tts/generate (Caused by NewConne
no audio

4.9-clone-child

error · clone

The quick brown fox jumps over the lazy dog.

HTTPConnectionPool(host='localhost', port=8090): Max retries exceeded with url: /api/v1/tts/generate (Caused by NewConne
no audio

4.10-clone-cross-lang

error · clone

你好,今天天气很好。

HTTPConnectionPool(host='localhost', port=8090): Max retries exceeded with url: /api/v1/tts/generate (Caused by NewConne
no audio

Pronunciation

4 cases

6.1-irish-name

Her name is Saoirse Ronan.

RTF 4.23×
lat 14.23s

6.2-brand-hyundai

I drive a Hyundai Ioniq.

RTF 4.55×
lat 7.28s

6.3-gif-vs-jif

Save the file as a GIF and not a JPEG.

RTF 4.26×
lat 10.23s

6.4-sql

We use SQL to query the database.

RTF 4.86×
lat 12.04s

Pauses

3 cases

7.1-pause-short

Wait, [pause 0.5s] for it.

RTF 4.39×
lat 11.60s

7.2-pause-medium

She paused, [pause 1.5s] then continued.

RTF 4.09×
lat 16.34s

7.3-pause-long

And then [pause 3.0s] silence.

RTF 4.11×
lat 19.06s

Issues

pipe-deadlock

critical · fixed

Worker subprocess pipe deadlock — root cause of all activation hangs across L4 and A100 attempts
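The report does not say which pipe deadlocked or how it was fixed. As a general illustration of this class of bug only, the sketch below shows a parent that drains a child's pipes while waiting, so the child cannot block on a full pipe buffer. The child command and the function name `run_worker_draining` are hypothetical.

```python
import subprocess
import sys

# Child that writes far more than a typical OS pipe buffer to stdout.
CHILD = "import sys; sys.stdout.write('x' * 1_000_000)"


def run_worker_draining() -> int:
    """Start the child and drain its pipes while waiting for it to exit."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # communicate() reads stdout and stderr to EOF, then waits for exit.
    # Calling proc.wait() here without reading can hang: the child blocks
    # writing to a full pipe while the parent blocks waiting for the child.
    out, _err = proc.communicate(timeout=60)
    return len(out)


received = run_worker_draining()
print(f"received {received} bytes")
```

Whether the actual fix resembled this is not stated in the report.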

liveness-restart

high · open

Liveness probe times out during long synthesis (event loop blocked by sync call); 4 pod restarts observed in 85 min of testing. Production blocker — drops in-flight requests.
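One common way to keep a probe handler responsive is to move the synchronous call off the event loop. The sketch below is an assumption about a possible mitigation, not the service's code: `blocking_synthesize` and `liveness_ticks` are hypothetical stand-ins, and the approach only helps if the real synthesis call releases the GIL while it runs.

```python
import asyncio
import time
from typing import Tuple


def blocking_synthesize(text: str) -> bytes:
    """Stand-in for a long synchronous synthesis call (hypothetical)."""
    time.sleep(0.3)
    return text.encode("utf-8")


async def liveness_ticks(stop: asyncio.Event) -> int:
    """Stand-in for a liveness handler: counts how often the loop gets to run."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def main() -> Tuple[bytes, int]:
    stop = asyncio.Event()
    probe = asyncio.create_task(liveness_ticks(stop))
    # Offload the sync call to a worker thread so the event loop stays free.
    audio = await asyncio.to_thread(blocking_synthesize, "hello")
    stop.set()
    return audio, await probe


audio, ticks = asyncio.run(main())
print(f"synthesis returned {len(audio)} bytes; loop ran {ticks} times meanwhile")
```

If the synthesis call were awaited directly on the loop instead, the tick counter would not advance until it returned, which is the behaviour the probe timeout suggests.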

non-deterministic-no-seed

medium · open

Without an explicit `seed` parameter, identical (text, preset_id) input produces different audio durations and prosody across calls. Disables hash-based output caching. Untested: whether passing `seed` actually makes output deterministic.
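A hash-based cache could sidestep the problem by refusing to cache unseeded requests. This is a minimal sketch under the assumption, untested per the note above, that a fixed `seed` makes output reproducible; `cache_key` is a hypothetical helper.

```python
import hashlib
import json
from typing import Optional


def cache_key(text: str, preset_id: str, seed: Optional[int]) -> Optional[str]:
    """Return a stable cache key, or None when output may not be reproducible."""
    if seed is None:
        # Unseeded calls produce different audio for identical input,
        # so a cached result would not represent "the" output.
        return None
    payload = json.dumps(
        {"text": text, "preset_id": preset_id, "seed": seed},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


print(cache_key("Hello, this is a repeatability test.", "default", None))
print(cache_key("Hello, this is a repeatability test.", "default", 1234))
```

The preset name `default` and the seed value are placeholders.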

streaming-endpoint-not-incremental

low · by-design

/api/v1/tts/stream serves a pre-rendered buffer chunked at 4 KB; TTFB equals total synth latency. moss_tts_delay is batch-only by architecture. True streaming would require the Realtime variant (upstream config bug).
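The TTFB comparison can be reproduced with a small timing helper. The sketch below is an illustration, not the harness code: `measure_ttfb` and `prerendered` are hypothetical names, and the generator simulates a server that renders everything before sending 4 KB chunks.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttfb(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttfb_s, total_s, n_bytes) for an iterable of response chunks."""
    start = clock()
    ttfb = None
    n_bytes = 0
    for chunk in chunks:
        if ttfb is None and chunk:
            ttfb = clock() - start  # first non-empty chunk has arrived
        n_bytes += len(chunk)
    total = clock() - start
    return (total if ttfb is None else ttfb), total, n_bytes


def prerendered(delay_s: float, size: int, chunk_size: int = 4096) -> Iterator[bytes]:
    """Simulate full synthesis first, then chunked delivery of the buffer."""
    time.sleep(delay_s)  # all synthesis happens before the first byte
    buf = b"\0" * size
    for i in range(0, size, chunk_size):
        yield buf[i:i + chunk_size]


ttfb, total, n_bytes = measure_ttfb(prerendered(0.2, 20_000))
print(f"ttfb={ttfb:.3f}s total={total:.3f}s bytes={n_bytes}")
```

With pre-rendered delivery the two timings are nearly equal; an incremental decoder would show a TTFB well below the total. Against a real endpoint the request would have to be issued inside the timed region.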

service-selector-overlap

medium · open

era-tts and era-tts-moss Services both select app=era-tts; requests round-robin to both pods. Workaround: direct pod port-forward.
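The overlap follows from how equality-based selectors match: a Service selects every pod whose labels contain all of the selector's key/value pairs. The sketch below models that rule; the pod names and the extra `model` label are hypothetical and not taken from the deployment.

```python
from typing import Dict, List


def selects(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """Equality-based selector: every selector pair must appear in the pod's labels."""
    return all(pod_labels.get(k) == v for k, v in selector.items())


def endpoints(selector: Dict[str, str], pods: Dict[str, Dict[str, str]]) -> List[str]:
    """Names of the pods a Service with this selector would route to."""
    return sorted(name for name, labels in pods.items() if selects(selector, labels))


# Hypothetical pods: both carry app=era-tts; one also carries a distinguishing label.
pods = {
    "era-tts-pod": {"app": "era-tts"},
    "era-tts-moss-pod": {"app": "era-tts", "model": "moss"},
}

overlapping = endpoints({"app": "era-tts"}, pods)
narrowed = endpoints({"app": "era-tts", "model": "moss"}, pods)
print("app=era-tts ->", overlapping)
print("app=era-tts,model=moss ->", narrowed)
```

Narrowing one selector isolates only that Service; the other Service would still match both pods unless its pods and selector also gain a distinguishing label.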

long-form-cap

medium · open

max_new_tokens=8192 only extends audio to ~26s (no proportional gain over 4096)
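A possible client-side workaround, not something the report tested, is to split long text on sentence boundaries and synthesize each chunk as its own request. The sketch below shows only the splitting step; `chunk_sentences` is a hypothetical helper and the 40-word budget is an illustrative value, not a measured limit.

```python
import re
from typing import List


def chunk_sentences(text: str, max_words: int = 40) -> List[str]:
    """Group whole sentences into chunks of at most max_words words where possible.

    A single sentence longer than max_words is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "The deployment process began at six in the morning. "
    "By half past seven, the first replicas were warm and serving traffic. "
    "By noon, the team had isolated the issue and shipped a fix."
)
for i, chunk in enumerate(chunk_sentences(sample, max_words=25), start=1):
    print(i, chunk)
```

Because output is non-deterministic without a seed, prosody may vary between chunks, so the joined audio would need a listening check.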

pause-tag-literal

medium · open

[pause X.Ys] markers are pronounced literally, even though this is the canonical upstream syntax per the model card. May be a v1.5-only feature, or may need different preprocessing.
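If the markers turn out to need preprocessing, one option is to strip them on the client and insert silence between separately synthesized segments. The sketch below covers only the parsing step and is an assumption about a workaround, not upstream behaviour; `split_pauses` and `PAUSE_RE` are hypothetical names.

```python
import re
from typing import List, Tuple, Union

# Matches markers such as [pause 0.5s] or [pause 3s].
PAUSE_RE = re.compile(r"\[pause\s+(\d+(?:\.\d+)?)s\]")

Segment = Tuple[str, Union[str, float]]


def split_pauses(text: str) -> List[Segment]:
    """Split text into ("text", str) and ("pause", seconds) segments, in order."""
    segments: List[Segment] = []
    pos = 0
    for match in PAUSE_RE.finditer(text):
        before = text[pos:match.start()].strip()
        if before:
            segments.append(("text", before))
        segments.append(("pause", float(match.group(1))))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


segments = split_pauses("Listen carefully.[pause 2.0s]The answer is forty two.")
for kind, value in segments:
    print(kind, value)
```

Each text segment would then be synthesized on its own and joined with silence of the requested length, at the cost of losing prosodic continuity across the pause.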

Deployment

Service

era-tts-moss

Namespace

era-core

GPU

nvidia-tesla-a100

GPU mode

gke-spot (preemptible)

PVC

era-tts-model-cache-moss

Worker timeout

3600s