Audio and Motion Data are the two streams that make the Avatar speak and move. Avatar speech audio is the input to Motion Server. Motion data is the output that drives the Avatar’s mouth, head, and gestures in AvatarKit.
Spatius does not consume the user’s microphone audio by default. In a voice-agent app, user speech usually goes through ASR, agent logic, and TTS before it becomes avatar speech audio.
avatar speech audio -> Motion Server -> motion data -> AvatarKit
Avatar speech audio
Motion Server accepts avatar speech audio as mono 16-bit PCM (s16le). The sample rate is configured when the connection or server session is initialized; after that, every chunk you send must match the configured rate.
| Property | Value |
|---|---|
| Sample rate | One of 8000, 16000, 22050, 24000, 32000, 44100, 48000 |
| Channels | 1 (mono) |
| Bit depth | 16-bit |
| Encoding | Signed PCM, little-endian (s16le) |
| Container | Raw PCM bytes — no WAV header, no compressed frames |
If your source is stereo, floating-point, compressed, or a different sample rate, convert it before sending. AvatarKit does not resample for you.
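As a minimal sketch of that conversion, the helper below downmixes interleaved stereo float32 samples to mono s16le using only the standard library. It is illustrative, not part of any Spatius SDK, and it deliberately omits resampling; use a proper audio pipeline (e.g. ffmpeg or soxr) to change sample rates.

```python
# Sketch: convert interleaved stereo float32 samples to mono s16le,
# the raw format Motion Server expects. Resampling is intentionally
# omitted here.
import struct

def stereo_f32_to_mono_s16le(raw: bytes) -> bytes:
    """Downmix interleaved stereo float32 frames and quantize to s16le."""
    floats = struct.unpack(f"<{len(raw) // 4}f", raw)
    out = bytearray()
    for i in range(0, len(floats), 2):
        mixed = (floats[i] + floats[i + 1]) / 2.0       # average both channels
        clamped = max(-1.0, min(1.0, mixed))            # avoid int16 overflow
        out += struct.pack("<h", int(clamped * 32767))  # signed 16-bit little-endian
    return bytes(out)
```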
Use 16000 Hz as the default for most speech-driven integrations. Use 24000 Hz when it matches your TTS provider natively. Use 44100 / 48000 Hz when an RTC framework dictates that rate.
Sample-rate mismatch usually shows up as distorted, silent, or out-of-sync playback. It may not produce a separate error event.
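One quick sanity check, since no error event may fire: compute the duration a chunk should represent at the session rate and compare it to what your pipeline assumes. The helper below is a hypothetical utility, not a Spatius API; it only applies the fact that 16-bit mono PCM is 2 bytes per sample.

```python
# Sketch: sanity-check a raw s16le mono chunk against the session sample rate.
# 16-bit mono PCM is 2 bytes per sample, so duration = len(bytes) / (2 * rate).
BYTES_PER_SAMPLE = 2  # 16-bit mono

def chunk_duration_ms(chunk: bytes, sample_rate: int) -> float:
    if len(chunk) % BYTES_PER_SAMPLE:
        raise ValueError("s16le chunk must contain whole 16-bit samples")
    return len(chunk) / (BYTES_PER_SAMPLE * sample_rate) * 1000.0
```

For example, a 640-byte chunk is 20 ms of audio at 16000 Hz but about 13.3 ms at 24000 Hz; if the session rate and your source rate disagree, playback drifts or distorts by exactly this ratio.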
→ Reference: Web Configuration · iOS AudioFormat · Android AudioFormat
Motion data
Motion data is generated by Motion Server from avatar speech audio. It is not video, and it is not animation data your app creates by hand.
AvatarKit consumes motion data together with the matching audio. If audio arrives without motion data, the Avatar may play sound but cannot perform the matching mouth, head, or gesture movement.
Data by mode
The inputs and outputs are the same in every mode; what changes is who owns each side of the exchange:
| Mode | Avatar speech audio to Motion Server | Audio and motion data to AvatarKit |
|---|---|---|
| Basic Mode | AvatarKit on the client. | AvatarKit receives the output directly. |
| Custom Mode | Your backend through the Spatius Server SDK. | Your backend forwards encoded output messages through your transport. |
| LiveKit Plugin | The LiveKit Plugin running in your agent worker. | Motion Server publishes the output into the LiveKit room. |
In Custom Mode, keep the two payload boundaries separate:
- Backend → Motion Server: send raw mono PCM16 avatar speech audio at the session sample rate.
- Backend / transport → AvatarKit: deliver both encoded outputs produced by the server path.
| Encoded output | Client API |
|---|---|
| Audio messages | receiveAudioData() |
| Motion messages | receiveMotionData() |
Do not pass your original raw PCM directly into Custom Mode receive APIs such as receiveAudioData(). Those APIs consume encoded output messages from the Spatius server path. If you only deliver audio messages and omit motion messages, AvatarKit can play audio but cannot drive the Avatar’s movement.
In the LiveKit Plugin path, do not call receiveAudioData(...) or provide audio chunks from the client. The plugin attaches to your agent session and sends avatar speech audio to Spatius from the agent worker.
Response end and interruption
Each avatar response needs a clear end. Without it, AvatarKit may keep the response open and never return to idle.
In Basic Mode, avatar speech audio enters AvatarKit through AvatarController.receiveAudioData(...):
- Provide the first chunk with receiveAudioData(audioData, false).
- Continue providing chunks as your TTS or speech source produces them.
- Mark the final chunk end-of-stream with receiveAudioData(lastChunk, true).
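The streaming contract can be sketched with a minimal mock. MockAvatarController below is a stand-in written for illustration, not the real AvatarController: it only models the two behaviors documented here, that a response stays open until a chunk is flagged end-of-stream, and that receiveAudioData(...) returns a conversation ID.

```python
# Sketch of the Basic Mode streaming contract using a mock controller.
import itertools

class MockAvatarController:
    _ids = itertools.count(1)

    def __init__(self):
        self.conversation_id = None
        self.response_open = False

    def receiveAudioData(self, chunk: bytes, end_of_stream: bool) -> str:
        if not self.response_open:
            self.conversation_id = f"conv-{next(self._ids)}"  # new response begins
            self.response_open = True
        if end_of_stream:
            self.response_open = False  # avatar can return to idle
        return self.conversation_id

controller = MockAvatarController()
cid = controller.receiveAudioData(b"\x00" * 640, False)  # first chunk
controller.receiveAudioData(b"\x00" * 640, False)        # middle chunk
controller.receiveAudioData(b"\x00" * 320, True)         # final chunk: end-of-stream
```

If the final True flag is never sent, response_open stays true, which is the mock analogue of the "never returns to idle" failure above.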
receiveAudioData(...) returns a conversation ID. Keep it if you need to correlate later state changes, interruptions, or errors with a specific response.
In Custom Mode, your backend sends raw PCM audio through the Server SDK instead. For example, the Python Server SDK uses sample_rate=... when creating the session, then sends the avatar speech audio bytes with the same end-of-stream flag. The backend then forwards both outputs to the client: encoded audio messages through receiveAudioData(...), and encoded motion messages through receiveMotionData(...).
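The two payload boundaries can be sketched as follows. Everything here is a mock written for illustration: mock_motion_server stands in for the real Motion Server, and the transport list stands in for whatever channel carries messages to your client. The only point being demonstrated is that raw PCM goes in, two encoded streams come out, and both must be forwarded.

```python
# Sketch of the two Custom Mode payload boundaries (all names are mocks).
def mock_motion_server(raw_pcm: bytes, end_of_stream: bool):
    """Stand-in for Motion Server: returns encoded audio + motion messages."""
    return {"type": "audio", "payload": raw_pcm}, {"type": "motion", "payload": b"..."}

def forward_to_client(transport: list, raw_pcm: bytes, end_of_stream: bool) -> None:
    audio_msg, motion_msg = mock_motion_server(raw_pcm, end_of_stream)
    # On the client, these land in receiveAudioData() / receiveMotionData():
    transport.append(("receiveAudioData", audio_msg))
    transport.append(("receiveMotionData", motion_msg))

transport: list = []
forward_to_client(transport, b"\x00" * 640, end_of_stream=True)
```

Dropping the second append is the mock analogue of the failure described earlier: audio plays, but the Avatar cannot move.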
→ Reference: Web AvatarController · iOS AvatarController · Android AvatarController
Use interrupt() when the current avatar response should stop immediately, such as user barge-in.
interrupt() stops playback, clears pending audio and motion data, and resets the conversation context. After it returns, the next receiveAudioData(...) starts from a clean response.
pause() / resume() are different: they preserve state and buffers for later continuation. interrupt() throws them away.
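The difference can be shown with a small mock playback object. MockPlayback is not the real AvatarKit implementation; it only models the documented contrast: pause() preserves pending data, interrupt() discards it.

```python
# Sketch contrasting interrupt() with pause()/resume() on a minimal mock.
class MockPlayback:
    def __init__(self):
        self.buffer = []      # pending audio/motion items
        self.playing = False

    def enqueue(self, item):
        self.buffer.append(item)
        self.playing = True

    def pause(self):
        self.playing = False  # state and buffers preserved

    def resume(self):
        self.playing = bool(self.buffer)

    def interrupt(self):
        self.buffer.clear()   # pending audio and motion data discarded
        self.playing = False  # next response starts from a clean state

p = MockPlayback()
p.enqueue("chunk-1")
p.pause()
assert p.buffer == ["chunk-1"]  # pause keeps pending data
p.resume()
p.interrupt()
assert p.buffer == []           # interrupt throws it away
```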
What can go wrong
| Symptom | Likely cause |
|---|---|
| Audio is distorted or silent | Wrong sample rate, wrong channel count, compressed input, or non-s16le samples. |
| Avatar never returns to idle | Final chunk was not marked end-of-stream. |
| Audio plays but the Avatar does not move | Motion data is missing, late, or delivered through the wrong path. |
| Playback feels delayed | Chunks are too large, arrive late, or are buffered upstream by TTS / transport. |
| Avatar keeps speaking after barge-in | interrupt() is not called when your product cancels the current response. |
Pre-flight checklist
- Avatar speech audio is raw mono 16-bit little-endian PCM with no container.
- Every chunk matches the sample rate configured for the session.
- The final chunk of each response is marked end-of-stream.
- In Custom Mode, both encoded audio messages and encoded motion messages reach AvatarKit.
- interrupt() is wired to barge-in or any product action that cancels the current response.
Go next
- State & Events to observe playback state, errors, or recovery.
- Sessions & Lifecycle if audio is not flowing because the connection path is not online.
- Avatars if audio plays but the Avatar does not load or render.