Audio and Motion Data are the two streams that make the Avatar speak and move. Avatar speech audio is the input to Motion Server. Motion data is the output that drives the Avatar’s mouth, head, and gestures in AvatarKit. Spatius does not consume the user’s microphone audio by default. In a voice-agent app, user speech usually goes through ASR, agent logic, and TTS before it becomes avatar speech audio.
avatar speech audio -> Motion Server -> motion data -> AvatarKit

Avatar speech audio

Motion Server accepts avatar speech audio as mono 16-bit PCM (s16le). The sample rate is configured when the connection or server session is initialized; after that, every chunk you send must match the configured rate.
  • Sample rate: one of 8000, 16000, 22050, 24000, 32000, 44100, or 48000 Hz
  • Channels: 1 (mono)
  • Bit depth: 16-bit
  • Encoding: signed PCM, little-endian (s16le)
  • Container: raw PCM bytes (no WAV header, no compressed frames)
If your source is stereo, floating-point, compressed, or at a different sample rate, convert it before sending; AvatarKit does not resample for you.
  • 16000 Hz: the default for most speech-driven integrations.
  • 24000 Hz: when it matches your TTS provider natively.
  • 44100 / 48000 Hz: when an RTC framework dictates that rate.
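For the stereo or floating-point case, here is a minimal client-side sketch of the conversion, assuming interleaved Float32 samples in [-1, 1] that are already at the session sample rate (resampling itself is left to your capture or TTS pipeline):

  // Downmix interleaved float samples to mono and quantize to raw s16le bytes.
  function toMonoS16le(input: Float32Array, channels: number): ArrayBuffer {
    const frames = Math.floor(input.length / channels);
    const view = new DataView(new ArrayBuffer(frames * 2)); // 2 bytes per 16-bit sample
    for (let i = 0; i < frames; i++) {
      let sum = 0;
      for (let c = 0; c < channels; c++) sum += input[i * channels + c];
      const s = Math.max(-1, Math.min(1, sum / channels)); // clamp before quantizing
      view.setInt16(i * 2, Math.round(s * 32767), true);   // true = little-endian
    }
    return view.buffer;
  }

Writing through a DataView with the little-endian flag keeps the output s16le regardless of the host platform's byte order.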
Sample-rate mismatch usually shows up as distorted, silent, or out-of-sync playback. It may not produce a separate error event.
→ Reference: Web Configuration · iOS AudioFormat · Android AudioFormat

Motion data

Motion data is generated by Motion Server from avatar speech audio. It is not video, and it is not animation data your app creates by hand. AvatarKit consumes motion data together with the matching audio. If audio arrives without motion data, the Avatar may play sound but cannot perform the matching mouth, head, or gesture movement.

Data by mode

The input and output are the same in every mode; what changes is which component owns each hop:
  • Basic Mode: AvatarKit on the client sends avatar speech audio to Motion Server and receives the output directly.
  • Custom Mode: your backend sends avatar speech audio through the Spatius Server SDK and forwards the encoded output messages through your transport.
  • LiveKit Plugin: the LiveKit Plugin running in your agent worker sends avatar speech audio, and Motion Server publishes the output into the LiveKit room.
In Custom Mode, keep the two payload boundaries separate:
  • Backend → Motion Server: send raw mono PCM16 avatar speech audio at the session sample rate.
  • Backend / transport → AvatarKit: deliver both encoded outputs produced by the server path.
Each encoded output maps to one client receive API:
  • Audio messages → receiveAudioData()
  • Motion messages → receiveMotionData()
Do not pass your original raw PCM directly into Custom Mode receive APIs such as receiveAudioData(). Those APIs consume encoded output messages from the Spatius server path. If you deliver only audio messages and omit motion messages, AvatarKit can play audio but cannot drive the Avatar's movement.
In the LiveKit Plugin path, do not call receiveAudioData(...) or provide audio chunks from the client. The plugin attaches to your agent session and sends avatar speech audio to Spatius from the agent worker.
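To make the two boundaries concrete, here is a TypeScript sketch. MotionSession, sendAudio, onEncodedAudio, onEncodedMotion, and ClientTransport are hypothetical stand-ins for your Server SDK session and your own transport, not real Spatius names:

  // Hypothetical shapes for the Server SDK session and your client transport.
  interface MotionSession {
    sendAudio(pcm: Uint8Array, endOfStream: boolean): void;    // boundary 1: raw mono PCM16 in
    onEncodedAudio(cb: (msg: Uint8Array) => void): void;       // encoded audio messages out
    onEncodedMotion(cb: (msg: Uint8Array) => void): void;      // encoded motion messages out
  }
  interface ClientTransport {
    send(kind: "audio" | "motion", payload: Uint8Array): void; // your app-level framing
  }

  // Boundary 2: forward both encoded outputs to the client, unmodified.
  function wireCustomMode(session: MotionSession, transport: ClientTransport): void {
    session.onEncodedAudio((msg) => transport.send("audio", msg));   // client feeds receiveAudioData(...)
    session.onEncodedMotion((msg) => transport.send("motion", msg)); // client feeds receiveMotionData(...)
  }

The asymmetry is the point: raw PCM crosses only the first boundary, and only encoded messages cross the second.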

Response end and interruption

Each avatar response needs a clear end. Without it, AvatarKit may keep the response open and never return to idle. In Basic Mode, avatar speech audio enters AvatarKit through AvatarController.receiveAudioData(...):
  1. Provide the first chunk with receiveAudioData(audioData, false).
  2. Continue providing chunks as your TTS or speech source produces them.
  3. Mark the final chunk end-of-stream with receiveAudioData(lastChunk, true).
receiveAudioData(...) returns a conversation ID. Keep it if you need to correlate later state changes, interruptions, or errors with a specific response.
In Custom Mode, your backend sends raw PCM audio through the Server SDK instead. For example, the Python Server SDK uses sample_rate=... when creating the session, then sends the avatar speech audio bytes with the same end-of-stream flag. The backend then forwards both outputs to the client: encoded audio messages through receiveAudioData(...), and encoded motion messages through receiveMotionData(...).
→ Reference: Web AvatarController · iOS AvatarController · Android AvatarController
Use interrupt() when the current avatar response should stop immediately, such as on user barge-in. interrupt() stops playback, clears pending audio and motion data, and resets the conversation context. After it returns, the next receiveAudioData(...) starts from a clean response. pause() / resume() are different: they preserve state and buffers for later continuation; interrupt() throws them away.
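Here is a sketch of the Basic Mode sequence from steps 1-3, assuming a controller with the two-argument receiveAudioData(data, endOfStream) described above; AvatarControllerLike and the TTS chunk iterable are stand-ins, not real AvatarKit types:

  // Minimal stand-in for the AvatarController surface used below.
  interface AvatarControllerLike {
    receiveAudioData(data: ArrayBuffer, endOfStream: boolean): string; // returns a conversation ID
  }

  async function speakResponse(
    controller: AvatarControllerLike,
    chunks: AsyncIterable<ArrayBuffer>, // hypothetical TTS output, already raw mono s16le
  ): Promise<string | undefined> {
    let pending: ArrayBuffer | null = null;
    let conversationId: string | undefined;
    for await (const chunk of chunks) {
      if (pending !== null) {
        // Not the last chunk yet, so the end-of-stream flag stays false.
        const id = controller.receiveAudioData(pending, false);
        conversationId ??= id;
      }
      pending = chunk;
    }
    if (pending !== null) {
      // Final chunk: mark end-of-stream so AvatarKit can close the response.
      const id = controller.receiveAudioData(pending, true);
      conversationId ??= id;
    }
    return conversationId; // keep it to correlate later state changes or errors
  }

Buffering one chunk ahead means the last real chunk carries the end-of-stream flag, even when the TTS stream ends without warning.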

What can go wrong

Symptom → likely cause:
  • Audio is distorted or silent → wrong sample rate, wrong channel count, compressed input, or non-s16le samples.
  • Avatar never returns to idle → the final chunk was not marked end-of-stream.
  • Audio plays but the Avatar does not move → motion data is missing, late, or delivered through the wrong path.
  • Playback feels delayed → chunks are too large, arrive late, or are buffered upstream by TTS / transport.
  • Avatar keeps speaking after barge-in → interrupt() is not called when your product cancels the current response.

Pre-flight checklist

  • Sample rate matches the configured connection or server session.
  • Audio is mono.
  • Samples are 16-bit signed PCM, little-endian.
  • Bytes are raw PCM, not WAV / MP3 / Opus / AAC frames.
  • Final chunk of each response is marked end-of-stream.
  • In Custom Mode, encoded motion messages are delivered to receiveMotionData(...).
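Most of these checks can only be enforced where the audio is produced, but the container item is cheap to guard at send time. A heuristic sketch, assuming chunks arrive as Uint8Array:

  // Heuristic pre-flight guard for one outgoing chunk: raw s16le PCM should
  // never begin with a RIFF/WAV header and always has an even byte length.
  function assertRawPcm16(chunk: Uint8Array): void {
    const looksLikeWav =
      chunk.length >= 4 &&
      chunk[0] === 0x52 && chunk[1] === 0x49 && // "RI"
      chunk[2] === 0x46 && chunk[3] === 0x46;   // "FF"
    if (looksLikeWav) {
      throw new Error("Chunk starts with a RIFF header; send raw PCM bytes, not a WAV file.");
    }
    if (chunk.length % 2 !== 0) {
      throw new Error("Odd byte length; 16-bit samples are 2 bytes each.");
    }
  }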

Go next

  • State & Events to observe playback state, errors, or recovery.
  • Sessions & Lifecycle if audio is not flowing because the connection path is not online.
  • Avatars if audio plays but the Avatar does not load or render.