Audio and Motion Data are the two streams that make the Avatar speak and move. Avatar speech audio is the input to Motion Server. Motion data is the output that drives the Avatar’s mouth, head, and gestures in AvatarKit. Spatius does not consume the user’s microphone audio by default. In a voice-agent app, user speech usually goes through ASR, agent logic, and TTS before it becomes avatar speech audio.
avatar speech audio -> Motion Server -> motion data -> AvatarKit

Avatar speech audio

Motion Server accepts avatar speech audio as mono 16-bit PCM (s16le). The sample rate is configured when the connection or server session is initialized; after that, every chunk you send must match the configured rate.
  • Sample rate: one of 8000, 16000, 22050, 24000, 32000, 44100, or 48000 Hz
  • Channels: 1 (mono)
  • Bit depth: 16-bit
  • Encoding: signed PCM, little-endian (s16le)
  • Container: raw PCM bytes (no WAV header, no compressed frames)
If your source is stereo, floating-point, compressed, or at a different sample rate, convert it before sending; AvatarKit does not resample for you.
  • 16000 Hz: the default for most speech-driven integrations.
  • 24000 Hz: when it matches your TTS provider natively.
  • 44100 / 48000 Hz: when an RTC framework dictates that rate.
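For the stereo or floating-point case, here is a minimal client-side sketch of the conversion, assuming interleaved Float32 samples in [-1, 1] that are already at the session sample rate (resampling itself is left to your capture or TTS pipeline):

  // Downmix interleaved float samples to mono and quantize to raw s16le bytes.
  function toMonoS16le(input: Float32Array, channels: number): ArrayBuffer {
    const frames = Math.floor(input.length / channels);
    const view = new DataView(new ArrayBuffer(frames * 2)); // 2 bytes per 16-bit sample
    for (let i = 0; i < frames; i++) {
      let sum = 0;
      for (let c = 0; c < channels; c++) sum += input[i * channels + c];
      const s = Math.max(-1, Math.min(1, sum / channels)); // clamp before quantizing
      view.setInt16(i * 2, Math.round(s * 32767), true);   // true = little-endian
    }
    return view.buffer;
  }

Writing through a DataView with the little-endian flag keeps the output s16le regardless of the host platform's byte order.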
Sample-rate mismatch usually shows up as distorted, silent, or out-of-sync playback. It may not produce a separate error event.
→ Reference: Web Configuration · iOS AudioFormat · Android AudioFormat

Motion data

Motion data is generated by Motion Server from avatar speech audio. It is not video, and it is not animation data your app creates by hand. AvatarKit consumes motion data together with the matching audio. If audio arrives without motion data, the Avatar may play sound but cannot perform the matching mouth, head, or gesture movement.

Data by mode

The input and output are the same in every mode; what changes is which component owns each hop:
  • Basic Mode: AvatarKit on the client sends avatar speech audio to Motion Server and receives the output directly.
  • Custom Mode: your backend sends avatar speech audio through the Spatius Server SDK and forwards the encoded output messages through your transport.
  • LiveKit Plugin: the LiveKit Plugin running in your agent worker sends avatar speech audio, and Motion Server publishes the output into the LiveKit room.
In Custom Mode, keep the two payload boundaries separate:
  • Backend → Motion Server: send raw mono PCM16 avatar speech audio at the session sample rate.
  • Backend / transport → AvatarKit: deliver both encoded outputs produced by the server path.
Each encoded output maps to one client receive API:
  • Audio messages → receiveAudioData()
  • Motion messages → receiveMotionData()
Do not pass your original raw PCM directly into Custom Mode receive APIs such as receiveAudioData(). Those APIs consume encoded output messages from the Spatius server path. If you deliver only audio messages and omit motion messages, AvatarKit can play audio but cannot drive the Avatar's movement.
In the LiveKit Plugin path, do not call receiveAudioData(...) or provide audio chunks from the client. The plugin attaches to your agent session and sends avatar speech audio to Spatius from the agent worker.
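To make the two boundaries concrete, here is a TypeScript sketch. MotionSession, sendAudio, onEncodedAudio, onEncodedMotion, and ClientTransport are hypothetical stand-ins for your Server SDK session and your own transport, not real Spatius names:

  // Hypothetical shapes for the Server SDK session and your client transport.
  interface MotionSession {
    sendAudio(pcm: Uint8Array, endOfStream: boolean): void;    // boundary 1: raw mono PCM16 in
    onEncodedAudio(cb: (msg: Uint8Array) => void): void;       // encoded audio messages out
    onEncodedMotion(cb: (msg: Uint8Array) => void): void;      // encoded motion messages out
  }
  interface ClientTransport {
    send(kind: "audio" | "motion", payload: Uint8Array): void; // your app-level framing
  }

  // Boundary 2: forward both encoded outputs to the client, unmodified.
  function wireCustomMode(session: MotionSession, transport: ClientTransport): void {
    session.onEncodedAudio((msg) => transport.send("audio", msg));   // client feeds receiveAudioData(...)
    session.onEncodedMotion((msg) => transport.send("motion", msg)); // client feeds receiveMotionData(...)
  }

The asymmetry is the point: raw PCM crosses only the first boundary, and only encoded messages cross the second.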

Response end and interruption

Each avatar response needs a clear end. Without it, AvatarKit may keep the response open and never return to idle. In Basic Mode, avatar speech audio enters AvatarKit through AvatarController.receiveAudioData(...):
  1. Provide the first chunk with receiveAudioData(audioData, false).
  2. Continue providing chunks as your TTS or speech source produces them.
  3. Mark the final chunk end-of-stream with receiveAudioData(lastChunk, true).
receiveAudioData(...) returns a conversation ID. Keep it if you need to correlate later state changes, interruptions, or errors with a specific response.
In Custom Mode, your backend sends raw PCM audio through the Server SDK instead. For example, the Python Server SDK uses sample_rate=... when creating the session, then sends the avatar speech audio bytes with the same end-of-stream flag. The backend then forwards both outputs to the client: encoded audio messages through receiveAudioData(...), and encoded motion messages through receiveMotionData(...).
→ Reference: Web AvatarController · iOS AvatarController · Android AvatarController
Use interrupt() when the current avatar response should stop immediately, such as on user barge-in. interrupt() stops playback, clears pending audio and motion data, and resets the conversation context. After it returns, the next receiveAudioData(...) starts from a clean response. pause() / resume() are different: they preserve state and buffers for later continuation; interrupt() throws them away.
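Here is a sketch of the Basic Mode sequence from steps 1-3, assuming a controller with the two-argument receiveAudioData(data, endOfStream) described above; AvatarControllerLike and the TTS chunk iterable are stand-ins, not real AvatarKit types:

  // Minimal stand-in for the AvatarController surface used below.
  interface AvatarControllerLike {
    receiveAudioData(data: ArrayBuffer, endOfStream: boolean): string; // returns a conversation ID
  }

  async function speakResponse(
    controller: AvatarControllerLike,
    chunks: AsyncIterable<ArrayBuffer>, // hypothetical TTS output, already raw mono s16le
  ): Promise<string | undefined> {
    let pending: ArrayBuffer | null = null;
    let conversationId: string | undefined;
    for await (const chunk of chunks) {
      if (pending !== null) {
        // Not the last chunk yet, so the end-of-stream flag stays false.
        const id = controller.receiveAudioData(pending, false);
        conversationId ??= id;
      }
      pending = chunk;
    }
    if (pending !== null) {
      // Final chunk: mark end-of-stream so AvatarKit can close the response.
      const id = controller.receiveAudioData(pending, true);
      conversationId ??= id;
    }
    return conversationId; // keep it to correlate later state changes or errors
  }

Buffering one chunk ahead means the last real chunk carries the end-of-stream flag, even when the TTS stream ends without warning.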

What can go wrong

Symptom → likely cause:
  • Audio is distorted or silent → wrong sample rate, wrong channel count, compressed input, or non-s16le samples.
  • Avatar never returns to idle → the final chunk was not marked end-of-stream.
  • Audio plays but the Avatar does not move → motion data is missing, late, or delivered through the wrong path.
  • Playback feels delayed → chunks are too large, arrive late, or are buffered upstream by TTS / transport.
  • Avatar keeps speaking after barge-in → interrupt() is not called when your product cancels the current response.

Pre-flight checklist

  • Sample rate matches the configured connection or server session.
  • Audio is mono.
  • Samples are 16-bit signed PCM, little-endian.
  • Bytes are raw PCM, not WAV / MP3 / Opus / AAC frames.
  • Final chunk of each response is marked end-of-stream.
  • In Custom Mode, encoded motion messages are delivered to receiveMotionData(...).
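Most of these checks can only be enforced where the audio is produced, but the container item is cheap to guard at send time. A heuristic sketch, assuming chunks arrive as Uint8Array:

  // Heuristic pre-flight guard for one outgoing chunk: raw s16le PCM should
  // never begin with a RIFF/WAV header and always has an even byte length.
  function assertRawPcm16(chunk: Uint8Array): void {
    const looksLikeWav =
      chunk.length >= 4 &&
      chunk[0] === 0x52 && chunk[1] === 0x49 && // "RI"
      chunk[2] === 0x46 && chunk[3] === 0x46;   // "FF"
    if (looksLikeWav) {
      throw new Error("Chunk starts with a RIFF header; send raw PCM bytes, not a WAV file.");
    }
    if (chunk.length % 2 !== 0) {
      throw new Error("Odd byte length; 16-bit samples are 2 bytes each.");
    }
  }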

Go next

  • State & Events to observe playback state, errors, or recovery.
  • Sessions & Lifecycle if audio is not flowing because the connection path is not online.
  • Avatars if audio plays but the Avatar does not load or render.