Cartesia is a voice AI platform that provides ultra-realistic, real-time Text-to-Speech (TTS) models with industry-leading latency. By combining Cerebras Inference’s lightning-fast LLM responses with Cartesia’s natural-sounding voice synthesis, you can build highly responsive voice agents and conversational AI applications. This guide will walk you through integrating Cerebras models with Cartesia to create a complete voice AI pipeline.

Prerequisites

Before you begin, ensure you have:
  • Cerebras API Key - Get a free API key here.
  • Cartesia API Key - Visit Cartesia and create an account. Navigate to your profile settings to generate an API key.
  • Python 3.10 or higher - Required for running the integration code.

Configure Cartesia Integration

1. Install required dependencies

Install the necessary Python packages for both Cerebras Inference and Cartesia:
pip install openai cartesia pyaudio
The openai package provides the client for Cerebras Inference (OpenAI-compatible), cartesia is the official Cartesia SDK for voice synthesis, and pyaudio enables real-time audio playback.
macOS users: If you encounter errors installing pyaudio, first install PortAudio with: brew install portaudio
2. Configure environment variables

Create a .env file in your project directory to securely store your API keys:
CEREBRAS_API_KEY=your-cerebras-api-key-here
CARTESIA_API_KEY=your-cartesia-api-key-here
Alternatively, you can set these as environment variables in your shell:
export CEREBRAS_API_KEY="your-cerebras-api-key-here"
export CARTESIA_API_KEY="your-cartesia-api-key-here"
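Note that os.getenv (used in the code below) reads only variables already present in the process environment; a .env file is not loaded automatically. The python-dotenv package is the usual loader, or you can use a minimal stdlib-only sketch like this one (it handles plain KEY=value lines, not the full .env syntax):

```python
import os

def load_env_file(path=".env"):
    """Load simple KEY=value lines from a .env file into os.environ."""
    loaded = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without an '='
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip().strip('"')
    # Do not overwrite variables already exported in the shell
    for key, value in loaded.items():
        os.environ.setdefault(key, value)
    return loaded
```

Call load_env_file() once at startup, before constructing the API clients.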
3. Initialize the Cerebras client

Set up the Cerebras client using the OpenAI-compatible interface. The X-Cerebras-3rd-Party-Integration header, passed with each request in the examples below, helps Cerebras track and optimize this integration:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1"
)
4. Create a basic text-to-speech pipeline

Now let’s create a complete pipeline that generates text with Cerebras and converts it to speech with Cartesia. This example demonstrates the power of combining Cerebras’s fast inference with Cartesia’s ultra-low latency voice synthesis:
import os
from openai import OpenAI
from cartesia import Cartesia
import pyaudio

# Initialize Cerebras client
cerebras_client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1"
)

# Initialize Cartesia client
cartesia_client = Cartesia(
    api_key=os.getenv("CARTESIA_API_KEY")
)

def generate_and_speak(prompt, voice_id="a0e99841-438c-4a64-b679-ae501e7d6091"):
    """
    Generate text response using Cerebras and stream speech directly to speakers.
    
    Args:
        prompt: User input text
        voice_id: Cartesia voice ID (default is Barbershop Man)
    """
    # Generate text response with Cerebras
    response = cerebras_client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Keep responses concise and natural for voice output."},
            {"role": "user", "content": prompt}
        ],
        max_completion_tokens=150,
        temperature=0.7,
        extra_headers={"X-Cerebras-3rd-Party-Integration": "cartesia"}
    )
    
    text_response = response.choices[0].message.content
    print(f"Generated text: {text_response}")
    
    # Stream audio directly to speakers using Cartesia's WebSocket approach
    print("🔊 Streaming audio...")
    
    # Set up audio stream
    p = pyaudio.PyAudio()
    rate = 44100
    stream = p.open(
        format=pyaudio.paFloat32,
        channels=1,
        rate=rate,
        output=True
    )
    
    # Open a Cartesia WebSocket connection for streaming synthesis
    ws = cartesia_client.tts.websocket()

    try:
        # Generate and stream audio
        for output in ws.send(
            model_id="sonic-3",
            transcript=text_response,
            voice={"mode": "id", "id": voice_id},
            stream=True,
            output_format={
                "container": "raw",
                "encoding": "pcm_f32le",
                "sample_rate": rate
            }
        ):
            # Write audio chunks directly to speakers
            stream.write(output.audio)

    finally:
        # Cleanup
        stream.stop_stream()
        stream.close()
        p.terminate()
        ws.close()
    
    return text_response

# Example usage
if __name__ == "__main__":
    generate_and_speak("Tell me an interesting fact about space exploration.")
5. Build a conversational voice agent

For a more advanced use case, here’s how to build a multi-turn conversational agent that maintains context across multiple interactions:
import os
from openai import OpenAI
from cartesia import Cartesia
import pyaudio

# Initialize Cerebras client
cerebras_client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1"
)

# Initialize Cartesia client
cartesia_client = Cartesia(
    api_key=os.getenv("CARTESIA_API_KEY")
)

def stream_audio_to_speakers(text, voice_id="a0e99841-438c-4a64-b679-ae501e7d6091"):
    """Stream text as audio directly to speakers using Cartesia WebSocket."""
    # Set up audio stream
    p = pyaudio.PyAudio()
    rate = 44100
    stream = p.open(
        format=pyaudio.paFloat32,
        channels=1,
        rate=rate,
        output=True
    )
    
    # Open a Cartesia WebSocket connection for streaming synthesis
    ws = cartesia_client.tts.websocket()

    try:
        # Generate and stream audio
        for output in ws.send(
            model_id="sonic-3",
            transcript=text,
            voice={"mode": "id", "id": voice_id},
            stream=True,
            output_format={
                "container": "raw",
                "encoding": "pcm_f32le",
                "sample_rate": rate
            }
        ):
            # Write audio chunks directly to speakers
            stream.write(output.audio)

    finally:
        # Cleanup
        stream.stop_stream()
        stream.close()
        p.terminate()
        ws.close()

class VoiceAgent:
    def __init__(self, system_prompt, voice_id="a0e99841-438c-4a64-b679-ae501e7d6091"):
        self.conversation_history = [
            {"role": "system", "content": system_prompt}
        ]
        self.voice_id = voice_id
    
    def chat(self, user_input):
        """Process user input and generate voice response."""
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": user_input
        })
        
        # Generate response with Cerebras
        response = cerebras_client.chat.completions.create(
            model="llama-3.3-70b",
            messages=self.conversation_history,
            max_completion_tokens=200,
            temperature=0.8,
            extra_headers={"X-Cerebras-3rd-Party-Integration": "cartesia"}
        )
        
        assistant_message = response.choices[0].message.content
        
        # Add assistant response to history
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_message
        })
        
        print(f"Assistant: {assistant_message}")
        
        # Stream audio response directly to speakers
        stream_audio_to_speakers(assistant_message, self.voice_id)
        
        return assistant_message

# Example: Customer service agent
agent = VoiceAgent(
    system_prompt="You are a friendly customer service representative. Be helpful, concise, and professional.",
    voice_id="a0e99841-438c-4a64-b679-ae501e7d6091"  # Barbershop Man voice
)

# Simulate conversation
agent.chat("Hi, I need help with my order.")
agent.chat("My order number is 12345.")
agent.chat("When will it arrive?")
This voice agent maintains conversation context and provides natural, spoken responses using Cerebras’s fast inference and Cartesia’s voice synthesis.
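One caveat: because chat resends the entire conversation_history on every turn, long sessions grow the prompt without bound. A simple mitigation, sketched here as an illustration (trim_history and its max_turns parameter are not part of any SDK), is to keep the system prompt plus only the most recent messages:

```python
def trim_history(history, max_turns=10):
    """Keep the system prompt plus the most recent max_turns messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    return system + rest[-max_turns:]

# Example: a history with one system prompt and 25 user/assistant turns
history = [{"role": "system", "content": "You are helpful."}]
for i in range(25):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message {i}"})

trimmed = trim_history(history, max_turns=10)
print(len(trimmed))  # 11: the system prompt + 10 most recent messages
```

You could call trim_history on self.conversation_history before each request to keep prompt size (and per-turn latency) bounded.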
6. Stream responses for lower latency

For faster perceived response times, you can stream the Cerebras output token by token and play the synthesized audio chunk by chunk as it arrives. This example streams the text generation, then streams the resulting audio straight to the speakers:
import os
from openai import OpenAI
from cartesia import Cartesia
import pyaudio

# Initialize clients
cerebras_client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1"
)

cartesia_client = Cartesia(
    api_key=os.getenv("CARTESIA_API_KEY")
)

def stream_voice_response(prompt, voice_id="a0e99841-438c-4a64-b679-ae501e7d6091", play_audio=True):
    """
    Stream text generation and voice synthesis for ultra-low latency.
    
    Args:
        prompt: User input text
        voice_id: Cartesia voice ID
        play_audio: Whether to play audio (set to False in headless environments)
    """
    # Stream text generation from Cerebras
    stream = cerebras_client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        stream=True,
        max_completion_tokens=200,
        extra_headers={"X-Cerebras-3rd-Party-Integration": "cartesia"}
    )
    
    # Collect text chunks from stream
    full_text = ""
    for chunk in stream:
        if chunk.choices[0].delta.content:
            text_chunk = chunk.choices[0].delta.content
            full_text += text_chunk
            print(text_chunk, end="", flush=True)
    
    print("\n\nGenerating audio...")
    
    # Set up Cartesia websocket for streaming TTS
    ws = cartesia_client.tts.websocket()
    
    # Set up audio playback if enabled
    p = None
    audio_stream = None
    rate = 22050
    
    if play_audio:
        try:
            p = pyaudio.PyAudio()
        except Exception as e:
            print(f"Audio playback not available: {e}")
            play_audio = False
    
    # Generate and optionally play audio
    audio_chunks = []
    for output in ws.send(
        model_id="sonic-3",
        transcript=full_text,
        voice={"mode": "id", "id": voice_id},
        stream=True,
        output_format={
            "container": "raw",
            "encoding": "pcm_f32le",
            "sample_rate": rate
        }
    ):
        buffer = output.audio
        audio_chunks.append(buffer)
        
        if play_audio and p:
            if not audio_stream:
                audio_stream = p.open(
                    format=pyaudio.paFloat32,
                    channels=1,
                    rate=rate,
                    output=True
                )
            audio_stream.write(buffer)
    
    # Cleanup
    if audio_stream:
        audio_stream.stop_stream()
        audio_stream.close()
    if p:
        p.terminate()
    ws.close()
    
    print(f"Generated {len(audio_chunks)} audio chunks")
    return full_text

# Example usage
if __name__ == "__main__":
    # Set play_audio=False in headless/testing environments
    result = stream_voice_response(
        "Explain quantum computing in simple terms.",
        play_audio=False  # Change to True to hear audio
    )
    print(f"\nFinal response: {result}")
This streaming approach minimizes latency by starting audio playback as soon as the first chunks are available.
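The example above still waits for the full text before starting synthesis. To cut latency further, a common pattern is to split the token stream into sentences and hand each sentence to TTS as soon as it completes. Here is a sketch of just the sentence-buffering logic; the on_sentence callback is a hypothetical stand-in for the Cartesia ws.send call above:

```python
import re

# A sentence ends at '.', '!', or '?' followed by whitespace (a simplification)
SENTENCE_END = re.compile(r'([.!?])\s')

def stream_sentences(token_iter, on_sentence):
    """Buffer streamed tokens and invoke on_sentence for each complete sentence."""
    buffer = ""
    for token in token_iter:
        buffer += token
        # Emit every complete sentence accumulated so far
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            end = match.end(1)
            on_sentence(buffer[:end].strip())
            buffer = buffer[end:]
    # Flush whatever remains after the stream ends
    if buffer.strip():
        on_sentence(buffer.strip())

# Demo with a fake token stream; in the pipeline above, token_iter would be
# the chunk.choices[0].delta.content values from the Cerebras stream, and
# on_sentence would forward each sentence to Cartesia for synthesis.
sentences = []
stream_sentences(iter(["Hello ", "there. How ", "are you? I'm ", "fine."]), sentences.append)
print(sentences)  # ['Hello there.', 'How are you?', "I'm fine."]
```

Sending sentence-sized transcripts lets synthesis of the first sentence overlap with generation of the rest, at the cost of slightly choppier prosody across sentence boundaries.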