Llama Stack is Meta’s comprehensive framework for building generative AI applications. It provides standardized APIs for inference, safety, memory, and agentic systems. By integrating Cerebras as a provider, you can leverage ultra-fast inference speeds while using Llama Stack’s unified interface.

Prerequisites

Before you begin, ensure you have:
  • Cerebras API Key - Get a free API key here
  • Python 3.12 or 3.13 - Llama Stack requires Python 3.12+ (3.12 is recommended for best compatibility)
  • Basic familiarity with async Python - Llama Stack uses async/await patterns
  • uv package manager (optional but recommended) - For faster dependency installation

Configure Llama Stack with Cerebras

1. Install Llama Stack

Install the Llama Stack distribution with Cerebras support. This installs both the Llama Stack server and client libraries needed to interact with Cerebras.
pip install llama-stack llama-stack-client
Alternatively, using uv (faster):
uv pip install llama-stack llama-stack-client
2. Configure environment variables

Export your Cerebras API key as an environment variable:
export CEREBRAS_API_KEY=your-cerebras-api-key-here
You can also add this to your shell profile (e.g., ~/.bashrc or ~/.zshrc) for persistence.
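Before starting the server, it can help to fail fast when the key is missing rather than debugging an opaque startup error later. A minimal sketch (the helper name and error wording are illustrative, not part of Llama Stack):

```python
import os

def require_api_key(name: str = "CEREBRAS_API_KEY") -> str:
    """Return the named API key from the environment, or raise with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it first, e.g. `export {name}=...`"
        )
    return key
```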
3. Create a Cerebras configuration file

Create a file named cerebras-run.yaml with the following configuration:
version: 2
image_name: cerebras
apis:
- inference
providers:
  inference:
  - provider_id: cerebras
    provider_type: remote::cerebras
    config:
      base_url: https://api.cerebras.ai
      api_key: ${env.CEREBRAS_API_KEY}
storage:
  backends:
    kv_default:
      type: kv_sqlite
      db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/cerebras}/kvstore.db
    sql_default:
      type: sql_sqlite
      db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/cerebras}/sql_store.db
  stores:
    metadata:
      namespace: registry
      backend: kv_default
    inference:
      table_name: inference_store
      backend: sql_default
    conversations:
      table_name: openai_conversations
      backend: sql_default
server:
  port: 8321
4. Start the Llama Stack server

Launch the Llama Stack server with your Cerebras configuration:
llama stack run cerebras-run.yaml
The server will start on http://localhost:8321 by default. You should see output indicating that Cerebras has been successfully configured as a provider.
5. Make your first inference request

Now you can use the Llama Stack client to make inference requests. This example demonstrates how to use Llama Stack’s standardized API to interact with Cerebras models.
from llama_stack_client import LlamaStackClient

# Initialize the Llama Stack client
client = LlamaStackClient(
    base_url="http://localhost:8321",
)

# Make a chat completion request
response = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)

print(response.choices[0].message.content)
6. Try streaming responses

Llama Stack supports streaming responses for real-time output. Streaming is particularly useful for interactive applications where you want to display responses as they’re generated.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://localhost:8321",
)

# Stream the response
stream = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=[
        {"role": "user", "content": "Write a short poem about artificial intelligence."}
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

print()
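Because streaming delivers the response as a sequence of text deltas, it is often convenient to factor the accumulation into a helper you can reuse (and test without a server). A small sketch, shown here with stand-in chunk objects that mimic the shape of real streaming responses:

```python
from types import SimpleNamespace

def accumulate_stream(chunks) -> str:
    """Join the text deltas from a stream of chat-completion chunks into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content may be None
            parts.append(delta)
    return "".join(parts)

# Stand-in chunks for illustration; real chunks come from the server
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["Hello", ", ", "world", None]
]
print(accumulate_stream(fake_chunks))  # Hello, world
```

With a real stream, you would pass the `stream` object from the example above directly to `accumulate_stream`.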

Using Cerebras Directly with OpenAI SDK

If you prefer to use Cerebras directly without the Llama Stack server, you can use the OpenAI SDK with Cerebras endpoints. This approach gives you direct access to Cerebras while still tracking usage through the integration header.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "llama-stack"
    }
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)
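The OpenAI SDK already retries some transient failures on its own; if you want explicit control over retry behavior, you can wrap the call yourself. A generic sketch (the wrapper is illustrative, not part of either SDK):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5,
                 retry_on: tuple = (Exception,)):
    """Call fn(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))

# Demonstration with a function that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In practice you would wrap the request, e.g. `with_retries(lambda: client.chat.completions.create(...))`, and narrow `retry_on` to the exception types you consider transient.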

Advanced Features

Using Multiple Models

You can configure multiple Cerebras models in your Llama Stack configuration and switch between them based on your task requirements. Edit your cerebras-run.yaml to include:
models:
  - model_id: llama-3.3-70b
    provider_id: cerebras
    metadata:
      description: "Best for complex reasoning tasks"
  
  - model_id: llama3.1-8b
    provider_id: cerebras
    metadata:
      description: "Fast and efficient for simple tasks"
  
  - model_id: qwen-3-32b
    provider_id: cerebras
    metadata:
      description: "Excellent for multilingual applications"
Then switch between models in your code:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Use the larger model for complex tasks
complex_response = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=[{"role": "user", "content": "Explain quantum computing in one sentence."}],
)
print("Complex:", complex_response.choices[0].message.content)

# Use the smaller model for simple tasks
simple_response = client.chat.completions.create(
    model="cerebras/llama3.1-8b",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print("Simple:", simple_response.choices[0].message.content)
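To avoid hard-coding model names at every call site, you can centralize the choice in a small routing helper. A sketch using the model IDs from this guide (the task categories are illustrative, not a Llama Stack concept):

```python
# Map task categories to the model IDs configured above
# (the category names are illustrative)
MODEL_ROUTES = {
    "complex": "cerebras/llama-3.3-70b",
    "simple": "cerebras/llama3.1-8b",
    "multilingual": "cerebras/qwen-3-32b",
}

def pick_model(task: str) -> str:
    """Return the model ID for a task category, defaulting to the large model."""
    return MODEL_ROUTES.get(task, "cerebras/llama-3.3-70b")

print(pick_model("simple"))   # cerebras/llama3.1-8b
print(pick_model("unknown"))  # cerebras/llama-3.3-70b
```

You would then call `client.chat.completions.create(model=pick_model("simple"), ...)` and change routing in one place.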

System Prompts and Temperature Control

Customize model behavior with system prompts and sampling parameters to fine-tune responses:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Example with system prompt and sampling params
response = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}
    ],
    temperature=0.7,
    top_p=0.9,
    max_tokens=500,
)
print(response.choices[0].message.content)
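If you use several sampling configurations across an application, named presets keep them consistent. A sketch (the preset names and exact values are a matter of taste, not recommendations from Cerebras or Llama Stack):

```python
# Illustrative sampling presets; tune the values for your use case
PRESETS = {
    "deterministic": {"temperature": 0.0, "top_p": 1.0},
    "balanced": {"temperature": 0.7, "top_p": 0.9},
    "creative": {"temperature": 1.0, "top_p": 0.95},
}

def sampling_params(mode: str = "balanced", max_tokens: int = 500) -> dict:
    """Build keyword arguments for chat.completions.create from a named preset."""
    params = dict(PRESETS[mode])
    params["max_tokens"] = max_tokens
    return params

print(sampling_params("deterministic"))
```

Usage: `client.chat.completions.create(model="cerebras/llama-3.3-70b", messages=messages, **sampling_params("creative"))`.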

Building Agentic Applications

Llama Stack provides abstractions for building AI agents. A simple starting point is a multi-turn conversation in which you maintain the message history yourself and send it back with each request:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# For agentic applications, use multi-turn conversations
conversation = [
    {"role": "system", "content": "You are a helpful research assistant."},
    {"role": "user", "content": "What are the latest trends in AI?"}
]

response = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=conversation,
)
print(response.choices[0].message.content)

# Continue the conversation
conversation.append({"role": "assistant", "content": response.choices[0].message.content})
conversation.append({"role": "user", "content": "Can you elaborate on one of those trends?"})

response2 = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=conversation,
)
print(response2.choices[0].message.content)
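The manual append pattern above can be factored into a small history class that also bounds context size by dropping the oldest turns. A sketch (this class is illustrative, not part of the Llama Stack client):

```python
class Conversation:
    """Keep chat history, truncating old turns to bound the context size."""

    def __init__(self, system_prompt: str, max_turns: int = 20):
        self.system = {"role": "system", "content": system_prompt}
        self.turns = []  # user/assistant messages, oldest first
        self.max_turns = max_turns

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})
        if len(self.turns) > self.max_turns:
            # Drop the oldest turns; the system prompt is always kept
            self.turns = self.turns[-self.max_turns:]

    def messages(self):
        return [self.system] + self.turns

conv = Conversation("You are a helpful research assistant.", max_turns=4)
conv.add("user", "What are the latest trends in AI?")
conv.add("assistant", "Some trends are ...")
conv.add("user", "Can you elaborate on one of those trends?")
print(len(conv.messages()))  # 4 (system prompt + 3 turns)
```

Each request then becomes `client.chat.completions.create(model=..., messages=conv.messages())`, followed by `conv.add("assistant", ...)` with the response text.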

FAQ

When should I use Llama Stack instead of calling Cerebras directly?

Llama Stack provides a standardized interface with additional features like safety guardrails, memory management, and agentic capabilities. If you only need basic inference, calling Cerebras directly with the OpenAI SDK is simpler. Use Llama Stack when you need its advanced features or want a provider-agnostic interface.

Can I connect to a remote Llama Stack server?

Yes, you can use Llama Stack’s client library to connect to a remote Llama Stack server. However, for Cerebras integration, you’ll need to ensure the server is configured with your Cerebras API key. Alternatively, use the OpenAI SDK directly as shown in the “Using Cerebras Directly” section.

Which Cerebras models work with Llama Stack?

All current Cerebras models work with Llama Stack. For complex reasoning and agentic tasks, use llama-3.3-70b or qwen-3-32b. For faster responses on simpler tasks, use llama3.1-8b.

What Python version does Llama Stack require?

Llama Stack requires Python 3.12 or higher.

Does Llama Stack handle errors and retries automatically?

Llama Stack automatically handles retries and basic error handling. For production applications, implement additional error handling around your API calls and monitor your Cerebras API usage through the Cerebras Cloud dashboard.

Can I use Llama Stack’s safety features with Cerebras?

Yes, Llama Stack’s safety APIs work with any configured inference provider, including Cerebras. You can add content moderation, prompt injection detection, and other safety features by configuring safety providers in your run.yaml. See the Llama Stack safety documentation for details.

Troubleshooting

Server won’t start

If the Llama Stack server fails to start:
  1. Verify your Python version is 3.12 or higher: python --version
  2. Check that your CEREBRAS_API_KEY environment variable is set: echo $CEREBRAS_API_KEY
  3. Ensure the cerebras-run.yaml file is in your current directory
  4. Try running with verbose logging: llama stack run cerebras-run.yaml --verbose
  5. Check the Llama Stack releases page for any breaking changes

Connection errors

If you see connection errors when making requests:
  1. Verify the Llama Stack server is running on the expected port (default: 8321)
  2. Check that your Cerebras API key is valid by testing it directly with the OpenAI SDK
  3. Ensure there are no firewall rules blocking localhost connections
  4. Try restarting the Llama Stack server
  5. Verify your network connectivity to api.cerebras.ai

Model not found errors

If you get “model not found” errors:
  1. Use the cerebras/ prefix for model names (e.g., cerebras/llama-3.3-70b)
  2. List available models: curl http://localhost:8321/v1/models
  3. Restart the Llama Stack server after making configuration changes
  4. Consult the Cerebras models page for the current list of available models

Slow response times

If responses are slower than expected:
  1. Verify you’re using Cerebras models (not accidentally routing through another provider)
  2. Check your network connection and latency to Cerebras endpoints
  3. Consider using a smaller model like llama3.1-8b for simpler tasks
  4. Enable streaming to get partial responses faster
  5. Check your Cerebras account for any rate limiting or usage quotas

Import errors with llama_stack_client

If you get import errors:
  1. Ensure llama-stack-client is properly installed: pip install llama-stack-client
  2. Check that you’re using the correct import: from llama_stack_client import LlamaStackClient
  3. Verify you’re not mixing old and new package names
  4. Create a fresh virtual environment if issues persist

Next Steps

Llama Stack is actively developed by Meta and the community. For the latest features and updates, check the official documentation and GitHub repository.