Llama Stack is Meta’s comprehensive framework for building generative AI applications. It provides standardized APIs for inference, safety, memory, and agentic systems. By integrating Cerebras as a provider, you can leverage ultra-fast inference speeds while using Llama Stack’s unified interface.

Prerequisites

Before you begin, ensure you have:
  • Cerebras API Key - Get a free API key here
  • Python 3.12 or 3.13 - Llama Stack requires Python 3.12+ (3.12 is recommended for best compatibility)
  • Basic familiarity with async Python - Llama Stack uses async/await patterns
  • uv package manager (optional but recommended) - For faster dependency installation

Configure Llama Stack with Cerebras

1. Install Llama Stack

Install the Llama Stack distribution with Cerebras support. This installs both the Llama Stack server and client libraries needed to interact with Cerebras.
pip install llama-stack llama-stack-client
Alternatively, using uv (faster):
uv pip install llama-stack llama-stack-client
2. Configure environment variables

Export your Cerebras API key as an environment variable:
export CEREBRAS_API_KEY=your-cerebras-api-key-here
You can also add this to your shell profile (e.g., ~/.bashrc or ~/.zshrc) for persistence.
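Before starting the server, it can help to fail fast when the key is missing rather than debugging an opaque startup error later. A minimal sketch (the helper name and error wording are illustrative, not part of Llama Stack):

```python
import os

def require_api_key(name: str = "CEREBRAS_API_KEY") -> str:
    """Return the named API key from the environment, or raise with a clear message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it first, e.g. `export {name}=...`"
        )
    return key
```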
3. Create a Cerebras configuration file

Create a file named cerebras-run.yaml with the following configuration:
version: 2
image_name: cerebras
apis:
- inference
providers:
  inference:
  - provider_id: cerebras
    provider_type: remote::cerebras
    config:
      base_url: https://api.cerebras.ai
      api_key: ${env.CEREBRAS_API_KEY}
storage:
  backends:
    kv_default:
      type: kv_sqlite
      db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/cerebras}/kvstore.db
    sql_default:
      type: sql_sqlite
      db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/cerebras}/sql_store.db
  stores:
    metadata:
      namespace: registry
      backend: kv_default
    inference:
      table_name: inference_store
      backend: sql_default
    conversations:
      table_name: openai_conversations
      backend: sql_default
server:
  port: 8321
4. Start the Llama Stack server

Launch the Llama Stack server with your Cerebras configuration:
llama stack run cerebras-run.yaml
The server will start on http://localhost:8321 by default. You should see output indicating that Cerebras has been successfully configured as a provider.
5. Make your first inference request

Now you can use the Llama Stack client to make inference requests. This example demonstrates how to use Llama Stack’s standardized API to interact with Cerebras models.
from llama_stack_client import LlamaStackClient

# Initialize the Llama Stack client
client = LlamaStackClient(
    base_url="http://localhost:8321",
)

# Make a chat completion request
response = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)

print(response.choices[0].message.content)
6. Try streaming responses

Llama Stack supports streaming responses for real-time output. Streaming is particularly useful for interactive applications where you want to display responses as they’re generated.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://localhost:8321",
)

# Stream the response
stream = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=[
        {"role": "user", "content": "Write a short poem about artificial intelligence."}
    ],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

print()
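Because streaming delivers the response as a sequence of text deltas, it is often convenient to factor the accumulation into a helper you can reuse (and test without a server). A small sketch, shown here with stand-in chunk objects that mimic the shape of real streaming responses:

```python
from types import SimpleNamespace

def accumulate_stream(chunks) -> str:
    """Join the text deltas from a stream of chat-completion chunks into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content may be None
            parts.append(delta)
    return "".join(parts)

# Stand-in chunks for illustration; real chunks come from the server
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["Hello", ", ", "world", None]
]
print(accumulate_stream(fake_chunks))  # Hello, world
```

With a real stream, you would pass the `stream` object from the example above directly to `accumulate_stream`.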

Using Cerebras Directly with OpenAI SDK

If you prefer to use Cerebras directly without the Llama Stack server, you can use the OpenAI SDK with Cerebras endpoints. This approach gives you direct access to Cerebras while still tracking usage through the integration header.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("CEREBRAS_API_KEY"),
    base_url="https://api.cerebras.ai/v1",
    default_headers={
        "X-Cerebras-3rd-Party-Integration": "llama-stack"
    }
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)
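The OpenAI SDK already retries some transient failures on its own; if you want explicit control over retry behavior, you can wrap the call yourself. A generic sketch (the wrapper is illustrative, not part of either SDK):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5,
                 retry_on: tuple = (Exception,)):
    """Call fn(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))

# Demonstration with a function that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

In practice you would wrap the request, e.g. `with_retries(lambda: client.chat.completions.create(...))`, and narrow `retry_on` to the exception types you consider transient.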

Advanced Features

Using Multiple Models

You can configure multiple Cerebras models in your Llama Stack configuration and switch between them based on your task requirements. Edit your cerebras-run.yaml to include:
models:
  - model_id: llama-3.3-70b
    provider_id: cerebras
    metadata:
      description: "Best for complex reasoning tasks"
  
  - model_id: llama3.1-8b
    provider_id: cerebras
    metadata:
      description: "Fast and efficient for simple tasks"
  
  - model_id: qwen-3-32b
    provider_id: cerebras
    metadata:
      description: "Excellent for multilingual applications"
Then switch between models in your code:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Use the larger model for complex tasks
complex_response = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=[{"role": "user", "content": "Explain quantum computing in one sentence."}],
)
print("Complex:", complex_response.choices[0].message.content)

# Use the smaller model for simple tasks
simple_response = client.chat.completions.create(
    model="cerebras/llama3.1-8b",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print("Simple:", simple_response.choices[0].message.content)
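To avoid hard-coding model names at every call site, you can centralize the choice in a small routing helper. A sketch using the model IDs from this guide (the task categories are illustrative, not a Llama Stack concept):

```python
# Map task categories to the model IDs configured above
# (the category names are illustrative)
MODEL_ROUTES = {
    "complex": "cerebras/llama-3.3-70b",
    "simple": "cerebras/llama3.1-8b",
    "multilingual": "cerebras/qwen-3-32b",
}

def pick_model(task: str) -> str:
    """Return the model ID for a task category, defaulting to the large model."""
    return MODEL_ROUTES.get(task, "cerebras/llama-3.3-70b")

print(pick_model("simple"))   # cerebras/llama3.1-8b
print(pick_model("unknown"))  # cerebras/llama-3.3-70b
```

You would then call `client.chat.completions.create(model=pick_model("simple"), ...)` and change routing in one place.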

System Prompts and Temperature Control

Customize model behavior with system prompts and sampling parameters to fine-tune responses:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Example with system prompt and sampling params
response = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}
    ],
    temperature=0.7,
    top_p=0.9,
    max_tokens=500,
)
print(response.choices[0].message.content)
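If you use several sampling configurations across an application, named presets keep them consistent. A sketch (the preset names and exact values are a matter of taste, not recommendations from Cerebras or Llama Stack):

```python
# Illustrative sampling presets; tune the values for your use case
PRESETS = {
    "deterministic": {"temperature": 0.0, "top_p": 1.0},
    "balanced": {"temperature": 0.7, "top_p": 0.9},
    "creative": {"temperature": 1.0, "top_p": 0.95},
}

def sampling_params(mode: str = "balanced", max_tokens: int = 500) -> dict:
    """Build keyword arguments for chat.completions.create from a named preset."""
    params = dict(PRESETS[mode])
    params["max_tokens"] = max_tokens
    return params

print(sampling_params("deterministic"))
```

Usage: `client.chat.completions.create(model="cerebras/llama-3.3-70b", messages=messages, **sampling_params("creative"))`.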

Building Agentic Applications

Llama Stack provides abstractions for building AI agents. A simple starting point is a multi-turn conversation in which you maintain the message history yourself and send it back with each request:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# For agentic applications, use multi-turn conversations
conversation = [
    {"role": "system", "content": "You are a helpful research assistant."},
    {"role": "user", "content": "What are the latest trends in AI?"}
]

response = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=conversation,
)
print(response.choices[0].message.content)

# Continue the conversation
conversation.append({"role": "assistant", "content": response.choices[0].message.content})
conversation.append({"role": "user", "content": "Can you elaborate on one of those trends?"})

response2 = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=conversation,
)
print(response2.choices[0].message.content)
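The manual append pattern above can be factored into a small history class that also bounds context size by dropping the oldest turns. A sketch (this class is illustrative, not part of the Llama Stack client):

```python
class Conversation:
    """Keep chat history, truncating old turns to bound the context size."""

    def __init__(self, system_prompt: str, max_turns: int = 20):
        self.system = {"role": "system", "content": system_prompt}
        self.turns = []  # user/assistant messages, oldest first
        self.max_turns = max_turns

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})
        if len(self.turns) > self.max_turns:
            # Drop the oldest turns; the system prompt is always kept
            self.turns = self.turns[-self.max_turns:]

    def messages(self):
        return [self.system] + self.turns

conv = Conversation("You are a helpful research assistant.", max_turns=4)
conv.add("user", "What are the latest trends in AI?")
conv.add("assistant", "Some trends are ...")
conv.add("user", "Can you elaborate on one of those trends?")
print(len(conv.messages()))  # 4 (system prompt + 3 turns)
```

Each request then becomes `client.chat.completions.create(model=..., messages=conv.messages())`, followed by `conv.add("assistant", ...)` with the response text.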

FAQ

When should I use Llama Stack instead of calling Cerebras directly?

Llama Stack provides a standardized interface with additional features like safety guardrails, memory management, and agentic capabilities. If you only need basic inference, calling Cerebras directly with the OpenAI SDK is simpler. Use Llama Stack when you need its advanced features or want a provider-agnostic interface.

Can I connect to a remote Llama Stack server?

Yes, you can use Llama Stack’s client library to connect to a remote Llama Stack server. However, for Cerebras integration, you’ll need to ensure the server is configured with your Cerebras API key. Alternatively, use the OpenAI SDK directly as shown in the “Using Cerebras Directly” section.

Which Cerebras models work with Llama Stack?

All current Cerebras models work with Llama Stack. For complex reasoning and agentic tasks, use llama-3.3-70b or qwen-3-32b. For faster responses on simpler tasks, use llama3.1-8b.

What Python version does Llama Stack require?

Llama Stack requires Python 3.12 or higher.

Does Llama Stack handle errors and retries automatically?

Llama Stack automatically handles retries and basic error handling. For production applications, implement additional error handling around your API calls and monitor your Cerebras API usage through the Cerebras Cloud dashboard.

Can I use Llama Stack’s safety features with Cerebras?

Yes, Llama Stack’s safety APIs work with any configured inference provider, including Cerebras. You can add content moderation, prompt injection detection, and other safety features by configuring safety providers in your run.yaml. See the Llama Stack safety documentation for details.

Troubleshooting

Server won’t start

If the Llama Stack server fails to start:
  1. Verify your Python version is 3.12 or higher: python --version
  2. Check that your CEREBRAS_API_KEY environment variable is set: echo $CEREBRAS_API_KEY
  3. Ensure the cerebras-run.yaml file is in your current directory
  4. Try running with verbose logging: llama stack run cerebras-run.yaml --verbose
  5. Check the Llama Stack releases page for any breaking changes

Connection errors

If you see connection errors when making requests:
  1. Verify the Llama Stack server is running on the expected port (default: 8321)
  2. Check that your Cerebras API key is valid by testing it directly with the OpenAI SDK
  3. Ensure there are no firewall rules blocking localhost connections
  4. Try restarting the Llama Stack server
  5. Verify your network connectivity to api.cerebras.ai

Model not found errors

If you get “model not found” errors:
  1. Use the cerebras/ prefix for model names (e.g., cerebras/llama-3.3-70b)
  2. List available models: curl http://localhost:8321/v1/models
  3. Restart the Llama Stack server after making configuration changes
  4. Consult the Cerebras models page for the current list of available models

Slow response times

If responses are slower than expected:
  1. Verify you’re using Cerebras models (not accidentally routing through another provider)
  2. Check your network connection and latency to Cerebras endpoints
  3. Consider using a smaller model like llama3.1-8b for simpler tasks
  4. Enable streaming to get partial responses faster
  5. Check your Cerebras account for any rate limiting or usage quotas

Import errors with llama_stack_client

If you get import errors:
  1. Ensure llama-stack-client is properly installed: pip install llama-stack-client
  2. Check that you’re using the correct import: from llama_stack_client import LlamaStackClient
  3. Verify you’re not mixing old and new package names
  4. Create a fresh virtual environment if issues persist

Next Steps

Llama Stack is actively developed by Meta and the community. For the latest features and updates, check the official documentation and GitHub repository.