Prerequisites
Before you begin, ensure you have:
- Cerebras API Key - Get a free API key here
- Python 3.12 or 3.13 - Llama Stack requires Python 3.12+ (Python 3.12 is recommended for best compatibility)
- Basic familiarity with async Python - Llama Stack uses async/await patterns
- uv package manager (optional but recommended) - For faster dependency installation
Configure Llama Stack with Cerebras
1. Install Llama Stack
Install the Llama Stack distribution with Cerebras support. This installs both the Llama Stack server and the client libraries needed to interact with Cerebras. Alternatively, you can install with uv for a faster setup.
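For example, assuming the packages are published on PyPI as `llama-stack` and `llama-stack-client` (names may vary by release):

```shell
# Install the Llama Stack server and client libraries
pip install llama-stack llama-stack-client

# Or, with uv, for faster dependency resolution
uv pip install llama-stack llama-stack-client
```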
2. Configure environment variables
Export your Cerebras API key as an environment variable. You can also add this to your shell profile (e.g., ~/.bashrc or ~/.zshrc) for persistence.
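A sketch, with a placeholder key:

```shell
# Set the API key for the current shell session (placeholder value shown)
export CEREBRAS_API_KEY="your-api-key-here"

# To persist it across sessions, append the same export line
# to ~/.bashrc or ~/.zshrc
```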
3. Create a Cerebras configuration file
Create a file named cerebras-run.yaml with your Cerebras provider configuration.
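The exact schema depends on your Llama Stack version; a minimal sketch (field names assumed from the Cerebras distribution template — verify against your release) might look like:

```yaml
version: '2'
image_name: cerebras
apis:
  - inference
providers:
  inference:
    - provider_id: cerebras
      provider_type: remote::cerebras
      config:
        base_url: https://api.cerebras.ai
        api_key: ${env.CEREBRAS_API_KEY}
models:
  - model_id: cerebras/llama-3.3-70b
    provider_id: cerebras
server:
  port: 8321
```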
4. Start the Llama Stack server
Launch the Llama Stack server with your Cerebras configuration. The server will start on http://localhost:8321 by default. You should see output indicating that Cerebras has been successfully configured as a provider.
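The command shape may vary slightly by release; typically:

```shell
# Launch the server using the configuration file created in step 3
llama stack run cerebras-run.yaml
```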
5. Make your first inference request
Now you can use the Llama Stack client to make inference requests. This example demonstrates how to use Llama Stack’s standardized API to interact with Cerebras models.
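A minimal sketch, assuming the server is running on localhost:8321 and the model is registered as cerebras/llama-3.3-70b in your run.yaml; the exact client surface varies by llama-stack-client version (this sketch assumes the OpenAI-compatible chat completions API):

```python
from llama_stack_client import LlamaStackClient

# Connect to the local Llama Stack server started in the previous step
client = LlamaStackClient(base_url="http://localhost:8321")

# Send a chat completion request; the system prompt and temperature
# illustrate basic control over model behavior
response = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain what Llama Stack is in one sentence."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```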
6. Try streaming responses
Llama Stack supports streaming responses for real-time output. Streaming is particularly useful for interactive applications where you want to display responses as they’re generated.
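A sketch of streaming with the same assumed OpenAI-compatible API (model ID and server address as in the earlier steps):

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Request a streamed response; chunks arrive as they are generated
stream = client.chat.completions.create(
    model="cerebras/llama-3.3-70b",
    messages=[{"role": "user", "content": "Write a haiku about fast inference."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:  # final chunk may carry no content
        print(delta.content, end="", flush=True)
print()
```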
Using Cerebras Directly with OpenAI SDK
If you prefer to use Cerebras directly without the Llama Stack server, you can use the OpenAI SDK with Cerebras endpoints. This approach gives you direct access to Cerebras while still tracking usage through the integration header.
Advanced Features
Using Multiple Models
You can configure multiple Cerebras models in your Llama Stack configuration and switch between them based on your task requirements. Edit your run.yaml to include the additional models.
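For instance, the models section of run.yaml could list several Cerebras models side by side (schema assumed from this guide's configuration style; verify against your release):

```yaml
models:
  - model_id: cerebras/llama-3.3-70b   # complex reasoning and agentic tasks
    provider_id: cerebras
  - model_id: cerebras/qwen-3-32b
    provider_id: cerebras
  - model_id: cerebras/llama3.1-8b     # faster responses on simpler tasks
    provider_id: cerebras
```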
System Prompts and Temperature Control
Customize model behavior with system prompts and sampling parameters (such as temperature) to fine-tune responses.
Building Agentic Applications
Llama Stack provides powerful abstractions for building AI agents, and its agent framework can drive multi-turn conversations with Cerebras models.
FAQ
What's the difference between using Llama Stack and calling Cerebras directly?
Llama Stack provides a standardized interface with additional features like safety guardrails, memory management, and agentic capabilities. If you only need basic inference, calling Cerebras directly with the OpenAI SDK is simpler. Use Llama Stack when you need its advanced features or want a provider-agnostic interface.
Can I use Llama Stack without running a local server?
Yes, you can use Llama Stack’s client library to connect to a remote Llama Stack server. However, for Cerebras integration, you’ll need to ensure the server is configured with your Cerebras API key. Alternatively, use the OpenAI SDK directly as shown in the “Using Cerebras Directly” section.
Which Cerebras models work best with Llama Stack?
All current Cerebras models work with Llama Stack. For complex reasoning and agentic tasks, use
llama-3.3-70b or qwen-3-32b. For faster responses on simpler tasks, use llama3.1-8b.
What Python versions are supported?
Llama Stack requires Python 3.12 or higher, as noted in the Prerequisites.
How do I handle rate limits and errors?
Llama Stack automatically handles retries and basic error handling. For production applications, implement additional error handling around your API calls and monitor your Cerebras API usage through the Cerebras Cloud dashboard.
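One way to add the extra error handling mentioned above is a generic retry-with-backoff wrapper; this is an illustrative sketch, and the exception types worth retrying depend on the client library you use:

```python
import time

def with_retries(call, max_attempts=3, base_delay=1.0, retryable=(Exception,)):
    """Invoke `call` with exponential backoff; re-raise after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retryable:
            if attempt == max_attempts:
                raise
            # Back off: base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))

# Usage (hypothetical client call):
# result = with_retries(lambda: client.chat.completions.create(...))
```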
Can I use Llama Stack's safety features with Cerebras models?
Yes, Llama Stack’s safety APIs work with any configured inference provider, including Cerebras. You can add content moderation, prompt injection detection, and other safety features by configuring safety providers in your
run.yaml. See the Llama Stack safety documentation for details.
Troubleshooting
Server won’t start
If the Llama Stack server fails to start:
- Verify your Python version is 3.12 or higher: python --version
- Check that your CEREBRAS_API_KEY environment variable is set: echo $CEREBRAS_API_KEY
- Ensure the cerebras-run.yaml file is in your current directory
- Try running with verbose logging: llama stack run cerebras-run.yaml --verbose
- Check the Llama Stack releases page for any breaking changes
Connection errors
If you see connection errors when making requests:
- Verify the Llama Stack server is running on the expected port (default: 8321)
- Check that your Cerebras API key is valid by testing it directly with the OpenAI SDK
- Ensure there are no firewall rules blocking localhost connections
- Try restarting the Llama Stack server
- Verify your network connectivity to api.cerebras.ai
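One way to test your key directly with the OpenAI SDK, as suggested in the checklist above (a sketch; it assumes CEREBRAS_API_KEY is set and the model ID is valid on your account):

```python
import os

from openai import OpenAI

# Point the OpenAI SDK at the Cerebras endpoint, bypassing Llama Stack
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

response = client.chat.completions.create(
    model="llama-3.3-70b",  # no cerebras/ prefix when calling the API directly
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```

If this call succeeds but requests through Llama Stack fail, the problem is in the server configuration rather than the key.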
Model not found errors
If you get “model not found” errors:
- Use the cerebras/ prefix for model names (e.g., cerebras/llama-3.3-70b)
- List available models: curl http://localhost:8321/v1/models
- Restart the Llama Stack server after making configuration changes
- Consult the Cerebras models page for the current list of available models
Slow response times
If responses are slower than expected:
- Verify you’re using Cerebras models (not accidentally routing through another provider)
- Check your network connection and latency to Cerebras endpoints
- Consider using a smaller model like llama3.1-8b for simpler tasks
- Enable streaming to get partial responses faster
- Check your Cerebras account for any rate limiting or usage quotas
Import errors with llama_stack_client
If you get import errors:
- Ensure llama-stack-client is properly installed: pip install llama-stack-client
- Check that you’re using the correct import: from llama_stack_client import LlamaStackClient
- Verify you’re not mixing old and new package names
- Create a fresh virtual environment if issues persist
Next Steps
- Explore the Llama Stack documentation for advanced features like safety, memory, and agentic systems
- Try different Cerebras models to find the best fit for your use case
- Learn about Llama Stack’s safety APIs for content moderation
- Build agentic applications with Llama Stack’s agent framework
- Join the Llama Stack Discord community for support and discussions
- Review the Llama Stack GitHub repository for latest updates and examples
Llama Stack is actively developed by Meta and the community. For the latest features and updates, check the official documentation and GitHub repository.

