LlamaIndex is a powerful data orchestration framework for building LLM applications. By integrating with Cerebras Inference, you can leverage ultra-fast inference speeds to create responsive RAG (Retrieval-Augmented Generation) applications, chatbots, and data analysis tools.

Prerequisites

Before you begin, ensure you have:
  • Cerebras API Key - Sign up for a free API key in the Cerebras Cloud console
  • Python 3.10 or higher - LlamaIndex requires Python 3.10+
  • Basic familiarity with LlamaIndex - Review the LlamaIndex documentation if you’re new to the framework

Configure LlamaIndex with Cerebras

1. Install required dependencies

Install the LlamaIndex core package and the dedicated Cerebras integration:
pip install llama-index-core llama-index-llms-cerebras
LlamaIndex provides a dedicated Cerebras integration that handles all the API configuration automatically.
You may see a warning about PyTorch/TensorFlow/Flax not being found. This is harmless - those are only needed for running local models. Since you’re using Cerebras’s API, you can safely ignore it or suppress it with warnings.filterwarnings("ignore", message=".*PyTorch.*").
2. Configure environment variables

Create a .env file in your project directory to store your API key securely:
CEREBRAS_API_KEY=your-cerebras-api-key-here
This keeps your credentials safe and makes it easy to switch between development and production environments.
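LlamaIndex does not read .env files on its own - most projects install the python-dotenv package and call load_dotenv() at startup. To illustrate what that loading step actually does, here is a minimal stdlib-only sketch; the load_env_file helper is our own illustration, not part of LlamaIndex or python-dotenv:

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Read simple KEY=VALUE lines from a .env file into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Don't overwrite variables already set in the real environment
        os.environ.setdefault(key.strip(), value.strip())

load_env_file()
```

In practice, pip install python-dotenv and a single load_dotenv() call is the more common approach; the sketch above only shows the idea.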
3. Initialize the Cerebras LLM

Set up LlamaIndex to use Cerebras as your LLM provider via the dedicated Cerebras integration:
import os
import warnings
warnings.filterwarnings("ignore", message=".*PyTorch.*TensorFlow.*Flax.*")

from llama_index.llms.cerebras import Cerebras
from llama_index.core import Settings
from llama_index.core.llms import ChatMessage

# Initialize Cerebras LLM
llm = Cerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY")
)

# Set as the default LLM for LlamaIndex
Settings.llm = llm

# Test with a simple query
messages = [ChatMessage(role="user", content="What is 25 * 25?")]
response = llm.chat(messages)
print(response.message.content)
This configuration tells LlamaIndex to route all LLM calls through Cerebras, giving you access to ultra-fast inference speeds.
4. Make your first query

Test the integration with a simple query to verify everything is working correctly:
import os
from llama_index.llms.cerebras import Cerebras
from llama_index.core.llms import ChatMessage

# Initialize Cerebras LLM
llm = Cerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY")
)

# Create a simple chat interaction
messages = [
    ChatMessage(role="system", content="You are a helpful AI assistant."),
    ChatMessage(role="user", content="Explain what LlamaIndex is in one sentence.")
]

# Get response from Cerebras
response = llm.chat(messages)
print(response.message.content)
You should see a fast, coherent response explaining LlamaIndex. The speed difference compared to other providers will be immediately noticeable!

Streaming Responses

Cerebras’s ultra-fast inference makes streaming particularly impressive. Here’s how to stream responses in LlamaIndex:
import os
from llama_index.llms.cerebras import Cerebras
from llama_index.core.llms import ChatMessage

# Initialize Cerebras LLM
llm = Cerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY")
)

messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="Write a haiku about coding.")
]

# Stream the response
for chunk in llm.stream_chat(messages):
    print(chunk.delta, end="", flush=True)
print()
With Cerebras, you’ll see tokens appearing almost instantaneously, creating a smooth user experience.

Advanced: Custom Query Pipeline

For more control over your RAG pipeline, start from the same Cerebras configuration and build a custom query pipeline on top of it:
import os
from llama_index.llms.cerebras import Cerebras
from llama_index.core import Settings
from llama_index.core.llms import ChatMessage

# Initialize Cerebras LLM
llm = Cerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY")
)

Settings.llm = llm

# Test the configuration
messages = [ChatMessage(role="user", content="List 3 benefits of using LlamaIndex.")]
response = llm.chat(messages)
print(response.message.content)
To build the pipeline itself, load your documents into an index, create a retriever with index.as_retriever(), and assemble a QueryPipeline from your desired modules, such as retrievers, prompt templates, and summarizers.

Model Selection Guide

Choose the right Cerebras model for your use case:
  • llama-3.3-70b - Best for complex reasoning, long-form content, and tasks requiring deep understanding
  • qwen-3-32b - Balanced performance for general-purpose applications
  • llama3.1-8b - Fastest option for simple tasks and high-throughput scenarios
  • gpt-oss-120b - Large open-weight model for demanding tasks
  • zai-glm-4.6 - Advanced 357B parameter model with strong reasoning capabilities
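The guide above can be captured in a small lookup helper. This is a sketch only: the profile names and the pick_model function are our own invention, while the model ids come from the list above:

```python
# Hypothetical helper mapping workload profiles to Cerebras model ids
MODEL_BY_PROFILE = {
    "complex-reasoning": "llama-3.3-70b",
    "general": "qwen-3-32b",
    "high-throughput": "llama3.1-8b",
    "demanding": "gpt-oss-120b",
    "advanced-reasoning": "zai-glm-4.6",
}

def pick_model(profile: str, default: str = "llama-3.3-70b") -> str:
    """Return a Cerebras model id for a workload profile, with a fallback."""
    return MODEL_BY_PROFILE.get(profile, default)

print(pick_model("high-throughput"))  # → llama3.1-8b
```

Centralizing the model id this way makes it easy to switch models per environment without touching the rest of your pipeline code.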

Next Steps

Now that you have LlamaIndex working with Cerebras, try building a full RAG pipeline over your own documents, adding streaming to your chat interface, and experimenting with the models listed above.

Troubleshooting

My API key isn’t working

Make sure you’re using your Cerebras API key, not an OpenAI key. Double-check that:
  1. Your .env file contains CEREBRAS_API_KEY=your-key-here
  2. You’re loading the environment variable correctly with os.getenv("CEREBRAS_API_KEY")
  3. The API key is active in your Cerebras dashboard
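A quick way to rule out the first two issues is a small sanity check before constructing the LLM. The check_api_key helper below is illustrative, not part of LlamaIndex; the sk- heuristic relies on the fact that OpenAI secret keys start with sk-:

```python
import os

def check_api_key(env_var: str = "CEREBRAS_API_KEY") -> bool:
    """Return True if the key is present; print hints for common mistakes."""
    key = os.getenv(env_var)
    if not key:
        print(f"{env_var} is not set - check your .env file and how you load it")
        return False
    if key.startswith("sk-"):
        print(f"{env_var} looks like an OpenAI key, not a Cerebras key")
    return True

check_api_key()
```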
Can I use async operations with Cerebras?

Yes! Cerebras fully supports async operations. Use await llm.achat() and await llm.astream_chat() for async calls:
import asyncio
import os
from llama_index.llms.cerebras import Cerebras
from llama_index.core.llms import ChatMessage

llm = Cerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY")
)

async def async_query():
    messages = [ChatMessage(role="user", content="Hello")]
    response = await llm.achat(messages)
    return response

# Run the coroutine from synchronous code
response = asyncio.run(async_query())
print(response.message.content)
What about rate limits?

Cerebras has generous rate limits, but if you’re building a high-traffic application, consider:
  1. Implementing exponential backoff for retries
  2. Using LlamaIndex’s built-in retry logic
  3. Caching responses for common queries
  4. Contacting Cerebras support for enterprise rate limits
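Item 1 can be sketched in a few lines of plain Python. The with_backoff wrapper below is a generic illustration (not a LlamaIndex or Cerebras API), and the delay values are arbitrary starting points:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=0.5, max_delay=8.0):
    """Call `call()`, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries - surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))

# Example: a stand-in for llm.chat that fails twice, then succeeds
attempts = {"count": 0}
def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429: rate limited")
    return "ok"

print(with_backoff(flaky_call, base_delay=0.01))  # → ok
```

In a real application you would wrap llm.chat(messages) in the call argument and catch only the rate-limit error your client raises, rather than bare Exception.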
Does Cerebras provide embedding models?

Cerebras currently provides LLM inference, not embedding models. For embeddings in your RAG pipeline, use a separate embedding provider such as HuggingFace embeddings by setting Settings.embed_model to a HuggingFaceEmbedding instance. This lets you use Cerebras for generation while relying on specialized embedding models for retrieval.