LlamaIndex is a powerful data orchestration framework for building LLM applications. By integrating with Cerebras Inference, you can leverage ultra-fast inference speeds to create responsive RAG (Retrieval-Augmented Generation) applications, chatbots, and data analysis tools.

Prerequisites

Before you begin, ensure you have:
  • Cerebras API Key - Sign up for a free API key in the Cerebras Cloud console
  • Python 3.10 or higher - LlamaIndex requires Python 3.10+
  • Basic familiarity with LlamaIndex - Review the LlamaIndex documentation if you’re new to the framework

Configure LlamaIndex with Cerebras

1. Install required dependencies

Install the LlamaIndex core package and the dedicated Cerebras integration:
pip install llama-index-core llama-index-llms-cerebras
LlamaIndex provides a dedicated Cerebras integration that handles all the API configuration automatically.
You may see a warning about PyTorch/TensorFlow/Flax not being found. This is harmless - those are only needed for running local models. Since you’re using Cerebras’s API, you can safely ignore it or suppress it with warnings.filterwarnings("ignore", message=".*PyTorch.*").
2. Configure environment variables

Create a .env file in your project directory to store your API key securely:
CEREBRAS_API_KEY=your-cerebras-api-key-here
This keeps your credentials safe and makes it easy to switch between development and production environments.
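LlamaIndex does not read .env files on its own - most projects install the python-dotenv package and call load_dotenv() at startup. To illustrate what that loading step actually does, here is a minimal stdlib-only sketch; the load_env_file helper is our own illustration, not part of LlamaIndex or python-dotenv:

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Read simple KEY=VALUE lines from a .env file into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and malformed lines
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Don't overwrite variables already set in the real environment
        os.environ.setdefault(key.strip(), value.strip())

load_env_file()
```

In practice, pip install python-dotenv and a single load_dotenv() call is the more common approach; the sketch above only shows the idea.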
3. Initialize the Cerebras LLM

Set up LlamaIndex to use Cerebras as your LLM provider via the dedicated Cerebras integration:
import os
import warnings
warnings.filterwarnings("ignore", message=".*PyTorch.*TensorFlow.*Flax.*")

from llama_index.llms.cerebras import Cerebras
from llama_index.core import Settings
from llama_index.core.llms import ChatMessage

# Initialize Cerebras LLM
llm = Cerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY")
)

# Set as the default LLM for LlamaIndex
Settings.llm = llm

# Test with a simple query
messages = [ChatMessage(role="user", content="What is 25 * 25?")]
response = llm.chat(messages)
print(response.message.content)
This configuration tells LlamaIndex to route all LLM calls through Cerebras, giving you access to ultra-fast inference speeds.
4. Make your first query

Test the integration with a simple query to verify everything is working correctly:
import os
from llama_index.llms.cerebras import Cerebras
from llama_index.core.llms import ChatMessage

# Initialize Cerebras LLM
llm = Cerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY")
)

# Create a simple chat interaction
messages = [
    ChatMessage(role="system", content="You are a helpful AI assistant."),
    ChatMessage(role="user", content="Explain what LlamaIndex is in one sentence.")
]

# Get response from Cerebras
response = llm.chat(messages)
print(response.message.content)
You should see a fast, coherent response explaining LlamaIndex. The speed difference compared to other providers will be immediately noticeable!

Streaming Responses

Cerebras’s ultra-fast inference makes streaming particularly impressive. Here’s how to stream responses in LlamaIndex:
import os
from llama_index.llms.cerebras import Cerebras
from llama_index.core.llms import ChatMessage

# Initialize Cerebras LLM
llm = Cerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY")
)

messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="Write a haiku about coding.")
]

# Stream the response
for chunk in llm.stream_chat(messages):
    print(chunk.delta, end="", flush=True)
print()
With Cerebras, you’ll see tokens appearing almost instantaneously, creating a smooth user experience.

Advanced: Custom Query Pipeline

For more control over your RAG pipeline, start from the same Cerebras configuration and build a custom query pipeline on top of it:
import os
from llama_index.llms.cerebras import Cerebras
from llama_index.core import Settings
from llama_index.core.llms import ChatMessage

# Initialize Cerebras LLM
llm = Cerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY")
)

Settings.llm = llm

# Test the configuration
messages = [ChatMessage(role="user", content="List 3 benefits of using LlamaIndex.")]
response = llm.chat(messages)
print(response.message.content)
To build the pipeline itself, load your documents into an index, create a retriever with index.as_retriever(), and assemble a QueryPipeline from your desired modules, such as retrievers, prompt templates, and summarizers.

Model Selection Guide

Choose the right Cerebras model for your use case:
  • llama-3.3-70b - Best for complex reasoning, long-form content, and tasks requiring deep understanding
  • qwen-3-32b - Balanced performance for general-purpose applications
  • llama3.1-8b - Fastest option for simple tasks and high-throughput scenarios
  • gpt-oss-120b - Large open-weight model for demanding tasks
  • zai-glm-4.6 - Advanced 357B parameter model with strong reasoning capabilities
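The guide above can be captured in a small lookup helper. This is a sketch only: the profile names and the pick_model function are our own invention, while the model ids come from the list above:

```python
# Hypothetical helper mapping workload profiles to Cerebras model ids
MODEL_BY_PROFILE = {
    "complex-reasoning": "llama-3.3-70b",
    "general": "qwen-3-32b",
    "high-throughput": "llama3.1-8b",
    "demanding": "gpt-oss-120b",
    "advanced-reasoning": "zai-glm-4.6",
}

def pick_model(profile: str, default: str = "llama-3.3-70b") -> str:
    """Return a Cerebras model id for a workload profile, with a fallback."""
    return MODEL_BY_PROFILE.get(profile, default)

print(pick_model("high-throughput"))  # → llama3.1-8b
```

Centralizing the model id this way makes it easy to switch models per environment without touching the rest of your pipeline code.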

Next Steps

Now that you have LlamaIndex working with Cerebras, try building a full RAG pipeline over your own documents, adding streaming to your chat interface, and experimenting with the models listed above.

Troubleshooting

My API key isn’t working

Make sure you’re using your Cerebras API key, not an OpenAI key. Double-check that:
  1. Your .env file contains CEREBRAS_API_KEY=your-key-here
  2. You’re loading the environment variable correctly with os.getenv("CEREBRAS_API_KEY")
  3. The API key is active in your Cerebras dashboard
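A quick way to rule out the first two issues is a small sanity check before constructing the LLM. The check_api_key helper below is illustrative, not part of LlamaIndex; the sk- heuristic relies on the fact that OpenAI secret keys start with sk-:

```python
import os

def check_api_key(env_var: str = "CEREBRAS_API_KEY") -> bool:
    """Return True if the key is present; print hints for common mistakes."""
    key = os.getenv(env_var)
    if not key:
        print(f"{env_var} is not set - check your .env file and how you load it")
        return False
    if key.startswith("sk-"):
        print(f"{env_var} looks like an OpenAI key, not a Cerebras key")
    return True

check_api_key()
```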
Can I use async operations with Cerebras?

Yes! Cerebras fully supports async operations. Use await llm.achat() and await llm.astream_chat() for async calls:
import asyncio
import os
from llama_index.llms.cerebras import Cerebras
from llama_index.core.llms import ChatMessage

llm = Cerebras(
    model="llama-3.3-70b",
    api_key=os.getenv("CEREBRAS_API_KEY")
)

async def async_query():
    messages = [ChatMessage(role="user", content="Hello")]
    response = await llm.achat(messages)
    return response

# Run the coroutine from synchronous code
response = asyncio.run(async_query())
print(response.message.content)
What about rate limits?

Cerebras has generous rate limits, but if you’re building a high-traffic application, consider:
  1. Implementing exponential backoff for retries
  2. Using LlamaIndex’s built-in retry logic
  3. Caching responses for common queries
  4. Contacting Cerebras support for enterprise rate limits
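Item 1 can be sketched in a few lines of plain Python. The with_backoff wrapper below is a generic illustration (not a LlamaIndex or Cerebras API), and the delay values are arbitrary starting points:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=0.5, max_delay=8.0):
    """Call `call()`, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries - surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))

# Example: a stand-in for llm.chat that fails twice, then succeeds
attempts = {"count": 0}
def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429: rate limited")
    return "ok"

print(with_backoff(flaky_call, base_delay=0.01))  # → ok
```

In a real application you would wrap llm.chat(messages) in the call argument and catch only the rate-limit error your client raises, rather than bare Exception.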
Does Cerebras provide embedding models?

Cerebras currently provides LLM inference, not embedding models. For embeddings in your RAG pipeline, use a separate embedding provider such as HuggingFace embeddings by setting Settings.embed_model to a HuggingFaceEmbedding instance. This lets you use Cerebras for generation while relying on specialized embedding models for retrieval.