POST /v1/chat/completions
import os

from cerebras.cloud.sdk import Cerebras

# Reads the API key from the CEREBRAS_API_KEY environment variable.
client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

chat_completion = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)
print(chat_completion)
{
  "id": "chatcmpl-292e278f-514e-4186-9010-91ce6a14168b",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Hello! How can I assist you today?",
        "reasoning": "The user is asking for a simple greeting to the world. This is a straightforward request that doesn't require complex analysis. I should provide a friendly, direct response.",
        "role": "assistant"
      }
    }
  ],
  "created": 1723733419,
  "model": "gpt-oss-120b",
  "system_fingerprint": "fp_70185065a4",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 10,
    "total_tokens": 22,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "time_info": {
    "queue_time": 0.000073161,
    "prompt_time": 0.0010744798888888889,
    "completion_time": 0.005658071111111111,
    "total_time": 0.022224903106689453,
    "created": 1723733419
  }
}

Request

logprobs
bool
Whether to return log probabilities of the output tokens or not. Default: false
max_completion_tokens
integer | null
The maximum number of tokens that can be generated in the completion, including reasoning tokens. The total length of input tokens and generated tokens is limited by the model’s context length. Default settings: qwen-3-32b = 40k | llama-3.3-70b = 64k.
messages
object[]
required
A list of messages comprising the conversation so far. Note: System prompts must be passed to the messages parameter as a string. Support for other object types will be added in future releases.
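For example, a system prompt goes in as a plain string in the content field; a minimal sketch reusing the client from the example above:

chat_completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        # The system prompt's content must be a plain string, not an array of parts.
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain nucleus sampling in one sentence."},
    ],
)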
model
string
required
Available options:
  • llama3.1-8b
  • llama-3.3-70b
  • qwen-3-32b
  • qwen-3-235b-a22b-instruct-2507 (preview)
  • gpt-oss-120b
  • zai-glm-4.6 (preview)
parallel_tool_calls
boolean | null
Whether to enable parallel function calling during tool use. When enabled (default), the model can request multiple tool calls simultaneously in a single response. When disabled, the model will only request one tool call at a time. Default: true
prediction
object | null
Configuration for a Predicted Output, which can greatly speed up response times when large parts of the model response are known in advance. This is most common when you are regenerating a file with mostly minor changes to the content. Visit our page on Predicted Outputs for more information and examples.
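As a rough sketch, assuming the OpenAI-style {"type": "content", ...} shape for the prediction object (the Predicted Outputs page is authoritative):

# Sketch: regenerating a file with minor edits, passing the known text as a prediction.
# The {"type": "content", "content": ...} shape is assumed here; confirm it against
# the Predicted Outputs page. app.py is a hypothetical file.
existing_code = open("app.py").read()

chat_completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "Rename the function run to main:\n" + existing_code}
    ],
    prediction={"type": "content", "content": existing_code},
)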
reasoning_effort
string | null
Controls the amount of reasoning the model performs. Available values:
  • "low" - Minimal reasoning, faster responses
  • "medium" - Moderate reasoning (default)
  • "high" - Extensive reasoning, more thorough analysis
This flag is only available for the gpt-oss-120b model.
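A minimal sketch (gpt-oss-120b only, per the note above):

chat_completion = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Outline a three-step database migration plan."}],
    reasoning_effort="low",  # trade reasoning depth for faster responses
)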
response_format
object | null
An object that controls the format of the model response. Setting it to { "type": "json_schema", "json_schema": { "name": "schema_name", "strict": true, "schema": {...} } } enforces schema compliance by ensuring that the model output conforms to your specified JSON schema. See Structured Outputs for more information. Setting { "type": "json_object" } enables the legacy JSON mode, ensuring that the model output is valid JSON. However, using json_schema is recommended for models that support it.
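A minimal sketch using json_schema; the schema itself is illustrative:

import json

chat_completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "Extract the city and country from: I live in Lyon, France."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "location",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
                "additionalProperties": False,
            },
        },
    },
)
# With strict schema enforcement, the content parses as the requested JSON shape.
print(json.loads(chat_completion.choices[0].message.content))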
seed
integer | null
If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result. Determinism is not guaranteed.
stop
string | null
Up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
stream
boolean | null
If set, partial message deltas will be sent.
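For example, a minimal streaming loop (assuming the SDK yields OpenAI-compatible chunk objects with a delta field):

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a partial delta; content can be None on some chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")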
temperature
number | null
What sampling temperature to use, between 0 and 1.5. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
top_logprobs
integer | null
An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.
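A minimal sketch combining the two parameters (the per-token fields assume the OpenAI-compatible logprobs shape):

chat_completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    logprobs=True,   # required when top_logprobs is set
    top_logprobs=5,
)
# Assumed OpenAI-compatible shape: one entry per generated token.
for token_info in chat_completion.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)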
top_p
number | null
An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So, 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
tool_choice
string | object
Controls which (if any) tool is called by the model. none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools. Specifying a particular tool via {"type": "function", "function": {"name": "my_function"}} forces the model to call that tool. none is the default when no tools are present. auto is the default if tools are present.
tools
object | null
A list of tools the model may call. Currently, only functions are supported as a tool. Use this to provide a list of functions the model may generate JSON inputs for. Specifying tools consumes prompt tokens in the context. If too many are given, the model may perform poorly or you may hit context length limitations.
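A minimal sketch with one function tool; get_weather is a hypothetical name, and the shapes follow the function-calling format shown under tool_choice above:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical function for illustration
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

chat_completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",  # the default whenever tools are present
)
# If the model chose to call the tool, the request arrives as tool_calls
# rather than as message content.
print(chat_completion.choices[0].message.tool_calls)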
user
string | null
A unique identifier representing your end-user, which can help to monitor and detect abuse.

Response

id
string
A unique identifier for the chat completion.
choices
object[]
A list of chat completion choices. Can be more than one if n is greater than 1.
created
integer
The Unix timestamp (in seconds) of when the chat completion was created.
model
string
The model used for the chat completion.
object
string
The object type, which is always chat.completion.
usage
object
Usage statistics for the completion request.
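In the SDK, these fields are attributes on the returned object; a quick sketch of reading them from the chat_completion created in the example above:

print(chat_completion.id)                          # unique completion identifier
print(chat_completion.model)                       # model that served the request
print(chat_completion.choices[0].message.content)  # assistant reply
print(chat_completion.usage.total_tokens)          # prompt + completion tokens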