134 changes: 82 additions & 52 deletions docs/user-guides/configuration-guide/llm-configuration.md
@@ -59,85 +59,127 @@ For more details about the command and its usage, see the [CLI documentation](..

### Using LLMs with Reasoning Traces

```{deprecated} 0.18.0
The `reasoning_config` field and its options `remove_reasoning_traces`, `start_token`, and `end_token` are deprecated. The `rails.output.apply_to_reasoning_traces` field has also been deprecated. Instead, use output rails to guardrail reasoning traces, as introduced in this section.
```

Reasoning-capable LLMs such as [DeepSeek-R1](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) and [NVIDIA Llama 3.1 Nemotron Ultra 253B V1](https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1) include reasoning traces in their responses, typically wrapped in tokens such as `<think>` and `</think>`.

The NeMo Guardrails toolkit automatically extracts these traces and makes them available throughout your guardrails configuration through the following variables:

- In Colang flows, use the `$bot_thinking` variable.
- In Python contexts, use the `bot_thinking` variable, as shown in the sketch after this list.
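
For example, a custom action can read the trace from the action context. The following is a minimal sketch rather than a toolkit built-in: the action name, the banned-phrase policy, and the registration call are illustrative assumptions.

```python
from typing import Optional

from nemoguardrails.actions import action


@action(name="check_bot_reasoning")
async def check_bot_reasoning(context: Optional[dict] = None) -> bool:
    """Illustrative check that gates a response on its reasoning trace."""
    # `bot_thinking` holds the extracted reasoning trace, when the model emits one
    bot_thinking = (context or {}).get("bot_thinking") or ""
    # Hypothetical policy: block responses whose reasoning mentions confidential material
    return "confidential" not in bot_thinking.lower()


# Register the action on an LLMRails instance, for example:
# rails.register_action(check_bot_reasoning)
```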

#### Guardrailing Reasoning Traces with Output Rails

Use output rails to inspect and control reasoning traces. This allows you to:

- Block responses based on problematic reasoning patterns.
- Enhance moderation decisions with reasoning context.
- Monitor and filter sensitive information in reasoning.

##### Prepare Configuration Files

The following configuration files show a minimal configuration for guardrailing reasoning traces with output rails.

1. Configure output rails in `config.yml`:

   ```yaml
   models:
     - type: main
       engine: nim
       model: nvidia/llama-3.1-nemotron-ultra-253b-v1
     - type: self_check_output
       model: <your_moderation_model>
       engine: <your_engine>

   rails:
     output:
       flows:
         - self check output
   ```

1. Configure the prompt to access the reasoning traces in `prompts.yml`:

   ```yaml
   prompts:
     - task: self_check_output
       content: |
         Your task is to check if the bot message complies with company policy.

         Bot message: "{{ bot_response }}"

         {% if bot_thinking %}
         Bot reasoning: "{{ bot_thinking }}"
         {% endif %}

         Should this be blocked (Yes or No)?
         Answer:
   ```
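
With both files in place, you can load the configuration and generate a guardrailed response. The following is a minimal sketch; the `./config` directory and the example prompt are assumptions.

```python
from nemoguardrails import RailsConfig, LLMRails

# Assumes the config.yml and prompts.yml above are saved in ./config
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# The self check output rail receives the bot message and, when the model
# emits one, the reasoning trace through the {{ bot_thinking }} variable
response = rails.generate(messages=[{"role": "user", "content": "Hello!"}])
print(response["content"])
```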

For more detailed examples of guardrailing reasoning traces, refer to [Guardrailing Bot Reasoning Content](../../advanced/bot-thinking-guardrails.md).

#### Accessing Reasoning Traces in API Responses

There are two ways to access reasoning traces in API responses: with generation options and without generation options.

Use **With GenerationOptions** when you:

- Need structured access to reasoning and response separately.
- Are building a new application.
- Need access to other structured fields such as `state`, `output_data`, or `llm_metadata`.

Use **Without GenerationOptions** when you:

- Need backward compatibility with existing code.
- Want the raw response with inline reasoning tags.
- Are integrating with systems that expect tagged strings.

##### With GenerationOptions for Structured Access

When you pass `GenerationOptions` to the API, the function returns a `GenerationResponse` object with structured fields. This approach provides clean separation between the reasoning traces and the final response content, making it easier to process each component independently.

The `reasoning_content` field contains the extracted reasoning traces, while `response` contains the main LLM response. This structured access pattern is recommended for new applications as it provides type safety and clear access to all response metadata.

The following example demonstrates how to use `GenerationOptions` in an asynchronous guardrails generation call, `rails.generate_async`, to access reasoning traces.

```python
from nemoguardrails import RailsConfig, LLMRails
from nemoguardrails.rails.llm.options import GenerationOptions

# Load the guardrails configuration
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# Create a GenerationOptions object to enable structured responses
options = GenerationOptions()

# Make an async call with GenerationOptions
result = await rails.generate_async(
messages=[{"role": "user", "content": "What is 2+2?"}],
options=options
)

# Access reasoning traces separately from the response
if result.reasoning_content:
print("Reasoning:", result.reasoning_content)

# Access the main response content
print("Response:", result.response[0]["content"])
```

The following example output shows the reasoning traces and the main response content from the guardrailed generation result.

```
Reasoning: Let me calculate: 2 plus 2 equals 4.
Response: The answer is 4.
```
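
The same `result` object also exposes the other structured fields listed earlier. A quick sketch, assuming your configuration populates them; the field names are taken from the list above and their availability may vary by version.

```python
# Structured metadata beyond the response text
print("State:", result.state)
print("Output data:", result.output_data)
print("LLM metadata:", result.llm_metadata)
```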

##### Without GenerationOptions for Tagged String

When you call the API without `GenerationOptions`, the result is a plain dict or string response, and the reasoning is wrapped in `<think>` tags.

The following example demonstrates how to access reasoning traces without using `GenerationOptions`.

```python
response = rails.generate(
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
print(response["content"])
```

The reasoning portion of the response is wrapped in `<think>` tags, as shown in the following example output.

```
<think>Let me calculate: 2 plus 2 equals 4.</think>
The answer is 4.
```
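
If downstream code needs the two parts separately, the tagged string can be split manually. The following is a sketch assuming a single leading `<think>` block, as in the output above.

```python
import re

raw = response["content"]

# Split one leading <think>...</think> block from the final answer
match = re.match(r"<think>(.*?)</think>\s*(.*)", raw, re.DOTALL)
if match:
    print("Reasoning:", match.group(1).strip())
    print("Response:", match.group(2).strip())
else:
    # No reasoning trace was emitted; the content is the answer itself
    print("Response:", raw)
```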

### NIM for LLMs

[NVIDIA NIM](https://docs.nvidia.com/nim/index.html) is a set of easy-to-use microservices designed to accelerate the deployment of generative AI models across the cloud, data center, and workstations.
Expand Down