# Guardrailing Bot Reasoning Content

Reasoning-capable large language models (LLMs) expose their internal thought process as reasoning traces. These traces reveal how the model arrives at its conclusions, providing transparency into the decision-making process. However, they may also contain sensitive information or problematic reasoning patterns that need to be monitored and controlled.

The NeMo Guardrails toolkit helps you set up guardrails to inspect and control these reasoning traces by extracting them. With this feature, you can configure guardrails that can block responses based on the model's reasoning process, enhance moderation decisions with reasoning context, or monitor reasoning patterns.

```{note}
This guide uses Colang 1.0 syntax. Bot reasoning guardrails are currently supported in Colang 1.0 only.
```

```{important}
The examples in this guide range from minimal toy examples (for understanding concepts) to complete reference implementations. These examples teach you how to access and work with `bot_thinking` in different contexts, not as production-ready code to copy-paste. Adapt these patterns to your specific use case with appropriate validation, error handling, and business logic for your application.
```

---

## Accessing Reasoning Content

When an LLM generates a response with reasoning traces, the NeMo Guardrails toolkit extracts the reasoning and makes it available through the `bot_thinking` variable. You can use this variable in the following ways.

### In Colang Flows

The reasoning content is available as a context variable in Colang output rails. For example, in `config/rails.co`, you can set up a flow to capture the reasoning content by setting the `$captured_reasoning` variable to `$bot_thinking`.

```{code-block}
define flow check_reasoning
  if $bot_thinking
    $captured_reasoning = $bot_thinking
```
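
To see this flow run, you can load the configuration and generate a response from Python. The following is a minimal sketch; it assumes the flow above is part of a configuration directory at `./config` and is listed under `rails.output.flows`:

```{code-block} python
from nemoguardrails import LLMRails, RailsConfig

# Load the guardrails configuration from the ./config directory.
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# Generate a response; if the model emits reasoning traces,
# the check_reasoning flow captures them in $captured_reasoning.
response = rails.generate(messages=[{"role": "user", "content": "Hello!"}])
print(response["content"])
```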

### In Custom Actions

When you write Python action functions in `config/actions.py`, you can access the reasoning through the context dictionary. For example, the following action function retrieves the reasoning through `context.get("bot_thinking")` and returns `False` if it contains the word `"sensitive"`.

```{code-block} python
from typing import Optional

from nemoguardrails.actions import action

@action(is_system_action=True)
async def check_reasoning(context: Optional[dict] = None):
    bot_thinking = context.get("bot_thinking")

    # Block the response when the reasoning mentions "sensitive".
    if bot_thinking and "sensitive" in bot_thinking.lower():
        return False

    return True
```
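
Actions defined in `config/actions.py` are discovered automatically when the configuration loads. If you keep the action elsewhere, you can also register it programmatically. The following sketch assumes a hypothetical `my_actions` module holding the function above:

```{code-block} python
from nemoguardrails import LLMRails, RailsConfig

from my_actions import check_reasoning  # hypothetical module with the action above

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# Manual registration; not needed when the action lives in config/actions.py.
rails.register_action(check_reasoning, "check_reasoning")
```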

### In Prompt Templates

When you render prompts for LLM tasks such as `self check output`, the reasoning is available as a Jinja2 template variable. For example, in `prompts.yml`, you can set up a prompt to check if the reasoning contains the word `"sensitive"` and block the response if it does.

```yaml
prompts:
  - task: self_check_output
    content: |
      Check the bot reasoning below. If the reasoning contains the word
      "sensitive", the response must be blocked.

      {% if bot_thinking %}
      Bot reasoning: "{{ bot_thinking }}"
      {% endif %}

      Should this be blocked (Yes or No)?
```

```{important}
Always check if reasoning exists before using it, as not all models provide reasoning traces.
```

---

## Guardrailing with Output Rails

You can use the `$bot_thinking` variable in output rails to inspect and control responses based on reasoning content.

```{code-block}
:caption: Basic Pattern Matching

define bot refuse to respond
  "I'm sorry, I can't respond to that."

define flow block_sensitive_reasoning
  if $bot_thinking and "sensitive" in $bot_thinking
    bot refuse to respond
    stop
```

Add this flow to your output rails in `config.yml`:

```{code-block}
:caption: In `config.yml`

rails:
  output:
    flows:
      - block_sensitive_reasoning
```

```{note}
This demonstrates basic pattern matching for learning purposes. Real implementations should use more comprehensive validation and consider edge cases.
```
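
To verify that the rail fires, you can send a prompt that is likely to surface the pattern and check for the refusal message. This is a sketch, not part of the reference configuration:

```{code-block} python
# Assumes `rails` is an LLMRails instance loaded from this configuration.
response = rails.generate(
    messages=[{"role": "user", "content": "Summarize the internal roadmap."}]
)

# If the model's reasoning mentioned "sensitive", the output rail fired
# and the content is the refusal message.
print(response["content"])
```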

---

## Guardrailing with Custom Actions

For complex validation logic or reusable checks across multiple flows, you can write custom Python actions.
This approach provides better code organization and makes it easier to share validation logic across different guardrails.

1. Write the custom action function in `config/actions.py` as follows:

   ```{code-block} python
   from typing import Optional

   from nemoguardrails.actions import action

   @action(is_system_action=True)
   async def check_reasoning_quality(context: Optional[dict] = None):
       bot_thinking = context.get("bot_thinking")

       if not bot_thinking:
           return True

       forbidden_patterns = [
           "proprietary information",
           "trade secret",
           "confidential data"
       ]

       for pattern in forbidden_patterns:
           if pattern.lower() in bot_thinking.lower():
               return False

       return True
   ```

2. Write the flow that uses the custom action function in `config/rails.co` as follows:

   ```{code-block}
   define bot refuse to respond
     "I'm sorry, I can't respond to that."

   define flow quality_check_reasoning
     $is_safe = execute check_reasoning_quality

     if not $is_safe
       bot refuse to respond
       stop
   ```

3. Add the flow to your output rails in `config.yml`.

   ```{code-block}
   rails:
     output:
       flows:
         - quality_check_reasoning
   ```
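
Because the action is plain Python, you can unit-test it directly before wiring it into the rails. A minimal sketch follows; the import path assumes the file layout above:

```{code-block} python
import asyncio

from config.actions import check_reasoning_quality  # import path assumes the layout above

# Reasoning that trips a forbidden pattern is rejected.
blocked = asyncio.run(
    check_reasoning_quality(context={"bot_thinking": "This relies on a trade secret."})
)
assert blocked is False

# Missing reasoning is allowed through.
allowed = asyncio.run(check_reasoning_quality(context={}))
assert allowed is True
```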

---

## Using Reasoning in Self-Check Output

The following example shows how to use `bot_thinking` in a self-check output rail. This pattern provides reasoning traces to your moderation LLM, allowing it to make more informed decisions by evaluating both the response and the reasoning process.

This is the *complete reference implementation* from [examples/configs/self_check_thinking/](https://github.com/NVIDIA/NeMo-Guardrails/tree/develop/examples/configs/self_check_thinking) in the NeMo Guardrails toolkit repository.

### Configuration Steps

1. Write the `config.yml` file as follows:

   ```yaml
   models:
     - type: main
       engine: <your_engine>
       model: <your_reasoning_model>
     - type: self_check_output
       model: <your_moderation_model>
       engine: <your_engine>

   rails:
     output:
       flows:
         - self check output
   ```

2. Write the `prompts.yml` file as follows:

   ```yaml
   prompts:
     - task: self_check_output
       content: |
         Your task is to check if the bot message below complies with the company policy.

         Company policy for the bot:
         - messages should not contain any explicit content
         - messages should not contain abusive language or offensive content
         - messages should not contain any harmful content
         - messages should not contain racially insensitive content
         - if a message is a refusal, it should be polite

         Bot message: "{{ bot_response }}"

         {% if bot_thinking %}
         Bot thinking/reasoning: "{{ bot_thinking }}"
         {% endif %}

         Question: Should the message be blocked (Yes or No)?
         Answer:
   ```


The `{% if bot_thinking %}` conditional ensures the prompt works with both reasoning and non-reasoning models. When reasoning is available, the self-check LLM can evaluate both the final response and the reasoning process.
You can find the complete working configuration with all files in [examples/configs/self_check_thinking/](https://github.com/NVIDIA/NeMo-Guardrails/tree/develop/examples/configs/self_check_thinking). Use it as a reference for your own implementation.
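
To confirm that the self-check rail ran on a given turn, you can request the generation log. This sketch assumes the configuration above is in `./config`; the exact shape of the log object may vary by toolkit version:

```{code-block} python
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

response = rails.generate(
    messages=[{"role": "user", "content": "How do I reset my password?"}],
    options={"log": {"activated_rails": True}},
)

# List the rails that fired, including "self check output".
for activated_rail in response.log.activated_rails:
    print(activated_rail.type, activated_rail.name)
```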

## Related Guides

Refer to the following guides for more information about the features used on this page.

- [LLM Configuration - Using LLMs with Reasoning Traces](../configuration-guide/llm-configuration.md#using-llms-with-reasoning-traces): API response handling and breaking changes.
- [Output Rails](../../getting-started/5-output-rails/README.md): General guide on output rails.
- [Self-Check Output Example](https://github.com/NVIDIA/NeMo-Guardrails/tree/develop/examples/configs/self_check_thinking): Complete working configuration.
- [Custom Actions](../../colang-language-syntax-guide.md#actions): Guide on writing custom actions.