`docs/user-guides/configuration-guide/llm-configuration.md`

For more details about the command and its usage, see the [CLI documentation](..
### Using LLMs with Reasoning Traces

```{deprecated} 0.18.0
The `reasoning_config` field and its options `remove_reasoning_traces`, `start_token`, and `end_token` are deprecated. The `rails.output.apply_to_reasoning_traces` field has also been deprecated. Instead, use output rails to guardrail reasoning traces, as introduced in this section.
```

Reasoning-capable LLMs such as [DeepSeek-R1](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) and [NVIDIA Llama 3.1 Nemotron Ultra 253B V1](https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1) include reasoning traces in their responses, typically wrapped in tokens such as `<think>` and `</think>`.

The NeMo Guardrails toolkit automatically extracts these traces and makes them available in your guardrails configuration through the following variables:

- In Colang flows, use the `$bot_thinking` variable.
- In Python contexts, use the `bot_thinking` variable.

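Conceptually, the extraction works like the following plain-Python sketch. This is illustrative only: the toolkit performs the extraction internally, and the exact token pair depends on the model.

```python
import re

# Raw completion from a reasoning-capable model (illustrative).
raw = "<think>Let me check the policy first.</think>Sure, here is the answer."

# Separate the reasoning trace from the visible reply.
match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", raw, flags=re.DOTALL)
bot_thinking, bot_response = match.group(1), match.group(2)

print(bot_thinking)   # Let me check the policy first.
print(bot_response)   # Sure, here is the answer.
```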
#### Guardrailing Reasoning Traces with Output Rails

Use output rails to inspect and control reasoning traces. This allows you to:

- Block responses based on problematic reasoning patterns.
- Enhance moderation decisions with reasoning context.
- Monitor and filter sensitive information in reasoning.
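For illustration, a simple check over the reasoning trace might look like the following sketch. The helper name and blocked terms are hypothetical, not part of NeMo Guardrails; in practice, the `self_check_output` rail delegates this decision to a moderation model.

```python
BLOCKED_TERMS = ("password", "api key", "social security")

def reasoning_violates_policy(bot_thinking):
    """Return True if the reasoning trace mentions a blocked term.

    Hypothetical helper: the terms and matching rule are illustrative only.
    """
    if not bot_thinking:
        return False
    lowered = bot_thinking.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

print(reasoning_violates_policy("The user's password is hunter2"))  # True
print(reasoning_violates_policy("2 plus 2 equals 4"))               # False
```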

##### Prepare Configuration Files

The following configuration files show a minimal configuration for guardrailing reasoning traces with output rails.

1. Configure output rails in `config.yml`:

   ```yaml
   models:
     - type: main
       engine: nim
       model: nvidia/llama-3.1-nemotron-ultra-253b-v1
     - type: self_check_output
       model: <your_moderation_model>
       engine: <your_engine>

   rails:
     output:
       flows:
         - self check output
   ```

1. Configure the prompt to access the reasoning traces in `prompts.yml`:

   ```yaml
   prompts:
     - task: self_check_output
       content: |
         Your task is to check if the bot message complies with company policy.

         Bot message: "{{ bot_response }}"

         {% if bot_thinking %}
         Bot reasoning: "{{ bot_thinking }}"
         {% endif %}

         Should this be blocked (Yes or No)?
         Answer:
   ```
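To see roughly what the moderation model receives, the prompt template can be approximated in plain Python. This is a sketch that mimics the Jinja rendering; the toolkit renders `prompts.yml` itself, and the helper name is hypothetical.

```python
def render_self_check_prompt(bot_response, bot_thinking=None):
    """Approximate the rendered self_check_output prompt (sketch only)."""
    lines = [
        "Your task is to check if the bot message complies with company policy.",
        "",
        f'Bot message: "{bot_response}"',
    ]
    if bot_thinking:  # mirrors the {% if bot_thinking %} guard
        lines += ["", f'Bot reasoning: "{bot_thinking}"']
    lines += ["", "Should this be blocked (Yes or No)?", "Answer:"]
    return "\n".join(lines)

print(render_self_check_prompt("The answer is 4.", "Let me calculate: 2 plus 2 equals 4."))
```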

For more detailed examples of guardrailing reasoning traces, refer to [Guardrailing Bot Reasoning Content](../../advanced/bot-thinking-guardrails.md).

#### Accessing Reasoning Traces in API Responses

There are two ways to access reasoning traces in API responses: with generation options and without generation options.

Use **With GenerationOptions** when you:

- Need structured access to reasoning and response separately.
- Are building a new application.
- Need access to other structured fields such as `state`, `output_data`, or `llm_metadata`.

Use **Without GenerationOptions** when you:

- Need backward compatibility with existing code.
- Want the raw response with inline reasoning tags.
- Are integrating with systems that expect tagged strings.

##### With GenerationOptions for Structured Access

When you pass `GenerationOptions` to the API, the function returns a `GenerationResponse` object with structured fields. This approach cleanly separates the reasoning traces from the final response content, making it easier to process each component independently.

The `reasoning_content` field contains the extracted reasoning traces, while `response` contains the main LLM response. This structured access pattern is recommended for new applications because it provides type safety and clear access to all response metadata.

The following example demonstrates how to use `GenerationOptions` in a guardrails async generation call, `rails.generate_async`, to access reasoning traces.

```python
from nemoguardrails import RailsConfig, LLMRails
from nemoguardrails.rails.llm.options import GenerationOptions

# Load the guardrails configuration
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# Create a GenerationOptions object to enable structured responses
options = GenerationOptions()

# Make an async call with GenerationOptions
result = await rails.generate_async(
    messages=[{"role": "user", "content": "What is 2+2?"}],
    options=options
)

# Access reasoning traces separately from the response
if result.reasoning_content:
    print("Reasoning:", result.reasoning_content)

# Access the main response content
print("Response:", result.response[0]["content"])
```

The following example output shows the reasoning traces and the main response content from the guardrailed generation result.

```
Reasoning: Let me calculate: 2 plus 2 equals 4.
Response: The answer is 4.
```

##### Without GenerationOptions for Tagged String

When calling without `GenerationOptions`, such as when you use the dict or string response form, reasoning is wrapped in `<think>` tags.

The following example demonstrates how to access reasoning traces without using `GenerationOptions`.

```python
response = rails.generate(
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
print(response["content"])
```

The response is wrapped in `<think>` tags, as shown in the following example output.

```
<think>Let me calculate: 2 plus 2 equals 4.</think>
The answer is 4.
```
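If a downstream system should not expose reasoning to end users, the tags can be stripped from the tagged string. The following is a hypothetical helper, assuming a single `<think>...</think>` pair as in the output above.

```python
import re

def strip_reasoning(text):
    """Remove the <think>...</think> block, leaving only the visible answer."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

tagged = "<think>Let me calculate: 2 plus 2 equals 4.</think>\nThe answer is 4."
print(strip_reasoning(tagged))  # The answer is 4.
```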
### NIM for LLMs
[NVIDIA NIM](https://docs.nvidia.com/nim/index.html) is a set of easy-to-use microservices designed to accelerate the deployment of generative AI models across the cloud, data center, and workstations.