
Commit 86dd9d9

gjulianm and ijkaylin authored
gpu: Update setup documentation (#33090)
* Update setup
* Improvements
* PR suggestions
* Fix process.core.usage
* Fix tab
* Apply suggestions from code review
* Update content/en/gpu_monitoring/setup.md
  Co-authored-by: Kathy L. <[email protected]>
* Update content/en/gpu_monitoring/setup.md
  Co-authored-by: Kathy L. <[email protected]>
* Update content/en/gpu_monitoring/setup.md
  Co-authored-by: Kathy L. <[email protected]>
* PR comments

---------

Co-authored-by: Kathy L. <[email protected]>
1 parent 876b044 commit 86dd9d9

File tree

1 file changed (+114, -16 lines)


content/en/gpu_monitoring/setup.md

Lines changed: 114 additions & 16 deletions
@@ -2,27 +2,36 @@
 title: Set up GPU Monitoring
 private: true
 ---
-This page provides instructions on setting up Datadog's GPU Monitoring on a uniform cluster (all nodes have GPU devices) or a mixed cluster (only some nodes have GPU devices).
+This page provides instructions on setting up Datadog's GPU Monitoring on your infrastructure. Follow the configuration instructions that match your operating environment below.
+
+To get additional insights and advanced eBPF metrics, such as GPU core utilization, you can optionally enable System Probe in privileged mode.
 
 ### Prerequisites
 
 To begin using Datadog's GPU Monitoring, your environment must meet the following criteria:
-- You are a Datadog user with active Datadog infrastructure hosts
-- The NVIDIA device plugin for Kubernetes is installed ([directly][3], or through [NVIDIA GPU Operator][4])
+
+- You are running the Datadog Agent on the GPU-accelerated hosts that you want to monitor. If not, see the following guides:
+  - [Install the Datadog Agent on Kubernetes][1]
+  - [Install the Datadog Agent on Docker][7]
+  - [Install the Datadog Agent on non-containerized Linux][8]
+- The NVIDIA drivers are installed on the hosts. If using Kubernetes, the NVIDIA device plugin for Kubernetes is installed ([directly][3], or through [NVIDIA GPU Operator][4])
 
 #### Minimum version requirements
 
-- **Datadog Agent**: version 7.72.2
-- [**Datadog Operator**][5]: version 1.18, _or_ [**Datadog Helm chart**][6]: version 3.137.3
+- **Datadog Agent**: v7.72.2
 - **Operating system**: Linux
   - (Optional) For advanced eBPF metrics, Linux kernel version 5.8
 - **NVIDIA driver**: version 450.51
+
+If using Kubernetes, the following additional requirements must be met:
+
+- [**Datadog Operator**][5]: version 1.18, _or_ [**Datadog Helm chart**][6]: version 3.137.3
 - **Kubernetes**: 1.22 with PodResources API active
 
-## Set up GPU Monitoring on a uniform cluster or non-Kubernetes environment
+## Set up GPU Monitoring on a uniform Kubernetes cluster or non-Kubernetes environment
 
 The following instructions are the basic steps to set up GPU Monitoring in the following environments:
-- In a Kubernetes cluster where **all** the nodes have GPU devices
+- In a Kubernetes cluster where **all** nodes have GPU devices
 - In a non-Kubernetes environment, such as Docker or non-containerized Linux.
 
 {{< tabs >}}
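Before moving on, it can help to confirm these prerequisites on a GPU host. A minimal sketch, assuming `nvidia-smi` ships with the driver and, on Kubernetes, that GPU nodes carry the `nvidia.com/gpu.present` label commonly set by the NVIDIA GPU Operator:

```shell
# NVIDIA driver version (must be at least 450.51)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Kernel version (5.8 or later is only needed for the optional advanced eBPF metrics)
uname -r

# On Kubernetes: list nodes with the GPU label commonly set by the NVIDIA GPU Operator
kubectl get nodes -L nvidia.com/gpu.present
```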
@@ -34,7 +43,7 @@ The following instructions are the basic steps to set up GPU Monitoring in the f
 : Enables GPU Monitoring.
 
 `gpu.privilegedMode: true`
-: _Optional_. Enables advanced eBPF metrics, such as GPU core utilization (`gpu.core.usage`).
+: _Optional_. Enables advanced eBPF metrics, such as GPU core utilization (`gpu.process.core.usage`).
 
 `gpu.patchCgroupPermissions: true`
 : _Only for GKE_. Enables a code path in `system-probe` that ensures the Agent can access GPU devices.
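For reference, a minimal `DatadogAgent` snippet wiring these parameters together might look like the following sketch; the placement under `spec.features.gpu` is an assumption inferred from the parameter names above and from the Agent Profile example later on this page:

```yaml
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  features:
    gpu:
      enabled: true
      privilegedMode: true         # Optional: advanced eBPF metrics (gpu.process.core.usage)
      patchCgroupPermissions: true # Only for GKE
```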
@@ -74,7 +83,7 @@ The following instructions are the basic steps to set up GPU Monitoring in the f
 : Enables GPU Monitoring.
 
 `gpuMonitoring.privilegedMode: true`
-: _Optional_. Enables advanced eBPF metrics, such as GPU core utilization (`gpu.core.usage`).
+: _Optional_. Enables advanced eBPF metrics, such as GPU core utilization (`gpu.process.core.usage`).
 
 `gpuMonitoring.configureCgroupPerms: true`
 : _Only for GKE_. Enables a code path in `system-probe` that ensures the Agent can access GPU devices.
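A rough sketch of the corresponding Helm values, assuming these keys live under the chart's top-level `datadog` section:

```yaml
datadog:
  gpuMonitoring:
    enabled: true
    privilegedMode: true       # Optional: advanced eBPF metrics (gpu.process.core.usage)
    configureCgroupPerms: true # Only for GKE
```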
@@ -226,7 +235,7 @@ gpu:
   enabled: true
 ```
 
-To enable advanced eBPF metrics, follow these steps:
+Additionally, to enable advanced eBPF-based metrics such as GPU core utilization (`gpu.process.core.usage`), follow these steps:
 
 1. If `/etc/datadog-agent/system-probe.yaml` does not exist, create it from `system-probe.yaml.example`:
 
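After completing the remaining steps in this section, a quick sanity check is to restart the Agent and look for the GPU check in its status output; a minimal sketch, assuming a systemd-based Linux host:

```shell
sudo systemctl restart datadog-agent

# Wait a few moments, then confirm the GPU check appears among the running checks
sudo datadog-agent status | grep -i gpu
```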
@@ -251,19 +260,106 @@ To enable advanced eBPF metrics, follow these steps:
 
 {{< /tabs >}}
 
-## Set up GPU Monitoring on a mixed cluster
+## Set up GPU Monitoring on a mixed Kubernetes cluster
+
+In a mixed Kubernetes cluster, some nodes have GPU devices while other nodes do not. Two separate DaemonSets are required (one for GPU nodes, which use a dedicated runtime class, and another for non-GPU nodes) because of the runtime class requirements of the NVIDIA device plugin for Kubernetes.
 
-In a mixed cluster, some nodes have GPU devices while other nodes do not.
+The recommended method to set up the Agent in this case is the Datadog Operator, version 1.20 or greater, which provides features that make this setup easier. However, for compatibility reasons, instructions are also provided for Helm installations and for older versions of the Datadog Operator.
 
 {{< tabs >}}
-{{% tab "Datadog Operator" %}}
+{{% tab "Datadog Operator (1.20 or greater)" %}}
+
 To set up GPU Monitoring on a mixed cluster with the Datadog Operator, use the Operator's [Agent Profiles][2] feature to selectively enable GPU Monitoring only on nodes with GPUs.
 
-1. Ensure that the [latest version of the Datadog Agent][4] is [installed and deployed][1] on every GPU host you wish to monitor.
+1. Configure the Datadog Operator to enable the Datadog Agent Profile feature in the DatadogAgentInternal mode.
+
+If the Datadog Operator was deployed with Helm directly, without a values file, the configuration can be toggled from the command line:
+
+```shell
+helm upgrade --set datadogAgentProfile.enabled=true --set datadogAgentInternal.enabled=true --set datadogCRDs.crds.datadogAgentProfiles=true --set datadogCRDs.crds.datadogAgentInternal=true <release-name> datadog/datadog-operator
+```
+
+If the Datadog Operator was deployed with a values file, the configuration can be toggled by adding the following settings to the values file:
+
+```yaml
+datadogAgentProfile:
+  enabled: true
+
+datadogAgentInternal:
+  enabled: true
+
+datadogCRDs:
+  crds:
+    datadogAgentProfiles: true
+    datadogAgentInternal: true
+```
+
+Then re-deploy the Datadog Operator with: `helm upgrade --install <release-name> datadog/datadog-operator -f datadog-operator.yaml`.
 
 2. Modify your `DatadogAgent` resource with the following changes:
 
+   1. Add the `agent.datadoghq.com/update-metadata` annotation to the `DatadogAgent` resource.
+   2. If you want advanced eBPF metrics, ensure at least one system-probe feature is enabled (for example, `npm`, `cws`, or `usm`). If none is enabled, you can enable the `oomKill` feature.
+
+The additions to the `datadog-agent.yaml` file should look like this:
+
+```yaml
+apiVersion: datadoghq.com/v2alpha1
+kind: DatadogAgent
+metadata:
+  name: datadog
+  annotations:
+    agent.datadoghq.com/update-metadata: "true" # Required for the Datadog Agent Internal mode to work.
+spec:
+  features:
+    oomKill:
+      # Only enable this feature if there is nothing else that requires the system-probe container in all Agent pods
+      # Examples of system-probe features are npm, cws, usm
+      enabled: true
+```
+
+3. Apply your changes to the `DatadogAgent` resource. These changes are safe to apply to all Datadog Agents, regardless of whether they run on GPU nodes.
+
+4. Create a [Datadog Agent Profile][2] that targets GPU nodes and enables GPU Monitoring on these targeted nodes.
+
+In the following example, the `profileNodeAffinity` selector is targeting nodes with the label [`nvidia.com/gpu.present=true`][3], because this label is commonly present on nodes with the NVIDIA GPU Operator. You may use another label if you wish.
+
+```yaml
+apiVersion: datadoghq.com/v1alpha1
+kind: DatadogAgentProfile
+metadata:
+  name: gpu-nodes
+spec:
+  profileAffinity:
+    profileNodeAffinity:
+      - key: nvidia.com/gpu.present
+        operator: In
+        values:
+          - "true"
+  config:
+    features:
+      gpu:
+        enabled: true
+        privilegedMode: true # Only for advanced eBPF metrics
+        patchCgroupPermissions: true # Only for GKE
 ```
+
+5. After you apply this new [Datadog Agent Profile][2], the Datadog Operator creates a new DaemonSet, `gpu-nodes-agent`.
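To confirm the profile took effect, a short sketch (the manifest filename and namespace are placeholders):

```shell
# Apply the profile, then check that it exists and that the extra DaemonSet was created
kubectl apply -f datadog-agent-profile.yaml
kubectl get datadogagentprofiles
kubectl get daemonsets -n <datadog-namespace> | grep gpu-nodes
```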
+
+[1]: /containers/kubernetes/installation/?tab=datadogoperator
+[2]: https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md
+[3]: http://nvidia.com/gpu.present
+[4]: https://github.com/DataDog/datadog-agent/releases
+
+{{% /tab %}}
+{{% tab "Datadog Operator (1.18 or 1.19)" %}}
+To set up GPU Monitoring on a mixed cluster with the Datadog Operator, use the Operator's [Agent Profiles][2] feature to selectively enable GPU Monitoring only on nodes with GPUs.
+
+1. Ensure that the [latest version of the Datadog Agent][4] is [installed and deployed][1] on every GPU host you wish to monitor.
+
+2. Modify your `DatadogAgent` resource with the following changes:
+
+```yaml
 spec:
   features:
     oomKill:
@@ -307,7 +403,7 @@ To set up GPU Monitoring on a mixed cluster with the Datadog Operator, use the O
 
 In the following example, the `profileNodeAffinity` selector is targeting nodes with the label [`nvidia.com/gpu.present=true`][3], because this label is commonly present on nodes with the NVIDIA GPU Operator. You may use another label if you wish.
 
-```
+```yaml
 apiVersion: datadoghq.com/v1alpha1
 kind: DatadogAgentProfile
 metadata:
@@ -383,7 +479,7 @@ To set up GPU Monitoring on a mixed cluster with Helm, create two different Helm
 : Enables GPU Monitoring.
 
 `gpuMonitoring.privilegedMode: true`
-: _Optional_. Enables advanced eBPF metrics, such as GPU core utilization (`gpu.core.usage`).
+: _Optional_. Enables advanced eBPF metrics, such as GPU core utilization (`gpu.process.core.usage`).
 
 `gpuMonitoring.configureCgroupPerms: true`
 : _Only for GKE_. Enables a code path in `system-probe` that ensures the Agent can access GPU devices.
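For the Helm release that targets the GPU nodes, the values might look roughly like this sketch; `agents.nodeSelector` and the `nvidia.com/gpu.present` label are assumptions, and the complementary release for non-GPU nodes would target the remaining nodes (for example, with a node affinity that excludes this label) and leave `gpuMonitoring` disabled:

```yaml
# Illustrative values for the GPU-node release of a mixed cluster
datadog:
  gpuMonitoring:
    enabled: true
    privilegedMode: true # Optional: advanced eBPF metrics
agents:
  nodeSelector:
    nvidia.com/gpu.present: "true"
```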
@@ -452,3 +548,5 @@ To set up GPU Monitoring on a mixed cluster with Helm, create two different Helm
 [4]: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
 [5]: https://github.com/DataDog/datadog-operator
 [6]: https://github.com/DataDog/helm-charts/blob/main/charts/datadog/README.md
+[7]: /containers/docker/
+[8]: /agent/supported_platforms/linux/
