author     Ilya Mashchenko <ilya@netdata.cloud>  2024-03-14 22:17:07 +0200
committer  GitHub <noreply@github.com>  2024-03-14 22:17:07 +0200
commit     382bc6c23b3c8c301292296525c3d1955cece71b (patch)
tree       0b2fce60d4ce9a80f3d63c8c98886eb0bcaaa3e4
parent     c56a123576b790528a9e4a030f1c0679b11c6665 (diff)
docs: add "With NVIDIA GPUs monitoring" to docker install (#17167)
-rw-r--r--  packaging/docker/README.md  37
-rw-r--r--  src/collectors/python.d.plugin/nvidia_smi/README.md  86
2 files changed, 42 insertions, 81 deletions
diff --git a/packaging/docker/README.md b/packaging/docker/README.md
index 2872d254a9..fbe5ba4332 100644
--- a/packaging/docker/README.md
+++ b/packaging/docker/README.md
@@ -172,6 +172,43 @@ Add `- /run/dbus:/run/dbus:ro` to the netdata service `volumes`.
</TabItem>
</Tabs>
+### With NVIDIA GPUs monitoring
+
+Monitoring NVIDIA GPUs requires:
+
+- Using the official [NVIDIA driver](https://www.nvidia.com/Download/index.aspx).
+- Installing [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) (a typical runtime setup is sketched after this list).
+- Allowing the Netdata container to access GPU resources.
+
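+After installing the toolkit, Docker usually needs to be configured to use the NVIDIA runtime. On most systems this amounts to the following commands (a typical setup sketch; follow the NVIDIA install guide linked above for your distribution):
+
+```bash
+# Register the NVIDIA runtime with Docker, then restart the daemon
+sudo nvidia-ctk runtime configure --runtime=docker
+sudo systemctl restart docker
+```
+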
+<Tabs>
+<TabItem value="docker_run" label="docker run">
+
+<h3>Using the <code>docker run</code> command</h3>
+
+Add `--gpus 'all,capabilities=utility'` to your `docker run` command.
+
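+A minimal sketch of such a command (the volumes and other options from the standard install are omitted for brevity):
+
+```bash
+# Minimal example; add the volumes and options from the standard install as needed
+docker run -d --name=netdata \
+  -p 19999:19999 \
+  --gpus 'all,capabilities=utility' \
+  netdata/netdata
+```
+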
+</TabItem>
+<TabItem value="docker compose" label="docker-compose">
+
+<h3>Using the <code>docker-compose</code> command</h3>
+
+Add the following to the netdata service definition:
+
+```yaml
+ deploy:
+ resources:
+ reservations:
+ devices:
+ - driver: nvidia
+ count: all
+ capabilities: [gpu]
+```
+
+</TabItem>
+</Tabs>
+
### With host-editable configuration
Use a [bind mount](https://docs.docker.com/storage/bind-mounts/) for `/etc/netdata` rather than a volume.
diff --git a/src/collectors/python.d.plugin/nvidia_smi/README.md b/src/collectors/python.d.plugin/nvidia_smi/README.md
index 534832809a..ac99b5dc0e 100644
--- a/src/collectors/python.d.plugin/nvidia_smi/README.md
+++ b/src/collectors/python.d.plugin/nvidia_smi/README.md
@@ -11,22 +11,18 @@ learn_rel_path: "Integrations/Monitor/Devices"
Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the `nvidia-smi` CLI tool.
-## Requirements and Notes
+## Requirements
-- You must have the `nvidia-smi` tool installed and your NVIDIA GPU(s) must support the tool. Mostly the newer high end models used for AI / ML and Crypto or Pro range, read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
-- You must enable this plugin, as its disabled by default due to minor performance issues:
+- The `nvidia-smi` tool installed, and NVIDIA GPU(s) that support it. Support mostly covers newer high-end models used for AI/ML and crypto workloads, and the Pro range; read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
+- Enable this plugin, as it's disabled by default due to minor performance issues:
```bash
cd /etc/netdata # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d.conf
```
Remove the '#' before nvidia_smi so it reads: `nvidia_smi: yes`.
-
- On some systems when the GPU is idle the `nvidia-smi` tool unloads and there is added latency again when it is next queried. If you are running GPUs under constant workload this isn't likely to be an issue.
-- Currently the `nvidia-smi` tool is being queried via cli. Updating the plugin to use the nvidia c/c++ API directly should resolve this issue. See discussion here: <https://github.com/netdata/netdata/pull/4357>
-- Contributions are welcome.
-- Make sure `netdata` user can execute `/usr/bin/nvidia-smi` or wherever your binary is.
-- If `nvidia-smi` process [is not killed after netdata restart](https://github.com/netdata/netdata/issues/7143) you need to off `loop_mode`.
-- `poll_seconds` is how often in seconds the tool is polled for as an integer.
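+
+The `netdata` user must also be able to execute the `nvidia-smi` binary. A quick check (assuming the tool is on the default `PATH`; adjust the path if your installation differs):
+
+```bash
+# Run nvidia-smi as the netdata user to confirm it is executable for the collector
+sudo -u netdata nvidia-smi
+```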
+
+If using Docker, see [Netdata Docker container with NVIDIA GPUs monitoring](https://github.com/netdata/netdata/tree/master/packaging/docker#with-nvidia-gpus-monitoring).
## Charts
@@ -83,75 +79,3 @@ Now you can manually run the `nvidia_smi` module in debug mode:
```bash
./python.d.plugin nvidia_smi debug trace
```
-
-## Docker
-
-GPU monitoring in a docker container is possible with [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system, and `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable.
-
-Sample `docker-compose.yml`
-```yaml
-version: '3'
-services:
- netdata:
- image: netdata/netdata
- container_name: netdata
- hostname: example.com # set to fqdn of host
- ports:
- - 19999:19999
- restart: unless-stopped
- cap_add:
- - SYS_PTRACE
- security_opt:
- - apparmor:unconfined
- environment:
- - NETDATA_EXTRA_APK_PACKAGES=gcompat
- volumes:
- - netdataconfig:/etc/netdata
- - netdatalib:/var/lib/netdata
- - netdatacache:/var/cache/netdata
- - /etc/passwd:/host/etc/passwd:ro
- - /etc/group:/host/etc/group:ro
- - /proc:/host/proc:ro
- - /sys:/host/sys:ro
- - /etc/os-release:/host/etc/os-release:ro
- deploy:
- resources:
- reservations:
- devices:
- - driver: nvidia
- count: all
- capabilities: [gpu]
-
-volumes:
- netdataconfig:
- netdatalib:
- netdatacache:
-```
-
-Sample `docker run`
-```yaml
-docker run -d --name=netdata \
- -p 19999:19999 \
- -e NETDATA_EXTRA_APK_PACKAGES=gcompat \
- -v netdataconfig:/etc/netdata \
- -v netdatalib:/var/lib/netdata \
- -v netdatacache:/var/cache/netdata \
- -v /etc/passwd:/host/etc/passwd:ro \
- -v /etc/group:/host/etc/group:ro \
- -v /proc:/host/proc:ro \
- -v /sys:/host/sys:ro \
- -v /etc/os-release:/host/etc/os-release:ro \
- --restart unless-stopped \
- --cap-add SYS_PTRACE \
- --security-opt apparmor=unconfined \
- --gpus all \
- netdata/netdata
-```
-
-### Docker Troubleshooting
-To troubleshoot `nvidia-smi` in a docker container, first confirm that `nvidia-smi` is working on the host system. If that is working correctly, run `docker exec -it netdata nvidia-smi` to confirm it's working within the docker container. If `nvidia-smi` is fuctioning both inside and outside of the container, confirm that `nvidia-smi: yes` is uncommented in `python.d.conf`.
-```bash
-docker exec -it netdata bash
-cd /etc/netdata
-./edit-config python.d.conf
-```