author Albin Suresh <albin.suresh@softwareag.com> 2022-05-25 18:48:52 +0530
committer Albin Suresh <albin.suresh@softwareag.com> 2022-05-26 16:42:45 +0530
commit e3775e430d3109081d3926ab4e7b13b05e1c2741
tree 1e25522055a7e502b4b6dafd1ee6d441dd413912 /docs
parent 7f75af001342075a102460a3e3f5792411ed3ec3
Fix tedge watchdog timeout misalignment with monitored services
Diffstat (limited to 'docs')
-rw-r--r-- docs/src/howto-guides/021_enable_tedge_watchdog_using_systemd.md 54
1 files changed, 32 insertions, 22 deletions
diff --git a/docs/src/howto-guides/021_enable_tedge_watchdog_using_systemd.md b/docs/src/howto-guides/021_enable_tedge_watchdog_using_systemd.md
index 50c39967..c7a188fa 100644
--- a/docs/src/howto-guides/021_enable_tedge_watchdog_using_systemd.md
+++ b/docs/src/howto-guides/021_enable_tedge_watchdog_using_systemd.md
@@ -2,24 +2,31 @@
## Introduction
-The systemd watchdog feature enables systemd to detect when a service is unhealthy or unresponsive and attempt to fix it by restarting that service.
+The systemd watchdog feature enables systemd to detect when a service is unhealthy or unresponsive and
+attempt to fix it by restarting that service.
To detect if a service is healthy or not, systemd relies on periodic health notifications from that service at regular intervals.
-If the service fails to send that notification within a time threshold, then systemd will assume that service to be unhealthy and restart it.
+If the service fails to send that notification within a time threshold,
+then systemd will assume that service to be unhealthy and restart it.
This document describes how the systemd watchdog mechanism can be enabled for thin-edge services.
-## Enabling the `watchdog` feature in `systemd`
+## Enabling the systemd watchdog feature for a tedge service
-Enabling systemd `watchdog` for a `thin-edge.io` service (tedge_agent, tedge_mapper_c8y/az/collectd)
-using the `systemd` is a two-step process.
+Enabling systemd watchdog for a `thin-edge.io` service (tedge-agent, tedge-mapper-c8y/az/collectd) is a two-step process.
-### Step 1: Enable the `watchdog` feature in the `systemd` service file
-For example to enable the `watchdog` feature for `tedge-mapper-c8y` service, update systemd service file as shown below.
+### Step 1: Enable the watchdog feature in the systemd service file
-Add `tedge-watchdog.service` in `After` under `[Unit]` section.
-Add `WatchdogSec=5` under `[Service]` section.
+For example, to enable the watchdog feature for `tedge-mapper-c8y` service,
+update the systemd service file as shown below:
-The sample service file after updating looks as below.
+> Note: The systemd service file for tedge services are usually present in `/lib/systemd/system` directory,
+> like `/lib/systemd/system/tedge-mapper-c8y.service`.
+
+Add `tedge-watchdog.service` as an `After` service dependency under `[Unit]` section.
+Add the watchdog interval as `WatchdogSec=30` under `[Service]` section.
+Update the restart condition as `Restart=always` under `[Service]` section.
+
+Here is the updated service file for `tedge-mapper-c8y` service:
```shell
[Unit]
@@ -29,19 +36,16 @@ After=syslog.target network.target mosquitto.service tedge-watchdog.service
[Service]
User=tedge-mapper
ExecStart=/usr/bin/tedge_mapper c8y
-Restart=on-failure
+Restart=always
RestartPreventExitStatus=255
-WatchdogSec=5
+WatchdogSec=30
```
-> Note: The systemd service file for tedge services are usually present
-in `/lib/systemd/system` directory, like `/lib/systemd/system/tedge-mapper-c8y.service`.
-
### Step 2: Start the `tedge-watchdog` service
The `tedge-watchdog` service is responsible for periodically checking the health of
-all tedge services for which the watchdog feature is enabled, and send systemd
-watchdog notifications on their behalf to systemd.
+all tedge services for which the watchdog feature is enabled,
+and send systemd watchdog notifications on their behalf to systemd.
Start and enable the `tedge-watchdog` service as follows:
@@ -50,16 +54,22 @@ systemctl start tedge-watchdog.service
systemctl enable tedge-watchdog.service
```
-Now, the `tedge-watchdog` service will be keep sending health check messages to the monitored services periodically within their configured `WatchdogSec` interval.
+Once started, the `tedge-watchdog` service will keep checking the health of the monitored tedge services
+by periodically sending health check messages to them within their configured `WatchdogSec` interval.
-The health check request for service is published to `tedge/health-check/<service-name>` topic and the health status response from that service is expected on `tedge/health/<service-name>` topic.
+The health check request for service is published to `tedge/health-check/<service-name>` topic and
+the health status response from that service is expected on `tedge/health/<service-name>` topic.
-Once the health status response is received from a particular service, the `tedge-watchdog` service will send the watchdog notification on behalf of that service to systemd.
+Once the health status response is received from a particular service,
+the `tedge-watchdog` service will send the [systemd notification](https://www.freedesktop.org/software/systemd/man/sd_notify.html#) to systemd on behalf of that monitored service.
## Debugging
-One can observe the message exchange between the `service` and the `watchdog` by subscribing to `tedge/health/#` and `tedge/health-check/#` topics.
+
+One can observe the message exchange between the `service` and the `watchdog`
+by subscribing to `tedge/health/#` and `tedge/health-check/#` topics.
For more info check [here](./020_monitor_tedge_health.md)
-> Note: If the watchdog service did not send the notification to the systemd within `WatchdogSec`, then the systemd will kill the existing service process and restarts it.
+> Note: If the watchdog service does not send the notification to the systemd within `WatchdogSec` interval for a service,
+> then systemd restarts that service by killing the old process and spawning a new one to replace it.
> Note: [Here](https://www.medo64.com/2019/01/systemd-watchdog-for-any-service/) is an example about using `systemd watchdog` feature.
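
As a reading aid for the topic scheme described in the patched document, here is a minimal sketch that derives the request and response topics for one service. The helper names `health_check_topic` and `health_status_topic` are hypothetical and not part of thin-edge.io. The sketch also assumes the `<service-name>` placeholder is filled with the systemd unit name without its `.service` suffix.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with thin-edge.io): build the two topic
# names that the document describes for a given service name.

# Topic on which tedge-watchdog publishes the health check request.
health_check_topic() {
    printf 'tedge/health-check/%s\n' "$1"
}

# Topic on which the monitored service is expected to answer.
health_status_topic() {
    printf 'tedge/health/%s\n' "$1"
}

# Assumption: the service name equals the unit name without ".service".
service="tedge-mapper-c8y"
echo "request topic:  $(health_check_topic "$service")"
echo "response topic: $(health_status_topic "$service")"

# To watch the real exchange, assuming a local MQTT broker and the mosquitto
# client tools are installed, one could subscribe to both wildcard topics:
#   mosquitto_sub -v -t 'tedge/health/#' -t 'tedge/health-check/#'
```

The commented `mosquitto_sub` line covers both wildcard topics named in the Debugging section, so requests and responses appear interleaved in one terminal.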