author | Albin Suresh <albin.suresh@softwareag.com> | 2022-05-25 18:48:52 +0530 |
---|---|---|
committer | Albin Suresh <albin.suresh@softwareag.com> | 2022-05-26 16:42:45 +0530 |
commit | e3775e430d3109081d3926ab4e7b13b05e1c2741 (patch) | |
tree | 1e25522055a7e502b4b6dafd1ee6d441dd413912 /docs | |
parent | 7f75af001342075a102460a3e3f5792411ed3ec3 (diff) |
Fix tedge watchdog timeout misalignment with monitored services
Diffstat (limited to 'docs')
-rw-r--r-- | docs/src/howto-guides/021_enable_tedge_watchdog_using_systemd.md | 54 |
1 files changed, 32 insertions, 22 deletions
diff --git a/docs/src/howto-guides/021_enable_tedge_watchdog_using_systemd.md b/docs/src/howto-guides/021_enable_tedge_watchdog_using_systemd.md
index 50c39967..c7a188fa 100644
--- a/docs/src/howto-guides/021_enable_tedge_watchdog_using_systemd.md
+++ b/docs/src/howto-guides/021_enable_tedge_watchdog_using_systemd.md
@@ -2,24 +2,31 @@
 
 ## Introduction
 
-The systemd watchdog feature enables systemd to detect when a service is unhealthy or unresponsive and attempt to fix it by restarting that service.
+The systemd watchdog feature enables systemd to detect when a service is unhealthy or unresponsive and
+attempt to fix it by restarting that service.
 To detect if a service is healthy or not, systemd relies on periodic health notifications from that service at regular intervals.
-If the service fails to send that notification within a time threshold, then systemd will assume that service to be unhealthy and restart it.
+If the service fails to send that notification within a time threshold,
+then systemd will assume that service to be unhealthy and restart it.
 
 This document describes how the systemd watchdog mechanism can be enabled for thin-edge services.
 
-## Enabling the `watchdog` feature in `systemd`
+## Enabling the systemd watchdog feature for a tedge service
 
-Enabling systemd `watchdog` for a `thin-edge.io` service (tedge_agent, tedge_mapper_c8y/az/collectd)
-using the `systemd` is a two-step process.
+Enabling systemd watchdog for a `thin-edge.io` service (tedge-agent, tedge-mapper-c8y/az/collectd) is a two-step process.
 
-### Step 1: Enable the `watchdog` feature in the `systemd` service file
-For example to enable the `watchdog` feature for `tedge-mapper-c8y` service, update systemd service file as shown below.
+### Step 1: Enable the watchdog feature in the systemd service file
 
-Add `tedge-watchdog.service` in `After` under `[Unit]` section.
-Add `WatchdogSec=5` under `[Service]` section.
+For example, to enable the watchdog feature for `tedge-mapper-c8y` service,
+update the systemd service file as shown below:
 
-The sample service file after updating looks as below.
+> Note: The systemd service file for tedge services are usually present in `/lib/systemd/system` directory,
+> like `/lib/systemd/system/tedge-mapper-c8y.service`.
+
+Add `tedge-watchdog.service` as an `After` service dependency under `[Unit]` section.
+Add the watchdog interval as `WatchdogSec=30` under `[Service]` section.
+Update the restart condition as `Restart=always` under `[Service]` section.
+
+Here is the updated service file for `tedge-mapper-c8y` service:
 
 ```shell
 [Unit]
@@ -29,19 +36,16 @@ After=syslog.target network.target mosquitto.service tedge-watchdog.service
 [Service]
 User=tedge-mapper
 ExecStart=/usr/bin/tedge_mapper c8y
-Restart=on-failure
+Restart=always
 RestartPreventExitStatus=255
-WatchdogSec=5
+WatchdogSec=30
 ```
 
-> Note: The systemd service file for tedge services are usually present
-in `/lib/systemd/system` directory, like `/lib/systemd/system/tedge-mapper-c8y.service`.
-
 ### Step 2: Start the `tedge-watchdog` service
 
 The `tedge-watchdog` service is responsible for periodically checking the health of
-all tedge services for which the watchdog feature is enabled, and send systemd
-watchdog notifications on their behalf to systemd.
+all tedge services for which the watchdog feature is enabled,
+and send systemd watchdog notifications on their behalf to systemd.
 
 Start and enable the `tedge-watchdog` service as follows:
 
@@ -50,16 +54,22 @@ systemctl start tedge-watchdog.service
 systemctl enable tedge-watchdog.service
 ```
 
-Now, the `tedge-watchdog` service will be keep sending health check messages to the monitored services periodically within their configured `WatchdogSec` interval.
+Once started, the `tedge-watchdog` service will keep checking the health of the monitored tedge services
+by periodically sending health check messages to them within their configured `WatchdogSec` interval.
 
-The health check request for service is published to `tedge/health-check/<service-name>` topic and the health status response from that service is expected on `tedge/health/<service-name>` topic.
+The health check request for service is published to `tedge/health-check/<service-name>` topic and
+the health status response from that service is expected on `tedge/health/<service-name>` topic.
 
-Once the health status response is received from a particular service, the `tedge-watchdog` service will send the watchdog notification on behalf of that service to systemd.
+Once the health status response is received from a particular service,
+the `tedge-watchdog` service will send the [systemd notification](https://www.freedesktop.org/software/systemd/man/sd_notify.html#) to systemd on behalf of that monitored service.
 
 ## Debugging
-One can observe the message exchange between the `service` and the `watchdog` by subscribing to `tedge/health/#` and `tedge/health-check/#` topics.
+
+One can observe the message exchange between the `service` and the `watchdog`
+by subscribing to `tedge/health/#` and `tedge/health-check/#` topics.
 For more info check [here](./020_monitor_tedge_health.md)
 
-> Note: If the watchdog service did not send the notification to the systemd within `WatchdogSec`, then the systemd will kill the existing service process and restarts it.
+> Note: If the watchdog service does not send the notification to the systemd within `WatchdogSec` interval for a service,
+> then systemd restarts that service by killing the old process and spawning a new one to replace it.
 
 > Note: [Here](https://www.medo64.com/2019/01/systemd-watchdog-for-any-service/) is an example about using `systemd watchdog` feature.
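
Reviewer note: the topic layout and the new `WatchdogSec=30` / `Restart=always` settings described in the patched guide could be spot-checked with a few small shell helpers. The sketch below is illustrative only. The function names are hypothetical and not part of thin-edge.io; it assumes `systemctl` is on the PATH, and it relies on `WatchdogUSec` and `Restart` being the unit properties that `systemctl show` reports for these settings.

```shell
# Sketch only: these helper names are hypothetical and not part of thin-edge.io.

# Topic on which the watchdog publishes a health check request for a service.
health_check_topic() {
    printf 'tedge/health-check/%s\n' "$1"
}

# Topic on which the health status response from that service is expected.
health_status_topic() {
    printf 'tedge/health/%s\n' "$1"
}

# Show the watchdog interval and restart policy systemd has loaded for a service.
# Assumes systemctl is on PATH; WatchdogUSec and Restart are systemd unit properties.
show_watchdog_settings() {
    systemctl show "$1.service" -p WatchdogUSec -p Restart
}
```

For example, assuming the mosquitto client tools are installed and the broker listens on its default local port, running `mosquitto_sub -v -t "$(health_status_topic '#')"` in one terminal and `mosquitto_pub -t "$(health_check_topic tedge-mapper-c8y)" -m ''` in another should show whether the mapper answers a health check, and `show_watchdog_settings tedge-mapper-c8y` should reflect the values from the edited service file after a `systemctl daemon-reload`.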