author Jelger Haanstra <jghaanstra@users.noreply.github.com> 2019-07-17 12:29:46 +0200
committer Chris Akritidis <43294513+cakrit@users.noreply.github.com> 2019-07-17 12:29:46 +0200
commit 0cf35dc906a97d8efc841abe331433066503a925
tree c8e5c2f8977f1b76bdfce4ca6477938888afda1b
parent 3b60a082cea349797f41b653387ed93bc7e82e35
Update docs health monitoring and health management api (#6435)
* Update docs health monitoring and health management api * Update docs health monitoring and health management api
-rw-r--r-- health/README.md 43
-rw-r--r-- web/api/health/README.md 42
2 files changed, 42 insertions, 43 deletions
diff --git a/health/README.md b/health/README.md
index 81cc043d0e..345f7fc70d 100644
--- a/health/README.md
+++ b/health/README.md
@@ -65,7 +65,7 @@ This line starts an alarm or alarm template.
alarm: NAME
```
-or
+or
```
template: NAME
@@ -161,7 +161,7 @@ The simple pattern syntax and operation is explained in [simple patterns](../lib
This line makes a database lookup to find a value. The result of this lookup is available as `$this`.
The format is:
-
+
```
lookup: METHOD AFTER [at BEFORE] [every DURATION] [OPTIONS] [of DIMENSIONS]
```
@@ -311,15 +311,15 @@ delay: [[[up U] [down D] multiplier M] max X]
notification for this event will be sent 10 seconds after the actual event. This is used in
hope the alarm will get back to its previous state within the duration given. The default `U`
is zero.
-
+
- `down D` defines the delay to be applied to a notification for an alarm that moves to lower
state (i.e. CRITICAL to WARNING, CRITICAL to CLEAR, WARNING to CLEAR). For example, `down 1m`
will delay the notification by 1 minute. This is used to prevent notifications for flapping
alarms. The default `D` is zero.
-
+
- `multiplier M` multiplies `U` and `D` when an alarm changes state, while a notification is
delayed. The default multiplier is `1.0`.
-
+
- `max X` defines the maximum absolute notification delay an alarm may get. The default `X`
is `max(U * M, D * M)` (i.e. the max duration of `U` or `D` multiplied once with `M`).
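The default for `max X` described above can be checked with a small calculation. This is only an illustrative sketch in Python: the function name `default_max_delay` is hypothetical and durations are assumed to be expressed in seconds; it is not Netdata's own code.

```python
def default_max_delay(up: float, down: float, multiplier: float = 1.0) -> float:
    """Default `max X`, in seconds: max(U * M, D * M), as stated in the README."""
    return max(up * multiplier, down * multiplier)


# Example: `delay: up 10s down 1m multiplier 1.5` with no explicit `max`
print(default_max_delay(up=10, down=60, multiplier=1.5))
```

With `U` = 10 seconds, `D` = 60 seconds and `M` = 1.5, the larger product is `D * M`, so the default cap would be 90 seconds.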
@@ -361,13 +361,13 @@ repeat: [off] [warning DURATION] [critical DURATION]
#### Alarm line `option`
-The only possible value for the `option` line is
+The only possible value for the `option` line is
```
option: no-clear-notification
```
-For some alarms we need to compare two time-frames to detect anomalies. For example, `health.d/httpcheck.conf` has an alarm template called `web_service_slow` that compares the average http call response time over the last 3 minutes to the average over the last hour. It triggers a warning alarm when the average of the last 3 minutes is twice the average of the last hour. In such cases, it is easy to trigger the alarm, but difficult to tell when the alarm is cleared. As time passes, the newest window moves into the older, so the average response time of the last hour will keep increasing. Eventually, the comparison will find the averages in the two time-frames close enough to clear the alarm. However, the issue was not resolved; it's just a matter of the newer data "polluting" the old. For such alarms, it's a good idea to tell Netdata to not clear the notification, by using the `no-clear-notification` option.
+For some alarms we need to compare two time-frames to detect anomalies. For example, `health.d/httpcheck.conf` has an alarm template called `web_service_slow` that compares the average http call response time over the last 3 minutes to the average over the last hour. It triggers a warning alarm when the average of the last 3 minutes is twice the average of the last hour. In such cases, it is easy to trigger the alarm, but difficult to tell when the alarm is cleared. As time passes, the newest window moves into the older, so the average response time of the last hour will keep increasing. Eventually, the comparison will find the averages in the two time-frames close enough to clear the alarm. However, the issue was not resolved; it's just a matter of the newer data "polluting" the old. For such alarms, it's a good idea to tell Netdata to not clear the notification, by using the `no-clear-notification` option.
---
@@ -417,14 +417,14 @@ crit: $this > (($status == $CRITICAL) ? (85) : (95))
The above says:
* If the alarm is currently a warning, then the threshold for being considered a warning
is 75, otherwise it's 85.
-
+
* If the alarm is currently critical, then the threshold for being considered critical
is 85, otherwise it's 95.
Which in turn, results in the following behavior:
* While the value is rising, it will trigger a warning when it exceeds 85, and a critical
alert when it exceeds 95.
-
+
* While the value is falling, it will return to a warning state when it goes below 85,
and a normal state when it goes below 75.
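The hysteresis described above can be replayed with a short simulation. This is a sketch under stated assumptions, not Netdata's evaluator: the numeric status values are illustrative (only their ordering matters), the warning test is assumed to be `$status >= $WARNING` because the `warn` line is not visible in this hunk, and the critical expression is assumed to take precedence over the warning one.

```python
# Illustrative status values; only the ordering CLEAR < WARNING < CRITICAL matters.
CLEAR, WARNING, CRITICAL = 1, 2, 3


def next_status(value: float, status: int) -> int:
    """Evaluate the warn/crit expressions against the previous status."""
    # Assumed warn line: $this > (($status >= $WARNING) ? (75) : (85))
    warn_threshold = 75 if status >= WARNING else 85
    # crit line from the README: $this > (($status == $CRITICAL) ? (85) : (95))
    crit_threshold = 85 if status == CRITICAL else 95
    if value > crit_threshold:
        return CRITICAL
    if value > warn_threshold:
        return WARNING
    return CLEAR


def replay(values, status=CLEAR):
    """Feed a series of values through the alarm and record each status."""
    history = []
    for value in values:
        status = next_status(value, status)
        history.append(status)
    return history


# Rising to 96, then falling back below 75.
print(replay([80, 86, 90, 96, 90, 84, 76, 74]))
```

In this replay the alarm stays critical at 90 on the way down (it only became critical above 95 on the way up) and stays in warning at 76, which is the flapping protection the dynamic thresholds are meant to provide.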
@@ -442,13 +442,13 @@ Which in turn, results in the following behavior:
You can find all the variables that can be used for a given chart, using
`http://your.netdata.ip:19999/api/v1/alarm_variables?chart=CHART_NAME`
Example: [variables for the `system.cpu` chart of the registry](https://registry.my-netdata.io/api/v1/alarm_variables?chart=system.cpu).
-
+
_Hint: If you don't know how to find the CHART_NAME, you can read about it [here](../docs/Charts.md#charts)._
-Netdata supports 3 internal indexes for variables that will be used in health monitoring.
+Netdata supports 3 internal indexes for variables that will be used in health monitoring.
<details markdown="1"><summary>The variables below can be used in both chart alarms and context templates.</summary>
-Although the `alarm_variables` link shows you variables for a particular chart, the same variables can also be used in templates for charts belonging to the same [context](../docs/Charts.md#contexts). The reason is that all charts of a given context are essentially identical, with the only difference being the [family](../docs/Charts.md#families) that identifies a particular hardware or software instance. Charts and templates do not apply to specific families anyway, unless you explicitly limit an alarm with the [alarm line `families`](#alarm-line-families).
+Although the `alarm_variables` link shows you variables for a particular chart, the same variables can also be used in templates for charts belonging to the same [context](../docs/Charts.md#contexts). The reason is that all charts of a given context are essentially identical, with the only difference being the [family](../docs/Charts.md#families) that identifies a particular hardware or software instance. Charts and templates do not apply to specific families anyway, unless you explicitly limit an alarm with the [alarm line `families`](#alarm-line-families).
</details>
- **chart local variables**. All the dimensions of the chart are exposed as local variables. The value of $this for the other configured alarms of the chart also appears, under the name of each configured alarm.
@@ -478,13 +478,13 @@ Although the `alarm_variables` link shows you variables for a particular chart,
- **special variables** are:
- `$this`, which is resolved to the value of the current alarm.
-
+
- `$status`, which is resolved to the current status of the alarm (the current = the last
status, i.e. before the current database lookup and the evaluation of the `calc` line).
This value can be compared with `$REMOVED`, `$UNINITIALIZED`, `$UNDEFINED`, `$CLEAR`,
`$WARNING`, `$CRITICAL`. These values are incremental, i.e. `$status > $CLEAR` works as
expected.
-
+
- `$now`, which is resolved to current unix timestamp.
## Alarm Statuses
@@ -493,16 +493,16 @@ Alarms can have the following statuses:
- `REMOVED` - the alarm has been deleted (this happens when a SIGUSR2 is sent to netdata
to reload health configuration)
-
+
- `UNINITIALIZED` - the alarm is not initialized yet
-
+
- `UNDEFINED` - the alarm failed to be calculated (i.e. the database lookup failed,
a division by zero occurred, etc)
-
+
- `CLEAR` - the alarm is not armed / raised (i.e. is OK)
-
+
- `WARNING` - the warning expression resulted in true or non-zero
-
+
- `CRITICAL` - the critical expression resulted in true or non-zero
The external script will be called for all status changes.
@@ -675,9 +675,6 @@ You can find how netdata interpreted the expressions by examining the alarm at `
## Disabling health checks or silencing notifications at runtime
-The health checks can be controlled at runtime via the [health management api](../web/api/health/#health-management-api).
+It's currently not possible to schedule notifications from within the alarm template. For those scenarios where you need to temporarily disable notifications (for instance when running backups triggers a disk alert) you can disable or silence notifications at runtime. The health checks can be controlled at runtime via the [health management api](../web/api/health/#health-management-api).
[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]()
-
-
-
diff --git a/web/api/health/README.md b/web/api/health/README.md
index 58ef20cb76..0b4f79f387 100644
--- a/web/api/health/README.md
+++ b/web/api/health/README.md
@@ -50,7 +50,7 @@ From Netdata v1.16.0 and beyond, the configuration controlled via the API comman
Specifically, the API allows you to:
- Disable health checks completely. Alarm conditions will not be evaluated at all and no entries will be added to the alarm log.
- Silence alarm notifications. Alarm conditions will be evaluated, the alarms will appear in the log and the netdata UI will show the alarms as active, but no notifications will be sent.
- - Disable or Silence specific alarms that match selectors on alarm/template name, chart, context, host and family.
+ - Disable or Silence specific alarms that match selectors on alarm/template name, chart, context, host and family.
The API is available by default, but it is protected by an `api authorization token` that is stored in the file you will see in the following entry of `http://localhost:19999/netdata.conf`:
@@ -59,13 +59,15 @@ The API is available by default, but it is protected by an `api authorization to
# netdata management api key file = /var/lib/netdata/netdata.api.key
```
-You can access the API via GET requests, by adding the token to an `X-Auth-Token` http header, like this:
+You can access the API via GET requests, by adding the token to an `X-Auth-Token` http header, like this:
```
-curl "http://myserver/api/v1/manage/health?cmd=RESET" -H "X-Auth-Token: Mytoken"
+curl "http://myserver/api/v1/manage/health?cmd=RESET" -H "X-Auth-Token: Mytoken"
```
-The command `RESET` just returns netdata to the default operation, with all health checks and notifications enabled.
+By default access to the health management API is only allowed from `localhost`. Accessing the API from anything else will return a 403 error with the message `You are not allowed to access this resource.`. You can change permissions by editing the `allow management from` variable in netdata.conf within the [web] section. See [web server access lists](../../server/#access-lists) for more information.
+
+The command `RESET` just returns netdata to the default operation, with all health checks and notifications enabled.
If you've configured and entered your token correctly, you should see the plain text response `All health checks and notifications are enabled`.
### Disable or silence all alarms
@@ -73,14 +75,14 @@ If you've configured and entered your token correctly, you should see the plain
If all you need is to temporarily disable all health checks, then issue the following before your maintenance period starts:
```
-curl "http://myserver/api/v1/manage/health?cmd=DISABLE ALL" -H "X-Auth-Token: Mytoken"
+curl "http://myserver/api/v1/manage/health?cmd=DISABLE ALL" -H "X-Auth-Token: Mytoken"
```
The effect of disabling health checks is that the alarm criteria are not evaluated at all and nothing is written in the alarm log.
If you want the health checks to be running but to not receive any notifications during your maintenance period, you can instead use this:
```
-curl "http://myserver/api/v1/manage/health?cmd=SILENCE ALL" -H "X-Auth-Token: Mytoken"
+curl "http://myserver/api/v1/manage/health?cmd=SILENCE ALL" -H "X-Auth-Token: Mytoken"
```
Alarms may then still be raised and logged in netdata, so you'll be able to see them via the UI.
@@ -88,44 +90,44 @@ Alarms may then still be raised and logged in netdata, so you'll be able to see
Regardless of the option you choose, at the end of your maintenance period you revert to the normal state via the RESET command.
```
- curl "http://myserver/api/v1/manage/health?cmd=RESET" -H "X-Auth-Token: Mytoken"
+ curl "http://myserver/api/v1/manage/health?cmd=RESET" -H "X-Auth-Token: Mytoken"
```
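For scripted maintenance windows, the same requests can be built programmatically. The following is a minimal sketch in Python using only the standard library; the helper name `build_health_request` is hypothetical, `myserver` and `Mytoken` are the placeholders from the curl examples, and percent-encoding the space in `SILENCE ALL` as `%20` is an assumption about what the server accepts rather than something the README states.

```python
import urllib.parse
import urllib.request


def build_health_request(base_url: str, token: str, cmd: str, **criteria: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the health management API."""
    params = {"cmd": cmd, **criteria}
    # quote_via=quote encodes spaces as %20 instead of '+'.
    query = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
    url = f"{base_url}/api/v1/manage/health?{query}"
    return urllib.request.Request(url, headers={"X-Auth-Token": token})


req = build_health_request("http://myserver", "Mytoken", "SILENCE ALL")
print(req.full_url)

# To actually send it against a reachable Netdata instance:
# print(urllib.request.urlopen(req).read().decode())
```

Building the request separately from sending it makes it easy to log or inspect the exact URL before a maintenance script runs it.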
### Disable or silence specific alarms
-If you do not wish to disable/silence all alarms, then the `DISABLE ALL` and `SILENCE ALL` commands can't be used.
+If you do not wish to disable/silence all alarms, then the `DISABLE ALL` and `SILENCE ALL` commands can't be used.
Instead, the following commands expect that one or more alarm selectors will be added, so that only alarms that match the selectors are disabled or silenced.
-- `DISABLE` : Set the mode to disable health checks.
-- `SILENCE` : Set the mode to silence notifications.
+- `DISABLE` : Set the mode to disable health checks.
+- `SILENCE` : Set the mode to silence notifications.
-You will normally put one of these commands in the same request with your first alarm selector, but it's possible to issue them separately as well.
-You will get a warning in the response, if a selector was added without a SILENCE/DISABLE command, or vice versa.
+You will normally put one of these commands in the same request with your first alarm selector, but it's possible to issue them separately as well.
+You will get a warning in the response, if a selector was added without a SILENCE/DISABLE command, or vice versa.
-Each request can specify a single alarm `selector`, with one or more `selection criteria`.
-A single alarm will match a `selector` if all selection criteria match the alarm.
+Each request can specify a single alarm `selector`, with one or more `selection criteria`.
+A single alarm will match a `selector` if all selection criteria match the alarm.
You can add as many selectors as you like.
In essence, the rule is: IF (alarm matches all the criteria in selector1 OR all the criteria in selector2 OR ...) THEN apply the DISABLE or SILENCE command.
To clear all selectors and reset the mode to default, use the `RESET` command.
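The matching rule above (any selector, all of its criteria) can be sketched as follows. This is an illustration only: Python's `fnmatchcase` is used as a stand-in for Netdata simple patterns, which have their own syntax, and the alarm names and hosts are hypothetical.

```python
from fnmatch import fnmatchcase


def selector_matches(selector: dict, alarm: dict) -> bool:
    """A selector matches only if ALL of its criteria match the alarm."""
    return all(
        fnmatchcase(alarm.get(key, ""), pattern)
        for key, pattern in selector.items()
    )


def command_applies(selectors: list, alarm: dict) -> bool:
    """The DISABLE/SILENCE command applies if ANY selector matches."""
    return any(selector_matches(selector, alarm) for selector in selectors)


# Two selectors, as if added by two separate requests (hypothetical values).
selectors = [
    {"context": "load"},
    {"alarm": "disk_*", "hosts": "web*"},
]

load_alarm = {"alarm": "load_average_15", "context": "load", "hosts": "db1"}
disk_on_db = {"alarm": "disk_space_usage", "context": "disk.space", "hosts": "db1"}
disk_on_web = {"alarm": "disk_space_usage", "context": "disk.space", "hosts": "web1"}

for alarm in (load_alarm, disk_on_db, disk_on_web):
    print(alarm["alarm"], alarm["hosts"], command_applies(selectors, alarm))
```

Here the disk alarm on `db1` is left alone because the second selector requires both the alarm name and the host to match.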
-The following example silences notifications for all the alarms with context=load:
+The following example silences notifications for all the alarms with context=load:
```
-curl "http://myserver/api/v1/manage/health?cmd=SILENCE&context=load" -H "X-Auth-Token: Mytoken"
+curl "http://myserver/api/v1/manage/health?cmd=SILENCE&context=load" -H "X-Auth-Token: Mytoken"
```
-#### Selection criteria
+#### Selection criteria
-The `selection criteria` are key/value pairs, in the format `key : value`, where value is a netdata [simple pattern](../../../libnetdata/simple_pattern/). This means that you can create very powerful selectors (you will rarely need more than one or two).
+The `selection criteria` are key/value pairs, in the format `key : value`, where value is a netdata [simple pattern](../../../libnetdata/simple_pattern/). This means that you can create very powerful selectors (you will rarely need more than one or two).
The accepted keys for the `selection criteria` are the following:
-- `alarm` : The expression provided will match both `alarm` and `template` names.
+- `alarm` : The expression provided will match both `alarm` and `template` names.
- `chart` : Chart ids/names, as shown on the dashboard. These will match the `on` entry of a configured `alarm`.
- `context` : Chart context, as shown on the dashboard. These will match the `on` entry of a configured `template`.
- `hosts` : The hostnames that will need to match.
- `families` : The alarm families.
-You can add any of the selection criteria you need on the request, to ensure that only the alarms you are interested in are matched and disabled/silenced. e.g. there is no reason to add `hosts: *`, if you want the criteria to be applied to alarms for all hosts.
+You can add any of the selection criteria you need on the request, to ensure that only the alarms you are interested in are matched and disabled/silenced. e.g. there is no reason to add `hosts: *`, if you want the criteria to be applied to alarms for all hosts.
Example 1: Disable all health checks for context = `random`