author Andrew Maguire <andrewm4894@gmail.com> 2022-06-21 10:19:58 +0100
committer GitHub <noreply@github.com> 2022-06-21 12:19:58 +0300
commit 73f803fbc8900d0b004a87369983d997b67c094d
tree 691c0075d84c8b9fda33351e269f69c78ae372f2 /health
parent 03de79ed1a0b9d7126f05b113b5de7d026ccf232
Add ml alerts examples (#13173)
* add ml alarm examples
* Update Makefile.am
* add hyperlinks and node level AR example
Diffstat (limited to 'health')
-rw-r--r-- health/Makefile.am 1
-rw-r--r-- health/REFERENCE.md 62
-rw-r--r-- health/health.d/ml.conf 36
3 files changed, 99 insertions, 0 deletions
diff --git a/health/Makefile.am b/health/Makefile.am
index d5eb884688..777b35858b 100644
--- a/health/Makefile.am
+++ b/health/Makefile.am
@@ -61,6 +61,7 @@ dist_healthconfig_DATA = \
health.d/megacli.conf \
health.d/memcached.conf \
health.d/memory.conf \
+ health.d/ml.conf \
health.d/mysql.conf \
health.d/net.conf \
health.d/netfilter.conf \
diff --git a/health/REFERENCE.md b/health/REFERENCE.md
index 3c1e53b2a3..d1af747676 100644
--- a/health/REFERENCE.md
+++ b/health/REFERENCE.md
@@ -895,6 +895,68 @@ lookup: mean -10s of user
Since [`z = (x - mean) / stddev`](https://en.wikipedia.org/wiki/Standard_score) we create two input alarms, one for `mean` and one for `stddev` and then use them both as inputs in our final `cpu_user_zscore` alarm.
+### Example 8 - [Anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) based CPU dimensions alarm
+
+Warning if 5 minute rolling [anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) for any CPU dimension is above 5%, critical if it goes above 20%:
+
+```yaml
+template: ml_5min_cpu_dims
+ on: system.cpu
+ os: linux
+ hosts: *
+ lookup: average -5m anomaly-bit foreach *
+ calc: $this
+ units: %
+ every: 30s
+ warn: $this > (($status >= $WARNING) ? (5) : (20))
+ crit: $this > (($status == $CRITICAL) ? (20) : (100))
+ info: rolling 5min anomaly rate for each system.cpu dimension
+```
+
+The `lookup` line will calculate the average anomaly rate of each `system.cpu` dimension over the last 5 minutes. In this case
+Netdata will create alarms for all dimensions of the chart.
+
+### Example 9 - [Anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) based CPU chart alarm
+
+Warning if 5 minute rolling [anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) averaged across all CPU dimensions is above 5%, critical if it goes above 20%:
+
+```yaml
+template: ml_5min_cpu_chart
+ on: system.cpu
+ os: linux
+ hosts: *
+ lookup: average -5m anomaly-bit of *
+ calc: $this
+ units: %
+ every: 30s
+ warn: $this > (($status >= $WARNING) ? (5) : (20))
+ crit: $this > (($status == $CRITICAL) ? (20) : (100))
+ info: rolling 5min anomaly rate for system.cpu chart
+```
+
+The `lookup` line will calculate the average anomaly rate across all `system.cpu` dimensions over the last 5 minutes. In this case
+Netdata will create one alarm for the chart.
+
+### Example 10 - [Anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) based node level alarm
+
+Warning if 5 minute rolling [anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) averaged across all ML enabled dimensions is above 5%, critical if it goes above 20%:
+
+```yaml
+template: ml_5min_node
+ on: anomaly_detection.anomaly_rate
+ os: linux
+ hosts: *
+ lookup: average -5m of anomaly_rate
+ calc: $this
+ units: %
+ every: 30s
+ warn: $this > (($status >= $WARNING) ? (5) : (20))
+ crit: $this > (($status == $CRITICAL) ? (20) : (100))
+ info: rolling 5min anomaly rate for all ML enabled dims
+```
+
+The `lookup` line will use the `anomaly_rate` dimension of the `anomaly_detection.anomaly_rate` ML chart to calculate the average [node level anomaly rate](https://learn.netdata.cloud/docs/agent/ml#node-anomaly-rate) over the last 5 minutes.
+
## Troubleshooting
You can compile Netdata with [debugging](/daemon/README.md#debugging) and then set in `netdata.conf`:
diff --git a/health/health.d/ml.conf b/health/health.d/ml.conf
new file mode 100644
index 0000000000..9bcc81e76b
--- /dev/null
+++ b/health/health.d/ml.conf
@@ -0,0 +1,36 @@
+# below are some examples of using the `anomaly-bit` option to define alerts based on anomaly
+# rates as opposed to raw metric values. You can read more about the anomaly-bit and Netdata's
+# native anomaly detection here:
+# https://learn.netdata.cloud/docs/configure/machine-learning#anomaly-bit---100--anomalous-0--normal
+
+# examples below are commented, you would need to uncomment and adjust as desired to enable them.
+
+# alert per dimension example
+# if anomaly rate is between 5-20% then warning (pick your own threshold that works best via trial and error).
+# if anomaly rate is above 20% then critical (pick your own threshold that works best via trial and error).
+# template: ml_5min_cpu_dims
+# on: system.cpu
+# os: linux
+# hosts: *
+# lookup: average -5m anomaly-bit foreach *
+# calc: $this
+# units: %
+# every: 30s
+# warn: $this > (($status >= $WARNING) ? (5) : (20))
+# crit: $this > (($status == $CRITICAL) ? (20) : (100))
+# info: rolling 5min anomaly rate for each system.cpu dimension
+
+# alert per chart example
+# if anomaly rate is between 5-20% then warning (pick your own threshold that works best via trial and error).
+# if anomaly rate is above 20% then critical (pick your own threshold that works best via trial and error).
+# template: ml_5min_cpu_chart
+# on: system.cpu
+# os: linux
+# hosts: *
+# lookup: average -5m anomaly-bit of *
+# calc: $this
+# units: %
+# every: 30s
+# warn: $this > (($status >= $WARNING) ? (5) : (20))
+# crit: $this > (($status == $CRITICAL) ? (20) : (100))
+# info: rolling 5min anomaly rate for system.cpu chart
\ No newline at end of file
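
Review note: the `warn` and `crit` expressions in these templates pick their threshold from the alarm's current `$status`, which can be hard to read at a glance. The Python sketch below models that evaluation so the thresholds can be tried against sample values. It is an illustration under stated assumptions, not Netdata's implementation: the `Status` values, `anomaly_rate`, and `next_status` are hypothetical names; only the ordering CLEAR < WARNING < CRITICAL is relied on, and the sketch assumes that a true `crit` expression takes precedence over a true `warn` expression.

```python
from enum import IntEnum


class Status(IntEnum):
    # Hypothetical numeric values; only the ordering matters for this sketch.
    CLEAR = 1
    WARNING = 2
    CRITICAL = 3


def anomaly_rate(bits):
    """Average a window of anomaly bits (100 = anomalous, 0 = normal).

    The result is a percentage, which is what `lookup: average -5m anomaly-bit`
    is described as producing in the examples above.
    """
    return sum(bits) / len(bits)


def next_status(value, status):
    """Model one evaluation of the warn/crit lines from the templates.

    warn: $this > (($status >= $WARNING) ? (5) : (20))
    crit: $this > (($status == $CRITICAL) ? (20) : (100))
    """
    warn_threshold = 5 if status >= Status.WARNING else 20
    crit_threshold = 20 if status == Status.CRITICAL else 100

    # Assumption: critical wins when both expressions are true.
    if value > crit_threshold:
        return Status.CRITICAL
    if value > warn_threshold:
        return Status.WARNING
    return Status.CLEAR


if __name__ == "__main__":
    status = Status.CLEAR
    for value in [10, 25, 10, 4, 100]:
        status = next_status(value, status)
        print(f"anomaly rate {value:>3}% -> {status.name}")
```

Under this model the alarm enters WARNING only once the rate exceeds 20% and stays there until it falls to 5% or below. Because the rate is a percentage, it never exceeds the 100 threshold, so the sketch never enters CRITICAL from a non-critical state. Whether that matches the agent's actual behaviour and the intent stated in the example descriptions is worth checking against the health reference before relying on these thresholds.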