summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorJoel Hans <joel@netdata.cloud>2021-01-07 12:15:27 -0700
committerGitHub <noreply@github.com>2021-01-07 12:15:27 -0700
commitc4941b49274d82c2d9ee535a7fdaf932da2ebbf7 (patch)
tree9ab50f416a599e0bb0474277f8c4338d5a4c69f1
parent46a8075c8ff6da974d1b39f5b3b9eb50385892c1 (diff)
Add guide: Detect anomalies in nodes and applications with Netdata (#10451)
* Init new anomalies guide * Continuing work * Finish draft of part 1 * Tweak to opening * Change headline * Copyediting and fixes * Fix link * Remove with Netdata from title * Clean up frontmatter * Update docs/guides/monitor/anomaly-detection.md Co-authored-by: Andrew Maguire <andrewm4894@gmail.com> * Update docs/guides/monitor/anomaly-detection.md Co-authored-by: Andrew Maguire <andrewm4894@gmail.com> * Fixes for Jen * Fixes and improvements for Andy Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
-rw-r--r--collectors/python.d.plugin/anomalies/README.md8
-rw-r--r--docs/guides/monitor/anomaly-detection.md188
2 files changed, 193 insertions, 3 deletions
diff --git a/collectors/python.d.plugin/anomalies/README.md b/collectors/python.d.plugin/anomalies/README.md
index 487a11a095..e074be6f52 100644
--- a/collectors/python.d.plugin/anomalies/README.md
+++ b/collectors/python.d.plugin/anomalies/README.md
@@ -1,15 +1,17 @@
<!--
-title: "Anomalies"
+title: "Anomaly detection with Netdata"
+description: "Use ML-driven anomaly detection to narrow your focus to only affected metrics and services/processes on your node to shorten root cause analysis."
custom_edit_url: https://github.com/netdata/netdata/edit/master/collectors/python.d.plugin/anomalies/README.md
+sidebar_url: anomalies
-->
-# Anomalies: ML-driven anomaly detection for the Netdata Agent
+# Anomaly detection with Netdata
This collector uses the Python [PyOD](https://pyod.readthedocs.io/en/latest/index.html) library to perform unsupervised [anomaly detection](https://en.wikipedia.org/wiki/Anomaly_detection) on your Netdata charts and/or dimensions.
Instead of this collector just _collecting_ data, it also does some computation on the data it collects to return an anomaly probability and anomaly flag for each chart or custom model you define. This computation consists of a **train** function that runs every `train_n_secs` to train the ML models to learn what 'normal' typically looks like on your node. At each iteration there is also a **predict** function that uses the latest trained models and most recent metrics to produce an anomaly probability and anomaly flag for each chart or custom model you define.
-**Note**: As this is a somewhat unique collector and involves often subjective concepts like anomalies and anomaly probabilities, we would love to hear any feedback on it from the community. Please let us know on the [community forum](https://community.netdata.cloud/c/agent-development/9) or drop us a note at [analytics-ml-team@netdata.cloud](mailto:analytics-ml-team@netdata.cloud) for any and all feedback, both positive and negative. This sort of feedback is priceless to help us make complex features more useful.
+> As this is a somewhat unique collector and involves often subjective concepts like anomalies and anomaly probabilities, we would love to hear any feedback on it from the community. Please let us know on the [community forum](https://community.netdata.cloud/c/agent-development/9) or drop us a note at [analytics-ml-team@netdata.cloud](mailto:analytics-ml-team@netdata.cloud) for any and all feedback, both positive and negative. This sort of feedback is priceless to help us make complex features more useful.
## Charts
diff --git a/docs/guides/monitor/anomaly-detection.md b/docs/guides/monitor/anomaly-detection.md
new file mode 100644
index 0000000000..3b5a0437bc
--- /dev/null
+++ b/docs/guides/monitor/anomaly-detection.md
@@ -0,0 +1,188 @@
+<!--
+title: "Detect anomalies in systems and applications"
+description: "Detect anomalies in any system, container, or application in your infrastructure with machine learning and the open-source Netdata Agent."
+image: /img/seo/guides/monitor/anomaly-detection.png
+custom_edit_url: https://github.com/netdata/netdata/edit/master/docs/guides/monitor/anomaly-detection.md
+-->
+
+# Detect anomalies in systems and applications
+
+Beginning with v1.27, the [open-source Netdata Agent](https://github.com/netdata/netdata) is capable of unsupervised
+[anomaly detection](https://en.wikipedia.org/wiki/Anomaly_detection) with machine learning (ML). As with all things
+Netdata, the anomalies collector comes with preconfigured alarms and instant visualizations that require no query
+languages or organizing metrics. You configure the collector to look at specific charts, and it handles the rest.
+
+Netdata's implementation uses a handful of functions in the [Python Outlier Detection (PyOD)
+library](https://github.com/yzhao062/pyod/tree/master), which periodically runs a `train` function that learns what
+"normal" looks like on your node and creates an ML model for each chart, then utilizes the
+[`predict_proba()`](https://pyod.readthedocs.io/en/latest/api_cc.html#pyod.models.base.BaseDetector.predict_proba) and
+[`predict()`](https://pyod.readthedocs.io/en/latest/api_cc.html#pyod.models.base.BaseDetector.predict) PyOD functions to
+quantify how anomalous certain charts are.
+
+All these metrics and alarms are available for centralized monitoring in [Netdata Cloud](https://app.netdata.cloud). If
+you choose to sign up for Netdata Cloud and [claim your nodes](/claim/README.md), you will have the ability to run
+tailored anomaly detection on every node in your infrastructure, regardless of its purpose or workload.
+
+In this guide, you'll learn how to set up the anomalies collector to instantly detect anomalies in an Nginx web server
+and/or the node that hosts it, which will give you the tools to configure parallel unsupervised monitors for any
+application in your infrastructure. Let's get started.
+
+![Example anomaly detection with an Nginx web
+server](https://user-images.githubusercontent.com/1153921/103586700-da5b0a00-4ea2-11eb-944e-46edd3f83e3a.png)
+
+## Prerequisites
+
+- A node running the Netdata Agent. If you don't yet have that, [get Netdata](/docs/get/README.md).
+- A Netdata Cloud account. [Sign up](https://app.netdata.cloud) if you don't have one already.
+- Familiarity with configuring the Netdata Agent with [`edit-config`](/docs/configure/nodes.md).
+- _Optional_: An Nginx web server running on the same node to follow the example configuration steps.
+
+## Install required Python packages
+
+The anomalies collector uses a few Python packages, available with `pip3`, to run ML training. It requires
+[`numba`](http://numba.pydata.org/), [`scikit-learn`](https://scikit-learn.org/stable/),
+[`pyod`](https://pyod.readthedocs.io/en/latest/), in addition to
+[`netdata-pandas`](https://github.com/netdata/netdata-pandas), which is a package built by the Netdata team to pull data
+from a Netdata Agent's API into a [Pandas](https://pandas.pydata.org/). Read more about `netdata-pandas` on its [package
+repo](https://github.com/netdata/netdata-pandas) or in Netdata's [community
+repo](https://github.com/netdata/community/tree/main/netdata-agent-api/netdata-pandas).
+
+```bash
+# Become the netdata user
+sudo su -s /bin/bash netdata
+
+# Install required packages for the netdata user
+pip3 install --user netdata-pandas==0.0.32 numba==0.50.1 scikit-learn==0.23.2 pyod==0.8.3
+```
+
+> If the `pip3` command fails, you need to install it. For example, on an Ubuntu system, use `sudo apt install
+> python3-pip`.
+
+Use `exit` to become your normal user again.
+
+## Enable the anomalies collector
+
+Navigate to your [Netdata config directory](/docs/configure/nodes.md#the-netdata-config-directory) and use `edit-config`
+to open the `python.d.conf` file.
+
+```bash
+sudo ./edit-config python.d.conf
+```
+
+In `python.d.conf` file, search for the `anomalies` line. If the line exists, set the value to `yes`. Add the line
+yourself if it doesn't already exist. Either way, the final result should look like:
+
+```conf
+anomalies: yes
+```
+
+[Restart the Agent](/docs/configure/start-stop-restart.md) with `sudo systemctl restart netdata` to start up the
+anomalies collector. By default, the model training process runs every 30 minutes, and uses the previous 4 hours of
+metrics to establish a baseline for health and performance across the default included charts.
+
+> 💡 The anomaly collector may need 30-60 seconds to finish its initial training and have enough data to start
+> generating anomaly scores. You may need to refresh your browser tab for the **Anomalies** section to appear in menus
+> on both the local Agent dashboard or Netdata Cloud.
+
+## Configure the anomalies collector
+
+Open `python.d/anomalies.conf` with `edit-conf`.
+
+```bash
+sudo ./edit-config python.d/anomalies.conf
+```
+
+The file contains many user-configurable settings with sane defaults. Here are some important settings that don't
+involve tweaking the behavior of the ML training itself.
+
+- `charts_regex`: Which charts to train models for and run anomaly detection on, with each chart getting a separate
+ model.
+- `charts_to_exclude`: Specific charts, selected by the regex in `charts_regex`, to exclude.
+- `train_every_n`: How often to train the ML models.
+- `train_n_secs`: The number of historical observations to train each model on. The default is 4 hours, but if your node
+ doesn't have historical metrics going back that far, consider [changing the metrics retention
+ policy](/docs/store/change-metrics-storage.md) or reducing this window.
+- `custom_models`: A way to define custom models that you want anomaly probabilities for, including multi-node or
+ streaming setups. More on custom models in part 3 of this guide series.
+
+> ⚠️ Setting `charts_regex` with many charts or `train_n_secs` to a very large number will have an impact on the
+> resources and time required to train a model for every chart. The actual performance implications depend on the
+> resources available on your node. If you plan on changing these settings beyond the default, or what's mentioned in
+> this guide, make incremental changes to observe the performance impact. Considering `train_max_n` to cap the number of
+> observations actually used to train on.
+
+### Run anomaly detection on Nginx and log file metrics
+
+As mentioned above, this guide uses an Nginx web server to demonstrate how the anomalies collector works. You must
+configure the collector to monitor charts from the
+[Nginx](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/nginx) and [web
+log](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/weblog) collectors.
+
+`charts_regex` allows for some basic regex, such as wildcards (`*`) to match all contexts with a certain pattern. For
+example, `system\..*` matches with any chart wit ha context that begins with `system.`, and ends in any number of other
+characters (`.*`). Note the escape character (`\`) around the first period to capture a period character exactly, and
+not any character.
+
+Change `charts_regex` in `anomalies.conf` to the following:
+
+```conf
+ charts_regex: 'system\..*|nginx_local\..*|web_log_nginx\..*|apps.cpu|apps.mem'
+```
+
+This value tells the anomaly collector to train against every `system.` chart, every `nginx_local` chart, every
+`web_log_nginx` chart, and specifically the `apps.cpu` and `apps.mem` charts.
+
+![The anomalies collector chart with many
+dimensions](https://user-images.githubusercontent.com/1153921/102813877-db5e4880-4386-11eb-8040-d7a1d7a476bb.png)
+
+### Remove some metrics from anomaly detection
+
+As you can see in the above screenshot, this node is now looking for anomalies in many places. The result is a single
+`anomalies_local.probability` chart with more than twenty dimensions, some of which the dashboard hides at the bottom of
+a scroll-able area. In addition, training and analyzing the anomaly collector on many charts might require more CPU
+utilization that you're willing to give.
+
+First, explicitly declare which `system.` charts to monitor rather than of all of them using regex (`system\..*`).
+
+```conf
+ charts_regex: 'system\.cpu|system\.load|system\.io|system\.net|system\.ram|nginx_local\..*|web_log_nginx\..*|apps.cpu|apps.mem'
+```
+
+Next, remove some charts with the `charts_to_exclude` setting. For this example, using an Nginx web server, focus on the
+volume of requests/responses, not, for example, which type of 4xx response a user might receive.
+
+```conf
+ charts_to_exclude: 'web_log_nginx.excluded_requests,web_log_nginx.responses_by_status_code_class,web_log_nginx.status_code_class_2xx_responses,web_log_nginx.status_code_class_4xx_responses,web_log_nginx.current_poll_uniq_clients,web_log_nginx.requests_by_http_method,web_log_nginx.requests_by_http_version,web_log_nginx.requests_by_ip_proto'
+```
+
+![The anomalies collector with less
+dimensions](https://user-images.githubusercontent.com/1153921/102820642-d69f9180-4392-11eb-91c5-d3d166d40105.png)
+
+Apply the ideas behind the collector's regex and exclude settings to any other
+[system](/docs/collect/system-metrics.md), [container](/docs/collect/container-metrics.md), or
+[application](/docs/collect/application-metrics.md) metrics you want to detect anomalies for.
+
+## What's next?
+
+Now that you know how to set up unsupervised anomaly detection in the Netdata Agent, using an Nginx web server as an
+example, it's time to apply that knowledge to other mission-critical parts of your infrastructure. If you're not sure
+what to monitor next, check out our list of [collectors](/collectors/COLLECTORS.md) to see what kind of metrics Netdata
+can collect from your systems, containers, and applications.
+
+For a more user-friendly anomaly detection experience, try out the [Metric
+Correlations](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations) feature in Netdata Cloud. Metric
+Correlations runs only at your requests, removing unrelated charts from the dashboard to help you focus on root cause
+analysis.
+
+Stay tuned for the next two parts of this guide, which provide more real-world context for the anomalies collector.
+First, maximize the immediate value you get from anomaly detection by tracking preconfigured alarms, visualizing
+anomalies in charts, and building a new dashboard tailored to your applications. Then, learn about creating custom ML
+models, which help you holistically monitor an application or service by monitoring anomalies across a _cluster of
+charts_.
+
+### Related reference documentation
+
+- [Netdata Agent · Anomalies collector](/collectors/python.d.plugin/anomalies/README.md)
+- [Netdata Cloud · Metric Correlations](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations)
+
+[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fmonitor%2Fanomaly-detectionl&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>)