From 9f1403de7d3ea2633768d34095afcf880c7c4e2d Mon Sep 17 00:00:00 2001
From: Tasos Katsoulas <12612986+tkatsoulas@users.noreply.github.com>
Date: Thu, 2 Feb 2023 15:23:54 +0200
Subject: Covert our documentation links to GH absolute links (#14344)

Signed-off-by: Tasos Katsoulas
---
 docs/guides/monitor/anomaly-detection-python.md | 36 ++--
 docs/guides/monitor/anomaly-detection.md | 16 +-
 docs/guides/monitor/dimension-templates.md | 37 ++--
 docs/guides/monitor/kubernetes-k8s-netdata.md | 28 +--
 docs/guides/monitor/lamp-stack.md | 42 ++--
 docs/guides/monitor/pi-hole-raspberry-pi.md | 26 +--
 docs/guides/monitor/process.md | 231 +++++++++++----------
 .../monitor/raspberry-pi-anomaly-detection.md | 22 +-
 docs/guides/monitor/statsd.md | 14 +-
 docs/guides/monitor/stop-notifications-alarms.md | 12 +-
 docs/guides/monitor/visualize-monitor-anomalies.md | 28 +--
 11 files changed, 254 insertions(+), 238 deletions(-)
(limited to 'docs/guides/monitor')

diff --git a/docs/guides/monitor/anomaly-detection-python.md b/docs/guides/monitor/anomaly-detection-python.md
index ad8398cc68..d6d27f4e57 100644
--- a/docs/guides/monitor/anomaly-detection-python.md
+++ b/docs/guides/monitor/anomaly-detection-python.md
@@ -23,7 +23,7 @@ library](https://github.com/yzhao062/pyod/tree/master), which periodically runs quantify how anomalous certain charts are. All these metrics and alarms are available for centralized monitoring in [Netdata Cloud](https://app.netdata.cloud). If
-you choose to sign up for Netdata Cloud and [connect your nodes](/claim/README.md), you will have the ability to run
+you choose to sign up for Netdata Cloud and [connect your nodes](https://github.com/netdata/netdata/blob/master/claim/README.md), you will have the ability to run
 tailored anomaly detection on every node in your infrastructure, regardless of its purpose or workload.
 In this guide, you'll learn how to set up the anomalies collector to instantly detect anomalies in an Nginx web server
@@ -35,9 +35,9 @@ server](https://user-images.githubusercontent.com/1153921/103586700-da5b0a00-4ea ## Prerequisites
-- A node running the Netdata Agent. If you don't yet have that, [get Netdata](/docs/get-started.mdx).
+- A node running the Netdata Agent. If you don't yet have that, [get Netdata](https://github.com/netdata/netdata/blob/master/docs/get-started.mdx).
 - A Netdata Cloud account. [Sign up](https://app.netdata.cloud) if you don't have one already.
-- Familiarity with configuring the Netdata Agent with [`edit-config`](/docs/configure/nodes.md).
+- Familiarity with configuring the Netdata Agent with [`edit-config`](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md).
 - _Optional_: An Nginx web server running on the same node to follow the example configuration steps. ## Install required Python packages
@@ -65,7 +65,7 @@ Use `exit` to become your normal user again. ## Enable the anomalies collector
-Navigate to your [Netdata config directory](/docs/configure/nodes.md#the-netdata-config-directory) and use `edit-config`
+Navigate to your [Netdata config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md#the-netdata-config-directory) and use `edit-config`
 to open the `python.d.conf` file. ```bash
@@ -79,8 +79,8 @@ yourself if it doesn't already exist. Either way, the final result should look l anomalies: yes ```
-[Restart the Agent](/docs/configure/start-stop-restart.md) with `sudo systemctl restart netdata`, or the [appropriate
-method](/docs/configure/start-stop-restart.md) for your system, to start up the anomalies collector. By default, the
+[Restart the Agent](https://github.com/netdata/netdata/blob/master/docs/configure/start-stop-restart.md) with `sudo systemctl restart netdata`, or the [appropriate
+method](https://github.com/netdata/netdata/blob/master/docs/configure/start-stop-restart.md) for your system, to start up the anomalies collector. By default, the
 model training process runs every 30 minutes, and uses the previous 4 hours of metrics to establish a baseline for health and performance across the default included charts.
@@ -105,7 +105,7 @@ involve tweaking the behavior of the ML training itself. - `train_every_n`: How often to train the ML models. - `train_n_secs`: The number of historical observations to train each model on. The default is 4 hours, but if your node doesn't have historical metrics going back that far, consider [changing the metrics retention
- policy](/docs/store/change-metrics-storage.md) or reducing this window.
+ policy](https://github.com/netdata/netdata/blob/master/docs/store/change-metrics-storage.md) or reducing this window.
 - `custom_models`: A way to define custom models that you want anomaly probabilities for, including multi-node or streaming setups.
@@ -119,8 +119,8 @@ involve tweaking the behavior of the ML training itself. As mentioned above, this guide uses an Nginx web server to demonstrate how the anomalies collector works. You must configure the collector to monitor charts from the
-[Nginx](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/nginx) and [web
-log](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/weblog) collectors.
+[Nginx](https://github.com/netdata/go.d.plugin/blob/master/modules/nginx/README.md) and [web
+log](https://github.com/netdata/go.d.plugin/blob/master/modules/weblog/README.md) collectors.
 `charts_regex` allows for some basic regex, such as wildcards (`*`) to match all contexts with a certain pattern.
 For example, `system\..*` matches with any chart with a context that begins with `system.`, and ends in any number of other
@@ -163,27 +163,27 @@ volume of requests/responses, not, for example, which type of 4xx response a use dimensions](https://user-images.githubusercontent.com/1153921/102820642-d69f9180-4392-11eb-91c5-d3d166d40105.png) Apply the ideas behind the collector's regex and exclude settings to any other
-[system](/docs/collect/system-metrics.md), [container](/docs/collect/container-metrics.md), or
-[application](/docs/collect/application-metrics.md) metrics you want to detect anomalies for.
+[system](https://github.com/netdata/netdata/blob/master/docs/collect/system-metrics.md), [container](https://github.com/netdata/netdata/blob/master/docs/collect/container-metrics.md), or
+[application](https://github.com/netdata/netdata/blob/master/docs/collect/application-metrics.md) metrics you want to detect anomalies for.
 ## What's next? Now that you know how to set up unsupervised anomaly detection in the Netdata Agent, using an Nginx web server as an example, it's time to apply that knowledge to other mission-critical parts of your infrastructure. If you're not sure
-what to monitor next, check out our list of [collectors](/collectors/COLLECTORS.md) to see what kind of metrics Netdata
+what to monitor next, check out our list of [collectors](https://github.com/netdata/netdata/blob/master/collectors/COLLECTORS.md) to see what kind of metrics Netdata
 can collect from your systems, containers, and applications.
-Keep on moving to [part 2](/docs/guides/monitor/visualize-monitor-anomalies.md), which covers the charts and alarms
+Keep on moving to [part 2](https://github.com/netdata/netdata/blob/master/docs/guides/monitor/visualize-monitor-anomalies.md), which covers the charts and alarms
 Netdata creates for unsupervised anomaly detection.
 For a different troubleshooting experience, try out the [Metric
-Correlations](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations) feature in Netdata Cloud. Metric
+Correlations](https://github.com/netdata/netdata/blob/master/docs/cloud/insights/metric-correlations.md) feature in Netdata Cloud. Metric
 Correlations helps you perform faster root cause analysis by narrowing a dashboard to only the charts most likely to be related to an anomaly. ### Related reference documentation
-- [Netdata Agent · Anomalies collector](/collectors/python.d.plugin/anomalies/README.md)
-- [Netdata Agent · Nginx collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/nginx)
-- [Netdata Agent · web log collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/weblog)
-- [Netdata Cloud · Metric Correlations](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations)
+- [Netdata Agent · Anomalies collector](https://github.com/netdata/netdata/blob/master/collectors/python.d.plugin/anomalies/README.md)
+- [Netdata Agent · Nginx collector](https://github.com/netdata/go.d.plugin/blob/master/modules/nginx/README.md)
+- [Netdata Agent · web log collector](https://github.com/netdata/go.d.plugin/blob/master/modules/weblog/README.md)
+- [Netdata Cloud · Metric Correlations](https://github.com/netdata/netdata/blob/master/docs/cloud/insights/metric-correlations.md)
diff --git a/docs/guides/monitor/anomaly-detection.md b/docs/guides/monitor/anomaly-detection.md
index 7f7727f53c..ce819d937a 100644
--- a/docs/guides/monitor/anomaly-detection.md
+++ b/docs/guides/monitor/anomaly-detection.md
@@ -14,7 +14,7 @@ custom_edit_url: https://github.com/netdata/netdata/edit/master/docs/guides/moni As of [`v1.32.0`](https://github.com/netdata/netdata/releases/tag/v1.32.0), Netdata comes with some ML powered [anomaly detection](https://en.wikipedia.org/wiki/Anomaly_detection) capabilities built into it and available to use out of the box,
 with zero configuration required (ML was enabled by default in `v1.35.0-29-nightly` in [this PR](https://github.com/netdata/netdata/pull/13158), previously it required a one line config change).
-This means that in addition to collecting raw value metrics, the Netdata agent will also produce an [`anomaly-bit`](https://learn.netdata.cloud/docs/agent/ml#anomaly-bit---100--anomalous-0--normal) every second which will be `100` when recent raw metric values are considered anomalous by Netdata and `0` when they look normal. Once we aggregate beyond one second intervals this aggregated `anomaly-bit` becomes an ["anomaly rate"](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate---averageanomaly-bit).
+This means that in addition to collecting raw value metrics, the Netdata agent will also produce an [`anomaly-bit`](https://github.com/netdata/netdata/blob/master/ml/README.md#anomaly-bit---100--anomalous-0--normal) every second which will be `100` when recent raw metric values are considered anomalous by Netdata and `0` when they look normal. Once we aggregate beyond one second intervals this aggregated `anomaly-bit` becomes an ["anomaly rate"](https://github.com/netdata/netdata/blob/master/ml/README.md#anomaly-rate---averageanomaly-bit).
 To be as concrete as possible, the below api call shows how to access the raw anomaly bit of the `system.cpu` chart from the [london.my-netdata.io](https://london.my-netdata.io) Netdata demo server. Passing `options=anomaly-bit` returns the anomaly bit instead of the raw metric value.
@@ -22,19 +22,19 @@ To be as concrete as possible, the below api call shows how to access the raw an https://london.my-netdata.io/api/v1/data?chart=system.cpu&options=anomaly-bit ```
-If we aggregate the above to just 1 point by adding `points=1` we get an "[Anomaly Rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate---averageanomaly-bit)":
+If we aggregate the above to just 1 point by adding `points=1` we get an "[Anomaly Rate](https://github.com/netdata/netdata/blob/master/ml/README.md#anomaly-rate---averageanomaly-bit)":
 ``` https://london.my-netdata.io/api/v1/data?chart=system.cpu&options=anomaly-bit&points=1 ```
-The fundamentals of Netdata's anomaly detection approach and implementation are covered in lots more detail in the [agent ML documentation](https://learn.netdata.cloud/docs/agent/ml).
+The fundamentals of Netdata's anomaly detection approach and implementation are covered in lots more detail in the [agent ML documentation](https://github.com/netdata/netdata/blob/master/ml/README.md).
 This guide will explain how to get started using these ML based anomaly detection capabilities within Netdata. ## Anomaly Advisor
-The [Anomaly Advisor](https://learn.netdata.cloud/docs/cloud/insights/anomaly-advisor) is the flagship anomaly detection feature within Netdata. In the "Anomalies" tab of Netdata you will see an overall "Anomaly Rate" chart that aggregates node level anomaly rate for all nodes in a space. The aim of this chart is to make it easy to quickly spot periods of time where the overall "[node anomaly rate](https://learn.netdata.cloud/docs/agent/ml#node-anomaly-rate)" is elevated in some unusual way and for what node or nodes this relates to.
+The [Anomaly Advisor](https://github.com/netdata/netdata/blob/master/docs/cloud/insights/anomaly-advisor.mdx) is the flagship anomaly detection feature within Netdata. In the "Anomalies" tab of Netdata you will see an overall "Anomaly Rate" chart that aggregates node level anomaly rate for all nodes in a space. The aim of this chart is to make it easy to quickly spot periods of time where the overall "[node anomaly rate](https://github.com/netdata/netdata/blob/master/ml/README.md#node-anomaly-rate)" is elevated in some unusual way and for what node or nodes this relates to.
 ![image](https://user-images.githubusercontent.com/2178292/175928290-490dd8b9-9c55-4724-927e-e145cb1cc837.png)
@@ -44,7 +44,7 @@ Once an area on the Anomaly Rate chart is highlighted netdata will append a "hea ## Embedded Anomaly Rate Charts
-Charts in both the [Overview](https://learn.netdata.cloud/docs/cloud/visualize/overview) and [single node dashboard](https://learn.netdata.cloud/docs/cloud/visualize/overview#jump-to-single-node-dashboards) tabs also expose the underlying anomaly rates for each dimension so users can easily see if the raw metrics are considered anomalous or not by Netdata.
+Charts in both the [Overview](https://github.com/netdata/netdata/blob/master/docs/cloud/visualize/overview.md) and [single node dashboard](https://github.com/netdata/netdata/blob/master/docs/cloud/visualize/overview.md#jump-to-single-node-dashboards) tabs also expose the underlying anomaly rates for each dimension so users can easily see if the raw metrics are considered anomalous or not by Netdata.
 Pressing the anomalies icon (next to the information icon in the chart header) will expand the anomaly rate chart to make it easy to see how the anomaly rate for any individual dimension corresponds to the raw underlying data. In the example below we can see that the spike in `system.pgpgio|in` corresponded in the anomaly rate for that dimension jumping to 100% for a small period of time until the spike passed.
@@ -65,9 +65,9 @@ You can see some example ML based alert configurations below: Check out the resources below to learn more about how Netdata is approaching ML:
-- [Agent ML documentation](https://learn.netdata.cloud/docs/agent/ml).
-- [Anomaly Advisor documentation](https://learn.netdata.cloud/docs/cloud/insights/anomaly-advisor).
-- [Metric Correlations documentation](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations).
+- [Agent ML documentation](https://github.com/netdata/netdata/blob/master/ml/README.md).
+- [Anomaly Advisor documentation](https://github.com/netdata/netdata/blob/master/docs/cloud/insights/anomaly-advisor.mdx).
+- [Metric Correlations documentation](https://github.com/netdata/netdata/blob/master/docs/cloud/insights/metric-correlations.md).
 - Anomaly Advisor [launch blog post](https://www.netdata.cloud/blog/introducing-anomaly-advisor-unsupervised-anomaly-detection-in-netdata/). - Netdata Approach to ML [blog post](https://www.netdata.cloud/blog/our-approach-to-machine-learning/). - `areal/ml` related [GitHub Discussions](https://github.com/netdata/netdata/discussions?discussions_q=label%3Aarea%2Fml).
diff --git a/docs/guides/monitor/dimension-templates.md b/docs/guides/monitor/dimension-templates.md
index 539127366c..d2795a9c60 100644
--- a/docs/guides/monitor/dimension-templates.md
+++ b/docs/guides/monitor/dimension-templates.md
@@ -8,24 +8,27 @@ custom_edit_url: https://github.com/netdata/netdata/edit/master/docs/guides/moni Your ability to monitor the health of your systems and applications relies on your ability to create and maintain the best set of alarms for your particular needs.
-In v1.18 of Netdata, we introduced **dimension templates** for alarms, which simplifies the process of writing [alarm
-entities](/health/REFERENCE.md#health-entity-reference) for charts with many dimensions.
+In v1.18 of Netdata, we introduced **dimension templates** for alarms, which simplifies the process of
+writing [alarm entities](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md#health-entity-reference) for
+charts with many dimensions.
 Dimension templates can condense many individual entities into one—no more copy-pasting one entity and changing the `alarm`/`template` and `lookup` lines for each dimension you'd like to monitor. They are, however, an advanced health monitoring feature. For more basic instructions on creating your first alarm,
-check out our [health monitoring documentation](/health/README.md), which also includes
-[examples](/health/REFERENCE.md#example-alarms).
+check out our [health monitoring documentation](https://github.com/netdata/netdata/blob/master/health/README.md), which also includes
+[examples](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md#example-alarms).
 ## The fundamentals of `foreach`
-Our dimension templates update creates a new `foreach` parameter to the existing [`lookup`
-line](/health/REFERENCE.md#alarm-line-lookup). This is where the magic happens.
+Our dimension templates update creates a new `foreach` parameter to the
+existing [`lookup` line](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md#alarm-line-lookup). This
+is where the magic happens.
 You use the `foreach` parameter to specify which dimensions you want to monitor with this single alarm. You can separate
-them with a comma (`,`) or a pipe (`|`). You can also use a [Netdata simple pattern](/libnetdata/simple_pattern/README.md)
-to create many alarms with a regex-like syntax.
+them with a comma (`,`) or a pipe (`|`). You can also use
+a [Netdata simple pattern](https://github.com/netdata/netdata/blob/master/libnetdata/simple_pattern/README.md) to create
+many alarms with a regex-like syntax.
 The `foreach` parameter _has_ to be the last parameter in your `lookup` line, and if you have both `of` and `foreach` in the same `lookup` line, Netdata will ignore the `of` parameter and use `foreach` instead.
@@ -95,7 +98,7 @@ Let's look at some other examples of how `foreach` works so you can best apply i In the last example, we used `foreach system,user,nice` to create three distinct alarms using dimension templates. But what if you want to quickly create alarms for _all_ the dimensions of a given chart?
-Use a [simple pattern](/libnetdata/simple_pattern/README.md)! One example of a simple pattern is a single wildcard
+Use a [simple pattern](https://github.com/netdata/netdata/blob/master/libnetdata/simple_pattern/README.md)! One example of a simple pattern is a single wildcard
 (`*`). Instead of monitoring system CPU usage, let's monitor per-application CPU usage using the `apps.cpu` chart. Passing a
@@ -113,14 +116,15 @@ lookup: average -10m percentage foreach * This entity will now create alarms for every dimension in the `apps.cpu` chart. Given that most `apps.cpu` charts have 10 or more dimensions, using the wildcard ensures you catch every CPU-hogging process.
-To learn more about how to use simple patterns with dimension templates, see our [simple patterns
-documentation](/libnetdata/simple_pattern/README.md).
+To learn more about how to use simple patterns with dimension templates, see
+our [simple patterns documentation](https://github.com/netdata/netdata/blob/master/libnetdata/simple_pattern/README.md).
 ## Using `foreach` with alarm templates
-Dimension templates also work with [alarm templates](/health/REFERENCE.md#alarm-line-alarm-or-template). Alarm
-templates help you create alarms for all the charts with a given context—for example, all the cores of your system's
-CPU.
+Dimension templates also work
+with [alarm templates](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md#alarm-line-alarm-or-template).
+Alarm templates help you create alarms for all the charts with a given context—for example, all the cores of your
+system's CPU.
 By combining the two, you can create dozens of individual alarms with a single template entity. Here's how you would create alarms for the `system`, `user`, and `nice` dimensions for every chart in the `cpu.cpu` context—or, in other
@@ -170,7 +174,8 @@ alarms that will help you better monitor the health of your systems. Or, at the very least, simplify your configuration files.
-For information about other advanced features in Netdata's health monitoring toolkit, check out our [health
-documentation](/health/README.md). And if you have some cool alarms you built using dimension templates,
+For information about other advanced features in Netdata's health monitoring toolkit, check out
+our [health documentation](https://github.com/netdata/netdata/blob/master/health/README.md). And if you have some cool
+alarms you built using dimension templates,
diff --git a/docs/guides/monitor/kubernetes-k8s-netdata.md b/docs/guides/monitor/kubernetes-k8s-netdata.md
index 5cfefe8923..5732fc96c6 100644
--- a/docs/guides/monitor/kubernetes-k8s-netdata.md
+++ b/docs/guides/monitor/kubernetes-k8s-netdata.md
@@ -46,7 +46,7 @@ To follow this tutorial, you need: - A free Netdata Cloud account. [Sign up](https://app.netdata.cloud/sign-up?cloudRoute=/spaces) if you don't have one already. - A working cluster running Kubernetes v1.9 or newer, with a Netdata deployment and connected parent/child nodes. See
- our [Kubernetes deployment process](/packaging/installer/methods/kubernetes.md) for details on deployment and
+ our [Kubernetes deployment process](https://github.com/netdata/netdata/blob/master/packaging/installer/methods/kubernetes.md) for details on deployment and
 conneting to Cloud.
 - The [`kubectl`](https://kubernetes.io/docs/reference/kubectl/overview/) command line tool, within [one minor version difference](https://kubernetes.io/docs/tasks/tools/install-kubectl/#before-you-begin) of your cluster, on an
@@ -104,7 +104,7 @@ To get started, [sign in](https://app.netdata.cloud/sign-in?cloudRoute=/spaces) to the War Room you connected your cluster to, if not **General**. Netdata Cloud is already visualizing your Kubernetes metrics, streamed in real-time from each node, in the
-[Overview](https://learn.netdata.cloud/docs/cloud/visualize/overview):
+[Overview](https://github.com/netdata/netdata/blob/master/docs/cloud/visualize/overview.md):
 ![Netdata's Kubernetes monitoring dashboard](https://user-images.githubusercontent.com/1153921/109037415-eafc5500-7687-11eb-8773-9b95941e3328.png)
@@ -126,8 +126,8 @@ cluster](https://user-images.githubusercontent.com/1153921/109042169-19c8fa00-76 For example, the chart above shows a spike in the CPU utilization from `rabbitmq` every minute or so, along with a baseline CPU utilization of 10-15% across the cluster.
-Read about the [Overview](https://learn.netdata.cloud/docs/cloud/visualize/overview) and some best practices on [viewing
-an overview of your infrastructure](/docs/visualize/overview-infrastructure.md) for details on using composite charts to
+Read about the [Overview](https://github.com/netdata/netdata/blob/master/docs/cloud/visualize/overview.md) and some best practices on [viewing
+an overview of your infrastructure](https://github.com/netdata/netdata/blob/master/docs/visualize/overview-infrastructure.md) for details on using composite charts to
 drill down into per-node performance metrics. ## Pod and container metrics
@@ -154,7 +154,7 @@ Let's explore the most colorful box by hovering over it.
 container](https://user-images.githubusercontent.com/1153921/109049544-a8417980-7695-11eb-80a7-109b4a645a27.png) The **Context** tab shows `rabbitmq-5bb66bb6c9-6xr5b` as the container's image name, which means this container is
-running a [RabbitMQ](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/rabbitmq) workload.
+running a [RabbitMQ](https://github.com/netdata/go.d.plugin/blob/master/modules/rabbitmq/README.md) workload.
 Click the **Metrics** tab to see real-time metrics from that container. Unsurprisingly, it shows a spike in CPU utilization at regular intervals.
@@ -173,7 +173,7 @@ different namespaces. ![Time-series Kubernetes monitoring in Netdata Cloud](https://user-images.githubusercontent.com/1153921/109075210-126a1680-76b6-11eb-918d-5acdcdac152d.png)
-Each composite chart has a [definition bar](https://learn.netdata.cloud/docs/cloud/visualize/overview#definition-bar)
+Each composite chart has a [definition bar](https://github.com/netdata/netdata/blob/master/docs/cloud/visualize/overview.md#definition-bar)
 for complete customization. For example, grouping the top chart by `k8s_container_name` reveals new information. ![Changing time-series charts](https://user-images.githubusercontent.com/1153921/109075212-139b4380-76b6-11eb-836f-939482ae55fc.png)
@@ -183,20 +183,20 @@ for complete customization. For example, grouping the top chart by `k8s_containe Netdata has a [service discovery plugin](https://github.com/netdata/agent-service-discovery), which discovers and creates configuration files for [compatible services](https://github.com/netdata/helmchart#service-discovery-and-supported-services) and any endpoints covered by
-our [generic Prometheus collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/prometheus).
+our [generic Prometheus collector](https://github.com/netdata/go.d.plugin/blob/master/modules/prometheus/README.md).
 Netdata uses these files to collect metrics from any compatible application as they run _inside_ of a pod. Service discovery happens without manual intervention as pods are created, destroyed, or moved between nodes. Service metrics show up on the Overview as well, beneath the **Kubernetes** section, and are labeled according to the service in question. For example, the **RabbitMQ** section has numerous charts from the [`rabbitmq`
-collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/rabbitmq):
+collector](https://github.com/netdata/go.d.plugin/blob/master/modules/rabbitmq/README.md):
 ![Finding service discovery metrics](https://user-images.githubusercontent.com/1153921/109054511-2eac8a00-769b-11eb-97f1-da93acb4b5fe.png) > The robot-shop cluster has more supported services, such as MySQL, which are not visible with zero configuration. This > is usually because of services running on non-default ports, using non-default names, or required passwords. Read up
-> on [configuring service discovery](/packaging/installer/methods/kubernetes.md#configure-service-discovery) to collect
+> on [configuring service discovery](https://github.com/netdata/netdata/blob/master/packaging/installer/methods/kubernetes.md#configure-service-discovery) to collect
 > more service metrics. Service metrics are essential to infrastructure monitoring, as they're the best indicator of the end-user experience,
@@ -210,7 +210,7 @@ Netdata also automatically collects metrics from two essential Kubernetes proces The **k8s kubelet** section visualizes metrics from the Kubernetes agent responsible for managing every pod on a given node. This also happens without any configuration thanks to the [kubelet
-collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubelet).
+collector](https://github.com/netdata/go.d.plugin/blob/master/modules/k8s_kubelet/README.md).
 Monitoring each node's kubelet can be invaluable when diagnosing issues with your Kubernetes cluster. For example, you can see if the number of running containers/pods has dropped, which could signal a fault or crash in a particular
@@ -226,7 +226,7 @@ configuration-related errors, and the actual vs. desired numbers of volumes, plu The **k8s kube-proxy** section displays metrics about the network proxy that runs on each node in your Kubernetes cluster. kube-proxy lets pods communicate with each other and accept sessions from outside your cluster. Its metrics are collected by the [kube-proxy
-collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubeproxy).
+collector](https://github.com/netdata/go.d.plugin/blob/master/modules/k8s_kubeproxy/README.md).
 With Netdata, you can monitor how often your k8s proxies are syncing proxy rules between nodes. Dramatic changes in these figures could indicate an anomaly in your cluster that's worthy of further investigation.
@@ -246,9 +246,9 @@ clusters of all sizes.
 - [Netdata Helm chart](https://github.com/netdata/helmchart) - [Netdata service discovery](https://github.com/netdata/agent-service-discovery) - [Netdata Agent · `kubelet`
- collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubelet)
+ collector](https://github.com/netdata/go.d.plugin/blob/master/modules/k8s_kubelet/README.md)
 - [Netdata Agent · `kube-proxy`
- collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/k8s_kubeproxy)
-- [Netdata Agent · `cgroups.plugin`](/collectors/cgroups.plugin/README.md)
+ collector](https://github.com/netdata/go.d.plugin/blob/master/modules/k8s_kubeproxy/README.md)
+- [Netdata Agent · `cgroups.plugin`](https://github.com/netdata/netdata/blob/master/collectors/cgroups.plugin/README.md)
diff --git a/docs/guides/monitor/lamp-stack.md b/docs/guides/monitor/lamp-stack.md
index 29b35e142b..165888c4bf 100644
--- a/docs/guides/monitor/lamp-stack.md
+++ b/docs/guides/monitor/lamp-stack.md
@@ -58,7 +58,7 @@ To follow this tutorial, you need: ## Install the Netdata Agent If you don't have the free, open-source Netdata monitoring agent installed on your node yet, get started with a [single
-kickstart command](/docs/get-started.mdx):
+kickstart command](https://github.com/netdata/netdata/blob/master/docs/get-started.mdx):
@@ -68,15 +68,15 @@ replacing `NODE` with the hostname or IP address of your system. ## Enable hardware and Linux system monitoring
-There's nothing you need to do to enable [system monitoring](/docs/collect/system-metrics.md) and Linux monitoring with
+There's nothing you need to do to enable [system monitoring](https://github.com/netdata/netdata/blob/master/docs/collect/system-metrics.md) and Linux monitoring with
 the Netdata Agent, which autodetects metrics from CPUs, memory, disks, networking devices, and Linux processes like systemd without any configuration.
 If you're using containers, Netdata automatically collects resource utilization
-metrics from each using the [cgroups data collector](/collectors/cgroups.plugin/README.md).
+metrics from each using the [cgroups data collector](https://github.com/netdata/netdata/blob/master/collectors/cgroups.plugin/README.md).
 ## Enable Apache monitoring Let's begin by configuring Apache to work with Netdata's [Apache data
-collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/apache).
+collector](https://github.com/netdata/go.d.plugin/blob/master/modules/apache/README.md).
 Actually, there's nothing for you to do to enable Apache monitoring with Netdata.
@@ -87,7 +87,7 @@ metrics](https://httpd.apache.org/docs/2.4/mod/mod_status.html), which is just _ ## Enable web log monitoring The Netdata Agent also comes with a [web log
-collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/weblog), which reads Apache's access
+collector](https://github.com/netdata/go.d.plugin/blob/master/modules/weblog/README.md), which reads Apache's access
 log file, processes each line, and converts them into per-second metrics. On Debian systems, it reads the file at `/var/log/apache2/access.log`.
@@ -100,7 +100,7 @@ monitoring. Because your MySQL database is password-protected, you do need to tell MySQL to allow the `netdata` user to connect to without a password. Netdata's [MySQL data
-collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/mysql) collects metrics in _read-only_
+collector](https://github.com/netdata/go.d.plugin/blob/master/modules/mysql/README.md) collects metrics in _read-only_
 mode, without being able to alter or affect operations in any way. First, log into the MySQL shell.
Then, run the following three commands, one at a time: @@ -112,15 +112,15 @@ FLUSH PRIVILEGES; ``` Run `sudo systemctl restart netdata`, or the [appropriate alternative for your -system](/docs/configure/start-stop-restart.md), to collect dozens of metrics every second for robust MySQL monitoring. +system](https://github.com/netdata/netdata/blob/master/docs/configure/start-stop-restart.md), to collect dozens of metrics every second for robust MySQL monitoring. ## Enable PHP monitoring Unlike Apache or MySQL, PHP isn't a service that you can monitor directly, unless you instrument a PHP-based application -with [StatsD](/collectors/statsd.plugin/README.md). +with [StatsD](https://github.com/netdata/netdata/blob/master/collectors/statsd.plugin/README.md). However, if you use [PHP-FPM](https://php-fpm.org/) in your LAMP stack, you can monitor that process with our [PHP-FPM -data collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/phpfpm). +data collector](https://github.com/netdata/go.d.plugin/blob/master/modules/phpfpm/README.md). Open your PHP-FPM configuration for editing, replacing `7.4` with your version of PHP: @@ -166,12 +166,12 @@ If the Netdata Agent isn't already open in your browser, open a new tab and navi > If you [signed up](https://app.netdata.cloud/sign-up?cloudRoute=/spaces) for Netdata Cloud earlier, you can also view > the exact same LAMP stack metrics there, plus additional features, like drag-and-drop custom dashboards. Be sure to -> [connecting your node](/claim/README.md) to start streaming metrics to your browser through Netdata Cloud. +> [connecting your node](https://github.com/netdata/netdata/blob/master/claim/README.md) to start streaming metrics to your browser through Netdata Cloud. Netdata automatically organizes all metrics and charts onto a single page for easy navigation. Peek at gauges to see overall system performance, then scroll down to see more. 
Click-and-drag with your mouse to pan _all_ charts back and forth through different time intervals, or hold `SHIFT` and use the scrollwheel (or two-finger scroll) to zoom in and -out. Check out our doc on [interacting with charts](/docs/visualize/interact-dashboards-charts.md) for all the details. +out. Check out our doc on [interacting with charts](https://github.com/netdata/netdata/blob/master/docs/visualize/interact-dashboards-charts.md) for all the details. ![The Netdata dashboard](https://user-images.githubusercontent.com/1153921/109520555-98e17800-7a69-11eb-86ec-16f689da4527.png) @@ -205,15 +205,15 @@ Here's a quick reference for what charts you might want to focus on after settin The Netdata Agent comes with hundreds of pre-configured alarms to help you keep tabs on your system, including 19 alarms designed for smarter LAMP stack monitoring. -Click the 🔔 icon in the top navigation to [see active alarms](/docs/monitor/view-active-alarms.md). The **Active** tabs +Click the 🔔 icon in the top navigation to [see active alarms](https://github.com/netdata/netdata/blob/master/docs/monitor/view-active-alarms.md). The **Active** tabs shows any alarms currently triggered, while the **All** tab displays a list of _every_ pre-configured alarm. The ![An example of LAMP stack alarms](https://user-images.githubusercontent.com/1153921/109524120-5883f900-7a6d-11eb-830e-0e7baaa28163.png) -[Tweak alarms](/docs/monitor/configure-alarms.md) based on your infrastructure monitoring needs, and to see these alarms +[Tweak alarms](https://github.com/netdata/netdata/blob/master/docs/monitor/configure-alarms.md) based on your infrastructure monitoring needs, and to see these alarms in other places, like your inbox or a Slack channel, [enable a notification -method](/docs/monitor/enable-notifications.md). +method](https://github.com/netdata/netdata/blob/master/docs/monitor/enable-notifications.md). ## What's next? @@ -223,7 +223,7 @@ services. 
The per-second metrics granularity means you have the most accurate in any LAMP-related issues. Another powerful way to monitor the availability of a LAMP stack is the [`httpcheck` -collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/httpcheck), which pings a web server at +collector](https://github.com/netdata/go.d.plugin/blob/master/modules/httpcheck/README.md), which pings a web server at a regular interval and tells you whether if and how quickly it's responding. The `response_match` option also lets you monitor when the web server's response isn't what you expect it to be, which might happen if PHP-FPM crashes, for example. @@ -233,14 +233,14 @@ we're not covering it here, but it _does_ work in a single-node setup. Just don' node crashed. If you're planning on managing more than one node, or want to take advantage of advanced features, like finding the -source of issues faster with [Metric Correlations](https://learn.netdata.cloud/docs/cloud/insights/metric-correlations), +source of issues faster with [Metric Correlations](https://github.com/netdata/netdata/blob/master/docs/cloud/insights/metric-correlations.md), [sign up](https://app.netdata.cloud/sign-up?cloudRoute=/spaces) for a free Netdata Cloud account. 
### Related reference documentation -- [Netdata Agent · Get started](/docs/get-started.mdx) -- [Netdata Agent · Apache data collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/apache) -- [Netdata Agent · Web log collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/weblog) -- [Netdata Agent · MySQL data collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/mysql) -- [Netdata Agent · PHP-FPM data collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/phpfpm) +- [Netdata Agent · Get started](https://github.com/netdata/netdata/blob/master/docs/get-started.mdx) +- [Netdata Agent · Apache data collector](https://github.com/netdata/go.d.plugin/blob/master/modules/apache/README.md) +- [Netdata Agent · Web log collector](https://github.com/netdata/go.d.plugin/blob/master/modules/weblog/README.md) +- [Netdata Agent · MySQL data collector](https://github.com/netdata/go.d.plugin/blob/master/modules/mysql/README.md) +- [Netdata Agent · PHP-FPM data collector](https://github.com/netdata/go.d.plugin/blob/master/modules/phpfpm/README.md) diff --git a/docs/guides/monitor/pi-hole-raspberry-pi.md b/docs/guides/monitor/pi-hole-raspberry-pi.md index 1246d8ba12..5099d12b95 100644 --- a/docs/guides/monitor/pi-hole-raspberry-pi.md +++ b/docs/guides/monitor/pi-hole-raspberry-pi.md @@ -79,7 +79,7 @@ service](https://discourse.pi-hole.net/t/how-do-i-configure-my-devices-to-use-pi finished setting up Pi-hole at this point. As far as configuring Netdata to monitor Pi-hole metrics, there's nothing you actually need to do. Netdata's [Pi-hole -collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/pihole) will autodetect the new service +collector](https://github.com/netdata/go.d.plugin/blob/master/modules/pihole/README.md) will autodetect the new service running on your Raspberry Pi and immediately start collecting metrics every second. 
Restart Netdata with `sudo systemctl restart netdata`, which will then recognize that Pi-hole is running and start a @@ -98,15 +98,15 @@ part of your system might affect another. ![The Netdata dashboard in action](https://user-images.githubusercontent.com/1153921/80827388-b9fee100-8b98-11ea-8f60-0d7824667cd3.gif) -If you're completely new to Netdata, look at our [step-by-step guide](/docs/guides/step-by-step/step-00.md) for a -walkthrough of all its features. For a more expedited tour, see the [get started guide](/docs/get-started.mdx). +If you're completely new to Netdata, look at our [step-by-step guide](https://github.com/netdata/netdata/blob/master/docs/guides/step-by-step/step-00.md) for a +walkthrough of all its features. For a more expedited tour, see the [get started guide](https://github.com/netdata/netdata/blob/master/docs/get-started.mdx). ### Enable temperature sensor monitoring You need to manually enable Netdata's built-in [temperature sensor -collector](https://learn.netdata.cloud/docs/agent/collectors/charts.d.plugin/sensors) to start collecting metrics. +collector](https://github.com/netdata/netdata/blob/master/collectors/charts.d.plugin/sensors/README.md) to start collecting metrics. -> Netdata uses a few plugins to manage its [collectors](/collectors/REFERENCE.md), each using a different language: Go, +> Netdata uses a few plugins to manage its [collectors](https://github.com/netdata/netdata/blob/master/collectors/REFERENCE.md), each using a different language: Go, > Python, Node.js, and Bash. While our Go collectors are undergoing the most active development, we still support the > other languages. In this case, you need to enable a temperature sensor collector that's written in Bash. @@ -124,7 +124,7 @@ Raspberry Pi temperature sensor monitoring. ### Storing historical metrics on your Raspberry Pi By default, Netdata allocates 256 MiB in disk space to store historical metrics inside the [database -engine](/database/engine/README.md). 
On the Raspberry Pi used for this guide, Netdata collects 1,500 metrics every +engine](https://github.com/netdata/netdata/blob/master/database/engine/README.md). On the Raspberry Pi used for this guide, Netdata collects 1,500 metrics every second, which equates to storing 3.5 days worth of historical metrics. You can increase this allocation by editing `netdata.conf` and increasing the `dbengine multihost disk space` setting to @@ -136,8 +136,8 @@ more than 256. ``` Use our [database sizing -calculator](/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics) -and [guide on storing historical metrics](/docs/guides/longer-metrics-storage.md) to help you determine the right +calculator](https://github.com/netdata/netdata/blob/master/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics) +and [guide on storing historical metrics](https://github.com/netdata/netdata/blob/master/docs/guides/longer-metrics-storage.md) to help you determine the right setting for your Raspberry Pi. ## What's next? @@ -146,12 +146,12 @@ Now that you're monitoring Pi-hole and your Raspberry Pi with Netdata, you can e configure Netdata to more specific goals. Most importantly, you can always install additional services and instantly collect metrics from many of them with our -[300+ integrations](/collectors/COLLECTORS.md). +[300+ integrations](https://github.com/netdata/netdata/blob/master/collectors/COLLECTORS.md). -- [Optimize performance](/docs/guides/configure/performance.md) using tweaks developed for IoT devices. -- [Stream Raspberry Pi metrics](/streaming/README.md) to a parent host for easy access or longer-term storage. -- [Tweak alarms](/health/QUICKSTART.md) for either Pi-hole or the health of your Raspberry Pi. -- [Export metrics to external databases](/exporting/README.md) with the exporting engine. 
+- [Optimize performance](https://github.com/netdata/netdata/blob/master/docs/guides/configure/performance.md) using tweaks developed for IoT devices. +- [Stream Raspberry Pi metrics](https://github.com/netdata/netdata/blob/master/streaming/README.md) to a parent host for easy access or longer-term storage. +- [Tweak alarms](https://github.com/netdata/netdata/blob/master/health/QUICKSTART.md) for either Pi-hole or the health of your Raspberry Pi. +- [Export metrics to external databases](https://github.com/netdata/netdata/blob/master/exporting/README.md) with the exporting engine. Or, head over to [our guides](https://learn.netdata.cloud/guides/) for even more experiments and insights into troubleshooting the health of your systems and services. diff --git a/docs/guides/monitor/process.md b/docs/guides/monitor/process.md index 2f46d7abc4..7cc327a01c 100644 --- a/docs/guides/monitor/process.md +++ b/docs/guides/monitor/process.md @@ -23,38 +23,46 @@ SQL queries or know a bunch of arbitrary command-line flags. With Netdata's process monitoring, you can: -- Benchmark/optimize performance of standard applications, like web servers or databases -- Benchmark/optimize performance of custom applications -- Troubleshoot CPU/memory/disk utilization issues (why is my system's CPU spiking right now?) -- Perform granular capacity planning based on the specific needs of your infrastructure -- Search for leaking file descriptors -- Investigate zombie processes +- Benchmark/optimize performance of standard applications, like web servers or databases +- Benchmark/optimize performance of custom applications +- Troubleshoot CPU/memory/disk utilization issues (why is my system's CPU spiking right now?) +- Perform granular capacity planning based on the specific needs of your infrastructure +- Search for leaking file descriptors +- Investigate zombie processes ... and much more. Let's get started. ## Prerequisites -- One or more Linux nodes running [Netdata](/docs/get-started.mdx). 
If you need more time to understand Netdata before - following this guide, see the [infrastructure](/docs/quickstart/infrastructure.md) or - [single-node](/docs/quickstart/single-node.md) monitoring quickstarts. -- A general understanding of how to [configure the Netdata Agent](/docs/configure/nodes.md) using `edit-config`. -- A Netdata Cloud account. [Sign up](https://app.netdata.cloud) if you don't have one already. +- One or more Linux nodes running [Netdata](https://github.com/netdata/netdata/blob/master/docs/get-started.mdx). If you + need more time to understand Netdata before + following this guide, see + the [infrastructure](https://github.com/netdata/netdata/blob/master/docs/quickstart/infrastructure.md) or + [single-node](https://github.com/netdata/netdata/blob/master/docs/quickstart/single-node.md) monitoring quickstarts. +- A general understanding of how + to [configure the Netdata Agent](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md) + using `edit-config`. +- A Netdata Cloud account. [Sign up](https://app.netdata.cloud) if you don't have one already. ## How does Netdata do process monitoring? -The Netdata Agent already knows to look for hundreds of [standard applications that we support via -collectors](/collectors/COLLECTORS.md), and groups them based on their purpose. Let's say you want to monitor a MySQL +The Netdata Agent already knows to look for hundreds +of [standard applications that we support via collectors](https://github.com/netdata/netdata/blob/master/collectors/COLLECTORS.md), +and groups them based on their +purpose. Let's say you want to monitor a MySQL database using its process. The Netdata Agent already knows to look for processes with the string `mysqld` in their name, along with a few others, and puts them into the `sql` group. This `sql` group then becomes a dimension in all process-specific charts. The process and groups settings are used by two unique and powerful collectors. 
-[**`apps.plugin`**](/collectors/apps.plugin/README.md) looks at the Linux process tree every second, much like `top` or +[**`apps.plugin`**](https://github.com/netdata/netdata/blob/master/collectors/apps.plugin/README.md) looks at the Linux +process tree every second, much like `top` or `ps fax`, and collects resource utilization information on every running process. It then automatically adds a layer of meaningful visualization on top of these metrics, and creates per-process/application charts. -[**`ebpf.plugin`**](/collectors/ebpf.plugin/README.md): Netdata's extended Berkeley Packet Filter (eBPF) collector +[**`ebpf.plugin`**](https://github.com/netdata/netdata/blob/master/collectors/ebpf.plugin/README.md): Netdata's extended +Berkeley Packet Filter (eBPF) collector monitors Linux kernel-level metrics for file descriptors, virtual filesystem IO, and process management, and then hands process-specific metrics over to `apps.plugin` for visualization. The eBPF collector also collects and visualizes metrics on an _event frequency_, which means it captures every kernel interaction, and not just the volume of @@ -65,55 +73,55 @@ interaction at every second in time. That's even more precise than Netdata's sta With these collectors working in parallel, Netdata visualizes the following per-second metrics for _any_ process on your Linux systems: -- CPU utilization (`apps.cpu`) - - Total CPU usage - - User/system CPU usage (`apps.cpu_user`/`apps.cpu_system`) -- Disk I/O - - Physical reads/writes (`apps.preads`/`apps.pwrites`) - - Logical reads/writes (`apps.lreads`/`apps.lwrites`) - - Open unique files (if a file is found open multiple times, it is counted just once, `apps.files`) -- Memory - - Real Memory Used (non-shared, `apps.mem`) - - Virtual Memory Allocated (`apps.vmem`) - - Minor page faults (i.e. 
memory activity, `apps.minor_faults`) -- Processes - - Threads running (`apps.threads`) - - Processes running (`apps.processes`) - - Carried over uptime (since the last Netdata Agent restart, `apps.uptime`) - - Minimum uptime (`apps.uptime_min`) - - Average uptime (`apps.uptime_average`) - - Maximum uptime (`apps.uptime_max`) - - Pipes open (`apps.pipes`) -- Swap memory - - Swap memory used (`apps.swap`) - - Major page faults (i.e. swap activity, `apps.major_faults`) -- Network - - Sockets open (`apps.sockets`) -- eBPF file - - Number of calls to open files. (`apps.file_open`) - - Number of files closed. (`apps.file_closed`) - - Number of calls to open files that returned errors. - - Number of calls to close files that returned errors. -- eBPF syscall - - Number of calls to delete files. (`apps.file_deleted`) - - Number of calls to `vfs_write`. (`apps.vfs_write_call`) - - Number of calls to `vfs_read`. (`apps.vfs_read_call`) - - Number of bytes written with `vfs_write`. (`apps.vfs_write_bytes`) - - Number of bytes read with `vfs_read`. (`apps.vfs_read_bytes`) - - Number of calls to write a file that returned errors. - - Number of calls to read a file that returned errors. -- eBPF process - - Number of process created with `do_fork`. (`apps.process_create`) - - Number of threads created with `do_fork` or `__x86_64_sys_clone`, depending on your system's kernel version. (`apps.thread_create`) - - Number of times that a process called `do_exit`. (`apps.task_close`) -- eBPF net - - Number of bytes sent. (`apps.bandwidth_sent`) - - Number of bytes received. 
(`apps.bandwidth_recv`) +- CPU utilization (`apps.cpu`) + - Total CPU usage + - User/system CPU usage (`apps.cpu_user`/`apps.cpu_system`) +- Disk I/O + - Physical reads/writes (`apps.preads`/`apps.pwrites`) + - Logical reads/writes (`apps.lreads`/`apps.lwrites`) + - Open unique files (if a file is found open multiple times, it is counted just once, `apps.files`) +- Memory + - Real Memory Used (non-shared, `apps.mem`) + - Virtual Memory Allocated (`apps.vmem`) + - Minor page faults (i.e. memory activity, `apps.minor_faults`) +- Processes + - Threads running (`apps.threads`) + - Processes running (`apps.processes`) + - Carried over uptime (since the last Netdata Agent restart, `apps.uptime`) + - Minimum uptime (`apps.uptime_min`) + - Average uptime (`apps.uptime_average`) + - Maximum uptime (`apps.uptime_max`) + - Pipes open (`apps.pipes`) +- Swap memory + - Swap memory used (`apps.swap`) + - Major page faults (i.e. swap activity, `apps.major_faults`) +- Network + - Sockets open (`apps.sockets`) +- eBPF file + - Number of calls to open files. (`apps.file_open`) + - Number of files closed. (`apps.file_closed`) + - Number of calls to open files that returned errors. + - Number of calls to close files that returned errors. +- eBPF syscall + - Number of calls to delete files. (`apps.file_deleted`) + - Number of calls to `vfs_write`. (`apps.vfs_write_call`) + - Number of calls to `vfs_read`. (`apps.vfs_read_call`) + - Number of bytes written with `vfs_write`. (`apps.vfs_write_bytes`) + - Number of bytes read with `vfs_read`. (`apps.vfs_read_bytes`) + - Number of calls to write a file that returned errors. + - Number of calls to read a file that returned errors. +- eBPF process + - Number of process created with `do_fork`. (`apps.process_create`) + - Number of threads created with `do_fork` or `__x86_64_sys_clone`, depending on your system's kernel + version. (`apps.thread_create`) + - Number of times that a process called `do_exit`. 
(`apps.task_close`) +- eBPF net + - Number of bytes sent. (`apps.bandwidth_sent`) + - Number of bytes received. (`apps.bandwidth_recv`) As an example, here's the per-process CPU utilization chart, including a `sql` group/dimension. -![A per-process CPU utilization chart in Netdata -Cloud](https://user-images.githubusercontent.com/1153921/101217226-3a5d5700-363e-11eb-8610-aa1640aefb5d.png) +![A per-process CPU utilization chart in Netdata Cloud](https://user-images.githubusercontent.com/1153921/101217226-3a5d5700-363e-11eb-8610-aa1640aefb5d.png) ## Configure the Netdata Agent to recognize a specific process @@ -123,7 +131,8 @@ aware of hundreds of processes, and collects metrics from them automatically. But, if you want to change the grouping behavior, add an application that isn't yet supported in the Netdata Agent, or monitor a custom application, you need to edit the `apps_groups.conf` configuration file. -Navigate to your [Netdata config directory](/docs/configure/nodes.md) and use `edit-config` to edit the file. +Navigate to your [Netdata config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md) and +use `edit-config` to edit the file. ```bash cd /etc/netdata # Replace this with your Netdata config directory if not at /etc/netdata. @@ -138,7 +147,8 @@ others, and groups them into `sql`. That makes sense, since all these processes sql: mysqld* mariad* postgres* postmaster* oracle_* ora_* sqlservr ``` -These groups are then reflected as [dimensions](/web/README.md#dimensions) within Netdata's charts. +These groups are then reflected as [dimensions](https://github.com/netdata/netdata/blob/master/web/README.md#dimensions) +within Netdata's charts. ![An example per-process CPU utilization chart in Netdata Cloud](https://user-images.githubusercontent.com/1153921/101369156-352e2100-3865-11eb-9f0d-b8fac162e034.png) @@ -153,12 +163,13 @@ shouldn't need to configure it to discover them. 
However, if you're using multiple applications that the Netdata Agent groups together you may want to separate them for more precise monitoring. If you're not running any other types of SQL databases on that node, you don't need to change -the grouping, since you know that any MySQL is the only process contributing to the `sql` group. +the grouping, since you know that any MySQL is the only process contributing to the `sql` group. Let's say you're using both MySQL and PostgreSQL databases on a single node, and want to monitor their processes -independently. Open the `apps_groups.conf` file as explained in the [section -above](#configure-the-netdata-agent-to-recognize-a-specific-process) and scroll down until you find the `database -servers` section. Create new groups for MySQL and PostgreSQL, and move their process queries into the unique groups. +independently. Open the `apps_groups.conf` file as explained in +the [section above](#configure-the-netdata-agent-to-recognize-a-specific-process) and scroll down until you find +the `database servers` section. Create new groups for MySQL and PostgreSQL, and move their process queries into the +unique groups. ```conf # ----------------------------------------------------------------------------- @@ -169,17 +180,18 @@ postgres: postgres* sql: mariad* postmaster* oracle_* ora_* sqlservr ``` -Restart Netdata with `sudo systemctl restart netdata`, or the [appropriate -method](/docs/configure/start-stop-restart.md) for your system, to start collecting utilization metrics from your -application. Time to [visualize your process metrics](#visualize-process-metrics). +Restart Netdata with `sudo systemctl restart netdata`, or +the [appropriate method](https://github.com/netdata/netdata/blob/master/docs/configure/start-stop-restart.md) for your system, to start collecting utilization metrics +from your application. Time to [visualize your process metrics](#visualize-process-metrics). 
### Custom applications Let's assume you have an application that runs on the process `custom-app`. To monitor eBPF metrics for that application separate from any others, you need to create a new group in `apps_groups.conf` and associate that process name with it. -Open the `apps_groups.conf` file as explained in the [section -above](#configure-the-netdata-agent-to-recognize-a-specific-process). Scroll down to `# NETDATA processes accounting`. +Open the `apps_groups.conf` file as explained in +the [section above](#configure-the-netdata-agent-to-recognize-a-specific-process). Scroll down +to `# NETDATA processes accounting`. Above that, paste in the following text, which creates a new `custom-app` group with the `custom-app` process. Replace `custom-app` with the name of your application's Linux process. `apps_groups.conf` should now look like this: @@ -195,26 +207,25 @@ custom-app: custom-app ... ``` -Restart Netdata with `sudo systemctl restart netdata`, or the [appropriate -method](/docs/configure/start-stop-restart.md) for your system, to start collecting utilization metrics from your -application. +Restart Netdata with `sudo systemctl restart netdata`, or +the [appropriate method](https://github.com/netdata/netdata/blob/master/docs/configure/start-stop-restart.md) for your system, to start collecting utilization metrics +from your application. ## Visualize process metrics Now that you're collecting metrics for your process, you'll want to visualize them using Netdata's real-time, -interactive charts. Find these visualizations in the same section regardless of whether you use [Netdata -Cloud](https://app.netdata.cloud) for infrastructure monitoring, or single-node monitoring with the local Agent's -dashboard at `http://localhost:19999`. +interactive charts. 
Find these visualizations in the same section regardless of whether you +use [Netdata Cloud](https://app.netdata.cloud) for infrastructure monitoring, or single-node monitoring with the local +Agent's dashboard at `http://localhost:19999`. -If you need a refresher on all the available per-process charts, see the [above -list](#per-process-metrics-and-charts-in-netdata). +If you need a refresher on all the available per-process charts, see +the [above list](#per-process-metrics-and-charts-in-netdata). ### Using Netdata's application collector (`apps.plugin`) `apps.plugin` puts all of its charts under the **Applications** section of any Netdata dashboard. -![Screenshot of the Applications section on a Netdata -dashboard](https://user-images.githubusercontent.com/1153921/101401172-2ceadb80-388f-11eb-9e9a-88443894c272.png) +![Screenshot of the Applications section on a Netdata dashboard](https://user-images.githubusercontent.com/1153921/101401172-2ceadb80-388f-11eb-9e9a-88443894c272.png) Let's continue with the MySQL example. We can create a [test database](https://www.digitalocean.com/community/tutorials/how-to-measure-mysql-query-performance-with-mysqlslap) in @@ -223,11 +234,9 @@ MySQL to generate load on the `mysql` process. `apps.plugin` immediately collects and visualizes this activity `apps.cpu` chart, which shows an increase in CPU utilization from the `sql` group. There is a parallel increase in `apps.pwrites`, which visualizes writes to disk. 
-![Per-application CPU utilization -metrics](https://user-images.githubusercontent.com/1153921/101409725-8527da80-389b-11eb-96e9-9f401535aafc.png) +![Per-application CPU utilization metrics](https://user-images.githubusercontent.com/1153921/101409725-8527da80-389b-11eb-96e9-9f401535aafc.png) -![Per-application disk writing -metrics](https://user-images.githubusercontent.com/1153921/101409728-85c07100-389b-11eb-83fd-d79dd1545b5a.png) +![Per-application disk writing metrics](https://user-images.githubusercontent.com/1153921/101409728-85c07100-389b-11eb-83fd-d79dd1545b5a.png) Next, the `mysqlslap` utility queries the database to provide some benchmarking load on the MySQL database. It won't look exactly like a production database executing lots of user queries, but it gives you an idea into the possibility of @@ -240,8 +249,7 @@ sudo mysqlslap --user=sysadmin --password --host=localhost --concurrency=50 --i The following per-process disk utilization charts show spikes under the `sql` group at the same time `mysqlslap` was run numerous times, with slightly different concurrency and query options. -![Per-application disk -metrics](https://user-images.githubusercontent.com/1153921/101411810-d08fb800-389e-11eb-85b3-f3fa41f1f887.png) +![Per-application disk metrics](https://user-images.githubusercontent.com/1153921/101411810-d08fb800-389e-11eb-85b3-f3fa41f1f887.png) > 💡 Click on any dimension below a chart in Netdata Cloud (or to the right of a chart on a local Agent dashboard), to > visualize only that dimension. This can be particularly useful in process monitoring to separate one process' @@ -256,8 +264,7 @@ For example, running the above workload shows the entire "story" how MySQL inter processes/threads to handle a large number of SQL queries, then subsequently close the tasks as each query returns the relevant data. 
-![Per-process eBPF -charts](https://user-images.githubusercontent.com/1153921/101412395-c8844800-389f-11eb-86d2-20c8a0f7b3c0.png) +![Per-process eBPF charts](https://user-images.githubusercontent.com/1153921/101412395-c8844800-389f-11eb-86d2-20c8a0f7b3c0.png) `ebpf.plugin` visualizes additional eBPF metrics, which are system-wide and not per-process, under the **eBPF** section. @@ -267,35 +274,39 @@ Now that you have `apps_groups.conf` configured correctly, and know where to fin Netdata's ecosystem, you can precisely monitor the health and performance of any process on your node using per-second metrics. -For even more in-depth troubleshooting, see our guide on [monitoring and debugging applications with -eBPF](/docs/guides/troubleshoot/monitor-debug-applications-ebpf.md). +For even more in-depth troubleshooting, see our guide +on [monitoring and debugging applications with eBPF](https://github.com/netdata/netdata/blob/master/docs/guides/troubleshoot/monitor-debug-applications-ebpf.md). -If the process you're monitoring also has a [supported collector](/collectors/COLLECTORS.md), now is a great time to set +If the process you're monitoring also has +a [supported collector](https://github.com/netdata/netdata/blob/master/collectors/COLLECTORS.md), now is a great time to +set that up if it wasn't autodetected. With both process utilization and application-specific metrics, you should have every -piece of data needed to discover the root cause of an incident. See our [collector -setup](/docs/collect/enable-configure.md) doc for details. +piece of data needed to discover the root cause of an incident. See +our [collector setup](https://github.com/netdata/netdata/blob/master/docs/collect/enable-configure.md) doc for details. 
-[Create new dashboards](/docs/visualize/create-dashboards.md) in Netdata Cloud using charts from `apps.plugin`, +[Create new dashboards](https://github.com/netdata/netdata/blob/master/docs/visualize/create-dashboards.md) in Netdata +Cloud using charts from `apps.plugin`, `ebpf.plugin`, and application-specific collectors to build targeted dashboards for monitoring key processes across your infrastructure. -Tr