From fecbb89d0c33e2bbe84aa14c0b3204cb60134218 Mon Sep 17 00:00:00 2001 From: Joel Hans Date: Thu, 4 Jun 2020 09:05:25 -0700 Subject: Move/refactor docs to accomodate new Guides section on Learn (#9266) * Move directories and change verbiage to guide * Move health guides * Quick fix to collectors quickstart * Fix broken links * Remove health/tutorials dir * Fix links in collectors quickstart * Fix links to go.d pages --- docs/guides/collect-apache-nginx-web-logs.md | 161 +++++++++ docs/guides/collect-unbound-metrics.md | 138 ++++++++ docs/guides/longer-metrics-storage.md | 158 +++++++++ docs/guides/monitor-cockroachdb.md | 136 ++++++++ docs/guides/monitor-hadoop-cluster.md | 204 ++++++++++++ docs/guides/monitor/dimension-templates.md | 178 ++++++++++ docs/guides/monitor/stop-notifications-alarms.md | 94 ++++++ docs/guides/step-by-step/step-00.md | 115 +++++++ docs/guides/step-by-step/step-01.md | 156 +++++++++ docs/guides/step-by-step/step-02.md | 208 ++++++++++++ docs/guides/step-by-step/step-03.md | 91 ++++++ docs/guides/step-by-step/step-04.md | 144 +++++++++ docs/guides/step-by-step/step-05.md | 347 ++++++++++++++++++++ docs/guides/step-by-step/step-06.md | 122 +++++++ docs/guides/step-by-step/step-07.md | 114 +++++++ docs/guides/step-by-step/step-08.md | 395 +++++++++++++++++++++++ docs/guides/step-by-step/step-09.md | 163 ++++++++++ docs/guides/step-by-step/step-10.md | 230 +++++++++++++ docs/guides/step-by-step/step-99.md | 51 +++ docs/guides/using-host-labels.md | 210 ++++++++++++ 20 files changed, 3415 insertions(+) create mode 100644 docs/guides/collect-apache-nginx-web-logs.md create mode 100644 docs/guides/collect-unbound-metrics.md create mode 100644 docs/guides/longer-metrics-storage.md create mode 100644 docs/guides/monitor-cockroachdb.md create mode 100644 docs/guides/monitor-hadoop-cluster.md create mode 100644 docs/guides/monitor/dimension-templates.md create mode 100644 docs/guides/monitor/stop-notifications-alarms.md create mode 100644 
docs/guides/step-by-step/step-00.md create mode 100644 docs/guides/step-by-step/step-01.md create mode 100644 docs/guides/step-by-step/step-02.md create mode 100644 docs/guides/step-by-step/step-03.md create mode 100644 docs/guides/step-by-step/step-04.md create mode 100644 docs/guides/step-by-step/step-05.md create mode 100644 docs/guides/step-by-step/step-06.md create mode 100644 docs/guides/step-by-step/step-07.md create mode 100644 docs/guides/step-by-step/step-08.md create mode 100644 docs/guides/step-by-step/step-09.md create mode 100644 docs/guides/step-by-step/step-10.md create mode 100644 docs/guides/step-by-step/step-99.md create mode 100644 docs/guides/using-host-labels.md (limited to 'docs/guides') diff --git a/docs/guides/collect-apache-nginx-web-logs.md b/docs/guides/collect-apache-nginx-web-logs.md new file mode 100644 index 0000000000..6a32c8971d --- /dev/null +++ b/docs/guides/collect-apache-nginx-web-logs.md @@ -0,0 +1,161 @@ + + +# Monitor Nginx or Apache web server log files with Netdata + +Log files have been a critical resource for developers and system administrators who want to understand the health and +performance of their web servers, and Netdata is taking important steps to make them even more valuable. + +By parsing web server log files with Netdata, and seeing the volume of redirects, requests, or server errors over time, +you can better understand what's happening on your infrastructure. Too many bad requests? Maybe a recent deploy missed a +few small SVG icons. Too many requests? Time to batten down the hatches; it could be a DDoS. + +Netdata has been capable of monitoring web log files for quite some time, thanks to the [weblog python.d +module](/collectors/python.d.plugin/web_log/README.md), but we recently refactored this module in Go, and that effort +comes with a ton of improvements. + +You can now use the [LTSV log format](http://ltsv.org/), track TLS and cipher usage, and the whole parser is faster than +ever. 
In one test on a system with SSD storage, the collector consistently parsed the logs for 200,000 requests in +200ms, using ~30% of a single core. To learn more about these improvements, see our [v1.19 release post](https://blog.netdata.cloud/posts/release-1.19/). + +The [go.d plugin](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/weblog/) is currently compatible +with [Nginx](https://nginx.org/en/) and [Apache](https://httpd.apache.org/). + +This guide will walk you through using the new Go-based web log collector to turn the logs these web servers +constantly write to into real-time insights into your infrastructure. + +## Set up your web servers + +As with all data sources, Netdata can auto-detect Nginx or Apache servers if you installed them using their standard +installation procedures. + +Almost all web server installations will need _no_ configuration to start collecting metrics. As long as your web server +has a readable access log file, you can configure the web log plugin to access and parse it. + +## Configure the web log collector + +To use the Go version of this plugin, you need to explicitly enable it, and disable the deprecated Python version. +First, open `python.d.conf`: + +```bash +cd /etc/netdata/ # Replace with your Netdata configuration directory, if not /etc/netdata/ +./edit-config python.d.conf +``` + +Find the `web_log` line, uncomment it, and set it to `web_log: no`. Next, open the `go.d.conf` file for editing. + +```bash +./edit-config go.d.conf +``` + +Find the `web_log` line again, uncomment it, and set it to `web_log: yes`. + +Finally, restart Netdata with `service netdata restart`, or the appropriate method for your system. You should see +metrics in your Netdata dashboard! 
+ +![Example of real-time web server log metrics in Netdata's +dashboard](https://user-images.githubusercontent.com/1153921/69448130-2980c280-0d15-11ea-9fa5-6dcff25a92c3.png) + +If you don't see web log charts, or **web log nginx**/**web log apache** menus on the right-hand side of your dashboard, +continue reading for other configuration options. + +## Custom configuration of the web log collector + +The web log collector's default configuration comes with a few example jobs that should cover most Linux distributions +and their default locations for log files: + +```yaml +# [ JOBS ] +jobs: +# NGINX +# debian, arch + - name: nginx + path: /var/log/nginx/access.log + +# gentoo + - name: nginx + path: /var/log/nginx/localhost.access_log + +# APACHE +# debian + - name: apache + path: /var/log/apache2/access.log + +# gentoo + - name: apache + path: /var/log/apache2/access_log + +# arch + - name: apache + path: /var/log/httpd/access_log + +# debian + - name: apache_vhosts + path: /var/log/apache2/other_vhosts_access.log + +# GUNICORN + - name: gunicorn + path: /var/log/gunicorn/access.log + + - name: gunicorn + path: /var/log/gunicorn/gunicorn-access.log +``` + +However, if your log files were not auto-detected, it might be because they are in a different location. Try the default +`web_log.conf` file. + +```bash +./edit-config go.d/web_log.conf +``` + +To create a new custom configuration, you need to set the `path` parameter to point to your web server's access log +file. You can give it a `name` as well, and set the `log_type` to `auto`. + +```yaml +jobs: + - name: example + path: /path/to/file.log + log_type: auto +``` + +Restart Netdata with `service netdata restart` or the appropriate method for your system. Netdata should pick up your +web server's access log and begin showing real-time charts! 
+ +### Custom log formats and fields + +The web log collector is capable of parsing custom Nginx and Apache log formats and presenting them as charts, but we'll +leave that topic for a separate guide. + +We do have [extensive +documentation](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/weblog/#custom-log-format) on how +to build custom parsing for Nginx and Apache logs. + +## Tweak web log collector alarms + +Over time, we've created some default alarms for web log monitoring. These alarms are designed to work only when your +web server is receiving more than 120 requests per minute. Otherwise, there's simply not enough data to draw conclusions +about what is "too few" or "too many." + +- [web log alarms](https://raw.githubusercontent.com/netdata/netdata/master/health/health.d/web_log.conf). + +You can also edit this file directly with `edit-config`: + +```bash +./edit-config health.d/web_log.conf +``` + +For more information about editing the defaults or writing new alarm entities, see our [health monitoring +documentation](/health/README.md). + +## What's next? + +Now that you have web log collection up and running, we recommend you take a look at the documentation for our +[python.d](/collectors/python.d.plugin/web_log/README.md) for some ideas of how you can turn these rather "boring" logs +into powerful real-time tools for keeping your servers happy. + +Don't forget to give GitHub user [Wing924](https://github.com/Wing924) a big 👍 for his hard work in starting up the Go +refactoring effort. 
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fcollect-apache-nginx-web-logs&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/collect-unbound-metrics.md b/docs/guides/collect-unbound-metrics.md new file mode 100644 index 0000000000..2994647452 --- /dev/null +++ b/docs/guides/collect-unbound-metrics.md @@ -0,0 +1,138 @@ + + +# Monitor Unbound DNS servers with Netdata + +[Unbound](https://nlnetlabs.nl/projects/unbound/about/) is a "validating, recursive, caching DNS resolver" from NLNet +Labs. In v1.19 of Netdata, we released a completely refactored collector for collecting real-time metrics from Unbound +servers and displaying them in Netdata dashboards. + +Unbound runs on FreeBSD, OpenBSD, NetBSD, macOS, Linux, and Windows, and supports DNS-over-TLS, which ensures that DNS +queries and answers are all encrypted with TLS. In theory, that should reduce the risk of eavesdropping or +man-in-the-middle attacks when communicating with DNS servers. + +This guide will show you how to collect dozens of essential metrics from your Unbound servers with minimal +configuration. + +## Set up your Unbound installation + +As with all data sources, Netdata can auto-detect Unbound servers if you installed them using the standard installation +procedure. + +Regardless of whether you're connecting to a local or remote Unbound server, you need to be able to access the server's +`remote-control` interface via an IP address, FQDN, or Unix socket. + +To set up the `remote-control` interface, you can use `unbound-control`. First, run `unbound-control-setup` to generate +the TLS key files that will encrypt connections to the remote interface. Then add the following to the end of your +`unbound.conf` configuration file. 
See the [Unbound +documentation](https://nlnetlabs.nl/documentation/unbound/howto-setup/#setup-remote-control) for more details on using +`unbound-control`, such as how to handle situations when Unbound is run under a unique user. + +```conf +# enable remote-control +remote-control: + control-enable: yes +``` + +Next, make your `unbound.conf`, `unbound_control.key`, and `unbound_control.pem` files readable by Netdata using [access +control lists](https://wiki.archlinux.org/index.php/Access_Control_Lists) (ACL). + +```bash +sudo setfacl -m user:netdata:r unbound.conf +sudo setfacl -m user:netdata:r unbound_control.key +sudo setfacl -m user:netdata:r unbound_control.pem +``` + +Finally, take note whether you're using Unbound in _cumulative_ or _non-cumulative_ mode. This will become relevant when +configuring the collector. + +## Configure the Unbound collector + +You may not need to do any more configuration to have Netdata collect your Unbound metrics. + +If you followed the steps above to enable `remote-control` and make your Unbound files readable by Netdata, that should +be enough. Restart Netdata with `service netdata restart`, or the appropriate method for your system. You should see +Unbound metrics in your Netdata dashboard! + +![Some charts showing Unbound metrics in real-time](https://user-images.githubusercontent.com/1153921/69659974-93160f00-103c-11ea-88e6-27e9efcf8c0d.png) + +If that failed, you will need to manually configure `unbound.conf`. See the next section for details. + +### Manual setup for a local Unbound server + +To configure Netdata's Unbound collector module, navigate to your Netdata configuration directory (typically at +`/etc/netdata/`) and use `edit-config` to initialize and edit your Unbound configuration file. + +```bash +cd /etc/netdata/ # Replace with your Netdata configuration directory, if not /etc/netdata/ +sudo ./edit-config go.d/unbound.conf +``` + +The file contains all the global and job-related parameters. 
The `name` setting is required, and two Unbound servers +can't have the same name. + +> It is important you know whether your Unbound server is running in cumulative or non-cumulative mode, as a conflict +> between modes will create incorrect charts. + +Here are two examples for local Unbound servers, which may work based on your unique setup: + +```yaml +jobs: + - name: local + address: 127.0.0.1:8953 + cumulative: no + use_tls: yes + tls_skip_verify: yes + tls_cert: /path/to/unbound_control.pem + tls_key: /path/to/unbound_control.key + + - name: local + address: 127.0.0.1:8953 + cumulative: yes + use_tls: no +``` + +Netdata will attempt to read `unbound.conf` to get the appropriate `address`, `cumulative`, `use_tls`, `tls_cert`, and +`tls_key` parameters. + +Restart Netdata with `service netdata restart`, or the appropriate method for your system. + +### Manual setup for a remote Unbound server + +Collecting metrics from remote Unbound servers requires manual configuration. There are too many possibilities to cover +all remote connections here, but the [default `unbound.conf` +file](https://github.com/netdata/go.d.plugin/blob/master/config/go.d/unbound.conf) contains a few useful examples: + +```yaml +jobs: + - name: remote + address: 203.0.113.10:8953 + use_tls: no + + - name: remote_cumulative + address: 203.0.113.11:8953 + use_tls: no + cumulative: yes + + - name: remote + address: 203.0.113.10:8953 + cumulative: yes + use_tls: yes + tls_cert: /etc/unbound/unbound_control.pem + tls_key: /etc/unbound/unbound_control.key +``` + +To see all the available options, see the default [unbound.conf +file](https://github.com/netdata/go.d.plugin/blob/master/config/go.d/unbound.conf). + +## What's next? + +Now that you're collecting metrics from your Unbound servers, let us know how it's working for you! There's always room +for improvement or refinement based on real-world use cases. 
Feel free to [file an +issue](https://github.com/netdata/netdata/issues/new?labels=bug%2C+needs+triage&template=bug_report.md) with your +thoughts. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Funbound-metrics&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/longer-metrics-storage.md b/docs/guides/longer-metrics-storage.md new file mode 100644 index 0000000000..5c542f427f --- /dev/null +++ b/docs/guides/longer-metrics-storage.md @@ -0,0 +1,158 @@ + + +# Change how long Netdata stores metrics + +Netdata helps you collect thousands of system and application metrics every second, but what about storing them for the +long term? + +Many people think Netdata can only store about an hour's worth of real-time metrics, but that's simply not true any +more. With the right settings, Netdata is quite capable of efficiently storing hours or days worth of historical, +per-second metrics without having to rely on an [exporting engine](/exporting/README.md). + +This guide gives two options for configuring Netdata to store more metrics. **We recommend the default [database +engine](#using-the-database-engine)**, but you can stick with or switch to the round-robin database if you prefer. + +Let's get started. + +## Using the database engine + +The database engine uses RAM to store recent metrics while also using a "spill to disk" feature that takes advantage of +available disk space for long-term metrics storage. This feature of the database engine allows you to store a much +larger dataset than your system's available RAM. 
+ +The database engine is currently the default method of storing metrics, but if you're not sure which database you're +using, check out your `netdata.conf` file and look for the `memory mode` setting: + +```conf +[global] + memory mode = dbengine +``` + +If `memory mode` is set to anything but `dbengine`, change it and restart Netdata using the standard command for +restarting services on your system. You're now using the database engine! + +What makes the database engine efficient? While it's structured like a traditional database, the database engine splits +data between RAM and disk. The database engine caches and indexes data on RAM to keep memory usage low, and then +compresses older metrics onto disk for long-term storage. + +When the Netdata dashboard queries for historical metrics, the database engine will use its cache, stored in RAM, to +return relevant metrics for visualization in charts. + +Now, given that the database engine uses _both_ RAM and disk, there are two other settings to consider: `page cache +size` and `dbengine disk space`. + +```conf +[global] + page cache size = 32 + dbengine disk space = 256 +``` + +`page cache size` sets the maximum amount of RAM (in MiB) the database engine will use for caching and indexing. +`dbengine disk space` sets the maximum disk space (again, in MiB) the database engine will use for storing compressed +metrics. The default settings retain about two days' worth of metrics on a system collecting 2,000 metrics every second. + +[**See our database engine calculator**](https://learn.netdata.cloud/docs/agent/database/calculator) to help you +correctly set `dbengine disk space` based on your needs. The calculator gives an accurate estimate based on how many +slave nodes you have, how many metrics your Agent collects, and more. + +With the database engine active, you can back up your `/var/cache/netdata/dbengine/` folder to another location for +redundancy. 
+ +Now that you know how to switch to the database engine, let's cover the round-robin database for those who +aren't ready to make the move. + +## Using the round-robin database + +In previous versions, Netdata used a round-robin database to store 1 hour of per-second metrics. + +To see if you're still using this database, or if you would like to switch to it, open your `netdata.conf` file and see +if the `memory mode` option is set to `save`. + +```conf +[global] + memory mode = save +``` + +If `memory mode` is set to `save`, then you're using the round-robin database. If so, the `history` option defaults to +`3600`, which is equivalent to 3,600 seconds, or one hour. + +To increase your historical metrics, you can increase `history` to the number of seconds you'd like to store: + +```conf +[global] + # 2 hours = 2 * 60 * 60 = 7200 seconds + history = 7200 + # 4 hours = 4 * 60 * 60 = 14400 seconds + history = 14400 + # 24 hours = 24 * 60 * 60 = 86400 seconds + history = 86400 +``` + +And so on. + +Next, check to see how many metrics Netdata collects on your system, and how much RAM that uses. Visit the Netdata +dashboard and look at the bottom-right corner of the interface. You'll find a sentence similar to the following: + +> Every second, Netdata collects 1,938 metrics, presents them in 299 charts and monitors them with 81 alarms. Netdata is +> using 25 MB of memory on **netdata-linux** for 1 hour, 6 minutes and 36 seconds of real-time history. + +On this desktop system, using a Ryzen 5 1600 and 16GB of RAM, the round-robin database uses 25 MB of RAM to store just +over an hour's worth of data for nearly 2,000 metrics. + +To increase the `history` option, you need to edit your `netdata.conf` file and increase the `history` setting. In most +installations, you'll find it at `/etc/netdata/netdata.conf`, but some operating systems place it at +`/opt/netdata/etc/netdata/netdata.conf`. 
+ +Use `/etc/netdata/edit-config netdata.conf`, or your favorite text editor, to replace `3600` with the number of seconds +you'd like to store. + +You should base this number on two things: How much history you need for your use case, and how much RAM you're willing +to dedicate to Netdata. + +> Take care when you change the `history` option on production systems. Netdata is configured to stop its process if +> your system starts running out of RAM, but you can never be too careful. Out of memory situations are very bad. + +How much RAM will a longer history use? Let's use a little math. + +The round-robin database needs 4 bytes for every value Netdata collects. If Netdata collects metrics every second, +that's 4 bytes, per second, per metric. + +```text +4 bytes * X seconds * Y metrics = RAM usage in bytes +``` + +Let's assume your system collects 1,000 metrics per second. + +```text +4 bytes * 3600 seconds * 1,000 metrics = 14400000 bytes = 14.4 MB RAM +``` + +With that formula, you can calculate the RAM usage for much larger history settings. + +```conf +# 2 hours at 1,000 metrics per second +4 bytes * 7200 seconds * 1,000 metrics = 28800000 bytes = 28.8 MB RAM +# 2 hours at 2,000 metrics per second +4 bytes * 7200 seconds * 2,000 metrics = 57600000 bytes = 57.6 MB RAM +# 4 hours at 2,000 metrics per second +4 bytes * 14400 seconds * 2,000 metrics = 115200000 bytes = 115.2 MB RAM +# 24 hours at 1,000 metrics per second +4 bytes * 86400 seconds * 1,000 metrics = 345600000 bytes = 345.6 MB RAM +``` + +## What's next? + +Now that you have configured either the database engine or the round-robin database to store more metrics, you'll +probably want to see it in action! + +For more information about how to pan charts to view historical metrics, see our documentation on [using +charts](/web/README.md#using-charts). + +And if you'd now like to reduce Netdata's resource usage, view our [performance guide](/docs/Performance.md) for +our best practices on optimization. 
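
The round-robin RAM formula from this guide (4 bytes per value, per second, per metric) is easy to script if you want to try several `history` values before editing `netdata.conf`. The sketch below is only an illustration: `history_seconds` and `ram_bytes` are hypothetical helper names, it assumes per-second collection, and it reports decimal megabytes (1 MB = 1,000,000 bytes) as the guide does.

```python
BYTES_PER_VALUE = 4  # the round-robin database stores each collected value in 4 bytes


def history_seconds(hours):
    """Convert a retention period in hours to a `history` value in seconds."""
    return int(hours * 60 * 60)


def ram_bytes(history, metrics):
    """Estimate round-robin RAM usage: 4 bytes * seconds * metrics."""
    return BYTES_PER_VALUE * history * metrics


if __name__ == "__main__":
    # (hours of history, metrics collected per second)
    for hours, metrics in [(1, 1000), (2, 2000), (4, 2000), (24, 1000)]:
        seconds = history_seconds(hours)
        megabytes = ram_bytes(seconds, metrics) / 1_000_000
        print(f"{hours:>2} h, {metrics} metrics: history = {seconds}, ~{megabytes} MB")
```

Swap in the metric count shown at the bottom of your own dashboard to get an estimate for your system.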
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Flonger-metrics-storage&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/monitor-cockroachdb.md b/docs/guides/monitor-cockroachdb.md new file mode 100644 index 0000000000..fd0e7db643 --- /dev/null +++ b/docs/guides/monitor-cockroachdb.md @@ -0,0 +1,136 @@ + + +# Monitor CockroachDB metrics with Netdata + +[CockroachDB](https://github.com/cockroachdb/cockroach) is an open-source project that brings SQL databases into +scalable, disaster-resilient cloud deployments. Thanks to a [new CockroachDB +collector](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/cockroachdb/) released in +[v1.20](https://blog.netdata.cloud/posts/release-1.20/), you can now monitor any number of CockroachDB databases with +maximum granularity using Netdata. Collect more than 50 unique metrics and put them on interactive visualizations +designed for better visual anomaly detection. + +Netdata itself uses CockroachDB as part of its Netdata Cloud infrastructure, so we're happy to introduce this new +collector and help others get started with it straightaway. + +Let's dive in and walk through the process of monitoring CockroachDB metrics with Netdata. + +## What's in this guide + +- [Configure the CockroachDB collector](#configure-the-cockroachdb-collector) + - [Manual setup for a local CockroachDB database](#manual-setup-for-a-local-cockroachdb-database) +- [Tweak CockroachDB alarms](#tweak-cockroachdb-alarms) + +## Configure the CockroachDB collector + +Because _all_ of Netdata's collectors can auto-detect the services they monitor, you _shouldn't_ need to worry about +configuring CockroachDB. Netdata only needs to regularly query the database's `_status/vars` page to gather metrics and +display them on the dashboard. 
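
To see roughly what the collector reads, you can fetch `_status/vars` yourself. The sketch below is an illustration, not the collector's code. It assumes the endpoint returns Prometheus-style plain text (one `name value` sample per line, `#` for comment lines, no trailing timestamps) and that the node listens on `localhost:8080`. The names `fetch_vars` and `parse_samples` are hypothetical.

```python
from urllib.request import urlopen


def fetch_vars(url="http://localhost:8080/_status/vars", timeout=5):
    """Download the raw metrics text from a CockroachDB node (assumed URL)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_samples(text):
    """Map each sample key to its float value, skipping comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces stay intact.
        key, sep, value = line.rpartition(" ")
        if not sep:
            continue
        try:
            samples[key] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


if __name__ == "__main__":
    metrics = parse_samples(fetch_vars())
    print(f"{len(metrics)} samples exposed at _status/vars")
```

If the request fails or returns no samples, Netdata's auto-detection is unlikely to succeed either, and a manual job with the correct `url` is the next thing to try.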
+ +If your CockroachDB instance is accessible through `http://localhost:8080/` or `http://127.0.0.1:8080`, your setup is +complete. Restart Netdata with `service netdata restart`, or use the [appropriate +method](../getting-started.md#start-stop-and-restart-netdata) for your system, and refresh your browser. You should see +CockroachDB metrics in your Netdata dashboard! + +
+ Figure: CPU utilization charts from a CockroachDB database monitored by Netdata
+ +> Note: Netdata collects metrics from CockroachDB every 10 seconds, instead of our usual 1 second, because CockroachDB +> only updates `_status/vars` every 10 seconds. You can't change this setting in CockroachDB. + +If you don't see CockroachDB charts, you may need to configure the collector manually. + +### Manual setup for a local CockroachDB database + +To configure Netdata's CockroachDB collector, navigate to your Netdata configuration directory (typically at +`/etc/netdata/`) and use `edit-config` to initialize and edit your CockroachDB configuration file. + +```bash +cd /etc/netdata/ # Replace with your Netdata configuration directory, if not /etc/netdata/ +./edit-config go.d/cockroachdb.conf +``` + +Scroll down to the `[JOBS]` section at the bottom of the file. You will see the two default jobs there, which you can +edit, or create a new job with any of the parameters listed above in the file. Both the `name` and `url` values are +required, and everything else is optional. + +For a production cluster, you'll use either an IP address or the system's hostname. Be sure that your remote system +allows TCP communication on port 8080, or whichever port you have configured CockroachDB's [Admin +UI](https://www.cockroachlabs.com/docs/stable/monitoring-and-alerting.html#prometheus-endpoint) to listen on. + +```yaml +# [ JOBS ] +jobs: + - name: remote + url: http://203.0.113.0:8080/_status/vars + + - name: remote_hostname + url: http://cockroachdb.example.com:8080/_status/vars +``` + +For a secure cluster, use `https` in the `url` field instead. 
+ +```yaml +# [ JOBS ] +jobs: + - name: remote + url: https://203.0.113.0:8080/_status/vars + tls_skip_verify: yes # If your certificate is self-signed + + - name: remote_hostname + url: https://cockroachdb.example.com:8080/_status/vars + tls_skip_verify: yes # If your certificate is self-signed +``` + +You can add as many jobs as you'd like based on how many CockroachDB databases you have—Netdata will create separate +charts for each job. Once you've edited `cockroachdb.conf` according to the needs of your infrastructure, restart +Netdata to see your new charts. + +
+ Figure: Charts showing a node failure during a simulated test
+ +## Tweak CockroachDB alarms + +This release also includes eight pre-configured alarms for live nodes, such as whether the node is live, storage +capacity, issues with replication, and the number of SQL connections/statements. See [health.d/cockroachdb.conf on +GitHub](https://raw.githubusercontent.com/netdata/netdata/master/health/health.d/cockroachdb.conf) for details. + +You can also edit these files directly with `edit-config`: + +```bash +cd /etc/netdata/ # Replace with your Netdata configuration directory, if not /etc/netdata/ +./edit-config health.d/cockroachdb.conf # You may need to use `sudo` for write privileges +``` + +For more information about editing the defaults or writing new alarm entities, see our health monitoring [quickstart +guide](/health/QUICKSTART.md). + +## What's next? + +Now that you're collecting metrics from your CockroachDB databases, let us know how it's working for you! There's always +room for improvement or refinement based on real-world use cases. Feel free to [file an +issue](https://github.com/netdata/netdata/issues/new?labels=bug%2C+needs+triage&template=bug_report.md) with your +thoughts. 
+ +Also, be sure to check out these useful resources: + +- [Netdata's CockroachDB + documentation](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/cockroachdb/) +- [Netdata's CockroachDB + configuration](https://github.com/netdata/go.d.plugin/blob/master/config/go.d/cockroachdb.conf) +- [Netdata's CockroachDB + alarms](https://github.com/netdata/netdata/blob/29d9b5e51603792ee27ef5a21f1de0ba8e130158/health/health.d/cockroachdb.conf) +- [CockroachDB homepage](https://www.cockroachlabs.com/product/) +- [CockroachDB documentation](https://www.cockroachlabs.com/docs/stable/) +- [`_status/vars` endpoint + docs](https://www.cockroachlabs.com/docs/stable/monitoring-and-alerting.html#prometheus-endpoint) +- [Monitor CockroachDB with + Prometheus](https://www.cockroachlabs.com/docs/stable/monitor-cockroachdb-with-prometheus.html) + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fmonitor-cockroachdb&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/monitor-hadoop-cluster.md b/docs/guides/monitor-hadoop-cluster.md new file mode 100644 index 0000000000..17901f2815 --- /dev/null +++ b/docs/guides/monitor-hadoop-cluster.md @@ -0,0 +1,204 @@ + + +# Monitor a Hadoop cluster with Netdata + +Hadoop, an [Apache project](https://hadoop.apache.org/), is a framework for processing large sets of data across a +distributed cluster of systems. + +And while Hadoop is designed to be a highly-available and fault-tolerant service, those who operate a Hadoop cluster +will want to monitor the health and performance of their [Hadoop Distributed File System +(HDFS)](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) and [Zookeeper](https://zookeeper.apache.org/) +implementations. + +Netdata comes with built-in and pre-configured support for monitoring both HDFS and Zookeeper. 
+ +This guide assumes you have a Hadoop cluster, with HDFS and Zookeeper, running already. If you don't, please follow +the [official Hadoop +instructions](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html) or an +alternative, like the guide available from +[DigitalOcean](https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-in-stand-alone-mode-on-ubuntu-18-04). + +For more specifics on the collection modules used in this guide, read the respective pages in our documentation: + +- [HDFS](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/hdfs) +- [Zookeeper](https://learn.netdata.cloud/docs/agent/collectors/go.d.plugin/modules/zookeeper) + +## Set up your HDFS and Zookeeper installations + +As with all data sources, Netdata can auto-detect HDFS and Zookeeper nodes if you installed them using the standard +installation procedure. + +For Netdata to collect HDFS metrics, it needs to be able to access the node's `/jmx` endpoint. You can test whether a +JMX endpoint is accessible by using `curl HDFS-IP:PORT/jmx`. For a NameNode, you should see output similar to the +following: + +```json +{ + "beans" : [ { + "name" : "Hadoop:service=NameNode,name=JvmMetrics", + "modelerType" : "JvmMetrics", + "MemNonHeapUsedM" : 65.67851, + "MemNonHeapCommittedM" : 67.3125, + "MemNonHeapMaxM" : -1.0, + "MemHeapUsedM" : 154.46341, + "MemHeapCommittedM" : 215.0, + "MemHeapMaxM" : 843.0, + "MemMaxM" : 843.0, + "GcCount" : 15, + "GcTimeMillis" : 305, + "GcNumWarnThresholdExceeded" : 0, + "GcNumInfoThresholdExceeded" : 0, + "GcTotalExtraSleepTime" : 92, + "ThreadsNew" : 0, + "ThreadsRunnable" : 6, + "ThreadsBlocked" : 0, + "ThreadsWaiting" : 7, + "ThreadsTimedWaiting" : 34, + "ThreadsTerminated" : 0, + "LogFatal" : 0, + "LogError" : 0, + "LogWarn" : 2, + "LogInfo" : 348 + }, + { ... 
} + ] +} +``` + +The JSON result for a DataNode's `/jmx` endpoint is slightly different: + +```json +{ + "beans" : [ { + "name" : "Hadoop:service=DataNode,name=DataNodeActivity-dev-slave-01.dev.loc +al-9866", + "modelerType" : "DataNodeActivity-dev-slave-01.dev.local-9866", + "tag.SessionId" : null, + "tag.Context" : "dfs", + "tag.Hostname" : "dev-slave-01.dev.local", + "BytesWritten" : 500960407, + "TotalWriteTime" : 463, + "BytesRead" : 80689178, + "TotalReadTime" : 41203, + "BlocksWritten" : 16, + "BlocksRead" : 16, + "BlocksReplicated" : 4, + ... + }, + { ... } + ] +} +``` + +If Netdata can't access the `/jmx` endpoint for either a NameNode or DataNode, it will not be able to auto-detect and +collect metrics from your HDFS implementation. + +Zookeeper auto-detection relies on an accessible client port and a whitelisted `mntr` command. For more details on +`mntr`, see Zookeeper's documentation on [cluster +options](https://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_clusterOptions) and [Zookeeper +commands](https://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_zkCommands). + +## Configure the HDFS and Zookeeper modules + +To configure Netdata's HDFS module, navigate to your Netdata directory (typically at `/etc/netdata/`) and use +`edit-config` to initialize and edit your HDFS configuration file. + +```bash +cd /etc/netdata/ +sudo ./edit-config go.d/hdfs.conf +``` + +At the bottom of the file, you will see two example jobs, both of which are commented out: + +```yaml +# [ JOBS ] +#jobs: +# - name: namenode +# url: http://127.0.0.1:9870/jmx +# +# - name: datanode +# url: http://127.0.0.1:9864/jmx +``` + +Uncomment these lines and edit the `url` value(s) according to your setup. Now's the time to add any other configuration +details, which you can find inside of the `hdfs.conf` file itself. Most production implementations will require TLS +certificates. 
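
Before editing the `url` values, it can help to confirm which kind of node answers at each address. The following is a minimal sketch, not part of Netdata: it assumes the bean names shown in the `/jmx` output above (`Hadoop:service=NameNode` and `Hadoop:service=DataNode`), and the function name `jmx_node_type` is made up for this example.

```bash
# Hypothetical helper: classify a /jmx response by the bean names it contains.
# Reads the JSON body on stdin, so it works with curl output or a saved file.
jmx_node_type() {
  local body
  body=$(cat)
  if printf '%s' "$body" | grep -q 'Hadoop:service=NameNode'; then
    echo "namenode"
  elif printf '%s' "$body" | grep -q 'Hadoop:service=DataNode'; then
    echo "datanode"
  else
    # Neither bean name was found: wrong port, wrong service, or no response.
    echo "unknown"
    return 1
  fi
}

# Example usage, with the addresses from the commented-out example jobs
# (adjust them to your setup):
#   curl -s http://127.0.0.1:9870/jmx | jmx_node_type
#   curl -s http://127.0.0.1:9864/jmx | jmx_node_type
```

If the function reports `unknown` for an address, Netdata will most likely be unable to collect HDFS metrics from it either, so fix the address or port before enabling the job.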
+ +The result for a simple HDFS setup, running entirely on `localhost` and without certificate authentication, might look +like this: + +```yaml +# [ JOBS ] +jobs: + - name: namenode + url: http://127.0.0.1:9870/jmx + + - name: datanode + url: http://127.0.0.1:9864/jmx +``` + +At this point, Netdata should be configured to collect metrics from your HDFS servers. Let's move on to Zookeeper. + +Next, use `edit-config` again to initialize/edit your `zookeeper.conf` file. + +```bash +cd /etc/netdata/ +sudo ./edit-config go.d/zookeeper.conf +``` + +As with the `hdfs.conf` file, head to the bottom, uncomment the example jobs, and tweak the `address` values according +to your setup. Again, you may need to add further configuration options, like TLS certificates. + +```yaml +jobs: + - name : local + address : 127.0.0.1:2181 + + - name : remote + address : 203.0.113.10:2182 +``` + +Finally, restart Netdata. + +```sh +sudo service netdata restart +``` + +Upon restart, Netdata should recognize your HDFS/Zookeeper servers, enable the HDFS and Zookeeper modules, and begin +showing real-time metrics for both in your Netdata dashboard. 🎉 + +## Configuring HDFS and Zookeeper alarms + +The Netdata community helped us create sane defaults for alarms related to both HDFS and Zookeeper. You may want to +investigate these to ensure they work well with your Hadoop implementation. + +- [HDFS alarms](https://raw.githubusercontent.com/netdata/netdata/master/health/health.d/hdfs.conf) +- [Zookeeper alarms](https://raw.githubusercontent.com/netdata/netdata/master/health/health.d/zookeeper.conf) + +You can also access/edit these files directly with `edit-config`: + +```bash +sudo /etc/netdata/edit-config health.d/hdfs.conf +sudo /etc/netdata/edit-config health.d/zookeeper.conf +``` + +For more information about editing the defaults or writing new alarm entities, see our [health monitoring +documentation](/health/README.md). + +## What's next?
+ +If you're having issues with Netdata auto-detecting your HDFS/Zookeeper servers, or want to help improve how Netdata +collects or presents metrics from these services, feel free to [file an +issue](https://github.com/netdata/netdata/issues/new?labels=bug%2C+needs+triage&template=bug_report.md). + +- Read up on the [HDFS configuration + file](https://github.com/netdata/go.d.plugin/blob/master/config/go.d/hdfs.conf) to understand how to configure + global options or per-job options, such as username/password, TLS certificates, timeouts, and more. +- Read up on the [Zookeeper configuration + file](https://github.com/netdata/go.d.plugin/blob/master/config/go.d/zookeeper.conf) to understand how to configure + global options or per-job options, timeouts, TLS certificates, and more. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fmonitor-hadoop-cluster&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/monitor/dimension-templates.md b/docs/guides/monitor/dimension-templates.md new file mode 100644 index 0000000000..5fe3ae9632 --- /dev/null +++ b/docs/guides/monitor/dimension-templates.md @@ -0,0 +1,178 @@ + + +# Use dimension templates to create dynamic alarms + +Your ability to monitor the health of your systems and applications relies on your ability to create and maintain +the best set of alarms for your particular needs. + +In v1.18 of Netdata, we introduced **dimension templates** for alarms, which simplifies the process of writing [alarm +entities](/health/REFERENCE.md#health-entity-reference) for charts with many dimensions. + +Dimension templates can condense many individual entities into one—no more copy-pasting one entity and changing the +`alarm`/`template` and `lookup` lines for each dimension you'd like to monitor. 
+ +They are, however, an advanced health monitoring feature. For more basic instructions on creating your first alarm, +check out our [health monitoring documentation](/health/README.md), which also includes +[examples](/health/REFERENCE.md#example-alarms). + +## The fundamentals of `foreach` + +Our dimension templates update creates a new `foreach` parameter to the existing [`lookup` +line](/health/REFERENCE.md#alarm-line-lookup). This is where the magic happens. + +You use the `foreach` parameter to specify which dimensions you want to monitor with this single alarm. You can separate +them with a comma (`,`) or a pipe (`|`). You can also use a [Netdata simple pattern](/libnetdata/simple_pattern/README.md) +to create many alarms with a regex-like syntax. + +The `foreach` parameter _has_ to be the last parameter in your `lookup` line, and if you have both `of` and `foreach` in +the same `lookup` line, Netdata will ignore the `of` parameter and use `foreach` instead. + +Let's get into some examples so you can see how the new parameter works. + +> ⚠️ The following entities are examples to showcase the functionality and syntax of dimension templates. They are not +> meant to be run as-is on production systems. + +## Condensing entities with `foreach` + +Let's say you want to monitor the `system`, `user`, and `nice` dimensions in your system's overall CPU utilization. +Before dimension templates, you would need the following three entities: + +```yaml + alarm: cpu_system + on: system.cpu +lookup: average -10m percentage of system + every: 1m + warn: $this > 50 + crit: $this > 80 + + alarm: cpu_user + on: system.cpu +lookup: average -10m percentage of user + every: 1m + warn: $this > 50 + crit: $this > 80 + + alarm: cpu_nice + on: system.cpu +lookup: average -10m percentage of nice + every: 1m + warn: $this > 50 + crit: $this > 80 +``` + +With dimension templates, you can condense these into a single alarm. Take note of the `alarm` and `lookup` lines. 
+ +```yaml + alarm: cpu_template + on: system.cpu +lookup: average -10m percentage foreach system,user,nice + every: 1m + warn: $this > 50 + crit: $this > 80 +``` + +The `alarm` line specifies the naming scheme Netdata will use. You can use whatever naming scheme you'd like, with `.` +and `_` being the only allowed symbols. + +The `lookup` line has changed from `of` to `foreach`, and we're now passing three dimensions. + +In this example, Netdata will create three alarms with the names `cpu_template_system`, `cpu_template_user`, and +`cpu_template_nice`. Every minute, each alarm will use the same database query to calculate the average CPU usage for +the `system`, `user`, and `nice` dimensions over the last 10 minutes and send out alarms if necessary. + +You can find these three alarms active by clicking on the **Alarms** button in the top navigation, and then clicking on +the **All** tab and scrolling to the **system - cpu** collapsible section. + +![Three new alarms created from the dimension template](https://user-images.githubusercontent.com/1153921/66218994-29523800-e67f-11e9-9bcb-9bca23e2c554.png) + +Let's look at some other examples of how `foreach` works so you can best apply it in your configurations. + +### Using a Netdata simple pattern in `foreach` + +In the last example, we used `foreach system,user,nice` to create three distinct alarms using dimension templates. But +what if you want to quickly create alarms for _all_ the dimensions of a given chart? + +Use a [simple pattern](/libnetdata/simple_pattern/README.md)! One example of a simple pattern is a single wildcard +(`*`). + +Instead of monitoring system CPU usage, let's monitor per-application CPU usage using the `apps.cpu` chart. 
Passing a +wildcard as the simple pattern tells Netdata to create a separate alarm for _every_ process on your system: + +```yaml + alarm: app_cpu + on: apps.cpu +lookup: average -10m percentage foreach * + every: 1m + warn: $this > 50 + crit: $this > 80 +``` + +This entity will now create alarms for every dimension in the `apps.cpu` chart. Given that most `apps.cpu` charts have +10 or more dimensions, using the wildcard ensures you catch every CPU-hogging process. + +To learn more about how to use simple patterns with dimension templates, see our [simple patterns +documentation](/libnetdata/simple_pattern/README.md). + +## Using `foreach` with alarm templates + +Dimension templates also work with [alarm templates](/health/REFERENCE.md#alarm-line-alarm-or-template). Alarm +templates help you create alarms for all the charts with a given context, such as all the cores of your system's +CPU. + +By combining the two, you can create dozens of individual alarms with a single template entity. Here's how you would +create alarms for the `system`, `user`, and `nice` dimensions for every chart in the `cpu.cpu` context (in other +words, every CPU core). + +```yaml +template: cpu_template + on: cpu.cpu + lookup: average -10m percentage foreach system,user,nice + every: 1m + warn: $this > 50 + crit: $this > 80 +``` + +On a system with a 6-core, 12-thread Ryzen 5 1600 CPU, this one entity creates alarms on the following charts and +dimensions: + +- `cpu.cpu0` + - `cpu_template_user` + - `cpu_template_system` + - `cpu_template_nice` +- `cpu.cpu1` + - `cpu_template_user` + - `cpu_template_system` + - `cpu_template_nice` +- `cpu.cpu2` + - `cpu_template_user` + - `cpu_template_system` + - `cpu_template_nice` +- ... +- `cpu.cpu11` + - `cpu_template_user` + - `cpu_template_system` + - `cpu_template_nice` + +Here is how a few of those dimension template-generated alarms look in the Netdata dashboard.
+ +![A few of the created alarms in the Netdata dashboard](https://user-images.githubusercontent.com/1153921/66219669-708cf880-e680-11e9-8b3a-7bfe178fa28b.png) + +All in all, this single entity creates 36 individual alarms. Much easier than writing 36 separate entities in your +health configuration files! + +## What's next? + +We hope you're excited about the possibilities of using dimension templates! Maybe they'll inspire you to build new +alarms that will help you better monitor the health of your systems. + +Or, at the very least, simplify your configuration files. + +For information about other advanced features in Netdata's health monitoring toolkit, check out our [health +documentation](/health/README.md). And if you have some cool alarms you built using dimension templates, + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fmonitor%2dimension-templates&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/monitor/stop-notifications-alarms.md b/docs/guides/monitor/stop-notifications-alarms.md new file mode 100644 index 0000000000..197b6b0ebf --- /dev/null +++ b/docs/guides/monitor/stop-notifications-alarms.md @@ -0,0 +1,94 @@ + + +# Stop notifications for individual alarms + +In this short tutorial, you'll learn how to stop notifications for individual alarms in Netdata's health +monitoring system. We also refer to this process as _silencing_ the alarm. + +Why silence alarms? We designed Netdata's pre-configured alarms for production systems, so they might not be +relevant if you run Netdata on your laptop or a small virtual server. If they're not helpful, they can be a distraction +to real issues with health and performance. 
+ +Silencing individual alarms is an excellent solution for situations where you're not interested in seeing a specific +alarm but don't want to disable a [notification system](/health/notifications/README.md) entirely. + +## Find the alarm configuration file + +To silence an alarm, you need to know where to find its configuration file. + +Let's use the `system.cpu` chart as an example. It's the first chart you'll see on most Netdata dashboards. + +To figure out which file you need to edit, open up Netdata's dashboard, click the **Alarms** button at the top +of the dashboard, and then click the **All** tab. + +In this example, we're looking for the `system - cpu` entity, which, when opened, looks like this: + +![The system - cpu alarm +entity](https://user-images.githubusercontent.com/1153921/67034648-ebb4cc80-f0cc-11e9-9d49-1023629924f5.png) + +In the `source` row, you see that this chart is getting its configuration from +`4@/usr/lib/netdata/conf.d/health.d/cpu.conf`. The relevant part begins at `health.d`: `health.d/cpu.conf`. That's +the file you need to edit if you want to silence this alarm. + +For more information about editing or referencing health configuration files on your system, see the [health +quickstart](/health/QUICKSTART.md#edit-health-configuration-files). + +## Edit the file to enable silencing + +To edit `health.d/cpu.conf`, use `edit-config` from inside of your Netdata configuration directory. + +```bash +cd /etc/netdata/ # Replace with your Netdata configuration directory, if not /etc/netdata/ +./edit-config health.d/cpu.conf +``` + +> You may need to use `sudo` or another method of elevating your privileges. + +The beginning of the file looks like this: + +```yaml +template: 10min_cpu_usage + on: system.cpu + os: linux + hosts: * + lookup: average -10m unaligned of user,system,softirq,irq,guest + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal) + to: sysadmin +``` + +To silence this alarm, change `sysadmin` to `silent`. + +```yaml + to: silent +``` + +Use `killall -USR2 netdata` to reload your health configuration and ensure you get no more notifications about that +alarm. + +You can add `to: silent` to any alarm you'd rather not bother you with notifications. + +## What's next? + +You should now know the fundamentals behind silencing any individual alarm in Netdata. + +To learn about _all_ of Netdata's health configuration possibilities, visit the [health reference +guide](/health/REFERENCE.md), or check out other [tutorials on health monitoring](/health/README.md#tutorials). + +Or, take better control over how you get notified about alarms via the [notification +system](/health/notifications/README.md). + +You can also use Netdata's [Health Management API](/web/api/health/README.md#health-management-api) to control health +checks and notifications while Netdata runs. With this API, you can disable health checks during a maintenance window or +backup process, for example. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fmonitor%2Fstop-notifications-alarms%2F&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/step-by-step/step-00.md b/docs/guides/step-by-step/step-00.md new file mode 100644 index 0000000000..794366645a --- /dev/null +++ b/docs/guides/step-by-step/step-00.md @@ -0,0 +1,115 @@ + + +# The step-by-step Netdata guide + +Welcome to Netdata! We're glad you're interested in our health monitoring and performance troubleshooting system.
+ +Because Netdata is entirely open-source software, you can use it free of charge, whether you want to monitor one or ten +thousand systems! All our code is hosted on [GitHub](https://github.com/netdata/netdata). + +This guide is designed to help you understand what Netdata is, what it's capable of, and how it'll help you make +faster and more informed decisions about the health and performance of your systems and applications. If you're +completely new to Netdata, or have never tried health monitoring/performance troubleshooting systems before, this +guide is perfect for you. + +If you have monitoring experience, or would rather get straight into configuring Netdata to your needs, you can jump +straight into code and configurations with our [getting started guide](/docs/getting-started.md). + +> This guide contains instructions for Netdata installed on a Linux system. Many of the instructions will work on +> other supported operating systems, like FreeBSD and macOS, but we can't make any guarantees. + +## Where to go if you need help + +No matter where you are in this Netdata guide, if you need help, head over to our [GitHub +repository](https://github.com/netdata/netdata/). That's where we collect questions from users, help fix their bugs, and +point people toward documentation that explains what they're having trouble with. + +Click on the **issues** tab to see all the conversations we're having with Netdata users. Use the search bar to find +previously-written advice for your specific problem, and if you don't see any results, hit the **New issue** button to +send us a question. + +Or, if that's too complicated, feel free to send this guide's author [an email](mailto:joel@netdata.cloud). + +## Before we get started + +Let's make sure you have Netdata installed on your system! + +> If you already installed Netdata, feel free to skip to [Step 1: Netdata's building blocks](step-01.md). 
+ +The easiest way to install Netdata on a Linux system is our `kickstart.sh` one-line installer. Run this on your system +and let it take care of the rest. + +This script will install Netdata from source, keep it up to date with nightly releases, connect to the Netdata +[registry](/registry/README.md), and send [_anonymous statistics_](/docs/anonymous-statistics.md) about how you use +Netdata. We use this information to better understand how we can improve the Netdata experience for all our users. + +```bash +bash <(curl -Ss https://my-netdata.io/kickstart.sh) +``` + +Once finished, you'll have Netdata installed, and you'll be set up to get _nightly updates_ with the latest features, +improvements, and bugfixes. + +If this method doesn't work for you, or you want to use a different process, visit our [installation +documentation](/packaging/installer/README.md) for details. + +## Netdata fundamentals + +[Step 1. Netdata's building blocks](step-01.md) + +In this introductory step, we'll talk about the fundamental ideas, philosophies, and UX decisions behind Netdata. + +[Step 2. Get to know Netdata's dashboard](step-02.md) + +Visit Netdata's dashboard to explore, manipulate charts, and check out alarms. Get your first taste of visual anomaly +detection. + +[Step 3. Monitor more than one system with Netdata](step-03.md) + +While the dashboard lets you quickly move from one agent to another, Netdata Cloud is our SaaS solution for monitoring +the health of many systems. We'll cover its features and the benefits of using Netdata Cloud on top of the dashboard. + +[Step 4. The basics of configuring Netdata](step-04.md) + +While Netdata can monitor thousands of metrics in real-time without any configuration, you may _want_ to tweak some +settings based on your system's resources. + +## Intermediate steps + +[Step 5. Health monitoring alarms and notifications](step-05.md) + +Learn how to tune, silence, and write custom alarms.
Then enable notifications so you never miss a change in health +status or performance anomaly. + +[Step 6. Collect metrics from more services and apps](step-06.md) + +Learn how to enable/disable collection plugins and configure a collection plugin job to add more charts to your Netdata +dashboard and begin monitoring more apps and services, like MySQL, Nginx, MongoDB, and hundreds more. + +[Step 7. Netdata's dashboard in depth](step-07.md) + +Now that you configured your Netdata monitoring agent to your exact needs, you'll dive back into metrics snapshots, +updates, and the dashboard's settings. + +## Advanced steps + +[Step 8. Building your first custom dashboard](step-08.md) + +Using simple HTML, CSS, and JavaScript, we'll build a custom dashboard that displays essential information in any format +you choose. You can even monitor many systems from a single HTML file. + +[Step 9. Long-term metrics storage](step-09.md) + +By default, Netdata can store lots of real-time metrics, but you can also tweak our custom database engine to your +heart's content. Want to take your Netdata metrics elsewhere? We're happy to help you archive data to Prometheus, +MongoDB, TimescaleDB, and others. + +[Step 10. Set up a proxy](step-10.md) + +Run Netdata behind an Nginx proxy to improve performance, and enable TLS/HTTPS for better security. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fdocs%2Fguides%2Fstep-by-step%2Fstep-00&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/docs/guides/step-by-step/step-01.md b/docs/guides/step-by-step/step-01.md new file mode 100644 index 0000000000..cdcfcd7a2d --- /dev/null +++ b/docs/guides/step-by-step/step-01.md @@ -0,0 +1,156 @@ + + +# Step 1. 
Netdata's building blocks + +Netdata is a distributed and real-time _health monitoring and performance troubleshooting toolkit_ for monitoring your +systems and applications. + +Because the monitoring agent is highly optimized, you can install it on all your physical systems, containers, IoT devices, +and edge devices without disrupting their core function. + +By default, and without configuration, Netdata delivers real-time insights into everything happening on the system, from +CPU utilization to packet loss on every network device. Netdata can also auto-detect metrics from hundreds of your +favorite services and applications, like MySQL/MariaDB, Docker, Nginx, Apache, MongoDB, and more. + +All metrics are automatically updated, providing interactive dashboards that allow you to dive in, discover anomalies, +and figure out the root cause of any issue. + +Best of all, Netdata is entirely free, open-source software! Solo developers and enterprises with thousands of systems +can both use it free of charge. We're hosted on [GitHub](https://github.com/netdata/netdata). + +Want to learn about the history of Netdata, what inspired our CEO to build it in the first place, and where we're +headed? Read Costa's comprehensive blog post: _[Redefining monitoring with Netdata (and how it came to +be)](https://blog.netdata.cloud/posts/redefining-monitoring-netdata/)_. + +## What you'll learn in this step + +In the first step of the Netdata guide, you'll learn about: + +- [Netdata's core features](#netdatas-core-features) +- [Why you should use Netdata](#why-you-should-use-netdata) +- [How Netdata has complementary systems, not competitors](#how-netdata-has-complementary-systems-not-competitors) + +Let's get started! + +## Netdata's core features + +Netdata has only been around for a few years, but it's a complex piece of software. Here are just some of the features +we'll cover throughout this guide.
+ +- A sophisticated **dashboard**, which we'll cover in [step 2](step-02.md). The real-time, highly-granular dashboard, + with hundreds of charts, is your main source of information about the health and performance of your systems/ + applications. We designed the dashboard with anomaly detection and quick analysis in mind. We'll return to + dashboard-related topics in both [step 7](step-07.md) and [step 8](step-08.md). +- **Long-term metrics storage** by default. With our new database engine, you can store days, weeks, or months of + per-second historical metrics. Or you can archive metrics to another database, like MongoDB or Prometheus. We'll + cover all these options in [step 9](step-09.md). +- **No configuration necessary**. Without any configuration, you'll get thousands of real-time metrics and hundreds of + alarms designed by our community of sysadmin experts. But you _can_ configure Netdata in a lot of ways, some of + which we'll cover in [step 4](step-04.md). +- **Distributed, per-system installation**. Instead of centralizing metrics in one location, you install Netdata on + _every_ system, and each system is responsible for its metrics. Having distributed agents reduces cost and lets + Netdata run on devices with little available resources, such as IoT and edge devices, without affecting their core + purpose. +- **Sophisticated health monitoring** to ensure you always know when an anomaly hits. In [step 5](step-05.md), we dive + into how you can tune alarms, write your own alarm, and enable two types of notifications. +- **High-speed, low-resource collectors** that allow you to collect thousands of metrics every second while using only + a fraction of your system's CPU resources and a few MiB of RAM. +- **Netdata Cloud** is our SaaS toolkit that helps Netdata users monitor the health and performance of entire + infrastructures, whether they are two or two thousand (or more!) systems. We'll cover Netdata Cloud in [step + 3](step-03.md). 
+ +## Why you should use Netdata + +Because you care about the health and performance of your systems and applications, and all of the awesome features we +just mentioned. And it's free! + +All these may be valid reasons, but let's step back and talk about Netdata's _principles_ for health monitoring and +performance troubleshooting. We have a lot of [complementary +systems](#how-netdata-has-complementary-systems-not-competitors), and we think there's a good reason why Netdata should +always be your first choice when troubleshooting an anomaly. + +We built Netdata on four principles. + +### Per-second data collection + +Our first principle is per-second data collection for all metrics. + +That matters because you can't monitor a 2-second service-level agreement (SLA) with 10-second metrics. You can't detect +quick anomalies if your metrics don't show them. + +How do we solve this? By decentralizing monitoring. Each node is responsible for collecting metrics, triggering alarms, +and building dashboards locally, and we work hard to ensure it does each step (and others) with remarkable efficiency. +For example, Netdata can [collect 100,000 metrics](https://github.com/netdata/netdata/issues/1323) every second while +using only 9% of a single server-grade CPU core! + +By decentralizing monitoring and emphasizing speed at every turn, Netdata helps you scale your health monitoring and +performance troubleshooting to an infrastructure of every size. _And_ you get to keep per-second metrics in long-term +storage thanks to the database engine. + +### Unlimited metrics + +We believe all metrics are fundamentally important, and all metrics should be available to the user. + +If you don't collect _all_ the metrics a system creates, you're only seeing part of the story. It's like saying you've +read a book after skipping all but the last ten pages. You only know the ending, not everything that leads to it. 
+ +Most monitoring solutions exist to poke you when there's a problem, and then tell you to use a dozen different console +tools to find the root cause. Netdata prefers to give you every piece of information you might need to understand why an +anomaly happened. + +### Meaningful presentation + +We want every piece of Netdata's dashboard not only to look good and update every second, but also provide context as to +what you're looking at and why it matters. + +The principle of meaningful presentation is fundamental to our dashboard's user experience (UX). We could have put +charts in a grid or hidden some behind tabs or buttons. We instead chose to stack them vertically, on a single page, so +you can visually see how, for example, a jump in disk usage can also increase system load. + +Here's an example of a system undergoing a disk stress test: + +![Screen Shot 2019-10-23 at 15 38 +32](https://user-images.githubusercontent.com/1153921/67439589-7f920700-f5ab-11e9-930d-fb0014900d90.png) + +> For the curious, here's the command: `stress-ng --fallocate 4 --fallocate-bytes 4g --timeout 1m --metrics --verify +> --times`! + +### Immediate results + +Finally, Netdata should be usable from the moment you install it. + +As we've talked about, and as you'll learn in the foll