author     Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>  2022-07-06 14:01:53 +0300
committer  GitHub <noreply@github.com>  2022-07-06 14:01:53 +0300
commit     49234f23de3a32682daff07ca229b6b62f24c090 (patch)
tree       a81ed628abcf4457737bcc3597b097e8e430497a /docs
parent     8d5850fd49bf6308cd6cab690cdbba4a35505b39 (diff)
Multi-Tier database backend for long term metrics storage (#13263)
* Tier part 1
* Tier part 2
* Tier part 3
* Tier part 4
* Tier part 5
* Fix some ML compilation errors
* fix more conflicts
* pass proper tier
* move metric_uuid from state to RRDDIM
* move aclk_live_status from state to RRDDIM
* move ml_dimension from state to RRDDIM
* abstracted the data collection interface
* support flushing for mem db too
* abstracted the query api
* abstracted latest/oldest time per metric
* cleanup
* store_metric for tier1
* fix for store_metric
* allow multiple tiers, more than 2
* state to tier
* Change storage type in db. Query param to request min, max, sum or average
* Store tier data correctly
* Fix skipping tier page type
* Add tier grouping in the tier
* Fix to handle archived charts (part 1)
* Temp fix for query granularity when requesting tier1 data
* Fix parameters in the correct order and calculate the anomaly based on the anomaly count
* Proper tiering grouping
* Anomaly calculation based on anomaly count
* force type checking on storage handles
* update cmocka tests
* fully dynamic number of storage tiers
* fix static allocation
* configure grouping for all tiers; disable tiers for unittest; disable statsd configuration for private charts mode
* use default page dt using the tiering info
* automatic selection of tier
* fix for automatic selection of tier
* working prototype of dynamic tier selection
* automatic selection of tier done right (I hope)
* ask for the proper tier value, based on the grouping function
* fixes for unittests and load_metric_next()
* fixes for lgtm findings
* minor renames
* add dbengine to page cache size setting
* add dbengine to page cache with malloc
* query engine optimized to loop as little as required based on the view_update_every
* query engine grouping methods now do not assume a constant number of points per group and they allocate memory with OWA
* report db points per tier in jsonwrap
* query planner that switches database tiers on the fly to satisfy the query for the entire timeframe
* dbengine statistics and documentation (in progress)
* calculate average point duration in db
* handle single point pages the best we can
* handle single point pages even better
* Keep page type in the rrdeng_page_descr
* updated doc
* handle future backwards compatibility - improved statistics
* support &tier=X in queries
* enforce increasing iterations on tiers
* tier 1 is always 1 iteration
* backfilling higher tiers on first data collection
* reversed anomaly bit
* set up to 5 tiers
* natural points should only be offered on tier 0, unless a specific tier is selected
* do not allow more than 65535 points of tier0 to be aggregated on any tier
* Work only on actually activated tiers
* fix query interpolation
* fix query interpolation again
* fix lgtm finding
* Activate one tier for now
* backfilling of higher tiers using raw metrics from lower tiers
* fix for crash on start when storage tiers is increased from the default
* more statistics on exit
* fix bug that prevented higher tiers from getting any values; added backfilling options
* fixed the statistics log line
* removed limit of 255 iterations per tier; moved the code of freezing rd->tiers[x]->db_metric_handle
* fixed division by zero on zero points_wanted
* removed dead code
* Decide on the descr->type for the type of metric
* don't store metrics on unknown page types
* free db_metric_handle on sql based context queries
* Disable STORAGE_POINT value check in the exporting engine unit tests
* fix for db modes other than dbengine
* fix for aclk archived chart queries destroying db_metric_handles of valid rrddims
* fix left-over freez() instead of OWA freez on median queries

Co-authored-by: Costa Tsaousis <costa@netdata.cloud>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
Diffstat (limited to 'docs')
-rw-r--r--  docs/guides/longer-metrics-storage.md | 210
1 file changed, 109 insertions(+), 101 deletions(-)
diff --git a/docs/guides/longer-metrics-storage.md b/docs/guides/longer-metrics-storage.md
index 2c6872d494..8ccd9585fa 100644
--- a/docs/guides/longer-metrics-storage.md
+++ b/docs/guides/longer-metrics-storage.md
@@ -1,150 +1,158 @@
<!--
-title: "Change how long Netdata stores metrics"
-description: "With a single configuration change, the Netdata Agent can store days, weeks, or months of metrics at its famous per-second granularity."
+title: "Netdata Longer Metrics Retention"
+description: ""
custom_edit_url: https://github.com/netdata/netdata/edit/master/docs/guides/longer-metrics-storage.md
-->
-# Change how long Netdata stores metrics
+# Netdata Longer Metrics Retention
-Netdata helps you collect thousands of system and application metrics every second, but what about storing them for the
-long term?
+Metrics retention affects three aspects of a Netdata Agent's operation:
-Many people think Netdata can only store about an hour's worth of real-time metrics, but that's simply not true any
-more. With the right settings, Netdata is quite capable of efficiently storing hours or days worth of historical,
-per-second metrics without having to rely on an [exporting engine](/docs/export/external-databases.md).
+1. The disk space required to store the metrics.
+2. The memory the Netdata Agent will require to keep that retention available for queries.
+3. The CPU resources that will be required to query longer time-frames.
-This guide gives two options for configuring Netdata to store more metrics. **We recommend the default [database
-engine](#using-the-database-engine)**, but you can stick with or switch to the round-robin database if you prefer.
+As retention increases, the resources required to support that retention increase too.
-Let's get started.
+Since Netdata Agents usually run at the edge, inside production systems, consider using Netdata Agent **parents**. In a **parent - child** setup, the child (the Netdata Agent running on a production system) delegates all of these functions, including longer metrics retention and querying, to a parent node that can dedicate more resources to the task. A single Netdata Agent parent can centralize multiple children Netdata Agents (dozens, hundreds, or even thousands, depending on its available resources).
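+
+In such a setup, the child streams its metrics to the parent through `stream.conf`. A minimal sketch of the child side (the destination address and the API key below are placeholders to replace with your own values):
+
+```
+# stream.conf on the child
+[stream]
+    enabled = yes
+    destination = PARENT_IP:19999
+    api key = 11111111-2222-3333-4444-555555555555
+```
+
+On the parent, the same API key is enabled in its own `stream.conf` (a `[11111111-2222-3333-4444-555555555555]` section with `enabled = yes`), so the parent accepts, stores and serves the children's metrics.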
-## Using the database engine
-The database engine uses RAM to store recent metrics while also using a "spill to disk" feature that takes advantage of
-available disk space for long-term metrics storage. This feature of the database engine allows you to store a much
-larger dataset than your system's available RAM.
+## Ephemerality of metrics
-The database engine is currently the default method of storing metrics, but if you're not sure which database you're
-using, check out your `netdata.conf` file and look for the `[db].mode` setting:
+The ephemerality of metrics plays an important role in retention. In environments where metrics stop being collected and new metrics are constantly being generated, we are interested in two parameters:
-```conf
-[db]
- mode = dbengine
-```
-
-If `[db].mode` is set to anything but `dbengine`, change it and restart Netdata using the standard command for
-restarting services on your system. You're now using the database engine!
+1. The **expected concurrent number of metrics** as an average for the lifetime of the database.
+ This affects mainly the storage requirements.
-What makes the database engine efficient? While it's structured like a traditional database, the database engine splits
-data between RAM and disk. The database engine caches and indexes data on RAM to keep memory usage low, and then
-compresses older metrics onto disk for long-term storage.
+2. The **expected total number of unique metrics** for the lifetime of the database.
+ This affects mainly the memory requirements for having all these metrics indexed and available to be queried.
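+
+As a purely hypothetical illustration of the difference between the two: a node may collect a steady 2000 metrics at any moment, but if its containers are replaced daily, every replacement introduces new unique metrics, so the unique metrics over the lifetime of the database can be many times the concurrent ones:
+
+```
+concurrent metrics (average)          = 2000
+new unique metrics per day (churn)    =  500
+unique metrics over a 30-day lifetime = 2000 + 500 x 30 = 17000
+```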
-When the Netdata dashboard queries for historical metrics, the database engine will use its cache, stored in RAM, to
-return relevant metrics for visualization in charts.
+## Granularity of metrics
-Now, given that the database engine uses _both_ RAM and disk, there are two other settings to consider: `page cache
-size MB` and `dbengine multihost disk space MB`.
+The granularity of metrics (the frequency at which they are collected and stored, i.e. their resolution) significantly affects retention.
-```conf
-[db]
- page cache size MB = 32
- dbengine multihost disk space MB = 256
-```
+Lowering the granularity from per second to every two seconds doubles retention and halves the CPU requirements of the Netdata Agent, without affecting disk space or memory requirements.
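+
+For example, to collect and store data every 2 seconds instead of every second, doubling retention for the same disk space, set this in `netdata.conf` and restart the Agent:
+
+```
+[db]
+    # collect and store a point every 2 seconds
+    update every = 2
+```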
-`[db].page cache size MB` sets the maximum amount of RAM the database engine will use for caching and indexing.
-`[db].dbengine multihost disk space MB` sets the maximum disk space the database engine will use for storing
-compressed metrics. The default settings retain about four day's worth of metrics on a system collecting 2,000 metrics
-every second.
+## Which database mode to use
-[**See our database engine
-calculator**](/docs/store/change-metrics-storage.md#calculate-the-system-resources-ram-disk-space-needed-to-store-metrics)
-to help you correctly set `[db].dbengine multihost disk space MB` based on your needs. The calculator gives an accurate estimate
-based on how many child nodes you have, how many metrics your Agent collects, and more.
+Netdata Agents support multiple database modes.
-With the database engine active, you can back up your `/var/cache/netdata/dbengine/` folder to another location for
-redundancy.
+The default mode `[db].mode = dbengine` has been designed to scale for longer retentions.
-Now that you know how to switch to the database engine, let's cover the default round-robin database for those who
-aren't ready to make the move.
+The other available database modes are designed to minimize resource utilization and should usually be considered in **parent - child** setups, on the children side.
-## Using the round-robin database
+So,
-In previous versions, Netdata used a round-robin database to store 1 hour of per-second metrics.
+* On a single node setup, use `[db].mode = dbengine` to increase retention.
+* On a **parent - child** setup, use `[db].mode = dbengine` on the parent to increase retention, and a more resource-efficient mode (like `save`, `ram` or `none`) on the children to minimize resource utilization, as in the sketch below.
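+
+A minimal sketch of the child side, assuming the parent keeps the long-term data and the child only needs a short in-memory buffer:
+
+```
+# netdata.conf on the child
+[db]
+    # keep recent metrics in RAM only; the parent holds the history
+    mode = ram
+
+    # with ram/save modes, retention is the number of entries to keep
+    # (here: 1 hour of per-second data)
+    retention = 3600
+```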
-To see if you're still using this database, or if you would like to switch to it, open your `netdata.conf` file and see
-if `[db].mode` option is set to `save`.
+To use `dbengine`, set this in `netdata.conf` (it is the default):
-```conf
+```
[db]
- mode = save
+ mode = dbengine
```
-If `[db].mode` is set to `save`, then you're using the round-robin database. If so, the `[db].retention` option is set to
-`3600`, which is the equivalent to 3,600 seconds, or one hour.
+## Tiering
-To increase your historical metrics, you can increase `[db].retention` to the number of seconds you'd like to store:
+`dbengine` supports tiering. Tiering allows keeping up to 3 versions of the data:
-```conf
+1. Tier 0 is the high resolution data.
+2. Tier 1 is the first tier that samples data every 60 data collections of Tier 0.
+3. Tier 2 is the second tier that samples data every 3600 data collections of Tier 0 (60 of Tier 1).
+
+To enable tiering, set `[db].storage tiers` in `netdata.conf` (the default is 1, which enables only Tier 0):
+
+```
[db]
- # 2 hours = 2 * 60 * 60 = 7200 seconds
- retention = 7200
- # 4 hours = 4 * 60 * 60 = 14440 seconds
- retention = 14440
- # 24 hours = 24 * 60 * 60 = 86400 seconds
- retention = 86400
+ mode = dbengine
+ storage tiers = 3
```
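+
+With per second collection (`update every = 1`), the effective resolution of each tier in this example is:
+
+```
+Tier 0:    1 second  per point (as collected)
+Tier 1:   60 seconds per point (1 minute)
+Tier 2: 3600 seconds per point (1 hour)
+```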
-And so on.
+## Disk space requirements
-Next, check to see how many metrics Netdata collects on your system, and how much RAM that uses. Visit the Netdata
-dashboard and look at the bottom-right corner of the interface. You'll find a sentence similar to the following:
+Netdata Agents require about 1 byte on disk per database point on Tier 0, and about 4 times more per point on higher tiers (Tier 1 and Tier 2). Higher tiers need the extra space because for every point they store `min`, `max`, `sum`, `count` and `anomaly rate` (5 values, yet only 4 times the storage, because `count` and `anomaly rate` are 16-bit integers). The `average` is calculated on the fly at query time, as `sum / count`.
-> Every second, Netdata collects 1,938 metrics, presents them in 299 charts and monitors them with 81 alarms. Netdata is
-> using 25 MB of memory on **netdata-linux** for 1 hour, 6 minutes and 36 seconds of real-time history.
+### Tier 0 - per second for a week
-On this desktop system, using a Ryzen 5 1600 and 16GB of RAM, the round-robin databases uses 25 MB of RAM to store just
-over an hour's worth of data for nearly 2,000 metrics.
+For 2000 metrics, collected every second and retained for a week, Tier 0 needs: 1 byte x 2000 metrics x 3600 secs per hour x 24 hours per day x 7 days per week ≈ 1100MB.
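+
+Expanded, the arithmetic is:
+
+```
+1 byte x 2000 metrics x 3600 secs/hour x 24 hours/day x 7 days
+  = 1,209,600,000 bytes ≈ 1154 MiB, rounded down to 1100MB below
+```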
-You should base this number on two things: How much history you need for your use case, and how much RAM you're willing
-to dedicate to Netdata.
+The setting to control this is in `netdata.conf`:
-How much RAM will a longer retention use? Let's use a little math.
+```
+[db]
+ mode = dbengine
+
+ # per second data collection
+ update every = 1
+
+ # enable only Tier 0
+ storage tiers = 1
+
+ # Tier 0, per second data for a week
+ dbengine multihost disk space MB = 1100
+```
-The round-robin database needs 4 bytes for every value Netdata collects. If Netdata collects metrics every second,
-that's 4 bytes, per second, per metric.
+By setting it to `1100` and restarting the Netdata Agent, this node will start maintaining about a week of data. But pay attention to the number of metrics. If you have more than 2000 metrics on a node, or you need more than a week of high-resolution metrics, you will need to scale this setting accordingly; see the rule of thumb below.
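+
+As a rough rule of thumb, assuming about 1 byte per point on Tier 0:
+
+```
+Tier 0 disk space in MB ≈ metrics x 86400 secs/day x retention days / (1024 x 1024)
+```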
-```text
-4 bytes * X seconds * Y metrics = RAM usage in bytes
-```
+### Tier 1 - per minute for a month
-Let's assume your system collects 1,000 metrics per second.
+By default, Tier 1 samples the data every 60 points of Tier 0. If Tier 0 is per second, then Tier 1 is per minute.
-```text
-4 bytes * 3600 seconds * 1,000 metrics = 14400000 bytes = 14.4 MB RAM
-```
+Tier 1 needs 4 times more storage per point than Tier 0. So, for 2000 metrics, with per-minute resolution, retained for a month, Tier 1 needs: 4 bytes x 2000 metrics x 60 minutes per hour x 24 hours per day x 30 days per month = 330MB.
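+
+Expanded:
+
+```
+4 bytes x 2000 metrics x 60 mins/hour x 24 hours/day x 30 days
+  = 345,600,000 bytes ≈ 330 MiB
+```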
-With that formula, you can calculate the RAM usage for much larger history settings.
-
-```conf
-# 2 hours at 1,000 metrics per second
-4 bytes * 7200 seconds * 1,000 metrics = 28800000 bytes = 28.8 MB RAM
-# 2 hours at 2,000 metrics per second
-4 bytes * 7200 seconds * 2,000 metrics = 57600000 bytes = 57.6 MB RAM
-# 4 hours at 2,000 metrics per second
-4 bytes * 14440 seconds * 2,000 metrics = 115520000 bytes = 115.52 MB RAM
-# 24 hours at 1,000 metrics per second
-4 bytes * 86400 seconds * 1,000 metrics = 345600000 bytes = 345.6 MB RAM
+Do this in `netdata.conf`:
+
+```
+[db]
+ mode = dbengine
+
+ # per second data collection
+ update every = 1
+
+ # enable only Tier 0 and Tier 1
+ storage tiers = 2
+
+ # Tier 0, per second data for a week
+ dbengine multihost disk space MB = 1100
+
+ # Tier 1, per minute data for a month
+ dbengine tier 1 multihost disk space MB = 330
```
-## What's next?
+Once `netdata.conf` is edited, the Netdata Agent needs to be restarted for the changes to take effect.
+
+### Tier 2 - per hour for a year
+
+By default, Tier 2 samples data every 3600 points of Tier 0 (60 points of Tier 1). If Tier 0 is per second, then Tier 2 is per hour.
-Now that you have either configured database engine or round-robin database engine to store more metrics, you'll
-probably want to see it in action!
+The storage requirements per point are the same as for Tier 1.
+
+For 2000 metrics, with per-hour resolution, retained for a year, Tier 2 needs: 4 bytes x 2000 metrics x 24 hours per day x 365 days per year = 67MB.
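+
+Expanded:
+
+```
+4 bytes x 2000 metrics x 24 hours/day x 365 days
+  = 70,080,000 bytes ≈ 67 MiB
+```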
+
+Do this in `netdata.conf`:
+
+```
+[db]
+ mode = dbengine
+
+ # per second data collection
+ update every = 1
+
+ # enable Tier 0, Tier 1 and Tier 2
+ storage tiers = 3
+
+ # Tier 0, per second data for a week
+ dbengine multihost disk space MB = 1100
+
+ # Tier 1, per minute data for a month
+ dbengine tier 1 multihost disk space MB = 330
+
+ # Tier 2, per hour data for a year
+ dbengine tier 2 multihost disk space MB = 67
+```
-For more information about how to pan charts to view historical metrics, see our documentation on [using
-charts](/web/README.md#using-charts).
+Once `netdata.conf` is edited, the Netdata Agent needs to be restarted for the changes to take effect.
-And if you'd now like to reduce Netdata's resource usage, view our [performance
-guide](/docs/guides/configure/performance.md) for our best practices on optimization.