author  Chris Akritidis <43294513+cakrit@users.noreply.github.com>  2023-03-14 11:05:48 -0700
committer  GitHub <noreply@github.com>  2023-03-14 11:05:48 -0700
commit  4263a234d679122cbbc5b7eb1049276c6b7fbfd4 (patch)
tree  c29b8e1bf227c8ef5c3d1e25e797b2cbaa409f61
parent  aba7e8c064b3b993ce837495bb66d1362b1f21df (diff)
Update change-metrics-storage.md (#14726)
Provide better calculation of RAM usage
-rw-r--r--  docs/store/change-metrics-storage.md | 69
1 file changed, 51 insertions, 18 deletions
diff --git a/docs/store/change-metrics-storage.md b/docs/store/change-metrics-storage.md
index cc9954532d..dfca1ff2b1 100644
--- a/docs/store/change-metrics-storage.md
+++ b/docs/store/change-metrics-storage.md
@@ -1,13 +1,3 @@
-<!--
-title: "Change how long Netdata stores metrics"
-description: "With a single configuration change, the Netdata Agent can store days, weeks, or months of metrics at its famous per-second granularity."
-custom_edit_url: "https://github.com/netdata/netdata/edit/master/docs/store/change-metrics-storage.md"
-sidebar_label: "Change how long Netdata stores metrics"
-learn_status: "Published"
-learn_topic_type: "Tasks"
-learn_rel_path: "Configuration"
--->
-
# Change how long Netdata stores metrics
The Netdata Agent uses a custom-made time-series database (TSDB), named the
@@ -86,23 +76,67 @@ numbers should not deviate significantly from the above.
### Memory for concurrently collected metrics
-DBENGINE memory is related to the number of metrics concurrently being collected, the retention of the metrics
+The total memory Netdata uses is heavily influenced by the memory consumed by the DBENGINE.
+The DBENGINE memory is related to the number of metrics concurrently being collected, the retention of the metrics
on disk in relation to the queries running, and the number of metrics for which retention is maintained.
-The precise analysis of how much memory will be used is described in
-[dbengine memory requirements](https://github.com/netdata/netdata/blob/master/database/engine/README.md#memory-requirements).
+The precise analysis of how much memory will be used by the DBENGINE itself is described in
+[DBENGINE memory requirements](https://github.com/netdata/netdata/blob/master/database/engine/README.md#memory-requirements).
+
+In addition to the DBENGINE, Netdata uses memory for contexts, metric labels (e.g. in a Kubernetes setup),
+other Netdata structures/processes (e.g. Health) and system overhead.
-The quick rule of thumb for a high level estimation is
+The quick rule of thumb for a high-level estimation is:
```
-memory in KiB = METRICS x (TIERS - 1) x 4KiB x 2 + 32768 KiB
+DBENGINE memory in MiB = METRICS x (TIERS - 1) x 8 / 1024 MiB
+Total Netdata memory in MiB = Metric cardinality factor x DBENGINE memory in MiB + "dbengine page cache size MB" from netdata.conf
```
+The cardinality factor is usually between 3 and 4 and depends mainly on the ephemerality of the collected metrics. The more ephemeral
+the infrastructure, the higher the factor. If the cardinality is extremely high, with a lot of very short-lived containers
+(hundreds started every minute), the multiplication factor can get much higher. In such cases, we recommend splitting the load across
+multiple Netdata parents, until we can provide a way to lower the cardinality by aggregating similar metrics.
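+
+To make the rule of thumb easier to reuse, here is a small, unofficial Python sketch of the same calculation. The function name and its defaults (a cardinality factor of 3 and the default 32 MiB page cache) are illustrative assumptions, not part of Netdata itself:
+
+```python
+# Unofficial sketch of the rule-of-thumb estimate described above.
+def estimate_netdata_memory_mib(metrics, tiers=3, cardinality_factor=3, page_cache_mib=32):
+    """Return (dbengine_mib, total_mib) for a high-level estimation.
+
+    metrics            -- number of concurrently collected metrics (dimensions)
+    tiers              -- number of storage tiers
+    cardinality_factor -- usually 3 to 4; higher for very ephemeral infrastructures
+    page_cache_mib     -- "dbengine page cache size MB" from netdata.conf
+    """
+    dbengine_mib = metrics * (tiers - 1) * 8 / 1024
+    total_mib = cardinality_factor * dbengine_mib + page_cache_mib
+    return dbengine_mib, total_mib
+```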
+
+#### Small agent RAM usage
-So, for 2000 metrics (dimensions) in 3 storage tiers:
+For 2000 metrics (dimensions) in 3 storage tiers and the default cache size:
```
-memory for 2k metrics = 2000 x (3 - 1) x 4 KiB x 2 + 32768 KiB = 64 MiB
+DBENGINE memory for 2k metrics = 2000 x (3 - 1) x 8 / 1024 MiB = 32 MiB
+dbengine page cache size MB = 32 MiB
+Total Netdata memory in MiB = Between 2*32 + 32 = 96 MiB and 3*32 + 32 = 128 MiB, for low to average cardinality
```
+#### Large parent RAM usage
+
+The Netdata parent in our production infrastructure, at the time of writing:
+ - Collects 206k metrics per second, most of them streamed from child nodes
+ - Includes metrics of moderately ephemeral Kubernetes containers (average ephemerality), leading to a cardinality factor of about 4
+ - Uses 3 tiers for retention
+ - Has `dbengine page cache size MB` in `netdata.conf` set to 4 GiB
+
+The rule of thumb calculation for this setup gives us:
+```
+DBENGINE memory = 206,000 x (3 - 1) x 8 / 1024 = ~3,200 MiB = ~3 GiB
+Extra cache = 4 GiB
+Metric cardinality factor = 4
+Estimated total Netdata memory = 4 x 3 + 4 = 16 GiB
+```
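+
+For reference, the hedged Python sketch from the previous section reproduces this estimate before rounding (using the same 4 GiB cache and cardinality factor of 4):
+
+```python
+dbengine_mib, total_mib = estimate_netdata_memory_mib(
+    metrics=206_000, tiers=3, cardinality_factor=4, page_cache_mib=4096)
+print(f"DBENGINE: {dbengine_mib / 1024:.1f} GiB")  # ~3.1 GiB
+print(f"Total:    {total_mib / 1024:.1f} GiB")     # ~16.6 GiB
+```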
+
+The actual measurements, taken during a period of low usage, were the following:
+
+Purpose | RAM | Note
+:--- | ---: | :---
+DBENGINE usage | 5.9 GiB | Out of 7 GiB max
+Cardinality-related memory (k8s contexts, labels, strings) | 3.4 GiB |
+Buffer for queries | 0 GiB | Out of 0.5 GiB max, when heavily queried
+Other | 0.5 GiB |
+System overhead | 4.4 GiB | Calculated by subtracting all of the above from the total
+**Total Netdata memory usage** | 14.2 GiB |
+
+All the figures above, except for the system memory management overhead, were retrieved from Netdata itself.
+The overhead can't be measured directly, so we calculated it by subtracting all the other figures from the total Netdata memory usage.
+This overhead is usually around 50% of the memory actually usable by Netdata, but can range from 20% in small
+setups all the way to 100% in some edge cases.
## Configure metric retention
@@ -114,4 +148,3 @@ Save the file and restart the Agent with `sudo systemctl restart netdata`, or
the [appropriate method](https://github.com/netdata/netdata/blob/master/docs/configure/start-stop-restart.md)
for your system, to change the database engine's size.
-
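+
+For context, the retention settings referred to here live in the `[db]` section of `netdata.conf`. The snippet below is only an illustration of what such a configuration may look like; the exact key names and values to use depend on your Agent version and the retention you chose above:
+
+```
+[db]
+    mode = dbengine
+    storage tiers = 3
+    # Tier 0, per second data
+    dbengine multihost disk space MB = 256
+    # Tier 1, per minute data
+    dbengine tier 1 multihost disk space MB = 128
+    # Tier 2, per hour data
+    dbengine tier 2 multihost disk space MB = 64
+```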