author Chris Akritidis <43294513+cakrit@users.noreply.github.com> 2023-04-17 06:40:13 -0700
committer GitHub <noreply@github.com> 2023-04-17 06:40:13 -0700
commit 227988441cb0b275b179617990f8bfe82cf322e3
tree 397530ce5038180920a224375d917100fe325c19
parent a4f8dc6cde66f69760506e57958df1e03f4ea7b0
Add section for scaling parent nodes (#14915)
Update change-metrics-storage.md
-rw-r--r-- docs/store/change-metrics-storage.md 12
1 files changed, 12 insertions, 0 deletions
diff --git a/docs/store/change-metrics-storage.md b/docs/store/change-metrics-storage.md
index a7d7e77456..5e14fe2472 100644
--- a/docs/store/change-metrics-storage.md
+++ b/docs/store/change-metrics-storage.md
@@ -127,6 +127,8 @@ The Netdata parent in our production infrastructure at the time of writing:
- 3 tiers are used for retention
- The `dbengine page cache size MB` in `netdata.conf` is configured to be 4GB
+Netdata parents can end up collecting millions of metrics per second. See also [scaling dedicated parent nodes](#scaling-dedicated-parent-nodes).
+
The rule of thumb calculation for this set up gives us
```
DBENGINE memory = 206,000 x 16 / 1024 MiB = 3,217 MiB = about 3 GiB
@@ -193,3 +195,13 @@ All new child nodes are automatically transferred to the multihost dbengine inst
space. If you want to migrate a child node from its legacy dbengine instance to the multihost dbengine instance, you
must delete the instance's directory, which is located in `/var/cache/netdata/MACHINE_GUID/dbengine`, after stopping the
Agent.
+
+## Scaling dedicated parent nodes
+
+When you use streaming in medium to large infrastructures, potentially millions of metrics per second can reach each parent node.
+In the lab, we have reliably collected 1 million metrics/sec with 16 cores and 32 GB of RAM.
+
+We suggest running parents on dedicated VMs, using at most 50% of the CPU, and ensuring you have enough RAM
+for the desired retention. When your infrastructure could push a parent beyond these limits, split the load across multiple parents that
+do not communicate with each other. With each child sending data to only one of the parents, you can still have replication, high availability,
+and infrastructure-level observability via the Netdata Cloud UI.
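As a rough illustration of the sizing advice in the new section, the Python sketch below estimates how many independent parents a given total ingestion rate needs, then assigns each child to exactly one of them. Everything in it is hypothetical: the function names, the child names and rates, and especially the assumptions that CPU use scales linearly with metrics/sec and that the 1 million metrics/sec lab figure corresponds to a fully loaded parent. RAM for retention is not modeled. Measure your own parents before relying on numbers like these.

```python
import heapq
import math


def parents_needed(total_metrics_per_sec: float,
                   lab_capacity_per_sec: float = 1_000_000,
                   cpu_cap: float = 0.5) -> int:
    """Estimate how many independent parents are needed (hypothetical helper).

    ASSUMPTION: lab_capacity_per_sec is what one parent ingests at 100% CPU,
    and CPU use grows linearly with ingested metrics/sec. Capping CPU at
    cpu_cap scales the per-parent budget down proportionally.
    """
    budget = lab_capacity_per_sec * cpu_cap
    return max(1, math.ceil(total_metrics_per_sec / budget))


def assign_children(child_rates: dict[str, float], n_parents: int) -> list[list[str]]:
    """Assign every child to exactly one parent, balancing load greedily.

    Longest-processing-time heuristic: place the busiest child first,
    always onto the currently least-loaded parent.
    """
    heap = [(0.0, i) for i in range(n_parents)]  # (current load, parent index)
    heapq.heapify(heap)
    groups: list[list[str]] = [[] for _ in range(n_parents)]
    for child, rate in sorted(child_rates.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)          # least-loaded parent
        groups[idx].append(child)
        heapq.heappush(heap, (load + rate, idx))
    return groups


if __name__ == "__main__":
    # Hypothetical children and their metrics/sec.
    rates = {"child-a": 400_000, "child-b": 300_000,
             "child-c": 250_000, "child-d": 150_000}
    n = parents_needed(sum(rates.values()))  # 1,100,000 / 500,000 -> 3 parents
    for i, group in enumerate(assign_children(rates, n)):
        print(f"parent-{i}: {group}")
```

The greedy step is the classic longest-processing-time heuristic, chosen here only for simplicity; it does not reflect how Netdata itself distributes children, and any grouping works as long as each child streams to a single parent that stays within its budget.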