authorMateusz Bularz <60339703+M4itee@users.noreply.github.com>2023-11-28 10:56:07 +0100
committerMateusz Bularz <60339703+M4itee@users.noreply.github.com>2023-11-28 10:56:07 +0100
commit43912c4b8bb6f9f3b30cfbeba43d31d014a8c9ee (patch)
tree6a7d2d9e334a271155d6091b13787c66576ed76c
parent0eae3020ee396f5542cc8b7d978a04412bff5722 (diff)
-rw-r--r--docs/netdata-cloud-onprem/getting-started-light-poc.md23
-rw-r--r--docs/netdata-cloud-onprem/getting-started.md85
-rw-r--r--docs/netdata-cloud-onprem/troubleshooting-onprem.md9
3 files changed, 58 insertions, 59 deletions
diff --git a/docs/netdata-cloud-onprem/getting-started-light-poc.md b/docs/netdata-cloud-onprem/getting-started-light-poc.md
index d79c9c2b70..80a73e8089 100644
--- a/docs/netdata-cloud-onprem/getting-started-light-poc.md
+++ b/docs/netdata-cloud-onprem/getting-started-light-poc.md
@@ -1,45 +1,46 @@
-# Getting started Getting started with Netdata Cloud On-Prem Light PoC
-Due to the high demand we designed very light and easy to install version of netdata for clients who do not have kubernetes cluster installed. Please keep in mind that this is (for now) only designed to be used as a PoC with no built in resiliency on failures of any kind.
+# Getting started with Netdata Cloud On-Prem Light PoC
+Due to the high demand, we designed a very light and easy-to-install version of netdata for clients who do not have a Kubernetes cluster installed. Please keep in mind that this is (for now) only designed to be used as a PoC with no built-in resiliency to failures of any kind.
Requirements:
- Ubuntu 22.04 (clean installation will work best).
- 10 CPU Cores and 24 GiB of memory.
 - Shell access as a sudo user.
- - TLS certificate for Netdata Cloud On-Prem PoC. Single endpoint is required. Certificate must be trusted by all entities connecting to the On-Prem installation by any means.
+ - TLS certificate for Netdata Cloud On-Prem PoC. A single endpoint is required. The certificate must be trusted by all entities connecting to the On-Prem installation by any means.
- AWS ID and Key - contact Netdata Product Team - info@netdata.cloud
- License Key - contact Netdata Product Team - info@netdata.cloud
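Because the installer asks for a certificate and a private key in PEM format, it can save a failed run to verify up front that the two actually match. The following is an optional sketch (not part of the official instructions) that compares public-key digests with `openssl`:

```shell
# check_cert_key CERT KEY - sanity check that a PEM certificate and a
# PEM private key belong together by comparing public-key digests.
check_cert_key() {
  cert_pub=$(openssl x509 -in "$1" -noout -pubkey | openssl sha256)
  key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null | openssl sha256)
  if [ "$cert_pub" = "$key_pub" ]; then
    echo "match"
  else
    echo "MISMATCH" >&2
    return 1
  fi
}
```

For example, `check_cert_key cert.pem key.pem` prints `match` when the pair is consistent (the file names here are placeholders).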
-To install whole environment, login to designation host and run:
+To install the whole environment, log in to the designated host and run:
```shell
curl -O https://netdata-cloud-netdata-static-content.s3.amazonaws.com/provision.sh
chmod +x provision.sh
sudo ./provision.sh --install
```
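As an optional precaution (not part of the official steps), the downloaded file can be sanity-checked before being made executable - for example, confirming it starts with a shebang rather than a saved HTML error page:

```shell
# looks_like_script FILE - succeeds if FILE begins with "#!", a cheap
# guard against having saved an error page instead of the installer.
looks_like_script() {
  head -n 1 "$1" | grep -q '^#!'
}
```

For example: `looks_like_script provision.sh && sudo ./provision.sh --install`.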
-What script with does during the installation?
+What does the script do during installation?
1. Prompts the user to provide:
 - ID and KEY for accessing AWS (to pull helm charts and container images)
- License Key
- URL under which Netdata Cloud Onprem PoC is going to function (without protocol like `https://`)
- Path for certificate file (PEM format)
- Path for private key file (PEM format)
-2. After getting all of the information installation is starting. Script will install:
+2. After gathering all of the information, the installation starts. The script will install:
1. Helm
2. Kubectl
3. AWS CLI
4. K3s cluster (single node)
-3. When all the required software is installed script starts to provision K3s cluster with gathered data.
+3. When all the required software is installed, the script starts to provision the K3s cluster with the gathered data.
After cluster provisioning, netdata is ready to be used.
-##### How to login?
-Because this is a PoC with 0 configuration required, only login by mail is able to work. What's more every mail that Netdata Cloud On-Prem is sending will appear on mailcatcher, which acts as the SMTP server with a simple GUI to read the mails. Steps:
+
+##### How to log in?
+Because this is a PoC with zero configuration required, logging in by email is the only method that works. What's more, every email that Netdata Cloud On-Prem sends will appear in the mailcatcher, which acts as the SMTP server with a simple GUI for reading the mails. Steps:
1. Open the Netdata Cloud On-Prem PoC in a web browser at the URL you specified
2. Provide email and use the button to confirm
3. Mailcatcher will catch all the emails so go to `<URL from point 1.>/mailcatcher`. Find yours and click the link.
4. You are now logged in to netdata. Add your first nodes!
##### How to remove Netdata Cloud On-Prem PoC?
-To uninstall whole PoC, use the same script that installed it, with the `--uninstall` switch.
+To uninstall the whole PoC, use the same script that installed it, with the `--uninstall` switch.
```shell
cd <script dir>
@@ -47,4 +48,4 @@ sudo ./provision.sh --uninstall
```
#### WARNING
-This script will expose automatically expose not only netdata but also a mailcatcher under `<URL from point 1.>/mailcatcher`.
+This script will automatically expose not only netdata but also a mailcatcher under `<URL from point 1.>/mailcatcher`.
diff --git a/docs/netdata-cloud-onprem/getting-started.md b/docs/netdata-cloud-onprem/getting-started.md
index 0af0e3e38b..27c6bb538c 100644
--- a/docs/netdata-cloud-onprem/getting-started.md
+++ b/docs/netdata-cloud-onprem/getting-started.md
@@ -1,5 +1,5 @@
# Getting started with Netdata Cloud On-Prem
-Helm charts are designed for kubernetes to run as the local equivalent of the netdata.cloud public offering. This means that no data is sent outside of your cluster. By default On-Prem installation is trying to reach for outside resources only when pulling the container images.
+Helm charts are designed for Kubernetes to run as the local equivalent of the Netdata Cloud public offering. This means that no data is sent outside of your cluster. By default, the On-Prem installation reaches outside resources only when pulling the container images.
There are 2 helm charts in total:
- netdata-cloud-onprem - installs onprem itself.
- netdata-cloud-dependency - installs all necessary dependency applications. Not for production use, PoC only.
@@ -7,19 +7,19 @@ There are 2 helm charts in total:
## Requirements
#### Install host:
- [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
-- [Helm](https://helm.sh/docs/intro/install/) version 3.12+ with OCI Configuration (explained in installation section)
+- [Helm](https://helm.sh/docs/intro/install/) version 3.12+ with OCI Configuration (explained in the installation section)
- [Kubectl](https://kubernetes.io/docs/tasks/tools/)
#### Kubernetes requirements:
- Kubernetes cluster version 1.23+
- Kubernetes metrics server (For autoscaling)
-- TLS certificate for Netdata Cloud On-Prem. Single endpoint is required but there is an option to split frontend, api and mqtt endpoints. Certificate must be trusted by all entities connecting to the On-Prem installation by any means.
+- TLS certificate for Netdata Cloud On-Prem. A single endpoint is required but there is an option to split the frontend, api, and mqtt endpoints. The certificate must be trusted by all entities connecting to the On-Prem installation by any means.
- Ingress controller to support HTTPS `*`
- PostgreSQL version 13.7 `*` (Main persistent data app)
- EMQX version 5.11 `*` (MQTT Broker that allows Agents to send messages to the On-Prem Cloud)
- Apache Pulsar version 2.10+ `*` (Central communication hub. Applications exchange messages through Pulsar)
- Traefik version 2.7.x `*` (Internal communication - API Gateway)
-- Elastic Search version 8.8.x `*` (Holds Feed)
+- Elasticsearch version 8.8.x `*` (Holds Feed)
- Redis version 6.2 `*` (Cache)
- Some form of generating imagePullSecret `*` (Our ECR repos are secured)
- Default storage class configured and working (Persistent volumes based on SSDs are preferred)
@@ -27,53 +27,53 @@ There are 2 helm charts in total:
#### Hardware requirements:
##### How we tested it:
-- A number of VMs on the AWS EC2, size of the instance was c6a.32xlarge (128CPUs / 256GiB memory).
+- Several VMs on AWS EC2; the instance size was c6a.32xlarge (128 CPUs / 256 GiB of memory).
- Host system - Ubuntu 22.04.
- Each VM hosts 200 Agent nodes as docker containers.
-- Agents are connected DIRECTLY to cloud (no Parent-Child relationships). This is the worst option for the cloud.
-- Cloud hosted on 1 kubernetes node c6a.8xlarge (32CPUs / 64GiB memory).
+- Agents are connected directly to the Netdata Cloud On-Prem (no Parent-Child relationships). This is the worst option for the cloud.
+- Cloud hosted on a single c6a.8xlarge Kubernetes node (32 CPUs / 64 GiB of memory).
- Dependencies were also installed on the same node.
-- Maximum connected nodes was ~2000.
+- The maximum number of connected nodes was ~2000.
##### Results
There was no point in trying to connect more nodes, as this already covers the PoC's purpose.
-- In a peak connection phase - All nodes startup were triggered in ~15 minues:
- - Up to 60% (20 cores) CPU usage of the kubernetes node. Top usage came from:
+- In the peak connection phase - all node startups were triggered within ~15 minutes:
+ - Up to 60% (20 cores) CPU usage of the Kubernetes node. Top usage came from:
- Ingress controller (we used haproxy ingress controller)
- Postgres
- Pulsar
- EMQX
Combined they were responsible for ~30-35% of CPU usage of the node.
-- When all nodes connected and synchronized their state CPU usage floated between 30% and 40% - depending on what we did on the cloud (browsing different). Here top offenders were:
+- When all nodes had connected and synchronized their state, CPU usage floated between 30% and 40%, depending on what we did on the Cloud. The top offenders here were:
- Pulsar
- Postgres
Combined they were responsible for ~15-20% of CPU usage of the node.
- Memory usage - 45GiB in a peak. Most of it (~20GiB) was consumed by:
- Postgres
- - Elastic
+ - Elasticsearch
- Pulsar
For comparison - a Netdata Cloud On-Prem installation with just 100 nodes connected, without dependencies, is going to consume ~2 CPUs and ~2 GiB of memory (real usage, not Kubernetes requests).
## Pulling the helm chart
-Helm chart for the Netdata Cloud On-Prem installation on Kubernetes is available at ECR registry.
-ECR registry is private, so you need to login first. Credentials are sent by our Product Team. If you do not have them, please contact our Product Team - info@netdata.cloud.
+The helm chart for the Netdata Cloud On-Prem installation on Kubernetes is available in the ECR registry.
+The ECR registry is private, so you need to log in first. Credentials are sent by our Product Team. If you do not have them, please contact our Product Team - info@netdata.cloud.
#### Configure AWS CLI
-Machine used for helm chart installation will also need [AWS CLI installed](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
-There are 2 options of configuring `aws` cli to work with provided credentials. First one is to set the environment variables:
+The machine used for helm chart installation will also need [AWS CLI installed](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
+There are 2 options for configuring the `aws` CLI to work with the provided credentials. The first one is to set the environment variables:
```bash
export AWS_ACCESS_KEY_ID=<your_secret_id>
export AWS_SECRET_ACCESS_KEY=<your_secret_key>
```
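Since later `aws` and `helm` commands fail in unhelpful ways when credentials are missing, a small guard like the following (a sketch, not from the official docs) can abort early with a clear message:

```shell
# require_aws_env - abort with a clear message if either credential
# variable is unset or empty; the ":?" expansion exits non-zero.
require_aws_env() {
  : "${AWS_ACCESS_KEY_ID:?AWS_ACCESS_KEY_ID is not set}"
  : "${AWS_SECRET_ACCESS_KEY:?AWS_SECRET_ACCESS_KEY is not set}"
}
```

Calling `require_aws_env` before the registry login stops the run immediately when either variable is absent.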
-Second one is to use an interactive shell:
+The second one is to use an interactive shell:
```bash
aws configure
```
#### Configure helm to use secured ECR repository
-Using `aws` command we will generate token for helm to access secured ECR repository:
+Using the `aws` command, we will generate a token for helm to access the secured ECR repository:
```bash
aws ecr get-login-password --region us-east-1 | helm registry login --username AWS --password-stdin 362923047827.dkr.ecr.us-east-1.amazonaws.com/netdata-cloud-onprem
```
@@ -84,7 +84,7 @@ helm pull oci://362923047827.dkr.ecr.us-east-1.amazonaws.com/netdata-cloud-depen
helm pull oci://362923047827.dkr.ecr.us-east-1.amazonaws.com/netdata-cloud-onprem --untar
```
-Local folders with newest versions of helm charts should appear on your working dir.
+Local folders with the newest versions of the helm charts should appear in your working directory.
## Installation
@@ -93,22 +93,21 @@ Netdata provides access to two helm charts:
2. netdata-cloud-onprem - the application itself + provisioning
### netdata-cloud-dependency
-
-Entire helm chart is designed around the idea that it allows to install all of the necessary applications:
-- redis
-- elasticsearch
-- emqx
-- pulsar
-- postgresql
-- traefik
-- mailcatcher
+The entire helm chart is designed around the idea that it allows the installation of all of the necessary applications:
+- Redis
+- Elasticsearch
+- EMQX
+- Apache Pulsar
+- PostgreSQL
+- Traefik
+- Mailcatcher
- k8s-ecr-login-renew
- kubernetes-ingress
-Every configuration option is available through `values.yaml` in the folder that contains your netdata-cloud-dependency helm chart. All configuration options are described in README.md that is a part of the helm chart. It is enough to mention here that each component can be enabled/disabled individually. It is done by true/false switches in `values.yaml`. In this way it is easier for user to migrate to production-grade components gradually.
+Every configuration option is available through `values.yaml` in the folder that contains your netdata-cloud-dependency helm chart. All configuration options are described in the README.md that is part of the helm chart. It is worth mentioning here that each component can be enabled or disabled individually through true/false switches in `values.yaml`, making it easier for the user to migrate to production-grade components gradually.
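As a purely illustrative sketch of what such switches look like (the key names below are hypothetical - consult the README.md inside the chart for the real options):

```yaml
# Hypothetical values.yaml fragment - key names are examples only;
# see the chart's README.md for the actual options.
redis:
  enabled: true          # use the bundled Redis
elasticsearch:
  enabled: false         # e.g. when an external production cluster is used instead
```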
-Unless you prefer different solution to the problem, `k8s-ecr-login-renew` is responsible for calling out the `AWS API` for token regeneration. This token is then injected into the secret that every node is using for authentication with secured ECR when pulling the images.
-Default setting in `values.yaml` of `netdata-cloud-onprem` - `.global.imagePullSecrets` is configured to work out of the box with the dependency helm chart.
+Unless you prefer a different solution to the problem, `k8s-ecr-login-renew` is responsible for calling the `AWS API` for token regeneration. This token is then injected into the secret that every node uses to authenticate with the secured ECR when pulling the images.
+The default setting in `values.yaml` of `netdata-cloud-onprem` - `.global.imagePullSecrets` is configured to work out of the box with the dependency helm chart.
For helm chart installation - save your changes in `values.yaml` and execute:
```shell
@@ -118,7 +117,7 @@ helm upgrade --wait --install netdata-cloud-dependency -n netdata-cloud --create
### netdata-cloud-onprem
-Every configuration option is available through `values.yaml` in the folder that contains your netdata-cloud-onprem helm chart. All configuration options are described in README.md that is a part of the helm chart.
+Every configuration option is available through `values.yaml` in the folder that contains your netdata-cloud-onprem helm chart. All configuration options are described in the README.md that is part of the helm chart.
#### Installing Netdata Cloud On-Prem
```shell
@@ -128,13 +127,13 @@ helm upgrade --wait --install netdata-cloud-onprem -n netdata-cloud --create-nam
##### Important notes
1. Installation takes care of provisioning the resources with migration services.
-1. During the first installation, a secret called the `netdata-cloud-common` is created. It contains several randomly generated entries. Deleting helm chart is not going to delete this secret, nor reinstalling whole onprem, unless manually deleted by kubernetes administrator. Content of this secret is extremely relevant - strings that are contained there are essential part of encryption. Loosing or changing data that it contains will result in data loss.
+1. During the first installation, a secret called `netdata-cloud-common` is created. It contains several randomly generated entries. Neither deleting the helm chart nor reinstalling the whole On-Prem will delete this secret; it persists unless manually deleted by the Kubernetes administrator. The content of this secret is extremely important - the strings it contains are an essential part of the encryption. Losing or changing the data it contains will result in data loss.
## Short description of services
#### cloud-accounts-service
Responsible for user registration & authentication. Manages user account information.
#### cloud-agent-data-ctrl-service
-Forwards requests from the cloud to the relevant agents.
+Forwards requests from the cloud to the relevant agents.
The requests include:
* Fetching chart metadata from the agent
* Fetching chart data from the agent
@@ -154,18 +153,18 @@ Persists latest alert statuses received from the agent in the cloud.
Aggregates alert statuses from relevant node instances.
Exposes API endpoints to fetch alert data for visualization on the cloud.
Determines if notifications need to be sent when alert statuses change and emits relevant messages to Pulsar.
-Exposes API endpoints to store and return notification silencing data.
+Exposes API endpoints to store and return notification-silencing data.
#### cloud-alarm-streaming-service
Responsible for starting the alert stream between the agent and the cloud.
-Ensures that messages are processed in the correct order, starts a reconciliation process between the cloud and the agent if out of order processing occurs.
+Ensures that messages are processed in the correct order, and starts a reconciliation process between the cloud and the agent if out-of-order processing occurs.
#### cloud-charts-mqtt-input-service
Forwards MQTT messages emitted by the agent related to the chart entities to the internal Pulsar broker. These include the chart metadata that is used to display relevant charts on the cloud.
#### cloud-charts-mqtt-output-service
Forwards Pulsar messages emitted in the cloud related to the charts entities to the MQTT broker. From there, the messages reach the relevant agent.
#### cloud-charts-service
-Exposes API endpoints to fetch the chart metdata.
+Exposes API endpoints to fetch the chart metadata.
Forwards data requests via the `cloud-agent-data-ctrl-service` to the relevant agents to fetch chart data points.
-Exposes API endpoints to call various other endpoints on the agent, for instance functions.
+Exposes API endpoints to call various other endpoints on the agent, for instance, functions.
#### cloud-custom-dashboard-service
Exposes API endpoints to fetch and store custom dashboard data.
#### cloud-environment-service
@@ -177,11 +176,11 @@ Exposes API endpoints to fetch feed events from Elasticsearch.
#### cloud-frontend
Contains the on-prem cloud website. Serves static content.
#### cloud-iam-user-service
-Acts as a middleware for authentication on most of API endpoints. Validates incoming token headers, injects relevant headers and forwards the requests.
+Acts as a middleware for authentication on most of the API endpoints. Validates incoming token headers, injects the relevant ones, and forwards the requests.
#### cloud-metrics-exporter
-Exports various metrics from an on prem cloud-install. Uses the Prometheus metric exposition format.
+Exports various metrics from an On-Prem Cloud installation. Uses the Prometheus metric exposition format.
#### cloud-netdata-assistant
-Exposes API endpoints to fetch a human friendly explanation of various netdata configuration options, namely the alerts.
+Exposes API endpoints to fetch a human-friendly explanation of various netdata configuration options, namely the alerts.
#### cloud-node-mqtt-input-service
Forwards MQTT messages emitted by the agent related to the node entities to the internal Pulsar broker. These include the node metadata as well as their connectivity state, either direct or via parents.
#### cloud-node-mqtt-output-service
@@ -190,7 +189,7 @@ Forwards Pulsar messages emitted in the cloud related to the charts entities to
Exposes API endpoints to handle integrations.
Handles incoming notification messages and uses the relevant channels (email, Slack...) to notify the relevant users.
#### cloud-spaceroom-service
-Exposes API endpoints to fetch and store relations between agents, nodes, spaces, users and rooms.
+Exposes API endpoints to fetch and store relations between agents, nodes, spaces, users, and rooms.
Acts as a provider of authorization for other cloud endpoints.
Exposes API endpoints to authenticate agents connecting to the cloud.
@@ -198,4 +197,4 @@ Exposes API endpoints to authenticate agents connecting to the cloud.
![infrastructure.jpeg](infrastructure.jpeg)
-### If you have any questions or suggestions please contact netdata team. \ No newline at end of file
+### If you have any questions or suggestions, please contact the Netdata team. \ No newline at end of file
diff --git a/docs/netdata-cloud-onprem/troubleshooting-onprem.md b/docs/netdata-cloud-onprem/troubleshooting-onprem.md
index 25b560a533..4f449c9651 100644
--- a/docs/netdata-cloud-onprem/troubleshooting-onprem.md
+++ b/docs/netdata-cloud-onprem/troubleshooting-onprem.md
@@ -2,16 +2,15 @@
We cannot predict how your particular installation of Netdata Cloud On-Prem is going to perform. It depends on a mixture of the underlying infrastructure, the number of agents, and their topology.
You can always contact the Netdata team for recommendations!
-#### Loading charts takes long time or ends with error
+#### Loading charts takes a long time or ends with an error
The charts service tries to collect the data from all of the agents in question. If we are talking about the overview screen, all of the nodes in the space are going to be queried (the `All nodes` room). If it takes a long time, there are a few things that should be checked:
1. How many nodes are you querying directly?
There is a big difference between having 100 nodes connected directly to the cloud compared to them being connected through a few parents. Netdata always prioritizes querying nodes through parents. This way, we can reduce some of the load by pushing the responsibility to query the data to the parent. The parent is then responsible for passing accumulated data from nodes connected to it to the cloud.
1. If you are missing data from endpoints all the time.
Netdata Cloud always queries nodes themselves for the metrics. The cloud only holds information about metadata, such as information about what charts can be pulled from any node, but not the data points themselves for any metric. This means that if a node is throttled by the network connection or under high resource pressure, the information exchange between the agent and cloud through the MQTT broker might take a long time. In addition to checking resource usage and networking, we advise using a parent node for such endpoints. Parents can hold the data from nodes that are connected to the cloud through them, eliminating the need to query those endpoints.
1. Errors on the cloud when trying to load charts.
- If the entire data query is crashing and no data is displayed on the UI, it could indicate problems with the `cloud-charts-service`. It is possible that the query you are performing is simply exceeding the CPU and/or memory limits set on the deployment. We advise increasing those resources.
-
-#### It takes long time to load anything on the Cloud UI
+ If the entire data query is crashing and no data is displayed on the UI, it could indicate problems with the `cloud-charts-service`. The query you are performing might simply exceed the CPU and/or memory limits set on the deployment. We advise increasing those resources.
+#### It takes a long time to load anything on the Cloud UI
When experiencing sluggishness and slow responsiveness, the following factors should be checked regarding the Postgres database:
1. CPU: Monitor the CPU usage to ensure it is not reaching its maximum capacity. High and sustained CPU usage can lead to sluggish performance.
1. Memory: Check if the database server has sufficient memory allocated. Inadequate memory could cause excessive disk I/O and slow down the database.
@@ -19,4 +18,4 @@ When experiencing sluggishness and slow responsiveness, the following factors sh
By examining these factors and ensuring that CPU, memory, and disk IOPS are within acceptable ranges, you can mitigate potential performance issues with the Postgres database.
#### Nodes are not updated quickly on the Cloud UI
-If youre experiencing delays with information exchange between the Cloud UI and the Agent, and youve already checked the networking and resource usage on the agent side, the problem may be related to Apache Pulsar or the database. Slow alerts on node alerts or slow updates on node status (online/offline) could indicate issues with message processing or database performance. You may want to investigate the performance of Apache Pulsar, ensure it is properly configured, and consider scaling or optimizing the database to handle the volume of data being processed or written to it. \ No newline at end of file
+If you're experiencing delays with information exchange between the Cloud UI and the Agent, and you've already checked the networking and resource usage on the agent side, the problem may be related to Apache Pulsar or the database. Slow alerts on node alerts or slow updates on node status (online/offline) could indicate issues with message processing or database performance. You may want to investigate the performance of Apache Pulsar, ensure it is properly configured, and consider scaling or optimizing the database to handle the volume of data being processed or written to it.