author	Mateusz Bularz <60339703+M4itee@users.noreply.github.com>	2023-11-21 11:39:39 +0100
committer	Mateusz Bularz <60339703+M4itee@users.noreply.github.com>	2023-11-21 11:39:39 +0100
commit	3cf59046a5e18e385bfba943d85e576d107a0f04 (patch)
tree	18b706b2ec51cfc07604692e8e4569b741bf233d
parent	ae70a1e60fe5128e587e819508b12eb02da407e0 (diff)
updates to documentation
-rw-r--r--	docs/netdata-cloud-onprem/getting-started-light-poc.md	11
-rw-r--r--	docs/netdata-cloud-onprem/getting-started.md	106
-rw-r--r--	docs/netdata-cloud-onprem/troubleshooting-onprem.md	22
3 files changed, 34 insertions, 105 deletions
diff --git a/docs/netdata-cloud-onprem/getting-started-light-poc.md b/docs/netdata-cloud-onprem/getting-started-light-poc.md
index bcdbc24605..10eaeef49c 100644
--- a/docs/netdata-cloud-onprem/getting-started-light-poc.md
+++ b/docs/netdata-cloud-onprem/getting-started-light-poc.md
@@ -5,11 +5,11 @@ Requirements:
- Ubuntu 22.04 (clean installation will work best)
- 10 CPU Cores and 24 GiB of memory
- Access to shell as a sudo
- - TLS certificate that is going to be trusted by all agents and web browsers that will use this PoC installation
+ - TLS certificate for the Netdata Cloud On-Prem PoC. A single endpoint is required. The certificate must be trusted by all entities connecting to the On-Prem installation by any means.
 To install the whole environment, log in to the designated host and run:
```shell
-curl <link>
+curl -O https://netdata-cloud-netdata-static-content.s3.amazonaws.com/provision.sh
chmod +x provision.sh
sudo ./provision.sh --install
```
@@ -18,7 +18,7 @@ What script does?
1. Prompts user to provide:
- ID and KEY for accessing the AWS (to pull helm charts and container images)
- License Key
- - URL under which Netdata Cloud Onprem PoC is going to function
+ - URL under which the Netdata Cloud On-Prem PoC is going to function (without a protocol prefix like `https://`)
- Path for certificate file (unencrypted)
- Path for private key file (unencrypted)
2. After getting all of the information installation is starting. Script will install:
@@ -28,4 +28,7 @@ What script does?
4. K3s cluster (single node)
3. When all the required software is installed script starts to provision K3s cluster with gathered data.
-After cluster provisioning netdata is ready to be used.
\ No newline at end of file
+After cluster provisioning, Netdata is ready to be used.
+
+#### WARNING
+This script will automatically expose not only Netdata but also a mailcatcher under `<URL from point 1.>/mailcatcher`.
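Since the provisioner expects the URL without a protocol prefix, a small pre-flight check can normalize operator input before the script is run. The helper below is a hypothetical sketch, not part of provision.sh:

```shell
# Hypothetical helper (not part of provision.sh): the provisioner
# expects the bare host, so strip a leading scheme such as "https://"
# if the operator pasted a full URL.
strip_protocol() {
    printf '%s\n' "$1" | sed -E 's#^[A-Za-z][A-Za-z0-9+.-]*://##'
}

strip_protocol "https://cloud.example.com"   # -> cloud.example.com
strip_protocol "cloud.example.com"           # -> cloud.example.com
```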
diff --git a/docs/netdata-cloud-onprem/getting-started.md b/docs/netdata-cloud-onprem/getting-started.md
index 63614446b7..482214e24d 100644
--- a/docs/netdata-cloud-onprem/getting-started.md
+++ b/docs/netdata-cloud-onprem/getting-started.md
@@ -1,5 +1,5 @@
# Getting started with Netdata Cloud On-Prem
-Helm chart is designed for kubernetes to run as the local equivalent of the netdata.cloud public offering.
+The Helm charts are designed to run on Kubernetes as the local equivalent of the netdata.cloud public offering. This means that no data is sent outside of your cluster. By default, the On-Prem installation reaches outside resources only when pulling the container images.
## Requirements
#### Install host:
@@ -10,7 +10,7 @@ Helm chart is designed for kubernetes to run as the local equivalent of the netd
#### Kubernetes requirements:
- Kubernetes cluster version 1.23+
- Kubernetes metrics server (For autoscaling)
-- TLS certificate for Netdata Cloud On-Prem
+- TLS certificate for Netdata Cloud On-Prem. A single endpoint is required, but there is an option to split the frontend, API, and MQTT endpoints. The certificate must be trusted by all entities connecting to the On-Prem installation by any means.
- Ingress controller to support HTTPS `*`
- PostgreSQL version 13.7 `*` (Main persistent data app)
- EMQX version 5.11 `*` (MQTT Broker that allows Agents to send messages to the On-Prem Cloud)
@@ -73,7 +73,8 @@ Entire helm chart is designed around the idea that it allows to install all of t
- k8s-ecr-login-renew
- kubernetes-ingress
-Each component can be enabled/disabled individually. It is done by true/false switches in `values.yaml`.
+Every configuration option is available through `values.yaml` in the folder that contains your netdata-cloud-dependency helm chart. All configuration options are described in the README.md that is part of the helm chart. It is enough to mention here that each component can be enabled or disabled individually, via true/false switches in `values.yaml`. This makes it easier for the user to migrate to production-grade components gradually.
+
Unless you prefer different solution to the problem, `k8s-ecr-login-renew` is responsible for calling out the `AWS API` for token regeneration. This token is then injected into the secret that every node is using for authentication with secured ECR when pulling the images.
Default setting in `values.yaml` of `netdata-cloud-onprem` - `.global.imagePullSecrets` is configured to work out of the box with the dependency helm chart.
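The true/false switches could be flipped like this before installing the dependency chart, for example to replace the bundled PoC PostgreSQL with an external one. The key names below are illustrative, assuming the common `<component>.enabled` convention; consult the chart's README.md for the actual layout:

```shell
# Illustrative only: the key names assume a "<component>.enabled"
# convention; check the chart's README.md for the real structure.
cat > /tmp/values-demo.yaml <<'EOF'
postgresql:
  enabled: true
emqx:
  enabled: true
EOF

# Disable only the postgresql block, leaving emqx untouched.
sed '/^postgresql:/,/enabled:/ s/enabled: true/enabled: false/' \
    /tmp/values-demo.yaml > /tmp/values-prod.yaml

cat /tmp/values-prod.yaml
```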
@@ -83,84 +84,9 @@ cd [your helm chart location]
helm upgrade --wait --install netdata-cloud-dependency -n netdata-cloud --create-namespace -f values.yaml .
```
-#### Manual dependency configuration options for production usage
-##### EMQX
-1. Make sure setup meeds your HA (High Avability) requirements.
-2. Environment variables to set:
- ```
- EMQX_BROKER__SHARED_SUBSCRIPTION_GROUP__cloudnodemqttinput__STRATEGY = hash_clientid
- EMQX_BROKER__SHARED_SUBSCRIPTION_GROUP__cloudagentmqttinput__STRATEGY = hash_clientid
- EMQX_BROKER__SHARED_SUBSCRIPTION_GROUP__cloudalarmlogmqttinput__STRATEGY = hash_clientid
- EMQX_BROKER__SHARED_SUBSCRIPTION_GROUP__cloudalarmconfigmqttinput__STRATEGY = hash_clientid
- EMQX_FORCE_SHUTDOWN__MAX_HEAP_SIZE = 128MB
- EMQX_AUTHENTICATION__1__MECHANISM = password_based
- EMQX_AUTHENTICATION__1__BACKEND = built_in_database
- EMQX_AUTHENTICATION__1__USER_ID_TYPE = username
- EMQX_AUTHENTICATION__1__ENABLE = true
- EMQX_AUTHORIZATION__NO_MATCH = deny
- EMQX_AUTHORIZATION__SOURCES__1__TYPE = file
- EMQX_AUTHORIZATION__SOURCES__1__ENABLE = false
- EMQX_AUTHORIZATION__SOURCES__2__ENABLE = true
- EMQX_AUTHORIZATION__SOURCES__2__TYPE = built_in_database
- EMQX_MQTT__MAX_PACKET_SIZE = 5MB
- ```
-3. Make sure `Values.global.emqx.provisioning` have all the data it needs. First password is the one you configured for your EMQX (needs to be an admin password). Second password username and password `Values.global.emqx.provisioning.users.netdata` is for the default user that services will use to contact EMQX's API.
-
-##### Apache Pulsar
-If you want to deploy Pulsar on the Kubernetes there is a ready to use helm chart available [here](https://pulsar.apache.org/docs/3.1.x/deploy-kubernetes/).
-1. Authentiaction - only 1 method can be used at the time. Currently we support:
- - None (not recommended). Make sure everything is disabled in `Values.global.pulsar.authentication`
- - Basic auth - turn the feature on in `Values.global.pulsar.authentication.basic`, provide password for pulsar in the same section. Each service can be configured individually.
- - OAuth - configure section `Values.global.pulsar.authentication.oauth`. In this case applications need to also mount private key. Add it manually to the cluster and point `privateKeySecretName` to it. `privateKeySecretPath` is a mounting path for it.
-2. Namespace we are using must be named `onprem` - this step is done by provisioning script during Netdata Cloud On-Prem installation.
-3. You do not need to create Topics (by default they are creating themselves). Default creation method is to create non-partitioned topics. Partitioned topics can be used but there is no need in instalations for less than 30k Netdata Agent nodes. If you predict such big installation please contact us for further instructions.
-
-##### Elastic Search
-Elastic is going to be provisioned during the first installation. Make sure to setup Elastic in High Avability and configure network and credentials for the `cloud-feed-service` to be able to connect the Elastic instance.
-
-##### Postgres
-Postgres is provisioned automatically as well. Same it was with for example EMQX - `Values.global.postgres.provisioning` - first credentials for global admin, second one for creating the user called `dev`.
-All the databases are created and assigned permissions during the first installation. `migrations` jobs that run every upgrade are there to apply schema and keep it up to date further further application changes.
-
-##### Redis
-We are using Redis in very basic and simple way. The only thing Netdata Cloud On-Prem needs is a password to Redis server. No additional provisioning is required since Redis can automatically create it's own "databases".
-
-##### Traefik
-We need traefik to:
-1. Run in minimum 2 pods for HA.
-2. Be able to utilize Netdata Cloud On-Prem namespace. We are deploying there `ingressroutes` and `middlewares`.
-3. (Optional) Prometheus metrics can be enabled - Netdata Agent for Kubernetes (if installed) can scrape those metrics.
-
-##### Ingress controller
-This is the first point of contact for both the agents and the users. This is also configureable in
-General requirements:
-1. Ingress for EMQX's passthrough:
- - Host from: `Values.global.public.cloudUrl`, port: `8083`, path: `/mqtt` - pointing to `emqx`'s service.
-2. Ingress for the rest of communication.
- - Host from: `Values.global.public.cloudUrl`, port: `80`, path: `/` - pointing to `Values.global.ingress.traefikServiceName`.
- - Host from: `Values.global.public.apiUrl`, port: `80`, path: `/api` - pointing to `Values.global.ingress.traefikServiceName`.
-3. Make sure you have ingress controller installed and correctly pointed to in `Values.global.ingress`. We ourselves are using [HAProxy Ingress Controller](https://github.com/haproxytech/kubernetes-ingress).
-
-
### netdata-cloud-onprem
-Helm chart needs some basic configuration. Every configuration option is available through `values.yaml` in the folder that contains your netdata-cloud-onprem helm chart.
-
-|Setting|Description|
-|---|---|
-|.global.netdata_cloud_license|This is section for license key that you will obtain from Product Team. **It is mandatory to provide correct key**|
-|.global.pulsar|Section responsible for Apache Pulsar configuration. Default points to PoC installation from `netdata-cloud-dependency`|
-|.global.emqx|Section responsible for EMQX configuration. Default points to PoC installation from `netdata-cloud-dependency`|
-|.global.redis|Section responsible for Redis configuration. Default points to PoC installation from `netdata-cloud-dependency`|
-|.global.postgresql|Section responsible for PostgreSQL configuration. Default points to PoC installation from `netdata-cloud-dependency`|
-|.global.elastic|Section responsible for Elastic Search configuration. Default points to PoC installation from `netdata-cloud-dependency`|
-|.global.oauth.github|Settings for login through GitHub. If not configured this option will not work at all|
-|.global.oauth.google|Settings for login through Google account. Without configuration there is no option to login with Google Account|
-|.global.mail.sendgrid|Netdata Cloud is able to send mails through sendgrid, this section allows for it's configuration|
-|.global.mail.smtp|Section for SMTP server configuration. By default it points to the mailcatcher that is installed by dependency helm chart. To access emails without proper SMTP server, setup port forwarding to mailcatcher on port `1080` for webui. By default this is the only avaiable option to access the cloud|
-|.global.ingress|Section responsible for Ingress configuration. After enabling this feature helm chart will create needed ingresses|
-|.<APP_NAME>|Each netdata application have it's own section. You can tune services or passwords individually for each application. `<APP_NAME>.autoscaling` is useful when scaling for more performance. Short description of the applications avaiable below|
-
+Every configuration option is available through `values.yaml` in the folder that contains your netdata-cloud-onprem helm chart. All configuration options are described in the README.md that is part of the helm chart.
#### Installing Netdata Cloud On-Prem
```shell
@@ -240,26 +166,4 @@ Exposes API endpoints to authenticate agents connecting to the cloud.
![infrastructure.jpeg](infrastructure.jpeg)
-## Basic troubleshooting
-We cannot predict how your particular installation of Netdata Cloud On-prem is going to work. It is a mixture of underlying infrastructure, the number of agents, and their topology. You can always contact the Netdata team for recommendations!
-
-#### Loading charts takes long time or ends with error
-Charts service is trying to collect the data from all of the agents in question. If we are talking about the overview screen, all of the nodes in space are going to be queried (`All nodes` room). If it takes a long time, there are a few things that should be checked:
-1. How many nodes are you querying directly?
- There is a big difference between having 100 nodes connected directly to the cloud compared to them being connected through a few parents. Netdata always prioritizes querying nodes through parents. This way, we can reduce some of the load by pushing the responsibility to query the data to the parent. The parent is then responsible for passing accumulated data from nodes connected to it to the cloud.
-1. If you are missing data from endpoints all the time.
- Netdata Cloud always queries nodes themselves for the metrics. The cloud only holds information about metadata, such as information about what charts can be pulled from any node, but not the data points themselves for any metric. This means that if a node is throttled by the network connection or under high resource pressure, the information exchange between the agent and cloud through the MQTT broker might take a long time. In addition to checking resource usage and networking, we advise using a parent node for such endpoints. Parents can hold the data from nodes that are connected to the cloud through them, eliminating the need to query those endpoints.
-1. Errors on the cloud when trying to load charts.
- If the entire data query is crashing and no data is displayed on the UI, it could indicate problems with the `cloud-charts-service`. It is possible that the query you are performing is simply exceeding the CPU and/or memory limits set on the deployment. We advise increasing those resources.
-
-#### It takes long time to load anything on the Cloud UI
-When experiencing sluggishness and slow responsiveness, the following factors should be checked regarding the Postgres database:
- 1. CPU: Monitor the CPU usage to ensure it is not reaching its maximum capacity. High and sustained CPU usage can lead to sluggish performance.
- 1. Memory: Check if the database server has sufficient memory allocated. Inadequate memory could cause excessive disk I/O and slow down the database.
- 1. Disk Queue / IOPS: Analyze the disk queue length and disk I/O operations per second (IOPS). A high disk queue length or limited IOPS can indicate a bottleneck and negatively impact database performance.
-By examining these factors and ensuring that CPU, memory, and disk IOPS are within acceptable ranges, you can mitigate potential performance issues with the Postgres database.
-
-#### Nodes are not updated quickly on the Cloud UI
-If youre experiencing delays with information exchange between the Cloud UI and the Agent, and youve already checked the networking and resource usage on the agent side, the problem may be related to Apache Pulsar or the database. Slow alerts on node alerts or slow updates on node status (online/offline) could indicate issues with message processing or database performance. You may want to investigate the performance of Apache Pulsar, ensure it is properly configured, and consider scaling or optimizing the database to handle the volume of data being processed or written to it.
-
 ### If you have any questions or suggestions please contact netdata team.
\ No newline at end of file
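As a first pass on the CPU factor from the Postgres checklist above, a quick host-level probe can flag sustained saturation. This is a Linux-only sketch (it reads `/proc/loadavg`), and the threshold is a rough heuristic rather than a hard rule:

```shell
# Quick probe for the Postgres host (Linux-only): a 1-minute load
# average persistently above the core count is consistent with the
# CPU pressure described in the troubleshooting notes.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
echo "load1=${load1} cores=${cores}"
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
    echo "CPU may be saturated - dig deeper with iostat and pg_stat_activity"
else
    echo "CPU load looks OK"
fi
```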
diff --git a/docs/netdata-cloud-onprem/troubleshooting-onprem.md b/docs/netdata-cloud-onprem/troubleshooting-onprem.md
new file mode 100644
index 0000000000..25b560a533
--- /dev/null
+++ b/docs/netdata-cloud-onprem/troubleshooting-onprem.md
@@ -0,0 +1,22 @@
+# Basic troubleshooting
+We cannot predict how your particular installation of Netdata Cloud On-Prem is going to work. It is a mixture of underlying infrastructure, the number of agents, and their topology.
+You can always contact the Netdata team for recommendations!
+
+#### Loading charts takes a long time or ends with an error
+The charts service tries to collect the data from all of the agents in question. If we are talking about the overview screen, all of the nodes in the space are going to be queried (the `All nodes` room). If it takes a long time, there are a few things that should be checked:
+1. How many nodes are you querying directly?
+ There is a big difference between having 100 nodes connected directly to the cloud compared to them being connected through a few parents. Netdata always prioritizes querying nodes through parents. This way, we can reduce some of the load by pushing the responsibility to query the data to the parent. The parent is then responsible for passing accumulated data from nodes connected to it to the cloud.
+1. If you are missing data from endpoints all the time.
+ Netdata Cloud always queries nodes themselves for the metrics. The cloud only holds information about metadata, such as information about what charts can be pulled from any node, but not the data points themselves for any metric. This means that if a node is throttled by the network connection or under high resource pressure, the information exchange between the agent and cloud through the MQTT broker might take a long time. In addition to checking resource usage and networking, we advise using a parent node for such endpoints. Parents can hold the data from nodes that are connected to the cloud through them, eliminating the need to query those endpoints.
+1. Errors on the cloud when trying to load charts.
+ If the entire data query is crashing and no data is displayed on the UI, it could indicate problems with the `cloud-charts-service`. It is possible that the query you are performing is simply exceeding the CPU and/or memory limits set on the deployment. We advise increasing those resources.
+
+#### It takes a long time to load anything on the Cloud UI
+When experiencing sluggishness and slow responsiveness, the following factors should be checked regarding the Postgres database:
+ 1. CPU: Monitor the CPU usage to ensure it is not reaching its maximum capacity. High and sustained CPU usage can lead to sluggish performance.
+ 1. Memory: Check if the database server has sufficient memory allocated. Inadequate memory could cause excessive disk I/O and slow down the database.
+ 1. Disk Queue / IOPS: Analyze the disk queue length and disk I/O operations per second (IOPS). A high disk queue length or limited IOPS can indicate a bottleneck and negatively impact database performance.
+By examining these factors and ensuring that CPU, memory, and disk IOPS are within acceptable ranges, you can mitigate potential performance issues with the Postgres database.
+
+#### Nodes are not updated quickly on the Cloud UI
+If you're experiencing delays with information exchange between the Cloud UI and the Agent, and you've already checked the networking and resource usage on the agent side, the problem may be related to Apache Pulsar or the database. Slow node alerts or slow updates of node status (online/offline) could indicate issues with message processing or database performance. You may want to investigate the performance of Apache Pulsar, ensure it is properly configured, and consider scaling or optimizing the database to handle the volume of data being processed or written to it.
\ No newline at end of file
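For the chart-loading errors described above, raising the limits on `cloud-charts-service` might look like the following. This is an illustrative sketch only: the namespace matches the install commands earlier in this commit, but the container name and the limit values are assumptions to adapt to your cluster.

```shell
# Illustrative sketch: raise CPU/memory limits on cloud-charts-service.
# The container name inside the deployment and the limit values are
# assumptions - verify them with "kubectl describe deployment" first.
NS=netdata-cloud
DEPLOY=cloud-charts-service
PATCH='{"spec":{"template":{"spec":{"containers":[{"name":"cloud-charts-service","resources":{"limits":{"cpu":"2","memory":"2Gi"}}}]}}}}'

# Printed rather than executed here - run it against your cluster:
echo kubectl -n "$NS" patch deployment "$DEPLOY" --patch "$PATCH"
```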