authorIlya Mashchenko <ilya@netdata.cloud>2023-11-12 19:12:08 +0200
committerGitHub <noreply@github.com>2023-11-12 19:12:08 +0200
commitc4f71a268d54ac601a3888bda0b7e4c946bf8aa8 (patch)
tree119de88e6c23e4619b1cafe70b00fa8df27d6d5b
parent1fad323016076547e140a54d1436b0ea154c29f1 (diff)
health guides: remove guides for alerts that don't exist in the repo (#16375)
-rw-r--r--health/guides/cgroups/cgroup_10s_received_packets_storm.md37
-rw-r--r--health/guides/cgroups/cgroup_1m_received_packets_rate.md37
-rw-r--r--health/guides/cgroups/k8s_cgroup_10s_received_packets_storm.md59
-rw-r--r--health/guides/cgroups/k8s_cgroup_1m_received_packets_rate.md49
-rw-r--r--health/guides/fping/fping_host_latency.md22
-rw-r--r--health/guides/fping/fping_host_reachable.md55
-rw-r--r--health/guides/mdstat/mdstat_last_collected.md49
-rw-r--r--health/guides/ram/30min_ram_swapped_out.md26
-rw-r--r--health/guides/ram/used_swap.md24
-rw-r--r--health/guides/upsd/upsd_10min_ups_load.md (renamed from health/guides/nut/nut_10min_ups_load.md)2
-rw-r--r--health/guides/upsd/upsd_ups_battery_charge.md (renamed from health/guides/nut/nut_ups_charge.md)2
-rw-r--r--health/guides/upsd/upsd_ups_last_collected_secs.md (renamed from health/guides/nut/nut_last_collected_secs.md)0
12 files changed, 2 insertions, 360 deletions
diff --git a/health/guides/cgroups/cgroup_10s_received_packets_storm.md b/health/guides/cgroups/cgroup_10s_received_packets_storm.md
deleted file mode 100644
index 242acebd29..0000000000
--- a/health/guides/cgroups/cgroup_10s_received_packets_storm.md
+++ /dev/null
@@ -1,37 +0,0 @@
-### Understand the alert
-
-This alert compares the average rate of packets received on a network interface over the last 10 seconds with the rate over the last minute. If the short-term rate significantly exceeds the longer-term rate, it may indicate a packet storm, which can impact network performance and connectivity.
-
-### What is a packet storm?
-
-A packet storm is a sudden increase in network traffic due to a large number of packets being sent simultaneously. This can cause network congestion, packet loss, and increased latency, leading to a degradation of network performance and potential loss of connectivity.
-
-### Troubleshoot the alert
-
-1. Identify the affected interface and examine its traffic patterns:
-
-Use `iftop` or a similar monitoring tool to view real-time network traffic on the affected interface.
-
-```
-sudo iftop -i <interface_name>
-```
-
-Replace `<interface_name>` with the name of the network interface experiencing the packet storm (e.g., eth0).
-
-2. Check for possible packet flood sources:
-
-Examine logs, firewall rules, and traffic patterns for evidence of a Denial of Service (DoS) attack or a misconfigured application. Use tools like `tcpdump` or `wireshark` to capture network packets and analyze traffic.
-
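-For example, you can capture a short sample of traffic on the affected interface for offline analysis (the interface name is a placeholder; `-c 1000` stops after 1000 packets and `-w` writes them to a file):
-
-```
-sudo tcpdump -i <interface_name> -n -c 1000 -w storm.pcap
-```
-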
-3. Limit or block unwanted traffic:
-
-Apply traffic shaping or Quality of Service (QoS) policies, firewall rules, or Intrusion Prevention System (IPS) to limit or block the sources of unwanted traffic.
-
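-As a first-aid measure, you can drop traffic from an identified flood source at the firewall (the address below is a placeholder):
-
-```
-sudo iptables -A INPUT -s <offending_ip> -j DROP
-```
-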
-4. Monitor network performance:
-
-Continuously monitor network performance to ensure the issue is resolved and prevent future packet storms. Use monitoring tools like Netdata to keep track of network performance metrics.
-
-### Useful resources
-
-1. [iftop: Linux Network Bandwidth Monitoring Tool](https://www.tecmint.com/iftop-linux-network-bandwidth-monitoring-tool/)
-2. [tcpdump: A powerful command-line packet analyzer](https://www.tcpdump.org/)
-3. [Wireshark: A network protocol analyzer for UNIX and Windows](https://www.wireshark.org/)
diff --git a/health/guides/cgroups/cgroup_1m_received_packets_rate.md b/health/guides/cgroups/cgroup_1m_received_packets_rate.md
deleted file mode 100644
index bbe727fb4c..0000000000
--- a/health/guides/cgroups/cgroup_1m_received_packets_rate.md
+++ /dev/null
@@ -1,37 +0,0 @@
-### Understand the alert
-
-This alert calculates the average number of packets received by the network interface `${label:device}` over the period of one minute. If you receive this alert, it means that the rate of received packets has significantly increased, which could indicate a potential network bottleneck or an increased network workload.
-
-### What does received packets rate mean?
-
-`Received packets rate` represents the speed at which packets are arriving at the network interface of your machine. A packet is a single unit of data transmitted over the network. A high rate of received packets indicates that your network is under significant workload as it processes incoming data.
-
-### Troubleshoot the alert
-
-1. Monitor your network traffic
-
- Use the `iftop` command to get a real-time report of bandwidth usage on your network interfaces:
- ```
- sudo iftop -i ${label:device}
- ```
- If you don't have `iftop` installed, install it using your package manager.
-
-2. Identify the top consumers of network bandwidth
-
- Inspect the output of `iftop` to identify if any IP addresses or hosts are using an unusual amount of bandwidth. This can help you pinpoint any sudden surges in network traffic caused by specific services or applications.
-
-3. Check for possible network congestion
-
- Determine if the high received packets rate is caused by network congestion. Network congestion occurs when the volume of data being transmitted exceeds the available capacity of the network. You can use `ping` or `traceroute` commands to check for latency and packet loss.
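-
-   For example (the remote host is a placeholder; look for high round-trip times or lost packets):
-
-   ```
-   ping -c 20 <remote_host>
-   traceroute -n <remote_host>
-   ```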
-
-4. Examine your application logs
-
- Investigate your application logs to identify any unusual activity or network spikes. This can provide valuable information about potential issues, such as a sudden increase in incoming client connections, improperly optimized application configurations, or the presence of malicious traffic.
-
-5. Optimize your network configuration
-
- Review your networking configurations to ensure they are optimized for the current workload. Check for any misconfigurations or resource limitations that might be causing the high received packets rate. You might consider increasing the maximum number of open file descriptors, changing your network driver settings, or adjusting your network buffer sizes.
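-
-   As a minimal sketch, you can inspect and raise the kernel's receive buffer and backlog limits with `sysctl` (the values below are illustrative, not recommendations):
-
-   ```
-   sysctl net.core.rmem_max net.core.netdev_max_backlog
-   sudo sysctl -w net.core.rmem_max=26214400
-   sudo sysctl -w net.core.netdev_max_backlog=5000
-   ```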
-
-### Useful resources
-
-1. [Iftop Guide – Monitor Network Bandwidth](https://www.tecmint.com/iftop-linux-network-bandwidth-monitoring-tool/)
diff --git a/health/guides/cgroups/k8s_cgroup_10s_received_packets_storm.md b/health/guides/cgroups/k8s_cgroup_10s_received_packets_storm.md
deleted file mode 100644
index b5e3c424df..0000000000
--- a/health/guides/cgroups/k8s_cgroup_10s_received_packets_storm.md
+++ /dev/null
@@ -1,59 +0,0 @@
-### Understand the alert
-
-This alert indicates a potential `received packets storm` in your Kubernetes (k8s) cluster's network on a cgroup (control group) network interface. A received packets storm occurs when the average number of received packets in the last 10 seconds significantly exceeds the rate over the last minute.
-
-### What is a cgroup?
-
-A `cgroup` (control group) is a Linux kernel feature that limits, accounts for, and isolates resource usage (CPU, memory, disk I/O, etc.) for a process or a group of processes. In Kubernetes, cgroups are used to manage resources for each container within a pod.
-
-### What is a received packets storm?
-
-A received packets storm occurs when the average number of received packets on a network interface becomes significantly higher than the recent background rate. This can cause network congestion, increased latency, or even denial of service, affecting the performance of services running on the Kubernetes cluster.
-
-### Troubleshoot the alert
-
-1. Inspect overall network activity on the affected node(s):
-
- Use the `iftop` command to monitor network activity on the host in real-time:
-
- ```
- sudo iftop
- ```
-
- If you don't have `iftop` installed, install it before running the command.
-
-2. Identify the container(s) responsible for the high packet rate:
-
- To list the running container(s) and their associated cgroups, run the following command:
-
- ```
-    sudo kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}:{.metadata.name}{"\t"}{.status.containerStatuses[*].containerID}{"\n"}{end}'
- ```
-
-    Now, check the network interface statistics of a suspect container from inside its pod (network counters live in the pod's network namespace, not in a cgroup file):
-
-    ```
-    sudo kubectl exec -n <namespace> <pod-name> -- cat /proc/net/dev
-    ```
-
-3. Investigate the cause:
-
- - Inspect the logs of the affected container(s) for any errors or unusual activity:
-
- ```
- sudo kubectl logs -f <pod-name> -c <container-name> -n <namespace>
- ```
-
- - Check if there are any misconfigurations or if network rate limits are not set correctly in the Kubernetes Deployment, StatefulSet, or DaemonSet manifest.
-
-4. Mitigate the issue:
-
-   - If unnecessary traffic is causing the packet storm, consider implementing network throttling or limiting the rate at which packets are generated or received by the container(s), as in the sketch after this list.
-
- - If the issue is caused by a bug or misconfiguration, fix the problem and redeploy the affected component(s).
-
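-One way to police inbound traffic is the kernel's `tc` ingress qdisc. A minimal sketch, assuming you apply it on the node to the pod's host-side veth interface (the interface name and rate are placeholders):
-
-```
-sudo tc qdisc add dev <veth_interface> handle ffff: ingress
-sudo tc filter add dev <veth_interface> parent ffff: protocol ip prio 50 u32 match ip src 0.0.0.0/0 police rate 10mbit burst 100k drop flowid :1
-```
-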
-### Useful resources
-
-1. [Kubernetes Cgroups documentation](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/#debugging-the-kernel-cgroups-and-kubernetes-primitives)
-2. [Monitoring and Visualizing Network Bandwidth on Linux](https://www.tecmint.com/linux-network-bandwidth-monitoring-tools/)
-3. [Networking in Kubernetes](https://kubernetes.io/docs/concepts/cluster-administration/networking/)
diff --git a/health/guides/cgroups/k8s_cgroup_1m_received_packets_rate.md b/health/guides/cgroups/k8s_cgroup_1m_received_packets_rate.md
deleted file mode 100644
index 7554c8358a..0000000000
--- a/health/guides/cgroups/k8s_cgroup_1m_received_packets_rate.md
+++ /dev/null
@@ -1,49 +0,0 @@
-### Understand the alert
-
-This alert calculates the average number of packets received by a specific network interface (denoted as `${label:device}` in the alert) on a Kubernetes cluster node over the last minute. If you receive this alert, it indicates that there is a significant amount of network traffic received by the node.
-
-### What does high received packets rate mean?
-
-A high received packets rate means that the network interface on the Kubernetes cluster node is processing a large number of incoming network packets. This can be due to increased legitimate traffic to the services running on the cluster or may indicate a potential network issue or Distributed Denial of Service (DDoS) attack.
-
-### Troubleshoot the alert
-
-1. Verify the current network traffic on the Kubernetes node:
-
- You can use the `nethogs` tool to analyze the network traffic on the Kubernetes node. If the tool is not installed, you can install it with:
-
- ```
- sudo apt install nethogs # Ubuntu/Debian
- sudo yum install nethogs # CentOS/RHEL
- ```
-
- Run `nethogs` to check the network traffic:
-
- ```
- sudo nethogs
- ```
-
-2. Check the services running on the Kubernetes cluster:
-
- Use the command `kubectl get pods --all-namespaces` to list all the pods running on the cluster. Inspect the output and identify any services that might be consuming a high amount of network traffic.
-
-3. Inspect logs for any anomalies:
-
- Check the application and Kubernetes logs for any unusual activity, errors, or repeated access attempts that may indicate a network issue or potential attack.
-
-4. Close unnecessary processes or services:
-
- Based on your analysis, if you find any unnecessary processes or services consuming a high amount of network traffic, consider terminating or scaling them down.
-
-5. Check for DDoS attacks:
-
- If you suspect a DDoS attack, consider implementing traffic filtering, rate limiting, or using a DDoS protection service to mitigate the attack.
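-
-   As one example, the iptables `hashlimit` module can drop per-source floods above a threshold (the port and rates below are placeholders to tune):
-
-   ```
-   sudo iptables -A INPUT -p tcp --dport 443 -m hashlimit \
-     --hashlimit-mode srcip --hashlimit-above 200/second \
-     --hashlimit-burst 500 --hashlimit-name ddos -j DROP
-   ```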
-
-6. Monitor network traffic:
-
- Continue monitoring the network traffic on the Kubernetes node to ensure that the received packets rate returns to normal levels.
-
-### Useful resources
-
-1. [Kubernetes Networking](https://kubernetes.io/docs/concepts/cluster-administration/networking/)
-2. [How to Monitor and Identify Issues with Kubernetes Networking](https://www.stackrox.com/post/2017/03/how-to-monitor-and-identify-issues-with-kubernetes-networking/)
diff --git a/health/guides/fping/fping_host_latency.md b/health/guides/fping/fping_host_latency.md
deleted file mode 100644
index c13a6ef065..0000000000
--- a/health/guides/fping/fping_host_latency.md
+++ /dev/null
@@ -1,22 +0,0 @@
-### Understand the alert
-
-`fping` is a command line tool to send ICMP (Internet Control Message Protocol) echo requests to network hosts, similar to ping, but performing much better when pinging multiple hosts. The Netdata
-Agent utilizes `fping` to monitor latency, packet loss, uptime and reachability of any number of network endpoints.
-
-For the `fping_host_latency` alert, the Netdata Agent monitors the average latency to the network host over the last 10 seconds. Receiving this alert indicates high latency to the network host. It is likely you are experiencing networking issues or the host is overloaded.
-
-### Troubleshoot the alert
-
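-First, confirm the latency with a manual probe. `fping -c` sends a fixed number of probes and reports min/avg/max round-trip times (the endpoint is a placeholder):
-
-```
-fping -c 10 <your_endpoint>
-```
-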
-- Customize the ICMP requests for each endpoint
-
-Different endpoints could be in different networks. For example, a server on your internal network will typically respond with lower latency than your cloud infrastructure. Avoid using a single global configuration for every endpoint. You can find more information about how to configure each endpoint separately in the [fping.plugin alarm guide](https://learn.netdata.cloud/docs/agent/collectors/fping.plugin/#additional-tips).
-
-- Prioritize traffic on your endpoints
-
-Quality of service (QoS) is the use of mechanisms or technologies to control traffic and ensure the performance of critical applications. QoS works best when low-priority traffic exists that can be dropped when congestion occurs. The higher-priority traffic must fit within the bandwidth limitations of the link or path. The following are two open source solutions to apply QoS policies to your network interfaces.
-
-### Useful resources
-
-- [FireQOS](https://firehol.org/tutorial/fireqos-new-user/) is a traffic shaping helper. It has a very simple shell scripting language to express traffic shaping.
-
-- [`tcconfig`](https://tcconfig.readthedocs.io/en/latest/index.html) is a command wrapper that makes it easy to set up traffic control of network bandwidth, latency, packet-loss, packet-corruption, etc. \ No newline at end of file
diff --git a/health/guides/fping/fping_host_reachable.md b/health/guides/fping/fping_host_reachable.md
deleted file mode 100644
index ff3c3eed24..0000000000
--- a/health/guides/fping/fping_host_reachable.md
+++ /dev/null
@@ -1,55 +0,0 @@
-### Understand the alert
-
-`fping` is a command line tool to send ICMP (Internet Control Message Protocol) echo requests to network hosts, similar to ping, but performing much better when pinging multiple hosts. The Netdata
-Agent utilizes `fping` to monitor latency, packet loss, uptime and reachability of any number of network endpoints.
-
-The `fping_host_reachable` alert in the Netdata Agent checks the reachability of a network host (0: unreachable, 1: reachable). Receiving a critical alert indicates that your endpoints are unreachable. It is likely that the host is down or your system is experiencing networking issues.
-
-### Troubleshoot the alert
-
-- Check network connectivity
-
-Verify that your system has access to the particular endpoint. Check for basic connectivity to known hosts from both your host and the endpoint.
-
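-A quick way to test reachability is with `fping` itself (the endpoint is a placeholder):
-
-```
-fping -c 3 <your_endpoint>
-```
-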
-- DNS settings
-
-If you are using DNS resolution to check your endpoint, you should also check your DNS settings. To troubleshoot this issue, verify that your DNS server can resolve your endpoints.
-
-1. Check your current DNS resolution (for example, on Linux you can use the `host` command):
-
- ```
- host -v <your_endpoint>
- ```
-
-2. If the endpoint is supposed to be public facing, try an alternative DNS server (for example, Cloudflare's):
-
- ```
- host -v <your_endpoint> 1.1.1.1
- ```
-
-- Verify access restrictions on the remote host
-
-If the remote host is a Linux-based machine and you have access to it, you can check the following.
-
-**Check the ICMP settings**
-
-Most Linux distributions let you restrict ICMP echo operations.
-
- 1. Check your current setting. If this value is set to 1, your system ignores incoming ICMP echo requests.
-    ```
-    sysctl net.ipv4.icmp_echo_ignore_all
-    ```
- 2. To change this, set the `net.ipv4.icmp_echo_ignore_all=0` entry in `/etc/sysctl.conf`.
-
- 3. Reload the sysctl settings.
- ```
- sysctl -p
- ```
-
-**Check your firewall rules**
-
-Depending on which firewall you use, the commands might differ from what's shown below. For example, if you are using iptables you can check for rules restricting `icmp`.
- ```
- iptables -L -n | grep -i icmp
- ```
-
-For further investigation of or changes to your firewall settings, we **strongly** advise you to consult your firewall's documentation and guidelines. \ No newline at end of file
diff --git a/health/guides/mdstat/mdstat_last_collected.md b/health/guides/mdstat/mdstat_last_collected.md
deleted file mode 100644
index 4205d9935e..0000000000
--- a/health/guides/mdstat/mdstat_last_collected.md
+++ /dev/null
@@ -1,49 +0,0 @@
-### Understand the alert
-
-The `mdstat_last_collected` alert is generated when there is a delay or absence of data collection from the Multiple Device (md) driver for an extended period of time. This can be a sign of an issue with the RAID array or the system itself.
-
-### Troubleshoot the alert
-
-1. Check the status of the RAID array
-
- The status of the RAID array can be checked using the following command:
-
- ```
- cat /proc/mdstat
- ```
-
- This will display the RAID array's current status, including any errors, degraded state, or rebuilding progress.
-
-2. Ensure the Netdata Agent is running
-
- Verify that the Netdata Agent is running and collecting data from the system using the following command:
-
- ```
- sudo systemctl status netdata
- ```
-
- If the Netdata Agent is not running, start it using:
-
- ```
- sudo systemctl start netdata
- ```
-
-3. Check if the `mdstat` plugin is enabled in `/etc/netdata/netdata.conf`
-
- Ensure that the plugin responsible for collecting data from the md driver is enabled. Look for the following lines in `/etc/netdata/netdata.conf`:
-
- ```
- [plugin:proc:/proc/mdstat]
- dedicated lines for md devices = no (auto)
- ```
-
- Make sure that the option is set as shown above.
-
-4. Check for any hardware issues or faulty disks
-
- If the RAID array status shows errors or a degraded state, investigate the disks and the RAID controller for any hardware issues or failures. If needed, replace the faulty disk and rebuild the array.
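-
-   If a disk must be replaced, `mdadm` can fail it out and add the replacement (the array and device names below are placeholders):
-
-   ```
-   sudo mdadm --detail /dev/md0
-   sudo mdadm --manage /dev/md0 --fail /dev/sdX1 --remove /dev/sdX1
-   sudo mdadm --manage /dev/md0 --add /dev/sdY1
-   ```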
-
-5. Monitor the RAID array and system status
-
- Keep an eye on the RAID array's status and overall system health. If the issue persists or worsens, consider scheduling downtime for further diagnostics and maintenance.
-
diff --git a/health/guides/ram/30min_ram_swapped_out.md b/health/guides/ram/30min_ram_swapped_out.md
deleted file mode 100644
index c822aa6901..0000000000
--- a/health/guides/ram/30min_ram_swapped_out.md
+++ /dev/null
@@ -1,26 +0,0 @@
-### Understand the alert
-
-If the system needs more memory resources than your available RAM, inactive pages in memory can be moved into the swap space (or swap file). The swap space (or swap file) is located on hard drives,
-which have a slower access time than physical memory.
-
-The Netdata Agent calculates the percentage of system RAM that was swapped out over the last 30 minutes. This alert is triggered in a warning state if that percentage exceeds 20%.
-
-### Troubleshoot the alert
-
-You can identify the most resource-greedy processes on your system, but if you receive this alert frequently, you should consider upgrading your system's RAM.
-
-- Find the processes that consume the most RAM
-
-Linux:
-```
-top -b -o +%MEM | head -n 22
-```
-
-FreeBSD:
-```
-top -b -o res | head -n 22
-```
-
-Here, you can see which processes are the main RAM consumers. To avoid thrashing, consider killing any main consumer processes that you do not need.
-
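-To watch current swap activity directly, `vmstat` reports swap-in (`si`) and swap-out (`so`) rates, here sampled every 5 seconds:
-
-```
-vmstat 5
-```
-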
-Netdata strongly suggests knowing exactly what processes you are closing and being certain that they are not necessary.
diff --git a/health/guides/ram/used_swap.md b/health/guides/ram/used_swap.md
deleted file mode 100644
index 48b2c61022..0000000000
--- a/health/guides/ram/used_swap.md
+++ /dev/null
@@ -1,24 +0,0 @@
-### Understand the alert
-
-If the system needs more memory resources than your available RAM, inactive pages in memory can be moved into the swap space (or swap file). The swap space (or swap file) is located on hard drives, which have a slower access time than physical memory.
-
-The Netdata Agent calculates the percentage of used swap. This alert indicates high swap memory utilization, which may be a sign that the system has experienced memory pressure, affecting the performance of your system. If neither RAM nor swap is available, the OOM Killer can start killing processes.
-
-This alert is triggered in warning state when the percentage of used swap is between 80-90% and in critical state when it is between 90-98%.
-
-### Troubleshoot the alert
-
-- Check per-process RAM usage to find the top consumers
-
-Linux:
-```
-top -b -o +%MEM | head -n 22
-```
-FreeBSD:
-```
-top -b -o res | head -n 22
-```
-
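-On Linux, you can also list per-process swap usage directly by reading `VmSwap` from `/proc` (a quick sketch; values are in kB):
-
-```
-grep VmSwap /proc/[0-9]*/status 2>/dev/null | sort -k2 -rn | head -20
-```
-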
-It would be helpful to close any of the main consumer processes, but Netdata strongly suggests knowing exactly what processes you are closing and being certain that they are not necessary.
-
diff --git a/health/guides/nut/nut_10min_ups_load.md b/health/guides/upsd/upsd_10min_ups_load.md
index 98f5fb4c3c..fad4a2f6fa 100644
--- a/health/guides/nut/nut_10min_ups_load.md
+++ b/health/guides/upsd/upsd_10min_ups_load.md
@@ -1,6 +1,6 @@
### Understand the alert
-This alert is based on the `nut_10min_ups_load` metric, which measures the average UPS load over the last 10 minutes. If you receive this alert, it means that the load on your UPS is higher than expected, which may lead to an unstable power supply and ungraceful system shutdowns.
+This alert is based on the `upsd_10min_ups_load` metric, which measures the average UPS load over the last 10 minutes. If you receive this alert, it means that the load on your UPS is higher than expected, which may lead to an unstable power supply and ungraceful system shutdowns.
### Troubleshoot the alert
diff --git a/health/guides/nut/nut_ups_charge.md b/health/guides/upsd/upsd_ups_battery_charge.md
index eb3abac7cd..0d8f757f20 100644
--- a/health/guides/nut/nut_ups_charge.md
+++ b/health/guides/upsd/upsd_ups_battery_charge.md
@@ -1,6 +1,6 @@
### Understand the alert
-The `nut_ups_charge` alert indicates that the average UPS charge over the last minute has dropped below a predefined threshold. This might be due to a power outage, a UPS malfunction, or a sudden surge in power demands that the UPS can't handle.
+The `upsd_ups_battery_charge` alert indicates that the average UPS charge over the last minute has dropped below a predefined threshold. This might be due to a power outage, a UPS malfunction, or a sudden surge in power demands that the UPS can't handle.
### Troubleshoot the alert
diff --git a/health/guides/nut/nut_last_collected_secs.md b/health/guides/upsd/upsd_ups_last_collected_secs.md
index 8182478343..8182478343 100644
--- a/health/guides/nut/nut_last_collected_secs.md
+++ b/health/guides/upsd/upsd_ups_last_collected_secs.md