summaryrefslogtreecommitdiffstats
path: root/backends
diff options
context:
space:
mode:
Diffstat (limited to 'backends')
-rw-r--r--backends/Makefile.am19
-rw-r--r--backends/README.md137
-rw-r--r--backends/backends.c659
-rw-r--r--backends/backends.h50
-rw-r--r--backends/graphite/Makefile.am4
-rw-r--r--backends/graphite/graphite.c90
-rw-r--r--backends/graphite/graphite.h35
-rw-r--r--backends/json/Makefile.am4
-rw-r--r--backends/json/json.c152
-rw-r--r--backends/json/json.h34
-rwxr-xr-xbackends/nc-backend.sh158
-rw-r--r--backends/opentsdb/Makefile.am4
-rw-r--r--backends/opentsdb/opentsdb.c90
-rw-r--r--backends/opentsdb/opentsdb.h35
-rw-r--r--backends/prometheus/Makefile.am8
-rw-r--r--backends/prometheus/README.md376
-rw-r--r--backends/prometheus/backend_prometheus.c565
-rw-r--r--backends/prometheus/backend_prometheus.h20
18 files changed, 2440 insertions, 0 deletions
diff --git a/backends/Makefile.am b/backends/Makefile.am
new file mode 100644
index 0000000000..268259edd7
--- /dev/null
+++ b/backends/Makefile.am
@@ -0,0 +1,19 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+
+AUTOMAKE_OPTIONS = subdir-objects
+MAINTAINERCLEANFILES = $(srcdir)/Makefile.in
+
+SUBDIRS = \
+ graphite \
+ json \
+ opentsdb \
+ prometheus \
+ $(NULL)
+
+dist_noinst_DATA = \
+ README.md \
+ $(NULL)
+
+dist_noinst_SCRIPTS = \
+ nc-backend.sh \
+ $(NULL)
diff --git a/backends/README.md b/backends/README.md
new file mode 100644
index 0000000000..e514c2b8fe
--- /dev/null
+++ b/backends/README.md
@@ -0,0 +1,137 @@
+
+netdata supports backends for archiving the metrics, or providing long term dashboards, using grafana or other tools, like this:
+
+![image](https://cloud.githubusercontent.com/assets/2662304/20649711/29f182ba-b4ce-11e6-97c8-ab2c0ab59833.png)
+
+Since netdata collects thousands of metrics per server per second, which would easily congest any backend server when several netdata servers are sending data to it, netdata allows sending metrics at a lower frequency. So, although netdata collects metrics every second, it can send to the backend servers averages or sums every X seconds (though, it can send them per second if you need it to).
+
+## features
+
+1. Supported backends
+
+ 1. **graphite** (`plaintext interface`, used by **Graphite**, **InfluxDB**, **KairosDB**, **Blueflood**, **ElasticSearch** via logstash tcp input and the graphite codec, etc)
+
+ metrics are sent to the backend server as `prefix.hostname.chart.dimension`. `prefix` is configured below, `hostname` is the hostname of the machine (can also be configured).
+
+ 2. **opentsdb** (`telnet interface`, used by **OpenTSDB**, **InfluxDB**, **KairosDB**, etc)
+
+ metrics are sent to opentsdb as `prefix.chart.dimension` with tag `host=hostname`.
+
+ 3. **json** document DBs
+
+ metrics are sent to a document db, `JSON` formatted.
+
+ 4. **prometheus** is described at [prometheus page](prometheus/) since it pulls data from netdata.
+
+2. Only one backend may be active at a time.
+
+3. All metrics are transferred to the backend - netdata does not implement any metric filtering.
+
+4. Three modes of operation (for all backends):
+
+ 1. `as collected`: the latest collected value is sent to the backend. This means that if netdata is configured to send data to the backend every 10 seconds, only 1 out of 10 values will appear at the backend server. The values are sent exactly as collected, before any multipliers or dividers applied and before any interpolation. This mode emulates other data collectors, such as `collectd`.
+
+ 2. `average`: the average of the interpolated values shown on the netdata graphs is sent to the backend. So, if netdata is configured to send data to the backend server every 10 seconds, the average of the 10 values shown on the netdata charts will be used. **If you can't decide which mode to use, use `average`.**
+
+ 3. `sum` or `volume`: the sum of the interpolated values shown on the netdata graphs is sent to the backend. So, if netdata is configured to send data to the backend every 10 seconds, the sum of the 10 values shown on the netdata charts will be used.
+
+5. This code is smart enough, not to slow down netdata, independently of the speed of the backend server.
+
+## configuration
+
+In `/etc/netdata/netdata.conf` you should have something like this (if not download the latest version of `netdata.conf` from your netdata):
+
+```
+[backend]
+ enabled = yes | no
+ type = graphite | opentsdb | json
+ host tags = list of TAG=VALUE
+ destination = space separated list of [PROTOCOL:]HOST[:PORT] - the first working will be used
+ data source = average | sum | as collected
+ prefix = netdata
+ hostname = my-name
+ update every = 10
+ buffer on failures = 10
+ timeout ms = 20000
+ send charts matching = *
+ send hosts matching = localhost *
+ send names instead of ids = yes
+```
+
+- `enabled = yes | no`, enables or disables sending data to a backend
+
+- `type = graphite | opentsdb | json`, selects the backend type
+
+- `destination = host1 host2 host3 ...`, accepts **a space separated list** of hostnames, IPs (IPv4 and IPv6) and ports to connect to. Netdata will use the **first available** to send the metrics.
+
+ The format of each item in this list, is: `[PROTOCOL:]IP[:PORT]`.
+
+ `PROTOCOL` can be `udp` or `tcp`. `tcp` is the default and only supported by the current backends.
+
+ `IP` can be `XX.XX.XX.XX` (IPv4), or `[XX:XX...XX:XX]` (IPv6). For IPv6 you can to enclose the IP in `[]` to separate it from the port.
+
+ `PORT` can be a number of a service name. If omitted, the default port for the backend will be used (graphite = 2003, opentsdb = 4242).
+
+ Example IPv4:
+
+```
+ destination = 10.11.14.2:4242 10.11.14.3:4242 10.11.14.4:4242
+```
+
+ Example IPv6 and IPv4 together:
+
+```
+ destination = [ffff:...:0001]:2003 10.11.12.1:2003
+```
+
+ When multiple servers are defined, netdata will try the next one when the first one fails. This allows you to load-balance different servers: give your backend servers in different order on each netdata.
+
+ netdata also ships [`nc-backend.sh`](https://github.com/netdata/netdata/blob/master/contrib/nc-backend.sh), a script that can be used as a fallback backend to save the metrics to disk and push them to the time-series database when it becomes available again. It can also be used to monitor / trace / debug the metrics netdata generates.
+
+- `data source = as collected`, or `data source = average`, or `data source = sum`, selects the kind of data that will be sent to the backend.
+
+- `hostname = my-name`, is the hostname to be used for sending data to the backend server. By default this is `[global].hostname`.
+
+- `prefix = netdata`, is the prefix to add to all metrics.
+
+- `update every = 10`, is the number of seconds between sending data to the backend. netdata will add some randomness to this number, to prevent stressing the backend server when many netdata servers send data to the same backend. This randomness does not affect the quality of the data, only the time they are sent.
+
+- `buffer on failures = 10`, is the number of iterations (each iteration is `[backend].update every` seconds) to buffer data, when the backend is not available. If the backend fails to receive the data after that many failures, data loss on the backend is expected (netdata will also log it).
+
+- `timeout ms = 20000`, is the timeout in milliseconds to wait for the backend server to process the data. By default this is `2 * update_every * 1000`.
+
+- `send hosts matching = localhost *` includes one or more space separated patterns, using ` * ` as wildcard (any number of times within each pattern). The patterns are checked against the hostname (the localhost is always checked as `localhost`), allowing us to filter which hosts will be sent to the backend when this netdata is a central netdata aggregating multiple hosts. A pattern starting with ` ! ` gives a negative match. So to match all hosts named `*db*` except hosts containing `*slave*`, use `!*slave* *db*` (so, the order is important: the first pattern matching the hostname will be used - positive or negative).
+
+- `send charts matching = *` includes one or more space separated patterns, using ` * ` as wildcard (any number of times within each pattern). The patterns are checked against both chart id and chart name. A pattern starting with ` ! ` gives a negative match. So to match all charts named `apps.*` except charts ending in `*reads`, use `!*reads apps.*` (so, the order is important: the first pattern matching the chart id or the chart name will be used - positive or negative).
+
+- `send names instead of ids = yes | no` controls the metric names netdata should send to backend. netdata supports names and IDs for charts and dimensions. Usually IDs are unique identifiers as read by the system and names are human friendly labels (also unique). Most charts and metrics have the same ID and name, but in several cases they are different: disks with device-mapper, interrupts, QoS classes, statsd synthetic charts, etc.
+
+- `host tags = list of TAG=VALUE` defines tags that should be appended on all metrics for the given host. These are currently only sent to opentsdb and prometheus. Please use the appropriate format for each time-series db. For example opentsdb likes them like `TAG1=VALUE1 TAG2=VALUE2`, but prometheus like `tag1="value1",tag2="value2"`. Host tags are mirrored with database replication (streaming of metrics between netdata servers).
+
+## monitoring operation
+
+netdata provides 5 charts:
+
+1. **Buffered metrics**, the number of metrics netdata added to the buffer for dispatching them to the backend server.
+2. **Buffered data size**, the amount of data (in KB) netdata added the buffer.
+3. ~~**Backend latency**, the time the backend server needed to process the data netdata sent. If there was a re-connection involved, this includes the connection time.~~ (this chart has been removed, because it only measures the time netdata needs to give the data to the O/S - since the backend servers do not ack the reception, netdata does not have any means to measure this properly)
+4. **Backend operations**, the number of operations performed by netdata.
+5. **Backend thread CPU usage**, the CPU resources consumed by the netdata thread, that is responsible for sending the metrics to the backend server.
+
+![image](https://cloud.githubusercontent.com/assets/2662304/20463536/eb196084-af3d-11e6-8ee5-ddbd3b4d8449.png)
+
+## alarms
+
+The latest version of the alarms configuration for monitoring the backend is here: https://github.com/netdata/netdata/blob/master/conf.d/health.d/backend.conf
+
+netdata adds 4 alarms:
+
+1. `backend_last_buffering`, number of seconds since the last successful buffering of backend data
+2. `backend_metrics_sent`, percentage of metrics sent to the backend server
+3. `backend_metrics_lost`, number of metrics lost due to repeating failures to contact the backend server
+4. ~~`backend_slow`, the percentage of time between iterations needed by the backend time to process the data sent by netdata~~ (this was misleading and has been removed).
+
+![image](https://cloud.githubusercontent.com/assets/2662304/20463779/a46ed1c2-af43-11e6-91a5-07ca4533cac3.png)
+
+## InfluxDB setup as netdata backend (example)
+You can find blog post with example: how to use InfluxDB with netdata [here](https://blog.hda.me/2017/01/09/using-netdata-with-influxdb-backend.html)
diff --git a/backends/backends.c b/backends/backends.c
new file mode 100644
index 0000000000..6cb1e1c62a
--- /dev/null
+++ b/backends/backends.c
@@ -0,0 +1,659 @@
+// SPDX-License-Identifier: GPL-3.0-or-later
+
+#include "backends.h"
+
+// ----------------------------------------------------------------------------
+// How backends work in netdata:
+//
+// 1. There is an independent thread that runs at the required interval
+// (for example, once every 10 seconds)
+//
+// 2. Every time it wakes, it calls the backend formatting functions to build
+// a buffer of data. This is a very fast, memory only operation.
+//
+// 3. If the buffer already includes data, the new data are appended.
+// If the buffer becomes too big, because the data cannot be sent, a
+// log is written and the buffer is discarded.
+//
+// 4. Then it tries to send all the data. It blocks until all the data are sent
+// or the socket returns an error.
+// If the time required for this is above the interval, it starts skipping
+// intervals, but the calculated values include the entire database, without
+// gaps (it remembers the timestamps and continues from where it stopped).
+//
+// 5. repeats the above forever.
+//
+
+const char *global_backend_prefix = "netdata";
+int global_backend_update_every = 10;
+BACKEND_OPTIONS global_backend_options = BACKEND_SOURCE_DATA_AVERAGE | BACKEND_OPTION_SEND_NAMES;
+
+// ----------------------------------------------------------------------------
+// helper functions for backends
+
+size_t backend_name_copy(char *d, const char *s, size_t usable) {
+ size_t n;
+
+ for(n = 0; *s && n < usable ; d++, s++, n++) {
+ char c = *s;
+
+ if(c != '.' && !isalnum(c)) *d = '_';
+ else *d = c;
+ }
+ *d = '\0';
+
+ return n;
+}
+
+// calculate the SUM or AVERAGE of a dimension, for any timeframe
+// may return NAN if the database does not have any value in the give timeframe
+
+inline calculated_number backend_calculate_value_from_stored_data(
+ RRDSET *st // the chart
+ , RRDDIM *rd // the dimension
+ , time_t after // the start timestamp
+ , time_t before // the end timestamp
+ , BACKEND_OPTIONS backend_options // BACKEND_SOURCE_* bitmap
+ , time_t *first_timestamp // the first point of the database used in this response
+ , time_t *last_timestamp // the timestamp that should be reported to backend
+) {
+ RRDHOST *host = st->rrdhost;
+
+ // find the edges of the rrd database for this chart
+ time_t first_t = rrdset_first_entry_t(st);
+ time_t last_t = rrdset_last_entry_t(st);
+ time_t update_every = st->update_every;
+
+ // step back a little, to make sure we have complete data collection
+ // for all metrics
+ after -= update_every * 2;
+ before -= update_every * 2;
+
+ // align the time-frame
+ after = after - (after % update_every);
+ before = before - (before % update_every);
+
+ // for before, loose another iteration
+ // the latest point will be reported the next time
+ before -= update_every;
+
+ if(unlikely(after > before))
+ // this can happen when update_every > before - after
+ after = before;
+
+ if(unlikely(after < first_t))
+ after = first_t;
+
+ if(unlikely(before > last_t))
+ before = last_t;
+
+ if(unlikely(before < first_t || after > last_t)) {
+ // the chart has not been updated in the wanted timeframe
+ debug(D_BACKEND, "BACKEND: %s.%s.%s: aligned timeframe %lu to %lu is outside the chart's database range %lu to %lu",
+ host->hostname, st->id, rd->id,
+ (unsigned long)after, (unsigned long)before,
+ (unsigned long)first_t, (unsigned long)last_t
+ );
+ return NAN;
+ }
+
+ *first_timestamp = after;
+ *last_timestamp = before;
+
+ size_t counter = 0;
+ calculated_number sum = 0;
+
+ long start_at_slot = rrdset_time2slot(st, before),
+ stop_at_slot = rrdset_time2slot(st, after),
+ slot, stop_now = 0;
+
+ for(slot = start_at_slot; !stop_now ; slot--) {
+
+ if(unlikely(slot < 0)) slot = st->entries - 1;
+ if(unlikely(slot == stop_at_slot)) stop_now = 1;
+
+ storage_number n = rd->values[slot];
+
+ if(unlikely(!does_storage_number_exist(n))) {
+ // not collected
+ continue;
+ }
+
+ calculated_number value = unpack_storage_number(n);
+ sum += value;
+
+ counter++;
+ }
+
+ if(unlikely(!counter)) {
+ debug(D_BACKEND, "BACKEND: %s.%s.%s: no values stored in database for range %lu to %lu",
+ host->hostname, st->id, rd->id,
+ (unsigned long)after, (unsigned long)before
+ );
+ return NAN;
+ }
+
+ if(unlikely(BACKEND_OPTIONS_DATA_SOURCE(backend_options) == BACKEND_SOURCE_DATA_SUM))
+ return sum;
+
+ return sum / (calculated_number)counter;
+}
+
+
+// discard a response received by a backend
+// after logging a simple of it to error.log
+
+int discard_response(BUFFER *b, const char *backend) {
+ char sample[1024];
+ const char *s = buffer_tostring(b);
+ char *d = sample, *e = &sample[sizeof(sample) - 1];
+
+ for(; *s && d < e ;s++) {
+ char c = *s;
+ if(unlikely(!isprint(c))) c = ' ';
+ *d++ = c;
+ }
+ *d = '\0';
+
+ info("BACKEND: received %zu bytes from %s backend. Ignoring them. Sample: '%s'", buffer_strlen(b), backend, sample);
+ buffer_flush(b);
+ return 0;
+}
+
+
+// ----------------------------------------------------------------------------
+// the backend thread
+
+static SIMPLE_PATTERN *charts_pattern = NULL;
+static SIMPLE_PATTERN *hosts_pattern = NULL;
+
+inline int backends_can_send_rrdset(BACKEND_OPTIONS backend_options, RRDSET *st) {
+ RRDHOST *host = st->rrdhost;
+
+ if(unlikely(rrdset_flag_check(st, RRDSET_FLAG_BACKEND_IGNORE)))
+ return 0;
+
+ if(unlikely(!rrdset_flag_check(st, RRDSET_FLAG_BACKEND_SEND))) {
+ // we have not checked this chart
+ if(simple_pattern_matches(charts_pattern, st->id) || simple_pattern_matches(charts_pattern, st->name))
+ rrdset_flag_set(st, RRDSET_FLAG_BACKEND_SEND);
+ else {
+ rrdset_flag_set(st, RRDSET_FLAG_BACKEND_IGNORE);
+ debug(D_BACKEND, "BACKEND: not sending chart '%s' of host '%s', because it is disabled for backends.", st->id, host->hostname);
+ return 0;
+ }
+ }
+
+ if(unlikely(!rrdset_is_available_for_backends(st))) {
+ debug(D_BACKEND, "BACKEND: not sending chart '%s' of host '%s', because it is not available for backends.", st->id, host->hostname);
+ return 0;
+ }
+
+ if(unlikely(st->rrd_memory_mode == RRD_MEMORY_MODE_NONE && !(BACKEND_OPTIONS_DATA_SOURCE(backend_options) == BACKEND_SOURCE_DATA_AS_COLLECTED))) {
+ debug(D_BACKEND, "BACKEND: not sending chart '%s' of host '%s' because its memory mode is '%s' and the backend requires database access.", st->id, host->hostname, rrd_memory_mode_name(host->rrd_memory_mode));
+ return 0;
+ }
+
+ return 1;
+}
+
+inline BACKEND_OPTIONS backend_parse_data_source(const char *source, BACKEND_OPTIONS backend_options) {
+ if(!strcmp(source, "raw") || !strcmp(source, "as collected") || !strcmp(source, "as-collected") || !strcmp(source, "as_collected") || !strcmp(source, "ascollected")) {
+ backend_options |= BACKEND_SOURCE_DATA_AS_COLLECTED;
+ backend_options &= ~(BACKEND_OPTIONS_SOURCE_BITS ^ BACKEND_SOURCE_DATA_AS_COLLECTED);
+ }
+ else if(!strcmp(source, "average")) {
+ backend_options |= BACKEND_SOURCE_DATA_AVERAGE;
+ backend_options &= ~(BACKEND_OPTIONS_SOURCE_BITS ^ BACKEND_SOURCE_DATA_AVERAGE);
+ }
+ else if(!strcmp(source, "sum") || !strcmp(source, "volume")) {
+ backend_options |= BACKEND_SOURCE_DATA_SUM;
+ backend_options &= ~(BACKEND_OPTIONS_SOURCE_BITS ^ BACKEND_SOURCE_DATA_SUM);
+ }
+ else {
+ error("BACKEND: invalid data source method '%s'.", source);
+ }
+
+ return backend_options;
+}
+
+static void backends_main_cleanup(void *ptr) {
+ struct netdata_static_thread *static_thread = (struct netdata_static_thread *)ptr;
+ static_thread->enabled = NETDATA_MAIN_THREAD_EXITING;
+
+ info("cleaning up...");
+
+ static_thread->enabled = NETDATA_MAIN_THREAD_EXITED;
+}
+
+void *backends_main(void *ptr) {
+ netdata_thread_cleanup_push(backends_main_cleanup, ptr);
+
+ int default_port = 0;
+ int sock = -1;
+ BUFFER *b = buffer_create(1), *response = buffer_create(1);
+ int (*backend_request_formatter)(BUFFER *, const char *, RRDHOST *, const char *, RRDSET *, RRDDIM *, time_t, time_t, BACKEND_OPTIONS) = NULL;
+ int (*backend_response_checker)(BUFFER *) = NULL;
+
+ // ------------------------------------------------------------------------
+ // collect configuration options
+
+ struct timeval timeout = {
+ .tv_sec = 0,
+ .tv_usec = 0
+ };
+ int enabled = config_get_boolean(CONFIG_SECTION_BACKEND, "enabled", 0);
+ const char *source = config_get(CONFIG_SECTION_BACKEND, "data source", "average");
+ const char *type = config_get(CONFIG_SECTION_BACKEND, "type", "graphite");
+ const char *destination = config_get(CONFIG_SECTION_BACKEND, "destination", "localhost");
+ global_backend_prefix = config_get(CONFIG_SECTION_BACKEND, "prefix", "netdata");
+ const char *hostname = config_get(CONFIG_SECTION_BACKEND, "hostname", localhost->hostname);
+ global_backend_update_every = (int)config_get_number(CONFIG_SECTION_BACKEND, "update every", global_backend_update_every);
+ int buffer_on_failures = (int)config_get_number(CONFIG_SECTION_BACKEND, "buffer on failures", 10);
+ long timeoutms = config_get_number(CONFIG_SECTION_BACKEND, "timeout ms", global_backend_update_every * 2 * 1000);
+
+ if(config_get_boolean(CONFIG_SECTION_BACKEND, "send names instead of ids", (global_backend_options & BACKEND_OPTION_SEND_NAMES)))
+ global_backend_options |= BACKEND_OPTION_SEND_NAMES;
+ else
+ global_backend_options &= ~BACKEND_OPTION_SEND_NAMES;
+
+ charts_pattern = simple_pattern_create(config_get(CONFIG_SECTION_BACKEND, "send charts matching", "*"), NULL, SIMPLE_PATTERN_EXACT);
+ hosts_pattern = simple_pattern_create(config_get(CONFIG_SECTION_BACKEND, "send hosts matching", "localhost *"), NULL, SIMPLE_PATTERN_EXACT);
+
+
+ // ------------------------------------------------------------------------
+ // validate configuration options
+ // and prepare for sending data to our backend
+
+ global_backend_options = backend_parse_data_source(source, global_backend_options);
+
+ if(timeoutms < 1) {
+ error("BACKEND: invalid timeout %ld ms given. Assuming %d ms.", timeoutms, global_backend_update_every * 2 * 1000);
+ timeoutms = global_backend_update_every * 2 * 1000;
+ }
+ timeout.tv_sec = (timeoutms * 1000) / 1000000;
+ timeout.tv_usec = (timeoutms * 1000) % 1000000;
+
+ if(!enabled || global_backend_update_every < 1)
+ goto cleanup;
+
+ // ------------------------------------------------------------------------
+ // select the backend type
+
+ if(!strcmp(type, "graphite") || !strcmp(type, "graphite:plaintext")) {
+
+ default_port = 2003;
+ backend_response_checker = process_graphite_response;
+
+ if(BACKEND_OPTIONS_DATA_SOURCE(global_backend_options) == BACKEND_SOURCE_DATA_AS_COLLECTED)
+ backend_request_formatter = format_dimension_collected_graphite_plaintext;
+ else
+ backend_request_formatter = format_dimension_stored_graphite_plaintext;
+
+ }
+ else if(!strcmp(type, "opentsdb") || !strcmp(type, "opentsdb:telnet")) {
+
+ default_port = 4242;
+ backend_response_checker = process_opentsdb_response;
+
+ if(BACKEND_OPTIONS_DATA_SOURCE(global_backend_options) == BACKEND_SOURCE_DATA_AS_COLLECTED)
+ backend_request_formatter = format_dimension_collected_opentsdb_telnet;
+ else
+ backend_request_formatter = format_dimension_stored_opentsdb_telnet;
+
+ }
+ else if (!strcmp(type, "json") || !strcmp(type, "json:plaintext")) {
+
+ default_port = 5448;
+ backend_response_checker = process_json_response;
+
+ if (BACKEND_OPTIONS_DATA_SOURCE(global_backend_options) == BACKEND_SOURCE_DATA_AS_COLLECTED)
+ backend_request_formatter = format_dimension_collected_json_plaintext;
+ else
+ backend_request_formatter = format_dimension_stored_json_plaintext;
+
+ }
+ else {
+ error("BACKEND: Unknown backend type '%s'", type);
+ goto cleanup;
+ }
+
+ if(backend_request_formatter == NULL || backend_response_checker == NULL) {
+ error("BACKEND: backend is misconfigured - disabling it.");
+ goto cleanup;
+ }
+
+
+ // ------------------------------------------------------------------------
+ // prepare the charts for monitoring the backend operation
+
+ struct rusage thread;
+
+ collected_number
+ chart_buffered_metrics = 0,
+ chart_lost_metrics = 0,
+ chart_sent_metrics = 0,
+ chart_buffered_bytes = 0,
+ chart_received_bytes = 0,
+ chart_sent_bytes = 0,
+ chart_receptions = 0,
+ chart_transmission_successes = 0,
+ chart_transmission_failures = 0,
+ chart_data_lost_events = 0,
+ chart_lost_bytes = 0,
+ chart_backend_reconnects = 0;
+ // chart_backend_latency = 0;
+
+ RRDSET *chart_metrics = rrdset_create_localhost("netdata", "backend_metrics", NULL, "backend", NULL, "Netdata Buffered Metrics", "metrics", "backends", NULL, 130600, global_backend_update_every, RRDSET_TYPE_LINE);
+ rrddim_add(chart_metrics, "buffered", NULL, 1, 1, RRD_ALGORITHM_ABSOLUTE);
+ rrddim_add(chart_metrics, "lost", NULL, 1, 1, RRD_ALGORITHM_ABSOLUTE);
+ rrddim_add(chart_metrics, "sent", NULL, 1, 1, RRD_ALGORITHM_ABSOLUTE);
+
+ RRDSET *chart_bytes = rrdset_create_localhost("netdata", "backend_bytes", NULL, "backend", NULL, "Netdata Backend Data Size", "KB", "backends", NULL, 130610, global_backend_update_every, RRDSET_TYPE_AREA);
+ rrddim_add(chart_bytes, "buffered", NULL, 1, 1024, RRD_ALGORITHM_ABSOLUTE);
+ rrddim_add(chart_bytes, "lost", NULL, 1, 1024, RRD_ALGORITHM_ABSOLUTE);
+ rrddim_add(chart_bytes, "sent", NULL, 1, 1024, RRD_ALGORITHM_ABSOLUTE);
+ rrddim_add(chart_bytes, "received", NULL, 1, 1024, RRD_ALGORITHM_ABSOLUTE);
+
+ RRDSET *chart_ops = rrdset_create_localhost("netdata", "backend_ops", NULL, "backend", NULL, "Netdata Backend Operations", "operations", "backends", NULL, 130630, global_backend_update_every, RRDSET_TYPE_LINE);
+ rrddim_add(chart_ops, "write", NULL, 1, 1, RRD_ALGORITHM_ABSOLUTE);
+ rrddim_add(chart_ops, "discard", NULL, 1, 1, RRD_ALGORITHM_ABSOLUTE);
+ rrddim_add(chart_ops, "reconnect", NULL, 1, 1, RRD_ALGORITHM_ABSOLUTE);
+ rrddim_add(chart_ops, "failure", NULL, 1, 1, RRD_ALGORITHM_ABSOLUTE);
+ rrddim_add(chart_ops, "read", NULL, 1, 1, RRD_ALGORITHM_ABSOLUTE);
+
+ /*
+ * this is misleading - we can only measure the time we need to send data
+ * this time is not related to the time required for the data to travel to
+ * the backend database and the time that server needed to process them
+ *
+ * issue #1432 and https://www.softlab.ntua.gr/facilities/documentation/unix/unix-socket-faq/unix-socket-faq-2.html
+ *
+ RRDSET *chart_latency = rrdset_create_localhost("netdata", "backend_latency", NULL, "backend", NULL, "Netdata Backend Latency", "ms", "backends", NULL, 130620, global_backend_update_every, RRDSET_TYPE_AREA);
+ rrddim_add(chart_latency, "latency", NULL, 1, 1000, RRD_ALGORITHM_ABSOLUTE);
+ */
+
+ RRDSET *chart_rusage = rrdset_create_localhost("netdata", "backend_thread_cpu", NULL, "backend", NULL, "NetData Backend Thread CPU usage", "milliseconds/s", "backends", NULL, 130630, global_backend_update_every, RRDSET_TYPE_STACKED);
+ rrddim_add(chart_rusage, "user", NULL, 1, 1000, RRD_ALGORITHM_INCREMENTAL);
+ rrddim_add(chart_rusage, "system", NULL, 1, 1000, RRD_ALGORITHM_INCREMENTAL);
+
+
+ // ------------------------------------------------------------------------
+ // prepare the backend main loop
+
+ info("BACKEND: configured ('%s' on '%s' sending '%s' data, every %d seconds, as host '%s', with prefix '%s')", type, destination, source, global_backend_update_every, hostname, global_backend_prefix);
+
+ usec_t step_ut = global_backend_update_every * USEC_PER_SEC;
+ time_t after = now_realtime_sec();
+ int failures = 0;
+ heartbeat_t hb;
+ heartbeat_init(&hb);
+
+ while(!netdata_exit) {
+
+ // ------------------------------------------------------------------------
+ // Wait for the next iteration point.
+
+ heartbeat_next(&hb, step_ut);
+ time_t before = now_realtime_sec();
+ debug(D_BACKEND, "BACKEND: preparing buffer for timeframe %lu to %lu", (unsigned long)after, (unsigned long)before);
+
+ // ------------------------------------------------------------------------
+ // add to the buffer the data we need to send to the backend
+
+ netdata_thread_disable_cancelability();
+
+ size_t count_hosts = 0;
+ size_t count_charts_total = 0;
+ size_t count_dims_total = 0;
+
+ rrd_rdlock();
+ RRDHOST *host;
+ rrdhost_foreach_read(host) {
+ if(unlikely(!rrdhost_flag_check(host, RRDHOST_FLAG_BACKEND_SEND|RRDHOST_FLAG_BACKEND_DONT_SEND))) {
+ char *name = (host == localhost)?"localhost":host->hostname;
+ if (!hosts_pattern || simple_pattern_matches(hosts_pattern, name)) {
+ rrdhost_flag_set(host, RRDHOST_FLAG_BACKEND_SEND);
+ info("enabled backend for host '%s'", name);
+ }
+ else {
+ rrdhost_flag_set(host, RRDHOST_FLAG_BACKEND_DONT_SEND);
+ info("disabled backend for host '%s'", name);
+ }
+ }
+
+ if(unlikely(!rrdhost_flag_check(host, RRDHOST_FLAG_BACKEND_SEND)))
+ continue;
+
+ rrdhost_rdlock(host);
+
+ count_hosts++;
+ size_t count_charts = 0;
+ size_t count_dims = 0;
+ size_t count_dims_skipped = 0;
+
+ const char *__hostname = (host == localhost)?hostname:host->hostname;
+
+ RRDSET *st;
+ rrdset_foreach_read(st, host) {
+ if(likely(backends_can_send_rrdset(global_backend_options, st))) {
+ rrdset_rdlock(st);
+
+ count_charts++;
+
+ RRDDIM *rd;
+ rrddim_foreach_read(rd, st) {
+ if (likely(rd->last_collected_time.tv_sec >= after)) {
+ chart_buffered_metrics += backend_request_formatter(b, global_backend_prefix, host, __hostname, st, rd, after, before, global_backend_options);
+ count_dims++;
+ }
+ else {
+ debug(D_BACKEND, "BACKEND: not sending dimension '%s' of chart '%s' from host '%s', its last data collection (%lu) is not within our timeframe (%lu to %lu)", rd->id, st->id, __hostname, (unsigned long)rd->last_collected_time.tv_sec, (unsigned long)after, (unsigned long)before);
+ count_dims_skipped++;
+ }
+ }
+
+ rrdset_unlock(st);
+ }
+ }
+
+ debug(D_BACKEND, "BACKEND: sending host '%s', metrics of %zu dimensions, of %zu charts. Skipped %zu dimensions.", __hostname, count_dims, count_charts, count_dims_skipped);
+ count_charts_total += count_charts;
+ count_dims_total += count_dims;
+
+ rrdhost_unlock(host);
+ }
+ rrd_unlock();
+
+ netdata_thread_enable_cancelability();
+
+ debug(D_BACKEND, "BACKEND: buffer has %zu bytes, added metrics for %zu dimensions, of %zu charts, from %zu hosts", buffer_strlen(b), count_dims_total, count_charts_total, count_hosts);
+
+ // ------------------------------------------------------------------------
+
+ chart_buffered_bytes = (collected_number)buffer_strlen(b);
+
+ // reset the monitoring chart counters
+ chart_received_bytes =
+ chart_sent_bytes =
+ chart_sent_metrics =
+ chart_lost_metrics =
+ chart_transmission_successes =
+ chart_transmission_failures =
+ chart_data_lost_events =
+ chart_lost_bytes =
+ chart_backend_reconnects = 0;
+ // chart_backend_latency = 0;
+
+ if(unlikely(netdata_exit)) break;
+
+ //fprintf(stderr, "\nBACKEND BEGIN:\n%s\nBACKEND END\n", buffer_tostring(b));
+ //fprintf(stderr, "after = %lu, before = %lu\n", after, before);
+
+ // prepare for the next iteration
+ // to add incrementally data to buffer
+ after = before;
+
+ // ------------------------------------------------------------------------
+ // if we are connected, receive a response, without blocking
+
+ if(likely(sock != -1)) {
+ errno = 0;
+
+ // loop through to collect all data
+ while(sock != -1 && errno != EWOULDBLOCK) {
+ buffer_need_bytes(response, 4096);
+
+ ssize_t r = recv(sock, &response->buffer[response->len], response->size - response->len, MSG_DONTWAIT);
+ if(likely(r > 0)) {
+ // we received some data
+ response->len += r;
+ chart_received_bytes += r;
+ chart_receptions++;
+ }
+ else if(r == 0) {
+ error("BACKEND: '%s' closed the socket", destination);
+ close(sock);
+ sock = -1;
+ }
+ else {
+ // failed to receive data
+ if(errno != EAGAIN && errno != EWOULDBLOCK) {
+ error("BACKEND: cannot receive data from backend '%s'.", destination);
+ }
+ }
+ }
+
+ // if we received data, process them
+ if(buffer_strlen(response))
+ backend_response_checker(response);
+ }
+
+ // ------------------------------------------------------------------------
+ // if we are not connected, connect to a backend server
+
+ if(unlikely(sock == -1)) {
+ // usec_t start_ut = now_monotonic_usec();
+ size_t reconnects = 0;
+
+ sock = connect_to_one_of(destination, default_port, &timeout, &reconnects, NULL, 0);
+
+ chart_backend_reconnects += reconnects;
+ // chart_backend_latency += now_monotonic_usec() - start_ut;
+ }
+
+ if(unlikely(netdata_exit)) b