path: root/health
Age | Commit message | Author
2023-06-08 | Fix CID 385073 -- Uninitialized scalar variable (#15163) | Stelios Fragkakis
Fix CID 385073 Uninitialized scalar variable
2023-06-07 | freeipmi: add availability status chart and alarm (#15151) | Ilya Mashchenko
2023-06-05 | Generate, store and transmit a unique alert event_hash_id (#15111) | Emmanuel Vasilakis
* generate and store an event_hash_id * transmit to cloud * transmit to the cloud
2023-06-01 | health: remove "families" from alarms config (#15086) | Ilya Mashchenko
2023-05-29 | Only queue an alert to the cloud when it's inserted (#15110) | Emmanuel Vasilakis
only queue an alert to cloud when its inserted
2023-05-24 | fix cockroachdb alarms (#15095) | Ilya Mashchenko
2023-05-23 | Better cleanup of health log table (#15045) | Emmanuel Vasilakis
2023-05-22 | Use chart labels to filter alerts (#14982) | Emmanuel Vasilakis
* use chart labels to filter alerts * add entry to readme * support chart_label=val val2 val3 * docs updates * more docs * use rc not rt
2023-05-15 | Comment out default `role_recipients_*` values (#15047) | James Gregory-Monk
2023-05-02 | feat: add OpsGenie alert levels to payload (#14992) | OliverNChalk
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
2023-04-25 | Update README.md (#14962) | Chris Akritidis
2023-04-21 | Add a checkpoint message to alerts stream (#14847) | Emmanuel Vasilakis
* pull aclk schemas * resolve capas * handle checkpoints and removed from health * build with disable-cloud * codacy 1 * misc changes * one more char in hash * free buffer * change topic * misc fixes * skip removed alert variables * change hash functions * use create and destroy for compatibility with older openssl
2023-04-20 | WEBRTC for communication between agents and browsers (#14874) | Costa Tsaousis
* initial webrtc setup * missing files * rewrite of webrtc integration * initialization and cleanup of webrtc connections * make it compile without libdatachannel * add missing webrtc_initialize() function when webrtc is not enabled * make c++17 optional * add build/m4/ax_compiler_vendor.m4 * add ax_cxx_compile_stdcxx.m4 * added new m4 files to makefile.am * id all webrtc connections * show warning when webrtc is disabled * fixed message * moved all webrtc error checking inside webrtc.cpp * working webrtc connection establishment and cleanup * remove obsolete code * rewrote webrtc code in C to remove dependency for c++17 * fixed left-over reference * detect binary and text messages * minor fix * naming of webrtc threads * added webrtc configuration * fix for thread_get_name_np() * smaller web_client memory footprint * universal web clients cache * free web clients every 100 uses * webrtc is now enabled by default only when compiled with internal checks * webrtc responses to /api/ requests, including LZ4 compression * fix for binary and text messages * web_client_cache is now global * unification of the internal web server API, for web requests, aclk request, webrtc requests * more cleanup and unification of web client timings * fixed compiler warnings * update sent and received bytes * eliminated of almost all big buffers in web client * registry now uses the new json generation * cookies are now an array; fixed redirects * fix redirects, again * write cookies directly to the header buffer, eliminating the need for cookie structures in web client * reset the has_cookies flag * gathered all web client cleanup to one function * fixes redirects * added summary.globals in /api/v2/data response * ars to arc in /api/v2/data * properly handle host impersonation * set the context of mem.numa_nodes
2023-04-18 | bump go.d.plugin to v0.52.1 (#14921) | Ilya Mashchenko
2023-04-12 | Update REFERENCE.md (#14900) | Chris Akritidis
2023-04-12 | Collect additional BTRFS metrics (#14636) | Dimitris P
* Add commit_stats metrics to BTRFS section * Add error_stats metrics (per device) to BTRFS section * Simplify commit stats variables and chart ids/names * Add basic BTRFS error alarms. Configured to trip whenever any of the error dimensions is non-zero. * Add chart descriptions for new charts. * Remove duplicate code * Comment out some debugging code * Always create error stats dimensions, even if zero * Show rate of commits and commit duration instead of totals * Change current commit metrics to absolute from incremental * Change commits dimension to absolute and add separate commits time share chart * Rename 'device_' rrdlabels to 'filesystem_' * Replace all snprintf() calls with snprintfz() * Fix codacy warning * Provide separate error charts for each filesystem device * Accept code review suggestions for more descriptive context and labels Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud> * Add 'device' prefix to id, name, title of errors chart * Add 'device_id' label to device_errors * Update health.d/btrfs.conf to match new errors charts * Remove commented out code * Do not disable all BTRFS metrics collection if only commit_stats is missing * Do not disable all BTRFS metrics collection if only error_stats is missing * Fix bug of BTRFS device add/remove not being detected properly * Fix double free() error when deleting a device * Update dashboard info with bold tags Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud> --------- Co-authored-by: Austin S. Hemmelgarn <austin@netdata.cloud> Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
2023-04-10 | Add support for alert notifications to ntfy.sh (#14875) | Dim-P
2023-04-05 | Fix js tag in documentation (#14862) | Fotis Voutsas
fix js tag
2023-04-04 | Update Agent notification methods documentation (#14827) | Fotis Voutsas
* update health/notifications/README.md * Alerta notification method documentation update * Amazon SNS and some alerta changes * notification methods improvements * alerta refinements * awssns refinements * custom alert refinements * discord refinements * email notifications documentation update * flock notifications documentation update * alerta edits * awssns edits * custom notification method edits * discord edits * email notification method edits * flock edits * IRC notifications update * Kavenegar notifications documentation update * matrix notifications documentation update * messagebird notifications documentation update * msteams notifications documentation update * wording change * twilio notifications documentation update * telegram notifications documentation update * syslog notifications update * smstools3 notifications documentation update * rocket.chat notifications documentation update * pushover notifications documentation update * pushbullet notifications documentation update * prowl notifications documentation update * pagerduty notifications documentation update * remove comments from example configuration * slight wording changes * more notification methods documentation updates * slack notification documentation update * add config options to the notifications Introduction page * crop image twilio * crop image slack * crop image pushover * crop images pushbullet * crop image messagebird * crop image kavenegar
2023-04-03 | fix typo alerms -> alarms (#14854) | slavox
2023-03-21 | /api/v2/X part 5 (#14718) | Costa Tsaousis
* query timestamps are now pre-determined and alignment on timestamps is guarranteed * turn internal_fatal() to internal_error() to investigate the issue * handle query when no data exist in the db * check for non NULL dict when running dictionary garbage collect * support API v2 requests via ACLK * add nodes detailed information to /api/v2/nodes * fixed keys and added dummy nodes for completeness * added nodes_hard_hash, alerts_hard_hash, alerts_soft_hash; started building a nodes status object to reflect the current status of a node * make sure replication does not double count charts that are already being replicated * expose min and max in sts structures * added view_minimum_value and view_maximum_value; percentage calculation is now an additional pass on the data, removed from formatters; absolute value calculation is now done at the query level, removed from formatters * respect trimming in percentage calculation; updated swagger * api/v2/weights preparative work to support multi-node queries - still single node though * multi-node /api/v2/weights endpoint, supporting all the filtering parameters of /api/v2/data * when passing the raw option, the query exposes the hidden dimensions * fix compilation issues on older systems * the query engine now calculates per dimension min, max, sum, count, anomaly count * use the macro to calculate storage point anomaly rate * weights endpoint exposing version hashes * weights method=value shows min, max, average, sum, count, anomaly count, anomaly rate * query: expose RESET flag; do not add the same point multiple times to the aggregated point * weights: more compact output * weights requests can be interrupted * all /api/v2 requests can be interrupted and timeout * allow relative timestamps in weights * fix macos compilation warnings * Revert "fix macos compilation warnings" This reverts commit 8a1d24e41e9b58de566ac59f0c4b1c465bcc0592. 
* /api/v2/data group-by now works on dimension names, not ids * /api/v2/weights does not query metrics without retention and new output format * /api/v2/weights value and anomaly queries do context queries when contexts are filtered; query timeout is now always in ms
2023-03-21 | Replace hardcoded links pointing to "learn.netdata.cloud" with github absolute links (#14779) | Fotis Voutsas
* Update REFERENCE.md * replace redirected links * format the files * fix redirected link * format the file * replace hardcoded links
2023-03-02 | Fix doc links (#14650) | Chris Akritidis
* Update freebsd.md * Update REFERENCE.md * Update README.md * Update COLLECTORS.md
2023-03-02 | /api/v2/contexts (#14592) | Costa Tsaousis
* preparation for /api/v2/contexts * working /api/v2/contexts * add anomaly rate information in all statistics; when sum-count is requested, return sums and counts instead of averages * minor fix * query targegt now accurately counts hosts, contexts, instances, dimensions, metrics * cleanup /api/v2/contexts * full text search with /api/v2/contexts * simple patterns now support the option to search ignoring case * full text search API with /api/v2/q * simple pattern execution optimization * do not show q when not given * full text search accounting * separated /api/v2/nodes from /api/v2/contexts * fix ssv queries for group_by * count query instances queried and failed per context and host * split rrdcontext.c to multiple files * add query totals * fix anomaly rate calculation; provide "ni" for indexing hosts * do not generate zero valued members * faster calculation of anomaly rate; by just summing integers for each db points and doing math once for every generated point * fix typo when printing dimensions totals * added option minify to remove spaces and newlines fron JSON output * send instance ids and names when they differ * do not add in query target dimensions, instances, contexts and hosts for which there is no retention in the current timeframe * fix for the previous + renames and code cleanup * when a dimension is filtered, include in the response all the other dimensions that are selectable * do not add nodes that do not have retention in the current window * move selection of dimensions to query_dimension_add(), instead of query_metric_add() * increase the pre-processing capacity of queries * generate instance fqdn ids and names only when they are needed * provide detailed statistics about tiers retention, queries, points, update_every * late allocation of query dimensions * cleanup * more cleanup * support for annotations per displayed point, RESET and PARTIAL * new type annotations * if a chart is not linked to contexts and it is collected, link it when 
it is collected * make ML run reentrant * make ML rrdr query synchronous * optimize replication memory allocation of replication_sort_entry * change units to percentage, when requesting a coefficinet of variation, or a percentage query * initialize replication before starting main threads * properly decrement no room requests counter * propagate the non-zero flag to group-by * the same by avoiding the extra loop * respect non-zero in all dimension arrays * remove dictionary garbage collection from dictionary_entries() and dictionary_version() * be more verbose when jv2 indexing is postponed * prevent infinite loop * use hidden dimensions even when dimensions pattern is unset * traverse hosts using dictionaries * fix dictionary unittests
2023-02-28 | Make the title metadata H1 in all markdown files (#14625) | Fotis Voutsas
* make the title metadata the H1 * Update collectors/python.d.plugin/zscores/README.md * Update libnetdata/ebpf/README.md * Update ml/README.md * Update libnetdata/string/README.md --------- Co-authored-by: Chris Akritidis <43294513+cakrit@users.noreply.github.com>
2023-02-28 | Update REFERENCE.md (#14627) | Chris Akritidis
2023-02-27 | Reorg learn 0227 (#14621) | Chris Akritidis
* reorg batch 1 * remove duplicate cloud custom dashboard and agent dashboard * Simplify the root web/README * Merge streaming references * Make enable streaming the overall intro and the README the reference * Remove reference-streaming document * Update overview pages
2023-02-26 | Reorg learn 0226 (#14610) | Chris Akritidis
* Reorg getting started * Streaming * Remove blanks * Fix up to cloud alerts
2023-02-22 | Clean host structure (#14584) | Stelios Fragkakis
* Remove varlib_dir from host structure * Remove unused parameter
2023-02-20 | bump go.d to v0.51.0 (#14572) | Ilya Mashchenko
2023-02-20 | Fix broken links in our documentation (#14565) | Fotis Voutsas
* fix broken link in ml/README.md * fix broken link across all files * fix broken link across all files * fix broken links and remove what's next sections * fix broken links and remove what's next section * Remove related links sections with broken links that link to removed files * fix broken links
2023-02-17 | Update email notification docs with info about setup in Docker. (#14555) | Austin S. Hemmelgarn
2023-02-17 | Reorg learn 021723 (#14556) | Chris Akritidis
* Change titles of agent alert notifications * Reintroduce netdata for iot * Eliminate guides category, merge health config docs * Rename setup to configuration * Codacy fixes and move health config reference
2023-02-15 | JSON internal API, IEEE754 base64/hex streaming, weights endpoint optimization (#14493) | Costa Tsaousis
* first work on standardizing json formatting * renamed old grouping to time_grouping and added group_by * add dummy functions to enable compilation * buffer json api work * jsonwrap opening with buffer_json_X() functions * cleanup * storage for quotes * optimize buffer printing for both numbers and strings * removed ; from define * contexts json generation using the new json functions * fix buffer overflow at unit test * weights endpoint using new json api * fixes to weights endpoint * check buffer overflow on all buffer functions * do synchronous queries for weights * buffer_flush() now resets json state too * content type typedef * print double values that are above the max 64-bit value * str2ndd() can now parse values above UINT64_MAX * faster number parsing by avoiding double calculations as much as possible * faster number parsing * faster hex parsing * accurate printing and parsing of double values, even for very large numbers that cannot fit in 64bit integers * full printing and parsing without using library functions - and related unit tests * added IEEE754 streaming capability to enable streaming of double values in hex * streaming and replication to transfer all values in hex * use our own str2ndd for set2 * remove subnormal check from ieee * base64 encoding for numbers, instead of hex * when increasing double precision, also make sure the fractional number printed is aligned to the wanted precision * str2ndd_encoded() parses all encoding formats, including integers * prevent uninitialized use * /api/v1/info using the new json API * Fix error when compiling with --disable-ml * Remove redundant 'buffer_unittest' declaration * Fix formatting * Fix formatting * Fix formatting * fix buffer unit test * apps.plugin using the new JSON API * make sure the metrics registry does not accept negative timestamps * do not allow pages with negative timestamps to be loaded from db files; do not accept pages with negative timestamps in the cache * Fix more formatting --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-02-15 | Reorganize learn documents under Integrations part 2 (#14538) | Chris Akritidis
* Reorg exporter integrations * Reorg alert notifications. change mdx to md for alert related files * Move all .mdx to .md, including links.
2023-02-14 | Fix broken links in markdown files (#14513) | Fotis Voutsas
* fix broken links in claim/README.md * delete broken link in docs/guidelines.md * fix broken links * fix broken link * fix broken links * fix broken links * fix broken links * fix broken links * fix broken links * remove broken link * fix broken link * fix broken links * fix broken links * fix broken links * fix broken link * fix linking phrasing * fix broken links batch * fix broken links second batch * fix broken links * fix broken links * fix broken links * Update COLLECTORS.md * fix broken links * fix broken links
2023-02-09 | Fix crash when child connects (#14492) | Stelios Fragkakis
* Just formatting * Remove single threaded * Only destroy if we are localhost (ie. shutdown)
2023-02-08 | Only load required charts for rrdvars (#14443) | Emmanuel Vasilakis
* store only rrdvars health needs * make it simpler * only set * fix codacy
2023-02-08 | Add export for people running their own registry (#14457) | Chris Akritidis
See https://github.com/netdata/netdata/issues/3495#issuecomment-1408452259
2023-02-02 | fix kubelet alarms (#14414) | Ilya Mashchenko
2023-02-02 | Convert our documentation links to GH absolute links (#14344) | Tasos Katsoulas
Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
2023-01-31 | Update the notifications/integrations docs (#14335) | Hugo Valente
* Add the docs for the newly added notification/integrations methods of the cloud. Notifications: Discord/PagerDuty/Slack/Generic WebHook * Update docs related to; Managing notification with the new methods. Co-authored-by: Shyam Sreevalsan <shyam@netdata.cloud>
2023-01-30 | Add main health readme to learn (#14356) | Chris Akritidis
2023-01-30 | Delete QUICKSTART.md (#14355) | Chris Akritidis
The info is already in the main README
2023-01-27 | minor fix on notification doc (Discord) (#14339) | Tasos Katsoulas
Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
2023-01-27 | DBENGINE v2 - improvements part 10 (#14332) | Costa Tsaousis
* replication cancels pending queries on exit * log when waiting for inflight queries * when there are collected and not-collected metrics, use the context priority from the collected only * Write metadata with a faster pace * Remove journal file size limit and sync mode to 0 / Drop wal checkpoint for now * Wrap in a big transaction remaining metadata writes (test 1) * fix higher tiers when tiering iterations = 2 * dbengine always returns db-aligned points; query engine expands the queries by 2 points in every direction to have enough data for interpolation * Wrap in a big transaction metadata writes (test 2) * replication cancelling fix * do not first and last entry in replication when the db has no retention * fix internal check condition * Increase metadata write batch size * always apply error limit to dbengine logs * Remove code that processes the obsolete health.db files * cleanup in query.c * do not allow queries to go beyond db boundaries * prevent internal log for +1 delta in timestamp * detect gap pages in conflicts * double protection for gap injection in main cache * Add checkpoint to prevent large WAL while running Remove unused and duplicate functions * do not allocate chart cache dir if not needed * add more info to unittests * revert query expansion to satisfy unittests Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-26 | Add |nowarn and |noclear notification modifiers (#14330) | Martin Vobruba
2023-01-25 | Introduce the new Structure of the documentation (#13915) | Fotis Voutsas
* Moving the cloud docs under /docs/cloud (previous location: netdata/learn/*) * Added metadata on almost every document of the old learn site for the new ingest process of learn. * Map old learn document to their best fit as topic related docs. Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> Co-authored-by: DShreve2 <david@netdata.cloud> Co-authored-by: hugovalente-pm <hugo@netdata.cloud>
2023-01-23 | DBENGINE v2 - improvements part 7 (#14307) | Costa Tsaousis
* run cleanup in workers * when there is a discrepancy between update every, fix it * fix the other occurences of metric update every mismatch * allow resetting the same timestamp * validate flushed pages before committing them to disk * initialize collection with the latest time in mrg * these should be static functions * acquire metrics for writing to detect multiple data collections of the same metric * print the uuid of the metric that is collected twice * log the discrepancies of completed pages * 1 second tolerance * unify validation of pages and related logging across dbengine * make do_flush_pages() thread safe * flush pages runs on libuv workers * added uv events to tp workers * dont cross datafile spinlock and rwlock * should be unlock * prevent the creation of multiple datafiles * break an infinite replication loop * do not log the epxansion of the replication window due to start streaming * log all invalid pages with internal checks * do not shutdown event loop threads * add information about collected page events, to find the root cause of invalid collected pages * rewrite of the gap filling to fix the invalid collected pages problem * handle multiple collections of the same metric gracefully * added log about main cache page conflicts; fix gap filling once again... 
* keep track of the first metric writer * it should be an internal fatal - it does not harm users * do not check of future timestamps on collected pages, since we inherit the clock of the children; do not check collected pages validity without internal checks * prevent negative replication completion percentage * internal error for the discrepancy of mrg * better logging of dbengine new metrics collection * without internal checks it is unused * prevent pluginsd crash on exit due to calling pthread_cancel() on an exited thread * renames and atomics everywhere * if a datafile cannot be acquired for deletion during shutdown, continue - this can happen when there are hot pages in open cache referencing it * Debug for context load * rrdcontext uuid debug * rrddim uuid debug * rrdeng uuid debug * Revert "rrdeng uuid debug" This reverts commit 393da190826a582e7e6cc90771bf91b175826d8b. * Revert "rrddim uuid debug" This reverts commit 72150b30408294f141b19afcfb35abd7c34777d8. * Revert "rrdcontext uuid debug" This reverts commit 2c3b940dc23f460226e9b2a6861c214e840044d0. * Revert "Debug for context load" This reverts commit 0d880fc1589f128524e0b47abd9ff0714283ce3b. * do not use legacy uuids on multihost dbs * thread safety for journafile size * handle other cases of inconsistent collected pages * make health thread check if it should be running in key loops * do not log uuids Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-20 | add consul license expiration time alarm (#14298) | Ilya Mashchenko
* add consul license alarm * minor