path: root/health
Age  Commit message  Author
2023-07-28  Fix descriptions in config objects, make them single line (#15610)  Fotis Voutsas
2023-07-28  Sample Cloud Notifications metadata for Discord (#15597)  Satyadeep Ashwathnarayana
* Added a sample metadata.yaml for Alerta
* Fixing trailing spaces bringing yamllint errors
* Sample Cloud Notification metadata for Discord
* Add yamllint line-length check disable
* Update metadata.yaml
2023-07-27  Added a sample metadata.yaml for Alerta (#15591)  Satyadeep Ashwathnarayana
* Added a sample metadata.yaml for Alerta
* Fixing trailing spaces bringing yamllint errors
2023-07-27  remove the noise by silencing alerts that don't need to wake up people (#15590)  Costa Tsaousis
2023-07-26  Add schema and examples for notification method metadata. (#15549)  Austin S. Hemmelgarn
2023-07-26  Refactor RRD code. (#15423)  vkalintiris
* Storage engine.
* Host indexes to rrdb
* Move globals to rrdb
* Move storage_tiers_backfill to rrdb
* default_rrd_update_every to rrdb
* default_rrd_history_entries to rrdb
* gap_when_lost_iterations_above to rrdb
* rrdset_free_obsolete_time_s to rrdb
* libuv_worker_threads to rrdb
* ieee754_doubles to rrdb
* rrdhost_free_orphan_time_s to rrdb
* rrd_rwlock to rrdb
* localhost to rrdb
* rm extern from func decls
* mv rrd macro under rrd.h
* default_rrdeng_page_cache_mb to rrdb
* default_rrdeng_extent_cache_mb to rrdb
* db_engine_journal_check to rrdb
* default_rrdeng_disk_quota_mb to rrdb
* default_multidb_disk_quota_mb to rrdb
* multidb_ctx to rrdb
* page_type_size to rrdb
* tier_page_size to rrdb
* No storage_engine_id in rrdim functions
* storage_engine_id is provided by st
* Update to fix merge conflict.
* Update field name
* Remove unnecessary macros from rrd.h
* Rm unused type decls
* Rm duplicate func decls
* make internal function static
* Make the rest of public dbengine funcs accept a storage_instance.
* No more rrdengine_instance :)
* rm rrdset_debug from rrd.h
* Use rrdb to access globals in ML and ACLK. Missed due to not having the submodules in the worktree.
* rm total_number
* rm RRDVAR_TYPE_TOTAL
* rm unused inline
* Rm names from typedef'd enums
* rm unused header include
* Move include
* Rm unused header include
* s/rrdhost_find_or_create/rrdhost_get_or_create/g
* s/find_host_by_node_id/rrdhost_find_by_node_id/ Also, remove duplicate definition in rrdcontext.c
* rm macro used only once
* rm macro used only once
* Reduce rrd.h api by moving funcs into a collector specific utils header
* Remove unused func
* Move parser specific function out of rrd.h
* return storage_number instead of void pointer
* move code related to rrd initialization out of rrdhost.c
* Remove tier_grouping from rrdim_tier. Saves 8 * storage_tiers bytes per dimension.
* Fix rebase
* s/rrd_update_every/update_every/
* Mark functions as static and constify args
* Add license notes and file to build systems.
* Remove remaining non-log/config mentions of memory mode
* Move rrdlabels api to separate file. Also, move localhost functions that loads labels outside of database/ and into daemon/
* Remove function decl in rrd.h
* merge rrdhost_cache_dir_for_rrdset_alloc into rrdset_cache_dir
* Do not expose internal function from rrd.h
* Rm NETDATA_RRD_INTERNALS. Only one function decl is covered. We have more database internal functions that we currently expose for no good reason. These will be placed in a separate internal header in follow up PRs.
* Add license note
* Include libnetdata.h instead of aral.h
* Use rrdb to access localhost
* Fix builds without dbengine
* Add header to build system files
* Add rrdlabels.h to build systems
* Move func def from rrd.h to rrdhost.c
* Fix macos build
* Rm non-existing function
* Rebase master
* Define buffer length macro in ad_charts.
* Fix FreeBSD builds.
* Mark functions static
* Rm func decls without definitions
* Rebase master
* Rebase master
* Properly initialize value of storage tiers.
* Fix build after rebase.
2023-07-26  proc integrations (#15494)  Costa Tsaousis
Co-authored-by: ilyam8 <ilya@netdata.cloud>
2023-07-25  added cloud status in registry?action=hello (#15530)  Costa Tsaousis
Co-authored-by: ilyam8 <ilya@netdata.cloud>
2023-07-22  Improve the update of the alert chart name in the database (#15490)  Stelios Fragkakis
Disable check during health init
Store chart_name when storing a new transition
2023-07-21  Memory Controller (MC) and DIMM Error Detection And Correction (EDAC) (#15473)  Costa Tsaousis
Co-authored-by: ilyam8 <ilya@netdata.cloud>
2023-07-21  docs: note that health foreach works only with template (#15478)  Ilya Mashchenko
* docs: note that health foreach works only with template
2023-07-20  Store and transmit chart_name to cloud in alert events (#15441)  Emmanuel Vasilakis
2023-07-18  disable apps_group_file_descriptors_utilization alarm (#15435)  Ilya Mashchenko
2023-07-18  monitor applications file descriptor limits (#15417)  Costa Tsaousis
Co-authored-by: ilyam8 <ilya@netdata.cloud>
2023-07-17  Minor typo fix on consul.conf (#15419)  Fotis Voutsas
2023-07-13  Rename log_access and log_health (#15368)  Emmanuel Vasilakis
2023-07-12  agent alert notifications redirect (#15350)  Costa Tsaousis
* agent alert notifications redirect
* set the same cookies with SameSite: Strict
* registry search now requires only "for" parameter
* registry responses are not cacheable
* fix typo and add more error checking
* registry memory when mmap is used
* fix free with aral
2023-07-12  health: fix windows alarms for vnodes (#15376)  Ilya Mashchenko
2023-07-12  Keep health log history in seconds (#15314)  Emmanuel Vasilakis
* rebase
* changes queries to delete based on when
* readme changes
* no need to do migration
* wip, protect un-updated events from cleanup
* remove index on when_key
* fix query for claimed cleanup
* if set less than minimum, set minimum
* fix query
* correct config assign
2023-07-11  Rename log Macros (debug) (#15322)  thiagoftsm
2023-07-10  Use spinlock in host and chart (#15328)  Stelios Fragkakis
* Switch alarm log lock to spinlock
* Switch the alerts lock in the chart structure to spinlock
* Proper lock usage
2023-07-10  multi-threaded version of freeipmi.plugin (#15327)  Costa Tsaousis
* multi-threaded version of freeipmi.plugin
* fix type check
* debug info
* debug info
* updated should be smaller, not bigger
* ignore sensors without name
* variable data collection frequencies for sensors and sel; also respect the min data collection frequency
* reorg and code cleanup
* collect states even for unknown units and empty names
* render all sensors
* reset unknown state sensors
* ignore sensors without name
* added component fan
* Update ipmi.conf
* added label type
* remove global state counters and chart
* updated copyright notice
* remove unused struct members
* remove unused variable
* added a log line every time the plugin decides to exit to show what was wrong
* reworked freeipmi for optimal performance
* disabled debugging and fixed bug
* added debug
* added debug
* added debug
* removed debugging info
* cleanup and final touches
* let fan metrics be categorized by the component they are cooling
* added plugin and module to charts
* more component matches
* code cleanup, sel should now be a lot faster
* make sel min collection time 30 secs
* more component matches; refreshed functions copied from freeipmi codebase
* add keepalive to avoid parser read timeout during ipmi_detect_speed_secs
* ipmi.fan_speed => ipmi.sensor_fan_speed
* update metrics csv and readme
* ok newline
---------
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
2023-07-06  Rename generic `error` function (#15296)  thiagoftsm
2023-07-06  fix(alerting): removing some of criticals (#15124)  Mateusz Bularz
2023-07-06  Code reorg and cleanup - enrichment of /api/v2 (#15294)  Costa Tsaousis
* claim script now accepts the same params as the kickstart
* rewrote buildinfo to unify all methods
* added cloud unavailable in cloud status
* added all exporters
* renamed httpd to h2o
* rename ENABLE_COMPRESSION to ENABLE_LZ4
* rename global variable
* rename ENABLE_HTTPS to ENABLE_OPENSSL
* fix coverity-scan for openssl
* add lz4 to coverity-scan
* added all plugins and most of the features
* added all plugins and most of the features
* generalize bitmap code so that we can have any size of bitmaps
* cleanup
* fix compilation without protobuf
* fix compilation with other allocators
* fix bitmap
* comprehensive bitmaps unit test
* bitmap as macros
* added developer mode
* added system info to build info
* cloud available/unavailable
* added /api/v2/info
* added units and ni to transitions
* when showing instances and transitions, show only the instances that have transitions
* cleanup
* add missing quotes
* add anchor to transitions
* added more to build info
* calculate retention per tier and expose it to /api/v2/info
* added currently collected metrics
* do not show space and retention when no numbers are available
* fix impossible overflow
* Add function for transitions and execute callback
* In case of error, reset and try next dictionary entry
* Fix error message
* simpler logic to maintain retention per tier
* /api/v2/alert_transitions
* Handle case of recipient null. Convert after and before to usec
* Add classification, type and component
* working /api/v2/alert_transitions
* Fix query to properly handle context and alert name
* cleanup
* Add search with transition
* accept transition in /api/v2/alert_transitions
* totally dynamic facets
* fixed debug info
* restructured facets
* cleanup; removal of options=transitions
* updated alert entries flags
* method to exec
* Return also exec run timestamp. Temp table cleanup only when we don't execute with a transition
* cleanup obsolete anchor parameter
* Add sql_get_alert_configuration function
* added options=config to alert_transitions
* added /api/v2/alert_config
* preliminary work for /api/v2/claim
* initialize variables; do not expose expected retention if no disk space info is available; do not report aclk as initializing when not claimed
* fix claim session key filename
* put a newline into the session key file
* more progress on claiming
* final /api/v2/claim endpoint
* after claiming, refresh our state at the output
* Fix query to fetch config
* Remove debug log
* add configuration objects
* add configuration objects - fixed
* respect the NETDATA_DISABLE_CLOUD env variable
* NETDATA_DISABLE_CLOUD env variable sets the default, but the config sets the final value
* use a new claimed_id on every claiming
* regenerate random key on claiming and wait for online status
* ignore write() return value when writing a newline
* don't show cloud status disabled when claimed_id is missing
* added ctx to alert instances
* cleanup config and transitions from /api/v2/alerts
* fix unused variable
* in /api/v2/alert_config show 1 config without an array
* show alert values conditionally, by appending options=values
* When storing host info if the key value is empty, store unknown
* added options=summary to control when the alerts summary is shown
* increased http_api_v2 to version 5
* claiming random key file is now not world readable
* added local-listeners binary that detects all the listening ports, their IPs and their command lines
---------
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-07-05  health: respect overriding nc binary for IRC notifications (#15310)  Ilya Mashchenko
2023-06-30  Replace `info` macro with a less generic name (#15266)  Carlo Cabrera
2023-06-29  Misc alert fixes (#15274)  Emmanuel Vasilakis
* rebase
* proper pointer
2023-06-28  rewrite /api/v2/alerts (#15257)  Costa Tsaousis
* rewrite /api/v2/alerts
* implement searching for transition
* Find transition id and issue callback
* Fix parameters
* call and transition filter
* Search with transition as well
* renames and cleanup
* render flags
* what if scenario for moving transitions at the top level
* If transition is given, limit the query appropriately
* Add alert transitions
* Optimize find transition to use prepared query. Drop temp table properly
* enabled alert instances again
* Order by when key
* Order by global_id
* Return last X transitions
* updated field names
* add ati to configurations and show all keys in debug mode
* Code cleanup and optimizations
* Drop temp table in case of error
* Finalize temp table population statement to prevent memory leak
* final changes
---------
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-06-22  New alerts endpoint (#15232)  Stelios Fragkakis
* alerts / alerts_log v2
* Add global_id to ae. Populate entries with global id
* Remove transition id from template. Change history to instances
* Link ae to rc in all cases. Code cleanup
2023-06-21  Use a single health log table (#15157)  Emmanuel Vasilakis
* move old health log tables to one
* change table in sqlite_health
* remove check for off period of agent
* changes in aclk_alert
* fixes
* add new field insert_mark_timestamp
* cleanup
* remove hostname, create the health log table during sqlite init
* create the health_log during migration
* move source from health_log to alert_hash. Remove class, component and type field from health_log
* Register now_usec sqlite function
* use global_id instead of insert_mark_timestamp. Use function now_usec to populate it
* create functions earlier to have them during migration
* small unit test fix
* create additional health_log_detail table. Do the insert of an alert event on both
* do the update on health_log_detail
* change more queries
* more indexes, fix inject removed
* change last executed and select health log queries
* random uuid for sqlite
* do migration from old tables
* queries to send alerts to cloud
* cleanup queries
* get an alarm id from db if not found in memory
* small fix on query
* add info when migration completes
* don't pick health_log_detail during migration
* check proper old health_log table
* safer migration
* proper log sent alerts. small fix in claimed cleanup
* cleanups
* extra check for cleanup
* also get an alarm_event_id from sql
* check for empty source
* remove cleanup of main health log table
---------
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-06-19  /api/v2/nodes and streaming function (#15168)  Costa Tsaousis
* dummy streaming function
* expose global functions upstream
* separate function for pushing global functions
* add missing conditions
* allow streaming function to run async
* started internal API for functions
* cache host retention and expose it to /api/v2/nodes
* internal API for function table fields; more progress on streaming status
* abstracted and unified rrdhost status
* port old coverity warning fix - although it is not needed
* add ML information to rrdhost status
* add ML capability to streaming to signal the transmission of ML information; added ML information to host status
* protect host->receiver
* count metrics and instances per host
* exposed all inbound and outbound streaming
* fix for ML status and dependency of DATA_WITH_ML to INTERPOLATED, not IEEE754
* update ML dummy
* added all fields
* added streaming group by and cleaned up accepted values by cloud
* removed type
* Revert "removed type". This reverts commit faae4177e603d4f85b7433f33f92ef3ccd23976e.
* added context to db summary
* new /api/v2/nodes schema
* added ML type
* change default function charts
* log to trace new capa
* add more debug
* removed debugging code
* retry on receive interrupted read; respect sender reconnect delay in all cases
* set disconnected host flag and manipulate localhost child count atomically, inside set/clear receiver
* fix infinite loop
* send_to_plugin() now has a spinlock to ensure that only 1 thread is writing to the plugin/child at the same time
* global cloud_status() call
* cloud should be a section, since it will contain error information
* put cloud capabilities into cloud
* aclk status in /api/v2 agents sections
* keep aclk_connection_counter
* updates on /api/v2/nodes
* final /api/v2/nodes and addition of /api/v2/nodes_instances
* parametrize all /api/v2/xxx output to control which info is output per endpoint
* always accept nodes selector
* st needs to be per instance, not per node
* fix merging of contexts; fix cups plugin priorities
* add after and before parameters to /api/v2/contexts/nodes/nodes_instances/q
* give each libuv worker a unique id
* aclk http_api_v2 version 4
2023-06-16  Fix health crash (#15209)  Stelios Fragkakis
2023-06-08  Fix CID 385073 -- Uninitialized scalar variable (#15163)  Stelios Fragkakis
Fix CID 385073 Uninitialized scalar variable
2023-06-07  freeipmi: add availability status chart and alarm (#15151)  Ilya Mashchenko
2023-06-05  Generate, store and transmit a unique alert event_hash_id (#15111)  Emmanuel Vasilakis
* generate and store an event_hash_id
* transmit to cloud
* transmit to the cloud
2023-06-01  health: remove "families" from alarms config (#15086)  Ilya Mashchenko
2023-05-29  Only queue an alert to the cloud when it's inserted (#15110)  Emmanuel Vasilakis
only queue an alert to cloud when its inserted
2023-05-24  fix cockroachdb alarms (#15095)  Ilya Mashchenko
2023-05-23  Better cleanup of health log table (#15045)  Emmanuel Vasilakis
2023-05-22  Use chart labels to filter alerts (#14982)  Emmanuel Vasilakis
* use chart labels to filter alerts
* add entry to readme
* support chart_label=val val2 val3
* docs updates
* more docs
* use rc not rt
2023-05-15  Comment out default `role_recipients_*` values (#15047)  James Gregory-Monk
2023-05-02  feat: add OpsGenie alert levels to payload (#14992)  OliverNChalk
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
2023-04-25  Update README.md (#14962)  Chris Akritidis
2023-04-21  Add a checkpoint message to alerts stream (#14847)  Emmanuel Vasilakis
* pull aclk schemas
* resolve capas
* handle checkpoints and removed from health
* build with disable-cloud
* codacy 1
* misc changes
* one more char in hash
* free buffer
* change topic
* misc fixes
* skip removed alert variables
* change hash functions
* use create and destroy for compatibility with older openssl
2023-04-20  WEBRTC for communication between agents and browsers (#14874)  Costa Tsaousis
* initial webrtc setup
* missing files
* rewrite of webrtc integration
* initialization and cleanup of webrtc connections
* make it compile without libdatachannel
* add missing webrtc_initialize() function when webrtc is not enabled
* make c++17 optional
* add build/m4/ax_compiler_vendor.m4
* add ax_cxx_compile_stdcxx.m4
* added new m4 files to makefile.am
* id all webrtc connections
* show warning when webrtc is disabled
* fixed message
* moved all webrtc error checking inside webrtc.cpp
* working webrtc connection establishment and cleanup
* remove obsolete code
* rewrote webrtc code in C to remove dependency for c++17
* fixed left-over reference
* detect binary and text messages
* minor fix
* naming of webrtc threads
* added webrtc configuration
* fix for thread_get_name_np()
* smaller web_client memory footprint
* universal web clients cache
* free web clients every 100 uses
* webrtc is now enabled by default only when compiled with internal checks
* webrtc responses to /api/ requests, including LZ4 compression
* fix for binary and text messages
* web_client_cache is now global
* unification of the internal web server API, for web requests, aclk request, webrtc requests
* more cleanup and unification of web client timings
* fixed compiler warnings
* update sent and received bytes
* eliminated almost all big buffers in web client
* registry now uses the new json generation
* cookies are now an array; fixed redirects
* fix redirects, again
* write cookies directly to the header buffer, eliminating the need for cookie structures in web client
* reset the has_cookies flag
* gathered all web client cleanup to one function
* fixes redirects
* added summary.globals in /api/v2/data response
* ars to arc in /api/v2/data
* properly handle host impersonation
* set the context of mem.numa_nodes
2023-04-18  bump go.d.plugin to v0.52.1 (#14921)  Ilya Mashchenko
2023-04-12  Update REFERENCE.md (#14900)  Chris Akritidis
2023-04-12  Collect additional BTRFS metrics (#14636)  Dimitris P
* Add commit_stats metrics to BTRFS section
* Add error_stats metrics (per device) to BTRFS section
* Simplify commit stats variables and chart ids/names
* Add basic BTRFS error alarms. Configured to trip whenever any of the error dimensions is non-zero.
* Add chart descriptions for new charts.
* Remove duplicate code
* Comment out some debugging code
* Always create error stats dimensions, even if zero
* Show rate of commits and commit duration instead of totals
* Change current commit metrics to absolute from incremental
* Change commits dimension to absolute and add separate commits time share chart
* Rename 'device_' rrdlabels to 'filesystem_'
* Replace all snprintf() calls with snprintfz()
* Fix codacy warning
* Provide separate error charts for each filesystem device
* Accept code review suggestions for more descriptive context and labels
  Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
* Add 'device' prefix to id, name, title of errors chart
* Add 'device_id' label to device_errors
* Update health.d/btrfs.conf to match new errors charts
* Remove commented out code
* Do not disable all BTRFS metrics collection if only commit_stats is missing
* Do not disable all BTRFS metrics collection if only error_stats is missing
* Fix bug of BTRFS device add/remove not being detected properly
* Fix double free() error when deleting a device
* Update dashboard info with bold tags
  Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
---------
Co-authored-by: Austin S. Hemmelgarn <austin@netdata.cloud>
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
2023-04-10  Add support for alert notifications to ntfy.sh (#14875)  Dim-P