path: root/health
Age | Commit message | Author
2020-03-02 | vernemq alarms, dashboard info and godplugin new version (#8236) | Ilya Mashchenko
* web/gui: add vernemq to the dashboard_info.js * health: add vernemq alarms and update Makefile.am * health: vernemq alarms info fix * health: vernemq alarms info fix * health: fix vernemq_socket_errors template * packaging: bump godplugin version to v0.16.0 * packaging:update godplugin checksums * docs: add vernemq to the COLLECTORS.md
2020-02-24 | Merging the feature branch for the ACLK in the previous sprint. (#8179) | Andrew Moss
* ACLK connection and protocol improvements (#8139) * Adding ACLK retry on connection failure (#8147) * Fixed reconnect issues on the ACLK. (#8163) * Cleaning up ACLK - part 1 (#8167) Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2020-02-20 | Tutorials to support v1.20 release (#7943) | Joel Hans
* Add draft of CockroachDB tutorial * Fixed and new images * Support figures for images * Change border color * Change job * Initialize eBPF tutorial * Very very rough draft of host labels tutorial * Add a few mentions of tutorial * Fix for Thiago * Simplify health entities * Fixes for Thiago * Fixes and add tutorials to collectors README * Fixes to CockroachDB * Remove ebpf tutorial * remove link * Updates for Patti and Thiago * Add streaming security note * Straightaway
2020-02-11 | fix_exclusive_notification: (#7769) | thiagoftsm
This commit removes the returns that were creating the bug
2020-02-08 | health_doc_name: Clarify the rules to create an alarm name (#7911) | thiagoftsm
* health_doc_name: Clarify the rules to create an alarm name * health_doc_name: Fixes typo and grammar * health_doc_name: Fixes typo
2020-02-07 | alarms_values: New endpoint (#7836) | thiagoftsm
* alarms_values: New endpoint This commit brings the new endpoint to Netdata * alarms_values: Documentation This commit brings the missing documentation for the PR * alarms_values: New function This commit brings new code that removes duplication * alarms_values: Fix typo * alarms_values: Fix missing word This commit fixes the missing word inside the documentation * alarms_values: Fix order This commit fixes the order of the alarm answer * alarms_values: Fixes typo and remove unnecessary variable * alarms_values: Fixes doc Describe all parameters present in the endpoint * alarms_values: Same options This commit brings the same input pattern for alarms and alarms_values * alarms_values: Update swagger This commit brings the missing information to swagger json * alarms_values: Update swagger This commit brings the missing information to swagger yaml
2020-02-06 | Drop dirty dbengine pages if disk cannot keep up (#7777) | Markos Fountoulakis
* Introduce dirty page pressure handling in the dbengine page cache that invalidates pages when the disk cannot keep up with the flushing speed.
2020-02-06 | ACLK agent 1 (#7894) | Stelios Fragkakis
* - Add initial mqtt support * [WIP] Agent cloud link - Setup main mqtt thread to connect to a broker using V5 of the MQTT protocol (TBD) - Send alarms to "netdata/alarm" - Add error checks to handle connection failures - Add params for Broker, port Maximum concurrent sent / recev messages - Dummy function to check claiming status - Generic mqtt_send command to publish message to a base topic , sub topic It will end up in the form base_topic/sub_topic - Add host/port in the connection failure error message * Test libmosquitto libs * connect to broker locally (assume localhost:1883) * subscribe to channel netdata/command * Test try a reload command to trigger health reload * publish alerts to netdata/alarm * - Fix compile issues * - Use sleep_usec instead of usleep * - Delay reconnection on failure due to misconfiguration (high cpu usage) * - Remove the TLS connection config * - Fix NETDATA_MQTT_INITIALIZATION_SLEEP_WAIT to use seconds * - Gather ACLK related code under aclk folder - Add aclk_ functions for abstract layer - Moved low level libs intergration in mqtt.c * - Add README.md file with initial comment * - Clean MQTT v5 * - Code cleanup * - Remove alarm log for now - Remove the heart beat * - Remove message properties for V5 * - Remove message properties for V5 (header) * Fixed the netdata target to use a local static version of libmosquitto. The installer does not yet have steps to pull and build the local library. cd project_root git clone ssh://git@github.com/netdata/mosquitto mosquitto/ (cd mosquitto/lib && make) # Ignore the cpp error This will leave mosquitto/lib/libmosquitto.a for the build process to use. * - Fix compile issues with older < 1.6 libmosquitto lib * - Enable alarm events to check it works - Re arrange includes - Rework topic to be agent/guid/. 
Actual id will be returned by the is_agent_claimed * - Add initial metadata info - Added helper function in web_api - Added a debug command (info) * Update the claiming state to retrieve the claimed id. * - Use define for constants like command and metadata topics - Function to wait for initialization of the ACLK link - New aclk_subscribe command with QOS parameter for the mqtt subscription - Use the is_agent_claimed function to get the real claim id and use it to build the topics that will be used for the cloud communication - Change in netdata-claim.sh.in to write the claim id without a trailing \n * - Use define for constants like command and metadata topics - Function to wait for initialization of the ACLK link - New aclk_subscribe command with QOS parameter for the mqtt subscription - Use the is_agent_claimed function to get the real claim id and use it to build the topics that will be used for the cloud communication - Change in netdata-claim.sh.in to write the claim id without a trailing \n * - Remove the alarm log for now - Add code (but disabled) to send charts * - Use dummy anon, anon as username and password for testing purposes * - Use client id anon as well * Testing without TLS * Switching TLS back on to fix docker environment. 
* - Added query processing An incoming URL now calls web_client_api_request_v1_data to handle a request and push the results back to the "data" topic - Move the above processing from the message callback to the query handle loop - Added helper "pause" , "resume" commands to stop and resume query processing to stress test loading the queue with queries before executing them - Changed the endpoint topics to "meta", and "cmd" (previously metadata and command) * make info message follow protocol * move metadata msg generation into new func * move metadata msg generation into new func * - Add metadata to the responses - Add hook to queue chart changes on creation and dimensions - Changed the queue mechanism to include delay for X seconds - Add delayed submittion of charts to the cloud so that all DIMs are defined to avoid resubmission * - Add additional data info for aclk_queue command * - Use web_clinet_api_request_v1 to handle the incoming request This will handle all requests coming from the cloud * - Cleanup and aclk_query structure - Add msg_id parameter - Enable the incoming JSON request - Enable the outgoing JSON response * - Added new thread to handle query processing - Add lock and cond wait to wakeup thread when queries are submitted - Cleanup on the main init function * - Add wait time on agent init, to allow for chart, alarms and other definitions to be completed. 
- During the wait time, no queries will be queued * - Send metadata on query thread init - New generic create header function for the JSON response - Pack info and charts into one message - Modified chart to remove entries (test) - Modified charts mod to remove entries e.g alarms and volatile info - Change input to aclk_update_chart (RRDHOST / instead of hostname) * - When a request fails, add to the payload - We may need to handle in a different key - Error check in json parsing * - Add dummy aclk_update_alarm command * - Move incoming request JSON parsing code away from mqtt.c - Added #ifdef ACLK_ENABLE so that we can have code merged but disabled by default - Added version in incoming and outgoing JSON dict * - Disable code if ACLK_ENABLE is not defined - Remove references to the mqtt (mosquitto) lib - Add dummy stubs in mqtt.c for completeness if ACLK_ENABLE is not defined * - Disable challenge sample code for now * - Remove libmosquitto from makefile * - Fix spaces in Makefile.am - Remove ifdef to avoid warning from LGTM * - Remove for now the code that builds an along log test message to send to the cloud * - Add check for ACLK_ENABLE definition and avoid calling the chart update functions * - Remove commented code * - Move source files to the correct place (ACLK_PLUGIN_FILES) * - Remove include file thats not needed * - Remove include file thats not needed - Add improved checks for load_claiming_state() * - Fix error message. 
Used error() that also logs errno and message * - Fix some codacy issues * - Fix more codacy issues, code cleanup * - Revert code to address codacy warnings * - Revert spaces added in a previous commit by mistake * clean up if/else nest * print error if fopen fails * minor - error already logs errno * - Fix version formatting * - Cleanup all ACLK related compiler warnings - Re-arrange include files - Removed unused defines * - More compilation warnings fixed - Bug with thread creation fixed * - Add condition to skip compilation of the ACLK code entirely. Add env variable ACLK="yes" to enable * - Add condition to skip the libmosquitto * - Change feature flag from ACLK_ENABLE to ENABLE_ACLK in accordance with the rest of ENABLE_xx flags - Typo in info message fix Co-authored-by: Andrew Moss <1043609+amoss@users.noreply.github.com> Co-authored-by: Timo <6674623+underhood@users.noreply.github.com>
2020-02-01 | Parse host tags (#7702) | Vladimir Kobal
* Fix memory leaks * Check for configuration options * Parse simple tags * Parse JSON tags * Remove an unnecessary check * Parse a JSON object * Parse a JSON array * Update the documentation * Fix host locks
2020-01-31 | installer: include go.d.plugin version v0.15.0 (#7882) | Ilya Mashchenko
* /web/gui/dashboard_info.js: add cockroachdb info * /web/gui/dashboard_info.js: lgtm fix * /health/health.d/: add cockroachdb.conf
2020-01-30 | Clarify editing health config files in health quickstart (#7883) | Joel Hans
* Add fixes to health quickstart * Add notice about EDITOR and fix link
2020-01-28 | Missing extern (#7877) | thiagoftsm
* missing_extern: Fix missing Fix few externs that were missing in global variables * missing_extern: Variables This commit declares the variables inside .c files
2020-01-25 | Remove all references to .keep files (#7829) | James Mills
2020-01-17 | Alarm Log labels (#7548) | thiagoftsm
* alarm_log_with_labels: Alarm Log Rebase of alarm log to commit against master * alarm_log_with_labels: Remove lock This commit removes unnecessary locks from health_log * alarm_log_with_labels: Restore and Rebase Remove previous changes and rebase the PR * alarm_log_with_labels: Unique line This commit brings a unique line to alarm log * alarm_log_with_labels: Correct separator This log file uses tabulation instead of comma * alarm_log_with_labels: Fix memory leak There was a missing call for buffer_free
2020-01-15 | Update stop-notifications-alarms.md (#7737) | Yashar Nesabian
fix a repetitive word
2020-01-06 | Clean up host labels in API responses (#7616) | Vladimir Kobal
* Remove host labels from the Swagger specification * Remove host labels from the api responses
2020-01-02 | Adjust alarm labels (#7600) | thiagoftsm
* adjust_alarm_labels: variable rename This commit renames the variables inside health * adjust_alarm_labels: Doc Changes documentation for the labels * adjust_alarm_labels: Fix typo this commit brings the fix for the documentation * adjust_alarm_labels: Table align Fix table align on documentation * adjust_alarm_labels: Table align Fix link * adjust_alarm_labels: Link * adjust_alarm_labels: Link * adjust_alarm_labels: Remove contradiction The previous documentation had a contradiction removed with this commit * adjust_alarm_labels: Missing conversion This commit brings the latest change to text
2019-12-18 | silencers_info: Change error to info (#7479) | thiagoftsm
This commit changes the error message to info when the file is not present
2019-12-16 | Labels issues (#7515) | Andrew Moss
Initial work on host labels from the dedicated branch. Includes work for issues #7096, #7400, #7411, #7369, #7410, #7458, #7459, #7412 and #7408 by @vlvkobal, @thiagoftsm, @cakrit and @amoss.
2019-12-09 | Fix missing parenthesis on softnet.conf (#7476) | Steve8291
Missing parenthesis in alarm: 1min_netdev_backlog_exceeded
2019-12-04 | Docs: Fixes to new health documentation structure (#7419) | Joel Hans
* Fixed link * Added GA links
2019-12-04 | installer: include go.d.plugin version v0.12.0 (#7418) | Ilya Mashchenko
* add unbound basic alarms * add scaleio basic alarms * update health Makefile.am * add scaleio to dashboard_info.js * packaging: set go.d.plugin version to 0.12.0 * packaging: update go.d.plugin checksums
2019-12-03 | Health: Proposed restructuring of health documentation (#7329) | Joel Hans
* Squashed commits for PR * Addressing comments from Chris and Thiago * Changed sidebar title * Fixes for Vlad
2019-11-25 | installer: include go.d.plugin version v0.11.0 (#7365) | Ilya Mashchenko
* bump godplugin ver to 0.11.0 * update godplugin checksums * add python unbound module to obsolete modules list * add deprecation info to the python unbound readme * remove old unbound charts descriptions from the dashboard_info.js * add web_log go ver alarms * update web_log alarms info (401) * remove unbound from python.d.conf
2019-11-15 | Fine tune various alarm values. (#7322) | Austin S. Hemmelgarn
* Fix formatting in alarm configurations. This makes sure everything is lined up properly so that the alarm definitions are easier to read. * Make TCP Accept Queue alarms much less aggressive. This switches the alarms to use averages instead of sums, and bumps up the trip points to be more aggressive, as both of these may be non-zero even in normal operation of a system. * Make softnet alarms less aggressive. This decreases the sampling window from 10 minutes to 1 minute, switches to using an average instead of a sum, and adjusts the trigger thresholds to be more aggressive. This one will need to be watched, as the resultant values may be too lenient for some systems. * Tweak UDP alarms to work like the TCP alarms. Just to ensure consistency.
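The effect of switching an alarm from a sum to an average can be sketched in a few lines of Python. This is only an illustration: the sample values, the window length, and the threshold are made up, and Netdata's real alarms are defined in health configuration files rather than in Python.

```python
# Hypothetical per-second counts of dropped packets in a 10-sample window.
samples = [0, 1, 0, 0, 2, 0, 0, 1, 0, 0]

# Hypothetical trip point, applied unchanged to both aggregations.
threshold = 1


def window_sum(values):
    """Total over the window; grows with the window length."""
    return sum(values)


def window_average(values):
    """Mean over the window; independent of the window length."""
    return sum(values) / len(values)


# A few sporadic events are enough to push the sum over the trip point,
# while the average stays well below it.
sum_trips = window_sum(samples) > threshold
average_trips = window_average(samples) > threshold
```

With the same trip point, the sum-based check fires on occasional non-zero samples, while the average-based check only fires when the rate is sustained across the window.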
2019-11-11 | Ownership and permissions of /etc/netdata (#7244) | Konstantinos Natsakis
* make install takes care of ownership and permissions of /etc/netdata Instead of netdata-installer.sh * Fix identation in Makefile.am files * netdata-installer.sh: Clearer variable assignment * netdata-installer.sh: Set /etc/netdata/netdata.conf ownership to root:root and permissions to 0644 * netdata-installer.sh: Set /etc/netdata/.environment permissions to 0644 * install-or-update.sh: Set permissions for /opt/netdata/etc/netdata.conf to 0644 * install-or-update.sh: Use ${NETDATA_PREFIX} more * install-or-update.sh: Improve indentation * install-or-update.sh: Do not create /opt/netdata/etc/netdata directories * debian/rules: /etc/netdata files and directories are now installed by make install * debian/rules: Properly copy files across directories When destination directory exists * netdata.spec.in: /etc/netdata ownership and permissions * Revert "Fix identation in Makefile.am files" This reverts commit 63fdb299b69152fda6984f81b0fef02f364c5efe. * Remove uninstall-local recipes from Makefile.am files * Removed superfluous whitespace and hash
2019-11-11 | Makefile.am files indentation (#7252) | Konstantinos Natsakis
* Use 4 spaces for indentation of non-recipe lines in Makefile.am files * Be more consistent in the use of space before = in Makefile.am files
2019-11-07 | fix_irc_notification: Remove line break from message (#7243) | thiagoftsm
* fix_irc_notification: Remove line break from message The line break present in Netdata alarms creates 421 errors on the server, because according to RFC 1459 a line break marks the end of the message * fix_irc_notification: Adjust tabulation The script 'alarm-notify.sh' was not following our default format; this commit returns the newest line to the old format
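The failure mode behind this fix can be sketched as follows. This is a minimal Python illustration, not the actual change to `alarm-notify.sh` (which is a shell script); the function names `sanitize_irc_text` and `build_privmsg` are hypothetical. It relies only on what RFC 1459 states: a CR-LF pair terminates a message, so text after an embedded line break is parsed as a separate command, which the server may reject with numeric 421 (unknown command).

```python
def sanitize_irc_text(text: str) -> str:
    """Collapse embedded line breaks so the text fits in one IRC message."""
    # splitlines() handles \n, \r\n and \r; empty fragments are dropped.
    parts = (part.strip() for part in text.splitlines())
    return " ".join(part for part in parts if part)


def build_privmsg(target: str, text: str) -> str:
    """Build a single PRIVMSG line terminated by exactly one CR-LF.

    `target` is a channel or nick; both names here are illustrative.
    """
    return f"PRIVMSG {target} :{sanitize_irc_text(text)}\r\n"
```

For example, `build_privmsg("#alerts", "CPU usage high\nhost: web1")` yields one protocol line instead of two, so the server never sees `host: web1` as a command of its own.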
2019-11-05 | Update SYN cookie alarm to be less aggressive. (#7250) | Austin S. Hemmelgarn
* Update SYN cookie alarm to be less aggressive. Based on discussion from #6998 * Update SYN queue overflow alarm the same way.
2019-10-30 | Update alarm-notify.sh to enable IRC notifications (#7148) | Avinash H. Duduskar
* Update alarm-notify.sh Add irc as method and update author name. There are still some bugs (false error code 421) when using irc notifications, but adding it to the method list at least enables the notification method and sends alerts as expected.
2019-10-24 | detect if the disk cannot keep up with data collection (#7139) | Markos Fountoulakis
* Adjust dbengine flushing speed more dynamically * Added error tracking statistics for failure to flush events * Added alarm for dbengine flushing errors * Improved dbengine accounting for committed to be written pages
2019-10-22 | add support for am2320 sensor (#7024) | Tom Buck
* add support for am2320 sensor add support for am2320 temperature and humidity sensor * Rename readme.md to README.md * updated README.md to include proper sections updated README.md to include proper sections * readme updated and file name corrected readme updated with missing formatting and information. AM2320.chart.py filename corrected. * changed simple service import changed simple service import location * updated README.md to remove the reference of moving the script file. * requested changes - Moved header from README.md to am2320.chart.py - Added Alarm for am2320 to health.d - Changed exception to value error in am2320.chart.py * typo changed mae to make in comment * Add title and icon for AM2320 Sensor Add title and icon for AM2320 Sensor * typo corrected changed Save to save * added I2C group to installer Added netdata to the I2C group during install or update. Removed instruction to add netdata to I2C group from README.md * change tab to spaces change tab to spaces
2019-10-21 | telegram: fix broken links, add setup instructions (#7033) | mal
* Fix broken telegram bot link, tidy comments The web documentation refers to `@myidbot`, which works. `@get_id` is currently a useless channel. * Add telegram notification setup instructions
2019-10-21 | mysql: add cluster_status alarm (#6989) | Ilya Mashchenko
2019-10-17 | Implement hangouts chat notifications (#7013) | Hendrik Hofstadt
* Implement hangouts chat notifications * Fix null check in bash script * Apply suggestions from code review Co-Authored-By: Joel Hans <joel.g.hans@gmail.com>
2019-10-14 | Fix typo in health_alarm_notify.conf (#7062) | sz4bi
2019-10-07 | Remove hard cap from page cache size to eliminate deadlocks. (#7006) | Markos Fountoulakis
* Remove page cache error detection and deadlock resolution * Change page cache logic to disallow deadlocks due to too many API users * Updated documentation * Changed default and minimum page cache size values to 32 and 8 MiB respectively
2019-10-06 | fix bug: issue #7002 (#7003) | 刘洋 Jax
2019-10-05 | Increase dbengine default cache size (#6997) | Markos Fountoulakis
* Increase database engine default page cache size to support up to 32K metrics out of the box * Reduce mass flood effect of dbengine page cache alarm * changed repeating notification to every hour
2019-10-03 | Update README.md (#6961) | Chris Akritidis
2019-10-03 | mysql: collect galera cluster metrics (#6962) | Ilya Mashchenko
* mysql: collect galera cluster metrics * mysql: readme update * mysql: add galera cluster size and state alarms
2019-09-27 | Create a template for all dimensions (#6560) | thiagoftsm
* health_connection: Comments inside Health Config To try to understand better what is necessary to change and where it is necessary to change anything inside the health, I commented the functions inside this file" " * health_connection: Comments about Health in other files This commit brings the rest of the comments that were missed for health" * health_connection: Comments on health_log I had to append more comments on health_log * health_connection: Create a new variable New variable is created to work with foreach * health_connection: Fix new option and doc The first implementation of the 'foreach' had a problem, this fixes the error. This commit also brings the updates for the documentation * health_connection: Understanding health This commit is to save the place that I am working, it has the map to understand all the alam process * health_connection: Update map I changed the position of the error message to identify the correct place to add new alarms * health_connection: End of simple alarm This commit finishes what is necessary to bring the same lookup for different dimensions in one unique line * health_connection: Documentation and template steps This commit brings the documentation missed for template and comments to help in the next step of apply a template to create an alarm. * health_connection: Restoring After some tests, it was detected that the alarms were not working as expected * health_connection: Fix bug and bring dimension to template This commit brings a fix for an old Netdata bug, before this the Netdata always tried to create a new entry in an index with the same id raising an error. 
It also brings the possibility to use 'foreach' in template * health_connection: Fix cmake compilation There was a problem with cmake compilation fixed by this commit * health_connection: shell script Finilize the shell script to test the PR * health_connection: Remove debug message During the development, I used some messages to understand the code this commit removes the last message * health_connection: Fix bugs This commits fix bugs reported by tests * health_connection: Alarm working This commit brings the necessary change for the alarms work, but it is missing the unlink from the newest list * health_connection: Template code written This commit finishes the creation of alarm from template, but it was not tested yet. * health_connection: Remove comments I am removing the comments from this PR to bring back late * health_connection: Remove lines Another commit to restore the files before they to be commented * health_connection: New alarm and remove messages I am bringing a new alarm to test template with SP and removing comments used during the development * health_connection: Functional test review After to review the functional test script, it was necessary to small adjust to test all the features available with the new version * health_connection: Free structure I am moving the free list for the correct place, the previous place was not safe * health_connection: ShellCheck This commit fixes the problems with shellcheck * health_connection: FIx hash This commit fix the hash calculation that was using wrong input * health_connection: Fix message error The system was showing a wronge message, because when we have foreach the alarm created with templated is added in a second stage to the index * health_connection: Fix documentation In this commit I am fixing the grammar of the previous doc and bringing two examples * health_connection: Fix examples This commit fix the last two examples that was brought in this PR * health_connection: Fix example doc When I 
brought the correct grammar in the last commit, I lost a mark * health_connection: Grammar fix Fixing grammar of the documentation * health_connection: Memory leak This commit fixes the memory leak that was present in the PR * health_connection: Reload This commit fix the problem that the alarms were not linked after to receive a SIGUSR2 * health_connection: False Positive from codacy Codacy was given a false positive, I changed the function to avoid it. * health_connection: dead code Remove dead code from the code. * health_connection: Memory Leak Remove memory leak when clean simple pattern * health_connection: Script format With this commit I am formatting the last message to return for the default color on terminal * health_connection: Script format 2 With this commit I am formatting the last message to return for the default color on terminal * health_connection: Script format 3 With this commit I am formatting the error message to return for the default color on terminal
2019-09-25 | zookeeper and hdfs: alarms and dashboard_info (#6927) | Ilya Mashchenko
* add zookeeper alarms * add zookeeper to dashboard_info * zookeeper alarm fix * add hdfs alarms * add hdfs to dashboard_info * minor * fix hdfs zk links: use latest version * hdfs dashboard_info: change semicolon to comma
2019-09-24 | Detect deadlock in dbengine page cache (#6911) | Markos Fountoulakis
* Detect deadlock in dbengine page cache when there are too many metrics and print error message * Resolve dbengine deadlock by dropping metrics when page cache is too small and define relevant alarms * Changed printing deadlock errors to only happen once per dbengine instance
2019-09-22 | Correct read length of silencers file (#6909) | Chris Akritidis
2019-09-20 | Fix some errors reported by Coverity (#6797) | thiagoftsm
* coverity_20190905: Fix reported bugs This commit has fixes for some bugs reported by Coverity in the present day * coverity_20190905: Fix missing report Fix a missing report of error * coverity_20190905: Pipe close The previous fix had an error that would allow a socket to remain open, this commit fixes this * coverity_20190905: Error pattern The call of perror would generate a different error report, instead I am using strerror() to keep pattern * coverity_20190905: Error function Rewrite the call to error function * coverity_20190905: Fix missing tests The previous fix did not have correct tests after cleaning the variables * coverity_20190905: Fix readable I changed to an else instead of a new if, it is cleaner this way * coverity_20190905: remove unnecessary test This commit is removing an unnecessary test for a variable that will never be NULL. * coverity_20190905: Add necessary NULL After cleaning the variable, I am setting the variable to NULL to avoid cleaning it again * coverity_20190905: Remove false error The condition added to fix Coverity was generating false positives, so we are changing to debug * coverity_20190905: Remove false error The condition added to fix Coverity was generating false positives, so we are changing to debug * coverity_20190905: Bring else to avoid error Bring an else to solve the problem to read a FD not opened * coverity_20190905: Return After analysing the last changes, I decided to return, because they were not necessary * coverity_20190905: Remove NULL Remove unnecessary set of variable to NULL
2019-09-17 | vcsa collector: charts description and alarms (#6772) | Ilya Mashchenko
* add vcsa to dashboard_info.js * add vcsa alarms * update dashboard_info.js * update dashboard_info.js * update alarms * availability alarm fix
2019-09-17 | Instructions for simple SMTP transport (#6870) | Chris Akritidis
Adds info on how to use msmtp as an alternative to sendmail, in order to have a simple MTA configuration for sending emails and auth to an existing SMTP server.
2019-09-17 | Gearman plugin for Netdata (#6567) | Kyle Agronick
Added Gearman plugin and alarms
2019-09-17 | Center the chart on timeframe when an alarm is raised (#6391) | thiagoftsm
##### Summary When an alarm was raised, we sent notifications to our users, but the notifications were missing information that allows the chart to be centered on the relevant timeframe. ##### Component Name UI ##### Additional Information Closes #5810