path: root/health
Age  Commit message  Author
2020-07-26  Add alarms for FreeBSD interface errors (#8340)  Lasse Bang Mikkelsen
Based on net.drops alarms.
2020-07-16  Change all instances of alarm to template (#9553)  Toby Hammond
Fix megacli.conf alarm.
2020-07-11  Remove health from archived metrics (#9520)  Markos Fountoulakis
* Disassociate health variables and alarms from archived charts and dimensions. * Ignore archived charts during health reload.
2020-07-06  Fix broken link in Kavenegar notification doc (#9492)  Joel Hans
* Fix broken link * Retrigger CI
2020-06-29  Fixed duplicate alarm ids in health-log.db (#9428)  Stelios Fragkakis
Fixed duplicate alarm ids in health-log.db
2020-06-12  Change streaming terminology to parent-child in the code (#9323)  Andrew Moss
2020-06-12  Add support for persistent metadata (#9324)  Stelios Fragkakis
* Implemented collector metadata logging * Added persistent GUIDs for charts and dimensions * Added metadata log replay and automatic compaction * Added detection of charts with no active collector (archived) * Added new endpoint to report archived charts via `/api/v1/archivedcharts` * Added support for collector metadata update Co-authored-by: Markos Fountoulakis <44345837+mfundul@users.noreply.github.com>
2020-06-08  Add revisions to Matrix doc (#9295)  Joel Hans
2020-06-08  Support for matrix notifications (#9196)  David Heidelberg
2020-06-04  Move/refactor docs to accomodate new Guides section on Learn (#9266)  Joel Hans
* Move directories and change verbiage to guide * Move health guides * Quick fix to collectors quickstart * Fix broken links * Remove health/tutorials dir * Fix links in collectors quickstart * Fix links to go.d pages
2020-06-03  Fixes documentation ambiguity leading into issue #8239 (#9255)  Timotej S
* docu update * ilyam8 & joelhans comments on PR resolved
2020-05-26  New alarms (exporting and Backend) (#9075)  thiagoftsm
New alarms for exporting and backend.
2020-05-14  Improve the impact of health code on netdata scalability (#8407)  Markos Fountoulakis
* Add support for spawning processes without pipes. * Port health_alarm_execute() from mypopen() to netdata_spawn() * Make alarm notifications asynchronous within a single health thread iteration * Initial version of spawn server. * preliminary integration of spawn client with health
2020-05-14  Account for zfs.arc_size.min, and correct calc (#8913)  araemo
2020-05-12  Remove check for old alarm status (#8978)  Stelios Fragkakis
Fixed coverity issue (CID 358436)
2020-05-11  Docs: Fix internal links and remove obsolete admonitions (#8946)  Joel Hans
* Fixed a few more links * Remove old syntax * Abs-relative links to files in docs folder * Trying to fix nother doc learn link * Fix a few more links * Add testing doc * Tracking down mysteries * Cleanup * Update broken external links * Remove index.html that appeared from testing * Fix remainder of links
2020-05-11  Enable support for Netdata Cloud.  Andrew Moss
This PR merges the feature-branch to make the cloud live. It contains the following work: Co-authored-by: Andrew Moss <1043609+amoss@users.noreply.github.com> Co-authored-by: Jacek Kolasa <jacek.kolasa@gmail.com> Co-authored-by: Austin S. Hemmelgarn <austin@netdata.cloud> Co-authored-by: James Mills <prologic@shortcircuit.net.au> Co-authored-by: Markos Fountoulakis <44345837+mfundul@users.noreply.github.com> Co-authored-by: Timotej S <6674623+underhood@users.noreply.github.com> Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com> * dashboard with new navbars, v1.0-alpha.9: PR #8478 * dashboard v1.0.11: netdata/dashboard#76 Co-authored-by: Jacek Kolasa <jacek.kolasa@gmail.com> * Added installer code to bundle JSON-c if it's not present. PR #8836 Co-authored-by: James Mills <prologic@shortcircuit.net.au> * Fix claiming config PR #8843 * Adds JSON-c as hard dep. for ACLK PR #8838 * Fix SSL renegotiation errors in old versions of openssl. PR #8840. Also - we have a transient problem with opensuse CI so this PR disables them with a commit from @prologic. Co-authored-by: James Mills <prologic@shortcircuit.net.au> * Fix claiming error handling PR #8850 * Added CI to verify JSON-C bundling code in installer PR #8853 * Make cloud-enabled flag in web/api/v1/info be independent of ACLK build success PR #8866 * Reduce ACLK_STABLE_TIMEOUT from 10 to 3 seconds PR #8871 * remove old-cloud related UI from old dashboard (accessible now via /old suffix) PR #8858 * dashboard v1.0.13 PR #8870 * dashboard v1.0.14 PR #8904 * Provide feedback on proxy setting changes PR #8895 * Change the name of the connect message to update during an ongoing session PR #8927 * Fetch active alarms from alarm_log PR #8944
2020-04-22  health: fix mdstat `failed devices` alarm (#8794)  Ilya Mashchenko
2020-04-22  health/portcheck: remove no-clear-notification (#8748)  Ilya Mashchenko
2020-04-20  added whoisquery health templates (#8700)  Yashar Nesabian
Update Makefile.am to add whoisquery.conf
2020-04-15  added certificate revocation alert (#8684)  Yashar Nesabian
* added certificate revocation alert
2020-04-14  Docs: Standardize links between documentation (#8638)  Joel Hans
* Trying out some absolute-ish links * Try one out on installer * Testing logic * Trying out some more links * Fixing links * Fix links in python collectors * Changed a bunch more links * Fix build errors * Another push of links * Fix build error and add more links * Complete first pass * Fix final broken links * Fix links to files * Fix for Netlify * Two more fixes
2020-04-13  Revert "Revert changes since v1.21 in pereparation for hotfix release."  Austin S. Hemmelgarn
This reverts commit e2874320fc027f7ab51ab3e115d5b1889b8fd747.
2020-04-13  Revert changes since v1.21 in pereparation for hotfix release.  Austin S. Hemmelgarn
2020-04-08  health/alarm_notify: add dynatrace enabled check (#8654)  Ilya Mashchenko
2020-04-06  Health Alarm to Dynatrace Event implementation (#8476)  Illumine IT Consulting
* Health Alarm to Dynatrace Event implementation * Update health/notifications/dynatrace/README.md Co-Authored-By: Patti Short <35278231+shortpatti@users.noreply.github.com> * Update health/notifications/dynatrace/README.md Co-Authored-By: Patti Short <35278231+shortpatti@users.noreply.github.com> * Update health/notifications/dynatrace/README.md Co-Authored-By: Patti Short <35278231+shortpatti@users.noreply.github.com> * Update health/notifications/dynatrace/README.md Co-Authored-By: Patti Short <35278231+shortpatti@users.noreply.github.com> * Update health/notifications/dynatrace/README.md Co-Authored-By: Patti Short <35278231+shortpatti@users.noreply.github.com> * Update health/notifications/dynatrace/README.md Co-Authored-By: Patti Short <35278231+shortpatti@users.noreply.github.com> * Removing unwanted " * Update health/notifications/dynatrace/README.md Co-Authored-By: Joel Hans <joel.g.hans@gmail.com> * Update health/notifications/dynatrace/README.md Co-Authored-By: Joel Hans <joel.g.hans@gmail.com> * Update health/notifications/dynatrace/README.md Co-Authored-By: Joel Hans <joel.g.hans@gmail.com> * Update health/notifications/dynatrace/README.md Co-Authored-By: Joel Hans <joel.g.hans@gmail.com> * Update health/notifications/dynatrace/README.md Co-Authored-By: Joel Hans <joel.g.hans@gmail.com> * Update health/notifications/dynatrace/README.md Co-Authored-By: Joel Hans <joel.g.hans@gmail.com> * Update health/notifications/dynatrace/README.md Co-Authored-By: Joel Hans <joel.g.hans@gmail.com> * Implementation of https://github.com/netdata/netdata/pull/8476/checks?check_run_id=533657899 * Implemented https://github.com/netdata/netdata/pull/8476/checks?check_run_id=534740578 Co-authored-by: Patti Short <35278231+shortpatti@users.noreply.github.com> Co-authored-by: Joel Hans <joel.g.hans@gmail.com>
2020-04-06  Docs: Change MacOS to macOS (#8562)  Joel Hans
* Change MacOS to macOS * Change Mac as noun to macOS system
2020-04-03  Change all https://app.netdata.cloud URLs to https://netdata.cloud to restore connectivity with netdata cloud.  Markos Fountoulakis
2020-03-31  Switching over to soft feature flag (#8545)  Andrew Moss
Preparing for the cloud release. This changes how we handle the feature flag so that it no longer requires installer switches and can be set from the config file. This still requires internal access to use and is not ready for public access yet.
2020-03-31  Improve the behavior of claiming (#8516)  Andrew Moss
The default cloud url has been updated to app.netdata.cloud ready for the release. The claiming process now checks the current user executing claiming and refuses to perform the claim for the wrong user. If the current UID is 0 then claiming proceeds but the file ownership is adjusted to be the correct netdata user. The default expected user is `netdata` unless the script can identify the user from the current configuration. After the claiming script is executed the CLI is used to reload the claiming state.
2020-03-19  health: add dns_query module alarm (#8434)  Ilya Mashchenko
2020-03-11  new version of godplugin and pulsar alarms, dashboard info (#8364)  Ilya Mashchenko
bump godplugin to v0.17.0 and add pulsar alarms, dashboard_info
2020-03-10  Bulk add frontmatter to all documentation (#8354)  Joel Hans
* Bulk add frontmatter * A few extra edge cases
2020-03-02  vernemq alarms, dashboard info and godplugin new version (#8236)  Ilya Mashchenko
* web/gui: add vernemq to the dashboard_info.js * health: add vernemq alarms and update Makefile.am * health: vernemq alarms info fix * health: vernemq alarms info fix * health: fix vernemq_socket_errors template * packaging: bump godplugin version to v0.16.0 * packaging:update godplugin checksums * docs: add vernemq to the COLLECTORS.md
2020-02-24  Merging the feature branch for the ACLK in the previous sprint. (#8179)  Andrew Moss
* ACLK connection and protocol improvements (#8139) * Adding ACLK retry on connection failure (#8147) * Fixed reconnect issues on the ACLK. (#8163) * Cleaning up ACLK - part 1 (#8167) Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2020-02-20  Tutorials to support v1.20 release (#7943)  Joel Hans
* Add draft of CockroachDB tutorial * Fixed and new images * Support figures for images * Change border color * Change job * Initialize eBPF tutorial * Very very rough draft of host labels tutorial * Add a few mentions of tutorial * Fix for Thiago * Simplify health entities * Fixes for Thiago * Fixes and add tutorials to collectors README * Fixes to cockroachBD * Remove ebpf tutorial * remove link * Updates for Patti and Thiago * Add streaming security note * Straightaway
2020-02-11  fix_exclusive_notification: (#7769)  thiagoftsm
This commit removes the returns that were creating the bug
2020-02-08  health_doc_name: Clarify the rules to create an alarm name (#7911)  thiagoftsm
* health_doc_name: Clarify the rules to create an alarm name * health_doc_name: Fixes typo and gramma * health_doc_name: Fixes typo
2020-02-07  alarms_values: New endpoint (#7836)  thiagoftsm
* alarms_values: New endpoint This commit brings the new endpoint to Netdata * alarms_values: Documentation This commit brings the missing documentation for the PR * alarms_values: New function This commit brings a new code that removes dupplication * alarms_values: Fix typo * alarms_values: Fix missing word This commit fixes the missing word inside the documentation * alarms_values: Fix order This commit fixes the order of the alarm answer * alarms_values: Fixes typo and remmove unecessary variable * alarms_values: Fixes doc Describe all paramenters present in the endpoint * alarms_values: Same options This commit brings the same input pattern for alams and alams_values * alarms_values: Update swagger This commit brings the missing information to swagger json * alarms_values: Update swagger This commit brings the missing information to swagger yaml
2020-02-06  Drop dirty dbengine pages if disk cannot keep up (#7777)  Markos Fountoulakis
* Introduce dirty page pressure handling in the dbengine page cache that invalidates pages when the disk cannot keep up with the flushing speed.
2020-02-06  ACLK agent 1 (#7894)  Stelios Fragkakis
* - Add initial mqtt support * [WIP] Agent cloud link - Setup main mqtt thread to connect to a broker using V5 of the MQTT protocol (TBD) - Send alarms to "netdata/alarm" - Add error checks to handle connection failures - Add params for Broker, port Maximum concurrent sent / recev messages - Dummy function to check claiming status - Generic mqtt_send command to publish message to a base topic , sub topic It will end up in the form base_topic/sub_topic - Add host/port in the connection failure error message * Test libmosquitto libs * connect to broker locally (assume localhost:1883) * subscribe to channel netdata/command * Test try a reload command to trigger health reload * publish alerts to netdata/alarm * - Fix compile issues * - Use sleep_usec instead of usleep * - Delay reconnection on failure due to misconfiguration (high cpu usage) * - Remove the TLS connection config * - Fix NETDATA_MQTT_INITIALIZATION_SLEEP_WAIT to use seconds * - Gather ACLK related code under aclk folder - Add aclk_ functions for abstract layer - Moved low level libs intergration in mqtt.c * - Add README.md file with initial comment * - Clean MQTT v5 * - Code cleanup * - Remove alarm log for now - Remove the heart beat * - Remove message properties for V5 * - Remove message properties for V5 (header) * Fixed the netdata target to use a local static version of libmosquitto. The installer does not yet have steps to pull and build the local library. cd project_root git clone ssh://git@github.com/netdata/mosquitto mosquitto/ (cd mosquitto/lib && make) # Ignore the cpp error This will leave mosquitto/lib/libmosquitto.a for the build process to use. * - Fix compile issues with older < 1.6 libmosquitto lib * - Enable alarm events to check it works - Re arrange includes - Rework topic to be agent/guid/. 
Actual id will be returned by the is_agent_claimed * - Add initial metadata info - Added helper function in web_api - Added a debug command (info) * Update the claiming state to retrieve the claimed id. * - Use define for constants like command and metadata topics - Function to wait for initialization of the ACLK link - New aclk_subscribe command with QOS parameter for the mqtt subscription - Use the is_agent_claimed function to get the real claim id and use it to build the topics that will be used for the cloud communication - Change in netdata-claim.sh.in to write the claim id without a trailing \n * - Use define for constants like command and metadata topics - Function to wait for initialization of the ACLK link - New aclk_subscribe command with QOS parameter for the mqtt subscription - Use the is_agent_claimed function to get the real claim id and use it to build the topics that will be used for the cloud communication - Change in netdata-claim.sh.in to write the claim id without a trailing \n * - Remove the alarm log for now - Add code (but disabled) to send charts * - Use dummy anon, anon as username and password for testing purposes * - Use client id anon as well * Testing without TLS * Switching TLS back on to fix docker environment. 
* - Added query processing An incoming URL now calls web_client_api_request_v1_data to handle a request and push the results back to the "data" topic - Move the above processing from the message callback to the query handle loop - Added helper "pause" , "resume" commands to stop and resume query processing to stress test loading the queue with queries before executing them - Changed the endpoint topics to "meta", and "cmd" (previously metadata and command) * make info message follow protocol * move metadata msg generation into new func * move metadata msg generation into new func * - Add metadata to the responses - Add hook to queue chart changes on creation and dimensions - Changed the queue mechanism to include delay for X seconds - Add delayed submittion of charts to the cloud so that all DIMs are defined to avoid resubmission * - Add additional data info for aclk_queue command * - Use web_clinet_api_request_v1 to handle the incoming request This will handle all requests coming from the cloud * - Cleanup and aclk_query structure - Add msg_id parameter - Enable the incoming JSON request - Enable the outgoing JSON response * - Added new thread to handle query processing - Add lock and cond wait to wakeup thread when queries are submitted - Cleanup on the main init function * - Add wait time on agent init, to allow for chart, alarms and other definitions to be completed. 
- During the wait time, no queries will be queued * - Send metadata on query thread init - New generic create header function for the JSON response - Pack info and charts into one message - Modified chart to remove entries (test) - Modified charts mod to remove entries e.g alarms and volatile info - Change input to aclk_update_chart (RRDHOST / instead of hostname) * - When a request fails, add to the payload - We may need to handle in a different key - Error check in json parsing * - Add dummy aclk_update_alarm command * - Move incoming request JSON parsing code away from mqtt.c - Added #ifdef ACLK_ENABLE so that we can have code merged but disabled by default - Added version in incoming and outgoing JSON dict * - Disable code if ACLK_ENABLE is not defined - Remove references to the mqtt (mosquitto) lib - Add dummy stubs in mqtt.c for completeness if ACLK_ENABLE is not defined * - Disable challenge sample code for now * - Remove libmosquitto from makefile * - Fix spaces in Makefile.am - Remove ifdef to avoid warning from LGTM * - Remove for now the code that builds an along log test message to send to the cloud * - Add check for ACLK_ENABLE definition and avoid calling the chart update functions * - Remove commented code * - Move source files to the correct place (ACLK_PLUGIN_FILES) * - Remove include file thats not needed * - Remove include file thats not needed - Add improved checks for load_claiming_state() * - Fix error message. 
Used error() that also logs errno and message * - Fix some codacy issues * - Fix more codacy issues, code cleanup * - Revert code to address codacy warnings * - Revert spaces added in a previous commit by mistake * clean up if/else nest * print error if fopen fails * minor - error already logs errno * - Fix version formatting * - Cleanup all ACLK related compiler warnings - Re-arrange include files - Removed unused defines * - More compilation warnings fixed - Bug with thread creation fixed * - Add condition to skip compilation of the ACLK code entirely. Add env variable ACLK="yes" to enable * - Add condition to skip the libmosquitto * - Change feature flag from ACLK_ENABLE to ENABLE_ACLK in accordance with the rest of ENABLE_xx flags - Typo in info message fix Co-authored-by: Andrew Moss <1043609+amoss@users.noreply.github.com> Co-authored-by: Timo <6674623+underhood@users.noreply.github.com>
2020-02-01  Parse host tags (#7702)  Vladimir Kobal
* Fix memory leaks * Check for configuration options * Parse simple tags * Parse JSON tags * Remove an unnecessary check * Parse a JSON object * Parse a JSON array * Update the documentation * Fix host locks
2020-01-31  installer: include go.d.plugin version v0.15.0 (#7882)  Ilya Mashchenko
* /web/giu/dashboard_info.js: add cockroachdb info * /web/giu/dashboard_info.js: lgtm fix * /health/health.d/: add cockroachdb.conf
2020-01-30  Clarify editing health config files in health quickstart (#7883)  Joel Hans
* Add fixes to health quickstart * Add notice about EDITOR and fix link
2020-01-28  Missing extern (#7877)  thiagoftsm
* missing_extern: Fix missing Fix few externs that were missing in global variables * missing_extern: Variables This commit declares the variables inside .c files
2020-01-25  Remove all refernces to .keep files (#7829)  James Mills
2020-01-17  Alarm Log labels (#7548)  thiagoftsm
* alarm_log_with_labels: Alarm Log Rebase of alarm log to commit against master * alarm_log_with_labels: Remove lock This commit removes unecessary locks from health_log * alarm_log_with_labels: Restore and Rebase Remove previous changes and rebase the PR * alarm_log_with_labels: Unique line This commit brings an unique line to alarm log * alarm_log_with_labels: Correct separator This log file uses tabulation instead comma * alarm_log_with_labels: Fix memory leak There was a missing call for buffer_free
2020-01-15  Update stop-notifications-alarms.md (#7737)  Yashar Nesabian
fix a repetitive word
2020-01-06  Clean up host labels in API responses (#7616)  Vladimir Kobal
* Remove host labels from the Swagger specification * Remove host labels from the api responses
2020-01-02  Adjust alarm labels (#7600)  thiagoftsm
* adjust_alarm_labels: variable rename This commit renames the variables inside health * adjust_alarm_labels: Doc Changes documentation for the labels * adjust_alarm_labels: Fix typo this commit brings the fix for the documentation * adjust_alarm_labels: Table align Fix table align on documentation * adjust_alarm_labels: Table align Fix link * adjust_alarm_labels: Link * adjust_alarm_labels: Link * adjust_alarm_labels: Remove contradiction The previous documentation had a contradiction removed with this commit * adjust_alarm_labels: Missing conversion This commit brings the latest change to text