Age | Commit message | Author |
|
* include localhost hostname in edit_command
* since the edit_command now contains the localhost name, don't pass it again to the script
|
|
|
|
The order of the fields makes the bitfields irrelevant in this case.
|
|
|
|
|
|
|
|
|
|
Co-authored-by: Tina Luedtke <kickoke@users.noreply.github.com>
Co-authored-by: Josh Soref <jsoref@users.noreply.github.com>
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
|
|
* Send ML feature information with UpdateNodeInfo.
We achieve this by adding the `ml_{capable,enabled}` fields in
`system_info`. When streaming, these fields allow a parent to understand if
the child has ML and if it runs ML for itself.
The UpdateNodeInfo includes this information about a child, plus a
boolean that is set to true when the parent runs ML for the child.
* Fix unit test and building with --disable-ml.
* Refactoring to use the new MachineLearningInfo message
* Update aclk-schemas repository to include latest ML info message.
|
|
|
|
|
|
* Set a flag to do aclk sync thread shutdown
Attempt to dequeue a cmd in case the queue is full and someone is blocked
* Drop tables and recreate instead of deleting
* Add commands to check the database -W check-database, fix-database, compact-database
* Split the database setup to config and cleanup part
* Add checks during database setup and cleanup to detect corruption to the dimension and chart tables
* Add full database check and refactor code
* Change commands to better indicate that the operations refer to the sqlite metadata database (not the metrics dbengine database)
* Add check for table being null (request for entire database check)
* Rename command for better clarity
|
|
(#11827)
|
|
|
|
|
|
* Fix compilation warnings (variables used when debugging is enabled using NETDATA_INTERNAL_CHECKS)
* Fix compilation warning (casting)
|
|
* Add check for NULL wc->host
* Use sqlite3_exec, if it fails it will be retried on the next health log entries rotation
|
|
* always queue to aclk_alert
* proper function name
|
|
* add some logging for ng arch to access.log
* change arrows to IN, OG, AC
* log also the params for aclk requests
* check for wc->host before using wc->host->hostname
* turn two messages to info
* reduce alert event logs
* used thread local variables
|
|
* Use correct hop count if host in memory
* Add locking to be safe when using host lookup
* Update the live state correctly
|
|
|
|
Co-authored-by: ilyam8 <ilya@netdata.cloud>
|
|
|
|
|
|
* Do not send hidden dimensions to the cloud
* Cleanup dimension_delete table from stale entries
|
|
* Fix hop count
* Remove the warning message
|
|
* Enhance the dimension delete table and adjust the trigger to include chart_id and host_id
* Add the aclk_process_dimension_deletion function
* Change variable chart_name in aclk_upd_dimension_event (it is st->id from st.type dot st.id)
* Process dimension deletion when retention updates are sent
* Do not send charts if we don't have dimensions
* Add check for uuid_parse return code
|
|
|
|
* Move retention code to the charts
* Log information about node registration and updates
* Prevent deadlock if aclk_database_enq_cmd locks for a node
* Improve message (indicate that it comes from alerts). This will be improved in a followup PR
* Disable parts that can't be used if the new cloud env is not available
* Set dimension FLAG if message has been queued
* Queue messages using the correct protocol enabled
* Cleanup unused functions
Rename functions that queue charts and dimensions
Improve the generic chart payload add function
Add a counter for pending charts/dimension payloads to avoid polling the db
Delay the retention update message until we are done with the updates
Fix full resync command to handle sequence_id = 0 correctly
Disable functions not needed when the new cloud env functionality is not compiled
* Add chart_payload count and retry count
Output information or error message if we fail to queue chart/dimension PUSH commands
Only try to queue commands if we have chart_payload_count>0
Remove the event loop shutdown opcode handle
* Improve detection of shutdown (check netdata_exit)
* Adjusting info messages
|
|
|
|
* Add support for feature extraction and K-Means clustering.
This patch adds support for performing feature extraction and running the
K-Means clustering algorithm on the extracted features.
We use the open-source dlib library to compute the K-Means clustering
centers, which has been added as a new git submodule.
The build system has been updated to recognize two new options:
1) --enable-ml: build an agent with ml functionality, and
2) --enable-ml-tests: support running tests with the `-W mltest`
option in netdata.
The second flag is meant only for internal use. To build tests successfully,
you need to install the GoogleTest framework on your machine.
* Boilerplate code to track hosts/dims and init ML config options.
A new opaque pointer field is added to the database's host and dimension
data structures. The fields point to C++ wrapper classes that will be used
to store ML-related information in follow-up patches.
The ML functionality needs to iterate all tracked dimensions twice per
second. To avoid locking the entire DB multiple times, we use a
separate dictionary to add/remove dimensions as they are created/deleted
by the database.
A global configuration object is initialized during the startup of the
agent. It will allow our users to specify ML-related configuration
options, eg. hosts/charts to skip from training, etc.
* Add support for training and prediction of dimensions.
Every new host spawns a training thread which is used to train the model
of each dimension.
Training of dimensions is done in a non-batching mode in order to avoid
impacting the generated ML model by the CPU, RAM and disk utilization of
the training code itself.
For performance reasons, prediction is done at the time a new value
is pushed in the database. The alternative option, ie. maintaining a
separate thread for prediction, would be ~3-4x slower and would
increase locking contention considerably.
For similar reasons, we use a custom function to unpack storage_numbers
into doubles, instead of long doubles.
* Add data structures required by the anomaly detector.
This patch adds two data structures that will be used by the anomaly
detector in follow-up patches.
The first data structure is a circular bit buffer which is being used to
count the number of set bits over time.
The second data structure represents an expandable, rolling window that
tracks set/unset bits. It is explicitly modeled as a finite-state
machine in order to make the anomaly detector's behaviour easier to test
and reason about.
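As an illustration of the first data structure, here is a minimal sketch of a circular bit buffer that keeps a running count of its set bits. The type and function names (`bit_buffer_t`, `bit_buffer_insert`, and so on) and the fixed capacity are assumptions made for this example and are not taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BIT_BUFFER_CAPACITY 8 /* hypothetical window length, in samples */

/* Hypothetical fixed-size circular buffer of bits with a running count. */
typedef struct {
    bool bits[BIT_BUFFER_CAPACITY];
    size_t next;     /* slot that the next insert will overwrite */
    size_t size;     /* number of valid bits currently stored */
    size_t set_bits; /* running count of stored bits that are set */
} bit_buffer_t;

void bit_buffer_init(bit_buffer_t *bb) {
    assert(bb != NULL);
    bb->next = 0;
    bb->size = 0;
    bb->set_bits = 0;
}

/* Insert a bit, evicting the oldest one once the buffer is full. */
void bit_buffer_insert(bit_buffer_t *bb, bool bit) {
    assert(bb != NULL);
    if (bb->size == BIT_BUFFER_CAPACITY) {
        /* The slot at `next` holds the oldest bit; stop counting it. */
        if (bb->bits[bb->next])
            bb->set_bits--;
    } else {
        bb->size++;
    }
    bb->bits[bb->next] = bit;
    if (bit)
        bb->set_bits++;
    bb->next = (bb->next + 1) % BIT_BUFFER_CAPACITY;
}

/* Number of set bits among the stored bits, in constant time. */
size_t bit_buffer_count(const bit_buffer_t *bb) {
    assert(bb != NULL);
    return bb->set_bits;
}
```

Keeping the count up to date on every insert means the number of set bits in the window can be read without rescanning the buffer.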
* Add anomaly detection thread.
This patch creates a new anomaly detection thread per host. Each thread
maintains a BitRateWindow which is updated every second based on the
anomaly status of the corresponding host.
Based on the updated status of the anomaly window, we can identify the
existence/absence of an anomaly event, its start/end time and the
dimensions that participate in it.
* Create/insert/query anomaly events from Sqlite DB.
* Create anomaly event endpoints.
This patch adds two endpoints to expose information about anomaly
events. The first endpoint returns the list of anomalous events within a
specified time range. The second endpoint provides detailed information
about a single anomaly event, ie. the list of anomalous dimensions in
that event along with their anomaly rate.
The `anomaly-bit` option has been added to the `/data` endpoint in order
to allow users to get the anomaly status of individual dimensions per
second.
* Fix build failures on Ubuntu 16.04 & CentOS 7.
These distros do not have toolchains with C++11 enabled by default.
Replacing nullptr with NULL should fix the build problems on these
platforms when the ML feature is not enabled.
* Fix `make dist` to include ML makefiles and dlib sources.
Currently, we add ml/kmeans/dlib to EXTRA_DIST. We might want to
generate an explicit list of source files in the future, in order to
bring down the generated archive's file size.
* Small changes to make the LGTM & Codacy bots happy.
- Cast unused result of function calls to void.
- Pass a const-ref string to Database's constructor.
- Reduce the scope of a local variable in the anomaly detector.
* Add user configuration option to enable/disable anomaly detection.
* Do not log dimension-specific operations.
Training and prediction operations happen every second for each
dimension. To make it easier to run anomaly detection with this PR on
many charts & dimensions, I've removed the logs that would cause log
flooding.
* Reset dimensions' bit counter when not above anomaly rate threshold.
* Update the default config options with real values.
With this patch the default configuration options will match the ones
we want our users to use by default.
* Update conditions for creating new ML dimensions.
1. Skip dimensions with update_every != 1,
2. Skip dimensions that come from the ML charts.
With this filtering in place, any configuration value for the
relevant simple_pattern expressions will work correctly.
* Teach buildinfo{,json} about the ML feature.
* Set --enable-ml by default in the configuration options.
This patch is only meant for testing the building of the ML functionality
on Github. It will be reverted once tests pass successfully.
* Minor build system fixes.
- Add path to json header
- Enable C++ linker when ML functionality is enabled
- Rename ml/ml-dummy.cc to ml/ml-dummy.c
* Revert "Set --enable-ml by default in the configuration options."
This reverts commit 28206952a59a577675c86194f2590ec63b60506c.
We pass all Github checks when building the ML functionality, except for
those that run on CentOS 7 due to not having a C++11 toolchain.
* Check for missing dlib and nlohmann files.
We simply check the single-source files upon which our build system
depends. If they are missing, an error message notifies the user
about missing git submodules which are required for the ML
functionality.
* Allow users to specify the maximum number of KMeans iterations.
* Use dlib v19.10
v19.22 broke compatibility with CentOS 7's g++. Development of the
anomaly detection used v19.10, which is the version used by most Debian and
Ubuntu distribution versions that are not past EOL.
No observable performance improvements/regressions specific to the K-Means
algorithm occur between the two versions.
* Detect and use the -std=c++11 flag when building anomaly detection.
This patch automatically adds the -std=c++11 when building netdata
with the ML functionality, if it's supported by the user's toolchain.
With this change we are able to build the agent correctly on CentOS 7.
* Restructure configuration options.
- update default values,
- clamp values to min/max defaults,
- validate and identify conflicting values.
* Add update_every configuration option.
Considering that the MVP does not support per host configuration
options, the update_every option will be used to filter hosts to train.
With this change anomaly detection will be supported on:
- Single nodes with update_every != 1, and
- Children nodes with a common update_every value that might differ from
the value of the parent node.
* Reorganize anomaly detection charts.
This follows Andrew's suggestion to have four charts to show the number
of anomalous/normal dimensions, the anomaly rate, the detector's window
length, and the events that occur in the prediction step.
Context and family values, along with the necessary information in the
dashboard_info.js file, will be updated in a follow-up commit.
* Do not dump anomaly event info in logs.
* Automatically handle low "train every secs" configuration values.
If a user specifies a very low value for the "train every secs", then
it is possible that the time it takes to train a dimension is higher
than its allotted time.
In that case, we want the training thread to:
- Reduce its CPU usage per second, and
- Allow the prediction thread to proceed.
We achieve this by limiting the training time of a single dimension to
be equal to half the time allotted to it. This means that the training
thread will never consume more than 50% of a single core.
* Automatically detect if ML functionality should be enabled.
With these changes, we enable ML if:
- The user has not explicitly specified --disable-ml, and
- Git submodules have been checked out properly, and
- The toolchain supports C++11.
If the user has explicitly specified --enable-ml, the build fails if
git submodules are missing, or the toolchain does not support C++11.
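The decision rules above can be restated as a small function. This is only a sketch in C of the logic described in this message; the real checks live in the build system, and the enum and function names used here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* What the user asked for on the configure command line (hypothetical names). */
typedef enum { ML_AUTO, ML_FORCE_ENABLE, ML_FORCE_DISABLE } ml_request_t;

/* Outcome of the detection step. */
typedef enum { ML_BUILD_OFF, ML_BUILD_ON, ML_BUILD_ERROR } ml_decision_t;

ml_decision_t decide_ml(ml_request_t request,
                        bool submodules_present,
                        bool cxx11_supported) {
    bool prerequisites_met = submodules_present && cxx11_supported;

    switch (request) {
    case ML_FORCE_DISABLE:
        /* --disable-ml always wins. */
        return ML_BUILD_OFF;
    case ML_FORCE_ENABLE:
        /* --enable-ml fails the build when a prerequisite is missing. */
        return prerequisites_met ? ML_BUILD_ON : ML_BUILD_ERROR;
    case ML_AUTO:
    default:
        /* No explicit flag: enable ML only when everything is available. */
        return prerequisites_met ? ML_BUILD_ON : ML_BUILD_OFF;
    }
}
```

The only difference between the automatic and the explicit `--enable-ml` paths is what happens when a prerequisite is missing: the former silently builds without ML, the latter stops with an error.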
* Disable anomaly detection by default.
* Do not update charts in locked region.
* Cleanup code reading configuration options.
* Enable C++ linker when building ML.
* Disable ML functionality for CMake builds.
* Skip LGTM for dlib and nlohmann libraries.
* Do not build ML if libuuid is missing.
* Fix dlib path in LGTM's yaml config file.
* Add chart to track duration of prediction step.
* Add chart to track duration of training step.
* Limit the number of dimensions in an anomaly event.
This will ensure our JSON results won't grow without any limit. The
default ML configuration options train approximately 1700 dimensions
in a newly-installed Netdata agent. The hard-limit is set to 2000
dimensions which:
- Is well above the default number of dimensions we train,
- If it is ever reached it means that the user had accidentally set a
very low anomaly rate threshold, and
- Considering that we sort the result by anomaly score, the cutoff
dimensions will be the less anomalous, ie. the least important to
investigate.
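A minimal sketch of the cutoff described above: sort the dimensions by anomaly score in descending order and keep at most `limit` of them. The `dim_score_t` type and the function names are assumptions for this example, not the patch's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record: one dimension and its score within an anomaly event. */
typedef struct {
    const char *name;
    double anomaly_score;
} dim_score_t;

/* qsort comparator: higher scores first. */
static int by_score_desc(const void *a, const void *b) {
    const dim_score_t *x = a;
    const dim_score_t *y = b;
    if (x->anomaly_score < y->anomaly_score) return 1;
    if (x->anomaly_score > y->anomaly_score) return -1;
    return 0;
}

/* Sorts `dims` in place and returns how many leading entries to report. */
size_t top_anomalous(dim_score_t *dims, size_t n, size_t limit) {
    assert(dims != NULL || n == 0);
    if (n > 1)
        qsort(dims, n, sizeof(*dims), by_score_desc);
    return n < limit ? n : limit;
}
```

Because the sort runs before the cutoff, whatever falls past the limit is always the least anomalous part of the result.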
* Add information about the ML charts.
* Update family value in ML charts.
This fix will allow us to show the individual charts in the RHS Anomaly
Detection submenu.
* Rename chart type
s/anomalydetection/anomaly_detection/g
* Expose ML feat in /info endpoint.
* Export ML config through /info endpoint.
* Fix CentOS 7 build.
* Reduce the critical region of a host's lock.
Before this change, each host had a single, dedicated lock to protect
its map of dimensions from adding/deleting new dimensions while training
and detecting anomalies. This was problematic because training of a
single dimension can take several seconds in nodes that are under heavy
load.
After this change, the host's lock protects only the insertion/deletion
of new dimensions, and the prediction step. For the training of dimensions
we use a dedicated lock per dimension, which is responsible for protecting
the dimension from deletion while training.
Prediction is fast enough, even on slow machines or under heavy load,
which allows us to use the host's main lock and avoid increasing the
complexity of our implementation in the anomaly detector.
* Improve the way we are tracking anomaly detector's performance.
This change allows us to:
- track the total training time per update_every period,
- track the maximum training time of a single dimension per
update_every period, and
- export the current number of total, anomalous, normal dimensions
to the /info endpoint.
Also, now that we use dedicated locks per dimensions, we can train under
heavy load continuously without having to sleep in order to yield the
training thread and allow the prediction thread to progress.
* Use samples instead of seconds in ML configuration.
This commit changes the way we are handling input ML configuration
options from the user. Instead of treating values as seconds, we
interpret all inputs as number of update_every periods. This allows
us to enable anomaly detection on hosts that have update_every != 1
second, and still produce a model for training/prediction & detection
that behaves in an expected way.
Tested by running anomaly detection on an agent with update_every = [1,
2, 4] seconds.
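To make the unit change concrete, the sketch below converts a configuration value expressed in samples into wall-clock seconds for a given host. The function name and the example value of 3600 samples are assumptions chosen for illustration.

```c
#include <assert.h>

/* Wall-clock duration covered by `samples` collection periods on a host
 * that collects one sample every `update_every` seconds. */
long samples_to_seconds(long samples, long update_every) {
    assert(samples >= 0);
    assert(update_every > 0);
    return samples * update_every;
}
```

With this interpretation, a hypothetical setting of 3600 samples covers 3600 seconds on a host with update_every = 1 and 14400 seconds on a host with update_every = 4, while the amount of data the model sees stays the same.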
* Remove unnecessary log message in detection thread
* Move ML configuration to global section.
* Update web/gui/dashboard_info.js
Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
* Fix typo
Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
* Rebase.
* Use negative logic for anomaly bit.
* Add info for prediction_stats and training_stats charts.
* Disable ML on PPC64EL.
The CI test fails with -std=c++11 and requires -std=gnu++11 instead.
However, it's not easy to quickly append the required flag to CXXFLAGS.
For the time being, simply disable ML on PPC64EL and if any users
require this functionality we can fix it in the future.
* Add comment on why we disable ML on PPC64EL.
Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
|
|
* rebased
* add error message
* make function void
* fix return
|
|
* Replace all usages of SN_EXISTS with SN_DEFAULT_FLAGS.
* Remove references to SN_NOT_EXISTS in comments.
* Replace raw zero constant with SN_EMPTY_SLOT.
* Use get_storage_number_flags only in storage_number.{c,h}
* Compare against SN_EMPTY_SLOT to check if a storage_number exists.
This is safe because:
1. rrdset_done_interpolate() is the only place where we call store_metric(),
2. All store_metric() calls, except for one, store an SN_EMPTY_SLOT value.
3. When we are not storing an SN_EMPTY_SLOT value, the flags that we pass to
pack_storage_number() can be either SN_EXISTS *or* SN_EXISTS_RESET.
* Compare only the SN_EXISTS_RESET bit to find reset values.
* Remove get_storage_number_flags from storage_number.h
* Do not set storage_number flags outside of rrdset_done_interpolate().
This is an NFC intended to limit the scope of storage_number flags
processing to just one function.
* Set reset bit without overwriting the rest of the flags.
* Rename SN_EXISTS to SN_ANOMALY_BIT.
* Use GOTOs in pack_storage_number to return from a single place.
* Teach pack_storage_number how to handle anomalous zero values.
Up until now, a storage_number had always either the SN_EXISTS or
SN_EXISTS_RESET bit set. This meant that it was not possible for any
packed storage_number to compare equal to the SN_EMPTY_SLOT.
However, the SN_ANOMALY_BIT can be set to zero. This is fine for every
value other than the anomalous 0 value, because it would compare equal to
SN_EMPTY_SLOT. We address this issue by mapping the anomalous zero value
to SN_EXISTS_100 (a number which was not possible to generate with the
previous versions of the agent, ie. it won't exist in older dbengine files).
This change was tested manually by intentionally flipping the anomaly
bit for odd/even iterations in rrdset_done_interpolate. Prior to this
change, charts whose dimensions had 0 values were showing up in the
dashboard as gaps (SN_EMPTY_SLOT), whereas with this commit the values
are displayed correctly.
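The collision and its fix can be shown with a toy encoding. The sketch below is not the real pack_storage_number(): it stores a small unsigned magnitude in the low bits, and the `toy_` names and the bit positions are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t toy_storage_number;

/* Assumed values, for illustration only. */
#define TOY_SN_EMPTY_SLOT  ((toy_storage_number)0)
#define TOY_SN_ANOMALY_BIT ((toy_storage_number)1 << 24) /* cleared = anomalous */
#define TOY_SN_EXISTS_100  ((toy_storage_number)1 << 26)

/* Pack a magnitude (below 2^24) together with its anomaly status. */
toy_storage_number toy_pack(uint32_t magnitude, bool anomalous) {
    assert(magnitude < (1u << 24));

    toy_storage_number sn = magnitude;
    if (!anomalous)
        sn |= TOY_SN_ANOMALY_BIT;

    /* An anomalous zero has no bits set and would be indistinguishable
     * from an empty slot, so map it to a reserved, otherwise unused value. */
    if (sn == TOY_SN_EMPTY_SLOT)
        sn = TOY_SN_EXISTS_100;

    return sn;
}

/* A slot is empty only when it compares equal to the empty marker. */
bool toy_is_empty(toy_storage_number sn) {
    return sn == TOY_SN_EMPTY_SLOT;
}
```

In this toy version, only the anomalous zero needs special handling: every other packed value has at least one bit set, either in the magnitude or in the anomaly bit.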
|
|
* mark host as UNUSED
* use snprintfz instead of snprintf. removes warning: %s directive output between 0 and 4096 bytes may exceed minimum required size of 4095
* increase length to 22 to include full int length. stops warning %d directive output may be truncated writing between 1 and 11 bytes into a region of size 5
* increase buffers to stop warning %0.1f directive output may be truncated writing between 3 and 312 bytes into a region of size 100
* use sprintfz
|
|
* fix 2 coverity errors
* remove call to sql_queue_removed_alerts_to_aclk from health
|
|
(#11606)
* Add flag to mark containers as created from official images in analytics.
* Fix CI.
* process NETDATA_CONTAINER_IS_OFFICIAL_IMAGE variable from system info and export to anonymous-statistics script
Co-authored-by: Emmanuel Vasilakis <mrzammler@mm.st>
|
|
|
|
* add ifdefs to protect regions of alerts code
* remove check
|
|
|
|
|
|
|
|
ACLK-NG supports both new and old cloud protocol. Protobuf and C++ compiler are required only for new cloud protocol.
There is no reason to skip building whole ACLK-NG when protobuf is missing.
|
|
* add alert messages
* also clear date_cloud_ack
* move buffer_create
* remove include file
* use wc->node_id
|
|
* Fix memory leak CID_373251
* Check return value CID_373248
* Check return code CID_373249
* Check return code CID_373250
* Initialize cmd CID_373249
|
|
|
|
* node info function
* Code cleanup
* Remove unnecessary strdupz / freez functions
* Fix compilation error if ACLK_NG is not available
|
|
* Rebased
* use sql health log if it exists
* store alert config in sqlite
* move unlock before loop
* fix warnings
* remove hash message
* check return from counting health log
* remove check of hostname when reading log
* try to create the health log table to catch accidental removals of it
* fix warnings, cast values, report config_hash_id
* use snprintfz, add info logging
* remove unnecessary strdup and free
* check if stored config hash is null
* return if prepare statement fails
* replace with static variables
* remove replace info, free edit_command
* remove setting cfg entries to NULL
* change uuid_copy
* check return of uuid_parse, and exit if it's not valid
* also free cfg
* use address
* removed health_alarm_entry_sql2json and sql_health_alarm_log_select_all
* remove check for is_valid_alarm_id
* replace lengths with GUID_LEN
* use uuid_unparse_lower_fix
* removed web api endpoint to get alert config
* check for non null values for name, chart and family
* include a date_updated field in alert_hash
* for config hash, digest NULL string if value to digest is null
* Use empty string instead of null
|
|
* Make sure an element was found for removal
* Remove fatal if async send fails
Add newline
|
|
- SQLITE_ENABLE_UPDATE_DELETE_LIMIT 1
- SQLITE_OMIT_LOAD_EXTENSION 1
- SQLITE_ENABLE_DBSTAT_VTAB 1
- Fix compilation warnings
|
|
|