path: root/libnetdata
AgeCommit messageAuthor
2022-05-03Trace rwlocks of netdata (#12785)Costa Tsaousis
* with -DNETDATA_INTERNAL_CHECKS=1 enable rwlocks tracing
* fix strings alignment on terminal
* remove wrong addition
* removed formatting warning; now counting active locks per thread; tracing is enabled with -DNETDATA_TRACE_RWLOCKS=1
* added the missing netdata_mutex_destroy()
* optimized clocks usage in locks
* added also main
* fixed formatting warning
* add compiler warning when compiling with -DNETDATA_TRACE_RWLOCKS=1
* cleanup and documentation
* fix for old variable
* >= not just > to allow proper comparisons
* don't print 0x twice and print the lock pointer on every line
* trace locks deeper
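The commit above mentions counting active locks per thread and switching tracing on at compile time. A minimal sketch of that general idea follows. Every name in it (`SKETCH_TRACE_RWLOCKS`, `sketch_rdlock()`, `sketch_locks_held`) is hypothetical and is not taken from netdata's code; it only wraps the standard POSIX rwlock calls.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* feature-test macro for strict -std modes */
#endif
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical compile-time switch, standing in for a -D... build flag. */
#ifndef SKETCH_TRACE_RWLOCKS
#define SKETCH_TRACE_RWLOCKS 1
#endif

/* Number of rwlocks currently held by the calling thread. */
static _Thread_local int sketch_locks_held = 0;

/* Print one trace line with the lock pointer and the call site. */
static void sketch_trace(const char *op, void *lock, const char *file, int line) {
#if SKETCH_TRACE_RWLOCKS
    fprintf(stderr, "%-8s lock %p at %s:%d, held by this thread: %d\n",
            op, lock, file, line, sketch_locks_held);
#else
    (void)op; (void)lock; (void)file; (void)line;
#endif
}

static int sketch_rdlock(pthread_rwlock_t *l, const char *file, int line) {
    int rc = pthread_rwlock_rdlock(l);
    if (rc == 0) { sketch_locks_held++; sketch_trace("RDLOCK", l, file, line); }
    return rc;
}

static int sketch_wrlock(pthread_rwlock_t *l, const char *file, int line) {
    int rc = pthread_rwlock_wrlock(l);
    if (rc == 0) { sketch_locks_held++; sketch_trace("WRLOCK", l, file, line); }
    return rc;
}

static int sketch_unlock(pthread_rwlock_t *l, const char *file, int line) {
    assert(sketch_locks_held > 0); /* unlocking without holding is a bug */
    int rc = pthread_rwlock_unlock(l);
    if (rc == 0) { sketch_locks_held--; sketch_trace("UNLOCK", l, file, line); }
    return rc;
}

/* Call-site macros capture the file and line of the caller. */
#define SKETCH_RDLOCK(l) sketch_rdlock((l), __FILE__, __LINE__)
#define SKETCH_WRLOCK(l) sketch_wrlock((l), __FILE__, __LINE__)
#define SKETCH_UNLOCK(l) sketch_unlock((l), __FILE__, __LINE__)
```

Routing every lock call through a macro is what makes the call site available to the trace line; the per-thread counter needs no locking of its own because each thread only touches its own copy.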
2022-05-03One way allocator to double the speed of parallel context queries (#12787)Costa Tsaousis
* one way allocator to speed up context queries
* fixed a bug while expanding memory pages
* reworked for clarity and finally fixed the bug of allocating memory beyond the page size
* further optimize allocation step to minimize the number of allocations made
* implement strdup with memcpy instead of strcpy
* added documentation
* prevent an uninitialized use of owa
* added callocz() interface
* integrate onewayalloc everywhere, apart from sql queries
* one way allocator is now used in context queries using archived charts in sql
* align on the size of pointers
* forgotten freez()
* removed unneeded memcpys
* give unique names to global variables to avoid conflicts with system definitions
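A one-way allocator hands out memory from large pages and releases everything at once, so individual allocations are never freed. The sketch below illustrates that technique under stated assumptions: the `sketch_owa_*` names, the page layout, and the 4096-byte default are all hypothetical and are not netdata's onewayalloc interface. It does follow three details the commit lists: sizes are aligned to the size of a pointer, an oversized request gets a page big enough to hold it, and strdup is done with `memcpy`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical page header; the payload follows it directly. */
typedef struct sketch_page {
    struct sketch_page *next;
    size_t size; /* usable bytes in data[] */
    size_t used; /* bytes already handed out */
    unsigned char data[];
} sketch_page;

typedef struct {
    sketch_page *head; /* page currently used for allocations */
    size_t page_size;  /* default size of new pages */
} sketch_owa;

/* Round n up to a multiple of the pointer size. */
static size_t sketch_align(size_t n) {
    size_t a = sizeof(void *);
    return (n + a - 1) / a * a;
}

static sketch_page *sketch_page_new(size_t size) {
    sketch_page *p = malloc(sizeof(*p) + size);
    if (!p) abort();
    p->next = NULL;
    p->size = size;
    p->used = 0;
    return p;
}

static sketch_owa *sketch_owa_create(size_t page_size) {
    sketch_owa *o = malloc(sizeof(*o));
    if (!o) abort();
    o->page_size = sketch_align(page_size ? page_size : 4096);
    o->head = sketch_page_new(o->page_size);
    return o;
}

static void *sketch_owa_malloc(sketch_owa *o, size_t n) {
    assert(o && o->head);
    n = sketch_align(n ? n : 1);
    sketch_page *p = o->head;
    if (p->size - p->used < n) {
        /* Not enough room: chain a new page that is at least n bytes. */
        sketch_page *np = sketch_page_new(n > o->page_size ? n : o->page_size);
        np->next = p;
        o->head = np;
        p = np;
    }
    void *ret = p->data + p->used;
    p->used += n;
    return ret;
}

/* strdup via memcpy: the length is known, so copy it in one call. */
static char *sketch_owa_strdup(sketch_owa *o, const char *s) {
    size_t len = strlen(s) + 1;
    return memcpy(sketch_owa_malloc(o, len), s, len);
}

/* The only way to free: release every page at once. */
static void sketch_owa_destroy(sketch_owa *o) {
    sketch_page *p = o->head;
    while (p) {
        sketch_page *next = p->next;
        free(p);
        p = next;
    }
    free(o);
}
```

The trade-off is that memory is only reclaimed at `sketch_owa_destroy()`, which suits short-lived work such as a single query whose allocations all die together.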
2022-05-03Speed up BUFFER increases (minimize reallocs) (#12792)Costa Tsaousis
* speedup BUFFER increases by forward looking reallocs
* implemented buffer_vsprintf() and optimized buffer_sprintf() to minimize calls to vsnprintfz()
* optimize json generation for well known strings
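One common way to reduce reallocs in a growable text buffer is to grow it geometrically and to format into the space already available, retrying only when the output was truncated. The sketch below shows that approach as an assumption about what "forward looking reallocs" could look like; the `sketch_buffer` type, its functions, the doubling factor, and the 64-byte starting size are hypothetical and do not describe netdata's BUFFER.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *data;
    size_t len;      /* bytes used, excluding the terminating NUL */
    size_t size;     /* bytes allocated */
    size_t reallocs; /* how many times the buffer was grown */
} sketch_buffer;

/* Make room for `extra` more bytes plus a NUL, growing ahead of need. */
static void sketch_buffer_need(sketch_buffer *b, size_t extra) {
    size_t wanted = b->len + extra + 1;
    if (wanted <= b->size) return;
    size_t grown = b->size ? b->size * 2 : 64; /* assumed growth policy */
    if (grown < wanted) grown = wanted;
    char *p = realloc(b->data, grown);
    if (!p) abort();
    b->data = p;
    b->size = grown;
    b->reallocs++;
}

static void sketch_buffer_vsprintf(sketch_buffer *b, const char *fmt, va_list args) {
    assert(b && fmt);
    va_list copy;
    va_copy(copy, args); /* kept for a possible second pass */
    sketch_buffer_need(b, 1);
    size_t avail = b->size - b->len;
    int n = vsnprintf(b->data + b->len, avail, fmt, args);
    if (n < 0) {
        b->data[b->len] = '\0'; /* formatting error: leave buffer unchanged */
        va_end(copy);
        return;
    }
    if ((size_t)n >= avail) {
        /* Truncated: grow once to the exact need (or more) and retry. */
        sketch_buffer_need(b, (size_t)n);
        vsnprintf(b->data + b->len, b->size - b->len, fmt, copy);
    }
    va_end(copy);
    b->len += (size_t)n;
}

static void sketch_buffer_sprintf(sketch_buffer *b, const char *fmt, ...) {
    va_list args;
    va_start(args, fmt);
    sketch_buffer_vsprintf(b, fmt, args);
    va_end(args);
}
```

With doubling, appending a thousand short strings triggers only a handful of reallocs, and most appends cost a single `vsnprintf()` call because the first attempt already fits.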
2022-05-02procfile: more comfortable initial settings and faster/fewer reallocs (#12791)Costa Tsaousis
2022-05-02Don't use MADV_DONTDUMP on non-linux builds (#12795)vkalintiris
2022-04-28faster execution of external programs (#12759)Costa Tsaousis
* faster invocation of external plugins by eliminating the need for starting /bin/sh and then the command
* added missing parameter
* prefer the z function
* cleanup and clarity - addressed LGTM issue
* simplified the popen() interface a bit, to make it more predictable for future uses
* removed commented old code
* more comments cleanup
* mypopen_raw() added for completeness - it is not currently used
* simplified the mypopen_raw() interface even further
* Update libnetdata/popen/popen.c
  Co-authored-by: Vladimir Kobal <vlad@prokk.net>
* restored 0 flags for netdata_spawn() and cosmetic changes
* added more clarity to the code and reverted old behavior of all other execution of commands
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
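Starting a program directly, instead of through `/bin/sh -c`, saves one process creation per invocation. The sketch below shows the direct form with the standard `posix_spawnp()` call. It is an illustration only: `sketch_run_direct()` is a hypothetical helper, it leaves out the pipe a popen-style interface would attach to the child, and it is not the code in libnetdata/popen/popen.c.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* feature-test macro for strict -std modes */
#endif
#include <assert.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Run argv[0] with the given arguments, without an intermediate shell.
 * Returns the program's exit code, or -1 if it could not be started
 * or did not exit normally. */
static int sketch_run_direct(char *const argv[]) {
    assert(argv && argv[0]);
    pid_t pid;
    /* posix_spawnp() searches PATH for argv[0], like a shell would. */
    if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
        return -1;
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the arguments are passed as an array, no shell quoting or word splitting takes place; the cost is that shell features such as redirection and globbing are not available to the command line.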
2022-04-28feat(dbengine): make dbengine page cache undumpable and dedupable (#12765)Ilya Mashchenko
* make netdata more awesome
* reworked on-madvise and mmap to provide clarity
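On Linux, `madvise()` can exclude a mapping from core dumps with `MADV_DONTDUMP` and offer it to kernel same-page merging with `MADV_MERGEABLE`. The sketch below assumes these are the two hints behind "undumpable" and "dedupable"; the commit text does not name them, and `sketch_cache_alloc()` is a hypothetical helper. Guarding each hint with `#ifdef` keeps the code buildable on systems that lack it, which is the concern of the MADV_DONTDUMP fix listed above.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON /* older BSD/macOS spelling */
#endif

/* Map `size` bytes of anonymous memory and apply optional hints.
 * Returns NULL if the mapping itself fails. */
static void *sketch_cache_alloc(size_t size) {
    assert(size > 0);
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return NULL;

    /* The hints are advisory: a failure leaves the memory fully usable,
     * so the return values are deliberately ignored. */
#ifdef MADV_DONTDUMP
    (void)madvise(mem, size, MADV_DONTDUMP);
#endif
#ifdef MADV_MERGEABLE
    (void)madvise(mem, size, MADV_MERGEABLE);
#endif
    return mem;
}

static int sketch_cache_free(void *mem, size_t size) {
    return munmap(mem, size);
}
```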
2022-04-26fix implicit declaration of function 'appconfig_section_option_destroy_non_loaded' (#12756)Ilya Mashchenko
2022-04-25fix(cgroups.plugin): remove "enable cgroup X" config option on cgroup deletion (#12746)Ilya Mashchenko
2022-03-29Socket connections (eBPF) and bug fix (#12532)thiagoftsm
2022-03-24timex: this plugin enables timex plugin for non-linux systems (#12489)Suraj Neupane
* timex: this plugin enables timex plugin for non-linux system
* refactoring and fixing PR comments
* move OS specific macros to libnetdata
* Update README.md
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
Co-authored-by: Tina Luedtke <kickoke@users.noreply.github.com>
2022-03-15Remove backends subsystem (#12146)Vladimir Kobal
2022-03-14Remove unnecessary error report for proc and sys files (#12385)thiagoftsm
2022-03-08CO-RE and syscalls (#12318)thiagoftsm
2022-02-22Remove SIZEOF_VOIDP and ENVIRONMENT{32,64} macros. (#12046)vkalintiris
2022-02-21Update libs code (#12190)thiagoftsm
2022-02-21Fix compilation warnings on macOS (#12082)Vladimir Kobal
2022-02-17Docs: Removed Google Analytics tags (#12145)Tina Luedtke
2022-01-19eBPF plugin CO-RE and monitoring (#11992)thiagoftsm
2022-01-19Compute platform-specific list of static_threads at runtime. (#11955)vkalintiris
Compute array of static threads at runtime.
2022-01-18Do not use dbengine headers when dbengine is disabled. (#11967)vkalintiris
Prior to this commit both daemon/commands.c and spawn/spawn.c used to include database/engine/rrdenginelib.h, ie. a header file that is available only when enabling the dbengine feature.
2022-01-11Fix time_t format (#11897)Vladimir Kobal
2022-01-10Fix cachestat on kernel 5.15.x (eBPF) (#11833)thiagoftsm
2021-11-16Fix typos (#11782)Dimitris Apostolou
Co-authored-by: ilyam8 <ilya@netdata.cloud>
2021-11-08Add SSL_MODE_ENABLE_PARTIAL_WRITE to netdata_srv_ctx (#11754)Emmanuel Vasilakis
2021-10-27Anomaly Detection MVP (#11548)vkalintiris
* Add support for feature extraction and K-Means clustering.
  This patch adds support for performing feature extraction and running the K-Means clustering algorithm on the extracted features. We use the open-source dlib library to compute the K-Means clustering centers, which has been added as a new git submodule. The build system has been updated to recognize two new options: 1) --enable-ml: build an agent with ml functionality, and 2) --enable-ml-tests: support running tests with the `-W mltest` option in netdata. The second flag is meant only for internal use. To build tests successfully, you need to install the GoogleTest framework on your machine.
* Boilerplate code to track hosts/dims and init ML config options.
  A new opaque pointer field is added to the database's host and dimension data structures. The fields point to C++ wrapper classes that will be used to store ML-related information in follow-up patches. The ML functionality needs to iterate all tracked dimensions twice per second. To avoid locking the entire DB multiple times, we use a separate dictionary to add/remove dimensions as they are created/deleted by the database. A global configuration object is initialized during the startup of the agent. It will allow our users to specify ML-related configuration options, eg. hosts/charts to skip from training, etc.
* Add support for training and prediction of dimensions.
  Every new host spawns a training thread which is used to train the model of each dimension. Training of dimensions is done in a non-batching mode in order to avoid impacting the generated ML model by the CPU, RAM and disk utilization of the training code itself. For performance reasons, prediction is done at the time a new value is pushed in the database. The alternative option, ie. maintaining a separate thread for prediction, would be ~3-4x slower and would increase locking contention considerably. For similar reasons, we use a custom function to unpack storage_numbers into doubles, instead of long doubles.
* Add data structures required by the anomaly detector.
  This patch adds two data structures that will be used by the anomaly detector in follow-up patches. The first data structure is a circular bit buffer which is being used to count the number of set bits over time. The second data structure represents an expandable, rolling window that tracks set/unset bits. It is explicitly modeled as a finite-state machine in order to make the anomaly detector's behaviour easier to test and reason about.
* Add anomaly detection thread.
  This patch creates a new anomaly detection thread per host. Each thread maintains a BitRateWindow which is updated every second based on the anomaly status of the corresponding host. Based on the updated status of the anomaly window, we can identify the existence/absence of an anomaly event, its start/end time and the dimensions that participate in it.
* Create/insert/query anomaly events from Sqlite DB.
* Create anomaly event endpoints.
  This patch adds two endpoints to expose information about anomaly events. The first endpoint returns the list of anomalous events within a specified time range. The second endpoint provides detailed information about a single anomaly event, ie. the list of anomalous dimensions in that event along with their anomaly rate. The `anomaly-bit` option has been added to the `/data` endpoint in order to allow users to get the anomaly status of individual dimensions per second.
* Fix build failures on Ubuntu 16.04 & CentOS 7.
  These distros do not have toolchains with C++11 enabled by default. Replacing nullptr with NULL should fix the build problems on these platforms when the ML feature is not enabled.
* Fix `make dist` to include ML makefiles and dlib sources.
  Currently, we add ml/kmeans/dlib to EXTRA_DIST. We might want to generate an explicit list of source files in the future, in order to bring down the generated archive's file size.
* Small changes to make the LGTM & Codacy bots happy.
  - Cast unused result of function calls to void.
  - Pass a const-ref string to Database's constructor.
  - Reduce the scope of a local variable in the anomaly detector.
* Add user configuration option to enable/disable anomaly detection.
* Do not log dimension-specific operations.
  Training and prediction operations happen every second for each dimension. To make it easier to run this PR's anomaly detection for many charts & dimensions, I've removed logs that would cause log flooding.
* Reset dimensions' bit counter when not above anomaly rate threshold.
* Update the default config options with real values.
  With this patch the default configuration options will match the ones we want our users to use by default.
* Update conditions for creating new ML dimensions.
  1. Skip dimensions with update_every != 1,
  2. Skip dimensions that come from the ML charts.
  With this filtering in place, any configuration value for the relevant simple_pattern expressions will work correctly.
* Teach buildinfo{,json} about the ML feature.
* Set --enable-ml by default in the configuration options.
  This patch is only meant for testing the building of the ML functionality on Github. It will be reverted once tests pass successfully.
* Minor build system fixes.
  - Add path to json header
  - Enable C++ linker when ML functionality is enabled
  - Rename ml/ml-dummy.cc to ml/ml-dummy.c
* Revert "Set --enable-ml by default in the configuration options."
  This reverts commit 28206952a59a577675c86194f2590ec63b60506c. We pass all Github checks when building the ML functionality, except for those that run on CentOS 7 due to not having a C++11 toolchain.
* Check for missing dlib and nlohmann files.
  We simply check the single-source files upon which our build system depends. If they are missing, an error message notifies the user about missing git submodules which are required for the ML functionality.
* Allow users to specify the maximum number of KMeans iterations.
* Use dlib v19.10
  v19.22 broke compatibility with CentOS 7's g++. Development of the anomaly detection used v19.10, which is the version used by most Debian and Ubuntu distribution versions that are not past EOL. No observable performance improvements/regressions specific to the K-Means algorithm occur between the two versions.
* Detect and use the -std=c++11 flag when building anomaly detection.
  This patch automatically adds the -std=c++11 when building netdata with the ML functionality, if it's supported by the user's toolchain. With this change we are able to build the agent correctly on CentOS 7.
* Restructure configuration options.
  - update default values,
  - clamp values to min/max defaults,
  - validate and identify conflicting values.
* Add update_every configuration option.
  Considering that the MVP does not support per host configuration options, the update_every option will be used to filter hosts to train. With this change anomaly detection will be supported on:
  - Single nodes with update_every != 1, and
  - Children nodes with a common update_every value that might differ from the value of the parent node.
* Reorganize anomaly detection charts.
  This follows Andrew's suggestion to have four charts to show the number of anomalous/normal dimensions, the anomaly rate, the detector's window length, and the events that occur in the prediction step. Context and family values, along with the necessary information in the dashboard_info.js file, will be updated in a follow-up commit.
* Do not dump anomaly event info in logs.
* Automatically handle low "train every secs" configuration values.
  If a user specifies a very low value for the "train every secs", then it is possible that the time it takes to train a dimension is higher than its allotted time. In that case, we want the training thread to:
  - Reduce its CPU usage per second, and
  - Allow the prediction thread to proceed.
  We achieve this by limiting the training time of a single dimension to be equal to half the time allotted to it. This means that the training thread will never consume more than 50% of a single core.
* Automatically detect if ML functionality should be enabled.
  With these changes, we enable ML if:
  - The user has not explicitly specified --disable-ml, and
  - Git submodules have been checked out properly, and
  - The toolchain supports C++11.
  If the user has explicitly specified --enable-ml, the build fails if git submodules are missing, or the toolchain does not support C++11.
* Disable anomaly detection by default.
* Do not update charts in locked region.
* Cleanup code reading configuration options.
* Enable C++ linker when building ML.
* Disable ML functionality for CMake builds.
* Skip LGTM for dlib and nlohmann libraries.
* Do not build ML if libuuid is missing.
* Fix dlib path in LGTM's yaml config file.
* Add chart to track duration of prediction step.
* Add chart to track duration of training step.
* Limit the number of dimensions in an anomaly event.
  This will ensure our JSON results won't grow without any limit. The default ML configuration options train approximately 1700 dimensions in a newly-installed Netdata agent. The hard-limit is set to 2000 dimensions which:
  - Is well above the default number of dimensions we train,
  - If it is ever reached it means that the user had accidentally set a very low anomaly rate threshold, and
  - Considering that we sort the result by anomaly score, the cutoff dimensions will be the less anomalous, ie. the least important to investigate.
* Add information about the ML charts.
* Update family value in ML charts.
  This fix will allow us to show the individual charts in the RHS Anomaly Detection submenu.
* Rename chart type s/anomalydetection/anomaly_detection/g
* Expose ML feat in /info endpoint.
* Export ML config through /info endpoint.
* Fix CentOS 7 build.
* Reduce the critical region of a host's lock.
  Before this change, each host had a single, dedicated lock to protect its map of dimensions from adding/deleting new dimensions while training and detecting anomalies. This was problematic because training of a single dimension can take several seconds in nodes that are under heavy load. After this change, the host's lock protects only the insertion/deletion of new dimensions, and the prediction step. For the training of dimensions we use a dedicated lock per dimension, which is responsible for protecting the dimension from deletion while training. Prediction is fast enough, even on slow machines or under heavy load, which allows us to use the host's main lock and avoid increasing the complexity of our implementation in the anomaly detector.
* Improve the way we are tracking anomaly detector's performance.
  This change allows us to:
  - track the total training time per update_every period,
  - track the maximum training time of a single dimension per update_every period, and
  - export the current number of total, anomalous, normal dimensions to the /info endpoint.
  Also, now that we use dedicated locks per dimension, we can train under heavy load continuously without having to sleep in order to yield the training thread and allow the prediction thread to progress.
* Use samples instead of seconds in ML configuration.
  This commit changes the way we are handling input ML configuration options from the user. Instead of treating values as seconds, we interpret all inputs as number of update_every periods. This allows us to enable anomaly detection on hosts that have update_every != 1 second, and still produce a model for training/prediction & detection that behaves in an expected way. Tested by running anomaly detection on an agent with update_every = [1, 2, 4] seconds.
* Remove unnecessary log message in detection thread
* Move ML configuration to global section.
* Update web/gui/dashboard_info.js
  Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
* Fix typo
  Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
* Rebase.
* Use negative logic for anomaly bit.
* Add info for prediction_stats and training_stats charts.
* Disable ML on PPC64EL.
  The CI test fails with -std=c++11 and requires -std=gnu++11 instead. However, it's not easy to quickly append the required flag to CXXFLAGS. For the time being, simply disable ML on PPC64EL and if any users require this functionality we can fix it in the future.
* Add comment on why we disable ML on PPC64EL.
Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
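The commit above describes a circular bit buffer that counts the number of set bits over time. The sketch below shows one way such a structure can work, written in C to match the rest of this directory even though the commit says the ML code uses C++ wrapper classes. The `sketch_bitbuf` type, its fixed 8-slot capacity, and the rate helper are hypothetical and are not the commit's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_BITS 8 /* hypothetical window length, in samples */

typedef struct {
    bool bits[SKETCH_BITS];
    size_t next;      /* slot the next insert will overwrite */
    size_t filled;    /* number of slots holding real samples */
    size_t set_count; /* number of true bits currently stored */
} sketch_bitbuf;

/* Insert one bit, evicting the oldest sample once the buffer is full.
 * The running count is updated in O(1) instead of rescanning. */
static void sketch_bitbuf_insert(sketch_bitbuf *b, bool bit) {
    assert(b);
    if (b->filled == SKETCH_BITS) {
        if (b->bits[b->next]) b->set_count--; /* oldest bit leaves */
    } else {
        b->filled++;
    }
    b->bits[b->next] = bit;
    if (bit) b->set_count++;
    b->next = (b->next + 1) % SKETCH_BITS;
}

/* Fraction of stored samples that are set; 0.0 when empty. */
static double sketch_bitbuf_rate(const sketch_bitbuf *b) {
    return b->filled ? (double)b->set_count / (double)b->filled : 0.0;
}
```

Feeding one anomaly bit per sample into such a buffer gives a rolling anomaly rate that can be compared against a threshold without storing more than the window itself.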
2021-10-22Reuse the SN_EXISTS bit to track anomaly status. (#11154)vkalintiris
* Replace all usages of SN_EXISTS with SN_DEFAULT_FLAGS.
* Remove references to SN_NOT_EXISTS in comments.
* Replace raw zero constant with SN_EMPTY_SLOT.
* Use get_storage_number_flags only in storage_number.{c,h}
* Compare against SN_EMPTY_SLOT to check if a storage_number exists.
  This is safe because:
  1. rrdset_done_interpolate() is the only place where we call store_metric(),
  2. All store_metric() calls, except for one, store an SN_EMPTY_SLOT value.
  3. When we are not storing an SN_EMPTY_SLOT value, the flags that we pass to pack_storage_number() can be either SN_EXISTS *or* SN_EXISTS_RESET.
* Compare only the SN_EXISTS_RESET bit to find reset values.
* Remove get_storage_number_flags from storage_number.h
* Do not set storage_number flags outside of rrdset_done_interpolate().
  This is an NFC intended to limit the scope of storage_number flags processing to just one function.
* Set reset bit without overwriting the rest of the flags.
* Rename SN_EXISTS to SN_ANOMALY_BIT.
* Use GOTOs in pack_storage_number to return from a single place.
* Teach pack_storage_number how to handle anomalous zero values.
  Up until now, a storage_number always had either the SN_EXISTS or SN_EXISTS_RESET bit set. This meant that it was not possible for any packed storage_number to compare equal to the SN_EMPTY_SLOT. However, the SN_ANOMALY_BIT can be set to zero. This is fine for every value other than the anomalous 0 value, because it would compare equal to SN_EMPTY_SLOT. We address this issue by mapping the anomalous zero value to SN_EXISTS_100 (a number which was not possible to generate with the previous versions of the agent, ie. it won't exist in older dbengine files). This change was tested manually by intentionally flipping the anomaly bit for odd/even iterations in rrdset_done_interpolate. Prior to this change, charts whose dimensions had 0 values were showing up in the dashboard as gaps (SN_EMPTY_SLOT), whereas with this commit the values are displayed correctly.
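The last item above describes a collision: once the former "exists" bit doubles as an anomaly marker that may be zero, an anomalous value of 0 packs to the same word as an empty slot. The toy model below reproduces that collision and the remapping fix. It is only a model: the `TOY_*` names, the bit positions, and the plain 24-bit integer payload are assumptions chosen for illustration, while the real storage_number is a packed floating-point format whose layout is not given in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for this toy: low 24 bits carry the value,
 * the bits above them carry flags. */
#define TOY_EMPTY_SLOT  UINT32_C(0)
#define TOY_VALUE_MASK  UINT32_C(0x00FFFFFF)
#define TOY_ANOMALY_BIT (UINT32_C(1) << 24) /* set means NOT anomalous */
#define TOY_RESET_BIT   (UINT32_C(1) << 25)
#define TOY_EXISTS_100  (UINT32_C(1) << 26) /* sentinel for anomalous zero */

static uint32_t toy_pack(uint32_t value, bool anomalous, bool reset) {
    assert(value <= TOY_VALUE_MASK);
    uint32_t sn = value;
    if (!anomalous) sn |= TOY_ANOMALY_BIT;
    if (reset) sn |= TOY_RESET_BIT;
    /* An anomalous, non-reset zero has no bits set and would be
     * indistinguishable from an empty slot, so remap it. */
    if (sn == TOY_EMPTY_SLOT) sn = TOY_EXISTS_100;
    return sn;
}

static bool toy_exists(uint32_t sn) {
    return sn != TOY_EMPTY_SLOT;
}

static bool toy_is_anomalous(uint32_t sn) {
    return toy_exists(sn) && !(sn & TOY_ANOMALY_BIT);
}

static uint32_t toy_value(uint32_t sn) {
    /* The sentinel lies outside the value mask, so it decodes to 0. */
    return sn & TOY_VALUE_MASK;
}
```

The remap works because the sentinel is a bit pattern that no ordinary packed value produces, so readers can tell "anomalous zero" apart from "nothing stored" with a single comparison against the empty slot.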
2021-10-18Fix interval usage and reduce I/O (#11662)thiagoftsm
2021-10-12eBPF cgroup integration (#11642)thiagoftsm
2021-10-01Integrate eBPF and cgroup (consumer side) (#11573)thiagoftsm
2021-09-20Update libbpf (#11480)thiagoftsm
2021-08-19Update ebpf socket (#11441)thiagoftsm
2021-08-11Split eBPF programs (#11401)thiagoftsm
2021-08-11Add ACLK synchronization event loop (#11396)Stelios Fragkakis
2021-07-19Move cleanup of obsolete charts to a separate thread (#11222)Vladimir Kobal
2021-07-02Ebpf disk latency (#11276)thiagoftsm
Add disk monitoring independent of filesystem.
2021-06-18Ebpf apps memory usage (#11256)thiagoftsm
2021-06-16eBPF keep values from `ebpf.d.conf` (#11253)thiagoftsm
2021-06-08eBPF ext4 (new thread for collector) (#11224)thiagoftsm
* ebpf_ext4: Add new thread
* ebpf_ext4: Add configuration files
* ebpf_ext4: Add helpers to identify partitions and main threads
* ebpf_ext4: Add helpers to create chart
* ebpf_ext4: Add functions to read data from kernel ring
* ebpf_ext4: Add functions to send data to Netdata
* ebpf_ext4: Adjust dimensions
* ebpf_ext4: Add information for dashboard
* ebpf_ext4: Update documentation
* ebpf_ext4: Update algorithm to read Array table instead hash table
* ebpf_ext4: Add new eBPF version
* ebpf_ext4: Add obsolete chart
* ebpf_ext4: Fix coverity report
* ebpf_ext4: Fix grammar in readme.md
* ebpf_ext4: Update link inside dashboard_info.js
* ebpf_ext4: Rename function and remove unused options inside filesystem.conf
* ebpf_ext4: Rename variables and fix format
* ebpf_ext4: Rename more variables
* ebpf_ext4: Update algorithm to create dimensions
* ebpf_ext4: Fix comment grammar
* ebpf_ext4: Add messages to simplify comparison with hash table
* ebpf_ext4: Update eBPF release
* ebpf_ext4: Remove variables to improve the buckets
* ebpf_ext4: Update algorithm to select filesystem
* ebpf_ext4: Remove messages
* ebpf_ext4: Add comment to filesystem
2021-05-25Move parser from children to main thread (#11152)thiagoftsm
Centralize eBPF plugin parser to avoid possible contradictions between user configuration and visualized charts.
2021-05-06Check the version of the default cgroup mountpoint (#11102)Vladimir Kobal
2021-05-03Ebpf directory cache (#10855)thiagoftsm
Add new thread to ebpf.plugin.
2021-04-28Load names (#11034)thiagoftsm
2021-04-27Provide more agent analytics to posthog (#11020)Emmanuel Vasilakis
* Move statistics related functions to analytics.c
* error message change, space added after if
* start an analytics thread
* use heartbeat instead of sleep
* add late environment (after rrdinit) pick of some attributes
* change loop
* re-enable info messages
* remove possible new line
* log and report hits on allmetrics pages. detect if exporting engines are enabled/in use, and report them
* use lowercase for analytics variables
* add collectors
* add buildinfo
* more attributes from late environment
* add new attributes to v1/info
* re-gather meta data before exit. update allmetrics counters to be available in v1/info
* log hits to dashboard
* add mirrored hosts
* added notification methods
* fix spaces, proper JSON naming
* add alerts, charts and metrics count
* more attributes
* keep the thread up, and report a meta event every 2 hours
* small formatting changes. Disable analytics_log_prometheus when unit testing. Add the new attributes to the anonymous-statistics.sh.in script
* applied clang-format
* don't gather data again on exit
* safe buffer length in snprintfz
* add rrdset lock
* remove show_archived
* remove setenv
* calculate lengths during sets
2021-04-21Revert "Provide more agent analytics to posthog (#10887)" (#11011)Emmanuel Vasilakis
This reverts commit a1ce482f3e336dbabe1b12b92f6339af6a2bbbf8.
2021-04-21Provide more agent analytics to posthog (#10887)Emmanuel Vasilakis
* Move statistics related functions to analytics.c
* error message change, space added after if
* start an analytics thread
* use heartbeat instead of sleep
* add late environment (after rrdinit) pick of some attributes
* change loop
* re-enable info messages
* remove possible new line
* log and report hits on allmetrics pages. detect if exporting engines are enabled/in use, and report them
* use lowercase for analytics variables
* add collectors
* add buildinfo
* more attributes from late environment
* add new attributes to v1/info
* re-gather meta data before exit. update allmetrics counters to be available in v1/info
* log hits to dashboard
* add mirrored hosts
* added notification methods
* fix spaces, proper JSON naming
* add alerts, charts and metrics count
* more attributes
* keep the thread up, and report a meta event every 2 hours
* small formatting changes. Disable analytics_log_prometheus when unit testing. Add the new attributes to the anonymous-statistics.sh.in script
* applied clang-format
* don't gather data again on exit
* safe buffer length in snprintfz
* add rrdset lock
* remove show_archived
2021-04-20Provide new attributes in health conf files (#10961)Emmanuel Vasilakis
* read and store new attributes (class, component, type) from health conf files. Replace family variable in info strings
* provide the attributes to jsons
* remove extra semicolon
* populate conf files with new attributes
* added newline
* remove extra defines from health.h
* remove empty line
* remove realloc
* use helper variables for find_and_replace. Adjust position for next strstr
* remove comments
* Add type to mysql.conf and vcsa.conf
* fix formatting
* add parenthesis
* remove extra assignment
* changes to mysql_galera_cluster_state from master
* add type Errors to unbound_request_list_overwritten
* fix indentation for info strings spanning more than one line
* check for null, replace with empty string if true
* add class, component, type to systemdunits.conf
2021-04-15Bring flexible adjust for eBPF hash tables (#10962)thiagoftsm
Allow users to set the hash table size.
2021-04-15Remove error message on netdata restart (#8685)Steve8291
When issuing a SIGTERM with `systemctl restart netdata.service` an ERROR message is created in the log for every plugin:
> netdata ERROR : PLUGINSD[apps] : child pid 23901 killed by signal 15.
> netdata ERROR : PLUGINSD[python.d] : child pid 23908 killed by signal 15.
> netdata ERROR : PLUGINSD[nfacct] : child pid 23909 killed by signal 15.
> netdata ERROR : PLUGINSD[go.d] : child pid 23899 killed by signal 15.
Seems like it would be worth silencing this to an INFO message if we did a proper restart or shutdown. Also, I wasn't sure what the proper return code should be so I put it in as `return(0);`
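The message above proposes treating a child killed by signal 15 as informational when the parent is restarting or shutting down on purpose. The sketch below shows how a `waitpid()` status can be classified that way. The `sketch_*` names and the `shutting_down` parameter are hypothetical; the actual change in the commit is not shown in this log.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* feature-test macro for strict -std modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum { SKETCH_LOG_INFO, SKETCH_LOG_ERROR } sketch_level;

/* Decide how loudly to log the end of a child process. */
static sketch_level sketch_classify(int status, bool shutting_down) {
    if (WIFEXITED(status))
        return WEXITSTATUS(status) == 0 ? SKETCH_LOG_INFO : SKETCH_LOG_ERROR;
    /* SIGTERM during an orderly shutdown is expected, not an error. */
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM && shutting_down)
        return SKETCH_LOG_INFO;
    return SKETCH_LOG_ERROR;
}

/* Demo helper: fork a child that only waits, kill it with `signo`,
 * and return the status reported by waitpid(). */
static int sketch_signal_child(int signo) {
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        for (;;) pause(); /* child: sleep until a signal ends it */
    }
    kill(pid, signo);
    int status = 0;
    waitpid(pid, &status, 0);
    return status;
}
```

Keeping the shutdown state as an explicit input means a plugin that receives SIGTERM unexpectedly, while the parent is running normally, is still reported as an error.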
2021-04-14Spelling libnetdata (#10917)Josh Soref