summaryrefslogtreecommitdiffstats
path: root/ml
AgeCommit message (Collapse)Author
2023-03-21Revert "Use static thread-pool for training. (#14702)" (#14782)vkalintiris
This reverts commit 5046e034212c008557dd014196b6f6204eda24b2. Will re-apply once we investigate an issue that occurs during the shutdown of the agent.
2023-03-21Use static thread-pool for training. (#14702)vkalintiris
* Use static thread-pool for training. * Add missing function definition * disable training stats chart * Add config option to explicitly enable ML stats charts. --------- Co-authored-by: Costa Tsaousis <costa@netdata.cloud>
2023-03-10Refactor ML code. (#14659)vkalintiris
* Refactor ML code. This commit introduces only non-functional changes. Originally, the C++ code exposed C functions to be called from the rest of the agent. When we migrated from C++ to C, we did not eliminate these wrapper functions to make the PR easier to understand and keep the total LOC low. This commit removes the wrapper functions and "reclaims" the `ml_` prefix that we used for the public API of the old implementation. Also, the nlohmann Json library has been removed and its functionality was replaced with the equivalent Json functionality that we added in libnetdata's BUFFERs. * Remove missing headers from build systems. * Fix CMake build. * rrddim_free is outside of rrd "internals" now.
2023-03-07add note on readme on how to easily see all ml related blog posts (#14675)Andrew Maguire
2023-03-02/api/v2/contexts (#14592)Costa Tsaousis
* preparation for /api/v2/contexts * working /api/v2/contexts * add anomaly rate information in all statistics; when sum-count is requested, return sums and counts instead of averages * minor fix * query targegt now accurately counts hosts, contexts, instances, dimensions, metrics * cleanup /api/v2/contexts * full text search with /api/v2/contexts * simple patterns now support the option to search ignoring case * full text search API with /api/v2/q * simple pattern execution optimization * do not show q when not given * full text search accounting * separated /api/v2/nodes from /api/v2/contexts * fix ssv queries for group_by * count query instances queried and failed per context and host * split rrdcontext.c to multiple files * add query totals * fix anomaly rate calculation; provide "ni" for indexing hosts * do not generate zero valued members * faster calculation of anomaly rate; by just summing integers for each db points and doing math once for every generated point * fix typo when printing dimensions totals * added option minify to remove spaces and newlines fron JSON output * send instance ids and names when they differ * do not add in query target dimensions, instances, contexts and hosts for which there is no retention in the current timeframe * fix for the previous + renames and code cleanup * when a dimension is filtered, include in the response all the other dimensions that are selectable * do not add nodes that do not have retention in the current window * move selection of dimensions to query_dimension_add(), instead of query_metric_add() * increase the pre-processing capacity of queries * generate instance fqdn ids and names only when they are needed * provide detailed statistics about tiers retention, queries, points, update_every * late allocation of query dimensions * cleanup * more cleanup * support for annotations per displayed point, RESET and PARTIAL * new type annotations * if a chart is not linked to contexts and it is collected, link it when it is collected * make ML run reentrant * make ML rrdr query synchronous * optimize replication memory allocation of replication_sort_entry * change units to percentage, when requesting a coefficinet of variation, or a percentage query * initialize replication before starting main threads * properly decrement no room requests counter * propagate the non-zero flag to group-by * the same by avoiding the extra loop * respect non-zero in all dimension arrays * remove dictionary garbage collection from dictionary_entries() and dictionary_version() * be more verbose when jv2 indexing is postponed * prevent infinite loop * use hidden dimensions even when dimensions pattern is unset * traverse hosts using dictionaries * fix dictionary unittests
2023-02-28Port ML from C++ to C. (#14567)vkalintiris
* Port ML from C++ to C. Pretty much everything is a non-functional change, ie. the functionality is identical to the one provided by the existing implementation that is written in C++. Performance-wise, this implementation: - Eliminates/reduces the number of allocations and deallocations we have to do for training/detection, - Uses just a single thread to perform detection for *all* the hosts (ie. reduces the number of required threads by 50% on parents), and - Allows training, prediction and detection of dimensions that have an update_every that is different from that of the localhost. The only C++ functionality that we still use is vectors, because they make our life easier and they are pretty much a requirement imposed by dlib. * Remove profile.plugin It was useful only for testing during development. * Limit logs to 200 lines per period * Properly generate ml_info in /api/v1/info endpoint. * Remove resource usage charts since we use worker charts. * Use a temporary to make linters happy. * Rebase. * Fix builds that have ML functionality disabled.
2023-02-20Fix broken links in our documentation (#14565)Fotis Voutsas
* fix broken link in ml/README.md * fix broken link across all files * fix broken link across all files * fix broken links and remove what's next sections * fix broken links and remove what's next section * Remove related links sections with broken links that link to removed files * fix broken links
2023-02-17Reorg learn 021723 (#14556)Chris Akritidis
* Change titles of agent alert notifications * Reintroduce netdata for iot * Eliminate guides category, merge health config docs * Rename setup to configuration * Codacy fixes and move health config reference
2023-02-15JSON internal API, IEEE754 base64/hex streaming, weights endpoint ↵Costa Tsaousis
optimization (#14493) * first work on standardizing json formatting * renamed old grouping to time_grouping and added group_by * add dummy functions to enable compilation * buffer json api work * jsonwrap opening with buffer_json_X() functions * cleanup * storage for quotes * optimize buffer printing for both numbers and strings * removed ; from define * contexts json generation using the new json functions * fix buffer overflow at unit test * weights endpoint using new json api * fixes to weights endpoint * check buffer overflow on all buffer functions * do synchronous queries for weights * buffer_flush() now resets json state too * content type typedef * print double values that are above the max 64-bit value * str2ndd() can now parse values above UINT64_MAX * faster number parsing by avoiding double calculations as much as possible * faster number parsing * faster hex parsing * accurate printing and parsing of double values, even for very large numbers that cannot fit in 64bit integers * full printing and parsing without using library functions - and related unit tests * added IEEE754 streaming capability to enable streaming of double values in hex * streaming and replication to transfer all values in hex * use our own str2ndd for set2 * remove subnormal check from ieee * base64 encoding for numbers, instead of hex * when increasing double precision, also make sure the fractional number printed is aligned to the wanted precision * str2ndd_encoded() parses all encoding formats, including integers * prevent uninitialized use * /api/v1/info using the new json API * Fix error when compiling with --disable-ml * Remove redundant 'buffer_unittest' declaration * Fix formatting * Fix formatting * Fix formatting * fix buffer unit test * apps.plugin using the new JSON API * make sure the metrics registry does not accept negative timestamps * do not allow pages with negative timestamps to be loaded from db files; do not accept pages with negative timestamps in the cache * Fix more formatting --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-02-07Streaming interpolated values (#14431)Costa Tsaousis
* first commit - untested * fix wrong begin command * added set v2 too * debug to log stream buffer * debug to log stream buffer * faster streaming printing * mark charts and dimensions as collected * use stream points even if sender is not enabled * comment out stream debug log * parse null as nan * custom begin v2 * custom set v2; replication now copies the anomalous flag too * custom end v2 * enabled stream log test * renamed to BEGIN2, SET2, END2 * dont mix up replay and v2 members in user object * fix typo * cleanup * support to v2 to v1 proxying * mark updated dimensions as such * do not log unknown flags * comment out stream debug log * send also the chart id on BEGIN2, v2 to v2 * update the data collections counter * v2 values are transferred in hex * faster hex parsing * a little more generic hex and dec printing and parsing * fix hex parsing * minor optimization in dbengine api * turn debugging into info message * generalized the timings tracking, so that it can be used in more places * commented out debug info * renamed conflicting variable with macro * remove wrong edits * integrated ML and added cleanup in case parsing is interrupted * disable data collection locking during v2 * cleanup stale ML locks; send updated chart variables during v2; add info to find stale locks * inject an END2 between repeated BEGIN2 from rrdset_done() * test: remove lockless single-threaded logic from dictionary and aral and apply the right acquire/release memory order to reference counters * more fine grained dictionary atomics * remove unecessary return values * pointer validation under NETDATA_DICTIONARY_VALIDATE_POINTERS * Revert "pointer validation under NETDATA_DICTIONARY_VALIDATE_POINTERS" This reverts commit 846cdf2713e2a7ee2ff797f38db11714228800e9. * Revert "remove unecessary return values" This reverts commit 8c87d30f4d86f0f5d6b4562cf74fe7447138bbff. * Revert "more fine grained dictionary atomics" This reverts commit 984aec4234a340d197d45239ff9a10fd479fcf3c. * Revert "test: remove lockless single-threaded logic from dictionary and aral and apply the right acquire/release memory order to reference counters" This reverts commit c460b3d0ad497d2641bd0ea1d63cec7c052e74e4. * Apply again "pointer validation under NETDATA_DICTIONARY_VALIDATE_POINTERS" while keeping the improved atomic operations. This reverts commit f158d009 * fix last commit * fix last commit again * optimizations in dbengine * do not send anomaly bit on non-supporting agents (send it when the INTERPOLATED capability is available) * break long empty-points-loops in rrdset_done() * decide page alignment on new page allocation, not on every point collected * create max size pages but no smaller than 1/3 * Fix compilation when --disable-ml is specified * Return false * fixes for NETDATA_LOG_REPLICATION_REQUESTS * added compile option NETDATA_WITHOUT_WORKERS_LATENCY * put timings in BEGIN2, SET2, END2 * isolate begin2 ml * revert repositioning data collection lock * fixed multi-threading of statistics * do not lookup dimensions all the time if they come in the same order * update used on iteration, not on every points; also do better error handling --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-02-03Stop training thread from processing training requests once cancelled. (#14423)vkalintiris
2023-02-02Covert our documentation links to GH absolute links (#14344)Tasos Katsoulas
Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
2023-01-26DBENGINE v2 - improvements part 9 (#14326)Costa Tsaousis
* on shutdown stop data collection for all hosts instead of freeing their memory * print number of sql statements per metadata host scan * print timings with metadata checking * use dbengine API to figure out of a database is legacy * Recalculate retention after a datafile deletion * validate child timestamps during replication * main cache uses a lockless aral per partition, protected by the partition index lock * prevent ML crash * Revert "main cache uses a lockless aral per partition, protected by the partition index lock" This reverts commit 6afc01527dc5c66548b4bc8a1d63c026c3149358. * Log direct index and binary searches * distribute metrics more evenly across time * statistics about retention recalculation * fix crash * Reverse the binary search to calculate retention * more optimization on retention calculation * removed commented old code Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-25Introduce the new Structure of the documentation (#13915)Fotis Voutsas
* Moving the cloud docs under /docs/cloud (previous location: netdata/learn/*) * Added metadata on almost every document of the old learn site for the new ingest process of learn. * Map old learn document to their best fit as topic related docs. Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> Co-authored-by: DShreve2 <david@netdata.cloud> Co-authored-by: hugovalente-pm <hugo@netdata.cloud>
2023-01-20track memory footprint of Netdata (#14294)Costa Tsaousis
* track memory footprint of Netdata * track db modes alloc/ram/save/map * track system info; track sender and receiver * fixes * more fixes * track workers memory, onewayalloc memory; unify judyhs size estimation * track replication structures and buffers * Properly clear host RRDHOST_FLAG_METADATA_UPDATE flag * flush the replication buffer every 1000 times the circular buffer is found empty * dont take timestamp too frequently in sender loop * sender buffers are not used by the same thread as the sender, so they were never recreated - fixed it * free sender thread buffer on replication threads when replication is idle * use the last sender flag as a timestamp of the last buffer recreation * free cbuffer before reconnecting * recreate cbuffer on every flush * timings for journal v2 loading * inlining of metric and cache functions * aral likely/unlikely * free left-over thread buffers * fix NULL pointer dereference in replication * free sender thread buffer on sender thread too * mark ctx as used before flushing * better logging on ctx datafiles closing Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-18DBENGINE v2 - improvements part 5 (#14289)Costa Tsaousis
* cleanup journal v2 mounts periodically * fix for last commit * re-enable loading page from disk when the arrangement of pages requires it * Remove unused statistics * Estimate diskspace when the current datafile is full and queue a rotate command (Currently it will not attempt to estimate end size for journals) Queue a command to check quota on startup per tier * apps.plugin now exposes RSS chart * shorter thread names to make debugging easier, since thread names can only be 15 characters * more thread names fixes * allow an apps_groups.conf target to be pid 0 or 1 Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-14More 32bit fixes (#14264)Costa Tsaousis
* query planer weight calculation using long long * adjust replication query ahead pipeline for smaller systems * do not generate huge replication messages * add message to indicate replication message was interrupted * improved message * max replication size 25% of sender buffer * fix for last commit * use less cache and smaller page sizes and fewer threads on 32-bits * fix reserved libuv workers for 32bits * fix detection of 32/64 bit
2023-01-13DBENGINE v2 - improvements 2 (#14257)Costa Tsaousis
* allow extents to be merged for as long as possible * do not block the event loop while recalculating retention due to datafile rotation * buffers are incrementally cleaned up, every second, by just 1 entry * fix order of commands * remove newline * measure cancelled extent read requests * count all cancelled extent requests * do not double count failed pages * fixed cancelled name * Fix error and warnings when compiling with --disable-dbengine * when the timeframe is outside retention and whole query should fail * do not mark as failed pages that have been loaded but have been skipped * added chart to show cache memory calculation variables * LONG_MAX for 32-bit compatibility * fix cache size calculation on 32-bit * fix cache size calculation on 32-bit - use unsinged long long * fix compilation warnings on 32-bits * fix another compilation warning on 32-bits * fix compilation warnings on older 32-bit compilers * fix compilation warnings on older 32-bit compilers - more of them * disable ML threads joining Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-12Remove temporary allocations when preprocessing a samples buffer (#14208)vkalintiris
* Remove allocations when smoothing samples buffer The number of dims per sample is always 1, ie. we are training and predicting only individual dimensions. * Do not try to stop anomaly detection threads twice. A separate function call has been added to stop anomaly detection threads. This commit removes the left over function calls that were made internally when a host was being created/destroyed. * Move samples instead of copying.
2023-01-11cancel ml threads on shutdown and join them on host free (#14240)Costa Tsaousis
* cancel ml threads on shutdown and join them on host free * mark them not running before joining them
2023-01-10DBENGINE v2 (#14125)Costa Tsaousis
* count open cache pages refering to datafile * eliminate waste flush attempts * remove eliminated variable * journal v2 scanning split functions * avoid locking open cache for a long time while migrating to journal v2 * dont acquire datafile for the loop; disable thread cancelability while a query is running * work on datafile acquiring * work on datafile deletion * work on datafile deletion again * logs of dbengine should start with DBENGINE * thread specific key for queries to check if a query finishes without a finalize * page_uuid is not used anymore * Cleanup judy traversal when building new v2 Remove not needed calls to metric registry * metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent * disable checks for invalid time-ranges * Remove type from page details * report scanning time * remove infinite loop from datafile acquire for deletion * remove infinite loop from datafile acquire for deletion again * trace query handles * properly allocate array of dimensions in replication * metrics cleanup * metrics registry uses arrayalloc * arrayalloc free should be protected by lock * use array alloc in page cache * journal v2 scanning fix * datafile reference leaking hunding * do not load metrics of future timestamps * initialize reasons * fix datafile reference leak * do not load pages that are entirely overlapped by others * expand metric retention atomically * split replication logic in initialization and execution * replication prepare ahead queries * replication prepare ahead queries fixed * fix replication workers accounting * add router active queries chart * restore accounting of pages metadata sources; cleanup replication * dont count skipped pages as unroutable * notes on services shutdown * do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file * do not add pages we dont need to pdc * time in range re-work to provide info about past and future matches * finner control on the pages selected for processing; accounting of page related issues * fix invalid reference to handle->page * eliminate data collection handle of pg_lookup_next * accounting for queries with gaps * query preprocessing the same way the processing is done; cache now supports all operations on Judy * dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3) * get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query * finner gaps calculation; accounting of overlapping pages in queries * fix gaps accounting * move datafile deletion to worker thread * tune libuv workers and thread stack size * stop netdata threads gradually * run indexing together with cache flush/evict * more work on clean shutdown * limit the number of pages to evict per run * do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction * economies on flags for smaller page footprint; cleanup and renames * eviction moves referenced pages to the end of the queue * use murmur hash for indexing partition * murmur should be static * use more indexing partitions * revert number of partitions to number of cpus * cancel threads first, then stop services * revert default thread stack size * dont execute replication requests of disconnected senders * wait more time for services that are exiting gradually * fixed last commit * finer control on page selection algorithm * default stacksize of 1MB * fix formatting * fix worker utilization going crazy when the number is rotating * avoid buffer full due to replication preprocessing of requests * support query priorities * add count of spins in spinlock when compiled with netdata internal checks * remove prioritization from dbengine queries; cache now uses mutexes for the queues * hot pages are now in sections judy arrays, like dirty * align replication queries to optimal page size * during flushing add to clean and evict in batches * Revert "during flushing add to clean and evict in batches" This reverts commit 8fb2b69d068499eacea6de8291c336e5e9f197c7. * dont lock clean while evicting pages during flushing * Revert "dont lock clean while evicting pages during flushing" This reverts commit d6c82b5f40aeba86fc7aead062fab1b819ba58b3. * Revert "Revert "during flushing add to clean and evict in batches"" This reverts commit ca7a187537fb8f743992700427e13042561211ec. * dont cross locks during flushing, for the fastest flushes possible * low-priority queries load pages synchronously * Revert "low-priority queries load pages synchronously" This reverts commit 1ef2662ddcd20fe5842b856c716df134c42d1dc7. * cache uses spinlock again * during flushing, dont lock the clean queue at all; each item is added atomically * do smaller eviction runs * evict one page at a time to minimize lock contention on the clean queue * fix eviction statistics * fix last commit * plain should be main cache * event loop cleanup; evictions and flushes can now happen concurrently * run flush and evictions from tier0 only * remove not needed variables * flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time * added worker jobs in timer to find the slow part of it * support fast eviction of pages when all_of_them is set * revert default thread stack size * bypass event loop for dispatching read extent commands to workers - send them directly * Revert "bypass event loop for dispatching read extent commands to workers - send them directly" This reverts commit 2c08bc5bab12881ae33bc73ce5dea03dfc4e1fce. * cache work requests * minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors * publish flushed pages to open cache in the thread pool * prevent eventloop requests from getting stacked in the event loop * single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c * more rrdengine.c cleanup * enable db rotation * do not log when there is a filter * do not run multiple migration to journal v2 * load all extents async * fix wrong paste * report opcodes waiting, works dispatched, works executing * cleanup event loop memory every 10 minutes * dont dispatch more work requests than the number of threads available * use the dispatched counter instead of the executing counter to check if the worker thread pool is full * remove UV_RUN_NOWAIT * replication to fill the queues * caching of extent buffers; code cleanup * caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation * single transaction wal * synchronous flushing * first cancel the threads, then signal them to exit * caching of rrdeng query handles; added priority to query target; health is now low prio * add priority to the missing points; do not allow critical priority in queries * offload query preparation and routing to libuv thread pool * updated timing charts for the offloaded query preparation * caching of WALs * accounting for struct caches (buffers); do not load extents with invalid sizes * protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset * also check if the expanded before is not the chart later updated time * also check if the expanded before is not after the wall clock time of when the query started * Remove unused variable * replication to queue less queries; cleanup of internal fatals * Mark dimension to be updated async * caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol) * disable pgc stress test, under an ifdef * disable mrg stress test under an ifdef * Mark chart and host labels, host info for async check and store in the database * dictionary items use arrayalloc * cache section pages structure is allocated with arrayalloc * Add function to wakeup the aclk query threads and check for exit Register function to be called during shutdown after signaling the service to exit * parallel preparation of all dimensions of queries * be more sensitive to enable streaming after replication * atomically finish chart replication * fix last commit * fix last commit again * fix last commit again again * fix last commit again again again * unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication * do not cancel start streaming; use high priority queries when we have locked chart data collection * prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered * opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue * Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore * Fix bad memory allocation / cleanup * Cleanup ACLK sync initialization (part 1) * Don't update metric registry during shutdown (part 1) * Prevent crash when dashboard is refreshed and host goes away * Mark ctx that is shutting down. Test not adding flushed pages to open cache as hot if we are shutting down * make ML work * Fix compile without NETDATA_INTERNAL_CHECKS * shutdown each ctx independently * fix completion of quiesce * do not update shared ML charts * Create ML charts on child hosts. When a parent runs a ML for a child, the relevant-ML charts should be created on the child host. These charts should use the parent's hostname to differentiate multiple parents that might run ML for a child. The only exception to this rule is the training/prediction resource usage charts. These are created on the localhost of the parent host, because they provide information specific to said host. * check new ml code * first save the database, then free all memory * dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db * Cleanup metadata thread (part 2) * increase refcount before dispatching prep command * Do not try to stop anomaly detection threads twice. A separate function call has been added to stop anomaly detection threads. This commit removes the left over function calls that were made internally when a host was being created/destroyed. * Remove allocations when smoothing samples buffer The number of dims per sample is always 1, ie. we are training and predicting only individual dimensions. * set the orphan flag when loading archived hosts * track worker dispatch callbacks and threadpool worker init * make ML threads joinable; mark ctx having flushing in progress as early as possible * fix allocation counter * Cleanup metadata thread (part 3) * Cleanup metadata thread (part 4) * Skip metadata host scan when running unittest * unittest support during init * dont use all the libuv threads for queries * break an infinite loop when sleep_usec() is interrupted * ml prediction is a collector for several charts * sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit * worker_unregister() in netdata threads cleanup * moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h * fixed ML issues * removed engine2 directory * added dbengine2 files in CMakeLists.txt * move query plan data to query target, so that they can be exposed by in jsonwrap * uniform definition of query plan according to the other query target members * event_loop should be in daemon, not libnetdata * metric_retention_by_uuid() is now part of the storage engine abstraction * unify time_t variables to have the suffix _s (meaning: seconds) * old dbengine statistics become "dbengine io" * do not enable ML resource usage charts by default * unify ml chart families, plugins and modules * cleanup query plans from query target * cleanup all extent buffers * added debug info for rrddim slot to time * rrddim now does proper gap management * full rewrite of the mem modes * use library functions for madvise * use CHECKSUM_SZ for the checksum size * fix coverity warning about the impossible case of returning a page that is entirely in the past of the query * fix dbengine shutdown * keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently * fine tune cache evictions * dont initialize health if the health service is not running - prevent crash on shutdown while children get connected * rename AS threads to ACLK[hostname] * prevent re-use of uninitialized memory in queries * use JulyL instead of JudyL for PDC operations - to test it first * add also JulyL files * fix July memory accounting * disable July for PDC (use Judy) * use the function to remove datafiles from linked list * fix july and event_loop * add july to libnetdata subdirs * rename time_t variables that end in _t to end in _s * replicate when there is a gap at the beginning of the replication period * reset postponing of sender connections when a receiver is connected * Adjust update every properly * fix replication infinite loop due to last change * packed enums in rrd.h and cleanup of obsolete rrd structure members * prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests() * void unused variable * void unused variables * fix indentation * entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps * macros to caclulate page entries by time and size * prevent statsd cleanup crash on exit * cleanup health thread related variables Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com> Co-authored-by: vkalintiris <vasilis@netdata.cloud>
2023-01-04Create ML charts on child hosts. (#14207)vkalintiris
When a parent runs a ML for a child, the relevant-ML charts should be created on the child host. These charts should use the parent's hostname to differentiate multiple parents that might run ML for a child. The only exception to this rule is the training/prediction resource usage charts. These are created on the localhost of the parent host, because they provide information specific to said host.
2023-01-04Fix typos (#14194)Dimitris Apostolou
2023-01-04Refactor ML code and add support for multiple KMeans models (#14198)vkalintiris
* Add profile.plugin Creates the specified number of charts/dimensions, and supports backfilling with pseudo-historical data. * Bump * Remove wrongly merged line. * Use the number of models specified from the config section. * Add option to consult all ML models. * Remove profiling option consuming all models. * Add underscore after chart name prefix. * prediction -> dimensions chart * reorder funcs * Split charts across types with correct priority * Ignore training request when chart is under replication. * Track global number of models consulted. * Cleanup config. * initial readme updates * fix readme * readme * Fix function definition when ML is disabled. * Add dummy ml_chart_update_{begin,end} * Remove profile_plugin * Define chart priorities under collectors/all.h * s/curr_t/current_time/ * Use libnetdata's lock/thread wrappers. * Fix autotools & cmake builds. * Delete ML dimensions & charts. * Let users of buffer preprocessing to handle memory. * Add separate API calls to start/stop ML threads. Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
2022-12-22Revert "Refactor ML code and add support for multiple KMeans models. … ↵vkalintiris
(#14172)
2022-12-21Refactor ML code and add support for multiple KMeans models. (#14065)vkalintiris
* Add profile.plugin Creates the specified number of charts/dimensions, and supports backfilling with pseudo-historical data. * Bump * Remove wrongly merged line. * Use the number of models specified from the config section. * Add option to consult all ML models. * Remove profiling option consuming all models. * Add underscore after chart name prefix. * prediction -> dimensions chart * reorder funcs * Split charts across types with correct priority * Ignore training request when chart is under replication. * Track global number of models consulted. * Cleanup config. * initial readme updates * fix readme * readme * Fix function definition when ML is disabled. * Add dummy ml_chart_update_{begin,end} * Remove profile_plugin * Define chart priorities under collectors/all.h * s/curr_t/current_time/ Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
2022-11-28replication fixes No 7 (#14053)Costa Tsaousis
* move global statistics workers to a separate thread; query statistics per query source; query statistics for ML, exporters, backfilling; reset replication point in time every 10 seconds, instead of every 1; fix compilation warnings; optimize the replication queries code; prevent long tail of replication requests (big sleeps); provide query statistics about replication ; optimize replication sender when most senders are full; optimize replication_request_get_first_available(); reset replication completion calculation; * remove workers utilization from global statistics thread
2022-11-22Do not force internal collectors to call rrdset_next. (#13926)vkalintiris
* Remove calls to rrdset_next(). * Rm checks plugin * Update documentantion * Call rrdset_next from within rrdset_done This wraps up the removal of rrdset_next from internal collectors, which removes a lot of unecessary code and the need for if/else clauses in every place. The pluginsd parser is the only component that calls rrdset_next*() functions because it's not strictly speaking a collector but more of a collector manager/proxy. With the current changes it's possible to simplify the API we expose from RRD significantly, but this will be follow-up work in the future. * Remove stale reference to checks.plugin * Fix RRD unit test rrdset_next is not meant to be called from these tests. * Fix db engine unit test. * Schedule rrdset_next when we have completed at least one collection. * Mark chart creation clauses as unlikely. * Add missing brace to fix FreeBSD plugin.
2022-11-10Remove health_thread_stop (#13948)Emmanuel Vasilakis
* remove health_thread_stop * soft sleeps
2022-11-09Remove anomaly rates chart. (#13763)vkalintiris
2022-10-23QUERY_TARGET: new query engine for Netdata Agent (#13697)Costa Tsaousis
* initial implementation of QUERY_TARGET * rrd2rrdr() interface * rrddim_find_best_tier_for_timeframe() ported * added dimension filtering * added db object in query target * rrd2rrdr() ported * working on formatters * working on jsonwrapper * finally, it compiles... * 1st run without crashes * query planer working * cleanup old code * review changes * fix also changing data collection frequency * fix signess * fix rrdlabels and dimension ordering * fixes * remove unused variable * ml should accept NULL response from rrd2rrdr() * number formatting fixes * more number formatting fixes * more number formatting fixes * support mc parallel queries * formatting and cleanup * added rrd2rrdr_legacy() as a simplified interface to run a query * make sure rrdset_find_natural_update_every_for_timeframe() returns a value * make signed comparisons * weights endpoint using rrdcontexts * fix for legacy db modes and cleanup * fix for chart_ids and remove AR chart from weights endpoint * Ignore command if not initialized yet * remove unused members * properly initialize window * code cleanup - rrddim linked list is gone; rrdset rwlock is gone too * reviewed RRDR.internal members * eliminate unnecessary members of QUERY_TARGET * more complete query ids; more detailed information on aborted queries * properly terminate option strings * query id contains group_options which is controlled by users, so escaping is necessary * tense in query id * tense in query id - again * added the remaining query options to the query id * Expose hidden option to the dimension * use the hidden flag when loading context dimensions * Specify table alias for option * dont update chart last access time, unless at least a dimension of the chart will be queried Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-10-13dbengine free from RRDSET and RRDDIM (#13772)Costa Tsaousis
* dbengine free from RRDSET and RRDDIM * fix for excess parameters to query ops * add comment about ML * update_every from int to uint32_t * rrddim_mem storage engine working * fixes for update_every_s * working dbengine * a lot of changes in dbengine regarding timestamps * better logging of not sequential points * rrdset_done() now gives aligned timestamps for higher tiers * dont change the end_time of descriptors, because they cant be loaded back * fixes for cmake * fixes for db mode ram * Global counters for dbengine loading errors. Ensure dbengine store metrics always has aligned metrics or breaks the page when storing new data. * update lgtm config * fixes for 32-bit systems * update unittests * Don't try to find and create a host on the fly if not already in memory * Remove unused functions * print backtrace in case of fatal * always set ctx to page_index * detect ctx and metric uuid discrepancies * use legacy uuid if multihost is not available * fix for last commit * prevent repeating log * Do not try to access archived charts when executing a data query * Remove unused function * log inconsistent collections once every 10 mins Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-10-13overload libc memory allocators with custom ones to trace all allocations ↵Costa Tsaousis
(#13810) * overload libc memory allocators with custom ones to trace all allocations * grab libc pointers for external c plugins * use -ldl when necessary; fallback to work without dlsym when it is not available * initialize global variable * add optional dl libs * dynamically link every library function when needed for the first time * prevent crashes on musl libc * another attempt * dont dereference function * attempt no 3 * attempt no 4 * cleanup - all attempts failed * dont enable tracing of allocations * missing parenthesis
2022-10-09Remove extern from function declared in headers. (#13790)vkalintiris
By default functions are declared as extern in C/C++ headers. The goal of this PR is to reduce the wall of text that many headers have and, more importantly, to make the declaration of extern'd variables - of which we have many dispersed in various places - easily and quickly identifiable. Automatically generated with: $ git grep -l '^extern.*(' '**.h' | \ grep -v libjudy | \ grep -v 'sqlite3.h' | \ xargs sed -i -e 's/extern \(.*(.*$\)/\1/' This is a NFC.
2022-10-05Remove anomaly detector (#13657)vkalintiris
* Move all dims under one class. * Dimension owns anomaly rate RD. * Remove Dimension::isAnomalous() * Remove Dimension::trainEvery() * Rm ml/kmeans * Remove anomaly detector The same logic can be implemented by using the host anomaly rate dim. * Profile plugin. * Revert "Profile plugin." This reverts commit e3db37cb49c514502c5216cfe7bca2a003fb90f1. * Add separate source files for anomaly detection charts. * Handle training/prediction sync at the dimension level. * Keep multiple KMeans models in mem. * Move feature extraction outside KMeans class. * Use multiple models. * Add /api/v1/ml_models endpoint. * Remove Dimension::getID() * Use just 1 model and fix tests. * Add detection logic based on rrdr. * Remove config options related to anomaly detection. * Make anomaly detection queries configurable. * Fix ad query duration option. * Finalize queries in all code paths. * Check if query was initialized before finalizing it * Do not leak OWA * Profile plugin. * Revert "Profile plugin." This reverts commit 5c77145d0df7e091d030476c480ab8d9cbceb89e. * Change context from anomaly_detection to detector_events.
2022-09-23Do not create train/predict dimensions meant for tracking anomaly rates. ↵vkalintiris
(#13707)
2022-09-19RRD structures managed by dictionaries (#13646)Costa Tsaousis
* rrdset - in progress * rrdset optimal constructor; rrdset conflict * rrdset final touches * re-organization of rrdset object members * prevent use-after-free * dictionary dfe supports also counting of iterations * rrddim managed by dictionary * rrd.h cleanup * DICTIONARY_ITEM now is referencing actual dictionary items in the code * removed rrdset linked list * Revert "removed rrdset linked list" This reverts commit 690d6a588b4b99619c2c5e10f84e8f868ae6def5. * removed rrdset linked list * added comments * Switch chart uuid to static allocation in rrdset Remove unused functions * rrdset_archive() and friends... * always create rrdfamily * enable ml_free_dimension * rrddim_foreach done with dfe * most custom rrddim loops replaced with rrddim_foreach * removed accesses to rrddim->dimensions * removed locks that are no longer needed * rrdsetvar is now managed by the dictionary * set rrdset is rrdsetvar, fixes https://github.com/netdata/netdata/pull/13646#issuecomment-1242574853 * conflict callback of rrdsetvar now properly checks if it has to reset the variable * dictionary registered callbacks accept as first parameter the DICTIONARY_ITEM * dictionary dfe now uses internal counter to report; avoided excess variables defined with dfe * dictionary walkthrough callbacks get dictionary acquired items * dictionary reference counters that can be dupped from zero * added advanced functions for get and del * rrdvar managed by dictionaries * thread safety for rrdsetvar * faster rrdvar initialization * rrdvar string lengths should match in all add, del, get functions * rrdvar internals hidden from the rest of the world * rrdvar is now acquired throughout netdata * hide the internal structures of rrdsetvar * rrdsetvar is now acquired through out netdata * rrddimvar managed by dictionary; rrddimvar linked list removed; rrddimvar structures hidden from the rest of netdata * better error handling * dont create variables if not initialized for health * dont create variables if not initialized for health again * rrdfamily is now managed by dictionaries; references of it are acquired dictionary items * type checking on acquired objects * rrdcalc renaming of functions * type checking for rrdfamily_acquired * rrdcalc managed by dictionaries * rrdcalc double free fix * host rrdvars is always needed * attempt to fix deadlock 1 * attempt to fix deadlock 2 * Remove unused variable * attempt to fix deadlock 3 * snprintfz * rrdcalc index in rrdset fix * Stop storing active charts and computing chart hashes * Remove store active chart function * Remove compute chart hash function * Remove sql_store_chart_hash function * Remove store_active_dimension function * dictionary delayed destruction * formatting and cleanup * zero dictionary base on rrdsetvar * added internal error to log delayed destructions of dictionaries * typo in rrddimvar * added debugging info to dictionary * debug info * fix for rrdcalc keys being empty * remove forgotten unlock * remove deadlock * Switch to metadata version 5 and drop chart_hash chart_hash_map chart_active dimension_active v_chart_hash * SQL cosmetic changes * do not busy wait while destroying a referenced dictionary * remove deadlock * code cleanup; re-organization; * fast cleanup and flushing of dictionaries * number formatting fixes * do not delete configured alerts when archiving a chart * rrddim obsolete linked list management outside dictionaries * removed duplicate contexts call * fix crash when rrdfamily is not initialized * dont keep rrddimvar referenced * properly cleanup rrdvar * removed some locks * Do not attempt to cleanup chart_hash / chart_hash_map * rrdcalctemplate managed by dictionary * register callbacks on the right dictionary * removed some more locks * rrdcalc secondary index replaced with linked-list; rrdcalc labels updates are now executed by health thread * when looking up for an alarm look using both chart id and chart name * host initialization a bit more modular * init rrdlabels on host update * preparation for dictionary views * improved comment * unused variables without internal che