summaryrefslogtreecommitdiffstats
path: root/streaming
AgeCommit message (Collapse)Author
2023-02-09Virtual hosts for data collection (#14464)Costa Tsaousis
* support multiple hosts at pluginsd structures * cleanup obsolete code * use a lookup hashtable to quickly find the keyword to execute, without traversing the whole linked list of keywords * more cleanup * move new hash function to inlined.h * minimize comparisons, eliminate a pre-parsing of the first keyword for each line * cleanup parser from old code * move parser into libnetdata * unique entries in parser keywords hashtable * move all hashing functions to inlined.h, name their sources, simple_hash() now defaults to FNV1a, it was FNV1 * small_hash() for parser * plugins.d now can switch hosts, and also create/update them * update hash function and hashtable size * updated message * unittest all hashing functions * reset the chart when setting a new host * remove host tags * enable archived hosts when a collector pushes host info * do not need localhost to swtich to localhost * disable ARAL and OWA with -DFSANITIZE_ADDRESS=1
2023-02-08Add markdown files in Learn (#14466)Fotis Voutsas
* add metadata for learn * first batch of adding metadata to md files * second batch of adding metadata to md files * third batch of adding metadata to md files * test one sidebar_label * add missing sidebar_labels * add missing sidebar_labels to files left behind * test, ansible doc is stubborn * fix * fix * fix * don't use questionmarks in the sidebar label * don't use exclamation marks and symbols in the sidebar label * fix style guide * fixes * fixes
2023-02-07Streaming interpolated values (#14431)Costa Tsaousis
* first commit - untested * fix wrong begin command * added set v2 too * debug to log stream buffer * debug to log stream buffer * faster streaming printing * mark charts and dimensions as collected * use stream points even if sender is not enabled * comment out stream debug log * parse null as nan * custom begin v2 * custom set v2; replication now copies the anomalous flag too * custom end v2 * enabled stream log test * renamed to BEGIN2, SET2, END2 * dont mix up replay and v2 members in user object * fix typo * cleanup * support to v2 to v1 proxying * mark updated dimensions as such * do not log unknown flags * comment out stream debug log * send also the chart id on BEGIN2, v2 to v2 * update the data collections counter * v2 values are transferred in hex * faster hex parsing * a little more generic hex and dec printing and parsing * fix hex parsing * minor optimization in dbengine api * turn debugging into info message * generalized the timings tracking, so that it can be used in more places * commented out debug info * renamed conflicting variable with macro * remove wrong edits * integrated ML and added cleanup in case parsing is interrupted * disable data collection locking during v2 * cleanup stale ML locks; send updated chart variables during v2; add info to find stale locks * inject an END2 between repeated BEGIN2 from rrdset_done() * test: remove lockless single-threaded logic from dictionary and aral and apply the right acquire/release memory order to reference counters * more fine grained dictionary atomics * remove unecessary return values * pointer validation under NETDATA_DICTIONARY_VALIDATE_POINTERS * Revert "pointer validation under NETDATA_DICTIONARY_VALIDATE_POINTERS" This reverts commit 846cdf2713e2a7ee2ff797f38db11714228800e9. * Revert "remove unecessary return values" This reverts commit 8c87d30f4d86f0f5d6b4562cf74fe7447138bbff. * Revert "more fine grained dictionary atomics" This reverts commit 984aec4234a340d197d45239ff9a10fd479fcf3c. * Revert "test: remove lockless single-threaded logic from dictionary and aral and apply the right acquire/release memory order to reference counters" This reverts commit c460b3d0ad497d2641bd0ea1d63cec7c052e74e4. * Apply again "pointer validation under NETDATA_DICTIONARY_VALIDATE_POINTERS" while keeping the improved atomic operations. This reverts commit f158d009 * fix last commit * fix last commit again * optimizations in dbengine * do not send anomaly bit on non-supporting agents (send it when the INTERPOLATED capability is available) * break long empty-points-loops in rrdset_done() * decide page alignment on new page allocation, not on every point collected * create max size pages but no smaller than 1/3 * Fix compilation when --disable-ml is specified * Return false * fixes for NETDATA_LOG_REPLICATION_REQUESTS * added compile option NETDATA_WITHOUT_WORKERS_LATENCY * put timings in BEGIN2, SET2, END2 * isolate begin2 ml * revert repositioning data collection lock * fixed multi-threading of statistics * do not lookup dimensions all the time if they come in the same order * update used on iteration, not on every points; also do better error handling --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-02-06Minor fix, convert metadata of the learn to hidden sections (#14427)Tasos Katsoulas
minor fix, convert metadata of the learn to hidden sections Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
2023-02-06replication to streaming transition when there are gaps (#14434)Costa Tsaousis
fix https://github.com/netdata/netdata/issues/14432
2023-02-02Covert our documentation links to GH absolute links (#14344)Tasos Katsoulas
Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
2023-02-02DBENGINE v2 - improvements part 12 (#14379)Costa Tsaousis
* parallel initialization of tiers * do not spawn multiple dbengine event loops * user configurable dbengine parallel initialization * size netdata based on the real cpu cores available on the system netdata runs, not on the system monitored * user configurable system cpus * move cpuset parsing to os.c/.h * fix replication of misaligned chart dimensions * give a different path to each tier thread * statically allocate the path into the initialization structure * use aral for reusing dbengine pages * dictionaries uses ARAL for fixed sized values * fix compilation without internal checks * journal v2 index uses aral * test to see judy allocations * judy allocations using aral * Add config option to select if dbengine will use direct I/O (default is yes) * V1 journafiles will use uv_fs_read instead of mmap (respect the direct I/O setting) * Remove sqlite3IsMemdb as it is unused * Fix compilation error when --disable-dbengine is used * use aral for dbengine work_cmds * changed aral API to support new features * pgc and mrg aral overheads * rrdeng opcodes using aral * better structuring and naming * dbegnine query handles using aral * page descriptors using aral * remove obsolete linking * extent io descriptors using aral * aral keeps one last page alive * add missing return value * added judy aral overhead * pdc now uses aral * page_details now use aral * epdl and deol using aral - make sure ARALs are initialized before spawning the event loop * remove unused linking * pgc now uses one aral per partition * aral measure maximum allocation queue * aral to allocate pages in parallel * aral parallel pages allocation when needed * aral cleanup * track page allocation and page population separately --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-30DBENGINE v2 - improvements part 11 (#14337)Costa Tsaousis
* acquiring / releasing interface for metrics * metrics registry statistics * cleanup metrics registry by deleting metrics when they dont have retention anymore; do not double copy the data of pages to be flushed * print the tier in retention summary * Open files with buffered instead of direct I/O (test) * added more metrics stats and fixed unittest * rename writer functions to avoid confusion with refcounting * do not release a metric that is not acquired * Revert to use direct I/O on write -- use direct I/O on read as well * keep track of ARAL overhead and add it to the memory chart * aral full check via api * Cleanup * give names to ARALs and PGCs * aral improvements * restore query expansion to the future * prefer higher resolution tier when switching plans * added extent read statistics * smoother joining of tiers at query engine * fine tune aral max allocation size * aral restructuring to hide its internals from the rest of netdata * aral restructuring; addtion of defrag option to aral to keep the linked list sorted - enabled by default to test it * fully async aral * some statistics and cleanup * fix infinite loop while calculating retention * aral docs and defragmenting disabled by default * fix bug and add optimization when defragmenter is not enabled * aral stress test * aral speed report and documentation * added internal checks that all pages are full * improve internal log about metrics deletion * metrics registry uses one aral per partition * metrics registry aral max size to 512 elements per page * remove data_structures/README.md dependency --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-27DBENGINE v2 - improvements part 10 (#14332)Costa Tsaousis
* replication cancels pending queries on exit * log when waiting for inflight queries * when there are collected and not-collected metrics, use the context priority from the collected only * Write metadata with a faster pace * Remove journal file size limit and sync mode to 0 / Drop wal checkpoint for now * Wrap in a big transaction remaining metadata writes (test 1) * fix higher tiers when tiering iterations = 2 * dbengine always returns db-aligned points; query engine expands the queries by 2 points in every direction to have enough data for interpolation * Wrap in a big transaction metadata writes (test 2) * replication cancelling fix * do not first and last entry in replication when the db has no retention * fix internal check condition * Increase metadata write batch size * always apply error limit to dbengine logs * Remove code that processes the obsolete health.db files * cleanup in query.c * do not allow queries to go beyond db boundaries * prevent internal log for +1 delta in timestamp * detect gap pages in conflicts * double protection for gap injection in main cache * Add checkpoint to prevent large WAL while running Remove unused and duplicate functions * do not allocate chart cache dir if not needed * add more info to unittests * revert query expansion to satisfy unittests Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-26DBENGINE v2 - improvements part 9 (#14326)Costa Tsaousis
* on shutdown stop data collection for all hosts instead of freeing their memory * print number of sql statements per metadata host scan * print timings with metadata checking * use dbengine API to figure out of a database is legacy * Recalculate retention after a datafile deletion * validate child timestamps during replication * main cache uses a lockless aral per partition, protected by the partition index lock * prevent ML crash * Revert "main cache uses a lockless aral per partition, protected by the partition index lock" This reverts commit 6afc01527dc5c66548b4bc8a1d63c026c3149358. * Log direct index and binary searches * distribute metrics more evenly across time * statistics about retention recalculation * fix crash * Reverse the binary search to calculate retention * more optimization on retention calculation * removed commented old code Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-25DBENGINE v2 - improvements part 8 (#14319)Costa Tsaousis
* cache 100 pages for each size our tiers need * smarter page caching * account the caching structures * dynamic max number of cached pages * make variables const to ensure they are not changed * make sure replication timestamps do not go to the future * replication now sends chart and dimension states atomically; replication receivers ignores chart and dimension states when rbegin is also ignored * make sure all pages are flushed on shutdown * take into account empty points too * when recalculating retention update first_time_s on metrics only when they are bigger * Report the datafile number we use to recalculate retention * Report the datafile number we use to recalculate retention * rotate db at startup * make query plans overlap * Calculate properly first time s * updated event labels * negative page caching fix * Atempt to create missing tables on query failure * Atempt to create missing tables on query failure (part 2) * negative page caching for all gaps, to eliminate jv2 scans * Fix unittest Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-23DBENGINE v2 - improvements part 7 (#14307)Costa Tsaousis
* run cleanup in workers * when there is a discrepancy between update every, fix it * fix the other occurences of metric update every mismatch * allow resetting the same timestamp * validate flushed pages before committing them to disk * initialize collection with the latest time in mrg * these should be static functions * acquire metrics for writing to detect multiple data collections of the same metric * print the uuid of the metric that is collected twice * log the discrepancies of completed pages * 1 second tolerance * unify validation of pages and related logging across dbengine * make do_flush_pages() thread safe * flush pages runs on libuv workers * added uv events to tp workers * dont cross datafile spinlock and rwlock * should be unlock * prevent the creation of multiple datafiles * break an infinite replication loop * do not log the epxansion of the replication window due to start streaming * log all invalid pages with internal checks * do not shutdown event loop threads * add information about collected page events, to find the root cause of invalid collected pages * rewrite of the gap filling to fix the invalid collected pages problem * handle multiple collections of the same metric gracefully * added log about main cache page conflicts; fix gap filling once again... * keep track of the first metric writer * it should be an internal fatal - it does not harm users * do not check of future timestamps on collected pages, since we inherit the clock of the children; do not check collected pages validity without internal checks * prevent negative replication completion percentage * internal error for the discrepancy of mrg * better logging of dbengine new metrics collection * without internal checks it is unused * prevent pluginsd crash on exit due to calling pthread_cancel() on an exited thread * renames and atomics everywhere * if a datafile cannot be acquired for deletion during shutdown, continue - this can happen when there are hot pages in open cache referencing it * Debug for context load * rrdcontext uuid debug * rrddim uuid debug * rrdeng uuid debug * Revert "rrdeng uuid debug" This reverts commit 393da190826a582e7e6cc90771bf91b175826d8b. * Revert "rrddim uuid debug" This reverts commit 72150b30408294f141b19afcfb35abd7c34777d8. * Revert "rrdcontext uuid debug" This reverts commit 2c3b940dc23f460226e9b2a6861c214e840044d0. * Revert "Debug for context load" This reverts commit 0d880fc1589f128524e0b47abd9ff0714283ce3b. * do not use legacy uuids on multihost dbs * thread safety for journafile size * handle other cases of inconsistent collected pages * make health thread check if it should be running in key loops * do not log uuids Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-20track memory footprint of Netdata (#14294)Costa Tsaousis
* track memory footprint of Netdata * track db modes alloc/ram/save/map * track system info; track sender and receiver * fixes * more fixes * track workers memory, onewayalloc memory; unify judyhs size estimation * track replication structures and buffers * Properly clear host RRDHOST_FLAG_METADATA_UPDATE flag * flush the replication buffer every 1000 times the circular buffer is found empty * dont take timestamp too frequently in sender loop * sender buffers are not used by the same thread as the sender, so they were never recreated - fixed it * free sender thread buffer on replication threads when replication is idle * use the last sender flag as a timestamp of the last buffer recreation * free cbuffer before reconnecting * recreate cbuffer on every flush * timings for journal v2 loading * inlining of metric and cache functions * aral likely/unlikely * free left-over thread buffers * fix NULL pointer dereference in replication * free sender thread buffer on sender thread too * mark ctx as used before flushing * better logging on ctx datafiles closing Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-18DBENGINE v2 - improvements part 5 (#14289)Costa Tsaousis
* cleanup journal v2 mounts periodically * fix for last commit * re-enable loading page from disk when the arrangement of pages requires it * Remove unused statistics * Estimate diskspace when the current datafile is full and queue a rotate command (Currently it will not attempt to estimate end size for journals) Queue a command to check quota on startup per tier * apps.plugin now exposes RSS chart * shorter thread names to make debugging easier, since thread names can only be 15 characters * more thread names fixes * allow an apps_groups.conf target to be pid 0 or 1 Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-18Revert health to run in a single thread (#14244)Emmanuel Vasilakis
* revert health to single thread * remove getting now * use a health struct * remove commented code * cleanup health log from metdata * dont check for METADATA_UPDATE
2023-01-17Make sure variables are streamed after SENDER_CONNECTED flag is set (#14283)Emmanuel Vasilakis
make sure vars are sent after SENDER_CONNECTED flag is set
2023-01-14More 32bit fixes (#14264)Costa Tsaousis
* query planer weight calculation using long long * adjust replication query ahead pipeline for smaller systems * do not generate huge replication messages * add message to indicate replication message was interrupted * improved message * max replication size 25% of sender buffer * fix for last commit * use less cache and smaller page sizes and fewer threads on 32-bits * fix reserved libuv workers for 32bits * fix detection of 32/64 bit
2023-01-13DBENGINE v2 - improvements 2 (#14257)Costa Tsaousis
* allow extents to be merged for as long as possible * do not block the event loop while recalculating retention due to datafile rotation * buffers are incrementally cleaned up, every second, by just 1 entry * fix order of commands * remove newline * measure cancelled extent read requests * count all cancelled extent requests * do not double count failed pages * fixed cancelled name * Fix error and warnings when compiling with --disable-dbengine * when the timeframe is outside retention and whole query should fail * do not mark as failed pages that have been loaded but have been skipped * added chart to show cache memory calculation variables * LONG_MAX for 32-bit compatibility * fix cache size calculation on 32-bit * fix cache size calculation on 32-bit - use unsinged long long * fix compilation warnings on 32-bits * fix another compilation warning on 32-bits * fix compilation warnings on older 32-bit compilers * fix compilation warnings on older 32-bit compilers - more of them * disable ML threads joining Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-13Enable retries for SSL_ERROR_WANT_READ (#14120)Emmanuel Vasilakis
* enable retries for SSL_ERROR_WANT_READ * only when bytes is <= 0 * treat ERROR_WANT_READ/WRITE as 0 bytes * dont close connection on zero bytes * reuse ssl connection * treat zero bytes * ifdef for old openssl * revert check
2023-01-10DBENGINE v2 (#14125)Costa Tsaousis
* count open cache pages refering to datafile * eliminate waste flush attempts * remove eliminated variable * journal v2 scanning split functions * avoid locking open cache for a long time while migrating to journal v2 * dont acquire datafile for the loop; disable thread cancelability while a query is running * work on datafile acquiring * work on datafile deletion * work on datafile deletion again * logs of dbengine should start with DBENGINE * thread specific key for queries to check if a query finishes without a finalize * page_uuid is not used anymore * Cleanup judy traversal when building new v2 Remove not needed calls to metric registry * metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent * disable checks for invalid time-ranges * Remove type from page details * report scanning time * remove infinite loop from datafile acquire for deletion * remove infinite loop from datafile acquire for deletion again * trace query handles * properly allocate array of dimensions in replication * metrics cleanup * metrics registry uses arrayalloc * arrayalloc free should be protected by lock * use array alloc in page cache * journal v2 scanning fix * datafile reference leaking hunding * do not load metrics of future timestamps * initialize reasons * fix datafile reference leak * do not load pages that are entirely overlapped by others * expand metric retention atomically * split replication logic in initialization and execution * replication prepare ahead queries * replication prepare ahead queries fixed * fix replication workers accounting * add router active queries chart * restore accounting of pages metadata sources; cleanup replication * dont count skipped pages as unroutable * notes on services shutdown * do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file * do not add pages we dont need to pdc * time in range re-work to provide info about past and future matches * finner control on the pages selected for processing; accounting of page related issues * fix invalid reference to handle->page * eliminate data collection handle of pg_lookup_next * accounting for queries with gaps * query preprocessing the same way the processing is done; cache now supports all operations on Judy * dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3) * get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query * finner gaps calculation; accounting of overlapping pages in queries * fix gaps accounting * move datafile deletion to worker thread * tune libuv workers and thread stack size * stop netdata threads gradually * run indexing together with cache flush/evict * more work on clean shutdown * limit the number of pages to evict per run * do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction * economies on flags for smaller page footprint; cleanup and renames * eviction moves referenced pages to the end of the queue * use murmur hash for indexing partition * murmur should be static * use more indexing partitions * revert number of partitions to number of cpus * cancel threads first, then stop services * revert default thread stack size * dont execute replication requests of disconnected senders * wait more time for services that are exiting gradually * fixed last commit * finer control on page selection algorithm * default stacksize of 1MB * fix formatting * fix worker utilization going crazy when the number is rotating * avoid buffer full due to replication preprocessing of requests * support query priorities * add count of spins in spinlock when compiled with netdata internal checks * remove prioritization from dbengine queries; cache now uses mutexes for the queues * hot pages are now in sections judy arrays, like dirty * align replication queries to optimal page size * during flushing add to clean and evict in batches * Revert "during flushing add to clean and evict in batches" This reverts commit 8fb2b69d068499eacea6de8291c336e5e9f197c7. * dont lock clean while evicting pages during flushing * Revert "dont lock clean while evicting pages during flushing" This reverts commit d6c82b5f40aeba86fc7aead062fab1b819ba58b3. * Revert "Revert "during flushing add to clean and evict in batches"" This reverts commit ca7a187537fb8f743992700427e13042561211ec. * dont cross locks during flushing, for the fastest flushes possible * low-priority queries load pages synchronously * Revert "low-priority queries load pages synchronously" This reverts commit 1ef2662ddcd20fe5842b856c716df134c42d1dc7. * cache uses spinlock again * during flushing, dont lock the clean queue at all; each item is added atomically * do smaller eviction runs * evict one page at a time to minimize lock contention on the clean queue * fix eviction statistics * fix last commit * plain should be main cache * event loop cleanup; evictions and flushes can now happen concurrently * run flush and evictions from tier0 only * remove not needed variables * flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time * added worker jobs in timer to find the slow part of it * support fast eviction of pages when all_of_them is set * revert default thread stack size * bypass event loop for dispatching read extent commands to workers - send them directly * Revert "bypass event loop for dispatching read extent commands to workers - send them directly" This reverts commit 2c08bc5bab12881ae33bc73ce5dea03dfc4e1fce. * cache work requests * minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors * publish flushed pages to open cache in the thread pool * prevent eventloop requests from getting stacked in the event loop * single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c * more rrdengine.c cleanup * enable db rotation * do not log when there is a filter * do not run multiple migration to journal v2 * load all extents async * fix wrong paste * report opcodes waiting, works dispatched, works executing * cleanup event loop memory every 10 minutes * dont dispatch more work requests than the number of threads available * use the dispatched counter instead of the executing counter to check if the worker thread pool is full * remove UV_RUN_NOWAIT * replication to fill the queues * caching of extent buffers; code cleanup * caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation * single transaction wal * synchronous flushing * first cancel the threads, then signal them to exit * caching of rrdeng query handles; added priority to query target; health is now low prio * add priority to the missing points; do not allow critical priority in queries * offload query preparation and routing to libuv thread pool * updated timing charts for the offloaded query preparation * caching of WALs * accounting for struct caches (buffers); do not load extents with invalid sizes * protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset * also check if the expanded before is not the chart later updated time * also check if the expanded before is not after the wall clock time of when the query started * Remove unused variable * replication to queue less queries; cleanup of internal fatals * Mark dimension to be updated async * caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol) * disable pgc stress test, under an ifdef * disable mrg stress test under an ifdef * Mark chart and host labels, host info for async check and store in the database * dictionary items use arrayalloc * cache section pages structure is allocated with arrayalloc * Add function to wakeup the aclk query threads and check for exit Register function to be called during shutdown after signaling the service to exit * parallel preparation of all dimensions of queries * be more sensitive to enable streaming after replication * atomically finish chart replication * fix last commit * fix last commit again * fix last commit again again * fix last commit again again again * unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication * do not cancel start streaming; use high priority queries when we have locked chart data collection * prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered * opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue * Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore * Fix bad memory allocation / cleanup * Cleanup ACLK sync initialization (part 1) * Don't update metric registry during shutdown (part 1) * Prevent crash when dashboard is refreshed and host goes away * Mark ctx that is shutting down. Test not adding flushed pages to open cache as hot if we are shutting down * make ML work * Fix compile without NETDATA_INTERNAL_CHECKS * shutdown each ctx independently * fix completion of quiesce * do not update shared ML charts * Create ML charts on child hosts. When a parent runs a ML for a child, the relevant-ML charts should be created on the child host. These charts should use the parent's hostname to differentiate multiple parents that might run ML for a child. The only exception to this rule is the training/prediction resource usage charts. These are created on the localhost of the parent host, because they provide information specific to said host. * check new ml code * first save the database, then free all memory * dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db * Cleanup metadata thread (part 2) * increase refcount before dispatching prep command * Do not try to stop anomaly detection threads twice. A separate function call has been added to stop anomaly detection threads. This commit removes the left over function calls that were made internally when a host was being created/destroyed. * Remove allocations when smoothing samples buffer The number of dims per sample is always 1, ie. we are training and predicting only individual dimensions. * set the orphan flag when loading archived hosts * track worker dispatch callbacks and threadpool worker init * make ML threads joinable; mark ctx having flushing in progress as early as possible * fix allocation counter * Cleanup metadata thread (part 3) * Cleanup metadata thread (part 4) * Skip metadata host scan when running unittest * unittest support during init * dont use all the libuv threads for queries * break an infinite loop when sleep_usec() is interrupted * ml prediction is a collector for several charts * sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit * worker_unregister() in netdata threads cleanup * moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h * fixed ML issues * removed engine2 directory * added dbengine2 files in CMakeLists.txt * move query plan data to query target, so that they can be exposed by in jsonwrap * uniform definition of query plan according to the other query target members * event_loop should be in daemon, not libnetdata * metric_retention_by_uuid() is now part of the storage engine abstraction * unify time_t variables to have the suffix _s (meaning: seconds) * old dbengine statistics become "dbengine io" * do not enable ML resource usage charts by default * unify ml chart families, plugins and modules * cleanup query plans from query target * cleanup all extent buffers * added debug info for rrddim slot to time * rrddim now does proper gap management * full rewrite of the mem modes * use library functions for madvise * use CHECKSUM_SZ for the checksum size * fix coverity warning about the impossible case of returning a page that is entirely in the past of the query * fix dbengine shutdown * keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently * fine tune cache evictions * dont initialize health if the health service is not running - prevent crash on shutdown while children get connected * rename AS threads to ACLK[hostname] * prevent re-use of uninitialized memory in queries * use JulyL instead of JudyL for PDC operations - to test it first * add also JulyL files * fix July memory accounting * disable July for PDC (use Judy) * use the function to remove datafiles from linked list * fix july and event_loop * add july to libnetdata subdirs * rename time_t variables that end in _t to end in _s * replicate when there is a gap at the beginning of the replication period * reset postponing of sender connections when a receiver is connected * Adjust update every properly * fix replication infinite loop due to last change * packed enums in rrd.h and cleanup of obsolete rrd structure members * prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests() * void unused variable * void unused variables * fix indentation * entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps * macros to caclulate page entries by time and size * prevent statsd cleanup crash on exit * cleanup health thread related variables Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com> Co-authored-by: vkalintiris <vasilis@netdata.cloud>
2023-01-04Fix typos (#14194)Dimitris Apostolou
2022-12-09don't log too much about streaming connections (#14117)Costa Tsaousis
2022-12-02replication fixes 9 (#14079)Costa Tsaousis
* replication fixes 9 * no room metric is now absolute * decrement senders full and flip the actions on sender full and sender available * finer timings * execute all requests, unconditionally, once they are in the replication queue; worker charts are now sorted * remove left-over debug message * log verification of replication completion only when there are no pending requests any more * use up to 50% of the sender buffer for replication responses
2022-12-01fix SSL related crashes (#14076)Costa Tsaousis
prevent concurrent initialization of SSL from multiple threads; prevent re-initialization of SSL
2022-11-29Strict control of streaming API keys and MACHINE GUIDs in stream.conf (#14063)Costa Tsaousis
do not allow machine guids to be used as API keys
2022-11-29replication fixes No 8 (#14061)Costa Tsaousis
* replication requests with start_streaming=true are executed immediately upon reception, instead of being placed in the queue * disable thread cancelability while workers cleanup * remove obsolete worker from replication * multi-threaded replication with netdata.conf option to set number of replication threads * revert spinlock to mutex * separate worker and main thread worker jobs * restart the queue every 10 seconds only * use atomic for sender buffer percentage * reset the queue position after sleeping * use sender resets to sleep properly * fix condition * cleanup sender members related to replication
2022-11-28replication fixes No 7 (#14053)Costa Tsaousis
* move global statistics workers to a separate thread; query statistics per query source; query statistics for ML, exporters, backfilling; reset replication point in time every 10 seconds, instead of every 1; fix compilation warnings; optimize the replication queries code; prevent long tail of replication requests (big sleeps); provide query statistics about replication ; optimize replication sender when most senders are full; optimize replication_request_get_first_available(); reset replication completion calculation; * remove workers utilization from global statistics thread
2022-11-25replication fixes #6 (#14046)Costa Tsaousis
use the faster monotonic clock in workers and replication; avoid unecessary statistics function on every request on replication - gather them all together once every second; check the chart flags on all mirrored hosts, not only the ones that have a sender; cleanup and unify replication logs; added child world time to REND; fix first BEGIN been transmitted when replication starts;
2022-11-24replication fixes #5 (#14038)Costa Tsaousis
* pluginsd cleanup; replication logic cleanup; fix bug in replication begin * log replication start/stop and change the keyword of NETDATA_LOG_REPLICATION_REQUESTS logs to REPLAY * dont ask for data the child does not have; log fixes * more pluginsd cleanup * count sender dictionary entries * fix dictionary_flush()
2022-11-22do not merge duplicate replication requests (#14037)Costa Tsaousis
do not merge duplicate replication requests - just ignore them
2022-11-22Replication fixes #3 (#14035)Costa Tsaousis
* cleanup and additional information about replication * fix deadlock on sender mutex * do not ignore start streaming empty requests; when there duplicate requests, merge them * flipped the flag * final touch * added queued flag on the charts to prevent them from being obsoleted by the service thread
2022-11-22use 2 levels of judy arrays to speed up replication on very busy parents ↵Costa Tsaousis
(#14031) * use 2 levels of judy arrays to speed up replication on very busy parents * delete requests from judy when executed * do not process requests when the sender is not connected; not all requests removed are executed, so count the executed accurately * flush replication data on sender disconnect * cache used buffer ratio in sender structure * cleanup replication requests when they are not valid any more * properly update inner and outer judy arrays on deletions * detailed replication stats * fix bug in dictionary where deletion and flushes lead to crashes * replication should only report retention of tier 0 * replication now has 2 buffer limits 10 -> 30 % * detailed statistics about resets and skipped requests * register worker metrics * added counter for waits * make new requests discoverable * make it continue on iterations properly
2022-11-21Fix documentation TLS streaming (#14024)thiagoftsm
2022-11-20streaming compression, query planner and replication fixes (#14023)Costa Tsaousis
* streaming compression, query planner and replication fixes * remove journal v2 stats from global statistics * disable sql for checking past sql UUIDs * single threaded replication * final replication thread using dictionaries and JudyL for sorting the pending requests * do not timeout the sending socket when there are pending replication requests * streaming receiver using read() instead of fread() * remove FILE * from streaming - now using posix read() and write() * increase timeouts to 10 minutes * apply sender timeout only when there are metrics that are supposed to be streamed * error handling in replication * remove retries on socket read timeout; better error messages * take into account inbound traffic too to detect that a connection is stale * remove race conditions from replication thread * make sure deleted entries are marked as executed, so that even if deletion fails, they will not be executed * 2 minutes timeout to retry streaming to a parent that already has this node * remove unecessary condition check * fix compilation warnings * include judy in replication * wrappers to handle retries for SSL_read and SSL_write * compressed bytes read monitoring * recursive locks on replication to make it faster during flush or cleanup * replication completion chart at the receiver side * simplified recursive mutex * simplified recursive mutex again
2022-11-15Revert "New journal disk based indexing for agent memory reduction" (#14000)Costa Tsaousis
Revert "New journal disk based indexing for agent memory reduction (#13885)" This reverts commit 224b051a2b2bab39a4b536e531ab9ca590bf31bb.
2022-11-15New journal disk based indexing for agent memory reduction (#13885)Stelios Fragkakis
* Add read only option to netdata_mmap so files are accessed ousing PROT_READ * Initial functions to write the new journal file and switch to the new indexing * Cleanup code, add parameters to pg_cache_punch_hole to avoid updating page latets oldest times pg_cache insert to have parameter if page index locked needs to be done Page eviction functions will try to deallocate the descriptor as well (pg_cache_punch_hole without page_index time updates) Cleanup messages during startup * Cleanup messages during startup * Disbale extent caching for now, add placeholder for journal indexing and activation while the agent is running * Add main function to populate descriptors by checking the new journal indexing * prevent crash * fix for binary search crash * Avoid Time-of-check time-of-use filesystem race condition * always add a page * populate fixes - it is still incomplete * pg_cache_insert returns the descriptor that ends up in the page_index * Add populate next (Fix 1) * Fix compilation warnings, reactivate extent caching * Add populate next (Fix 2) * Add populate next (Fix 3) switch to the next entry or journal file when asking to populate descriptor with next * Fix resource leak and wrong sizeof * Rework page population (part 1) * Additional checksums added / journal validation * Cleanup (part 1) * Locking added and Cleanup (part 2) * Close journal file after new journal index activation * Skip warning when compiling without NETDATA_INTERNAL_CHECKS * Ignore empty index file (header and trailer and no metrics) * Try to remove all evicted descriptors (may prevent slight memory increase) * Evict pages also when we succesfully do try_reserve * Precache pages and cleanup * Add a separate cleanup thread to release unused descriptors * Check existence of key correctly * Fix total file size calculation * Statistics for journal descriptors * Track and release jourval v2 descriptors * Do not try to allocate pages for locality if under pressure * Do not track v2 descriptors when populating the page_index * Track page descriptors as they are inserted in the page index (per journal file) Scan journal files for pending items to cleanup Cleanup v2 descriptors only if they are not populated Check before adding to page cache to avoid memory allocation /free * Close journal file that has been processed and migrated to the new index Check for valid file before trying to truncate / close. This file has been closed during startup * Better calculation for the number of prefetched data pages based on the query end time Code cleanup and comments Add v2 populated descriptor expiration based on journal access time * Code cleanup * Faster indexing Better journal validation (more sanity checks) Detect new datafile/ journal creation and trigger index generation Switch to the new index / mark descriptors in memory as needed Update journal access time when a descriptor is returned Code cleanup (part 1) * Re activate descriptor clean Code cleanup * Allow locality precaching * Allow locality precaching for the same page alignment * Descriptor cleanup internal changed * Disable locality precaching * Precache only if not under pressure / internal cleanup at 60 seconds * Remove unused functions * Migrate on startup always Make sure the metric uuid is valid (we have a page_index) Prevent crash if no datafile is available when logging an error Remove unused functions * New warn limit for precaching Stress test v2 descriptor cleanup - Every 1s cleanup if it doesnt exist in cache - 60s cache eviction * Arrayalloc internal checks on free activated with NETDATA_ARRAYALLOC_INTERNAL_CHECKS Ability to add DESCRIPTOR_EXPIRATION_TIME and DESCRIPTOR_INTERVAL_CLEANUP during compile Defaults DESCRIPTOR_INTERVAL_CLEANUP = 60 and DESCRIPTOR_EXPIRATION_TIME = 600 * Lookup page index correctly * Calculate index time once * Detect a duplicate page when doing cache insert and during flushing of pages * Better logging * Descriptor validation (extent vs page index) when building an index file while the agent is running * Mark invalid entries in the journal v2 file * Schedule an index rebuild if a descriptor is found without an extent in the timerange we are processing Release descriptor lock to prevent random shutdown locks * Proper unlock * Skip descriptor cleanup when journal file v2 migration is running * Fix page cache statistics Remove multiple entries of the page_index from the page cache Cleanup * Adjust preload pages on pg_cache_next. Handle invalid descriptor properly Unlock properly * Better handling of invalid pages Journal indexing during runtime will scan all files to find potential ones to index * Reactivate migration on startup Evict descriptors to cause migration Don't count the entries in page index (calculate when processing the extent list) Check for valid extent since we may set the extent to NULL on startup if it is invalid Better structure init Address valgrind issues * Add don't fork/dump option * Add separate lock to protect accessing a datafile's extent list Comment out some unused code (for now) Abort descriptor cleanup if we are force flushing pages (page cache under pressure) * Check for index and schedule when data flush completes Configure max datafile size during compilation Keep a separate JudyL array for descriptors Skip quota test if we are deleting descriptors or explicitly flushing pages under pressure * Fix * set function when waiters are waken up * add the line number to trace the deadlock * add thread id * add wait list * init to zero * disable thread cancelability inside dbengine rrdeng_load_page_next() * make sure the owner is the thread * disable thread cancelability for replication as a whole * Check and queue indexing after first page flush * Queue indexing after a small delay to allow some time for page flushing * tracing of waiters o