path: root/daemon/unit_test.c
Age  Commit message  [Author]
2024-02-06  Move daemon/ under src/ (#16933)  [vkalintiris]
2024-01-23  DYNCFG: dynamically configured alerts (#16779)  [Costa Tsaousis]
* cleanup alerts * fix references * fix references * fix references * load alerts once and apply them to each node * simplify health_create_alarm_entry() * Compile without warnings with compiler flags: -Wall -Wextra -Wformat=2 -Wshadow -Wno-format-nonliteral -Winit-self * code re-organization and cleanup * generate patterns when applying prototypes; give unique dyncfg names to all alerts * eval expressions keep the source and the parsed_as as STRING pointers * renamed host to node in dyncfg ids * renamed host to node in dyncfg ids * add all cloud roles to the list of parsed X-Netdata-Role header and also default to member access level * working functionality * code re-organization: moved health event-loop to a new file, moved health globals to health.c * rrdcalctemplate is removed; alert_cfg is removed; foreach dimension is removed; RRDCALCs are now instanciated only when they are linked to RRDSETs * dyncfg alert prototypes initialization for alerts * health dyncfg split to separate file * cleanup not-needed code * normalize matches between parsing and json * also detect !* for disabled alerts * dyncfg capability disabled * Store alert config part1 * Add rrdlabels_common_count * wip health variables lookup without indexes * Improve rrdlabels_common_count by reusing rrdlabels_find_label_with_key_unsafe with an additional parameter * working variables with runtime lookup * working variables with runtime lookup * delete rrddimvar and rrdfamily index * remove rrdsetvar; now all variables are in RRDVARs inside hosts and charts * added /api/v1/variable that resolves a variable the same way alerts do * remove rrdcalc from eval * remove debug code * remove duplicate assignment * Fix memory leak * all alert variables are now handled by alert_variable_lookup() and EVAL is now independent of alerts * hide all internal structures of EVAL * Enable -Wformat flag Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> * Adjust binding for calculation, warning, critical * Remove unused macro * Update config hash id * use the right info and summary in alerts log * use synchronous queries for alerts * Handle cases when config_hash_id is missing from health_log * remove deadlock from health worker * parsing to json payload for health alert prototypes * cleaner parsing and avoiding memory leaks in case of duplicate members in json * fix left-over rename of function * Keep original lookup field to send to the cloud Cleanup / rename function to store config Remove unused DEFINEs, functions * Use ac->lookup * link jobs to the host when the template is registered; do not accept running a function without a host * full dyncfg support for health alerts, except action TEST * working dyncfg additions, updates, removals * fixed missing source, wrong status updates * add alerts by type, component, classification, recipient and module at the /api/v2/alerts endpoint * fix dyncfg unittest * rename functions * generalize the json-c parser macros and move them to libnetdata * report progress when enabling and disabling dyncfg templates * moved rrdcalc and rrdvar to health * update alarms * added schema for alerts; separated alert_action_options from rrdr_options; restructured the json payload for alerts * enable parsed json alerts; allow sending back accepted but disabled * added format_version for alerts payload; enables/disables status now is also inheritted by the status of the rules; fixed variable names in json output * remove the RRDHOST pointer from DYNCFG * Fix command field submitted to the cloud * do not send updates 
to creation requests, for DYNCFG jobs --------- Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com> Co-authored-by: Tasos Katsoulas <tasos@netdata.cloud> Co-authored-by: ilyam8 <ilya@netdata.cloud>
2024-01-11  Delete memory mode "map" and "save". (#16604)  [vkalintiris]
* Delete memory modes "map" and "save". * Remove unmaintained exporting tests * Remove references of map/save modes in docs. * Remove more references to map/save from docs.
2024-01-11  Name storage engine variables consistently. (#16753)  [vkalintiris]
* Consistent naming of STORAGE_INSTANCE instances. Replace usages of `db_instance` and `instance` with `si`. * Rename array `storage_metrics_groups[tier]` to `smg[tier]` * Rename db_metric_handle to smh * Rename instances of `storage_engine_query_handle` to `seqh`. * Rename instances of STORAGE_ENGINE_BACKEND to `seb`. * Rename instances of STORAGE_COLLECT_HANDLE to `sch`.
2023-12-27  Shutdown dbengine event loop properly (#16658)  [Stelios Fragkakis]
* Shutdown dbengine event loop properly * Adjust messages
2023-12-14  Remove assert (#16611)  [Stelios Fragkakis]
2023-12-01  Code cleanup (#16448)  [Stelios Fragkakis]
* Code cleanup * More cleanup * More cleanup * Use FILENAME_MAX * query fix
2023-11-23  Handle ephemeral hosts (#16381)  [Stelios Fragkakis]
* Handle ephemeral hosts * Node ephemeral removal timeout 86400 seconds (1 day) * Move config from health to global section * Set a node to queryable false when it is ephemeral and is removed * Log queryable. Send queryable=0 only when forcing host deletion (the node is ephemeral) * Switch to "is ephemeral node". Document stream.conf * Unregister node id
2023-11-22  New logging layer (#16357)  [Costa Tsaousis]
* cleanup of logging - wip * first working iteration * add errno annotator * replace old logging functions with netdata_logger() * cleanup * update error_limit * fix remanining error_limit references * work on fatal() * started working on structured logs * full cleanup * default logging to files; fix all plugins initialization * fix formatting of numbers * cleanup and reorg * fix coverity issues * cleanup obsolete code * fix formatting of numbers * fix log rotation * fix for older systems * add detection of systemd journal via stderr * finished on access.log * remove left-over transport * do not add empty fields to the logs * journal get compact uuids; X-Transaction-ID header is added in web responses * allow compiling on systems without memfd sealing * added libnetdata/uuid directory * move datetime formatters to libnetdata * add missing files * link the makefiles in libnetdata * added uuid_parse_flexi() to parse UUIDs with and without hyphens; the web server now read X-Transaction-ID and uses it for functions and web responses * added stream receiver, sender, proc plugin and pluginsd log stack * iso8601 advanced usage; line_splitter module in libnetdata; code cleanup * add message ids to streaming inbound and outbound connections * cleanup line_splitter between lines to avoid logging garbage; when killing children, kill them with SIGABRT if internal checks is enabled * send SIGABRT to external plugins only if we are not shutting down * fix cross cleanup in pluginsd parser * fatal when there is a stack error in logs * compile netdata with -fexceptions * do not kill external plugins with SIGABRT * metasync info logs to debug level * added severity to logs * added json output; added options per log output; added documentation; fixed issues mentioned * allow memfd only on linux * moved journal low level functions to journal.c/h * move health logs to daemon.log with proper priorities * fixed a couple of bugs; health log in journal * updated docs * systemd-cat-native command to push structured logs to journal from the command line * fix makefiles * restored NETDATA_LOG_SEVERITY_LEVEL * fix makefiles * systemd-cat-native can also work as the logger of Netdata scripts * do not require a socket to systemd-journal to log-as-netdata * alarm notify logs in native format * properly compare log ids * fatals log alerts; alarm-notify.sh working * fix overflow warning * alarm-notify.sh now logs the request (command line) * anotate external plugins logs with the function cmd they run * added context, component and type to alarm-notify.sh; shell sanitization removes control character and characters that may be expanded by bash * reformatted alarm-notify logs * unify cgroup-network-helper.sh * added quotes around params * charts.d.plugin switched logging to journal native * quotes for logfmt * unify the status codes of streaming receivers and senders * alarm-notify: dont log anything, if there is nothing to do * all external plugins log to stderr when running outside netdata; alarm-notify now shows an error when notifications menthod are needed but are not available * migrate cgroup-name.sh to new logging * systemd-cat-native now supports messages with newlines * socket.c logs use priority * cleanup log field types * inherit the systemd set INVOCATION_ID if found * allow systemd-cat-native to send messages to a systemd-journal-remote URL * log2journal command that can convert structured logs to journal export format * various fixes and documentation of log2journal * updated log2journal docs * updated 
log2journal docs * updated documentation of fields * allow compiling without libcurl * do not use socket as format string * added version information to newly added tools * updated documentation and help messages * fix the namespace socket path * print errno with error * do not timeout * updated docs * updated docs * updated docs * log2journal updated docs and params * when talking to a remote journal, systemd-cat-native batches the messages * enable lz4 compression for systemd-cat-native when sending messages to a systemd-journal-remote * Revert "enable lz4 compression for systemd-cat-native when sending messages to a systemd-journal-remote" This reverts commit b079d53c11f6687cd64d804fdd7b24c0492bf245. * note about uncompressed traffic * log2journal: code reorg and cleanup to make modular * finished rewriting log2journal * more comments * rewriting rules support * increased limits * updated docs * updated docs * fix old log call * use journal only when stderr is connected to journal * update netdata.spec for libcurl, libpcre2 and log2journal * pcre2-devel * do not require pcre2 in centos < 8, amazonlinux < 2023, open suse * log2journal only on systems pcre2 is available * ignore log2journal in .gitignore * avoid log2journal on centos 7, amazonlinux 2 and opensuse * add pcre2-8 to static build * undo last commit * Bundle to static Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> * Add build deps for deb packages Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> * Add dependencies; build from source Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> * Test build for amazon linux and centos expect to fail for suse Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> * fix minor oversight Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> * Reorg code * Add the install from source (deps) as a TODO * Not enable the build on suse ecosystem Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> --------- Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> Co-authored-by: Tasos Katsoulas <tasos@netdata.cloud>
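For context on the formats named in this entry: log2journal takes structured log lines (logfmt or JSON) and emits systemd's journal export format, which is a series of FIELD=VALUE lines with a blank line terminating each record. The mapping below is purely illustrative; the fields Netdata actually emits and the rewriting rules applied are configurable and differ from this sketch.

    # illustrative logfmt input
    time=2023-11-22T10:15:30Z level=error comp=streaming msg="connection reset by peer"

    # possible journal export format output (field names assumed, one blank-line-terminated record)
    MESSAGE=connection reset by peer
    PRIORITY=3
    COMP=streaming
    SYSLOG_IDENTIFIER=netdata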
2023-10-27  Faster parents (#16127)  [Costa Tsaousis]
* cache ctx in collection handle * cache rd together with rda * do not repeatedy call rrdcontexts - cached collection status; optimize pluginsd_acquire_dimension() * fix unit tests * do the absolutely minimum while updating timestamps, ensure validity during reading them * when the stream is INTERPOLATED, buffer outstanding data for up to 50ms if the buffer contains DATA only. * remove the spinlock from mrg * remove the metric flags that are not used any more * mrg writers can be different threads * update first time when latest clean is also updated * cleanup * set hot page with a simple atomic operation * sender sets chart slot for every chart * work on senders without SLOT * enable SLOT capability * send slot at BEGIN when SLOT is enabled * fix slot generation and parsing * send slot while re-streaming * use the sender capabilities, not the receiver * cleanup * add slots support to all chart and dimension related plugin commands * fix condition * fix calculation * check sender capabilties * assign slots in constructors * we need the dimension slot at the DIMENSION keyword * more debug info in case of dimension mismatch * ensure the RRDDIM EXPOSED flag is multi-threaded and set it after the sender buffer has been committed, so that replication will not send dimensions prematurely * fix renumbering on child restart * reset rda caching when receiving a chart definition * optimize pluginsd_end_v2() * do not do zero sized allocations * trust the chart slot id of the child * cleanup charts on pluginsd thread exit * better cleanup * find the chart and put it in the slot, if it not already there * move slots array to host * initialize pluginsd slots properly * add slots to replay begin; do not cleanup slots that dont belong to a chart * cleanup on obsolete * cleanup slots on obsoletions * cleanup and renames about obsoletion * rewrite obsolation service code to remove race conditions * better service obsoletion log * added debugging * more debug * exposed flag now compares versions * removed debugging messages * respolve conflicts * fix replication check for unsent dimensions
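The "slot" mechanism this entry keeps referring to can be pictured as a per-host array that maps a small integer, assigned by the child and sent with every sample, to the chart it belongs to, replacing a dictionary lookup per line with an array index. A hedged sketch of the idea, with all names invented for illustration:

    /* Sketch only: the child assigns each chart a small integer slot and sends
     * it with every sample, so the parent resolves the chart in O(1). */
    #include <stdlib.h>

    typedef struct rrdset RRDSET;        /* opaque here */

    struct pluginsd_slots {
        RRDSET **array;                  /* slot id -> chart, owned by the receiving host */
        size_t size;
    };

    static RRDSET *slot_get(struct pluginsd_slots *s, size_t slot) {
        return (slot < s->size) ? s->array[slot] : NULL;   /* one array lookup per sample */
    }

    static void slot_put(struct pluginsd_slots *s, size_t slot, RRDSET *st) {
        if (slot >= s->size) {                              /* grow on demand; error handling omitted */
            size_t new_size = slot + 1;
            s->array = realloc(s->array, new_size * sizeof(RRDSET *));
            for (size_t i = s->size; i < new_size; i++)
                s->array[i] = NULL;
            s->size = new_size;
        }
        s->array[slot] = st;
    }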
2023-10-20  Drop an unused index from aclk_alert table (#16242)  [Stelios Fragkakis]
* Drop unused aclk_alert index * Log messages only when compiled with NETDATA_INTERNAL_CHECKS
2023-08-03  Revert "Refactor RRD code. (#15423)" (#15723)  [vkalintiris]
This reverts commit 440bd51e08fdfa2a4daa191fb68643456028a753. dbengine was still being used for non-zero tiers even on non-dbengine modes.
2023-07-26  Refactor RRD code. (#15423)  [vkalintiris]
* Storage engine. * Host indexes to rrdb * Move globals to rrdb * Move storage_tiers_backfill to rrdb * default_rrd_update_every to rrdb * default_rrd_history_entries to rrdb * gap_when_lost_iterations_above to rrdb * rrdset_free_obsolete_time_s to rrdb * libuv_worker_threads to rrdb * ieee754_doubles to rrdb * rrdhost_free_orphan_time_s to rrdb * rrd_rwlock to rrdb * localhost to rrdb * rm extern from func decls * mv rrd macro under rrd.h * default_rrdeng_page_cache_mb to rrdb * default_rrdeng_extent_cache_mb to rrdb * db_engine_journal_check to rrdb * default_rrdeng_disk_quota_mb to rrdb * default_multidb_disk_quota_mb to rrdb * multidb_ctx to rrdb * page_type_size to rrdb * tier_page_size to rrdb * No storage_engine_id in rrdim functions * storage_engine_id is provided by st * Update to fix merge conflict. * Update field name * Remove unnecessary macros from rrd.h * Rm unused type decls * Rm duplicate func decls * make internal function static * Make the rest of public dbengine funcs accept a storage_instance. * No more rrdengine_instance :) * rm rrdset_debug from rrd.h * Use rrdb to access globals in ML and ACLK Missed due to not having the submodules in the worktree. * rm total_number * rm RRDVAR_TYPE_TOTAL * rm unused inline * Rm names from typedef'd enums * rm unused header include * Move include * Rm unused header include * s/rrdhost_find_or_create/rrdhost_get_or_create/g * s/find_host_by_node_id/rrdhost_find_by_node_id/ Also, remove duplicate definition in rrdcontext.c * rm macro used only once * rm macro used only once * Reduce rrd.h api by moving funcs into a collector specific utils header * Remove unused func * Move parser specific function out of rrd.h * return storage_number instead of void pointer * move code related to rrd initialization out of rrdhost.c * Remove tier_grouping from rrdim_tier Saves 8 * storage_tiers bytes per dimension. * Fix rebase * s/rrd_update_every/update_every/ * Mark functions as static and constify args * Add license notes and file to build systems. * Remove remaining non-log/config mentions of memory mode * Move rrdlabels api to separate file. Also, move localhost functions that loads labels outside of database/ and into daemon/ * Remove function decl in rrd.h * merge rrdhost_cache_dir_for_rrdset_alloc into rrdset_cache_dir * Do not expose internal function from rrd.h * Rm NETDATA_RRD_INTERNALS Only one function decl is covered. We have more database internal functions that we currently expose for no good reason. These will be placed in a separate internal header in follow up PRs. * Add license note * Include libnetdata.h instead of aral.h * Use rrdb to access localhost * Fix builds without dbengine * Add header to build system files * Add rrdlabels.h to build systems * Move func def from rrd.h to rrdhost.c * Fix macos build * Rm non-existing function * Rebase master * Define buffer length macro in ad_charts. * Fix FreeBSD builds. * Mark functions static * Rm func decls without definitions * Rebase master * Rebase master * Properly initialize value of storage tiers. * Fix build after rebase.
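The recurring pattern in this entry is folding long-lived globals (localhost, default update-every, history entries, the RRD rwlock, and so on) into a single rrdb aggregate. A minimal C sketch of that shape, with member names and types assumed for illustration rather than taken from the tree:

    /* Sketch only: the real struct in the Netdata tree has different members;
     * this just illustrates consolidating scattered globals into one aggregate. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct rrdhost RRDHOST;          /* opaque here */

    struct rrdb {
        RRDHOST *localhost;                  /* was a standalone global pointer */
        int default_update_every;            /* was default_rrd_update_every */
        long default_history_entries;        /* was default_rrd_history_entries */
        size_t storage_tiers_backfill;       /* backfilling policy per tier */
        bool ieee754_doubles;                /* capability flag */
        pthread_rwlock_t rrd_rwlock;         /* was a global lock */
    };

    extern struct rrdb rrdb;                 /* single instance, accessed as rrdb.localhost etc. */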
2023-07-06  Code reorg and cleanup - enrichment of /api/v2 (#15294)  [Costa Tsaousis]
* claim script now accepts the same params as the kickstart * rewrote buildinfo to unify all methods * added cloud unavailable in cloud status * added all exporters * renamed httpd to h2o * rename ENABLE_COMPRESSION to ENABLE_LZ4 * rename global variable * rename ENABLE_HTTPS to ENABLE_OPENSSL * fix coverity-scan for openssl * add lz4 to coverity-scan * added all plugins and most of the features * added all plugins and most of the features * generalize bitmap code so that we can have any size of bitmaps * cleanup * fix compilation without protobuf * fix compilation with others allocators * fix bitmap * comprehensive bitmaps unit test * bitmap as macros * added developer mode * added system info to build info * cloud available/unavailable * added /api/v2/info * added units and ni to transitions * when showing instances and transitions, show only the instances that have transitions * cleanup * add missing quotes * add anchor to transitions * added more to build info * calculate retention per tier and expose it to /api/v2/info * added currently collected metrics * do not show space and retention when no numbers are available * fix impossible overflow * Add function for transitions and execute callback * In case of error, reset and try next dictionary entry * Fix error message * simpler logic to maintain retention per tier * /api/v2/alert_transitions * Handle case of recipient null Convert after and before to usec * Add classification, type and component * working /api/v2/alert_transitions * Fix query to properly handle context and alert name * cleanup * Add search with transition * accept transition in /api/v2/alert_transitions * totaly dynamic facets * fixed debug info * restructured facets * cleanup; removal of options=transitions * updated alert entries flags * method to exec * Return also exec run timestamp Temp table cleanup only when we don't execute with a transition * cleanup obsolete anchor parameter * Add sql_get_alert_configuration function * added options=config to alert_transitions * added /api/v2/alert_config * preliminary work for /api/v2/claim * initialize variables; do not expose expected retention if no disk space info is available; do not report aclk as initializing when not claimed * fix claim session key filename * put a newline into the session key file * more progress on claiming * final /api/v2/claim endpoint * after claiming, refresh our state at the output * Fix query to fetch config * Remove debug log * add configuration objects * add configuration objects - fixed * respect the NETDATA_DISABLE_CLOUD env variable * NETDATA_DISABLE_CLOUD env variable sets the default, but the config sets the final value * use a new claimed_id on every claiming * regenerate random key on claiming and wait for online status * ignore write() return value when writing a newline * dont show cloud status disabled when claimed_id is missing * added ctx to alert instances * cleanup config and transitions from /api/v2/alerts * fix unused variable * in /api/v2/alert_config show 1 config without an array * show alert values conditionally, by appending options=values * When storing host info if the key value is empty, store unknown * added options=summary to control when the alerts summary is shown * increased http_api_v2 to version 5 * claming random key file is now not world readable * added local-listeners binary that detects all the listening ports, their IPs and their command lines --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-07-01  Optimizations part 3 (#15293)  [Costa Tsaousis]
* use madvise to speed up indexing * collect all rrddim members into a collector structure * use tier 0 virtual point for storing last stored value * reorganize key fields in rrddim * remove fgets from pluginsd and replace it with read() * properly uncork the web server sockets * Revert "reorganize key fields in rrddim" This reverts commit 2d45fa3959087e05462d387ff115a260f3a04b60. * Revert "use tier 0 virtual point for storing last stored value" This reverts commit a576cdd377ad4778a3b8608cabbb7ea7bb19a3a8. * fix cork names * fix compilation warnings
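"use madvise to speed up indexing" refers to hinting the kernel about the access pattern of a memory-mapped file before scanning it. A generic example of the system call involved, not the agent's actual indexing code:

    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map a file and tell the kernel we are about to read it sequentially,
     * so it can read ahead aggressively; illustrative only. */
    static void *map_for_sequential_scan(const char *path, size_t *size_out) {
        int fd = open(path, O_RDONLY);
        if (fd == -1) return NULL;

        struct stat st;
        if (fstat(fd, &st) == -1) { close(fd); return NULL; }

        void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                                           /* mapping stays valid */
        if (base == MAP_FAILED) return NULL;

        madvise(base, (size_t)st.st_size, MADV_SEQUENTIAL);  /* read-ahead hint */
        madvise(base, (size_t)st.st_size, MADV_WILLNEED);    /* prefetch soon */

        *size_out = (size_t)st.st_size;
        return base;
    }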
2023-06-21  Use a single health log table (#15157)  [Emmanuel Vasilakis]
* move old health log tables to one * change table in sqlite_health * remove check for off period of agent * changes in aclk_alert * fixes * add new field insert_mark_timestamp * cleanup * remove hostname, create the health log table during sqlite init * create the health_log during migration * move source from health_log to alert_hash. Remove class, component and type field from health_log * Register now_usec sqlite function * use global_id instead of insert_mark_timestamp. Use function now_usec to populate it * create functions earlier to have them during migration * small unit test fix * create additional health_log_detail table. Do the insert of an alert event on both * do the update on health_log_detail * change more queries * more indexes, fix inject removed * change last executed and select health log queries * random uuid for sqlite * do migration from old tables * queries to send alerts to cloud * cleanup queries * get an alarm id from db if not found in memory * small fix on query * add info when migration completes * dont pick health_log_detail during migration * check proper old health_log table * safer migration * proper log sent alerts. small fix in claimed cleanup * cleanups * extra check for cleanup * also get an alarm_event_id from sql * check for empty source * remove cleanup of main health log table --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-06-19  Obvious memory reductions (#15204)  [Costa Tsaousis]
* remove rd->update_every * reduce amount of memory for RRDDIM * reorgnize rrddim->db entries * optimize rrdset and statsd * optimize dictionaries * RW_SPINLOCK for dictionaries * fix codeql warning * rw_spinlock improvements * remove obsolete assertion * fix crash on health_alarm_log_process() * use RW_SPINLOCK for AVL trees * add RW_SPINLOCK read/write trylock * pgc and mrg now use rw_spinlocks; cache line optimizations for mrg * thread tag of dbegnine init * append created datafile, lockless * make DOUBLE_LINKED_LIST_APPEND_ITEM_UNSAFE friendly for lockless use * thread cancelability in spinlocks; optimize thread cancelability management * introduce a JudyL to index datafiles and use it during queries to quickly find the relevant files * use the last timestamp of each journal file for indexing * when the previous cannot be found, start from the beginning * add more stats to PDC to trace routing easier * rename spinlock functions * fix for spinlock renames * revert statsd socket statistics to size_t * turn fatal into internal_fatal() * show candidates always * show connected status and connection attempts
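Several of the bullets above move hot structures from mutexes or rwlocks to a reader/writer spinlock. A compact sketch of the technique on C11 atomics; the agent's RW_SPINLOCK has its own implementation details (trylock variants, thread-cancelability handling) not shown here:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* state semantics: 0 = free, >0 = number of readers, -1 = one writer */
    typedef struct { atomic_int state; } rw_spinlock_t;

    static inline void rw_spinlock_init(rw_spinlock_t *l) { atomic_init(&l->state, 0); }

    static inline void rw_spinlock_read_lock(rw_spinlock_t *l) {
        for (;;) {
            int s = atomic_load_explicit(&l->state, memory_order_relaxed);
            if (s >= 0 &&
                atomic_compare_exchange_weak_explicit(&l->state, &s, s + 1,
                                                      memory_order_acquire,
                                                      memory_order_relaxed))
                return;
            /* spin: a writer holds it, or the CAS lost a race */
        }
    }

    static inline void rw_spinlock_read_unlock(rw_spinlock_t *l) {
        atomic_fetch_sub_explicit(&l->state, 1, memory_order_release);
    }

    static inline bool rw_spinlock_write_trylock(rw_spinlock_t *l) {
        int expected = 0;
        return atomic_compare_exchange_strong_explicit(&l->state, &expected, -1,
                                                       memory_order_acquire,
                                                       memory_order_relaxed);
    }

    static inline void rw_spinlock_write_lock(rw_spinlock_t *l) {
        while (!rw_spinlock_write_trylock(l))
            ;  /* spin until there are no readers and no writer */
    }

    static inline void rw_spinlock_write_unlock(rw_spinlock_t *l) {
        atomic_store_explicit(&l->state, 0, memory_order_release);
    }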
2023-04-07  Boost dbengine (#14832)  [Costa Tsaousis]
* configure extent cache size * workers can now execute up to 10 jobs in a run, boosting query prep and extent reads * fix dispatched and executing counters * boost to the max * increase libuv worker threads * query prep always get more prio than extent reads; stop processing in batch when dbengine is queue is critical * fix accounting of query prep * inlining of time-grouping functions, to speed up queries with billions of points * make switching based on a local const variable * print one pending contexts loading message per iteration * inlined store engine query API * inlined storage engine data collection api * inlined all storage engine query ops * eliminate and inline data collection ops * simplified query group-by * more error handling * optimized partial trimming of group-by queries * preparative work to support multiple passes of group-by * more preparative work to support multiple passes of group-by (accepts multiple group-by params) * unified query timings * unified query timings - weights endpoint * query target is no longer a static thread variable - there is a list of cached query targets, each of which of freed every 1000 queries * fix query memory accounting * added summary.dimension[].pri and sorted summary.dimensions based on priority and then name * limit max ACLK WEB response size to 30MB * the response type should be text/plain * more preparative work for multiple group-by passes * create functions for generating group by keys, ids and names * multiple group-by passes are now supported * parse group-by options array also with an index * implemented percentage-of-instance group by function * family is now merged in multi-node contexts * prevent uninitialized use
2023-02-15  JSON internal API, IEEE754 base64/hex streaming, weights endpoint optimization (#14493)  [Costa Tsaousis]
* first work on standardizing json formatting * renamed old grouping to time_grouping and added group_by * add dummy functions to enable compilation * buffer json api work * jsonwrap opening with buffer_json_X() functions * cleanup * storage for quotes * optimize buffer printing for both numbers and strings * removed ; from define * contexts json generation using the new json functions * fix buffer overflow at unit test * weights endpoint using new json api * fixes to weights endpoint * check buffer overflow on all buffer functions * do synchronous queries for weights * buffer_flush() now resets json state too * content type typedef * print double values that are above the max 64-bit value * str2ndd() can now parse values above UINT64_MAX * faster number parsing by avoiding double calculations as much as possible * faster number parsing * faster hex parsing * accurate printing and parsing of double values, even for very large numbers that cannot fit in 64bit integers * full printing and parsing without using library functions - and related unit tests * added IEEE754 streaming capability to enable streaming of double values in hex * streaming and replication to transfer all values in hex * use our own str2ndd for set2 * remove subnormal check from ieee * base64 encoding for numbers, instead of hex * when increasing double precision, also make sure the fractional number printed is aligned to the wanted precision * str2ndd_encoded() parses all encoding formats, including integers * prevent uninitialized use * /api/v1/info using the new json API * Fix error when compiling with --disable-ml * Remove redundant 'buffer_unittest' declaration * Fix formatting * Fix formatting * Fix formatting * fix buffer unit test * apps.plugin using the new JSON API * make sure the metrics registry does not accept negative timestamps * do not allow pages with negative timestamps to be loaded from db files; do not accept pages with negative timestamps in the cache * Fix more formatting --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
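The IEEE754 hex (and later base64) streaming mentioned above transfers the 64-bit pattern of each double verbatim, avoiding both precision loss and the cost of decimal formatting and parsing. A generic round-trip sketch of the hex variant, not the encoder the PR added:

    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Encode the raw IEEE754 bit pattern of a double as 16 hex characters. */
    static void double_to_hex(double value, char out[17]) {
        uint64_t bits;
        memcpy(&bits, &value, sizeof(bits));            /* bit-exact, no rounding */
        snprintf(out, 17, "%016" PRIx64, bits);
    }

    /* Decode it back; the result is bit-identical to the original. */
    static double hex_to_double(const char *hex) {
        uint64_t bits = strtoull(hex, NULL, 16);
        double value;
        memcpy(&value, &bits, sizeof(value));
        return value;
    }

    int main(void) {
        char hex[17];
        double_to_hex(123456789.000000001, hex);
        printf("%s -> %.17g\n", hex, hex_to_double(hex));  /* lossless round-trip */
        return 0;
    }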
2023-02-14  Prevent crash when running '-W createdataset' (#14455)  [Emmanuel Vasilakis]
prepare environment to run the dataset create
2023-01-27  DBENGINE v2 - improvements part 10 (#14332)  [Costa Tsaousis]
* replication cancels pending queries on exit * log when waiting for inflight queries * when there are collected and not-collected metrics, use the context priority from the collected only * Write metadata with a faster pace * Remove journal file size limit and sync mode to 0 / Drop wal checkpoint for now * Wrap in a big transaction remaining metadata writes (test 1) * fix higher tiers when tiering iterations = 2 * dbengine always returns db-aligned points; query engine expands the queries by 2 points in every direction to have enough data for interpolation * Wrap in a big transaction metadata writes (test 2) * replication cancelling fix * do not first and last entry in replication when the db has no retention * fix internal check condition * Increase metadata write batch size * always apply error limit to dbengine logs * Remove code that processes the obsolete health.db files * cleanup in query.c * do not allow queries to go beyond db boundaries * prevent internal log for +1 delta in timestamp * detect gap pages in conflicts * double protection for gap injection in main cache * Add checkpoint to prevent large WAL while running Remove unused and duplicate functions * do not allocate chart cache dir if not needed * add more info to unittests * revert query expansion to satisfy unittests Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-20  track memory footprint of Netdata (#14294)  [Costa Tsaousis]
* track memory footprint of Netdata * track db modes alloc/ram/save/map * track system info; track sender and receiver * fixes * more fixes * track workers memory, onewayalloc memory; unify judyhs size estimation * track replication structures and buffers * Properly clear host RRDHOST_FLAG_METADATA_UPDATE flag * flush the replication buffer every 1000 times the circular buffer is found empty * dont take timestamp too frequently in sender loop * sender buffers are not used by the same thread as the sender, so they were never recreated - fixed it * free sender thread buffer on replication threads when replication is idle * use the last sender flag as a timestamp of the last buffer recreation * free cbuffer before reconnecting * recreate cbuffer on every flush * timings for journal v2 loading * inlining of metric and cache functions * aral likely/unlikely * free left-over thread buffers * fix NULL pointer dereference in replication * free sender thread buffer on sender thread too * mark ctx as used before flushing * better logging on ctx datafiles closing Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
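Tracking the footprint of individual subsystems generally means routing their allocations through wrappers that maintain per-category byte counters. A hedged sketch of that pattern; the category names and the header-based bookkeeping are invented for illustration and are not how Netdata implements it:

    #include <stdatomic.h>
    #include <stdlib.h>

    enum mem_category { MEMCAT_WORKERS, MEMCAT_REPLICATION, MEMCAT_ONEWAYALLOC, MEMCAT_MAX };

    static _Atomic size_t mem_used[MEMCAT_MAX];

    /* Allocate and account: a size_t header remembers how much to subtract on free
     * (real code would keep this metadata elsewhere or align it properly). */
    static void *accounted_malloc(enum mem_category cat, size_t size) {
        size_t *p = malloc(sizeof(size_t) + size);
        if (!p) return NULL;
        *p = size;
        atomic_fetch_add(&mem_used[cat], size);
        return p + 1;
    }

    static void accounted_free(enum mem_category cat, void *ptr) {
        if (!ptr) return;
        size_t *p = (size_t *)ptr - 1;
        atomic_fetch_sub(&mem_used[cat], *p);
        free(p);
    }

    /* e.g. expose atomic_load(&mem_used[MEMCAT_WORKERS]) as a chart dimension */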
2023-01-10  DBENGINE v2 (#14125)  [Costa Tsaousis]
* count open cache pages refering to datafile * eliminate waste flush attempts * remove eliminated variable * journal v2 scanning split functions * avoid locking open cache for a long time while migrating to journal v2 * dont acquire datafile for the loop; disable thread cancelability while a query is running * work on datafile acquiring * work on datafile deletion * work on datafile deletion again * logs of dbengine should start with DBENGINE * thread specific key for queries to check if a query finishes without a finalize * page_uuid is not used anymore * Cleanup judy traversal when building new v2 Remove not needed calls to metric registry * metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent * disable checks for invalid time-ranges * Remove type from page details * report scanning time * remove infinite loop from datafile acquire for deletion * remove infinite loop from datafile acquire for deletion again * trace query handles * properly allocate array of dimensions in replication * metrics cleanup * metrics registry uses arrayalloc * arrayalloc free should be protected by lock * use array alloc in page cache * journal v2 scanning fix * datafile reference leaking hunding * do not load metrics of future timestamps * initialize reasons * fix datafile reference leak * do not load pages that are entirely overlapped by others * expand metric retention atomically * split replication logic in initialization and execution * replication prepare ahead queries * replication prepare ahead queries fixed * fix replication workers accounting * add router active queries chart * restore accounting of pages metadata sources; cleanup replication * dont count skipped pages as unroutable * notes on services shutdown * do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file * do not add pages we dont need to pdc * time in range re-work to provide info about past and future matches * finner control on the pages selected for processing; accounting of page related issues * fix invalid reference to handle->page * eliminate data collection handle of pg_lookup_next * accounting for queries with gaps * query preprocessing the same way the processing is done; cache now supports all operations on Judy * dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3) * get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query * finner gaps calculation; accounting of overlapping pages in queries * fix gaps accounting * move datafile deletion to worker thread * tune libuv workers and thread stack size * stop netdata threads gradually * run indexing together with cache flush/evict * more work on clean shutdown * limit the number of pages to evict per run * do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction * economies on flags for smaller page footprint; cleanup and renames * eviction moves referenced pages to the end of the queue * use murmur hash for indexing partition * murmur should be static * use more indexing partitions * revert number of partitions to number of cpus * cancel threads first, then stop services * revert default thread stack size * dont execute replication requests of disconnected senders * wait more time for services that are 
exiting gradually * fixed last commit * finer control on page selection algorithm * default stacksize of 1MB * fix formatting * fix worker utilization going crazy when the number is rotating * avoid buffer full due to replication preprocessing of requests * support query priorities * add count of spins in spinlock when compiled with netdata internal checks * remove prioritization from dbengine queries; cache now uses mutexes for the queues * hot pages are now in sections judy arrays, like dirty * align replication queries to optimal page size * during flushing add to clean and evict in batches * Revert "during flushing add to clean and evict in batches" This reverts commit 8fb2b69d068499eacea6de8291c336e5e9f197c7. * dont lock clean while evicting pages during flushing * Revert "dont lock clean while evicting pages during flushing" This reverts commit d6c82b5f40aeba86fc7aead062fab1b819ba58b3. * Revert "Revert "during flushing add to clean and evict in batches"" This reverts commit ca7a187537fb8f743992700427e13042561211ec. * dont cross locks during flushing, for the fastest flushes possible * low-priority queries load pages synchronously * Revert "low-priority queries load pages synchronously" This reverts commit 1ef2662ddcd20fe5842b856c716df134c42d1dc7. * cache uses spinlock again * during flushing, dont lock the clean queue at all; each item is added atomically * do smaller eviction runs * evict one page at a time to minimize lock contention on the clean queue * fix eviction statistics * fix last commit * plain should be main cache * event loop cleanup; evictions and flushes can now happen concurrently * run flush and evictions from tier0 only * remove not needed variables * flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time * added worker jobs in timer to find the slow part of it * support fast eviction of pages when all_of_them is set * revert default thread stack size * bypass event loop for dispatching read extent commands to workers - send them directly * Revert "bypass event loop for dispatching read extent commands to workers - send them directly" This reverts commit 2c08bc5bab12881ae33bc73ce5dea03dfc4e1fce. 
* cache work requests * minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors * publish flushed pages to open cache in the thread pool * prevent eventloop requests from getting stacked in the event loop * single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c * more rrdengine.c cleanup * enable db rotation * do not log when there is a filter * do not run multiple migration to journal v2 * load all extents async * fix wrong paste * report opcodes waiting, works dispatched, works executing * cleanup event loop memory every 10 minutes * dont dispatch more work requests than the number of threads available * use the dispatched counter instead of the executing counter to check if the worker thread pool is full * remove UV_RUN_NOWAIT * replication to fill the queues * caching of extent buffers; code cleanup * caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation * single transaction wal * synchronous flushing * first cancel the threads, then signal them to exit * caching of rrdeng query handles; added priority to query target; health is now low prio * add priority to the missing points; do not allow critical priority in queries * offload query preparation and routing to libuv thread pool * updated timing charts for the offloaded query preparation * caching of WALs * accounting for struct caches (buffers); do not load extents with invalid sizes * protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset * also check if the expanded before is not the chart later updated time * also check if the expanded before is not after the wall clock time of when the query started * Remove unused variable * replication to queue less queries; cleanup of internal fatals * Mark dimension to be updated async * caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol) * disable pgc stress test, under an ifdef * disable mrg stress test under an ifdef * Mark chart and host labels, host info for async check and store in the database * dictionary items use arrayalloc * cache section pages structure is allocated with arrayalloc * Add function to wakeup the aclk query threads and check for exit Register function to be called during shutdown after signaling the service to exit * parallel preparation of all dimensions of queries * be more sensitive to enable streaming after replication * atomically finish chart replication * fix last commit * fix last commit again * fix last commit again again * fix last commit again again again * unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication * do not cancel start streaming; use high priority queries when we have locked chart data collection * prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered * opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue * Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore * Fix bad memory allocation / cleanup * Cleanup ACLK sync initialization (part 1) * Don't update metric registry during shutdown (part 1) * Prevent crash when dashboard is refreshed and host goes away * Mark ctx that is shutting down. 
Test not adding flushed pages to open cache as hot if we are shutting down * make ML work * Fix compile without NETDATA_INTERNAL_CHECKS * shutdown each ctx independently * fix completion of quiesce * do not update shared ML charts * Create ML charts on child hosts. When a parent runs a ML for a child, the relevant-ML charts should be created on the child host. These charts should use the parent's hostname to differentiate multiple parents that might run ML for a child. The only exception to this rule is the training/prediction resource usage charts. These are created on the localhost of the parent host, because they provide information specific to said host. * check new ml code * first save the database, then free all memory * dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db * Cleanup metadata thread (part 2) * increase refcount before dispatching prep command * Do not try to stop anomaly detection threads twice. A separate function call has been added to stop anomaly detection threads. This commit removes the left over function calls that were made internally when a host was being created/destroyed. * Remove allocations when smoothing samples buffer The number of dims per sample is always 1, ie. we are training and predicting only individual dimensions. * set the orphan flag when loading archived hosts * track worker dispatch callbacks and threadpool worker init * make ML threads joinable; mark ctx having flushing in progress as early as possible * fix allocation counter * Cleanup metadata thread (part 3) * Cleanup metadata thread (part 4) * Skip metadata host scan when running unittest * unittest support during init * dont use all the libuv threads for queries * break an infinite loop when sleep_usec() is interrupted * ml prediction is a collector for several charts * sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit * worker_unregister() in netdata threads cleanup * moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h * fixed ML issues * removed engine2 directory * added dbengine2 files in CMakeLists.txt * move query plan data to query target, so that they can be exposed by in jsonwrap * uniform definition of query plan according to the other query target members * event_loop should be in daemon, not libnetdata * metric_retention_by_uuid() is now part of the storage engine abstraction * unify time_t variables to have the suffix _s (meaning: seconds) * old dbengine statistics become "dbengine io" * do not enable ML resource usage charts by default * unify ml chart families, plugins and modules * cleanup query plans from query target * cleanup all extent buffers * added debug info for rrddim slot to time * rrddim now does proper gap management * full rewrite of the mem modes * use library functions for madvise * use CHECKSUM_SZ for the checksum size * fix coverity warning about the impossible case of returning a page that is entirely in the past of the query * fix dbengine shutdown * keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently * fine tune cache evictions * dont initialize health if the health service is not running - prevent crash on shutdown while children get connected * rename AS threads to ACLK[hostname] * prevent re-use of uninitialized memory in queries * use JulyL 
instead of JudyL for PDC operations - to test it first * add also JulyL files * fix July memory accounting * disable July for PDC (use Judy) * use the function to remove datafiles from linked list * fix july and event_loop * add july to libnetdata subdirs * rename time_t variables that end in _t to end in _s * replicate when there is a gap at the beginning of the replication period * reset postponing of sender connections when a receiver is connected * Adjust update every properly * fix replication infinite loop due to last change * packed enums in rrd.h and cleanup of obsolete rrd structure members * prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests() * void unused variable * void unused variables * fix indentation * entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps * macros to caclulate page entries by time and size * prevent statsd cleanup crash on exit * cleanup health thread related variables Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com> Co-authored-by: vkalintiris <vasilis@netdata.cloud>
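One recurring idea in this entry is partitioning the cache and metrics-registry indexes by a hash of the key ("use murmur hash for indexing partition", with the partition count later tied to the number of CPUs), so that concurrent operations on different metrics take different locks. A minimal sketch of that selection step; the mixer below is a generic 64-bit finalizer standing in for the murmur hash the commit refers to:

    #include <stddef.h>
    #include <stdint.h>

    /* Generic 64-bit bit mixer (splitmix64 finalizer), used here as a stand-in
     * for murmur: it spreads nearby keys uniformly across partitions. */
    static inline uint64_t mix64(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    /* Pick which index partition (and therefore which lock) owns this metric,
     * so lookups on different metrics rarely contend. */
    static inline size_t cache_partition(uint64_t metric_id, size_t n_partitions) {
        return (size_t)(mix64(metric_id) % n_partitions);
    }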
2022-11-29  Sanitize command arguments. (#14064)  [vkalintiris]
* Sanitize bash arguments. Remove leading dashes and escape single quotes in command arguments. * Quote expanded variable in test
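The sanitization described here strips leading dashes so an argument cannot be parsed as an option, and escapes single quotes so the argument can be embedded safely in a single-quoted shell string. An illustrative C version of the technique, not the function the PR added:

    #include <stdlib.h>
    #include <string.h>

    /* Return a newly allocated copy of arg that is safe to splice into a
     * single-quoted shell command: leading dashes removed, every single
     * quote replaced with the  '\''  sequence. Caller frees the result. */
    static char *sanitize_command_argument(const char *arg) {
        while (*arg == '-') arg++;                  /* cannot be mistaken for an option */

        /* worst case every input byte is a quote: 4 output bytes per input byte */
        char *out = malloc(strlen(arg) * 4 + 1);
        if (!out) return NULL;

        char *d = out;
        for (const char *s = arg; *s; s++) {
            if (*s == '\'') {
                memcpy(d, "'\\''", 4);              /* close quote, escaped ', reopen quote */
                d += 4;
            } else {
                *d++ = *s;
            }
        }
        *d = '\0';
        return out;
    }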
2022-11-28  replication fixes No 7 (#14053)  [Costa Tsaousis]
* move global statistics workers to a separate thread; query statistics per query source; query statistics for ML, exporters, backfilling; reset replication point in time every 10 seconds, instead of every 1; fix compilation warnings; optimize the replication queries code; prevent long tail of replication requests (big sleeps); provide query statistics about replication ; optimize replication sender when most senders are full; optimize replication_request_get_first_available(); reset replication completion calculation; * remove workers utilization from global statistics thread
2022-11-22  Do not force internal collectors to call rrdset_next. (#13926)  [vkalintiris]
* Remove calls to rrdset_next(). * Rm checks plugin * Update documentation * Call rrdset_next from within rrdset_done. This wraps up the removal of rrdset_next from internal collectors, which removes a lot of unnecessary code and the need for if/else clauses in every place. The pluginsd parser is the only component that calls rrdset_next*() functions because it's not strictly speaking a collector but more of a collector manager/proxy. With the current changes it's possible to simplify the API we expose from RRD significantly, but this will be follow-up work in the future. * Remove stale reference to checks.plugin * Fix RRD unit test. rrdset_next is not meant to be called from these tests. * Fix db engine unit test. * Schedule rrdset_next when we have completed at least one collection. * Mark chart creation clauses as unlikely. * Add missing brace to fix FreeBSD plugin.
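The change is easiest to see as a before/after sketch of an internal collection step. The rrdset_next/rrdset_done names come from the message above; the surrounding code, types, and the first_collection_of() helper are assumed for illustration and would only compile against the Netdata RRD headers:

    /* Before: every internal collector repeated this boilerplate around each
     * sample (first_collection_of() stands in for the flag collectors kept). */
    static void collect_sample_before(RRDSET *st, RRDDIM *rd, collected_number value) {
        if (!first_collection_of(st))
            rrdset_next(st);              /* advance the chart before the new sample */
        rrddim_set_by_pointer(st, rd, value);
        rrdset_done(st);
    }

    /* After: rrdset_done() schedules the "next" step itself once the chart has
     * completed at least one collection, so collectors only set values. */
    static void collect_sample_after(RRDSET *st, RRDDIM *rd, collected_number value) {
        rrddim_set_by_pointer(st, rd, value);
        rrdset_done(st);
    }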
2022-10-31  Replication of metrics (gaps filling) during streaming (#13873)  [vkalintiris]
* Revert "Use llvm's ar and ranlib when compiling with clang (#13854)" This reverts commit a9135f47bbb36e9cb437b18a7109607569580db7. * Profile plugin * Fix macos static thread * Add support for replication - Add a new capability for replication, when not supported the agent should behave as previously. - When replication is supported, the text protocol supports the following new commands: - CHART_DEFINITION_END: send the first/last entry of the child - REPLAY_RRDSET_BEGIN: sends the name of the chart we are replicating - REPLAY_RRDSET_HEADER: sends a line describing the columns of the following command (ie. start-time, end-time, dim1-name, ...) - REPLAY_RRDSET_DONE: sends values to push for a specific start/end time - REPLAY_RRDSET_END: send the (a) update every of the chart, (b) first/last entries in DB, (c) whether the child's been told to start streaming, (d) original after/before period to replicate. - REPLAY_CHART: Sent from a parent to a child, specifying (a) the chart name we want data for, (b) whether the child should start streaming once it has fullfilled the request with the aforementioned commands, (c) after/before of the data the parent wants - As a consequence of the new protocol, streaming is disabled for all charts on a new connection. It's enabled once replication is finished. - The configuration parameters are specified from within stream.conf: - "enable replication = yes|no" - "seconds to replicate = 3600" - "replication step = 600" (ie. how many seconds to fill per roundtrip request. * Minor fixes - quote set and dim ids - start streaming after writing replicated data to the buffer - write replicated data only when buffer is less than 50% full. - use reentrant iteration for charts * Do not send chart definitions on connection. * Track replication status through rrdset flags. * Add debug flag for noisy log messages. * Add license notice. * Iterate charts with reentrant loop * Set replication finished flag when streaming is disabled. * Revert "Profile plugin" This reverts commit 468fc9386e5283e0865fae56e9989b8ec83de14d. Used only for testing purposes. * Revert "Revert "Use llvm's ar and ranlib when compiling with clang (#13854)"" This reverts commit 27c955c58d95aed6c44d42e8b675f0cf3ca45c6d. Reapply commit that I had to revert in order to be able to build the agent on MacOS. * Build replication source files with CMa