summaryrefslogtreecommitdiffstats
path: root/daemon
AgeCommit message (Collapse)Author
2023-06-07Re-write of SSL support in Netdata; restoration of SIGCHLD; detection of ↵Costa Tsaousis
stale plugins; streaming improvements (#15113) * add information about streaming connections to /api/v2/nodes; reset defer time when sender or receivers connect or disconnect * make each streaming destination respect its SSL settings * to not send SSL traffic over non-SSL connection * keep track of outgoing streaming connection attempts * retry SSL reads when SSL_read() returns SSL_ERROR_WANT_READ * Revert "retry SSL reads when SSL_read() returns SSL_ERROR_WANT_READ" This reverts commit 14c858677c6f2d3b08c94f298e2f45ecdb74c801. * cleanup SSL connections properly * initialize SSL in rpt before takeover * sender should free SSL when talking to a non-SSL destination * do not shutdown SSL when receiver exits * restore operation of SIGCHLD when the reaper is not enabled * create an fgets function that checks for data and times out * work on error handling of plugins exiting * remove newlines from logs * global call to waitid(), caching the result for netdata_pclose() to process * receiver tid * parser timeouts in 2 minutes instead of 10 * fix crash when UUID is NULL in SQLite * abstract sqlite3 parsing for uuid and text * write proper ssl errors on read and write * fix for SSL_ERROR_WANT_RETRY_VERIFY * SSL WANT per function * unified SSL error logging * fix compilation warning * additional logging about parser cleanup * streaming parser should call the pluginsd parser cleanup * SSL error handling work * SSL initialization unification * check for pending data when receiving SSL response with timeout * macro to check if an SSL connection has been established * remove SSL_pending() * check for SSL macros * use SSL_peek() to find if there is a response * SSL renames * more SSL renames & cleanup * rrdpush ssl connection function * abstract all SSL functions into security.c * keep track of SSL connections and always attempt to use SSL read/write when on SSL connection * signal openssl to skip certificate validation when configured to do so * better SSL error handling and logging * SSL code cleanup * SSL retry on SSL_connect and SSL_accept * SSL provide default return value for old compilers * SSL read/write functions emulate system read/write functions * fix receive/send timeout and switch from SSL_peek() to SSL_pending() * remove SSL_pending() * removed sender auto-retry and debug info for initial recevier response * ssl skip certificate verification config for web server * ssl errors log ip and port of the peer * keep ssl with web_client for its whole lifetime * thread safe socket peers to text * use error_limit() for common ssl errors * cleanup * more cleanup * coverity fixes * ssl error logs include both local and remote ip/port info * remove obsolete code
2023-05-29update agent telemetry url to be cloud function instead of posthog (#15085)Andrew Maguire
* update agent events telemetry url to be cloud function instead of posthog.
2023-05-15Debugfs collector (#15017)thiagoftsm
Co-authored-by: Fotis Voutsas <fotis@netdata.cloud> Co-authored-by: Austin S. Hemmelgarn <ahferroin7@gmail.com> Co-authored-by: ilyam8 <ilya@netdata.cloud>
2023-05-10make zlib compulsory dep (#14928)Timotej S
zlib compulsory
2023-05-10initial minimal h2o webserver integration (#14585)Timotej S
Introduces h2o based web server as an alternative
2023-04-21Remove netdatacli response size limitation (#14906)Stelios Fragkakis
* Dump config * Add charcat and rawcat * Build incoming response in an buffer * Allocate a buffer to hold the command response so that we dont have a 4K char limit * Add a dumpconfig command to output the current netdata.conf * Remove -W dumpconfig for now * Fix typo * Improve help message
2023-04-20WEBRTC for communication between agents and browsers (#14874)Costa Tsaousis
* initial webrtc setup * missing files * rewrite of webrtc integration * initialization and cleanup of webrtc connections * make it compile without libdatachannel * add missing webrtc_initialize() function when webrtc is not enabled * make c++17 optional * add build/m4/ax_compiler_vendor.m4 * add ax_cxx_compile_stdcxx.m4 * added new m4 files to makefile.am * id all webrtc connections * show warning when webrtc is disabled * fixed message * moved all webrtc error checking inside webrtc.cpp * working webrtc connection establishment and cleanup * remove obsolete code * rewrote webrtc code in C to remove dependency for c++17 * fixed left-over reference * detect binary and text messages * minor fix * naming of webrtc threads * added webrtc configuration * fix for thread_get_name_np() * smaller web_client memory footprint * universal web clients cache * free web clients every 100 uses * webrtc is now enabled by default only when compiled with internal checks * webrtc responses to /api/ requests, including LZ4 compression * fix for binary and text messages * web_client_cache is now global * unification of the internal web server API, for web requests, aclk request, webrtc requests * more cleanup and unification of web client timings * fixed compiler warnings * update sent and received bytes * eliminated of almost all big buffers in web client * registry now uses the new json generation * cookies are now an array; fixed redirects * fix redirects, again * write cookies directly to the header buffer, eliminating the need for cookie structures in web client * reset the has_cookies flag * gathered all web client cleanup to one function * fixes redirects * added summary.globals in /api/v2/data response * ars to arc in /api/v2/data * properly handle host impersonation * set the context of mem.numa_nodes
2023-04-20Initialize machine GUID earlier in the agent startup sequence (#14922)Stelios Fragkakis
* Read (cache) machine guid earlier, so that a fatal event during init will have the correct machine guid * Setup analytics global environment explicitly if we fail to initialize before localhost is created
2023-04-20Skip ML initialization when it's been disabled in netdata.conf (#14920)vkalintiris
* Apply ML changes again. The ML changes in 003df5f2 wheere reverted with 556bdad9 because we were partially initializing ML even when it was explicitly disabled in netdata.conf, causing the agent to crash on startup. * Do not start/stop ML threads when ML is disabled. * Restore default config settings.
2023-04-18change docusaurus admonitions to our style of admonitions (#14917)Fotis Voutsas
* change docusaurus admonitions to our style of admonitions * Apply suggestions from code review Co-authored-by: Shyam Sreevalsan <shyam@netdata.cloud> --------- Co-authored-by: Shyam Sreevalsan <shyam@netdata.cloud>
2023-04-14Revert ML changes. (#14908)vkalintiris
2023-04-13Save and load ML models (#14810)vkalintiris
* Revert "Revert "Use static thread-pool for training. (#14702)" (#14782)" This reverts commit 5321ca8d1ef8d974a6a2b2128ca8804de6acb693. * Model I/O. * Minor changes Meant to make debugging a crash issues easier on cloud VMs: - Less verbose logging - Higher logging history - Modify installer to use debug info by default * Fix ML initialization order. * read lock hosts when running detection. * Revert debugging changes. * Update ml/Config.cc Co-authored-by: Andrew Maguire <andrewm4894@gmail.com> --------- Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
2023-04-07Boost dbengine (#14832)Costa Tsaousis
* configure extent cache size * workers can now execute up to 10 jobs in a run, boosting query prep and extent reads * fix dispatched and executing counters * boost to the max * increase libuv worker threads * query prep always get more prio than extent reads; stop processing in batch when dbengine is queue is critical * fix accounting of query prep * inlining of time-grouping functions, to speed up queries with billions of points * make switching based on a local const variable * print one pending contexts loading message per iteration * inlined store engine query API * inlined storage engine data collection api * inlined all storage engine query ops * eliminate and inline data collection ops * simplified query group-by * more error handling * optimized partial trimming of group-by queries * preparative work to support multiple passes of group-by * more preparative work to support multiple passes of group-by (accepts multiple group-by params) * unified query timings * unified query timings - weights endpoint * query target is no longer a static thread variable - there is a list of cached query targets, each of which of freed every 1000 queries * fix query memory accounting * added summary.dimension[].pri and sorted summary.dimensions based on priority and then name * limit max ACLK WEB response size to 30MB * the response type should be text/plain * more preparative work for multiple group-by passes * create functions for generating group by keys, ids and names * multiple group-by passes are now supported * parse group-by options array also with an index * implemented percentage-of-instance group by function * family is now merged in multi-node contexts * prevent uninitialized use
2023-04-07Set a default registry unique id when there is none for statistics script ↵Emmanuel Vasilakis
(#14861) set a default registry unique id when theres none
2023-04-03update posthog domain (#14818)Andrew Maguire
* update posthog domain * update for curl too * update telemetry url in installer
2023-03-28configure extent cache size (#14821)Costa Tsaousis
2023-03-23add validation step before using GCP metadata (#14801)Ilya Mashchenko
* add GCP data validation * Update daemon/system-info.sh --------- Co-authored-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
2023-03-21Revert "Use static thread-pool for training. (#14702)" (#14782)vkalintiris
This reverts commit 5046e034212c008557dd014196b6f6204eda24b2. Will re-apply once we investigate an issue that occurs during the shutdown of the agent.
2023-03-21add validation step before using Azure metadata (AZURE_IMDS_DATA) (#14775)Ilya Mashchenko
2023-03-21Use static thread-pool for training. (#14702)vkalintiris
* Use static thread-pool for training. * Add missing function definition * disable training stats chart * Add config option to explicitly enable ML stats charts. --------- Co-authored-by: Costa Tsaousis <costa@netdata.cloud>
2023-03-16Use one thread for ACLK synchonization (#14281)Stelios Fragkakis
* Remove aclk sync threads * Disable functions if compiled with --disable-cloud * Allocate and reuse buffer when scanning hosts Tune transactions when writing metadata Error checking when executing db_execute (it is already within a loop with retries) * Schedule host context load in parallel Child connection will be delayed if context load is not complete Event loop cleanup * Delay retention check if context is not loaded Remove context load check from regular metadata host scan * Improve checks to check finished threads * Cleanup warnings when compiling with --disable-cloud * Clean chart labels that were created before our current maximum retention * Fix sql statement * Remove structures members that of no use Remove buffer allocations when not needed * Fix compilation error * Don't check for service running when not from a worker * Code cleanup if agent is compiled with --disable-cloud Setup ACLK tables in the database if needed Submit node status update messages to the cloud * Fix compilation warning when --disable-cloud is specified * Address codacy issues * Remove empty file -- has already been moved under contexts * Use enum instead of numbers * Use UUID_STR_LEN * Add newline at the end of file * Release node_id to prevent memory leak under certain cases * Add queries in defines * Ignore rc from transaction start -- if there is an active transaction, we will use it (same with commit) should further improve in a future PR * Remove commented out code * If host is null (it should not be) do not allocate config (coverity reports Resource leak) * Do garbage collection when contexts is initialized * Handle the case when config is not yet available for a host
2023-03-13fix system info disk size detection on raspberry pi (#14711)Ilya Mashchenko
2023-03-10/api/v2/X improvements part 3 (#14665)Costa Tsaousis
* max web request size to 64KB * fix the request too big message * increase max request reading tries to 100 * support for bigger web requests * add "avg" as a shortcut for "average" to both group by aggregation and time aggregation; discard the last partial points of a query in play mode, up to max update every; group by hidden dimensions too * better implementation for partial data trimming * added group_by=selected to return only one dimension for all selected metrics * fix acceptance of group_by=selected * passing option "raw" disables partial data trimming * remove obsolete option "plan"; use "debug" * fix view.min and view.max calculation - there were 2 bugs: a) min and max were reset for every row and b) min and max were corrupted by GBC and AR printing * per row annotations * added time column to point annotations * disable caching for /api/v2/contexts responses * added api format json2 that returns an array for each points, having all the point values and annotations in them * work on swagger about /api/v2 * prevent infinite loop * cleanup and swagger work * allow negative simple pattern expressions to work as expected * do not lookup in the dictionary empty names * garbage collect dictionaries * make query_target allocate less aggressively; queries fill the remaining points with nulls * reusable query ops to save memory on huge queries * move parts of query plans into query ops to save query target memory * remove storage engine from query metric tiers, to save memory, and recalculate it when it is needed
2023-03-09fix: detect the host os in k8s on non-docker cri (#14694)Witold Duranek
2023-03-08Fix Azure IMDS (#14686)Shyam Sreevalsan
* Fix Azure IMDS * Update system-info.sh * Update system-info.sh * Update daemon/system-info.sh Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud> --------- Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
2023-03-02/api/v2/contexts (#14592)Costa Tsaousis
* preparation for /api/v2/contexts * working /api/v2/contexts * add anomaly rate information in all statistics; when sum-count is requested, return sums and counts instead of averages * minor fix * query targegt now accurately counts hosts, contexts, instances, dimensions, metrics * cleanup /api/v2/contexts * full text search with /api/v2/contexts * simple patterns now support the option to search ignoring case * full text search API with /api/v2/q * simple pattern execution optimization * do not show q when not given * full text search accounting * separated /api/v2/nodes from /api/v2/contexts * fix ssv queries for group_by * count query instances queried and failed per context and host * split rrdcontext.c to multiple files * add query totals * fix anomaly rate calculation; provide "ni" for indexing hosts * do not generate zero valued members * faster calculation of anomaly rate; by just summing integers for each db points and doing math once for every generated point * fix typo when printing dimensions totals * added option minify to remove spaces and newlines fron JSON output * send instance ids and names when they differ * do not add in query target dimensions, instances, contexts and hosts for which there is no retention in the current timeframe * fix for the previous + renames and code cleanup * when a dimension is filtered, include in the response all the other dimensions that are selectable * do not add nodes that do not have retention in the current window * move selection of dimensions to query_dimension_add(), instead of query_metric_add() * increase the pre-processing capacity of queries * generate instance fqdn ids and names only when they are needed * provide detailed statistics about tiers retention, queries, points, update_every * late allocation of query dimensions * cleanup * more cleanup * support for annotations per displayed point, RESET and PARTIAL * new type annotations * if a chart is not linked to contexts and it is collected, link it when it is collected * make ML run reentrant * make ML rrdr query synchronous * optimize replication memory allocation of replication_sort_entry * change units to percentage, when requesting a coefficinet of variation, or a percentage query * initialize replication before starting main threads * properly decrement no room requests counter * propagate the non-zero flag to group-by * the same by avoiding the extra loop * respect non-zero in all dimension arrays * remove dictionary garbage collection from dictionary_entries() and dictionary_version() * be more verbose when jv2 indexing is postponed * prevent infinite loop * use hidden dimensions even when dimensions pattern is unset * traverse hosts using dictionaries * fix dictionary unittests
2023-02-26Reorg learn 0226 (#14610)Chris Akritidis
* Reorg getting started * Streaming * Remove blanks * Fix up to cloud alerts
2023-02-24Prevent core dump when the agent is performing a quick shutdown (#14587)Stelios Fragkakis
* Prevent core dump when the agent is performing a quick shutdown (e.g. when rrd_init fails) * Threads that have not started during shutdown are immediately marked as EXITED * Do not attempt to get statistics if database is not initialized * Do not attempt to get context db statistics if the context database is not initialized
2023-02-22/api/v2/data - multi-host/context/instance/dimension/label queries (#14564)Costa Tsaousis
* fundamentals for having /api/v2/ working * use an atomic to prevent writing to internal pipe too much * first attempt of multi-node, multi-context, multi-chart, multi-dimension queries * v2 jsonwrap * first attempt for group by * cleaned up RRDR and fixed group by * improvements to /api/v2/api * query instance may be realloced, so pointers to it get invalid; solved memory leaks * count of quried metrics in summary information * provide detailed information about selected, excluded, queried and failed metrics for each entity * select instances by fqdn too * add timing information to json output * link charts to rrdcontexts, if a query comes in and it is found unlinked * calculate min, max, sum, average, volume, count per metric * api v2 parameters naming * renders alerts and units * render machine_guid and node_id in all sections it is relevant * unified keys * group by now takes into account units and when there are multiple units involved, it creates a dimension per unit * request and detailed are hidden behind an option * summary includes only a flattened list of alerts * alert counts per host and instance * count of grouped metrics per dimension * added contexts to summary * added chart title * added dimension priorities and chart type * support for multiple group by at the same time * minor fixes * labels are now a tree * keys uniformity * filtering by alerts, both having a specific alert and having a specific alert in a specific status * added scope of hosts and contexts * count of instances on contexts and hosts * make the api return valid responses even when the response contains no data * calculate average and contribution % for every item in the summary * fix compilation warnings * fix compilation warnings - again
2023-02-20Remove References category from learn (#14571)Chris Akritidis
2023-02-15JSON internal API, IEEE754 base64/hex streaming, weights endpoint ↵Costa Tsaousis
optimization (#14493) * first work on standardizing json formatting * renamed old grouping to time_grouping and added group_by * add dummy functions to enable compilation * buffer json api work * jsonwrap opening with buffer_json_X() functions * cleanup * storage for quotes * optimize buffer printing for both numbers and strings * removed ; from define * contexts json generation using the new json functions * fix buffer overflow at unit test * weights endpoint using new json api * fixes to weights endpoint * check buffer overflow on all buffer functions * do synchronous queries for weights * buffer_flush() now resets json state too * content type typedef * print double values that are above the max 64-bit value * str2ndd() can now parse values above UINT64_MAX * faster number parsing by avoiding double calculations as much as possible * faster number parsing * faster hex parsing * accurate printing and parsing of double values, even for very large numbers that cannot fit in 64bit integers * full printing and parsing without using library functions - and related unit tests * added IEEE754 streaming capability to enable streaming of double values in hex * streaming and replication to transfer all values in hex * use our own str2ndd for set2 * remove subnormal check from ieee * base64 encoding for numbers, instead of hex * when increasing double precision, also make sure the fractional number printed is aligned to the wanted precision * str2ndd_encoded() parses all encoding formats, including integers * prevent uninitialized use * /api/v1/info using the new json API * Fix error when compiling with --disable-ml * Remove redundant 'buffer_unittest' declaration * Fix formatting * Fix formatting * Fix formatting * fix buffer unit test * apps.plugin using the new JSON API * make sure the metrics registry does not accept negative timestamps * do not allow pages with negative timestamps to be loaded from db files; do not accept pages with negative timestamps in the cache * Fix more formatting --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-02-14Prevent crash when running '-W createdataset' (#14455)Emmanuel Vasilakis
prepare environment to run the dataset create
2023-02-09Virtual hosts for data collection (#14464)Costa Tsaousis
* support multiple hosts at pluginsd structures * cleanup obsolete code * use a lookup hashtable to quickly find the keyword to execute, without traversing the whole linked list of keywords * more cleanup * move new hash function to inlined.h * minimize comparisons, eliminate a pre-parsing of the first keyword for each line * cleanup parser from old code * move parser into libnetdata * unique entries in parser keywords hashtable * move all hashing functions to inlined.h, name their sources, simple_hash() now defaults to FNV1a, it was FNV1 * small_hash() for parser * plugins.d now can switch hosts, and also create/update them * update hash function and hashtable size * updated message * unittest all hashing functions * reset the chart when setting a new host * remove host tags * enable archived hosts when a collector pushes host info * do not need localhost to swtich to localhost * disable ARAL and OWA with -DFSANITIZE_ADDRESS=1
2023-02-02Covert our documentation links to GH absolute links (#14344)Tasos Katsoulas
Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
2023-02-02DBENGINE v2 - improvements part 12 (#14379)Costa Tsaousis
* parallel initialization of tiers * do not spawn multiple dbengine event loops * user configurable dbengine parallel initialization * size netdata based on the real cpu cores available on the system netdata runs, not on the system monitored * user configurable system cpus * move cpuset parsing to os.c/.h * fix replication of misaligned chart dimensions * give a different path to each tier thread * statically allocate the path into the initialization structure * use aral for reusing dbengine pages * dictionaries uses ARAL for fixed sized values * fix compilation without internal checks * journal v2 index uses aral * test to see judy allocations * judy allocations using aral * Add config option to select if dbengine will use direct I/O (default is yes) * V1 journafiles will use uv_fs_read instead of mmap (respect the direct I/O setting) * Remove sqlite3IsMemdb as it is unused * Fix compilation error when --disable-dbengine is used * use aral for dbengine work_cmds * changed aral API to support new features * pgc and mrg aral overheads * rrdeng opcodes using aral * better structuring and naming * dbegnine query handles using aral * page descriptors using aral * remove obsolete linking * extent io descriptors using aral * aral keeps one last page alive * add missing return value * added judy aral overhead * pdc now uses aral * page_details now use aral * epdl and deol using aral - make sure ARALs are initialized before spawning the event loop * remove unused linking * pgc now uses one aral per partition * aral measure maximum allocation queue * aral to allocate pages in parallel * aral parallel pages allocation when needed * aral cleanup * track page allocation and page population separately --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-30DBENGINE v2 - improvements part 11 (#14337)Costa Tsaousis
* acquiring / releasing interface for metrics * metrics registry statistics * cleanup metrics registry by deleting metrics when they dont have retention anymore; do not double copy the data of pages to be flushed * print the tier in retention summary * Open files with buffered instead of direct I/O (test) * added more metrics stats and fixed unittest * rename writer functions to avoid confusion with refcounting * do not release a metric that is not acquired * Revert to use direct I/O on write -- use direct I/O on read as well * keep track of ARAL overhead and add it to the memory chart * aral full check via api * Cleanup * give names to ARALs and PGCs * aral improvements * restore query expansion to the future * prefer higher resolution tier when switching plans * added extent read statistics * smoother joining of tiers at query engine * fine tune aral max allocation size * aral restructuring to hide its internals from the rest of netdata * aral restructuring; addtion of defrag option to aral to keep the linked list sorted - enabled by default to test it * fully async aral * some statistics and cleanup * fix infinite loop while calculating retention * aral docs and defragmenting disabled by default * fix bug and add optimization when defragmenter is not enabled * aral stress test * aral speed report and documentation * added internal checks that all pages are full * improve internal log about metrics deletion * metrics registry uses one aral per partition * metrics registry aral max size to 512 elements per page * remove data_structures/README.md dependency --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-27DBENGINE v2 - improvements part 10 (#14332)Costa Tsaousis
* replication cancels pending queries on exit * log when waiting for inflight queries * when there are collected and not-collected metrics, use the context priority from the collected only * Write metadata with a faster pace * Remove journal file size limit and sync mode to 0 / Drop wal checkpoint for now * Wrap in a big transaction remaining metadata writes (test 1) * fix higher tiers when tiering iterations = 2 * dbengine always returns db-aligned points; query engine expands the queries by 2 points in every direction to have enough data for interpolation * Wrap in a big transaction metadata writes (test 2) * replication cancelling fix * do not first and last entry in replication when the db has no retention * fix internal check condition * Increase metadata write batch size * always apply error limit to dbengine logs * Remove code that processes the obsolete health.db files * cleanup in query.c * do not allow queries to go beyond db boundaries * prevent internal log for +1 delta in timestamp * detect gap pages in conflicts * double protection for gap injection in main cache * Add checkpoint to prevent large WAL while running Remove unused and duplicate functions * do not allocate chart cache dir if not needed * add more info to unittests * revert query expansion to satisfy unittests Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-26DBENGINE v2 - improvements part 9 (#14326)Costa Tsaousis
* on shutdown stop data collection for all hosts instead of freeing their memory * print number of sql statements per metadata host scan * print timings with metadata checking * use dbengine API to figure out of a database is legacy * Recalculate retention after a datafile deletion * validate child timestamps during replication * main cache uses a lockless aral per partition, protected by the partition index lock * prevent ML crash * Revert "main cache uses a lockless aral per partition, protected by the partition index lock" This reverts commit 6afc01527dc5c66548b4bc8a1d63c026c3149358. * Log direct index and binary searches * distribute metrics more evenly across time * statistics about retention recalculation * fix crash * Reverse the binary search to calculate retention * more optimization on retention calculation * removed commented old code Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-25Add Collector log (#14309)thiagoftsm
2023-01-25Introduce the new Structure of the documentation (#13915)Fotis Voutsas
* Moving the cloud docs under /docs/cloud (previous location: netdata/learn/*) * Added metadata on almost every document of the old learn site for the new ingest process of learn. * Map old learn document to their best fit as topic related docs. Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> Co-authored-by: DShreve2 <david@netdata.cloud> Co-authored-by: hugovalente-pm <hugo@netdata.cloud>
2023-01-25DBENGINE v2 - improvements part 8 (#14319)Costa Tsaousis
* cache 100 pages for each size our tiers need * smarter page caching * account the caching structures * dynamic max number of cached pages * make variables const to ensure they are not changed * make sure replication timestamps do not go to the future * replication now sends chart and dimension states atomically; replication receivers ignores chart and dimension states when rbegin is also ignored * make sure all pages are flushed on shutdown * take into account empty points too * when recalculating retention update first_time_s on metrics only when they are bigger * Report the datafile number we use to recalculate retention * Report the datafile number we use to recalculate retention * rotate db at startup * make query plans overlap * Calculate properly first time s * updated event labels * negative page caching fix * Atempt to create missing tables on query failure * Atempt to create missing tables on query failure (part 2) * negative page caching for all gaps, to eliminate jv2 scans * Fix unittest Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-23DBENGINE v2 - improvements part 7 (#14307)Costa Tsaousis
* run cleanup in workers * when there is a discrepancy between update every, fix it * fix the other occurences of metric update every mismatch * allow resetting the same timestamp * validate flushed pages before committing them to disk * initialize collection with the latest time in mrg * these should be static functions * acquire metrics for writing to detect multiple data collections of the same metric * print the uuid of the metric that is collected twice * log the discrepancies of completed pages * 1 second tolerance * unify validation of pages and related logging across dbengine * make do_flush_pages() thread safe * flush pages runs on libuv workers * added uv events to tp workers * dont cross datafile spinlock and rwlock * should be unlock * prevent the creation of multiple datafiles * break an infinite replication loop * do not log the epxansion of the replication window due to start streaming * log all invalid pages with internal checks * do not shutdown event loop threads * add information about collected page events, to find the root cause of invalid collected pages * rewrite of the gap filling to fix the invalid collected pages problem * handle multiple collections of the same metric gracefully * added log about main cache page conflicts; fix gap filling once again... * keep track of the first metric writer * it should be an internal fatal - it does not harm users * do not check of future timestamps on collected pages, since we inherit the clock of the children; do not check collected pages validity without internal checks * prevent negative replication completion percentage * internal error for the discrepancy of mrg * better logging of dbengine new metrics collection * without internal checks it is unused * prevent pluginsd crash on exit due to calling pthread_cancel() on an exited thread * renames and atomics everywhere * if a datafile cannot be acquired for deletion during shutdown, continue - this can happen when there are hot pages in open cache referencing it * Debug for context load * rrdcontext uuid debug * rrddim uuid debug * rrdeng uuid debug * Revert "rrdeng uuid debug" This reverts commit 393da190826a582e7e6cc90771bf91b175826d8b. * Revert "rrddim uuid debug" This reverts commit 72150b30408294f141b19afcfb35abd7c34777d8. * Revert "rrdcontext uuid debug" This reverts commit 2c3b940dc23f460226e9b2a6861c214e840044d0. * Revert "Debug for context load" This reverts commit 0d880fc1589f128524e0b47abd9ff0714283ce3b. * do not use legacy uuids on multihost dbs * thread safety for journafile size * handle other cases of inconsistent collected pages * make health thread check if it should be running in key loops * do not log uuids Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-20DBENGINE v2 - improvements part 6 (#14299)Costa Tsaousis
* query preparation runs before extent reads * populate mrg in parallel * fix formatting warning * first search for a metric then add it if it does not exist * Revert "first search for a metric then add it if it does not exist" This reverts commit 4afa6461fcce859d03f1c9cf56dd3b5933ee5ebc. * Revert "fix formatting warning" This reverts commit 49473493f7f1c3399b5635a573d3c6ed2b6e46f3. * Revert "populate mrg in parallel" This reverts commit a40166708d4222f6329904f109114c47c44ca666. * merge journalfiles metrics before committing them to MRG * Revert "merge journalfiles metrics before committing them to MRG" This reverts commit 50c8934e23a0a09ea4da80e3f88290e46496ad92. * Revert "Revert "populate mrg in parallel"" This reverts commit f4c149d2ab7a8c9af24a10f95438a0d662a5cf8a. * Revert "Revert "fix formatting warning"" This reverts commit 78298ff9efc49806ded029f5f1e868cc42e8f6eb. * Revert "Revert "first search for a metric then add it if it does not exist"" This reverts commit 997b9c813b290882ba18a8c44bf73f9ee5480adf. * preload first and last journal files v2 * fix formatting warning * parallel loading of tiers; cleanup of ctx structures * use half the cores * add partitions to metrics registry * revert accidental change * parallel processing according to MRG partitions; dont recalculate retention on exit
2023-01-20Fixes required to make the agent work without crashes on MacOS (#14304)vkalintiris
* Bump the soft limit on open FDs to the max. On systems with a low soft-limit for open file descriptors, the agent would fail to initialize all dbengine tiers. * Iterate the right number of dbengine tiers. For whatever reason, this was causing a crash on MacOS but it was running "correctly" on Linux systems.
2023-01-20track memory footprint of Netdata (#14294)Costa Tsaousis
* track memory footprint of Netdata * track db modes alloc/ram/save/map * track system info; track sender and receiver * fixes * more fixes * track workers memory, onewayalloc memory; unify judyhs size estimation * track replication structures and buffers * Properly clear host RRDHOST_FLAG_METADATA_UPDATE flag * flush the replication buffer every 1000 times the circular buffer is found empty * dont take timestamp too frequently in sender loop * sender buffers are not used by the same thread as the sender, so they were never recreated - fixed it * free sender thread buffer on replication threads when replication is idle * use the last sender flag as a timestamp of the last buffer recreation * free cbuffer before reconnecting * recreate cbuffer on every flush * timings for journal v2 loading * inlining of metric and cache functions * aral likely/unlikely * free left-over thread buffers * fix NULL pointer dereference in replication * free sender thread buffer on sender thread too * mark ctx as used before flushing * better logging on ctx datafiles closing Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-19Improve file descriptor closing loops (#14213)Dim-P
* Add for_each_open_fd() and fix second instance of _SC_OPEN_MAX * Add argument to allow exclusion of file descriptors from closing * Fix clang error * Address review comments * Use close_range() if possible and replace macros with enums
2023-01-18DBENGINE v2 - improvements part 5 (#14289)Costa Tsaousis
* cleanup journal v2 mounts periodically * fix for last commit * re-enable loading page from disk when the arrangement of pages requires it * Remove unused statistics * Estimate diskspace when the current datafile is full and queue a rotate command (Currently it will not attempt to estimate end size for journals) Queue a command to check quota on startup per tier * apps.plugin now exposes RSS chart * shorter thread names to make debugging easier, since thread names can only be 15 characters * more thread names fixes * allow an apps_groups.conf target to be pid 0 or 1 Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-18Revert health to run in a single thread (#14244)Emmanuel Vasilakis
* revert health to single thread * remove getting now * use a health struct * remove commented code * cleanup health log from metdata * dont check for METADATA_UPDATE
2023-01-17DBENGINE v2 - improvements part 3 (#14269)Costa Tsaousis
* reduce journal v2 shared memory using madvise() - not integrated yet * working attempt to minimize dbengine shared memory * never call willneed - let the kernel decide which parts of each file are really needed * journal files get MADV_RANDOM * dont call MADV_DONTNEED too frequently * madvise() is always called with the journal unlocked but referenced * call madvise() even less frequently * added chart for monitoring database events * turn batch mode on under critical conditions * max size to evict is 1/4 of the max * fix max size to evict calculation * use dbengine_page/extent_alloc/free to pages and extents allocations, tracking also the size of these allocations at free time * fix calculation for batch evictions * allow main and open cache to have as many evictors as needed * control inline evictors for each cache; report different levels of cache pressure on every cache evaluation * more inline evictors for extent cache * bypass max inline evictors above critical level * current cache usage has to be taken * re-arrange items in journafile * updated docs - work in progress * more docs work * more docs work * Map / unmap journal file * draw.io diagram for dbengine operations * updated dbengine diagram * updated docs * journal files v2 now get mapped and unmapped as needed * unmap journal v2 immediately when getting retention * mmap and munmap do not block queries evaluating journal files v2 * have only one unmap function Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-16fix(pacakging): fix cpu/memory metrics when running inside LXC container as ↵Ilya Mashchenko
systemd service (#14255) Fixes https://github.com/netdata/netdata/issues/14238