path: root/daemon
Age  Commit message  Author
2022-07-14Update docs on metric storage (#13327)Tasos Katsoulas
This PR: - Explains the new tiering mechanism. - Tidies up the docs about the Agent's database options. - Updates all the configuration options for the `dbengine`. - Provides a new way for users to calculate the space they need for metric storage (via a spreadsheet). Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> Co-authored-by: DShreve2 <david@netdata.cloud>
2022-07-13Fix bitmap unit tests (#13374)Stelios Fragkakis
* Fix bitmap unit tests * Fix bitmap unit tests (part 2)
2022-07-11Detect stored metric size by page type (#13334)Stelios Fragkakis
* Report unknown page only once Get metric storage size by the page type Verify validity of the page and skip problematic ones * Change PAGE_SIZE to PAGE_POINT_SIZE_BYTES * Add bitmap256 and unittests * Fix unit test tier_page_type array page_type_size arrays * Add another counter to not rely on uint8_t overflow to stop the test loop
2022-07-08fix crash on start on slow disks because ml is initialized before dbengine starts (#13342)Costa Tsaousis
2022-07-08Better ACLK debug communication log (#13281)Timotej S
2022-07-07Fix two helgrind reports (#13325)vkalintiris
* Use atomic ops with host->rrdpush_sender_connected. * Use different storage unit for rrdim's updated and exposed fields. The bitfields would end up in the same byte, thus requiring explicit protection with mutexes.
2022-07-06Multi-Tier database backend for long term metrics storage (#13263)Stelios Fragkakis
* Tier part 1 * Tier part 2 * Tier part 3 * Tier part 4 * Tier part 5 * Fix some ML compilation errors * fix more conflicts * pass proper tier * move metric_uuid from state to RRDDIM * move aclk_live_status from state to RRDDIM * move ml_dimension from state to RRDDIM * abstracted the data collection interface * support flushing for mem db too * abstracted the query api * abstracted latest/oldest time per metric * cleanup * store_metric for tier1 * fix for store_metric * allow multiple tiers, more than 2 * state to tier * Change storage type in db. Query param to request min, max, sum or average * Store tier data correctly * Fix skipping tier page type * Add tier grouping in the tier * Fix to handle archived charts (part 1) * Temp fix for query granularity when requesting tier1 data * Fix parameters in the correct order and calculate the anomaly based on the anomaly count * Proper tiering grouping * Anomaly calculation based on anomaly count * force type checking on storage handles * update cmocka tests * fully dynamic number of storage tiers * fix static allocation * configure grouping for all tiers; disable tiers for unittest; disable statsd configuration for private charts mode * use default page dt using the tiering info * automatic selection of tier * fix for automatic selection of tier * working prototype of dynamic tier selection * automatic selection of tier done right (I hope) * ask for the proper tier value, based on the grouping function * fixes for unittests and load_metric_next() * fixes for lgtm findings * minor renames * add dbengine to page cache size setting * add dbengine to page cache with malloc * query engine optimized to loop as little as required based on the view_update_every * query engine grouping methods now do not assume a constant number of points per group and they allocate memory with OWA * report db points per tier in jsonwrap * query planner that switches database tiers on the fly to satisfy the query for the entire timeframe * 
dbengine statistics and documentation (in progress) * calculate average point duration in db * handle single point pages the best we can * handle single point pages even better * Keep page type in the rrdeng_page_descr * updated doc * handle future backwards compatibility - improved statistics * support &tier=X in queries * enforce increasing iterations on tiers * tier 1 is always 1 iteration * backfilling higher tiers on first data collection * reversed anomaly bit * set up to 5 tiers * natural points should only be offered on tier 0, unless a specific tier is selected * do not allow more than 65535 points of tier0 to be aggregated on any tier * Work only on actually activated tiers * fix query interpolation * fix query interpolation again * fix lgtm finding * Activate one tier for now * backfilling of higher tiers using raw metrics from lower tiers * fix for crash on start when storage tiers is increased from the default * more statistics on exit * fix bug that prevented higher tiers from getting any values; added backfilling options * fixed the statistics log line * removed limit of 255 iterations per tier; moved the code of freezing rd->tiers[x]->db_metric_handle * fixed division by zero on zero points_wanted * removed dead code * Decide on the descr->type for the type of metric * dont store metrics on unknown page types * free db_metric_handle on sql based context queries * Disable STORAGE_POINT value check in the exporting engine unit tests * fix for db modes other than dbengine * fix for aclk archived chart queries destroying db_metric_handles of valid rrddims * fix left-over freez() instead of OWA freez on median queries Co-authored-by: Costa Tsaousis <costa@netdata.cloud> Co-authored-by: Vladimir Kobal <vlad@prokk.net>
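The tiering idea described in this commit (a higher tier keeps one aggregated point per group of lower-tier points, with enough information to answer min, max, sum or average queries and to derive an anomaly rate from an anomaly count) can be illustrated with a small sketch. It is written in Python for brevity and is not the agent's C implementation: `TierPoint`, `aggregate_tier` and the field names are hypothetical, and the 65535 bound is only borrowed from the commit text as an assumed limit on the group size.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_GROUP = 65535  # assumption: mirrors the "65535 points of tier0" limit in the commit text


@dataclass
class TierPoint:
    """One aggregated point of a higher tier (hypothetical layout)."""
    min_value: float
    max_value: float
    total: float
    count: int
    anomaly_count: int

    @property
    def average(self) -> float:
        return self.total / self.count

    @property
    def anomaly_rate(self) -> float:
        # Percentage of the aggregated samples that were flagged anomalous.
        return 100.0 * self.anomaly_count / self.count


def aggregate_tier(samples: List[Tuple[float, bool]], group: int) -> List[TierPoint]:
    """Collapse (value, is_anomalous) samples into one TierPoint per `group` samples."""
    if not 1 <= group <= MAX_GROUP:
        raise ValueError("group must be between 1 and MAX_GROUP")
    points = []
    for start in range(0, len(samples), group):
        chunk = samples[start:start + group]
        values = [value for value, _ in chunk]
        anomalies = sum(1 for _, anomalous in chunk if anomalous)
        points.append(TierPoint(min(values), max(values), sum(values), len(values), anomalies))
    return points


# Eight tier-0 samples, every fourth one flagged as anomalous.
tier0 = [(float(i), i % 4 == 0) for i in range(8)]
tier1 = aggregate_tier(tier0, group=4)
```

Keeping the sum and the count rather than a precomputed average is what lets a query pick min, max, sum or average after the fact, and lets several tier points be merged again without losing accuracy.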
2022-06-30fix RAM calculation on macOS in system-info (#13260)Ilya Mashchenko
2022-06-29Query engine with natural and virtual points (#13248)Costa Tsaousis
* new query engine * use Index * Revert change that changed in-memory page indexing to start time - update_every + 1 * use internal_error() to cleanup the code * interpolates values when generating points Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-06-28Dictionaries with reference counters and full deletion support during traversal (#13195)Costa Tsaousis
* dont use atomic operations when not needed; detect misuse of the unsafe functions * use relaxed atomic operations for statistics * use relaxed atomic operations for statistics * dictionaries now use reference counters, allowing deletions of any item while traversing it * added acquire/release interface to dictionaries * added unittest for reference counters * added NETDATA_INTERNAL_CHECKS logs to detect non-exclusive access to crucial parts of the dictionaries * dictionaries cannot be deleted while there are referenced items in them - they will be deleted once the last item gets unreferenced * cleanup * properly cleanup released items * maintain counters for readers and writers; defer all deletes on sorted walkthrough; cleaner internal_error(); * somewhat faster reference counters on single threaded dictionaries * minor optimizations; allow compiling without internal checks
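The behaviour listed above (items are reference counted, an acquire/release interface exists, and an item deleted while it is still referenced is only removed once the last reference is released) can be sketched as follows. This is a single-threaded Python illustration, not the agent's dictionary code: `RefCountedDict` and its method names are hypothetical, and no locking or atomics are modelled.

```python
class RefCountedDict:
    """Dictionary whose items can be deleted while they are being traversed."""

    def __init__(self):
        # name -> [value, reference count, deletion pending]
        self._items = {}

    def set(self, name, value):
        item = self._items.get(name)
        if item is None:
            self._items[name] = [value, 0, False]
        else:
            item[0] = value
            item[2] = False  # re-inserting cancels a pending deletion

    def acquire(self, name):
        """Return the value and take a reference, or None if absent or being deleted."""
        item = self._items.get(name)
        if item is None or item[2]:
            return None
        item[1] += 1
        return item[0]

    def release(self, name):
        """Drop a reference; perform the deferred deletion when it was the last one."""
        item = self._items[name]
        item[1] -= 1
        if item[1] == 0 and item[2]:
            del self._items[name]

    def delete(self, name):
        item = self._items.get(name)
        if item is None:
            return
        if item[1] > 0:
            item[2] = True  # still referenced: defer the removal
        else:
            del self._items[name]

    def walk(self):
        """Yield (name, value) pairs, holding a reference to the current item."""
        for name in list(self._items):  # snapshot of names, so removals are safe
            value = self.acquire(name)
            if value is None:
                continue  # removed (or pending removal) since the snapshot
            try:
                yield name, value
            finally:
                self.release(name)

    def __len__(self):
        return len(self._items)


d = RefCountedDict()
for key in ("a", "b", "c"):
    d.set(key, key.upper())

visited = []
for name, value in d.walk():
    visited.append((name, value))
    d.delete(name)  # deleting the item currently being visited is allowed
```

After the loop the dictionary is empty: each deletion was deferred until the traversal released its reference to that item.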
2022-06-28netdata doubles (#13217)Costa Tsaousis
* netdata doubles * fix cmocka test * fix cmocka test again * fix left-overs of long double to NETDATA_DOUBLE * RRDDIM detached from disk representation; db settings in [db] section of netdata.conf * update the memory before saving * rrdset is now detached from file structures too * on memory mode map, update the memory mapped structures on every iteration * allow RRD_ID_LENGTH_MAX to be changed * granularity secs, back to update every * fix formatting * more formatting
2022-06-28Update netdata commands (#13080)Tasos Katsoulas
* Update netdata commands Adding the `-W buildinfo` options. * Update README.md * Update README.md * Update daemon/README.md Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud> * Update daemon/README.md Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud> * also add the change in the daemon command line help message Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> * remove whitespace Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
2022-06-28Add more sqlite unittests (#13227)Stelios Fragkakis
2022-06-27Removes Legacy JSON Cloud Protocol Support In Agent (#13111)Timotej S
* removes old protocol support (cloud removed support already)
2022-06-24Add user plugin dirs to environment (#13203)Vladimir Kobal
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
2022-06-23Add configuration for dbengine page fetch timeout and retry count (#13194)Stelios Fragkakis
* Add configuration for page cache fetch timeout and retry count Change page cache wait default timeout to 3 seconds * Issue info message in the error.log if values not within expected lower range * Fix compilation errors with --disable-dbengine
2022-06-22Query Engine multi-granularity support (and MC improvements) (#13155)Costa Tsaousis
* set grouping functions * storage engine should check the validity of timestamps, not the query engine * calculate and store in RRDR anomaly rates for every query * anomaly rate used by volume metric correlations * mc volume should use absolute data, to avoid cancelling effect * return anomaly-rates in jasonwrap with jw-anomaly-rates option to data queries * dont return null on anomaly rates * allow passing group query options from the URL * added countif to the query engine and used it in metric correlations * fix configure * fix countif and anomaly rate percentages * added group_options to metric correlations; updated swagger * added newline at the end of yaml file * always check the time the highlighted window was above/below the highlighted window * properly track time in memory queries * error for internal checks only * moved pack_storage_number() into the storage engines * moved unpack_storage_number() inside the storage engines * remove old comment * pass unit tests * properly detect zero or subnormal values in pack_storage_number() * fill nulls before the value, not after * make sure math.h is included * workaround for isfinite() * fix for isfinite() * faster isfinite() alternative * fix for faster isfinite() alternative * next_metric() now returns end_time too * variable step implemented in a generic way * remove left-over variables * ensure we always complete the wanted number of points * fixes * ensure no infinite loop * mc-volume-improvements: Add information about invalid condition * points should have a duration in the past * removed unneeded info() line * Fix unit tests for exporting engine * new_point should only be checked when it is fetched from the db; better comment about the premature breaking of the main query loop Co-authored-by: Thiago Marques <thiagoftsm@gmail.com> Co-authored-by: Vladimir Kobal <vlad@prokk.net>
2022-06-17Revert "Configurable storage engine for Netdata agents: step 3 (#12892)" (#13171)vkalintiris
This reverts commit 100a12c6cc01222b1518e5e50d2147f592d8a111. A couple of parent/child startup/shutdown scenarios can lead to crashes.
2022-06-16Configurable storage engine for Netdata agents: step 3 (#12892)Adrien Béraud
* storage engine: add host context API Add a new API to allow storage engines to manage host contexts. * Replace single global context with per-engine global context * Context is full managed by storage engines: a storage engine can use no context, a global engine context, per host contexts, or a mix of these. * Currently, only dbengine uses contexts. Following the current logic, legacy hosts use their own context, while non-legacy hosts share the global context. * storage engine: use empty function instead of null for context ops * rrdhost: don't check return value for void call * rrdhost: create context with host * storage engine: move rrddim ops to rrddim_mem.{c,h} * storage engine: don't use NULL for end-of-list marker * storage engine: fallback to default engine
2022-06-16Fix labels unit test (#13156)Stelios Fragkakis
2022-06-16Remove pinned page reference (#13108)Stelios Fragkakis
* Disable reference to prev_descr as we do not keep two pages pinned * Remove extra pinned page from page cache calculations * Removed invalid comment * Remove unused variable
2022-06-15Add an option to use malloc for page cache instead of mmap (#13142)Stelios Fragkakis
Add an option to switch to using malloc for page cache instead of mmap
2022-06-1373x times faster metrics correlations at the agent (#13107)Costa Tsaousis
* faster correlations * 4x times faster correlations * a little bit more help * 10x times faster metrics correlations * 6 digits precision; better comments * enabled metrics correlations by default * abstracted DIFFS_NUMBER to allow easily changing it * reworked the entire logic to have more accuracy and support a baseline that is power of two multiple of highlight * properly calculate shifts * even more improved version * added support for timeout; fixed another memory leak; skipped hidden dimensions * default timeout 1min * reduce memory even further * use dictionary for the list of charts and optimize locks * return 403 forbidden, when mc is not enabled * added query options * dont process zero dimensions * added volume method as an option to metric correlations ; now metric correlations can support multiple implementations * make sure we will never crash * spread results evenly for both kstwo and volume * fixed bug in query engine that was missing misaligned queries when a single point was requested from the db; improved comments; improved query flags * updated swagger and added sane defaults; query options are now supported, including anomaly-bit * added "raw" option to allow cross node correlations; added "group" option to allow different time aggregations; allowed calling metric correlations without any parameters; allowed calling metric correlations with relative timestamps; added timeout to volume method; properly handled timeout on ks2 method; json output now sends all parameters back - same for json_wrap; modified query engine to use present time for relative timestamps; modified "allow_past" to mean both past backwards and forwards * emulate the old behaviour about zero points * 100% accuracy against python ks_2samp(); now the default is volume and the default points are 500 * added config option to change default metric correlations method * removed work-arounds now that rrdlabels are merged
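The commit above names two correlation methods, `ks2` and `volume`, and reports that the former matches Python's `ks_2samp()`. As a reminder of what a two-sample Kolmogorov-Smirnov comparison measures, here is a minimal sketch of the KS statistic alone, the largest gap between the empirical distribution functions of a baseline window and a highlighted window. It is not the agent's implementation and omits the p-value; `ks_statistic` and its parameter names are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(baseline: Sequence[float], highlight: Sequence[float]) -> float:
    """Return max |F_baseline(x) - F_highlight(x)| over all observed values x."""
    if not baseline or not highlight:
        raise ValueError("both samples must be non-empty")
    a = sorted(baseline)
    b = sorted(highlight)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])
```

A value near 0 means the two windows look alike, while a value near 1 means the metric behaved very differently during the highlighted window.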
2022-06-13Labels with dictionary (#13070)Costa Tsaousis
* squashed and rebased to master * fix overflow and single character bug in sanitize; include rrd.h instead of node_info.h * added unittest for UTF-8 multibyte sanitization * Fix unit test compilation * Fix CMake build * remove double sanitizer for opentsdb; cleanup sanitize_json_string() * rename error_description to error_message to avoid conflict with json-c * revert last and undef error_description from json-c * more unittests; attempt to fix protobuf map issue * get rid of rrdlabels_get() and replace it with a safe version that writes the value to a buffer * added dictionary sorting unittest; rrdlabels_to_buffer() now is sorted * better sorted dictionary checking * proper unittesting for sorted dictionaries * call dictionary deletion callback when destroying the dictionary * remove obsolete variable * Fix exporting unit tests * Fix k8s label parsing test * workaround for cmocka and strdupz() * Bypass cmocka memory allocation check * Revert "Bypass cmocka memory allocation check" This reverts commit 4c49923839d9229bea23ca914dd8a0be1ebe2bf4. * Revert "workaround for cmocka and strdupz()" This reverts commit 7bebee04801db1865c748a7896d5fa54bb7104a5. * Bypass cmocka memory allocation checks * respect json formatting for chart labels * cloud sends colons * print the value only once * allow parenthesis in values and spaces; make stream sender send quotes for values Co-authored-by: Vladimir Kobal <vlad@prokk.net>
2022-06-13fix virtualization detection on FreeBSD (#13087)Ilya Mashchenko
2022-06-01Dictionary with JudyHS and double linked list (#13032)Costa Tsaousis
* dictionary internals isolation * more dictionary cleanups * added unit test * we should use DICT internally * disable cups in cmake * implement DICTIONARY with Judy arrays * operational JUDY implementation * JUDY cleanup * JUDY summary added * JudyHS implementation with double linked list * test negative searches too * optimize destruction * optimize set to insert first without lookup * updated stats * code cleanup; better organization; updated info * more code cleanup and commenting * more cleanup, renames and comments * fix rename * more cleanups * use Judy.h from system paths * added foreach traversal; added flag to add item in front; isolated locks to their own functions; destruction returns the number of bytes freed * more comments; flags are now 16-bit * completed unittesting * addressed comments and added reference counters maintainance * added unittest in main; tested removal of items in front, back and middle * added read/write walkthrough and foreach; allowed walkthrough and foreach in write mode to delete the current element (used by cups.plugin); referenced counters removed from the API * DICTFE.name should be const too * added API calls for exposing all statistics * dictionary flags as enum and reference counters as atomic operations * more comments; improved error handling at unit tests * added functions to allow unsafe access while traversing the dictionary with locks in place * check for libcups in cmake * added delete callback; implemented statsd with this dictionary * added missing dfe_done() * added alternative implementation with AVL * added documentation * added comments and warning about AVL * dictionary walktrhough on new code * simplified foreach; updated docs * updated docs * AVL is much faster without hashes * AVL should follow DBENGINE
2022-05-24Run the /net/dev module of the proc plugin in a separate thread (#12996)Vladimir Kobal
2022-05-24Fix compilation warnings (#12993)Vladimir Kobal
2022-05-23Make heartbeat a static chart (#12986)Emmanuel Vasilakis
2022-05-20chore: check link local address before querying cloud instance metadata (#12973)Ilya Mashchenko
check link local address before querying cloud providers data
2022-05-20fix: keep virtualization unknown if all used commands are not available (#12964)Ilya Mashchenko
2022-05-18detailed dbengine stats (#12948)Costa Tsaousis
2022-05-17feat: move dirs, logs, and env vars config options to separate sections (#12935)Ilya Mashchenko
2022-05-17Reduce timeout to 1 second for getting cloud instance info (#12941)Emmanuel Vasilakis
2022-05-16fix virtualization detection when `systemd-detect-virt` is not available (#12911)Ilya Mashchenko
2022-05-16fix `[global statistics]` section in netdata.conf (#12916)Ilya Mashchenko
2022-05-10workers fixes and improvements (#12863)Costa Tsaousis
2022-05-10Initialize the metadata database when performing dbengine stress test (#12861)Stelios Fragkakis
* Remove error (no real value) * Add a parameter to create an in-memory database for stress testing * Add a new parameter to the stresstest command to set the number of desired libuv worker threads
2022-05-09Workers utilization charts (#12807)Costa Tsaousis
* initial version of worker utilization * working example * without mutexes * monitoring DBENGINE, ACLKSYNC, WEB workers * added charts to monitor worker usage * fixed charts units * updated contexts * updated priorities * added documentation * converted threads to stacked chart * One query per query thread * Revert "One query per query thread" This reverts commit 6aeb391f5987c3c6ba2864b559fd7f0cd64b14d3. * fixed priority for web charts * read worker cpu utilization from proc * read workers cpu utilization via /proc/self/task/PID/stat, so that we have cpu utilization even when the jobs are too long to finish within our update_every frequency * disabled web server cpu utilization monitoring - it is now monitored by worker utilization * tight integration of worker utilization to web server * monitoring statsd worker threads * code cleanup and renaming of variables * contrained worker and statistics conflict to just one variable * support for rendering jobs per type * better priorities and removed the total jobs chart * added busy time in ms per job type * added proc.plugin monitoring, switch clock to MONOTONIC_RAW if available, global statistics now cleans up old worker threads * isolated worker thread families * added cgroups.plugin workers * remove unneeded dimensions when then expected worker is just one * plugins.d and streaming monitoring * rebased; support worker_is_busy() to be called one after another * added diskspace plugin monitoring * added tc.plugin monitoring * added ML threads monitoring * dont create dimensions and charts that are not needed * fix crash when job types are added on the fly * added timex and idlejitter plugins; collected heartbeat statistics; reworked heartbeat according to the POSIX * the right name is heartbeat for this chart * monitor streaming senders * added streaming senders to global stats * prevent division by zero * added clock_init() to external C plugins * added freebsd and macos plugins * added freebsd and macos to global 
statistics * dont use new as a variable; address compiler warnings on FreeBSD and MacOS * refactored contexts to be unique; added health threads monitoring Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-05-07fix memory leaks and mismatches of the use of the z functions for allocations (#12841)Costa Tsaousis
* fix mismatches of the use of the z functions for allocations * when there was no memory, the original name of the dimensions was freed, and with a mismatching deallocator * fixed memory leak at rrdeng_load_metric_*() functions * fixed memory leak on exit of plugins.d parser * fixed memory leak on plugins and streaming receiver threads exit * fixed compiler warnings
2022-05-04* Add a parameter for the libuv worker threads to pre-initialize (#12814)Stelios Fragkakis
* Set the thread name for libuv threads to LIBUV_WORKER * Make sure the dbengine thread has the correct name
2022-05-04Metric correlations (#12582)Emmanuel Vasilakis
* initial attempt at metric correlations * fix loop * simplify struct * change json * get points from query * comment * dont lock the host as much * add a configuration option to enable/disable metric correlations * remove KSfbar from header file * lock charts * add timeout * cast multiplication * add licencing info * better licencing * use onewayalloc * destroy owa
2022-05-03Remove node.d.plugin and relevant files (#12769)Suraj Neupane
* Remove node.d.plugin and relevant files * fix build packages * remove node.d related words/phrases from docs and tests
2022-05-03Trace rwlocks of netdata (#12785)Costa Tsaousis
* with -DNETDATA_INTERNAL_CHECKS=1 enable rwlocks tracing * fix strings alignment on terminal * remove wrong addition * removed formatting warning; now counting active locks per thread; tracing is enabled with -DNETDATA_TRACE_RWLOCKS=1 * added the missing netdata_mutex_destroy() * optimized clocks usage in locks * added also main * fixed formatting warning * add compiler warning when compiling with -DNETDATA_TRACE_RWLOCKS=1 * cleanup and documentation * fix for old variable * >= not just > to allow proper comparisons * dont print 0x twice and print the lock pointer on every line * trace locks deeper
2022-05-03One way allocator to double the speed of parallel context queries (#12787)Costa Tsaousis
* one way allocator to speed up context queries * fixed a bug while expanding memory pages * reworked for clarity and finally fixed the bug of allocating memory beyond the page size * further optimize allocation step to minimize the number of allocations made * implement strdup with memcpy instead of strcpy * added documentation * prevent an uninitialized use of owa * added callocz() interface * integrate onewayalloc everywhere - apart sql queries * one way allocator is now used in context queries using archived charts in sql * align on the size of pointers * forgotten freez() * removed not needed memcpys * give unique names to global variables to avoid conflicts with system definitions
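A one way allocator hands out memory by advancing an offset inside a page and releases everything in a single step when the query finishes, so individual frees cost nothing. The sketch below illustrates that idea, including the pointer-size alignment and the "never allocate beyond the page size" concern mentioned in the commit. It is a Python model over `bytearray` pages, not the agent's C allocator; `OneWayAllocator`, `alloc`, `destroy` and the 8-byte alignment are assumptions made for the example.

```python
class OneWayAllocator:
    """Bump allocator: allocations only move forward, everything is freed at once."""

    ALIGN = 8  # assumption: pointer size on a 64-bit platform

    def __init__(self, page_size: int = 4096):
        self.page_size = page_size
        self.pages = []   # list of bytearray pages
        self.offset = 0   # next free byte in the last page

    def alloc(self, size: int) -> memoryview:
        if size <= 0:
            raise ValueError("size must be positive")
        # Round the request up to a multiple of ALIGN.
        size = (size + self.ALIGN - 1) // self.ALIGN * self.ALIGN
        # Open a new page when the request does not fit in the current one;
        # oversized requests get a page of their own size.
        if not self.pages or self.offset + size > len(self.pages[-1]):
            self.pages.append(bytearray(max(self.page_size, size)))
            self.offset = 0
        start = self.offset
        self.offset += size
        return memoryview(self.pages[-1])[start:start + size]

    def destroy(self) -> None:
        """Drop every page in one step; there is no per-allocation free."""
        self.pages.clear()
        self.offset = 0


owa = OneWayAllocator(page_size=64)
first = owa.alloc(10)    # rounded up to 16 bytes
second = owa.alloc(40)   # still fits in the first page (16 + 40 <= 64)
third = owa.alloc(16)    # 56 + 16 > 64, so a second page is opened
pages_in_use = len(owa.pages)
owa.destroy()
```

The trade-off is that memory is never reused during the lifetime of the allocator, which suits short-lived work such as a single query.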
2022-05-02Make atomics a hard-dep. (#12730)vkalintiris
They are used extensively throughout our code base, and not having support for them does not generate a thread-safe agent.
2022-04-27fix: use `diskutil info` to calculate the disk size on macOS (#12764)Ilya Mashchenko
2022-04-25fix(cgroups.plugin): remove "enable cgroup X" config option on cgroup deletion (#12746)Ilya Mashchenko
2022-04-11Add a timeout parameter to data queries (#12649)Stelios Fragkakis
* Add timeout parameter in queries and in calling functions * Add CANCEL flag in RRDR and code to cancel a query * Update swagger * Format swagger file properly
2022-04-11feat: add k8s_cluster_name host tag (GKE only) (#12638)Ilya Mashchenko