netdata - Mirror of https://github.com/netdata/netdata

Age	Commit message (Collapse)	Author
2023-02-08	Only load required charts for rrdvars (#14443)	Emmanuel Vasilakis
	* store only rrdvars health needs * make it simpler * only set * fix codacy
2023-02-08	Add export for people running their own registry (#14457)	Chris Akritidis
	See https://github.com/netdata/netdata/issues/3495#issuecomment-1408452259
2023-02-02	fix kubelet alarms (#14414)	Ilya Mashchenko

2023-02-02	Covert our documentation links to GH absolute links (#14344)	Tasos Katsoulas
	Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
2023-01-31	Update the notifications/integrations docs (#14335)	Hugo Valente
	* Add the docs for the newly added notification/integrations methods of the cloud. Notifications: Discord/PagerDuty/Slack/Generic WebHook * Update docs related to; Managing notification with the new methods. Co-authored-by: Shyam Sreevalsan <shyam@netdata.cloud>
2023-01-30	Add main health readme to learn (#14356)	Chris Akritidis

2023-01-30	Delete QUICKSTART.md (#14355)	Chris Akritidis
	The info is already in the main README
2023-01-27	minor fix on notification doc (Discord) (#14339)	Tasos Katsoulas
	Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
2023-01-27	DBENGINE v2 - improvements part 10 (#14332)	Costa Tsaousis
	* replication cancels pending queries on exit * log when waiting for inflight queries * when there are collected and not-collected metrics, use the context priority from the collected only * Write metadata with a faster pace * Remove journal file size limit and sync mode to 0 / Drop wal checkpoint for now * Wrap in a big transaction remaining metadata writes (test 1) * fix higher tiers when tiering iterations = 2 * dbengine always returns db-aligned points; query engine expands the queries by 2 points in every direction to have enough data for interpolation * Wrap in a big transaction metadata writes (test 2) * replication cancelling fix * do not first and last entry in replication when the db has no retention * fix internal check condition * Increase metadata write batch size * always apply error limit to dbengine logs * Remove code that processes the obsolete health.db files * cleanup in query.c * do not allow queries to go beyond db boundaries * prevent internal log for +1 delta in timestamp * detect gap pages in conflicts * double protection for gap injection in main cache * Add checkpoint to prevent large WAL while running Remove unused and duplicate functions * do not allocate chart cache dir if not needed * add more info to unittests * revert query expansion to satisfy unittests Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-26	Add \|nowarn and \|noclear notification modifiers (#14330)	Martin Vobruba

2023-01-25	Introduce the new Structure of the documentation (#13915)	Fotis Voutsas
	* Moving the cloud docs under /docs/cloud (previous location: netdata/learn/) Added metadata on almost every document of the old learn site for the new ingest process of learn. * Map old learn document to their best fit as topic related docs. Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud> Co-authored-by: DShreve2 <david@netdata.cloud> Co-authored-by: hugovalente-pm <hugo@netdata.cloud>
2023-01-23	DBENGINE v2 - improvements part 7 (#14307)	Costa Tsaousis
	* run cleanup in workers * when there is a discrepancy between update every, fix it * fix the other occurences of metric update every mismatch * allow resetting the same timestamp * validate flushed pages before committing them to disk * initialize collection with the latest time in mrg * these should be static functions * acquire metrics for writing to detect multiple data collections of the same metric * print the uuid of the metric that is collected twice * log the discrepancies of completed pages * 1 second tolerance * unify validation of pages and related logging across dbengine * make do_flush_pages() thread safe * flush pages runs on libuv workers * added uv events to tp workers * dont cross datafile spinlock and rwlock * should be unlock * prevent the creation of multiple datafiles * break an infinite replication loop * do not log the epxansion of the replication window due to start streaming * log all invalid pages with internal checks * do not shutdown event loop threads * add information about collected page events, to find the root cause of invalid collected pages * rewrite of the gap filling to fix the invalid collected pages problem * handle multiple collections of the same metric gracefully * added log about main cache page conflicts; fix gap filling once again... * keep track of the first metric writer * it should be an internal fatal - it does not harm users * do not check of future timestamps on collected pages, since we inherit the clock of the children; do not check collected pages validity without internal checks * prevent negative replication completion percentage * internal error for the discrepancy of mrg * better logging of dbengine new metrics collection * without internal checks it is unused * prevent pluginsd crash on exit due to calling pthread_cancel() on an exited thread * renames and atomics everywhere * if a datafile cannot be acquired for deletion during shutdown, continue - this can happen when there are hot pages in open cache referencing it * Debug for context load * rrdcontext uuid debug * rrddim uuid debug * rrdeng uuid debug * Revert "rrdeng uuid debug" This reverts commit 393da190826a582e7e6cc90771bf91b175826d8b. * Revert "rrddim uuid debug" This reverts commit 72150b30408294f141b19afcfb35abd7c34777d8. * Revert "rrdcontext uuid debug" This reverts commit 2c3b940dc23f460226e9b2a6861c214e840044d0. * Revert "Debug for context load" This reverts commit 0d880fc1589f128524e0b47abd9ff0714283ce3b. * do not use legacy uuids on multihost dbs * thread safety for journafile size * handle other cases of inconsistent collected pages * make health thread check if it should be running in key loops * do not log uuids Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-20	add consul license expiration time alarm (#14298)	Ilya Mashchenko
	* add consul license alarm * minor
2023-01-20	track memory footprint of Netdata (#14294)	Costa Tsaousis
	* track memory footprint of Netdata * track db modes alloc/ram/save/map * track system info; track sender and receiver * fixes * more fixes * track workers memory, onewayalloc memory; unify judyhs size estimation * track replication structures and buffers * Properly clear host RRDHOST_FLAG_METADATA_UPDATE flag * flush the replication buffer every 1000 times the circular buffer is found empty * dont take timestamp too frequently in sender loop * sender buffers are not used by the same thread as the sender, so they were never recreated - fixed it * free sender thread buffer on replication threads when replication is idle * use the last sender flag as a timestamp of the last buffer recreation * free cbuffer before reconnecting * recreate cbuffer on every flush * timings for journal v2 loading * inlining of metric and cache functions * aral likely/unlikely * free left-over thread buffers * fix NULL pointer dereference in replication * free sender thread buffer on sender thread too * mark ctx as used before flushing * better logging on ctx datafiles closing Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-18	fix(alarms): treat 0 processors as unknown in load_cpu_number (#14286)	Ilya Mashchenko

2023-01-18	Revert health to run in a single thread (#14244)	Emmanuel Vasilakis
	* revert health to single thread * remove getting now * use a health struct * remove commented code * cleanup health log from metdata * dont check for METADATA_UPDATE
2023-01-16	bump go.d.plugin to v0.49.1 (#14275)	Ilya Mashchenko

2023-01-13	fix(health): don't assume 2 cores if the number is unknown (#14265)	Ilya Mashchenko

2023-01-10	DBENGINE v2 (#14125)	Costa Tsaousis
	* count open cache pages refering to datafile * eliminate waste flush attempts * remove eliminated variable * journal v2 scanning split functions * avoid locking open cache for a long time while migrating to journal v2 * dont acquire datafile for the loop; disable thread cancelability while a query is running * work on datafile acquiring * work on datafile deletion * work on datafile deletion again * logs of dbengine should start with DBENGINE * thread specific key for queries to check if a query finishes without a finalize * page_uuid is not used anymore * Cleanup judy traversal when building new v2 Remove not needed calls to metric registry * metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent * disable checks for invalid time-ranges * Remove type from page details * report scanning time * remove infinite loop from datafile acquire for deletion * remove infinite loop from datafile acquire for deletion again * trace query handles * properly allocate array of dimensions in replication * metrics cleanup * metrics registry uses arrayalloc * arrayalloc free should be protected by lock * use array alloc in page cache * journal v2 scanning fix * datafile reference leaking hunding * do not load metrics of future timestamps * initialize reasons * fix datafile reference leak * do not load pages that are entirely overlapped by others * expand metric retention atomically * split replication logic in initialization and execution * replication prepare ahead queries * replication prepare ahead queries fixed * fix replication workers accounting * add router active queries chart * restore accounting of pages metadata sources; cleanup replication * dont count skipped pages as unroutable * notes on services shutdown * do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file * do not add pages we dont need to pdc * time in range re-work to provide info about past and future matches * finner control on the pages selected for processing; accounting of page related issues * fix invalid reference to handle->page * eliminate data collection handle of pg_lookup_next * accounting for queries with gaps * query preprocessing the same way the processing is done; cache now supports all operations on Judy * dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3) * get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query * finner gaps calculation; accounting of overlapping pages in queries * fix gaps accounting * move datafile deletion to worker thread * tune libuv workers and thread stack size * stop netdata threads gradually * run indexing together with cache flush/evict * more work on clean shutdown * limit the number of pages to evict per run * do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction * economies on flags for smaller page footprint; cleanup and renames * eviction moves referenced pages to the end of the queue * use murmur hash for indexing partition * murmur should be static * use more indexing partitions * revert number of partitions to number of cpus * cancel threads first, then stop services * revert default thread stack size * dont execute replication requests of disconnected senders * wait more time for services that are exiting gradually * fixed last commit * finer control on page selection algorithm * default stacksize of 1MB * fix formatting * fix worker utilization going crazy when the number is rotating * avoid buffer full due to replication preprocessing of requests * support query priorities * add count of spins in spinlock when compiled with netdata internal checks * remove prioritization from dbengine queries; cache now uses mutexes for the queues * hot pages are now in sections judy arrays, like dirty * align replication queries to optimal page size * during flushing add to clean and evict in batches * Revert "during flushing add to clean and evict in batches" This reverts commit 8fb2b69d068499eacea6de8291c336e5e9f197c7. * dont lock clean while evicting pages during flushing * Revert "dont lock clean while evicting pages during flushing" This reverts commit d6c82b5f40aeba86fc7aead062fab1b819ba58b3. * Revert "Revert "during flushing add to clean and evict in batches"" This reverts commit ca7a187537fb8f743992700427e13042561211ec. * dont cross locks during flushing, for the fastest flushes possible * low-priority queries load pages synchronously * Revert "low-priority queries load pages synchronously" This reverts commit 1ef2662ddcd20fe5842b856c716df134c42d1dc7. * cache uses spinlock again * during flushing, dont lock the clean queue at all; each item is added atomically * do smaller eviction runs * evict one page at a time to minimize lock contention on the clean queue * fix eviction statistics * fix last commit * plain should be main cache * event loop cleanup; evictions and flushes can now happen concurrently * run flush and evictions from tier0 only * remove not needed variables * flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time * added worker jobs in timer to find the slow part of it * support fast eviction of pages when all_of_them is set * revert default thread stack size * bypass event loop for dispatching read extent commands to workers - send them directly * Revert "bypass event loop for dispatching read extent commands to workers - send them directly" This reverts commit 2c08bc5bab12881ae33bc73ce5dea03dfc4e1fce. * cache work requests * minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors * publish flushed pages to open cache in the thread pool * prevent eventloop requests from getting stacked in the event loop * single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c * more rrdengine.c cleanup * enable db rotation * do not log when there is a filter * do not run multiple migration to journal v2 * load all extents async * fix wrong paste * report opcodes waiting, works dispatched, works executing * cleanup event loop memory every 10 minutes * dont dispatch more work requests than the number of threads available * use the dispatched counter instead of the executing counter to check if the worker thread pool is full * remove UV_RUN_NOWAIT * replication to fill the queues * caching of extent buffers; code cleanup * caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation * single transaction wal * synchronous flushing * first cancel the threads, then signal them to exit * caching of rrdeng query handles; added priority to query target; health is now low prio * add priority to the missing points; do not allow critical priority in queries * offload query preparation and routing to libuv thread pool * updated timing charts for the offloaded query preparation * caching of WALs * accounting for struct caches (buffers); do not load extents with invalid sizes * protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset * also check if the expanded before is not the chart later updated time * also check if the expanded before is not after the wall clock time of when the query started * Remove unused variable * replication to queue less queries; cleanup of internal fatals * Mark dimension to be updated async * caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol) * disable pgc stress test, under an ifdef * disable mrg stress test under an ifdef * Mark chart and host labels, host info for async check and store in the database * dictionary items use arrayalloc * cache section pages structure is allocated with arrayalloc * Add function to wakeup the aclk query threads and check for exit Register function to be called during shutdown after signaling the service to exit * parallel preparation of all dimensions of queries * be more sensitive to enable streaming after replication * atomically finish chart replication * fix last commit * fix last commit again * fix last commit again again * fix last commit again again again * unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication * do not cancel start streaming; use high priority queries when we have locked chart data collection * prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered * opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue * Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore * Fix bad memory allocation / cleanup * Cleanup ACLK sync initialization (part 1) * Don't update metric registry during shutdown (part 1) * Prevent crash when dashboard is refreshed and host goes away * Mark ctx that is shutting down. Test not adding flushed pages to open cache as hot if we are shutting down * make ML work * Fix compile without NETDATA_INTERNAL_CHECKS * shutdown each ctx independently * fix completion of quiesce * do not update shared ML charts * Create ML charts on child hosts. When a parent runs a ML for a child, the relevant-ML charts should be created on the child host. These charts should use the parent's hostname to differentiate multiple parents that might run ML for a child. The only exception to this rule is the training/prediction resource usage charts. These are created on the localhost of the parent host, because they provide information specific to said host. * check new ml code * first save the database, then free all memory * dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db * Cleanup metadata thread (part 2) * increase refcount before dispatching prep command * Do not try to stop anomaly detection threads twice. A separate function call has been added to stop anomaly detection threads. This commit removes the left over function calls that were made internally when a host was being created/destroyed. * Remove allocations when smoothing samples buffer The number of dims per sample is always 1, ie. we are training and predicting only individual dimensions. * set the orphan flag when loading archived hosts * track worker dispatch callbacks and threadpool worker init * make ML threads joinable; mark ctx having flushing in progress as early as possible * fix allocation counter * Cleanup metadata thread (part 3) * Cleanup metadata thread (part 4) * Skip metadata host scan when running unittest * unittest support during init * dont use all the libuv threads for queries * break an infinite loop when sleep_usec() is interrupted * ml prediction is a collector for several charts * sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit * worker_unregister() in netdata threads cleanup * moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h * fixed ML issues * removed engine2 directory * added dbengine2 files in CMakeLists.txt * move query plan data to query target, so that they can be exposed by in jsonwrap * uniform definition of query plan according to the other query target members * event_loop should be in daemon, not libnetdata * metric_retention_by_uuid() is now part of the storage engine abstraction * unify time_t variables to have the suffix _s (meaning: seconds) * old dbengine statistics become "dbengine io" * do not enable ML resource usage charts by default * unify ml chart families, plugins and modules * cleanup query plans from query target * cleanup all extent buffers * added debug info for rrddim slot to time * rrddim now does proper gap management * full rewrite of the mem modes * use library functions for madvise * use CHECKSUM_SZ for the checksum size * fix coverity warning about the impossible case of returning a page that is entirely in the past of the query * fix dbengine shutdown * keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently * fine tune cache evictions * dont initialize health if the health service is not running - prevent crash on shutdown while children get connected * rename AS threads to ACLK[hostname] * prevent re-use of uninitialized memory in queries * use JulyL instead of JudyL for PDC operations - to test it first * add also JulyL files * fix July memory accounting * disable July for PDC (use Judy) * use the function to remove datafiles from linked list * fix july and event_loop * add july to libnetdata subdirs * rename time_t variables that end in _t to end in _s * replicate when there is a gap at the beginning of the replication period * reset postponing of sender connections when a receiver is connected * Adjust update every properly * fix replication infinite loop due to last change * packed enums in rrd.h and cleanup of obsolete rrd structure members * prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests() * void unused variable * void unused variables * fix indentation * entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps * macros to caclulate page entries by time and size * prevent statsd cleanup crash on exit * cleanup health thread related variables Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com> Co-authored-by: vkalintiris <vasilis@netdata.cloud>
2023-01-09	fix consul_raft_leadership_transitions alarm units (#14232)pre-dbengine2	Ilya Mashchenko

2023-01-06	Add few alarms for elasticsearch (#14197)	Ilya Mashchenko
	Fixes https://github.com/netdata/netdata/issues/14184
2023-01-05	Use brackets around info variables (#14206)	Emmanuel Vasilakis
	* use brackets around vars * update doc * change doc * try fix codacy * fix 2
2022-12-23	use specific charts labels instead of family in alarms (#14173)	Ilya Mashchenko

2022-12-23	add alarms and dashboard info for Consul (#14163)	Ilya Mashchenko

2022-12-19	Add a health configuration option of what alarms to load (#14150)	Emmanuel Vasilakis
	* use a simple pattern to choose what alerts to load * rename config to alarms * add config entry doc * fix link * change wording * change wording 2 * add to analytics * build a list of disabled alarms for analytics * add silenced alarms * remove analytics related * remove analytics related 2
2022-12-16	Populate field values in send_slack() for Mattermost (#14153)	je2555
	Closes https://github.com/netdata/netdata/issues/14152
2022-12-14	use chart labels in portcheck alarms (#14137)	Ilya Mashchenko

2022-12-13	fix httpcheck alarms (#14133)	Ilya Mashchenko

2022-12-09	remove deprecated fping.plugin in accordance with v1.37.0 deprecation notice ↵	Ilya Mashchenko
	(#14073)
2022-12-08	fix health emphemerality labels src (#14105)	Ilya Mashchenko
	Fixes https://github.com/netdata/netdata/issues/14084
2022-11-29	Sanitize command arguments. (#14064)	vkalintiris
	* Sanitize bash arguments. Remove leading dashes and escape single quotes in command arguments. * Quote expanded variable in test
2022-11-28	replication fixes No 7 (#14053)	Costa Tsaousis
	* move global statistics workers to a separate thread; query statistics per query source; query statistics for ML, exporters, backfilling; reset replication point in time every 10 seconds, instead of every 1; fix compilation warnings; optimize the replication queries code; prevent long tail of replication requests (big sleeps); provide query statistics about replication ; optimize replication sender when most senders are full; optimize replication_request_get_first_available(); reset replication completion calculation; * remove workers utilization from global statistics thread
2022-11-21	Change static image urls to app.netdata.cloud in alarm-notify.sh (#14007)	Emmanuel Vasilakis
	change image labels to app.netdata.cloud
2022-11-20	streaming compression, query planner and replication fixes (#14023)	Costa Tsaousis
	* streaming compression, query planner and replication fixes * remove journal v2 stats from global statistics * disable sql for checking past sql UUIDs * single threaded replication * final replication thread using dictionaries and JudyL for sorting the pending requests * do not timeout the sending socket when there are pending replication requests * streaming receiver using read() instead of fread() * remove FILE * from streaming - now using posix read() and write() * increase timeouts to 10 minutes * apply sender timeout only when there are metrics that are supposed to be streamed * error handling in replication * remove retries on socket read timeout; better error messages * take into account inbound traffic too to detect that a connection is stale * remove race conditions from replication thread * make sure deleted entries are marked as executed, so that even if deletion fails, they will not be executed * 2 minutes timeout to retry streaming to a parent that already has this node * remove unecessary condition check * fix compilation warnings * include judy in replication * wrappers to handle retries for SSL_read and SSL_write * compressed bytes read monitoring * recursive locks on replication to make it faster during flush or cleanup * replication completion chart at the receiver side * simplified recursive mutex * simplified recursive mutex again
2022-11-10	Remove health_thread_stop (#13948)	Emmanuel Vasilakis
	* remove health_thread_stop * soft sleeps
2022-11-09	Remove anomaly rates chart. (#13763)	vkalintiris

2022-11-07	health/ping: use 'host' label in alerts info (#13955)	Ilya Mashchenko

2022-11-07	bump go.d.plugin to v0.43.0 (#13954)	Ilya Mashchenko

2022-11-03	recalculate last_collected_total (#13945)	Costa Tsaousis

2022-10-31	add ping dashboard info and alarms (#13916)	Ilya Mashchenko
	Co-authored-by: Shyam Sreevalsan <shyam@netdata.cloud>
2022-10-27	Fix reading health "enable" from the configuration (#13894)	Stelios Fragkakis

2022-10-24	app to api netdata cloud (#13856)	Timotej S

2022-10-19	Health thread per host (#13712)	Emmanuel Vasilakis
	* Rebased * rebased * health_execute_pending_updates -> health_execute_delayed_initializations * fix labels for current host only * missing bracket * misc fixes, reload health for disconnected hosts * remove volatile, add comment
2022-10-17	chore(health): rm pihole_blocklist_gravity_file_existence_state (#13826)	Ilya Mashchenko

2022-10-12	bump go.d.plugin v0.40.4 (#13808)	Ilya Mashchenko

2022-10-11	minor - add host labels for ephemerality and nodes with unstable connections ↵	Timotej S
	(#13784) * add host labels for ephemerality and nodes with unstable connections
2022-10-10	Fix minor typo in systemdunits.conf alert (#13796)	Tasos Katsoulas

2022-10-10	Also enable rrdvars from health (#13795)	Emmanuel Vasilakis
	also enable rrdvars from health
2022-10-10	Initialize st->rrdvars from rrdset insert callback (#13786)	Emmanuel Vasilakis
	* init st->rrdvars from rrdset callback * also init rrdfamily and rrdimvar from rrdset * remove empty line
2022-10-09	Remove extern from function declared in headers. (#13790)	vkalintiris
	By default functions are declared as extern in C/C++ headers. The goal of this PR is to reduce the wall of text that many headers have and, more importantly, to make the declaration of extern'd variables - of which we have many dispersed in various places - easily and quickly identifiable. Automatically generated with: $ git grep -l '^extern.(' '.h' \| \ grep -v libjudy \| \ grep -v 'sqlite3.h' \| \ xargs sed -i -e 's/extern $.(.*$$/\1/' This is a NFC.