Age | Commit message (Collapse) | Author |
|
Fixes https://github.com/netdata/learn/issues/1298
|
|
* acquiring / releasing interface for metrics
* metrics registry statistics
* cleanup metrics registry by deleting metrics when they dont have retention anymore; do not double copy the data of pages to be flushed
* print the tier in retention summary
* Open files with buffered instead of direct I/O (test)
* added more metrics stats and fixed unittest
* rename writer functions to avoid confusion with refcounting
* do not release a metric that is not acquired
* Revert to use direct I/O on write -- use direct I/O on read as well
* keep track of ARAL overhead and add it to the memory chart
* aral full check via api
* Cleanup
* give names to ARALs and PGCs
* aral improvements
* restore query expansion to the future
* prefer higher resolution tier when switching plans
* added extent read statistics
* smoother joining of tiers at query engine
* fine tune aral max allocation size
* aral restructuring to hide its internals from the rest of netdata
* aral restructuring; addtion of defrag option to aral to keep the linked list sorted - enabled by default to test it
* fully async aral
* some statistics and cleanup
* fix infinite loop while calculating retention
* aral docs and defragmenting disabled by default
* fix bug and add optimization when defragmenter is not enabled
* aral stress test
* aral speed report and documentation
* added internal checks that all pages are full
* improve internal log about metrics deletion
* metrics registry uses one aral per partition
* metrics registry aral max size to 512 elements per page
* remove data_structures/README.md dependency
---------
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
|
|
This will be changed when the new integrations page is launched.
Also, it needs to be the main page when expanding the contents, not just another page in there (@tkatsoulas will do that)
|
|
|
|
* on shutdown stop data collection for all hosts instead of freeing their memory
* print number of sql statements per metadata host scan
* print timings with metadata checking
* use dbengine API to figure out of a database is legacy
* Recalculate retention after a datafile deletion
* validate child timestamps during replication
* main cache uses a lockless aral per partition, protected by the partition index lock
* prevent ML crash
* Revert "main cache uses a lockless aral per partition, protected by the partition index lock"
This reverts commit 6afc01527dc5c66548b4bc8a1d63c026c3149358.
* Log direct index and binary searches
* distribute metrics more evenly across time
* statistics about retention recalculation
* fix crash
* Reverse the binary search to calculate retention
* more optimization on retention calculation
* removed commented old code
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
|
|
* Moving the cloud docs under /docs/cloud (previous location: netdata/learn/*)
* Added metadata on almost every document of the old learn site for the new ingest process of learn.
* Map old learn document to their best fit as topic related docs.
Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
Co-authored-by: DShreve2 <david@netdata.cloud>
Co-authored-by: hugovalente-pm <hugo@netdata.cloud>
|
|
* fix(proc.plugin): add cpu label to per core util% charts
* fix codeql warning
|
|
* cache 100 pages for each size our tiers need
* smarter page caching
* account the caching structures
* dynamic max number of cached pages
* make variables const to ensure they are not changed
* make sure replication timestamps do not go to the future
* replication now sends chart and dimension states atomically; replication receivers ignores chart and dimension states when rbegin is also ignored
* make sure all pages are flushed on shutdown
* take into account empty points too
* when recalculating retention update first_time_s on metrics only when they are bigger
* Report the datafile number we use to recalculate retention
* Report the datafile number we use to recalculate retention
* rotate db at startup
* make query plans overlap
* Calculate properly first time s
* updated event labels
* negative page caching fix
* Atempt to create missing tables on query failure
* Atempt to create missing tables on query failure (part 2)
* negative page caching for all gaps, to eliminate jv2 scans
* Fix unittest
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
* run cleanup in workers
* when there is a discrepancy between update every, fix it
* fix the other occurences of metric update every mismatch
* allow resetting the same timestamp
* validate flushed pages before committing them to disk
* initialize collection with the latest time in mrg
* these should be static functions
* acquire metrics for writing to detect multiple data collections of the same metric
* print the uuid of the metric that is collected twice
* log the discrepancies of completed pages
* 1 second tolerance
* unify validation of pages and related logging across dbengine
* make do_flush_pages() thread safe
* flush pages runs on libuv workers
* added uv events to tp workers
* dont cross datafile spinlock and rwlock
* should be unlock
* prevent the creation of multiple datafiles
* break an infinite replication loop
* do not log the epxansion of the replication window due to start streaming
* log all invalid pages with internal checks
* do not shutdown event loop threads
* add information about collected page events, to find the root cause of invalid collected pages
* rewrite of the gap filling to fix the invalid collected pages problem
* handle multiple collections of the same metric gracefully
* added log about main cache page conflicts; fix gap filling once again...
* keep track of the first metric writer
* it should be an internal fatal - it does not harm users
* do not check of future timestamps on collected pages, since we inherit the clock of the children; do not check collected pages validity without internal checks
* prevent negative replication completion percentage
* internal error for the discrepancy of mrg
* better logging of dbengine new metrics collection
* without internal checks it is unused
* prevent pluginsd crash on exit due to calling pthread_cancel() on an exited thread
* renames and atomics everywhere
* if a datafile cannot be acquired for deletion during shutdown, continue - this can happen when there are hot pages in open cache referencing it
* Debug for context load
* rrdcontext uuid debug
* rrddim uuid debug
* rrdeng uuid debug
* Revert "rrdeng uuid debug"
This reverts commit 393da190826a582e7e6cc90771bf91b175826d8b.
* Revert "rrddim uuid debug"
This reverts commit 72150b30408294f141b19afcfb35abd7c34777d8.
* Revert "rrdcontext uuid debug"
This reverts commit 2c3b940dc23f460226e9b2a6861c214e840044d0.
* Revert "Debug for context load"
This reverts commit 0d880fc1589f128524e0b47abd9ff0714283ce3b.
* do not use legacy uuids on multihost dbs
* thread safety for journafile size
* handle other cases of inconsistent collected pages
* make health thread check if it should be running in key loops
* do not log uuids
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
* track memory footprint of Netdata
* track db modes alloc/ram/save/map
* track system info; track sender and receiver
* fixes
* more fixes
* track workers memory, onewayalloc memory; unify judyhs size estimation
* track replication structures and buffers
* Properly clear host RRDHOST_FLAG_METADATA_UPDATE flag
* flush the replication buffer every 1000 times the circular buffer is found empty
* dont take timestamp too frequently in sender loop
* sender buffers are not used by the same thread as the sender, so they were never recreated - fixed it
* free sender thread buffer on replication threads when replication is idle
* use the last sender flag as a timestamp of the last buffer recreation
* free cbuffer before reconnecting
* recreate cbuffer on every flush
* timings for journal v2 loading
* inlining of metric and cache functions
* aral likely/unlikely
* free left-over thread buffers
* fix NULL pointer dereference in replication
* free sender thread buffer on sender thread too
* mark ctx as used before flushing
* better logging on ctx datafiles closing
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
|
|
* cleanup journal v2 mounts periodically
* fix for last commit
* re-enable loading page from disk when the arrangement of pages requires it
* Remove unused statistics
* Estimate diskspace when the current datafile is full and queue a rotate command (Currently it will not attempt to estimate end size for journals)
Queue a command to check quota on startup per tier
* apps.plugin now exposes RSS chart
* shorter thread names to make debugging easier, since thread names can only be 15 characters
* more thread names fixes
* allow an apps_groups.conf target to be pid 0 or 1
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
* reduce journal v2 shared memory using madvise() - not integrated yet
* working attempt to minimize dbengine shared memory
* never call willneed - let the kernel decide which parts of each file are really needed
* journal files get MADV_RANDOM
* dont call MADV_DONTNEED too frequently
* madvise() is always called with the journal unlocked but referenced
* call madvise() even less frequently
* added chart for monitoring database events
* turn batch mode on under critical conditions
* max size to evict is 1/4 of the max
* fix max size to evict calculation
* use dbengine_page/extent_alloc/free to pages and extents allocations, tracking also the size of these allocations at free time
* fix calculation for batch evictions
* allow main and open cache to have as many evictors as needed
* control inline evictors for each cache; report different levels of cache pressure on every cache evaluation
* more inline evictors for extent cache
* bypass max inline evictors above critical level
* current cache usage has to be taken
* re-arrange items in journafile
* updated docs - work in progress
* more docs work
* more docs work
* Map / unmap journal file
* draw.io diagram for dbengine operations
* updated dbengine diagram
* updated docs
* journal files v2 now get mapped and unmapped as needed
* unmap journal v2 immediately when getting retention
* mmap and munmap do not block queries evaluating journal files v2
* have only one unmap function
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
systemd service (#14255)
Fixes https://github.com/netdata/netdata/issues/14238
|
|
Fixes https://github.com/netdata/netdata/issues/14235
|
|
* rename "Pid" to "PID" in functions
I think it would be better for the case of "Pid" if we stuck to generally expected convention of "PID"
* `PPid` to `PPID`
* `Pid` to `PID` for pointer in PPID
* ignore "Gid" and "Uid" from this PR
|
|
rename NETDATA_ML_CHART_FAMILY from `ml - machine learning` to just `machine learning`
|
|
* count open cache pages refering to datafile
* eliminate waste flush attempts
* remove eliminated variable
* journal v2 scanning split functions
* avoid locking open cache for a long time while migrating to journal v2
* dont acquire datafile for the loop; disable thread cancelability while a query is running
* work on datafile acquiring
* work on datafile deletion
* work on datafile deletion again
* logs of dbengine should start with DBENGINE
* thread specific key for queries to check if a query finishes without a finalize
* page_uuid is not used anymore
* Cleanup judy traversal when building new v2
Remove not needed calls to metric registry
* metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent
* disable checks for invalid time-ranges
* Remove type from page details
* report scanning time
* remove infinite loop from datafile acquire for deletion
* remove infinite loop from datafile acquire for deletion again
* trace query handles
* properly allocate array of dimensions in replication
* metrics cleanup
* metrics registry uses arrayalloc
* arrayalloc free should be protected by lock
* use array alloc in page cache
* journal v2 scanning fix
* datafile reference leaking hunding
* do not load metrics of future timestamps
* initialize reasons
* fix datafile reference leak
* do not load pages that are entirely overlapped by others
* expand metric retention atomically
* split replication logic in initialization and execution
* replication prepare ahead queries
* replication prepare ahead queries fixed
* fix replication workers accounting
* add router active queries chart
* restore accounting of pages metadata sources; cleanup replication
* dont count skipped pages as unroutable
* notes on services shutdown
* do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file
* do not add pages we dont need to pdc
* time in range re-work to provide info about past and future matches
* finner control on the pages selected for processing; accounting of page related issues
* fix invalid reference to handle->page
* eliminate data collection handle of pg_lookup_next
* accounting for queries with gaps
* query preprocessing the same way the processing is done; cache now supports all operations on Judy
* dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3)
* get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query
* finner gaps calculation; accounting of overlapping pages in queries
* fix gaps accounting
* move datafile deletion to worker thread
* tune libuv workers and thread stack size
* stop netdata threads gradually
* run indexing together with cache flush/evict
* more work on clean shutdown
* limit the number of pages to evict per run
* do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction
* economies on flags for smaller page footprint; cleanup and renames
* eviction moves referenced pages to the end of the queue
* use murmur hash for indexing partition
* murmur should be static
* use more indexing partitions
* revert number of partitions to number of cpus
* cancel threads first, then stop services
* revert default thread stack size
* dont execute replication requests of disconnected senders
* wait more time for services that are exiting gradually
* fixed last commit
* finer control on page selection algorithm
* default stacksize of 1MB
* fix formatting
* fix worker utilization going crazy when the number is rotating
* avoid buffer full due to replication preprocessing of requests
* support query priorities
* add count of spins in spinlock when compiled with netdata internal checks
* remove prioritization from dbengine queries; cache now uses mutexes for the queues
* hot pages are now in sections judy arrays, like dirty
* align replication queries to optimal page size
* during flushing add to clean and evict in batches
* Revert "during flushing add to clean and evict in batches"
This reverts commit 8fb2b69d068499eacea6de8291c336e5e9f197c7.
* dont lock clean while evicting pages during flushing
* Revert "dont lock clean while evicting pages during flushing"
This reverts commit d6c82b5f40aeba86fc7aead062fab1b819ba58b3.
* Revert "Revert "during flushing add to clean and evict in batches""
This reverts commit ca7a187537fb8f743992700427e13042561211ec.
* dont cross locks during flushing, for the fastest flushes possible
* low-priority queries load pages synchronously
* Revert "low-priority queries load pages synchronously"
This reverts commit 1ef2662ddcd20fe5842b856c716df134c42d1dc7.
* cache uses spinlock again
* during flushing, dont lock the clean queue at all; each item is added atomically
* do smaller eviction runs
* evict one page at a time to minimize lock contention on the clean queue
* fix eviction statistics
* fix last commit
* plain should be main cache
* event loop cleanup; evictions and flushes can now happen concurrently
* run flush and evictions from tier0 only
* remove not needed variables
* flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time
* added worker jobs in timer to find the slow part of it
* support fast eviction of pages when all_of_them is set
* revert default thread stack size
* bypass event loop for dispatching read extent commands to workers - send them directly
* Revert "bypass event loop for dispatching read extent commands to workers - send them directly"
This reverts commit 2c08bc5bab12881ae33bc73ce5dea03dfc4e1fce.
* cache work requests
* minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors
* publish flushed pages to open cache in the thread pool
* prevent eventloop requests from getting stacked in the event loop
* single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c
* more rrdengine.c cleanup
* enable db rotation
* do not log when there is a filter
* do not run multiple migration to journal v2
* load all extents async
* fix wrong paste
* report opcodes waiting, works dispatched, works executing
* cleanup event loop memory every 10 minutes
* dont dispatch more work requests than the number of threads available
* use the dispatched counter instead of the executing counter to check if the worker thread pool is full
* remove UV_RUN_NOWAIT
* replication to fill the queues
* caching of extent buffers; code cleanup
* caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation
* single transaction wal
* synchronous flushing
* first cancel the threads, then signal them to exit
* caching of rrdeng query handles; added priority to query target; health is now low prio
* add priority to the missing points; do not allow critical priority in queries
* offload query preparation and routing to libuv thread pool
* updated timing charts for the offloaded query preparation
* caching of WALs
* accounting for struct caches (buffers); do not load extents with invalid sizes
* protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset
* also check if the expanded before is not the chart later updated time
* also check if the expanded before is not after the wall clock time of when the query started
* Remove unused variable
* replication to queue less queries; cleanup of internal fatals
* Mark dimension to be updated async
* caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol)
* disable pgc stress test, under an ifdef
* disable mrg stress test under an ifdef
* Mark chart and host labels, host info for async check and store in the database
* dictionary items use arrayalloc
* cache section pages structure is allocated with arrayalloc
* Add function to wakeup the aclk query threads and check for exit
Register function to be called during shutdown after signaling the service to exit
* parallel preparation of all dimensions of queries
* be more sensitive to enable streaming after replication
* atomically finish chart replication
* fix last commit
* fix last commit again
* fix last commit again again
* fix last commit again again again
* unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication
* do not cancel start streaming; use high priority queries when we have locked chart data collection
* prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered
* opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue
* Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore
* Fix bad memory allocation / cleanup
* Cleanup ACLK sync initialization (part 1)
* Don't update metric registry during shutdown (part 1)
* Prevent crash when dashboard is refreshed and host goes away
* Mark ctx that is shutting down.
Test not adding flushed pages to open cache as hot if we are shutting down
* make ML work
* Fix compile without NETDATA_INTERNAL_CHECKS
* shutdown each ctx independently
* fix completion of quiesce
* do not update shared ML charts
* Create ML charts on child hosts.
When a parent runs a ML for a child, the relevant-ML charts
should be created on the child host. These charts should use
the parent's hostname to differentiate multiple parents that might
run ML for a child.
The only exception to this rule is the training/prediction resource
usage charts. These are created on the localhost of the parent host,
because they provide information specific to said host.
* check new ml code
* first save the database, then free all memory
* dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db
* Cleanup metadata thread (part 2)
* increase refcount before dispatching prep command
* Do not try to stop anomaly detection threads twice.
A separate function call has been added to stop anomaly detection threads.
This commit removes the left over function calls that were made
internally when a host was being created/destroyed.
* Remove allocations when smoothing samples buffer
The number of dims per sample is always 1, ie. we are training and
predicting only individual dimensions.
* set the orphan flag when loading archived hosts
* track worker dispatch callbacks and threadpool worker init
* make ML threads joinable; mark ctx having flushing in progress as early as possible
* fix allocation counter
* Cleanup metadata thread (part 3)
* Cleanup metadata thread (part 4)
* Skip metadata host scan when running unittest
* unittest support during init
* dont use all the libuv threads for queries
* break an infinite loop when sleep_usec() is interrupted
* ml prediction is a collector for several charts
* sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit
* worker_unregister() in netdata threads cleanup
* moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h
* fixed ML issues
* removed engine2 directory
* added dbengine2 files in CMakeLists.txt
* move query plan data to query target, so that they can be exposed by in jsonwrap
* uniform definition of query plan according to the other query target members
* event_loop should be in daemon, not libnetdata
* metric_retention_by_uuid() is now part of the storage engine abstraction
* unify time_t variables to have the suffix _s (meaning: seconds)
* old dbengine statistics become "dbengine io"
* do not enable ML resource usage charts by default
* unify ml chart families, plugins and modules
* cleanup query plans from query target
* cleanup all extent buffers
* added debug info for rrddim slot to time
* rrddim now does proper gap management
* full rewrite of the mem modes
* use library functions for madvise
* use CHECKSUM_SZ for the checksum size
* fix coverity warning about the impossible case of returning a page that is entirely in the past of the query
* fix dbengine shutdown
* keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently
* fine tune cache evictions
* dont initialize health if the health service is not running - prevent crash on shutdown while children get connected
* rename AS threads to ACLK[hostname]
* prevent re-use of uninitialized memory in queries
* use JulyL instead of JudyL for PDC operations - to test it first
* add also JulyL files
* fix July memory accounting
* disable July for PDC (use Judy)
* use the function to remove datafiles from linked list
* fix july and event_loop
* add july to libnetdata subdirs
* rename time_t variables that end in _t to end in _s
* replicate when there is a gap at the beginning of the replication period
* reset postponing of sender connections when a receiver is connected
* Adjust update every properly
* fix replication infinite loop due to last change
* packed enums in rrd.h and cleanup of obsolete rrd structure members
* prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests()
* void unused variable
* void unused variables
* fix indentation
* entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps
* macros to caclulate page entries by time and size
* prevent statsd cleanup crash on exit
* cleanup health thread related variables
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: vkalintiris <vasilis@netdata.cloud>
|
|
|
|
* Add profile.plugin
Creates the specified number of charts/dimensions, and supports
backfilling with pseudo-historical data.
* Bump
* Remove wrongly merged line.
* Use the number of models specified from the config section.
* Add option to consult all ML models.
* Remove profiling option consuming all models.
* Add underscore after chart name prefix.
* prediction -> dimensions chart
* reorder funcs
* Split charts across types with correct priority
* Ignore training request when chart is under replication.
* Track global number of models consulted.
* Cleanup config.
* initial readme updates
* fix readme
* readme
* Fix function definition when ML is disabled.
* Add dummy ml_chart_update_{begin,end}
* Remove profile_plugin
* Define chart priorities under collectors/all.h
* s/curr_t/current_time/
* Use libnetdata's lock/thread wrappers.
* Fix autotools & cmake builds.
* Delete ML dimensions & charts.
* Let users of buffer preprocessing to handle memory.
* Add separate API calls to start/stop ML threads.
Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
|
|
|
|
|
|
|
|
Fixes https://github.com/netdata/netdata/issues/14170
|
|
|
|
(#14172)
|
|
* Add profile.plugin
Creates the specified number of charts/dimensions, and supports
backfilling with pseudo-historical data.
* Bump
* Remove wrongly merged line.
* Use the number of models specified from the config section.
* Add option to consult all ML models.
* Remove profiling option consuming all models.
* Add underscore after chart name prefix.
* prediction -> dimensions chart
* reorder funcs
* Split charts across types with correct priority
* Ignore training request when chart is under replication.
* Track global number of models consulted.
* Cleanup config.
* initial readme updates
* fix readme
* readme
* Fix function definition when ML is disabled.
* Add dummy ml_chart_update_{begin,end}
* Remove profile_plugin
* Define chart priorities under collectors/all.h
* s/curr_t/current_time/
Co-authored-by: Andrew Maguire <andrewm4894@gmail.com>
|
|
Fixes https://github.com/netdata/netdata/issues/14154
|
|
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
Fixes https://github.com/netdata/netdata/issues/14138
|
|
|
|
|
|
|
|
(#14073)
|
|
notice (#14072)
|
|
* Update COLLECTORS.md
minor fix
* Update COLLECTORS.md
remove whitespace
|
|
add clickhouse and third party install instruction
|
|
* proper use for atomic_compare_exchange()
* diskspace plugin is multi-threaded but it uses single threaded dictionaries
|
|
* find the issue with workers performance and optimize it
* have 2 independent locks in workers
* break down workers statistics work to measure the time of each module
* optimize reading cpu for workers; cleanup procfile from unecessary allocations and operations
* cleanup workers
* fix the workers statistics dimension name
|
|
|
|
use the faster monotonic clock in workers and replication; avoid unecessary statistics function on every request on replication - gather them all together once every second; check the chart flags on all mirrored hosts, not only the ones that have a sender; cleanup and unify replication logs; added child world time to REND; fix first BEGIN been transmitted when replication starts;
|
|
|
|
* dont sleep for all duration of step
* empty line
|
|
* pluginsd cleanup; replication logic cleanup; fix bug in replication begin
* log replication start/stop and change the keyword of NETDATA_LOG_REPLICATION_REQUESTS logs to REPLAY
* dont ask for data the child does not have; log fixes
* more pluginsd cleanup
* count sender dictionary entries
* fix dictionary_flush()
|
|
* cleanup and additional information about replication
* fix deadlock on sender mutex
* do not ignore start streaming empty requests; when there duplicate requests, merge them
* flipped the flag
* final touch
* added queued flag on the charts to prevent them from being obsoleted by the service thread
|
|
* Remove calls to rrdset_next().
* Rm checks plugin
* Update documentantion
* Call rrdset_next from within rrdset_done
This wraps up the removal of rrdset_next from internal collectors, which
removes a lot of unecessary code and the need for if/else clauses in
every place.
The pluginsd parser is the only component that calls rrdset_next*()
functions because it's not strictly speaking a collector but more of a
collector manager/proxy.
With the current changes it's possible to simplify the API we expose
from RRD significantly, but this will be follow-up work in the future.
* Remove stale reference to checks.plugin
* Fix RRD unit test
rrdset_next is not meant to be called from these tests.
* Fix db engine unit test.
* Schedule rrdset_next when we have completed at least one collection.
* Mark chart creation clauses as unlikely.
* Add missing brace to fix FreeBSD plugin.
|
|
* streaming compression, query planner and replication fixes
* remove journal v2 stats from global statistics
* disable sql for checking past sql UUIDs
* single threaded replication
* final replication thread using dictionaries and JudyL for sorting the pending requests
* do not timeout the sending socket when there are pending replication requests
* streaming receiver using read() instead of fread()
* remove FILE * from streaming - now using posix read() and write()
* increase timeouts to 10 minutes
* apply sender timeout only when there are metrics that are supposed to be streamed
* error handling in replication
* remove retries on socket read timeout; better error messages
* take into account inbound traffic too to detect that a connection is stale
* remove race conditions from replication thread
* make sure deleted entries are marked as executed, so that even if deletion fails, they will not be executed
* 2 minutes timeout to retry streaming to a parent that already has this node
* remove unecessary condition check
* fix compilation warnings
* include judy in replication
* wrappers to handle retries for SSL_read and SSL_write
* compressed bytes read monitoring
* recursive locks on replication to make it faster during flush or cleanup
* replication completion chart at the receiver side
* simplified recursive mutex
* simplified recursive mutex again
|
|
|
|
Revert "New journal disk based indexing for agent memory reduction (#13885)"
This reverts commit 224b051a2b2bab39a4b536e531ab9ca590bf31bb.
|