Age | Commit message (Collapse) | Author |
|
* dyncfg fncnames as constants
* add helper macros to know parser streaming/plugin
* plugins dictionary per RRDHOST
* api_request_v2_config add support for /host/
* streamify pluginsd_register_plugin
* streamify pluginsd_register_module
* streamify report_job_status
* streamify dyncfg get functions
* module_type2str
* add job type and flags
* add DYNCFG_REGISTER_JOB
* implement register job
* push all to parent at startup
* add helper function is_dyncfg_function
* forward virtual functions trough streaming
* separate job2json
* add api/v2/job_statuses
* do cleanup on streaming
* streamify set functions
* support FUNCTION_PAYLOAD trough streaming
* WIP tests
* dont attempt loading non-localhost configs
* move cfg persistence to proper place
* prevent race
* properly update job state at runtime
* cleanup 1
* job2json add missing reason
* add tests
* correct HTTP code
* add test
* streamify delete_job_cb
* add DELETE_JOB keyword
* job delete over streaming
* add tests for create and delete job over parent
* rrdpush common checks to macro
* add missing forwarders
* fix jobs according to test results
* more tests
* review comment 1
* codacy remove valid warning
* codacy ruby fixes
* fix wrong rc check
* minimal test plugin for child
* add test
* dict walk insted of master lock
* minor - english spelling fixes
* thiago comments 1
* minor - rename folder to dynconf
* enable only when built with -DNETDATA_TEST_DYNCFG
* minor - compiler warning
* create dir post daemonization
* stricter URL check
|
|
* remove the line length limit from pluginsd
* initialize the buffer on every iteration
* buffer_tostring inlined
* Release buffer
---------
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
|
|
This reverts commit 440bd51e08fdfa2a4daa191fb68643456028a753.
dbengine was still being used for non-zero tiers
even on non-dbengine modes.
|
|
* Storage engine.
* Host indexes to rrdb
* Move globals to rrdb
* Move storage_tiers_backfill to rrdb
* default_rrd_update_every to rrdb
* default_rrd_history_entries to rrdb
* gap_when_lost_iterations_above to rrdb
* rrdset_free_obsolete_time_s to rrdb
* libuv_worker_threads to rrdb
* ieee754_doubles to rrdb
* rrdhost_free_orphan_time_s to rrdb
* rrd_rwlock to rrdb
* localhost to rrdb
* rm extern from func decls
* mv rrd macro under rrd.h
* default_rrdeng_page_cache_mb to rrdb
* default_rrdeng_extent_cache_mb to rrdb
* db_engine_journal_check to rrdb
* default_rrdeng_disk_quota_mb to rrdb
* default_multidb_disk_quota_mb to rrdb
* multidb_ctx to rrdb
* page_type_size to rrdb
* tier_page_size to rrdb
* No storage_engine_id in rrdim functions
* storage_engine_id is provided by st
* Update to fix merge conflict.
* Update field name
* Remove unnecessary macros from rrd.h
* Rm unused type decls
* Rm duplicate func decls
* make internal function static
* Make the rest of public dbengine funcs accept a storage_instance.
* No more rrdengine_instance :)
* rm rrdset_debug from rrd.h
* Use rrdb to access globals in ML and ACLK
Missed due to not having the submodules in the
worktree.
* rm total_number
* rm RRDVAR_TYPE_TOTAL
* rm unused inline
* Rm names from typedef'd enums
* rm unused header include
* Move include
* Rm unused header include
* s/rrdhost_find_or_create/rrdhost_get_or_create/g
* s/find_host_by_node_id/rrdhost_find_by_node_id/
Also, remove duplicate definition in rrdcontext.c
* rm macro used only once
* rm macro used only once
* Reduce rrd.h api by moving funcs into a collector specific utils header
* Remove unused func
* Move parser specific function out of rrd.h
* return storage_number instead of void pointer
* move code related to rrd initialization out of rrdhost.c
* Remove tier_grouping from rrdim_tier
Saves 8 * storage_tiers bytes per dimension.
* Fix rebase
* s/rrd_update_every/update_every/
* Mark functions as static and constify args
* Add license notes and file to build systems.
* Remove remaining non-log/config mentions of memory mode
* Move rrdlabels api to separate file.
Also, move localhost functions that loads
labels outside of database/ and into daemon/
* Remove function decl in rrd.h
* merge rrdhost_cache_dir_for_rrdset_alloc into rrdset_cache_dir
* Do not expose internal function from rrd.h
* Rm NETDATA_RRD_INTERNALS
Only one function decl is covered. We have more
database internal functions that we currently
expose for no good reason. These will be placed
in a separate internal header in follow up PRs.
* Add license note
* Include libnetdata.h instead of aral.h
* Use rrdb to access localhost
* Fix builds without dbengine
* Add header to build system files
* Add rrdlabels.h to build systems
* Move func def from rrd.h to rrdhost.c
* Fix macos build
* Rm non-existing function
* Rebase master
* Define buffer length macro in ad_charts.
* Fix FreeBSD builds.
* Mark functions static
* Rm func decls without definitions
* Rebase master
* Rebase master
* Properly initialize value of storage tiers.
* Fix build after rebase.
|
|
* rebase
* changes queries to delete based on when
* readme changes
* no need to do migration
* wip, protect un-updated events from cleanup
* remove index on when_key
* fix query for claimed cleanup
* if set less than minimum, set minimum
* fix query
* correct config assign
|
|
* claim script now accepts the same params as the kickstart
* rewrote buildinfo to unify all methods
* added cloud unavailable in cloud status
* added all exporters
* renamed httpd to h2o
* rename ENABLE_COMPRESSION to ENABLE_LZ4
* rename global variable
* rename ENABLE_HTTPS to ENABLE_OPENSSL
* fix coverity-scan for openssl
* add lz4 to coverity-scan
* added all plugins and most of the features
* added all plugins and most of the features
* generalize bitmap code so that we can have any size of bitmaps
* cleanup
* fix compilation without protobuf
* fix compilation with others allocators
* fix bitmap
* comprehensive bitmaps unit test
* bitmap as macros
* added developer mode
* added system info to build info
* cloud available/unavailable
* added /api/v2/info
* added units and ni to transitions
* when showing instances and transitions, show only the instances that have transitions
* cleanup
* add missing quotes
* add anchor to transitions
* added more to build info
* calculate retention per tier and expose it to /api/v2/info
* added currently collected metrics
* do not show space and retention when no numbers are available
* fix impossible overflow
* Add function for transitions and execute callback
* In case of error, reset and try next dictionary entry
* Fix error message
* simpler logic to maintain retention per tier
* /api/v2/alert_transitions
* Handle case of recipient null
Convert after and before to usec
* Add classification, type and component
* working /api/v2/alert_transitions
* Fix query to properly handle context and alert name
* cleanup
* Add search with transition
* accept transition in /api/v2/alert_transitions
* totaly dynamic facets
* fixed debug info
* restructured facets
* cleanup; removal of options=transitions
* updated alert entries flags
* method to exec
* Return also exec run timestamp
Temp table cleanup only when we don't execute with a transition
* cleanup obsolete anchor parameter
* Add sql_get_alert_configuration function
* added options=config to alert_transitions
* added /api/v2/alert_config
* preliminary work for /api/v2/claim
* initialize variables; do not expose expected retention if no disk space info is available; do not report aclk as initializing when not claimed
* fix claim session key filename
* put a newline into the session key file
* more progress on claiming
* final /api/v2/claim endpoint
* after claiming, refresh our state at the output
* Fix query to fetch config
* Remove debug log
* add configuration objects
* add configuration objects - fixed
* respect the NETDATA_DISABLE_CLOUD env variable
* NETDATA_DISABLE_CLOUD env variable sets the default, but the config sets the final value
* use a new claimed_id on every claiming
* regenerate random key on claiming and wait for online status
* ignore write() return value when writing a newline
* dont show cloud status disabled when claimed_id is missing
* added ctx to alert instances
* cleanup config and transitions from /api/v2/alerts
* fix unused variable
* in /api/v2/alert_config show 1 config without an array
* show alert values conditionally, by appending options=values
* When storing host info if the key value is empty, store unknown
* added options=summary to control when the alerts summary is shown
* increased http_api_v2 to version 5
* claming random key file is now not world readable
* added local-listeners binary that detects all the listening ports, their IPs and their command lines
---------
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
* use madvise to speed up indexing
* collect all rrddim members into a collector structure
* use tier 0 virtual point for storing last stored value
* reorganize key fields in rrddim
* remove fgets from pluginsd and replace it with read()
* properly uncork the web server sockets
* Revert "reorganize key fields in rrddim"
This reverts commit 2d45fa3959087e05462d387ff115a260f3a04b60.
* Revert "use tier 0 virtual point for storing last stored value"
This reverts commit a576cdd377ad4778a3b8608cabbb7ea7bb19a3a8.
* fix cork names
* fix compilation warnings
|
|
* make all pluginsd functions inline, instead of function pointers
* dynamic MRG partitions based on the number of CPUs
* report the right size of the MRG
* prevent invalid read on pluginsd exit
* faster service_running() check; fix compiler warnings; shutdown replication after streaming to prevent crash on shutdown
* sender is now using a spinlock
* rrdcontext uses spinlock
* replace select() with poll()
* signed calculation of threads
* disable read-ahead on jnfv2 files during scan
|
|
* use gperf for the pluginsd parser
* simplify pluginsd_parser by removing void pointers to user
* pluginsd_split_words() with inlined pluginsd_space()
* quoted_string_splitter() now uses a map instead of a function for determining spaces
* add stress test for pluginsd parser
* optimized BITMAP256
* optimized rrdpush receiver reception
* optimized rrdpush sender compression
* renames and cleanup
* remove wrong negation
* unify handshake and disconnection reasons
* use parser_find_keyword
* register job names only for the current repertoire
|
|
* remove rd->update_every
* reduce amount of memory for RRDDIM
* reorgnize rrddim->db entries
* optimize rrdset and statsd
* optimize dictionaries
* RW_SPINLOCK for dictionaries
* fix codeql warning
* rw_spinlock improvements
* remove obsolete assertion
* fix crash on health_alarm_log_process()
* use RW_SPINLOCK for AVL trees
* add RW_SPINLOCK read/write trylock
* pgc and mrg now use rw_spinlocks; cache line optimizations for mrg
* thread tag of dbegnine init
* append created datafile, lockless
* make DOUBLE_LINKED_LIST_APPEND_ITEM_UNSAFE friendly for lockless use
* thread cancelability in spinlocks; optimize thread cancelability management
* introduce a JudyL to index datafiles and use it during queries to quickly find the relevant files
* use the last timestamp of each journal file for indexing
* when the previous cannot be found, start from the beginning
* add more stats to PDC to trace routing easier
* rename spinlock functions
* fix for spinlock renames
* revert statsd socket statistics to size_t
* turn fatal into internal_fatal()
* show candidates always
* show connected status and connection attempts
|
|
* dummy streaming function
* expose global functions upstream
* separate function for pushing global functions
* add missing conditions
* allow streaming function to run async
* started internal API for functions
* cache host retention and expose it to /api/v2/nodes
* internal API for function table fields; more progress on streaming status
* abstracted and unified rrdhost status
* port old coverity warning fix - although it is not needed
* add ML information to rrdhost status
* add ML capability to streaming to signal the transmission of ML information; added ML information to host status
* protect host->receiver
* count metrics and instances per host
* exposed all inbound and outbound streaming
* fix for ML status and dependency of DATA_WITH_ML to INTERPOLATED, not IEEE754
* update ML dummy
* added all fields
* added streaming group by and cleaned up accepted values by cloud
* removed type
* Revert "removed type"
This reverts commit faae4177e603d4f85b7433f33f92ef3ccd23976e.
* added context to db summary
* new /api/v2/nodes schema
* added ML type
* change default function charts
* log to trace new capa
* add more debug
* removed debugging code
* retry on receive interrupted read; respect sender reconnect delay in all cases
* set disconnected host flag and manipulate localhost child count atomically, inside set/clear receiver
* fix infinite loop
* send_to_plugin() now has a spinlock to ensure that only 1 thread is writing to the plugin/child at the same time
* global cloud_status() call
* cloud should be a section, since it will contain error information
* put cloud capabilities into cloud
* aclk status in /api/v2 agents sections
* keep aclk_connection_counter
* updates on /api/v2/nodes
* final /api/v2/nodes and addition of /api/v2/nodes_instances
* parametrize all /api/v2/xxx output to control which info is outputed per endpoint
* always accept nodes selector
* st needs to be per instance, not per node
* fix merging of contexts; fix cups plugin priorities
* add after and before parameters to /api/v2/contexts/nodes/nodes_instances/q
* give each libuv worker a unique id
* aclk http_api_v2 version 4
|
|
* api v2 nodes for streaming statuses
* remove test
* move parts of the output
* in api/v2/data return 5 values per point when aggregation=percentage and raw option is given; return final values when aggregation=percentage is not the final grouping
|
|
stale plugins; streaming improvements (#15113)
* add information about streaming connections to /api/v2/nodes; reset defer time when sender or receivers connect or disconnect
* make each streaming destination respect its SSL settings
* to not send SSL traffic over non-SSL connection
* keep track of outgoing streaming connection attempts
* retry SSL reads when SSL_read() returns SSL_ERROR_WANT_READ
* Revert "retry SSL reads when SSL_read() returns SSL_ERROR_WANT_READ"
This reverts commit 14c858677c6f2d3b08c94f298e2f45ecdb74c801.
* cleanup SSL connections properly
* initialize SSL in rpt before takeover
* sender should free SSL when talking to a non-SSL destination
* do not shutdown SSL when receiver exits
* restore operation of SIGCHLD when the reaper is not enabled
* create an fgets function that checks for data and times out
* work on error handling of plugins exiting
* remove newlines from logs
* global call to waitid(), caching the result for netdata_pclose() to process
* receiver tid
* parser timeouts in 2 minutes instead of 10
* fix crash when UUID is NULL in SQLite
* abstract sqlite3 parsing for uuid and text
* write proper ssl errors on read and write
* fix for SSL_ERROR_WANT_RETRY_VERIFY
* SSL WANT per function
* unified SSL error logging
* fix compilation warning
* additional logging about parser cleanup
* streaming parser should call the pluginsd parser cleanup
* SSL error handling work
* SSL initialization unification
* check for pending data when receiving SSL response with timeout
* macro to check if an SSL connection has been established
* remove SSL_pending()
* check for SSL macros
* use SSL_peek() to find if there is a response
* SSL renames
* more SSL renames & cleanup
* rrdpush ssl connection function
* abstract all SSL functions into security.c
* keep track of SSL connections and always attempt to use SSL read/write when on SSL connection
* signal openssl to skip certificate validation when configured to do so
* better SSL error handling and logging
* SSL code cleanup
* SSL retry on SSL_connect and SSL_accept
* SSL provide default return value for old compilers
* SSL read/write functions emulate system read/write functions
* fix receive/send timeout and switch from SSL_peek() to SSL_pending()
* remove SSL_pending()
* removed sender auto-retry and debug info for initial recevier response
* ssl skip certificate verification config for web server
* ssl errors log ip and port of the peer
* keep ssl with web_client for its whole lifetime
* thread safe socket peers to text
* use error_limit() for common ssl errors
* cleanup
* more cleanup
* coverity fixes
* ssl error logs include both local and remote ip/port info
* remove obsolete code
|
|
* initial webrtc setup
* missing files
* rewrite of webrtc integration
* initialization and cleanup of webrtc connections
* make it compile without libdatachannel
* add missing webrtc_initialize() function when webrtc is not enabled
* make c++17 optional
* add build/m4/ax_compiler_vendor.m4
* add ax_cxx_compile_stdcxx.m4
* added new m4 files to makefile.am
* id all webrtc connections
* show warning when webrtc is disabled
* fixed message
* moved all webrtc error checking inside webrtc.cpp
* working webrtc connection establishment and cleanup
* remove obsolete code
* rewrote webrtc code in C to remove dependency for c++17
* fixed left-over reference
* detect binary and text messages
* minor fix
* naming of webrtc threads
* added webrtc configuration
* fix for thread_get_name_np()
* smaller web_client memory footprint
* universal web clients cache
* free web clients every 100 uses
* webrtc is now enabled by default only when compiled with internal checks
* webrtc responses to /api/ requests, including LZ4 compression
* fix for binary and text messages
* web_client_cache is now global
* unification of the internal web server API, for web requests, aclk request, webrtc requests
* more cleanup and unification of web client timings
* fixed compiler warnings
* update sent and received bytes
* eliminated of almost all big buffers in web client
* registry now uses the new json generation
* cookies are now an array; fixed redirects
* fix redirects, again
* write cookies directly to the header buffer, eliminating the need for cookie structures in web client
* reset the has_cookies flag
* gathered all web client cleanup to one function
* fixes redirects
* added summary.globals in /api/v2/data response
* ars to arc in /api/v2/data
* properly handle host impersonation
* set the context of mem.numa_nodes
|
|
* fundamentals for having /api/v2/ working
* use an atomic to prevent writing to internal pipe too much
* first attempt of multi-node, multi-context, multi-chart, multi-dimension queries
* v2 jsonwrap
* first attempt for group by
* cleaned up RRDR and fixed group by
* improvements to /api/v2/api
* query instance may be realloced, so pointers to it get invalid; solved memory leaks
* count of quried metrics in summary information
* provide detailed information about selected, excluded, queried and failed metrics for each entity
* select instances by fqdn too
* add timing information to json output
* link charts to rrdcontexts, if a query comes in and it is found unlinked
* calculate min, max, sum, average, volume, count per metric
* api v2 parameters naming
* renders alerts and units
* render machine_guid and node_id in all sections it is relevant
* unified keys
* group by now takes into account units and when there are multiple units involved, it creates a dimension per unit
* request and detailed are hidden behind an option
* summary includes only a flattened list of alerts
* alert counts per host and instance
* count of grouped metrics per dimension
* added contexts to summary
* added chart title
* added dimension priorities and chart type
* support for multiple group by at the same time
* minor fixes
* labels are now a tree
* keys uniformity
* filtering by alerts, both having a specific alert and having a specific alert in a specific status
* added scope of hosts and contexts
* count of instances on contexts and hosts
* make the api return valid responses even when the response contains no data
* calculate average and contribution % for every item in the summary
* fix compilation warnings
* fix compilation warnings - again
|
|
optimization (#14493)
* first work on standardizing json formatting
* renamed old grouping to time_grouping and added group_by
* add dummy functions to enable compilation
* buffer json api work
* jsonwrap opening with buffer_json_X() functions
* cleanup
* storage for quotes
* optimize buffer printing for both numbers and strings
* removed ; from define
* contexts json generation using the new json functions
* fix buffer overflow at unit test
* weights endpoint using new json api
* fixes to weights endpoint
* check buffer overflow on all buffer functions
* do synchronous queries for weights
* buffer_flush() now resets json state too
* content type typedef
* print double values that are above the max 64-bit value
* str2ndd() can now parse values above UINT64_MAX
* faster number parsing by avoiding double calculations as much as possible
* faster number parsing
* faster hex parsing
* accurate printing and parsing of double values, even for very large numbers that cannot fit in 64bit integers
* full printing and parsing without using library functions - and related unit tests
* added IEEE754 streaming capability to enable streaming of double values in hex
* streaming and replication to transfer all values in hex
* use our own str2ndd for set2
* remove subnormal check from ieee
* base64 encoding for numbers, instead of hex
* when increasing double precision, also make sure the fractional number printed is aligned to the wanted precision
* str2ndd_encoded() parses all encoding formats, including integers
* prevent uninitialized use
* /api/v1/info using the new json API
* Fix error when compiling with --disable-ml
* Remove redundant 'buffer_unittest' declaration
* Fix formatting
* Fix formatting
* Fix formatting
* fix buffer unit test
* apps.plugin using the new JSON API
* make sure the metrics registry does not accept negative timestamps
* do not allow pages with negative timestamps to be loaded from db files; do not accept pages with negative timestamps in the cache
* Fix more formatting
---------
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
* first commit - untested
* fix wrong begin command
* added set v2 too
* debug to log stream buffer
* debug to log stream buffer
* faster streaming printing
* mark charts and dimensions as collected
* use stream points even if sender is not enabled
* comment out stream debug log
* parse null as nan
* custom begin v2
* custom set v2; replication now copies the anomalous flag too
* custom end v2
* enabled stream log test
* renamed to BEGIN2, SET2, END2
* dont mix up replay and v2 members in user object
* fix typo
* cleanup
* support to v2 to v1 proxying
* mark updated dimensions as such
* do not log unknown flags
* comment out stream debug log
* send also the chart id on BEGIN2, v2 to v2
* update the data collections counter
* v2 values are transferred in hex
* faster hex parsing
* a little more generic hex and dec printing and parsing
* fix hex parsing
* minor optimization in dbengine api
* turn debugging into info message
* generalized the timings tracking, so that it can be used in more places
* commented out debug info
* renamed conflicting variable with macro
* remove wrong edits
* integrated ML and added cleanup in case parsing is interrupted
* disable data collection locking during v2
* cleanup stale ML locks; send updated chart variables during v2; add info to find stale locks
* inject an END2 between repeated BEGIN2 from rrdset_done()
* test: remove lockless single-threaded logic from dictionary and aral and apply the right acquire/release memory order to reference counters
* more fine grained dictionary atomics
* remove unecessary return values
* pointer validation under NETDATA_DICTIONARY_VALIDATE_POINTERS
* Revert "pointer validation under NETDATA_DICTIONARY_VALIDATE_POINTERS"
This reverts commit 846cdf2713e2a7ee2ff797f38db11714228800e9.
* Revert "remove unecessary return values"
This reverts commit 8c87d30f4d86f0f5d6b4562cf74fe7447138bbff.
* Revert "more fine grained dictionary atomics"
This reverts commit 984aec4234a340d197d45239ff9a10fd479fcf3c.
* Revert "test: remove lockless single-threaded logic from dictionary and aral and apply the right acquire/release memory order to reference counters"
This reverts commit c460b3d0ad497d2641bd0ea1d63cec7c052e74e4.
* Apply again "pointer validation under NETDATA_DICTIONARY_VALIDATE_POINTERS" while keeping the improved atomic operations.
This reverts commit f158d009
* fix last commit
* fix last commit again
* optimizations in dbengine
* do not send anomaly bit on non-supporting agents (send it when the INTERPOLATED capability is available)
* break long empty-points-loops in rrdset_done()
* decide page alignment on new page allocation, not on every point collected
* create max size pages but no smaller than 1/3
* Fix compilation when --disable-ml is specified
* Return false
* fixes for NETDATA_LOG_REPLICATION_REQUESTS
* added compile option NETDATA_WITHOUT_WORKERS_LATENCY
* put timings in BEGIN2, SET2, END2
* isolate begin2 ml
* revert repositioning data collection lock
* fixed multi-threading of statistics
* do not lookup dimensions all the time if they come in the same order
* update used on iteration, not on every points; also do better error handling
---------
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
* track memory footprint of Netdata
* track db modes alloc/ram/save/map
* track system info; track sender and receiver
* fixes
* more fixes
* track workers memory, onewayalloc memory; unify judyhs size estimation
* track replication structures and buffers
* Properly clear host RRDHOST_FLAG_METADATA_UPDATE flag
* flush the replication buffer every 1000 times the circular buffer is found empty
* dont take timestamp too frequently in sender loop
* sender buffers are not used by the same thread as the sender, so they were never recreated - fixed it
* free sender thread buffer on replication threads when replication is idle
* use the last sender flag as a timestamp of the last buffer recreation
* free cbuffer before reconnecting
* recreate cbuffer on every flush
* timings for journal v2 loading
* inlining of metric and cache functions
* aral likely/unlikely
* free left-over thread buffers
* fix NULL pointer dereference in replication
* free sender thread buffer on sender thread too
* mark ctx as used before flushing
* better logging on ctx datafiles closing
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
* cleanup journal v2 mounts periodically
* fix for last commit
* re-enable loading page from disk when the arrangement of pages requires it
* Remove unused statistics
* Estimate diskspace when the current datafile is full and queue a rotate command (Currently it will not attempt to estimate end size for journals)
Queue a command to check quota on startup per tier
* apps.plugin now exposes RSS chart
* shorter thread names to make debugging easier, since thread names can only be 15 characters
* more thread names fixes
* allow an apps_groups.conf target to be pid 0 or 1
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
|
|
* query planer weight calculation using long long
* adjust replication query ahead pipeline for smaller systems
* do not generate huge replication messages
* add message to indicate replication message was interrupted
* improved message
* max replication size 25% of sender buffer
* fix for last commit
* use less cache and smaller page sizes and fewer threads on 32-bits
* fix reserved libuv workers for 32bits
* fix detection of 32/64 bit
|
|
* count open cache pages refering to datafile
* eliminate waste flush attempts
* remove eliminated variable
* journal v2 scanning split functions
* avoid locking open cache for a long time while migrating to journal v2
* dont acquire datafile for the loop; disable thread cancelability while a query is running
* work on datafile acquiring
* work on datafile deletion
* work on datafile deletion again
* logs of dbengine should start with DBENGINE
* thread specific key for queries to check if a query finishes without a finalize
* page_uuid is not used anymore
* Cleanup judy traversal when building new v2
Remove not needed calls to metric registry
* metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent
* disable checks for invalid time-ranges
* Remove type from page details
* report scanning time
* remove infinite loop from datafile acquire for deletion
* remove infinite loop from datafile acquire for deletion again
* trace query handles
* properly allocate array of dimensions in replication
* metrics cleanup
* metrics registry uses arrayalloc
* arrayalloc free should be protected by lock
* use array alloc in page cache
* journal v2 scanning fix
* datafile reference leaking hunding
* do not load metrics of future timestamps
* initialize reasons
* fix datafile reference leak
* do not load pages that are entirely overlapped by others
* expand metric retention atomically
* split replication logic in initialization and execution
* replication prepare ahead queries
* replication prepare ahead queries fixed
* fix replication workers accounting
* add router active queries chart
* restore accounting of pages metadata sources; cleanup replication
* dont count skipped pages as unroutable
* notes on services shutdown
* do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file
* do not add pages we dont need to pdc
* time in range re-work to provide info about past and future matches
* finner control on the pages selected for processing; accounting of page related issues
* fix invalid reference to handle->page
* eliminate data collection handle of pg_lookup_next
* accounting for queries with gaps
* query preprocessing the same way the processing is done; cache now supports all operations on Judy
* dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3)
* get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query
* finner gaps calculation; accounting of overlapping pages in queries
* fix gaps accounting
* move datafile deletion to worker thread
* tune libuv workers and thread stack size
* stop netdata threads gradually
* run indexing together with cache flush/evict
* more work on clean shutdown
* limit the number of pages to evict per run
* do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction
* economies on flags for smaller page footprint; cleanup and renames
* eviction moves referenced pages to the end of the queue
* use murmur hash for indexing partition
* murmur should be static
* use more indexing partitions
* revert number of partitions to number of cpus
* cancel threads first, then stop services
* revert default thread stack size
* dont execute replication requests of disconnected senders
* wait more time for services that are exiting gradually
* fixed last commit
* finer control on page selection algorithm
* default stacksize of 1MB
* fix formatting
* fix worker utilization going crazy when the number is rotating
* avoid buffer full due to replication preprocessing of requests
* support query priorities
* add count of spins in spinlock when compiled with netdata internal checks
* remove prioritization from dbengine queries; cache now uses mutexes for the queues
* hot pages are now in sections judy arrays, like dirty
* align replication queries to optimal page size
* during flushing add to clean and evict in batches
* Revert "during flushing add to clean and evict in batches"
This reverts commit 8fb2b69d068499eacea6de8291c336e5e9f197c7.
* dont lock clean while evicting pages during flushing
* Revert "dont lock clean while evicting pages during flushing"
This reverts commit d6c82b5f40aeba86fc7aead062fab1b819ba58b3.
* Revert "Revert "during flushing add to clean and evict in batches""
This reverts commit ca7a187537fb8f743992700427e13042561211ec.
* dont cross locks during flushing, for the fastest flushes possible
* low-priority queries load pages synchronously
* Revert "low-priority queries load pages synchronously"
This reverts commit 1ef2662ddcd20fe5842b856c716df134c42d1dc7.
* cache uses spinlock again
* during flushing, dont lock the clean queue at all; each item is added atomically
* do smaller eviction runs
* evict one page at a time to minimize lock contention on the clean queue
* fix eviction statistics
* fix last commit
* plain should be main cache
* event loop cleanup; evictions and flushes can now happen concurrently
* run flush and evictions from tier0 only
* remove not needed variables
* flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time
* added worker jobs in timer to find the slow part of it
* support fast eviction of pages when all_of_them is set
* revert default thread stack size
* bypass event loop for dispatching read extent commands to workers - send them directly
* Revert "bypass event loop for dispatching read extent commands to workers - send them directly"
This reverts commit 2c08bc5bab12881ae33bc73ce5dea03dfc4e1fce.
* cache work requests
* minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors
* publish flushed pages to open cache in the thread pool
* prevent eventloop requests from getting stacked in the event loop
* single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c
* more rrdengine.c cleanup
* enable db rotation
* do not log when there is a filter
* do not run multiple migration to journal v2
* load all extents async
* fix wrong paste
* report opcodes waiting, works dispatched, works executing
* cleanup event loop memory every 10 minutes
* dont dispatch more work requests than the number of threads available
* use the dispatched counter instead of the executing counter to check if the worker thread pool is full
* remove UV_RUN_NOWAIT
* replication to fill the queues
* caching of extent buffers; code cleanup
* caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation
* single transaction wal
* synchronous flushing
* first cancel the threads, then signal them to exit
* caching of rrdeng query handles; added priority to query target; health is now low prio
* add priority to the missing points; do not allow critical priority in queries
* offload query preparation and routing to libuv thread pool
* updated timing charts for the offloaded query preparation
* caching of WALs
* accounting for struct caches (buffers); do not load extents with invalid sizes
* protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset
* also check if the expanded before is not the chart later updated time
* also check if the expanded before is not after the wall clock time of when the query started
* Remove unused variable
* replication to queue less queries; cleanup of internal fatals
* Mark dimension to be updated async
* caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol)
* disable pgc stress test, under an ifdef
* disable mrg stress test under an ifdef
* Mark chart and host labels, host info for async check and store in the database
* dictionary items use arrayalloc
* cache section pages structure is allocated with arrayalloc
* Add function to wakeup the aclk query threads and check for exit
Register function to be called during shutdown after signaling the service to exit
* parallel preparation of all dimensions of queries
* be more sensitive to enable streaming after replication
* atomically finish chart replication
* fix last commit
* fix last commit again
* fix last commit again again
* fix last commit again again again
* unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication
* do not cancel start streaming; use high priority queries when we have locked chart data collection
* prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered
* opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue
* Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore
* Fix bad memory allocation / cleanup
* Cleanup ACLK sync initialization (part 1)
* Don't update metric registry during shutdown (part 1)
* Prevent crash when dashboard is refreshed and host goes away
* Mark ctx that is shutting down.
Test not adding flushed pages to open cache as hot if we are shutting down
* make ML work
* Fix compile without NETDATA_INTERNAL_CHECKS
* shutdown each ctx independently
* fix completion of quiesce
* do not update shared ML charts
* Create ML charts on child hosts.
When a parent runs a ML for a child, the relevant-ML charts
should be created on the child host. These charts should use
the parent's hostname to differentiate multiple parents that might
run ML for a child.
The only exception to this rule is the training/prediction resource
usage charts. These are created on the localhost of the parent host,
because they provide information specific to said host.
* check new ml code
* first save the database, then free all memory
* dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db
* Cleanup metadata thread (part 2)
* increase refcount before dispatching prep command
* Do not try to stop anomaly detection threads twice.
A separate function call has been added to stop anomaly detection threads.
This commit removes the left over function calls that were made
internally when a host was being created/destroyed.
* Remove allocations when smoothing samples buffer
The number of dims per sample is always 1, ie. we are training and
predicting only individual dimensions.
* set the orphan flag when loading archived hosts
* track worker dispatch callbacks and threadpool worker init
* make ML threads joinable; mark ctx having flushing in progress as early as possible
* fix allocation counter
* Cleanup metadata thread (part 3)
* Cleanup metadata thread (part 4)
* Skip metadata host scan when running unittest
* unittest support during init
* dont use all the libuv threads for queries
* break an infinite loop when sleep_usec() is interrupted
* ml prediction is a collector for several charts
* sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit
* worker_unregister() in netdata threads cleanup
* moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h
* fixed ML issues
* removed engine2 directory
* added dbengine2 files in CMakeLists.txt
* move query plan data to query target, so that they can be exposed by in jsonwrap
* uniform definition of query plan according to the other query target members
* event_loop should be in daemon, not libnetdata
* metric_retention_by_uuid() is now part of the storage engine abstraction
* unify time_t variables to have the suffix _s (meaning: seconds)
* old dbengine statistics become "dbengine io"
* do not enable ML resource usage charts by default
* unify ml chart families, plugins and modules
* cleanup query plans from query target
* cleanup all extent buffers
* added debug info for rrddim slot to time
* rrddim now does proper gap management
* full rewrite of the mem modes
* use library functions for madvise
* use CHECKSUM_SZ for the checksum size
* fix coverity warning about the impossible case of returning a page that is entirely in the past of the query
* fix dbengine shutdown
* keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently
* fine tune cache evictions
* dont initialize health if the health service is not running - prevent crash on shutdown while children get connected
* rename AS threads to ACLK[hostname]
* prevent re-use of uninitialized memory in queries
* use JulyL instead of JudyL for PDC operations - to test it first
* add also JulyL files
* fix July memory accounting
* disable July for PDC (use Judy)
* use the function to remove datafiles from linked list
* fix july and event_loop
* add july to libnetdata subdirs
* rename time_t variables that end in _t to end in _s
* replicate when there is a gap at the beginning of the replication period
* reset postponing of sender connections when a receiver is connected
* Adjust update every properly
* fix replication infinite loop due to last change
* packed enums in rrd.h and cleanup of obsolete rrd structure members
* prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests()
* void unused variable
* void unused variables
* fix indentation
* entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps
* macros to caclulate page entries by time and size
* prevent statsd cleanup crash on exit
* cleanup health thread related variables
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: vkalintiris <vasilis@netdata.cloud>
|
|
* replication fixes 9
* no room metric is now absolute
* decrement senders full and flip the actions on sender full and sender available
* finer timings
* execute all requests, unconditionally, once they are in the replication queue; worker charts are now sorted
* remove left-over debug message
* log verification of replication completion only when there are no pending requests any more
* use up to 50% of the sender buffer for replication responses
|
|
* replication requests with start_streaming=true are executed immediately upon reception, instead of being placed in the queue
* disable thread cancelability while workers cleanup
* remove obsolete worker from replication
* multi-threaded replication with netdata.conf option to set number of replication threads
* revert spinlock to mutex
* separate worker and main thread worker jobs
* restart the queue every 10 seconds only
* use atomic for sender buffer percentage
* reset the queue position after sleeping
* use sender resets to sleep properly
* fix condition
* cleanup sender members related to replication
|
|
* cleanup and additional information about replication
* fix deadlock on sender mutex
* do not ignore start streaming empty requests; when there duplicate requests, merge them
* flipped the flag
* final touch
* added queued flag on the charts to prevent them from being obsoleted by the service thread
|
|
(#14031)
* use 2 levels of judy arrays to speed up replication on very busy parents
* delete requests from judy when executed
* do not process requests when the sender is not connected; not all requests removed are executed, so count the executed accurately
* flush replication data on sender disconnect
* cache used buffer ratio in sender structure
* cleanup replication requests when they are not valid any more
* properly update inner and outer judy arrays on deletions
* detailed replication stats
* fix bug in dictionary where deletion and flushes lead to crashes
* replication should only report retention of tier 0
* replication now has 2 buffer limits 10 -> 30 %
* detailed statistics about resets and skipped requests
* register worker metrics
* added counter for waits
* make new requests discoverable
* make it continue on iterations properly
|
|
* streaming compression, query planner and replication fixes
* remove journal v2 stats from global statistics
* disable sql for checking past sql UUIDs
* single threaded replication
* final replication thread using dictionaries and JudyL for sorting the pending requests
* do not timeout the sending socket when there are pending replication requests
* streaming receiver using read() instead of fread()
* remove FILE * from streaming - now using posix read() and write()
* increase timeouts to 10 minutes
* apply sender timeout only when there are metrics that are supposed to be streamed
* error handling in replication
* remove retries on socket read timeout; better error messages
* take into account inbound traffic too to detect that a connection is stale
* remove race conditions from replication thread
* make sure deleted entries are marked as executed, so that even if deletion fails, they will not be executed
* 2 minutes timeout to retry streaming to a parent that already has this node
* remove unecessary condition check
* fix compilation warnings
* include judy in replication
* wrappers to handle retries for SSL_read and SSL_write
* compressed bytes read monitoring
* recursive locks on replication to make it faster during flush or cleanup
* replication completion chart at the receiver side
* simplified recursive mutex
* simplified recursive mutex again
|
|
Revert "New journal disk based indexing for agent memory reduction (#13885)"
This reverts commit 224b051a2b2bab39a4b536e531ab9ca590bf31bb.
|