summaryrefslogtreecommitdiffstats
path: root/claim/claim.c
AgeCommit message (Collapse)Author
2023-01-17Store host and claim info in sqlite as soon as possible (#14263)Emmanuel Vasilakis
* store host and claim info as soon as possible * no need to set the flag * check for metasync_worker.loop
2023-01-10DBENGINE v2 (#14125)Costa Tsaousis
* count open cache pages refering to datafile * eliminate waste flush attempts * remove eliminated variable * journal v2 scanning split functions * avoid locking open cache for a long time while migrating to journal v2 * dont acquire datafile for the loop; disable thread cancelability while a query is running * work on datafile acquiring * work on datafile deletion * work on datafile deletion again * logs of dbengine should start with DBENGINE * thread specific key for queries to check if a query finishes without a finalize * page_uuid is not used anymore * Cleanup judy traversal when building new v2 Remove not needed calls to metric registry * metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent * disable checks for invalid time-ranges * Remove type from page details * report scanning time * remove infinite loop from datafile acquire for deletion * remove infinite loop from datafile acquire for deletion again * trace query handles * properly allocate array of dimensions in replication * metrics cleanup * metrics registry uses arrayalloc * arrayalloc free should be protected by lock * use array alloc in page cache * journal v2 scanning fix * datafile reference leaking hunding * do not load metrics of future timestamps * initialize reasons * fix datafile reference leak * do not load pages that are entirely overlapped by others * expand metric retention atomically * split replication logic in initialization and execution * replication prepare ahead queries * replication prepare ahead queries fixed * fix replication workers accounting * add router active queries chart * restore accounting of pages metadata sources; cleanup replication * dont count skipped pages as unroutable * notes on services shutdown * do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file * do not add pages we dont need to pdc * time in range re-work to provide info about past and future matches * finner control on the pages selected for processing; accounting of page related issues * fix invalid reference to handle->page * eliminate data collection handle of pg_lookup_next * accounting for queries with gaps * query preprocessing the same way the processing is done; cache now supports all operations on Judy * dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3) * get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query * finner gaps calculation; accounting of overlapping pages in queries * fix gaps accounting * move datafile deletion to worker thread * tune libuv workers and thread stack size * stop netdata threads gradually * run indexing together with cache flush/evict * more work on clean shutdown * limit the number of pages to evict per run * do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction * economies on flags for smaller page footprint; cleanup and renames * eviction moves referenced pages to the end of the queue * use murmur hash for indexing partition * murmur should be static * use more indexing partitions * revert number of partitions to number of cpus * cancel threads first, then stop services * revert default thread stack size * dont execute replication requests of disconnected senders * wait more time for services that are exiting gradually * fixed last commit * finer control on page selection algorithm * default stacksize of 1MB * fix formatting * fix worker utilization going crazy when the number is rotating * avoid buffer full due to replication preprocessing of requests * support query priorities * add count of spins in spinlock when compiled with netdata internal checks * remove prioritization from dbengine queries; cache now uses mutexes for the queues * hot pages are now in sections judy arrays, like dirty * align replication queries to optimal page size * during flushing add to clean and evict in batches * Revert "during flushing add to clean and evict in batches" This reverts commit 8fb2b69d068499eacea6de8291c336e5e9f197c7. * dont lock clean while evicting pages during flushing * Revert "dont lock clean while evicting pages during flushing" This reverts commit d6c82b5f40aeba86fc7aead062fab1b819ba58b3. * Revert "Revert "during flushing add to clean and evict in batches"" This reverts commit ca7a187537fb8f743992700427e13042561211ec. * dont cross locks during flushing, for the fastest flushes possible * low-priority queries load pages synchronously * Revert "low-priority queries load pages synchronously" This reverts commit 1ef2662ddcd20fe5842b856c716df134c42d1dc7. * cache uses spinlock again * during flushing, dont lock the clean queue at all; each item is added atomically * do smaller eviction runs * evict one page at a time to minimize lock contention on the clean queue * fix eviction statistics * fix last commit * plain should be main cache * event loop cleanup; evictions and flushes can now happen concurrently * run flush and evictions from tier0 only * remove not needed variables * flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time * added worker jobs in timer to find the slow part of it * support fast eviction of pages when all_of_them is set * revert default thread stack size * bypass event loop for dispatching read extent commands to workers - send them directly * Revert "bypass event loop for dispatching read extent commands to workers - send them directly" This reverts commit 2c08bc5bab12881ae33bc73ce5dea03dfc4e1fce. * cache work requests * minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors * publish flushed pages to open cache in the thread pool * prevent eventloop requests from getting stacked in the event loop * single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c * more rrdengine.c cleanup * enable db rotation * do not log when there is a filter * do not run multiple migration to journal v2 * load all extents async * fix wrong paste * report opcodes waiting, works dispatched, works executing * cleanup event loop memory every 10 minutes * dont dispatch more work requests than the number of threads available * use the dispatched counter instead of the executing counter to check if the worker thread pool is full * remove UV_RUN_NOWAIT * replication to fill the queues * caching of extent buffers; code cleanup * caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation * single transaction wal * synchronous flushing * first cancel the threads, then signal them to exit * caching of rrdeng query handles; added priority to query target; health is now low prio * add priority to the missing points; do not allow critical priority in queries * offload query preparation and routing to libuv thread pool * updated timing charts for the offloaded query preparation * caching of WALs * accounting for struct caches (buffers); do not load extents with invalid sizes * protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset * also check if the expanded before is not the chart later updated time * also check if the expanded before is not after the wall clock time of when the query started * Remove unused variable * replication to queue less queries; cleanup of internal fatals * Mark dimension to be updated async * caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol) * disable pgc stress test, under an ifdef * disable mrg stress test under an ifdef * Mark chart and host labels, host info for async check and store in the database * dictionary items use arrayalloc * cache section pages structure is allocated with arrayalloc * Add function to wakeup the aclk query threads and check for exit Register function to be called during shutdown after signaling the service to exit * parallel preparation of all dimensions of queries * be more sensitive to enable streaming after replication * atomically finish chart replication * fix last commit * fix last commit again * fix last commit again again * fix last commit again again again * unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication * do not cancel start streaming; use high priority queries when we have locked chart data collection * prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered * opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue * Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore * Fix bad memory allocation / cleanup * Cleanup ACLK sync initialization (part 1) * Don't update metric registry during shutdown (part 1) * Prevent crash when dashboard is refreshed and host goes away * Mark ctx that is shutting down. Test not adding flushed pages to open cache as hot if we are shutting down * make ML work * Fix compile without NETDATA_INTERNAL_CHECKS * shutdown each ctx independently * fix completion of quiesce * do not update shared ML charts * Create ML charts on child hosts. When a parent runs a ML for a child, the relevant-ML charts should be created on the child host. These charts should use the parent's hostname to differentiate multiple parents that might run ML for a child. The only exception to this rule is the training/prediction resource usage charts. These are created on the localhost of the parent host, because they provide information specific to said host. * check new ml code * first save the database, then free all memory * dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db * Cleanup metadata thread (part 2) * increase refcount before dispatching prep command * Do not try to stop anomaly detection threads twice. A separate function call has been added to stop anomaly detection threads. This commit removes the left over function calls that were made internally when a host was being created/destroyed. * Remove allocations when smoothing samples buffer The number of dims per sample is always 1, ie. we are training and predicting only individual dimensions. * set the orphan flag when loading archived hosts * track worker dispatch callbacks and threadpool worker init * make ML threads joinable; mark ctx having flushing in progress as early as possible * fix allocation counter * Cleanup metadata thread (part 3) * Cleanup metadata thread (part 4) * Skip metadata host scan when running unittest * unittest support during init * dont use all the libuv threads for queries * break an infinite loop when sleep_usec() is interrupted * ml prediction is a collector for several charts * sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit * worker_unregister() in netdata threads cleanup * moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h * fixed ML issues * removed engine2 directory * added dbengine2 files in CMakeLists.txt * move query plan data to query target, so that they can be exposed by in jsonwrap * uniform definition of query plan according to the other query target members * event_loop should be in daemon, not libnetdata * metric_retention_by_uuid() is now part of the storage engine abstraction * unify time_t variables to have the suffix _s (meaning: seconds) * old dbengine statistics become "dbengine io" * do not enable ML resource usage charts by default * unify ml chart families, plugins and modules * cleanup query plans from query target * cleanup all extent buffers * added debug info for rrddim slot to time * rrddim now does proper gap management * full rewrite of the mem modes * use library functions for madvise * use CHECKSUM_SZ for the checksum size * fix coverity warning about the impossible case of returning a page that is entirely in the past of the query * fix dbengine shutdown * keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently * fine tune cache evictions * dont initialize health if the health service is not running - prevent crash on shutdown while children get connected * rename AS threads to ACLK[hostname] * prevent re-use of uninitialized memory in queries * use JulyL instead of JudyL for PDC operations - to test it first * add also JulyL files * fix July memory accounting * disable July for PDC (use Judy) * use the function to remove datafiles from linked list * fix july and event_loop * add july to libnetdata subdirs * rename time_t variables that end in _t to end in _s * replicate when there is a gap at the beginning of the replication period * reset postponing of sender connections when a receiver is connected * Adjust update every properly * fix replication infinite loop due to last change * packed enums in rrd.h and cleanup of obsolete rrd structure members * prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests() * void unused variable * void unused variables * fix indentation * entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps * macros to caclulate page entries by time and size * prevent statsd cleanup crash on exit * cleanup health thread related variables Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com> Co-authored-by: vkalintiris <vasilis@netdata.cloud>
2022-10-16Add a thread to asynchronously process metadata updates (#13783)Stelios Fragkakis
* Remove old metalog text fle processing * Add metadata event loop * Move functions from sqlite_functions.c to sqlite_metadata.c Queue updates to the metadata event loop Migration to remove unused tables Cleanup unused functions * Queue chart labels to metadata * Store chart labels to metadata * During shutdown, run full speed * Add shutdown prepare Handle SHUTDOWN in the cmd queue function Add worker thread to handle host/chart/dimension metadata doing dictionary traversals * Remove unused RRDIM_FLAG_ACLK Add flags to trigger host/chart/dimension metadata processing * Incremental processing of chart metadata writes * Store host labels * Remove redundant return statements * Change unit tests / cleanup * Fix rescheduling * Schedule chart labels update by setting the RRDSET_FLAG_METADATA_UPDATE flag * Queue commands to update metadata for dimension and host labels * Make sure we do a final scan to store metadata during shutdown (if needed) * Remove unused structures Adjust queue size since we do batch processing of updates without queueing individual messages Remove pragma mmap for now Fix memory leak during sqlite unittest (minor) * Dont update if we are in archive mode * Cleanup * Build entire message payload and store * Initialize worker completion properly * Properly skip host check for pending metadata updates * Report bind param failures Add worker request inside the data payload Initialize variables to silence warnings Rebase on master * Report the chart id (not the dimension) and the dimension id when storing a dimension * Compilation warnings in 32bit * Add DEFINE for the queries * Remove commented out code * * Remove items parameter from unitest * Remove commented out code * sqlite_metadata.h contains only public items * Use sleep_usec instead of usleep * Rename metadata_database_init_cmd_queue to metadata_init_cmd_queue * Rename metadata_database_enq_cmd_noblock to metadata_enq_cmd_noblock
2022-10-05Allow netdata plugins to expose functions for querying more information ↵Costa Tsaousis
about specific charts (#13720) * function renames and code cleanup in popen.c; no actual code changes * netdata popen() now opens both child process stdin and stdout and returns FILE * for both * pass both input and output to parser structures * updated rrdset to call custom functions * RRDSET FUNCTION leading calls for both sync and async operation * put RRDSET functions to a separate file * added format and timeout at function definition * support for synchronous (internal plugins) and asynchronous (external plugins and children) functions * /api/v1/function endpoint * functions are now attached to the host and there is a dictionary view per chart * functions implemented at plugins.d * remove the defer until keyword hook from plugins.d when it is done * stream sender implementation of functions * sanitization of all functions so that certain characters are only allowed * strictier sanitization * common max size * 1st working plugins.d example * always init inflight dictionary * properly destroy dictionaries to avoid parallel insertion of items * add more debugging on disconnection reasons * add more debugging on disconnection reasons again * streaming receiver respects newlines * dont use the same fp for both streaming receive and send * dont free dbengine memory with internal checks * make sender proceed in the buffer * added timing info and garbage collection at plugins.d * added info about routing nodes * added info about routing nodes with delay * added more info about delays * added more info about delays again * signal sending thread to wake up * streaming version labeling and commented code to support capabilities * added functions to /api/v1/data, /api/v1/charts, /api/v1/chart, /api/v1/info * redirect top output to stdout * address coverity findings * fix resource leaks of popen * log attempts to connect to individual destinations * better messages * properly parse destinations * try to find a function from the most matching to the least matching * log added streaming destinations * rotate destinations bypassing a node in the middle that does not accept our connection * break the loops properly * use typedef to define callbacks * capabilities negotiation during streaming * functions exposed upstream based on capabilities; compression disabled per node persisting reconnects; always try to connect with all capabilities * restore functionality to lookup functions * better logging of capabilities * remove old versions from capabilities when a newer version is there * fix formatting * optimization for plugins.d rrdlabels to avoid creating and destructing dictionaries all the time * delayed health initialization for rrddim and rrdset * cleanup health initialization * fix for popen() not returning the right value * add health worker jobs for initializing rrdset and rrddim * added content type support for functions; apps.plugin permanent function to display all the processes * fixes for functions parameters parsing in apps.plugin * fix for process matching in apps.plugiin * first working function for apps.plugin * Dashboard ACL is disabled for functions; Function errors are all in JSON format * apps.plugin function processes returns json table * use json_escape_string() to escape message * fix formatting * apps.plugin exposes all its metrics to function processes * fix json formatting when filtering out some rows * reopen the internal pipe of rrdpush in case of errors * misplaced statement * do not use buffer->len * support for GLOBAL functions (functions that are not linked to a chart * added /api/v1/functions endpoint; removed format from the FUNCTIONS api; * swagger documentation about the new api end points * added plugins.d documentation about functions * never re-close a file * remove uncessesary ifdef * fixed issues identified by codacy * fix for null label value * make edit-config copy-and-paste friendly * Revert "make edit-config copy-and-paste friendly" This reverts commit 54500c0e0a97f65a0c66c4d34e966f6a9056698e. * reworked sender handshake to fix coverity findings * timeout is zero, for both send_timeout() and recv_timeout() * properly detect that parent closed the socket * support caching of function responses; limit function response to 10MB; added protection from malformed function responses * disabled excessive logging * added units to apps.plugin function processes and normalized all values to be human readable * shorter field names * fixed issues reported * fixed apps.plugin error response; tested that pluginsd can properly handle faulty responses * use double linked list macros for double linked list management * faster apps.plugin function printing by minimizing file operations * added memory percentage * fix compatibility issues with older compilers and FreeBSD * rrdpush sender code cleanup; rrhost structure cleanup from sender flags and variables; * fix letftover variable in ifdef * apps.plugin: do not call detach from the thread; exit immediately when input is broken * exclude AR charts from health * flush cleaner; prefer sender output * clarity * do not fill the cbuffer if not connected * fix * dont enabled host->sender if streaming is not enabled; send host label updates to parent; * functions are only available through ACLK * Prepared statement reports only in dev mode * fix AR chart detection * fix for streaming not being enabling itself * more cleanup of sender and receiver structures * moved read-only flags and configuration options to rrdhost->options * fixed merge with master * fix for incomplete rename * prevent service thread from working on charts that are being collected Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-08-24Remove aclk_api.[ch] (#13540)Timotej S
* get rid of aclk_starter middleman * get rid of aclk_api.[ch]
2022-07-24Rrdcontext (#13335)Costa Tsaousis
* type checking on dictionary return values * first STRING implementation, used by DICTIONARY and RRDLABEL * enable AVL compilation of STRING * Initial functions to store context info * Call simple test functions * Add host_id when getting charts * Allow host to be null and in this case it will process the localhost * Simplify init Do not use strdupz - link directly to sqlite result set * Init the database during startup * make it compile - no functionality yet * intermediate commit * intermidiate * first interface to sql * loading instances * check if we need to update cloud * comparison of rrdcontext on conflict * merge context titles * rrdcontext public interface; statistics on STRING; scratchpad on DICTIONARY * dictionaries maintain version numbers; rrdcontext api * cascading changes * first operational cleanup * string unittest * proper cleanup of referenced dictionaries * added rrdmetrics * rrdmetric starting retention * Add fields to context Adjuct context creation and delete * Memory cleanup * Fix get context list Fix memory double free in tests Store context with two hosts * calculated retention * rrdcontext retention with collection * Persist database and shutdown * loading all from sql * Get chart list and dimension list changes * fully working attempt 1 * fully working attempt 2 * missing archived flag from log * fixed archived / collected * operational * proper cleanup * cleanup - implemented all interface functions - dictionary react callback triggers after the dictionary is unlocked * track all reasons for changes * proper tracking of reasons of changes * fully working thread * better versioning of contexts * fix string indexing with AVL * running version per context vs hub version; ifdef dbengine * added option to disable rrdmetrics * release old context when a chart changes context * cleanup properly * renamed config * cleanup contexts; general cleanup; * deletion inline with dequeue; lots of cleanup; child connected/disconnected * ml should start after rrdcontext * added missing NULL to ri->rrdset; rrdcontext flags are now only changed under a mutex lock * fix buggy STRING under AVL * Rework database initialization Add migration logic to the context database * fix data race conditions during context deletion * added version hash algorithm * fix string over AVL * update aclk-schemas * compile new ctx related protos * add ctx stream message utils * add context messages * add dummy rx message handlers * add the new topics * add ctx capability * add helper functions to send the new messages * update cmake build to not fail * update topic names * handle rrdcontext_enabled * add more functions * fatal on OOM cases instead of return NULL * silence unknown query type error * fully working attempt 1 * fully working attempt 2 * allow compiling without ACLK * added family to the context * removed excess character in UUID * smarter merging of titles and families * Database migration code to add family Add family to SQL_CHART_DATA and VERSIONED_CONTEXT_DATA * add family to context message * enable ctx in communication * hardcoded enabled contexts * Add hard code for CTX * add update node collectors to json * add context message log * fix log about last_time_t * fix collected flags for queued items * prevent crash on charts cleanup * fix bug in AVL indexing of dictionaries; make sure react callback of dictionaries has a reference counter, which is acquired while the dictionary is locked * fixed dictionary unittest * strict policy to cleanup and garbage collector * fix db rotation and garbage collection timings * remove deadlock * proper garbage collection - a lot faster retention recalculation * Added not NULL in database columns Remove migration code for context -- we will ship with version 1 of the table schema Added define for query in tests to detect localhost * Use UUID_STR_LEN instead of GUID_LEN + 1 Use realistic timestamps when adding test data in the database * Add NULL checks for passed parameters * Log deleted context when compiled with NETDATA_INTERNAL_CHECKS * Error checking for null host id * add missing ContextsCheckpoint log convertor * Fix spelling in VACCUM * Hold additional information for host -- prepare to load archived hosts on startup * Make sure claim id is valid * is_get_claimed is actually get the current claim id * Simplify ctx get chart list query * remove env negotiation * fix string unittest when there are some strings already in the index * propagate live-retention flag upstream; cleanup all update reasons; updated instances logging; automated attaching started/stopped collecting flags; * first implementation of /api/v1/contexts * full contexts API; updated swagger * disabled debugging; rrdcontext enabled by default * final cleanup and renaming of global variables * return current time on currently collected contexts, charts and dimensions * added option "deepscan" to the API to have the server refresh the retention and recalculate the contexts on the fly * fixed identation of yaml * Add constrains to the host table * host->node_id may not be available * new capabilities * lock the context while rendering json * update aclk-schemas * added permanent labels to all charts about plugin, module and family; added labels to all proc plugin modules * always add the labels * allow merging of families down to [x] * dont show uuids by default, added option to enable them; response is now accepting after,before to show only data for a specific timeframe; deleted items are only shown when "deleted" is requested; hub version is now shown when "queue" is requested * Use the localhost claim id * Fix to handle host constrains better * cgroups: add "k8s." prefix to chart context in k8s * Improve sqlite metadata version migration check * empty values set to "[none]"; fix labels unit test to reflect that * Check if we reached the version we want first (address CODACY report re: Array index 'i' is used before limits check) * Rewrite condition to address CODACY report (Redundant condition: t->filter_callback. '!A || (A && B)' is equivalent to '!A || B') * Properly unlock context * fixed memory leak on rrdcontexts - it was not freeing all dictionaries in rrdhost; added wait of up to 100ms on dictionary_destroy() to give time to dictionaries to release their items before destroying them * fixed memory leak on rrdlabels not freed on rrdinstances * fixed leak when dimensions and charts are redefined * Mark entries for charts and dimensions as submitted to the cloud 3600 seconds after their creation Mark entries for charts and dimensions as updated (confirmed by the cloud) 1800 seconds after their submission * renamed struct string * update cgroups alarms * fixed codacy suggestions * update dashboard info * fix k8s_cgroup_10s_received_packets_storm alarm * added filtering options to /api/v1/contexts and /api/v1/context * fix eslint * fix eslint * Fix pointer binding for host / chart uuids * Fix cgroups unit tests * fixed non-retention updates not propagated upstream * removed non-fatal fatals * Remove context from 2 way string merge. * Move string_2way_merge to dictionary.c * Add 2-way string merge tests. * split long lines * fix indentation in netdata-swagger.yaml * update netdata-swagger.json * yamllint please * remove the deleted flag when a context is collected * fix yaml warning in swagger * removed non-fatal fatals * charts should now be able to switch contexts * allow deletion of unused metrics, instances and contexts * keep the queued flag * cleanup old rrdinstance labels * dont hide objects when there is no filter; mark objects as deleted when there are no sub-objects * delete old instances once they changed context * delete all instances and contexts that do not have sub-objects * more precise transitions * Load archived hosts on startup (part 1) * update the queued time every time * disable by default; dedup deleted dimensions after snapshot * Load archived hosts on startup (part 2) * delayed processing of events until charts are being collected * remove dont-trigger flag when object is collected * polish all triggers given the new dont_process flag * Remove always true condition Enums for readbility / create_host_callback only if ACLK is enabled (for now) * Skip retention message if context streaming is enabled Add messages in the access log if context streaming is enabled * Check for node id being a UUID that can be parsed Improve error check / reporting when loading archived hosts and creating ACLK sync threads * collected, archived, deleted are now mutually exclusive * Enable the "orphan" handling for now Remove dead code Fix memory leak on free host * Queue charts and dimensions will be no-op if host is set to stream contexts * removed unused parameter and made sure flags are set on rrdcontext insert * make the rrdcontext thread abort mid-work when exiting * Skip chart hash computation and storage if contexts streaming is enabled Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com> Co-authored-by: Timo <timotej@netdata.cloud> Co-authored-by: ilyam8 <ilya@netdata.cloud> Co-authored-by: Vladimir Kobal <vlad@prokk.net> Co-authored-by: Vasilis Kalintiris <vasilis@netdata.cloud>
2022-05-27Fix coverity issue 378598 (#13022)Emmanuel Vasilakis
move appconfig_get before aclk lock
2022-03-17fix unclaimed agents (#12449)Timotej S
2022-03-16ensure claim_id is sent always lowercase (#12423)Timotej S
2022-01-19Handle re-claim while the agent is running in new architecture (#11924)Emmanuel Vasilakis
* re-connect when re-claiming * send the previous claim_id when disconnecting * use same block for aclk_kill_link * free prev_claimed_id
2021-06-14Allows ACLK NG and Legacy to coexist (#11225)Timotej S
2021-05-24Remove unecessary relative paths when including headers. (#11124)vkalintiris
Currently, we add the repository's top-level dir in the compiler's header search path. This means that code in every top-level directory within the repo can include headers sibling top-level directories. This patch makes header inclusion consistent when it comes to files that are included from sibling top-level directories within the repo.
2021-04-29Add functionality to store node_id for a host (#11059)Stelios Fragkakis
2021-04-26Store null claim id in the database for non claimed children (#11036)Stelios Fragkakis
2021-04-22Persist claim ids in local database for parent and children (#10993)Stelios Fragkakis
2021-03-16Adds ACLK-NG as fallback(#10315)Timotej S
* adds a new implementation of ACLK written almost from scratch * external dependencies only OpenSSL and JSON-C * fallback for systems where ACLK Legacy can't build (for technical or philosophical reasons) * can be forced to build by giving "--aclk-ng" to the installer
2021-01-19Move ACLK Legacy into a subfolder (#10265)Timotej S
* move all legacy ACLK into a subfolder to make space for ACLK-NG
2020-11-26ACLK Child Availability Messages (#9918)Timotej S
* new ACLK messages for Claiming MVP1
2020-09-08ACLK Version Negotiation (#9819)Timotej S
* implements version negotiation for ACLK
2020-08-26Adds claimed_id streaming (#9804)Timotej S
* streams claimed_id of child nodes to parents * adds this information into /api/v1/info
2020-08-04Fix missing comma. (#9656)Markos Fountoulakis
2020-07-23Sends netdata.public.unique.id (machine GUID) with claim (#9574)Timotej S
send mguid to the cloud
2020-05-20Regenerate topic base on connect (#9044)Andrew Moss
Allow agents to be reclaimed while they are running. Fix a race hazard between claiming and the ACLK. Changes the private key, base topic, username and contents of the LWT. Co-authored-by: <hilari@hilarimoragrega.com>
2020-05-11Enable support for Netdata Cloud.Andrew Moss
This PR merges the feature-branch to make the cloud live. It contains the following work: Co-authored-by: Andrew Moss <1043609+amoss@users.noreply.github.com(opens in new tab)> Co-authored-by: Jacek Kolasa <jacek.kolasa@gmail.com(opens in new tab)> Co-authored-by: Austin S. Hemmelgarn <austin@netdata.cloud(opens in new tab)> Co-authored-by: James Mills <prologic@shortcircuit.net.au(opens in new tab)> Co-authored-by: Markos Fountoulakis <44345837+mfundul@users.noreply.github.com(opens in new tab)> Co-authored-by: Timotej S <6674623+underhood@users.noreply.github.com(opens in new tab)> Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com(opens in new tab)> * dashboard with new navbars, v1.0-alpha.9: PR #8478 * dashboard v1.0.11: netdata/dashboard#76 Co-authored-by: Jacek Kolasa <jacek.kolasa@gmail.com(opens in new tab)> * Added installer code to bundle JSON-c if it's not present. PR #8836 Co-authored-by: James Mills <prologic@shortcircuit.net.au(opens in new tab)> * Fix claiming config PR #8843 * Adds JSON-c as hard dep. for ACLK PR #8838 * Fix SSL renegotiation errors in old versions of openssl. PR #8840. Also - we have a transient problem with opensuse CI so this PR disables them with a commit from @prologic. Co-authored-by: James Mills <prologic@shortcircuit.net.au(opens in new tab)> * Fix claiming error handling PR #8850 * Added CI to verify JSON-C bundling code in installer PR #8853 * Make cloud-enabled flag in web/api/v1/info be independent of ACLK build success PR #8866 * Reduce ACLK_STABLE_TIMEOUT from 10 to 3 seconds PR #8871 * remove old-cloud related UI from old dashboard (accessible now via /old suffix) PR #8858 * dashboard v1.0.13 PR #8870 * dashboard v1.0.14 PR #8904 * Provide feedback on proxy setting changes PR #8895 * Change the name of the connect message to update during an ongoing session PR #8927 * Fetch active alarms from alarm_log PR #8944
2020-04-03Fix Coverity defects (#8579)Andrew Moss
Fix Coverity CID355287 and CID355289: technically it is a false-positive but it is easier to put a pattern in the code that they can recognise as a sanitizer. The compiler will remove it during optimization. Fix CID353973: the security condition is unlikely to occur but we can avoid it completely. Fix resource leak from CID 355286 and CID 355288. Fixing new resource leak introduced by a previous commit (CID355449)
2020-04-03Fix compiler warnings in the claiming code (#8567)Vladimir Kobal
2020-03-31Switching over to soft feature flag (#8545)Andrew Moss
Preparing for the cloud release. This changes how we handle the feature flag so that it no longer requires installer switches and can be set from the config file. This still requires internal access to use and is not ready for public access yet.
2020-03-31Improve the behavior of claiming (#8516)Andrew Moss
The default cloud url has been updated to app.netdata.cloud ready for the release. The claiming process now checks the current user executing claiming and refuses to perform the claim for the wrong user. If the current UID is 0 then claiming proceeds but the file ownership is adjusted to be the correct netdata user. The default expected user is `netdata` unless the script can identify the user from the current configuration. After the claiming script is executed the CLI is used to reload the claiming state.
2020-03-26Improved ACLK (#8498)Stelios Fragkakis
Improved the stability of the ACLK
2020-03-25HTTP proxy support + some cleanup (#8418)Timo
* HTTP proxy support + some cleanup * fix unrelated compiler warnings with -Wextra * minor - log proxy setting * run changed code trough .clang-format * fix case when url ends by / * update README
2020-03-21Fix syntax error in claiming script. (#8452)Markos Fountoulakis
* Fix syntax error in claiming script. * Synchronized error messages between claiming script and C code * Fix exit code check
2020-03-17Fix outstanding problems in claiming and add SOCKS5 support. (#8406)Andrew Moss
This commit fixes the known problems in claiming: incorrect reports of success, better treatment of error code and improved visibility of what the script is doing. There has been extensive testing against both environments to check that it works. The socks5 proxy support has been integrated and works for both methods of calling the claiming script. Co-authored-by: Timotej Šiškovič <timotej@netdata.cloud>
2020-03-10Improve ACLK according to results of the smoke-test. (#8358)Andrew Moss
* Cleaning up the ACLK part 2 (#8187) * Initial proxy support implementation (#8146) * Implement the ACLK Challenge-Response Authentication (#8317) Co-authored-by: Timo <6674623+underhood@users.noreply.github.com>
2020-02-06ACLK agent 1 (#7894)Stelios Fragkakis
* - Add initial mqtt support * [WIP] Agent cloud link - Setup main mqtt thread to connect to a broker using V5 of the MQTT protocol (TBD) - Send alarms to "netdata/alarm" - Add error checks to handle connection failures - Add params for Broker, port Maximum concurrent sent / recev messages - Dummy function to check claiming status - Generic mqtt_send command to publish message to a base topic , sub topic It will end up in the form base_topic/sub_topic - Add host/port in the connection failure error message * Test libmosquitto libs * connect to broker locally (assume localhost:1883) * subscribe to channel netdata/command * Test try a reload command to trigger health reload * publish alerts to netdata/alarm * - Fix compile issues * - Use sleep_usec instead of usleep * - Delay reconnection on failure due to misconfiguration (high cpu usage) * - Remove the TLS connection config * - Fix NETDATA_MQTT_INITIALIZATION_SLEEP_WAIT to use seconds * - Gather ACLK related code under aclk folder - Add aclk_ functions for abstract layer - Moved low level libs intergration in mqtt.c * - Add README.md file with initial comment * - Clean MQTT v5 * - Code cleanup * - Remove alarm log for now - Remove the heart beat * - Remove message properties for V5 * - Remove message properties for V5 (header) * Fixed the netdata target to use a local static version of libmosquitto. The installer does not yet have steps to pull and build the local library. cd project_root git clone ssh://git@github.com/netdata/mosquitto mosquitto/ (cd mosquitto/lib && make) # Ignore the cpp error This will leave mosquitto/lib/libmosquitto.a for the build process to use. * - Fix compile issues with older < 1.6 libmosquitto lib * - Enable alarm events to check it works - Re arrange includes - Rework topic to be agent/guid/. Actual id will be returned by the is_agent_claimed * - Add initial metadata info - Added helper function in web_api - Added a debug command (info) * Update the claiming state to retrieve the claimed id. * - Use define for constants like command and metadata topics - Function to wait for initialization of the ACLK link - New aclk_subscribe command with QOS parameter for the mqtt subscription - Use the is_agent_claimed function to get the real claim id and use it to build the topics that will be used for the cloud communication - Change in netdata-claim.sh.in to write the claim id without a trailing \n * - Use define for constants like command and metadata topics - Function to wait for initialization of the ACLK link - New aclk_subscribe command with QOS parameter for the mqtt subscription - Use the is_agent_claimed function to get the real claim id and use it to build the topics that will be used for the cloud communication - Change in netdata-claim.sh.in to write the claim id without a trailing \n * - Remove the alarm log for now - Add code (but disabled) to send charts * - Use dummy anon, anon as username and password for testing purposes * - Use client id anon as well * Testing without TLS * Switching TLS back on to fix docker environment. * - Added query processing An incoming URL now calls web_client_api_request_v1_data to handle a request and push the results back to the "data" topic - Move the above processing from the message callback to the query handle loop - Added helper "pause" , "resume" commands to stop and resume query processing to stress test loading the queue with queries before executing them - Changed the endpoint topics to "meta", and "cmd" (previously metadata and command) * make info message follow protocol * move metadata msg generation into new func * move metadata msg generation into new func * - Add metadata to the responses - Add hook to queue chart changes on creation and dimensions - Changed the queue mechanism to include delay for X seconds - Add delayed submittion of charts to the cloud so that all DIMs are defined to avoid resubmission * - Add additional data info for aclk_queue command * - Use web_clinet_api_request_v1 to handle the incoming request This will handle all requests coming from the cloud * - Cleanup and aclk_query structure - Add msg_id parameter - Enable the incoming JSON request - Enable the outgoing JSON response * - Added new thread to handle query processing - Add lock and cond wait to wakeup thread when queries are submitted - Cleanup on the main init function * - Add wait time on agent init, to allow for chart, alarms and other definitions to be completed. - During the wait time, no queries will be queued * - Send metadata on query thread init - New generic create header function for the JSON response - Pack info and charts into one message - Modified chart to remove entries (test) - Modified charts mod to remove entries e.g alarms and volatile info - Change input to aclk_update_chart (RRDHOST / instead of hostname) * - When a request fails, add to the payload - We may need to handle in a different key - Error check in json parsing * - Add dummy aclk_update_alarm command * - Move incoming request JSON parsing code away from mqtt.c - Added #ifdef ACLK_ENABLE so that we can have code merged but disabled by default - Added version in incoming and outgoing JSON dict * - Disable code if ACLK_ENABLE is not defined - Remove references to the mqtt (mosquitto) lib - Add dummy stubs in mqtt.c for completeness if ACLK_ENABLE is not defined * - Disable challenge sample code for now * - Remove libmosquitto from makefile * - Fix spaces in Makefile.am - Remove ifdef to avoid warning from LGTM * - Remove for now the code that builds an along log test message to send to the cloud * - Add check for ACLK_ENABLE definition and avoid calling the chart update functions * - Remove commented code * - Move source files to the correct place (ACLK_PLUGIN_FILES) * - Remove include file thats not needed * - Remove include file thats not needed - Add improved checks for load_claiming_state() * - Fix error message. Used error() that also logs errno and message * - Fix some codacy issues * - Fix more codacy issues, code cleanup * - Revert code to address codacy warnings * - Revert spaces added in a previous commit by mistake * clean up if/else nest * print error if fopen fails * minor - error already logs errno * - Fix version formatting * - Cleanup all ACLK related compiler warnings - Re-arrange include files - Removed unused defines * - More compilation warnings fixed - Bug with thread creation fixed * - Add condition to skip compilation of the ACLK code entirely. Add env variable ACLK="yes" to enable * - Add condition to skip the libmosquitto * - Change feature flag from ACLK_ENABLE to ENABLE_ACLK in accordance with the rest of ENABLE_xx flags - Typo in info message fix Co-authored-by: Andrew Moss <1043609+amoss@users.noreply.github.com> Co-authored-by: Timo <6674623+underhood@users.noreply.github.com>
2019-12-19Agent claiming (#7525)Markos Fountoulakis
Initial infrastructure support for agent claiming. This feature is not currently enabled as we are still finalizing the details of the cloud infrastructure w.r.t. agent claiming. The feature will be enabled when we are ready to release it.