DBENGINE v2 - improvements part 11 (#14337)

* acquiring / releasing interface for metrics * metrics registry statistics * cleanup metrics registry by deleting metrics when they dont have retention anymore; do not double copy the data of pages to be flushed * print the tier in retention summary * Open files with buffered instead of direct I/O (test) * added more metrics stats and fixed unittest * rename writer functions to avoid confusion with refcounting * do not release a metric that is not acquired * Revert to use direct I/O on write -- use direct I/O on read as well * keep track of ARAL overhead and add it to the memory chart * aral full check via api * Cleanup * give names to ARALs and PGCs * aral improvements * restore query expansion to the future * prefer higher resolution tier when switching plans * added extent read statistics * smoother joining of tiers at query engine * fine tune aral max allocation size * aral restructuring to hide its internals from the rest of netdata * aral restructuring; addtion of defrag option to aral to keep the linked list sorted - enabled by default to test it * fully async aral * some statistics and cleanup * fix infinite loop while calculating retention * aral docs and defragmenting disabled by default * fix bug and add optimization when defragmenter is not enabled * aral stress test * aral speed report and documentation * added internal checks that all pages are full * improve internal log about metrics deletion * metrics registry uses one aral per partition * metrics registry aral max size to 512 elements per page * remove data_structures/README.md dependency --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
author: Costa Tsaousis <costa@netdata.cloud> 2023-01-30 20:36:16 +0200
committer: GitHub <noreply@github.com> 2023-01-30 20:36:16 +0200
commit: 7f8f11eb373dfc7bf6ac5a03e57a1b03487a279e (patch)
tree: a79f74e904690145b2c6807ca512eec1dd2160ed /libnetdata
parent: fd7f39a74426d16f01559bd5aac1a6f90baef57f (diff)
14 files changed, 1242 insertions, 606 deletions
diff --git a/libnetdata/Makefile.am b/libnetdata/Makefile.am
index b5bbb79e03..b81d620ba6 100644
--- a/libnetdata/Makefile.am
+++ b/libnetdata/Makefile.am
@@ -5,7 +5,7 @@ MAINTAINERCLEANFILES = $(srcdir)/Makefile.in
 
 SUBDIRS = \
     adaptive_resortable_list \
-    arrayalloc \
+    aral \
     avl \
     buffer \
     clocks \
diff --git a/libnetdata/arrayalloc/Makefile.am b/libnetdata/aral/Makefile.am
index 161784b8f6..161784b8f6 100644
--- a/libnetdata/arrayalloc/Makefile.am
+++ b/libnetdata/aral/Makefile.am
diff --git a/libnetdata/aral/README.md b/libnetdata/aral/README.md
new file mode 100644
index 0000000000..2b7376636e
--- /dev/null
+++ b/libnetdata/aral/README.md
@@ -0,0 +1,169 @@
+<!--
+title: "Array Allocator"
+custom_edit_url: https://github.com/netdata/netdata/edit/master/libnetdata/aral/README.md
+-->
+
+# Array Allocator
+
+Come on! Array allocators are embedded in libc! Why do we need such a thing in Netdata?
+
+Well, we have a couple of problems to solve:
+
+1. **Fragmentation** - It is important for Netdata to keeps its overall memory footprint as low as possible. libc does an amazing job when the same thread allocates and frees some memory. But it simply cannot do better without knowing the specifics of the application when memory is allocated and freed randomly between threads.
+2. **Speed** - Especially when allocations and de-allocations happen across threads, the speed penalty is tremendous.
+
+In Netdata we have a few moments that are very tough. Imagine collecting 1 million metrics per second. You have a buffer for each metric and put append new points there. This works beautifully, of course! But then, when the buffers get full, imagine the situation. You suddenly need 1 million buffers, at once!
+
+To solve this problem we first spread out the buffers. So, the first time each metric asks for a buffer, it gets a smaller one. We added logic there to spread them as evenly as possible across time. Solved? Not exactly!
+
+We have 3 tiers for each metric. For the metrics of tier 0 (per second resolution) we have a max buffer for 1024 points and every new metrics gets a random size between 3 points and 1024. So they are distributed across time. For 1 million metrics, we have about 1000 buffers beings created every second.
+
+But at some point, the end of the minute will come, and suddenly all the metrics will need a new buffer for tier 1 (per minute). Oops! We will spread tier 1 buffers across time too, but the first minute is a tough one. We really need 1 million buffers instantly.
+
+And if that minute happens to also be the beginning of an hour... tier 2 (per hour) kicks in. For that instant we are going to need 2 million buffers instantly.
+
+The problem becomes even bigger when we collect 2, or even 10 million metrics...
+
+So solve it, Netdata uses a special implementation of an array allocator that is tightly integrated with the structures we need.
+
+## Features
+
+1. Malloc, or MMAP modes. File based MMAP is also supported to put the data in file backed up shared memory.
+2. Fully asynchronous operations. There are just a couple of points where spin-locks protect a few counters and pointers.
+3. Optional defragmenter, that once enabled it will make free operation slower while trying to maintain a sorted list of fragments to offer first during allocations. The defragmenter can be enabled / disabled at run time. The defragmenter can hurt performance on application with intense turn-around of allocation, like Netdata dbengine caches. So, it is disabled by default.
+4. Without the defragmenter enabled, ARAL still tries to keep pages full, but the depth of the search is limited to 3 pages (so, a page with a free slot will either become 1st, 2nd, or 3rd). At the same time, during allocations, ARAL will evaluate the first 2 pages to find the one that is more full than the other, to use it for the new allocation.
+
+## How it works
+
+Allocations are organized in pages. Pages have a minimum size (a system page, usually 4KB) and a maximum defined by for each different kind of object.
+
+Initially every page is free. When an allocation request is made, the free space is split, and the first element is reserved. Free space is now considered there rest.
+
+This continuous until the page gets full, where a new page is allocated and the process is repeated.
+
+Each allocation returned has a pointer appended to it. The pointer points to the page the allocation belongs to.
+
+When a pointer is freed, the page it belongs is identified, its space is marked free, and it is prepended in a single linked list that resides in the page itself. So, each page has its own list of free slots to use.
+
+Pages are then on another linked list. This is a double linked list and at its beginning has the pages with free space and at the end the pages that are full. 
+
+When the defragmenter is enabled the pages double linked list is also sorted, like this: the fewer the free slots on a page, the earlier in the linked list the page will be, except if it does not have any free slot, in which case it will be at the end. So, the defragmenter tries to have pages full.
+
+When a page is entirerly free, it is given back to the system immediately. There is no caching of free pages.
+
+
+Parallelism is achieved like this:
+
+When some threads are waiting for a page to be allocated, free operations are allowed. If a free operation happens before a new page is allocated, any waiting thread will get the slot that is freed on another page.
+
+Free operations happen in parallel, even for the same page. There is a spin-lock on each page to protect the base pointer of the page's free slots single linked list. But, this is instant. All preparative work happens lockless, then to add the free slot to the page, the page spinlock is acquired, the free slot is prepended to the linked list on the page, the spinlock is released. Such free operations on different pages are totally parallel.
+
+Once the free operation on a page has finished, the pages double linked list spinlock is acquired to put the page first on that linked list. If the defragmenter is enabled, the spinlock is retained for a little longer, to find the exact position of the page in the linked list.
+
+During allocations, the reverse order is used. First get the pages double linked list spinlock, get the first page and decrement its free slots counter, then release the spinlock. If the first page does not have any free slots, a page allocation is spawn, without any locks acquired. All threads are spinning waiting for a page with free slots, either from the newly allocated one or from a free operation that may happen in parallel.
+
+Once a page is acquired, each thread locks its own page to get the first free slot and releases the lock immediately. This is guaranteed to succeed, because when the page was given to that thread its free slots counter was decremented. So, there is a free slot for every thread that got that page. All preparative work to return a pointer to the caller is done lock free. Allocations on different pages are done in parallel, without any intervention between them.
+
+
+## What to expect
+
+Systems not designed for parallelism achieve their top performance single threaded. The single threaded speed is the baseline. Adding more threads makes them slower.
+
+The baseline for ARAL is the following, the included stress test when running single threaded:
+
+```
+Running stress test of 1 threads, with 10000 elements each, for 5 seconds...
+2023-01-29 17:04:50: netdata INFO  : TH[0] : set name of thread 1314983 to TH[0]
+ARAL executes 12.27 M malloc and 12.26 M free operations/s
+ARAL executes 12.29 M malloc and 12.29 M free operations/s
+ARAL executes 12.30 M malloc and 12.30 M free operations/s
+ARAL executes 12.30 M malloc and 12.29 M free operations/s
+ARAL executes 12.29 M malloc and 12.29 M free operations/s
+Waiting the threads to finish...
+2023-01-29 17:04:55: netdata INFO  : MAIN : ARAL: did 61487356 malloc, 61487356 free, using 1 threads, in 5003808 usecs
+```
+
+The same test with 2 threads, both threads on the same ARAL of course. As you see performance improved:
+
+```
+Running stress test of 2 threads, with 10000 elements each, for 5 seconds...
+2023-01-29 17:05:25: netdata INFO  : TH[0] : set name of thread 1315537 to TH[0]
+2023-01-29 17:05:25: netdata INFO  : TH[1] : set name of thread 1315538 to TH[1]
+ARAL executes 17.75 M malloc and 17.73 M free operations/s
+ARAL executes 17.93 M malloc and 17.93 M free operations/s
+ARAL executes 18.17 M malloc and 18.18 M free operations/s
+ARAL executes 18.33 M malloc and 18.32 M free operations/s
+ARAL executes 18.36 M malloc and 18.36 M free operations/s
+Waiting the threads to finish...
+2023-01-29 17:05:30: netdata INFO  : MAIN : ARAL: did 90976190 malloc, 90976190 free, using 2 threads, in 5029462 usecs
+```
+
+The same test with 4 threads:
+
+```
+Running stress test of 4 threads, with 10000 elements each, for 5 seconds...
+2023-01-29 17:10:12: netdata INFO  : TH[0] : set name of thread 1319552 to TH[0]
+2023-01-29 17:10:12: netdata INFO  : TH[1] : set name of thread 1319553 to TH[1]
+2023-01-29 17:10:12: netdata INFO  : TH[2] : set name of thread 1319554 to TH[2]
+2023-01-29 17:10:12: netdata INFO  : TH[3] : set name of thread 1319555 to TH[3]
+ARAL executes 19.95 M malloc and 19.91 M free operations/s
+ARAL executes 20.08 M malloc and 20.08 M free operations/s
+ARAL executes 20.85 M malloc and 20.85 M free operations/s
+ARAL executes 20.84 M malloc and 20.84 M free operations/s
+ARAL executes 21.37 M malloc and 21.37 M free operations/s
+Waiting the threads to finish...
+2023-01-29 17:10:17: netdata INFO  : MAIN : ARAL: did 103549747 malloc, 103549747 free, using 4 threads, in 5023325 usecs
+```
+
+The same with 8 threads:
+
+```
+Running stress test of 8 threads, with 10000 elements each, for 5 seconds...
+2023-01-29 17:07:06: netdata INFO  : TH[0] : set name of thread 1317608 to TH[0]
+2023-01-29 17:07:06: netdata INFO  : TH[1] : set name of thread 1317609 to TH[1]
+2023-01-29 17:07:06: netdata INFO  : TH[2] : set name of thread 1317610 to TH[2]
+2023-01-29 17:07:06: netdata INFO  : TH[3] : set name of thread 1317611 to TH[3]
+2023-01-29 17:07:06: netdata INFO  : TH[4] : set name of thread 1317612 to TH[4]
+2023-01-29 17:07:06: netdata INFO  : TH[5] : set name of thread 1317613 to TH[5]
+2023-01-29 17:07:06: netdata INFO  : TH[6] : set name of thread 1317614 to TH[6]
+2023-01-29 17:07:06: netdata INFO  : TH[7] : set name of thread 1317615 to TH[7]
+ARAL executes 15.73 M malloc and 15.66 M free operations/s
+ARAL executes 13.95 M malloc and 13.94 M free operations/s
+ARAL executes 15.59 M malloc and 15.58 M free operations/s
+ARAL executes 15.49 M malloc and 15.49 M free operations/s
+ARAL executes 16.16 M malloc and 16.16 M free operations/s
+Waiting the threads to finish...
+2023-01-29 17:07:11: netdata INFO  : MAIN : ARAL: did 78427750 malloc, 78427750 free, using 8 threads, in 5088591 usecs
+```
+
+The same with 16 threads:
+
+```
+Running stress test of 16 threads, with 10000 elements each, for 5 seconds...
+2023-01-29 17:08:04: netdata INFO  : TH[0] : set name of thread 1318663 to TH[0]
+2023-01-29 17:08:04: netdata INFO  : TH[1] : set name of thread 1318664 to TH[1]
+2023-01-29 17:08:04: netdata INFO  : TH[2] : set name of thread 1318665 to TH[2]
+2023-01-29 17:08:04: netdata INFO  : TH[3] : set name of thread 1318666 to TH[3]
+2023-01-29 17:08:04: netdata INFO  : TH[4] : set name of thread 1318667 to TH[4]
+2023-01-29 17:08:04: netdata INFO  : TH[5] : set name of thread 1318668 to TH[5]
+2023-01-29 17:08:04: netdata INFO  : TH[6] : set name of thread 1318669 to TH[6]
+2023-01-29 17:08:04: netdata INFO  : TH[7] : set name of thread 1318670 to TH[7]
+2023-01-29 17:08:04: netdata INFO  : TH[8] : set name of thread 1318671 to TH[8]
+2023-01-29 17:08:04: netdata INFO  : TH[9] : set name of thread 1318672 to TH[9]
+2023-01-29 17:08:04: netdata INFO  : TH[10] : set name of thread 1318673 to TH[10]
+2023-01-29 17:08:04: netdata INFO  : TH[11] : set name of thread 1318674 to TH[11]
+2023-01-29 17:08:04: netdata INFO  : TH[12] : set name of thread 1318675 to TH[12]
+2023-01-29 17:08:04: netdata INFO  : TH[13] : set name of thread 1318676 to TH[13]
+2023-01-29 17:08:04: netdata INFO  : TH[14] : set name of thread 1318677 to TH[14]
+2023-01-29 17:08:04: netdata INFO  : TH[15] : set name of thread 1318678 to TH[15]
+ARAL executes 11.77 M malloc and 11.62 M free operations/s
+ARAL executes 12.80 M malloc and 12.81 M free operations/s
+ARAL executes 13.26 M malloc and 13.25 M free operations/s
+ARAL executes 13.30 M malloc and 13.29 M free operations/s
+ARAL executes 13.23 M malloc and 13.25 M free operations/s
+Waiting the threads to finish...
+2023-01-29 17:08:09: netdata INFO  : MAIN : ARAL: did 65302122 malloc, 65302122 free, using 16 threads, in 5066009 usecs
+```
+
+As you can see, the top performance is with 4 threads, almost double the single thread speed.
+16 threads performance is still better than single threaded, despite the intense concurrency.
diff --git a/libnetdata/aral/aral.c b/libnetdata/aral/aral.c
new file mode 100644
index 0000000000..8ea4f64624
--- /dev/null
+++ b/libnetdata/aral/aral.c
@@ -0,0 +1,918 @@
+#include "../libnetdata.h"
+#include "aral.h"
+
+#ifdef NETDATA_TRACE_ALLOCATIONS
+#define TRACE_ALLOCATIONS_FUNCTION_DEFINITION_PARAMS , const char *file, const char *function, size_t line
+#define TRACE_ALLOCATIONS_FUNCTION_CALL_PARAMS , file, function, line
+#else
+#define TRACE_ALLOCATIONS_FUNCTION_DEFINITION_PARAMS
+#define TRACE_ALLOCATIONS_FUNCTION_CALL_PARAMS
+#endif
+
+#define ARAL_FREE_PAGES_DELTA_TO_REARRANGE_LIST 5
+
+// max file size
+#define ARAL_MAX_PAGE_SIZE_MMAP (1*1024*1024*1024)
+
+// max malloc size
+// optimal at current versions of libc is up to 256k
+// ideal to have the same overhead as libc is 4k
+#define ARAL_MAX_PAGE_SIZE_MALLOC (65*1024)
+
+typedef struct aral_free {
+    size_t size;
+    struct aral_free *next;
+} ARAL_FREE;
+
+typedef struct aral_page {
+    size_t size;                    // the allocation size of the page
+    const char *filename;
+    uint8_t *data;
+
+    uint32_t free_elements_to_move_first;
+    uint32_t max_elements;          // the number of elements that can fit on this page
+
+    struct {
+        uint32_t used_elements;         // the number of used elements on this page
+        uint32_t free_elements;         // the number of free elements on this page
+    } aral_lock;
+
+    struct {
+        SPINLOCK spinlock;
+        ARAL_FREE *list;
+    } free;
+
+    struct aral_page *prev; // the prev page on the list
+    struct aral_page *next; // the next page on the list
+} ARAL_PAGE;
+
+struct aral {
+    struct {
+        char name[ARAL_MAX_NAME + 1];
+
+        bool lockless;
+        bool defragment;
+
+        size_t element_size;            // calculated to take into account ARAL overheads
+        size_t max_allocation_size;          // calculated in bytes
+        size_t page_ptr_offset;         // calculated
+        size_t natural_page_size;       // calculated
+
+        size_t requested_element_size;
+        size_t initial_page_elements;
+        size_t max_page_elements;
+
+        struct {
+            bool enabled;
+            const char *filename;
+            char **cache_dir;
+        } mmap;
+    } config;
+
+    struct {
+        SPINLOCK spinlock;
+        size_t file_number;             // for mmap
+        struct aral_page *pages;        // linked list of pages
+
+        size_t user_malloc_operations;
+        size_t user_free_operations;
+        size_t defragment_operations;
+        size_t defragment_linked_list_traversals;
+    } aral_lock;
+
+    struct {
+        SPINLOCK spinlock;
+        size_t allocation_size;         // current allocation size
+    } adders;
+
+    struct {
+    } atomic;
+};
+
+struct {
+    struct {
+        struct {
+            size_t allocations;
+            size_t allocated;
+        } structures;
+
+        struct {
+            size_t allocations;
+            size_t allocated;
+            size_t used;
+        } malloc;
+
+        struct {
+            size_t allocations;
+            size_t allocated;
+            size_t used;
+        } mmap;
+    } atomic;
+} aral_globals = {};
+
+void aral_get_size_statistics(size_t *structures, size_t *malloc_allocated, size_t *malloc_used, size_t *mmap_allocated, size_t *mmap_used) {
+    *structures = __atomic_load_n(&aral_globals.atomic.structures.allocated, __ATOMIC_RELAXED);
+    *malloc_allocated = __atomic_load_n(&aral_globals.atomic.malloc.allocated, __ATOMIC_RELAXED);
+    *malloc_used = __atomic_load_n(&aral_globals.atomic.malloc.used, __ATOMIC_RELAXED);
+    *mmap_allocated = __atomic_load_n(&aral_globals.atomic.mmap.allocated, __ATOMIC_RELAXED);
+    *mmap_used = __atomic_load_n(&aral_globals.atomic.mmap.used, __ATOMIC_RELAXED);
+}
+
+#define ARAL_NATURAL_ALIGNMENT  (sizeof(uintptr_t) * 2)
+static inline size_t natural_alignment(size_t size, size_t alignment) {
+    if(unlikely(size % alignment))
+        size = size + alignment - (size % alignment);
+
+    return size;
+}
+
+static size_t aral_align_alloc_size(ARAL *ar, uint64_t size) {
+    if(size % ar->config.natural_page_size)
+        size += ar->config.natural_page_size - (size % ar->config.natural_page_size) ;
+
+    if(size % ar->config.element_size)
+        size -= size % ar->config.element_size;
+
+    return size;
+}
+
+static inline void aral_lock(ARAL *ar) {
+    if(likely(!ar->config.lockless))
+        netdata_spinlock_lock(&ar->aral_lock.spinlock);
+}
+
+static inline void aral_unlock(ARAL *ar) {
+    if(likely(!ar->config.lockless))
+        netdata_spinlock_unlock(&ar->aral_lock.spinlock);
+}
+
+static void aral_delete_leftover_files(const char *name, const char *path, const char *required_prefix) {
+    DIR *dir = opendir(path);
+    if(!dir) return;
+
+    char full_path[FILENAME_MAX + 1];
+    size_t len = strlen(required_prefix);
+
+    struct dirent *de = NULL;
+    while((de = readdir(dir))) {
+        if(de->d_type == DT_DIR)
+            continue;
+
+        if(strncmp(de->d_name, required_prefix, len) != 0)
+            continue;
+
+        snprintfz(full_path, FILENAME_MAX, "%s/%s", path, de->d_name);
+        info("ARAL: '%s' removing left-over file '%s'", name, full_path);
+        if(unlikely(unlink(full_path) == -1))
+            error("ARAL: '%s' cannot delete file '%s'", name, full_path);
+    }
+
+    closedir(dir);
+}
+
+// ----------------------------------------------------------------------------
+// check a free slot
+
+#ifdef NETDATA_INTERNAL_CHECKS
+static inline void aral_free_validate_internal_check(ARAL *ar, ARAL_FREE *fr) {
+    if(unlikely(fr->size < ar->config.element_size))
+        fatal("ARAL: '%s' free item of size %zu, less than the expected element size %zu",
+              ar->config.name, fr->size, ar->config.element_size);
+
+    if(unlikely(fr->size % ar->config.element_size))
+        fatal("ARAL: '%s' free item of size %zu is not multiple to element size %zu",
+              ar->config.name, fr->size, ar->config.element_size);
+}
+#else
+#define aral_free_validate_internal_check(ar, fr) debug_dummy()
+#endif
+
+// ----------------------------------------------------------------------------
+// find the page a pointer belongs to
+
+#ifdef NETDATA_INTERNAL_CHECKS
+static inline ARAL_PAGE *find_page_with_allocation_internal_check(ARAL *ar, void *ptr) {
+    aral_lock(ar);
+
+    uintptr_t seeking = (uintptr_t)ptr;
+    ARAL_PAGE *page;
+
+    for(page = ar->aral_lock.pages; page ; page = page->next) {
+        if(unlikely(seeking >= (uintptr_t)page->data && seeking < (uintptr_t)page->data + page->size))
+            break;
+    }
+
+    aral_unlock(ar);
+
+    return page;
+}
+#endif
+
+// ----------------------------------------------------------------------------
+// find a page with a free slot (there shouldn't be any)
+
+#ifdef NETDATA_ARAL_INTERNAL_CHECKS
+static inline ARAL_PAGE *find_page_with_free_slots_internal_check___with_aral_lock(ARAL *ar) {
+    ARAL_PAGE *page;
+
+    for(page = ar->aral_lock.pages; page ; page = page->next) {
+        if(page->aral_lock.free_elements)
+            break;
+
+        internal_fatal(page->size - page->aral_lock.used_elements * ar->config.element_size >= ar->config.element_size,
+                       "ARAL: '%s' a page is marked full, but it is not!", ar->config.name);
+
+        internal_fatal(page->size < page->aral_lock.used_elements * ar->config.element_size,
+                       "ARAL: '%s' a page has been overflown!", ar->config.name);
+    }
+
+    return page;
+}
+#endif
+
+static ARAL_PAGE *aral_create_page___no_lock_needed(ARAL *ar TRACE_ALLOCATIONS_FUNCTION_DEFINITION_PARAMS) {
+    ARAL_PAGE *page = callocz(1, sizeof(ARAL_PAGE));
+    netdata_spinlock_init(&page->free.spinlock);
+    page->size = ar->adders.allocation_size;
+
+    if(page->size > ar->config.max_allocation_size)
+        page->size = ar->config.max_allocation_size;
+    else
+        ar->adders.allocation_size = aral_align_alloc_size(ar, (uint64_t)ar->adders.allocation_size * 4 / 3);
+
+    page->max_elements = page->aral_lock.free_elements = page->size / ar->config.element_size;
+    page->free_elements_to_move_first = page->max_elements / 4;
+    if(unlikely(page->free_elements_to_move_first < 1))
+        page->free_elements_to_move_first = 1;
+
+    __atomic_add_fetch(&aral_globals.atomic.structures.allocations, 1, __ATOMIC_RELAXED);
+    __atomic_add_fetch(&aral_globals.atomic.structures.allocated, sizeof(ARAL_PAGE), __ATOMIC_RELAXED);
+
+    if(unlikely(ar->config.mmap.enabled)) {
+        ar->aral_lock.file_number++;
+        char filename[FILENAME_MAX + 1];
+        snprintfz(filename, FILENAME_MAX, "%s/array_alloc.mmap/%s.%zu", *ar->config.mmap.cache_dir, ar->config.mmap.filename, ar->aral_lock.file_number);
+        page->filename = strdupz(filename);
+        page->data = netdata_mmap(page->filename, page->size, MAP_SHARED, 0, false, NULL);
+        if (unlikely(!page->data))
+            fatal("ARAL: '%s' cannot allocate aral buffer of size %zu on filename '%s'",
+                  ar->config.name, page->size, page->filename);
+        __atomic_add_fetch(&aral_globals.atomic.mmap.allocations, 1, __ATOMIC_RELAXED);
+        __atomic_add_fetch(&aral_globals.atomic.mmap.allocated, page->size, __ATOMIC_RELAXED);
+    }
+    else {
+#ifdef NETDATA_TRACE_ALLOCATIONS
+        page->data = mallocz_int(page->size TRACE_ALLOCATIONS_FUNCTION_CALL_PARAMS);
+#else
+        page->data = mallocz(page->size);
+#endif
+        __atomic_add_fetch(&aral_globals.atomic.malloc.allocations, 1, __ATOMIC_RELAXED);
+        __atomic_add_fetch(&aral_globals.atomic.malloc.allocated, page->size, __ATOMIC_RELAXED);
+    }
+
+    // link the free space to its page
+    ARAL_FREE *fr = (ARAL_FREE *)page->data;
+    fr->size = page->size;
+    fr->next = NULL;
+    page->free.list = fr;
+
+    aral_free_validate_internal_check(ar, fr);
+
+    return page;
+}
+
+void aral_del_page___no_lock_needed(ARAL *ar, ARAL_PAGE *page TRACE_ALLOCATIONS_FUNCTION_DEFINITION_PARAMS) {
+
+    // free it
+    if (ar->config.mmap.enabled) {
+        netdata_munmap(page->data, page->size);
+
+        if (unlikely(unlink(page->filename) == 1))
+            error("Cannot delete file '%s'", page->filename);
+
+        freez((void *)page->filename);
+
+        __atomic_sub_fetch(&aral_globals.atomic.mmap.allocations, 1, __ATOMIC_RELAXED);
+        __atomic_sub_fetch(&aral_globals.atomic.mmap.allocated, page->size, __ATOMIC_RELAXED);
+    }
+    else {
+#ifdef NETDATA_TRACE_ALLOCATIONS
+        freez_int(page->data TRACE_ALLOCATIONS_FUNCTION_CALL_PARAMS);
+#else
+        freez(page->data);
+#endif
+        __atomic_sub_fetch(&aral_globals.atomic.malloc.allocations, 1, __ATOMIC_RELAXED);
+        __atomic_sub_fetch(&aral_globals.atomic.malloc.allocated, page->size, __ATOMIC_RELAXED);
+    }
+
+    freez(page);
+
+    __atomic_sub_fetch(&aral_globals.atomic.structures.allocations, 1, __ATOMIC_RELAXED);
+    __atomic_sub_fetch(&aral_globals.atomic.structures.allocated, sizeof(ARAL_PAGE), __ATOMIC_RELAXED);
+}
+
+static inline void aral_insert_not_linked_page_with_free_items_to_proper_position___aral_lock_needed(ARAL *ar, ARAL_PAGE *page) {
+    ARAL_PAGE *first = ar->aral_lock.pages;
+
+    if (page->aral_lock.free_elements <= page->free_elements_to_move_first ||
+        !first ||
+        !first->aral_lock.free_elements ||
+        page->aral_lock.free_elements <= first->aral_lock.free_elements + ARAL_FREE_PAGES_DELTA_TO_REARRANGE_LIST) {
+        // first position
+        DOUBLE_LINKED_LIST_PREPEND_ITEM_UNSAFE(ar->aral_lock.pages, page, prev, next);
+    }
+    else {
+        ARAL_PAGE *second = first->next;
+
+        if (!second ||
+            !second->aral_lock.free_elements ||
+            page->aral_lock.free_elements <= second->aral_lock.free_elements)
+            // second position
+            DOUBLE_LINKED_LIST_INSERT_ITEM_AFTER_UNSAFE(ar->aral_lock.pages, first, page, prev, next);
+        else
+            // third position
+            DOUBLE_LINKED_LIST_INSERT_ITEM_AFTER_UNSAFE(ar->aral_lock.pages, second, page, prev, next);
+    }
+}
+
+static inline ARAL_PAGE *aral_acquire_a_free_slot(ARAL *ar TRACE_ALLOCATIONS_FUNCTION_DEFINITION_PARAMS) {
+    aral_lock(ar);
+
+    ARAL_PAGE *page = ar->aral_lock.pages;
+
+    while(!page || !page->aral_lock.free_elements) {
+#ifdef NETDATA_ARAL_INTERNAL_CHECKS
+        internal_fatal(find_page_with_free_slots_internal_check___with_aral_lock(ar), "ARAL: '%s' found page with free slot!", ar->config.name);
+#endif
+        aral_unlock(ar);
+
+        if(netdata_spinlock_trylock(&ar->adders.spinlock)) {
+            page = aral_create_page___no_lock_needed(ar TRACE_ALLOCATIONS_FUNCTION_CALL_PARAMS);
+
+            aral_lock(ar);
+            aral_insert_not_linked_page_with_free_items_to_proper_position___aral_lock_needed(ar, page);
+            netdata_spinlock_unlock(&ar->adders.spinlock);
+            break;
+        }
+        else {
+            aral_lock(ar);
+            page = ar->aral_lock.pages;
+        }
+    }
+
+    // we have a page
+    // and aral locked
+
+    {
+        ARAL_PAGE *first = ar->aral_lock.pages;
+        ARAL_PAGE *second = first->next;
+
+        if (!second ||
+            !second->aral_lock.free_elements ||
+            first->aral_lock.free_elements <= second->aral_lock.free_elements + ARAL_FREE_PAGES_DELTA_TO_REARRANGE_LIST)
+            page = first;
+        else {
+            DOUBLE_LINKED_LIST_REMOVE_ITEM_UNSAFE(ar->aral_lock.pages, second, prev, next);
+            DOUBLE_LINKED_LIST_PREPEND_ITEM_UNSAFE(ar->aral_lock.pages, second, prev, next);
+            page = second;
+        }
+    }
+
+    internal_fatal(!page || !page->aral_lock.free_elements,
+                   "ARAL: '%s' selected page does not have a free slot in it",
+                   ar->config.name);
+
+    internal_fatal(page->max_elements != page->aral_lock.used_elements + page->aral_lock.free_elements,
+                   "ARAL: '%s' page element counters do not match, "
+                   "page says it can handle %zu elements, "
+                   "but there are %zu used and %zu free items, "
+                   "total %zu items",
+                   ar->config.name,
+                   (size_t)page->max_elements,
+                   (size_t)page->aral_lock.used_elements, (size_t)page->aral_lock.free_elements,
+                   (size_t)page->aral_lock.used_elements + (size_t)page->aral_lock.free_elements
+    );
+
+    ar->aral_lock.user_malloc_operations++;
+
+    // acquire a slot for the caller
+    page->aral_lock.used_elements++;
+    if(--page->aral_lock.free_elements == 0) {
+        // we are done with this page
+        // move the full page last
+        // so that pages with free items remain first in the list
+        DOUBLE_LINKED_LIST_REMOVE_ITEM_UNSAFE(ar->aral_lock.pages, page, prev, next);
+        DOUBLE_LINKED_LIST_APPEND_ITEM_UNSAFE(ar->aral_lock.pages, page, prev, next);
+    }
+
+    aral_unlock(ar);
+
+    return page;
+}
+
+void *aral_mallocz_internal(ARAL *ar TRACE_ALLOCATIONS_FUNCTION_DEFINITION_PARAMS) {
+
+    ARAL_PAGE *page = aral_acquire_a_free_slot(ar TRACE_ALLOCATIONS_FUNCTION_CALL_PARAMS);
+
+    netdata_spinlock_lock(&page->free.spinlock);
+
+    internal_fatal(!page->free.list,
+                   "ARAL: '%s' free item to use, cannot be NULL.", ar->config.name);
+
+    internal_fatal(page->free.list->size < ar->config.element_size,
+                   "ARAL: '%s' free item size %zu, cannot be smaller than %zu",
+                   ar->config.name, page->free.list->size, ar->config.element_size);
+
+    ARAL_FREE *found_fr = page->free.list;
+
+    // check if the remaining size (after we use this slot) is not enough for another element
+    if(unlikely(found_fr->size - ar->config.element_size < ar->config.element_size)) {
+        // we can use the entire free space entry
+
+        page->free.list = found_fr->next;
+    }
+    else {
+        // we can split the free space entry
+
+        uint8_t *data = (uint8_t *)found_fr;
+        ARAL_FREE *fr = (ARAL_FREE *)&data[ar->config.element_size];
+        fr->size = found_fr->size - ar->config.element_size;
+
+        // link the free slot first in the page
+        fr->next = found_fr->next;
+        page->free.list = fr;
+
+        aral_free_validate_internal_check(ar, fr);
+    }
+
+    netdata_spinlock_unlock(&page->free.spinlock);
+
+    // put the page pointer after the element
+    uint8_t *data = (uint8_t *)found_fr;
+    ARAL_PAGE **page_ptr = (ARAL_PAGE **)&data[ar->config.page_ptr_offset];
+    *page_ptr = page;
+
+    if(unlikely(ar->config.mmap.enabled))
+        __atomic_add_fetch(&aral_globals.atomic.mmap.used, ar->config.element_size, __ATOMIC_RELAXED);
+    else
+        __atomic_add_fetch(&aral_globals.atomic.malloc.used, ar->config.element_size, __ATOMIC_RELAXED);
+
+    return (void *)found_fr;
+}
+
+static inline ARAL_PAGE *aral_ptr_to_page___must_NOT_have_aral_lock(ARAL *ar, void *ptr) {
+    // given a data pointer we returned before,
+    // find the ARAL_PAGE it belongs to
+
+    uint8_t *data = (uint8_t *)ptr;
+    ARAL_PAGE **page_ptr = (ARAL_PAGE **)&data[ar->config.page_ptr_offset];
+    ARAL_PAGE *page = *page_ptr;
+
+#ifdef NETDATA_INTERNAL_CHECKS
+    // make it NULL so that we will fail on double free
+    // do not enable this on production, because the MMAP file
+    // will need to be saved again!
+    *page_ptr = NULL;
+#endif
+
+#ifdef NETDATA_ARAL_INTERNAL_CHECKS
+    {
+        // find the page ptr belongs
+        ARAL_PAGE *page2 = find_page_with_allocation_internal_check(ar, ptr);
+
+        internal_fatal(page != page2,
+                       "ARAL: '%s' page pointers do not match!",
+                       ar->name);
+
+        internal_fatal(!page2,
+                       "ARAL: '%s' free of pointer %p is not in ARAL address space.",
+                       ar->name, ptr);
+    }
+#endif
+
+    internal_fatal(!page,
+                   "ARAL: '%s' possible corruption or double free of pointer %p",
+                   ar->config.name, ptr);
+
+    return page;
+}
+
+static void aral_defrag_sorted_page_position___aral_lock_needed(ARAL *ar, ARAL_PAGE *page) {
+    ARAL_PAGE *tmp;
+
+    int action = 0; (void)action;
+    size_t move_later = 0, move_earlier = 0;
+
+    for(tmp = page->next ;
+        tmp && tmp->aral_lock.free_elements && tmp->aral_lock.free_elements < page->aral_lock.free_elements ;
+        tmp = tmp->next)
+        move_later++;
+
+    if(!tmp && page->next) {
+        DOUBLE_LINKED_LIST_REMOVE_ITEM_UNSAFE(ar->aral_lock.pages, page, prev, next);
author	Costa Tsaousis <costa@netdata.cloud>	2023-01-30 20:36:16 +0200
committer	GitHub <noreply@github.com>	2023-01-30 20:36:16 +0200
commit	7f8f11eb373dfc7bf6ac5a03e57a1b03487a279e (patch)
tree	a79f74e904690145b2c6807ca512eec1dd2160ed /libnetdata
parent	fd7f39a74426d16f01559bd5aac1a6f90baef57f (diff)