diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2020-08-03 14:31:33 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2020-08-03 14:31:33 -0700 |
commit | 8f0cb6660acb0d4756df880a3e60e73daa9c244e (patch) | |
tree | 9f14396bed65f82b94f7bb2425ffb2f0cbd4519a /Documentation | |
parent | 5ece08178d6567db5ef0090b1ae7f795c3c36161 (diff) | |
parent | c1cc4784ce6e8cceff1013709abd74bcbf7fbf24 (diff) |
Merge tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
- kfree_rcu updates
- RCU tasks updates
- Read-side scalability tests
- SRCU updates
- Torture-test updates
- Documentation updates
- Miscellaneous fixes
* tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (109 commits)
torture: Remove obsolete "cd $KVM"
torture: Avoid duplicate specification of qemu command
torture: Dump ftrace at shutdown only if requested
torture: Add kvm-tranform.sh script for qemu-cmd files
torture: Add more tracing crib notes to kvm.sh
torture: Improve diagnostic for KCSAN-incapable compilers
torture: Correctly summarize build-only runs
torture: Pass --kmake-arg to all make invocations
rcutorture: Check for unwatched readers
torture: Abstract out console-log error detection
torture: Add a stop-run capability
torture: Create qemu-cmd in --buildonly runs
rcu/rcutorture: Replace 0 with false
torture: Add --allcpus argument to the kvm.sh script
torture: Remove whitespace from identify_qemu_vcpus output
rcutorture: NULL rcu_torture_current earlier in cleanup code
rcutorture: Handle non-statistic bang-string error messages
torture: Set configfile variable to current scenario
rcutorture: Add races with task-exit processing
locktorture: Use true and false to assign to bool variables
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/RCU/Design/Requirements/Requirements.rst | 7 | ||||
-rw-r--r-- | Documentation/RCU/checklist.rst (renamed from Documentation/RCU/checklist.txt) | 17 | ||||
-rw-r--r-- | Documentation/RCU/index.rst | 9 | ||||
-rw-r--r-- | Documentation/RCU/lockdep-splat.rst (renamed from Documentation/RCU/lockdep-splat.txt) | 109 | ||||
-rw-r--r-- | Documentation/RCU/lockdep.rst (renamed from Documentation/RCU/lockdep.txt) | 12 | ||||
-rw-r--r-- | Documentation/RCU/rculist_nulls.rst | 200 | ||||
-rw-r--r-- | Documentation/RCU/rculist_nulls.txt | 172 | ||||
-rw-r--r-- | Documentation/RCU/rcuref.rst (renamed from Documentation/RCU/rcuref.txt) | 199 | ||||
-rw-r--r-- | Documentation/RCU/stallwarn.rst (renamed from Documentation/RCU/stallwarn.txt) | 62 | ||||
-rw-r--r-- | Documentation/RCU/torture.rst (renamed from Documentation/RCU/torture.txt) | 117 | ||||
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 68 | ||||
-rw-r--r-- | Documentation/locking/locktorture.rst | 2 |
12 files changed, 569 insertions, 405 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst index 50d5c43c48b0..8f41ad0aa753 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.rst +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -2583,7 +2583,12 @@ not work to have these markers in the trampoline itself, because there would need to be instructions following ``rcu_read_unlock()``. Although ``synchronize_rcu()`` would guarantee that execution reached the ``rcu_read_unlock()``, it would not be able to guarantee that execution -had completely left the trampoline. +had completely left the trampoline. Worse yet, in some situations +the trampoline's protection must extend a few instructions *prior* to +execution reaching the trampoline. For example, these few instructions +might calculate the address of the trampoline, so that entering the +trampoline would be pre-ordained a surprisingly long time before execution +actually reached the trampoline itself. The solution, in the form of `Tasks RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.rst index e98ff261a438..2efed9926c3f 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.rst @@ -1,4 +1,8 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================ Review Checklist for RCU Patches +================================ This document contains a checklist for producing and reviewing patches @@ -411,18 +415,21 @@ over a rather long period of time, but improvements are always welcome! __rcu sparse checks to validate your RCU code. These can help find problems as follows: - CONFIG_PROVE_LOCKING: check that accesses to RCU-protected data + CONFIG_PROVE_LOCKING: + check that accesses to RCU-protected data structures are carried out under the proper RCU read-side critical section, while holding the right combination of locks, or whatever other conditions are appropriate. - CONFIG_DEBUG_OBJECTS_RCU_HEAD: check that you don't pass the + CONFIG_DEBUG_OBJECTS_RCU_HEAD: + check that you don't pass the same object to call_rcu() (or friends) before an RCU grace period has elapsed since the last time that you passed that same object to call_rcu() (or friends). - __rcu sparse checks: tag the pointer to the RCU-protected data + __rcu sparse checks: + tag the pointer to the RCU-protected data structure with __rcu, and sparse will warn you if you access that pointer without the services of one of the variants of rcu_dereference(). @@ -442,8 +449,8 @@ over a rather long period of time, but improvements are always welcome! You instead need to use one of the barrier functions: - o call_rcu() -> rcu_barrier() - o call_srcu() -> srcu_barrier() + - call_rcu() -> rcu_barrier() + - call_srcu() -> srcu_barrier() However, these barrier functions are absolutely -not- guaranteed to wait for a grace period. In fact, if there are no call_rcu() diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst index 81a0a1e5f767..e703d3dbe60c 100644 --- a/Documentation/RCU/index.rst +++ b/Documentation/RCU/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + .. _rcu_concepts: ============ @@ -8,10 +10,17 @@ RCU concepts :maxdepth: 3 arrayRCU + checklist + lockdep + lockdep-splat rcubarrier rcu_dereference whatisRCU rcu + rculist_nulls + rcuref + torture + stallwarn listRCU NMI-RCU UP diff --git a/Documentation/RCU/lockdep-splat.txt b/Documentation/RCU/lockdep-splat.rst index b8096316fd11..2a5c79db57dc 100644 --- a/Documentation/RCU/lockdep-splat.txt +++ b/Documentation/RCU/lockdep-splat.rst @@ -1,3 +1,9 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Lockdep-RCU Splat +================= + Lockdep-RCU was added to the Linux kernel in early 2010 (http://lwn.net/Articles/371986/). This facility checks for some common misuses of the RCU API, most notably using one of the rcu_dereference() @@ -12,55 +18,54 @@ overwriting or worse. There can of course be false positives, this being the real world and all that. So let's look at an example RCU lockdep splat from 3.0-rc5, one that -has long since been fixed: - -============================= -WARNING: suspicious RCU usage ------------------------------ -block/cfq-iosched.c:2776 suspicious rcu_dereference_protected() usage! - -other info that might help us debug this: - - -rcu_scheduler_active = 1, debug_locks = 0 -3 locks held by scsi_scan_6/1552: - #0: (&shost->scan_mutex){+.+.}, at: [<ffffffff8145efca>] -scsi_scan_host_selected+0x5a/0x150 - #1: (&eq->sysfs_lock){+.+.}, at: [<ffffffff812a5032>] -elevator_exit+0x22/0x60 - #2: (&(&q->__queue_lock)->rlock){-.-.}, at: [<ffffffff812b6233>] -cfq_exit_queue+0x43/0x190 - -stack backtrace: -Pid: 1552, comm: scsi_scan_6 Not tainted 3.0.0-rc5 #17 -Call Trace: - [<ffffffff810abb9b>] lockdep_rcu_dereference+0xbb/0xc0 - [<ffffffff812b6139>] __cfq_exit_single_io_context+0xe9/0x120 - [<ffffffff812b626c>] cfq_exit_queue+0x7c/0x190 - [<ffffffff812a5046>] elevator_exit+0x36/0x60 - [<ffffffff812a802a>] blk_cleanup_queue+0x4a/0x60 - [<ffffffff8145cc09>] scsi_free_queue+0x9/0x10 - [<ffffffff81460944>] __scsi_remove_device+0x84/0xd0 - [<ffffffff8145dca3>] scsi_probe_and_add_lun+0x353/0xb10 - [<ffffffff817da069>] ? error_exit+0x29/0xb0 - [<ffffffff817d98ed>] ? _raw_spin_unlock_irqrestore+0x3d/0x80 - [<ffffffff8145e722>] __scsi_scan_target+0x112/0x680 - [<ffffffff812c690d>] ? trace_hardirqs_off_thunk+0x3a/0x3c - [<ffffffff817da069>] ? error_exit+0x29/0xb0 - [<ffffffff812bcc60>] ? kobject_del+0x40/0x40 - [<ffffffff8145ed16>] scsi_scan_channel+0x86/0xb0 - [<ffffffff8145f0b0>] scsi_scan_host_selected+0x140/0x150 - [<ffffffff8145f149>] do_scsi_scan_host+0x89/0x90 - [<ffffffff8145f170>] do_scan_async+0x20/0x160 - [<ffffffff8145f150>] ? do_scsi_scan_host+0x90/0x90 - [<ffffffff810975b6>] kthread+0xa6/0xb0 - [<ffffffff817db154>] kernel_thread_helper+0x4/0x10 - [<ffffffff81066430>] ? finish_task_switch+0x80/0x110 - [<ffffffff817d9c04>] ? retint_restore_args+0xe/0xe - [<ffffffff81097510>] ? __kthread_init_worker+0x70/0x70 - [<ffffffff817db150>] ? gs_change+0xb/0xb - -Line 2776 of block/cfq-iosched.c in v3.0-rc5 is as follows: +has long since been fixed:: + + ============================= + WARNING: suspicious RCU usage + ----------------------------- + block/cfq-iosched.c:2776 suspicious rcu_dereference_protected() usage! + +other info that might help us debug this:: + + rcu_scheduler_active = 1, debug_locks = 0 + 3 locks held by scsi_scan_6/1552: + #0: (&shost->scan_mutex){+.+.}, at: [<ffffffff8145efca>] + scsi_scan_host_selected+0x5a/0x150 + #1: (&eq->sysfs_lock){+.+.}, at: [<ffffffff812a5032>] + elevator_exit+0x22/0x60 + #2: (&(&q->__queue_lock)->rlock){-.-.}, at: [<ffffffff812b6233>] + cfq_exit_queue+0x43/0x190 + + stack backtrace: + Pid: 1552, comm: scsi_scan_6 Not tainted 3.0.0-rc5 #17 + Call Trace: + [<ffffffff810abb9b>] lockdep_rcu_dereference+0xbb/0xc0 + [<ffffffff812b6139>] __cfq_exit_single_io_context+0xe9/0x120 + [<ffffffff812b626c>] cfq_exit_queue+0x7c/0x190 + [<ffffffff812a5046>] elevator_exit+0x36/0x60 + [<ffffffff812a802a>] blk_cleanup_queue+0x4a/0x60 + [<ffffffff8145cc09>] scsi_free_queue+0x9/0x10 + [<ffffffff81460944>] __scsi_remove_device+0x84/0xd0 + [<ffffffff8145dca3>] scsi_probe_and_add_lun+0x353/0xb10 + [<ffffffff817da069>] ? error_exit+0x29/0xb0 + [<ffffffff817d98ed>] ? _raw_spin_unlock_irqrestore+0x3d/0x80 + [<ffffffff8145e722>] __scsi_scan_target+0x112/0x680 + [<ffffffff812c690d>] ? trace_hardirqs_off_thunk+0x3a/0x3c + [<ffffffff817da069>] ? error_exit+0x29/0xb0 + [<ffffffff812bcc60>] ? kobject_del+0x40/0x40 + [<ffffffff8145ed16>] scsi_scan_channel+0x86/0xb0 + [<ffffffff8145f0b0>] scsi_scan_host_selected+0x140/0x150 + [<ffffffff8145f149>] do_scsi_scan_host+0x89/0x90 + [<ffffffff8145f170>] do_scan_async+0x20/0x160 + [<ffffffff8145f150>] ? do_scsi_scan_host+0x90/0x90 + [<ffffffff810975b6>] kthread+0xa6/0xb0 + [<ffffffff817db154>] kernel_thread_helper+0x4/0x10 + [<ffffffff81066430>] ? finish_task_switch+0x80/0x110 + [<ffffffff817d9c04>] ? retint_restore_args+0xe/0xe + [<ffffffff81097510>] ? __kthread_init_worker+0x70/0x70 + [<ffffffff817db150>] ? gs_change+0xb/0xb + +Line 2776 of block/cfq-iosched.c in v3.0-rc5 is as follows:: if (rcu_dereference(ioc->ioc_data) == cic) { @@ -70,7 +75,7 @@ case. Instead, we hold three locks, one of which might be RCU related. And maybe that lock really does protect this reference. If so, the fix is to inform RCU, perhaps by changing __cfq_exit_single_io_context() to take the struct request_queue "q" from cfq_exit_queue() as an argument, -which would permit us to invoke rcu_dereference_protected as follows: +which would permit us to invoke rcu_dereference_protected as follows:: if (rcu_dereference_protected(ioc->ioc_data, lockdep_is_held(&q->queue_lock)) == cic) { @@ -85,7 +90,7 @@ On the other hand, perhaps we really do need an RCU read-side critical section. In this case, the critical section must span the use of the return value from rcu_dereference(), or at least until there is some reference count incremented or some such. One way to handle this is to -add rcu_read_lock() and rcu_read_unlock() as follows: +add rcu_read_lock() and rcu_read_unlock() as follows:: rcu_read_lock(); if (rcu_dereference(ioc->ioc_data) == cic) { @@ -102,7 +107,7 @@ above lockdep-RCU splat. But in this particular case, we don't actually dereference the pointer returned from rcu_dereference(). Instead, that pointer is just compared to the cic pointer, which means that the rcu_dereference() can be replaced -by rcu_access_pointer() as follows: +by rcu_access_pointer() as follows:: if (rcu_access_pointer(ioc->ioc_data) == cic) { diff --git a/Documentation/RCU/lockdep.txt b/Documentation/RCU/lockdep.rst index 89db949eeca0..f1fc8ae3846a 100644 --- a/Documentation/RCU/lockdep.txt +++ b/Documentation/RCU/lockdep.rst @@ -1,4 +1,8 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================== RCU and lockdep checking +======================== All flavors of RCU have lockdep checking available, so that lockdep is aware of when each task enters and leaves any flavor of RCU read-side @@ -8,7 +12,7 @@ tracking to include RCU state, which can sometimes help when debugging deadlocks and the like. In addition, RCU provides the following primitives that check lockdep's -state: +state:: rcu_read_lock_held() for normal RCU. rcu_read_lock_bh_held() for RCU-bh. @@ -63,7 +67,7 @@ checking of rcu_dereference() primitives: The rcu_dereference_check() check expression can be any boolean expression, but would normally include a lockdep expression. However, any boolean expression can be used. For a moderately ornate example, -consider the following: +consider the following:: file = rcu_dereference_check(fdt->fd[fd], lockdep_is_held(&files->file_lock) || @@ -82,7 +86,7 @@ RCU read-side critical sections, in case (2) the ->file_lock prevents any change from taking place, and finally, in case (3) the current task is the only task accessing the file_struct, again preventing any change from taking place. If the above statement was invoked only from updater -code, it could instead be written as follows: +code, it could instead be written as follows:: file = rcu_dereference_protected(fdt->fd[fd], lockdep_is_held(&files->file_lock) || @@ -105,7 +109,7 @@ false and they are called from outside any RCU read-side critical section. For example, the workqueue for_each_pwq() macro is intended to be used either within an RCU read-side critical section or with wq->mutex held. -It is thus implemented as follows: +It is thus implemented as follows:: #define for_each_pwq(pwq, wq) list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node, diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst new file mode 100644 index 000000000000..a9fc774bc400 --- /dev/null +++ b/Documentation/RCU/rculist_nulls.rst @@ -0,0 +1,200 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================= +Using RCU hlist_nulls to protect list and objects +================================================= + +This section describes how to use hlist_nulls to +protect read-mostly linked lists and +objects using SLAB_TYPESAFE_BY_RCU allocations. + +Please read the basics in Documentation/RCU/listRCU.rst + +Using 'nulls' +============= + +Using special makers (called 'nulls') is a convenient way +to solve following problem : + +A typical RCU linked list managing objects which are +allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can +use following algos : + +1) Lookup algo +-------------- + +:: + + rcu_read_lock() + begin: + obj = lockless_lookup(key); + if (obj) { + if (!try_get_ref(obj)) // might fail for free objects + goto begin; + /* + * Because a writer could delete object, and a writer could + * reuse these object before the RCU grace period, we + * must check key after getting the reference on object + */ + if (obj->key != key) { // not the object we expected + put_ref(obj); + goto begin; + } + } + rcu_read_unlock(); + +Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu() +but a version with an additional memory barrier (smp_rmb()) + +:: + + lockless_lookup(key) + { + struct hlist_node *node, *next; + for (pos = rcu_dereference((head)->first); + pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); + pos = rcu_dereference(next)) + if (obj->key == key) + return obj; + return NULL; + } + +And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb():: + + struct hlist_node *node; + for (pos = rcu_dereference((head)->first); + pos && ({ prefetch(pos->next); 1; }) && + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); + pos = rcu_dereference(pos->next)) + if (obj->key == key) + return obj; + return NULL; + +Quoting Corey Minyard:: + + "If the object is moved from one list to another list in-between the + time the hash is calculated and the next field is accessed, and the + object has moved to the end of a new list, the traversal will not + complete properly on the list it should have, since the object will + be on the end of the new list and there's not a way to tell it's on a + new list and restart the list traversal. I think that this can be + solved by pre-fetching the "next" field (with proper barriers) before + checking the key." + +2) Insert algo +-------------- + +We need to make sure a reader cannot read the new 'obj->obj_next' value +and previous value of 'obj->key'. Or else, an item could be deleted +from a chain, and inserted into another chain. If new chain was empty +before the move, 'next' pointer is NULL, and lockless reader can +not detect it missed following items in original chain. + +:: + + /* + * Please note that new inserts are done at the head of list, + * not in the middle or end. + */ + obj = kmem_cache_alloc(...); + lock_chain(); // typically a spin_lock() + obj->key = key; + /* + * we need to make sure obj->key is updated before obj->next + * or obj->refcnt + */ + smp_wmb(); + atomic_set(&obj->refcnt, 1); + hlist_add_head_rcu(&obj->obj_node, list); + unlock_chain(); // typically a spin_unlock() + + +3) Remove algo +-------------- +Nothing special here, we can use a standard RCU hlist deletion. +But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused +very very fast (before the end of RCU grace period) + +:: + + if (put_last_reference_on(obj) { + lock_chain(); // typically a spin_lock() + hlist_del_init_rcu(&obj->obj_node); + unlock_chain(); // typically a spin_unlock() + kmem_cache_free(cachep, obj); + } + + + +-------------------------------------------------------------------------- + +Avoiding extra smp_rmb() +======================== + +With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup() +and extra smp_wmb() in insert function. + +For example, if we choose to store the slot number as the 'nulls' +end-of-list marker for each slot of the hash table, we can detect +a race (some writer did a delete and/or a move of an object +to another chain) checking the final 'nulls' value if +the lookup met the end of chain. If final 'nulls' value +is not the slot number, then we must restart the lookup at +the beginning. If the object was moved to the same chain, +then the reader doesn't care : It might eventually +scan the list again without harm. + + +1) lookup algo +-------------- + +:: + + head = &table[slot]; + rcu_read_lock(); + begin: + hlist_nulls_for_each_entry_rcu(obj, node, head, member) { + if (obj->key == key) { + if (!try_get_ref(obj)) // might fail for free objects + goto begin; + if (obj->key != key) { // not the object we expected + put_ref(obj); + goto begin; + } + goto out; + } + /* + * if the nulls value we got at the end of this lookup is + * not the expected one, we must restart lookup. + * We probably met an item that was moved to another chain. + */ + if (get_nulls_value(node) != slot) + goto begin; + obj = NULL; + + out: + rcu_read_unlock(); + +2) Insert function +------------------ + +:: + + /* + * Please note that new inserts are done at the head of list, + * not in the middle or end. + */ + obj = kmem_cache_alloc(cachep); + lock_chain(); // typically a spin_lock() + obj->key = key; + /* + * changes to obj->key must be visible before refcnt one + */ + smp_wmb(); + atomic_set(&obj->refcnt, 1); + /* + * insert obj in RCU way (readers might be traversing chain) + */ + hlist_nulls_add_head_rcu(&obj->obj_node, list); + unlock_chain(); // typically a spin_unlock() diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt deleted file mode 100644 index 23f115dc87cf..000000000000 --- a/Documentation/RCU/rculist_nulls.txt +++ /dev/null @@ -1,172 +0,0 @@ -Using hlist_nulls to protect read-mostly linked lists and -objects using SLAB_TYPESAFE_BY_RCU allocations. - -Please read the basics in Documentation/RCU/listRCU.rst - -Using special makers (called 'nulls') is a convenient way -to solve following problem : - -A typical RCU linked list managing objects which are -allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can -use following algos : - -1) Lookup algo --------------- -rcu_read_lock() -begin: -obj = lockless_lookup(key); -if (obj) { - if (!try_get_ref(obj)) // might fail for free objects - goto begin; - /* - * Because a writer could delete object, and a writer could - * reuse these object before the RCU grace period, we - * must check key after getting the reference on object - */ - if (obj->key != key) { // not the object we expected - put_ref(obj); - goto begin; - } -} -rcu_read_unlock(); - -Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu() -but a version with an additional memory barrier (smp_rmb()) - -lockless_lookup(key) -{ - struct hlist_node *node, *next; - for (pos = rcu_dereference((head)->first); - pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) && - ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); - pos = rcu_dereference(next)) - if (obj->key == key) - return obj; - return NULL; - -And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb() : - - struct hlist_node *node; - for (pos = rcu_dereference((head)->first); - pos && ({ prefetch(pos->next); 1; }) && - ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); - pos = rcu_dereference(pos->next)) - if (obj->key == key) - return obj; - return NULL; -} - -Quoting Corey Minyard : - -"If the object is moved from one list to another list in-between the - time the hash is calculated and the next field is accessed, and the - object has moved to the end of a new list, the traversal will not - complete properly on the list it should have, since the object will - be on the end of the new list and there's not a way to tell it's on a - new list and restart the list traversal. I think that this can be - solved by pre-fetching the "next" field (with proper barriers) before - checking the key." - -2) Insert algo : ----------------- - -We need to make sure a reader cannot read the new 'obj->obj_next' value -and previous value of 'obj->key'. Or else, an item could be deleted -from a chain, and inserted into another chain. If new chain was empty -before the move, 'next' pointer is NULL, and lockless reader can -not detect it missed following items in original chain. - -/* - * Please note that new inserts are done at the head of list, - * not in the middle or end. - */ -obj = kmem_cache_alloc(...); -lock_chain(); // typically a spin_lock() -obj->key = key; -/* - * we need to make sure obj->key is updated before obj->next - * or obj->refcnt - */ -smp_wmb(); -atomic_set(&obj->refcnt, 1); -hlist_add_head_rcu(&obj->obj_node, list); -unlock_chain(); // typically a spin_unlock() - - -3) Remove algo --------------- -Nothing special here, we can use a standard RCU hlist deletion. -But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused -very very fast (before the end of RCU grace period) - -if (put_last_reference_on(obj) { - lock_chain(); // typically a spin_lock() - hlist_del_init_rcu(&obj->obj_node); - unlock_chain(); // typically a spin_unlock() - kmem_cache_free(cachep, obj); -} - - - --------------------------------------------------------------------------- -With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup() -and extra smp_wmb() in insert function. - -For example, if we choose to store the slot number as the 'nulls' -end-of-list marker for each slot of the hash table, we can detect -a race (some writer did a delete and/or a move of an object -to another chain) checking the final 'nulls' value if -the lookup met the end of chain. If final 'nulls' value -is not the slot number, then we must restart the lookup at -the beginning. If the object was moved to the same chain, -then the reader doesn't care : It might eventually -scan the list again without harm. - - -1) lookup algo - - head = &table[slot]; - rcu_read_lock(); -begin: - hlist_nulls_for_each_entry_rcu(obj, node, head, member) { - if (obj->key == key) { - if (!try_get_ref(obj)) // might fail for free objects - goto begin; - if (obj->key != key) { // not the object we expected - put_ref(obj); - goto begin; - } - goto out; - } -/* - * if the nulls value we got at the end of this lookup is - * not the expected one, we must restart lookup. - * We probably met an item that was moved to another chain. - */ - if (get_nulls_value(node) != slot) - goto begin; - obj = NULL; - -out: - rcu_read_unlock(); - -2) Insert function : --------------------- - -/* - * Please note that new inserts are done at the head of list, - * not in the middle or end. - */ -obj = kmem_cache_alloc(cachep); -lock_chain(); // typically a spin_lock() -obj->key = key; -/* - * changes to obj->key must be visible before refcnt one - */ -smp_wmb(); -atomic_set(&obj->refcnt, 1); -/* - * insert obj in RCU way (readers might be traversing chain) - */ -hlist_nulls_add_head_rcu(&obj->obj_node, list); -unlock_chain(); // typically a spin_unlock() diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.rst index 5e6429d66c24..b33aeb14fde3 100644 --- a/Documentation/RCU/rcuref.txt +++ b/Documentation/RCU/rcuref.rst @@ -1,4 +1,8 @@ -Reference-count design for elements of lists/arrays protected by RCU. +.. SPDX-License-Identifier: GPL-2.0 + +==================================================================== +Reference-count design for elements of lists/arrays protected by RCU +==================================================================== Please note that the percpu-ref feature is likely your first @@ -12,32 +16,33 @@ please read on. Reference counting on elements of lists which are protected by traditional reader/writer spinlocks or semaphores are straightforward: -CODE LISTING A: -1. 2. -add() search_and_reference() -{ { - alloc_object read_lock(&list_lock); - ... search_for_element - atomic_set(&el->rc, 1); atomic_inc(&el->rc); - write_lock(&list_lock); ... - add_element read_unlock(&list_lock); - ... ... - write_unlock(&list_lock); } -} - -3. 4. -release_referenced() delete() -{ { - ... write_lock(&list_lock); - if(atomic_dec_and_test(&el->rc)) ... - kfree(el); - ... remove_element -} write_unlock(&list_lock); - ... - if (atomic_dec_and_test(&el->rc)) - kfree(el); - ... - } +CODE LISTING A:: + + 1. 2. + add() search_and_reference() + { { + alloc_object read_lock(&list_lock); + ... search_for_element + atomic_set(&el->rc, 1); atomic_inc(&el->rc); + write_lock(&list_lock); ... + add_element read_unlock(&list_lock); + ... ... + write_unlock(&list_lock); } + } + + 3. 4. + release_referenced() delete() + { { + ... write_lock(&list_lock); + if(atomic_dec_and_test(&el->rc)) ... + kfree(el); + ... remove_element + } write_unlock(&list_lock); + ... + if (atomic_dec_and_test(&el->rc)) + kfree(el); + ... + } If this list/array is made lock free using RCU as in changing the write_lock() in add() and delete() to spin_lock() and changing read_lock() @@ -46,34 +51,35 @@ search_and_reference() could potentially hold reference to an element which has already been deleted from the list/array. Use atomic_inc_not_zero() in this scenario as follows: -CODE LISTING B: -1. 2. -add() search_and_reference() -{ { - alloc_object rcu_read_lock(); - ... search_for_element - atomic_set(&el->rc, 1); if (!atomic_inc_not_zero(&el->rc)) { - spin_lock(&list_lock); rcu_read_unlock(); - return FAIL; - add_element } - ... ... - spin_unlock(&list_lock); rcu_read_unlock(); -} } -3. 4. -release_referenced() delete() -{ { - ... spin_lock(&list_lock); - if (atomic_dec_and_test(&el->rc)) ... - call_rcu(&el->head, el_free); remove_element - ... spin_unlock(&list_lock); -} ... - if (atomic_dec_and_test(&el->rc)) - call_rcu(&el->head, el_free); - ... - } +CODE LISTING B:: + + 1. 2. + add() search_and_reference() + { { + alloc_object rcu_read_lock(); + ... search_for_element + atomic_set(&el->rc, 1); if (!atomic_inc_not_zero(&el->rc)) { + spin_lock(&list_lock); rcu_read_unlock(); + return FAIL; + add_element } + ... ... + spin_unlock(&list_lock); rcu_read_unlock(); + } } + 3. 4. + release_referenced() delete() + { { + ... spin_lock(&list_lock); + if (atomic_dec_and_test(&el->rc)) ... + call_rcu(&el->head, el_free); remove_element + ... spin_unlock(&list_lock); + } ... + if (atomic_dec_and_test(&el->rc)) + call_rcu(&el->head, el_free); + ... + } Sometimes, a reference to the element needs to be obtained in the -update (write) stream. In such cases, atomic_inc_not_zero() might be +update (write) stream. In such cases, atomic_inc_not_zero() might be overkill, since we hold the update-side spinlock. One might instead use atomic_inc() in such cases. @@ -82,39 +88,40 @@ search_and_reference() code path. In such cases, the atomic_dec_and_test() may be moved from delete() to el_free() as follows: -CODE LISTING C: -1. 2. -add() search_and_reference() -{ { - alloc_object rcu_read_lock(); - ... search_for_element - atomic_set(&el->rc, 1); atomic_inc(&el->rc); - spin_lock(&list_lock); ... - - add_element rcu_read_unlock(); - ... } - spin_unlock(&list_lock); 4. -} delete() -3. { -release_referenced() spin_lock(&list_lock); -{ ... - ... remove_element - if (atomic_dec_and_test(&el->rc)) spin_unlock(&list_lock); - kfree(el); ... - ... call_rcu(&el->head, el_free); -} ... -5. } -void el_free(struct rcu_head *rhp) -{ - release_referenced(); -} +CODE LISTING C:: + + 1. 2. + add() search_and_reference() + { { + alloc_object rcu_read_lock(); + ... search_for_element + atomic_set(&el->rc, 1); atomic_inc(&el->rc); + spin_lock(&list_lock); ... + + add_element rcu_read_unlock(); + ... } + spin_unlock(&list_lock); 4. + } delete() + 3. { + release_referenced() spin_lock(&list_lock); + { ... + ... remove_element + if (atomic_dec_and_test(&el->rc)) spin_unlock(&list_lock); + kfree(el); ... + ... call_rcu(&el->head, el_free); + } ... + 5. } + void el_free(struct rcu_head *rhp) + { + release_referenced(); + } The key point is that the initial reference added by add() is not removed until after a grace period has elapsed following removal. This means that search_and_reference() cannot find this element, which means that the value of el->rc cannot increase. Thus, once it reaches zero, there are no -readers that can or ever will be able to reference the element. The -element can therefore safely be freed. This in turn guarantees that if +readers that can or ever will be able to reference the element. The +element can therefore safely be freed. This in turn guarantees that if any reader finds the element, that reader may safely acquire a reference without checking the value of the reference counter. @@ -130,21 +137,21 @@ the eventual invocation of kfree(), which is usually not a problem on modern computer systems, even the small ones. In cases where delete() can sleep, synchronize_rcu() can be called from -delete(), so that el_free() can be subsumed into delete as follows: - -4. -delete() -{ - spin_lock(&list_lock); - ... - remove_element - spin_unlock(&list_lock); - ... - |