summaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2020-08-03 14:31:33 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2020-08-03 14:31:33 -0700
commit8f0cb6660acb0d4756df880a3e60e73daa9c244e (patch)
tree9f14396bed65f82b94f7bb2425ffb2f0cbd4519a /Documentation
parent5ece08178d6567db5ef0090b1ae7f795c3c36161 (diff)
parentc1cc4784ce6e8cceff1013709abd74bcbf7fbf24 (diff)
Merge tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar: - kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes * tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (109 commits) torture: Remove obsolete "cd $KVM" torture: Avoid duplicate specification of qemu command torture: Dump ftrace at shutdown only if requested torture: Add kvm-tranform.sh script for qemu-cmd files torture: Add more tracing crib notes to kvm.sh torture: Improve diagnostic for KCSAN-incapable compilers torture: Correctly summarize build-only runs torture: Pass --kmake-arg to all make invocations rcutorture: Check for unwatched readers torture: Abstract out console-log error detection torture: Add a stop-run capability torture: Create qemu-cmd in --buildonly runs rcu/rcutorture: Replace 0 with false torture: Add --allcpus argument to the kvm.sh script torture: Remove whitespace from identify_qemu_vcpus output rcutorture: NULL rcu_torture_current earlier in cleanup code rcutorture: Handle non-statistic bang-string error messages torture: Set configfile variable to current scenario rcutorture: Add races with task-exit processing locktorture: Use true and false to assign to bool variables ...
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/RCU/Design/Requirements/Requirements.rst7
-rw-r--r--Documentation/RCU/checklist.rst (renamed from Documentation/RCU/checklist.txt)17
-rw-r--r--Documentation/RCU/index.rst9
-rw-r--r--Documentation/RCU/lockdep-splat.rst (renamed from Documentation/RCU/lockdep-splat.txt)109
-rw-r--r--Documentation/RCU/lockdep.rst (renamed from Documentation/RCU/lockdep.txt)12
-rw-r--r--Documentation/RCU/rculist_nulls.rst200
-rw-r--r--Documentation/RCU/rculist_nulls.txt172
-rw-r--r--Documentation/RCU/rcuref.rst (renamed from Documentation/RCU/rcuref.txt)199
-rw-r--r--Documentation/RCU/stallwarn.rst (renamed from Documentation/RCU/stallwarn.txt)62
-rw-r--r--Documentation/RCU/torture.rst (renamed from Documentation/RCU/torture.txt)117
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt68
-rw-r--r--Documentation/locking/locktorture.rst2
12 files changed, 569 insertions, 405 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index 50d5c43c48b0..8f41ad0aa753 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -2583,7 +2583,12 @@ not work to have these markers in the trampoline itself, because there
would need to be instructions following ``rcu_read_unlock()``. Although
``synchronize_rcu()`` would guarantee that execution reached the
``rcu_read_unlock()``, it would not be able to guarantee that execution
-had completely left the trampoline.
+had completely left the trampoline. Worse yet, in some situations
+the trampoline's protection must extend a few instructions *prior* to
+execution reaching the trampoline. For example, these few instructions
+might calculate the address of the trampoline, so that entering the
+trampoline would be pre-ordained a surprisingly long time before execution
+actually reached the trampoline itself.
The solution, in the form of `Tasks
RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.rst
index e98ff261a438..2efed9926c3f 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.rst
@@ -1,4 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
Review Checklist for RCU Patches
+================================
This document contains a checklist for producing and reviewing patches
@@ -411,18 +415,21 @@ over a rather long period of time, but improvements are always welcome!
__rcu sparse checks to validate your RCU code. These can help
find problems as follows:
- CONFIG_PROVE_LOCKING: check that accesses to RCU-protected data
+ CONFIG_PROVE_LOCKING:
+ check that accesses to RCU-protected data
structures are carried out under the proper RCU
read-side critical section, while holding the right
combination of locks, or whatever other conditions
are appropriate.
- CONFIG_DEBUG_OBJECTS_RCU_HEAD: check that you don't pass the
+ CONFIG_DEBUG_OBJECTS_RCU_HEAD:
+ check that you don't pass the
same object to call_rcu() (or friends) before an RCU
grace period has elapsed since the last time that you
passed that same object to call_rcu() (or friends).
- __rcu sparse checks: tag the pointer to the RCU-protected data
+ __rcu sparse checks:
+ tag the pointer to the RCU-protected data
structure with __rcu, and sparse will warn you if you
access that pointer without the services of one of the
variants of rcu_dereference().
@@ -442,8 +449,8 @@ over a rather long period of time, but improvements are always welcome!
You instead need to use one of the barrier functions:
- o call_rcu() -> rcu_barrier()
- o call_srcu() -> srcu_barrier()
+ - call_rcu() -> rcu_barrier()
+ - call_srcu() -> srcu_barrier()
However, these barrier functions are absolutely -not- guaranteed
to wait for a grace period. In fact, if there are no call_rcu()
diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst
index 81a0a1e5f767..e703d3dbe60c 100644
--- a/Documentation/RCU/index.rst
+++ b/Documentation/RCU/index.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
.. _rcu_concepts:
============
@@ -8,10 +10,17 @@ RCU concepts
:maxdepth: 3
arrayRCU
+ checklist
+ lockdep
+ lockdep-splat
rcubarrier
rcu_dereference
whatisRCU
rcu
+ rculist_nulls
+ rcuref
+ torture
+ stallwarn
listRCU
NMI-RCU
UP
diff --git a/Documentation/RCU/lockdep-splat.txt b/Documentation/RCU/lockdep-splat.rst
index b8096316fd11..2a5c79db57dc 100644
--- a/Documentation/RCU/lockdep-splat.txt
+++ b/Documentation/RCU/lockdep-splat.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Lockdep-RCU Splat
+=================
+
Lockdep-RCU was added to the Linux kernel in early 2010
(http://lwn.net/Articles/371986/). This facility checks for some common
misuses of the RCU API, most notably using one of the rcu_dereference()
@@ -12,55 +18,54 @@ overwriting or worse. There can of course be false positives, this
being the real world and all that.
So let's look at an example RCU lockdep splat from 3.0-rc5, one that
-has long since been fixed:
-
-=============================
-WARNING: suspicious RCU usage
------------------------------
-block/cfq-iosched.c:2776 suspicious rcu_dereference_protected() usage!
-
-other info that might help us debug this:
-
-
-rcu_scheduler_active = 1, debug_locks = 0
-3 locks held by scsi_scan_6/1552:
- #0: (&shost->scan_mutex){+.+.}, at: [<ffffffff8145efca>]
-scsi_scan_host_selected+0x5a/0x150
- #1: (&eq->sysfs_lock){+.+.}, at: [<ffffffff812a5032>]
-elevator_exit+0x22/0x60
- #2: (&(&q->__queue_lock)->rlock){-.-.}, at: [<ffffffff812b6233>]
-cfq_exit_queue+0x43/0x190
-
-stack backtrace:
-Pid: 1552, comm: scsi_scan_6 Not tainted 3.0.0-rc5 #17
-Call Trace:
- [<ffffffff810abb9b>] lockdep_rcu_dereference+0xbb/0xc0
- [<ffffffff812b6139>] __cfq_exit_single_io_context+0xe9/0x120
- [<ffffffff812b626c>] cfq_exit_queue+0x7c/0x190
- [<ffffffff812a5046>] elevator_exit+0x36/0x60
- [<ffffffff812a802a>] blk_cleanup_queue+0x4a/0x60
- [<ffffffff8145cc09>] scsi_free_queue+0x9/0x10
- [<ffffffff81460944>] __scsi_remove_device+0x84/0xd0
- [<ffffffff8145dca3>] scsi_probe_and_add_lun+0x353/0xb10
- [<ffffffff817da069>] ? error_exit+0x29/0xb0
- [<ffffffff817d98ed>] ? _raw_spin_unlock_irqrestore+0x3d/0x80
- [<ffffffff8145e722>] __scsi_scan_target+0x112/0x680
- [<ffffffff812c690d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
- [<ffffffff817da069>] ? error_exit+0x29/0xb0
- [<ffffffff812bcc60>] ? kobject_del+0x40/0x40
- [<ffffffff8145ed16>] scsi_scan_channel+0x86/0xb0
- [<ffffffff8145f0b0>] scsi_scan_host_selected+0x140/0x150
- [<ffffffff8145f149>] do_scsi_scan_host+0x89/0x90
- [<ffffffff8145f170>] do_scan_async+0x20/0x160
- [<ffffffff8145f150>] ? do_scsi_scan_host+0x90/0x90
- [<ffffffff810975b6>] kthread+0xa6/0xb0
- [<ffffffff817db154>] kernel_thread_helper+0x4/0x10
- [<ffffffff81066430>] ? finish_task_switch+0x80/0x110
- [<ffffffff817d9c04>] ? retint_restore_args+0xe/0xe
- [<ffffffff81097510>] ? __kthread_init_worker+0x70/0x70
- [<ffffffff817db150>] ? gs_change+0xb/0xb
-
-Line 2776 of block/cfq-iosched.c in v3.0-rc5 is as follows:
+has long since been fixed::
+
+ =============================
+ WARNING: suspicious RCU usage
+ -----------------------------
+ block/cfq-iosched.c:2776 suspicious rcu_dereference_protected() usage!
+
+other info that might help us debug this::
+
+ rcu_scheduler_active = 1, debug_locks = 0
+ 3 locks held by scsi_scan_6/1552:
+ #0: (&shost->scan_mutex){+.+.}, at: [<ffffffff8145efca>]
+ scsi_scan_host_selected+0x5a/0x150
+ #1: (&eq->sysfs_lock){+.+.}, at: [<ffffffff812a5032>]
+ elevator_exit+0x22/0x60
+ #2: (&(&q->__queue_lock)->rlock){-.-.}, at: [<ffffffff812b6233>]
+ cfq_exit_queue+0x43/0x190
+
+ stack backtrace:
+ Pid: 1552, comm: scsi_scan_6 Not tainted 3.0.0-rc5 #17
+ Call Trace:
+ [<ffffffff810abb9b>] lockdep_rcu_dereference+0xbb/0xc0
+ [<ffffffff812b6139>] __cfq_exit_single_io_context+0xe9/0x120
+ [<ffffffff812b626c>] cfq_exit_queue+0x7c/0x190
+ [<ffffffff812a5046>] elevator_exit+0x36/0x60
+ [<ffffffff812a802a>] blk_cleanup_queue+0x4a/0x60
+ [<ffffffff8145cc09>] scsi_free_queue+0x9/0x10
+ [<ffffffff81460944>] __scsi_remove_device+0x84/0xd0
+ [<ffffffff8145dca3>] scsi_probe_and_add_lun+0x353/0xb10
+ [<ffffffff817da069>] ? error_exit+0x29/0xb0
+ [<ffffffff817d98ed>] ? _raw_spin_unlock_irqrestore+0x3d/0x80
+ [<ffffffff8145e722>] __scsi_scan_target+0x112/0x680
+ [<ffffffff812c690d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
+ [<ffffffff817da069>] ? error_exit+0x29/0xb0
+ [<ffffffff812bcc60>] ? kobject_del+0x40/0x40
+ [<ffffffff8145ed16>] scsi_scan_channel+0x86/0xb0
+ [<ffffffff8145f0b0>] scsi_scan_host_selected+0x140/0x150
+ [<ffffffff8145f149>] do_scsi_scan_host+0x89/0x90
+ [<ffffffff8145f170>] do_scan_async+0x20/0x160
+ [<ffffffff8145f150>] ? do_scsi_scan_host+0x90/0x90
+ [<ffffffff810975b6>] kthread+0xa6/0xb0
+ [<ffffffff817db154>] kernel_thread_helper+0x4/0x10
+ [<ffffffff81066430>] ? finish_task_switch+0x80/0x110
+ [<ffffffff817d9c04>] ? retint_restore_args+0xe/0xe
+ [<ffffffff81097510>] ? __kthread_init_worker+0x70/0x70
+ [<ffffffff817db150>] ? gs_change+0xb/0xb
+
+Line 2776 of block/cfq-iosched.c in v3.0-rc5 is as follows::
if (rcu_dereference(ioc->ioc_data) == cic) {
@@ -70,7 +75,7 @@ case. Instead, we hold three locks, one of which might be RCU related.
And maybe that lock really does protect this reference. If so, the fix
is to inform RCU, perhaps by changing __cfq_exit_single_io_context() to
take the struct request_queue "q" from cfq_exit_queue() as an argument,
-which would permit us to invoke rcu_dereference_protected as follows:
+which would permit us to invoke rcu_dereference_protected as follows::
if (rcu_dereference_protected(ioc->ioc_data,
lockdep_is_held(&q->queue_lock)) == cic) {
@@ -85,7 +90,7 @@ On the other hand, perhaps we really do need an RCU read-side critical
section. In this case, the critical section must span the use of the
return value from rcu_dereference(), or at least until there is some
reference count incremented or some such. One way to handle this is to
-add rcu_read_lock() and rcu_read_unlock() as follows:
+add rcu_read_lock() and rcu_read_unlock() as follows::
rcu_read_lock();
if (rcu_dereference(ioc->ioc_data) == cic) {
@@ -102,7 +107,7 @@ above lockdep-RCU splat.
But in this particular case, we don't actually dereference the pointer
returned from rcu_dereference(). Instead, that pointer is just compared
to the cic pointer, which means that the rcu_dereference() can be replaced
-by rcu_access_pointer() as follows:
+by rcu_access_pointer() as follows::
if (rcu_access_pointer(ioc->ioc_data) == cic) {
diff --git a/Documentation/RCU/lockdep.txt b/Documentation/RCU/lockdep.rst
index 89db949eeca0..f1fc8ae3846a 100644
--- a/Documentation/RCU/lockdep.txt
+++ b/Documentation/RCU/lockdep.rst
@@ -1,4 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
RCU and lockdep checking
+========================
All flavors of RCU have lockdep checking available, so that lockdep is
aware of when each task enters and leaves any flavor of RCU read-side
@@ -8,7 +12,7 @@ tracking to include RCU state, which can sometimes help when debugging
deadlocks and the like.
In addition, RCU provides the following primitives that check lockdep's
-state:
+state::
rcu_read_lock_held() for normal RCU.
rcu_read_lock_bh_held() for RCU-bh.
@@ -63,7 +67,7 @@ checking of rcu_dereference() primitives:
The rcu_dereference_check() check expression can be any boolean
expression, but would normally include a lockdep expression. However,
any boolean expression can be used. For a moderately ornate example,
-consider the following:
+consider the following::
file = rcu_dereference_check(fdt->fd[fd],
lockdep_is_held(&files->file_lock) ||
@@ -82,7 +86,7 @@ RCU read-side critical sections, in case (2) the ->file_lock prevents
any change from taking place, and finally, in case (3) the current task
is the only task accessing the file_struct, again preventing any change
from taking place. If the above statement was invoked only from updater
-code, it could instead be written as follows:
+code, it could instead be written as follows::
file = rcu_dereference_protected(fdt->fd[fd],
lockdep_is_held(&files->file_lock) ||
@@ -105,7 +109,7 @@ false and they are called from outside any RCU read-side critical section.
For example, the workqueue for_each_pwq() macro is intended to be used
either within an RCU read-side critical section or with wq->mutex held.
-It is thus implemented as follows:
+It is thus implemented as follows::
#define for_each_pwq(pwq, wq)
list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node,
diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst
new file mode 100644
index 000000000000..a9fc774bc400
--- /dev/null
+++ b/Documentation/RCU/rculist_nulls.rst
@@ -0,0 +1,200 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================================
+Using RCU hlist_nulls to protect list and objects
+=================================================
+
+This section describes how to use hlist_nulls to
+protect read-mostly linked lists and
+objects using SLAB_TYPESAFE_BY_RCU allocations.
+
+Please read the basics in Documentation/RCU/listRCU.rst
+
+Using 'nulls'
+=============
+
+Using special makers (called 'nulls') is a convenient way
+to solve following problem :
+
+A typical RCU linked list managing objects which are
+allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
+use following algos :
+
+1) Lookup algo
+--------------
+
+::
+
+ rcu_read_lock()
+ begin:
+ obj = lockless_lookup(key);
+ if (obj) {
+ if (!try_get_ref(obj)) // might fail for free objects
+ goto begin;
+ /*
+ * Because a writer could delete object, and a writer could
+ * reuse these object before the RCU grace period, we
+ * must check key after getting the reference on object
+ */
+ if (obj->key != key) { // not the object we expected
+ put_ref(obj);
+ goto begin;
+ }
+ }
+ rcu_read_unlock();
+
+Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu()
+but a version with an additional memory barrier (smp_rmb())
+
+::
+
+ lockless_lookup(key)
+ {
+ struct hlist_node *node, *next;
+ for (pos = rcu_dereference((head)->first);
+ pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
+ ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
+ pos = rcu_dereference(next))
+ if (obj->key == key)
+ return obj;
+ return NULL;
+ }
+
+And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb()::
+
+ struct hlist_node *node;
+ for (pos = rcu_dereference((head)->first);
+ pos && ({ prefetch(pos->next); 1; }) &&
+ ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
+ pos = rcu_dereference(pos->next))
+ if (obj->key == key)
+ return obj;
+ return NULL;
+
+Quoting Corey Minyard::
+
+ "If the object is moved from one list to another list in-between the
+ time the hash is calculated and the next field is accessed, and the
+ object has moved to the end of a new list, the traversal will not
+ complete properly on the list it should have, since the object will
+ be on the end of the new list and there's not a way to tell it's on a
+ new list and restart the list traversal. I think that this can be
+ solved by pre-fetching the "next" field (with proper barriers) before
+ checking the key."
+
+2) Insert algo
+--------------
+
+We need to make sure a reader cannot read the new 'obj->obj_next' value
+and previous value of 'obj->key'. Or else, an item could be deleted
+from a chain, and inserted into another chain. If new chain was empty
+before the move, 'next' pointer is NULL, and lockless reader can
+not detect it missed following items in original chain.
+
+::
+
+ /*
+ * Please note that new inserts are done at the head of list,
+ * not in the middle or end.
+ */
+ obj = kmem_cache_alloc(...);
+ lock_chain(); // typically a spin_lock()
+ obj->key = key;
+ /*
+ * we need to make sure obj->key is updated before obj->next
+ * or obj->refcnt
+ */
+ smp_wmb();
+ atomic_set(&obj->refcnt, 1);
+ hlist_add_head_rcu(&obj->obj_node, list);
+ unlock_chain(); // typically a spin_unlock()
+
+
+3) Remove algo
+--------------
+Nothing special here, we can use a standard RCU hlist deletion.
+But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
+very very fast (before the end of RCU grace period)
+
+::
+
+ if (put_last_reference_on(obj) {
+ lock_chain(); // typically a spin_lock()
+ hlist_del_init_rcu(&obj->obj_node);
+ unlock_chain(); // typically a spin_unlock()
+ kmem_cache_free(cachep, obj);
+ }
+
+
+
+--------------------------------------------------------------------------
+
+Avoiding extra smp_rmb()
+========================
+
+With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup()
+and extra smp_wmb() in insert function.
+
+For example, if we choose to store the slot number as the 'nulls'
+end-of-list marker for each slot of the hash table, we can detect
+a race (some writer did a delete and/or a move of an object
+to another chain) checking the final 'nulls' value if
+the lookup met the end of chain. If final 'nulls' value
+is not the slot number, then we must restart the lookup at
+the beginning. If the object was moved to the same chain,
+then the reader doesn't care : It might eventually
+scan the list again without harm.
+
+
+1) lookup algo
+--------------
+
+::
+
+ head = &table[slot];
+ rcu_read_lock();
+ begin:
+ hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
+ if (obj->key == key) {
+ if (!try_get_ref(obj)) // might fail for free objects
+ goto begin;
+ if (obj->key != key) { // not the object we expected
+ put_ref(obj);
+ goto begin;
+ }
+ goto out;
+ }
+ /*
+ * if the nulls value we got at the end of this lookup is
+ * not the expected one, we must restart lookup.
+ * We probably met an item that was moved to another chain.
+ */
+ if (get_nulls_value(node) != slot)
+ goto begin;
+ obj = NULL;
+
+ out:
+ rcu_read_unlock();
+
+2) Insert function
+------------------
+
+::
+
+ /*
+ * Please note that new inserts are done at the head of list,
+ * not in the middle or end.
+ */
+ obj = kmem_cache_alloc(cachep);
+ lock_chain(); // typically a spin_lock()
+ obj->key = key;
+ /*
+ * changes to obj->key must be visible before refcnt one
+ */
+ smp_wmb();
+ atomic_set(&obj->refcnt, 1);
+ /*
+ * insert obj in RCU way (readers might be traversing chain)
+ */
+ hlist_nulls_add_head_rcu(&obj->obj_node, list);
+ unlock_chain(); // typically a spin_unlock()
diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt
deleted file mode 100644
index 23f115dc87cf..000000000000
--- a/Documentation/RCU/rculist_nulls.txt
+++ /dev/null
@@ -1,172 +0,0 @@
-Using hlist_nulls to protect read-mostly linked lists and
-objects using SLAB_TYPESAFE_BY_RCU allocations.
-
-Please read the basics in Documentation/RCU/listRCU.rst
-
-Using special makers (called 'nulls') is a convenient way
-to solve following problem :
-
-A typical RCU linked list managing objects which are
-allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
-use following algos :
-
-1) Lookup algo
---------------
-rcu_read_lock()
-begin:
-obj = lockless_lookup(key);
-if (obj) {
- if (!try_get_ref(obj)) // might fail for free objects
- goto begin;
- /*
- * Because a writer could delete object, and a writer could
- * reuse these object before the RCU grace period, we
- * must check key after getting the reference on object
- */
- if (obj->key != key) { // not the object we expected
- put_ref(obj);
- goto begin;
- }
-}
-rcu_read_unlock();
-
-Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu()
-but a version with an additional memory barrier (smp_rmb())
-
-lockless_lookup(key)
-{
- struct hlist_node *node, *next;
- for (pos = rcu_dereference((head)->first);
- pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
- ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
- pos = rcu_dereference(next))
- if (obj->key == key)
- return obj;
- return NULL;
-
-And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb() :
-
- struct hlist_node *node;
- for (pos = rcu_dereference((head)->first);
- pos && ({ prefetch(pos->next); 1; }) &&
- ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
- pos = rcu_dereference(pos->next))
- if (obj->key == key)
- return obj;
- return NULL;
-}
-
-Quoting Corey Minyard :
-
-"If the object is moved from one list to another list in-between the
- time the hash is calculated and the next field is accessed, and the
- object has moved to the end of a new list, the traversal will not
- complete properly on the list it should have, since the object will
- be on the end of the new list and there's not a way to tell it's on a
- new list and restart the list traversal. I think that this can be
- solved by pre-fetching the "next" field (with proper barriers) before
- checking the key."
-
-2) Insert algo :
-----------------
-
-We need to make sure a reader cannot read the new 'obj->obj_next' value
-and previous value of 'obj->key'. Or else, an item could be deleted
-from a chain, and inserted into another chain. If new chain was empty
-before the move, 'next' pointer is NULL, and lockless reader can
-not detect it missed following items in original chain.
-
-/*
- * Please note that new inserts are done at the head of list,
- * not in the middle or end.
- */
-obj = kmem_cache_alloc(...);
-lock_chain(); // typically a spin_lock()
-obj->key = key;
-/*
- * we need to make sure obj->key is updated before obj->next
- * or obj->refcnt
- */
-smp_wmb();
-atomic_set(&obj->refcnt, 1);
-hlist_add_head_rcu(&obj->obj_node, list);
-unlock_chain(); // typically a spin_unlock()
-
-
-3) Remove algo
---------------
-Nothing special here, we can use a standard RCU hlist deletion.
-But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
-very very fast (before the end of RCU grace period)
-
-if (put_last_reference_on(obj) {
- lock_chain(); // typically a spin_lock()
- hlist_del_init_rcu(&obj->obj_node);
- unlock_chain(); // typically a spin_unlock()
- kmem_cache_free(cachep, obj);
-}
-
-
-
---------------------------------------------------------------------------
-With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup()
-and extra smp_wmb() in insert function.
-
-For example, if we choose to store the slot number as the 'nulls'
-end-of-list marker for each slot of the hash table, we can detect
-a race (some writer did a delete and/or a move of an object
-to another chain) checking the final 'nulls' value if
-the lookup met the end of chain. If final 'nulls' value
-is not the slot number, then we must restart the lookup at
-the beginning. If the object was moved to the same chain,
-then the reader doesn't care : It might eventually
-scan the list again without harm.
-
-
-1) lookup algo
-
- head = &table[slot];
- rcu_read_lock();
-begin:
- hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
- if (obj->key == key) {
- if (!try_get_ref(obj)) // might fail for free objects
- goto begin;
- if (obj->key != key) { // not the object we expected
- put_ref(obj);
- goto begin;
- }
- goto out;
- }
-/*
- * if the nulls value we got at the end of this lookup is
- * not the expected one, we must restart lookup.
- * We probably met an item that was moved to another chain.
- */
- if (get_nulls_value(node) != slot)
- goto begin;
- obj = NULL;
-
-out:
- rcu_read_unlock();
-
-2) Insert function :
---------------------
-
-/*
- * Please note that new inserts are done at the head of list,
- * not in the middle or end.
- */
-obj = kmem_cache_alloc(cachep);
-lock_chain(); // typically a spin_lock()
-obj->key = key;
-/*
- * changes to obj->key must be visible before refcnt one
- */
-smp_wmb();
-atomic_set(&obj->refcnt, 1);
-/*
- * insert obj in RCU way (readers might be traversing chain)
- */
-hlist_nulls_add_head_rcu(&obj->obj_node, list);
-unlock_chain(); // typically a spin_unlock()
diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.rst
index 5e6429d66c24..b33aeb14fde3 100644
--- a/Documentation/RCU/rcuref.txt
+++ b/Documentation/RCU/rcuref.rst
@@ -1,4 +1,8 @@
-Reference-count design for elements of lists/arrays protected by RCU.
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================================
+Reference-count design for elements of lists/arrays protected by RCU
+====================================================================
Please note that the percpu-ref feature is likely your first
@@ -12,32 +16,33 @@ please read on.
Reference counting on elements of lists which are protected by traditional
reader/writer spinlocks or semaphores are straightforward:
-CODE LISTING A:
-1. 2.
-add() search_and_reference()
-{ {
- alloc_object read_lock(&list_lock);
- ... search_for_element
- atomic_set(&el->rc, 1); atomic_inc(&el->rc);
- write_lock(&list_lock); ...
- add_element read_unlock(&list_lock);
- ... ...
- write_unlock(&list_lock); }
-}
-
-3. 4.
-release_referenced() delete()
-{ {
- ... write_lock(&list_lock);
- if(atomic_dec_and_test(&el->rc)) ...
- kfree(el);
- ... remove_element
-} write_unlock(&list_lock);
- ...
- if (atomic_dec_and_test(&el->rc))
- kfree(el);
- ...
- }
+CODE LISTING A::
+
+ 1. 2.
+ add() search_and_reference()
+ { {
+ alloc_object read_lock(&list_lock);
+ ... search_for_element
+ atomic_set(&el->rc, 1); atomic_inc(&el->rc);
+ write_lock(&list_lock); ...
+ add_element read_unlock(&list_lock);
+ ... ...
+ write_unlock(&list_lock); }
+ }
+
+ 3. 4.
+ release_referenced() delete()
+ { {
+ ... write_lock(&list_lock);
+ if(atomic_dec_and_test(&el->rc)) ...
+ kfree(el);
+ ... remove_element
+ } write_unlock(&list_lock);
+ ...
+ if (atomic_dec_and_test(&el->rc))
+ kfree(el);
+ ...
+ }
If this list/array is made lock free using RCU as in changing the
write_lock() in add() and delete() to spin_lock() and changing read_lock()
@@ -46,34 +51,35 @@ search_and_reference() could potentially hold reference to an element which
has already been deleted from the list/array. Use atomic_inc_not_zero()
in this scenario as follows:
-CODE LISTING B:
-1. 2.
-add() search_and_reference()
-{ {
- alloc_object rcu_read_lock();
- ... search_for_element
- atomic_set(&el->rc, 1); if (!atomic_inc_not_zero(&el->rc)) {
- spin_lock(&list_lock); rcu_read_unlock();
- return FAIL;
- add_element }
- ... ...
- spin_unlock(&list_lock); rcu_read_unlock();
-} }
-3. 4.
-release_referenced() delete()
-{ {
- ... spin_lock(&list_lock);
- if (atomic_dec_and_test(&el->rc)) ...
- call_rcu(&el->head, el_free); remove_element
- ... spin_unlock(&list_lock);
-} ...
- if (atomic_dec_and_test(&el->rc))
- call_rcu(&el->head, el_free);
- ...
- }
+CODE LISTING B::
+
+ 1. 2.
+ add() search_and_reference()
+ { {
+ alloc_object rcu_read_lock();
+ ... search_for_element
+ atomic_set(&el->rc, 1); if (!atomic_inc_not_zero(&el->rc)) {
+ spin_lock(&list_lock); rcu_read_unlock();
+ return FAIL;
+ add_element }
+ ... ...
+ spin_unlock(&list_lock); rcu_read_unlock();
+ } }
+ 3. 4.
+ release_referenced() delete()
+ { {
+ ... spin_lock(&list_lock);
+ if (atomic_dec_and_test(&el->rc)) ...
+ call_rcu(&el->head, el_free); remove_element
+ ... spin_unlock(&list_lock);
+ } ...
+ if (atomic_dec_and_test(&el->rc))
+ call_rcu(&el->head, el_free);
+ ...
+ }
Sometimes, a reference to the element needs to be obtained in the
-update (write) stream. In such cases, atomic_inc_not_zero() might be
+update (write) stream. In such cases, atomic_inc_not_zero() might be
overkill, since we hold the update-side spinlock. One might instead
use atomic_inc() in such cases.
@@ -82,39 +88,40 @@ search_and_reference() code path. In such cases, the
atomic_dec_and_test() may be moved from delete() to el_free()
as follows:
-CODE LISTING C:
-1. 2.
-add() search_and_reference()
-{ {
- alloc_object rcu_read_lock();
- ... search_for_element
- atomic_set(&el->rc, 1); atomic_inc(&el->rc);
- spin_lock(&list_lock); ...
-
- add_element rcu_read_unlock();
- ... }
- spin_unlock(&list_lock); 4.
-} delete()
-3. {
-release_referenced() spin_lock(&list_lock);
-{ ...
- ... remove_element
- if (atomic_dec_and_test(&el->rc)) spin_unlock(&list_lock);
- kfree(el); ...
- ... call_rcu(&el->head, el_free);
-} ...
-5. }
-void el_free(struct rcu_head *rhp)
-{
- release_referenced();
-}
+CODE LISTING C::
+
+ 1. 2.
+ add() search_and_reference()
+ { {
+ alloc_object rcu_read_lock();
+ ... search_for_element
+ atomic_set(&el->rc, 1); atomic_inc(&el->rc);
+ spin_lock(&list_lock); ...
+
+ add_element rcu_read_unlock();
+ ... }
+ spin_unlock(&list_lock); 4.
+ } delete()
+ 3. {
+ release_referenced() spin_lock(&list_lock);
+ { ...
+ ... remove_element
+ if (atomic_dec_and_test(&el->rc)) spin_unlock(&list_lock);
+ kfree(el); ...
+ ... call_rcu(&el->head, el_free);
+ } ...
+ 5. }
+ void el_free(struct rcu_head *rhp)
+ {
+ release_referenced();
+ }
The key point is that the initial reference added by add() is not removed
until after a grace period has elapsed following removal. This means that
search_and_reference() cannot find this element, which means that the value
of el->rc cannot increase. Thus, once it reaches zero, there are no
-readers that can or ever will be able to reference the element. The
-element can therefore safely be freed. This in turn guarantees that if
+readers that can or ever will be able to reference the element. The
+element can therefore safely be freed. This in turn guarantees that if
any reader finds the element, that reader may safely acquire a reference
without checking the value of the reference counter.
@@ -130,21 +137,21 @@ the eventual invocation of kfree(), which is usually not a problem on
modern computer systems, even the small ones.
In cases where delete() can sleep, synchronize_rcu() can be called from
-delete(), so that el_free() can be subsumed into delete as follows:
-
-4.
-delete()
-{
- spin_lock(&list_lock);
- ...
- remove_element
- spin_unlock(&list_lock);
- ...
-