diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
new file mode 100644
index 000000000000..fd5e2cbc4935
--- /dev/null
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -0,0 +1,2704 @@
+=================================
+A Tour Through RCU's Requirements
+=================================
+
+Copyright IBM Corporation, 2015
+
+Author: Paul E. McKenney
+
+The initial version of this document appeared on
+`LWN <https://lwn.net/>`_ in the following articles:
+`part 1 <https://lwn.net/Articles/652156/>`_,
+`part 2 <https://lwn.net/Articles/652677/>`_, and
+`part 3 <https://lwn.net/Articles/653326/>`_.
+
+Introduction
+------------
+
+Read-copy update (RCU) is a synchronization mechanism that is often used
+as a replacement for reader-writer locking. RCU is unusual in that
+updaters do not block readers, which means that RCU's read-side
+primitives can be exceedingly fast and scalable. In addition, updaters
+can make useful forward progress concurrently with readers. However, all
+this concurrency between RCU readers and updaters does raise the
+question of exactly what RCU readers are doing, which in turn raises the
+question of exactly what RCU's requirements are.
+
+This document therefore summarizes RCU's requirements, and can be
+thought of as an informal, high-level specification for RCU. It is
+important to understand that RCU's specification is primarily empirical
+in nature; in fact, I learned about many of these requirements the hard
+way. This situation might cause some consternation. However, not only
+has this learning process been a lot of fun, but it has also been a
+great privilege to work with so many people willing to apply
+technologies in interesting new ways.
+
+All that aside, here are the categories of currently known RCU
+requirements:
+
+#. `Fundamental Requirements`_
+#. `Fundamental Non-Requirements`_
+#. `Parallelism Facts of Life`_
+#. `Quality-of-Implementation Requirements`_
+#. `Linux Kernel Complications`_
+#. `Software-Engineering Requirements`_
+#. `Other RCU Flavors`_
+#. `Possible Future Changes`_
+
+This is followed by a `summary <#Summary>`__. Note that the answer to
+each quick quiz immediately follows the quiz itself.
+
+Fundamental Requirements
+------------------------
+
+RCU's fundamental requirements are the closest thing RCU has to hard
+mathematical requirements. These are:
+
+#. `Grace-Period Guarantee`_
+#. `Publish/Subscribe Guarantee`_
+#. `Memory-Barrier Guarantees`_
+#. `RCU Primitives Guaranteed to Execute Unconditionally`_
+#. `Guaranteed Read-to-Write Upgrade`_
+
+Grace-Period Guarantee
+~~~~~~~~~~~~~~~~~~~~~~
+
+RCU's grace-period guarantee is unusual in being premeditated: Jack
+Slingwine and I had this guarantee firmly in mind when we started work
+on RCU (then called “rclock”) in the early 1990s. That said, the past
+two decades of experience with RCU have produced a much more detailed
+understanding of this guarantee.
+
+RCU's grace-period guarantee allows updaters to wait for the completion
+of all pre-existing RCU read-side critical sections. An RCU read-side
+critical section begins with the marker ``rcu_read_lock()`` and ends
+with the marker ``rcu_read_unlock()``. These markers may be nested, and
+RCU treats a nested set as one big RCU read-side critical section.
+Production-quality implementations of ``rcu_read_lock()`` and
+``rcu_read_unlock()`` are extremely lightweight, and in fact have
+exactly zero overhead in Linux kernels built for production use with
+``CONFIG_PREEMPT=n``.
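+
+For illustration, in ``CONFIG_PREEMPT=n`` builds the markers can be
+thought of as follows. This is a conceptual sketch only, not the actual
+kernel implementation, which must also interact with lockdep and
+preemption:
+
+   ::
+
+      /* Conceptual sketch: in CONFIG_PREEMPT=n builds, the read-side
+         markers emit no instructions, acting only as compiler barriers. */
+      static inline void sketch_rcu_read_lock(void)
+      {
+        barrier();
+      }
+
+      static inline void sketch_rcu_read_unlock(void)
+      {
+        barrier();
+      }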
+
+The grace-period guarantee allows ordering to be enforced with extremely
+low overhead to readers, for example:
+
+ ::
+
+ 1 int x, y;
+ 2
+ 3 void thread0(void)
+ 4 {
+ 5 rcu_read_lock();
+ 6 r1 = READ_ONCE(x);
+ 7 r2 = READ_ONCE(y);
+ 8 rcu_read_unlock();
+ 9 }
+ 10
+ 11 void thread1(void)
+ 12 {
+ 13 WRITE_ONCE(x, 1);
+ 14 synchronize_rcu();
+ 15 WRITE_ONCE(y, 1);
+ 16 }
+
+Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing
+readers, any instance of ``thread0()`` that loads a value of zero from
+``x`` must complete before ``thread1()`` stores to ``y``, so that
+instance must also load a value of zero from ``y``. Similarly, any
+instance of ``thread0()`` that loads a value of one from ``y`` must have
+started after the ``synchronize_rcu()`` started, and must therefore also
+load a value of one from ``x``. Therefore, the outcome:
+
+ ::
+
+ (r1 == 0 && r2 == 1)
+
+cannot happen.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Wait a minute! You said that updaters can make useful forward |
+| progress concurrently with readers, but pre-existing readers will |
+| block ``synchronize_rcu()``!!! |
+| Just who are you trying to fool??? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| First, if updaters do not wish to be blocked by readers, they can use |
+| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later. |
+| Second, even when using ``synchronize_rcu()``, the other update-side |
+| code does run concurrently with readers, whether pre-existing or not. |
++-----------------------------------------------------------------------+
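+
+As a preview of that later discussion, here is a minimal sketch of the
+non-blocking alternative: ``call_rcu()`` invokes a callback after a
+grace period elapses, so the updater never waits. The structure, field,
+and function names below are hypothetical:
+
+   ::
+
+      struct foo {
+        int a;
+        struct rcu_head rh;
+      };
+
+      static void free_foo_cb(struct rcu_head *rhp)
+      {
+        struct foo *fp = container_of(rhp, struct foo, rh);
+
+        kfree(fp); /* Runs only after all pre-existing readers finish. */
+      }
+
+      void retire_foo(struct foo *fp)
+      {
+        /* fp must already be unreachable by new readers. */
+        call_rcu(&fp->rh, free_foo_cb); /* Deferred free, never blocks. */
+        /* For simple kfree() cases, kfree_rcu(fp, rh) avoids the need
+           for a separate callback function. */
+      }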
+
+The earlier ``thread0()``/``thread1()`` scenario resembles one of the
+first uses of RCU in
+`DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a
+distributed lock manager's transition into a state suitable for handling
+recovery from node failure, more or less as follows:
+
+ ::
+
+ 1 #define STATE_NORMAL 0
+ 2 #define STATE_WANT_RECOVERY 1
+ 3 #define STATE_RECOVERING 2
+ 4 #define STATE_WANT_NORMAL 3
+ 5
+ 6 int state = STATE_NORMAL;
+ 7
+ 8 void do_something_dlm(void)
+ 9 {
+ 10 int state_snap;
+ 11
+ 12 rcu_read_lock();
+ 13 state_snap = READ_ONCE(state);
+ 14 if (state_snap == STATE_NORMAL)
+ 15 do_something();
+ 16 else
+ 17 do_something_carefully();
+ 18 rcu_read_unlock();
+ 19 }
+ 20
+ 21 void start_recovery(void)
+ 22 {
+ 23 WRITE_ONCE(state, STATE_WANT_RECOVERY);
+ 24 synchronize_rcu();
+ 25 WRITE_ONCE(state, STATE_RECOVERING);
+ 26 recovery();
+ 27 WRITE_ONCE(state, STATE_WANT_NORMAL);
+ 28 synchronize_rcu();
+ 29 WRITE_ONCE(state, STATE_NORMAL);
+ 30 }
+
+The RCU read-side critical section in ``do_something_dlm()`` works with
+the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that
+``do_something()`` never runs concurrently with ``recovery()``, but with
+little or no synchronization overhead in ``do_something_dlm()``.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Why is the ``synchronize_rcu()`` on line 28 needed? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Without that extra grace period, memory reordering could result in |
+| ``do_something_dlm()`` executing ``do_something()`` concurrently with |
+| the last bits of ``recovery()``. |
++-----------------------------------------------------------------------+
+
+In order to avoid fatal problems such as deadlocks, an RCU read-side
+critical section must not contain calls to ``synchronize_rcu()``.
+Similarly, an RCU read-side critical section must not contain anything
+that waits, directly or indirectly, on completion of an invocation of
+``synchronize_rcu()``.
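+
+For example, the following sketch (hypothetical function name)
+self-deadlocks, because the grace period cannot complete until the
+enclosing read-side critical section ends:
+
+   ::
+
+      void buggy_reader(void)
+      {
+        rcu_read_lock();
+        synchronize_rcu(); /* BUG: Waits for itself, never returns. */
+        rcu_read_unlock();
+      }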
+
+Although RCU's grace-period guarantee is useful in and of itself, with
+`quite a few use cases <https://lwn.net/Articles/573497/>`__, it would
+be good to be able to use RCU to coordinate read-side access to linked
+data structures. For this, the grace-period guarantee is not sufficient,
+as can be seen in function ``add_gp_buggy()`` below. We will look at the
+reader's code later, but in the meantime, just think of the reader as
+locklessly picking up the ``gp`` pointer, and, if the value loaded is
+non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.
+
+ ::
+
+ 1 bool add_gp_buggy(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+       5     return false;
+ 6 spin_lock(&gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 p->a = a;
+      12   p->b = b;
+ 13 gp = p; /* ORDERING BUG */
+ 14 spin_unlock(&gp_lock);
+ 15 return true;
+ 16 }
+
+The problem is that both the compiler and weakly ordered CPUs are within
+their rights to reorder this code as follows:
+
+ ::
+
+ 1 bool add_gp_buggy_optimized(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+       5     return false;
+ 6 spin_lock(&gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 gp = p; /* ORDERING BUG */
+ 12 p->a = a;
+      13   p->b = b;
+ 14 spin_unlock(&gp_lock);
+ 15 return true;
+ 16 }
+
+If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized``
+executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
+And this is but one of many ways in which compiler and hardware
+optimizations could cause trouble. Therefore, we clearly need some way
+to prevent the compiler and the CPU from reordering in this manner,
+which brings us to the publish-subscribe guarantee discussed in the next
+section.
+
+Publish/Subscribe Guarantee
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RCU's publish-subscribe guarantee allows data to be inserted into a
+linked data structure without disrupting RCU readers. The updater uses
+``rcu_assign_pointer()`` to insert the new data, and readers use
+``rcu_dereference()`` to access data, whether new or old. The following
+shows an example of insertion:
+
+ ::
+
+ 1 bool add_gp(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+       5     return false;
+ 6 spin_lock(&gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 p->a = a;
+      12   p->b = b;
+ 13 rcu_assign_pointer(gp, p);
+ 14 spin_unlock(&gp_lock);
+ 15 return true;
+ 16 }
+
+The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a
+simple assignment statement, but also guarantees that its assignment
+will happen after the two assignments in lines 11 and 12, similar to the
+C11 ``memory_order_release`` store operation. It also prevents any
+number of “interesting” compiler optimizations, for example, the use of
+``gp`` as a scratch location immediately preceding the assignment.
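+
+On the update side, ``rcu_assign_pointer()`` can thus be thought of as a
+release store. The following one-liner is a conceptual sketch only; the
+real macro also performs ``sparse`` checking and special-cases constant
+``NULL`` pointers:
+
+   ::
+
+      /* Conceptual sketch only, not the real macro. */
+      #define sketch_rcu_assign_pointer(p, v) smp_store_release(&(p), (v))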
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But ``rcu_assign_pointer()`` does nothing to prevent the two |
+| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
+| also cause problems? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| No, it cannot. The readers cannot see either of these two fields |
+| until the assignment to ``gp``, by which time both fields are fully |
+| initialized. So reordering the assignments to ``p->a`` and ``p->b`` |
+| cannot possibly cause any problems. |
++-----------------------------------------------------------------------+
+
+It is tempting to assume that the reader need not do anything special to
+control its accesses to the RCU-protected data, as shown in
+``do_something_gp_buggy()`` below:
+
+ ::
+
+ 1 bool do_something_gp_buggy(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 p = gp; /* OPTIMIZATIONS GALORE!!! */
+ 5 if (p) {
+ 6 do_something(p->a, p->b);
+ 7 rcu_read_unlock();
+ 8 return true;
+ 9 }
+ 10 rcu_read_unlock();
+ 11 return false;
+ 12 }
+
+However, this temptation must be resisted because there are a
+surprisingly large number of ways that the compiler (to say nothing of
+`DEC Alpha CPUs <https://h71000.www7.hp.com/wizard/wiz_2637.html>`__)
+can trip this code up. For but one example, if the compiler were short
+of registers, it might choose to refetch from ``gp`` rather than keeping
+a separate copy in ``p`` as follows:
+
+ ::
+
+ 1 bool do_something_gp_buggy_optimized(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */
+ 5 do_something(gp->a, gp->b);
+ 6 rcu_read_unlock();
+ 7 return true;
+ 8 }
+ 9 rcu_read_unlock();
+ 10 return false;
+ 11 }
+
+If this function ran concurrently with a series of updates that replaced
+the current structure with a new one, the fetches of ``gp->a`` and
+``gp->b`` might well come from two different structures, which could
+cause serious confusion. To prevent this (and much else besides),
+``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``:
+
+ ::
+
+ 1 bool do_something_gp(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 p = rcu_dereference(gp);
+ 5 if (p) {
+ 6 do_something(p->a, p->b);
+ 7 rcu_read_unlock();
+ 8 return true;
+ 9 }
+ 10 rcu_read_unlock();
+ 11 return false;
+ 12 }
+
+In the Linux kernel, ``rcu_dereference()`` uses volatile casts and (on
+DEC Alpha) memory barriers. Should a `high-quality implementation of
+C11 memory_order_consume [PDF]
+<http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__
+ever appear, then ``rcu_dereference()`` could be implemented as a
+``memory_order_consume`` load. Regardless of the exact implementation, a
+pointer fetched by ``rcu_dereference()`` may not be used outside of the
+outermost RCU read-side critical section containing that
+``rcu_dereference()``, unless protection of the corresponding data
+element has been passed from RCU to some other synchronization
+mechanism, most commonly locking or `reference
+counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__.
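+
+One common way of passing protection to a reference count is sketched
+below, assuming a hypothetical ``atomic_t refcnt`` field in
+``struct foo``:
+
+   ::
+
+      struct foo *get_gp_ref(void)
+      {
+        struct foo *p;
+
+        rcu_read_lock();
+        p = rcu_dereference(gp);
+        if (p && !atomic_inc_not_zero(&p->refcnt))
+          p = NULL; /* Object is already being torn down. */
+        rcu_read_unlock();
+        return p; /* If non-NULL, now protected by the reference count. */
+      }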
+
+In short, updaters use ``rcu_assign_pointer()`` and readers use
+``rcu_dereference()``, and these two RCU API elements work together to
+ensure that readers have a consistent view of newly added data elements.
+
+Of course, it is also necessary to remove elements from RCU-protected
+data structures, for example, using the following process:
+
+#. Remove the data element from the enclosing structure.
+#. Wait for all pre-existing RCU read-side critical sections to complete
+ (because only pre-existing readers can possibly have a reference to
+ the newly removed data element).
+#. At this point, only the updater has a reference to the newly removed
+ data element, so it can safely reclaim the data element, for example,
+ by passing it to ``kfree()``.
+
+This process is implemented by ``remove_gp_synchronous()``:
+
+ ::
+
+ 1 bool remove_gp_synchronous(void)
+ 2 {
+ 3 struct foo *p;
+ 4
+ 5 spin_lock(&gp_lock);
+ 6 p = rcu_access_pointer(gp);
+ 7 if (!p) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 rcu_assign_pointer(gp, NULL);
+ 12 spin_unlock(&gp_lock);
+ 13 synchronize_rcu();
+ 14 kfree(p);
+ 15 return true;
+ 16 }
+
+This function is straightforward, with line 13 waiting for a grace
+period before line 14 frees the old data element. This waiting ensures
+that readers will reach line 7 of ``do_something_gp()`` before the data
+element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on
+line 6 is similar to ``rcu_dereference()``, except that:
+
+#. The value returned by ``rcu_access_pointer()`` cannot be
+ dereferenced. If you want to access the value pointed to as well as
+ the pointer itself, use ``rcu_dereference()`` instead of
+ ``rcu_access_pointer()``.
+#. The call to ``rcu_access_pointer()`` need not be protected. In
+ contrast, ``rcu_dereference()`` must either be within an RCU
+ read-side critical section or in a code segment where the pointer
+ cannot change, for example, in code protected by the corresponding
+ update-side lock.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``, |
+| what destructive optimizations might the compiler make use of? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Let's start with what happens to ``do_something_gp()`` if it fails to |
+| use ``rcu_dereference()``. It could reuse a value formerly fetched |
+| from this same pointer. It could also fetch the pointer from ``gp`` |
+| in a byte-at-a-time manner, resulting in *load tearing*, in turn |
+| resulting in a bytewise mash-up of two distinct pointer values. It    |
+| might even use value-speculation optimizations, where it makes a      |
+| wrong guess, but by the time it gets around to checking the value,    |
+| an update has changed the pointer to match the wrong guess. Too bad   |
+| about any dereferences that returned pre-initialization garbage in |
+| the meantime! |
+| For ``remove_gp_synchronous()``, as long as all modifications to |
+| ``gp`` are carried out while holding ``gp_lock``, the above |
+| optimizations are harmless. However, ``sparse`` will complain if you |
+| define ``gp`` with ``__rcu`` and then access it without using either |
+| ``rcu_access_pointer()`` or ``rcu_dereference()``. |
++-----------------------------------------------------------------------+
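+
+For the record, the ``__rcu`` annotation mentioned in the answer above
+is applied to the pointer declaration, as in this sketch:
+
+   ::
+
+      /* Lets sparse flag any access to gp that bypasses
+         rcu_access_pointer(), rcu_dereference(), and friends. */
+      struct foo __rcu *gp;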
+
+In short, RCU's publish-subscribe guarantee is provided by the
+combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This
+guarantee allows data elements to be safely added to RCU-protected
+linked data structures without disrupting RCU readers. This guarantee
+can be used in combination with the grace-period guarantee to also allow
+data elements to be removed from RCU-protected linked data structures,
+again without disrupting RCU readers.
+
+This guarantee was only partially premeditated. DYNIX/ptx used an
+explicit memory barrier for publication, but had nothing resembling
+``rcu_dereference()`` for subscription, nor did it have anything
+resembling the ``smp_read_barrier_depends()`` that was later subsumed
+into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The
+need for these operations made itself known quite suddenly at a
+late-1990s meeting with the DEC Alpha architects, back in the days when
+DEC was still a free-standing company. It took the Alpha architects a
+good hour to convince me that any sort of barrier would ever be needed,
+and it then took me a good *two* hours to convince them that their
+documentation did not make this point clear. More recent work with the C
+and C++ standards committees has provided much education on tricks and
+traps from the compiler. In short, compilers were much less tricky in
+the early 1990s, but in 2015, don't even think about omitting
+``rcu_dereference()``!
+
+Memory-Barrier Guarantees
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The previous section's simple linked-data-structure scenario clearly
+demonstrates the need for RCU's stringent memory-ordering guarantees on
+systems with more than one CPU:
+
+#. Each CPU that has an RCU read-side critical section that begins
+ before ``synchronize_rcu()`` starts is guaranteed to execute a full
+ memory barrier between the time that the RCU read-side critical
+ section ends and the time that ``synchronize_rcu()`` returns. Without
+ this guarantee, a pre-existing RCU read-side critical section might
+ hold a reference to the newly removed ``struct foo`` after the
+ ``kfree()`` on line 14 of ``remove_gp_synchronous()``.
+#. Each CPU that has an RCU read-side critical section that ends after
+ ``synchronize_rcu()`` returns is guaranteed to execute a full memory
+ barrier between the time that ``synchronize_rcu()`` begins and the
+ time that the RCU read-side critical section begins. Without this
+ guarantee, a later RCU read-side critical section running after the
+ ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run
+ ``do_something_gp()`` and find the newly deleted ``struct foo``.
+#. If the task invoking ``synchronize_rcu()`` remains on a given CPU,
+ then that CPU is guaranteed to execute a full memory barrier sometime
+ during the execution of ``synchronize_rcu()``. This guarantee ensures
+ that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really
+ does execute after the removal on line 11.
+#. If the task invoking ``synchronize_rcu()`` migrates among a group of
+ CPUs during that invocation, then each of the CPUs in that group is
+ guaranteed to execute a full memory barrier sometime during the
+ execution of ``synchronize_rcu()``. This guarantee also ensures that
+ the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does
+   execute after the removal on line 11, even in the case where the
+ thread executing the ``synchronize_rcu()`` migrates in the meantime.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Given that multiple CPUs can start RCU read-side critical sections at |
+| any time without any ordering whatsoever, how can RCU possibly tell |
+| whether or not a given RCU read-side critical section starts before a |
+| given instance of ``synchronize_rcu()``? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| If RCU cannot tell whether or not a given RCU read-side critical |
+| section starts before a given instance of ``synchronize_rcu()``, then |
+| it must assume that the RCU read-side critical section started first. |
+| In other words, a given instance of ``synchronize_rcu()`` can avoid |
+| waiting on a given RCU read-side critical section only if it can |
+| prove that ``synchronize_rcu()`` started first. |
+| A related question is “When ``rcu_read_lock()`` doesn't generate any |
+| code, why does it matter how it relates to a grace period?” The |
+| answer is that it is not the relationship of ``rcu_read_lock()`` |
+| itself that is important, but rather the relationship of the code |
+| within the enclosed RCU read-side critical section to the code |
+| preceding and following the grace period. If we take this viewpoint, |
+| then a given RCU read-side critical section begins before a given |
+| grace period when some access preceding the grace period observes the |
+| effect of some access within the critical section, in which case none |
+| of the accesses within the critical section may observe the effects |
+| of any access following the grace period. |
+| |
+| As of late 2016, mathematical models of RCU take this viewpoint, for |
+| example, see slides 62 and 63 of the `2016 LinuxCon |
+| EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 |
+| 6.10.04c.LCE.pdf>`__ |
+| presentation. |
++-----------------------------------------------------------------------+
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| The first and second guarantees require unbelievably strict ordering! |
+| Are all these memory barriers *really* required? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Yes, they really are required. To see why the first guarantee is |
+| required, consider the following sequence of events: |
+| |
+| #. CPU 1: ``rcu_read_lock()`` |
+| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` |
+| #. CPU 0: ``list_del_rcu(p);`` |
+| #. CPU 0: ``synchronize_rcu()`` starts. |
+| #. CPU 1: ``do_something_with(q->a);`` |
+| ``/* No smp_mb(), so might happen after kfree(). */`` |
+| #. CPU 1: ``rcu_read_unlock()`` |
+| #. CPU 0: ``synchronize_rcu()`` returns. |
+| #. CPU 0: ``kfree(p);`` |
+| |
+| Therefore, there absolutely must be a full memory barrier between the |
+| end of the RCU read-side critical section and the end of the grace |
+| period. |
+| |
+| The sequence of events demonstrating the necessity of the second rule |
+| is roughly similar: |
+| |
+| #. CPU 0: ``list_del_rcu(p);`` |
+| #. CPU 0: ``synchronize_rcu()`` starts. |
+| #. CPU 1: ``rcu_read_lock()`` |
+| #. CPU 1: ``q = rcu_dereference(gp);`` |
+| ``/* Might return p if no memory barrier. */`` |
+| #. CPU 0: ``synchronize_rcu()`` returns. |
+| #. CPU 0: ``kfree(p);`` |
+| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` |
+| #. CPU 1: ``rcu_read_unlock()`` |
+| |
+| And similarly, without a memory barrier between the beginning of the |
+| grace period and the beginning of the RCU read-side critical section, |
+| CPU 1 might end up accessing the freelist. |
+| |
+| The “as if” rule of course applies, so that any implementation that |
+| acts as if the appropriate memory barriers were in place is a correct |
+| implementation. That said, it is much easier to fool yourself into |
+| believing that you have adhered to the as-if rule than it is to |
+| actually adhere to it! |
++-----------------------------------------------------------------------+
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
+| absolutely no code in some kernel builds. This means that the |
+| compiler might arbitrarily rearrange consecutive RCU read-side |
+| critical sections. Given such rearrangement, if a given RCU read-side |
+| critical section is done, how can you be sure that all prior RCU |
+| read-side critical sections are done? Won't the compiler |
+| rearrangements make that impossible to determine? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
+| absolutely no code, RCU infers quiescent states only at special |
+| locations, for example, within the scheduler. Because calls to |
+| ``schedule()`` had better prevent calling-code accesses to shared |
+| variables from being rearranged across the call to ``schedule()``, if |
+| RCU detects the end of a given RCU read-side critical section, it |
+| will necessarily detect the end of all prior RCU read-side critical |
+| sections, no matter how aggressively the compiler scrambles the code. |
+| Again, this all assumes that the compiler cannot scramble code across |
+| calls to the scheduler, out of interrupt handlers, into the idle |
+| loop, into user-mode code, and so on. But if your kernel build allows |
+| that sort of scrambling, you have broken far more than just RCU! |
++-----------------------------------------------------------------------+
+
+Note that these memory-barrier requirements do not replace the
+fundamental RCU requirement that a grace period wait for all
+pre-existing readers. On the contrary, the memory barriers called out in
+this section must operate in such a way as to *enforce* this fundamental
+requirement. Of course, different implementations enforce this
+requirement in different ways, but enforce it they must.
+
+RCU Primitives Guaranteed to Execute Unconditionally
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The common-case RCU primitives are unconditional. They are invoked, they
+do their job, and they return, with no possibility of error, and no need
+to retry. This is a key RCU design philosophy.
+
+However, this philosophy is pragmatic rather than pigheaded. If someone
+comes up with a good justification for a particular conditional RCU
+primitive, it might well be implemented and added. After all, this
+guarantee was reverse-engineered, not premeditated. The unconditional
+nature of the RCU primitives was initially an accident of
+implementation, and later experience with conditional synchronization
+primitives caused me to elevate this accident to a
+guarantee. Therefore, the justification for adding a conditional
+primitive to RCU would need to be based on detailed and compelling use
+cases.
+
+Guaranteed Read-to-Write Upgrade
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As far as RCU is concerned, it is always possible to carry out an update
+within an RCU read-side critical section. For example, that RCU
+read-side critical section might search for a given data element, and
+then might acquire the update-side spinlock in order to update that
+element, all while remaining in that RCU read-side critical section. Of
+course, it is necessary to exit the RCU read-side critical section
+before invoking ``synchronize_rcu()``; however, this inconvenience can
+be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API
+members described later in this document.
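+
+As a concrete illustration, here is a sketch of such an upgrade,
+assuming a hypothetical RCU-protected list of ``foo`` structures headed
+by ``foo_head`` and updated under ``foo_lock``:
+
+   ::
+
+      void upgrade_and_update(int key, int new_b)
+      {
+        struct foo *p;
+
+        rcu_read_lock();
+        list_for_each_entry_rcu(p, &foo_head, list) {
+          if (p->key == key) {
+            /* Upgrade to writer while still within the critical
+               section.  A production version would recheck under
+               foo_lock that p had not been removed in the meantime. */
+            spin_lock(&foo_lock);
+            WRITE_ONCE(p->b, new_b); /* In-place update. */
+            spin_unlock(&foo_lock);
+            break;
+          }
+        }
+        rcu_read_unlock(); /* Exit before any synchronize_rcu()! */
+      }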
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But how does the upgrade-to-write operation exclude other readers? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| It doesn't, just like normal RCU updates, which also do not exclude |
+| RCU readers. |
++-----------------------------------------------------------------------+
+
+This guarantee allows lookup code to be shared between read-side and
+update-side code, and was premeditated, appearing in the earliest
+DYNIX/ptx RCU documentation.
+
+Fundamental Non-Requirements
+----------------------------
+
+RCU provides extremely lightweight readers, and its read-side
+guarantees, though quite useful, are correspondingly lightweight. It is
+therefore all too easy to assume that RCU is guaranteeing more than it
+really is. Of course, the list of things that RCU does not guarantee is
+infinitely long; however, the following sections list a few
+non-guarantees that have caused confusion. Except where otherwise noted,
+these non-guarantees were premeditated.
+
+#. `Readers Impose Minimal Ordering`_
+#. `Readers Do Not Exclude Updaters`_
+#. `Updaters Only Wait For Old Readers`_
+#. `Grace Periods Don't Partition Read-Side Critical Sections`_
+#. `Read-Side Critical Sections Don't Partition Grace Periods`_
+
+Readers Impose Minimal Ordering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Reader-side markers such as ``rcu_read_lock()`` and
+``rcu_read_unlock()`` provide absolutely no ordering guarantees except
+through their interaction with the grace-period APIs such as
+``synchronize_rcu()``. To see this, consider the following pair of
+threads:
+
+ ::
+
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(x, 1);
+ 5 rcu_read_unlock();
+ 6 rcu_read_lock();
+ 7 WRITE_ONCE(y, 1);
+ 8 rcu_read_unlock();
+ 9 }
+ 10
+ 11 void thread1(void)
+ 12 {
+ 13 rcu_read_lock();
+ 14 r1 = READ_ONCE(y);
+ 15 rcu_read_unlock();
+ 16 rcu_read_lock();
+ 17 r2 = READ_ONCE(x);
+ 18 rcu_read_unlock();
+ 19 }
+
+After ``thread0()`` and ``thread1()`` execute concurrently, it is quite
+possible to have
+
+ ::
+
+ (r1 == 1 && r2 == 0)
+
+(that is, ``y`` appears to have been assigned before ``x``), which would
+not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had
+much in the way of ordering properties. But they do not, so the CPU is
+within its rights to do significant reordering. This is by design: Any
+significant ordering constraints would slow down these fast-path APIs.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Can't the compiler also reorder this code? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()`` |
+| prevent the compiler from reordering in this particular case. |
++-----------------------------------------------------------------------+
+
+Readers Do Not Exclude Updaters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` exclude updates.
+All they do is to prevent grace periods from ending. The following
+example illustrates this:
+
+ ::
+
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 r1 = READ_ONCE(y);
+ 5 if (r1) {
+ 6 do_something_with_nonzero_x();
+ 7 r2 = READ_ONCE(x);
+ 8 WARN_ON(!r2); /* BUG!!! */
+ 9 }
+ 10 rcu_read_unlock();
+ 11 }
+ 12
+ 13 void thread1(void)
+ 14 {
+ 15 spin_lock(&my_lock);
+ 16 WRITE_ONCE(x, 1);
+ 17 WRITE_ONCE(y, 1);
+ 18 spin_unlock(&my_lock);
+ 19 }
+
+If the ``thread0()`` function's ``rcu_read_lock()`` excluded the
+``thread1()`` function's update, the ``WARN_ON()`` could never fire. But
+the fact is that ``rcu_read_lock()`` does not exclude much of anything
+aside from subsequent grace periods, of which ``thread1()`` has none, so
+the ``WARN_ON()`` can and does fire.
+
+Updaters Only Wait For Old Readers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It might be tempting to assume that after ``synchronize_rcu()``
+completes, there are no readers executing. This temptation must be
+avoided because new readers can start immediately after
+``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no
+obligation to wait for these new readers.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Suppose that ``synchronize_rcu()`` did wait until *all* readers had   |
+| completed instead of waiting only on pre-existing readers. For how |
+| long would the updater be able to rely on there being no readers? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| For no time at all. Even if ``synchronize_rcu()`` were to wait until |
+| all readers had completed, a new reader might start immediately after |
+| ``synchronize_rcu()`` completed. Therefore, the code following |
+| ``synchronize_rcu()`` can *never* rely on there being no readers. |
++-----------------------------------------------------------------------+
+
+Grace Periods Don't Partition Read-Side Critical Sections
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is tempting to assume that if any part of one RCU read-side critical
+section precedes a given grace period, and if any part of another RCU
+read-side critical section follows that same grace period, then all of
+the first RCU read-side critical section must precede all of the second.
+However, this just isn't the case: A single grace period does not
+partition the set of RCU read-side critical sections. An example of this
+situation can be illustrated as follows, where ``a``, ``b``, and ``c``
+are initially all zero:
+
+ ::
+
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(a, 1);
+ 5 WRITE_ONCE(b, 1);
+ 6 rcu_read_unlock();
+ 7 }
+ 8
+ 9 void thread1(void)
+ 10 {
+ 11 r1 = READ_ONCE(a);
+ 12 synchronize_rcu();
+ 13 WRITE_ONCE(c, 1);
+ 14 }
+ 15
+ 16 void thread2(void)
+ 17 {
+ 18 rcu_read_lock();
+ 19 r2 = READ_ONCE(b);
+ 20 r3 = READ_ONCE(c);
+ 21 rcu_read_unlock();
+ 22 }
+
+It turns out that the outcome:
+
+ ::
+
+      (r1 == 1 && r2 == 0 && r3 == 1)