diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
new file mode 100644
index 000000000000..fd5e2cbc4935
--- /dev/null
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -0,0 +1,2704 @@
+=================================
+A Tour Through RCU's Requirements
+=================================
+
+Copyright IBM Corporation, 2015
+
+Author: Paul E. McKenney
+
+The initial version of this document appeared on
+`LWN <https://lwn.net/>`_ in the following articles:
+`part 1 <https://lwn.net/Articles/652156/>`_,
+`part 2 <https://lwn.net/Articles/652677/>`_, and
+`part 3 <https://lwn.net/Articles/653326/>`_.
+
+Introduction
+------------
+
+Read-copy update (RCU) is a synchronization mechanism that is often used
+as a replacement for reader-writer locking. RCU is unusual in that
+updaters do not block readers, which means that RCU's read-side
+primitives can be exceedingly fast and scalable. In addition, updaters
+can make useful forward progress concurrently with readers. However, all
+this concurrency between RCU readers and updaters does raise the
+question of exactly what RCU readers are doing, which in turn raises the
+question of exactly what RCU's requirements are.
+
+This document therefore summarizes RCU's requirements, and can be
+thought of as an informal, high-level specification for RCU. It is
+important to understand that RCU's specification is primarily empirical
+in nature; in fact, I learned about many of these requirements the hard
+way. This situation might cause some consternation. However, not only
+has this learning process been a lot of fun, but it has also been a
+great privilege to work with so many people willing to apply
+technologies in interesting new ways.
+
+All that aside, here are the categories of currently known RCU
+requirements:
+
+#. `Fundamental Requirements`_
+#. `Fundamental Non-Requirements`_
+#. `Parallelism Facts of Life`_
+#. `Quality-of-Implementation Requirements`_
+#. `Linux Kernel Complications`_
+#. `Software-Engineering Requirements`_
+#. `Other RCU Flavors`_
+#. `Possible Future Changes`_
+
+This is followed by a `summary <#Summary>`__. Note that the answer to
+each quick quiz immediately follows the quiz itself.
+
+Fundamental Requirements
+------------------------
+
+RCU's fundamental requirements are the closest thing RCU has to hard
+mathematical requirements. These are:
+
+#. `Grace-Period Guarantee`_
+#. `Publish/Subscribe Guarantee`_
+#. `Memory-Barrier Guarantees`_
+#. `RCU Primitives Guaranteed to Execute Unconditionally`_
+#. `Guaranteed Read-to-Write Upgrade`_
+
+Grace-Period Guarantee
+~~~~~~~~~~~~~~~~~~~~~~
+
+RCU's grace-period guarantee is unusual in being premeditated: Jack
+Slingwine and I had this guarantee firmly in mind when we started work
+on RCU (then called “rclock”) in the early 1990s. That said, the past
+two decades of experience with RCU have produced a much more detailed
+understanding of this guarantee.
+
+RCU's grace-period guarantee allows updaters to wait for the completion
+of all pre-existing RCU read-side critical sections. An RCU read-side
+critical section begins with the marker ``rcu_read_lock()`` and ends
+with the marker ``rcu_read_unlock()``. These markers may be nested, and
+RCU treats a nested set as one big RCU read-side critical section.
+Production-quality implementations of ``rcu_read_lock()`` and
+``rcu_read_unlock()`` are extremely lightweight, and in fact have
+exactly zero overhead in Linux kernels built for production use with
+``CONFIG_PREEMPT=n``.
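+
+For illustration, in ``CONFIG_PREEMPT=n`` builds the markers can be
+thought of as follows. This is a conceptual sketch only, not the actual
+kernel implementation, which must also interact with lockdep and
+preemption:
+
+   ::
+
+      /* Conceptual sketch: in CONFIG_PREEMPT=n builds, the read-side
+         markers emit no instructions, acting only as compiler barriers. */
+      static inline void sketch_rcu_read_lock(void)
+      {
+        barrier();
+      }
+
+      static inline void sketch_rcu_read_unlock(void)
+      {
+        barrier();
+      }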
+
+The grace-period guarantee allows ordering to be enforced with extremely
+low overhead to readers, for example:
+
+ ::
+
+ 1 int x, y;
+ 2
+ 3 void thread0(void)
+ 4 {
+ 5 rcu_read_lock();
+ 6 r1 = READ_ONCE(x);
+ 7 r2 = READ_ONCE(y);
+ 8 rcu_read_unlock();
+ 9 }
+ 10
+ 11 void thread1(void)
+ 12 {
+ 13 WRITE_ONCE(x, 1);
+ 14 synchronize_rcu();
+ 15 WRITE_ONCE(y, 1);
+ 16 }
+
+Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing
+readers, any instance of ``thread0()`` that loads a value of zero from
+``x`` must complete before ``thread1()`` stores to ``y``, so that
+instance must also load a value of zero from ``y``. Similarly, any
+instance of ``thread0()`` that loads a value of one from ``y`` must have
+started after the ``synchronize_rcu()`` started, and must therefore also
+load a value of one from ``x``. Therefore, the outcome:
+
+ ::
+
+ (r1 == 0 && r2 == 1)
+
+cannot happen.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Wait a minute! You said that updaters can make useful forward |
+| progress concurrently with readers, but pre-existing readers will |
+| block ``synchronize_rcu()``!!! |
+| Just who are you trying to fool??? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| First, if updaters do not wish to be blocked by readers, they can use |
+| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later. |
+| Second, even when using ``synchronize_rcu()``, the other update-side |
+| code does run concurrently with readers, whether pre-existing or not. |
++-----------------------------------------------------------------------+
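+
+As a preview of that later discussion, here is a minimal sketch of the
+non-blocking alternative: ``call_rcu()`` invokes a callback after a
+grace period elapses, so the updater never waits. The structure, field,
+and function names below are hypothetical:
+
+   ::
+
+      struct foo {
+        int a;
+        struct rcu_head rh;
+      };
+
+      static void free_foo_cb(struct rcu_head *rhp)
+      {
+        struct foo *fp = container_of(rhp, struct foo, rh);
+
+        kfree(fp); /* Runs only after all pre-existing readers finish. */
+      }
+
+      void retire_foo(struct foo *fp)
+      {
+        /* fp must already be unreachable by new readers. */
+        call_rcu(&fp->rh, free_foo_cb); /* Deferred free, never blocks. */
+        /* For simple kfree() cases, kfree_rcu(fp, rh) avoids the need
+           for a separate callback function. */
+      }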
+
+The earlier ``thread0()``/``thread1()`` scenario resembles one of the
+first uses of RCU in
+`DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a
+distributed lock manager's transition into a state suitable for handling
+recovery from node failure, more or less as follows:
+
+ ::
+
+ 1 #define STATE_NORMAL 0
+ 2 #define STATE_WANT_RECOVERY 1
+ 3 #define STATE_RECOVERING 2
+ 4 #define STATE_WANT_NORMAL 3
+ 5
+ 6 int state = STATE_NORMAL;
+ 7
+ 8 void do_something_dlm(void)
+ 9 {
+ 10 int state_snap;
+ 11
+ 12 rcu_read_lock();
+ 13 state_snap = READ_ONCE(state);
+ 14 if (state_snap == STATE_NORMAL)
+ 15 do_something();
+ 16 else
+ 17 do_something_carefully();
+ 18 rcu_read_unlock();
+ 19 }
+ 20
+ 21 void start_recovery(void)
+ 22 {
+ 23 WRITE_ONCE(state, STATE_WANT_RECOVERY);
+ 24 synchronize_rcu();
+ 25 WRITE_ONCE(state, STATE_RECOVERING);
+ 26 recovery();
+ 27 WRITE_ONCE(state, STATE_WANT_NORMAL);
+ 28 synchronize_rcu();
+ 29 WRITE_ONCE(state, STATE_NORMAL);
+ 30 }
+
+The RCU read-side critical section in ``do_something_dlm()`` works with
+the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that
+``do_something()`` never runs concurrently with ``recovery()``, but with
+little or no synchronization overhead in ``do_something_dlm()``.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Why is the ``synchronize_rcu()`` on line 28 needed? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Without that extra grace period, memory reordering could result in |
+| ``do_something_dlm()`` executing ``do_something()`` concurrently with |
+| the last bits of ``recovery()``. |
++-----------------------------------------------------------------------+
+
+In order to avoid fatal problems such as deadlocks, an RCU read-side
+critical section must not contain calls to ``synchronize_rcu()``.
+Similarly, an RCU read-side critical section must not contain anything
+that waits, directly or indirectly, on completion of an invocation of
+``synchronize_rcu()``.
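+
+For example, the following sketch (hypothetical function name)
+self-deadlocks, because the grace period cannot complete until the
+enclosing read-side critical section ends:
+
+   ::
+
+      void buggy_reader(void)
+      {
+        rcu_read_lock();
+        synchronize_rcu(); /* BUG: Waits for itself, never returns. */
+        rcu_read_unlock();
+      }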
+
+Although RCU's grace-period guarantee is useful in and of itself, with
+`quite a few use cases <https://lwn.net/Articles/573497/>`__, it would
+be good to be able to use RCU to coordinate read-side access to linked
+data structures. For this, the grace-period guarantee is not sufficient,
+as can be seen in function ``add_gp_buggy()`` below. We will look at the
+reader's code later, but in the meantime, just think of the reader as
+locklessly picking up the ``gp`` pointer, and, if the value loaded is
+non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.
+
+ ::
+
+ 1 bool add_gp_buggy(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+       5     return false;
+ 6 spin_lock(&gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 p->a = a;
+      12   p->b = b;
+ 13 gp = p; /* ORDERING BUG */
+ 14 spin_unlock(&gp_lock);
+ 15 return true;
+ 16 }
+
+The problem is that both the compiler and weakly ordered CPUs are within
+their rights to reorder this code as follows:
+
+ ::
+
+ 1 bool add_gp_buggy_optimized(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+       5     return false;
+ 6 spin_lock(&gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 gp = p; /* ORDERING BUG */
+ 12 p->a = a;
+      13   p->b = b;
+ 14 spin_unlock(&gp_lock);
+ 15 return true;
+ 16 }
+
+If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized``
+executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
+And this is but one of many ways in which compiler and hardware
+optimizations could cause trouble. Therefore, we clearly need some way
+to prevent the compiler and the CPU from reordering in this manner,
+which brings us to the publish-subscribe guarantee discussed in the next
+section.
+
+Publish/Subscribe Guarantee
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RCU's publish-subscribe guarantee allows data to be inserted into a
+linked data structure without disrupting RCU readers. The updater uses
+``rcu_assign_pointer()`` to insert the new data, and readers use
+``rcu_dereference()`` to access data, whether new or old. The following
+shows an example of insertion:
+
+ ::
+
+ 1 bool add_gp(int a, int b)
+ 2 {
+ 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
+ 4 if (!p)
+       5     return false;
+ 6 spin_lock(&gp_lock);
+ 7 if (rcu_access_pointer(gp)) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 p->a = a;
+      12   p->b = b;
+ 13 rcu_assign_pointer(gp, p);
+ 14 spin_unlock(&gp_lock);
+ 15 return true;
+ 16 }
+
+The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a
+simple assignment statement, but also guarantees that its assignment
+will happen after the two assignments in lines 11 and 12, similar to the
+C11 ``memory_order_release`` store operation. It also prevents any
+number of “interesting” compiler optimizations, for example, the use of
+``gp`` as a scratch location immediately preceding the assignment.
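+
+On the update side, ``rcu_assign_pointer()`` can thus be thought of as a
+release store. The following one-liner is a conceptual sketch only; the
+real macro also performs ``sparse`` checking and special-cases constant
+``NULL`` pointers:
+
+   ::
+
+      /* Conceptual sketch only, not the real macro. */
+      #define sketch_rcu_assign_pointer(p, v) smp_store_release(&(p), (v))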
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But ``rcu_assign_pointer()`` does nothing to prevent the two |
+| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
+| also cause problems? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| No, it cannot. The readers cannot see either of these two fields |
+| until the assignment to ``gp``, by which time both fields are fully |
+| initialized. So reordering the assignments to ``p->a`` and ``p->b`` |
+| cannot possibly cause any problems. |
++-----------------------------------------------------------------------+
+
+It is tempting to assume that the reader need not do anything special to
+control its accesses to the RCU-protected data, as shown in
+``do_something_gp_buggy()`` below:
+
+ ::
+
+ 1 bool do_something_gp_buggy(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 p = gp; /* OPTIMIZATIONS GALORE!!! */
+ 5 if (p) {
+ 6 do_something(p->a, p->b);
+ 7 rcu_read_unlock();
+ 8 return true;
+ 9 }
+ 10 rcu_read_unlock();
+ 11 return false;
+ 12 }
+
+However, this temptation must be resisted because there are a
+surprisingly large number of ways that the compiler (to say nothing of
+`DEC Alpha CPUs <https://h71000.www7.hp.com/wizard/wiz_2637.html>`__)
+can trip this code up. For but one example, if the compiler were short
+of registers, it might choose to refetch from ``gp`` rather than keeping
+a separate copy in ``p`` as follows:
+
+ ::
+
+ 1 bool do_something_gp_buggy_optimized(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */
+ 5 do_something(gp->a, gp->b);
+ 6 rcu_read_unlock();
+ 7 return true;
+ 8 }
+ 9 rcu_read_unlock();
+ 10 return false;
+ 11 }
+
+If this function ran concurrently with a series of updates that replaced
+the current structure with a new one, the fetches of ``gp->a`` and
+``gp->b`` might well come from two different structures, which could
+cause serious confusion. To prevent this (and much else besides),
+``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``:
+
+ ::
+
+ 1 bool do_something_gp(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 p = rcu_dereference(gp);
+ 5 if (p) {
+ 6 do_something(p->a, p->b);
+ 7 rcu_read_unlock();
+ 8 return true;
+ 9 }
+ 10 rcu_read_unlock();
+ 11 return false;
+ 12 }
+
+In the Linux kernel, ``rcu_dereference()`` uses volatile casts and (on
+DEC Alpha) memory barriers. Should a `high-quality implementation of
+C11 memory_order_consume [PDF]
+<http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__
+ever appear, then ``rcu_dereference()`` could be implemented as a
+``memory_order_consume`` load. Regardless of the exact implementation, a
+pointer fetched by ``rcu_dereference()`` may not be used outside of the
+outermost RCU read-side critical section containing that
+``rcu_dereference()``, unless protection of the corresponding data
+element has been passed from RCU to some other synchronization
+mechanism, most commonly locking or `reference
+counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__.
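+
+One common way of passing protection to a reference count is sketched
+below, assuming a hypothetical ``atomic_t refcnt`` field in
+``struct foo``:
+
+   ::
+
+      struct foo *get_gp_ref(void)
+      {
+        struct foo *p;
+
+        rcu_read_lock();
+        p = rcu_dereference(gp);
+        if (p && !atomic_inc_not_zero(&p->refcnt))
+          p = NULL; /* Object is already being torn down. */
+        rcu_read_unlock();
+        return p; /* If non-NULL, now protected by the reference count. */
+      }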
+
+In short, updaters use ``rcu_assign_pointer()`` and readers use
+``rcu_dereference()``, and these two RCU API elements work together to
+ensure that readers have a consistent view of newly added data elements.
+
+Of course, it is also necessary to remove elements from RCU-protected
+data structures, for example, using the following process:
+
+#. Remove the data element from the enclosing structure.
+#. Wait for all pre-existing RCU read-side critical sections to complete
+ (because only pre-existing readers can possibly have a reference to
+ the newly removed data element).
+#. At this point, only the updater has a reference to the newly removed
+ data element, so it can safely reclaim the data element, for example,
+ by passing it to ``kfree()``.
+
+This process is implemented by ``remove_gp_synchronous()``:
+
+ ::
+
+ 1 bool remove_gp_synchronous(void)
+ 2 {
+ 3 struct foo *p;
+ 4
+ 5 spin_lock(&gp_lock);
+ 6 p = rcu_access_pointer(gp);
+ 7 if (!p) {
+ 8 spin_unlock(&gp_lock);
+ 9 return false;
+ 10 }
+ 11 rcu_assign_pointer(gp, NULL);
+ 12 spin_unlock(&gp_lock);
+ 13 synchronize_rcu();
+ 14 kfree(p);
+ 15 return true;
+ 16 }
+
+This function is straightforward, with line 13 waiting for a grace
+period before line 14 frees the old data element. This waiting ensures
+that readers will reach line 7 of ``do_something_gp()`` before the data
+element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on
+line 6 is similar to ``rcu_dereference()``, except that:
+
+#. The value returned by ``rcu_access_pointer()`` cannot be
+ dereferenced. If you want to access the value pointed to as well as
+ the pointer itself, use ``rcu_dereference()`` instead of
+ ``rcu_access_pointer()``.
+#. The call to ``rcu_access_pointer()`` need not be protected. In
+ contrast, ``rcu_dereference()`` must either be within an RCU
+ read-side critical section or in a code segment where the pointer
+ cannot change, for example, in code protected by the corresponding
+ update-side lock.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``, |
+| what destructive optimizations might the compiler make use of? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Let's start with what happens to ``do_something_gp()`` if it fails to |
+| use ``rcu_dereference()``. It could reuse a value formerly fetched |
+| from this same pointer. It could also fetch the pointer from ``gp`` |
+| in a byte-at-a-time manner, resulting in *load tearing*, in turn |
+| resulting in a bytewise mash-up of two distinct pointer values. It    |
+| might even use value-speculation optimizations, where it makes a      |
+| wrong guess, but by the time it gets around to checking the value,    |
+| an update has changed the pointer to match the wrong guess. Too bad   |
+| about any dereferences that returned pre-initialization garbage in |
+| the meantime! |
+| For ``remove_gp_synchronous()``, as long as all modifications to |
+| ``gp`` are carried out while holding ``gp_lock``, the above |
+| optimizations are harmless. However, ``sparse`` will complain if you |
+| define ``gp`` with ``__rcu`` and then access it without using either |
+| ``rcu_access_pointer()`` or ``rcu_dereference()``. |
++-----------------------------------------------------------------------+
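+
+For the record, the ``__rcu`` annotation mentioned in the answer above
+is applied to the pointer declaration, as in this sketch:
+
+   ::
+
+      /* Lets sparse flag any access to gp that bypasses
+         rcu_access_pointer(), rcu_dereference(), and friends. */
+      struct foo __rcu *gp;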
+
+In short, RCU's publish-subscribe guarantee is provided by the
+combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This
+guarantee allows data elements to be safely added to RCU-protected
+linked data structures without disrupting RCU readers. This guarantee
+can be used in combination with the grace-period guarantee to also allow
+data elements to be removed from RCU-protected linked data structures,
+again without disrupting RCU readers.
+
+This guarantee was only partially premeditated. DYNIX/ptx used an
+explicit memory barrier for publication, but had nothing resembling
+``rcu_dereference()`` for subscription, nor did it have anything
+resembling the ``smp_read_barrier_depends()`` that was later subsumed
+into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The
+need for these operations made itself known quite suddenly at a
+late-1990s meeting with the DEC Alpha architects, back in the days when
+DEC was still a free-standing company. It took the Alpha architects a
+good hour to convince me that any sort of barrier would ever be needed,
+and it then took me a good *two* hours to convince them that their
+documentation did not make this point clear. More recent work with the C
+and C++ standards committees has provided much education on tricks and
+traps from the compiler. In short, compilers were much less tricky in
+the early 1990s, but in 2015, don't even think about omitting
+``rcu_dereference()``!
+
+Memory-Barrier Guarantees
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The previous section's simple linked-data-structure scenario clearly
+demonstrates the need for RCU's stringent memory-ordering guarantees on
+systems with more than one CPU:
+
+#. Each CPU that has an RCU read-side critical section that begins
+ before ``synchronize_rcu()`` starts is guaranteed to execute a full
+ memory barrier between the time that the RCU read-side critical
+ section ends and the time that ``synchronize_rcu()`` returns. Without
+ this guarantee, a pre-existing RCU read-side critical section might
+ hold a reference to the newly removed ``struct foo`` after the
+ ``kfree()`` on line 14 of ``remove_gp_synchronous()``.
+#. Each CPU that has an RCU read-side critical section that ends after
+ ``synchronize_rcu()`` returns is guaranteed to execute a full memory
+ barrier between the time that ``synchronize_rcu()`` begins and the
+ time that the RCU read-side critical section begins. Without this
+ guarantee, a later RCU read-side critical section running after the
+ ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run
+ ``do_something_gp()`` and find the newly deleted ``struct foo``.
+#. If the task invoking ``synchronize_rcu()`` remains on a given CPU,
+ then that CPU is guaranteed to execute a full memory barrier sometime
+ during the execution of ``synchronize_rcu()``. This guarantee ensures
+ that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really
+ does execute after the removal on line 11.
+#. If the task invoking ``synchronize_rcu()`` migrates among a group of
+ CPUs during that invocation, then each of the CPUs in that group is
+ guaranteed to execute a full memory barrier sometime during the
+ execution of ``synchronize_rcu()``. This guarantee also ensures that
+ the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does
+   execute after the removal on line 11, even in the case where the
+ thread executing the ``synchronize_rcu()`` migrates in the meantime.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Given that multiple CPUs can start RCU read-side critical sections at |
+| any time without any ordering whatsoever, how can RCU possibly tell |
+| whether or not a given RCU read-side critical section starts before a |
+| given instance of ``synchronize_rcu()``? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| If RCU cannot tell whether or not a given RCU read-side critical |
+| section starts before a given instance of ``synchronize_rcu()``, then |
+| it must assume that the RCU read-side critical section started first. |
+| In other words, a given instance of ``synchronize_rcu()`` can avoid |
+| waiting on a given RCU read-side critical section only if it can |
+| prove that ``synchronize_rcu()`` started first. |
+| A related question is “When ``rcu_read_lock()`` doesn't generate any |
+| code, why does it matter how it relates to a grace period?” The |
+| answer is that it is not the relationship of ``rcu_read_lock()`` |
+| itself that is important, but rather the relationship of the code |
+| within the enclosed RCU read-side critical section to the code |
+| preceding and following the grace period. If we take this viewpoint, |
+| then a given RCU read-side critical section begins before a given |
+| grace period when some access preceding the grace period observes the |
+| effect of some access within the critical section, in which case none |
+| of the accesses within the critical section may observe the effects |
+| of any access following the grace period. |
+| |
+| As of late 2016, mathematical models of RCU take this viewpoint, for |
+| example, see slides 62 and 63 of the `2016 LinuxCon |
+| EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 |
+| 6.10.04c.LCE.pdf>`__ |
+| presentation. |
++-----------------------------------------------------------------------+
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| The first and second guarantees require unbelievably strict ordering! |
+| Are all these memory barriers *really* required? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| Yes, they really are required. To see why the first guarantee is |
+| required, consider the following sequence of events: |
+| |
+| #. CPU 1: ``rcu_read_lock()`` |
+| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` |
+| #. CPU 0: ``list_del_rcu(p);`` |
+| #. CPU 0: ``synchronize_rcu()`` starts. |
+| #. CPU 1: ``do_something_with(q->a);`` |
+| ``/* No smp_mb(), so might happen after kfree(). */`` |
+| #. CPU 1: ``rcu_read_unlock()`` |
+| #. CPU 0: ``synchronize_rcu()`` returns. |
+| #. CPU 0: ``kfree(p);`` |
+| |
+| Therefore, there absolutely must be a full memory barrier between the |
+| end of the RCU read-side critical section and the end of the grace |
+| period. |
+| |
+| The sequence of events demonstrating the necessity of the second rule |
+| is roughly similar: |
+| |
+| #. CPU 0: ``list_del_rcu(p);`` |
+| #. CPU 0: ``synchronize_rcu()`` starts. |
+| #. CPU 1: ``rcu_read_lock()`` |
+| #. CPU 1: ``q = rcu_dereference(gp);`` |
+| ``/* Might return p if no memory barrier. */`` |
+| #. CPU 0: ``synchronize_rcu()`` returns. |
+| #. CPU 0: ``kfree(p);`` |
+| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` |
+| #. CPU 1: ``rcu_read_unlock()`` |
+| |
+| And similarly, without a memory barrier between the beginning of the |
+| grace period and the beginning of the RCU read-side critical section, |
+| CPU 1 might end up accessing the freelist. |
+| |
+| The “as if” rule of course applies, so that any implementation that |
+| acts as if the appropriate memory barriers were in place is a correct |
+| implementation. That said, it is much easier to fool yourself into |
+| believing that you have adhered to the as-if rule than it is to |
+| actually adhere to it! |
++-----------------------------------------------------------------------+
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
+| absolutely no code in some kernel builds. This means that the |
+| compiler might arbitrarily rearrange consecutive RCU read-side |
+| critical sections. Given such rearrangement, if a given RCU read-side |
+| critical section is done, how can you be sure that all prior RCU |
+| read-side critical sections are done? Won't the compiler |
+| rearrangements make that impossible to determine? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
+| absolutely no code, RCU infers quiescent states only at special |
+| locations, for example, within the scheduler. Because calls to |
+| ``schedule()`` had better prevent calling-code accesses to shared |
+| variables from being rearranged across the call to ``schedule()``, if |
+| RCU detects the end of a given RCU read-side critical section, it |
+| will necessarily detect the end of all prior RCU read-side critical |
+| sections, no matter how aggressively the compiler scrambles the code. |
+| Again, this all assumes that the compiler cannot scramble code across |
+| calls to the scheduler, out of interrupt handlers, into the idle |
+| loop, into user-mode code, and so on. But if your kernel build allows |
+| that sort of scrambling, you have broken far more than just RCU! |
++-----------------------------------------------------------------------+
+
+Note that these memory-barrier requirements do not replace the
+fundamental RCU requirement that a grace period wait for all
+pre-existing readers. On the contrary, the memory barriers called out in
+this section must operate in such a way as to *enforce* this fundamental
+requirement. Of course, different implementations enforce this
+requirement in different ways, but enforce it they must.
+
+RCU Primitives Guaranteed to Execute Unconditionally
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The common-case RCU primitives are unconditional. They are invoked, they
+do their job, and they return, with no possibility of error, and no need
+to retry. This is a key RCU design philosophy.
+
+However, this philosophy is pragmatic rather than pigheaded. If someone
+comes up with a good justification for a particular conditional RCU
+primitive, it might well be implemented and added. After all, this
+guarantee was reverse-engineered, not premeditated. The unconditional
+nature of the RCU primitives was initially an accident of
+implementation, and later experience with conditional synchronization
+primitives caused me to elevate this accident to a
+guarantee. Therefore, the justification for adding a conditional
+primitive to RCU would need to be based on detailed and compelling use
+cases.
+
+Guaranteed Read-to-Write Upgrade
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As far as RCU is concerned, it is always possible to carry out an update
+within an RCU read-side critical section. For example, that RCU
+read-side critical section might search for a given data element, and
+then might acquire the update-side spinlock in order to update that
+element, all while remaining in that RCU read-side critical section. Of
+course, it is necessary to exit the RCU read-side critical section
+before invoking ``synchronize_rcu()``; however, this inconvenience can
+be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API
+members described later in this document.
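+
+As a concrete illustration, here is a sketch of such an upgrade,
+assuming a hypothetical RCU-protected list of ``foo`` structures headed
+by ``foo_head`` and updated under ``foo_lock``:
+
+   ::
+
+      void upgrade_and_update(int key, int new_b)
+      {
+        struct foo *p;
+
+        rcu_read_lock();
+        list_for_each_entry_rcu(p, &foo_head, list) {
+          if (p->key == key) {
+            /* Upgrade to writer while still within the critical
+               section.  A production version would recheck under
+               foo_lock that p had not been removed in the meantime. */
+            spin_lock(&foo_lock);
+            WRITE_ONCE(p->b, new_b); /* In-place update. */
+            spin_unlock(&foo_lock);
+            break;
+          }
+        }
+        rcu_read_unlock(); /* Exit before any synchronize_rcu()! */
+      }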
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| But how does the upgrade-to-write operation exclude other readers? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| It doesn't, just like normal RCU updates, which also do not exclude |
+| RCU readers. |
++-----------------------------------------------------------------------+
+
+This guarantee allows lookup code to be shared between read-side and
+update-side code, and was premeditated, appearing in the earliest
+DYNIX/ptx RCU documentation.
+
+Fundamental Non-Requirements
+----------------------------
+
+RCU provides extremely lightweight readers, and its read-side
+guarantees, though quite useful, are correspondingly lightweight. It is
+therefore all too easy to assume that RCU is guaranteeing more than it
+really is. Of course, the list of things that RCU does not guarantee is
+infinitely long; however, the following sections list a few
+non-guarantees that have caused confusion. Except where otherwise noted,
+these non-guarantees were premeditated.
+
+#. `Readers Impose Minimal Ordering`_
+#. `Readers Do Not Exclude Updaters`_
+#. `Updaters Only Wait For Old Readers`_
+#. `Grace Periods Don't Partition Read-Side Critical Sections`_
+#. `Read-Side Critical Sections Don't Partition Grace Periods`_
+
+Readers Impose Minimal Ordering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Reader-side markers such as ``rcu_read_lock()`` and
+``rcu_read_unlock()`` provide absolutely no ordering guarantees except
+through their interaction with the grace-period APIs such as
+``synchronize_rcu()``. To see this, consider the following pair of
+threads:
+
+ ::
+
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(x, 1);
+ 5 rcu_read_unlock();
+ 6 rcu_read_lock();
+ 7 WRITE_ONCE(y, 1);
+ 8 rcu_read_unlock();
+ 9 }
+ 10
+ 11 void thread1(void)
+ 12 {
+ 13 rcu_read_lock();
+ 14 r1 = READ_ONCE(y);
+ 15 rcu_read_unlock();
+ 16 rcu_read_lock();
+ 17 r2 = READ_ONCE(x);
+ 18 rcu_read_unlock();
+ 19 }
+
+After ``thread0()`` and ``thread1()`` execute concurrently, it is quite
+possible to have
+
+ ::
+
+ (r1 == 1 && r2 == 0)
+
+(that is, ``y`` appears to have been assigned before ``x``), which would
+not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had
+much in the way of ordering properties. But they do not, so the CPU is
+within its rights to do significant reordering. This is by design: Any
+significant ordering constraints would slow down these fast-path APIs.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Can't the compiler also reorder this code? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()`` |
+| prevent the compiler from reordering in this particular case. |
++-----------------------------------------------------------------------+
+
+Readers Do Not Exclude Updaters
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` exclude updates.
+All they do is to prevent grace periods from ending. The following
+example illustrates this:
+
+ ::
+
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 r1 = READ_ONCE(y);
+ 5 if (r1) {
+ 6 do_something_with_nonzero_x();
+ 7 r2 = READ_ONCE(x);
+ 8 WARN_ON(!r2); /* BUG!!! */
+ 9 }
+ 10 rcu_read_unlock();
+ 11 }
+ 12
+ 13 void thread1(void)
+ 14 {
+ 15 spin_lock(&my_lock);
+ 16 WRITE_ONCE(x, 1);
+ 17 WRITE_ONCE(y, 1);
+ 18 spin_unlock(&my_lock);
+ 19 }
+
+If the ``thread0()`` function's ``rcu_read_lock()`` excluded the
+``thread1()`` function's update, the ``WARN_ON()`` could never fire. But
+the fact is that ``rcu_read_lock()`` does not exclude much of anything
+aside from subsequent grace periods, of which ``thread1()`` has none, so
+the ``WARN_ON()`` can and does fire.
+
+Updaters Only Wait For Old Readers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It might be tempting to assume that after ``synchronize_rcu()``
+completes, there are no readers executing. This temptation must be
+avoided because new readers can start immediately after
+``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no
+obligation to wait for these new readers.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**: |
++-----------------------------------------------------------------------+
+| Suppose that ``synchronize_rcu()`` did wait until *all* readers had   |
+| completed instead of waiting only on pre-existing readers. For how |
+| long would the updater be able to rely on there being no readers? |
++-----------------------------------------------------------------------+
+| **Answer**: |
++-----------------------------------------------------------------------+
+| For no time at all. Even if ``synchronize_rcu()`` were to wait until |
+| all readers had completed, a new reader might start immediately after |
+| ``synchronize_rcu()`` completed. Therefore, the code following |
+| ``synchronize_rcu()`` can *never* rely on there being no readers. |
++-----------------------------------------------------------------------+
+
+Grace Periods Don't Partition Read-Side Critical Sections
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is tempting to assume that if any part of one RCU read-side critical
+section precedes a given grace period, and if any part of another RCU
+read-side critical section follows that same grace period, then all of
+the first RCU read-side critical section must precede all of the second.
+However, this just isn't the case: A single grace period does not
+partition the set of RCU read-side critical sections. An example of this
+situation can be illustrated as follows, where ``a``, ``b``, and ``c``
+are initially all zero:
+
+ ::
+
+ 1 void thread0(void)
+ 2 {
+ 3 rcu_read_lock();
+ 4 WRITE_ONCE(a, 1);
+ 5 WRITE_ONCE(b, 1);
+ 6 rcu_read_unlock();
+ 7 }
+ 8
+ 9 void thread1(void)
+ 10 {
+ 11 r1 = READ_ONCE(a);
+ 12 synchronize_rcu();
+ 13 WRITE_ONCE(c, 1);
+ 14 }
+ 15
+ 16 void thread2(void)
+ 17 {
+ 18 rcu_read_lock();
+ 19 r2 = READ_ONCE(b);
+ 20 r3 = READ_ONCE(c);
+ 21 rcu_read_unlock();
+ 22 }
+
+It turns out that the outcome:
+
+ ::
+
+      (r1 == 1 && r2 == 0 && r3 == 1)