Diffstat (limited to 'Documentation/RCU/Design/Data-Structures/Data-Structures.html')
-rw-r--r--Documentation/RCU/Design/Data-Structures/Data-Structures.html1391
1 files changed, 0 insertions, 1391 deletions
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html
deleted file mode 100644
index c30c1957c7e6..000000000000
--- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html
+++ /dev/null
@@ -1,1391 +0,0 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
- "http://www.w3.org/TR/html4/loose.dtd">
- <html>
- <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title>
- <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
- </head><body>
-
- <p>December 18, 2016</p>
- <p>This article was contributed by Paul E.&nbsp;McKenney</p>
-
-<h3>Introduction</h3>
-
-This document describes RCU's major data structures and their relationship
-to each other.
-
-<ol>
-<li> <a href="#Data-Structure Relationships">
- Data-Structure Relationships</a>
-<li> <a href="#The rcu_state Structure">
- The <tt>rcu_state</tt> Structure</a>
-<li> <a href="#The rcu_node Structure">
- The <tt>rcu_node</tt> Structure</a>
-<li> <a href="#The rcu_segcblist Structure">
- The <tt>rcu_segcblist</tt> Structure</a>
-<li> <a href="#The rcu_data Structure">
- The <tt>rcu_data</tt> Structure</a>
-<li> <a href="#The rcu_head Structure">
- The <tt>rcu_head</tt> Structure</a>
-<li> <a href="#RCU-Specific Fields in the task_struct Structure">
- RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
-<li> <a href="#Accessor Functions">
- Accessor Functions</a>
-</ol>
-
-<h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>
-
-<p>RCU is for all intents and purposes a large state machine, and its
-data structures maintain the state in such a way as to allow RCU readers
-to execute extremely quickly, while also processing the RCU grace periods
-requested by updaters in an efficient and extremely scalable fashion.
-The efficiency and scalability of RCU updaters is provided primarily
-by a combining tree, as shown below:
-
-</p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">
-
-</p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
-containing a tree of <tt>rcu_node</tt> structures.
-Each leaf node of the <tt>rcu_node</tt> tree has up to 16
-<tt>rcu_data</tt> structures associated with it, so that there
-are <tt>NR_CPUS</tt> <tt>rcu_data</tt> structures in total,
-one for each possible CPU.
-This tree is adjusted at boot time, if needed, to handle the
-common case where <tt>nr_cpu_ids</tt> is much less than
-<tt>NR_CPUS</tt>.
-For example, a number of Linux distributions set <tt>NR_CPUS=4096</tt>,
-which results in a three-level <tt>rcu_node</tt> tree.
-If the actual hardware has only 16 CPUs, RCU will adjust itself
-at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.
-
-</p><p>The purpose of this combining tree is to allow per-CPU events
-such as quiescent states, dyntick-idle transitions,
-and CPU hotplug operations to be processed efficiently
-and scalably.
-Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
-and other events are recorded by the leaf-level <tt>rcu_node</tt>
-structures.
-All of these events are combined at each level of the tree until finally
-grace periods are completed at the tree's root <tt>rcu_node</tt>
-structure.
-A grace period can be completed at the root once every CPU
-(or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task)
-has passed through a quiescent state.
-Once a grace period has completed, record of that fact is propagated
-back down the tree.
-
-</p><p>As can be seen from the diagram, on a 64-bit system
-a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
-of 64 at the root and a fanout of 16 at the leaves.
-
-<table>
-<tr><th>&nbsp;</th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- Why isn't the fanout at the leaves also 64?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- Because there are more types of events that affect the leaf-level
- <tt>rcu_node</tt> structures than further up the tree.
- Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of
-	64, the contention on these structures' <tt>-&gt;lock</tt> fields
-	becomes excessive.
- Experimentation on a wide variety of systems has shown that a fanout
- of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
- </font>
-
- <p><font color="ffffff">Of course, further experience with
- systems having hundreds or thousands of CPUs may demonstrate
- that the fanout for the non-leaf <tt>rcu_node</tt> structures
- must also be reduced.
- Such reduction can be easily carried out when and if it proves
- necessary.
- In the meantime, if you are using such a system and running into
- contention problems on the non-leaf <tt>rcu_node</tt> structures,
- you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
- parameter to reduce the non-leaf fanout as needed.
- </font>
-
- <p><font color="ffffff">Kernels built for systems with
- strong NUMA characteristics might also need to adjust
- <tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
- <tt>rcu_node</tt> structures align with hardware boundaries.
- However, there has thus far been no need for this.
-</font></td></tr>
-<tr><td>&nbsp;</td></tr>
-</table>
-
-<p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
-a 32-bit system), then RCU will automatically add more levels to the
-tree.
-For example, if you are crazy enough to build a 64-bit system with 65,536
-CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows:
-
-</p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%">
-
-</p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
-accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
-32-bit systems.
-On the other hand, you can set both <tt>CONFIG_RCU_FANOUT</tt> and
-<tt>CONFIG_RCU_FANOUT_LEAF</tt> to be as small as 2, which would result
-in a 16-CPU test using a 4-level tree.
-This can be useful for testing large-system capabilities on small test
-machines.
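-
-</p><p>For concreteness, a <tt>.config</tt> fragment along the following
-lines (a sketch; on recent kernels these options are hidden behind
-<tt>CONFIG_RCU_EXPERT</tt>) produces just such a four-level tree on a
-16-CPU machine:
-
-<pre>
-CONFIG_NR_CPUS=16
-CONFIG_RCU_FANOUT=2
-CONFIG_RCU_FANOUT_LEAF=2
-</pre>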
-
-</p><p>This multi-level combining tree allows us to get most of the
-performance and scalability
-benefits of partitioning, even though RCU grace-period detection is
-inherently a global operation.
-The trick here is that only the last CPU to report a quiescent state
-into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt>
-structure at the next level up the tree.
-This means that at the leaf-level <tt>rcu_node</tt> structure, only
-one access out of sixteen will progress up the tree.
-For the internal <tt>rcu_node</tt> structures, the situation is even
-more extreme: Only one access out of sixty-four will progress up
-the tree.
-Because the vast majority of the CPUs do not progress up the tree,
-the lock contention remains roughly constant up the tree.
-No matter how many CPUs there are in the system, at most 64 quiescent-state
-reports per grace period will progress all the way to the root
-<tt>rcu_node</tt> structure, thus ensuring that the lock contention
-on that root <tt>rcu_node</tt> structure remains acceptably low.
-
-</p><p>In effect, the combining tree acts like a big shock absorber,
-keeping lock contention under control at all tree levels regardless
-of the level of loading on the system.
-
-</p><p>RCU updaters wait for normal grace periods by registering
-RCU callbacks, either directly via <tt>call_rcu()</tt>
-or indirectly via <tt>synchronize_rcu()</tt> and friends.
-RCU callbacks are represented by <tt>rcu_head</tt> structures,
-which are queued on <tt>rcu_data</tt> structures while they are
-waiting for a grace period to elapse, as shown in the following figure:
-
-</p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%">
-
-</p><p>This figure shows how <tt>TREE_RCU</tt>'s and
-<tt>PREEMPT_RCU</tt>'s major data structures are related.
-Lesser data structures will be introduced with the algorithms that
-make use of them.
-
-</p><p>Note that each of the data structures in the above figure has
-its own synchronization:
-
-<p><ol>
-<li>	Each <tt>rcu_state</tt> structure has a lock and a mutex,
- and some fields are protected by the corresponding root
- <tt>rcu_node</tt> structure's lock.
-<li> Each <tt>rcu_node</tt> structure has a spinlock.
-<li> The fields in <tt>rcu_data</tt> are private to the corresponding
- CPU, although a few can be read and written by other CPUs.
-</ol>
-
-<p>It is important to note that different data structures can have
-very different ideas about the state of RCU at any given time.
-For but one example, awareness of the start or end of a given RCU
-grace period propagates slowly through the data structures.
-This slow propagation is absolutely necessary for RCU to have good
-read-side performance.
-If this balkanized implementation seems foreign to you, one useful
-trick is to consider each instance of these data structures to be
-a different person, each having the usual slightly different
-view of reality.
-
-</p><p>The general role of each of these data structures is as
-follows:
-
-</p><ol>
-<li> <tt>rcu_state</tt>:
- This structure forms the interconnection between the
- <tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
- tracks grace periods, serves as short-term repository
- for callbacks orphaned by CPU-hotplug events,
- maintains <tt>rcu_barrier()</tt> state,
- tracks expedited grace-period state,
- and maintains state used to force quiescent states when
-	grace periods extend too long.
-<li> <tt>rcu_node</tt>: This structure forms the combining
- tree that propagates quiescent-state
- information from the leaves to the root, and also propagates
- grace-period information from the root to the leaves.
- It provides local copies of the grace-period state in order
- to allow this information to be accessed in a synchronized
- manner without suffering the scalability limitations that
- would otherwise be imposed by global locking.
- In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists
- of tasks that have blocked while in their current
- RCU read-side critical section.
- In <tt>CONFIG_PREEMPT_RCU</tt> with
- <tt>CONFIG_RCU_BOOST</tt>, it manages the
- per-<tt>rcu_node</tt> priority-boosting
- kernel threads (kthreads) and state.
- Finally, it records CPU-hotplug state in order to determine
- which CPUs should be ignored during a given grace period.
-<li> <tt>rcu_data</tt>: This per-CPU structure is the
- focus of quiescent-state detection and RCU callback queuing.
- It also tracks its relationship to the corresponding leaf
- <tt>rcu_node</tt> structure to allow more-efficient
- propagation of quiescent states up the <tt>rcu_node</tt>
- combining tree.
-	Like the <tt>rcu_node</tt> structure, it provides a local
-	copy of the grace-period information to allow essentially
-	free synchronized access to this information from the
-	corresponding CPU.
- Finally, this structure records past dyntick-idle state
- for the corresponding CPU and also tracks statistics.
-<li> <tt>rcu_head</tt>:
- This structure represents RCU callbacks, and is the
- only structure allocated and managed by RCU users.
-	The <tt>rcu_head</tt> structure is normally embedded
-	within the RCU-protected data structure, as illustrated
-	by the sketch following this list.
-</ol>
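-
-<p>For example, a minimal use of an embedded <tt>rcu_head</tt> might
-look as follows.
-This is an illustrative sketch: <tt>struct foo</tt>,
-<tt>foo_release()</tt>, and <tt>foo_retire()</tt> are hypothetical names,
-but <tt>call_rcu()</tt> and <tt>container_of()</tt> are the real APIs.
-
-<pre>
- 1 struct foo {
- 2   int data;
- 3   struct rcu_head rh;  /* Embedded rcu_head, managed by RCU. */
- 4 };
- 5
- 6 /* Invoked after a grace period; maps the rcu_head back to its foo. */
- 7 static void foo_release(struct rcu_head *rhp)
- 8 {
- 9   kfree(container_of(rhp, struct foo, rh));
-10 }
-11
-12 /* Called after fp has been unlinked from all readers' view. */
-13 static void foo_retire(struct foo *fp)
-14 {
-15   call_rcu(&amp;fp-&gt;rh, foo_release);
-16 }
-</pre>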
-
-<p>If all you wanted from this article was a general notion of how
-RCU's data structures are related, you are done.
-Otherwise, each of the following sections gives more details on
-the <tt>rcu_state</tt>, <tt>rcu_node</tt>, and <tt>rcu_data</tt> data
-structures.
-
-<h3><a name="The rcu_state Structure">
-The <tt>rcu_state</tt> Structure</a></h3>
-
-<p>The <tt>rcu_state</tt> structure is the base structure that
-represents the state of RCU in the system.
-This structure forms the interconnection between the
-<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
-tracks grace periods, contains the lock used to
-synchronize with CPU-hotplug events,
-and maintains state used to force quiescent states when
-grace periods extend too long.
-
-</p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed,
-singly and in groups, in the following sections.
-The more specialized fields are covered in the discussion of their
-use.
-
-<h5>Relationship to rcu_node and rcu_data Structures</h5>
-
-This portion of the <tt>rcu_state</tt> structure is declared
-as follows:
-
-<pre>
- 1 struct rcu_node node[NUM_RCU_NODES];
- 2 struct rcu_node *level[NUM_RCU_LVLS + 1];
- 3 struct rcu_data __percpu *rda;
-</pre>
-
-<table>
-<tr><th>&nbsp;</th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- Wait a minute!
- You said that the <tt>rcu_node</tt> structures formed a tree,
- but they are declared as a flat array!
- What gives?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- The tree is laid out in the array.
-	The first node in the array is the head, the next set of nodes in the
- array are children of the head node, and so on until the last set of
- nodes in the array are the leaves.
- </font>
-
- <p><font color="ffffff">See the following diagrams to see how
- this works.
-</font></td></tr>
-<tr><td>&nbsp;</td></tr>
-</table>
-
-<p>The <tt>rcu_node</tt> tree is embedded into the
-<tt>-&gt;node[]</tt> array as shown in the following figure:
-
-</p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%">
-
-</p><p>One interesting consequence of this mapping is that a
-breadth-first traversal of the tree is implemented as a simple
-linear scan of the array, which is in fact what the
-<tt>rcu_for_each_node_breadth_first()</tt> macro does.
-This macro is used at the beginnings and ends of grace periods.
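-
-</p><p>Give or take details that vary across kernel versions, the macro
-is just such a scan, with <tt>rcu_num_nodes</tt> reflecting the
-boot-time-adjusted size of the tree:
-
-<pre>
- 1 #define rcu_for_each_node_breadth_first(rsp, rnp) \
- 2   for ((rnp) = &amp;(rsp)-&gt;node[0]; \
- 3        (rnp) &lt; &amp;(rsp)-&gt;node[rcu_num_nodes]; (rnp)++)
-</pre>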
-
-</p><p>Each entry of the <tt>-&gt;level</tt> array references
-the first <tt>rcu_node</tt> structure on the corresponding level
-of the tree, for example, as shown below:
-
-</p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%">
-
-</p><p>The zero<sup>th</sup> element of the array references the root
-<tt>rcu_node</tt> structure, the first element references the
-first child of the root <tt>rcu_node</tt>, and finally the second
-element references the first leaf <tt>rcu_node</tt> structure.
-
-</p><p>For whatever it is worth, if you draw the tree to be tree-shaped
-rather than array-shaped, it is easy to draw a planar representation:
-
-</p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%">
-
-</p><p>Finally, the <tt>-&gt;rda</tt> field is a per-CPU pointer
-that references the corresponding CPU's <tt>rcu_data</tt> structure.
-
-</p><p>All of these fields are constant once initialization is complete,
-and therefore need no protection.
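-
-<p>For example, RCU-internal code can map a CPU number to that CPU's
-<tt>rcu_data</tt> structure along the following lines
-(<tt>per_cpu_ptr()</tt> is the real API; the surrounding names are
-illustrative):
-
-<pre>
- 1 struct rcu_data *rdp = per_cpu_ptr(rsp-&gt;rda, cpu);
-</pre>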
-
-<h5>Grace-Period Tracking</h5>
-
-<p>This portion of the <tt>rcu_state</tt> structure is declared
-as follows:
-
-<pre>
- 1 unsigned long gp_seq;
-</pre>
-
-<p>RCU grace periods are numbered, and
-the <tt>-&gt;gp_seq</tt> field contains the current grace-period
-sequence number.
-The bottom two bits are the state of the current grace period,
-which can be zero for not yet started or one for in progress.
-In other words, if the bottom two bits of <tt>-&gt;gp_seq</tt> are
-zero, then RCU is idle.
-Any other value in the bottom two bits indicates that something is broken.
-This field is protected by the root <tt>rcu_node</tt> structure's
-<tt>-&gt;lock</tt> field.
-
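-<p>The kernel manipulates these sequence numbers with small helper
-functions patterned after the following sketch of those found in
-<tt>kernel/rcu/rcu.h</tt>:
-
-<pre>
- 1 #define RCU_SEQ_CTR_SHIFT  2
- 2 #define RCU_SEQ_STATE_MASK ((1 &lt;&lt; RCU_SEQ_CTR_SHIFT) - 1)
- 3
- 4 /* Grace-period count, with the low-order state bits stripped. */
- 5 static inline unsigned long rcu_seq_ctr(unsigned long s)
- 6 {
- 7   return s &gt;&gt; RCU_SEQ_CTR_SHIFT;
- 8 }
- 9
-10 /* Bottom bits: zero for idle, one for a grace period in progress. */
-11 static inline int rcu_seq_state(unsigned long s)
-12 {
-13   return s &amp; RCU_SEQ_STATE_MASK;
-14 }
-</pre>
-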
-</p><p>There are <tt>-&gt;gp_seq</tt> fields
-in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures
-as well.
-The field in the <tt>rcu_state</tt> structure represents the
-most current value, and those of the other structures are compared
-in order to detect the beginnings and ends of grace periods in a distributed
-fashion.
-The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt>
-(down the tree from the root to the leaves) to <tt>rcu_data</tt>.
-
-<h5>Miscellaneous</h5>
-
-<p>This portion of the <tt>rcu_state</tt> structure is declared
-as follows:
-
-<pre>
- 1 unsigned long gp_max;
- 2 char abbr;
- 3 char *name;
-</pre>
-
-<p>The <tt>-&gt;gp_max</tt> field tracks the duration of the longest
-grace period in jiffies.
-It is protected by the root <tt>rcu_node</tt>'s <tt>-&gt;lock</tt>.
-
-<p>The <tt>-&gt;name</tt> and <tt>-&gt;abbr</tt> fields distinguish
-between preemptible RCU (&ldquo;rcu_preempt&rdquo; and &ldquo;p&rdquo;)
-and non-preemptible RCU (&ldquo;rcu_sched&rdquo; and &ldquo;s&rdquo;).
-These fields are used for diagnostic and tracing purposes.
-
-<h3><a name="The rcu_node Structure">
-The <tt>rcu_node</tt> Structure</a></h3>
-
-<p>The <tt>rcu_node</tt> structures form the combining
-tree that propagates quiescent-state
-information from the leaves to the root, and that also propagates
-grace-period information from the root down to the leaves.
-They provide local copies of the grace-period state in order
-to allow this information to be accessed in a synchronized
-manner without suffering the scalability limitations that
-would otherwise be imposed by global locking.
-In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists
-of tasks that have blocked while in their current
-RCU read-side critical section.
-In <tt>CONFIG_PREEMPT_RCU</tt> with
-<tt>CONFIG_RCU_BOOST</tt>, they manage the
-per-<tt>rcu_node</tt> priority-boosting
-kernel threads (kthreads) and state.
-Finally, they record CPU-hotplug state in order to determine
-which CPUs should be ignored during a given grace period.
-
-</p><p>The <tt>rcu_node</tt> structure's fields are discussed,
-singly and in groups, in the following sections.
-
-<h5>Connection to Combining Tree</h5>
-
-<p>This portion of the <tt>rcu_node</tt> structure is declared
-as follows:
-
-<pre>
- 1 struct rcu_node *parent;
- 2 u8 level;
- 3 u8 grpnum;
- 4 unsigned long grpmask;
- 5 int grplo;
- 6 int grphi;
-</pre>
-
-<p>The <tt>-&gt;parent</tt> pointer references the <tt>rcu_node</tt>
-one level up in the tree, and is <tt>NULL</tt> for the root
-<tt>rcu_node</tt>.
-The RCU implementation makes heavy use of this field to push quiescent
-states up the tree.
-The <tt>-&gt;level</tt> field gives the level in the tree, with
-the root being at level zero, its children at level one, and so on.
-The <tt>-&gt;grpnum</tt> field gives this node's position within
-the children of its parent, so this number can range between 0 and 31
-on 32-bit systems and between 0 and 63 on 64-bit systems.
-The <tt>-&gt;level</tt> and <tt>-&gt;grpnum</tt> fields are
-used only during initialization and for tracing.
-The <tt>-&gt;grpmask</tt> field is the bitmask counterpart of
-<tt>-&gt;grpnum</tt>, and therefore always has exactly one bit set.
-This mask is used to clear the bit corresponding to this <tt>rcu_node</tt>
-structure in its parent's bitmasks, which are described later.
-Finally, the <tt>-&gt;grplo</tt> and <tt>-&gt;grphi</tt> fields
-contain the lowest and highest numbered CPU served by this
-<tt>rcu_node</tt> structure, respectively.
-
-</p><p>All of these fields are constant, and thus do not require any
-synchronization.
-
-<h5>Synchronization</h5>
-
-<p>This field of the <tt>rcu_node</tt> structure is declared
-as follows:
-
-<pre>
- 1 raw_spinlock_t lock;
-</pre>
-
-<p>This field is used to protect the remaining fields in this structure,
-unless otherwise stated.
-That said, all of the fields in this structure can be accessed without
-locking for tracing purposes.
-Yes, this can result in confusing traces, but better some tracing confusion
-than to be heisenbugged out of existence.
-
-<h5>Grace-Period Tracking</h5>
-
-<p>This portion of the <tt>rcu_node</tt> structure is declared
-as follows:
-
-<pre>
- 1 unsigned long gp_seq;
- 2 unsigned long gp_seq_needed;
-</pre>
-
-<p>The <tt>rcu_node</tt> structures' <tt>-&gt;gp_seq</tt> fields are
-the counterparts of the field of the same name in the <tt>rcu_state</tt>
-structure.
-They each may lag up to one step behind their <tt>rcu_state</tt>
-counterpart.
-If the bottom two bits of a given <tt>rcu_node</tt> structure's
-<tt>-&gt;gp_seq</tt> field are zero, then this <tt>rcu_node</tt>
-structure believes that RCU is idle.
-</p><p>The <tt>-&gt;gp_seq</tt> field of each <tt>rcu_node</tt>
-structure is updated at the beginning and the end
-of each grace period.
-
-<p>The <tt>-&gt;gp_seq_needed</tt> fields record the
-furthest-in-the-future grace period request seen by the corresponding
-<tt>rcu_node</tt> structure. The request is considered fulfilled when
-the value of the <tt>-&gt;gp_seq</tt> field equals or exceeds that of
-the <tt>-&gt;gp_seq_needed</tt> field.
-
-<table>
-<tr><th>&nbsp;</th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- Suppose that this <tt>rcu_node</tt> structure doesn't see
- a request for a very long time.
- Won't wrapping of the <tt>-&gt;gp_seq</tt> field cause
- problems?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- No, because if the <tt>-&gt;gp_seq_needed</tt> field lags behind the
- <tt>-&gt;gp_seq</tt> field, the <tt>-&gt;gp_seq_needed</tt> field
- will be updated at the end of the grace period.
- Modulo-arithmetic comparisons therefore will always get the
- correct answer, even with wrapping.
-</font></td></tr>
-<tr><td>&nbsp;</td></tr>
-</table>
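-
-<p>The modulo-arithmetic comparison called out in this Quick Quiz's
-answer is carried out by helpers along the lines of the following
-sketch, patterned after <tt>ULONG_CMP_GE()</tt> and
-<tt>rcu_seq_done()</tt> in the kernel sources:
-
-<pre>
- 1 /* True if a - b is "non-negative" in wraparound arithmetic. */
- 2 #define ULONG_CMP_GE(a, b)  (ULONG_MAX / 2 &gt;= (a) - (b))
- 3
- 4 /* Has the grace period requested by snapshot s completed? */
- 5 static inline bool rcu_seq_done(unsigned long *sp, unsigned long s)
- 6 {
- 7   return ULONG_CMP_GE(READ_ONCE(*sp), s);
- 8 }
-</pre>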
-
-<h5>Quiescent-State Tracking</h5>
-
-<p>These fields manage the propagation of quiescent states up the
-combining tree.
-
-</p><p>This portion of the <tt>rcu_node</tt> structure has fields
-as follows:
-
-<pre>
- 1 unsigned long qsmask;
- 2 unsigned long expmask;
- 3 unsigned long qsmaskinit;
- 4 unsigned long expmaskinit;
-</pre>
-
-<p>The <tt>-&gt;qsmask</tt> field tracks which of this
-<tt>rcu_node</tt> structure's children still need to report
-quiescent states for the current normal grace period.
-Such children will have a value of 1 in their corresponding bit.
-Note that the leaf <tt>rcu_node</tt> structures should be
-thought of as having <tt>rcu_data</tt> structures as their
-children.
-Similarly, the <tt>-&gt;expmask</tt> field tracks which
-of this <tt>rcu_node</tt> structure's children still need to report
-quiescent states for the current expedited grace period.
-An expedited grace period has
-the same conceptual properties as a normal grace period, but the
-expedited implementation accepts extreme CPU overhead to obtain
-much lower grace-period latency, for example, consuming a few
-tens of microseconds worth of CPU time to reduce grace-period
-duration from milliseconds to tens of microseconds.
-The <tt>-&gt;qsmaskinit</tt> field tracks which of this
-<tt>rcu_node</tt> structure's children cover at least
-one online CPU.
-This mask is used to initialize <tt>-&gt;qsmask</tt>,
-and <tt>-&gt;expmaskinit</tt> is used to initialize
-<tt>-&gt;expmask</tt>, at the beginnings of the
-normal and expedited grace periods, respectively.
-
-<table>
-<tr><th>&nbsp;</th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
- Why are these bitmasks protected by locking?
- Come on, haven't you heard of atomic instructions???
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
- Lockless grace-period computation! Such a tantalizing possibility!
- </font>
-
- <p><font color="ffffff">But consider the following sequence of events:
- </font>
-
- <ol>
- <li> <font color="ffffff">CPU&nbsp;0 has been in dyntick-idle
- mode for quite some time.
- When it wakes up, it notices that the current RCU
- grace period needs it to report in, so it sets a
- flag where the scheduling clock interrupt will find it.
- </font><p>
- <li> <font color="ffffff">Meanwhile, CPU&nbsp;1 is running
- <tt>force_quiescent_state()</tt>,
-		and notices that CPU&nbsp;0 has been in dyntick-idle mode,
- which qualifies as an extended quiescent state.
- </font><p>
- <li> <font color="ffffff">CPU&nbsp;0's scheduling clock
- interrupt fires in the
- middle of an RCU read-side critical section, and notices
- that the RCU core needs something, so commences RCU softirq
- processing.
- </font>
- <p>
- <li> <font color="ffffff">CPU&nbsp;0's softirq handler
- executes and is just about ready
- to report its quiescent state up the <tt>rcu_node</tt>
- tree.
- </font><p>
- <li> <font color="ffffff">But CPU&nbsp;1 beats it to the punch,
- completing the current
- grace period and starting a new one.
- </font><p>
- <li> <font color="ffffff">CPU&nbsp;0 now reports its quiescent
- state for the wrong
- grace period.
- That grace period might now end before the RCU read-side
- critical section.
- If that happens, disaster will ensue.
- </font>
- </ol>
-
- <p><font color="ffffff">So the locking is absolutely required in
- order to coordinate clearing of the bits with updating of the
- grace-period sequence number in <tt>-&gt;gp_seq</tt>.
-</font></td></tr>
-<tr><td>&nbsp;</td></tr>
-</table>
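-
-<p>The resulting lock-protected propagation of quiescent states up the
-tree can be sketched as follows.
-This is a much-simplified rendition of the kernel's
-<tt>rcu_report_qs_rnp()</tt>: the grace-period-sequence checks that the
-above sequence of events makes necessary are omitted, as are all
-wakeups and diagnostics.
-
-<pre>
- 1 static void report_qs_sketch(struct rcu_node *rnp, unsigned long mask)
- 2 {
- 3   for (;;) {
- 4     raw_spin_lock(&amp;rnp-&gt;lock);
- 5     rnp-&gt;qsmask &amp;= ~mask;  /* These children are now done. */
- 6     if (rnp-&gt;qsmask != 0 || rnp-&gt;parent == NULL) {
- 7       /* Not the last to report, or at the root: stop here. */
- 8       raw_spin_unlock(&amp;rnp-&gt;lock);
- 9       return;
-10     }
-11     mask = rnp-&gt;grpmask;   /* Clear our bit one level up. */
-12     raw_spin_unlock(&amp;rnp-&gt;lock);
-13     rnp = rnp-&gt;parent;     /* Only the last reporter ascends. */
-14   }
-15 }
-</pre>
-
-<p>When <tt>-&gt;qsmask</tt> reaches zero at the root, the current
-grace period can end.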
-
-<h5>Blocked-Task Management</h5>
-
-<p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the
-midst of their RCU read-side critical sections, and these tasks
-must be tracked explicitly.
-The details of exactly why and how they are tracked will be covered
-in a separate article on RCU read-side processing.
-For now, it is enough to know that the <tt>rcu_node</tt>
-structure tracks them.
-
-<pre>
- 1 struct list_head blkd_tasks;
- 2 struct list_head *gp_tasks;
- 3 struct list_head *exp_tasks;
- 4 bool wait_blkd_tasks;
-</pre>
-
-<p>The <tt>-&gt;blkd_tasks</tt> field is a list header for
-the list of blocked and preempted tasks.
-As tasks undergo context switches within RCU read-side critical
-sections, their <tt>task_struct</tt> structures are enqueued
-(via the <tt>task_struct</tt>'s <tt>-&gt;rcu_node_entry</tt>
-field) onto the head of the <tt>-&gt;blkd_tasks</tt> list for the
-leaf <tt>rcu_node</tt> structure corresponding to the CPU
-on which the outgoing context switch executed.
-As these tasks later exit their RCU read-side critical sections,
-they remove themselves from the list.
-This list is therefore in reverse time order, so that if one of the tasks
-is blocking the current grace period, all subsequent tasks must
-also be blocking that same grace period.
-Therefore, a single pointer into this list suffices to track
-all tasks blocking a given grace period.
-That pointer is stored in <tt>-&gt;gp_tasks</tt> for normal
-grace periods and in <tt>-&gt;exp_tasks</tt> for expedited
-grace periods.
-These last two fields are <tt>NULL</tt> if either there is
-no grace period in flight or if there are no blocked tasks
-preventing that grace period from completing.
-If either of these two pointers is referencing a task that
-removes itself from the <tt>-&gt;blkd_tasks</tt> list,
-then that task must advance the pointer to the next task on
-the list, or set the pointer to <tt>NULL</tt> if there
-are no subsequent tasks on the list.
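-
-</p><p>This pointer-advancing rule can be sketched as follows
-(illustrative only; in the kernel this logic lives in the
-<tt>rcu_read_unlock()</tt> slow path, with additional locking and
-bookkeeping):
-
-<pre>
- 1 static void remove_blkd_task_sketch(struct task_struct *t,
- 2                                     struct rcu_node *rnp)
- 3 {
- 4   struct list_head *np = t-&gt;rcu_node_entry.next;
- 5
- 6   if (np == &amp;rnp-&gt;blkd_tasks)
- 7     np = NULL;  /* t was last on the list. */
- 8   if (rnp-&gt;gp_tasks == &amp;t-&gt;rcu_node_entry)
- 9     rnp-&gt;gp_tasks = np;   /* Advance normal-GP pointer. */
-10   if (rnp-&gt;exp_tasks == &amp;t-&gt;rcu_node_entry)
-11     rnp-&gt;exp_tasks = np;  /* Advance expedited-GP pointer. */
-12   list_del_init(&amp;t-&gt;rcu_node_entry);
-13 }
-</pre>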
-
-</p><p>For example, suppose that tasks&nbsp;T1, T2, and&nbsp;T3 are
-all hard-affinitied to the largest-numbered CPU in the system.
-Then if task&nbsp;T1 blocked in an RCU read-side
-critical section, then an expedited grace period started,
-then task&nbsp;T2 blocked in an RCU read-side critical section,
-then a normal grace period started, and finally task&nbsp;T3 blocked
-in an RCU read-side critical section, then the state of the
-last leaf <tt>rcu_node</tt> structure's blocked-task list
-would be as shown below:
-
-</p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%">
-
-</p><p>Task&nbsp;T1 is blocking both grace periods, task&nbsp;T2 is
-blocking only the normal grace period, and task&nbsp;T3 is blocking
-neither grace period.
-Note that these tasks will not remove themselves from this list
-immediately upon resuming execution.
-They will instead remain on the list until they execute the outermost
-<tt>rcu_read_unlock()</tt> that ends their RCU read-side critical
-section.
-
-<p>
-The <tt>-&gt;wait_blkd_tasks</tt> field indicates whether or not
-the current grace period is waiting on a blocked task.
-
-<h5>Sizing the <tt>rcu_node</tt> Array</h5>
-
-<p>The <tt>rcu_node</tt> array is sized via a series of
-C-preprocessor expressions as follows:
-
-<pre>
- 1 #ifdef CONFIG_RCU_FANOUT
- 2 #define RCU_FANOUT CONFIG_RCU_FANOUT
- 3 #else
- 4 # ifdef CONFIG_64BIT
- 5 # define RCU_FANOUT 64
- 6 # else
- 7 # define RCU_FANOUT 32
- 8 # endif
- 9 #endif
-10
-11 #ifdef CONFIG_RCU_FANOUT_LEAF
-12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
-13 #else
-14 # ifdef CONFIG_64BIT
-15 # define RCU_FANOUT_LEAF 64
-16 # else
-17 # define RCU_FANOUT_LEAF 32
-18 # endif
-19 #endif
-20
-21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
-22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
-23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
-24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
-25
-26 #if NR_CPUS &lt;= RCU_FANOUT_1
-27 # define RCU_NUM_LVLS 1
-28 # define NUM_RCU_LVL_0 1
-29 # define NUM_RCU_NODES NUM_RCU_LVL_0
-30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
-31 # define RCU_NODE_NAME_INIT { "rcu_node_0" }
-32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
-33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" }
-34 #elif NR_CPUS &lt;= RCU_FANOUT_2
-35 # define RCU_NUM_LVLS 2
-36 # define NUM_RCU_LVL_0 1
-37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
-38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
-39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
-40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
-41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
-42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" }
-43 #elif NR_CPUS &lt;= RCU_FANOUT_3
-44 # define RCU_NUM_LVLS 3
-45 # define NUM_RCU_LVL_0 1
-46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
-47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
-48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
-49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
-50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
-51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
-52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
-53 #elif NR_CPUS &lt;= RCU_FANOUT_4
-54 # define RCU_NUM_LVLS 4
-55 # define NUM_RCU_LVL_0 1
-56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
-57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
-58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
-59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
-60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
-61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
-62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
-63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
-64 #else
-65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
-66 #endif
-</pre>
-
-<p>The maximum number of levels in the <tt>rcu_node</tt> structure
-is currently limited to four, as specified by lines&nbsp;21-24
-and the structure of the subsequent &ldquo;if&rdquo; statement.
-For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which
-should be sufficient for the next few years at least.
-For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which
-should see us through the next decade or so.
-This four-level tree also allows kernels built with
-<tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs,
-which might be useful in very large systems having eight CPUs per
-socket (but please note that no one has yet shown any measurable
-performance degradation due to misaligned socket and <tt>rcu_node</tt>
-boundaries).
-In addition, building kernels with a full four levels of <tt>rcu_node</tt>
-tree permits better testing of RCU's combining-tree code.
-
-</p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children
-are permitted at each non-leaf level of the <tt>rcu_node</tt> tree.
-If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified,
-it is set based on the word size of the system, which is also
-the Kconfig default.
-
-</p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are
-handled by each leaf <tt>rcu_node</tt> structure.
-Experience has shown that allowing a given leaf <tt>rcu_node</tt>
-structure to handle 64 CPUs, as permitted by the number of bits in
-the <tt>-&gt;qsmask</tt> field on a 64-bit system, results in
-excessive contention for the leaf <tt>rcu_node</tt> structures'
-<tt>-&gt;lock</tt> fields.
-The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore
-limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>.
-If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value
-selected is based on the word size of the system, just as for
-<tt>CONFIG_RCU_FANOUT</tt>.
-Lines&nbsp;11-19 perform this computation.
-
-</p><p>Lines&nbsp;21-24 compute the maximum number of CPUs supported by
-a single-level (which contains a single <tt>rcu_node</tt> structure),
-two-level, three-level, and four-level <tt>rcu_node</tt> tree,
-respectively, given the fanout specified by <tt>RCU_FANOUT</tt>
-and <tt>RCU_FANOUT_LEAF</tt>.
-These numbers of CPUs are retained in the
-<tt>RCU_FANOUT_1</tt>,
-<tt>RCU_FANOUT_2</tt>,
-<tt>RCU_FANOUT_3</tt>, and
-<tt>RCU_FANOUT_4</tt>
-C-preprocessor variables, respectively.
-
-</p><p>These variables are used to control the C-preprocessor <tt>#if</tt>
-statement spanning lines&nbsp;26-66 that computes the number of
-<tt>rcu_node</tt> structures required for each level of the tree,
-as well as the number of levels required.
-The number of levels is placed in the <tt>RCU_NUM_LVLS</tt>
-C-preprocessor variable by lines&nbsp;27, 35, 44, and&nbsp;54.
-The number of <tt>rcu_node</tt> structures for the topmost level
-of the tree is always exactly one, and this value is unconditionally
-placed into <tt>NUM_RCU_LVL_0</tt> by lines&nbsp;28, 36, 45, and&nbsp;55.
-The rest of the levels (if any) of the <tt>rcu_node</tt> tree
-are computed by dividing the maximum number of CPUs by the
-fanout supported by the number of levels from the current level down,
-rounding up. This computation is performed by lines&nbsp;37,
-46-47, and&nbsp;56-58.
-Lines&nbsp;31-33, 40-42, 50-52, and&nbsp;61-63 create initializers
-for lockdep lock-class names.
-Finally, lines&nbsp;64-66 produce an error if the maximum number of
-CPUs is too large for the specified fanout.
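-
-<p>To make this computation concrete, consider the default 64-bit values
-of <tt>RCU_FANOUT=64</tt> and <tt>RCU_FANOUT_LEAF=16</tt> on a kernel
-built with <tt>NR_CPUS=4096</tt>:
-
-<pre>
-RCU_FANOUT_1 = 16                    (CPUs per leaf)
-RCU_FANOUT_2 = 16 * 64     = 1,024
-RCU_FANOUT_3 = 1,024 * 64  = 65,536
-
-NR_CPUS = 4,096 fits within RCU_FANOUT_3, so RCU_NUM_LVLS = 3
-NUM_RCU_LVL_1 = DIV_ROUND_UP(4,096, 1,024) = 4
-NUM_RCU_LVL_2 = DIV_ROUND_UP(4,096, 16)    = 256
-NUM_RCU_NODES = 1 + 4 + 256                = 261
-</pre>
-
-<p>This is the three-level tree called out at the beginning of this
-document for distributions setting <tt>NR_CPUS=4096</tt>.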
-
-<h3><a name="The rcu_segcblist Structure">
-The <tt>rcu_segcblist</tt> Structure</a></h3>
-
-The <tt>rcu_segcblist</tt> structure maintains a segmented list of
-callbacks as follows:
-
-<pre>
- 1 #define RCU_DONE_TAIL 0
- 2 #define RCU_WAIT_TAIL 1
- 3 #define RCU_NEXT_READY_TAIL 2
- 4 #define RCU_NEXT_TAIL 3
- 5 #define RCU_CBLIST_NSEGS 4
- 6
- 7 struct rcu_segcblist {
- 8 struct rcu_head *head;
- 9 struct rcu_head **tails[RCU_CBLIST_NSEGS];
-10 unsigned long gp_seq[RCU_CBLIST_NSEGS];
-11 long len;
-12 long len_lazy;
-13 };
-</pre>
-
-<p>
-The segments are as follows:
-
-<ol>
-<li> <tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed.
- These callbacks are ready to be invoked.
-<li> <tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the
- current grace period.
-	Note that different CPUs can have different ideas about
-	which grace period is current, hence the <tt>-&gt;gp_seq</tt>
-	field.