diff options
84 files changed, 1207 insertions, 1312 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index 95b30fa25d56..62e847bcdcdd 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -2080,6 +2080,8 @@ Some of the relevant points of interest are as follows: <li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. <li> <a href="#Tracing and RCU">Tracing and RCU</a>. <li> <a href="#Energy Efficiency">Energy Efficiency</a>. +<li> <a href="#Scheduling-Clock Interrupts and RCU"> + Scheduling-Clock Interrupts and RCU</a>. <li> <a href="#Memory Efficiency">Memory Efficiency</a>. <li> <a href="#Performance, Scalability, Response Time, and Reliability"> Performance, Scalability, Response Time, and Reliability</a>. @@ -2532,6 +2534,134 @@ I learned of many of these requirements via angry phone calls: Flaming me on the Linux-kernel mailing list was apparently not sufficient to fully vent their ire at RCU's energy-efficiency bugs! +<h3><a name="Scheduling-Clock Interrupts and RCU"> +Scheduling-Clock Interrupts and RCU</a></h3> + +<p> +The kernel transitions between in-kernel non-idle execution, userspace +execution, and the idle loop. +Depending on kernel configuration, RCU handles these states differently: + +<table border=3> +<tr><th><tt>HZ</tt> Kconfig</th> + <th>In-Kernel</th> + <th>Usermode</th> + <th>Idle</th></tr> +<tr><th align="left"><tt>HZ_PERIODIC</tt></th> + <td>Can rely on scheduling-clock interrupt.</td> + <td>Can rely on scheduling-clock interrupt and its + detection of interrupt from usermode.</td> + <td>Can rely on RCU's dyntick-idle detection.</td></tr> +<tr><th align="left"><tt>NO_HZ_IDLE</tt></th> + <td>Can rely on scheduling-clock interrupt.</td> + <td>Can rely on scheduling-clock interrupt and its + detection of interrupt from usermode.</td> + <td>Can rely on RCU's dyntick-idle detection.</td></tr> +<tr><th align="left"><tt>NO_HZ_FULL</tt></th> + <td>Can only sometimes rely on scheduling-clock interrupt. + In other cases, it is necessary to bound kernel execution + times and/or use IPIs.</td> + <td>Can rely on RCU's dyntick-idle detection.</td> + <td>Can rely on RCU's dyntick-idle detection.</td></tr> +</table> + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the + scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt> + and <tt>NO_HZ_IDLE</tt> do? +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + Because, as a performance optimization, <tt>NO_HZ_FULL</tt> + does not necessarily re-enable the scheduling-clock interrupt + on entry to each and every system call. +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +However, RCU must be reliably informed as to whether any given +CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>, +also whether that CPU is executing in usermode, as discussed +<a href="#Energy Efficiency">earlier</a>. +It also requires that the scheduling-clock interrupt be enabled when +RCU needs it to be: + +<ol> +<li> If a CPU is either idle or executing in usermode, and RCU believes + it is non-idle, the scheduling-clock tick had better be running. + Otherwise, you will get RCU CPU stall warnings. Or at best, + very long (11-second) grace periods, with a pointless IPI waking + the CPU from time to time. +<li> If a CPU is in a portion of the kernel that executes RCU read-side + critical sections, and RCU believes this CPU to be idle, you will get + random memory corruption. <b>DON'T DO THIS!!!</b> + + <br>This is one reason to test with lockdep, which will complain + about this sort of thing. +<li> If a CPU is in a portion of the kernel that is absolutely + positively no-joking guaranteed to never execute any RCU read-side + critical sections, and RCU believes this CPU to to be idle, + no problem. This sort of thing is used by some architectures + for light-weight exception handlers, which can then avoid the + overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> + at exception entry and exit, respectively. + Some go further and avoid the entireties of <tt>irq_enter()</tt> + and <tt>irq_exit()</tt>. + + <br>Just make very sure you are running some of your tests with + <tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths + was in fact joking about not doing RCU read-side critical sections. +<li> If a CPU is executing in the kernel with the scheduling-clock + interrupt disabled and RCU believes this CPU to be non-idle, + and if the CPU goes idle (from an RCU perspective) every few + jiffies, no problem. It is usually OK for there to be the + occasional gap between idle periods of up to a second or so. + + <br>If the gap grows too long, you get RCU CPU stall warnings. +<li> If a CPU is either idle or executing in usermode, and RCU believes + it to be idle, of course no problem. +<li> If a CPU is executing in the kernel, the kernel code + path is passing through quiescent states at a reasonable + frequency (preferably about once per few jiffies, but the + occasional excursion to a second or so is usually OK) and the + scheduling-clock interrupt is enabled, of course no problem. + + <br>If the gap between a successive pair of quiescent states grows + too long, you get RCU CPU stall warnings. +</ol> + +<table> +<tr><th> </th></tr> +<tr><th align="left">Quick Quiz:</th></tr> +<tr><td> + But what if my driver has a hardware interrupt handler + that can run for many seconds? + I cannot invoke <tt>schedule()</tt> from an hardware + interrupt handler, after all! +</td></tr> +<tr><th align="left">Answer:</th></tr> +<tr><td bgcolor="#ffffff"><font color="ffffff"> + One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt> + every so often. + But given that long-running interrupt handlers can cause + other problems, not least for response time, shouldn't you + work to keep your interrupt handler's runtime within reasonable + bounds? +</font></td></tr> +<tr><td> </td></tr> +</table> + +<p> +But as long as RCU is properly informed of kernel state transitions between +in-kernel execution, usermode execution, and idle, and as long as the +scheduling-clock interrupt is enabled when RCU needs it to be, you +can rest assured that the bugs you encounter will be in some other +part of RCU or some other part of the kernel! + <h3><a name="Memory Efficiency">Memory Efficiency</a></h3> <p> diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 6beda556faf3..49747717d905 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -23,6 +23,14 @@ over a rather long period of time, but improvements are always welcome! Yet another exception is where the low real-time latency of RCU's read-side primitives is critically important. + One final exception is where RCU readers are used to prevent + the ABA problem (https://en.wikipedia.org/wiki/ABA_problem) + for lockless updates. This does result in the mildly + counter-intuitive situation where rcu_read_lock() and + rcu_read_unlock() are used to protect updates, however, this + approach provides the same potential simplifications that garbage + collectors do. + 1. Does the update code have proper mutual exclusion? RCU does allow -readers- to run (almost) naked, but -writers- must @@ -40,7 +48,9 @@ over a rather long period of time, but improvements are always welcome! explain how this single task does not become a major bottleneck on big multiprocessor machines (for example, if the task is updating information relating to itself that other tasks can read, there - by definition can be no bottleneck). + by definition can be no bottleneck). Note that the definition + of "large" has changed significantly: Eight CPUs was "large" + in the year 2000, but a hundred CPUs was unremarkable in 2017. 2. Do the RCU read-side critical sections make proper use of rcu_read_lock() and friends? These primitives are needed @@ -55,6 +65,12 @@ over a rather long period of time, but improvements are always welcome! Disabling of preemption can serve as rcu_read_lock_sched(), but is less readable. + Letting RCU-protected pointers "leak" out of an RCU read-side + critical section is every bid as bad as letting them leak out + from under a lock. Unless, of course, you have arranged some + other means of protection, such as a lock or a reference count + -before- letting them out of the RCU read-side critical section. + 3. Does the update code tolerate concurrent accesses? The whole point of RCU is to permit readers to run without @@ -78,10 +94,10 @@ over a rather long period of time, but improvements are always welcome! This works quite well, also. - c. Make updates appear atomic to readers. For example, + c. Make updates appear atomic to readers. For example, pointer updates to properly aligned fields will appear atomic, as will individual atomic primitives. - Sequences of perations performed under a lock will -not- + Sequences of operations performed under a lock will -not- appear to be atomic to RCU readers, nor will sequences of multiple atomic primitives. @@ -168,8 +184,8 @@ over a rather long period of time, but improvements are always welcome! 5. If call_rcu(), or a related primitive such as call_rcu_bh(), call_rcu_sched(), or call_srcu() is used, the callback function - must be written to be called from softirq context. In particular, - it cannot block. + will be called from softirq context. In particular, it cannot + block. 6. Since synchronize_rcu() can block, it cannot be called from any sort of irq context. The same rule applies for @@ -178,11 +194,14 @@ over a rather long period of time, but improvements are always welcome! synchronize_sched_expedite(), and synchronize_srcu_expedited(). The expedited forms of these primitives have the same semantics - as the non-expedited forms, but expediting is both expensive - and unfriendly to real-time workloads. Use of the expedited - primitives should be restricted to rare configuration-change - operations that would not normally be undertaken while a real-time - workload is running. + as the non-expedited forms, but expediting is both expensive and + (with the exception of synchronize_srcu_expedited()) unfriendly + to real-time workloads. Use of the expedited primitives should + be restricted to rare configuration-change operations that would + not normally be undertaken while a real-time workload is running. + However, real-time workloads can use rcupdate.rcu_normal kernel + boot parameter to completely disable expedited grace periods, + though this might have performance implications. In particular, if you find yourself invoking one of the expedited primitives repeatedly in a loop, please do everyone a favor: @@ -193,11 +212,6 @@ over a rather long period of time, but improvements are always welcome! of the system, especially to real-time workloads running on the rest of the system. - In addition, it is illegal to call the expedited forms from - a CPU-hotplug notifier, or while holding a lock that is acquired - by a CPU-hotplug notifier. Failing to observe this restriction - will result in deadlock. - 7. If the updater uses call_rcu() or synchronize_rcu(), then the corresponding readers must use rcu_read_lock() and rcu_read_unlock(). If the updater uses call_rcu_bh() or @@ -321,7 +335,7 @@ over a rather long period of time, but improvements are always welcome! Similarly, disabling preemption is not an acceptable substitute for rcu_read_lock(). Code that attempts to use preemption disabling where it should be using rcu_read_lock() will break - in real-time kernel builds. + in CONFIG_PREEMPT=y kernel builds. If you want to wait for interrupt handlers, NMI handlers, and code under the influence of preempt_disable(), you instead @@ -356,23 +370,22 @@ over a rather long period of time, but improvements are always welcome! not the case, a self-spawning RCU callback would prevent the victim CPU from ever going offline.) -14. SRCU (srcu_read_lock(), srcu_read_unlock(), srcu_dereference(), - synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu()) - may only be invoked from process context. Unlike other forms of - RCU, it -is- permissible to block in an SRCU read-side critical - section (demarked by srcu_read_lock() and srcu_read_unlock()), - hence the "SRCU": "sleepable RCU". Please note that if you - don't need to sleep in read-side critical sections, you should be - using RCU rather than SRCU, because RCU is almost always faster - and easier to use than is SRCU. - - Also unlike other forms of RCU, explicit initialization - and cleanup is required via init_srcu_struct() and - cleanup_srcu_struct(). These are passed a "struct srcu_struct" - that defines the scope of a given SRCU domain. Once initialized, - the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock() - synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu(). - A given synchronize_srcu() waits only for SRCU read-side critical +14. Unlike other forms of RCU, it -is- permissible to block in an + SRCU read-side critical section (demarked by srcu_read_lock() + and srcu_read_unlock()), hence the "SRCU": "sleepable RCU". + Please note that if you don't need to sleep in read-side critical + sections, you should be using RCU rather than SRCU, because RCU + is almost always faster and easier to use than is SRCU. + + Also unlike other forms of RCU, explicit initialization and + cleanup is required either at build time via DEFINE_SRCU() + or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct() + and cleanup_srcu_struct(). These last two are passed a + "struct srcu_struct" that defines the scope of a given + SRCU domain. Once initialized, the srcu_struct is passed + to srcu_read_lock(), srcu_read_unlock() synchronize_srcu(), + synchronize_srcu_expedited(), and call_srcu(). A given + synchronize_srcu() waits only for SRCU read-side critical sections governed by srcu_read_lock() and srcu_read_unlock() calls that have been passed the same srcu_struct. This property is what makes sleeping read-side critical sections tolerable -- @@ -390,10 +403,16 @@ over a rather long period of time, but improvements are always welcome! Therefore, SRCU should be used in preference to rw_semaphore only in extremely read-intensive situations, or in situations requiring SRCU's read-side deadlock immunity or low read-side - realtime latency. + realtime latency. You should also consider percpu_rw_semaphore + when you need lightweight readers. - Note that, rcu_assign_pointer() relates to SRCU just as it does - to other forms of RCU. + SRCU's expedited primitive (synchronize_srcu_expedited()) + never sends IPIs to other CPUs, so it is easier on + real-time workloads than is synchronize_rcu_expedited(), + synchronize_rcu_bh_expedited() or synchronize_sched_expedited(). + + Note that rcu_dereference() and rcu_assign_pointer() relate to + SRCU just as they do to other forms of RCU. 15. The whole point of call_rcu(), synchronize_rcu(), and friends is to wait until all pre-existing readers have finished before @@ -435,3 +454,33 @@ over a rather long period of time, but improvements are always welcome! These debugging aids can help you find problems that are otherwise extremely difficult to spot. + +18. If you register a callback using call_rcu(), call_rcu_bh(), + call_rcu_sched(), or call_srcu(), and pass in a function defined + within a loadable module, then it in necessary to wait for + all pending callbacks to be invoked after the last invocation + and before unloading that module. Note that it is absolutely + -not- sufficient to wait for a grace period! The current (say) + synchronize_rcu() implementation waits only for all previous + callbacks registered on the CPU that synchronize_rcu() is running + on, but it is -not- guaranteed to wait for callbacks registered + on other CPUs. + + You instead need to use one of the barrier functions: + + o call_rcu() -> rcu_barrier() + o call_rcu_bh() -> rcu_barrier_bh() + o call_rcu_sched() -> rcu_barrier_sched() + o call_srcu() -> srcu_barrier() + + However, these barrier functions are absolutely -not- guaranteed + to wait for a grace period. In fact, if there are no call_rcu() + callbacks waiting anywhere in the system, rcu_barrier() is within + its rights to return immediately. + + So if you need to wait for both an RCU grace period and for + all pre-existing call_rcu() callbacks, you will need to execute + both rcu_barrier() and synchronize_rcu(), if necessary, using + something like workqueues to to execute them concurrently. + + See rcubarrier.txt for more information. diff --git |