-rw-r--r-- Documentation/RCU/Design/Requirements/Requirements.html | 130
-rw-r--r-- Documentation/RCU/checklist.txt | 121
-rw-r--r-- Documentation/RCU/rcu.txt | 9
-rw-r--r-- Documentation/RCU/rcu_dereference.txt | 61
-rw-r--r-- Documentation/RCU/rcubarrier.txt | 5
-rw-r--r-- Documentation/RCU/torture.txt | 20
-rw-r--r-- Documentation/RCU/whatisRCU.txt | 5
-rw-r--r-- Documentation/admin-guide/kernel-parameters.txt | 7
-rw-r--r-- Documentation/core-api/kernel-api.rst | 49
-rw-r--r-- Documentation/memory-barriers.txt | 41
-rw-r--r-- MAINTAINERS | 2
-rw-r--r-- arch/alpha/include/asm/spinlock.h | 5
-rw-r--r-- arch/arc/include/asm/spinlock.h | 5
-rw-r--r-- arch/arm/include/asm/spinlock.h | 16
-rw-r--r-- arch/arm64/include/asm/spinlock.h | 58
-rw-r--r-- arch/arm64/kernel/process.c | 2
-rw-r--r-- arch/blackfin/include/asm/spinlock.h | 5
-rw-r--r-- arch/blackfin/kernel/module.c | 39
-rw-r--r-- arch/hexagon/include/asm/spinlock.h | 5
-rw-r--r-- arch/ia64/include/asm/spinlock.h | 21
-rw-r--r-- arch/m32r/include/asm/spinlock.h | 5
-rw-r--r-- arch/metag/include/asm/spinlock.h | 5
-rw-r--r-- arch/mn10300/include/asm/spinlock.h | 5
-rw-r--r-- arch/parisc/include/asm/spinlock.h | 7
-rw-r--r-- arch/powerpc/include/asm/spinlock.h | 33
-rw-r--r-- arch/s390/include/asm/spinlock.h | 7
-rw-r--r-- arch/sh/include/asm/spinlock-cas.h | 5
-rw-r--r-- arch/sh/include/asm/spinlock-llsc.h | 5
-rw-r--r-- arch/sparc/include/asm/spinlock_32.h | 5
-rw-r--r-- arch/tile/include/asm/spinlock_32.h | 2
-rw-r--r-- arch/tile/include/asm/spinlock_64.h | 2
-rw-r--r-- arch/tile/lib/spinlock_32.c | 23
-rw-r--r-- arch/tile/lib/spinlock_64.c | 22
-rw-r--r-- arch/xtensa/include/asm/spinlock.h | 5
-rw-r--r-- drivers/ata/libata-eh.c | 8
-rw-r--r-- include/asm-generic/qspinlock.h | 14
-rw-r--r-- include/linux/init_task.h | 8
-rw-r--r-- include/linux/rcupdate.h | 14
-rw-r--r-- include/linux/rcutiny.h | 8
-rw-r--r-- include/linux/sched.h | 5
-rw-r--r-- include/linux/spinlock.h | 31
-rw-r--r-- include/linux/spinlock_up.h | 6
-rw-r--r-- include/linux/srcutiny.h | 13
-rw-r--r-- include/linux/srcutree.h | 3
-rw-r--r-- include/linux/swait.h | 55
-rw-r--r-- include/trace/events/rcu.h | 7
-rw-r--r-- include/uapi/linux/membarrier.h | 23
-rw-r--r-- ipc/sem.c | 3
-rw-r--r-- kernel/Makefile | 1
-rw-r--r-- kernel/exit.c | 10
-rw-r--r-- kernel/locking/qspinlock.c | 117
-rw-r--r-- kernel/membarrier.c | 70
-rw-r--r-- kernel/rcu/Kconfig | 3
-rw-r--r-- kernel/rcu/rcu.h | 128
-rw-r--r-- kernel/rcu/rcu_segcblist.c | 108
-rw-r--r-- kernel/rcu/rcu_segcblist.h | 28
-rw-r--r-- kernel/rcu/rcuperf.c | 17
-rw-r--r-- kernel/rcu/rcutorture.c | 83
-rw-r--r-- kernel/rcu/srcutiny.c | 8
-rw-r--r-- kernel/rcu/srcutree.c | 50
-rw-r--r-- kernel/rcu/tiny.c | 2
-rw-r--r-- kernel/rcu/tiny_plugin.h | 47
-rw-r--r-- kernel/rcu/tree.c | 174
-rw-r--r-- kernel/rcu/tree.h | 15
-rw-r--r-- kernel/rcu/tree_exp.h | 2
-rw-r--r-- kernel/rcu/tree_plugin.h | 238
-rw-r--r-- kernel/rcu/update.c | 18
-rw-r--r-- kernel/sched/Makefile | 1
-rw-r--r-- kernel/sched/completion.c | 11
-rw-r--r-- kernel/sched/core.c | 38
-rw-r--r-- kernel/sched/membarrier.c | 152
-rw-r--r-- kernel/task_work.c | 8
-rw-r--r-- kernel/torture.c | 2
-rw-r--r-- net/netfilter/nf_conntrack_core.c | 52
-rwxr-xr-x tools/testing/selftests/rcutorture/bin/config_override.sh | 61
-rw-r--r-- tools/testing/selftests/rcutorture/bin/functions.sh | 27
-rwxr-xr-x tools/testing/selftests/rcutorture/bin/kvm-build.sh | 11
-rwxr-xr-x tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 58
-rwxr-xr-x tools/testing/selftests/rcutorture/bin/kvm.sh | 34
-rw-r--r-- tools/testing/selftests/rcutorture/configs/rcu/BUSTED.boot | 2
-rw-r--r-- tools/testing/selftests/rcutorture/configs/rcu/SRCU-C.boot | 1
-rw-r--r-- tools/testing/selftests/rcutorture/configs/rcu/SRCU-u | 3
-rw-r--r-- tools/testing/selftests/rcutorture/configs/rcu/TREE01.boot | 2
-rw-r--r-- tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 2
84 files changed, 1207 insertions, 1312 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 95b30fa25d56..62e847bcdcdd 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2080,6 +2080,8 @@ Some of the relevant points of interest are as follows:
<li> <a href="#Scheduler and RCU">Scheduler and RCU</a>.
<li> <a href="#Tracing and RCU">Tracing and RCU</a>.
<li> <a href="#Energy Efficiency">Energy Efficiency</a>.
+<li> <a href="#Scheduling-Clock Interrupts and RCU">
+ Scheduling-Clock Interrupts and RCU</a>.
<li> <a href="#Memory Efficiency">Memory Efficiency</a>.
<li> <a href="#Performance, Scalability, Response Time, and Reliability">
Performance, Scalability, Response Time, and Reliability</a>.
@@ -2532,6 +2534,134 @@ I learned of many of these requirements via angry phone calls:
Flaming me on the Linux-kernel mailing list was apparently not
sufficient to fully vent their ire at RCU's energy-efficiency bugs!
+<h3><a name="Scheduling-Clock Interrupts and RCU">
+Scheduling-Clock Interrupts and RCU</a></h3>
+
+<p>
+The kernel transitions between in-kernel non-idle execution, userspace
+execution, and the idle loop.
+Depending on kernel configuration, RCU handles these states differently:
+
+<table border=3>
+<tr><th><tt>HZ</tt> Kconfig</th>
+ <th>In-Kernel</th>
+ <th>Usermode</th>
+ <th>Idle</th></tr>
+<tr><th align="left"><tt>HZ_PERIODIC</tt></th>
+ <td>Can rely on scheduling-clock interrupt.</td>
+ <td>Can rely on scheduling-clock interrupt and its
+ detection of interrupt from usermode.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td></tr>
+<tr><th align="left"><tt>NO_HZ_IDLE</tt></th>
+ <td>Can rely on scheduling-clock interrupt.</td>
+ <td>Can rely on scheduling-clock interrupt and its
+ detection of interrupt from usermode.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td></tr>
+<tr><th align="left"><tt>NO_HZ_FULL</tt></th>
+ <td>Can only sometimes rely on scheduling-clock interrupt.
+ In other cases, it is necessary to bound kernel execution
+ times and/or use IPIs.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td></tr>
+</table>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the
+ scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt>
+ and <tt>NO_HZ_IDLE</tt> do?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ Because, as a performance optimization, <tt>NO_HZ_FULL</tt>
+ does not necessarily re-enable the scheduling-clock interrupt
+ on entry to each and every system call.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+However, RCU must be reliably informed as to whether any given
+CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>,
+also whether that CPU is executing in usermode, as discussed
+<a href="#Energy Efficiency">earlier</a>.
+It also requires that the scheduling-clock interrupt be enabled when
+RCU needs it to be:
+
+<ol>
+<li> If a CPU is either idle or executing in usermode, and RCU believes
+ it is non-idle, the scheduling-clock tick had better be running.
+ Otherwise, you will get RCU CPU stall warnings. Or at best,
+ very long (11-second) grace periods, with a pointless IPI waking
+ the CPU from time to time.
+<li> If a CPU is in a portion of the kernel that executes RCU read-side
+ critical sections, and RCU believes this CPU to be idle, you will get
+ random memory corruption. <b>DON'T DO THIS!!!</b>
+
+ <br>This is one reason to test with lockdep, which will complain
+ about this sort of thing.
+<li> If a CPU is in a portion of the kernel that is absolutely
+ positively no-joking guaranteed to never execute any RCU read-side
+ critical sections, and RCU believes this CPU to be idle,
+ no problem. This sort of thing is used by some architectures
+ for light-weight exception handlers, which can then avoid the
+ overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt>
+ at exception entry and exit, respectively.
+ Some go further and avoid the entireties of <tt>irq_enter()</tt>
+ and <tt>irq_exit()</tt>.
+
+ <br>Just make very sure you are running some of your tests with
+ <tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths
+ was in fact joking about not doing RCU read-side critical sections.
+<li> If a CPU is executing in the kernel with the scheduling-clock
+ interrupt disabled and RCU believes this CPU to be non-idle,
+ and if the CPU goes idle (from an RCU perspective) every few
+ jiffies, no problem. It is usually OK for there to be the
+ occasional gap between idle periods of up to a second or so.
+
+ <br>If the gap grows too long, you get RCU CPU stall warnings.
+<li> If a CPU is either idle or executing in usermode, and RCU believes
+ it to be idle, of course no problem.
+<li> If a CPU is executing in the kernel, the kernel code
+ path is passing through quiescent states at a reasonable
+ frequency (preferably about once per few jiffies, but the
+ occasional excursion to a second or so is usually OK) and the
+ scheduling-clock interrupt is enabled, of course no problem.
+
+ <br>If the gap between a successive pair of quiescent states grows
+ too long, you get RCU CPU stall warnings.
+</ol>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ But what if my driver has a hardware interrupt handler
+ that can run for many seconds?
+ I cannot invoke <tt>schedule()</tt> from a hardware
+ interrupt handler, after all!
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt>
+ every so often.
+ But given that long-running interrupt handlers can cause
+ other problems, not least for response time, shouldn't you
+ work to keep your interrupt handler's runtime within reasonable
+ bounds?
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+But as long as RCU is properly informed of kernel state transitions between
+in-kernel execution, usermode execution, and idle, and as long as the
+scheduling-clock interrupt is enabled when RCU needs it to be, you
+can rest assured that the bugs you encounter will be in some other
+part of RCU or some other part of the kernel!
+
<h3><a name="Memory Efficiency">Memory Efficiency</a></h3>
<p>
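The six numbered cases added to Requirements.html above can be read as a small decision table. What follows is a minimal userspace sketch in C of that table, not kernel code. All names in it (cpu_ctx, rcu_outcome, cpu_state, classify_cpu()) are hypothetical, and the frequent_qs flag merges "goes idle every few jiffies" and "passes through quiescent states every few jiffies" into one input, which is a simplification of the text.

```c
#include <stdbool.h>

/* Hypothetical model of what the CPU is really doing. */
enum cpu_ctx {
	CTX_IDLE_OR_USER,      /* idle loop or usermode execution */
	CTX_KERNEL_READERS,    /* kernel code that may run RCU readers */
	CTX_KERNEL_NO_READERS, /* kernel code guaranteed never to run RCU readers */
};

/* Hypothetical outcomes named after the consequences in the list. */
enum rcu_outcome {
	RCU_OK,                /* "no problem" */
	RCU_STALL_RISK,        /* RCU CPU stall warnings or very long grace periods */
	RCU_MEMORY_CORRUPTION, /* "random memory corruption" */
};

struct cpu_state {
	enum cpu_ctx ctx;
	bool rcu_thinks_idle; /* RCU's belief about this CPU */
	bool tick_enabled;    /* scheduling-clock interrupt running */
	bool frequent_qs;     /* idle periods or quiescent states every few jiffies */
};

enum rcu_outcome classify_cpu(struct cpu_state s)
{
	if (s.rcu_thinks_idle) {
		/* Item 2: readers on a CPU that RCU ignores. */
		if (s.ctx == CTX_KERNEL_READERS)
			return RCU_MEMORY_CORRUPTION;
		/* Items 3 and 5: nothing for RCU to miss. */
		return RCU_OK;
	}

	/* From here on, RCU believes the CPU is non-idle. */
	if (s.ctx == CTX_IDLE_OR_USER) {
		/* Item 1: the tick must run so RCU can notice the quiescent state. */
		return s.tick_enabled ? RCU_OK : RCU_STALL_RISK;
	}

	/* Items 4 and 6: in-kernel execution needs frequent quiescent states. */
	return s.frequent_qs ? RCU_OK : RCU_STALL_RISK;
}
```

The only combination that maps to memory corruption is item 2, which is why the text singles it out. Every other mismatch degrades to stall warnings at worst.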
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 6beda556faf3..49747717d905 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -23,6 +23,14 @@ over a rather long period of time, but improvements are always welcome!
Yet another exception is where the low real-time latency of RCU's
read-side primitives is critically important.
+ One final exception is where RCU readers are used to prevent
+ the ABA problem (https://en.wikipedia.org/wiki/ABA_problem)
+ for lockless updates. This does result in the mildly
+ counter-intuitive situation where rcu_read_lock() and
+ rcu_read_unlock() are used to protect updates; however, this
+ approach provides the same potential simplifications that garbage
+ collectors do.
+
1. Does the update code have proper mutual exclusion?
RCU does allow -readers- to run (almost) naked, but -writers- must
@@ -40,7 +48,9 @@ over a rather long period of time, but improvements are always welcome!
explain how this single task does not become a major bottleneck on
big multiprocessor machines (for example, if the task is updating
information relating to itself that other tasks can read, there
- by definition can be no bottleneck).
+ by definition can be no bottleneck). Note that the definition
+ of "large" has changed significantly: Eight CPUs was "large"
+ in the year 2000, but a hundred CPUs was unremarkable in 2017.
2. Do the RCU read-side critical sections make proper use of
rcu_read_lock() and friends? These primitives are needed
@@ -55,6 +65,12 @@ over a rather long period of time, but improvements are always welcome!
Disabling of preemption can serve as rcu_read_lock_sched(), but
is less readable.
+ Letting RCU-protected pointers "leak" out of an RCU read-side
+ critical section is every bit as bad as letting them leak out
+ from under a lock. Unless, of course, you have arranged some
+ other means of protection, such as a lock or a reference count
+ -before- letting them out of the RCU read-side critical section.
+
3. Does the update code tolerate concurrent accesses?
The whole point of RCU is to permit readers to run without
@@ -78,10 +94,10 @@ over a rather long period of time, but improvements are always welcome!
This works quite well, also.
- c. Make updates appear atomic to readers. For example,
+ c. Make updates appear atomic to readers. For example,
pointer updates to properly aligned fields will
appear atomic, as will individual atomic primitives.
- Sequences of perations performed under a lock will -not-
+ Sequences of operations performed under a lock will -not-
appear to be atomic to RCU readers, nor will sequences
of multiple atomic primitives.
@@ -168,8 +184,8 @@ over a rather long period of time, but improvements are always welcome!
5. If call_rcu(), or a related primitive such as call_rcu_bh(),
call_rcu_sched(), or call_srcu() is used, the callback function
- must be written to be called from softirq context. In particular,
- it cannot block.
+ will be called from softirq context. In particular, it cannot
+ block.
6. Since synchronize_rcu() can block, it cannot be called from
any sort of irq context. The same rule applies for
@@ -178,11 +194,14 @@ over a rather long period of time, but improvements are always welcome!
synchronize_sched_expedited(), and synchronize_srcu_expedited().
The expedited forms of these primitives have the same semantics
- as the non-expedited forms, but expediting is both expensive
- and unfriendly to real-time workloads. Use of the expedited
- primitives should be restricted to rare configuration-change
- operations that would not normally be undertaken while a real-time
- workload is running.
+ as the non-expedited forms, but expediting is both expensive and
+ (with the exception of synchronize_srcu_expedited()) unfriendly
+ to real-time workloads. Use of the expedited primitives should
+ be restricted to rare configuration-change operations that would
+ not normally be undertaken while a real-time workload is running.
+ However, real-time workloads can use the rcupdate.rcu_normal kernel
+ boot parameter to completely disable expedited grace periods,
+ though this might have performance implications.
In particular, if you find yourself invoking one of the expedited
primitives repeatedly in a loop, please do everyone a favor:
@@ -193,11 +212,6 @@ over a rather long period of time, but improvements are always welcome!
of the system, especially to real-time workloads running on
the rest of the system.
- In addition, it is illegal to call the expedited forms from
- a CPU-hotplug notifier, or while holding a lock that is acquired
- by a CPU-hotplug notifier. Failing to observe this restriction
- will result in deadlock.
-
7. If the updater uses call_rcu() or synchronize_rcu(), then the
corresponding readers must use rcu_read_lock() and
rcu_read_unlock(). If the updater uses call_rcu_bh() or
@@ -321,7 +335,7 @@ over a rather long period of time, but improvements are always welcome!
Similarly, disabling preemption is not an acceptable substitute
for rcu_read_lock(). Code that attempts to use preemption
disabling where it should be using rcu_read_lock() will break
- in real-time kernel builds.
+ in CONFIG_PREEMPT=y kernel builds.
If you want to wait for interrupt handlers, NMI handlers, and
code under the influence of preempt_disable(), you instead
@@ -356,23 +370,22 @@ over a rather long period of time, but improvements are always welcome!
not the case, a self-spawning RCU callback would prevent the
victim CPU from ever going offline.)
-14. SRCU (srcu_read_lock(), srcu_read_unlock(), srcu_dereference(),
- synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu())
- may only be invoked from process context. Unlike other forms of
- RCU, it -is- permissible to block in an SRCU read-side critical
- section (demarked by srcu_read_lock() and srcu_read_unlock()),
- hence the "SRCU": "sleepable RCU". Please note that if you
- don't need to sleep in read-side critical sections, you should be
- using RCU rather than SRCU, because RCU is almost always faster
- and easier to use than is SRCU.
-
- Also unlike other forms of RCU, explicit initialization
- and cleanup is required via init_srcu_struct() and
- cleanup_srcu_struct(). These are passed a "struct srcu_struct"
- that defines the scope of a given SRCU domain. Once initialized,
- the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock()
- synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu().
- A given synchronize_srcu() waits only for SRCU read-side critical
+14. Unlike other forms of RCU, it -is- permissible to block in an
+ SRCU read-side critical section (demarked by srcu_read_lock()
+ and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
+ Please note that if you don't need to sleep in read-side critical
+ sections, you should be using RCU rather than SRCU, because RCU
+ is almost always faster and easier to use than is SRCU.
+
+ Also unlike other forms of RCU, explicit initialization and
+ cleanup is required either at build time via DEFINE_SRCU()
+ or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct()
+ and cleanup_srcu_struct(). These last two are passed a
+ "struct srcu_struct" that defines the scope of a given
+ SRCU domain. Once initialized, the srcu_struct is passed
+ to srcu_read_lock(), srcu_read_unlock(), synchronize_srcu(),
+ synchronize_srcu_expedited(), and call_srcu(). A given
+ synchronize_srcu() waits only for SRCU read-side critical
sections governed by srcu_read_lock() and srcu_read_unlock()
calls that have been passed the same srcu_struct. This property
is what makes sleeping read-side critical sections tolerable --
@@ -390,10 +403,16 @@ over a rather long period of time, but improvements are always welcome!
Therefore, SRCU should be used in preference to rw_semaphore
only in extremely read-intensive situations, or in situations
requiring SRCU's read-side deadlock immunity or low read-side
- realtime latency.
+ realtime latency. You should also consider percpu_rw_semaphore
+ when you need lightweight readers.
- Note that, rcu_assign_pointer() relates to SRCU just as it does
- to other forms of RCU.
+ SRCU's expedited primitive (synchronize_srcu_expedited())
+ never sends IPIs to other CPUs, so it is easier on
+ real-time workloads than is synchronize_rcu_expedited(),
+ synchronize_rcu_bh_expedited() or synchronize_sched_expedited().
+
+ Note that rcu_dereference() and rcu_assign_pointer() relate to
+ SRCU just as they do to other forms of RCU.
15. The whole point of call_rcu(), synchronize_rcu(), and friends
is to wait until all pre-existing readers have finished before
@@ -435,3 +454,33 @@ over a rather long period of time, but improvements are always welcome!
These debugging aids can help you find problems that are
otherwise extremely difficult to spot.
+
+18. If you register a callback using call_rcu(), call_rcu_bh(),
+ call_rcu_sched(), or call_srcu(), and pass in a function defined
+ within a loadable module, then it is necessary to wait for
+ all pending callbacks to be invoked after the last invocation
+ and before unloading that module. Note that it is absolutely
+ -not- sufficient to wait for a grace period! The current (say)
+ synchronize_rcu() implementation waits only for all previous
+ callbacks registered on the CPU that synchronize_rcu() is running
+ on, but it is -not- guaranteed to wait for callbacks registered
+ on other CPUs.
+
+ You instead need to use one of the barrier functions:
+
+ o call_rcu() -> rcu_barrier()
+ o call_rcu_bh() -> rcu_barrier_bh()
+ o call_rcu_sched() -> rcu_barrier_sched()
+ o call_srcu() -> srcu_barrier()
+
+ However, these barrier functions are absolutely -not- guaranteed
+ to wait for a grace period. In fact, if there are no call_rcu()
+ callbacks waiting anywhere in the system, rcu_barrier() is within
+ its rights to return immediately.
+
+ So if you need to wait for both an RCU grace period and for
+ all pre-existing call_rcu() callbacks, you will need to execute
+ both rcu_barrier() and synchronize_rcu(), if necessary, using
+ something like workqueues to execute them concurrently.
+
+ See rcubarrier.txt for more information.
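The ordering that new checklist item 18 asks for (stop posting callbacks, wait for all pending callbacks, only then unload) can be sketched with a userspace toy model in C. This is not kernel code: call_rcu_model(), rcu_barrier_model(), module_exit_model(), and the per-CPU arrays are hypothetical stand-ins that only mimic the roles of call_rcu(), rcu_barrier(), and a module exit function. Grace periods are not modeled at all.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4 /* hypothetical number of simulated CPUs */
#define MAX_CBS       8 /* hypothetical per-CPU queue capacity */

typedef void (*cb_fn)(void *arg);

struct cb {
	cb_fn fn;
	void *arg;
};

/* Simulated per-CPU callback queues. */
struct cb cb_queue[NR_CPUS_MODEL][MAX_CBS];
int cb_queued[NR_CPUS_MODEL];

bool module_text_present = true; /* false once the "module" is unloaded */
int cbs_run;                     /* callbacks invoked in total */
int cbs_run_after_unload;        /* callbacks invoked too late */

/* Stand-in for a callback function that lives in module text. */
void module_cb(void *arg)
{
	(void)arg;
	if (!module_text_present)
		cbs_run_after_unload++; /* in a real kernel: a jump into freed text */
	cbs_run++;
}

/* Stand-in for call_rcu(): queue a callback on the given simulated CPU. */
int call_rcu_model(int cpu, cb_fn fn, void *arg)
{
	if (cpu < 0 || cpu >= NR_CPUS_MODEL || cb_queued[cpu] >= MAX_CBS)
		return -1;
	cb_queue[cpu][cb_queued[cpu]].fn = fn;
	cb_queue[cpu][cb_queued[cpu]].arg = arg;
	cb_queued[cpu]++;
	return 0;
}

/* Stand-in for rcu_barrier(): invoke every callback queued on every CPU. */
void rcu_barrier_model(void)
{
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
		for (int i = 0; i < cb_queued[cpu]; i++)
			cb_queue[cpu][i].fn(cb_queue[cpu][i].arg);
		cb_queued[cpu] = 0;
	}
}

/* Number of callbacks still pending across all simulated CPUs. */
int pending_callbacks(void)
{
	int total = 0;

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		total += cb_queued[cpu];
	return total;
}

/* Exit path: the caller must already have stopped posting new callbacks. */
void module_exit_model(void)
{
	rcu_barrier_model();         /* wait for callbacks on all CPUs */
	module_text_present = false; /* only now is unloading safe */
}
```

In this model, queueing module_cb on two different simulated CPUs and then calling module_exit_model() leaves no pending callbacks and never runs a callback after the simulated unload. Swapping the two statements in module_exit_model() would make cbs_run_after_unload nonzero, which is the failure item 18 warns about.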
diff --git