summaryrefslogtreecommitdiffstats
path: root/Documentation/memory-barriers.txt
diff options
context:
space:
mode:
authorDavid Howells <dhowells@redhat.com>2006-03-31 16:00:29 +0100
committerLinus Torvalds <torvalds@g5.osdl.org>2006-03-31 12:27:01 -0800
commit108b42b4b966462444265806e3d7260a632ad630 (patch)
tree8565bb700a0a4bb6456f1955a9c2ec69e4df8927 /Documentation/memory-barriers.txt
parent428622986858aebddc32d022af65e88b9d2ea8bb (diff)
[PATCH] Document Linux's memory barriers [try #7]
The attached patch documents the Linux kernel's memory barriers. I've updated it from the comments I've been given. The per-arch notes sections are gone because it's clear that there are so many exceptions, that it's not worth having them. I've added a list of references to other documents. I've tried to get rid of the concept of memory accesses appearing on the bus; what matters is apparent behaviour with respect to other observers in the system. Interrupts barrier effects are now considered to be non-existent. They may be there, but you may not rely on them. I've added a couple of definition sections at the top of the document: one to specify the minimum execution model that may be assumed, the other to specify what this document refers to by the term "memory". I've made greater mention of the use of mmiowb(). I've adjusted the way in which caches are described, and described the fun that can be had with cache coherence maintenance being unordered and data dependency not being necessarily implicit. I've described (smp_)read_barrier_depends(). I've rearranged the order of the sections, so that memory barriers are discussed in abstract first, and then described the memory barrier facilities available on Linux, before going on to more real-world discussions and examples. I've added information about the lack of memory barriering effects with atomic ops and bitops. I've added information about control dependencies. I've added more diagrams to illustrate caching interactions between CPUs. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Diffstat (limited to 'Documentation/memory-barriers.txt')
-rw-r--r--Documentation/memory-barriers.txt1913
1 files changed, 1913 insertions, 0 deletions
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 000000000000..f8550310a6d5
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,1913 @@
+ ============================
+ LINUX KERNEL MEMORY BARRIERS
+ ============================
+
+By: David Howells <dhowells@redhat.com>
+
+Contents:
+
+ (*) Abstract memory access model.
+
+ - Device operations.
+ - Guarantees.
+
+ (*) What are memory barriers?
+
+ - Varieties of memory barrier.
+ - What may not be assumed about memory barriers?
+ - Data dependency barriers.
+ - Control dependencies.
+ - SMP barrier pairing.
+ - Examples of memory barrier sequences.
+
+ (*) Explicit kernel barriers.
+
+ - Compiler barrier.
+ - The CPU memory barriers.
+ - MMIO write barrier.
+
+ (*) Implicit kernel memory barriers.
+
+ - Locking functions.
+ - Interrupt disabling functions.
+ - Miscellaneous functions.
+
+ (*) Inter-CPU locking barrier effects.
+
+ - Locks vs memory accesses.
+ - Locks vs I/O accesses.
+
+ (*) Where are memory barriers needed?
+
+ - Interprocessor interaction.
+ - Atomic operations.
+ - Accessing devices.
+ - Interrupts.
+
+ (*) Kernel I/O barrier effects.
+
+ (*) Assumed minimum execution ordering model.
+
+ (*) The effects of the cpu cache.
+
+ - Cache coherency.
+ - Cache coherency vs DMA.
+ - Cache coherency vs MMIO.
+
+ (*) The things CPUs get up to.
+
+ - And then there's the Alpha.
+
+ (*) References.
+
+
+============================
+ABSTRACT MEMORY ACCESS MODEL
+============================
+
+Consider the following abstract model of the system:
+
+ : :
+ : :
+ : :
+ +-------+ : +--------+ : +-------+
+ | | : | | : | |
+ | | : | | : | |
+ | CPU 1 |<----->| Memory |<----->| CPU 2 |
+ | | : | | : | |
+ | | : | | : | |
+ +-------+ : +--------+ : +-------+
+ ^ : ^ : ^
+ | : | : |
+ | : | : |
+ | : v : |
+ | : +--------+ : |
+ | : | | : |
+ | : | | : |
+ +---------->| Device |<----------+
+ : | | :
+ : | | :
+ : +--------+ :
+ : :
+
+Each CPU executes a program that generates memory access operations. In the
+abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
+perform the memory operations in any order it likes, provided program causality
+appears to be maintained. Similarly, the compiler may also arrange the
+instructions it emits in any order it likes, provided it doesn't affect the
+apparent operation of the program.
+
+So in the above diagram, the effects of the memory operations performed by a
+CPU are perceived by the rest of the system as the operations cross the
+interface between the CPU and rest of the system (the dotted lines).
+
+
+For example, consider the following sequence of events:
+
+ CPU 1 CPU 2
+ =============== ===============
+ { A == 1; B == 2 }
+ A = 3; x = A;
+ B = 4; y = B;
+
+The set of accesses as seen by the memory system in the middle can be arranged
+in 24 different combinations:
+
+ STORE A=3, STORE B=4, x=LOAD A->3, y=LOAD B->4
+ STORE A=3, STORE B=4, y=LOAD B->4, x=LOAD A->3
+ STORE A=3, x=LOAD A->3, STORE B=4, y=LOAD B->4
+ STORE A=3, x=LOAD A->3, y=LOAD B->2, STORE B=4
+ STORE A=3, y=LOAD B->2, STORE B=4, x=LOAD A->3
+ STORE A=3, y=LOAD B->2, x=LOAD A->3, STORE B=4
+ STORE B=4, STORE A=3, x=LOAD A->3, y=LOAD B->4
+ STORE B=4, ...
+ ...
+
+and can thus result in four different combinations of values:
+
+ x == 1, y == 2
+ x == 1, y == 4
+ x == 3, y == 2
+ x == 3, y == 4
+
+
+Furthermore, the stores committed by a CPU to the memory system may not be
+perceived by the loads made by another CPU in the same order as the stores were
+committed.
+
+
+As a further example, consider this sequence of events:
+
+ CPU 1 CPU 2
+ =============== ===============
+ { A == 1, B == 2, C = 3, P == &A, Q == &C }
+ B = 4; Q = P;
+ P = &B D = *Q;
+
+There is an obvious data dependency here, as the value loaded into D depends on
+the address retrieved from P by CPU 2. At the end of the sequence, any of the
+following results are possible:
+
+ (Q == &A) and (D == 1)
+ (Q == &B) and (D == 2)
+ (Q == &B) and (D == 4)
+
+Note that CPU 2 will never try and load C into D because the CPU will load P
+into Q before issuing the load of *Q.
+
+
+DEVICE OPERATIONS
+-----------------
+
+Some devices present their control interfaces as collections of memory
+locations, but the order in which the control registers are accessed is very
+important. For instance, imagine an ethernet card with a set of internal
+registers that are accessed through an address port register (A) and a data
+port register (D). To read internal register 5, the following code might then
+be used:
+
+ *A = 5;
+ x = *D;
+
+but this might show up as either of the following two sequences:
+
+ STORE *A = 5, x = LOAD *D
+ x = LOAD *D, STORE *A = 5
+
+the second of which will almost certainly result in a malfunction, since it set
+the address _after_ attempting to read the register.
+
+
+GUARANTEES
+----------
+
+There are some minimal guarantees that may be expected of a CPU:
+
+ (*) On any given CPU, dependent memory accesses will be issued in order, with
+ respect to itself. This means that for:
+
+ Q = P; D = *Q;
+
+ the CPU will issue the following memory operations:
+
+ Q = LOAD P, D = LOAD *Q
+
+ and always in that order.
+
+ (*) Overlapping loads and stores within a particular CPU will appear to be
+ ordered within that CPU. This means that for:
+
+ a = *X; *X = b;
+
+ the CPU will only issue the following sequence of memory operations:
+
+ a = LOAD *X, STORE *X = b
+
+ And for:
+
+ *X = c; d = *X;
+
+ the CPU will only issue:
+
+ STORE *X = c, d = LOAD *X
+
+ (Loads and stores overlap if they are targetted at overlapping pieces of
+ memory).
+
+And there are a number of things that _must_ or _must_not_ be assumed:
+
+ (*) It _must_not_ be assumed that independent loads and stores will be issued
+ in the order given. This means that for:
+
+ X = *A; Y = *B; *D = Z;
+
+ we may get any of the following sequences:
+
+ X = LOAD *A, Y = LOAD *B, STORE *D = Z
+ X = LOAD *A, STORE *D = Z, Y = LOAD *B
+ Y = LOAD *B, X = LOAD *A, STORE *D = Z
+ Y = LOAD *B, STORE *D = Z, X = LOAD *A
+ STORE *D = Z, X = LOAD *A, Y = LOAD *B
+ STORE *D = Z, Y = LOAD *B, X = LOAD *A
+
+ (*) It _must_ be assumed that overlapping memory accesses may be merged or
+ discarded. This means that for:
+
+ X = *A; Y = *(A + 4);
+
+ we may get any one of the following sequences:
+
+ X = LOAD *A; Y = LOAD *(A + 4);
+ Y = LOAD *(A + 4); X = LOAD *A;
+ {X, Y} = LOAD {*A, *(A + 4) };
+
+ And for:
+
+ *A = X; Y = *A;
+
+ we may get either of:
+
+ STORE *A = X; Y = LOAD *A;
+ STORE *A = Y;
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+As can be seen above, independent memory operations are effectively performed
+in random order, but this can be a problem for CPU-CPU interaction and for I/O.
+What is required is some way of intervening to instruct the compiler and the
+CPU to restrict the order.
+
+Memory barriers are such interventions. They impose a perceived partial
+ordering between the memory operations specified on either side of the barrier.
+They request that the sequence of memory events generated appears to other
+parts of the system as if the barrier is effective on that CPU.
+
+
+VARIETIES OF MEMORY BARRIER
+---------------------------
+
+Memory barriers come in four basic varieties:
+
+ (1) Write (or store) memory barriers.
+
+ A write memory barrier gives a guarantee that all the STORE operations
+ specified before the barrier will appear to happen before all the STORE
+ operations specified after the barrier with respect to the other
+ components of the system.
+
+ A write barrier is a partial ordering on stores only; it is not required
+ to have any effect on loads.
+
+ A CPU can be viewed as as commiting a sequence of store operations to the
+ memory system as time progresses. All stores before a write barrier will
+ occur in the sequence _before_ all the stores after the write barrier.
+
+ [!] Note that write barriers should normally be paired with read or data
+ dependency barriers; see the "SMP barrier pairing" subsection.
+
+
+ (2) Data dependency barriers.
+
+ A data dependency barrier is a weaker form of read barrier. In the case
+ where two loads are performed such that the second depends on the result
+ of the first (eg: the first load retrieves the address to which the second
+ load will be directed), a data dependency barrier would be required to
+ make sure that the target of the second load is updated before the address
+ obtained by the first load is accessed.
+
+ A data dependency barrier is a partial ordering on interdependent loads
+ only; it is not required to have any effect on stores, independent loads
+ or overlapping loads.
+
+ As mentioned in (1), the other CPUs in the system can be viewed as
+ committing sequences of stores to the memory system that the CPU being
+ considered can then perceive. A data dependency barrier issued by the CPU
+ under consideration guarantees that for any load preceding it, if that
+ load touches one of a sequence of stores from another CPU, then by the
+ time the barrier completes, the effects of all the stores prior to that
+ touched by the load will be perceptible to any loads issued after the data
+ dependency barrier.
+
+ See the "Examples of memory barrier sequences" subsection for diagrams
+ showing the ordering constraints.
+
+ [!] Note that the first load really has to have a _data_ dependency and
+ not a control dependency. If the address for the second load is dependent
+ on the first load, but the dependency is through a conditional rather than
+ actually loading the address itself, then it's a _control_ dependency and
+ a full read barrier or better is required. See the "Control dependencies"
+ subsection for more information.
+
+ [!] Note that data dependency barriers should normally be paired with
+ write barriers; see the "SMP barrier pairing" subsection.
+
+
+ (3) Read (or load) memory barriers.
+
+ A read barrier is a data dependency barrier plus a guarantee that all the
+ LOAD operations specified before the barrier will appear to happen before
+ all the LOAD operations specified after the barrier with respect to the
+ other components of the system.
+
+ A read barrier is a partial ordering on loads only; it is not required to
+ have any effect on stores.
+
+ Read memory barriers imply data dependency barriers, and so can substitute
+ for them.
+
+ [!] Note that read barriers should normally be paired with write barriers;
+ see the "SMP barrier pairing" subsection.
+
+
+ (4) General memory barriers.
+
+ A general memory barrier is a combination of both a read memory barrier
+ and a write memory barrier. It is a partial ordering over both loads and
+ stores.
+
+ General memory barriers imply both read and write memory barriers, and so
+ can substitute for either.
+
+
+And a couple of implicit varieties:
+
+ (5) LOCK operations.
+
+ This acts as a one-way permeable barrier. It guarantees that all memory
+ operations after the LOCK operation will appear to happen after the LOCK
+ operation with respect to the other components of the system.
+
+ Memory operations that occur before a LOCK operation may appear to happen
+ after it completes.
+
+ A LOCK operation should almost always be paired with an UNLOCK operation.
+
+
+ (6) UNLOCK operations.
+
+ This also acts as a one-way permeable barrier. It guarantees that all
+ memory operations before the UNLOCK operation will appear to happen before
+ the UNLOCK operation with respect to the other components of the system.
+
+ Memory operations that occur after an UNLOCK operation may appear to
+ happen before it completes.
+
+ LOCK and UNLOCK operations are guaranteed to appear with respect to each
+ other strictly in the order specified.
+
+ The use of LOCK and UNLOCK operations generally precludes the need for
+ other sorts of memory barrier (but note the exceptions mentioned in the
+ subsection "MMIO write barrier").
+
+
+Memory barriers are only required where there's a possibility of interaction
+between two CPUs or between a CPU and a device. If it can be guaranteed that
+there won't be any such interaction in any particular piece of code, then
+memory barriers are unnecessary in that piece of code.
+
+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may _not_ be relied upon outside of arch
+specific code.
+
+
+WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
+----------------------------------------------
+
+There are certain things that the Linux kernel memory barriers do not guarantee:
+
+ (*) There is no guarantee that any of the memory accesses specified before a
+ memory barrier will be _complete_ by the completion of a memory barrier
+ instruction; the barrier can be considered to draw a line in that CPU's
+ access queue that accesses of the appropriate type may not cross.
+
+ (*) There is no guarantee that issuing a memory barrier on one CPU will have
+ any direct effect on another CPU or any other hardware in the system. The
+ indirect effect will be the order in which the second CPU sees the effects
+ of the first CPU's accesses occur, but see the next point:
+
+ (*) There is no guarantee that the a CPU will see the correct order of effects
+ from a second CPU's accesses, even _if_ the second CPU uses a memory
+ barrier, unless the first CPU _also_ uses a matching memory barrier (see
+ the subsection on "SMP Barrier Pairing").
+
+ (*) There is no guarantee that some intervening piece of off-the-CPU
+ hardware[*] will not reorder the memory accesses. CPU cache coherency
+ mechanisms should propagate the indirect effects of a memory barrier
+ between CPUs, but might not do so in order.
+
+ [*] For information on bus mastering DMA and coherency please read:
+
+ Documentation/pci.txt
+ Documentation/DMA-mapping.txt
+ Documentation/DMA-API.txt
+
+
+DATA DEPENDENCY BARRIERS
+------------------------
+
+The usage requirements of data dependency barriers are a little subtle, and
+it's not always obvious that they're needed. To illustrate, consider the
+following sequence of events:
+
+ CPU 1 CPU 2
+ =============== ===============
+ { A == 1, B == 2, C = 3, P == &A, Q == &C }
+ B = 4;
+ <write barrier>
+ P = &B
+ Q = P;
+ D = *Q;
+
+There's a clear data dependency here, and it would seem that by the end of the
+sequence, Q must be either &A or &B, and that:
+
+ (Q == &A) implies (D == 1)
+ (Q == &B) implies (D == 4)
+
+But! CPU 2's perception of P may be updated _before_ its perception of B, thus
+leading to the following situation:
+
+ (Q == &B) and (D == 2) ????
+
+Whilst this may seem like a failure of coherency or causality maintenance, it
+isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
+Alpha).
+
+To deal with this, a data dependency barrier must be inserted between the
+address load and the data load:
+
+ CPU 1 CPU 2
+ =============== ===============
+ { A == 1, B == 2, C = 3, P == &A, Q == &C }
+ B = 4;
+ <write barrier>
+ P = &B
+ Q = P;
+ <data dependency barrier>
+ D = *Q;
+
+This enforces the occurrence of one of the two implications, and prevents the
+third possibility from arising.
+
+[!] Note that this extremely counterintuitive situation arises most easily on
+machines with split caches, so that, for example, one cache bank processes
+even-numbered cache lines and the other bank processes odd-numbered cache
+lines. The pointer P might be stored in an odd-numbered cache line, and the
+variable B might be stored in an even-numbered cache line. Then, if the
+even-numbered bank of the reading CPU's cache is extremely busy while the
+odd-numbered bank is idle, one can see the new value of the pointer P (&B),
+but the old value of the variable B (1).
+
+
+Another example of where data dependency barriers might by required is where a
+number is read from memory and then used to calculate the index for an array
+access:
+
+ CPU 1 CPU 2
+ =============== ===============
+ { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
+ M[1] = 4;
+ <write barrier>
+ P = 1
+ Q = P;
+ <data dependency barrier>
+ D = M[Q];
+
+
+The data dependency barrier is very important to the RCU system, for example.
+See rcu_dereference() in include/linux/rcupdate.h. This permits the current
+target of an RCU'd pointer to be replaced with a new modified target, without
+the replacement target appearing to be incompletely initialised.
+
+See also the subsection on "Cache Coherency" for a more thorough example.
+
+
+CONTROL DEPENDENCIES
+--------------------
+
+A control dependency requires a full read memory barrier, not simply a data
+dependency barrier to make it work correctly. Consider the following bit of
+code:
+
+ q = &a;
+ if (p)
+ q = &b;
+ <data dependency barrier>
+ x = *q;
+
+This will not have the desired effect because there is no actual data
+dependency, but rather a control dependency that the CPU may short-circuit by
+attempting to predict the outcome in advance. In such a case what's actually
+required is:
+
+ q = &a;
+ if (p)
+ q = &b;
+ <read barrier>
+ x = *q;
+
+
+SMP BARRIER PAIRING
+-------------------
+
+When dealing with CPU-CPU interactions, certain types of memory barrier should
+always be paired. A lack of appropriate pairing is almost certainly an error.
+
+A write barrier should always be paired with a data dependency barrier or read
+barrier, though a general barrier would also be viable. Similarly a read
+barrier or a data dependency barrier should always be paired with at least an
+write barrier, though, again, a general barrier is viable:
+
+ CPU 1 CPU 2
+ =============== ===============
+ a = 1;
+ <write barrier>
+ b = 2; x = a;
+ <read barrier>
+ y = b;
+
+Or:
+
+ CPU 1 CPU 2
+ =============== ===============================
+ a = 1;
+ <write barrier>
+ b = &a; x = b;
+ <data dependency barrier>
+ y = *x;
+
+Basically, the read barrier always has to be there, even though it can be of
+the "weaker" type.
+
+
+EXAMPLES OF MEMORY BARRIER SEQUENCES
+------------------------------------
+
+Firstly, write barriers act as a partial orderings on store operations.
+Consider the following sequence of events:
+
+ CPU 1
+ =======================
+ STORE A = 1
+ STORE B = 2
+ STORE C = 3
+ <write barrier>
+ STORE D = 4
+ STORE E = 5
+
+This sequence of events is committed to the memory coherence system in an order
+that the rest of the system might perceive as the unordered set of { STORE A,
+STORE B, STORE C } all occuring before the unordered set of { STORE D, STORE E
+}:
+
+ +-------+ : :
+ | | +------+
+ | |------>| C=3 | } /\
+ | | : +------+ }----- \ -----> Events perceptible
+ | | : | A=1 | } \/ to rest of system
+ | | : +------+ }
+ | CPU 1 | : | B=2 | }
+ | | +------+ }
+ | | wwwwwwwwwwwwwwww } <--- At this point the write barrier
+ | | +------+ } requires all stores prior to the
+ | | : | E=5 | } barrier to be committed before
+ | | : +------+ } further stores may be take place.
+ | |------>| D=4 | }
+ | | +------+
+ +-------+ : :
+ |
+ | Sequence in which stores committed to memory system
+ | by CPU 1
+ V
+
+
+Secondly, data dependency barriers act as a partial orderings on data-dependent
+loads. Consider the following sequence of events:
+
+ CPU 1 CPU 2
+ ======================= =======================
+ STORE A = 1
+ STORE B = 2
+ <write barrier>
+ STORE C = &B LOAD X
+ STORE D = 4 LOAD C (gets &B)
+ LOAD *C (reads B)
+
+Without intervention, CPU 2 may perceive the events on CPU 1 in some
+effectively random order, despite the write barrier issued by CPU 1:
+
+ +-------+ : : : :
+ | | +------+ +-------+ | Sequence of update
+ | |------>| B=2 |----- --->| Y->8 | | of perception on
+ | | : +------+ \ +-------+ | CPU 2
+ | CPU 1 | : | A=1 | \ --->| C->&Y | V
+ | | +------+ | +-------+
+ | | wwwwwwwwwwwwwwww | : :
+ | | +------+ | : :
+ | | : | C=&B |--- | : : +-------+
+ | | : +------+ \ | +-------+ | |
+ | |------>| D=4 | ----------->| C->&B |------>| |
+ | | +------+ | +-------+ | |
+ +-------+ : : | : : | |
+ | : : | |
+ | : : | CPU 2 |
+ | +-------+ | |
+ Apparently incorrect ---> | | B->7 |------>| |
+ perception of B (!) | +-------+ | |
+ | : : | |
+ | +-------+ | |
+ The load of X holds ---> \ | X->9 |------>| |
+ up the maintenance \ +-------+ | |
+ of coherence of B ----->| B->2 | +-------+
+ +-------+
+ : :
+
+
+In the above example, CPU 2 perceives that B is 7, despite the load of *C
+(which would be B) coming after the the LOAD of C.
+
+If, however, a data dependency barrier were to be placed between the load of C
+and the load of *C (ie: B) on CPU 2, then the following will occur:
+
+ +-------+ : : : :
+ | | +------+ +-------+
+ | |------>| B=2 |----- --->| Y->8 |
+ | | : +------+ \ +-------+
+ | CPU 1 | : | A=1 | \ --->| C->&Y |
+ | | +------+ | +-------+
+ | | wwwwwwwwwwwwwwww | : :
+ | | +------+ | : :
+ | | : | C=&B |--- | : : +-------+
+ | | : +------+ \ | +-------+ | |
+ | |------>| D=4 | ----------->| C->&B |------>| |
+ | | +------+ | +-------+ | |
+ +-------+ : : | : : | |
+ | : : | |
+ | : : | CPU 2 |
+ | +-------+ | |
+ \ | X->9 |------>| |
+ \ +-------+ | |
+ ----->| B->2 | | |
+ +-------+ | |
+ Makes sure all effects ---> ddddddddddddddddd | |
+ prior to the store of C +-------+ | |
+ are perceptible to | B->2 |------>| |
+ successive loads +-------+ | |
+ : : +-------+
+
+
+And thirdly, a read barrier acts as a partial order on loads. Consider the
+following sequence of events:
+
+ CPU 1 CPU 2
+ ======================= =======================
+ STORE A=1
+ STORE B=2
+ STORE C=3
+ <write barrier>
+ STORE D=4
+ STORE E=5
+ LOAD A
+ LOAD B
+ LOAD C
+ LOAD D
+ LOAD E
+
+Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
+some effectively random order, despite the write barrier issued by CPU 1:
+
+ +-------+ : :
+ | | +------+
+ | |------>| C=3 | }
+ | | : +------+ }
+ | | : | A=1 | }
+ | | : +------+ }
+ | CPU 1 | : | B=2 | }---
+ | | +------+ } \
+ | | wwwwwwwwwwwww} \
+ | | +------+ } \ : : +-------+
+ | | : | E=5 | } \ +-------+ | |
+ | | : +------+ } \ { | C->3 |------>| |
+ | |------>| D=4 | } \ { +-------+ : | |
+ | | +------+ \ { | E->5 | : | |
+ +-------+ : : \ { +-------+ : | |
+ Transfer -->{ | A->1 | : | CPU 2 |
+ from CPU 1 { +-------+ : | |
+ to CPU 2 { | D->4 | : | |
+ { +-------+ : | |
+ { | B->2 |------>| |
+ +-------+ | |
+ : : +-------+
+
+
+If, however, a read barrier were to be placed between the load of C and the
+load of D on CPU 2, then the partial ordering imposed by CPU 1 will be
+perceived correctly by CPU 2.
+
+ +-------+ : :
+ | | +------+
+ | |------>| C=3 | }
+ | | : +------+ }
+ | | : | A=1 | }---
+ | | : +------+ } \
+ | CPU 1 | : | B=2 | } \
+ | | +------+ \
+ | | wwwwwwwwwwwwwwww \
+ | | +------+ \ : : +-------+
+ | | : | E=5 | } \ +-------+ | |
+ | | : +------+ }--- \ { | C->3 |------>| |
+ | |------>| D=4 | } \ \ { +-------+ : | |
+ | | +------+ \ -->{ | B->2 | : | |
+ +-------+ : : \ { +-------+ : | |
+ \ { | A->1 | : | CPU 2 |
+ \ +-------+ | |
+ At this point the read ----> \ rrrrrrrrrrrrrrrrr | |
+ barrier causes all effects \ +-------+ | |
+ prior to the storage of C \ { | E->5 | : | |
+ to be perceptible to CPU 2 -->{ +-------+ : | |
+ { | D->4 |------>| |
+ +-------+ | |
+ : : +-------+
+
+
+========================
+EXPLICIT KERNEL BARRIERS
+========================
+
+The Linux kernel has a variety of different barriers that act at different
+levels:
+
+ (*) Compiler barrier.
+
+ (*) CPU memory barriers.
+
+ (*) MMIO write barrier.
+
+
+COMPILER BARRIER
+----------------
+
+The Linux kernel has an explicit compiler barrier function that prevents the
+compiler from moving the memory accesses either side of it to the other side:
+
+ barrier();
+
+This a general barrier - lesser varieties of compiler barrier do not exist.
+
+The compiler barrier has no direct effect on the CPU, which may then reorder
+things however it wishes.
+
+
+CPU MEMORY BARRIERS
+-------------------
+
+The Linux kernel has eight basic CPU memory barriers:
+
+ TYPE MANDATORY SMP CONDITIONAL
+ =============== ======================= ===========================
+ GENERAL mb() smp_mb()
+ WRITE wmb() smp_wmb()
+ READ rmb() smp_rmb()
+ DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends()
+
+
+All CPU memory barriers unconditionally imply compiler barriers.
+
+SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
+systems because it is assumed that a CPU will be appear to be self-consistent,
+and will order overlapping accesses correctly with respect to itself.
+
+[!] Note that SMP memory barriers _must_ be used to control the ordering of
+references to shared memory on SMP systems, though the use of locking instead
+is sufficient.
+
+Mandatory barriers should not be used to control SMP effects, since mandatory
+barriers unnecessarily impose overhead on UP systems. They may, however, be
+used to control MMIO effects on accesses through relaxed memory I/O windows.
+These are required even on non-SMP systems as they affect the order in which
+memory operations appear to a device by prohibiting both the compiler and the
+CPU from reordering them.
+
+
+There are some more advanced barrier functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+ These assign the value to the variable and then insert at least a write
+ barrier after it, depending on the function. They aren't guaranteed to
+ insert anything more than a compiler barrier in a UP compilation.
+
+
+ (*) smp_mb__before_atomic_dec();
+ (*) smp_mb__after_atomic_dec();
+ (*) smp_mb__before_atomic_inc();
+ (*) smp_mb__after_atomic_inc();
+
+ These are for use with atomic add, subtract, increment and decrement
+ functions, especially when used for reference counting. These functions
+ do not imply memory barriers.
+
+ As an example, consider a piece of code that marks an object as being dead
+ and then decrements the object's reference count:
+
+ obj->dead = 1;
+ smp_mb__before_atomic_dec();
+ atomic_dec(&obj->ref_count);
+
+ This makes sure that the death mark on the object is perceived to be set
+ *before* the reference counter is decremented.
+
+ See Documentation/atomic_ops.txt for more information. See the "Atomic
+ operations" subsection for information on where to use these.
+
+
+ (*) smp_mb__before_clear_bit(void);
+ (*) smp_mb__after_clear_bit(void);
+
+ These are for use similar to the atomic inc/dec barriers. These are
+ typically used for bitwise unlocking operations, so care must be taken as
+ there are no implicit memory barriers here either.
+
+ Consider implementing an unlock operation of some nature by clearing a
+ locking bit. The clear_bit() would then need to be barriered like this:
+
+ smp_mb__before_clear_bit();
+ clear_bit( ... );
+
+ This prevents memory operations before the clear leaking to after it. See
+ the subsection on "Locking Functions" with reference to UNLOCK operation
+ implications.
+
+ See Documentation/atomic_ops.txt for more information. See the "Atomic
+ operations" subsection for information on where to use these.
+
+
+MMIO WRITE BARRIER
+------------------
+
+The Linux kernel also has a special barrier for use with memory-mapped I/O
+writes:
+
+ mmiowb();
+
+This is a variation on the mandatory write barrier that causes writes to weakly
+ordered I/O regions to be partially ordered. Its effects may go beyond the
+CPU->Hardware interface and actually affect the hardware at some level.
+
+See the subsection "Locks vs I/O accesses" for more information.
+
+
+===============================
+IMPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+Some of the other functions in the linux kernel imply memory barriers, amongst
+which are locking, scheduling and memory allocation functions.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+LOCKING FUNCTIONS
+-----------------
+
+The Linux kernel has a number of locking constructs:
+
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+ (*) RCU
+
+In all cases there are variants on "LOCK" operations and "UNLOCK" operations
+for each construct. These operations all imply certain barriers:
+
+ (1) LOCK operation implication:
+
+ Memory operations issued after the LOCK will be completed after the LOCK
+ operation has completed.
+
+ Memory operations issued before the LOCK may be completed after the LOCK
+ operation has completed.
+
+ (2) UNLOCK operation implication:
+
+ Memory operations issued before the UNLOCK will be completed before the
+ UNLOCK operation has completed.
+
+ Memory operations issued after the UNLOCK may be completed before the
+ UNLOCK operation has completed.
+
+ (3) LOCK vs LOCK implication:
+
+ All LOCK operations issued before another LOCK operation will be completed
+ before that LOCK operation.
+
+ (4) LOCK vs UNLOCK implication:
+
+ All LOCK operations issued before an UNLOCK operation will be completed
+ before the UNLOCK operation.
+
+ All UNLOCK operations issued before a LOCK operation will be completed
+ before the LOCK operation.
+
+ (5) Failed conditional LOCK implication:
+
+ Certain variants of the LOCK operation may fail, either due to being
+ unable to get the lock immediately, or due to receiving an unblocked
+ signal whilst asleep waiting for the lock to become available. Failed
+ locks do not imply any sort of barrier.
+
+Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
+equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
+
+[!] Note: one of the consequence of LOCKs and UNLOCKs being only one-way
+ barriers is that the effects instructions outside of a critical section may
+ seep into the inside of the critical section.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so cannot be counted on in such a situation to actually achieve
+anything at all - especially with respect to I/O accesses - unless combined
+with interrupt disabling operations.
+
+See also the section on "Inter-CPU locking barrier effects".
+
+
+As an example, consider the following:
+
+ *A = a;
+ *B = b;
+ LOCK
+ *C = c;
+ *D = d;
+ UNLOCK
+ *E = e;
+ *F = f;
+
+The following sequence of events is acceptable:
+
+ LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+
+ [+] Note that {*F,*A} indicates a combined access.
+
+But none of th