summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorDavid S. Miller <davem@davemloft.net>2020-06-01 15:53:08 -0700
committerDavid S. Miller <davem@davemloft.net>2020-06-01 15:53:08 -0700
commit9a25c1df24a6fea9dc79eec950453c4e00f707fd (patch)
tree1188078f9838a3b6a60a3923ed31df142ffc8ed6
parentefd7ed0f5f2d07ccbb1853c5d46656cdfa1371fb (diff)
parentcf51abcded837ef209faa03a62b2ea44e45995e8 (diff)
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-06-01 The following pull-request contains BPF updates for your *net-next* tree. We've added 55 non-merge commits during the last 1 day(s) which contain a total of 91 files changed, 4986 insertions(+), 463 deletions(-). The main changes are: 1) Add rx_queue_mapping to bpf_sock from Amritha. 2) Add BPF ring buffer, from Andrii. 3) Attach and run programs through devmap, from David. 4) Allow SO_BINDTODEVICE opt in bpf_setsockopt, from Ferenc. 5) link based flow_dissector, from Jakub. 6) Use tracing helpers for lsm programs, from Jiri. 7) Several sk_msg fixes and extensions, from John. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-rw-r--r--Documentation/bpf/ringbuf.rst209
-rw-r--r--MAINTAINERS2
-rw-r--r--drivers/net/ethernet/amazon/ena/ena_netdev.c2
-rw-r--r--drivers/net/ethernet/intel/i40e/i40e_txrx.c2
-rw-r--r--drivers/net/ethernet/intel/ice/ice_txrx_lib.c2
-rw-r--r--drivers/net/ethernet/intel/ixgbe/ixgbe_main.c2
-rw-r--r--drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c2
-rw-r--r--drivers/net/ethernet/marvell/mvneta.c2
-rw-r--r--drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c10
-rw-r--r--drivers/net/ethernet/sfc/rx.c2
-rw-r--r--drivers/net/ethernet/socionext/netsec.c2
-rw-r--r--drivers/net/ethernet/ti/cpsw_priv.c2
-rw-r--r--drivers/net/tun.c2
-rw-r--r--drivers/net/veth.c8
-rw-r--r--drivers/net/virtio_net.c4
-rw-r--r--include/linux/bpf-netns.h64
-rw-r--r--include/linux/bpf.h21
-rw-r--r--include/linux/bpf_types.h4
-rw-r--r--include/linux/bpf_verifier.h4
-rw-r--r--include/linux/skbuff.h26
-rw-r--r--include/linux/skmsg.h8
-rw-r--r--include/net/flow_dissector.h6
-rw-r--r--include/net/net_namespace.h4
-rw-r--r--include/net/netns/bpf.h18
-rw-r--r--include/net/sock.h2
-rw-r--r--include/net/tls.h9
-rw-r--r--include/net/xdp.h17
-rw-r--r--include/uapi/linux/bpf.h95
-rw-r--r--kernel/bpf/Makefile3
-rw-r--r--kernel/bpf/bpf_lsm.c2
-rw-r--r--kernel/bpf/cgroup.c2
-rw-r--r--kernel/bpf/core.c2
-rw-r--r--kernel/bpf/cpumap.c2
-rw-r--r--kernel/bpf/devmap.c132
-rw-r--r--kernel/bpf/helpers.c34
-rw-r--r--kernel/bpf/net_namespace.c373
-rw-r--r--kernel/bpf/ringbuf.c501
-rw-r--r--kernel/bpf/syscall.c29
-rw-r--r--kernel/bpf/verifier.c195
-rw-r--r--kernel/trace/bpf_trace.c28
-rw-r--r--net/core/dev.c18
-rw-r--r--net/core/filter.c94
-rw-r--r--net/core/flow_dissector.c124
-rw-r--r--net/core/skmsg.c98
-rw-r--r--net/core/sock.c10
-rw-r--r--net/ipv4/udp_tunnel.c2
-rw-r--r--net/ipv6/ip6_udp_tunnel.c2
-rw-r--r--net/tls/tls_sw.c20
-rw-r--r--tools/bpf/bpftool/btf.c10
-rw-r--r--tools/bpf/bpftool/cgroup.c14
-rw-r--r--tools/bpf/bpftool/feature.c91
-rw-r--r--tools/bpf/bpftool/gen.c6
-rw-r--r--tools/bpf/bpftool/iter.c8
-rw-r--r--tools/bpf/bpftool/link.c55
-rw-r--r--tools/bpf/bpftool/map.c41
-rw-r--r--tools/bpf/bpftool/net.c12
-rw-r--r--tools/bpf/bpftool/perf.c2
-rw-r--r--tools/bpf/bpftool/prog.c27
-rw-r--r--tools/bpf/bpftool/struct_ops.c15
-rw-r--r--tools/include/uapi/linux/bpf.h95
-rw-r--r--tools/lib/bpf/Build2
-rw-r--r--tools/lib/bpf/Makefile6
-rw-r--r--tools/lib/bpf/libbpf.c49
-rw-r--r--tools/lib/bpf/libbpf.h24
-rw-r--r--tools/lib/bpf/libbpf.map7
-rw-r--r--tools/lib/bpf/libbpf_probes.c5
-rw-r--r--tools/lib/bpf/ringbuf.c288
-rw-r--r--tools/testing/selftests/bpf/Makefile5
-rw-r--r--tools/testing/selftests/bpf/bench.c16
-rw-r--r--tools/testing/selftests/bpf/benchs/bench_ringbufs.c566
-rwxr-xr-xtools/testing/selftests/bpf/benchs/run_bench_ringbufs.sh75
-rw-r--r--tools/testing/selftests/bpf/prog_tests/flow_dissector.c166
-rw-r--r--tools/testing/selftests/bpf/prog_tests/flow_dissector_reattach.c588
-rw-r--r--tools/testing/selftests/bpf/prog_tests/ringbuf.c211
-rw-r--r--tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c102
-rw-r--r--tools/testing/selftests/bpf/prog_tests/skb_helpers.c30
-rw-r--r--tools/testing/selftests/bpf/prog_tests/sockmap_basic.c35
-rw-r--r--tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c97
-rw-r--r--tools/testing/selftests/bpf/progs/bpf_flow.c20
-rw-r--r--tools/testing/selftests/bpf/progs/connect4_prog.c33
-rw-r--r--tools/testing/selftests/bpf/progs/perfbuf_bench.c33
-rw-r--r--tools/testing/selftests/bpf/progs/ringbuf_bench.c60
-rw-r--r--tools/testing/selftests/bpf/progs/test_ringbuf.c78
-rw-r--r--tools/testing/selftests/bpf/progs/test_ringbuf_multi.c77
-rw-r--r--tools/testing/selftests/bpf/progs/test_skb_helpers.c28
-rw-r--r--tools/testing/selftests/bpf/progs/test_skmsg_load_helpers.c47
-rw-r--r--tools/testing/selftests/bpf/progs/test_sockmap_kern.h46
-rw-r--r--tools/testing/selftests/bpf/progs/test_xdp_devmap_helpers.c22
-rw-r--r--tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c44
-rw-r--r--tools/testing/selftests/bpf/test_maps.c52
-rw-r--r--tools/testing/selftests/bpf/test_sockmap.c163
-rw-r--r--tools/testing/selftests/bpf/verifier/and.c4
-rw-r--r--tools/testing/selftests/bpf/verifier/array_access.c4
-rw-r--r--tools/testing/selftests/bpf/verifier/bounds.c6
-rw-r--r--tools/testing/selftests/bpf/verifier/calls.c2
-rw-r--r--tools/testing/selftests/bpf/verifier/direct_value_access.c4
-rw-r--r--tools/testing/selftests/bpf/verifier/helper_access_var_len.c2
-rw-r--r--tools/testing/selftests/bpf/verifier/helper_value_access.c6
-rw-r--r--tools/testing/selftests/bpf/verifier/value_ptr_arith.c8
99 files changed, 5050 insertions, 539 deletions
diff --git a/Documentation/bpf/ringbuf.rst b/Documentation/bpf/ringbuf.rst
new file mode 100644
index 000000000000..75f943f0009d
--- /dev/null
+++ b/Documentation/bpf/ringbuf.rst
@@ -0,0 +1,209 @@
+===============
+BPF ring buffer
+===============
+
+This document describes BPF ring buffer design, API, and implementation details.
+
+.. contents::
+ :local:
+ :depth: 2
+
+Motivation
+----------
+
+There are two distinctive motivators for this work, which are not satisfied by
+existing perf buffer, which prompted creation of a new ring buffer
+implementation.
+
+- more efficient memory utilization by sharing ring buffer across CPUs;
+- preserving ordering of events that happen sequentially in time, even across
+ multiple CPUs (e.g., fork/exec/exit events for a task).
+
+These two problems are independent, but perf buffer fails to satisfy both.
+Both are a result of a choice to have per-CPU perf ring buffer. Both can be
+also solved by having an MPSC implementation of ring buffer. The ordering
+problem could technically be solved for perf buffer with some in-kernel
+counting, but given the first one requires an MPSC buffer, the same solution
+would solve the second problem automatically.
+
+Semantics and APIs
+------------------
+
+Single ring buffer is presented to BPF programs as an instance of BPF map of
+type ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives considered, but
+ultimately rejected.
+
+One way would be to, similar to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, make
+``BPF_MAP_TYPE_RINGBUF`` could represent an array of ring buffers, but not
+enforce "same CPU only" rule. This would be more familiar interface compatible
+with existing perf buffer use in BPF, but would fail if application needed more
+advanced logic to lookup ring buffer by arbitrary key.
+``BPF_MAP_TYPE_HASH_OF_MAPS`` addresses this with current approach.
+Additionally, given the performance of BPF ringbuf, many use cases would just
+opt into a simple single ring buffer shared among all CPUs, for which current
+approach would be an overkill.
+
+Another approach could introduce a new concept, alongside BPF map, to represent
+generic "container" object, which doesn't necessarily have key/value interface
+with lookup/update/delete operations. This approach would add a lot of extra
+infrastructure that has to be built for observability and verifier support. It
+would also add another concept that BPF developers would have to familiarize
+themselves with, new syntax in libbpf, etc. But then would really provide no
+additional benefits over the approach of using a map. ``BPF_MAP_TYPE_RINGBUF``
+doesn't support lookup/update/delete operations, but so doesn't few other map
+types (e.g., queue and stack; array doesn't support delete, etc).
+
+The approach chosen has an advantage of re-using existing BPF map
+infrastructure (introspection APIs in kernel, libbpf support, etc), being
+familiar concept (no need to teach users a new type of object in BPF program),
+and utilizing existing tooling (bpftool). For common scenario of using a single
+ring buffer for all CPUs, it's as simple and straightforward, as would be with
+a dedicated "container" object. On the other hand, by being a map, it can be
+combined with ``ARRAY_OF_MAPS`` and ``HASH_OF_MAPS`` map-in-maps to implement
+a wide variety of topologies, from one ring buffer for each CPU (e.g., as
+a replacement for perf buffer use cases), to a complicated application
+hashing/sharding of ring buffers (e.g., having a small pool of ring buffers
+with hashed task's tgid being a look up key to preserve order, but reduce
+contention).
+
+Key and value sizes are enforced to be zero. ``max_entries`` is used to specify
+the size of ring buffer and has to be a power of 2 value.
+
+There are a bunch of similarities between perf buffer
+(``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and new BPF ring buffer semantics:
+
+- variable-length records;
+- if there is no more space left in ring buffer, reservation fails, no
+ blocking;
+- memory-mappable data area for user-space applications for ease of
+ consumption and high performance;
+- epoll notifications for new incoming data;
+- but still the ability to do busy polling for new data to achieve the
+ lowest latency, if necessary.
+
+BPF ringbuf provides two sets of APIs to BPF programs:
+
+- ``bpf_ringbuf_output()`` allows to *copy* data from one place to a ring
+ buffer, similarly to ``bpf_perf_event_output()``;
+- ``bpf_ringbuf_reserve()``/``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()``
+ APIs split the whole process into two steps. First, a fixed amount of space
+ is reserved. If successful, a pointer to a data inside ring buffer data
+ area is returned, which BPF programs can use similarly to a data inside
+ array/hash maps. Once ready, this piece of memory is either committed or
+ discarded. Discard is similar to commit, but makes consumer ignore the
+ record.
+
+``bpf_ringbuf_output()`` has disadvantage of incurring extra memory copy,
+because record has to be prepared in some other place first. But it allows to
+submit records of the length that's not known to verifier beforehand. It also
+closely matches ``bpf_perf_event_output()``, so will simplify migration
+significantly.
+
+``bpf_ringbuf_reserve()`` avoids the extra copy of memory by providing a memory
+pointer directly to ring buffer memory. In a lot of cases records are larger
+than BPF stack space allows, so many programs have use extra per-CPU array as
+a temporary heap for preparing sample. bpf_ringbuf_reserve() avoid this needs
+completely. But in exchange, it only allows a known constant size of memory to
+be reserved, such that verifier can verify that BPF program can't access memory
+outside its reserved record space. bpf_ringbuf_output(), while slightly slower
+due to extra memory copy, covers some use cases that are not suitable for
+``bpf_ringbuf_reserve()``.
+
+The difference between commit and discard is very small. Discard just marks
+a record as discarded, and such records are supposed to be ignored by consumer
+code. Discard is useful for some advanced use-cases, such as ensuring
+all-or-nothing multi-record submission, or emulating temporary
+``malloc()``/``free()`` within single BPF program invocation.
+
+Each reserved record is tracked by verifier through existing
+reference-tracking logic, similar to socket ref-tracking. It is thus
+impossible to reserve a record, but forget to submit (or discard) it.
+
+``bpf_ringbuf_query()`` helper allows to query various properties of ring
+buffer. Currently 4 are supported:
+
+- ``BPF_RB_AVAIL_DATA`` returns amount of unconsumed data in ring buffer;
+- ``BPF_RB_RING_SIZE`` returns the size of ring buffer;
+- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical possition
+ of consumer/producer, respectively.
+
+Returned values are momentarily snapshots of ring buffer state and could be
+off by the time helper returns, so this should be used only for
+debugging/reporting reasons or for implementing various heuristics, that take
+into account highly-changeable nature of some of those characteristics.
+
+One such heuristic might involve more fine-grained control over poll/epoll
+notifications about new data availability in ring buffer. Together with
+``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for output/commit/discard
+helpers, it allows BPF program a high degree of control and, e.g., more
+efficient batched notifications. Default self-balancing strategy, though,
+should be adequate for most applications and will work reliable and efficiently
+already.
+
+Design and Implementation
+-------------------------
+
+This reserve/commit schema allows a natural way for multiple producers, either
+on different CPUs or even on the same CPU/in the same BPF program, to reserve
+independent records and work with them without blocking other producers. This
+means that if BPF program was interruped by another BPF program sharing the
+same ring buffer, they will both get a record reserved (provided there is
+enough space left) and can work with it and submit it independently. This
+applies to NMI context as well, except that due to using a spinlock during
+reservation, in NMI context, ``bpf_ringbuf_reserve()`` might fail to get
+a lock, in which case reservation will fail even if ring buffer is not full.
+
+The ring buffer itself internally is implemented as a power-of-2 sized
+circular buffer, with two logical and ever-increasing counters (which might
+wrap around on 32-bit architectures, that's not a problem):
+
+- consumer counter shows up to which logical position consumer consumed the
+ data;
+- producer counter denotes amount of data reserved by all producers.
+
+Each time a record is reserved, producer that "owns" the record will
+successfully advance producer counter. At that point, data is still not yet
+ready to be consumed, though. Each record has 8 byte header, which contains the
+length of reserved record, as well as two extra bits: busy bit to denote that
+record is still being worked on, and discard bit, which might be set at commit
+time if record is discarded. In the latter case, consumer is supposed to skip
+the record and move on to the next one. Record header also encodes record's
+relative offset from the beginning of ring buffer data area (in pages). This
+allows ``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()`` to accept only the
+pointer to the record itself, without requiring also the pointer to ring buffer
+itself. Ring buffer memory location will be restored from record metadata
+header. This significantly simplifies verifier, as well as improving API
+usability.
+
+Producer counter increments are serialized under spinlock, so there is
+a strict ordering between reservations. Commits, on the other hand, are
+completely lockless and independent. All records become available to consumer
+in the order of reservations, but only after all previous records where
+already committed. It is thus possible for slow producers to temporarily hold
+off submitted records, that were reserved later.
+
+Reservation/commit/consumer protocol is verified by litmus tests in
+Documentation/litmus_tests/bpf-rb/_.
+
+One interesting implementation bit, that significantly simplifies (and thus
+speeds up as well) implementation of both producers and consumers is how data
+area is mapped twice contiguously back-to-back in the virtual memory. This
+allows to not take any special measures for samples that have to wrap around
+at the end of the circular buffer data area, because the next page after the
+last data page would be first data page again, and thus the sample will still
+appear completely contiguous in virtual memory. See comment and a simple ASCII
+diagram showing this visually in ``bpf_ringbuf_area_alloc()``.
+
+Another feature that distinguishes BPF ringbuf from perf ring buffer is
+a self-pacing notifications of new data being availability.
+``bpf_ringbuf_commit()`` implementation will send a notification of new record
+being available after commit only if consumer has already caught up right up to
+the record being committed. If not, consumer still has to catch up and thus
+will see new data anyways without needing an extra poll notification.
+Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c_) show that
+this allows to achieve a very high throughput without having to resort to
+tricks like "notify only every Nth sample", which are necessary with perf
+buffer. For extreme cases, when BPF program wants more manual control of
+notifications, commit/discard/output helpers accept ``BPF_RB_NO_WAKEUP`` and
+``BPF_RB_FORCE_WAKEUP`` flags, which give full control over notifications of
+data availability, but require extra caution and diligence in using this API.
diff --git a/MAINTAINERS b/MAINTAINERS
index 5d81c002232a..66d1a3f10102 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18456,7 +18456,7 @@ L: netdev@vger.kernel.org
L: bpf@vger.kernel.org
S: Maint