diff options
author | Mauro Carvalho Chehab <mchehab+huawei@kernel.org> | 2020-04-28 00:01:36 +0200 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2020-04-28 14:39:46 -0700 |
commit | cb3f0d56e153398a035eb22769d2cb2837f29747 (patch) | |
tree | 8466ce6d769da4796645d2dfb3c100642fd63d0a /Documentation/networking/filter.rst | |
parent | aee113427c5d205730b2c1a023661799f41aca23 (diff) |
docs: networking: convert filter.txt to ReST
- add SPDX header;
- adjust title markup;
- mark code blocks and literals as such;
- use footnote markup;
- mark tables as such;
- adjust identation, whitespaces and blank lines;
- add to networking/index.rst.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'Documentation/networking/filter.rst')
-rw-r--r-- | Documentation/networking/filter.rst | 1651 |
1 files changed, 1651 insertions, 0 deletions
diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst new file mode 100644 index 000000000000..a1d3e192b9fa --- /dev/null +++ b/Documentation/networking/filter.rst @@ -0,0 +1,1651 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================================= +Linux Socket Filtering aka Berkeley Packet Filter (BPF) +======================================================= + +Introduction +------------ + +Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. +Though there are some distinct differences between the BSD and Linux +Kernel filtering, but when we speak of BPF or LSF in Linux context, we +mean the very same mechanism of filtering in the Linux kernel. + +BPF allows a user-space program to attach a filter onto any socket and +allow or disallow certain types of data to come through the socket. LSF +follows exactly the same filter code structure as BSD's BPF, so referring +to the BSD bpf.4 manpage is very helpful in creating filters. + +On Linux, BPF is much simpler than on BSD. One does not have to worry +about devices or anything like that. You simply create your filter code, +send it to the kernel via the SO_ATTACH_FILTER option and if your filter +code passes the kernel check on it, you then immediately begin filtering +data on that socket. + +You can also detach filters from your socket via the SO_DETACH_FILTER +option. This will probably not be used much since when you close a socket +that has a filter on it the filter is automagically removed. The other +less common case may be adding a different filter on the same socket where +you had another filter that is still running: the kernel takes care of +removing the old one and placing your new one in its place, assuming your +filter has passed the checks, otherwise if it fails the old filter will +remain on that socket. + +SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once +set, a filter cannot be removed or changed. This allows one process to +setup a socket, attach a filter, lock it then drop privileges and be +assured that the filter will be kept until the socket is closed. + +The biggest user of this construct might be libpcap. Issuing a high-level +filter command like `tcpdump -i em1 port 22` passes through the libpcap +internal compiler that generates a structure that can eventually be loaded +via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` +displays what is being placed into this structure. + +Although we were only speaking about sockets here, BPF in Linux is used +in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel +qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places +such as team driver, PTP code, etc where BPF is being used. + +.. [1] Documentation/userspace-api/seccomp_filter.rst + +Original BPF paper: + +Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new +architecture for user-level packet capture. In Proceedings of the +USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993 +Conference Proceedings (USENIX'93). USENIX Association, Berkeley, +CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf] + +Structure +--------- + +User space applications include <linux/filter.h> which contains the +following relevant structures:: + + struct sock_filter { /* Filter block */ + __u16 code; /* Actual filter code */ + __u8 jt; /* Jump true */ + __u8 jf; /* Jump false */ + __u32 k; /* Generic multiuse field */ + }; + +Such a structure is assembled as an array of 4-tuples, that contains +a code, jt, jf and k value. jt and jf are jump offsets and k a generic +value to be used for a provided code:: + + struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ + unsigned short len; /* Number of filter blocks */ + struct sock_filter __user *filter; + }; + +For socket filtering, a pointer to this structure (as shown in +follow-up example) is being passed to the kernel through setsockopt(2). + +Example +------- + +:: + + #include <sys/socket.h> + #include <sys/types.h> + #include <arpa/inet.h> + #include <linux/if_ether.h> + /* ... */ + + /* From the example above: tcpdump -i em1 port 22 -dd */ + struct sock_filter code[] = { + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 8, 0x000086dd }, + { 0x30, 0, 0, 0x00000014 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 17, 0x00000011 }, + { 0x28, 0, 0, 0x00000036 }, + { 0x15, 14, 0, 0x00000016 }, + { 0x28, 0, 0, 0x00000038 }, + { 0x15, 12, 13, 0x00000016 }, + { 0x15, 0, 12, 0x00000800 }, + { 0x30, 0, 0, 0x00000017 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 8, 0x00000011 }, + { 0x28, 0, 0, 0x00000014 }, + { 0x45, 6, 0, 0x00001fff }, + { 0xb1, 0, 0, 0x0000000e }, + { 0x48, 0, 0, 0x0000000e }, + { 0x15, 2, 0, 0x00000016 }, + { 0x48, 0, 0, 0x00000010 }, + { 0x15, 0, 1, 0x00000016 }, + { 0x06, 0, 0, 0x0000ffff }, + { 0x06, 0, 0, 0x00000000 }, + }; + + struct sock_fprog bpf = { + .len = ARRAY_SIZE(code), + .filter = code, + }; + + sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); + if (sock < 0) + /* ... bail out ... */ + + ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); + if (ret < 0) + /* ... bail out ... */ + + /* ... */ + close(sock); + +The above example code attaches a socket filter for a PF_PACKET socket +in order to let all IPv4/IPv6 packets with port 22 pass. The rest will +be dropped for this socket. + +The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments +and SO_LOCK_FILTER for preventing the filter to be detached, takes an +integer value with 0 or 1. + +Note that socket filters are not restricted to PF_PACKET sockets only, +but can also be used on other socket families. + +Summary of system calls: + + * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); + +Normally, most use cases for socket filtering on packet sockets will be +covered by libpcap in high-level syntax, so as an application developer +you should stick to that. libpcap wraps its own layer around all that. + +Unless i) using/linking to libpcap is not an option, ii) the required BPF +filters use Linux extensions that are not supported by libpcap's compiler, +iii) a filter might be more complex and not cleanly implementable with +libpcap's compiler, or iv) particular filter codes should be optimized +differently than libpcap's internal compiler does; then in such cases +writing such a filter "by hand" can be of an alternative. For example, +xt_bpf and cls_bpf users might have requirements that could result in +more complex filter code, or one that cannot be expressed with libpcap +(e.g. different return codes for various code paths). Moreover, BPF JIT +implementors may wish to manually write test cases and thus need low-level +access to BPF code as well. + +BPF engine and instruction set +------------------------------ + +Under tools/bpf/ there's a small helper tool called bpf_asm which can +be used to write low-level filters for example scenarios mentioned in the +previous section. Asm-like syntax mentioned here has been implemented in +bpf_asm and will be used for further explanations (instead of dealing with +less readable opcodes directly, principles are the same). The syntax is +closely modelled after Steven McCanne's and Van Jacobson's BPF paper. + +The BPF architecture consists of the following basic elements: + + ======= ==================================================== + Element Description + ======= ==================================================== + A 32 bit wide accumulator + X 32 bit wide X register + M[] 16 x 32 bit wide misc registers aka "scratch memory + store", addressable from 0 to 15 + ======= ==================================================== + +A program, that is translated by bpf_asm into "opcodes" is an array that +consists of the following elements (as already mentioned):: + + op:16, jt:8, jf:8, k:32 + +The element op is a 16 bit wide opcode that has a particular instruction +encoded. jt and jf are two 8 bit wide jump targets, one for condition +"jump if true", the other one "jump if false". Eventually, element k +contains a miscellaneous argument that can be interpreted in different +ways depending on the given instruction in op. + +The instruction set consists of load, store, branch, alu, miscellaneous +and return instructions that are also represented in bpf_asm syntax. This +table lists all bpf_asm instructions available resp. what their underlying +opcodes as defined in linux/filter.h stand for: + + =========== =================== ===================== + Instruction Addressing mode Description + =========== =================== ===================== + ld 1, 2, 3, 4, 12 Load word into A + ldi 4 Load word into A + ldh 1, 2 Load half-word into A + ldb 1, 2 Load byte into A + ldx 3, 4, 5, 12 Load word into X + ldxi 4 Load word into X + ldxb 5 Load byte into X + + st 3 Store A into M[] + stx 3 Store X into M[] + + jmp 6 Jump to label + ja 6 Jump to label + jeq 7, 8, 9, 10 Jump on A == <x> + jneq 9, 10 Jump on A != <x> + jne 9, 10 Jump on A != <x> + jlt 9, 10 Jump on A < <x> + jle 9, 10 Jump on A <= <x> + jgt 7, 8, 9, 10 Jump on A > <x> + jge 7, 8, 9, 10 Jump on A >= <x> + jset 7, 8, 9, 10 Jump on A & <x> + + add 0, 4 A + <x> + sub 0, 4 A - <x> + mul 0, 4 A * <x> + div 0, 4 A / <x> + mod 0, 4 A % <x> + neg !A + and 0, 4 A & <x> + or 0, 4 A | <x> + xor 0, 4 A ^ <x> + lsh 0, 4 A << <x> + rsh 0, 4 A >> <x> + + tax Copy A into X + txa Copy X into A + + ret 4, 11 Return + =========== =================== ===================== + +The next table shows addressing formats from the 2nd column: + + =============== =================== =============================================== + Addressing mode Syntax Description + =============== =================== =============================================== + 0 x/%x Register X + 1 [k] BHW at byte offset k in the packet + 2 [x + k] BHW at the offset X + k in the packet + 3 M[k] Word at offset k in M[] + 4 #k Literal value stored in k + 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet + 6 L Jump label L + 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf + 8 x/%x,Lt,Lf Jump to Lt if true, otherwise jump to Lf + 9 #k,Lt Jump to Lt if predicate is true + 10 x/%x,Lt Jump to Lt if predicate is true + 11 a/%a Accumulator A + 12 extension BPF extension + =============== =================== =============================================== + +The Linux kernel also has a couple of BPF extensions that are used along +with the class of load instructions by "overloading" the k argument with +a negative offset + a particular extension offset. The result of such BPF +extensions are loaded into A. + +Possible BPF extensions are shown in the following table: + + =================================== ================================================= + Extension Description + =================================== ================================================= + len skb->len + proto skb->protocol + type skb->pkt_type + poff Payload start offset + ifidx skb->dev->ifindex + nla Netlink attribute of type X with offset A + nlan Nested Netlink attribute of type X with offset A + mark skb->mark + queue skb->queue_mapping + hatype skb->dev->type + rxhash skb->hash + cpu raw_smp_processor_id() + vlan_tci skb_vlan_tag_get(skb) + vlan_avail skb_vlan_tag_present(skb) + vlan_tpid skb->vlan_proto + rand prandom_u32() + =================================== ================================================= + +These extensions can also be prefixed with '#'. +Examples for low-level BPF: + +**ARP packets**:: + + ldh [12] + jne #0x806, drop + ret #-1 + drop: ret #0 + +**IPv4 TCP packets**:: + + ldh [12] + jne #0x800, drop + ldb [23] + jneq #6, drop + ret #-1 + drop: ret #0 + +**(Accelerated) VLAN w/ id 10**:: + + ld vlan_tci + jneq #10, drop + ret #-1 + drop: ret #0 + +**icmp random packet sampling, 1 in 4**: + + ldh [12] + jne #0x800, drop + ldb [23] + jneq #1, drop + # get a random uint32 number + ld rand + mod #4 + jneq #1, drop + ret #-1 + drop: ret #0 + +**SECCOMP filter example**:: + + ld [4] /* offsetof(struct seccomp_data, arch) */ + jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ + ld [0] /* offsetof(struct seccomp_data, nr) */ + jeq #15, good /* __NR_rt_sigreturn */ + jeq #231, good /* __NR_exit_group */ + jeq #60, good /* __NR_exit */ + jeq #0, good /* __NR_read */ + jeq #1, good /* __NR_write */ + jeq #5, good /* __NR_fstat */ + jeq #9, good /* __NR_mmap */ + jeq #14, good /* __NR_rt_sigprocmask */ + jeq #13, good /* __NR_rt_sigaction */ + jeq #35, good /* __NR_nanosleep */ + bad: ret #0 /* SECCOMP_RET_KILL_THREAD */ + good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ + +The above example code can be placed into a file (here called "foo"), and +then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf +and cls_bpf understands and can directly be loaded with. Example with above +ARP code:: + + $ ./bpf_asm foo + 4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, + +In copy and paste C-like output:: + + $ ./bpf_asm -c foo + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 1, 0x00000806 }, + { 0x06, 0, 0, 0xffffffff }, + { 0x06, 0, 0, 0000000000 }, + +In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF +filters that might not be obvious at first, it's good to test filters before +attaching to a live system. For that purpose, there's a small tool called +bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows +for testing BPF filters against given pcap files, single stepping through the +BPF code on the pcap's packets and to do BPF machine register dumps. + +Starting bpf_dbg is trivial and just requires issuing:: + + # ./bpf_dbg + +In case input and output do not equal stdin/stdout, bpf_dbg takes an +alternative stdin source as a first argument, and an alternative stdout +sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`. + +Other than that, a particular libreadline configuration can be set via +file "~/.bpf_dbg_init" and the command history is stored in the file +"~/.bpf_dbg_history". + +Interaction in bpf_dbg happens through a shell that also has auto-completion +support (follow-up example commands starting with '>' denote bpf_dbg shell). +The usual workflow would be to ... + +* load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 + Loads a BPF filter from standard output of bpf_asm, or transformed via + e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT + debugging (next section), this command creates a temporary socket and + loads the BPF code into the kernel. Thus, this will also be useful for + JIT developers. + +* load pcap foo.pcap + + Loads standard tcpdump pcap file. + +* run [<n>] + +bpf passes:1 fails:9 + Runs through all packets from a pcap to account how many passes and fails + the filter will generate. A limit of packets to traverse can be given. + +* disassemble:: + + l0: ldh [12] + l1: jeq #0x800, l2, l5 + l2: ldb [23] + l3: jeq #0x1, l4, l5 + l4: ret #0xffff + l5: ret #0 + + Prints out BPF code disassembly. + +* dump:: + + /* { op, jt, jf, k }, */ + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 3, 0x00000800 }, + { 0x30, 0, 0, 0x00000017 }, + { 0x15, 0, 1, 0x00000001 }, + { 0x06, 0, 0, 0x0000ffff }, + { 0x06, 0, 0, 0000000000 }, + + Prints out C-style BPF code dump. + +* breakpoint 0:: + + breakpoint at: l0: ldh [12] + +* breakpoint 1:: + + breakpoint at: l1: jeq #0x800, l2, l5 + + ... + + Sets breakpoints at particular BPF instructions. Issuing a `run` command + will walk through the pcap file continuing from the current packet and + break when a breakpoint is being hit (another `run` will continue from + the currently active breakpoint executing next instructions): + + * run:: + + -- register dump -- + pc: [0] <-- program counter + code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction + curr: l0: ldh [12] <-- disassembly of current instruction + A: [00000000][0] <-- content of A (hex, decimal) + X: [00000000][0] <-- content of X (hex, decimal) + M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) + -- packet dump -- <-- Current packet from pcap (hex) + len: 42 + 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 + 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 + 32: 00 00 00 00 00 00 0a 3b 01 01 + (breakpoint) + > + + * breakpoint:: + + breakpoints: 0 1 + + Prints currently set breakpoints. + +* step [-<n>, +<n>] + + Performs single stepping through the BPF program from the current pc + offset. Thus, on each step invocation, above register dump is issued. + This can go forwards and backwards in time, a plain `step` will break + on the next BPF instruction, thus +1. (No `run` needs to be issued here.) + +* select <n> + + Selects a given packet from the pcap file to continue from. Thus, on + the next `run` or `step`, the BPF program is being evaluated against + the user pre-selected packet. Numbering starts just as in Wireshark + with index 1. + +* quit + + Exits bpf_dbg. + +JIT compiler +------------ + +The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, +PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through +CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each +attached filter from user space or for internal kernel users if it has +been previously enabled by root:: + + echo 1 > /proc/sys/net/core/bpf_jit_enable + +For JIT developers, doing audits etc, each compile run can output the generated +opcode image into the kernel log via:: + + echo 2 > /proc/sys/net/core/bpf_jit_enable + +Example output from dmesg:: + + [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f + [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 + [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 + [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 + [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 + [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 + +When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and +setting any other value than that will return in failure. This is even the case for +setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log +is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the +generally recommended approach instead. + +In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for +generating disassembly out of the kernel log's hexdump:: + + # ./bpf_jit_disasm + 70 bytes emitted from JIT compiler (pass:3, flen:6) + ffffffffa0069c8f + <x>: + 0: push %rbp + 1: mov %rsp,%rbp + 4: sub $0x60,%rsp + 8: mov %rbx,-0x8(%rbp) + c: mov 0x68(%rdi),%r9d + 10: sub 0x6c(%rdi),%r9d + 14: mov 0xd8(%rdi),%r8 + 1b: mov $0xc,%esi + 20: callq 0xffffffffe0ff9442 + 25: cmp $0x800,%eax + 2a: jne 0x0000000000000042 + 2c: mov $0x17,%esi + 31: callq 0xffffffffe0ff945e + 36: cmp $0x1,%eax + 39: jne 0x0000000000000042 + 3b: mov $0xffff,%eax + 40: jmp 0x0000000000000044 + 42: xor %eax,%eax + 44: leaveq + 45: retq + + Issuing option `-o` will "annotate" opcodes to resulting assembler + instructions, which can be very useful for JIT developers: + + # ./bpf_jit_disasm -o + 70 bytes emitted from JIT compiler (pass:3, flen:6) + ffffffffa0069c8f + <x>: + 0: push %rbp + 55 + 1: mov %rsp,%rbp + 48 89 e5 + 4: sub $0x60,%rsp + 48 83 ec 60 + 8: mov %rbx,-0x8(%rbp) + 48 89 5d f8 + c: mov 0x68(%rdi),%r9d + 44 8b 4f 68 + 10: sub 0x6c(%rdi),%r9d + 44 2b 4f 6c + 14: mov 0xd8(%rdi),%r8 + 4c 8b 87 d8 00 00 00 + 1b: mov $0xc,%esi + be 0c 00 00 00 + 20: callq 0xffffffffe0ff9442 + e8 1d 94 ff e0 + 25: cmp $0x800,%eax + 3d 00 08 00 00 + 2a: jne 0x0000000000000042 + 75 16 + 2c: mov $0x17,%esi + be 17 00 00 00 + 31: callq 0xffffffffe0ff945e + e8 28 94 ff e0 + 36: cmp $0x1,%eax + 83 f8 01 + 39: jne 0x0000000000000042 + 75 07 + 3b: mov $0xffff,%eax + b8 ff ff 00 00 + 40: jmp 0x0000000000000044 + eb 02 + 42: xor %eax,%eax + 31 c0 + 44: leaveq + c9 + 45: retq + c3 + +For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful +toolchain for developing and testing the kernel's JIT compiler. + +BPF kernel internals +-------------------- +Internally, for the kernel interpreter, a different instruction set +format with similar underlying principles from BPF described in previous +paragraphs is being used. However, the instruction set format is modelled +closer to the underlying architecture to mimic native instruction sets, so +that a better performance can be achieved (more details later). This new +ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which +originates from [e]xtended BPF is not the same as BPF extensions! While +eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' +of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) + +It is designed to be JITed with one to one mapping, which can also open up +the possibility for GCC/LLVM compilers to generate optimized eBPF code through +an eBPF backend that performs almost as fast as natively compiled code. + +The new instruction set was originally designed with the possible goal in +mind to write programs in "restricted C" and compile into eBPF with a optional +GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with +minimal performance overhead over two steps, that is, C -> eBPF -> native code. + +Currently, the new format is being used for running user BPF programs, which +includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, +team driver's classifier for its load-balancing mode, netfilter's xt_bpf +extension, PTP dissector/classifier, and much more. They are all internally +converted by the kernel into the new instruction set representation and run +in the eBPF interpreter. For in-kernel handlers, this all works transparently +by using bpf_prog_create() for setting up the filter, resp. +bpf_prog_destroy() for destroying it. The macro +BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed +code to run the filter. 'filter' is a pointer to struct bpf_prog that we +got from bpf_prog_create(), and 'ctx' the given context (e.g. +skb pointer). All constraints and restrictions from bpf_check_classic() apply +before a conversion to the new layout is being done behind the scenes! + +Currently, the classic BPF format is being used for JITing on most +32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64, +sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF +instruction set. + +Some core changes of the new internal format: + +- Number of registers increase from 2 to 10: + + The old format had two registers A and X, and a hidden frame pointer. The + new layout extends this to be 10 internal registers and a read-only frame + pointer. Since 64-bit CPUs are passing arguments to functions via registers + the number of args from eBPF program to in-kernel function is restricted + to 5 and one register is used to accept return value from an in-kernel + function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ + sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved + registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. + + Therefore, eBPF calling convention is defined as: + + * R0 - return value from in-kernel function, and exit value for eBPF program + * R1 - R5 - arguments from eBPF program to in-kernel function + * R6 - R9 - callee saved registers that in-kernel function will preserve + * R10 - read-only frame pointer to access stack + + Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, + etc, and eBPF calling convention maps directly to ABIs used by the kernel on + 64-bit architectures. + + On 32-bit architectures JIT may map programs that use only 32-bit arithmetic + and may let more complex programs to be interpreted. + + R0 - R5 are scratch registers and eBPF program needs spill/fill them if + necessary across calls. Note that there is only one eBPF program (== one + eBPF main routine) and it cannot call other eBPF functions, it can only + call predefined in-kernel functions, though. + +- Register width increases from 32-bit to 64-bit: + + Still, the semantics of the original 32-bit ALU operations are preserved + via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower + subregisters that zero-extend into 64-bit if they are being written to. + That behavior maps directly to x86_64 and arm64 subregister definition, but + makes other JITs more difficult. + + 32-bit architectures run 64-bit internal BPF programs via interpreter. + Their JITs may convert BPF programs that only use 32-bit subregisters into + native instruction set and let the rest being interpreted. + + Operation is 64-bit, because on 64-bit architectures, pointers are also + 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, + so 32-bit eBPF registers would otherwise require to define register-pair + ABI, thus, there won't be able to use a direct eBPF register to HW register + mapping and JIT would need to do combine/split/move operations for every + register in and out of the function, which is complex, bug prone and slow. + Another reason is the use of atomic 64-bit counters. + +- Conditional jt/jf targets replaced with jt/fall-through: + + While the original design has constructs such as ``if (cond) jump_true; + else jump_false;``, they are being replaced into alternative constructs like + ``if (cond) jump_true; /* else fall-through */``. + +- Introduces bpf_call insn and register passing convention for zero overhead + calls from/to other kernel functions: + + Before an in-kernel function call, the internal BPF program needs to + place function arguments into R1 to R5 registers to satisfy calling + convention, then the interpreter will take them from registers and pass + to in-kernel function. If R1 - R5 registers are mapped to CPU registers + that are used for argument passing on given architecture, the JIT compiler + doesn't need to emit extra moves. Function arguments will be in the correct + registers and BPF_CALL instruction will be JITed as single 'call' HW + instruction. This calling convention was picked to cover common call + situations without performance penalty. + + After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has + a return value of the function. Since R6 - R9 are callee saved, their state + is preserved across the call. + + For example, consider three C functions:: + + u64 f1() { return (*_f2)(1); } + u64 f2(u64 a) { return f3(a + 1, a); } + u64 f3(u64 a, u64 b) { return a - b; } + + GCC can compile f1, f3 into x86_64:: + + f1: + movl $1, %edi + movq _f2(%rip), %rax + jmp *%rax + f3: + movq %rdi, %rax + subq %rsi, %rax + ret + + Function f2 in eBPF may look like:: + + f2: + bpf_mov R2, R1 + bpf_add R1, 1 + bpf_call f3 + bpf_exit + + If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and + returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to + be used to call into f2. + + For practical reasons all eBPF programs have only one argument 'ctx' which is + already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs + can call kernel functions with up to 5 arguments. Calls with 6 or more arguments + are currently not supported, but these restrictions can be lifted if necessary + in the future. + + On 64-bit architectures all register map to HW registers one to one. For + example, x86_64 JIT compiler can map them as ... + + :: + + R0 - rax + R1 - rdi + R2 - rsi + R3 - rdx + R4 - rcx + R5 - r8 + R6 - rbx + R7 - r13 + R8 - r14 + R9 - r15 + R10 - rbp + + ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing + and rbx, r12 - r15 are callee saved. + + Then the following internal BPF pseudo-program:: + + bpf_mov R6, R1 /* save ctx */ + bpf_mov R2, 2 + bpf_mov R3, 3 + bpf_mov R4, 4 + bpf_mov R5, 5 + bpf_call foo + bpf_mov R7, R0 /* save foo() return value */ + bpf_mov R1, R6 /* restore ctx for next call */ + bpf_mov R2, 6 + bpf_mov R3, 7 + bpf_mov R4, 8 + bpf_mov R5, 9 + bpf_call bar + bpf_add R0, R7 + bpf_exit + + After JIT to x86_64 may look like:: + + push %rbp + mov %rsp,%rbp + sub $0x228,%rsp + mov %rbx,-0x228(%rbp) + mov %r13,-0x220(%rbp) + mov %rdi,%rbx + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq foo + mov %rax,%r13 + mov %rbx,%rdi + mov $0x6,%esi + mov $0x7,%edx + mov $0x8,%ecx + mov $0x9,%r8d + callq bar + add %r13,%rax + mov -0x228(%rbp),%rbx + mov -0x220(%rbp),%r13 + leaveq + retq + + Which is in this example equivalent in C to:: + + u64 bpf_filter(u64 ctx) + { + return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); + } + + In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 + arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper + registers and place their return value into ``%rax`` which is R0 in eBPF. + Prologue and epilogue are emitted by JIT and are implicit in the + interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve + them across the calls as defined by calling convention. + + For example the following program is invalid:: + + bpf_mov R1, 1 + bpf_call foo + bpf_mov R0, R1 + bpf_exit + + After the call the registers R1-R5 contain junk values and cannot be read. + An in-kernel eBPF verifier is used to validate internal BPF programs. + +Also in the new design, eBPF is limited to 4096 insns, which means that any +program will terminate quickly and will only call a fixed number of kernel +functions. Original BPF and the new format are two operand instructions, +which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. + +The input context pointer for invoking the interpreter function is generic, +its content is defined by a specific use case. For seccomp register R1 points +to seccomp_data, for converted BPF filters R1 points to a skb. + +A program, that is translated internally consists of the following elements:: + + op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 + +So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field +has room for new instructions. Some of them may use 16/24/32 byte encoding. New +instructions must be multiple of 8 bytes to preserve backward compatibility. + +Internal BPF is a general purpose RISC instruction set. Not every register and +every instruction are used during translation from original BPF to new format. +For example, socket filters are not using ``exclusive add`` instruction, but +tracing filters may do to maintain counters of events, for example. Register R9 +is not used by socket filters either, but more complex filters may be running +out of registers and would have to resort to spill/fill to stack. + +Internal BPF can be used as a generic assembler for last step performance +optimizations, socket filters and seccomp are using it as assembler. Tracing +filters may use it as assembler to generate code from kernel. In kernel usage +may not be bounded by security considerations, since generated internal BPF code +may be optimizing internal code path and not being exposed to the user space. +Safety of internal BPF can come from a verifier (TBD). In such use cases as +described, it may be used as safe instruction set. + +Just like the original BPF, the new format runs within a controlled environment, +is deterministic and the kernel can easily prove that. The safety of the program +can be determined in two steps: first step does depth-first-search to disallow +loops and other CFG validation; second step starts from the first insn and +descends all possible paths. It simulates execution of every insn and observes +the state change of registers and stack. + +eBPF opcode encoding +-------------------- + +eBPF is reusing most of the opcode encoding from classic to simplify conversion +of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' +field is divided into three parts:: + + +----------------+--------+--------------------+ + | 4 bits | 1 bit | 3 bits | + | operation code | source | instruction class | + +----------------+--------+--------------------+ + (MSB) (LSB) + +Three LSB bits store instruction class which is one of: + + =================== =============== + Classic BPF classes eBPF classes + =================== =============== + BPF_LD 0x00 BPF_LD 0x00 + BPF_LDX 0x01 BPF_LDX 0x01 + BPF_ST 0x02 BPF_ST 0x02 + BPF_STX 0x03 BPF_STX 0x03 + BPF_ALU 0x04 BPF_ALU 0x04 + BPF_JMP 0x05 BPF_JMP 0x05 + BPF_RET 0x06 BPF_JMP32 0x06 + BPF_MISC 0x07 BPF_ALU64 0x07 + =================== =============== + +When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... + + :: + + BPF_K 0x00 + BPF_X 0x08 + + * in classic BPF, this means:: + + BPF_SRC(code) == BPF_X - use register X as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + + * in eBPF, this means:: + + BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + +... and four MSB bits store operation code. + +If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:: + + BPF_ADD 0x00 + BPF_SUB 0x10 + BPF_MUL 0x20 |