path: root/net/core/filter.c
Age  Commit message  Author
2020-12-08  bpf: Only provide bpf_sock_from_file with CONFIG_NET  (Florent Revest)
This moves the bpf_sock_from_file definition into net/core/filter.c which only gets compiled with CONFIG_NET and also moves the helper proto usage next to other tracing helpers that are conditional on CONFIG_NET. This avoids ld: kernel/trace/bpf_trace.o: in function `bpf_sock_from_file': bpf_trace.c:(.text+0xe23): undefined reference to `sock_from_file' When compiling a kernel with BPF and without NET. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Florent Revest <revest@chromium.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/bpf/20201208173623.1136863-1-revest@chromium.org
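For orientation, a minimal sketch of how the helper can be used now that it is guarded by CONFIG_NET; the LSM hook chosen here (file_ioctl) and the printed message are illustrative assumptions, not part of this patch:

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* Any hook that receives a struct file * would do; file_ioctl is just an example. */
SEC("lsm/file_ioctl")
int BPF_PROG(sock_ioctl_audit, struct file *file, unsigned int cmd, unsigned long arg)
{
    struct socket *sock = bpf_sock_from_file(file);

    if (sock)                   /* the file refers to a socket */
        bpf_printk("ioctl 0x%x on a socket", cmd);
    return 0;                   /* 0 = allow */
}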
2020-12-03  bpf: Adds support for setting window clamp  (Prankur gupta)
Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_WINDOW_CLAMP, which sets the maximum receiver window size. It will be useful for limiting receiver window based on RTT. Signed-off-by: Prankur gupta <prankgup@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com
2020-12-02  bpf: Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks  (Stanislav Fomichev)
This allows bpf_{s,g}etsockopt to be called from the cgroup bind4/bind6 hooks. The socket now has to be locked/unlocked around the bind hook execution; that shouldn't cause any overhead because the socket is unbound and shouldn't receive any traffic. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/20201202172516.3483656-3-sdf@google.com
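A minimal sketch of what this enables; the SO_MARK value is illustrative and the socket-level constants are open-coded for brevity:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define SOL_SOCKET 1
#define SO_MARK    36

char LICENSE[] SEC("license") = "GPL";

SEC("cgroup/bind4")
int mark_at_bind(struct bpf_sock_addr *ctx)
{
    int mark = 42;  /* illustrative fwmark */

    /* The socket is now locked around the bind hook, so the
     * {s,g}etsockopt helpers are safe to call from here. */
    bpf_setsockopt(ctx, SOL_SOCKET, SO_MARK, &mark, sizeof(mark));
    return 1;       /* let the bind() proceed */
}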
2020-10-22  net: Properly typecast int values to set sk_max_pacing_rate  (Ke Li)
In setsockopt(SO_MAX_PACING_RATE) on 64bit systems, sk_max_pacing_rate, after extended from 'u32' to 'unsigned long', takes unintentionally hiked value whenever assigned from an 'int' value with MSB=1, due to binary sign extension in promoting s32 to u64, e.g. 0x80000000 becomes 0xFFFFFFFF80000000. Thus inflated sk_max_pacing_rate causes subsequent getsockopt to return ~0U unexpectedly. It may also result in increased pacing rate. Fix by explicitly casting the 'int' value to 'unsigned int' before assigning it to sk_max_pacing_rate, for zero extension to happen. Fixes: 76a9ebe811fb ("net: extend sk_pacing_rate to unsigned long") Signed-off-by: Ji Li <jli@akamai.com> Signed-off-by: Ke Li <keli@akamai.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201022064146.79873-1-keli@akamai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22  bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop  (Toke Høiland-Jørgensen)
Based on the discussion in [0], update the bpf_redirect_neigh() helper to accept an optional parameter specifying the nexthop information. This makes it possible to combine bpf_fib_lookup() and bpf_redirect_neigh() without incurring a duplicate FIB lookup - since the FIB lookup helper will return the nexthop information even if no neighbour is present, this can simply be passed on to bpf_redirect_neigh() if bpf_fib_lookup() returns BPF_FIB_LKUP_RET_NO_NEIGH. Thus fix & extend it before helper API is frozen. [0] https://lore.kernel.org/bpf/393e17fc-d187-3a8d-2f0d-a627c7c63fca@iogearbox.net/ Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/bpf/160322915615.32199.1187570224032024535.stgit@toke.dk
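A condensed tc egress sketch of the combined pattern described above; includes and the packet parsing that fills fib_params are elided, and TC_ACT_OK comes from linux/pkt_cls.h:

static __always_inline int fwd(struct __sk_buff *skb)
{
    struct bpf_fib_lookup fib_params = {};
    struct bpf_redir_neigh nh_params = {};
    int ret;

    /* ... populate fib_params (family, src/dst addresses, ifindex, ...) from the packet ... */

    ret = bpf_fib_lookup(skb, &fib_params, sizeof(fib_params), 0);
    if (ret == BPF_FIB_LKUP_RET_SUCCESS)
        return bpf_redirect(fib_params.ifindex, 0);
    if (ret == BPF_FIB_LKUP_RET_NO_NEIGH) {
        /* FIB entry found but neighbour missing: pass the gateway on to
         * bpf_redirect_neigh() instead of doing a second FIB lookup. */
        nh_params.nh_family = fib_params.family;
        __builtin_memcpy(&nh_params.ipv6_nh, &fib_params.ipv6_dst,
                         sizeof(nh_params.ipv6_nh));
        return bpf_redirect_neigh(fib_params.ifindex, &nh_params,
                                  sizeof(nh_params), 0);
    }
    return TC_ACT_OK;   /* fall back to the stack */
}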
2020-10-12  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski)
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-10-12 The main changes are: 1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong. 2) libbpf relocation support for different size load/store, from Andrii. 3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel. 4) BPF support for per-cpu variables, from Hao. 5) sockmap improvements, from John. ==================== Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-11  bpf: Always return target ifindex in bpf_fib_lookup  (Toke Høiland-Jørgensen)
The bpf_fib_lookup() helper performs a neighbour lookup for the destination IP and returns BPF_FIB_LKUP_RET_NO_NEIGH if this fails, with the expectation that the BPF program will pass the packet up the stack in this case. However, with the addition of bpf_redirect_neigh(), that helper can be used instead to perform the neighbour lookup, at the cost of a bit of duplicated work. For that we still need the target ifindex, and since bpf_fib_lookup() already has that at the time it performs the neighbour lookup, there is really no reason why it can't just return it in any case. So let's just always return the ifindex if the FIB lookup itself succeeds. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Ahern <dsahern@gmail.com> Link: https://lore.kernel.org/bpf/20201009184234.134214-1-toke@redhat.com
2020-10-11  bpf: Add redirect_peer helper  (Daniel Borkmann)
Add an efficient ingress to ingress netns switch that can be used out of tc BPF programs in order to redirect traffic from host ns ingress into a container veth device ingress without having to go via CPU backlog queue [0]. For local containers this can also be utilized and path via CPU backlog queue only needs to be taken once, not twice. On a high level this borrows from ipvlan which does similar switch in __netif_receive_skb_core() and then iterates via another_round. This helps to reduce latency for mentioned use cases. Pod to remote pod with redirect(), TCP_RR [1]: # percpu_netperf 10.217.1.33 RT_LATENCY: 122.450 (per CPU: 122.666 122.401 122.333 122.401 ) MEAN_LATENCY: 121.210 (per CPU: 121.100 121.260 121.320 121.160 ) STDDEV_LATENCY: 120.040 (per CPU: 119.420 119.910 125.460 115.370 ) MIN_LATENCY: 46.500 (per CPU: 47.000 47.000 47.000 45.000 ) P50_LATENCY: 118.500 (per CPU: 118.000 119.000 118.000 119.000 ) P90_LATENCY: 127.500 (per CPU: 127.000 128.000 127.000 128.000 ) P99_LATENCY: 130.750 (per CPU: 131.000 131.000 129.000 132.000 ) TRANSACTION_RATE: 32666.400 (per CPU: 8152.200 8169.842 8174.439 8169.897 ) Pod to remote pod with redirect_peer(), TCP_RR: # percpu_netperf 10.217.1.33 RT_LATENCY: 44.449 (per CPU: 43.767 43.127 45.279 45.622 ) MEAN_LATENCY: 45.065 (per CPU: 44.030 45.530 45.190 45.510 ) STDDEV_LATENCY: 84.823 (per CPU: 66.770 97.290 84.380 90.850 ) MIN_LATENCY: 33.500 (per CPU: 33.000 33.000 34.000 34.000 ) P50_LATENCY: 43.250 (per CPU: 43.000 43.000 43.000 44.000 ) P90_LATENCY: 46.750 (per CPU: 46.000 47.000 47.000 47.000 ) P99_LATENCY: 52.750 (per CPU: 51.000 54.000 53.000 53.000 ) TRANSACTION_RATE: 90039.500 (per CPU: 22848.186 23187.089 22085.077 21919.130 ) [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
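A minimal tc ingress sketch on the host-facing device; the ifindex and section name are illustrative, and in practice the peer device index comes from the orchestration/CNI setup:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define CONTAINER_VETH_IFINDEX 42   /* illustrative */

char LICENSE[] SEC("license") = "GPL";

SEC("tc")
int host_to_pod(struct __sk_buff *skb)
{
    /* Switch netns and enqueue directly on the veth peer's ingress,
     * skipping the per-CPU backlog queue. */
    return bpf_redirect_peer(CONTAINER_VETH_IFINDEX, 0);
}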
2020-10-09  bpf: Add tcp_notsent_lowat bpf setsockopt  (Nikita V. Shirokov)
Adding support for TCP_NOTSENT_LOWAT sockoption (https://lwn.net/Articles/560082/) in tcp bpf programs. Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201009070325.226855-1-tehnerd@tehnerd.com
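A minimal sockops sketch of the new option; the 128 KiB value is illustrative and SOL_TCP/TCP_NOTSENT_LOWAT are open-coded with their standard values for brevity:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define SOL_TCP           6
#define TCP_NOTSENT_LOWAT 25

char LICENSE[] SEC("license") = "GPL";

SEC("sockops")
int set_notsent_lowat(struct bpf_sock_ops *skops)
{
    int lowat = 128 * 1024; /* cap un-sent data buffered below the socket */

    if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB ||
        skops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
        bpf_setsockopt(skops, SOL_TCP, TCP_NOTSENT_LOWAT,
                       &lowat, sizeof(lowat));
    return 1;
}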
2020-10-08  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Small conflict around locking in rxrpc_process_event() - channel_lock moved to bundle in next, while state lock needs _bh() from net. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-02  bpf, sockmap: Add skb_adjust_room to pop bytes off ingress payload  (John Fastabend)
This implements a new helper skb_adjust_room() so users can push/pop extra bytes from a BPF_SK_SKB_STREAM_VERDICT program. Some protocols may include headers and other information that we may not want to include when doing a redirect from a BPF_SK_SKB_STREAM_VERDICT program. One use case is to redirect TLS packets into a receive socket that doesn't expect TLS data. In TLS case the first 13B or so contain the protocol header. With KTLS the payload is decrypted so we should be able to redirect this to a receiving socket, but the receiving socket may not be expecting to receive a TLS header and discard the data. Using the above helper we can pop the header off and put an appropriate header on the payload. This allows for creating a proxy between protocols without extra hops through the stack or userspace. So in order to fix this case add skb_adjust_room() so users can strip the header. After this the user can strip the header and an unmodified receiver thread will work correctly when data is redirected into the ingress path of a sock. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/160160099197.7052.8443193973242831692.stgit@john-Precision-5820-Tower
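A rough sk_skb verdict sketch of the record-header case described above; the 13-byte pop, the sockmap layout, the section name, and the zero mode/flags accepted by the sk_skb variant of the helper are assumptions for illustration only:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_SOCKMAP);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} sock_map SEC(".maps");

char LICENSE[] SEC("license") = "GPL";

SEC("sk_skb/stream_verdict")
int strip_record_hdr(struct __sk_buff *skb)
{
    /* Pop ~13 bytes of record header off the front of the payload so an
     * unmodified receiver can consume the redirected data. */
    if (bpf_skb_adjust_room(skb, -13, 0, 0))
        return SK_DROP;
    return bpf_sk_redirect_map(skb, &sock_map, 0, 0);
}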
2020-10-01  bpf: Fix "unresolved symbol" build error with resolve_btfids  (Yonghong Song)
Michal reported a build failure like the one below: BTFIDS vmlinux FAILED unresolved symbol tcp_timewait_sock make[1]: *** [/.../linux-5.9-rc7/Makefile:1176: vmlinux] Error 255 This error can be triggered when the config has CONFIG_NET enabled but CONFIG_INET disabled. In this case, there is no user of structs inet_timewait_sock and tcp_timewait_sock and hence vmlinux BTF types are not generated for these two structures. To fix the problem, let us force BTF generation for these two structures with BTF_TYPE_EMIT. Fixes: fce557bcef11 ("bpf: Make btf_sock_ids global") Reported-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20201001051339.2549085-1-yhs@fb.com
2020-09-30  bpf: Add redirect_neigh helper as redirect drop-in  (Daniel Borkmann)
Add a redirect_neigh() helper as redirect() drop-in replacement for the xmit side. Main idea for the helper is to be very similar in semantics to the latter just that the skb gets injected into the neighboring subsystem in order to let the stack do the work it knows best anyway to populate the L2 addresses of the packet and then hand over to dev_queue_xmit() as redirect() does. This solves two bigger items: i) skbs don't need to go up to the stack on the host facing veth ingress side for traffic egressing the container to achieve the same for populating L2 which also has the huge advantage that ii) the skb->sk won't get orphaned in ip_rcv_core() when entering the IP routing layer on the host stack. Given that skb->sk neither gets orphaned when crossing the netns as per 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing the skb.") the helper can then push the skbs directly to the phys device where FQ scheduler can do its work and TCP stack gets proper backpressure given we hold on to skb->sk as long as skb is still residing in queues. With the helper used in BPF data path to then push the skb to the phys device, I observed a stable/consistent TCP_STREAM improvement on veth devices for traffic going container -> host -> host -> container from ~10Gbps to ~15Gbps for a single stream in my test environment. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: David Ahern <dsahern@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/bpf/f207de81629e1724899b73b8112e0013be782d35.1601477936.git.daniel@iogearbox.net
2020-09-30  bpf, net: Rework cookie generator as per-cpu one  (Daniel Borkmann)
With its use in BPF, the cookie generator can be called very frequently in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg) and attached to the root cgroup, for example, when used in v1/v2 mixed environments. In particular, when there's a high churn on sockets in the system there can be many parallel requests to the bpf_get_socket_cookie() and bpf_get_netns_cookie() helpers which then cause contention on the atomic counter. As similarly done in f991bd2e1421 ("fs: introduce a per-cpu last_ino allocator"), add a small helper library that both can use for the 64 bit counters. Given this can be called from different contexts, we also need to deal with potential nested calls even though in practice they are considered extremely rare. One idea as suggested by Eric Dumazet was to use a reverse counter for this situation since we don't expect 64 bit overflows anyways; that way, we can avoid bigger gaps in the 64 bit counter space compared to just batch-wise increase. Even on machines with small number of cores (e.g. 4) the cookie generation shrinks from min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run in parallel from multiple CPUs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net
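The batching idea, sketched schematically in plain user-space C; this is not the kernel code, and the names, batch size, and the thread-local stand-in for per-CPU data are all illustrative:

#include <stdatomic.h>

#define COOKIE_BATCH 4096ULL

static _Atomic unsigned long long shared_last;
/* thread-local state stands in for per-CPU state here */
static _Thread_local unsigned long long local_next, local_end;

unsigned long long cookie_next(void)
{
    if (local_next == local_end) {
        /* one contended atomic per batch instead of per cookie */
        local_end  = atomic_fetch_add(&shared_last, COOKIE_BATCH) + COOKIE_BATCH;
        local_next = local_end - COOKIE_BATCH;
    }
    return ++local_next;
}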
2020-09-30  bpf: Add classid helper only based on skb->sk  (Daniel Borkmann)
Similarly to 5a52ae4e32a6 ("bpf: Allow to retrieve cgroup v1 classid from v2 hooks"), add a helper to retrieve cgroup v1 classid solely based on the skb->sk, so it can be used as key as part of BPF map lookups out of tc from host ns, in particular given the skb->sk is retained these days when crossing net ns thanks to 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing the skb."). This is similar to bpf_skb_cgroup_id() which implements the same for v2. Kubernetes ecosystem is still operating on v1 however, hence net_cls needs to be used there until this can be dropped in with the v2 helper of bpf_skb_cgroup_id(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/ed633cf27a1c620e901c5aa99ebdefb028dce600.1601477936.git.daniel@iogearbox.net
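A small tc sketch of using the helper's return value as a map key; the map layout and byte-counter semantics are illustrative:

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u64);     /* net_cls classid */
    __type(value, __u64);   /* byte counter */
} bytes_per_class SEC(".maps");

char LICENSE[] SEC("license") = "GPL";

SEC("tc")
int account(struct __sk_buff *skb)
{
    __u64 classid = bpf_skb_cgroup_classid(skb);
    __u64 *bytes, zero = 0;

    bytes = bpf_map_lookup_elem(&bytes_per_class, &classid);
    if (!bytes) {
        bpf_map_update_elem(&bytes_per_class, &classid, &zero, BPF_NOEXIST);
        bytes = bpf_map_lookup_elem(&bytes_per_class, &classid);
    }
    if (bytes)
        __sync_fetch_and_add(bytes, skb->len);
    return TC_ACT_OK;
}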
2020-09-25  bpf: Change bpf_sk_assign to accept ARG_PTR_TO_BTF_ID_SOCK_COMMON  (Martin KaFai Lau)
This patch changes the bpf_sk_assign() to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer returned by the bpf_skc_to_*() helpers also. The bpf_sk_lookup_assign() is taking ARG_PTR_TO_SOCKET_"OR_NULL". Meaning it specifically takes a literal NULL. ARG_PTR_TO_BTF_ID_SOCK_COMMON does not allow a literal NULL, so another ARG type is required for this purpose and another follow-up patch can be used if there is such need. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200925000415.3857374-1-kafai@fb.com
2020-09-25  bpf: Change bpf_tcp_*_syncookie to accept ARG_PTR_TO_BTF_ID_SOCK_COMMON  (Martin KaFai Lau)
This patch changes the bpf_tcp_*_syncookie() to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer returned by the bpf_skc_to_*() helpers also. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Lorenz Bauer <lmb@cloudflare.com> Link: https://lore.kernel.org/bpf/20200925000409.3856725-1-kafai@fb.com
2020-09-25  bpf: Change bpf_sk_release and bpf_sk_*cgroup_id to accept ARG_PTR_TO_BTF_ID_SOCK_COMMON  (Martin KaFai Lau)
The previous patch allows the networking bpf prog to use the bpf_skc_to_*() helpers to get a PTR_TO_BTF_ID socket pointer, e.g. "struct tcp_sock *". It allows the bpf prog to read all the fields of the tcp_sock. This patch changes the bpf_sk_release() and bpf_sk_*cgroup_id() to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer returned by the bpf_skc_to_*() helpers also. For example, the following will work: sk = bpf_skc_lookup_tcp(skb, tuple, tuplen, BPF_F_CURRENT_NETNS, 0); if (!sk) return; tp = bpf_skc_to_tcp_sock(sk); if (!tp) { bpf_sk_release(sk); return; } lsndtime = tp->lsndtime; /* Pass tp to bpf_sk_release() will also work */ bpf_sk_release(tp); Since PTR_TO_BTF_ID could be NULL, the helper taking ARG_PTR_TO_BTF_ID_SOCK_COMMON has to check for NULL at runtime. A btf_id of "struct sock" may not always mean a fullsock. Regardless of whether the helper's running context gets a non-fullsock or not, considering the fullsock check/handling is pretty cheap, it is better to keep the same verifier expectation that a helper taking ARG_PTR_TO_BTF_ID* will be able to handle the minisock situation. In the bpf_sk_*cgroup_id() case, it will try to get a fullsock by using sk_to_full_sk() as its skb variant bpf_sk"b"_*cgroup_id() has already been doing. bpf_sk_release can already handle minisock, so nothing special has to be done. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200925000356.3856047-1-kafai@fb.com
2020-09-25  bpf: Enable bpf_skc_to_* sock casting helper to networking prog type  (Martin KaFai Lau)
There is a constant need to add more fields into the bpf_tcp_sock for the bpf programs running at tc, sock_ops...etc. A current workaround could be to use bpf_probe_read_kernel(). However, other than making another helper call for reading each field and missing CO-RE, it is also not as intuitive to use as directly reading "tp->lsndtime" for example. While already having perfmon cap to do bpf_probe_read_kernel(), it will be much easier if the bpf prog can directly read from the tcp_sock. This patch tries to do that by using the existing casting-helpers bpf_skc_to_*() whose func_proto returns a btf_id. For example, the func_proto of bpf_skc_to_tcp_sock returns the btf_id of the kernel "struct tcp_sock". These helpers are also added to is_ptr_cast_function(). It ensures the returning reg (BPF_REF_0) will also carries the ref_obj_id. That will keep the ref-tracking works properly. The bpf_skc_to_* helpers are made available to most of the bpf prog types in filter.c. The bpf_skc_to_* helpers will be limited by perfmon cap. This patch adds a ARG_PTR_TO_BTF_ID_SOCK_COMMON. The helper accepting this arg can accept a btf-id-ptr (PTR_TO_BTF_ID + &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON]) or a legacy-ctx-convert-skc-ptr (PTR_TO_SOCK_COMMON). The bpf_skc_to_*() helpers are changed to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will accept pointer obtained from skb->sk. Instead of specifying both arg_type and arg_btf_id in the same func_proto which is how the current ARG_PTR_TO_BTF_ID does, the arg_btf_id of the new ARG_PTR_TO_BTF_ID_SOCK_COMMON is specified in the compatible_reg_types[] in verifier.c. The reason is the arg_btf_id is always the same. Discussion in this thread: https://lore.kernel.org/bpf/20200922070422.1917351-1-kafai@fb.com/ The ARG_PTR_TO_BTF_ID_ part gives a clear expectation that the helper is expecting a PTR_TO_BTF_ID which could be NULL. This is the same behavior as the existing helper taking ARG_PTR_TO_BTF_ID. The _SOCK_COMMON part means the helper is also expecting the legacy SOCK_COMMON pointer. By excluding the _OR_NULL part, the bpf prog cannot call helper with a literal NULL which doesn't make sense in most cases. e.g. bpf_skc_to_tcp_sock(NULL) will be rejected. All PTR_TO_*_OR_NULL reg has to do a NULL check first before passing into the helper or else the bpf prog will be rejected. This behavior is nothing new and consistent with the current expectation during bpf-prog-load. [ ARG_PTR_TO_BTF_ID_SOCK_COMMON will be used to replace ARG_PTR_TO_SOCK* of other existing helpers later such that those existing helpers can take the PTR_TO_BTF_ID returned by the bpf_skc_to_*() helpers. The only special case is bpf_sk_lookup_assign() which can accept a literal NULL ptr. It has to be handled specially in another follow up patch if there is a need (e.g. by renaming ARG_PTR_TO_SOCKET_OR_NULL to ARG_PTR_TO_BTF_ID_SOCK_COMMON_OR_NULL). ] [ When converting the older helpers that take ARG_PTR_TO_SOCK* in the later patch, if the kernel does not support BTF, ARG_PTR_TO_BTF_ID_SOCK_COMMON will behave like ARG_PTR_TO_SOCK_COMMON because no reg->type could have PTR_TO_BTF_ID in this case. It is not a concern for the newer-btf-only helper like the bpf_skc_to_*() here though because these helpers must require BTF vmlinux to begin with. ] Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200925000350.3855720-1-kafai@fb.com
2020-09-23  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (David S. Miller)
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-09-23 The following pull-request contains BPF updates for your *net-next* tree. We've added 95 non-merge commits during the last 22 day(s) which contain a total of 124 files changed, 4211 insertions(+), 2040 deletions(-). The main changes are: 1) Full multi function support in libbpf, from Andrii. 2) Refactoring of function argument checks, from Lorenz. 3) Make bpf_tail_call compatible with functions (subprograms), from Maciej. 4) Program metadata support, from YiFei. 5) bpf iterator optimizations, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-22  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (David S. Miller)
Two minor conflicts: 1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing its initial assignment. 2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-21  bpf: Allow specifying a BTF ID per argument in function protos  (Lorenz Bauer)
Function prototypes using ARG_PTR_TO_BTF_ID currently use two ways to signal which BTF IDs are acceptable. First, bpf_func_proto.btf_id is an array of IDs, one for each argument. This array is only accessed up to the highest numbered argument that uses ARG_PTR_TO_BTF_ID and may therefore be less than five arguments long. It usually points at a BTF_ID_LIST. Second, check_btf_id is a function pointer that is called by the verifier if present. It gets the actual BTF ID of the register, and the argument number we're currently checking. It turns out that the only user, check_arg_btf_id, ignores the argument, and is simply used to check whether the BTF ID has a struct sock_common at its start. Replace both of these mechanisms with an explicit BTF ID for each argument in a function proto. Thanks to btf_struct_ids_match this is very flexible: check_arg_btf_id can be replaced by requiring struct sock_common. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200921121227.255763-5-lmb@cloudflare.com
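An illustrative shape of a func_proto after this change, with the BTF ID tied to the argument explicitly; the helper itself (bpf_demo) is made up, only the field/macro names are intended to match the mechanism described here:

BTF_ID_LIST(demo_btf_ids)
BTF_ID(struct, sock_common)

static const struct bpf_func_proto bpf_demo_proto = {
    .func         = bpf_demo,
    .gpl_only     = false,
    .ret_type     = RET_INTEGER,
    .arg1_type    = ARG_PTR_TO_BTF_ID,
    .arg1_btf_id  = &demo_btf_ids[0],   /* BTF ID for argument 1 */
};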
2020-09-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (David S. Miller)
Alexei Starovoitov says: ==================== pull-request: bpf 2020-09-15 The following pull-request contains BPF updates for your *net* tree. We've added 12 non-merge commits during the last 19 day(s) which contain a total of 10 files changed, 47 insertions(+), 38 deletions(-). The main changes are: 1) docs/bpf fixes, from Andrii. 2) ld_abs fix, from Daniel. 3) socket casting helpers fix, from Martin. 4) hash iterator fixes, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15  bpf: Bpf_skc_to_* casting helpers require a NULL check on sk  (Martin KaFai Lau)
The bpf_skc_to_* type casting helpers are available to BPF_PROG_TYPE_TRACING. The traced PTR_TO_BTF_ID may be NULL. For example, the skb->sk may be NULL. Thus, these casting helpers need to check "!sk" also and this patch fixes them. Fixes: 0d4fad3e57df ("bpf: Add bpf_skc_to_udp6_sock() helper") Fixes: 478cfbdf5f13 ("bpf: Add bpf_skc_to_{tcp, tcp_timewait, tcp_request}_sock() helpers") Fixes: af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200915182959.241101-1-kafai@fb.com
2020-09-14  ipv4: Initialize flowi4_multipath_hash in data path  (David Ahern)
flowi4_multipath_hash was added by the commit referenced below for tunnels. Unfortunately, the patch did not initialize the new field for several fast path lookups that do not initialize the entire flow struct to 0. Fix those locations. Currently, flowi4_multipath_hash is random garbage and affects the hash value computed by fib_multipath_hash for multipath selection. Fixes: 24ba14406c5c ("route: Add multipath_hash in flowi_common to make user-define hash") Signed-off-by: David Ahern <dsahern@gmail.com> Cc: wenxu <wenxu@ucloud.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10  tcp: simplify _bpf_setsockopt(): Remove flags argument  (Neal Cardwell)
Now that the previous patches have removed the code that uses the flags argument to _bpf_setsockopt(), we can remove that argument. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Kevin Yang <yyd@google.com> Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-10  tcp: simplify tcp_set_congestion_control(): Always reinitialize  (Neal Cardwell)
Now that the previous patches ensure that all call sites for tcp_set_congestion_control() want to initialize congestion control, we can simplify tcp_set_congestion_control() by removing the reinit argument and the code to support it. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Kevin Yang <yyd@google.com> Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-10  tcp: Simplify EBPF TCP_CONGESTION to always init CC  (Neal Cardwell)
Now that the previous patch ensures we don't initialize the congestion control twice, when EBPF sets the congestion control algorithm at connection establishment we can simplify the code by simply initializing the congestion control module at that time. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Kevin Yang <yyd@google.com> Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-08  bpf: Fix clobbering of r2 in bpf_gen_ld_abs  (Daniel Borkmann)
Bryce reported that he saw the following with: 0: r6 = r1 1: r1 = 12 2: r0 = *(u16 *)skb[r1] The xlated sequence was incorrectly clobbering r2 with pointer value of r6 ... 0: (bf) r6 = r1 1: (b7) r1 = 12 2: (bf) r1 = r6 3: (bf) r2 = r1 4: (85) call bpf_skb_load_helper_16_no_cache#7692160 ... and hence call to the load helper never succeeded given the offset was too high. Fix it by reordering the load of r6 to r1. Other than that the insn has similar calling convention than BPF helpers, that is, r0 - r5 are scratch regs, so nothing else affected after the insn. Fixes: e0cea7ce988c ("bpf: implement ld_abs/ld_ind in native bpf") Reported-by: Bryce Kahle <bryce.kahle@datadoghq.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/cace836e4d07bb63b1a53e49c5dfb238a040c298.1599512096.git.daniel@iogearbox.net
2020-09-04  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
We got slightly different patches removing a double word in a comment in net/ipv4/raw.c - picked the version from net. Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached values instead of VNIC login response buffer (following what commit 507ebe6444a4 ("ibmvnic: Fix use-after-free of VNIC login response buffer") did). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-08-24  tcp: bpf: Optionally store mac header in TCP_SAVE_SYN  (Martin KaFai Lau)
This patch is adapted from Eric's patch in an earlier discussion [1]. The TCP_SAVE_SYN currently only stores the network header and tcp header. This patch allows it to optionally store the mac header also if the setsockopt's optval is 2. It requires one more bit for the "save_syn" bit field in tcp_sock. This patch achieves this by moving the syn_smc bit next to the is_mptcp. The syn_smc is currently used with the TCP experimental option. Since syn_smc is only used when CONFIG_SMC is enabled, this patch also puts the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did with "IS_ENABLED(CONFIG_MPTCP)". The mac_hdrlen is also stored in the "struct saved_syn" to allow a quick offset from the bpf prog if it chooses to start getting from the network header or the tcp header. [1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/ Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com
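A sketch of the sockops flow this enables; the buffer size, section name, and open-coded constants are illustrative, and TCP_BPF_SYN_MAC is used here as the getsockopt variant that starts from the mac header:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define SOL_TCP      6
#define TCP_SAVE_SYN 27

char LICENSE[] SEC("license") = "GPL";

SEC("sockops")
int save_syn_with_mac(struct bpf_sock_ops *skops)
{
    int save = 2;       /* 2 = also keep the mac header of the SYN */
    char hdrs[128];     /* illustrative buffer */

    switch (skops->op) {
    case BPF_SOCK_OPS_TCP_LISTEN_CB:
        bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN, &save, sizeof(save));
        break;
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
        /* read the saved SYN back, starting from the mac header */
        bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_MAC, hdrs, sizeof(hdrs));
        break;
    }
    return 1;
}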
2020-08-24  bpf: tcp: Allow bpf prog to write and parse TCP header option  (Martin KaFai Lau)
[ Note: The TCP changes here is mainly to implement the bpf pieces into the bpf_skops_*() functions introduced in the earlier patches. ] The earlier effort in BPF-TCP-CC allows the TCP Congestion Control algorithm to be written in BPF. It opens up opportunities to allow a faster turnaround time in testing/releasing new congestion control ideas to production environment. The same flexibility can be extended to writing TCP header option. It is not uncommon that people want to test new TCP header option to improve the TCP performance. Another use case is for data-center that has a more controlled environment and has more flexibility in putting header options for internal only use. For example, we want to test the idea in putting maximum delay ACK in TCP header option which is similar to a draft RFC proposal [1]. This patch introduces the necessary BPF API and use them in the TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse and write TCP header options. It currently supports most of the TCP packet except RST. Supported TCP header option: ─────────────────────────── This patch allows the bpf-prog to write any option kind. Different bpf-progs can write its own option by calling the new helper bpf_store_hdr_opt(). The helper will ensure there is no duplicated option in the header. By allowing bpf-prog to write any option kind, this gives a lot of flexibility to the bpf-prog. Different bpf-prog can write its own option kind. It could also allow the bpf-prog to support a recently standardized option on an older kernel. Sockops Callback Flags: ────────────────────── The bpf program will only be called to parse/write tcp header option if the following newly added callback flags are enabled in tp->bpf_sock_ops_cb_flags: BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG A few words on the PARSE CB flags. When the above PARSE CB flags are turned on, the bpf-prog will be called on packets received at a sk that has at least reached the ESTABLISHED state. The parsing of the SYN-SYNACK-ACK will be discussed in the "3 Way HandShake" section. The default is off for all of the above new CB flags, i.e. the bpf prog will not be called to parse or write bpf hdr option. There are details comment on these new cb flags in the UAPI bpf.h. sock_ops->skb_data and bpf_load_hdr_opt() ───────────────────────────────────────── sock_ops->skb_data and sock_ops->skb_data_end covers the whole TCP header and its options. They are read only. The new bpf_load_hdr_opt() helps to read a particular option "kind" from the skb_data. Please refer to the comment in UAPI bpf.h. It has details on what skb_data contains under different sock_ops->op. 3 Way HandShake ─────────────── The bpf-prog can learn if it is sending SYN or SYNACK by reading the sock_ops->skb_tcp_flags. * Passive side When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB), the received SYN skb will be available to the bpf prog. The bpf prog can use the SYN skb (which may carry the header option sent from the remote bpf prog) to decide what bpf header option should be written to the outgoing SYNACK skb. The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*). More on this later. Also, the bpf prog can learn if it is in syncookie mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE). The bpf prog can store the received SYN pkt by using the existing bpf_setsockopt(TCP_SAVE_SYN). The example in a later patch does it. 
[ Note that the fullsock here is a listen sk, bpf_sk_storage is not very useful here since the listen sk will be shared by many concurrent connection requests. Extending bpf_sk_storage support to request_sock will add weight to the minisock and it is not necessary better than storing the whole ~100 bytes SYN pkt. ] When the connection is established, the bpf prog will be called in the existing PASSIVE_ESTABLISHED_CB callback. At that time, the bpf prog can get the header option from the saved syn and then apply the needed operation to the newly established socket. The later patch will use the max delay ack specified in the SYN header and set the RTO of this newly established connection as an example. The received ACK (that concludes the 3WHS) will also be available to the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data. It could be useful in syncookie scenario. More on this later. There is an existing getsockopt "TCP_SAVED_SYN" to return the whole saved syn pkt which includes the IP[46] header and the TCP header. A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to start getting from, e.g. starting from TCP header, or from IP[46] header. The new getsockopt(TCP_BPF_SYN*) will also know where it can get the SYN's packet from: - (a) the just received syn (available when the bpf prog is writing SYNACK) and it is the only way to get SYN during syncookie mode. or - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other existing CB). The bpf prog does not need to know where the SYN pkt is coming from. The getsockopt(TCP_BPF_SYN*) will hide this details. Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to bpf_load_hdr_opt() to read a particular header option from the SYN packet. * Fastopen Fastopen should work the same as the regular non fastopen case. This is a test in a later patch. * Syncookie For syncookie, the later example patch asks the active side's bpf prog to resend the header options in ACK. The server can use bpf_load_hdr_opt() to look at the options in this received ACK during PASSIVE_ESTABLISHED_CB. * Active side The bpf prog will get a chance to write the bpf header option in the SYN packet during WRITE_HDR_OPT_CB. The received SYNACK pkt will also be available to the bpf prog during the existing ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data and bpf_load_hdr_opt(). * Turn off header CB flags after 3WHS If the bpf prog does not need to write/parse header options beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags to avoid being called for header options. Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on so that the kernel will only call it when there is option that the kernel cannot handle. [1]: draft-wang-tcpm-low-latency-opt-00 https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
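A skeletal sockops sketch tying the pieces above together; the new *_HDR_OPT_* callbacks, flags, and helpers are the ones introduced by this series, while the option kind/payload and the points at which the callback flags are enabled are illustrative, with error handling elided:

struct demo_opt {       /* made-up option layout */
    __u8  kind;         /* an unassigned kind, for illustration */
    __u8  len;          /* total option length: 4 */
    __u16 data;
} __attribute__((packed));

SEC("sockops")
int hdr_opt_demo(struct bpf_sock_ops *skops)
{
    struct demo_opt opt = { .kind = 65, .len = 4, .data = 1 };

    switch (skops->op) {
    case BPF_SOCK_OPS_TCP_CONNECT_CB:
    case BPF_SOCK_OPS_TCP_LISTEN_CB:
        /* opt in to the new write/parse callbacks */
        bpf_sock_ops_cb_flags_set(skops,
            skops->bpf_sock_ops_cb_flags |
            BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG |
            BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG);
        break;
    case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
        /* reserve room in the outgoing segment for our option */
        bpf_reserve_hdr_opt(skops, sizeof(opt), 0);
        break;
    case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
        bpf_store_hdr_opt(skops, &opt, sizeof(opt), 0);
        break;
    case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
        /* search the received header for our kind */
        bpf_load_hdr_opt(skops, &opt, sizeof(opt), 0);
        break;
    }
    return 1;
}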
2020-08-24  bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8  (Martin KaFai Lau)
A later patch needs to add a few pointers and a few u8 to sock_ops_kern. Hence, this patch saves some spaces by moving some of the existing members from u32 to u8 so that the later patch can still fit everything in a cacheline. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200820190058.2885640-1-kafai@fb.com
2020-08-24  tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt  (Martin KaFai Lau)
This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog to set the min rto of a connection. It could be used together with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX). A later selftest patch will communicate the max delay ack in a bpf tcp header option and then the receiving side can use bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com
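A combined sketch of this option and the TCP_BPF_DELACK_MAX entry below; which side sets which option, and the raw values, are illustrative (a real deployment communicates the delay-ack budget to the peer through a TCP header option as described in the entries above), with SOL_TCP open-coded as in the earlier sketches:

SEC("sockops")
int tune_timers(struct bpf_sock_ops *skops)
{
    int val;

    switch (skops->op) {
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
        /* one side caps how long it may delay ACKs ... */
        val = 10000;    /* illustrative value */
        bpf_setsockopt(skops, SOL_TCP, TCP_BPF_DELACK_MAX,
                       &val, sizeof(val));
        break;
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
        /* ... and its peer can then afford a shorter min RTO */
        val = 20000;    /* illustrative value */
        bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RTO_MIN,
                       &val, sizeof(val));
        break;
    }
    return 1;
}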
2020-08-24  tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt  (Martin KaFai Lau)
This change is mostly from an internal patch and adapts it from sysctl config to the bpf_setsockopt setup. The bpf_prog can set the max delay ack by using bpf_setsockopt(TCP_BPF_DELACK_MAX). This max delay ack can be communicated to its peer through bpf header option. The receiving peer can then use this max delay ack and set a potentially lower rto by using bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced in the next patch. Another later selftest patch will also use it like the above to show how to write and parse bpf tcp header option. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com
2020-08-24  tcp: Use a struct to represent a saved_syn  (Martin KaFai Lau)
The TCP_SAVE_SYN has both the network header and tcp header. The total length of the saved syn packet is currently stored in the first 4 bytes (u32) of an array and the actual packet data is stored after that. A later patch will add a bpf helper that allows to get the tcp header alone from the saved syn without the network header. It will be more convenient to have a direct offset to a specific header instead of re-parsing it. This requires to separately store the network hdrlen. The total header length (i.e. network + tcp) is still needed for the current usage in getsockopt. Although this total length can be obtained by looking into the tcphdr and then get the (th->doff << 2), this patch chooses to directly store the tcp hdrlen in the second four bytes of this newly created "struct saved_syn". By using a new struct, it can give a readable name to each individual header length. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com
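The layout this describes, reconstructed as a sketch; field names follow the description here, and the 2020-08-24 mac-header entry above extends this struct further:

struct saved_syn {
    u32 network_hdrlen;  /* first four bytes: network header length */
    u32 tcp_hdrlen;      /* second four bytes: tcp header length */
    u8  data[];          /* the saved SYN packet itself */
};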
2020-08-23  treewide: Use fallthrough pseudo-keyword  (Gustavo A. R. Silva)
Replace the existing /* fall through */ comments and their variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings where applicable. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
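The shape of the conversion, as a minimal sketch (case labels and handlers are made up):

switch (opt) {
case OPT_FOO:
    handle_foo();
    fallthrough;    /* previously a "fall through" comment */
case OPT_BAR:
    handle_bar();
    break;
}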
2020-08-13  bpf: sock_ops sk access may stomp registers when dst_reg = src_reg  (John Fastabend)
Similar to patch ("bpf: sock_ops ctx access may stomp registers") if the src_reg = dst_reg when reading the sk field of a sock_ops struct we generate xlated code, 53: (61) r9 = *(u32 *)(r9 +28) 54: (15) if r9 == 0x0 goto pc+3 56: (79) r9 = *(u64 *)(r9 +0) This stomps on the r9 reg to do the sk_fullsock check and then when reading the skops->sk field instead of the sk pointer we get the sk_fullsock. To fix use similar pattern noted in the previous fix and use the temp field to save/restore a register used to do sk_fullsock check. After the fix the generated xlated code reads, 52: (7b) *(u64 *)(r9 +32) = r8 53: (61) r8 = *(u32 *)(r9 +28) 54: (15) if r9 == 0x0 goto pc+3 55: (79) r8 = *(u64 *)(r9 +32) 56: (79) r9 = *(u64 *)(r9 +0) 57: (05) goto pc+1 58: (79) r8 = *(u64 *)(r9 +32) Here r9 register was in-use so r8 is chosen as the temporary register. In line 52 r8 is saved in temp variable and at line 54 restored in case fullsock != 0. Finally we handle fullsock == 0 case by restoring at line 58. This adds a new macro SOCK_OPS_GET_SK it is almost possible to merge this with SOCK_OPS_GET_FIELD, but I found the extra branch logic a bit more confusing than just adding a new macro despite a bit of duplicating code. Fixes: 1314ef561102e ("bpf: export bpf_sock for BPF_PROG_TYPE_SOCK_OPS prog type") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/159718349653.4728.6559437186853473612.stgit@john-Precision-5820-Tower
2020-08-13  bpf: sock_ops ctx access may stomp registers in corner case  (John Fastabend)
I had a sockmap program that after doing some refactoring started spewing this splat at me: [18610.807284] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001 [...] [18610.807359] Call Trace: [18610.807370] ? 0xffffffffc114d0d5 [18610.807382] __cgroup_bpf_run_filter_sock_ops+0x7d/0xb0 [18610.807391] tcp_connect+0x895/0xd50 [18610.807400] tcp_v4_connect+0x465/0x4e0 [18610.807407] __inet_stream_connect+0xd6/0x3a0 [18610.807412] ? __inet_stream_connect+0x5/0x3a0 [18610.807417] inet_stream_connect+0x3b/0x60 [18610.807425] __sys_connect+0xed/0x120 After some debugging I was able to build this simple reproducer, __section("sockops/reproducer_bad") int bpf_reproducer_bad(struct bpf_sock_ops *skops) { volatile __maybe_unused __u32 i = skops->snd_ssthresh; return 0; } And along the way noticed that below program ran without splat, __section("sockops/reproducer_good") int bpf_reproducer_good(struct bpf_sock_ops *skops) { volatile __maybe_unused __u32 i = skops->snd_ssthresh; volatile __maybe_unused __u32 family; compiler_barrier(); family = skops->family; return 0; } So I decided to check out the code we generate for the above two programs and noticed each generates the BPF code you would expect, 0000000000000000 <bpf_reproducer_bad>: ; volatile __maybe_unused __u32 i = skops->snd_ssthresh; 0: r1 = *(u32 *)(r1 + 96) 1: *(u32 *)(r10 - 4) = r1 ; return 0; 2: r0 = 0 3: exit 0000000000000000 <bpf_reproducer_good>: ; volatile __maybe_unused __u32 i = skops->snd_ssthresh; 0: r2 = *(u32 *)(r1 + 96) 1: *(u32 *)(r10 - 4) = r2 ; family = skops->family; 2: r1 = *(u32 *)(r1 + 20) 3: *(u32 *)(r10 - 8) = r1 ; return 0; 4: r0 = 0 5: exit So we get reasonable assembly, but still something was causing the null pointer dereference. So, we load the programs and dump the xlated version observing that line 0 above 'r* = *(u32 *)(r1 +96)' is going to be translated by the skops access helpers. int bpf_reproducer_bad(struct bpf_sock_ops * skops): ; volatile __maybe_unused __u32 i = skops->snd_ssthresh; 0: (61) r1 = *(u32 *)(r1 +28) 1: (15) if r1 == 0x0 goto pc+2 2: (79) r1 = *(u64 *)(r1 +0) 3: (61) r1 = *(u32 *)(r1 +2340) ; volatile __maybe_unused __u32 i = skops->snd_ssthresh; 4: (63) *(u32 *)(r10 -4) = r1 ; return 0; 5: (b7) r0 = 0 6: (95) exit int bpf_reproducer_good(struct bpf_sock_ops * skops): ; volatile __maybe_unused __u32 i = skops->snd_ssthresh; 0: (61) r2 = *(u32 *)(r1 +28) 1: (15) if r2 == 0x0 goto pc+2 2: (79) r2 = *(u64 *)(r1 +0) 3: (61) r2 = *(u32 *)(r2 +2340) ; volatile __maybe_unused __u32 i = skops->snd_ssthresh; 4: (63) *(u32 *)(r10 -4) = r2 ; family = skops->family; 5: (79) r1 = *(u64 *)(r1 +0) 6: (69) r1 = *(u16 *)(r1 +16) ; family = skops->family; 7: (63) *(u32 *)(r10 -8) = r1 ; return 0; 8: (b7) r0 = 0 9: (95) exit Then we look at lines 0 and 2 above. In the good case we do the zero check in r2 and then load 'r1 + 0' at line 2. Do a quick cross-check into the bpf_sock_ops check and we can confirm that is the 'struct sock *sk' pointer field. But, in the bad case, 0: (61) r1 = *(u32 *)(r1 +28) 1: (15) if r1 == 0x0 goto pc+2 2: (79) r1 = *(u64 *)(r1 +0) Oh no, we read 'r1 +28' into r1, this is skops->fullsock and then in line 2 we read the 'r1 +0' as a pointer. Now jumping back to our spat, [18610.807284] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001 The 0x01 makes sense because that is exactly the fullsock value. And its not a valid dereference so we splat. 
To fix we need to guard the case when a program is doing a sock_ops field access with src_reg == dst_reg. This is already handled in the load case where the ctx_access handler uses a tmp register being careful to store the old value and restore it. To fix the get case test if src_reg == dst_reg and in this case do the is_fullsock test in the temporary register. Remembering to restore the temporary register before writing to either dst_reg or src_reg to avoid smashing the pointer into the struct holding the tmp variable. Adding this inline code to test_tcpbpf_kern will now be generated correctly from, 9: r2 = *(u32 *)(r2 + 96) to xlated code, 12: (7b) *(u64 *)(r2 +32) = r9 13: (61) r9 = *(u32 *)(r2 +28) 14: (15) if r9 == 0x0 goto pc+4 15: (79) r9 = *(u64 *)(r2 +32) 16: (79) r2 = *(u64 *)(r2 +0) 17: (61) r2 = *(u32 *)(r2 +2348) 18: (05) goto pc+1 19: (79) r9 = *(u64 *)(r2 +32) And in the normal case we keep the original code, because really this is an edge case. From this, 9: r2 = *(u32 *)(r6 + 96) to xlated code, 22: (61) r2 = *(u32 *)(r6 +28) 23: (15) if r2 == 0x0 goto pc+2 24: (79) r2 = *(u64 *)(r6 +0) 25: (61) r2 = *(u32 *)(r2 +2348) So three additional instructions if dst == src register, but I scanned my current code base and did not see this pattern anywhere so should not be a big deal. Further, it seems no one else has hit this or at least reported it so it must a fairly rare pattern. Fixes: 9b1f3d6e5af29 ("bpf: Refactor sock_ops_convert_ctx_access") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/159718347772.4728.2781381670567919577.stgit@john-Precision-5820-Tower
2020-07-31  bpf: Expose socket storage to BPF_PROG_TYPE_CGROUP_SOCK  (Stanislav Fomichev)
This lets us use socket storage from the following hooks: * BPF_CGROUP_INET_SOCK_CREATE * BPF_CGROUP_INET_SOCK_RELEASE * BPF_CGROUP_INET4_POST_BIND * BPF_CGROUP_INET6_POST_BIND Using existing 'bpf_sk_storage_get_proto' doesn't work because second argument is ARG_PTR_TO_SOCKET. Even though BPF_PROG_TYPE_CGROUP_