summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2019-06-03net: ena: arrange ena_probe() function variables in reverse christmas treeSameeh Jubran
Reverse christmas tree arrangement is when strings are written from longer to shorter with each line. Most of our functions are abiding this arrangement but this function does not. In this commit we arrange the variables of ena_probe() in reverse christmas tree. Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: Sameeh Jubran <sameehj@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03net: ena: replace free_tx/rx_ids union with single free_ids field in ena_ringSameeh Jubran
struct ena_ring holds a union of free_rx_ids and free_tx_ids. Both of the above fields mean the exact same thing and are used exactly the same way. Furthermore, these fields are always used with a prefix of the type of ring. So for tx it will be tx_ring->free_tx_ids, and for rx it will be rx_ring->free_rx_ids, which shows how redundant the "_tx" and "_rx" parts are. Furthermore still, this may lead to confusing code like where tx_ring->free_rx_ids which works correctly but looks like a mess. This commit removes the aforementioned redundancy by replacing the free_rx/tx_ids union with a single free_ids field. It also changes a single goto label name from err_free_tx_ids: to err_tx_free_ids: for consistency with the above new notation. Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: Sameeh Jubran <sameehj@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03net: ena: ethtool: add extra properties retrieval via get_priv_flagsArthur Kiyanovski
This commit adds a mechanism for exposing different device properties via ethtool's priv_flags. The strings are provided by the device and copied to user space through the driver. In this commit we: Add commands, structs and defines necessary for handling extra properties Add functions for: Allocation/destruction of a buffer for extra properties strings. Retreival of extra properties strings and flags from the network device. Handle the allocation of a buffer for extra properties strings. * Initialize buffer with extra properties strings from the network device at driver startup. Use ethtool's get_priv_flags to expose extra properties of the ENA device Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: Sameeh Jubran <sameehj@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03net: ena: add handling of llq max tx burst sizeSameeh Jubran
There is a maximum TX burst size that the ENA device can handle. It is exposed by the device to the driver and the driver needs to comply with it to avoid bugs. In this commit we: 1. Add ena_com_is_doorbell_needed(), which calculates the number of llq entries that will be used to hold a packet, and will return true if they exceed the number of allowed entries in a burst. If the function returns true, a doorbell needs to be invoked to send this packet in the next burst. 2. Follow the available entries in the current burst: - Every doorbell a new burst begins - With each write of an llq entry, the available entries in the current burst are decreased by 1. Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: Sameeh Jubran <sameehj@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03net: dsa: mv88e6xxx: make mv88e6xxx_g1_stats_wait staticRasmus Villemoes
mv88e6xxx_g1_stats_wait has no users outside global1.c, so make it static. Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03net: dsa: mv88e6xxx: fix comments and macro names in mv88e6390_g1_mgmt_rsvd2cpuRasmus Villemoes
The macros have an extraneous '800' (after 0180C2 there should be just six nibbles, with X representing one), while the comments have interchanged c2 and 80 and an extra :00. Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02nexthop: Add entry to MAINTAINERSDavid Ahern
Add entry to MAINTAINERS file for new nexthop code. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02Merge branch 'r8169-replace-several-function-pointers-with-direct-calls'David S. Miller
Heiner Kallweit says: ==================== r8169: replace several function pointers with direct calls This series removes most function pointers from struct rtl8169_private and uses direct calls instead. This simplifies the code and avoids the penalty of indirect calls in times of retpoline. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02r8169: avoid tso csum function indirectionHeiner Kallweit
Replace indirect call to tso_csum with direct calls. To do this we have to move rtl_chip_supports_csum_v2(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02r8169: remove struct jumbo_opsHeiner Kallweit
The jumbo_ops are used in just one place, so we can simplify the code and avoid the penalty of indirect calls in times of retpoline. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02r8169: remove struct mdio_opsHeiner Kallweit
The mdio_ops are used in just one place, so we can simplify the code and avoid the penalty of indirect calls in times of retpoline. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02r8169: improve r8169_csum_workaroundHeiner Kallweit
Use helper skb_is_gso() and simplify access to tx_dropped. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: ethernet: improve eth_platform_get_mac_addressHeiner Kallweit
pci_device_to_OF_node(to_pci_dev(dev)) is the same as dev->of_node, so we can simplify the code. In addition add an empty line before the return statement. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02Merge branch 'ifa_list-RCU'David S. Miller
Florian Westphal says: ==================== net: add rcu annotations for ifa_list v3: fix typo in patch1 commit message All other patches are unchanged. v2: remove ifa_list iteration in afs instead of conversion Eric Dumazet reported following problem: It looks that unless RTNL is held, accessing ifa_list needs proper RCU protection. indev->ifa_list can be changed under us by another cpu (which owns RTNL) [..] A proper rcu_dereference() with an happy sparse support would require adding __rcu attribute. This patch series does that: add __rcu to the ifa_list pointers. That makes sparse complain, so the series also adds the required rcu_assign_pointer/dereference helpers where needed. All patches except the last one are preparation work. Two new macros are introduced for in_ifaddr walks. Last patch adds the __rcu annotations and the assign_pointer/dereference helper calls. This patch is a bit large, but I found no better way -- other approaches (annotate-first or add helpers-first) all result in mid-series sparse warnings. This series is submitted vs. net-next rather than net for several reasons: 1. Its (mostly) compile-tested only 2. 3rd patch changes behaviour wrt. secondary addresses (see changelog) 3. The problem exists for a very long time (2004), so it doesn't seem to be urgent to fix this -- rcu use to free ifa_list predates the git era. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: ipv4: provide __rcu annotation for ifa_listFlorian Westphal
ifa_list is protected by rcu, yet code doesn't reflect this. Add the __rcu annotations and fix up all places that are now reported by sparse. I've done this in the same commit to not add intermediate patches that result in new warnings. Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02drivers: use in_dev_for_each_ifa_rtnl/rcuFlorian Westphal
Like previous patches, use the new iterator macros to avoid sparse warnings once proper __rcu annotations are added. Compile tested only. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: use new in_dev_ifa iteratorsFlorian Westphal
Use in_dev_for_each_ifa_rcu/rtnl instead. This prevents sparse warnings once proper __rcu annotations are added. Signed-off-by: Florian Westphal <fw@strlen.de> t di# Last commands done (6 commands done): Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02netfilter: use in_dev_for_each_ifa_rcuFlorian Westphal
Netfilter hooks are always running under rcu read lock, use the new iterator macro so sparse won't complain once we add proper __rcu annotations. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02devinet: use in_dev_for_each_ifa_rcu in more placesFlorian Westphal
This also replaces spots that used for_primary_ifa(). for_primary_ifa() aborts the loop on the first secondary address seen. Replace it with either the rcu or rtnl variant of in_dev_for_each_ifa(), but two places will now also consider secondary addresses too: inet_addr_onlink() and inet_ifa_byprefix(). I do not understand why they should ignore secondary addresses. Why would a secondary address not be considered 'on link'? When matching a prefix, why ignore a matching secondary address? Other places get converted as well, but gain "->flags & SECONDARY" check. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: inetdevice: provide replacement iterators for in_ifaddr walkFlorian Westphal
The ifa_list is protected either by rcu or rtnl lock, but the current iterators do not account for this. This adds two iterators as replacement, a later patch in the series will update them with the needed rcu/rtnl_dereference calls. Its not done in this patch yet to avoid sparse warnings -- the fields lack the proper __rcu annotation. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02afs: do not send list of client addressesFlorian Westphal
David Howells says: I'm told that there's not really any point populating the list. Current OpenAFS ignores it, as does AuriStor - and IBM AFS 3.6 will do the right thing. The list is actually useless as it's the client's view of the world, not the servers, so if there's any NAT in the way its contents are invalid. Further, it doesn't support IPv6 addresses. On that basis, feel free to make it an empty list and remove all the interface enumeration. V1 of this patch reworked the function to use a new helper for the ifa_list iteration to avoid sparse warnings once the proper __rcu annotations get added in struct in_device later. But, in light of the above, just remove afs_get_ipv4_interfaces. Compile tested only. Cc: David Howells <dhowells@redhat.com> Cc: linux-afs@lists.infradead.org Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02qed: remove redundant assignment to rcColin Ian King
The variable rc is assigned with a value that is never read and it is re-assigned a new value later on. The assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02Merge tag 'isdn-removal' of ↵David S. Miller
https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground Arnd Bergmann says: ==================== isdn: deprecate non-mISDN drivers When isdn4linux came up in the context of another patch series, I remembered that we had discussed removing it a while ago. It turns out that the suggestion from Karsten Keil wa to remove I4L in 2018 after the last public ISDN networks are shut down. This has happened now (with a very small number of exceptions), so I guess it's time to try again. We currently have three ISDN stacks in the kernel: the original isdn4linux (with the hisax driver), the newer CAPI (with four drivers), and finally the mISDN stack (supporting roughly the same hardware as hisax). As far as I can tell, anyone using ISDN with mainline kernel drivers in the past few years uses mISDN, and this is typically used for voice-only PBX installations that don't require a public network. The older stacks support additional features for data networks, but those typically make no sense any more if there is no network to connect to. My proposal for this time is to kill off isdn4linux entirely, as it seems to have been unusable for quite a while. This code has been abandoned for many years and it does cause problems for treewide maintenance as it tends to do everything that we try to stop doing. Birger Harzenetter mentioned that is is still using i4l in order to make use of the 'divert' feature that is not part of mISDN, but has otherwise moved on to mISDN for normal operation, like apparently everyone else. CAPI in turn is not quite as obsolete, but two of the drivers (avm and hysdn) don't seem to be used at all, while another one (gigaset) will stop being maintained as Paul Bolle is no longer able to test it after the network gets shut down in September. All three are now moved into drivers/staging to let others speak up in case there are remaining users. This leaves Bluetooth CMTP as the only remaining user of CAPI, but Marcel Holtmann wishes to keep maintaining it. For the discussion on version 1, see [2] Unfortunately, Karsten Keil as the maintainer has not participated in the discussion. Arnd [1] https://patchwork.kernel.org/patch/8484861/#17900371 [2] https://listserv.isdn4linux.de/pipermail/isdn4linux/2019-April/thread.html ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02Merge branch 'mscc-ocelot-tc-flower'David S. Miller
Horatiu Vultur says: ==================== Add hw offload of TC flower on MSCC Ocelot This patch series enables hardware offload for flower filter used in traffic controller on MSCC Ocelot board. v2->v3 changes: - remove the check for shared blocks v1->v2 changes: - when declaring variables use reverse christmas tree ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: mscc: ocelot: Hardware ofload for tc flower filterHoratiu Vultur
Hardware offload of port filtering are now supported via tc command using flower filter. ACL rules are used to enable the hardware offload. The following keys are supported: vlan_id vlan_prio dst_mac/src_mac for non IP frames dst_ip/src_ip dst_port/src_port The following actions are supported: trap drop These filters are supported only on the ingress schedulare. Add: tc qdisc add dev eth3 ingress tc filter ad dev eth3 parent ffff: ip_proto ip flower \ ip_proto tcp dst_port 80 action drop Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02net: mscc: ocelot: Add support for tcamHoratiu Vultur
Add ACL support using the TCAM. Using ACL it is possible to create rules in hardware to filter/redirect frames. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-02selftests: Add test cases for nexthop objectsDavid Ahern
Add functional test cases for nexthop objects. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset container Netfilter/IPVS update for net-next: 1) Add UDP tunnel support for ICMP errors in IPVS. Julian Anastasov says: This patchset is a followup to the commit that adds UDP/GUE tunnel: "ipvs: allow tunneling with gue encapsulation". What we do is to put tunnel real servers in hash table (patch 1), add function to lookup tunnels (patch 2) and use it to strip the embedded tunnel headers from ICMP errors (patch 3). 2) Extend xt_owner to match for supplementary groups, from Lukasz Pawelczyk. 3) Remove unused oif field in flow_offload_tuple object, from Taehee Yoo. 4) Release basechain counters from workqueue to skip synchronize_rcu() call. From Florian Westphal. 5) Replace skb_make_writable() by skb_ensure_writable(). Patchset from Florian Westphal. 6) Checksum support for gue encapsulation in IPVS, from Jacky Hu. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-05-31 The following pull-request contains BPF updates for your *net-next* tree. Lots of exciting new features in the first PR of this developement cycle! The main changes are: 1) misc verifier improvements, from Alexei. 2) bpftool can now convert btf to valid C, from Andrii. 3) verifier can insert explicit ZEXT insn when requested by 32-bit JITs. This feature greatly improves BPF speed on 32-bit architectures. From Jiong. 4) cgroups will now auto-detach bpf programs. This fixes issue of thousands bpf programs got stuck in dying cgroups. From Roman. 5) new bpf_send_signal() helper, from Yonghong. 6) cgroup inet skb programs can signal CN to the stack, from Lawrence. 7) miscellaneous cleanups, from many developers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31selftests/bpf: measure RTT from xdp using xdpingAlan Maguire
xdping allows us to get latency estimates from XDP. Output looks like this: ./xdping -I eth4 192.168.55.8 Setting up XDP for eth4, please wait... XDP setup disrupts network connectivity, hit Ctrl+C to quit Normal ping RTT data [Ignore final RTT; it is distorted by XDP using the reply] PING 192.168.55.8 (192.168.55.8) from 192.168.55.7 eth4: 56(84) bytes of data. 64 bytes from 192.168.55.8: icmp_seq=1 ttl=64 time=0.302 ms 64 bytes from 192.168.55.8: icmp_seq=2 ttl=64 time=0.208 ms 64 bytes from 192.168.55.8: icmp_seq=3 ttl=64 time=0.163 ms 64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.275 ms 4 packets transmitted, 4 received, 0% packet loss, time 3079ms rtt min/avg/max/mdev = 0.163/0.237/0.302/0.054 ms XDP RTT data: 64 bytes from 192.168.55.8: icmp_seq=5 ttl=64 time=0.02808 ms 64 bytes from 192.168.55.8: icmp_seq=6 ttl=64 time=0.02804 ms 64 bytes from 192.168.55.8: icmp_seq=7 ttl=64 time=0.02815 ms 64 bytes from 192.168.55.8: icmp_seq=8 ttl=64 time=0.02805 ms The xdping program loads the associated xdping_kern.o BPF program and attaches it to the specified interface. If run in client mode (the default), it will add a map entry keyed by the target IP address; this map will store RTT measurements, current sequence number etc. Finally in client mode the ping command is executed, and the xdping BPF program will use the last ICMP reply, reformulate it as an ICMP request with the next sequence number and XDP_TX it. After the reply to that request is received we can measure RTT and repeat until the desired number of measurements is made. This is why the sequence numbers in the normal ping are 1, 2, 3 and 8. We XDP_TX a modified version of ICMP reply 4 and keep doing this until we get the 4 replies we need; hence the networking stack only sees reply 8, where we have XDP_PASSed it upstream since we are done. In server mode (-s), xdping simply takes ICMP requests and replies to them in XDP rather than passing the request up to the networking stack. No map entry is required. xdping can be run in native XDP mode (the default, or specified via -N) or in skb mode (-S). A test program test_xdping.sh exercises some of these options. Note that native XDP does not seem to XDP_TX for veths, hence -N is not tested. Looking at the code, it looks like XDP_TX is supported so I'm not sure if that's expected. Running xdping in native mode for ixgbe as both client and server works fine. Changes since v4 - close fds on cleanup (Song Liu) Changes since v3 - fixed seq to be __be16 (Song Liu) - fixed fd checks in xdping.c (Song Liu) Changes since v2 - updated commit message to explain why seq number of last ICMP reply is 8 not 4 (Song Liu) - updated types of seq number, raddr and eliminated csum variable in xdpclient/xdpserver functions as it was not needed (Song Liu) - added XDPING_DEFAULT_COUNT definition and usage specification of default/max counts (Song Liu) Changes since v1 - moved from RFC to PATCH - removed unused variable in ipv4_csum() (Song Liu) - refactored ICMP checks into icmp_check() function called by client and server programs and reworked client and server programs due to lack of shared code (Song Liu) - added checks to ensure that SKB and native mode are not requested together (Song Liu) Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31Merge branch '40GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates 2019-05-31 This series contains updates to the iavf driver. Nathan Chancellor converts the use of gnu_printf to printf. Aleksandr modifies the driver to limit the number of RSS queues to the number of online CPUs in order to avoid creating misconfigured RSS queues. Gustavo A. R. Silva converts a couple of instances where sizeof() can be replaced with struct_size(). Alice makes the remaining changes to the iavf driver to cleanup all the old "i40evf" references in the driver to iavf, including the file names that still contained the old driver reference. There was no functional changes made, just cosmetic to reduce any confusion going forward now that the iavf driver is the virtual function driver for both i40e and ice drivers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31bpf: doc: update answer for 32-bit subregister questionJiong Wang
There has been quite a few progress around the two steps mentioned in the answer to the following question: Q: BPF 32-bit subregister requirements This patch updates the answer to reflect what has been done. v2: - Add missing full stop. (Song Liu) - Minor tweak on one sentence. (Song Liu) v1: - Integrated rephrase from Quentin and Jakub Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31Merge branch 'map-charge-cleanup'Alexei Starovoitov
Roman Gushchin says: ==================== During my work on memcg-based memory accounting for bpf maps I've done some cleanups and refactorings of the existing memlock rlimit-based code. It makes it more robust, unifies size to pages conversion, size checks and corresponding error codes. Also it adds coverage for cgroup local storage and socket local storage maps. It looks like some preliminary work on the mm side might be required to start working on the memcg-based accounting, so I'm sending these patches as a separate patchset. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: move memory size checks to bpf_map_charge_init()Roman Gushchin
Most bpf map types doing similar checks and bytes to pages conversion during memory allocation and charging. Let's unify these checks by moving them into bpf_map_charge_init(). Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: rework memlock-based memory accounting for mapsRoman Gushchin
In order to unify the existing memlock charging code with the memcg-based memory accounting, which will be added later, let's rework the current scheme. Currently the following design is used: 1) .alloc() callback optionally checks if the allocation will likely succeed using bpf_map_precharge_memlock() 2) .alloc() performs actual allocations 3) .alloc() callback calculates map cost and sets map.memory.pages 4) map_create() calls bpf_map_init_memlock() which sets map.memory.user and performs actual charging; in case of failure the map is destroyed <map is in use> 1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which performs uncharge and releases the user 2) .map_free() callback releases the memory The scheme can be simplified and made more robust: 1) .alloc() calculates map cost and calls bpf_map_charge_init() 2) bpf_map_charge_init() sets map.memory.user and performs actual charge 3) .alloc() performs actual allocations <map is in use> 1) .map_free() callback releases the memory 2) bpf_map_charge_finish() performs uncharge and releases the user The new scheme also allows to reuse bpf_map_charge_init()/finish() functions for memcg-based accounting. Because charges are performed before actual allocations and uncharges after freeing the memory, no bogus memory pressure can be created. In cases when the map structure is not available (e.g. it's not created yet, or is already destroyed), on-stack bpf_map_memory structure is used. The charge can be transferred with the bpf_map_charge_move() function. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: group memory related fields in struct bpf_map_memoryRoman Gushchin
Group "user" and "pages" fields of bpf_map into the bpf_map_memory structure. Later it can be extended with "memcg" and other related information. The main reason for a such change (beside cosmetics) is to pass bpf_map_memory structure to charging functions before the actual allocation of bpf_map. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: add memlock precharge for socket local storageRoman Gushchin
Socket local storage maps lack the memlock precharge check, which is performed before the memory allocation for most other bpf map types. Let's add it in order to unify all map types. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: add memlock precharge check for cgroup_local_storageRoman Gushchin
Cgroup local storage maps lack the memlock precharge check, which is performed before the memory allocation for most other bpf map types. Let's add it in order to unify all map types. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31Merge branch 'propagate-cn-to-tcp'Alexei Starovoitov
Lawrence Brakmo says: ==================== This patchset adds support for propagating congestion notifications (cn) to TCP from cgroup inet skb egress BPF programs. Current cgroup skb BPF programs cannot trigger TCP congestion window reductions, even when they drop a packet. This patch-set adds support for cgroup skb BPF programs to send congestion notifications in the return value when the packets are TCP packets. Rather than the current 1 for keeping the packet and 0 for dropping it, they can now return: NET_XMIT_SUCCESS (0) - continue with packet output NET_XMIT_DROP (1) - drop packet and do cn NET_XMIT_CN (2) - continue with packet output and do cn -EPERM - drop packet Finally, HBM programs are modified to collect and return more statistics. There has been some discussion regarding the best place to manage bandwidths. Some believe this should be done in the qdisc where it can also be managed with a BPF program. We believe there are advantages for doing it with a BPF program in the cgroup/skb callback. For example, it reduces overheads in the cases where there is on primary workload and one or more secondary workloads, where each workload is running on its own cgroupv2. In this scenario, we only need to throttle the secondary workloads and there is no overhead for the primary workload since there will be no BPF program attached to its cgroup. Regardless, we agree that this mechanism should not penalize those that are not using it. We tested this by doing 1 byte req/reply RPCs over loopback. Each test consists of 30 sec of back-to-back 1 byte RPCs. Each test was repeated 50 times with a 1 minute delay between each set of 10. We then calculated the average RPCs/sec over the 50 tests. We compare upstream with upstream + patchset and no BPF program as well as upstream + patchset and a BPF program that just returns ALLOW_PKT. Here are the results: upstream 80937 RPCs/sec upstream + patches, no BPF program 80894 RPCs/sec upstream + patches, BPF program 80634 RPCs/sec These numbers indicate that there is no penalty for these patches The use of congestion notifications improves the performance of HBM when using Cubic. Without congestion notifications, Cubic will not decrease its cwnd and HBM will need to drop a large percentage of the packets. The following results are obtained for rate limits of 1Gbps, between two servers using netperf, and only one flow. We also show how reducing the max delayed ACK timer can improve the performance when using Cubic. Command used was: ./do_hbm_test.sh -l -D --stats -N -r=<rate> [--no_cn] [dctcp] \ -s=<server running netserver> where: <rate> is 1000 --no_cn specifies no cwr notifications dctcp uses dctcp Cubic DCTCP Lim, DA Mbps cwnd cred drops Mbps cwnd cred drops -------- ---- ---- ---- ----- ---- ---- ---- ----- 1G, 40 35 462 -320 67% 995 1 -212 0.05% 1G, 40,cn 736 9 -78 0.07 995 1 -212 0.05 1G, 5,cn 941 2 -189 0.13 995 1 -212 0.05 Notes: --no_cn has no effect with DCTCP Lim = rate limit DA = maximum delay ack timer cred = credit in packets drops = % packets dropped v1->v2: Insures that only BPF_CGROUP_INET_EGRESS can return values 2 and 3 New egress values apply to all protocols, not just TCP Cleaned up patch 4, Update BPF_CGROUP_RUN_PROG_INET_EGRESS callers Removed changes to __tcp_transmit_skb (patch 5), no longer needed Removed sample use of EDT v2->v3: Removed the probe timer related changes v3->v4: Replaced preempt_enable_no_resched() by preempt_enable() in BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() macro ==================== Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: Add more stats to HBMbrakmo
Adds more stats to HBM, including average cwnd and rtt of all TCP flows, percents of packets that are ecn ce marked and distribution of return values. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: Add cn support to hbm_out_kern.cbrakmo
Update hbm_out_kern.c to support returning cn notifications. Also updates relevant files to allow disabling cn notifications. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS callsbrakmo
Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning congestion notifications from the BPF programs. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: Update __cgroup_bpf_run_filter_skb with cnbrakmo
For egress packets, __cgroup_bpf_fun_filter_skb() will now call BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of PROG_CGROUP_RUN_ARRAY() in order to propagate congestion notifications (cn) requests to TCP callers. For egress packets, this function can return: NET_XMIT_SUCCESS (0) - continue with packet output NET_XMIT_DROP (1) - drop packet and notify TCP to call cwr NET_XMIT_CN (2) - continue with packet output and notify TCP to call cwr -EPERM - drop packet For ingress packets, this function will return -EPERM if any attached program was found and if it returned != 1 during execution. Otherwise 0 is returned. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: cgroup inet skb programs can return 0 to 3brakmo
Allows cgroup inet skb programs to return values in the range [0, 3]. The second bit is used to deterine if congestion occurred and higher level protocol should decrease rate. E.g. TCP would call tcp_enter_cwr() The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS at load time if it uses the new return values (i.e. 2 or 3). The expected_attach_type is currently not enforced for BPF_PROG_TYPE_CGROUP_SKB. e.g Meaning the current bpf_prog with expected_attach_type setting to BPF_CGROUP_INET_EGRESS can attach to BPF_CGROUP_INET_INGRESS. Blindly enforcing expected_attach_type will break backward compatibility. This patch adds a enforce_expected_attach_type bit to only enforce the expected_attach_type when it uses the new return value. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAYbrakmo
Create new macro BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() to be used by __cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can request cwr for TCP packets. Current cgroup skb programs can only return 0 or 1 (0 to drop the packet. This macro changes the behavior so the low order bit indicates whether the packet should be dropped (0) or not (1) and the next bit is used for congestion notification (cn). Hence, new allowed return values of CGROUP EGRESS BPF programs are: 0: drop packet 1: keep packet 2: drop packet and call cwr 3: keep packet and call cwr This macro then converts it to one of NET_XMIT values or -EPERM that has the effect of dropping the packet with no cn. 0: NET_XMIT_SUCCESS skb should be transmitted (no cn) 1: NET_XMIT_DROP skb should be dropped and cwr called 2: NET_XMIT_CN skb should be transmitted and cwr called 3: -EPERM skb should be dropped (no cn) Note that when more than one BPF program is called, the packet is dropped if at least one of programs requests it be dropped, and there is cn if at least one program returns cn. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-31xen-netback: remove redundant assignment to errColin Ian King
The variable err is assigned with the value -ENOMEM that is never read and it is re-assigned a new value later on. The assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31nexthop: remove redundant assignment to errColin Ian King
The variable err is initialized with a value that is never read and err is reassigned a few statements later. This initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31Merge branch 'phylink-sfp-updates'David S. Miller
Russell King says: ==================== phylink/sfp updates This is a series of updates to phylink and sfp: - Remove an unused net device argument from the phylink MII ioctl emulation code. - add support for using interrupts when using a GPIO for link status tracking, rather than polling it at one second intervals. This reduces the need to wakeup the CPU every second. - add support to the MII ioctl API to read and write Clause 45 PHY registers. I don't know how desirable this is for mainline, but I have used this facility extensively to investigate the Marvell 88x3310 PHY. A recent illustration of use for this was debugging the PHY-without-firmware problem recently reported. - add mandatory attach/detach methods for the upstream side of sfp bus code, which will allow us to remove the "netdev" structure from the SFP layers. - remove the "netdev" structure from the SFP upstream registration calls, which simplifies PHY to SFP links. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31net: sfp: remove sfp-bus use of netdevsRussell King
The sfp-bus code now no longer has any use for the network device structure, so remove its use. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31net: sfp: add mandatory attach/detach methods for sfp busesRussell King
Add attach and detach methods for SFP buses, which will allow us to get rid of the netdev storage in sfp-bus. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>