summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2019-11-08sctp: add SCTP_PEER_ADDR_THLDS_V2 sockoptXin Long
Section 7.2 of rfc7829: "Peer Address Thresholds (SCTP_PEER_ADDR_THLDS) Socket Option" extends 'struct sctp_paddrthlds' with 'spt_pathcpthld' added to allow a user to change ps_retrans per sock/asoc/transport, as other 2 paddrthlds: pf_retrans, pathmaxrxt. Note: to not break the user's program, here to support pf_retrans dump and setting by adding a new sockopt SCTP_PEER_ADDR_THLDS_V2, and a new structure sctp_paddrthlds_v2 instead of extending sctp_paddrthlds. Also, when setting ps_retrans, the value is not allowed to be greater than pf_retrans. v1->v2: - use SCTP_PEER_ADDR_THLDS_V2 to set/get pf_retrans instead, as Marcelo and David Laight suggested. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08sctp: add support for Primary Path SwitchoverXin Long
This is a new feature defined in section 5 of rfc7829: "Primary Path Switchover". By introducing a new tunable parameter: Primary.Switchover.Max.Retrans (PSMR) The primary path will be changed to another active path when the path error counter on the old primary path exceeds PSMR, so that "the SCTP sender is allowed to continue data transmission on a new working path even when the old primary destination address becomes active again". This patch is to add this tunable parameter, 'ps_retrans' per netns, sock, asoc and transport. It also allows a user to change ps_retrans per netns by sysctl, and ps_retrans per sock/asoc/transport will be initialized with it. The check will be done in sctp_do_8_2_transport_strike() when this feature is enabled. Note this feature is disabled by initializing 'ps_retrans' per netns as 0xffff by default, and its value can't be less than 'pf_retrans' when changing by sysctl. v3->v4: - add define SCTP_PS_RETRANS_MAX 0xffff, and use it on extra2 of sysctl 'ps_retrans'. - add a new entry for ps_retrans on ip-sysctl.txt. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08sctp: add SCTP_EXPOSE_POTENTIALLY_FAILED_STATE sockoptXin Long
This is a sockopt defined in section 7.3 of rfc7829: "Exposing the Potentially Failed Path State", by which users can change pf_expose per sock and asoc. The new sockopt SCTP_EXPOSE_POTENTIALLY_FAILED_STATE is also known as SCTP_EXPOSE_PF_STATE for short. v2->v3: - return -EINVAL if params.assoc_value > SCTP_PF_EXPOSE_MAX. - define SCTP_EXPOSE_PF_STATE SCTP_EXPOSE_POTENTIALLY_FAILED_STATE. v3->v4: - improve changelog. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08sctp: add SCTP_ADDR_POTENTIALLY_FAILED notificationXin Long
SCTP Quick failover draft section 5.1, point 5 has been removed from rfc7829. Instead, "the sender SHOULD (i) notify the Upper Layer Protocol (ULP) about this state transition", as said in section 3.2, point 8. So this patch is to add SCTP_ADDR_POTENTIALLY_FAILED, defined in section 7.1, "which is reported if the affected address becomes PF". Also remove transport cwnd's update when moving from PF back to ACTIVE , which is no longer in rfc7829 either. Note that ulp_notify will be set to false if asoc->expose is not 'enabled', according to last patch. v2->v3: - define SCTP_ADDR_PF SCTP_ADDR_POTENTIALLY_FAILED. v3->v4: - initialize spc_state with SCTP_ADDR_AVAILABLE, as Marcelo suggested. - check asoc->pf_expose in sctp_assoc_control_transport(), as Marcelo suggested. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08sctp: add pf_expose per netns and sock and asocXin Long
As said in rfc7829, section 3, point 12: The SCTP stack SHOULD expose the PF state of its destination addresses to the ULP as well as provide the means to notify the ULP of state transitions of its destination addresses from active to PF, and vice versa. However, it is recommended that an SCTP stack implementing SCTP-PF also allows for the ULP to be kept ignorant of the PF state of its destinations and the associated state transitions, thus allowing for retention of the simpler state transition model of [RFC4960] in the ULP. Not only does it allow to expose the PF state to ULP, but also allow to ignore sctp-pf to ULP. So this patch is to add pf_expose per netns, sock and asoc. And in sctp_assoc_control_transport(), ulp_notify will be set to false if asoc->expose is not 'enabled' in next patch. It also allows a user to change pf_expose per netns by sysctl, and pf_expose per sock and asoc will be initialized with it. Note that pf_expose also works for SCTP_GET_PEER_ADDR_INFO sockopt, to not allow a user to query the state of a sctp-pf peer address when pf_expose is 'disabled', as said in section 7.3. v1->v2: - Fix a build warning noticed by Nathan Chancellor. v2->v3: - set pf_expose to UNUSED by default to keep compatible with old applications. v3->v4: - add a new entry for pf_expose on ip-sysctl.txt, as Marcelo suggested. - change this patch to 1/5, and move sctp_assoc_control_transport change into 2/5, as Marcelo suggested. - use SCTP_PF_EXPOSE_UNSET instead of SCTP_PF_EXPOSE_UNUSED, and set SCTP_PF_EXPOSE_UNSET to 0 in enum, as Marcelo suggested. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08devlink: disallow reload operation during device cleanupJiri Pirko
There is a race between driver code that does setup/cleanup of device and devlink reload operation that in some drivers works with the same code. Use after free could we easily obtained by running: while true; do echo 10 > /sys/bus/netdevsim/new_device devlink dev reload netdevsim/netdevsim10 & echo 10 > /sys/bus/netdevsim/del_device done Fix this by enabling reload only after setup of device is complete and disabling it at the beginning of the cleanup process. Reported-by: Ido Schimmel <idosch@mellanox.com> Fixes: 2d8dc5bbf4e7 ("devlink: Add support for reload") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08selftest: net: add alternative names testJiri Pirko
Add a simple test for recently added netdevice alternative names. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08packet: fix data-race in fanout_flow_is_huge()Eric Dumazet
KCSAN reported the following data-race [1] Adding a couple of READ_ONCE()/WRITE_ONCE() should silence it. Since the report hinted about multiple cpus using the history concurrently, I added a test avoiding writing on it if the victim slot already contains the desired value. [1] BUG: KCSAN: data-race in fanout_demux_rollover / fanout_demux_rollover read to 0xffff8880b01786cc of 4 bytes by task 18921 on cpu 1: fanout_flow_is_huge net/packet/af_packet.c:1303 [inline] fanout_demux_rollover+0x33e/0x3f0 net/packet/af_packet.c:1353 packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453 deliver_skb net/core/dev.c:1888 [inline] dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958 xmit_one net/core/dev.c:3195 [inline] dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215 __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0x9f/0xc0 net/socket.c:657 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311 __sys_sendmmsg+0x123/0x350 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg net/socket.c:2439 [inline] __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 write to 0xffff8880b01786cc of 4 bytes by task 18922 on cpu 0: fanout_flow_is_huge net/packet/af_packet.c:1306 [inline] fanout_demux_rollover+0x3a4/0x3f0 net/packet/af_packet.c:1353 packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453 deliver_skb net/core/dev.c:1888 [inline] dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958 xmit_one net/core/dev.c:3195 [inline] dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215 __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0x9f/0xc0 net/socket.c:657 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311 __sys_sendmmsg+0x123/0x350 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg net/socket.c:2439 [inline] __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 18922 Comm: syz-executor.3 Not tainted 5.4.0-rc6+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 3b3a5b0aab5b ("packet: rollover huge flows before small flows") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08Merge branch 'TIPC-Encryption'David S. Miller
Tuong Lien says: ==================== TIPC Encryption This series provides TIPC encryption feature, kernel part. There will be another one in the 'iproute2/tipc' for user space to set key. v2: add select crypto 'aes(gcm)' for TIPC_CRYPTO in Kconfig ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08tipc: add support for AEAD key setting via netlinkTuong Lien
This commit adds two netlink commands to TIPC in order for user to be able to set or remove AEAD keys: - TIPC_NL_KEY_SET - TIPC_NL_KEY_FLUSH When the 'KEY_SET' is given along with the key data, the key will be initiated and attached to TIPC crypto. On the other hand, the 'KEY_FLUSH' command will remove all existing keys if any. Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08tipc: introduce TIPC encryption & authenticationTuong Lien
This commit offers an option to encrypt and authenticate all messaging, including the neighbor discovery messages. The currently most advanced algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All encryption/decryption is done at the bearer layer, just before leaving or after entering TIPC. Supported features: - Encryption & authentication of all TIPC messages (header + data); - Two symmetric-key modes: Cluster and Per-node; - Automatic key switching; - Key-expired revoking (sequence number wrapped); - Lock-free encryption/decryption (RCU); - Asynchronous crypto, Intel AES-NI supported; - Multiple cipher transforms; - Logs & statistics; Two key modes: - Cluster key mode: One single key is used for both TX & RX in all nodes in the cluster. - Per-node key mode: Each nodes in the cluster has one specific TX key. For RX, a node requires its peers' TX key to be able to decrypt the messages from those peers. Key setting from user-space is performed via netlink by a user program (e.g. the iproute2 'tipc' tool). Internal key state machine: Attach Align(RX) +-+ +-+ | V | V +---------+ Attach +---------+ | IDLE |---------------->| PENDING |(user = 0) +---------+ +---------+ A A Switch| A | | | | | | Free(switch/revoked) | | (Free)| +----------------------+ | |Timeout | (TX) | | |(RX) | | | | | | v | +---------+ Switch +---------+ | PASSIVE |<----------------| ACTIVE | +---------+ (RX) +---------+ (user = 1) (user >= 1) The number of TFMs is 10 by default and can be changed via the procfs 'net/tipc/max_tfms'. At this moment, as for simplicity, this file is also used to print the crypto statistics at runtime: echo 0xfff1 > /proc/sys/net/tipc/max_tfms The patch defines a new TIPC version (v7) for the encryption message (- backward compatibility as well). The message is basically encapsulated as follows: +----------------------------------------------------------+ | TIPCv7 encryption | Original TIPCv2 | Authentication | | header | packet (encrypted) | Tag | +----------------------------------------------------------+ The throughput is about ~40% for small messages (compared with non- encryption) and ~9% for large messages. With the support from hardware crypto i.e. the Intel AES-NI CPU instructions, the throughput increases upto ~85% for small messages and ~55% for large messages. By default, the new feature is inactive (i.e. no encryption) until user sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO" in the kernel configuration to enable/disable the new code when needed. MAINTAINERS | add two new files 'crypto.h' & 'crypto.c' in tipc Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08tipc: add new AEAD key structure for user APITuong Lien
The new structure 'tipc_aead_key' is added to the 'tipc.h' for user to be able to transfer a key to TIPC in kernel. Netlink will be used for this purpose in the later commits. Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08tipc: enable creating a "preliminary" nodeTuong Lien
When user sets RX key for a peer not existing on the own node, a new node entry is needed to which the RX key will be attached. However, since the peer node address (& capabilities) is unknown at that moment, only the node-ID is provided, this commit allows the creation of a node with only the data that we call as “preliminary”. A preliminary node is not the object of the “tipc_node_find()” but the “tipc_node_find_by_id()”. Once the first message i.e. LINK_CONFIG comes from that peer, and is successfully decrypted by the own node, the actual peer node data will be properly updated and the node will function as usual. In addition, the node timer always starts when a node object is created so if a preliminary node is not used, it will be cleaned up. The later encryption functions will also use the node timer and be able to create a preliminary node automatically when needed. Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08tipc: add reference counter to bearerTuong Lien
As a need to support the crypto asynchronous operations in the later commits, apart from the current RCU mechanism for bearer pointer, we add a 'refcnt' to the bearer object as well. So, a bearer can be hold via 'tipc_bearer_hold()' without being freed even though the bearer or interface can be disabled in the meanwhile. If that happens, the bearer will be released then when the crypto operation is completed and 'tipc_bearer_put()' is called. Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08Merge branch '100GbE' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 100GbE Intel Wired LAN Driver Updates 2019-11-08 Another series that contains updates to the ice driver only. Anirudh cleans up the code of kernel config of ifdef wrappers by moving code that is needed by DCB to disable and enable the PF VSI for configuration. Implements ice_vsi_type_str() to convert an VSI type enum value to its string equivalent to help identify VSI types from module print statements. Usha and Tarun add support for setting the maximum per-queue bit rate for transmit queues. Dave implements dcb_nl set functions and supporting software DCB functions to support the callbacks defined in the dcbnl_rtnl_ops structure. Henry adds a check to ensure we are not resetting the device when trying to configure it, and to return -EBUSY during a reset. Usha fixes a call trace caused by the receive/transmit descriptor size change request via ethtool when DCB is configured by using the number of enabled queues and not the total number of allocated queues. Paul cleans up and refactors the software LLDP configuration to handle when firmware DCBX is disabled. Akeem adds checks to ensure the VF or PF is not disabled before honoring mailbox messages to configure the VF. Brett corrects the check to make sure the vector_id passed down from iavf is less than the max allowed interrupts per VF. Updates a flag bit to align with the current specification. Bruce updates a switch statement to use the correct status of the Download Package AQ command. Does some housekeeping by cleaning up a conditional check that is not needed. Mitch shortens up the delay for SQ responses to resolve issues with VF resets failing. Jake cleans up the code reducing namespace pollution and to simplify ice_debug_cq() since it always uses the same mask, not need to pass it in. Improve debugging by adding the command opcode in the debug messages that print an error code. v2: fixed reverse christmas tree issue in patch 3 of the series. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08net: icmp: fix data-race in cmp_global_allow()Eric Dumazet
This code reads two global variables without protection of a lock. We need READ_ONCE()/WRITE_ONCE() pairs to avoid load/store-tearing and better document the intent. KCSAN reported : BUG: KCSAN: data-race in icmp_global_allow / icmp_global_allow read to 0xffffffff861a8014 of 4 bytes by task 11201 on cpu 0: icmp_global_allow+0x36/0x1b0 net/ipv4/icmp.c:254 icmpv6_global_allow net/ipv6/icmp.c:184 [inline] icmpv6_global_allow net/ipv6/icmp.c:179 [inline] icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514 icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43 ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640 dst_link_failure include/net/dst.h:419 [inline] vti_xmit net/ipv4/ip_vti.c:243 [inline] vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279 __netdev_start_xmit include/linux/netdevice.h:4420 [inline] netdev_start_xmit include/linux/netdevice.h:4434 [inline] xmit_one net/core/dev.c:3280 [inline] dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296 __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873 dev_queue_xmit+0x21/0x30 net/core/dev.c:3906 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179 write to 0xffffffff861a8014 of 4 bytes by task 11183 on cpu 1: icmp_global_allow+0x174/0x1b0 net/ipv4/icmp.c:272 icmpv6_global_allow net/ipv6/icmp.c:184 [inline] icmpv6_global_allow net/ipv6/icmp.c:179 [inline] icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514 icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43 ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640 dst_link_failure include/net/dst.h:419 [inline] vti_xmit net/ipv4/ip_vti.c:243 [inline] vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279 __netdev_start_xmit include/linux/netdevice.h:4420 [inline] netdev_start_xmit include/linux/netdevice.h:4434 [inline] xmit_one net/core/dev.c:3280 [inline] dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296 __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873 dev_queue_xmit+0x21/0x30 net/core/dev.c:3906 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 11183 Comm: syz-executor.2 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 4cdf507d5452 ("icmp: add a global rate limitation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08net/sched: annotate lockless accesses to qdisc->emptyEric Dumazet
KCSAN reported the following race [1] BUG: KCSAN: data-race in __dev_queue_xmit / net_tx_action read to 0xffff8880ba403508 of 1 bytes by task 21814 on cpu 1: __dev_xmit_skb net/core/dev.c:3389 [inline] __dev_queue_xmit+0x9db/0x1b40 net/core/dev.c:3761 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825 neigh_hh_output include/net/neighbour.h:500 [inline] neigh_output include/net/neighbour.h:509 [inline] ip6_finish_output2+0x873/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0x9f/0xc0 net/socket.c:657 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311 __sys_sendmmsg+0x123/0x350 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg net/socket.c:2439 [inline] __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 write to 0xffff8880ba403508 of 1 bytes by interrupt on cpu 0: qdisc_run_begin include/net/sch_generic.h:160 [inline] qdisc_run include/net/pkt_sched.h:120 [inline] net_tx_action+0x2b1/0x6c0 net/core/dev.c:4551 __do_softirq+0x115/0x33f kernel/softirq.c:292 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337 do_softirq kernel/softirq.c:329 [inline] __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189 local_bh_enable include/linux/bottom_half.h:32 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:688 [inline] ip6_finish_output2+0x7bb/0xec0 net/ipv6/ip6_output.c:117 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0x9f/0xc0 net/socket.c:657 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311 __sys_sendmmsg+0x123/0x350 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg net/socket.c:2439 [inline] __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 21817 Comm: syz-executor.2 Not tainted 5.4.0-rc6+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: d518d2ed8640 ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08ice: print opcode when printing controlq errorsJacob Keller
To help aid in debugging, display the command opcode in debug messages that print an error code. This makes it easier to see what command failed if only ICE_DBG_AQ_MSG is enabled. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: use more accurate ICE_DBG mask typesJacob Keller
ice_debug_cq is passed a mask which is always ICE_DBG_AQ_CMD. Modify this function, removing the mask parameter entirely, and directly use the more appropriate ICE_DBG_AQ_DESC and ICE_DBG_AQ_DESC_BUF. The function is only called from ice_controlq.c, and has no other callers outside of that file. Move it and mark it static to avoid namespace pollution. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: Introduce and use ice_vsi_type_strAnirudh Venkataramanan
ice_vsi_type_str converts an ice_vsi_type enum value to its string equivalent. This is expected to help easily identify VSI types from module print statements. Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: remove unnecessary conditional checkBruce Allan
There is no reason to do this conditional check before the assignment so simply remove it. Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: Update enum ice_flg64_bits to current specificationBrett Creeley
Currently the VLAN ice_flg64_bits are off by 1. Fix this by setting the ICE_FLG_EVLAN_x8100 flag to 14, which also updates ICE_FLG_EVLAN_x9100 to 15 and ICE_FLG_VLAN_x8100 to 16. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: delay lessMitch Williams
Shorten the delay for SQ responses, but increase the number of loops. Max delay time is unchanged, but some operations complete much more quickly. In the process, add a new define to make the delay count and delay time more explicit. Add comments to make things more explicit. This fixes a problem with VF resets failing on with many VFs. Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: use pkg_dwnld_status instead of sq_last_statusBruce Allan
Since the return value from the Download Package AQ command is stored in hw->pkg_dwnld_status, use that instead of sq_last_status since that may have the return value from some other AQ command leading to unexpected results. Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: Change max MSI-x vector_id check in cfg_irq_mapBrett Creeley
Currently we check to make sure the vector_id passed down from iavf is less than or equal to pf->hw.func_caps.common_caps.num_msix_vectors. This is incorrect because the vector_id is always 0-based and never greater than or equal to the ICE_MAX_INTR_PER_VF. Fix this by checking to make sure the vector_id is less than the max allowed interrupts per VF (ICE_MAX_INTR_PER_VF). Signed-off-by: Brett Creeley <brett.creeley@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: Check if VF is disabled for Opcode and other operationsAkeem G Abodunrin
This patch adds code to check if PF or VF is disabled before honoring mailbox message to configure VF - If it is disabled, and opcode is for resetting VF, the PF driver simply tell VF that all is set. In addition, if reset is ongoing, and Admin intend to configure VF on the host, we can poll the VF enabling bit to make sure it is ready before continue - If after ~250 milliseconds, VF is not in active state, we can bail out with invalid error. Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: configure software LLDP in ice_init_pf_dcbPaul Greenwalt
Move software LLDP configuration when FW DCBX is disabled to ice_init_pf_dcb, since that is where the FW DCBX state is determined. Remove this software LLDP configuration from ice_vsi_setup and ice_set_priv_flags. Software configuration includes redirecting Rx LLDP packets up the stack, when FW DCBX is not running. Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: Fix to change Rx/Tx ring descriptor size via ethtool with DCBxUsha Ketineni
This patch fixes the call trace caused by the kernel when the Rx/Tx descriptor size change request is initiated via ethtool when DCB is configured. ice_set_ringparam() should use vsi->num_txq instead of vsi->alloc_txq as it represents the queues that are enabled in the driver when DCB is enabled/disabled. Otherwise, queue index being used can go out of range. For example, when vsi->alloc_txq has 104 queues and with 3 TCS enabled via DCB, each TC gets 34 queues, vsi->num_txq will be 102 and only 102 queues will be enabled. Signed-off-by: Usha Ketineni <usha.k.ketineni@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: avoid setting features during resetHenry Tieman
Certain subsystems behave very badly when called during reset (core dump). This patch returns -EBUSY when reconfiguring some subsystems during reset. With this patch some ethtool functions will not core dump during reset. Signed-off-by: Henry Tieman <henry.w.tieman@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: Implement DCBNL supportDave Ertman
Implement interface layer for the DCBNL subsystem. These are the functions to support the callbacks defined in the dcbnl_rtnl_ops struct. These callbacks are going to be used to interface with the DCB settings of the device. Implementation of dcb_nl set functions and supporting SW DCB functions. Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: Add NDO callback to set the maximum per-queue bitrateUsha Ketineni
Allow for rate limiting Tx queues. Bitrate is set in Mbps(megabits per second). Mbps max-rate is set for the queue via sysfs: /sys/class/net/<iface>/queues/tx-<queue>/tx_maxrate ex: echo 100 >/sys/class/net/ens7/queues/tx-0/tx_maxrate echo 200 >/sys/class/net/ens7/queues/tx-1/tx_maxrate Note: A value of zero for tx_maxrate means disabled, default is disabled. Signed-off-by: Usha Ketineni <usha.k.ketineni@intel.com> Co-developed-by: Tarun Singh <tarun.k.singh@intel.com> Signed-off-by: Tarun Singh <tarun.k.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08ice: Use ice_ena_vsi and ice_dis_vsi in DCB configuration flowAnirudh Venkataramanan
DCB configuration flow needs to disable and enable only the PF (main) VSI, so use ice_ena_vsi and ice_dis_vsi. To avoid the use of ifdef to control the staticness of these functions, move them to ice_lib.c. Also replace the allocate and copy of old_cfg to kmemdup() in ice_pf_dcb_cfg(). Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-11-08cxgb4: fix 64-bit division on i386Rahul Lakkireddy
Fix following compile error on i386 architecture. ERROR: "__udivdi3" [drivers/net/ethernet/chelsio/cxgb4/cxgb4.ko] undefined! Fixes: 0e395b3cb1fb ("cxgb4: add FLOWC based QoS offload") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08Merge tag 'mac80211-next-for-net-next-2019-11-08' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Some relatively small changes: * typo fixes in docs * APIs for station separation using VLAN tags rather than separate wifi netdevs * some preparations for upcoming features (802.3 offload and airtime queue limits (AQL) * stack reduction in ieee80211_assoc_success() * use DEFINE_DEBUGFS_ATTRIBUTE in hwsim ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08cxgb4: Use match_string() helper to simplify the codeYueHaibing
match_string() returns the array index of a matching string. Use it instead of the open-coded implementation. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08net: ethernet: stmmac: Add support for syscfg clockChristophe Roullier
Add optional support for syscfg clock in dwmac-stm32.c Now Syscfg clock is activated automatically when syscfg registers are used Signed-off-by: Christophe Roullier <christophe.roullier@st.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08cfg80211: VLAN offload support for set_key and set_sta_vlanGurumoorthi Gnanasambandhan
This provides an alternative mechanism for AP VLAN support where a single netdev is used with VLAN tagged frames instead of separate netdevs for each VLAN without tagged frames from the WLAN driver. By setting NL80211_EXT_FEATURE_VLAN_OFFLOAD flag the driver indicates support for a single netdev with VLAN tagged frames. Separate VLAN-specific netdevs can be added using RTM_NEWLINK/IFLA_VLAN_ID similarly to Ethernet. NL80211_CMD_NEW_KEY (for group keys), NL80211_CMD_NEW_STATION, and NL80211_CMD_SET_STATION will optionally specify vlan_id using NL80211_ATTR_VLAN_ID. Signed-off-by: Gurumoorthi Gnanasambandhan <gguru@codeaurora.org> Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Link: https://lore.kernel.org/r/20191031214640.5012-1-jouni@codeaurora.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08mac80211: Shrink the size of ack_frame_id to make room for tx_time_estToke Høiland-Jørgensen
To implement airtime queue limiting, we need to keep a running account of the estimated airtime of all skbs queued into the device. Do to this correctly, we need to store the airtime estimate into the skb so we can decrease the outstanding balance when the skb is freed. This means that the time estimate must be stored somewhere that will survive for the lifetime of the skb. To get this, decrease the size of the ack_frame_id field to 6 bits, and lower the size of the ID space accordingly. This leaves 10 bits for use for tx_time_est, which is enough to store a maximum of 4096 us, if we shift the values so they become units of 4us. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/157182474063.150713.16132669599100802716.stgit@toke.dk Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08mac80211: don't re-parse elems in ieee80211_assoc_success()Johannes Berg
We've already parsed the same data in the caller, so we can pass it. The only thing is that we might fill in more details in ieee80211_assoc_success(), but that doesn't bother the caller, so it's fine to do even when we share the parsed data. This reduces the stack space usage of the call stack here, Arnd reported it had grown above the 1024 byte warning limit. Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://lore.kernel.org/r/20191028125240.cb7661671bd2.I757c8752bf4f2f35e54f5e0a2c0a9cd9216c3d8b@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08mac80211: move store skb ack code to its own functionJohn Crispin
This patch moves the code handling SKBTX_WIFI_STATUS inside the TX path into an extra function. This allows us to reuse it inside the 802.11 encap offloading datapath. Signed-off-by: John Crispin <john@phrozen.org> Link: https://lore.kernel.org/r/20191029091304.7330-2-john@phrozen.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08mac80211_hwsim: use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs fopszhong jiang
It is more clear to use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs file operation rather than DEFINE_SIMPLE_ATTRIBUTE. It is detected with the help of coccinelle. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Link: https://lore.kernel.org/r/1572404462-45462-1-git-send-email-zhongjiang@huawei.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-07tipc: eliminate checking netns if node establishedHoang Le
Currently, we scan over all network namespaces at each received discovery message in order to check if the sending peer might be present in a host local namespaces. This is unnecessary since we can assume that a peer will not change its location during an established session. We now improve the condition for this testing so that we don't perform any redundant scans. Fixes: f73b12812a3d ("tipc: improve throughput between nodes in netns") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07net: add a READ_ONCE() in skb_peek_tail()Eric Dumazet
skb_peek_tail() can be used without protection of a lock, as spotted by KCSAN [1] In order to avoid load-stearing, add a READ_ONCE() Note that the corresponding WRITE_ONCE() are already there. [1] BUG: KCSAN: data-race in sk_wait_data / skb_queue_tail read to 0xffff8880b36a4118 of 8 bytes by task 20426 on cpu 1: skb_peek_tail include/linux/skbuff.h:1784 [inline] sk_wait_data+0x15b/0x250 net/core/sock.c:2477 kcm_wait_data+0x112/0x1f0 net/kcm/kcmsock.c:1103 kcm_recvmsg+0xac/0x320 net/kcm/kcmsock.c:1130 sock_recvmsg_nosec net/socket.c:871 [inline] sock_recvmsg net/socket.c:889 [inline] sock_recvmsg+0x92/0xb0 net/socket.c:885 ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480 do_recvmmsg+0x19a/0x5c0 net/socket.c:2601 __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680 __do_sys_recvmmsg net/socket.c:2703 [inline] __se_sys_recvmmsg net/socket.c:2696 [inline] __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 write to 0xffff8880b36a4118 of 8 bytes by task 451 on cpu 0: __skb_insert include/linux/skbuff.h:1852 [inline] __skb_queue_before include/linux/skbuff.h:1958 [inline] __skb_queue_tail include/linux/skbuff.h:1991 [inline] skb_queue_tail+0x7e/0xc0 net/core/skbuff.c:3145 kcm_queue_rcv_skb+0x202/0x310 net/kcm/kcmsock.c:206 kcm_rcv_strparser+0x74/0x4b0 net/kcm/kcmsock.c:370 __strp_recv+0x348/0xf50 net/strparser/strparser.c:309 strp_recv+0x84/0xa0 net/strparser/strparser.c:343 tcp_read_sock+0x174/0x5c0 net/ipv4/tcp.c:1639 strp_read_sock+0xd4/0x140 net/strparser/strparser.c:366 do_strp_work net/strparser/strparser.c:414 [inline] strp_work+0x9a/0xe0 net/strparser/strparser.c:423 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 451 Comm: kworker/u4:3 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: kstrp strp_work Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07net: add annotations on hh->hh_len lockless accessesEric Dumazet
KCSAN reported a data-race [1] While we can use READ_ONCE() on the read sides, we need to make sure hh->hh_len is written last. [1] BUG: KCSAN: data-race in eth_header_cache / neigh_resolve_output write to 0xffff8880b9dedcb8 of 4 bytes by task 29760 on cpu 0: eth_header_cache+0xa9/0xd0 net/ethernet/eth.c:247 neigh_hh_init net/core/neighbour.c:1463 [inline] neigh_resolve_output net/core/neighbour.c:1480 [inline] neigh_resolve_output+0x415/0x470 net/core/neighbour.c:1470 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ndisc_send_skb+0x459/0x5f0 net/ipv6/ndisc.c:505 ndisc_send_ns+0x207/0x430 net/ipv6/ndisc.c:647 rt6_probe_deferred+0x98/0xf0 net/ipv6/route.c:615 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 read to 0xffff8880b9dedcb8 of 4 bytes by task 29572 on cpu 1: neigh_resolve_output net/core/neighbour.c:1479 [inline] neigh_resolve_output+0x113/0x470 net/core/neighbour.c:1470 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ndisc_send_skb+0x459/0x5f0 net/ipv6/ndisc.c:505 ndisc_send_ns+0x207/0x430 net/ipv6/ndisc.c:647 rt6_probe_deferred+0x98/0xf0 net/ipv6/route.c:615 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 29572 Comm: kworker/1:4 Not tainted 5.4.0-rc6+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events rt6_probe_deferred Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07Merge branch 'u64_stats_t'David S. Miller
Eric Dumazet says: ==================== net: introduce u64_stats_t KCSAN found a data-race in per-cpu u64 stats accounting. (The stack traces are included in the 8th patch : tun: switch to u64_stats_t) This patch series first consolidate code in five patches. Then the last three patches address the data-race resolution. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07net: use u64_stats_t in struct pcpu_lstatsEric Dumazet
In order to fix the data-race found by KCSAN, we can use the new u64_stats_t type and its accessors instead of plain u64 fields. This will still generate optimal code for both 32 and 64 bit platforms. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07tun: switch to u64_stats_tEric Dumazet
In order to fix this data-race found by KCSAN [1], switch to u64_stats_t helpers. They provide all the needed annotations, without adding extra cost. [1] BUG: KCSAN: data-race in tun_get_user / tun_net_get_stats64 read to 0xffffe8ffffd8aca8 of 8 bytes by task 4882 on cpu 0: tun_net_get_stats64+0x9b/0x230 drivers/net/tun.c:1171 dev_get_stats+0x89/0x1e0 net/core/dev.c:9103 rtnl_fill_stats+0x56/0x370 net/core/rtnetlink.c:1177 rtnl_fill_ifinfo+0xd3b/0x2100 net/core/rtnetlink.c:1667 rtmsg_ifinfo_build_skb+0xb0/0x150 net/core/rtnetlink.c:3472 rtmsg_ifinfo_event.part.0+0x4e/0xb0 net/core/rtnetlink.c:3504 rtmsg_ifinfo_event net/core/rtnetlink.c:3515 [inline] rtmsg_ifinfo+0x85/0x90 net/core/rtnetlink.c:3513 __dev_notify_flags+0x18b/0x200 net/core/dev.c:7649 dev_change_flags+0xb8/0xe0 net/core/dev.c:7691 dev_ifsioc+0x201/0x6a0 net/core/dev_ioctl.c:237 dev_ioctl+0x149/0x660 net/core/dev_ioctl.c:489 sock_do_ioctl+0xdb/0x230 net/socket.c:1061 sock_ioctl+0x3a3/0x5e0 net/socket.c:1189 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:509 [inline] do_vfs_ioctl+0x991/0xc60 fs/ioctl.c:696 write to 0xffffe8ffffd8aca8 of 8 bytes by task 4883 on cpu 1: tun_get_user+0x1d94/0x2ba0 drivers/net/tun.c:2002 tun_chr_write_iter+0x79/0xd0 drivers/net/tun.c:2022 call_write_iter include/linux/fs.h:1895 [inline] new_sync_write+0x388/0x4a0 fs/read_write.c:483 __vfs_write+0xb1/0xc0 fs/read_write.c:496 __kernel_write+0xb8/0x240 fs/read_write.c:515 write_pipe_buf+0xb6/0xf0 fs/splice.c:794 splice_from_pipe_feed fs/splice.c:500 [inline] __splice_from_pipe+0x248/0x480 fs/splice.c:624 splice_from_pipe+0xbb/0x100 fs/splice.c:659 default_file_splice_write+0x45/0x90 fs/splice.c:806 do_splice_from fs/splice.c:848 [inline] direct_splice_actor+0xa0/0xc0 fs/splice.c:1020 splice_direct_to_actor+0x215/0x510 fs/splice.c:975 do_splice_direct+0x161/0x1e0 fs/splice.c:1063 do_sendfile+0x384/0x7f0 fs/read_write.c:1464 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 4883 Comm: syz-executor.1 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07u64_stats: provide u64_stats_t typeEric Dumazet
On 64bit arches, struct u64_stats_sync is empty and provides no help against load/store tearing. Using READ_ONCE()/WRITE_ONCE() would be needed. But the update side would be slightly more expensive. local64_t was defined so that we could use regular adds in a manner which is atomic wrt IRQs. However the u64_stats infra means we do not have to use local64_t on 32bit arches since the syncp provides the needed protection. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07net: dummy: use standard dev_lstats_add() and dev_lstats_read()Eric Dumazet
This driver can simply use the common infrastructure instead of duplicating it. This cleanup will ease u64_stats_t adoption in a single location. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07vsockmon: use standard dev_lstats_add() and dev_lstats_read()Eric Dumazet
This cleanup will ease u64_stats_t adoption in a single location. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>