summaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)Author
2017-01-24bpf: allow option for setting bpf_l4_csum_replace from scratchDaniel Borkmann
When programs need to calculate the csum from scratch for small UDP packets and use bpf_l4_csum_replace() to feed the result from helpers like bpf_csum_diff(), then we need a flag besides BPF_F_MARK_MANGLED_0 that would ignore the case of current csum being 0, and which would still allow for the helper to set the csum and transform when needed to CSUM_MANGLED_0. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bpf: enable load bytes helper for filter/reuseport progsDaniel Borkmann
BPF_PROG_TYPE_SOCKET_FILTER are used in various facilities such as for SO_REUSEPORT and packet fanout demuxing, packet filtering, kcm, etc, and yet the only facility they can use is BPF_LD with {BPF_ABS, BPF_IND} for single byte/half/word access. Direct packet access is only restricted to tc programs right now, but we can still facilitate usage by allowing skb_load_bytes() helper added back then in 05c74e5e53f6 ("bpf: add bpf_skb_load_bytes helper") that calls skb_header_pointer() similarly to bpf_load_pointer(), but for stack buffers with larger access size. Name the previous sk_filter_func_proto() as bpf_base_func_proto() since this is used everywhere else as well, similarly for the ctx converter, that is, bpf_convert_ctx_access(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bpf: simplify __is_valid_access test on cbDaniel Borkmann
The __is_valid_access() test for cb[] from 62c7989b24db ("bpf: allow b/h/w/dw access for bpf's cb in ctx") was done unnecessarily complex, we can just simplify it the same way as recent fix from 2d071c643f1c ("bpf, trace: make ctx access checks more robust") did. Overflow can never happen as size is 1/2/4/8 depending on access. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net/sched: Introduce sample tc actionYotam Gigi
This action allows the user to sample traffic matched by tc classifier. The sampling consists of choosing packets randomly and sampling them using the psample module. The user can configure the psample group number, the sampling rate and the packet's truncation (to save kernel-user traffic). Example: To sample ingress traffic from interface eth1, one may use the commands: tc qdisc add dev eth1 handle ffff: ingress tc filter add dev eth1 parent ffff: \ matchall action sample rate 12 group 4 Where the first command adds an ingress qdisc and the second starts sampling randomly with an average of one sampled packet per 12 packets on dev eth1 to psample group 4. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: Introduce psample, a new genetlink channel for packet samplingYotam Gigi
Add a general way for kernel modules to sample packets, without being tied to any specific subsystem. This netlink channel can be used by tc, iptables, etc. and allow to standardize packet sampling in the kernel. For every sampled packet, the psample module adds the following metadata fields: PSAMPLE_ATTR_IIFINDEX - the packets input ifindex, if applicable PSAMPLE_ATTR_OIFINDEX - the packet output ifindex, if applicable PSAMPLE_ATTR_ORIGSIZE - the packet's original size, in case it has been truncated during sampling PSAMPLE_ATTR_SAMPLE_GROUP - the packet's sample group, which is set by the user who initiated the sampling. This field allows the user to differentiate between several samplers working simultaneously and filter packets relevant to him PSAMPLE_ATTR_GROUP_SEQ - sequence counter of last sent packet. The sequence is kept for each group PSAMPLE_ATTR_SAMPLE_RATE - the sampling rate used for sampling the packets PSAMPLE_ATTR_DATA - the actual packet bits The sampled packets are sent to the PSAMPLE_NL_MCGRP_SAMPLE multicast group. In addition, add the GET_GROUPS netlink command which allows the user to see the current sample groups, their refcount and sequence number. This command currently supports only netlink dump mode. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24net: dsa: Fix inverted test for multiple CPU interfaceAndrew Lunn
Remove the wrong !, otherwise we get false positives about having multiple CPU interfaces. Fixes: b22de490869d ("net: dsa: store CPU switch structure in the tree") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24bridge: multicast to unicastFelix Fietkau
Implements an optional, per bridge port flag and feature to deliver multicast packets to any host on the according port via unicast individually. This is done by copying the packet per host and changing the multicast destination MAC to a unicast one accordingly. multicast-to-unicast works on top of the multicast snooping feature of the bridge. Which means unicast copies are only delivered to hosts which are interested in it and signalized this via IGMP/MLD reports previously. This feature is intended for interface types which have a more reliable and/or efficient way to deliver unicast packets than broadcast ones (e.g. wifi). However, it should only be enabled on interfaces where no IGMPv2/MLDv1 report suppression takes place. This feature is disabled by default. The initial patch and idea is from Felix Fietkau. Signed-off-by: Felix Fietkau <nbd@nbd.name> [linus.luessing@c0d3.blue: various bug + style fixes, commit message] Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24Introduce a sysctl that modifies the value of PROT_SOCK.Krister Johansen
Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl that denotes the first unprivileged inet port in the namespace. To disable all privileged ports set this to zero. It also checks for overlap with the local port range. The privileged and local range may not overlap. The use case for this change is to allow containerized processes to bind to priviliged ports, but prevent them from ever being allowed to modify their container's network configuration. The latter is accomplished by ensuring that the network namespace is not a child of the user namespace. This modification was needed to allow the container manager to disable a namespace's priviliged port restrictions without exposing control of the network namespace to processes in the user namespace. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-22ipv6: add NUMA awareness to seg6_hmac_init_algo()Eric Dumazet
Since we allocate per cpu storage, let's also use NUMA hints. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-226lowpan: use rb_entry()Geliang Tang
To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20net: dsa: Remove hwmon supportAndrew Lunn
Only the Marvell mv88e6xxx DSA driver made use of the HWMON support in DSA. The temperature sensor registers are actually in the embedded PHYs, and the PHY driver now supports it. So remove all HWMON support from DSA and drivers. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20inet: don't use sk_v6_rcv_saddr directlyJosef Bacik
When comparing two sockets we need to use inet6_rcv_saddr so we get a NULL sk_v6_rcv_saddr if the socket isn't AF_INET6, otherwise our comparison function can be wrong. Fixes: 637bc8b ("inet: reset tb->fastreuseport when adding a reuseport sk") Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20net: remove duplicate code.Mahesh Bandewar
netdev_rx_handler_register() checks to see if the handler is already busy which was recently separated into netdev_is_rx_handler_busy(). So use the same function inside register() to avoid code duplication. Essentially this change should be a no-op Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20fq_codel: Avoid regenerating skb flow hash unless necessaryAndrew Collins
The fq_codel qdisc currently always regenerates the skb flow hash. This wastes some cycles and prevents flow seperation in cases where the traffic has been encrypted and can no longer be understood by the flow dissector. Change it to use the prexisting flow hash if one exists, and only regenerate if necessary. Signed-off-by: Andrew Collins <acollins@cradlepoint.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20tipc: make replicast a user selectable optionJon Paul Maloy
If the bearer carrying multicast messages supports broadcast, those messages will be sent to all cluster nodes, irrespective of whether these nodes host any actual destinations socket or not. This is clearly wasteful if the cluster is large and there are only a few real destinations for the message being sent. In this commit we extend the eligibility of the newly introduced "replicast" transmit option. We now make it possible for a user to select which method he wants to be used, either as a mandatory setting via setsockopt(), or as a relative setting where we let the broadcast layer decide which method to use based on the ratio between cluster size and the message's actual number of destination nodes. In the latter case, a sending socket must stick to a previously selected method until it enters an idle period of at least 5 seconds. This eliminates the risk of message reordering caused by method change, i.e., when changes to cluster size or number of destinations would otherwise mandate a new method to be used. Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20tipc: introduce replicast as transport option for multicastJon Paul Maloy
TIPC multicast messages are currently carried over a reliable 'broadcast link', making use of the underlying media's ability to transport packets as L2 broadcast or IP multicast to all nodes in the cluster. When the used bearer is lacking that ability, we can instead emulate the broadcast service by replicating and sending the packets over as many unicast links as needed to reach all identified destinations. We now introduce a new TIPC link-level 'replicast' service that does this. Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20tipc: add functionality to lookup multicast destination nodesJon Paul Maloy
As a further preparation for the upcoming 'replicast' functionality, we add some necessary structs and functions for looking up and returning a list of all nodes that host destinations for a given multicast message. Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20tipc: add function for checking broadcast support in bearerJon Paul Maloy
As a preparation for the 'replicast' functionality we are going to introduce in the next commits, we need the broadcast base structure to store whether bearer broadcast is available at all from the currently used bearer or bearers. We do this by adding a new function tipc_bearer_bcast_support() to the bearer layer, and letting the bearer selection function in bcast.c use this to give a new boolean field, 'bcast_support' the appropriate value. Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20device: Implement a bus agnostic dev_num_vf routinePhil Sutter
Now that pci_bus_type has num_vf callback set, dev_num_vf can be implemented in a bus type independent way and the check for whether a PCI device is being handled in rtnetlink can be dropped. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20gre6: Clean up unused struct ipv6_tel_txoption definitionJakub Sitnicki
Commit b05229f44228 ("gre6: Cleanup GREv6 transmit path, call common GRE functions") removed the ip6gre specific transmit function, but left the struct ipv6_tel_txoption definition. Clean it up. Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20net: remove bh disabling around percpu_counter accessesEric Dumazet
Shaohua Li made percpu_counter irq safe in commit 098faf5805c8 ("percpu_counter: make APIs irq safe") We can safely remove BH disable/enable sections around various percpu_counter manipulations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19net: ipv6: Keep nexthop of multipath route on admin downDavid Ahern
IPv6 deletes route entries associated with multipath routes on an admin down where IPv4 does not. For example: $ ip ro ls vrf red unreachable default metric 8192 1.1.1.0/24 metric 64 nexthop via 10.100.1.254 dev eth1 weight 1 nexthop via 10.100.2.254 dev eth2 weight 1 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.4 10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4 $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Set link down: $ ip li set eth1 down IPv4 retains the multihop route but flags eth1 route as dead: $ ip ro ls vrf red unreachable default metric 8192 1.1.1.0/24 nexthop via 10.100.1.16 dev eth1 weight 1 dead linkdown nexthop via 10.100.2.16 dev eth2 weight 1 10.100.2.0/24 dev eth2 proto kernel scope link src 10.100.2.4 and IPv6 deletes the route as part of flushing all routes for the device: $ ip -6 ro ls vrf red 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Worse, on admin up of the device the multipath route has to be deleted to get this leg of the route re-added. This patch keeps routes that are part of a multipath route if ignore_routes_with_linkdown is set with the dead and linkdown flags enabling consistency between IPv4 and IPv6: $ ip -6 ro ls vrf red 2001:db8:2:: dev red proto none metric 0 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:11::/120 via 2001:db8:1::16 dev eth1 metric 1024 dead linkdown pref medium 2001:db8:11::/120 via 2001:db8:2::17 dev eth2 metric 1024 pref medium ... Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19net: caif: Remove unused stats member from struct chnl_netTobias Klauser
The stats member of struct chnl_net is used nowhere in the code, so it might as well be removed. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19net/sched: cls_flower: reduce fl_change stack sizeArnd Bergmann
The new ARP support has pushed the stack size over the edge on ARM, as there are two large objects on the stack in this function (mask and tb) and both have now grown a bit more: net/sched/cls_flower.c: In function 'fl_change': net/sched/cls_flower.c:928:1: error: the frame size of 1072 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] We can solve this by dynamically allocating one or both of them. I first tried to do it just for the mask, but that only saved 152 bytes on ARM, while this version just does it for the 'tb' array, bringing the stack size back down to 664 bytes. Fixes: 99d31326cbe6 ("net/sched: cls_flower: Support matching on ARP") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: Remove usage of net_device last_rx memberTobias Klauser
The network stack no longer uses the last_rx member of struct net_device since the bonding driver switched to use its own private last_rx in commit 9f242738376d ("bonding: use last_arp_rx in slave_last_rx()"). However, some drivers still (ab)use the field for their own purposes and some driver just update it without actually using it. Previously, there was an accompanying comment for the last_rx member added in commit 4dc89133f49b ("net: add a comment on netdev->last_rx") which asked drivers not to update is, unless really needed. However, this commend was removed in commit f8ff080dacec ("bonding: remove useless updating of slave->dev->last_rx"), so some drivers added later on still did update last_rx. Remove all usage of last_rx and switch three drivers (sky2, atp and smc91c92_cs) which actually read and write it to use their own private copy in netdev_priv. Compile-tested with allyesconfig and allmodconfig on x86 and arm. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Mirko Lindner <mlindner@marvell.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: dsa: use cpu_switch instead of ds[0]Vivien Didelot
Now that the DSA Ethernet switches are true Linux devices, the CPU switch is not necessarily the first one. If its address is higher than the second switch on the same MDIO bus, its index will be 1, not 0. Avoid any confusion by using dst->cpu_switch instead of dst->ds[0]. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: dsa: store CPU switch structure in the treeVivien Didelot
Store a dsa_switch pointer to the CPU switch in the tree instead of only its index. This avoids the need to initialize it to -1. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: ipv6: remove prefix arg to rt6_fill_nodeDavid Ahern
The prefix arg to rt6_fill_node is non-0 in only 1 path - rt6_dump_route where a user is requesting a prefix only dump. Simplify rt6_fill_node by removing the prefix arg and moving the prefix check to rt6_dump_route. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18net: ipv6: remove nowait arg to rt6_fill_nodeDavid Ahern
All callers of rt6_fill_node pass 0 for nowait arg. Remove the arg and simplify rt6_fill_node accordingly. rt6_fill_node passes the nowait of 0 to ip6mr_get_route. Remove the nowait arg from it as well. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: implement sender-side procedures for SSN Reset Request ParameterXin Long
This patch is to implement sender-side procedures for the Outgoing and Incoming SSN Reset Request Parameter described in rfc6525 section 5.1.2 and 5.1.3. It is also add sockopt SCTP_RESET_STREAMS in rfc6525 section 6.3.2 for users. Note that the new asoc member strreset_outstanding is to make sure only one reconf request chunk on the fly as rfc6525 section 5.1.1 demands. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add sockopt SCTP_ENABLE_STREAM_RESETXin Long
This patch is to add sockopt SCTP_ENABLE_STREAM_RESET to get/set strreset_enable to indicate which reconf request type it supports, which is described in rfc6525 section 6.3.1. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add reconf_enable in asoc ep and netnsXin Long
This patch is to add reconf_enable field in all of asoc ep and netns to indicate if they support stream reset. When initializing, asoc reconf_enable get the default value from ep reconf_enable which is from netns netns reconf_enable by default. It is also to add reconf_capable in asoc peer part to know if peer supports reconf_enable, the value is set if ext params have reconf chunk support when processing init chunk, just as rfc6525 section 5.1.1 demands. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add stream reconf primitiveXin Long
This patch is to add a primitive based on sctp primitive frame for sending stream reconf request. It works as the other primitives, and create a SCTP_CMD_REPLY command to send the request chunk out. sctp_primitive_RECONF would be the api to send a reconf request chunk. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add stream reconf timerXin Long
This patch is to add a per transport timer based on sctp timer frame for stream reconf chunk retransmission. It would start after sending a reconf request chunk, and stop after receiving the response chunk. If the timer expires, besides retransmitting the reconf request chunk, it would also do the same thing with data RTO timer. like to increase the appropriate error counts, and perform threshold management, possibly destroying the asoc if sctp retransmission thresholds are exceeded, just as section 5.1.1 describes. This patch is also to add asoc strreset_chunk, it is used to save the reconf request chunk, so that it can be retransmitted, and to check if the response is really for this request by comparing the information inside with the response chunk as well. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18sctp: add support for generating stream reconf ssn reset request chunkXin Long
This patch is to add asoc strreset_outseq and strreset_inseq for saving the reconf request sequence, initialize them when create assoc and process init, and also to define Incoming and Outgoing SSN Reset Request Parameter described in rfc6525 section 4.1 and 4.2, As they can be in one same chunk as section rfc6525 3.1-3 describes, it makes them in one function. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: reset tb->fastreuseport when adding a reuseport skJosef Bacik
If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and never set it again. Which means that in the future if we end up adding a bunch of reuseport sk's to that tb we'll have to do the expensive scan every time. Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family so we know what comparison to make, and the ipv6 only setting so we can make sure to compare with new sockets appropriately. Once one sk has made it onto the list we know that there are no potential bind conflicts on the owners list that match that sk's rcv_addr. So copy the sk's information into our bind bucket and set tb->fastruseport to FASTREUSESOCK_STRICT so we know we have to do an extra check for subsequent reuseport sockets and skip the expensive bind conflict check. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: split inet_csk_get_port into two functionsJosef Bacik
inet_csk_get_port does two different things, it either scans for an open port, or it tries to see if the specified port is available for use. Since these two operations have different rules and are basically independent lets split them into two different functions to make them both more readable. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: don't check for bind conflicts twice when searching for a portJosef Bacik
This is just wasted time, we've already found a tb that doesn't have a bind conflict, and we don't drop the head lock so scanning again isn't going to give us a different answer. Instead move the tb->reuse setting logic outside of the found_tb path and put it in the success: path. Then make it so that we don't goto again if we find a bind conflict in the found_tb path as we won't reach this anymore when we are scanning for an ephemeral port. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: kill smallest_size and smallest_portJosef Bacik
In inet_csk_get_port we seem to be using smallest_port to figure out where the best place to look for a SO_REUSEPORT sk that matches with an existing set of SO_REUSEPORT's. However if we get to the logic if (smallest_size != -1) { port = smallest_port; goto have_port; } we will do a useless search, because we would have already done the inet_csk_bind_conflict for that port and it would have returned 1, otherwise we would have gone to found_tb and succeeded. Since this logic makes us do yet another trip through inet_csk_bind_conflict for a port we know won't work just delete this code and save us the time. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: drop ->bind_conflictJosef Bacik
The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict is how they check the rcv_saddr, so delete this call back and simply change inet_csk_bind_conflict to call inet_rcv_saddr_equal. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18inet: collapse ipv4/v6 rcv_saddr_equal functions into oneJosef Bacik
We pass these per-protocol equal functions around in various places, but we can just have one function that checks the sk->sk_family and then do the right comparison function. I've also changed the ipv4 version to not cast to inet_sock since it is unneeded. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17tcp: accept RST for rcv_nxt - 1 after receiving a FINJason Baron
Using a Mac OSX box as a client connecting to a Linux server, we have found that when certain applications (such as 'ab'), are abruptly terminated (via ^C), a FIN is sent followed by a RST packet on tcp connections. The FIN is accepted by the Linux stack but the RST is sent with the same sequence number as the FIN, and Linux responds with a challenge ACK per RFC 5961. The OSX client then sometimes (they are rate-limited) does not reply with any RST as would be expected on a closed socket. This results in sockets accumulating on the Linux server left mostly in the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible. This sequence of events can tie up a lot of resources on the Linux server since there may be a lot of data in write buffers at the time of the RST. Accepting a RST equal to rcv_nxt - 1, after we have already successfully processed a FIN, has made a significant difference for us in practice, by freeing up unneeded resources in a more expedient fashion. A packetdrill test demonstrating the behavior: // testing mac osx rst behavior // Establish a connection 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 0.000 bind(3, ..., ...) = 0 0.000 listen(3, 1) = 0 0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5> 0.200 < . 1:1(0) ack 1 win 32768 0.200 accept(3, ..., ...) = 4 // Client closes the connection 0.300 < F. 1:1(0) ack 1 win 32768 // now send rst with same sequence 0.300 < R. 1:1(0) ack 1 win 32768 // make sure we are in TCP_CLOSE 0.400 %{ assert tcpi_state == 7 }% Signed-off-by: Jason Baron <jbaron@akamai.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17net: ping: Use right format specifier to avoid type castingGao Feng
The inet_num is u16, so use %hu instead of casting it to int. And the sk_bound_dev_if is int actually, so it needn't cast to int. Signed-off-by: Gao Feng <fgao@ikuai8.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17bridge: sparse fixes in br_ip6_multicast_alloc_query()Lance Richardson
Changed type of csum field in struct igmpv3_query from __be16 to __sum16 to eliminate type warning, made same change in struct igmpv3_report for consistency. Fixed up an ntohs() where htons() should have been used instead. Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2017-01-17mpls: Packet statsRobert Shearman
Having MPLS packet stats is useful for observing network operation and for diagnosing network problems. In the absence of anything better, RFC2863 and RFC3813 are used for guidance for which stats to expose and the semantics of them. In particular rx_noroutes maps to in unknown protos in RFC2863. The stats are exposed to userspace via AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of RTM_GETSTATS messages. All the introduced fields are 64-bit, even error ones, to ensure no overflow with long uptimes. Per-CPU counters are used to avoid cache-line contention on the commonly used fields. The other fields have also been made per-CPU for code to avoid performance problems in error conditions on the assumption that on some platforms the cost of atomic operations could be more expensive than sending the packet (which is what would be done in the success case). If that's not the case, we could instead not use per-CPU counters for these fields. Only unicast and non-fragment are exposed at the moment, but other counters can be exposed in the future either by adding to the end of struct mpls_link_stats or by additional netlink attributes in the AF_MPLS IFLA_STATS_AF_SPEC nested attribute. Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17net: AF-specific RTM_GETSTATS attributesRobert Shearman
Add the functionality for including address-family-specific per-link stats in RTM_GETSTATS messages. This is done through adding a new IFLA_STATS_AF_SPEC attribute under which address family attributes are nested and then the AF-specific attributes can be further nested. This follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the advantage of presenting an easily extended hierarchy. The rtnl_af_ops structure is extended to provide AFs with the opportunity to fill and provide the size of their stats attributes. One alternative would have been to provide AFs with the ability to add attributes directly into the RTM_GETSTATS message without a nested hierarchy. I discounted this approach as it increases the rate at which the 32 attribute number space is used up and it makes implementation a little more tricky for stats dump resuming (at the moment the order in which attributes are added to the message has to match the numeric order of the attributes). Another alternative would have been to register per-AF RTM_GETSTATS handlers. I discounted this approach as I perceived a common use-case to be getting all the stats for an interface and this approach would necessitate multiple requests/dumps to retrieve them all. Signed-off-by: Robert Shearman <rshearma@brocade.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Handle multicast packets properly in fast-RX path of mac80211, from Johannes Berg. 2) Because of a logic bug, the user can't actually force SW checksumming on r8152 devices. This makes diagnosis of hw checksumming bugs really annoying. Fix from Hayes Wang. 3) VXLAN route lookup does not take the source and destination ports into account, which means IPSEC policies cannot be matched properly. Fix from Martynas Pumputis. 4) Do proper RCU locking in netvsc callbacks, from Stephen Hemminger. 5) Fix SKB leaks in mlxsw driver, from Arkadi Sharshevsky. 6) If lwtunnel_fill_encap() fails, we do not abort the netlink message construction properly in fib_dump_info(), from David Ahern. 7) Do not use kernel stack for DMA buffers in atusb driver, from Stefan Schmidt. 8) Openvswitch conntack actions need to maintain a correct checksum, fix from Lance Richardson. 9) ax25_disconnect() is missing a check for ax25->sk being NULL, in fact it already checks this, but not in all of the necessary spots. Fix from Basil Gunn. 10) Action GET operations in the packet scheduler can erroneously bump the reference count of the entry, making it unreleasable. Fix from Jamal Hadi Salim. Jamal gives a great set of example command lines that trigger this in the commit message. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits) net sched actions: fix refcnt when GETing of action after bind net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions net/mlx4_core: Fix racy CQ (Completion Queue) free net: stmmac: don't use netdev_[dbg, info, ..] before net_device is registered net/mlx5e: Fix a -Wmaybe-uninitialized warning ax25: Fix segfault after sock connection timeout bpf: rework prog_digest into prog_tag tipc: allocate user memory with GFP_KERNEL flag net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types ip6_tunnel: Account for tunnel header in tunnel MTU mld: do not remove mld souce list info when set link down be2net: fix MAC addr setting on privileged BE3 VFs be2net: don't delete MAC on close on unprivileged BE3 VFs be2net: fix status check in be_cmd_pmac_add() cpmac: remove hopeless #warning ravb: do not use zero-length alignment DMA descriptor mlx4: do not call napi_schedule() without care openvswitch: maintain correct checksum state in conntrack actions tcp: fix tcp_fastopen unaligned access complaints on sparc ...
2017-01-16net sched actions: fix refcnt when GETing of action after bindJamal Hadi Salim
Demonstrating the issue: .. add a drop action $sudo $TC actions add action drop index 10 .. retrieve it $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 2 bind 0 installed 29 sec used 29 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 ... bug 1 above: reference is two. Reference is actually 1 but we forget to subtract 1. ... do a GET again and we see the same issue try a few times and nothing changes ~$ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 2 bind 0 installed 31 sec used 31 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 ... lets try to bind the action to a filter.. $ sudo $TC qdisc add dev lo ingress $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \ u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10 ... and now a few GETs: $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 3 bind 1 installed 204 sec used 204 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 4 bind 1 installed 206 sec used 206 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 5 bind 1 installed 235 sec used 235 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 .... as can be observed the reference count keeps going up. After the fix $ sudo $TC actions add action drop index 10 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 1 bind 0 installed 4 sec used 4 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 1 bind 0 installed 6 sec used 6 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 $ sudo $TC qdisc add dev lo ingress $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \ u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 2 bind 1 installed 32 sec used 32 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 $ sudo $TC -s actions get action gact index 10 action order 1: gact action drop random type none pass val 0 index 10 ref 2 bind 1 installed 33 sec used 33 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 Fixes: aecc5cefc389 ("net sched actions: fix GETing actions") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16net/sched: cls_flower: Disallow duplicate internal elementsPaul Blakey
Flower currently allows having the same filter twice with the same priority. Actions (and statistics update) will always execute on the first inserted rule leaving the second rule unused. This patch disallows that. Signed-off-by: Paul Blakey <paulb@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>