summaryrefslogtreecommitdiffstats
path: root/net/netfilter
AgeCommit message (Collapse)Author
2015-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree, they are: 1) Don't truncate ethernet protocol type to u8 in nft_compat, from Arturo Borrero. 2) Fix several problems in the addition/deletion of elements in nf_tables. 3) Fix module refcount leak in ip_vs_sync, from Julian Anastasov. 4) Fix a race condition in the abort path in the nf_tables transaction infrastructure. Basically aborted rules can show up as active rules until changes are unrolled, oneliner from Patrick McHardy. 5) Check for overflows in the data area of the rule, also from Patrick. 6) Fix off-by-one in the per-rule user data size field. This introduces a new nft_userdata structure that is placed at the beginning of the user data area that contains the length to save some bits from the rule and we only need one bit to indicate its presence, from Patrick. 7) Fix rule replacement error path, the replaced rule is deleted on error instead of leaving it in place. This has been fixed by relying on the abort path to undo the incomplete replacement. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-04netfilter: nf_tables: fix error handling of rule replacementPablo Neira Ayuso
In general, if a transaction object is added to the list successfully, we can rely on the abort path to undo what we've done. This allows us to simplify the error handling of the rule replacement path in nf_tables_newrule(). This implicitly fixes an unnecessary removal of the old rule, which needs to be left in place if we fail to replace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-04netfilter: nf_tables: fix userdata length overflowPatrick McHardy
The NFT_USERDATA_MAXLEN is defined to 256, however we only have a u8 to store its size. Introduce a struct nft_userdata which contains a length field and indicate its presence using a single bit in the rule. The length field of struct nft_userdata is also a u8, however we don't store zero sized data, so the actual length is udata->len + 1. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-04netfilter: nf_tables: check for overflow of rule dlen fieldPatrick McHardy
Check that the space required for the expressions doesn't exceed the size of the dlen field, which would lead to the iterators crashing. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-04netfilter: nf_tables: fix transaction race conditionPatrick McHardy
A race condition exists in the rule transaction code for rules that get added and removed within the same transaction. The new rule starts out as inactive in the current and active in the next generation and is inserted into the ruleset. When it is deleted, it is additionally set to inactive in the next generation as well. On commit the next generation is begun, then the actions are finalized. For the new rule this would mean clearing out the inactive bit for the previously current, now next generation. However nft_rule_clear() clears out the bits for *both* generations, activating the rule in the current generation, where it should be deactivated due to being deleted. The rule will thus be active until the deletion is finalized, removing the rule from the ruleset. Similarly, when aborting a transaction for the same case, the undo of insertion will remove it from the RCU protected rule list, the deletion will clear out all bits. However until the next RCU synchronization after all operations have been undone, the rule is active on CPUs which can still see the rule on the list. Generally, there may never be any modifications of the current generations' inactive bit since this defeats the entire purpose of atomicity. Change nft_rule_clear() to only touch the next generations bit to fix this. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-02-27rhashtable: remove indirection for grow/shrink decision functionsDaniel Borkmann
Currently, all real users of rhashtable default their grow and shrink decision functions to rht_grow_above_75() and rht_shrink_below_30(), so that there's currently no need to have this explicitly selectable. It can/should be generic and private inside rhashtable until a real use case pops up. Since we can make this private, we'll save us this additional indirection layer and can improve insertion/deletion time as well. Reference: http://patchwork.ozlabs.org/patch/443040/ Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-24Merge https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvsPablo Neira Ayuso
Simon Horman says: ==================== Second Round of IPVS Fixes for v3.20 This patch resolves some memory leaks in connection synchronisation code that date back to v2.6.39. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-02-22ipvs: add missing ip_vs_pe_put in sync codeJulian Anastasov
ip_vs_conn_fill_param_sync() gets in param.pe a module reference for persistence engine from __ip_vs_pe_getbyname() but forgets to put it. Problem occurs in backup for sync protocol v1 (2.6.39). Also, pe_data usually comes in sync messages for connection templates and ip_vs_conn_new() copies the pointer only in this case. Make sure pe_data is not leaked if it comes unexpectedly for normal connections. Leak can happen only if bogus messages are sent to backup server. Fixes: fe5e7a1efb66 ("IPVS: Backup, Adding Version 1 receive capability") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2015-02-22netfilter: nf_tables: fix addition/deletion of elements from commit/abortPablo Neira Ayuso
We have several problems in this path: 1) There is a use-after-free when removing individual elements from the commit path. 2) We have to uninit() the data part of the element from the abort path to avoid a chain refcount leak. 3) We have to check for set->flags to see if there's a mapping, instead of the element flags. 4) We have to check for !(flags & NFT_SET_ELEM_INTERVAL_END) to skip elements that are part of the interval that have no data part, so they don't need to be uninit(). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-02-22netfilter: nft_compat: don't truncate ethernet protocol type to u8Arturo Borrero
Use u16 for protocol and then cast it to __be16 >> net/netfilter/nft_compat.c:140:37: sparse: incorrect type in assignment (different base types) net/netfilter/nft_compat.c:140:37: expected restricted __be16 [usertype] ethproto net/netfilter/nft_compat.c:140:37: got unsigned char [unsigned] [usertype] proto >> net/netfilter/nft_compat.c:351:37: sparse: incorrect type in assignment (different base types) net/netfilter/nft_compat.c:351:37: expected restricted __be16 [usertype] ethproto net/netfilter/nft_compat.c:351:37: got unsigned char [unsigned] [usertype] proto Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-02-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains updates for your net tree, they are: 1) Fix removal of destination in IPVS when the new mixed family support is used, from Alexey Andriyanov via Simon Horman. 2) Fix module refcount undeflow in nft_compat when reusing a match / target. 3) Fix iptables-restore when the recent match is used with a new hitcount that exceeds threshold, from Florian Westphal. 4) Fix stack corruption in xt_socket due to using stack storage to save the inner IPv6 header, from Eric Dumazet. I'll follow up soon with another batch with more fixes that are still cooking. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-16netfilter: xt_socket: fix a stack corruption bugEric Dumazet
As soon as extract_icmp6_fields() returns, its local storage (automatic variables) is deallocated and can be overwritten. Lets add an additional parameter to make sure storage is valid long enough. While we are at it, adds some const qualifiers. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: b64c9256a9b76 ("tproxy: added IPv6 support to the socket match") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-02-16netfilter: xt_recent: don't reject rule if new hitcount exceeds table maxFlorian Westphal
given: -A INPUT -m recent --update --seconds 30 --hitcount 4 and iptables-save > foo then iptables-restore < foo will fail with: kernel: xt_recent: hitcount (4) is larger than packets to be remembered (4) for table DEFAULT Even when the check is fixed, the restore won't work if the hitcount is increased to e.g. 6, since by the time checkentry runs it will find the 'old' incarnation of the table. We can avoid this by increasing the maximum threshold silently; we only have to rm all the current entries of the table (these entries would not have enough room to handle the increased hitcount). This even makes (not-very-useful) -A INPUT -m recent --update --seconds 30 --hitcount 4 -A INPUT -m recent --update --seconds 30 --hitcount 42 work. Fixes: abc86d0f99242b7f142b (netfilter: xt_recent: relax ip_pkt_list_tot restrictions) Tracked-down-by: Chris Vine <chris@cvine.freeserve.co.uk> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-02-16netfilter: nft_compat: fix module refcount underflowPablo Neira Ayuso
Feb 12 18:20:42 nfdev kernel: ------------[ cut here ]------------ Feb 12 18:20:42 nfdev kernel: WARNING: CPU: 4 PID: 4359 at kernel/module.c:963 module_put+0x9b/0xba() Feb 12 18:20:42 nfdev kernel: CPU: 4 PID: 4359 Comm: ebtables-compat Tainted: G W 3.19.0-rc6+ #43 [...] Feb 12 18:20:42 nfdev kernel: Call Trace: Feb 12 18:20:42 nfdev kernel: [<ffffffff815fd911>] dump_stack+0x4c/0x65 Feb 12 18:20:42 nfdev kernel: [<ffffffff8103e6f7>] warn_slowpath_common+0x9c/0xb6 Feb 12 18:20:42 nfdev kernel: [<ffffffff8109919f>] ? module_put+0x9b/0xba Feb 12 18:20:42 nfdev kernel: [<ffffffff8103e726>] warn_slowpath_null+0x15/0x17 Feb 12 18:20:42 nfdev kernel: [<ffffffff8109919f>] module_put+0x9b/0xba Feb 12 18:20:42 nfdev kernel: [<ffffffff813ecf7c>] nft_match_destroy+0x45/0x4c Feb 12 18:20:42 nfdev kernel: [<ffffffff813e683f>] nf_tables_rule_destroy+0x28/0x70 Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
2015-02-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains two small Netfilter updates for your net-next tree, they are: 1) Add ebtables support to nft_compat, from Arturo Borrero. 2) Fix missing validation of the SET_ID attribute in the lookup expressions, from Patrick McHardy. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09ipvs: fix inability to remove a mixed-family RSAlexey Andriyanov
The current code prevents any operation with a mixed-family dest unless IP_VS_CONN_F_TUNNEL flag is set. The problem is that it's impossible for the client to follow this rule, because ip_vs_genl_parse_dest does not even read the destination conn_flags when cmd = IPVS_CMD_DEL_DEST (need_full_dest = 0). Also, not every client can pass this flag when removing a dest. ipvsadm, for example, does not support the "-i" command line option together with the "-d" option. This change disables any checks for mixed-family on IPVS_CMD_DEL_DEST command. Signed-off-by: Alexey Andriyanov <alan@al-an.info> Fixes: bc18d37f676f ("ipvs: Allow heterogeneous pools now that we support them") Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2015-02-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/vxlan.c drivers/vhost/net.c include/linux/if_vlan.h net/core/dev.c The net/core/dev.c conflict was the overlap of one commit marking an existing function static whilst another was adding a new function. In the include/linux/if_vlan.h case, the type used for a local variable was changed in 'net', whereas the function got rewritten to fix a stacked vlan bug in 'net-next'. In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next' overlapped with an endainness fix for VHOST 1.0 in 'net'. In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter in 'net-next' whereas in 'net' there was a bug fix to pass in the correct network namespace pointer in calls to this function. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04netfilter: Use rhashtable walk iteratorHerbert Xu
This patch gets rid of the manual rhashtable walk in nft_hash which touches rhashtable internals that should not be exposed. It does so by using the rhashtable iterator primitives. Note that I'm leaving nft_hash_destroy alone since it's only invoked on shutdown and it shouldn't be affected by changes to rhashtable internals (or at least not what I'm planning to change). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30netfilter: nft_lookup: add missing attribute validation for NFTA_LOOKUP_SET_IDPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-30netfilter: nft_compat: add ebtables supportArturo Borrero
This patch extends nft_compat to support ebtables extensions. ebtables verdict codes are translated to the ones used by the nf_tables engine, so we can properly use ebtables target extensions from nft_compat. This patch extends previous work by Giuseppe Longo <giuseppelng@gmail.com>. Signed-off-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-30netfilter: nf_tables: fix leaks in error path of nf_tables_newchain()Pablo Neira Ayuso
Release statistics and module refcount on memory allocation problems. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-30ipvs: rerouting to local clients is not needed anymoreJulian Anastasov
commit f5a41847acc5 ("ipvs: move ip_route_me_harder for ICMP") from 2.6.37 introduced ip_route_me_harder() call for responses to local clients, so that we can provide valid rt_src after SNAT. It was used by TCP to provide valid daddr for ip_send_reply(). After commit 0a5ebb8000c5 ("ipv4: Pass explicit daddr arg to ip_send_reply()." from 3.0 this rerouting is not needed anymore and should be avoided, especially in LOCAL_IN. Fixes 3.12.33 crash in xfrm reported by Florian Wiessner: "3.12.33 - BUG xfrm_selector_match+0x25/0x2f6" Reported-by: Smart Weblications GmbH - Florian Wiessner <f.wiessner@smart-weblications.de> Tested-by: Smart Weblications GmbH - Florian Wiessner <f.wiessner@smart-weblications.de> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
2015-01-26netfilter: nf_tables: disable preemption when restoring chain countersPablo Neira Ayuso
With CONFIG_DEBUG_PREEMPT=y [22144.496057] BUG: using smp_processor_id() in preemptible [00000000] code: iptables-compat/10406 [22144.496061] caller is debug_smp_processor_id+0x17/0x1b [22144.496065] CPU: 2 PID: 10406 Comm: iptables-compat Not tainted 3.19.0-rc4+ # [...] [22144.496092] Call Trace: [22144.496098] [<ffffffff8145b9fa>] dump_stack+0x4f/0x7b [22144.496104] [<ffffffff81244f52>] check_preemption_disabled+0xd6/0xe8 [22144.496110] [<ffffffff81244f90>] debug_smp_processor_id+0x17/0x1b [22144.496120] [<ffffffffa07c557e>] nft_stats_alloc+0x94/0xc7 [nf_tables] [22144.496130] [<ffffffffa07c73d2>] nf_tables_newchain+0x471/0x6d8 [nf_tables] [22144.496140] [<ffffffffa07c5ef6>] ? nft_trans_alloc+0x18/0x34 [nf_tables] [22144.496154] [<ffffffffa063c8da>] nfnetlink_rcv_batch+0x2b4/0x457 [nfnetlink] Reported-by: Andreas Schultz <aschultz@tpip.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-19netfilter: nf_tables: validate hooks in NAT expressionsPablo Neira Ayuso
The user can crash the kernel if it uses any of the existing NAT expressions from the wrong hook, so add some code to validate this when loading the rule. This patch introduces nft_chain_validate_hooks() which is based on an existing function in the bridge version of the reject expression. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-18netlink: make nlmsg_end() and genlmsg_end() voidJohannes Berg
Contrary to common expectations for an "int" return, these functions return only a positive value -- if used correctly they cannot even return 0 because the message header will necessarily be in the skb. This makes the very common pattern of if (genlmsg_end(...) < 0) { ... } be a whole bunch of dead code. Many places also simply do return nlmsg_end(...); and the caller is expected to deal with it. This also commonly (at least for me) causes errors, because it is very common to write if (my_function(...)) /* error condition */ and if my_function() does "return nlmsg_end()" this is of course wrong. Additionally, there's not a single place in the kernel that actually needs the message length returned, and if anyone needs it later then it'll be very easy to just use skb->len there. Remove this, and make the functions void. This removes a bunch of dead code as described above. The patch adds lines because I did - return nlmsg_end(...); + nlmsg_end(...); + return 0; I could have preserved all the function's return values by returning skb->len, but instead I've audited all the places calling the affected functions and found that none cared. A few places actually compared the return value with <= 0 in dump functionality, but that could just be changed to < 0 with no change in behaviour, so I opted for the more efficient version. One instance of the error I've made numerous times now is also present in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't check for <0 or <=0 and thus broke out of the loop every single time. I've preserved this since it will (I think) have caused the messages to userspace to be formatted differently with just a single message for every SKB returned to userspace. It's possible that this isn't needed for the tools that actually use this, but I don't even know what they are so couldn't test that changing this behaviour would be acceptable. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== netfilter updates for net-next The following patchset contains netfilter updates for net-next, just a bunch of cleanups and small enhancement to selectively flush conntracks in ctnetlink, more specifically the patches are: 1) Rise default number of buckets in conntrack from 16384 to 65536 in systems with >= 4GBytes, patch from Marcelo Leitner. 2) Small refactor to save one level on indentation in xt_osf, from Joe Perches. 3) Remove unnecessary sizeof(char) in nf_log, from Fabian Frederick. 4) Another small cleanup to remove redundant variable in nfnetlink, from Duan Jiong. 5) Fix compilation warning in nfnetlink_cthelper on parisc, from Chen Gang. 6) Fix wrong format in debugging for ctseqadj, from Gao feng. 7) Selective conntrack flushing through the mark for ctnetlink, patch from Kristian Evensen. 8) Remove nf_ct_conntrack_flush_report() exported symbol now that is not required anymore after the selective flushing patch, again from Kristian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/xen-netfront.c Minor overlapping changes in xen-netfront.c, mostly to do with some buffer management changes alongside the split of stats into TX and RX. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== netfilter/ipvs fixes for net The following patchset contains netfilter/ipvs fixes, they are: 1) Small fix for the FTP helper in IPVS, a diff variable may be left unset when CONFIG_IP_VS_IPV6 is set. Patch from Dan Carpenter. 2) Fix nf_tables port NAT in little endian archs, patch from leroy christophe. 3) Fix race condition between conntrack confirmation and flush from userspace. This is the second reincarnation to resolve this problem. 4) Make sure inner messages in the batch come with the nfnetlink header. 5) Relax strict check from nfnetlink_bind() that may break old userspace applications using all 1s group mask. 6) Schedule removal of chains once no sets and rules refer to them in the new nf_tables ruleset flush command. Reported by Asbjoern Sloth Toennesen. Note that this batch comes later than usual because of the short winter holidays. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-08netfilter: conntrack: Remove nf_ct_conntrack_flush_reportKristian Evensen
The only user of nf_ct_conntrack_flush_report() was ctnetlink_del_conntrack(). After adding support for flushing connections with a given mark, this function is no longer called. Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-08netfilter: conntrack: Flush connections with a given markKristian Evensen
This patch adds support for selective flushing of conntrack mappings. By adding CTA_MARK and CTA_MARK_MASK to a delete-message, the mark (and mask) is checked before a connection is deleted while flushing. Configuring the flush is moved out of ctnetlink_del_conntrack(), and instead of calling nf_conntrack_flush_report(), we always call nf_ct_iterate_cleanup(). This enables us to only make one call from the new ctnetlink_flush_conntrack() and makes it easy to add more filter parameters. Filtering is done in the ctnetlink_filter_match()-function, which is also called from ctnetlink_dump_table(). ctnetlink_dump_filter has been renamed ctnetlink_filter, to indicated that it is no longer only used when dumping conntrack entries. Moreover, reject mark filters with -EOPNOTSUPP if no ct mark support is available. Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-06netfilter: nf_tables: fix flush ruleset chain dependenciesPablo Neira Ayuso
Jumping between chains doesn't mix well with flush ruleset. Rules from a different chain and set elements may still refer to us. [ 353.373791] ------------[ cut here ]------------ [ 353.373845] kernel BUG at net/netfilter/nf_tables_api.c:1159! [ 353.373896] invalid opcode: 0000 [#1] SMP [ 353.373942] Modules linked in: intel_powerclamp uas iwldvm iwlwifi [ 353.374017] CPU: 0 PID: 6445 Comm: 31c3.nft Not tainted 3.18.0 #98 [ 353.374069] Hardware name: LENOVO 5129CTO/5129CTO, BIOS 6QET47WW (1.17 ) 07/14/2010 [...] [ 353.375018] Call Trace: [ 353.375046] [<ffffffff81964c31>] ? nf_tables_commit+0x381/0x540 [ 353.375101] [<ffffffff81949118>] nfnetlink_rcv+0x3d8/0x4b0 [ 353.375150] [<ffffffff81943fc5>] netlink_unicast+0x105/0x1a0 [ 353.375200] [<ffffffff8194438e>] netlink_sendmsg+0x32e/0x790 [ 353.375253] [<ffffffff818f398e>] sock_sendmsg+0x8e/0xc0 [ 353.375300] [<ffffffff818f36b9>] ? move_addr_to_kernel.part.20+0x19/0x70 [ 353.375357] [<ffffffff818f44f9>] ? move_addr_to_kernel+0x19/0x30 [ 353.375410] [<ffffffff819016d2>] ? verify_iovec+0x42/0xd0 [ 353.375459] [<ffffffff818f3e10>] ___sys_sendmsg+0x3f0/0x400 [ 353.375510] [<ffffffff810615fa>] ? native_sched_clock+0x2a/0x90 [ 353.375563] [<ffffffff81176697>] ? acct_account_cputime+0x17/0x20 [ 353.375616] [<ffffffff8110dc78>] ? account_user_time+0x88/0xa0 [ 353.375667] [<ffffffff818f4bbd>] __sys_sendmsg+0x3d/0x80 [ 353.375719] [<ffffffff81b184f4>] ? int_check_syscall_exit_work+0x34/0x3d [ 353.375776] [<ffffffff818f4c0d>] SyS_sendmsg+0xd/0x20 [ 353.375823] [<ffffffff81b1826d>] system_call_fastpath+0x16/0x1b Release objects in this order: rules -> sets -> chains -> tables, to make sure no references to chains are held anymore. Reported-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.biz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-06netfilter: nfnetlink: relax strict multicast group check from netlink_bindPablo Neira Ayuso
Relax the checking that was introduced in 97840cb ("netfilter: nfnetlink: fix insufficient validation in nfnetlink_bind") when the subscription bitmask is used. Existing userspace code code may request to listen to all of the existing netlink groups by setting an all to one subscription group bitmask. Netlink already validates subscription via setsockopt() for us. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-06netfilter: nfnetlink: validate nfnetlink header from batchPablo Neira Ayuso
Make sure there is enough room for the nfnetlink header in the netlink messages that are part of the batch. There is a similar check in netlink_rcv_skb(). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-06netfilter: conntrack: fix race between confirmation and flushPablo Neira Ayuso
Commit 5195c14c8b27c ("netfilter: conntrack: fix race in __nf_conntrack_confirm against get_next_corpse") aimed to resolve the race condition between the confirmation (packet path) and the flush command (from control plane). However, it introduced a crash when several packets race to add a new conntrack, which seems easier to reproduce when nf_queue is in place. Fix this race, in __nf_conntrack_confirm(), by removing the CT from unconfirmed list before checking the DYING bit. In case race occured, re-add the CT to the dying list This patch also changes the verdict from NF_ACCEPT to NF_DROP when we lose race. Basically, the confirmation happens for the first packet that we see in a flow. If you just invoked conntrack -F once (which should be the common case), then this is likely to be the first packet of the flow (unless you already called flush anytime soon in the past). This should be hard to trigger, but better drop this packet, otherwise we leave things in inconsistent state since the destination will likely reply to this packet, but it will find no conntrack, unless the origin retransmits. The change of the verdict has been discussed in: https://www.marc.info/?l=linux-netdev&m=141588039530056&w=2 Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-01-05netfilter: nf_ct_seqadj: print ack seq in the right host byte orderGao feng
new_start_seq and new_end_seq are network byte order, print the host byte order in debug message and print seq number as the type of unsigned int. Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@soleta.eu>
2015-01-05netfilter: nfnetlink_cthelper: Remove 'const' and '&' to avoid warningsChen Gang
The related code can be simplified, and also can avoid related warnings (with allmodconfig under parisc): CC [M] net/netfilter/nfnetlink_cthelper.o net/netfilter/nfnetlink_cthelper.c: In function ‘nfnl_cthelper_from_nlattr’: net/netfilter/nfnetlink_cthelper.c:97:9: warning: passing argument 1 o ‘memcpy’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-array-qualifiers] memcpy(&help->data, nla_data(attr), help->helper->data_len); ^ In file included from include/linux/string.h:17:0, from include/uapi/linux/uuid.h:25, from include/linux/uuid.h:23, from include/linux/mod_devicetable.h:12, from ./arch/parisc/include/asm/hardware.h:4, from ./arch/parisc/include/asm/processor.h:15, from ./arch/parisc/include/asm/spinlock.h:6, from ./arch/parisc/include/asm/atomic.h:21, from include/linux/atomic.h:4, from ./arch/parisc/include/asm/bitops.h:12, from include/linux/bitops.h:36, from include/linux/kernel.h:10, from include/linux/list.h:8, from include/linux/module.h:9, from net/netfilter/nfnetlink_cthelper.c:11: ./arch/parisc/include/asm/string.h:8:8: note: expected ‘void *’ but argument is of type ‘const char (*)[]’ void * memcpy(void * dest,const void *src,size_t count); ^ Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@soleta.eu>
2015-01-05netfilter: nfnetlink: remove redundant variable nskbDuan Jiong
Actually after netlink_skb_clone() is called, the nskb and skb will point to the same thing, but they are used just like they are different, sometimes this is confusing, so i think there is no necessary to keep nskb anymore. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@soleta.eu>
2015-01-03rhashtable: Per bucket locks & deferred expansion/shrinkingThomas Graf
Introduces an array of spinlocks to protect bucket mutations. The number of spinlocks per CPU is configurable and selected based on the hash of the bucket. This allows for parallel insertions and removals of entries which do not share a lock. The patch also defers expansion and shrinking to a worker queue which allows insertion and removal from atomic context. Insertions and deletions may occur in parallel to it and are only held up briefly while the particular bucket is linked or unzipped. Mutations of the bucket table pointer is protected by a new mutex, read access is RCU protected. In the event of an expansion or shrinking, the new bucket table allocated is exposed as a so called future table as soon as the resize process starts. Lookups, deletions, and insertions will briefly use both tables. The future table becomes the main table after an RCU grace period and initial linking of the old to the new table was performed. Optimization of the chains to make use of the new number of buckets follows only the new table is in use. The side effect of this is that during that RCU grace period, a bucket traversal using any rht_for_each() variant on the main table will not see any insertions performed during the RCU grace period which would at that point land in the future table. The lookup will see them as it searches both tables if needed. Having multiple insertions and removals occur in parallel requires nelems to become an atomic counter. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03nft_hash: Remove rhashtable_remove_pprev()Thomas Graf
The removal function of nft_hash currently stores a reference to the previous element during lookup which is used to optimize removal later on. This was possible because a lock is held throughout calling rhashtable_lookup() and rhashtable_remove(). With the introdution of deferred table resizing in parallel to lookups and insertions, the nftables lock will no longer synchronize all table mutations and the stored pprev may become invalid. Removing this optimization makes removal slightly more expensive on average but allows taking the resize cost out of the insert and remove path. Signed-off-by: Thomas Graf <tgraf@suug.ch> Cc: netfilter-devel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03rhashtable: Convert bucket iterators to take table and indexThomas Graf
This patch is in preparation to introduce per bucket spinlocks. It extends all iterator macros to take the bucket table and bucket index. It also introduces a new rht_dereference_bucket() to handle protected accesses to buckets. It introduces a barrier() to the RCU iterators to the prevent the compiler from caching the first element. The lockdep verifier is introduced as stub which always succeeds and properly implement in the next patch when the locks are introduced. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03rhashtable: Do hashing inside of rhashtable_lookup_compare()Thomas Graf
Hash the key inside of rhashtable_lookup_compare() like rhashtable_lookup() does. This allows to simplify the hashing functions and keep them private. Signed-off-by: Thomas Graf <tgraf@suug.ch> Cc: netfilter-devel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-27netlink/genetlink: pass network namespace to bind/unbindJohannes Berg
Netlink families can exist in multiple namespaces, and for the most part multicast subscriptions are per network namespace. Thus it only makes sense to have bind/unbind notifications per network namespace. To achieve this, pass the network namespace of a given client socket to the bind/unbind functions. Also do this in generic netlink, and there also make sure that any bind for multicast groups that only exist in init_net is rejected. This isn't really a problem if it is accepted since a client in a different namespace will never receive any notifications from such a group, but it can confuse the family if not rejected (it's also possible to silently (without telling the family) accept it, but it would also have to be ignored on unbind so families that take any kind of action on bind/unbind won't do unnecessary work for invalid clients like that. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23netfilter: nf_tables: fix port natting in little endian archsleroy christophe
Make sure this fetches 16-bits port data from the register. Remove casting to make sparse happy, not needed anymore. Signed-off-by: leroy christophe <christophe.leroy@c-s.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-12-23netfilter: log: remove unnecessary sizeof(char)Fabian Frederick
sizeof(char) is always 1. Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-12-23netfilter: xt_osf: Use continue to reduce indentationJoe Perches
Invert logic in test to use continue. This routine already uses continue, use it a bit more to minimize > 80 column long lines and unnecessary indentation. No change in compiled object file. Other miscellanea: o Remove trailing whitespace o Realign arguments to multiline statement Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-12-23netfilter: conntrack: adjust nf_conntrack_buckets default valueMarcelo Leitner
Manually bumping either nf_conntrack_buckets or nf_conntrack_max has become a common task as our Linux servers tend to serve more and more clients/applications, so let's adjust nf_conntrack_buckets this to a more updated value. Now for systems with more than 4GB of memory, nf_conntrack_buckets becomes 65536 instead of 16384, resulting in nf_conntrack_max=256k entries. Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-12-18Merge tag 'ipvs2-for-v3.19' of ↵Pablo Neira Ayuso
https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next into ipvs-next Simon Horman says: ==================== Second round of IPVS Updates for v3.19 please consider these IPVS updates for v3.19 or alternatively v3.20. The single patch in this series fixes a long standing bug that has not caused any trouble and thus is not being prioritised as a fix. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-12-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) New offloading infrastructure and example 'rocker' driver for offloading of switching and routing to hardware. This work was done by a large group of dedicated individuals, not limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend, Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu 2) Start making the networking operate on IOV iterators instead of modifying iov objects in-situ during transfers. Thanks to Al Viro and Herbert Xu. 3) A set of new netlink interfaces for the TIPC stack, from Richard Alpe. 4) Remove unnecessary looping during ipv6 routing lookups, from Martin KaFai Lau. 5) Add PAUSE frame generation support to gianfar driver, from Matei Pavaluca. 6) Allow for larger reordering levels in TCP, which are easily achievable in the real world right now, from Eric Dumazet. 7) Add a variable of napi_schedule that doesn't need to disable cpu interrupts, from Eric Dumazet. 8) Use a doubly linked list to optimize neigh_parms_release(), from Nicolas Dichtel. 9) Various enhancements to the kernel BPF verifier, and allow eBPF programs to actually be attached to sockets. From Alexei Starovoitov. 10) Support TSO/LSO in sunvnet driver, from David L Stevens. 11) Allow controlling ECN usage via routing metrics, from Florian Westphal. 12) Remote checksum offload, from Tom Herbert. 13) Add split-header receive, BQL, and xmit_more support to amd-xgbe driver, from Thomas Lendacky. 14) Add MPLS support to openvswitch, from Simon Horman. 15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen Klassert. 16) Do gro flushes on a per-device basis using a timer, from Eric Dumazet. This tries to resolve the conflicting goals between the desired handling of bulk vs. RPC-like traffic. 17) Allow userspace to ask for the CPU upon what a packet was received/steered, via SO_INCOMING_CPU. From Eric Dumazet. 18) Limit GSO packets to half the current congestion window, from Eric Dumazet. 19) Add a generic helper so that all drivers set their RSS keys in a consistent way, from Eric Dumazet. 20) Add xmit_more support to enic driver, from Govindarajulu Varadarajan. 21) Add VLAN packet scheduler action, from Jiri Pirko. 22) Support configurable RSS hash functions via ethtool, from Eyal Perry. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits) Fix race condition between vxlan_sock_add and vxlan_sock_release net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header net/mlx4: Add support for A0 steering net/mlx4: Refactor QUERY_PORT net/mlx4_core: Add explicit error message when rule doesn't meet configuration net/mlx4: Add A0 hybrid steering net/mlx4: Add mlx4_bitmap zone allocator net/mlx4: Add a check if there are too many reserved QPs net/mlx4: Change QP allocation scheme net/mlx4_core: Use tasklet for user-space CQ completion events net/mlx4_core: Mask out host side virtualization features for guests net/mlx4_en: Set csum level for encapsulated packets be2net: Export tunnel offloads only when a VxLAN tunnel is created gianfar: Fix dma check map error when DMA_API_DEBUG is enabled cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call net: fec: only enable mdio interrupt before phy device link up net: fec: clear all interrupt events to support i.MX6SX net: fec: reset fep link status in suspend function net: sock: fix access via invalid file descriptor net: introduce helper macro for_each_cmsghdr ...
2014-12-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS changes from Al Viro: "First pile out of several (there _definitely_ will be more). Stuff in this one: - unification of d_splice_alias()/d_materialize_unique() - iov_iter rewrite - killing a bunch of ->f_path.dentry users (and f_dentry macro). Getting that completed will make life much simpler for unionmount/overlayfs, since then we'll be able to limit the places sensitive to file _dentry_ to reasonably few. Which allows to have file_inode(file) pointing to inode in a covered layer, with dentry pointing to (negative) dentry in union one. Still not complete, but much closer now. - crapectomy in lustre (dead code removal, mostly) - "let's make seq_printf return nothing" preparations - assorted cleanups and fixes There _definitely_ will be more piles" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) copy_from_iter_nocache() new helper: iov_iter_kvec() csum_and_copy_..._iter() iov_iter.c: handle ITER_KVEC directly iov_iter.c: convert copy_to_iter() to iterate_and_advance iov_iter.c: convert copy_from_iter() to iterate_and_advance iov_iter.c: get rid of bvec_copy_page_{to,from}_iter() iov_iter.c: convert iov_iter_zero() to iterate_and_advance iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds iov_iter.c: convert iov_iter_npages() to iterate_all_kinds iov_iter.c: iterate_and_advance iov_iter.c: macros for iterating over iov_iter kill f_dentry macro dcache: fix kmemcheck warning in switch_names new helper: audit_file() nfsd_vfs_write(): use file_inode() ncpfs: use file_inode() kill f_dentry uses lockd: get rid of ->f_path.dentry->d_sb ...
2014-12-10ipvs: uninitialized data with IP_VS_IPV6Dan Carpenter
The app_tcp_pkt_out() function expects "*diff" to be set and ends up using uninitialized data if CONFIG_IP_VS_IPV6 is turned on. The same issue is there in app_tcp_pkt_in(). Thanks to Julian Anastasov for noticing that. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>