From cf28f3bbfca097d956f9021cb710dfad56adcc62 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Mon, 17 Aug 2020 10:42:14 -0700 Subject: bpf: Use get_file_rcu() instead of get_file() for task_file iterator With latest `bpftool prog` command, we observed the following kernel panic. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page PGD dfe894067 P4D dfe894067 PUD deb663067 PMD 0 Oops: 0010 [#1] SMP CPU: 9 PID: 6023 ... RIP: 0010:0x0 Code: Bad RIP value. RSP: 0000:ffffc900002b8f18 EFLAGS: 00010286 RAX: ffff8883a405f400 RBX: ffff888e46a6bf00 RCX: 000000008020000c RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8883a405f400 RBP: ffff888e46a6bf50 R08: 0000000000000000 R09: ffffffff81129600 R10: ffff8883a405f300 R11: 0000160000000000 R12: 0000000000002710 R13: 000000e9494b690c R14: 0000000000000202 R15: 0000000000000009 FS: 00007fd9187fe700(0000) GS:ffff888e46a40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 0000000de5d33002 CR4: 0000000000360ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: rcu_core+0x1a4/0x440 __do_softirq+0xd3/0x2c8 irq_exit+0x9d/0xa0 smp_apic_timer_interrupt+0x68/0x120 apic_timer_interrupt+0xf/0x20 RIP: 0033:0x47ce80 Code: Bad RIP value. RSP: 002b:00007fd9187fba40 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000002 RBX: 00007fd931789160 RCX: 000000000000010c RDX: 00007fd9308cdfb4 RSI: 00007fd9308cdfb4 RDI: 00007ffedd1ea0a8 RBP: 00007fd9187fbab0 R08: 000000000000000e R09: 000000000000002a R10: 0000000000480210 R11: 00007fd9187fc570 R12: 00007fd9316cc400 R13: 0000000000000118 R14: 00007fd9308cdfb4 R15: 00007fd9317a9380 After further analysis, the bug is triggered by Commit eaaacd23910f ("bpf: Add task and task/file iterator targets") which introduced task_file bpf iterator, which traverses all open file descriptors for all tasks in the current namespace. The latest `bpftool prog` calls a task_file bpf program to traverse all files in the system in order to associate processes with progs/maps, etc. When traversing files for a given task, rcu read_lock is taken to access all files in a file_struct. But it used get_file() to grab a file, which is not right. It is possible file->f_count is 0 and get_file() will unconditionally increase it. Later put_file() may cause all kind of issues with the above as one of sympotoms. The failure can be reproduced with the following steps in a few seconds: $ cat t.c #include #include #include #include #include #define N 10000 int fd[N]; int main() { int i; for (i = 0; i < N; i++) { fd[i] = open("./note.txt", 'r'); if (fd[i] < 0) { fprintf(stderr, "failed\n"); return -1; } } for (i = 0; i < N; i++) close(fd[i]); return 0; } $ gcc -O2 t.c $ cat run.sh #/bin/bash for i in {1..100} do while true; do ./a.out; done & done $ ./run.sh $ while true; do bpftool prog >& /dev/null; done This patch used get_file_rcu() which only grabs a file if the file->f_count is not zero. This is to ensure the file pointer is always valid. The above reproducer did not fail for more than 30 minutes. Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets") Suggested-by: Josef Bacik Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov Reviewed-by: Josef Bacik Link: https://lore.kernel.org/bpf/20200817174214.252601-1-yhs@fb.com --- kernel/bpf/task_iter.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index 232df29793e9..f21b5e1e4540 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -178,10 +178,11 @@ again: f = fcheck_files(curr_files, curr_fd); if (!f) continue; + if (!get_file_rcu(f)) + continue; /* set info->fd */ info->fd = curr_fd; - get_file(f); rcu_read_unlock(); return f; } -- cgit v1.2.3 From 3fb1a96a91120877488071a167d26d76be4be977 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Tue, 18 Aug 2020 09:44:56 -0700 Subject: libbpf: Fix build on ppc64le architecture MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On ppc64le we get the following warning: In file included from btf_dump.c:16:0: btf_dump.c: In function ‘btf_dump_emit_struct_def’: ../include/linux/kernel.h:20:17: error: comparison of distinct pointer types lacks a cast [-Werror] (void) (&_max1 == &_max2); \ ^ btf_dump.c:882:11: note: in expansion of macro ‘max’ m_sz = max(0LL, btf__resolve_size(d->btf, m->type)); ^~~ Fix by explicitly casting to __s64, which is a return type from btf__resolve_size(). Fixes: 702eddc77a90 ("libbpf: Handle GCC built-in types for Arm NEON") Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200818164456.1181661-1-andriin@fb.com --- tools/lib/bpf/btf_dump.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/lib/bpf/btf_dump.c b/tools/lib/bpf/btf_dump.c index fe39bd774697..57c00fa63932 100644 --- a/tools/lib/bpf/btf_dump.c +++ b/tools/lib/bpf/btf_dump.c @@ -879,7 +879,7 @@ static void btf_dump_emit_struct_def(struct btf_dump *d, btf_dump_printf(d, ": %d", m_sz); off = m_off + m_sz; } else { - m_sz = max(0LL, btf__resolve_size(d->btf, m->type)); + m_sz = max((__s64)0, btf__resolve_size(d->btf, m->type)); off = m_off + m_sz * 8; } btf_dump_printf(d, ";"); -- cgit v1.2.3 From 8b61fba503904acae24aeb2bd5569b4d6544d48f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alvin=20=C5=A0ipraga?= Date: Tue, 18 Aug 2020 10:51:34 +0200 Subject: macvlan: validate setting of multiple remote source MAC addresses MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Remote source MAC addresses can be set on a 'source mode' macvlan interface via the IFLA_MACVLAN_MACADDR_DATA attribute. This commit tightens the validation of these MAC addresses to match the validation already performed when setting or adding a single MAC address via the IFLA_MACVLAN_MACADDR attribute. iproute2 uses IFLA_MACVLAN_MACADDR_DATA for its 'macvlan macaddr set' command, and IFLA_MACVLAN_MACADDR for its 'macvlan macaddr add' command, which demonstrates the inconsistent behaviour that this commit addresses: # ip link add link eth0 name macvlan0 type macvlan mode source # ip link set link dev macvlan0 type macvlan macaddr add 01:00:00:00:00:00 RTNETLINK answers: Cannot assign requested address # ip link set link dev macvlan0 type macvlan macaddr set 01:00:00:00:00:00 # ip -d link show macvlan0 5: macvlan0@eth0: mtu 1500 ... link/ether 2e:ac:fd:2d:69:f8 brd ff:ff:ff:ff:ff:ff promiscuity 0 macvlan mode source remotes (1) 01:00:00:00:00:00 numtxqueues 1 ... With this change, the 'set' command will (rightly) fail in the same way as the 'add' command. Signed-off-by: Alvin Šipraga Signed-off-by: David S. Miller --- drivers/net/macvlan.c | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 4942f6112e51..5da04e997989 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -1269,6 +1269,9 @@ static void macvlan_port_destroy(struct net_device *dev) static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[], struct netlink_ext_ack *extack) { + struct nlattr *nla, *head; + int rem, len; + if (tb[IFLA_ADDRESS]) { if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) return -EINVAL; @@ -1316,6 +1319,20 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[], return -EADDRNOTAVAIL; } + if (data[IFLA_MACVLAN_MACADDR_DATA]) { + head = nla_data(data[IFLA_MACVLAN_MACADDR_DATA]); + len = nla_len(data[IFLA_MACVLAN_MACADDR_DATA]); + + nla_for_each_attr(nla, head, len, rem) { + if (nla_type(nla) != IFLA_MACVLAN_MACADDR || + nla_len(nla) != ETH_ALEN) + return -EINVAL; + + if (!is_valid_ether_addr(nla_data(nla))) + return -EADDRNOTAVAIL; + } + } + if (data[IFLA_MACVLAN_MACADDR_COUNT]) return -EINVAL; @@ -1372,10 +1389,6 @@ static int macvlan_changelink_sources(struct macvlan_dev *vlan, u32 mode, len = nla_len(data[IFLA_MACVLAN_MACADDR_DATA]); nla_for_each_attr(nla, head, len, rem) { - if (nla_type(nla) != IFLA_MACVLAN_MACADDR || - nla_len(nla) != ETH_ALEN) - continue; - addr = nla_data(nla); ret = macvlan_hash_add_source(vlan, addr); if (ret) -- cgit v1.2.3 From db06ea341fcd1752fbdb58454507faa140e3842f Mon Sep 17 00:00:00 2001 From: Edward Cree Date: Tue, 18 Aug 2020 13:43:30 +0100 Subject: sfc: really check hash is valid before using it Actually hook up the .rx_buf_hash_valid method in EF100's nic_type. Fixes: 068885434ccb ("sfc: check hash is valid before using it") Reported-by: Martin Habets Signed-off-by: Edward Cree Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/sfc/ef100_nic.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c index 206d70f9d95b..b8a7e9ed7913 100644 --- a/drivers/net/ethernet/sfc/ef100_nic.c +++ b/drivers/net/ethernet/sfc/ef100_nic.c @@ -739,6 +739,7 @@ const struct efx_nic_type ef100_pf_nic_type = { .rx_remove = efx_mcdi_rx_remove, .rx_write = ef100_rx_write, .rx_packet = __ef100_rx_packet, + .rx_buf_hash_valid = ef100_rx_buf_hash_valid, .fini_dmaq = efx_fini_dmaq, .max_rx_ip_filters = EFX_MCDI_FILTER_TBL_ROWS, .filter_table_probe = ef100_filter_table_up, @@ -820,6 +821,7 @@ const struct efx_nic_type ef100_vf_nic_type = { .rx_remove = efx_mcdi_rx_remove, .rx_write = ef100_rx_write, .rx_packet = __ef100_rx_packet, + .rx_buf_hash_valid = ef100_rx_buf_hash_valid, .fini_dmaq = efx_fini_dmaq, .max_rx_ip_filters = EFX_MCDI_FILTER_TBL_ROWS, .filter_table_probe = ef100_filter_table_up, -- cgit v1.2.3 From 9cbbc451098ec1e9942886023203b2247dec94bd Mon Sep 17 00:00:00 2001 From: Edward Cree Date: Tue, 18 Aug 2020 13:43:57 +0100 Subject: sfc: take correct lock in ef100_reset() When downing and upping the ef100 filter table, we need to take a write lock on efx->filter_sem, not just a read lock, because we may kfree() the table pointers. Without this, resets cause a WARN_ON from efx_rwsem_assert_write_locked(). Fixes: a9dc3d5612ce ("sfc_ef100: RX filter table management and related gubbins") Signed-off-by: Edward Cree Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/sfc/ef100_nic.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c index b8a7e9ed7913..19fe86b3b316 100644 --- a/drivers/net/ethernet/sfc/ef100_nic.c +++ b/drivers/net/ethernet/sfc/ef100_nic.c @@ -431,18 +431,18 @@ static int ef100_reset(struct efx_nic *efx, enum reset_type reset_type) /* A RESET_TYPE_ALL will cause filters to be removed, so we remove filters * and reprobe after reset to avoid removing filters twice */ - down_read(&efx->filter_sem); + down_write(&efx->filter_sem); ef100_filter_table_down(efx); - up_read(&efx->filter_sem); + up_write(&efx->filter_sem); rc = efx_mcdi_reset(efx, reset_type); if (rc) return rc; netif_device_attach(efx->net_dev); - down_read(&efx->filter_sem); + down_write(&efx->filter_sem); rc = ef100_filter_table_up(efx); - up_read(&efx->filter_sem); + up_write(&efx->filter_sem); if (rc) return rc; -- cgit v1.2.3 From 788f920a0f137baa4dbc1efdd5039c4a0a01b8d7 Mon Sep 17 00:00:00 2001 From: Edward Cree Date: Tue, 18 Aug 2020 13:44:18 +0100 Subject: sfc: null out channel->rps_flow_id after freeing it If an ef100_net_open() fails, ef100_net_stop() may be called without channel->rps_flow_id having been written; thus it may hold the address freed by a previous ef100_net_stop()'s call to efx_remove_filters(). This then causes a double-free when efx_remove_filters() is called again, leading to a panic. To prevent this, after freeing it, overwrite it with NULL. Fixes: a9dc3d5612ce ("sfc_ef100: RX filter table management and related gubbins") Signed-off-by: Edward Cree Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/sfc/rx_common.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/sfc/rx_common.c b/drivers/net/ethernet/sfc/rx_common.c index ef9bca92b0b7..5e29284c89c9 100644 --- a/drivers/net/ethernet/sfc/rx_common.c +++ b/drivers/net/ethernet/sfc/rx_common.c @@ -849,6 +849,7 @@ void efx_remove_filters(struct efx_nic *efx) efx_for_each_channel(channel, efx) { cancel_delayed_work_sync(&channel->filter_work); kfree(channel->rps_flow_id); + channel->rps_flow_id = NULL; } #endif down_write(&efx->filter_sem); -- cgit v1.2.3 From e6a43910d55d09dae65772ad571d4c61e459b17a Mon Sep 17 00:00:00 2001 From: Edward Cree Date: Tue, 18 Aug 2020 13:44:50 +0100 Subject: sfc: don't free_irq()s if they were never requested If efx_nic_init_interrupt fails, or was never run (e.g. due to an earlier failure in ef100_net_open), freeing irqs in efx_nic_fini_interrupt is not needed and will cause error messages and stack traces. So instead, only do this if efx_nic_init_interrupt successfully completed, as indicated by the new efx->irqs_hooked flag. Fixes: 965b549f3c20 ("sfc_ef100: implement ndo_open/close and EVQ probing") Signed-off-by: Edward Cree Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/sfc/net_driver.h | 2 ++ drivers/net/ethernet/sfc/nic.c | 4 ++++ 2 files changed, 6 insertions(+) diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h index dcb741d8bd11..062462a13847 100644 --- a/drivers/net/ethernet/sfc/net_driver.h +++ b/drivers/net/ethernet/sfc/net_driver.h @@ -846,6 +846,7 @@ struct efx_async_filter_insertion { * @timer_quantum_ns: Interrupt timer quantum, in nanoseconds * @timer_max_ns: Interrupt timer maximum value, in nanoseconds * @irq_rx_adaptive: Adaptive IRQ moderation enabled for RX event queues + * @irqs_hooked: Channel interrupts are hooked * @irq_rx_mod_step_us: Step size for IRQ moderation for RX event queues * @irq_rx_moderation_us: IRQ moderation time for RX event queues * @msg_enable: Log message enable flags @@ -1004,6 +1005,7 @@ struct efx_nic { unsigned int timer_quantum_ns; unsigned int timer_max_ns; bool irq_rx_adaptive; + bool irqs_hooked; unsigned int irq_mod_step_us; unsigned int irq_rx_moderation_us; u32 msg_enable; diff --git a/drivers/net/ethernet/sfc/nic.c b/drivers/net/ethernet/sfc/nic.c index d994d136bb03..d1e908846f5d 100644 --- a/drivers/net/ethernet/sfc/nic.c +++ b/drivers/net/ethernet/sfc/nic.c @@ -129,6 +129,7 @@ int efx_nic_init_interrupt(struct efx_nic *efx) #endif } + efx->irqs_hooked = true; return 0; fail2: @@ -154,6 +155,8 @@ void efx_nic_fini_interrupt(struct efx_nic *efx) efx->net_dev->rx_cpu_rmap = NULL; #endif + if (!efx->irqs_hooked) + return; if (EFX_INT_MODE_USE_MSI(efx)) { /* Disable MSI/MSI-X interrupts */ efx_for_each_channel(channel, efx) @@ -163,6 +166,7 @@ void efx_nic_fini_interrupt(struct efx_nic *efx) /* Disable legacy interrupt */ free_irq(efx->legacy_irq, efx); } + efx->irqs_hooked = false; } /* Register dump */ -- cgit v1.2.3 From 335956421c86f64fd46186d76d3961f6adcff187 Mon Sep 17 00:00:00 2001 From: Ganji Aravind Date: Tue, 18 Aug 2020 21:10:57 +0530 Subject: cxgb4: Fix work request size calculation for loopback test Work request used for sending loopback packet needs to add the firmware work request only once. So, fix by using correct structure size. Fixes: 7235ffae3d2c ("cxgb4: add loopback ethtool self-test") Signed-off-by: Ganji Aravind Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/chelsio/cxgb4/sge.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c index d2b587d1670a..7c9fe4bc235b 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/sge.c +++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c @@ -2553,8 +2553,8 @@ int cxgb4_selftest_lb_pkt(struct net_device *netdev) pkt_len = ETH_HLEN + sizeof(CXGB4_SELFTEST_LB_STR); - flits = DIV_ROUND_UP(pkt_len + sizeof(struct cpl_tx_pkt) + - sizeof(*wr), sizeof(__be64)); + flits = DIV_ROUND_UP(pkt_len + sizeof(*cpl) + sizeof(*wr), + sizeof(__be64)); ndesc = flits_to_desc(flits); lb = &pi->ethtool_lb; -- cgit v1.2.3 From c650e04898072e4b579cbf8d9dd5b86bcdbe9b00 Mon Sep 17 00:00:00 2001 From: Ganji Aravind Date: Tue, 18 Aug 2020 21:10:58 +0530 Subject: cxgb4: Fix race between loopback and normal Tx path Even after Tx queues are marked stopped, there exists a small window where the current packet in the normal Tx path is still being sent out and loopback selftest ends up corrupting the same Tx ring. So, ensure selftest takes the Tx lock to synchronize access the Tx ring. Fixes: 7235ffae3d2c ("cxgb4: add loopback ethtool self-test") Signed-off-by: Ganji Aravind Reviewed-by: Jesse Brandeburg Signed-off-by: David S. Miller --- drivers/net/ethernet/chelsio/cxgb4/sge.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c index 7c9fe4bc235b..869431a1eedd 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/sge.c +++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c @@ -2561,11 +2561,14 @@ int cxgb4_selftest_lb_pkt(struct net_device *netdev) lb->loopback = 1; q = &adap->sge.ethtxq[pi->first_qset]; + __netif_tx_lock(q->txq, smp_processor_id()); reclaim_completed_tx(adap, &q->q, -1, true); credits = txq_avail(&q->q) - ndesc; - if (unlikely(credits < 0)) + if (unlikely(credits < 0)) { + __netif_tx_unlock(q->txq); return -ENOMEM; + } wr = (void *)&q->q.desc[q->q.pidx]; memset(wr, 0, sizeof(struct tx_desc)); @@ -2598,6 +2601,7 @@ int cxgb4_selftest_lb_pkt(struct net_device *netdev) init_completion(&lb->completion); txq_advance(&q->q, ndesc); cxgb4_ring_tx_db(adap, &q->q, ndesc); + __netif_tx_unlock(q->txq); /* wait for the pkt to return */ ret = wait_for_completion_timeout(&lb->completion, 10 * HZ); -- cgit v1.2.3 From 989e4da042ca4a56bbaca9223d1a93639ad11e17 Mon Sep 17 00:00:00 2001 From: Sumera Priyadarsini Date: Wed, 19 Aug 2020 00:22:41 +0530 Subject: net: gianfar: Add of_node_put() before goto statement Every iteration of for_each_available_child_of_node() decrements reference count of the previous node, however when control is transferred from the middle of the loop, as in the case of a return or break or goto, there is no decrement thus ultimately resulting in a memory leak. Fix a potential memory leak in gianfar.c by inserting of_node_put() before the goto statement. Issue found with Coccinelle. Signed-off-by: Sumera Priyadarsini Signed-off-by: David S. Miller --- drivers/net/ethernet/freescale/gianfar.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c index b513b8c5c3b5..41dd3d0f3452 100644 --- a/drivers/net/ethernet/freescale/gianfar.c +++ b/drivers/net/ethernet/freescale/gianfar.c @@ -750,8 +750,10 @@ static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev) continue; err = gfar_parse_group(child, priv, model); - if (err) + if (err) { + of_node_put(child); goto err_grp_init; + } } } else { /* SQ_SG_MODE */ err = gfar_parse_group(np, priv, model); -- cgit v1.2.3 From eabe861881a733fc84f286f4d5a1ffaddd4f526f Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Sat, 15 Aug 2020 04:46:41 -0400 Subject: net: handle the return value of pskb_carve_frag_list() correctly pskb_carve_frag_list() may return -ENOMEM in pskb_carve_inside_nonlinear(). we should handle this correctly or we would get wrong sk_buff. Fixes: 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function") Signed-off-by: Miaohe Lin Signed-off-by: David S. Miller --- net/core/skbuff.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 5c3b906aeef3..e18184ffa9c3 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -5987,9 +5987,13 @@ static int pskb_carve_inside_nonlinear(struct sk_buff *skb, const u32 off, if (skb_has_frag_list(skb)) skb_clone_fraglist(skb); - if (k == 0) { - /* split line is in frag list */ - pskb_carve_frag_list(skb, shinfo, off - pos, gfp_mask); + /* split line is in frag list */ + if (k == 0 && pskb_carve_frag_list(skb, shinfo, off - pos, gfp_mask)) { + /* skb_frag_unref() is not needed here as shinfo->nr_frags = 0. */ + if (skb_has_frag_list(skb)) + kfree_skb_list(skb_shinfo(skb)->frag_list); + kfree(data); + return -ENOMEM; } skb_release_data(skb); -- cgit v1.2.3 From 0410d07190961ac526f05085765a8d04d926545b Mon Sep 17 00:00:00 2001 From: Jiri Wiesner Date: Sun, 16 Aug 2020 20:52:44 +0200 Subject: bonding: fix active-backup failover for current ARP slave When the ARP monitor is used for link detection, ARP replies are validated for all slaves (arp_validate=3) and fail_over_mac is set to active, two slaves of an active-backup bond may get stuck in a state where both of them are active and pass packets that they receive to the bond. This state makes IPv6 duplicate address detection fail. The state is reached thus: 1. The current active slave goes down because the ARP target is not reachable. 2. The current ARP slave is chosen and made active. 3. A new slave is enslaved. This new slave becomes the current active slave and can reach the ARP target. As a result, the current ARP slave stays active after the enslave action has finished and the log is littered with "PROBE BAD" messages: > bond0: PROBE: c_arp ens10 && cas ens11 BAD The workaround is to remove the slave with "going back" status from the bond and re-enslave it. This issue was encountered when DPDK PMD interfaces were being enslaved to an active-backup bond. I would be possible to fix the issue in bond_enslave() or bond_change_active_slave() but the ARP monitor was fixed instead to keep most of the actions changing the current ARP slave in the ARP monitor code. The current ARP slave is set as inactive and backup during the commit phase. A new state, BOND_LINK_FAIL, has been introduced for slaves in the context of the ARP monitor. This allows administrators to see how slaves are rotated for sending ARP requests and attempts are made to find a new active slave. Fixes: b2220cad583c9 ("bonding: refactor ARP active-backup monitor") Signed-off-by: Jiri Wiesner Signed-off-by: David S. Miller --- drivers/net/bonding/bond_main.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 415a37e44cae..c5d3032dd1a2 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2948,6 +2948,9 @@ static int bond_ab_arp_inspect(struct bonding *bond) if (bond_time_in_interval(bond, last_rx, 1)) { bond_propose_link_state(slave, BOND_LINK_UP); commit++; + } else if (slave->link == BOND_LINK_BACK) { + bond_propose_link_state(slave, BOND_LINK_FAIL); + commit++; } continue; } @@ -3056,6 +3059,19 @@ static void bond_ab_arp_commit(struct bonding *bond) continue; + case BOND_LINK_FAIL: + bond_set_slave_link_state(slave, BOND_LINK_FAIL, + BOND_SLAVE_NOTIFY_NOW); + bond_set_slave_inactive_flags(slave, + BOND_SLAVE_NOTIFY_NOW); + + /* A slave has just been enslaved and has become + * the current active slave. + */ + if (rtnl_dereference(bond->curr_active_slave)) + RCU_INIT_POINTER(bond->current_arp_slave, NULL); + continue; + default: slave_err(bond->dev, slave->dev, "impossible: link_new_state %d on slave\n", @@ -3106,8 +3122,6 @@ static bool bond_ab_arp_probe(struct bonding *bond) return should_notify_rtnl; } - bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER); - bond_for_each_slave_rcu(bond, slave, iter) { if (!found && !before && bond_slave_is_up(slave)) before = slave; -- cgit v1.2.3 From 4ef1a7cb08e94da1f2f2a34ee6cefe7ae142dc98 Mon Sep 17 00:00:00 2001 From: Xin Long Date: Mon, 17 Aug 2020 14:30:49 +0800 Subject: ipv6: some fixes for ipv6_dev_find() This patch is to do 3 things for ipv6_dev_find(): As David A. noticed, - rt6_lookup() is not really needed. Different from __ip_dev_find(), ipv6_dev_find() doesn't have a compatibility problem, so remove it. As Hideaki suggested, - "valid" (non-tentative) check for the address is also needed. ipv6_chk_addr() calls ipv6_chk_addr_and_flags(), which will traverse the address hash list, but it's heavy to be called inside ipv6_dev_find(). This patch is to reuse the code of ipv6_chk_addr_and_flags() for ipv6_dev_find(). - dev parameter is passed into ipv6_dev_find(), as link-local addresses from user space has sin6_scope_id set and the dev lookup needs it. Fixes: 81f6cb31222d ("ipv6: add ipv6_dev_find()") Suggested-by: YOSHIFUJI Hideaki Reported-by: David Ahern Signed-off-by: Xin Long Signed-off-by: David S. Miller --- include/net/addrconf.h | 3 ++- net/ipv6/addrconf.c | 60 +++++++++++++++++++------------------------------- net/tipc/udp_media.c | 8 +++---- 3 files changed, 28 insertions(+), 43 deletions(-) diff --git a/include/net/addrconf.h b/include/net/addrconf.h index ba3f6c15ad2b..18f783dcd55f 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -97,7 +97,8 @@ bool ipv6_chk_custom_prefix(const struct in6_addr *addr, int ipv6_chk_prefix(const struct in6_addr *addr, struct net_device *dev); -struct net_device *ipv6_dev_find(struct net *net, const struct in6_addr *addr); +struct net_device *ipv6_dev_find(struct net *net, const struct in6_addr *addr, + struct net_device *dev); struct inet6_ifaddr *ipv6_get_ifaddr(struct net *net, const struct in6_addr *addr, diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 8e761b8c47c6..01146b66d666 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -1893,12 +1893,13 @@ EXPORT_SYMBOL(ipv6_chk_addr); * 2. does the address exist on the specific device * (skip_dev_check = false) */ -int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr, - const struct net_device *dev, bool skip_dev_check, - int strict, u32 banned_flags) +static struct net_device * +__ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr, + const struct net_device *dev, bool skip_dev_check, + int strict, u32 banned_flags) { unsigned int hash = inet6_addr_hash(net, addr); - const struct net_device *l3mdev; + struct net_device *l3mdev, *ndev; struct inet6_ifaddr *ifp; u32 ifp_flags; @@ -1909,10 +1910,11 @@ int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr, dev = NULL; hlist_for_each_entry_rcu(ifp, &inet6_addr_lst[hash], addr_lst) { - if (!net_eq(dev_net(ifp->idev->dev), net)) + ndev = ifp->idev->dev; + if (!net_eq(dev_net(ndev), net)) continue; - if (l3mdev_master_dev_rcu(ifp->idev->dev) != l3mdev) + if (l3mdev_master_dev_rcu(ndev) != l3mdev) continue; /* Decouple optimistic from tentative for evaluation here. @@ -1923,15 +1925,23 @@ int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr, : ifp->flags; if (ipv6_addr_equal(&ifp->addr, addr) && !(ifp_flags&banned_flags) && - (!dev || ifp->idev->dev == dev || + (!dev || ndev == dev || !(ifp->scope&(IFA_LINK|IFA_HOST) || strict))) { rcu_read_unlock(); - return 1; + return ndev; } } rcu_read_unlock(); - return 0; + return NULL; +} + +int ipv6_chk_addr_and_flags(struct net *net, const struct in6_addr *addr, + const struct net_device *dev, bool skip_dev_check, + int strict, u32 banned_flags) +{ + return __ipv6_chk_addr_and_flags(net, addr, dev, skip_dev_check, + strict, banned_flags) ? 1 : 0; } EXPORT_SYMBOL(ipv6_chk_addr_and_flags); @@ -1990,35 +2000,11 @@ EXPORT_SYMBOL(ipv6_chk_prefix); * * The caller should be protected by RCU, or RTNL. */ -struct net_device *ipv6_dev_find(struct net *net, const struct in6_addr *addr) +struct net_device *ipv6_dev_find(struct net *net, const struct in6_addr *addr, + struct net_device *dev) { - unsigned int hash = inet6_addr_hash(net, addr); - struct inet6_ifaddr *ifp, *result = NULL; - struct net_device *dev = NULL; - - rcu_read_lock(); - hlist_for_each_entry_rcu(ifp, &inet6_addr_lst[hash], addr_lst) { - if (net_eq(dev_net(ifp->idev->dev), net) && - ipv6_addr_equal(&ifp->addr, addr)) { - result = ifp; - break; - } - } - - if (!result) { - struct rt6_info *rt; - - rt = rt6_lookup(net, addr, NULL, 0, NULL, 0); - if (rt) { - dev = rt->dst.dev; - ip6_rt_put(rt); - } - } else { - dev = result->idev->dev; - } - rcu_read_unlock(); - - return dev; + return __ipv6_chk_addr_and_flags(net, addr, dev, !dev, 1, + IFA_F_TENTATIVE); } EXPORT_SYMBOL(ipv6_dev_find); diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c index 53f0de0676b7..911d13cd2e67 100644 --- a/net/tipc/udp_media.c +++ b/net/tipc/udp_media.c @@ -660,6 +660,7 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b, struct udp_tunnel_sock_cfg tuncfg = {NULL}; struct nlattr *opts[TIPC_NLA_UDP_MAX + 1]; u8 node_id[NODE_ID_LEN] = {0,}; + struct net_device *dev; int rmcast = 0; ub = kzalloc(sizeof(*ub), GFP_ATOMIC); @@ -714,8 +715,6 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b, rcu_assign_pointer(ub->bearer, b); tipc_udp_media_addr_set(&b->addr, &local); if (local.proto == htons(ETH_P_IP)) { - struct net_device *dev; - dev = __ip_dev_find(net, local.ipv4.s_addr, false); if (!dev) { err = -ENODEV; @@ -738,9 +737,8 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b, b->mtu = b->media->mtu; #if IS_ENABLED(CONFIG_IPV6) } else if (local.proto == htons(ETH_P_IPV6)) { - struct net_device *dev; - - dev = ipv6_dev_find(net, &local.ipv6); + dev = ub->ifindex ? __dev_get_by_index(net, ub->ifindex) : NULL; + dev = ipv6_dev_find(net, &local.ipv6, dev); if (!dev) { err = -ENODEV; goto err; -- cgit v1.2.3 From 840110a4eae190dcbb9907d68216d5d1d9f25839 Mon Sep 17 00:00:00 2001 From: Maxim Mikityanskiy Date: Mon, 17 Aug 2020 16:34:05 +0300 Subject: ethtool: Fix preserving of wanted feature bits in netlink interface Currently, ethtool-netlink calculates new wanted bits as: (req_wanted & req_mask) | (old_active & ~req_mask) It completely discards the old wanted bits, so they are forgotten with the next ethtool command. Sample steps to reproduce: 1. ethtool -k eth0 tx-tcp-segmentation: on # TSO is on from the beginning 2. ethtool -K eth0 tx off tx-tcp-segmentation: off [not requested] 3. ethtool -k eth0 tx-tcp-segmentation: off [requested on] 4. ethtool -K eth0 rx off # Some change unrelated to TSO 5. ethtool -k eth0 tx-tcp-segmentation: off # "Wanted on" is forgotten This commit fixes it by changing the formula to: (req_wanted & req_mask) | (old_wanted & ~req_mask), where old_active was replaced by old_wanted to account for the wanted bits. The shortcut condition for the case where nothing was changed now compares wanted bitmasks, instead of wanted to active. Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request") Signed-off-by: Maxim Mikityanskiy Reviewed-by: Michal Kubecek Signed-off-by: David S. Miller --- net/ethtool/features.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/net/ethtool/features.c b/net/ethtool/features.c index 4e632dc987d8..ec196f0fddc9 100644 --- a/net/ethtool/features.c +++ b/net/ethtool/features.c @@ -224,7 +224,9 @@ int ethnl_set_features(struct sk_buff *skb, struct genl_info *info) DECLARE_BITMAP(wanted_diff_mask, NETDEV_FEATURE_COUNT); DECLARE_BITMAP(active_diff_mask, NETDEV_FEATURE_COUNT); DECLARE_BITMAP(old_active, NETDEV_FEATURE_COUNT); + DECLARE_BITMAP(old_wanted, NETDEV_FEATURE_COUNT); DECLARE_BITMAP(new_active, NETDEV_FEATURE_COUNT); + DECLARE_BITMAP(new_wanted, NETDEV_FEATURE_COUNT); DECLARE_BITMAP(req_wanted, NETDEV_FEATURE_COUNT); DECLARE_BITMAP(req_mask, NETDEV_FEATURE_COUNT); struct nlattr *tb[ETHTOOL_A_FEATURES_MAX + 1]; @@ -250,6 +252,7 @@ int ethnl_set_features(struct sk_buff *skb, struct genl_info *info) rtnl_lock(); ethnl_features_to_bitmap(old_active, dev->features); + ethnl_features_to_bitmap(old_wanted, dev->wanted_features); ret = ethnl_parse_bitset(req_wanted, req_mask, NETDEV_FEATURE_COUNT, tb[ETHTOOL_A_FEATURES_WANTED], netdev_features_strings, info->extack); @@ -261,11 +264,11 @@ int ethnl_set_features(struct sk_buff *skb, struct genl_info *info) goto out_rtnl; } - /* set req_wanted bits not in req_mask from old_active */ + /* set req_wanted bits not in req_mask from old_wanted */ bitmap_and(req_wanted, req_wanted, req_mask, NETDEV_FEATURE_COUNT); - bitmap_andnot(new_active, old_active, req_mask, NETDEV_FEATURE_COUNT); - bitmap_or(req_wanted, new_active, req_wanted, NETDEV_FEATURE_COUNT); - if (bitmap_equal(req_wanted, old_active, NETDEV_FEATURE_COUNT)) { + bitmap_andnot(new_wanted, old_wanted, req_mask, NETDEV_FEATURE_COUNT); + bitmap_or(req_wanted, new_wanted, req_wanted, NETDEV_FEATURE_COUNT); + if (bitmap_equal(req_wanted, old_wanted, NETDEV_FEATURE_COUNT)) { ret = 0; goto out_rtnl; } -- cgit v1.2.3 From 2847bfed888fbb8bf4c8e8067fd6127538c2c700 Mon Sep 17 00:00:00 2001 From: Maxim Mikityanskiy Date: Mon, 17 Aug 2020 16:34:06 +0300 Subject: ethtool: Account for hw_features in netlink interface ethtool-netlink ignores dev->hw_features and may confuse the drivers by asking them to enable features not in the hw_features bitmask. For example: 1. ethtool -k eth0 tls-hw-tx-offload: off [fixed] 2. ethtool -K eth0 tls-hw-tx-offload on tls-hw-tx-offload: on 3. ethtool -k eth0 tls-hw-tx-offload: on [fixed] Fitler out dev->hw_features from req_wanted to fix it and to resemble the legacy ethtool behavior. Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request") Signed-off-by: Maxim Mikityanskiy Reviewed-by: Michal Kubecek Signed-off-by: David S. Miller --- net/ethtool/features.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ethtool/features.c b/net/ethtool/features.c index ec196f0fddc9..6b288bfd7678 100644 --- a/net/ethtool/features.c +++ b/net/ethtool/features.c @@ -273,7 +273,8 @@ int ethnl_set_features(struct sk_buff *skb, struct genl_info *info) goto out_rtnl; } - dev->wanted_features = ethnl_bitmap_to_features(req_wanted); + dev->wanted_features &= ~dev->hw_features; + dev->wanted_features |= ethnl_bitmap_to_features(req_wanted) & dev->hw_features; __netdev_update_features(dev); ethnl_features_to_bitmap(new_active, dev->features); mod = !bitmap_equal(old_active, new_active, NETDEV_FEATURE_COUNT); -- cgit v1.2.3 From f01204ec8be7ea5e8f0230a7d4200e338d563bde Mon Sep 17 00:00:00 2001 From: Maxim Mikityanskiy Date: Mon, 17 Aug 2020 16:34:07 +0300 Subject: ethtool: Don't omit the netlink reply if no features were changed The legacy ethtool userspace tool shows an error when no features could be changed. It's useful to have a netlink reply to be able to show this error when __netdev_update_features wasn't called, for example: 1. ethtool -k eth0 large-receive-offload: off 2. ethtool -K eth0 rx-fcs on 3. ethtool -K eth0 lro on Could not change any device features rx-lro: off [requested on] 4. ethtool -K eth0 lro on # The output should be the same, but without this patch the kernel # doesn't send the reply, and ethtool is unable to detect the error. This commit makes ethtool-netlink always return a reply when requested, and it still avoids unnecessary calls to __netdev_update_features if the wanted features haven't changed. Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request") Signed-off-by: Maxim Mikityanskiy Reviewed-by: Michal Kubecek Signed-off-by: David S. Miller --- net/ethtool/features.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/net/ethtool/features.c b/net/ethtool/features.c index 6b288bfd7678..495635f152ba 100644 --- a/net/ethtool/features.c +++ b/net/ethtool/features.c @@ -268,14 +268,11 @@ int ethnl_set_features(struct sk_buff *skb, struct genl_info *info) bitmap_and(req_wanted, req_wanted, req_mask, NETDEV_FEATURE_COUNT); bitmap_andnot(new_wanted, old_wanted, req_mask, NETDEV_FEATURE_COUNT); bitmap_or(req_wanted, new_wanted, req_wanted, NETDEV_FEATURE_COUNT); - if (bitmap_equal(req_wanted, old_wanted, NETDEV_FEATURE_COUNT)) { - ret = 0; - goto out_rtnl; + if (!bitmap_equal(req_wanted, old_wanted, NETDEV_FEATURE_COUNT)) { + dev->wanted_features &= ~dev->hw_features; + dev->wanted_features |= ethnl_bitmap_to_features(req_wanted) & dev->hw_features; + __netdev_update_features(dev); } - - dev->wanted_features &= ~dev->hw_features; - dev->wanted_features |= ethnl_bitmap_to_features(req_wanted) & dev->hw_features; - __netdev_update_features(dev); ethnl_features_to_bitmap(new_active, dev->features); mod = !bitmap_equal(old_active, new_active, NETDEV_FEATURE_COUNT); -- cgit v1.2.3 From 17340552ce449ab55e3ccbfc2af1bcc600b4fbb5 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Mon, 17 Aug 2020 23:40:42 +0100 Subject: net: mscc: ocelot: remove duplicate "the the" phrase in Kconfig text The Kconfig help text contains the phrase "the the" in the help text. Fix this. Signed-off-by: Colin Ian King Signed-off-by: David S. Miller --- drivers/net/dsa/ocelot/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/dsa/ocelot/Kconfig b/drivers/net/dsa/ocelot/Kconfig index f121619d81fe..2d23ccef7d0e 100644 --- a/drivers/net/dsa/ocelot/Kconfig +++ b/drivers/net/dsa/ocelot/Kconfig @@ -9,7 +9,7 @@ config NET_DSA_MSCC_FELIX select NET_DSA_TAG_OCELOT select FSL_ENETC_MDIO help - This driver supports network switches from the the Vitesse / + This driver supports network switches from the Vitesse / Microsemi / Microchip Ocelot family of switching cores that are connected to their host CPU via Ethernet. The following switches are supported: -- cgit v1.2.3 From ad6641189c5935192a15eeb4b369dd04ebedfabb Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Mon, 17 Aug 2020 23:44:25 +0100 Subject: net: ipv4: remove duplicate "the the" phrase in Kconfig text The Kconfig help text contains the phrase "the the" in the help text. Fix this and reformat the block of help text. Signed-off-by: Colin Ian King Signed-off-by: David S. Miller --- net/ipv4/Kconfig | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig index 60db5a6487cc..87983e70f03f 100644 --- a/net/ipv4/Kconfig +++ b/net/ipv4/Kconfig @@ -661,13 +661,13 @@ config TCP_CONG_BBR BBR (Bottleneck Bandwidth and RTT) TCP congestion control aims to maximize network utilization and minimize queues. It builds an explicit - model of the the bottleneck delivery rate and path round-trip - propagation delay. It tolerates packet loss and delay unrelated to - congestion. It can operate over LAN, WAN, cellular, wifi, or cable - modem links. It can coexist with flows that use loss-based congestion - control, and can operate with shallow buffers, deep buffers, - bufferbloat, policers, or AQM schemes that do not provide a delay - signal. It requires the fq ("Fair Queue") pacing packet scheduler. + model of the bottleneck delivery rate and path round-trip propagation + delay. It tolerates packet loss and delay unrelated to congestion. It + can operate over LAN, WAN, cellular, wifi, or cable modem links. It can + coexist with flows that use loss-based congestion control, and can + operate with shallow buffers, deep buffers, bufferbloat, policers, or + AQM schemes that do not provide a delay signal. It requires the fq + ("Fair Queue") pacing packet scheduler. choice prompt "Default TCP congestion control" -- cgit v1.2.3 From e679654a704e5bd676ea6446fa7b764cbabf168a Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Tue, 18 Aug 2020 15:23:09 -0700 Subject: bpf: Fix a rcu_sched stall issue with bpf task/task_file iterator In our production system, we observed rcu stalls when 'bpftool prog` is running. rcu: INFO: rcu_sched self-detected stall on CPU rcu: \x097-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913 \x09(t=21031 jiffies g=2534773 q=179750) NMI backtrace for cpu 7 CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G W 5.8.0-00004-g68bfc7f8c1b4 #6 Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019 Call Trace: dump_stack+0x57/0x70 nmi_cpu_backtrace.cold+0x14/0x53 ? lapic_can_unplug_cpu.cold+0x39/0x39 nmi_trigger_cpumask_backtrace+0xb7/0xc7 rcu_dump_cpu_stacks+0xa2/0xd0 rcu_sched_clock_irq.cold+0x1ff/0x3d9 ? tick_nohz_handler+0x100/0x100 update_process_times+0x5b/0x90 tick_sched_timer+0x5e/0xf0 __hrtimer_run_queues+0x12a/0x2a0 hrtimer_interrupt+0x10e/0x280 __sysvec_apic_timer_interrupt+0x51/0xe0 asm_call_on_stack+0xf/0x20 sysvec_apic_timer_interrupt+0x6f/0x80 asm_sysvec_apic_timer_interrupt+0x12/0x20 RIP: 0010:task_file_seq_get_next+0x71/0x220 Code: 00 00 8b 53 1c 49 8b 7d 00 89 d6 48 8b 47 20 44 8b 18 41 39 d3 76 75 48 8b 4f 20 8b 01 39 d0 76 61 41 89 d1 49 39 c1 48 19 c0 <48> 8b 49 08 21 d0 48 8d 04 c1 4c 8b 08 4d 85 c9 74 46 49 8b 41 38 RSP: 0018:ffffc90006223e10 EFLAGS: 00000297 RAX: ffffffffffffffff RBX: ffff888f0d172388 RCX: ffff888c8c07c1c0 RDX: 00000000000f017b RSI: 00000000000f017b RDI: ffff888c254702c0 RBP: ffffc90006223e68 R08: ffff888be2a1c140 R09: 00000000000f017b R10: 0000000000000002 R11: 0000000000100000 R12: ffff888f23c24118 R13: ffffc90006223e60 R14: ffffffff828509a0 R15: 00000000ffffffff task_file_seq_next+0x52/0xa0 bpf_seq_read+0xb9/0x320 vfs_read+0x9d/0x180 ksys_read+0x5f/0xe0 do_syscall_64+0x38/0x60 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f8815f4f76e Code: c0 e9 f6 fe ff ff 55 48 8d 3d 76 70 0a 00 48 89 e5 e8 36 06 02 00 66 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 52 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 RSP: 002b:00007fff8f9df578 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 000000000170b9c0 RCX: 00007f8815f4f76e RDX: 0000000000001000 RSI: 00007fff8f9df5b0 RDI: 0000000000000007 RBP: 00007fff8f9e05f0 R08: 0000000000000049 R09: 0000000000000010 R10: 00007f881601fa40 R11: 0000000000000246 R12: 00007fff8f9e05a8 R13: 00007fff8f9e05a8 R14: 0000000001917f90 R15: 000000000000e22e Note that `bpftool prog` actually calls a task_file bpf iterator program to establish an association between prog/map/link/btf anon files and processes. In the case where the above rcu stall occured, we had a process having 1587 tasks and each task having roughly 81305 files. This implied 129 million bpf prog invocations. Unfortunwtely none of these files are prog/map/link/btf files so bpf iterator/prog needs to traverse all these files and not able to return to user space since there are no seq_file buffer overflow. This patch fixed the issue in bpf_seq_read() to limit the number of visited objects. If the maximum number of visited objects is reached, no more objects will be visited in the current syscall. If there is nothing written in the seq_file buffer, -EAGAIN will return to the user so user can try again. The maximum number of visited objects is set at 1 million. In our Intel Xeon D-2191 2.3GHZ 18-core server, bpf_seq_read() visiting 1 million files takes around 0.18 seconds. We did not use cond_resched() since for some iterators, e.g., netlink iterator, where rcu read_lock critical section spans between consecutive seq_ops->next(), which makes impossible to do cond_resched() in the key while loop of function bpf_seq_read(). Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov Cc: Paul E. McKenney Link: https://lore.kernel.org/bpf/20200818222309.2181348-1-yhs@fb.com --- kernel/bpf/bpf_iter.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c index b6715964b685..8faa2ce89396 100644 --- a/kernel/bpf/bpf_iter.c +++ b/kernel/bpf/bpf_iter.c @@ -67,6 +67,9 @@ static void bpf_iter_done_stop(struct seq_file *seq) iter_priv->done_stop = true; } +/* maximum visited objects before bailing out */ +#define MAX_ITER_OBJECTS 1000000 + /* bpf_seq_read, a customized and simpler version for bpf iterator. * no_llseek is assumed for this file. * The following are differences from seq_read(): @@ -79,7 +82,7 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size, { struct seq_file *seq = file->private_data; size_t n, offs, copied = 0; - int err = 0; + int err = 0, num_objs = 0; void *p; mutex_lock(&seq->lock); @@ -135,6 +138,7 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size, while (1) { loff_t pos = seq->index; + num_objs++; offs = seq->count; p = seq->op->next(seq, p, &seq->index); if (pos == seq->index) { @@ -153,6 +157,15 @@ static ssize_t bpf_seq_read(struct file *file, char __user *buf, size_t size, if (seq->count >= size) break; + if (num_objs >= MAX_ITER_OBJECTS) { + if (offs == 0) { + err = -EAGAIN; + seq->op->stop(seq, p); + goto done; + } + break; + } + err = seq->op->show(seq, p); if (err > 0) { bpf_iter_dec_seq_num(seq); -- cgit v1.2.3 From e60572b8d4c39572be6857d1ec91fdf979f8775f Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Tue, 18 Aug 2020 15:23:10 -0700 Subject: bpf: Avoid visit same object multiple times Currently when traversing all tasks, the next tid is always increased by one. This may result in visiting the same task multiple times in a pid namespace. This patch fixed the issue by seting the next tid as pid_nr_ns(pid, ns) + 1, similar to funciton next_tgid(). Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov Cc: Rik van Riel Link: https://lore.kernel.org/bpf/20200818222310.2181500-1-yhs@fb.com --- kernel/bpf/task_iter.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index f21b5e1e4540..99af4cea1102 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -29,8 +29,9 @@ static struct task_struct *task_seq_get_next(struct pid_namespace *ns, rcu_read_lock(); retry: - pid = idr_get_next(&ns->idr, tid); + pid = find_ge_pid(*tid, ns); if (pid) { + *tid = pid_nr_ns(pid, ns); task = get_pid_task(pid, PIDTYPE_PID); if (!task) { ++*tid; -- cgit v1.2.3 From 00fa1d83a8b50351c830521d00135e823c46e7d0 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Tue, 18 Aug 2020 15:23:12 -0700 Subject: bpftool: Handle EAGAIN error code properly in pids collection When the error code is EAGAIN, the kernel signals the user space should retry the read() operation for bpf iterators. Let us do it. Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20200818222312.2181675-1-yhs@fb.com --- tools/bpf/bpftool/pids.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/bpf/bpftool/pids.c b/tools/bpf/bpftool/pids.c index e3b116325403..df7d8ec76036 100644 --- a/tools/bpf/bpftool/pids.c +++ b/tools/bpf/bpftool/pids.c @@ -134,6 +134,8 @@ int build_obj_refs_table(struct obj_refs_table *table, enum bpf_obj_type type) while (true) { ret = read(fd, buf, sizeof(buf)); if (ret < 0) { + if (errno == EAGAIN) + continue; err = -errno; p_err("failed to read PID iterator output: %d", err); goto out; -- cgit v1.2.3 From 63d4a4c145cca2e84dc6e62d2ef5cb990c9723c2 Mon Sep 17 00:00:00 2001 From: Shay Agroskin Date: Wed, 19 Aug 2020 20:28:36 +0300 Subject: net: ena: Prevent reset after device destruction The reset work is scheduled by the timer routine whenever it detects that a device reset is required (e.g. when a keep_alive signal is missing). When releasing device resources in ena_destroy_device() the driver cancels the scheduling of the timer routine without destroying the reset work explicitly. This creates the following bug: The driver is suspended and the ena_suspend() function is called -> This function calls ena_destroy_device() to free the net device resources -> The driver waits for the timer routine to finish its execution and then cancels it, thus preventing from it to be called again. If, in its final execution, the timer routine schedules a reset, the reset routine might be called afterwards,and a redundant call to ena_restore_device() would be made. By changing the reset routine we allow it to read the device's state accurately. This is achieved by checking whether ENA_FLAG_TRIGGER_RESET flag is set before resetting the device and making both the destruction function and the flag check are under rtnl lock. The ENA_FLAG_TRIGGER_RESET is cleared at the end of the destruction routine. Also surround the flag check with 'likely' because we expect that the reset routine would be called only when ENA_FLAG_TRIGGER_RESET flag is set. The destruction of the timer and reset services in __ena_shutoff() have to stay, even though the timer routine is destroyed in ena_destroy_device(). This is to avoid a case in which the reset routine is scheduled after free_netdev() in __ena_shutoff(), which would create an access to freed memory in adapter->flags. Fixes: 8c5c7abdeb2d ("net: ena: add power management ops to the ENA driver") Signed-off-by: Shay Agroskin Signed-off-by: David S. Miller --- drivers/net/ethernet/amazon/ena/ena_netdev.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index 2a6c9725e092..44aeace196f0 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -3601,16 +3601,14 @@ static void ena_fw_reset_device(struct work_struct *work) { struct ena_adapter *adapter = container_of(work, struct ena_adapter, reset_task); - struct pci_dev *pdev = adapter->pdev; - if (unlikely(!test_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags))) { - dev_err(&pdev->dev, - "device reset schedule while reset bit is off\n"); - return; - } rtnl_lock(); - ena_destroy_device(adapter, false); - ena_restore_device(adapter); + + if (likely(test_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags))) { + ena_destroy_device(adapter, false); + ena_restore_device(adapter); + } + rtnl_unlock(); } @@ -4389,8 +4387,11 @@ static void __ena_shutoff(struct pci_dev *pdev, bool shutdown) netdev->rx_cpu_rmap = NULL; } #endif /* CONFIG_RFS_ACCEL */ - del_timer_sync(&adapter->timer_service); + /* Make sure timer and reset routine won't be called after + * freeing device resources. + */ + del_timer_sync(&adapter->timer_service); cancel_work_sync(&adapter->reset_task); rtnl_lock(); /* lock released inside the below if-else block */ -- cgit v1.2.3 From 8b147f6f3e7de4e51113e3e9ec44aa2debc02c58 Mon Sep 17 00:00:00 2001 From: Shay Agroskin Date: Wed, 19 Aug 2020 20:28:37 +0300 Subject: net: ena: Change WARN_ON expression in ena_del_napi_in_range() The ena_del_napi_in_range() function unregisters the napi handler for rings in a given range. This function had the following WARN_ON macro: WARN_ON(ENA_IS_XDP_INDEX(adapter, i) && adapter->ena_napi[i].xdp_ring); This macro prints the call stack if the expression inside of it is true [1], but the expression inside of it is the wanted situation. The expression checks whether the ring has an XDP queue and its index corresponds to a XDP one. This patch changes the expression to !ENA_IS_XDP_INDEX(adapter, i) && adapter->ena_napi[i].xdp_ring which indicates an unwanted situation. Also, change the structure of the function. The napi handler is unregistered for all rings, and so there's no need to check whether the index is an XDP index or not. By removing this check the code becomes much more readable. Fixes: 548c4940b9f1 ("net: ena: Implement XDP_TX action") Signed-off-by: Shay Agroskin Signed-off-by: David S. Miller --- drivers/net/ethernet/amazon/ena/ena_netdev.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index 44aeace196f0..233db15c970d 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -2180,13 +2180,10 @@ static void ena_del_napi_in_range(struct ena_adapter *adapter, int i; for (i = first_index; i < first_index + count; i++) { - /* Check if napi was initialized before */ - if (!ENA_IS_XDP_INDEX(adapter, i) || - adapter->ena_napi[i].xdp_ring) - netif_napi_del(&adapter->ena_napi[i].napi); - else - WARN_ON(ENA_IS_XDP_INDEX(adapter, i) && - adapter->ena_napi[i].xdp_ring); + netif_napi_del(&adapter->ena_napi[i].napi); + + WARN_ON(!ENA_IS_XDP_INDEX(adapter, i) && + adapter->ena_napi[i].xdp_ring); } } -- cgit v1.2.3 From ccd143e5150f24b9ba15145c7221b61dd9e41021 Mon Sep 17 00:00:00 2001 From: Shay Agroskin Date: Wed, 19 Aug 2020 20:28:38 +0300 Subject: net: ena: Make missed_tx stat incremental Most statistics in ena driver are incremented, meaning that a stat's value is a sum of all increases done to it since driver/queue initialization. This patch makes all statistics this way, effectively making missed_tx statistic incremental. Also added a comment regarding rx_drops and tx_drops to make it clearer how these counters are calculated. Fixes: 11095fdb712b ("net: ena: add statistics for missed tx packets") Signed-off-by: Shay Agroskin Signed-off-by: David S. Miller --- drivers/net/ethernet/amazon/ena/ena_netdev.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index 233db15c970d..a3a8edf9a734 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -3687,7 +3687,7 @@ static int check_missing_comp_in_tx_queue(struct ena_adapter *adapter, } u64_stats_update_begin(&tx_ring->syncp); - tx_ring->tx_stats.missed_tx = missed_tx; + tx_ring->tx_stats.missed_tx += missed_tx; u64_stats_update_end(&tx_ring->syncp); return rc; @@ -4556,6 +4556,9 @@ static void ena_keep_alive_wd(void *adapter_data, tx_drops = ((u64)desc->tx_drops_high << 32) | desc->tx_drops_low; u64_stats_update_begin(&adapter->syncp); + /* These stats are accumulated by the device, so the counters indicate + * all drops since last reset. + */ adapter->dev_stats.rx_drops = rx_drops; adapter->dev_stats.tx_drops = tx_drops; u64_stats_update_end(&adapter->syncp); -- cgit v1.2.3 From d1fb55592909ea249af70170c7a52e637009564d Mon Sep 17 00:00:00 2001 From: Johannes Berg Date: Wed, 19 Aug 2020 21:52:38 +0200 Subject: netlink: fix state reallocation in policy export Evidently, when I did this previously, we didn't have more than 10 policies and didn't run into the reallocation path, because it's missing a memset() for the unused policies. Fix that. Fixes: d07dcf9aadd6 ("netlink: add infrastructure to expose policies to userspace") Signed-off-by: Johannes Berg Signed-off-by: David S. Miller --- net/netlink/policy.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/netlink/policy.c b/net/netlink/policy.c index f6491853c797..2b3e26f7496f 100644 --- a/net/netlink/policy.c +++ b/net/netlink/policy.c @@ -51,6 +51,9 @@ static int add_policy(struct nl_policy_dump **statep, if (!state) return -ENOMEM; + memset(&state->policies[state->n_alloc], 0, + flex_array_size(state, policies, n_alloc - state->n_alloc)); + state->policies[state->n_alloc].policy = policy; state->policies[state->n_alloc].maxtype = maxtype; state->n_alloc = n_alloc; -- cgit v1.2.3 From 957ff4278e0db34f56c2bc121fdd6393e4523ef2 Mon Sep 17 00:00:00 2001 From: Min Li Date: Tue, 18 Aug 2020 10:41:22 -0400 Subject: ptp: ptp_clockmatrix: use i2c_master_send for i2c write The old code for i2c write would break on some controllers, which fails at handling Repeated Start Condition. So we will just use i2c_master_send to handle write in one transanction. Changes since v1: - Remove indentation change Signed-off-by: Min Li Signed-off-by: David S. Miller --- drivers/ptp/ptp_clockmatrix.c | 56 +++++++++++++++++++++++++++++++++---------- drivers/ptp/ptp_clockmatrix.h | 2 ++ 2 files changed, 45 insertions(+), 13 deletions(-) diff --git a/drivers/ptp/ptp_clockmatrix.c b/drivers/ptp/ptp_clockmatrix.c index 73aaae5574ed..e020faff7da5 100644 --- a/drivers/ptp/ptp_clockmatrix.c +++ b/drivers/ptp/ptp_clockmatrix.c @@ -142,16 +142,15 @@ static int idtcm_strverscmp(const char *ver1, const char *ver2) return result; } -static int idtcm_xfer(struct idtcm *idtcm, - u8 regaddr, - u8 *buf, - u16 count, - bool write) +static int idtcm_xfer_read(struct idtcm *idtcm, + u8 regaddr, + u8 *buf, + u16 count) { struct i2c_client *client = idtcm->client; struct i2c_msg msg[2]; int cnt; - char *fmt = "i2c_transfer failed at %d in %s for %s, at addr: %04X!\n"; + char *fmt = "i2c_transfer failed at %d in %s, at addr: %04X!\n"; msg[0].addr = client->addr; msg[0].flags = 0; @@ -159,7 +158,7 @@ static int idtcm_xfer(struct idtcm *idtcm, msg[0].buf = ®addr; msg[1].addr = client->addr; - msg[1].flags = write ? 0 : I2C_M_RD; + msg[1].flags = I2C_M_RD; msg[1].len = count; msg[1].buf = buf; @@ -170,7 +169,6 @@ static int idtcm_xfer(struct idtcm *idtcm, fmt, __LINE__, __func__, - write ? "write" : "read", regaddr); return cnt; } else if (cnt != 2) { @@ -182,6 +180,37 @@ static int idtcm_xfer(struct idtcm *idtcm, return 0; } +static int idtcm_xfer_write(struct idtcm *idtcm, + u8 regaddr, + u8 *buf, + u16 count) +{ + struct i2c_client *client = idtcm->client; + /* we add 1 byte for device register */ + u8 msg[IDTCM_MAX_WRITE_COUNT + 1]; + int cnt; + char *fmt = "i2c_master_send failed at %d in %s, at addr: %04X!\n"; + + if (count > IDTCM_MAX_WRITE_COUNT) + return -EINVAL; + + msg[0] = regaddr; + memcpy(&msg[1], buf, count); + + cnt = i2c_master_send(client, msg, count + 1); + + if (cnt < 0) { + dev_err(&client->dev, + fmt, + __LINE__, + __func__, + regaddr); + return cnt; + } + + return 0; +} + static int idtcm_page_offset(struct idtcm *idtcm, u8 val) { u8 buf[4]; @@ -195,7 +224,7 @@ static int idtcm_page_offset(struct idtcm *idtcm, u8 val) buf[2] = 0x10; buf[3] = 0x20; - err = idtcm_xfer(idtcm, PAGE_ADDR, buf, sizeof(buf), 1); + err = idtcm_xfer_write(idtcm, PAGE_ADDR, buf, sizeof(buf)); if (err) { idtcm->page_offset = 0xff; @@ -223,11 +252,12 @@ static int _idtcm_rdwr(struct idtcm *idtcm, err = idtcm_page_offset(idtcm, hi); if (err) - goto out; + return err; - err = idtcm_xfer(idtcm, lo, buf, count, write); -out: - return err; + if (write) + return idtcm_xfer_write(idtcm, lo, buf, count); + + return idtcm_xfer_read(idtcm, lo, buf, count); } static int idtcm_read(struct idtcm *idtcm, diff --git a/drivers/ptp/ptp_clockmatrix.h b/drivers/ptp/ptp_clockmatrix.h index ffae56c5d97f..82840d72364a 100644 --- a/drivers/ptp/ptp_clockmatrix.h +++ b/drivers/ptp/ptp_clockmatrix.h @@ -55,6 +55,8 @@ #define PEROUT_ENABLE_OUTPUT_MASK (0xdeadbeef) +#define IDTCM_MAX_WRITE_COUNT (512) + /* Values of DPLL_N.DPLL_MODE.PLL_MODE */ enum pll_mode { PLL_MODE_MIN = 0, -- cgit v1.2.3 From 9553b62c1dd27df67ab2f52ec8a3bc3501887619 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Tue, 18 Aug 2020 18:14:39 +0200 Subject: net: atlantic: Use readx_poll_timeout() for large timeout Commit 8dcf2ad39fdb2 ("net: atlantic: add hwmon getter for MAC temperature") implemented a read callback with an udelay(10000U). This fails to compile on ARM because the delay is >1ms. I doubt that it is needed to spin for 10ms even if possible on x86. >From looking at the code, the context appears to be preemptible so using usleep() should work and avoid busy spinning. Use readx_poll_timeout() in the poll loop. Fixes: 8dcf2ad39fdb2 ("net: atlantic: add hwmon getter for MAC temperature") Cc: Mark Starovoytov Cc: Igor Russkikh Signed-off-by: Sebastian Andrzej Siewior Acked-by: Guenter Roeck Signed-off-by: David S. Miller --- drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c index 16a944707ba9..8941ac4df9e3 100644 --- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c +++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c @@ -1631,8 +1631,8 @@ static int hw_atl_b0_get_mac_temp(struct aq_hw_s *self, u32 *temp) hw_atl_ts_reset_set(self, 0); } - err = readx_poll_timeout_atomic(hw_atl_b0_ts_ready_and_latch_high_get, - self, val, val == 1, 10000U, 500000U); + err = readx_poll_timeout(hw_atl_b0_ts_ready_and_latch_high_get, self, + val, val == 1, 10000U, 500000U); if (err) return err; -- cgit v1.2.3 From cf96d977381d4a23957bade2ddf1c420b74a26b6 Mon Sep 17 00:00:00 2001 From: Wang Hai Date: Wed, 19 Aug 2020 10:33:09 +0800 Subject: net: gemini: Fix missing free_netdev() in error path of gemini_ethernet_port_probe() Replace alloc_etherdev_mq with devm_alloc_etherdev_mqs. In this way, when probe fails, netdev can be freed automatically.