summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2017-11-10audit: use audit_set_enabled() in audit_enable()Paul Moore
Use audit_set_enabled() to enable auditing during early boot. This obviously won't emit an audit change record, but it will work anyway and should help prevent in future problems by consolidating the enable/disable code in one function. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10audit: convert audit_ever_enabled to a booleanPaul Moore
We were treating it as a boolean, let's make it a boolean to help avoid future mistakes. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10audit: don't use simple_strtol() anymorePaul Moore
The simple_strtol() function is deprecated, use kstrtol() instead. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10audit: initialize the audit subsystem as early as possiblePaul Moore
We can't initialize the audit subsystem until after the network layer is initialized (core_initcall), but do it soon after. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-11-10audit: ensure that 'audit=1' actually enables audit for PID 1Paul Moore
Prior to this patch we enabled audit in audit_init(), which is too late for PID 1 as the standard initcalls are run after the PID 1 task is forked. This means that we never allocate an audit_context (see audit_alloc()) for PID 1 and therefore miss a lot of audit events generated by PID 1. This patch enables audit as early as possible to help ensure that when PID 1 is forked it can allocate an audit_context if required. Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-09-05audit: update the function commentsGeliang Tang
Update the function comments to match the code. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-09-05audit: Reduce overhead using a coarse clockMel Gorman
Commit 2115bb250f26 ("audit: Use timespec64 to represent audit timestamps") noted that audit timestamps were not y2038 safe and used a 64-bit timestamp. In itself, this makes sense but the conversion was from CURRENT_TIME to ktime_get_real_ts64() which is a heavier call to record an accurate timestamp which is required in some, but not all, cases. The impact is that when auditd is running without any rules that all syscalls have higher overhead. This is visible in the sysbench-thread benchmark as a 11.5% performance hit. That benchmark is dumb as rocks but it's also visible in redis as an 8-10% hit on all operations which is of greater concern. It is somewhat stupid of audit to track syscalls without any rules related to syscalls but that is how it behaves. The overhead can be directly measured with perf comparing 4.9 with 4.12 4.9 7.76% sysbench [kernel.vmlinux] [k] __schedule 7.62% sysbench [kernel.vmlinux] [k] _raw_spin_lock 7.37% sysbench libpthread-2.22.so [.] __lll_lock_elision 7.29% sysbench [kernel.vmlinux] [.] syscall_return_via_sysret 6.59% sysbench [kernel.vmlinux] [k] native_sched_clock 5.21% sysbench libc-2.22.so [.] __sched_yield 4.38% sysbench [kernel.vmlinux] [k] entry_SYSCALL_64 4.28% sysbench [kernel.vmlinux] [k] do_syscall_64 3.49% sysbench libpthread-2.22.so [.] __lll_unlock_elision 3.13% sysbench [kernel.vmlinux] [k] __audit_syscall_exit 2.87% sysbench [kernel.vmlinux] [k] update_curr 2.73% sysbench [kernel.vmlinux] [k] pick_next_task_fair 2.31% sysbench [kernel.vmlinux] [k] syscall_trace_enter 2.20% sysbench [kernel.vmlinux] [k] __audit_syscall_entry ..... 0.00% swapper [kernel.vmlinux] [k] read_tsc 4.12 7.84% sysbench [kernel.vmlinux] [k] __schedule 7.05% sysbench [kernel.vmlinux] [k] _raw_spin_lock 6.57% sysbench libpthread-2.22.so [.] __lll_lock_elision 6.50% sysbench [kernel.vmlinux] [.] syscall_return_via_sysret 5.95% sysbench [kernel.vmlinux] [k] read_tsc 5.71% sysbench [kernel.vmlinux] [k] native_sched_clock 4.78% sysbench libc-2.22.so [.] __sched_yield 4.30% sysbench [kernel.vmlinux] [k] entry_SYSCALL_64 3.94% sysbench [kernel.vmlinux] [k] do_syscall_64 3.37% sysbench libpthread-2.22.so [.] __lll_unlock_elision 3.32% sysbench [kernel.vmlinux] [k] __audit_syscall_exit 2.91% sysbench [kernel.vmlinux] [k] __getnstimeofday64 Note the additional overhead from read_tsc which goes from 0% to 5.95%. This is on a single-socket E3-1230 but similar overheads have been measured on an older machine which the patch also eliminates. The patch in question has no explanation as to why a fully-accurate timestamp is required and is likely an oversight. Using a coarser, but monotically increasing, timestamp the overhead can be eliminated. While it can be worked around by configuring or disabling audit, it's tricky enough to detect that a kernel fix is justified. With this patch, we see the following; sysbenchthread 4.9.0 4.12.0 4.12.0 vanilla vanilla coarse-v1r1 Amean 1 1.49 ( 0.00%) 1.66 ( -11.42%) 1.51 ( -1.34%) Amean 3 1.48 ( 0.00%) 1.65 ( -11.45%) 1.50 ( -0.96%) Amean 5 1.49 ( 0.00%) 1.67 ( -12.31%) 1.51 ( -1.83%) Amean 7 1.49 ( 0.00%) 1.66 ( -11.72%) 1.50 ( -0.67%) Amean 12 1.48 ( 0.00%) 1.65 ( -11.57%) 1.52 ( -2.89%) Amean 16 1.49 ( 0.00%) 1.65 ( -11.13%) 1.51 ( -1.73%) The benchmark is reporting the time required for different thread counts to lock/unlock a private mutex which, while dense, demonstrates the syscall overhead. This is showing that 4.12 took a 11-12% hit but the overhead is almost eliminated by the patch. While the variance is not reported here, it's well within the noise with the patch applied. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-09-03Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Thomas Gleixner: "A single fix for a thinko in the raw timekeeper update which causes clock MONOTONIC_RAW to run with erratically increased frequency" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Fix ktime_get_raw() incorrect base accumulation
2017-09-03Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: - Prevent a potential inconistency in the perf user space access which might lead to evading sanity checks. - Prevent perf recording function trace entries twice * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/ftrace: Fix double traces of perf on ftrace:function perf/core: Fix potential double-fetch bug
2017-09-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix handling of pinned BPF map nodes in hash of maps, from Daniel Borkmann. 2) IPSEC ESP error paths leak memory, from Steffen Klassert. 3) We need an RCU grace period before freeing fib6_node objects, from Wei Wang. 4) Must check skb_put_padto() return value in HSR driver, from FLorian Fainelli. 5) Fix oops on PHY probe failure in ftgmac100 driver, from Andrew Jeffery. 6) Fix infinite loop in UDP queue when using SO_PEEK_OFF, from Eric Dumazet. 7) Use after free when tcf_chain_destroy() called multiple times, from Jiri Pirko. 8) Fix KSZ DSA tag layer multiple free of SKBS, from Florian Fainelli. 9) Fix leak of uninitialized memory in sctp_get_sctp_info(), inet_diag_msg_sctpladdrs_fill() and inet_diag_msg_sctpaddrs_fill(). From Stefano Brivio. 10) L2TP tunnel refcount fixes from Guillaume Nault. 11) Don't leak UDP secpath in udp_set_dev_scratch(), from Yossi Kauperman. 12) Revert a PHY layer change wrt. handling of PHY_HALTED state in phy_stop_machine(), it causes regressions for multiple people. From Florian Fainelli. 13) When packets are sent out of br0 we have to clear the offload_fwdq_mark value. 14) Several NULL pointer deref fixes in packet schedulers when their ->init() routine fails. From Nikolay Aleksandrov. 15) Aquantium devices cannot checksum offload correctly when the packet is <= 60 bytes. From Pavel Belous. 16) Fix vnet header access past end of buffer in AF_PACKET, from Benjamin Poirier. 17) Double free in probe error paths of nfp driver, from Dan Carpenter. 18) QOS capability not checked properly in DCB init paths of mlx5 driver, from Huy Nguyen. 19) Fix conflicts between firmware load failure and health_care timer in mlx5, also from Huy Nguyen. 20) Fix dangling page pointer when DMA mapping errors occur in mlx5, from Eran Ben ELisha. 21) ->ndo_setup_tc() in bnxt_en driver doesn't count rings properly, from Michael Chan. 22) Missing MSIX vector free in bnxt_en, also from Michael Chan. 23) Refcount leak in xfrm layer when using sk_policy, from Lorenzo Colitti. 24) Fix copy of uninitialized data in qlge driver, from Arnd Bergmann. 25) bpf_setsockopts() erroneously always returns -EINVAL even on success. Fix from Yuchung Cheng. 26) tipc_rcv() needs to linearize the SKB before parsing the inner headers, from Parthasarathy Bhuvaragan. 27) Fix deadlock between link status updates and link removal in netvsc driver, from Stephen Hemminger. 28) Missed locking of page fragment handling in ESP output, from Steffen Klassert. 29) Fix refcnt leak in ebpf congestion control code, from Sabrina Dubroca. 30) sxgbe_probe_config_dt() doesn't check devm_kzalloc()'s return value, from Christophe Jaillet. 31) Fix missing ipv6 rx_dst_cookie update when rx_dst is updated during early demux, from Paolo Abeni. 32) Several info leaks in xfrm_user layer, from Mathias Krause. 33) Fix out of bounds read in cxgb4 driver, from Stefano Brivio. 34) Properly propagate obsolete state of route upwards in ipv6 so that upper holders like xfrm can see it. From Xin Long. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (118 commits) udp: fix secpath leak bridge: switchdev: Clear forward mark when transmitting packet mlxsw: spectrum: Forbid linking to devices that have uppers wl1251: add a missing spin_lock_init() Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()" net: dsa: bcm_sf2: Fix number of CFP entries for BCM7278 kcm: do not attach PF_KCM sockets to avoid deadlock sch_tbf: fix two null pointer dereferences on init failure sch_sfq: fix null pointer dereference on init failure sch_netem: avoid null pointer deref on init failure sch_fq_codel: avoid double free on init failure sch_cbq: fix null pointer dereferences on init failure sch_hfsc: fix null pointer deref and double free on init failure sch_hhf: fix null pointer dereference on init failure sch_multiq: fix double free on init failure sch_htb: fix crash on init failure net/mlx5e: Fix CQ moderation mode not set properly net/mlx5e: Fix inline header size for small packets net/mlx5: E-Switch, Unload the representors in the correct order net/mlx5e: Properly resolve TC offloaded ipv6 vxlan tunnel source address ...
2017-08-31mm, uprobes: fix multiple free of ->uprobes_state.xol_areaEric Biggers
Commit 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") made it possible to kill a forking task while it is waiting to acquire its ->mmap_sem for write, in dup_mmap(). However, it was overlooked that this introduced an new error path before the new mm_struct's ->uprobes_state.xol_area has been set to NULL after being copied from the old mm_struct by the memcpy in dup_mm(). For a task that has previously hit a uprobe tracepoint, this resulted in the 'struct xol_area' being freed multiple times if the task was killed at just the right time while forking. Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather than in uprobe_dup_mmap(). With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C program given by commit 2b7e8665b4ff ("fork: fix incorrect fput of ->exe_file causing use-after-free"), provided that a uprobe tracepoint has been set on the fork_thread() function. For example: $ gcc reproducer.c -o reproducer -lpthread $ nm reproducer | grep fork_thread 0000000000400719 t fork_thread $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable $ ./reproducer Here is the use-after-free reported by KASAN: BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200 Read of size 8 at addr ffff8800320a8b88 by task reproducer/198 CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 #255 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 Call Trace: dump_stack+0xdb/0x185 print_address_description+0x7e/0x290 kasan_report+0x23b/0x350 __asan_report_load8_noabort+0x19/0x20 uprobe_clear_state+0x1c4/0x200 mmput+0xd6/0x360 do_exit+0x740/0x1670 do_group_exit+0x13f/0x380 get_signal+0x597/0x17d0 do_signal+0x99/0x1df0 exit_to_usermode_loop+0x166/0x1e0 syscall_return_slowpath+0x258/0x2c0 entry_SYSCALL_64_fastpath+0xbc/0xbe ... Allocated by task 199: save_stack_trace+0x1b/0x20 kasan_kmalloc+0xfc/0x180 kmem_cache_alloc_trace+0xf3/0x330 __create_xol_area+0x10f/0x780 uprobe_notify_resume+0x1674/0x2210 exit_to_usermode_loop+0x150/0x1e0 prepare_exit_to_usermode+0x14b/0x180 retint_user+0x8/0x20 Freed by task 199: save_stack_trace+0x1b/0x20 kasan_slab_free+0xa8/0x1a0 kfree+0xba/0x210 uprobe_clear_state+0x151/0x200 mmput+0xd6/0x360 copy_process.part.8+0x605f/0x65d0 _do_fork+0x1a5/0xbd0 SyS_clone+0x19/0x20 do_syscall_64+0x22f/0x660 return_from_SYSCALL_64+0x0/0x7a Note: without KASAN, you may instead see a "Bad page state" message, or simply a general protection fault. Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com Fixes: 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") Signed-off-by: Eric Biggers <ebiggers@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31kernel/kthread.c: kthread_worker: don't hog the cpuShaohua Li
If the worker thread continues getting work, it will hog the cpu and rcu stall complains. Make it a good citizen. This is triggered in a loop block device test. Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-29Merge branch 'for-4.13-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "A late but obvious fix for cgroup. I broke the 'cpuset.memory_pressure' file a long time ago (v4.4) by accidentally deleting its file index, which made it a duplicate of the 'cpuset.memory_migrate' file. Spotted and fixed by Waiman" * 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: Fix incorrect memory_pressure control file mapping
2017-08-29perf/ftrace: Fix double traces of perf on ftrace:functionZhou Chengming
When running perf on the ftrace:function tracepoint, there is a bug which can be reproduced by: perf record -e ftrace:function -a sleep 20 & perf record -e ftrace:function ls perf script ls 10304 [005] 171.853235: ftrace:function: perf_output_begin ls 10304 [005] 171.853237: ftrace:function: perf_output_begin ls 10304 [005] 171.853239: ftrace:function: task_tgid_nr_ns ls 10304 [005] 171.853240: ftrace:function: task_tgid_nr_ns ls 10304 [005] 171.853242: ftrace:function: __task_pid_nr_ns ls 10304 [005] 171.853244: ftrace:function: __task_pid_nr_ns We can see that all the function traces are doubled. The problem is caused by the inconsistency of the register function perf_ftrace_event_register() with the probe function perf_ftrace_function_call(). The former registers one probe for every perf_event. And the latter handles all perf_events on the current cpu. So when two perf_events on the current cpu, the traces of them will be doubled. So this patch adds an extra parameter "event" for perf_tp_event, only send sample data to this event when it's not NULL. Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Reviewed-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: alexander.shishkin@linux.intel.com Cc: huawei.libin@huawei.com Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-29perf/core: Fix potential double-fetch bugMeng Xu
While examining the kernel source code, I found a dangerous operation that could turn into a double-fetch situation (a race condition bug) where the same userspace memory region are fetched twice into kernel with sanity checks after the first fetch while missing checks after the second fetch. 1. The first fetch happens in line 9573 get_user(size, &uattr->size). 2. Subsequently the 'size' variable undergoes a few sanity checks and transformations (line 9577 to 9584). 3. The second fetch happens in line 9610 copy_from_user(attr, uattr, size) 4. Given that 'uattr' can be fully controlled in userspace, an attacker can race condition to override 'uattr->size' to arbitrary value (say, 0xFFFFFFFF) after the first fetch but before the second fetch. The changed value will be copied to 'attr->size'. 5. There is no further checks on 'attr->size' until the end of this function, and once the function returns, we lose the context to verify that 'attr->size' conforms to the sanity checks performed in step 2 (line 9577 to 9584). 6. My manual analysis shows that 'attr->size' is not used elsewhere later, so, there is no working exploit against it right now. However, this could easily turns to an exploitable one if careless developers start to use 'attr->size' later. To fix this, override 'attr->size' from the second fetch to the one from the first fetch, regardless of what is actually copied in. In this way, it is assured that 'attr->size' is consistent with the checks performed after the first fetch. Signed-off-by: Meng Xu <mengxu.gatech@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: alexander.shishkin@linux.intel.com Cc: meng.xu@gatech.edu Cc: sanidhya@gatech.edu Cc: taesoo@gatech.edu Link: http://lkml.kernel.org/r/1503522470-35531-1-git-send-email-meng.xu@gatech.edu Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-27Minor page waitqueue cleanupsLinus Torvalds
Tim Chen and Kan Liang have been battling a customer load that shows extremely long page wakeup lists. The cause seems to be constant NUMA migration of a hot page that is shared across a lot of threads, but the actual root cause for the exact behavior has not been found. Tim has a patch that batches the wait list traversal at wakeup time, so that we at least don't get long uninterruptible cases where we traverse and wake up thousands of processes and get nasty latency spikes. That is likely 4.14 material, but we're still discussing the page waitqueue specific parts of it. In the meantime, I've tried to look at making the page wait queues less expensive, and failing miserably. If you have thousands of threads waiting for the same page, it will be painful. We'll need to try to figure out the NUMA balancing issue some day, in addition to avoiding the excessive spinlock hold times. That said, having tried to rewrite the page wait queues, I can at least fix up some of the braindamage in the current situation. In particular: (a) we don't want to continue walking the page wait list if the bit we're waiting for already got set again (which seems to be one of the patterns of the bad load). That makes no progress and just causes pointless cache pollution chasing the pointers. (b) we don't want to put the non-locking waiters always on the front of the queue, and the locking waiters always on the back. Not only is that unfair, it means that we wake up thousands of reading threads that will just end up being blocked by the writer later anyway. Also add a comment about the layout of 'struct wait_page_key' - there is an external user of it in the cachefiles code that means that it has to match the layout of 'struct wait_bit_key' in the two first members. It so happens to match, because 'struct page *' and 'unsigned long *' end up having the same values simply because the page flags are the first member in struct page. Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Christopher Lameter <cl@linux.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-26Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix a timer granularity handling race+bug, which would manifest itself by spuriously increasing timeouts of some timers (from 1 jiffy to ~500 jiffies in the worst case measured) in certain nohz states" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix excessive granularity of new timers after a nohz idle
2017-08-26Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Ingo Molnar: "A single fix to not allow nonsensical event groups that result in kernel warnings" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix group {cpu,task} validation
2017-08-26time: Fix ktime_get_raw() incorrect base accumulationJohn Stultz
In comqit fc6eead7c1e2 ("time: Clean up CLOCK_MONOTONIC_RAW time handling"), the following code got mistakenly added to the update of the raw timekeeper: /* Update the monotonic raw base */ seconds = tk->raw_sec; nsec = (u32)(tk->tkr_raw.xtime_nsec >> tk->tkr_raw.shift); tk->tkr_raw.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec); Which adds the raw_sec value and the shifted down raw xtime_nsec to the base value. But the read function adds the shifted down tk->tkr_raw.xtime_nsec value another time, The result of this is that ktime_get_raw() users (which are all internal users) see the raw time move faster then it should (the rate at which can vary with the current size of tkr_raw.xtime_nsec), which has resulted in at least problems with graphics rendering performance. The change tried to match the monotonic base update logic: seconds = (u64)(tk->xtime_sec + tk->wall_to_monotonic.tv_sec); nsec = (u32) tk->wall_to_monotonic.tv_nsec; tk->tkr_mono.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec); Which adds the wall_to_monotonic.tv_nsec value, but not the tk->tkr_mono.xtime_nsec value to the base. To fix this, simplify the tkr_raw.base accumulation to only accumulate the raw_sec portion, and do not include the tkr_raw.xtime_nsec portion, which will be added at read time. Fixes: fc6eead7c1e2 ("time: Clean up CLOCK_MONOTONIC_RAW time handling") Reported-and-tested-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Stephen Boyd <stephen.boyd@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Daniel Mentz <danielmentz@google.com> Link: http://lkml.kernel.org/r/1503701824-1645-1-git-send-email-john.stultz@linaro.org
2017-08-25fork: fix incorrect fput of ->exe_file causing use-after-freeEric Biggers
Commit 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") made it possible to kill a forking task while it is waiting to acquire its ->mmap_sem for write, in dup_mmap(). However, it was overlooked that this introduced an new error path before a reference is taken on the mm_struct's ->exe_file. Since the ->exe_file of the new mm_struct was already set to the old ->exe_file by the memcpy() in dup_mm(), it was possible for the mmput() in the error path of dup_mm() to drop a reference to ->exe_file which was never taken. This caused the struct file to later be freed prematurely. Fix it by updating mm_init() to NULL out the ->exe_file, in the same place it clears other things like the list of mmaps. This bug was found by syzkaller. It can be reproduced using the following C program: #define _GNU_SOURCE #include <pthread.h> #include <stdlib.h> #include <sys/mman.h> #include <sys/syscall.h> #include <sys/wait.h> #include <unistd.h> static void *mmap_thread(void *_arg) { for (;;) { mmap(NULL, 0x1000000, PROT_READ, MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); } } static void *fork_thread(void *_arg) { usleep(rand() % 10000); fork(); } int main(void) { fork(); fork(); fork(); for (;;) { if (fork() == 0) { pthread_t t; pthread_create(&t, NULL, mmap_thread, NULL); pthread_create(&t, NULL, fork_thread, NULL); usleep(rand() % 10000); syscall(__NR_exit_group, 0); } wait(NULL); } } No special kernel config options are needed. It usually causes a NULL pointer dereference in __remove_shared_vm_struct() during exit, or in dup_mmap() (which is usually inlined into copy_process()) during fork. Both are due to a vm_area_struct's ->vm_file being used after it's already been freed. Google Bug Id: 64772007 Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com Fixes: 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [v4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-25perf/core: Fix group {cpu,task} validationMark Rutland
Regardless of which events form a group, it does not make sense for the events to target different tasks and/or CPUs, as this leaves the group inconsistent and impossible to schedule. The core perf code assumes that these are consistent across (successfully intialised) groups. Core perf code only verifies this when moving SW events into a HW context. Thus, we can violate this requirement for pure SW groups and pure HW groups, unless the relevant PMU driver happens to perform this verification itself. These mismatched groups subsequently wreak havoc elsewhere. For example, we handle watchpoints as SW events, and reserve watchpoint HW on a per-CPU basis at pmu::event_init() time to ensure that any event that is initialised is guaranteed to have a slot at pmu::add() time. However, the core code only checks the group leader's cpu filter (via event_filter_match()), and can thus install follower events onto CPUs violating thier (mismatched) CPU filters, potentially installing them into a CPU without sufficient reserved slots. This can be triggered with the below test case, resulting in warnings from arch backends. #define _GNU_SOURCE #include <linux/hw_breakpoint.h> #include <linux/perf_event.h> #include <sched.h> #include <stdio.h> #include <sys/prctl.h> #include <sys/syscall.h> #include <unistd.h> static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags) { return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags); } char watched_char; struct perf_event_attr wp_attr = { .type = PERF_TYPE_BREAKPOINT, .bp_type = HW_BREAKPOINT_RW, .bp_addr = (unsigned long)&watched_char, .bp_len = 1, .size = sizeof(wp_attr), }; int main(int argc, char *argv[]) { int leader, ret; cpu_set_t cpus; /* * Force use of CPU0 to ensure our CPU0-bound events get scheduled. */ CPU_ZERO(&cpus); CPU_SET(0, &cpus); ret = sched_setaffinity(0, sizeof(cpus), &cpus); if (ret) { printf("Unable to set cpu affinity\n"); return 1; } /* open leader event, bound to this task, CPU0 only */ leader = perf_event_open(&wp_attr, 0, 0, -1, 0); if (leader < 0) { printf("Couldn't open leader: %d\n", leader); return 1; } /* * Open a follower event that is bound to the same task, but a * different CPU. This means that the group should never be possible to * schedule. */ ret = perf_event_open(&wp_attr, 0, 1, leader, 0); if (ret < 0) { printf("Couldn't open mismatched follower: %d\n", ret); return 1; } else { printf("Opened leader/follower with mismastched CPUs\n"); } /* * Open as many independent events as we can, all bound to the same * task, CPU0 only. */ do { ret = perf_event_open(&wp_attr, 0, 0, -1, 0); } while (ret >= 0); /* * Force enable/disble all events to trigger the erronoeous * installation of the follower event. */ printf("Opened all events. Toggling..\n"); for (;;) { prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0); prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0); } return 0; } Fix this by validating this requirement regardless of whether we're moving events. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zhou Chengming <zhouchengming1@huawei.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1498142498-15758-1-git-send-email-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-24Merge tag 'trace-v4.13-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Various bug fixes: - Two small memory leaks in error paths. - A missed return error code on an error path. - A fix to check the tracing ring buffer CPU when it doesn't exist (caused by setting maxcpus on the command line that is less than the actual number of CPUs, and then onlining them manually). - A fix to have the reset of boot tracers called by lateinit_sync() instead of just lateinit(). As some of the tracers register via lateinit(), and if the clear happens before the tracer is registered, it will never start even though it was told to via the kernel command line" * tag 'trace-v4.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix freeing of filter in create_filter() when set_str is false tracing: Fix kmemleak in tracing_map_array_free() ftrace: Check for null ret_stack on profile function graph entry function ring-buffer: Have ring_buffer_alloc_read_page() return error on offline CPU tracing: Missing error code in tracer_alloc_buffers() tracing: Call clear_boot_tracer() at lateinit_sync
2017-08-24cpuset: Fix incorrect memory_pressure control file mappingWaiman Long
The memory_pressure control file was incorrectly set up without a private value (0, by default). As a result, this control file was treated like memory_migrate on read. By adding back the FILE_MEMORY_PRESSURE private value, the correct memory pressure value will be returned. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 7dbdb199d3bf ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE") Cc: stable@vger.kernel.org # v4.4+
2017-08-24tracing: Fix freeing of filter in create_filter() when set_str is falseSteven Rostedt (VMware)
Performing the following task with kmemleak enabled: # cd /sys/kernel/tracing/events/irq/irq_handler_entry/ # echo 'enable_event:kmem:kmalloc:3 if irq >' > trigger # echo 'enable_event:kmem:kmalloc:3 if irq > 31' > trigger # echo scan > /sys/kernel/debug/kmemleak # cat /sys/kernel/debug/kmemleak unreferenced object 0xffff8800b9290308 (size 32): comm "bash", pid 1114, jiffies 4294848451 (age 141.139s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff81cef5aa>] kmemleak_alloc+0x4a/0xa0 [<ffffffff81357938>] kmem_cache_alloc_trace+0x158/0x290 [<ffffffff81261c09>] create_filter_start.constprop.28+0x99/0x940 [<ffffffff812639c9>] create_filter+0xa9/0x160 [<ffffffff81263bdc>] create_event_filter+0xc/0x10 [<ffffffff812655e5>] set_trigger_filter+0xe5/0x210 [<ffffffff812660c4>] event_enable_trigger_func+0x324/0x490 [<ffffffff812652e2>] event_trigger_write+0x1a2/0x260 [<ffffffff8138cf87>] __vfs_write+0xd7/0x380 [<ffffffff8138f421>] vfs_write+0x101/0x260 [<ffffffff8139187b>] SyS_write+0xab/0x130 [<ffffffff81cfd501>] entry_SYSCALL_64_fastpath+0x1f/0xbe [<ffffffffffffffff>] 0xffffffffffffffff The function create_filter() is passed a 'filterp' pointer that gets allocated, and if "set_str" is true, it is up to the caller to free it, even on error. The problem is that the pointer is not freed by create_filter() when set_str is false. This is a bug, and it is not up to the caller to free the filter on error if it doesn't care about the string. Link: http://lkml.kernel.org/r/1502705898-27571-2-git-send-email-chuhu@redhat.com Cc: stable@vger.kernel.org Fixes: 38b78eb85 ("tracing: Factorize filter creation") Reported-by: Chunyu Hu <chuhu@redhat.com> Tested-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-24tracing: Fix kmemleak in tracing_map_array_free()Chunyu Hu
kmemleak reported the below leak when I was doing clear of the hist trigger. With this patch, the kmeamleak is gone. unreferenced object 0xffff94322b63d760 (size 32): comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s) hex dump (first 32 bytes): 00 01 00 00 04 00 00 00 08 00 00 00 ff 00 00 00 ................ 10 00 00 00 00 00 00 00 80 a8 7a f2 31 94 ff ff ..........z.1... backtrace: [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff9e424cba>] kmem_cache_alloc_trace+0xca/0x1d0 [<ffffffff9e377736>] tracing_map_array_alloc+0x26/0x140 [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50 [<ffffffff9e38b935>] create_hist_data+0x535/0x750 [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420 [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0 [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170 [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0 [<ffffffff9e450b85>] SyS_write+0x55/0xc0 [<ffffffff9e203857>] do_syscall_64+0x67/0x150 [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a [<ffffffffffffffff>] 0xffffffffffffffff unreferenced object 0xffff9431f27aa880 (size 128): comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s) hex dump (first 32 bytes): 00 00 8c 2a 32 94 ff ff 00 f0 8b 2a 32 94 ff ff ...*2......*2... 00 e0 8b 2a 32 94 ff ff 00 d0 8b 2a 32 94 ff ff ...*2......*2... backtrace: [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0 [<ffffffff9e425348>] __kmalloc+0xe8/0x220 [<ffffffff9e3777c1>] tracing_map_array_alloc+0xb1/0x140 [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50 [<ffffffff9e38b935>] create_hist_data+0x535/0x750 [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420 [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0 [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170 [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0 [<ffffffff9e450b85>] SyS_write+0x55/0xc0 [<ffffffff9e203857>] do_syscall_64+0x67/0x150 [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a [<ffffffffffffffff>] 0xffffffffffffffff Link: http://lkml.kernel.org/r/1502705898-27571-1-git-send-email-chuhu@redhat.com Cc: stable@vger.kernel.org Fixes: 08d43a5fa063 ("tracing: Add lock-free tracing_map") Signed-off-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-24ftrace: Check for null ret_stack on profile function graph entry functionSteven Rostedt (VMware)
There's a small race when function graph shutsdown and the calling of the registered function graph entry callback. The callback must not reference the task's ret_stack without first checking that it is not NULL. Note, when a ret_stack is allocated for a task, it stays allocated until the task exits. The problem here, is that function_graph is shutdown, and a new task was created, which doesn't have its ret_stack allocated. But since some of the functions are still being traced, the callbacks can still be called. The normal function_graph code handles this, but starting with commit 8861dd303c ("ftrace: Access ret_stack->subtime only in the function profiler") the profiler code references the ret_stack on function entry, but doesn't check if it is NULL first. Link: https://bugzilla.kernel.org/show_bug.cgi?id=196611 Cc: stable@vger.kernel.org Fixes: 8861dd303c ("ftrace: Access ret_stack->subtime only in the function profiler") Reported-by: lilydjwg@gmail.com Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-24timers: Fix excessive granularity of new timers after a nohz idleNicholas Piggin
When a timer base is idle, it is forwarded when a new timer is added to ensure that granularity does not become excessive. When not idle, the timer tick is expected to increment the base. However there are several problems: - If an existing timer is modified, the base is forwarded only after the index is calculated. - The base is not forwarded by add_timer_on. - There is a window after a timer is restarted from a nohz idle, after it is marked not-idle and before the timer tick on this CPU, where a timer may be added but the ancient base does not get forwarded. These result in excessive granularity (a 1 jiffy timeout can blow out to 100s of jiffies), which cause the rcu lockup detector to trigger, among other things. Fix this by keeping track of whether the timer base has been idle since it was last run or forwarded, and if so then forward it before adding a new timer. There is still a case where mod_timer optimises the case of a pending timer mod with the same expiry time, where the timer can see excessive granularity relative to the new, shorter interval. A comment is added, but it's not changed because it is an important fastpath for networking. This has been tested and found to fix the RCU softlockup messages. Testing was also done with tracing to measure requested versus achieved wakeup latencies for all non-deferrable timers in an idle system (with no lockup watchdogs running). Wakeup latency relative to absolute latency is calculated (note this suffers from round-up skew at low absolute times) and analysed: max avg std upstream 506.0 1.20 4.68 patched 2.0 1.08 0.15 The bug was noticed due to the lockup detector Kconfig changes dropping it out of people's .configs and resulting in larger base clk skew When the lockup detectors are enabled, no CPU can go idle for longer than 4 seconds, which limits the granularity errors. Sub-optimal timer behaviour is observable on a smaller scale in that case: max avg std upstream 9.0 1.05 0.19 patched 2.0 1.04 0.11 Fixes: Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: David Miller <davem@davemloft.net> Cc: dzickus@redhat.com Cc: sfr@canb.auug.org.au Cc: mpe@ellerman.id.au Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: linuxarm@huawei.com Cc: abdhalee@linux.vnet.ibm.com Cc: John Stultz <john.stultz@linaro.org> Cc: akpm@linux-foundation.org Cc: paulmck@linux.vnet.ibm.com Cc: torvalds@linux-foundation.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170822084348.21436-1-npiggin@gmail.com
2017-08-22bpf: fix map value attribute for hash of mapsDaniel Borkmann
Currently, iproute2's BPF ELF loader works fine with array of maps when retrieving the fd from a pinned node and doing a selfcheck against the provided map attributes from the object file, but we fail to do the same for hash of maps and thus refuse to get the map from pinned node. Reason is that when allocating hash of maps, fd_htab_map_alloc() will set the value size to sizeof(void *), and any user space map creation requests are forced to set 4 bytes as value size. Thus, selfcheck will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID returns the value size used to create the map. Fix it by handling it the same way as we do for array of maps, which means that we leave value size at 4 bytes and in the allocation phase round up value size to 8 bytes. alloc_htab_elem() needs an adjustment in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem() calling into htab_map_update_elem() with the pointer of the map pointer as value. Unlike array of maps where we just xchg(), we're using the generic htab_map_update_elem() callback also used from helper calls, which published the key/value already on return, so we need to ensure to memcpy() the right size. Fixes: bcc6b1b7ebf8 ("bpf: Add hash of maps support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-21pids: make task_tgid_nr_ns() safeOleg Nesterov
This was reported many times, and this was even mentioned in commit 52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is not safe because task->group_leader points to nowhere after the exiting task passes exit_notify(), rcu_read_lock() can not help. We really need to change __unhash_process() to nullify group_leader, parent, and real_parent, but this needs some cleanups. Until then we can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and fix the problem. Reported-by: Troy Kensinger <tkensinger@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-20Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "Two fixes for the perf subsystem: - Fix an inconsistency of RDPMC mm struct tagging across exec() which causes RDPMC to fault. - Correct the timestamp mechanics across IOC_DISABLE/ENABLE which causes incorrect timestamps and total time calculations" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix time on IOC_ENABLE perf/x86: Fix RDPMC vs. mm_struct tracking
2017-08-20Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A pile of smallish changes all over the place: - Add a missing ISB in the GIC V1 driver - Remove an ACPI version check in the GIC V3 ITS driver - Add the missing irq_pm_shutdown function for BRCMSTB-L2 to avoid spurious wakeups - Remove the artifical limitation of ITS instances to the number of NUMA nodes which prevents utilizing the ITS hardware correctly - Prevent a infinite parsing loop in the GIC-V3 ITS/MSI code - Honour the force affinity argument in the GIC-V3 driver which is required to make perf work correctly - Correctly report allocation failures in GIC-V2/V3 to avoid using half allocated and initialized interrupts. - Fixup checks against nr_cpu_ids in the generic IPI code" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/ipi: Fixup checks against nr_cpu_ids genirq: Restore trigger settings in irq_modify_status() MAINTAINERS: Remove Jason Cooper's irqchip git tree irqchip/gic-v3-its-platform-msi: Fix msi-parent parsing loop irqchip/gic-v3-its: Allow GIC ITS number more than MAX_NUMNODES irqchip: brcmstb-l2: Define an irq_pm_shutdown function irqchip/gic: Ensure we have an ISB between ack and ->handle_irq irqchip/gic-v3-its: Remove ACPICA version check for ACPI NUMA irqchip/gic-v3: Honor forced affinity setting irqchip/gic-v3: Report failures in gic_irq_domain_alloc irqchip/gic-v2: Report failures in gic_irq_domain_alloc irqchip/atmel-aic: Remove root argument from ->fixup() prototype irqchip/atmel-aic: Fix unbalanced refcount in aic_common_rtc_irq_fixup() irqchip/atmel-aic: Fix unbalanced of_node_put() in aic_common_irq_fixup()
2017-08-20Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull watchdog fix from Thomas Gleixner: "A fix for the hardlockup watchdog to prevent false positives with extreme Turbo-Modes which make the perf/NMI watchdog fire faster than the hrtimer which is used to verify. Slightly larger than the minimal fix, which just would increase the hrtimer frequency, but comes with extra overhead of more watchdog timer interrupts and thread wakeups for all users. With this change we restrict the overhead to the extreme Turbo-Mode systems" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kernel/watchdog: Prevent false positives with turbo modes
2017-08-20genirq/ipi: Fixup checks against nr_cpu_idsAlexey Dobriyan
Valid CPU ids are [0, nr_cpu_ids-1] inclusive. Fixes: 3b8e29a82dd1 ("genirq: Implement ipi_send_mask/single()") Fixes: f9bce791ae2a ("genirq: Add a new function to get IPI reverse mapping") Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170819095751.GB27864@avx2
2017-08-18signal: don't remove SIGNAL_UNKILLABLE for traced tasks.Jamie Iles
When forcing a signal, SIGNAL_UNKILLABLE is removed to prevent recursive faults, but this is undesirable when tracing. For example, debugging an init process (whether global or namespace), hitting a breakpoint and SIGTRAP will force SIGTRAP and then remove SIGNAL_UNKILLABLE. Everything continues fine, but then once debugging has finished, the init process is left killable which is unlikely what the user expects, resulting in either an accidentally killed init or an init that stops reaping zombies. Link: http://lkml.kernel.org/r/20170815112806.10728-1-jamie.iles@oracle.com Signed-off-by: Jamie Iles <jamie.iles@oracle.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18kmod: fix wait on recursive loopLuis R. Rodriguez
Recursive loops with module loading were previously handled in kmod by restricting the number of modprobe calls to 50 and if that limit was breached request_module() would return an error and a user would see the following on their kernel dmesg: request_module: runaway loop modprobe binfmt-464c Starting init:/sbin/init exists but couldn't execute it (error -8) This issue could happen for instance when a 64-bit kernel boots a 32-bit userspace on some architectures and has no 32-bit binary format hanlders. This is visible, for instance, when a CONFIG_MODULES enabled 64-bit MIPS kernel boots a into o32 root filesystem and the binfmt handler for o32 binaries is not built-in. After commit 6d7964a722af ("kmod: throttle kmod thread limit") we now don't have any visible signs of an error and the kernel just waits for the loop to end somehow. Although this *particular* recursive loop could also be addressed by doing a sanity check on search_binary_handler() and disallowing a modular binfmt to be required for modprobe, a generic solution for any recursive kernel kmod issues is still needed. This should catch these loops. We can investigate each loop and address each one separately as they come in, this however puts a stop gap for them as before. Link: http://lkml.kernel.org/r/20170809234635.13443-3-mcgrof@kernel.org Fixes: 6d7964a722af ("kmod: throttle kmod thread limit") Signed-o