summaryrefslogtreecommitdiffstats
path: root/mm/oom_kill.c
AgeCommit message (Collapse)Author
2019-02-01mm, oom: fix use-after-free in oom_kill_processShakeel Butt
Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 ("mm, oom: reduce dependency on tasklist_lock") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01oom, oom_reaper: do not enqueue same task twiceTetsuo Handa
Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/mmu_notifier: use structure for invalidate_range_start/end calls v2Jérôme Glisse
To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> From: Jérôme Glisse <jglisse@redhat.com> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, oom: add oom victim's memcg to the oom context informationyuzhoujian
The current oom report doesn't display victim's memcg context during the global OOM situation. While this information is not strictly needed, it can be really helpful for containerized environments to locate which container has lost a process. Now that we have a single line for the oom context, we can trivially add both the oom memcg (this can be either global_oom or a specific memcg which hits its hard limits) and task_memcg which is the victim's memcg. Below is the single line output in the oom report after this patch. - global oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> - memcg oom context information: oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid> [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()] Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, oom: reorganize the oom report in dump_headeryuzhoujian
OOM report contains several sections. The first one is the allocation context that has triggered the OOM. Then we have cpuset context followed by the stack trace of the OOM path. The tird one is the OOM memory information. Followed by the current memory state of all system tasks. At last, we will show oom eligible tasks and the information about the chosen oom victim. One thing that makes parsing more awkward than necessary is that we do not have a single and easily parsable line about the oom context. This patch is reorganizing the oom report to 1) who invoked oom and what was the allocation request [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 2) OOM stack trace [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3 [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016 [ 515.906821] Call Trace: [ 515.908062] dump_stack+0x5a/0x73 [ 515.909311] dump_header+0x55/0x28c [ 515.914260] oom_kill_process+0x2d8/0x300 [ 515.916708] out_of_memory+0x145/0x4a0 [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16 [ 515.919157] __alloc_pages_nodemask+0x277/0x290 [ 515.920367] filemap_fault+0x3d0/0x6c0 [ 515.921529] ? filemap_map_pages+0x2b8/0x420 [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4] [ 515.923884] __do_fault+0x20/0x80 [ 515.925032] __handle_mm_fault+0xbc0/0xe80 [ 515.926195] handle_mm_fault+0xfa/0x210 [ 515.927357] __do_page_fault+0x233/0x4c0 [ 515.928506] do_page_fault+0x32/0x140 [ 515.929646] ? page_fault+0x8/0x30 [ 515.930770] page_fault+0x1e/0x30 3) OOM memory information [ 515.958093] Mem-Info: [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0 active_file:4402672 inactive_file:483963 isolated_file:1344 unevictable:0 dirty:4886753 writeback:0 unstable:0 slab_reclaimable:148442 slab_unreclaimable:18741 mapped:1347 shmem:1347 pagetables:58669 bounce:0 free:88663 free_pcp:0 free_cma:0 ... 4) current memory state of all system tasks [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic 5) oom context (contrains and the chosen victim). oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0 An admin can easily get the full oom context at a single line which makes parsing much easier. Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: convert totalram_pages and totalhigh_pages variables to atomicArun KS
totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-24Merge branch 'siginfo-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull siginfo updates from Eric Biederman: "I have been slowly sorting out siginfo and this is the culmination of that work. The primary result is in several ways the signal infrastructure has been made less error prone. The code has been updated so that manually specifying SEND_SIG_FORCED is never necessary. The conversion to the new siginfo sending functions is now complete, which makes it difficult to send a signal without filling in the proper siginfo fields. At the tail end of the patchset comes the optimization of decreasing the size of struct siginfo in the kernel from 128 bytes to about 48 bytes on 64bit. The fundamental observation that enables this is by definition none of the known ways to use struct siginfo uses the extra bytes. This comes at the cost of a small user space observable difference. For the rare case of siginfo being injected into the kernel only what can be copied into kernel_siginfo is delivered to the destination, the rest of the bytes are set to 0. For cases where the signal and the si_code are known this is safe, because we know those bytes are not used. For cases where the signal and si_code combination is unknown the bits that won't fit into struct kernel_siginfo are tested to verify they are zero, and the send fails if they are not. I made an extensive search through userspace code and I could not find anything that would break because of the above change. If it turns out I did break something it will take just the revert of a single change to restore kernel_siginfo to the same size as userspace siginfo. Testing did reveal dependencies on preferring the signo passed to sigqueueinfo over si->signo, so bit the bullet and added the complexity necessary to handle that case. Testing also revealed bad things can happen if a negative signal number is passed into the system calls. Something no sane application will do but something a malicious program or a fuzzer might do. So I have fixed the code that performs the bounds checks to ensure negative signal numbers are handled" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits) signal: Guard against negative signal numbers in copy_siginfo_from_user32 signal: Guard against negative signal numbers in copy_siginfo_from_user signal: In sigqueueinfo prefer sig not si_signo signal: Use a smaller struct siginfo in the kernel signal: Distinguish between kernel_siginfo and siginfo signal: Introduce copy_siginfo_from_user and use it's return value signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE signal: Fail sigqueueinfo if si_signo != sig signal/sparc: Move EMT_TAGOVF into the generic siginfo.h signal/unicore32: Use force_sig_fault where appropriate signal/unicore32: Generate siginfo in ucs32_notify_die signal/unicore32: Use send_sig_fault where appropriate signal/arc: Use force_sig_fault where appropriate signal/arc: Push siginfo generation into unhandled_exception signal/ia64: Use force_sig_fault where appropriate signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn signal/ia64: Use the generic force_sigsegv in setup_frame signal/arm/kvm: Use send_sig_mceerr signal/arm: Use send_sig_fault where appropriate signal/arm: Use force_sig_fault where appropriate ...
2018-09-11signal: Use SEND_SIG_PRIV not SEND_SIG_FORCED with SIGKILL and SIGSTOPEric W. Biederman
Now that siginfo is never allocated for SIGKILL and SIGSTOP there is no difference between SEND_SIG_PRIV and SEND_SIG_FORCED for SIGKILL and SIGSTOP. This makes SEND_SIG_FORCED unnecessary and redundant in the presence of SIGKILL and SIGSTOP. Therefore change users of SEND_SIG_FORCED that are sending SIGKILL or SIGSTOP to use SEND_SIG_PRIV instead. This removes the last users of SEND_SIG_FORCED. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-04mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm().Tetsuo Handa
Commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers") has added an ability to skip over vmas with blockable mmu notifiers. This however didn't call tlb_finish_mmu as it should. As a result inc_tlb_flush_pending has been called without its pairing dec_tlb_flush_pending and all callers mm_tlb_flush_pending would flush even though this is not really needed. This alone is not harmful and it seems there shouldn't be any such callers for oom victims at all but there is no real reason to skip tlb_finish_mmu on early skip either so call it. [mhocko@suse.com: new changelog] Link: http://lkml.kernel.org/r/b752d1d5-81ad-7a35-2394-7870641be51c@i-love.sakura.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04mm: memcontrol: print proper OOM header when no eligible victim leftJohannes Weiner
When the memcg OOM killer runs out of killable tasks, it currently prints a WARN with no further OOM context. This has caused some user confusion. Warnings indicate a kernel problem. In a reported case, however, the situation was triggered by a nonsensical memcg configuration (hard limit set to 0). But without any VM context this wasn't obvious from the report, and it took some back and forth on the mailing list to identify what is actually a trivial issue. Handle this OOM condition like we handle it in the global OOM killer: dump the full OOM context and tell the user we ran out of tasks. This way the user can identify misconfigurations easily by themselves and rectify the problem - without having to go through the hassle of running into an obscure but unsettling warning, finding the appropriate kernel mailing list and waiting for a kernel developer to remote-analyze that the memcg configuration caused this. If users cannot make sense of why the OOM killer was triggered or why it failed, they will still report it to the mailing list, we know that from experience. So in case there is an actual kernel bug causing this, kernel developers will very likely hear about it. Link: http://lkml.kernel.org/r/20180821160406.22578-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: - the rest of MM - procfs updates - various misc things - more y2038 fixes - get_maintainer updates - lib/ updates - checkpatch updates - various epoll updates - autofs updates - hfsplus - some reiserfs work - fatfs updates - signal.c cleanups - ipc/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits) ipc/util.c: update return value of ipc_getref from int to bool ipc/util.c: further variable name cleanups ipc: simplify ipc initialization ipc: get rid of ids->tables_initialized hack lib/rhashtable: guarantee initial hashtable allocation lib/rhashtable: simplify bucket_table_alloc() ipc: drop ipc_lock() ipc/util.c: correct comment in ipc_obtain_object_check ipc: rename ipcctl_pre_down_nolock() ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid() ipc: reorganize initialization of kern_ipc_perm.seq ipc: compute kern_ipc_perm.id under the ipc lock init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp adfs: use timespec64 for time conversion kernel/sysctl.c: fix typos in comments drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md fork: don't copy inconsistent signal handler state to child signal: make get_signal() return bool signal: make sigkill_pending() return bool ...
2018-08-22mm, oom: introduce memory.oom.groupRoman Gushchin
For some workloads an intervention from the OOM killer can be painful. Killing a random task can bring the workload into an inconsistent state. Historically, there are two common solutions for this problem: 1) enabling panic_on_oom, 2) using a userspace daemon to monitor OOMs and kill all outstanding processes. Both approaches have their downsides: rebooting on each OOM is an obvious waste of capacity, and handling all in userspace is tricky and requires a userspace agent, which will monitor all cgroups for OOMs. In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate the necessity of enabling panic_on_oom. Also, it can simplify the cgroup management for userspace applications. This commit introduces a new knob for cgroup v2 memory controller: memory.oom.group. The knob determines whether the cgroup should be treated as an indivisible workload by the OOM killer. If set, all tasks belonging to the cgroup or to its descendants (if the memory cgroup is not a leaf cgroup) are killed together or not at all. To determine which cgroup has to be killed, we do traverse the cgroup hierarchy from the victim task's cgroup up to the OOMing cgroup (or root) and looking for the highest-level cgroup with memory.oom.group set. Tasks with the OOM protection (oom_score_adj set to -1000) are treated as an exception and are never killed. This patch doesn't change the OOM victim selection algorithm. Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm, oom: refactor oom_kill_process()Roman Gushchin
Patch series "introduce memory.oom.group", v2. This is a tiny implementation of cgroup-aware OOM killer, which adds an ability to kill a cgroup as a single unit and so guarantee the integrity of the workload. Although it has only a limited functionality in comparison to what now resides in the mm tree (it doesn't change the victim task selection algorithm, doesn't look at memory stas on cgroup level, etc), it's also much simpler and more straightforward. So, hopefully, we can avoid having long debates here, as we had with the full implementation. As it doesn't prevent any futher development, and implements an useful and complete feature, it looks as a sane way forward. This patch (of 2): oom_kill_process() consists of two logical parts: the first one is responsible for considering task's children as a potential victim and printing the debug information. The second half is responsible for sending SIGKILL to all tasks sharing the mm struct with the given victim. This commit splits oom_kill_process() with an intention to re-use the the second half: __oom_kill_process(). The cgroup-aware OOM killer will kill multiple tasks belonging to the victim cgroup. We don't need to print the debug information for the each task, as well as play with task selection (considering task's children), so we can't use the existing oom_kill_process(). Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm/oom_kill.c: clean up oom_reap_task_mm()Michal Hocko
Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably - Undocumented return value. - comment "failed to reap part..." is misleading - sounds like it's referring to something which happened in the past, is in fact referring to something which might happen in the future. - fails to call trace_finish_task_reaping() in one case - code duplication. - Increases mmap_sem hold time a little by moving trace_finish_task_reaping() inside the locked region. So sue me ;) - Sharing the finish: path means that the trace event won't distinguish between the two sources of finishing. Add a short explanation for the return value and fix the rest by reorganizing the function a bit to have unified function exit paths. Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm, oom: describe task memory unit, larger PID padRodrigo Freire
The default page memory unit of OOM task dump events might not be intuitive and potentially misleading for the non-initiated when debugging OOM events: These are pages and not kBs. Add a small printk prior to the task dump informing that the memory units are actually memory _pages_. Also extends PID field to align on up to 7 characters. Reference https://lkml.org/lkml/2018/7/3/1201 Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com Signed-off-by: Rodrigo Freire <rfreire@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm, oom: remove oom_lock from oom_reaperMichal Hocko
oom_reaper used to rely on the oom_lock since e2fe14564d33 ("oom_reaper: close race with exiting task"). We do not really need the lock anymore though. 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") has removed serialization with the exit path based on the mm reference count and so we do not really rely on the oom_lock anymore. Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock to prevent from races when the page allocator didn't manage to get the freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later on and move on to another victim. Although this is possible in principle let's wait for it to actually happen in real life before we make the locking more complex again. Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem + MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path now. [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did] Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22mm, oom: distinguish blockable mode for mmu notifiersMichal Hocko
There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: David Rientjes <rientjes@google.com> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-21Merge branch 'siginfo-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull core signal handling updates from Eric Biederman: "It was observed that a periodic timer in combination with a sufficiently expensive fork could prevent fork from every completing. This contains the changes to remove the need for that restart. This set of changes is split into several parts: - The first part makes PIDTYPE_TGID a proper pid type instead something only for very special cases. The part starts using PIDTYPE_TGID enough so that in __send_signal where signals are actually delivered we know if the signal is being sent to a a group of processes or just a single process. - With that prep work out of the way the logic in fork is modified so that fork logically makes signals received while it is running appear to be received after the fork completes" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits) signal: Don't send signals to tasks that don't exist signal: Don't restart fork when signals come in. fork: Have new threads join on-going signal group stops fork: Skip setting TIF_SIGPENDING in ptrace_init_task signal: Add calculate_sigpending() fork: Unconditionally exit if a fatal signal is pending fork: Move and describe why the code examines PIDNS_ADDING signal: Push pid type down into complete_signal. signal: Push pid type down into __send_signal signal: Push pid type down into send_signal signal: Pass pid type into do_send_sig_info signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task signal: Pass pid type into group_send_sig_info signal: Pass pid and pid type into send_sigqueue posix-timers: Noralize good_sigevent signal: Use PIDTYPE_TGID to clearly store where file signals will be sent pid: Implement PIDTYPE_TGID pids: Move the pgrp and session pid pointers from task_struct to signal_struct kvm: Don't open code task_pid in kvm_vcpu_ioctl pids: Compute task_tgid using signal->leader_pid ...
2018-08-17mm/oom_kill.c: document oom_lockMichal Hocko
Add comments describing oom_lock's scope. Requested-by: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, oom: remove sleep from under oom_lockMichal Hocko
Tetsuo has pointed out that since 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21signal: Pass pid type into do_send_sig_infoEric W. Biederman
This passes the information we already have at the call sight into do_send_sig_info. Ultimately allowing for better handling of signals sent to a group of processes during fork. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-06-15mm: fix oom_kill event handlingRoman Gushchin
Commit e27be240df53 ("mm: memcg: make sure memory.events is uptodate when waking pollers") converted most of memcg event counters to per-memcg atomics, which made them less confusing for a user. The "oom_kill" counter remained untouched, so now it behaves differently than other counters (including "oom"). This adds nothing but confusion. Let's fix this by adding the MEMCG_OOM_KILL event, and follow the MEMCG_OOM approach. This also removes a hack from count_memcg_event_mm(), introduced earlier specially for the OOM_KILL counter. [akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch] Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07mm: rename page_counter's count/limit into usage/maxRoman Gushchin
This patch renames struct page_counter fields: count -> usage limit -> max and the corresponding functions: page_counter_limit() -> page_counter_set_max() mem_cgroup_get_limit() -> mem_cgroup_get_max() mem_cgroup_resize_limit() -> mem_cgroup_resize_max() memcg_update_kmem_limit() -> memcg_update_kmem_max() memcg_update_tcp_limit() -> memcg_update_tcp_max() The idea behind this renaming is to have the direct matching between memory cgroup knobs (low, high, max) and page_counters API. This is pure renaming, this patch doesn't bring any functional change. Link: http://lkml.kernel.org/r/20180405185921.4942-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11mm, oom: fix concurrent munlock and oom reaper unmap, v3David Rientjes
Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05mm,oom_reaper: check for MMF_OOM_SKIP before complainingTetsuo Handa
I got "oom_reaper: unable to reap pid:" messages when the victim thread was blocked inside free_pgtables() (which occurred after returning from unmap_vmas() and setting MMF_OOM_SKIP). We don't need to complain when exit_mmap() already set MMF_OOM_SKIP. Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB oom_reaper: unable to reap pid:7558 (a.out) a.out D13272 7558 6931 0x00100084 Call Trace: schedule+0x2d/0x80 rwsem_down_write_failed+0x2bb/0x440 call_rwsem_down_write_failed+0x13/0x20 down_write+0x49/0x60 unlink_file_vma+0x28/0x50 free_pgtables+0x36/0x100 exit_mmap+0xbb/0x180 mmput+0x50/0x110 copy_process.part.41+0xb61/0x1fe0 _do_fork+0xe6/0x560 do_syscall_64+0x74/0x230 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05mm, oom: remove 3% bonus for CAP_SYS_ADMIN processesDavid Rientjes
Since the 2.6 kernel, the oom killer has slightly biased away from CAP_SYS_ADMIN processes by discounting some of its memory usage in comparison to other processes. This has always been implicit and nothing exactly relies on the behavior. Gaurav notices that __task_cred() can dereference a potentially freed pointer if the task under consideration is exiting because a reference to the task_struct is not held. Remove the CAP_SYS_ADMIN bias so that all processes are treated equally. If any CAP_SYS_ADMIN process would like to be biased against, it is always allowed to adjust /proc/pid/oom_score_adj. Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05mm: kernel-doc: add missing parameter descriptionsMike Rapoport
Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31mm, oom: avoid reaping only for mm's with blockable invalidate callbacksDavid Rientjes
This uses the new annotation to determine if an mm has mmu notifiers with blockable invalidate range callbacks to avoid oom reaping. Otherwise, the callbacks are used around unmap_page_range(). Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141330120.74052@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dimitri Sivanich <sivanich@hpe.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Joerg Roedel <joro@8bytes.org> Cc: Doug Ledford <dledford@redhat.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-14mm, oom_reaper: fix memory corruptionMichal Hocko
David Rientjes has reported the following memory corruption while the oom reaper tries to unmap the victims address space BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000 addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d file: (null) fault: (null) mmap: (null) readpage: (null) CPU: 2 PID: 1001 Comm: oom_reaper Call Trace: unmap_page_range+0x1068/0x1130 __oom_reap_task_mm+0xd5/0x16b oom_reaper+0xff/0x14c kthread+0xc1/0xe0 Tetsuo Handa has noticed that the synchronization inside exit_mmap is insufficient. We only synchronize with the oom reaper if tsk_is_oom_victim which is not true if the final __mmput is called from a different context than the oom victim exit path. This can trivially happen from context of any task which has grabbed mm reference (e.g. to read /proc/<pid>/ file which requires mm etc.). The race would look like this oom_reaper oom_victim task mmget_not_zero do_exit mmput __oom_reap_task_mm mmput __mmput exit_mmap remove_vma unmap_page_range Fix this issue by providing a new mm_is_oom_victim() helper which operates on the mm struct rather than a task. Any context which operates on a remote mm struct should use this helper in place of tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared so it is stable in the exit_mmap path. Debugged by Tetsuo Handa. Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Rientjes <rientjes@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: <stable@vger.kernel.org> [4.14] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29mm, oom_reaper: gather each vma to prevent leaking TLB entryWang Nan
tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1"). Which causes leaking of tlb entries. Will clarifies his patch: "Basically, we tag each address space with an ASID (PCID on x86) which is resident in the TLB. This means we can elide TLB invalidation when pulling down a full mm because we won't ever assign that ASID to another mm without doing TLB invalidation elsewhere (which actually just nukes the whole TLB). I think that means that we could potentially not fault on a kernel uaccess, because we could hit in the TLB" There could be a window between complete_signal() sending IPI to other cores and all threads sharing this mm are really kicked off from cores. In this window, the oom reaper may calls tlb_flush_mmu_tlbonly() to flush TLB then frees pages. However, due to the above problem, the TLB entries are not really flushed on arm64. Other threads are possible to access these pages through TLB entries. Moreover, a copy_to_user() can also write to these pages without generating page fault, causes use-after-free bugs. This patch gathers each vma instead of gathering full vm space. In this case tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com Fixes: aac453635549 ("mm, oom: introduce oom reaper") Signed-off-by: Wang Nan <wangnan0@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Bob Liu <liubo95@huawei.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: simplify nodemask printingMichal Hocko
alloc_warn() and dump_header() have to explicitly handle NULL nodemask which forces both paths to use pr_cont. We can do better. printk already handles NULL pointers properly so all we need is to teach nodemask_pr_args to handle NULL nodemask carefully. This allows simplification of both alloc_warn() and dump_header() and gets rid of pr_cont altogether. This patch has been motivated by patch from Joe Perches http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com [akpm@linux-foundation.org: fix tile warning, per Arnd] Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Joe Perches <joe@perches.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm,oom_reaper: remove pointless kthread_run() error checkTetsuo Handa
Since oom_init() is called before userspace processes start, memory allocation failure for creating the OOM reaper kernel thread will let the OOM killer call panic() rather than wake up the OOM reaper. Link: http://lkml.kernel.org/r/1510137800-4602-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo H