summaryrefslogtreecommitdiffstats
path: root/mm
AgeCommit message (Collapse)Author
2019-07-12mm/slab: refactor common ksize KASAN logic into slab_common.cMarco Elver
This refactors common code of ksize() between the various allocators into slab_common.c: __ksize() is the allocator-specific implementation without instrumentation, whereas ksize() includes the required KASAN logic. Link: http://lkml.kernel.org/r/20190626142014.141844-5-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/kasan: change kasan_check_{read,write} to return booleanMarco Elver
This changes {,__}kasan_check_{read,write} functions to return a boolean denoting if the access was valid or not. [sfr@canb.auug.org.au: include types.h for "bool"] Link: http://lkml.kernel.org/r/20190705184949.13cdd021@canb.auug.org.au Link: http://lkml.kernel.org/r/20190626142014.141844-3-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/kasan: introduce __kasan_check_{read,write}Marco Elver
Patch series "mm/kasan: Add object validation in ksize()", v3. This patch (of 5): This introduces __kasan_check_{read,write}. __kasan_check functions may be used from anywhere, even compilation units that disable instrumentation selectively. This change eliminates the need for the __KASAN_INTERNAL definition. [elver@google.com: v5] Link: http://lkml.kernel.org/r/20190708170706.174189-2-elver@google.com Link: http://lkml.kernel.org/r/20190626142014.141844-2-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/kasan: print frame description for stack bugsMarco Elver
This adds support for printing stack frame description on invalid stack accesses. The frame description is embedded by the compiler, which is parsed and then pretty-printed. Currently, we can only print the stack frame info for accesses to the task's own stack, but not accesses to other tasks' stacks. Example of what it looks like: page dumped because: kasan: bad access detected addr ffff8880673ef98a is located in stack of task insmod/2008 at offset 106 in frame: kasan_stack_oob+0x0/0xf5 [test_kasan] this frame has 2 objects: [32, 36) 'i' [96, 106) 'stack_array' Memory state around the buggy address: Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=198435 Link: http://lkml.kernel.org/r/20190522100048.146841-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/kmemleak.c: change error at _write when kmemleak is disabledAndré Almeida
According to POSIX, EBUSY means that the "device or resource is busy", and this can lead to people thinking that the file `/sys/kernel/debug/kmemleak/` is somehow locked or being used by other process. Change this error code to a more appropriate one. Link: http://lkml.kernel.org/r/20190612155231.19448-1-andrealmeid@collabora.com Signed-off-by: André Almeida <andrealmeid@collabora.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/kmemleak.c: fix check for softirq contextDmitry Vyukov
in_softirq() is a wrong predicate to check if we are in a softirq context. It also returns true if we have BH disabled, so objects are falsely stamped with "softirq" comm. The correct predicate is in_serving_softirq(). If user does cat from /sys/kernel/debug/kmemleak previously they would see this, which is clearly wrong, this is system call context (see the comm): unreferenced object 0xffff88805bd661c0 (size 64): comm "softirq", pid 0, jiffies 4294942959 (age 12.400s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00 ................ 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ backtrace: [<0000000007dcb30c>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<0000000007dcb30c>] slab_post_alloc_hook mm/slab.h:439 [inline] [<0000000007dcb30c>] slab_alloc mm/slab.c:3326 [inline] [<0000000007dcb30c>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553 [<00000000969722b7>] kmalloc include/linux/slab.h:547 [inline] [<00000000969722b7>] kzalloc include/linux/slab.h:742 [inline] [<00000000969722b7>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline] [<00000000969722b7>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085 [<00000000a4134b5f>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475 [<00000000d20248ad>] do_ip_setsockopt.isra.0+0x19fe/0x1c00 net/ipv4/ip_sockglue.c:957 [<000000003d367be7>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246 [<000000003c7c76af>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616 [<000000000c1aeb23>] sock_common_setsockopt+0x3e/0x50 net/core/sock.c:3130 [<000000000157b92b>] __sys_setsockopt+0x9e/0x120 net/socket.c:2078 [<00000000a9f3d058>] __do_sys_setsockopt net/socket.c:2089 [inline] [<00000000a9f3d058>] __se_sys_setsockopt net/socket.c:2086 [inline] [<00000000a9f3d058>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086 [<000000001b8da885>] do_syscall_64+0x7c/0x1a0 arch/x86/entry/common.c:301 [<00000000ba770c62>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 now they will see this: unreferenced object 0xffff88805413c800 (size 64): comm "syz-executor.4", pid 8960, jiffies 4294994003 (age 14.350s) hex dump (first 32 bytes): 00 7a 8a 57 80 88 ff ff e0 00 00 01 00 00 00 00 .z.W............ 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................ backtrace: [<00000000c5d3be64>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline] [<00000000c5d3be64>] slab_post_alloc_hook mm/slab.h:439 [inline] [<00000000c5d3be64>] slab_alloc mm/slab.c:3326 [inline] [<00000000c5d3be64>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553 [<0000000023865be2>] kmalloc include/linux/slab.h:547 [inline] [<0000000023865be2>] kzalloc include/linux/slab.h:742 [inline] [<0000000023865be2>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline] [<0000000023865be2>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085 [<000000003029a9d4>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475 [<00000000ccd0a87c>] do_ip_setsockopt.isra.0+0x19fe/0x1c00 net/ipv4/ip_sockglue.c:957 [<00000000a85a3785>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246 [<00000000ec13c18d>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616 [<0000000052d748e3>] sock_common_setsockopt+0x3e/0x50 net/core/sock.c:3130 [<00000000512f1014>] __sys_setsockopt+0x9e/0x120 net/socket.c:2078 [<00000000181758bc>] __do_sys_setsockopt net/socket.c:2089 [inline] [<00000000181758bc>] __se_sys_setsockopt net/socket.c:2086 [inline] [<00000000181758bc>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086 [<00000000d4b73623>] do_syscall_64+0x7c/0x1a0 arch/x86/entry/common.c:301 [<00000000c1098bec>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Link: http://lkml.kernel.org/r/20190517171507.96046-1-dvyukov@gmail.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12slub: don't panic for memcg kmem cache creation failureShakeel Butt
Currently for CONFIG_SLUB, if a memcg kmem cache creation is failed and the corresponding root kmem cache has SLAB_PANIC flag, the kernel will be crashed. This is unnecessary as the kernel can handle the creation failures of memcg kmem caches. Additionally CONFIG_SLAB does not implement this behavior. So, to keep the behavior consistent between SLAB and SLUB, removing the panic for memcg kmem cache creation failures. The root kmem cache creation failure for SLAB_PANIC correctly panics for both SLAB and SLUB. Link: http://lkml.kernel.org/r/20190619232514.58994-1-shakeelb@google.com Reported-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Roman Gushchin <guro@fb.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/slub.c: avoid double string traverse in kmem_cache_flags()Yury Norov
If ',' is not found, kmem_cache_flags() calls strlen() to find the end of line. We can do it in a single pass using strchrnul(). Link: http://lkml.kernel.org/r/20190501053111.7950-1-ynorov@marvell.com Signed-off-by: Yury Norov <ynorov@marvell.com> Acked-by: Aaron Tomlin <atomlin@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/slab: sanity-check page type when looking up cacheKees Cook
This avoids any possible type confusion when looking up an object. For example, if a non-slab were to be passed to kfree(), the invalid slab_cache pointer (i.e. overlapped with some other value from the struct page union) would be used for subsequent slab manipulations that could lead to further memory corruption. Since the page is already in cache, adding the PageSlab() check will have nearly zero cost, so add a check and WARN() to virt_to_cache(). Additionally replaces an open-coded virt_to_cache(). To support the failure mode this also updates all callers of virt_to_cache() and cache_from_obj() to handle a NULL cache pointer return value (though note that several already handle this case gracefully). [dan.carpenter@oracle.com: restore IRQs in kfree()] Link: http://lkml.kernel.org/r/20190613065637.GE16334@mwanda Link: http://lkml.kernel.org/r/20190530045017.15252-3-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/slab: validate cache membership under freelist hardeningKees Cook
Patch series "mm/slab: Improved sanity checking". This adds defenses against slab cache confusion (as seen in real-world exploits[1]) and gracefully handles type confusions when trying to look up slab caches from an arbitrary page. (Also is patch 3: new LKDTM tests for these defenses as well as for the existing double-free detection. This patch (of 3): When building under CONFIG_SLAB_FREELIST_HARDENING, it makes sense to perform sanity-checking on the assumed slab cache during kmem_cache_free() to make sure the kernel doesn't mix freelists across slab caches and corrupt memory (as seen in the exploitation of flaws like CVE-2018-9568[1]). Note that the prior code might WARN() but still corrupt memory (i.e. return the assumed cache instead of the owned cache). There is no noticeable performance impact (changes are within noise). Measuring parallel kernel builds, I saw the following with CONFIG_SLAB_FREELIST_HARDENED, before and after this patch: before: Run times: 288.85 286.53 287.09 287.07 287.21 Min: 286.53 Max: 288.85 Mean: 287.35 Std Dev: 0.79 after: Run times: 289.58 287.40 286.97 287.20 287.01 Min: 286.97 Max: 289.58 Mean: 287.63 Std Dev: 0.99 Delta: 0.1% which is well below the standard deviation [1] https://github.com/ThomasKing2014/slides/raw/master/Building%20universal%20Android%20rooting%20with%20a%20type%20confusion%20vulnerability.pdf Link: http://lkml.kernel.org/r/20190530045017.15252-2-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alexander Popov <alex.popov@linux.com> Cc: Alexander Potapenko <glider@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/z3fold.c: lock z3fold page before __SetPageMovable()Henry Burns
Following zsmalloc.c's example we call trylock_page() and unlock_page(). Also make z3fold_page_migrate() assert that newpage is passed in locked, as per the documentation. [akpm@linux-foundation.org: fix trylock_page return value test, per Shakeel] Link: http://lkml.kernel.org/r/20190702005122.41036-1-henryburns@google.com Link: http://lkml.kernel.org/r/20190702233538.52793-1-henryburns@google.com Signed-off-by: Henry Burns <henryburns@google.com> Suggested-by: Vitaly Wool <vitalywool@gmail.com> Acked-by: Vitaly Wool <vitalywool@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Vitaly Vul <vitaly.vul@sony.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Xidong Wang <wangxidong_97@163.com> Cc: Jonathan Adams <jwadams@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm/memcontrol: fix wrong statistics in memory.statYafang Shao
When we calculate total statistics for memcg1_stats and memcg1_events, we use the the index 'i' in the for loop as the events index. Actually we should use memcg1_stats[i] and memcg1_events[i] as the events index. Link: http://lkml.kernel.org/r/1562116978-19539-1-git-send-email-laoar.shao@gmail.com Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty"). Signed-off-by: Yafang Shao <laoar.shao@gmail.com Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yafang Shao <shaoyafang@didiglobal.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12mm: vmscan: scan anonymous pages on file refaultsKuo-Hsin Yang
When file refaults are detected and there are many inactive file pages, the system never reclaim anonymous pages, the file pages are dropped aggressively when there are still a lot of cold anonymous pages and system thrashes. This issue impacts the performance of applications with large executable, e.g. chrome. With this patch, when file refault is detected, inactive_list_is_low() always returns true for file pages in get_scan_count() to enable scanning anonymous pages. The problem can be reproduced by the following test program. ---8<--- void fallocate_file(const char *filename, off_t size) { struct stat st; int fd; if (!stat(filename, &st) && st.st_size >= size) return; fd = open(filename, O_WRONLY | O_CREAT, 0600); if (fd < 0) { perror("create file"); exit(1); } if (posix_fallocate(fd, 0, size)) { perror("fallocate"); exit(1); } close(fd); } long *alloc_anon(long size) { long *start = malloc(size); memset(start, 1, size); return start; } long access_file(const char *filename, long size, long rounds) { int fd, i; volatile char *start1, *end1, *start2; const int page_size = getpagesize(); long sum = 0; fd = open(filename, O_RDONLY); if (fd == -1) { perror("open"); exit(1); } /* * Some applications, e.g. chrome, use a lot of executable file * pages, map some of the pages with PROT_EXEC flag to simulate * the behavior. */ start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0); if (start1 == MAP_FAILED) { perror("mmap"); exit(1); } end1 = start1 + size / 2; start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2); if (start2 == MAP_FAILED) { perror("mmap"); exit(1); } for (i = 0; i < rounds; ++i) { struct timeval before, after; volatile char *ptr1 = start1, *ptr2 = start2; gettimeofday(&before, NULL); for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size) sum += *ptr1 + *ptr2; gettimeofday(&after, NULL); printf("File access time, round %d: %f (sec) ", i, (after.tv_sec - before.tv_sec) + (after.tv_usec - before.tv_usec) / 1000000.0); } return sum; } int main(int argc, char *argv[]) { const long MB = 1024 * 1024; long anon_mb, file_mb, file_rounds; const char filename[] = "large"; long *ret1; long ret2; if (argc != 4) { printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS "); exit(0); } anon_mb = atoi(argv[1]); file_mb = atoi(argv[2]); file_rounds = atoi(argv[3]); fallocate_file(filename, file_mb * MB); printf("Allocate %ld MB anonymous pages ", anon_mb); ret1 = alloc_anon(anon_mb * MB); printf("Access %ld MB file pages ", file_mb); ret2 = access_file(filename, file_mb * MB, file_rounds); printf("Print result to prevent optimization: %ld ", *ret1 + ret2); return 0; } ---8<--- Running the test program on 2GB RAM VM with kernel 5.2.0-rc5, the program fills ram with 2048 MB memory, access a 200 MB file for 10 times. Without this patch, the file cache is dropped aggresively and every access to the file is from disk. $ ./thrash 2048 200 10 Allocate 2048 MB anonymous pages Access 200 MB file pages File access time, round 0: 2.489316 (sec) File access time, round 1: 2.581277 (sec) File access time, round 2: 2.487624 (sec) File access time, round 3: 2.449100 (sec) File access time, round 4: 2.420423 (sec) File access time, round 5: 2.343411 (sec) File access time, round 6: 2.454833 (sec) File access time, round 7: 2.483398 (sec) File access time, round 8: 2.572701 (sec) File access time, round 9: 2.493014 (sec) With this patch, these file pages can be cached. $ ./thrash 2048 200 10 Allocate 2048 MB anonymous pages Access 200 MB file pages File access time, round 0: 2.475189 (sec) File access time, round 1: 2.440777 (sec) File access time, round 2: 2.411671 (sec) File access time, round 3: 1.955267 (sec) File access time, round 4: 0.029924 (sec) File access time, round 5: 0.000808 (sec) File access time, round 6: 0.000771 (sec) File access time, round 7: 0.000746 (sec) File access time, round 8: 0.000738 (sec) File access time, round 9: 0.000747 (sec) Checked the swap out stats during the test [1], 19006 pages swapped out with this patch, 3418 pages swapped out without this patch. There are more swap out, but I think it's within reasonable range when file backed data set doesn't fit into the memory. $ ./thrash 2000 100 2100 5 1 # ANON_MB FILE_EXEC FILE_NOEXEC ROUNDS PROCESSES Allocate 2000 MB anonymous pages active_anon: 1613644, inactive_anon: 348656, active_file: 892, inactive_file: 1384 (kB) pswpout: 7972443, pgpgin: 478615246 Access 100 MB executable file pages Access 2100 MB regular file pages File access time, round 0: 12.165, (sec) active_anon: 1433788, inactive_anon: 478116, active_file: 17896, inactive_file: 24328 (kB) File access time, round 1: 11.493, (sec) active_anon: 1430576, inactive_anon: 477144, active_file: 25440, inactive_file: 26172 (kB) File access time, round 2: 11.455, (sec) active_anon: 1427436, inactive_anon: 476060, active_file: 21112, inactive_file: 28808 (kB) File access time, round 3: 11.454, (sec) active_anon: 1420444, inactive_anon: 473632, active_file: 23216, inactive_file: 35036 (kB) File access time, round 4: 11.479, (sec) active_anon: 1413964, inactive_anon: 471460, active_file: 31728, inactive_file: 32224 (kB) pswpout: 7991449 (+ 19006), pgpgin: 489924366 (+ 11309120) With 4 processes accessing non-overlapping parts of a large file, 30316 pages swapped out with this patch, 5152 pages swapped out without this patch. The swapout number is small comparing to pgpgin. [1]: https://github.com/vovo/testing/blob/master/mem_thrash.c Link: http://lkml.kernel.org/r/20190701081038.GA83398@google.com Fixes: e9868505987a ("mm,vmscan: only evict file pages when we have plenty") Fixes: 7c5bd705d8f9 ("mm: memcg: only evict file pages when we have plenty") Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Sonny Rao <sonnyrao@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> [4.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-10Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Many bug fixes and cleanups, and an optimization for case-insensitive lookups" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix coverity warning on error path of filename setup ext4: replace ktype default_attrs with default_groups ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree() ext4: refactor initialize_dirent_tail() ext4: rename "dirent_csum" functions to use "dirblock" ext4: allow directory holes jbd2: drop declaration of journal_sync_buffer() ext4: use jbd2_inode dirty range scoping jbd2: introduce jbd2_inode dirty range scoping mm: add filemap_fdatawait_range_keep_errors() ext4: remove redundant assignment to node ext4: optimize case-insensitive lookups ext4: make __ext4_get_inode_loc plug ext4: clean up kerneldoc warnigns when building with W=1 ext4: only set project inherit bit for directory ext4: enforce the immutable flag on open files ext4: don't allow any modifications to an immutable file jbd2: fix typo in comment of journal_submit_inode_data_buffers jbd2: fix some print format mistakes ext4: gracefully handle ext4_break_layouts() failure during truncate
2019-07-10Merge tag 'copy-file-range-fixes-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull copy_file_range updates from Darrick Wong: "This fixes numerous parameter checking problems and inconsistent behaviors in the new(ish) copy_file_range system call. Now the system call will actually check its range parameters correctly; refuse to copy into files for which the caller does not have sufficient privileges; update mtime and strip setuid like file writes are supposed to do; and allows copying up to the EOF of the source file instead of failing the call like we used to. Summary: - Create a generic copy_file_range handler and make individual filesystems responsible for calling it (i.e. no more assuming that do_splice_direct will work or is appropriate) - Refactor copy_file_range and remap_range parameter checking where they are the same - Install missing copy_file_range parameter checking(!) - Remove suid/sgid and update mtime like any other file write - Change the behavior so that a copy range crossing the source file's eof will result in a short copy to the source file's eof instead of EINVAL - Permit filesystems to decide if they want to handle cross-superblock copy_file_range in their local handlers" * tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: fuse: copy_file_range needs to strip setuid bits and update timestamps vfs: allow copy_file_range to copy across devices xfs: use file_modified() helper vfs: introduce file_modified() helper vfs: add missing checks to copy_file_range vfs: remove redundant checks from generic_remap_checks() vfs: introduce generic_file_rw_checks() vfs: no fallback for ->copy_file_range vfs: introduce generic_copy_file_range()
2019-07-09Merge tag 'docs-5.3' of git://git.lwn.net/linuxLinus Torvalds
Pull Documentation updates from Jonathan Corbet: "It's been a relatively busy cycle for docs: - A fair pile of RST conversions, many from Mauro. These create more than the usual number of simple but annoying merge conflicts with other trees, unfortunately. He has a lot more of these waiting on the wings that, I think, will go to you directly later on. - A new document on how to use merges and rebases in kernel repos, and one on Spectre vulnerabilities. - Various improvements to the build system, including automatic markup of function() references because some people, for reasons I will never understand, were of the opinion that :c:func:``function()`` is unattractive and not fun to type. - We now recommend using sphinx 1.7, but still support back to 1.4. - Lots of smaller improvements, warning fixes, typo fixes, etc" * tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits) docs: automarkup.py: ignore exceptions when seeking for xrefs docs: Move binderfs to admin-guide Disable Sphinx SmartyPants in HTML output doc: RCU callback locks need only _bh, not necessarily _irq docs: format kernel-parameters -- as code Doc : doc-guide : Fix a typo platform: x86: get rid of a non-existent document Add the RCU docs to the core-api manual Documentation: RCU: Add TOC tree hooks Documentation: RCU: Rename txt files to rst Documentation: RCU: Convert RCU UP systems to reST Documentation: RCU: Convert RCU linked list to reST Documentation: RCU: Convert RCU basic concepts to reST docs: filesystems: Remove uneeded .rst extension on toctables scripts/sphinx-pre-install: fix out-of-tree build docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/ Documentation: PGP: update for newer HW devices Documentation: Add section about CPU vulnerabilities for Spectre Documentation: platform: Delete x86-laptop-drivers.txt docs: Note that :c:func: should no longer be used ...
2019-07-08Merge branch 'siginfo-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull force_sig() argument change from Eric Biederman: "A source of error over the years has been that force_sig has taken a task parameter when it is only safe to use force_sig with the current task. The force_sig function is built for delivering synchronous signals such as SIGSEGV where the userspace application caused a synchronous fault (such as a page fault) and the kernel responded with a signal. Because the name force_sig does not make this clear, and because the force_sig takes a task parameter the function force_sig has been abused for sending other kinds of signals over the years. Slowly those have been fixed when the oopses have been tracked down. This set of changes fixes the remaining abusers of force_sig and carefully rips out the task parameter from force_sig and friends making this kind of error almost impossible in the future" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits) signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus signal: Remove the signal number and task parameters from force_sig_info signal: Factor force_sig_info_to_task out of force_sig_info signal: Generate the siginfo in force_sig signal: Move the computation of force into send_signal and correct it. signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal signal: Remove the task parameter from force_sig_fault signal: Use force_sig_fault_to_task for the two calls that don't deliver to current signal: Explicitly call force_sig_fault on current signal/unicore32: Remove tsk parameter from __do_user_fault signal/arm: Remove tsk parameter from __do_user_fault signal/arm: Remove tsk parameter from ptrace_break signal/nds32: Remove tsk parameter from send_sigtrap signal/riscv: Remove tsk parameter from do_trap signal/sh: Remove tsk parameter from force_sig_info_fault signal/um: Remove task parameter from send_sigtrap signal/x86: Remove task parameter from send_sigtrap signal: Remove task parameter from force_sig_mceerr signal: Remove task parameter from force_sig signal: Remove task parameter from force_sigsegv ...
2019-07-08Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP} - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to manage the permissions of executable vmalloc regions more strictly - Slight performance improvement by keeping softirqs enabled while touching the FPSIMD/SVE state (kernel_neon_begin/end) - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new XAFLAG and AXFLAG instructions for floating point comparison flags manipulation) and FRINT (rounding floating point numbers to integers) - Re-instate ARM64_PSEUDO_NMI support which was previously marked as BROKEN due to some bugs (now fixed) - Improve parking of stopped CPUs and implement an arm64-specific panic_smp_self_stop() to avoid warning on not being able to stop secondary CPUs during panic - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI platforms - perf: DDR performance monitor support for iMX8QXP - cache_line_size() can now be set from DT or ACPI/PPTT if provided to cope with a system cache info not exposed via the CPUID registers - Avoid warning on hardware cache line size greater than ARCH_DMA_MINALIGN if the system is fully coherent - arm64 do_page_fault() and hugetlb cleanups - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep) - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the 'arm_boot_flags' introduced in 5.1) - CONFIG_RANDOMIZE_BASE now enabled in defconfig - Allow the selection of ARM64_MODULE_PLTS, currently only done via RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill over into the vmalloc area - Make ZONE_DMA32 configurable * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits) perf: arm_spe: Enable ACPI/Platform automatic module loading arm_pmu: acpi: spe: Add initial MADT/SPE probing ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens ACPI/PPTT: Modify node flag detection to find last IDENTICAL x86/entry: Simplify _TIF_SYSCALL_EMU handling arm64: rename dump_instr as dump_kernel_instr arm64/mm: Drop [PTE|PMD]_TYPE_FAULT arm64: Implement panic_smp_self_stop() arm64: Improve parking of stopped CPUs arm64: Expose FRINT capabilities to userspace arm64: Expose ARMv8.5 CondM capability to userspace arm64: defconfig: enable CONFIG_RANDOMIZE_BASE arm64: ARM64_MODULES_PLTS must depend on MODULES arm64: bpf: do not allocate executable memory arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP arm64: module: create module allocations without exec permissions arm64: Allow user selection of ARM64_MODULE_PLTS acpi/arm64: ignore 5.1 FADTs that are reported as 5.0 arm64: Allow selecting Pseudo-NMI again ...
2019-07-05Revert "mm: page cache: store only head pages in i_pages"Linus Torvalds
This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f. Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in __delete_from_swap_cache() to trigger: page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363 anon flags: 0x17fffe00080034(uptodate|lru|active|swapbacked) raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689 raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000 page dumped because: VM_BUG_ON_PAGE(entry != page) page->mem_cgroup:ffff978433ace000 ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:170! invalid opcode: 0000 [#1] SMP NOPTI CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019 RIP: 0010:__delete_from_swap_cache+0x20d/0x240 Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f RSP: 0018:ffffa982036e7980 EFLAGS: 00010046 RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900 RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535 R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120 R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000 FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0 Call Trace: delete_from_swap_cache+0x46/0xa0 try_to_free_swap+0xbc/0x110 swap_writepage+0x13/0x70 pageout.isra.0+0x13c/0x350 shrink_page_list+0xc14/0xdf0 shrink_inactive_list+0x1e5/0x3c0 shrink_node_memcg+0x202/0x760 shrink_node+0xe0/0x470 balance_pgdat+0x2d1/0x510 kswapd+0x220/0x420 kthread+0xfb/0x130 ret_from_fork+0x22/0x40 and it's not immediately obvious why it happens. It's too late in the rc cycle to do anything but revert for now. Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/ Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Suggested-by: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05swap_readpage(): avoid blk_wake_io_task() if !synchronousOleg Nesterov
swap_readpage() sets waiter = bio->bi_private even if synchronous = F, this means that the caller can get the spurious wakeup after return. This can be fatal if blk_wake_io_task() does set_current_state(TASK_RUNNING) after the caller does set_special_state(), in the worst case the kernel can crash in do_task_dead(). Link: http://lkml.kernel.org/r/20190704160301.GA5956@redhat.com Fixes: 0619317ff8baa2d ("block: add polled wakeup task helper") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Qian Cai <cai@lca.pw> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05mm/vmscan.c: prevent useless kswapd loopsShakeel Butt
In production we have noticed hard lockups on large machines running large jobs due to kswaps hoarding lru lock within isolate_lru_pages when sc->reclaim_idx is 0 which is a small zone. The lru was couple hundred GiBs and the condition (page_zonenum(page) > sc->reclaim_idx) in isolate_lru_pages() was basically skipping GiBs of pages while holding the LRU spinlock with interrupt disabled. On further inspection, it seems like there are two issues: (1) If kswapd on the return from balance_pgdat() could not sleep (i.e. node is still unbalanced), the classzone_idx is unintentionally set to 0 and the whole reclaim cycle of kswapd will try to reclaim only the lowest and smallest zone while traversing the whole memory. (2) Fundamentally isolate_lru_pages() is really bad when the allocation has woken kswapd for a smaller zone on a very large machine running very large jobs. It can hoard the LRU spinlock while skipping over 100s of GiBs of pages. This patch only fixes (1). (2) needs a more fundamental solution. To fix (1), in the kswapd context, if pgdat->kswapd_classzone_idx is invalid use the classzone_idx of the previous kswapd loop otherwise use the one the waker has requested. Link: http://lkml.kernel.org/r/20190701201847.251028-1-shakeelb@google.com Fixes: e716f2eb24de ("mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hdanton@sina.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05mm/page_alloc.c: fix regression with deferred struct page initJuergen Gross
Commit 0e56acae4b4d ("mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections") is causing a regression on some systems when the kernel is booted as Xen dom0. The system will just hang in early boot. Reason is an endless loop in get_page_from_freelist() in case the first zone looked at has no free memory. deferred_grow_zone() is always returning true due to the following code snipplet: /* If the zone is empty somebody else may have cleared out the zone */ if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, first_deferred_pfn)) { pgdat->first_deferred_pfn = ULONG_MAX; pgdat_resize_unlock(pgdat, &flags); return true; } This in turn results in the loop as get_page_from_freelist() is assuming forward progress can be made by doing some more struct page initialization. Link: http://lkml.kernel.org/r/20190620160821.4210-1-jgross@suse.com Fixes: 0e56acae4b4d ("mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections") Signed-off-by: Juergen Gross <jgross@suse.com> Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29mm, swap: fix THP swap outHuang Ying
0-Day test system reported some OOM regressions for several THP (Transparent Huge Page) swap test cases. These regressions are bisected to 6861428921b5 ("block: always define BIO_MAX_PAGES as 256"). In the commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled. So the bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out THP. That causes the OOM. As in the patch description of 6861428921b5 ("block: always define BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write THP to swap space. So the issue is fixed via doing that in get_swap_bio(). BTW: I remember I have checked the THP swap code when 6861428921b5 ("block: always define BIO_MAX_PAGES as 256") was merged, and thought the THP swap code needn't to be changed. But apparently, I was wrong. I should have done this at that time. Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warningArnd Bergmann
gcc gets confused in pcpu_get_vm_areas() because there are too many branches that affect whether 'lva' was initialized before it gets used: mm/vmalloc.c: In function 'pcpu_get_vm_areas': mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized] insert_vmap_area_augment(lva, &va->rb_node, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ &free_vmap_area_root, &free_vmap_area_list); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ mm/vmalloc.c:916:20: note: 'lva' was declared here struct vmap_area *lva; ^~~ Add an intialization to NULL, and check whether this has changed before the first use. [akpm@linux-foundation.org: tweak comments] Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Joel Fernandes <joelaf@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29mm/page_idle.c: fix oops because end_pfn is larger than max_pfnColin Ian King
Currently the calcuation of end_pfn can round up the pfn number to more than the actual maximum number of pfns, causing an Oops. Fix this by ensuring end_pfn is never more than max_pfn. This can be easily triggered when on systems where the end_pfn gets rounded up to more than max_pfn using the idle-page stress-ng stress test: sudo stress-ng --idle-page 0 BUG: unable to handle kernel paging request at 00000000000020d8 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 11039 Comm: stress-ng-idle- Not tainted 5.0.0-5-generic #6-Ubuntu Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:page_idle_get_page+0xc8/0x1a0 Code: 0f b1 0a 75 7d 48 8b 03 48 89 c2 48 c1 e8 33 83 e0 07 48 c1 ea 36 48 8d 0c 40 4c 8d 24 88 49 c1 e4 07 4c 03 24 d5 00 89 c3 be <49> 8b 44 24 58 48 8d b8 80 a1 02 00 e8 07 d5 77 00 48 8b 53 08 48 RSP: 0018:ffffafd7c672fde8 EFLAGS: 00010202 RAX: 0000000000000005 RBX: ffffe36341fff700 RCX: 000000000000000f RDX: 0000000000000284 RSI: 0000000000000275 RDI: 0000000001fff700 RBP: ffffafd7c672fe00 R08: ffffa0bc34056410 R09: 0000000000000276 R10: ffffa0bc754e9b40 R11: ffffa0bc330f6400 R12: 0000000000002080 R13: ffffe36341fff700 R14: 0000000000080000 R15: ffffa0bc330f6400 FS: 00007f0ec1ea5740(0000) GS:ffffa0bc7db00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000020d8 CR3: 0000000077d68000 CR4: 00000000000006e0 Call Trace: page_idle_bitmap_write+0x8c/0x140 sysfs_kf_bin_write+0x5c/0x70 kernfs_fop_write+0x12e/0x1b0 __vfs_write+0x1b/0x40 vfs_write+0xab/0x1b0 ksys_write+0x55/0xc0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Link: http://lkml.kernel.org/r/20190618124352.28307-1-colin.king@canonical.com Fixes: 33c3fc71c8cf ("mm: introduce idle page tracking") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29mm/oom_kill.c: fix uninitialized oc->constraintYafang Shao
In dump_oom_summary() oc->constraint is used to show oom_constraint_text, but it hasn't been set before. So the value of it is always the default value 0. We should inititialize it before. Bellow is the output when memcg oom occurs, before this patch: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0 after this patch: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0 Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.com Fixes: ef8444ea01d7 ("mm, oom: reorganize the oom report in dump_header") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wind Yu <yuzhoujian@didichuxing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHugeNaoya Horiguchi
madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline for hugepages with overcommitting enabled. That was caused by the suboptimal code in current soft-offline code. See the following part: ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL, MIGRATE_SYNC, MR_MEMORY_FAILURE); if (ret) { ... } else { /* * We set PG_hwpoison only when the migration source hugepage * was successfully dissolved, because otherwise hwpoisoned * hugepage remains on free hugepage list, then userspace will * find it as SIGBUS by allocation failure. That's not expected * in soft-offlining. */ ret = dissolve_free_huge_page(page); if (!ret) { if (set_hwpoison_free_buddy_page(page)) num_poisoned_pages_inc(); } } return ret; Here dissolve_free_huge_page() returns -EBUSY if the migration source page was freed into buddy in migrate_pages(), but even in that case we actually has a chance that set_hwpoison_free_buddy_page() succeeds. So that means current code gives up offlining too early now. dissolve_free_huge_page() checks that a given hugepage is suitable for dissolving, where we should return success for !PageHuge() case because the given hugepage is considered as already dissolved. This change also affects other callers of dissolve_free_huge_page(), which are cleaned up together. [n-horiguchi@ah.jp.nec.com: v3] Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.comLink: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Chen, Jerry T <jerry.t.chen@intel.com> Tested-by: Chen, Jerry T <jerry.t.chen@intel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com> Cc: "Chen, Jerry T" <jerry.t.chen@intel.com> Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() failsNaoya Horiguchi
The pass/fail of soft offline should be judged by checking whether the raw error page was finally contained or not (i.e. the result of set_hwpoison_free_buddy_page()), but current code do not work like that. It might lead us to misjudge the test result when set_hwpoison_free_buddy_page() fails. Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may not offline the original page and will not return an error. Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining") Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com> Cc: "Chen, Jerry T" <jerry.t.chen@intel.com> Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com> Cc: <stable@vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemaskzhong jiang
mpol_rebind_nodemask() is called for MPOL_BIND and MPOL_INTERLEAVE mempoclicies when the tasks's cpuset's mems_allowed changes. For policies created without MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES, it works by remapping the policy's allowed nodes (stored in v.nodes) using the previous value of mems_allowed (stored in w.cpuset_mems_allowed) as the domain of map and the new mems_allowed (passed as nodes) as the range of the map (see the comment of bitmap_remap() for details). The result of remapping is stored back as policy's nodemask in v.nodes, and the new value of mems_allowed should be stored in w.cpuset_mems_allowed to facilitate the next rebind, if it happens. However, 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets") introduced a bug where the result of remapping is stored in w.cpuset_mems_allowed instead. Thus, a mempolicy's allowed nodes can evolve in an unexpected way after a series of rebinding due to cpuset mems_allowed changes, possibly binding to a wrong node or a smaller number of nodes which may e.g. overload them. This patch fixes the bug so rebinding again works as intended. [vbabka@suse.cz: new changlog] Link: http://lkml.kernel.org/r/ef6a69c6-c052-b067-8f2c-9d615c619bb9@suse.cz Link: http://lkml.kernel.org/r/1558768043-23184-1-git-send-email-zhongjiang@huawei.com Fixes: 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets") Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>