summaryrefslogtreecommitdiffstats
path: root/mm/madvise.c
AgeCommit message (Collapse)Author
2017-09-06mm,fork: introduce MADV_WIPEONFORKRik van Riel
Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty in the child process after fork. This differs from MADV_DONTFORK in one important way. If a child process accesses memory that was MADV_WIPEONFORK, it will get zeroes. The address ranges are still valid, they are just empty. If a child process accesses memory that was MADV_DONTFORK, it will get a segmentation fault, since those address ranges are no longer valid in the child after fork. Since MADV_DONTFORK also seems to be used to allow very large programs to fork in systems with strict memory overcommit restrictions, changing the semantics of MADV_DONTFORK might break existing programs. MADV_WIPEONFORK only works on private, anonymous VMAs. The use case is libraries that store or cache information, and want to know that they need to regenerate it in the child process after fork. Examples of this would be: - systemd/pulseaudio API checks (fail after fork) (replacing a getpid check, which is too slow without a PID cache) - PKCS#11 API reinitialization check (mandated by specification) - glibc's upcoming PRNG (reseed after fork) - OpenSSL PRNG (reseed after fork) The security benefits of a forking server having a re-inialized PRNG in every child process are pretty obvious. However, due to libraries having all kinds of internal state, and programs getting compiled with many different versions of each library, it is unreasonable to expect calling programs to re-initialize everything manually after fork. A further complication is the proliferation of clone flags, programs bypassing glibc's functions to call clone directly, and programs calling unshare, causing the glibc pthread_atfork hook to not get called. It would be better to have the kernel take care of this automatically. The patch also adds MADV_KEEPONFORK, to undo the effects of a prior MADV_WIPEONFORK. This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO: https://man.openbsd.org/minherit.2 [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines] Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Florian Weimer <fweimer@redhat.com> Reported-by: Colm MacCártaigh <colm@allcosts.net> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31mm, madvise: ensure poisoned pages are removed from per-cpu listsMel Gorman
Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed and bisected it to the commit 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP"). The problem is that a page that was poisoned with madvise() is reused. The commit removed a check that would trigger if DEBUG_VM was enabled but re-enabling the check only fixes the problem as a side-effect by printing a bad_page warning and recovering. The root of the problem is that an madvise() can leave a poisoned page on the per-cpu list. This patch drains all per-cpu lists after pages are poisoned so that they will not be reused. Wendy reports that the test case in question passes with this patch applied. While this could be done in a targeted fashion, it is over-complicated for such a rare operation. Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Wang, Wendy <wendy.wang@intel.com> Tested-by: Wang, Wendy <wendy.wang@intel.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Hansen, Dave" <dave.hansen@intel.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-25mm/madvise.c: fix freeing of locked page with MADV_FREEEric Biggers
If madvise(..., MADV_FREE) split a transparent hugepage, it called put_page() before unlock_page(). This was wrong because put_page() can free the page, e.g. if a concurrent madvise(..., MADV_DONTNEED) has removed it from the memory mapping. put_page() then rightfully complained about freeing a locked page. Fix this by moving the unlock_page() before put_page(). This bug was found by syzkaller, which encountered the following splat: BUG: Bad page state in process syzkaller412798 pfn:1bd800 page:ffffea0006f60000 count:0 mapcount:0 mapping: (null) index:0x20a00 flags: 0x200000000040019(locked|uptodate|dirty|swapbacked) raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: 0x1(locked) Modules linked in: CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 bad_page+0x230/0x2b0 mm/page_alloc.c:565 free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943 free_pages_check mm/page_alloc.c:952 [inline] free_pages_prepare mm/page_alloc.c:1043 [inline] free_pcp_prepare mm/page_alloc.c:1068 [inline] free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584 __put_single_page mm/swap.c:79 [inline] __put_page+0xfb/0x160 mm/swap.c:113 put_page include/linux/mm.h:814 [inline] madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371 walk_pmd_range mm/pagewalk.c:50 [inline] walk_pud_range mm/pagewalk.c:108 [inline] walk_p4d_range mm/pagewalk.c:134 [inline] walk_pgd_range mm/pagewalk.c:160 [inline] __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249 walk_page_range+0x200/0x470 mm/pagewalk.c:326 madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444 madvise_free_single_vma+0x353/0x580 mm/madvise.c:471 madvise_dontneed_free mm/madvise.c:555 [inline] madvise_vma mm/madvise.c:664 [inline] SYSC_madvise mm/madvise.c:832 [inline] SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760 entry_SYSCALL_64_fastpath+0x1f/0xbe Here is a C reproducer: #define _GNU_SOURCE #include <pthread.h> #include <sys/mman.h> #include <unistd.h> #define MADV_FREE 8 #define PAGE_SIZE 4096 static void *mapping; static const size_t mapping_size = 0x1000000; static void *madvise_thrproc(void *arg) { madvise(mapping, mapping_size, (long)arg); } int main(void) { pthread_t t[2]; for (;;) { mapping = mmap(NULL, mapping_size, PROT_WRITE, MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); munmap(mapping + mapping_size / 2, PAGE_SIZE); pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED); pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE); pthread_join(t[0], NULL); pthread_join(t[1], NULL); munmap(mapping, mapping_size); } } Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and CONFIG_DEBUG_VM=y are needed. Google Bug Id: 64696096 Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [v4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02mm, mprotect: flush TLB if potentially racing with a parallel reclaim ↵Mel Gorman
leaving stale TLB entries Nadav Amit identified a theoritical race between page reclaim and mprotect due to TLB flushes being batched outside of the PTL being held. He described the race as follows: CPU0 CPU1 ---- ---- user accesses memory using RW PTE [PTE now cached in TLB] try_to_unmap_one() ==> ptep_get_and_clear() ==> set_tlb_ubc_flush_pending() mprotect(addr, PROT_READ) ==> change_pte_range() ==> [ PTE non-present - no flush ] user writes using cached RW PTE ... try_to_unmap_flush() The same type of race exists for reads when protecting for PROT_NONE and also exists for operations that can leave an old TLB entry behind such as munmap, mremap and madvise. For some operations like mprotect, it's not necessarily a data integrity issue but it is a correctness issue as there is a window where an mprotect that limits access still allows access. For munmap, it's potentially a data integrity issue although the race is massive as an munmap, mmap and return to userspace must all complete between the window when reclaim drops the PTL and flushes the TLB. However, it's theoritically possible so handle this issue by flushing the mm if reclaim is potentially currently batching TLB flushes. Other instances where a flush is required for a present pte should be ok as either the page lock is held preventing parallel reclaim or a page reference count is elevated preventing a parallel free leading to corruption. In the case of page_mkclean there isn't an obvious path that userspace could take advantage of without using the operations that are guarded by this patch. Other users such as gup as a race with reclaim looks just at PTEs. huge page variants should be ok as they don't race with reclaim. mincore only looks at PTEs. userfault also should be ok as if a parallel reclaim takes place, it will either fault the page back in or read some of the data before the flush occurs triggering a fault. Note that a variant of this patch was acked by Andy Lutomirski but this was for the x86 parts on top of his PCID work which didn't make the 4.13 merge window as expected. His ack is dropped from this version and there will be a follow-on patch on top of PCID that will include his ack. [akpm@linux-foundation.org: tweak comments] [akpm@linux-foundation.org: fix spello] Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: <stable@vger.kernel.org> [v4.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10userfaultfd: non-cooperative: add madvise() event for MADV_FREE requestMike Rapoport
MADV_FREE is identical to MADV_DONTNEED from the point of view of uffd monitor. The monitor has to stop handling #PF events in the range being freed. We are reusing userfaultfd_remove callback along with the logic required to re-get and re-validate the VMA which may change or disappear because userfaultfd_remove releases mmap_sem. Link: http://lkml.kernel.org/r/1497876311-18615-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10swap: add block io poll in swapin pathShaohua Li
For fast flash disk, async IO could introduce overhead because of context switch. block-mq now supports IO poll, which improves performance and latency a lot. swapin is a good place to use this technique, because the task is waiting for the swapin page to continue execution. In my virtual machine, directly read 4k data from a NVMe with iopoll is about 60% better than that without poll. With iopoll support in swapin patch, my microbenchmark (a task does random memory write) is about 10%~25% faster. CPU utilization increases a lot though, 2x and even 3x CPU utilization. This will depend on disk speed. While iopoll in swapin isn't intended for all usage cases, it's a win for latency sensistive workloads with high speed swap disk. block layer has knob to control poll in runtime. If poll isn't enabled in block layer, there should be no noticeable change in swapin. I got a chance to run the same test in a NVMe with DRAM as the media. In simple fio IO test, blkpoll boosts 50% performance in single thread test and ~20% in 8 threads test. So this is the base line. In above swap test, blkpoll boosts ~27% performance in single thread test. blkpoll uses 2x CPU time though. If we enable hybid polling, the performance gain has very slight drop but CPU time is only 50% worse than that without blkpoll. Also we can adjust parameter of hybid poll, with it, the CPU time penality is reduced further. In 8 threads test, blkpoll doesn't help though. The performance is similar to that without blkpoll, but cpu utilization is similar too. There is lock contention in swap path. The cpu time spending on blkpoll isn't high. So overall, blkpoll swapin isn't worse than that without it. The swapin readahead might read several pages in in the same time and form a big IO request. Since the IO will take longer time, it doesn't make sense to do poll, so the patch only does iopoll for single page swapin. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c Signed-off-by: Shaohua Li <shli@fb.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Jens Axboe <axboe@fb.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm/madvise: move up the behavior parameter validationAnshuman Khandual
madvise_behavior_valid() should be called before acting upon the behavior parameter. Hence move up the function. This also includes MADV_SOFT_OFFLINE and MADV_HWPOISON options as valid behavior parameter for the system call madvise(). Link: http://lkml.kernel.org/r/20170418052844.24891-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm/madvise.c: clean up MADV_SOFT_OFFLINE and MADV_HWPOISONAnshuman Khandual
This cleans up handling MADV_SOFT_OFFLINE and MADV_HWPOISON called through madvise() system call. * madvise_memory_failure() was misleading to accommodate handling of both memory_failure() as well as soft_offline_page() functions. Basically it handles memory error injection from user space which can go either way as memory failure or soft offline. Renamed as madvise_inject_error() instead. * Renamed struct page pointer 'p' to 'page'. * pr_info() was essentially printing PFN value but it said 'page' which was misleading. Made the process virtual address explicit. Before the patch: Soft offlining page 0x15e3e at 0x3fff8c230000 Soft offlining page 0x1f3 at 0x3fffa0da0000 Soft offlining page 0x744 at 0x3fff7d200000 Soft offlining page 0x1634d at 0x3fff95e20000 Soft offlining page 0x16349 at 0x3fff95e30000 Soft offlining page 0x1d6 at 0x3fff9e8b0000 Soft offlining page 0x5f3 at 0x3fff91bd0000 Injecting memory failure for page 0x15c8b at 0x3fff83280000 Injecting memory failure for page 0x16190 at 0x3fff83290000 Injecting memory failure for page 0x740 at 0x3fff9a2e0000 Injecting memory failure for page 0x741 at 0x3fff9a2f0000 After the patch: Soft offlining pfn 0x1484e at process virtual address 0x3fff883c0000 Soft offlining pfn 0x1484f at process virtual address 0x3fff883d0000 Soft offlining pfn 0x14850 at process virtual address 0x3fff883e0000 Soft offlining pfn 0x14851 at process virtual address 0x3fff883f0000 Soft offlining pfn 0x14852 at process virtual address 0x3fff88400000 Soft offlining pfn 0x14853 at process virtual address 0x3fff88410000 Soft offlining pfn 0x14854 at process virtual address 0x3fff88420000 Soft offlining pfn 0x1521c at process virtual address 0x3fff6bc70000 Injecting memory failure for pfn 0x10fcf at process virtual address 0x3fff86310000 Injecting memory failure for pfn 0x10fd0 at process virtual address 0x3fff86320000 Injecting memory failure for pfn 0x10fd1 at process virtual address 0x3fff86330000 Injecting memory failure for pfn 0x10fd2 at process virtual address 0x3fff86340000 Injecting memory failure for pfn 0x10fd3 at process virtual address 0x3fff86350000 Injecting memory failure for pfn 0x10fd4 at process virtual address 0x3fff86360000 Injecting memory failure for pfn 0x10fd5 at process virtual address 0x3fff86370000 Link: http://lkml.kernel.org/r/20170410084701.11248-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: enable MADV_FREE for swapless systemShaohua Li
Now MADV_FREE pages can be easily reclaimed even for swapless system. We can safely enable MADV_FREE for all systems. Link: http://lkml.kernel.org/r/155648585589300bfae1d45078e7aebb3d988b87.1487965799.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: reclaim MADV_FREE pagesShaohua Li
When memory pressure is high, we free MADV_FREE pages. If the pages are not dirty in pte, the pages could be freed immediately. Otherwise we can't reclaim them. We put the pages back to anonumous LRU list (by setting SwapBacked flag) and the pages will be reclaimed in normal swapout way. We use normal page reclaim policy. Since MADV_FREE pages are put into inactive file list, such pages and inactive file pages are reclaimed according to their age. This is expected, because we don't want to reclaim too many MADV_FREE pages before used once pages. Based on Minchan's original patch [minchan@kernel.org: clean up lazyfree page handling] Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: move MADV_FREE pages into LRU_INACTIVE_FILE listShaohua Li
madv()'s MADV_FREE indicate pages are 'lazyfree'. They are still anonymous pages, but they can be freed without pageout. To distinguish these from normal anonymous pages, we clear their SwapBacked flag. MADV_FREE pages could be freed without pageout, so they pretty much like used once file pages. For such pages, we'd like to reclaim them once there is memory pressure. Also it might be unfair reclaiming MADV_FREE pages always before used once file pages and we definitively want to reclaim the pages before other anonymous and file pages. To speed up MADV_FREE pages reclaim, we put the pages into LRU_INACTIVE_FILE list. The rationale is LRU_INACTIVE_FILE list is tiny nowadays and should be full of used once file pages. Reclaiming MADV_FREE pages will not have much interfere of anonymous and active file pages. And the inactive file pages and MADV_FREE pages will be reclaimed according to their age, so we don't reclaim too many MADV_FREE pages too. Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also means we can reclaim the pages without swap support. This idea is suggested by Johannes. This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to avoid bisect failure, next patch will do it. The patch is based on Minchan's original patch. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEEDAndrea Arcangeli
userfaultfd_remove() has to be execute before zapping the pagetables or UFFDIO_COPY could keep filling pages after zap_page_range returned, which would result in non zero data after a MADV_DONTNEED. However userfaultfd_remove() may have to release the mmap_sem. This was handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a potentially stale vma (the very vma passed to zap_page_range(vma, ...)). The fix consists in revalidating the vma in case userfaultfd_remove() had to release the mmap_sem. This also optimizes away an unnecessary down_read/up_read in the MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered. It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as userfaultfd_remove() will be defined as "true" at build time. Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24mm: remove shmem_mapping() shmem_zero_setup() duplicatesHugh Dickins
Remove the prototypes for shmem_mapping() and shmem_zero_setup() from linux/mm.h, since they are already provided in linux/shmem_fs.h. But shmem_fs.h must then provide the inline stub for shmem_mapping() when CONFIG_SHMEM is not set, and a few more cfiles now need to #include it. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702081658250.1549@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24mm, madvise: fail with ENOMEM when splitting vma will hit max_map_countDavid Rientjes
If madvise(2) advice will result in the underlying vma being split and the number of areas mapped by the process will exceed /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN. EAGAIN is returned by madvise(2) when a kernel resource, such as slab, is temporarily unavailable. It indicates that userspace should retry the advice in the near future. This is important for advice such as MADV_DONTNEED which is often used by malloc implementations to free memory back to the system: we really do want to free memory back when madvise(2) returns EAGAIN because slab allocations (for vmas, anon_vmas, or mempolicies) cannot be allocated. Encountering /proc/sys/vm/max_map_count is not a temporary failure, however, so return ENOMEM to indicate this is a more serious issue. A followup patch to the man page will specify this behavior. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24userfaultfd: non-cooperative: add madvise() event for MADV_REMOVE requestMike Rapoport
When a page is removed from a shared mapping, the uffd reader should be notified, so that it won't attempt to handle #PF events for the removed pages. We can reuse the UFFD_EVENT_REMOVE because from the uffd monitor point of view, the semantices of madvise(MADV_DONTNEED) and madvise(MADV_REMOVE) is exactly the same. Link: http://lkml.kernel.org/r/1484814154-1557-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24userfaultfd: non-cooperative: rename *EVENT_MADVDONTNEED to *EVENT_REMOVEMike Rapoport
Patch series "userfaultfd: non-cooperative: add madvise() event for MADV_REMOVE request". These patches add notification of madvise(MADV_REMOVE) event to non-cooperative userfaultfd monitor. The first pacth renames EVENT_MADVDONTNEED to EVENT_REMOVE along with relevant functions and structures. Using _REMOVE instead of _MADVDONTNEED describes the event semantics more clearly and I hope it's not too late for such change in the ABI. This patch (of 3): The UFFD_EVENT_MADVDONTNEED purpose is to notify uffd monitor about removal of certain range from address space tracked by userfaultfd. Hence, UFFD_EVENT_REMOVE seems to better reflect the operation semantics. Respectively, 'madv_dn' field of uffd_msg is renamed to 'remove' and the madvise_userfault_dontneed callback is renamed to userfaultfd_remove. Link: http://lkml.kernel.org/r/1484814154-1557-2-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22oom-reaper: use madvise_dontneed() logic to decide if unmap the VMAKirill A. Shutemov
Logic on whether we can reap pages from the VMA should match what we have in madvise_dontneed(). In particular, we should skip, VM_PFNMAP VMAs, but we don't now. Let's just extract condition on which we can shoot down pagesi from a VMA with MADV_DONTNEED into separate function and use it in both places. Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: drop unused argument of zap_page_range()Kirill A. Shutemov
There's no users of zap_page_range() who wants non-NULL 'details'. Let's drop it. Link: http://lkml.kernel.org/r/20170118122429.43661-3-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: avoid MADV_DONTNEED race conditionAndrea Arcangeli
MADV_DONTNEED must be notified to userland before the pages are zapped. This allows userland to immediately stop adding pages to the userfaultfd ranges before the pages are actually zapped or there could be non-zeropage leftovers as result of concurrent UFFDIO_COPY run in between zap_page_range and madvise_userfault_dontneed (both MADV_DONTNEED and UFFDIO_COPY runs under the mmap_sem for reading, so they can run concurrently). Link: http://lkml.kernel.org/r/20161216144821.5183-15-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: add madvise() event for MADV_DONTNEED requestPavel Emelyanov
If the page is punched out of the address space the uffd reader should know this and zeromap the respective area in case of the #PF event. Link: http://lkml.kernel.org/r/20161216144821.5183-14-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12mm: add tlb_remove_check_page_size_change to track page size changeAneesh Kumar K.V
With commit e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force flush if page size change") we added the ability to force a tlb flush when the page size change in a mmu_gather loop. We did that by checking for a page size change every time we added a page to mmu_gather for lazy flush/remove. We can improve that by moving the page size change check early and not doing it every time we add a page. This also helps us to do tlb flush when invalidating a range covering dax mapping. Wrt dax mapping we don't have a backing struct page and hence we don't call tlb_remove_page, which earlier forced the tlb flush on page size change. Moving the page size change check earlier means we will do the same even for dax mapping. We also avoid doing this check on architecture other than powerpc. In a later patch we will remove page size check from tlb_remove_page(). Link: http://lkml.kernel.org/r/20161026084839.27299-5-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23mm: make mmap_sem for write waits killable for mm syscallsMichal Hocko
This is a follow up work for oom_reaper [1]. As the async OOM killing depends on oom_sem for read we would really appreciate if a holder for write didn't stood in the way. This patchset is changing many of down_write calls to be killable to help those cases when the writer is blocked and waiting for readers to release the lock and so help __oom_reap_task to process the oom victim. Most of the patches are really trivial because the lock is help from a shallow syscall paths where we can return EINTR trivially and allow the current task to die (note that EINTR will never get to the userspace as the task has fatal signal pending). Others seem to be easy as well as the callers are already handling fatal errors and bail and return to userspace which should be sufficient to handle the failure gracefully. I am not familiar with all those code paths so a deeper review is really appreciated. As this work is touching more areas which are not directly connected I have tried to keep the CC list as small as possible and people who I believed would be familiar are CCed only to the specific patches (all should have received the cover though). This patchset is based on linux-next and it depends on down_write_killable for rw_semaphores which got merged into tip locking/rwsem branch and it is merged into this next tree. I guess it would be easiest to route these patches via mmotm because of the dependency on the tip tree but if respective maintainers prefer other way I have no objections. I haven't covered all the mmap_write(mm->mmap_sem) instances here $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l 98 $ git grep "down_write(.*\<mmap_sem\>)" | wc -l 62 I have tried to cover those which should be relatively easy to review in this series because this alone should be a nice improvement. Other places can be changed on top. [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org This patch (of 18): This is the first step in making mmap_sem write waiters killable. It focuses on the trivial ones which are taking the lock early after entering the syscall and they are not changing state before. Therefore it is very easy to change them to use down_write_killable and immediately return with -EINTR. This will allow the waiter to pass away without blocking the mmap_sem which might be required to make a forward progress. E.g. the oom reaper will need the lock for reading to dismantle the OOM victim address space. The only tricky function in this patch is vm_mmap_pgoff which has many call sites via vm_mmap. To reduce the risk keep vm_mmap with the original non-killable semantic for now. vm_munmap callers do not bother checking the return value so open code it into the munmap syscall path for now for simplicity. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15mm/madvise: update comment on sys_madvise()Naoya Horiguchi
Some new MADV_* advices are not documented in sys_madvise() comment. So let's update it. [akpm@linux-foundation.org: modifications suggested by Michal] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Jason Baron <jbaron@redhat.com> Cc: Chen Gong <gong.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15mm/madvise: pass return code of memory_failure() to userspaceNaoya Horiguchi
Currently the return value of memory_failure() is not passed to userspace when madvise(MADV_HWPOISON) is used. This is inconvenient for test programs that want to know the result of error handling. So let's return it to the caller as we already do in the MADV_SOFT_OFFLINE case. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Chen Gong <gong.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15mm/huge_memory.c: don't split THP page when MADV_FREE syscall is calledMinchan Kim
We don't need to split THP page when MADV_FREE syscall is called if [start, len] is aligned with THP size. The split could be done when VM decide to free it in reclaim path if memory pressure is heavy. With that, we could avoid unnecessary THP split. For the feature, this patch changes pte dirtness marking logic of THP. Now, it marks every ptes of pages dirty unconditionally in splitting, which makes MADV_FREE void. So, instead, this patch propagates pmd dirtiness to all pages via PG_dirty and restores pte dirtiness from PG_dirty. With this, if pmd is clean(ie, MADV_FREEed) when split happens(e,g, shrink_page_list), all of pages are clean too so we could discard them. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Shaohua Li <shli@kernel.org> Cc: <yalin.wang2010@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jason Evans <je@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mika Penttil <mika.penttila@nextfour.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Rik van Riel <riel@redhat.com> Cc: Roland Dreier <roland@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Shaohua Li <shli@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15mm: move lazily freed pages to inactive listMinchan Kim
MADV_FREE is a hint that it's okay to discard pages if there is memory pressure and we use reclaimers(ie, kswapd and direct reclaim) to free them so there is no value keeping them in the active anonymous LRU so this patch moves them to inactive LRU list's head. This means that MADV_FREE-ed pages which were living on the inactive list are reclaimed first because they are more likely to be cold rather than recently active pages. An arguable issue for the approach would be whether we should put the page to the head or tail of the inactive list. I chose head because the kernel cannot make sure it's really cold or warm for every MADV_FREE usecase but at least we know it's not *hot*, so landing of inactive head would be a comprimise for various usecases. This fixes suboptimal behavior of MADV_FREE when pages living on the active list will sit there for a long time even under memory pressure while the inactive list is reclaimed heavily. This basically breaks the whole purpose of using MADV_FREE to help the system to free memory which is might not be used. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: <yalin.wang2010@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jason Evans <je@fb.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Mika Penttil <mika.penttila@nextfour.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Roland Dreier <roland@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15mm/madvise.c: free swp_entry in madvise_freeMinchan Kim
When I test below piece of code with 12 processes(ie, 512M * 12 = 6G consume) on my (3G ram + 12 cpu + 8G swap, the madvise_free is siginficat slower (ie, 2x times) than madvise_dontneed. loop = 5; mmap(512M); while (loop--) { memset(512M); madvise(MADV_FREE or MADV_DONTNEED); } The reason is lots of swapin. 1) dontneed: 1,612 swapin 2) madvfree: 879,585 swapin If we find hinted pages were already swapped out when syscall is called, it's pointless to keep the swapped-out pages in pte. Instead, let's free the cold page because swapin is more expensive than (alloc page + zeroing). With this patch, it reduced swapin from 879,585 to 1,878 so elapsed time 1) dontneed: 6.10user 233.50system 0:50.44elapsed 2) madvfree: 6.03user 401.17system 1:30.67elapsed 2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Shaohua Li <shli@kernel.org> Cc: <yalin.wang2010@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jason Evans <je@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Mika Penttil <mika.penttila@nextfour.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Rik van Riel <riel@redhat.com> Cc: Roland Dreier <roland@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Shaohua Li <shli@kernel.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15mm: support madvise(MADV_FREE)Minchan Kim
Linux doesn't have an ability to free pages lazy while other OS already have been supported that named by madvise(MADV_FREE). The gain is clear that kernel can discard freed pages rather than swapping out or OOM if memory pressure happens. Without memory pressure, freed pages would be reused by userspace without another additional overhead(ex, page fault + allocation + zeroing). Jason Evans said: : Facebook has been using MAP_UNINITIALIZED : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for : several years, but there are operational costs to maintaining this : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it : in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it : increased throughput for much of our workload by ~5%, and although the : benefit has decreased using newer hardware and kernels, there is still : enough benefit that we cannot reasonably retire it without a replacement. : : Aside from Facebook operations, there are numerous broadly used : applications that would benefit from MADV_FREE. The ones that immediately : come to mind are redis, varnish, and MariaDB. I don't have much insight : into Android internals and development process, but I would hope to see : MADV_FREE support eventually end up there as well to benefit applications : linked with the integrated jemalloc. : : jemalloc will use MADV_FREE once it becomes available in the Linux kernel. : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux : (and AIX, but I'm not sure it even compiles on AIX). The lack of : MADV_FREE on Linux forced me down a long series of increasingly : sophisticated heuristics for madvise() volume reduction, and even so this : remains a common performance issue for people using jemalloc on Linux. : Please integrate MADV_FREE; many people will benefit substantially. How it works: When madvise syscall is called, VM clears dirty bit of ptes of the range. If memory pressure happens, VM checks dirty bit of page table and if it found still "clean", it means it's a "lazyfree pages" so VM could discard the page instead of swapping out. Once there was store operation for the page before VM peek a page to reclaim, dirty bit is set so VM can swap out the page instead of discarding. One thing we should notice is that basically, MADV_FREE relies on dirty bit in page table entry to decide whether VM allows to discard the page or not. IOW, if page table entry includes marked dirty bit, VM shouldn't discard the page. However, as a example, if swap-in by read fault happens, page table entry doesn't have dirty bit so MADV_FREE could discard the page wrongly. For avoiding the problem, MADV_FREE did more checks with PageDirty and PageSwapCache. It worked out because swapped-in page lives on swap cache and since it is evicted from the swap cache, the page has PG_dirty flag. So both page flags check effectively prevent wrong discarding by MADV_FREE. However, a problem in above logic is that swapped-in page has PG_dirty still after they are removed from swap cache so VM cannot consider the page as freeable any more even if madvise_free is called in future. Look at below example for detail. ptr = malloc(); memset(ptr); .. .. .. heavy memory pressure so all of pages are swapped out .. .. var = *ptr; -> a page swapped-in and could be removed from swapcache. Then, page table doesn't mark dirty bit and page descriptor includes PG_dirty .. .. madvise_free(ptr); -> It doesn't clear PG_dirty of the page. .. .. .. .. heavy memory pressure again. .. In this time, VM cannot discard the page because the page .. has *PG_dirty* To solve the problem, this patch clears PG_dirty if only the page is owned exclusively by current process when madvise is called because PG_dirty represents ptes's dirtiness in several processes so we could clear it only if we own it exclusively. Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already have supported the feature for other OS(ex, FreeBSD) barrios@blaptop:~/benchmark/ebizzy$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 12 On-line CPU(s) list: 0-11 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 12 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: