summaryrefslogtreecommitdiffstats
path: root/mm/swap.c
AgeCommit message (Collapse)Author
2017-05-03mm: move MADV_FREE pages into LRU_INACTIVE_FILE listShaohua Li
madv()'s MADV_FREE indicate pages are 'lazyfree'. They are still anonymous pages, but they can be freed without pageout. To distinguish these from normal anonymous pages, we clear their SwapBacked flag. MADV_FREE pages could be freed without pageout, so they pretty much like used once file pages. For such pages, we'd like to reclaim them once there is memory pressure. Also it might be unfair reclaiming MADV_FREE pages always before used once file pages and we definitively want to reclaim the pages before other anonymous and file pages. To speed up MADV_FREE pages reclaim, we put the pages into LRU_INACTIVE_FILE list. The rationale is LRU_INACTIVE_FILE list is tiny nowadays and should be full of used once file pages. Reclaiming MADV_FREE pages will not have much interfere of anonymous and active file pages. And the inactive file pages and MADV_FREE pages will be reclaimed according to their age, so we don't reclaim too many MADV_FREE pages too. Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also means we can reclaim the pages without swap support. This idea is suggested by Johannes. This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to avoid bisect failure, next patch will do it. The patch is based on Minchan's original patch. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-01Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "The main x86 MM changes in this cycle were: - continued native kernel PCID support preparation patches to the TLB flushing code (Andy Lutomirski) - various fixes related to 32-bit compat syscall returning address over 4Gb in applications, launched from 64-bit binaries - motivated by C/R frameworks such as Virtuozzo. (Dmitry Safonov) - continued Intel 5-level paging enablement: in particular the conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov) - x86/mpx ABI corner case fixes/enhancements (Joerg Roedel) - ... plus misc updates, fixes and cleanups" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash x86/mm: Fix flush_tlb_page() on Xen x86/mm: Make flush_tlb_mm_range() more predictable x86/mm: Remove flush_tlb() and flush_tlb_current_task() x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly() x86/mm/64: Fix crash in remove_pagetable() Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" x86/boot/e820: Remove a redundant self assignment x86/mm: Fix dump pagetables for 4 levels of page tables x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()" x86/espfix: Add support for 5-level paging x86/kasan: Extend KASAN to support 5-level paging x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y x86/paravirt: Add 5-level support to the paravirt code x86/mm: Define virtual memory map for 5-level paging x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert x86/boot: Detect 5-level paging support x86/mm/numa: Remove numa_nodemask_from_meminfo() ...
2017-05-01mm, zone_device: Replace {get, put}_zone_device_page() with a single ↵Dan Williams
reference to fix pmem crash The x86 conversion to the generic GUP code included a small change which causes crashes and data corruption in the pmem code - not good. The root cause is that the /dev/pmem driver code implicitly relies on the x86 get_user_pages() implementation doing a get_page() on the page refcount, because get_page() does a get_zone_device_page() which properly refcounts pmem's separate page struct arrays that are not present in the regular page struct structures. (The pmem driver does this because it can cover huge memory areas.) But the x86 conversion to the generic GUP code changed the get_page() to page_cache_get_speculative() which is faster but doesn't do the get_zone_device_page() call the pmem code relies on. One way to solve the regression would be to change the generic GUP code to use get_page(), but that would slow things down a bit and punish other generic-GUP using architectures for an x86-ism they did not care about. (Arguably the pmem driver was probably not working reliably for them: but nvdimm is an Intel feature, so non-x86 exposure is probably still limited.) So restructure the pmem code's interface with the MM instead: get rid of the get/put_zone_device_page() distinction, integrate put_zone_device_page() into __put_page() and and restructure the pmem completion-wait and teardown machinery: Kirill points out that the calls to {get,put}_dev_pagemap() can be removed from the mm fast path if we take a single get_dev_pagemap() reference to signify that the page is alive and use the final put of the page to drop that reference. This does require some care to make sure that any waits for the percpu_ref to drop to zero occur *after* devm_memremap_page_release(), since it now maintains its own elevated reference. This speeds up things while also making the pmem refcounting more robust going forward. Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-08mm: move pcp and lru-pcp draining into single wqMichal Hocko
We currently have 2 specific WQ_RECLAIM workqueues in the mm code. vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain per cpu lru caches. This seems more than necessary because both can run on a single WQ. Both do not block on locks requiring a memory allocation nor perform any allocations themselves. We will save one rescuer thread this way. On the other hand drain_all_pages() queues work on the system wq which doesn't have rescuer and so this depend on memory allocation (when all workers are stuck allocating and new ones cannot be created). Initially we thought this would be more of a theoretical problem but Hugh Dickins has reported: : 4.11-rc has been giving me hangs after hours of swapping load. At : first they looked like memory leaks ("fork: Cannot allocate memory"); : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh" : before looking at /proc/meminfo one time, and the stat_refresh stuck : in D state, waiting for completion of flush_work like many kworkers. : kthreadd waiting for completion of flush_work in drain_all_pages(). This worker should be using WQ_RECLAIM as well in order to guarantee a forward progress. We can reuse the same one as for lru draining and vmstat. Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Yang Li <pku.leo@gmail.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24mm: vmscan: move dirty pages out of the way until they're flushedJohannes Weiner
We noticed a performance regression when moving hadoop workloads from 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout activity initiated by kswapd as well as frequent bursts of allocation stalls and direct reclaim scans. Even lowering the dirty ratios to the equivalent of less than 1% of memory would not eliminate the issue, suggesting that dirty pages concentrate where the scanner is looking. This can be traced back to recent efforts of thrash avoidance. Where 3.10 would not detect refaulting pages and continuously supply clean cache to the inactive list, a thrashing workload on 4.0+ will detect and activate refaulting pages right away, distilling used-once pages on the inactive list much more effectively. This is by design, and it makes sense for clean cache. But for the most part our workload's cache faults are refaults and its use-once cache is from streaming writes. We end up with most of the inactive list dirty, and we don't go after the active cache as long as we have use-once pages around. But waiting for writes to avoid reclaiming clean cache that *might* refault is a bad trade-off. Even if the refaults happen, reads are faster than writes. Before getting bogged down on writeback, reclaim should first look at *all* cache in the system, even active cache. To accomplish this, activate pages that are dirty or under writeback when they reach the end of the inactive LRU. The pages are marked for immediate reclaim, meaning they'll get moved back to the inactive LRU tail as soon as they're written back and become reclaimable. But in the meantime, by reducing the inactive list to only immediately reclaimable pages, we allow the scanner to deactivate and refill the inactive list with clean cache from the active list tail to guarantee forward progress. [hannes@cmpxchg.org: update comment] Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm/swap: split swap cache into 64MB trunksHuang, Ying
The patch is to improve the scalability of the swap out/in via using fine grained locks for the swap cache. In current kernel, one address space will be used for each swap device. And in the common configuration, the number of the swap device is very small (one is typical). This causes the heavy lock contention on the radix tree of the address space if multiple tasks swap out/in concurrently. But in fact, there is no dependency between pages in the swap cache. So that, we can split the one shared address space for each swap device into several address spaces to reduce the lock contention. In the patch, the shared address space is split into 64MB trunks. 64MB is chosen to balance the memory space usage and effect of lock contention reduction. The size of struct address_space on x86_64 architecture is 408B, so with the patch, 6528B more memory will be used for every 1GB swap space on x86_64 architecture. One address space is still shared for the swap entries in the same 64M trunks. To avoid lock contention for the first round of swap space allocation, the order of the swap clusters in the initial free clusters list is changed. The swap space distance between the consecutive swap clusters in the free cluster list is at least 64M. After the first round of allocation, the swap clusters are expected to be freed randomly, so the lock contention should be reduced effectively. Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-25mm: add PageWaiters indicating tasks are waiting for a page bitNicholas Piggin
Add a new page flag, PageWaiters, to indicate the page waitqueue has tasks waiting. This can be tested rather than testing waitqueue_active which requires another cacheline load. This bit is always set when the page has tasks on page_waitqueue(page), and is set and cleared under the waitqueue lock. It may be set when there are no tasks on the waitqueue, which will cause a harmless extra wakeup check that will clears the bit. The generic bit-waitqueue infrastructure is no longer used for pages. Instead, waitqueues are used directly with a custom key type. The generic code was not flexible enough to have PageWaiters manipulation under the waitqueue lock (which simplifies concurrency). This improves the performance of page lock intensive microbenchmarks by 2-3%. Putting two bits in the same word opens the opportunity to remove the memory barrier between clearing the lock bit and testing the waiters bit, after some work on the arch primitives (e.g., ensuring memory operand widths match and cover both bits). Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Andrew Lutomirski <luto@kernel.org> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07thp: reduce usage of huge zero page's atomic counterAaron Lu
The global zero page is used to satisfy an anonymous read fault. If THP(Transparent HugePage) is enabled then the global huge zero page is used. The global huge zero page uses an atomic counter for reference counting and is allocated/freed dynamically according to its counter value. CPU time spent on that counter will greatly increase if there are a lot of processes doing anonymous read faults. This patch proposes a way to reduce the access to the global counter so that the CPU load can be reduced accordingly. To do this, a new flag of the mm_struct is introduced: MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only need to touch the global counter in two cases: 1 The first time it uses the global huge zero page; 2 The time when mm_user of its mm_struct reaches zero. Note that right now, the huge zero page is eligible to be freed as soon as its last use goes away. With this patch, the page will not be eligible to be freed until the exit of the last process from which it was ever used. And with the use of mm_user, the kthread is not eligible to use huge zero page either. Since no kthread is using huge zero page today, there is no difference after applying this patch. But if that is not desired, I can change it to when mm_count reaches zero. Case used for test on Haswell EP: usemem -n 72 --readonly -j 0x200000 100G Which spawns 72 processes and each will mmap 100G anonymous space and then do read only access to that space sequentially with a step of 2MB. CPU cycles from perf report for base commit: 54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page CPU cycles from perf report for this commit: 0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page Performance(throughput) of the workload for base commit: 1784430792 Performance(throughput) of the workload for this commit: 4726928591 164% increase. Runtime of the workload for base commit: 707592 us Runtime of the workload for this commit: 303970 us 50% drop. Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, pagevec: release/reacquire lru_lock on pgdat changeMel Gorman
With node-lru, the locking is based on the pgdat. Previously it was required that a pagevec drain released one zone lru_lock and acquired another zone lru_lock on every zone change. Now, it's only necessary if the node changes. The end-result is fewer lock release/acquires if the pages are all on the same node but in different zones. Link: http://lkml.kernel.org/r/1468588165-12461-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, vmscan: move LRU lists to nodeMel Gorman
This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks. Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review. In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate with the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28mm, vmscan: move lru_lock to the nodeMel Gorman
Node-based reclaim requires node-based LRUs and locking. This is a preparation patch that just moves the lru_lock to the node so later patches are easier to review. It is a mechanical change but note this patch makes contention worse because the LRU lock is hotter and direct reclaim and kswapd can contend on the same lock even when reclaiming from different zones. Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26shmem: add huge pages supportKirill A. Shutemov
Here's basic implementation of huge pages support for shmem/tmpfs. It's all pretty streight-forward: - shmem_getpage() allcoates huge page if it can and try to inserd into radix tree with shmem_add_to_page_cache(); - shmem_add_to_page_cache() puts the page onto radix-tree if there's space for it; - shmem_undo_range() removes huge pages, if it fully within range. Partial truncate of huge pages zero out this part of THP. This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour. As we don't really create hole in this case, lseek(SEEK_HOLE) may have inconsistent results depending what pages happened to be allocated. - no need to change shmem_fault: core-mm will map an compound page as huge if VMA is suitable; Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24mm/swap.c: flush lru pvecs on compound page arrivalLukasz Odzioba
Currently we can have compound pages held on per cpu pagevecs, which leads to a lot of memory unavailable for reclaim when needed. In the systems with hundreads of processors it can be GBs of memory. On of the way of reproducing the problem is to not call munmap explicitly on all mapped regions (i.e. after receiving SIGTERM). After that some pages (with THP enabled also huge pages) may end up on lru_add_pvec, example below. void main() { #pragma omp parallel { size_t size = 55 * 1000 * 1000; // smaller than MEM/CPUS void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS , -1, 0); if (p != MAP_FAILED) memset(p, 0, size); //munmap(p, size); // uncomment to make the problem go away } } When we run it with THP enabled it will leave significant amount of memory on lru_add_pvec. This memory will be not reclaimed if we hit OOM, so when we run above program in a loop: for i in `seq 100`; do ./a.out; done many processes (95% in my case) will be killed by OOM. The primary point of the LRU add cache is to save the zone lru_lock contention with a hope that more pages will belong to the same zone and so their addition can be batched. The huge page is already a form of batched addition (it will add 512 worth of memory in one go) so skipping the batching seems like a safer option when compared to a potential excess in the caching which can be quite large and much harder to fix because lru_add_drain_all is way to expensive and it is not really clear what would be a good moment to call it. Similarly we can reproduce the problem on lru_deactivate_pvec by adding: madvise(p, size, MADV_FREE); after memset. This patch flushes lru pvecs on compound page arrival making the problem less severe - after applying it kill rate of above example drops to 0%, due to reducing maximum amount of memory held on pvec from 28MB (with THP) to 56kB per CPU. Suggested-by: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.com Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Ming Li <mingli199x@qq.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-09mm: introduce dedicated WQ_MEM_RECLAIM workqueue to do lru_add_drain_allWang Sheng-Hui
This patch is based on https://patchwork.ozlabs.org/patch/574623/. Tejun submitted commit 23d11a58a9a6 ("workqueue: skip flush dependency checks for legacy workqueues") for the legacy create*_workqueue() interface. But some workq created by alloc_workqueue still reports warning on memory reclaim, e.g nvme_workq with flag WQ_MEM_RECLAIM set: workqueue: WQ_MEM_RECLAIM nvme:nvme_reset_work is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu ------------[ cut here ]------------ WARNING: CPU: 0 PID: 6 at SoC/linux/kernel/workqueue.c:2448 check_flush_dependency+0xb4/0x10c ... check_flush_dependency+0xb4/0x10c flush_work+0x54/0x140 lru_add_drain_all+0x138/0x188 migrate_prep+0xc/0x18 alloc_contig_range+0xf4/0x350 cma_alloc+0xec/0x1e4 dma_alloc_from_contiguous+0x38/0x40 __dma_alloc+0x74/0x25c nvme_alloc_queue+0xcc/0x36c nvme_reset_work+0x5c4/0xda8 process_one_work+0x128/0x2ec worker_thread+0x58/0x434 kthread+0xd4/0xe8 ret_from_fork+0x10/0x50 That's because lru_add_drain_all() will schedule the drain work on system_wq, whose flag is set to 0, !WQ_MEM_RECLAIM. Introduce a dedicated WQ_MEM_RECLAIM workqueue to do lru_add_drain_all(), aiding in getting memory freed. Link: http://lkml.kernel.org/r/1464917521-9775-1-git-send-email-shhuiw@foxmail.com Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20mm/swap.c: put activate_page_pvecs and other pagevecs togetherMing Li
Put the activate_page_pvecs definition next to those of the other pagevecs, for clarity. Signed-off-by: Ming Li <mingli199x@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-28thp: keep huge zero page pinned until tlb flushKirill A. Shutemov
Andrea has found[1] a race condition on MMU-gather based TLB flush vs split_huge_page() or shrinker which frees huge zero under us (patch 1/2 and 2/2 respectively). With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the page pinned until flush is complete and the pin prevents the page from being split under us. We still need patch 2/2. This is simplified version of Andrea's patch. We don't need fancy encoding. [1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usageKirill A. Shutemov
Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15mm, x86: get_user_pages() for dax mappingsDan Williams
A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver has established a devm_memremap_pages() mapping, i.e. when the pfn_t return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when encountering _PAGE_DEVMAP during a page table walk we lookup and pin a struct dev_pagemap instance to keep the result of pfn_to_page() valid until put_page(). Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Logan Gunthorpe <logang@deltatee.com> Cc: Dave Hansen <dave@sr71.net> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15mm: move lazily freed pages to inactive listMinchan Kim
MADV_FREE is a hint that it's okay to discard pages if there is memory pressure and we use reclaimers(ie, kswapd and direct reclaim) to free them so there is no value keeping them in the active anonymous LRU so this patch moves them to inactive LRU list's head. This means that MADV_FREE-ed pages which were living on the inactive list are reclaimed first because they are more likely to be cold rather than recently active pages. An arguable issue for the approach would be whether we should put the page to the head or tail of the inactive list. I chose head because the kernel cannot make sure it's really cold or warm for every MADV_FREE usecase but at least we know it's not *hot*, so landing of inactive head would be a comprimise for various usecases. This fixes suboptimal behavior of MADV_FREE when pages living on the active list will sit there for a long time even under memory pressure while the inactive list is reclaimed heavily. This basically breaks the whole purpose of using MADV_FREE to help the system to free memory which is might not be used. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: <yalin.wang2010@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jason Evans <je@fb.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Mika Penttil <mika.penttila@nextfour.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Roland Dreier <roland@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15thp: allow mlocked THP againKirill A. Shutemov
Before THP refcounting rework, THP was not allowed to cross VMA boundary. So, if we have THP and we split it, PG_mlocked can be safely transferred to small pages. With new THP refcounting and naive approach to mlocking we can end up with this scenario: 1. we have a mlocked THP, which belong to one VM_LOCKED VMA. 2. the process does munlock() on the *part* of the THP: - the VMA is split into two, one of them VM_LOCKED; - huge PMD split into PTE table; - THP is still mlocked; 3. split_huge_page(): - it transfers PG_mlocked to *all* small pages regrardless if it blong to any VM_LOCKED VMA. We probably could munlock() all small pages on split_huge_page(), but I think we have accounting issue already on step two. Instead of forbidding mlocked pages altogether, we just avoid mlocking PTE-mapped THPs and munlock THPs on split_huge_pmd(). This means PTE-mapped THPs will be on normal lru lists and will be split under memory pressure by vmscan. After the split vmscan will detect unevictable small pages and mlock them. With this approach we shouldn't hit situation like described above. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15mm: drop tail page refcountingKirill A. Shutemov
Tail page refcounting is utterly complicated and painful to support. It uses ->_mapcount on tail pages to store how many times this page is pinned. get_page() bumps ->_mapcount on tail page in addition to ->_count on head. This information is required by split_huge_page() to be able to distribute pins from head of compound page to tails during the split. We will need ->_mapcount to account PTE mappings of subpages of the compound page. We eliminate need in current meaning of ->_mapcount in tail pages by forbidding split entirely if the page is pinned. The only user of tail page refcounting is THP which is marked BROKEN for now. Let's drop all this mess. It makes get_page() and put_page() much simpler. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: make compound_head() robustKirill A. Shutemov
Hugh has pointed that compound_head() call can be unsafe in some context. There's one example: CPU0 CPU1 isolate_migratepages_block() page_count() compound_head() !!PageTail() == true put_page() tail->first_page = NULL head = tail->first_page alloc_pages(__GFP_COMP) prep_compound_page() tail->first_page = head __SetPageTail(p); !!PageTail() == true <head == NULL dereferencing> The race is pure theoretical. I don't it's possible to trigger it in practice. But who knows. We can fix the race by changing how encode PageTail() and compound_head() within struct page to be able to update them in one shot. The patch introduces page->compound_head into third double word block in front of compound_dtor and compound_order. Bit 0 encodes PageTail() and the rest bits are pointer to head page if bit zero is set. The patch moves page->pmd_huge_pte out of word, just in case if an architecture defines pgtable_t into something what can have the bit 0 set. hugetlb_cgroup uses page->lru.next in the second tail page to store pointer struct hugetlb_cgroup. The patch switch it to use page->private in the second tail page instead. The space is free since ->first_page is removed from the union. The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in first tail page to store struct hugetlb_cgroup pointer. But that's out of scope of the patch. That means page->compound_head shares storage space with: - page->lru.next; - page->next; - page->rcu_head.next; That's too long list to be absolutely sure, but looks like nobody uses bit 0 of the word. page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future call_rcu_lazy() is not allowed as it makes use of the bit and we can get false positive PageTail(). [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10mm: introduce idle page trackingVladimir Davydov
Knowing the portion of memory that is not used by a certain application or memory cgroup (idle memory) can be useful for partitioning the system efficiently, e.g. by setting memory cgroup limits appropriately. Currently, the only means to estimate the amount of idle memory provided by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the access bit for all pages mapped to a particular process by writing 1 to clear_refs, wait for some time, and then count smaps:Referenced. However, this method has two serious shortcomings: - it does not count unmapped file pages - it affects the reclaimer logic To overcome these drawbacks, this patch introduces two new page flags, Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap. A page's Idle flag can only be set from userspace by setting bit in /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page, and it is cleared whenever the page is accessed either through page tables (it is cleared in page_referenced() in this case) or using the read(2) system call (mark_page_accessed()). Thus by setting the Idle flag for pages of a particular workload, which can be found e.g. by reading /proc/PID/pagemap, waiting for some time to let the workload access its working set, and then reading the bitmap file, one can estimate the amount of pages that are not used by the workload. The Young page flag is used to avoid interference with the memory reclaimer. A page's Young flag is set whenever the Access bit of a page table entry pointing to the page is cleared by writing to the bitmap file. If page_referenced() is called on a Young page, it will add 1 to its return value, therefore concealing the fact that the Access bit was cleared. Note, since there is no room for extra page flags on 32 bit, this feature uses extended page flags when compiled on 32 bit. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: kpageidle requires an MMU] [akpm@linux-foundation.org: decouple from page-flags rework] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24mm: drop bogus VM_BUG_ON_PAGE assert in put_page() codepathKirill A. Shutemov
My commit 8d63d99a5dfb ("mm: avoid tail page refcounting on non-THP compound pages") which was merged during 4.1 merge window caused regression: page:ffffea0010a15040 count:0 mapcount:1 mapping: (null) index:0x0 flags: 0x8000000000008014(referenced|dirty|tail) page dumped because: VM_BUG_ON_PAGE(page_mapcount(page) != 0) ------------[ cut here ]------------ kernel BUG at mm/swap.c:134! The problem can be reproduced by playing *two* audio files at the same time and then stopping one of players. I used two mplayers to trigger this. The VM_BUG_ON_PAGE() which triggers the bug is bogus: Sound subsystem uses compound pages for its buffers, but unlike most __GFP_COMP sound maps compound pages to userspace with PTEs. In our case with two players map the buffer twice and therefore elevates page_mapcount() on tail pages by two. When one of players exits it unmaps the VMA and drops page_mapcount() to one and try to release reference on the page with put_page(). My commit changes which path it takes under put_compound_page(). It hits put_unrefcounted_compound_page() where VM_BUG_ON_PAGE() is. It sees page_mapcount() == 1. The function wrongly assumes that subpages of compound page cannot be be mapped by itself with PTEs.. The solution is simply drop the VM_BUG_ON_PAGE(). Note: there's no need to move the check under put_page_testzero(). Allocator will check the mapcount by itself before putting on free list. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Borislav Petkov <bp@alien8.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15mm: don't call __page_cache_release for hugetlbNaoya Horiguchi
__put_compound_page() calls __page_cache_release() to do some freeing work, but it's obviously for thps, not for hugetlb. We don't care because PageLRU is always cleared and page->mem_cgroup is always NULL for hugetlb. But it's not correct and has potential risks, so let's make it conditional. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15mm: rename deactivate_page to deactivate_file_pageMinchan Kim
"deactivate_page" was created for file invalidation so it has too specific logic for file-backed pages. So, let's change the name of the function and date to a file-specific one and yield the generic name. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Wang, Yalin <Yalin.Wang@sonymobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull backing device changes from Jens Axboe: "This contains a cleanup of how the backing device is handled, in preparation for a rework of the life time rules. In this part, the most important change is to split the unrelated nommu mmap flags from it, but also removing a backing_dev_info pointer from the address_space (and inode), and a cleanup of other various minor bits. Christoph did all the work here, I just fixed an oops with pages that have a swap backing. Arnd fixed a missing export, and Oleg killed the lustre backing_dev_info from staging. Last patch was from Al, unexporting parts that are now no longer needed outside" * 'for-3.20/bdi' of git://git.kernel.dk/linux-block: Make super_blocks and sb_lock static mtd: export new mtd_mmap_capabilities fs: make inode_to_bdi() handle NULL inode staging/lustre/llite: get rid of backing_dev_info fs: remove default_backing_dev_info fs: don't reassign dirty inodes to default_backing_dev_info nfs: don't call bdi_unregister ceph: remove call to bdi_unregister fs: remove mapping->backing_dev_info fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info nilfs2: set up s_bdi like the generic mount_bdev code block_dev: get bdev inode bdi directly from the block device block_dev: only write bdev inode on close fs: introduce f_op->mmap_capabilities for nommu mmap support fs: kill BDI_CAP_SWAP_BACKED fs: deduplicate noop_backing_dev_info
2015-02-10rmap: drop support of non-linear mappingsKirill A. Shutemov
We don't create non-linear mappings anymore. Let's drop code which handles them in rmap. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-20fs: kill BDI_CAP_SWAP_BACKEDChristoph Hellwig
This bdi flag isn't too useful - we can determine that a vma is backed by either swap or shmem trivially in the caller. This also allows removing the backing_dev_info instaces for swap and shmem in favor of noop_backing_dev_info. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-09mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cacheMichal Hocko
free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks. This is not a big deal for the normal release path but it completely kills memcg uncharge batching which reduces res_counter spin_lock contention. Dave has noticed this with his page fault scalability test case on a large machine when the lock was basically dominating on all CPUs: 80.18% 80.18% [kernel] [k] _raw_spin_lock | --- _raw_spin_lock | |--66.59%-- res_counter_uncharge_until | res_counter_uncharge | uncharge_batch | uncharge_list | mem_cgroup_uncharge_list | release_pages | free_pages_and_swap_cache | tlb_flush_mmu_free | | | |--90.12%-- unmap_single_vma | | unmap_vmas | | unmap_region | | do_munmap | | vm_munmap | | sys_munmap | | system_call_fastpath | | __GI___munmap | | | --9.88%-- tlb_flush_mmu | tlb_finish_mmu |