summaryrefslogtreecommitdiffstats
path: root/mm/page_alloc.c
AgeCommit message (Collapse)Author
2020-10-16mm/page_owner: change split_page_owner to take a countMatthew Wilcox (Oracle)
The implementation of split_page_owner() prefers a count rather than the old order of the page. When we support a variable size THP, we won't have the order at this point, but we will have the number of pages. So change the interface to what the caller and callee would prefer. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-4-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13memblock: use separate iterators for memory and reserved regionsMike Rapoport
for_each_memblock() is used to iterate over memblock.memory in a few places that use data from memblock_region rather than the memory ranges. Introduce separate for_each_mem_region() and for_each_reserved_mem_region() to improve encapsulation of memblock internals from its users. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> [x86] Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> [MIPS] Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> [.clang-format] Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Axtens <dja@axtens.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Emil Renner Berthing <kernel@esmil.dk> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20200818151634.14343-18-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13memblock: reduce number of parameters in for_each_mem_range()Mike Rapoport
Currently for_each_mem_range() and for_each_mem_range_rev() iterators are the most generic way to traverse memblock regions. As such, they have 8 parameters and they are hardly convenient to users. Most users choose to utilize one of their wrappers and the only user that actually needs most of the parameters is memblock itself. To avoid yet another naming for memblock iterators, rename the existing for_each_mem_range[_rev]() to __for_each_mem_range[_rev]() and add a new for_each_mem_range[_rev]() wrappers with only index, start and end parameters. The new wrapper nicely fits into init_unavailable_mem() and will be used in upcoming changes to simplify memblock traversals. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> [MIPS] Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Axtens <dja@axtens.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Emil Renner Berthing <kernel@esmil.dk> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20200818151634.14343-11-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm/page_alloc.c: fix freeing non-compound pagesMatthew Wilcox (Oracle)
Here is a very rare race which leaks memory: Page P0 is allocated to the page cache. Page P1 is free. Thread A Thread B Thread C find_get_entry(): xas_load() returns P0 Removes P0 from page cache P0 finds its buddy P1 alloc_pages(GFP_KERNEL, 1) returns P0 P0 has refcount 1 page_cache_get_speculative(P0) P0 has refcount 2 __free_pages(P0) P0 has refcount 1 put_page(P0) P1 is not freed Fix this by freeing all the pages in __free_pages() that won't be freed by the call to put_page(). It's usually not a good idea to split a page, but this is a very unlikely scenario. Fixes: e286781d5f2e ("mm: speculative page references") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200926213919.26642-1-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mmzone: clean code by removing unused macro parameterMateusz Nosek
Previously 'for_next_zone_zonelist_nodemask' macro parameter 'zlist' was unused so this patch removes it. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20200917211906.30059-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm/page_alloc.c: __perform_reclaim should return 'unsigned long'Yanfei Xu
__perform_reclaim()'s single caller expects it to return 'unsigned long', hence change its return value and a local variable to 'unsigned long'. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20200916022138.16740-1-yanfei.xu@windriver.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm/page_alloc.c: clean code by merging two functionsMateusz Nosek
finalise_ac() is just 'epilogue' for 'prepare_alloc_pages'. Therefore there is no need to keep them both so 'finalise_ac' content can be merged into prepare_alloc_pages() code. It would make __alloc_pages_nodemask() cleaner when it comes to readability. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@kernel.org> Link: https://lkml.kernel.org/r/20200916110118.6537-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm/page_alloc.c: fix early params garbage value accessesMateusz Nosek
Previously in '__init early_init_on_alloc' and '__init early_init_on_free' the return values from 'kstrtobool' were not handled properly. That caused potential garbage value read from variable 'bool_result'. Introduced patch fixes error handling. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20200916214125.28271-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm/page_alloc.c: micro-optimization remove unnecessary branchMateusz Nosek
Previously flags check was separated into two separated checks with two separated branches. In case of presence of any of two mentioned flags, the same effect on flow occurs. Therefore checks can be merged and one branch can be avoided. Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20200911092310.31136-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm/page_alloc.c: clean code by removing unnecessary initializationMateusz Nosek
Previously variable 'tmp' was initialized, but was not read later before reassigning. So the initialization can be removed. [akpm@linux-foundation.org: remove `tmp' altogether] Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20200904132422.17387-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm, isolation: avoid checking unmovable pages across pageblock boundaryLi Xinhai
In has_unmovable_pages(), the page parameter would not always be the first page within a pageblock (see how the page pointer is passed in from start_isolate_page_range() after call __first_valid_page()), so that would cause checking unmovable pages span two pageblocks. After this patch, the checking is enforced within one pageblock no matter the page is first one or not, and obey the semantics of this function. This issue is found by code inspection. Michal said "this might lead to false negatives when an unrelated block would cause an isolation failure". Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Link: https://lkml.kernel.org/r/20200824065811.383266-1-lixinhai.lxh@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13mm/page_alloc: tweak comments in has_unmovable_pages()David Hildenbrand
Patch series "mm / virtio-mem: support ZONE_MOVABLE", v5. When introducing virtio-mem, the semantics of ZONE_MOVABLE were rather unclear, which is why we special-cased ZONE_MOVABLE such that partially plugged blocks would never end up in ZONE_MOVABLE. Now that the semantics are much clearer (and are documented in patch #6), let's support partially plugged memory blocks in ZONE_MOVABLE, allowing partially plugged memory blocks to be online to ZONE_MOVABLE and also unplugging from such memory blocks. This avoids surprises when onlining of memory blocks suddenly fails, just because they are not completely populated by virtio-mem (yet). This is especially helpful for testing, but also paves the way for virtio-mem optimizations, allowing more memory to get reliably unplugged. Cleanup has_unmovable_pages() and set_migratetype_isolate(), providing better documentation of how ZONE_MOVABLE interacts with different kind of unmovable pages (memory offlining vs. alloc_contig_range()). This patch (of 6): Let's move the split comment regarding bootmem allocations and memory holes, especially in the context of ZONE_MOVABLE, to the PageReserved() check. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200816125333.7434-1-david@redhat.com Link: http://lkml.kernel.org/r/20200816125333.7434-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-11mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected ↵Vijay Balakrishna
by khugepaged When memory is hotplug added or removed the min_free_kbytes should be recalculated based on what is expected by khugepaged. Currently after hotplug, min_free_kbytes will be set to a lower default and higher default set when THP enabled is lost. This change restores min_free_kbytes as expected for THP consumers. [vijayb@linux.microsoft.com: v5] Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com Fixes: f000565adb77 ("thp: set recommended min free kbytes") Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Allen Pais <apais@microsoft.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-03mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIsJoonsoo Kim
memalloc_nocma_{save/restore} APIs can be used to skip page allocation on CMA area, but, there is a missing case and the page on CMA area could be allocated even if APIs are used. This patch handles this case to fix the potential issue. For now, these APIs are used to prevent long-term pinning on the CMA page. When the long-term pinning is requested on the CMA page, it is migrated to the non-CMA page before pinning. This non-CMA page is allocated by using memalloc_nocma_{save/restore} APIs. If APIs doesn't work as intended, the CMA page is allocated and it is pinned for a long time. This long-term pin for the CMA page causes cma_alloc() failure and it could result in wrong behaviour on the device driver who uses the cma_alloc(). Missing case is an allocation from the pcplist. MIGRATE_MOVABLE pcplist could have the pages on CMA area so we need to skip it if ALLOC_CMA isn't specified. Fixes: 8510e69c8efe (mm/page_alloc: fix memalloc_nocma_{save/restore} APIs) Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Link: https://lkml.kernel.org/r/1601429472-12599-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-26mm: replace memmap_context by meminit_contextLaurent Dufour
Patch series "mm: fix memory to node bad links in sysfs", v3. Sometimes, firmware may expose interleaved memory layout like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] In that case, we can see memory blocks assigned to multiple nodes in sysfs: $ ls -l /sys/devices/system/memory/memory21 total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 -rw-r--r-- 1 root root 65536 Aug 24 05:27 online -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index drwxr-xr-x 2 root root 0 Aug 24 05:27 power -r--r--r-- 1 root root 65536 Aug 24 05:27 removable -rw-r--r-- 1 root root 65536 Aug 24 05:27 state lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones The same applies in the node's directory with a memory21 link in both the node1 and node2's directory. This is wrong but doesn't prevent the system to run. However when later, one of these memory blocks is hot-unplugged and then hot-plugged, the system is detecting an inconsistency in the sysfs layout and a BUG_ON() is raised: kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This has been seen on PowerPC LPAR. The root cause of this issue is that when node's memory is registered, the range used can overlap another node's range, thus the memory block is registered to multiple nodes in sysfs. There are two issues here: (a) The sysfs memory and node's layouts are broken due to these multiple links (b) The link errors in link_mem_sections() should not lead to a system panic. To address (a) register_mem_sect_under_node should not rely on the system state to detect whether the link operation is triggered by a hot plug operation or not. This is addressed by the patches 1 and 2 of this series. Issue (b) will be addressed separately. This patch (of 2): The memmap_context enum is used to detect whether a memory operation is due to a hot-add operation or happening at boot time. Make it general to the hotplug operation and rename it as meminit_context. There is no functional change introduced by this patch Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-21mm, page_alloc: fix core hung in free_pcppages_bulk()Charan Teja Reddy
The following race is observed with the repeated online, offline and a delay between two successive online of memory blocks of movable zone. P1 P2 Online the first memory block in the movable zone. The pcp struct values are initialized to default values,i.e., pcp->high = 0 & pcp->batch = 1. Allocate the pages from the movable zone. Try to Online the second memory block in the movable zone thus it entered the online_pages() but yet to call zone_pcp_update(). This process is entered into the exit path thus it tries to release the order-0 pages to pcp lists through free_unref_page_commit(). As pcp->high = 0, pcp->count = 1 proceed to call the function free_pcppages_bulk(). Update the pcp values thus the new pcp values are like, say, pcp->high = 378, pcp->batch = 63. Read the pcp's batch value using READ_ONCE() and pass the same to free_pcppages_bulk(), pcp values passed here are, batch = 63, count = 1. Since num of pages in the pcp lists are less than ->batch, then it will stuck in while(list_empty(list)) loop with interrupts disabled thus a core hung. Avoid this by ensuring free_pcppages_bulk() is called with proper count of pcp list pages. The mentioned race is some what easily reproducible without [1] because pcp's are not updated for the first memory block online and thus there is a enough race window for P2 between alloc+free and pcp struct values update through onlining of second memory block. With [1], the race still exists but it is very narrow as we update the pcp struct values for the first memory block online itself. This is not limited to the movable zone, it could also happen in cases with the normal zone (e.g., hotplug to a node that only has DMA memory, or no other memory yet). [1]: https://patchwork.kernel.org/patch/11696389/ Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type") Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: <stable@vger.kernel.org> [2.6+] Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-21mm: include CMA pages in lowmem_reserve at bootDoug Berger
The lowmem_reserve arrays provide a means of applying pressure against allocations from lower zones that were targeted at higher zones. Its values are a function of the number of pages managed by higher zones and are assigned by a call to the setup_per_zone_lowmem_reserve() function. The function is initially called at boot time by the function init_per_zone_wmark_min() and may be called later by accesses of the /proc/sys/vm/lowmem_reserve_ratio sysctl file. The function init_per_zone_wmark_min() was moved up from a module_init to a core_initcall to resolve a sequencing issue with khugepaged. Unfortunately this created a sequencing issue with CMA page accounting. The CMA pages are added to the managed page count of a zone when cma_init_reserved_areas() is called at boot also as a core_initcall. This makes it uncertain whether the CMA pages will be added to the managed page counts of their zones before or after the call to init_per_zone_wmark_min() as it becomes dependent on link order. With the current link order the pages are added to the managed count after the lowmem_reserve arrays are initialized at boot. This means the lowmem_reserve values at boot may be lower than the values used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the ratio values are unchanged. In many cases the difference is not significant, but for example an ARM platform with 1GB of memory and the following memory layout cma: Reserved 256 MiB at 0x0000000030000000 Zone ranges: DMA [mem 0x0000000000000000-0x000000002fffffff] Normal empty HighMem [mem 0x0000000030000000-0x000000003fffffff] would result in 0 lowmem_reserve for the DMA zone. This would allow userspace to deplete the DMA zone easily. Funnily enough $ cat /proc/sys/vm/lowmem_reserve_ratio would fix up the situation because as a side effect it forces setup_per_zone_lowmem_reserve. This commit breaks the link order dependency by invoking init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages have the chance to be properly accounted in their zone(s) and allowing the lowmem_reserve arrays to receive consistent values. Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization") Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jason Baron <jbaron@akamai.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14mm: store compound_nr as well as compound_orderMatthew Wilcox (Oracle)
Patch series "THP prep patches". These are some generic cleanups and improvements, which I would like merged into mmotm soon. The first one should be a performance improvement for all users of compound pages, and the others are aimed at getting code to compile away when CONFIG_TRANSPARENT_HUGEPAGE is disabled (ie small systems). Also better documented / less confusing than the current prefix mixture of compound, hpage and thp. This patch (of 7): This removes a few instructions from functions which need to know how many pages are in a compound page. The storage used is either page->mapping on 64-bit or page->index on 32-bit. Both of these are fine to overlay on tail pages. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-1-willy@infradead.org Link: http://lkml.kernel.org/r/20200629151959.15779-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/page_alloc: remove a wrapper for alloc_migration_target()Joonsoo Kim
There is a well-defined standard migration target callback. Use it directly. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1594622517-20681-8-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/page_alloc.c: delete or fix duplicated wordsRandy Dunlap
Drop the repeated word "them" and "that". Change "the the" to "to the". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200801173822.14973-10-rdunlap@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/page_alloc: fix memalloc_nocma_{save/restore} APIsJoonsoo Kim
Currently, memalloc_nocma_{save/restore} API that prevents CMA area in page allocation is implemented by using current_gfp_context(). However, there are two problems of this implementation. First, this doesn't work for allocation fastpath. In the fastpath, original gfp_mask is used since current_gfp_context() is introduced in order to control reclaim and it is on slowpath. So, CMA area can be allocated through the allocation fastpath even if memalloc_nocma_{save/restore} APIs are used. Currently, there is just one user for these APIs and it has a fallback method to prevent actual problem. Second, clearing __GFP_MOVABLE in current_gfp_context() has a side effect to exclude the memory on the ZONE_MOVABLE for allocation target. To fix these problems, this patch changes the implementation to exclude CMA area in page allocation. Main point of this change is using the alloc_flags. alloc_flags is mainly used to control allocation so it fits for excluding CMA area in allocation. Fixes: d7fefcc8de91 (mm/cma: add PF flag to force non cma alloc) Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Link: http://lkml.kernel.org/r/1595468942-29687-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/page_alloc.c: skip setting nodemask when we are in interruptMuchun Song
When we are in the interrupt context, it is irrelevant to the current task context. If we use current task's mems_allowed, we can be fair to alloc pages in the fast path and fall back to slow path memory allocation when the current node(which is the current task mems_allowed) does not have enough memory to allocate. In this case, it slows down the memory allocation speed of interrupt context. So we can skip setting the nodemask to allow any node to allocate memory, so that fast path allocation can success. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200706025921.53683-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/page_alloc: fallbacks at most has 3 elementsWei Yang
MIGRAGE_TYPES is used to be the mark of end and there are at most 3 elements for the one dimension array. Reduce to 3 to save little memory. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200625231022.18784-1-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/page_alloc: silence a KASAN false positiveQian Cai
kernel_init_free_pages() will use memset() on s390 to clear all pages from kmalloc_order() which will override KASAN redzones because a redzone was setup from the end of the allocation size to the end of the last page. Silence it by not reporting it there. An example of the report is, BUG: KASAN: slab-out-of-bounds in __free_pages_ok Write of size 4096 at addr 000000014beaa000 Call Trace: show_stack+0x152/0x210 dump_stack+0x1f8/0x248 print_address_description.isra.13+0x5e/0x4d0 kasan_report+0x130/0x178 check_memory_region+0x190/0x218 memset+0x34/0x60 __free_pages_ok+0x894/0x12f0 kfree+0x4f2/0x5e0 unpack_to_rootfs+0x60e/0x650 populate_rootfs+0x56/0x358 do_one_initcall+0x1f4/0xa20 kernel_init_freeable+0x758/0x7e8 kernel_init+0x1c/0x170 ret_from_fork+0x24/0x28 Memory state around the buggy address: 000000014bea9f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 000000014bea9f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >000000014beaa000: 03 fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe ^ 000000014beaa080: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe 000000014beaa100: fe fe fe fe fe fe fe fe fe fe fe fe fe fe Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Vasily Gorbik <gor@linux.ibm.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Link: http://lkml.kernel.org/r/20200610052154.5180-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/page_alloc.c: remove unnecessary end_bitidx for ↵Wei Yang
[set|get]_pfnblock_flags_mask() After previous cleanup, the end_bitidx is not necessary any more. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/20200623124201.8199-4-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/page_alloc.c: simplify pageblock bitmap accessWei Yang
Due to commit e58469bafd05 ("mm: page_alloc: use word-based accesses for get/set pageblock bitmaps"), pageblock bitmap is accessed with word-based access. This operation could be simplified a little. Intuitively, if we want to get a bit range [start_idx, end_idx] in a word, we can do like this: mask = (1 << (end_bitidx - start_bitidx + 1)) - 1; ret = (word >> start_idx) & mask; And also if we want to set a bit range [start_idx, end_idx] with flags, we can do the same by just shift start_bitidx. By doing so we reduce some instructions for these two helper functions: Before Patched set_pfnblock_flags_mask 209 198(-5%) get_pfnblock_flags_mask 101 87(-13%) Since the syntax is changed a little, we need to check the whole 4-bit migrate_type instead of part of it. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/20200623124201.8199-3-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/page_alloc.c: extract the common part in pfn_to_bitidx()Wei Yang
The return value calculation is the same both for SPARSEMEM or not. Just take it out. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/20200623124201.8199-2-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/page_alloc: remove nr_free_pagecache_pages()David Hildenbrand
nr_free_pagecache_pages() isn't used outside page_alloc.c anymore - and the name does not really help to understand what's going on. Let's open-code it instead and add a comment. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Huang Ying <ying.huang@intel.com> Link: http://lkml.kernel.org/r/20200619132410.23859-3-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: remove vm_total_pagesDavid Hildenbrand
The global variable "vm_total_pages" is a relic from older days. There is only a single user that reads the variable - build_all_zonelists() - and the first thing it does is update it. Use a local variable in build_all_zonelists() instead and remove the global variable. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm, page_alloc: skip ->waternark_boost for atomic order-0 allocationsCharan Teja Reddy
When boosting is enabled, it is observed that rate of atomic order-0 allocation failures are high due to the fact that free levels in the system are checked with ->watermark_boost offset. This is not a problem for sleepable allocations but for atomic allocations which looks like regression. This problem is seen frequently on system setup of Android kernel running on Snapdragon hardware with 4GB RAM size. When no extfrag event occurred in the system, ->watermark_boost factor is zero, thus the watermark configurations in the system are: _watermark = ( [WMARK_MIN] = 1272, --> ~5MB [WMARK_LOW] = 9067, --> ~36MB [WMARK_HIGH] = 9385), --> ~38MB watermark_boost = 0 After launching some memory hungry applications in Android which can cause extfrag events in the system to an extent that ->watermark_boost can be set to max i.e. default boost factor makes it to 150% of high watermark. _watermark = ( [WMARK_MIN] = 1272, --> ~5MB [WMARK_LOW] = 9067, --> ~36MB [WMARK_HIGH] = 9385), --> ~38MB watermark_boost = 14077, -->~57MB With default system configuration, for an atomic order-0 allocation to succeed, having free memory of ~2MB will suffice. But boosting makes the min_wmark to ~61MB thus for an atomic order-0 allocation to be successful system should have minimum of ~23MB of free memory(from calculations of zone_watermark_ok(), min = 3/4(min/2)). But failures are observed despite system is having ~20MB of free memory. In the testing, this is reproducible as early as first 300secs since boot and with furtherlowram configurations(<2GB) it is observed as early as first 150secs since boot. These failures can be avoided by excluding the ->watermark_boost in watermark caluculations for atomic order-0 allocations. [akpm@linux-foundation.org: fix comment grammar, reflow comment] [charante@codeaurora.org: fix suggested by Mel Gorman] Link: http://lkml.kernel.org/r/31556793-57b1-1c21-1a9d-22674d9bd938@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Mel Gorman <mgorman@techsingularity.net> Link: http://lkml.kernel.org/r/1589882284-21010-1-git-send-email-charante@codeaurora.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07page_alloc: consider highatomic reserve in watermark fastJaewon Kim
zone_watermark_fast was introduced by commit 48ee5f3696f6 ("mm, page_alloc: shortcut watermark checks for order-0 pages"). The commit simply checks if free pages is bigger than watermark without additional calculation such like reducing watermark. It considered free cma pages but it did not consider highatomic reserved. This may incur exhaustion of free pages except high order atomic free pages. Assume that reserved_highatomic pageblock is bigger than watermark min, and there are only few free pages except high order atomic free. Because zone_watermark_fast passes the allocation without considering high order atomic free, normal reclaimable allocation like GFP_HIGHUSER will consume all the free pages. Then finally order-0 atomic allocation may fail on allocation. This means watermark min is not protected against non-atomic allocation. The order-0 atomic allocation with ALLOC_HARDER unwantedly can be failed. Additionally the __GFP_MEMALLOC allocation with ALLOC_NO_WATERMARKS also can be failed. To avoid the problem, zone_watermark_fast should consider highatomic reserve. If the actual size of high atomic free is counted accurately like cma free, we may use it. On this patch just use nr_reserved_highatomic. Additionally introduce __zone_watermark_unusable_free to factor out common parts between zone_watermark_fast and __zone_watermark_ok. This is an example of ALLOC_HARDER allocation failure using v4.19 based kernel. Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null) Call trace: [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0 [<ffffff8008223320>] warn_alloc+0xd8/0x12c [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250 [<ffffff800827f6e8>] new_slab+0x128/0x604 [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670 [<ffffff800827ba00>] __kmalloc+0x2f8/0x310 [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144 [<ffffff80084ad880>] security_sid_to_context+0x10/0x18 [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28 [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70 [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c Mem-Info: active_anon:102061 inactive_anon:81551 isolated_anon:0 active_file:59102 inactive_file:68924 isolated_file:64 unevictable:611 dirty:63 writeback:0 unstable:0 slab_reclaimable:13324 slab_unreclaimable:44354 mapped:83015 shmem:4858 pagetables:26316 bounce:0 free:2727 free_pcp:1035 free_cma:178 Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB lowmem_reserve[]: 0 0 Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB 138826 total pagecache pages 5460 pages in swap cache Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142 This is an example of ALLOC_NO_WATERMARKS allocation failure using v4.14 based kernel. kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null) kswapd0 cpuset=/ mems_allowed=0 CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1 Call trace: [<0000000000000000>] dump_backtrace+0x0/0x248 [<0000000000000000>] show_stack+0x18/0x20 [<0000000000000000>] __dump_stack+0x20/0x28 [<0000000000000000>] dump_stack+0x68/0x90 [<0000000000000000>] warn_alloc+0x104/0x198 [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0 [<0000000000000000>] zs_malloc+0x148/0x3d0 [<0000000000000000>] zram_bvec_rw+0x410/0x798 [<0000000000000000>] zram_rw_page+0x88/0xdc [<0000000000000000>] bdev_write_page+0x70/0xbc [<0000000000000000>] __swap_writepage+0x58/0x37c [<0000000000000000>] swap_writepage+0x40/0x4c [<0000000000000000>] shrink_page_list+0xc30/0xf48 [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c [<0000000000000000>] shrink_node_memcg+0x23c/0x618 [<0000000000000000>] shrink_node+0x1c8/0x304 [<0000000000000000>] kswapd+0x680/0x7c4 [<0000000000000000>] kthread+0x110/0x120 [<0000000000000000>] ret_from_fork+0x10/0x18 Mem-Info: active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a unevictable:4158 dirty:117 writeback:0 unstable:0\x0a slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0 Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB [ 4738.329607] lowmem_reserve[]: 0 0 Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB This is trace log which shows GFP_HIGHUSER consumes free pages right before ALLOC_NO_WATERMARKS. <...>-22275 [006] .... 889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO <...>-22275 [006] .... 889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO kswapd0-1207 [005] ...1 889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE [jaewon31.kim@samsung.com: remove redundant code for high-order] Link: http://lkml.kernel.org/r/20200623035242.27232-1-jaewon31.kim@samsung.com Reported-by: Yong-Taek Lee <ytk.lee@samsung.com> Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200619235958.11283-1-jaewon31.kim@samsung.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm, page_alloc: use unlikely() in task_capc()Vlastimil Babka
Hugh noted that task_capc() could use unlikely(), as most of the time there is no capture in progress and we are in page freeing hot path. Indeed adding unlikely() produces assembly that better matches the assumption and moves all the tests away from the hot path. I have also noticed that we don't need to test for cc->direct_compaction as the only place we set current->task_capture is compact_zone_order() which also always sets cc->direct_compaction true. Suggested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Hugh Dickins <hughd@googlecom> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Li Wang <liwang@redhat.com> Link: http://lkml.kernel.org/r/4a24f7af-3aa5-6e80-4ae6-8f253b562039@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm/sparse: cleanup the code surrounding memory_present()Mike Rapoport
After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent functions that call memory_present() for each region in memblock.memory: sparse_memory_present_with_active_regions() and membocks_present(). Moreover, all architectures have a call to either of these functions preceding the call to sparse_init() and in the most cases they are called one after the other. Mark the regions from memblock.memory as present during sparce_init() by making sparse_init() call memblocks_present(), make memblocks_present() and memory_present() functions static and remove redundant sparse_memory_present_with_active_regions() function. Also remove no longer required HAVE_MEMORY_PRESENT configuration option. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcontrol: account kernel stack per nodeShakeel Butt
Currently the kernel stack is being accounted per-zone. There is no need to do that. In addition due to being per-zone, memcg has to keep a separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of node_stat_item. In addition localize the kernel stack stats updates to account_kernel_stack(). Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07mm: memcg: convert vmstat slab counters to bytesRoman Gushchin
In order to prepare for per-object slab memory accounting, convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes. To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB). Internally global and per-node counters are stored in pages, however memcg and lruvec counters are stored in bytes. This scheme may look weird, but only for now. As soon as slab p