summaryrefslogtreecommitdiffstats
path: root/mm/page_alloc.c
AgeCommit message (Collapse)Author
2018-01-31mm/page_alloc.c: fix typos in commentsShile Zhang
Link: http://lkml.kernel.org/r/1515485774-4768-1-git-send-email-zhangshile@gmail.com Signed-off-by: Shile Zhang <zhangshile@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31mm/page_alloc.c: fix comment in __get_free_pages()Jiankang Chen
__get_free_pages() will return a virtual address, but it is not just a 32-bit address, for example on a 64-bit system. And this comment really confuses new readers of mm. Link: http://lkml.kernel.org/r/1511780964-64864-1-git-send-email-chenjiankang1@huawei.com Signed-off-by: Jiankang Chen <chenjiankang1@huawei.com> Reported-by: Hanjun Guo <guohanjun@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31mm: split deferred_init_range into initializing and freeing partsPavel Tatashin
In deferred_init_range() we initialize struct pages, and also free them to buddy allocator. We do it in separate loops, because buddy page is computed ahead, so we do not want to access a struct page that has not been initialized yet. There is still, however, a corner case where it is potentially possible to access uninitialized struct page: this is when buddy page is from the next memblock range. This patch fixes this problem by splitting deferred_init_range() into two functions: one to initialize struct pages, and another to free them. In addition, this patch brings the following improvements: - Get rid of __def_free() helper function. And simplifies loop logic by adding a new pfn validity check function: deferred_pfn_valid(). - Reduces number of variables that we track. So, there is a higher chance that we will avoid using stack to store/load variables inside hot loops. - Enables future multi-threading of these functions: do initialization in multiple threads, wait for all threads to finish, do freeing part in multithreading. Tested on x86 with 1T of memory to make sure no regressions are introduced. [akpm@linux-foundation.org: fix spello in comment] Link: http://lkml.kernel.org/r/20171107150446.32055-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-04mm: check pfn_valid first in zero_resv_unavailDave Young
With latest kernel I get below bug while testing kdump: BUG: unable to handle kernel paging request at ffffea00034b1040 IP: zero_resv_unavail+0xbd/0x126 PGD 37b98067 P4D 37b98067 PUD 37b97067 PMD 0 Oops: 0002 [#1] SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.15.0-rc1+ #316 Hardware name: LENOVO 20ARS1BJ02/20ARS1BJ02, BIOS GJET92WW (2.42 ) 03/03/2017 task: ffffffff81a0e4c0 task.stack: ffffffff81a00000 RIP: 0010:zero_resv_unavail+0xbd/0x126 RSP: 0000:ffffffff81a03d88 EFLAGS: 00010006 RAX: 0000000000000000 RBX: ffffea00034b1040 RCX: 0000000000000010 RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffffea00034b1040 RBP: 00000000000d2c41 R08: 00000000000000c0 R09: 0000000000000a0d R10: 0000000000000002 R11: 0000000000007f01 R12: ffffffff81a03d90 R13: ffffea0000000000 R14: 0000000000000063 R15: 0000000000000062 FS: 0000000000000000(0000) GS:ffffffff81c73000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffea00034b1040 CR3: 0000000037609000 CR4: 00000000000606b0 Call Trace: ? free_area_init_nodes+0x640/0x664 ? zone_sizes_init+0x58/0x72 ? setup_arch+0xb50/0xc6c ? start_kernel+0x64/0x43d ? secondary_startup_64+0xa5/0xb0 Code: c1 e8 0c 48 39 d8 76 27 48 89 de 48 c1 e3 06 48 c7 c7 7a 87 79 81 e8 b0 c0 3e ff 4c 01 eb b9 10 00 00 00 31 c0 48 89 df 49 ff c6 <f3> ab eb bc 6a 00 49 c7 c0 f0 93 d1 81 31 d2 83 ce ff 41 54 49 RIP: zero_resv_unavail+0xbd/0x126 RSP: ffffffff81a03d88 CR2: ffffea00034b1040 ---[ end trace f5ba9e8f73c7ee26 ]--- This is introduced by commit a4a3ede2132a ("mm: zero reserved and unavailable struct pages"). The reason is some efi reserved boot ranges is not reported in E820 ram. In my case it is a bgrt buffer: efi: mem00: [Boot Data |RUN| | | | | | | |WB|WT|WC|UC] range=[0x00000000d2c41000-0x00000000d2c85fff] (0MB) Use "add_efi_memmap" can workaround the problem with another fix: http://lkml.kernel.org/r/20171130052327.GA3500@dhcp-128-65.nay.redhat.com In zero_resv_unavail it would be better to check pfn_valid first before zero the page struct. This fixes the problem and potential other similar problems. Also as Pavel Tatashin suggested checks pfn_valid at the beginning of the section. The range is backed by real memory. The memory range is efi "Boot Service Data", that means after ExitBootServices() these ranges can be used as system ram. But some of them need to be reserved, for example the bgrt image address in an acpi table, if the image memory is freed then kexec reboot will fail because kexec inherit same acpi table to initialize the driver. Link: http://lkml.kernel.org/r/20171201095048.GA3084@dhcp-128-65.nay.redhat.com Fixes: a4a3ede2132a ("mm: zero reserved and unavailable struct pages") Signed-off-by: Dave Young <dyoung@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-14mm/page_alloc.c: avoid excessive IRQ disabled times in free_unref_page_list()Lucas Stach
Since commit 9cca35d42eb6 ("mm, page_alloc: enable/disable IRQs once when freeing a list of pages") we see excessive IRQ disabled times of up to 25ms on an embedded ARM system (tracing overhead included). This is due to graphics buffers being freed back to the system via release_pages(). Graphics buffers can be huge, so it's not hard to hit cases where the list of pages to free has 2048 entries. Disabling IRQs while freeing all those pages is clearly not a good idea. Introduce a batch limit, which allows IRQ servicing once every few pages. The batch count is the same as used in other parts of the MM subsystem when dealing with IRQ disabled regions. Link: http://lkml.kernel.org/r/20171207170314.4419-1-l.stach@pengutronix.de Fixes: 9cca35d42eb6 ("mm, page_alloc: enable/disable IRQs once when freeing a list of pages") Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29mm/cma: fix alloc_contig_range ret code/potential leakMike Kravetz
If the call __alloc_contig_migrate_range() in alloc_contig_range returns -EBUSY, processing continues so that test_pages_isolated() is called where there is a tracepoint to identify the busy pages. However, it is possible for busy pages to become available between the calls to these two routines. In this case, the range of pages may be allocated. Unfortunately, the original return code (ret == -EBUSY) is still set and returned to the caller. Therefore, the caller believes the pages were not allocated and they are leaked. Update the comment to indicate that allocation is still possible even if __alloc_contig_migrate_range returns -EBUSY. Also, clear return code in this case so that it is not accidentally used or returned to caller. Link: http://lkml.kernel.org/r/20171122185214.25285-1-mike.kravetz@oracle.com Fixes: 8ef5849fa8a2 ("mm/cma: always check which page caused allocation failure") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29mm, memory_hotplug: do not back off draining pcp free pages from kworker contextMichal Hocko
drain_all_pages backs off when called from a kworker context since commit 0ccce3b92421 ("mm, page_alloc: drain per-cpu pages from workqueue context") because the original IPI based pcp draining has been replaced by a WQ based one and the check wanted to prevent from recursion and inter workers dependencies. This has made some sense at the time because the system WQ has been used and one worker holding the lock could be blocked while waiting for new workers to emerge which can be a problem under OOM conditions. Since then commit ce612879ddc7 ("mm: move pcp and lru-pcp draining into single wq") has moved draining to a dedicated (mm_percpu_wq) WQ with a rescuer so we shouldn't depend on any other WQ activity to make a forward progress so calling drain_all_pages from a worker context is safe as long as this doesn't happen from mm_percpu_wq itself which is not the case because all workers are required to _not_ depend on any MM locks. Why is this a problem in the first place? ACPI driven memory hot-remove (acpi_device_hotplug) is executed from the worker context. We end up calling __offline_pages to free all the pages and that requires both lru_add_drain_all_cpuslocked and drain_all_pages to do their job otherwise we can have dangling pages on pcp lists and fail the offline operation (__test_page_isolated_in_pageblock would see a page with 0 ref count but without PageBuddy set). Fix the issue by removing the worker check in drain_all_pages. lru_add_drain_all_cpuslocked doesn't have this restriction so it works as expected. Link: http://lkml.kernel.org/r/20170828093341.26341-1-mhocko@kernel.org Fixes: 0ccce3b924212 ("mm, page_alloc: drain per-cpu pages from workqueue context") Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> [4.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17mm, compaction: split off flag for not updating skip hintsVlastimil Babka
Pageblock skip hints were added as a heuristic for compaction, which shares core code with CMA. Since CMA reliability would suffer from the heuristics, compact_control flag ignore_skip_hint was added for the CMA use case. Since 6815bf3f233e ("mm/compaction: respect ignore_skip_hint in update_pageblock_skip") the flag also means that CMA won't *update* the skip hints in addition to ignoring them. Today, direct compaction can also ignore the skip hints in the last resort attempt, but there's no reason not to set them when isolation fails in such case. Thus, this patch splits off a new no_set_skip_hint flag to avoid the updating, which only CMA sets. This should improve the heuristics a bit, and allow us to simplify the persistent skip bit handling as the next step. Link: http://lkml.kernel.org/r/20171102121706.21504-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: make alloc_node_mem_map a void call if we don't have ↵Oscar Salvador
CONFIG_FLAT_NODE_MEM_MAP free_area_init_node() calls alloc_node_mem_map(), but this function does nothing unless we have CONFIG_FLAT_NODE_MEM_MAP. As a cleanup, we can move the "#ifdef CONFIG_FLAT_NODE_MEM_MAP" within alloc_node_mem_map() out of the function, and define a alloc_node_mem_map() { } when CONFIG_FLAT_NODE_MEM_MAP is not present. This also moves the printk that lays within the "#ifdef CONFIG_FLAT_NODE_MEM_MAP" block from free_area_init_node() to alloc_node_mem_map(), getting rid of the "#ifdef CONFIG_FLAT_NODE_MEM_MAP" in free_area_init_node(). [akpm@linux-foundation.org: clean up the printk while we're there] Link: http://lkml.kernel.org/r/20171114111935.GA11758@techadventures.net Signed-off-by: Oscar Salvador <osalvador@techadventures.net> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: simplify nodemask printingMichal Hocko
alloc_warn() and dump_header() have to explicitly handle NULL nodemask which forces both paths to use pr_cont. We can do better. printk already handles NULL pointers properly so all we need is to teach nodemask_pr_args to handle NULL nodemask carefully. This allows simplification of both alloc_warn() and dump_header() and gets rid of pr_cont altogether. This patch has been motivated by patch from Joe Perches http://lkml.kernel.org/r/b31236dfe3fc924054fd7842bde678e71d193638.1509991345.git.joe@perches.com [akpm@linux-foundation.org: fix tile warning, per Arnd] Link: http://lkml.kernel.org/r/20171109100531.3cn2hcqnuj7mjaju@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Joe Perches <joe@perches.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/page_alloc.c: broken deferred calculationPavel Tatashin
In reset_deferred_meminit() we determine number of pages that must not be deferred. We initialize pages for at least 2G of memory, but also pages for reserved memory in this node. The reserved memory is determined in this function: memblock_reserved_memory_within(), which operates over physical addresses, and returns size in bytes. However, reset_deferred_meminit() assumes that that this function operates with pfns, and returns page count. The result is that in the best case machine boots slower than expected due to initializing more pages than needed in single thread, and in the worst case panics because fewer than needed pages are initialized early. Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing") Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: don't warn about allocations which stall for too longTetsuo Handa
Commit 63f53dea0c98 ("mm: warn about allocations which stall for too long") was a great step for reducing possibility of silent hang up problem caused by memory allocation stalls. But this commit reverts it, for it is possible to trigger OOM lockup and/or soft lockups when many threads concurrently called warn_alloc() (in order to warn about memory allocation stalls) due to current implementation of printk(), and it is difficult to obtain useful information due to limitation of synchronous warning approach. Current printk() implementation flushes all pending logs using the context of a thread which called console_unlock(). printk() should be able to flush all pending logs eventually unless somebody continues appending to printk() buffer. Since warn_alloc() started appending to printk() buffer while waiting for oom_kill_process() to make forward progress when oom_kill_process() is processing pending logs, it became possible for warn_alloc() to force oom_kill_process() loop inside printk(). As a result, warn_alloc() significantly increased possibility of preventing oom_kill_process() from making forward progress. ---------- Pseudo code start ---------- Before warn_alloc() was introduced: retry: if (mutex_trylock(&oom_lock)) { while (atomic_read(&printk_pending_logs) > 0) { atomic_dec(&printk_pending_logs); print_one_log(); } // Send SIGKILL here. mutex_unlock(&oom_lock) } goto retry; After warn_alloc() was introduced: retry: if (mutex_trylock(&oom_lock)) { while (atomic_read(&printk_pending_logs) > 0) { atomic_dec(&printk_pending_logs); print_one_log(); } // Send SIGKILL here. mutex_unlock(&oom_lock) } else if (waited_for_10seconds()) { atomic_inc(&printk_pending_logs); } goto retry; ---------- Pseudo code end ---------- Although waited_for_10seconds() becomes true once per 10 seconds, unbounded number of threads can call waited_for_10seconds() at the same time. Also, since threads doing waited_for_10seconds() keep doing almost busy loop, the thread doing print_one_log() can use little CPU resource. Therefore, this situation can be simplified like ---------- Pseudo code start ---------- retry: if (mutex_trylock(&oom_lock)) { while (atomic_read(&printk_pending_logs) > 0) { atomic_dec(&printk_pending_logs); print_one_log(); } // Send SIGKILL here. mutex_unlock(&oom_lock) } else { atomic_inc(&printk_pending_logs); } goto retry; ---------- Pseudo code end ---------- when printk() is called faster than print_one_log() can process a log. One of possible mitigation would be to introduce a new lock in order to make sure that no other series of printk() (either oom_kill_process() or warn_alloc()) can append to printk() buffer when one series of printk() (either oom_kill_process() or warn_alloc()) is already in progress. Such serialization will also help obtaining kernel messages in readable form. ---------- Pseudo code start ---------- retry: if (mutex_trylock(&oom_lock)) { mutex_lock(&oom_printk_lock); while (atomic_read(&printk_pending_logs) > 0) { atomic_dec(&printk_pending_logs); print_one_log(); } // Send SIGKILL here. mutex_unlock(&oom_printk_lock); mutex_unlock(&oom_lock) } else { if (mutex_trylock(&oom_printk_lock)) { atomic_inc(&printk_pending_logs); mutex_unlock(&oom_printk_lock); } } goto retry; ---------- Pseudo code end ---------- But this commit does not go that direction, for we don't want to introduce a new lock dependency, and we unlikely be able to obtain useful information even if we serialized oom_kill_process() and warn_alloc(). Synchronous approach is prone to unexpected results (e.g. too late [1], too frequent [2], overlooked [3]). As far as I know, warn_alloc() never helped with providing information other than "something is going wrong". I want to consider asynchronous approach which can obtain information during stalls with possibly relevant threads (e.g. the owner of oom_lock and kswapd-like threads) and serve as a trigger for actions (e.g. turn on/off tracepoints, ask libvirt daemon to take a memory dump of stalling KVM guest for diagnostic purpose). This commit temporarily loses ability to report e.g. OOM lockup due to unable to invoke the OOM killer due to !__GFP_FS allocation request. But asynchronous approach will be able to detect such situation and emit warning. Thus, let's remove warn_alloc(). [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981 [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com [3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever")) Link: http://lkml.kernel.org/r/1509017339-4802-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Reported-by: yuwang.yuwang <yuwang.yuwang@alibaba-inc.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, page_alloc: fix potential false positive in __zone_watermark_okVlastimil Babka
Since commit 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for order-0 allocations"), __zone_watermark_ok() check for high-order allocations will shortcut per-migratetype free list checks for ALLOC_HARDER allocations, and return true as long as there's free page of any migratetype. The intention is that ALLOC_HARDER can allocate from MIGRATE_HIGHATOMIC free lists, while normal allocations can't. However, as a side effect, the watermark check will then also return true when there are pages only on the MIGRATE_ISOLATE list, or (prior to CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list. Since the allocation cannot actually obtain isolated pages, and might not be able to obtain CMA pages, this can result in a false positive. The condition should be rare and perhaps the outcome is not a fatal one. Still, it's better if the watermark check is correct. There also shouldn't be a performance tradeoff here. Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz Fixes: 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for order-0 allocations") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, sysctl: make NUMA stats configurableKemi Wang
This is the second step which introduces a tunable interface that allow numa stats configurable for optimizing zone_statistics(), as suggested by Dave Hansen and Ying Huang. ========================================================================= When page allocation performance becomes a bottleneck and you can tolerate some possible tool breakage and decreased numa counter precision, you can do: echo 0 > /proc/sys/vm/numa_stat In this case, numa counter update is ignored. We can see about *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu cycles per single page allocation and reclaim on Jesper's page_bench03 (88 threads) running on a 2-Socket Broadwell-based server (88 threads, 126G memory). Benchmark link provided by Jesper D Brouer (increase loop times to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench ========================================================================= When page allocation performance is not a bottleneck and you want all tooling to work, you can do: echo 1 > /proc/sys/vm/numa_stat This is system default setting. Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka for comments to help improve the original patch. [keescook@chromium.org: make sure mutex is a global static] Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com Signed-off-by: Kemi Wang <kemi.wang@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: Ying Huang <ying.huang@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christopher Lameter <cl@linux.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, page_alloc: simplify list handling in rmqueue_bulk()Vlastimil Babka
rmqueue_bulk() fills an empty pcplist with pages from the free list. It tries to preserve increasing order by pfn to the caller, because it leads to better performance with some I/O controllers, as explained in commit e084b2d95e48 ("page-allocator: preserve PFN ordering when __GFP_COLD is set"). To preserve the order, it's sufficient to add pages to the tail of the list as they are retrieved. The current code instead adds to the head of the list, but then updates the list head pointer to the last added page, in each step. This does result in the same order, but is needlessly confusing and potentially wasteful, with no apparent benefit. This patch simplifies the code and adjusts comment accordingly. Link: http://lkml.kernel.org/r/f6505442-98a9-12e4-b2cd-0fa83874c159@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: remove __GFP_COLDMel Gorman
As the page free path makes no distinction between cache hot and cold pages, there is no real useful ordering of pages in the free list that allocation requests can take advantage of. Juding from the users of __GFP_COLD, it is likely that a number of them are the result of copying other sites instead of actually measuring the impact. Remove the __GFP_COLD parameter which simplifies a number of paths in the page allocator. This is potentially controversial but bear in mind that the size of the per-cpu pagelists versus modern cache sizes means that the whole per-cpu list can often fit in the L3 cache. Hence, there is only a potential benefit for microbenchmarks that alloc/free pages in a tight loop. It's even worse when THP is taken into account which has little or no chance of getting a cache-hot page as the per-cpu list is bypassed and the zeroing of multiple pages will thrash the cache anyway. The truncate microbenchmarks are not shown as this patch affects the allocation path and not the free path. A page fault microbenchmark was tested but it showed no sigificant difference which is not surprising given that the __GFP_COLD branches are a miniscule percentage of the fault path. Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: remove cold parameter from free_hot_cold_page*Mel Gorman
Most callers users of free_hot_cold_page claim the pages being released are cache hot. The exception is the page reclaim paths where it is likely that enough pages will be freed in the near future that the per-cpu lists are going to be recycled and the cache hotness information is lost. As no one really cares about the hotness of pages being released to the allocator, just ditch the parameter. The APIs are renamed to indicate that it's no longer about hot/cold pages. It should also be less confusing as there are subtle differences between them. __free_pages drops a reference and frees a page when the refcount reaches zero. free_hot_cold_page handled pages whose refcount was already zero which is non-obvious from the name. free_unref_page should be more obvious. No performance impact is expected as the overhead is marginal. The parameter is removed simply because it is a bit stupid to have a useless parameter copied everywhere. [mgorman@techsingularity.net: add pages to head, not tail] Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm, page_alloc: enable/disable IRQs once when freeing a list of pagesMel Gorman
Patch series "Follow-up for speed up page cache truncation", v2. This series is a follow-on for Jan Kara's series "Speed up page cache truncation" series. We both ended up looking at the same problem but saw different problems based on the same data. This series builds upon his work. A variety of workloads were compared on four separate machines but each machine showed gains albeit at different levels. Minimally, some of the differences are due to NUMA where truncating data from a remote node is slower than a local node. The workloads checked were o sparse truncate microbenchmark, tiny o sparse truncate microbenchmark, large o reaim-io disk workfile o dbench4 (modified by mmtests to produce more stable results) o filebench varmail configuration for small memory size o bonnie, directory operations, working set size 2*RAM reaim-io, dbench and filebench all showed minor gains. Truncation does not dominate those workloads but were tested to ensure no other regressions. They will not be reported further. The sparse truncate microbench was written by Jan. It creates a number of files and then times how long it takes to truncate each one. The "tiny" configuraiton creates a number of files that easily fits in memory and times how long it takes to truncate files with page cache. The large configuration uses enough files to have data that is twice the size of memory and so timings there include truncating page cache and working set shadow entries in the radix tree. Patches 1-4 are the most relevant parts of this series. Patches 5-8 are optional as they are deleting code that is essentially useless but has a negligible performance impact. The changelogs have more information on performance but just for bonnie delete options, the main comparison is bonnie 4.14.0-rc5 4.14.0-rc5 4.14.0-rc5 jan-v2 vanilla mel-v2 Hmean SeqCreate ops 76.20 ( 0.00%) 75.80 ( -0.53%) 76.80 ( 0.79%) Hmean SeqCreate read 85.00 ( 0.00%) 85.00 ( 0.00%) 85.00 ( 0.00%) Hmean SeqCreate del 13752.31 ( 0.00%) 12090.23 ( -12.09%) 15304.84 ( 11.29%) Hmean RandCreate ops 76.00 ( 0.00%) 75.60 ( -0.53%) 77.00 ( 1.32%) Hmean RandCreate read 96.80 ( 0.00%) 96.80 ( 0.00%) 97.00 ( 0.21%) Hmean RandCreate del 13233.75 ( 0.00%) 11525.35 ( -12.91%) 14446.61 ( 9.16%) Jan's series is the baseline and the vanilla kernel is 12% slower where as this series on top gains another 11%. This is from a different machine than the data in the changelogs but the detailed data was not collected as there was no substantial change in v2. This patch (of 8): Freeing a list of pages current enables/disables IRQs for each page freed. This patch splits freeing a list of pages into two operations -- preparing the pages for freeing and the actual freeing. This is a tradeoff - we're taking two passes of the list to free in exchange for avoiding multiple enable/disable of IRQs. sparsetruncate (tiny) 4.14.0-rc4 4.14.0-rc4 janbatch-v1r1 oneirq-v1r1 Min Time 149.00 ( 0.00%) 141.00 ( 5.37%) 1st-qrtle Time 150.00 ( 0.00%) 142.00 ( 5.33%) 2nd-qrtle Time 151.00 ( 0.00%) 142.00 ( 5.96%) 3rd-qrtle Time 151.00 ( 0.00%) 143.00 ( 5.30%) Max-90% Time 153.00 ( 0.00%) 144.00 ( 5.88%) Max-95% Time 155.00 ( 0.00%) 147.00 ( 5.16%) Max-99% Time 201.00 ( 0.00%) 195.00 ( 2.99%) Max Time 236.00 ( 0.00%) 230.00 ( 2.54%) Amean Time 152.65 ( 0.00%) 144.37 ( 5.43%) Stddev Time 9.78 ( 0.00%) 10.44 ( -6.72%) Coeff Time 6.41 ( 0.00%) 7.23 ( -12.84%) Best99%Amean Time 152.07 ( 0.00%) 143.72 ( 5.50%) Best95%Amean Time 150.75 ( 0.00%) 142.37 ( 5.56%) Best90%Amean Time 150.59 ( 0.00%) 142.19 ( 5.58%) Best75%Amean Time 150.36 ( 0.00%) 141.92 ( 5.61%) Best50%Amean Time 150.04 ( 0.00%) 141.69 ( 5.56%) Best25%Amean Time 149.85 ( 0.00%) 141.38 ( 5.65%) With a tiny number of files, each file truncated has resident page cache and it shows that time to truncate is roughtly 5-6% with some minor jitter. 4.14.0-rc4 4.14.0-rc4 janbatch-v1r1 oneirq-v1r1 Hmean SeqCreate ops 65.27 ( 0.00%) 81.86 ( 25.43%) Hmean SeqCreate read 39.48 ( 0.00%) 47.44 ( 20.16%) Hmean SeqCreate del 24963.95 ( 0.00%) 26319.99 ( 5.43%) Hmean RandCreate ops 65.47 ( 0.00%) 82.01 ( 25.26%) Hmean RandCreate read 42.04 ( 0.00%) 51.75 ( 23.09%) Hmean RandCreate del 23377.66 ( 0.00%) 23764.79 ( 1.66%) As expected, there is a small gain for the delete operation. [mgorman@techsingularity.net: use page_private and set_page_private helpers] Link: http://lkml.kernel.org/r/20171018101547.mjycw7zreb66jzpa@techsingularity.net Link: http://lkml.kernel.org/r/20171018075952.10627-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Kara <jack@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm/page_alloc: make sure __rmqueue() etc are always inlineAaron Lu
__rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and __rmqueue_cma_fallback() are all in page allocator's hot path and better be finished as soon as possible. One way to make them faster is by making them inline. But as Andrew Morton and Andi Kleen pointed out: https://lkml.org/lkml/2017/10/10/1252 https://lkml.org/lkml/2017/10/10/1279 To make sure they are inlined, we should use __always_inline for them. With the will-it-scale/page_fault1/process benchmark, when using nr_cpu processes to stress buddy, the results for will-it-scale.processes with and without the patch are: On a 2-sockets Intel-Skylake machine: compiler base head gcc-4.4.7 6496131 6911823 +6.4% gcc-4.9.4 7225110 7731072 +7.0% gcc-5.4.1 7054224 7688146 +9.0% gcc-6.2.0 7059794 7651675 +8.4% On a 4-sockets Intel-Skylake machine: compiler base head gcc-4.4.7 13162890 13508193 +2.6% gcc-4.9.4 14997463 15484353 +3.2% gcc-5.4.1 14708711 15449805 +5.0% gcc-6.2.0 14574099 15349204 +5.3% The above 4 compilers are used because I've done the tests through Intel's Linux Kernel Performance(LKP) infrastructure and they are the available compilers there. The benefit being less on 4 sockets machine is due to the lock contention there(perf-profile/native_queued_spin_lock_slowpath=81%) is less severe than on the 2 sockets machine(85%). What the benchmark does is: it forks nr_cpu processes and then each process does the following: 1 mmap() 128M anonymous space; 2 writes to each page there to trigger actual page allocation; 3 munmap() it. in a loop. https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c Binary size wise, I have locally built them with different compilers: [aaron@aaronlu obj]$ size */*/mm/page_alloc.o text data bss dec hex filename 37409 9904 8524 55837 da1d gcc-4.9.4/base/mm/page_alloc.o 38273 9904 8524 56701 dd7d gcc-4.9.4/head/mm/page_alloc.o 37465 9840 8428 55733 d9b5 gcc-5.5.0/base/mm/page_alloc.o 38169 9840 8428 56437 dc75 gcc-5.5.0/head/mm/page_alloc.o 37573 9840 8428 55841 da21 gcc-6.4.0/base/mm/page_alloc.o 38261 9840 8428 56529 dcd1 gcc-6.4.0/head/mm/page_alloc.o 36863 9840 8428 55131 d75b gcc-7.2.0/base/mm/page_alloc.o 37711 9840 8428 55979 daab gcc-7.2.0/head/mm/page_alloc.o Text size increased about 800 bytes for mm/page_alloc.o. [aaron@aaronlu obj]$ size */*/vmlinux text data bss dec hex filename 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/base/vmlinux 10342757 5903208 17723392 33969357 20654cd gcc-4.9.4/head/vmlinux 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/base/vmlinux 10332448 5836608 17715200 33884256 2050860 gcc-5.5.0/head/vmlinux 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/base/vmlinux 10094546 5836696 17715200 33646442 201676a gcc-6.4.0/head/vmlinux 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/base/vmlinux 10018775 5828732 17715200 33562707 2002053 gcc-7.2.0/head/vmlinux Text size for vmlinux has no change though, probably due to function alignment. Link: http://lkml.kernel.org/r/20171013063111.GA26032@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: stop zeroing memory during allocation in vmemmapPavel Tatashin
vmemmap_alloc_block() will no longer zero the block, so zero memory at its call sites for everything except struct pages. Struct page memory is zero'd by struct page initialization. Replace allocators in sparse-vmemmap to use the non-zeroing version. So, we will get the performance improvement by zeroing the memory in parallel when struct pages are zeroed. Add struct page zeroing as a part of initialization of other fields in __init_single_page(). This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes): BASE FIX sparse_init 11.244671836s 0.007199623s zone_sizes_init 4.879775891s 8.355182299s -------------------------- Total 16.124447727s 8.362381922s sparse_init is where memory for struct pages is zeroed, and the zeroing part is moved later in this patch into __init_single_page(), which is called from zone_sizes_init(). [akpm@linux-foundation.org: make vmemmap_alloc_block_zero() private to sparse-vmemmap.c] Link: http://lkml.kernel.org/r/20171013173214.27300-10-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Tested-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: zero reserved and unavailable struct pagesPavel Tatashin
Some memory is reserved but unavailable: not present in memblock.memory (because not backed by physical pages), but present in memblock.reserved. Such memory has backing struct pages, but they are not initialized by going through __init_single_page(). In some cases these struct pages are accessed even if they do not contain any data. One example is page_to_pfn() might access page->flags if this is where section information is stored (CONFIG_SPARSEMEM, SECTION_IN_PAGE_FLAGS). One example of such memory: trim_low_memory_range() unconditionally reserves from pfn 0, but e820__memblock_setup() might provide the exiting memory from pfn 1 (i.e. KVM). Since struct pages are zeroed in __init_single_page(), and not during allocation time, we must zero such struct pages explicitly. The patch involves adding a new memblock iterator: for_each_resv_unavail_range(i, p_start, p_end) Which iterates through reserved && !memory lists, and we zero struct pages explicitly by calling mm_zero_struct_page(). === Here is more detailed example of problem that this patch is addressing: Run tested on qemu with the following arguments: -enable-kvm -cpu kvm64 -m 512 -smp 2 This patch reports that there are 98 unavailable pages. They are: pfn 0 and pfns in range [159, 255]. Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does not reserve [159, 255] ones. e820__memblock_setup() reports linux that the following physical ranges are available: [1 , 158] [256, 130783] Notice, that exactly unavailable pfns are missing! Now, lets check what we have in zone 0: [1, 131039] pfn 0, is not part of the zone, but pfns [1, 158], are. However, the bigger problem we have if we do not initialize these struct pages is with memory hotplug. Because, that path operates at 2M boundaries (section_nr). And checks if 2M range of pages is hot removable. It starts with first pfn from zone, rounds it down to 2M boundary (sturct pages are allocated at 2M boundaries when vmemmap is created), and checks if that section is hot removable. In this case start with pfn 1 and convert it down to pfn 0. Later pfn is converted to struct page, and some fields are checked. Now, if we do not zero struct pages, we get unpredictable results. In fact when CONFIG_VM_DEBUG is enabled, and we explicitly set all vmemmap memory to ones, the following panic is observed with kernel test without this patch applied: BUG: unable to handle kernel NULL pointer dereference at (null) IP: is_pageblock_removable_nolock+0x35/0x90 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT ... task: ffff88001f4e2900 task.stack: ffffc90000314000 RIP: 0010:is_pageblock_removable_nolock+0x35/0x90 Call Trace: ? is_mem_section_removable+0x5a/0xd0 show_mem_removable+0x6b/0xa0 dev_attr_show+0x1b/0x50 sysfs_kf_seq_show+0xa1/0x100 kernfs_seq_show+0x22/0x30 seq_read+0x1ac/0x3a0 kernfs_fop_read+0x36/0x190 ? security_file_permission+0x90/0xb0 __vfs_read+0x16/0x30 vfs_read+0x81/0x130 SyS_read+0x44/0xa0 entry_SYSCALL_64_fastpath+0x1f/0xbd Link: http://lkml.kernel.org/r/20171013173214.27300-7-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Tested-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: define memblock_virt_alloc_try_nid_rawPavel Tatashin
* A new variant of memblock_virt_alloc_* allocations: memblock_virt_alloc_try_nid_raw() - Does not zero the allocated memory - Does not panic if request cannot be satisfied * optimize early system hash allocations Clients can call alloc_large_system_hash() with flag: HASH_ZERO to specify that memory that was allocated for system hash needs to be zeroed, otherwise the memory does not need to be zeroed, and client will initialize it. If memory does not need to be zero'd, call the new memblock_virt_alloc_raw() interface, and thus improve the boot performance. * debug for raw alloctor When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no places excpect zeroed memory. Link: http://lkml.kernel.org/r/20171013173214.27300-6-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Tested-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15mm: deferred_init_memmap improvementsPavel Tatashin
Patch series "complete deferred page initialization", v12. SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option, which defers initializing struct pages until all cpus have been started so it can be done in parallel. However, this feature is sub-optimal, because the deferred page initialization code expects that the struct pages have already been zeroed, and the zeroing is done early in boot with a single thread only. Also, we access that memory and set flags before struct pages are initialized. All of this is fixed in this patchset. In this work we do the following: - Never read access struct page until it was initialized - Never set any fields in struct pages before they are initialized - Zero struct page at the beginning of struct page initialization ========================================================================== Performance improvements on x86 machine with 8 nodes: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory: TIME SPEED UP base no deferred: 95.796233s fix no deferred: 79.978956s 19.77% base deferred: 77.254713s fix deferred: 55.050509s 40.34% ========================================================================== SPARC M6 3600 MHz with 15T of memory TIME SPEED UP base no deferred: 358.335727s fix no deferred: 302.320936s 18.52% base deferred: 237.534603s fix deferred: 182.103003s 30.44% ========================================================================== Raw dmesg output with timestamps: x86 base no deferred: https://hastebin.com/ofunepurit.scala x86 base deferred: https://hastebin.com/ifazegeyas.scala x86 fix no deferred: https://hastebin.com/pegocohevo.scala x86 fix deferred: https://hastebin.com/ofupevikuk.scala sparc base no deferred: https://hastebin.com/ibobeteken.go sparc base deferred: https://hastebin.com/fariqimiyu.go sparc fix no deferred: https://hastebin.com/muhegoheyi.go sparc fix deferred: https://hastebin.com/xadinobutu.go This patch (of 11): deferred_init_memmap() is called when struct pages are initialized later in boot by slave CPUs. This patch simplifies and optimizes this function, and also fixes a couple issues (described below). The main change is that now we are iterating through free memblock areas instead of all configured memory. Thus, we do not have to check if the struct page has already been initialized. ===== In deferred_init_memmap() where all deferred struct pages are initialized we have a check like this: if (page->flags) { VM_BUG_ON(page_zone(page) != zone); goto free_range; } This way we are checking if the current deferred page has already been initialized. It works, because memory for struct pages has been zeroed, and the only way flags are not zero if it went through __init_single_page() before. But, once we change the current behavior and won't zero the memory in memblock allocator, we cannot trust anything inside "struct page"es until they are initialized. This patch fixes this. The deferred_init_memmap() is re-written to loop through only free memory ranges provided by memblock. Note, this first issue is relevant only when the following change is merged: ===== This patch fixes another existing issue on systems that have holes in zones i.e CONFIG_HOLES_IN_ZONE is defined. In for_each_mem_pfn_range() we have code like this: if (!pfn_valid_within(pfn) goto free_range; Note: 'page' is not set to NULL and is not incremented but 'pfn' advances. Thus means if deferred struct pages are enabled on systems with these kind of holes, linux would get memory corruptions. I have fixed this issue by defining a new macro that performs all the necessary operations when we free the current set of pages. [pasha.tatashin@oracle.com: buddy page accessed before initialized] Link: http://lkml.kernel.org/r/20171102170221.7401-2-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20171013173214.27300-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Tested-by: Bob Picco <bob.picco@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15kmemcheck: remove annotationsLevin, Alexander (Sasha Levin)
Patch series "kmemcheck: kill kme