summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2017-09-07Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...
2017-09-07Merge tag 'for-linus-4.14b-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - the new pvcalls backend for routing socket calls from a guest to dom0 - some cleanups of Xen code - a fix for wrong usage of {get,put}_cpu() * tag 'for-linus-4.14b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (27 commits) xen/mmu: set MMU_NORMAL_PT_UPDATE in remap_area_mfn_pte_fn xen: Don't try to call xen_alloc_p2m_entry() on autotranslating guests xen/events: events_fifo: Don't use {get,put}_cpu() in xen_evtchn_fifo_init() xen/pvcalls: use WARN_ON(1) instead of __WARN() xen: remove not used trace functions xen: remove unused function xen_set_domain_pte() xen: remove tests for pvh mode in pure pv paths xen-platform: constify pci_device_id. xen: cleanup xen.h xen: introduce a Kconfig option to enable the pvcalls backend xen/pvcalls: implement write xen/pvcalls: implement read xen/pvcalls: implement the ioworker functions xen/pvcalls: disconnect and module_exit xen/pvcalls: implement release command xen/pvcalls: implement poll command xen/pvcalls: implement accept command xen/pvcalls: implement listen command xen/pvcalls: implement bind command xen/pvcalls: implement connect command ...
2017-09-07Merge tag 'powerpc-4.14-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Nothing really major this release, despite quite a lot of activity. Just lots of things all over the place. Some things of note include: - Access via perf to a new type of PMU (IMC) on Power9, which can count both core events as well as nest unit events (Memory controller etc). - Optimisations to the radix MMU TLB flushing, mostly to avoid unnecessary Page Walk Cache (PWC) flushes when the structure of the tree is not changing. - Reworks/cleanups of do_page_fault() to modernise it and bring it closer to other architectures where possible. - Rework of our page table walking so that THP updates only need to send IPIs to CPUs where the affected mm has run, rather than all CPUs. - The size of our vmalloc area is increased to 56T on 64-bit hash MMU systems. This avoids problems with the percpu allocator on systems with very sparse NUMA layouts. - STRICT_KERNEL_RWX support on PPC32. - A new sched domain topology for Power9, to capture the fact that pairs of cores may share an L2 cache. - Power9 support for VAS, which is a new mechanism for accessing coprocessors, and initial support for using it with the NX compression accelerator. - Major work on the instruction emulation support, adding support for many new instructions, and reworking it so it can be used to implement the emulation needed to fixup alignment faults. - Support for guests under PowerVM to use the Power9 XIVE interrupt controller. And probably that many things again that are almost as interesting, but I had to keep the list short. Plus the usual fixes and cleanups as always. Thanks to: Alexey Kardashevskiy, Alistair Popple, Andreas Schwab, Aneesh Kumar K.V, Anju T Sudhakar, Arvind Yadav, Balbir Singh, Benjamin Herrenschmidt, Bhumika Goyal, Breno Leitao, Bryant G. Ly, Christophe Leroy, Cédric Le Goater, Dan Carpenter, Dou Liyang, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Hannes Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall, LABBE Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring, Masahiro Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Rashmica Gupta, Rob Herring, Rui Teng, Sam Bobroff, Santosh Sivaraj, Scott Wood, Shilpasri G Bhat, Sukadev Bhattiprolu, Suraj Jitindar Singh, Tobin C. Harding, Victor Aoqui" * tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (321 commits) powerpc/xive: Fix section __init warning powerpc: Fix kernel crash in emulation of vector loads and stores powerpc/xive: improve debugging macros powerpc/xive: add XIVE Exploitation Mode to CAS powerpc/xive: introduce H_INT_ESB hcall powerpc/xive: add the HW IRQ number under xive_irq_data powerpc/xive: introduce xive_esb_write() powerpc/xive: rename xive_poke_esb() in xive_esb_read() powerpc/xive: guest exploitation of the XIVE interrupt controller powerpc/xive: introduce a common routine xive_queue_page_alloc() powerpc/sstep: Avoid used uninitialized error axonram: Return directly after a failed kzalloc() in axon_ram_probe() axonram: Improve a size determination in axon_ram_probe() axonram: Delete an error message for a failed memory allocation in axon_ram_probe() powerpc/powernv/npu: Move tlb flush before launching ATSD powerpc/macintosh: constify wf_sensor_ops structures powerpc/iommu: Use permission-specific DEVICE_ATTR variants powerpc/eeh: Delete an error out of memory message at init time powerpc/mm: Use seq_putc() in two functions macintosh: Convert to using %pOF instead of full_name ...
2017-09-07Merge branch 'efi-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI updates from Ingo Molnar: "The main changes in this cycle were: - Transparently fall back to other poweroff method(s) if EFI poweroff fails (and returns) - Use separate PE/COFF section headers for the RX and RW parts of the ARM stub loader so that the firmware can use strict mapping permissions - Add support for requesting the firmware to wipe RAM at warm reboot - Increase the size of the random seed obtained from UEFI so CRNG fast init can complete earlier - Update the EFI framebuffer address if it points to a BAR that gets moved by the PCI resource allocation code - Enable "reset attack mitigation" of TPM environments: this is enabled if the kernel is configured with CONFIG_RESET_ATTACK_MITIGATION=y. - Clang related fixes - Misc cleanups, constification, refactoring, etc" * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi/bgrt: Use efi_mem_type() efi: Move efi_mem_type() to common code efi/reboot: Make function pointer orig_pm_power_off static efi/random: Increase size of firmware supplied randomness efi/libstub: Enable reset attack mitigation firmware/efi/esrt: Constify attribute_group structures firmware/efi: Constify attribute_group structures firmware/dcdbas: Constify attribute_group structures arm/efi: Split zImage code and data into separate PE/COFF sections arm/efi: Replace open coded constants with symbolic ones arm/efi: Remove pointless dummy .reloc section arm/efi: Remove forbidden values from the PE/COFF header drivers/fbdev/efifb: Allow BAR to be moved instead of claiming it efi/reboot: Fall back to original power-off method if EFI_RESET_SHUTDOWN returns efi/arm/arm64: Add missing assignment of efi.config_table efi/libstub/arm64: Set -fpie when building the EFI stub efi/libstub/arm64: Force 'hidden' visibility for section markers efi/libstub/arm64: Use hidden attribute for struct screen_info reference efi/arm: Don't mark ACPI reclaim memory as MEMBLOCK_NOMAP
2017-09-07Merge branch 'x86-platform-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 platform updates from Ingo Molnar: "The main changes include various Hyper-V optimizations such as faster hypercalls and faster/better TLB flushes - and there's also some Intel-MID cleanups" * 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tracing/hyper-v: Trace hyperv_mmu_flush_tlb_others() x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls x86/platform/intel-mid: Make several arrays static, to make code smaller MAINTAINERS: Add missed file for Hyper-V x86/hyper-v: Use hypercall for remote TLB flush hyper-v: Globalize vp_index x86/hyper-v: Implement rep hypercalls hyper-v: Use fast hypercall for HVCALL_SIGNAL_EVENT x86/hyper-v: Introduce fast hypercall implementation x86/hyper-v: Make hv_do_hypercall() inline x86/hyper-v: Include hyperv/ only when CONFIG_HYPERV is set x86/platform/intel-mid: Make 'bt_sfi_data' const x86/platform/intel-mid: Make IRQ allocation a bit more flexible x86/platform/intel-mid: Group timers callbacks together
2017-09-06Merge branch 'for-4.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata updates from Tejun Heo: "Except for the ahci fix that fixes a boot issue, nothing major in this pull request. Some new platform controller support and device specific changes" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: libata: zpodd: make arrays cdb static, reduces object code size ahci: don't use MSI for devices with the silly Intel NVMe remapping scheme dt-bindings: ata: add DT bindings for MediaTek SATA controller ata: mediatek: add support for MediaTek SATA controller pata_octeon_cf: use of_property_read_{bool|u32}() cs5536: add support for IDE controller variant ata: sata_gemini: Introduce explicit IDE pin control ata: sata_gemini: Retire custom pin control ata: ahci_platform: Add shutdown handler ata: sata_gemini: explicitly request exclusive reset control ata: Drop unnecessary static ata: Convert to using %pOF instead of full_name
2017-09-06Merge branch 'for-4.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "Several notable changes this cycle: - Thread mode was merged. This will be used for cgroup2 support for CPU and possibly other controllers. Unfortunately, CPU controller cgroup2 support didn't make this pull request but most contentions have been resolved and the support is likely to be merged before the next merge window. - cgroup.stat now shows the number of descendant cgroups. - cpuset now can enable the easier-to-configure v2 behavior on v1 hierarchy" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) cpuset: Allow v2 behavior in v1 cgroup cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup cgroup: remove unneeded checks cgroup: misc changes cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy cgroup: re-use the parent pointer in cgroup_destroy_locked() cgroup: add cgroup.stat interface with basic hierarchy stats cgroup: implement hierarchy limits cgroup: keep track of number of descent cgroups cgroup: add comment to cgroup_enable_threaded() cgroup: remove unnecessary empty check when enabling threaded mode cgroup: update debug controller to print out thread mode information cgroup: implement cgroup v2 thread support cgroup: implement CSS_TASK_ITER_THREADED cgroup: introduce cgroup->dom_cgrp and threaded css_set handling cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS cgroup: reorganize cgroup.procs / task write path cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets cgroup: distinguish local and children populated states cgroup: remove now unused list_head @pending in cgroup_apply_cftypes() ...
2017-09-06Merge branch 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "Nothing major. I introduced a flag collsion bug during v4.13 cycle which is fixed in this pull request. Fortunately, the flag is for debugging / verification and the bug isn't critical" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Fix flag collision workqueue: Use TASK_IDLE workqueue: fix path to documentation workqueue: doc change for ST behavior on NUMA systems
2017-09-06Merge branch 'for-4.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "A lot of changes for percpu this time around. percpu inherited the same area allocator from the original pre-virtual-address-mapped implementation. This was from the time when percpu allocator wasn't used all that much and the implementation was focused on simplicity, with the unfortunate computational complexity of O(number of areas allocated from the chunk) per alloc / free. With the increase in percpu usage, we're hitting cases where the lack of scalability is hurting. The most prominent one right now is bpf perpcu map creation / destruction which may allocate and free a lot of entries consecutively and it's likely that the problem will become more prominent in the future. To address the issue, Dennis replaced the area allocator with hinted bitmap allocator which is more consistent. While the new allocator does perform a bit worse in some cases, it outperforms the old allocator way more than an order of magnitude in other more common scenarios while staying mostly flat in CPU overhead and completely flat in memory consumption" * 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (27 commits) percpu: update header to contain bitmap allocator explanation. percpu: update pcpu_find_block_fit to use an iterator percpu: use metadata blocks to update the chunk contig hint percpu: update free path to take advantage of contig hints percpu: update alloc path to only scan if contig hints are broken percpu: keep track of the best offset for contig hints percpu: skip chunks if the alloc does not fit in the contig hint percpu: add first_bit to keep track of the first free in the bitmap percpu: introduce bitmap metadata blocks percpu: replace area map allocator with bitmap percpu: generalize bitmap (un)populated iterators percpu: increase minimum percpu allocation size and align first regions percpu: introduce nr_empty_pop_pages to help empty page accounting percpu: change the number of pages marked in the first_chunk pop bitmap percpu: combine percpu address checks percpu: modify base_addr to be region specific percpu: setup_first_chunk rename schunk/dchunk to chunk percpu: end chunk area maps page aligned for the populated bitmap percpu: unify allocation of schunk and dchunk percpu: setup_first_chunk remove dyn_size and consolidate logic ...
2017-09-06Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: - various misc bits - DAX updates - OCFS2 - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits) mm,fork: introduce MADV_WIPEONFORK x86,mpx: make mpx depend on x86-64 to free up VMA flag mm: add /proc/pid/smaps_rollup mm: hugetlb: clear target sub-page last when clearing huge page mm: oom: let oom_reap_task and exit_mmap run concurrently swap: choose swap device according to numa node mm: replace TIF_MEMDIE checks by tsk_is_oom_victim mm, oom: do not rely on TIF_MEMDIE for memory reserves access z3fold: use per-cpu unbuddied lists mm, swap: don't use VMA based swap readahead if HDD is used as swap mm, swap: add sysfs interface for VMA based swap readahead mm, swap: VMA based swap readahead mm, swap: fix swap readahead marking mm, swap: add swap readahead hit statistics mm/vmalloc.c: don't reinvent the wheel but use existing llist API mm/vmstat.c: fix wrong comment selftests/memfd: add memfd_create hugetlbfs selftest mm/shmem: add hugetlbfs support to memfd_create() mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas() ...
2017-09-06x86/mm: Document how CR4.PCIDE restore worksAndy Lutomirski
While debugging a problem, I thought that using cr4_set_bits_and_update_boot() to restore CR4.PCIDE would be helpful. It turns out to be counterproductive. Add a comment documenting how this works. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06x86/mm: Reinitialize TLB state on hotplug and resumeAndy Lutomirski
When Linux brings a CPU down and back up, it switches to init_mm and then loads swapper_pg_dir into CR3. With PCID enabled, this has the side effect of masking off the ASID bits in CR3. This can result in some confusion in the TLB handling code. If we bring a CPU down and back up with any ASID other than 0, we end up with the wrong ASID active on the CPU after resume. This could cause our internal state to become corrupt, although major corruption is unlikely because init_mm doesn't have any user pages. More obviously, if CONFIG_DEBUG_VM=y, we'll trip over an assertion in the next context switch. The result of *that* is a failure to resume from suspend with probability 1 - 1/6^(cpus-1). Fix it by reinitializing cpu_tlbstate on resume and CPU bringup. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Jiri Kosina <jikos@kernel.org> Fixes: 10af6235e0d3 ("x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm,fork: introduce MADV_WIPEONFORKRik van Riel
Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty in the child process after fork. This differs from MADV_DONTFORK in one important way. If a child process accesses memory that was MADV_WIPEONFORK, it will get zeroes. The address ranges are still valid, they are just empty. If a child process accesses memory that was MADV_DONTFORK, it will get a segmentation fault, since those address ranges are no longer valid in the child after fork. Since MADV_DONTFORK also seems to be used to allow very large programs to fork in systems with strict memory overcommit restrictions, changing the semantics of MADV_DONTFORK might break existing programs. MADV_WIPEONFORK only works on private, anonymous VMAs. The use case is libraries that store or cache information, and want to know that they need to regenerate it in the child process after fork. Examples of this would be: - systemd/pulseaudio API checks (fail after fork) (replacing a getpid check, which is too slow without a PID cache) - PKCS#11 API reinitialization check (mandated by specification) - glibc's upcoming PRNG (reseed after fork) - OpenSSL PRNG (reseed after fork) The security benefits of a forking server having a re-inialized PRNG in every child process are pretty obvious. However, due to libraries having all kinds of internal state, and programs getting compiled with many different versions of each library, it is unreasonable to expect calling programs to re-initialize everything manually after fork. A further complication is the proliferation of clone flags, programs bypassing glibc's functions to call clone directly, and programs calling unshare, causing the glibc pthread_atfork hook to not get called. It would be better to have the kernel take care of this automatically. The patch also adds MADV_KEEPONFORK, to undo the effects of a prior MADV_WIPEONFORK. This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO: https://man.openbsd.org/minherit.2 [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines] Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Florian Weimer <fweimer@redhat.com> Reported-by: Colm MacCártaigh <colm@allcosts.net> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Drewry <wad@chromium.org> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06x86,mpx: make mpx depend on x86-64 to free up VMA flagRik van Riel
Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4. If a child process accesses memory that was MADV_WIPEONFORK, it will get zeroes. The address ranges are still valid, they are just empty. If a child process accesses memory that was MADV_DONTFORK, it will get a segmentation fault, since those address ranges are no longer valid in the child after fork. Since MADV_DONTFORK also seems to be used to allow very large programs to fork in systems with strict memory overcommit restrictions, changing the semantics of MADV_DONTFORK might break existing programs. The use case is libraries that store or cache information, and want to know that they need to regenerate it in the child process after fork. Examples of this would be: - systemd/pulseaudio API checks (fail after fork) (replacing a getpid check, which is too slow without a PID cache) - PKCS#11 API reinitialization check (mandated by specification) - glibc's upcoming PRNG (reseed after fork) - OpenSSL PRNG (reseed after fork) The security benefits of a forking server having a re-inialized PRNG in every child process are pretty obvious. However, due to libraries having all kinds of internal state, and programs getting compiled with many different versions of each library, it is unreasonable to expect calling programs to re-initialize everything manually after fork. A further complication is the proliferation of clone flags, programs bypassing glibc's functions to call clone directly, and programs calling unshare, causing the glibc pthread_atfork hook to not get called. It would be better to have the kernel take care of this automatically. The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior MADV_WIPEONFORK. This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO: https://man.openbsd.org/minherit.2 This patch (of 2): MPX only seems to be available on 64 bit CPUs, starting with Skylake and Goldmont. Move VM_MPX into the 64 bit only portion of vma->vm_flags, in order to free up a VMA flag. Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.com Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Colm MacCártaigh <colm@allcosts.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm: add /proc/pid/smaps_rollupDaniel Colascione
/proc/pid/smaps_rollup is a new proc file that improves the performance of user programs that determine aggregate memory statistics (e.g., total PSS) of a process. Android regularly "samples" the memory usage of various processes in order to balance its memory pool sizes. This sampling process involves opening /proc/pid/smaps and summing certain fields. For very large processes, sampling memory use this way can take several hundred milliseconds, due mostly to the overhead of the seq_printf calls in task_mmu.c. smaps_rollup improves the situation. It contains most of the fields of /proc/pid/smaps, but instead of a set of fields for each VMA, smaps_rollup instead contains one synthetic smaps-format entry representing the whole process. In the single smaps_rollup synthetic entry, each field is the summation of the corresponding field in all of the real-smaps VMAs. Using a common format for smaps_rollup and smaps allows userspace parsers to repurpose parsers meant for use with non-rollup smaps for smaps_rollup, and it allows userspace to switch between smaps_rollup and smaps at runtime (say, based on the availability of smaps_rollup in a given kernel) with minimal fuss. By using smaps_rollup instead of smaps, a caller can avoid the significant overhead of formatting, reading, and parsing each of a large process's potentially very numerous memory mappings. For sampling system_server's PSS in Android, we measured a 12x speedup, representing a savings of several hundred milliseconds. One alternative to a new per-process proc file would have been including PSS information in /proc/pid/status. We considered this option but thought that PSS would be too expensive (by a few orders of magnitude) to collect relative to what's already emitted as part of /proc/pid/status, and slowing every user of /proc/pid/status for the sake of readers that happen to want PSS feels wrong. The code itself works by reusing the existing VMA-walking framework we use for regular smaps generation and keeping the mem_size_stats structure around between VMA walks instead of using a fresh one for each VMA. In this way, summation happens automatically. We let seq_file walk over the VMAs just as it does for regular smaps and just emit nothing to the seq_file until we hit the last VMA. Benchmarks: using smaps: iterations:1000 pid:1163 pss:220023808 0m29.46s real 0m08.28s user 0m20.98s system using smaps_rollup: iterations:1000 pid:1163 pss:220702720 0m04.39s real 0m00.03s user 0m04.31s system We're using the PSS samples we collect asynchronously for system-management tasks like fine-tuning oom_adj_score, memory use tracking for debugging, application-level memory-use attribution, and deciding whether we want to kill large processes during system idle maintenance windows. Android has been using PSS for these purposes for a long time; as the average process VMA count has increased and and devices become more efficiency-conscious, PSS-collection inefficiency has started to matter more. IMHO, it'd be a lot safer to optimize the existing PSS-collection model, which has been fine-tuned over the years, instead of changing the memory tracking approach entirely to work around smaps-generation inefficiency. Tim said: : There are two main reasons why Android gathers PSS information: : : 1. Android devices can show the user the amount of memory used per : application via the settings app. This is a less important use case. : : 2. We log PSS to help identify leaks in applications. We have found : an enormous number of bugs (in the Android platform, in Google's own : apps, and in third-party applications) using this data. : : To do this, system_server (the main process in Android userspace) will : sample the PSS of a process three seconds after it changes state (for : example, app is launched and becomes the foreground application) and about : every ten minutes after that. The net result is that PSS collection is : regularly running on at least one process in the system (usually a few : times a minute while the screen is on, less when screen is off due to : suspend). PSS of a process is an incredibly useful stat to track, and we : aren't going to get rid of it. We've looked at some very hacky approaches : using RSS ("take the RSS of the target process, subtract the RSS of the : zygote process that is the parent of all Android apps") to reduce the : accounting time, but it regularly overestimated the memory used by 20+ : percent. Accordingly, I don't think that there's a good alternative to : using PSS. : : We started looking into PSS collection performance after we noticed random : frequency spikes while a phone's screen was off; occasionally, one of the : CPU clusters would ramp to a high frequency because there was 200-300ms of : constant CPU work from a single thread in the main Android userspace : process. The work causing the spike (which is reasonable governor : behavior given the amount of CPU time needed) was always PSS collection. : As a result, Android is burning more power than we should be on PSS : collection. : : The other issue (and why I'm less sure about improving smaps as a : long-term solution) is that the number of VMAs per process has increased : significantly from release to release. After trying to figure out why we : were seeing these 200-300ms PSS collection times on Android O but had not : noticed it in previous versions, we found that the number of VMAs in the : main system process increased by 50% from Android N to Android O (from : ~1800 to ~2700) and varying increases in every userspace process. Android : M to N also had an increase in the number of VMAs, although not as much. : I'm not sure why this is increasing so much over time, but thinking about : ASLR and ways to make ASLR better, I expect that this will continue to : increase going forward. I would not be surprised if we hit 5000 VMAs on : the main Android process (system_server) by 2020. : : If we assume that the number of VMAs is going to increase over time, then : doing anything we can do to reduce the overhead of each VMA during PSS : collection seems like the right way to go, and that means outputting an : aggregate statistic (to avoid whatever overhead there is per line in : writing smaps and in reading each line from userspace). Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com Signed-off-by: Daniel Colascione <dancol@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sonny Rao <sonnyrao@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm: hugetlb: clear target sub-page last when clearing huge pageHuang Ying
Huge page helps to reduce TLB miss rate, but it has higher cache footprint, sometimes this may cause some issue. For example, when clearing huge page on x86_64 platform, the cache footprint is 2M. But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC (last level cache). That is, in average, there are 2.5M LLC for each core and 1.25M LLC for each thread. If the cache pressure is heavy when clearing the huge page, and we clear the huge page from the begin to the end, it is possible that the begin of huge page is evicted from the cache after we finishing clearing the end of the huge page. And it is possible for the application to access the begin of the huge page after clearing the huge page. To help the above situation, in this patch, when we clear a huge page, the order to clear sub-pages is changed. In quite some situation, we can get the address that the application will access after we clear the huge page, for example, in a page fault handler. Instead of clearing the huge page from begin to end, we will clear the sub-pages farthest from the the sub-page to access firstly, and clear the sub-page to access last. This will make the sub-page to access most cache-hot and sub-pages around it more cache-hot too. If we cannot know the address the application will access, the begin of the huge page is assumed to be the the address the application will access. With this patch, the throughput increases ~28.3% in vm-scalability anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699 system (36 cores, 72 threads). The test case creates 72 processes, each process mmap a big anonymous memory area and writes to it from the begin to the end. For each process, other processes could be seen as other workload which generates heavy cache pressure. At the same time, the cache miss rate reduced from ~33.4% to ~31.7%, the IPC (instruction per cycle) increased from 0.56 to 0.74, and the time spent in user space is reduced ~7.9% Christopher Lameter suggests to clear bytes inside a sub-page from end to begin too. But tests show no visible performance difference in the tests. May because the size of page is small compared with the cache size. Thanks Andi Kleen to propose to use address to access to determine the order of sub-pages to clear. The hugetlbfs access address could be improved, will do that in another patch. [ying.huang@intel.com: improve readability of clear_huge_page()] Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.com Suggested-by: Andi Kleen <andi.kleen@intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Nadia Yvette Chambers <nyc@holomorphy.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm: oom: let oom_reap_task and exit_mmap run concurrentlyAndrea Arcangeli
This is purely required because exit_aio() may block and exit_mmap() may never start, if the oom_reap_task cannot start running on a mm with mm_users == 0. At the same time if the OOM reaper doesn't wait at all for the memory of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would generate a spurious OOM kill. If it wasn't because of the exit_aio or similar blocking functions in the last mmput, it would be enough to change the oom_reap_task() in the case it finds mm_users == 0, to wait for a timeout or to wait for __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the problem here so the concurrency of exit_mmap and oom_reap_task is apparently warranted. It's a non standard runtime, exit_mmap() runs without mmap_sem, and oom_reap_task runs with the mmap_sem for reading as usual (kind of MADV_DONTNEED). The race between the two is solved with a combination of tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP (serialized by a dummy down_write/up_write cycle on the same lines of the ksm_exit method). If the oom_reap_task() may be running concurrently during exit_mmap, exit_mmap will wait it to finish in down_write (before taking down mm structures that would make the oom_reap_task fail with use after free). If exit_mmap comes first, oom_reap_task() will skip the mm if MMF_OOM_SKIP is already set and in turn all memory is already freed and furthermore the mm data structures may already have been taken down by free_pgtables. [aarcange@redhat.com: incremental one liner] Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com [rientjes@google.com: remove unused mmput_async] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com [aarcange@redhat.com: microoptimization] Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com Fixes: 26db62f179d1 ("oom: keep mm of the killed task available") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: David Rientjes <rientjes@google.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06swap: choose swap device according to numa nodeAaron Lu
If the system has more than one swap device and swap device has the node information, we can make use of this information to decide which swap device to use in get_swap_pages() to get better performance. The current code uses a priority based list, swap_avail_list, to decide which swap device to use and if multiple swap devices share the same priority, they are used round robin. This patch changes the previous single global swap_avail_list into a per-numa-node list, i.e. for each numa node, it sees its own priority based list of available swap devices. Swap device's priority can be promoted on its matching node's swap_avail_list. The current swap device's priority is set as: user can set a >=0 value, or the system will pick one starting from -1 then downwards. The priority value in the swap_avail_list is the negated value of the swap device's due to plist being sorted from low to high. The new policy doesn't change the semantics for priority >=0 cases, the previous starting from -1 then downwards now becomes starting from -2 then downwards and -1 is reserved as the promoted value. Take 4-node EX machine as an example, suppose 4 swap devices are available, each sit on a different node: swapA on node 0 swapB on node 1 swapC on node 2 swapD on node 3 After they are all swapped on in the sequence of ABCD. Current behaviour: their priorities will be: swapA: -1 swapB: -2 swapC: -3 swapD: -4 And their position in the global swap_avail_list will be: swapA -> swapB -> swapC -> swapD prio:1 prio:2 prio:3 prio:4 New behaviour: their priorities will be(note that -1 is skipped): swapA: -2 swapB: -3 swapC: -4 swapD: -5 And their positions in the 4 swap_avail_lists[nid] will be: swap_avail_lists[0]: /* node 0's available swap device list */ swapA -> swapB -> swapC -> swapD prio:1 prio:3 prio:4 prio:5 swap_avali_lists[1]: /* node 1's available swap device list */ swapB -> swapA -> swapC -> swapD prio:1 prio:2 prio:4 prio:5 swap_avail_lists[2]: /* node 2's available swap device list */ swapC -> swapA -> swapB -> swapD prio:1 prio:2 prio:3 prio:5 swap_avail_lists[3]: /* node 3's available swap device list */ swapD -> swapA -> swapB -> swapC prio:1 prio:2 prio:3 prio:4 To see the effect of the patch, a test that starts N process, each mmap a region of anonymous memory and then continually write to it at random position to trigger both swap in and out is used. On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives are used as swap devices with each attached to a different node, the result is: runtime=30m/processes=32/total test size=128G/each process mmap region=4G kernel throughput vanilla 13306 auto-binding 15169 +14% runtime=30m/processes=64/total test size=128G/each process mmap region=2G kernel throughput vanilla 11885 auto-binding 14879 +25% [aaron.lu@intel.com: v2] Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com [akpm@linux-foundation.org: use kmalloc_array()] Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm: replace TIF_MEMDIE checks by tsk_is_oom_victimMichal Hocko
TIF_MEMDIE is set only to the tasks whick were either directly selected by the OOM killer or passed through mark_oom_victim from the allocator path. tsk_is_oom_victim is more generic and allows to identify all tasks (threads) which share the mm with the oom victim. Please note that the freezer still needs to check TIF_MEMDIE because we cannot thaw tasks which do not participage in oom_victims counting otherwise a !TIF_MEMDIE task could interfere after oom_disbale returns. Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm, oom: do not rely on TIF_MEMDIE for memory reserves accessMichal Hocko
For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the kthreads") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06z3fold: use per-cpu unbuddied listsVitaly Wool
It's been noted that z3fold doesn't scale well when it's run in a large number of threads on many cores, which can be easily reproduced with fio 'randrw' test with --numjobs=32. E.g. the result for 1 cluster (4 cores) is: Run status group 0 (all jobs): READ: io=244785MB, aggrb=496883KB/s, minb=15527KB/s, ... WRITE: io=246735MB, aggrb=500841KB/s, minb=15651KB/s, ... While for 8 cores (2 clusters) the result is: Run status group 0 (all jobs): READ: io=244785MB, aggrb=265942KB/s, minb=8310KB/s, ... WRITE: io=246735MB, aggrb=268060KB/s, minb=8376KB/s, ... The bottleneck here is the pool lock which many threads become waiting upon. To reduce that spin lock contention, z3fold can operate only on the lists local to the current CPU whenever possible. Due to the nature of z3fold unbuddied list handling (it only takes the first entry off the list on a hot path), if the z3fold pool is big enough and balanced well enough, limiting search to only local unbuddied list doesn't lead to a significant compression ratio degrade (2.57x vs 2.65x in our measurements). This patch also introduces two worker threads: one for async in-page object layout optimization and one for releasing freed pages. This is done to speed up z3fold_free() which is often on a hot path. The fio results for 8-core case are now the following: Run status group 0 (all jobs): READ: io=244785MB, aggrb=1568.3MB/s, minb=50182KB/s, ... WRITE: io=246735MB, aggrb=1580.8MB/s, minb=50582KB/s, ... So we're in for almost 6x performance increase. Link: http://lkml.kernel.org/r/20170806181443.f9b65018f8bde25ef990f9e8@gmail.com Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm, swap: don't use VMA based swap readahead if HDD is used as swapHuang Ying
VMA based swap readahead will readahead the virtual pages that is continuous in the virtual address space. While the original swap readahead will readahead the swap slots that is continuous in the swap device. Although VMA based swap readahead is more correct for the swap slots to be readahead, it will trigger more small random readings, which may cause the performance of HDD (hard disk) to degrade heavily, and may finally exceed the benefit. To avoid the issue, in this patch, if the HDD is used as swap, the VMA based swap readahead will be disabled, and the original swap readahead will be used instead. Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm, swap: add sysfs interface for VMA based swap readaheadHuang Ying
The sysfs interface to control the VMA based swap readahead is added as follow, /sys/kernel/mm/swap/vma_ra_enabled Enable the VMA based swap readahead algorithm, or use the original global swap readahead algorithm. /sys/kernel/mm/swap/vma_ra_max_order Set the max order of the readahead window size for the VMA based swap readahead algorithm. The corresponding ABI documentation is added too. Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm, swap: VMA based swap readaheadHuang Ying
The swap readahead is an important mechanism to reduce the swap in latency. Although pure sequential memory access pattern isn't very popular for anonymous memory, the space locality is still considered valid. In the original swap readahead implementation, the consecutive blocks in swap device are readahead based on the global space locality estimation. But the consecutive blocks in swap device just reflect the order of page reclaiming, don't necessarily reflect the access pattern in virtual memory. And the different tasks in the system may have different access patterns, which makes the global space locality estimation incorrect. In this patch, when page fault occurs, the virtual pages near the fault address will be readahead instead of the swap slots near the fault swap slot in swap device. This avoid to readahead the unrelated swap slots. At the same time, the swap readahead is changed to work on per-VMA from globally. So that the different access patterns of the different VMAs could be distinguished, and the different readahead policy could be applied accordingly. The original core readahead detection and scaling algorithm is reused, because it is an effect algorithm to detect the space locality. The test and result is as follow, Common test condition ===================== Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device: NVMe disk Micro-benchmark with combined access pattern ============================================ vm-scalability, sequential swap test case, 4 processes to eat 50G virtual memory space, repeat the sequential memory writing until 300 seconds. The first round writing will trigger swap out, the following rounds will trigger sequential swap in and out. At the same time, run vm-scalability random swap test case in background, 8 processes to eat 30G virtual memory space, repeat the random memory write until 300 seconds. This will trigger random swap-in in the background. This is a combined workload with sequential and random memory accessing at the same time. The result (for sequential workload) is as follow, Base Optimized ---- --------- throughput 345413 KB/s 414029 KB/s (+19.9%) latency.average 97.14 us 61.06 us (-37.1%) latency.50th 2 us 1 us latency.60th 2 us 1 us latency.70th 98 us 2 us latency.80th 160 us 2 us latency.90th 260 us 217 us latency.95th 346 us 369 us latency.99th 1.34 ms 1.09 ms ra_hit% 52.69% 99.98% The original swap readahead algorithm is confused by the background random access workload, so readahead hit rate is lower. The VMA-base readahead algorithm works much better. Linpack ======= The test memory size is bigger than RAM to trigger swapping. Base Optimized ---- --------- elapsed_time 393.49 s 329.88 s (-16.2%) ra_hit% 86.21% 98.82% The score of base and optimized kernel hasn't visible changes. But the elapsed time reduced and readahead hit rate improved, so the optimized kernel runs better for startup and tear down stages. And the absolute value of readahead hit rate is high, shows that the space locality is still valid in some practical workloads. Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm, swap: fix swap readahead markingHuang Ying
In the original implementation, it is possible that the existing pages in the swap cache (not newly readahead) could be marked as the readahead pages. This will cause the statistics of swap readahead be wrong and influence the swap readahead algorithm too. This is fixed via marking a page as the readahead page only if it is newly allocated and read from the disk. When testing with linpack, after the fixing the swap readahead hit rate increased from ~66% to ~86%. Link: http://lkml.kernel.org/r/20170807054038.1843-3-ying.huan