From 91d25ba8a6b0d810dc844cebeedc53029118ce3e Mon Sep 17 00:00:00 2001 From: Ross Zwisler Date: Wed, 6 Sep 2017 16:18:43 -0700 Subject: dax: use common 4k zero page for dax mmap reads When servicing mmap() reads from file holes the current DAX code allocates a page cache page of all zeroes and places the struct page pointer in the mapping->page_tree radix tree. This has three major drawbacks: 1) It consumes memory unnecessarily. For every 4k page that is read via a DAX mmap() over a hole, we allocate a new page cache page. This means that if you read 1GiB worth of pages, you end up using 1GiB of zeroed memory. This is easily visible by looking at the overall memory consumption of the system or by looking at /proc/[pid]/smaps: 7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data Size: 1048576 kB Rss: 1048576 kB Pss: 1048576 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 1048576 kB Private_Dirty: 0 kB Referenced: 1048576 kB Anonymous: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB 2) It is slower than using a common zero page because each page fault has more work to do. Instead of just inserting a common zero page we have to allocate a page cache page, zero it, and then insert it. Here are the average latencies of dax_load_hole() as measured by ftrace on a random test box: Old method, using zeroed page cache pages: 3.4 us New method, using the common 4k zero page: 0.8 us This was the average latency over 1 GiB of sequential reads done by this simple fio script: [global] size=1G filename=/root/dax/data fallocate=none [io] rw=read ioengine=mmap 3) The fact that we had to check for both DAX exceptional entries and for page cache pages in the radix tree made the DAX code more complex. Solve these issues by following the lead of the DAX PMD code and using a common 4k zero page instead. As with the PMD code we will now insert a DAX exceptional entry into the radix tree instead of a struct page pointer which allows us to remove all the special casing in the DAX code. Note that we do still pretty aggressively check for regular pages in the DAX radix tree, especially where we take action based on the bits set in the page. If we ever find a regular page in our radix tree now that most likely means that someone besides DAX is inserting pages (which has happened lots of times in the past), and we want to find that out early and fail loudly. This solution also removes the extra memory consumption. Here is that same /proc/[pid]/smaps after 1GiB of reading from a hole with the new code: 7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data Size: 1048576 kB Rss: 0 kB Pss: 0 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 0 kB Referenced: 0 kB Anonymous: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Locked: 0 kB Overall system memory consumption is similarly improved. Another major change is that we remove dax_pfn_mkwrite() from our fault flow, and instead rely on the page fault itself to make the PTE dirty and writeable. The following description from the patch adding the vm_insert_mixed_mkwrite() call explains this a little more: "To be able to use the common 4k zero page in DAX we need to have our PTE fault path look more like our PMD fault path where a PTE entry can be marked as dirty and writeable as it is first inserted rather than waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault() call. Right now we can rely on having a dax_pfn_mkwrite() call because we can distinguish between these two cases in do_wp_page(): case 1: 4k zero page => writable DAX storage case 2: read-only DAX storage => writeable DAX storage This distinction is made by via vm_normal_page(). vm_normal_page() returns false for the common 4k zero page, though, just as it does for DAX ptes. Instead of special casing the DAX + 4k zero page case we will simplify our DAX PTE page fault sequence so that it matches our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper. We will instead use dax_iomap_fault() to handle write-protection faults. This means that insert_pfn() needs to follow the lead of insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If 'mkwrite' is set insert_pfn() will do the work that was previously done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path" Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler Reviewed-by: Jan Kara Cc: "Darrick J. Wong" Cc: "Theodore Ts'o" Cc: Alexander Viro Cc: Andreas Dilger Cc: Christoph Hellwig Cc: Dan Williams Cc: Dave Chinner Cc: Ingo Molnar Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Steven Rostedt Cc: Kirill A. Shutemov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/dax.txt | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index a7e6e14aeb08..3be3b266be41 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -63,9 +63,8 @@ Filesystem support consists of - implementing an mmap file operation for DAX files which sets the VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These - handlers should probably call dax_iomap_fault() (for fault and page_mkwrite - handlers), dax_iomap_pmd_fault(), dax_pfn_mkwrite() passing the appropriate - iomap operations. + handlers should probably call dax_iomap_fault() passing the appropriate + fault size and iomap operations. - calling iomap_zero_range() passing appropriate iomap operations instead of block_truncate_page() for DAX files - ensuring that there is sufficient locking between reads, writes, -- cgit v1.2.3 From 5a47074f0279421778f97b1b1e75686696a5f42a Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Wed, 6 Sep 2017 16:20:10 -0700 Subject: zram: add config and doc file for writeback feature This patch adds document and kconfig for using of writeback feature. Link: http://lkml.kernel.org/r/1498459987-24562-10-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim Cc: Juneho Choi Cc: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-block-zram | 8 ++++++++ Documentation/blockdev/zram.txt | 11 +++++++++++ 2 files changed, 19 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index 451b6d882b2c..c1513c756af1 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -90,3 +90,11 @@ Description: device's debugging info useful for kernel developers. Its format is not documented intentionally and may change anytime without any notice. + +What: /sys/block/zram/backing_dev +Date: June 2017 +Contact: Minchan Kim +Description: + The backing_dev file is read-write and set up backing + device for zram to write incompressible pages. + For using, user should enable CONFIG_ZRAM_WRITEBACK. diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 4fced8a21307..257e65714c6a 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -168,6 +168,7 @@ max_comp_streams RW the number of possible concurrent compress operations comp_algorithm RW show and change the compression algorithm compact WO trigger memory compaction debug_stat RO this file is used for zram debugging purposes +backing_dev RW set up backend storage for zram to write out User space is advised to use the following files to read the device statistics. @@ -231,5 +232,15 @@ line of text and contains the following stats separated by whitespace: resets the disksize to zero. You must set the disksize again before reusing the device. +* Optional Feature + += writeback + +With incompressible pages, there is no memory saving with zram. +Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page +to backing storage rather than keeping it in memory. +User should set up backing device via /sys/block/zramX/backing_dev +before disksize setting. + Nitin Gupta ngupta@vflare.org -- cgit v1.2.3 From c9bff3eebc09be23fbc868f5e6731666d23cbea3 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 6 Sep 2017 16:20:13 -0700 Subject: mm, page_alloc: rip out ZONELIST_ORDER_ZONE Patch series "cleanup zonelists initialization", v1. This is aimed at cleaning up the zonelists initialization code we have but the primary motivation was bug report [2] which got resolved but the usage of stop_machine is just too ugly to live. Most patches are straightforward but 3 of them need a special consideration. Patch 1 removes zone ordered zonelists completely. I am CCing linux-api because this is a user visible change. As I argue in the patch description I do not think we have a strong usecase for it these days. I have kept sysctl in place and warn into the log if somebody tries to configure zone lists ordering. If somebody has a real usecase for it we can revert this patch but I do not expect anybody will actually notice runtime differences. This patch is not strictly needed for the rest but it made patch 6 easier to implement. Patch 7 removes stop_machine from build_all_zonelists without adding any special synchronization between iterators and updater which I _believe_ is acceptable as explained in the changelog. I hope I am not missing anything. Patch 8 then removes zonelists_mutex which is kind of ugly as well and not really needed AFAICS but a care should be taken when double checking my thinking. This patch (of 9): Supporting zone ordered zonelists costs us just a lot of code while the usefulness is arguable if existent at all. Mel has already made node ordering default on 64b systems. 32b systems are still using ZONELIST_ORDER_ZONE because it is considered better to fallback to a different NUMA node rather than consume precious lowmem zones. This argument is, however, weaken by the fact that the memory reclaim has been reworked to be node rather than zone oriented. This means that lowmem requests have to skip over all highmem pages on LRUs already and so zone ordering doesn't save the reclaim time much. So the only advantage of the zone ordering is under a light memory pressure when highmem requests do not ever hit into lowmem zones and the lowmem pressure doesn't need to reclaim. Considering that 32b NUMA systems are rather suboptimal already and it is generally advisable to use 64b kernel on such a HW I believe we should rather care about the code maintainability and just get rid of ZONELIST_ORDER_ZONE altogether. Keep systcl in place and warn if somebody tries to set zone ordering either from kernel command line or the sysctl. [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate] Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Johannes Weiner Cc: Joonsoo Kim Cc: Shaohua Li Cc: Toshi Kani Cc: Abdul Haleem Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/kernel-parameters.txt | 2 +- Documentation/sysctl/vm.txt | 4 +++- Documentation/vm/numa | 7 ++----- 3 files changed, 6 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 6996b7727b85..86b0e8ec8ad7 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2783,7 +2783,7 @@ Allowed values are enable and disable numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA. - one of ['zone', 'node', 'default'] can be specified + 'node', 'default' can be specified This can be set from sysctl after boot. See Documentation/sysctl/vm.txt for details. diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 48244c42ff52..9baf66a9ef4e 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -572,7 +572,9 @@ See Documentation/nommu-mmap.txt for more information. numa_zonelist_order -This sysctl is only for NUMA. +This sysctl is only for NUMA and it is deprecated. Anything but +Node order will fail! + 'where the memory is allocated from' is controlled by zonelists. (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation. you may be able to read ZONE_DMA as ZONE_DMA32...) diff --git a/Documentation/vm/numa b/Documentation/vm/numa index a08f71647714..a31b85b9bb88 100644 --- a/Documentation/vm/numa +++ b/Documentation/vm/numa @@ -79,11 +79,8 @@ memory, Linux must decide whether to order the zonelists such that allocations fall back to the same zone type on a different node, or to a different zone type on the same node. This is an important consideration because some zones, such as DMA or DMA32, represent relatively scarce resources. Linux chooses -a default zonelist order based on the sizes of the various zone types relative -to the total memory of the node and the total memory of the system. The -default zonelist order may be overridden using the numa_zonelist_order kernel -boot parameter or sysctl. [see Documentation/admin-guide/kernel-parameters.rst and -Documentation/sysctl/vm.txt] +a default Node ordered zonelist. This means it tries to fallback to other zones +from the same node before using remote nodes which are ordered by NUMA distance. By default, Linux will attempt to satisfy memory allocation requests from the node to which the CPU that executes the request is assigned. Specifically, -- cgit v1.2.3 From 26b433d0da062d6e19d75350c0171d3cf8ff560d Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 6 Sep 2017 16:21:15 -0700 Subject: fscache: remove unused ->now_uncached callback Patch series "Ranged pagevec lookup", v2. In this series I make pagevec_lookup() update the index (to be consistent with pagevec_lookup_tag() and also as a preparation for ranged lookups), provide ranged variant of pagevec_lookup() and use it in places where it makes sense. This not only removes some common code but is also a measurable performance win for some use cases (see patch 4/10) where radix tree is sparse and searching & grabing of a page after the end of the range has measurable overhead. This patch (of 10): The callback doesn't ever get called. Remove it. Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz Signed-off-by: Jan Kara Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/caching/netfs-api.txt | 2 -- 1 file changed, 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt index aed6b94160b1..0eb31de3a2c1 100644 --- a/Documentation/filesystems/caching/netfs-api.txt +++ b/Documentation/filesystems/caching/netfs-api.txt @@ -151,8 +151,6 @@ To define an object, a structure of the following type should be filled out: void (*mark_pages_cached)(void *cookie_netfs_data, struct address_space *mapping, struct pagevec *cached_pvec); - - void (*now_uncached)(void *cookie_netfs_data); }; This has the following fields: -- cgit v1.2.3 From d9bfcfdc41e8e5d80f7591f95a09ccce7cb8ad05 Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Wed, 6 Sep 2017 16:24:40 -0700 Subject: mm, swap: add sysfs interface for VMA based swap readahead The sysfs interface to control the VMA based swap readahead is added as follow, /sys/kernel/mm/swap/vma_ra_enabled Enable the VMA based swap readahead algorithm, or use the original global swap readahead algorithm. /sys/kernel/mm/swap/vma_ra_max_order Set the max order of the readahead window size for the VMA based swap readahead algorithm. The corresponding ABI documentation is added too. Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.com Signed-off-by: "Huang, Ying" Cc: Johannes Weiner Cc: Minchan Kim Cc: Rik van Riel Cc: Shaohua Li Cc: Hugh Dickins Cc: Fengguang Wu Cc: Tim Chen Cc: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-kernel-mm-swap | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-swap (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-swap b/Documentation/ABI/testing/sysfs-kernel-mm-swap new file mode 100644 index 000000000000..587db52084c7 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-swap @@ -0,0 +1,26 @@ +What: /sys/kernel/mm/swap/ +Date: August 2017 +Contact: Linux memory management mailing list +Description: Interface for swapping + +What: /sys/kernel/mm/swap/vma_ra_enabled +Date: August 2017 +Contact: Linux memory management mailing list +Description: Enable/disable VMA based swap readahead. + + If set to true, the VMA based swap readahead algorithm + will be used for swappable anonymous pages mapped in a + VMA, and the global swap readahead algorithm will be + still used for tmpfs etc. other users. If set to + false, the global swap readahead algorithm will be + used for all swappable pages. + +What: /sys/kernel/mm/swap/vma_ra_max_order +Date: August 2017 +Contact: Linux memory management mailing list +Description: The max readahead size in order for VMA based swap readahead + + VMA based swap readahead algorithm will readahead at + most 1 << max_order pages for each readahead. The + real readahead size for each readahead will be scaled + according to the estimation algorithm. -- cgit v1.2.3 From a2468cc9bfdff6139f59ca896671e5819ff5f94a Mon Sep 17 00:00:00 2001 From: Aaron Lu Date: Wed, 6 Sep 2017 16:24:57 -0700 Subject: swap: choose swap device according to numa node If the system has more than one swap device and swap device has the node information, we can make use of this information to decide which swap device to use in get_swap_pages() to get better performance. The current code uses a priority based list, swap_avail_list, to decide which swap device to use and if multiple swap devices share the same priority, they are used round robin. This patch changes the previous single global swap_avail_list into a per-numa-node list, i.e. for each numa node, it sees its own priority based list of available swap devices. Swap device's priority can be promoted on its matching node's swap_avail_list. The current swap device's priority is set as: user can set a >=0 value, or the system will pick one starting from -1 then downwards. The priority value in the swap_avail_list is the negated value of the swap device's due to plist being sorted from low to high. The new policy doesn't change the semantics for priority >=0 cases, the previous starting from -1 then downwards now becomes starting from -2 then downwards and -1 is reserved as the promoted value. Take 4-node EX machine as an example, suppose 4 swap devices are available, each sit on a different node: swapA on node 0 swapB on node 1 swapC on node 2 swapD on node 3 After they are all swapped on in the sequence of ABCD. Current behaviour: their priorities will be: swapA: -1 swapB: -2 swapC: -3 swapD: -4 And their position in the global swap_avail_list will be: swapA -> swapB -> swapC -> swapD prio:1 prio:2 prio:3 prio:4 New behaviour: their priorities will be(note that -1 is skipped): swapA: -2 swapB: -3 swapC: -4 swapD: -5 And their positions in the 4 swap_avail_lists[nid] will be: swap_avail_lists[0]: /* node 0's available swap device list */ swapA -> swapB -> swapC -> swapD prio:1 prio:3 prio:4 prio:5 swap_avali_lists[1]: /* node 1's available swap device list */ swapB -> swapA -> swapC -> swapD prio:1 prio:2 prio:4 prio:5 swap_avail_lists[2]: /* node 2's available swap device list */ swapC -> swapA -> swapB -> swapD prio:1 prio:2 prio:3 prio:5 swap_avail_lists[3]: /* node 3's available swap device list */ swapD -> swapA -> swapB -> swapC prio:1 prio:2 prio:3 prio:4 To see the effect of the patch, a test that starts N process, each mmap a region of anonymous memory and then continually write to it at random position to trigger both swap in and out is used. On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives are used as swap devices with each attached to a different node, the result is: runtime=30m/processes=32/total test size=128G/each process mmap region=4G kernel throughput vanilla 13306 auto-binding 15169 +14% runtime=30m/processes=64/total test size=128G/each process mmap region=2G kernel throughput vanilla 11885 auto-binding 14879 +25% [aaron.lu@intel.com: v2] Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com [akpm@linux-foundation.org: use kmalloc_array()] Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com Signed-off-by: Aaron Lu Cc: "Chen, Tim C" Cc: Huang Ying Cc: Andi Kleen Cc: Michal Hocko Cc: Minchan Kim Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/swap_numa.txt | 69 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 69 insertions(+) create mode 100644 Documentation/vm/swap_numa.txt (limited to 'Documentation') diff --git a/Documentation/vm/swap_numa.txt b/Documentation/vm/swap_numa.txt new file mode 100644 index 000000000000..d5960c9124f5 --- /dev/null +++ b/Documentation/vm/swap_numa.txt @@ -0,0 +1,69 @@ +Automatically bind swap device to numa node +------------------------------------------- + +If the system has more than one swap device and swap device has the node +information, we can make use of this information to decide which swap +device to use in get_swap_pages() to get better performance. + + +How to use this feature +----------------------- + +Swap device has priority and that decides the order of it to be used. To make +use of automatically binding, there is no need to manipulate priority settings +for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and +swapB, with swapA attached to node 0 and swapB attached to node 1, are going +to be swapped on. Simply swapping them on by doing: +# swapon /dev/swapA +# swapon /dev/swapB + +Then node 0 will use the two swap devices in the order of swapA then swapB and +node 1 will use the two swap devices in the order of swapB then swapA. Note +that the order of them being swapped on doesn't matter. + +A more complex example on a 4 node machine. Assume 6 swap devices are going to +be swapped on: swapA and swapB are attached to node 0, swapC is attached to +node 1, swapD and swapE are attached to node 2 and swapF is attached to node3. +The way to swap them on is the same as above: +# swapon /dev/swapA +# swapon /dev/swapB +# swapon /dev/swapC +# swapon /dev/swapD +# swapon /dev/swapE +# swapon /dev/swapF + +Then node 0 will use them in the order of: +swapA/swapB -> swapC -> swapD -> swapE -> swapF +swapA and swapB will be used in a round robin mode before any other swap device. + +node 1 will use them in the order of: +swapC -> swapA -> swapB -> swapD -> swapE -> swapF + +node 2 will use them in the order of: +swapD/swapE -> swapA -> swapB -> swapC -> swapF +Similaly, swapD and swapE will be used in a round robin mode before any +other swap devices. + +node 3 will use them in the order of: +swapF -> swapA -> swapB -> swapC -> swapD -> swapE + + +Implementation details +---------------------- + +The current code uses a priority based list, swap_avail_list, to decide +which swap device to use and if multiple swap devices share the same +priority, they are used round robin. This change here replaces the single +global swap_avail_list with a per-numa-node list, i.e. for each numa node, +it sees its own priority based list of available swap devices. Swap +device's priority can be promoted on its matching node's swap_avail_list. + +The current swap device's priority is set as: user can set a >=0 value, +or the system will pick one starting from -1 then downwards. The priority +value in the swap_avail_list is the negated value of the swap device's +due to plist being sorted from low to high. The new policy doesn't change +the semantics for priority >=0 cases, the previous starting from -1 then +downwards now becomes starting from -2 then downwards and -1 is reserved +as the promoted value. So if multiple swap devices are attached to the same +node, they will all be promoted to priority -1 on that node's plist and will +be used round robin before any other swap devices. -- cgit v1.2.3 From 493b0e9d945fa9dfe96be93ae41b4ca4b6fdb317 Mon Sep 17 00:00:00 2001 From: Daniel Colascione Date: Wed, 6 Sep 2017 16:25:08 -0700 Subject: mm: add /proc/pid/smaps_rollup /proc/pid/smaps_rollup is a new proc file that improves the performance of user programs that determine aggregate memory statistics (e.g., total PSS) of a process. Android regularly "samples" the memory usage of various processes in order to balance its memory pool sizes. This sampling process involves opening /proc/pid/smaps and summing certain fields. For very large processes, sampling memory use this way can take several hundred milliseconds, due mostly to the overhead of the seq_printf calls in task_mmu.c. smaps_rollup improves the situation. It contains most of the fields of /proc/pid/smaps, but instead of a set of fields for each VMA, smaps_rollup instead contains one synthetic smaps-format entry representing the whole process. In the single smaps_rollup synthetic entry, each field is the summation of the corresponding field in all of the real-smaps VMAs. Using a common format for smaps_rollup and smaps allows userspace parsers to repurpose parsers meant for use with non-rollup smaps for smaps_rollup, and it allows userspace to switch between smaps_rollup and smaps at runtime (say, based on the availability of smaps_rollup in a given kernel) with minimal fuss. By using smaps_rollup instead of smaps, a caller can avoid the significant overhead of formatting, reading, and parsing each of a large process's potentially very numerous memory mappings. For sampling system_server's PSS in Android, we measured a 12x speedup, representing a savings of several hundred milliseconds. One alternative to a new per-process proc file would have been including PSS information in /proc/pid/status. We considered this option but thought that PSS would be too expensive (by a few orders of magnitude) to collect relative to what's already emitted as part of /proc/pid/status, and slowing every user of /proc/pid/status for the sake of readers that happen to want PSS feels wrong. The code itself works by reusing the existing VMA-walking framework we use for regular smaps generation and keeping the mem_size_stats structure around between VMA walks instead of using a fresh one for each VMA. In this way, summation happens automatically. We let seq_file walk over the VMAs just as it does for regular smaps and just emit nothing to the seq_file until we hit the last VMA. Benchmarks: using smaps: iterations:1000 pid:1163 pss:220023808 0m29.46s real 0m08.28s user 0m20.98s system using smaps_rollup: iterations:1000 pid:1163 pss:220702720 0m04.39s real 0m00.03s user 0m04.31s system We're using the PSS samples we collect asynchronously for system-management tasks like fine-tuning oom_adj_score, memory use tracking for debugging, application-level memory-use attribution, and deciding whether we want to kill large processes during system idle maintenance windows. Android has been using PSS for these purposes for a long time; as the average process VMA count has increased and and devices become more efficiency-conscious, PSS-collection inefficiency has started to matter more. IMHO, it'd be a lot safer to optimize the existing PSS-collection model, which has been fine-tuned over the years, instead of changing the memory tracking approach entirely to work around smaps-generation inefficiency. Tim said: : There are two main reasons why Android gathers PSS information: : : 1. Android devices can show the user the amount of memory used per : application via the settings app. This is a less important use case. : : 2. We log PSS to help identify leaks in applications. We have found : an enormous number of bugs (in the Android platform, in Google's own : apps, and in third-party applications) using this data. : : To do this, system_server (the main process in Android userspace) will : sample the PSS of a process three seconds after it changes state (for : example, app is launched and becomes the foreground application) and about : every ten minutes after that. The net result is that PSS collection is : regularly running on at least one process in the system (usually a few : times a minute while the screen is on, less when screen is off due to : suspend). PSS of a process is an incredibly useful stat to track, and we : aren't going to get rid of it. We've looked at some very hacky approaches : using RSS ("take the RSS of the target process, subtract the RSS of the : zygote process that is the parent of all Android apps") to reduce the : accounting time, but it regularly overestimated the memory used by 20+ : percent. Accordingly, I don't think that there's a good alternative to : using PSS. : : We started looking into PSS collection performance after we noticed random : frequency spikes while a phone's screen was off; occasionally, one of the : CPU clusters would ramp to a high frequency because there was 200-300ms of : constant CPU work from a single thread in the main Android userspace : process. The work causing the spike (which is reasonable governor : behavior given the amount of CPU time needed) was always PSS collection. : As a result, Android is burning more power than we should be on PSS : collection. : : The other issue (and why I'm less sure about improving smaps as a : long-term solution) is that the number of VMAs per process has increased : significantly from release to release. After trying to figure out why we : were seeing these 200-300ms PSS collection times on Android O but had not : noticed it in previous versions, we found that the number of VMAs in the : main system process increased by 50% from Android N to Android O (from : ~1800 to ~2700) and varying increases in every userspace process. Android : M to N also had an increase in the number of VMAs, although not as much. : I'm not sure why this is increasing so much over time, but thinking about : ASLR and ways to make ASLR better, I expect that this will continue to : increase going forward. I would not be surprised if we hit 5000 VMAs on : the main Android process (system_server) by 2020. : : If we assume that the number of VMAs is going to increase over time, then : doing anything we can do to reduce the overhead of each VMA during PSS : collection seems like the right way to go, and that means outputting an : aggregate statistic (to avoid whatever overhead there is per line in : writing smaps and in reading each line from userspace). Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com Signed-off-by: Daniel Colascione Cc: Tim Murray Cc: Joel Fernandes Cc: Al Viro Cc: Randy Dunlap Cc: Minchan Kim Cc: Michal Hocko Cc: Sonny Rao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/procfs-smaps_rollup | 31 +++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) create mode 100644 Documentation/ABI/testing/procfs-smaps_rollup (limited to 'Documentation') diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup new file mode 100644 index 000000000000..0a54ed0d63c9 --- /dev/null +++ b/Documentation/ABI/testing/procfs-smaps_rollup @@ -0,0 +1,31 @@ +What: /proc/pid/smaps_rollup +Date: August 2017 +Contact: Daniel Colascione +Description: + This file provides pre-summed memory information for a + process. The format is identical to /proc/pid/smaps, + except instead of an entry for each VMA in a process, + smaps_rollup has a single entry (tagged "[rollup]") + for which each field is the sum of the corresponding + fields from all the maps in /proc/pid/smaps. + For more details, see the procfs man page. + + Typical output looks like this: + + 00100000-ff709000 ---p 00000000 00:00 0 [rollup] + Rss: 884 kB + Pss: 385 kB + Shared_Clean: 696 kB + Shared_Dirty: 0 kB + Private_Clean: 120 kB + Private_Dirty: 68 kB + Referenced: 884 kB + Anonymous: 68 kB + LazyFree: 0 kB + AnonHugePages: 0 kB + ShmemPmdMapped: 0 kB + Shared_Hugetlb: 0 kB + Private_Hugetlb: 0 kB + Swap: 0 kB + SwapPss: 0 kB + Locked: 385 kB -- cgit v1.2.3