path: root/mm
Age  Commit message  Author
2020-08-12  mm/vmscan: restore active/inactive ratio for anonymous LRU  (Joonsoo Kim)
Now that workingset detection is implemented for the anonymous LRU, we no longer need a large inactive list to detect frequently accessed pages before they are reclaimed. This effectively reverts the temporary measure put in by commit "mm/vmscan: make active/inactive ratio as 1:1 for anon lru". Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-7-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12  mm/swap: implement workingset detection for anonymous LRU  (Joonsoo Kim)
This patch implements workingset detection for anonymous LRU. All the infrastructure is implemented by the previous patches so this patch just activates the workingset detection by installing/retrieving the shadow entry and adding refault calculation. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
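The refault-distance idea behind workingset detection can be pictured with a toy userspace model. This is only an illustration of the concept described above, not the kernel's workingset.c code; all names and numbers here are made up.

/*
 * Toy model of refault-distance based workingset detection.
 * A global "eviction clock" counts reclaimed pages; an evicted page
 * leaves behind a shadow entry recording the clock value, and on
 * refault the distance decides whether the page starts active.
 */
#include <stdio.h>

static unsigned long evictions;                  /* global eviction clock */

struct shadow { unsigned long eviction; };       /* what a shadow entry stores */

/* Called when a page is reclaimed: remember the clock at eviction time. */
static struct shadow evict_page(void)
{
    struct shadow s = { .eviction = evictions++ };
    return s;
}

/*
 * Called when the same page is faulted back in.  If fewer pages were
 * evicted while it was out than fit in the workingset, the page is part
 * of the workingset and should be activated rather than left inactive.
 */
static int refault_is_workingset(struct shadow s, unsigned long workingset_pages)
{
    unsigned long refault_distance = evictions - s.eviction;
    return refault_distance <= workingset_pages;
}

int main(void)
{
    struct shadow s = evict_page();
    evictions += 10;                             /* 10 other pages evicted meanwhile */
    printf("activate on refault: %d\n", refault_is_workingset(s, 64));  /* prints 1 */
    return 0;
}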
2020-08-12  mm/swapcache: support to handle the shadow entries  (Joonsoo Kim)
Workingset detection for anonymous pages will be implemented in the following patch, and it requires storing shadow entries in the swapcache. This patch implements the infrastructure to store the shadow entry in the swapcache. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12  mm/workingset: prepare the workingset detection infrastructure for anon LRU  (Joonsoo Kim)
To prepare the workingset detection for anon LRU, this patch splits workingset event counters for refault, activate and restore into anon and file variants, as well as the refaults counter in struct lruvec. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12  mm/vmscan: protect the workingset on anonymous LRU  (Joonsoo Kim)
In the current implementation, newly created or swapped-in anonymous pages start on the active list. Growing the active list results in rebalancing the active/inactive lists, so old pages on the active list are demoted to the inactive list. Hence, pages on the active list aren't protected at all. Following is an example of this situation. Assume 50 hot pages on the active list. Numbers denote the number of pages on the active/inactive list (active | inactive). 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(uo) | 50(h) 3. workload: another 50 newly created (used-once) pages 50(uo) | 50(uo), swap-out 50(h) This patch tries to fix this issue. Like the file LRU, newly created or swapped-in anonymous pages will be inserted into the inactive list. They are promoted to the active list if they are referenced enough. This simple modification changes the above example as follows. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(h) | 50(uo) 3. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(uo) As you can see, hot pages on the active list are now protected. Note that this implementation has a drawback: a page cannot be promoted and will be swapped out if its re-access interval is greater than the size of the inactive list but less than the size of the total (active+inactive). To solve this potential issue, the following patch will apply workingset detection similar to the one that's already applied to the file LRU. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
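The effect of where new pages start can be reproduced with a tiny userspace model of the 50-hot-pages example above. This is a toy simulation, not kernel code: two 50-slot lists stand in for the anon LRU, and dropping off the inactive tail stands in for swap-out.

/*
 * Toy model: 100 page slots split into a 50-slot active list and a
 * 50-slot inactive list.  Compare starting new anonymous pages on the
 * active list (old behaviour) with starting them on the inactive list
 * (this patch) while 100 used-once pages stream through.
 */
#include <stdio.h>
#include <string.h>

#define LIST 50

/* push a page id onto the head of a list; return what falls off the tail */
static int push(int *list, int page)
{
    int evicted = list[LIST - 1];
    memmove(list + 1, list, (LIST - 1) * sizeof(int));
    list[0] = page;
    return evicted;
}

int main(void)
{
    int active[LIST], inactive[LIST];
    int start_on_active;

    for (start_on_active = 1; start_on_active >= 0; start_on_active--) {
        int i, p, hot_resident = 0;

        for (i = 0; i < LIST; i++) { active[i] = i; inactive[i] = -1; }  /* hot ids 0..49 */

        for (p = 100; p < 200; p++) {            /* 100 used-once pages */
            if (start_on_active)
                push(inactive, push(active, p)); /* demoted active tail lands on inactive */
            else
                push(inactive, p);               /* new pages start on the inactive list */
        }

        for (i = 0; i < LIST; i++) {
            hot_resident += (active[i] >= 0 && active[i] < 50);
            hot_resident += (inactive[i] >= 0 && inactive[i] < 50);
        }
        printf("%s: %d of 50 hot pages still resident\n",
               start_on_active ? "start on active  " : "start on inactive", hot_resident);
    }
    return 0;
}

With insertion on the active list, 0 of the 50 hot pages survive the stream of used-once pages; with insertion on the inactive list, all 50 remain, matching the two scenarios in the message.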
2020-08-12  mm/vmscan: make active/inactive ratio as 1:1 for anon lru  (Joonsoo Kim)
Patch series "workingset protection/detection on the anonymous LRU list", v7. * PROBLEM In current implementation, newly created or swap-in anonymous page is started on the active list. Growing the active list results in rebalancing active/inactive list so old pages on the active list are demoted to the inactive list. Hence, hot page on the active list isn't protected at all. Following is an example of this situation. Assume that 50 hot pages on active list and system can contain total 100 pages. Numbers denote the number of pages on active/inactive list (active | inactive). (h) stands for hot pages and (uo) stands for used-once pages. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(uo) | 50(h) 3. workload: another 50 newly created (used-once) pages 50(uo) | 50(uo), swap-out 50(h) As we can see, hot pages are swapped-out and it would cause swap-in later. * SOLUTION Since this is what we want to avoid, this patchset implements workingset protection. Like as the file LRU list, newly created or swap-in anonymous page is started on the inactive list. Also, like as the file LRU list, if enough reference happens, the page will be promoted. This simple modification changes the above example as following. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(h) | 50(uo) 3. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(uo) hot pages remains in the active list. :) * EXPERIMENT I tested this scenario on my test bed and confirmed that this problem happens on current implementation. I also checked that it is fixed by this patchset. * SUBJECT workingset detection * PROBLEM Later part of the patchset implements the workingset detection for the anonymous LRU list. There is a corner case that workingset protection could cause thrashing. If we can avoid thrashing by workingset detection, we can get the better performance. Following is an example of thrashing due to the workingset protection. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (will be hot) pages 50(h) | 50(wh) 3. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(wh) 4. workload: 50 (will be hot) pages 50(h) | 50(wh), swap-in 50(wh) 5. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(wh) 6. repeat 4, 5 Without workingset detection, this kind of workload cannot be promoted and thrashing happens forever. * SOLUTION Therefore, this patchset implements workingset detection. All the infrastructure for workingset detecion is already implemented, so there is not much work to do. First, extend workingset detection code to deal with the anonymous LRU list. Then, make swap cache handles the exceptional value for the shadow entry. Lastly, install/retrieve the shadow value into/from the swap cache and check the refault distance. * EXPERIMENT I made a test program to imitates above scenario and confirmed that problem exists. Then, I checked that this patchset fixes it. My test setup is a virtual machine with 8 cpus and 6100MB memory. But, the amount of the memory that the test program can use is about 280 MB. This is because the system uses large ram-backed swap and large ramdisk to capture the trace. Test scenario is like as below. 1. allocate cold memory (512MB) 2. allocate hot-1 memory (96MB) 3. activate hot-1 memory (96MB) 4. allocate another hot-2 memory (96MB) 5. access cold memory (128MB) 6. access hot-2 memory (96MB) 7. 
repeat 5, 6 Since hot-1 memory (96MB) is on the active list, the inactive list can contains roughly 190MB pages. hot-2 memory's re-access interval (96+128 MB) is more 190MB, so it cannot be promoted without workingset detection and swap-in/out happens repeatedly. With this patchset, workingset detection works and promotion happens. Therefore, swap-in/out occurs less. Here is the result. (average of 5 runs) type swap-in swap-out base 863240 989945 patch 681565 809273 As we can see, patched kernel do less swap-in/out. * OVERALL TEST (ebizzy using modified random function) ebizzy is the test program that main thread allocates lots of memory and child threads access them randomly during the given times. Swap-in will happen if allocated memory is larger than the system memory. The random function that represents the zipf distribution is used to make hot/cold memory. Hot/cold ratio is controlled by the parameter. If the parameter is high, hot memory is accessed much larger than cold one. If the parameter is low, the number of access on each memory would be similar. I uses various parameters in order to show the effect of patchset on various hot/cold ratio workload. My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB ram swap. Result format is as following. param: 1-1024-0.1 - 1 (number of thread) - 1024 (allocated memory size, MB) - 0.1 (zipf distribution alpha, 0.1 works like as roughly uniform random, 1.3 works like as small portion of memory is hot and the others are cold) pswpin: smaller is better std: standard deviation improvement: negative is better * single thread param pswpin std improvement base 1-1024.0-0.1 14101983.40 79441.19 prot 1-1024.0-0.1 14065875.80 136413.01 ( -0.26 ) detect 1-1024.0-0.1 13910435.60 100804.82 ( -1.36 ) base 1-1024.0-0.7 7998368.80 43469.32 prot 1-1024.0-0.7 7622245.80 88318.74 ( -4.70 ) detect 1-1024.0-0.7 7618515.20 59742.07 ( -4.75 ) base 1-1024.0-1.3 1017400.80 38756.30 prot 1-1024.0-1.3 940464.60 29310.69 ( -7.56 ) detect 1-1024.0-1.3 945511.40 24579.52 ( -7.07 ) base 1-1280.0-0.1 22895541.40 50016.08 prot 1-1280.0-0.1 22860305.40 51952.37 ( -0.15 ) detect 1-1280.0-0.1 22705565.20 93380.35 ( -0.83 ) base 1-1280.0-0.7 13717645.60 46250.65 prot 1-1280.0-0.7 12935355.80 64754.43 ( -5.70 ) detect 1-1280.0-0.7 13040232.00 63304.00 ( -4.94 ) base 1-1280.0-1.3 1654251.40 4159.68 prot 1-1280.0-1.3 1522680.60 33673.50 ( -7.95 ) detect 1-1280.0-1.3 1599207.00 70327.89 ( -3.33 ) base 1-1536.0-0.1 31621775.40 31156.28 prot 1-1536.0-0.1 31540355.20 62241.36 ( -0.26 ) detect 1-1536.0-0.1 31420056.00 123831.27 ( -0.64 ) base 1-1536.0-0.7 19620760.60 60937.60 prot 1-1536.0-0.7 18337839.60 56102.58 ( -6.54 ) detect 1-1536.0-0.7 18599128.00 75289.48 ( -5.21 ) base 1-1536.0-1.3 2378142.40 20994.43 prot 1-1536.0-1.3 2166260.60 48455.46 ( -8.91 ) detect 1-1536.0-1.3 2183762.20 16883.24 ( -8.17 ) base 1-1792.0-0.1 40259714.80 90750.70 prot 1-1792.0-0.1 40053917.20 64509.47 ( -0.51 ) detect 1-1792.0-0.1 39949736.40 104989.64 ( -0.77 ) base 1-1792.0-0.7 25704884.40 69429.68 prot 1-1792.0-0.7 23937389.00 79945.60 ( -6.88 ) detect 1-1792.0-0.7 24271902.00 35044.30 ( -5.57 ) base 1-1792.0-1.3 3129497.00 32731.86 prot 1-1792.0-1.3 2796994.40 19017.26 ( -10.62 ) detect 1-1792.0-1.3 2886840.40 33938.82 ( -7.75 ) base 1-2048.0-0.1 48746924.40 50863.88 prot 1-2048.0-0.1 48631954.40 24537.30 ( -0.24 ) detect 1-2048.0-0.1 48509419.80 27085.34 ( -0.49 ) base 1-2048.0-0.7 32046424.40 78624.22 prot 1-2048.0-0.7 29764182.20 86002.26 ( -7.12 ) detect 1-2048.0-0.7 
30250315.80 101282.14 ( -5.60 ) base 1-2048.0-1.3 3916723.60 24048.55 prot 1-2048.0-1.3 3490781.60 33292.61 ( -10.87 ) detect 1-2048.0-1.3 3585002.20 44942.04 ( -8.47 ) * multi thread param pswpin std improvement base 8-1024.0-0.1 16219822.60 329474.01 prot 8-1024.0-0.1 15959494.00 654597.45 ( -1.61 ) detect 8-1024.0-0.1 15773790.80 502275.25 ( -2.75 ) base 8-1024.0-0.7 9174107.80 537619.33 prot 8-1024.0-0.7 8571915.00 385230.08 ( -6.56 ) detect 8-1024.0-0.7 8489484.20 364683.00 ( -7.46 ) base 8-1024.0-1.3 1108495.60 83555.98 prot 8-1024.0-1.3 1038906.20 63465.20 ( -6.28 ) detect 8-1024.0-1.3 941817.80 32648.80 ( -15.04 ) base 8-1280.0-0.1 25776114.20 450480.45 prot 8-1280.0-0.1 25430847.00 465627.07 ( -1.34 ) detect 8-1280.0-0.1 25282555.00 465666.55 ( -1.91 ) base 8-1280.0-0.7 15218968.00 702007.69 prot 8-1280.0-0.7 13957947.80 492643.86 ( -8.29 ) detect 8-1280.0-0.7 14158331.20 238656.02 ( -6.97 ) base 8-1280.0-1.3 1792482.80 30512.90 prot 8-1280.0-1.3 1577686.40 34002.62 ( -11.98 ) detect 8-1280.0-1.3 1556133.00 22944.79 ( -13.19 ) base 8-1536.0-0.1 33923761.40 575455.85 prot 8-1536.0-0.1 32715766.20 300633.51 ( -3.56 ) detect 8-1536.0-0.1 33158477.40 117764.51 ( -2.26 ) base 8-1536.0-0.7 20628907.80 303851.34 prot 8-1536.0-0.7 19329511.20 341719.31 ( -6.30 ) detect 8-1536.0-0.7 20013934.00 385358.66 ( -2.98 ) base 8-1536.0-1.3 2588106.40 130769.20 prot 8-1536.0-1.3 2275222.40 89637.06 ( -12.09 ) detect 8-1536.0-1.3 2365008.40 124412.55 ( -8.62 ) base 8-1792.0-0.1 43328279.20 946469.12 prot 8-1792.0-0.1 41481980.80 525690.89 ( -4.26 ) detect 8-1792.0-0.1 41713944.60 406798.93 ( -3.73 ) base 8-1792.0-0.7 27155647.40 536253.57 prot 8-1792.0-0.7 24989406.80 502734.52 ( -7.98 ) detect 8-1792.0-0.7 25524806.40 263237.87 ( -6.01 ) base 8-1792.0-1.3 3260372.80 137907.92 prot 8-1792.0-1.3 2879187.80 63597.26 ( -11.69 ) detect 8-1792.0-1.3 2892962.20 33229.13 ( -11.27 ) base 8-2048.0-0.1 50583989.80 710121.48 prot 8-2048.0-0.1 49599984.40 228782.42 ( -1.95 ) detect 8-2048.0-0.1 50578596.00 660971.66 ( -0.01 ) base 8-2048.0-0.7 33765479.60 812659.55 prot 8-2048.0-0.7 30767021.20 462907.24 ( -8.88 ) detect 8-2048.0-0.7 32213068.80 211884.24 ( -4.60 ) base 8-2048.0-1.3 3941675.80 28436.45 prot 8-2048.0-1.3 3538742.40 76856.08 ( -10.22 ) detect 8-2048.0-1.3 3579397.80 58630.95 ( -9.19 ) As we can see, all the cases show improvement. Especially, test case with zipf distribution 1.3 show more improvements. It means that if there is a hot/cold tendency in anon pages, this patchset works better. This patch (of 6): Current implementation of LRU management for anonymous page has some problems. Most important one is that it doesn't protect the workingset, that is, pages on the active LRU list. Although, this problem will be fixed in the following patchset, the preparation is required and this patch does it. What following patch does is to implement workingset protection. After the following patchset, newly created or swap-in pages will start their lifetime on the inactive list. If inactive list is too small, there is not enough chance to be referenced and the page cannot become the workingset. In order to provide the newly anonymous or swap-in pages enough chance to be referenced again, this patch makes active/inactive LRU ratio as 1:1. This is just a temporary measure. Later patch in the series introduces workingset detection for anonymous LRU that will be used to better decide if pages should start on the active and inactive list. Afterwards this patch is effectively reverted. 
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Link: http://lkml.kernel.org/r/1595490560-15117-1-git-send-email-iamjoonsoo.kim@lge.com Link: http://lkml.kernel.org/r/1595490560-15117-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12  mm/hugetlb: add mempolicy check in the reservation routine  (Muchun Song)
In the reservation routine, we only check whether the cpuset meets the memory allocation requirements, but we ignore the mempolicy in the MPOL_BIND case. The hugetlb mmap may succeed, yet the subsequent memory allocation can fail due to mempolicy restrictions, and the process receives a SIGBUS signal. This can be reproduced by the following steps. 1) Compile the test case. cd tools/testing/selftests/vm/ gcc map_hugetlb.c -o map_hugetlb 2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the system. Each node will pre-allocate one huge page. echo 2 > /proc/sys/vm/nr_hugepages 3) Run the test case (mmap 4MB). We receive the SIGBUS signal. numactl --membind=0 ./map_hugetlb 4 With this patch applied, the mmap fails in step 3) with "mmap: Cannot allocate memory". [akpm@linux-foundation.org: include sched.h for `current'] Reported-by: Jianchao Guo <guojianchao@bytedance.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Baoquan He <bhe@redhat.com> Link: http://lkml.kernel.org/r/20200728034938.14993-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
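A minimal C sketch of the reproducer, similar in spirit to the map_hugetlb selftest referenced above but not a copy of it. Run it under numactl --membind=0 with huge pages pre-allocated as in step 2; with the fix, mmap() itself fails with ENOMEM, while without it mmap() succeeds and the fault delivers SIGBUS.

/*
 * Map 4MB of anonymous hugetlb memory and touch it.
 * Without the mempolicy check the SIGBUS arrives at the memset();
 * with it, mmap() reports the failure up front.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LENGTH (4UL * 1024 * 1024)      /* 4MB, i.e. two 2MB huge pages */

int main(void)
{
    void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");                  /* expected behaviour after the fix */
        return 1;
    }
    memset(addr, 0, LENGTH);             /* without the fix: SIGBUS here */
    munmap(addr, LENGTH);
    return 0;
}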
2020-08-12  mm: memcg: charge memcg percpu memory to the parent cgroup  (Roman Gushchin)
Memory cgroups are using large chunks of percpu memory to store vmstat data. Yet this memory is not accounted at all, so in the case when there are many (dying) cgroups, it's not exactly clear where all the memory is. Because the size of memory cgroup internal structures can dramatically exceed the size of object or page which is pinning it in the memory, it's not a good idea to simply ignore it. It actually breaks the isolation between cgroups. Let's account the consumed percpu memory to the parent cgroup. [guro@fb.com: add WARN_ON_ONCE()s, per Johannes] Link: http://lkml.kernel.org/r/20200811170611.GB1507044@carbon.DHCP.thefacebook.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623184515.4132564-5-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12  mm: memcg/percpu: per-memcg percpu memory statistics  (Roman Gushchin)
Percpu memory can represent a noticeable chunk of the total memory consumption, especially on big machines with many CPUs. Let's track percpu memory usage for each memcg and display it in memory.stat. A percpu allocation is usually scattered over multiple pages (and nodes), and can be significantly smaller than a page. So let's add a byte-sized counter on the memcg level: MEMCG_PERCPU_B. Byte-sized vmstat infra created for slabs can be perfectly reused for percpu case. [guro@fb.com: v3] Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
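For illustration, the new counter can be read from a cgroup's memory.stat file. A minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup, that the stat key is named "percpu", and using a made-up cgroup path:

/* Print the "percpu" line (bytes) from one cgroup's memory.stat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/fs/cgroup/system.slice/memory.stat", "r");  /* illustrative path */

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "percpu ", 7) == 0)   /* e.g. "percpu 3674112" */
            fputs(line, stdout);
    fclose(f);
    return 0;
}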
2020-08-12  mm: memcg/percpu: account percpu memory to memory cgroups  (Roman Gushchin)
Percpu memory is becoming more and more widely used by various subsystems, and the total amount of memory controlled by the percpu allocator can make up a good part of the total memory. As an example, bpf maps can consume a lot of percpu memory, and they are created by a user. Also, some cgroup internals (e.g. memory controller statistics) can be quite large. On a machine with many CPUs and a big number of cgroups they can consume hundreds of megabytes. So the lack of memcg accounting is creating a breach in the memory isolation. Similar to the slab memory, percpu memory should be accounted by default. To implement the percpu accounting it's possible to take the slab memory accounting as a model to follow. Let's introduce two types of percpu chunks: root and memcg. What makes memcg chunks different is an additional space allocated to store memcg membership information. If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be used. If it's possible to charge the corresponding size to the target memory cgroup, allocation is performed, and the memcg ownership data is recorded. System-wide allocations are performed using root chunks, so there is no additional memory overhead. To implement a fast reparenting of percpu memory on memcg removal, we don't store mem_cgroup pointers directly: instead we use the obj_cgroup API, introduced for slab accounting. [akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning] [akpm@linux-foundation.org: move unreachable code, per Roman] [cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning] Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Bixuan Cui <cuibixuan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
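A kernel-style sketch of how a subsystem would obtain accounted percpu memory after this change. The struct and function names below are made up for illustration; alloc_percpu_gfp()/free_percpu() are the existing percpu API, and __GFP_ACCOUNT is what routes the allocation to a memcg-aware chunk as described above.

/*
 * Sketch (not from this patch): a percpu allocation made with
 * __GFP_ACCOUNT is charged to the allocating task's memory cgroup and
 * placed in a memcg chunk; without the flag it comes from a root chunk.
 */
#include <linux/percpu.h>
#include <linux/gfp.h>
#include <linux/types.h>

struct foo_stats {
	u64 events;
};

static struct foo_stats __percpu *stats;

static int foo_init(void)
{
	/* accounted: may fail with -ENOMEM if the memcg limit is hit */
	stats = alloc_percpu_gfp(struct foo_stats, GFP_KERNEL | __GFP_ACCOUNT);
	if (!stats)
		return -ENOMEM;
	return 0;
}

static void foo_exit(void)
{
	free_percpu(stats);	/* uncharges the owning memcg */
}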
2020-08-12  percpu: return number of released bytes from pcpu_free_area()  (Roman Gushchin)
Patch series "mm: memcg accounting of percpu memory", v3. This patchset adds percpu memory accounting to memory cgroups. It's based on the rework of the slab controller and reuses concepts and features introduced for the per-object slab accounting. Percpu memory is becoming more and more widely used by various subsystems, and the total amount of memory controlled by the percpu allocator can make a good part of the total memory. As an example, bpf maps can consume a lot of percpu memory, and they are created by a user. Also, some cgroup internals (e.g. memory controller statistics) can be quite large. On a machine with many CPUs and big number of cgroups they can consume hundreds of megabytes. So the lack of memcg accounting is creating a breach in the memory isolation. Similar to the slab memory, percpu memory should be accounted by default. Percpu allocations by their nature are scattered over multiple pages, so they can't be tracked on the per-page basis. So the per-object tracking introduced by the new slab controller is reused. The patchset implements charging of percpu allocations, adds memcg-level statistics, enables accounting for percpu allocations made by memory cgroup internals and provides some basic tests. To implement the accounting of percpu memory without a significant memory and performance overhead the following approach is used: all accounted allocations are placed into a separate percpu chunk (or chunks). These chunks are similar to default chunks, except that they do have an attached vector of pointers to obj_cgroup objects, which is big enough to save a pointer for each allocated object. On the allocation, if the allocation has to be accounted (__GFP_ACCOUNT is passed, the allocating process belongs to a non-root memory cgroup, etc), the memory cgroup is getting charged and if the maximum limit is not exceeded the allocation is performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the allocation is forced over the limit, depending on gfp (as any other kernel memory allocation). The memory cgroup information is saved in the obj_cgroup vector at the corresponding offset. On the release time the memcg information is restored from the vector and the cgroup is getting uncharged. Unaccounted allocations (at this point the absolute majority of all percpu allocations) are performed in the old way, so no additional overhead is expected. To avoid pinning dying memory cgroups by outstanding allocations, obj_cgroup API is used instead of directly saving memory cgroup pointers. obj_cgroup is basically a pointer to a memory cgroup with a standalone reference counter. The trick is that it can be atomically swapped to point at the parent cgroup, so that the original memory cgroup can be released prior to all objects, which has been charged to it. Because all charges and statistics are fully recursive, it's perfectly correct to uncharge the parent cgroup instead. This scheme is used in the slab memory accounting, and percpu memory can just follow the scheme. This patch (of 5): To implement accounting of percpu memory we need the information about the size of freed object. Return it from pcpu_free_area(). 
Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tobin C. Harding <tobin@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Bixuan Cui <cuibixuan@huawei.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-09  Merge tag 'kbuild-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild  (Linus Torvalds)
Pull Kbuild updates from Masahiro Yamada:
 - run the checker (e.g. sparse) after the compiler
 - remove unneeded cc-option tests for old compiler flags
 - fix tar-pkg to install dtbs
 - introduce ccflags-remove-y and asflags-remove-y syntax
 - allow to trace functions in sub-directories of lib/
 - introduce hostprogs-always-y and userprogs-always-y syntax
 - various Makefile cleanups
* tag 'kbuild-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: stop filtering out $(GCC_PLUGINS_CFLAGS) from cc-option base kbuild: include scripts/Makefile.* only when relevant CONFIG is enabled kbuild: introduce hostprogs-always-y and userprogs-always-y kbuild: sort hostprogs before passing it to ifneq kbuild: move host .so build rules to scripts/gcc-plugins/Makefile kbuild: Replace HTTP links with HTTPS ones kbuild: trace functions in subdirectories of lib/ kbuild: introduce ccflags-remove-y and asflags-remove-y kbuild: do not export LDFLAGS_vmlinux kbuild: always create directories of targets powerpc/boot: add DTB to 'targets' kbuild: buildtar: add dtbs support kbuild: remove cc-option test of -ffreestanding kbuild: remove cc-option test of -fno-stack-protector Revert "kbuild: Create directory for target DTB" kbuild: run the checker after the compiler
2020-08-07  Merge branch 'akpm' (patches from Andrew)  (Linus Torvalds)
Merge misc updates from Andrew Morton:
 - a few MM hotfixes
 - kthread, tools, scripts, ntfs and ocfs2
 - some of MM
Subsystems affected by this patch series: kthread, tools, scripts, ntfs, ocfs2 and mm (hotfixes, pagealloc, slab-generic, slab, slub, kcsan, debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore, sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan). * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits) mm: vmscan: consistent update to pgrefill mm/vmscan.c: fix typo khugepaged: khugepaged_test_exit() check mmget_still_valid() khugepaged: retract_page_tables() remember to test exit khugepaged: collapse_pte_mapped_thp() protect the pmd lock khugepaged: collapse_pte_mapped_thp() flush the right range mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible mm: thp: replace HTTP links with HTTPS ones mm/page_alloc: fix memalloc_nocma_{save/restore} APIs mm/page_alloc.c: skip setting nodemask when we are in interrupt mm/page_alloc: fallbacks at most has 3 elements mm/page_alloc: silence a KASAN false positive mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask() mm/page_alloc.c: simplify pageblock bitmap access mm/page_alloc.c: extract the common part in pfn_to_bitidx() mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits mm/shuffle: remove dynamic reconfiguration mm/memory_hotplug: document why shuffle_zone() is relevant mm/page_alloc: remove nr_free_pagecache_pages() mm: remove vm_total_pages ...
2020-08-07  mm: vmscan: consistent update to pgrefill  (Shakeel Butt)
The vmstat pgrefill is useful together with pgscan and pgsteal stats to measure the reclaim efficiency. However vmstat's pgrefill is not updated consistently at system level. It gets updated for both global and memcg reclaim however pgscan and pgsteal are updated for only global reclaim. So, update pgrefill only for global reclaim. If someone is interested in the stats representing both system level as well as memcg level reclaim, then consult the root memcg's memory.stat instead of /proc/vmstat. Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200711011459.1159929-1-shakeelb@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  mm/vmscan.c: fix typo  (dylan-meiners)
Change "optizimation" to "optimization". Signed-off-by: dylan-meiners <spacct.spacct@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200609185144.10049-1-spacct.spacct@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  khugepaged: khugepaged_test_exit() check mmget_still_valid()  (Hugh Dickins)
Move collapse_huge_page()'s mmget_still_valid() check into khugepaged_test_exit() itself. collapse_huge_page() is used for anon THP only, and earned its mmget_still_valid() check because it inserts a huge pmd entry in place of the page table's pmd entry; whereas collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp() merely clears the page table's pmd entry. But core dumping without mmap lock must have been as open to mistaking a racily cleared pmd entry for a page table at physical page 0, as exit_mmap() was. And we certainly have no interest in mapping as a THP once dumping core. Fixes: 59ea6d06cfa9 ("coredump: fix race condition between collapse_huge_page() and core dumping") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  khugepaged: retract_page_tables() remember to test exit  (Hugh Dickins)
Only once have I seen this scenario (and forgot even to notice what forced the eventual crash): a sequence of "BUG: Bad page map" alerts from vm_normal_page(), from zap_pte_range() servicing exit_mmap(); pmd:00000000, pte values corresponding to data in physical page 0. The pte mappings being zapped in this case were supposed to be from a huge page of ext4 text (but could as well have been shmem): my belief is that it was racing with collapse_file()'s retract_page_tables(), found *pmd pointing to a page table, locked it, but *pmd had become 0 by the time start_pte was decided. In most cases, that possibility is excluded by holding mmap lock; but exit_mmap() proceeds without mmap lock. Most of what's run by khugepaged checks khugepaged_test_exit() after acquiring mmap lock: khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so, for example. But retract_page_tables() did not: fix that. The fix is for retract_page_tables() to check khugepaged_test_exit(), after acquiring mmap lock, before doing anything to the page table. Getting the mmap lock serializes with __mmput(), which briefly takes and drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on mm_users makes sure we don't touch the page table once exit_mmap() might reach it, since exit_mmap() will be proceeding without mmap lock, not expecting anyone to be racing with it. Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  khugepaged: collapse_pte_mapped_thp() protect the pmd lock  (Hugh Dickins)
When retract_page_tables() removes a page table to make way for a huge pmd, it holds huge page lock, i_mmap_lock_write, mmap_write_trylock and pmd lock; but when collapse_pte_mapped_thp() does the same (to handle the case when the original mmap_write_trylock had failed), only mmap_write_trylock and pmd lock are held. That's not enough. One machine has twice crashed under load, with "BUG: spinlock bad magic" and GPF on 6b6b6b6b6b6b6b6b. Examining the second crash, page_vma_mapped_walk_done()'s spin_unlock of pvmw->ptl (serving page_referenced() on a file THP, that had found a page table at *pmd) discovers that the page table page and its lock have already been freed by the time it comes to unlock. Follow the example of retract_page_tables(), but we only need one of huge page lock or i_mmap_lock_write to secure against this: because it's the narrower lock, and because it simplifies collapse_pte_mapped_thp() to know the hpage earlier, choose to rely on huge page lock here. Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [5.4+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021213070.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  khugepaged: collapse_pte_mapped_thp() flush the right range  (Hugh Dickins)
pmdp_collapse_flush() should be given the start address at which the huge page is mapped, haddr: it was given addr, which at that point has been used as a local variable, incremented to the end address of the extent. Found by source inspection while chasing a hugepage locking bug, which I then could not explain by this. At first I thought this was very bad; then saw that all of the page translations that were not flushed would actually still point to the right pages afterwards, so harmless; then realized that I know nothing of how different architectures and models cache intermediate paging structures, so maybe it matters after all - particularly since the page table concerned is immediately freed. Much easier to fix than to think about. Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [5.4+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021204390.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible  (Peter Xu)
This is found by code observation only. Firstly, the worst case scenario should assume the whole range was covered by pmd sharing. The old algorithm might not work as expected for ranges like (1g-2m, 1g+2m), where the adjusted range should be (0, 1g+2m) but the expected range should be (0, 2g). While at it, remove the loop since it should not be required. With that, the new code should be faster too when the invalidating range is huge. Mike said: : With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only : adjust to (0, 1g+2m) which is incorrect. : : We should cc stable. The original reason for adjusting the range was to : prevent data corruption (getting wrong page). Since the range is not : always adjusted correctly, the potential for corruption still exists. : : However, I am fairly confident that adjust_range_if_pmd_sharing_possible : is only going to be called in two cases: : : 1) for a single page : 2) for range == entire vma : : In those cases, the current code should produce the correct results. : : To be safe, let's just cc stable. Fixes: 017b1660df89 ("mm: migration: fix migration of huge PMD shared pages") Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200730201636.74778-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
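The worst-case adjustment can be pictured with a small userspace sketch. This illustrates the idea only, not the kernel's adjust_range_if_pmd_sharing_possible() itself: assume the whole range may be covered by shared PMD page tables, round it out to PUD_SIZE boundaries, and clamp to the VMA.

/* Round an invalidation range out to PUD_SIZE (1GB) boundaries, clamped to the VMA. */
#include <stdio.h>

#define PUD_SIZE   (1UL << 30)
#define PUD_MASK   (~(PUD_SIZE - 1))

static void adjust_range(unsigned long vm_start, unsigned long vm_end,
                         unsigned long *start, unsigned long *end)
{
    unsigned long a_start = *start & PUD_MASK;                  /* round down */
    unsigned long a_end   = (*end + PUD_SIZE - 1) & PUD_MASK;   /* round up   */

    if (a_start < vm_start) a_start = vm_start;
    if (a_end > vm_end)     a_end = vm_end;
    *start = a_start;
    *end = a_end;
}

int main(void)
{
    unsigned long g = 1UL << 30, m = 1UL << 20;
    unsigned long start = g - 2 * m, end = g + 2 * m;   /* the (1g-2m, 1g+2m) example */

    adjust_range(0, 2 * g, &start, &end);
    printf("adjusted: (%lu MB, %lu MB)\n", start >> 20, end >> 20);  /* (0 MB, 2048 MB) */
    return 0;
}

For the (1g-2m, 1g+2m) range inside a (0, 2g) VMA this prints (0, 2g), the expected result from the message, with no loop involved.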
2020-08-07  mm: thp: replace HTTP links with HTTPS ones  (Alexander A. Klimov)
Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Deterministic algorithm: For each file: If not .svg: For each line: If doesn't contain `xmlns`: For each link, `http://[^# ]*(?:\w|/)`: If neither `gnu\.org/license`, nor `mozilla\.org/MPL`: If both the HTTP and HTTPS versions return 200 OK and serve the same content: Replace HTTP with HTTPS. [akpm@linux-foundation.org: fix amd.com URL, per Vlastimil] Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20200713164345.36088-1-grandmaster@al2klimov.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  mm/page_alloc: fix memalloc_nocma_{save/restore} APIs  (Joonsoo Kim)
Currently, memalloc_nocma_{save/restore} API that prevents CMA area in page allocation is implemented by using current_gfp_context(). However, there are two problems of this implementation. First, this doesn't work for allocation fastpath. In the fastpath, original gfp_mask is used since current_gfp_context() is introduced in order to control reclaim and it is on slowpath. So, CMA area can be allocated through the allocation fastpath even if memalloc_nocma_{save/restore} APIs are used. Currently, there is just one user for these APIs and it has a fallback method to prevent actual problem. Second, clearing __GFP_MOVABLE in current_gfp_context() has a side effect to exclude the memory on the ZONE_MOVABLE for allocation target. To fix these problems, this patch changes the implementation to exclude CMA area in page allocation. Main point of this change is using the alloc_flags. alloc_flags is mainly used to control allocation so it fits for excluding CMA area in allocation. Fixes: d7fefcc8de91 (mm/cma: add PF flag to force non cma alloc) Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Roman Gushchin <guro@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Link: http://lkml.kernel.org/r/1595468942-29687-1-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
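A kernel-style sketch of the scoped API this patch fixes. The surrounding function is hypothetical, but memalloc_nocma_save()/memalloc_nocma_restore() are the API named in the subject; within the window, page allocations are supposed to avoid CMA pageblocks, and the patch makes that hold on the allocation fastpath too.

#include <linux/sched/mm.h>
#include <linux/gfp.h>

/* Allocate a page that must not come from a CMA area (illustrative helper). */
static struct page *alloc_non_cma_page(void)
{
	unsigned int flags = memalloc_nocma_save();
	struct page *page = alloc_page(GFP_KERNEL);	/* excluded from CMA within the scope */

	memalloc_nocma_restore(flags);
	return page;
}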
2020-08-07  mm/page_alloc.c: skip setting nodemask when we are in interrupt  (Muchun Song)
When we are in interrupt context, the allocation is unrelated to the current task's context. If we use the current task's mems_allowed, the fast-path allocation can fail and fall back to the slow path when the current node (the one in the task's mems_allowed) does not have enough memory to allocate. In this case, it slows down memory allocation in interrupt context. So we can skip setting the nodemask to allow any node to allocate memory, so that the fast-path allocation can succeed. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200706025921.53683-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  mm/page_alloc: fallbacks at most has 3 elements  (Wei Yang)
MIGRATE_TYPES is used as the end mark, and there are at most 3 elements in each one-dimensional array. Reduce the size to 3 to save a little memory. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200625231022.18784-1-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  mm/page_alloc: silence a KASAN false positive  (Qian Cai)
kernel_init_free_pages() will use memset() on s390 to clear all pages from kmalloc_order() which will override KASAN redzones because a redzone was setup from the end of the allocation size to the end of the last page. Silence it by not reporting it there. An example of the report is, BUG: KASAN: slab-out-of-bounds in __free_pages_ok Write of size 4096 at addr 000000014beaa000 Call Trace: show_stack+0x152/0x210 dump_stack+0x1f8/0x248 print_address_description.isra.13+0x5e/0x4d0 kasan_report+0x130/0x178 check_memory_region+0x190/0x218 memset+0x34/0x60 __free_pages_ok+0x894/0x12f0 kfree+0x4f2/0x5e0 unpack_to_rootfs+0x60e/0x650 populate_rootfs+0x56/0x358 do_one_initcall+0x1f4/0xa20 kernel_init_freeable+0x758/0x7e8 kernel_init+0x1c/0x170 ret_from_fork+0x24/0x28 Memory state around the buggy address: 000000014bea9f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 000000014bea9f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >000000014beaa000: 03 fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe ^ 000000014beaa080: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe 000000014beaa100: fe fe fe fe fe fe fe fe fe fe fe fe fe fe Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Vasily Gorbik <gor@linux.ibm.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Link: http://lkml.kernel.org/r/20200610052154.5180-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()  (Wei Yang)
After previous cleanup, the end_bitidx is not necessary any more. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/20200623124201.8199-4-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  mm/page_alloc.c: simplify pageblock bitmap access  (Wei Yang)
Since commit e58469bafd05 ("mm: page_alloc: use word-based accesses for get/set pageblock bitmaps"), the pageblock bitmap is accessed with word-based accesses. This operation can be simplified a little. Intuitively, to get a bit range [start_bitidx, end_bitidx] in a word, we can do: mask = (1 << (end_bitidx - start_bitidx + 1)) - 1; ret = (word >> start_bitidx) & mask; And to set a bit range [start_bitidx, end_bitidx] to flags, we can do the same by just shifting flags left by start_bitidx. By doing so we reduce some instructions for these two helper functions: Before Patched set_pfnblock_flags_mask 209 198 (-5%) get_pfnblock_flags_mask 101 87 (-13%) Since the syntax changed a little, we need to check the whole 4-bit migrate_type instead of part of it. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/20200623124201.8199-3-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
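A standalone userspace version of the word-based bit-range access sketched in the message. This is an illustration only, not the pageblock helpers themselves.

/* Get/set a bit range [start_bitidx, end_bitidx] within a single word. */
#include <stdio.h>

static unsigned long get_bits(unsigned long word, int start_bitidx, int end_bitidx)
{
    unsigned long mask = (1UL << (end_bitidx - start_bitidx + 1)) - 1;
    return (word >> start_bitidx) & mask;
}

static unsigned long set_bits(unsigned long word, unsigned long flags,
                              int start_bitidx, int end_bitidx)
{
    unsigned long mask = ((1UL << (end_bitidx - start_bitidx + 1)) - 1) << start_bitidx;
    return (word & ~mask) | ((flags << start_bitidx) & mask);
}

int main(void)
{
    unsigned long word = 0;

    word = set_bits(word, 0x5, 4, 7);                        /* store a 4-bit value at bits 4..7 */
    printf("0x%lx -> %lu\n", word, get_bits(word, 4, 7));    /* prints 0x50 -> 5 */
    return 0;
}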
2020-08-07  mm/page_alloc.c: extract the common part in pfn_to_bitidx()  (Wei Yang)
The return value calculation is the same both for SPARSEMEM or not. Just take it out. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/20200623124201.8199-2-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  mm/shuffle: remove dynamic reconfiguration  (David Hildenbrand)
Commit e900a918b098 ("mm: shuffle initial free memory to improve memory-side-cache utilization") promised "autodetection of a memory-side-cache (to be added in a follow-on patch)" over a year ago. The original series included patches [1], however, they were dropped during review [2] to be followed-up later. Due to lack of platforms that publish an HMAT, autodetection is currently not implemented. However, manual activation is actively used [3]. Let's simplify for now and re-add when really (ever?) needed. [1] https://lkml.kernel.org/r/154510700291.1941238.817190985966612531.stgit@dwillia2-desk3.amr.corp.intel.com [2] https://lkml.kernel.org/r/154690326478.676627.103843791978176914.stgit@dwillia2-desk3.amr.corp.intel.com [3] https://lkml.kernel.org/r/CAPcyv4irwGUU2x+c6b4L=KbB1dnasNKaaZd6oSpYjL9kfsnROQ@mail.gmail.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dan Williams <dan.j.williams@intel.com> Link: http://lkml.kernel.org/r/20200624094741.9918-4-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07  mm/memory_hotplug: document why shuffle_zone() is relevant  (David Hildenbrand)