path: root/mm/memcontrol.c
Age  Commit message  Author
2020-06-26  mm/memcontrol.c: prevent missed memory.low load tears  (Chris Down)
Looks like one of these got missed when massaging in f86b810c2610 ("mm, memcg: prevent memory.low load/store tearing") with other linux-mm changes. Link: http://lkml.kernel.org/r/20200612174437.GA391453@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Reported-by: Michal Koutny <mkoutny@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26  mm/memcontrol.c: add missed css_put()  (Muchun Song)
We should put the css reference when memory allocation failed. Link: http://lkml.kernel.org/r/20200614122653.98829-1-songmuchun@bytedance.com Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Qian Cai <cai@lca.pw> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26  mm: memcontrol: handle div0 crash race condition in memory.low  (Johannes Weiner)
Tejun reports seeing rare div0 crashes in memory.low stress testing:

    RIP: 0010:mem_cgroup_calculate_protection+0xed/0x150
    Code: 0f 46 d1 4c 39 d8 72 57 f6 05 16 d6 42 01 40 74 1f 4c 39 d8 76 1a 4c 39 d1 76 15 4c 29 d1 4c 29 d8 4d 29 d9 31 d2 48 0f af c1 <49> f7 f1 49 01 c2 4c 89 96 38 01 00 00 5d c3 48 0f af c7 31 d2 49
    RSP: 0018:ffffa14e01d6fcd0 EFLAGS: 00010246
    RAX: 000000000243e384 RBX: 0000000000000000 RCX: 0000000000008f4b
    RDX: 0000000000000000 RSI: ffff8b89bee84000 RDI: 0000000000000000
    RBP: ffffa14e01d6fcd0 R08: ffff8b89ca7d40f8 R09: 0000000000000000
    R10: 0000000000000000 R11: 00000000006422f7 R12: 0000000000000000
    R13: ffff8b89d9617000 R14: ffff8b89bee84000 R15: ffffa14e01d6fdb8
    FS:  0000000000000000(0000) GS:ffff8b8a1f1c0000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f93b1fc175b CR3: 000000016100a000 CR4: 0000000000340ea0
    Call Trace:
     shrink_node+0x1e5/0x6c0
     balance_pgdat+0x32d/0x5f0
     kswapd+0x1d7/0x3d0
     kthread+0x11c/0x160
     ret_from_fork+0x1f/0x30

This happens when parent_usage == siblings_protected. We check that usage is bigger than protected, which should imply parent_usage being bigger than siblings_protected. However, we don't read (or even update) these values atomically, and they can be out of sync as the memory state changes under us. A bit of fluctuation around the target protection isn't a big deal, but we need to handle the div0 case.

Check the parent state explicitly to make sure we have a reasonable positive value for the divisor.

Link: http://lkml.kernel.org/r/20200615140658.601684-1-hannes@cmpxchg.org
Fixes: 8a931f801340 ("mm: memcontrol: recursive memory.low protection")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Roman Gushchin <guro@fb.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
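A minimal sketch of the kind of divisor guard described here (simplified, hypothetical variable names, not the literal kernel patch):

    /*
     * Only scale by the parent's unclaimed share when the divisor is
     * known to be positive; the racy reads can otherwise leave
     * parent_usage == siblings_protected and divide by zero.
     */
    if (usage > protected && parent_usage > siblings_protected) {
            unsigned long divisor = parent_usage - siblings_protected;

            protection = protection * (usage - protected) / divisor;
    }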
2020-06-09  mmap locking API: convert mmap_sem comments  (Michel Lespinasse)
Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09  mmap locking API: use coccinelle to convert mmap_sem rwsem call sites  (Michel Lespinasse)
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead.

The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .
    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04  mm, memcg: fix some typos in memcontrol.c  (Ethon Paul)
There are some typos in comments; fix them:

    s/responsiblity/responsibility
    s/oflline/offline

Signed-off-by: Ethon Paul <ethp@qq.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200411064246.15781-1-ethp@qq.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: base LRU balancing on an explicit cost model  (Johannes Weiner)
Currently, scan pressure between the anon and file LRU lists is balanced based on a mixture of reclaim efficiency and a somewhat vague notion of "value" of having certain pages in memory over others. That concept of value is problematic, because it has caused us to count any event that remotely makes one LRU list more or less preferable for reclaim, even when these events are not directly comparable and impose very different costs on the system. One example is referenced file pages that we still deactivate and referenced anonymous pages that we actually rotate back to the head of the list.

There is also conceptual overlap with the LRU algorithm itself. By rotating recently used pages instead of reclaiming them, the algorithm already biases the applied scan pressure based on page value. Thus, when rebalancing scan pressure due to rotations, we should think of reclaim cost, and leave assessing the page value to the LRU algorithm.

Lastly, considering both value-increasing as well as value-decreasing events can sometimes cause the same type of event to be counted twice, i.e. rotating a page increases the LRU value, while reclaiming it successfully decreases the value. In itself this will balance out fine, but it quietly skews the impact of events that are only recorded once.

The abstract metric of "value", the murky relationship with the LRU algorithm, and accounting both negative and positive events make the current pressure balancing model hard to reason about and modify.

This patch switches to a balancing model of accounting the concrete, actually observed cost of reclaiming one LRU over another. For now, that cost includes pages that are scanned but rotated back to the list head. Subsequent patches will add consideration for IO caused by refaulting of recently evicted pages.

Replace struct zone_reclaim_stat with two cost counters in the lruvec, and make everything that affects cost go through a new lru_note_cost() function.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
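An illustrative sketch of the cost-counter idea in plain C (hypothetical names; the kernel's lru_note_cost() operates on struct lruvec):

    #include <stdbool.h>

    struct lru_cost {
            unsigned long anon_cost;   /* work spent reclaiming the anon side */
            unsigned long file_cost;   /* work spent reclaiming the file side */
    };

    /* Record that reclaiming one side of the LRU just cost some work;
     * the scan balance later compares anon_cost against file_cost. */
    static void note_cost(struct lru_cost *c, bool file, unsigned long nr_pages)
    {
            if (file)
                    c->file_cost += nr_pages;
            else
                    c->anon_cost += nr_pages;
    }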
2020-06-03  mm: memcontrol: update page->mem_cgroup stability rules  (Johannes Weiner)
The previous patches have simplified the access rules around page->mem_cgroup somewhat:

1. We never change page->mem_cgroup while the page is isolated by somebody else. This was by far the biggest exception to our rules and it didn't stop at lock_page() or lock_page_memcg().

2. We charge pages before they get put into page tables now, so the somewhat fishy rule about "can be in page table as long as it's still locked" is now gone and boiled down to having an exclusive reference to the page.

Document the new rules. Any of the following will stabilize the page->mem_cgroup association:

- the page lock
- LRU isolation
- lock_page_memcg()
- exclusive access to the page

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-20-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: delete unused lrucare handling  (Johannes Weiner)
Swapin faults were the last event to charge pages after they had already been put on the LRU list. Now that we charge directly on swapin, the lrucare portion of the charge code is unused. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: make swap tracking an integral part of memory control  (Johannes Weiner)
Without swap page tracking, users that are otherwise memory controlled can easily escape their containment and allocate significant amounts of memory that they're not being charged for. That's because swap does readahead, but without the cgroup records of who owned the page at swapout, readahead pages don't get charged until somebody actually faults them into their page table and we can identify an owner task. This can be maliciously exploited with MADV_WILLNEED, which triggers arbitrary readahead allocations without charging the pages.

Make swap page tracking an integral part of memcg and remove the Kconfig options. In the first place, it was only made configurable to allow users to save some memory. But the overhead of tracking cgroup ownership per swap page is minimal - 2 bytes per page, or 512k per 1G of swap, or 0.04%. Saving that at the expense of broken containment semantics is not something we should present as a coequal option.

The swapaccount=0 boot option will continue to exist, and it will eliminate the page_counter overhead and hide the swap control files, but it won't disable swap slot ownership tracking.

This patch makes sure we always have the cgroup records at swapin time; the next patch will fix the actual bug by charging readahead swap pages at swapin time rather than at fault time.

v2: fix double swap charge bug in cgroup1/cgroup2 code gating

[hannes@cmpxchg.org: fix crash with cgroup_disable=memory]
Link: http://lkml.kernel.org/r/20200521215855.GB815153@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: http://lkml.kernel.org/r/20200508183105.225460-16-hannes@cmpxchg.org
Debugged-by: Hugh Dickins <hughd@google.com>
Debugged-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: prepare swap controller setup for integration  (Johannes Weiner)
A few cleanups to streamline the swap controller setup:

- Replace the do_swap_account flag with cgroup_memory_noswap. This brings it in line with other functionality that is usually available unless explicitly opted out of - nosocket, nokmem.

- Remove the really_do_swap_account flag that stores the boot option and is later used to switch the do_swap_account. It's not clear why this indirection is/was necessary. Use do_swap_account directly.

- Minor coding style polishing

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-15-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: drop unused try/commit/cancel charge API  (Johannes Weiner)
There are no more users. RIP in peace. [arnd@arndb.de: fix an unused-function warning] Link: http://lkml.kernel.org/r/20200528095640.151454-1-arnd@arndb.de Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-14-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: switch to native NR_ANON_THPS counter  (Johannes Weiner)
With rmap memcg locking already in place for NR_ANON_MAPPED, it's just a small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the native NR_ANON_THPS accounting sites. [hannes@cmpxchg.org: fixes] Link: http://lkml.kernel.org/r/20200512121750.GA397968@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> [build-tested] Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-12-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: switch to native NR_ANON_MAPPED counter  (Johannes Weiner)
Memcg maintains a private MEMCG_RSS counter. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the same way we do for NR_FILE_MAPPED. With the previous patch removing MEMCG_CACHE and the private NR_SHMEM counter, this patch finally eliminates the need to have page->mapping set up at charge time. However, we need to have page->mem_cgroup set up by the time rmap runs and does the accounting, so switch the commit and the rmap callbacks around. v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo) Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
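A compact sketch of the accounting pattern described above (kernel-style fragment, illustrative only, not the full rmap conversion):

    /* Stabilize page->mem_cgroup for the stat update, the same way
     * NR_FILE_MAPPED is handled, then feed the native vmstat counter. */
    lock_page_memcg(page);
    __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr_pages);
    unlock_page_memcg(page);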
2020-06-03  mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM counters  (Johannes Weiner)
Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM. The page is already locked in these places, so page->mem_cgroup is stable; we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure it's set up in time. Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private NR_SHMEM accounting sites. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: prepare cgroup vmstat infrastructure for native anon counters  (Johannes Weiner)
Anonymous compound pages can be mapped by ptes, which means that if we want to track NR_ANON_MAPPED, NR_ANON_THPS on a per-cgroup basis, we have to be prepared to see tail pages in our accounting functions.

Make mod_lruvec_page_state() and lock_page_memcg() deal with tail pages correctly, namely by redirecting to the head page which has the page->mem_cgroup set up.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-9-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: prepare move_account for removal of private page type counters  (Johannes Weiner)
When memcg uses the generic vmstat counters, it doesn't need to do anything at charging and uncharging time. It does, however, need to migrate counts when pages move to a different cgroup in move_account.

Prepare the move_account function for the arrival of NR_FILE_PAGES, NR_ANON_MAPPED, NR_ANON_THPS etc. by having a branch for files and a branch for anon, which can then be divided into sub-branches.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-8-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: prepare uncharging for removal of private page type counters  (Johannes Weiner)
The uncharge batching code adds up the anon, file, kmem counts to determine the total number of pages to uncharge and references to drop. But the next patches will remove the anon and file counters. Maintain an aggregate nr_pages in the uncharge_gather struct. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-7-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: convert page cache to a new mem_cgroup_charge() API  (Johannes Weiner)
The try/commit/cancel protocol that memcg uses dates back to when pages used to be uncharged upon removal from the page cache, and thus couldn't be committed before the insertion had succeeded. Nowadays, pages are uncharged when they are physically freed; it doesn't matter whether the insertion was successful or not. For the page cache, the transaction dance has become unnecessary.

Introduce a mem_cgroup_charge() function that simply charges a newly allocated page to a cgroup and sets up page->mem_cgroup in one single step. If the insertion fails, the caller doesn't have to do anything but free/put the page.

Then switch the page cache over to this new API.

Subsequent patches will also convert anon pages, but it needs a bit more prep work. Right now, memcg depends on page->mapping being already set up at the time of charging, so that it can maintain its own MEMCG_CACHE and MEMCG_RSS counters. For anon, page->mapping is set under the same pte lock under which the page is published, so a single charge point that can block doesn't work there just yet.

The following prep patches will replace the private memcg counters with the generic vmstat counters, thus removing the page->mapping dependency, then complete the transition to the new single-point charge API and delete the old transactional scheme.

v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
v3: rebase on preceding shmem simplification patch

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
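A rough sketch of the caller pattern this enables (insert_into_cache() is a hypothetical helper; this is not the actual page cache conversion):

    /* Charge and commit in a single step before publishing the page;
     * on failure the caller just drops its reference. */
    static int add_new_cache_page(struct address_space *mapping, pgoff_t index,
                                  struct mm_struct *mm, gfp_t gfp)
    {
            struct page *page = alloc_page(gfp);
            int err;

            if (!page)
                    return -ENOMEM;

            err = mem_cgroup_charge(page, mm, gfp);
            if (!err)
                    err = insert_into_cache(mapping, page, index); /* hypothetical */

            if (err)
                    put_page(page); /* uncharge happens when the page is freed */
            return err;
    }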
2020-06-03  mm: memcontrol: move out cgroup swaprate throttling  (Johannes Weiner)
The cgroup swaprate throttling is about matching new anon allocations to the rate of available IO when that is being throttled. It's the io controller hooking into the VM, rather than a memory controller thing. Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(), and drop the @memcg argument which is only used to check whether the preceding page charge has succeeded and the fault is proceeding. We could decouple the call from mem_cgroup_try_charge() here as well, but that would cause unnecessary churn: the following patches convert all callsites to a new charge API and we'll decouple as we go along. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-5-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: drop @compound parameter from memcg charging API  (Johannes Weiner)
The memcg charging API carries a boolean @compound parameter that tells whether the page we're dealing with is a hugepage. mem_cgroup_commit_charge() has another boolean @lrucare that indicates whether the page needs LRU locking or not while charging. The majority of callsites know those parameters at compile time, which results in a lot of naked "false, false" argument lists. This makes for cryptic code and is a breeding ground for subtle mistakes.

Thankfully, the huge page state can be inferred from the page itself and doesn't need to be passed along. This is safe because charging completes before the page is published, and thus before anybody could split it. Simplify the callsites by removing @compound, and let memcg infer the state by using hpage_nr_pages() unconditionally. That function does PageTransHuge() to identify huge pages, which also helpfully asserts that nobody passes in tail pages by accident.

The following patches will introduce a new charging API, best not to carry over unnecessary weight.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03  mm: memcontrol: fix stat-corrupting race in charge moving  (Johannes Weiner)
The move_lock is a per-memcg lock, but the VM accounting code that needs to acquire it comes from the page and follows page->mem_cgroup under RCU protection. That means that the page becomes unlocked not when we drop the move_lock, but when we update page->mem_cgroup. And that assignment doesn't imply any memory ordering. If that pointer write gets reordered against the reads of the page state - page_mapped, PageDirty etc. the state may change while we rely on it being stable and we can end up corrupting the counters. Place an SMP memory barrier to make sure we're done with all page state by the time the new page->mem_cgroup becomes visible. Also replace the open-coded move_lock with a lock_page_memcg() to make it more obvious what we're serializing against. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-3-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
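A condensed sketch of the ordering described (the comment marks the elided stat migration; this is not the full move_account code):

    lock_page_memcg(page);       /* take @from's move_lock, block stat updaters */

    /* ... migrate page_mapped()/PageDirty() etc. counts from @from to @to ... */

    smp_mb();                    /* finish all page state accesses ...          */
    page->mem_cgroup = to;       /* ... before the new owner becomes visible    */

    __unlock_page_memcg(from);   /* drop the move_lock taken above */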
2020-06-03  mm/memcg: optimize memory.numa_stat like memory.stat  (Shakeel Butt)
Currently reading memory.numa_stat traverses the underlying memcg tree multiple times to accumulate the stats to present the hierarchical view of the memcg tree. However the kernel already maintains the hierarchical view of the stats and uses it in memory.stat. Just use the same mechanism in memory.numa_stat as well.

I ran a simple benchmark which reads root_mem_cgroup's memory.numa_stat file in the presence of 10000 memcgs. The results are:

Without the patch:

    $ time cat /dev/cgroup/memory/memory.numa_stat > /dev/null
    real    0m0.700s
    user    0m0.001s
    sys     0m0.697s

With the patch:

    $ time cat /dev/cgroup/memory/memory.numa_stat > /dev/null
    real    0m0.001s
    user    0m0.001s
    sys     0m0.000s

[akpm@linux-foundation.org: avoid forcing out-of-line code generation]
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200304022058.248270-1-shakeelb@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02  memcg: fix memcg_kmem_bypass() for remote memcg charging  (Zefan Li)
While trying to use remote memcg charging in an out-of-tree kernel module I found it's not working, because the current thread is a workqueue thread. As we will probably encounter this issue in the future as the users of memalloc_use_memcg() grow, and there's nothing wrong with this usage, it's better we fix it now.

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/1d202a12-26fe-0012-ea14-f025ddcd044a@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02  mm/memcg: automatically penalize tasks with high swap use  (Jakub Kicinski)
Add a memory.swap.high knob, which can be used to protect the system from swap exhaustion. The mechanism used for penalizing is similar to the memory.high penalty (sleep on return to user space).

That is not to say that the knob itself is equivalent to memory.high. The objective is more to protect the system from potentially buggy tasks consuming a lot of swap and impacting other tasks, or even bringing the whole system to a standstill with complete swap exhaustion. Hopefully without the need to find per-task hard limits.

Slowing misbehaving tasks down gradually allows user space oom killers or other protection mechanisms to react. oomd and earlyoom already do killing based on swap exhaustion, and memory.swap.high protection will help implement such userspace oom policies more reliably.

We can use one counter for the number of pages allocated under pressure to save struct task space and avoid two separate hierarchy walks on the hot path. The exact overage is calculated on return to user space, anyway.

Take the new high limit into account when determining if swap is "full". Borrowing the explanation from Johannes:

The idea behind "swap full" is that as long as the workload has plenty of swap space available and it's not changing its memory contents, it makes sense to generously hold on to copies of data in the swap device, even after the swapin. A later reclaim cycle can drop the page without any IO. Trading disk space for IO.

But the only two ways to reclaim a swap slot are when the pages are faulted in and the references go away, or by scanning the virtual address space like swapoff does - which is very expensive (one could argue it's too expensive even for swapoff, it's often more practical to just reboot).

So at some point in the fill level, we have to start freeing up swap slots on fault/swapin. Otherwise we could eventually run out of swap slots while they're filled with copies of data that is also in RAM. We don't want to OOM a workload because its available swap space is filled with redundant cache.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Chris Down <chris@chrisdown.name>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200527195846.102707-5-kuba@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02  mm/memcg: move cgroup high memory limit setting into struct page_counter  (Jakub Kicinski)
High memory limit is currently recorded directly in struct mem_cgroup. We are about to add a high limit for swap, move the field to struct page_counter and add some helpers. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Down <chris@chrisdown.name> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200527195846.102707-4-kuba@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02  mm/memcg: move penalty delay clamping out of calculate_high_delay()  (Jakub Kicinski)
We will want to call calculate_high_delay() twice - once for memory and once for swap - and we should apply the clamp value to the sum of the penalties. Clamping has to be applied outside of calculate_high_delay().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chris Down <chris@chrisdown.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20200527195846.102707-3-kuba@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
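A sketch of the intended call pattern (MAX_PENALTY_JIFFIES and the overage arguments are placeholders, not the real identifiers):

    /* Penalize for memory and swap overage separately, then clamp the sum
     * so a single return-to-userspace never sleeps longer than the cap. */
    penalty_jiffies  = calculate_high_delay(memcg, nr_pages, mem_overage);
    penalty_jiffies += calculate_high_delay(memcg, nr_pages, swap_overage);
    penalty_jiffies  = min(penalty_jiffies, MAX_PENALTY_JIFFIES);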
2020-06-02  mm/memcg: prepare for swap over-high accounting and penalty calculation  (Jakub Kicinski)
Patch series "memcg: Slow down swap allocation as the available space gets depleted", v6.

Tejun describes the problem as follows:

When swap runs out, there's an abrupt change in system behavior - the anonymous memory suddenly becomes unmanageable which readily breaks any sort of memory isolation and can bring down the whole system. To avoid that, oomd [1] monitors free swap space and triggers kills when it drops below the specific threshold (e.g. 15%).

While this works, it's far from ideal:

- Depending on IO performance and total swap size, a given headroom might not be enough or might be too much.

- oomd has to monitor swap depletion in addition to the usual pressure metrics and it currently doesn't consider memory.swap.max.

Solve this by adapting parts of the approach that memory.high uses - slow down allocation as the resource gets depleted, turning the depletion behavior from an abrupt cliff into a gradual degradation observable through the memory pressure metric.

[1] https://github.com/facebookincubator/oomd

This patch (of 4):

Slice the memory overage calculation logic a little bit so we can reuse it to apply a similar penalty to the swap. The logic which accesses the memory-specific fields (use and high values) has to be taken out of calculate_high_delay().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20200527195846.102707-1-kuba@kernel.org
Link: http://lkml.kernel.org/r/20200527195846.102707-2-kuba@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02  memcg: expose root cgroup's memory.stat  (Shakeel Butt)
One way to measure the efficiency of memory reclaim is to look at the ratio (pgscan+pgrefill)/pgsteal. However at the moment these stats are not updated consistently at the system level and the ratio of these is not very meaningful. The pgsteal and pgscan are updated only for global reclaim while pgrefill gets updated for global as well as cgroup reclaim.

Please note that this difference is only for system level vmstats. The cgroup stats returned by memory.stat are actually consistent. The cgroup's pgsteal contains the number of reclaimed pages for global as well as cgroup reclaim. So, one way to get the system level stats is to get these stats from root's memory.stat, so, expose memory.stat for the root cgroup.

From Johannes Weiner: There are subtle differences between /proc/vmstat and memory.stat, and cgroup-aware code that wants to watch the full hierarchy currently has to know about these intricacies and translate semantics back and forth. Generally having the fully recursive memory.stat at the root level could help a broader range of usecases.

Why not fix the stats by including both the global and cgroup reclaim activity instead of exposing root cgroup's memory.stat? The reason is the benefit of having metrics exposing the activity that happens purely due to machine capacity rather than localized activity that happens due to the limits throughout the cgroup tree. Additionally there are userspace tools like sysstat(sar) which read these stats to inform about the system level reclaim activity. So, we should not break such use-cases.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200508170630.94406-1-shakeelb@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02  mm: memcontrol: simplify value comparison between count and limit  (Kaixu Xia)
When the variables count and limit have the same value (count == limit), the result of the min(margin, limit - count) statement is 0, so the variable margin ends up set to 0. In that case the min() statement is not necessary and we can directly set the variable margin to 0.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/1587479661-27237-1-git-send-email-kaixuxia@tencent.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
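A sketch of the simplified margin calculation in mem_cgroup_margin()-style code (illustrative; field names as used elsewhere in this log):

    unsigned long margin = 0;
    unsigned long count = page_counter_read(&memcg->memory);
    unsigned long limit = READ_ONCE(memcg->memory.max);

    if (count < limit)
            margin = limit - count;   /* count == limit leaves margin at 0 */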
2020-06-02  mm, memcg: add workingset_restore in memory.stat  (Yafang Shao)
There's a new workingset counter introduced in commit 1899ad18c607 ("mm: workingset: tell cache transitions from workingset thrashing"). With the help of this counter we can know whether the workingset is transitioning or thrashing. To leverage the benefit of this counter for memcg, we should introduce it into memory.stat. Then we could know the workingset of the workload inside a memcg better.

Below is the verification of this new counter in memory.stat. Read a file into the memory and then read it again to make these pages be active. The size of this file is 1G. (memory.max is greater than file size) The counters in memory.stat will be

    inactive_file 0
    active_file 1073639424
    workingset_refault 0
    workingset_activate 0
    workingset_restore 0
    workingset_nodereclaim 0

Trigger the memcg reclaim by setting a lower value to memory.high, and then some pages will be demoted into the inactive list, and then some pages in the inactive list will be evicted into the storage.

    inactive_file 498094080
    active_file 310063104
    workingset_refault 0
    workingset_activate 0
    workingset_restore 0
    workingset_nodereclaim 0

Then recover the memory.high and read the file into memory again. As a result of it, the transitioning will occur. Below is the result of this transitioning,

    inactive_file 498094080
    active_file 575397888
    workingset_refault 64746
    workingset_activate 64746
    workingset_restore 64746
    workingset_nodereclaim 0

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Link: http://lkml.kernel.org/r/20200504153522.11553-1-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02  mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead  (NeilBrown)
After an NFS page has been written it is considered "unstable" until a COMMIT request succeeds. If the COMMIT fails, the page will be re-written. These "unstable" pages are currently accounted as "reclaimable", either in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a 'reclaimable' count. This might have made sense when sending the COMMIT required a separate action by the VFS/MM (e.g. releasepage() used to send a COMMIT). However now that all writes generated by ->writepages() will automatically be followed by a COMMIT (since commit 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete")) it makes more sense to treat them as writeback pages. So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in NR_WRITEBACK and WB_WRITEBACK. A particular effect of this change is that when wb_check_background_flush() calls wb_over_bg_threshold(), the latter will report 'true' a lot less often as the 'unstable' pages are no longer considered 'dirty' (as there is nothing that writeback can do about them anyway). Currently wb_check_background_flush() will trigger writeback to NFS even when there are relatively few dirty pages (if there are lots of unstable pages), this can result in small writes going to the server (10s of Kilobytes rather than a Megabyte) which hurts throughput. With this patch, there are fewer writes which are each larger on average. Where the NR_UNSTABLE_NFS count was included in statistics virtual-files, the entry is retained, but the value is hard-coded as zero. static trace points and warning printks which mentioned this counter no longer report it. [akpm@linux-foundation.org: re-layout comment] [akpm@linux-foundation.org: fix printk warning] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com> Acked-by: Michal Hocko <mhocko@suse.com> [mm] Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-07  mm, memcg: fix error return value of mem_cgroup_css_alloc()  (Yafang Shao)
When I run my memcg testcase which creates lots of memcgs, I found there're unexpected out of memory logs while there're still enough available free memory. The error log is

    mkdir: cannot create directory 'foo.65533': Cannot allocate memory

The reason is when we try to create more than MEM_CGROUP_ID_MAX memcgs, an -ENOMEM errno will be set by mem_cgroup_css_alloc(), but the right errno should be -ENOSPC "No space left on device", which is an appropriate errno for userspace's failed mkdir.

As the errno really misled me, we should make it right. After this patch, the error log will be

    mkdir: cannot create directory 'foo.65533': No space left on device

[akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal]
Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Link: http://lkml.kernel.org/r/20200407063621.GA18914@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1586192163-20099-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
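A sketch of roughly the shape of the allocation path described (ID-space bounds and error handling simplified; treat the details as assumptions):

    /* When the memcg ID space (not memory) is exhausted, report ENOSPC so
     * userspace's mkdir sees "No space left on device" instead of ENOMEM. */
    memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL, 1, MEM_CGROUP_ID_MAX, GFP_KERNEL);
    if (memcg->id.id < 0) {
            error = -ENOSPC;
            goto fail;
    }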
2020-04-10  mm, memcg: do not high throttle allocators based on wraparound  (Jakub Kicinski)
If a cgroup violates its memory.high constraints, we may end up unduly penalising it. For example, for the following hierarchy:

    A:   max high, 20 usage
    A/B: 9 high, 10 usage
    A/C: max high, 10 usage

We would end up doing the following calculation below when calculating high delay for A/B:

    A/B: 10 - 9 = 1...
    A:   20 - PAGE_COUNTER_MAX = 21, so set max_overage to 21.

This gets worse with higher disparities in usage in the parent.

I have no idea how this disappeared from the final version of the patch, but it is certainly Not Good(tm). This wasn't obvious in testing because, for a simple cgroup hierarchy with only one child, the result is usually roughly the same. It's only in more complex hierarchies that things go really awry (although still, the effects are limited to a maximum of 2 seconds in schedule_timeout_killable).

[chris@chrisdown.name: changelog]
Fixes: e26733e0d0ec ("mm, memcg: throttle allocators based on ancestral memory.high")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [5.4.x]
Link: http://lkml.kernel.org/r/20200331152424.GA1019937@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
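A sketch of the guard implied here (hypothetical loop; the real code scales the overage through a helper before comparing):

    /* Walk the ancestors and only account an overage where usage actually
     * exceeds high; otherwise "usage - high" wraps around the counter max. */
    for (; memcg; memcg = parent_mem_cgroup(memcg)) {
            unsigned long usage = page_counter_read(&memcg->memory);
            unsigned long high = READ_ONCE(memcg->high);

            if (usage <= high)
                    continue;
            max_overage = max(max_overage, usage - high);
    }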
2020-04-07  mm: use fallthrough;  (Joe Perches)
Convert the various /* fallthrough */ comments to the pseudo-keyword fallthrough; Done via script: https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/ Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07  mm, memcg: bypass high reclaim iteration for cgroup hierarchy root  (Chris Down)
The root of the hierarchy cannot have high set, so we will never reclaim based on it. This makes that clearer and avoids another entry. Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200312164137.GA1753625@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02  mm: memcg: make memory.oom.group tolerable to task migration  (Roman Gushchin)
If a task is getting moved out of the OOMing cgroup, it might result in unexpected OOM killings if memory.oom.group is used anywhere in the cgroup tree. Imagine the following example:

              A (oom.group = 1)
             / \
     (OOM) B   C

Let's say B's memory.max is exceeded and it's OOMing. The OOM killer selects a task in B as a victim, but someone asynchronously moves the task into C. mem_cgroup_get_oom_group() will iterate over all ancestors of C up to the root cgroup. In theory it had to stop at the oom_domain level - the memory cgroup which is OOMing. But because B is not an ancestor of C, it's not happening. Instead it chooses A (because its oom.group is set), and kills all tasks in A. This behavior is wrong because the OOM happened in B, so there is no reason to kill anything outside.

Fix this by checking if the memory cgroup to which the task belongs is a descendant of the oom_domain. If not, memory.oom.group should not be taken into account.
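A sketch of the check described (mem_cgroup_is_descendant() is the existing helper; surrounding code simplified):

    /* The victim may have been moved out of oom_domain concurrently; in
     * that case ignore memory.oom.group instead of walking up from an
     * unrelated subtree. */
    memcg = mem_cgroup_from_task(victim);
    if (!mem_cgroup_is_descendant(memcg, oom_domain))
            return NULL;    /* kill only the selected victim */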