Age  Commit message  Author
2020-12-15  mm: memcg: deprecate the non-hierarchical mode  (Roman Gushchin)
Patch series "mm: memcg: deprecate cgroup v1 non-hierarchical mode", v1. The non-hierarchical cgroup v1 mode is a legacy of early days of the memory controller and doesn't bring any value today. However, it complicates the code and creates many edge cases all over the memory controller code. It's a good time to deprecate it completely. This patchset removes the internal logic, adjusts the user interface and updates the documentation. The alt patch removes some bits of the cgroup core code, which become obsolete. Michal Hocko said: "All that we know today is that we have a warning in place to complain loudly when somebody relies on use_hierarchy=0 with a deeper hierarchy. For all those years we have seen _zero_ reports that would describe a sensible usecase. Moreover we (SUSE) have backported this warning into old distribution kernels (since 3.0 based kernels) to extend the coverage and didn't hear even for users who adopt new kernels only very slowly. The only report we have seen so far was a LTP test suite which doesn't really reflect any real life usecase" This patch (of 3): The non-hierarchical cgroup v1 mode is a legacy of early days of the memory controller and doesn't bring any value today. However, it complicates the code and creates many edge cases all over the memory controller code. It's a good time to deprecate it completely. Functionally this patch enabled is by default for all cgroups and forbids switching it off. Nothing changes if cgroup v2 is used: hierarchical mode was enforced from scratch. To protect the ABI memory.use_hierarchy interface is preserved with a limited functionality: reading always returns "1", writing of "1" passes silently, writing of any other value fails with -EINVAL and a warning to dmesg (on the first occasion). Link: https://lkml.kernel.org/r/20201110220800.929549-1-guro@fb.com Link: https://lkml.kernel.org/r/20201110220800.929549-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm: memcg: fix obsolete code comments  (Roman Gushchin)
This patch fixes/removes some obsolete comments in the code related to the kernel memory accounting: - kmem_cache->memcg_params.memcg_caches has been removed by commit 9855609bde03 ("mm: memcg/slab: use a single set of kmem_caches for all accounted allocations") - memcg->kmemcg_id is not used as a gate for kmem accounting since commit 0b8f73e10428 ("mm: memcontrol: clean up alloc, online, offline, free functions") Link: https://lkml.kernel.org/r/20201110184615.311974-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/memcg: update page struct member in comments  (Alex Shi)
The page->mem_cgroup member has been replaced by memcg_data, and a helper page_memcg() has been added for it. Update the comments accordingly to avoid confusion. Link: https://lkml.kernel.org/r/1491c150-1cc0-6062-08ea-9c891548a3bc@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/rmap: always do TTU_IGNORE_ACCESS  (Shakeel Butt)
Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2"), the code to check the secondary MMU's page table access bit is broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the secondary MMU's page table before the check. More specifically, this affects secondary MMUs which unmap the memory in mmu_notifier_invalidate_range_start(), like KVM. However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the absence of TTU_IGNORE_ACCESS, and it explicitly performs the page table access check before trying to unmap the page. So, at worst, reclaim will miss accesses in a very short window if we remove the page table access check from the unmapping code. There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg reclaim: in memcg reclaim, page_referenced() only accounts accesses from processes in the same memcg as the target page, but the unmapping code considers accesses from all processes, decreasing the effectiveness of memcg reclaim. The simplest solution is to always assume TTU_IGNORE_ACCESS in the unmapping code. Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm: memcg/slab: fix use after free in obj_cgroup_charge  (Muchun Song)
The rcu_read_lock/unlock can only guarantee that the memcg will not be freed; it cannot guarantee that css_get() on the memcg succeeds. If the whole process of a cgroup offlining is completed between reading an objcg->memcg pointer and bumping the css reference on another CPU, and there are exactly 0 external references to this memory cgroup (how did we get to obj_cgroup_charge() then?), css_get() can change the ref counter from 0 back to 1. Link: https://lkml.kernel.org/r/20201028035013.99711-2-songmuchun@bytedance.com Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
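A hedged sketch of the fix pattern this changelog implies (simplified; the retry-loop shape and the use of css_tryget() are inferred from the description, and the wrapper name is made up for illustration):

```c
static struct mem_cgroup *get_memcg_from_objcg(struct obj_cgroup *objcg)
{
	struct mem_cgroup *memcg;

	rcu_read_lock();
retry:
	/* may observe a memcg that is concurrently being offlined */
	memcg = obj_cgroup_memcg(objcg);
	/* css_tryget() fails instead of resurrecting a zero refcount */
	if (unlikely(!css_tryget(&memcg->css)))
		goto retry;	/* re-read; objcg is reparented on offline */
	rcu_read_unlock();

	return memcg;
}
```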
2020-12-15  mm: memcg/slab: fix return of child memcg objcg for root memcg  (Muchun Song)
Consider a memcg hierarchy in which the root memcg has two children, A and B. If we fail to get the reference on the objcg of memcg A, get_obj_cgroup_from_current() can return the wrong objcg for the root memcg. Link: https://lkml.kernel.org/r/20201029164429.58703-1-songmuchun@bytedance.com Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Chris Down <chris@chrisdown.name> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Eugene Syromiatnikov <esyr@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Adrian Reber <areber@redhat.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm: memcontrol: eliminate redundant check in __mem_cgroup_insert_exceeded()  (Miaohe Lin)
The mz->usage_in_excess >= mz_node->usage_in_excess check is exactly the else case of mz->usage_in_excess < mz_node->usage_in_excess. So we could replace else if (mz->usage_in_excess >= mz_node->usage_in_excess) with else equally. Also drop the comment which doesn't really explain much. Link: https://lkml.kernel.org/r/20201012131607.10656-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
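Schematically, the simplification in the rbtree insertion loop looks like this (a sketch, not the full __mem_cgroup_insert_exceeded()):

```c
	if (mz->usage_in_excess < mz_node->usage_in_excess) {
		p = &(*p)->rb_left;
	} else {
		/* implicitly covers usage_in_excess >= mz_node->usage_in_excess */
		p = &(*p)->rb_right;
	}
```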
2020-12-15  mm: memcontrol: remove unused mod_memcg_obj_state()  (Muchun Song)
Since commit 991e7673859e ("mm: memcontrol: account kernel stack per node") there is no user of the mod_memcg_obj_state(). So just remove it. Also rework type of the idx parameter of the mod_objcg_state() from int to enum node_stat_item. Link: https://lkml.kernel.org/r/20201013153504.92602-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christopher Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm: memcontrol: add file_thp, shmem_thp to memory.stat  (Johannes Weiner)
As huge page usage in the page cache and for shmem files proliferates in our production environment, the performance monitoring team has asked for per-cgroup stats on those pages. We already track and export anon_thp per cgroup. We already track file THP and shmem THP per node, so making them per-cgroup is only a matter of switching from node to lruvec counters. All callsites are in places where the pages are charged and locked, so page->memcg is stable. [hannes@cmpxchg.org: add documentation] Link: https://lkml.kernel.org/r/20201026174029.GC548555@cmpxchg.org Link: https://lkml.kernel.org/r/20201022151844.489337-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  tmpfs: fix Documentation nits  (Randy Dunlap)
Fix a typo, punctuation, use uppercase for CPUs, and limit tmpfs to keeping only its files in virtual memory (phrasing). Link: https://lkml.kernel.org/r/20201202010934.18566-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/shmem.c: make shmem_mapping() inline  (Hui Su)
shmem_mapping() isn't worth an out-of-line call from any callsite. So make it inline by making shmem_aops global, exporting shmem_aops, inlining shmem_mapping(), and replacing direct references to shmem_aops with shmem_mapping() in shmem.c. Link: https://lkml.kernel.org/r/20201115165207.GA265355@rlk Signed-off-by: Hui Su <sh_def@163.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
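The resulting helper presumably reduces to a single pointer comparison, roughly like this (a sketch assuming shmem_aops is now a global, exported symbol):

```c
extern const struct address_space_operations shmem_aops;

static inline bool shmem_mapping(struct address_space *mapping)
{
	/* a mapping belongs to shmem iff it uses shmem's address_space ops */
	return mapping->a_ops == &shmem_aops;
}
```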
2020-12-15  mm: remove pagevec_lookup_range_nr_tag()  (Jeff Layton)
With the merge of commit 2e1692966034 ("ceph: have ceph_writepages_start call pagevec_lookup_range_tag"), nothing calls this anymore. Link: https://lkml.kernel.org/r/20201021193926.101474-1-jlayton@kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/swapfile.c: use memset to fill the swap_map with SWAP_HAS_CACHE  (Miaohe Lin)
We could use helper memset to fill the swap_map with SWAP_HAS_CACHE instead of a direct loop here to simplify the code. Also we can remove the local variable i and map this way. Link: https://lkml.kernel.org/r/20200921122224.7139-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
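A standalone illustration of the change (SWAP_HAS_CACHE is 0x40 in include/linux/swap.h; the wrapper function name here is made up):

```c
#include <string.h>

#define SWAP_HAS_CACHE	0x40

static void mark_swap_has_cache(unsigned char *swap_map, unsigned long offset,
				unsigned long nr_pages)
{
	/*
	 * was: for (i = 0; i < nr_pages; i++)
	 *              swap_map[offset + i] = SWAP_HAS_CACHE;
	 */
	memset(swap_map + offset, SWAP_HAS_CACHE, nr_pages);
}
```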
2020-12-15  mm/swapfile.c: remove unnecessary out label in __swap_duplicate()  (Miaohe Lin)
When the code jumps to the out label, p must be NULL, so all the out label really does is perform a redundant if check and return err. Remove this unnecessary out label, since it does not handle any resource freeing or other cleanup. Link: https://lkml.kernel.org/r/20201009130337.29698-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/swap_state: skip meaningless swap cache readahead when ra_info.win == 0  (Miaohe Lin)
swap_ra_info() may leave ra_info untouched in the non_swap_entry() case, as the page table lock is not held. In this case we have ra_info.nr_pte == 0 and it is meaningless to continue with swap cache readahead. Skip these operations by initializing ra_info.win to 1. [akpm@linux-foundation.org: clean up struct init] Link: https://lkml.kernel.org/r/20201009133059.58407-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/swapfile.c: use helper function swap_count() in add_swap_count_continuation()  (Miaohe Lin)
Commit 570a335b8e22 ("swap_info: swap count continuations") introduced the func add_swap_count_continuation() but forgot to use the helper function swap_count() introduced by commit 355cfa73ddff ("mm: modify swap_map and add SWAP_HAS_CACHE flag"). Link: https://lkml.kernel.org/r/20201009134306.18033-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
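For reference, the helper being adopted is a one-liner of roughly this shape: it masks off the cache bit so only the reference-count part of a swap_map entry remains, replacing an open-coded "count & ~SWAP_HAS_CACHE".

```c
#define SWAP_HAS_CACHE	0x40	/* as in include/linux/swap.h */

static unsigned char swap_count(unsigned char ent)
{
	return ent & ~SWAP_HAS_CACHE;
}
```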
2020-12-15  mm: handle zone device pages in release_pages()  (Ralph Campbell)
release_pages() is an optimized, inlined version of __put_page(), except that zone device struct pages that are not page_is_devmap_managed() (i.e., memory_type MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA) fall through to the code that could return the zone device page to the page allocator instead of adjusting the pgmap reference count. Clearly these types of pages do not have their reference count decremented to zero via release_pages(), or page allocation problems would be seen. Just to be safe, handle the one-to-zero case in release_pages() like __put_page() does. Link: https://lkml.kernel.org/r/20201021194733.11530-1-rcampbell@nvidia.com Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/gup: combine put_compound_head() and unpin_user_page()  (Jason Gunthorpe)
These functions accomplish the same thing but have different implementations. unpin_user_page() has a bug where it calls mod_node_page_state() after calling put_page(), which creates a risk that the page could have been hot-unplugged from the system. Fix this by using put_compound_head() as the only implementation. __unpin_devmap_managed_user_page() and related can be deleted as well in favour of the simpler, but slower, version in put_compound_head() that has an extra atomic page_ref_sub, but always calls put_page() which internally contains the special devmap code. Move put_compound_head() to be directly after try_grab_compound_head() so people can find it in the future. Link: https://lkml.kernel.org/r/0-v1-6730d4ee0d32+40e6-gup_combine_put_jgg@nvidia.com Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> CC: Joao Martins <joao.m.martins@oracle.com> CC: Jonathan Corbet <corbet@lwn.net> CC: Dan Williams <dan.j.williams@intel.com> CC: Dave Chinner <david@fromorbit.com> CC: Christoph Hellwig <hch@infradead.org> CC: Jane Chu <jane.chu@oracle.com> CC: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> CC: Michal Hocko <mhocko@suse.com> CC: Mike Kravetz <mike.kravetz@oracle.com> CC: Shuah Khan <shuah@kernel.org> CC: Muchun Song <songmuchun@bytedance.com> CC: Vlastimil Babka <vbabka@suse.cz> CC: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/gup: remove the vma allocation from gup_longterm_locked()  (Jason Gunthorpe)
Long ago there wasn't a FOLL_LONGTERM flag so this DAX check was done by post-processing the VMA list. These days it is trivial to just check each VMA to see if it is DAX before processing it inside __get_user_pages() and return failure if a DAX VMA is encountered with FOLL_LONGTERM. Removing the allocation of the VMA list is a significant speed up for many call sites. Add an IS_ENABLED to vma_is_fsdax so that code generation is unchanged when DAX is compiled out. Remove the dummy version of __gup_longterm_locked() as !CONFIG_CMA already makes memalloc_nocma_save(), check_and_migrate_cma_pages(), and memalloc_nocma_restore() into a NOP. Link: https://lkml.kernel.org/r/0-v1-5551df3ed12e+b8-gup_dax_speedup_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/gup: prevent gup_fast from racing with COW during fork  (Jason Gunthorpe)
Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes") pages under a FOLL_PIN will not be write protected during COW for fork. This means that pages returned from pin_user_pages(FOLL_WRITE) should not become write protected while the pin is active. However, there is a small race where get_user_pages_fast(FOLL_PIN) can establish a FOLL_PIN at the same time copy_present_page() is write protecting it:

  CPU 0 (gup_fast)                        CPU 1 (fork)
  get_user_pages_fast()
   internal_get_user_pages_fast()
                                          copy_page_range()
                                            pte_alloc_map_lock()
                                              copy_present_page()
                                                atomic_read(has_pinned) == 0
                                                page_maybe_dma_pinned() == false
   atomic_set(has_pinned, 1);
   gup_pgd_range()
    gup_pte_range()
     pte_t pte = gup_get_pte(ptep)
     pte_access_permitted(pte)
     try_grab_compound_head()
                                                pte = pte_wrprotect(pte)
                                                set_pte_at();
                                            pte_unmap_unlock()
  // GUP now returns with a write protected page

The first attempt to resolve this by using the write protect caused problems (and was missing a barrier), see commit f3c64eda3e50 ("mm: avoid early COW write protect games during fork()"). Instead, wrap copy_p4d_range() with the write side of a seqcount and check the read side around gup_pgd_range(). If there is a collision then get_user_pages_fast() fails and falls back to slow GUP. Slow GUP is safe against this race because copy_page_range() is only called while holding the exclusive side of the mmap_lock on the src mm_struct. [akpm@linux-foundation.org: coding style fixes] Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: "Ahmed S. Darwish" <a.darwish@linutronix.de> [seqcount_t parts] Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
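A schematic of the seqcount usage described above (simplified and paraphrased from the changelog; field names and exact call sites may differ from the final mainline code):

```c
/* fork side: copy_page_range(), called with the source mm's mmap_lock held
 * exclusively, brackets the page table copy with the write side. */
	raw_write_seqcount_begin(&src_mm->write_protect_seq);
	ret = copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd, addr, next);
	raw_write_seqcount_end(&src_mm->write_protect_seq);

/* gup_fast side: sample the read side, walk the page tables locklessly, and
 * bail out to slow GUP if a fork ran concurrently. */
	seq = raw_read_seqcount(&current->mm->write_protect_seq);
	if (seq & 1)				/* a fork is in progress right now */
		return 0;			/* caller falls back to slow GUP */
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	if (read_seqcount_retry(&current->mm->write_protect_seq, seq))
		return 0;			/* collided with fork: redo via slow GUP */
```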
2020-12-15  mm/gup: reorganize internal_get_user_pages_fast()  (Jason Gunthorpe)
Patch series "Add a seqcount between gup_fast and copy_page_range()", v4. As discussed and suggested by Linus use a seqcount to close the small race between gup_fast and copy_page_range(). Ahmed confirms that raw_write_seqcount_begin() is the correct API to use in this case and it doesn't trigger any lockdeps. I was able to test it using two threads, one forking and the other using ibv_reg_mr() to trigger GUP fast. Modifying copy_page_range() to sleep made the window large enough to reliably hit to test the logic. This patch (of 2): The next patch in this series makes the lockless flow a little more complex, so move the entire block into a new function and remove a level of indention. Tidy a bit of cruft: - addr is always the same as start, so use start - Use the modern check_add_overflow() for computing end = start + len - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to avoid shift overflow, make the variables unsigned long to avoid coding casts in both places. nr_pinned was missing its cast - The handling of ret and nr_pinned can be streamlined a bit No functional change. Link: https://lkml.kernel.org/r/0-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com Link: https://lkml.kernel.org/r/1-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/gup_test: GUP_TEST depends on DEBUG_FS  (Barry Song)
Without DEBUG_FS, all the code in gup_benchmark becomes meaningless. The kernel does provide a debugfs stub while DEBUG_FS is disabled, but the point here is that GUP_TEST can do nothing without DEBUG_FS. [song.bao.hua@hisilicon.com: add comment as a prompt to users as commented by John and Randy] Link: https://lkml.kernel.org/r/20201108083732.15336-1-song.bao.hua@hisilicon.com Link: https://lkml.kernel.org/r/20201104100552.20156-1-song.bao.hua@hisilicon.com Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Suggested-by: John Garry <john.garry@huawei.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/gup_test.c: mark gup_test_init as __init function  (Barry Song)
gup_test_init() is only called during initialization, mark it as __init to save some memory. Link: https://lkml.kernel.org/r/20201103081016.16532-1-song.bao.hua@hisilicon.com Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  selftests/vm: 2x speedup for run_vmtests.sh  (John Hubbard)
Each invocation of userfaultfd for "anon" and "shmem" was taking about 6.5 sec to run, contributing to an overall run time of about 22 sec for run_vmtests.sh. Reduce the size and bounce input values to the userfaultfd invocation within run_vmtests.sh, enough to get each invocation down to about 1.0 sec. This should still provide a reasonable smoke test, while staying within a nominal time budget of around 1 second or so per test. And this brings the overall running time of run_vmtests.sh down to 11 seconds. Link: https://lkml.kernel.org/r/20201026064021.3545418-10-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  selftests/vm: hmm-tests: remove the libhugetlbfs dependency  (John Hubbard)
HMM selftests are incredibly useful, but they are only effective if people actually build and run them. All the other tests in selftests/vm can be built with very standard, always-available libraries: libpthread, librt. The hmm-tests.c program, on the other hand, requires something that is (much) less readily available: libhugetlbfs. And so the build will typically fail for many developers. A simple attempt to install libhugetlbfs will also run into complications on some common distros these days: Fedora and Arch Linux (yes, Arch AUR has it, but that's fragile, as always with AUR). The library is not maintained actively enough at the moment, for distros to deal with it. I had to build it from source, for Fedora, and that didn't go too smoothly either. It turns out that, out of 21 tests in hmm-tests.c, only 2 actually require functionality from libhugetlbfs. Therefore, if libhugetlbfs is missing, simply ifdef those two tests out and allow the developer to at least have the other 19 tests, if they don't want to pause to work through the above issues. Also issue a warning, so that it's clear that there is an imperfection in the build. In order to do that, a tiny shell script (check_config.sh) runs a quick compile (not link, that's too prone to false failures with library paths), and basically, if the compiler doesn't find hugetlbfs.h in its standard locations, then the script concludes that libhugetlbfs is not available. The output is in two files, one for inclusion in hmm-test.c (local_config.h), and one for inclusion in the Makefile (local_config.mk). Link: https://lkml.kernel.org/r/20201026064021.3545418-9-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  selftests/vm: run_vmtests.sh: update and clean up gup_test invocation  (John Hubbard)
Run benchmarks on the _fast variants of gup and pup, as originally intended. Run the new gup_test sub-test: dump pages. In addition to exercising the dump_page() call, it also demonstrates the various options you can use to specify which pages to dump, and how. Link: https://lkml.kernel.org/r/20201026064021.3545418-8-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  selftests/vm: gup_test: introduce the dump_pages() sub-test  (John Hubbard)
For quite a while, I was doing a quick hack to gup_test.c (previously, gup_benchmark.c) whenever I wanted to try out my changes to dump_page(). This makes that hack unnecessary, and instead allows anyone to easily get the same coverage from a user space program. That saves a lot of time because you don't have to change the kernel, in order to test different pages and options. The new sub-test takes advantage of the existing gup_test infrastructure, which already provides a simple user space program, some allocated user space pages, an ioctl call, pinning of those pages (via either get_user_pages or pin_user_pages) and a corresponding kernel-side test invocation. There's not much more required, mainly just a couple of inputs from the user. In fact, the new test re-uses the existing command line options in order to get various helpful combinations (THP or normal, _fast or slow gup, gup vs. pup, and more). New command line options are: which pages to dump, and what type of "get/pin" to use. In order to figure out which pages to dump, the logic is: * If the user doesn't specify anything, the page 0 (the first page in the address range that the program sets up for testing) is dumped. * Or, the user can type up to 8 page indices anywhere on the command line. If you type more than 8, then it uses the first 8 and ignores the remaining items. For example: ./gup_test -ct -F 1 0 19 0x1000 Meaning: -c: dump pages sub-test -t: use THP pages -F 1: use pin_user_pages() instead of get_user_pages() 0 19 0x1000: dump pages 0, 19, and 4096 Link: https://lkml.kernel.org/r/20201026064021.3545418-7-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  selftests/vm: only some gup_test items are really benchmarks  (John Hubbard)
Only some of the gup_test items are really benchmarks, so some minor cleanup and improvements are in order: 1. Rename the other items appropriately. 2. Stop reporting timing information on the non-benchmark items. It's still being recorded and is available, but there's no point in cluttering up the report with data that no one reasonably needs to check. 3. Don't do iterations for non-benchmark items. 4. Print out a shorter, more appropriate report for the non-benchmark tests. 5. Add the command that was run to the report. This really helps, as there are quite a lot of options now. 6. Use a larger integer type for cmd, now that it's being compared against larger values. Otherwise it doesn't work, because in this case cmd is about 3 billion, which is the perfect size for problems with signed vs unsigned int. Link: https://lkml.kernel.org/r/20201026064021.3545418-6-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  selftests/vm: minor cleanup: Makefile and gup_test.c  (John Hubbard)
A few cleanups that don't deserve separate patches, but that also should not clutter up other functional changes: 1. Remove an unnecessary #include <prctl.h> 2. Restore the sorted order of TEST_GEN_FILES. 3. Add -lpthread to the common LDLIBS, as it is harmless and several tests use it. This gets rid of one special rule already. Link: https://lkml.kernel.org/r/20201026064021.3545418-5-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  selftests/vm: rename run_vmtests --> run_vmtests.sh  (John Hubbard)
Rename to *.sh, in order to match the conventions of all of the other items in selftests/vm. The only reason not to use a .sh suffix on a shell script like this might be to make it look more like a normal program, but that's not an issue here. Link: https://lkml.kernel.org/r/20201026064021.3545418-4-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  selftests/vm: use a common gup_test.h  (John Hubbard)
Avoid the need to copy-paste the gup_test ioctl commands and the struct gup_test definition, between the kernel and the user space application, by providing a new header file for these. This allows easier and safer adding of new ioctl calls, as well as reducing the overall line count. Details: The header file has to be able to compile independently, because of the arguably unfortunate way that the Makefile is written: the Makefile tries to build all of its prerequisites, when really it should be only building the .c files, and leaving the other prerequisites (LOCAL_HDRS) as pure dependencies. That Makefile limitation is probably not worth fixing, but it explains why one of the includes had to be moved into the new header file. Also: simplify the ioctl struct (struct gup_test), by deleting the unused __expansion[10] field. This sort of thing is what you might see in a stable ABI, but this low-level, kernel-developer-oriented selftests/vm system is very much not subject to ABI stability. So "expansion" and "reserved" fields are unnecessary here. Link: https://lkml.kernel.org/r/20201026064021.3545418-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/gup_benchmark: rename to mm/gup_test  (John Hubbard)
Patch series "selftests/vm: gup_test, hmm-tests, assorted improvements", v3. Summary: This series provides two main things, and a number of smaller supporting goodies. The two main points are: 1) Add a new sub-test to gup_test, which in turn is a renamed version of gup_benchmark. This sub-test allows nicer testing of dump_pages(), at least on user-space pages. For quite a while, I was doing a quick hack to gup_test.c whenever I wanted to try out changes to dump_page(). Then Matthew Wilcox asked me what I meant when I said "I used my dump_page() unit test", and I realized that it might be nice to check in a polished up version of that. Details about how it works and how to use it are in the commit description for patch #6 ("selftests/vm: gup_test: introduce the dump_pages() sub-test"). 2) Fixes a limitation of hmm-tests: these tests are incredibly useful, but only if people actually build and run them. And it turns out that libhugetlbfs is a little too effective at throwing a wrench in the works, there. So I've added a little configuration check that removes just two of the 21 hmm-tests, if libhugetlbfs is not available. Further details in the commit description of patch #8 ("selftests/vm: hmm-tests: remove the libhugetlbfs dependency"). Other smaller things that this series does: a) Remove code duplication by creating gup_test.h. b) Clear up the sub-test organization, and their invocation within run_vmtests.sh. c) Other minor assorted improvements. [1] v2 is here: https://lore.kernel.org/linux-doc/20200929212747.251804-1-jhubbard@nvidia.com/ [2] https://lore.kernel.org/r/CAHk-=wgh-TMPHLY3jueHX7Y2fWh3D+nMBqVS__AZm6-oorquWA@mail.gmail.com This patch (of 9): Rename nearly every "gup_benchmark" reference and file name to "gup_test". The one exception is for the actual gup benchmark test itself. The current code already does a *little* bit more than benchmarking, and definitely covers more than get_user_pages_fast(). More importantly, however, subsequent patches are about to add some functionality that is non-benchmark related. Closely related changes: * Kconfig: in addition to renaming the options from GUP_BENCHMARK to GUP_TEST, update the help text to reflect that it's no longer a benchmark-only test. Link: https://lkml.kernel.org/r/20201026064021.3545418-1-jhubbard@nvidia.com Link: https://lkml.kernel.org/r/20201026064021.3545418-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/filemap.c: remove else after a return  (Hailong Liu)
The `else' is not useful after a `return' in __lock_page_or_retry(). [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20201202154720.115162-1-carver4lio@163.com Signed-off-by: Hailong Liu<liu.hailong6@zte.com.cn> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
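The shape of the cleanup, schematically (generic names, not the literal __lock_page_or_retry() code):

```c
/* before */
	if (cond)
		return 0;
	else
		return slow_path();	/* the else adds nothing after a return */

/* after */
	if (cond)
		return 0;

	return slow_path();
```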
2020-12-15  mm/truncate: add parameter explanation for invalidate_mapping_pagevec  (Alex Shi)
To fix a kernel-doc markups issue: mm/truncate.c:646: warning: Function parameter or member 'mapping' not described in 'invalidate_mapping_pagevec' mm/truncate.c:646: warning: Function parameter or member 'start' not described in 'invalidate_mapping_pagevec' mm/truncate.c:646: warning: Function parameter or member 'end' not described in 'invalidate_mapping_pagevec' mm/truncate.c:646: warning: Function parameter or member 'nr_pagevec' not described in 'invalidate_mapping_pagevec' Link: https://lkml.kernel.org/r/1605605088-30668-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
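The warnings go away once the kernel-doc block names every parameter; the added comment presumably has roughly this shape (descriptions paraphrased here, not copied from the patch):

```c
/**
 * invalidate_mapping_pagevec - invalidate clean, unlocked pagecache pages
 * @mapping:     the address_space whose pages are to be invalidated
 * @start:       offset of the first page in the range
 * @end:         offset of the last page in the range
 * @nr_pagevec:  out parameter counting pages that could not be invalidated
 */
```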
2020-12-15  mm/filemap.c: generic_file_buffered_read() now uses find_get_pages_contig  (Kent Overstreet)
Convert generic_file_buffered_read() to get pages to read from in batches, and then copy data to userspace from many pages at once - in particular, we now don't touch any cachelines that might be contended while we're in the loop to copy data to userspace. This is a performance improvement on workloads that do buffered reads with large blocksizes, and a very large performance improvement if that file is also being accessed concurrently by different threads. On smaller reads (512 bytes), there's a very small performance improvement (1%, within the margin of error). akpm: kernel test robot found a 32% speedup on one test: https://lkml.kernel.org/r/20201030081456.GY31092@shao2-debian Link: https://lkml.kernel.org/r/20201025212949.602194-3-kent.overstreet@gmail.com Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/filemap.c: break generic_file_buffered_read up into multiple functions  (Kent Overstreet)
Patch series "generic_file_buffered_read() improvements", v2. generic_file_buffered_read() has turned into a real monstrosity to work with. And it's a major performance improvement, for both small random and large sequential reads. On my test box, 4k buffered random reads go from ~150k to ~250k iops, and the improvements to big sequential reads are even bigger. This incorporates the fix for IOCB_WAITQ handling that Jens just posted as well, also factors out lock_page_for_iocb() to improve handling of the various iocb flags. This patch (of 2): This is prep work for changing generic_file_buffered_read() to use find_get_pages_contig() to batch up all the pagecache lookups. This patch should be functionally identical to the existing code and changes as little as of the flow control as possible. More refactoring could be done, this patch is intended to be relatively minimal. Link: https://lkml.kernel.org/r/20201025212949.602194-1-kent.overstreet@gmail.com Link: https://lkml.kernel.org/r/20201025212949.602194-2-kent.overstreet@gmail.com Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm/page_owner: record timestamp and pid  (Liam Mark)
Collect the time for each allocation recorded in page owner so that allocation "surges" can be measured. Record the pid for each allocation recorded in page owner so that the source of allocation "surges" can be better identified. The above is very useful when doing memory analysis. On a crash for example, we can get this information from kdump (or ramdump) and parse it to figure out memory allocation problems. Please note that on x86_64 this increases the size of struct page_owner from 16 bytes to 32. Vlastimil: it's not a functionality intended for production, so unless somebody says they need to enable page_owner for debugging and this increase prevents them from fitting into available memory, let's not complicate things with making this optional. [lmark@codeaurora.org: v3] Link: https://lkml.kernel.org/r/20201210160357.27779-1-georgi.djakov@linaro.org Link: https://lkml.kernel.org/r/20201209125153.10533-1-georgi.djakov@linaro.org Signed-off-by: Liam Mark <lmark@codeaurora.org> Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
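A sketch of the extended record (the pre-existing members are abridged, and the exact new field names are assumed from the description rather than copied from the patch):

```c
struct page_owner {
	unsigned short order;
	short last_migrate_reason;
	gfp_t gfp_mask;
	depot_stack_handle_t handle;
	/* new: when and by whom the page was allocated */
	u64 ts_nsec;		/* e.g. ktime_get_ns() at allocation time */
	pid_t pid;		/* current->pid of the allocating task */
};
```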
2020-12-15  mm: fix page_owner initializing issue for arm32  (Zhenhua Huang)
Page owner information for the pages used by page owner itself is missing on arm32 targets. The reason is that dummy_handle and failure_handle are not initialized correctly: the buddy allocator is used to initialize these two handles, but the buddy allocator is not ready when page owner calls it. This change fixes that by initializing page owner after buddy initialization. The working flow before this change was: 1. allocate memory for page_ext (using memblock); 2. invoke the init callbacks of page_ext_ops, like page_owner (using the buddy allocator); 3. initialize buddy. After this change it is: 1. allocate memory for page_ext (using memblock); 2. initialize buddy; 3. invoke the init callbacks of page_ext_ops, like page_owner (using the buddy allocator). With the change, failure_handle and dummy_handle get their correct values and the page owner output, for example, has the entry for page owner itself:

  Page allocated via order 2, mask 0x6202c0(GFP_USER|__GFP_NOWARN), pid 1006, ts 67278156558 ns
  PFN 543776 type Unmovable Block 531 type Unmovable Flags 0x0()
   init_page_owner+0x28/0x2f8
   invoke_init_callbacks_flatmem+0x24/0x34
   start_kernel+0x33c/0x5d8

Link: https://lkml.kernel.org/r/1603104925-5888-1-git-send-email-zhenhuah@codeaurora.org Signed-off-by: Zhenhua Huang <zhenhuah@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  device-dax/kmem: use struct_size()  (Dan Williams)
Linus notes the kernel has had a nice helper for the 'size of struct with variable array member at the end' operation for a couple years now, use it. Link: http://lore.kernel.org/r/CAHk-=wgNTLbvAD8mNTvh+GQyapNWeX20PXhU_+frqEvVq4298w@mail.gmail.com Link: https://lkml.kernel.org/r/160288261564.3242821.6055291930923876456.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
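A generic illustration of the helper being adopted (not the device-dax code itself): struct_size() from <linux/overflow.h> computes "header plus N trailing elements" with overflow checking, replacing an open-coded sizeof expression.

```c
struct res_list {
	int nr;
	struct resource res[];			/* flexible array member */
};

	struct res_list *p;

	/* before: p = kzalloc(sizeof(*p) + sizeof(struct resource) * nr, GFP_KERNEL); */
	p = kzalloc(struct_size(p, res, nr), GFP_KERNEL);
```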
2020-12-15  mm/slub: let number of online CPUs determine the slub page order  (Bharata B Rao)
The page order of the slab that gets chosen for a given slab cache depends on the number of objects that can be fit in the slab while meeting other requirements. We start with a value of minimum objects based on nr_cpu_ids that is driven by the possible number of CPUs and hence could be higher than the actual number of CPUs present in the system. This leads to calculate_order() choosing a page order that is on the higher side, leading to increased slab memory consumption on systems that have bigger page sizes. Hence rely on the number of online CPUs when determining the minimum objects, thereby increasing the chances of choosing a lower, conservative page order for the slab. Vlastimil said: "Ideally, we would react to hotplug events and update existing caches accordingly. But for that, recalculation of order for existing caches would have to be made safe, while not affecting hot paths. We have removed the sysfs interface with 32a6f409b693 ("mm, slub: remove runtime allocation order changes") as it didn't seem easy and worth the trouble. In case somebody wants to start with a large order right from the boot because they know they will hotplug lots of cpus later, they can use slub_min_objects= boot param to override this heuristic. So in case this change regresses somebody's performance, there's a way around it and thus the risk is low IMHO" Link: https://lkml.kernel.org/r/20201118082759.1413056-1-bharata@linux.ibm.com Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15  mm, slub: use kmem_cache_debug_flags() in deactivate_slab()  (Vlastimil Babka)
Commit 9cf7a1118365 ("mm/slub: make add_full() condition more explicit") replaced an unnecessarily generic kmem_cache_debug(s) check with an explicit check of SLAB_STORE_USER and #ifdef CONFIG_SLUB_DEBUG. We can achieve the same specific check with the recently added kmem_cache_debug_flags() which removes the #ifdef and restores the no-branch-overhead benefit of static key check when slub debugging is not enabled. Link: https://lkml.kernel.org/r/3ef24214-38c7-1238-8296-88caf7f48ab6@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Abel Wu <wuyun.wu@huawei.com> Cc: Christopher Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Liu Xiang <liu.xiang6@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
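Schematically, the substitution described above looks like this (abridged; the add_full() call stands in for the surrounding deactivate_slab() logic, which is omitted):

```c
/* before: explicit flag test guarded by an #ifdef */
#ifdef CONFIG_SLUB_DEBUG
	if (s->flags & SLAB_STORE_USER)
		add_full(s, n, page);
#endif

/* after: same specific check, no #ifdef; the static key inside
 * kmem_cache_debug_flags() keeps the fast path branch-free when
 * slub debugging is not enabled */
	if (kmem_cache_debug_flags(s, SLAB_STORE_USER))
		add_full(s, n, page);
```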
2020-12-15  mm/slab: perform init_on_free earlier  (Alexander Popov)
Currently in CONFIG_SLAB, init_on_free happens too late, and heap objects go to the heap quarantine without being erased. Let's move init