From 68f23b89067fdf187763e75a56087550624fdbee Mon Sep 17 00:00:00 2001 From: Theodore Ts'o Date: Thu, 30 Jan 2020 22:11:04 -0800 Subject: memcg: fix a crash in wb_workfn when a device disappears Without memcg, there is a one-to-one mapping between the bdi and bdi_writeback structures. In this world, things are fairly straightforward; the first thing bdi_unregister() does is to shutdown the bdi_writeback structure (or wb), and part of that writeback ensures that no other work queued against the wb, and that the wb is fully drained. With memcg, however, there is a one-to-many relationship between the bdi and bdi_writeback structures; that is, there are multiple wb objects which can all point to a single bdi. There is a refcount which prevents the bdi object from being released (and hence, unregistered). So in theory, the bdi_unregister() *should* only get called once its refcount goes to zero (bdi_put will drop the refcount, and when it is zero, release_bdi gets called, which calls bdi_unregister). Unfortunately, del_gendisk() in block/gen_hd.c never got the memo about the Brave New memcg World, and calls bdi_unregister directly. It does this without informing the file system, or the memcg code, or anything else. This causes the root wb associated with the bdi to be unregistered, but none of the memcg-specific wb's are shutdown. So when one of these wb's are woken up to do delayed work, they try to dereference their wb->bdi->dev to fetch the device name, but unfortunately bdi->dev is now NULL, thanks to the bdi_unregister() called by del_gendisk(). As a result, *boom*. Fortunately, it looks like the rest of the writeback path is perfectly happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to create a bdi_dev_name() function which can handle bdi->dev being NULL. This also allows us to bulletproof the writeback tracepoints to prevent them from dereferencing a NULL pointer and crashing the kernel if one is tracing with memcg's enabled, and an iSCSI device dies or a USB storage stick is pulled. The most common way of triggering this will be hotremoval of a device while writeback with memcg enabled is going on. It was triggering several times a day in a heavily loaded production environment. Google Bug Id: 145475544 Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu Signed-off-by: Theodore Ts'o Cc: Chris Mason Cc: Tejun Heo Cc: Jens Axboe Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/backing-dev.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index c360f6a6c844..62f05f605fb5 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -21,6 +21,7 @@ struct backing_dev_info noop_backing_dev_info = { EXPORT_SYMBOL_GPL(noop_backing_dev_info); static struct class *bdi_class; +const char *bdi_unknown_name = "(unknown)"; /* * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU -- cgit v1.2.3 From c7a91bc7c2e17e0a9c8b9745a2cb118891218fd1 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Thu, 30 Jan 2020 22:11:07 -0800 Subject: mm/mempolicy.c: fix out of bounds write in mpol_parse_str() What we are trying to do is change the '=' character to a NUL terminator and then at the end of the function we restore it back to an '='. The problem is there are two error paths where we jump to the end of the function before we have replaced the '=' with NUL. 
We end up putting the '=' in the wrong place (possibly one element before the start of the buffer). Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display") Signed-off-by: Dan Carpenter Acked-by: Vlastimil Babka Dmitry Vyukov Cc: Michal Hocko Cc: Dan Carpenter Cc: Lee Schermerhorn Cc: Andrea Arcangeli Cc: Hugh Dickins Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/mempolicy.c b/mm/mempolicy.c index b2920ae87a61..977c641f78cf 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2821,6 +2821,9 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) char *flags = strchr(str, '='); int err = 1, mode; + if (flags) + *flags++ = '\0'; /* terminate mode string */ + if (nodelist) { /* NUL-terminate mode or flags string */ *nodelist++ = '\0'; @@ -2831,9 +2834,6 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) } else nodes_clear(nodes); - if (flags) - *flags++ = '\0'; /* terminate mode string */ - mode = match_string(policy_modes, MPOL_MAX, str); if (mode < 0) goto out; -- cgit v1.2.3 From 1f503443e7df8dc8366608b4d810ce2d6669827c Mon Sep 17 00:00:00 2001 From: Pingfan Liu Date: Thu, 30 Jan 2020 22:11:10 -0800 Subject: mm/sparse.c: reset section's mem_map when fully deactivated After commit ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug"), when a mem section is fully deactivated, section_mem_map still records the section's start pfn, which is not used any more and will be reassigned during re-addition. In analogy with alloc/free pattern, it is better to clear all fields of section_mem_map. Beside this, it breaks the user space tool "makedumpfile" [1], which makes assumption that a hot-removed section has mem_map as NULL, instead of checking directly against SECTION_MARKED_PRESENT bit. (makedumpfile will be better to change the assumption, and need a patch) The bug can be reproduced on IBM POWERVM by "drmgr -c mem -r -q 5" , trigger a crash, and save vmcore by makedumpfile [1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version") Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com Signed-off-by: Pingfan Liu Acked-by: Michal Hocko Acked-by: David Hildenbrand Cc: Dan Williams Cc: Oscar Salvador Cc: Baoquan He Cc: Qian Cai Cc: Kazuhito Hagio Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/sparse.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/sparse.c b/mm/sparse.c index 3822ecbd8a1f..3918fc3eaef1 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -789,7 +789,7 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, ms->usage = NULL; } memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); - ms->section_mem_map = sparse_encode_mem_map(NULL, section_nr); + ms->section_mem_map = (unsigned long)NULL; } if (section_is_early && memmap) -- cgit v1.2.3 From dfe9aa23cab7880a794db9eb2d176c00ed064eb6 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Thu, 30 Jan 2020 22:11:14 -0800 Subject: mm/migrate.c: also overwrite error when it is bigger than zero If we get here after successfully adding page to list, err would be 1 to indicate the page is queued in the list. 
Current code has two problems: * on success, 0 is not returned * on error, if add_page_for_migratioin() return 1, and the following err1 from do_move_pages_to_node() is set, the err1 is not returned since err is 1 And these behaviors break the user interface. Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com Fixes: e0153fc2c760 ("mm: move_pages: return valid node id in status if the page is already on the target node"). Signed-off-by: Wei Yang Acked-by: Yang Shi Cc: John Hubbard Cc: Vlastimil Babka Cc: Christoph Lameter Cc: Michal Hocko Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/migrate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/migrate.c b/mm/migrate.c index 86873b6f38a7..430fdccc733e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1676,7 +1676,7 @@ out_flush: err1 = do_move_pages_to_node(mm, &pagelist, current_node); if (!err1) err1 = store_status(status, start, current_node, i - start); - if (!err) + if (err >= 0) err = err1; out: return err; -- cgit v1.2.3 From f1037ec0cc8ac1a450974ad9754e991f72884f48 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Thu, 30 Jan 2020 22:11:17 -0800 Subject: mm/memory_hotplug: fix remove_memory() lockdep splat The daxctl unit test for the dax_kmem driver currently triggers the (false positive) lockdep splat below. It results from the fact that remove_memory_block_devices() is invoked under the mem_hotplug_lock() causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs active state tracking). It is a false positive because the sysfs attribute path triggering the memory remove is not the same attribute path associated with memory-block device. sysfs_break_active_protection() is not applicable since there is no real deadlock conflict, instead move memory-block device removal outside the lock. The mem_hotplug_lock() is not needed to synchronize the memory-block device removal vs the page online state, that is already handled by lock_device_hotplug(). Specifically, lock_device_hotplug() is sufficient to allow try_remove_memory() to check the offline state of the memblocks and be assured that any in progress online attempts are flushed / blocked by kernfs_drain() / attribute removal. The add_memory() path safely creates memblock devices under the mem_hotplug_lock(). There is no kernfs active state synchronization in the memblock device_register() path, so nothing to fix there. This change is only possible thanks to the recent change that refactored memory block device removal out of arch_remove_memory() (commit 4c4b7f9ba948 "mm/memory_hotplug: remove memory block devices before arch_remove_memory()"), and David's due diligence tracking down the guarantees afforded by kernfs_drain(). Not flagged for -stable since this only impacts ongoing development and lockdep validation, not a runtime issue. ====================================================== WARNING: possible circular locking dependency detected 5.5.0-rc3+ #230 Tainted: G OE ------------------------------------------------------ lt-daxctl/6459 is trying to acquire lock: ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80 but task is already holding lock: ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0 which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #2 (mem_hotplug_lock.rw_sem){++++}: __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 get_online_mems+0x3e/0xb0 kmem_cache_create_usercopy+0x2e/0x260 kmem_cache_create+0x12/0x20 ptlock_cache_init+0x20/0x28 start_kernel+0x243/0x547 secondary_startup_64+0xb6/0xc0 -> #1 (cpu_hotplug_lock.rw_sem){++++}: __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 cpus_read_lock+0x3e/0xb0 online_pages+0x37/0x300 memory_subsys_online+0x17d/0x1c0 device_online+0x60/0x80 state_store+0x65/0xd0 kernfs_fop_write+0xcf/0x1c0 vfs_write+0xdb/0x1d0 ksys_write+0x65/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe -> #0 (kn->count#241){++++}: check_prev_add+0x98/0xa40 validate_chain+0x576/0x860 __lock_acquire+0x39c/0x790 lock_acquire+0xa2/0x1b0 __kernfs_remove+0x25f/0x2e0 kernfs_remove_by_name_ns+0x41/0x80 remove_files.isra.0+0x30/0x70 sysfs_remove_group+0x3d/0x80 sysfs_remove_groups+0x29/0x40 device_remove_attrs+0x39/0x70 device_del+0x16a/0x3f0 device_unregister+0x16/0x60 remove_memory_block_devices+0x82/0xb0 try_remove_memory+0xb5/0x130 remove_memory+0x26/0x40 dev_dax_kmem_remove+0x44/0x6a [kmem] device_release_driver_internal+0xe4/0x1c0 unbind_store+0xef/0x120 kernfs_fop_write+0xcf/0x1c0 vfs_write+0xdb/0x1d0 ksys_write+0x65/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe other info that might help us debug this: Chain exists of: kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(mem_hotplug_lock.rw_sem); lock(cpu_hotplug_lock.rw_sem); lock(mem_hotplug_lock.rw_sem); lock(kn->count#241); *** DEADLOCK *** No fixes tag as this has been a long standing issue that predated the addition of kernfs lockdep annotations. Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams Acked-by: Michal Hocko Reviewed-by: David Hildenbrand Cc: Vishal Verma Cc: Pavel Tatashin Cc: Dave Hansen Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index a91a072f2b2c..0ddff29079c3 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1764,8 +1764,6 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) BUG_ON(check_hotplug_memory_range(start, size)); - mem_hotplug_begin(); - /* * All memory blocks must be offlined before removing memory. Check * whether all memory blocks in question are offline and return error @@ -1778,9 +1776,14 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) /* remove memmap entry */ firmware_map_remove(start, start + size, "System RAM"); - /* remove memory block devices before removing memory */ + /* + * Memory block device removal under the device_hotplug_lock is + * a barrier against racing online attempts. + */ remove_memory_block_devices(start, size); + mem_hotplug_begin(); + arch_remove_memory(nid, start, size, NULL); memblock_free(start, size); memblock_remove(start, size); -- cgit v1.2.3 From fac0516b5534897bf4c4a88daa06a8cfa5611b23 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Thu, 30 Jan 2020 22:11:20 -0800 Subject: mm: thp: don't need care deferred split queue in memcg charge move path If compound is true, this means it is a PMD mapped THP. Which implies the page is not linked to any defer list. So the first code chunk will not be executed. 
Also with this reason, it would not be proper to add this page to a defer list. So the second code chunk is not correct. Based on this, we should remove the defer list related code. [yang.shi@linux.alibaba.com: better patch title] Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Signed-off-by: Wei Yang Suggested-by: Kirill A. Shutemov Acked-by: Yang Shi Cc: David Rientjes Cc: Michal Hocko Cc: Kirill A. Shutemov Cc: Johannes Weiner Cc: Vladimir Davydov Cc: [5.4+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 18 ------------------ 1 file changed, 18 deletions(-) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6c83cf4ed970..27c231bf4565 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5340,14 +5340,6 @@ static int mem_cgroup_move_account(struct page *page, __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages); } -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (compound && !list_empty(page_deferred_list(page))) { - spin_lock(&from->deferred_split_queue.split_queue_lock); - list_del_init(page_deferred_list(page)); - from->deferred_split_queue.split_queue_len--; - spin_unlock(&from->deferred_split_queue.split_queue_lock); - } -#endif /* * It is safe to change page->mem_cgroup here because the page * is referenced, charged, and isolated - we can't race with @@ -5357,16 +5349,6 @@ static int mem_cgroup_move_account(struct page *page, /* caller should have done css_get */ page->mem_cgroup = to; -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (compound && list_empty(page_deferred_list(page))) { - spin_lock(&to->deferred_split_queue.split_queue_lock); - list_add_tail(page_deferred_list(page), - &to->deferred_split_queue.split_queue); - to->deferred_split_queue.split_queue_len++; - spin_unlock(&to->deferred_split_queue.split_queue_lock); - } -#endif - spin_unlock_irqrestore(&from->move_lock, flags); ret = 0; -- cgit v1.2.3 From 5984fabb6e82d9ab4e6305cb99694c85d46de8ae Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Thu, 30 Jan 2020 22:11:24 -0800 Subject: mm: move_pages: report the number of non-attempted pages Since commit a49bd4d71637 ("mm, numa: rework do_pages_move"), the semantic of move_pages() has changed to return the number of non-migrated pages if they were result of a non-fatal reasons (usually a busy page). This was an unintentional change that hasn't been noticed except for LTP tests which checked for the documented behavior. There are two ways to go around this change. We can even get back to the original behavior and return -EAGAIN whenever migrate_pages is not able to migrate pages due to non-fatal reasons. Another option would be to simply continue with the changed semantic and extend move_pages documentation to clarify that -errno is returned on an invalid input or when migration simply cannot succeed (e.g. -ENOMEM, -EBUSY) or the number of pages that couldn't have been migrated due to ephemeral reasons (e.g. page is pinned or locked for other reasons). This patch implements the second option because this behavior is in place for some time without anybody complaining and possibly new users depending on it. Also it allows to have a slightly easier error handling as the caller knows that it is worth to retry when err > 0. But since the new semantic would be aborted immediately if migration is failed due to ephemeral reasons, need include the number of non-attempted pages in the return value too. 
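
For a userspace caller, the return-value convention described above (a negative return is a fatal -errno; a positive return is the number of pages that were not migrated or not attempted, and is worth retrying) can be handled as in the sketch below. This is a hedged illustration, not part of the patch: it assumes the libnuma move_pages() wrapper from <numaif.h>, the helper name and retry policy are invented for the example, and a real caller would normally inspect the per-page status[] entries rather than blindly re-issuing the whole request.

#include <numaif.h>		/* move_pages(), MPOL_MF_MOVE (libnuma) */
#include <stdio.h>

/* Illustrative only: retry a move_pages() request while it keeps
 * reporting a positive count of non-migrated/non-attempted pages. */
static long move_pages_with_retry(int pid, unsigned long count, void **pages,
				  const int *nodes, int *status)
{
	int tries = 3;
	long ret;

	do {
		ret = move_pages(pid, count, pages, nodes, status,
				 MPOL_MF_MOVE);
	} while (ret > 0 && --tries);	/* > 0: busy/non-attempted pages remain */

	if (ret < 0)
		perror("move_pages");	/* fatal: invalid input, ENOMEM, ... */
	return ret;
}
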
Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") Signed-off-by: Yang Shi Suggested-by: Michal Hocko Acked-by: Michal Hocko Reviewed-by: Wei Yang Cc: [4.17+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/migrate.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/migrate.c b/mm/migrate.c index 430fdccc733e..b3b5d3bf0aab 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1627,8 +1627,19 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, start = i; } else if (node != current_node) { err = do_move_pages_to_node(mm, &pagelist, current_node); - if (err) + if (err) { + /* + * Positive err means the number of failed + * pages to migrate. Since we are going to + * abort and return the number of non-migrated + * pages, so need to incude the rest of the + * nr_pages that have not been attempted as + * well. + */ + if (err > 0) + err += nr_pages - i - 1; goto out; + } err = store_status(status, start, current_node, i - start); if (err) goto out; @@ -1659,8 +1670,11 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, goto out_flush; err = do_move_pages_to_node(mm, &pagelist, current_node); - if (err) + if (err) { + if (err > 0) + err += nr_pages - i - 1; goto out; + } if (i > start) { err = store_status(status, start, current_node, i - start); if (err) @@ -1674,6 +1688,13 @@ out_flush: /* Make sure we do not overwrite the existing error */ err1 = do_move_pages_to_node(mm, &pagelist, current_node); + /* + * Don't have to report non-attempted pages here since: + * - If the above loop is done gracefully all pages have been + * attempted. + * - If the above loop is aborted it means a fatal error + * happened, should return ret. + */ if (!err1) err1 = store_status(status, start, current_node, i - start); if (err >= 0) -- cgit v1.2.3 From 90e9f6a66c78fdf3c2e5884ffe97bfc2736863c2 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Thu, 30 Jan 2020 22:11:57 -0800 Subject: mm/slub.c: avoid slub allocation while holding list_lock If we are already under list_lock, don't call kmalloc(). Otherwise we will run into a deadlock because kmalloc() also tries to grab the same lock. Fix the problem by using a static bitmap instead. WARNING: possible recursive locking detected -------------------------------------------- mount-encrypted/4921 is trying to acquire lock: (&(&n->list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437 but task is already holding lock: (&(&n->list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&n->list_lock)->rlock); lock(&(&n->list_lock)->rlock); *** DEADLOCK *** Link: http://lkml.kernel.org/r/20191108193958.205102-2-yuzhao@google.com Signed-off-by: Yu Zhao Acked-by: Kirill A. 
Shutemov Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Tetsuo Handa Cc: Yu Zhao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 88 ++++++++++++++++++++++++++++++++++----------------------------- 1 file changed, 47 insertions(+), 41 deletions(-) (limited to 'mm') diff --git a/mm/slub.c b/mm/slub.c index 0ab92ec8c2a6..17dc00e33115 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -439,19 +439,38 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page, } #ifdef CONFIG_SLUB_DEBUG +static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)]; +static DEFINE_SPINLOCK(object_map_lock); + /* * Determine a map of object in use on a page. * * Node listlock must be held to guarantee that the page does * not vanish from under us. */ -static void get_map(struct kmem_cache *s, struct page *page, unsigned long *map) +static unsigned long *get_map(struct kmem_cache *s, struct page *page) { void *p; void *addr = page_address(page); + VM_BUG_ON(!irqs_disabled()); + + spin_lock(&object_map_lock); + + bitmap_zero(object_map, page->objects); + for (p = page->freelist; p; p = get_freepointer(s, p)) - set_bit(slab_index(p, s, addr), map); + set_bit(slab_index(p, s, addr), object_map); + + return object_map; +} + +static void put_map(unsigned long *map) +{ + VM_BUG_ON(map != object_map); + lockdep_assert_held(&object_map_lock); + + spin_unlock(&object_map_lock); } static inline unsigned int size_from_object(struct kmem_cache *s) @@ -3675,13 +3694,12 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page, #ifdef CONFIG_SLUB_DEBUG void *addr = page_address(page); void *p; - unsigned long *map = bitmap_zalloc(page->objects, GFP_ATOMIC); - if (!map) - return; + unsigned long *map; + slab_err(s, page, text, s->name); slab_lock(page); - get_map(s, page, map); + map = get_map(s, page); for_each_object(p, s, addr, page->objects) { if (!test_bit(slab_index(p, s, addr), map)) { @@ -3689,8 +3707,9 @@ static void list_slab_objects(struct kmem_cache *s, struct page *page, print_tracking(s, p); } } + put_map(map); + slab_unlock(page); - bitmap_free(map); #endif } @@ -4384,19 +4403,19 @@ static int count_total(struct page *page) #endif #ifdef CONFIG_SLUB_DEBUG -static void validate_slab(struct kmem_cache *s, struct page *page, - unsigned long *map) +static void validate_slab(struct kmem_cache *s, struct page *page) { void *p; void *addr = page_address(page); + unsigned long *map; + + slab_lock(page); if (!check_slab(s, page) || !on_freelist(s, page, NULL)) - return; + goto unlock; /* Now we know that a valid freelist exists */ - bitmap_zero(map, page->objects); - - get_map(s, page, map); + map = get_map(s, page); for_each_object(p, s, addr, page->objects) { u8 val = test_bit(slab_index(p, s, addr), map) ? 
SLUB_RED_INACTIVE : SLUB_RED_ACTIVE; @@ -4404,18 +4423,13 @@ static void validate_slab(struct kmem_cache *s, struct page *page, if (!check_object(s, page, p, val)) break; } -} - -static void validate_slab_slab(struct kmem_cache *s, struct page *page, - unsigned long *map) -{ - slab_lock(page); - validate_slab(s, page, map); + put_map(map); +unlock: slab_unlock(page); } static int validate_slab_node(struct kmem_cache *s, - struct kmem_cache_node *n, unsigned long *map) + struct kmem_cache_node *n) { unsigned long count = 0; struct page *page; @@ -4424,7 +4438,7 @@ static int validate_slab_node(struct kmem_cache *s, spin_lock_irqsave(&n->list_lock, flags); list_for_each_entry(page, &n->partial, slab_list) { - validate_slab_slab(s, page, map); + validate_slab(s, page); count++; } if (count != n->nr_partial) @@ -4435,7 +4449,7 @@ static int validate_slab_node(struct kmem_cache *s, goto out; list_for_each_entry(page, &n->full, slab_list) { - validate_slab_slab(s, page, map); + validate_slab(s, page); count++; } if (count != atomic_long_read(&n->nr_slabs)) @@ -4452,15 +4466,11 @@ static long validate_slab_cache(struct kmem_cache *s) int node; unsigned long count = 0; struct kmem_cache_node *n; - unsigned long *map = bitmap_alloc(oo_objects(s->max), GFP_KERNEL); - - if (!map) - return -ENOMEM; flush_all(s); for_each_kmem_cache_node(s, node, n) - count += validate_slab_node(s, n, map); - bitmap_free(map); + count += validate_slab_node(s, n); + return count; } /* @@ -4590,18 +4600,17 @@ static int add_location(struct loc_track *t, struct kmem_cache *s, } static void process_slab(struct loc_track *t, struct kmem_cache *s, - struct page *page, enum track_item alloc, - unsigned long *map) + struct page *page, enum track_item alloc) { void *addr = page_address(page); void *p; + unsigned long *map; - bitmap_zero(map, page->objects); - get_map(s, page, map); - + map = get_map(s, page); for_each_object(p, s, addr, page->objects) if (!test_bit(slab_index(p, s, addr), map)) add_location(t, s, get_track(s, p, alloc)); + put_map(map); } static int list_locations(struct kmem_cache *s, char *buf, @@ -4612,11 +4621,9 @@ static int list_locations(struct kmem_cache *s, char *buf, struct loc_track t = { 0, 0, NULL }; int node; struct kmem_cache_node *n; - unsigned long *map = bitmap_alloc(oo_objects(s->max), GFP_KERNEL); - if (!map || !alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location), - GFP_KERNEL)) { - bitmap_free(map); + if (!alloc_loc_track(&t, PAGE_SIZE / sizeof(struct location), + GFP_KERNEL)) { return sprintf(buf, "Out of memory\n"); } /* Push back cpu slabs */ @@ -4631,9 +4638,9 @@ static int list_locations(struct kmem_cache *s, char *buf, spin_lock_irqsave(&n->list_lock, flags); list_for_each_entry(page, &n->partial, slab_list) - process_slab(&t, s, page, alloc, map); + process_slab(&t, s, page, alloc); list_for_each_entry(page, &n->full, slab_list) - process_slab(&t, s, page, alloc, map); + process_slab(&t, s, page, alloc); spin_unlock_irqrestore(&n->list_lock, flags); } @@ -4682,7 +4689,6 @@ static int list_locations(struct kmem_cache *s, char *buf, } free_loc_track(&t); - bitmap_free(map); if (!t.count) len += sprintf(buf, "No data\n"); return len; -- cgit v1.2.3 From 8c96f1bc6fc49c724c4cdd22d3e99260263b7384 Mon Sep 17 00:00:00 2001 From: He Zhe Date: Thu, 30 Jan 2020 22:12:00 -0800 Subject: mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t kmemleak_lock as a rwlock on RT can possibly be acquired in atomic context which does work. 
Since the kmemleak operation is performed in atomic context make it a raw_spinlock_t so it can also be acquired on RT. This is used for debugging and is not enabled by default in a production like environment (where performance/latency matters) so it makes sense to make it a raw_spinlock_t instead trying to get rid of the atomic context. Turn also the kmemleak_object->lock into raw_spinlock_t which is acquired (nested) while the kmemleak_lock is held. The time spent in "echo scan > kmemleak" slightly improved on 64core box with this patch applied after boot. [bigeasy@linutronix.de: redo the description, update comments. Merge the individual bits: He Zhe did the kmemleak_lock, Liu Haitao the ->lock and Yongxin Liu forwarded Liu's patch.] Link: http://lkml.kernel.org/r/20191219170834.4tah3prf2gdothz4@linutronix.de Link: https://lkml.kernel.org/r/20181218150744.GB20197@arrakis.emea.arm.com Link: https://lkml.kernel.org/r/1542877459-144382-1-git-send-email-zhe.he@windriver.com Link: https://lkml.kernel.org/r/20190927082230.34152-1-yongxin.liu@windriver.com Signed-off-by: He Zhe Signed-off-by: Liu Haitao Signed-off-by: Yongxin Liu Signed-off-by: Sebastian Andrzej Siewior Acked-by: Catalin Marinas Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kmemleak.c | 112 +++++++++++++++++++++++++++++----------------------------- 1 file changed, 56 insertions(+), 56 deletions(-) (limited to 'mm') diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 244607663363..3a4259eeb5a0 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -13,7 +13,7 @@ * * The following locks and mutexes are used by kmemleak: * - * - kmemleak_lock (rwlock): protects the object_list modifications and + * - kmemleak_lock (raw_spinlock_t): protects the object_list modifications and * accesses to the object_tree_root. The object_list is the main list * holding the metadata (struct kmemleak_object) for the allocated memory * blocks. The object_tree_root is a red black tree used to look-up @@ -22,13 +22,13 @@ * object_tree_root in the create_object() function called from the * kmemleak_alloc() callback and removed in delete_object() called from the * kmemleak_free() callback - * - kmemleak_object.lock (spinlock): protects a kmemleak_object. Accesses to - * the metadata (e.g. count) are protected by this lock. Note that some - * members of this structure may be protected by other means (atomic or - * kmemleak_lock). This lock is also held when scanning the corresponding - * memory block to avoid the kernel freeing it via the kmemleak_free() - * callback. This is less heavyweight than holding a global lock like - * kmemleak_lock during scanning + * - kmemleak_object.lock (raw_spinlock_t): protects a kmemleak_object. + * Accesses to the metadata (e.g. count) are protected by this lock. Note + * that some members of this structure may be protected by other means + * (atomic or kmemleak_lock). This lock is also held when scanning the + * corresponding memory block to avoid the kernel freeing it via the + * kmemleak_free() callback. This is less heavyweight than holding a global + * lock like kmemleak_lock during scanning. * - scan_mutex (mutex): ensures that only one thread may scan the memory for * unreferenced objects at a time. The gray_list contains the objects which * are already referenced or marked as false positives and need to be @@ -135,7 +135,7 @@ struct kmemleak_scan_area { * (use_count) and freed using the RCU mechanism. 
*/ struct kmemleak_object { - spinlock_t lock; + raw_spinlock_t lock; unsigned int flags; /* object status flags */ struct list_head object_list; struct list_head gray_list; @@ -191,8 +191,8 @@ static int mem_pool_free_count = ARRAY_SIZE(mem_pool); static LIST_HEAD(mem_pool_free_list); /* search tree for object boundaries */ static struct rb_root object_tree_root = RB_ROOT; -/* rw_lock protecting the access to object_list and object_tree_root */ -static DEFINE_RWLOCK(kmemleak_lock); +/* protecting the access to object_list and object_tree_root */ +static DEFINE_RAW_SPINLOCK(kmemleak_lock); /* allocation caches for kmemleak internal data */ static struct kmem_cache *object_cache; @@ -426,7 +426,7 @@ static struct kmemleak_object *mem_pool_alloc(gfp_t gfp) } /* slab allocation failed, try the memory pool */ - write_lock_irqsave(&kmemleak_lock, flags); + raw_spin_lock_irqsave(&kmemleak_lock, flags); object = list_first_entry_or_null(&mem_pool_free_list, typeof(*object), object_list); if (object) @@ -435,7 +435,7 @@ static struct kmemleak_object *mem_pool_alloc(gfp_t gfp) object = &mem_pool[--mem_pool_free_count]; else pr_warn_once("Memory pool empty, consider increasing CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE\n"); - write_unlock_irqrestore(&kmemleak_lock, flags); + raw_spin_unlock_irqrestore(&kmemleak_lock, flags); return object; } @@ -453,9 +453,9 @@ static void mem_pool_free(struct kmemleak_object *object) } /* add the object to the memory pool free list */ - write_lock_irqsave(&kmemleak_lock, flags); + raw_spin_lock_irqsave(&kmemleak_lock, flags); list_add(&object->object_list, &mem_pool_free_list); - write_unlock_irqrestore(&kmemleak_lock, flags); + raw_spin_unlock_irqrestore(&kmemleak_lock, flags); } /* @@ -514,9 +514,9 @@ static struct kmemleak_object *find_and_get_object(unsigned long ptr, int alias) struct kmemleak_object *object; rcu_read_lock(); - read_lock_irqsave(&kmemleak_lock, flags); + raw_spin_lock_irqsave(&kmemleak_lock, flags); object = lookup_object(ptr, alias); - read_unlock_irqrestore(&kmemleak_lock, flags); + raw_spin_unlock_irqrestore(&kmemleak_lock, flags); /* check whether the object is still available */ if (object && !get_object(object)) @@ -546,11 +546,11 @@ static struct kmemleak_object *find_and_remove_object(unsigned long ptr, int ali unsigned long flags; struct kmemleak_object *object; - write_lock_irqsave(&kmemleak_lock, flags); + raw_spin_lock_irqsave(&kmemleak_lock, flags); object = lookup_object(ptr, alias); if (object) __remove_object(object); - write_unlock_irqrestore(&kmemleak_lock, flags); + raw_spin_unlock_irqrestore(&kmemleak_lock, flags); return object; } @@ -585,7 +585,7 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size, INIT_LIST_HEAD(&object->object_list); INIT_LIST_HEAD(&object->gray_list); INIT_HLIST_HEAD(&object->area_list); - spin_lock_init(&object->lock); + raw_spin_lock_init(&object->lock); atomic_set(&object->use_count, 1); object->flags = OBJECT_ALLOCATED; object->pointer = ptr; @@ -617,7 +617,7 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size, /* kernel backtrace */ object->trace_len = __save_stack_trace(object->trace); - write_lock_irqsave(&kmemleak_lock, flags); + raw_spin_lock_irqsave(&kmemleak_lock, flags); untagged_ptr = (unsigned long)kasan_reset_tag((void *)ptr); min_addr = min(min_addr, untagged_ptr); @@ -649,7 +649,7 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size, list_add_tail_rcu(&object->object_list, &object_list); out: - 
write_unlock_irqrestore(&kmemleak_lock, flags); + raw_spin_unlock_irqrestore(&kmemleak_lock, flags); return object; } @@ -667,9 +667,9 @@ static void __delete_object(struct kmemleak_object *object) * Locking here also ensures that the corresponding memory block * cannot be freed when it is being scanned. */ - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); object->flags &= ~OBJECT_ALLOCATED; - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); put_object(object); } @@ -739,9 +739,9 @@ static void paint_it(struct kmemleak_object *object, int color) { unsigned long flags; - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); __paint_it(object, color); - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); } static void paint_ptr(unsigned long ptr, int color) @@ -798,7 +798,7 @@ static void add_scan_area(unsigned long ptr, size_t size, gfp_t gfp) if (scan_area_cache) area = kmem_cache_alloc(scan_area_cache, gfp_kmemleak_mask(gfp)); - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); if (!area) { pr_warn_once("Cannot allocate a scan area, scanning the full object\n"); /* mark the object for full scan to avoid false positives */ @@ -820,7 +820,7 @@ static void add_scan_area(unsigned long ptr, size_t size, gfp_t gfp) hlist_add_head(&area->node, &object->area_list); out_unlock: - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); put_object(object); } @@ -842,9 +842,9 @@ static void object_set_excess_ref(unsigned long ptr, unsigned long excess_ref) return; } - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); object->excess_ref = excess_ref; - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); put_object(object); } @@ -864,9 +864,9 @@ static void object_no_scan(unsigned long ptr) return; } - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); object->flags |= OBJECT_NO_SCAN; - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); put_object(object); } @@ -1026,9 +1026,9 @@ void __ref kmemleak_update_trace(const void *ptr) return; } - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); object->trace_len = __save_stack_trace(object->trace); - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); put_object(object); } @@ -1233,7 +1233,7 @@ static void scan_block(void *_start, void *_end, unsigned long flags; unsigned long untagged_ptr; - read_lock_irqsave(&kmemleak_lock, flags); + raw_spin_lock_irqsave(&kmemleak_lock, flags); for (ptr = start; ptr < end; ptr++) { struct kmemleak_object *object; unsigned long pointer; @@ -1268,7 +1268,7 @@ static void scan_block(void *_start, void *_end, * previously acquired in scan_object(). These locks are * enclosed by scan_mutex. 
*/ - spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING); + raw_spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING); /* only pass surplus references (object already gray) */ if (color_gray(object)) { excess_ref = object->excess_ref; @@ -1277,7 +1277,7 @@ static void scan_block(void *_start, void *_end, excess_ref = 0; update_refs(object); } - spin_unlock(&object->lock); + raw_spin_unlock(&object->lock); if (excess_ref) { object = lookup_object(excess_ref, 0); @@ -1286,12 +1286,12 @@ static void scan_block(void *_start, void *_end, if (object == scanned) /* circular reference, ignore */ continue; - spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING); + raw_spin_lock_nested(&object->lock, SINGLE_DEPTH_NESTING); update_refs(object); - spin_unlock(&object->lock); + raw_spin_unlock(&object->lock); } } - read_unlock_irqrestore(&kmemleak_lock, flags); + raw_spin_unlock_irqrestore(&kmemleak_lock, flags); } /* @@ -1324,7 +1324,7 @@ static void scan_object(struct kmemleak_object *object) * Once the object->lock is acquired, the corresponding memory block * cannot be freed (the same lock is acquired in delete_object). */ - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); if (object->flags & OBJECT_NO_SCAN) goto out; if (!(object->flags & OBJECT_ALLOCATED)) @@ -1344,9 +1344,9 @@ static void scan_object(struct kmemleak_object *object) if (start >= end) break; - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); cond_resched(); - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); } while (object->flags & OBJECT_ALLOCATED); } else hlist_for_each_entry(area, &object->area_list, node) @@ -1354,7 +1354,7 @@ static void scan_object(struct kmemleak_object *object) (void *)(area->start + area->size), object); out: - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); } /* @@ -1407,7 +1407,7 @@ static void kmemleak_scan(void) /* prepare the kmemleak_object's */ rcu_read_lock(); list_for_each_entry_rcu(object, &object_list, object_list) { - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); #ifdef DEBUG /* * With a few exceptions there should be a maximum of @@ -1424,7 +1424,7 @@ static void kmemleak_scan(void) if (color_gray(object) && get_object(object)) list_add_tail(&object->gray_list, &gray_list); - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); } rcu_read_unlock(); @@ -1492,14 +1492,14 @@ static void kmemleak_scan(void) */ rcu_read_lock(); list_for_each_entry_rcu(object, &object_list, object_list) { - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); if (color_white(object) && (object->flags & OBJECT_ALLOCATED) && update_checksum(object) && get_object(object)) { /* color it gray temporarily */ object->count = object->min_count; list_add_tail(&object->gray_list, &gray_list); } - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); } rcu_read_unlock(); @@ -1519,7 +1519,7 @@ static void kmemleak_scan(void) */ rcu_read_lock(); list_for_each_entry_rcu(object, &object_list, object_list) { - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); if (unreferenced_object(object) && !(object->flags & OBJECT_REPORTED)) { object->flags |= OBJECT_REPORTED; @@ -1529,7 +1529,7 @@ static void kmemleak_scan(void) new_leaks++; } - 
spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); } rcu_read_unlock(); @@ -1681,10 +1681,10 @@ static int kmemleak_seq_show(struct seq_file *seq, void *v) struct kmemleak_object *object = v; unsigned long flags; - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); if ((object->flags & OBJECT_REPORTED) && unreferenced_object(object)) print_unreferenced(seq, object); - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); return 0; } @@ -1714,9 +1714,9 @@ static int dump_str_object_info(const char *str) return -EINVAL; } - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); dump_object_info(object); - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); put_object(object); return 0; @@ -1735,11 +1735,11 @@ static void kmemleak_clear(void) rcu_read_lock(); list_for_each_entry_rcu(object, &object_list, object_list) { - spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irqsave(&object->lock, flags); if ((object->flags & OBJECT_REPORTED) && unreferenced_object(object)) __paint_it(object, KMEMLEAK_GREY); - spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irqrestore(&object->lock, flags); } rcu_read_unlock(); -- cgit v1.2.3 From 5b57b8f22709f07c0ab5921c94fd66e8c59c3e11 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 30 Jan 2020 22:12:03 -0800 Subject: mm/debug.c: always print flags in dump_page() Commit 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line") inadvertently removed printing of page flags for pages that are neither anon nor ksm nor have a mapping. Fix that. Using pr_cont() again would be a solution, but the commit explicitly removed its use. Avoiding the danger of mixing up split lines from multiple CPUs might be beneficial for near-panic dumps like this, so fix this without reintroducing pr_cont(). 
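
Condensed for readability (this restates the change shown in the diff that follows, with the poisoned-page and file-mapping handling of the real __dump_page() left out): pick the type prefix once, then emit the flags in a single pr_warn(), so the line is printed for every page type and cannot be interleaved with another CPU's output the way pr_cont() continuations could be. The function name below is illustrative only.

#include <linux/mm.h>
#include <linux/printk.h>

/* Simplified sketch of the new flow in __dump_page(). */
static void dump_page_flags(struct page *page)
{
	const char *type = "";

	if (PageKsm(page))
		type = "ksm ";
	else if (PageAnon(page))
		type = "anon ";

	/* One pr_warn(): always emitted, never split across continuations. */
	pr_warn("%sflags: %#lx(%pGp)\n", type, page->flags, &page->flags);
}
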
Link: http://lkml.kernel.org/r/9f884d5c-ca60-dc7b-219c-c081c755fab6@suse.cz Fixes: 76a1850e4572 ("mm/debug.c: __dump_page() prints an extra line") Signed-off-by: Vlastimil Babka Reported-by: Anshuman Khandual Reported-by: Michal Hocko Acked-by: Michal Hocko Cc: David Hildenbrand Cc: Qian Cai Cc: Oscar Salvador Cc: Mel Gorman Cc: Mike Rapoport Cc: Dan Williams Cc: Pavel Tatashin Cc: Ralph Campbell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/debug.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/debug.c b/mm/debug.c index 74ee73cf7079..4a9ded0d3504 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -47,6 +47,7 @@ void __dump_page(struct page *page, const char *reason) struct address_space *mapping; bool page_poisoned = PagePoisoned(page); int mapcount; + char *type = ""; /* * If struct page is poisoned don't access Page*() functions as that @@ -78,9 +79,9 @@ void __dump_page(struct page *page, const char *reason) page, page_ref_count(page), mapcount, page->mapping, page_to_pgoff(page)); if (PageKsm(page)) - pr_warn("ksm flags: %#lx(%pGp)\n", page->flags, &page->flags); + type = "ksm "; else if (PageAnon(page)) - pr_warn("anon flags: %#lx(%pGp)\n", page->flags, &page->flags); + type = "anon "; else if (mapping) { if (mapping->host && mapping->host->i_dentry.first) { struct dentry *dentry; @@ -88,10 +89,11 @@ void __dump_page(struct page *page, const char *reason) pr_warn("%ps name:\"%pd\"\n", mapping->a_ops, dentry); } else pr_warn("%ps\n", mapping->a_ops); - pr_warn("flags: %#lx(%pGp)\n", page->flags, &page->flags); } BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS + 1); + pr_warn("%sflags: %#lx(%pGp)\n", type, page->flags, &page->flags); + hex_only: print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32, sizeof(unsigned long), page, -- cgit v1.2.3 From ddf8f376d137ba41ca67347a0b80ba0c357a1018 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 30 Jan 2020 22:12:07 -0800 Subject: mm/filemap.c: clean up filemap_write_and_wait() At some point filemap_write_and_wait() and filemap_write_and_wait_range() got the exact same implementation with the exception of the range being specified in *_range() Similar to other functions in fs.h which call *_range(..., 0, LLONG_MAX), change filemap_write_and_wait() to be a static inline which calls filemap_write_and_wait_range() Link: http://lkml.kernel.org/r/20191129160713.30892-1-ira.weiny@intel.com Signed-off-by: Ira Weiny Reviewed-by: Nikolay Borisov Reviewed-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap.c | 34 ++++++---------------------------- 1 file changed, 6 insertions(+), 28 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index bf6aa30be58d..1784478270e1 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -632,33 +632,6 @@ static bool mapping_needs_writeback(struct address_space *mapping) return mapping->nrpages; } -int filemap_write_and_wait(struct address_space *mapping) -{ - int err = 0; - - if (mapping_needs_writeback(mapping)) { - err = filemap_fdatawrite(mapping); - /* - * Even if the above returned error, the pages may be - * written partially (e.g. -ENOSPC), so we wait for it. - * But the -EIO is special case, it may indicate the worst - * thing (e.g. bug) happened, so we avoid waiting for it. 
- */ - if (err != -EIO) { - int err2 = filemap_fdatawait(mapping); - if (!err) - err = err2; - } else { - /* Clear any previously stored errors */ - filemap_check_errors(mapping); - } - } else { - err = filemap_check_errors(mapping); - } - return err; -} -EXPORT_SYMBOL(filemap_write_and_wait); - /** * filemap_write_and_wait_range - write out & wait on a file range * @mapping: the address_space for the pages @@ -680,7 +653,12 @@ int filemap_write_and_wait_range(struct address_space *mapping, if (mapping_needs_writeback(mapping)) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); - /* See comment of filemap_write_and_wait() */ + /* + * Even if the above returned error, the pages may be + * written partially (e.g. -ENOSPC), so we wait for it. + * But the -EIO is special case, it may indicate the worst + * thing (e.g. bug) happened, so we avoid waiting for it. + */ if (err != -EIO) { int err2 = filemap_fdatawait_range(mapping, lstart, lend); -- cgit v1.2.3 From 15494520b776aa2eadc3e2fefae524764cab9cea Mon Sep 17 00:00:00 2001 From: Qiujun Huang Date: Thu, 30 Jan 2020 22:12:10 -0800 Subject: mm: fix gup_pud_range sorry for not processing for a long time. I met it again. patch v1 https://lkml.org/lkml/2019/9/20/656 do_machine_check() do_memory_failure() memory_failure() hw_poison_user_mappings() try_to_unmap() pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); ...and now we have a swap entry that indicates that the page entry refers to a bad (and poisoned) page of memory, but gup_fast() at this level of the page table was ignoring swap entries, and incorrectly assuming that "!pxd_none() == valid and present". And this was not just a poisoned page problem, but a generaly swap entry problem. So, any swap entry type (device memory migration, numa migration, or just regular swapping) could lead to the same problem. Fix this by checking for pxd_present(), instead of pxd_none(). Link: http://lkml.kernel.org/r/1578479084-15508-1-git-send-email-hqjagain@gmail.com Signed-off-by: Qiujun Huang Cc: John Hubbard Cc: Aneesh Kumar K.V Cc: Naoya Horiguchi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/gup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/gup.c b/mm/gup.c index 7646bf993b25..9c41670d3f43 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2237,7 +2237,7 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, pud_t pud = READ_ONCE(*pudp); next = pud_addr_end(addr, end); - if (pud_none(pud)) + if (unlikely(!pud_present(pud))) return 0; if (unlikely(pud_huge(pud))) { if (!gup_huge_pud(pud, pudp, addr, next, flags, -- cgit v1.2.3 From be9d30458913f70bed1801abb5ee6f7b7d6f4b19 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Thu, 30 Jan 2020 22:12:14 -0800 Subject: mm/gup.c: use is_vm_hugetlb_page() to check whether to follow huge No functional change, just leverage the helper function to improve readability as others. 
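
For context, and not part of the patch: is_vm_hugetlb_page() is a trivial inline wrapper around the same flag test that was open-coded before, which is why this is purely a readability change. When CONFIG_HUGETLB_PAGE is enabled it is defined (in include/linux/hugetlb_inline.h) roughly as:

static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
{
	return !!(vma->vm_flags & VM_HUGETLB);
}

with a stub that always returns false when hugetlb support is compiled out.
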
Link: http://lkml.kernel.org/r/20200113070322.26627-1-richardw.yang@linux.intel.com Signed-off-by: Wei Yang Acked-by: Vlastimil Babka Acked-by: David Rientjes Reviewed-by: Ralph Campbell Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/gup.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/gup.c b/mm/gup.c index 9c41670d3f43..6cb71800fb5c 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -323,7 +323,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, pmdval = READ_ONCE(*pmd); if (pmd_none(pmdval)) return no_page_table(vma, flags); - if (pmd_huge(pmdval) && vma->vm_flags & VM_HUGETLB) { + if (pmd_huge(pmdval) && is_vm_hugetlb_page(vma)) { page = follow_huge_pmd(mm, address, pmd, flags); if (page) return page; @@ -433,7 +433,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma, pud = pud_offset(p4dp, address); if (pud_none(*pud)) return no_page_table(vma, flags); - if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) { + if (pud_huge(*pud) && is_vm_hugetlb_page(vma)) { page = follow_huge_pud(mm, address, pud, flags); if (page) return page; -- cgit v1.2.3 From a43e982082c24c2f5c0b139daac9657ac352eed3 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Thu, 30 Jan 2020 22:12:17 -0800 Subject: mm/gup: factor out duplicate code from four routines MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12. Overview: This is a prerequisite to solving the problem of proper interactions between file-backed pages, and [R]DMA activities, as discussed in [1], [2], [3], and in a remarkable number of email threads since about 2017. :) A new internal gup flag, FOLL_PIN is introduced, and thoroughly documented in the last patch's Documentation/vm/pin_user_pages.rst. I believe that this will provide a good starting point for doing the layout lease work that Ira Weiny has been working on. That's because these new wrapper functions provide a clean, constrained, systematically named set of functionality that, again, is required in order to even know if a page is "dma-pinned". In contrast to earlier approaches, the page tracking can be incrementally applied to the kernel call sites that, until now, have been simply calling get_user_pages() ("gup"). In other words, opt-in by changing from this: get_user_pages() (sets FOLL_GET) put_page() to this: pin_user_pages() (sets FOLL_PIN) unpin_user_page() Testing: * I've done some overall kernel testing (LTP, and a few other goodies), and some directed testing to exercise some of the changes. And as you can see, gup_benchmark is enhanced to exercise this. Basically, I've been able to runtime test the core get_user_pages() and pin_user_pages() and related routines, but not so much on several of the call sites--but those are generally just a couple of lines changed, each. Not much of the kernel is actually using this, which on one hand reduces risk quite a lot. But on the other hand, testing coverage is low. So I'd love it if, in particular, the Infiniband and PowerPC folks could do a smoke test of this series for me. Runtime testing for the call sites so far is pretty light: * io_uring: Some directed tests from liburing exercise this, and they pass. * process_vm_access.c: A small directed test passes. * gup_benchmark: the enhanced version hits the new gup.c code, and passes. 
* infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py * VFIO: compiles (I'm vowing to set up a run time test soon, but it's not ready just yet) * powerpc: it compiles... * drm/via: compiles... * goldfish: compiles... * net/xdp: compiles... * media/v4l2: compiles... [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/ [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/ [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/ This patch (of 22): There are four locations in gup.c that have a fair amount of code duplication. This means that changing one requires making the same changes in four places, not to mention reading the same code four times, and wondering if there are subtle differences. Factor out the common code into static functions, thus reducing the overall line count and the code's complexity. Also, take the opportunity to slightly improve the efficiency of the error cases, by doing a mass subtraction of the refcount, surrounded by get_page()/put_page(). Also, further simplify (slightly), by waiting until the the successful end of each routine, to increment *nr. Link: http://lkml.kernel.org/r/20200107224558.2362728-2-jhubbard@nvidia.com Signed-off-by: John Hubbard Reviewed-by: Christoph Hellwig Reviewed-by: Jérôme Glisse Reviewed-by: Jan Kara Cc: Kirill A. Shutemov Cc: Ira Weiny Cc: Christoph Hellwig Cc: Aneesh Kumar K.V Cc: Alex Williamson Cc: Björn Töpel Cc: Daniel Vetter Cc: Dan Williams Cc: Hans Verkuil Cc: Jason Gunthorpe Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Jonathan Corbet Cc: Leon Romanovsky Cc: Mauro Carvalho Chehab Cc: Mike Rapoport Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/gup.c | 95 +++++++++++++++++++++++++++------------------------------------- 1 file changed, 40 insertions(+), 55 deletions(-) (limited to 'mm') diff --git a/mm/gup.c b/mm/gup.c index 6cb71800fb5c..706f85b84f47 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1978,6 +1978,29 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr, } #endif +static int record_subpages(struct page *page, unsigned long addr, + unsigned long end, struct page **pages) +{ + int nr; + + for (nr = 0; addr != end; addr += PAGE_SIZE) + pages[nr++] = page++; + + return nr; +} + +static void put_compound_head(struct page *page, int refs) +{ + VM_BUG_ON_PAGE(page_ref_count(page) < refs, page); + /* + * Calling put_page() for each ref is unnecessarily slow. Only the last + * ref needs a put_page(). 
+ */ + if (refs > 1) + page_ref_sub(page, refs - 1); + put_page(page); +} + #ifdef CONFIG_ARCH_HAS_HUGEPD static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end, unsigned long sz) @@ -2007,32 +2030,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, /* hugepages are never "special" */ VM_BUG_ON(!pfn_valid(pte_pfn(pte))); - refs = 0; head = pte_page(pte); - page = head + ((addr & (sz-1)) >> PAGE_SHIFT); - do { - VM_BUG_ON(compound_head(page) != head); - pages[*nr] = page; - (*nr)++; - page++; - refs++; - } while (addr += PAGE_SIZE, addr != end); + refs = record_subpages(page, addr, end, pages + *nr); head = try_get_compound_head(head, refs); - if (!head) { - *nr -= refs; + if (!head) return 0; - } if (unlikely(pte_val(pte) != pte_val(*ptep))) { - /* Could be optimized better */ - *nr -= refs; - while (refs--) - put_page(head); + put_compound_head(head, refs); return 0; } + *nr += refs; SetPageReferenced(head); return 1; } @@ -2079,28 +2090,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr); } - refs = 0; page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); - do { - pages[*nr] = page; - (*nr)++; - page++; - refs++; - } while (addr += PAGE_SIZE, addr != end); + refs = record_subpages(page, addr, end, pages + *nr); head = try_get_compound_head(pmd_page(orig), refs); - if (!head) { - *nr -= refs; + if (!head) return 0; - } if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { - *nr -= refs; - while (refs--) - put_page(head); + put_compound_head(head, refs); return 0; } + *nr += refs; SetPageReferenced(head); return 1; } @@ -2120,28 +2122,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr); } - refs = 0; page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); - do { - pages[*nr] = page; - (*nr)++; - page++; - refs++; - } while (addr += PAGE_SIZE, addr != end); + refs = record_subpages(page, addr, end, pages + *nr); head = try_get_compound_head(pud_page(orig), refs); - if (!head) { - *nr -= refs; + if (!head) return 0; - } if (unlikely(pud_val(orig) != pud_val(*pudp))) { - *nr -= refs; - while (refs--) - put_page(head); + put_compound_head(head, refs); return 0; } + *nr += refs; SetPageReferenced(head); return 1; } @@ -2157,28 +2150,20 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr, return 0; BUILD_BUG_ON(pgd_devmap(orig)); - refs = 0; + page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT); - do { - pages[*nr] = page; - (*nr)++; - page++; - refs++; - } while (addr += PAGE_SIZE, addr != end); + refs = record_subpages(page, addr, end, pages + *nr); head = try_get_compound_head(pgd_page(orig), refs); - if (!head) { - *nr -= refs; + if (!head) return 0; - } if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) { - *nr -= refs; - while (refs--) - put_page(head); + put_compound_head(head, refs); return 0; } + *nr += refs; SetPageReferenced(head); return 1; } -- cgit v1.2.3 From a707cdd55f0f8bbf409214ab42028d89c3eddec6 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Thu, 30 Jan 2020 22:12:21 -0800 Subject: mm/gup: move try_get_compound_head() to top, fix minor issues MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit An upcoming patch uses try_get_compound_head() more widely, so move it to the top of gup.c. Also fix a tiny spelling error and a checkpatch.pl warning. 
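
Pulling the scattered hunks of the previous patch's diff back together: after the refactor, each of the four gup_huge_*() fast paths follows the same shape. The sketch below is illustrative only (the function name is made up, and the FOLL_WRITE permission and devmap checks of the real gup_huge_pmd() are omitted): record the subpages, take one compound-head reference covering all of them, re-check that the page-table entry has not changed underneath us, and back out with put_compound_head() if it has.

/* Illustrative consolidation of the common pattern, not new kernel code. */
static int gup_huge_pmd_sketch(pmd_t orig, pmd_t *pmdp, unsigned long addr,
			       unsigned long end, struct page **pages, int *nr)
{
	struct page *head, *page;
	int refs;

	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
	refs = record_subpages(page, addr, end, pages + *nr);

	head = try_get_compound_head(pmd_page(orig), refs);
	if (!head)
		return 0;	/* lost the race against the page being freed */

	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
		put_compound_head(head, refs);	/* entry changed: undo the refs */
		return 0;
	}

	*nr += refs;
	SetPageReferenced(head);
	return 1;
}
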
Link: http://lkml.kernel.org/r/20200107224558.2362728-3-jhubbard@nvidia.com Signed-off-by: John Hubbard Reviewed-by: Christoph Hellwig Reviewed-by: Jan Kara Reviewed-by: Ira Weiny Cc: Alex Williamson Cc: Aneesh Kumar K.V Cc: Björn Töpel Cc: Daniel Vetter Cc: Dan Williams Cc: Hans Verkuil Cc: Jason Gunthorpe Cc: Jason Gunthorpe Cc: Jens Axboe Cc: Jerome Glisse Cc: Jonathan Corbet Cc: Kirill A. Shutemov Cc: Leon Romanovsky Cc: Mauro