summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2020-03-23btrfs: hold a ref on the root in create_reloc_inodeJosef Bacik
We're creating a reloc inode in the data reloc tree, we need to hold a ref on the root while we're doing that. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in find_data_referencesJosef Bacik
We're looking up the data references for the bytenr in a root, we need to hold a ref on that root while we're doing that. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in record_reloc_root_in_transJosef Bacik
We are recording this root in the transaction, so we need to hold a ref on it until we do that. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in merge_reloc_rootsJosef Bacik
We look up the corresponding root for the reloc root, we need to hold a ref while we're messing with it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in prepare_to_mergeJosef Bacik
We look up the reloc roots corresponding root, we need to hold a ref on that root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in build_backref_treeJosef Bacik
This is trickier than the previous conversions. We have backref_node's that need to hold onto their root for their lifetime. Do the read of the root and grab the ref. If at any point we don't use the root we discard it, however if we use it in our backref node we don't free it until we free the backref node. Any time we switch the root's for the backref node we need to drop our ref on the old root and grab the ref on the new root, and if we dupe a node we need to get a ref on the root there as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold ref on root in btrfs_ioctl_default_subvolJosef Bacik
We look up an arbitrary fs root here, we need to hold a ref on the root for the duration. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in btrfs_ioctl_get_subvol_infoJosef Bacik
We look up whatever root userspace has given us, we need to hold a ref throughout this operation. Use 'root' only for the on fs root and not as a temporary variable elsewhere. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in btrfs_search_path_in_tree_userJosef Bacik
We can wander into a different root, so grab a ref on the root we look up. Later on we make root = fs_info->tree_root so we need this separate out label to make sure we do the right cleanup only in the case we're looking up a different root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in btrfs_search_path_in_treeJosef Bacik
We look up an arbitrary fs root, we need to hold a ref on it while we're doing our search. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in search_ioctlJosef Bacik
We lookup a arbitrary fs root, we need to hold a ref on that root. If we're using our own inodes root then grab a ref on that as well to make the cleanup easier. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in create_subvolJosef Bacik
We're creating the new root here, but we should hold the ref until after we've initialized the inode for it. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in fixup_tree_root_locationJosef Bacik
Looking up the inode from an arbitrary tree means we need to hold a ref on that root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in __btrfs_run_defrag_inodeJosef Bacik
We are looking up an arbitrary inode, we need to hold a ref on the root while we're doing this. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a root ref in btrfs_get_dentryJosef Bacik
Looking up the inode we need to search the root, make sure we hold a reference on that root while we're doing the lookup. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on the root in resolve_indirect_refJosef Bacik
We're looking up a random root, we need to hold a ref on it while we're using it. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: hold a ref on fs roots while they're in the radix treeJosef Bacik
If the root is sitting in the radix tree, we should probably have a ref for the radix tree. Grab a ref on the root when we insert it, and drop it when it gets deleted. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: describe the space reservation system in generalJosef Bacik
Add another comment to cover how the space reservation system works generally. This covers the actual reservation flow, as well as how flushing is handled. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: add a comment describing delalloc space reservationJosef Bacik
delalloc space reservation is tricky because it encompasses both data and metadata. Make it clear what each side does, the general flow of how space is moved throughout the lifetime of a write, and what goes into the calculations. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: add a comment describing block reservesJosef Bacik
This is a giant comment at the top of block-rsv.c describing generally how block reserves work. It is purely about the block reserves themselves, and nothing to do with how the actual reservation system works. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: handle NULL roots in btrfs_put/btrfs_grab_fs_rootJosef Bacik
We want to use this for dropping all roots, and in some error cases we may not have a root, so handle this to make the cleanup code easier. Make btrfs_grab_fs_root the same so we can use it in cases where the root may not exist (like the quota root). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: make the fs root init functions staticJosef Bacik
Now that the orphan cleanup stuff doesn't use this directly we can just make them static. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: open code btrfs_read_fs_root_no_nameJosef Bacik
All this does is call btrfs_get_fs_root() with check_ref == true. Just use btrfs_get_fs_root() so we don't have a bunch of different helpers that do the same thing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: remove btrfs_read_fs_root, not used anymoreJosef Bacik
All helpers should either be using btrfs_get_fs_root() or btrfs_read_tree_root(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: make relocation use btrfs_read_tree_root()Josef Bacik
Relocation has it's special roots, we don't want to save these in the root cache either, so swap it to use btrfs_read_tree_root(). However the reloc root does need REF_COWS set, so make sure we set it everywhere we use this helper, as it no longer does the REF_COWS setting. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: export and use btrfs_read_tree_root for tree-logJosef Bacik
Tree-log uses btrfs_read_fs_root to load its log, but this just calls btrfs_read_tree_root. We don't save the log roots in our root cache, so just export this helper and use it in the logging code. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: make btrfs_find_orphan_roots use btrfs_get_fs_rootJosef Bacik
btrfs_find_orphan_roots has this weird thing where it looks up the root in cache to see if it is there before just reading the root. But the read it uses just reads the root, it doesn't do any of the init work, we do that by hand here. But this is unnecessary, all we really want is to see if the root still exists and add it to the dead roots list to be cleaned up, otherwise we delete the orphan item. Fix this by just using btrfs_get_fs_root directly with check_ref set to false so we get the orphan root items. Then we just handle in cache and out of cache roots the same, add them to the dead roots list and carry on. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: move fs root init stuff into btrfs_init_fs_rootJosef Bacik
We have a helper for reading fs roots that just reads the fs root off the disk and then sets REF_COWS and init's the inheritable flags. Move this into btrfs_init_fs_root so we can later get rid of this helper and consolidate all of the fs root reading into one helper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: push __setup_root into btrfs_alloc_rootJosef Bacik
There's no reason to not init the root at alloc time, and with later patches it actually causes problems if we error out mounting the fs before the tree_root is init'ed because we expect it to have a valid ref count. Fix this by pushing __setup_root into btrfs_alloc_root. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: delete the ordered isize update codeJosef Bacik
Now that we have a safe way to update the isize, remove all of this code as it's no longer needed. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: replace all uses of btrfs_ordered_update_i_sizeJosef Bacik
Now that we have a safe way to update the i_size, replace all uses of btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: use the file extent tree infrastructureJosef Bacik
We want to use this everywhere we modify the file extent items permanently. These include: 1) Inserting new file extents for writes and prealloc extents. 2) Truncating inode items. 3) btrfs_cont_expand(). 4) Insert inline extents. 5) Insert new extents from log replay. 6) Insert a new extent for clone, as it could be past i_size. 7) Hole punching For hole punching in particular it might seem it's not necessary because anybody extending would use btrfs_cont_expand, however there is a corner that still can give us trouble. Start with an empty file and fallocate KEEP_SIZE 1M-2M We now have a 0 length file, and a hole file extent from 0-1M, and a prealloc extent from 1M-2M. Now punch 1M-1.5M Because this is past i_size we have [HOLE EXTENT][ NOTHING ][PREALLOC] [0 1M][1M 1.5M][1.5M 2M] with an i_size of 0. Now if we pwrite 0-1.5M we'll increas our i_size to 1.5M, but our disk_i_size is still 0 until the ordered extent completes. However if we now immediately truncate 2M on the file we'll just call btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M. If we commit the transaction here and crash we'll expose the gap. To fix this we need to clear the file extent mapping for the range that we punched but didn't insert a corresponding file extent for. This will mean the truncate will only get an disk_i_size set to 1M if we crash before the finish ordered io happens. I've written an xfstest to reproduce the problem and validate this fix. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: introduce per-inode file extent treeJosef Bacik
In order to keep track of where we have file extents on disk, and thus where it is safe to adjust the i_size to, we need to have a tree in place to keep track of the contiguous areas we have file extents for. Add helpers to use this tree, as it's not required for NO_HOLES file systems. We will use this by setting DIRTY for areas we know we have file extent item's set, and clearing it when we remove file extent items for truncation. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: use btrfs_ordered_update_i_size in clone_finish_inode_updateJosef Bacik
We were using btrfs_i_size_write(), which unconditionally jacks up inode->disk_i_size. However since clone can operate on ranges we could have pending ordered extents for a range prior to the start of our clone operation and thus increase disk_i_size too far and have a hole with no file extent. Fix this by using the btrfs_ordered_update_i_size helper which will do the right thing in the face of pending ordered extents outside of our clone range. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: update the comment of btrfs_control_ioctl()Su Yue
Btrfsctl was removed in 2012, now the function btrfs_control_ioctl() is only used for devices ioctls. So update the comment. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Su Yue <Damenly_Su@gmx.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: relocation: Add introduction of how relocation worksQu Wenruo
Relocation is one of the most complex part of btrfs, while it's also the foundation stone for online resizing, profile converting. For such a complex facility, we should at least have some introduction to it. This patch will add an basic introduction at pretty a high level, explaining: - What relocation does - How relocation is done Only mentioning how data reloc tree and reloc tree are involved in the operation. No details like the backref cache, or the data reloc tree contents. - Which function to refer. More detailed comments will be added for reloc tree creation, data reloc tree creation and backref cache. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23Btrfs: don't iterate mod seq list when putting a tree mod seqFilipe Manana
Each new element added to the mod seq list is always appended to the list, and each one gets a sequence number coming from a counter which gets incremented everytime a new element is added to the list (or a new node is added to the tree mod log rbtree). Therefore the element with the lowest sequence number is always the first element in the list. So just remove the list iteration at btrfs_put_tree_mod_seq() that computes the minimum sequence number in the list and replace it with a check for the first element's sequence number. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-23btrfs: Add overview of device replaceQu Wenruo
The overview of btrfs dev-replace. It mentions some corner cases caused by the write duplication and scrub based data copy. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ adjust wording ] Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-22Linux 5.6-rc7Linus Torvalds
2020-03-22Merge tag 'for-5.6-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two fixes. The first is a regression: when dropping some incompat bits the conditions were reversed. The other is a fix for rename whiteout potentially leaving stack memory linked to a list" * tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix removal of raid[56|1c34} incompat flags after removing block group btrfs: fix log context list corruption after rename whiteout error
2020-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "10 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: x86/mm: split vmalloc_sync_all() mm, slub: prevent kmalloc_node crashes and memory leaks mm/mmu_notifier: silence PROVE_RCU_LIST warnings epoll: fix possible lost wakeup on epoll_ctl() path mm: do not allow MADV_PAGEOUT for CoW pages mm, memcg: throttle allocators based on ancestral memory.high mm, memcg: fix corruption on 64-bit divisor in memory.high throttling page-flags: fix a crash at SetPageError(THP_SWAP) mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event
2020-03-21x86/mm: split vmalloc_sync_all()Joerg Roedel
Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the vunmap() code-path. While this change was necessary to maintain correctness on x86-32-pae kernels, it also adds additional cycles for architectures that don't need it. Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported severe performance regressions in micro-benchmarks because it now also calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But the vmalloc_sync_all() implementation on x86-64 is only needed for newly created mappings. To avoid the unnecessary work on x86-64 and to gain the performance back, split up vmalloc_sync_all() into two functions: * vmalloc_sync_mappings(), and * vmalloc_sync_unmappings() Most call-sites to vmalloc_sync_all() only care about new mappings being synchronized. The only exception is the new call-site added in the above mentioned commit. Shile Zhang directed us to a report of an 80% regression in reaim throughput. Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES] Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/ Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm, slub: prevent kmalloc_node crashes and memory leaksVlastimil Babka
Sachin reports [1] a crash in SLUB __slab_alloc(): BUG: Kernel NULL pointer dereference on read at 0x000073b0 Faulting instruction address: 0xc0000000003d55f4 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1 NIP: c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000 REGS: c0000008b37836d0 TRAP: 0300 Not tainted (5.6.0-rc2-next-20200218-autotest) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004844 XER: 00000000 CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1 GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500 GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620 GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000 GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000 GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002 GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122 GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8 GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180 NIP ___slab_alloc+0x1f4/0x760 LR __slab_alloc+0x34/0x60 Call Trace: ___slab_alloc+0x334/0x760 (unreliable) __slab_alloc+0x34/0x60 __kmalloc_node+0x110/0x490 kvmalloc_node+0x58/0x110 mem_cgroup_css_online+0x108/0x270 online_css+0x48/0xd0 cgroup_apply_control_enable+0x2ec/0x4d0 cgroup_mkdir+0x228/0x5f0 kernfs_iop_mkdir+0x90/0xf0 vfs_mkdir+0x110/0x230 do_mkdirat+0xb0/0x1a0 system_call+0x5c/0x68 This is a PowerPC platform with following NUMA topology: available: 2 nodes (0-1) node 0 cpus: node 0 size: 0 MB node 0 free: 0 MB node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 1 size: 35247 MB node 1 free: 30907 MB node distances: node 0 1 0: 10 40 1: 40 10 possible numa nodes: 0-31 This only happens with a mmotm patch "mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node" [2] which effectively calls kmalloc_node for each possible node. SLUB however only allocates kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on node_to_mem_node to return such valid node for other nodes since commit a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node"). This is however not true in this configuration where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31, thus it contains zeroes and get_partial() ends up accessing non-allocated kmem_cache_node. A related issue was reported by Bharata (originally by Ramachandran) [3] where a similar PowerPC configuration, but with mainline kernel without patch [2] ends up allocating large amounts of pages by kmalloc-1k kmalloc-512. This seems to have the same underlying issue with node_to_mem_node() not behaving as expected, and might probably also lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4]. This patch should fix both issues by not relying on node_to_mem_node() anymore and instead simply falling back to NUMA_NO_NODE, when kmalloc_node(node) is attempted for a node that's not online, or has no usable memory. The "usable memory" condition is also changed from node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly the condition that SLUB uses to allocate kmem_cache_node structures. The check in get_partial() is removed completely, as the checks in ___slab_alloc() are now sufficient to prevent get_partial() being reached with an invalid node. [1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/ [2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/ [3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/ [4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/ Fixes: a561ce00b09e ("slub: fall back to node_to_mem_node() node if allocating on memoryless node") Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christopher Lameter <cl@linux.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm/mmu_notifier: silence PROVE_RCU_LIST warningsQian Cai
It is safe to traverse mm->notifier_subscriptions->list either under SRCU read lock or mm->notifier_subscriptions->lock using hlist_for_each_entry_rcu(). Silence the PROVE_RCU_LIST false positives, for example, WARNING: suspicious RCU usage ----------------------------- mm/mmu_notifier.c:484 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by libvirtd/802: #0: ffff9321e3f58148 (&mm->mmap_sem#2){++++}, at: do_mprotect_pkey+0xe1/0x3e0 #1: ffffffff91ae6160 (mmu_notifier_invalidate_range_start){+.+.}, at: change_p4d_range+0x5fa/0x800 #2: ffffffff91ae6e08 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x178/0x460 stack backtrace: CPU: 7 PID: 802 Comm: libvirtd Tainted: G I 5.6.0-rc6-next-20200317+ #2 Hardware name: HP ProLiant BL460c Gen8, BIOS I31 11/02/2014 Call Trace: dump_stack+0xa4/0xfe lockdep_rcu_suspicious+0xeb/0xf5 __mmu_notifier_invalidate_range_start+0x3ff/0x460 change_p4d_range+0x746/0x800 change_protection+0x1df/0x300 mprotect_fixup+0x245/0x3e0 do_mprotect_pkey+0x23b/0x3e0 __x64_sys_mprotect+0x51/0x70 do_syscall_64+0x91/0xae8 entry_SYSCALL_64_after_hwframe+0x49/0xb3 Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Link: http://lkml.kernel.org/r/20200317175640.2047-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21epoll: fix possible lost wakeup on epoll_ctl() pathRoman Penyaev
This fixes possible lost wakeup introduced by commit a218cc491420. Originally modifications to ep->wq were serialized by ep->wq.lock, but in commit a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention") a new rw lock was introduced in order to relax fd event path, i.e. callers of ep_poll_callback() function. After the change ep_modify and ep_insert (both are called on epoll_ctl() path) were switched to ep->lock, but ep_poll (epoll_wait) was using ep->wq.lock on wqueue list modification. The bug doesn't lead to any wqueue list corruptions, because wake up path and list modifications were serialized by ep->wq.lock internally, but actual waitqueue_active() check prior wake_up() call can be reordered with modifications of ep ready list, thus wake up can be lost. And yes, can be healed by explicit smp_mb(): list_add_tail(&epi->rdlink, &ep->rdllist); smp_mb(); if (waitqueue_active(&ep->wq)) wake_up(&ep->wp); But let's make it simple, thus current patch replaces ep->wq.lock with the ep->lock for wqueue modifications, thus wake up path always observes activeness of the wqueue correcty. Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention") Reported-by: Max Neunhoeffer <max@arangodb.com> Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Max Neunhoeffer <max@arangodb.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Christopher Kohlhoff <chris.kohlhoff@clearpool.io> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Jes Sorensen <jes.sorensen@gmail.com> Cc: <stable@vger.kernel.org> [5.1+] Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de References: https://bugzilla.kernel.org/show_bug.cgi?id=205933 Bisected-by: Max Neunhoeffer <max@arangodb.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm: do not allow MADV_PAGEOUT for CoW pagesMichal Hocko
Jann has brought up a very interesting point [1]. While shared pages are excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed that way. This can lead to all sorts of hard to debug problems. E.g. performance problems outlined by Daniel [2]. There are runtime environments where there is a substantial memory shared among security domains via CoW memory and a easy to reclaim way of that memory, which MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation in for the parent process which might be more privileged or even open side channel attacks. The feasibility of the latter is not really clear to me TBH but there is no real reason for exposure at this stage. It seems there is no real use case to depend on reclaiming CoW memory via madvise at this stage so it is much easier to simply disallow it and this is what this patch does. Put it simply MADV_{PAGEOUT,COLD} can operate only on the exclusively owned memory which is a straightforward semantic. [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm, memcg: throttle allocators based on ancestral memory.highChris Down
Prior to this commit, we only directly check the affected cgroup's memory.high against its usage. However, it's possible that we are being reclaimed as a result of hitting an ancestor memory.high and should be penalised based on that, instead. This patch changes memory.high overage throttling to use the largest overage in its ancestors when considering how many penalty jiffies to charge. This makes sure that we penalise poorly behaving cgroups in the same way regardless of at what level of the hierarchy memory.high was breached. Fixes: 0e4b01df8659 ("mm, memcg: throttle allocators when failing reclaim over memory.high") Reported-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> [5.4.x+] Link: http://lkml.kernel.org/r/8cd132f84bd7e16cdb8fde3378cdbf05ba00d387.1584036142.git.chris@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm, memcg: fix corruption on 64-bit divisor in memory.high throttlingChris Down
Commit 0e4b01df8659 had a bunch of fixups to use the right division method. However, it seems that after all that it still wasn't right -- div_u64 takes a 32-bit divisor. The headroom is still large (2^32 pages), so on mundane systems you won't hit this, but this should definitely be fixed. Fixes: 0e4b01df8659 ("mm, memcg: throttle allocators when failing reclaim over memory.high") Reported-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chris Down <chris@chrisdown.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: <stable@vger.kernel.org> [5.4.x+] Link: http://lkml.kernel.org/r/80780887060514967d414b3cd91f9a316a16ab98.1584036142.git.chris@chrisdown.name Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21page-flags: fix a crash at SetPageError(THP_SWAP)Qian Cai
Commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after swapped out") supported writing THP to a swap device but forgot to upgrade an older commit df8c94d13c7e ("page-flags: define behavior of FS/IO-related flags on compound pages") which could trigger a crash during THP swapping out with DEBUG_VM_PGFLAGS=y, kernel BUG at include/linux/page-flags.h:317! page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) end_swap_bio_write() SetPageError(page) VM_BUG_ON_PAGE(1 && PageCompound(page)) <IRQ> bio_endio+0x297/0x560 dec_pending+0x218/0x430 [dm_mod] clone_endio+0xe4/0x2c0 [dm_mod] bio_endio+0x297/0x560 blk_update_request+0x201/0x920 scsi_end_request+0x6b/0x4b0 scsi_io_completion+0x509/0x7e0 scsi_finish_command+0x1ed/0x2a0 scsi_softirq_done+0x1c9/0x1d0 __blk_mqnterrupt+0xf/0x20 </IRQ> Fix by checking PF_NO_TAIL in those places instead. Fixes: bd4c82c22c36 ("mm, THP, swap: delay splitting THP after swapped out") Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200310235846.1319-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-21mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP caseBaoquan He
In section_deactivate(), pfn_to_page() doesn't work any more after ms->section_mem_map is resetting to NULL in SPARSEMEM|!VMEMMAP case. It causes a hot remove failure: kernel BUG at mm/page_alloc.c:4806! invalid opcode: 0000 [#1] SMP PTI CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G W 5.5.0-next-20200205+ #340 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Workqueue: kacpi_hotplug acpi_hotplug_work_fn RIP: 0010:free_pages+0x85/0xa0 Call Trace: __remove_pages+0x99/0xc0 arch_remove_memory+0x23/0x4d try_remove_memory+0xc8/0x130 __remove_memory+0xa/0x11 acpi_memory_device_remove+0x72/0x100 acpi_bus_trim+0x55/0x90 acpi_device_hotplug+0x2eb/0x3d0 acpi_hotplug_work_fn+0x1a/0x30 process_one_work+0x1a7/0x370 worker_thread+0x30/0x380 kthread+0x112/0x130 ret_from_fork+0x35/0x40 Let's move the ->section_mem_map resetting after depopulate_section_memmap() to fix it. [akpm@linux-foundation.org: remove unneeded initialization, per David] Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>