summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2015-09-02Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...
2015-09-01Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk() fixes, Documentation and MAINTAINERS updates)" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits) MAINTAINERS: update my e-mail address mod_devicetable: add space before */ scsi: a100u2w: trivial typo in printk i2c: Fix typo in i2c-bfin-twi.c treewide: fix typos in comment blocks Doc: fix trivial typo in SubmittingPatches proportions: Spelling s/consitent/consistent/ dm: Spelling s/consitent/consistent/ aic7xxx: Fix typo in error message pcmcia: Fix typo in locking documentation scsi/arcmsr: Fix typos in error log drm/nouveau/gr: Fix typo in nv10.c [SCSI] Fix printk typos in drivers/scsi staging: comedi: Grammar s/Enable support a/Enable support for a/ Btrfs: Spelling s/consitent/consistent/ README: GTK+ is a acronym ASoC: omap: Fix typo in config option description mm: tlb.c: Fix error message ntfs: super.c: Fix error log fix typo in Documentation/SubmittingPatches ...
2015-08-13block: remove bio_get_nr_vecs()Kent Overstreet
We can always fill up the bio now, no need to estimate the possible size based on queue parameters. Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [hch: rebased and wrote a changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-13btrfs: remove bio splitting and merge_bvec_fn() callsKent Overstreet
Btrfs has been doing bio splitting from btrfs_map_bio(), by checking device limits as well as calling ->merge_bvec_fn() etc. That is not necessary any more, because generic_make_request() is now able to handle arbitrarily sized bios. So clean up unnecessary code paths. Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-09Merge 4.2-rc6 into char-misc-nextGreg Kroah-Hartman
We want the fixes in Linus's tree in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-09Merge branch 'for-linus-4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "We have a btrfs quota regression fix. I merged this one on Thursday and have run it through tests against current master. Normally I wouldn't have sent this while you were finalizing rc6, but I'm feeding mosquitoes in the adirondacks next week, so I wanted to get this one out before leaving. I'll leave longer tests running and check on things during the week, but I don't expect any problems" * 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: qgroup: Fix a regression in qgroup reserved space.
2015-08-07Btrfs: Spelling s/consitent/consistent/Geert Uytterhoeven
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2015-08-06btrfs: qgroup: Fix a regression in qgroup reserved space.Qu Wenruo
During the change to new btrfs extent-oriented qgroup implement, due to it doesn't use the old __qgroup_excl_accounting() for exclusive extent, it didn't free the reserved bytes. The bug will cause limit function go crazy as the reserved space is never freed, increasing limit will have no effect and still cause EQOUT. The fix is easy, just free reserved bytes for newly created exclusive extent as what it does before. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Yang Dongsheng <yangds.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-05char: make misc_deregister a void functionGreg Kroah-Hartman
With well over 200+ users of this api, there are a mere 12 users that actually checked the return value of this function. And all of them really didn't do anything with that information as the system or module was shutting down no matter what. So stop pretending like it matters, and just return void from misc_deregister(). If something goes wrong in the call, you will get a WARNING splat in the syslog so you know how to fix up your driver. Other than that, there's nothing that can go wrong. Cc: Alasdair Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Wim Van Sebroeck <wim@iguana.be> Cc: Christine Caulfield <ccaulfie@redhat.com> Cc: David Teigland <teigland@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <jlbec@evilplan.org> Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Acked-by: Alessandro Zummo <a.zummo@towertech.it> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-31Merge branch 'for-linus-4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Filipe fixed up a hard to trigger ENOSPC regression from our merge window pull, and we have a few other smaller fixes" * 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix quick exhaustion of the system array in the superblock btrfs: its btrfs_err() instead of btrfs_error() btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-22Btrfs: fix quick exhaustion of the system array in the superblockFilipe Manana
Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when finishing block group creation"), introduced in 4.2-rc1, the following test was failing due to exhaustion of the system array in the superblock: #!/bin/bash truncate -s 100T big.img mkfs.btrfs big.img mount -o loop big.img /mnt/loop num=5 sz=10T for ((i = 0; i < $num; i++)); do echo fallocate $i $sz fallocate -l $sz /mnt/loop/testfile$i done btrfs filesystem sync /mnt/loop for ((i = 0; i < $num; i++)); do echo rm $i rm /mnt/loop/testfile$i btrfs filesystem sync /mnt/loop done umount /mnt/loop This made btrfs_add_system_chunk() fail with -EFBIG due to excessive allocation of system block groups. This happened because the test creates a large number of data block groups per transaction and when committing the transaction we start the writeout of the block group caches for all the new new (dirty) block groups, which results in pre-allocating space for each block group's free space cache using the same transaction handle. That in turn often leads to creation of more block groups, and all get attached to the new_bgs list of the same transaction handle to the point of getting a list with over 1500 elements, and creation of new block groups leads to the need of reserving space in the chunk block reserve and often creating a new system block group too. So that made us quickly exhaust the chunk block reserve/system space info, because as of the commit mentioned before, we do reserve space for each new block group in the chunk block reserve, unlike before where we would not and would at most allocate one new system block group and therefore would only ensure that there was enough space in the system space info to allocate 1 new block group even if we ended up allocating thousands of new block groups using the same transaction handle. That worked most of the time because the computed required space at check_system_chunk() is very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and that all nodes/leafs in a path will be COWed and split) and since the updates to the chunk tree all happen at btrfs_create_pending_block_groups it is unlikely that a path needs to be COWed more than once (unless writepages() for the btree inode is called by mm in between) and that compensated for the need of creating any new nodes/leads in the chunk tree. So fix this by ensuring we don't accumulate a too large list of new block groups in a transaction's handles new_bgs list, inserting/updating the chunk tree for all accumulated new block groups and releasing the unused space from the chunk block reserve whenever the list becomes sufficiently large. This is a generic solution even though the problem currently can only happen when starting the writeout of the free space caches for all dirty block groups (btrfs_start_dirty_block_groups()). Reported-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22btrfs: its btrfs_err() instead of btrfs_error()Anand Jain
sorry I indented to use btrfs_err() and I have no idea how btrfs_error() got there. infact I was thinking about these kind of oversights since these two func are too closely named. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22btrfs: Avoid NULL pointer dereference of free_extent_buffer when ↵Zhao Lei
read_tree_block() fail When read_tree_block() failed, we can see following dmesg: [ 134.371389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000063 [ 134.372236] IP: [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90 [ 134.372236] PGD 0 [ 134.372236] Oops: 0000 [#1] SMP [ 134.372236] Modules linked in: [ 134.372236] CPU: 0 PID: 2289 Comm: mount Not tainted 4.2.0-rc1_HEAD_c65b99f046843d2455aa231747b5a07a999a9f3d_+ #115 [ 134.372236] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014 [ 134.372236] task: ffff88003b6e1a00 ti: ffff880011e60000 task.ti: ffff880011e60000 [ 134.372236] RIP: 0010:[<ffffffff813a4a51>] [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90 ... [ 134.372236] Call Trace: [ 134.372236] [<ffffffff81379aa1>] free_root_extent_buffers+0x91/0xb0 [ 134.372236] [<ffffffff81379c3d>] free_root_pointers+0x17d/0x190 [ 134.372236] [<ffffffff813801b0>] open_ctree+0x1ca0/0x25b0 [ 134.372236] [<ffffffff8144d017>] ? disk_name+0x97/0xb0 [ 134.372236] [<ffffffff813558aa>] btrfs_mount+0x8fa/0xab0 ... Reason: read_tree_block() changed to return error number on fail, and this value(not NULL) is set to tree_root->node, then subsequent code will run to: free_root_pointers() ->free_root_extent_buffers() ->free_extent_buffer() ->atomic_read((extent_buffer *)(-E_XXX)->refs); and trigger above error. Fix: Set tree_root->node to NULL on fail to make error_handle code happy. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-22btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()Zhao Lei
Liu Bo <bo.li.liu@oracle.com> reported a lockdep warning of delayed_iput_sem in xfstests generic/241: [ 2061.345955] ============================================= [ 2061.346027] [ INFO: possible recursive locking detected ] [ 2061.346027] 4.1.0+ #268 Tainted: G W [ 2061.346027] --------------------------------------------- [ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock: [ 2061.346027] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100 [ 2061.346027] but task is already holding lock: [ 2061.346027] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100 [ 2061.346027] other info that might help us debug this: [ 2061.346027] Possible unsafe locking scenario: [ 2061.346027] CPU0 [ 2061.346027] ---- [ 2061.346027] lock(&fs_info->delayed_iput_sem); [ 2061.346027] lock(&fs_info->delayed_iput_sem); [ 2061.346027] *** DEADLOCK *** It is rarely happened, about 1/400 in my test env. The reason is recursion of btrfs_run_delayed_iputs(): cleaner_kthread -> btrfs_run_delayed_iputs() *1 -> get delayed_iput_sem lock *2 -> iput() -> ... -> btrfs_commit_transaction() -> btrfs_run_delayed_iputs() *1 -> get delayed_iput_sem lock (dead lock) *2 *1: recursion of btrfs_run_delayed_iputs() *2: warning of lockdep about delayed_iput_sem When fs is in high stress, new iputs may added into fs_info->delayed_iputs list when btrfs_run_delayed_iputs() is running, which cause second btrfs_run_delayed_iputs() run into down_read(&fs_info->delayed_iput_sem) again, and cause above lockdep warning. Actually, it will not cause real problem because both locks are read lock, but to avoid lockdep warning, we can do a fix. Fix: Don't do btrfs_run_delayed_iputs() in btrfs_commit_transaction() for cleaner_kthread thread to break above recursion path. cleaner_kthread is calling btrfs_run_delayed_iputs() explicitly in code, and don't need to call btrfs_run_delayed_iputs() again in btrfs_commit_transaction(), it also give us a bonus to avoid stack overflow. Test: No above lockdep warning after patch in 1200 generic/241 tests. Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-17Merge branch 'for-linus-4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are all from Filipe, and cover a few problems we've had reported on the list recently (along with ones he found on his own)" * 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix file corruption after cloning inline extents Btrfs: fix order by which delayed references are run Btrfs: fix list transaction->pending_ordered corruption Btrfs: fix memory leak in the extent_same ioctl Btrfs: fix shrinking truncate when the no_holes feature is enabled
2015-07-14Btrfs: fix file corruption after cloning inline extentsFilipe Manana
Using the clone ioctl (or extent_same ioctl, which calls the same extent cloning function as well) we end up allowing copy an inline extent from the source file into a non-zero offset of the destination file. This is something not expected and that the btrfs code is not prepared to deal with - all inline extents must be at a file offset equals to 0. For example, the following excerpt of a test case for fstests triggers a crash/BUG_ON() on a write operation after an inline extent is cloned into a non-zero offset: _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount # Create our test files. File foo has the same 2K of data at offset 4K # as file bar has at its offset 0. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \ -c "pwrite -S 0xbb 4k 2K" \ -c "pwrite -S 0xcc 8K 4K" \ $SCRATCH_MNT/foo | _filter_xfs_io # File bar consists of a single inline extent (2K size). $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \ $SCRATCH_MNT/bar | _filter_xfs_io # Now call the clone ioctl to clone the extent of file bar into file # foo at its offset 4K. This made file foo have an inline extent at # offset 4K, something which the btrfs code can not deal with in future # IO operations because all inline extents are supposed to start at an # offset of 0, resulting in all sorts of chaos. # So here we validate that clone ioctl returns an EOPNOTSUPP, which is # what it returns for other cases dealing with inlined extents. $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \ $SCRATCH_MNT/bar $SCRATCH_MNT/foo # Because of the inline extent at offset 4K, the following write made # the kernel crash with a BUG_ON(). $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io status=0 exit The stack trace of the BUG_ON() triggered by the last write is: [152154.035903] ------------[ cut here ]------------ [152154.036424] kernel BUG at mm/page-writeback.c:2286! [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$ [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G W 4.1.0-rc6-btrfs-next-11+ #2 [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000 [152154.036424] RIP: 0010:[<ffffffff8111a9d5>] [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90 [152154.036424] RSP: 0018:ffff880429effc68 EFLAGS: 00010246 [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001 [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0 [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001 [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68 [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80 [152154.036424] FS: 00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000 [152154.036424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0 [152154.036424] Stack: [152154.036424] ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1 [152154.036424] ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8 [152154.036424] 0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000 [152154.036424] Call Trace: [152154.036424] [<ffffffffa04e97c1>] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs] [152154.036424] [<ffffffffa04ea82c>] __btrfs_buffered_write+0x245/0x4c8 [btrfs] [152154.036424] [<ffffffffa04ed14b>] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs] [152154.036424] [<ffffffffa04ed15a>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs] [152154.036424] [<ffffffffa04ed2c7>] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs] [152154.036424] [<ffffffff81165a4a>] __vfs_write+0x7c/0xa5 [152154.036424] [<ffffffff81165f89>] vfs_write+0xa0/0xe4 [152154.036424] [<ffffffff81166855>] SyS_pwrite64+0x64/0x82 [152154.036424] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$ [152154.036424] RIP [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90 [152154.036424] RSP <ffff880429effc68> [152154.242621] ---[ end trace e3d3376b23a57041 ]--- Fix this by returning the error EOPNOTSUPP if an attempt to copy an inline extent into a non-zero offset happens, just like what is done for other scenarios that would require copying/splitting inline extents, which were introduced by the following commits: 00fdf13a2e9f ("Btrfs: fix a crash of clone with inline extents's split") 3f9e3df8da3c ("btrfs: replace error code from btrfs_drop_extents") Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-07-11Btrfs: fix order by which delayed references are runFilipe Manana
When we have an extent that got N references removed and N new references added in the same transaction, we must run the insertion of the references first because otherwise the last removed reference will remove the extent item from the extent tree, resulting in a failure for the insertions. This is a regression introduced in the 4.2-rc1 release and this fix just brings back the behaviour of selecting reference additions before any reference removals. The following test case for fstests reproduces the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_dm_flakey _require_cloner _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create prealloc extent covering range [160K, 620K[ $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo # Now write to the last 80K of the prealloc extent plus 40K to the unallocated # space that immediately follows it. This creates a new extent of 40K that spans # the range [620K, 660K[. $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | _filter_xfs_io # At this point, there are now 2 back references to the prealloc extent in our # extent tree. Both are for our file offset 160K and one relates to a file # extent item with a data offset of 0 and a length of 380K, while the other # relates to a file extent item with a data offset of 380K and a length of 80K. # Make sure everything done so far is durably persisted (all back references are # in the extent tree, etc). sync # Now clone all extents of our file that cover the offset 160K up to its eof # (660K at this point) into itself at offset 2M. This leaves a hole in the file # covering the range [660K, 2M[. The prealloc extent will now be referenced by # the file twice, once for offset 160K and once for offset 2M. The 40K extent # that follows the prealloc extent will also be referenced twice by our file, # once for offset 620K and once for offset 2M + 460K. $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 $SCRATCH_MNT/foo \ $SCRATCH_MNT/foo # Now create one new extent in our file with a size of 100Kb. It will span the # range [3M, 3M + 100K[. It also will cause creation of a hole spanning the # range [2M + 460K, 3M[. Our new file size is 3M + 100K. $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | _filter_xfs_io # At this point, there are now (in memory) 4 back references to the prealloc # extent. # # Two of them are for file offset 160K, related to file extent items # matching the file offsets 160K and 540K respectively, with data offsets of # 0 and 380K respectively, and with lengths of 380K and 80K respectively. # # The other two references are for file offset 2M, related to file extent items # matching the file offsets 2M and 2M + 380K respectively, with data offsets of # 0 and 380K respectively, and with lengths of 389K and 80K respectively. # # The 40K extent has 2 back references, one for file offset 620K and the other # for file offset 2M + 460K. # # The 100K extent has a single back reference and it relates to file offset 3M. # Now clone our 100K extent into offset 600K. That offset covers the last 20K # of the prealloc extent, the whole 40K extent and 40K of the hole starting at # offset 660K. $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo # At this point there's only one reference to the 40K extent, at file offset # 2M + 460K, we have 4 references for the prealloc extent (2 for file offset # 160K and 2 for file offset 2M) and 2 references for the 100K extent (1 for # file offset 3M and a new one for file offset 600K). # Now fsync our file to make all its new data and metadata updates are durably # persisted and present if a power failure/crash happens after a successful # fsync and before the next transaction commit. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo echo "File digest before power failure:" md5sum $SCRATCH_MNT/foo | _filter_scratch # Silently drop all writes and ummount to simulate a crash/power failure. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Allow writes again, mount to trigger log replay and validate file contents. # During log replay, the btrfs delayed references implementation used to run the # deletion of back references before the addition of new back references, which # made the addition fail as it didn't find the key in the extent tree that it # was looking for. The failure triggered by this test was related to the 40K # extent, which got 1 reference dropped and 1 reference added during the fsync # log replay - when running the delayed references at transaction commit time, # btrfs was applying the deletion before the insertion, resulting in a failure # of the insertion that ended up turning the fs into read-only mode. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey echo "File digest after log replay:" md5sum $SCRATCH_MNT/foo | _filter_scratch _unmount_flakey status=0 exit This issue turned the filesystem into read-only mode (current transaction aborted) and produced the following traces: [ 8247.578385] ------------[ cut here ]------------ [ 8247.579947] WARNING: CPU: 0 PID: 11341 at fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d [btrfs]() (...) [ 8247.601697] Call Trace: [ 8247.602222] [<ffffffff8145f077>] dump_stack+0x4f/0x7b [ 8247.604320] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb [ 8247.605488] [<ffffffffa0506c8d>] ? lookup_inline_extent_backref+0x17d/0x45d [btrfs] [ 8247.608226] [<ffffffffa0506c8d>] lookup_inline_extent_backref+0x17d/0x45d [btrfs] [ 8247.617061] [<ffffffffa0507957>] insert_inline_extent_backref+0x41/0xb2 [btrfs] [ 8247.621856] [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a [btrfs] [ 8247.624366] [<ffffffffa050ee60>] __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs] [ 8247.626176] [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 [btrfs] [ 8247.627435] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 [ 8247.628531] [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs] (...) [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]--- [ 8247.727263] WARNING: CPU: 3 PID: 11341 at fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]() [ 8247.728954] BTRFS: Transaction aborted (error -5) (...) [ 8247.760866] Call Trace: [ 8247.761534] [<ffffffff8145f077>] dump_stack+0x4f/0x7b [ 8247.764271] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb [ 8247.767582] [<ffffffffa0510e04>] ? btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs] [ 8247.769373] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48 [ 8247.770836] [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs] [ 8247.772532] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 [ 8247.773664] [<ffffffffa0520482>] btrfs_commit_transaction+0x4c/0xa20 [btrfs] [ 8247.775047] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf [ 8247.776176] [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189 [ 8247.777427] [<ffffffffa055a920>] btrfs_recover_log_trees+0x2da/0x33d [btrfs] [ 8247.778575] [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc [btrfs] [ 8247.779838] [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs] [ 8247.781020] [<ffffffff81120f48>] ? register_shrinker+0x56/0x81 [ 8247.782285] [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs] (...) [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]--- [ 8247.794276] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2771: errno=-5 IO failure [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: errno=-5 IO failure (Failed to recover log tree) Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") Signed-off-by: Filipe Manana <fdmanana@suse.com> Acked-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-07-11Btrfs: fix list transaction->pending_ordered corruptionFilipe Manana
When we call btrfs_commit_transaction(), we splice the list "ordered" of our transaction handle into the transaction's "pending_ordered" list, but we don't re-initialize the "ordered" list of our transaction handle, this means it still points to the same elements it used to before the splice. Then we check if the current transaction's state is >= TRANS_STATE_COMMIT_START and if it is we end up calling btrfs_end_transaction() which simply splices again the "ordered" list of our handle into the transaction's "pending_ordered" list, leaving multiple pointers to the same ordered extents which results in list corruption when we are iterating, removing and freeing ordered extents at btrfs_wait_pending_ordered(), resulting in access to dangling pointers / use-after-free issues. Similarly, btrfs_end_transaction() can end up in some cases calling btrfs_commit_transaction(), and both did a list splice of the transaction handle's "ordered" list into the transaction's "pending_ordered" without re-initializing the handle's "ordered" list, resulting in exactly the same problem. This produces the following warning on a kernel with linked list debugging enabled: [109749.265416] ------------[ cut here ]------------ [109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98() [109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d (...) [109749.287505] Call Trace: [109749.288135] [<ffffffff8145f077>] dump_stack+0x4f/0x7b [109749.298080] [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2 [109749.331605] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb [109749.334849] [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98 [109749.337093] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48 [109749.337847] [<ffffffff81260642>] __list_del_entry+0x5a/0x98 [109749.338678] [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs] [109749.340145] [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs] [109749.348313] [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs] [109749.349745] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf [109749.350819] [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs] [109749.351976] [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e [109749.360341] [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e [109749.368828] [<ffffffff8118ee1d>] do_fsync+0x34/0x4e [109749.369790] [<ffffffff8118f045>] SyS_fsync+0x10/0x14 [109749.370925] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f [109749.382274] ---[ end trace 48e0d07f7c03d95a ]--- On a non-debug kernel this leads to invalid memory accesses, causing a crash. Fix this by using list_splice_init() instead of list_splice() in btrfs_commit_transaction() and btrfs_end_transaction(). Cc: stable@vger.kernel.org Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3" Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
2015-07-11Btrfs: fix memory leak in the extent_same ioctlFilipe Manana
We were allocating memory with memdup_user() but we were never releasing that memory. This affected pretty much every call to the ioctl, whether it deduplicated extents or not. This issue was reported on IRC by Julian Taylor and on the mailing list by Marcel Ritter, credit goes to them for finding the issue. Reported-by: Julian Taylor <jtaylor.debian@googlemail.com> Reported-by: Marcel Ritter <ritter.marcel@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de>
2015-07-11Btrfs: fix shrinking truncate when the no_holes feature is enabledFilipe Manana
If the no_holes feature is enabled, we attempt to shrink a file to a size that ends up in the middle of a hole and we don't have any file extent items in the fs/subvol tree that go beyond the new file size (or any ordered extents that will insert such file extent items), we end up not updating the inode's disk_i_size, we only update the inode's i_size. This means that after unmounting and mounting the filesystem, or after the inode is evicted and reloaded, its i_size ends up being incorrect (an inode's i_size is set to the disk_i_size field when an inode is loaded). This happens when btrfs_truncate_inode_items() doesn't find any file extent items to drop - in this case it never makes a call to btrfs_ordered_update_i_size() in order to update the inode's disk_i_size. Example reproducer: $ mkfs.btrfs -O no-holes -f /dev/sdd $ mount /dev/sdd /mnt # Create our test file with some data and durably persist it. $ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo $ sync # Append some data to the file, increasing its size, and leave a hole # between the old size and the start offset if the following write. So # our file gets a hole in the range [128Kb, 256Kb[. $ xfs_io -c "truncate 160K" /mnt/foo # We expect to see our file with a size of 160Kb, with the first 128Kb # of data all having the value 0xaa and the remaining 32Kb of data all # having the value 0x00. $ od -t x1 /mnt/foo 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0500000 # Now cleanly unmount and mount again the filesystem. $ umount /mnt $ mount /dev/sdd /mnt # We expect to get the same result as before, a file with a size of # 160Kb, with the first 128Kb of data all having the value 0xaa and the # remaining 32Kb of data all having the value 0x00. $ od -t x1 /mnt/foo 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0400000 In the example above the file size/data do not match what they were before the remount. Fix this by always calling btrfs_ordered_update_i_size() with a size matching the size the file was truncated to if btrfs_truncate_inode_items() is not called for a log tree and no file extent items were dropped. This ensures the same behaviour as when the no_holes feature is not enabled. A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-07-11Merge branch 'for-linus-4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is an assortment of fixes. Most of the commits are from Filipe (fsync, the inode allocation cache and a few others). Mark kicked in a series fixing corners in the extent sharing ioctls, and everyone else fixed up on assorted other problems" * 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix wrong check for btrfs_force_chunk_alloc() Btrfs: fix warning of bytes_may_use Btrfs: fix hang when failing to submit bio of directIO Btrfs: fix a comment in inode.c:evict_inode_truncate_pages() Btrfs: fix memory corruption on failure to submit bio for direct IO btrfs: don't update mtime/ctime on deduped inodes btrfs: allow dedupe of same inode btrfs: fix deadlock with extent-same and readpage btrfs: pass unaligned length to btrfs_cmp_data() Btrfs: fix fsync after truncate when no_holes feature is enabled Btrfs: fix fsync xattr loss in the fast fsync path Btrfs: fix fsync data loss after append write Btrfs: fix crash on close_ctree() if cleaner starts new transaction Btrfs: fix race between caching kthread and returning inode to inode cache Btrfs: use kmem_cache_free when freeing entry in inode cache Btrfs: fix race between balance and unused block group deletion btrfs: add error handling for scrub_workers_get() btrfs: cleanup noused initialization of dev in btrfs_end_bio() btrfs: qgroup: allow user to clear the limitation on qgroup
2015-07-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: "Assorted VFS fixes and related cleanups (IMO the most interesting in that part are f_path-related things and Eric's descriptor-related stuff). UFS regression fixes (it got broken last cycle). 9P fixes. fs-cache series, DAX patches, Jan's file_remove_suid() work" [ I'd say this is much more than "fixes and related cleanups". The file_table locking rule change by Eric Dumazet is a rather big and fundamental update even if the patch isn't huge. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits) 9p: cope with bogus responses from server in p9_client_{read,write} p9_client_write(): avoid double p9_free_req() 9p: forgetting to cancel request on interrupted zero-copy RPC dax: bdev_direct_access() may sleep block: Add support for DAX reads/writes to block devices dax: Use copy_from_iter_nocache dax: Add block size note to documentation fs/file.c: __fget() and dup2() atomicity rules fs/file.c: don't acquire files->file_lock in fd_install() fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation vfs: avoid creation of inode number 0 in get_next_ino namei: make set_root_rcu() return void make simple_positive() public ufs: use dir_pages instead of ufs_dir_pages() pagemap.h: move dir_pages() over there remove the pointless include of lglock.h fs: cleanup slight list_entry abuse xfs: Correctly lock inode when removing suid and file capabilities fs: Call security_ops->inode_killpriv on truncate fs: Provide function telling whether file_remove_privs() will do anything ...
2015-07-01Btrfs: fix wrong check for btrfs_force_chunk_alloc()Shilong Wang
btrfs_force_chunk_alloc() return 1 for allocation chunk successfully. This problem exists since commit c87f08ca4. With this patch, we might fix some enospc problems for balances. Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01Btrfs: fix warning of bytes_may_useLiu Bo
While running generic/019, dmesg got several warnings from btrfs_free_reserved_data_space(). Test generic/019 produces some disk failures so sumbit dio will get errors, in which case, btrfs_direct_IO() goes to the error handling and free bytes_may_use, but the problem is that bytes_may_use has been free'd during get_block(). This adds a runtime flag to show if we've gone through get_block(), if so, don't do the cleanup work. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01Btrfs: fix hang when failing to submit bio of directIOLiu Bo
The hang is uncoverd by generic/019. btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits an error, thus those added ordered extents will never get processed, which block processes that waiting for them via btrfs_start_ordered_extent(). This fixes the above, and meanwhile finish_ordered_fn will do the space accounting work. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01Btrfs: fix a comment in inode.c:evict_inode_truncate_pages()Filipe Manana
The comment was not correct about the part where it says the endio callback of the bio might have not yet been called - update it to mention that by that time the endio callback execution might still be in progress only. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01Btrfs: fix memory corruption on failure to submit bio for direct IOFilipe Manana
If we fail to submit a bio for a direct IO request, we were grabbing the corresponding ordered extent and decrementing its reference count twice, once for our lookup reference and once for the ordered tree reference. This was a problem because it caused the ordered extent to be freed without removing it from the ordered tree and any lists it might be attached to, leaving dangling pointers to the ordered extent around. Example trace with CONFIG_DEBUG_PAGEALLOC=y: [161779.858707] BUG: unable to handle kernel paging request at 0000000087654330 [161779.859983] IP: [<ffffffff8124ca68>] rb_prev+0x22/0x3b [161779.860636] PGD 34d818067 PUD 0 [161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC (...) [161779.860636] Call Trace: [161779.860636] [<ffffffffa06b36a6>] __tree_search+0xd9/0xf9 [btrfs] [161779.860636] [<ffffffffa06b3708>] tree_search+0x42/0x63 [btrfs] [161779.860636] [<ffffffffa06b4868>] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs] [161779.860636] [<ffffffffa06b4873>] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs] [161779.860636] [<ffffffffa06aab8e>] btrfs_get_blocks_direct+0x11b/0x615 [btrfs] [161779.860636] [<ffffffff8119727f>] do_blockdev_direct_IO+0x5ff/0xb43 [161779.860636] [<ffffffffa06aaa73>] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs] [161779.860636] [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs] [161779.860636] [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34 [161779.860636] [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs] [161779.860636] [<ffffffffa06a10ae>] btrfs_direct_IO+0x198/0x21f [btrfs] [161779.860636] [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs] [161779.860636] [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128 [161779.860636] [<ffffffffa06affaa>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs] [161779.860636] [<ffffffffa06b004c>] btrfs_file_write_iter+0x201/0x3e0 [btrfs] (...) We were also not freeing the btrfs_dio_private we allocated previously, which kmemleak reported with the following trace in its sysfs file: unreferenced object 0xffff8803f553bf80 (size 96): comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s) hex dump (first 32 bytes): 88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00 .l.............. 00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00 ................ backtrace: [<ffffffff81161ffe>] create_object+0x172/0x29a [<ffffffff8145870f>] kmemleak_alloc+0x25/0x41 [<ffffffff81154e64>] kmemleak_alloc_recursive.constprop.40+0x16/0x18 [<ffffffff811579ed>] kmem_cache_alloc_trace+0xfb/0x148 [<ffffffffa03d8cff>] btrfs_submit_direct+0x65/0x16a [btrfs] [<ffffffff811968dc>] dio_bio_submit+0x62/0x8f [<ffffffff811975fe>] do_blockdev_direct_IO+0x97e/0xb43 [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34 [<ffffffffa03d70ae>] btrfs_direct_IO+0x198/0x21f [btrfs] [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128 [<ffffffffa03e604d>] btrfs_file_write_iter+0x201/0x3e0 [btrfs] [<ffffffff8116586a>] __vfs_write+0x7c/0xa5 [<ffffffff81165da9>] vfs_write+0xa0/0xe4 [<ffffffff81166675>] SyS_pwrite64+0x64/0x82 [<ffffffff81464fd7>] system_call_fastpath+0x12/0x6f [<ffffffffffffffff>] 0xffffffffffffffff For read requests we weren't doing any cleanup either (none of the work done by btrfs_endio_direct_read()), so a failure submitting a bio for a read request would leave a range in the inode's io_tree locked forever, blocking any future operations (both reads and writes) against that range. So fix this by making sure we do the same cleanup that we do for the case where the bio submission succeeds. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01btrfs: don't update mtime/ctime on deduped inodesMark Fasheh
One issue users have reported is that dedupe changes mtime on files, resulting in tools like rsync thinking that their contents have changed when in fact the data is exactly the same. We also skip the ctime update as no user-visible metadata changes here and we want dedupe to be transparent to the user. Clone still wants time changes, so we special case this in the code. This was tested with the btrfs-extent-same tool. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01btrfs: allow dedupe of same inodeMark Fasheh
clone() supports cloning within an inode so extent-same can do the same now. This patch fixes up the locking in extent-same to know about the single-inode case. In addition to that, we add a check for overlapping ranges, which clone does not allow. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01btrfs: fix deadlock with extent-same and readpageMark Fasheh
->readpage() does page_lock() before extent_lock(), we do the opposite in extent-same. We want to reverse the order in btrfs_extent_same() but it's not quite straightforward since the page locks are taken inside btrfs_cmp_data(). So I split btrfs_cmp_data() into 3 parts with a small context structure that is passed between them. The first, btrfs_cmp_data_prepare() gathers up the pages needed (taking page lock as required) and puts them on our context structure. At this point, we are safe to lock the extent range. Afterwards, we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free() to clean up our context. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01btrfs: pass unaligned length to btrfs_cmp_data()Mark Fasheh
In the case that we dedupe the tail of a file, we might expand the dedupe len out to the end of our last block. We don't want to compare data past i_size however, so pass the original length to btrfs_cmp_data(). Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01Btrfs: fix fsync after truncate when no_holes feature is enabledFilipe Manana
When we have the no_holes feature enabled, if a we truncate a file to a smaller size, truncate it again but to a size greater than or equals to its original size and fsync it, the log tree will not have any information about the hole covering the range [truncate_1_offset, new_file_size[. Which means if the fsync log is replayed, the file will remain with the state it had before both truncate operations. Without the no_holes feature this does not happen, since when the inode is logged (full sync flag is set) it will find in the fs/subvol tree a leaf with a generation matching the current transaction id that has an explicit extent item representing the hole. Fix this by adding an explicit extent item representing a hole between the last extent and the inode's i_size if we are doing a full sync. The issue is easy to reproduce with the following test case for fstests: . ./common/rc . ./common/filter . ./common/dmflakey _need_to_be_root _supported_fs generic _supported_os Linux _require_scratch _require_dm_flakey # This test was motivated by an issue found in btrfs when the btrfs # no-holes feature is enabled (introduced in kernel 3.14). So enable # the feature if the fs being tested is btrfs. if [ $FSTYP == "btrfs" ]; then _require_btrfs_fs_feature "no_holes" _require_btrfs_mkfs_feature "no-holes" MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes" fi rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create our test files and make sur