summaryrefslogtreecommitdiffstats
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2018-12-17btrfs: document extent mapping assumptions in checksumJohannes Thumshirn
Document why map_private_extent_buffer() cannot return '1' (i.e. the map spans two pages) for the csum_tree_block() case. The current algorithm for detecting a page boundary crossing in map_private_extent_buffer() will return a '1' *IFF* the extent buffer's offset in the page + the offset passed in by csum_tree_block() and the minimal length passed in by csum_tree_block() - 1 are bigger than PAGE_SIZE. We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32 and the current extent buffer allocator always guarantees page aligned extends, so the above condition can't be true. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: don't initialize 'offset' in map_private_extent_buffer()Johannes Thumshirn
In map_private_extent_buffer() the 'offset' variable is initialized to a page aligned version of the 'start' parameter. But later on it is overwritten with either the offset from the extent buffer's start or 0. So get rid of the initial initialization. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: fix deadlock with memory reclaim during scrubFilipe Manana
When a transaction commit starts, it attempts to pause scrub and it blocks until the scrub is paused. So while the transaction is blocked waiting for scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub, otherwise we risk getting into a deadlock with reclaim. Checking for scrub pause requests is done early at the beginning of the while loop of scrub_stripe() and later in the loop, scrub_extent() and scrub_raid56_parity() are called, which in turn call scrub_pages() and scrub_pages_for_parity() respectively. These last two functions do memory allocations using GFP_KERNEL. Same problem could happen while scrubbing the super blocks, since it calls scrub_pages(). We also can not have any of the worker tasks, created by the scrub task, doing GFP_KERNEL allocations, because before pausing, the scrub task waits for all the worker tasks to complete (also done at scrub_stripe()). So make sure GFP_NOFS is used for the memory allocations because at any time a scrub pause request can happen from another task that started to commit a transaction. Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path") CC: stable@vger.kernel.org # 4.6+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extent_io_ops::readpage_io_failed_hookNikolay Borisov
For data inodes this hook does nothing but to return -EAGAIN which is used to signal to the endio routines that this bio belongs to a data inode. If this is the case the actual retrying is handled by bio_readpage_error. Alternatively, if this bio belongs to the btree inode then btree_io_failed_hook just does some cleanup and doesn't retry anything. This patch simplifies the code flow by eliminating readpage_io_failed_hook and instead open-coding btree_io_failed_hook in end_bio_extent_readpage. Also eliminate some needless checks since IO is always performed on either data inode or btree inode, both of which are guaranteed to have their extent_io_tree::ops set. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: remove btrfs_bio_end_io_tJohannes Thumshirn
The btrfs_bio_end_io_t typedef was introduced with commit a1d3c4786a4b ("btrfs: btrfs_multi_bio replaced with btrfs_bio") but never used anywhere. This commit also introduced a forward declaration of 'struct btrfs_bio' which is only needed for btrfs_bio_end_io_t. Remove both as they're not needed anywhere. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: replace btrfs_io_bio::end_io with a simple helperDavid Sterba
The end_io callback implemented as btrfs_io_bio_endio_readpage only calls kfree. Also the callback is set only in case the csum buffer is allocated and not pointing to the inline buffer. We can use that information to drop the indirection and call a helper that will free the csums only in the right case. This shrinks struct btrfs_io_bio by 8 bytes. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: remove redundant csum buffer in btrfs_io_bioDavid Sterba
The io_bio tracks checksums and has an inline buffer or an allocated one. And there's a third member that points to the right one, but we don't need to use an extra pointer for that. Let btrfs_io_bio::csum point to the right buffer and check that the inline buffer is not accidentally freed. This shrinks struct btrfs_io_bio by 8 bytes. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: replace async_cow::root with fs_infoDavid Sterba
The async_cow::root is used to propagate fs_info to async_cow_submit. We can't use inode to reach it because it could become NULL after write without compression in async_cow_start. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: merge btrfs_submit_bio_done to its callerDavid Sterba
There's one caller and its code is simple, we can open code it in run_one_async_done. The errors are passed through bio. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: balance: print to system log when balance ends or is pausedAnand Jain
Print a kernel log message when the balance ends, either for cancel or completed or if it is paused. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: balance: print args during start and resumeAnand Jain
The information about balance arguments is important for system audit, this patch prints the textual representation when balance starts or is resumed. Example command: $ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs Example kernel log output: BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1 Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog, simplify code ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: add helper to describe block group flagsAnand Jain
Factor out helper that describes block group flags from describe_relocation. The result will not be longer than the given size. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creationFilipe Manana
If the quota enable and snapshot creation ioctls are called concurrently we can get into a deadlock where the task enabling quotas will deadlock on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it twice, or the task creating a snapshot tries to commit the transaction while the task enabling quota waits for the former task to commit the transaction while holding the mutex. The following time diagrams show how both cases happen. First scenario: CPU 0 CPU 1 btrfs_ioctl() btrfs_ioctl_quota_ctl() btrfs_quota_enable() mutex_lock(fs_info->qgroup_ioctl_lock) btrfs_start_transaction() btrfs_ioctl() btrfs_ioctl_snap_create_v2 create_snapshot() --> adds snapshot to the list pending_snapshots of the current transaction btrfs_commit_transaction() create_pending_snapshots() create_pending_snapshot() qgroup_account_snapshot() btrfs_qgroup_inherit() mutex_lock(fs_info->qgroup_ioctl_lock) --> deadlock, mutex already locked by this task at btrfs_quota_enable() Second scenario: CPU 0 CPU 1 btrfs_ioctl() btrfs_ioctl_quota_ctl() btrfs_quota_enable() mutex_lock(fs_info->qgroup_ioctl_lock) btrfs_start_transaction() btrfs_ioctl() btrfs_ioctl_snap_create_v2 create_snapshot() --> adds snapshot to the list pending_snapshots of the current transaction btrfs_commit_transaction() --> waits for task at CPU 0 to release its transaction handle btrfs_commit_transaction() --> sees another task started the transaction commit first --> releases its transaction handle --> waits for the transaction commit to be completed by the task at CPU 1 create_pending_snapshot() qgroup_account_snapshot() btrfs_qgroup_inherit() mutex_lock(fs_info->qgroup_ioctl_lock) --> deadlock, task at CPU 0 has the mutex locked but it is waiting for us to finish the transaction commit So fix this by setting the quota enabled flag in fs_info after committing the transaction at btrfs_quota_enable(). This ends up serializing quota enable and snapshot creation as if the snapshot creation happened just before the quota enable request. The quota rescan task, scheduled after committing the transaction in btrfs_quote_enable(), will do the accounting. Fixes: 6426c7ad697d ("btrfs: qgroup: Fix qgroup accounting when creating snapshot") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: fix access to available allocation bits when starting balanceFilipe Manana
The available allocation bits members from struct btrfs_fs_info are protected by a sequence lock, and when starting balance we access them incorrectly in two different ways: 1) In the read sequence lock loop at btrfs_balance() we use the values we read from fs_info->avail_*_alloc_bits and we can immediately do actions that have side effects and can not be undone (printing a message and jumping to a label). This is wrong because a retry might be needed, so our actions must not have side effects and must be repeatable as long as read_seqretry() returns a non-zero value. In other words, we were essentially ignoring the sequence lock; 2) Right below the read sequence lock loop, we were reading the values from avail_metadata_alloc_bits and avail_data_alloc_bits without any protection from concurrent writers, that is, reading them outside of the read sequence lock critical section. So fix this by making sure we only read the available allocation bits while in a read sequence lock critical section and that what we do in the critical section is repeatable (has nothing that can not be undone) so that any eventual retry that is needed is handled properly. Fixes: de98ced9e743 ("Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits") Fixes: 14506127979a ("btrfs: fix a bogus warning when converting only data or metadata") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: allow clear_extent_dirty() to receive a cached extent state recordFilipe Manana
We can have a lot freed extents during the life span of transaction, so the red black tree that keeps track of the ranges of each freed extent (fs_info->freed_extents[]) can get quite big. When finishing a transaction commit we find each range, process it (discard the extents, unpin them) and then remove it from the red black tree. We can use an extent state record as a cache when searching for a range, so that when we clean the range we can use the cached extent state we passed to the search function instead of iterating the red black tree again. Doing things as fast as possible when finishing a transaction (in state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we block another task that wants to commit the next transaction. So change clear_extent_dirty() to allow an optional extent state record to be passed as an argument, which will be passed down to __clear_extent_bit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Handle final split-brain possibility during fsid changeNikolay Borisov
This patch lands the last case which needs to be handled by the fsid change code. Namely, this is the case where a multidisk filesystem has already undergone at least one successful fsid change i.e all disks have the METADATA_UUID incompat bit and power failure occurs as another fsid change is in progress. When such an event occurs, disks could be split in 2 groups. One of the groups will have both METADATA_UUID and CHANGING_FSID_V2 flags set coupled with old fsid/metadata_uuid pairs. The other group of disks will have only METADATA_UUID bit set and their fsid will be different than the one in disks in the first group. Here we look at the following cases: a) A disk from the first group is scanned first, so fs_devices is created with stale fsid/metdata_uuid. Then when a disk from the second group is scanned it needs to first check whether there exists such an fs_devices that has fsid_change set to true (because it was created with a disk having the CHANGING_FSID_V2 flag), the metadata_uuid and fsid of the fs_devices will be different (since it was created by a disk which already has had at least 1 successful fsid change) and finally the metadata_uuid of the fs_devices will equal that of the currently scanned disk (because metadata_uuid never really changes). When the correct fs_devices is found the information from the scanned disk will replace the current one in fs_devices since the scanned disk will have higher generation number. b) A disk from the second group is scanned so fs_devices is created as usual with differing fsid/metdata_uid. Then when a disk from the first group is scanned the code detects that it has both CHANGING_FSID_V2 and METADATA_UUID flags set and will search for fs_devices that has differing metadata_uuid/fsid and whose metadata_uuid is the same as that of the scanned device. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Handle one more split-brain scenario during fsid changeNikolay Borisov
This commit continues hardening the scanning code to handle cases where power loss could have caused disks in a multi-disk filesystem to be in inconsistent state. Namely handle the situation that can occur when some of the disks in multi-disk fs have completed their fsid change i.e they have METADATA_UUID incompat flag set, have cleared the CHANGING_FSID_V2 flag and their fsid/metadata_uuid are different. At the same time the other half of the disks will have their fsid/metadata_uuid unchanged and will only have CHANGING_FSID_V2 flag. This is handled by introducing code in the scan path which: a) Handles the case when a device with CHANGING_FSID_V2 flag is scanned and as a result btrfs_fs_devices is created with matching fsid/metdata_uuid. Subsequently, when a device with completed fsid change is scanned it will detect this via the new code in find_fsid i.e that such an fs_devices exist that fsid_change flag is set to true, it's metadata_uuid/fsid match and the metadata_uuid of the scanned device matches that of the fs_devices. In this case, it's important to note that the devices which has its fsid change completed will have a higher generation number than the device with FSID_CHANGING_V2 flag set, so its superblock block will be used during mount. To prevent an assertion triggering because the sb used for mounting will have differing fsid/metadata_uuid than the ones in the fs_devices struct also add code in device_list_add which overwrites the values in fs_devices. b) Alternatively we can end up with a device that completed its fsid change be scanned first which will create the respective btrfs_fs_devices struct with differing fsid/metadata_uuid. In this case when a device with FSID_CHANGING_V2 flag set is scanned it will call the newly added find_fsid_inprogress function which will return the correct fs_devices. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: add members to fs_devices to track fsid changesNikolay Borisov
In order to gracefully handle split-brain scenario during fsid change (which are very unlikely, yet possible), two more pieces of information will be necessary: 1. The highest generation number among all devices registered to a particular btrfs_fs_devices 2. A boolean flag whether a given btrfs_fs_devices was created by a device which had the FSID_CHANGING_V2 flag set. This is a preparatory patch and just introduces the variables as well as code which sets them, their actual use is going to happen in a later patch. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Add handling for disk split-brain scenario during fsid changeNikolay Borisov
Even though fsid change without rewrite is a very quick operation it's still possible to experience a split-brain scenario if power loss occurs at the most inconvenient time. This patch handles the case where power failure occurs while the first transaction (the one setting CHANGING_FSID_V2) flag is being persisted on disk. This can cause the btrfs_fs_devices of this filesystem to be created by a device which: a) has the CHANGING_FSID_V2 flag set but its fsid value is intact b) or a device which doesn't have CHANGING_FSID_V2 flag set and its fsid value is intact This situation is trivially handled by the current find_fsid code since in both cases the devices are going to be treated like ordinary devices. Since btrfs is always mounted using the superblock of the latest device (the one with highest generation number), meaning it will have the CHANGING_FSID_V2 flag set, ensure it's being cleared on mount. On the first transaction commit following mount all disks will have it cleared. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove fsid/metadata_fsid fields from btrfs_infoNikolay Borisov
Currently btrfs_fs_info structure contains a copy of the fsid/metadata_uuid fields. Same values are also contained in the btrfs_fs_devices structure which fs_info has a reference to. Let's reduce duplication by removing the fields from fs_info and always refer to the ones in fs_devices. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Add sysfs support for metadata_uuid featureNikolay Borisov
Since the metadata_uuid is a new incompat feature it requires the respective sysfs hooks. This patch adds the 'metdata_uuid' feature to be shown if it supported by the kernel. Additionally it adds /sys/fs/btrfs/UUID/metadata_uuid attribute which allows one to read the current metadata_uuid. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Introduce support for FSID change without metadata rewriteNikolay Borisov
This field is going to be used when the user wants to change the UUID of the filesystem without having to rewrite all metadata blocks. This field adds another level of indirection such that when the FSID is changed what really happens is the current UUID (the one with which the fs was created) is copied to the 'metadata_uuid' field in the superblock as well as a new incompat flag is set METADATA_UUID. When the kernel detects this flag is set it knows that the superblock in fact has 2 UUIDs: 1. Is the UUID which is user-visible, currently known as FSID. 2. Metadata UUID - this is the UUID which is stamped into all on-disk datastructures belonging to this file system. When the new incompat flag is present device scanning checks whether both fsid/metadata_uuid of the scanned device match any of the registered filesystems. When the flag is not set then both UUIDs are equal and only the FSID is retained on disk, metadata_uuid is set only in-memory during mount. Additionally a new metadata_uuid field is also added to the fs_info struct. It's initialised either with the FSID in case METADATA_UUID incompat flag is not set or with the metdata_uuid of the superblock otherwise. This commit introduces the new fields as well as the new incompat flag and switches all users of the fsid to the new logic. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor updates in comments ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: use EXPORT_FOR_TESTS for conditionally exported functionsJohannes Thumshirn
Several functions in BTRFS are only used inside the source file they are declared if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not defined. However if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is defined these functions are shared with the unit tests code. Before the introduction of the EXPORT_FOR_TESTS macro, these functions could not be declared as static and the compiler had a harder task when optimizing and inlining them. As we have EXPORT_FOR_TESTS now, use it where appropriate to support the compiler. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: introduce EXPORT_FOR_TESTS macroJohannes Thumshirn
Depending on whether CONFIG_BTRFS_FS_RUN_SANITY_TESTS is set, some BTRFS functions are either local to the file they are implemented in and thus should be declared static or are called from within the test implementation defined in a different file. Introduce an EXPORT_FOR_TESTS macro which depending on CONFIG_BTRFS_FS_RUN_SANITY_TESTS either adds the 'static' keyword to a function or not. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: remove unused drop_on_err in btrfs_mkdirJohannes Thumshirn
Up to commit 32955c5422a8 ("btrfs: switch to discard_new_inode()") the drop_on_err variable in btrfs_mkdir() was used to check whether the inode had to be dropped via iput(). After commit 32955c5422a8 ("btrfs: switch to discard_new_inode()") discard_new_inode() is called when err is set and inode is non NULL. Therefore drop_on_err is not used anymore and thus causes a warning when building with -Wunused-but-set-variable. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Replace BUG_ON with ASSERT in find_lock_delalloc_rangeNikolay Borisov
lock_delalloc_pages should only return 2 values - 0 in case of success and -EAGAIN if the range of pages to be locked should be shrunk due to some of gone. Manual inspections confirms that this is indeed the case since __process_pages_contig is where lock_delalloc_pages gets its return value. The latter always returns 0 or -EAGAIN so the invariant holds. No functional changes. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Sink find_lock_delalloc_range's 'max_bytes' argumentNikolay Borisov
All callers of this function pass BTRFS_MAX_EXTENT_SIZE (128M) so let's reduce the argument count and make that a local variable. No functional changes. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove superfluous check form btrfs_remove_chunkNikolay Borisov
It's unnecessary to check map->stripes[i].dev for NULL given its value is already set and dereferenced above the the check. No functional changes. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: don't report user-requested cancel as an errorAnand Jain
As of now only user requested replace cancel can cancel the replace-scrub so no need to log the error. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: silence warning if replace is canceledAnand Jain
When we successfully cancel the device replace, its scrub worker returns -ECANCELED, which is then passed to btrfs_dev_replace_finishing. It cleans up based on the returned status and propagates the same -ECANCELED back the parent function. As of now only user can cancel the replace-scrub, so its ok to silence the warning here. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: dev-replace: add explicit check for replace result "no error"Anand Jain
We recast the replace return status BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS to 0, to indicate no error. And since BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR should also return 0, which is also declared as 0, so we just return. Instead add it to the if statement so that there is enough clarity while reading the code. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: dev-replace: replace's scrub must not be running in suspended stateAnand Jain
When the replace state is in the suspended state, btrfs_scrub_cancel() should fail with -ENOTCONN as there is no scrub running. As a safety catch check if btrfs_scrub_cancel() returns -ENOTCONN and assert if it doesn't. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: dev-replace: set result code of cancel by status of scrubAnand Jain
The device-replace needs to check the result code of the scrub workers in btrfs_dev_replace_cancel and distinguish if successful cancel operation and when the there was no operation running. If btrfs_scrub_cancel() fails, return BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED so that user can try to cancel the replace again. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: fix use-after-free due to race between replace start and cancelAnand Jain
The device replace cancel thread can race with the replace start thread and if fs_info::scrubs_running is not yet set, btrfs_scrub_cancel() will fail to stop the scrub thread. The scrub thread continues with the scrub for replace which then will try to write to the target device and which is already freed by the cancel thread. scrub_setup_ctx() warns as tgtdev is NULL. struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace) { ... if (is_dev_replace) { WARN_ON(!fs_info->dev_replace.tgtdev); <=== sctx->pages_per_wr_bio = SCRUB_PAGES_PER_WR_BIO; sctx->wr_tgtdev = fs_info->dev_replace.tgtdev; sctx->flush_all_writes = false; } [ 6724.497655] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc started [ 6753.945017] BTRFS info (device sdb): dev_replace from /dev/sdb (devid 1) to /dev/sdc canceled [ 6852.426700] WARNING: CPU: 0 PID: 4494 at fs/btrfs/scrub.c:622 scrub_setup_ctx.isra.19+0x220/0x230 [btrfs] ... [ 6852.428928] RIP: 0010:scrub_setup_ctx.isra.19+0x220/0x230 [btrfs] ... [ 6852.432970] Call Trace: [ 6852.433202] btrfs_scrub_dev+0x19b/0x5c0 [btrfs] [ 6852.433471] btrfs_dev_replace_start+0x48c/0x6a0 [btrfs] [ 6852.433800] btrfs_dev_replace_by_ioctl+0x3a/0x60 [btrfs] [ 6852.434097] btrfs_ioctl+0x2476/0x2d20 [btrfs] [ 6852.434365] ? do_sigaction+0x7d/0x1e0 [ 6852.434623] do_vfs_ioctl+0xa9/0x6c0 [ 6852.434865] ? syscall_trace_enter+0x1c8/0x310 [ 6852.435124] ? syscall_trace_enter+0x1c8/0x310 [ 6852.435387] ksys_ioctl+0x60/0x90 [ 6852.435663] __x64_sys_ioctl+0x16/0x20 [ 6852.435907] do_syscall_64+0x50/0x180 [ 6852.436150] entry_SYSCALL_64_after_hwframe+0x49/0xbe Further, as the replace thread enters scrub_write_page_to_dev_replace() without the target device it panics: static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx, struct scrub_page *spage) { ... bio_set_dev(bio, sbio->dev->bdev); <====== [ 6929.715145] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a0 .. [ 6929.717106] Workqueue: btrfs-scrub btrfs_scrub_helper [btrfs] [ 6929.717420] RIP: 0010:scrub_write_page_to_dev_replace+0xb4/0x260 [btrfs] .. [ 6929.721430] Call Trace: [ 6929.721663] scrub_write_block_to_dev_replace+0x3f/0x60 [btrfs] [ 6929.721975] scrub_bio_end_io_worker+0x1af/0x490 [btrfs] [ 6929.722277] normal_work_helper+0xf0/0x4c0 [btrfs] [ 6929.722552] process_one_work+0x1f4/0x520 [ 6929.722805] ? process_one_work+0x16e/0x520 [ 6929.723063] worker_thread+0x46/0x3d0 [ 6929.723313] kthread+0xf8/0x130 [ 6929.723544] ? process_one_work+0x520/0x520 [ 6929.723800] ? kthread_delayed_work_timer_fn+0x80/0x80 [ 6929.724081] ret_from_fork+0x3a/0x50 Fix this by letting the btrfs_dev_replace_finishing() to do the job of cleaning after the cancel, including freeing of the target device. btrfs_dev_replace_finishing() is called when btrfs_scub_dev() returns along with the scrub return status. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: dev-replace: go back to suspend state if another EXCL_OP is runningAnand Jain
In a secnario where balance and replace co-exists as below, - start balance - pause balance - start replace - reboot and when system restarts, balance resumes first. Then the replace is attempted to restart but will fail as the EXCL_OP lock is already held by the balance. If so place the replace state back to BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state. Fixes: 010a47bde9420 ("btrfs: add proper safety check before resuming dev-replace") CC: stable@vger.kernel.org # 4.18+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: dev-replace: go back to suspended state if target device is missingAnand Jain
At the time of forced unmount we place the running replace to BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state, so when the system comes back and expect the target device is missing. Then let the replace state continue to be in BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state instead of BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED as there isn't any matching scrub running as part of replace. Fixes: e93c89c1aaaa ("Btrfs: add new sources for device replace code") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: mark btrfs_dev_replace_start as staticAnand Jain
There isn't any other consumer other than in its own file dev-replace.c. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: harden agaist duplicate fsid on scanned devicesAnand Jain
It's not that impossible to imagine that a device OR a btrfs image is copied just by using the dd or the cp command. Which in case both the copies of the btrfs will have the same fsid. If on the system with automount enabled, the copied FS gets scanned. We have a known bug in btrfs, that we let the device path be changed after the device has been mounted. So using this loop hole the new copied device would appears as if its mounted immediately after it's been copied. For example: Initially.. /dev/mmcblk0p4 is mounted as / $ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT mmcblk0 179:0 0 29.2G 0 disk |-mmcblk0p4 179:4 0 4G 0 part / |-mmcblk0p2 179:2 0 500M 0 part /boot |-mmcblk0p3 179:3 0 256M 0 part [SWAP] `-mmcblk0p1 179:1 0 256M 0 part /boot/efi $ btrfs fi show Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid 1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4 Copy mmcblk0 to sda $ dd if=/dev/mmcblk0 of=/dev/sda And immediately after the copy completes the change in the device superblock is notified which the automount scans using btrfs device scan and the new device sda becomes the mounted root device. $ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 1 14.9G 0 disk |-sda4 8:4 1 4G 0 part / |-sda2 8:2 1 500M 0 part |-sda3 8:3 1 256M 0 part `-sda1 8:1 1 256M 0 part mmcblk0 179:0 0 29.2G 0 disk |-mmcblk0p4 179:4 0 4G 0 part |-mmcblk0p2 179:2 0 500M 0 part /boot |-mmcblk0p3 179:3 0 256M 0 part [SWAP] `-mmcblk0p1 179:1 0 256M 0 part /boot/efi $ btrfs fi show / Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid 1 size 4.00GiB used 3.00GiB path /dev/sda4 The bug is quite nasty that you can't either unmount /dev/sda4 or /dev/mmcblk0p4. And the problem does not get solved until you take sda out of the system on to another system to change its fsid using the 'btrfstune -u' command. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: introduce nparity raid_attrHans van Kranenburg
Instead of hardcoding exceptions for RAID5 and RAID6 in the code, use an nparity field in raid_attr. Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: fix ncopies raid_attr for RAID56Hans van Kranenburg
RAID5 and RAID6 profile store one copy of the data, not 2 or 3. These values are not yet used anywhere so there's no change. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: alloc_chunk: fix more DUP stripe size handlingHans van Kranenburg
Commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling" fixed calculating the stripe_size for a new DUP chunk. However, the same calculation reappears a bit later, and that one was not changed yet. The resulting bug that is exposed is that the newly allocated device extents ('stripes') can have a few MiB overlap with the next thing stored after them, which is another device extent or the end of the disk. The scenario in which this can happen is: * The block device for the filesystem is less than 10GiB in size. * The amount of contiguous free unallocated disk space chosen to use for chunk allocation is 20% of the total device size, or a few MiB more or less. An example: - The filesystem device is 7880MiB (max_chunk_size gets set to 788MiB) - There's 1578MiB unallocated raw disk space left in one contiguous piece. In this case stripe_size is first calculated as 789MiB, (half of 1578MiB). Since 789MiB (stripe_size * data_stripes) > 788MiB (max_chunk_size), we enter the if block. Now stripe_size value is immediately overwritten while calculating an adjusted value based on max_chunk_size, which ends up as 788MiB. Next, the value is rounded up to a 16MiB boundary, 800MiB, which is actually more than the value we had before. However, the last comparison fails to detect this, because it's comparing the value with the total amount of free space, which is about twice the size of stripe_size. In the example above, this means that the resulting raw disk space being allocated is 1600MiB, while only a gap of 1578MiB has been found. The second device extent object for this DUP chunk will overlap for 22MiB with whatever comes next. The underlying problem here is that the stripe_size is reused all the time for different things. So, when entering the code in the if block, stripe_size is immediately overwritten with something else. If later we decide we want to have the previous value back, then the logic to compute it was copy pasted in again. With this change, the value in stripe_size is not unnecessarily destroyed, so the duplicated calculation is not needed any more. Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: alloc_chunk: improve chunk size variable nameHans van Kranenburg
The variable num_bytes is really a way too generic name for a variable in this function. There are a dozen other variables that hold a number of bytes as value. Give it a name that actually describes what it does, which is holding the size of the chunk that we're allocating. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: alloc_chunk: do not refurbish num_bytesHans van Kranenburg
The variable num_bytes is used to store the chunk length of the chunk that we're allocating. Do not reuse it for something really different in the same function. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: use tagged writepage to mitigate livelock of snapshotEthan Lien
Snapshot is expected to be fast. But if there are writers steadily creating dirty pages in our subvolume, the snapshot may take a very long time to complete. To fix the problem, we use tagged writepage for snapshot flusher as we do in the generic write_cache_pages(), so we can omit pages dirtied after the snapshot command. This does not change the semantics regarding which data get to the snapshot, if there are pages being dirtied during the snapshotting operation. There's a sync called before snapshot is taken in old/new case, any IO in flight just after that may be in the snapshot but this depends on other system effects that might still sync the IO. We do a simple snapshot speed test on a Intel D-1531 box: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 1m58sec patched: 6.54sec This is the best case for this patch since for a sequential write case, we omit nearly all pages dirtied after the snapshot command. For a multi writers, random write test: fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120 --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5; time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio original: 15.83sec patched: 10.35sec The improvement is smaller compared to the sequential write case, since we omit only half of the pages dirtied after snapshot command. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove unused extent_state argument from ↵Nikolay Borisov
btrfs_writepage_endio_finish_ordered This parameter was never used, yet was part of the interface of the function ever since its introduction as extent_io_ops::writepage_end_io_hook in e6dcd2dc9c48 ("Btrfs: New data=ordered implementation"). Now that NULL is passed everywhere as a value for this parameter let's remove it for good. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extent_page_data argument from writepage_delallocNikolay Borisov
The only remaining use of the 'epd' argument in writepage_delalloc is to reference the extent_io_tree which was set in extent_writepages. Since it is guaranteed that page->mapping of any page passed to writepage_delalloc (and __extent_writepage as the sole caller) to be equal to that passed in extent_writepages we can directly get the io_tree via the already passed inode (which is also taken from page->mapping->host). No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Move epd::extent_locked check to writepage_delalloc's callerNikolay Borisov
If epd::extent_locked is set then writepage_delalloc terminates. Make this a bit more apparent in the caller by simply bubbling the check up. This enables to remove epd as an argument to writepage_delalloc in a future patch. No functional change. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Check for missing device before bio submission in btrfs_map_bioNikolay Borisov
Before btrfs_map_bio submits all stripe bios it does a number of checks to ensure the device for every stripe is present. However, it doesn't do a DEV_STATE_MISSING check, instead this is relegated to the lower level btrfs_schedule_bio (in the async submission case, sync submission doesn't check DEV_STATE_MISSING at all). Additionally btrfs_schedule_bios does the duplicate device->bdev check which has already been performed in btrfs_map_bio. This patch moves the DEV_STATE_MISSING check in btrfs_map_bio and removes the duplicate device->bdev check. Doing so ensures that no bio cloning/submission happens for both async/sync requests in the face of missing device. This makes the async io submission path slightly shorter in terms of instruction count. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: remove redundant replace_state initAnand Jain
dev_replace::replace_state has been set to BTRFS_DEV_REPLACE_ITEM_STATE_NEVER_STARTED (0) in the same function, So delete the line which sets replace_state = 0; Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: remove no longer used io_err from btrfs_log_ctxFilipe Manana
The io_err field of struct btrfs_log_ctx is no longer used after the recent simplification of the fast fsync path, where we now wait for ordered extents to complete before logging the inode. We did this in commit b5e6c3e170b7 ("btrfs: always wait on ordered extents at fsync time") and commit a2120a473a80 ("btrfs: clean up the left over logged_list usage") removed its last use. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>