summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/compression.c
AgeCommit message (Collapse)Author
2020-12-09btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecsQu Wenruo
Refactor btrfs_lookup_bio_sums() by: - Remove the @file_offset parameter There are two factors making the @file_offset parameter useless: * For csum lookup in csum tree, file offset makes no sense We only need disk_bytenr, which is unrelated to file_offset * page_offset (file offset) of each bvec is not contiguous. Pages can be added to the same bio as long as their on-disk bytenr is contiguous, meaning we could have pages at different file offsets in the same bio. Thus passing file_offset makes no sense any more. The only user of file_offset is for data reloc inode, we will use a new function, search_file_offset_in_bio(), to handle it. - Extract the csum tree lookup into search_csum_tree() The new function will handle the csum search in csum tree. The return value is the same as btrfs_find_ordered_sum(), returning the number of found sectors which have checksum. - Change how we do the main loop The only needed info from bio is: * the on-disk bytenr * the length After extracting the above info, we can do the search without bio at all, which makes the main loop much simpler: for (cur_disk_bytenr = orig_disk_bytenr; cur_disk_bytenr < orig_disk_bytenr + orig_len; cur_disk_bytenr += count * sectorsize) { /* Lookup csum tree */ count = search_csum_tree(fs_info, path, cur_disk_bytenr, search_len, csum_dst); if (!count) { /* Csum hole handling */ } } - Use single variable as the source to calculate all other offsets Instead of all different type of variables, we use only one main variable, cur_disk_bytenr, which represents the current disk bytenr. All involved values can be calculated from that variable, and all those variable will only be visible in the inner loop. The above refactoring makes btrfs_lookup_bio_sums() way more robust than it used to be, especially related to the file offset lookup. Now file_offset lookup is only related to data reloc inode, otherwise we don't need to bother file_offset at all. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09btrfs: drop casts of bio bi_sectorDavid Sterba
Since commit 72deb455b5ec ("block: remove CONFIG_LBDAF") (5.2) the sector_t type is u64 on all arches and configs so we don't need to typecast it. It used to be unsigned long and the result of sector size shifts were not guaranteed to fit in the type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: remove unnecessary local variables for checksum sizeDavid Sterba
Remove local variable that is then used just once and replace it with fs_info::csum_size. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: switch cached fs_info::csum_size from u16 to u32David Sterba
The fs_info value is 32bit, switch also the local u16 variables. This leads to a better assembly code generated due to movzwl. This simple change will shave some bytes on x86_64 and release config: text data bss dec hex filename 1090000 17980 14912 1122892 11224c pre/btrfs.ko 1089794 17980 14912 1122686 11217e post/btrfs.ko DELTA: -206 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: use cached value of fs_info::csum_size everywhereDavid Sterba
btrfs_get_16 shows up in the system performance profiles (helper to read 16bit values from on-disk structures). This is partially because of the checksum size that's frequently read along with data reads/writes, other u16 uses are from item size or directory entries. Replace all calls to btrfs_super_csum_size by the cached value from fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: introduce mount option rescue=ignorebadrootsJosef Bacik
In the face of extent root corruption, or any other core fs wide root corruption we will fail to mount the file system. This makes recovery kind of a pain, because you need to fall back to userspace tools to scrape off data. Instead provide a mechanism to gracefully handle bad roots, so we can at least mount read-only and possibly recover data from the file system. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08btrfs: push the NODATASUM check into btrfs_lookup_bio_sumsJosef Bacik
When we move to being able to handle NULL csum_roots it'll be cleaner to just check in btrfs_lookup_bio_sums instead of at all of the caller locations, so push the NODATASUM check into it as well so it's unified. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07btrfs: compression: move declarations to headerDavid Sterba
The declarations of compression algorithm callbacks are defined in the .c file as they're used from there. Compiler warns that there are no declarations for public functions when compiling lzo.c/zlib.c/zstd.c. Fix that by moving the declarations to the header as it's the common place for all of them. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: remove fail label in check_compressed_csumNikolay Borisov
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: increment corrupt device counter during compressed readNikolay Borisov
If a compressed read fails due to checksum error only a line is printed to dmesg, device corrupt counter is not modified. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: remove needless ASSERT check of orig_bio in end_compressed_bio_readNikolay Borisov
compressed_bio::orig_bio is always set in btrfs_submit_compressed_read before any bio submission is performed. Since that function is always called with a valid bio it renders the ASSERT unnecessary. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_submit_compressed_write take btrfs_inodeNikolay Borisov
Majority of its uses are for btrfs_inode so take it as an argument directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27btrfs: make btrfs_csum_one_bio takae btrfs_inodeNikolay Borisov
Will enable converting btrfs_submit_compressed_write to btrfs_inode more easily. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: unexport btrfs_compress_set_level()Anand Jain
btrfs_compress_set_level() can be static function in the file compression.c. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25btrfs: use crypto_shash_digest() instead of open codingEric Biggers
Use crypto_shash_digest() instead of crypto_shash_init() + crypto_shash_update() + crypto_shash_final(). This is more efficient. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-31btrfs: use larger zlib buffer for s390 hardware compressionMikhail Zaslonko
In order to benefit from s390 zlib hardware compression support, increase the btrfs zlib workspace buffer size from 1 to 4 pages (if s390 zlib hardware support is enabled on the machine). This brings up to 60% better performance in hardware on s390 compared to the PAGE_SIZE buffer and much more compared to the software zlib processing in btrfs. In case of memory pressure, fall back to a single page buffer during workspace allocation. The data compressed with larger input buffers will still conform to zlib standard and thus can be decompressed also on a systems that uses only PAGE_SIZE buffer for btrfs zlib. Link: http://lkml.kernel.org/r/20200108105103.29028-1-zaslonko@linux.ibm.com Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: David Sterba <dsterba@suse.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Eduard Shishkin <edward6@linux.ibm.com> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-20btrfs: get rid of at_offset parameter to btrfs_lookup_bio_sums()Omar Sandoval
We can encode this in the offset parameter: -1 means use the page offsets, anything else is a valid offset. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20btrfs: get rid of trivial __btrfs_lookup_bio_sums() wrappersOmar Sandoval
Currently, we have two wrappers for __btrfs_lookup_bio_sums(): btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and btrfs_lookup_bio_sums(), which is used everywhere else. The only difference is that the _dio variant looks up csums starting at the given offset instead of using the page index, which isn't actually direct I/O-specific. Let's clean up the signature and return value of __btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get rid of the trivial helpers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-30btrfs: fix compressed write bio blkcg attributionDennis Zhou
Bio attribution is handled at bio_set_dev() as once we have a device, we have a corresponding request_queue and then can derive the current css. In special cases, we want to attribute to bio to someone else. This can be done by calling bio_associate_blkg_from_css() or kthread_associate_blkcg() depending on the scenario. Btrfs does this for compressed writeback as they are handled by kworkers, so the latter can be done here. Commit 1a41802701ec ("btrfs: drop bio_set_dev where not needed") removes early bio_set_dev() calls prior to submit_stripe_bio(). This breaks the above assumption that we'll have a request_queue when we are doing association. To fix this, switch to using kthread_associate_blkcg(). Without this, we crash in btrfs/024: [ 3052.093088] BUG: kernel NULL pointer dereference, address: 0000000000000510 [ 3052.107013] #PF: supervisor read access in kernel mode [ 3052.107014] #PF: error_code(0x0000) - not-present page [ 3052.107015] PGD 0 P4D 0 [ 3052.107021] Oops: 0000 [#1] SMP [ 3052.138904] CPU: 42 PID: 201270 Comm: kworker/u161:0 Kdump: loaded Not tainted 5.5.0-rc1-00062-g4852d8ac90a9 #712 [ 3052.138905] Hardware name: Quanta Tioga Pass Single Side 01-0032211004/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 [ 3052.138912] Workqueue: btrfs-delalloc btrfs_work_helper [ 3052.191375] RIP: 0010:bio_associate_blkg_from_css+0x1e/0x3c0 [ 3052.191379] RSP: 0018:ffffc900210cfc90 EFLAGS: 00010282 [ 3052.191380] RAX: 0000000000000000 RBX: ffff88bfe5573c00 RCX: 0000000000000000 [ 3052.191382] RDX: ffff889db48ec2f0 RSI: ffff88bfe5573c00 RDI: ffff889db48ec2f0 [ 3052.191386] RBP: 0000000000000800 R08: 0000000000203bb0 R09: ffff889db16b2400 [ 3052.293364] R10: 0000000000000000 R11: ffff88a07fffde80 R12: ffff889db48ec2f0 [ 3052.293365] R13: 0000000000001000 R14: ffff889de82bc000 R15: ffff889e2b7bdcc8 [ 3052.293367] FS: 0000000000000000(0000) GS:ffff889ffba00000(0000) knlGS:0000000000000000 [ 3052.293368] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3052.293369] CR2: 0000000000000510 CR3: 0000000002611001 CR4: 00000000007606e0 [ 3052.293370] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3052.293371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3052.293372] PKRU: 55555554 [ 3052.293376] Call Trace: [ 3052.402552] btrfs_submit_compressed_write+0x137/0x390 [ 3052.402558] submit_compressed_extents+0x40f/0x4c0 [ 3052.422401] btrfs_work_helper+0x246/0x5a0 [ 3052.422408] process_one_work+0x200/0x570 [ 3052.438601] ? process_one_work+0x180/0x570 [ 3052.438605] worker_thread+0x4c/0x3e0 [ 3052.438614] kthread+0x103/0x140 [ 3052.460735] ? process_one_work+0x570/0x570 [ 3052.460737] ? kthread_mod_delayed_work+0xc0/0xc0 [ 3052.460744] ret_from_fork+0x24/0x30 Fixes: 1a41802701ec ("btrfs: drop bio_set_dev where not needed") Reported-by: Chris Murphy <chris@colorremedies.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-30btrfs: punt all bios created in btrfs_submit_compressed_write()Dennis Zhou
Compressed writes happen in the background via kworkers. However, this causes bios to be attributed to root bypassing any cgroup limits from the actual writer. We tag the first bio with REQ_CGROUP_PUNT, which will punt the bio to an appropriate cgroup specific workqueue and attribute the IO properly. However, if btrfs_submit_compressed_write() creates a new bio, we don't tag it the same way. Add the appropriate tagging for subsequent bios. Fixes: ec39f7696ccfa ("Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios") Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: drop bio_set_dev where not neededDavid Sterba
bio_set_dev sets a bdev to a bio and is not only setting a pointer bug also changing some state bits if there was a different bdev set before. This is one thing that's not needed. Another thing is that setting a bdev at bio allocation time is too early and actually does not work with plain redundancy profiles, where each time we submit a bio to a device, the bdev is set correctly. In many places the bio bdev is set to latest_bdev that seems to serve as a stub pointer "just to put something to bio". But we don't have to do that. Where do we know which bdev to set: * for regular IO: submit_stripe_bio that's called by btrfs_map_bio * repair IO: repair_io_failure, read or write from specific device * super block write (using buffer_heads but uses raw bdev) and barriers * scrub: this does not use all regular IO paths as it needs to reach all copies, verify and fixup eventually, and for that all bdev management is independent * raid56: rbio_add_io_page, for the RMW write * integrity-checker: does it's own low-level block tracking Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: remove ops pointer from workspace_managerDavid Sterba
We can infer the ops from the type that is now passed to all functions that would need it, this makes workspace_manager::ops redundant and can be removed. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: inline free_workspaceDavid Sterba
Replace indirect calls to free_workspace by switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to spectre vulnerability mitigations. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: pass type to btrfs_put_workspaceDavid Sterba
We can infer the workspace_manager from type and the type will be used in the following patch to call a common helper for free_workspace. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: inline alloc_workspaceDavid Sterba
Replace indirect calls to alloc_workspace by switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to spectre vulnerability mitigations. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: pass type to btrfs_get_workspaceDavid Sterba
We can infer the workspace_manager from type and the type will be used in the following patch to call a common helper for alloc_workspace. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: inline put_workspaceDavid Sterba
Similar to get_workspace, majority of the callbacks is trivial, we don't gain anything by the indirection, so replace them by a switch function. Trivial callback implementations use the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: inline get_workspaceDavid Sterba
Majority of the callbacks is trivial, we don't gain anything by the indirection, so replace them by a switch function. ZLIB needs to adjust level in the callback and ZSTD workspace management is complex, the rest is call to the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: export alloc/free/get/put callbacks of all algosDavid Sterba
The indirect calls will be replaced by a switch in compression.c. (Switch is faster than indirect calls with when Spectre mitigations are enabled). Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: inline cleanup_workspace_managerDavid Sterba
Replace loop calling to all algos with a list of direct calls to the cleanup manager callback. When that becomes trivial it is replaced by direct call to the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: let workspace manager cleanup take only the typeDavid Sterba
With the access to the workspace structures, we can look it up together with the compression ops inside the workspace manager cleanup helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: inline init_workspace_managerDavid Sterba
Replace loop calling to all algos with a list of direct calls to the init manager callback. When that becomes trivial it is replaced by direct call to the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: let workspace manager init take only the typeDavid Sterba
With the access to the workspace structures, we can look it up together with the compression ops inside the workspace manager init helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: compression: attach workspace manager to the opsDavid Sterba
There's a lot of indirection when the generic code calls into algo-specific callbacks to reach the private workspace manager structure and back to the generic code. To simplify that, export the workspace manager for heuristic, LZO and ZLIB, while ZSTD is going to use it's own manager. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: switch compression callbacks to direct callsDavid Sterba
The indirect calls bring some overhead due to spectre vulnerability mitigations. The number of cases is small and below the threshold (10-20) where indirect call would be better. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: export compression and decompression callbacksDavid Sterba
Export compress_pages, decompress_bio and decompress callbacks for all compression algos. The indirect calls will be replaced by a switch. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: use better definition of number of compression typeChengguang Xu
The compression type upper limit constant is the same as the last value and this is confusing. In order to keep coding style consistent, use BTRFS_NR_COMPRESS_TYPES as the total number that follows the idom of 'NR' being one more than the last value. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18Btrfs: use REQ_CGROUP_PUNT for worker thread submitted biosChris Mason
Async CRCs and compression submit IO through helper threads, which means they have IO priority inversions when cgroup IO controllers are in use. This flags all of the writes submitted by btrfs helper threads as REQ_CGROUP_PUNT. submit_bio() will punt these to dedicated per-blkcg work items to avoid the priority inversion. For the compression code, we take a reference on the wbc's blkg css and pass it down to the async workers. For the async CRCs, the bio already has the correct css, we just need to tell the block layer to use REQ_CGROUP_PUNT. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Chris Mason <clm@fb.com> Modified-and-reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18Btrfs: stop using btrfs_schedule_bio()Chris Mason
btrfs_schedule_bio() hands IO off to a helper thread to do the actual submit_bio() call. This has been used to make sure async crc and compression helpers don't get stuck on IO submission. To maintain good performance, over time the IO submission threads duplicated some IO scheduler characteristics such as high and low priority IOs and they also made some ugly assumptions about request allocation batch sizes. All of this cost at least one extra context switch during IO submission, and doesn't fit well with the modern blkmq IO stack. So, this commit stops using btrfs_schedule_bio(). We may need to adjust the number of async helper threads for crcs and compression, but long term it's a better path. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: move cond_wake_up functions out of ctreeDavid Sterba
The file ctree.h serves as a header for everything and has become quite bloated. Split some helpers that are generic and create a new file that should be the catch-all for code that's not btrfs-specific. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: compression: replace set_level callbacks by a common helperDavid Sterba
The set_level callbacks do not do anything special and can be replaced by a helper that uses the levels defined in the tables. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: lift bio_set_dev from bio allocation helpersDavid Sterba
The block device is passed around for the only purpose to set it in new bios. Move the assignment one level up. This is a preparatory patch for further bdev cleanups. Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: correctly validate compression typeJohannes Thumshirn
Nikolay reported the following KASAN splat when running btrfs/048: [ 1843.470920] ================================================================== [ 1843.471971] BUG: KASAN: slab-out-of-bounds in strncmp+0x66/0xb0 [ 1843.472775] Read of size 1 at addr ffff888111e369e2 by task btrfs/3979 [ 1843.473904] CPU: 3 PID: 3979 Comm: btrfs Not tainted 5.2.0-rc3-default #536 [ 1843.475009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 1843.476322] Call Trace: [ 1843.476674] dump_stack+0x7c/0xbb [ 1843.477132] ? strncmp+0x66/0xb0 [ 1843.477587] print_address_description+0x114/0x320 [ 1843.478256] ? strncmp+0x66/0xb0 [ 1843.478740] ? strncmp+0x66/0xb0 [ 1843.479185] __kasan_report+0x14e/0x192 [ 1843.479759] ? strncmp+0x66/0xb0 [ 1843.480209] kasan_report+0xe/0x20 [ 1843.480679] strncmp+0x66/0xb0 [ 1843.481105] prop_compression_validate+0x24/0x70 [ 1843.481798] btrfs_xattr_handler_set_prop+0x65/0x160 [ 1843.482509] __vfs_setxattr+0x71/0x90 [ 1843.483012] __vfs_setxattr_noperm+0x84/0x130 [ 1843.483606] vfs_setxattr+0xac/0xb0 [ 1843.484085] setxattr+0x18c/0x230 [ 1843.484546] ? vfs_setxattr+0xb0/0xb0 [ 1843.485048] ? __mod_node_page_state+0x1f/0xa0 [ 1843.485672] ? _raw_spin_unlock+0x24/0x40 [ 1843.486233] ? __handle_mm_fault+0x988/0x1290 [ 1843.486823] ? lock_acquire+0xb4/0x1e0 [ 1843.487330] ? lock_acquire+0xb4/0x1e0 [ 1843.487842] ? mnt_want_write_file+0x3c/0x80 [ 1843.488442] ? debug_lockdep_rcu_enabled+0x22/0x40 [ 1843.489089] ? rcu_sync_lockdep_assert+0xe/0x70 [ 1843.489707] ? __sb_start_write+0x158/0x200 [ 1843.490278] ? mnt_want_write_file+0x3c/0x80 [ 1843.490855] ? __mnt_want_write+0x98/0xe0 [ 1843.491397] __x64_sys_fsetxattr+0xba/0xe0 [ 1843.492201] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 1843.493201] do_syscall_64+0x6c/0x230 [ 1843.493988] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1843.495041] RIP: 0033:0x7fa7a8a7707a [ 1843.495819] Code: 48 8b 0d 21 de 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 be 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ee dd 2b 00 f7 d8 64 89 01 48 [ 1843.499203] RSP: 002b:00007ffcb73bca38 EFLAGS: 00000202 ORIG_RAX: 00000000000000be [ 1843.500210] RAX: ffffffffffffffda RBX: 00007ffcb73bda9d RCX: 00007fa7a8a7707a [ 1843.501170] RDX: 00007ffcb73bda9d RSI: 00000000006dc050 RDI: 0000000000000003 [ 1843.502152] RBP: 00000000006dc050 R08: 0000000000000000 R09: 0000000000000000 [ 1843.503109] R10: 0000000000000002 R11: 0000000000000202 R12: 00007ffcb73bda91 [ 1843.504055] R13: 0000000000000003 R14: 00007ffcb73bda82 R15: ffffffffffffffff [ 1843.505268] Allocated by task 3979: [ 1843.505771] save_stack+0x19/0x80 [ 1843.506211] __kasan_kmalloc.constprop.5+0xa0/0xd0 [ 1843.506836] setxattr+0xeb/0x230 [ 1843.507264] __x64_sys_fsetxattr+0xba/0xe0 [ 1843.507886] do_syscall_64+0x6c/0x230 [ 1843.508429] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1843.509558] Freed by task 0: [ 1843.510188] (stack is not available) [ 1843.511309] The buggy address belongs to the object at ffff888111e369e0 which belongs to the cache kmalloc-8 of size 8 [ 1843.514095] The buggy address is located 2 bytes inside of 8-byte region [ffff888111e369e0, ffff888111e369e8) [ 1843.516524] The buggy address belongs to the page: [ 1843.517561] page:ffff88813f478d80 refcount:1 mapcount:0 mapping:ffff88811940c300 index:0xffff888111e373b8 compound_mapcount: 0 [ 1843.519993] flags: 0x4404000010200(slab|head) [ 1843.520951] raw: 0004404000010200 ffff88813f48b008 ffff888119403d50 ffff88811940c300 [ 1843.522616] raw: ffff888111e373b8 000000000016000f 00000001ffffffff 0000000000000000 [ 1843.524281] page dumped because: kasan: bad access detected [ 1843.525936] Memory state around the buggy address: [ 1843.526975] ffff888111e36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 1843.528479] ffff888111e36900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 1843.530138] >ffff888111e36980: fc fc fc fc fc fc fc fc fc fc fc fc 02 fc fc fc [ 1843.531877] ^ [ 1843.533287] ffff888111e36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 1843.534874] ffff888111e36a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 1843.536468] ================================================================== This is caused by supplying a too short compression value ('lz') in the test-case and comparing it to 'lzo' with strncmp() and a length of 3. strncmp() read past the 'lz' when looking for the 'o' and thus caused an out-of-bounds read. Introduce a new check 'btrfs_compress_is_valid_type()' which not only checks the user-supplied value against known compression types, but also employs checks for too short values. Reported-by: Nikolay Borisov <nborisov@suse.com> Fixes: 272e5326c783 ("btrfs: prop: fix vanished compression property after failed set") CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: remove assumption about csum type form btrfs_print_data_csum_error()Johannes Thumshirn
btrfs_print_data_csum_error() still assumed checksums to be 32 bit in size. Make it size agnostic. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: directly call into crypto framework for checksummingJohannes Thumshirn
Currently btrfs_csum_data() relied on the crc32c() wrapper around the crypto framework for calculating the CRCs. As we have our own crypto_shash structure in the fs_info now, we can directly call into the crypto framework without going trough the wrapper. This way we can even remove the btrfs_csum_data() and btrfs_csum_final() wrappers. The module dependency on crc32c is preserved via MODULE_SOFTDEP("pre: crc32c"), which was previously provided by LIBCRC32C config option doing the same. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: don't assume compressed_bio sums to be 4 bytesJohannes Thumshirn
BTRFS has the implicit assumption that a checksum in compressed_bio is 4 bytes. While this is true for CRC32C, it is not for any other checksum. Change the data type to be a byte array and adjust loop index calculation accordingly. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01btrfs: don't assume ordered sums to be 4 bytesJohannes Thumshirn
BTRFS has the implicit assumption that a checksum in btrfs_orderd_sums is 4 bytes. While this is true for CRC32C, it is not for any other checksum. Change the data type to be a byte array and adjust loop index calculation accordingly. This includes moving the adjustment of 'index' by 'ins_size' in btrfs_csum_file_blocks() before dividing 'ins_size' by the checksum size, because before this patch the 'sums' member of 'struct btrfs_ordered_sum' was 4 Bytes in size and afterwards it is only one byte. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-20Merge tag 'for-5.2-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Notable highlights: - fixes for some long-standing bugs in fsync that were quite hard to catch but now finaly fixed - some fixups to error handling paths that did not properly clean up (locking, memory) - fix to space reservation for inheriting properties" * tag 'for-5.2-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: tree-checker: detect file extent items with overlapping ranges Btrfs: fix race between ranged fsync and writeback of adjacent ranges Btrfs: avoid fallback to transaction commit during fsync of files with holes btrfs: extent-tree: Fix a bug that btrfs is unable to add pinned bytes btrfs: sysfs: don't leak memory when failing add fsid btrfs: sysfs: Fix error path kobject memory leak Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path btrfs: use the existing reserved items for our first prop for inheritance btrfs: don't double unlock on error in btrfs_punch_hole btrfs: Check the compression level before getting a workspace
2019-05-07Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "Nothing major in this series, just fixes and improvements all over the map. This contains: - Series of fixes for sed-opal (David, Jonas) - Fixes and performance tweaks for BFQ (via Paolo) - Set of fixes for bcache (via Coly) - Set of fixes for md (via Song) - Enabling multi-page for passthrough requests (Ming) - Queue release fix series (Ming) - Device notification improvements (Martin) - Propagate underlying device rotational status in loop (Holger) - Removal of mtip32xx trim support, which has been disabled for years (Christoph) - Improvement and cleanup of nvme command handling (Christoph) - Add block SPDX tags (Christoph) - Cleanup/hardening of bio/bvec iteration (Christoph) - A few NVMe pull requests (Christoph) - Removal of CONFIG_LBDAF (Christoph) - Various little fixes here and there" * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits) block: fix mismerge in bvec_advance block: don't drain in-progress dispatch in blk_cleanup_queue() blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release blk-mq: always free hctx after request queue is freed blk-mq: split blk_mq_alloc_and_init_hctx into two parts blk-mq: free hw queue's resource in hctx's release handler blk-mq: move cancel of requeue_work into blk_mq_release blk-mq: grab .q_usage_counter when queuing request from plug code path block: fix function name in comment nvmet: protect discovery change log event list iteration nvme: mark nvme_core_init and nvme_core_exit static nvme: move command size checks to the core nvme-fabrics: check more command sizes nvme-pci: check more command sizes nvme-pci: remove an unneeded variable initialization nvme-pci: unquiesce admin queue on shutdown nvme-pci: shutdown on timeout during deletion nvme-pci: fix psdt field for single segment sgls nvme-multipath: don't print ANA group state by default nvme-multipath: split bios with the ns_head bio_set before submitting ...
2019-05-03btrfs: Check the compression level before getting a workspaceJohnny Chang
When a file's compression property is set as zlib or zstd but leave the compression mount option not be set, that means btrfs will try to compress the file with default compression level. But in btrfs_compress_pages(), it calls get_workspace() with level = 0. This will return a workspace with a wrong compression level. For zlib, the compression level in the workspace will be 0 (that means "store only"). And for zstd, the compression in the workspace will be 1, not the default level 3. How to reproduce: mkfs -t btrfs /dev/sdb mount /dev/sdb /mnt/ mkdir /mnt/zlib btrfs property set /mnt/zlib/ compression zlib dd if=/dev/zero of=/mnt/zlib/compression-friendly-file-10M bs=1M count=10 sync btrfs-debugfs -f /mnt/zlib/compression-friendly-file-10M btrfs-debugfs output: * before: ... (258 9961472): ram 524288 disk 1106247680 disk_size 524288 file: ... extents 20 disk size 10485760 logical size 10485760 ratio 1.00 * after: ... (258 10354688): ram 131072 disk 14217216 disk_size 4096 file: ... extents 80 disk size 327680 logical size 10485760 ratio 32.00 The steps for zstd are similar, but need to put a debugging message to show the level of the return workspace in zstd_get_workspace(). This commit adds a check of the compression level before getting a workspace by set_level(). CC: stable@vger.kernel.org # 5.1+ Signed-off-by: Johnny Chang <johnnyc@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>