summaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)Author
2020-05-27nvme-tcp: set MSG_SENDPAGE_NOTLAST with MSG_MORE when we have more to sendSagi Grimberg
We can signal the stack that this is not the last page coming and the stack can build a larger tso segment, so go ahead and use it. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27nvmet: mark nvmet_ana_state staticChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2020-05-27nvmet: replace kstrndup() with kmemdup_nul()Chen Zhou
It is more efficient to use kmemdup_nul() if the size is known exactly. The doc in kernel: "Note: Use kmemdup_nul() instead if the size is known exactly." Signed-off-by: Chen Zhou <chenzhou10@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-26block/floppy: fix contended case in floppy_queue_rq()Jiri Kosina
Since the switch of floppy driver to blk-mq, the contended (fdc_busy) case in floppy_queue_rq() is not handled correctly. In case we reach floppy_queue_rq() with fdc_busy set (i.e. with the floppy locked due to another request still being in-flight), we put the request on the list of requests and return BLK_STS_OK to the block core, without actually scheduling delayed work / doing further processing of the request. This means that processing of this request is postponed until another request comes and passess uncontended. Which in some cases might actually never happen and we keep waiting indefinitely. The simple testcase is for i in `seq 1 2000`; do echo -en $i '\r'; blkid --info /dev/fd0 2> /dev/null; done run in quemu. That reliably causes blkid eventually indefinitely hanging in __floppy_read_block_0() waiting for completion, as the BIO callback never happens, and no further IO is ever submitted on the (non-existent) floppy device. This was observed reliably on qemu-emulated device. Fix that by not queuing the request in the contended case, and return BLK_STS_RESOURCE instead, so that blk core handles the request rescheduling and let it pass properly non-contended later. Fixes: a9f38e1dec107a ("floppy: convert to blk-mq") Cc: stable@vger.kernel.org Tested-by: Libor Pechacek <lpechacek@suse.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-24loop: remove redundant assignment to variable errorColin Ian King
The variable error is being assigned a value that is never read so the assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21block: remove ioctl_by_bdevChristoph Hellwig
No callers left. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21s390/dasd: remove ioctl_by_bdev callsStefan Haberland
The IBM partition parser requires device type specific information only available to the DASD driver to correctly register partitions. The current approach of using ioctl_by_bdev with a fake user space pointer is discouraged. Fix this by replacing IOCTL calls with direct in-kernel function calls. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21dasd: refactor dasd_ioctl_informationChristoph Hellwig
Prepare for in-kernel callers of this functionality. Signed-off-by: Christoph Hellwig <hch@lst.de> [sth@de.ibm.com: remove leftover kfree] Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Add LOOP_CONFIGURE ioctlMartijn Coenen
This allows userspace to completely setup a loop device with a single ioctl, removing the in-between state where the device can be partially configured - eg the loop device has a backing file associated with it, but is reading from the wrong offset. Besides removing the intermediate state, another big benefit of this ioctl is that LOOP_SET_STATUS can be slow; the main reason for this slowness is that LOOP_SET_STATUS(64) calls blk_mq_freeze_queue() to freeze the associated queue; this requires waiting for RCU synchronization, which I've measured can take about 15-20ms on this device on average. In addition to doing what LOOP_SET_STATUS can do, LOOP_CONFIGURE can also be used to: - Set the correct block size immediately by setting loop_config.block_size (avoids LOOP_SET_BLOCK_SIZE) - Explicitly request direct I/O mode by setting LO_FLAGS_DIRECT_IO in loop_config.info.lo_flags (avoids LOOP_SET_DIRECT_IO) - Explicitly request read-only mode by setting LO_FLAGS_READ_ONLY in loop_config.info.lo_flags Here's setting up ~70 regular loop devices with an offset on an x86 Android device, using LOOP_SET_FD and LOOP_SET_STATUS: vsoc_x86:/system/apex # time for i in `seq 30 100`; do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done 0m03.40s real 0m00.02s user 0m00.03s system Here's configuring ~70 devices in the same way, but using a modified losetup that uses the new LOOP_CONFIGURE ioctl: vsoc_x86:/system/apex # time for i in `seq 30 100`; do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done 0m01.94s real 0m00.01s user 0m00.01s system Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Clean up LOOP_SET_STATUS lo_flags handlingMartijn Coenen
LOOP_SET_STATUS(64) will actually allow some lo_flags to be modified; in particular, LO_FLAGS_AUTOCLEAR can be set and cleared, whereas LO_FLAGS_PARTSCAN can be set to request a partition scan. Make this explicit by updating the UAPI to include the flags that can be set/cleared using this ioctl. The implementation can then blindly take over the passed in flags, and use the previous flags for those flags that can't be set / cleared using LOOP_SET_STATUS. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Rework lo_ioctl() __user argument castingMartijn Coenen
In preparation for a new ioctl that needs to copy_from_user(); makes the code easier to read as well. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Move loop_set_status_from_info() and friends upMartijn Coenen
So we can use it without forward declaration. This is a separate commit to make it easier to verify that this is just a move, without functional modifications. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Factor out configuring loop from statusMartijn Coenen
Factor out this code into a separate function, so it can be reused by other code more easily. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Remove figure_loop_size()Martijn Coenen
This function was now only used by loop_set_capacity(). Just open code the remaining code in the caller instead. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Refactor loop_set_status() size calculationMartijn Coenen
figure_loop_size() calculates the loop size based on the passed in parameters, but at the same time it updates the offset and sizelimit parameters in the loop device configuration. That is a somewhat unexpected side effect of a function with this name, and it is only only needed by one of the two callers of this function - loop_set_status(). Move the lo_offset and lo_sizelimit assignment back into loop_set_status(), and use the newly factored out functions to validate and apply the newly calculated size. This allows us to get rid of figure_loop_size() in a follow-up commit. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Switch to set_capacity_revalidate_and_notify()Martijn Coenen
This was recently added to block/genhd.c, and takes care of both updating the capacity and notifying userspace of the new size. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Factor out setting loop device sizeMartijn Coenen
This code is used repeatedly. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Remove sector_t truncation checksMartijn Coenen
sector_t is now always u64, so we don't need to check for truncation. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21loop: Call loop_config_discard() only after new config is appliedMartijn Coenen
loop_set_status() calls loop_config_discard() to configure discard for the loop device; however, the discard configuration depends on whether the loop device uses encryption, and when we call it the encryption configuration has not been updated yet. Move the call down so we apply the correct discard configuration based on the new configuration. Signed-off-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-16block/swim3: use set_current_state macroXu Wang
Use set_current_state macro instead of current->state = TASK_RUNNING. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13Merge branch 'md-next' of ↵Jens Axboe
git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.8/drivers Pull MD changes from Song. * 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid1: Replace zero-length array with flexible-array md: add a newline when printing parameter 'start_ro' by sysfs md: stop using ->queuedata md/raid1: release pending accounting for an I/O only after write-behind is also finished md: remove redundant memalloc scope API usage raid5: update code comment of scribble_alloc() raid5: remove gfp flags from scribble_alloc() md: use memalloc scope APIs in mddev_suspend()/mddev_resume() md: remove the extra line for ->hot_add_disk md: flush md_rdev_misc_wq for HOT_ADD_DISK case md: don't flush workqueue unconditionally in md_open md: add new workqueue for delete rdev md: add checkings before flush md_misc_wq
2020-05-13md/raid1: Replace zero-length array with flexible-arrayGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: add a newline when printing parameter 'start_ro' by sysfsXiongfeng Wang
Add a missing newline when printing module parameter 'start_ro' by sysfs. Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: stop using ->queuedataChristoph Hellwig
Pointer to mddev is already available in private_data. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md/raid1: release pending accounting for an I/O only after write-behind is ↵David Jeffery
also finished When using RAID1 and write-behind, md can deadlock when errors occur. With write-behind, r1bio structs can be accounted by raid1 as queued but not counted as pending. The pending count is dropped when the original bio is returned complete but write-behind for the r1bio may still be active. This breaks the accounting used in some conditions to know when the raid1 md device has reached an idle state. It can result in calls to freeze_array deadlocking. freeze_array will never complete from a negative "unqueued" value being calculated due to a queued count larger than the pending count. To properly account for write-behind, move the call to allow_barrier from call_bio_endio to raid_end_bio_io. When using write-behind, md can call call_bio_endio before all write-behind I/O is complete. Using raid_end_bio_io for the point to call allow_barrier will release the pending count at a point where all I/O for an r1bio, even write-behind, is done. Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: remove redundant memalloc scope API usageColy Li
In mddev_create_serial_pool(), memalloc scope APIs memalloc_noio_save() and memalloc_noio_restore() are used when allocating memory by calling mempool_create_kmalloc_pool(). After adding the memalloc scope APIs in raid array suspend context, it is unncessary to explicitly call them around mempool_create_kmalloc_pool() any longer. This patch removes the redundant memalloc scope APIs in mddev_create_serial_pool(). Signed-off-by: Coly Li <colyli@suse.de> Cc: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13raid5: update code comment of scribble_alloc()Coly Li
Code comments of scribble_alloc() is outdated for a while. This patch update the comments in function header for the new parameter list. Suggested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13raid5: remove gfp flags from scribble_alloc()Coly Li
Using GFP_NOIO flag to call scribble_alloc() from resize_chunk() does not have the expected behavior. kvmalloc_array() inside scribble_alloc() which receives the GFP_NOIO flag will eventually call kmalloc_node() to allocate physically continuous pages. Now we have memalloc scope APIs in mddev_suspend()/mddev_resume() to prevent memory reclaim I/Os during raid array suspend context, calling to kvmalloc_array() with GFP_KERNEL flag may avoid deadlock of recursive I/O as expected. This patch removes the useless gfp flags from parameters list of scribble_alloc(), and call kvmalloc_array() with GFP_KERNEL flag. The incorrect GFP_NOIO flag does not exist anymore. Fixes: b330e6a49dc3 ("md: convert to kvmalloc") Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: use memalloc scope APIs in mddev_suspend()/mddev_resume()Coly Li
In raid5.c:resize_chunk(), scribble_alloc() is called with GFP_NOIO flag, then it is sent into kvmalloc_array() inside scribble_alloc(). The problem is kvmalloc_array() eventually calls kvmalloc_node() which does not accept non GFP_KERNEL compatible flag like GFP_NOIO, then kmalloc_node() is called indeed to allocate physically continuous pages. When system memory is under heavy pressure, and the requesting size is large, there is high probability that allocating continueous pages will fail. But simply using GFP_KERNEL flag to call kvmalloc_array() is also progblematic. In the code path where scribble_alloc() is called, the raid array is suspended, if kvmalloc_node() triggers memory reclaim I/Os and such I/Os go back to the suspend raid array, deadlock will happen. What is desired here is to allocate non-physically (a.k.a virtually) continuous pages and avoid memory reclaim I/Os. Michal Hocko suggests to use the mmealloc sceope APIs to restrict memory reclaim I/O in allocating context, specifically to call memalloc_noio_save() when suspend the raid array and to call memalloc_noio_restore() when resume the raid array. This patch adds the memalloc scope APIs in mddev_suspend() and mddev_resume(), to restrict memory reclaim I/Os during the raid array is suspended. The benifit of adding the memalloc scope API in the unified entry point mddev_suspend()/mddev_resume() is, no matter which md raid array type (personality), we are sure the deadlock by recursive memory reclaim I/O won't happen on the suspending context. Please notice that the memalloc scope APIs only take effect on the raid array suspending context, if the memory allocation is from another new created kthread after raid array suspended, the recursive memory reclaim I/Os won't be restricted. The mddev_suspend()/mddev_resume() entries are used for the critical section where the raid metadata is modifying, creating a kthread to allocate memory inside the critical section is queer and very probably being buggy. Fixes: b330e6a49dc3 ("md: convert to kvmalloc") Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: remove the extra line for ->hot_add_diskGuoqing Jiang
It is not not necessary to add a newline for them since they don't exceed 80 characters, and it is not intutive to distinguish ->hot_add_disk() from hot_add_disk() too. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: flush md_rdev_misc_wq for HOT_ADD_DISK caseGuoqing Jiang
Since rdev->kobj is removed asynchronously, it is possible that the rdev->kobj still exists when try to add the rdev again after rdev is removed. But this path md_ioctl (HOT_ADD_DISK) -> hot_add_disk -> bind_rdev_to_array missed it. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: don't flush workqueue unconditionally in md_openGuoqing Jiang
We need to check mddev->del_work before flush workqueu since the purpose of flush is to ensure the previous md is disappeared. Otherwise the similar deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the bdev->bd_mutex before flush workqueue. kernel: [ 154.522645] ====================================================== kernel: [ 154.522647] WARNING: possible circular locking dependency detected kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O kernel: [ 154.522651] ------------------------------------------------------ kernel: [ 154.522653] mdadm/2482 is trying to acquire lock: kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0 kernel: [ 154.522673] kernel: [ 154.522673] but task is already holding lock: kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522691] kernel: [ 154.522691] which lock already depends on the new lock. kernel: [ 154.522691] kernel: [ 154.522694] kernel: [ 154.522694] the existing dependency chain (in reverse order) is: kernel: [ 154.522696] kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}: kernel: [ 154.522704] __mutex_lock+0x87/0x950 kernel: [ 154.522706] __blkdev_get+0x79/0x590 kernel: [ 154.522708] blkdev_get+0x65/0x140 kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40 kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod] kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod] kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod] kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522735] vfs_write+0xad/0x1a0 kernel: [ 154.522737] ksys_write+0xa4/0xe0 kernel: [ 154.522745] do_syscall_64+0x64/0x2b0 kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522749] kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}: kernel: [ 154.522752] __mutex_lock+0x87/0x950 kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod] kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod] kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0 kernel: [ 154.522763] vfs_write+0xad/0x1a0 kernel: [ 154.522765] ksys_write+0xa4/0xe0 kernel: [ 154.522767] do_syscall_64+0x64/0x2b0 kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522770] kernel: [ 154.522770] -> #2 (kn->count#253){++++}: kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0 kernel: [ 154.522778] kernfs_remove+0x1f/0x30 kernel: [ 154.522780] kobject_del+0x28/0x60 kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod] kernel: [ 154.522786] process_one_work+0x2a7/0x5f0 kernel: [ 154.522788] worker_thread+0x2d/0x3d0 kernel: [ 154.522793] kthread+0x117/0x130 kernel: [ 154.522795] ret_from_fork+0x3a/0x50 kernel: [ 154.522796] kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}: kernel: [ 154.522800] process_one_work+0x27e/0x5f0 kernel: [ 154.522802] worker_thread+0x2d/0x3d0 kernel: [ 154.522804] kthread+0x117/0x130 kernel: [ 154.522806] ret_from_fork+0x3a/0x50 kernel: [ 154.522807] kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}: kernel: [ 154.522813] __lock_acquire+0x1392/0x1690 kernel: [ 154.522816] lock_acquire+0xb4/0x1a0 kernel: [ 154.522818] flush_workqueue+0xab/0x4b0 kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522823] __blkdev_get+0xea/0x590 kernel: [ 154.522825] blkdev_get+0x65/0x140 kernel: [ 154.522828] do_dentry_open+0x1d1/0x380 kernel: [ 154.522831] path_openat+0x567/0xcc0 kernel: [ 154.522834] do_filp_open+0x9b/0x110 kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522838] do_sys_open+0x57/0x80 kernel: [ 154.522840] do_syscall_64+0x64/0x2b0 kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522844] kernel: [ 154.522844] other info that might help us debug this: kernel: [ 154.522844] kernel: [ 154.522846] Chain exists of: kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex kernel: [ 154.522846] kernel: [ 154.522850] Possible unsafe locking scenario: kernel: [ 154.522850] kernel: [ 154.522852] CPU0 CPU1 kernel: [ 154.522853] ---- ---- kernel: [ 154.522854] lock(&bdev->bd_mutex); kernel: [ 154.522856] lock(&mddev->reconfig_mutex); kernel: [ 154.522858] lock(&bdev->bd_mutex); kernel: [ 154.522860] lock((wq_completion)md_misc); kernel: [ 154.522861] kernel: [ 154.522861] *** DEADLOCK *** kernel: [ 154.522861] kernel: [ 154.522864] 1 lock held by mdadm/2482: kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590 kernel: [ 154.522868] kernel: [ 154.522868] stack backtrace: kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25 kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 kernel: [ 154.522878] Call Trace: kernel: [ 154.522881] dump_stack+0x8f/0xcb kernel: [ 154.522884] check_noncircular+0x194/0x1b0 kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690 kernel: [ 154.522890] __lock_acquire+0x1392/0x1690 kernel: [ 154.522893] lock_acquire+0xb4/0x1a0 kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522898] flush_workqueue+0xab/0x4b0 kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0 kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod] kernel: [ 154.522910] __blkdev_get+0xea/0x590 kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522914] blkdev_get+0x65/0x140 kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0 kernel: [ 154.522918] do_dentry_open+0x1d1/0x380 kernel: [ 154.522921] path_openat+0x567/0xcc0 kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690 kernel: [ 154.522926] do_filp_open+0x9b/0x110 kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0 kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630 kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0 kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0 kernel: [ 154.522944] do_sys_open+0x57/0x80 kernel: [ 154.522946] do_syscall_64+0x64/0x2b0 kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae And md_alloc also flushed the same workqueue, but the thing is different here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and the flush is necessary to avoid race condition, so leave it as it is. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: add new workqueue for delete rdevGuoqing Jiang
Since the purpose of call flush_workqueue in new_dev_store is to ensure md_delayed_delete() has completed, so we should check rdev->del_work is pending or not. To suppress lockdep warning, we have to check mddev->del_work while md_delayed_delete is attached to rdev->del_work, so it is not aligned to the purpose of flush workquee. So a new workqueue is needed to avoid the awkward situation, and introduce a new func flush_rdev_wq to flush the new workqueue after check if there was pending work. Also like new_dev_store, ADD_NEW_DISK ioctl has the same purpose to flush workqueue while it holds bdev->bd_mutex, so make the same change applies to the ioctl to avoid similar lock issue. And md_delayed_delete actually wants to delete rdev, so rename the function to rdev_delayed_delete. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13md: add checkings before flush md_misc_wqGuoqing Jiang
Coly reported possible circular locking dependencyi with LOCKDEP enabled, quote the below info from the detailed report [1]. [ 1607.673903] Chain exists of: [ 1607.673903] kn->count#256 --> (wq_completion)md_misc --> (work_completion)(&rdev->del_work) [ 1607.673903] [ 1607.827946] Possible unsafe locking scenario: [ 1607.827946] [ 1607.898780] CPU0 CPU1 [ 1607.952980] ---- ---- [ 1608.007173] lock((work_completion)(&rdev->del_work)); [ 1608.069690] lock((wq_completion)md_misc); [ 1608.149887] lock((work_completion)(&rdev->del_work)); [ 1608.242563] lock(kn->count#256); [ 1608.283238] [ 1608.283238] *** DEADLOCK *** [ 1608.283238] [ 1608.354078] 2 locks held by kworker/5:0/843: [ 1608.405152] #0: ffff8889eecc9948 ((wq_completion)md_misc){+.+.}, at: process_one_work+0x42b/0xb30 [ 1608.512399] #1: ffff888a1d3b7e10 ((work_completion)(&rdev->del_work)){+.+.}, at: process_one_work+0x42b/0xb30 [ 1608.632130] Since works (rdev->del_work and mddev->del_work) are queued in md_misc_wq, then lockdep_map lock is held if either of them are running, then both of them try to hold kernfs lock by call kobject_del. Then if new_dev_store or array_state_store are triggered by write to the related sysfs node, so the write operation gets kernfs lock, but need the lockdep_map because all of them would trigger flush_workqueue(md_misc_wq) finally, then the same lockdep_map lock is needed. To suppress the lockdep warnning, we should flush the workqueue in case the related work is pending. And several works are attached to md_misc_wq, so we need to check which work should be checked: 1. for __md_stop_writes, the purpose of call flush workqueue is ensure sync thread is started if it was starting, so check mddev->del_work is pending or not since md_start_sync is attached to mddev->del_work. 2. __md_stop flushes md_misc_wq to ensure event_work is done, check the event_work is enough. Assume raid_{ctr,dtr} -> md_stop -> __md_stop doesn't need the kernfs lock. 3. both new_dev_store (holds kernfs lock) and ADD_NEW_DISK ioctl (holds the bdev->bd_mutex) call flush_workqueue to ensure md_delayed_delete has completed, this case will be handled in next patch. 4. md_open flushes workqueue to ensure the previous md is disappeared, but it holds bdev->bd_mutex then try to flush workqueue, so it is better to check mddev->del_work as well to avoid potential lock issue, this will be done in another patch. [1]: https://marc.info/?l=linux-raid&m=158518958031584&w=2 Cc: Coly Li <colyli@suse.de> Reported-by: Coly Li <colyli@suse.de> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-12Merge tag 'floppy-for-5.8' of https://github.com/evdenis/linux-floppy into ↵Jens Axboe
for-5.8/drivers Floppy patches for 5.8 Cleanups: - symbolic register names for x86,sparc64,sparc32,powerpc,parisc,m68k - split of local/global variables for drive,fdc - UBSAN warning suppress in setup_rw_floppy() Changes were compile tested on arm, sparc64, powerpc, m68k. Many patches introduce no binary changes by using defines instead of magic numbers. The patches were also tested with syzkaller and simple write/read/format tests on real hardware. Signed-off-by: Denis Efremov <efremov@linux.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> * tag 'floppy-for-5.8' of https://github.com/evdenis/linux-floppy: (31 commits) floppy: suppress UBSAN warning in setup_rw_floppy() floppy: add defines for sizes of cmd & reply buffers of floppy_raw_cmd floppy: add FD_AUTODETECT_SIZE define for struct floppy_drive_params floppy: use print_hex_dump() in setup_DMA() floppy: cleanup: make set_fdc() always set current_drive and current_fd floppy: cleanup: get rid of current_reqD in favor of current_drive floppy: make sure to reset all FDCs upon resume() floppy: cleanup: do not iterate on current_fdc in do_floppy_init() floppy: cleanup: add a few comments about expectations in certain functions floppy: cleanup: do not iterate on current_fdc in DMA grab/release functions floppy: cleanup: make get_fdc_version() not rely on current_fdc anymore floppy: cleanup: make next_valid_format() not rely on current_drive anymore floppy: cleanup: make check_wp() not rely on current_{fdc,drive} anymore floppy: cleanup: make fdc_specify() not rely on current_{fdc,drive} anymore floppy: cleanup: make fdc_configure() not rely on current_fdc anymore floppy: cleanup: make perpendicular_mode() not rely on current_fdc anymore floppy: cleanup: make need_more_output() not rely on current_fdc anymore floppy: cleanup: make result() not rely on current_fdc anymore floppy: cleanup: make output_byte() not rely on current_fdc anymore floppy: cleanup: make wait_til_ready() not rely on current_fdc anymore ...
2020-05-12floppy: suppress UBSAN warning in setup_rw_floppy()Denis Efremov
UBSAN: array-index-out-of-bounds in drivers/block/floppy.c:1521:45 index 16 is out of range for type 'unsigned char [16]' Call Trace: ... setup_rw_floppy+0x5c3/0x7f0 floppy_ready+0x2be/0x13b0 process_one_work+0x2c1/0x5d0 worker_thread+0x56/0x5e0 kthread+0x122/0x170 ret_from_fork+0x35/0x40 From include/uapi/linux/fd.h: struct floppy_raw_cmd { ... unsigned char cmd_count; unsigned char cmd[16]; unsigned char reply_count; unsigned char reply[16]; ... } This out-of-bounds access is intentional. The command in struct floppy_raw_cmd may take up the space initially intended for the reply and the reply count. It is needed for long 82078 commands such as RESTORE, which takes 17 command bytes. Initial cmd size is not enough and since struct setup_rw_floppy is a part of uapi we check that cmd_count is in [0:16+1+16] in raw_cmd_copyin(). The patch adds union with original cmd,reply_count,reply fields and fullcmd field of equivalent size. The cmd accesses are turned to fullcmd where appropriate to suppress UBSAN warning. Link: https://lore.kernel.org/r/20200501134416.72248-5-efremov@linux.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: add defines for sizes of cmd & reply buffers of floppy_raw_cmdDenis Efremov
Use FD_RAW_CMD_SIZE, FD_RAW_REPLY_SIZE defines instead of magic numbers for cmd & reply buffers of struct floppy_raw_cmd. Remove local to floppy.c MAX_REPLIES define, as it is now FD_RAW_REPLY_SIZE. FD_RAW_CMD_FULLSIZE added as we allow command to also fill reply_count and reply fields. Link: https://lore.kernel.org/r/20200501134416.72248-4-efremov@linux.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: add FD_AUTODETECT_SIZE define for struct floppy_drive_paramsDenis Efremov
Use FD_AUTODETECT_SIZE for autodetect buffer size in struct floppy_drive_params instead of a magic number. Link: https://lore.kernel.org/r/20200501134416.72248-3-efremov@linux.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: use print_hex_dump() in setup_DMA()Denis Efremov
Remove pr_cont() and use print_hex_dump() in setup_DMA() to print the contents of the cmd buffer. Link: https://lore.kernel.org/r/20200501134416.72248-2-efremov@linux.com Suggested-by: Joe Perches <joe@perches.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: cleanup: make set_fdc() always set current_drive and current_fdWilly Tarreau
When called with a negative drive value, set_fdc() would stick to the current fdc (which was assumed to reflect the current_drive's FDC). We do not need this anymore as the last call place with a negative value was just addressed. Let's make this function always set both current_fdc and current_drive so that there's no more ambiguity. A few comments stating this were added to a few non-obvious places. Link: https://lore.kernel.org/r/20200410101904.14652-3-w@1wt.eu Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: cleanup: get rid of current_reqD in favor of current_driveWilly Tarreau
This macro equals -1 and is used as an alternative for current_drive when calling reschedule_timeout(), which in turn needs to remap it. This only adds obfuscation, let's simply use current_drive. Link: https://lore.kernel.org/r/20200410101904.14652-2-w@1wt.eu Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: make sure to reset all FDCs upon resume()Willy Tarreau
In floppy_resume() we don't properly reinitialize all FDCs, instead we reinitialize the current FDC once per available FDC because value -1 is passed to user_reset_fdc(). Let's simply save the current drive and properly reinitialize each FDC. Link: https://lore.kernel.org/r/20200410101904.14652-1-w@1wt.eu Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: cleanup: do not iterate on current_fdc in do_floppy_init()Willy Tarreau
There's no need to iterate on current_fdc in do_floppy_init() anymore, in the first case it's only used as an array index to access fdc_state[], so let's get rid of this confusing assignment. The second case is a bit trickier because user_reset_fdc() needs to already know current_fdc when called with drive==-1 due to this call chain: user_reset_fdc() lock_fdc() set_fdc() drive<0 ==> new_fdc = current_fdc Note that current_drive is not used in this code part and may even not match a unit belonging to current_fdc. Instead of passing -1 we can simply pass the first drive of the FDC being initialized, which is even cleaner as it will allow the function chain above to consistently assign both variables. Link: https://lore.kernel.org/r/20200410093023.14499-1-w@1wt.eu Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: cleanup: add a few comments about expectations in certain functionsWilly Tarreau
The locking in the driver is far from being obvious, with unlocking automatically happening at end of operations scheduled by interrupt, especially for the error paths where one does not necessarily expect that such an interrupt may be triggered. Let's add a few comments about what to expect at certain places to avoid misdetecting bugs which are not. Link: https://lore.kernel.org/r/20200331094054.24441-24-w@1wt.eu Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: cleanup: do not iterate on current_fdc in DMA grab/release functionsWilly Tarreau
Both floppy_grab_irq_and_dma() and floppy_release_irq_and_dma() used to iterate on the global variable while setting up or freeing resources. Now that they exclusively rely on functions which take the fdc as an argument, so let's not touch the global one anymore. Link: https://lore.kernel.org/r/20200331094054.24441-23-w@1wt.eu Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: cleanup: make get_fdc_version() not rely on current_fdc anymoreWilly Tarreau
Now the fdc is passed in argument so that the function does not use current_fdc anymore. Link: https://lore.kernel.org/r/20200331094054.24441-22-w@1wt.eu Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: cleanup: make next_valid_format() not rely on current_drive anymoreWilly Tarreau
Now the drive is passed in argument so that the function does not use current_drive anymore. Link: https://lore.kernel.org/r/20200331094054.24441-21-w@1wt.eu Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: cleanup: make check_wp() not rely on current_{fdc,drive} anymoreWilly Tarreau
Now the fdc and drive are passed in argument so that the function does not use current_fdc nor current_drive anymore. Link: https://lore.kernel.org/r/20200331094054.24441-20-w@1wt.eu Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: cleanup: make fdc_specify() not rely on current_{fdc,drive} anymoreWilly Tarreau
Now the fdc and drive are passed in argument so that the function does not use current_fdc nor current_drive anymore. Link: https://lore.kernel.org/r/20200331094054.24441-19-w@1wt.eu Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Denis Efremov <efremov@linux.com>
2020-05-12floppy: cleanup: make fdc_configure() not rely on current_fdc anymoreWilly Tarreau
Now the fdc is passed in argument so that the function does not use current_fdc anymore. Link: https://lore.kernel.org/r/20200331094054.24441-18-w@1wt.eu Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Denis Efremov <efremov@linux.com>