summaryrefslogtreecommitdiffstats
path: root/block
AgeCommit message (Collapse)Author
2019-11-14block: merge invalidate_partitions into rescan_partitionsChristoph Hellwig
A lot of the logic in invalidate_partitions and rescan_partitions is shared. Merge the two functions to simplify things. There is a small behavior change in that we now send the kevent change notice also if we were not invalidating but no partitions were found, which seems like the right thing to do. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14block: refactor rescan_partitionsChristoph Hellwig
Split out a helper that adds one single partition, and another one calling that dealing with the parsed_partitions state. This makes it much more obvious how we clean up all state and start again when using the rescan label. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14block, bfq: deschedule empty bfq_queues not referred by any processPaolo Valente
Since commit 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging"), to prevent the service guarantees of a bfq_queue from being violated, the bfq_queue may be left busy, i.e., scheduled for service, even if empty (see comments in __bfq_bfqq_expire() for details). But, if no process will send requests to the bfq_queue any longer, then there is no point in keeping the bfq_queue scheduled for service. In addition, keeping the bfq_queue scheduled for service, but with no process reference any longer, may cause the bfq_queue to be freed when descheduled from service. But this is assumed to never happen, and causes a UAF if it happens. This, in turn, caused crashes [1, 2]. This commit fixes this issue by descheduling an empty bfq_queue when it remains with not process reference. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1767539 [2] https://bugzilla.kernel.org/show_bug.cgi?id=205447 Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging") Reported-by: Chris Evich <cevich@redhat.com> Reported-by: Patrick Dung <patdung100@gmail.com> Reported-by: Thorsten Schubert <tschubert@bafh.org> Tested-by: Thorsten Schubert <tschubert@bafh.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue()John Garry
These functions are not referenced, so delete them. Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12block: rework zone reportingChristoph Hellwig
Avoid the need to allocate a potentially large array of struct blk_zone in the block layer by switching the ->report_zones method interface to a callback model. Now the caller simply supplies a callback that is executed on each reported zone, and private data for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12block: Remove partition support for zoned block devicesDamien Le Moal
No known partitioning tool supports zoned block devices, especially the host managed flavor with strong sequential write constraints. Furthermore, there are also no known user nor use cases for partitioned zoned block devices. This patch removes partition device creation for zoned block devices, which allows simplifying the processing of zone commands for zoned block devices. A warning is added if a partition table is found on the device. For report zones operations no zone sector information remapping is necessary anymore, simplifying the code. Of note is that remapping of zone reports for DM targets is still necessary as done by dm_remap_zone_report(). Similarly, remaping of a zone reset bio is not necessary anymore. Testing for the applicability of the zone reset all request also becomes simpler and only needs to check that the number of sectors of the requested zone range is equal to the disk capacity. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12block: Simplify report zones executionDamien Le Moal
All kernel users of blkdev_report_zones() as well as applications use through ioctl(BLKZONEREPORT) expect to potentially get less zone descriptors than requested. As such, the use of the internal report zones command execution loop implemented by blk_report_zones() is not necessary and can even be harmful to performance by causing the execution of inefficient small zones report command to service the reminder of a requested zone array. This patch removes blk_report_zones(), simplifying the code. Also remove a now incorrect comment in dm_blk_report_zones(). Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Javier Gonzalez <javier@javigon.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12block: cleanup the !zoned case in blk_revalidate_disk_zonesChristoph Hellwig
blk_revalidate_disk_zones is never called for non-zoned devices. Just return early and warn instead of trying to handle this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12block: Enhance blk_revalidate_disk_zones()Damien Le Moal
For ZBC and ZAC zoned devices, the scsi driver revalidation processing implemented by sd_revalidate_disk() includes a call to sd_zbc_read_zones() which executes a full disk zone report used to check that all zones of the disk are the same size. This processing is followed by a call to blk_revalidate_disk_zones(), used to initialize the device request queue zone bitmaps (zone type and zone write lock bitmaps). To do so, blk_revalidate_disk_zones() also executes a full device zone report to obtain zone types. As a result, the entire zoned block device revalidation process includes two full device zone report. By moving the zone size checks into blk_revalidate_disk_zones(), this process can be optimized to a single full device zone report, leading to shorter device scan and revalidation times. This patch implements this optimization, reducing the original full device zone report implemented in sd_zbc_check_zones() to a single, small, report zones command execution to obtain the size of the first zone of the device. Checks whether all zones of the device are the same size as the first zone size are moved to the generic blk_check_zone() function called from blk_revalidate_disk_zones(). This optimization also has the following benefits: 1) fewer memory allocations in the scsi layer during disk revalidation as the potentailly large buffer for zone report execution is not needed. 2) Implement zone checks in a generic manner, reducing the burden on device driver which only need to obtain the zone size and check that this size is a power of 2 number of LBAs. Any new type of zoned block device will benefit from this. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-12block: check bi_size overflow before mergeJunichi Nomura
__bio_try_merge_page() may merge a page to bio without bio_full() check and cause bi_size overflow. The overflow typically ends up with sd_init_command() warning on zero segment request with call trace like this: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1986 at drivers/scsi/scsi_lib.c:1025 scsi_init_io+0x156/0x180 CPU: 2 PID: 1986 Comm: kworker/2:1H Kdump: loaded Not tainted 5.4.0-rc7 #1 Workqueue: kblockd blk_mq_run_work_fn RIP: 0010:scsi_init_io+0x156/0x180 RSP: 0018:ffffa11487663bf0 EFLAGS: 00010246 RAX: 00000000002be0a0 RBX: ffff8e6e9ff30118 RCX: 0000000000000000 RDX: 00000000ffffffe1 RSI: 0000000000000000 RDI: ffff8e6e9ff30118 RBP: ffffa11487663c18 R08: ffffa11487663d28 R09: ffff8e6e9ff30150 R10: 0000000000000001 R11: 0000000000000000 R12: ffff8e6e9ff30000 R13: 0000000000000001 R14: ffff8e74a1cf1800 R15: ffff8e6e9ff30000 FS: 0000000000000000(0000) GS:ffff8e6ea7680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fff18cf0fe8 CR3: 0000000659f0a001 CR4: 00000000001606e0 Call Trace: sd_init_command+0x326/0xb40 [sd_mod] scsi_queue_rq+0x502/0xaa0 ? blk_mq_get_driver_tag+0xe7/0x120 blk_mq_dispatch_rq_list+0x256/0x5a0 ? elv_rb_del+0x24/0x30 ? deadline_remove_request+0x7b/0xc0 blk_mq_do_dispatch_sched+0xa3/0x140 blk_mq_sched_dispatch_requests+0xfb/0x170 __blk_mq_run_hw_queue+0x81/0x130 blk_mq_run_work_fn+0x1b/0x20 process_one_work+0x179/0x390 worker_thread+0x4f/0x3e0 kthread+0x105/0x140 ? max_active_store+0x80/0x80 ? kthread_bind+0x20/0x20 ret_from_fork+0x35/0x40 ---[ end trace f9036abf5af4a4d3 ]--- blk_update_request: I/O error, dev sdd, sector 2875552 op 0x1:(WRITE) flags 0x0 phys_seg 0 prio class 0 XFS (sdd1): writeback error on sector 2875552 __bio_try_merge_page() should check the overflow before actually doing merge. Fixes: 07173c3ec276c ("block: enable multipage bvecs") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-08block: split bio if the only bvec's length is > SZ_4KMing Lei
64K PAGE_SIZE is popular on ARM64 or other ARCHs, and 64K has been big enough to break some devices probably, so change the logic to split bio if the only bvec's length is > SZ_4K instead of PAGE_SIZE. Fixes: fa5322872187 (block: avoid blk_bio_segment_split for small I/O operations) Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-08block: still try to split bio if the bvec crosses pagesMing Lei
Some device may set segment boundary as PAGE_SIZE - 1. If the bvec crosses pages, and meantime its length is <= PAGE_SIZE, we still need to split the bvec into 2 segments. Fixes this issue by still splitting bio if the single bvec crosses pages. Reported-by: kernel test robot <lkp@intel.com> Fixes: fa5322872187 (block: avoid blk_bio_segment_split for small I/O operations) Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTATTejun Heo
blkg_rwstat is now only used by bfq-iosched and blk-throtl when on cgroup1. Let's move it into its own files and gate it behind a config option. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07blk-cgroup: reimplement basic IO stats using cgroup rstatTejun Heo
blk-cgroup has been using blkg_rwstat to track basic IO stats. Unfortunately, reading recursive stats scales badly as itinvolves walking all descendants. On systems with a huge number of cgroups (dead or alive), this can lead to substantial CPU cost when reading IO stats. This patch reimplements basic IO stats using cgroup rstat which uses more memory but makes recursive stat reading O(# descendants which have been active since last reading) instead of O(# descendants). * blk-cgroup core no longer uses sync/async stats. Introduce new stat enums - BLKG_IOSTAT_{READ|WRITE|DISCARD}. * Add blkg_iostat[_set] which encapsulates byte and io stats, last values for propagation delta calculation and u64_stats_sync for correctness on 32bit archs. * Update the new percpu stat counters directly and implement blkcg_rstat_flush() to implement propagation. * blkg_print_stat() can now bring the stats up to date by calling cgroup_rstat_flush() and print them instead of directly summing up all descendants. * It now allocates 96 bytes per cpu. It used to be 40 bytes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Dan Schatzberg <dschatzberg@fb.com> Cc: Daniel Xu <dlxu@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07blk-cgroup: remove now unused blkg_print_stat_{bytes|ios}_recursive()Tejun Heo
These don't have users anymore. Remove them. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07blk-throtl: stop using blkg->stat_bytes and ->stat_iosTejun Heo
When used on cgroup1, blk-throtl uses the blkg->stat_bytes and ->stat_ios from blk-cgroup core to populate four stat knobs. blk-cgroup core is moving away from blkg_rwstat to improve scalability and won't be able to support this usage. It isn't like the sharing gains all that much. Let's break them out to dedicated rwstat counters which are updated when on cgroup1. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07bfq-iosched: stop using blkg->stat_bytes and ->stat_iosTejun Heo
When used on cgroup1, bfq uses the blkg->stat_bytes and ->stat_ios from blk-cgroup core to populate six stat knobs. blk-cgroup core is moving away from blkg_rwstat to improve scalability and won't be able to support this usage. It isn't like the sharing gains all that much. Let's break it out to dedicated rwstat counters which are updated when on cgroup1. This makes use of bfqg_*rwstat*() helpers outside of CONFIG_BFQ_CGROUP_DEBUG. Move them out. v2: Compile fix when !CONFIG_BFQ_CGROUP_DEBUG. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07bfq-iosched: relocate bfqg_*rwstat*() helpersTejun Heo
Collect them right under #ifdef CONFIG_BFQ_CGROUP_DEBUG. The next patch will use them from !DEBUG path and this makes it easy to move them out of the ifdef block. This is pure code reorganization. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07Merge branch 'for-linus' into for-5.5/blockJens Axboe
Pull on for-linus to resolve what otherwise would have been a conflict with the cgroups rstat patchset from Tejun. * for-linus: (942 commits) blkcg: make blkcg_print_stat() print stats only for online blkgs nvme: change nvme_passthru_cmd64 to explicitly mark rsvd nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths nvme-rdma: fix a segmentation fault during module unload iocost: don't nest spin_lock_irq in ioc_weight_write() io_uring: ensure we clear io_kiocb->result before each issue um-ubd: Entrust re-queue to the upper layers nvme-multipath: remove unused groups_only mode in ana log nvme-multipath: fix possible io hang after ctrl reconnect io_uring: don't touch ctx in setup after ring fd install io_uring: Fix leaked shadow_req Linux 5.4-rc5 riscv: cleanup do_trap_break nbd: verify socket is supported during setup ata: libahci_platform: Fix regulator_get_optional() misuse nbd: handle racing with error'ed out commands nbd: protect cmd->status with cmd->lock io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD io_uring: used cached copies of sq->dropped and cq->overflow ARM: dts: stm32: relax qspi pins slew-rate for stm32mp157 ...
2019-11-07block: add zone open, close and finish ioctl supportAjay Joshi
Introduce three new ioctl commands BLKOPENZONE, BLKCLOSEZONE and BLKFINISHZONE to allow applications to control the condition of zones on a zoned block device through the execution of the REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH operations. Contains contributions from Matias Bjorling, Hans Holmberg, Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig. Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07block: add zone open, close and finish operationsAjay Joshi
Zoned block devices (ZBC and ZAC devices) allow an explicit control over the condition (state) of zones. The operations allowed are: * Open a zone: Transition to open condition to indicate that a zone will actively be written * Close a zone: Transition to closed condition to release the drive resources used for writing to a zone * Finish a zone: Transition an open or closed zone to the full condition to prevent write operations To enable this control for in-kernel zoned block device users, define the new request operations REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH as well as the generic function blkdev_zone_mgmt() for submitting these operations on a range of zones. This results in blkdev_reset_zones() removal and replacement with this new zone magement function. Users of blkdev_reset_zones() (f2fs and dm-zoned) are updated accordingly. Contains contributions from Matias Bjorling, Hans Holmberg, Dmitry Fomichev, Keith Busch, Damien Le Moal and Christoph Hellwig. Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07block: Simplify REQ_OP_ZONE_RESET_ALL handlingDamien Le Moal
There is no need for the function __blkdev_reset_all_zones() as REQ_OP_ZONE_RESET_ALL can be handled directly in blkdev_reset_zones() bio loop with an early break from the loop. This patch removes this function and modifies blkdev_reset_zones(), simplifying the code. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07block: Remove REQ_OP_ZONE_RESET pluggingDamien Le Moal
REQ_OP_ZONE_RESET operations cannot be merged as these bios and requests do not have a size and are never sequential due to the zone start sector position required for their execution. As a result, there is no point in using a plug around blkdev_reset_zones() bio issuing loop. This patch removes this unnecessary plugging. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Javier González <javier@javigon.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06blkcg: make blkcg_print_stat() print stats only for online blkgsTejun Heo
blkcg_print_stat() iterates blkgs under RCU and doesn't test whether the blkg is online. This can call into pd_stat_fn() on a pd which is still being initialized leading to an oops. The heaviest operation - recursively summing up rwstat counters - is already done while holding the queue_lock. Expand queue_lock to cover the other operations and skip the blkg if it isn't online yet. The online state is protected by both blkcg and queue locks, so this guarantees that only online blkgs are processed. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Roman Gushchin <guro@fb.com> Cc: Josef Bacik <jbacik@fb.com> Fixes: 903d23f0a354 ("blk-cgroup: allow controllers to output their own stats") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-06block: Warn if elevator= parameter is usedJan Kara
With transition to blk-mq, the elevator= kernel argument was removed as it makes less and less sense with the current variety of devices. Since this may surprise some users and there are advices on the Internet that still suggest to use it, let's at least warn if the parameter is used. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04block: avoid blk_bio_segment_split for small I/O operationsChristoph Hellwig
__blk_queue_split() adds significant overhead for small I/O operations. Add a shortcut to avoid it for cases where we know we never need to split. Based on a patch from Ming Lei. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04blk-mq: make sure that line break can be printedMing Lei
8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores") avoids sysfs buffer overflow, and reserves one character for line break. However, the last snprintf() doesn't get correct 'size' parameter passed in, so fixed it. Fixes: 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04block: sed-opal: Introduce Opal Datastore UIDRevanth Rajashekar
This patch introduces Opal Datastore UID. The generic read/write table ioctl can use this UID to access the Opal Datastore. Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04block: sed-opal: Add support to read/write opal tables genericallyRevanth Rajashekar
This feature gives the user RW access to any opal table with admin1 authority. The flags described in the new structure determines if the user wants to read/write the data. Flags are checked for valid values in order to allow future features to be added to the ioctl. The user can provide the desired table's UID. Also, the ioctl provides a size and offset field and internally will loop data accesses to return the full data block. Read overrun is prevented by the initiator's sec_send_recv() backend. The ioctl provides a private field with the intention to accommodate any future expansions to the ioctl. Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04block: sed-opal: Generalizing write data to any opal tableRevanth Rajashekar
This patch refactors the existing "write_shadowmbr" func and creates a new generalized function "generic_table_write_data", to write data to any opal table. Also, a few cleanups are included in this patch. Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-02blk-mq: avoid sysfs buffer overflow with too many CPU coresMing Lei
It is reported that sysfs buffer overflow can be triggered if the system has too many CPU cores(>841 on 4K PAGE_SIZE) when showing CPUs of hctx via /sys/block/$DEV/mq/$N/cpu_list. Use snprintf to avoid the potential buffer overflow. This version doesn't change the attribute format, and simply stops showing CPU numbers if the buffer is going to overflow. Cc: stable@vger.kernel.org Fixes: 676141e48af7("blk-mq: don't dump CPU -> hw queue map on driver load") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-01blk-mq: Make blk_mq_run_hw_queue() return voidJohn Garry
Since commit 97889f9ac24f ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()"), the return value of blk_mq_run_hw_queue() is never checked, so make it return void, which very marginally simplifies the code. Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-31iocost: don't nest spin_lock_irq in ioc_weight_write()Dan Carpenter
This code causes a static analysis warning: block/blk-iocost.c:2113 ioc_weight_write() error: double lock 'irq' We disable IRQs in blkg_conf_prep() and re-enable them in blkg_conf_finish(). IRQ disable/enable should not be nested because that means the IRQs will be enabled at the first unlock instead of the second one. Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25blk-mq: remove needless goto from blk_mq_get_driver_tagAndré Almeida
The only usage of the label "done" is when (rq->tag != -1) at the beginning of the function. Rather than jumping to label, we can just remove this label and execute the code at the "if". Besides that, the code that would be executed after the label "done" is the return of the logical expression (rq->tag != -1) but since we are already inside the if, we now that this is true. Remove the label and replace the goto with the proper result of the label. Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25block: Reduce the amount of memory used for tag setsBart Van Assche
Instead of allocating an array of size nr_cpu_ids for set->tags, allocate an array of size set->nr_hw_queues. This patch improves behavior that was introduced by commit 868f2f0b7206 ("blk-mq: dynamic h/w context count"). Reallocating tag sets from inside __blk_mq_update_nr_hw_queues() is safe because: - All request queues that share the tag sets are frozen before the tag sets are reallocated. - blk_mq_queue_tag_busy_iter() holds q->q_usage_counter while active and hence is serialized against __blk_mq_update_nr_hw_queues(). Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25block: Reduce the amount of memory required per request queueBart Van Assche
Instead of always allocating at least nr_cpu_ids hardware queues per request queue, reallocate q->queue_hw_ctx if it has to grow. This patch improves behavior that was introduced by commit 868f2f0b7206 ("blk-mq: dynamic h/w context count"). Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25block: Remove the synchronize_rcu() call from __blk_mq_update_nr_hw_queues()Bart Van Assche
Since the blk_mq_{,un}freeze_queue() calls in __blk_mq_update_nr_hw_queues() already serialize __blk_mq_update_nr_hw_queues() against blk_mq_queue_tag_busy_iter(), the synchronize_rcu() call in __blk_mq_update_nr_hw_queues() is not necessary. Hence remove it. Note: the synchronize_rcu() call in __blk_mq_update_nr_hw_queues() was introduced by commit f5bbbbe4d635 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter"). Commit 530ca2c9bd69 ("blk-mq: Allow blocking queue tag iter callbacks") removed the rcu_read_{,un}lock() calls that correspond to the synchronize_rcu() call in __blk_mq_update_nr_hw_queues(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-23compat_ioctl: reimplement SG_IO handlingArnd Bergmann
There are two code locations that implement the SG_IO ioctl: the old sg.c driver, and the generic scsi_ioctl helper that is in turn used by multiple drivers. To eradicate the old compat_ioctl conversion handler for the SG_IO command, I implement a readable pair of put_sg_io_hdr() /get_sg_io_hdr() helper functions that can be used for both compat and native mode, and then I call this from both drivers. For the iovec handling, there is already a compat_import_iovec() function that can simply be called in place of import_iovec(). To avoid having to pass the compat/native state through multiple indirections, I mark the SG_IO command itself as compatible in fs/compat_ioctl.c and use in_compat_syscall() to figure out where we are called from. As a side-effect of this, the sg.c driver now also accepts the 32-bit sg_io_hdr format in compat mode using the read/write interface, not just ioctl. This should improve compatiblity with old 32-bit binaries, but it would break if any application intentionally passes the 64-bit data structure in compat mode here. Steffen Maier helped debug an issue in an earlier version of this patch. Cc: Steffen Maier <maier@linux.ibm.com> Cc: linux-scsi@vger.kernel.org Cc: Doug Gilbert <dgilbert@interlog.com> Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-15blk-rq-qos: fix first node deletion of rq_qos_del()Tejun Heo
rq_qos_del() incorrectly assigns the node being deleted to the head if it was the first on the list in the !prev path. Fix it by iterating with ** instead. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-15blkcg: Fix multiple bugs in blkcg_activate_policy()Tejun Heo
blkcg_activate_policy() has the following bugs. * cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however, blkcg_activate_policy() ends up using pd's allocated for the root blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs can be passed in pd's which are allocated for the root blkcg. For blk-iocost, this means that ->pd_init_fn() can write beyond the end of the allocated object as it determines the length of the flex array at the end based on the blkcg's nesting level. * Each pd is initialized as they get allocated. If alloc fails, the policy will get freed with pd's initialized on it. * After the above partial failure, the partial pds are not freed. This patch fixes all the above issues by * Restructuring blkcg_activate_policy() so that alloc and init passes are separate. Init takes place only after all allocs succeeded and on failure all allocated pds are freed. * Unifying and fixing the cleanup of the remaining pd_prealloc. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-14block: Fix elv_support_iosched()Damien Le Moal
A BIO based request queue does not have a tag_set, which prevent testing for the flag BLK_MQ_F_NO_SCHED indicating that the queue does not require an elevator. This leads to an incorrect initialization of a default elevator in some cases such as BIO based null_blk (queue_mode == BIO) with zoned mode enabled as the default elevator in this case is mq-deadline instead of "none". Fix this by testing for a NULL queue mq_ops field which indicates that the queue is BIO based and should not have an elevator. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-10block: account statistics for passthrough requestsLogan Gunthorpe
Presently, passthrough requests are not accounted for because blk_do_io_stat() expressly rejects them. Based on some digging in the history, this doesn't seem like a concious decision but one that evolved from the change from blk_fs_request() to blk_rq_is_passthrough(). To support this, call blk_account_io_start() in blk_execute_rq_nowait() and remove the passthrough check in blk_do_io_stat(). Link: https://lore.kernel.org/linux-block/20191010100526.GA27209@lst.de/ Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07blk-stat: Optimise blk_stat_add()Pavel Begunkov
blk_stat_add() calls {get,put}_cpu_ptr() in a loop, which entails overhead of disabling/enabling preemption. The loop is under RCU (i.e.short) anyway, so do get_cpu() in advance. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07blk-mq: Embed counters into struct mq_inflightPavel Begunkov
Store inflight counters immediately in struct mq_inflight. That's type-safer and removes extra indirection. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07blk-mq: Reuse callback in blk_mq_in_flight*()Pavel Begunkov
Reuse a more generic callback in both blk_mq_in_flight() and blk_mq_in_flight_rw(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07blk-mq: Inline status checkersPavel Begunkov
blk_mq_request_completed() and blk_mq_request_started() are short, inline it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07block: Reduce sysfs_lock locking inside blk_cleanup_queue()Bart Van Assche
Since blk_cleanup_queue() is called after blk_unregister_queue() and since that last function removes all sysfs attributes, serializing any code in blk_cleanup_queue() against sysfs callback methods nor against I/O scheduler changes is necessary. Hence remove the syfs_lock locking calls from the start of blk_cleanup_queue(). Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07block: Remove "dying" checks from sysfs callbacksBart Van Assche
Block drivers must call del_gendisk() before blk_cleanup_queue(). del_gendisk() calls kobject_del() and kobject_del() waits until any ongoing sysfs callback functions have finished. In other words, the sysfs callback functions won't be called for a queue in the dying state. Hence remove the "dying" checks from the sysfs callback functions. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07block: Remove request_queue.nr_queuesBart Van Assche
Commit 897bb0c7f1ea ("blk-mq: Use proper cpumask iterator"; v4.6) removed the last use of request_queue.nr_queues from outside blk_mq_init_allocate_queue(). Remove this member variable to make struct request_queue smaller. This patch does not change any functionality. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07block: Fix three kernel-doc warningsBart Van Assche
Fix the following kernel-doc warnings: block/t10-pi.c:242: warning: Function parameter or member 'rq' not described in 't10_pi_type3_prepare' block/t10-pi.c:249: warning: Function parameter or member 'rq' not described in 't10_pi_type3_complete' block/t10-pi.c:249: warning: Function parameter or member 'nr_bytes' not described in 't10_pi_type3_complete' Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Fixes: 54d4e6ab91eb ("block: centralize PI remapping logic to the block layer") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>