summaryrefslogtreecommitdiffstats
path: root/drivers/nvme
AgeCommit message (Collapse)Author
2020-09-27nvme-pci: allocate separate interrupt for the reserved non-polled I/O queueJeffle Xu
One queue will be reserved for non-polled IO when nvme.poll_queues is greater or equal than the number of IO queues that the nvme controller can provide. Currently the reserved queue for non-polled IO will reuse the interrupt used by admin queue in this case, e.g, vector 0. This can work and the performance may not be an issue since the admin queue is used unfrequently. However this behaviour may be inconsistent with that when nvme.poll_queues is smaller than the number of IO queues available. Thus allocate separate interrupt for this reserved queue, and thus make the behaviour consistent. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> [hch: minor cleanups, mostly to the pre-existing surrounding code] Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-27nvme: fix error handling in nvme_ns_report_zonesChristoph Hellwig
nvme_submit_sync_cmd can return positive NVMe error codes in addition to the negative Linux error code, which are currently ignored. Fix this by removing __nvme_ns_report_zones and handling the errors from nvme_submit_sync_cmd in the caller instead of multiplexing the return value and the number of zones reported into a single return value. Fixes: 240e6ee272c0 ("nvme: support for zoned namespaces") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-09-27nvmet-fc: fix missing check for no hostport structJames Smart
A hostport port pointer is allowed to be NULL as it is not allocated if the lldd does not support the new interfaces for NVME LS request support. The hostport free routine validates the handle but forgot to validate the hostport pointer. Validate the hostport pointer before using it to validate the handle. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-27nvmet: add passthru ZNS supportChaitanya Kulkarni
In the default passthru implementation NVMeOF target passthru ctrl is not capable of handling Zoned Namespaces (ZNS). Update the nvmet_parse_pasthru_admin_cmd() to allow NVME_ID_CNS_CS_CTRL/NVME_CIS_ZNS and NVME_ID_CNS_CS_NS/NVME_CIS_ZNS. With this addition NVMeOF Passthru allows Zoned Namespaces. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-27nvmet: handle keep-alive timer when kato is modified by a set features cmdAmit Engel
A user may modify the kato by a set features cmd. To properly deal with races or a kato value of 0 (no keep alive enabled) change nvmet_set_feat_kato to first disable the timer, then set the value and then re-enable the timer. Signed-off-by: Amit Engel <amit.engel@dell.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-27nvmet-tcp: have queue io_work context run on sock incoming cpuMark Wunderlich
No real good need to spread queues artificially. Usually the target will serve multiple hosts, and it's better to run on the socket incoming cpu for better affinitization rather than spread queues on all online cpus. We rely on RSS to spread the work around sufficiently. Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-27nvme-pci: Move enumeration by class to be last in the tableAndy Shevchenko
It's unusual that we have enumeration by class in the middle of the table. It might potentially be problematic in the future if we add another entry after it. So, move class matching entry to be the last in the ID table. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-27nvme: use an xarray to lookup the Commands Supported and Effects logChaitanya Kulkarni
When using linked list we have to open code the locking, search, and destroy operations with the loops even if data structure doesn't fall into the fast path. One of the main advantage of having XArray to store, search, and remove items is that it handles all the locking by itself, avoids the loops when using linked lists, provides clear API to replace the linked list's search and destroy loops. This patch replaces the ctrl->cel list with XArray and removes :- a. Extra code needed for the linked list for ctrl->cel item management such as nvme_find_cel(). b. Destroy loop in the nvme_free_ctrl(). c. Explicit insertion locking in the nvme_get_effects_log(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-27nvme: lift the file open code from nvme_ctrl_get_by_pathChaitanya Kulkarni
Lift opening the file open/close code from nvme_ctrl_get_by_path into the caller, just keeping a simple nvme_ctrl_from_file() helper. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> [hch: refactored a bit, split the bug fixes into a separate prep patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
2020-09-26Merge tag 'block-5.9-2020-09-25' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "NVMe pull request from Christoph, and removal of a dead define. - fix error during controller probe that cause double free irqs (Keith Busch) - FC connection establishment fix (James Smart) - properly handle completions for invalid tags (Xianting Tian) - pass the correct nsid to the command effects and supported log (Chaitanya Kulkarni)" * tag 'block-5.9-2020-09-25' of git://git.kernel.dk/linux-block: block: remove unused BLK_QC_T_EAGAIN flag nvme-core: don't use NVME_NSID_ALL for command effects and supported log nvme-fc: fail new connections to a deleted host or remote port nvme-pci: fix NULL req in completion handler nvme: return errors for hwmon init
2020-09-24Merge branch 'for-5.10/block' into for-5.10/driversJens Axboe
* for-5.10/block: (140 commits) bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag bdi: invert BDI_CAP_NO_ACCT_WB bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag mm: use SWP_SYNCHRONOUS_IO more intelligently bdi: remove BDI_CAP_SYNCHRONOUS_IO bdi: remove BDI_CAP_CGROUP_WRITEBACK block: lift setting the readahead size into the block layer md: update the optimal I/O size on reshape bdi: initialize ->ra_pages and ->io_pages in bdi_init aoe: set an optimal I/O size bcache: inherit the optimal I/O size drbd: remove dead code in device_to_statistics fs: remove the unused SB_I_MULTIROOT flag block: mark blkdev_get static PM: mm: cleanup swsusp_swap_check mm: split swap_type_of PM: rewrite is_hibernate_resume_dev to not require an inode mm: cleanup claim_swapfile ocfs2: cleanup o2hb_region_dev_store dasd: cleanup dasd_scan_partitions ...
2020-09-24bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flagChristoph Hellwig
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the backing_dev_info shared between the block drivers and the writeback code. To help untangling the dependency replace it with a queue flag and a superblock flag derived from it. This also helps with the case of e.g. a file system requiring stable writes due to its own checksumming, but not forcing it on other users of the block device like the swap code. One downside is that we an't support the stable_pages_required bdi attribute in sysfs anymore. It is replaced with a queue attribute which also is writable for easier testing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24block: lift setting the readahead size into the block layerChristoph Hellwig
Drivers shouldn't really mess with the readahead size, as that is a VM concept. Instead set it based on the optimal I/O size by lifting the algorithm from the md driver when registering the disk. Also set bdi->io_pages there as well by applying the same scheme based on max_sectors. To ensure the limits work well for stacking drivers a new helper is added to update the readahead limits from the block limits, which is also called from disk_stack_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23nvme-core: don't use NVME_NSID_ALL for command effects and supported logChaitanya Kulkarni
In the function nvme_get_effects_log() it uses NVME_NSID_ALL which has namespace scope. The command effect log page is controller specific. Replace NVME_NSID_ALL with 0x00 which specifies the controller scope instead of namespace scope. Fixes: 84fef62d135b ("nvme: check admin passthru command effects") Link: https://bugzilla.kernel.org/show_bug.cgi?id=209287 Reported-by: Huai-Cheng Kuo <hh81478072@gmail.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-22Merge tag 'block-5.9-2020-09-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A few NVMe fixes, and a dasd write zero fix" * tag 'block-5.9-2020-09-22' of git://git.kernel.dk/linux-block: nvmet: get transport reference for passthru ctrl nvme-core: get/put ctrl and transport module in nvme_dev_open/release() nvme-tcp: fix kconfig dependency warning when !CRYPTO nvme-pci: disable the write zeros command for Intel 600P/P3100 s390/dasd: Fix zero write for FBA devices
2020-09-22nvme-fc: fail new connections to a deleted host or remote portJames Smart
The lldd may have made calls to delete a remote port or local port and the delete is in progress when the cli then attempts to create a new controller. Currently, this proceeds without error although it can't be very successful. Fix this by validating that both the host port and remote port are present when a new controller is to be created. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-22nvme-pci: fix NULL req in completion handlerXianting Tian
Currently, we use nvmeq->q_depth as the upper limit for a valid tag in nvme_handle_cqe(), it is not correct. Because the available tag number is recorded in tagset, which is not equal to nvmeq->q_depth. The nvme driver registers interrupts for queues before initializing the tagset, because it uses the number of successful request_irq() calls to configure the tagset parameters. This allows a race condition with the current tag validity check if the controller happens to produce an interrupt with a corrupted CQE before the tagset is initialized. Replace the driver's indirect tag check with the one already provided by the block layer. Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-22nvme: return errors for hwmon initKeith Busch
Initializing the nvme hwmon retrieves a log from the controller. If the controller is broken, we need to return the appropriate error so that subsequent initialization doesn't attempt to continue. Reported-by: Tong Zhang <ztong0001@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-17nvmet: get transport reference for passthru ctrlChristoph Hellwig
Grab a reference to the transport driver to ensure it can't be unloaded while a passthrough controller is active. Fixes: c1fef73f793b ("nvmet: add passthru code to process commands") Reported-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
2020-09-17nvme-core: get/put ctrl and transport module in nvme_dev_open/release()Chaitanya Kulkarni
Get and put the reference to the ctrl in the nvme_dev_open() and nvme_dev_release() before and after module get/put for ctrl in char device file operations. Introduce char_dev relase function, get/put the controller and module which allows us to fix the potential Oops which can be easily reproduced with a passthru ctrl (although the problem also exists with pure user access): Entering kdb (current=0xffff8887f8290000, pid 3128) on processor 30 Oops: (null) due to oops @ 0xffffffffa01019ad CPU: 30 PID: 3128 Comm: bash Tainted: G W OE 5.8.0-rc4nvme-5.9+ #35 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.4 RIP: 0010:nvme_free_ctrl+0x234/0x285 [nvme_core] Code: 57 10 a0 e8 73 bf 02 e1 ba 3d 11 00 00 48 c7 c6 98 33 10 a0 48 c7 c7 1d 57 10 a0 e8 5b bf 02 e1 8 RSP: 0018:ffffc90001d63de0 EFLAGS: 00010246 RAX: ffffffffa05c0440 RBX: ffff8888119e45a0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff8888177e9550 RDI: ffff8888119e43b0 RBP: ffff8887d4768000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: ffffc90001d63c90 R12: ffff8888119e43b0 R13: ffff8888119e5108 R14: dead000000000100 R15: ffff8888119e5108 FS: 00007f1ef27b0740(0000) GS:ffff888817600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa05c0470 CR3: 00000007f6bee000 CR4: 00000000003406e0 Call Trace: device_release+0x27/0x80 kobject_put+0x98/0x170 nvmet_passthru_ctrl_disable+0x4a/0x70 [nvmet] nvmet_passthru_enable_store+0x4c/0x90 [nvmet] configfs_write_file+0xe6/0x150 vfs_write+0xba/0x1e0 ksys_write+0x5f/0xe0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f1ef1eb2840 Code: Bad RIP value. RSP: 002b:00007fffdbff0eb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1ef1eb2840 RDX: 0000000000000002 RSI: 00007f1ef27d2000 RDI: 0000000000000001 RBP: 00007f1ef27d2000 R08: 000000000000000a R09: 00007f1ef27b0740 R10: 0000000000000001 R11: 0000000000000246 R12: 00007f1ef2186400 R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000 With this patch fix we take the module ref count in nvme_dev_open() and release that ref count in newly introduced nvme_dev_release(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-15nvme-tcp: fix kconfig dependency warning when !CRYPTONecip Fazil Yildiran
When NVME_TCP is enabled and CRYPTO is disabled, it results in the following Kbuild warning: WARNING: unmet direct dependencies detected for CRYPTO_CRC32C Depends on [n]: CRYPTO [=n] Selected by [y]: - NVME_TCP [=y] && INET [=y] && BLK_DEV_NVME [=y] The reason is that NVME_TCP selects CRYPTO_CRC32C without depending on or selecting CRYPTO while CRYPTO_CRC32C is subordinate to CRYPTO. Honor the kconfig menu hierarchy to remove kconfig dependency warnings. Fixes: 79fd751d61aa ("nvme: tcp: selects CRYPTO_CRC32C for nvme-tcp") Signed-off-by: Necip Fazil Yildiran <fazilyildiran@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-15nvme-pci: disable the write zeros command for Intel 600P/P3100David Milburn
The write zeros command does not work with 4k range. bash-4.4# ./blkdiscard /dev/nvme0n1p2 bash-4.4# strace -efallocate xfs_io -c "fzero 536895488 2048" /dev/nvme0n1p2 fallocate(3, FALLOC_FL_ZERO_RANGE, 536895488, 2048) = 0 +++ exited with 0 +++ bash-4.4# dd bs=1 if=/dev/nvme0n1p2 skip=536895488 count=512 | hexdump -C 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 00000200 bash-4.4# ./blkdiscard /dev/nvme0n1p2 bash-4.4# strace -efallocate xfs_io -c "fzero 536895488 4096" /dev/nvme0n1p2 fallocate(3, FALLOC_FL_ZERO_RANGE, 536895488, 4096) = 0 +++ exited with 0 +++ bash-4.4# dd bs=1 if=/dev/nvme0n1p2 skip=536895488 count=512 | hexdump -C 00000000 5c 61 5c b0 96 21 1b 5e 85 0c 07 32 9c 8c eb 3c |\a\..!.^...2...<| 00000010 4a a2 06 ca 67 15 2d 8e 29 8d a8 a0 7e 46 8c 62 |J...g.-.)...~F.b| 00000020 bb 4c 6c c1 6b f5 ae a5 e4 a9 bc 93 4f 60 ff 7a |.Ll.k.......O`.z| Reported-by: Eric Sandeen <esandeen@redhat.com> Signed-off-by: David Milburn <dmilburn@redhat.com> Tested-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-11Merge tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Fix a regression in bdev partition locking (Christoph) - NVMe pull request from Christoph: - cancel async events before freeing them (David Milburn) - revert a broken race fix (James Smart) - fix command processing during resets (Sagi Grimberg) - Fix a kyber crash with requeued flushes (Omar) - Fix __bio_try_merge_page() same_page error for no merging (Ritesh) * tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block: block: Set same_page to false in __bio_try_merge_page if ret is false nvme-fabrics: allow to queue requests for live queues block: only call sched requeue_request() for scheduled requests nvme-tcp: cancel async events before freeing event struct nvme-rdma: cancel async events before freeing event struct nvme-fc: cancel async events before freeing event struct nvme: Revert: Fix controller creation races with teardown flow block: restore a specific error code in bdev_del_partition
2020-09-09nvme-fabrics: allow to queue requests for live queuesSagi Grimberg
Right now we are failing requests based on the controller state (which is checked inline in nvmf_check_ready) however we should definitely accept requests if the queue is live. When entering controller reset, we transition the controller into NVME_CTRL_RESETTING, and then return BLK_STS_RESOURCE for non-mpath requests (have blk_noretry_request set). This is also the case for NVME_REQ_USER for the wrong reason. There shouldn't be any reason for us to reject this I/O in a controller reset. We do want to prevent passthru commands on the admin queue because we need the controller to fully initialize first before we let user passthru admin commands to be issued. In a non-mpath setup, this means that the requests will simply be requeued over and over forever not allowing the q_usage_counter to drop its final reference, causing controller reset to hang if running concurrently with heavy I/O. Fixes: 35897b920c8a ("nvme-fabrics: fix and refine state checks in __nvmf_check_ready") Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-08nvme-tcp: cancel async events before freeing event structDavid Milburn
Cancel async event work in case async event has been queued up, and nvme_tcp_submit_async_event() runs after event has been freed. Signed-off-by: David Milburn <dmilburn@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-08nvme-rdma: cancel async events before freeing event structDavid Milburn
Cancel async event work in case async event has been queued up, and nvme_rdma_submit_async_event() runs after event has been freed. Signed-off-by: David Milburn <dmilburn@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-08nvme-fc: cancel async events before freeing event structDavid Milburn
Cancel async event work in case async event has been queued up, and nvme_fc_submit_async_event() runs after event has been freed. Signed-off-by: David Milburn <dmilburn@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-08nvme: Revert: Fix controller creation races with teardown flowJames Smart
The indicated patch introduced a barrier in the sysfs_delete attribute for the controller that rejects the request if the controller isn't created. "Created" is defined as at least 1 call to nvme_start_ctrl(). This is problematic in error-injection testing. If an error occurs on the initial attempt to create an association and the controller enters reconnect(s) attempts, the admin cannot delete the controller until either there is a successful association created or ctrl_loss_tmo times out. Where this issue is particularly hurtful is when the "admin" is the nvme-cli, it is performing a connection to a discovery controller, and it is initiated via auto-connect scripts. With the FC transport, if the first connection attempt fails, the controller enters a normal reconnect state but returns control to the cli thread that created the controller. In this scenario, the cli attempts to read the discovery log via ioctl, which fails, causing the cli to see it as an empty log and then proceeds to delete the discovery controller. The delete is rejected and the controller is left live. If the discovery controller reconnect then succeeds, there is no action to delete it, and it sits live doing nothing. Cc: <stable@vger.kernel.org> # v5.7+ Fixes: ce1518139e69 ("nvme: Fix controller creation races with teardown flow") Signed-off-by: James Smart <james.smart@broadcom.com> CC: Israel Rukshin <israelr@mellanox.com> CC: Max Gurtovoy <maxg@mellanox.com> CC: Christoph Hellwig <hch@lst.de> CC: Keith Busch <kbusch@kernel.org> CC: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-04Merge tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "A bit larger than usual this week, mostly due to the NVMe fixes arriving late for -rc3 and hence didn't make last weeks pull request. - NVMe: - instance leak and io boundary fixes from Keith - fc locking fix from Christophe - various tcp/rdma reset during traffic fixes from Sagi - pci use-after-free fix from Tong - tcp target null deref fix from Ziye - Locking fix for partition removal (Christoph) - Ensure bdi->io_pages is always set (me) - Fixup for hd struct reference (Ming) - Fix for zero length bvecs (Ming) - Two small blk-iocost fixes (Tejun)" * tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block: block: allow for_each_bvec to support zero len bvec blk-stat: make q->stats->lock irqsafe blk-iocost: ioc_pd_free() shouldn't assume irq disabled block: fix locking in bdev_del_partition block: release disk reference in hd_struct_free_work block: ensure bdi->io_pages is always initialized nvme-pci: cancel nvme device request before disabling nvme: only use power of two io boundaries nvme: fix controller instance leak nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()' nvme: Fix NULL dereference for pci nvme controllers nvme-rdma: fix reset hang if controller died in the middle of a reset nvme-rdma: fix timeout handler nvme-rdma: serialize controller teardown sequences nvme-tcp: fix reset hang if controller died in the middle of a reset nvme-tcp: fix timeout handler nvme-tcp: serialize controller teardown sequences nvme: have nvme_wait_freeze_timeout return if it timed out nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
2020-09-02nvme: opencode revalidate_disk in nvme_validate_nsChristoph Hellwig
Keep control in the NVMe driver instead of going through an indirect call back into ->revalidate_disk. Also reorder the function a bit to be easier to follow with the additional code. And now that we have removed all callers of revalidate_disk() in the nvme code, ->revalidate_disk is only called from the open code when first opening the device. Which is of course totally pointless as we have a valid size since the initial scan, and will get an updated view through the asynchronous notifiation everytime the size changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01nvme: don't call revalidate_disk from nvme_set_queue_dyingChristoph Hellwig
In nvme_set_queue_dying we really just want to ensure the disk and bdev sizes are set to zero. Going through revalidate_disk leads to a somewhat arcance and complex callchain relying on special behavior in a few places. Instead just lift the set_capacity directly to nvme_set_queue_dying, and rename and move the nvme_mpath_update_disk_size helper so that we can use it in nvme_set_queue_dying to propagate the size to the bdev without detours. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01block: replace bd_set_size with bd_set_nr_sectorsChristoph Hellwig
Replace bd_set_size with a version that takes the number of sectors instead, as that fits most of the current and future callers much better. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01Merge branch 'block-5.9' into for-5.10/blockJens Axboe
* block-5.9: blk-stat: make q->stats->lock irqsafe blk-iocost: ioc_pd_free() shouldn't assume irq disabled block: fix locking in bdev_del_partition block: release disk reference in hd_struct_free_work block: ensure bdi->io_pages is always initialized nvme-pci: cancel nvme device request before disabling nvme: only use power of two io boundaries nvme: fix controller instance leak nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()' nvme: Fix NULL dereference for pci nvme controllers nvme-rdma: fix reset hang if controller died in the middle of a reset nvme-rdma: fix timeout handler nvme-rdma: serialize controller teardown sequences nvme-tcp: fix reset hang if controller died in the middle of a reset nvme-tcp: fix timeout handler nvme-tcp: serialize controller teardown sequences nvme: have nvme_wait_freeze_timeout return if it timed out nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
2020-08-28nvme-pci: cancel nvme device request before disablingTong Zhang
This patch addresses an irq free warning and null pointer dereference error problem when nvme devices got timeout error during initialization. This problem happens when nvme_timeout() function is called while nvme_reset_work() is still in execution. This patch fixed the problem by setting flag of the problematic request to NVME_REQ_CANCELLED before calling nvme_dev_disable() to make sure __nvme_submit_sync_cmd() returns an error code and let nvme_submit_sync_cmd() fail gracefully. The following is console output. [ 62.472097] nvme nvme0: I/O 13 QID 0 timeout, disable controller [ 62.488796] nvme nvme0: could not set timestamp (881) [ 62.494888] ------------[ cut here ]------------ [ 62.495142] Trying to free already-free IRQ 11 [ 62.495366] WARNING: CPU: 0 PID: 7 at kernel/irq/manage.c:1751 free_irq+0x1f7/0x370 [ 62.495742] Modules linked in: [ 62.495902] CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.8.0+ #8 [ 62.496206] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4 [ 62.496772] Workqueue: nvme-reset-wq nvme_reset_work [ 62.497019] RIP: 0010:free_irq+0x1f7/0x370 [ 62.497223] Code: e8 ce 49 11 00 48 83 c4 08 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 44 89 f6 48 c70 [ 62.498133] RSP: 0000:ffffa96800043d40 EFLAGS: 00010086 [ 62.498391] RAX: 0000000000000000 RBX: ffff9b87fc458400 RCX: 0000000000000000 [ 62.498741] RDX: 0000000000000001 RSI: 0000000000000096 RDI: ffffffff9693d72c [ 62.499091] RBP: ffff9b87fd4c8f60 R08: ffffa96800043bfd R09: 0000000000000163 [ 62.499440] R10: ffffa96800043bf8 R11: ffffa96800043bfd R12: ffff9b87fd4c8e00 [ 62.499790] R13: ffff9b87fd4c8ea4 R14: 000000000000000b R15: ffff9b87fd76b000 [ 62.500140] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000 [ 62.500534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 62.500816] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0 [ 62.501165] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 62.501515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 62.501864] Call Trace: [ 62.501993] pci_free_irq+0x13/0x20 [ 62.502167] nvme_reset_work+0x5d0/0x12a0 [ 62.502369] ? update_load_avg+0x59/0x580 [ 62.502569] ? ttwu_queue_wakelist+0xa8/0xc0 [ 62.502780] ? try_to_wake_up+0x1a2/0x450 [ 62.502979] process_one_work+0x1d2/0x390 [ 62.503179] worker_thread+0x45/0x3b0 [ 62.503361] ? process_one_work+0x390/0x390 [ 62.503568] kthread+0xf9/0x130 [ 62.503726] ? kthread_park+0x80/0x80 [ 62.503911] ret_from_fork+0x22/0x30 [ 62.504090] ---[ end trace de9ed4a70f8d71e2 ]--- [ 123.912275] nvme nvme0: I/O 12 QID 0 timeout, disable controller [ 123.914670] nvme nvme0: 1/0/0 default/read/poll queues [ 123.916310] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 123.917469] #PF: supervisor write access in kernel mode [ 123.917725] #PF: error_code(0x0002) - not-present page [ 123.917976] PGD 0 P4D 0 [ 123.918109] Oops: 0002 [#1] SMP PTI [ 123.918283] CPU: 0 PID: 7 Comm: kworker/u4:0 Tainted: G W 5.8.0+ #8 [ 123.918650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812dda519-p4 [ 123.919219] Workqueue: nvme-reset-wq nvme_reset_work [ 123.919469] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80 [ 123.919757] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4 [ 123.920657] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286 [ 123.920912] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000 [ 123.921258] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000 [ 123.921602] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000 [ 123.921949] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000 [ 123.922295] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000 [ 123.922641] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000 [ 123.923032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 123.923312] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0 [ 123.923660] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 123.924007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 123.924353] Call Trace: [ 123.924479] blk_mq_alloc_tag_set+0x137/0x2a0 [ 123.924694] nvme_reset_work+0xed6/0x12a0 [ 123.924898] process_one_work+0x1d2/0x390 [ 123.925099] worker_thread+0x45/0x3b0 [ 123.925280] ? process_one_work+0x390/0x390 [ 123.925486] kthread+0xf9/0x130 [ 123.925642] ? kthread_park+0x80/0x80 [ 123.925825] ret_from_fork+0x22/0x30 [ 123.926004] Modules linked in: [ 123.926158] CR2: 0000000000000000 [ 123.926322] ---[ end trace de9ed4a70f8d71e3 ]--- [ 123.926549] RIP: 0010:__blk_mq_alloc_map_and_request+0x21/0x80 [ 123.926832] Code: 66 0f 1f 84 00 00 00 00 00 41 55 41 54 55 48 63 ee 53 48 8b 47 68 89 ee 48 89 fb 8b4 [ 123.927734] RSP: 0000:ffffa96800043d40 EFLAGS: 00010286 [ 123.927989] RAX: ffff9b87fc4fee40 RBX: ffff9b87fc8cb008 RCX: 0000000000000000 [ 123.928336] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9b87fc618000 [ 123.928679] RBP: 0000000000000000 R08: ffff9b87fdc2c4a0 R09: ffff9b87fc616000 [ 123.929025] R10: 0000000000000000 R11: ffff9b87fffd1500 R12: 0000000000000000 [ 123.929370] R13: 0000000000000000 R14: ffff9b87fc8cb200 R15: ffff9b87fc8cb000 [ 123.929715] FS: 0000000000000000(0000) GS:ffff9b87fdc00000(0000) knlGS:0000000000000000 [ 123.930106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 123.930384] CR2: 0000000000000000 CR3: 000000003aa0a000 CR4: 00000000000006f0 [ 123.930731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 123.931077] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Co-developed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Tong Zhang <ztong0001@gmail.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvme: only use power of two io boundariesKeith Busch
The kernel requires a power of two for boundaries because that's the only way it can efficiently split commands that cross them. A controller, however, may report a non-power of two boundary. The driver had been rounding the controller's value to one the kernel can use, but splitting on the wrong boundary provides no benefit on the device side, and incurs additional submission overhead from non-optimal splits. Don't provide any boundary hint if the controller's value can't be used and log a warning when first scanning a disk's unreported IO boundary. Since the chunk sector logic has grown, move it to a separate function. Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvme: fix controller instance leakKeith Busch
If the driver has to unbind from the controller for an early failure before the subsystem has been set up, there won't be a subsystem holding the controller's instance, so the controller needs to free its own instance in this case. Fixes: 733e4b69d508d ("nvme: Assign subsys instance from first ctrl") Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'Christophe JAILLET
The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent in this function. Use 'spin_lock_irqsave()' also here, as there is no guarantee that interruptions are disabled at that point, according to surrounding code. Fixes: a97ec51b37ef ("nvmet_fc: Rework target side abort handling") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvme: Fix NULL dereference for pci nvme controllersSagi Grimberg
PCIe controllers do not have fabric opts, verify they exist before showing ctrl_loss_tmo or reconnect_delay attributes. Fixes: 764075fdcb2f ("nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs") Reported-by: Tobias Markus <tobias@markus-regensburg.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvme-rdma: fix reset hang if controller died in the middle of a resetSagi Grimberg
If the controller becomes unresponsive in the middle of a reset, we will hang because we are waiting for the freeze to complete, but that cannot happen since we have commands that are inflight holding the q_usage_counter, and we can't blindly fail requests that times out. So give a timeout and if we cannot wait for queue freeze before unfreezing, fail and have the error handling take care how to proceed (either schedule a reconnect of remove the controller). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvme-rdma: fix timeout handlerSagi Grimberg
When a request times out in a LIVE state, we simply trigger error recovery and let the error recovery handle the request cancellation, however when a request times out in a non LIVE state, we make sure to complete it immediately as it might block controller setup or teardown and prevent forward progress. However tearing down the entire set of I/O and admin queues causes freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really an overkill to what we actually need, which is to just fence controller teardown that may be running, stop the queue, and cancel the request if it is not already completed. Now that we have the controller teardown_lock, we can safely serialize request cancellation. This addresses a hang caused by calling extra queue freeze on controller namespaces, causing unfreeze to not complete correctly. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvme-rdma: serialize controller teardown sequencesSagi Grimberg
In the timeout handler we may need to complete a request because the request that timed out may be an I/O that is a part of a serial sequence of controller teardown or initialization. In order to complete the request, we need to fence any other context that may compete with us and complete the request that is timing out. In this case, we could have a potential double completion in case a hard-irq or a different competing context triggered error recovery and is running inflight request cancellation concurrently with the timeout handler. Protect using a ctrl teardown_lock to serialize contexts that may complete a cancelled request due to error recovery or a reset. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvme-tcp: fix reset hang if controller died in the middle of a resetSagi Grimberg
If the controller becomes unresponsive in the middle of a reset, we will hang because we are waiting for the freeze to complete, but that cannot happen since we have commands that are inflight holding the q_usage_counter, and we can't blindly fail requests that times out. So give a timeout and if we cannot wait for queue freeze before unfreezing, fail and have the error handling take care how to proceed (either schedule a reconnect of remove the controller). Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvme-tcp: fix timeout handlerSagi Grimberg
When a request times out in a LIVE state, we simply trigger error recovery and let the error recovery handle the request cancellation, however when a request times out in a non LIVE state, we make sure to complete it immediately as it might block controller setup or teardown and prevent forward progress. However tearing down the entire set of I/O and admin queues causes freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really an overkill to what we actually need, which is to just fence controller teardown that may be running, stop the queue, and cancel the request if it is not already completed. Now that we have the controller teardown_lock, we can safely serialize request cancellation. This addresses a hang caused by calling extra queue freeze on controller namespaces, causing unfreeze to not complete correctly. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvme-tcp: serialize controller teardown sequencesSagi Grimberg
In the timeout handler we may need to complete a request because the request that timed out may be an I/O that is a part of a serial sequence of controller teardown or initialization. In order to complete the request, we need to fence any other context that may compete with us and complete the request that is timing out. In this case, we could have a potential double completion in case a hard-irq or a different competing context triggered error recovery and is running inflight request cancellation concurrently with the timeout handler. Protect using a ctrl teardown_lock to serialize contexts that may complete a cancelled request due to error recovery or a reset. Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvme: have nvme_wait_freeze_timeout return if it timed outSagi Grimberg
Users can detect if the wait has completed or not and take appropriate actions based on this information (e.g. weather to continue initialization or rather fail and schedule another initialization attempt). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptanceSagi Grimberg
NVME_CTRL_NEW should never see any I/O, because in order to start initialization it has to transition to NVME_CTRL_CONNECTING and from there it will never return to this state. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pduZiye Yang
When handling commands without in-capsule data, we assign the ttag assuming we already have the queue commands array allocated (based on the queue size information in the connect data payload). However if the connect itself did not send the connect data in-capsule we have yet to allocate the queue commands,and we will assign a bogus ttag and suffer a NULL dereference when we receive the corresponding h2cdata pdu. Fix this by checking if we already allocated commands before dereferencing it when handling h2cdata, if we didn't, its for sure a connect and we should use the preallocated connect command. Signed-off-by: Ziye Yang <ziye.yang@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-24Merge tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request from Sagi: - nvme completion rework from Christoph and Chao that mostly came from a bit of divergence of how we classify errors related to pathing/retry etc. - nvmet passthru fixes from Chaitanya - minor nvmet fixes from Amit and I - mpath round-robin path selection fix from Martin - ignore noiob for zoned devices from Keith - minor nvme-fc fix from Tianjia" - BFQ cgroup leak fix (Dmitry) - block layer MAINTAINERS addition (Geert) - fix null_blk FUA checking (Hou) - get_max_io_size() size fix (Keith) - fix block page_is_mergeable() for compound pages (Matthew) - discard granularity fixes (Ming) - IO scheduler ordering fix (Ming) - misc fixes * tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block: (31 commits) null_blk: fix passing of REQ_FUA flag in null_handle_rq nvmet: Disable keep-alive timer when kato is cleared to 0h nvme: redirect commands on dying queue nvme: just check the status code type in nvme_is_path_error nvme: refactor command completion nvme: rename and document nvme_end_request nvme: skip noiob for zoned devices nvme-pci: fix PRP pool size nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth nvme: Use spin_lock_irq() when taking the ctrl->lock nvmet: call blk_mq_free_request() directly nvmet: fix oops in pt cmd execution nvmet: add ns tear down label for pt-cmd handling nvme: multipath: round-robin: eliminate "fallback" variable nvme: multipath: round-robin: fix single non-optimized path case nvme-fc: Fix wrong return value in __nvme_fc_init_request() nvmet-passthru: Reject commands with non-sgl flags set nvmet: fix a memory leak blkcg: fix memleak for iolatency MAINTAINERS: Add missing header files to BLOCK LAYER section ...
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-21nvmet: Disable keep-alive timer when kato is cleared to 0hAmit Engel
Based on nvme spec, when keep alive timeout is set to zero the keep-alive timer should be disabled. Signed-off-by: Amit Engel <amit.engel@dell.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>