2019-07-08  rbd: support for object-map and fast-diff (Ilya Dryomov)
Speed up reads, discards and zeroouts through RBD_OBJ_FLAG_MAY_EXIST and RBD_OBJ_FLAG_NOOP_FOR_NONEXISTENT, based on the object map. Invalid object maps are not trusted, but are still updated. Note that we never iterate, resize or invalidate object maps.

If the object-map feature is enabled but the object map fails to load, we just fail the requester (either "rbd map" or I/O, by way of the post-acquire action).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

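As a rough illustration of how a per-object request might consult the object map (a sketch only; the helper name rbd_object_map_get() and the OBJECT_* state names are assumptions, not necessarily the driver's actual code):

    /*
     * Illustrative sketch: decide whether an object request can skip
     * the OSDs entirely.  Reads and discards of an object the map says
     * does not exist can be completed locally.
     */
    static void rbd_obj_consult_object_map(struct rbd_obj_request *obj_req)
    {
            struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev;
            u8 state = rbd_object_map_get(rbd_dev, obj_req->ex.oe_objno);

            if (state == OBJECT_NONEXISTENT)
                    obj_req->flags |= RBD_OBJ_FLAG_NOOP_FOR_NONEXISTENT;
            else
                    obj_req->flags |= RBD_OBJ_FLAG_MAY_EXIST;
    }
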
2019-07-08  rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe() (Ilya Dryomov)
The snapshot object map will be loaded in rbd_dev_image_probe(), so we need to know the snapshot's size (as opposed to HEAD's size) sooner.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  libceph: export osd_req_op_data() macro (Ilya Dryomov)
We already have one exported wrapper around it for extent.osd_data and rbd_object_map_update_finish() needs another one for cls.request_data.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Jeff Layton <jlayton@kernel.org>

2019-07-08  libceph: change ceph_osdc_call() to take page vector for response (Ilya Dryomov)
This will be used for loading the object map. rbd_obj_read_sync() isn't suitable because the object map must be accessed through class methods.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Jeff Layton <jlayton@kernel.org>

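For reference, the call takes roughly this shape after the change (a sketch from memory; parameter names are approximate, not a verbatim copy of the header):

    /* The response now comes back through a page vector rather than a
     * single page, so a caller such as the object map loader can
     * receive replies larger than one page. */
    int ceph_osdc_call(struct ceph_osd_client *osdc,
                       struct ceph_object_id *oid,
                       struct ceph_object_locator *oloc,
                       const char *class, const char *method,
                       unsigned int flags,
                       struct page *req_page, size_t req_len,
                       struct page **resp_pages, size_t *resp_len);
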
2019-07-08  libceph: bump CEPH_MSG_MAX_DATA_LEN (again) (Ilya Dryomov)
This time it is for the rbd object map. Object maps are limited in size to 256000000 objects, at two bits per object.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

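For scale (simple arithmetic, not part of the commit message): 256000000 objects x 2 bits = 512000000 bits = 64000000 bytes, i.e. roughly 61 MiB for the largest possible object map, which is what the bumped message data length has to accommodate.
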
2019-07-08  rbd: new exclusive lock wait/wake code (Ilya Dryomov)
rbd_wait_state_locked() is built around rbd_dev->lock_waitq and blocks rbd worker threads while waiting for the lock, potentially impacting other rbd devices. There is no good way to pass an error code into image request state machines when acquisition fails, hence the use of RBD_DEV_FLAG_BLACKLISTED for everything, and various other issues.

Introduce rbd_dev->acquiring_list and move acquisition into the image request state machine. Use rbd_img_schedule() for kicking and passing error codes. No blocking occurs while waiting for the lock, but rbd_dev->lock_rwsem is still held across lock, unlock and set_cookie calls.

Always acquire the lock on "rbd map" to avoid associating the latency of acquiring the lock with the first I/O request.

A slight regression is that lock_timeout is now respected only if lock acquisition is triggered by "rbd map" and not by I/O. This is somewhat compensated by the fact that we no longer block if the peer refuses to release the lock -- I/O is failed with EROFS right away.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: quiescing lock should wait for image requests (Ilya Dryomov)
Syncing OSD requests doesn't really work. A single image request may consist of multiple object requests, each of which can go through a series of OSD requests (original, copyups, etc). On top of that, the OSD client may be shared with other rbd devices.

What we want is to ensure that all in-flight image requests complete. Introduce rbd_dev->running_list and block in RBD_LOCK_STATE_RELEASING until that happens. New OSD requests may be started during this time.

Note that __rbd_img_handle_request() acquires rbd_dev->lock_rwsem only if need_exclusive_lock() returns true. This avoids a deadlock similar to the one outlined in the previous commit between unlock and I/O that doesn't require the lock, such as a read with the object-map feature disabled.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: lock should be quiesced on reacquire (Ilya Dryomov)
Quiesce the exclusive lock at the top of rbd_reacquire_lock() instead of only when ceph_cls_set_cookie() fails. This avoids a deadlock on rbd_dev->lock_rwsem: if rbd_dev->lock_rwsem is needed for I/O completion, set_cookie can hang a ceph-msgr worker thread when the set_cookie reply ends up behind an I/O reply, because, like lock and unlock requests, set_cookie is sent and waited upon with rbd_dev->lock_rwsem held for write.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: introduce copyup state machine (Ilya Dryomov)
Both the write and copyup paths will get more complex with object map support. Factor the copyup code out into a separate state machine. While at it, take advantage of the obj_req->osd_reqs list and issue the empty and current snapc OSD requests together, one after another.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: rename rbd_obj_setup_*() to rbd_obj_init_*() (Ilya Dryomov)
These functions don't allocate and set up OSD requests anymore.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: move OSD request allocation into object request state machines (Ilya Dryomov)
Following submission, move initial OSD request allocation into the object request state machines. Everything that has to do with OSD requests is now handled inside the state machine; all __rbd_img_fill_request() has left is initialization.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: factor out __rbd_osd_setup_discard_ops() (Ilya Dryomov)
With obj_req->xferred removed, obj_req->ex.oe_off and obj_req->ex.oe_len can be updated if required for alignment. Previously the new offset and length weren't stored anywhere beyond rbd_obj_setup_discard().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: factor out rbd_osd_setup_copyup() (Ilya Dryomov)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: introduce obj_req->osd_reqs list (Ilya Dryomov)
Since the dawn of time it had been assumed that a single object request spawns a single OSD request. This is already impacting copyup: instead of sending empty and current snapc copyups together, we wait for the empty snapc OSD request to complete in order to reassign obj_req->osd_req to the current snapc OSD request. Looking further, updating potentially hundreds of snapshot object maps serially is a non-starter.

Replace the obj_req->osd_req pointer with an obj_req->osd_reqs list. Use osd_req->r_private_item for linkage.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

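A minimal sketch of what the new linkage allows (illustrative only, with error handling elided): every OSD request spawned by one object request can be walked via the list and, for example, submitted in turn.

    /* Illustrative: submit all OSD requests hanging off an object
     * request.  r_private_item is the list linkage mentioned above. */
    struct ceph_osd_request *osd_req;

    list_for_each_entry(osd_req, &obj_req->osd_reqs, r_private_item)
            ceph_osdc_start_request(osd_req->r_osdc, osd_req, false);
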
2019-07-08  libceph: rename r_unsafe_item to r_private_item (Ilya Dryomov)
This list item remained from when we had safe and unsafe replies (commit vs ack). It has since become a private list item for use by clients.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  rbd: introduce image request state machine (Ilya Dryomov)
Make it possible to schedule image requests on a workqueue. This fixes the parent chain recursion added in the previous commit and lays the groundwork for exclusive lock wait/wake improvements.

The "wait for pending subrequests and report first nonzero result" code is generalized so it can be used by the object request state machine.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: move OSD request submission into object request state machines (Ilya Dryomov)
Start eliminating the asymmetry where the initial OSD request is allocated and submitted from outside the state machine, making error handling and restarts harder than they could be. This commit deals with submission; a commit that deals with allocation will follow.

Note that this commit adds parent chain recursion on the submission side:

    rbd_img_request_submit
      rbd_obj_handle_request
        __rbd_obj_handle_request
          rbd_obj_handle_read
            rbd_obj_handle_write_guard
              rbd_obj_read_from_parent
                rbd_img_request_submit

This will be fixed in the next commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: get rid of RBD_OBJ_WRITE_{FLAT,GUARD} (Ilya Dryomov)
In preparation for moving OSD request allocation and submission into object request state machines, get rid of RBD_OBJ_WRITE_{FLAT,GUARD}. We would need to start in a new state, whether the request is guarded or not. Unify them into RBD_OBJ_WRITE_OBJECT and pass guard info through obj_req->flags.

While at it, make our ENOENT handling a little more precise: only hide ENOENT when it is actually expected, that is, on delete.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: replace obj_req->tried_parent with obj_req->read_state (Ilya Dryomov)
Make rbd_obj_handle_read() look like a state machine and get rid of the necessity to patch result in rbd_obj_handle_request(), completing the removal of obj_req->xferred and img_req->xferred.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  rbd: get rid of obj_req->xferred, obj_req->result and img_req->xferred (Ilya Dryomov)
obj_req->xferred and img_req->xferred don't bring any value. The former is used for short reads and has to be set to obj_req->ex.oe_len after that and elsewhere. The latter is just an aggregate.

Use result for short reads (>= 0 means number of bytes read, < 0 means error) and pass it around explicitly. There is no need to store it in obj_req.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>

2019-07-08  ceph: don't NULL terminate virtual xattrs (Jeff Layton)
The convention with xattrs is to not store the termination with string data, given that it returns the length. This is how setfattr/getfattr operate.

Most of ceph's virtual xattr routines use snprintf to plop the string directly into the destination buffer, but snprintf always NULL terminates the string. This means that if we send the kernel a buffer that is the exact length needed to hold the string, it'll end up truncated.

Add a ceph_fmt_xattr helper function to format the string into an on-stack buffer that should always be large enough to hold the whole thing and then memcpy the result into the destination buffer. If it does turn out that the formatted string won't fit in the on-stack buffer, then return -E2BIG and do a WARN_ONCE().

Change over most of the virtual xattr routines to use the new helper. A couple of the xattrs are sourced from strings however, and it's difficult to know how long they'll be. Just have those memcpy the result in place after verifying the length.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

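A rough sketch of such a helper (the on-stack buffer size and the exact error handling are assumptions; the real fs/ceph/xattr.c code may differ in detail):

    /* Sketch: format into a stack buffer that should comfortably fit
     * any virtual xattr value, then copy only the string bytes (no
     * terminating NULL) into the caller's buffer when it is large
     * enough.  The returned length lets the caller detect truncation. */
    static ssize_t ceph_fmt_xattr(char *val, size_t size,
                                  const char *fmt, ...)
    {
            char buf[96];           /* assumed upper bound */
            va_list args;
            int ret;

            va_start(args, fmt);
            ret = vsnprintf(buf, sizeof(buf), fmt, args);
            va_end(args);

            if (WARN_ONCE(ret + 1 > sizeof(buf),
                          "formatted xattr too long (%d)\n", ret))
                    return -E2BIG;

            if (ret <= size)
                    memcpy(val, buf, ret);
            return ret;
    }
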
2019-07-08  ceph: return -ERANGE if virtual xattr value didn't fit in buffer (Jeff Layton)
The getxattr manpage states that we should return ERANGE if the destination buffer size is too small to hold the value. ceph_vxattrcb_layout does this internally, but we should be doing this for all vxattrs.

Fix the only caller of getxattr_cb to check the returned size against the buffer length and return -ERANGE if it doesn't fit. Drop the same check in ceph_vxattrcb_layout and just rely on the caller to handle it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

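The caller-side check amounts to something like this (a sketch; the variable names are illustrative):

    /* getxattr_cb() returns the full value length even when the value
     * did not fit, so the single caller can flag truncation. */
    err = vxattr->getxattr_cb(ci, value, size);
    if (err >= 0 && size && (size_t)err > size)
            err = -ERANGE;
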
2019-07-08  ceph: make getxattr_cb return ssize_t (Jeff Layton)
The getxattr_cb functions return size_t, which is unsigned; that value is then cast to int and then to ssize_t before being returned. While all of this works, it relies on implicit casting rules for signed/unsigned conversions. Change getxattr_cb to return ssize_t to better conform with what the caller actually wants. Also, remove some suspicious casts.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: more precise CEPH_CLIENT_CAPS_PENDING_CAPSNAP (Yan, Zheng)
The client uses this flag to tell the MDS whether there are more cap snaps that need to be flushed. It is mainly for the case where the client needs to re-send cap/snap flushes after an MDS failover, but the CEPH_CAP_ANY_FILE_WR caps on the corresponding inodes were all released before the failover.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: kick flushing and flush snaps before sending normal cap message (Yan, Zheng)
Otherwise the client may send cap flush messages in the wrong order.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: clear CEPH_I_KICK_FLUSH flag inside __kick_flushing_caps() (Yan, Zheng)
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: increment change_attribute on local changes (Jeff Layton)
We don't set SB_I_VERSION on ceph since we need to manage it ourselves, so we must increment it whenever we update the file times.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: handle change_attr in cap messages (Jeff Layton)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: add change_attr field to ceph_inode_info (Jeff Layton)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  iversion: add a routine to update a raw value with a larger one (Jeff Layton)
Under ceph, clients can be independently updating iversion themselves, while working under comprehensive sets of caps on an inode. In that situation we always want to prefer the largest value of a change attribute.

Add a new function that will update a raw value with a larger one, but otherwise leave it alone.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

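A minimal sketch of such a routine, assuming it sits alongside the other raw helpers in include/linux/iversion.h (the exact name and form there may differ):

    /* Advance the raw change attribute only if the new value is larger,
     * so a stale update can never move it backwards.  Lock-free via
     * cmpxchg; cur is refreshed on each failed attempt. */
    static inline void
    inode_set_max_iversion_raw(struct inode *inode, u64 val)
    {
            u64 cur = inode_peek_iversion_raw(inode);

            do {
                    if (cur > val)
                            break;
            } while (!atomic64_try_cmpxchg(&inode->i_version, &cur, val));
    }
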
2019-07-08  ceph: allow querying of STATX_BTIME in ceph_getattr (Jeff Layton)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  libceph: turn on CEPH_FEATURE_MSG_ADDR2 (Jeff Layton)
Now that the client can handle either address format, advertise to the peer that we can support it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: handle btime in cap messages (Jeff Layton)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: add btime field to ceph_inode_info (Jeff Layton)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  libceph: rename ceph_encode_addr to ceph_encode_banner_addr (Jeff Layton)
...ditto for the decode function. We only use these functions to fix up banner addresses now, so let's name them more appropriately.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  libceph: use TYPE_LEGACY for entity addrs instead of TYPE_NONE (Jeff Layton)
Going forward, we'll have different address types, so let's use the addr2 TYPE_LEGACY for internal tracking rather than TYPE_NONE. Also, make ceph_pr_addr print the address type value as well.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: fix decode_locker to use ceph_decode_entity_addr (Jeff Layton)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: have MDS map decoding use entity_addr_t decoder (Jeff Layton)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  libceph: correctly decode ADDR2 addresses in incremental OSD maps (Jeff Layton)
Given the new format, we have to decode the addresses twice: once to skip past the new_up_client field, and a second time to collect the addresses.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  libceph: fix watch_item_t decoding to use ceph_decode_entity_addr (Jeff Layton)
While we're in there, let's also fix up the decoder to do proper bounds checking.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  libceph: switch osdmap decoding to use ceph_decode_entity_addr (Jeff Layton)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  libceph: ADDR2 support for monmap (Jeff Layton)
Switch the MonMap decoder to use the new decoding routine for entity_addr_t's.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  libceph: add ceph_decode_entity_addr (Jeff Layton)
Add a function for decoding an entity_addr_t. Once CEPH_FEATURE_MSG_ADDR2 is enabled, the server daemons will start encoding entity_addr_t differently. Add a new helper function that can handle either format.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

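The overall shape is a dispatch on a leading marker byte; the sketch below is illustrative only (the marker values and the per-format helpers are assumptions, not the exact libceph code):

    /* Peek at a marker byte to tell the new versioned (ADDR2-capable)
     * encoding from the legacy one, then hand off to the matching
     * decoder.  decode_*_addr() helpers are hypothetical. */
    int ceph_decode_entity_addr(void **p, void *end,
                                struct ceph_entity_addr *addr)
    {
            u8 marker;

            ceph_decode_8_safe(p, end, marker, bad);
            if (marker == 1)
                    return decode_versioned_addr(p, end, addr);
            return decode_legacy_addr(p, end, addr);
    bad:
            return -EINVAL;
    }
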
2019-07-08  libceph: fix sa_family just after reading address (Jeff Layton)
It doesn't make sense to leave it undecoded until later.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: remove request from waiting list before unregister (Yan, Zheng)
Link: https://tracker.ceph.com/issues/40339
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: don't blindly unregister session that is in opening state (Yan, Zheng)
handle_cap_export() may add placeholder caps to a session that is in the opening state. These caps' session pointers become dangling after the session gets unregistered. The fix is to not unregister a session in the opening state during MDS failover; just let the client reconnect later once the MDS has recovered.

Link: https://tracker.ceph.com/issues/40190
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: fix infinite loop in get_quota_realm() (Yan, Zheng)
get_quota_realm() enters an infinite loop if the quota inode has no caps. This can happen after the client gets evicted.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: add selinux support (Yan, Zheng)
When creating a new file or directory, use security_dentry_init_security() to prepare the SELinux context for the new inode, then send the openc/mkdir request to the MDS together with the SELinux xattr.

security_dentry_init_security() only supports a single security module, and only SELinux has a dentry_init_security hook, so only SELinux is supported for now. We can add support for other security modules once the kernel has a generic version of dentry_init_security().

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: rename struct ceph_acls_info to ceph_acl_sec_ctx (Yan, Zheng)
Also rename ceph_release_acls_info() to ceph_release_acl_sec_ctx(), and move their definitions to different files. This is preparation for security label support.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

2019-07-08  ceph: fix debug print format in __set_xattr() (Yan, Zheng)
The name is not '\0' terminated.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>