summaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)Author
2013-05-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Alex Elder: "This is a big pull. Most of it is culmination of Alex's work to implement RBD image layering, which is now complete (yay!). There is also some work from Yan to fix i_mutex behavior surrounding writes in cephfs, a sync write fix, a fix for RBD images that get resized while they are mapped, and a few patches from me that resolve annoying auth warnings and fix several bugs in the ceph auth code." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (254 commits) rbd: fix image request leak on parent read libceph: use slab cache for osd client requests libceph: allocate ceph message data with a slab allocator libceph: allocate ceph messages with a slab allocator rbd: allocate image object names with a slab allocator rbd: allocate object requests with a slab allocator rbd: allocate name separate from obj_request rbd: allocate image requests with a slab allocator rbd: use binary search for snapshot lookup rbd: clear EXISTS flag if mapped snapshot disappears rbd: kill off the snapshot list rbd: define rbd_snap_size() and rbd_snap_features() rbd: use snap_id not index to look up snap info rbd: look up snapshot name in names buffer rbd: drop obj_request->version rbd: drop rbd_obj_method_sync() version parameter rbd: more version parameter removal rbd: get rid of some version parameters rbd: stop tracking header object version rbd: snap names are pointer to constant data ...
2013-05-06Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull CIFS fixes from Steve French: "A set of cifs cleanup fixes. The only big one of this set optimizes the cifs error logging, renaming cFYI and cERROR macros to cifs_dbg, and in the process makes it clearer and reduces module size." * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: cifs: small variable name cleanup CIFS: fix error return code in cifs_atomic_open() cifs: store the real expected sequence number in the mid cifs: on send failure, readjust server sequence number downward cifs: remove ENOSPC handling in smb_sendv [CIFS] cifs: Rename cERROR and cFYI to cifs_dbg fs: cifs: use kmemdup instead of kmalloc + memcpy cifs: replaced kmalloc + memset with kzalloc cifs: ignore the unc= and prefixpath= mount options
2013-05-06autofs - remove autofs dentry mount checkDavid Jeffery
When checking if an autofs mount point is busy it isn't sufficient to only check if it's a mount point. For example, if the mount of an offset mountpoint in a tree is denied for this host by its export and the dentry becomes a process working directory the check incorrectly returns the mount as not in use at expire. This can happen since the default when mounting within a tree is nostrict, which means ingnore mount fails on mounts within the tree and continue. The nostrict option is meant to allow mounting in this case. Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Ian Kent <raven@themaw.net> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-06autofs - fix sparse warning for autofs4_d_manage()Claudiu Ghioc
Fixed the sparse warning: fs/autofs4/root.c:411:5: warning: symbol 'autofs4_d_manage' was not declared. Should it be static?" [ Clearly it should be static as the function is declared static at the top of root.c. - imk ] Signed-off-by: Claudiu Ghioc <claudiu.ghioc@gmail.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-04cifs: small variable name cleanupDan Carpenter
server and ses->server are the same, but it's a little bit ugly that we lock &ses->server->srv_mutex and unlock &server->srv_mutex. It causes a false positive in Smatch about inconsistent locking. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-05-04CIFS: fix error return code in cifs_atomic_open()Wei Yongjun
Fix to return a negative error code from the error handling case instead of 0, as returned elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-05-04cifs: store the real expected sequence number in the midJeff Layton
Currently, the signing routines take a pointer to a place to store the expected sequence number for the mid response. It then stores a value that's one below what that sequence number should be, and then adds one to it when verifying the signature on the response. Increment the sequence number before storing the value in the mid, and eliminate the "+1" when checking the signature. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-05-04cifs: on send failure, readjust server sequence number downwardJeff Layton
If sending a call to the server fails for some reason (for instance, the sending thread caught a signal), then we must readjust the sequence number downward again or the next send will have it too high. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-05-04cifs: remove ENOSPC handling in smb_sendvJeff Layton
To my knowledge, no one ever reported seeing this pop. Acked-by: Suresh Jayaraman <sjayaraman@novell.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-05-04[CIFS] cifs: Rename cERROR and cFYI to cifs_dbgJoe Perches
It's not obvious from reading the macro names that these macros are for debugging. Convert the names to a single more typical kernel style cifs_dbg macro. cERROR(1, ...) -> cifs_dbg(VFS, ...) cFYI(1, ...) -> cifs_dbg(FYI, ...) cFYI(DBG2, ...) -> cifs_dbg(NOISY, ...) Move the terminating format newline from the macro to the call site. Add CONFIG_CIFS_DEBUG function cifs_vfs_err to emit the "CIFS VFS: " prefix for VFS messages. Size is reduced ~ 1% when CONFIG_CIFS_DEBUG is set (default y) $ size fs/cifs/cifs.ko* text data bss dec hex filename 265245 2525 132 267902 4167e fs/cifs/cifs.ko.new 268359 2525 132 271016 422a8 fs/cifs/cifs.ko.old Other miscellaneous changes around these conversions: o Miscellaneous typo fixes o Add terminating \n's to almost all formats and remove them from the macros to be more kernel style like. A few formats previously had defective \n's o Remove unnecessary OOM messages as kmalloc() calls dump_stack o Coalesce formats to make grep easier, added missing spaces when coalescing formats o Use %s, __func__ instead of embedded function name o Removed unnecessary "cifs: " prefixes o Convert kzalloc with multiply to kcalloc o Remove unused cifswarn macro Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-05-04fs: cifs: use kmemdup instead of kmalloc + memcpySilviu-Mihai Popescu
This replaces calls to kmalloc followed by memcpy with a single call to kmemdup. This was found via make coccicheck. Signed-off-by: Silviu-Mihai Popescu <silviupopescu1990@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-05-04cifs: replaced kmalloc + memset with kzallocDia Vasile
Signed-off-by: Diana Vasile <kill.elohim@hotmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-05-04cifs: ignore the unc= and prefixpath= mount optionsJeff Layton
...as advertised for 3.10. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Steve French <smfrench@gmail.com>
2013-05-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second round of VFS updates from Al Viro: "Assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: xtensa simdisk: fix braino in "xtensa simdisk: switch to proc_create_data()" hostfs: use kmalloc instead of kzalloc hostfs: move HOSTFS_SUPER_MAGIC to <linux/magic.h> hostfs: remove "will unlock" comment vfs: use list_move instead of list_del/list_add proc_devtree: Replace include linux/module.h with linux/export.h create_mnt_ns: unidiomatic use of list_add() fs: remove dentry_lru_prune() Removed unused typedef to avoid "unused local typedef" warnings. kill fs/read_write.h fs: Fix hang with BSD accounting on frozen filesystem sun3_scsi: add ->show_info() nubus: Kill nubus_proc_detach_device() more mode_t whack-a-mole... do_coredump(): don't wait for thaw if coredump has already been interrupted do_mount(): fix a leak introduced in 3.9 ("mount: consolidate permission checks")
2013-05-04hostfs: use kmalloc instead of kzallocJames Hogan
The inode info structure is zeroed at allocation with kzalloc, and then all but one of the fields (including the largest, vfs_inode) are initialised explicitly. Switch to using kmalloc and initialise the remaining field too. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04hostfs: move HOSTFS_SUPER_MAGIC to <linux/magic.h>James Hogan
Move HOSTFS_SUPER_MAGIC to <linux/magic.h> to be with it's magical friends from other file systems. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04hostfs: remove "will unlock" commentJames Hogan
A "will unlock" comment was added to hostfs in the following commit, along with a spinlock: Commit e9193059b1b3733695d5b80e667778311695aa73 ("hostfs: fix races in dentry_name() and inode_name()"). But the spinlock was subsequently removed in the following commit: Commit ec2447c278ee973d35f38e53ca16ba7f965ae33d ("hostfs: simplify locking"). Since the comment is no longer applicable, remove it. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04vfs: use list_move instead of list_del/list_addWei Yongjun
Using list_move() instead of list_del() + list_add(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04proc_devtree: Replace include linux/module.h with linux/export.hSyam Sidhardhan
Since it uses only THIS_MODULE macro, include <linux/export.h> is the right to go here. Signed-off-by: Syam Sidhardhan <s.syam@samsung.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04create_mnt_ns: unidiomatic use of list_add()Al Viro
while list_add(A, B) and list_add(B, A) are equivalent when both A and B are guaranteed to be empty, the usual idiom is list_add(what, where), not the other way round... Not a bug per se, but only by accident and it makes RTFS harder for no good reason. Spotted-by: Rajat Sharma <fs.rajat@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04fs: remove dentry_lru_prune()Yan, Zheng
When pruning a dentry, its ancestor dentry can also be pruned. But the ancestor dentry does not go through dput(), so it does not get put on the dentry LRU. Hence associating d_prune with removing the dentry from the LRU is the wrong. The fix is remove dentry_lru_prune(). Call file system's d_prune() callback directly when pruning dentries. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04Removed unused typedef to avoid "unused local typedef" warnings.Han Shen
Fix warnings about unused local typedefs (reported by gcc 4.8). Signed-off-by: Han Shen (shenhan@google.com) Change-Id: I4bccc234f1390daa808d2b309ed112e20c0ac096 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04kill fs/read_write.hAl Viro
fs/compat.c doesn't need it anymore, so let's just move the remaining contents (two typedefs) into fs/read_write.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04do_coredump(): don't wait for thaw if coredump has already been interruptedAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-04do_mount(): fix a leak introduced in 3.9 ("mount: consolidate permission ↵Al Viro
checks") Cc: stable@vger.kernel.org Bisected-by: Michael Leun <lkml20130126@newton.leun.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-03Merge branch 'for-3.10' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd changes from J Bruce Fields: "Highlights include: - Some more DRC cleanup and performance work from Jeff Layton - A gss-proxy upcall from Simo Sorce: currently krb5 mounts to the server using credentials from Active Directory often fail due to limitations of the svcgssd upcall interface. This replacement lifts those limitations. The existing upcall is still supported for backwards compatibility. - More NFSv4.1 support: at this point, if a user with a current client who upgrades from 4.0 to 4.1 should see no regressions. In theory we do everything a 4.1 server is required to do. Patches for a couple minor exceptions are ready for 3.11, and with those and some more testing I'd like to turn 4.1 on by default in 3.11." Fix up semantic conflict as per Stephen Rothwell and linux-next: Commit 030d794bf498 ("SUNRPC: Use gssproxy upcall for server RPCGSS authentication") adds two new users of "PDE(inode)->data", but we're supposed to use "PDE_DATA(inode)" instead since commit d9dda78bad87 ("procfs: new helper - PDE_DATA(inode)"). The old PDE() macro is no longer available since commit c30480b92cf4 ("proc: Make the PROC_I() and PDE() macros internal to procfs") * 'for-3.10' of git://linux-nfs.org/~bfields/linux: (60 commits) NFSD: SECINFO doesn't handle unsupported pseudoflavors correctly NFSD: Simplify GSS flavor encoding in nfsd4_do_encode_secinfo() nfsd: make symbol nfsd_reply_cache_shrinker static svcauth_gss: fix error return code in rsc_parse() nfsd4: don't remap EISDIR errors in rename svcrpc: fix gss-proxy to respect user namespaces SUNRPC: gssp_procedures[] can be static SUNRPC: define {create,destroy}_use_gss_proxy_proc_entry in !PROC case nfsd4: better error return to indicate SSV non-support nfsd: fix EXDEV checking in rename SUNRPC: Use gssproxy upcall for server RPCGSS authentication. SUNRPC: Add RPC based upcall mechanism for RPCGSS auth SUNRPC: conditionally return endtime from import_sec_context SUNRPC: allow disabling idle timeout SUNRPC: attempt AF_LOCAL connect on setup nfsd: Decode and send 64bit time values nfsd4: put_client_renew_locked can be static nfsd4: remove unused macro nfsd4: remove some useless code nfsd4: implement SEQ4_STATUS_RECALLABLE_STATE_REVOKED ...
2013-05-03Merge tag 'jfs-3.10' of git://github.com/kleikamp/linux-shaggyLinus Torvalds
Pull jfs fixes from David Kleikamp: "A couple fixes for jfs" (What's with the unhelpful pull request "explanations" from fs people today?) * tag 'jfs-3.10' of git://github.com/kleikamp/linux-shaggy: jfs: fix a couple races jfs: avoid undefined behavior from left-shifting by 32 bits
2013-05-03Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext3/jbd fixes from Jan Kara: "A couple of ext3/jbd fixes" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: jbd: use kmem_cache_zalloc for allocating journal head jbd: use kmem_cache_zalloc instead of kmem_cache_alloc/memset jbd: don't wait (forever) for stale tid caused by wraparound ext3: fix data=journal fast mount/umount hang
2013-05-03Merge tag 'for-v3.10' of git://git.infradead.org/users/cbou/linux-pstoreLinus Torvalds
Pull pstore update from Anton Vorontsov: - A new platform data parameter to specify ECC configuration; - Rounding fixup to not waste memory in ecc_blocks; - Restore ECC information printouts; - A small code cleanup: use kmemdup where appropriate. * tag 'for-v3.10' of git://git.infradead.org/users/cbou/linux-pstore: pstore/ram: Restore ecc information block pstore/ram: Allow specifying ecc parameters in platform data pstore/ram: Include ecc_size when calculating ecc_block pstore: Replace calls to kmalloc and memcpy with kmemdup
2013-05-03Merge branch 'for_next' into for_linusJan Kara
2013-05-02Merge tag 'for-linus-v3.10-rc1' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs update from Ben Myers: "For 3.10-rc1 we have a number of bug fixes and cleanups and a currently experimental feature from David Chinner, CRCs protection for metadata. CRCs are enabled by using mkfs.xfs to create a filesystem with the feature bits set. - numerous fixes for speculative preallocation - don't verify buffers on IO errors - rename of random32 to prandom32 - refactoring/rearrangement in xfs_bmap.c - removal of unused m_inode_shrink in struct xfs_mount - fix error handling of xfs_bufs and readahead - quota driven preallocation throttling - fix WARN_ON in xfs_vm_releasepage - add ratelimited printk for different alert levels - fix spurious forced shutdowns due to freed Extent Free Intents - remove some obsolete XLOG_CIL_HARD_SPACE_LIMIT() macros - remove some obsoleted comments - (experimental) CRC support for metadata" * tag 'for-linus-v3.10-rc1' of git://oss.sgi.com/xfs/xfs: (46 commits) xfs: fix da node magic number mismatches xfs: Remote attr validation fixes and optimisations xfs: Teach dquot recovery about CONFIG_XFS_QUOTA xfs: add metadata CRC documentation xfs: implement extended feature masks xfs: add CRC checks to the superblock xfs: buffer type overruns blf_flags field xfs: add buffer types to directory and attribute buffers xfs: add CRC protection to remote attributes xfs: split remote attribute code out xfs: add CRCs to attr leaf blocks xfs: add CRCs to dir2/da node blocks xfs: shortform directory offsets change for dir3 format xfs: add CRC checking to dir2 leaf blocks xfs: add CRC checking to dir2 data blocks xfs: add CRC checking to dir2 free blocks xfs: add CRC checks to block format directory blocks xfs: add CRC checks to remote symlinks xfs: split out symlink code into it's own file. xfs: add version 3 inode format with CRCs ...
2013-05-02Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc update from Benjamin Herrenschmidt: "The main highlights this time around are: - A pile of addition POWER8 bits and nits, such as updated performance counter support (Michael Ellerman), new branch history buffer support (Anshuman Khandual), base support for the new PCI host bridge when not using the hypervisor (Gavin Shan) and other random related bits and fixes from various contributors. - Some rework of our page table format by Aneesh Kumar which fixes a thing or two and paves the way for THP support. THP itself will not make it this time around however. - More Freescale updates, including Altivec support on the new e6500 cores, new PCI controller support, and a pile of new boards support and updates. - The usual batch of trivial cleanups & fixes" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits) powerpc: Fix build error for book3e powerpc: Context switch the new EBB SPRs powerpc: Turn on the EBB H/FSCR bits powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S powerpc: Setup BHRB instructions facility in HFSCR for POWER8 powerpc: Fix interrupt range check on debug exception powerpc: Update tlbie/tlbiel as per ISA doc powerpc: Print page size info during boot powerpc: print both base and actual page size on hash failure powerpc: Fix hpte_decode to use the correct decoding for page sizes powerpc: Decode the pte-lp-encoding bits correctly. powerpc: Use encode avpn where we need only avpn values powerpc: Reduce PTE table memory wastage powerpc: Move the pte free routines from common header powerpc: Reduce the PTE_INDEX_SIZE powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format powerpc: New hugepage directory format powerpc: Don't truncate pgd_index wrongly powerpc: Don't hard code the size of pte page powerpc: Save DAR and DSISR in pt_regs on MCE ...
2013-05-01ceph: use ceph_create_snap_context()Alex Elder
Now that we have a library routine to create snap contexts, use it. This is part of: http://tracker.ceph.com/issues/4857 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: kill off osd data write_request parametersAlex Elder
In the incremental move toward supporting distinct data items in an osd request some of the functions had "write_request" parameters to indicate, basically, whether the data belonged to in_data or the out_data. Now that we maintain the data fields in the op structure there is no need to indicate the direction, so get rid of the "write_request" parameters. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01ceph: fix printk format warnings in file.cRandy Dunlap
Fix printk format warnings by using %zd for 'ssize_t' variables: fs/ceph/file.c:751:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat] fs/ceph/file.c:762:2: warning: format '%ld' expects argument of type 'long int', but argument 11 has type 'ssize_t' [-Wformat] Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: ceph-devel@vger.kernel.org Signed-off-by: Sage Weil <sage@inktank.com>
2013-05-01ceph: fix race between writepages and truncateYan, Zheng
ceph_writepages_start() reads inode->i_size in two places. It can get different values between successive read, because truncate can change inode->i_size at any time. The race can lead to mismatch between data length of osd request and pages marked as writeback. When osd request finishes, it clear writeback page according to its data length. So some pages can be left in writeback state forever. The fix is only read inode->i_size once, save its value to a local variable and use the local variable when i_size is needed. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01ceph: apply write checks in ceph_aio_writeYan, Zheng
copy write checks in __generic_file_aio_write to ceph_aio_write. To make these checks cover sync write path. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01ceph: take i_mutex before getting Fw capYan, Zheng
There is deadlock as illustrated bellow. The fix is taking i_mutex before getting Fw cap reference. write truncate MDS --------------------- -------------------- -------------- get Fw cap lock i_mutex lock i_mutex (blocked) request setattr.size -> <- revoke Fw cap Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01libceph: change how "safe" callback is usedAlex Elder
An osd request currently has two callbacks. They inform the initiator of the request when we've received confirmation for the target osd that a request was received, and when the osd indicates all changes described by the request are durable. The only time the second callback is used is in the ceph file system for a synchronous write. There's a race that makes some handling of this case unsafe. This patch addresses this problem. The error handling for this callback is also kind of gross, and this patch changes that as well. In ceph_sync_write(), if a safe callback is requested we want to add the request on the ceph inode's unsafe items list. Because items on this list must have their tid set (by ceph_osd_start_request()), the request added *after* the call to that function returns. The problem with this is that there's a race between starting the request and adding it to the unsafe items list; the request may already be complete before ceph_sync_write() even begins to put it on the list. To address this, we change the way the "safe" callback is used. Rather than just calling it when the request is "safe", we use it to notify the initiator the bounds (start and end) of the period during which the request is *unsafe*. So the initiator gets notified just before the request gets sent to the osd (when it is "unsafe"), and again when it's known the results are durable (it's no longer unsafe). The first call will get made in __send_request(), just before the request message gets sent to the messenger for the first time. That function is only called by __send_queued(), which is always called with the osd client's request mutex held. We then have this callback function insert the request on the ceph inode's unsafe list when we're told the request is unsafe. This will avoid the race because this call will be made under protection of the osd client's request mutex. It also nicely groups the setup and cleanup of the state associated with managing unsafe requests. The name of the "safe" callback field is changed to "unsafe" to better reflect its new purpose. It has a Boolean "unsafe" parameter to indicate whether the request is becoming unsafe or is now safe. Because the "msg" parameter wasn't used, we drop that. This resolves the original problem reportedin: http://tracker.ceph.com/issues/4706 Reported-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-05-01ceph: let osd client clean up for interrupted requestAlex Elder
In ceph_sync_write(), if a safe callback is supplied with a request, and an error is returned by ceph_osdc_wait_request(), a block of code is executed to remove the request from the unsafe writes list and drop references to capabilities acquired just prior to a call to ceph_osdc_wait_request(). The only function used for this callback is sync_write_commit(), and it does *exactly* what that block of error handling code does. Now in ceph_osdc_wait_request(), if an error occurs (due to an interupt during a wait_for_completion_interruptible() call), complete_request() gets called, and that calls the request's safe_callback method if it's defined. So this means that this cleanup activity gets called twice in this case, which is erroneous (and in fact leads to a crash). Fix this by just letting the osd client handle the cleanup in the event of an interrupt. This resolves one problem mentioned in: http://tracker.ceph.com/issues/4706 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-05-01ceph: fix symlink inode operationsYan, Zheng
add getattr/setattr and xattrs related methods. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Greg Farnum <greg@inktank.com>
2013-05-01ceph: Use pseudo-random numbers to choose mdsSam Lang
We don't need to use up entropy to choose an mds, so use prandom_u32() to get a pseudo-random number. Also, we don't need to choose a random mds if only one mds is available, so add special casing for the common case. Fixes http://tracker.ceph.com/issues/3579 Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
2013-05-01libceph: add, don't set data for a messageAlex Elder
Change the names of the functions that put data on a pagelist to reflect that we're adding to whatever's already there rather than just setting it to the one thing. Currently only one data item is ever added to a message, but that's about to change. This resolves: http://tracker.ceph.com/issues/2770 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: combine initializing and setting osd dataAlex Elder
This ends up being a rather large patch but what it's doing is somewhat straightforward. Basically, this is replacing two calls with one. The first of the two calls is initializing a struct ceph_osd_data with data (either a page array, a page list, or a bio list); the second is setting an osd request op so it associates that data with one of the op's parameters. In place of those two will be a single function that initializes the op directly. That means we sort of fan out a set of the needed functions: - extent ops with pages data - extent ops with pagelist data - extent ops with bio list data and - class ops with page data for receiving a response We also have define another one, but it's only used internally: - class ops with pagelist data for request parameters Note that we *still* haven't gotten rid of the osd request's r_data_in and r_data_out fields. All the osd ops refer to them for their data. For now, these data fields are pointers assigned to the appropriate r_data_* field when these new functions are called. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: specify osd op by index in requestAlex Elder
An osd request now holds all of its source op structures, and every place that initializes one of these is in fact initializing one of the entries in the the osd request's array. So rather than supplying the address of the op to initialize, have caller specify the osd request and an indication of which op it would like to initialize. This better hides the details the op structure (and faciltates moving the data pointers they use). Since osd_req_op_init() is a common routine, and it's not used outside the osd client code, give it static scope. Also make it return the address of the specified op (so all the other init routines don't have to repeat that code). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: add data pointers in osd op structuresAlex Elder
An extent type osd operation currently implies that there will be corresponding data supplied in the data portion of the request (for write) or response (for read) message. Similarly, an osd class method operation implies a data item will be supplied to receive the response data from the operation. Add a ceph_osd_data pointer to each of those structures, and assign it to point to eithre the incoming or the outgoing data structure in the osd message. The data is not always available when an op is initially set up, so add two new functions to allow setting them after the op has been initialized. Begin to make use of the data item pointer available in the osd operation rather than the request data in or out structure in places where it's convenient. Add some assertions to verify pointers are always set the way they're expected to be. This is a sort of stepping stone toward really moving the data into the osd request ops, to allow for some validation before making that jump. This is the first in a series of patches that resolve: http://tracker.ceph.com/issues/4657 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: keep source rather than message osd op arrayAlex Elder
An osd request keeps a pointer to the osd operations (ops) array that it builds in its request message. In order to allow each op in the array to have its own distinct data, we will need to keep track of each op's data, and that information does not go over the wire. As long as we're tracking the data we might as well just track the entire (source) op definition for each of the ops. And if we're doing that, we'll have no more need to keep a pointer to the wire-encoded version. This patch makes the array of source ops be kept with the osd request structure, and uses that instead of the version encoded in the message in places where that was previously used. The array will be embedded in the request structure, and the maximum number of ops we ever actually use is currently 2. So reduce CEPH_OSD_MAX_OP to 2 to reduce the size of the structure. The result of doing this sort of ripples back up, and as a result various function parameters and local variables become unnecessary. Make r_num_ops be unsigned, and move the definition of struct ceph_osd_req_op earlier to ensure it's defined where needed. It does not yet add per-op data, that's coming soon. This resolves: http://tracker.ceph.com/issues/4656 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: a few more osd data cleanupsAlex Elder
These are very small changes that make use osd_data local pointers as shorthands for structures being operated on. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01libceph: define osd data initialization helpersAlex Elder
Define and use functions that encapsulate the initializion of a ceph_osd_data structure. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-05-01ceph: build osd request message later for writepagesAlex Elder
Hold off building the osd request message in ceph_writepages_start() until just before it will be submitted to the osd client for execution. We'll still create the request and allocate the page pointer array after we learn we have at least one page to write. A local variable will be used to keep track of the allocated array of pages. Wait until just before submitting the request for assigning that page array pointer to the request message. Create ands use a new function osd_req_op_extent_update() whose purpose is to serve this one spot where the length value supplied when an osd request's op was initially formatted might need to get changed (reduced, never increased) before submitting the request. Previously, ceph_writepages_start() assigned the message header's data length because of this update. That's no longer necessary, because ceph_osdc_build_request() will recalculate the right value to use based on the content of the ops in the request. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>