summaryrefslogtreecommitdiffstats
path: root/fs/fscache
AgeCommit message (Collapse)Author
2017-03-02KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload()David Howells
rcu_dereference_key() and user_key_payload() are currently being used in two different, incompatible ways: (1) As a wrapper to rcu_dereference() - when only the RCU read lock used to protect the key. (2) As a wrapper to rcu_dereference_protected() - when the key semaphor is used to protect the key and the may be being modified. Fix this by splitting both of the key wrappers to produce: (1) RCU accessors for keys when caller has the key semaphore locked: dereference_key_locked() user_key_payload_locked() (2) RCU accessors for keys when caller holds the RCU read lock: dereference_key_rcu() user_key_payload_rcu() This should fix following warning in the NFS idmapper =============================== [ INFO: suspicious RCU usage. ] 4.10.0 #1 Tainted: G W ------------------------------- ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 0 1 lock held by mount.nfs/5987: #0: (rcu_read_lock){......}, at: [<d000000002527abc>] nfs_idmap_get_key+0x15c/0x420 [nfsv4] stack backtrace: CPU: 1 PID: 5987 Comm: mount.nfs Tainted: G W 4.10.0 #1 Call Trace: dump_stack+0xe8/0x154 (unreliable) lockdep_rcu_suspicious+0x140/0x190 nfs_idmap_get_key+0x380/0x420 [nfsv4] nfs_map_name_to_uid+0x2a0/0x3b0 [nfsv4] decode_getfattr_attrs+0xfac/0x16b0 [nfsv4] decode_getfattr_generic.constprop.106+0xbc/0x150 [nfsv4] nfs4_xdr_dec_lookup_root+0xac/0xb0 [nfsv4] rpcauth_unwrap_resp+0xe8/0x140 [sunrpc] call_decode+0x29c/0x910 [sunrpc] __rpc_execute+0x140/0x8f0 [sunrpc] rpc_run_task+0x170/0x200 [sunrpc] nfs4_call_sync_sequence+0x68/0xa0 [nfsv4] _nfs4_lookup_root.isra.44+0xd0/0xf0 [nfsv4] nfs4_lookup_root+0xe0/0x350 [nfsv4] nfs4_lookup_root_sec+0x70/0xa0 [nfsv4] nfs4_find_root_sec+0xc4/0x100 [nfsv4] nfs4_proc_get_rootfh+0x5c/0xf0 [nfsv4] nfs4_get_rootfh+0x6c/0x190 [nfsv4] nfs4_server_common_setup+0xc4/0x260 [nfsv4] nfs4_create_server+0x278/0x3c0 [nfsv4] nfs4_remote_mount+0x50/0xb0 [nfsv4] mount_fs+0x74/0x210 vfs_kern_mount+0x78/0x220 nfs_do_root_mount+0xb0/0x140 [nfsv4] nfs4_try_mount+0x60/0x100 [nfsv4] nfs_fs_mount+0x5ec/0xda0 [nfs] mount_fs+0x74/0x210 vfs_kern_mount+0x78/0x220 do_mount+0x254/0xf70 SyS_mount+0x94/0x100 system_call+0x38/0xe0 Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-01-31fscache: Fix dead object requeueDavid Howells
Under some circumstances, an fscache object can become queued such that it fscache_object_work_func() can be called once the object is in the OBJECT_DEAD state. This results in the kernel oopsing when it tries to invoke the handler for the state (which is hard coded to 0x2). The way this comes about is something like the following: (1) The object dispatcher is processing a work state for an object. This is done in workqueue context. (2) An out-of-band event comes in that isn't masked, causing the object to be queued, say EV_KILL. (3) The object dispatcher finishes processing the current work state on that object and then sees there's another event to process, so, without returning to the workqueue core, it processes that event too. It then follows the chain of events that initiates until we reach OBJECT_DEAD without going through a wait state (such as WAIT_FOR_CLEARANCE). At this point, object->events may be 0, object->event_mask will be 0 and oob_event_mask will be 0. (4) The object dispatcher returns to the workqueue processor, and in due course, this sees that the object's work item is still queued and invokes it again. (5) The current state is a work state (OBJECT_DEAD), so the dispatcher jumps to it - resulting in an OOPS. When I'm seeing this, the work state in (1) appears to have been either LOOK_UP_OBJECT or CREATE_OBJECT (object->oob_table is fscache_osm_lookup_oob). The window for (2) is very small: (A) object->event_mask is cleared whilst the event dispatch process is underway - though there's no memory barrier to force this to the top of the function. The window, therefore is from the time the object was selected by the workqueue processor and made requeueable to the time the mask was cleared. (B) fscache_raise_event() will only queue the object if it manages to set the event bit and the corresponding event_mask bit was set. The enqueuement is then deferred slightly whilst we get a ref on the object and get the per-CPU variable for workqueue congestion. This slight deferral slightly increases the probability by allowing extra time for the workqueue to make the item requeueable. Handle this by giving the dead state a processor function and checking the for the dead state address rather than seeing if the processor function is address 0x2. The dead state processor function can then set a flag to indicate that it's occurred and give a warning if it occurs more than once per object. If this race occurs, an oops similar to the following is seen (note the RIP value): BUG: unable to handle kernel NULL pointer dereference at 0000000000000002 IP: [<0000000000000002>] 0x1 PGD 0 Oops: 0010 [#1] SMP Modules linked in: ... CPU: 17 PID: 16077 Comm: kworker/u48:9 Not tainted 3.10.0-327.18.2.el7.x86_64 #1 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015 Workqueue: fscache_object fscache_object_work_func [fscache] task: ffff880302b63980 ti: ffff880717544000 task.ti: ffff880717544000 RIP: 0010:[<0000000000000002>] [<0000000000000002>] 0x1 RSP: 0018:ffff880717547df8 EFLAGS: 00010202 RAX: ffffffffa0368640 RBX: ffff880edf7a4480 RCX: dead000000200200 RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff880edf7a4480 RBP: ffff880717547e18 R08: 0000000000000000 R09: dfc40a25cb3a4510 R10: dfc40a25cb3a4510 R11: 0000000000000400 R12: 0000000000000000 R13: ffff880edf7a4510 R14: ffff8817f6153400 R15: 0000000000000600 FS: 0000000000000000(0000) GS:ffff88181f420000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000002 CR3: 000000000194a000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffffffffa0363695 ffff880edf7a4510 ffff88093f16f900 ffff8817faa4ec00 ffff880717547e60 ffffffff8109d5db 00000000faa4ec18 0000000000000000 ffff8817faa4ec18 ffff88093f16f930 ffff880302b63980 ffff88093f16f900 Call Trace: [<ffffffffa0363695>] ? fscache_object_work_func+0xa5/0x200 [fscache] [<ffffffff8109d5db>] process_one_work+0x17b/0x470 [<ffffffff8109e4ac>] worker_thread+0x21c/0x400 [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400 [<ffffffff810a5acf>] kthread+0xcf/0xe0 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 [<ffffffff816460d8>] ret_from_fork+0x58/0x90 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeremy McNicoll <jeremymc@redhat.com> Tested-by: Frank Sorenson <sorenson@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-31fscache: Clear outstanding writes when disabling a cookieDavid Howells
fscache_disable_cookie() needs to clear the outstanding writes on the cookie it's disabling because they cannot be completed after. Without this, fscache_nfs_open_file() gets stuck because it disables the cookie when the file is opened for writing but can't uncache the pages till afterwards - otherwise there's a race between the open routine and anyone who already has it open R/O and is still reading from it. Looking in /proc/pid/stack of the offending process shows: [<ffffffffa0142883>] __fscache_wait_on_page_write+0x82/0x9b [fscache] [<ffffffffa014336e>] __fscache_uncache_all_inode_pages+0x91/0xe1 [fscache] [<ffffffffa01740fa>] nfs_fscache_open_file+0x59/0x9e [nfs] [<ffffffffa01ccf41>] nfs4_file_open+0x17f/0x1b8 [nfsv4] [<ffffffff8117350e>] do_dentry_open+0x16d/0x2b7 [<ffffffff811743ac>] vfs_open+0x5c/0x65 [<ffffffff81184185>] path_openat+0x785/0x8fb [<ffffffff81184343>] do_filp_open+0x48/0x9e [<ffffffff81174710>] do_sys_open+0x13b/0x1cb [<ffffffff811747b9>] SyS_open+0x19/0x1b [<ffffffff81001c44>] do_syscall_64+0x80/0x17a [<ffffffff8165c2da>] return_from_SYSCALL_64+0x0/0x7a [<ffffffffffffffff>] 0xffffffffffffffff Reported-by: Jianhong Yin <jiyin@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-31FS-Cache: Initialise stores_lock in netfs cookieDavid Howells
Initialise the stores_lock in fscache netfs cookies. Technically, it shouldn't be necessary, since the netfs cookie is an index and stores no data, but initialising it anyway adds insignificant overhead. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-06-30Merge branch 'd_real' of ↵Al Viro
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into work.misc
2016-06-01FS-Cache: wake write waiter after invalidating writesYan, Zheng
Signed-off-by: Yan, Zheng <zyan@redhat.com> Acked-by: David Howells <dhowells@redhat.com>
2016-05-29drop redundant ->owner initializationsAl Viro
it's not needed for file_operations of inodes located on fs defined in the hosting module and for file_operations that go into procfs. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-11FS-Cache: Handle a write to the page immediately beyond the EOF markerDavid Howells
Handle a write being requested to the page immediately beyond the EOF marker on a cache object. Currently this gets an assertion failure in CacheFiles because the EOF marker is used there to encode information about a partial page at the EOF - which could lead to an unknown blank spot in the file if we extend the file over it. The problem is actually in fscache where we check the index of the page being written against store_limit. store_limit is set to the number of pages that we're allowed to store by fscache_set_store_limit() - which means it's one more than the index of the last page we're allowed to store. The problem is that we permit writing to a page with an index _equal_ to the store limit - when we should reject that case. Whilst we're at it, change the triggered assertion in CacheFiles to just return -ENOBUFS instead. The assertion failure looks something like this: CacheFiles: Assertion failed 1000 < 7b1 is false ------------[ cut here ]------------ kernel BUG at fs/cachefiles/rdwr.c:962! ... RIP: 0010:[<ffffffffa02c9e83>] [<ffffffffa02c9e83>] cachefiles_write_page+0x273/0x2d0 [cachefiles] Cc: stable@vger.kernel.org # v2.6.31+; earlier - that + backport of a17754f (at least) Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-11-11FS-Cache: Don't override netfs's primary_index if registering failedKinglong Mee
Only override netfs->primary_index when registering success. Cc: stable@vger.kernel.org # v2.6.30+ Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-11-11FS-Cache: Increase reference of parent after registering, netfs successKinglong Mee
If netfs exist, fscache should not increase the reference of parent's usage and n_children, otherwise, never be decreased. v2: thanks David's suggest, move increasing reference of parent if success use kmem_cache_free() freeing primary_index directly v3: don't move "netfs->primary_index->parent = &fscache_fsdef_index;" Cc: stable@vger.kernel.org # v2.6.30+ Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-11-06mm, page_alloc: distinguish between being unable to sleep, unwilling to ↵Mel Gorman
sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-21KEYS: Merge the type-specific data with the payload dataDavid Howells
Merge the type-specific data with the payload data into one four-word chunk as it seems pointless to keep them separate. Use user_key_payload() for accessing the payloads of overloaded user-defined keys. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-cifs@vger.kernel.org cc: ecryptfs@vger.kernel.org cc: linux-ext4@vger.kernel.org cc: linux-f2fs-devel@lists.sourceforge.net cc: linux-nfs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: linux-ima-devel@lists.sourceforge.net
2015-04-02FS-Cache: Retain the netfs context in the retrieval op earlierDavid Howells
Now that the retrieval operation may be disposed of by fscache_put_operation() before we actually set the context, the retrieval-specific cleanup operation can produce a NULL-pointer dereference when it tries to unconditionally clean up the netfs context. Given that it is expected that we'll get at least as far as the place where we currently set the context pointer and it is unlikely we'll go through the error handling paths prior to that point, retain the context right from the point that the retrieval op is allocated. Concomitant to this, we need to retain the cookie pointer in the retrieval op also so that we can call the netfs to release its context in the release method. In addition, we might now get into fscache_release_retrieval_op() with the op only initialised. To this end, set the operation to DEAD only after the release method has been called and skip the n_pages test upon cleanup if the op is still in the INITIALISED state. Without these changes, the following oops might be seen: BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8 ... RIP: 0010:[<ffffffffa0089c98>] fscache_release_retrieval_op+0xae/0x100 ... Call Trace: [<ffffffffa0088560>] fscache_put_operation+0x117/0x2e0 [<ffffffffa008b8f5>] __fscache_read_or_alloc_pages+0x351/0x3ac [<ffffffffa00b761f>] __nfs_readpages_from_fscache+0x59/0xbf [nfs] [<ffffffffa00b06c5>] nfs_readpages+0x10c/0x185 [nfs] [<ffffffff81124925>] ? alloc_pages_current+0x119/0x13e [<ffffffff810ee5fd>] ? __page_cache_alloc+0xfb/0x10a [<ffffffff810f87f8>] __do_page_cache_readahead+0x188/0x22c [<ffffffff810f8b3a>] ondemand_readahead+0x29e/0x2af [<ffffffff810f8c92>] page_cache_sync_readahead+0x38/0x3a [<ffffffff810ef337>] generic_file_read_iter+0x1a2/0x55a [<ffffffffa00a9dff>] ? nfs_revalidate_mapping+0xd6/0x288 [nfs] [<ffffffffa00a6a23>] nfs_file_read+0x49/0x70 [nfs] [<ffffffff811363be>] new_sync_read+0x78/0x9c [<ffffffff81137164>] __vfs_read+0x13/0x38 [<ffffffff8113721e>] vfs_read+0x95/0x121 [<ffffffff811372f6>] SyS_read+0x4c/0x8a [<ffffffff81557a52>] system_call_fastpath+0x12/0x17 Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: The operation cancellation method needs calling in more placesDavid Howells
Any time an incomplete operation is cancelled, the operation cancellation function needs to be called to clean up. This is currently being passed directly to some of the functions that might want to call it, but not all. Instead, pass the cancellation method pointer to the fscache_operation_init() and have that cache it in the operation struct. Further, plug in a dummy cancellation handler if the caller declines to set one as this allows us to call the function unconditionally (the extra overhead isn't worth bothering about as we don't expect to be calling this typically). The cancellation method must thence be called everywhere the CANCELLED state is set. Note that we call it *before* setting the CANCELLED state such that the method can use the old state value to guide its operation. fscache_do_cancel_retrieval() needs moving higher up in the sources so that the init function can use it now. Without this, the following oops may be seen: FS-Cache: Assertion failed FS-Cache: 3 == 0 is false ------------[ cut here ]------------ kernel BUG at ../fs/fscache/page.c:261! ... RIP: 0010:[<ffffffffa0089c1b>] fscache_release_retrieval_op+0x77/0x100 [<ffffffffa008853d>] fscache_put_operation+0x114/0x2da [<ffffffffa008b8c2>] __fscache_read_or_alloc_pages+0x358/0x3b3 [<ffffffffa00b761f>] __nfs_readpages_from_fscache+0x59/0xbf [nfs] [<ffffffffa00b06c5>] nfs_readpages+0x10c/0x185 [nfs] [<ffffffff81124925>] ? alloc_pages_current+0x119/0x13e [<ffffffff810ee5fd>] ? __page_cache_alloc+0xfb/0x10a [<ffffffff810f87f8>] __do_page_cache_readahead+0x188/0x22c [<ffffffff810f8b3a>] ondemand_readahead+0x29e/0x2af [<ffffffff810f8c92>] page_cache_sync_readahead+0x38/0x3a [<ffffffff810ef337>] generic_file_read_iter+0x1a2/0x55a [<ffffffffa00a9dff>] ? nfs_revalidate_mapping+0xd6/0x288 [nfs] [<ffffffffa00a6a23>] nfs_file_read+0x49/0x70 [nfs] [<ffffffff811363be>] new_sync_read+0x78/0x9c [<ffffffff81137164>] __vfs_read+0x13/0x38 [<ffffffff8113721e>] vfs_read+0x95/0x121 [<ffffffff811372f6>] SyS_read+0x4c/0x8a [<ffffffff81557a52>] system_call_fastpath+0x12/0x17 The assertion is showing that the remaining number of pages (n_pages) is not 0 when the operation is being released. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Put an aborted initialised op so that it is accounted correctlyDavid Howells
Call fscache_put_operation() or a wrapper on any op that has gone through fscache_operation_init() so that the accounting shown in /proc is done correctly, specifically fscache_n_op_release. fscache_put_operation() therefore now allows an op in the INITIALISED state as well as in the CANCELLED and COMPLETE states. Note that this means that an operation can get put that doesn't have its ->object pointer filled in, so anything that depends on the object needs to be conditional in fscache_put_operation(). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Fix cancellation of in-progress operationDavid Howells
Cancellation of an in-progress operation needs to update the relevant counters and start any operations that are pending waiting on this one. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Count the number of initialised operationsDavid Howells
Count and display through /proc/fs/fscache/stats the number of initialised operations. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Out of line fscache_operation_init()David Howells
Out of line fscache_operation_init() so that it can access internal FS-Cache features, such as stats, in a later commit. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Permit fscache_cancel_op() to cancel in-progress operations tooDavid Howells
Currently, fscache_cancel_op() only cancels pending operations - attempts to cancel in-progress operations are ignored. This leads to a problem in fscache_wait_for_operation_activation() whereby the wait is terminated, but the object has been killed. The check at the end of the function now triggers because it's no longer contingent on the cache having produced an I/O error since the commit that fixed the logic error in fscache_object_is_dead(). The result of the check is that it tries to cancel the operation - but since the object may not be pending by this point, the cancellation request may be ignored - with the result that the the object is just put by the caller and fscache_put_operation has an assertion failure because the operation isn't in either the COMPLETE or the CANCELLED states. To fix this, we permit in-progress ops to be cancelled under some circumstances. The bug results in an oops that looks something like this: FS-Cache: fscache_wait_for_operation_activation() = -ENOBUFS [obj dead 3] FS-Cache: FS-Cache: Assertion failed FS-Cache: 3 == 5 is false ------------[ cut here ]------------ kernel BUG at ../fs/fscache/operation.c:432! ... RIP: 0010:[<ffffffffa0088574>] fscache_put_operation+0xf2/0x2cd Call Trace: [<ffffffffa008b92a>] __fscache_read_or_alloc_pages+0x2ec/0x3b3 [<ffffffffa00b761f>] __nfs_readpages_from_fscache+0x59/0xbf [nfs] [<ffffffffa00b06c5>] nfs_readpages+0x10c/0x185 [nfs] [<ffffffff81124925>] ? alloc_pages_current+0x119/0x13e [<ffffffff810ee5fd>] ? __page_cache_alloc+0xfb/0x10a [<ffffffff810f87f8>] __do_page_cache_readahead+0x188/0x22c [<ffffffff810f8b3a>] ondemand_readahead+0x29e/0x2af [<ffffffff810f8c92>] page_cache_sync_readahead+0x38/0x3a [<ffffffff810ef337>] generic_file_read_iter+0x1a2/0x55a [<ffffffffa00a9dff>] ? nfs_revalidate_mapping+0xd6/0x288 [nfs] [<ffffffffa00a6a23>] nfs_file_read+0x49/0x70 [nfs] [<ffffffff811363be>] new_sync_read+0x78/0x9c [<ffffffff81137164>] __vfs_read+0x13/0x38 [<ffffffff8113721e>] vfs_read+0x95/0x121 [<ffffffff811372f6>] SyS_read+0x4c/0x8a [<ffffffff81557a52>] system_call_fastpath+0x12/0x17 Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: fscache_object_is_dead() has wrong logic, kill itDavid Howells
fscache_object_is_dead() returns true only if the object is marked dead and the cache got an I/O error. This should be a logical OR instead. Since two of the callers got split up into handling for separate subcases, expand the other callers and kill the function. This is probably the right thing to do anyway since one of the subcases isn't about the object at all, but rather about the cache. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Synchronise object death state change vs operation submissionDavid Howells
When an object is being marked as no longer live, do this under the object spinlock to prevent a race with operation submission targeted on that object. The problem occurs due to the following pair of intertwined sequences when the cache tries to create an object that would take it over the hard available space limit: NETFS INTERFACE =============== (A) The netfs calls fscache_acquire_cookie(). object creation is deferred to the object state machine and the netfs is allowed to continue. OBJECT STATE MACHINE KTHREAD ============================ (1) The object is looked up on disk by fscache_look_up_object() calling cachefiles_walk_to_object(). The latter finds that the object is not yet represented on disk and calls fscache_object_lookup_negative(). (2) fscache_object_lookup_negative() sets FSCACHE_COOKIE_NO_DATA_YET and clears FSCACHE_COOKIE_LOOKING_UP, thus allowing the netfs to start queuing read operations. (B) The netfs calls fscache_read_or_alloc_pages(). This calls fscache_wait_for_deferred_lookup() which sees FSCACHE_COOKIE_LOOKING_UP become clear, allowing the read to begin. (C) A read operation is set up and passed to fscache_submit_op() to deal with. (3) cachefiles_walk_to_object() calls cachefiles_has_space(), which fails (or one of the file operations to create stuff fails). cachefiles returns an error to fscache. (4) fscache_look_up_object() transits to the LOOKUP_FAILURE state, (5) fscache_lookup_failure() sets FSCACHE_OBJECT_LOOKED_UP and FSCACHE_COOKIE_UNAVAILABLE and clears FSCACHE_COOKIE_LOOKING_UP then transits to the KILL_OBJECT state. (6) fscache_kill_object() clears FSCACHE_OBJECT_IS_LIVE in an attempt to reject any further requests from the netfs. (7) object->n_ops is examined and found to be 0. fscache_kill_object() transits to the DROP_OBJECT state. (D) fscache_submit_op() locks the object spinlock, sees if it can dispatch the op immediately by calling fscache_object_is_active() - which fails since FSCACHE_OBJECT_IS_AVAILABLE has not yet been set. (E) fscache_submit_op() then tests FSCACHE_OBJECT_LOOKED_UP - which is set. It then queues the object and increments object->n_ops. (8) fscache_drop_object() releases the object and eventually fscache_put_object() calls cachefiles_put_object() which suffers an assertion failure here: ASSERTCMP(object->fscache.n_ops, ==, 0); Locking the object spinlock in step (6) around the clearance of FSCACHE_OBJECT_IS_LIVE ensures that the the decision trees in fscache_submit_op() and fscache_submit_exclusive_op() don't see the IS_LIVE flag being cleared mid-decision: either the op is queued before step (7) - in which case fscache_kill_object() will see n_ops>0 and will deal with the op - or the op will be rejected. This, combined with rejecting op submission if the target object is dying, fix the problem. The problem shows up as the following oops: CacheFiles: Assertion failed CacheFiles: 1 == 0 is false ------------[ cut here ]------------ kernel BUG at ../fs/cachefiles/interface.c:339! ... RIP: 0010:[<ffffffffa014fd9c>] [<ffffffffa014fd9c>] cachefiles_put_object+0x2a4/0x301 [cachefiles] ... Call Trace: [<ffffffffa008674b>] fscache_put_object+0x18/0x21 [fscache] [<ffffffffa00883e6>] fscache_object_work_func+0x3ba/0x3c9 [fscache] [<ffffffff81054dad>] process_one_work+0x226/0x441 [<ffffffff81055d91>] worker_thread+0x273/0x36b [<ffffffff81055b1e>] ? rescuer_thread+0x2e1/0x2e1 [<ffffffff81059b9d>] kthread+0x10e/0x116 [<ffffffff81059a8f>] ? kthread_create_on_node+0x1bb/0x1bb [<ffffffff815579ac>] ret_from_fork+0x7c/0xb0 [<ffffffff81059a8f>] ? kthread_create_on_node+0x1bb/0x1bb Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Handle a new operation submitted against a killed objectDavid Howells
Reject new operations that are being submitted against an object if that object has failed its lookup or creation states or has been killed by the cache backend for some other reason, such as having been culled. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: When submitting an op, cancel it if the target object is dyingDavid Howells
When submitting an operation, prefer to cancel the operation immediately rather than queuing it for later processing if the object is marked as dying (ie. the object state machine has reached the KILL_OBJECT state). Whilst we're at it, change the series of related test_bit() calls into a READ_ONCE() and bitwise-AND operators to reduce the number of load instructions (test_bit() has a volatile address). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-04-02FS-Cache: Move fscache_report_unexpected_submission() to make it more availableDavid Howells
Move fscache_report_unexpected_submission() up within operation.c so that it can be called from fscache_submit_exclusive_op() too. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2015-02-24FS-Cache: Count culled objects and objects rejected due to lack of spaceDavid Howells
Count the number of objects that get culled by the cache backend and the number of objects that the cache backend declines to instantiate due to lack of space in the cache. These numbers are made available through /proc/fs/fscache/stats Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com>
2014-10-13fs/fscache/object-list.c: use __seq_open_private()Rob Jones
Reduce boilerplate code by using __seq_open_private() instead of seq_open() in fscache_objlist_open(). Signed-off-by: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com>
2014-09-17FS-Cache: refcount becomes corrupt under vma pressure.Milosz Tanski
In rare cases under heavy VMA pressure the ref count for a fscache cookie becomes corrupt. In this case we decrement ref count even if we fail before incrementing the refcount. FS-Cache: Assertion failed bnode-eca5f9c6/syslog 0 > 0 is false ------------[ cut here ]------------ kernel BUG at fs/fscache/cookie.c:519! invalid opcode: 0000 [#1] SMP Call Trace: [<ffffffffa01ba060>] __fscache_relinquish_cookie+0x50/0x220 [fscache] [<ffffffffa02d64ce>] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph] [<ffffffffa02ae1d3>] ceph_destroy_inode+0x33/0x200 [ceph] [<ffffffff811cf67e>] ? __fsnotify_inode_delete+0xe/0x10 [<ffffffff811a9e0c>] destroy_inode+0x3c/0x70 [<ffffffff811a9f51>] evict+0x111/0x180 [<ffffffff811aa763>] iput+0x103/0x190 [<ffffffff811a5de8>] __dentry_kill+0x1c8/0x220 [<ffffffff811a5f31>] shrink_dentry_list+0xf1/0x250 [<ffffffff811a762c>] prune_dcache_sb+0x4c/0x60 [<ffffffff811930af>] super_cache_scan+0xff/0x170 [<ffffffff8113d7a0>] shrink_slab_node+0x140/0x2c0 [<ffffffff8113f2da>] shrink_slab+0x8a/0x130 [<ffffffff81142572>] balance_pgdat+0x3e2/0x5d0 [<ffffffff811428ca>] kswapd+0x16a/0x4a0 [<ffffffff810a43f0>] ? __wake_up_sync+0x20/0x20 [<ffffffff81142760>] ? balance_pgdat+0x5d0/0x5d0 [<ffffffff81083e09>] kthread+0xc9/0xe0 [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90 [<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0 [<ffffffff8159f63c>] ret_from_fork+0x7c/0xb0 [<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0 RIP [<ffffffffa01b984b>] __fscache_disable_cookie+0x1db/0x210 [fscache] RSP <ffff8803bc85f9b8> ---[ end trace 254d0d7c74a01f25 ]--- Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com>
2014-08-27FS-Cache: Reduce cookie ref count if submit fails.Milosz Tanski
I've been seeing issues with disposing cookies under vma pressure. The symptom is that the refcount gets out of sync. In this case we fail to decrement the refcount if submit fails. I found this while auditing the error in and around cookie operations. Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com>
2014-08-27FS-Cache: Timeout for releasepage()Milosz Tanski
This is meant to avoid a recusive hang caused by underlying filesystem trying to grab a free page and causing a write-out. INFO: task kworker/u30:7:28375 blocked for more than 120 seconds. Not tainted 3.15.0-virtual #74 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u30:7 D 0000000000000000 0 28375 2 0x00000000 Workqueue: fscache_operation fscache_op_work_func [fscache] ffff88000b147148 0000000000000046 0000000000000000 ffff88000b1471c8 ffff8807aa031820 0000000000014040 ffff88000b147fd8 0000000000014040 ffff880f0c50c860 ffff8807aa031820 ffff88000b147158 ffff88007be59cd0 Call Trace: [<ffffffff815930e9>] schedule+0x29/0x70 [<ffffffffa018bed5>] __fscache_wait_on_page_write+0x55/0x90 [fscache] [<ffffffff810a4350>] ? __wake_up_sync+0x20/0x20 [<ffffffffa018c135>] __fscache_maybe_release_page+0x65/0x1e0 [fscache] [<ffffffffa02ad813>] ceph_releasepage+0x83/0x100 [ceph] [<ffffffff811635b0>] ? anon_vma_fork+0x130/0x130 [<ffffffff8112cdd2>] try_to_release_page+0x32/0x50 [<ffffffff81140096>] shrink_page_list+0x7e6/0x9d0 [<ffffffff8113f278>] ? isolate_lru_pages.isra.73+0x78/0x1e0 [<ffffffff81140932>] shrink_inactive_list+0x252/0x4c0 [<ffffffff811412b1>] shrink_lruvec+0x3e1/0x670 [<ffffffff8114157f>] shrink_zone+0x3f/0x110 [<ffffffff81141b06>] do_try_to_free_pages+0x1d6/0x450 [<ffffffff8114a939>] ? zone_statistics+0x99/0xc0 [<ffffffff81141e44>] try_to_free_pages+0xc4/0x180 [<ffffffff81136982>] __alloc_pages_nodemask+0x6b2/0xa60 [<ffffffff811c1d4e>] ? __find_get_block+0xbe/0x250 [<ffffffff810a405e>] ? wake_up_bit+0x2e/0x40 [<ffffffff811740c3>] alloc_pages_current+0xb3/0x180 [<ffffffff8112cf07>] __page_cache_alloc+0xb7/0xd0 [<ffffffff8112da6c>] grab_cache_page_write_begin+0x7c/0xe0 [<ffffffff81214072>] ? ext4_mark_inode_dirty+0x82/0x220 [<ffffffff81214a89>] ext4_da_write_begin+0x89/0x2d0 [<ffffffff8112c6ee>] generic_perform_write+0xbe/0x1d0 [<ffffffff811a96b1>] ? update_time+0x81/0xc0 [<ffffffff811ad4c2>] ? mnt_clone_write+0x12/0x30 [<ffffffff8112e80e>] __generic_file_aio_write+0x1ce/0x3f0 [<ffffffff8112ea8e>] generic_file_aio_write+0x5e/0xe0 [<ffffffff8120b94f>] ext4_file_write+0x9f/0x410 [<ffffffff8120af56>] ? ext4_file_open+0x66/0x180 [<ffffffff8118f0da>] do_sync_write+0x5a/0x90 [<ffffffffa025c6c9>] cachefiles_write_page+0x149/0x430 [cachefiles] [<ffffffff812cf439>] ? radix_tree_gang_lookup_tag+0x89/0xd0 [<ffffffffa018c512>] fscache_write_op+0x222/0x3b0 [fscache] [<ffffffffa018b35a>] fscache_op_work_func+0x3a/0x100 [fscache] [<ffffffff8107bfe9>] process_one_work+0x179/0x4a0 [<ffffffff8107d47b>] worker_thread+0x11b/0x370 [<ffffffff8107d360>] ? manage_workers.isra.21+0x2e0/0x2e0 [<ffffffff81083d69>] kthread+0xc9/0xe0 [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90 [<ffffffff81083ca0>] ? flush_kthread_worker+0xb0/0xb0 [<ffffffff8159eefc>] ret_from_fork+0x7c/0xb0 [<ffffffff81083ca0>] ? flush_kthread_worker+0xb0/0xb0 Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com>
2014-08-06fs/fscache: make ctl_table staticFabian Frederick
fscache_sysctls and fscache_sysctls_root are only used in main.c Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: David Howells <dhowells@redhat.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-16sched: Remove proliferation of wait_on_bit() action functionsNeilBrown
The current "wait_on_bit" interface requires an 'action' function to be provided which does the actual waiting. There are over 20 such functions, many of them identical. Most cases can be satisfied by one of just two functions, one which uses io_schedule() and one which just uses schedule(). So: Rename wait_on_bit and wait_on_bit_lock to wait_on_bit_action and wait_on_bit_lock_action to make it explicit that they need an action function. Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io which are *not* given an action function but implicitly use a standard one. The decision to error-out if a signal is pending is now made based on the 'mode' argument rather than being encoded in the action function. All instances of the old wait_on_bit and wait_on_bit_lock which can use the new version have been changed accordingly and their action functions have been discarded. wait_on_bit{_lock} does not return any specific error code in the event of a signal so the caller must check for non-zero and interpolate their own error code as appropriate. The wait_on_bit() call in __fscache_wait_on_invalidate() was ambiguous as it specified TASK_UNINTERRUPTIBLE but used fscache_wait_bit_interruptible as an action function. David Howells confirms this should be uniformly "uninterruptible" The main remaining user of wait_on_bit{,_lock}_action is NFS which needs to use a freezer-aware schedule() call. A comment in fs/gfs2/glock.c notes that having multiple 'action' functions is useful as they display differently in the 'wchan' field of 'ps'. (and /proc/$PID/wchan). As the new bit_wait{,_io} functions are tagged "__sched", they will not show up at all, but something higher in the stack. So the distinction will still be visible, only with different function names (gds2_glock_wait versus gfs2_glock_dq_wait in the gfs2/glock.c case). Since first version of this patch (against 3.15) two new action functions appeared, on in NFS and one in CIFS. CIFS also now uses an action function that makes the same freezer aware schedule call as NFS. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: David Howells <dhowells@redhat.com> (fscache, keys) Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2) Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steve French <sfrench@samba.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-06-06fscache: convert use of typedef ctl_table to struct ctl_tableJoe Perches
This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04fs/fscache: replace seq_printf by seq_putsFabian Frederick
Replace seq_printf where possible + coalesce formats from 2 existing seq_puts Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04fs/fscache: convert printk to pr_foo()Fabian Frederick
All printk converted to pr_foo() except internal.h: printk(KERN_DEBUG Coalesce formats. Add pr_fmt Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-17FS-Cache: Handle removal of unadded object to the fscache_object_list rb treeDavid Howells
When FS-Cache allocates an object, the following sequence of events can occur: -->fscache_alloc_object() -->cachefiles_alloc_object() [via cache->ops->alloc_object] <--[returns new object] -->fscache_attach_object() <--[failed] -->cachefiles_put_object() [via cache->ops->put_object] -->fscache_object_destroy() -->fscache_objlist_remove() -->rb_erase() to remove the object from fscache_object_list. resulting in a crash in the rbtree code. The problem is that the object is only added to fscache_object_list on the success path of fscache_attach_object() where it calls fscache_objlist_add(). So if fscache_attach_object() fails, the object won't have been added to the objlist rbtree. We do, however, unconditionally try to remove the object from the tree. Thanks to NeilBrown for finding this and suggesting this solution. Reported-by: NeilBrown <neilb@suse.de> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: (a customer of) NeilBrown <neilb@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-14Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block IO core updates from Jens Axboe: "This is the pull request for the core changes in the block layer for 3.13. It contains: - The new blk-mq request interface. This is a new and more scalable queueing model that marries the best part of the request based interface we currently have (which is fully featured, but scales poorly) and the bio based "interface" which the new drivers for high IOPS devices end up using because it's much faster than the request based one. The bio interface has no block layer support, since it taps into the stack much earlier. This means that drivers end up having to implement a lot of functionality on their own, like tagging, timeout handling, requeue, etc. The blk-mq interface provides all these. Some drivers even provide a switch to select bio or rq and has code to handle both, since things like merging only works in the rq model and hence is faster for some workloads. This is a huge mess. Conversion of these drivers nets us a substantial code reduction. Initial results on converting SCSI to this model even shows an 8x improvement on single queue devices. So while the model was intended to work on the newer multiqueue devices, it has substantial improvements for "classic" hardware as well. This code has gone through extensive testing and development, it's now ready to go. A pull request is coming to convert virtio-blk to this model will be will be coming as well, with more drivers scheduled for 3.14 conversion. - Two blktrace fixes from Jan and Chen Gang. - A plug merge fix from Alireza Haghdoost. - Conversion of __get_cpu_var() from Christoph Lameter. - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven. - A fix for a race between request completion and the timeout handling from Jeff Moyer. This is what caused the merge conflict with blk-mq/core, in case you are looking at that. - A dm stacking fix from Mike Snitzer. - A code consolidation fix and duplicated code removal from Kent Overstreet. - A handful of block bug fixes from Mikulas Patocka, fixing a loop crash and memory corruption on blk cg. - Elevator switch bug fix from Tomoki Sekiyama. A heads-up that I had to rebase this branch. Initially the immutable bio_vecs had been queued up for inclusion, but a week later, it became clear that it wasn't fully cooked yet. So the decision was made to pull this out and postpone it until 3.14. It was a straight forward rebase, just pruning out the immutable series and the later fixes of problems with it. The rest of the patches applied directly and no further changes were made" * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits) block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: Do not call sector_div() with a 64-bit divisor kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup() block: Consolidate duplicated bio_trim() implementations block: Use rw_copy_check_uvector() block: Enable sysfs nomerge control for I/O requests in the plug list block: properly stack underlying max_segment_size to DM device elevator: acquire q->sysfs_lock in elevator_change() elevator: Fix a race in elevator switching and md device initialization block: Replace __get_cpu_var uses bdi: test bdi_init failure block: fix a probe argument to blk_register_region loop: fix crash if blk_alloc_queue fails blk-core: Fix memory corruption if blkcg_init_queue fails block: fix race between request completion and timeout handling blktrace: Send BLK_TN_PROCESS events to all running traces blk-mq: don't disallow request merges for req->special being set blk-mq: mq plug list breakage blk-mq: fix for flush deadlock ...
2013-11-08block: Replace __get_cpu_var usesChristoph Lameter
__get_cpu_var() is used for multiple purposes in the kernel source. One of them is address calculation via the form &__get_cpu_var(x). This calculates the address for the instance of the percpu variable of the current processor based on an offset. Other use cases are for storing and retrieving data from the current processors percpu area. __get_cpu_var() can be used as an lvalue when writing data or on the right side of an assignment. __get_cpu_var() is defined as : #define __get_cpu_var(var) (*this_cpu_ptr(&(var))) __get_cpu_var() always only does an address determination. However, store and retrieve operations could use a segment prefix (or global register on other platforms) to avoid the address calculation. this_cpu_write() and this_cpu_read() can directly take an offset into a percpu area and use optimized assembly code to read and write per cpu va