author Linus Torvalds <torvalds@linux-foundation.org> 2018-12-29 11:21:49 -0800
committer Linus Torvalds <torvalds@linux-foundation.org> 2018-12-29 11:21:49 -0800
commit 3868772b99e3146d02cf47e739d79022eba1d77c
tree d32c0283496e6955937b618981766b5f0878724f /Documentation/filesystems
parent 6f9d71c9c759b1e7d31189a4de228983192c7dc7
parent 942104a21ce4951420ddf6c6b3179a0627301f7e
Merge tag 'docs-5.0' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
 "A fairly normal cycle for documentation stuff. We have a new document
  on perf security, more Italian translations, more improvements to the
  memory-management docs, improvements to the pathname lookup
  documentation, and the usual array of smaller fixes.

  As is often the case, there are a few reaches outside of
  Documentation/ to adjust kerneldoc comments"

* tag 'docs-5.0' of git://git.lwn.net/linux: (38 commits)
  docs: improve pathname-lookup document structure
  configfs: fix wrong name of struct in documentation
  docs/mm-api: link slab_common.c to "The Slab Cache" section
  slab: make kmem_cache_create{_usercopy} description proper kernel-doc
  doc:process: add links where missing
  docs/core-api: make mm-api.rst more structured
  x86, boot: documentation whitespace fixup
  Documentation: devres: note checking needs when converting
  doc:it: add some process/* translations
  doc:it: fixes in process/1.Intro
  Documentation: convert path-lookup from markdown to resturctured text
  Documentation/admin-guide: update admin-guide index.rst
  Documentation/admin-guide: introduce perf-security.rst file
  scripts/kernel-doc: Fix struct and struct field attribute processing
  Documentation: dev-tools: Fix typos in index.rst
  Correct gen_init_cpio tool's documentation
  Document /proc/pid PID reuse behavior
  Documentation: update path-lookup.md for parallel lookups
  Documentation: Use "while" instead of "whilst"
  dmaengine: Add mailing list address to the documentation
  ...
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- Documentation/filesystems/caching/backend-api.txt | 2
-rw-r--r-- Documentation/filesystems/caching/cachefiles.txt | 4
-rw-r--r-- Documentation/filesystems/caching/netfs-api.txt | 2
-rw-r--r-- Documentation/filesystems/caching/operations.txt | 2
-rw-r--r-- Documentation/filesystems/configfs/configfs.txt | 2
-rw-r--r-- Documentation/filesystems/index.rst | 21
-rw-r--r-- Documentation/filesystems/path-lookup.rst (renamed from Documentation/filesystems/path-lookup.md) | 913
-rw-r--r-- Documentation/filesystems/proc.txt | 13
-rw-r--r-- Documentation/filesystems/qnx6.txt | 4
-rw-r--r-- Documentation/filesystems/spufs.txt | 2
-rw-r--r-- Documentation/filesystems/vfs.txt | 2
-rw-r--r-- Documentation/filesystems/xfs-self-describing-metadata.txt | 2
-rw-r--r-- Documentation/filesystems/xfs.txt | 2
13 files changed, 526 insertions, 445 deletions
diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt
index c0bd5677271b..c418280c915f 100644
--- a/Documentation/filesystems/caching/backend-api.txt
+++ b/Documentation/filesystems/caching/backend-api.txt
@@ -704,7 +704,7 @@ FS-Cache provides some utilities that a cache backend may make use of:
void fscache_get_retrieval(struct fscache_retrieval *op);
void fscache_put_retrieval(struct fscache_retrieval *op);
- These two functions are used to retain a retrieval record whilst doing
+ These two functions are used to retain a retrieval record while doing
asynchronous data retrieval and block allocation.
diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.txt
index 748a1ae49e12..28aefcbb1442 100644
--- a/Documentation/filesystems/caching/cachefiles.txt
+++ b/Documentation/filesystems/caching/cachefiles.txt
@@ -45,7 +45,7 @@ filesystems are very specific in nature.
CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
to communication with the daemon. Only one thing may have this open at once,
-and whilst it is open, a cache is at least partially in existence. The daemon
+and while it is open, a cache is at least partially in existence. The daemon
opens this and sends commands down it to control the cache.
CacheFiles is currently limited to a single cache.
@@ -163,7 +163,7 @@ Do not mount other things within the cache as this will cause problems. The
kernel module contains its own very cut-down path walking facility that ignores
mountpoints, but the daemon can't avoid them.
-Do not create, rename or unlink files and directories in the cache whilst the
+Do not create, rename or unlink files and directories in the cache while the
cache is active, as this may cause the state to become uncertain.
Renaming files in the cache might make objects appear to be other objects (the
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
index 2a6f7399c1f3..ba968e8f5704 100644
--- a/Documentation/filesystems/caching/netfs-api.txt
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -382,7 +382,7 @@ MISCELLANEOUS OBJECT REGISTRATION
An optional step is to request an object of miscellaneous type be created in
the cache. This is almost identical to index cookie acquisition. The only
difference is that the type in the object definition should be something other
-than index type. Whilst the parent object could be an index, it's more likely
+than index type. While the parent object could be an index, it's more likely
it would be some other type of object such as a data file.
xattr->cache =
diff --git a/Documentation/filesystems/caching/operations.txt b/Documentation/filesystems/caching/operations.txt
index a1c052cbba35..d8976c434718 100644
--- a/Documentation/filesystems/caching/operations.txt
+++ b/Documentation/filesystems/caching/operations.txt
@@ -171,7 +171,7 @@ Operations are used through the following procedure:
(3) If the submitting thread wants to do the work itself, and has marked the
operation with FSCACHE_OP_MYTHREAD, then it should monitor
FSCACHE_OP_WAITING as described above and check the state of the object if
- necessary (the object might have died whilst the thread was waiting).
+ necessary (the object might have died while the thread was waiting).
When it has finished doing its processing, it should call
fscache_op_complete() and fscache_put_operation() on it.
diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs/configfs.txt
index 3828e85345ae..16e606c11f40 100644
--- a/Documentation/filesystems/configfs/configfs.txt
+++ b/Documentation/filesystems/configfs/configfs.txt
@@ -216,7 +216,7 @@ be called whenever userspace asks for a write(2) on the attribute.
[struct configfs_bin_attribute]
- struct configfs_attribute {
+ struct configfs_bin_attribute {
struct configfs_attribute cb_attr;
void *cb_private;
size_t cb_max_size;
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 46d1b1be3a51..605befab300b 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -359,3 +359,24 @@ encryption of files and directories.
:maxdepth: 2
fscrypt
+
+Pathname lookup
+===============
+
+
+This write-up is based on three articles published at lwn.net:
+
+- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
+- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
+- <https://lwn.net/Articles/650786/> A walk among the symlinks
+
+Written by Neil Brown with help from Al Viro and Jon Corbet.
+It has subsequently been updated to reflect changes in the kernel
+including:
+
+- per-directory parallel name lookup.
+
+.. toctree::
+ :maxdepth: 2
+
+ path-lookup.rst
diff --git a/Documentation/filesystems/path-lookup.md b/Documentation/filesystems/path-lookup.rst
index e2edd45c4bc0..9d6b68853f5b 100644
--- a/Documentation/filesystems/path-lookup.md
+++ b/Documentation/filesystems/path-lookup.rst
@@ -1,20 +1,6 @@
-<head>
-<style> p { max-width:50em} ol, ul {max-width: 40em}</style>
-</head>
-Pathname lookup in Linux.
-=========================
-
-This write-up is based on three articles published at lwn.net:
-
-- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
-- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
-- <https://lwn.net/Articles/650786/> A walk among the symlinks
-
-Written by Neil Brown with help from Al Viro and Jon Corbet.
-
-Introduction
-------------
+Introduction to pathname lookup
+===============================
The most obvious aspect of pathname lookup, which very little
exploration is needed to discover, is that it is complex. There are
@@ -32,58 +18,58 @@ distinctions we need to clarify first.
There are two sorts of ...
--------------------------
-[`openat()`]: http://man7.org/linux/man-pages/man2/openat.2.html
+.. _openat: http://man7.org/linux/man-pages/man2/openat.2.html
Pathnames (sometimes "file names"), used to identify objects in the
filesystem, will be familiar to most readers. They contain two sorts
-of elements: "slashes" that are sequences of one or more "`/`"
+of elements: "slashes" that are sequences of one or more "``/``"
characters, and "components" that are sequences of one or more
-non-"`/`" characters. These form two kinds of paths. Those that
+non-"``/``" characters. These form two kinds of paths. Those that
start with slashes are "absolute" and start from the filesystem root.
The others are "relative" and start from the current directory, or
from some other location specified by a file descriptor given to a
-"xxx`at`" system call such as "[`openat()`]".
+"``XXXat``" system call such as `openat() <openat_>`_.
-[`execveat()`]: http://man7.org/linux/man-pages/man2/execveat.2.html
+.. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
It is tempting to describe the second kind as starting with a
component, but that isn't always accurate: a pathname can lack both
slashes and components, it can be empty, in other words. This is
-generally forbidden in POSIX, but some of those "xxx`at`" system calls
-in Linux permit it when the `AT_EMPTY_PATH` flag is given. For
+generally forbidden in POSIX, but some of those "xxx``at``" system calls
+in Linux permit it when the ``AT_EMPTY_PATH`` flag is given. For
example, if you have an open file descriptor on an executable file you
-can execute it by calling [`execveat()`] passing the file descriptor,
-an empty path, and the `AT_EMPTY_PATH` flag.
+can execute it by calling `execveat() <execveat_>`_ passing
+the file descriptor, an empty path, and the ``AT_EMPTY_PATH`` flag.
These paths can be divided into two sections: the final component and
everything else. The "everything else" is the easy bit. In all cases
it must identify a directory that already exists, otherwise an error
-such as `ENOENT` or `ENOTDIR` will be reported.
+such as ``ENOENT`` or ``ENOTDIR`` will be reported.
The final component is not so simple. Not only do different system
calls interpret it quite differently (e.g. some create it, some do
not), but it might not even exist: neither the empty pathname nor the
pathname that is just slashes have a final component. If it does
-exist, it could be "`.`" or "`..`" which are handled quite differently
+exist, it could be "``.``" or "``..``" which are handled quite differently
from other components.
-[POSIX]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
+.. _POSIX: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
-If a pathname ends with a slash, such as "`/tmp/foo/`" it might be
+If a pathname ends with a slash, such as "``/tmp/foo/``" it might be
tempting to consider that to have an empty final component. In many
ways that would lead to correct results, but not always. In
-particular, `mkdir()` and `rmdir()` each create or remove a directory named
+particular, ``mkdir()`` and ``rmdir()`` each create or remove a directory named
by the final component, and they are required to work with pathnames
-ending in "`/`". According to [POSIX]
+ending in "``/``". According to POSIX_
-> A pathname that contains at least one non- &lt;slash> character and
-> that ends with one or more trailing &lt;slash> characters shall not
-> be resolved successfully unless the last pathname component before
-> the trailing <slash> characters names an existing directory or a
-> directory entry that is to be created for a directory immediately
-> after the pathname is resolved.
+ A pathname that contains at least one non- &lt;slash> character and
+ that ends with one or more trailing &lt;slash> characters shall not
+ be resolved successfully unless the last pathname component before
+ the trailing <slash> characters names an existing directory or a
+ directory entry that is to be created for a directory immediately
+ after the pathname is resolved.
-The Linux pathname walking code (mostly in `fs/namei.c`) deals with
+The Linux pathname walking code (mostly in ``fs/namei.c``) deals with
all of these issues: breaking the path into components, handling the
"everything else" quite separately from the final component, and
checking that the trailing slash is not used where it isn't
@@ -100,15 +86,15 @@ of the possible races are seen most clearly in the context of the
"dcache" and an understanding of that is central to understanding
pathname lookup.
-More than just a cache.
------------------------
+More than just a cache
+----------------------
The "dcache" caches information about names in each filesystem to
make them quickly available for lookup. Each entry (known as a
"dentry") contains three significant fields: a component name, a
pointer to a parent dentry, and a pointer to the "inode" which
contains further information about the object in that parent with
-the given name. The inode pointer can be `NULL` indicating that the
+the given name. The inode pointer can be ``NULL`` indicating that the
name doesn't exist in the parent. While there can be linkage in the
dentry of a directory to the dentries of the children, that linkage is
not used for pathname lookup, and so will not be considered here.
@@ -135,7 +121,7 @@ whether remote filesystems like NFS and 9P, or cluster filesystems
like ocfs2 or cephfs. These filesystems allow the VFS to revalidate
cached information, and must provide their own protection against
awkward races. The VFS can detect these filesystems by the
-`DCACHE_OP_REVALIDATE` flag being set in the dentry.
+``DCACHE_OP_REVALIDATE`` flag being set in the dentry.
REF-walk: simple concurrency management with refcounts and spinlocks
--------------------------------------------------------------------
@@ -144,22 +130,23 @@ With all of those divisions carefully classified, we can now start
looking at the actual process of walking along a path. In particular
we will start with the handling of the "everything else" part of a
pathname, and focus on the "REF-walk" approach to concurrency
-management. This code is found in the `link_path_walk()` function, if
-you ignore all the places that only run when "`LOOKUP_RCU`"
+management. This code is found in the ``link_path_walk()`` function, if
+you ignore all the places that only run when "``LOOKUP_RCU``"
(indicating the use of RCU-walk) is set.
-[Meet the Lockers]: https://lwn.net/Articles/453685/
+.. _Meet the Lockers: https://lwn.net/Articles/453685/
REF-walk is fairly heavy-handed with locks and reference counts. Not
as heavy-handed as in the old "big kernel lock" days, but certainly not
afraid of taking a lock when one is needed. It uses a variety of
different concurrency controls. A background understanding of the
various primitives is assumed, or can be gleaned from elsewhere such
-as in [Meet the Lockers].
+as in `Meet the Lockers`_.
The locking mechanisms used by REF-walk include:
-### dentry->d_lockref ###
+dentry->d_lockref
+~~~~~~~~~~~~~~~~~
This uses the lockref primitive to provide both a spinlock and a
reference count. The special-sauce of this primitive is that the
@@ -168,49 +155,51 @@ with a single atomic memory operation.
Holding a reference on a dentry ensures that the dentry won't suddenly
be freed and used for something else, so the values in various fields
-will behave as expected. It also protects the `->d_inode` reference
+will behave as expected. It also protects the ``->d_inode`` reference
to the inode to some extent.
The association between a dentry and its inode is fairly permanent.
For example, when a file is renamed, the dentry and inode move
together to the new location. When a file is created the dentry will
-initially be negative (i.e. `d_inode` is `NULL`), and will be assigned
+initially be negative (i.e. ``d_inode`` is ``NULL``), and will be assigned
to the new inode as part of the act of creation.
When a file is deleted, this can be reflected in the cache either by
-setting `d_inode` to `NULL`, or by removing it from the hash table
+setting ``d_inode`` to ``NULL``, or by removing it from the hash table
(described shortly) used to look up the name in the parent directory.
If the dentry is still in use the second option is used as it is
perfectly legal to keep using an open file after it has been deleted
and having the dentry around helps. If the dentry is not otherwise in
-use (i.e. if the refcount in `d_lockref` is one), only then will
-`d_inode` be set to `NULL`. Doing it this way is more efficient for a
+use (i.e. if the refcount in ``d_lockref`` is one), only then will
+``d_inode`` be set to ``NULL``. Doing it this way is more efficient for a
very common case.
-So as long as a counted reference is held to a dentry, a non-`NULL` `->d_inode`
+So as long as a counted reference is held to a dentry, a non-``NULL`` ``->d_inode``
value will never be changed.
-### dentry->d_lock ###
+dentry->d_lock
+~~~~~~~~~~~~~~
-`d_lock` is a synonym for the spinlock that is part of `d_lockref` above.
+``d_lock`` is a synonym for the spinlock that is part of ``d_lockref`` above.
For our purposes, holding this lock protects against the dentry being
-renamed or unlinked. In particular, its parent (`d_parent`), and its
-name (`d_name`) cannot be changed, and it cannot be removed from the
+renamed or unlinked. In particular, its parent (``d_parent``), and its
+name (``d_name``) cannot be changed, and it cannot be removed from the
dentry hash table.
-When looking for a name in a directory, REF-walk takes `d_lock` on
+When looking for a name in a directory, REF-walk takes ``d_lock`` on
each candidate dentry that it finds in the hash table and then checks
that the parent and name are correct. So it doesn't lock the parent
while searching in the cache; it only locks children.
-When looking for the parent for a given name (to handle "`..`"),
-REF-walk can take `d_lock` to get a stable reference to `d_parent`,
+When looking for the parent for a given name (to handle "``..``"),
+REF-walk can take ``d_lock`` to get a stable reference to ``d_parent``,
but it first tries a more lightweight approach. As seen in
-`dget_parent()`, if a reference can be claimed on the parent, and if
-subsequently `d_parent` can be seen to have not changed, then there is
+``dget_parent()``, if a reference can be claimed on the parent, and if
+subsequently ``d_parent`` can be seen to have not changed, then there is
no need to actually take the lock on the child.
-### rename_lock ###
+rename_lock
+~~~~~~~~~~~
Looking up a given name in a given directory involves computing a hash
from the two values (the name and the dentry of the directory),
@@ -224,71 +213,117 @@ happened to be looking at a dentry that was moved in this way,
it might end up continuing the search down the wrong chain,
and so miss out on part of the correct chain.
-The name-lookup process (`d_lookup()`) does _not_ try to prevent this
+The name-lookup process (``d_lookup()``) does _not_ try to prevent this
from happening, but only to detect when it happens.
-`rename_lock` is a seqlock that is updated whenever any dentry is
-renamed. If `d_lookup` finds that a rename happened while it
+``rename_lock`` is a seqlock that is updated whenever any dentry is
+renamed. If ``d_lookup`` finds that a rename happened while it
unsuccessfully scanned a chain in the hash table, it simply tries
again.
-### inode->i_mutex ###
+inode->i_rwsem
+~~~~~~~~~~~~~~
-`i_mutex` is a mutex that serializes all changes to a particular
-directory. This ensures that, for example, an `unlink()` and a `rename()`
+``i_rwsem`` is a read/write semaphore that serializes all changes to a particular
+directory. This ensures that, for example, an ``unlink()`` and a ``rename()``
cannot both happen at the same time. It also keeps the directory
stable while the filesystem is asked to look up a name that is not
-currently in the dcache.
+currently in the dcache or, optionally, when the list of entries in a
+directory is being retrieved with ``readdir()``.
-This has a complementary role to that of `d_lock`: `i_mutex` on a
-directory protects all of the names in that directory, while `d_lock`
+This has a complementary role to that of ``d_lock``: ``i_rwsem`` on a
+directory protects all of the names in that directory, while ``d_lock``
on a name protects just one name in a directory. Most changes to the
-dcache hold `i_mutex` on the relevant directory inode and briefly take
-`d_lock` on one or more the dentries while the change happens. One
+dcache hold ``i_rwsem`` on the relevant directory inode and briefly take
+``d_lock`` on one or more the dentries while the change happens. One
exception is when idle dentries are removed from the dcache due to
-memory pressure. This uses `d_lock`, but `i_mutex` plays no role.
+memory pressure. This uses ``d_lock``, but ``i_rwsem`` plays no role.
-The mutex affects pathname lookup in two distinct ways. Firstly it
-serializes lookup of a name in a directory. `walk_component()` uses
-`lookup_fast()` first which, in turn, checks to see if the name is in the cache,
-using only `d_lock` locking. If the name isn't found, then `walk_component()`
-falls back to `lookup_slow()` which takes `i_mutex`, checks again that
+The semaphore affects pathname lookup in two distinct ways. Firstly it
+prevents changes during lookup of a name in a directory. ``walk_component()`` uses
+``lookup_fast()`` first which, in turn, checks to see if the name is in the cache,
+using only ``d_lock`` locking. If the name isn't found, then ``walk_component()``
+falls back to ``lookup_slow()`` which takes a shared lock on ``i_rwsem``, checks again that
the name isn't in the cache, and then calls in to the filesystem to get a
definitive answer. A new dentry will be added to the cache regardless of
the result.
Secondly, when pathname lookup reaches the final component, it will
-sometimes need to take `i_mutex` before performing the last lookup so
+sometimes need to take an exclusive lock on ``i_rwsem`` before performing the last lookup so
that the required exclusion can be achieved. How path lookup chooses
-to take, or not take, `i_mutex` is one of the
+to take, or not take, ``i_rwsem`` is one of the
issues addressed in a subsequent section.
-### mnt->mnt_count ###
-
-`mnt_count` is a per-CPU reference counter on "`mount`" structures.
+If two threads attempt to look up the same name at the same time - a
+name that is not yet in the dcache - the shared lock on ``i_rwsem`` will
+not prevent them both adding new dentries with the same name. As this
+would result in confusion an extra level of interlocking is used,
+based around a secondary hash table (``in_lookup_hashtable``) and a
+per-dentry flag bit (``DCACHE_PAR_LOOKUP``).
+
+To add a new dentry to the cache while only holding a shared lock on
+``i_rwsem``, a thread must call ``d_alloc_parallel()``. This allocates a
+dentry, stores the required name and parent in it, checks if there
+is already a matching dentry in the primary or secondary hash
+tables, and if not, stores the newly allocated dentry in the secondary
+hash table, with ``DCACHE_PAR_LOOKUP`` set.
+
+If a matching dentry was found in the primary hash table then that is
+returned and the caller can know that it lost a race with some other
+thread adding the entry. If no matching dentry is found in either
+cache, the newly allocated dentry is returned and the caller can
+detect this from the presence of ``DCACHE_PAR_LOOKUP``. In this case it
+knows that it has won any race and now is responsible for asking the
+filesystem to perform the lookup and find the matching inode. When
+the lookup is complete, it must call ``d_lookup_done()`` which clears
+the flag and does some other house keeping, including removing the
+dentry from the secondary hash table - it will normally have been
+added to the primary hash table already. Note that a ``struct
+waitqueue_head`` is passed to ``d_alloc_parallel()``, and
+``d_lookup_done()`` must be called while this ``waitqueue_head`` is still
+in scope.
+
+If a matching dentry is found in the secondary hash table,
+``d_alloc_parallel()`` has a little more work to do. It first waits for
+``DCACHE_PAR_LOOKUP`` to be cleared, using a wait_queue that was passed
+to the instance of ``d_alloc_parallel()`` that won the race and that
+will be woken by the call to ``d_lookup_done()``. It then checks to see
+if the dentry has now been added to the primary hash table. If it
+has, the dentry is returned and the caller just sees that it lost any
+race. If it hasn't been added to the primary hash table, the most
+likely explanation is that some other dentry was added instead using
+``d_splice_alias()``. In any case, ``d_alloc_parallel()`` repeats all the
+look ups from the start and will normally return something from the
+primary hash table.
+
+mnt->mnt_count
+~~~~~~~~~~~~~~
+
+``mnt_count`` is a per-CPU reference counter on "``mount``" structures.
Per-CPU here means that incrementing the count is cheap as it only
uses CPU-local memory, but checking if the count is zero is expensive as
-it needs to check with every CPU. Taking a `mnt_count` reference
+it needs to check with every CPU. Taking a ``mnt_count`` reference
prevents the mount structure from disappearing as the result of regular
unmount operations, but does not prevent a "lazy" unmount. So holding
-`mnt_count` doesn't ensure that the mount remains in the namespace and,
+``mnt_count`` doesn't ensure that the mount remains in the namespace and,
in particular, doesn't stabilize the link to the mounted-on dentry. It
-does, however, ensure that the `mount` data structure remains coherent,
+does, however, ensure that the ``mount`` data structure remains coherent,
and it provides a reference to the root dentry of the mounted
-filesystem. So a reference through `->mnt_count` provides a stable
+filesystem. So a reference through ``->mnt_count`` provides a stable
reference to the mounted dentry, but not the mounted-on dentry.
-### mount_lock ###
+mount_lock
+~~~~~~~~~~
-`mount_lock` is a global seqlock, a bit like `rename_lock`. It can be used to
+``mount_lock`` is a global seqlock, a bit like ``rename_lock``. It can be used to
check if any change has been made to any mount points.
While walking down the tree (away from the root) this lock is used when
crossing a mount point to check that the crossing was safe. That is,
the value in the seqlock is read, then the code finds the mount that
is mounted on the current directory, if there is one, and increments
-the `mnt_count`. Finally the value in `mount_lock` is checked against
+the ``mnt_count``. Finally the value in ``mount_lock`` is checked against
the old value. If there is no change, then the crossing was safe. If there
-was a change, the `mnt_count` is decremented and the whole process is
+was a change, the ``mnt_count`` is decremented and the whole process is
retried.
When walking up the tree (towards the root) by following a ".." link,
@@ -298,7 +333,8 @@ any changes to any mount points while stepping up. This locking is
needed to stabilize the link to the mounted-on dentry, which the
refcount on the mount itself doesn't ensure.
-### RCU ###
+RCU
+~~~
Finally the global (but extremely lightweight) RCU read lock is held
from time to time to ensure certain data structures don't get freed
@@ -307,137 +343,141 @@ unexpectedly.
In particular it is held while scanning chains in the dcache hash
table, and the mount point hash table.
-Bringing it together with `struct nameidata`
+Bringing it together with ``struct nameidata``
--------------------------------------------
-[First edition Unix]: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
+.. _First edition Unix: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
Throughout the process of walking a path, the current status is stored
-in a `struct nameidata`, "namei" being the traditional name - dating
-all the way back to [First Edition Unix] - of the function that
-converts a "name" to an "inode". `struct nameidata` contains (among
+in a ``struct nameidata``, "namei" being the traditional name - dating
+all the way back to `First Edition Unix`_ - of the function that
+converts a "name" to an "inode". ``struct nameidata`` contains (among
other fields):
-### `struct path path` ###
+``struct path path``
+~~~~~~~~~~~~~~~~~~
-A `path` contains a `struct vfsmount` (which is
-embedded in a `struct mount`) and a `struct dentry`. Together these
+A ``path`` contains a ``struct vfsmount`` (which is
+embedded in a ``struct mount``) and a ``struct dentry``. Together these
record the current status of the walk. They start out referring to the
starting point (the current working directory, the root directory, or some other
directory identified by a file descriptor), and are updated on each
-step. A reference through `d_lockref` and `mnt_count` is always
+step. A reference through ``d_lockref`` and ``mnt_count`` is always
held.
-### `struct qstr last` ###
+``struct qstr last``
+~~~~~~~~~~~~~~~~~~
-This is a string together with a length (i.e. _not_ `nul` terminated)
+This is a string together with a length (i.e. _not_ ``nul`` terminated)
that is the "next" component in the pathname.
-### `int last_type` ###
+``int last_type``
+~~~~~~~~~~~~~~~
-This is one of `LAST_NORM`, `LAST_ROOT`, `LAST_DOT`, `LAST_DOTDOT`, or
-`LAST_BIND`. The `last` field is only valid if the type is
-`LAST_NORM`. `LAST_BIND` is used when following a symlink and no
+This is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT``, ``LAST_DOTDOT``, or
+``LAST_BIND``. The ``last`` field is only valid if the type is
+``LAST_NORM``. ``LAST_BIND`` is used when following a symlink and no
components of the symlink have been processed yet. Others should be
fairly self-explanatory.
-### `struct path root` ###
+``struct path root``
+~~~~~~~~~~~~~~~~~~
This is used to hold a reference to the effective root of the
filesystem. Often that reference won't be needed, so this field is
only assigned the first time it is used, or when a non-standard root
-is requested. Keeping a reference in the `nameidata` ensures that
+is requested. Keeping a reference in the ``nameidata`` ensures that
only one root is in effect for the entire path walk, even if it races
-with a `chroot()` system call.
+with a ``chroot()`` system call.
The root is needed when either of two conditions holds: (1) either the
-pathname or a symbolic link starts with a "'/'", or (2) a "`..`"
-component is being handled, since "`..`" from the root must always stay
+pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
+component is being handled, since "``..``" from the root must always stay
at the root. The value used is usually the current root directory of
the calling process. An alternate root can be provided as when
-`sysctl()` calls `file_open_root()`, and when NFSv4 or Btrfs call
-`mount_subtree()`. In each case a pathname is being looked up in a very
+``sysctl()`` calls ``file_open_root()``, and when NFSv4 or Btrfs call
+``mount_subtree()``. In each case a pathname is being looked up in a very
specific part of the filesystem, and the lookup must not be allowed to
-escape that subtree. It works a bit like a local `chroot()`.
+escape that subtree. It works a bit like a local ``chroot()``.
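The rule that "``..``" never climbs above the effective root can be sketched in user space as follows. This is a toy model under stated assumptions: ``struct toy_dir`` and ``toy_dotdot()`` are hypothetical names, and the locking and mount-point handling that the kernel performs are left out entirely.

```c
#include <stddef.h>

/* Hypothetical directory node: only the parent link matters here. */
struct toy_dir {
	struct toy_dir *parent;	/* NULL for the real root */
};

/*
 * Handle a ".." component.  'root' is the effective root recorded for
 * this walk; once the walk is standing on it, ".." stays where it is.
 */
struct toy_dir *toy_dotdot(struct toy_dir *cur, const struct toy_dir *root)
{
	if (cur == root || cur->parent == NULL)
		return cur;
	return cur->parent;
}
```

With an alternate root set to some subtree, repeated "``..``" steps stop at that subtree even though the node has a parent, which is the "local ``chroot()``" behaviour described above.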
Ignoring the handling of symbolic links, we can now describe the
-"`link_path_walk()`" function, which handles the lookup of everything
+"``link_path_walk()``" function, which handles the lookup of everything
except the final component as:
-> Given a path (`name`) and a nameidata structure (`nd`), check that the
-> current directory has execute permission and then advance `name`
-> over one component while updating `last_type` and `last`. If that
-> was the final component, then return, otherwise call
-> `walk_component()` and repeat from the top.
+ Given a path (``name``) and a nameidata structure (``nd``), check that the
+ current directory has execute permission and then advance ``name``
+ over one component while updating ``last_type`` and ``last``. If that
+ was the final component, then return, otherwise call
+ ``walk_component()`` and repeat from the top.
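The loop just described can be sketched in user-space C. This is a simplified illustration, not the kernel implementation: every ``toy_`` name is hypothetical, the permission check is reduced to a comment, and ``toy_walk_component()`` only counts how often it is called.

```c
#include <stddef.h>

enum toy_last_type { TOY_NORM, TOY_DOT, TOY_DOTDOT };

/* Hypothetical stand-in for the parts of nameidata used by the loop. */
struct toy_nameidata {
	const char *last;		/* start of the pending component */
	size_t last_len;		/* its length; not nul-terminated */
	enum toy_last_type last_type;
	int steps;			/* calls made to toy_walk_component() */
};

/* Stub: a real walk would look the component up and step into it. */
int toy_walk_component(struct toy_nameidata *nd)
{
	nd->steps++;
	return 0;
}

int toy_link_path_walk(const char *name, struct toy_nameidata *nd)
{
	int err;

	while (*name == '/')
		name++;
	if (!*name)
		return 0;		/* nothing but slashes */

	for (;;) {
		size_t len = 0;

		/* An execute-permission check on the current directory
		 * would happen here. */
		while (name[len] && name[len] != '/')
			len++;

		nd->last = name;
		nd->last_len = len;
		if (len == 1 && name[0] == '.')
			nd->last_type = TOY_DOT;
		else if (len == 2 && name[0] == '.' && name[1] == '.')
			nd->last_type = TOY_DOTDOT;
		else
			nd->last_type = TOY_NORM;

		name += len;
		while (*name == '/')
			name++;
		if (!*name)
			return 0;	/* final component: left to the caller */

		err = toy_walk_component(nd);
		if (err)
			return err;
	}
}
```

For a path with three components the stub is called twice, and ``last``/``last_len`` are left describing the third component for the caller to handle.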
-`walk_component()` is even easier. If the component is `LAST_DOTS`,
-it calls `handle_dots()` which does the necessary locking as already
-described. If it finds a `LAST_NORM` component it first calls
-"`lookup_fast()`" which only looks in the dcache, but will ask the
+``walk_component()`` is even easier. If the component is ``LAST_DOT`` or ``LAST_DOTDOT``,
+it calls ``handle_dots()`` which does the necessary locking as already
+described. If it finds a ``LAST_NORM`` component it first calls
+"``lookup_fast()``" which only looks in the dcache, but will ask the
filesystem to revalidate the result if it is that sort of filesystem.
-If that doesn't get a good result, it calls "`lookup_slow()`" which
-takes the `i_mutex`, rechecks the cache, and then asks the filesystem
+If that doesn't get a good result, it calls "``lookup_slow()``" which
+takes ``i_rwsem``, rechecks the cache, and then asks the filesystem
to find a definitive answer. Each of these will call
-`follow_managed()` (as described below) to handle any mount points.
+``follow_managed()`` (as described below) to handle any mount points.
-In the absence of symbolic links, `walk_component()` creates a new
-`struct path` containing a counted reference to the new dentry and a
-reference to the new `vfsmount` which is only counted if it is
-different from the previous `vfsmount`. It then calls
-`path_to_nameidata()` to install the new `struct path` in the
-`struct nameidata` and drop the unneeded references.
+In the absence of symbolic links, ``walk_component()`` creates a new
+``struct path`` containing a counted reference to the new dentry and a
+reference to the new ``vfsmount`` which is only counted if it is
+different from the previous ``vfsmount``. It then calls
+``path_to_nameidata()`` to install the new ``struct path`` in the
+``struct nameidata`` and drop the unneeded references.
This "hand-over-hand" sequencing of getting a reference to the new
dentry before dropping the reference to the previous dentry may
seem obvious, but is worth pointing out so that we will recognize its
analogue in the "RCU-walk" version.
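The hand-over-hand ordering can be shown with a toy reference count. This is only a sketch of the ordering: ``struct toy_dentry``, ``toy_get()``, ``toy_put()`` and ``toy_step()`` are hypothetical names, and nothing here models the kernel's locking or the ``vfsmount`` reference.

```c
/* Hypothetical object with a plain, non-atomic reference count. */
struct toy_dentry {
	int refcount;
};

void toy_get(struct toy_dentry *d)
{
	d->refcount++;
}

void toy_put(struct toy_dentry *d)
{
	d->refcount--;
}

/*
 * Step from 'cur' to 'next': take the new reference first and only
 * then drop the old one, so the walk always holds at least one
 * reference.
 */
struct toy_dentry *toy_step(struct toy_dentry *cur, struct toy_dentry *next)
{
	toy_get(next);
	toy_put(cur);
	return next;
}
```

Reversing the two calls in ``toy_step()`` would leave a window in which the walk holds no reference at all, which is what the ordering is meant to avoid.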
-Handling the final component.
------------------------------
+Handling the final component
+----------------------------
-`link_path_walk()` only walks as far as setting `nd->last` and
-`nd->last_type` to refer to the final component of the path. It does
-not call `walk_component()` that last time. Handling that final
+``link_path_walk()`` only walks as far as setting ``nd->last`` and
+``nd->last_type`` to refer to the final component of the path. It does
+not call ``walk_component()`` that last time. Handling that final
component remains for the caller to sort out. Those callers are
-`path_lookupat()`, `path_parentat()`, `path_mountpoint()` and
-`path_openat()` each of which handles the differing requirements of
+``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and
+``path_openat()``, each of which handles the differing requirements of
different system calls.
-`path_parentat()` is clearly the simplest - it just wraps a little bit
-of housekeeping around `link_path_walk()` and returns the parent
+``path_parentat()`` is clearly the simplest - it just wraps a little bit
+of housekeeping around ``link_path_walk()`` and returns the parent
directory and final component to the caller. The caller will be either
-aiming to create a name (via `filename_create()`) or remove or rename
-a name (in which case `user_path_parent()` is used). They will use
-`i_mutex` to exclude other changes while they validate and then
+aiming to create a name (via ``filename_create()``) or remove or rename
+a name (in which case ``user_path_parent()`` is used). They will use
+``i_rwsem`` to exclude other changes while they validate and then
perform their operation.
-`path_lookupat()` is nearly as simple - it is used when an existing
-object is wanted such as by `stat()` or `chmod()`. It essentially just
-calls `walk_component()` on the final component through a call to
-`lookup_last()`. `path_lookupat()` returns just the final dentry.
+``path_lookupat()`` is nearly as simple - it is used when an existing
+object is wanted such as by ``stat()`` or ``chmod()``. It essentially just
+calls ``walk_component()`` on the final component through a call to
+``lookup_last()``. ``path_lookupat()`` returns just the final dentry.
-`path_mountpoint()` handles the special case of unmounting which must
+``path_mountpoint()`` handles the special case of unmounting which must
not try to revalidate the mounted filesystem. It effectively
-contains, through a call to `mountpoint_last()`, an alternate
-implementation of `lookup_slow()` which skips that step. This is
+cont