author: NeilBrown <neilb@suse.com>, 2018-12-05 10:02:51 +1100
committer: Jonathan Corbet <corbet@lwn.net>, 2018-12-06 10:06:51 -0700
commit: 7bbfd9ad8eb24e6683f7a0467edfcff6c189d492
tree: bc0a026340feeda4ba91f7e2d193c539a5dbfca7 (Documentation/filesystems)
parent: 036c20c06e43679a006e1cf932ce8284f4b39b42
Documentation: convert path-lookup from markdown to resturctured text
This allows the document to be integrated with the main documentation tree.

Changes include:
- rename from .md to .rst
- use `` for code, not single `
- use correct sub-section marking
- fix indented blocks, both code and non-code
- fix external-link markup

Signed-off-by: NeilBrown <neilb@suse.com>
[jc: changed the toctree organization a bit]
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- Documentation/filesystems/index.rst | 11
-rw-r--r-- Documentation/filesystems/path-lookup.rst (renamed from Documentation/filesystems/path-lookup.md) | 889
2 files changed, 464 insertions, 436 deletions
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 46d1b1be3a51..ba921bdd5b06 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -359,3 +359,14 @@ encryption of files and directories.
:maxdepth: 2
fscrypt
+
+Pathname lookup
+===============
+
+Pathname lookup in Linux is a complex beast; the document linked below
+provides a comprehensive summary for those looking for the details.
+
+.. toctree::
+ :maxdepth: 2
+
+ path-lookup.rst
diff --git a/Documentation/filesystems/path-lookup.md b/Documentation/filesystems/path-lookup.rst
index 06151b178f80..30a155736afe 100644
--- a/Documentation/filesystems/path-lookup.md
+++ b/Documentation/filesystems/path-lookup.rst
@@ -1,9 +1,6 @@
-<head>
-<style> p { max-width:50em} ol, ul {max-width: 40em}</style>
-</head>
-
-Pathname lookup in Linux.
-=========================
+========================
+Pathname lookup in Linux
+========================
This write-up is based on three articles published at lwn.net:
@@ -17,8 +14,8 @@ including:
- per-directory parallel name lookup.
-Introduction
-------------
+Introduction to pathname lookup
+===============================
The most obvious aspect of pathname lookup, which very little
exploration is needed to discover, is that it is complex. There are
@@ -36,58 +33,58 @@ distinctions we need to clarify first.
There are two sorts of ...
--------------------------
-[`openat()`]: http://man7.org/linux/man-pages/man2/openat.2.html
+.. _openat: http://man7.org/linux/man-pages/man2/openat.2.html
Pathnames (sometimes "file names"), used to identify objects in the
filesystem, will be familiar to most readers. They contain two sorts
-of elements: "slashes" that are sequences of one or more "`/`"
+of elements: "slashes" that are sequences of one or more "``/``"
characters, and "components" that are sequences of one or more
-non-"`/`" characters. These form two kinds of paths. Those that
+non-"``/``" characters. These form two kinds of paths. Those that
start with slashes are "absolute" and start from the filesystem root.
The others are "relative" and start from the current directory, or
from some other location specified by a file descriptor given to a
-"xxx`at`" system call such as "[`openat()`]".
+"``XXXat``" system call such as `openat() <openat_>`_.
-[`execveat()`]: http://man7.org/linux/man-pages/man2/execveat.2.html
+.. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
It is tempting to describe the second kind as starting with a
component, but that isn't always accurate: a pathname can lack both
slashes and components, it can be empty, in other words. This is
-generally forbidden in POSIX, but some of those "xxx`at`" system calls
-in Linux permit it when the `AT_EMPTY_PATH` flag is given. For
+generally forbidden in POSIX, but some of those "xxx``at``" system calls
+in Linux permit it when the ``AT_EMPTY_PATH`` flag is given. For
example, if you have an open file descriptor on an executable file you
-can execute it by calling [`execveat()`] passing the file descriptor,
-an empty path, and the `AT_EMPTY_PATH` flag.
+can execute it by calling `execveat() <execveat_>`_ passing
+the file descriptor, an empty path, and the ``AT_EMPTY_PATH`` flag.
These paths can be divided into two sections: the final component and
everything else. The "everything else" is the easy bit. In all cases
it must identify a directory that already exists, otherwise an error
-such as `ENOENT` or `ENOTDIR` will be reported.
+such as ``ENOENT`` or ``ENOTDIR`` will be reported.
The final component is not so simple. Not only do different system
calls interpret it quite differently (e.g. some create it, some do
not), but it might not even exist: neither the empty pathname nor the
pathname that is just slashes have a final component. If it does
-exist, it could be "`.`" or "`..`" which are handled quite differently
+exist, it could be "``.``" or "``..``" which are handled quite differently
from other components.
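
As a rough userspace illustration of the split between "everything else" and the final component, the sketch below locates the final component of a pathname string. It is not kernel code: ``classify_path()`` and ``struct path_shape`` are hypothetical names invented for this example, and trailing slashes are only recorded, not interpreted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a pathname; not a kernel structure. */
struct path_shape {
    bool absolute;        /* pathname starts with one or more slashes */
    bool trailing_slash;  /* slashes follow the final component */
    size_t last_off;      /* offset of the final component in the string */
    size_t last_len;      /* length of the final component, 0 if there is none */
};

static struct path_shape classify_path(const char *p)
{
    struct path_shape s = { false, false, 0, 0 };
    size_t end, e, b;

    assert(p != NULL);
    end = strlen(p);
    s.absolute = (end > 0 && p[0] == '/');

    /* Step back over any trailing slashes. */
    e = end;
    while (e > 0 && p[e - 1] == '/')
        e--;
    if (e == 0)
        return s;  /* empty, or slashes only: no final component */
    s.trailing_slash = (e < end);

    /* Step back to the start of the final component. */
    b = e;
    while (b > 0 && p[b - 1] != '/')
        b--;
    s.last_off = b;
    s.last_len = e - b;
    return s;
}
```

For ``/tmp/foo/`` this reports an absolute path whose final component is ``foo`` with a trailing slash, while both the empty string and ``///`` report no final component at all.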
-[POSIX]: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
+.. _POSIX: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
-If a pathname ends with a slash, such as "`/tmp/foo/`" it might be
+If a pathname ends with a slash, such as "``/tmp/foo/``" it might be
tempting to consider that to have an empty final component. In many
ways that would lead to correct results, but not always. In
-particular, `mkdir()` and `rmdir()` each create or remove a directory named
+particular, ``mkdir()`` and ``rmdir()`` each create or remove a directory named
by the final component, and they are required to work with pathnames
-ending in "`/`". According to [POSIX]
+ending in "``/``". According to POSIX_
-> A pathname that contains at least one non- &lt;slash> character and
-> that ends with one or more trailing &lt;slash> characters shall not
-> be resolved successfully unless the last pathname component before
-> the trailing <slash> characters names an existing directory or a
-> directory entry that is to be created for a directory immediately
-> after the pathname is resolved.
+ A pathname that contains at least one non- &lt;slash> character and
+ that ends with one or more trailing &lt;slash> characters shall not
+ be resolved successfully unless the last pathname component before
+ the trailing <slash> characters names an existing directory or a
+ directory entry that is to be created for a directory immediately
+ after the pathname is resolved.
-The Linux pathname walking code (mostly in `fs/namei.c`) deals with
+The Linux pathname walking code (mostly in ``fs/namei.c``) deals with
all of these issues: breaking the path into components, handling the
"everything else" quite separately from the final component, and
checking that the trailing slash is not used where it isn't
@@ -104,15 +101,15 @@ of the possible races are seen most clearly in the context of the
"dcache" and an understanding of that is central to understanding
pathname lookup.
-More than just a cache.
------------------------
+More than just a cache
+----------------------
The "dcache" caches information about names in each filesystem to
make them quickly available for lookup. Each entry (known as a
"dentry") contains three significant fields: a component name, a
pointer to a parent dentry, and a pointer to the "inode" which
contains further information about the object in that parent with
-the given name. The inode pointer can be `NULL` indicating that the
+the given name. The inode pointer can be ``NULL`` indicating that the
name doesn't exist in the parent. While there can be linkage in the
dentry of a directory to the dentries of the children, that linkage is
not used for pathname lookup, and so will not be considered here.
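
The three fields named above can be pictured with a toy structure. This is only a sketch for orientation: the ``toy_`` types and helpers are hypothetical, the real ``struct dentry`` carries many more fields, and the one borrowed convention is that a root dentry is its own parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; the real structures are far richer. */
struct toy_inode {
    unsigned long ino;
};

struct toy_dentry {
    const char *name;           /* component name */
    struct toy_dentry *parent;  /* parent dentry; a root points to itself */
    struct toy_inode *inode;    /* NULL: the name does not exist in the parent */
};

/* A "negative" dentry caches the fact that a name is absent. */
static bool toy_is_negative(const struct toy_dentry *d)
{
    assert(d != NULL);
    return d->inode == NULL;
}

static bool toy_is_root(const struct toy_dentry *d)
{
    return d->parent == d;
}

/* Count the steps from a dentry up to its root via the parent pointers. */
static unsigned toy_depth(const struct toy_dentry *d)
{
    unsigned depth = 0;

    while (!toy_is_root(d)) {
        d = d->parent;
        depth++;
    }
    return depth;
}
```

Note that the sketch only ever follows pointers from child to parent, matching the remark that parent-to-child linkage plays no part in lookup.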
@@ -139,7 +136,7 @@ whether remote filesystems like NFS and 9P, or cluster filesystems
like ocfs2 or cephfs. These filesystems allow the VFS to revalidate
cached information, and must provide their own protection against
awkward races. The VFS can detect these filesystems by the
-`DCACHE_OP_REVALIDATE` flag being set in the dentry.
+``DCACHE_OP_REVALIDATE`` flag being set in the dentry.
REF-walk: simple concurrency management with refcounts and spinlocks
--------------------------------------------------------------------
@@ -148,22 +145,23 @@ With all of those divisions carefully classified, we can now start
looking at the actual process of walking along a path. In particular
we will start with the handling of the "everything else" part of a
pathname, and focus on the "REF-walk" approach to concurrency
-management. This code is found in the `link_path_walk()` function, if
-you ignore all the places that only run when "`LOOKUP_RCU`"
+management. This code is found in the ``link_path_walk()`` function, if
+you ignore all the places that only run when "``LOOKUP_RCU``"
(indicating the use of RCU-walk) is set.
-[Meet the Lockers]: https://lwn.net/Articles/453685/
+.. _Meet the Lockers: https://lwn.net/Articles/453685/
REF-walk is fairly heavy-handed with locks and reference counts. Not
as heavy-handed as in the old "big kernel lock" days, but certainly not
afraid of taking a lock when one is needed. It uses a variety of
different concurrency controls. A background understanding of the
various primitives is assumed, or can be gleaned from elsewhere such
-as in [Meet the Lockers].
+as in `Meet the Lockers`_.
The locking mechanisms used by REF-walk include:
-### dentry->d_lockref ###
+dentry->d_lockref
+~~~~~~~~~~~~~~~~~
This uses the lockref primitive to provide both a spinlock and a
reference count. The special-sauce of this primitive is that the
@@ -172,49 +170,51 @@ with a single atomic memory operation.
Holding a reference on a dentry ensures that the dentry won't suddenly
be freed and used for something else, so the values in various fields
-will behave as expected. It also protects the `->d_inode` reference
+will behave as expected. It also protects the ``->d_inode`` reference
to the inode to some extent.
The association between a dentry and its inode is fairly permanent.
For example, when a file is renamed, the dentry and inode move
together to the new location. When a file is created the dentry will
-initially be negative (i.e. `d_inode` is `NULL`), and will be assigned
+initially be negative (i.e. ``d_inode`` is ``NULL``), and will be assigned
to the new inode as part of the act of creation.
When a file is deleted, this can be reflected in the cache either by
-setting `d_inode` to `NULL`, or by removing it from the hash table
+setting ``d_inode`` to ``NULL``, or by removing it from the hash table
(described shortly) used to look up the name in the parent directory.
If the dentry is still in use the second option is used as it is
perfectly legal to keep using an open file after it has been deleted
and having the dentry around helps. If the dentry is not otherwise in
-use (i.e. if the refcount in `d_lockref` is one), only then will
-`d_inode` be set to `NULL`. Doing it this way is more efficient for a
+use (i.e. if the refcount in ``d_lockref`` is one), only then will
+``d_inode`` be set to ``NULL``. Doing it this way is more efficient for a
very common case.
-So as long as a counted reference is held to a dentry, a non-`NULL` `->d_inode`
+So as long as a counted reference is held to a dentry, a non-``NULL`` ``->d_inode``
value will never be changed.
-### dentry->d_lock ###
+dentry->d_lock
+~~~~~~~~~~~~~~
-`d_lock` is a synonym for the spinlock that is part of `d_lockref` above.
+``d_lock`` is a synonym for the spinlock that is part of ``d_lockref`` above.
For our purposes, holding this lock protects against the dentry being
-renamed or unlinked. In particular, its parent (`d_parent`), and its
-name (`d_name`) cannot be changed, and it cannot be removed from the
+renamed or unlinked. In particular, its parent (``d_parent``), and its
+name (``d_name``) cannot be changed, and it cannot be removed from the
dentry hash table.
-When looking for a name in a directory, REF-walk takes `d_lock` on
+When looking for a name in a directory, REF-walk takes ``d_lock`` on
each candidate dentry that it finds in the hash table and then checks
that the parent and name are correct. So it doesn't lock the parent
while searching in the cache; it only locks children.
-When looking for the parent for a given name (to handle "`..`"),
-REF-walk can take `d_lock` to get a stable reference to `d_parent`,
+When looking for the parent for a given name (to handle "``..``"),
+REF-walk can take ``d_lock`` to get a stable reference to ``d_parent``,
but it first tries a more lightweight approach. As seen in
-`dget_parent()`, if a reference can be claimed on the parent, and if
-subsequently `d_parent` can be seen to have not changed, then there is
+``dget_parent()``, if a reference can be claimed on the parent, and if
+subsequently ``d_parent`` can be seen to have not changed, then there is
no need to actually take the lock on the child.
-### rename_lock ###
+rename_lock
+~~~~~~~~~~~
Looking up a given name in a given directory involves computing a hash
from the two values (the name and the dentry of the directory),
@@ -228,114 +228,117 @@ happened to be looking at a dentry that was moved in this way,
it might end up continuing the search down the wrong chain,
and so miss out on part of the correct chain.
-The name-lookup process (`d_lookup()`) does _not_ try to prevent this
+The name-lookup process (``d_lookup()``) does _not_ try to prevent this
from happening, but only to detect when it happens.
-`rename_lock` is a seqlock that is updated whenever any dentry is
-renamed. If `d_lookup` finds that a rename happened while it
+``rename_lock`` is a seqlock that is updated whenever any dentry is
+renamed. If ``d_lookup`` finds that a rename happened while it
unsuccessfully scanned a chain in the hash table, it simply tries
again.
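
The detect-and-retry idea can be sketched in userspace with a sequence counter. Everything here is an assumption made for illustration: the ``toy_`` names are invented, the counter uses C11 atomics rather than the kernel's seqlock, and ``toy_racy_scan()`` merely simulates a rename landing during the first scan.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy sequence lock: the counter is odd while a writer is active. */
struct toy_seqlock {
    atomic_uint seq;
};

static void toy_write_begin(struct toy_seqlock *sl)
{
    atomic_fetch_add(&sl->seq, 1);  /* counter becomes odd */
}

static void toy_write_end(struct toy_seqlock *sl)
{
    atomic_fetch_add(&sl->seq, 1);  /* counter becomes even again */
}

static unsigned toy_read_begin(struct toy_seqlock *sl)
{
    unsigned s;

    /* Wait until no writer is in progress. */
    while ((s = atomic_load(&sl->seq)) & 1)
        ;
    return s;
}

static bool toy_read_retry(struct toy_seqlock *sl, unsigned start)
{
    return atomic_load(&sl->seq) != start;
}

typedef void *(*toy_scan_fn)(void *ctx);

/* Scan; on failure, scan again only if a "rename" happened meanwhile. */
static void *toy_lookup(struct toy_seqlock *sl, toy_scan_fn scan, void *ctx,
                        unsigned *attempts)
{
    void *found;
    unsigned seq;

    *attempts = 0;
    do {
        seq = toy_read_begin(sl);
        (*attempts)++;
        found = scan(ctx);
        if (found)
            break;  /* a successful scan needs no recheck */
    } while (toy_read_retry(sl, seq));
    return found;
}

/* Demo scans used to exercise toy_lookup(). */
struct toy_race {
    struct toy_seqlock *sl;
    int calls;
};

static int toy_item = 42;

/* First call: a rename sneaks in and the scan misses. Later calls succeed. */
static void *toy_racy_scan(void *ctx)
{
    struct toy_race *r = ctx;

    if (r->calls++ == 0) {
        toy_write_begin(r->sl);
        toy_write_end(r->sl);
        return NULL;
    }
    return &toy_item;
}

/* The name really is absent and nothing is renamed. */
static void *toy_empty_scan(void *ctx)
{
    (void)ctx;
    return NULL;
}
```

The design point is that readers never block writers: a failed scan is only trusted if the counter did not move while it ran.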
-### inode->i_rwsem ###
+inode->i_rwsem
+~~~~~~~~~~~~~~
-`i_rwsem` is a read/write semaphore that serializes all changes to a particular
-directory. This ensures that, for example, an `unlink()` and a `rename()`
+``i_rwsem`` is a read/write semaphore that serializes all changes to a particular
+directory. This ensures that, for example, an ``unlink()`` and a ``rename()``
cannot both happen at the same time. It also keeps the directory
stable while the filesystem is asked to look up a name that is not
currently in the dcache or, optionally, when the list of entries in a
-directory is being retrieved with `readdir()`.
+directory is being retrieved with ``readdir()``.
-This has a complementary role to that of `d_lock`: `i_rwsem` on a
-directory protects all of the names in that directory, while `d_lock`
+This has a complementary role to that of ``d_lock``: ``i_rwsem`` on a
+directory protects all of the names in that directory, while ``d_lock``
on a name protects just one name in a directory. Most changes to the
-dcache hold `i_rwsem` on the relevant directory inode and briefly take
-`d_lock` on one or more the dentries while the change happens. One
+dcache hold ``i_rwsem`` on the relevant directory inode and briefly take
+``d_lock`` on one or more the dentries while the change happens. One
exception is when idle dentries are removed from the dcache due to
-memory pressure. This uses `d_lock`, but `i_rwsem` plays no role.
+memory pressure. This uses ``d_lock``, but ``i_rwsem`` plays no role.
The semaphore affects pathname lookup in two distinct ways. Firstly it
-prevents changes during lookup of a name in a directory. `walk_component()` uses
-`lookup_fast()` first which, in turn, checks to see if the name is in the cache,
-using only `d_lock` locking. If the name isn't found, then `walk_component()`
-falls back to `lookup_slow()` which takes a shared lock on `i_rwsem`, checks again that
+prevents changes during lookup of a name in a directory. ``walk_component()`` uses
+``lookup_fast()`` first which, in turn, checks to see if the name is in the cache,
+using only ``d_lock`` locking. If the name isn't found, then ``walk_component()``
+falls back to ``lookup_slow()`` which takes a shared lock on ``i_rwsem``, checks again that
the name isn't in the cache, and then calls in to the filesystem to get a
definitive answer. A new dentry will be added to the cache regardless of
the result.
Secondly, when pathname lookup reaches the final component, it will
-sometimes need to take an exclusive lock on `i_rwsem` before performing the last lookup so
+sometimes need to take an exclusive lock on ``i_rwsem`` before performing the last lookup so
that the required exclusion can be achieved. How path lookup chooses
-to take, or not take, `i_rwsem` is one of the
+to take, or not take, ``i_rwsem`` is one of the
issues addressed in a subsequent section.
If two threads attempt to look up the same name at the same time - a
-name that is not yet in the dcache - the shared lock on `i_rwsem` will
+name that is not yet in the dcache - the shared lock on ``i_rwsem`` will
not prevent them both adding new dentries with the same name. As this
would result in confusion an extra level of interlocking is used,
-based around a secondary hash table (`in_lookup_hashtable`) and a
-per-dentry flag bit (`DCACHE_PAR_LOOKUP`).
+based around a secondary hash table (``in_lookup_hashtable``) and a
+per-dentry flag bit (``DCACHE_PAR_LOOKUP``).
To add a new dentry to the cache while only holding a shared lock on
-`i_rwsem`, a thread must call `d_alloc_parallel()`. This allocates a
+``i_rwsem``, a thread must call ``d_alloc_parallel()``. This allocates a
dentry, stores the required name and parent in it, checks if there
is already a matching dentry in the primary or secondary hash
tables, and if not, stores the newly allocated dentry in the secondary
-hash table, with `DCACHE_PAR_LOOKUP` set.
+hash table, with ``DCACHE_PAR_LOOKUP`` set.
If a matching dentry was found in the primary hash table then that is
returned and the caller can know that it lost a race with some other
thread adding the entry. If no matching dentry is found in either
cache, the newly allocated dentry is returned and the caller can
-detect this from the presence of `DCACHE_PAR_LOOKUP`. In this case it
+detect this from the presence of ``DCACHE_PAR_LOOKUP``. In this case it
knows that it has won any race and now is responsible for asking the
filesystem to perform the lookup and find the matching inode. When
-the lookup is complete, it must call `d_lookup_done()` which clears
+the lookup is complete, it must call ``d_lookup_done()`` which clears
the flag and does some other house keeping, including removing the
dentry from the secondary hash table - it will normally have been
-added to the primary hash table already. Note that a `struct
-waitqueue_head` is passed to `d_alloc_parallel()`, and
-`d_lookup_done()` must be called while this `waitqueue_head` is still
+added to the primary hash table already. Note that a ``struct
+waitqueue_head`` is passed to ``d_alloc_parallel()``, and
+``d_lookup_done()`` must be called while this ``waitqueue_head`` is still
in scope.
If a matching dentry is found in the secondary hash table,
-`d_alloc_parallel()` has a little more work to do. It first waits for
-`DCACHE_PAR_LOOKUP` to be cleared, using a wait_queue that was passed
-to the instance of `d_alloc_parallel()` that won the race and that
-will be woken by the call to `d_lookup_done()`. It then checks to see
+``d_alloc_parallel()`` has a little more work to do. It first waits for
+``DCACHE_PAR_LOOKUP`` to be cleared, using a wait_queue that was passed
+to the instance of ``d_alloc_parallel()`` that won the race and that
+will be woken by the call to ``d_lookup_done()``. It then checks to see
if the dentry has now been added to the primary hash table. If it
has, the dentry is returned and the caller just sees that it lost any
race. If it hasn't been added to the primary hash table, the most
likely explanation is that some other dentry was added instead using
-`d_splice_alias()`. In any case, `d_alloc_parallel()` repeats all the
+``d_splice_alias()``. In any case, ``d_alloc_parallel()`` repeats all the
look ups from the start and will normally return something from the
primary hash table.
-### mnt->mnt_count ###
+mnt->mnt_count
+~~~~~~~~~~~~~~
-`mnt_count` is a per-CPU reference counter on "`mount`" structures.
+``mnt_count`` is a per-CPU reference counter on "``mount``" structures.
Per-CPU here means that incrementing the count is cheap as it only
uses CPU-local memory, but checking if the count is zero is expensive as
-it needs to check with every CPU. Taking a `mnt_count` reference
+it needs to check with every CPU. Taking a ``mnt_count`` reference
prevents the mount structure from disappearing as the result of regular
unmount operations, but does not prevent a "lazy" unmount. So holding
-`mnt_count` doesn't ensure that the mount remains in the namespace and,
+``mnt_count`` doesn't ensure that the mount remains in the namespace and,
in particular, doesn't stabilize the link to the mounted-on dentry. It
-does, however, ensure that the `mount` data structure remains coherent,
+does, however, ensure that the ``mount`` data structure remains coherent,
and it provides a reference to the root dentry of the mounted
-filesystem. So a reference through `->mnt_count` provides a stable
+filesystem. So a reference through ``->mnt_count`` provides a stable
reference to the mounted dentry, but not the mounted-on dentry.
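
The cost asymmetry of a per-CPU counter can be shown with a plain array. This is a deliberately simplified sketch: the ``toy_`` names and the fixed CPU count are assumptions, the caller passes the CPU number explicitly, and no concurrency control is modelled.

```c
#include <assert.h>

#define TOY_NR_CPUS 4  /* arbitrary size chosen for the example */

/* One slot per CPU; only the sum of all slots is meaningful. */
struct toy_mnt_count {
    long per_cpu[TOY_NR_CPUS];
};

/* Cheap: touches only the slot belonging to the given CPU. */
static void toy_mnt_get(struct toy_mnt_count *c, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    c->per_cpu[cpu]++;
}

static void toy_mnt_put(struct toy_mnt_count *c, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    c->per_cpu[cpu]--;
}

/* Expensive: has to visit every CPU's slot to learn the true count. */
static long toy_mnt_total(const struct toy_mnt_count *c)
{
    long sum = 0;
    int cpu;

    for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        sum += c->per_cpu[cpu];
    return sum;
}
```

A reference taken on one CPU and dropped on another leaves one slot at 1 and another at -1, which is why a single slot says nothing about whether the count is zero.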
-### mount_lock ###
+mount_lock
+~~~~~~~~~~
-`mount_lock` is a global seqlock, a bit like `rename_lock`. It can be used to
+``mount_lock`` is a global seqlock, a bit like ``rename_lock``. It can be used to
check if any change has been made to any mount points.
While walking down the tree (away from the root) this lock is used when
crossing a mount point to check that the crossing was safe. That is,
the value in the seqlock is read, then the code finds the mount that
is mounted on the current directory, if there is one, and increments
-the `mnt_count`. Finally the value in `mount_lock` is checked against
+the ``mnt_count``. Finally the value in ``mount_lock`` is checked against
the old value. If there is no change, then the crossing was safe. If there
-was a change, the `mnt_count` is decremented and the whole process is
+was a change, the ``mnt_count`` is decremented and the whole process is
retried.
When walking up the tree (towards the root) by following a ".." link,
@@ -345,7 +348,8 @@ any changes to any mount points while stepping up. This locking is
needed to stabilize the link to the mounted-on dentry, which the
refcount on the mount itself doesn't ensure.
-### RCU ###
+RCU
+~~~
Finally the global (but extremely lightweight) RCU read lock is held
from time to time to ensure certain data structures don't get freed
@@ -354,137 +358,141 @@ unexpectedly.
In particular it is held while scanning chains in the dcache hash
table, and the mount point hash table.
-Bringing it together with `struct nameidata`
+Bringing it together with ``struct nameidata``
--------------------------------------------
-[First edition Unix]: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
+.. _First edition Unix: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
Throughout the process of walking a path, the current status is stored
-in a `struct nameidata`, "namei" being the traditional name - dating
-all the way back to [First Edition Unix] - of the function that
-converts a "name" to an "inode". `struct nameidata` contains (among
+in a ``struct nameidata``, "namei" being the traditional name - dating
+all the way back to `First Edition Unix`_ - of the function that
+converts a "name" to an "inode". ``struct nameidata`` contains (among
other fields):
-### `struct path path` ###
+``struct path path``
+~~~~~~~~~~~~~~~~~~
-A `path` contains a `struct vfsmount` (which is
-embedded in a `struct mount`) and a `struct dentry`. Together these
+A ``path`` contains a ``struct vfsmount`` (which is
+embedded in a ``struct mount``) and a ``struct dentry``. Together these
record the current status of the walk. They start out referring to the
starting point (the current working directory, the root directory, or some other
directory identified by a file descriptor), and are updated on each
-step. A reference through `d_lockref` and `mnt_count` is always
+step. A reference through ``d_lockref`` and ``mnt_count`` is always
held.
-### `struct qstr last` ###
+``struct qstr last``
+~~~~~~~~~~~~~~~~~~
-This is a string together with a length (i.e. _not_ `nul` terminated)
+This is a string together with a length (i.e. _not_ ``nul`` terminated)
that is the "next" component in the pathname.
-### `int last_type` ###
+``int last_type``
+~~~~~~~~~~~~~~~
-This is one of `LAST_NORM`, `LAST_ROOT`, `LAST_DOT`, `LAST_DOTDOT`, or
-`LAST_BIND`. The `last` field is only valid if the type is
-`LAST_NORM`. `LAST_BIND` is used when following a symlink and no
+This is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT``, ``LAST_DOTDOT``, or
+``LAST_BIND``. The ``last`` field is only valid if the type is
+``LAST_NORM``. ``LAST_BIND`` is used when following a symlink and no
components of the symlink have been processed yet. Others should be
fairly self-explanatory.
-### `struct path root` ###
+``struct path root``
+~~~~~~~~~~~~~~~~~~
This is used to hold a reference to the effective root of the
filesystem. Often that reference won't be needed, so this field is
only assigned the first time it is used, or when a non-standard root
-is requested. Keeping a reference in the `nameidata` ensures that
+is requested. Keeping a reference in the ``nameidata`` ensures that
only one root is in effect for the entire path walk, even if it races
-with a `chroot()` system call.
+with a ``chroot()`` system call.
The root is needed when either of two conditions holds: (1) either the
-pathname or a symbolic link starts with a "'/'", or (2) a "`..`"
-component is being handled, since "`..`" from the root must always stay
+pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
+component is being handled, since "``..``" from the root must always stay
at the root. The value used is usually the current root directory of
the calling process. An alternate root can be provided as when
-`sysctl()` calls `file_open_root()`, and when NFSv4 or Btrfs call
-`mount_subtree()`. In each case a pathname is being looked up in a very
+``sysctl()`` calls ``file_open_root()``, and when NFSv4 or Btrfs call
+``mount_subtree()``. In each case a pathname is being looked up in a very
specific part of the filesystem, and the lookup must not be allowed to
-escape that subtree. It works a bit like a local `chroot()`.
+escape that subtree. It works a bit like a local ``chroot()``.
Ignoring the handling of symbolic links, we can now describe the
-"`link_path_walk()`" function, which handles the lookup of everything
+"``link_path_walk()``" function, which handles the lookup of everything
except the final component as:
-> Given a path (`name`) and a nameidata structure (`nd`), check that the
-> current directory has execute permission and then advance `name`
-> over one component while updating `last_type` and `last`. If that
-> was the final component, then return, otherwise call
-> `walk_component()` and repeat from the top.
+ Given a path (``name``) and a nameidata structure (``nd``), check that the
+ current directory has execute permission and then advance ``name``
+ over one component while updating ``last_type`` and ``last``. If that
+ was the final component, then return, otherwise call
+ ``walk_component()`` and repeat from the top.
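
The loop just described can be mimicked on a plain string. The sketch below is a userspace toy, not the kernel function: all ``toy_`` names are hypothetical, the permission check is omitted, and a simple counter stands in for the call to ``walk_component()``.

```c
#include <stddef.h>
#include <string.h>

enum toy_last_type { TOY_LAST_NORM, TOY_LAST_DOT, TOY_LAST_DOTDOT };

struct toy_nameidata {
    const char *last;              /* most recent component, not NUL-terminated */
    size_t last_len;               /* its length */
    enum toy_last_type last_type;  /* only meaningful when last is non-NULL */
    int walked;                    /* components handed to the stand-in walker */
};

static enum toy_last_type toy_classify(const char *s, size_t len)
{
    if (len == 1 && s[0] == '.')
        return TOY_LAST_DOT;
    if (len == 2 && s[0] == '.' && s[1] == '.')
        return TOY_LAST_DOTDOT;
    return TOY_LAST_NORM;
}

/* Walk every component except the last, which is left for the caller. */
static int toy_link_path_walk(const char *name, struct toy_nameidata *nd)
{
    nd->last = NULL;
    nd->last_len = 0;
    nd->last_type = TOY_LAST_NORM;
    nd->walked = 0;

    while (*name == '/')
        name++;
    if (!*name)
        return 0;  /* empty or slashes only: there is no final component */

    for (;;) {
        size_t len = strcspn(name, "/");

        /* Advance over one component, updating last and last_type. */
        nd->last = name;
        nd->last_len = len;
        nd->last_type = toy_classify(name, len);

        name += len;
        while (*name == '/')
            name++;
        if (!*name)
            return 0;  /* that was the final component */

        nd->walked++;  /* stand-in for walk_component() */
    }
}
```

For ``/usr/lib/../bin/`` the toy walks ``usr``, ``lib`` and ``..`` and returns with ``last`` referring to ``bin``, leaving the caller to decide what to do with it.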
-`walk_component()` is even easier. If the component is `LAST_DOTS`,
-it calls `handle_dots()` which does the necessary locking as already
-described. If it finds a `LAST_NORM` component it first calls
-"`lookup_fast()`" which only looks in the dcache, but will ask the
+``walk_component()`` is even easier. If the component is ``LAST_DOTS``,
+it calls ``handle_dots()`` which does the necessary locking as already
+described. If it finds a ``LAST_NORM`` component it first calls
+"``lookup_fast()``" which only looks in the dcache, but will ask the
filesystem to revalidate the result if it is that sort of filesystem.
-If that doesn't get a good result, it calls "`lookup_slow()`" which
-takes `i_rwsem`, rechecks the cache, and then asks the filesystem
+If that doesn't get a good result, it calls "``lookup_slow()``" which
+takes ``i_rwsem``, rechecks the cache, and then asks the filesystem
to find a definitive answer. Each of these will call
-`follow_managed()` (as described below) to handle any mount points.
+``follow_managed()`` (as described below) to handle any mount points.
-In the absence of symbolic links, `walk_component()` creates a new
-`struct path` containing a counted reference to the new dentry and a
-reference to the new `vfsmount` which is only counted if it is
-different from the previous `vfsmount`. It then calls
-`path_to_nameidata()` to install the new `struct path` in the
-`struct nameidata` and drop the unneeded references.
+In the absence of symbolic links, ``walk_component()`` creates a new
+``struct path`` containing a counted reference to the new dentry and a
+reference to the new ``vfsmount`` which is only counted if it is
+different from the previous ``vfsmount``. It then calls
+``path_to_nameidata()`` to install the new ``struct path`` in the
+``struct nameidata`` and drop the unneeded references.
This "hand-over-hand" sequencing of getting a reference to the new
dentry before dropping the reference to the previous dentry may
seem obvious, but is worth pointing out so that we will recognize its
analogue in the "RCU-walk" version.
-Handling the final component.
------------------------------
+Handling the final component
+----------------------------
-`link_path_walk()` only walks as far as setting `nd->last` and
-`nd->last_type` to refer to the final component of the path. It does
-not call `walk_component()` that last time. Handling that final
+``link_path_walk()`` only walks as far as setting ``nd->last`` and
+``nd->last_type`` to refer to the final component of the path. It does
+not call ``walk_component()`` that last time. Handling that final
component remains for the caller to sort out. Those callers are
-`path_lookupat()`, `path_parentat()`, `path_mountpoint()` and
-`path_openat()` each of which handles the differing requirements of
+``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and
+``path_openat()`` each of which handles the differing requirements of
different system calls.
-`path_parentat()` is clearly the simplest - it just wraps a little bit
-of housekeeping around `link_path_walk()` and returns the parent
+``path_parentat()`` is clearly the simplest - it just wraps a little bit
+of housekeeping around ``link_path_walk()`` and returns the parent
directory and final component to the caller. The caller will be either
-aiming to create a name (via `filename_create()`) or remove or rename
-a name (in which case `user_path_parent()` is used). They will use
-`i_rwsem` to exclude other changes while they validate and then
+aiming to create a name (via ``filename_create()``) or remove or rename
+a name (in which case ``user_path_parent()`` is used). They will use
+``i_rwsem`` to exclude other changes while they validate and then
perform their operation.
-`path_lookupat()` is nearly as simple - it is used when an existing
-object is wanted such as by `stat()` or `chmod()`. It essentially just
-calls `walk_component()` on the final component through a call to
-`lookup_last()`. `path_lookupat()` returns just the final dentry.
+``path_lookupat()`` is nearly as simple - it is used when an existing
+object is wanted such as by ``stat()`` or ``chmod()``. It essentially just
+calls ``walk_component()`` on the final component through a call to
+``lookup_last()``. ``path_lookupat()`` returns just the final dentry.
-`path_mountpoint()` handles the special case of unmounting which must
+``path_mountpoint()`` handles the special case of unmounting which must
not try to revalidate the mounted filesystem. It effectively
-contains, through a call to `mountpoint_last()`, an alternate
-implementation of `lookup_slow()` which skips that step. This is
+contains, through a call to ``mountpoint_last()``, an alternate
+implementation of ``lookup_slow()`` which skips that step. This is
important when unmounting a filesystem that is inaccessible, such as
one provided by a dead NFS server.
-Finally `path_openat()` is used for the `open()` system call; it
-contains, in support functions starting with "`do_last()`", all the
+Finally ``path_openat()`` is used for the ``open()`` system call; it
+contains, in support functions starting with "``do_last()``", all the
complexity needed to handle the different subtleties of O_CREAT (with
-or without O_EXCL), final "`/`" characters, and trailing symbolic
+or without O_EXCL), final "``/``" characters, and trailing symbolic
links. We will revisit this in the final part of this series, which
-focuses on those symbolic links. "`do_last()`" will sometimes, but
-not always, take `i_rwsem`, depending on what it finds.
+focuses on those symbolic links. "``do_last()``" will sometimes, but
+not always, take ``i_rwsem``, depending on what it finds.
Each of these, or the functions which call them, need to be alert to
-the possibility that the final component is not `LAST_NORM`. If the
+the possibility that the final component is not ``LAST_NORM``. If the
goal of the lookup is to create something, then any value for
-`last_type` other than `LAST_NORM` will result in an error. For
-example if `path_parentat()` reports `LAST_DOTDOT`, then the caller
+``last_type`` other than ``LAST_NORM`` will result in an error. For
+example if ``path_parentat()`` reports ``LAST_DOTDOT``, then the caller
won't try to create that name. They also check for trailing slashes
-by testing `last.name[last.len]`. If there is any character beyond
+by testing ``last.name[last.len]``. If there is any character beyond
the final component, it must be a trailing slash.
Revalidation and automounts
@@ -495,12 +503,12 @@ process not yet covered. One is the handling of stale cache entries
and the other is automounts.
On filesystems that require it, the lookup routines will call the
-`->d_revalidate()` dentry method to ensure that the cached information
+``->d_revalidate()`` dentry method to ensure that the cached information
is current. This will often confirm validity or update a few details
from a server. In some cases it may find that there has been change
further up the path and that something that was thought to be valid
previously isn't really. When this happens the lookup of the whole
-path is aborted and retried with the "`LOOKUP_REVAL`" flag set. This
+path is aborted and retried with the "``LOOKUP_REVAL``" flag set. This
forces revalidation to be more thorough. We will see more details of
this retry process in the next article.
@@ -512,52 +520,55 @@ tree, but a few notes specifically related to path lookup are in order
here.
The Linux VFS has a concept of "managed" dentries which is reflected
-in function names such as "`follow_managed()`". There are three
+in function names such as "``follow_managed()``". There are three
potentially interesting things about these dentries corresponding
-to three different flags that might be set in `dentry->d_flags`:
+to three different flags that might be set in ``dentry->d_flags``:
-### `DCACHE_MANAGE_TRANSIT` ###
+``DCACHE_MANAGE_TRANSIT``
+~~~~~~~~~~~~~~~~~~~~~~~~~
If this flag has been set, then the filesystem has requested that the
-`d_manage()` dentry operation be called before handling any possible
+``d_manage()`` dentry operation be called before handling any possible
mount point. This can perform two particular services:
It can block to avoid races. If an automount point is being
-unmounted, the `d_manage()` function will usually wait for that
+unmounted, the ``d_manage()`` function will usually wait for that
process to complete before letting the new lookup proceed and possibly
trigger a new automount.
It can selectively allow only some processes to transit through a
mount point. When a server process is managing automounts, it may
need to access a directory without triggering normal automount
-processing. That server process can identify itself to the `autofs`
+processing. That server process can identify itself to the ``autofs``
filesystem, which will then give it a special pass through
-`d_manage()` by returning `-EISDIR`.
+``d_manage()`` by returning ``-EISDIR``.
-### `DCACHE_MOUNTED` ###
+``DCACHE_MOUNTED``
+~~~~~~~~~~~~~~~~~~
This flag is set on every dentry that is mounted on. As Linux
supports multiple filesystem namespaces, it is possible that the
dentry may not be mounted on in *this* namespace, just in some
other. So this flag is seen as a hint, not a promise.
-If this flag is set, and `d_manage()` didn't return `-EISDIR`,
-`lookup_mnt()` is called to examine the mount hash table (honoring the
-`mount_lock` described earlier) and possibly return a new `vfsmount`
-and a new `dentry` (both with counted references).
+If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``,
+``lookup_mnt()`` is called to examine the mount hash table (honoring the
+``mount_lock`` described earlier) and possibly return a new ``vfsmount``
+and a new ``dentry`` (both with counted references).
-### `DCACHE_NEED_AUTOMOUNT` ###
+``DCACHE_NEED_AUTOMOUNT``
+~~~~~~~~~~~~~~~~~~~~~~~~~
-If `d_manage()` allowed us to get this far, and `lookup_mnt()` didn't
-find a mount point, then this flag causes the `d_automount()` dentry
+If ``d_manage()`` allowed us to get this far, and ``lookup_mnt()`` didn't
+find a mount point, then this flag causes the ``d_automount()`` dentry
operation to be called.
-The `d_automount()` operation can be arbitrarily complex and may
+The ``d_automount()`` operation can be arbitrarily complex and may
communicate with server processes etc. but it should ultimately either
report that there was an error, that there was nothing to mount, or
-should provide an updated `struct path` with new `dentry` and `vfsmount`.
+should provide an updated ``struct path`` with new ``dentry`` and ``vfsmount``.
-In the latter case, `finish_automount()` will be called to safely
+In the latter case, ``finish_automount()`` will be called to safely
install the new mount point into the mount table.
There is no new locking of import here and it is important that no
@@ -614,7 +625,7 @@ isn't in the cache, then it tries to stop gracefully and switch to
REF-walk.
This stopping requires getting a counted reference on the current
-`vfsmount` and `dentry`, and ensuring that these are still valid -
+``vfsmount`` and ``dentry``, and ensuring that these are still valid -
that a path walk with REF-walk would have found the same entries.
This is an invariant that RCU-walk must guarantee. It can only make
decisions, such as selecting the next step, that are decisions which
@@ -625,21 +636,21 @@ RCU-walk finds it cannot stop gracefully, it simply gives up and
restarts from the top with REF-walk.
This pattern of "try RCU-walk, if that fails try REF-walk" can be
-clearly seen in functions like `filename_lookup()`,
-`filename_parentat()`, `filename_mountpoint()`,
-`do_filp_open()`, and `do_file_open_root()`. These five
-correspond roughly to the four `path_`* functions we met earlier,
-each of which calls `link_path_walk()`. The `path_*` functions are
+clearly seen in functions like ``filename_lookup()``,
+``filename_parentat()``, ``filename_mountpoint()``,
+``do_filp_open()``, and ``do_file_open_root()``. These five
+correspond roughly to the four ``path_*`` functions we met earlier,
+each of which calls ``link_path_walk()``. The ``path_*`` functions are
called using different mode flags until a mode is found which works.
-They are first called with `LOOKUP_RCU` set to request "RCU-walk". If
-that fails with the error `ECHILD` they are called again with no
+They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
+that fails with the error ``ECHILD`` they are called again with no
special flag to request "REF-walk". If either of those report the
-error `ESTALE` a final attempt is made with `LOOKUP_REVAL` set (and no
-`LOOKUP_RCU`) to ensure that entries found in the cache are forcibly
+error ``ESTALE`` a final attempt is made with ``LOOKUP_REVAL`` set (and no
+``LOOKUP_RCU``) to ensure that entries found in the cache are forcibly
revalidated - normally entries are only revalidated if the filesystem
determines that they are too old to trust.
-The `LOOKUP_RCU` attempt may drop that flag internally and switch to
+The ``LOOKUP_RCU`` attempt may drop that flag internally and switch to
REF-walk, but will never then try to switch back to RCU-walk. Places
that trip up RCU-walk are much more likely to be near the leaves and
so it is very unlikely that there will be much, if any, benefit from
@@ -649,7 +660,7 @@ RCU and seqlocks: fast and light<