summaryrefslogtreecommitdiffstats
path: root/Documentation/filesystems
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2019-07-09 12:34:26 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2019-07-09 12:34:26 -0700
commite9a83bd2322035ed9d7dcf35753d3f984d76c6a5 (patch)
tree66dc466ff9aec0f9bb7f39cba50a47eab6585559 /Documentation/filesystems
parent7011b7e1b702cc76f9e969b41d9a95969f2aecaa (diff)
parent454f96f2b738374da4b0a703b1e2e7aed82c4486 (diff)
Merge tag 'docs-5.3' of git://git.lwn.net/linux
Pull Documentation updates from Jonathan Corbet: "It's been a relatively busy cycle for docs: - A fair pile of RST conversions, many from Mauro. These create more than the usual number of simple but annoying merge conflicts with other trees, unfortunately. He has a lot more of these waiting on the wings that, I think, will go to you directly later on. - A new document on how to use merges and rebases in kernel repos, and one on Spectre vulnerabilities. - Various improvements to the build system, including automatic markup of function() references because some people, for reasons I will never understand, were of the opinion that :c:func:``function()`` is unattractive and not fun to type. - We now recommend using sphinx 1.7, but still support back to 1.4. - Lots of smaller improvements, warning fixes, typo fixes, etc" * tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits) docs: automarkup.py: ignore exceptions when seeking for xrefs docs: Move binderfs to admin-guide Disable Sphinx SmartyPants in HTML output doc: RCU callback locks need only _bh, not necessarily _irq docs: format kernel-parameters -- as code Doc : doc-guide : Fix a typo platform: x86: get rid of a non-existent document Add the RCU docs to the core-api manual Documentation: RCU: Add TOC tree hooks Documentation: RCU: Rename txt files to rst Documentation: RCU: Convert RCU UP systems to reST Documentation: RCU: Convert RCU linked list to reST Documentation: RCU: Convert RCU basic concepts to reST docs: filesystems: Remove uneeded .rst extension on toctables scripts/sphinx-pre-install: fix out-of-tree build docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/ Documentation: PGP: update for newer HW devices Documentation: Add section about CPU vulnerabilities for Spectre Documentation: platform: Delete x86-laptop-drivers.txt docs: Note that :c:func: should no longer be used ...
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/api-summary.rst3
-rw-r--r--Documentation/filesystems/binderfs.rst68
-rw-r--r--Documentation/filesystems/ext4/index.rst8
-rw-r--r--Documentation/filesystems/index.rst13
-rw-r--r--Documentation/filesystems/porting10
-rw-r--r--Documentation/filesystems/ubifs-authentication.md4
-rw-r--r--Documentation/filesystems/vfs.rst1428
-rw-r--r--Documentation/filesystems/vfs.txt1268
-rw-r--r--Documentation/filesystems/xfs-delayed-logging-design.txt2
9 files changed, 1442 insertions, 1362 deletions
diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst
index aa51ffcfa029..bbb0c1c0e5cf 100644
--- a/Documentation/filesystems/api-summary.rst
+++ b/Documentation/filesystems/api-summary.rst
@@ -89,9 +89,6 @@ Other Functions
.. kernel-doc:: fs/direct-io.c
:export:
-.. kernel-doc:: fs/file_table.c
- :export:
-
.. kernel-doc:: fs/libfs.c
:export:
diff --git a/Documentation/filesystems/binderfs.rst b/Documentation/filesystems/binderfs.rst
deleted file mode 100644
index c009671f8434..000000000000
--- a/Documentation/filesystems/binderfs.rst
+++ /dev/null
@@ -1,68 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-The Android binderfs Filesystem
-===============================
-
-Android binderfs is a filesystem for the Android binder IPC mechanism. It
-allows to dynamically add and remove binder devices at runtime. Binder devices
-located in a new binderfs instance are independent of binder devices located in
-other binderfs instances. Mounting a new binderfs instance makes it possible
-to get a set of private binder devices.
-
-Mounting binderfs
------------------
-
-Android binderfs can be mounted with::
-
- mkdir /dev/binderfs
- mount -t binder binder /dev/binderfs
-
-at which point a new instance of binderfs will show up at ``/dev/binderfs``.
-In a fresh instance of binderfs no binder devices will be present. There will
-only be a ``binder-control`` device which serves as the request handler for
-binderfs. Mounting another binderfs instance at a different location will
-create a new and separate instance from all other binderfs mounts. This is
-identical to the behavior of e.g. ``devpts`` and ``tmpfs``. The Android
-binderfs filesystem can be mounted in user namespaces.
-
-Options
--------
-max
- binderfs instances can be mounted with a limit on the number of binder
- devices that can be allocated. The ``max=<count>`` mount option serves as
- a per-instance limit. If ``max=<count>`` is set then only ``<count>`` number
- of binder devices can be allocated in this binderfs instance.
-
-Allocating binder Devices
--------------------------
-
-.. _ioctl: http://man7.org/linux/man-pages/man2/ioctl.2.html
-
-To allocate a new binder device in a binderfs instance a request needs to be
-sent through the ``binder-control`` device node. A request is sent in the form
-of an `ioctl() <ioctl_>`_.
-
-What a program needs to do is to open the ``binder-control`` device node and
-send a ``BINDER_CTL_ADD`` request to the kernel. Users of binderfs need to
-tell the kernel which name the new binder device should get. By default a name
-can only contain up to ``BINDERFS_MAX_NAME`` chars including the terminating
-zero byte.
-
-Once the request is made via an `ioctl() <ioctl_>`_ passing a ``struct
-binder_device`` with the name to the kernel it will allocate a new binder
-device and return the major and minor number of the new device in the struct
-(This is necessary because binderfs allocates a major device number
-dynamically.). After the `ioctl() <ioctl_>`_ returns there will be a new
-binder device located under /dev/binderfs with the chosen name.
-
-Deleting binder Devices
------------------------
-
-.. _unlink: http://man7.org/linux/man-pages/man2/unlink.2.html
-.. _rm: http://man7.org/linux/man-pages/man1/rm.1.html
-
-Binderfs binder devices can be deleted via `unlink() <unlink_>`_. This means
-that the `rm() <rm_>`_ tool can be used to delete them. Note that the
-``binder-control`` device cannot be deleted since this would make the binderfs
-instance unuseable. The ``binder-control`` device will be deleted when the
-binderfs instance is unmounted and all references to it have been dropped.
diff --git a/Documentation/filesystems/ext4/index.rst b/Documentation/filesystems/ext4/index.rst
index 3be3e54d480d..705d813d558f 100644
--- a/Documentation/filesystems/ext4/index.rst
+++ b/Documentation/filesystems/ext4/index.rst
@@ -8,7 +8,7 @@ ext4 Data Structures and Algorithms
:maxdepth: 6
:numbered:
- about.rst
- overview.rst
- globals.rst
- dynamic.rst
+ about
+ overview
+ globals
+ dynamic
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 1131c34d77f6..2de2fe2ab078 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -16,7 +16,8 @@ algorithms work.
.. toctree::
:maxdepth: 2
- path-lookup.rst
+ vfs
+ path-lookup
api-summary
splice
@@ -31,13 +32,3 @@ filesystem implementations.
journalling
fscrypt
-
-Filesystem-specific documentation
-=================================
-
-Documentation for individual filesystem types can be found here.
-
-.. toctree::
- :maxdepth: 2
-
- binderfs.rst
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 3bd1148d8bb6..2813a19389fe 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -330,14 +330,14 @@ unreferenced dentries, and is now only called when the dentry refcount goes to
[mandatory]
.d_compare() calling convention and locking rules are significantly
-changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
+changed. Read updated documentation in Documentation/filesystems/vfs.rst (and
look at examples of other filesystems) for guidance.
---
[mandatory]
.d_hash() calling convention and locking rules are significantly
-changed. Read updated documentation in Documentation/filesystems/vfs.txt (and
+changed. Read updated documentation in Documentation/filesystems/vfs.rst (and
look at examples of other filesystems) for guidance.
---
@@ -377,12 +377,12 @@ where possible.
the filesystem provides it), which requires dropping out of rcu-walk mode. This
may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD should be
returned if the filesystem cannot handle rcu-walk. See
-Documentation/filesystems/vfs.txt for more details.
+Documentation/filesystems/vfs.rst for more details.
permission is an inode permission check that is called on many or all
directory inodes on the way down a path walk (to check for exec permission). It
must now be rcu-walk aware (mask & MAY_NOT_BLOCK). See
-Documentation/filesystems/vfs.txt for more details.
+Documentation/filesystems/vfs.rst for more details.
--
[mandatory]
@@ -625,7 +625,7 @@ in your dentry operations instead.
--
[mandatory]
->clone_file_range() and ->dedupe_file_range have been replaced with
- ->remap_file_range(). See Documentation/filesystems/vfs.txt for more
+ ->remap_file_range(). See Documentation/filesystems/vfs.rst for more
information.
--
[recommended]
diff --git a/Documentation/filesystems/ubifs-authentication.md b/Documentation/filesystems/ubifs-authentication.md
index 028b3e2e25f9..23e698167141 100644
--- a/Documentation/filesystems/ubifs-authentication.md
+++ b/Documentation/filesystems/ubifs-authentication.md
@@ -417,9 +417,9 @@ will then have to be provided beforehand in the normal way.
[DMC-CBC-ATTACK] http://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/
-[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.txt
+[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.rst
-[DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.txt
+[DM-VERITY] https://www.kernel.org/doc/Documentation/device-mapper/verity.rst
[FSCRYPT-POLICY2] https://www.spinics.net/lists/linux-ext4/msg58710.html
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
new file mode 100644
index 000000000000..0f85ab21c2ca
--- /dev/null
+++ b/Documentation/filesystems/vfs.rst
@@ -0,0 +1,1428 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Overview of the Linux Virtual File System
+=========================================
+
+Original author: Richard Gooch <rgooch@atnf.csiro.au>
+
+- Copyright (C) 1999 Richard Gooch
+- Copyright (C) 2005 Pekka Enberg
+
+
+Introduction
+============
+
+The Virtual File System (also known as the Virtual Filesystem Switch) is
+the software layer in the kernel that provides the filesystem interface
+to userspace programs. It also provides an abstraction within the
+kernel which allows different filesystem implementations to coexist.
+
+VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on
+are called from a process context. Filesystem locking is described in
+the document Documentation/filesystems/Locking.
+
+
+Directory Entry Cache (dcache)
+------------------------------
+
+The VFS implements the open(2), stat(2), chmod(2), and similar system
+calls. The pathname argument that is passed to them is used by the VFS
+to search through the directory entry cache (also known as the dentry
+cache or dcache). This provides a very fast look-up mechanism to
+translate a pathname (filename) into a specific dentry. Dentries live
+in RAM and are never saved to disc: they exist only for performance.
+
+The dentry cache is meant to be a view into your entire filespace. As
+most computers cannot fit all dentries in the RAM at the same time, some
+bits of the cache are missing. In order to resolve your pathname into a
+dentry, the VFS may have to resort to creating dentries along the way,
+and then loading the inode. This is done by looking up the inode.
+
+
+The Inode Object
+----------------
+
+An individual dentry usually has a pointer to an inode. Inodes are
+filesystem objects such as regular files, directories, FIFOs and other
+beasts. They live either on the disc (for block device filesystems) or
+in the memory (for pseudo filesystems). Inodes that live on the disc
+are copied into the memory when required and changes to the inode are
+written back to disc. A single inode can be pointed to by multiple
+dentries (hard links, for example, do this).
+
+To look up an inode requires that the VFS calls the lookup() method of
+the parent directory inode. This method is installed by the specific
+filesystem implementation that the inode lives in. Once the VFS has the
+required dentry (and hence the inode), we can do all those boring things
+like open(2) the file, or stat(2) it to peek at the inode data. The
+stat(2) operation is fairly simple: once the VFS has the dentry, it
+peeks at the inode data and passes some of it back to userspace.
+
+
+The File Object
+---------------
+
+Opening a file requires another operation: allocation of a file
+structure (this is the kernel-side implementation of file descriptors).
+The freshly allocated file structure is initialized with a pointer to
+the dentry and a set of file operation member functions. These are
+taken from the inode data. The open() file method is then called so the
+specific filesystem implementation can do its work. You can see that
+this is another switch performed by the VFS. The file structure is
+placed into the file descriptor table for the process.
+
+Reading, writing and closing files (and other assorted VFS operations)
+is done by using the userspace file descriptor to grab the appropriate
+file structure, and then calling the required file structure method to
+do whatever is required. For as long as the file is open, it keeps the
+dentry in use, which in turn means that the VFS inode is still in use.
+
+
+Registering and Mounting a Filesystem
+=====================================
+
+To register and unregister a filesystem, use the following API
+functions:
+
+.. code-block:: c
+
+ #include <linux/fs.h>
+
+ extern int register_filesystem(struct file_system_type *);
+ extern int unregister_filesystem(struct file_system_type *);
+
+The passed struct file_system_type describes your filesystem. When a
+request is made to mount a filesystem onto a directory in your
+namespace, the VFS will call the appropriate mount() method for the
+specific filesystem. New vfsmount referring to the tree returned by
+->mount() will be attached to the mountpoint, so that when pathname
+resolution reaches the mountpoint it will jump into the root of that
+vfsmount.
+
+You can see all filesystems that are registered to the kernel in the
+file /proc/filesystems.
+
+
+struct file_system_type
+-----------------------
+
+This describes the filesystem. As of kernel 2.6.39, the following
+members are defined:
+
+.. code-block:: c
+
+ struct file_system_operations {
+ const char *name;
+ int fs_flags;
+ struct dentry *(*mount) (struct file_system_type *, int,
+ const char *, void *);
+ void (*kill_sb) (struct super_block *);
+ struct module *owner;
+ struct file_system_type * next;
+ struct list_head fs_supers;
+ struct lock_class_key s_lock_key;
+ struct lock_class_key s_umount_key;
+ };
+
+``name``
+ the name of the filesystem type, such as "ext2", "iso9660",
+ "msdos" and so on
+
+``fs_flags``
+ various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
+
+``mount``
+ the method to call when a new instance of this filesystem should
+ be mounted
+
+``kill_sb``
+ the method to call when an instance of this filesystem should be
+ shut down
+
+
+``owner``
+ for internal VFS use: you should initialize this to THIS_MODULE
+ in most cases.
+
+``next``
+ for internal VFS use: you should initialize this to NULL
+
+ s_lock_key, s_umount_key: lockdep-specific
+
+The mount() method has the following arguments:
+
+``struct file_system_type *fs_type``
+ describes the filesystem, partly initialized by the specific
+ filesystem code
+
+``int flags``
+ mount flags
+
+``const char *dev_name``
+ the device name we are mounting.
+
+``void *data``
+ arbitrary mount options, usually comes as an ASCII string (see
+ "Mount Options" section)
+
+The mount() method must return the root dentry of the tree requested by
+caller. An active reference to its superblock must be grabbed and the
+superblock must be locked. On failure it should return ERR_PTR(error).
+
+The arguments match those of mount(2) and their interpretation depends
+on filesystem type. E.g. for block filesystems, dev_name is interpreted
+as block device name, that device is opened and if it contains a
+suitable filesystem image the method creates and initializes struct
+super_block accordingly, returning its root dentry to caller.
+
+->mount() may choose to return a subtree of existing filesystem - it
+doesn't have to create a new one. The main result from the caller's
+point of view is a reference to dentry at the root of (sub)tree to be
+attached; creation of new superblock is a common side effect.
+
+The most interesting member of the superblock structure that the mount()
+method fills in is the "s_op" field. This is a pointer to a "struct
+super_operations" which describes the next level of the filesystem
+implementation.
+
+Usually, a filesystem uses one of the generic mount() implementations
+and provides a fill_super() callback instead. The generic variants are:
+
+``mount_bdev``
+ mount a filesystem residing on a block device
+
+``mount_nodev``
+ mount a filesystem that is not backed by a device
+
+``mount_single``
+ mount a filesystem which shares the instance between all mounts
+
+A fill_super() callback implementation has the following arguments:
+
+``struct super_block *sb``
+ the superblock structure. The callback must initialize this
+ properly.
+
+``void *data``
+ arbitrary mount options, usually comes as an ASCII string (see
+ "Mount Options" section)
+
+``int silent``
+ whether or not to be silent on error
+
+
+The Superblock Object
+=====================
+
+A superblock object represents a mounted filesystem.
+
+
+struct super_operations
+-----------------------
+
+This describes how the VFS can manipulate the superblock of your
+filesystem. As of kernel 2.6.22, the following members are defined:
+
+.. code-block:: c
+
+ struct super_operations {
+ struct inode *(*alloc_inode)(struct super_block *sb);
+ void (*destroy_inode)(struct inode *);
+
+ void (*dirty_inode) (struct inode *, int flags);
+ int (*write_inode) (struct inode *, int);
+ void (*drop_inode) (struct inode *);
+ void (*delete_inode) (struct inode *);
+ void (*put_super) (struct super_block *);
+ int (*sync_fs)(struct super_block *sb, int wait);
+ int (*freeze_fs) (struct super_block *);
+ int (*unfreeze_fs) (struct super_block *);
+ int (*statfs) (struct dentry *, struct kstatfs *);
+ int (*remount_fs) (struct super_block *, int *, char *);
+ void (*clear_inode) (struct inode *);
+ void (*umount_begin) (struct super_block *);
+
+ int (*show_options)(struct seq_file *, struct dentry *);
+
+ ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
+ ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
+ int (*nr_cached_objects)(struct super_block *);
+ void (*free_cached_objects)(struct super_block *, int);
+ };
+
+All methods are called without any locks being held, unless otherwise
+noted. This means that most methods can block safely. All methods are
+only called from a process context (i.e. not from an interrupt handler
+or bottom half).
+
+``alloc_inode``
+ this method is called by alloc_inode() to allocate memory for
+ struct inode and initialize it. If this function is not
+ defined, a simple 'struct inode' is allocated. Normally
+ alloc_inode will be used to allocate a larger structure which
+ contains a 'struct inode' embedded within it.
+
+``destroy_inode``
+ this method is called by destroy_inode() to release resources
+ allocated for struct inode. It is only required if
+ ->alloc_inode was defined and simply undoes anything done by
+ ->alloc_inode.
+
+``dirty_inode``
+ this method is called by the VFS to mark an inode dirty.
+
+``write_inode``
+ this method is called when the VFS needs to write an inode to
+ disc. The second parameter indicates whether the write should
+ be synchronous or not, not all filesystems check this flag.
+
+``drop_inode``
+ called when the last access to the inode is dropped, with the
+ inode->i_lock spinlock held.
+
+ This method should be either NULL (normal UNIX filesystem
+ semantics) or "generic_delete_inode" (for filesystems that do
+ not want to cache inodes - causing "delete_inode" to always be
+ called regardless of the value of i_nlink)
+
+ The "generic_delete_inode()" behavior is equivalent to the old
+ practice of using "force_delete" in the put_inode() case, but
+ does not have the races that the "force_delete()" approach had.
+
+``delete_inode``
+ called when the VFS wants to delete an inode
+
+``put_super``
+ called when the VFS wishes to free the superblock
+ (i.e. unmount). This is called with the superblock lock held
+
+``sync_fs``
+ called when VFS is writing out all dirty data associated with a
+ superblock. The second parameter indicates whether the method
+ should wait until the write out has been completed. Optional.
+
+``freeze_fs``
+ called when VFS is locking a filesystem and forcing it into a
+ consistent state. This method is currently used by the Logical
+ Volume Manager (LVM).
+
+``unfreeze_fs``
+ called when VFS is unlocking a filesystem and making it writable
+ again.
+
+``statfs``
+ called when the VFS needs to get filesystem statistics.
+
+``remount_fs``
+ called when the filesystem is remounted. This is called with
+ the kernel lock held
+
+``clear_inode``
+ called then the VFS clears the inode. Optional
+
+``umount_begin``
+ called when the VFS is unmounting a filesystem.
+
+``show_options``
+ called by the VFS to show mount options for /proc/<pid>/mounts.
+ (see "Mount Options" section)
+
+``quota_read``
+ called by the VFS to read from filesystem quota file.
+
+``quota_write``
+ called by the VFS to write to filesystem quota file.
+
+``nr_cached_objects``
+ called by the sb cache shrinking function for the filesystem to
+ return the number of freeable cached objects it contains.
+ Optional.
+
+``free_cache_objects``
+ called by the sb cache shrinking function for the filesystem to
+ scan the number of objects indicated to try to free them.
+ Optional, but any filesystem implementing this method needs to
+ also implement ->nr_cached_objects for it to be called
+ correctly.
+
+ We can't do anything with any errors that the filesystem might
+ encountered, hence the void return type. This will never be
+ called if the VM is trying to reclaim under GFP_NOFS conditions,
+ hence this method does not need to handle that situation itself.
+
+ Implementations must include conditional reschedule calls inside
+ any scanning loop that is done. This allows the VFS to
+ determine appropriate scan batch sizes without having to worry
+ about whether implementations will cause holdoff problems due to
+ large scan batch sizes.
+
+Whoever sets up the inode is responsible for filling in the "i_op"
+field. This is a pointer to a "struct inode_operations" which describes
+the methods that can be performed on individual inodes.
+
+
+struct xattr_handlers
+---------------------
+
+On filesystems that support extended attributes (xattrs), the s_xattr
+superblock field points to a NULL-terminated array of xattr handlers.
+Extended attributes are name:value pairs.
+
+``name``
+ Indicates that the handler matches attributes with the specified
+ name (such as "system.posix_acl_access"); the prefix field must
+ be NULL.
+
+``prefix``
+ Indicates that the handler matches all attributes with the
+ specified name prefix (such as "user."); the name field must be
+ NULL.
+
+``list``
+ Determine if attributes matching this xattr handler should be
+ listed for a particular dentry. Used by some listxattr
+ implementations like generic_listxattr.
+
+``get``
+ Called by the VFS to get the value of a particular extended
+ attribute. This method is called by the getxattr(2) system
+ call.
+
+``set``
+ Called by the VFS to set the value of a particular extended
+ attribute. When the new value is NULL, called to remove a
+ particular extended attribute. This method is called by the the
+ setxattr(2) and removexattr(2) system calls.
+
+When none of the xattr handlers of a filesystem match the specified
+attribute name or when a filesystem doesn't support extended attributes,
+the various ``*xattr(2)`` system calls return -EOPNOTSUPP.
+
+
+The Inode Object
+================
+
+An inode object represents an object within the filesystem.
+
+
+struct inode_operations
+-----------------------
+
+This describes how the VFS can manipulate an inode in your filesystem.
+As of kernel 2.6.22, the following members are defined:
+
+.. code-block:: c
+
+ struct inode_operations {
+ int (*create) (struct inode *,struct dentry *, umode_t, bool);
+ struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
+ int (*link) (struct dentry *,struct inode *,struct dentry *);
+ int (*unlink) (struct inode *,struct dentry *);
+ int (*symlink) (struct inode *,struct dentry *,const char *);
+ int (*mkdir) (struct inode *,struct dentry *,umode_t);
+ int (*rmdir) (struct inode *,struct dentry *);
+ int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
+ int (*rename) (struct inode *, struct dentry *,
+ struct inode *, struct dentry *, unsigned int);
+ int (*readlink) (struct dentry *, char __user *,int);
+ const char *(*get_link) (struct dentry *, struct inode *,
+ struct delayed_call *);
+ int (*permission) (struct inode *, int);
+ int (*get_acl)(struct inode *, int);
+ int (*setattr) (struct dentry *, struct iattr *);
+ int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
+ ssize_t (*listxattr) (struct dentry *, char *, size_t);
+ void (*update_time)(struct inode *, struct timespec *, int);
+ int (*atomic_open)(struct inode *, struct dentry *, struct file *,
+ unsigned open_flag, umode_t create_mode);
+ int (*tmpfile) (struct inode *, struct dentry *, umode_t);
+ };
+
+Again, all methods are called without any locks being held, unless
+otherwise noted.
+
+``create``
+ called by the open(2) and creat(2) system calls. Only required
+ if you want to support regular files. The dentry you get should
+ not have an inode (i.e. it should be a negative dentry). Here
+ you will probably call d_instantiate() with the dentry and the
+ newly created inode
+
+``lookup``
+ called when the VFS needs to look up an inode in a parent
+ directory. The name to look for is found in the dentry. This
+ method must call d_add() to insert the found inode into the
+ dentry. The "i_count" field in the inode structure should be
+ incremented. If the named inode does not exist a NULL inode
+ should be inserted into the dentry (this is called a negative
+ dentry). Returning an error code from this routine must only be
+ done on a real error, otherwise creating inodes with system
+ calls like create(2), mknod(2), mkdir(2) and so on will fail.
+ If you wish to overload the dentry methods then you should
+ initialise the "d_dop" field in the dentry; this is a pointer to
+ a struct "dentry_operations". This method is called with the
+ directory inode semaphore held
+
+``link``
+ called by the link(2) system call. Only required if you want to
+ support hard links. You will probably need to call
+ d_instantiate() just as you would in the create() method
+
+``unlink``
+ called by the unlink(2) system call. Only required if you want
+ to support deleting inodes
+
+``symlink``
+ called by the symlink(2) system call. Only required if you want
+ to support symlinks. You will probably need to call
+ d_instantiate() just as you would in the create() method
+
+``mkdir``
+ called by the mkdir(2) system call. Only required if you want
+ to support creating subdirectories. You will probably need to
+ call d_instantiate() just as you would in the create() method
+
+``rmdir``
+ called by the rmdir(2) system call. Only required if you want
+ to support deleting subdirectories
+
+``mknod``
+ called by the mknod(2) system call to create a device (char,
+ block) inode or a named pipe (FIFO) or socket. Only required if
+ you want to support creating these types of inodes. You will
+ probably need to call d_instantiate() just as you would in the
+ create() method
+
+``rename``
+ called by the rename(2) system call to rename the object to have
+ the parent and name given by the second inode and dentry.
+
+ The filesystem must return -EINVAL for any unsupported or
+ unknown flags. Currently the following flags are implemented:
+ (1) RENAME_NOREPLACE: this flag indicates that if the target of
+ the rename exists the rename should fail with -EEXIST instead of
+ replacing the target. The VFS already checks for existence, so
+ for local filesystems the RENAME_NOREPLACE implementation is
+ equivalent to plain rename.
+ (2) RENAME_EXCHANGE: exchange source and target. Both must
+ exist; this is checked by the VFS. Unlike plain rename, source
+ and target may be of different type.
+
+``get_link``
+ called by the VFS to follow a symbolic link to the inode it
+ points to. Only required if you want to support symbolic links.
+ This method returns the symlink body to traverse (and possibly
+ resets the current position with nd_jump_link()). If the body
+ won't go away until the inode is gone, nothing else is needed;
+ if it needs to be otherwise pinned, arrange for its release by
+ having get_link(..., ..., done) do set_delayed_call(done,
+ destructor, argument). In that case destructor(argument) will
+ be called once VFS is done with the body you've returned. May
+ be called in RCU mode; that is indicated by NULL dentry
+ argument. If request can't be handled without leaving RCU mode,
+ have it return ERR_PTR(-ECHILD).
+
+ If the filesystem stores the symlink target in ->i_link, the
+ VFS may use it directly without calling ->get_link(); however,
+ ->get_link() must still be provided. ->i_link must not be
+ freed until after an RCU grace period. Writing to ->i_link
+ post-iget() time requires a 'release' memory barrier.
+
+``readlink``
+ this is now just an override for use by readlink(2) for the
+ cases when ->get_link uses nd_jump_link() or object is not in
+ fact a symlink. Normally filesystems should only implement
+ ->get_link for symlinks and readlink(2) will automatically use
+ that.
+
+``permission``
+ called by the VFS to check for access rights on a POSIX-like
+ filesystem.
+
+ May be called in rcu-walk mode (mask & MAY_NOT_BLOCK). If in
+ rcu-walk mode, the filesystem must check the permission without
+ blocking or storing to the inode.
+
+ If a situation is encountered that rcu-walk cannot handle,
+ return
+ -ECHILD and it will be called again in ref-walk mode.
+
+``setattr``
+ called by the VFS to set attributes for a file. This method is
+ called by chmod(2) and related system calls.
+
+``getattr``
+ called by the VFS to get attributes of a file. This method is
+ called by stat(2) and related system calls.
+
+``listxattr``
+ called by the VFS to list all extended attributes for a given
+ file. This method is called by the listxattr(2) system call.
+
+``update_time``
+ called by the VFS to update a specific time or the i_version of
+ an inode. If this is not defined the VFS will update the inode
+ itself and call mark_inode_dirty_sync.
+
+``atomic_open``
+ called on the last component of an open. Using this optional
+ method the filesystem can look up, possibly create and open the
+ file in one atomic operation. If it wants to leave actual
+ opening to the caller (e.g. if the file turned out to be a
+ symlink, device, or just something filesystem won't do atomic
+ open for), it may signal this by returning finish_no_open(file,
+ dentry). This method is only called if the last component is
+ negative or needs lookup. Cached positive dentries are still
+ handled by f_op->open(). If the file was created, FMODE_CREATED
+ flag should be set in file->f_mode. In case of O_EXCL the
+ method must only succeed if the file didn't exist and hence
+ FMODE_CREATED shall always be set on success.
+
+``tmpfile``
+ called in the end of O_TMPFILE open(). Optional, equivalent to
+ atomically creating, opening and unlinking a file in given
+ directory.
+
+
+The Address Space Object
+========================
+
+The address space object is used to group and manage pages in the page
+cache. It can be used to keep track of the pages in a file (or anything
+else) and also track the mapping of sections of the file into process
+address spaces.
+
+There are a number of distinct yet related services that an
+address-space can provide. These include communicating memory pressure,
+page lookup by address, and keeping track of pages tagged as Dirty or
+Writeback.
+
+The first can be used independently to the others. The VM can try to
+either write dirty pages in order to clean them, or release clean pages
+in order to reuse them. To do this it can call the ->writepage method
+on dirty pages, and ->releasepage on clean pages with PagePrivate set.
+Clean pages without PagePrivate and with no external references will be
+released without notice being given to the address_space.
+
+To achieve this functionality, pages need to be placed on an LRU with
+lru_cache_add and mark_page_active needs to be called whenever the page
+is used.
+
+Pages are normally kept in a radix tree index by ->index. This tree
+maintains information about the PG_Dirty and PG_Writeback status of each
+page, so that pages with either of these flags can be found quickly.
+
+The Dirty tag is primarily used by mpage_writepages - the default
+->writepages method. It uses the tag to find dirty pages to call
+->writepage on. If mpage_writepages is not used (i.e. the address
+provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is almost
+unused. write_inode_now and sync_inode do use it (through
+__sync_single_inode) to check if ->writepages has been successful in
+writing out the whole address_space.
+
+The Writeback tag is used by filemap*wait* and sync_page* functions, via
+filemap_fdatawait_range, to wait for all writeback to complete.
+
+An address_space handler may attach extra information to a page,
+typically using the 'private' field in the 'struct page'. If such
+information is attached, the PG_Private flag should be set. This will
+cause various VM routines to make extra calls into the address_space
+handler to deal with that data.
+
+An address space acts as an intermediate between storage and
+application. Data is read into the address space a whole page at a
+time, and provided to the application either by copying of the page, or
+by memory-mapping the page. Data is written into the address space by
+the application, and then written-back to storage typically in whole
+pages, however the address_space has finer control of write sizes.
+
+The read process essentially only requires 'readpage'. The write
+process is more complicated and uses write_begin/write_end or
+set_page_dirty to write data into the address_space, and writepage and
+writepages to writeback data to storage.
+
+Adding and removing pages to/from an address_space is protected by the
+inode's i_mutex.
+
+When data is written to a page, the PG_Dirty flag should be set. It
+typically remains set until writepage asks for it to be written. This
+should clear PG_Dirt