author | Tobin C. Harding <tobin@kernel.org> | 2019-06-04 10:26:56 +1000 |
---|---|---|
committer | Jonathan Corbet <corbet@lwn.net> | 2019-06-06 09:41:13 -0600 |
commit | ee5dc0491c38ae4e4e583d7532d470754bb173f6 (patch) | |
tree | 6b0d39e34a968dcb90387bf7f13bd67e3ce560aa /Documentation/filesystems/vfs.rst | |
parent | af96c1e304f7051bf2ee64c9957724bdace05c58 (diff) |
docs: filesystems: vfs: Render method descriptions
Currently vfs.rst does not render well into HTML the method descriptions
for VFS data structures. We can improve the HTML output by putting the
description string on a new line following the method name.
Suggested-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Diffstat (limited to 'Documentation/filesystems/vfs.rst')
-rw-r--r-- | Documentation/filesystems/vfs.rst | 1147 |
1 files changed, 642 insertions, 505 deletions
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 2ffbdf5f392c..0f85ab21c2ca 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -125,35 +125,46 @@ members are defined: struct lock_class_key s_umount_key; }; -``name``: the name of the filesystem type, such as "ext2", "iso9660", +``name`` + the name of the filesystem type, such as "ext2", "iso9660", "msdos" and so on -``fs_flags``: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) +``fs_flags`` + various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) -``mount``: the method to call when a new instance of this filesystem should -be mounted +``mount`` + the method to call when a new instance of this filesystem should + be mounted -``kill_sb``: the method to call when an instance of this filesystem - should be shut down +``kill_sb`` + the method to call when an instance of this filesystem should be + shut down -``owner``: for internal VFS use: you should initialize this to THIS_MODULE in - most cases. -``next``: for internal VFS use: you should initialize this to NULL +``owner`` + for internal VFS use: you should initialize this to THIS_MODULE + in most cases. + +``next`` + for internal VFS use: you should initialize this to NULL s_lock_key, s_umount_key: lockdep-specific The mount() method has the following arguments: -``struct file_system_type *fs_type``: describes the filesystem, partly initialized - by the specific filesystem code +``struct file_system_type *fs_type`` + describes the filesystem, partly initialized by the specific + filesystem code -``int flags``: mount flags +``int flags`` + mount flags -``const char *dev_name``: the device name we are mounting. +``const char *dev_name`` + the device name we are mounting. 
-``void *data``: arbitrary mount options, usually comes as an ASCII - string (see "Mount Options" section) +``void *data`` + arbitrary mount options, usually comes as an ASCII string (see + "Mount Options" section) The mount() method must return the root dentry of the tree requested by caller. An active reference to its superblock must be grabbed and the @@ -178,22 +189,27 @@ implementation. Usually, a filesystem uses one of the generic mount() implementations and provides a fill_super() callback instead. The generic variants are: -``mount_bdev``: mount a filesystem residing on a block device +``mount_bdev`` + mount a filesystem residing on a block device -``mount_nodev``: mount a filesystem that is not backed by a device +``mount_nodev`` + mount a filesystem that is not backed by a device -``mount_single``: mount a filesystem which shares the instance between - all mounts +``mount_single`` + mount a filesystem which shares the instance between all mounts A fill_super() callback implementation has the following arguments: -``struct super_block *sb``: the superblock structure. The callback - must initialize this properly. +``struct super_block *sb`` + the superblock structure. The callback must initialize this + properly. -``void *data``: arbitrary mount options, usually comes as an ASCII - string (see "Mount Options" section) +``void *data`` + arbitrary mount options, usually comes as an ASCII string (see + "Mount Options" section) -``int silent``: whether or not to be silent on error +``int silent`` + whether or not to be silent on error The Superblock Object @@ -240,87 +256,106 @@ noted. This means that most methods can block safely. All methods are only called from a process context (i.e. not from an interrupt handler or bottom half). -``alloc_inode``: this method is called by alloc_inode() to allocate memory - for struct inode and initialize it. 
If this function is not +``alloc_inode`` + this method is called by alloc_inode() to allocate memory for + struct inode and initialize it. If this function is not defined, a simple 'struct inode' is allocated. Normally alloc_inode will be used to allocate a larger structure which contains a 'struct inode' embedded within it. -``destroy_inode``: this method is called by destroy_inode() to release - resources allocated for struct inode. It is only required if +``destroy_inode`` + this method is called by destroy_inode() to release resources + allocated for struct inode. It is only required if ->alloc_inode was defined and simply undoes anything done by ->alloc_inode. -``dirty_inode``: this method is called by the VFS to mark an inode dirty. +``dirty_inode`` + this method is called by the VFS to mark an inode dirty. -``write_inode``: this method is called when the VFS needs to write an - inode to disc. The second parameter indicates whether the write - should be synchronous or not, not all filesystems check this flag. +``write_inode`` + this method is called when the VFS needs to write an inode to + disc. The second parameter indicates whether the write should + be synchronous or not, not all filesystems check this flag. -``drop_inode``: called when the last access to the inode is dropped, - with the inode->i_lock spinlock held. +``drop_inode`` + called when the last access to the inode is dropped, with the + inode->i_lock spinlock held. 
This method should be either NULL (normal UNIX filesystem - semantics) or "generic_delete_inode" (for filesystems that do not - want to cache inodes - causing "delete_inode" to always be + semantics) or "generic_delete_inode" (for filesystems that do + not want to cache inodes - causing "delete_inode" to always be called regardless of the value of i_nlink) - The "generic_delete_inode()" behavior is equivalent to the - old practice of using "force_delete" in the put_inode() case, - but does not have the races that the "force_delete()" approach - had. + The "generic_delete_inode()" behavior is equivalent to the old + practice of using "force_delete" in the put_inode() case, but + does not have the races that the "force_delete()" approach had. -``delete_inode``: called when the VFS wants to delete an inode +``delete_inode`` + called when the VFS wants to delete an inode -``put_super``: called when the VFS wishes to free the superblock +``put_super`` + called when the VFS wishes to free the superblock (i.e. unmount). This is called with the superblock lock held -``sync_fs``: called when VFS is writing out all dirty data associated with - a superblock. The second parameter indicates whether the method +``sync_fs`` + called when VFS is writing out all dirty data associated with a + superblock. The second parameter indicates whether the method should wait until the write out has been completed. Optional. -``freeze_fs``: called when VFS is locking a filesystem and - forcing it into a consistent state. This method is currently - used by the Logical Volume Manager (LVM). +``freeze_fs`` + called when VFS is locking a filesystem and forcing it into a + consistent state. This method is currently used by the Logical + Volume Manager (LVM). -``unfreeze_fs``: called when VFS is unlocking a filesystem and making it writable +``unfreeze_fs`` + called when VFS is unlocking a filesystem and making it writable again. -``statfs``: called when the VFS needs to get filesystem statistics. 
+``statfs`` + called when the VFS needs to get filesystem statistics. -``remount_fs``: called when the filesystem is remounted. This is called - with the kernel lock held +``remount_fs`` + called when the filesystem is remounted. This is called with + the kernel lock held -``clear_inode``: called then the VFS clears the inode. Optional +``clear_inode`` + called then the VFS clears the inode. Optional -``umount_begin``: called when the VFS is unmounting a filesystem. +``umount_begin`` + called when the VFS is unmounting a filesystem. -``show_options``: called by the VFS to show mount options for - /proc/<pid>/mounts. (see "Mount Options" section) +``show_options`` + called by the VFS to show mount options for /proc/<pid>/mounts. + (see "Mount Options" section) -``quota_read``: called by the VFS to read from filesystem quota file. +``quota_read`` + called by the VFS to read from filesystem quota file. -``quota_write``: called by the VFS to write to filesystem quota file. +``quota_write`` + called by the VFS to write to filesystem quota file. -``nr_cached_objects``: called by the sb cache shrinking function for the - filesystem to return the number of freeable cached objects it contains. +``nr_cached_objects`` + called by the sb cache shrinking function for the filesystem to + return the number of freeable cached objects it contains. Optional. -``free_cache_objects``: called by the sb cache shrinking function for the - filesystem to scan the number of objects indicated to try to free them. - Optional, but any filesystem implementing this method needs to also - implement ->nr_cached_objects for it to be called correctly. +``free_cache_objects`` + called by the sb cache shrinking function for the filesystem to + scan the number of objects indicated to try to free them. + Optional, but any filesystem implementing this method needs to + also implement ->nr_cached_objects for it to be called + correctly. 
We can't do anything with any errors that the filesystem might - encountered, hence the void return type. This will never be called if - the VM is trying to reclaim under GFP_NOFS conditions, hence this - method does not need to handle that situation itself. + encountered, hence the void return type. This will never be + called if the VM is trying to reclaim under GFP_NOFS conditions, + hence this method does not need to handle that situation itself. - Implementations must include conditional reschedule calls inside any - scanning loop that is done. This allows the VFS to determine - appropriate scan batch sizes without having to worry about whether - implementations will cause holdoff problems due to large scan batch - sizes. + Implementations must include conditional reschedule calls inside + any scanning loop that is done. This allows the VFS to + determine appropriate scan batch sizes without having to worry + about whether implementations will cause holdoff problems due to + large scan batch sizes. Whoever sets up the inode is responsible for filling in the "i_op" field. This is a pointer to a "struct inode_operations" which describes @@ -334,23 +369,31 @@ On filesystems that support extended attributes (xattrs), the s_xattr superblock field points to a NULL-terminated array of xattr handlers. Extended attributes are name:value pairs. -``name``: Indicates that the handler matches attributes with the specified name - (such as "system.posix_acl_access"); the prefix field must be NULL. +``name`` + Indicates that the handler matches attributes with the specified + name (such as "system.posix_acl_access"); the prefix field must + be NULL. -``prefix``: Indicates that the handler matches all attributes with the specified - name prefix (such as "user."); the name field must be NULL. +``prefix`` + Indicates that the handler matches all attributes with the + specified name prefix (such as "user."); the name field must be + NULL. 
-``list``: Determine if attributes matching this xattr handler should be listed - for a particular dentry. Used by some listxattr implementations like - generic_listxattr. +``list`` + Determine if attributes matching this xattr handler should be + listed for a particular dentry. Used by some listxattr + implementations like generic_listxattr. -``get``: Called by the VFS to get the value of a particular extended attribute. - This method is called by the getxattr(2) system call. +``get`` + Called by the VFS to get the value of a particular extended + attribute. This method is called by the getxattr(2) system + call. -``set``: Called by the VFS to set the value of a particular extended attribute. - When the new value is NULL, called to remove a particular extended - attribute. This method is called by the the setxattr(2) and - removexattr(2) system calls. +``set`` + Called by the VFS to set the value of a particular extended + attribute. When the new value is NULL, called to remove a + particular extended attribute. This method is called by the the + setxattr(2) and removexattr(2) system calls. When none of the xattr handlers of a filesystem match the specified attribute name or when a filesystem doesn't support extended attributes, @@ -399,128 +442,147 @@ As of kernel 2.6.22, the following members are defined: Again, all methods are called without any locks being held, unless otherwise noted. -``create``: called by the open(2) and creat(2) system calls. Only - required if you want to support regular files. The dentry you - get should not have an inode (i.e. it should be a negative - dentry). Here you will probably call d_instantiate() with the - dentry and the newly created inode +``create`` + called by the open(2) and creat(2) system calls. Only required + if you want to support regular files. The dentry you get should + not have an inode (i.e. it should be a negative dentry). 
Here + you will probably call d_instantiate() with the dentry and the + newly created inode -``lookup``: called when the VFS needs to look up an inode in a parent +``lookup`` + called when the VFS needs to look up an inode in a parent directory. The name to look for is found in the dentry. This method must call d_add() to insert the found inode into the dentry. The "i_count" field in the inode structure should be incremented. If the named inode does not exist a NULL inode should be inserted into the dentry (this is called a negative - dentry). Returning an error code from this routine must only - be done on a real error, otherwise creating inodes with system + dentry). Returning an error code from this routine must only be + done on a real error, otherwise creating inodes with system calls like create(2), mknod(2), mkdir(2) and so on will fail. If you wish to overload the dentry methods then you should - initialise the "d_dop" field in the dentry; this is a pointer - to a struct "dentry_operations". - This method is called with the directory inode semaphore held + initialise the "d_dop" field in the dentry; this is a pointer to + a struct "dentry_operations". This method is called with the + directory inode semaphore held -``link``: called by the link(2) system call. Only required if you want - to support hard links. You will probably need to call +``link`` + called by the link(2) system call. Only required if you want to + support hard links. You will probably need to call d_instantiate() just as you would in the create() method -``unlink``: called by the unlink(2) system call. Only required if you - want to support deleting inodes +``unlink`` + called by the unlink(2) system call. Only required if you want + to support deleting inodes -``symlink``: called by the symlink(2) system call. Only required if you - want to support symlinks. You will probably need to call +``symlink`` + called by the symlink(2) system call. 
Only required if you want + to support symlinks. You will probably need to call d_instantiate() just as you would in the create() method -``mkdir``: called by the mkdir(2) system call. Only required if you want +``mkdir`` + called by the mkdir(2) system call. Only required if you want to support creating subdirectories. You will probably need to call d_instantiate() just as you would in the create() method -``rmdir``: called by the rmdir(2) system call. Only required if you want +``rmdir`` + called by the rmdir(2) system call. Only required if you want to support deleting subdirectories -``mknod``: called by the mknod(2) system call to create a device (char, - block) inode or a named pipe (FIFO) or socket. Only required - if you want to support creating these types of inodes. You - will probably need to call d_instantiate() just as you would - in the create() method +``mknod`` + called by the mknod(2) system call to create a device (char, + block) inode or a named pipe (FIFO) or socket. Only required if + you want to support creating these types of inodes. You will + probably need to call d_instantiate() just as you would in the + create() method -``rename``: called by the rename(2) system call to rename the object to - have the parent and name given by the second inode and dentry. +``rename`` + called by the rename(2) system call to rename the object to have + the parent and name given by the second inode and dentry. The filesystem must return -EINVAL for any unsupported or - unknown flags. Currently the following flags are implemented: - (1) RENAME_NOREPLACE: this flag indicates that if the target - of the rename exists the rename should fail with -EEXIST - instead of replacing the target. The VFS already checks for - existence, so for local filesystems the RENAME_NOREPLACE - implementation is equivalent to plain rename. + unknown flags. 
Currently the following flags are implemented: + (1) RENAME_NOREPLACE: this flag indicates that if the target of + the rename exists the rename should fail with -EEXIST instead of + replacing the target. The VFS already checks for existence, so + for local filesystems the RENAME_NOREPLACE implementation is + equivalent to plain rename. (2) RENAME_EXCHANGE: exchange source and target. Both must - exist; this is checked by the VFS. Unlike plain rename, - source and target may be of different type. - -``get_link``: called by the VFS to follow a symbolic link to the - inode it points to. Only required if you want to support - symbolic links. This method returns the symlink body - to traverse (and possibly resets the current position with - nd_jump_link()). If the body won't go away until the inode - is gone, nothing else is needed; if it needs to be otherwise - pinned, arrange for its release by having get_link(..., ..., done) - do set_delayed_call(done, destructor, argument). - In that case destructor(argument) will be called once VFS is - done with the body you've returned. - May be called in RCU mode; that is indicated by NULL dentry + exist; this is checked by the VFS. Unlike plain rename, source + and target may be of different type. + +``get_link`` + called by the VFS to follow a symbolic link to the inode it + points to. Only required if you want to support symbolic links. + This method returns the symlink body to traverse (and possibly + resets the current position with nd_jump_link()). If the body + won't go away until the inode is gone, nothing else is needed; + if it needs to be otherwise pinned, arrange for its release by + having get_link(..., ..., done) do set_delayed_call(done, + destructor, argument). In that case destructor(argument) will + be called once VFS is done with the body you've returned. May + be called in RCU mode; that is indicated by NULL dentry argument. 
If request can't be handled without leaving RCU mode, have it return ERR_PTR(-ECHILD). - If the filesystem stores the symlink target in ->i_link, the VFS may use it directly without calling ->get_link(); however, ->get_link() must still be provided. ->i_link must not be freed until after an RCU grace period. Writing to ->i_link post-iget() time requires a 'release' memory barrier. -``readlink``: this is now just an override for use by readlink(2) for the +``readlink`` + this is now just an override for use by readlink(2) for the cases when ->get_link uses nd_jump_link() or object is not in fact a symlink. Normally filesystems should only implement ->get_link for symlinks and readlink(2) will automatically use that. -``permission``: called by the VFS to check for access rights on a POSIX-like +``permission`` + called by the VFS to check for access rights on a POSIX-like filesystem. - May be called in rcu-walk mode (mask & MAY_NOT_BLOCK). If in rcu-walk - mode, the filesystem must check the permission without blocking or - storing to the inode. + May be called in rcu-walk mode (mask & MAY_NOT_BLOCK). If in + rcu-walk mode, the filesystem must check the permission without + blocking or storing to the inode. - If a situation is encountered that rcu-walk cannot handle, return + If a situation is encountered that rcu-walk cannot handle, + return -ECHILD and it will be called again in ref-walk mode. -``setattr``: called by the VFS to set attributes for a file. This method - is called by chmod(2) and related system calls. - -``getattr``: called by the VFS to get attributes of a file. This method - is called by stat(2) and related system calls. - -``listxattr``: called by the VFS to list all extended attributes for a - given file. This method is called by the listxattr(2) system call. - -``update_time``: called by the VFS to update a specific time or the i_version of - an inode. If this is not defined the VFS will update the inode itself - and call mark_inode_dirty_sync. 
- -``atomic_open``: called on the last component of an open. Using this optional - method the filesystem can look up, possibly create and open the file in - one atomic operation. If it wants to leave actual opening to the - caller (e.g. if the file turned out to be a symlink, device, or just - something filesystem won't do atomic open for), it may signal this by - returning finish_no_open(file, dentry). This method is only called if - the last component is negative or needs lookup. Cached positive dentries - are still handled by f_op->open(). If the file was created, - FMODE_CREATED flag should be set in file->f_mode. In case of O_EXCL - the method must only succeed if the file didn't exist and hence FMODE_CREATED - shall always be set on success. - -``tmpfile``: called in the end of O_TMPFILE open(). Optional, equivalent to - atomically creating, opening and unlinking a file in given directory. +``setattr`` + called by the VFS to set attributes for a file. This method is + called by chmod(2) and related system calls. + +``getattr`` + called by the VFS to get attributes of a file. This method is + called by stat(2) and related system calls. + +``listxattr`` + called by the VFS to list all extended attributes for a given + file. This method is called by the listxattr(2) system call. + +``update_time`` + called by the VFS to update a specific time or the i_version of + an inode. If this is not defined the VFS will update the inode + itself and call mark_inode_dirty_sync. + +``atomic_open`` + called on the last component of an open. Using this optional + method the filesystem can look up, possibly create and open the + file in one atomic operation. If it wants to leave actual + opening to the caller (e.g. if the file turned out to be a + symlink, device, or just something filesystem won't do atomic + open for), it may signal this by returning finish_no_open(file, + dentry). This method is only called if the last component is + negative or needs lookup. 
Cached positive dentries are still + handled by f_op->open(). If the file was created, FMODE_CREATED + flag should be set in file->f_mode. In case of O_EXCL the + method must only succeed if the file didn't exist and hence + FMODE_CREATED shall always be set on success. + +``tmpfile`` + called in the end of O_TMPFILE open(). Optional, equivalent to + atomically creating, opening and unlinking a file in given + directory. The Address Space Object @@ -673,70 +735,75 @@ cache in your filesystem. The following members are defined: int (*swap_deactivate)(struct file *); }; -``writepage``: called by the VM to write a dirty page to backing store. - This may happen for data integrity reasons (i.e. 'sync'), or - to free up memory (flush). The difference can be seen in - wbc->sync_mode. - The PG_Dirty flag has been cleared and PageLocked is true. - writepage should start writeout, should set PG_Writeback, - and should make sure the page is unlocked, either synchronously - or asynchronously when the write operation completes. - - If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to - try too hard if there are problems, and may choose to write out - other pages from the mapping if that is easier (e.g. due to - internal dependencies). If it chooses not to start writeout, it - should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep - calling ->writepage on that page. - - See the file "Locking" for more details. - -``readpage``: called by the VM to read a page from backing store. - The page will be Locked when readpage is called, and should be - unlocked and marked uptodate once the read completes. - If ->readpage discovers that it needs to unlock the page for - some reason, it can do so, and then return AOP_TRUNCATED_PAGE. - In this case, the page will be relocated, relocked and if - that all succeeds, ->readpage will be called again. 
- -``writepages``: called by the VM to write out pages associated with the +``writepage`` + called by the VM to write a dirty page to backing store. This + may happen for data integrity reasons (i.e. 'sync'), or to free + up memory (flush). The difference can be seen in + wbc->sync_mode. The PG_Dirty flag has been cleared and + PageLocked is true. writepage should start writeout, should set + PG_Writeback, and should make sure the page is unlocked, either + synchronously or asynchronously when the write operation + completes. + + If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to + try too hard if there are problems, and may choose to write out + other pages from the mapping if that is easier (e.g. due to + internal dependencies). If it chooses not to start writeout, it + should return AOP_WRITEPAGE_ACTIVATE so that the VM will not + keep calling ->writepage on that page. + + See the file "Locking" for more details. + +``readpage`` + called by the VM to read a page from backing store. The page + will be Locked when readpage is called, and should be unlocked + and marked uptodate once the read completes. If ->readpage + discovers that it needs to unlock the page for some reason, it + can do so, and then return AOP_TRUNCATED_PAGE. In this case, + the page will be relocated, relocked and if that all succeeds, + ->readpage will be called again. + +``writepages`` + called by the VM to write out pages associated with the address_space object. If wbc->sync_mode is WBC_SYNC_ALL, then the writeback_control will specify a range of pages that must be - written out. If it is WBC_SYNC_NONE, then a nr_to_write is given - and that many pages should be written if possible. - If no ->writepages is given, then mpage_writepages is used - instead. This will choose pages from the address space that are - tagged as DIRTY and will pass them to ->writepage. - -``set_page_dirty``: called by the VM to set a page dirty. 
- This is particularly needed if an address space attaches - private data to a page, and that data needs to be updated when - a page is dirtied. This is called, for example, when a memory - mapped page gets modified. + written out. If it is WBC_SYNC_NONE, then a nr_to_write is + given and that many pages should be written if possible. If no + ->writepages is given, then mpage_writepages is used instead. + This will choose pages from the address space that are tagged as + DIRTY and will pass them to ->writepage. + +``set_page_dirty`` + called by the VM to set a page dirty. This is particularly + needed if an address space attaches private data to a page, and + that data needs to be updated when a page is dirtied. This is + called, for example, when a memory mapped page gets modified. If defined, it should set the PageDirty flag, and the PAGECACHE_TAG_DIRTY tag in the radix tree. -``readpages``: called by the VM to read pages associated with the address_space - object. This is essentially just a vector version of - readpage. Instead of just one page, several pages are - requested. +``readpages`` + called by the VM to read pages associated with the address_space + object. This is essentially just a vector version of readpage. + Instead of just one page, several pages are requested. readpages is only used for read-ahead, so read errors are ignored. If anything goes wrong, feel free to give up. -``write_begin``: - Called by the generic buffered write code to ask the filesystem to - prepare to write len bytes at the given offset in the file. The - address_space should check that the write will be able to complete, - by allocating space if necessary and doing any other internal - housekeeping. If the write will update parts of any basic-blocks on - storage, then those blocks should be pre-read (if they haven't been - read already) so that the updated blocks can be written out properly. 
+``write_begin``
+	Called by the generic buffered write code to ask the filesystem
+	to prepare to write len bytes at the given offset in the file.
+	The address_space should check that the write will be able to
+	complete, by allocating space if necessary and doing any other
+	internal housekeeping. If the write will update parts of any
+	basic-blocks on storage, then those blocks should be pre-read
+	(if they haven't been read already) so that the updated blocks
+	can be written out properly.

-	The filesystem must return the locked pagecache page for the specified
-	offset, in ``*pagep``, for the caller to write into.
+	The filesystem must return the locked pagecache page for the
+	specified offset, in ``*pagep``, for the caller to write into.

-	It must be able to cope with short writes (where the length passed to
-	write_begin is greater than the number of bytes copied into the page).
+	It must be able to cope with short writes (where the length
+	passed to write_begin is greater than the number of bytes copied
+	into the page).

 	flags is a field for AOP_FLAG_xxx flags, described in
 	include/linux/fs.h.

@@ -744,114 +811,128 @@ cache in your filesystem. The following members are defined:
 	A void * may be returned in fsdata, which then gets passed into
 	write_end.

-	Returns 0 on success; < 0 on failure (which is the error code), in
-	which case write_end is not called.
-
-``write_end``: After a successful write_begin, and data copy, write_end must
-	be called. len is the original len passed to write_begin, and copied
-	is the amount that was able to be copied.
-
-	The filesystem must take care of unlocking the page and releasing it
-	refcount, and updating i_size.
-
-	Returns < 0 on failure, otherwise the number of bytes (<= 'copied')
-	that were able to be copied into pagecache.
-
-``bmap``: called by the VFS to map a logical block offset within object to
-	physical block number. This method is used by the FIBMAP
-	ioctl and for working with swap-files. To be able to swap to
-	a file, the file must have a stable mapping to a block
-	device. The swap system does not go through the filesystem
-	but instead uses bmap to find out where the blocks in the file
-	are and uses those addresses directly.
-
-``invalidatepage``: If a page has PagePrivate set, then invalidatepage
-	will be called when part or all of the page is to be removed
-	from the address space. This generally corresponds to either a
-	truncation, punch hole or a complete invalidation of the address
+	Returns 0 on success; < 0 on failure (which is the error code),
+	in which case write_end is not called.
+
+``write_end``
+	After a successful write_begin, and data copy, write_end must be
+	called. len is the original len passed to write_begin, and
+	copied is the amount that was able to be copied.
+
+	The filesystem must take care of unlocking the page and
+	releasing it refcount, and updating i_size.
+
+	Returns < 0 on failure, otherwise the number of bytes (<=
+	'copied') that were able to be copied into pagecache.
+
+``bmap``
+	called by the VFS to map a logical block offset within object to
+	physical block number. This method is used by the FIBMAP ioctl
+	and for working with swap-files. To be able to swap to a file,
+	the file must have a stable mapping to a block device. The swap
+	system does not go through the filesystem but instead uses bmap
+	to find out where the blocks in the file are and uses those
+	addresses directly.
+
+``invalidatepage``
+	If a page has PagePrivate set, then invalidatepage will be
+	called when part or all of the page is to be removed from the
+	address space. This generally corresponds to either a
+	truncation, punch hole or a complete invalidation of the address
 	space (in the latter case 'offset' will always be 0 and 'length'
 	will be PAGE_SIZE). Any private data associated with the page
-	should be updated to reflect this truncation. If offset is 0 and
-	length is PAGE_SIZE, then the private data should be released,
-	because the page must be able to be completely discarded. This may
-	be done by calling the ->releasepage function, but in this case the
-	release MUST succeed.
-
-``releasepage``: releasepage is called on PagePrivate pages to indicate
-	that the page should be freed if possible. ->releasepage
-	should remove any private data from the page and clear the
-	PagePrivate flag. If releasepage() fails for some reason, it must
-	indicate failure with a 0 return value.
-	releasepage() is used in two distinct though related cases. The
-	first is when the VM finds a clean page with no active users and
-	wants to make it a free page. If ->releasepage succeeds, the
-	page will be removed from the address_space and become free.
+	should be updated to reflect this truncation. If offset is 0
+	and length is PAGE_SIZE, then the private data should be
+	released, because the page must be able to be completely
+	discarded. This may be done by calling the ->releasepage
+	function, but in this case the release MUST succeed.
+
+``releasepage``
+	releasepage is called on PagePrivate pages to indicate that the
+	page should be freed if possible. ->releasepage should remove
+	any private data from the page and clear the PagePrivate flag.
+	If releasepage() fails for some reason, it must indicate failure
+	with a 0 return value. releasepage() is used in two distinct
+	though related cases. The first is when the VM finds a clean
+	page with no active users and wants to make it a free page. If
+	->releasepage succeeds, the page will be removed from the
+	address_space and become free.

 	The second case is when a request has been made to invalidate
-	some or all pages in an address_space. This can happen
-	through the fadvise(POSIX_FADV_DONTNEED) system call or by the
-	filesystem explicitly requesting it as nfs and 9fs do (when
-	they believe the cache may be out of date with storage) by
-	calling invalidate_inode_pages2().
-	If the filesystem makes such a call, and needs to be certain
-	that all pages are invalidated, then its releasepage will
-	need to ensure this. Possibly it can clear the PageUptodate
-	bit if it cannot free private data yet.
-
-``freepage``: freepage is called once the page is no longer visible in
-	the page cache in order to allow the cleanup of any private
-	data. Since it may be called by the memory reclaimer, it
-	should not assume that the original address_space mapping still
-	exists, and it should not block.
-
-``direct_IO``: called by the generic read/write routines to perform
-	direct_IO - that is IO requests which bypass the page cache
-	and transfer data directly between the storage and the
-	application's address space.
-
-``isolate_page``: Called by the VM when isolating a movable non-lru page.
-	If page is successfully isolated, VM marks the page as PG_isolated
-	via __SetPageIsolated.
-
-``migrate_page``: This is used to compact the physical memory usage.
-	If the VM wants to relocate a page (maybe off a memory card
-	that is signalling imminent failure) it will pass a new page
-	and an old page to this function. migrate_page should
-	transfer any private data across and update any references
-	that it has to the page.
-
-``putback_page``: Called by the VM when isolated page's migration fails.
-
-``launder_page``: Called before freeing a page - it writes back the dirty page. To
-	prevent redirtying the page, it is kept locked during the whole
-	operation.
-
-``is_partially_uptodate``: Called by the VM when reading a file through the
-	pagecache when the underlying blocksize != pagesize. If the required
-	block is up to date then the read can complete without needing the IO
-	to bring the whole page up to date.
-
-``is_dirty_writeback``: Called by the VM when attempting to reclaim a page.
-	The VM uses dirty and writeback information to determine if it needs
-	to stall to allow flushers a chance to complete some IO. Ordinarily
-	it can use PageDirty and PageWriteback but some filesystems have
-	more complex state (unstable pages in NFS prevent reclaim) or
-	do not set those flags due to locking problems. This callback
-	allows a filesystem to indicate to the VM if a page should be
-	treated as dirty or writeback for the purposes of stalling.
-
-``error_remove_page``: normally set to generic_error_remove_page if truncation
-	is ok for this address space. Used for memory failure handling.
+	some or all pages in an address_space. This can happen through
+	the fadvise(POSIX_FADV_DONTNEED) system call or by the
+	filesystem explicitly requesting it as nfs and 9fs do (when they
+	believe the cache may be out of date with storage) by calling
+	invalidate_inode_pages2(). If the filesystem makes such a call,
+	and needs to be certain that all pages are invalidated, then its
+	releasepage will need to ensure this. Possibly it can clear the
+	PageUptodate bit if it cannot free private data yet.
+
+``freepage``
+	freepage is called once the page is no longer visible in the
+	page cache in order to allow the cleanup of any private data.
+	Since it may be called by the memory reclaimer, it should not
+	assume that the original address_space mapping still exists, and
+	it should not block.
+
+``direct_IO``
+	called by the generic read/write routines to perform direct_IO -
+	that is IO requests which bypass the page cache and transfer
+	data directly between the storage and the application's address
+	space.
+
+``isolate_page``
+	Called by the VM when isolating a movable non-lru page. If page
+	is successfully isolated, VM marks the page as PG_isolated via
+	__SetPageIsolated.
+
+``migrate_page``
+	This is used to compact the physical memory usage. If the VM
+	wants to relocate a page (maybe off a memory card that is
+	signalling imminent failure) it will pass a new page and an old
+	page to this function. migrate_page should transfer any private
+	data across and update any references that it has to the page.
+
+``putback_page``
+	Called by the VM when isolated page's migration fails.
+
+``launder_page``
+	Called before freeing a page - it writes back the dirty page.
+	To prevent redirtying the page, it is kept locked during the
+	whole operation.
+
+``is_partially_uptodate``
+	Called by the VM when reading a file through the pagecache when
+	the underlying blocksize != pagesize. If the required block is
+	up to date then the read can complete without needing the IO to